Chapter 8 Problems      Statistical Modeling: A Fresh Approach (2/e)

• What is a residual? Why does it make sense to make them small when fitting a model?
• What is “least squares?”
• What does it mean to “partition variability” using a model?
• How can a model term be redundant? Why are redundant terms a problem?

Prob 8.01. Here are some (made-up) data from an experiment growing trees. The height was measured for trees in different locations that had been watered and fertilized in different ways.

 height  water  light   compost  nitrogen
 5       2      shady   none     little
 4       1      bright  none     lot
 5       1.5    bright  some     little
 6       3      shady   rich     lot
 7       3      bright  some     little
 6       2      shady   rich     lot

(a)
In the model expression height ~ water, which is the explanatory variable?

 A height B water C light D compost E Can’t tell from this information.

(b)
Ranger Alan proposes the specific model formula [formula not preserved in this copy]. Copy the table to a piece of paper and fill in the model values and the residuals.
 height  water  model values  resids
 5       2
 4       1
 5       1.5
 6       3
 7       3
 6       2
(c)
Ranger Bill proposes the specific model formula [formula not preserved in this copy]. Again, fill in the model values and residuals.
 height  water  model values  resids
 5       2
 4       1
 5       1.5
 6       3
 7       3
 6       2
(d)
Based on your answers to the previous two parts, which of the two models is better? Give a specific definition of “better” and explain your answer quantitatively.
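
One way to make “better” quantitative is to compare the sum of squared residuals of two candidate formulas. The sketch below uses the height and water columns from the table above, but the two formulas in it are hypothetical placeholders chosen for illustration; they are not the Rangers' formulas.

```r
# Height and water columns from the tree table
trees <- data.frame(
  height = c(5, 4, 5, 6, 7, 6),
  water  = c(2, 1, 1.5, 3, 3, 2)
)

# Hypothetical candidate formulas (assumptions, not Alan's or Bill's)
formula_1 <- function(water) 1.5 * water + 2
formula_2 <- function(water) 0.5 * water + 4.5

# Model values, then residuals = response minus model value
trees$model_1 <- formula_1(trees$water)
trees$resid_1 <- trees$height - trees$model_1
trees$model_2 <- formula_2(trees$water)
trees$resid_2 <- trees$height - trees$model_2

# One number per model: the sum of squared residuals
ss_1 <- sum(trees$resid_1^2)
ss_2 <- sum(trees$resid_2^2)
print(trees)
c(ss_1 = ss_1, ss_2 = ss_2)
```

Under the least-squares criterion, the formula with the smaller sum of squares would be the one preferred.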

(e)
Write down the set of indicator variables that arise from the categorical variable compost.

(f)
The fitted values are exactly the same for the two models water ~ compost and water ~ compost-1. This suggests that the 1 vector (1,1,1,1,1,1) is redundant with the set of indicator variables due to the variable compost. Explain why this redundancy occurs. Is it because of something special about the “compost” variable?
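
As a minimal sketch of the redundancy described in part (f), the indicator variables can be generated with model.matrix() and their row sums inspected. The data frame name trees is an assumption; the values are the water and compost columns from the table above.

```r
# Water and compost columns from the tree table
trees <- data.frame(
  water   = c(2, 1, 1.5, 3, 3, 2),
  compost = factor(c("none", "none", "some", "rich", "some", "rich"))
)

# One indicator column per level of compost (no intercept column)
X <- model.matrix(~ compost - 1, data = trees)
print(X)

# Each case belongs to exactly one level, so the indicators sum to 1 in every row
print(rowSums(X))

# The two model forms give the same fitted values
with_intercept <- lm(water ~ compost, data = trees)
no_intercept   <- lm(water ~ compost - 1, data = trees)
same_fit <- all.equal(unname(fitted(with_intercept)), unname(fitted(no_intercept)))
same_fit
```

The same row-sum check can be run on any categorical variable to see whether the behavior is special to compost.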

(g)
Estimate, as best you can using only very simple calculations, the coefficients on the model water ~ compost-1. (Note: there is no intercept term in this model.)

(h)
Ranger Charley observes that the following model is perfect because all of the residuals are zero.

height ~ 1+water+light+compost+nitrogen

Charley believes that using this model will enable him to make excellent predictions about the height of trees in the future. Ranger Donald, on the other hand, calls Charley’s regression “ridiculous rot” and claims that Charley’s explanatory terms could fit perfectly any set of 6 numbers. Donald says that the perfect fit of Charley’s model does not give any evidence that the model is of any use whatsoever. Who do you think is right, Donald or Charley?
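
Donald's claim can be explored with a small experiment. The sketch below does not use Charley's model; as an assumption for illustration it fits six random numbers with six linearly independent model vectors (an intercept plus a degree-5 polynomial in a made-up variable x) and checks the size of the residuals.

```r
set.seed(3)

# Six cases; the response is pure random noise with no real relationship to x
x <- 1:6
y <- rnorm(6)

# Six coefficients: the intercept plus five polynomial terms
mod <- lm(y ~ poly(x, 5))

n_coef <- length(coef(mod))
largest_resid <- max(abs(resid(mod)))
n_coef
largest_resid   # zero up to round-off error
```

With as many independent coefficients as cases, the residuals vanish whatever the response values are.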

Prob 8.02. Which of these statements will compute the sum of squared residuals of the model stored in the object mod?

 A resid(mod)
 B sum(resid(mod))
 C sum(resid(mod))^2
 D sum(resid(mod)^2)
 E sum(resid(mod^2))
 F None of the above.

Prob 8.04. Here is a simple model that relates foot width to length in children, fit to the data in kidsfeet.csv:

> kids = fetchData("kidsfeet.csv")
> mod = lm( width ~ length, data=kids)
> coef(mod)
(Intercept)    length
2.8623    0.2479

(a)
Using the coefficients, calculate the predicted foot width from this model for a child with foot length 27cm.

2.86  3.10  7.93  9.12  9.56  12.24  28.62
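
A prediction from a fitted straight-line model is the intercept plus the slope times the input. The sketch below uses the two coefficients printed above; the function name predict_width and the 25 cm foot length are assumptions chosen for illustration.

```r
# Coefficients copied from the coef(mod) output above
intercept <- 2.8623
slope     <- 0.2479

# Model value: intercept + slope * length (hypothetical helper)
predict_width <- function(len_cm) intercept + slope * len_cm

# Example with an illustrative foot length of 25 cm
predict_width(25)
```
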
(b)
The sum of squares of the residuals from the model provides a simple indication of how far typical values are from the model. In this sense, the standard deviation of the residuals tells us how much uncertainty there is in the prediction. (Later on, we’ll see that another term needs to be added to this uncertainty.) What is the sum of squares of the residuals?

4.73  5.81  5.94  6.10  6.21
(c)
What is the sum of squares of the fitted values for the kids in kidsfeet.csv?

42.5  286.3  3157.7  8492.0  15582.1
(d)
What is the sum of squares of the foot widths for the kids in kidsfeet.csv?

3163.5  3167.2  3285.1  3314.8  3341.7
(e)
There is a simple relationship between the sum of squares of the response variable, the residuals, and the fitted values. You can confirm this directly. Which of the following R statements is appropriate to do this:

 A sum(kids$width) - (sum(resid(mod)) + sum(fitted(mod)))
 B sum(kids$width^2) - (sum(resid(mod)^2) + sum(fitted(mod)^2))
 C sum(resid(mod)) - sum(fitted(mod))
 D sum(resid(mod)^2) - sum(fitted(mod)^2)

Note: It might seem natural to use the == operator to compare the equality of two values, for instance A==B. However, arithmetic on the computer is subject to small round-off errors, too small to be important when looking at the quantities themselves but sufficient to cause the == operator to say the quantities are different. So, it’s usually better to compare numbers by subtracting one from the other and checking whether the result is very small.
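
The round-off issue in the note can be seen with two numbers that are equal on paper. This is a small sketch; the tolerance 1e-12 is an arbitrary choice for illustration.

```r
a <- 0.1 + 0.2
b <- 0.3

exact_equal <- (a == b)               # FALSE: the values differ by round-off
close_equal <- abs(a - b) < 1e-12     # TRUE: the difference is tiny
approx_equal <- isTRUE(all.equal(a, b))  # TRUE: all.equal() allows a small tolerance

c(exact = exact_equal, close = close_equal, approx = approx_equal)
```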

Prob 8.05. Consider the data collected by Francis Galton in the 1880s, stored in a modern format in the galton.csv file. In this file, height is the variable containing the child’s height, while the father’s and mother’s heights are contained in the variables father and mother. The family variable is a numerical code identifying children in the same family; the number of kids in each family is in nkids.

> galton = fetchData("Galton")
> lm( height ~ father, data=galton)
Coefficients:
(Intercept)       father
39.1104       0.3994

(a)
What is the model’s prediction for the height of a child whose father is 72 inches tall?
67.1  67.4  67.9  68.2
(b)
Construct a model using both the father’s and mother’s heights, using just the main effects but not including their interaction. What is the model’s prediction for the height of a child whose father is 72 inches tall and mother is 65 inches tall?
67.4  68.1  68.9  69.2
(c)
Construct a model using mother and father’s height, including the main effects as well as the interaction. What is the model’s prediction for the height of a child whose father is 72 inches tall and mother is 65 inches tall?
67.4  68.1  68.9  69.2
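
The difference between parts (b) and (c) is only in the model formula. The sketch below shows the two formulas on simulated data; the data frame sim and every number used to generate it are assumptions, so the printed predictions are not answers for the Galton data.

```r
set.seed(1)
n <- 200

# Simulated stand-in for the Galton variables (made-up generating values)
sim <- data.frame(
  father = rnorm(n, mean = 69, sd = 2.5),
  mother = rnorm(n, mean = 64, sd = 2.3)
)
sim$height <- 22 + 0.4 * sim$father + 0.3 * sim$mother + rnorm(n, sd = 2)

# Main effects only
main_mod <- lm(height ~ father + mother, data = sim)
# Main effects plus the interaction term father:mother
int_mod  <- lm(height ~ father * mother, data = sim)

# Prediction for one new case from each model
newcase <- data.frame(father = 72, mother = 65)
predict(main_mod, newdata = newcase)
predict(int_mod, newdata = newcase)

names(coef(int_mod))
```

Replacing sim with the galton data frame would give the predictions asked for in the problem.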

Galton did not have our modern techniques for including multiple variables into a model. So, he tried an expedient, defining a single variable, “mid-parent,” that reflected both the father’s and mother’s height. We can mimic this approach by defining the variable in the same way Galton did:

> galton = transform( galton,midparent=(father+1.08*mother)/2 )

Galton used the multiplier of 1.08 to adjust for the fact that the mothers were, as a group, shorter than the fathers.

Fit a model to the Galton data using the mid-parent variable and child’s sex, using both the main effects and the interaction. This will lead to a separate coefficient on mid-parent for male and female children.

(d)
What is the predicted height for a girl whose father is 67 inches and mother 64 inches?
63.6  63.9  64.2  65.4  65.7

The following questions are about the size of the residuals from models.

(e)
Without knowing anything about a randomly selected child except that he or she was in Galton’s data set, we can say that the child’s height is a random variable with a certain mean and standard deviation. What is this standard deviation?
2.51  2.73  2.95  3.44  3.58  3.67  3.72
(f)
Now consider that we are promised to be told the sex of the child, but no other information. We are going to make a prediction of the child’s height once we get this information, and we are asked to say, ahead of time, how good this prediction will be. A sensible way to do this is to give the standard deviation of the residuals from the best fitting model based on the child’s sex. What is this standard deviation of residuals?
2.51  2.73  2.95  3.44  3.58  3.67  3.72
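
Parts (e) and (f) compare the spread of the response with the spread of the residuals from a model. A minimal sketch on simulated data follows; the data frame sim and its generating values are assumptions, not the Galton data.

```r
set.seed(2)
n <- 300

# Simulated heights that depend on sex (made-up means and spread)
sim <- data.frame(sex = factor(sample(c("F", "M"), n, replace = TRUE)))
sim$height <- ifelse(sim$sex == "M", 69, 64) + rnorm(n, sd = 2.5)

# Knowing nothing about the child: the model height ~ 1
mod_none <- lm(height ~ 1, data = sim)
# Knowing the sex: the model height ~ sex
mod_sex  <- lm(height ~ sex, data = sim)

sd_height    <- sd(sim$height)
sd_resid_one <- sd(resid(mod_none))  # same as the sd of the heights
sd_resid_sex <- sd(resid(mod_sex))   # no larger than sd_resid_one

c(sd_height, sd_resid_one, sd_resid_sex)
```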

Prob 8.10. The “modern physics” course has a lab where students measure the speed of sound. The apparatus consists of an air-filled tube with a sound generator at one end and a microphone that can be set at any specified position within the tube. Using an oscilloscope, the transit time between the sound generator and microphone can be measured precisely. Knowing the position p and transit time t allows the speed of sound v to be calculated, based on the simple model: p = vt. Here are some data recorded by a student group calling themselves “CDT”.

 position (m)  transit time (millisec)
 0.2           0.6839
 0.4           1.252
 0.6           1.852
 0.8           2.458
 1.0           3.097
 1.2           3.619
 1.4           4.181

Part 1.

Enter these data into a spreadsheet in the standard case-variable format. Then fit an appropriate model. Note that the relationship p = vt between position, velocity, and time translates into a statistical model of the form p ~ t - 1 where the velocity will be the coefficient on the t term.

What are the units of the model coefficient corresponding to velocity, given the form of the data in the table above?

 A meters per second B miles per hour C millimeters per second D meters per millisecond E millimeters per millisecond F No units. It’s a pure number. G No way to know from the information provided.

Compare the velocity you find from your model fit to the accepted velocity of sound (at room temperature, at sea level, in dry air): 343 m/s. There should be a reasonable match. If not, check whether your data were entered properly and whether you specified your model correctly.
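
A minimal sketch of the Part 1 fit, with the data typed directly into a data frame rather than read from a spreadsheet. The names cdt and mod_origin are assumptions; the multiplication by 1000 assumes position in meters and time in milliseconds, as in the table.

```r
# CDT data: position in meters, transit time in milliseconds
cdt <- data.frame(
  p = c(0.2, 0.4, 0.6, 0.8, 1.0, 1.2, 1.4),
  t = c(0.6839, 1.252, 1.852, 2.458, 3.097, 3.619, 4.181)
)

# p = v t has no intercept, so the model formula is p ~ t - 1
mod_origin <- lm(p ~ t - 1, data = cdt)

# The coefficient on t is the velocity, in position units per time unit
v <- coef(mod_origin)[["t"]]
v

# Convert to meters per second for comparison with 343 m/s
v * 1000
```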

Part 2.

The students who recorded the data wrote down the transit time to 4 digits of precision, but recorded the position to only 1 or 2 digits, although they might simply have left off the trailing zeros that would indicate a higher precision.

Use the data to find out how precise the position measurement is. To do this, make two assumptions that are very reasonable in this case:

1.
The velocity model is highly accurate, that is, sound travels at a constant velocity through the tube.
2.
The transit time measurements are correct. This assumption reflects current technology. Time measurements can be made very precisely, even with inexpensive equipment.

Given these assumptions, you should be able to calculate the position from the transit time and velocity. If the measured position differs from this model value — as reflected by the residuals — then the measured position is imprecise. So, a reasonable way to infer the precision of the position is by the typical size of residuals.

How big is a typical residual? One appropriate way to measure this is with the standard deviation of the residuals.

• Give a numerical value for this.
0.001  0.006  0.010  0.017  0.084  0.128

Part 3.

The students’ lab report doesn’t indicate how they know for certain that the sound generator is at position zero. One way to figure this out is to measure the generator’s position from the data themselves. Denoting the actual position of the sound generator as p0, the equation relating position and transit time is p = p0 + vt. This suggests fitting a model of the form p ~ 1 + t, where the coefficient on 1 will be p0 and the coefficient on t will be v.

Fit this model to the data.

• What is the estimated value of p0?
-0.032  -0.012  0.000  0.012  0.032

Notice that adding new terms to the model reduces the standard deviation of the residuals.

• What is the new value of the standard deviation of the residuals?
0.001  0.006  0.010  0.017  0.084  0.128

Compare the estimated speed of sound found from the model p ~ 1 + t to the established value: 343 m/s. Notice that the estimate is better than the one from the model p ~ t - 1 that didn’t take into account the position of the sound generator.
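
The two models in this problem can be compared side by side. The sketch below retypes the CDT data so that it stands alone; the object names are assumptions.

```r
# CDT data: position in meters, transit time in milliseconds
cdt <- data.frame(
  p = c(0.2, 0.4, 0.6, 0.8, 1.0, 1.2, 1.4),
  t = c(0.6839, 1.252, 1.852, 2.458, 3.097, 3.619, 4.181)
)

mod_origin    <- lm(p ~ t - 1, data = cdt)  # generator assumed to be at position 0
mod_intercept <- lm(p ~ 1 + t, data = cdt)  # generator position p0 estimated

p0 <- coef(mod_intercept)[["(Intercept)"]]
v_intercept <- coef(mod_intercept)[["t"]]
v_origin    <- coef(mod_origin)[["t"]]

# Typical residual size for each model
sd_origin    <- sd(resid(mod_origin))
sd_intercept <- sd(resid(mod_intercept))

c(p0 = p0, v_origin = v_origin, v_intercept = v_intercept)
c(sd_origin = sd_origin, sd_intercept = sd_intercept)
```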

Prob 8.11. The graph shows some data on natural gas usage (in ccf) versus temperature (in deg. F) along with a model of the relationship.

(a)
What are the units of the residuals from a model in which natural gas usage is the response variable?
ccf  degF  ccf.per.degF  none
(b)
Using the graph, estimate the magnitude of a typical residual, that is, approximately how far a typical case is from the model relationship. (Ignore whether the residual is positive or negative. Just consider how far the case is from the model, whether it be above or below the model curve.)
2ccf  20ccf  50ccf  100ccf
(c)
There are two cases that are outliers with respect to the model relationship between the variables. Approximately how big are the residuals in these two cases?
2ccf  20ccf  50ccf  100ccf

Now ignore the model and focus just on those two outlier cases and their relationship to the other data points.

(d)
Are the two cases outliers with respect to natural gas usage?
True or False
(e)
Are the two cases outliers with respect to temperature?
True or False

Prob 8.12. It can be helpful to look closely at the residuals from a model. Here are some things you can easily do:

1.
Look for outliers in the residuals. If they exist, it can be worthwhile to look into the cases involved more deeply. They might be anomalous or misleading in some way.
2.
Plot the residuals versus the fitted model values. Ideally there should be no evident relationship between the two — the points should be a random scatter. When there is a strong relationship, even though it might be complicated, the model may be missing some important term.
3.
Plot the residuals versus the values of an important explanatory variable. (If there are multiple explanatory variables, there would be multiple plots to look at.) Again, ideally there should be no evident relationship. If there is, there is something to think about.

Using the world-record swim data, swim100m.csv, construct the model time ~ year + sex + year:sex. This model captures some of the variability in the record times, but doesn’t reflect something that’s obvious from a plot of the data: that records improved quickly in the early years (especially for women) but the improvement is much slower in recent years. The point of this exercise is to show how the residuals provide information about this.

• Find the cases in the residuals that are outliers. Explain what it is about these cases that fits in with the failure of the model to reflect the slowing improvement in world records.
• Plot the residuals versus the fitted model values. What pattern do you see that isn’t consistent with the idea that the residuals are unrelated to the fitted values?
• Plot the residuals versus year. Describe the pattern you see.
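
The three checks can be sketched on made-up data that share the key feature of the swim records: fast early improvement that levels off. The variables year and time below are simulated assumptions, not the contents of swim100m.csv, and show_residual_plots is a hypothetical helper.

```r
# Simulated record times: rapid early improvement, then a plateau
year <- seq(1905, 2000, by = 5)
time <- 50 + 20 * exp(-(year - 1905) / 30)

# A straight-line model misses the leveling off
mod <- lm(time ~ year)
r <- resid(mod)

# Sign pattern of the residuals: positive at both ends, negative in the middle
print(sign(r))

# Hypothetical helper drawing the two residual plots described above
show_residual_plots <- function(mod, x, xlab = "x") {
  plot(fitted(mod), resid(mod), xlab = "fitted model values", ylab = "residuals")
  abline(h = 0, lty = 2)
  plot(x, resid(mod), xlab = xlab, ylab = "residuals")
  abline(h = 0, lty = 2)
}
```

Calling show_residual_plots(mod, year, xlab = "year") draws both plots; a U-shaped pattern rather than a random scatter suggests that the model is missing a term.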

Now use the kids-feet data, kidsfeet.csv, and the model width ~ length + sex + length:sex.

Look at the residuals in the three suggested ways. Are there any outliers? Describe any patterns you see in relationship to the fitted model values and the explanatory variable length.

Prob 8.20. We’re going to use the ten-mile-race data to explore the idea of redundancy: Why redundancy is a problem and what we can do about it.

> run = fetchData("ten-mile-race.csv")

The data includes information about the runner’s age and sex, as well as the time it took to run the race.

I’m interested in how computer and cell-phone use as a child may have affected the runner’s ability. I don’t have any information about computer use, but as a rough proxy, I’m going to use the runner’s year of birth. The assumption is that runners who were born in the 1950s, 60s, and 70s didn’t have much chance to use computers as children.

Add in a new variable: yob. We’ll approximate this as the runner’s age subtracted from the year in which the race was run: 2005. That might be off by a year for any given person, but it will be pretty good.

> run$yob = 2005 - run$age

Each of the following models has two terms.

mod1 = lm( net ~ age + yob - 1, data=run)
mod2 = lm( net ~ 1 + age, data = run)
mod3 = lm( net ~ 1 + yob, data=run )

• Fit each of the models and interpret the coefficients in terms of the relationship between age and year of birth and running time. Then look at the R² and the sum of squared residuals in order to decide which is the better model.

Using special software that you don’t have, I have fitted a model — I’ll call it mod4 — with all three terms: the intercept, age, and year of birth. The model coefficients are:

 My Fantastic Model: mod4
 Intercept  age           yob
 -20050     20.831891052  12.642004612

My conclusion, based on the mod4 coefficients, is that people slow down by 20.8 seconds for every year they age. Making up for this, however, is the fact that people who were born earlier in the last century tend to run faster, by 12.6 seconds for every year earlier they were born. Presumably this is because those born earlier had less opportunity to use computers and cell phones and therefore went out and did healthful, energetic, physical play rather than typing.

• Using these coefficients, calculate the model values. The statement will look like this:
mod4vals = -20050 + 20.831891052*run$age +
12.642004612*run$yob

• Calculate the residuals from mod4 by subtracting the model values from the response variable (net running time). Compare the size of the residuals, using a sum of squares, a standard deviation, or a variance, to the size of the residuals from models 1 through 3. Judging from this, which is the better model?

I needed special software to find the coefficients in mod4 because R won’t do it. See what happens when you try the models with three terms, like this:

lm( net ~ 1 + age + yob, data=run )
lm( net ~ 1 + yob + age, data=run )

• Can you get three coefficients from the R software?
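
What R does with three redundant terms can be tried without the race file. The sketch below uses a simulated stand-in; run_sim and the numbers that generate net are assumptions, and only the relationship yob = 2005 - age matches the text.

```r
set.seed(4)
n <- 100

# Simulated runners: yob is an exact linear function of age
run_sim <- data.frame(age = sample(18:75, n, replace = TRUE))
run_sim$yob <- 2005 - run_sim$age
run_sim$net <- 5300 + 8 * run_sim$age + rnorm(n, sd = 600)  # made-up running times

# The same three terms, in two different orders
mod_a <- lm(net ~ 1 + age + yob, data = run_sim)
mod_b <- lm(net ~ 1 + yob + age, data = run_sim)

# lm() reports NA for the term it finds redundant with the earlier ones
coef(mod_a)
coef(mod_b)

# The fitted values do not depend on which term was dropped
same_fit <- all.equal(fitted(mod_a), fitted(mod_b))
same_fit
```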

I’m very pleased with mod4 and the special methods I used to find the coefficients.

Unfortunately, my statistical arch-enemy, Prof. Nalpak Ynnad, has proposed another model. He claims that computer and cell-phone use is helpful. According to his twisted theory, people actually run faster as they get older. Impossible! But look at his model coefficients.

 Ynnad’s Evil Model: mod5
 Intercept  age           yob
 60150      -19.16810895  -27.35799539

Ynnad’s ridiculous explanation is that the natural process of aging (that you run faster as you age), is masked by the beneficial effects of exposure to computers and cell phones as a child. That is, today’s kids would be even slower (because they are young) except for the fact that they use computers and cell phones so much. Presumably, when they grow up, they will be super fast, benefiting both from their advanced age and from the head start they got as children from their exposure to computers and cell phones.

• Looking at Ynnad’s model in terms of the R² or size of residuals, how does it compare to my model? Which one should you believe?
• Give an explanation of why both my model and Ynnad’s model are bogus. See if you can also explain why we shouldn’t take the coefficients in mod1 seriously at face value.
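
A quick way to compare mod4 and mod5 is to evaluate both on the same cases. The sketch below uses a hypothetical range of ages rather than the race data and relies only on the definition yob = 2005 - age and the two coefficient tables above.

```r
# Hypothetical ages; year of birth follows from the definition in the text
age <- 20:70
yob <- 2005 - age

# Model values from the two sets of coefficients
mod4vals <- -20050 + 20.831891052 * age + 12.642004612 * yob
mod5vals <-  60150 - 19.16810895  * age - 27.35799539  * yob

# Largest disagreement between the two models
max_diff <- max(abs(mod4vals - mod5vals))
max_diff

# Substituting yob = 2005 - age collapses each model to intercept + slope * age
int4 <- -20050 + 12.642004612 * 2005
slope4 <- 20.831891052 - 12.642004612
int5 <- 60150 - 27.35799539 * 2005
slope5 <- -19.16810895 + 27.35799539
rbind(mod4 = c(int4, slope4), mod5 = c(int5, slope5))
```

If the two rows agree, the models make the same predictions, and the data cannot distinguish between their very different stories.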