Reading Questions.
Prob 8.01. Here are some (made-up) data from an experiment growing trees. The height was measured for trees in different locations that had been watered and fertilized in different ways.
height | water | light | compost | nitrogen |
5 | 2 | shady | none | little |
4 | 1 | bright | none | lot |
5 | 1.5 | bright | some | little |
6 | 3 | shady | rich | lot |
7 | 3 | bright | some | little |
6 | 2 | shady | rich | lot |
A. height
B. water
C. light
D. compost
E. Can’t tell from this information.
height | water | model values | resids |
5 | 2 | | |
4 | 1 | | |
5 | 1.5 | | |
6 | 3 | | |
7 | 3 | | |
6 | 2 | | |
height | water | model values | resids |
5 | 2 | | |
4 | 1 | | |
5 | 1.5 | | |
6 | 3 | | |
7 | 3 | | |
6 | 2 | | |
height ~ 1+water+light+compost+nitrogen
Charley believes that using this model will enable him to make excellent predictions about the height of trees in the future. Ranger Donald, on the other hand, calls Charley’s regression “ridiculous rot” and claims that Charley’s explanatory terms could fit perfectly any set of 6 numbers. Donald says that the perfect fit of Charley’s model does not give any evidence that the model is of any use whatsoever. Who do you think is right, Donald or Charley?
Prob 8.02. Which of these statements will compute the sum of squared residuals of the model stored in the object mod?
A. resid(mod)
B. sum(resid(mod))
C. sum(resid(mod))^2
D. sum(resid(mod)^2)
E. sum(resid(mod^2))
F. None of the above.
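One way to check an answer empirically is to evaluate each candidate on a small fitted model and compare it with the sum of squared residuals computed directly from its definition. The sketch below is only an illustration: it assumes the height and water columns from Prob 8.01 and the model height ~ water, and the names trees, target, candidates, and matches are not from the text.

```r
# Height and water columns from the tree data in Prob 8.01.
trees <- data.frame(
  height = c(5, 4, 5, 6, 7, 6),
  water  = c(2, 1, 1.5, 3, 3, 2)
)
mod <- lm(height ~ water, data = trees)

# Definition: sum over cases of (response - model value)^2.
target <- sum((trees$height - fitted(mod))^2)

# Evaluate each candidate statement; one that R cannot evaluate becomes NA.
candidates <- list(
  A = resid(mod),
  B = sum(resid(mod)),
  C = sum(resid(mod))^2,
  D = sum(resid(mod)^2),
  E = tryCatch(sum(resid(mod^2)), error = function(e) NA_real_)
)

# TRUE marks a candidate that agrees with the definition, up to round-off.
matches <- sapply(candidates, function(x) isTRUE(all.equal(x, target)))
print(matches)
```

The comparison uses all.equal() rather than == so that tiny round-off differences do not count as a mismatch.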
Prob 8.04. Here is a simple model that relates foot width to length in children, fit to the data in kidsfeet.csv:
A. sum(kids$width) - (sum(resid(mod)) + sum(fitted(mod)))
B. sum(kids$width^2) - (sum(resid(mod)^2) + sum(fitted(mod)^2))
C. sum(resid(mod)) - sum(fitted(mod))
D. sum(resid(mod)^2) - sum(fitted(mod)^2)
Note: It might seem natural to use the == operator to compare the equality of two values, for instance A==B. However, arithmetic on the computer is subject to small round-off errors, too small to be important when looking at the quantities themselves but sufficient to cause the == operator to say the quantities are different. So, it’s usually better to compare numbers by subtracting one from the other and checking whether the result is very small.
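The round-off problem described in the note can be seen with two numbers that are equal on paper. This is a minimal sketch; the tolerance 1e-8 is an arbitrary choice for illustration, and the variable names are not from the text.

```r
a <- 0.1 + 0.2
b <- 0.3

# Exact comparison fails because of round-off in the last binary digits.
exact_equal <- a == b
print(exact_equal)

# The difference is tiny but not exactly zero.
print(a - b)

# Subtract and check whether the result is very small.
near_equal <- abs(a - b) < 1e-8
print(near_equal)

# Base R's all.equal() performs a tolerance-based comparison.
print(isTRUE(all.equal(a, b)))
```

The same pattern applies to sums computed from a model: subtract one quantity from the other and look at the size of the result.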
Prob 8.05. Consider the data collected by Francis Galton in the 1880s, stored in a modern format in the galton.csv file. In this file, heights is the variable containing the child’s height, while the father’s and mother’s heights are contained in the variables father and mother. The family variable is a numerical code identifying children in the same family; the number of kids in each family is in nkids.
Galton did not have our modern techniques for including multiple variables into a model. So, he tried an expedient, defining a single variable, “mid-parent,” that reflected both the father’s and mother’s height. We can mimic this approach by defining the variable in the same way Galton did:
Galton used the multiplier of 1.08 to adjust for the fact that the mothers were, as a group, shorter than the fathers.
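As a hedged sketch of the mid-parent idea: Galton's mid-parent height is commonly described as the average of the father's height and 1.08 times the mother's height. That formula is an assumption here, since the defining statement is not shown above, and the function name midparent_height is hypothetical.

```r
# Assumed definition: average of the father's height and the mother's
# height scaled up by the 1.08 multiplier.
midparent_height <- function(father, mother) {
  (father + 1.08 * mother) / 2
}

# Made-up heights in inches, for illustration only (not from galton.csv).
example_value <- midparent_height(father = 70, mother = 65)
print(example_value)
```

Because the function is vectorized, it can be applied to whole columns, for instance to the father and mother variables of the Galton data once they have been read into a data frame.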
Fit a model to the Galton data using the mid-parent variable and child’s sex, using both the main effects and the interaction. This will lead to a separate coefficient on mid-parent for male and female children.
The following questions are about the size of the residuals from models.
Prob 8.10. The “modern physics” course has a lab where students measure the speed of sound. The apparatus consists of an air-filled tube with a sound generator at one end and a microphone that can be set at any specified position within the tube. Using an oscilloscope, the transit time between the sound generator and microphone can be measured precisely. Knowing the position p and transit time t allows the speed of sound v to be calculated, based on the simple model:
p = vt
Here are some data recorded by a student group calling themselves “CDT”.
position (m) | transit time (millisec) |
0.2 | 0.6839 |
0.4 | 1.252 |
0.6 | 1.852 |
0.8 | 2.458 |
1.0 | 3.097 |
1.2 | 3.619 |
1.4 | 4.181 |
Part 1.
Enter these data into a spreadsheet in the standard case-variable format. Then fit an appropriate model. Note that the relationship p = vt between position, velocity, and time translates into a statistical model of the form p ~ t - 1 where the velocity will be the coefficient on the t term.
What are the units of the model coefficient corresponding to velocity, given the form of the data in the table above?
A
| meters per second |
B
| miles per hour |
C
| millimeters per second |
D
| meters per millisecond |
E
| millimeters per millisecond |
F
| No units. It’s a pure number. |
G
| No way to know from the information provided. |
Compare the velocity you find from your model fit to the accepted velocity of sound (at room temperature, at sea level, in dry air): 343 m/s. There should be a reasonable match. If not, check whether your data were entered properly and whether you specified your model correctly.
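A minimal sketch of Part 1, assuming the data are typed directly into a data frame rather than read from a spreadsheet file. The names sound, mod_origin, and v are illustrative choices, not from the text.

```r
# Data recorded by the "CDT" group, in the units given in the table.
sound <- data.frame(
  p = c(0.2, 0.4, 0.6, 0.8, 1.0, 1.2, 1.4),
  t = c(0.6839, 1.252, 1.852, 2.458, 3.097, 3.619, 4.181)
)

# p ~ t - 1 leaves out the intercept, so the fitted line passes
# through the origin, matching p = vt.
mod_origin <- lm(p ~ t - 1, data = sound)

# The coefficient on t is the velocity, expressed in the units of p
# divided by the units of t.
v <- coef(mod_origin)[["t"]]
print(v)

# Rescaled by 1000 for comparison with the accepted 343 m/s.
print(v * 1000)
```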
Part 2.
The students who recorded the data wrote down the transit time to 4 digits of precision, but recorded the position to only 1 or 2 digits, although they might simply have left off the trailing zeros that would indicate a higher precision.
Use the data to find out how precise the position measurement is. To do this, make two assumptions that are very reasonable in this case:
Given these assumptions, you should be able to calculate the position from the transit time and velocity. If the measured position differs from this model value — as reflected by the residuals — then the measured position is imprecise. So, a reasonable way to infer the precision of the position is by the typical size of residuals.
How big is a typical residual? One appropriate way to measure this is with the standard deviation of the residuals.
Part 3.
The students’ lab report doesn’t indicate how they know for certain that the sound generator is at position zero. One way to figure this out is to measure the generator’s position from the data themselves. Denoting the actual position of the sound generator as p_{0}, then the equation relating position and transit time is
p = p_{0} + vt
Fit this model to the data.
Notice that adding new terms to the model reduces the standard deviation of the residuals.
Compare the estimated speed of sound found from the model p ~ t to the established value: 343 m/s. Notice that the estimate is better than the one from the model p ~ t - 1 that didn’t take into account the position of the sound generator.
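The comparison in Parts 2 and 3 can be sketched as follows, fitting both models to the same data and looking at the typical residual size and the estimated speed from each. The object names (sound, mod_origin, mod_offset, and so on) are illustrative assumptions.

```r
# Data recorded by the "CDT" group: position in meters, time in millisec.
sound <- data.frame(
  p = c(0.2, 0.4, 0.6, 0.8, 1.0, 1.2, 1.4),
  t = c(0.6839, 1.252, 1.852, 2.458, 3.097, 3.619, 4.181)
)

mod_origin <- lm(p ~ t - 1, data = sound)  # generator assumed at position 0
mod_offset <- lm(p ~ t, data = sound)      # intercept estimates p_0

# "(Intercept)" is the estimated generator position p_0, in meters.
print(coef(mod_offset))

# Typical residual size for each model, in meters.
sd_origin <- sd(resid(mod_origin))
sd_offset <- sd(resid(mod_offset))
print(c(origin = sd_origin, offset = sd_offset))

# Estimated speed of sound in m/s from each model.
speed_origin <- coef(mod_origin)[["t"]] * 1000
speed_offset <- coef(mod_offset)[["t"]] * 1000
print(c(origin = speed_origin, offset = speed_offset))
```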
Prob 8.11. The graph shows some data on natural gas usage (in ccf) versus temperature (in deg. F) along with a model of the relationship.
Now ignore the model and focus just on those two outlier cases and their relationship to the other data points.
Prob 8.12. It can be helpful to look closely at the residuals from a model. Here are some things you can easily do:
Using the world-record swim data, swim100m.csv, construct the model time ~ year + sex + year:sex. This model captures some of the variability in the record times, but doesn’t reflect something that’s obvious from a plot of the data: that records improved quickly in the early years (especially for women) but the improvement is much slower in recent years. The point of this exercise is to show how the residuals provide information about this.
Now use the kids-feet data kidsfeet.csv and the model width ~ length + sex + length:sex.
Look at the residuals in the three suggested ways. Are there any outliers? Describe any patterns you see in relationship to the fitted model values and the explanatory variable length.
Prob 8.20. We’re going to use the ten-mile-race data to explore the idea of redundancy: Why redundancy is a problem and what we can do about it.
Read in the data:
The data include information about the runner’s age and sex, as well as the time it took to run the race.
I’m interested in how computer and cell-phone use as a child may have affected the runner’s ability. I don’t have any information about computer use, but as a rough proxy, I’m going to use the runner’s year of birth. The assumption is that runners who were born in the 1950s, 60s, and 70s didn’t have much chance to use computers as children.
Add in a new variable: yob. We’ll approximate this as the runner’s age subtracted from the year in which the race was run: 2005. That might be off by a year for any given person, but it will be pretty good.
Each of the following models has two terms.
Using special software that you don’t have, I have fitted a model — I’ll call it mod4 — with all three terms: the intercept, age, and year of birth. The model coefficients are:
My Fantastic Model: mod4
Intercept | age | yob |
-20050 | 20.831891052 | 12.642004612 |
My conclusion, based on the mod4 coefficients, is that people slow down by 20.8 seconds for every year they age. Making up for this, however, is the fact that people who were born earlier in the last century tend to run slower by 12.6 seconds for every year later they were born. Presumably this is because those born earlier had less opportunity to use computers and cell phones and therefore went out and did healthful, energetic, physical play rather than typing.
I needed special software to find the coefficients in mod4 because R won’t do it. See what happens when you try the models with three terms, like this:
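As a hedged illustration of what R does with a redundant term, the sketch below uses a handful of made-up runners rather than the ten-mile-race data. The data frame run, its time column, and the object mod_three are assumptions for this example only.

```r
# Made-up runners, for illustration only (not the ten-mile-race data).
# time is a running time in seconds.
run <- data.frame(
  age  = c(25, 31, 38, 44, 52, 60),
  time = c(4800, 5100, 5000, 5600, 5900, 6100)
)

# Year of birth approximated as the race year minus age, as in the text.
run$yob <- 2005 - run$age

# Three terms: intercept, age, and yob. Since yob is an exact linear
# combination of the intercept and age, lm() cannot estimate all three.
mod_three <- lm(time ~ 1 + age + yob, data = run)
print(coef(mod_three))
```

In this sketch lm() reports NA for the yob coefficient: it drops the term that adds no new information instead of choosing one of the many equally good sets of coefficients.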
I’m very pleased with mod4 and the special methods I used to find the coefficients.
Unfortunately, my statistical arch-enemy, Prof. Nalpak Ynnad, has proposed another model. He claims that computer and cell-phone use is helpful. According to his twisted theory, people actually run faster as they get older. Impossible! But look at his model coefficients.
Ynnad’s Evil Model: mod5
Intercept | age | yob |
60150 | -19.16810895 | -27.35799539 |
Ynnad’s ridiculous explanation is that the natural process of aging (that you run faster as you age) is masked by the beneficial effects of exposure to computers and cell phones as a child. That is, today’s kids would be even slower (because they are young) except for the fact that they use computers and cell phones so much. Presumably, when they grow up, they will be super fast, benefiting both from their advanced age and from the head start they got as children from their exposure to computers and cell phones.
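One way to weigh the two rival stories is to compute the model values each set of coefficients gives for the same runners. The sketch below assumes yob = 2005 - age, as defined earlier in the problem; the names coef4, coef5, model_time, and ages are hypothetical, and the ages are arbitrary.

```r
# Coefficients as printed in the two tables.
coef4 <- c(intercept = -20050, age = 20.831891052, yob = 12.642004612)
coef5 <- c(intercept =  60150, age = -19.16810895, yob = -27.35799539)

# Model value (seconds) for runners of the given ages in the 2005 race.
model_time <- function(coefs, age) {
  yob <- 2005 - age
  coefs[["intercept"]] + coefs[["age"]] * age + coefs[["yob"]] * yob
}

ages <- c(20, 40, 60)
time4 <- model_time(coef4, ages)
time5 <- model_time(coef5, ages)
print(rbind(age = ages, mod4 = time4, mod5 = time5))

# Largest disagreement between the two models, in seconds.
max_gap <- max(abs(time4 - time5))
print(max_gap)
```

Comparing max_gap with the size of the model values themselves shows how much, or how little, the two sets of coefficients differ in what they say about actual running times.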