Chapter 15 Problems      AGid      Statistical Modeling: A Fresh Approach (2/e)

• What is a “covariate” and how does it differ from any other kind of variable?
• Why is there a separate F statistic for each explanatory term in a model?
• How can covariates make an explanatory term look better (e.g., more significant) in an F test?
• How can covariates make an explanatory term look worse (e.g., less significant) in an F test?
• Why does the sum of squares of the various model terms change when there is collinearity among the model terms?

Prob 15.01. Often we are interested in whether two groups are different. For example, we might ask if girls have a different mean footlength than do boys. We can answer this question by constructing a suitable model.

> kids = fetchData("kidsfeet.csv")
> summary( lm( length ~ sex, data=kids ) )
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  25.1050     0.2847  88.180   <2e-16
sexG         -0.7839     0.4079  -1.922   0.0623

Interpret this report, keeping in mind that the foot length is reported in centimeters. (The reported value <2e-16 means p < 2 × 10^-16.)

1.
What is the point estimate of the difference between the lengths of boys’ and girls’ feet?

 A. Girls’ feet are, on average, 25 centimeters long.
 B. Girls’ feet are 0.4079 cm shorter than boys’.
 C. Girls’ feet are 0.7839 cm shorter than boys’.
 D. Girls’ feet are 1.922 cm shorter than boys’.

2.
The confidence interval can be written as a point estimate plus-or-minus a margin of error: P ± M. What is the 95% margin of error, M, on the difference between boys’ and girls’ foot lengths?
-0.78  0.28  0.41  0.60  0.80
3.
What is the Null Hypothesis being tested by the reported p-value 0.0623?

 A. Boys’ feet are, on average, longer than girls’ feet.
 B. Girls’ feet are, on average, shorter than boys’ feet.
 C. All boys’ feet are longer than all girls’ feet.
 D. No girl’s foot is shorter than all boys’ feet.
 E. There is no difference, on average, between boys’ footlengths and girls’ footlengths.

4.
What is the Null Hypothesis being tested by the p-value on the intercept?

 A. Boys’ and girls’ feet are, on average, the same length.
 B. The length of kids’ feet is, on average, zero.
 C. The length of boys’ feet is, on average, zero.
 D. The length of girls’ feet is, on average, zero.
 E. Girls’ and boys’ feet don’t intercept.

Here is the report from a related, but slightly different model:

> summary( lm( length~sex-1, data=kids ))
Coefficients:
     Estimate Std. Error t value Pr(>|t|)
sexB  25.1050     0.2847   88.18   <2e-16
sexG  24.3211     0.2921   83.26   <2e-16

Note that the p-values for both coefficients are practically zero, p < 2 × 10^-16.

What is the Null Hypothesis tested by the p-value on sexG?

 A. Girls’ feet have a different length, on average, than boys’.
 B. Girls’ feet are no different in length, on average, than boys’.
 C. Girls’ footlengths are, on average, zero.
 D. Girls’ footlengths are, on average, greater than zero.
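
The two parameterizations used in Prob 15.01 can be explored on made-up data. The following is only a sketch under stated assumptions: the data frame toy is simulated (it is not the kidsfeet data) and only base R is used, so fetchData() is not needed. It illustrates that, with an intercept, the sexG coefficient is the difference between the two group means; that with - 1 each coefficient is a group mean; and that a 95% margin of error is a t multiplier times the standard error.

```r
# Simulated foot lengths (hypothetical values, NOT the kidsfeet data).
set.seed(1)
toy <- data.frame(
  sex    = rep(c("B", "G"), each = 20),
  length = c(rnorm(20, mean = 25, sd = 1.3), rnorm(20, mean = 24, sd = 1.3))
)
mean_B <- mean(toy$length[toy$sex == "B"])
mean_G <- mean(toy$length[toy$sex == "G"])

# With an intercept: (Intercept) is the mean of the reference group B,
# and sexG is the difference mean(G) - mean(B).
with_intercept <- lm(length ~ sex, data = toy)
print(coef(with_intercept))

# Without an intercept: each coefficient is simply a group mean.
without_intercept <- lm(length ~ sex - 1, data = toy)
print(coef(without_intercept))

# 95% margin of error on the difference: t multiplier times standard error.
se_diff <- summary(with_intercept)$coefficients["sexG", "Std. Error"]
margin  <- qt(0.975, df = df.residual(with_intercept)) * se_diff
print(margin)
```

Because the null hypothesis attached to each coefficient is "this coefficient is zero", the meaning of a p-value changes with the parameterization even though the fitted group means do not.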

Prob 15.02. Here is an ANOVA table (with the “intercept” term included) from a fictional study of scores assigned to various flavors, textures, densities, and chunkiness of ice cream. Some of the values in the table have been left out. Figure out from the rest of the table what they should be.

             Df Sum-Sq Mean-Sq F-value p-value
(intercept)   1    _A_     200  _B_    _C_
flavor        8    640      80  _D_    0.134
density      _E_   100     100    2    0.160
fat-content   1    300     _F_    6    0.015
chunky        1    200     200    4    0.048
Residuals   100   5000      50

(a)
The value of A:
1  2  100  200  400  600
(b)
The value of B:
1  2  3  4  5  6  7  8  10  20  200
(c)
The value of C: (Hint: There’s enough information in the table to find this.)
0.00  0.015  0.030  0.048  0.096  0.134  0.160  0.320  0.480
(d)
The value of D:
0.0  0.8  1.6  3.2  4.8
(e)
The value of E:
0  1  2  3  4  5  6
(f)
The value of F:
100  200  300  400  500
(g)
How many cases are involved altogether?
50  100  111  112  200  5000
(h)
How many different flavors were tested?
1  3  5  8  9  10  12  100
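
The entries of an ANOVA table are tied together by simple arithmetic. The sketch below spells those relations out in base R using made-up numbers (they are deliberately not the values from the ice-cream table), so all variable names and values are illustrative assumptions.

```r
# Hypothetical ANOVA rows (made-up numbers, not the ice-cream table).
term_df  <- 3;  term_ss  <- 90
resid_df <- 40; resid_ss <- 400

# Mean square = sum of squares / degrees of freedom.
term_ms  <- term_ss / term_df      # 30
resid_ms <- resid_ss / resid_df    # 10

# F = mean square of the term / mean square of the residuals.
f_value <- term_ms / resid_ms      # 3

# p-value = upper tail of the F distribution with (term_df, resid_df) d.f.
p_value <- pf(f_value, term_df, resid_df, lower.tail = FALSE)
print(c(mean_sq = term_ms, F = f_value, p = p_value))

# The degrees of freedom of all rows, intercept included, add up to the
# number of cases. Here: 1 (intercept) + 3 (term) + 40 (residuals).
n_cases <- 1 + term_df + resid_df  # 44
print(n_cases)
```

The same relations can be run in reverse: knowing any two of Df, sum of squares, and mean square in a row fixes the third.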

Prob 15.04. Consider the following analysis of the kids’ feet data looking for a relationship between foot width and whether the child is left or right handed. The variable domhand gives the handedness, either L or R. We’ll construct the model in two different ways. There are 39 cases altogether.

> anova( lm(width ~ domhand, data=kids))
Response: width
            Df  Sum Sq Mean Sq    F value Pr(>F)
(Intercept)  1 3153.60 3153.60 12228.0064 <2e-16 ***
domhand      1    0.33    0.33     1.2617 0.2686
Residuals   37    9.54    0.26

> anova( lm(width ~ domhand - 1, data=kids))
Response: width
          Df  Sum Sq Mean Sq F value    Pr(>F)
domhand    2 3153.93 1576.96  6114.6 < 2.2e-16 ***
Residuals 37    9.54    0.26

• Explain why, in the first case, the p-value is not significant, but in the second case it is.
• Why does domhand have 1 degree of freedom in the first ANOVA report, but 2 degrees of freedom in the second?
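
One way to investigate these questions is to compare the two model forms on simulated data. This is a sketch under assumptions: toy and its columns hand and width are made up, with no real handedness effect built in. Base R's anova() does not print an (Intercept) row like the first report above, so the sum of squares for the intercept is computed by hand as n times the squared grand mean.

```r
# Simulated widths (hypothetical, NOT the kidsfeet data); no true hand effect.
set.seed(2)
toy <- data.frame(
  hand  = rep(c("L", "R"), times = c(8, 30)),
  width = rnorm(38, mean = 9, sd = 0.5)
)
with_int    <- anova(lm(width ~ hand, data = toy))
without_int <- anova(lm(width ~ hand - 1, data = toy))

# Sum of squares attributable to the intercept (the grand mean).
ss_intercept <- nrow(toy) * mean(toy$width)^2

# With an intercept, hand has 1 d.f.; without one, it has 2 d.f.
print(with_int["hand", c("Df", "Sum Sq")])
print(without_int["hand", c("Df", "Sum Sq")])

# The 2-d.f. sum of squares is the intercept's plus the 1-d.f. hand term's.
print(ss_intercept + with_int["hand", "Sum Sq"])
```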

Prob 15.05. A statistics student wrote:

I’m interested in the publishing business, particularly magazines, and thought I would try a statistical analysis of some of the features of magazines. I looked at several different magazines, and recorded several variables, some of which I could measure from a single copy and some of which I deduced from my knowledge of the publishing business.

• magazine: a number to identify the magazine
• pages: the number of pages in a typical issue
• color: the number of pages with a color picture
• age: the age group of the target audience
• sex: the sex of the intended audience
• sentenceLength: the average number of words in a sentence in the articles in the magazine

Most people find it hard to believe, but most mass-market magazines are very deliberately written and composed graphically to be attractive to the target audience. The distinctive “styles” of magazines are no accident.

I was interested to see if there is a relation between the average sentence length and any of the other variables. I made one linear model and had a look at the ANOVA table, as shown below.

Analysis of Variance Table

Response: sentenceLength
          Df Sum Sq Mean Sq F value  Pr(>F)
sex        1  0.222   0.222  0.0717 0.80626
age        3 71.067  23.689  7.6407   ????
color      1  0.299   0.299  0.0964 0.77647
Residuals  3  9.301   3.100

Answer each question based on the information given above.

1.
What model structure did I use to generate this table?

 A. sentenceLength ~ age + sex + color
 B. sentenceLength ~ sex * age + color
 C. sentenceLength ~ sex + age + color
 D. color ~ sentenceLength + sex + age

2.
How many cases are there altogether?

 A. 8
 B. 9
 C. 10
 D. No way to tell from the information given.

3.
Is the variable age categorical or quantitative?

 A. categorical
 B. quantitative
 C. Could be either.
 D. Can’t know for sure from the data given.

4.
The p-value for age is missing. What should this value be?

 A. 0.93553 from pf(7.6407, 3, 3)
 B. 0.06446 from 1-pf(7.6407, 3, 3)
 C. 0.98633 from pf(23.689, 3, 3)
 D. 0.01367 from 1-pf(23.689, 3, 3)
 E. 0.99902 from pnorm(23.689, 0, 7.6507)
 F. 0.00098 from 1-pnorm(23.689, 0, 7.6507)

5.
Based on the ANOVA table, what null hypothesis can be rejected in this study at the significance level p < 0.10?

 A. An average sentence has zero words.
 B. There is no relationship between the number of color pages and the sex of the intended audience.
 C. The number of color pages is not related to the sentence length.
 D. There is no relation between the average number of words per sentence in an article and the age group that the magazine is intended for, after taking sex into account.
 E. None of the above, because there is a different null hypothesis corresponding to each model term in the ANOVA report.

6.
I really wanted to look to see if there is an interaction between sex and age. What would be the point of including this term in the model?

 A. To see if the different sexes have a different distribution of age groups.
 B. To see if there is a difference in average sentence length between magazines for females and males.
 C. To see if magazines for different age groups are targeted to different sexes.
 D. To see if the difference in average sentence length between magazines for females and males changes from one age group to another.

7.
I tried to include an interaction between sex and age in the model. This didn’t work out. Using just the information in the printed ANOVA report, judge what might have gone wrong.

 A. The term was included as the last term in the ANOVA report and didn’t have a significant sum of squares.
 B. I discovered that sex and age were redundant.
 C. The p-values disappeared from the report.
 D. None of the above.

Prob 15.10. P-values concern the “statistical significance” of evidence for a relationship. This can be a different thing from the real-world importance of the observed relationship. It’s possible for a weak connection to be strongly statistically significant (if there is a lot of data for it) and for a strong relationship to lack statistical significance (if there is not much data).

Consider the data on the times it took runners to complete the Cherry Blossom ten-mile race in March 2005:

> run = fetchData("ten-mile-race.csv")
> names(run)
[1] "state" "time"  "net"   "age"   "sex"

Consider the net variable, which gives the time it took the runners to get from the start line to the finish line.

Answer each of the following questions, giving both a quantitative argument and also an everyday English explanation. Assessing statistical significance is a technical matter, but to interpret the substance of a relationship, you will have to put it in a real-world context.

1.
What is the relationship between net running time and the runner’s age? Is the relationship significant? Is it substantial?

2.
What is the relationship between net running time and the runner’s sex? Is the relationship significant? Is it substantial?

3.
Is there an interaction between sex and age? Is the relationship significant? Is it substantial?
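
The distinction between "significant" and "substantial" can be seen in a small simulation. The sketch below is an assumption-laden illustration, not an analysis of the race data: the sample size, the units (minutes), and the built-in effect of 0.1 minute per year of age are all invented. With many cases, even this weak relationship produces a very small p-value, while R² shows that age accounts for almost none of the variation in times.

```r
# Simulated runners (hypothetical values, NOT ten-mile-race.csv).
set.seed(10)
n   <- 20000
age <- rnorm(n, mean = 37, sd = 10)

# Assumed true effect: 0.1 minute per year of age, swamped by
# runner-to-runner scatter with a standard deviation of 15 minutes.
net <- 90 + 0.1 * age + rnorm(n, sd = 15)

fit     <- summary(lm(net ~ age))
slope   <- fit$coefficients["age", "Estimate"]
p_value <- fit$coefficients["age", "Pr(>|t|)"]
r2      <- fit$r.squared

# Tiny p-value (significant), tiny R-squared (not substantial).
print(c(slope = slope, p = p_value, r.squared = r2))
```

Judging whether the slope matters in practice still requires context, for example comparing it with the spread of finishing times among runners of the same age.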

Prob 15.11. You are conducting an experiment of a treatment for balding. You measure the hair follicle density before treatment and again after treatment. The data table has the following variables (with a few examples shown):

Subject.ID  follicle.density  when    sex
A59          7.3              before  M
A59          7.9              after   M
A60         61.2              before  F
A60         61.4              after   F
... and so on, 100 entries altogether

1.
Here is an ANOVA table for a model of these data:
> anova(lm(follicle.density~when,data=hair))
            Df  Sum Sq Mean Sq F value  p
when         1    33.7   33.7  0.157 0.693
Residuals   98 21077.5  215.1

Does this table suggest that the treatment makes a difference? Why or why not?

2.
Here’s another ANOVA table:
> anova(lm(follicle.density~when+Subject.ID,
+  data=hair))
           Df  Sum Sq Mean Sq  F value p
when        1    33.7   33.7  14.9  0.0002
Subject.ID 49 20858.6  425.7 185.0  zero
Residuals  97   218.9    2.3

Why is the F-value on when different in this model than in the previous one?

3.
What overall conclusion do you draw about the effectiveness of the treatment? Is the effect of the treatment statistically significant? Is it significant in practice?
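
A before-and-after design like this one can be mimicked with simulated data to see how a subject identifier changes the ANOVA report. In the sketch below everything is an assumption made for illustration: 50 simulated subjects, a large person-to-person spread, a built-in treatment effect of +1, and small measurement noise. The numbers will not match the tables in the problem.

```r
# Simulated before/after measurements (hypothetical, NOT the study data).
set.seed(3)
n_subj   <- 50
baseline <- rnorm(n_subj, mean = 40, sd = 15)   # large person-to-person spread
hair_sim <- data.frame(
  Subject.ID = factor(rep(seq_len(n_subj), each = 2)),
  when       = factor(rep(c("before", "after"), times = n_subj),
                      levels = c("before", "after")),
  follicle.density = rep(baseline, each = 2) +
    rep(c(0, 1), times = n_subj) +               # assumed treatment effect of +1
    rnorm(2 * n_subj, sd = 1)                    # small measurement noise
)

# Same sum of squares for `when` in both models; very different residuals.
simple <- anova(lm(follicle.density ~ when, data = hair_sim))
paired <- anova(lm(follicle.density ~ when + Subject.ID, data = hair_sim))
print(simple)
print(paired)
```

In the simulation, the Subject.ID term soaks up the person-to-person variation, so the residual mean square that forms the denominator of F shrinks while the numerator for when stays the same.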

Prob 15.12. During a conversation about college admissions, a group of high-school students starts to wonder how reliable the SAT score is, that is, how much an individual student’s score could be expected to vary just due to random factors such as the day on which the test was taken, the student’s mood, the specific questions on the test, etc. This variation within an individual is quite different from the variation from person to person.

The high-school students decide to study the issue. A simple way to do this is to have one student take the SAT test several times and examine the variability in the student’s scores. But, it would be better to do this for many different students. To this end, the students propose to pool together their scores from the times they took the SAT, producing a data frame that looks like this:

Student  Score  Sex  Order
PersonA   2110  F    1
PersonB   1950  M    1
PersonC   2080  F    1
PersonA   2090  F    2
PersonA   2150  F    3
... and so on

The order variable indicates how many times the student has taken the test. 1 means that it is the student’s first time, 2 the second time, and so on.

One student suggests that they simply take the standard deviation of the score variable to measure the variability in the SAT score. What’s wrong with this for the purpose the students have in mind?

 A. There’s nothing wrong with it.
 B. Standard deviations don’t measure random variability.
 C. It would confound variability between students with variability within each student.

Another student suggests looking at the sum of squared residuals from the model score ~ student. What’s wrong with this?

 A. There’s nothing wrong with it.
 B. It’s the coefficients on student that are important.
 C. Better to look at the mean square residual.

The students’ statistics teacher points out that the model score ~ student will exactly capture the score of any student who takes the SAT only once; the residuals for those students will be exactly zero. Explain why this isn’t a problem, given the purpose for which the model is being constructed.

Still another student suggests the model score ~ student + order in order to adjust for the possibility that scores change with experience, and not just at random. The group likes this idea and starts to elaborate on it. They make two main suggestions:

• Elaboration 1: score ~ student + order + sex
• Elaboration 2: score ~ student + order + student:order

Why not include sex as an additional covariate, as in Elaboration 1, to take into account the possibility that males and females might have systematically different scores?

 A. It’s a good idea.
 B. Bad idea since probably a person’s sex has nothing to do with his or her score.
 C. Useless, since sex is redundant with student.

Regarding Elaboration 2, which of the following statements is correct?

1.
True or False
It allows the model to capture how the change of score with experience itself might be different from one person to another.
2.
True or False
It assumes that all the students in the data frame are taking the test multiple times.
3.
True or False
With the interaction term in place, the model would capture the exact scores of all students who took the SAT just once or twice, so the mean square residual would reflect only those students who took the SAT three times or more.
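
The behavior of the model score ~ student can be tried out on simulated scores. This is a sketch with invented inputs: six hypothetical students, an assumed person-to-person standard deviation of 150 points, and an assumed within-person standard deviation of 40 points. It compares the overall standard deviation with the square root of the residual mean square, and shows what happens to students who took the test only once.

```r
# Simulated SAT scores (hypothetical values and student labels).
set.seed(6)
attempts <- c(1, 1, 2, 3, 3, 4)                 # hypothetical sittings per student
ids      <- paste0("Person", LETTERS[1:6])
student  <- factor(rep(ids, times = attempts))
ability  <- rnorm(6, mean = 2000, sd = 150)     # person-to-person variation
score    <- rep(ability, times = attempts) +
  rnorm(length(student), sd = 40)               # within-person variation
mod <- lm(score ~ student)

overall_sd <- sd(score)                         # mixes both kinds of variation
within_sd  <- sqrt(anova(mod)["Residuals", "Mean Sq"])
print(c(overall = overall_sd, within = within_sd))

# Students with a single score are fitted exactly and add no residual d.f.
single <- student %in% c("PersonA", "PersonB")
print(round(residuals(mod)[single], 8))
print(df.residual(mod))                         # sum(attempts - 1) = 8
```

The residual degrees of freedom count only the repeat sittings, which is why dividing the sum of squared residuals by them is what makes the result comparable across data sets of different sizes.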

Prob 15.21. In conducting a hypothesis test, we need to specify two things:

• A Null Hypothesis
• A Test Statistic

The numerical output of a hypothesis test is a p-value.

In modeling, a sensible Null Hypothesis is that one or more explanatory variables are unrelated to the response variable. We can simulate a situation in which this Null applies by shuffling the variables. For example, here are two trials of a simulation of the Null in a model of the kidsfeet data:

> kids = fetchData("kidsfeet.csv")
> lm( width ~ length + shuffle(sex), data=kids)
Coefficients:
  (Intercept)         length  shuffle(sex)G
      3.14406        0.23828       -0.07585
> lm( width ~ length + shuffle(sex), data=kids)
Coefficients:
  (Intercept)         length  shuffle(sex)G
      2.74975        0.25106        0.08668

The test statistic summarizes the situation. There are several possibilities, but here we will use R² from the model since this gives an indication of the quality of the model.

> r.squared(lm( width ~ length + shuffle(sex), data=kids))
[1] 0.4572837
> r.squared(lm( width ~ length + shuffle(sex), data=kids))
[1] 0.4175377
> r.squared(lm( width ~ length + shuffle(sex), data=kids))
[1] 0.4148968

By computing many such trials, we construct the sampling distribution under the Null — that is, the sampling distribution of the test statistic in the world in which the Null holds true. We can automate this process using do:

> samps = do(1000) *
+   r.squared(lm( width ~ length + shuffle(sex), data=kids))

Finally, to compute the p-value, we need to compute the test statistic on the model fitted to the actual data, not on the simulation.

> r.squared( lm( width ~ length + sex, data=kids))
[1] 0.4595428

The p-value is the probability of seeing a value of the test statistic from the Null Hypothesis simulation that is at least as extreme as our actual value. The meaning of “extreme” depends on what the test statistic is. In this example, since a better-fitting model will always have a larger R², we check how often the simulation produces an R² at least as large as the one from the actual data.

> table( samps >= 0.4595428)

FALSE  TRUE
912    88
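
The functions fetchData(), shuffle(), do(), and r.squared() are not part of base R. For readers without them, the same procedure can be sketched with sample() and replicate(). The version below is an illustration under assumptions: toy is simulated rather than read from kidsfeet.csv, no sex effect is built in, and the helper r2() is a stand-in written for this example.

```r
# Simulated feet (hypothetical values, NOT the kidsfeet data).
set.seed(4)
n   <- 40
toy <- data.frame(
  length = rnorm(n, mean = 24.7, sd = 1.3),
  sex    = rep(c("B", "G"), length.out = n)
)
toy$width <- 3 + 0.24 * toy$length + rnorm(n, sd = 0.4)  # no sex effect built in

r2 <- function(model) summary(model)$r.squared           # stands in for r.squared()

# Test statistic on the actual (here: simulated) data.
observed <- r2(lm(width ~ length + sex, data = toy))

# Each trial shuffles only sex, breaking any link between sex and width
# while leaving the width-length relationship intact.
null_r2 <- replicate(1000, {
  trial     <- toy
  trial$sex <- sample(trial$sex)                         # stands in for shuffle()
  r2(lm(width ~ length + sex, data = trial))
})

# p-value: fraction of Null trials with R-squared at least as large as observed.
p_value <- mean(null_r2 >= observed)
print(p_value)
```

Which variables are shuffled determines which Null Hypothesis is being simulated; here only sex is shuffled, so length keeps its real relationship to width in every trial.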

Here are various computer modeling statements that implement possible Null Hypotheses. Connect each computer statement to the corresponding Null.

1.
lm(width ~ length + shuffle(sex),data=kids)

2.
lm(width ~ shuffle(length) + shuffle(sex), data=kids)

3.
lm(width ~ shuffle(length), data=kids)

4.
lm(width ~ shuffle(sex), data=kids)

5.
lm(width ~ length + sex, data=shuffle(kids))

• Foot width is unrelated to foot length or to sex.
1  2  3  4  5
• Foot width is unrelated to sex, but it is related to foot length.
1  2  3  4  5
• Foot width is unrelated to sex, and we won’t consider any possible relationship to foot length.
1  2  3  4  5
• Foot width is unrelated to foot length, and we won’t consider any possible relationship to sex.
1  2  3  4  5
• This isn’t a hypothesis test; the randomization won’t change anything from the original data.
1  2  3  4  5

Prob 15.22. I’m interested in studying the length of gestation as a function of the ages of the mother and the father. In the gestation data set (gestation.csv), the variable age records the mother’s age in years, and dage gives the father’s age in years. The variable gestation is the length of the gestation in days. I hypothesize that the older the mother and father, the shorter the gestational period. So, I fit a model to those 599 cases where all the relevant data were recorded:

> summary(lm( gestation ~ age+dage, data=b))
Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) 282.62201    3.25821  86.741   <2e-16
age          -0.21313    0.19947  -1.068    0.286
dage          0.06085    0.17372   0.350    0.726

1.
Describe in everyday language the relationship between age and gestation indicated by this model.

2.
I note that the two p-values are nothing great. But I wonder whether if I treated mother’s and father’s age together — lumping them together into a single term with two degrees of freedom — I might not get something significant. Using the ANOVA reports given below, explain how you might come up with a single p-value summarizing the joint contribution of mother’s and father’s age. Insofar as you can, try to calculate the p-value itself.
> anova( lm(gestation ~ age+dage, data=b))
             Df Sum Sq Mean Sq F value Pr(>F)
age           1    486     486  1.9091 0.1676
dage          1     31      31  0.1227 0.7262
Residuals   596 151758     255

> anova( lm( gestation ~ dage+age, data=b))
             Df Sum Sq Mean Sq F value Pr(>F)
dage          1    227     227  0.8903 0.3458
age           1    291     291  1.1416 0.2858
Residuals   596 151758     255
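
One possible approach to pooling two terms can be sketched on simulated data. Everything in this example is an assumption for illustration: sim is invented (it is not gestation.csv), the two ages are made collinear on purpose, and the response is generated with no dependence on either. The sketch adds the sums of squares and degrees of freedom of the two terms, forms a single F from them, and cross-checks the result against a direct comparison of the full model with the intercept-only model.

```r
# Simulated parents' ages and gestation (hypothetical, NOT gestation.csv).
set.seed(5)
n   <- 200
sim <- data.frame(age = rnorm(n, mean = 27, sd = 5))
sim$dage      <- sim$age + rnorm(n, mean = 3, sd = 4)   # correlated with age
sim$gestation <- 280 + rnorm(n, sd = 16)                # unrelated to either age

full <- lm(gestation ~ age + dage, data = sim)
tab  <- anova(full)

# Pool the two terms: add their sums of squares and degrees of freedom.
both     <- c("age", "dage")
ss_joint <- sum(tab[both, "Sum Sq"])
df_joint <- sum(tab[both, "Df"])
f_joint  <- (ss_joint / df_joint) / tab["Residuals", "Mean Sq"]
p_joint  <- pf(f_joint, df_joint, tab["Residuals", "Df"], lower.tail = FALSE)
print(c(F = f_joint, p = p_joint))

# Cross-check: compare the full model with the intercept-only model.
check <- anova(lm(gestation ~ 1, data = sim), full)
print(check)

# The pooled sum of squares does not depend on the order of the terms.
rev_tab <- anova(lm(gestation ~ dage + age, data = sim))
print(sum(rev_tab[both, "Sum Sq"]))
```

Although collinearity shifts credit between the individual terms when their order changes, the pooled sum of squares and the residual row stay the same.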