Chapter 9 Problems      Statistical Modeling: A Fresh Approach (2/e)

Reading Questions.

Prob 9.01. The R² statistic is the ratio of the variance of the fitted values to the variance of the response variable.

Using the kidsfeet.csv data:

1. Find the variance of the response variable in the model width ~ sex + length + sex:length.
   0.053  0.119  0.183  0.260  0.346
2. Find the variance of the fitted values from the model.
   0.053  0.119  0.183  0.260  0.346
3. Compute R² as the ratio of these two variances.
   0.20  0.29  0.46  0.53  0.75
4. Is this the same as the “Multiple R-squared” given in the summary(mod) report?
   Yes  No
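
The calculation in this problem can be sketched as follows. Since kidsfeet.csv may not be at hand, this sketch uses R's built-in mtcars data as a stand-in. The variables mpg, am, and wt are assumptions chosen only for illustration, so the numbers will not match the answer choices above.

```r
# Stand-in data: built-in mtcars, NOT kidsfeet.csv
mod <- lm(mpg ~ am + wt + am:wt, data = mtcars)

var_response <- var(mtcars$mpg)    # variance of the response variable
var_fitted   <- var(fitted(mod))   # variance of the fitted model values
r2_by_hand   <- var_fitted / var_response

# Value reported as "Multiple R-squared" by summary(mod)
r2_reported <- summary(mod)$r.squared

c(by_hand = r2_by_hand, reported = r2_reported)
```

Replacing mtcars and the formula with the kidsfeet data and the model from the problem should give the quantities asked for.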

Prob 9.02. The variance of a response variable A is 145 and the variance of the residuals from the model A ~ 1+B is 45.
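
A fact that may help with this kind of problem: for a least-squares model that includes an intercept, the variance of the response equals the variance of the fitted values plus the variance of the residuals. A minimal sketch with simulated data (the values of A and B below are invented and unrelated to the numbers in the problem):

```r
set.seed(1)                        # simulated, purely illustrative data
B <- rnorm(200)
A <- 3 + 2 * B + rnorm(200, sd = 1.5)
mod <- lm(A ~ 1 + B)

var_total  <- var(A)               # variance of the response
var_fitted <- var(fitted(mod))     # variance of the fitted values
var_resid  <- var(resid(mod))      # variance of the residuals

# The two pieces add up to the variance of the response
all.equal(var_total, var_fitted + var_resid)

# So R^2 can also be computed from the residual variance
r2_from_resid <- 1 - var_resid / var_total
all.equal(r2_from_resid, summary(mod)$r.squared)
```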

Prob 9.04. For each of the following pairs of models, mark the statement that is most correct.

Part 1 ______________________________________________________________________________________________

Model 1: A ~ B+C
Model 2: A ~ B*C

A. Model 1 is nested in Model 2.
B. Model 2 is nested in Model 1.
C. The two models are the same.
D. None of the above is true.


Part 2 ______________________________________________________________________________________________

Model 1: A ~ B
Model 2: B ~ A

A. Model 1 is nested in Model 2.
B. Model 2 is nested in Model 1.
C. The two models are the same.
D. None of the above is true.


Part 3_______________________________________________________________________________________________

Model 1: A ~ B + C
Model 2: B ~ A * C

A. Model 1 is nested in Model 2.
B. Model 2 is nested in Model 1.
C. The two models are the same.
D. None of the above is true.


Part 4 ______________________________________________________________________________________________

Model 1: A ~ B + C + B:C
Model 2: A ~ B * C

A. Model 1 is nested in Model 2.
B. Model 2 is nested in Model 1.
C. The two models are the same.
D. None of the above is true.
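
To check answers about nesting by experiment, one option is to compare the term lists that R derives from each formula. The helpers model_terms() and is_nested() below are hypothetical names written for this sketch, not functions from R or from the book. They treat one model as nested in another when both have the same response and every term of the first (intercept included) appears in the second. The formulas use y, x, and z rather than the A, B, and C of the problem.

```r
# Hypothetical helper: list a formula's model terms, intercept included
model_terms <- function(f) {
  tt <- terms(f)
  labels <- attr(tt, "term.labels")
  # Sort the variables inside each interaction so "x:z" and "z:x" match
  labels <- vapply(strsplit(labels, ":", fixed = TRUE),
                   function(v) paste(sort(v), collapse = ":"),
                   character(1))
  if (attr(tt, "intercept") == 1) labels <- c("1", labels)
  labels
}

# Hypothetical helper: is the model `small` nested in the model `big`?
is_nested <- function(small, big) {
  identical(small[[2]], big[[2]]) &&               # same response variable
    all(model_terms(small) %in% model_terms(big))  # terms form a subset
}

model_terms(y ~ x * z)        # terms implied by the * shorthand
is_nested(y ~ x, y ~ x + z)   # TRUE
is_nested(y ~ x + z, y ~ x)   # FALSE
is_nested(y ~ x, x ~ y)       # FALSE: the response variables differ
```

No data are needed here, because terms() works on the formula alone.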


Prob 9.05. For each of the following pairs of models, mark the statement that is most correct.

Part 1 ______________________________________________________________________________________________

Model 1: A ~ B+C
Model 2: A ~ B*C

A. Model 1 can have a higher R² than Model 2.
B. Model 2 can have a higher R² than Model 1.
C. The R² values will be the same.
D. None of the above are necessarily true.


Part 2_______________________________________________________________________________________________

Model 1: A ~ B + C
Model 2: B ~ A * C

A. Model 1 can have a higher R² than Model 2.
B. Model 2 can have a higher R² than Model 1.
C. The R² values will be the same.
D. None of the above are necessarily true.


Part 3 ______________________________________________________________________________________________

Model 1: A ~ B + C + B:C
Model 2: A ~ B * C

A. Model 1 can have a higher R² than Model 2.
B. Model 2 can have a higher R² than Model 1.
C. The R² values will be the same.
D. None of the above are necessarily true.
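
One way to explore questions like these is to fit such pairs of models to data and compare the results. A minimal sketch with simulated data (the variables y, x, and z and the helper r2() are invented for illustration):

```r
set.seed(42)
n <- 100
x <- rnorm(n)
z <- rnorm(n)
y <- 1 + x - 0.5 * z + 0.3 * x * z + rnorm(n)   # invented relationship

# Hypothetical helper: fit a model and pull out its R^2
r2 <- function(formula) summary(lm(formula))$r.squared

r2_main <- r2(y ~ x + z)         # main effects only
r2_star <- r2(y ~ x * z)         # shorthand that includes the interaction
r2_long <- r2(y ~ x + z + x:z)   # the same terms written out

c(main = r2_main, star = r2_star, long = r2_long)
```

Changing the seed or the invented relationship shows which comparisons hold in every run and which depend on the data.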


[From a suggestion by a student]

Going further _____________________________________________________________________________________

In answering this question, you might want to try out a few examples using real data: just pick two quantitative variables to stand in for A and B.

What will be the relationship between R² for the following two models?

Model 1: A ~ B
Model 2: B ~ A

A. Model 1 can have a higher R² than Model 2.
B. Model 2 can have a higher R² than Model 1.
C. The R² values will be the same.
D. None of the above are necessarily true.
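
Following the suggestion to try real data, here is one possible experiment. It uses two quantitative variables from R's built-in mtcars data as stand-ins for A and B; the choice of mpg and wt is arbitrary.

```r
r2_ab <- summary(lm(mpg ~ wt, data = mtcars))$r.squared   # A ~ B
r2_ba <- summary(lm(wt ~ mpg, data = mtcars))$r.squared   # B ~ A

# For comparison: the squared correlation between the two variables
r_squared <- cor(mtcars$mpg, mtcars$wt)^2

c(A_on_B = r2_ab, B_on_A = r2_ba, cor_squared = r_squared)
```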


Prob 9.10. Which of the following statements is true about R²?

1. True or False
   R² will never go down when you add an additional explanatory term to a model.
2. For a perfectly fitting model,
   A. R² is exactly zero.
   B. R² is exactly one.
   C. Neither of the above.
3. In terms of the variances of the fitted model values, the residuals, and the response variable, R² is the:
   A. Variance of the residuals divided by the variance of the fitted.
   B. Variance of the response divided by the variance of the residuals.
   C. Variance of the fitted divided by the variance of the residuals.
   D. Variance of the fitted divided by the variance of the response.
   E. Variance of the response divided by the variance of the fitted.


Prob 9.11. Consider models with a form like this:

> lm( response ~ 1, data=whatever)

The R² of such a model will always be 0. Explain why.
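
A quick experiment may help in forming the explanation. This sketch uses the built-in mtcars data, with mpg standing in for the response (an arbitrary choice):

```r
mod0 <- lm(mpg ~ 1, data = mtcars)

coef(mod0)                        # the single coefficient
mean(mtcars$mpg)                  # compare with the mean of the response

unique(round(fitted(mod0), 10))   # distinct fitted values across all cases
var(fitted(mod0))                 # variance of the fitted values
summary(mod0)$r.squared           # R^2 reported for this model
```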

Prob 9.12. Consider the following models where a response variable A is modeled by explanatory variables B, C, and D.

1. A ~ B
2. A ~ B + C + B:C
3. A ~ B + C
4. A ~ B * C
5. A ~ B + D
6. A ~ B*C*D

Answer the following:

(a) Model 1 is nested in model 2. True or False
(b) Model 5 is nested in model 3. True or False
(c) Model 1 is nested in model 3. True or False
(d) Model 5 is nested in model 1. True or False
(e) Model 2 is nested in model 3. True or False
(f) Model 3 is nested in model 4. True or False
(g) All the other models are nested in model 6. True or False

Prob 9.13. Consider two models, Model 1 and Model 2, with the same response variable.

1. Model 1 is nested in Model 2 if the (choose one: variables / model terms) of Model 1 are a subset of those of Model 2.
2. True or False
   If Model 1 is nested in Model 2, then Model 1 cannot have a higher R² than Model 2.
3. Which of the following are nested in A ~ B*C + D?
   True or False   A ~ B
   True or False   A ~ B + D
   True or False   A ~ C
   True or False   A ~ B+C+D
   True or False   A ~ B*D + C
   True or False   A ~ D

Prob 9.21. Here is a set of models:

Model A: wage ~ 1

Model B: wage ~ age + sex

Model C: wage ~ 1 + age*sex

Model D: wage ~ educ

Model E: wage ~ educ + age - 1

Model F: wage ~ educ:age

Model G: wage ~ educ*age*sex

You may want to try fitting each of the models to the Current Population Survey data (cps.csv) to make sure you understand how the * shorthand for interaction and main effects expands to a complete set of terms. That way you can see exactly which coefficients are calculated for any of the models.
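
If cps.csv is not at hand, the same check can be sketched on a simulated stand-in. The data frame cps_sim and the helper coef_names() below are hypothetical: the column names follow the formulas above, but every value is invented, so only the names of the coefficients are meaningful.

```r
set.seed(7)
n <- 60
# Simulated stand-in for cps.csv; all values are invented
cps_sim <- data.frame(
  age  = sample(18:64, n, replace = TRUE),
  educ = sample(8:18, n, replace = TRUE),
  sex  = factor(sample(c("F", "M"), n, replace = TRUE))
)
cps_sim$wage <- 2 + 0.5 * cps_sim$educ + 0.05 * cps_sim$age + rnorm(n)

# Hypothetical helper: which coefficients does a model formula produce?
coef_names <- function(f) names(coef(lm(f, data = cps_sim)))

coef_names(wage ~ age + sex)        # Model B
coef_names(wage ~ 1 + age * sex)    # Model C
coef_names(wage ~ educ + age - 1)   # Model E
coef_names(wage ~ educ:age)         # Model F
```

With the real data, replacing cps_sim by the data frame read from cps.csv should list the coefficients of each model in the same way.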

Answer the following:

1. B is nested in A. True or False
2. D is nested in E. True or False
3. B is nested in C. True or False
4. All of the models A-F are nested in G. True or False
5. D is nested in F. True or False
6. At least one of the models A-G is nested in educ ~ age. True or False

Prob 9.22. A data set on US Congressional Districts (provided by Prof. Julie Dolan), congress.csv, contains information on the population of each congressional district in 2004. There are 436 districts listed, corresponding to the 435 voting members of the House of Representatives from the 50 states and an additional district for Washington, D.C., whose citizens have only a non-voting “representative.”

The US Supreme Court (Reynolds v. Sims, 377 US 533, 1964) ruled that state legislature districts had to be roughly equal in population: the one-person one-vote principle. Before this ruling, some states had grossly unequally sized districts. For example, one district in Connecticut for the state General Assembly had 191 people, while another district in the same state had 81,000. Los Angeles County had one representative in the California State Senate for a population of six million, while another county with only 14,000 residents also had one representative.

Of course, exact equality of district sizes is impossible in every district, since districts have geographically defined boundaries and the population can fluctuate within each boundary. The Supreme Court has written, “... mathematical nicety is not a constitutional requisite...” and “so long as the divergences from a strict population standard are based on legitimate considerations incident to the effectuation of a rational state policy, some deviations from the equal-population principle are constitutionally permissible ....” (Reynolds v. Sims)

The situation in the US House of Representatives is more complicated, since congressional districts are required to be entirely within a single state.

Let’s explore how close the districts for the US House of Representatives come to meeting the one-person one-vote principle.

One way to evaluate how far districts are from equality of population size is to examine the standard deviation across districts.

Another way to look at the spread is to try to account for the differences in populations by modeling them and looking at how much of the difference remains unexplained.

Let’s start with a very simple model that treats all the districts as the same: population ~ 1.

What is the meaning of the single coefficient from this model?

A. The mean district population across all states.
B. The mean district population across all districts.
C. The median population across all districts.
D. The median population across all states.
E. None of the above.


Calculate the standard deviation of the residuals. How does this compare to the standard deviation of the district population itself?

A. It’s much larger.
B. It’s somewhat larger.
C. It’s exactly the same.
D. It’s much smaller.
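
The steps for this first model can be sketched as follows. The data frame districts below is an invented stand-in for congress.csv (five hypothetical states, populations in thousands); only the variable names population and state are taken from the problem.

```r
set.seed(2004)
# Invented data: 5 hypothetical states with 1 to 8 districts each
districts <- data.frame(
  state = rep(c("S1", "S2", "S3", "S4", "S5"), times = c(1, 2, 4, 8, 3))
)
state_size <- c(S1 = 900, S2 = 620, S3 = 655, S4 = 640, S5 = 670)  # thousands
districts$population <- unname(state_size[districts$state]) +
  rnorm(nrow(districts), sd = 2)

mod1 <- lm(population ~ 1, data = districts)

coef(mod1)                    # the single coefficient
mean(districts$population)    # one candidate for what it measures
sd(resid(mod1))               # spread of the residuals
sd(districts$population)      # spread of the response itself
```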


Now model the district size by state: population ~ 1 + state.

What is the standard deviation of the residuals from this model?

A box plot of the residuals shows a peculiar pattern. What is it?

A. The residuals are all the same.
B. Every residual is an outlier.
C. The residuals are almost all very close to zero, except for a few outliers.


The variable state accounts for almost all of the variability from district to district. That is, districts within a state are almost exactly the same size, but that size differs from state to state. Why is there a state-to-state difference? The number of districts within a state must be a whole number: 1, 2, 3, and so on. Ideally, the district populations are the state population divided by the number of districts. The number of districts is set to make the district population as even as possible between states, but exact equality isn’t possible since the state populations differ. Notice that the largest and smallest districts (Montana and Wyoming, respectively) are in states with only a single district. Adding a second district to Montana would dramatically reduce the district size below the national mean. And even though Wyoming has a very low-population district, it’s impossible to take a district away since Wyoming only has one.
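
The comparison described above, between the spread of district populations and the spread left over once state is taken into account, can be sketched as follows. The data are an invented stand-in for congress.csv (five hypothetical states, populations in thousands), so the numbers are illustrative only.

```r
set.seed(2004)
# Invented data: 5 hypothetical states with 1 to 8 districts each
districts <- data.frame(
  state = rep(c("S1", "S2", "S3", "S4", "S5"), times = c(1, 2, 4, 8, 3))
)
state_size <- c(S1 = 900, S2 = 620, S3 = 655, S4 = 640, S5 = 670)  # thousands
districts$population <- unname(state_size[districts$state]) +
  rnorm(nrow(districts), sd = 2)

mod2 <- lm(population ~ 1 + state, data = districts)

sd(districts$population)   # district-to-district spread
sd(resid(mod2))            # spread remaining after accounting for state
summary(mod2)$r.squared    # fraction of the variance accounted for by state

# boxplot(resid(mod2)) would draw the box plot of the residuals
```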

Prob 9.23. Consider this rule of thumb:

In comparing two models based on the same data, the model with the larger R² is better than the model with the smaller R².

Explain what makes sense about this rule of thumb and also what issues it might be neglecting.
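
One issue the rule of thumb might neglect can be seen by experiment. In this sketch with simulated data, ten explanatory variables that are pure noise by construction are added to a model. All names and values are invented for illustration.

```r
set.seed(9)
n <- 30
dat <- data.frame(x = rnorm(n))
dat$y <- 2 * dat$x + rnorm(n)       # y depends on x only

# Ten columns of pure noise, unrelated to y by construction
junk <- as.data.frame(matrix(rnorm(n * 10), nrow = n))
names(junk) <- paste0("junk", 1:10)
dat <- cbind(dat, junk)

small <- lm(y ~ x, data = dat)
big   <- lm(y ~ ., data = dat)      # x plus all ten junk columns

r2_small <- summary(small)$r.squared
r2_big   <- summary(big)$r.squared
c(small = r2_small, big = r2_big)
```

Comparing the two values, and asking which model is more useful, is one way into the second half of the question.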