Reading Questions.

- How does R^{2} summarize the extent to which a model has captured variability?
- What does it mean for one model to be nested in another?
- How does the correlation coefficient differ from R^{2}?

Prob 9.01. The R^{2} statistic is the ratio of the variance of the fitted values to the variance of the response variable.

Using the kidsfeet.csv data:

1. Find the variance of the response variable in the model width ~ sex + length + sex:length.
   0.053 0.119 0.183 0.260 0.346
2. Find the variance of the fitted values from the model.
   0.053 0.119 0.183 0.260 0.346
3. Compute the R^{2} as the ratio of these two variances.
   0.20 0.29 0.46 0.53 0.75
4. Is this the same as the “Multiple R^{2}” given in the summary(mod) report?
   Yes No
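
The steps in this problem can be sketched as follows. This is a minimal illustration, not a solution: it uses R's built-in mtcars data as a stand-in so that it runs without kidsfeet.csv, and the variables mpg, wt, and am are assumptions chosen only to mimic a quantitative response with one quantitative and one categorical explanatory variable.

```r
# Stand-in data (assumption): mtcars ships with R.
# For the exercise itself, read kidsfeet.csv and use width, sex, and length.
dat <- mtcars
dat$am <- factor(dat$am)                   # categorical variable, playing the role of sex

mod <- lm(mpg ~ am + wt + am:wt, data = dat)

var_response <- var(dat$mpg)               # variance of the response variable
var_fitted   <- var(fitted(mod))           # variance of the fitted model values
r2_ratio     <- var_fitted / var_response  # R^2 computed as a ratio of variances

# Show the ratio next to the value that summary() reports, for comparison.
print(c(ratio = r2_ratio, summary_r2 = summary(mod)$r.squared))
```

Replacing the stand-in data and variable names with those from kidsfeet.csv gives the quantities asked for in parts 1 through 4.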

Prob 9.02. The variance of a response variable A is 145 and the variance of the residuals from the model A ~ 1+B is 45.

- What is the variance of the fitted model values?
  45 100 145 190 Cannot tell
- What is the R^{2} for this model?
  0 45/145 100/145 100/190 145/190 Cannot tell

Prob 9.04. For each of the following pairs of models, mark the statement that is most correct.

Part 1 ______________________________________________________________________________________________

- Model 1: A ~ B+C
- Model 2: A ~ B*C

A
| Model 1 is nested in Model 2. |

B
| Model 2 is nested in Model 1. |

C
| The two models are the same. |

D
| None of the above is true. |

Part 2 ______________________________________________________________________________________________

- Model 1: A ~ B
- Model 2: B ~ A

A
| Model 1 is nested in Model 2. |

B
| Model 2 is nested in Model 1. |

C
| The two models are the same. |

D
| None of the above is true. |

Part 3 ______________________________________________________________________________________________

- Model 1: A ~ B + C
- Model 2: B ~ A * C

A
| Model 1 is nested in Model 2. |

B
| Model 2 is nested in Model 1. |

C
| The two models are the same. |

D
| None of the above is true. |

Part 4 ______________________________________________________________________________________________

- Model 1: A ~ B + C + B:C
- Model 2: A ~ B * C

A
| Model 1 is nested in Model 2. |

B
| Model 2 is nested in Model 1. |

C
| The two models are the same. |

D
| None of the above is true. |

Prob 9.05. For each of the following pairs of models, mark the statement that is most correct.

Part 1 ______________________________________________________________________________________________

- Model 1: A ~ B+C
- Model 2: A ~ B*C

A
| Model 1 can have a higher R^{2} than Model 2. |

B
| Model 2 can have a higher R^{2} than Model 1. |

C
| The R^{2} values are the same. |

D
| None of the above are necessarily true. |

Part 2 ______________________________________________________________________________________________

- Model 1: A ~ B + C
- Model 2: B ~ A * C

A
| Model 1 can have a higher R^{2} than Model 2. |

B
| Model 2 can have a higher R^{2} than Model 1. |

C
| The R^{2} values are the same. |

D
| None of the above are necessarily true. |

Part 3 ______________________________________________________________________________________________

- Model 1: A ~ B + C + B:C
- Model 2: A ~ B * C

A
| Model 1 can have a higher R^{2} than Model 2. |

B
| Model 2 can have a higher R^{2} than Model 1. |

C
| The R^{2} values are the same. |

D
| None of the above are necessarily true. |

[From a suggestion by a student]

Going further _____________________________________________________________________________________

In answering this question, you might want to try out a few examples using real data: just pick two quantitative variables to stand in for A and B.

What will be the relationship between R^{2} for the following two models?

- Model 1: A ~ B
- Model 2: B ~ A

A
| Model 1 can have a higher R^{2} than Model 2. |

B
| Model 2 can have a higher R^{2} than Model 1. |

C
| The R^{2} values are the same. |

D
| None of the above are necessarily true. |
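
One way to try this out, as the problem suggests, is sketched below. The choice of data is an assumption made for illustration: mpg and wt from R's built-in mtcars data stand in for A and B, and any two quantitative variables could be used instead.

```r
# Stand-in variables (assumption): mpg plays the role of A, wt the role of B.
A <- mtcars$mpg
B <- mtcars$wt

r2_model1 <- summary(lm(A ~ B))$r.squared  # Model 1: A ~ B
r2_model2 <- summary(lm(B ~ A))$r.squared  # Model 2: B ~ A

# Print the two values side by side.
print(c(model1 = r2_model1, model2 = r2_model2))
```

Repeating the comparison with a few other pairs of variables helps show whether the pattern is specific to one data set.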

Prob 9.10. Which of the following statements is true about R^{2}?

1. True or False: R^{2} will never go down when you add an additional explanatory term to a model.
2. For a perfectly fitting model,

A | R^{2} is exactly zero. |

B | R^{2} is exactly one. |

C | Neither of the above. |

3. In terms of the variances of the fitted model points, the residuals, and the response variable, R^{2} is the:

A | Variance of the residuals divided by the variance of the fitted. |

B | Variance of the response divided by the variance of the residuals. |

C | Variance of the fitted divided by the variance of the residuals. |

D | Variance of the fitted divided by the variance of the response. |

E | Variance of the response divided by the variance of the fitted. |

Prob 9.11. Consider models with a form like this

> lm( response ~ 1, data=whatever)

The R^{2} of such a model will always be 0. Explain why.

Prob 9.12. Consider the following models where a response variable A is modeled by explanatory variables B, C, and D.

1 | A ~ B |

2 | A ~ B + C + B:C |

3 | A ~ B + C |

4 | A ~ B * C |

5 | A ~ B + D |

6 | A ~ B*C*D |

Answer the following:

- (a) Model 1 is nested in model 2. True or False
- (b) Model 5 is nested in model 3. True or False
- (c) Model 1 is nested in model 3. True or False
- (d) Model 5 is nested in model 1. True or False
- (e) Model 2 is nested in model 3. True or False
- (f) Model 3 is nested in model 4. True or False
- (g) All the other models are nested in model 6. True or False

Prob 9.13. Consider two models, Model 1 and Model 2, with the same response variable.

1. Model 1 is nested in Model 2 if the (variables / model terms) of Model 1 are a subset of those of Model 2.
2. True or False: If Model 1 is nested in Model 2, then Model 1 cannot have a higher R^{2} than Model 2.
3. Which of the following are nested in A ~ B*C + D?
   - True or False: A ~ B
   - True or False: A ~ B + D
   - True or False: B ~ C
   - True or False: A ~ B+C+D
   - True or False: A ~ B*D + C
   - True or False: A ~ D

Prob 9.21. Here is a set of models:

Model A: wage ~ 1

Model B: wage ~ age + sex

Model C: wage ~ 1 + age*sex

Model D: wage ~ educ

Model E: wage ~ educ + age - 1

Model F: wage ~ educ:age

Model G: wage ~ educ*age*sex

You may want to try fitting each of the models to the Current Population Survey data cps.csv to make sure you understand how the * shorthand for interaction and main effects expands to a complete set of terms. That way you can see exactly which coefficients are calculated for any of the models.
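
Even before fitting anything, the expansion of a formula into its terms can be inspected directly. The sketch below uses base R's terms() function; the helper name expand_terms is hypothetical and introduced only for this illustration. It does not need cps.csv, because it looks only at the formula and not at the data.

```r
# Hypothetical helper: list the model terms that a formula expands to.
expand_terms <- function(f) {
  tt <- terms(f)
  labels <- attr(tt, "term.labels")
  # Report the intercept explicitly as "1", since "- 1" in a formula removes it.
  if (attr(tt, "intercept") == 1) c("1", labels) else labels
}

print(expand_terms(wage ~ 1 + age*sex))    # Model C
print(expand_terms(wage ~ educ + age - 1)) # Model E
print(expand_terms(wage ~ educ*age*sex))   # Model G
```

Fitting the same formulas with lm() on the CPS data and looking at the coefficients shows how each categorical term turns into one or more coefficients.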

Answer the following:

1. B is nested in A. True or False
2. D is nested in E. True or False
3. B is nested in C. True or False
4. All of the models A-F are nested in G. True or False
5. D is nested in F. True or False
6. At least one of the models A-G is nested in educ ~ age. True or False

Prob 9.22. A data set on US Congressional Districts (provided by Prof. Julie Dolan), congress.csv, contains information on the population of each congressional district in 2004. There are 436 districts listed (corresponding to the 435 voting members of the House of Representatives from the 50 states and an additional district for Washington, D.C., whose citizens have only a non-voting “representative”).

The US Supreme Court (Reynolds v. Sims, 377 US 533, 1964) ruled that state legislature districts had to be roughly equal in population: the one-person one-vote principle. Before this ruling, some states had grossly unequally sized districts. For example, one district in Connecticut for the state General Assembly had 191 people, while another district in the same state had 81,000. Los Angeles County had one representative in the California State Senate for a population of six million, while another county with only 14,000 residents also had one representative.

Of course, exact equality of district sizes is impossible, since districts have geographically defined boundaries and the population can fluctuate within each boundary. The Supreme Court has written, “... mathematical nicety is not a constitutional requisite...” and “so long as the divergences from a strict population standard are based on legitimate considerations incident to the effectuation of a rational state policy, some deviations from the equal-population principle are constitutionally permissible ....” (Reynolds v. Sims)

The situation in the US House of Representatives is more complicated, since congressional districts are required to be entirely within a single state.

Let’s explore how close the districts for the US House of Representatives come to meeting the one-person one-vote principle.

One way to evaluate how far districts are from equality of population size is to examine the standard deviation across districts.

- What is the standard deviation of the district populations across the whole US?
4823 9468 28790 342183 540649

Another way to look at the spread is to try to account for the differences in populations by modeling them and looking at how much of the difference remains unexplained.

Let’s start with a very simple model that treats all the districts as the same: population ~ 1.

What is the meaning of the single coefficient from this model?

A
| The mean district population across all states. |

B
| The mean district population across all districts. |

C
| The median population across all districts. |

D
| The median population across all states. |

E
| None of the above. |

Calculate the standard deviation of the residuals. How does this compare to the standard deviation of the district population itself?

A
| It’s much larger. |

B
| It’s somewhat larger. |

C
| It’s exactly the same. |

D
| It’s much smaller. |

Now model the district size by state: population ~ 1 + state.

What is the standard deviation of the residuals from this model?

A box plot of the residuals shows a peculiar pattern. What is it?

A
| The residuals are all the same. |

B
| Every residual is an outlier. |

C
| The residuals are almost all very close to zero, except for a few outliers. |

The variable state accounts for almost all of the variability from district to district. That is, districts within a state are almost exactly the same size, but that size differs from state to state. Why is there a state-to-state difference? The number of districts within a state must be a whole number: 1, 2, 3, and so on. Ideally, the district populations are the state population divided by the number of districts. The number of districts is set to make the district population as even as possible between states, but exact equality isn’t possible since the state populations differ. Notice that the largest and smallest districts (Montana and Wyoming, respectively) are in states with only a single district. Adding a second district to Montana would dramatically reduce the district size below the national mean. And even though Wyoming has a very low-population district, it’s impossible to take a district away since Wyoming only has one.
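
The two models used in this problem can be sketched on a small made-up data set. Everything in the toy data frame below is hypothetical and is not taken from congress.csv; it only mimics the situation described above, with districts nearly equal in size within a state but differing between states.

```r
# Hypothetical toy data, NOT the congress.csv values.
toy <- data.frame(
  state      = factor(rep(c("S1", "S2", "S3"), times = c(4, 3, 1))),
  population = c(650100, 649900, 650050, 649950,  # S1: four districts
                 610020, 609980, 610000,          # S2: three districts
                 900000)                          # S3: a single district
)

mod_all   <- lm(population ~ 1, data = toy)          # all districts treated alike
mod_state <- lm(population ~ 1 + state, data = toy)  # a separate mean for each state

sd_pop   <- sd(toy$population)    # spread of the district populations
sd_all   <- sd(resid(mod_all))    # spread left over after population ~ 1
sd_state <- sd(resid(mod_state))  # spread left over after population ~ 1 + state

print(c(sd_pop = sd_pop, sd_all = sd_all, sd_state = sd_state))
```

With the real data, the same three quantities come from replacing toy with the data frame read from congress.csv, assuming its columns are named population and state as in the model formulas above.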

Prob 9.23. Consider this rule of thumb:

In comparing two models based on the same data, the model with
the larger R^{2} is better than the model with the smaller R^{2}.

Explain what makes sense about this rule of thumb and also what issues it might be neglecting.