Chapter 6 Problems      Statistical Modeling: A Fresh Approach (2/e)

• What is an “explanatory variable” and how does it differ from a “response variable”?
• What is the difference between a “model term” and a “variable”?
• Why are there sometimes multiple model terms in a model?
• What is an interaction term and how does it differ from a variable?
• In graphs of the model response value versus an explanatory variable, quantitative explanatory variables are associated with slopes and categorical explanatory variables are associated with step-like differences. Explain why.
• How can models that fail to represent the actual causal connections (if any) between variables still be useful? Give an example.

Prob 6.01. In McCleskey vs Georgia, lawyers presented data showing that, for convicted murderers, a death sentence was more likely if the victim was white than if the victim was black. For each case, they tabulated the race of the victim and the sentence (death or life in prison). Which of the following best describes the variables in their models?

 A Response is quantitative; explanatory variable is quantitative.
 B Response is quantitative; explanatory variable is categorical.
 C Response is categorical; explanatory variable is quantitative.
 D Response is categorical; explanatory variable is categorical.
 E There is no explanatory variable.

[Note: Based on an example from George Cobb.]

Prob 6.02. In studies of employment discrimination, several attributes of employees are often relevant:

age, sex, race, years of experience, salary, whether promoted, whether laid off

For each of the following questions, indicate which is the response variable and which is the explanatory variable.

1. Are men paid more than women?

Response Variable:

age  sex  race  years.experience  salary  promoted  laid.off

Explanatory Variable:

age  sex  race  years.experience  salary  promoted  laid.off

2. On average, how much extra salary is a year of experience worth?

Response Variable:

age  sex  race  years.experience  salary  promoted  laid.off

Explanatory Variable:

age  sex  race  years.experience  salary  promoted  laid.off

3. Are whites more likely than blacks to be promoted?

Response Variable:

age  sex  race  years.experience  salary  promoted  laid.off

Explanatory Variable:

age  sex  race  years.experience  salary  promoted  laid.off

4. Are older employees more likely to be laid off than younger ones?

Response Variable:

age  sex  race  years.experience  salary  promoted  laid.off

Explanatory Variable:

age  sex  race  years.experience  salary  promoted  laid.off

[Note: Thanks to George Cobb.]

Prob 6.04. The drawings show some data involving three variables:

• D — a quantitative variable
• A — a quantitative variable
• G — a categorical variable with two levels: S & K

On each of the plots, sketch a graph of the fitted model function of the indicated structure.

Draw these models:

• D ~ A+G
• D ~ A*G
• D ~ A-1
• D ~ 1
• D ~ A
• D ~ poly(A,2)

Only a qualitative sketch is needed. It will be good enough to draw out the graph on a piece of paper, roughly approximating the patterns of S and K seen in the graph. Then draw the model values right on your paper. (You can’t hand this in with AcroScore.)

Example: D ~ G
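As an aid to sketching, it can help to see which model terms each structure contains. The following is a minimal sketch that uses simulated stand-in data, not the data in the drawings; the coefficient values used to generate D are arbitrary assumptions. It fits each structure with `lm()` and lists the names of the fitted coefficients.

```r
# Simulated stand-in data (an assumption; the problem's data are not used here).
set.seed(1)
n <- 40
A <- runif(n, 0, 10)                                   # quantitative
G <- factor(sample(c("S", "K"), n, replace = TRUE),
            levels = c("S", "K"))                      # categorical, two levels
D <- 2 + 0.5 * A + 3 * (G == "K") + rnorm(n)           # quantitative response
dat <- data.frame(D, A, G)

# One formula per model structure named in the problem.
formulas <- list(
  D ~ A + G,
  D ~ A * G,
  D ~ A - 1,
  D ~ 1,
  D ~ A,
  D ~ poly(A, 2)
)

# For each structure, list the names of the coefficients that lm() fits.
terms_used <- lapply(formulas, function(f) names(coef(lm(f, data = dat))))
names(terms_used) <- sapply(formulas,
                            function(f) paste(deparse(f), collapse = ""))
print(terms_used)
```

Each coefficient corresponds to one feature of the sketch: an intercept sets a level, a coefficient on A sets a slope, a coefficient for a level of G sets a vertical offset between groups, and an A:G coefficient lets the slope differ between groups.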

• speed of a bicyclist.
• steepness of the road, a quantitative variable measured by the grade (rise over run). 0 means flat, + means uphill, - means downhill.
• fitness of the rider, a categorical variable with three levels: unfit, average, athletic.

On a piece of paper, sketch out a graph of speed versus steepness for reasonable models of each of these forms:

1. Model 1: speed ~ 1 + steepness
2. Model 2: speed ~ 1 + fitness
3. Model 3: speed ~ 1 + steepness + fitness
4. Model 4: speed ~ 1 + steepness + fitness + steepness:fitness
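To check a sketch of Models 3 and 4, one option is to fit them to made-up data. The sketch below assumes hypothetical flat-road speeds and hypothetical slopes for each fitness level; none of these numbers comes from real cycling data. It shows that Model 3 shares one slope among all riders, while the interaction term in Model 4 gives each fitness level its own slope.

```r
# Made-up cycling data: every number here is an assumption for illustration.
set.seed(2)
n_per <- 30
levels_fit <- c("unfit", "average", "athletic")
fitness <- factor(rep(levels_fit, each = n_per), levels = levels_fit)
steepness <- runif(3 * n_per, -0.08, 0.08)             # grade: rise over run
flat_speed  <- c(15, 20, 25)[as.integer(fitness)]      # assumed speed on flat road
grade_slope <- c(-120, -90, -60)[as.integer(fitness)]  # assumed change per unit grade
speed <- flat_speed + grade_slope * steepness + rnorm(3 * n_per, sd = 0.5)
bike <- data.frame(speed, steepness, fitness)

m3 <- lm(speed ~ 1 + steepness + fitness, data = bike)
m4 <- lm(speed ~ 1 + steepness + fitness + steepness:fitness, data = bike)

# Model 3: one shared slope, plus an offset for each non-reference fitness level.
print(coef(m3))

# Model 4: slope for a level = reference slope + that level's interaction coefficient.
b <- coef(m4)
slopes4 <- c(
  unfit    = unname(b["steepness"]),
  average  = unname(b["steepness"] + b["steepness:fitnessaverage"]),
  athletic = unname(b["steepness"] + b["steepness:fitnessathletic"])
)
print(slopes4)
```

In a graph of speed versus steepness, Model 3 therefore appears as three parallel lines and Model 4 as three lines that need not be parallel.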

Prob 6.10. The graphic (from the New York Times, April 17, 2008) shows the fitted values from a model of the survival of babies born extremely prematurely.

Caption: “A new study finds that doctors could better estimate an extremely premature baby’s chance of survival by considering factors including birth weight, length of gestation, sex and whether the mother was given steroids to help develop the baby’s lungs.”

Two different response variables are plotted: (1) the probability of survival and (2) the probability of survival without moderate to severe disabilities. Remarkably for a statistical graphic, there are three explanatory variables:

1. Birth weight (measured in pounds (lb) in the graphic).
2. The sex of the baby.
3. Whether the mother took steroids intended to help the fetus’s lungs develop.

Focus on the survival rates without disabilities — the darker bars in the graphic.

(a) Estimate the effect of giving steroids. That is, how much extra survival probability is associated with giving steroids?

 A No extra survival probability with steroids.
 B About 1 to 5 percentage points.
 C About 10 to 15 percentage points.
 D About 50 percentage points.
 E About 75 percentage points.

(b) For the babies where the mother was given steroids, how does the survival probability depend on the birth weight of the baby?

 A No dependence.
 B Increases by about 25 percentage points.
 C Increases by about 50 percentage points.
 D Increases by about 25 percentage points per pound.
 E Increases by about 50 percentage points per pound.

(c) For the babies where the mother was given steroids, how does the survival probability depend on the sex of the baby?

 A No dependence.
 B Higher for girls by about 15 percentage points.
 C Higher for boys by about 20 percentage points.
 D Higher for girls by about 40 percentage points.
 E Higher for boys by about 40 percentage points.

(d) How would you look for an interaction between birth weight and baby’s sex in accounting for survival?

 A Compare survival of males to females at a given weight.
 B Compare survival of males across different weights.
 C Compare survival of females across different weights.
 D Compare the difference in survival between males and females across different weights.

Do you see signs of a substantial interaction between birth weight and sex in accounting for survival? (Take substantial to mean “greater than 10 percentage points.”)

Yes  No

(e) How would you look for a substantial interaction between steroid use and baby’s sex in accounting for survival?

 A Compare survival of males to females when the mother was given steroids.
 B Compare survival of males between steroid given and steroid not given.
 C Compare survival of females between steroid given and steroid not given.
 D Compare the difference in survival between males and females between steroid given and steroid not given.

Do you see signs of a substantial interaction between steroid use and sex in accounting for survival? (Take substantial to mean “greater than 10 percentage points.”)

Yes  No

Prob 6.11. The graphic in the Figure is part of a report describing a standardized test for college graduates, the Collegiate Learning Assessment (CLA). The test consists of several essay questions which probe students’ critical thinking skills.

Although individual students take the test and receive a score, the purpose of the test is not to evaluate the students individually. Instead, the test is intended to evaluate the effect that the institution has on its students as indicated by the difference in test scores between 1st- and 4th-year students (freshmen and seniors). The cases in the graph are institutions, not individual students.

Council for Aid to Education, “Collegiate Learning Assessment: Draft Institutional Report, 2005-6” http://www.cae.org

There are three variables involved in the graphic:

• cla: the CLA test score (averaged over each institution), shown on the vertical axis.
• sat: the SAT test score of entering students (averaged over each institution), shown on the horizontal axis.
• class: whether the CLA test was taken by freshmen or seniors. (In the graph: blue for freshmen, red for seniors.)

What model is being depicted by the straight lines in the graph? Give your answer in the standard modeling notation (e.g., A ~ B+C) using the variable names above. Make sure to indicate what interaction term, if any, has been included in the model, and explain how you can tell whether the interaction is or is not there.

Prob 6.12. Time Magazine reported the results of a poll of people’s opinions about the U.S. economy in July 2008. The results are summarized in the graph.

[Source: Time, July 28, 2008, p. 41]

The variables depicted in the graph are:

• Pessimism, as indicated by agreeing with the statement that the U.S. was a better place to live in the 1990s and will continue to decline.
• Ethnicity, with three levels: White, African American, Hispanic.
• Income, with five levels.
• Age, with four levels.

Judging from the information in the graph, which of these statements best describes the model pessimism ~ income?

 A Pessimism declines as incomes get higher.
 B Pessimism increases as incomes get higher.
 C Pessimism is unrelated to income.

Again, judging from the information in the graph, which of these statements best describes the model pessimism ~ age?

 A Pessimism is highest in the 18-29 age group.
 B Pessimism is highest in the 64 and older group.
 C Pessimism is lowest among whites.
 D Pessimism is unrelated to age.

Poll results such as this are almost always reported using just one explanatory variable at a time, as in this graphic. However, it can be more informative to know the effect of one variable while adjusting for other variables. For example, in looking at the connection between pessimism and age, it would be useful to be able to untangle the influence of income.
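The idea of adjusting for another variable can be illustrated with a small simulation. The sketch below is not based on the Time poll. It invents a population (two age groups and two income groups, with made-up probabilities) in which pessimism depends only on income, while income is related to age. The one-variable model shows an apparent age effect; including income as a second explanatory variable largely removes it.

```r
# Invented population; all probabilities below are assumptions for illustration.
set.seed(3)
n <- 20000
age <- factor(sample(c("younger", "older"), n, replace = TRUE),
              levels = c("younger", "older"))
# Older respondents are assumed more likely to have high income.
p_high <- ifelse(age == "older", 0.7, 0.3)
income <- factor(ifelse(runif(n) < p_high, "high", "low"),
                 levels = c("low", "high"))
# Pessimism (1 = agrees with the statement) depends on income only.
p_pess <- ifelse(income == "high", 0.3, 0.6)
pessimism <- as.numeric(runif(n) < p_pess)
poll <- data.frame(pessimism, age, income)

one_var  <- lm(pessimism ~ age, data = poll)           # one explanatory variable
adjusted <- lm(pessimism ~ age + income, data = poll)  # adjusting for income

age_effect_one_var  <- unname(coef(one_var)["ageolder"])
age_effect_adjusted <- unname(coef(adjusted)["ageolder"])
print(c(one_var = age_effect_one_var, adjusted = age_effect_adjusted))
```

Here the response is coded 0/1 and modeled with `lm()` for simplicity, so coefficients read as differences in the proportion who are pessimistic. That choice is a simplification made for this sketch.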

Prob 6.13. Here is a display constructed using the Current Population Survey wage data:

Which of the following commands will make this? Each of the possibilities is a working command, so try them out and see which one makes the matching plot. Before you start, make sure to read in the data with

> cps = fetchData("cps.csv")

 A bwplot(wage~sex,groups=sector,data=cps)
 B bwplot(wage~sex|sector,data=cps)
 C bwplot(wage~cross(sex,sector),data=cps)

Prob 6.20. It’s possible to have interaction terms that involve more than two variables. For instance, one model of the swimming record data displayed in Chapter 4 was time ~ year*sex. This model design includes an interaction term between year and sex. This interaction term produces fitted model values that fall on lines with two different slopes, one for men and one for women. Now consider a third possible term to add to the model, the transform term that is “yes” when the year is larger than 1948 and “no” when the year is 1948 or earlier. Call this variable “post-war,” since World War II ended in 1945 and the Olympic games resumed in 1948. This can be interpreted to represent the systematic changes that occurred after the war.

Here is the model of the swimming record data that includes an intercept term, main terms for year, sex, and post-war, and interaction terms among all of those: a three-way interaction.

A two-way interaction term between sex and year allows there to be differently sloping lines for men and women. A three-way interaction term among sex, year, and post-war allows even more flexibility; the difference between slopes for men and women can be different before and after the war. You can see this from the graph. Before 1948, men’s and women’s slopes are very different. After the war the slopes are almost the same.

Explain how this graph gives support for the following interpretation: Before the war, women’s participation in competitive sports was rapidly increasing. As more women became involved in swimming, records were rapidly beaten. After the war, both women and men had high levels of participation and so new records were the result of better methods of training. Those methods apply equally to men and women and so records are improving at about the same rate for both sexes.
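The role of the three-way interaction can be seen in a small simulation. The sketch below does not use the swimming record data from Chapter 4; the times, starting values, and slopes are invented, and the variable name `postwar` is a stand-in for the "post-war" term described above. It fits time ~ year*sex*postwar and compares the difference between women's and men's slopes before and after 1948.

```r
# Invented record times; every number here is an assumption for illustration.
set.seed(4)
sim <- expand.grid(year = seq(1905, 2000, by = 5), sex = c("F", "M"))
sim$sex <- factor(sim$sex, levels = c("F", "M"))
sim$postwar <- factor(ifelse(sim$year > 1948, "yes", "no"),
                      levels = c("no", "yes"))
start   <- ifelse(sim$sex == "F", 95, 70)      # assumed 1905 record, in seconds
pre_slp <- ifelse(sim$sex == "F", -0.6, -0.3)  # assumed pre-war slopes differ by sex
sim$time <- start +
  pre_slp * (pmin(sim$year, 1948) - 1905) +    # pre-war improvement
  -0.15 * pmax(sim$year - 1948, 0) +           # same post-war slope for both sexes
  rnorm(nrow(sim), sd = 0.5)

mod <- lm(time ~ year * sex * postwar, data = sim)

# Fitted slope (seconds per year) for one sex in one period,
# taken as the change in model value over a one-year step.
slope_for <- function(sex, postwar) {
  new <- data.frame(year = c(0, 1),
                    sex = factor(sex, levels = c("F", "M")),
                    postwar = factor(postwar, levels = c("no", "yes")))
  unname(diff(predict(mod, newdata = new)))
}

gap_pre  <- slope_for("F", "no")  - slope_for("M", "no")
gap_post <- slope_for("F", "yes") - slope_for("M", "yes")
print(c(gap_pre = gap_pre, gap_post = gap_post))

# The three-way coefficient is exactly the change in that slope gap.
three_way <- unname(coef(mod)["year:sexM:postwaryes"])
print(three_way)
```

Without the three-way term, the model would force the gap between the women's and men's slopes to be the same in both periods.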

Prob 6.21. Consider the following situation. In order to encourage schools to perform well, a school district hires an external evaluator to give a rating to each school in the district. The rating is a single number based on the quality of the teachers, absenteeism among the teachers, the amount and quality of homework that the teachers assign, and so on.

To reward those schools that do well, the district gives a moderate salary bonus to each teacher and a fairly large budget increase to the school itself.

The next year, the school district publishes data showing that the students in schools that received the budget increases had done much better on standardized tests than the students in schools that hadn’t gotten the increases. The school district argues that this means that increasing budgets overall will improve performance on standardized tests.

The teachers’ union supports the idea of higher budgets, but objects to the rating system, saying that it is meaningless and that teacher pay should not be linked to it. The Taxpayers League argues that there is no real evidence that higher spending produces better results. They interpret the school district’s data as indicating only that the higher ranked schools are better and, of course, better schools give better results. Those schools were better before they won the ratings-based budget increase.

This is a serious problem. Because of the way the school district collected its data, being a high-rated school is confounded with getting a higher budget.

A modeling technique for dealing with situations like this is called threshold regression. Threshold regression models student test scores at each school as a function of the school rating, but includes another variable that indicates whether the school got a budget increase. The budget increase variable is almost the same thing as the school rating: because of the way the school district awarded the increases, it is a threshold transformation of the school rating.

The graph shows some data (plotted as circles) from a simulation of this situation in which the budget increase had a genuine impact of 50 points on the standardized test. The solid line shows the model of test score as a function of school rating, with only the main effect. This model corresponds to the claim that the threshold has no effect. The solid dots are the model values from another model, with rating as a main effect and a threshold transformation of rating that corresponds to which schools got the budget increase.

Explain how to interpret the models as indicating the effect of the budget increase. In addition to your explanation, make sure to give a numerical estimate of how big the effect is, according to the model as indicated in the graph.

An important statistical question is whether the data provide good support for the claim that the threshold makes a difference. (Techniques for answering this question are discussed later in the book.) The answer depends both on the size of the effect and on how much data is used for constructing the model. For the simulation here, it turns out that the threshold model has successfully detected the effect of the budget increase.
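A simulation along the lines described in this problem can be sketched as follows. This is not the simulation behind the graph. The rating scale, the cutoff of 60, the baseline relationship between rating and score, and the noise level are all assumptions; only the 50-point budget effect is taken from the problem statement.

```r
# Hypothetical simulation; only the 50-point effect comes from the problem text.
set.seed(5)
n <- 400
rating <- runif(n, 0, 100)             # assumed rating scale
cutoff <- 60                           # assumed threshold for the budget increase
got_budget <- rating > cutoff          # threshold transformation of rating
score <- 400 + 2 * rating +            # assumed baseline dependence on rating
  50 * got_budget +                    # genuine effect of the budget increase
  rnorm(n, sd = 15)
schools <- data.frame(score, rating, got_budget)

# Main effect only: corresponds to the claim that the threshold has no effect.
main_only <- lm(score ~ rating, data = schools)

# Threshold model: rating as a main effect plus the threshold term.
threshold_mod <- lm(score ~ rating + got_budget, data = schools)

# The coefficient on the threshold term estimates the jump at the cutoff.
budget_effect <- unname(coef(threshold_mod)["got_budgetTRUE"])
print(budget_effect)
```

Because rating is in the model, the threshold coefficient measures how much higher scores are just above the cutoff than the rating trend alone would predict, which is how the model separates the budget increase from school quality.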