Chapter 10 Problems      Statistical Modeling: A Fresh Approach (2/e)

• What is a covariate? Why use a special word for it when it is just a variable?
• What is the difference between a partial change and a total change?
• In the experimental method, how are covariates dealt with?
• What is the modeling approach to dealing with covariates?

Prob 10.01. Consider the data set on kids’ feet in kidsfeet.csv.

First, you’re going to look at a total change. Build a model of foot width as a function of foot length: width ~ length. Fit this model to the kids’ feet data.

According to this model, how much does the typical width change when the foot length is increased from 22 to 27 cm?

0.187  0.362  0.744  0.953  1.060  1.105  1.240  1.487

This is a total change, because it doesn’t hold any other variable constant, e.g. sex. That might sound silly, since obviously a kid’s sex doesn’t change as his or her foot grows. But the model doesn’t know that. It happens that most of the kids with foot lengths near 22 cm are girls, and most of the kids with foot lengths near 27 cm are boys. So when you compare feet with lengths of 22 and 27, you are effectively changing the sex at the same time as you change the foot length.

To look at a partial change, holding sex constant, you need to include sex in the model. A simple way to do this is width ~ length + sex. Using this model fitted to the kids’ feet data, how much does a typical foot width change if the foot length is increased from 22 to 27 cm?

0.187  0.362  0.744  0.953  1.060  1.142  1.163  1.240  1.487

You can also build more detailed models, for example a model that includes an interaction term: width ~ length * sex. Using this model fitted to the kids’ feet data, how much will a typical girl’s foot width change if the foot length is increased from 22 to 27 cm?

0.187  0.362  0.744  0.953  1.060  1.105  1.142  1.240  1.487
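The distinction between a total change and a partial change in this problem can be sketched on simulated data. The block below is only an illustration under stated assumptions, not the kidsfeet analysis: the data frame `toy`, its coefficients, and the size of the sex difference in foot length are all made up, chosen so that sex and length are correlated in the way the text describes.

```r
# Simulated stand-in for kidsfeet.csv; every number here is an assumption.
set.seed(1)
n <- 200
sex <- factor(sample(c("F", "M"), n, replace = TRUE))
# Boys' feet are made longer on average, so length and sex are correlated.
len <- rnorm(n, mean = ifelse(sex == "M", 25.5, 23.5), sd = 1)
# Width depends on length and, separately, on sex.
wid <- 3 + 0.2 * len + 0.5 * (sex == "M") + rnorm(n, sd = 0.3)
toy <- data.frame(width = wid, length = len, sex = sex)

total_mod   <- lm(width ~ length, data = toy)        # sex is free to vary
partial_mod <- lm(width ~ length + sex, data = toy)  # sex is held constant

# Change in model width when length goes from 22 to 27 cm.
total_change   <- unname(coef(total_mod)["length"]) * (27 - 22)
partial_change <- unname(coef(partial_mod)["length"]) * (27 - 22)
c(total = total_change, partial = partial_change)
```

In this simulation the total change should come out larger than the partial change, because moving from short feet to long feet also shifts the mix of kids from girls toward boys.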

Prob 10.02. In each of the following, a situation is described and a question is asked that is to be answered by modeling. Several variables are listed. Imagine an appropriate model and identify each variable as either the response variable, an explanatory variable, a covariate, or a variable to be ignored.

EXAMPLE: Some people have claimed that police foot patrols are more effective at reducing the crime rate than patrols done in automobiles. Data from several different cities is available; each city has its own fraction of patrols done by foot, its own crime rate, etc. The mayor of your town has asked for your advice on whether it would be worthwhile to shift to more foot patrols in order to reduce crime. She asks, “Is there evidence that a larger fraction of foot patrols reduces the crime rate?”

Variables:

1. Crime rate (e.g., robberies per 100,000 population)
2. Fraction of foot patrols
3. Number of policemen per 1000 population
4. Demographics (e.g., poverty rate)

The question focuses on how the fraction of foot patrols might influence crime rate, so crime rate is the response variable and fraction of foot patrols is an explanatory variable.

But the crime rate might also depend on the overall level of policing (as indicated by the number of policemen), or on the social conditions that are associated with crime (e.g., demographics). Since the mayor has no power to change the demographics of your town, and probably little power to change the overall number of policemen, in modeling the data from the different cities, you would want to hold constant the number of policemen and the demographics. You can do this by treating number of policemen and demographics as covariates and including them in your model.
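In the modeling language, the foot-patrol example might be written as in the sketch below. The variable names are hypothetical (no data set accompanies this example), so the block only builds the formula and inspects its parts rather than fitting anything.

```r
# Hypothetical variable names; the example comes with no data set.
patrol_formula <- crime_rate ~ foot_fraction + police_per_1000 + poverty_rate

# The response is on the left of the tilde; the explanatory variable of
# interest and the covariates all appear as terms on the right.
response_var <- all.vars(patrol_formula)[1]
model_terms  <- attr(terms(patrol_formula), "term.labels")
response_var
model_terms
```

The formula itself does not distinguish the explanatory variable of interest from the covariates. That distinction lies in how the fitted coefficients are interpreted.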

Fifteen years ago, your state legislature raised the legal drinking age from 18 to 21 years. An important motivation was to reduce the number of car accident deaths due to drunk or impaired drivers. Now, some people are arguing that the 21-year age limit encourages binge drinking among 18 to 20 year olds and that such binge drinking actually increases car accident deaths. But the evidence is that the number of car accident deaths has gone down since the 21-year age restriction was introduced. You are asked to examine the issue: Does the reduction in the number of car-accident deaths per year point to the effectiveness of the 21-year drinking age?

Variables:

1. Drinking age limit. Levels: 18 or 21.

response  explanatory  covariate  ignore

2. Number of car-accident deaths per year.

response  explanatory  covariate  ignore

3. Prevalence of seat-belt use.

response  explanatory  covariate  ignore

4. Fraction of cars with air bags.

response  explanatory  covariate  ignore

5. Number of car accidents (with or without death).

response  explanatory  covariate  ignore

Rating Surgeons

Your state government wants to guide citizens in choosing physicians. As part of this effort, they are going to rank all the surgeons in your state. You have been asked to build the rating system and you have a set of variables available for your use. These variables have been measured for each of the 342,861 people who underwent surgery in your state last year: one person being treated by one doctor. How should you construct a rating system that will help citizens to choose the most effective surgeon for their own treatment?

Variables:

• Outcome score. A high score means that the operation did what it was supposed to. A low score reflects failure, e.g. death. (Death is a very bad outcome, post-operative infection a somewhat bad outcome.)

response  explanatory  covariate  ignore
• Surgeon. One level for each of the operating surgeons.

response  explanatory  covariate  ignore
• Experience of the surgeon.

response  explanatory  covariate  ignore
• Difficulty of the case.

response  explanatory  covariate  ignore

School testing

Last year, your school district hired a new superintendent to “shake things up.” He did so, introducing several controversial new policies. At the end of the year, test scores were higher than the year before. A representative of the teachers’ union has asked you to examine the score data and answer this question: Is there reason to think that the higher scores were the result of the superintendent’s new policies?

Variables:

1. Superintendent (levels: New or Former superintendent)

response  explanatory  covariate  ignore

2. Exam difficulty

response  explanatory  covariate  ignore

3. Test scores

response  explanatory  covariate  ignore

Gravity

In a bizarre twist of time, you find yourself as Galileo’s research assistant in Pisa in 1605. Galileo is studying gravity: Does gravity accelerate all materials in the same way, whether they be made of metal, wood, stone, etc.? Galileo hired you as his assistant because you have brought with you, from the 21st century, a stop-watch with which to measure time intervals, a computer, and your skill in statistical modeling. All of these seem miraculous to him.

He drops objects off the top of the Leaning Tower of Pisa and you measure the following:

Variables:

1. The size of the object (measured by its diameter).

response  explanatory  covariate  ignore

2. Time of fall of the object.

response  explanatory  covariate  ignore

3. The material from which the object is made (brass, lead, wood, stone).

response  explanatory  covariate  ignore

[Thanks to James Heyman.]

Prob 10.04. Economists measure the inflation rate as a percent change in price per year. Unemployment is measured as the fraction (percentage) of those who want to work who are seeking jobs.

According to economists, in the short run — say, from one year to another — there is a relationship between inflation and unemployment: all other things being equal, as unemployment goes up, inflation should go down. (The relationship is called the “Phillips curve,” but you don’t need to know that or anything technical about economics to do this question.)

If this is true, in the model Inflation ~ Unemployment, what should be the sign of the coefficient on Unemployment?

positive  zero  negative

But despite the short-term relationship, economists claim that in the long run (over decades) unemployment and inflation should be unrelated.

If this is true, in the model Inflation ~ Unemployment, what should be the sign of the coefficient on Unemployment?

positive  zero  negative

The point of this exercise is to figure out how to arrange a model so that you can study the short-term behavior of the relationship, or so that you can study the long-term relationship.

For your reference, here is a graph showing a scatter plot of inflation and unemployment rates over about 30 years in the US. Each point shows the inflation and unemployment rates during one quarter of a year. The plotting symbol indicates which of three decade-long periods the point falls into.

The relationship between inflation and unemployment seems to be different from one decade to another — that’s the short term.

Which decade seems to violate the economists’ Phillips Curve short-term relationship?

A  B  C  none  all

Using the modeling language, express these different possible relationships between the variables Inflation, Unemployment, and Decade, where the variable Decade is a categorical variable with the three different levels shown in the legend for the graph.

1. Inflation depends on Unemployment in a way that doesn’t change over time.

2. Inflation changes with the decade, but doesn’t depend on Unemployment.

3. Inflation depends on Unemployment in the same way every decade, but each decade introduces a new background inflation rate independent of Unemployment.

4. Inflation depends on Unemployment in a way that differs from decade to decade.

Whether a model examines the short-term or the long-term behavior is analogous to whether a partial change or a total change is being considered.

Suppose you wanted to study the long-term relationship between inflation and unemployment. Which of these is appropriate?

A. Hold Decade constant. (Partial change)
B. Let Decade vary as it will. (Total change)

Now suppose you want to study the short-term relationship. Which of these is appropriate?

A. Hold Decade constant. (Partial change)
B. Let Decade vary as it will. (Total change)

Prob 10.05. Consider two models that you are to fit to a single data set involving three variables: A, B, and C.

Model 1: A ~ B
Model 2: A ~ B + C

(a) When should you say that Simpson’s Paradox is occurring?

A. When Model 2 has a lower R² than Model 1.
B. When Model 1 has a lower R² than Model 2.
C. When the coef. on B in Model 2 has the opposite sign to the coef. on B in Model 1.
D. When the coef. on C in Model 2 has the opposite sign to the coef. on B in Model 1.

(b) True or False: If B is uncorrelated with A, then the coefficient on B in the model A ~ B must be zero.

True or False

(c) True or False: If B is uncorrelated with A, then the coefficient on B in a model A ~ B + C must be zero.

True or False

(d) True or False: Simpson’s Paradox can occur if B is uncorrelated with C.

True or False

Based on a suggestion by student Atang Gilika.

Prob 10.10. Standard & Poor’s is a rating agency that provides information about various financial instruments such as stocks and bonds. The S&P 500 Stock Index, for instance, provides a summary of the value of stocks.

Bonds issued by governments, corporations, and other entities are rated using letters. As described on the Standard & Poor’s website, the ratings levels are AAA, AA+, AA, AA-, A+, A, A-, BBB+, BBB, BBB-, BB+, BB, BB-, B+, B, B-, CCC+, CCC, CCC-, CC, C, and D. The AAA rating is the best. (“The obligor’s capacity to meet its financial commitment on the obligation is extremely strong.”) D is the worst. (“The ‘D’ rating category is used when payments on an obligation are not made on the date due ....”)

• The bond ratings are a categorical variable.
True or False
• The bond ratings are an ordinal variable.
True or False

Bonds are a kind of debt; they pay interest and the principal is paid back at the end of a maturity period. The people and institutions who invest in bonds are willing to accept somewhat lower interest payments in exchange for greater security. Thus, AAA-rated bonds tend to pay the lowest interest rates and worse-rated bonds pay more. A report on interest rates on bonds (www.fmsbonds.com, for 8/21/2008) listed interest rates on municipal bonds:

Issue      Maturity   Rate

AAA Rated
National   10 Year    3.75
National   20 Year    4.60
National   30 Year    4.75
Florida    30 Year    4.70

AA Rated
National   10 Year    3.90
National   20 Year    4.70
National   30 Year    4.85
Florida    30 Year    4.80

A Rated
National   10 Year    4.20
National   20 Year    5.05
National   30 Year    5.20
Florida    30 Year    5.15

How many explanatory variables are given in this table to account for the interest rate?

A. Two: Issue and Maturity
B. Three: Issue, Maturity, and S&P Rating
C. Four: Issue, Maturity, S&P Rating, and Interest Rate

Judging from the table, and holding all other explanatory variables constant, what is the change in interest rate associated with a change from AAA to AA rating?

0.05  0.15  0.25  0.30  0.40

Again, holding all other explanatory variables constant, what is the change in interest rate for a 10-year compared to a 20-year maturity bond? (Pick the closest answer.)

0.15  0.50  0.85  1.20  1.45

Sometimes it is unclear when a variable should be considered quantitative and when it should be taken as categorical. For example, the maturity variable looks on the surface to be quantitative (10-year, 20-year, 30-year, etc.). What is it about these data that suggests that it would be unrealistic to treat maturity as a quantitative variable in a model of interest rate?

Prob 10.11. A study on drug D indicates that patients who were given the drug were less likely to recover from their condition C. Here is a table showing the overall results:

Drug        # recovered   # died   Recovery Rate
Given       1600          2400     40%
Not given   2000          2000     50%

Strangely, when investigators looked at the situation separately for males and females, they found that the drug improves recovery for each group:

Females

Drug        # recovered   # died   Recovery Rate
Given       900           2100     30%
Not given   200           800      20%

Males

Drug        # recovered   # died   Recovery Rate
Given       700           300      70%
Not given   1800          1200     60%

Which is right? Does the drug improve recovery or hinder recovery? What advice would you give to a physician about whether or not to prescribe the drug to her patients? Give enough of an explanation that the physician can judge whether your advice is reasonable.

Based on an example from Judea Pearl (2000) Causality: Models, Reasoning, and Inference, Cambridge Univ. Press, p. 175
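The recovery rates in the three tables above can be checked directly from the counts. The following is a small sketch in base R. The data frame name `counts` and its column names are choices made here for illustration, while the numbers are copied from the tables.

```r
# Counts copied from the Females and Males tables in this problem.
counts <- data.frame(
  sex       = c("Female", "Female", "Male", "Male"),
  drug      = c("Given", "Not given", "Given", "Not given"),
  recovered = c(900, 200, 700, 1800),
  died      = c(2100, 800, 300, 1200)
)

# Recovery rate within each sex (the partial view, holding sex constant).
counts$rate <- counts$recovered / (counts$recovered + counts$died)

# Pool the two sexes to rebuild the overall table (the total view).
overall <- aggregate(cbind(recovered, died) ~ drug, data = counts, FUN = sum)
overall$rate <- overall$recovered / (overall$recovered + overall$died)

counts
overall
```

Both views come from the same patients. The only difference between them is whether sex is held constant or allowed to vary along with the drug.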

Prob 10.12. Time Magazine reported the results of a poll of people’s opinions about the U.S. economy in July 2008. The results are summarized in the graph.

[Source: Time, July 28, 2008, p. 41]

In a typical news media report of a poll, the results are summarized using one explanatory variable at a time. The point of this exercise is to show that such univariate explanations can be misleading.

The poll involves three explanatory variables: ethnicity, income, and age. Regrettably, the reported results treat each of these explanatory variables separately, even though there are likely to be correlations among them. For instance, relatively few people in the 18 to 29 age group have high incomes.

The original data set from which the Time graphic was made contains the information needed to study the multiple explanatory variables simultaneously, for example looking at the connection between pessimism and age while adjusting for income. This data set is not available, so you will need to resort to a simulation which attempts to mimic the poll results. Of course, the simulation doesn’t necessarily describe people’s attitudes directly, but it does let you see how the conclusions drawn from the poll might have been different if the results for each explanatory variable had been presented in a way that adjusts for the other explanatory variables.

To install the simulation software, run this statement:

> fetchData("simulate.r")

Once that software has been installed, the following statement will run a simulation of a poll in which 10,000 people are asked to rate their level of pessimism (on a scale from 0 to 10) and to indicate their age group and income level:

> poll = run.sim(economic.outlook.poll, 10000)

The output of the simulation will be a data frame that looks something like this:

             age               income pessimism
1     [18 to 29]   [less than $20000]        10
2     [40 to 64] [$50,000 to $99,999]         5
3     [40 to 64]   [less than $20000]         9
4     [40 to 64] [$50,000 to $99,999]         7
5 [65 and older] [$50,000 to $99,999]         7
6     [18 to 29]   [less than $20000]        10

Your output will differ because the simulation reflects random sampling.

• Construct the model pessimism ~ age-1. Look at the coefficients and choose the statement that best reflects the results:

A. Middle-aged people have lower pessimism than young or old people.
B. Young people have the least pessimism.
C. There is no relationship between age and pessimism.

• Now construct the model pessimism ~ income-1. Look at the coefficients and choose the statement that best reflects the results:

A. Higher income people are more pessimistic than low-income people.
B. Higher income people are less pessimistic than low-income people.
C. There is no relationship between income and pessimism.

• Construct a model in which you can look at the relationship between pessimism and age while adjusting for income. That is, include income as a covariate in your model. Enter your model formula here:

Look at the coefficients from your model and choose the statement that best reflects the results:

A. Holding income constant, older people tend to have higher levels of pessimism than young people.
B. Holding income constant, young people tend to have higher levels of pessimism than old people.
C. Holding income constant, there is no relationship between age and pessimism.

• You can also interpret that same model to see the relationship between pessimism and income while adjusting for age. Which of the following statements best reflects the results? (Hint: make sure to pay attention to the sign of the coefficients.)

A. Holding age constant, higher income people are more pessimistic than low-income people.
B. Holding age constant, higher income people are less pessimistic than low-income people.
C. Holding age constant, there is no relationship between income and pessimism.
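The formulas pessimism ~ age-1 and pessimism ~ income-1 in this problem drop the intercept term. With a single categorical explanatory variable, that makes each coefficient the mean of the response in one group, which is often easier to read than a baseline plus differences. A minimal sketch on a made-up data frame (`toy`, `group`, and `y` are hypothetical and unrelated to the poll simulation):

```r
# Tiny made-up data set: three groups, two observations each.
toy <- data.frame(
  group = factor(c("a", "a", "b", "b", "c", "c")),
  y     = c(1, 3, 4, 6, 8, 10)
)

with_intercept <- coef(lm(y ~ group, data = toy))      # baseline plus differences
no_intercept   <- coef(lm(y ~ group - 1, data = toy))  # one coefficient per group
group_means    <- tapply(toy$y, toy$group, mean)       # groupwise means for comparison

with_intercept
no_intercept
group_means
```

Once a second explanatory variable is added to the formula, the coefficients are no longer simple group means, so they need to be read with attention to sign and to the reference level.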

Prob 10.20. Whenever you seek to study a partial relationship, there must be at least three variables involved: a response variable, an explanatory variable that is of direct interest, and one or more other explanatory variables that will be held constant: the covariates. Unfortunately, it’s hard to graph models involving three variables on paper: the usual graph of a model just shows one variable as a function of a second.

One way to display the relationship between a response variable and two quantitative explanatory variables is to use a contour plot. The two explanatory variables are plotted on the axes and the fitted model values are shown by the contours. The figure shows such a display of the fitted model of used car prices as a function of mileage and age.

The dots are the mileage and age of the individual cars — the model Price is indicated by the contours.

The total relationship between Price and mileage involves how the price changes for typical cars of different mileage. Pick a dot that is a typical car with about 10,000 miles. Using the contours, find the model price of this car.

Which of the following is closest to the model price (in dollars)?

18000  21000  25000  30000

Now pick another dot that is a typical car with about 70,000 miles. Using the contours, find the model price of this car.

18000  21000  25000  30000

The total relationship between Price and mileage is reflected by this ratio: change in model price divided by change in mileage. What is that ratio (roughly)?

A. 0.15 dollars/mile
B. 15.0 dollars/mile
C. 0.12 dollars/mile

In contrast, the partial relationship between Price and mileage holding age constant is found in a different way, by comparing two points with different mileage but exactly the same age.

Mark a point on the graph where age is 3 years and mileage is 10000. Keep in mind that this point doesn’t need to be an actual car, that is, a data point in the graph. There might be no actual car with an age of 3 years and mileage 10000. But using the contours, find the model price at this point:

22000  24000  26000  28000  30000

Now find another point, one where the age is exactly the same (3 years) but the mileage is different. Again there might not be an actual car there. Let’s pick mileage as 80000. Using the contours, find the model price at this point:

22000  24000  26000  28000  30000

The partial relationship between price and mileage (holding age constant) is again reflected by the ratio of the change in model price divided by the change in mileage. What is that ratio (roughly)?

A. 17.50 dollars/mile
B. 0.09 dollars/mile
C. 0.03 dollars/mile

Both the total relationship and the partial relationship are indicated by the slope of the model price function given by the contours. The total relationship involves the slope between two points that are typical cars, as indicated by the dots. The partial relationship involves a slope along a different direction. When holding age constant, that direction is the one where mileage changes but age does not (vertical in the graph).

There’s also a partial relationship between price and age holding mileage constant. That partial relationship involves the slope along the direction where age changes but mileage is held constant. Estimate that slope by finding the model price at a point where age is 2 years and another point where age is 5 years. You can pick whatever mileage you like, but it’s key that your two points be at exactly the same mileage.

Estimate the slope of the price function along a direction where age changes but mileage is held constant (horizontally on the graph).

A. 100 dollars per year
B. 500 dollars per year
C. 1000 dollars per year
D. 2000 dollars per year

The contour plot above shows a model in which both mileage and age are explanatory variables. By choosing the direction in which to measure the slope, one determines whether the slope reflects a total relationship (a direction between typical cars), or a partial relationship holding age constant (a direction where age does not change, which might not be typical for cars), or a partial relationship holding mileage constant (a direction where mileage does not change, which also might not be typical for cars).

In calculus, the partial derivative of price with respect to mileage refers to an infinitesimal change in a direction where age is held constant. Similarly, the partial derivative of price with respect to age refers to an infinitesimal change in a direction where mileage is held constant.

Of course, in order for the directional derivatives to make sense, the price function needs to have both age and mileage as explanatory variables. The following contour plot shows a model in which only age has been used as an explanatory variable: there is no dependence of the function on mileage.

Such a model is incapable of distinguishing between a partial relationship and a total relationship. Both the partial and the total relationship involve a ratio of the change in price and change in age between two points. For the total relationship, those two points would be typical cars of different ages. For the partial relationship, those two points would be different ages at exactly the same mileage. But, because the model does not depend on mileage, the two ratios will be exactly the same.
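The idea in this problem, that a slope depends on the direction in which it is measured, can be sketched numerically. The price function below is hypothetical: its coefficients are invented for illustration and are not read from the contour plot, and the two "typical" cars are likewise assumed.

```r
# Hypothetical linear price surface; the coefficients are made up for
# illustration and are not taken from the contour plot.
model_price <- function(mileage, age) 30000 - 0.05 * mileage - 1000 * age

# Partial relationship: change mileage while holding age at 3 years.
partial_slope <- (model_price(80000, 3) - model_price(10000, 3)) / (80000 - 10000)

# Total relationship: two "typical" cars, where the higher-mileage car
# is also older (6 years versus 1 year).
total_slope <- (model_price(70000, 6) - model_price(10000, 1)) / (70000 - 10000)

c(partial = partial_slope, total = total_slope)
```

With these assumed numbers the total slope is steeper than the partial slope, because the comparison between typical cars mixes the effect of extra mileage with the effect of extra age.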

Prob 10.21. Sometimes people use data in aggregated form to draw conclusions about individuals. For example, in 1950 W.S. Robinson analyzed the relationship between immigration and illiteracy in two different ways.[?] In the first, the unit of analysis is the individual US state, as shown in the figure: the plot shows the fraction of people in each state who are illiterate versus the fraction of people who are foreign born. The correlation is negative, meaning that states with higher foreign-born populations have less illiteracy.

Robinson’s second analysis involves the same data, but takes the unit of analysis as an individual person. The table gives the number of people who are illiterate and who are foreign born in the states included in the scatter plot.

The data in the table leads to a different conclusion than the analysis of states: the foreign born people are more likely to be illiterate.

This conflict between the results of the analyses, analogous to Simpson’s paradox, is called the ecological fallacy. (The word “ecological” is rooted in the Greek word oikos for house — think of the choice between studying individuals or the groups of individuals in their houses.)

The ecological fallacy is not a paradox; it isn’t a question of what is the correct unit of analysis. If you want to study the characteristic of individuals, your unit of analysis should be individuals. If you want to study groups, your unit of analysis should be those groups. It’s a fallacy to study groups when your interest is in individuals.
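A small numerical sketch can show how the two units of analysis can disagree. Every number below is invented for illustration and is not Robinson's data: four hypothetical states of equal population, each with an assumed foreign-born fraction and assumed illiteracy rates for native-born and foreign-born residents.

```r
# Four hypothetical states of equal population; all rates are made up.
states <- data.frame(
  frac_foreign  = c(0.05, 0.10, 0.20, 0.30),  # fraction foreign born
  illit_native  = c(0.10, 0.08, 0.05, 0.03),  # illiteracy among native born
  illit_foreign = c(0.15, 0.13, 0.10, 0.08)   # illiteracy among foreign born
)

# State-level illiteracy is a weighted mix of the two groups.
states$illit_state <- with(
  states,
  (1 - frac_foreign) * illit_native + frac_foreign * illit_foreign
)

# Unit of analysis = state: correlation across the four states.
state_cor <- cor(states$frac_foreign, states$illit_state)

# Unit of analysis = person: pooled illiteracy rate in each group.
pooled_foreign <- with(states, sum(frac_foreign * illit_foreign) / sum(frac_foreign))
pooled_native  <- with(states, sum((1 - frac_foreign) * illit_native) / sum(1 - frac_foreign))

c(state_cor = state_cor, foreign = pooled_foreign, native = pooled_native)
```

In this made-up example the state-level correlation is negative even though, person by person, the foreign-born group has the higher illiteracy rate.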

One way to think about the difference between Robinson’s conclusions with groups (the states) and the very different conclusions with individuals is to consider the factors that create the groups. Give an explanation, in everyday terms, why the immigrants that Robinson studied might tend to be clustered in states with low illiteracy rates, even if the immigrants themselves had high rates of illiteracy.