Chapter 4 Problems      Statistical Modeling: A Fresh Approach (2/e)

1.
Which is largest: the variance of the residuals, the variance of the model values, or the variance of the actual values?
2.
How can a difference in group means clearly shown by your data nonetheless be misleading?
3.
What does it mean to partition variation? What’s special about the variance — the square of the standard deviation — as a way to measure variation?

Prob 4.03. To exercise your ability to calculate groupwise quantities, use the swimming records in swim100m.csv and calculate the mean and minimum swimming time for each of the subsets described below. (Answers have been rounded to one decimal place.)

> require(mosaic)
> swim = fetchData("SwimRecords")
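The following is a minimal sketch of how groupwise means and minimums can be computed, using only base R rather than the mosaic commands. The data frame races and every value in it are made up for illustration; they are not the swimming records, so the numbers printed here are not answers to the questions below.

```r
# Hypothetical race times in seconds; NOT the SwimRecords data.
races <- data.frame(
  year = c(1910, 1915, 1930, 1950, 1912, 1935, 1960, 1990),
  time = c(70.0, 66.0, 58.0, 55.0, 85.0, 65.0, 61.0, 54.0),
  sex  = c("M", "M", "M", "M", "F", "F", "F", "F")
)

# Groupwise mean and minimum: one value per level of sex.
mean_by_sex <- tapply(races$time, races$sex, mean)
min_by_sex  <- tapply(races$time, races$sex, min)

# A logical expression such as year < 1920 also defines two groups
# (labelled FALSE and TRUE).
mean_by_era <- tapply(races$time, races$year < 1920, mean)

# Alternatively, pull out one subset and summarize it directly.
early     <- subset(races, year < 1920)
min_early <- min(early$time)

print(mean_by_sex)
print(min_by_sex)
print(mean_by_era)
print(min_early)
```

The same pattern should carry over to the real data once it is loaded: replace races with the name of your data frame and use its own variable names.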

(a)
Record times for women:
• Mean:
47.8  53.5  54.7  57.3  61.4  63.4  65.2  73.8  84.2
• Minimum:
47.8  53.5  54.7  57.3  61.4  63.4  65.2  73.8  84.2
(b)
All records before 1920. (Hint: the construction year<1920 can be used as a variable.)
• Mean:
47.8  53.8  54.7  57.3  61.4  63.4  69.6  73.8  84.2
• Minimum:
47.8  53.8  54.7  57.3  61.4  63.4  69.6  73.8  84.2
(c)
All records that are slower than 60 seconds. (Hint: Think about what “slower” means in terms of the swimming times.)
• Mean:
47.8  53.8  54.7  60.2  61.6  63.4  69.6  73.8  84.2
• Minimum:
47.8  53.8  54.7  60.2  61.6  63.4  69.6  73.8  84.2

Prob 4.04. Here is a model of wages in 1985 constructed using the CPS85 data.

> mod = mm( wage ~ sector, data=CPS85 )

wage is the “response variable,” while sector is the explanatory variable.

For every case in the data, the model will give a “fitted model value.” Different cases will have different fitted model values if they have different values for the explanatory variable. Here, the model assigns different fitted model values to workers in different sectors of the economy.
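As a rough illustration of what "fitted model value" and "residual" mean for a groupwise-mean model, the sketch below uses base R's ave() on a small made-up data set. The data frame pay and its sector labels are hypothetical and are not the CPS85 data, and ave() is used here as a stand-in for the idea, not as a description of how mm() works internally.

```r
# Hypothetical hourly wages in three made-up sectors; NOT the CPS85 data.
pay <- data.frame(
  wage   = c(6, 8, 10, 12, 14, 16, 7, 11),
  sector = c("a", "a", "b", "b", "c", "c", "a", "b")
)

# ave() replaces each wage by the mean wage of that case's sector,
# so cases in the same sector share the same fitted value.
fitted_vals <- ave(pay$wage, pay$sector)

# The residual is how far each case is from its own fitted value.
resids <- pay$wage - fitted_vals

print(cbind(pay, fitted_vals, resids))

# Standard deviations summarize the spread of each piece.
print(sd(fitted_vals))
print(sd(resids))
```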

You can see the groupwise means for the different sectors by looking at the model. Just give the model name, like this:

> mod

(a)
What is the mean wage for workers in the construction sector (const)?
6.54  7.42  7.59  8.04  8.50  9.50  11.95  12.70
(b)
What is the mean wage for workers in the management sector (manag)?
6.54  7.42  7.59  8.04  8.50  9.50  11.95  12.70
(c)
Which sector has the lowest mean wage?
clerical  const  manag  manuf  prof  sales  service
(d)
Statistical models attempt to account for case-to-case variability. One simple way to measure the success of a model is to look at the variation in the fitted model values. What is the standard deviation in the fitted model values for mod?
0  0.95  1.10  1.53  2.03  2.20  2.43  3.43  4.13  4.65
(e)
The residuals of the model tell how far each case is from that case’s fitted model value. In interpreting models, it’s often important to know the typical size of a residual. The standard deviation is often used to quantify “size”. What’s the standard deviation of the residuals of mod?
0  0.95  1.10  1.53  2.03  2.20  2.43  3.43  4.13  4.65

Prob 4.05. Here are two models of wages in 1985 in the CPS85 data:

> mod1 = mm( wage ~ 1, data=CPS85 )
> mod2 = mm( wage ~ sector, data=CPS85 )

The model mod1 corresponds to the grand mean, as if all cases were in the same group. The model mod2 breaks down the mean wage into groups depending on what sector of the economy the worker is in.

(a)
Which model has the greater variation from case to case in fitted model values?
mod1  mod2  same for both
(b)
Which model has the greater variation from case to case in residuals?
mod1  mod2  same for both
(c)
Which of these statements are true for both mod1 and mod2 (and all other groupwise mean models)?
• The mean residual is always zero.
True or False
• The standard deviation of residuals plus the standard deviation of fitted model values gives the standard deviation of the variable being modeled (the “response variable”).
True or False
• The variance of residuals plus the variance of fitted model values gives the variance of the variable being modeled.
True or False
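One way to investigate statements like these is to compute the relevant quantities on a small data set and compare them by eye. The sketch below does this in base R for a grand-mean model and a groupwise-mean model. The vectors y and g and the helper variation_summary() are hypothetical names introduced for illustration; the values are made up and are not the CPS85 data.

```r
# Hypothetical response values and group labels; NOT the CPS85 data.
y <- c(6, 8, 10, 12, 14, 16, 7, 11)
g <- c("a", "a", "b", "b", "c", "c", "a", "b")

# Collect the mean residual plus the sd and variance of the response,
# the fitted values, and the residuals.
variation_summary <- function(y, fitted) {
  res <- y - fitted
  c(mean_resid   = mean(res),
    sd_response  = sd(y),  sd_fitted  = sd(fitted),  sd_resid  = sd(res),
    var_response = var(y), var_fitted = var(fitted), var_resid = var(res))
}

# Grand-mean model: every case gets the same fitted value (like wage ~ 1).
grand <- variation_summary(y, rep(mean(y), length(y)))

# Groupwise-mean model: each case gets its group's mean (like wage ~ sector).
group <- variation_summary(y, ave(y, g))

print(round(rbind(grand, group), 3))
```

Reading across each row of the printed table shows which of the sums in the statements hold for this made-up example; repeating the check with the real models is a good way to confirm the pattern.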

Prob 4.06. Read in the Current Population Survey wage data:

> w = fetchData("CPS85")

(a)
What is the grand mean of wage?
7.68  7.88  8.26  8.31  9.02  9.40  10.88
(b)
What is the group-wise mean of wage for females?
7.68  7.88  8.26  8.31  9.02  9.40  10.88
(c)
What is the group-wise mean of wage for married people?
7.68  7.88  8.26  8.31  9.02  9.40  10.88
(d)
What is the group-wise mean of wage for married females? (Hint: There are two grouping variables involved.)
7.68  7.88  8.26  8.31  9.02  9.40  10.88
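When two grouping variables are involved, as in part (d), the groupwise means form a table with one cell per combination of levels. Here is a minimal base-R sketch using tapply() with a list of two grouping vectors. The data frame people and its values are hypothetical and are not the CPS85 data.

```r
# Hypothetical wages with two grouping variables; NOT the CPS85 data.
people <- data.frame(
  wage    = c(10, 12, 8, 9, 11, 7, 13, 6),
  sex     = c("F", "F", "F", "F", "M", "M", "M", "M"),
  married = c("Married", "Single", "Married", "Single",
              "Married", "Single", "Married", "Single")
)

# One grouping variable gives a vector of means, one per level.
by_sex <- tapply(people$wage, people$sex, mean)

# Two grouping variables give a table: rows are sex, columns are married.
by_both <- tapply(people$wage, list(people$sex, people$married), mean)

print(by_sex)
print(by_both)

# A single cell can be picked out by its row and column labels.
print(by_both["F", "Married"])
```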

Prob 4.07. Read in the Galton height data:

> g = fetchData("Galton")

(a)
What is the standard deviation of the height?
(b)
Calculate the grand mean and, from that, the residuals of the actual heights from the grand mean.
> mod0 = mm(height~1, data=g)
> res = resid(mod0)

What is the standard deviation of the residuals from this "grand mean" model?

2.51  2.58  2.92  3.58  3.82
(c)
Calculate the group-wise mean for the different sexes and, from that, the residuals of the actual heights from this group-wise model.
> mod1 = mm( height ~ sex, data=g)
> res1 = resid(mod1)

What is the standard deviation of the residuals from this group-wise model?

2.51  2.58  2.92  3.58  3.82
(d)
Which model has the smaller standard deviation of residuals?
mod0  mod1  they are the same

Prob 4.08. Create a spreadsheet with the three variables distance, team, and position, in the following way:

distance  team      position
5         Eagles    center
12        Eagles    forward
11        Eagles    end
2         Doves     center
18        Doves     end
12        Penguins  forward
15        Penguins  end
19        Eagles    back
5         Penguins  center
12        Penguins  back

(a)
After entering the data, you can calculate the mean distance in various ways.
• What is the grand mean distance?
4  9.25  10  11  11.1  11.75  12  14.67  15.5
• What is the group mean distance for the three teams?
• Eagles
4  9.25  10  11  11.1  11.75  12  14.67  15.5
• Doves
4  9.25  10  11  11.1  11.75  12  14.67  15.5
• Penguins
4  9.25  10  11  11.1  11.75  12  14.67  15.5
• What is the group mean distance for the following positions?
• back
4  9.25  10  11  11.1  11.75  12  14.67  15.5
• center
4  9.25  10  11  11.1  11.75  12  14.67  15.5
• end
4  9.25  10  11  11.1  11.75  12  14.67  15.5
(b)
Now, just for the sake of developing an understanding of group means, you are going to change the distance data. Make up values for distance so that the mean distance for Eagles is 14, for Penguins is 13, and for Doves is 15.

Cut and paste the output from R showing the means for these groups and then the means taken group-wise according to position.

(c)
Now arrange things so that the means are as stated in (b) but every case has a residual of either 1 or -1.
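To check made-up values against requirements like those in (b) and (c), it can help to compute the group means and the residuals directly. The sketch below is one possible check in base R. The helper all_unit_residuals() and the vectors vals_a, vals_b, and grp are hypothetical names with invented values; they are not the data from this problem.

```r
# TRUE if every case sits exactly 1 unit above or below its group mean.
all_unit_residuals <- function(values, groups) {
  res <- values - ave(values, groups)   # residuals from the group means
  all(abs(abs(res) - 1) < 1e-9)         # small tolerance for rounding error
}

# Hypothetical values in two made-up groups.
grp    <- c("x", "x", "y", "y")
vals_a <- c(3, 5, 9, 12)
vals_b <- c(3, 5, 9, 11)

# Group means, to compare against whatever target means are required.
print(tapply(vals_a, grp, mean))
print(tapply(vals_b, grp, mean))

# Residual check for each candidate set of values.
print(all_unit_residuals(vals_a, grp))
print(all_unit_residuals(vals_b, grp))
```

With your own data, pass the distance column and the team column to the helper in place of the hypothetical vectors.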

Prob 4.10. It can be helpful when testing and evaluating statistical methods to use simulations. In this exercise, you are going to use a simulation of salaries to explore groupwise means. Keep in mind that the simulation is not reality; you should NOT draw conclusions about real-world salaries from the simulation. Instead, the simulation is useful just for showing how statistical methods work in a setting where we know the true answer.

To use the simulations, you’ll need both the mosaic package and some additional software. Probably you already have mosaic loaded, but it doesn’t hurt to make sure. So give both these commands:

> require(mosaic)

The simulation you will use in this exercise is called salaries. It’s a simulation of salaries of college professors. To carry out the simulation, give this command:

> run.sim( salaries, n=5 )

  age sex children   rank   salary
1  47   M        0   Full 51601.75
2  49   M        1   Full 52280.93
3  49   M        0   Full 52427.08
4  39   M        2 Assist 38908.45
5  34   M        1 Assist 41761.81

The argument n tells how many cases to generate. By looking at these five cases, you can see the structure of the data.

Chances are, the data you generate by running the simulation will differ from the data printed here. That’s because the simulation generates cases at random. Still, underlying the simulation is a mathematical model that imposes certain patterns and relationships on the variables. You can get an idea of the structure of the model by looking at the salaries simulation itself:

> salaries

Causal Network with  5  vars:  age, sex, children, rank, salary
===============================================
age is exogenous
sex <== age
children is exogenous
rank <== age & sex & children
salary <== age & rank

This structure, and the equations that underlie it, might or might not correspond to the real world; no claim about the realism of the model is being made here. Instead, you’ll use the model to explore some mathematical properties of group means.
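To make the idea of a causal network concrete, here is a rough base-R sketch of how a simulation with the same arrows might be written. It is not the salaries simulation itself: the function sim_salaries(), every coefficient, the direction of every effect, and the rank label "Assoc" are assumptions invented for illustration.

```r
set.seed(1)  # make the random draws reproducible

# Hypothetical generator that follows the arrows of the printed network.
# All numbers below are invented; they are NOT those of the real simulation.
sim_salaries <- function(n) {
  # age is exogenous: drawn without reference to any other variable
  age <- round(runif(n, 30, 65))

  # sex <== age: the chance of each label depends (arbitrarily) on age
  sex <- ifelse(runif(n) < 0.3 + 0.005 * (age - 30), "M", "F")

  # children is exogenous
  children <- rpois(n, 1)

  # rank <== age & sex & children: a noisy score cut into three levels
  score <- age - 3 * (sex == "F") - 2 * children + rnorm(n, sd = 3)
  rank  <- cut(score, breaks = c(-Inf, 40, 50, Inf),
               labels = c("Assist", "Assoc", "Full"))

  # salary <== age & rank: sex and children enter only through rank
  bonus  <- c(0, 5000, 10000)[as.integer(rank)]
  salary <- 20000 + 500 * age + bonus + rnorm(n, sd = 1500)

  data.frame(age, sex, children, rank, salary)
}

print(sim_salaries(5))
```

The point of the sketch is the ordering: each variable is generated only from the variables that point to it, which is what the arrows in the printed network describe.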

Generate a data set with n = 1000 cases using the simulation.

> s = run.sim( salaries, n=1000 )

1.
What is the grand mean of the salary variable? (Choose the closest.)
39000  42000  48000  51000  53000  59000  65000  72000
2.
What is the grand mean of the age variable? (Choose the closest.)
41  45  48  50  53  55  61
3.
Calculate the groupwise means for salary broken down by sex.
• For women?
39000  42000  48000  51000  53000  59000  65000  72000
• For men?
39000  42000  48000  51000  53000  59000  65000  72000
• What’s the pattern indicated by these groupwise means?

 A Women and men earn almost exactly the same, on average.
 B Men earn less than women, on average.
 C Women earn less than men, on average.

4.
Make side-by-side boxplots of the distribution of salary, broken down by sex. Use the graph to answer the following questions. (Choose the closest answer.)
• What fraction of women earn more than the median salary for men?
None  0.25  0.50  0.75  All
• What fraction of men earn less than the median salary for women?
None  0.25  0.50  0.75  All
• Explain how it’s possible that the mean salary for men can be higher than the mean salary for women, and yet some men earn less than some women. (If this is obvious to you, then state the obvious!)
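The fractions asked about above can be read from the boxplot, but they can also be computed directly as a cross-check. The following is a minimal base-R sketch on made-up numbers. The vectors pay and grp are hypothetical and are not output from the salaries simulation.

```r
# Hypothetical salaries (in thousands) for two groups; NOT simulation output.
pay <- c(40, 44, 52, 56,   42, 50, 54, 58, 62)
grp <- c(rep("F", 4), rep("M", 5))

# Median salary within each group.
med <- tapply(pay, grp, median)

# The mean of a logical vector is the fraction of TRUE values.
frac_F_above_M_median <- mean(pay[grp == "F"] > med["M"])
frac_M_below_F_median <- mean(pay[grp == "M"] < med["F"])

print(med)
print(frac_F_above_M_median)
print(frac_M_below_F_median)
```

In this toy example the two groups overlap even though their medians differ, which is the situation the last question asks you to explain.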
5.
There are other variables involved in the salary simulation. In particular, consider the rank variable. At most colleges and universities, professors start at the assistant level, then some are promoted to associate and some further promoted to “full” professors.

Find the mean salary broken down by rank.

• What’s the mean salary for assistant professors? (Choose the closest.)
37000  41000  46000  52000  58000  63000
• What’s the mean salary for associate professors? (Choose the closest.)
37000  41000  46000  52000  58000  63000
• What’s the mean salary for “full” professors? (Choose the closest.)
37000  41000  46000  52000  58000  63000
6.
Make the following side-by-side boxplot. (Make sure to copy the command exactly.)
> bwplot( salary ~ cross(rank,sex), data=s )

Based on the graph, choose one of the following:

 A Adjusted for rank, women and men earn about the same.
 B Adjusted for rank, men systematically earn less than women.
 C Adjusted for rank, women earn less than men.

7.
Look at the distribution of rank, broken down by sex. (Hint: rank is a categorical variable, so it’s meaningless to calculate the mean. But you can tally up the proportions.)

Explain how the different distributions of rank for the different sexes can account for the pattern of salaries.
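Tallying proportions for a categorical variable within groups can be sketched in base R with table() and prop.table(). The data frame staff, its values, and the rank label "Assoc" are hypothetical; they are not output from the salaries simulation.

```r
# Hypothetical ranks and sexes; NOT output from the salaries simulation.
staff <- data.frame(
  rank = c("Assist", "Assist", "Assoc", "Full",
           "Assist", "Assoc", "Full", "Full"),
  sex  = c("F", "F", "F", "F", "M", "M", "M", "M")
)

# Counts of each rank (rows) within each sex (columns).
counts <- table(staff$rank, staff$sex)

# margin = 2 divides each column by its total, so every column sums to 1
# and shows the distribution of rank within that sex.
props <- prop.table(counts, margin = 2)

print(counts)
print(props)
```

Comparing the columns of the proportion table shows whether one group is concentrated in different ranks than the other.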

Keep in mind that this is a simulation and says nothing directly about the real-world distribution of salaries. In analyzing real-world salaries, however, you might want to use some of the same techniques.