Chapter 4 Problems      Statistical Modeling: A Fresh Approach (2/e)

Reading Questions.

Which is largest: the variance of the residuals, the variance of the model values, or the variance of the actual values?
How can a difference in group means clearly shown by your data nonetheless be misleading?
What does it mean to partition variation? What’s special about the variance — the square of the standard deviation — as a way to measure variation?

Prob 4.03. To exercise your ability to calculate groupwise quantities, use the swimming records in swim100m.csv and calculate the mean and minimum swimming time for each subset listed below. (Answers have been rounded to one decimal place.)

  > require(mosaic)
  > swim = fetchData("SwimRecords")

Record times for women:
All records before 1920. (Hint: the construction year<1920 can be used as a variable.)
All records that are slower than 60 seconds. (Hint: Think what “slower” means in terms of the swimming times.)
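
The subsetting pattern in the hints can be sketched in base R as follows. This is only an illustration: the `toy` data frame and its values are invented for the example and are not the swimming records, and `subset()` and `tapply()` are base-R stand-ins rather than the mosaic functions the course uses.

```r
# Toy data, invented for illustration; NOT the real swimming records.
toy <- data.frame(
  year = c(1905, 1912, 1924, 1936, 1910, 1930),
  time = c(65.8, 61.6, 57.4, 56.4, 95.0, 68.0),
  sex  = c("M", "M", "M", "M", "F", "F")
)

# A logical condition such as year < 1920 selects a subset of the rows.
early <- subset(toy, year < 1920)
early_mean <- mean(early$time)
early_min  <- min(early$time)

# The same condition can also serve as a grouping variable:
# one group where it is TRUE, another where it is FALSE.
mean_by_era <- tapply(toy$time, toy$year < 1920, mean)
min_by_sex  <- tapply(toy$time, toy$sex, min)

print(round(c(mean = early_mean, min = early_min), 1))
print(mean_by_era)
print(min_by_sex)
```

A "slower than 60 seconds" subset would follow the same pattern with a condition on `time` instead of `year`.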

Prob 4.04. Here is a model of wages in 1985 constructed using the CPS85 data.

  > mod = mm( wage ~ sector, data=CPS85 )

wage is the “response variable,” while sector is the explanatory variable.

For every case in the data, the model will give a “fitted model value.” Different cases will have different fitted model values if they have different values for the explanatory variable. Here, the model assigns different fitted model values to workers in different sectors of the economy.

You can see the groupwise means for the different sectors by looking at the model. Just give the model name, like this:

  > mod

What is the mean wage for workers in the construction sector (const)?
 6.54  7.42  7.59  8.04  8.50  9.50  11.95  12.70  
What is the mean wage for workers in the management sector (manag)?
 6.54  7.42  7.59  8.04  8.50  9.50  11.95  12.70  
Which sector has the lowest mean wage?
 clerical  const  manag  manuf  prof  sales  service  
Statistical models attempt to account for case-to-case variability. One simple way to measure the success of a model is to look at the variation in the fitted model values. What is the standard deviation in the fitted model values for mod?
 0  0.95  1.10  1.53  2.03  2.20  2.43  3.43  4.13  4.65  
The residuals of the model tell how far each case is from that case’s fitted model value. In interpreting models, it’s often important to know the typical size of a residual. The standard deviation is often used to quantify “size”. What’s the standard deviation of the residuals of mod?
 0  0.95  1.10  1.53  2.03  2.20  2.43  3.43  4.13  4.65  
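
For readers who want to see what "fitted model values" and "residuals" mean mechanically, here is a minimal base-R sketch. It assumes a made-up `toy` data frame (not CPS85) and uses `ave()` in place of `mm()`, so it illustrates the idea rather than the course's own commands.

```r
# Toy data, invented for illustration; NOT the CPS85 wages.
toy <- data.frame(
  wage   = c(6, 8, 10, 12, 9, 15),
  sector = c("a", "a", "b", "b", "c", "c")
)

# ave() returns, for every case, the mean of that case's group.
# That is the fitted model value of a groupwise-mean model.
fitted_vals <- ave(toy$wage, toy$sector)

# A residual is the actual value minus the fitted model value.
resids <- toy$wage - fitted_vals

# Standard deviations of the fitted values and of the residuals.
sd_fitted <- sd(fitted_vals)
sd_resid  <- sd(resids)
print(c(sd_fitted = sd_fitted, sd_resid = sd_resid))
```

With a model object built by `mm()`, the source's own `resid()` call plays the role of the subtraction step here.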

Prob 4.05. Here are two models of wages in 1985 in the CPS85 data:

  > mod1 = mm( wage ~ 1, data=CPS85 )
  > mod2 = mm( wage ~ sector, data=CPS85 )

The model mod1 corresponds to the grand mean, as if all cases were in the same group. The model mod2 breaks down the mean wage into groups depending on what sector of the economy the worker is in.

Which model has the greater variation from case to case in fitted model values?
 mod1  mod2  same for both  
Which model has the greater variation from case to case in residuals?
 mod1  mod2  same for both  
Which of these statements is true for both model 1 and 2 (and all other groupwise mean models)?

Prob 4.06. Read in the Current Population Survey wage data:

  > w = fetchData("CPS85")

What is the grand mean of wage?
 7.68  7.88  8.26  8.31  9.02  9.40  10.88  
What is the group-wise mean of wage for females?
 7.68  7.88  8.26  8.31  9.02  9.40  10.88  
What is the group-wise mean of wage for married people?
 7.68  7.88  8.26  8.31  9.02  9.40  10.88  
What is the group-wise mean of wage for married females? (Hint: There are two grouping variables involved.)
 7.68  7.88  8.26  8.31  9.02  9.40  10.88  
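
The hint about two grouping variables can be illustrated with a short base-R sketch. The `toy` data frame below is invented for the example and is not the CPS85 data; `tapply()` is a base-R substitute for the groupwise functions in mosaic.

```r
# Toy data, invented for illustration; NOT the CPS85 data.
toy <- data.frame(
  wage    = c(10, 12, 8, 6, 14, 13),
  sex     = c("F", "F", "F", "M", "M", "M"),
  married = c("Married", "Single", "Married", "Married", "Single", "Single")
)

# Grand mean: all cases treated as one group.
grand_mean <- mean(toy$wage)

# Group-wise mean with one grouping variable.
by_sex <- tapply(toy$wage, toy$sex, mean)

# Passing a list of two variables gives one mean per combination.
by_both <- tapply(toy$wage, list(toy$sex, toy$married), mean)

print(grand_mean)
print(by_sex)
print(by_both)
```

The result `by_both` is a small table with one row per sex and one column per marital status, so a single cell such as `by_both["F", "Married"]` is the mean for that combination.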

Prob 4.07. Read in the Galton height data:

  > g = fetchData("Galton")

What is the standard deviation of the height?
Calculate the grand mean and, from that, the residuals of the actual heights from the grand mean.
  > mod0 = mm(height~1, data=g)
  > res = resid(mod0)

What is the standard deviation of the residuals from this "grand mean" model?

 2.51  2.58  2.92  3.58  3.82  
Calculate the group-wise mean for the different sexes and, from that, the residuals of the actual heights from this group-wise model.
  > mod1 = mm( height ~ sex, data=g)
  > res1 = resid(mod1)

What is the standard deviation of the residuals from this group-wise model?

 2.51  2.58  2.92  3.58  3.82  
Which model has the smaller standard deviation of residuals?
 mod0  mod1  they are the same  

Prob 4.08. Create a spreadsheet with the three variables distance, team, and position, in the following way:

distance  team    position
5         Eagles  center
12        Eagles  forward
11        Eagles  end
2         Doves   center
18        Doves   end
19        Eagles  back

After entering the data, you can calculate the mean distance in various ways.
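
One way this might look in R, as a sketch: the six cases are typed directly into a data frame instead of a spreadsheet (an assumption made so the example is self-contained), and base R's `tapply()` stands in for whatever groupwise function you prefer.

```r
# The six cases from the problem, entered by hand instead of via a spreadsheet.
d <- data.frame(
  distance = c(5, 12, 11, 2, 18, 19),
  team     = c("Eagles", "Eagles", "Eagles", "Doves", "Doves", "Eagles"),
  position = c("center", "forward", "end", "center", "end", "back")
)

# Mean distance over all cases.
overall <- mean(d$distance)

# Mean distance taken group-wise by team, then by position.
by_team     <- tapply(d$distance, d$team, mean)
by_position <- tapply(d$distance, d$position, mean)

print(overall)
print(by_team)
print(by_position)
```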
Now, just for the sake of developing an understanding of group means, you are going to change the distance data. Make up values for distance so that the mean distance for Eagles is 14, for Penguins is 13, and for Doves is 15.

Cut and paste the output from R showing the means for these groups and then the means taken group-wise according to position.

Now arrange things so that the group means are as stated above, but every case has a residual of either 1 or -1.

Prob 4.10. It can be helpful when testing and evaluating statistical methods to use simulations. In this exercise, you are going to use a simulation of salaries to explore groupwise means. Keep in mind that the simulation is not reality; you should NOT draw conclusions about real-world salaries from the simulation. Instead, the simulation is useful just for showing how statistical methods work in a setting where we know the true answer.

To use the simulations, you’ll need both the mosaic package and some additional software. Probably you already have mosaic loaded, but it doesn’t hurt to make sure. So give both these commands:

  > require(mosaic)
  > source("")

The simulation you will use in this exercise is called salaries. It’s a simulation of salaries of college professors. To carry out the simulation, give this command:

  > run.sim( salaries, n=5 )

    age sex children   rank   salary
  1  47   M        0   Full 51601.75
  2  49   M        1   Full 52280.93
  3  49   M        0   Full 52427.08
  4  39   M        2 Assist 38908.45
  5  34   M        1 Assist 41761.81

The argument n tells how many cases to generate. By looking at these five cases, you can see the structure of the data.

Chances are, the data you generate by running the simulation will differ from the data printed here. That’s because the simulation generates cases at random. Still, underlying the simulation is a mathematical model that imposes certain patterns and relationships on the variables. You can get an idea of the structure of the model by looking at the salaries simulation itself:

  > salaries

  Causal Network with  5  vars:  age, sex, children, rank, salary
  age is exogenous
  sex <== age
  children is exogenous
  rank <== age & sex & children
  salary <== age & rank

This structure, and the equations that underlie it, might or might not correspond to the real world; no claim about the realism of the model is being made here. Instead, you’ll use the model to explore some mathematical properties of group means.

Generate a data set with n = 1000 cases using the simulation.

  > s = run.sim( salaries, n=1000 )

What is the grand mean of the salary variable? (Choose the closest.)
 39000  42000  48000  51000  53000  59000  65000  72000  
What is the grand mean of the age variable? (Choose the closest.)
 41  45  48  50  53  55  61  
Calculate the groupwise means for salary broken down by sex.
Make side-by-side boxplots of the distribution of salary, broken down by sex. Use the graph to answer the following questions. (Choose the closest answer.)
There are other variables involved in the salary simulation. In particular, consider the rank variable. At most colleges and universities, professors start at the assistant level, then some are promoted to associate and some further promoted to “full” professors.

Find the mean salary broken down by rank.

Make the following side-by-side boxplot. (Make sure to copy the command exactly.)
  > bwplot( salary ~ cross(rank,sex), data=s )

Based on the graph, choose one of the following:

Adjusted for rank, women and men earn about the same.
Adjusted for rank, men systematically earn less than women.
Adjusted for rank, women earn less than men.

Look at the distribution of rank, broken down by sex. (Hint: rank is a categorical variable, so it’s meaningless to calculate the mean. But you can tally up the proportions.)
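
Tallying proportions for a categorical variable can be sketched in base R like this. The `toy` data frame is invented for illustration and is not output from the salaries simulation; `table()` and `prop.table()` are base-R functions used here as one possible approach.

```r
# Toy data, invented for illustration; NOT output from the simulation.
toy <- data.frame(
  rank = c("Assist", "Assist", "Assoc", "Full", "Full", "Full", "Assist", "Full"),
  sex  = c("F", "F", "F", "F", "M", "M", "M", "M")
)

# Counts of each rank within each sex.
counts <- table(toy$rank, toy$sex)

# margin = 2 divides by the column totals, so the proportions
# within each sex (each column) sum to 1.
props <- prop.table(counts, margin = 2)

print(counts)
print(props)
```

Reading down a column of `props` gives the distribution of rank for one sex, which is what needs to be compared between the two columns.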

Explain how the different distributions of rank for the different sexes can account for the pattern of salaries.

Keep in mind that this is a simulation and says nothing directly about the real-world distribution of salaries. In analyzing real-world salaries, however, you might want to use some of the same techniques.