Which is largest: the variance of the residuals, the variance of the
model values, or the variance of the actual values?
How can a difference in group means clearly shown by your data nonetheless be misleading?
What does it mean to partition variation? What’s special about the variance —
the square of the standard deviation — as a way to measure variation?
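As a rough illustration of what partitioning means, the sketch below fits group means to a small made-up data set and compares the three variances. The variables `wage` and `group` and their values are hypothetical, and base R's `ave()` stands in for a groupwise-mean model.

```r
# Hypothetical data: a response and a grouping variable (not from the text).
wage  <- c(4, 6, 8, 10, 9, 11)
group <- factor(c("a", "a", "b", "b", "c", "c"))

# Fitted model value for each case = the mean of that case's group.
fitted_vals <- ave(wage, group)   # ave() uses mean() by default
resids      <- wage - fitted_vals

var(wage)          # variance of the actual values
var(fitted_vals)   # variance of the model values
var(resids)        # variance of the residuals

# For a groupwise-mean model the two pieces add up to the whole.
all.equal(var(wage), var(fitted_vals) + var(resids))
```

The same additivity does not hold for standard deviations, which is one reason the variance is the convenient measure here.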
To exercise your ability to calculate groupwise quantities, use the swimming
records in swim100m.csv and calculate the mean and minimum swimming time for
the subset. (Answers have been rounded to one decimal place.)
All records before 1920. (Hint: the construction year<1920 can be used as a
All records that are slower than 60 seconds. (Hint: Think what “slower” means
in terms of the swimming times.)
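One way to set up such subset calculations in base R is sketched here. The data frame `swim_demo` is a hypothetical stand-in with made-up values, and the column names `year` and `time` are assumptions about the layout of the real file.

```r
# Hypothetical stand-in for the swimming records (values are made up).
swim_demo <- data.frame(
  year = c(1905, 1912, 1918, 1922, 1936, 1956),
  time = c(65.8, 61.6, 61.4, 60.5, 56.4, 55.4)
)

# A logical test picks out the cases that belong to the subset.
early <- subset(swim_demo, year < 1920)
mean(early$time)
min(early$time)

# A slower swim takes longer, so it has a larger time.
slow <- subset(swim_demo, time > 60)
mean(slow$time)
min(slow$time)
```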
Here is a model of wages in 1985 constructed using the CPS85 data.
> mod = mm( wage ~ sector, data=CPS85 )
wage is the “response variable,” while sector is the explanatory variable.
For every case in the data, the model will give a “fitted model value.” Different
cases will have different fitted model values if they have different values for the
explanatory variable. Here, the model assigns different fitted model values to workers
in different sectors of the economy.
You can see the groupwise means for the different sectors by looking at the model.
Just give the model name, like this:
> mod
What is the mean wage for workers in the construction sector (const)?
What is the mean wage for workers in the management sector (manag)?
Which sector has the lowest mean wage?
Statistical models attempt to account for case-to-case variability. One simple way
to measure the success of a model is to look at the variation in the fitted model
values. What is the standard deviation in the fitted model values for
The residuals of the model tell how far each case is from that case’s fitted model
value. In interpreting models, it’s often important to know the typical size of a
residual. The standard deviation is often used to quantify “size”. What’s the
standard deviation of the residuals of mod?
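The quantities discussed above can be sketched with base R alone. This example swaps in `lm()` for the `mm()` function used in the exercise, and the data frame `d` holds made-up wages rather than the CPS85 data, so the numbers are for illustration only.

```r
# Hypothetical wages for three sectors (made-up numbers, not CPS85).
d <- data.frame(
  wage   = c(7, 9, 12, 14, 5, 7),
  sector = factor(c("const", "const", "manag", "manag", "service", "service"))
)

# Groupwise means, one per sector.
group_means <- tapply(d$wage, d$sector, mean)
group_means

# With a single categorical explanatory variable, lm() produces fitted
# values equal to the group means.
m <- lm(wage ~ sector, data = d)
sd(fitted(m))   # variation in the fitted model values
sd(resid(m))    # typical size of a residual
```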
Here are two models of wages in 1985 in the CPS85 data:
What is the standard deviation of the residuals from this group-wise model?
Which model has the smaller standard deviation of residuals?
mod0
mod1
they are the same
Create a spreadsheet with the three variables distance, team, and position, in the
After entering the data, you can calculate the mean distance in various
What is the grand mean distance?
What is the group mean distance for the three teams?
What is the group mean distance for the following positions?
Now, just for the sake of developing an understanding of group means, you are
going to change the dist data. Make up values for dist so that the mean dist for
Eagles is 14, for Penguins is 13, and for Doves is 15.
Cut and paste the output from R showing the means for these
groups and then the means taken group-wise according to position.
Now arrange things so that the means are as stated in (b) but every case has a
residual of either 1 or -1.
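The idea behind building data to order can be sketched with hypothetical groups and target means (deliberately not the teams above). This version assumes each group has an even number of cases, so that offsets of +1 and -1 cancel within the group.

```r
# Hypothetical groups and target means, chosen only for illustration.
target <- c(A = 10, B = 20)
group  <- rep(names(target), each = 4)   # 4 cases per group

# Alternate +1 and -1 within each group so the offsets cancel.
offset <- rep(c(1, -1), times = 4)
value  <- unname(target[group]) + offset

tapply(value, group, mean)    # recovers the target means
value - ave(value, group)     # every residual is 1 or -1
```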
It can be helpful when testing and evaluating statistical methods to use
simulations. In this exercise, you are going to use a simulation of salaries to explore
groupwise means. Keep in mind that the simulation is not reality; you should NOT
draw conclusions about real-world salaries from the simulation. Instead, the
simulation is useful just for showing how statistical methods work in a setting where
we know the true answer.
To use the simulations, you’ll need both the mosaic package and some additional
software. Probably you already have mosaic loaded, but it doesn’t hurt to make sure.
So give both these commands:
The simulation you will use in this exercise is called salaries. It’s a
simulation of salaries of college professors. To carry out the simulation, give this command:
> run.sim( salaries, n=5 )
  age sex children   rank   salary
1  47   M        0   Full 51601.75
2  49   M        1   Full 52280.93
3  49   M        0   Full 52427.08
4  39   M        2 Assist 38908.45
5  34   M        1 Assist 41761.81
The argument n tells how many cases to generate. By looking at these five cases, you can
see the structure of the data.
Chances are, the data you generate by running the simulation will differ
from the data printed here. That’s because the simulation generates cases
at random. Still, underlying the simulation is a mathematical model that
imposes certain patterns and relationships on the variables. You can get an
idea of the structure of the model by looking at the salaries simulation
Causal Network with 5 vars: age, sex, children, rank, salary
===============================================
age is exogenous
sex <== age
children is exogenous
rank <== age & sex & children
salary <== age & rank
This structure, and the equations that underlie it, might or might not correspond to the
real world; no claim about the realism of the model is being made here.
Instead, you’ll use the model to explore some mathematical properties of group means.
Generate a data set with n = 1000 cases using the simulation.
> s = run.sim( salaries, n=1000 )
What is the grand mean of the salary variable? (Choose the closest.)
What is the grand mean of the age variable? (Choose the closest.)
Calculate the groupwise means for salary broken down by sex.
What’s the pattern indicated by these groupwise means?
Women and men earn almost exactly the same, on average.
Men earn less than women, on average.
Women earn less than men, on average.
Make side-by-side boxplots of the distribution of salary, broken down by sex.
Use the graph to answer the following questions. (Choose the closest.)
What fraction of women earn more than the median salary for men?
What fraction of men earn less than the median salary for women?
Explain how it’s possible that the mean salary for men can be higher
than the mean salary for women, and yet some men earn less than
some women. (If this is obvious to you, then state the obvious!)
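Overlap between two groups can also be checked numerically. The sketch below uses a small made-up data frame, `s_demo`, rather than output from the simulation, and base R's `boxplot()` in place of a lattice plot.

```r
# Hypothetical salaries in thousands (not output from the simulation).
s_demo <- data.frame(
  salary = c(40, 44, 48, 52, 56, 42, 46, 50, 54, 58),
  sex    = rep(c("F", "M"), each = 5)
)

med   <- tapply(s_demo$salary, s_demo$sex, median)
women <- s_demo$salary[s_demo$sex == "F"]
men   <- s_demo$salary[s_demo$sex == "M"]

# The mean of a logical vector is the fraction of TRUE values.
frac_women_above <- mean(women > med[["M"]])   # women above the men's median
frac_men_below   <- mean(men < med[["F"]])     # men below the women's median
frac_women_above
frac_men_below

# Side-by-side boxplots show the same overlap graphically.
boxplot(salary ~ sex, data = s_demo)
```

Here the men's median is higher, yet a sizable fraction of each group lies on the "wrong" side of the other group's median: a difference in centers says little about individual cases.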
There are other variables involved in the salary simulation. In particular, consider
the rank variable. At most colleges and universities, professors start at the
assistant level, then some are promoted to associate and some further promoted to
Find the mean salary broken down by rank.
What’s the mean salary for assistant professors? (Choose the closest.)
What’s the mean salary for associate professors? (Choose the closest.)
What’s the mean salary for “full” professors? (Choose the closest.)
Make the following side-by-side boxplot. (Make sure to copy the command
> bwplot( salary ~ cross(rank,sex), data=s )
Based on the graph, choose one of the following:
Adjusted for rank, women and men earn about the same.
Adjusted for rank, men systematically earn less than women.
Adjusted for rank, women earn less than men.
Look at the distribution of rank, broken down by sex. (Hint: rank is a categorical
variable, so it’s meaningless to calculate the mean. But you can tally up the
Explain how the different distributions of rank for the different sexes can account for the pattern
Keep in mind that this is a simulation and says nothing directly about the
real-world distribution of salaries. In analyzing real-world salaries, however, you
might want to use some of the same techniques.
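The techniques in question (group means, means within groups, and tallies) can be sketched on a tiny made-up data set. The data frame `demo` is hypothetical and is built so that salary depends only on rank; it is not drawn from the simulation or from real salaries.

```r
# Hypothetical data in which salary depends only on rank.
demo <- data.frame(
  sex    = c("F", "F", "F", "F", "M", "M", "M", "M"),
  rank   = c("Assist", "Assist", "Assist", "Full",
             "Assist", "Full", "Full", "Full"),
  salary = c(40, 40, 40, 60, 40, 60, 60, 60)
)

# Broken down by sex alone, the group means differ.
by_sex <- tapply(demo$salary, demo$sex, mean)
by_sex

# Broken down by rank and sex together, the difference disappears.
by_rank_sex <- tapply(demo$salary, list(demo$rank, demo$sex), mean)
by_rank_sex

# The tally shows why: the sexes are distributed differently across ranks.
table(demo$rank, demo$sex)
```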