Chapter 10 Total and partial relationships
10.1 Adjustment
There are two basic approaches to adjusting for covariates. Conceptually, the simpler one is to hold the covariates constant at some level, either when collecting the data or by extracting a subset of the data in which those covariates are constant. The other approach is to include the covariates in your models.
For example, suppose you want to study the difference in wages between males and females. The very simple model wage ~ sex might give some insight, but it attributes to sex effects that might actually be due to level of education, age, or the sector of the economy in which the person works. Here’s the result from the simple model:
library(mosaicData)  # assumed source of the CPS85 data set
Cps <- CPS85
mod0 <- lm( wage ~ sex, data = Cps)
summary(mod0)
##
## Call:
## lm(formula = wage ~ sex, data = Cps)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.995 -3.529 -1.072 2.394 36.621
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.8789 0.3216 24.50 < 2e-16 ***
## sexM 2.1161 0.4372 4.84 1.7e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.034 on 532 degrees of freedom
## Multiple R-squared: 0.04218, Adjusted R-squared: 0.04038
## F-statistic: 23.43 on 1 and 532 DF, p-value: 1.703e-06
The coefficients indicate that a typical male makes $2.12 more per hour than a typical female. (Notice that R² = 0.0422 is very small: sex explains hardly any of the person-to-person variability in wage.)
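In a model with a single two-level categorical variable, the coefficient is just the difference between the two group means. The following is a minimal check of that reading, assuming CPS85 is the data set distributed in the mosaicData package (the text does not say where it comes from); the names group_means, mean_diff, and sex_coef are introduced here only for illustration.

```r
# Assumption: CPS85 is taken from the mosaicData package.
Cps <- mosaicData::CPS85

# Mean hourly wage within each sex
group_means <- tapply(Cps$wage, Cps$sex, mean)
group_means

# Difference between the group means (M minus F)
mean_diff <- unname(group_means["M"] - group_means["F"])

# Coefficient on sexM from the simple model
mod0 <- lm(wage ~ sex, data = Cps)
sex_coef <- unname(coef(mod0)["sexM"])

# The two numbers agree up to floating-point error
all.equal(mean_diff, sex_coef)
```

The intercept plays the matching role for the reference level: it is the mean wage of the F group.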
By including the variables age, educ, and sector in the model, you can adjust for these variables:
mod1 <- lm( wage ~ age + sex + educ + sector, data = Cps)
summary(mod1)
##
## Call:
## lm(formula = wage ~ age + sex + educ + sector, data = Cps)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11.198 -2.695 -0.465 2.066 35.159
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -4.69411 1.53776 -3.053 0.002384 **
## age 0.10221 0.01657 6.167 1.39e-09 ***
## sexM 1.94172 0.42285 4.592 5.51e-06 ***
## educ 0.61556 0.09439 6.521 1.65e-10 ***
## sectorconst 1.43552 1.13120 1.269 0.204999
## sectormanag 3.27105 0.76685 4.266 2.37e-05 ***
## sectormanuf 0.80627 0.73115 1.103 0.270644
## sectorother 0.75838 0.75918 0.999 0.318286
## sectorprof 2.24777 0.66976 3.356 0.000848 ***
## sectorsales -0.76706 0.84202 -0.911 0.362729
## sectorservice -0.56871 0.66602 -0.854 0.393556
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.334 on 523 degrees of freedom
## Multiple R-squared: 0.3022, Adjusted R-squared: 0.2888
## F-statistic: 22.65 on 10 and 523 DF, p-value: < 2.2e-16
The adjusted difference between the sexes is $1.94 per hour. (The R² = 0.30 from this model is considerably larger than for mod0, but a lot of the person-to-person variation in wages has still not been captured.)
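To see the unadjusted and adjusted results side by side, the relevant numbers can be pulled out of the two fitted models rather than read off the printed summaries. This is a small sketch, again assuming CPS85 comes from the mosaicData package; the helper pick() is hypothetical and defined only for this example.

```r
# Assumption: CPS85 is taken from the mosaicData package.
Cps <- mosaicData::CPS85

mod0 <- lm(wage ~ sex, data = Cps)
mod1 <- lm(wage ~ age + sex + educ + sector, data = Cps)

# Hypothetical helper: the sexM coefficient and R-squared of a fitted model
pick <- function(mod) {
  c(sexM = coef(mod)[["sexM"]], r_squared = summary(mod)$r.squared)
}

# One row per model
comparison <- rbind(unadjusted = pick(mod0), adjusted = pick(mod1))
round(comparison, 4)
```

The first column shows how much of the raw difference survives adjustment, and the second shows how much more of the variation in wage the larger model accounts for.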
It would be wrong to claim that simply including a covariate in a model guarantees that an appropriate adjustment has been made. The effectiveness of the adjustment depends on whether the model design is appropriate, for instance whether appropriate interaction terms have been included. However, it’s certainly the case that if you don’t include the covariate in the model, you have not adjusted for it.
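As one illustration of how model design can matter, the sketch below adds an interaction between sex and educ and compares the result with the additive model. The choice of this particular interaction is an assumption made for the example, not a recommendation from the text, and mod_int is a hypothetical name. As before, CPS85 is assumed to come from the mosaicData package.

```r
# Assumption: CPS85 is taken from the mosaicData package.
Cps <- mosaicData::CPS85

# Additive model: one sex difference shared across all education levels
mod1 <- lm(wage ~ age + sex + educ + sector, data = Cps)

# Hypothetical alternative design: the educ slope may differ by sex
mod_int <- lm(wage ~ age + sex * educ + sector, data = Cps)

# The interaction adds a single extra coefficient
coef(mod_int)["sexM:educ"]

# Nested-model comparison: does the interaction term improve the fit?
anova(mod1, mod_int)
```

Note that once the interaction is present, the sexM coefficient describes the difference at educ = 0, so it can no longer be read as a single adjusted difference between the sexes.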
The other approach is to subsample the data so that the levels of the covariates are approximately constant. For example, here is a subset that considers workers between the ages of 30 and 35 with 10 to 12 years of education and working in the sales sector of the economy:
small <- subset(Cps, age >= 30 & age <= 35 &
                     educ >= 10 & educ <= 12 &
                     sector == "sales")
The choice of these particular levels of age, educ, and sector is arbitrary, but you need to choose some level if you want to hold the covariates approximately constant.
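Before fitting anything to such a subset, it is worth checking how many cases survive each restriction. The sketch below counts them step by step, assuming CPS85 comes from the mosaicData package; the names in_age, in_educ, in_sector, and counts are introduced only for this example.

```r
# Assumption: CPS85 is taken from the mosaicData package.
Cps <- mosaicData::CPS85

# One logical vector per restriction used to build the subset
in_age    <- Cps$age >= 30 & Cps$age <= 35
in_educ   <- Cps$educ >= 10 & Cps$educ <= 12
in_sector <- Cps$sector == "sales"

# Number of cases remaining as each restriction is added
counts <- c(all             = nrow(Cps),
            age             = sum(in_age),
            age_educ        = sum(in_age & in_educ),
            age_educ_sector = sum(in_age & in_educ & in_sector))
counts
```

Each added condition can only shrink the subset, so holding several covariates constant at once quickly uses up the data.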
The subset of the data can be used to fit a simple model:
mod4 <- lm( wage ~ sex, data = small)
summary(mod4)
##
## Call:
## lm(formula = wage ~ sex, data = small)
##
## Residuals:
## 10 156 195
## 0.5 0.0 -0.5
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.500 0.500 9.000 0.0704 .
## sexM 4.500 0.866 5.196 0.1210
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7071 on 1 degrees of freedom
## Multiple R-squared: 0.9643, Adjusted R-squared: 0.9286
## F-statistic: 27 on 1 and 1 DF, p-value: 0.121
At first glance, there might seem to be nothing wrong with this approach and, indeed, for very large data sets it can be effective. In this case, however, there are only 3 cases that satisfy the various criteria: two women and one man.
table( small$sex )
##
## F M
## 2 1
So, the $4.50 difference in wages between the sexes depends entirely on the data from a single male! (Chapter @ref(“chap:confidence”) describes how to assess the precision of model coefficients. This one works out to be 4.50±11.00, which is not at all precise.)
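The ±11.00 can be reproduced from the printed summary alone. The sketch below assumes a 95% confidence level, which the text does not state, and takes the standard error and residual degrees of freedom from the mod4 output above; the variable names are introduced only for illustration.

```r
# Values read from the summary(mod4) output
se_sexM  <- 0.866  # standard error of the sexM coefficient
df_resid <- 1      # residual degrees of freedom (3 cases, 2 coefficients)

# Assumption: a 95% confidence level
t_star <- qt(0.975, df = df_resid)

# Margin of error = critical t value times the standard error
margin <- t_star * se_sexM
round(c(t_star = t_star, margin = margin), 2)
# t_star = 12.71, margin = 11.00
```

With only one residual degree of freedom the t multiplier is about 12.7 rather than the familiar value near 2, which is what makes the interval so wide. Calling confint(mod4) on the fitted model should give the corresponding interval directly.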