Chapter 10 Total and partial relationships

10.1 Adjustment

There are two basic approaches to adjusting for covariates. Conceptually, the simpler one is to hold the covariates constant at some level, either when collecting the data or by extracting a subset of the data in which those covariates are constant. The other approach is to include the covariates in your models.

For example, suppose you want to study the difference in wages between males and females. The very simple model wage ~ sex might give some insight, but it attributes to sex any effects that might actually be due to level of education, age, or the sector of the economy in which the person works. Here’s the result from the simple model:

Cps <- CPS85
mod0 <- lm( wage ~ sex, data = Cps)
summary(mod0)
## 
## Call:
## lm(formula = wage ~ sex, data = Cps)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -8.995 -3.529 -1.072  2.394 36.621 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   7.8789     0.3216   24.50  < 2e-16 ***
## sexM          2.1161     0.4372    4.84  1.7e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.034 on 532 degrees of freedom
## Multiple R-squared:  0.04218,    Adjusted R-squared:  0.04038 
## F-statistic: 23.43 on 1 and 532 DF,  p-value: 1.703e-06

The coefficients indicate that a typical male makes $2.12 more per hour than a typical female. (Notice that \(R^2 = 0.0422\) is very small: sex explains hardly any of the person-to-person variability in wage.)

By including the variables age, educ, and sector in the model, you can adjust for these variables:

mod1 <- lm( wage ~ age + sex + educ + sector, data = Cps)
summary(mod1)
## 
## Call:
## lm(formula = wage ~ age + sex + educ + sector, data = Cps)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -11.198  -2.695  -0.465   2.066  35.159 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   -4.69411    1.53776  -3.053 0.002384 ** 
## age            0.10221    0.01657   6.167 1.39e-09 ***
## sexM           1.94172    0.42285   4.592 5.51e-06 ***
## educ           0.61556    0.09439   6.521 1.65e-10 ***
## sectorconst    1.43552    1.13120   1.269 0.204999    
## sectormanag    3.27105    0.76685   4.266 2.37e-05 ***
## sectormanuf    0.80627    0.73115   1.103 0.270644    
## sectorother    0.75838    0.75918   0.999 0.318286    
## sectorprof     2.24777    0.66976   3.356 0.000848 ***
## sectorsales   -0.76706    0.84202  -0.911 0.362729    
## sectorservice -0.56871    0.66602  -0.854 0.393556    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.334 on 523 degrees of freedom
## Multiple R-squared:  0.3022, Adjusted R-squared:  0.2888 
## F-statistic: 22.65 on 10 and 523 DF,  p-value: < 2.2e-16

The adjusted difference between the sexes is $1.94 per hour. (The \(R^2=0.30\) from this model is considerably larger than for mod0, but still a lot of the person-to-person variation in wages has not been captured.)
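The logic behind this kind of adjustment can be sketched with a small simulation. The data below are entirely hypothetical (not the CPS85 data): one group has, on average, more education, and education raises wages. The unadjusted model folds the education effect into the group coefficient; including the covariate recovers the true group difference of 1:

```r
# Hypothetical simulation of confounding by a covariate.
set.seed(1)
n <- 10000
group <- rep(c("A", "B"), each = n / 2)
educ  <- 12 + 2 * (group == "B") + rnorm(n)          # B averages 2 more years
wage  <- 5 + 1 * (group == "B") + 0.6 * educ + rnorm(n)
coef(lm(wage ~ group))[["groupB"]]         # roughly 1 + 0.6 * 2 = 2.2
coef(lm(wage ~ group + educ))[["groupB"]]  # roughly the true difference, 1
```

With `educ` omitted, the group coefficient absorbs the two extra years of education times the $0.60 return per year; with `educ` in the model, the coefficient reflects the group difference at a fixed level of education.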

It would be wrong to claim that simply including a covariate in a model guarantees that an appropriate adjustment has been made. The effectiveness of the adjustment depends on whether the model design is suitable, for instance whether the relevant interaction terms have been included. However, it’s certainly the case that if you don’t include the covariate in the model, you have not adjusted for it.
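To illustrate what an interaction term can buy you, here is another hypothetical simulation (again, not the CPS85 data) in which the sex difference in wages grows with education. A model with the interaction wage ~ sex * educ picks up that pattern through the sex:educ coefficient, which an additive model would miss:

```r
# Hypothetical data in which the male-female gap grows with education.
set.seed(2)
n <- 5000
sex  <- rep(c("F", "M"), each = n / 2)
educ <- sample(8:18, n, replace = TRUE)
wage <- 4 + 0.5 * educ + 0.2 * (sex == "M") * educ + rnorm(n)
# wage ~ sex * educ expands to sex + educ + sex:educ; the interaction
# coefficient estimates how the sex gap changes per year of education.
mod_int <- lm(wage ~ sex * educ)
coef(mod_int)[["sexM:educ"]]   # close to the true value, 0.2
```

In this simulated world, reporting a single sex coefficient would be misleading: the gap is small for workers with little education and large for highly educated ones.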

The other approach is to subsample the data so that the levels of the covariates are approximately constant. For example, here is a subset that considers workers between the ages of 30 and 35 with 10 to 12 years of education, working in the sales sector of the economy:

small <- subset(Cps, age >= 30 & age <= 35 &
                       educ >= 10 & educ <= 12 &
                       sector == "sales")

The choice of these particular levels of age, educ, and sector is arbitrary, but you need to choose some level if you want to hold the covariates approximately constant.

The subset of the data can be used to fit a simple model:

mod4 <- lm( wage ~ sex, data = small)
summary(mod4)
## 
## Call:
## lm(formula = wage ~ sex, data = small)
## 
## Residuals:
##   10  156  195 
##  0.5  0.0 -0.5 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)  
## (Intercept)    4.500      0.500   9.000   0.0704 .
## sexM           4.500      0.866   5.196   0.1210  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7071 on 1 degrees of freedom
## Multiple R-squared:  0.9643, Adjusted R-squared:  0.9286 
## F-statistic:    27 on 1 and 1 DF,  p-value: 0.121

At first glance, there might seem to be nothing wrong with this approach and, indeed, for very large data sets it can be effective. In this case, however, there are only 3 cases that satisfy the various criteria: two women and one man.

table( small$sex )
## 
## F M 
## 2 1

So, the $4.50 wage difference between the sexes depends entirely on the data from a single male! (Chapter @ref(chap:confidence) describes how to assess the precision of model coefficients. This one works out to be \(4.50 \pm 11.00\): not at all precise.)
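The huge margin of error reflects the single residual degree of freedom. As a quick check in base R, multiplying the reported standard error of 0.866 by the 97.5% t quantile for 1 degree of freedom reproduces the \(\pm 11.00\):

```r
# With 1 residual degree of freedom the 97.5% t quantile is about 12.7,
# so even a modest standard error yields a very wide interval.
qt(0.975, df = 1) * 0.866   # about 11.0
```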