Chapter 12 Confidence in models

Regression reports are generated using software you have already encountered: lm to fit a model and summary to construct the report from the fitted model. To illustrate with the SwimRecords data from the mosaicData package:

mod <- lm(time ~ year + sex, data = SwimRecords)
summary(mod)

## 
## Call:
## lm(formula = time ~ year + sex, data = SwimRecords)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.7027 -2.7027 -0.5968  1.2796 19.0759 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 555.71678   33.79991  16.441  < 2e-16 ***
## year         -0.25146    0.01732 -14.516  < 2e-16 ***
## sexM         -9.79796    1.01287  -9.673 8.79e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.983 on 59 degrees of freedom
## Multiple R-squared:  0.844,  Adjusted R-squared:  0.8387 
## F-statistic: 159.6 on 2 and 59 DF,  p-value: < 2.2e-16

12.1 Confidence Intervals from Standard Errors

Given the coefficient estimate and the standard error from the regression report, the confidence interval is easily generated.
For a 95% confidence interval, you just multiply the standard error by 2 to get the margin of error. For example, in the above, the margin of error on sex M is $2 \times 1.013 = 2.03$ , or, in computer notation:

2 * 1.0129

## [1] 2.0258

If you want the two endpoints of the confidence interval, rather than just the margin of error, do these simple calculations: (1) subtract the margin of error from the estimate; (2) add the margin of error to the estimate. So,

-9.798 - 2*1.0129

## [1] -11.8238

-9.798 + 2*1.0129

## [1] -7.7722

The key thing is to remember the multiplier that is applied to the standard error. A multiplier of approximately 2 is for a 95% confidence interval.

The confint() function provides a convenient way to calculate confidence intervals directly.

It calculates the exact multiplier (which depends somewhat on the sample size) and applies it to the standard error to produce the confidence intervals.

mod <- lm( time ~ year + sex, data = SwimRecords)
confint(mod)

##                   2.5 %      97.5 %
## (Intercept) 488.0833106 623.3502562
## year         -0.2861277  -0.2167997
## sexM        -11.8247135  -7.7712094

It would be convenient if the regression report included confidence intervals rather than the standard error. Part of the reason it doesn’t is historical: the desire to connect to the traditional by-hand calculations.

12.2 Bootstrapping Confidence Intervals

Confidence intervals on model coefficients can be computed using the same bootstrapping technique introduced in Chapter @ref(“chap:statistical-inference”).

Start with your fitted model. To illustrate, here is a model of world-record swimming time over the years, taking into account sex:

lm( time ~ year + sex, data = SwimRecords)

## 
## Call:
## lm(formula = time ~ year + sex, data = SwimRecords)
## 
## Coefficients:
## (Intercept)         year         sexM  
##    555.7168      -0.2515      -9.7980

These coefficients reflect one hypothetical draw from the population-based sampling distribution. It’s impossible to get another draw from the “population” here: the actual records are all you’ve got.

But to approximate sampling variation, you can treat the sample as your population and re-sample:

lm( time ~ year + sex, data = resample(SwimRecords))

## 
## Call:
## lm(formula = time ~ year + sex, data = resample(SwimRecords))
## 
## Coefficients:
## (Intercept)         year         sexM  
##    533.8004      -0.2402      -9.9231

Constructing many such re-sampling trials and collect the results

s = do(500) * lm( time ~ year + sex, data = resample(SwimRecords))
head(s)

##   Intercept       year       sexM    sigma r.squared        F numdf dendf .row .index
## 1  551.9951 -0.2495943  -9.495649 2.841716 0.9078873 290.7598     2    59    1      1
## 2  555.8467 -0.2513771 -10.600461 4.007877 0.8340924 148.3099     2    59    1      2
## 3  693.8656 -0.3217095  -9.636199 5.493706 0.8373505 151.8716     2    59    1      3
## 4  633.5587 -0.2911209  -9.212951 4.760334 0.8182637 132.8231     2    59    1      4
## 5  588.4652 -0.2683256  -9.557587 3.912234 0.8497676 166.8624     2    59    1      5
## 6  544.5205 -0.2460383  -9.379971 3.127280 0.8878750 233.5992     2    59    1      6

To find the standard error of the coefficients, just take the standard deviation across the re-sampling trials. For the indicator variable sex M:

sd(s$sexM)

## [1] 0.9242203

Multiplying the standard error by 2 gives the approximate 95% margin of error. Alternatively, you can use the confint() function to calculate this for you:

confint(s$sexM, method = "stderr")

## Confidence Interval from Bootstrap Distribution (500 replicates)

##              V1        V2
## stderr -11.5434 -7.911714

12.3 Prediction Confidence Intervals

When a model is used to make a prediction, it’s helpful to be able to describe how precise the prediction is. For instance, suppose you want to use the KidsFeet data set (from mosaicData) to make a prediction of the foot width of a girl whose foot length is 25 cm.

First, fit your model:

names(KidsFeet)

## [1] "name"       "birthmonth" "birthyear"  "length"     "width"      "sex"        "biggerfoot"
## [8] "domhand"

levels(KidsFeet$sex)

## [1] "B" "G"

mod <- lm( width ~ length + sex, data = KidsFeet)

Now apply the model to the new data for which you want to make a prediction. Take care to use the right coding for categorical variables.

predict(mod, newdata = data.frame( length=25, sex="G" ))

##        1 
## 8.934275

In order to generate a confidence interval, the predict operator needs to be told what type of interval is wanted. There are two types of prediction confidence intervals:

Interval on the model value which reflects the sampling distributions of the coefficients themselves. To calculate this, use the interval = "confidence" named argument:

predict(mod, newdata = data.frame( length = 25, sex = "G" ), 
         interval = "confidence")

##        fit      lwr      upr
## 1 8.934275 8.742565 9.125985

The components named lwr and upr are the lower and upper limits of the confidence interval, respectively.

Interval on the prediction which includes the variation due to the uncertainty in the coefficients as well as the size of a typical residual. To find this interval, use the interval = "prediction" named argument:

predict(mod, newdata = data.frame( length = 25, sex = "G" ), 
         interval = "prediction")

##        fit      lwr      upr
## 1 8.934275 8.130484 9.738066

The prediction interval is larger than the model-value confidence interval because the residual always gives additional uncertainty around the model value. Predicting an individual’s foot width, even if we know her sex and foot length, involves a lot more uncertainty than predicting the mean foot width of all individuals with these particular characteristics.