Chapter 12 Confidence in models
Regression reports are generated using software you have already encountered: lm
to fit a model and summary
to construct the report from the fitted model. To illustrate with the SwimRecords
data from the mosaicData
package:
mod <- lm(time ~ year + sex, data = SwimRecords)
summary(mod)
##
## Call:
## lm(formula = time ~ year + sex, data = SwimRecords)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.7027 -2.7027 -0.5968 1.2796 19.0759
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 555.71678 33.79991 16.441 < 2e-16 ***
## year -0.25146 0.01732 -14.516 < 2e-16 ***
## sexM -9.79796 1.01287 -9.673 8.79e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.983 on 59 degrees of freedom
## Multiple R-squared: 0.844, Adjusted R-squared: 0.8387
## F-statistic: 159.6 on 2 and 59 DF, p-value: < 2.2e-16
12.1 Confidence Intervals from Standard Errors
Given the coefficient estimate and the standard error from the regression report, the confidence interval is easily generated.
For a 95% confidence interval, you just multiply the standard error by 2 to get the margin of error. For example, in the above, the margin of error on sex M
is 2×1.013=2.03, or, in computer notation:
2 * 1.0129
## [1] 2.0258
If you want the two endpoints of the confidence interval, rather than just the margin of error, do these simple calculations: (1) subtract the margin of error from the estimate; (2) add the margin of error to the estimate. So,
-9.798 - 2*1.0129
## [1] -11.8238
-9.798 + 2*1.0129
## [1] -7.7722
The key thing is to remember the multiplier that is applied to the standard error. A multiplier of approximately 2 is for a 95% confidence interval.
The confint()
function provides a convenient way to calculate confidence intervals directly.
It calculates the exact multiplier (which depends somewhat on the sample size) and applies it to the standard error to produce the confidence intervals.
mod <- lm( time ~ year + sex, data = SwimRecords)
confint(mod)
## 2.5 % 97.5 %
## (Intercept) 488.0833106 623.3502562
## year -0.2861277 -0.2167997
## sexM -11.8247135 -7.7712094
It would be convenient if the regression report included confidence intervals rather than the standard error. Part of the reason it doesn’t is historical: the desire to connect to the traditional by-hand calculations.
12.2 Bootstrapping Confidence Intervals
Confidence intervals on model coefficients can be computed using the same bootstrapping technique introduced in Chapter @ref(“chap:statistical-inference”).
Start with your fitted model. To illustrate, here is a model of world-record swimming time over the years, taking into account sex:
lm( time ~ year + sex, data = SwimRecords)
##
## Call:
## lm(formula = time ~ year + sex, data = SwimRecords)
##
## Coefficients:
## (Intercept) year sexM
## 555.7168 -0.2515 -9.7980
These coefficients reflect one hypothetical draw from the population-based sampling distribution. It’s impossible to get another draw from the “population” here: the actual records are all you’ve got.
But to approximate sampling variation, you can treat the sample as your population and re-sample:
lm( time ~ year + sex, data = resample(SwimRecords))
##
## Call:
## lm(formula = time ~ year + sex, data = resample(SwimRecords))
##
## Coefficients:
## (Intercept) year sexM
## 533.8004 -0.2402 -9.9231
Constructing many such re-sampling trials and collect the results
s = do(500) * lm( time ~ year + sex, data = resample(SwimRecords))
head(s)
## Intercept year sexM sigma r.squared F numdf dendf .row .index
## 1 551.9951 -0.2495943 -9.495649 2.841716 0.9078873 290.7598 2 59 1 1
## 2 555.8467 -0.2513771 -10.600461 4.007877 0.8340924 148.3099 2 59 1 2
## 3 693.8656 -0.3217095 -9.636199 5.493706 0.8373505 151.8716 2 59 1 3
## 4 633.5587 -0.2911209 -9.212951 4.760334 0.8182637 132.8231 2 59 1 4
## 5 588.4652 -0.2683256 -9.557587 3.912234 0.8497676 166.8624 2 59 1 5
## 6 544.5205 -0.2460383 -9.379971 3.127280 0.8878750 233.5992 2 59 1 6
To find the standard error of the coefficients, just take the standard deviation across the re-sampling trials. For the indicator variable sex M
:
sd(s$sexM)
## [1] 0.9242203
Multiplying the standard error by 2 gives the approximate 95% margin of error. Alternatively, you can use the confint()
function to calculate this for you:
confint(s$sexM, method = "stderr")
## Confidence Interval from Bootstrap Distribution (500 replicates)
## V1 V2
## stderr -11.5434 -7.911714
12.3 Prediction Confidence Intervals
When a model is used to make a prediction, it’s helpful to be able to describe how precise the prediction is. For instance, suppose you want to use the KidsFeet
data set (from mosaicData
) to make a prediction of the foot width of a girl whose foot length is 25 cm.
First, fit your model:
names(KidsFeet)
## [1] "name" "birthmonth" "birthyear" "length" "width" "sex" "biggerfoot"
## [8] "domhand"
levels(KidsFeet$sex)
## [1] "B" "G"
mod <- lm( width ~ length + sex, data = KidsFeet)
Now apply the model to the new data for which you want to make a prediction. Take care to use the right coding for categorical variables.
predict(mod, newdata = data.frame( length=25, sex="G" ))
## 1
## 8.934275
In order to generate a confidence interval, the predict
operator needs to be told what type of interval is wanted. There are two types of prediction confidence intervals:
- Interval on the model value which reflects the sampling distributions of the coefficients themselves. To calculate this, use the
interval = "confidence"
named argument:
predict(mod, newdata = data.frame( length = 25, sex = "G" ),
interval = "confidence")
## fit lwr upr
## 1 8.934275 8.742565 9.125985
The components named lwr
and upr
are the lower and upper limits of the confidence interval, respectively.
- Interval on the prediction which includes the variation due to the uncertainty in the coefficients as well as the size of a typical residual. To find this interval, use the
interval = "prediction"
named argument:
predict(mod, newdata = data.frame( length = 25, sex = "G" ),
interval = "prediction")
## fit lwr upr
## 1 8.934275 8.130484 9.738066
The prediction interval is larger than the model-value confidence interval because the residual always gives additional uncertainty around the model value. Predicting an individual’s foot width, even if we know her sex and foot length, involves a lot more uncertainty than predicting the mean foot width of all individuals with these particular characteristics.