Chapter 16 Models of Yes/No Variables
Fitting logistic models uses many of the same ideas as in linear models.
library(mosaic)
16.1 Fitting Logistic Models
The glm operator fits logistic models. (It also fits other kinds of models, but that's another story.) glm takes model design and data arguments that are identical to their counterparts in lm. Here's an example using the smoking/mortality data in Whickham:
mod = glm( outcome ~ age + smoker, data=Whickham,
family="binomial")
The last argument, family="binomial", simply specifies to glm() that the logistic transformation should be used. glm() is short for Generalized Linear Modeling, a broad label that covers logistic regression as well as other types of models involving links and transformations.
The regression report is produced with the summary operator, which recognizes that the model was fit logistically and does the right thing:
summary(mod)
##
## Call:
## glm(formula = outcome ~ age + smoker, family = "binomial", data = Whickham)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.9581 -0.5458 -0.2228 0.4381 3.2795
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -7.599221 0.441231 -17.223 <2e-16 ***
## age 0.123683 0.007177 17.233 <2e-16 ***
## smokerYes 0.204699 0.168422 1.215 0.224
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1560.32 on 1313 degrees of freedom
## Residual deviance: 945.02 on 1311 degrees of freedom
## AIC: 951.02
##
## Number of Fisher Scoring iterations: 6
Keep in mind that the coefficients refer to the intermediate model values Y (the link values). The probability P will be exp(Y)/(1 + exp(Y)).
16.2 Fitted Model Values
Logistic regression involves two different kinds of fitted values: the intermediate "link" value Y and the probability P. The fitted operator returns the probabilities:
probs = fitted(mod)
There is one fitted probability value for each case.
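A quick check (a sketch re-fitting the same model as above) confirms there is one probability per case and that every one of them lies between zero and one:

```r
library(mosaic)  # provides the Whickham data

mod = glm(outcome ~ age + smoker, data = Whickham, family = "binomial")
probs = fitted(mod)

length(probs)   # 1314: one fitted probability per case in Whickham
range(probs)    # both endpoints lie strictly between 0 and 1
```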
The link values can be obtained via the predict operator:
linkvals = predict(mod, type="link")
head(linkvals)
## 1 2 3 4 5 6
## -4.5498092 -5.1682251 1.3869832 0.6875514 0.3165019 -2.6945616
Notice that the link values are not necessarily between zero and one.
The predict operator can also be used to calculate the probability values:
response_vals = predict(mod, type="response")
head(response_vals)
## 1 2 3 4 5 6
## 0.010458680 0.005662422 0.800110184 0.665421999 0.578471493 0.063295027
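As a consistency check (a sketch re-fitting the same model), applying the transformation P = exp(Y)/(1 + exp(Y)) to the link values reproduces the response values:

```r
library(mosaic)  # provides the Whickham data

mod = glm(outcome ~ age + smoker, data = Whickham, family = "binomial")
linkvals = predict(mod, type = "link")

# Transform the link values Y into probabilities P by hand ...
byhand = exp(linkvals) / (1 + exp(linkvals))

# ... and compare with what predict() reports directly.
all.equal(as.numeric(byhand),
          as.numeric(predict(mod, type = "response")))
```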
This is particularly useful when you want to use predict to find the model values for inputs other than the original data frame used for fitting.
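For instance, here is a sketch using a hypothetical new case, a 60-year-old smoker (the input values are made up for illustration, not drawn from the text):

```r
library(mosaic)  # provides the Whickham data

mod = glm(outcome ~ age + smoker, data = Whickham, family = "binomial")

# A hypothetical input, not a row of Whickham; the column names
# must match the explanatory variables in the model.
newcase = data.frame(age = 60, smoker = "Yes")
predict(mod, newdata = newcase, type = "response")
```

Plugging the reported coefficients into the link formula gives Y ≈ -7.599 + 0.124 × 60 + 0.205 ≈ 0.03, so the fitted probability comes out close to one half.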
16.3 Which Level is “Yes”?
In fitting a logistic model, it's crucial that the response variable be categorical, with two levels. It happens that in the Whickham data, the outcome variable fits the bill: the levels are Alive and Dead.
The glm software will automatically recode the response variable as 0/1. The question is, which level gets mapped to 1? In some sense, it makes no difference since there are only two levels. But if you're talking about the probability of dying, it's nice not to mistake that for the probability of staying alive. So make sure that you know which level in the response variable corresponds to 1: it's the second level.
Here is an easy way to make sure which level has been coded as "Yes." First, fit the all-cases-the-same model, outcome ~ 1. The fitted model value P from this model will be the proportion of cases for which the outcome was "Yes."
mod2 = glm( outcome ~ 1, data = Whickham, family = "binomial")
fitted(mod2)
The fitted value is 0.2808 for every case, so 28% of the cases were "Yes." But which of the two levels is "Yes"? Find out just by looking at the proportion of each level:
tally( ~ outcome, data = Whickham, format = "proportion")
##
## Alive Dead
## 0.7191781 0.2808219
If you want to dictate which of the two levels is going to be encoded as 1, you can use a comparison operation to do so:
mod3 = glm( outcome == "Alive" ~ 1, data=Whickham,
family="binomial")
mod3
In this model, "Yes" means Alive.
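Since this model codes Alive as 1, its fitted value should equal the Alive proportion seen in the tally above, about 0.719. A quick sketch confirms this:

```r
library(mosaic)  # provides the Whickham data

mod3 = glm(outcome == "Alive" ~ 1, data = Whickham, family = "binomial")

# Every case gets the same fitted value: the overall proportion Alive.
head(fitted(mod3))   # each value is about 0.7192
```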
16.4 Analysis of Deviance
The same basic logic used in analysis of variance applies to logistic regression, although the quantity being broken down into parts is not the sum of squares of the residuals but, rather, the deviance.
The anova software will take apart a logistic model, term by term, using the order specified in the model.
anova(mod, test="Chisq")
## Analysis of Deviance Table
##
## Model: binomial, link: logit
##
## Response: outcome
##
## Terms added sequentially (first to last)
##
##
## Df Deviance Resid. Df Resid. Dev Pr(>Chi)
## NULL 1313 1560.32
## age 1 613.81 1312 946.51 <2e-16 ***
## smoker 1 1.49 1311 945.02 0.2228
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Notice the second argument, test="Chisq", which instructs anova to calculate a p-value for each term. This involves a slightly different test than the F test used in linear-model ANOVA.
The format of the ANOVA table for logistic regression is somewhat different from that used in linear models, but the concepts of degrees of freedom and partitioning still apply. The basic idea is to ask whether the reduction in deviance accomplished with the model terms is greater than what would be expected if random terms were used instead.
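In this example, the deviance drops from 1560.32 (null) to 945.02 (residual), a reduction of 615.30, which the table partitions as 613.81 for age plus 1.49 for smoker. That arithmetic can be checked directly from the fitted model (a sketch using the same model as above):

```r
library(mosaic)  # provides the Whickham data

mod = glm(outcome ~ age + smoker, data = Whickham, family = "binomial")

# Total reduction in deviance achieved by age and smoker together:
mod$null.deviance - deviance(mod)   # 1560.32 - 945.02 = 615.30
```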