Chapter 16 Models of Yes/No Variables
Fitting logistic models uses many of the same ideas as in linear models.
library(mosaic)
16.1 Fitting Logistic Models
The glm operator fits logistic models. (It also fits other kinds of models, but that's another story.) glm takes model design and data arguments that are identical to their counterparts in lm. Here's an example using the smoking/mortality data in Whickham:
mod = glm( outcome ~ age + smoker, data=Whickham,
family="binomial")
The last argument, family="binomial", simply specifies to glm() that the logistic transformation should be used. glm() is short for Generalized Linear Modeling, a broad label that covers logistic regression as well as other types of models involving links and transformations.
The regression report is produced with the summary operator, which recognizes that the model was fit logistically and does the right thing:
summary(mod)
##
## Call:
## glm(formula = outcome ~ age + smoker, family = "binomial", data = Whickham)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.9581 -0.5458 -0.2228 0.4381 3.2795
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -7.599221 0.441231 -17.223 <2e-16 ***
## age 0.123683 0.007177 17.233 <2e-16 ***
## smokerYes 0.204699 0.168422 1.215 0.224
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1560.32 on 1313 degrees of freedom
## Residual deviance: 945.02 on 1311 degrees of freedom
## AIC: 951.02
##
## Number of Fisher Scoring iterations: 6
Keep in mind that the coefficients refer to the intermediate model values Y (the link values). The probability P will be exp(Y)/(1 + exp(Y)).
16.2 Fitted Model Values
Logistic regression involves two different kinds of fitted values: the intermediate "link" value Y and the probability P. The fitted operator returns the probabilities:
probs = fitted(mod)
There is one fitted probability value for each case.
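A quick check (a sketch re-fitting the same model as above) confirms there is one probability per case and that every one of them lies between zero and one:

```r
library(mosaic)  # provides the Whickham data

mod = glm(outcome ~ age + smoker, data = Whickham, family = "binomial")
probs = fitted(mod)

length(probs)   # 1314: one fitted probability per case in Whickham
range(probs)    # both endpoints lie strictly between 0 and 1
```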
The link values can be obtained via the predict operator:
linkvals = predict(mod, type="link")
head(linkvals)
## 1 2 3 4 5 6
## -4.5498092 -5.1682251 1.3869832 0.6875514 0.3165019 -2.6945616
Notice that the link values are not necessarily between zero and one.
The predict operator can also be used to calculate the probability values:
response_vals = predict(mod, type="response")
head(response_vals)
## 1 2 3 4 5 6
## 0.010458680 0.005662422 0.800110184 0.665421999 0.578471493 0.063295027
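As a consistency check (a sketch re-fitting the same model), applying the transformation P = exp(Y)/(1 + exp(Y)) to the link values reproduces the response values:

```r
library(mosaic)  # provides the Whickham data

mod = glm(outcome ~ age + smoker, data = Whickham, family = "binomial")
linkvals = predict(mod, type = "link")

# Transform the link values Y into probabilities P by hand ...
byhand = exp(linkvals) / (1 + exp(linkvals))

# ... and compare with what predict() reports directly.
all.equal(as.numeric(byhand),
          as.numeric(predict(mod, type = "response")))
```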
This is particularly useful when you want to use predict to find the model values for inputs other than the original data frame used for fitting.
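For instance, here is a sketch using a hypothetical new case, a 60-year-old smoker (the input values are made up for illustration, not drawn from the text):

```r
library(mosaic)  # provides the Whickham data

mod = glm(outcome ~ age + smoker, data = Whickham, family = "binomial")

# A hypothetical input, not a row of Whickham; the column names
# must match the explanatory variables in the model.
newcase = data.frame(age = 60, smoker = "Yes")
predict(mod, newdata = newcase, type = "response")
```

Plugging the reported coefficients into the link formula gives Y ≈ -7.599 + 0.124 × 60 + 0.205 ≈ 0.03, so the fitted probability comes out close to one half.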
16.3 Which Level is “Yes”?
In fitting a logistic model, it's crucial that the response variable be categorical, with two levels. It happens that in the Whickham data, the outcome variable fits the bill: the levels are Alive and Dead.
The glm software will automatically recode the response variable as 0/1. The question is, which level gets mapped to 1? In some sense, it makes no difference since there are only two levels. But if you're talking about the probability of dying, it's nice not to mistake that for the probability of staying alive. So make sure that you know which level in the response variable corresponds to 1: it's the second level.
Here is an easy way to make sure which level has been coded as "Yes." First, fit the all-cases-the-same model, outcome ~ 1. The fitted model value P from this model will be the proportion of cases for which the outcome was "Yes."
mod2 = glm( outcome ~ 1, data = Whickham, family = "binomial")
fitted(mod2)
The fitted value is 0.2808 for every case, so 28% of the cases were "Yes." But which of the two levels is "Yes"? Find out just by looking at the proportion of each level:
tally( ~ outcome, data = Whickham, format = "proportion")
##
## Alive Dead
## 0.7191781 0.2808219
If you want to dictate which of the two levels is going to be encoded as 1, you can use a comparison operation to do so:
mod3 = glm( outcome == "Alive" ~ 1, data=Whickham,
family="binomial")
mod3
In this model, "Yes" means Alive.
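Since this model codes Alive as 1, its fitted value should equal the Alive proportion seen in the tally above, about 0.719. A quick sketch confirms this:

```r
library(mosaic)  # provides the Whickham data

mod3 = glm(outcome == "Alive" ~ 1, data = Whickham, family = "binomial")

# Every case gets the same fitted value: the overall proportion Alive.
head(fitted(mod3))   # each value is about 0.7192
```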
16.4 Analysis of Deviance
The same basic logic used in analysis of variance applies to logistic regression, although the quantity being broken down into parts is not the sum of squares of the residuals but, rather, the deviance.
The anova software will take apart a logistic model, term by term, using the order specified in the model.
anova(mod, test="Chisq")
## Analysis of Deviance Table
##
## Model: binomial, link: logit
##
## Response: outcome
##
## Terms added sequentially (first to last)
##
##
## Df Deviance Resid. Df Resid. Dev Pr(>Chi)
## NULL 1313 1560.32
## age 1 613.81 1312 946.51 <2e-16 ***
## smoker 1 1.49 1311 945.02 0.2228
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Notice the second argument, test="Chisq", which instructs anova to calculate a p-value for each term. This involves a slightly different test than the F test used in linear-model ANOVA.
The format of the ANOVA table for logistic regression is somewhat different from that used in linear models, but the concepts of degrees of freedom and partitioning still apply. The basic idea is to ask whether the reduction in deviance accomplished with the model terms is greater than what would be expected if random terms were used instead.
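In this example, the deviance drops from 1560.32 (null) to 945.02 (residual), a reduction of 615.30, which the table partitions as 613.81 for age plus 1.49 for smoker. That arithmetic can be checked directly from the fitted model (a sketch using the same model as above):

```r
library(mosaic)  # provides the Whickham data

mod = glm(outcome ~ age + smoker, data = Whickham, family = "binomial")

# Total reduction in deviance achieved by age and smoker together:
mod$null.deviance - deviance(mod)   # 1560.32 - 945.02 = 615.30
```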