Chapter 16 Problems      Statistical Modeling: A Fresh Approach (2/e)

Reading Questions.

Prob 16.01. The graphs show the link values and the corresponding probability values for a logistic model where x is the explanatory variable.

Use the graphs to look up answers to the following. Choose the closest possibility to what you see in the graphs.

Prob 16.02. The NASA space shuttle Challenger had a catastrophic accident during launch on January 28, 1986. Photographic evidence from the launch showed that the accident resulted from a plume of hot flame from the side of one of the booster rockets which cut into the main fuel tank. US President Reagan appointed a commission to investigate the accident. The commission concluded that the jet was due to the failure of an O-ring gasket between segments of the booster rocket.

[Figure: A NASA photograph showing the plume of flame from the side of the booster rocket during the Challenger launch.]

An important issue for the commission was whether the accident was avoidable. Attention focused on the fact that the ground temperature at the time of launch was 31°F, much lower than for any previous launch. Commission member and Nobel laureate physicist Richard Feynman famously demonstrated, using a glass of ice water and a C-clamp, that the O-rings were very inflexible when cold. But did the data available to NASA before the launch indicate a high risk of an O-ring failure?

Here is the information available at the time of Challenger’s launch from the previous shuttle launches:

Flight     Temp  Damage      Flight     Temp  Damage
STS-1       66   no          STS-2       70   yes
STS-3       69   no          STS-4       80   NA
STS-5       68   no          STS-6       67   no
STS-7       72   no          STS-8       73   no
STS-9       70   no          STS 41-B    57   yes
STS 41-C    63   yes         STS 41-D    70   yes
STS 41-G    78   no          STS 51-A    67   no
STS 51-B    75   no          STS 51-C    53   yes
STS 51-D    67   no          STS 51-F    81   no
STS 51-G    70   no          STS 51-I    76   no
STS 51-J    79   no          STS 61-A    75   yes
STS 61-B    76   no          STS 61-C    58   yes

Using these data, you can fit a logistic model to estimate the probability of failure at any temperature.

> mod = glm(Damage ~ Temp, family="binomial")
> summary(mod)  
Coefficients:  
            Estimate Std. Error z value Pr(>|z|)  
(Intercept)  15.0429     7.3786   2.039   0.0415  
Temp         -0.2322     0.1082  -2.145   0.0320

Use the coefficients to find the link value for these launch temperatures:

Convert the link value to a probability value for the launch temperatures:
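Both steps can be sketched in Python. The coefficients come from the summary above; the temperature 67°F used here is only an illustrative value, not one of the problem's launch temperatures:

```python
import math

# Coefficients read off the summary table above
intercept = 15.0429
slope = -0.2322

def link_value(temp):
    """Link (log-odds) value at a given launch temperature."""
    return intercept + slope * temp

def probability(temp):
    """Convert the link value to a probability with the logistic transform."""
    return 1 / (1 + math.exp(-link_value(temp)))

# Illustrative temperature of 67 F (typical of the previous launches):
print(link_value(67))    # link value: about -0.51
print(probability(67))   # probability of damage: about 0.37
```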

A more complete analysis of the situation would take into account the fact that there are multiple O-rings in each booster, while the Damage variable describes whether any O-ring failed. In addition, there were two O-rings on each booster segment, both of which would have to fail to create a leakage problem. Thus, the probabilities estimated from this model and these data do not accurately reflect the probability of a catastrophic accident.

Prob 16.04. George believes in astrology and wants to check whether a person’s sign influences whether they are left- or right-handed. With great effort, he collects data on 100 people, recording their dominant hand and their astrological sign. He builds a logistic model hand ~ sign. The deviance from the model hand ~ 1 is 102.8 on 99 degrees of freedom. Including the sign term in the model reduces the deviance to 63.8 on 88 degrees of freedom.

The sign term only reduced the degrees of freedom by 11 (that is, from 99 to 88) even though there are 12 astrological signs. Why?

A. There must have been one sign not represented among the 100 people in George’s sample.

B. sign is redundant with the intercept and so one level is lost.

C. hand uses up one degree of freedom.


According to theory, if sign were unrelated to hand, the 11 degrees of freedom ought to reduce the deviance by how much, on average?

A. (11/99) × 102.8

B. (1/11) × 102.8

C. To zero.

D. None of the above.


Prob 16.05. This problem traces through some of the steps in fitting a model of a yes/no process. For specificity, pretend that the data are from observations of a random sample of teenaged drivers. The response variable is whether or not the driver was in an accident during one year (birthday to birthday). The explanatory variables are sex and age of the driver. The model being fit is accident ~ 1 + age + sex.

Here is a very small, fictitious set of data.

Case  Age  Sex  Accident?
 1    17    F     Yes
 2    17    M     No
 3    18    M     Yes
 4    19    F     No

Even if it weren’t fictitious, it would be too small for any practical purpose. But it will serve to illustrate the principles of fitting.

In fitting the model, the computer compares the likelihoods of various candidate values for the coefficients, choosing those coefficients that maximize the likelihood of the model.

Consider these two different candidate coefficients:




Candidate A Coefficients
Intercept   age   sexF
   35       -2     -1

Candidate B Coefficients
Intercept   age   sexF
   35       -2      0

The link value is found by multiplying the coefficients by the values of the explanatory variables in the usual way.

The link value is converted to a probability value by using the logistic transform.

The probability value is converted to a likelihood by calculating the probability of the observed outcome according to the probability value. When the outcome is “Yes,” the likelihood is just the same as the probability value. But when the outcome is “No,” the likelihood is 1 minus the probability value.

To compute the likelihood of the entire set of observations under the candidate coefficients, multiply together the likelihoods for all the cases. Do this calculation separately for the candidate A coefficients and the candidate B coefficients. Show your work, and say which of the two candidates gives the bigger likelihood.

In an actual fitting calculation, the computer goes through large numbers of candidate coefficients in a systematic way to find the candidate with the largest possible likelihood: the maximum likelihood candidate. Explain why it makes sense to choose the candidate with the maximum rather than the minimum likelihood.
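The recipe above (link value, logistic transform, per-case likelihood, product over cases) can be sketched as a short Python function. Plugging in candidates A and B is left for you; the sanity check below uses all-zero coefficients instead:

```python
import math

# The four fictitious cases: (age, sexF indicator, observed outcome)
cases = [(17, 1, "Yes"), (17, 0, "No"), (18, 0, "Yes"), (19, 1, "No")]

def likelihood(intercept, b_age, b_sexF):
    """Multiply together the per-case likelihoods under the given coefficients."""
    total = 1.0
    for age, sexF, outcome in cases:
        link = intercept + b_age * age + b_sexF * sexF  # link value
        p_yes = 1 / (1 + math.exp(-link))               # logistic transform
        # likelihood of the observed outcome for this case
        total *= p_yes if outcome == "Yes" else (1 - p_yes)
    return total

# Sanity check: with all-zero coefficients every p(Yes) is 0.5,
# so the likelihood of the four cases is 0.5**4 = 0.0625.
print(likelihood(0, 0, 0))  # 0.0625
```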

Prob 16.10. The National Osteoporosis Risk Assessment (NORA)[?] studied about 200,000 postmenopausal women aged 50 years or older in the United States. When entering the study, 14,412 of these women had osteoporosis as defined by a bone-mineral density “T score.” In studying the risk factors for the development of osteoporosis, the researchers fit a logistic regression model.

The coefficients in a logistic regression model can be directly interpreted as the logarithm of an odds ratio — the “log odds ratio.” In presenting results from logistic regression, it’s common to exponentiate the coefficients, that is, to compute e^coef to produce a simple odds ratio.

The table below shows the coefficients in odds ratio form from the NORA model. There were many explanatory variables in the model: age group, years since menopause, health status, etc. All of these were arranged to be categorical variables, so there is one coefficient for each level of each variable. As always, one level of each variable serves as a reference level. For instance, in the table below, the age group 50-54 is the reference level. The odds ratio for the reference level is always given as 1.00, and the other odds ratios are always with respect to this reference. So, women in the 55-59 age group have odds of having osteoporosis that are 1.79 times bigger than women in the 50-54 age group. In contrast, women who are 6-10 years since menopause have odds of having osteoporosis that are 0.79 as big as women who are 5 or fewer years since menopause.

An odds ratio of 1 means that the group has the same probability value as the reference group. Odds ratios bigger than 1 mean the group is more likely to have osteoporosis than the reference group; odds ratios smaller than 1 mean the group is less likely to have the condition.

The 95% confidence interval on the odds ratio indicates the precision of the estimate from the available data. When the confidence interval for a coefficient includes 1.00, the null hypothesis that the population odds ratio is 1 cannot be rejected at a 0.05 significance level. For example, the odds ratio for a self-rated health status level of “very good” is 1.04 compared to those in “excellent” health. But the confidence interval, 0.97 to 1.13, includes 1.00, indicating that the evidence is weak that women in very good health have a different risk of developing osteoporosis compared to women in excellent health.

For some variables, e.g., “college education or higher,” no reference level is given. This is simply because the variable has just two levels. The other level serves as the reference.
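The coefficient-to-odds-ratio conversion is just exponentiation. Here is a minimal sketch, using a made-up coefficient of 0.58 rather than any value published by NORA:

```python
import math

def to_odds_ratio(coef):
    """A logistic-regression coefficient is a log odds ratio;
    exponentiating it gives the odds ratio reported in tables."""
    return math.exp(coef)

# Hypothetical coefficient of 0.58: exp(0.58) is about 1.79,
# the kind of odds ratio shown in the table for the 55-59 age group.
print(to_odds_ratio(0.58))
```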

Age group (years)                  Odds Ratio (95% CI)
  50-54                            1.00 (Referent)
  55-59                            1.79 (1.56-2.06)
  60-64                            3.84 (3.37-4.37)
  65-69                            5.94 (5.24-6.74)
  70-74                            9.54 (8.42-10.81)
  75-79                            14.34 (12.64-16.26)
  ≥80                              22.56 (19.82-25.67)

Years since menopause              Odds Ratio (95% CI)
  ≤5                               1.00 (Referent)
  6-10                             0.79 (0.70-0.89)
  11-15                            0.83 (0.76-0.91)
  16-20                            0.96 (0.89-1.03)
  21-25                            1.01 (0.95-1.08)
  26-30                            1.02 (0.95-1.09)
  31-35                            1.10 (1.03-1.19)
  36-40                            1.14 (1.05-1.24)
  ≥41                              1.24 (1.14-1.35)

College educ or higher             0.91 (0.87-0.94)

Self-rated health status           Odds Ratio (95% CI)
  Excellent                        1.00 (Referent)
  Very good                        1.04 (0.97-1.13)
  Good                             1.23 (1.14-1.33)
  Fair/poor                        1.62 (1.50-1.76)

Fracture history                   Odds Ratio (95% CI)
  Hip                              1.96 (1.75-2.20)
  Wrist                            1.90 (1.77-2.03)
  Spine                            1.34 (1.17-1.54)
  Rib                              1.43 (1.32-1.56)

Maternal history of osteoporosis   1.08 (1.01-1.17)

Maternal history of fracture       1.16 (1.11-1.22)

Race/ethnicity                     Odds Ratio (95% CI)
  White                            1.00 (Referent)
  African American                 0.55 (0.48-0.62)
  Native American                  0.97 (0.82-1.14)
  Hispanic                         1.31 (1.19-1.44)
  Asian                            1.56 (1.32-1.85)

Body mass index, kg/m²             Odds Ratio (95% CI)
  ≤23                              1.00 (Referent)
  23.01-25.99                      0.46 (0.44-0.48)
  26.00-29.99                      0.27 (0.26-0.28)
  ≥30                              0.16 (0.15-0.17)

Current medication use             Odds Ratio (95% CI)
  Cortisone                        1.63 (1.47-1.81)
  Diuretics                        0.81 (0.76-0.85)

Estrogen use                       Odds Ratio (95% CI)
  Former                           0.77 (0.73-0.80)
  Current                          0.27 (0.25-0.28)

Cigarette smoking                  Odds Ratio (95% CI)
  Former                           1.14 (1.10-1.19)
  Current                          1.58 (1.48-1.68)

Regular exercise                   0.86 (0.82-0.89)

Alcohol use, drinks/wk             Odds Ratio (95% CI)
  None                             1.00 (Referent)
  1-6                              0.85 (0.80-0.90)
  7-13                             0.76 (0.69-0.83)
  ≥14                              0.62 (0.54-0.71)

Technology                         Odds Ratio (95% CI)
  Heel x-ray                       1.00 (Referent)
  Forearm x-ray                    2.86 (2.75-2.99)
  Finger x-ray                     4.86 (4.56-5.18)
  Heel ultrasound                  0.79 (0.70-0.90)

Since all the variables were included simultaneously in the model, the various coefficients can be interpreted as indicating partial change: the odds ratio comparing the given level to the reference level for each variable, adjusting for all the other variables as if they had been held constant.


Prob 16.12. The concept of residuals does not cleanly apply to yes/no models because the model value is a probability (of a yes outcome), whereas the actual observation is the outcome itself. It would be silly to try to compute a difference between “yes” and a probability like 0.8. After all, what could it mean to calculate (yes - 0.8)^2?

In fitting ordinary linear models, the criterion used to select the best coefficients for any given model design is “least squares,” minimizing the sum of square residuals. The corresponding criterion in fitting yes/no models (and many other types of models) is “maximum likelihood.”

The word “likelihood” has a very specific and technical meaning in statistics; it’s not just a synonym for “chance” or “probability.” A likelihood is the probability of the observed outcome according to a specific model.

To illustrate, here is an example of some yes-no observations and the model values of two different models.







            Model A         Model B      Observed
Case     p(Yes)  p(No)   p(Yes)  p(No)   Outcome
  1        0.7    0.3      0.4    0.6      Yes
  2        0.6    0.4      0.8    0.2      No
  3        0.1    0.9      0.3    0.7      No
  4        0.5    0.5      0.9    0.1      Yes

Likelihood always refers to a given model, so there are two likelihoods here: one for Model A and another for Model B. The likelihood for each case under Model A is the probability of the observed outcome according to the model. For example, the likelihood under Model A for case 1 is 0.7, because that is the model value of the observed outcome “Yes” for that case. The likelihood of case 2 under Model A is 0.4 — that is the probability of “No” for case 2 under model A.

The likelihood for the whole set of observations combines the likelihoods of the individual cases: multiply them all together. This is justified if the cases are independent of one another, as is usually assumed; independence is sensible when the cases are the result of random sampling or random assignment to an experimental treatment.
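Using the per-case likelihoods just described, the multiply-together rule looks like this in Python (model values taken from the table above):

```python
# Per-case model values: (p(Yes) under the model, observed outcome)
model_A = [(0.7, "Yes"), (0.6, "No"), (0.1, "No"), (0.5, "Yes")]
model_B = [(0.4, "Yes"), (0.8, "No"), (0.3, "No"), (0.9, "Yes")]

def total_likelihood(cases):
    """Multiply the per-case likelihoods: p(Yes) when the outcome is Yes,
    1 - p(Yes) when the outcome is No."""
    total = 1.0
    for p_yes, outcome in cases:
        total *= p_yes if outcome == "Yes" else (1 - p_yes)
    return total

print(total_likelihood(model_A))  # 0.7 * 0.4 * 0.9 * 0.5 = 0.126
print(total_likelihood(model_B))  # 0.4 * 0.2 * 0.7 * 0.9 = 0.0504
```

Model A, with the larger likelihood, is the better of the two under the maximum-likelihood criterion.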