Reading Questions.
Prob 14.01. Your friend is interested in using natural teas to replace commercial drugs. He has been testing out his own tea blend, which he calls “Special Cooling Tea” to reduce temperatures of people with the flu. Fortunately for him, he had a bad flu with fever that lasted for 3 days. During this time, he alternated taking Special Cooling Tea and taking a conventional fever reducing drug, Paracetamol. On each occasion, he measured the change in his body temperature over one hour. His data were
Change in | |
Treatment | Temp. (F) |
Tea | -1.2 |
Drug | -2.0 |
Tea | -0.7 |
Drug | -1.5 |
Tea | +0.0 |
Drug | -0.7 |
Your friend is excited by these results. When he fit the model temperature change ~ Treatment he got the following reports:
Df | Sum Sq | Mean Sq | F value | Pr(>F) | |
treat | 1 | 0.13 | 0.13 | 0.11 | 0.7555 |
Residuals | 4 | 4.85 | 1.21 | ||
Based on these report, your friend claims, "This shows that my Special Cooling Tea is just as effective as Paracetamol."
A
| There are 4 degrees of freedom in the residual. |
B
| The p-value is very small. |
C
| The p-value is not small. |
D
| There is no p-value on the residuals. |
A
| The Tea and Paracetamol have the same effect. |
B
| The Tea is more effective than Paracetamol. |
C
| The Tea is less effective than Paracetamol. |
Of course, your friend’s conclusion that his experiment shows that Tea and Paracetamol are the same is actually unjustified. Failing to reject the null hypothesis is not the same as accepting the null hypothesis. You explain this to your friend.
He’s disappointed, of course. But then he gets a bit annoyed. “How can you ever show that two things are the same if your never get to accept the null hypothesis?”
“Well,” you respond, “the first thing is to decide what ‘the same’ means. In this case, you’re interested in the fever reduction. How big a difference in the temperature reduction would be medically insignificant?”
Your friend thinks for a while. “I suppose no one would care about 0.2 degrees.”
“Then you have to make your measurement precise to about 0.2 degrees. If such a precise measurement shows no difference, then it’s reasonable to claim that your Tea and the drug are equivalent.” Using your friend’s data, you fit the model and look at the regression report.
Estimate | Std. Error | t value | Pr(>|t|) | |
(Intercept) | -0.9333 | 0.6360 | -1.47 | 0.2161 |
treatTea | 0.3000 | 0.8994 | 0.33 | 0.7555 |
The margin of error on the difference in temperature reduction is ±1.8 degrees, based on the 6 cases your friend recorded.
Prob 14.02. You can test the null hypothesis that the population mean of a variable x is zero by constructing a model x~1 and interpreting the coefficient on the intercept term 1.
Often, the null hypothesis that the population mean is zero is irrelevant. For example, in the kid’s feet data, the mean foot width is 8.99 cm. There is no physical way for the population mean to be zero.
But, suppose you want to test the null hypothesis that the population mean (for the population of school-aged children from whom the sample was drawn) is 8.8 cm. To do this, create a new variable that would be zero if the population mean were 8.8 cm. This is simple: make the new variable by subtracting 8.8 cm from the original variable. The new variable can then be tested to see if its mean is different from zero.
Using the kids-feet data:
Prob 14.04. You are performing an agricultural experiment in which you will study whether placing small bits of iron in the soil will increase the yield of barley. Randomly scattered across an enclosure owned by the state’s agricultural field office, you have marked off ten plots, each of size 10 meters by 10 meters. You randomly assign five of the plots to be controls, and in the other five you place the iron.
A colleague hears of your experiment and asks to join it. She would like to test zinc. Still another colleague wants to test copper and another has two different liquid fertilizers to test. The head of the field office asks you to share the data from your control plots, so that the other researchers won’t have to grow additional control plots. This will reduce the overall cost of the set of experiments. He points out that the additional experiments impose absolutely no burden on you since they are being grown in plots that you are not using. All you need to do is provide the yield measurements for the control plots to your colleagues so that they can use them in the evaluation of their own data.
You wonder whether there is a hidden cost. Will you need to adjust the p-value thresholdhold? For instance, a Bonferroni correction would reduce the threshold for significance from 0.05 to 0.01, since there are five experiments altogether (iron, zinc, copper, and two fertilizers, each compared to a control). What’s the cost of making such an adjustment?
A
| It reduces the significance level of your experiment. |
B
| It reduces the power of your experiment. |
C
| It invalidates the comparison to the control. |
D
| None of the above. |
What’s the argument for making an adjustment to the p-value threshold?
A
| Doing multiple experiments increases the probability of a Type I error. |
B
| Doing multiple experiments increase the probability of a Type II error. |
C
| Doing multiple experiments introduces interaction terms. |
D
| The other experiments increase the power of your experiment. |
What’s the argument for not making an adjustment to the p-value threshold?
A
| The additional experiments (zinc, copper, fertilizers) have nothing to do with my iron experiment, so I can reasonably ignore them. |
B
| You don’t think the other experiments will be successful. |
C
| A p-value of 0.05 is always appropriate. |
Suppose that the other experimenters decided that they had sufficient budget to make their own control plots, so that their experiments could proceed independently of yours (although all the plots would be in the same field). Would this change the arguments for and against adjusting the p-value?
A
| No. There are still multiple tests being done, even if the data sets don’t overlap. |
B
| Yes. Now the tests aren’t sharing any data. |
The idea that a researcher’s evaluation of a p-value should depend on whether or not other experiments are being conducted has been termed “inconsistent” by Saville. [?] He writes, “An experiment is no more a natural unit than a project consisting of several experiments or a research program consisting of several projects.” Why should you do an adjustment merely because a set of experiments happened to be performed by the same researcher or in the same laboratory? If you’re going to adjust, why shouldn’t you adjust for all the tests conducted in the same institution or in the same country?
This quandry illustrates that data do not speak for themselves. The evaluation of evidence needs to be made in the context in which the data were collected. In my view, the iron researcher would be justified in not adjusting the p-value threshold because the other experiments are irrelevant to his. However, the field office head, who is supervising multiple experiments, should be making adjustments. This is paradoxical. If the p-value from the iron experimental were 0.04, the individual researcher would be justified in declaring the effect of iron significant, but the office head would not be. Same data, different results.
This paradox has an impact on how you should interpret the reports that you read. If you are reading hundreds of reports, you need to keep a skeptical attitude about individual reports with a p-value of 0.01. A small p-value is only one piece of evidence, not a proof.
Prob 14.05. One of the difficulties in interpreting p-values comes from what is sometimes called publication bias: only those studies with sufficiently low p-values are published; we never see studies on the same subject that didn’t have low p-values.
We should interpret a p-value of, say, 0.03 differently in these two situations:
It’s sometimes the case that “para-normal” phenomena — mind-reading, for instance — are subjected to studies and that the studies sometimes give low p-values. It’s hard to interpret the p-value directly because of the publication bias effect. We can, of course, ask that the study be repeated, but if there are many repeats that we never hear about (because they happened to generate large p-values), it can still be difficult to interpret the p-value.
There is one way out, however. In addition to asking that the study be repeated, we can ask that the sample size be increased. The reason is that the larger sample size should generate much smaller p-values if the studied effect is genuine. However, if the effect is spurious, we’ll still expect to see published p-values around 0.03.
To explore this, consider the following result of a ficticious study of the possible link between “positive thinking” (posThinking) and the t-cell count (tcells) in the immune system of 50 different people. The study has produced a “suggestive” p-value of about 0.08:
Here is a statement that can generate the p-value:
Why is the the output of pt subtracted from 1?
A
| This is always the way pt is used. |
B
| Because the t-value is positive. |
C
| Because we want a one-tailed test. |
D
| Because the t-value is less than 2. |
The df in the statement stands for the degrees of freedom of the residual, just as in the F test. There were 50 cases and two model vectors (the intercept and the posthinking indicator vector), so 48 degrees of freedom in the residual.
Now imagine that the same study had been repeated with four times as much data: 200 cases. Assuming that everything else remained the same, how big would the standard error on posThinking be given the usual relationship between the size of the standard error and the number of cases n:
Prob 14.10. The t- and F-distributions are closely related. That is, you can calculate the p-value on a t-statistic by using the F distribution, if you do it right.
Generate a large random sample from a t-distribution with 10 degrees of freedom. Then generate another random sample from the F-distribution with 1 and 10 degrees of freedom.
Graphically compare the distributions of the F and t values. What do you see?
Now compare the distributions of the F values and the square of the t values. What do you see?
Prob 14.11. You know that you can compute the sample mean m and variance s^{2} of a variable with simple calculations
This is a fine way to calculate these quantities. But, just to show the link of these quantities to modeling, I ask you to do the following with the kidsfeet data: kidsfeet.csv :
Prob 14.12. Zebra mussels are a small, fast reproducing species of freshwater mussel native to the lakes of southeast Russia. They have accidentally been introduced in other areas, competing with native species and creating problems for people as they cover the undersides of docks and boats, clog water intakes and other underwater structures. Zebra mussels even attach themselves to other mussels, sometimes starving those mussels.
Zebra mussels growing on a larger, native mussel. |
Ecologists Shirley Baker and Daniel Hornbach examined whether zebra mussels gain an advantage by attaching to other mussels rather than to rocks.[?] The ecologists collected samples of small rocks and Amblema plicata mussels, each of which had a collection of zebra mussels attached. The samples were transported to a laboratory where the group of mussels from each individual rock or Amblema were removed and placed in an aquarium equipped to measure oxygen uptake and ammonia excretion. After these physiological measurements were made, the biochemical composition of the mussel tissue was determined: the percentage of protein, lipid, carbohydrate, and ash.
Baker and Hornbach found that zebra mussels attached to Amblema had greater physiological activity than those attached to rocks as measured by oxygen uptake and ammonia excretion. But this appears to be a sign of extra effort for the Amblema-attached zebra mussels, since they had lower carbohydrate and lipid levels. In other words, attaching to Amblema appears to be disadvantageous to the zebra mussels compared to attaching to a rock.
In this problem, you will use the data collected by Baker and Hornbach to reprise their statistical analysis. The data are in the file zebra-mussels.csv.
There are 30 cases altogether. The unit of analysis was not an individual zebra mussel but the group of zebra mussels that were attached to individual rocks or Amblema.
The variables are:
A first question is whether the amount of mussel tissue is greater for the rock-attached or the Amblema-attached zebra mussels. Here is the report of a model:
Based on this report, which of these claims do data support?
A
| Smaller mass for rock-attached mussels, as a group. |
B
| Larger mass for rock-attached mussels, as a group. |
C
| No statistically significant difference in mass between the rock- and Amblema-attached mussels. |
The dry.mass variable gives the mass of the entire group of mussels. It might be that the mussels were systematically different in size on the two different substrates. One way to look at this is to compute the dry mass per individual mussel. This can be calculated by dividing dry.mass by count:
Fit a model of ind.mass versus attachment. Which of the claims does this model support?
A
| Smaller mass for rock-attached mussels, as a group. |
B
| Larger mass for rock-attached mussels, as a group. |
C
| No statistically significant difference in mass between the rock- and Amblema-attached mussels. |
In summarizing their results about oxygen uptake and nitrogen excretion, Baker and Hornbach wrote:
Attachment substrate had significant effects on ammonia excretion and oxygen uptake rates. Dreissenids [that is, Zebra mussels] attached to Amblema had higher ammonia excretion rates (p < 0.0001) and higher oxygen uptake rates (p < 0.0001) compared to dreissenids attached to rocks.
To check their statistics, fit these models for oxygen uptake per individual and ammonia excretion per individual:
The coefficients on the attachment variable indicate that:
Look at the p-values on the attachment variable. What are they?
Perhaps you are surprised to see that the p-values from your models are quite different from those reported by Baker and Hornbach. That’s because Baker and Hornbach included ind.mass as a covariate in their models, in order to take into account how the metabolic rate might depend on the mass of the mussel.
Following Baker and Hornbach’s procedure, fit these models:
You’ll get different p-values for the attachment variable in these models depending on whether the analysis is done using Anova (which is often called “Analysis of Covariance” (ANCOVA) when there is a covariate in the model, or is done using a regression report. In ANCOVA, the p-value depends on whether ind.mass is put before or after attachment.
What is the regression report p-value on attachment?
The reason that including ind.mass as a covariate has improved the p-value is that ind.mass has soaked up the variance in the response variable. Show an Anova report for one of the two models mod3 or mod4 and explain what in the report demonstrates that ind.mass is soaking up variance.
In contrast to the dramatic effect of ind.mass on the p-values, including it as a covariate does not seem to affect much the coefficient on attachment. What is it about ind.mass that explains this?
A
| ind.mass is almost entirely uncorrelated with the response variable. |
B
| ind.mass is almost entirely uncorrelated with attachment. |
C
| ind.mass is strongly correlated with the response variable. |
D
| ind.mass is strongly correlated with attachment. |
Prob 14.21. The table shows a brief data set that we will use to trace through a simple one-way ANOVA calculation by hand.
Age | Group |
18 | Student |
22 | Student |
45 | Parent |
51 | Parent |
35 | Professor |
57 | Professor |
Our model will be Age ~ 1 + Group
The ANOVA report breaks down the model into a series of nested models. In this case, the series is simple:
For each of these nested models, you need to calculate the “degrees of freedom” and the sum of squares of the fitted model values. The degrees of freedom is the number of model vectors involved. The sum of squares is found by fitting the model and adding up the squares of the fitted values.
We will do this by hand. First, fit Age ~ 1 and enter the fitted values into the table. “Fitting” is easy here, since the coefficient on Age ~ 1 is just the grand mean of the Age variable.
For Age ~ 1
| ||
Age | Group | Fitted |
18 | Student | |
22 | Student | |
45 | Parent | |
51 | Parent | |
35 | Professor | |
57 | Professor | |
Once you have written down the fitted values for each case, compute the sum of squares. Since there is one model vector, there is just one degree of freedom for this model.
Next, fit the model Age ~ 1 + Group and enter the fitted values into the table. This model is also easy to fit, since the fitted values are just the groupwise means.
For Age ~ 1 + Group
| ||
Age | Group | Fitted |
18 | Student | |
22 | Student | |
45 | Parent | |
51 | Parent | |
35 | Professor | |
57 | Professor | |
Compute the sum of squares for this model. Since there are three groups, there are three model vectors and hence three degrees of freedom.
Finally, compute the sum of squares of the Age variable itself. You might like to think of this as the sum of squares of the fitted values to a “perfect” model, a model that captures all of the variability in Age. We know that such a model can always be constructed so long as we allow N model vectors, even if they are junk, so the degrees of freedom ascribed to this “perfect” model is N. (We put “perfect” in quotation marks to emphasize that this model is perfect only in the sense that the residuals are zero. That can happen anytime we have as many model vectors as cases, that is, when m = N.)
Enter the degrees of freedom and the sums of squares of the fitted values into the table below:
Model | D.F. | Sum of Squares of Fitted |
Age ~ 1 | ||
Age ~ 1 + Group | ||
“Perfect” model | ||
We are almost at the point where we can construct the ANOVA table. At the heart of the ANOVA table is the degree-of-freedom and sum-of-squares information from the above table. But rather than entering the sum of squares directly, we subtract from each of the sum of squares the total of all the sums of squares of the models that appear above it. That is, we give each successive nested model credit only for the amount that it increased the sum of squares from what it had been in the previous model. A similar accounting is done for the degrees of freedom. In the following table, we’ll mark the header as Δ to emphasize that you should the change in sum of squares and the change in degrees in freedom from the previous model.
Fill in the first two columns of the table. Then go back and fill in the “mean square” column, which is just the sum of squares divided by the degrees of freedom. Similarly, fill in the F column, which is the mean square for each model divided by the mean square for the perfect model.
Model | Δ D.F. | Δ S.S. | M.S. | F ratio | p-value |
Age ~ 1 | |||||
Age ~ 1 + Group | |||||
“Perfect” model | |||||
You can use software to compute the p-value. The relevant parameters of the F distribution are the degrees of freedom of the model and the degrees of freedom of the “perfect” model.
Of course, in a real ANOVA table, rows are labeled differently. Each row gives a name to the terms that were added in making the model. So, in the above table, the labels will be “Intercept,” “Group,” and “Residuals.”
Compare your table to one constructed with software. Enter the data into a spreadsheet, save it to disk in an appropriate format (e.g., CSV), then read it into the statistical software and create the ANOVA table.
Prob 14.22. The special program rand() generates random vectors for use in modeling. It’s special because it works only within the lm command. For example, suppose we want to create three random model vectors along with an intercept term to model the kids’ foot width data:
The coefficients are different in the two models because the “explanatory” vectors are random.
We’re going to study the R^{2} due to such collections of random vectors. This can be calculated with the r.squared program:
Note that the R^{2} values vary from trial to trial, because the vectors are random.
According to the principle of equipartition, on average, each random vector should contribute 1∕(n - 1) to the R^{2} and the effect of multiple vectors is additive. So, for example, with n = 39 cases, the three random vectors in the above example should result in an R^{2} that is near 3∕(39 - 1) = 0.08.
Repeat the above calculation of R^{2} many times for p = 3 random vectors and find the mean of the resulting R^{2}. You can do the repetitions 100 times with a statement like this:
Now do the same thing for p = 1,3,10,20,37 and 38. Are the results consistent with the theory that, on average, R^{2} should be p∕(n - 1)?
Enter all your results in the table:
p | p∕(n - 1) | Mean R^{2} | |
1 | |||
3 | |||
10 | |||
20 | |||
37 | |||
38 | |||
Note that for p = 38 all of the trials produced the same R^{2} value. Explain why.
Prob 14.23. Suppose that you have worked hard to carry out an experiment about baldness that involves several different treatments: rubbing various herbs and spices on the scalp, playing hairy music, drinking kiwi juice, laying fresh moss on the head, eating peaches. Then, you measure the density of hair follicles after 2 months.
You plot out your data:
It looks like the music treatment produced the lowest number of follicles, while moss produced the highest. (Kiwi is a close second, but the range is so large that you aren’t sure about it.)
You decide to extract the Music and Moss groups and build a model of follicle density versus treatment. (This sort of test, comparing the means of two groups, is called a t-test.)
Estimate | Std. Error | t value | Pr(>|t|) | |
(Intercept) | 93.6316 | 3.5882 | 26.09 | 0.0000 |
groupMoss | 11.1991 | 5.0745 | 2.21 | 0.0405 |
Or do you? Admittedly, a p-value of 0.04 will not be terribly compelling to someone who doubts that music and moss affect follicle density in different ways. But it is below the conventional threshold of 0.05. Consider some of the things that could be wrong:
A
| That Moss and Music are no different than a Control group. |
B
| That Moss and Music are the same as each other. |
C
| That Moss and Music have no effect. |
The Bonferroni correction calls for the p-value to be adjusted based on the number of comparisons done. There are two, equivalent ways to do it. One way is to divide the conventional threshold by the number of comparisons and use the new threshold. So for this test, with 10 pairs of groups, the threshold for rejecting the null would be 0.05∕10 = 0.005. The other way is just to multiply the p-value by the number of comparisons and then compare this to the standard threshold.
Using the “multiply by” approach to the Bonferroni correction, what is the
p-value value on the Music-vs-Moss comparison?
Prob 14.24. This exercise concerns assessing the appropriateness of the Bonferroni correction in a setting where you are choosing one pair out of several possible groups to compare to each other. The exercise is based on a simulation. You will need to install additional software to run the simulation, which you can do with this command:
Recall that the Bonferroni correction relates to a situation where you are conducting k hypothesis tests and choosing the p-value that is most favorable. For example, you might have tested several different treatments: B, C, D, ... against a control, A. An example is shown in the figure.
Group E seems to stand out as different from the control, so you decide to focus your attention on comparing the control A to group E.
To generate the data in the graph, use this command:
When you do this, your results should match these exactly (subject only to round-off):
As you might guess from the name of the simulation, bogus.groups, the data are being sampled from a population in which the assigned group is unrelated to the quantitative variable val. That is, the null hypothesis is true.
The point of the seed=230 statement is to make sure that exactly the same “random” numbers are generated as in the example.
Using the data, extract the subset of data for groups A and E, and carry out a hypothesis test on group using the model val ~ group. (Hint: in the subset command, use the selector variable group=="A’’|group=="E’’ to select cases that are from either group A or group E. Then apply lm to this subset of data.)
As a convenience, you are provided with a special-purpose function, compare.many.groups that does each of the hypothesis tests comparing the control group to another group. You can run it this way:
For the simulated data, your results should match exactly (subject to round-off). Confirm that the p-value from this report matches the one you got by hand above.
The Bonferroni correction takes the “raw” p-value and corrects it by multiplying by the number of comparisons. Since there are 7 comparisons altogether (groups B, C, ..., H against group A), multiply the p-value by 7.
Now you are going to do a simulation to test whether the Bonferroni correction is appropriate. The idea is to generate data from a simulation that implements the null hypothesis, calculate the p-value for the “best looking” comparison between the control group A and one of the other groups, and see how often the p-value is less than 0.05. It will turn out that the p-value on the best looking comparison will be below 0.05 much too often. (If the p-value is to be taken at face value, when the null hypothesis is true then p < 0.05 should happen only 5% of the time.) Then you’ll apply the Bonferroni correction and see if this fixes things.
Because this is a somewhat elaborate computation, the following leads you through the development of the R statement that will generate the results needed. Each step is a slight elaboration on the previous one.
It happens that none of the p-values from this simulation were < 0.05.
Don’t be too surprised by the result. You got exactly the same answer on each trial because the same random “seed” was used every time.
Then answer the following questions:
Prob 14.25. You and your statistics-class friends are having a disagreement. Does the result of a hypothesis test depend on the units in which quantities are measured? For example, in the kids’ feet data, kidsfeet.csv the length and width are measured in cm. Would it matter if they had been measured in inches or meters?
Do a numerical experiment to figure out whether the units make a difference. Explain your strategy concisely and give the corresponding commands and the results here.