Chapter 14 Problems      AGid      Statistical Modeling: A Fresh Approach (2/e)

• A permutation test involves randomizing in order to simulate eliminating the relationship, if any, between the explanatory variables and the response variable. What’s the point of this?
• The F statistic compares two quantities. What are they?
• What is a “degree of freedom?”

Prob 14.01. Your friend is interested in using natural teas to replace commercial drugs. He has been testing out his own tea blend, which he calls “Special Cooling Tea” to reduce temperatures of people with the flu. Fortunately for him, he had a bad flu with fever that lasted for 3 days. During this time, he alternated taking Special Cooling Tea and taking a conventional fever reducing drug, Paracetamol. On each occasion, he measured the change in his body temperature over one hour. His data were

 Change in Treatment Temp. (F) Tea -1.2 Drug -2.0 Tea -0.7 Drug -1.5 Tea +0.0 Drug -0.7

Your friend is excited by these results. When he fit the model temperature change ~ Treatment he got the following reports:

 Df Sum Sq Mean Sq F value Pr(>F) treat 1 0.13 0.13 0.11 0.7555 Residuals 4 4.85 1.21

Based on these report, your friend claims, "This shows that my Special Cooling Tea is just as effective as Paracetamol."

• Which of the following features of the Anova report supports your friend’s conclusion that the two treatments are effective:

 A There are 4 degrees of freedom in the residual. B The p-value is very small. C The p-value is not small. D There is no p-value on the residuals.

• What’s an appropriate Null Hypothesis is this setting:

 A The Tea and Paracetamol have the same effect. B The Tea is more effective than Paracetamol. C The Tea is less effective than Paracetamol.

Of course, your friend’s conclusion that his experiment shows that Tea and Paracetamol are the same is actually unjustified. Failing to reject the null hypothesis is not the same as accepting the null hypothesis. You explain this to your friend.

He’s disappointed, of course. But then he gets a bit annoyed. “How can you ever show that two things are the same if your never get to accept the null hypothesis?”

“Well,” you respond, “the first thing is to decide what ‘the same’ means. In this case, you’re interested in the fever reduction. How big a difference in the temperature reduction would be medically insignificant?”

Your friend thinks for a while. “I suppose no one would care about 0.2 degrees.”

“Then you have to make your measurement precise to about 0.2 degrees. If such a precise measurement shows no difference, then it’s reasonable to claim that your Tea and the drug are equivalent.” Using your friend’s data, you fit the model and look at the regression report.

 Estimate Std. Error t value Pr(>|t|) (Intercept) -0.9333 0.6360 -1.47 0.2161 treatTea 0.3000 0.8994 0.33 0.7555

The margin of error on the difference in temperature reduction is ±1.8 degrees, based on the 6 cases your friend recorded.

• To reduce the margin of error to ±0.2 degrees, about how many cases altogether should your friend plan to collect? (Choose the closest one. For instance, 12 would be a doubling from the 6 actually collected.)
12  75  150  500  2000

Prob 14.02. You can test the null hypothesis that the population mean of a variable x is zero by constructing a model x~1 and interpreting the coefficient on the intercept term 1.

Often, the null hypothesis that the population mean is zero is irrelevant. For example, in the kid’s feet data, the mean foot width is 8.99 cm. There is no physical way for the population mean to be zero.

But, suppose you want to test the null hypothesis that the population mean (for the population of school-aged children from whom the sample was drawn) is 8.8 cm. To do this, create a new variable that would be zero if the population mean were 8.8 cm. This is simple: make the new variable by subtracting 8.8 cm from the original variable. The new variable can then be tested to see if its mean is different from zero.

Using the kids-feet data:

1.
Find the p-value on the null-hypothesis that the population mean of the foot width is 8.8 cm.

0.000  0.024  0.026  0.058  0.094  0.162  0.257
2.
Find the p-value on the null-hypothesis that the population mean of the foot width is 8.9 cm.

0.00  0.23  0.27  0.31  0.48  0.95

Prob 14.04. You are performing an agricultural experiment in which you will study whether placing small bits of iron in the soil will increase the yield of barley. Randomly scattered across an enclosure owned by the state’s agricultural field office, you have marked off ten plots, each of size 10 meters by 10 meters. You randomly assign five of the plots to be controls, and in the other five you place the iron.

A colleague hears of your experiment and asks to join it. She would like to test zinc. Still another colleague wants to test copper and another has two different liquid fertilizers to test. The head of the field office asks you to share the data from your control plots, so that the other researchers won’t have to grow additional control plots. This will reduce the overall cost of the set of experiments. He points out that the additional experiments impose absolutely no burden on you since they are being grown in plots that you are not using. All you need to do is provide the yield measurements for the control plots to your colleagues so that they can use them in the evaluation of their own data.

You wonder whether there is a hidden cost. Will you need to adjust the p-value thresholdhold? For instance, a Bonferroni correction would reduce the threshold for significance from 0.05 to 0.01, since there are five experiments altogether (iron, zinc, copper, and two fertilizers, each compared to a control). What’s the cost of making such an adjustment?

 A It reduces the significance level of your experiment. B It reduces the power of your experiment. C It invalidates the comparison to the control. D None of the above.

What’s the argument for making an adjustment to the p-value threshold?

 A Doing multiple experiments increases the probability of a Type I error. B Doing multiple experiments increase the probability of a Type II error. C Doing multiple experiments introduces interaction terms. D The other experiments increase the power of your experiment.

What’s the argument for not making an adjustment to the p-value threshold?

 A The additional experiments (zinc, copper, fertilizers) have nothing to do with my iron experiment, so I can reasonably ignore them. B You don’t think the other experiments will be successful. C A p-value of 0.05 is always appropriate.

Suppose that the other experimenters decided that they had sufficient budget to make their own control plots, so that their experiments could proceed independently of yours (although all the plots would be in the same field). Would this change the arguments for and against adjusting the p-value?

 A No. There are still multiple tests being done, even if the data sets don’t overlap. B Yes. Now the tests aren’t sharing any data.

The idea that a researcher’s evaluation of a p-value should depend on whether or not other experiments are being conducted has been termed “inconsistent” by Saville. [?] He writes, “An experiment is no more a natural unit than a project consisting of several experiments or a research program consisting of several projects.” Why should you do an adjustment merely because a set of experiments happened to be performed by the same researcher or in the same laboratory? If you’re going to adjust, why shouldn’t you adjust for all the tests conducted in the same institution or in the same country?

This quandry illustrates that data do not speak for themselves. The evaluation of evidence needs to be made in the context in which the data were collected. In my view, the iron researcher would be justified in not adjusting the p-value threshold because the other experiments are irrelevant to his. However, the field office head, who is supervising multiple experiments, should be making adjustments. This is paradoxical. If the p-value from the iron experimental were 0.04, the individual researcher would be justified in declaring the effect of iron significant, but the office head would not be. Same data, different results.

This paradox has an impact on how you should interpret the reports that you read. If you are reading hundreds of reports, you need to keep a skeptical attitude about individual reports with a p-value of 0.01. A small p-value is only one piece of evidence, not a proof.

Prob 14.05. One of the difficulties in interpreting p-values comes from what is sometimes called publication bias: only those studies with sufficiently low p-values are published; we never see studies on the same subject that didn’t have low p-values.

We should interpret a p-value of, say, 0.03 differently in these two situations:

1.
The study that generated the p-value was the only one that has been done on that subject. In this case, the p-value can reasonably be taken at face value.
2.
The study that generated the p-value was one of 20 different studies but was the only one of the 20 studies that generated a p-value below 0.05. In this case, the p-value has little genuine meaning since, even if the null hypothesis were true we wouldn’t be surprised to see a p-value like 0.03 from the “best” of 20 studies.

It’s sometimes the case that “para-normal” phenomena — mind-reading, for instance — are subjected to studies and that the studies sometimes give low p-values. It’s hard to interpret the p-value directly because of the publication bias effect. We can, of course, ask that the study be repeated, but if there are many repeats that we never hear about (because they happened to generate large p-values), it can still be difficult to interpret the p-value.

There is one way out, however. In addition to asking that the study be repeated, we can ask that the sample size be increased. The reason is that the larger sample size should generate much smaller p-values if the studied effect is genuine. However, if the effect is spurious, we’ll still expect to see published p-values around 0.03.

To explore this, consider the following result of a ficticious study of the possible link between “positive thinking” (posThinking) and the t-cell count (tcells) in the immune system of 50 different people. The study has produced a “suggestive” p-value of about 0.08:

Model:    tcells ~ posThinking
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept)         3.6412     1.2506   2.912  0.00614
posThinking         0.2325     0.1293   1.798  0.07847

Here is a statement that can generate the p-value:

> 2*(1-pt(1.798,df=48))
[1] 0.07846795

Why is the the output of pt subtracted from 1?

 A This is always the way pt is used. B Because the t-value is positive. C Because we want a one-tailed test. D Because the t-value is less than 2.

The df in the statement stands for the degrees of freedom of the residual, just as in the F test. There were 50 cases and two model vectors (the intercept and the posthinking indicator vector), so 48 degrees of freedom in the residual.

Now imagine that the same study had been repeated with four times as much data: 200 cases. Assuming that everything else remained the same, how big would the standard error on posThinking be given the usual relationship between the size of the standard error and the number of cases n:

• Standard error with n = 200 cases:

0.52  0.26  0.13  0.065  0.032
• With n = 200 cases, how many degrees of freedom will there be in the residual?

48  50  98  100  198  200
• Using the standard error with n = 200 cases, and assuming that the coefficient remains exactly as reported in the table, recalculate the p-value on posThinking. Select the one of these that is closest to the p-value:

0.4  0.04  0.004  0.0004  0.00004  0.000004

Prob 14.10. The t- and F-distributions are closely related. That is, you can calculate the p-value on a t-statistic by using the F distribution, if you do it right.

Generate a large random sample from a t-distribution with 10 degrees of freedom. Then generate another random sample from the F-distribution with 1 and 10 degrees of freedom.

Graphically compare the distributions of the F and t values. What do you see?

• True or False
They are the same.
• True or False
The F distribution is much more skew to the right.
• True or False
The t distribution is symmetrical around zero.

Now compare the distributions of the F values and the square of the t values. What do you see?

• True or False
They are the same.

Prob 14.11. You know that you can compute the sample mean m and variance s2 of a variable with simple calculations

Such calculations are implemented with simple computer commands, e.g.
feet = fetchData("kidsfeet.csv")
mean( length, data=feet )

[1] 24.723

var( length, data=feet )

[1] 1.736

This is a fine way to calculate these quantities. But, just to show the link of these quantities to modeling, I ask you to do the following with the kidsfeet data:  kidsfeet.csv :

1.
Fit a model where one of the coefficients will be the mean of the length. Enter the model report below.

2.
Perform Anova on the same model. One of the numbers in the Anova report will be the variance. Which one is it?

3.
Based on the above, explain why the calculation of the sample variance involves dividing by N - 1 rather than N.

Prob 14.12. Zebra mussels are a small, fast reproducing species of freshwater mussel native to the lakes of southeast Russia. They have accidentally been introduced in other areas, competing with native species and creating problems for people as they cover the undersides of docks and boats, clog water intakes and other underwater structures. Zebra mussels even attach themselves to other mussels, sometimes starving those mussels.

 Zebra mussels growing on a larger, native mussel.

Ecologists Shirley Baker and Daniel Hornbach examined whether zebra mussels gain an advantage by attaching to other mussels rather than to rocks.[?] The ecologists collected samples of small rocks and Amblema plicata mussels, each of which had a collection of zebra mussels attached. The samples were transported to a laboratory where the group of mussels from each individual rock or Amblema were removed and placed in an aquarium equipped to measure oxygen uptake and ammonia excretion. After these physiological measurements were made, the biochemical composition of the mussel tissue was determined: the percentage of protein, lipid, carbohydrate, and ash.

Baker and Hornbach found that zebra mussels attached to Amblema had greater physiological activity than those attached to rocks as measured by oxygen uptake and ammonia excretion. But this appears to be a sign of extra effort for the Amblema-attached zebra mussels, since they had lower carbohydrate and lipid levels. In other words, attaching to Amblema appears to be disadvantageous to the zebra mussels compared to attaching to a rock.

In this problem, you will use the data collected by Baker and Hornbach to reprise their statistical analysis. The data are in the file zebra-mussels.csv.

GroupID dry.mass count attachment  ammonia   O2
1       1     0.55    20       Rock    0.075 0.82
2       2     0.45    19       Rock    0.066 0.70
3       3     0.37    20       Rock    0.074 0.62
... and so on ...
28     28     0.89    28    Amblema    0.248 2.68
29     29     0.70    31    Amblema    0.258 2.26
30     30     0.68    64    Amblema    0.235 2.05

There are 30 cases altogether. The unit of analysis was not an individual zebra mussel but the group of zebra mussels that were attached to individual rocks or Amblema.

The variables are:

• dry.mass The mass of the dried tissue of the group, in grams.
• count How many mussels in the group.
• attachment The substrate to which the mussels in the group were attached: a rock or an Amblema mussel.
• O2 Oxygen uptake in mg per hour for the group.
• ammonia – Nitrogen excretion measured as ammonia in mg per hour for the group.
• lipid, protein, carbo, ash Biochemical makeup as percent of the dry weight for each of these.
• Kcal Total calorific value of the tissue in kilo-calories per gram.

A first question is whether the amount of mussel tissue is greater for the rock-attached or the Amblema-attached zebra mussels. Here is the report of a model:

> z = fetchData("zebra-mussels.csv")
> summary(lm(dry.mass ~ attachment, data=z))
Estimate Std. Error t value Pr(>|t|)
(Intercept)      0.6846     0.0554   12.36  1.2e-12
attachmentRock  -0.1983     0.0770   -2.58    0.016

Based on this report, which of these claims do data support?

 A Smaller mass for rock-attached mussels, as a group. B Larger mass for rock-attached mussels, as a group. C No statistically significant difference in mass between the rock- and Amblema-attached mussels.

The dry.mass variable gives the mass of the entire group of mussels. It might be that the mussels were systematically different in size on the two different substrates. One way to look at this is to compute the dry mass per individual mussel. This can be calculated by dividing dry.mass by count:

> z = transform( z, ind.mass=dry.mass/count )

Fit a model of ind.mass versus attachment. Which of the claims does this model support?

 A Smaller mass for rock-attached mussels, as a group. B Larger mass for rock-attached mussels, as a group. C No statistically significant difference in mass between the rock- and Amblema-attached mussels.

In summarizing their results about oxygen uptake and nitrogen excretion, Baker and Hornbach wrote:

Attachment substrate had significant effects on ammonia excretion and oxygen uptake rates. Dreissenids [that is, Zebra mussels] attached to Amblema had higher ammonia excretion rates (p < 0.0001) and higher oxygen uptake rates (p < 0.0001) compared to dreissenids attached to rocks.

To check their statistics, fit these models for oxygen uptake per individual and ammonia excretion per individual:

> mod1 = lm( O2/count ~ attachment, data=z )
> mod2 = lm( ammonia/count ~ attachment, data=z )

The coefficients on the attachment variable indicate that:

• ammonia excretion is
lower  higher
for rock-attached zebra mussels than for Amblema-attached mussels.
• oxygen uptake is
lower  higher
for rock-attached than for Amblema-attached.
• True or False
These results are consistent with Baker and Hornbach’s claims.

Look at the p-values on the attachment variable. What are they?

• Ammonia:
0.01  0.04  0.10  0.25  0.40
• Oxygen:
0.01  0.04  0.10  0.25  0.40

Perhaps you are surprised to see that the p-values from your models are quite different from those reported by Baker and Hornbach. That’s because Baker and Hornbach included ind.mass as a covariate in their models, in order to take into account how the metabolic rate might depend on the mass of the mussel.

Following Baker and Hornbach’s procedure, fit these models:

> mod3 = lm( O2/count ~ ind.mass + attachment, data=z )
> mod4 = lm( ammonia/count ~ ind.mass + attachment, data=z )

You’ll get different p-values for the attachment variable in these models depending on whether the analysis is done using Anova (which is often called “Analysis of Covariance” (ANCOVA) when there is a covariate in the model, or is done using a regression report. In ANCOVA, the p-value depends on whether ind.mass is put before or after attachment.

What is the regression report p-value on attachment?

• Oxygen
0.0000016  0.000016  0.0016  0.016  0.16
• Ammonia
0.0000005  0.0005  0.0001  0.001  0.01  0.1

The reason that including ind.mass as a covariate has improved the p-value is that ind.mass has soaked up the variance in the response variable. Show an Anova report for one of the two models mod3 or mod4 and explain what in the report demonstrates that ind.mass is soaking up variance.

In contrast to the dramatic effect of ind.mass on the p-values, including it as a covariate does not seem to affect much the coefficient on attachment. What is it about ind.mass that explains this?

 A ind.mass is almost entirely uncorrelated with the response variable. B ind.mass is almost entirely uncorrelated with attachment. C ind.mass is strongly correlated with the response variable. D ind.mass is strongly correlated with attachment.

Prob 14.21. The table shows a brief data set that we will use to trace through a simple one-way ANOVA calculation by hand.

 Age Group 18 Student 22 Student 45 Parent 51 Parent 35 Professor 57 Professor

Our model will be Age ~ 1 + Group

The ANOVA report breaks down the model into a series of nested models. In this case, the series is simple:

1.
Age ~ 1
2.
Age ~ 1 + Group

For each of these nested models, you need to calculate the “degrees of freedom” and the sum of squares of the fitted model values. The degrees of freedom is the number of model vectors involved. The sum of squares is found by fitting the model and adding up the squares of the fitted values.

We will do this by hand. First, fit Age ~ 1 and enter the fitted values into the table. “Fitting” is easy here, since the coefficient on Age ~ 1 is just the grand mean of the Age variable.

 For Age ~ 1 Age Group Fitted 18 Student 22 Student 45 Parent 51 Parent 35 Professor 57 Professor

Once you have written down the fitted values for each case, compute the sum of squares. Since there is one model vector, there is just one degree of freedom for this model.

Next, fit the model Age ~ 1 + Group and enter the fitted values into the table. This model is also easy to fit, since the fitted values are just the groupwise means.

 For Age ~ 1 + Group Age Group Fitted 18 Student 22 Student 45 Parent 51 Parent 35 Professor 57 Professor

Compute the sum of squares for this model. Since there are three groups, there are three model vectors and hence three degrees of freedom.

Finally, compute the sum of squares of the Age variable itself. You might like to think of this as the sum of squares of the fitted values to a “perfect” model, a model that captures all of the variability in Age. We know that such a model can always be constructed so long as we allow N model vectors, even if they are junk, so the degrees of freedom ascribed to this “perfect” model is N. (We put “perfect” in quotation marks to emphasize that this model is perfect only in the sense that the residuals are zero. That can happen anytime we have as many model vectors as cases, that is, when m = N.)

Enter the degrees of freedom and the sums of squares of the fitted values into the table below:

 Model D.F. Sum of Squares of Fitted Age ~ 1 Age ~ 1 + Group “Perfect” model

We are almost at the point where we can construct the ANOVA table. At the heart of the ANOVA table is the degree-of-freedom and sum-of-squares information from the above table. But rather than entering the sum of squares directly, we subtract from each of the sum of squares the total of all the sums of squares of the models that appear above it. That is, we give each successive nested model credit only for the amount that it increased the sum of squares from what it had been in the previous model. A similar accounting is done for the degrees of freedom. In the following table, we’ll mark the header as Δ to emphasize that you should the change in sum of squares and the change in degrees in freedom from the previous model.

Fill in the first two columns of the table. Then go back and fill in the “mean square” column, which is just the sum of squares divided by the degrees of freedom. Similarly, fill in the F column, which is the mean square for each model divided by the mean square for the perfect model.

 Model Δ D.F. Δ S.S. M.S. F ratio p-value Age ~ 1 Age ~ 1 + Group “Perfect” model

You can use software to compute the p-value. The relevant parameters of the F distribution are the degrees of freedom of the model and the degrees of freedom of the “perfect” model.

Of course, in a real ANOVA table, rows are labeled differently. Each row gives a name to the terms that were added in making the model. So, in the above table, the labels will be “Intercept,” “Group,” and “Residuals.”

Compare your table to one constructed with software. Enter the data into a spreadsheet, save it to disk in an appropriate format (e.g., CSV), then read it into the statistical software and create the ANOVA table.

Prob 14.22. The special program rand() generates random vectors for use in modeling. It’s special because it works only within the lm command. For example, suppose we want to create three random model vectors along with an intercept term to model the kids’ foot width data:

> lm( width ~ rand(3), data=kids)
Coefficients:
(Intercept)     rand(3)1     rand(3)2     rand(3)3
9.00838      0.01648     -0.05185      0.01627
> lm( width ~ rand(3), data=kids)
(Intercept)     rand(3)1     rand(3)2     rand(3)3
8.99367     -0.09795     -0.06916      0.05676

The coefficients are different in the two models because the “explanatory” vectors are random.

We’re going to study the R2 due to such collections of random vectors. This can be calculated with the r.squared program:

> r.squared(lm( width ~ rand(3), data=kids))
[1] 0.1770687
> r.squared(lm( width ~ rand(3), data=kids))
[1] 0.03972449

Note that the R2 values vary from trial to trial, because the vectors are random.

According to the principle of equipartition, on average, each random vector should contribute 1(n - 1) to the R2 and the effect of multiple vectors is additive. So, for example, with n = 39 cases, the three random vectors in the above example should result in an R2 that is near 3(39 - 1) = 0.08.

Repeat the above calculation of R2 many times for p = 3 random vectors and find the mean of the resulting R2. You can do the repetitions 100 times with a statement like this:

> samps=do(100)*r.squared(lm(width~rand(3),data=kids))

Now do the same thing for p = 1,3,10,20,37 and 38. Are the results consistent with the theory that, on average, R2 should be p∕(n - 1)?

Enter all your results in the table:

 p p∕(n - 1) Mean R2 1 3 10 20 37 38

Note that for p = 38 all of the trials produced the same R2 value. Explain why.

Prob 14.23. Suppose that you have worked hard to carry out an experiment about baldness that involves several different treatments: rubbing various herbs and spices on the scalp, playing hairy music, drinking kiwi juice, laying fresh moss on the head, eating peaches. Then, you measure the density of hair follicles after 2 months.

It looks like the music treatment produced the lowest number of follicles, while moss produced the highest. (Kiwi is a close second, but the range is so large that you aren’t sure about it.)

You decide to extract the Music and Moss groups and build a model of follicle density versus treatment. (This sort of test, comparing the means of two groups, is called a t-test.)

dd = subset(d, group=='Music' | group=="Moss")
summary(lm(val ~ group, data=dd))
 Estimate Std. Error t value Pr(>|t|) (Intercept) 93.6316 3.5882 26.09 0.0000 groupMoss 11.1991 5.0745 2.21 0.0405
Aha! p = 0.04. You have evidence that such treatments can have an effect.

Or do you? Admittedly, a p-value of 0.04 will not be terribly compelling to someone who doubts that music and moss affect follicle density in different ways. But it is below the conventional threshold of 0.05. Consider some of the things that could be wrong:

• What’s the null hypothesis in the hypothesis test presented above?

 A That Moss and Music are no different than a Control group. B That Moss and Music are the same as each other. C That Moss and Music have no effect.

• Assuming that the null hypothesis is correct, what’s the probability of seeing a p-value as small or smaller than the one actually observed?
0  0.01  0.04  0.05  0.10  0.50  0.94  0.99
• You chose the two groups to compare based on the graph. There were many other possible pairs of groups: Sage vs Peaches, Kiwi vs Moss, etc. Altogether, there are 5 * 42 = 10 possible pairs of groups. (In general, when there are k different groups, there k(k - 1)2 possible pairs of groups. So if k = 10 there are 45 different possible pairs.)

The Bonferroni correction calls for the p-value to be adjusted based on the number of comparisons done. There are two, equivalent ways to do it. One way is to divide the conventional threshold by the number of comparisons and use the new threshold. So for this test, with 10 pairs of groups, the threshold for rejecting the null would be 0.0510 = 0.005. The other way is just to multiply the p-value by the number of comparisons and then compare this to the standard threshold.

Using the “multiply by” approach to the Bonferroni correction, what is the p-value value on the Music-vs-Moss comparison?

0.005  0.01  0.04  0.05  0.20  0.40  0.80

Prob 14.24. This exercise concerns assessing the appropriateness of the Bonferroni correction in a setting where you are choosing one pair out of several possible groups to compare to each other. The exercise is based on a simulation. You will need to install additional software to run the simulation, which you can do with this command:

Recall that the Bonferroni correction relates to a situation where you are conducting k hypothesis tests and choosing the p-value that is most favorable. For example, you might have tested several different treatments: B, C, D, ... against a control, A. An example is shown in the figure.

Group E seems to stand out as different from the control, so you decide to focus your attention on comparing the control A to group E.

To generate the data in the graph, use this command:

d = run.sim(bogus.groups, ngroups=8, seed=230)

When you do this, your results should match these exactly (subject only to round-off):

group       val
1     A  96.47747
2     A 101.51901
3     A 104.93023
4     A 105.42620
5     A 106.63995
6     B 100.61215

tail(d)

group       val
35     G  96.05322
36     H  86.91363
37     H  92.46628
38     H 109.58906
39     H 126.52284
40     H 136.98016

As you might guess from the name of the simulation, bogus.groups, the data are being sampled from a population in which the assigned group is unrelated to the quantitative variable val. That is, the null hypothesis is true.

The point of the seed=230 statement is to make sure that exactly the same “random” numbers are generated as in the example.

Using the data, extract the subset of data for groups A and E, and carry out a hypothesis test on group using the model val ~ group. (Hint: in the subset command, use the selector variable group=="A’’|group=="E’’ to select cases that are from either group A or group E. Then apply lm to this subset of data.)

• What is the p-value that compares groups A to E.
0.011  0.032  0.057  0.128  0.193  0.253

As a convenience, you are provided with a special-purpose function, compare.many.groups that does each of the hypothesis tests comparing the control group to another group. You can run it this way:

compare.many.groups( d, control="A")

group1 group2       diff     pvalue
1      A      B  -3.456261 0.37730300
2      A      C  -2.203797 0.78987416
3      A      D  -1.047166 0.79541252
4      A      E -21.580974 0.01101977
5      A      F  -7.841517 0.03687201
6      A      G   1.361413 0.74935216
7      A      H   7.495821 0.46485192

For the simulated data, your results should match exactly (subject to round-off). Confirm that the p-value from this report matches the one you got by hand above.

• As it happens, the p-value for group F is also below the 0.05 threshold. What is that p-value?
0.021  0.031  0.037  0.042  0.049

The Bonferroni correction takes the “raw” p-value and corrects it by multiplying by the number of comparisons. Since there are 7 comparisons altogether (groups B, C, ..., H against group A), multiply the p-value by 7.

• What is the Bonferroni-corrected p-value for the A vs E comparison?
0.011  0.037  0.077  0.11  0.37  0.77

Now you are going to do a simulation to test whether the Bonferroni correction is appropriate. The idea is to generate data from a simulation that implements the null hypothesis, calculate the p-value for the “best looking” comparison between the control group A and one of the other groups, and see how often the p-value is less than 0.05. It will turn out that the p-value on the best looking comparison will be below 0.05 much too often. (If the p-value is to be taken at face value, when the null hypothesis is true then p < 0.05 should happen only 5% of the time.) Then you’ll apply the Bonferroni correction and see if this fixes things.

Because this is a somewhat elaborate computation, the following leads you through the development of the R statement that will generate the results needed. Each step is a slight elaboration on the previous one.

1.
Generate one realization from the bogus.groups simulation and find the A vs other group p-values. In this example, you will use seed=111 in the first few steps so that you can compare your output directly to what’s printed here. This can be done in two statements, like this:
d = run.sim( bogus.groups, ngroups=8,seed=111 )
compare.many.groups( d, control="A")

group1 group2        diff    pvalue
1      A      B  -2.7882420 0.7311174
2      A      C  16.0119200 0.1113634
3      A      D   0.1178478 0.9893143
4      A      E  11.8370894 0.1204709
5      A      F   9.7352535 0.4986335
6      A      G -10.9812636 0.4322559
7      A      H   6.9922567 0.4669595
2.
It’s helpful to combine the two statements above into a single statement. That will make it easier to automate later on. To do this, just put the statement that created d in place of d, as in the following. (The statement is so long that it’s spread over 3 lines. You can cut it out of this page and into an R session, if you wish.)
compare.many.groups(
run.sim( bogus.groups, ngroups=8,seed=111 ),
control="A")

group1 group2        diff    pvalue
1      A      B  -2.7882420 0.7311174
2      A      C  16.0119200 0.1113634
3      A      D   0.1178478 0.9893143
4      A      E  11.8370894 0.1204709
5      A      F   9.7352535 0.4986335
6      A      G -10.9812636 0.4322559
7      A      H   6.9922567 0.4669595

It happens that none of the p-values from this simulation were < 0.05.

3.
Extend the statement above to extract the smallest p-value. This involves prefacing the statement with min( and following it with \$pvalue)

min(compare.many.groups(
run.sim( bogus.groups, ngroups=8,seed=111 ),
control="A")\$pvalue)

[1] 0.1113634
4.
Now, run this statement a dozen times, using do(12)*. Assign the results to an object s.
s = do(12)*min(compare.many.groups(
run.sim( bogus.groups, ngroups=8,seed=111 ),
control="A")\$pvalue)
s

result
1  0.1113634
2  0.1113634
3  0.1113634
4  0.1113634
5  0.1113634
6  0.1113634
7  0.1113634
8  0.1113634
9  0.1113634
10 0.1113634
11 0.1113634
12 0.1113634

Don’t be too surprised by the result. You got exactly the same answer on each trial because the same random “seed” was used every time.

5.
Delete the random seed, so that new random numbers will be generated on each trial.
s = do(12)*min(compare.many.groups(
run.sim( bogus.groups, ngroups=8 ),
control="A")\$pvalue)
s

result
1  0.20460170
2  0.05247199
3  0.01378853
4  0.03755064
5  0.03587157
6  0.26104192
7  0.13371236
8  0.08099884
9  0.03397622
10 0.16773459
11 0.43178937
12 0.16676447
6.
The point of running a dozen trials was just to produce a manageable amount of output for you to read. Now generate a thousand trials and, storing the results in s. It will take about a minute for the computer to complete the calculation, more on older computers.
s = do(1000)*min(compare.many.groups(
run.sim( bogus.groups, ngroups=8 ),
control="A")\$pvalue)