Chapter 5 Problems      Statistical Modeling: A Fresh Approach (2/e)

• What is a sampling distribution? What sort of variation does it reflect?
• What are resampling and bootstrapping?
• What is the difference between a “confidence interval” and a “coverage interval”?

Prob 5.01. The mean of the adult children in Galton’s data is

> mean( height, data=Galton )

 66.76069

Had Galton selected a different sample of kids, he would likely have gotten a slightly different result. The confidence interval indicates a likely range of possible results around the actual result.

Use bootstrapping to calculate the 95% confidence interval on the mean height of the adult children in Galton’s data. The following statement will generate 500 bootstrapping trials.

> trials = do(500) * mean(height, data=resample(Galton) )
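
For readers who want to see what the do(500) and resample idiom is doing, here is a minimal base-R sketch. It assumes that resampling means drawing cases with replacement, and it uses simulated heights (the vector `heights` is made up for illustration and is not Galton's data), so its output says nothing about the answers below.

```r
set.seed(101)  # make the illustration repeatable

# Hypothetical stand-in data: 200 simulated heights in inches (NOT Galton's data)
heights <- rnorm(200, mean = 67, sd = 3.5)

# One bootstrap trial: resample the cases with replacement, then take the mean
one_trial <- function(x) mean(sample(x, size = length(x), replace = TRUE))

# 500 trials, analogous to do(500) * mean(height, data = resample(Galton))
boot_means <- replicate(500, one_trial(heights))

# The middle 95% of the trial results serves as a 95% confidence interval
ci <- quantile(boot_means, probs = c(0.025, 0.975))
ci
```

The same two quantiles are what qdata reports when it is applied to the stored trials.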

(a)
What’s the 95% confidence interval on the mean height?

 A 66.5 to 67.0 inches. B 66.1 to 67.3 inches. C 61.3 to 72 inches. D 65.3 to 66.9 inches.

(b)
A 95% coverage interval on the individual children’s height can be calculated like this:
> qdata(c(0.025,0.975), height, data=Galton)

2.5% 97.5%
60    73

Explain why the 95% coverage interval of individual children’s heights is so different from the 95% confidence interval on the mean height of all children.

(c)
Calculate a 95% confidence interval on the median height.

 A 66.5 to 67.0 inches. B 66.1 to 67.3 inches. C 61.3 to 72 inches. D 65.3 to 66.9 inches.

Prob 5.02. Consider the mean height in Galton’s data, grouped by sex:

> g = fetchData("Galton")
> mean( height ~ sex, data=g )

F        M
64.11016 69.22882

In interpreting this result, it’s helpful to have a confidence interval on the mean height of each group. Resampling and bootstrapping can do the job.

Resampling simulates a situation where a new sample is being drawn from the population, as if Galton had restarted his work. Since you don’t have access to the original population, you can’t actually pick another sample. But you can resample from your sample, introducing much of the variation that would occur if you had drawn a new sample from the population.
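
As a rough picture of the mechanics, the sketch below resamples a tiny made-up data frame in base R. It assumes that resampling means drawing rows with replacement, so that some cases appear more than once and others not at all. The data frame `toy` and its values are hypothetical.

```r
set.seed(202)

# Hypothetical toy sample with six cases (made-up values for illustration)
toy <- data.frame(
  sex    = c("F", "F", "F", "M", "M", "M"),
  height = c(62.0, 64.5, 66.0, 68.0, 70.5, 72.0)
)

# Draw row indices with replacement: some cases repeat, others drop out
rows <- sample(nrow(toy), size = nrow(toy), replace = TRUE)
resampled <- toy[rows, ]

# Groupwise means on the original sample and on the resample
original_means <- tapply(toy$height, toy$sex, mean)
resampled_means <- tapply(resampled$height, resampled$sex, mean)
```

With only six cases a resample can occasionally miss one group entirely; with a sample the size of Galton's this is not a practical concern.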

Each resample gives a somewhat different result:

> mean( height ~ sex, data=resample(g) )

F        M
64.06674 69.18453

> mean( height ~ sex, data=resample(g) )

F        M
63.95768 69.17194

> mean( height ~ sex, data=resample(g) )

F        M
64.03630 69.33382

By repeating this many times, you can estimate how much variation there is in the resampling process:

> trials = do(1000) * mean(height~sex, data=resample(g))

To quantify the variation, it’s conventional to take a 95% coverage interval. For example, here’s the coverage interval for the mean height of females in the resampling trials:

> qdata( c(0.025, 0.975), F, data=trials )

2.5%    97.5%
63.87739 64.34345
• What is the 95% coverage interval for heights of males in the resampling trials? (Choose the closest answer.)

 A 63.9 to 64.3 inches B 63.9 to 69.0 inches C 69.0 to 69.5 inches

• Do the confidence intervals for mean height overlap between males and females?
Yes  No
• Make a box-and-whisker plot of height versus sex.
> bwplot( height ~ sex, data=g )

This displays the overall variation of individual heights.

• Is there any overlap between males and females in the distribution of individual heights?
Yes  No
• How does the spread of individual heights compare to the 95% confidence intervals on the means?

 A The CIs on the means are about as wide as the distribution of individual heights. B The CIs on the means are much wider than the distribution of individual heights. C The CIs on the means are much narrower than the distribution of individual heights.

Prob 5.03. Resampling and bootstrapping provide one method to find confidence intervals. By making certain mathematical assumptions, it’s possible to estimate a confidence interval without randomization. This is the traditional method and is still widely used. It’s implemented by the confint function when applied to a model such as created by mm or lm. For example:

> g = fetchData("Galton")
> mod = mm(height ~ sex, data=g)
> confint(mod)

group    2.5 %   97.5 %
1     F 63.87352 64.34681
2     M 69.00046 69.45717

For the present, you can imagine that confint is going through the work of resampling and repeating trials, although that is not what is actually going on.
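
The mathematical assumptions are not spelled out here. Purely as an illustration, the sketch below computes the familiar t-based interval for a single mean (mean ± t* × s/√n) on simulated data and checks it against base R's t.test. Whether confint on an mm model performs exactly this per-group calculation is an assumption that the text does not confirm. The data are made up.

```r
set.seed(303)
x <- rnorm(50, mean = 64, sd = 2.5)  # hypothetical simulated data

n <- length(x)
se <- sd(x) / sqrt(n)                 # standard error of the mean
t_star <- qt(0.975, df = n - 1)       # multiplier for 95% confidence
by_hand <- mean(x) + c(-1, 1) * t_star * se

# Same interval from base R's one-sample t procedure
from_t_test <- as.numeric(t.test(x, conf.level = 0.95)$conf.int)
```

No resampling is involved: the width of the interval comes from a formula rather than from repeated trials.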

For each of the following groupwise models, find the 95% confidence interval on the group means. Then answer whether the confidence intervals overlap. When there are more than two groups, consider whether any of the groups overlap with any of the others.

Note: Although mean and mm are doing much the same thing, mm retains additional information about the data needed to calculate a confidence interval. So use mm in this exercise rather than mean.

(a)
In the kids’ feet data, the foot width broken down by sex.
> mm( width ~ sex, data=KidsFeet )

Overlap:

None  Barely  Much
(b)
In the CPS85 data, the hourly wage broken down by sex.
> mm( wage ~ sex, data=CPS85 )

Overlap:

None  Barely  Much
(c)
In the CPS85 data, the hourly wage broken down by married.
> mm( wage ~ married, data=CPS85 )

Overlap:

None  Barely  Much
(d)
In the CPS85 data, the hourly wage broken down by sector.
> mm( wage ~ sector, data=CPS85 )

Overlap:

None  Barely  Much

Prob 5.09. A student writes the following on a homework paper:

“The 95% confidence interval is (9.6, 11.4). I’m very confident that this is correct, because my sample mean of 10.5 lies within this interval.”

Comment on the student’s reasoning.

Source: Prof. Rob Carver, Stonehill College

Prob 5.12. Scientific papers very often contain graphics with “error bars.” Unfortunately, there is little standardization of what such error bars mean, so it is important for the reader to pay careful attention in interpreting the graphs.

The following four graphs — A through D — each show a distribution of data along with error bars. The meaning of the bars varies from graph to graph according to different conventions used in different areas of the scientific literature. In each graph, the height of the filled bar is the mean of the data. Your job is to associate each error bar with its meaning. You can do this by comparing the actual data (shown as dots) with the error bar.

[Figure: Graphs A through D, each showing the data as dots, the mean as a filled bar, and an error bar.]

• Range of the data
Graph A  Graph B  Graph C  Graph D
• Standard deviation of the data
Graph A  Graph B  Graph C  Graph D
• Standard error of the mean
Graph A  Graph B  Graph C  Graph D
• 95% confidence interval on the mean
Graph A  Graph B  Graph C  Graph D

This problem is based on G. Cumming, F. Fidler, and DL Vaux (2007), “Error bars in experimental biology”, J. Cell Biology 177(1):7-11

Prob 5.13. An advertisement for “America’s premier weight loss destination” states that “a typical two week stay results in a loss of 7-14 lbs.” (The New Yorker, 7 April 2008, p 38.)

The advertisement gives no details about the meaning of “typical,” but here are some possibilities:

• The 95% coverage interval of the weight loss of the individual clients.
• The 50% coverage interval of the weight loss of the individual clients.
• The 95% confidence interval on the mean weight loss of all the clients.
• The 50% confidence interval on the mean weight loss of all the clients.

Explain what would be valid and what misleading about advertising a confidence interval on the mean weight loss.

Why might it be reasonable to give a 50% coverage interval of the weight loss of individual clients, but not appropriate to give a 50% confidence interval on the mean weight loss?

Prob 5.17. Standard errors and confidence intervals apply not just to model coefficients, but to any numerical description of a variable. Consider, for instance, the median or IQR or standard deviation, and so on.

A quick and effective way to find a standard error is a method called bootstrapping, which involves repeatedly resampling the variable and calculating the description on each resample. This gives the sampling distribution of the description. From the sampling distribution, the standard error — which is just the standard deviation of the sampling distribution — can be computed.

Here’s an example, based on the inter-quartile range of the kids’ foot length measurements.

First, compute the desired sample statistic on the actual data. As it happens, IQR is not part of the mosaic package, so you need to use the with function:

> with( IQR(length), data=KidsFeet )

 1.6

Next, modify the statement to incorporate resampling of the data:

> with( IQR(length), data=resample(KidsFeet) )

 1.3

Finally, run this statement many times to generate the sampling distribution and find the standard error of this distribution:

> samps = do(500) * with( IQR(length), data=resample(KidsFeet) )
> sd(samps)

result
0.3379714
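
The same three steps can be sketched in base R for readers without the mosaic package. This is an illustration under stated assumptions: resampling is taken to mean sampling with replacement, the helper `boot_se` is a hypothetical name, and the data are simulated rather than taken from KidsFeet.

```r
set.seed(404)
# Hypothetical stand-in for a measured variable (NOT the KidsFeet data)
x <- rnorm(39, mean = 25, sd = 1.3)

# Standard error of any statistic: sd of that statistic over many resamples
boot_se <- function(x, stat, trials = 500) {
  sd(replicate(trials, stat(sample(x, replace = TRUE))))
}

se_iqr <- boot_se(x, IQR)
se_mean <- boot_se(x, mean)

# For the mean, the bootstrap value should be near the formula sd(x)/sqrt(n)
formula_se_mean <- sd(x) / sqrt(length(x))
```

Passing the statistic as a function argument means the same helper works for any numerical description of the variable.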

Use the bootstrapping method to find an estimate of the standard error of each of these sample statistics on the kids’ foot length data:

1.
The sample median. (Pick the closest answer.)
0.01  0.07  0.14  0.24  0.34  0.71  1.29  1.32  24.6
2.
The sample standard deviation. (Pick the closest answer.)
0.01  0.07  0.14  0.24  0.34  0.71  1.29  1.32  24.6
3.
The sample 75th percentile.
0.01  0.07  0.14  0.24  0.34  0.71  1.29  1.32  24.6

Bootstrapping works well in a broad set of circumstances, but if you have a very small sample, say fewer than a dozen cases, you should be skeptical of the result.

Prob 5.20. A perennial problem when writing scientific reports is figuring out how many significant digits to report. It’s naïve to copy all the digits from one’s calculator or computer output; the data generally do not justify such precision.

Once you have a confidence interval, however, you do not need to guess how many significant digits are appropriate. The standard error provides good guidance. Here is a rule of thumb: keep two significant digits of the margin of error and round the point estimate to the same precision.

For example, suppose you have a confidence interval of 1.7862 ± 0.3624 with 95% confidence. Keeping the first two significant digits of the margin of error gives 0.36. We’ll keep the point estimate to the same number of digits, giving altogether 1.79 ± 0.36.

Another example: suppose the confidence interval is 6548.23 ± 1321. Keeping the first two digits of the margin of error gives 1300, with a corresponding confidence interval of 6500 ± 1300.
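
The rule of thumb can be written as a small R function. This is a sketch, and `round_ci` is a hypothetical helper name, not part of any package. It rounds the margin of error to two significant digits and then rounds the point estimate to the same decimal position.

```r
round_ci <- function(estimate, margin) {
  # Keep two significant digits of the margin of error
  m <- signif(margin, 2)
  # Decimal position of the margin's second significant digit
  # (negative values mean rounding to tens, hundreds, ...)
  digits <- 1 - floor(log10(abs(margin)))
  c(estimate = round(estimate, digits), margin = m)
}

first  <- round_ci(1.7862, 0.3624)   # 1.79 ± 0.36
second <- round_ci(6548.23, 1321)    # 6500 ± 1300
```

Both calls reproduce the two worked examples above.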

(a)
Suppose the computer output is 0.03234232 ± 0.01837232.

Using this rule of thumb, what should you report as the confidence interval?

 A 0.3234 ± 0.01837 B 0.3234 ± 0.0183 C 0.0323 ± 0.0184 D 0.0323 ± 0.018 E 0.032 ± 0.018 F 0.032 ± 0.01 G 0.03 ± 0.01

(b)
Now suppose the computer output were 99.63742573 ± 1.48924367.

What should you report as the confidence interval?

 A 100 ± 1 B 99 ± 1.5 C 99.6 ± 1.5 D 99.64 ± 1.49 E 99.647 ± 1.489

Prob 5.23. Robert Hooke (1635-1703) was a contemporary of Isaac Newton. He is famous for his law of elasticity (Hooke’s Law) and is considered the father of microscopy. He was the first to use the word “cell” to name the components of plant tissue; the structures he saw through a microscope reminded him of monks’ cells in a monastery. He drew this picture of cork cells under the microscope:

[Figure: Hooke’s drawing of cork cells.]

Regarding these observations of cork, Hooke wrote:

I could exceedingly plainly perceive it to be all perforated and porous, much like a Honey-comb, but that the pores of it were not regular. . . . these pores, or cells, . . . were indeed the first microscopical pores I ever saw, and perhaps, that were ever seen, for I had not met with any Writer or Person, that had made any mention of them before this ....

He went on to measure the cell size.

But, to return to our Observation, I told several lines of these pores, and found that there were usually about threescore of these small Cells placed end-ways in the eighteenth part of an Inch in length, whence I concluded that there must be neer eleven hundred of them, or somewhat more then a thousand in the length of an Inch, and therefore in a square Inch above a Million, or 1166400. and in a Cubick Inch, above twelve hundred Millions, or 1259712000. a thing almost incredible, did not our Microscope assure us of it by ocular demonstration ....

From Robert Hooke, Micrographia, 1665

There are several aspects of Hooke’s statement that reflect its origins at the start of modern science. Some are quaint, such as the spelling and obsolete use of Capitalization and the hyperbolic language (“a thing almost incredible,” which, to be honest, is true enough, but not a style accepted today in scientific writing). Hooke worked before the development of the modern notion of precision. The seeming exactness of the number 1,259,712,000 for the count of cork cells in a cubic inch leaves a modern reader to wonder: did he really count over a billion cells?

It’s easy enough to trace through Hooke’s calculation. The observation at the base of the calculation is threescore cells (that’s 60 cells) in 1/18 of an inch. This comes out to 60 × 18 = 1080 cells per linear inch. Modeling each cell as a little cube allows this to be translated into the number of cells covering a square inch: 1080² or 1,166,400. To estimate the number of cells in a cubic inch of cork material, the calculation is 1080³ or 1,259,712,000.

To find the precision of these estimates, you need to go back to the precision of the basic observation: 60 cells in 1/18th of an inch. Hooke didn’t specify the precision of this, but it seems reasonable to think it might be something like 60 ± 5 or so, at a confidence level of 95%.

1.
When you change the units of a measurement (say, miles into kilometers), both the point estimate and the margin of error are multiplied by the conversion factor.

Translate Hooke’s count of the number of cells in 1/18 inch, 60 ± 5 into a confidence interval on the number of cells per linear inch.

 A 1080 ± 5 B 1080 ± 90 C 1080 ± 180

2.
In calculating the number of cells to cover a square inch, Hooke simply squared the number of cells per inch. That’s a reasonable approximation.

To carry this calculation through a confidence interval, you can’t just square the point estimate and the margin of error separately. Instead, a reasonable way to proceed is to take the endpoints of the interval (e.g., 55 to 65 for the count of cells in 1/18 inch), and square those endpoints. Then convert back to ± format.

What is a reasonable confidence interval for the number of cells covering a square inch?

 A 1,200,000 ± 500,000 B 1,170,000 ± 190,000 C 1,166,000 ± 19,000 D 1,166,400 ± 1,900

3.
What is a reasonable confidence interval for the number of cork cells that fit into a cubic inch?

 A 1,300,000,000 ± 160,000,000 B 1,260,000,000 ± 16,000,000 C 1,260,000,000 ± 1,600,000 D 1,259,700,000 ± 160,000 E 1,259,710,000 ± 16,000 F 1,259,712,000 ± 1,600

It’s usually better to write such numbers in scientific notation, so that the reader doesn’t have to count digits to make sense of them. For example, 1,260,000,000 ± 16,000,000 might be more clearly written as (1260 ± 16) × 10⁶.
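
The endpoint method described in part 2 can be sketched as a small R function. The name `power_interval` is hypothetical, the numbers are made up (a length of 10 ± 1, not Hooke's count), and taking the midpoint of the transformed endpoints as the new point estimate is one reasonable reading of "convert back to ± format", not the only one.

```r
power_interval <- function(center, margin, power) {
  # Raise the two endpoints of the interval to the given power
  ends <- c(center - margin, center + margin)^power
  # Convert back to "estimate ± margin" using midpoint and half-width
  c(estimate = mean(ends), margin = diff(ends) / 2)
}

area   <- power_interval(10, 1, 2)   # endpoints 81 and 121, so 101 ± 20
volume <- power_interval(10, 1, 3)   # endpoints 729 and 1331, so 1030 ± 301
```

Note that the relative margin of error grows with the power: 10% for the length, about 20% for the area, and about 29% for the volume.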

Prob 5.30. After a month’s hard work in the laboratory, you have measured a growth hormone from each of 40 plants and computed a confidence interval on the grand mean hormone concentration of 36 ± 8 ng/ml. Your advisor asks you to collect more samples until the margin of error is 4 ng/ml. Assuming the typical 1/√n relationship between the number of cases in the sample and the size of the margin of error, how many plants, including the 40 you have already processed, will you need to measure?

40  80  160  320  640
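
Assuming the relationship meant above is the usual one, in which the margin of error m shrinks in proportion to 1/√n, the required sample size follows from a short piece of algebra. The symbols n₁, m₁ (current sample size and margin) and n₂, m₂ (target values) are introduced here for illustration, and the numbers in the worked line are made up rather than taken from the problem.

```latex
\[
  m \propto \frac{1}{\sqrt{n}}
  \quad\Longrightarrow\quad
  \frac{m_2}{m_1} = \sqrt{\frac{n_1}{n_2}}
  \quad\Longrightarrow\quad
  n_2 = n_1 \left(\frac{m_1}{m_2}\right)^{2}
\]
% Illustration with made-up numbers: n_1 = 25, m_1 = 6, m_2 = 2
\[
  n_2 = 25 \left(\frac{6}{2}\right)^{2} = 25 \times 9 = 225
\]
```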

Prob 5.31. You are calculating the mean of a variable B and you want to know the standard error, that is, the standard deviation of the sampling distribution of the mean. Which of the following statements will estimate the standard error by bootstrapping?

 A sd(do(500)*resample(mean(B))) B resample(do(500)*mean(sd(B))) C mean(do(500)*mean(resample(B))) D sd(do(500)*mean(resample(B))) E resample(sd(do(500)*mean(B)))

Prob 5.40. In this activity, you are going to look at the sampling distribution and how it depends on the size of the sample. This will be done by simulating a sample drawn from a population with known properties. In particular you’ll be looking at a variable that is more or less like the distribution of human adult heights — normally distributed with a mean of 68 inches and a standard deviation of 3 inches.

Here’s one random sample of size n = 10 from this simulated population:

rnorm(10, mean=68, sd=3)

 62.842 71.095 62.357 68.896 67.494
 67.233 69.865 71.664 69.241 70.581

These are the heights of a random sample of n = 10. The sampling distribution refers to some numerical description of such data, for example, the sample mean. Consider this sample mean the output of a single trial.

mean( rnorm(10, mean=68, sd=3) )

 67.977

If you gave exactly this statement, it’s very likely that your result was different. That’s because you have a different random sample — rnorm generates random numbers. And if you repeat the statement, you’ll likely get a different value again, for instance:

mean( rnorm(10, mean=68, sd=3) )

 66.098

Note that both of the sample means above differ somewhat from the population mean of 68.

The point of examining a sampling distribution is to be able to see the reliability of a random sample. To do this, you generate many trials (say, 1000) and look at the distribution of the trials.

For example, here’s how to look at the sampling distribution for the mean of 10 random cases from the population:

s = do(1000)*mean( rnorm(10, mean=68, sd=3) )

By examining the distribution of the values stored in s, you can see what the sampling distribution looks like.
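
For comparison, here is a base-R sketch of the same kind of simulation using replicate in place of do. To avoid giving away the results of the activity, it assumes a different, hypothetical population (normal with mean 100 and standard deviation 15) and a sample size of n = 25.

```r
set.seed(505)

# Hypothetical population: normal, mean 100, sd 15 (not the heights example)
n <- 25

# 1000 trials; each trial is the mean of a fresh random sample of size n
s <- replicate(1000, mean(rnorm(n, mean = 100, sd = 15)))

center <- mean(s)   # should be near the population mean, 100
spread <- sd(s)     # should be near 15 / sqrt(25) = 3
```

A histogram or density plot of `s` shows the shape of the sampling distribution.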

Generate your own sample

• What is the mean of this distribution?
• What is the standard deviation of this distribution?
• What is the shape of this distribution?

Now modify your simulation to look at the sampling distribution for n = 1000.

• What is the mean of this distribution?
• What is the standard deviation of this distribution?
• What is the shape of this distribution?

Which of these two sample sizes, n = 10 or n = 1000, gave a sampling distribution that was more reliable? How might you measure the reliability?

The idea of a sampling distribution applies not just to means, but to any numerical description of a variable, to the coefficients on models, etc.

Now modify your computer statements to examine the sampling distribution of the standard deviation rather than the mean. Use a sample size of n = 10. (Note: Read the previous sentence again. The statistic you are asked to calculate is the sample standard deviation, not the sample mean.)

• What is the mean of this distribution?
• What is the standard deviation of this distribution?
• What is the shape of this distribution?

Repeat the above calculation of the distribution of the sample standard deviation with n = 1000.

• What is the mean of this distribution?
• What is the standard deviation of this distribution?
• What is the shape of this distribution?

For this simulation of heights, the population standard deviation was set to 3. You expect the result from a random sample to be close to the population parameter. Which of the two sample sizes, n = 10 or n = 1000, gives results that are closer to the population value?