Reading Questions.

- 1.
- What is the disadvantage of using a 100% coverage interval to describe variation?
- 2.
- In describing a sample of a variable, what is the relationship between the variance and the standard
deviation?
- 3.
- What is a residual?
- 4.
- What’s the difference between “density” and “frequency” in displaying a variable with a
histogram?
- 5.
- What’s a normal distribution?
- 6.
- Here is the graph showing boxplots of height broken down according to sex as
well as for both males and females together.
Which components of the boxplot for “All” match up exactly with the boxplot for “M” or “F”? Explain why.

- 7.
- Variables typically have units. For example, in Galton’s height data, the
height variable has units of inches. Suppose you are working with a
variable in units of degrees Celsius. What would be the units of the
standard deviation of a variable? Of the variance? Why are they different?

Prob 3.01. Here is a small table of percentiles of typical daily calorie consumption of college students.

Percentile | Calories |

0 | 1400 |

5 | 1800 |

10 | 2000 |

25 | 2400 |

50 | 2600 |

75 | 2900 |

90 | 3100 |

95 | 3300 |

100 | 3700 |

- (a)
- What is the 50%-coverage interval?
- Lower Boundary
- 1800 1900 2000 2200 2400 2500 2600
- Upper Boundary
- 2600 2750 2900 3000 3100 3200 3500

- (b)
- What percentage of cases lie between 2900 and 3300? 10 20 25 30 40 50 60 70 80 90 95
- (c)
- What is the percentile that marks the upper end of the 95%-coverage interval?
75 90 92.5 95 97.5 100
Estimate the corresponding calorie value from the table.
2900 3000 3100 3300 3500 3700
- (d)
- Using the 1.5 IQR rule-of-thumb for identifying an outlier, what would be the
threshold for identifying a low calorie consumption as an outlier? 1450 1500 1650 1750 1800 2000
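
The 1.5 IQR rule of thumb in part (d) can be sketched in base R as follows. This is only an illustration on a small made-up vector, not the calorie table; the values in `x` and the variable names are hypothetical, and `quantile()` is the base-R function, which may interpolate slightly differently from other percentile routines.

```r
# Hypothetical data, for illustration only (not the calorie table above).
x <- c(12, 15, 16, 18, 19, 20, 21, 23, 24, 40)

q   <- quantile(x, c(0.25, 0.75))   # 25th and 75th percentiles
iqr <- unname(q[2] - q[1])          # interquartile range

# Cases beyond these fences are flagged as outliers by the rule of thumb.
low_fence  <- unname(q[1]) - 1.5 * iqr
high_fence <- unname(q[2]) + 1.5 * iqr

outliers <- x[x < low_fence | x > high_fence]
outliers
```

With a percentile table such as the one above, the same arithmetic applies directly to the tabulated 25th and 75th percentiles.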

Prob 3.02.

Here are some useful operators for taking a quick look at data frames:

names | Lists the names of the components. |

ncol | Tells how many components there are. |

nrow | Tells how many lines of data there are. |

head | Prints the first several lines of the data frame. |

Here are some examples of these commands applied to the CO2 data frame:

Plant Type Treatment conc uptake

1 Qn1 Quebec nonchilled 95 16.0

2 Qn1 Quebec nonchilled 175 30.4

3 Qn1 Quebec nonchilled 250 34.8

4 Qn1 Quebec nonchilled 350 37.2

5 Qn1 Quebec nonchilled 500 35.3

6 Qn1 Quebec nonchilled 675 39.2
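
The commands that produced the listing above did not survive in this copy. A minimal sketch of how the four operators might be applied, assuming the `CO2` data frame that ships with R's built-in datasets package (which matches the columns shown):

```r
# CO2 is part of R's built-in datasets, so no file needs to be read.
names(CO2)   # names of the components (columns)
ncol(CO2)    # how many components there are
nrow(CO2)    # how many lines of data there are
head(CO2)    # the first several lines of the data frame
```

The same four calls work on any data frame, which is how the questions below can be answered.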

- The data frame iris records measurements on flowers. You can read in
with
creating an object named iris.

Use the above operators to answer the following questions.

- 1.
- Which of the following is the name of a column in iris?
flower Color Species Length
- 2.
- How many rows are there in iris? 1 50 100 150 200
- 3.
- How many columns are there in iris? 2 3 4 5 6 7 8 10
- 4.
- What is the Sepal.Length in the third row? 1.2 3.6 4.2 4.7 5.9

- The data frame mtcars has data on cars from the 1970s. You can read it in
with
creating an object named cars.

Use the above operators to answer the following questions.

- 1.
- Which of the following is the name of a column in cars?
carb color size weight wheels
- 2.
- How many rows are there in cars? 30 31 32 33 34 35
- 3.
- How many columns are there in cars? 7 8 9 10 11
- 4.
- What is the wt in the second row? 2.125 2.225 2.620 2.875 3.215

Prob 3.03. Here are Galton’s data on heights of adult children and their parents.

- (a)
- Which one of these commands will give the 95th percentile of the children’s
heights in Galton’s data?
A qdata(95,height,data=galton)
B qdata(0.95,height,data=galton)
C qdata(95,galton,data=height)
D qdata(0.95,galton,data=height)
E qdata(95,father,data=height)
F qdata(0.95,father,data=galton)
- (b)
- Which of these commands will give the 90-percent coverage interval of the
children’s heights in Galton’s data?
A qdata(c(0.05,0.95),height,data=galton)
B qdata(c(0.025,0.975),height,data=galton)
C qdata(0.90,height,data=galton)
D qdata(90,height,data=galton)
- (c)
- Find the 50-percent coverage interval of the following variables in Galton’s height
data:
- Father’s heights
A 59 to 73 inches
B 68 to 71 inches
C 63 to 65.5 inches
D 68 to 74 inches

- Mother’s heights
A 59 to 73 inches
B 68 to 71 inches
C 63 to 65.5 inches
D 68 to 74 inches
- (d)
- Find the 95-percent coverage interval of
- Father’s heights
A 65 to 73 inches
B 65 to 74 inches
C 68 to 73 inches
D 59 to 69 inches

- Mother’s heights
A 62.5 to 68.5 inches
B 65 to 69 inches
C 63 to 68.5 inches
D 59 to 69 inches

Prob 3.04. In Galton’s data, are the sons typically taller than their fathers? Create a variable that is the difference between the son’s height and the father’s height. (Arrange it so that a positive number refers to a son who is taller than his father.)

- 1.
- What’s the mean height difference (in inches)? -2.47 -0.31 0.06 66.76 69.23
- 2.
- What’s the standard deviation (in inches)? 1.32 2.63 2.74 3.58 3.75
- 3.
- What is the 95-percent coverage interval (in inches)?
A -3.7 to 4.8
B -4.6 to 4.9
C -5.2 to 5.6
D -9.5 to 4.5

Prob 3.05. Use R to generate the sequence of 101 numbers: 0, 1, 2, 3, …, 100.

- 1.
- What’s the mean value? 25 50 75 100
- 2.
- What’s the median value? 25 50 75 100
- 3.
- What’s the standard deviation? 10.7 29.3 41.2 53.8
- 4.
- What’s the sum of squares? 5050 20251 103450 338350 585200

Now generate the sequence of perfect squares 0, 1, 4, 9, …, 10000, or, written
another way, 0², 1², 2², 3², …, 100². (Hint: Make a simple sequence 0 to 100 and
square it.)

- 1.
- What’s the mean value? 50 2500 3350 4750 7860
- 2.
- What’s the median value? 50 2500 3350 4750 7860
- 3.
- What’s the standard deviation? 29.3 456.2 3028 4505 6108
- 4.
- What’s the sum of squares? 5050 20251 338350 585200 2050333330
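
The hint about making a simple sequence and squaring it might look like this in R. The sketch deliberately uses the shorter sequence 0 to 10 rather than 0 to 100, so it shows the technique without working the problem; the names `x` and `sq` are arbitrary.

```r
x <- 0:10        # the integers 0, 1, ..., 10
mean(x)          # 5
median(x)        # 5
sd(x)            # sqrt(11), about 3.32
sum(x^2)         # sum of squares: 385

sq <- x^2        # the perfect squares 0, 1, 4, ..., 100
mean(sq)         # 35
median(sq)       # 25
```

Replacing `0:10` with the longer sequence gives the quantities asked for above.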

Prob 3.06. Using Galton’s height data (galton.csv),

make a box-and-whisker plot of the appropriate variable and count the outliers to answer each of these questions.

- 1.
- Which of these statements will make a box-and-whisker plot of height?
A bwplot(height,data=galton)
B bwplot(~height,data=galton)
C bwplot(galton,data=height)
D bwplot(~galton,data=height)
- 2.
- How many of the cases are outliers in height? 0 1 2 3 5 10 15 20
- 3.
- Make a box-and-whisker plot of mother. The bounds of the whiskers will be at
60 and 69 inches. It looks like just a few cases are beyond the whiskers. This
might be misleading if there are several mothers with exactly the same
values.
Make a tally of the mothers, like this:

  58 58.5   59   60 60.2 60.5   61 61.5   62 62.5 62.7   63 63.5
   7    9   26   36    1    1   25    1   73   22    7  103   42

63.7   64 64.2 64.5 64.7   65 65.5   66 66.2 66.5 66.7   67   68
   8  112    5   26    7  133   36   69    5   47    6   45   11

68.5   69 70.5 Total
  10   23    2   898

How many of the cases lie outside the whiskers in the box-and-whisker plot of mother?
0 11 22 33 44 55 66
- 4.
- Apply the same process to father. According to the criteria
used by bwplot, how many of the cases are outliers in father?
0 4 9 14 19 24 29
- 5.
- You can tally on multiple variables. For instance, to tally on both mother and
sex, do this:
Looking just at the cases where mother is an outlier, how many of the children involved (variable sex) are female?

0 5 10 15 20 25 30 35
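
The tally commands referred to in this problem are not shown in this copy. As a rough stand-in, base R's `table()` produces the same kind of counts; this is a sketch on a tiny hypothetical data frame `d`, not the book's tally operator and not the Galton data.

```r
# Hypothetical miniature data frame; the real data come from galton.csv.
d <- data.frame(
  mother = c(58, 58, 64, 64, 70.5, 64),
  sex    = c("F", "M", "F", "F", "M", "M")
)

one_way <- table(d$mother)          # count of cases at each mother's height
two_way <- table(d$mother, d$sex)   # the same counts, broken down by sex
two_way
```

Reading across a row of the two-way table gives the number of female and male children at that mother's height.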

Prob 3.08. The figure shows the results from the medal winners in the women’s 10m air-rifle competition in the 2008 Olympics. (Figure from the New York Times, Aug. 10, 2008)

The location of each of 10 shots is shown as a translucent light circle in each target. The objective is to hit the bright target dot in the center. There is random scatter (variance) as well as steady deviations (bias) from the target.

What is the direction of the apparent bias in Katerina Emmons’s results?
(Directions are indicated as compass directions, E=east, NE=north east,
etc.)

NE NW SW SE

To measure the size of the bias, find the center of the shots and measure how far that is from the target dot. Take the distance between the concentric circles as one unit.

What is the size of the bias in Katerina Emmons’s results?

0 1 3 4 6 10

Prob 3.09. Here is a boxplot:

Reading from the graph, answer the following:

- (a)
- What is the median?
0 1 2 3 6 Can’t estimate from this graph
- (b)
- What is the 75th percentile?
0 1 2 3 6 Can’t estimate from this graph
- (c)
- What is the IQR?
0 1 2 3 4 6 Can’t estimate from this graph
- (d)
- What is the 40th percentile?
A between 0 and 1
B between 1 and 2
C between 2 and 3
D between 3 and 4
E between 4 and 6
F Can’t estimate from this graph.

Prob 3.10a. The plot shows two different displays of density. The displays might be from the same distribution or two different distributions.

- (a)
- What are the two displays?
A Density and cumulative
B Rug and cumulative
C Cumulative and box plot
D Density and rug plot
E Rug and box plot

- (b)
- The two displays show the same distribution.
True or False
- (c)
- Describe briefly any sign of mismatch or what features convince you that the two
displays are equivalent.

Prob 3.10b.

The plot shows two different displays of density. The displays might be from the same distribution or two different distributions.

- (a)
- What are the two displays?
A Density and cumulative
B Rug and cumulative
C Cumulative and box plot
D Density and rug plot
E Rug and box plot

- (b)
- The two displays show the same distribution.
True or False
- (c)
- Describe briefly any sign of mismatch or what features convince you that the two
displays are equivalent.

Prob 3.11. By hand, calculate the mean, the range, the variance, and the standard deviation of each of the following sets of numbers:

- (A)
- 1,0,-1
- (B)
- 1,3
- (C)
- 1,2,3.

- 1.
- Which of the 3 sets of numbers — A, B, or C — is the most spread out
according to the range?
A A
B B
C C
D No way to know
E All the same

- 2.
- Which of the 3 sets of numbers — A, B, or C — is the most spread out
according to the standard deviation?
A A
B B
C C
D No way to know
E All the same

Prob 3.12. A standard deviation contest. For (a) and (b) below, you can choose numbers from the set 0,1,2,3,4,5,6,7,8, and 9. Repeats are allowed.

- (a)
- Which list of 4 numbers has the largest standard deviation such a list can
possibly have?
A 0,3,6,9
B 0,0,0,9
C 0,0,9,9
D 0,9,9,9

- (b)
- Which list of 4 numbers has the smallest standard deviation such a list can
possibly have?
A 0,3,6,9
B 0,1,2,3
C 5,5,6,6
D 9,9,9,9

Prob 3.13.

- (a)
- From what kinds of variables would side-by-side boxplots be generated?
A categorical only
B quantitative only
C one categorical and one quantitative
D varies according to situation

- (b)
- From what kinds of variables would a histogram be generated?
A categorical only
B quantitative only
C one categorical and one quantitative
D varies according to situation

Prob 3.14.

The boxplots below are all made from exactly the same data. One of them is made correctly, according to the “1.5 IQR” convention for drawing the whiskers. The others are drawn differently.

Plot 1 | Plot 2 | Plot 3 | Plot 4 |

- Which of the plots is correct?
1 2 3 4

Prob 3.15.

The plot purports to show the density of a distribution of data. If this is true, the fraction of the data that falls between any two values on the x axis should be the area under the curve between those two values.

Answer the following questions. In doing so, keep in mind that the area of each little box on the graph paper has been arranged to be 0.01, so you can calculate the area by counting boxes. You don’t need to be too fanatical about dealing with boxes where only a portion is under the curve; just eyeball things and estimate.
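
Counting boxes is a by-hand version of numerical integration. The sketch below shows the same idea in R for a hypothetical density, a normal curve with mean 13 and standard deviation 1.5; this curve is an assumption chosen for illustration and is not the curve in the figure.

```r
# Hypothetical density: a normal curve with mean 13 and sd 1.5.
dens <- function(x) dnorm(x, mean = 13, sd = 1.5)

# Total area under the curve; for a genuine density this is very close to 1.
total <- integrate(dens, -Inf, Inf)$value

# Area between two x values: the fraction of cases with 12 <= x <= 14.
between <- integrate(dens, 12, 14)$value

c(total = total, between = between)
```

If the total area of a purported density is far from 1, the curve cannot be a density, which is the check part (a) asks for.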

- (a)
- The total area under a density curve should be 1. Assuming
that the density curve has height zero outside of the area of
the plot, is the area under the entire curve consistent with this?
yes no
- (b)
- What fraction of the data falls in the range 12 ≤ x ≤ 14?
A 14%
B 22%
C 34%
D 56%
E Can’t tell from this graph.

- (c)
- What fraction of the data falls in the range 14 ≤ x ≤ 16?
A 14%
B 22%
C 34%
D 56%
E Can’t tell from this graph.

- (d)
- What fraction of the data has x ≥ 16?
A 1%
B 2%
C 5%
D 10%
E Can’t tell from this graph.

- (e)
- What is the width of the 95% coverage interval? (Note: The coverage interval
itself has top and bottom ends. This problem asks for the spacing between the
two ends.)
A 2
B 4
C 8
D 12
E Can’t tell from this graph.

Prob 3.16. If two distributions have the same five-number summary, must their density plots have the same shape? Explain.

Prob 3.17. As the name suggests, the Old Faithful geyser in Yellowstone National Park has eruptions that come at fairly predictable intervals, making it particularly attractive to tourists.

- (a)
- You are a busy tourist and have only 10 minutes to sit around and watch the
geyser. But you can choose when to arrive. If the last eruption occurred at noon,
what time should you arrive at the geyser to maximize your chances of seeing an
eruption?
A 12:50
B 1:00
C 1:05
D 1:15
E 1:25

- (b)
- Roughly, what is the probability that in the best 10-minute interval, you will
actually see the eruption:
A 5%
B 10%
C 20%
D 30%
E 50%
F 75%

- (c)
- A simple measure of how faithful is Old Faithful is the interquartile range. What
is the interquartile range, according to the boxplot above?
A 10 minutes
B 15 minutes
C 25 minutes
D 35 minutes
E 50 minutes
F 75 minutes

- (d)
- Not only are you a busy tourist, you are a smart tourist. Having read about Old
Faithful, you understand that the time between eruptions depends on how long
the previous eruption lasted. Here’s a box plot indicating the distribution of
inter-eruption times when the previous eruption duration was less than three
minutes. (That is, “TRUE” means the previous eruption lasted less than three
minutes.)
You can easily ask the ranger what was the duration of the previous eruption.

What is the best 10-minute interval to return (after a noon eruption) so that you will be most likely to see the next eruption, given that the previous eruption was less than three minutes in duration (the “TRUE” category)?

A 1:00 to 1:10
B 1:05 to 1:15
C 1:10 to 1:20
D 1:15 to 1:25
E 1:20 to 1:30
F 1:25 to 1:35

- (e)
- How likely are you to see an eruption if you return for the most likely 10-minute
interval?
A About 5%
B About 10%
C About 20%
D About 30%
E About 50%
F About 75%

Prob 3.18. For each of the following distributions, estimate by eye the mean, standard deviation, and 95% coverage interval. Also, calculate the variance.

Part 1.

- Mean. 10 15 20 25 30
- Std. Dev. 2 5 12 15 20
- 95% coverage interval.
- Lower end: 1 3 10 15 20
- Upper end: 20 25 30 35 40
- Variance. 2 7 10 20 25 70 140 300

Part 2.

- Mean. 0.004 150 180 250
- Std. Dev. 10 30 60 80 120
- 95% coverage interval.
- Lower end: 50 80 100 135 150 200 230
- Upper end: 50 80 100 180 200 230
- Variance. 30 80 500 900 1600 23000

Prob 3.19. Consider a large company where the average wage of workers is $15 per hour, but there is a spread of wages from minimum wage to $35 per hour.

After a contract negotiation, all workers receive a $2 per hour raise. What happens to the standard deviation of hourly wages?

A No change
B It goes up by $2 per hour
C It goes up by $4 per hour
D It goes up by 4 dollars-square per hour
E It goes up by $4 per hour-square
F Can’t tell from the information given.

The annual cost-of-living adjustment is 3%. After the cost-of-living adjustment, what happens to the standard deviation of hourly wages?

A No change
B It goes up by 3%
C It goes up by 9%
D Can’t tell from the information given.

Prob 3.20. Construct a data set of 10 hypothetical exam scores (use integers between 0 and 100) so that the inter-quartile range equals zero and the mean is greater than the median.

Give your set of scores here:

Prob 3.23. Here are some familiar quantities. For each of them, indicate what is a typical value, how far a typical case is from this typical value, and what is an extreme but not impossible case.

Example: Adult height. Typical value, 1.7 meters (68 inches). Typical case is about 7cm (3 inches) from the typical value. An extreme height is 2.2 meters (87 inches).

- An adult’s weight.
- Income of a full-time employed
person.
- Speed of cars on a highway in good
conditions.
- Systolic blood pressure in adults. [You might need to look this up on the
Internet.]
- Blood cholesterol LDL levels. [Again, you might need the
Internet.]
- Fuel economy among different models of
cars.
- Wind speed on a summer
day.
- Hours of sleep per night for college
students.

Prob 3.24. Data on the distribution of economic variables, such as income, is often presented in quintiles: divisions of the group into five equal-sized parts.

Here is a table from the US Census Bureau (Historical Income Tables from March 21, 2002) giving the distribution of income across US households in year 2000.

Upper | Mean | |

Quintile | Boundary | Value |

Lowest | $17,955 | $10,190 |

Second | $33,006 | $25,334 |

Third | $52,272 | $42,361 |

Fourth | $81,960 | $65,729 |

Fifth | — | $141,260 |

Based on this table, calculate:

- (a)
- The 20th percentile of family income.
10190 17955 33006 25334 52272 42361 81960 141260
- (b)
- The 80th percentile of family income.
10190 17955 33006 25334 52272 42361 81960 141260
- (c)
- The table doesn’t specify the median family income but you can make a
reasonable estimate of it. Pick the closest one.
10000 18000 25500 42500 53000 65700
- (d)
- Note that there is no upper boundary reported for the fifth quintile,
and no lower boundary reported for the first quintile. Why?
- (e)
- From this table, what evidence is there that family
income has a skew rather than “normal” distribution?

Prob 3.25. Use the Internet to find “normal” ranges for some measurements used in clinical medicine. Pick one of the following or choose one of particular interest to you: blood pressure (systolic, diastolic, pulse), hematocrit, blood sodium and potassium levels, HDL and LDL cholesterol, white blood cell counts, clotting times, blood sugar levels, vital respiratory capacity, urine production, and so on. In addition to the normal range, find out what “normal” means, e.g., a 95% coverage interval on the population or a range inconsistent with proper physiological function. You may find out that there are differing views of what “normal” means — try to indicate the range of such views. You may also find out that “normal” ranges can be different depending on age, sex, and other demographic variables.

Prob 3.28. An advertisement for “America’s premier weight loss destination” states that “a typical two week stay results in a loss of 7-14 lbs.” (The New Yorker, 7 April 2008, p 38.)

The advertisement gives no details about the meaning of “typical.” Give two or three plausible interpretations of the quoted 7-14 pound figure in terms of “typical.” What interpretation would be most useful to a person trying to predict how much weight he or she might lose?

Prob 3.29. A seemingly straightforward statistic to describe the health of a population is average age at death. In 1842, the Report on the Sanitary Conditions of the Labouring Population of Great Britain gave these averages: “gentlemen and persons engaged in the professions, 45 years; tradesmen and their families, 26 years; mechanics, servants and laborers, and their families, 16 years.”

A student questioned the accuracy of the 1842 report with this observation: “The mechanics, servants and laborer population wouldn’t be able to renew itself with an average age at death of 16 years. Mothers would be dying so early in life that they couldn’t possibly raise their kids.”

Explain how an average age of death of 16 years could be quite consistent with a “normal” family structure in which parents raise their children through the child’s adolescence in the teenage years. What other information about ages at death would give a more complete picture of the situation?

Prob 3.30. The identification of a case as an outlier does not always mean that the case is invalid or abnormal or the result of a mistake. One situation where perfectly normal cases can look like outliers is when there is a mechanism of proportionality at work. Imagine, for instance, that there is a typical rate of production of a substance, and the normal variability is proportional in nature, say from 1/10 of that typical rate to 10 times the rate. This leads to a situation where some normal cases are 100 times as large as others.

To illustrate, look at the alder.csv data set, which contains field data from a study of nitrogen fixation in alder plants. The SNF variable records the amount of nitrogen fixed in soil by bacteria that reside in root nodules of the plants. Make a box plot and a histogram and describe the distribution. Which of the following descriptions is most appropriate:

A The distribution is skewed to the left, with outliers at very low values of SNF.
B The distribution is skewed to the right, with outliers at very high values of SNF.
C The distribution is roughly symmetrical, although there are a few outliers.

In working with a variable like this, it can help to convert the variable in a way that respects the idea of a proportional change. For instance, consider the three numbers 0.1, 1.0, and 10.0, which are evenly spaced in proportionate terms — each number is 10 times bigger than the preceding number. But as absolute differences, 0.1 and 1.0 are much closer to each other than 1.0 and 10.0.

The logarithm function transforms numbers to a scale where even proportions are equally spaced. For instance, taking the logarithm of the numbers 0.1, 1.0, and 10.0 gives the sequence -1, 0, 1 — exactly evenly spaced.
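
The three numbers in this example can be checked directly in R. This short sketch uses only base-R functions; the name `x` is arbitrary.

```r
x <- c(0.1, 1.0, 10.0)

diff(x)           # absolute gaps between neighbors: 0.9 and 9
log10(x)          # -1 0 1
diff(log10(x))    # 1 1: evenly spaced on the logarithmic scale
```

The gaps differ by a factor of ten on the original scale but are identical after the transformation, which is what "evenly spaced in proportionate terms" means.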

The logSNF variable gives the logarithm of SNF. Plot out the distribution of logSNF. Which of the following descriptions is most apt?

A The distribution is skewed to the left.
B The distribution is skewed to the right.
C The distribution is roughly symmetrical.

You can compute logarithms directly in R, using the functions log, log2, or log10. Which of these functions was used to compute the quantity logSNF from SNF. (Hint: Try them out!)

log log2 log10

The base of the logarithm gives the size of the proportional change that corresponds to a 1-unit increase on the logarithmic scale. For example, log2 calculates the base-2 logarithm. On the base-2 logarithmic scale, a doubling in size corresponds to a 1-unit increase. In contrast, on the base-10 scale, a ten-fold increase in size gives a 1-unit increase.
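
A brief sketch of what the base means in practice, using made-up inputs rather than the alder data. The value `v` is an arbitrary positive number chosen only for illustration.

```r
log2(c(1, 2, 4, 8))     # 0 1 2 3: each doubling adds 1
log10(c(1, 10, 100))    # 0 1 2: each ten-fold increase adds 1
log(exp(1))             # 1: log() is the natural logarithm (base e)

# Changing the base only rescales the result by a constant factor.
v <- 50
log2(v) / log10(v)      # equals log2(10) for any positive v
```

Because the bases differ only by a constant factor, the shape of a distribution looks the same whichever base is used; only the axis labels change.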

Logarithmic transformations are often used to deal with variables that are positive and strongly skewed. In economics, price, income and production variables are often this way. In general, any variable where it is sensible to describe changes in terms of proportion might be better displayed on a logarithmic scale. For example, price inflation rates are usually given as percent (e.g., “The inflation rate was 4% last year.”) and so in dealing with prices over time, the logarithmic transformation can be appropriate.

Prob 3.31.

This exercise deals with data on weight loss achieved by clients who stayed two weeks at a weight-loss resort. The same data are shown using three different sorts of graphical displays: a pie chart, a histogram, and a box-and-whiskers plot. The point of the exercise is to help you decide which display is the most effective at presenting information to you.

In many fields, pie charts are used as “statistical graphics.” Here’s a pie chart of the weight loss:

Using the pie graph, answer the following:

- (a)
- What’s the “typical” (median or mean) weight loss?
3.7 4.2 5.5 6.8 8.3 10.1 12.4
- (b)
- What is the central 50% coverage interval? 2.3 to 6.8, 4.2 to 10.7, 4.4 to 8.7, 6.1 to 9.3, 5.2 to 12.1
- (c)
- What is an upper extreme value?
10 13 16 18 20

Now to display the data as a histogram. So that you can’t just re-use your answers from the pie chart, the weights have been rescaled into kilograms.

Using the histogram, answer the following:

- 1.
- What’s the “typical” (median or mean) weight loss? 1.9 2.1 3.1 3.7 4.6 5.6
- 2.
- What is the central 50% coverage interval? 1.1 to 3.3, 2.0 to 4.8, 2.0 to 3.9, 2.8 to 4.4, 2.5 to 5.4
- 3.
- What is an upper extreme value?
6 8 10 12 14

Finally, here is a boxplot of the same data. It’s been rescaled into a traditional unit of weight: stones.

Using the boxplot, answer the following:

- 1.
- What’s the “typical” (median or mean) weight loss? 0.20 0.35 0.50 0.68 0.83 1.2
- 2.
- What is the central 50% coverage interval? 0.2 to 0.5, 0.3 to 0.8, 0.4 to 0.8, 0.5 to 0.7, 0.3 to 0.6
- 3.
- What is an upper extreme value?
0.7 0.9 1.0 1.1 1.3

Which style of graphic made it easiest to answer the questions?

pie.chart histogram box.plot

Prob 3.36. Elevators typically have a close-door button. Some people claim that this button has no mechanical function; it’s there just to give impatient people some sense of control over the elevator.

Design and conduct an experiment to test whether the button does cause the elevator door to close. Pick an elevator with such a button and record some details about the elevator itself: place installed, year installed, model number, etc.

Describe your experiment along with the measurements you made and your conclusions. You may want to do the experiment in small teams and use a stopwatch in order to make accurate measurements. Presumably, you will want to measure the time between when the button is pressed and when the door closes, but you might want to measure other quantities as well, for instance the time from when the door first opened to when you press the button.

Store the data from your experiment in a spreadsheet in Google Docs. Set the permissions for the spreadsheet so that anyone with the link can read your data. Make sure to paste the link into the textbox so that your data can be accessed.

Please don’t inconvenience other elevator users with the experiment.

Prob 3.50.
What’s a “normal” body temperature? Depending on whether you use
the Celsius or Fahrenheit scale, you are probably used to the numbers 37°C
or 98.6°F. These numbers come from the work of Carl Wunderlich,
published in Das Verhalten der Eigenwärme in Krankheiten in 1868 based
on more than a million measurements made under the armpit. According
to Wunderlich, “When the organism (man) is in a normal condition, the
general temperature of the body maintains itself at the physiologic point:
37°C = 98.6°F.”

Since 1868, not only have the techniques for measuring temperatures improved, but so has the understanding that “normal” is not a single temperature but a range of temperatures.

A 1992 article in the Journal of the American Medical Association (PA
Mackowiak et al., “A Critical Appraisal of 98.6°F ...” JAMA v. 268(12) pp.
1578-1580) examined temperature measurements made orally with an electronic
thermometer. The subjects were 148 healthy volunteers between age 18 and
40.

The figure shows the distribution of temperatures, separately for males and females. Note that the horizontal scale is given in both C and F — this problem will use F.

What’s the absolute range for females?

- Minimum: 96.1 96.3 97.1 98.6 99.9 100.8
- Maximum: 96.1 96.3 97.1 98.6 99.9 100.8

And for males?

- Minimum: 96.1 96.3 97.1 98.6 99.9 100.8
- Maximum: 96.1 96.3 97.1 98.6 99.9 100.8

Notice that there is an outlier for the females’ temperature, as evidenced by a big gap in temperature between that bar and the next closest bar. How big is the gap?

A About 0.01
B About 0.1
C Almost 1

Give a 95% coverage interval for females. Hint: The interval will exclude the most extreme 2.5% of cases on each of the left and right sides of the distribution. You can find the left endpoint of the 95% interval by scanning in from the left, adding up the heights of the bars until they total 0.025. Similarly, the right endpoint can be marked by scanning in from the right until the bars total 0.025.

A About 96.2
B About 96.8
C About 97.6

And for males?

A About 96.2
B About 96.7 to about 99.4
C About 97.5
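
The scanning procedure described in the hint above (adding up bar heights from each side until they total 0.025) can be sketched in R. The bins and fractions here are hypothetical, not read from the figure, and the names `left_edge`, `fraction`, and `width` are invented for the illustration.

```r
# Hypothetical histogram: left edge of each 0.5-degree bar and the
# fraction of cases in that bar (the fractions sum to 1).
left_edge <- c(96.0, 96.5, 97.0, 97.5, 98.0, 98.5, 99.0, 99.5)
fraction  <- c(0.01, 0.03, 0.10, 0.26, 0.30, 0.20, 0.08, 0.02)
width     <- 0.5

from_left  <- cumsum(fraction)             # running total, scanning from the left
from_right <- rev(cumsum(rev(fraction)))   # running total, scanning from the right

# The bar in which the running total first reaches 2.5% on each side.
lo_bar <- which(from_left >= 0.025)[1]
hi_bar <- rev(which(from_right >= 0.025))[1]

# The 2.5% points fall somewhere inside these bars, so this is approximate.
c(lower = left_edge[lo_bar], upper = left_edge[hi_bar] + width)
```

The result is only as precise as the bar width, which is why the answer choices are stated as "about".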

Prob 3.53. There are many different numerical descriptions of distributions: mean, median, standard deviation, variance, IQR, coverage interval, ... And these are just the ones we have touched on so far. We’ll also encounter “standard error,” “margin of error,” “confidence interval.” There are so many that it becomes a significant challenge to students to keep them straight. Eventually, statistical workers learn the subtleties of the different descriptions and when each is appropriate. Then, like using near synonyms in English, it becomes second nature.

As an example, consider the verb “spread.” Here are some synonyms from the thesaurus, each of which is appropriate in a particular context: broadcast, scatter, propagate, sprawl, extend, stretch, cover, daub, ... If you were talking to a farmer about sowing seeds, the words “broadcast” or “scatter” would be appropriate, but it would be silly to say the seeds are being “daubed” or “sprawled”. On the other hand, to an urbanite concerned with congestion in traffic, the growth of the city might well be summarized with “sprawl.” You have to know the context and the intent to choose the correct term.

To help you understand the different contexts and intents, here are three important ways of categorizing what a particular description captures:

- Location and scatter
- What is a typical value? (“center”)
- What are the top and bottom range of the values? (“range”)
- How far are the values scattered? (“scatter”)
- What is high? or What is low? (“non-central”)

- Including the “extremes”
- All inclusive, and sensitive to outliers. (“not-robust”)
- All inclusive, but not sensitive to outliers. (“robust”)
- Leaves out the very extremes. (“plausible”)
- Focuses on the middle. (“mainstream”)

Note that descriptors of both the “plausible” and the “mainstream” type are necessarily robust, since they leave out the outliers.

- Individual versus whole sample.
- Description relevant to individual cases
- Description or summary of entire samples, combining many cases.

You won’t have to deal with this until later, where it explains terms that you haven’t yet encountered, like “standard error”, “margin of error”, and “confidence interval.”

Example: The mean describes the center of a distribution. It is calculated from all the data and is not robust against outliers.

For each of the following descriptors of a distribution, choose the items that best characterize the descriptor.

- 1.
- Median
- (a)
- center range scatter non-central
- (b)
- robust not-robust plausible mainstream

- 2.
- Standard Deviation
- (a)
- center range scatter non-central
- (b)
- robust not-robust plausible mainstream

- 3.
- IQR
- (a)
- center range scatter non-central
- (b)
- robust not-robust plausible mainstream

- 4.
- Variance
- (a)
- center range scatter non-central
- (b)
- robust not-robust plausible mainstream

- 5.
- 95% coverage interval
- (a)
- center range scatter non-central
- (b)
- robust not-robust plausible mainstream

- 6.
- 50% coverage interval
- (a)
- center range scatter non-central
- (b)
- robust not-robust plausible mainstream

- 7.
- 50th percentile
- (a)
- center range scatter non-central

- 8.
- 80th percentile
- (a)
- center range scatter non-central

- 9.
- 99th percentile
- (a)
- center range scatter non-central

- 10.
- 10th percentile
- (a)
- center range scatter non-central

One of the reasons why there are so many descriptive terms is that they have different roles in theory. For example, the variance turns out to have simple theoretical properties that make it useful when describing sums of variables. It’s much simpler than, say, the IQR.

Prob 3.54. There are two kinds of questions that are often asked relating to percentiles:

- What is the value that falls at a given percentage? For instance, in the
ten-mile-race.csv running data, how fast are the fastest 10% of runners? In
R, you would ask in this way:
> run = fetchData("ten-mile-race.csv")

> qdata(0.10, run$net)

 10%
4409
The answer is in the units of the variable, in this case seconds. So 10% of the runners have net times faster than or equal to 4409 seconds.

- What percentage falls at a given value? For instance, what fraction of runners
are faster than 4000 seconds?
> pdata(4000, run$net)

[1] 0.04029643
The answer includes those whose net time is exactly equal to or less than 4000 seconds.

It’s important to pay attention to the p and q in the statement. pdata and qdata ask related but different questions.
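
The two kinds of question can also be illustrated with base-R functions on a small made-up vector. This is a sketch only: `quantile()` and `ecdf()` are base R, and treating them as close analogues of qdata and pdata is an assumption, since interpolation details may differ. The values in `net` are hypothetical, not the race data.

```r
# Hypothetical net times in seconds (not the ten-mile-race data).
net <- c(3900, 4100, 4400, 4800, 5200, 5600, 6000, 6500, 7000, 7600)

# "q" question: which value sits at a given fraction of the cases?
quantile(net, 0.10)

# "p" question: what fraction of cases sits at or below a given value?
mean(net <= 4400)     # 0.3
ecdf(net)(4400)       # the same fraction, via the empirical CDF

# Strictly below: ties at the value itself are excluded.
mean(net < 4400)      # 0.2
```

The last two lines show why "at or below" and "strictly below" can give different answers whenever some cases fall exactly on the value.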

Use pdata and qdata to answer the following questions about the running data.

- 1.
- Below (or equal to) what age are the youngest 35% of runners?
- Which statement will do the correct calculation?
A pdata(0.35,run$age)
B qdata(0.35,run$age)
C pdata(35,run$age)
D qdata(35,run$age)
- What will the answer be? 28 29 30 31 32 33 34 35

- 2.
- What’s the net time that divides the slowest 20% of runners from the rest of the
runners?
- Which statement will do the correct calculation?
A pdata(0.20,run$net)
B qdata(0.20,run$net)
C pdata(0.80,run$net)
D qdata(0.80,run$net)
- What will the answer be? 4921 5318 5988 6346 7123 7431 seconds

- 3.
- What is the 95% coverage interval on age?
- Which statement will do the correct calculation?
A pdata(c(0.025,0.975),run$age)
B qdata(c(0.025,0.975),run$age)
C pdata(c(0.050,0.950),run$age)
D qdata(c(0.050,0.950),run$age)
- What will the answer be?
A 22 to 60
B 20 to 65
C 25 to 59
D 20 to 60

- 4.
- What fraction of runners are 30 or younger?

- Which statement will do the correct calculation?
A pdata(30,run$age)
B qdata(30,run$age)
C pdata(30.1,run$age)
D qdata(30.1,run$age)
- What will the answer be?
In percent: 29.3 30.1 33.7 35.9 38.0 39.3

- 5.
- What fraction of runners are 65 or older? (Caution: This isn’t yet in the form of
a BELOW question.)
- Which statement will do the correct calculation?
A pdata(65,run$age)
B pdata(64.99,run$age)
C pdata(65.01,run$age)
D 1-pdata(65,run$age)
E 1-pdata(64.99,run$age)
F 1-pdata(65.01,run$age)
- What will the answer be?
In percent: 0.5 1.1 1.7 2.3 2.9

- 6.
- The time it takes for a runner to get to the start line after the starting gun is
fired is the difference between the time and net.
run$to.start = run$time - run$net
- How long is it before 75% of runners get to the start line?
In seconds: 164 192 213 294 324 351
- What fraction of runners get to the start line before one minute? (Caution:
the times are measured in seconds.)
In percent: 10 15 19 22 25 31 34

- 7.
- What is the 95% coverage interval on the ages of female runners?
A 19 to 61 years
B 22 to 61 years
C 19 to 56 years
D 22 to 56 years

- 8.
- What fraction of runners have a net time BELOW 4000 seconds? (That is,
don’t include those who are at exactly 4000 seconds.)

In percent: 3.72 4.00 4.03 4.07 5.21