Chapter 3 Problems      Statistical Modeling: A Fresh Approach (2/e)

1.
What is the disadvantage of using a 100% coverage interval to describe variation?
2.
In describing a sample of a variable, what is the relationship between the variance and the standard deviation?
3.
What is a residual?
4.
What’s the difference between “density” and “frequency” in displaying a variable with a histogram?
5.
What’s a normal distribution?
6.
Here is the graph showing boxplots of height broken down according to sex as well as for both males and females together.

Which components of the boxplot for “All” match up exactly with the boxplot for “M” or “F”? Explain why.

7.
Variables typically have units. For example, in Galton’s height data, the height variable has units of inches. Suppose you are working with a variable in units of degrees Celsius. What would be the units of the standard deviation of that variable? Of the variance? Why are they different?

Prob 3.01. Here is a small table of percentiles of typical daily calorie consumption of college students.

 Percentile  Calories
          0      1400
          5      1800
         10      2000
         25      2400
         50      2600
         75      2900
         90      3100
         95      3300
        100      3700

(a)
What is the 50%-coverage interval?
Lower Boundary
1800  1900  2000  2200  2400  2500  2600
Upper Boundary
2600  2750  2900  3000  3100  3200  3500
(b)
What percentage of cases lie between 2900 and 3300?
10  20  25  30  40  50  60  70  80  90  95
(c)
What is the percentile that marks the upper end of the 95%-coverage interval?
75  90  92.5  95  97.5  100

Estimate the corresponding calorie value from the table.

2900  3000  3100  3300  3500  3700
(d)
Using the 1.5 IQR rule-of-thumb for identifying an outlier, what would be the threshold for identifying a low calorie consumption as an outlier?
1450  1500  1650  1750  1800  2000
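
As a hedged illustration of the ideas in this problem, the sketch below computes a 50% coverage interval and the 1.5 IQR outlier thresholds in base R. The `calories` vector is made up for the example and is not the table above; `quantile` is used with its default interpolation rule, which may differ slightly from reading values off a printed percentile table.

```r
# Hypothetical data: 11 made-up daily calorie values (NOT the table above).
calories <- c(1500, 1900, 2100, 2300, 2500, 2600, 2700, 2900, 3000, 3200, 3600)

# A 50% coverage interval runs from the 25th to the 75th percentile.
q <- quantile(calories, probs = c(0.25, 0.75))

# The interquartile range (IQR) is the width of that interval.
iqr <- unname(q[2] - q[1])

# 1.5 IQR rule of thumb: values more than 1.5 IQRs beyond the
# 25th or 75th percentile are flagged as possible outliers.
low_threshold  <- unname(q[1]) - 1.5 * iqr
high_threshold <- unname(q[2]) + 1.5 * iqr

print(q)
print(c(low = low_threshold, high = high_threshold))
```

The same three steps (find the quartiles, take their difference, step out 1.5 times that difference) can be carried out by hand from a percentile table.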

Prob 3.02.

Here are some useful operators for taking a quick look at data frames:

 names  Lists the names of the components.
 ncol   Tells how many components there are.
 nrow   Tells how many lines of data there are.
 head   Prints the first several lines of the data frame.

Here are some examples of these commands applied to the CO2 data frame:

CO2 = fetchData("CO2")

Called from: fetchData("CO2")

names(CO2)

[1] "Plant"     "Type"      "Treatment" "conc"      "uptake"

ncol(CO2)

[1] 5

nrow(CO2)

[1] 84

head(CO2)

  Plant   Type  Treatment conc uptake
1   Qn1 Quebec nonchilled   95   16.0
2   Qn1 Quebec nonchilled  175   30.4
3   Qn1 Quebec nonchilled  250   34.8
4   Qn1 Quebec nonchilled  350   37.2
5   Qn1 Quebec nonchilled  500   35.3
6   Qn1 Quebec nonchilled  675   39.2
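
For practice without downloading anything, the same four operators can be tried on a data frame that ships with base R. The sketch below assumes the built-in `airquality` data frame (from the `datasets` package) rather than `fetchData`; the object name `aq` is just an illustrative choice.

```r
# airquality is bundled with base R, so no download is needed.
aq <- airquality

names(aq)    # names of the components (columns)
ncol(aq)     # how many components there are
nrow(aq)     # how many lines of data there are
head(aq, 3)  # first 3 lines; head(aq) alone prints the first 6
```

Individual entries can be picked out of the `head` output by reading across the row number and down the column name.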
• The data frame iris records measurements on flowers. You can read it in with
iris = fetchData("iris")

Called from: fetchData("iris")

creating an object named iris.

Use the above operators to answer the following questions.

1.
Which of the following is the name of a column in iris?

flower  Color  Species  Length
2.
How many rows are there in iris?
1  50  100  150  200
3.
How many columns are there in iris?
2  3  4  5  6  7  8  10
4.
What is the Sepal.Length in the third row?
1.2  3.6  4.2  4.7  5.9
• The data frame mtcars has data on cars from the 1970s. You can read it in with
cars = fetchData("mtcars")

Called from: fetchData("mtcars")

creating an object named cars.

Use the above operators to answer the following questions.

1.
Which of the following is the name of a column in cars?

carb  color  size  weight  wheels
2.
How many rows are there in cars?
30  31  32  33  34  35
3.
How many columns are there in cars?
7  8  9  10  11
4.
What is the wt in the second row?
2.125  2.225  2.620  2.875  3.215

Prob 3.03. Here are Galton’s data on heights of adult children and their parents.

> require(mosaic)
> galton = fetchData("Galton")

(a)
Which one of these commands will give the 95th percentile of the children’s heights in Galton’s data?

 A  qdata(95,height,data=galton)
 B  qdata(0.95,height,data=galton)
 C  qdata(95,galton,data=height)
 D  qdata(0.95,galton,data=height)
 E  qdata(95,father,data=height)
 F  qdata(0.95,father,data=galton)

(b)
Which of these commands will give the 90-percent coverage interval of the children’s heights in Galton’s data?

 A  qdata(c(0.05,0.95),height,data=galton)
 B  qdata(c(0.025,0.975),height,data=galton)
 C  qdata(0.90,height,data=galton)
 D  qdata(90,height,data=galton)

(c)
Find the 50-percent coverage interval of the following variables in Galton’s height data:
• Father’s heights

 A 59 to 73 inches B 68 to 71 inches C 63 to 65.5 inches D 68 to 74 inches

• Mother’s heights

 A 59 to 73 inches B 68 to 71 inches C 63 to 65.5 inches D 68 to 74 inches

(d)
Find the 95-percent coverage interval of
• Father’s heights

 A 65 to 73 inches B 65 to 74 inches C 68 to 73 inches D 59 to 69 inches

• Mother’s heights

 A 62.5 to 68.5 inches B 65 to 69 inches C 63 to 68.5 inches D 59 to 69 inches

Prob 3.04. In Galton’s data, are the sons typically taller than their fathers? Create a variable that is the difference between the son’s height and the father’s height. (Arrange it so that a positive number refers to a son who is taller than his father.)

1.
What’s the mean height difference (in inches)?
-2.47  -0.31  0.06  66.76  69.23
2.
What’s the standard deviation (in inches)?
1.32  2.63  2.74  3.58  3.75
3.
What is the 95-percent coverage interval (in inches)?

 A -3.7 to 4.8 B -4.6 to 4.9 C -5.2 to 5.6 D -9.5 to 4.5
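
One possible way to build such a difference variable is sketched below. The data frame `kids` is hypothetical (made-up heights in inches, not Galton's measurements), though it borrows the column names `father`, `height`, and `sex` used in Galton's data; the names `sons` and `diff` are illustrative choices.

```r
# Hypothetical heights in inches; these are NOT Galton's measurements.
kids <- data.frame(
  father = c(70.0, 68.5, 72.0, 66.0, 69.0, 71.0),
  height = c(71.0, 68.0, 70.5, 68.0, 69.5, 64.0),
  sex    = c("M", "M", "M", "M", "M", "F")
)

# Keep only the sons.
sons <- subset(kids, sex == "M")

# Positive values mean the son is taller than his father.
sons$diff <- sons$height - sons$father

mean(sons$diff)                        # typical difference
sd(sons$diff)                          # spread of the differences
quantile(sons$diff, c(0.025, 0.975))   # 95% coverage interval
```

Subtracting in the other order would flip the sign of every difference, so the order of subtraction is what fixes the meaning of "positive".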

Prob 3.05. Use R to generate the sequence of 101 numbers: 0, 1, 2, 3, ..., 100.

1.
What’s the mean value?
25  50  75  100
2.
What’s the median value?
25  50  75  100
3.
What’s the standard deviation?
10.7  29.3  41.2  53.8
4.
What’s the sum of squares?
5050  20251  103450  338350  585200

Now generate the sequence of perfect squares 0, 1, 4, 9, ..., 10000, or, written another way, 0², 1², 2², 3², ..., 100². (Hint: Make a simple sequence 0 to 100 and square it.)

1.
What’s the mean value?
50  2500  3350  4750  7860
2.
What’s the median value?
50  2500  3350  4750  7860
3.
What’s the standard deviation?
29.3  456.2  3028  4505  6108
4.
What’s the sum of squares?
5050  20251  338350  585200  2050333330
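
The mechanics can be tried first on a shorter sequence. The sketch below uses 1 to 10 (a deliberately different sequence from the one in the problem) and takes "sum of squares" to mean the sum of the squared values themselves, which is an assumption about the intended definition.

```r
# A short practice sequence: 1, 2, ..., 10
x <- 1:10

mean(x)      # arithmetic mean
median(x)    # middle value
sd(x)        # sample standard deviation
sum(x^2)     # sum of squares: 1^2 + 2^2 + ... + 10^2

# Squaring a vector squares each element in turn.
squares <- x^2
squares
```

Replacing `1:10` with `0:100` (or `seq(0, 100)`) gives the sequence the problem asks about.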

Prob 3.06. Using Galton’s height data (galton.csv),

> require(mosaic)
> galton = fetchData("Galton")

make a box-and-whisker plot of the appropriate variable and count the outliers to answer each of these questions.

1.
Which of these statements will make a box-and-whisker plot of height?

 A  bwplot(height,data=galton)
 B  bwplot(~height,data=galton)
 C  bwplot(galton,data=height)
 D  bwplot(~galton,data=height)

2.
How many of the cases are outliers in height?
0  1  2  3  5  10  15  20
3.
Make a box-and-whisker plot of mother. The bounds of the whiskers will be at 60 and 69 inches. It looks like just a few cases are beyond the whiskers. This might be misleading if there are several mothers with exactly the same values.

Make a tally of the mothers, like this:

> tally( ~mother, data=galton )

58  58.5    59    60  60.2  60.5    61  61.5    62  62.5  62.7    63  63.5
7     9    26    36     1     1    25     1    73    22     7   103    42
63.7    64  64.2  64.5  64.7    65  65.5    66  66.2  66.5  66.7    67    68
8   112     5    26     7   133    36    69     5    47     6    45    11
68.5    69  70.5 Total
10    23     2   898

How many of the cases lie outside the whiskers in the box-and-whisker plot of mother?

0  11  22  33  44  55  66
4.
Apply the same process to father. According to the criteria used by bwplot, how many of the cases are outliers in father?
0  4  9  14  19  24  29
5.
You can tally on multiple variables. For instance, to tally on both mother and sex, do this:
> tally( ~mother + sex, data=galton )

Looking just at the cases where mother is an outlier, how many of the children involved (variable sex) are female?

0  5  10  15  20  25  30  35
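
Counting outliers by eye can be cross-checked numerically. The sketch below uses base R's `boxplot.stats`, which applies the 1.5 IQR rule by default; to my understanding `bwplot` relies on the same function unless told otherwise, but treat that as an assumption. The vector `x` is made up, with repeated values at the low end to show how overplotted points can hide cases.

```r
# Hypothetical heights in inches, with a repeated extreme value (58 twice).
x <- c(58, 58, 62, 63, 63.5, 64, 64, 64.5, 65, 65, 65.5, 66, 66, 67, 70.5)

s <- boxplot.stats(x)   # coef = 1.5 by default

s$stats[c(1, 5)]  # ends of the lower and upper whiskers
s$out             # every case beyond the whiskers, duplicates included
length(s$out)     # number of outlying cases
table(s$out)      # how many cases share each outlying value
```

Here the plot would show only two distinct points beyond the whiskers, but three cases are outliers because the value 58 occurs twice.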

Prob 3.08. The figure shows the results from the medal winners in the women’s 10m air-rifle competition in the 2008 Olympics. (Figure from the New York Times, Aug. 10, 2008)

The locations of the 10 shots are shown as translucent light circles in each target. The objective is to hit the bright target dot in the center. There is random scatter (variance) as well as steady deviations (bias) from the target.

What is the direction of the apparent bias in Katerina Emmons’s results? (Directions are indicated as compass directions, E=east, NE=north east, etc.)

NE  NW  SW  SE

To measure the size of the bias, find the center of the shots and measure how far that is from the target dot. Take the distance between the concentric circles as one unit.

What is the size of the bias in Katerina Emmons’s results?

0  1  3  4  6  10

Prob 3.09. Here is a boxplot:

(a)
What is the median?

0  1  2  3  6  Can’t estimate from this graph
(b)
What is the 75th percentile?

0  1  2  3  6  Can’t estimate from this graph
(c)
What is the IQR?

0  1  2  3  4  6  Can’t estimate from this graph
(d)
What is the 40th percentile?

 A  between 0 and 1
 B  between 1 and 2
 C  between 2 and 3
 D  between 3 and 4
 E  between 4 and 6
 F  Can’t estimate from this graph.

Prob 3.10a. The plot shows two different displays of density. The displays might be from the same distribution or two different distributions.

(a)
What are the two displays?

 A  Density and cumulative
 B  Rug and cumulative
 C  Cumulative and box plot
 D  Density and rug plot
 E  Rug and box plot

(b)
The two displays show the same distribution.
True or False
(c)
Describe briefly any sign of mismatch or what features convince you that the two displays are equivalent.

Prob 3.10b.

The plot shows two different displays of density. The displays might be from the same distribution or two different distributions.

(a)
What are the two displays?

 A  Density and cumulative
 B  Rug and cumulative
 C  Cumulative and box plot
 D  Density and rug plot
 E  Rug and box plot

(b)
The two displays show the same distribution.
True or False
(c)
Describe briefly any sign of mismatch or what features convince you that the two displays are equivalent.

Prob 3.11. By hand, calculate the mean, the range, the variance, and the standard deviation of each of the following sets of numbers:

(A)
1,0,-1
(B)
1,3
(C)
1,2,3.

1.
Which of the 3 sets of numbers — A, B, or C — is the most spread out according to the range?

 A  A
 B  B
 C  C
 D  No way to know
 E  All the same

2.
Which of the 3 sets of numbers — A, B, or C — is the most spread out according to the standard deviation?

 A  A
 B  B
 C  C
 D  No way to know
 E  All the same
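
The by-hand steps can be checked against R. The sketch below walks through them for a different, made-up set (2, 4, 9), assuming the sample variance divides by n - 1 (as R's `var` does) and taking "range" to mean the distance from the smallest to the largest value.

```r
# A made-up practice set (not one of the sets A, B, or C).
x <- c(2, 4, 9)
n <- length(x)

m   <- sum(x) / n        # mean: (2 + 4 + 9) / 3
dev <- x - m             # deviations from the mean
ss  <- sum(dev^2)        # sum of squared deviations
v   <- ss / (n - 1)      # sample variance divides by n - 1
s   <- sqrt(v)           # standard deviation is the square root
r   <- max(x) - min(x)   # range, as a single width

c(mean = m, range = r, variance = v, sd = s)
```

Comparing `v` with `var(x)` and `s` with `sd(x)` confirms that the hand calculation matches the built-in functions.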

Prob 3.12. A standard deviation contest. For (a) and (b) below, you can choose numbers from the set 0,1,2,3,4,5,6,7,8, and 9. Repeats are allowed.

(a)
Which list of 4 numbers has the largest standard deviation such a list can possibly have?

 A  0,3,6,9
 B  0,0,0,9
 C  0,0,9,9
 D  0,9,9,9

(b)
Which list of 4 numbers has the smallest standard deviation such a list can possibly have?

 A  0,3,6,9
 B  0,1,2,3
 C  5,5,6,6
 D  9,9,9,9

Prob 3.13.

(a)
From what kinds of variables would side-by-side boxplots be generated?

 A  categorical only
 B  quantitative only
 C  one categorical and one quantitative
 D  varies according to situation

(b)
From what kinds of variables would a histogram be generated?

 A  categorical only
 B  quantitative only
 C  one categorical and one quantitative
 D  varies according to situation

Prob 3.14.

The boxplots below are all made from exactly the same data. One of them is made correctly, according to the “1.5 IQR” convention for drawing the whiskers. The others are drawn differently.

[Figure: four boxplots, labeled Plot 1, Plot 2, Plot 3, and Plot 4]
• Which of the plots is correct?
1  2  3  4

Prob 3.15.

The plot purports to show the density of a distribution of data. If this is true, the fraction of the data that falls between any two values on the x axis should be the area under the curve between those two values.

Answer the following questions. In doing so, keep in mind that the area of each little box on the graph paper has been arranged to be 0.01, so you can calculate the area by counting boxes. You don’t need to be too fanatical about dealing with boxes where only a portion is under the curve; just eyeball things and estimate.

(a)
The total area under a density curve should be 1. Assuming that the density curve has height zero outside of the area of the plot, is the area under the entire curve consistent with this?
yes  no
(b)
What fraction of the data falls in the range 12 ≤ x ≤ 14?

 A 14% B 22% C 34% D 56% E Can’t tell from this graph.

(c)
What fraction of the data falls in the range 14 ≤ x ≤ 16?

 A 14% B 22% C 34% D 56% E Can’t tell from this graph.

(d)
What fraction of the data has x ≥ 16?

 A 1% B 2% C 5% D 10% E Can’t tell from this graph.

(e)
What is the width of the 95% coverage interval? (Note: The coverage interval itself has top and bottom ends. This problem asks for the spacing between the two ends.)

 A 2 B 4 C 8 D 12 E Can’t tell from this graph.
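
The link between area and fraction of data can also be checked numerically. The sketch below uses an exponential density as a stand-in (an assumption chosen only for illustration; it is not the curve in the plot) and compares an exact integral with a box-counting style sum over narrow strips.

```r
# Stand-in density for illustration only: exponential with rate 0.5.
dens <- function(x) dexp(x, rate = 0.5)

# The total area under a density curve should be 1.
total <- integrate(dens, 0, Inf)$value

# Fraction of the data between 1 and 3 = area under the curve there.
frac <- integrate(dens, 1, 3)$value

# "Counting boxes": add up thin strips of width 0.01.
width <- 0.01
left_edges <- seq(1, 3 - width, by = width)
approx_frac <- sum(dens(left_edges) * width)

c(total = total, exact = frac, by_strips = approx_frac)
```

The strip sum is only approximate, just as eyeballing partial boxes on graph paper is, but it lands close to the exact area.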

Prob 3.16. If two distributions have the same five-number summary, must their density plots have the same shape? Explain.

Prob 3.17. As the name suggests, the Old Faithful geyser in Yellowstone National Park has eruptions that come at fairly predictable intervals, making it particularly attractive to tourists.

(a)
You are a busy tourist and have only 10 minutes to sit around and watch the geyser. But you can choose when to arrive. If the last eruption occurred at noon, what time should you arrive at the geyser to maximize your chances of seeing an eruption?

 A 12:50 B 1:00 C 1:05 D 1:15 E 1:25

(b)
Roughly, what is the probability that in the best 10-minute interval, you will actually see the eruption:

 A 5% B 10% C 20% D 30% E 50% F 75%

(c)
A simple measure of how faithful Old Faithful is comes from the interquartile range. What is the interquartile range, according to the boxplot above?

 A 10 minutes B 15 minutes C 25 minutes D 35 minutes E 50 minutes F 75 minutes

(d)
Not only are you a busy tourist, you are a smart tourist. Having read about Old Faithful, you understand that the time between eruptions depends on how long the previous eruption lasted. Here’s a box plot indicating the distribution of inter-eruption times when the previous eruption duration was less than three minutes. (That is, “TRUE” means the previous eruption lasted less than three minutes.)

You can easily ask the ranger what was the duration of the previous eruption.

What is the best 10-minute interval to return (after a noon eruption) so that you will be most likely to see the next eruption, given that the previous eruption was less than three minutes in duration (the “TRUE” category)?

 A 1:00 to 1:10 B 1:05 to 1:15 C 1:10 to 1:20 D 1:15 to 1:25 E 1:20 to 1:30 F 1:25 to 1:35

(e)
How likely are you to see an eruption if you return for the most likely 10-minute interval?

Prob 3.18. For each of the following distributions, estimate by eye the mean, standard deviation, and 95% coverage interval. Also, calculate the variance.

Part 1.

• Mean.
10  15  20  25  30
• Std. Dev.
2  5  12  15  20
• 95% coverage interval.
• Lower end:
1  3  10  15  20
• Upper end :
20  25  30  35  40
• Variance.
2  7  10  20  25  70  140  300

Part 2.

• Mean.
0.004  150  180  250
• Std. Dev.
10  30  60  80  120
• 95% coverage interval.
• Lower end:
50  80  100  135  150  200  230
• Upper end:
50  80  100  180  200  230
• Variance.
30  80  500  900  1600  23000

Prob 3.19. Consider a large company where the average wage of workers is \$15 per hour, but there is a spread of wages from minimum wage to \$35 per hour.

After a contract negotiation, all workers receive a \$2 per hour raise. What happens to the standard deviation of hourly wages?

 A  No change
 B  It goes up by \$2 per hour
 C  It goes up by \$4 per hour
 D  It goes up by 4 dollars-square per hour
 E  It goes up by \$4 per hour-square
 F  Can’t tell from the information given.

The annual cost-of-living adjustment is 3%. After the cost-of-living adjustment, what happens to the standard deviation of hourly wages?

 A  No change
 B  It goes up by 3%
 C  It goes up by 9%
 D  Can’t tell from the information given.

Prob 3.20. Construct a data set of 10 hypothetical exam scores (use integers between 0 and 100) so that the inter-quartile range equals zero and the mean is greater than the median.

Give your set of scores here:

Prob 3.23. Here are some familiar quantities. For each of them, indicate what is a typical value, how far a typical case is from this typical value, and what is an extreme but not impossible case.

Example: Adult height. Typical value, 1.7 meters (68 inches). Typical case is about 7cm (3 inches) from the typical value. An extreme height is 2.2 meters (87 inches).

• Income of a full-time employed person.
• Speed of cars on a highway in good conditions.
• Systolic blood pressure in adults. [You might need to look this up on the Internet.]
• Blood cholesterol LDL levels. [Again, you might need the Internet.]
• Fuel economy among different models of cars.
• Wind speed on a summer day.
• Hours of sleep per night for college students.

Prob 3.24. Data on the distribution of economic variables, such as income, is often presented in quintiles: divisions of the group into five equal-sized parts.

Here is a table from the US Census Bureau (Historical Income Tables from March 21, 2002) giving the distribution of income across US households in year 2000.

 Quintile  Upper Boundary  Mean Value
 Lowest    \$17,955        \$10,190
 Second    \$33,006        \$25,334
 Third     \$52,272        \$42,361
 Fourth    \$81,960        \$65,729
 Fifth     —               \$141,260

Based on this table, calculate:

(a)
The 20th percentile of family income.

10190  17955  33006  25334  52272  42361  81960  141260
(b)
The 80th percentile of family income.

10190  17955  33006  25334  52272  42361  81960  141260
(c)
The table doesn’t specify the median family income but you can make a reasonable estimate of it. Pick the closest one.

10000  18000  25500  42500  53000  65700
(d)
Note that there is no upper boundary reported for the fifth quintile, and no lower boundary reported for the first quintile. Why?
(e)
From this table, what evidence is there that family income has a skewed rather than “normal” distribution?

Prob 3.25. Use the Internet to find “normal” ranges for some measurements used in clinical medicine. Pick one of the following or choose one of particular interest to you: blood pressure (systolic, diastolic, pulse), hematocrit, blood sodium and potassium levels, HDL and LDL cholesterol, white blood cell counts, clotting times, blood sugar levels, vital respiratory capacity, urine production, and so on. In addition to the normal range, find out what “normal” means, e.g., a 95% coverage interval on the population or a range inconsistent with proper physiological function. You may find out that there are differing views of what “normal” means — try to indicate the range of such views. You may also find out that “normal” ranges can be different depending on age, sex, and other demographic variables.

Prob 3.28. An advertisement for “America’s premier weight loss destination” states that “a typical two week stay results in a loss of 7-14 lbs.” (The New Yorker, 7 April 2008, p 38.)

The advertisement gives no details about the meaning of “typical.” Give two or three plausible interpretations of the quoted 7-14 pound figure in terms of “typical.” What interpretation would be most useful to a person trying to predict how much weight he or she might lose?

Prob 3.29. A seemingly straightforward statistic to describe the health of a population is average age at death. In 1842, the Report on the Sanitary Conditions of the Labouring Population of Great Britain gave these averages: “gentlemen and persons engaged in the professions, 45 years; tradesmen and their families, 26 years; mechanics, servants and laborers, and their families, 16 years.”

A student questioned the accuracy of the 1842 report with this observation: “The mechanics, servants and laborer population wouldn’t be able to renew itself with an average age at death of 16 years. Mothers would be dying so early in life that they couldn’t possibly raise their kids.”

Explain how an average age of death of 16 years could be quite consistent with a “normal” family structure in which parents raise their children through the child’s adolescence in the teenage years. What other information about ages at death would give a more complete picture of the situation?

Prob 3.30. The identification of a case as an outlier does not always mean that the case is invalid or abnormal or the result of a mistake. One situation where perfectly normal cases can look like outliers is when there is a mechanism of proportionality at work. Imagine, for instance, that there is a typical rate of production of a substance, and the normal variability is proportional in nature, say from 1/10 of that typical rate to 10 times the rate. This leads to a situation where some normal cases are 100 times as large as others.

To illustrate, look at the alder.csv data set, which contains field data from a study of nitrogen fixation in alder plants. The SNF variable records the amount of nitrogen fixed in soil by bacteria that reside in root nodules of the plants. Make a box plot and a histogram and describe the distribution. Which of the following descriptions is most appropriate:

 A  The distribution is skewed to the left, with outliers at very low values of SNF.
 B  The distribution is skewed to the right, with outliers at very high values of SNF.
 C  The distribution is roughly symmetrical, although there are a few outliers.

In working with a variable like this, it can help to convert the variable in a way that respects the idea of a proportional change. For instance, consider the three numbers 0.1, 1.0, and 10.0, which are evenly spaced in proportionate terms — each number is 10 times bigger than the preceding number. But as absolute differences, 0.1 and 1.0 are much closer to each other than 1.0 and 10.0.

The logarithm function transforms numbers to a scale where even proportions are equally spaced. For instance, taking the logarithm of the numbers 0.1, 1.0, and 10.0 gives the sequence -1, 0, 1 — exactly evenly spaced.

The logSNF variable gives the logarithm of SNF. Plot out the distribution of logSNF. Which of the following descriptions is most apt?

 A  The distribution is skewed to the left.
 B  The distribution is skewed to the right.
 C  The distribution is roughly symmetrical.

You can compute logarithms directly in R, using the functions log, log2, or log10. Which of these functions was used to compute the quantity logSNF from SNF? (Hint: Try them out!)

log  log2  log10

The base of the logarithm gives the size of the proportional change that corresponds to a 1-unit increase on the logarithmic scale. For example, log2 calculates the base-2 logarithm. On the base-2 logarithmic scale, a doubling in size corresponds to a 1-unit increase. In contrast, on the base-10 scale, a ten-fold increase in size gives a 1-unit increase.

Logarithmic transformations are often used to deal with variables that are positive and strongly skewed. In economics, price, income and production variables are often this way. In general, any variable where it is sensible to describe changes in terms of proportion might be better displayed on a logarithmic scale. For example, price inflation rates are usually given as percent (e.g., “The inflation rate was 4% last year.”) and so in dealing with prices over time, the logarithmic transformation can be appropriate.
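
The spacing argument above can be seen directly in R. This minimal sketch uses the three numbers from the text (0.1, 1.0, and 10.0) plus a made-up doubling sequence; it is meant only to illustrate how the base sets the size of a 1-unit step.

```r
# The three numbers from the text: each is 10 times the one before.
x <- c(0.1, 1.0, 10.0)

diff(x)          # absolute gaps are very unequal
log10(x)         # on the base-10 scale the values are evenly spaced
diff(log10(x))   # each ten-fold increase is a 1-unit step

# On the base-2 scale, each doubling is a 1-unit step.
doubling <- c(1, 2, 4, 8)
log2(doubling)

# Different bases differ only by a constant factor.
log(x) / log(10)   # natural log rescaled to base 10
```

Because changing the base only rescales the axis, the shape of a distribution on a logarithmic scale does not depend on which base is used.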

Prob 3.31.

This exercise deals with data on weight loss achieved by clients who stayed two weeks at a weight-loss resort. The same data are shown using three different sorts of graphical displays: a pie chart, a histogram, and a box-and-whiskers plot. The point of the exercise is to help you decide which display is the most effective at presenting information to you.

In many fields, pie charts are used as “statistical graphics.” Here’s a pie chart of the weight loss:

Using the pie graph, answer the following:

(a)
What’s the “typical” (median or mean) weight loss?
3.7  4.2  5.5  6.8  8.3  10.1  12.4
(b)
What is the central 50% coverage interval?
2.3 to 6.8  4.2 to 10.7  4.4 to 8.7  6.1 to 9.3  5.2 to 12.1
(c)
What is an upper extreme value?
10  13  16  18  20

Now to display the data as a histogram. So that you can’t just re-use your answers from the pie chart, the weights have been rescaled into kilograms.

Using the histogram, answer the following:

1.
What’s the “typical” (median or mean) weight loss?
1.9  2.1  3.1  3.7  4.6  5.6
2.
What is the central 50% coverage interval?
1.1 to 3.3  2.0 to 4.8  2.0 to 3.9  2.8 to 4.4  2.5 to 5.4
3.
What is an upper extreme value?
6  8  10  12  14

Finally, here is a boxplot of the same data. It’s been rescaled into a traditional unit of weight: stones.

Using the boxplot, answer the following:

1.
What’s the “typical” (median or mean) weight loss?
0.20  0.35  0.50  0.68  0.83  1.2
2.
What is the central 50% coverage interval?
0.2 to 0.5  0.3 to 0.8  0.4 to 0.8  0.5 to 0.7  0.3 to 0.6
3.
What is an upper extreme value?
0.7  0.9  1.0  1.1  1.3

pie.chart  histogram  box.plot

Prob 3.36. Elevators typically have a close-door button. Some people claim that this button has no mechanical function; it’s there just to give impatient people some sense of control over the elevator.

Design and conduct an experiment to test whether the button does cause the elevator door to close. Pick an elevator with such a button and record some details about the elevator itself: place installed, year installed, model number, etc.

Describe your experiment along with the measurements you made and your conclusions. You may want to do the experiment in small teams and use a stopwatch in order to make accurate measurements. Presumably, you will want to measure the time between when the button is pressed and when the door closes, but you might want to measure other quantities as well, for instance the time from when the door first opened to when you press the button.

Please don’t inconvenience other elevator users with the experiment.

Prob 3.50. What’s a “normal” body temperature? Depending on whether you use the Celsius or Fahrenheit scale, you are probably used to the numbers 37 (C) or 98.6 (F). These numbers come from the work of Carl Wunderlich, published in Das Verhalten der Eigenwärme in Krankheiten in 1868 based on more than a million measurements made under the armpit. According to Wunderlich, “When the organism (man) is in a normal condition, the general temperature of the body maintains itself at the physiologic point: 37°C = 98.6°F.”

Since 1868, not only have the techniques for measuring temperatures improved, but so has the understanding that “normal” is not a single temperature but a range of temperatures.

A 1992 article in the Journal of the American Medical Association (PA Mackowiak et al., “A Critical Appraisal of 98.6°F ...” JAMA v. 268(12) pp. 1578-1580) examined temperature measurements made orally with an electronic thermometer. The subjects were 148 healthy volunteers between ages 18 and 40.

The figure shows the distribution of temperatures, separately for males and females. Note that the horizontal scale is given in both C and F — this problem will use F.

What’s the absolute range for females?

• Minimum:
96.1  96.3  97.1  98.6  99.9  100.8
• Maximum:
96.1  96.3  97.1  98.6  99.9  100.8

And for males?

• Minimum:
96.1  96.3  97.1  98.6  99.9  100.8
• Maximum:
96.1  96.3  97.1  98.6  99.9  100.8

Notice that there is an outlier for the females’ temperature, as evidenced by a big gap in temperature between that bar and the next closest bar. How big is the gap?

 A  About 0.01°F.
 B  About 0.1°F.
 C  Almost 1°F.

Give a 95% coverage interval for females. Hint: The interval will exclude the most extreme 2.5% of cases on each of the left and right sides of the distribution. You can find the left endpoint of the 95% interval by scanning in from the left, adding up the heights of the bars until they total 0.025. Similarly, the right endpoint can be marked by scanning in from the right until the bars total 0.025.

And for males?
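
The scanning procedure in the hint can be sketched in R with cumulative sums. The bin positions and bar heights below are hypothetical (they are not read from the figure), and the variable names are illustrative.

```r
# Hypothetical bins and bar heights (proportions summing to 1);
# these are NOT read from the figure.
temp <- c(96.5, 97.0, 97.5, 98.0, 98.5, 99.0, 99.5, 100.0)
prop <- c(0.01, 0.04, 0.15, 0.30, 0.30, 0.15, 0.04, 0.01)

# Scan in from the left: running total of bar heights.
from_left <- cumsum(prop)
# Scan in from the right: running total starting at the last bar.
from_right <- rev(cumsum(rev(prop)))

# Left endpoint: first bar where the left-hand total reaches 0.025.
lower <- temp[which(from_left >= 0.025)[1]]
# Right endpoint: last bar where the right-hand total reaches 0.025.
upper <- temp[tail(which(from_right >= 0.025), 1)]

c(lower = lower, upper = upper)
```

With coarse bins the interval can only be located to the nearest bar, so the result is an estimate rather than an exact percentile.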

Prob 3.53. There are many different numerical descriptions of distributions: mean, median, standard deviation, variance, IQR, coverage interval, ... And these are just the ones we have touched on so far. We’ll also encounter “standard error,” “margin of error,” “confidence interval.” There are so many that it becomes a significant challenge to students to keep them straight. Eventually, statistical workers learn the subtleties of the different descriptions and when each is appropriate. Then, like using near synonyms in English, it becomes second nature.

As an example, consider the verb “spread.” Here are some synonyms from the thesaurus, each of which is appropriate in a particular context: broadcast, scatter, propagate, sprawl, extend, stretch, cover, daub, ... If you were talking to a farmer about sowing seeds, the words “broadcast” or “scatter” would be appropriate, but it would be silly to say the seeds are being “daubed” or “sprawled.” On the other hand, to an urbanite concerned with congestion in traffic, the growth of the city might well be summarized with “sprawl.” You have to know the context and the intent to choose the correct term.

To help you understand the different contexts and intents, here are some important ways of categorizing what a particular description captures:

• Location and scatter
   • What is a typical value? (“center”)
   • What are the top and bottom range of the values? (“range”)
   • How far are the values scattered? (“scatter”)
   • What is high? or What is low? (“non-central”)
• Including the “extremes”
   • All inclusive, and sensitive to outliers. (“not-robust”)
   • All inclusive, but not sensitive to outliers. (“robust”)
   • Leaves out the very extremes. (“plausible”)
   • Focuses on the middle. (“mainstream”)

Note that descriptors of both the “plausible” and the “mainstream” type are necessarily robust, since they leave out the outliers.

• Individual versus whole sample.
   • Description relevant to individual cases
   • Description or summary of entire samples, combining many cases.

You won’t have to deal with this last distinction until later, when it will help explain terms that you haven’t yet encountered, like “standard error,” “margin of error,” and “confidence interval.”

Example: The mean describes the center of a distribution. It is calculated from all the data and is not-robust against outliers.

For each of the following descriptors of a distribution, choose the items that best characterize the descriptor.

1.
Median
(a)
center  range  scatter  non-central
(b)
robust  not-robust  plausible  mainstream
2.
Standard Deviation
(a)
center  range  scatter  non-central
(b)
robust  not-robust  plausible  mainstream
3.
IQR
(a)
center  range  scatter  non-central
(b)
robust  not-robust  plausible  mainstream
4.
Variance
(a)
center  range  scatter  non-central
(b)
robust  not-robust  plausible  mainstream
5.
95% coverage interval
(a)
center  range  scatter  non-central
(b)
robust  not-robust  plausible  mainstream
6.
50% coverage interval
(a)
center  range  scatter  non-central
(b)
robust  not-robust  plausible  mainstream
7.
50th percentile
(a)
center  range  scatter  non-central
8.
80th percentile
(a)
center  range  scatter  non-central
9.
99th percentile
(a)
center  range  scatter  non-central
10.
10th percentile
(a)
center  range  scatter  non-central

One of the reasons why there are so many descriptive terms is that they have different roles in theory. For example, the variance turns out to have simple theoretical properties that make it useful when describing sums of variables. It’s much simpler than, say, the IQR.
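
As a hedged illustration of that claim, the sketch below checks one such property on simulated data: the variance of a sum equals the sum of the variances plus twice the covariance. The vectors `x` and `y` are random numbers generated only for this example, and the seed is arbitrary.

```r
# Simulated data for illustration; the seed and sizes are arbitrary.
set.seed(1)
x <- rnorm(1000, sd = 3)
y <- rnorm(1000, sd = 4)

# Variance of a sum: var(x + y) = var(x) + var(y) + 2 * cov(x, y).
lhs <- var(x + y)
rhs <- var(x) + var(y) + 2 * cov(x, y)
c(lhs = lhs, rhs = rhs)

# When x and y are nearly uncorrelated, the covariance term is small,
# so the variances roughly add.
var(x) + var(y)

# No comparably simple rule connects these two quantities.
IQR(x + y)
IQR(x) + IQR(y)
```

The identity for the variance holds exactly for any two numeric vectors of equal length, which is part of what makes the variance convenient in theory.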

Prob 3.54. There are two kinds of questions that are often asked relating to percentiles:

• What is the value that falls at a given percentage? For instance, in the ten-mile-race.csv running data, how fast are the fastest 10% of runners? In R, you would ask in this way:
> run = fetchData("ten-mile-race.csv")
> qdata(0.10, run$net)
10%
4409

The answer is in the units of the variable, in this case seconds. So 10% of the runners have net times faster than or equal to 4409 seconds.

• What percentage falls at a given value? For instance, what fraction of runners are faster than 4000 seconds?
> pdata(4000, run$net)
[1] 0.04029643

The answer includes those whose net time is exactly equal to or less than 4000 seconds.

It’s important to pay attention to the p and q in the statement. pdata and qdata ask related but different questions.
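
The two kinds of question can also be explored in base R, which may help in seeing what each one computes. The sketch below uses `quantile` as a rough stand-in for qdata and a simple proportion as a stand-in for pdata; that correspondence is an assumption based on the descriptions above, and the `net` vector is made up rather than taken from the running data.

```r
# Hypothetical net times in seconds; NOT the ten-mile-race data.
net <- c(3900, 4000, 4000, 4400, 4800, 5200, 5600, 6000, 6500, 7000)

# "What value falls at a given percentage?" (the qdata-style question)
quantile(net, 0.10)

# "What percentage falls at a given value?" (the pdata-style question)
at_or_below <- mean(net <= 4000)   # includes times exactly equal to 4000
strictly_below <- mean(net < 4000) # excludes times exactly equal to 4000

# An "at or above" question is one minus a "strictly below" question.
at_or_above_6000 <- 1 - mean(net < 6000)

c(at_or_below = at_or_below,
  strictly_below = strictly_below,
  at_or_above_6000 = at_or_above_6000)
```

The difference between `<=` and `<` matters whenever several cases sit exactly at the cutoff, as the two runners at 4000 seconds do here.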

Use pdata and qdata to answer the following questions about the running data.

1.
Below (or equal to) what age are the youngest 35% of runners?
• Which statement will do the correct calculation?

 A  pdata(0.35,run$age)
 B  qdata(0.35,run$age)
 C  pdata(35,run$age)
 D  qdata(35,run$age)

• What will the answer be?
28  29  30  31  32  33  34  35
2.
What’s the net time that divides the slowest 20% of runners from the rest of the runners?
• Which statement will do the correct calculation?

 A  pdata(0.20,run$net)
 B  qdata(0.20,run$net)
 C  pdata(0.80,run$net)
 D  qdata(0.80,run$net)

• What will the answer be?
4921  5318  5988  6346  7123  7431
seconds
3.
What is the 95% coverage interval on age?
• Which statement will do the correct calculation?

 A  pdata(c(0.025,0.975),run$age)
 B  qdata(c(0.025,0.975),run$age)
 C  pdata(c(0.050,0.950),run$age)
 D  qdata(c(0.050,0.950),run$age)

• What will the answer be?

 A 22 to 60 B 20 to 65 C 25 to 59 D 20 to 60

4.
What fraction of runners are 30 or younger?
• Which statement will do the correct calculation?

 A  pdata(30,run$age)
 B  qdata(30,run$age)
 C  pdata(30.1,run$age)
 D  qdata(30.1,run$age)

• What will the answer be?
In percent:
29.3  30.1  33.7  35.9  38.0  39.3
5.
What fraction of runners are 65 or older? (Caution: This isn’t yet in the form of a BELOW question.)
• Which statement will do the correct calculation?

 A  pdata(65,run$age)
 B  pdata(64.99,run$age)
 C  pdata(65.01,run$age)
 D  1-pdata(65,run$age)
 E  1-pdata(64.99,run$age)
 F  1-pdata(65.01,run$age)

• What will the answer be?
In percent:
0.5  1.1  1.7  2.3  2.9
6.
The time it takes for a runner to get to the start line after the starting gun is fired is the difference between the time and net.
run$to.start = run$time - run$net

• How long is it before 75% of runners get to the start line?
In seconds:
164  192  213  294  324  351
• What fraction of runners get to the start line before one minute? (Caution: the times are measured in seconds.)
In percent:
10  15  19  22  25  31  34
7.
What is the 95% coverage interval on the ages of female runners?

 A 19 to 61 years B 22 to 61 years C 19 to 56 years D 22 to 56 years

8.
What fraction of runners have a net time BELOW 4000 seconds? (That is, don’t include those who are at exactly 4000 seconds.)
In percent:
3.72  4.00  4.03  4.07  5.21