Chapter 2 Problems      AGid      Statistical Modeling: A Fresh Approach (2/e)

1.
What are the two major different kinds of variables?
2.
How are variables and cases arranged in a data frame?
3.
How are these things related: population, sampling frame, sample, census?
4.
What’s the difference between a longitudinal and cross-sectional sample?
5.
Describe some types of sampling that can lead to a sample that is unrepresentative of the population.

Prob 2.02. Using the tally operator and the comparison operators (such as > or ==), answer the following questions about the CO2 data. You can read in the CO2 data with the statement

CO2 = fetchData("CO2")

You can see the data set itself by giving the command

CO2

In this exercise, you will use R commands to count how many of the cases satisfy various criteria:

1.
How many of the plants in CO2 are Mc1 for Plant?
7  12  14  21  28  34
2.
How many of the plants in CO2 are either Mc1 or Mn1?
8  12  14  16  23  54  92
3.
How many are Quebec for Type and nonchilled for Treatment?
8  12  14  16  21  23  54  92
4.
How many have a concentration (conc) of 300 or bigger?
12  24  36  48  60
5.
How many have a concentration between 300 and 450 (inclusive)?
12  24  36  48  60
6.
How many have a concentration between 300 and 450 (inclusive) and are nonchilled?
6  8  10  12  14  16
7.
How many have an uptake that is less than 1/10 of the concentration (in the units reported)?
17  33  34  51  68
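
The counting pattern this problem asks for can be sketched as follows, using criteria that are deliberately different from the ones in the questions. This is a minimal sketch under two assumptions: that R's built-in CO2 data frame (from the datasets package) matches what fetchData("CO2") returns, and that base R's table() and sum() are acceptable stand-ins for the tally operator. The variable names (is_low, n_low, and so on) are made up for illustration.

```r
# Assumption: R's built-in CO2 data frame (package 'datasets') stands in for
# the result of fetchData("CO2").
data(CO2)

# A comparison operator returns one TRUE/FALSE value per case.
is_low <- CO2$conc < 200

# table() tallies the TRUE and FALSE cases; sum() counts only the TRUEs.
table(is_low)
n_low <- sum(is_low)

# Combine criteria with & (and) or | (or).
n_low_chilled <- sum(CO2$conc < 200 & CO2$Treatment == "chilled")
n_low_or_high <- sum(CO2$conc < 200 | CO2$conc > 900)

# %in% tests membership in a set of levels.
n_two_plants <- sum(CO2$Plant %in% c("Qn1", "Qc2"))
```

Each count is the number of cases for which the logical expression is TRUE, so the same pattern applies to any criterion that can be written with comparison operators.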

Prob 2.04. Here is a small data frame about automobiles.

Make and model   Vehicle type   Trans. type   # of cyl.   City MPG   Hwy MPG
Kia Optima       compact        Man.          4           21         31
Kia Optima       compact        Auto.         6           20         28
Saab 9-7X AWD    SUV            Auto.         6           14         20
Saab 9-7X AWD    SUV            Auto.         8           12         16
Ford Focus       compact        Man.          4           24         35
Ford Focus       compact        Auto.         4           24         33
Ford F150 2WD    pickup         Auto.         8           13         17

(a)
What are the cases in the data frame?

A Individual car companies
B Individual makes and models of cars
C Individual configurations of cars
D Different sizes of cars

(b)
For each case, what variables are given? Are they categorical or quantitative?
• Kia Optima:
not.a.variable  categorical  quantitative
• City MPG:
not.a.variable  categorical  quantitative
• Vehicle type:
not.a.variable  categorical  quantitative
• SUV:
not.a.variable  categorical  quantitative
• Trans. type:
not.a.variable  categorical  quantitative
• # of cyl.:
not.a.variable  categorical  quantitative
(c)
Why are some cars listed twice? Is this a mistake in the table?

A Yes, it’s a mistake.
B A car brand might be listed more than once, but the cases have different attributes on other variables.
C Some cars are more in demand than others.

Prob 2.09. Here is a data set from an experiment about how reaction times change after drinking alcohol.[?] The measurements give how long it took for a person to catch a dropped ruler. One measurement was made before drinking any alcohol. Successive measurements were made after one standard drink, two standard drinks, and so on. Measurements are in seconds.

Before   After 1   After 2   After 3
0.68     0.73      0.80      1.38
0.54     0.64      0.92      1.44
0.71     0.66      0.83      1.46
0.82     0.92      0.97      1.51
0.58     0.68      0.70      1.49
0.80     0.87      0.92      1.54
and so on ...

(a)
What are the rows in the above data set?

A Individual measurements of reaction time.
B An individual person.
C The number of drinks.

(b)
How many variables are there?

A One: the reaction times.
B Two: the reaction times with and without alcohol.
C Four: the reaction times at four different levels of alcohol.

The format used for these data has several limitations:

• It leaves no room for multiple measurements of an individual at one level of alcohol, for example, two or three baseline measurements, or two or three measurements after one standard drink.
• It provides no flexibility for different levels of alcohol, for example 1.5 standard drinks, or for taking into account how long the measurement was made after the drink.

Another format, which would be better, is this:

SubjectID   ReactionTime   Drinks
S1          0.68           0
S1          0.73           1
S1          0.80           2
S1          1.38           3
S2          0.54           0
S2          0.64           1
S2          0.92           2
and so on ...

What are the cases in the reformatted data frame?

A Individual measurements of reaction time.
B An individual person.
C The number of drinks.

How many variables are there?

A The same as in the original version. It’s the same data!
B Three: the subject, the reaction time, the alcohol level.
C Four: the reaction times at four different levels of alcohol.

The lack of flexibility in the original data format indicates a more profound problem. The response to alcohol is not just a matter of quantity, but of timing. Drinks spread out over time have less effect than drinks consumed rapidly, and the physiological response to a drink changes over time as the alcohol is first absorbed into the blood and then gradually removed by the liver. Nothing in this data set indicates how long after the drinks the measurements were taken. The small change in reaction time after a single drink might reflect that there was little time for the alcohol to be absorbed before the measurement was taken; the large change after three drinks might actually be the response to the first drink finally kicking in. Perhaps it would have been better to make a measurement of the blood alcohol level at each reaction-time trial.

It’s important to think carefully about how to measure your variables effectively, and what you should measure in order to capture the effects you are interested in.
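
The reformatting discussed in this problem, from one row per person to one row per measurement, can be sketched in base R. This is only an illustration under stated assumptions: it uses the first two rows of the original table, the column names Before, After1, After2, and After3 are simplified spellings of the table headers, and the subject labels S1 and S2 are generated here because the original table does not label its rows.

```r
# Original layout: one row per person (first two rows of the table).
wide <- data.frame(
  Before = c(0.68, 0.54),
  After1 = c(0.73, 0.64),
  After2 = c(0.80, 0.92),
  After3 = c(1.38, 1.44)
)

# Hypothetical subject labels; the original table does not name its rows.
subject <- paste0("S", seq_len(nrow(wide)))

# Reformatted layout: one row per measurement.
# t() places each person's four times in one column, so as.vector() reads
# them person by person: all of S1's values, then all of S2's values.
long <- data.frame(
  SubjectID    = rep(subject, each = ncol(wide)),
  ReactionTime = as.vector(t(as.matrix(wide))),
  Drinks       = rep(0:3, times = nrow(wide))
)

long
```

In the reformatted layout, a repeated baseline measurement or a level such as 1.5 drinks is just one more row, and a new variable such as time since the last drink would be one more column.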

Prob 2.14. Sometimes categorical information is represented numerically. In the early days of computing, it was very common to represent everything with a number. For instance the categorical variable for sex, with levels male or female, might be stored as 0 or 1. Even categorical variables like race or language, with many different levels, can be represented as a number. The codebook provides the interpretation of each number (hence the word “codebook”).

Here is a very small part of a dataset from the 1960s used to study the influence of smoking and other factors on the weights of babies at birth.[?]  gestation.csv

gest.   wt    race   ed   wt.1   inc   smoke   number
284     120   8      5    100    1     0       0
282     113   0      5    135    4     0       0
279     128   0      2    115    2     1       1
244     138   7      2    178    98    0       0
245     132   7      1    140    2     0       0
351     140   0      5    120    99    3       2
282     144   0      2    124    2     1       1
279     141   0      1    128    2     1       1
281     110   8      5    99     2     1       2
273     114   7      2    154    1     0       0
285     115   7      2    130    1     0       0
255     92    4      7    125    1     1       5
261     115   3      2    125    4     1       5
261     144   0      2    170    7     0       0

At first glance, all of the data seem quantitative. But read the codebook:

gest. - length of gestation in days

wt -  birth weight in ounces (999 unknown)

race - mother’s race
0-5=white 6=mex 7=black 8=asian
9=mixed 99=unknown

ed - mother’s education
2= HS graduate--no other schooling ,
4=HS+some college
9=unknown

marital 1=married, 2= legally separated, 3= divorced,
4=widowed, 5=never married

inc - family yearly income in $2500 increments
0 = under 2500, 1=2500-4999, ...,
8= 12,500-14,999, 9=15000+,

smoke - does mother smoke? 0=never, 1= smokes now,
2=until current pregnancy, 3=once did, not now,
9=unknown

number  - number of cigarettes smoked per day
0=never, 1=1-4, 2=5-9, 3=10-14, 4=15-19,
5=20-29, 6=30-39, 7=40-60,
8=60+, 9=smoke but don’t know, 98=unknown, 99=not asked

Taking into account the codebook, what kind of data is each variable? If the data have a natural order, but are not genuinely quantitative, say “ordinal.” You can ignore the “unknown” or “not asked” codes when giving your answer.

(a)
Gestation
categorical  ordinal  quantitative
(b)
Race
categorical  ordinal  quantitative
(c)
Marital
categorical  ordinal  quantitative
(d)
Inc
categorical  ordinal  quantitative
(e)
Smoke
categorical  ordinal  quantitative
(f)
Number
categorical  ordinal  quantitative

The disadvantage of storing categorical information as numbers is that it’s easy to get confused and mistake one level for another. Modern software makes it easy to use text strings to label the different levels of categorical variables. Still, you are likely to encounter data sets with categorical information stored numerically, so be alert.

A good modern practice is to code missing data in a consistent way that can be automatically recognized by software as meaning missing. Often, NA is used for this purpose. Notice that in the number variable, there is a clear order to the categories until one gets to level 9, which means “smoke but don’t know.” This is an unfortunate choice. It would be better to store number as a quantitative variable telling the number of cigarettes smoked per day. Another variable could be used to indicate whether missing data was “smoke but don’t know,” “unknown”, or “not asked.”
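
The two practices described above, text labels for levels and NA for missing values, can be sketched in base R. This is a minimal illustration, not the way the original data were processed: the vectors below hold only a handful of values, the variable names smoke_code and wt are chosen for this example, and the trailing 9 and 999 are hypothetical values added to show how the codebook's "unknown" codes would be handled.

```r
# The first six smoke codes from the table, plus a hypothetical 9 ("unknown").
smoke_code <- c(0, 0, 1, 0, 0, 3, 9)

# factor() attaches a text label to each numeric code. Any code not listed
# in 'levels' (here, 9) becomes NA automatically.
smoke <- factor(
  smoke_code,
  levels = c(0, 1, 2, 3),
  labels = c("never", "smokes now", "until current pregnancy",
             "once did, not now")
)

# Tally the labeled levels, showing the missing value separately.
table(smoke, useNA = "ifany")

# For a quantitative variable, replace the "unknown" code with NA.
# The first three weights are from the table; the 999 is hypothetical.
wt <- c(120, 113, 128, 999)
wt[wt == 999] <- NA

# Functions such as mean() can then be told to skip the missing value.
mean(wt, na.rm = TRUE)
```

Leaving 999 in place would silently inflate any average of the weights, which is the kind of confusion that a recognized missing-value code avoids.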

Prob 2.22. Since the computer has to represent numbers that are both very large and very small, scientific notation is often used. The number 7.23e4 means 7.23 × 10^4 = 72300. The number 1.37e-2 means 1.37 × 10^-2 = 0.0137.
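
One way to check a reading of this notation is to type the number at the R prompt. The sketch below uses only the two examples given in the paragraph above, so it does not answer any of the questions that follow; the names x and y are chosen for illustration.

```r
# The two examples from the text, written in computer scientific notation.
x <- 7.23e4
y <- 1.37e-2

# all.equal() compares with a small tolerance, which avoids surprises from
# floating-point rounding.
isTRUE(all.equal(x, 7.23 * 10^4))
isTRUE(all.equal(y, 1.37 * 10^-2))

# format() switches between the two styles of display.
format(x, scientific = FALSE)
format(0.0137, scientific = TRUE)
```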

For each of the following numbers written in computer scientific notation, say what is the corresponding number.

(a)
3e1
0.003  0.01  0.03  0.1  0.3  1  3  10  30  100  300  1000  3000  10000
(b)
1e3
0.003  0.01  0.03  0.1  0.3  1  3  10  30  100  300  1000  3000  10000
(c)
0.1e3
0.003  0.01  0.03  0.1  0.3  1  3  10  30  100  300  1000  3000  10000
(d)
0.3e-2
0.003  0.01  0.03  0.1  0.3  1  3  10  30  100  300  1000  3000  10000
(e)
10e3
0.003  0.01  0.03  0.1  0.3  1  3  10  30  100  300  1000  3000  10000
(f)
10e-3
0.003  0.01  0.03  0.1  0.3  1  3  10  30  100  300  1000  3000  10000
(g)
0.0003e3
0.003  0.01  0.03  0.1  0.3  1  3  10  30  100  300  1000  3000  10000