Chapter 2 Problems      AGid      Statistical Modeling: A Fresh Approach (2/e)

Reading Questions.

1. What are the two major kinds of variables?
2. How are variables and cases arranged in a data frame?
3. How are these things related: population, sampling frame, sample, census?
4. What’s the difference between a longitudinal and a cross-sectional sample?
5. Describe some types of sampling that can leave the sample unrepresentative of the population.

Prob 2.02. Using the tally operator and the comparison operators (such as > or ==), answer the following questions about the CO2 data. You can read in the CO2 data with the statement

    CO2 = fetchData("CO2")

You can see the data set itself by giving the command

    CO2

In this exercise, you will use R commands to count how many of the cases satisfy various criteria:

1. How many of the plants in CO2 are Mc1 for Plant?
   7  12  14  21  28  34
2. How many of the plants in CO2 are either Mc1 or Mn1?
   8  12  14  16  23  54  92
3. How many are Quebec for Type and nonchilled for Treatment?
   8  12  14  16  21  23  54  92
4. How many have a concentration (conc) of 300 or bigger?
   12  24  36  48  60
5. How many have a concentration between 300 and 450 (inclusive)?
   12  24  36  48  60
6. How many have a concentration between 300 and 450 (inclusive) and are nonchilled?
   6  8  10  12  14  16
7. How many have an uptake that is less than 1/10 of the concentration (in the units reported)?
   17  33  34  51  68
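As a hint on the mechanics for Prob 2.02, here is a minimal sketch of counting cases with comparison operators. It uses criteria different from the ones asked above, and it assumes that R's built-in CO2 data set matches what fetchData("CO2") returns. It also uses base R's sum() and table() rather than the tally operator, so that it runs without extra packages.

```r
# Assumption: the built-in datasets::CO2 stands in for fetchData("CO2").
data(CO2)

# A comparison gives one TRUE/FALSE per case; sum() counts the TRUEs.
n_mississippi <- sum(CO2$Type == "Mississippi")

# table() reports the FALSE and TRUE counts side by side.
low_conc_counts <- table(CO2$conc < 200)

# Combine criteria with & (both must hold) or | (either may hold).
n_band   <- sum(CO2$conc >= 175 & CO2$conc <= 250)
n_either <- sum(CO2$Plant == "Qn1" | CO2$Plant == "Qn2")
```

The same comparisons can be handed to the tally operator in place of sum() or table().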

Prob 2.04. Here is a small data frame about automobiles.

Make and model   Vehicle type   Trans. type   # of cyl.   City MPG   Hwy MPG
Kia Optima       compact        Man.          4           21         31
Kia Optima       compact        Auto.         6           20         28
Saab 9-7X AWD    SUV            Auto.         6           14         20
Saab 9-7X AWD    SUV            Auto.         8           12         16
Ford Focus       compact        Man.          4           24         35
Ford Focus       compact        Auto.         4           24         33
Ford F150 2WD    pickup         Auto.         8           13         17

(a) What are the cases in the data frame?
   A  Individual car companies
   B  Individual makes and models of cars
   C  Individual configurations of cars
   D  Different sizes of cars


(b) For each case, what variables are given? Are they categorical or quantitative?
(c) Why are some cars listed twice? Is this a mistake in the table?
   A  Yes, it’s a mistake.
   B  A car brand might be listed more than once, but the cases have different attributes on other variables.
   C  Some cars are more in demand than others.


Prob 2.09. Here is a data set from an experiment about how reaction times change after drinking alcohol.[?] The measurements give how long it took for a person to catch a dropped ruler. One measurement was made before drinking any alcohol. Successive measurements were made after one standard drink, two standard drinks, and so on. Measurements are in seconds.

Before   After 1   After 2   After 3
0.68     0.73      0.80      1.38
0.54     0.64      0.92      1.44
0.71     0.66      0.83      1.46
0.82     0.92      0.97      1.51
0.58     0.68      0.70      1.49
0.80     0.87      0.92      1.54
and so on ...

(a) What are the rows in the above data set?
   A  Individual measurements of reaction time.
   B  An individual person.
   C  The number of drinks.

(b) How many variables are there?
   A  One: the reaction times.
   B  Two: the reaction times with and without alcohol.
   C  Four: the reaction times at four different levels of alcohol.


The format used for these data has several limitations:

Another format, which would be better, is this:

SubjectID   ReactionTime   Drinks
S1          0.68           0
S1          0.73           1
S1          0.80           2
S1          1.38           3
S2          0.54           0
S2          0.64           1
S2          0.92           2
and so on ...

What are the cases in the reformatted data frame?
   A  Individual measurements of reaction time.
   B  An individual person.
   C  The number of drinks.

How many variables are there?
   A  The same as in the original version. It’s the same data!
   B  Three: the subject, the reaction time, the alcohol level.
   C  Four: the reaction times at four different levels of alcohol.
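
The conversion between the two layouts in Prob 2.09 can be sketched in base R. This is only an illustration under stated assumptions: it uses the first two rows of the original table, takes them to be subjects S1 and S2 (as the reformatted table suggests), and the column names After1, After2, and After3 are hypothetical spellings chosen because R names cannot contain spaces.

```r
# Original layout: the four reaction times sit in four separate columns.
wide <- data.frame(
  SubjectID = c("S1", "S2"),   # assumed labels for the first two rows
  Before    = c(0.68, 0.54),
  After1    = c(0.73, 0.64),
  After2    = c(0.80, 0.92),
  After3    = c(1.38, 1.44)
)

time_cols <- c("Before", "After1", "After2", "After3")

# Reformatted layout: stack the four columns into a single ReactionTime
# column and record the number of drinks alongside each value.
long <- data.frame(
  SubjectID    = rep(wide$SubjectID, each = length(time_cols)),
  ReactionTime = as.vector(t(as.matrix(wide[, time_cols]))),
  Drinks       = rep(0:3, times = nrow(wide))
)
```

Transposing before flattening keeps each subject's four measurements together, in the same order as the reformatted table.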


The lack of flexibility in the original data format indicates a more profound problem. The response to alcohol is not just a matter of quantity, but of timing. Drinks spread out over time have less effect than drinks consumed rapidly, and the physiological response to a drink changes over time as the alcohol is first absorbed into the blood and then gradually removed by the liver. Nothing in this data set indicates how long after the drinks the measurements were taken. The small change in reaction time after a single drink might reflect that there was little time for the alcohol to be absorbed before the measurement was taken; the large change after three drinks might actually be the response to the first drink finally kicking in. Perhaps it would have been better to make a measurement of the blood alcohol level at each reaction-time trial.

It’s important to think carefully about how to measure your variables effectively, and what you should measure in order to capture the effects you are interested in.

Prob 2.14. Sometimes categorical information is represented numerically. In the early days of computing, it was very common to represent everything with a number. For instance, the categorical variable for sex, with levels male or female, might be stored as 0 or 1. Even categorical variables like race or language, with many different levels, can be represented as a number. The codebook provides the interpretation of each number (hence the word “codebook”).

Here is a very small part of a dataset from the 1960s used to study the influence of smoking and other factors on the weights of babies at birth.[?]  gestation.csv 

gest.  wt   race  ed  wt.1  inc  smoke  number
284    120  8     5   100   1    0      0
282    113  0     5   135   4    0      0
279    128  0     2   115   2    1      1
244    138  7     2   178   98   0      0
245    132  7     1   140   2    0      0
351    140  0     5   120   99   3      2
282    144  0     2   124   2    1      1
279    141  0     1   128   2    1      1
281    110  8     5   99    2    1      2
273    114  7     2   154   1    0      0
285    115  7     2   130   1    0      0
255    92   4     7   125   1    1      5
261    115  3     2   125   4    1      5
261    144  0     2   170   7    0      0

At first glance, all of the data seems quantitative. But read the codebook:

gest. - length of gestation in days  
 
wt -  birth weight in ounces (999 unknown)  
 
race - mother’s race  
   0-5=white 6=mex 7=black 8=asian  
   9=mixed 99=unknown  
 
ed - mother’s education  
   0= less than 8th grade,  
   1 = 8th -12th grade - did not graduate,  
   2= HS graduate--no other schooling ,  
   3= HS+trade,  
   4=HS+some college  
   5= College graduate,  
   6&7 Trade school HS unclear,  
   9=unknown  
 
marital 1=married, 2= legally separated, 3= divorced,  
  4=widowed, 5=never married  
 
inc - family yearly income in $2500 increments  
  0 = under 2500, 1=2500-4999, ...,  
  8= 12,500-14,999, 9=15000+,  
  98=unknown, 99=not asked  
 
smoke - does mother smoke? 0=never, 1= smokes now,  
    2=until current pregnancy, 3=once did, not now,  
    9=unknown  
 
number  - number of cigarettes smoked per day  
   0=never, 1=1-4, 2=5-9, 3=10-14, 4=15-19,  
   5=20-29, 6=30-39, 7=40-60,  
   8=60+, 9=smoke but don’t know, 98=unknown, 99=not asked

Taking into account the codebook, what kind of data is each variable? If the data have a natural order, but are not genuinely quantitative, say “ordinal.” You can ignore the “unknown” or “not asked” codes when giving your answer.

(a) Gestation
   categorical  ordinal  quantitative
(b) Race
   categorical  ordinal  quantitative
(c) Marital
   categorical  ordinal  quantitative
(d) Inc
   categorical  ordinal  quantitative
(e) Smoke
   categorical  ordinal  quantitative
(f) Number
   categorical  ordinal  quantitative

The disadvantage of storing categorical information as numbers is that it’s easy to get confused and mistake one level for another. Modern software makes it easy to use text strings to label the different levels of categorical variables. Still, you are likely to encounter data sets in which categorical variables are stored numerically, so be alert.

A good modern practice is to code missing data in a consistent way that can be automatically recognized by software as meaning missing. Often, NA is used for this purpose. Notice that in the number variable, there is a clear order to the categories until one gets to level 9, which means “smoke but don’t know.” This is an unfortunate choice. It would be better to store number as a quantitative variable telling the number of cigarettes smoked per day. Another variable could be used to indicate whether missing data was “smoke but don’t know,” “unknown”, or “not asked.”
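
A minimal sketch of these practices in R, using the smoke and inc codes from the first six rows of the excerpt. The variable names smoke_code, inc_code, and inc_missing are hypothetical, and the labels are copied from the codebook.

```r
# Codes copied from the first six rows of the excerpt.
smoke_code <- c(0, 0, 1, 0, 0, 3)
inc_code   <- c(1, 4, 2, 98, 2, 99)

# Attach the codebook's text labels to the numeric smoke codes.
# Code 9 ("unknown") is left out of levels, so it would become NA.
smoke <- factor(
  smoke_code,
  levels = c(0, 1, 2, 3),
  labels = c("never", "smokes now",
             "until current pregnancy", "once did, not now")
)

# Keep the reason for missingness in its own variable ...
inc_missing <- rep(NA_character_, length(inc_code))
inc_missing[inc_code == 98] <- "unknown"
inc_missing[inc_code == 99] <- "not asked"

# ... and mark the missing values themselves as NA.
inc <- inc_code
inc[inc_code %in% c(98, 99)] <- NA
```

With the special codes converted to NA, software can recognize the missing values automatically instead of treating 98 or 99 as an income level.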

Prob 2.22. Since the computer has to represent numbers that are both very large and very small, scientific notation is often used. The number 7.23e4 means 7.23 × 10^4 = 72300. The number 1.37e-2 means 1.37 × 10^-2 = 0.0137.
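
The two numbers from the text can be checked directly in R. This is a small sketch: all.equal() is used instead of == because decimal fractions are stored only approximately, and the names x, y, x_fixed, and y_sci are chosen here for illustration.

```r
# The two examples from the text, typed in computer scientific notation.
x <- 7.23e4
y <- 1.37e-2

# Compare with the written-out powers of ten, allowing for rounding error.
same_x <- isTRUE(all.equal(x, 7.23 * 10^4))
same_y <- isTRUE(all.equal(y, 1.37 * 10^-2))

# format() changes only how a number is displayed, not its stored value.
x_fixed <- format(x, scientific = FALSE)
y_sci   <- format(y, scientific = TRUE)
```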

For each of the following numbers written in computer scientific notation, say what is the corresponding number.

(a) 3e1
   0.003  0.01  0.03  0.1  0.3  1  3  10  30  100  300  1000  3000  10000
(b) 1e3
   0.003  0.01  0.03  0.1  0.3  1  3  10  30  100  300  1000  3000  10000
(c) 0.1e3
   0.003  0.01  0.03  0.1  0.3  1  3  10  30  100  300  1000  3000  10000
(d) 0.3e-2
   0.003  0.01  0.03  0.1  0.3  1  3  10  30  100  300  1000  3000  10000
(e) 10e3
   0.003  0.01  0.03  0.1  0.3  1  3  10  30  100  300  1000  3000  10000
(f) 10e-3
   0.003  0.01  0.03  0.1  0.3  1  3  10  30  100  300  1000  3000  10000
(g) 0.0003e3
   0.003  0.01  0.03  0.1  0.3  1  3  10  30  100  300  1000  3000  10000