Chapter 4 Groupwise models
Calculating means and other simple statistics is a matter of using the right function in R. The mosaic
package — which you should load in to R as shown in Section @ref(“sec:loading-mosaic”) — makes it straightforward to calculate either a “grand” statistic or a “group-wise” statistic. To illustrate:
Load the mosaic
package, needed only once per session.
Read in data you are interested in analyzing, for instance the Cherry-Blossom 2008 data described earlier:3
Runners = read.csv("http://tiny.cc/mosaic/Cherry-Blossom-2008.csv")
names( Runners )
## [1] "position" "division" "total" "name" "age" "place" "net" "gun"
## [9] "sex"
Calculate a grand mean on the “gun” time — the time between the start of the race, signalled by a gun, and when each runner crossed the finish line. Here, to emphasize that this is a group–wise model, we use the function gwm()
from the mosaic
package. It works the same as mean()
, but its output is more formally the output of a model.
gwm(gun ~ 1, data = Runners)
##
## Groupwise Model Call:
## gwm(formula = gun ~ 1, data = Runners)
##
## model_value
## 1 93.73
The model formula gun ~ 1
is an “all cases the same” model. You will refer to the 1
as the “Intercept term” later in the text, but for now, think of it as “one value.” The one value here is the grand mean. Grand mean and other “grand” statistics.
mean( gun ~ 1, data = Runners)
## 1
## 93.72504
median( gun ~ 1, data = Runners )
## 1
## 93.68333
sd( gun ~ 1, data = Runners )
## 1
## 14.96899
To tell R that you want to break the statistics down by groups, replace the 1
with the variable which defines the groups. You will be using this notation frequently in building models. Here, as before, the ~
means, “model by” or “broken down by” or “versus.” To find the means for the gun time broken down by sex, enter
gwm( gun ~ sex, data = Runners )
##
## Groupwise Model Call:
## gwm(formula = gun ~ sex, data = Runners)
##
## sex model_value
## 1 F 98.77
## 2 M 88.26
Other statistics work the same way, for instance,
sd( gun ~ sex, data = Runners )
## F M
## 13.33713 14.71851
Another example … wage broken down by sector of the economy, using data cps.csv
:4
CPS = read.csv("http://tiny.cc/mosaic/cps.csv")
gwm( wage ~ sector, data = CPS )
##
## Groupwise Model Call:
## gwm(formula = wage ~ sector, data = CPS)
##
## sector model_value
## 1 clerical 7.423
## 2 const 9.502
## 3 manag 12.704
## 4 manuf 8.036
## 5 other 8.501
## 6 prof 11.947
## 7 sales 7.593
## 8 service 6.537
In the Whickham smoking data example5, the outcome
for each person was not a number but a category: Alive or Dead at the time of the follow-up.
W <- read.csv("http://tiny.cc/mosaic/whickham.csv")
names(W)
## [1] "outcome" "smoker" "age"
levels(W$outcome)
## [1] "Alive" "Dead"
The gwm()
function recognizes that outcome
is a categorical variable and outputs proportions. In the case of “all cases the same,” this is porportions of the data in the outcome
levels:
gwm( outcome ~ 1, data = W )
##
## Groupwise Model Call:
## gwm(formula = outcome ~ 1, data = W)
##
## outcome model_value
## 1 Alive 0.7192
## 2 Dead 0.2808
Here’s the breakdown of the levels Alive
and Dead
according to smoking status:
gwm( outcome ~ smoker, data = W )
##
## Groupwise Model Call:
## gwm(formula = outcome ~ smoker, data = W)
##
## outcome smoker model_value
## 1 Alive No 0.3820
## 2 Dead No 0.1750
## 3 Alive Yes 0.3371
## 4 Dead Yes 0.1058
You may be surbrised by this result. But don’t be misled. A more meaningful question is whether smokers are different from non-smokers when holding other variables constant, such as age. To address this question, you need to add age into the model.
It might be natural to consider each age — 35, 36, 37, and so on — as a separate group, but you won’t get very many members of each group. And, likely, the data for 35 year-olds has quite a lot to say about 36 year-olds, so it doesn’t make sense to treat them as completely separate groups.
You can use the cut()
function to divide up a quantitative variable into groups. You get to specify the breaks between groups.
W$ageGroups <- cut(W$age, breaks=c(0,30,40,53,64,75,100) )
gwm( outcome ~ ageGroups, data = W )
##
## Groupwise Model Call:
## gwm(formula = outcome ~ ageGroups, data = W)
##
## outcome ageGroups model_value
## 1 Alive (0,30] 0.214612
## 2 Dead (0,30] 0.004566
## 3 Alive (30,40] 0.181887
## 4 Dead (30,40] 0.009893
## 5 Alive (40,53] 0.177321
## 6 Dead (40,53] 0.035769
## 7 Alive (53,64] 0.119482
## 8 Dead (53,64] 0.071537
## 9 Alive (64,75] 0.025875
## 10 Dead (64,75] 0.102740
## 11 Alive (75,100] 0.000000
## 12 Dead (75,100] 0.056317
gwm( outcome ~ smoker + ageGroups, data = W )
##
## Groupwise Model Call:
## gwm(formula = outcome ~ smoker + ageGroups, data = W)
##
## outcome smoker ageGroups model_value
## 1 Alive No (0,30] 0.123288
## 2 Dead No (0,30] 0.002283
## 3 Alive Yes (0,30] 0.091324
## 4 Dead Yes (0,30] 0.002283
## 5 Alive No (30,40] 0.097412
## 6 Dead No (30,40] 0.004566
## 7 Alive Yes (30,40] 0.084475
## 8 Dead Yes (30,40] 0.005327
## 9 Alive No (40,53] 0.075342
## 10 Dead No (40,53] 0.010654
## 11 Alive Yes (40,53] 0.101979
## 12 Dead Yes (40,53] 0.025114
## 13 Alive No (53,64] 0.064688
## 14 Dead No (53,64] 0.031963
## 15 Alive Yes (53,64] 0.054795
## 16 Dead Yes (53,64] 0.039574
## 17 Alive No (64,75] 0.021309
## 18 Dead No (64,75] 0.078387
## 19 Alive Yes (64,75] 0.004566
## 20 Dead Yes (64,75] 0.024353
## 21 Alive No (75,100] 0.000000
## 22 Dead No (75,100] 0.047184
## 23 Alive Yes (75,100] 0.000000
## 24 Dead Yes (75,100] 0.009132
With modeling techniques, to be introduced in later chapters, you can use quantitative variables without the need to divide them into groups.
4.1 Model Values and Residuals
A group-wise model tells you a model value for each group, but often you will need these in a case-by-case format: the model value for each case in a data set. The fitted()
function carries out this simple calculation, taking each case in turn, figuring out which group it belongs to, and then returning the set of model values for all the cases. It requires two arguments: the model (here provided by gwm()
) the data on which the model was based. For example:
Kids <- KidsFeet # just get this from mosaicData instead of going out on the web with read.csv
mod <- gwm( width ~ sex, data = Kids )
fitted( mod, Kids )
## [1] 9.190000 9.190000 9.190000 9.190000 9.190000 9.190000 9.190000 8.784211 8.784211 9.190000
## [11] 9.190000 9.190000 9.190000 9.190000 8.784211 8.784211 8.784211 8.784211 8.784211 8.784211
## [21] 9.190000 9.190000 8.784211 8.784211 8.784211 9.190000 8.784211 9.190000 9.190000 9.190000
## [31] 8.784211 8.784211 8.784211 9.190000 9.190000 8.784211 8.784211 8.784211 8.784211
The residuals are found by subtracting the case-by-case model value from the actual values for each case.
res <- Kids$width - fitted(mod, Kids)
head(res) # display the first few residuals
## [1] -0.79 -0.39 0.51 0.61 -0.29 0.51
res <- resid(mod, Kids) # function resid does the calculation automatically
head(res)
## [1] -0.79 -0.39 0.51 0.61 -0.29 0.51
Take care to use the same quantitative variable (width
in this case) from the data as was used in constructing the model.
The var()
function will calculate the variance:
var( Kids$width ) # overall variation
## [1] 0.2596761
var( fitted(mod, Kids) ) # variation in model values
## [1] 0.04222182
var( resid(mod, Kids) ) # residual variation
## [1] 0.2174543
Notice the “model triangle” relationship between these three numbers.