Chapter 2 Data in R
2.1 Preliminaries
Many of the R functions used in this text are written especially for the text to enhance convenience and clarity of purpose. To access these functions, you will need to load the mosaic
package at the beginning of your session. Loading a package is simple:
require(mosaic)
You need to do this only once in each session of R, and on systems such as RStudio the package will generally be reloaded automatically. (If you get an error message, it’s likely that the mosaic
package has not been installed on your system. Use the package installation menu in R to install mosaic
, after which the require()
function will load the package.)
mosaic
itself loads the other packages on which it depends. If a command you see in this text does not work for you, be sure that mosaic
is loaded.
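The install-then-load step described above can also be carried out from the console instead of the installation menu. The following is a minimal sketch, assuming an internet connection and a configured CRAN mirror; the helper name ensure_package() is hypothetical and is not part of base R or mosaic.

```r
# Hypothetical helper (not part of base R or mosaic): install a package
# only if it is missing, then report whether its namespace can be loaded.
ensure_package <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)   # needs an internet connection and a CRAN mirror
  }
  requireNamespace(pkg, quietly = TRUE)
}

# Demonstrated on "stats", which ships with every R installation,
# so nothing is downloaded here.
ok <- ensure_package("stats")
ok
```

Calling ensure_package("mosaic") and then require(mosaic) would mirror the steps in the text.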
Data used in statistical modeling are usually organized into tables, often created using spreadsheet software. Most people presume that the same software used to create a table of data should be used to display and analyze it. This is part of the reason for the popularity of spreadsheet programs such as Excel and Google Spreadsheets.
For serious statistical work, it’s helpful to take another approach that strictly separates the processes of data collection and of data analysis: use one program to create data files and another program to analyze the data stored in those files.
By doing this, one guarantees that the original data are not modified accidentally in the process of analyzing them. This also makes it possible to perform many different analyses of the data; modelers often create and compare many different models of the same data.
2.2 Reading Tabular Data into R
Data are central to statistics, and the tabular arrangement of data is very common. Accordingly, R provides a large number of ways to read in tabular data. These vary depending on how the data are stored, where they are located, etc., but they generally take similar forms, which will become familiar to you with use.
This text makes use of several datasets and most of these are available to you in the package mosaicData
. You can load this into your workspace with the now familiar command
require(mosaicData)
or by checking the box next to mosaicData
in the packages tab in RStudio. Once this is done, you can refer to a dataset by its name in mosaicData (clicking on mosaicData
in the packages tab in RStudio will bring up an index of names and associated codebooks).
An often used classic dataset residing in mosaicData
is the height data collected by Sir Francis Galton. The following command displays the first few records:
head(Galton)
## family father mother sex height nkids
## 1 1 78.5 67.0 M 73.2 4
## 2 1 78.5 67.0 F 69.2 4
## 3 1 78.5 67.0 F 69.0 4
## 4 1 78.5 67.0 F 69.0 4
## 5 2 75.5 66.5 M 73.5 4
## 6 2 75.5 66.5 M 72.5 4
Though mosaicData
contains many of this text’s datasets, it does not contain all of them, and you will also want to analyse your own data, generated and then stored in tabular form. The most common method of reading tabular data, for the purposes of this book, is the R function read.csv()
which, not surprisingly, reads in .csv
or comma-separated values files. These are plain text files that can be generated by a spreadsheet. read.csv()
imports tabular data (in .csv format) into R from anywhere on your computer or on the web.
Reading in a data table with read.csv()
is simply a matter of knowing the name of the file and where it is stored. For instance, one data table used in examples in this book is swim100m.csv
. All of the .csv
files of data mentioned in the text are available on the web at http://tiny.cc/mosaic/, so to read in this data table and create an object in R that contains the data, use a command like this:
Swim <- read.csv("http://tiny.cc/mosaic/swim100m.csv")
The part of this command that requires creativity is choosing a name for the R object that will hold the data. In the above command it is called Swim
, but you might prefer another name, e.g., S
or Sdata
or even Ralph
. Beginning with a capital letter is standard practice, but not required. Remember, R is case sensitive. Of course, it’s sensible to choose names that are short, easy to type and remember, and remind you what the contents of the object are about.
To help you identify data tables that can be accessed through read.csv()
, examples in this book will be marked with a flag containing the name of the file.
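To see what read.csv() expects without relying on a network connection, the sketch below writes a tiny .csv file to a temporary location and reads it back. The three data rows are copied from the first records of swim100m.csv as printed in this chapter; the object names csv_path and Mini are made up for illustration.

```r
# Write a small .csv file (header plus three rows) to a temporary location.
csv_path <- tempfile(fileext = ".csv")
writeLines(c("year,time,sex",
             "1905,65.8,M",
             "1908,65.6,M",
             "1910,62.8,M"),
           csv_path)

# read.csv() takes the quoted file location and returns a data frame.
Mini <- read.csv(csv_path)
Mini
```

The same call works with a web address in place of the local path, as in the Swim example above.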
2.3 Data Frames
The type of R object created by read.csv()
is called a data frame
and is essentially a tabular layout. To illustrate, here are the first several cases of the Swim
data frame created by the previous use of read.csv()
:
head(Swim)
## year time sex
## 1 1905 65.8 M
## 2 1908 65.6 M
## 3 1910 62.8 M
## 4 1912 61.6 M
## 5 1918 61.4 M
## 6 1920 60.4 M
What do you think a function might be called that prints out the last several cases? Try it.
Note that the head()
function, one of several functions that operate on data frames, takes the R object that you created, not the quoted name of the data file.
Data frames, like tabular data generally, involve variables and cases. In R, each of the variables is given a name. You can refer to the variable by name in a couple of different ways. To see the variable names in a data frame, something you might want to do to remind yourself of how names are spelled and capitalized, use the names()
function:
names(Swim)
## [1] "year" "time" "sex"
Another way to get quick information about the variables in a data frame is with summary()
:
summary(Swim)
## year time sex
## Min. :1905 Min. :47.84 F:31
## 1st Qu.:1924 1st Qu.:53.64 M:31
## Median :1956 Median :56.88
## Mean :1952 Mean :59.92
## 3rd Qu.:1976 3rd Qu.:65.20
## Max. :2004 Max. :95.00
To see how many cases there are in a data frame, use nrow()
:
nrow(Swim)
## [1] 62
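Alongside names(), summary(), and nrow(), base R's dim() and str() give a compact view of a data frame's shape and column types. The sketch below applies them to a hand-built data frame holding only the six records printed by head(Swim) above; the name SwimHead is hypothetical.

```r
# Hand-built copy of the six records shown by head(Swim).
SwimHead <- data.frame(
  year = c(1905, 1908, 1910, 1912, 1918, 1920),
  time = c(65.8, 65.6, 62.8, 61.6, 61.4, 60.4),
  sex  = c("M", "M", "M", "M", "M", "M")
)

names(SwimHead)  # variable names
nrow(SwimHead)   # number of cases
dim(SwimHead)    # cases and variables together
str(SwimHead)    # one line per variable, with its type and first values
```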
2.4 Variables in Data Frames
Perhaps the most common operation on a data frame is to refer to the values in a single variable. The two ways you will most commonly use involve functions with a data =
argument and the direct use of the $
notation.
The $
notation is the most basic, if not the most intuitive, way of referring to a variable in a data frame. Here we find the mean record time (time
) in the dataset we’ve named Swim
:
mean(Swim$time)
## [1] 59.92419
Think of this as referring to the variable by both its family name (the data frame’s name, Swim
) and its given name (time
), something like Clinton$Hillary.
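What the $ notation returns is an ordinary vector, which is why it can be handed to any function that works on vectors. A minimal sketch, using three records from the Swim listing above in a hypothetical data frame named SwimTop; the double-bracket form is the base R equivalent of $.

```r
# Three records copied from the head(Swim) listing.
SwimTop <- data.frame(
  year = c(1905, 1908, 1910),
  time = c(65.8, 65.6, 62.8),
  sex  = c("M", "M", "M")
)

times <- SwimTop$time                   # family name $ given name
is.numeric(times)                       # the result is a plain numeric vector
identical(times, SwimTop[["time"]])     # [[ ]] extracts the same column
mean(times)                             # any vector function can now be applied
```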
Most of the statistical modeling functions you will encounter in this book are designed to work with data frames and allow you to refer directly to variables within a data frame. For instance:
mean( ~ time, data = Swim)
## [1] 59.92419
min( ~ time, data = Swim)
## [1] 47.84
The data =
argument tells the function which data frame to pull the variable from. The use of the tilde (~
) identifies the first argument as a model formula, which is necessary if the data =
argument is to be used. Leaving off that argument or the tilde leads to an error.
The advantage of the data =
approach becomes evident when you construct statements that involve more than one variable within a data frame. For instance, here’s a calculation of the mean time separately for the different sexes:
mean( time ~ sex, data = Swim )
## F M
## 65.19226 54.65613
Alternatively,
mean( Swim$time ~ Swim$sex )
## F M
## 65.19226 54.65613
You will see much more of the tilde starting in Chapter @ref(“chap:simple-models”). It’s the R notation for “broken down by” or “versus.”
The ability of mean()
, median()
, and similar functions to handle the data =
format is provided by the mosaic
package. When you encounter a function that can’t handle the data =
format, use the $
notation.
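When mosaic is not available, base R offers a few ways to obtain the same kind of grouped summary. The following is a sketch only, run on an invented four-row data frame: the values are made up for illustration and are not taken from swim100m.csv, and the name Toy is hypothetical.

```r
# Invented values for illustration only; not taken from swim100m.csv.
Toy <- data.frame(
  time = c(60, 62, 70, 72),
  sex  = c("M", "M", "F", "F")
)

# $ notation with tapply(): apply mean() to time within each level of sex.
by_sex <- tapply(Toy$time, Toy$sex, mean)
by_sex

# with() evaluates an expression inside the data frame,
# which avoids typing Toy$ repeatedly.
overall <- with(Toy, mean(time))
overall

# aggregate() is a base R function that accepts a formula and data =.
aggregate(time ~ sex, data = Toy, FUN = mean)
```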
2.5 Adding a New Variable
Sometimes you will compute a new quantity from the existing variables and want to treat this as a new variable. Adding a new variable to a data frame can be done with the $
notation. For instance, here is how to create a new variable in Swim
that holds the time
converted from seconds to units of minutes:
Swim$minutes = Swim$time/60
The new variable appears just like the old ones:
head(Swim, n = 3L)
## year time sex minutes
## 1 1905 65.8 M 1.096667
## 2 1908 65.6 M 1.093333
## 3 1910 62.8 M 1.046667
You could also, if you want, redefine an existing variable, for instance:
Swim$time = Swim$time/60
head(Swim, n = 3L)
## year time sex minutes
## 1 1905 1.096667 M 1.096667
## 2 1908 1.093333 M 1.093333
## 3 1910 1.046667 M 1.046667
Such assignment operations do not change the original file (e.g. the swim100m.csv file) from which the data were read, only the data frame in the current session of R. This is an advantage, since it means that your data in the data file stay in their original state and therefore won’t be corrupted by operations made during analysis.
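The claim that assignments leave the data file untouched can be checked directly. The sketch below uses a temporary file containing three records copied from the Swim listing; the names csv_path, Work, and Fresh are hypothetical.

```r
# Create a small data file to stand in for swim100m.csv.
csv_path <- tempfile(fileext = ".csv")
writeLines(c("year,time,sex",
             "1905,65.8,M",
             "1908,65.6,M",
             "1910,62.8,M"),
           csv_path)

# Read it, then add a variable and overwrite an existing one.
Work <- read.csv(csv_path)
Work$minutes <- Work$time / 60
Work$time    <- Work$time / 60

# Reading the file again recovers the original values:
# only the data frame in the session was changed.
Fresh <- read.csv(csv_path)
names(Fresh)
Fresh$time
```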
2.6 Sampling from a Sampling Frame
Much of statistical analysis is concerned with the consequences of drawing a sample from the population. Ideally, you will have a sampling frame that lists every member of the population from which the sample is to be drawn. With this in hand, you could treat the individual cases in the sampling frame as if they were cards in a deck. To pick your random sample, shuffle the deck and deal out the desired number of cards.
When doing real work in the field, you would use the randomly dealt cards to locate the real-world cases they correspond to. Sometimes in this book, however, in order to let you explore the consequences of sampling, you will select a sample from an existing data set. The deal()
function performs this, taking as arguments the data frame to be used in the selection and the number of cases to sample.
For example, the kidsfeet.csv
data set has n=39 cases.
Kids <- read.csv("http://tiny.cc/mosaic/kidsfeet.csv")
nrow(Kids)
## [1] 39
Here’s how to take a random sample of five of the cases:
deal(Kids, 5)
## name birthmonth birthyear length width sex biggerfoot domhand orig.id
## 23 Laura 9 88 24.0 8.3 G R L 23
## 19 Lee 6 88 26.7 9.0 G L L 19
## 29 Mike 11 88 24.2 8.9 B L R 29
## 20 Heather 3 88 25.5 9.5 G R R 20
## 4 Josh 1 88 25.2 9.8 B L R 4
The results returned by deal()
will never contain the same case more than once, just as if you were dealing cards from a shuffled deck. In contrast, resample()
replaces each case after it is dealt so that it can appear more than once in the result. You wouldn’t want to do this to select from a sampling frame, but it turns out that there are valuable statistical uses for this sort of sampling with replacement.
You’ll make use of re-sampling in Chapter ??.
resample(Kids, 5)
## name birthmonth birthyear length width sex biggerfoot domhand orig.id
## 17 Caroline 12 87 24.0 8.7 G R L 17
## 39 Alisha 9 88 24.6 8.8 G L R 39
## 13 Cal 8 87 26.1 9.1 B L R 13
## 35 Peter 4 88 24.7 8.6 B R L 35
## 17.1 Caroline 12 87 24.0 8.7 G R L 17
Notice that Caroline was sampled twice.
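The difference between the two functions is the difference between sampling without and with replacement. The sketch below illustrates that idea with base R's sample() applied to row numbers; it is not a description of how deal() and resample() are implemented in mosaic. The data frame Frame, with 39 numbered rows, is a hypothetical stand-in for Kids.

```r
# Hypothetical stand-in for a sampling frame with 39 cases.
Frame <- data.frame(id = 1:39)

set.seed(101)  # arbitrary seed so the draws can be reproduced

# Without replacement: each row can be picked at most once (like dealing cards).
dealt <- Frame[sample(nrow(Frame), 5), , drop = FALSE]

# With replacement: a row may be picked more than once.
redrawn <- Frame[sample(nrow(Frame), 5, replace = TRUE), , drop = FALSE]

anyDuplicated(dealt$id)    # 0 means no case appears twice
anyDuplicated(redrawn$id)  # may or may not be 0, depending on the draw
```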