The mosaic package makes several summary statistic functions (like mean and sd) formula aware.
mean_(x, ..., data = NULL, groups = NULL, na.rm = getOption("na.rm", FALSE))
mean(x, ...)
median(x, ..., data = NULL, groups = NULL, na.rm = getOption("na.rm", FALSE))
range(x, ..., data = NULL, groups = NULL, na.rm = getOption("na.rm", FALSE))
sd(x, ..., data = NULL, groups = NULL, na.rm = getOption("na.rm", FALSE))
max(x, ..., data = NULL, groups = NULL, na.rm = getOption("na.rm", FALSE))
min(x, ..., data = NULL, groups = NULL, na.rm = getOption("na.rm", FALSE))
sum(x, ..., data = NULL, groups = NULL, na.rm = getOption("na.rm", FALSE))
IQR(x, ..., data = NULL, groups = NULL, na.rm = getOption("na.rm", FALSE))
fivenum(x, ..., data = NULL, groups = NULL, na.rm = getOption("na.rm", FALSE))
iqr(x, ..., data = NULL, groups = NULL, na.rm = getOption("na.rm", FALSE))
prod(x, ..., data = NULL, groups = NULL, na.rm = getOption("na.rm", FALSE))
favstats(x, ..., data = NULL, groups = NULL, na.rm = TRUE)
quantile(x, ..., data = NULL, groups = NULL, na.rm = getOption("na.rm", FALSE))
var(x, y = NULL, na.rm = getOption("na.rm", FALSE), ..., data = NULL)
cor(x, y = NULL, ..., data = NULL)
cov(x, y = NULL, ..., data = NULL)
x: a numeric vector or a formula
...: additional arguments
data: a data frame in which to evaluate formulas (or bare names). Note that the default is data = parent.frame(). This makes it convenient to use this function interactively by treating the working environment as if it were a data frame, but this may not be appropriate for programming uses. When programming, it is best to use an explicit data argument, ideally supplying a data frame that contains the variables mentioned.
groups: a grouping variable, typically the name of a variable in data
na.rm: a logical indicating whether NAs should be removed before computing
y: a numeric vector or a formula
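The advice on programming uses and the na.rm default can be sketched as follows. This is a minimal illustration rather than part of the package: group_means() is a hypothetical helper, and the sketch assumes the KidsFeet and HELPrct data sets that are available once mosaic is loaded.

```r
library(mosaic)

# Hypothetical helper (not part of mosaic): always pass the data frame
# explicitly instead of relying on the calling environment.
group_means <- function(df) {
  mean(width ~ sex, data = df)
}

feet_means <- group_means(KidsFeet)
feet_means

# The na.rm default is read from getOption("na.rm", FALSE), so it can be
# changed for a session and then restored.
old <- options(na.rm = TRUE)
risk_mean <- mean(~ drugrisk, data = HELPrct)  # NAs dropped via the option
options(old)
risk_mean
```

Restoring the previous options at the end keeps the changed default from leaking into later code.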
Many of these functions mask core R functions to provide an additional formula interface. Old behavior should be unchanged. But if the first argument is a formula, that formula, together with data, is used to generate the numeric vector(s) to be summarized. Formulas of the shape x ~ a or ~ x | a can be used to produce summaries of x for each subset defined by a. Two-way aggregation can be achieved using formulas of the form x ~ a + b or x ~ a | b. See the examples.
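As a rough cross-check of the two one-way shapes described above, the sketch below compares x ~ a with ~ x | a on the KidsFeet data, and then with base R's tapply(). The tapply() comparison only illustrates what is being computed; it is not a statement about how mosaic implements the formula interface.

```r
library(mosaic)

# Two formula shapes that request the mean of width within each sex
by_rhs  <- mean(width ~ sex, data = KidsFeet)
by_cond <- mean(~ width | sex, data = KidsFeet)

# Base R computation for comparison (an illustration only)
by_base <- tapply(KidsFeet$width, KidsFeet$sex, mean)

# Compare the group means, ignoring names and attributes
all.equal(as.numeric(by_rhs), as.numeric(by_cond))
all.equal(as.numeric(by_rhs), as.numeric(by_base))
```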
Earlier versions of these functions supported a "bare name + data frame" interface. This functionality has been removed since it was (a) ambiguous in some cases, (b) unnecessary, and (c) difficult to maintain.
mean(HELPrct$age)
#> [1] 35.65342
mean( ~ age, data = HELPrct)
#> [1] 35.65342
mean( ~ drugrisk, na.rm = TRUE, data = HELPrct)
#> [1] 1.887168
mean(age ~ shuffle(sex), data = HELPrct)
#>   female     male
#> 36.39252 35.42486
mean(age ~ shuffle(sex), data = HELPrct, .format = "table")
#>   shuffle(sex)     mean
#> 1       female 35.81308
#> 2         male 35.60405
# wrap in data.frame() to auto-convert awkward variable names
data.frame(mean(age ~ shuffle(sex), data = HELPrct, .format = "table"))
#>   shuffle.sex.     mean
#> 1       female 35.51402
#> 2         male 35.69653
mean(age ~ sex + substance, data = HELPrct)
#> female.alcohol   male.alcohol female.cocaine   male.cocaine  female.heroin
#>       39.16667       37.95035       34.85366       34.36036       34.66667
#>    male.heroin
#>       33.05319
mean( ~ age | sex + substance, data = HELPrct)
#> female.alcohol   male.alcohol female.cocaine   male.cocaine  female.heroin
#>       39.16667       37.95035       34.85366       34.36036       34.66667
#>    male.heroin
#>       33.05319
mean( ~ sqrt(age), data = HELPrct)
#> [1] 5.936703
sum( ~ age, data = HELPrct)
#> [1] 16151
sd(HELPrct$age)
#> [1] 7.710266
sd( ~ age, data = HELPrct)
#> [1] 7.710266
sd(age ~ sex + substance, data = HELPrct)
#> female.alcohol   male.alcohol female.cocaine   male.cocaine  female.heroin
#>       7.980333       7.575644       6.195002       6.889772       8.035839
#>    male.heroin
#>       7.973568
var(HELPrct$age)
#> [1] 59.4482
var( ~ age, data = HELPrct)
#> [1] 59.4482
var(age ~ sex + substance, data = HELPrct)
#> female.alcohol   male.alcohol female.cocaine   male.cocaine  female.heroin
#>       63.68571       57.39037       38.37805       47.46896       64.57471
#>    male.heroin
#>       63.57779
IQR(width ~ sex, data = KidsFeet)
#>    B    G
#> 0.75 0.60
iqr(width ~ sex, data = KidsFeet)
#>    B    G
#> 0.75 0.60
favstats(width ~ sex, data = KidsFeet)
#>   sex min    Q1 median    Q3 max     mean        sd  n missing
#> 1   B 8.4 8.875   9.15 9.625 9.8 9.190000 0.4517801 20       0
#> 2   G 7.9 8.550   8.80 9.150 9.5 8.784211 0.4935846 19       0
cor(length ~ width, data = KidsFeet)
#> [1] 0.6410961
cov(length ~ width, data = KidsFeet)
#> [1] 0.4304453
tally(is.na(mcs) ~ is.na(pcs), data = HELPmiss)
#>           is.na(pcs)
#> is.na(mcs) TRUE FALSE
#>      TRUE     2     0
#>      FALSE    0   468
cov(mcs ~ pcs, data = HELPmiss) # NA because of missing data
#> [1] NA
cov(mcs ~ pcs, data = HELPmiss, use = "complete") # ignore missing data
#> [1] 13.46433
# alternative approach using filter explicitly
cov(mcs ~ pcs, data = HELPmiss |> filter(!is.na(mcs) & !is.na(pcs)))
#> [1] 13.46433