The \doby{} package contains a variety of utility functions. This working document describes some of these functions. The package originally grew out of a need to calculate groupwise summary statistics (much in the spirit of \code{PROC SUMMARY} of the \proglang{SAS} system), but today the package contains many different utilities.
library(doBy)
\section{Data used for illustration} \label{sec:co2data}
The description of the \code{doBy} package is based on the \code{mtcars} dataset.
head(mtcars)
#> mpg cyl disp hp drat wt qsec vs am gear carb
#> Mazda RX4 21.0 6 160 110 3.90 2.62 16.5 0 1 4 4
#> Mazda RX4 Wag 21.0 6 160 110 3.90 2.88 17.0 0 1 4 4
#> Datsun 710 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1
#> Hornet 4 Drive 21.4 6 258 110 3.08 3.21 19.4 1 0 3 1
#> Hornet Sportabout 18.7 8 360 175 3.15 3.44 17.0 0 0 3 2
#> Valiant 18.1 6 225 105 2.76 3.46 20.2 1 0 3 1
tail(mtcars)
#> mpg cyl disp hp drat wt qsec vs am gear carb
#> Porsche 914-2 26.0 4 120.3 91 4.43 2.14 16.7 0 1 5 2
#> Lotus Europa 30.4 4 95.1 113 3.77 1.51 16.9 1 1 5 2
#> Ford Pantera L 15.8 8 351.0 264 4.22 3.17 14.5 0 1 5 4
#> Ferrari Dino 19.7 6 145.0 175 3.62 2.77 15.5 0 1 5 6
#> Maserati Bora 15.0 8 301.0 335 3.54 3.57 14.6 0 1 5 8
#> Volvo 142E 21.4 4 121.0 109 4.11 2.78 18.6 1 1 4 2
A description of the variable names can be found here: \url{https://rstudio-pubs-static.s3.amazonaws.com/61800_faea93548c6b49cc91cd0c5ef5059894.html}; see also \url{https://www.jstor.org/stable/2529336?seq=12}.
\label{sec:summaryBy}
The \summaryby{} function is used for calculating quantities like ``the mean and variance of numerical variables \(x\) and `hv(y)hv` for each combination of two factors `hv(A)hv` and `hv(B)hv`’’. Notice: A functionality similar to \summaryby\ is provided by \code{aggregate()} from base \R.
myfun1 <- function(x){
c(m=mean(x), s=sd(x))
}
summaryBy(cbind(mpg, cyl, lh=log(hp)) ~ vs,
data=mtcars, FUN=myfun1)
#> vs mpg.m mpg.s cyl.m cyl.s lh.m lh.s
#> 1 0 16.6 3.86 7.44 1.149 5.20 0.330
#> 2 1 24.6 5.38 4.57 0.938 4.48 0.289
A simpler call is
summaryBy(mpg ~ vs, data=mtcars, FUN=mean)
#> vs mpg.mean
#> 1 0 16.6
#> 2 1 24.6
Instead of formula we may specify a list containing the left hand side and the right hand side of a formula\footnote{This is a feature of \summaryby\ and it does not work with \code{aggregate}.} but that is possible only for variables already in the dataframe:
summaryBy(list(c("mpg", "cyl"), "vs"),
data=mtcars, FUN=myfun1)
#> vs mpg.m mpg.s cyl.m cyl.s
#> 1 0 16.6 3.86 7.44 1.149
#> 2 1 24.6 5.38 4.57 0.938
Inspired by the \pkg{dplyr} package, there is a \verb|summary_by| function which does the samme as \summaryby{} but with the data argument being the first so that one may write
mtcars |> summary_by(cbind(mpg, cyl, lh=log(hp)) ~ vs,
FUN=myfun1)
#> vs mpg.m mpg.s cyl.m cyl.s lh.m lh.s
#> 1 0 16.6 3.86 7.44 1.149 5.20 0.330
#> 2 1 24.6 5.38 4.57 0.938 4.48 0.289
Ordering (or sorting) a data frame is possible with the \code{orderBy} function. For example, we order the rows according to \code{gear} and \code{carb} (within \code{gear}):
x1 <- orderBy(~ gear + carb, data=mtcars)
head(x1, 4)
#> mpg cyl disp hp drat wt qsec vs am gear carb
#> Hornet 4 Drive 21.4 6 258 110 3.08 3.21 19.4 1 0 3 1
#> Valiant 18.1 6 225 105 2.76 3.46 20.2 1 0 3 1
#> Toyota Corona 21.5 4 120 97 3.70 2.46 20.0 1 0 3 1
#> Hornet Sportabout 18.7 8 360 175 3.15 3.44 17.0 0 0 3 2
tail(x1, 4)
#> mpg cyl disp hp drat wt qsec vs am gear carb
#> Lotus Europa 30.4 4 95.1 113 3.77 1.51 16.9 1 1 5 2
#> Ford Pantera L 15.8 8 351.0 264 4.22 3.17 14.5 0 1 5 4
#> Ferrari Dino 19.7 6 145.0 175 3.62 2.77 15.5 0 1 5 6
#> Maserati Bora 15.0 8 301.0 335 3.54 3.57 14.6 0 1 5 8
If we want the ordering to be by decreasing values of one of the variables, we can do
x2 <- orderBy(~ -gear + carb, data=mtcars)
Alternative forms are:
x3 <- orderBy(c("gear", "carb"), data=mtcars)
x4 <- orderBy(c("-gear", "carb"), data=mtcars)
x5 <- mtcars |> order_by(c("gear", "carb"))
x6 <- mtcars |> order_by(~ -gear + carb)
Suppose we want to split the \code{airquality} data into a list of dataframes, e.g.\ one dataframe for each month. This can be achieved by:
x <- splitBy(~ Month, data=airquality)
x <- splitBy(~ vs, data=mtcars)
lapply(x, head, 4)
#> $`0`
#> mpg cyl disp hp drat wt qsec vs am gear carb
#> Mazda RX4 21.0 6 160 110 3.90 2.62 16.5 0 1 4 4
#> Mazda RX4 Wag 21.0 6 160 110 3.90 2.88 17.0 0 1 4 4
#> Hornet Sportabout 18.7 8 360 175 3.15 3.44 17.0 0 0 3 2
#> Duster 360 14.3 8 360 245 3.21 3.57 15.8 0 0 3 4
#>
#> $`1`
#> mpg cyl disp hp drat wt qsec vs am gear carb
#> Datsun 710 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1
#> Hornet 4 Drive 21.4 6 258 110 3.08 3.21 19.4 1 0 3 1
#> Valiant 18.1 6 225 105 2.76 3.46 20.2 1 0 3 1
#> Merc 240D 24.4 4 147 62 3.69 3.19 20.0 1 0 4 2
attributes(x)
#> $names
#> [1] "0" "1"
#>
#> $groupid
#> vs
#> 1 0
#> 2 1
#>
#> $idxvec
#> $idxvec$`0`
#> [1] 1 2 5 7 12 13 14 15 16 17 22 23 24 25 27 29 30 31
#>
#> $idxvec$`1`
#> [1] 3 4 6 8 9 10 11 18 19 20 21 26 28 32
#>
#>
#> $grps
#> [1] "0" "0" "1" "1" "0" "1" "0" "1" "1" "1" "1" "0" "0" "0" "0" "0" "0" "1" "1"
#> [20] "1" "1" "0" "0" "0" "0" "1" "0" "1" "0" "0" "0" "1"
#>
#> $class
#> [1] "splitByData" "list"
Alternative forms are:
splitBy("vs", data=mtcars)
#> listentry vs
#> 1 0 0
#> 2 1 1
mtcars |> split_by(~ vs)
#> listentry vs
#> 1 0 0
#> 2 1 1
Suppose we want to select those rows within each month for which the the wind speed is larger than the mean wind speed (within the month). This is achieved by:
x <- subsetBy(~am, subset=mpg > mean(mpg), data=mtcars)
head(x)
#> mpg cyl disp hp drat wt qsec vs am gear carb
#> 1.Fiat 128 32.4 4 78.7 66 4.08 2.20 19.5 1 1 4 1
#> 1.Honda Civic 30.4 4 75.7 52 4.93 1.61 18.5 1 1 4 2
#> 1.Toyota Corolla 33.9 4 71.1 65 4.22 1.83 19.9 1 1 4 1
#> 1.Fiat X1-9 27.3 4 79.0 66 4.08 1.94 18.9 1 1 4 1
#> 1.Porsche 914-2 26.0 4 120.3 91 4.43 2.14 16.7 0 1 5 2
#> 1.Lotus Europa 30.4 4 95.1 113 3.77 1.51 16.9 1 1 5 2
Note that the statement \code{Wind > mean(Wind)} is evaluated within each month.
Alternative forms are
x <- subsetBy("am", subset=mpg > mean(mpg), data=mtcars)
x <- mtcars |> subset_by("vs", subset=mpg > mean(mpg))
x <- mtcars |> subset_by(~vs, subset=mpg > mean(mpg))
The \code{transformBy} function is analogous to the \code{transform} function except that it works within groups. For example:
head(x)
#> mpg cyl disp hp drat wt qsec vs am gear carb
#> 0.Mazda RX4 21.0 6 160 110 3.90 2.62 16.5 0 1 4 4
#> 0.Mazda RX4 Wag 21.0 6 160 110 3.90 2.88 17.0 0 1 4 4
#> 0.Hornet Sportabout 18.7 8 360 175 3.15 3.44 17.0 0 0 3 2
#> 0.Merc 450SL 17.3 8 276 180 3.07 3.73 17.6 0 0 3 3
#> 0.Pontiac Firebird 19.2 8 400 175 3.08 3.85 17.1 0 0 3 2
#> 0.Porsche 914-2 26.0 4 120 91 4.43 2.14 16.7 0 1 5 2
x <- transformBy(~vs, data=mtcars,
min.mpg=min(mpg), max.mpg=max(mpg))
head(x)
#> mpg cyl disp hp drat wt qsec vs am gear carb min.mpg max.mpg
#> 1 21.0 6 160 110 3.90 2.62 16.5 0 1 4 4 10.4 26
#> 2 21.0 6 160 110 3.90 2.88 17.0 0 1 4 4 10.4 26
#> 3 18.7 8 360 175 3.15 3.44 17.0 0 0 3 2 10.4 26
#> 4 14.3 8 360 245 3.21 3.57 15.8 0 0 3 4 10.4 26
#> 5 16.4 8 276 180 3.07 4.07 17.4 0 0 3 3 10.4 26
#> 6 17.3 8 276 180 3.07 3.73 17.6 0 0 3 3 10.4 26
Alternative forms:
x <- transformBy("vs", data=mtcars,
min.mpg=min(mpg), max.mpg=max(mpg))
x <- mtcars |> transform_by("vs",
min.mpg=min(mpg), max.mpg=max(mpg))
This \code{lapplyBy} function is a wrapper for first splitting data into a list according to the formula (using splitBy) and then applying a function to each element of the list (using lapply).
lapplyBy(~vs, data=mtcars,
FUN=function(d) lm(mpg~cyl, data=d) |> summary() |> coef())
#> $`0`
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 36.93 3.69 10.01 2.73e-08
#> cyl -2.73 0.49 -5.56 4.27e-05
#>
#> $`1`
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 41.9 5.78 7.26 0.00001
#> cyl -3.8 1.24 -3.07 0.00978
To obtain the indices of the first/last occurences of an item in a vector do:
x <- c(1, 1, 1, 2, 2, 2, 1, 1, 1, 3)
firstobs(x)
#> [1] 1 4 10
lastobs(x)
#> [1] 6 9 10
The same can be done on variables in a data frame, e.g.
firstobs(~vs, data=mtcars)
#> [1] 1 3
lastobs(~vs, data=mtcars)
#> [1] 31 32
\subsection{The \code{which.maxn()} and \code{which.minn()} functions} \label{sec:whichmaxn}
The location of the \(n\) largest / smallest entries in a numeric vector can be obtained with
x <- c(1:4, 0:5, 11, NA, NA)
which.maxn(x, 3)
#> [1] 11 10 4
which.minn(x, 5)
#> [1] 5 1 6 2 7
Find (sub) sequences in a vector:
x <- c(1, 1, 2, 2, 2, 1, 1, 3, 3, 3, 3, 1, 1, 1)
subSeq(x)
#> first last slength midpoint value
#> 1 1 2 2 2 1
#> 2 3 5 3 4 2
#> 3 6 7 2 7 1
#> 4 8 11 4 10 3
#> 5 12 14 3 13 1
subSeq(x, item=1)
#> first last slength midpoint value
#> 1 1 2 2 2 1
#> 2 6 7 2 7 1
#> 3 12 14 3 13 1
subSeq(letters[x])
#> first last slength midpoint value
#> 1 1 2 2 2 a
#> 2 3 5 3 4 b
#> 3 6 7 2 7 a
#> 4 8 11 4 10 c
#> 5 12 14 3 13 a
subSeq(letters[x], item="a")
#> first last slength midpoint value
#> 1 1 2 2 2 a
#> 2 6 7 2 7 a
#> 3 12 14 3 13 a
x <- c("dec", "jan", "feb", "mar", "apr", "may")
src1 <- list(c("dec", "jan", "feb"), c("mar", "apr", "may"))
tgt1 <- list("winter", "spring")
recodeVar(x, src=src1, tgt=tgt1)
#> [1] "winter" "winter" "winter" "spring" "spring" "spring"
head(renameCol(mtcars, c("vs", "mpg"), c("vs_", "mpg_")))
#> mpg_ cyl disp hp drat wt qsec vs_ am gear carb
#> Mazda RX4 21.0 6 160 110 3.90 2.62 16.5 0 1 4 4
#> Mazda RX4 Wag 21.0 6 160 110 3.90 2.88 17.0 0 1 4 4
#> Datsun 710 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1
#> Hornet 4 Drive 21.4 6 258 110 3.08 3.21 19.4 1 0 3 1
#> Hornet Sportabout 18.7 8 360 175 3.15 3.44 17.0 0 0 3 2
#> Valiant 18.1 6 225 105 2.76 3.46 20.2 1 0 3 1
Consider the vector
yvar <- c(0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0)
Imagine that “1” indicates an event of some kind which takes place at a certain time point. By default time points are assumed equidistant but for illustration we define time time variable
tvar <- seq_along(yvar) + c(0.1, 0.2)
Now we find time since event as
tse <- timeSinceEvent(yvar, tvar)
tse
#> yvar tvar abs.tse sign.tse ewin run tae tbe
#> 1 0 1.1 3.1 -3.1 1 NA NA -3.1
#> 2 0 2.2 2.0 -2.0 1 NA NA -2.0
#> 3 0 3.1 1.1 -1.1 1 NA NA -1.1
#> 4 1 4.2 0.0 0.0 1 1 0.0 0.0
#> 5 0 5.1 0.9 0.9 1 1 0.9 -5.1
#> 6 0 6.2 2.0 2.0 1 1 2.0 -4.0
#> 7 0 7.1 2.9 2.9 1 1 2.9 -3.1
#> 8 0 8.2 2.0 -2.0 2 1 4.0 -2.0
#> 9 0 9.1 1.1 -1.1 2 1 4.9 -1.1
#> 10 1 10.2 0.0 0.0 2 2 0.0 0.0
#> 11 0 11.1 0.9 0.9 2 2 0.9 -3.1
#> 12 0 12.2 2.0 2.0 2 2 2.0 -2.0
#> 13 0 13.1 1.1 -1.1 3 2 2.9 -1.1
#> 14 1 14.2 0.0 0.0 3 3 0.0 0.0
#> 15 1 15.1 0.0 0.0 4 4 0.0 0.0
#> 16 0 16.2 1.1 1.1 4 4 1.1 NA
#> 17 0 17.1 2.0 2.0 4 4 2.0 NA
#> 18 0 18.2 3.1 3.1 4 4 3.1 NA
#> 19 0 19.1 4.0 4.0 4 4 4.0 NA
#> 20 0 20.2 5.1 5.1 4 4 5.1 NA
The output reads as follows:
\begin{itemize} \item \verb"abs.tse": Absolute time since (nearest) event. \item \verb"sign.tse": Signed time since (nearest) event. \item \verb"ewin": Event window: Gives a symmetric window around each event. \item \verb"run": The value of \verb"run" is set to hv(1)hvwhen the first event occurs and is increased by
hv(1)hv at each subsequent event. \item \verb"tae": Time after event. \item \verb"tbe": Time before event. \end{itemize}
plot(sign.tse ~ tvar, data=tse, type="b")
grid()
rug(tse$tvar[tse$yvar == 1], col="blue",lwd=4)
points(scale(tse$run), col=tse$run, lwd=2)
lines(abs.tse + .2 ~ tvar, data=tse, type="b",col=3)
plot(tae ~ tvar, data=tse, ylim=c(-6,6), type="b")
grid()
lines(tbe ~ tvar, data=tse, type="b", col="red")
rug(tse$tvar[tse$yvar==1], col="blue", lwd=4)
lines(run ~ tvar, data=tse, col="cyan", lwd=2)
plot(ewin ~ tvar, data=tse, ylim=c(1, 4))
rug(tse$tvar[tse$yvar==1], col="blue", lwd=4)
grid()
lines(run ~ tvar, data=tse, col="red")
We may now find times for which time since an event is at most 1 as
tse$tvar[tse$abs <= 1]
#> [1] 4.2 5.1 10.2 11.1 14.2 15.1
Consider the \verb|lynx| data:
lynx <- as.numeric(lynx)
tvar <- 1821:1934
plot(tvar, lynx, type="l")
Suppose we want to estimate the cycle lengths. One way of doing this is as follows:
yyy <- lynx > mean(lynx)
head(yyy)
#> [1] FALSE FALSE FALSE FALSE FALSE TRUE
sss <- subSeq(yyy, TRUE)
sss
#> first last slength midpoint value
#> 1 6 10 5 8 TRUE
#> 2 16 19 4 18 TRUE
#> 3 27 28 2 28 TRUE
#> 4 35 38 4 37 TRUE
#> 5 44 47 4 46 TRUE
#> 6 53 55 3 54 TRUE
#> 7 63 66 4 65 TRUE
#> 8 75 76 2 76 TRUE
#> 9 83 87 5 85 TRUE
#> 10 92 96 5 94 TRUE
#> 11 104 106 3 105 TRUE
#> 12 112 114 3 113 TRUE
plot(tvar, lynx, type="l")
rug(tvar[sss$midpoint], col="blue", lwd=4)
Create the “event vector”
yvar <- rep(0, length(lynx))
yvar[sss$midpoint] <- 1
str(yvar)
#> num [1:114] 0 0 0 0 0 0 0 1 0 0 ...
tse <- timeSinceEvent(yvar,tvar)
head(tse, 20)
#> yvar tvar abs.tse sign.tse ewin run tae tbe
#> 1 0 1821 7 -7 1 NA NA -7
#> 2 0 1822 6 -6 1 NA NA -6
#> 3 0 1823 5 -5 1 NA NA -5
#> 4 0 1824 4 -4 1 NA NA -4
#> 5 0 1825 3 -3 1 NA NA -3
#> 6 0 1826 2 -2 1 NA NA -2
#> 7 0 1827 1 -1 1 NA NA -1
#> 8 1 1828 0 0 1 1 0 0
#> 9 0 1829 1 1 1 1 1 -9
#> 10 0 1830 2 2 1 1 2 -8
#> 11 0 1831 3 3 1 1 3 -7
#> 12 0 1832 4 4 1 1 4 -6
#> 13 0 1833 5 5 1 1 5 -5
#> 14 0 1834 4 -4 2 1 6 -4
#> 15 0 1835 3 -3 2 1 7 -3
#> 16 0 1836 2 -2 2 1 8 -2
#> 17 0 1837 1 -1 2 1 9 -1
#> 18 1 1838 0 0 2 2 0 0
#> 19 0 1839 1 1 2 2 1 -9
#> 20 0 1840 2 2 2 2 2 -8
We get two different (not that different) estimates of period lengths:
len1 <- tapply(tse$ewin, tse$ewin, length)
len2 <- tapply(tse$run, tse$run, length)
c(median(len1), median(len2), mean(len1), mean(len2))
#> [1] 9.50 9.00 9.50 8.92
We can overlay the cycles as:
tse$lynx <- lynx
tse2 <- na.omit(tse)
plot(lynx ~ tae, data=tse2)
plot(tvar, lynx, type="l", lty=2)
mm <- lm(lynx ~ tae + I(tae^2) + I(tae^3), data=tse2)
lines(fitted(mm) ~ tvar, data=tse2, col="red")
\section{Acknowledgements} \label{discussion}
Credit is due to Dennis Chabot, Gabor Grothendieck, Paul Murrell and Jim Robison-Cox for reporting various bugs and making various suggestions to the functionality in the \doby{} package.