doby: Groupwise computations and miscellaneous utilities

Søren Højsgaard

The \doby{} package contains a variety of utility functions. This working document describes some of these functions. The package originally grew out of a need to calculate groupwise summary statistics (much in the spirit of \code{PROC SUMMARY} of the \proglang{SAS} system), but today the package contains many different utilities.

library(doBy)

\section{Data used for illustration} \label{sec:co2data}

The description of the \code{doBy} package is based on the \code{mtcars} dataset.

head(mtcars)
#>                    mpg cyl disp  hp drat   wt qsec vs am gear carb
#> Mazda RX4         21.0   6  160 110 3.90 2.62 16.5  0  1    4    4
#> Mazda RX4 Wag     21.0   6  160 110 3.90 2.88 17.0  0  1    4    4
#> Datsun 710        22.8   4  108  93 3.85 2.32 18.6  1  1    4    1
#> Hornet 4 Drive    21.4   6  258 110 3.08 3.21 19.4  1  0    3    1
#> Hornet Sportabout 18.7   8  360 175 3.15 3.44 17.0  0  0    3    2
#> Valiant           18.1   6  225 105 2.76 3.46 20.2  1  0    3    1
tail(mtcars)
#>                 mpg cyl  disp  hp drat   wt qsec vs am gear carb
#> Porsche 914-2  26.0   4 120.3  91 4.43 2.14 16.7  0  1    5    2
#> Lotus Europa   30.4   4  95.1 113 3.77 1.51 16.9  1  1    5    2
#> Ford Pantera L 15.8   8 351.0 264 4.22 3.17 14.5  0  1    5    4
#> Ferrari Dino   19.7   6 145.0 175 3.62 2.77 15.5  0  1    5    6
#> Maserati Bora  15.0   8 301.0 335 3.54 3.57 14.6  0  1    5    8
#> Volvo 142E     21.4   4 121.0 109 4.11 2.78 18.6  1  1    4    2

A description of the variable names can be found here: \url{https://rstudio-pubs-static.s3.amazonaws.com/61800_faea93548c6b49cc91cd0c5ef5059894.html}; see also \url{https://www.jstor.org/stable/2529336?seq=12}.

Groupwise computations

summaryBy() and summary_by()

\label{sec:summaryBy}

The \summaryby{} function is used for calculating quantities like ``the mean and variance of numerical variables \(x\) and `hv(y)hv` for each combination of two factors `hv(A)hv` and `hv(B)hv`’’. Notice: A functionality similar to \summaryby\ is provided by \code{aggregate()} from base \R.

myfun1 <- function(x){
    c(m=mean(x), s=sd(x))
}
summaryBy(cbind(mpg, cyl, lh=log(hp)) ~ vs, 
          data=mtcars, FUN=myfun1)
#>   vs mpg.m mpg.s cyl.m cyl.s lh.m  lh.s
#> 1  0  16.6  3.86  7.44 1.149 5.20 0.330
#> 2  1  24.6  5.38  4.57 0.938 4.48 0.289

A simpler call is

summaryBy(mpg ~ vs, data=mtcars, FUN=mean)
#>   vs mpg.mean
#> 1  0     16.6
#> 2  1     24.6

Instead of formula we may specify a list containing the left hand side and the right hand side of a formula\footnote{This is a feature of \summaryby\ and it does not work with \code{aggregate}.} but that is possible only for variables already in the dataframe:

summaryBy(list(c("mpg", "cyl"), "vs"), 
          data=mtcars, FUN=myfun1)
#>   vs mpg.m mpg.s cyl.m cyl.s
#> 1  0  16.6  3.86  7.44 1.149
#> 2  1  24.6  5.38  4.57 0.938

Inspired by the \pkg{dplyr} package, there is a \verb|summary_by| function which does the samme as \summaryby{} but with the data argument being the first so that one may write

mtcars |> summary_by(cbind(mpg, cyl, lh=log(hp)) ~ vs,
                      FUN=myfun1)
#>   vs mpg.m mpg.s cyl.m cyl.s lh.m  lh.s
#> 1  0  16.6  3.86  7.44 1.149 5.20 0.330
#> 2  1  24.6  5.38  4.57 0.938 4.48 0.289

orderBy() and order_by()

Ordering (or sorting) a data frame is possible with the \code{orderBy} function. For example, we order the rows according to \code{gear} and \code{carb} (within \code{gear}):

x1 <- orderBy(~ gear + carb, data=mtcars)
head(x1, 4)
#>                    mpg cyl disp  hp drat   wt qsec vs am gear carb
#> Hornet 4 Drive    21.4   6  258 110 3.08 3.21 19.4  1  0    3    1
#> Valiant           18.1   6  225 105 2.76 3.46 20.2  1  0    3    1
#> Toyota Corona     21.5   4  120  97 3.70 2.46 20.0  1  0    3    1
#> Hornet Sportabout 18.7   8  360 175 3.15 3.44 17.0  0  0    3    2
tail(x1, 4)
#>                 mpg cyl  disp  hp drat   wt qsec vs am gear carb
#> Lotus Europa   30.4   4  95.1 113 3.77 1.51 16.9  1  1    5    2
#> Ford Pantera L 15.8   8 351.0 264 4.22 3.17 14.5  0  1    5    4
#> Ferrari Dino   19.7   6 145.0 175 3.62 2.77 15.5  0  1    5    6
#> Maserati Bora  15.0   8 301.0 335 3.54 3.57 14.6  0  1    5    8

If we want the ordering to be by decreasing values of one of the variables, we can do

x2 <- orderBy(~ -gear + carb, data=mtcars)

Alternative forms are:

x3 <- orderBy(c("gear", "carb"), data=mtcars)
x4 <- orderBy(c("-gear", "carb"), data=mtcars)
x5 <- mtcars |> order_by(c("gear", "carb"))
x6 <- mtcars |> order_by(~ -gear + carb)

splitBy() and split_by()

Suppose we want to split the \code{airquality} data into a list of dataframes, e.g.\ one dataframe for each month. This can be achieved by:

x <- splitBy(~ Month, data=airquality)
x <- splitBy(~ vs, data=mtcars)
lapply(x, head, 4)
#> $`0`
#>                    mpg cyl disp  hp drat   wt qsec vs am gear carb
#> Mazda RX4         21.0   6  160 110 3.90 2.62 16.5  0  1    4    4
#> Mazda RX4 Wag     21.0   6  160 110 3.90 2.88 17.0  0  1    4    4
#> Hornet Sportabout 18.7   8  360 175 3.15 3.44 17.0  0  0    3    2
#> Duster 360        14.3   8  360 245 3.21 3.57 15.8  0  0    3    4
#> 
#> $`1`
#>                 mpg cyl disp  hp drat   wt qsec vs am gear carb
#> Datsun 710     22.8   4  108  93 3.85 2.32 18.6  1  1    4    1
#> Hornet 4 Drive 21.4   6  258 110 3.08 3.21 19.4  1  0    3    1
#> Valiant        18.1   6  225 105 2.76 3.46 20.2  1  0    3    1
#> Merc 240D      24.4   4  147  62 3.69 3.19 20.0  1  0    4    2
attributes(x)
#> $names
#> [1] "0" "1"
#> 
#> $groupid
#>   vs
#> 1  0
#> 2  1
#> 
#> $idxvec
#> $idxvec$`0`
#>  [1]  1  2  5  7 12 13 14 15 16 17 22 23 24 25 27 29 30 31
#> 
#> $idxvec$`1`
#>  [1]  3  4  6  8  9 10 11 18 19 20 21 26 28 32
#> 
#> 
#> $grps
#>  [1] "0" "0" "1" "1" "0" "1" "0" "1" "1" "1" "1" "0" "0" "0" "0" "0" "0" "1" "1"
#> [20] "1" "1" "0" "0" "0" "0" "1" "0" "1" "0" "0" "0" "1"
#> 
#> $class
#> [1] "splitByData" "list"

Alternative forms are:

splitBy("vs", data=mtcars)
#>   listentry vs
#> 1         0  0
#> 2         1  1
mtcars |> split_by(~ vs)
#>   listentry vs
#> 1         0  0
#> 2         1  1

subsetBy() and subset_by()

Suppose we want to select those rows within each month for which the the wind speed is larger than the mean wind speed (within the month). This is achieved by:

x <- subsetBy(~am, subset=mpg > mean(mpg), data=mtcars)
head(x)
#>                   mpg cyl  disp  hp drat   wt qsec vs am gear carb
#> 1.Fiat 128       32.4   4  78.7  66 4.08 2.20 19.5  1  1    4    1
#> 1.Honda Civic    30.4   4  75.7  52 4.93 1.61 18.5  1  1    4    2
#> 1.Toyota Corolla 33.9   4  71.1  65 4.22 1.83 19.9  1  1    4    1
#> 1.Fiat X1-9      27.3   4  79.0  66 4.08 1.94 18.9  1  1    4    1
#> 1.Porsche 914-2  26.0   4 120.3  91 4.43 2.14 16.7  0  1    5    2
#> 1.Lotus Europa   30.4   4  95.1 113 3.77 1.51 16.9  1  1    5    2

Note that the statement \code{Wind > mean(Wind)} is evaluated within each month.

Alternative forms are

x <- subsetBy("am", subset=mpg > mean(mpg), data=mtcars)
x <- mtcars  |> subset_by("vs", subset=mpg > mean(mpg))
x <- mtcars  |> subset_by(~vs, subset=mpg > mean(mpg))

transformBy() and transform_by()

The \code{transformBy} function is analogous to the \code{transform} function except that it works within groups. For example:

head(x)
#>                      mpg cyl disp  hp drat   wt qsec vs am gear carb
#> 0.Mazda RX4         21.0   6  160 110 3.90 2.62 16.5  0  1    4    4
#> 0.Mazda RX4 Wag     21.0   6  160 110 3.90 2.88 17.0  0  1    4    4
#> 0.Hornet Sportabout 18.7   8  360 175 3.15 3.44 17.0  0  0    3    2
#> 0.Merc 450SL        17.3   8  276 180 3.07 3.73 17.6  0  0    3    3
#> 0.Pontiac Firebird  19.2   8  400 175 3.08 3.85 17.1  0  0    3    2
#> 0.Porsche 914-2     26.0   4  120  91 4.43 2.14 16.7  0  1    5    2
x <- transformBy(~vs, data=mtcars, 
                 min.mpg=min(mpg), max.mpg=max(mpg))
head(x)
#>    mpg cyl disp  hp drat   wt qsec vs am gear carb min.mpg max.mpg
#> 1 21.0   6  160 110 3.90 2.62 16.5  0  1    4    4    10.4      26
#> 2 21.0   6  160 110 3.90 2.88 17.0  0  1    4    4    10.4      26
#> 3 18.7   8  360 175 3.15 3.44 17.0  0  0    3    2    10.4      26
#> 4 14.3   8  360 245 3.21 3.57 15.8  0  0    3    4    10.4      26
#> 5 16.4   8  276 180 3.07 4.07 17.4  0  0    3    3    10.4      26
#> 6 17.3   8  276 180 3.07 3.73 17.6  0  0    3    3    10.4      26

Alternative forms:

x <- transformBy("vs", data=mtcars, 
                 min.mpg=min(mpg), max.mpg=max(mpg))
x <- mtcars |> transform_by("vs",
                             min.mpg=min(mpg), max.mpg=max(mpg))

lapplyBy() and lapply_by()

This \code{lapplyBy} function is a wrapper for first splitting data into a list according to the formula (using splitBy) and then applying a function to each element of the list (using lapply).

lapplyBy(~vs, data=mtcars,
         FUN=function(d) lm(mpg~cyl, data=d)  |> summary()  |> coef())
#> $`0`
#>             Estimate Std. Error t value Pr(>|t|)
#> (Intercept)    36.93       3.69   10.01 2.73e-08
#> cyl            -2.73       0.49   -5.56 4.27e-05
#> 
#> $`1`
#>             Estimate Std. Error t value Pr(>|t|)
#> (Intercept)     41.9       5.78    7.26  0.00001
#> cyl             -3.8       1.24   -3.07  0.00978

Miscellaneous utilities

firstobs() and lastobs()

To obtain the indices of the first/last occurences of an item in a vector do:

x <- c(1, 1, 1, 2, 2, 2, 1, 1, 1, 3)
firstobs(x)
#> [1]  1  4 10
lastobs(x)
#> [1]  6  9 10

The same can be done on variables in a data frame, e.g.

firstobs(~vs, data=mtcars)
#> [1] 1 3
lastobs(~vs, data=mtcars)
#> [1] 31 32

\subsection{The \code{which.maxn()} and \code{which.minn()} functions} \label{sec:whichmaxn}

The location of the \(n\) largest / smallest entries in a numeric vector can be obtained with

x <- c(1:4, 0:5, 11, NA, NA)
which.maxn(x, 3)
#> [1] 11 10  4
which.minn(x, 5)
#> [1] 5 1 6 2 7

Subsequences - subSeq()

Find (sub) sequences in a vector:

x <- c(1, 1, 2, 2, 2, 1, 1, 3, 3, 3, 3, 1, 1, 1)
subSeq(x)
#>   first last slength midpoint value
#> 1     1    2       2        2     1
#> 2     3    5       3        4     2
#> 3     6    7       2        7     1
#> 4     8   11       4       10     3
#> 5    12   14       3       13     1
subSeq(x, item=1)
#>   first last slength midpoint value
#> 1     1    2       2        2     1
#> 2     6    7       2        7     1
#> 3    12   14       3       13     1
subSeq(letters[x])
#>   first last slength midpoint value
#> 1     1    2       2        2     a
#> 2     3    5       3        4     b
#> 3     6    7       2        7     a
#> 4     8   11       4       10     c
#> 5    12   14       3       13     a
subSeq(letters[x], item="a")
#>   first last slength midpoint value
#> 1     1    2       2        2     a
#> 2     6    7       2        7     a
#> 3    12   14       3       13     a

Recoding values of a vector - recodeVar()

x <- c("dec", "jan", "feb", "mar", "apr", "may")
src1 <- list(c("dec", "jan", "feb"), c("mar", "apr", "may"))
tgt1 <- list("winter", "spring")
recodeVar(x, src=src1, tgt=tgt1)
#> [1] "winter" "winter" "winter" "spring" "spring" "spring"

Renaming columns of a dataframe or matrix - renameCol()

head(renameCol(mtcars, c("vs", "mpg"), c("vs_", "mpg_")))
#>                   mpg_ cyl disp  hp drat   wt qsec vs_ am gear carb
#> Mazda RX4         21.0   6  160 110 3.90 2.62 16.5   0  1    4    4
#> Mazda RX4 Wag     21.0   6  160 110 3.90 2.88 17.0   0  1    4    4
#> Datsun 710        22.8   4  108  93 3.85 2.32 18.6   1  1    4    1
#> Hornet 4 Drive    21.4   6  258 110 3.08 3.21 19.4   1  0    3    1
#> Hornet Sportabout 18.7   8  360 175 3.15 3.44 17.0   0  0    3    2
#> Valiant           18.1   6  225 105 2.76 3.46 20.2   1  0    3    1

Time since an event - timeSinceEvent()

Consider the vector

yvar <- c(0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0)

Imagine that “1” indicates an event of some kind which takes place at a certain time point. By default time points are assumed equidistant but for illustration we define time time variable

tvar <- seq_along(yvar) + c(0.1, 0.2)

Now we find time since event as

tse <- timeSinceEvent(yvar, tvar)
tse
#>    yvar tvar abs.tse sign.tse ewin run tae  tbe
#> 1     0  1.1     3.1     -3.1    1  NA  NA -3.1
#> 2     0  2.2     2.0     -2.0    1  NA  NA -2.0
#> 3     0  3.1     1.1     -1.1    1  NA  NA -1.1
#> 4     1  4.2     0.0      0.0    1   1 0.0  0.0
#> 5     0  5.1     0.9      0.9    1   1 0.9 -5.1
#> 6     0  6.2     2.0      2.0    1   1 2.0 -4.0
#> 7     0  7.1     2.9      2.9    1   1 2.9 -3.1
#> 8     0  8.2     2.0     -2.0    2   1 4.0 -2.0
#> 9     0  9.1     1.1     -1.1    2   1 4.9 -1.1
#> 10    1 10.2     0.0      0.0    2   2 0.0  0.0
#> 11    0 11.1     0.9      0.9    2   2 0.9 -3.1
#> 12    0 12.2     2.0      2.0    2   2 2.0 -2.0
#> 13    0 13.1     1.1     -1.1    3   2 2.9 -1.1
#> 14    1 14.2     0.0      0.0    3   3 0.0  0.0
#> 15    1 15.1     0.0      0.0    4   4 0.0  0.0
#> 16    0 16.2     1.1      1.1    4   4 1.1   NA
#> 17    0 17.1     2.0      2.0    4   4 2.0   NA
#> 18    0 18.2     3.1      3.1    4   4 3.1   NA
#> 19    0 19.1     4.0      4.0    4   4 4.0   NA
#> 20    0 20.2     5.1      5.1    4   4 5.1   NA

The output reads as follows: \begin{itemize} \item \verb"abs.tse": Absolute time since (nearest) event. \item \verb"sign.tse": Signed time since (nearest) event. \item \verb"ewin": Event window: Gives a symmetric window around each event. \item \verb"run": The value of \verb"run" is set to hv(1)hvwhen the first event occurs and is increased byhv(1)hv at each subsequent event. \item \verb"tae": Time after event. \item \verb"tbe": Time before event. \end{itemize}

plot(sign.tse ~ tvar, data=tse, type="b")
grid()
rug(tse$tvar[tse$yvar == 1], col="blue",lwd=4)
points(scale(tse$run), col=tse$run, lwd=2)
lines(abs.tse + .2 ~ tvar, data=tse, type="b",col=3)

plot of chunk unnamed-chunk-27

plot(tae ~ tvar, data=tse, ylim=c(-6,6), type="b")
grid()
lines(tbe ~ tvar, data=tse, type="b", col="red")
rug(tse$tvar[tse$yvar==1], col="blue", lwd=4)
lines(run ~ tvar, data=tse, col="cyan", lwd=2)

plot of chunk unnamed-chunk-28

plot(ewin ~ tvar, data=tse, ylim=c(1, 4))
rug(tse$tvar[tse$yvar==1], col="blue", lwd=4)
grid()
lines(run ~ tvar, data=tse, col="red")

plot of chunk unnamed-chunk-29

We may now find times for which time since an event is at most 1 as

tse$tvar[tse$abs <= 1]
#> [1]  4.2  5.1 10.2 11.1 14.2 15.1

Example: Using subSeq() and timeSinceEvent()

Consider the \verb|lynx| data:

lynx <- as.numeric(lynx)
tvar <- 1821:1934
plot(tvar, lynx, type="l")

plot of chunk unnamed-chunk-31

Suppose we want to estimate the cycle lengths. One way of doing this is as follows:

yyy <- lynx > mean(lynx)
head(yyy)
#> [1] FALSE FALSE FALSE FALSE FALSE  TRUE
sss <- subSeq(yyy, TRUE)
sss
#>    first last slength midpoint value
#> 1      6   10       5        8  TRUE
#> 2     16   19       4       18  TRUE
#> 3     27   28       2       28  TRUE
#> 4     35   38       4       37  TRUE
#> 5     44   47       4       46  TRUE
#> 6     53   55       3       54  TRUE
#> 7     63   66       4       65  TRUE
#> 8     75   76       2       76  TRUE
#> 9     83   87       5       85  TRUE
#> 10    92   96       5       94  TRUE
#> 11   104  106       3      105  TRUE
#> 12   112  114       3      113  TRUE
plot(tvar, lynx, type="l")
rug(tvar[sss$midpoint], col="blue", lwd=4)

plot of chunk unnamed-chunk-33

Create the “event vector”

yvar <- rep(0, length(lynx))
yvar[sss$midpoint] <- 1
str(yvar)
#>  num [1:114] 0 0 0 0 0 0 0 1 0 0 ...
tse <- timeSinceEvent(yvar,tvar)
head(tse, 20)
#>    yvar tvar abs.tse sign.tse ewin run tae tbe
#> 1     0 1821       7       -7    1  NA  NA  -7
#> 2     0 1822       6       -6    1  NA  NA  -6
#> 3     0 1823       5       -5    1  NA  NA  -5
#> 4     0 1824       4       -4    1  NA  NA  -4
#> 5     0 1825       3       -3    1  NA  NA  -3
#> 6     0 1826       2       -2    1  NA  NA  -2
#> 7     0 1827       1       -1    1  NA  NA  -1
#> 8     1 1828       0        0    1   1   0   0
#> 9     0 1829       1        1    1   1   1  -9
#> 10    0 1830       2        2    1   1   2  -8
#> 11    0 1831       3        3    1   1   3  -7
#> 12    0 1832       4        4    1   1   4  -6
#> 13    0 1833       5        5    1   1   5  -5
#> 14    0 1834       4       -4    2   1   6  -4
#> 15    0 1835       3       -3    2   1   7  -3
#> 16    0 1836       2       -2    2   1   8  -2
#> 17    0 1837       1       -1    2   1   9  -1
#> 18    1 1838       0        0    2   2   0   0
#> 19    0 1839       1        1    2   2   1  -9
#> 20    0 1840       2        2    2   2   2  -8

We get two different (not that different) estimates of period lengths:

len1 <- tapply(tse$ewin, tse$ewin, length)
len2 <- tapply(tse$run, tse$run, length)
c(median(len1), median(len2), mean(len1), mean(len2))
#> [1] 9.50 9.00 9.50 8.92

We can overlay the cycles as:

tse$lynx <- lynx
tse2 <- na.omit(tse)
plot(lynx ~ tae, data=tse2)

plot of chunk unnamed-chunk-37

plot(tvar, lynx, type="l", lty=2)
mm <- lm(lynx ~ tae + I(tae^2) + I(tae^3), data=tse2)
lines(fitted(mm) ~ tvar, data=tse2, col="red")

plot of chunk unnamed-chunk-38

\section{Acknowledgements} \label{discussion}

Credit is due to Dennis Chabot, Gabor Grothendieck, Paul Murrell and Jim Robison-Cox for reporting various bugs and making various suggestions to the functionality in the \doby{} package.