# Cross-validation of regression models

## Cross-validation

Cross-validation (CV) is an essentially simple and intuitively reasonable approach to estimating the predictive accuracy of regression models. CV is developed in many standard sources on regression modeling and “machine learning”—we particularly recommend James, Witten, Hastie, & Tibshirani (2021, secs. 5.1, 5.3)—and so we will describe the method only briefly here before taking up computational issues and some examples. See Arlot & Celisse (2010) for a wide-ranging, if technical, survey of cross-validation and related methods that emphasizes the statistical properties of CV.

Validating research by replication on independently collected data is a common scientific norm. Emulating this process in a single study by data-division is less common: The data are randomly divided into two, possibly equal-size, parts; the first part is used to develop and fit a statistical model; and then the second part is used to assess the adequacy of the model fit to the first part of the data. Data-division, however, suffers from two problems: (1) Dividing the data decreases the sample size and thus increases sampling error; and (2), even more disconcertingly, particularly in smaller samples, the results can vary substantially based on the random division of the data: See Harrell (2015, sec. 5.3) for this and other remarks about data-division and cross-validation.

Cross-validation speaks to both of these issues. In CV, the data are randomly divided as equally as possible into several, say $$k$$, parts, called “folds.” The statistical model is fit $$k$$ times, leaving each fold out in turn. Each fitted model is then used to predict the response variable for the cases in the omitted fold. A CV criterion or “cost” measure, such as the mean-squared error (“MSE”) of prediction, is then computed using these predicted values. In the extreme case, $$k = n$$, the number of cases in the data, individual cases are omitted in turn and the model is refit $$n$$ times, a procedure termed “leave-one-out (LOO) cross-validation.”
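In outline, $$k$$-fold CV for a least-squares regression can be sketched in a few lines of Python (an illustrative sketch with a hypothetical kfold_mse() helper, not the implementation used by the cv package):

```python
import numpy as np

def kfold_mse(X, y, k=10, seed=0):
    """Estimate prediction MSE of a least-squares fit by k-fold CV."""
    n = len(y)
    rng = np.random.default_rng(seed)
    folds = rng.permutation(n) % k  # random folds of near-equal size
    pred = np.empty(n)
    for j in range(k):
        test = folds == j
        # fit to the k - 1 retained folds, then predict the omitted fold
        beta, *_ = np.linalg.lstsq(X[~test], y[~test], rcond=None)
        pred[test] = X[test] @ beta
    return np.mean((y - pred) ** 2)

# toy data: straight-line relationship plus noise with variance 0.25
rng = np.random.default_rng(1)
x = rng.normal(size=200)
X = np.column_stack([np.ones_like(x), x])
y = 1 + 2 * x + rng.normal(scale=0.5, size=200)
print(round(kfold_mse(X, y), 2))  # close to the noise variance, 0.25
```

Because every case is predicted exactly once, from a model that never saw it, the averaged squared errors estimate out-of-sample prediction error.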

Because the $$n$$ models are each fit to $$n - 1$$ cases, LOO CV produces a nearly unbiased estimate of prediction error. The $$n$$ regression models are highly statistically dependent, however, based as they are on nearly the same data, and so the resulting estimate of prediction error has relatively large variance. In contrast, estimates of prediction error for $$k$$-fold CV with $$k = 5$$ or $$10$$ (commonly employed choices) are somewhat biased but have smaller variance. It is also possible to correct $$k$$-fold CV for bias (see below).

## Examples

### Polynomial regression for the Auto data

The data for this example are drawn from the ISLR2 package for R, associated with James et al. (2021). The presentation here is close (though not identical) to that in the original source (James et al., 2021, secs. 5.1, 5.3), and it demonstrates the use of the cv() function in the cv package.1

The Auto dataset contains information about 392 cars:

data("Auto", package="ISLR2")
#>   mpg cylinders displacement horsepower weight acceleration year origin
#> 1  18         8          307        130   3504         12.0   70      1
#> 2  15         8          350        165   3693         11.5   70      1
#> 3  18         8          318        150   3436         11.0   70      1
#> 4  16         8          304        150   3433         12.0   70      1
#> 5  17         8          302        140   3449         10.5   70      1
#> 6  15         8          429        198   4341         10.0   70      1
#>                        name
#> 1 chevrolet chevelle malibu
#> 2         buick skylark 320
#> 3        plymouth satellite
#> 4             amc rebel sst
#> 5               ford torino
#> 6          ford galaxie 500
dim(Auto)
#> [1] 392   9

Apart from origin (which we don’t use here), these variables are largely self-explanatory, except possibly for their units of measurement; for details, see help("Auto", package="ISLR2").

We’ll focus here on the relationship of mpg (miles per gallon) to horsepower, as displayed in the following scatterplot:

plot(mpg ~ horsepower, data=Auto)

The relationship between the two variables is monotone, decreasing, and nonlinear. Following James et al. (2021), we’ll consider approximating the relationship by a polynomial regression, with the degree of the polynomial $$p$$ ranging from 1 (a linear regression) to 10.2 Polynomial fits for $$p = 1$$ to $$5$$ are shown in the following figure:

plot(mpg ~ horsepower, data=Auto)
horsepower <- with(Auto,
                   seq(min(horsepower), max(horsepower),
                       length=1000))
for (p in 1:5){
  m <- lm(mpg ~ poly(horsepower, p), data=Auto)
  mpg <- predict(m, newdata=data.frame(horsepower=horsepower))
  lines(horsepower, mpg, col=p + 1, lty=p, lwd=2)
}
legend("topright", legend=1:5, col=2:6, lty=1:5, lwd=2,
       title="Degree", inset=0.02)

The linear fit is clearly inappropriate; the fits for $$p = 2$$ (quadratic) through $$4$$ are very similar; and the fit for $$p = 5$$ may over-fit the data by chasing one or two relatively high mpg values at the right (but see the CV results reported below).

The following graph shows two measures of estimated (squared) error as a function of polynomial-regression degree: The mean-squared error (“MSE”), defined as $$\mathsf{MSE} = \frac{1}{n}\sum_{i=1}^n (y_i - \widehat{y}_i)^2$$, and the usual residual variance, defined as $$\widehat{\sigma}^2 = \frac{1}{n - p - 1} \sum_{i=1}^n (y_i - \widehat{y}_i)^2$$. The former necessarily declines with $$p$$ (or, more strictly, can’t increase with $$p$$), while the latter gets slightly larger for the largest values of $$p$$, with the “best” value, by a small margin, for $$p = 7$$.
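Since both criteria share the same residual sum of squares, they differ only in their divisors: $$\widehat{\sigma}^2 = \frac{n}{n - p - 1}\,\mathsf{MSE}$$. A quick numeric check of that identity (an illustrative Python sketch, separate from the R code in this vignette):

```python
import numpy as np

# residuals from a hypothetical fit with p = 2 regression coefficients
n, p = 392, 2
rng = np.random.default_rng(0)
resid = rng.normal(scale=4.37, size=n)

mse = np.sum(resid**2) / n               # divides by n
sigma2 = np.sum(resid**2) / (n - p - 1)  # divides by n - p - 1

# the two criteria differ only by the factor n / (n - p - 1)
print(np.isclose(sigma2, mse * n / (n - p - 1)))  # True
```

The correction factor exceeds 1, which is why $$\widehat{\sigma}^2$$ can rise with $$p$$ even as the MSE keeps falling.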

library("cv") # for mse() and other functions

var <- mse <- numeric(10)
for (p in 1:10){
  m <- lm(mpg ~ poly(horsepower, p), data=Auto)
  mse[p] <- mse(Auto$mpg, fitted(m))
  var[p] <- summary(m)$sigma^2
}

plot(c(1, 10), range(mse, var), type="n",
     xlab="Degree of polynomial, p",
     ylab="Estimated Squared Error")
lines(1:10, mse, lwd=2, lty=1, col=2, pch=16, type="b")
lines(1:10, var, lwd=2, lty=2, col=3, pch=17, type="b")
legend("topright", inset=0.02,
       legend=c(expression(hat(sigma)^2), "MSE"),
       lwd=2, lty=2:1, col=3:2, pch=17:16)

The code for this graph uses the mse() function from the cv package to compute the MSE for each fit.

#### Using cv()

The generic cv() function has an "lm" method, which by default performs $$k = 10$$-fold CV:

m.auto <- lm(mpg ~ poly(horsepower, 2), data=Auto)
summary(m.auto)
#>
#> Call:
#> lm(formula = mpg ~ poly(horsepower, 2), data = Auto)
#>
#> Residuals:
#>     Min      1Q  Median      3Q     Max
#> -14.714  -2.594  -0.086   2.287  15.896
#>
#> Coefficients:
#>                      Estimate Std. Error t value Pr(>|t|)
#> (Intercept)            23.446      0.221   106.1   <2e-16 ***
#> poly(horsepower, 2)1 -120.138      4.374   -27.5   <2e-16 ***
#> poly(horsepower, 2)2   44.090      4.374    10.1   <2e-16 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 4.37 on 389 degrees of freedom
#> Multiple R-squared:  0.688,  Adjusted R-squared:  0.686
#> F-statistic:  428 on 2 and 389 DF,  p-value: <2e-16

cv(m.auto)
#> R RNG seed set to 398475
#> 10-Fold Cross Validation
#> method: Woodbury
#> criterion: mse
#> cross-validation criterion = 19.238
#> bias-adjusted cross-validation criterion = 19.224
#> full-sample criterion = 18.985

The "lm" method by default uses mse() as the CV criterion and the Woodbury matrix identity to update the regression with each fold deleted, without literally refitting the model. Computational details are discussed in the final section of this vignette. The function reports the CV estimate of MSE, a bias-adjusted estimate of the MSE (the bias adjustment is explained in the final section), and the MSE computed for the original, full-sample regression. Because the division of the data into 10 folds is random, cv() explicitly (randomly) generates and saves a seed for R’s pseudo-random number generator, to make the results replicable. The user can also specify the seed directly via the seed argument to cv().
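The idea behind the Woodbury shortcut can be illustrated outside of R: deleting a fold’s rows $$U$$ from the model matrix downdates the Gram matrix to $$X^\top X - U^\top U$$, whose inverse the Woodbury identity expresses in terms of the full-sample $$(X^\top X)^{-1}$$, avoiding a refit. A Python sketch of this linear-algebra identity (our own illustration, not the cv package's code):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
y = X @ np.array([1.0, -2.0, 0.5, 3.0]) + rng.normal(size=100)

XtX_inv = np.linalg.inv(X.T @ X)   # computed once, on the full sample
Xty = X.T @ y

fold = np.arange(10)               # indices of the deleted fold
U = X[fold]                        # the fold's rows (m x p)

# Woodbury downdate of (X'X)^{-1} after deleting the fold's rows:
# (X'X - U'U)^{-1} = A + A U' (I - U A U')^{-1} U A,  with A = (X'X)^{-1}
M = np.linalg.inv(np.eye(len(fold)) - U @ XtX_inv @ U.T)
XtX_inv_new = XtX_inv + XtX_inv @ U.T @ M @ U @ XtX_inv
beta_woodbury = XtX_inv_new @ (Xty - U.T @ y[fold])

# direct refit without the fold, for comparison
keep = np.setdiff1d(np.arange(100), fold)
beta_refit, *_ = np.linalg.lstsq(X[keep], y[keep], rcond=None)
print(np.allclose(beta_woodbury, beta_refit))  # True
```

The only matrix inverted per fold is $$m \times m$$ (the fold size), rather than refitting a regression on $$n - m$$ cases, which is where the speed advantage comes from.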

To perform LOO CV, we can set the k argument to cv() to the number of cases in the data, here k=392, or, more conveniently, to k="loo" or k="n":

cv(m.auto, k="loo")
#> n-Fold Cross Validation
#> method: hatvalues
#> criterion: mse
#> cross-validation criterion = 19.248

For LOO CV of a linear model, cv() by default uses the hatvalues from the model fit to the full data for the LOO updates, and reports only the CV estimate of MSE. Alternative methods are to use the Woodbury matrix identity or the “naive” approach of literally refitting the model with each case omitted. All three methods produce exact results for a linear model (within the precision of floating-point computations):
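The hatvalues shortcut rests on a standard identity for linear models: the deleted residual is $$e_{(i)} = e_i/(1 - h_i)$$, where $$e_i$$ is the ordinary residual and $$h_i$$ the hatvalue, so LOO predictions follow from a single full-sample fit. A Python sketch checking the identity against brute-force refitting (illustrative only, not the cv package's code):

```python
import numpy as np

rng = np.random.default_rng(2)
X = np.column_stack([np.ones(50), rng.normal(size=50)])
y = 2 + 3 * X[:, 1] + rng.normal(size=50)

# full-sample fit: hat matrix, hatvalues, and ordinary residuals
H = X @ np.linalg.inv(X.T @ X) @ X.T
h = np.diag(H)
resid = y - H @ y

# LOO shortcut: deleted residual e_(i) = e_i / (1 - h_i)
loo_mse_hat = np.mean((resid / (1 - h)) ** 2)

# brute force: refit with each case omitted in turn
n = len(y)
pred = np.empty(n)
for i in range(n):
    keep = np.arange(n) != i
    beta, *_ = np.linalg.lstsq(X[keep], y[keep], rcond=None)
    pred[i] = X[i] @ beta
loo_mse_naive = np.mean((y - pred) ** 2)

print(np.allclose(loo_mse_hat, loo_mse_naive))  # True
```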

cv(m.auto, k="loo", method="naive")
#> n-Fold Cross Validation
#> criterion: mse
#> cross-validation criterion = 19.248
#> bias-adjusted cross-validation criterion = 19.248
#> full-sample criterion = 18.985

cv(m.auto, k="loo", method="Woodbury")
#> n-Fold Cross Validation
#> method: Woodbury
#> criterion: mse
#> cross-validation criterion = 19.248
#> bias-adjusted cross-validation criterion = 19.248
#> full-sample criterion = 18.985

The "naive" and "Woodbury" methods also return the bias-adjusted estimate of MSE and the full-sample MSE, but bias isn’t an issue for LOO CV.

This is a small regression problem and all three computational approaches are essentially instantaneous, but it is still of interest to investigate their relative speed. In this comparison, we include the cv.glm() function from the boot package, which takes the naive approach, and for which we have to fit the linear model as an equivalent Gaussian GLM. We use the microbenchmark() function from the package of the same name for the timings (Mersmann, 2023):

m.auto.glm <- glm(mpg ~ poly(horsepower, 2), data=Auto)
boot::cv.glm(Auto, m.auto.glm)$delta
#> [1] 19.248 19.248

microbenchmark::microbenchmark(
  hatvalues = cv(m.auto, k="loo"),
  Woodbury = cv(m.auto, k="loo", method="Woodbury"),
  naive = cv(m.auto, k="loo", method="naive"),
  cv.glm = boot::cv.glm(Auto, m.auto.glm),
  times=10
)
#> Warning in microbenchmark::microbenchmark(hatvalues = cv(m.auto, k = "loo"), :
#> less accurate nanosecond times to avoid potential integer overflows
#> Unit: microseconds
#>       expr       min        lq     mean    median        uq      max neval cld
#>  hatvalues    986.13    1010.6   1131.7    1176.3    1197.7   1205.1    10 a
#>   Woodbury   9970.58   10173.4  10292.8   10351.1   10419.4  10462.7    10 a
#>      naive 209703.40  210921.2 233033.5  219749.1  270959.1 277474.9    10 b
#>     cv.glm 375319.86  382713.4 395619.9  384162.0  387250.0 448388.6    10 c

On our computer, using the hatvalues is about an order of magnitude faster than employing Woodbury matrix updates, and more than two orders of magnitude faster than refitting the model.3

#### Comparing competing models

The cv() function also has a method that can be applied to a list of regression models for the same data, composed using the models() function. For $$k$$-fold CV, the same folds are used for the competing models, which reduces random error in their comparison. The same result can be obtained by specifying a common seed for R’s random-number generator while applying cv() separately to each model, but employing a list of models is more convenient for both $$k$$-fold and LOO CV (where there is no random component to the composition of the $$n$$ folds).
We illustrate with the polynomial regression models of varying degree for the Auto data (discussed previously), beginning by fitting and saving the 10 models:

for (p in 1:10){
  assign(paste0("m.", p),
         lm(mpg ~ poly(horsepower, p), data=Auto))
}
objects(pattern="m\\.[0-9]")
#>  [1] "m.1"  "m.10" "m.2"  "m.3"  "m.4"  "m.5"  "m.6"  "m.7"  "m.8"  "m.9"
summary(m.2) # for example, the quadratic fit
#>
#> Call:
#> lm(formula = mpg ~ poly(horsepower, p), data = Auto)
#>
#> Residuals:
#>     Min      1Q  Median      3Q     Max
#> -14.714  -2.594  -0.086   2.287  15.896
#>
#> Coefficients:
#>                      Estimate Std. Error t value Pr(>|t|)
#> (Intercept)            23.446      0.221   106.1   <2e-16 ***
#> poly(horsepower, p)1 -120.138      4.374   -27.5   <2e-16 ***
#> poly(horsepower, p)2   44.090      4.374    10.1   <2e-16 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 4.37 on 389 degrees of freedom
#> Multiple R-squared:  0.688,  Adjusted R-squared:  0.686
#> F-statistic:  428 on 2 and 389 DF,  p-value: <2e-16

We then apply cv() to the list of 10 models (the data argument is required):

# 10-fold CV
cv.auto.10 <- cv(models(m.1, m.2, m.3, m.4, m.5,
                        m.6, m.7, m.8, m.9, m.10),
                 data=Auto, seed=2120)
cv.auto.10[1:2] # for the linear and quadratic models
#>
#> Model model.1:
#> 10-Fold Cross Validation
#> method: Woodbury
#> cross-validation criterion = 24.246
#> bias-adjusted cross-validation criterion = 24.23
#> full-sample criterion = 23.944
#>
#> Model model.2:
#> 10-Fold Cross Validation
#> method: Woodbury
#> cross-validation criterion = 19.346
#> bias-adjusted cross-validation criterion = 19.327
#> full-sample criterion = 18.985

# LOO CV
cv.auto.loo <- cv(models(m.1, m.2, m.3, m.4, m.5,
                         m.6, m.7, m.8, m.9, m.10),
                  data=Auto, k="loo")
cv.auto.loo[1:2] # linear and quadratic models
#>
#> Model model.1:
#> n-Fold Cross Validation
#> method: hatvalues
#> cross-validation criterion = 24.232
#>
#> Model model.2:
#> n-Fold Cross Validation
#> method: hatvalues
#> cross-validation criterion = 19.248

Because we didn’t supply names for the models in the calls to the models() function, the names model.1, model.2, etc., are generated by the function. Finally, we extract and graph the adjusted MSEs for $$10$$-fold CV and the MSEs for LOO CV:

cv.mse.10 <- sapply(cv.auto.10, function(x) x[["adj CV crit"]])
cv.mse.loo <- sapply(cv.auto.loo, function(x) x[["CV crit"]])
plot(c(1, 10), range(cv.mse.10, cv.mse.loo), type="n",
     xlab="Degree of polynomial, p",
     ylab="Cross-Validated MSE")
lines(1:10, cv.mse.10, lwd=2, lty=1, col=2, pch=16, type="b")
lines(1:10, cv.mse.loo, lwd=2, lty=2, col=3, pch=17, type="b")
legend("topright", inset=0.02,
       legend=c("10-Fold CV", "LOO CV"),
       lwd=2, lty=2:1, col=3:2, pch=17:16)

Alternatively, we can use the plot() method for "cvModList" objects to compare the models, though with separate graphs for 10-fold and LOO CV:

plot(cv.auto.10, main="Polynomial Regressions, 10-Fold CV",
     axis.args=list(labels=1:10), xlab="Degree of Polynomial, p")
plot(cv.auto.loo, main="Polynomial Regressions, LOO CV",
     axis.args=list(labels=1:10), xlab="Degree of Polynomial, p")

In this example, 10-fold and LOO CV produce generally similar results, and also results similar to those produced by the estimated error variance $$\widehat{\sigma}^2$$ for each model, reported above (except for the highest-degree polynomials, where the CV results more clearly suggest over-fitting).

### Logistic regression for the Mroz data

The Mroz data set from the carData package (associated with Fox & Weisberg, 2019) has been used by several authors to illustrate binary logistic regression; see, in particular, Fox & Weisberg (2019). The data were originally drawn from the U.S. Panel Study of Income Dynamics and pertain to married women.
Here are a few cases in the data set:

data("Mroz", package="carData")
head(Mroz, 3)
#>   lfp k5 k618 age wc hc    lwg   inc
#> 1 yes  1    0  32 no no 1.2102 10.91
#> 2 yes  0    2  30 no no 0.3285 19.50
#> 3 yes  1    3  35 no no 1.5141 12.04
tail(Mroz, 3)
#>     lfp k5 k618 age wc hc     lwg    inc
#> 751  no  0    0  43 no no 0.88814  9.952
#> 752  no  0    0  60 no no 1.22497 24.984
#> 753  no  0    3  39 no no 0.85321 28.363

The response variable in the logistic regression is lfp, labor-force participation, a factor coded "yes" or "no". The remaining variables are predictors:

- k5, number of children 5 years old or younger in the woman’s household;
- k618, number of children between 6 and 18 years old;
- age, in years;
- wc, wife’s college attendance, "yes" or "no";
- hc, husband’s college attendance;
- lwg, the woman’s log wage rate if she is employed, or her imputed wage rate if she is not (a variable that Fox & Weisberg, 2019, show is problematically defined); and
- inc, family income, in $1000s, exclusive of the wife’s income.

We use the glm() function to fit a binary logistic regression to the Mroz data:

m.mroz <- glm(lfp ~ ., data=Mroz, family=binomial)
summary(m.mroz)
#>
#> Call:
#> glm(formula = lfp ~ ., family = binomial, data = Mroz)
#>
#> Coefficients:
#>             Estimate Std. Error z value Pr(>|z|)
#> (Intercept)  3.18214    0.64438    4.94  7.9e-07 ***
#> k5          -1.46291    0.19700   -7.43  1.1e-13 ***
#> k618        -0.06457    0.06800   -0.95  0.34234
#> age         -0.06287    0.01278   -4.92  8.7e-07 ***
#> wcyes        0.80727    0.22998    3.51  0.00045 ***
#> hcyes        0.11173    0.20604    0.54  0.58762
#> lwg          0.60469    0.15082    4.01  6.1e-05 ***
#> inc         -0.03445    0.00821   -4.20  2.7e-05 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> (Dispersion parameter for binomial family taken to be 1)
#>
#>     Null deviance: 1029.75  on 752  degrees of freedom
#> Residual deviance:  905.27  on 745  degrees of freedom
#> AIC: 921.3
#>
#> Number of Fisher Scoring iterations: 4

BayesRule(ifelse(Mroz$lfp == "yes", 1, 0),
          fitted(m.mroz, type="response"))
#> [1] 0.30677
#> attr(,"casewise loss")
#> [1] "y != round(yhat)"

In addition to the usual summary output for a GLM, we show the result of applying the BayesRule() function from the cv package to predictions derived from the fitted model. Bayes rule predicts a “success” in a binary regression model when the fitted probability of success [i.e., $$\phi = \Pr(y = 1)$$] is $$\widehat{\phi} \ge .5$$ and a “failure” if $$\widehat{\phi} \lt .5$$.4 The first argument to BayesRule() is the binary {0, 1} response, and the second argument is the predicted probability of success. BayesRule() returns the proportion of predictions that are in error, as appropriate for a “cost” function.

The value returned by BayesRule() carries an “attribute” named "casewise loss", set to "y != round(yhat)", signifying that the Bayes rule CV criterion is computed as the mean of casewise values: 0 if the prediction for a case matches the observed value, and 1 if it does not (signifying a prediction error). The mse() function for numeric responses is also calculated as a casewise average. Some other criteria, such as the median absolute error, computed by the medAbsErr() function in the cv package, aren’t averages of casewise components. The distinction is important because, to our knowledge, the statistical theory of cross-validation (for example, in Davison & Hinkley, 1997; S. Bates, Hastie, & Tibshirani, 2023; and Arlot & Celisse, 2010) is developed for CV criteria like MSE that are means of casewise components. As a consequence, we limit computation of bias adjustments and confidence intervals (see below) to criteria that are casewise averages.

In this example, the fitted logistic regression incorrectly predicts 31% of the responses; we expect this estimate to be optimistic, given that the model is used to “predict” the data to which it was fit.
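The casewise loss underlying this criterion is simple to state: score 1 when the prediction (success if $$\widehat{\phi} \ge .5$$) disagrees with the observed response, 0 otherwise, and average. A minimal Python sketch of the criterion (a hypothetical bayes_rule() helper, not the cv package's function):

```python
import numpy as np

def bayes_rule(y, phat):
    """Proportion of misclassified cases: predict 1 when phat >= 0.5."""
    pred = (np.asarray(phat) >= 0.5).astype(int)
    return np.mean(np.asarray(y) != pred)  # mean of casewise 0/1 losses

y = np.array([1, 0, 1, 1, 0])
phat = np.array([0.9, 0.4, 0.3, 0.6, 0.7])
print(bayes_rule(y, phat))  # 0.4  (the third and fifth cases are misclassified)
```

Because the criterion is a mean of casewise losses, it qualifies for the bias adjustment and confidence intervals described below.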
The "glm" method for cv() is largely similar to the "lm" method, although the default algorithm, selected explicitly by method="exact", refits the model with each fold removed (and is thus equivalent to method="naive" for "lm" models). For generalized linear models, method="Woodbury" or (for LOO CV) method="hatvalues" provides approximate results (see the last section of the vignette for details):

cv(m.mroz, criterion=BayesRule, seed=248)
#> R RNG seed set to 248
#> 10-Fold Cross Validation
#> criterion: BayesRule
#> cross-validation criterion = 0.32404
#> bias-adjusted cross-validation criterion = 0.31952
#> 95% CI for bias-adjusted CV criterion = (0.28607, 0.35297)
#> full-sample criterion = 0.30677

cv(m.mroz, criterion=BayesRule, seed=248, method="Woodbury")
#> R RNG seed set to 248
#> 10-Fold Cross Validation
#> method: Woodbury
#> criterion: BayesRule
#> cross-validation criterion = 0.32404
#> bias-adjusted cross-validation criterion = 0.31926
#> 95% CI for bias-adjusted CV criterion = (0.28581, 0.35271)
#> full-sample criterion = 0.30677

To ensure that the two methods use the same 10 folds, we specify the seed for R’s random-number generator explicitly; here, as is common in our experience, the "exact" and "Woodbury" algorithms produce nearly identical results. The CV estimates of prediction error are slightly higher than the estimate based on all of the cases.

The printed output includes a 95% confidence interval for the bias-adjusted Bayes rule CV criterion. S. Bates et al. (2023) show that these confidence intervals are unreliable for models fit to small samples, and by default cv() computes them only when the sample size is 400 or larger and when the CV criterion employed is an average of casewise components, as is the case for Bayes rule. See the final section of the vignette for details of the computation of confidence intervals for bias-adjusted CV criteria.
Here are results of applying LOO CV to the Mroz model, using both the exact and the approximate methods:

cv(m.mroz, k="loo", criterion=BayesRule)
#> n-Fold Cross Validation
#> criterion: BayesRule
#> cross-validation criterion = 0.32005
#> bias-adjusted cross-validation criterion = 0.3183
#> 95% CI for bias-adjusted CV criterion = (0.28496, 0.35164)
#> full-sample criterion = 0.30677

cv(m.mroz, k="loo", criterion=BayesRule, method="Woodbury")
#> n-Fold Cross Validation
#> method: Woodbury
#> criterion: BayesRule
#> cross-validation criterion = 0.32005
#> bias-adjusted cross-validation criterion = 0.3183
#> 95% CI for bias-adjusted CV criterion = (0.28496, 0.35164)
#> full-sample criterion = 0.30677

cv(m.mroz, k="loo", criterion=BayesRule, method="hatvalues")
#> n-Fold Cross Validation
#> method: hatvalues
#> criterion: BayesRule
#> cross-validation criterion = 0.32005

To the number of decimal digits shown, the three methods produce identical results for this example. As for linear models, we report some timings for the various cv() methods of computation in LOO CV, as well as for the cv.glm() function from the boot package (which, recall, refits the model with each case removed, and thus is comparable to cv() with method="exact"):

microbenchmark::microbenchmark(
  hatvalues=cv(m.mroz, k="loo", criterion=BayesRule,
               method="hatvalues"),
  Woodbury=cv(m.mroz, k="loo", criterion=BayesRule,
              method="Woodbury"),
  exact=cv(m.mroz, k="loo", criterion=BayesRule),
  cv.glm=boot::cv.glm(Mroz, m.mroz, cost=BayesRule),
  times=10)
#> Unit: milliseconds
#>       expr       min        lq      mean    median        uq       max neval
#>  hatvalues    1.3145    1.3296    1.6209    1.3431    1.3785    4.0053    10
#>   Woodbury   37.0752   39.6014   40.5334   40.0595   40.5473   47.5972    10
#>      exact 1714.0892 1747.4177 1782.3877 1785.8192 1813.3066 1848.2108    10
#>     cv.glm 2003.4197 2032.1928 2066.9809 2072.7632 2104.7543 2115.3950    10
#>  cld
#>    a
#>    b
#>    c
#>    d

There is a substantial time penalty associated with exact computations.
## Cross-validating mixed-effects models

The fundamental analogy for cross-validation is to the collection of new data. That is, predicting the response in each fold from the model fit to the data in the other folds is like using the model fit to all of the data to predict the response for new cases from the values of the predictors for those new cases. As we explained, the application of this idea to independently sampled cases is straightforward: simply partition the data into random folds of equal size and leave each fold out in turn, or, in the case of LOO CV, simply omit each case in turn.

In contrast, mixed-effects models are fit to dependent data, in which cases are clustered. Examples include hierarchical data, where the clusters comprise higher-level units (e.g., students clustered in schools), and longitudinal data, where the clusters are individuals and the cases are repeated observations on the individuals over time.5

We can think of two approaches to applying cross-validation to clustered data:6

1. Treat CV as analogous to predicting the response for one or more cases in a newly observed cluster. In this instance, the folds comprise one or more whole clusters; we refit the model with all of the cases in the clusters in the current fold removed; and we then predict the response for the cases in the clusters in the current fold. These predictions are based only on fixed effects, because the random effects for the omitted clusters are presumably unknown, as they would be for data on cases in newly observed clusters.

2. Treat CV as analogous to predicting the response for a newly observed case in an existing cluster. In this instance, the folds comprise one or more individual cases, and the predictions can use both the fixed and random effects.

### Example: The High-School and Beyond data

Following their use by Raudenbush & Bryk (2002), data from the 1982 High School and Beyond (HSB) survey have become a staple of the literature on mixed-effects models.
The HSB data are used by Fox & Weisberg (2019, sec. 7.2.2) to illustrate the application of linear mixed models to hierarchical data, and we’ll closely follow their example here. The HSB data are included in the MathAchieve and MathAchSchool data sets in the nlme package (Pinheiro & Bates, 2000). MathAchieve includes individual-level data on 7185 students in 160 high schools, and MathAchSchool includes school-level data:

data("MathAchieve", package="nlme")
dim(MathAchieve)
#> [1] 7185    6
head(MathAchieve, 3)
#> Grouped Data: MathAch ~ SES | School
#>   School Minority    Sex    SES MathAch MEANSES
#> 1   1224       No Female -1.528   5.876  -0.428
#> 2   1224       No Female -0.588  19.708  -0.428
#> 3   1224       No   Male -0.528  20.349  -0.428
tail(MathAchieve, 3)
#> Grouped Data: MathAch ~ SES | School
#>      School Minority    Sex    SES MathAch MEANSES
#> 7183   9586       No Female  1.332  19.641   0.627
#> 7184   9586       No Female -0.008  16.241   0.627
#> 7185   9586       No Female  0.792  22.733   0.627

data("MathAchSchool", package="nlme")
dim(MathAchSchool)
#> [1] 160   7
head(MathAchSchool, 2)
#>      School Size Sector PRACAD DISCLIM HIMINTY MEANSES
#> 1224   1224  842 Public   0.35   1.597       0  -0.428
#> 1288   1288 1855 Public   0.27   0.174       0   0.128
tail(MathAchSchool, 2)
#>      School Size   Sector PRACAD DISCLIM HIMINTY MEANSES
#> 9550   9550 1532   Public   0.45   0.791       0   0.059
#> 9586   9586  262 Catholic   1.00  -2.416       0   0.627

The first few students are in school number 1224 and the last few in school 9586. We’ll use only the School, SES (students’ socioeconomic status), and MathAch (their score on a standardized math-achievement test) variables in the MathAchieve data set, and Sector ("Catholic" or "Public") in the MathAchSchool data set.
Some data management is required before fitting a mixed-effects model to the HSB data, for which we use the dplyr package (Wickham, François, Henry, Müller, & Vaughan, 2023):

library("dplyr")
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#>     filter, lag
#> The following objects are masked from 'package:base':
#>
#>     intersect, setdiff, setequal, union

MathAchieve %>%
  group_by(School) %>%
  summarize(mean.ses = mean(SES)) -> Temp
Temp <- merge(MathAchSchool, Temp, by="School")
HSB <- merge(Temp[, c("School", "Sector", "mean.ses")],
             MathAchieve[, c("School", "SES", "MathAch")],
             by="School")
names(HSB) <- tolower(names(HSB))
HSB$cses <- with(HSB, ses - mean.ses)

In the process, we created two new variables: mean.ses, the average SES for students in each school (a school-level variable); and cses, each student’s SES centered at their school’s average. For details, see Fox & Weisberg (2019, sec. 7.2.2).

Still following Fox and Weisberg, we proceed to use the lmer() function in the lme4 package (D. Bates, Mächler, Bolker, & Walker, 2015) to fit a mixed model for math achievement to the HSB data:

library("lme4")
hsb.lmer <- lmer(mathach ~ mean.ses*cses + sector*cses
                 + (cses | school), data=HSB)
summary(hsb.lmer, correlation=FALSE)
#> Linear mixed model fit by REML ['lmerMod']
#> Formula: mathach ~ mean.ses * cses + sector * cses + (cses | school)
#>    Data: HSB
#>
#> REML criterion at convergence: 46504
#>
#> Scaled residuals:
#>    Min     1Q Median     3Q    Max
#> -3.159 -0.723  0.017  0.754  2.958
#>
#> Random effects:
#>  Groups   Name        Variance Std.Dev. Corr
#>  school   (Intercept)  2.380   1.543
#>           cses         0.101   0.318    0.39
#>  Residual             36.721   6.060
#> Number of obs: 7185, groups:  school, 160
#>
#> Fixed effects:
#>                     Estimate Std. Error t value
#> (Intercept)           12.128      0.199   60.86
#> mean.ses               5.333      0.369   14.45
#> cses                   2.945      0.156   18.93
#> sectorCatholic         1.227      0.306    4.00
#> mean.ses:cses          1.039      0.299    3.48
#> cses:sectorCatholic   -1.643      0.240   -6.85

We can then cross-validate at the cluster (i.e., school) level,

cv(hsb.lmer, k=10, clusterVariables="school", seed=5240)
#> R RNG seed set to 5240
#> 10-Fold Cross Validation based on 160 {school} clusters
#> cross-validation criterion = 39.157
#> bias-adjusted cross-validation criterion = 39.148
#> 95% CI for bias-adjusted CV criterion = (38.066, 40.231)
#> full-sample criterion = 39.006

or at the case (i.e., student) level,

cv(hsb.lmer, seed=1575)
#> R RNG seed set to 1575
#> Warning in checkConv(attr(opt, "derivs"), opt$par, ctrl = control$checkConv, :
#> Model failed to converge with max|grad| = 0.00587228 (tol = 0.002, component 1)
#> boundary (singular) fit: see help('isSingular')
#> 10-Fold Cross Validation
#> cross-validation criterion = 37.445
#> bias-adjusted cross-validation criterion = 37.338
#> 95% CI for bias-adjusted CV criterion = (36.288, 38.388)
#> full-sample criterion = 36.068

For cluster-level CV, the clusterVariables argument tells cv() how the clusters are defined. Were there more than one clustering variable, say classes within schools, these would be provided as a character vector of variable names: clusterVariables = c("school", "class"). For cluster-level CV, the default is k = "loo", that is, leave one cluster out at a time; we instead specify k = 10 folds of clusters, each fold therefore comprising $$160/10 = 16$$ schools.
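The essential point of cluster-level CV is that folds are formed from whole clusters, so no cluster is split between the training and held-out sets. A minimal Python sketch of such a fold assignment (a hypothetical cluster_folds() helper, not the cv package's internals):

```python
import numpy as np

def cluster_folds(cluster_ids, k, seed=0):
    """Assign whole clusters to folds, so no cluster is split across folds."""
    rng = np.random.default_rng(seed)
    clusters = np.unique(cluster_ids)
    # randomly spread the clusters (not the cases) over the k folds
    fold_of = dict(zip(clusters, rng.permutation(len(clusters)) % k))
    return np.array([fold_of[c] for c in cluster_ids])

# 6 cases in 3 clusters ("schools"); with k=3, each fold is one whole cluster
ids = np.array(["a", "a", "b", "b", "c", "c"])
f = cluster_folds(ids, k=3)
# every case in a cluster shares that cluster's fold
print(all(f[ids == c].min() == f[ids == c].max() for c in "abc"))  # True
```

Case-level CV, by contrast, would permute the cases directly, ignoring cluster membership when forming folds.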

If the clusterVariables argument is omitted, then case-level CV is employed, with k = 10 folds as the default, here each with $$7185/10 \approx 719$$ students. Notice that one of the 10 models refit with a fold removed failed to converge. Convergence problems are common in mixed-effects modeling. The apparent issue here is that an estimated variance component is close to or equal to 0, which is at a boundary of the parameter space. That shouldn’t disqualify the fitted model for the kind of prediction required for cross-validation.

There is also a cv() method for linear mixed models fit by the lme() function in the nlme package, and the arguments for cv() in this case are the same as for a model fit by lmer() or glmer(). We illustrate with the mixed model fit to the HSB data:

library("nlme")
#>
#> Attaching package: 'nlme'
#> The following object is masked from 'package:lme4':
#>
#>     lmList
#> The following object is masked from 'package:dplyr':
#>
#>     collapse
hsb.lme <- lme(mathach ~ mean.ses*cses + sector*cses,
               random = ~ cses | school, data=HSB,
               control=list(opt="optim"))
summary(hsb.lme)
#> Linear mixed-effects model fit by REML
#>   Data: HSB
#>     AIC   BIC logLik
#>   46525 46594 -23252
#>
#> Random effects:
#>  Formula: ~cses | school
#>  Structure: General positive-definite, Log-Cholesky parametrization
#>             StdDev   Corr
#> (Intercept) 1.541177 (Intr)
#> cses        0.018174 0.006
#> Residual    6.063492
#>
#> Fixed effects:  mathach ~ mean.ses * cses + sector * cses
#>                       Value Std.Error   DF t-value p-value
#> (Intercept)         12.1282   0.19920 7022  60.886   0e+00
#> mean.ses             5.3367   0.36898  157  14.463   0e+00
#> cses                 2.9421   0.15122 7022  19.456   0e+00
#> sectorCatholic       1.2245   0.30611  157   4.000   1e-04
#> mean.ses:cses        1.0444   0.29107 7022   3.588   3e-04
#> cses:sectorCatholic -1.6421   0.23312 7022  -7.044   0e+00
#>  Correlation:
#>                     (Intr) men.ss cses   sctrCt mn.ss:
#> mean.ses             0.256
#> cses                 0.000  0.000
#> sectorCatholic      -0.699 -0.356  0.000
#> mean.ses:cses        0.000  0.000  0.295  0.000
#> cses:sectorCatholic  0.000  0.000 -0.696  0.000 -0.351
#>
#> Standardized Within-Group Residuals:
#>       Min        Q1       Med        Q3       Max
#> -3.170106 -0.724877  0.014892  0.754263  2.965498
#>
#> Number of Observations: 7185
#> Number of Groups: 160

cv(hsb.lme, k=10, clusterVariables="school", seed=5240)
#> R RNG seed set to 5240
#> 10-Fold Cross Validation based on 160 {school} clusters
#> cross-validation criterion = 39.157
#> bias-adjusted cross-validation criterion = 39.149
#> 95% CI for bias-adjusted CV criterion = (38.066, 40.232)
#> full-sample criterion = 39.006

cv(hsb.lme, seed=1575)
#> R RNG seed set to 1575
#> 10-Fold Cross Validation
#> cross-validation criterion = 37.442
#> bias-adjusted cross-validation criterion = 37.402
#> 95% CI for bias-adjusted CV criterion = (36.351, 38.453)
#> full-sample criterion = 36.147

We used the same random-number generator seeds as in the previous example cross-validating the model fit by lmer(), and so the same folds are employed in both cases.7 The estimated covariance components and fixed effects in the summary output differ slightly between the lmer() and lme() solutions, although both functions seek to maximize the REML criterion. This is, of course, to be expected when different algorithms are used for numerical optimization. To the precision reported, the cluster-level CV results for the lmer() and lme() models are identical, while the case-level CV results are very similar but not identical.

### Example: Contrived hierarchical data

We introduce an artificial data set that exemplifies aspects of cross-validation particular to hierarchical models. Using this data set, we show that model comparisons employing cluster-based and those employing case-based cross-validation may not agree on a “best” model. Furthermore, commonly used measures of fit, such as mean-squared error, do not necessarily become smaller as models become larger, even when the models are nested, and even when the measure of fit is computed for the whole data set.

Consider a researcher studying improvement in a skill, yodeling, for example, among students enrolled in a four-year yodeling program. The plan is to measure each student’s skill level at the beginning of the program and every year thereafter until the end of the program, resulting in five annual measurements for each student. It turns out that yodeling appeals to students of all ages, and students enrolling in the program range in age from 20 to 70. Moreover, participants’ untrained yodeling skill is similar at all ages, as is their rate of progress with training. All students complete the four-year program.

The researcher, who has more expertise in yodeling than in modeling, decides to model the response, $$y$$, yodeling skill, as a function of age, $$x$$, reasoning that students get older during their stay in the program, and (incorrectly) that age can serve as a proxy for elapsed time. The researcher knows that a mixed model should be used to account for clustering due to the expected similarity of measurements taken from each student.

We start by generating the data, using parameters consistent with the description above and meant to highlight the issues that arise in cross-validating mixed-effects models:8

# Parameters:
set.seed(9693)
Nb <- 100     # number of groups
Nw <- 5       # number of individuals within groups
Bb <- 0       # between-group regression coefficient on group mean
SDre <- 2.0   # between-group SD of random level relative to group mean of x
SDwithin <- 0.5  # within group SD
Bw <- 1          # within group effect of x
Ay <- 10         # intercept for response
Ax <- 20         # starting level of x
Nx <- Nw*10      # number of distinct x values

Data <- data.frame(
    group = factor(rep(1:Nb, each=Nw)),
    x = Ax + rep(1:Nx, length.out = Nw*Nb)
) |>
    within({
        xm <- ave(x, group, FUN = mean)  # within-group mean
        y <- Ay +
            Bb * xm +                    # contextual effect
            Bw * (x - xm) +              # within-group effect
            rnorm(Nb, sd=SDre)[group] +  # random level by group
            rnorm(Nb*Nw, sd=SDwithin)    # random error within groups
    })

Here is a scatterplot of the data for 10 representative students (without loss of generality, the first 10 of the 100), showing the 95% concentration ellipse for each cluster:9

library("lattice")
library("latticeExtra")
plot <- xyplot(y ~ x, data=Data[1:Nx, ], group=group,
               ylim=c(4, 16),
               par.settings=list(superpose.symbol=list(pch=1, cex=0.7))) +
    layer(panel.ellipse(..., center.cex=0))
plot # display graph

The between-student effect of age is 0 but the within-student effect is 1. Due to the large variation in ages between students, the least-squares regression of yodeling skill on age (for the 500 observations among all 100 students) produces an estimated slope close to 0 (though with a small $$p$$-value), because the slope is heavily weighted toward the between-student effect:
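This behavior follows from the standard decomposition of the pooled least-squares slope into between-cluster and within-cluster components:

$$\hat{\beta} = \frac{S^{(B)}_{xx}\hat{\beta}_B + S^{(W)}_{xx}\hat{\beta}_W}{S^{(B)}_{xx} + S^{(W)}_{xx}}$$

where $$S^{(B)}_{xx} = \sum_j n_j(\bar{x}_j - \bar{x})^2$$ and $$S^{(W)}_{xx} = \sum_j \sum_i (x_{ij} - \bar{x}_j)^2$$. Because ages vary far more between students than within students over the four years of the program, $$S^{(B)}_{xx}$$ dominates, and the pooled slope is drawn toward the between-student slope $$\hat{\beta}_B \approx 0$$.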

summary(lm(y ~ x, data=Data))
#>
#> Call:
#> lm(formula = y ~ x, data = Data)
#>
#> Residuals:
#>    Min     1Q Median     3Q    Max
#> -5.771 -1.658 -0.089  1.552  7.624
#>
#> Coefficients:
#>             Estimate Std. Error t value Pr(>|t|)
#> (Intercept)  9.05043    0.34719   26.07   <2e-16 ***
#> x            0.02091    0.00727    2.87   0.0042 **
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 2.35 on 498 degrees of freedom
#> Multiple R-squared:  0.0163, Adjusted R-squared:  0.0143
#> F-statistic: 8.26 on 1 and 498 DF,  p-value: 0.00422

The initial mixed-effects model that we fit to the data is a simple random-intercepts model:

# random intercept only:
mod.0 <- lmer(y ~ 1 + (1 | group), Data)
summary(mod.0)
#> Linear mixed model fit by REML ['lmerMod']
#> Formula: y ~ 1 + (1 | group)
#>    Data: Data
#>
#> REML criterion at convergence: 2103.1
#>
#> Scaled residuals:
#>     Min      1Q  Median      3Q     Max
#> -2.0351 -0.7264 -0.0117  0.7848  2.0438
#>
#> Random effects:
#>  Groups   Name        Variance Std.Dev.
#>  group    (Intercept) 2.90     1.70
#>  Residual             2.71     1.65
#> Number of obs: 500, groups:  group, 100
#>
#> Fixed effects:
#>             Estimate Std. Error t value
#> (Intercept)   10.002      0.186    53.9

We will shortly consider three other, more complex, mixed models; because of data-management considerations, it is convenient to fit them now, but we defer discussion of these models:

# effect of x and random intercept:
mod.1 <- lmer(y ~ x + (1 | group), Data)

# effect of x, contextual (student) mean of x, and random intercept:
mod.2 <- lmer(y ~ x + xm + (1 | group), Data)
# equivalent to y ~ I(x - xm) + xm + (1 | group)

# model generating the data (where Bb = 0)
mod.3 <- lmer(y ~ I(x - xm) + (1 | group), Data)

We proceed to obtain predictions from the random-intercept model (mod.0) and the other models (mod.1, mod.2, and mod.3) based on fixed effects alone, as would be used for cross-validation based on clusters (i.e., students), and for fixed and random effects—so-called best linear unbiased predictions or BLUPs—as would be used for cross-validation based on cases (i.e., occasions within students):

Data <- within(Data, {
    fit_mod0.fe <- predict(mod.0, re.form = ~ 0) # fixed effects only
    fit_mod0.re <- predict(mod.0)                # fixed and random effects (BLUPs)
    fit_mod1.fe <- predict(mod.1, re.form = ~ 0)
    fit_mod1.re <- predict(mod.1)
    fit_mod2.fe <- predict(mod.2, re.form = ~ 0)
    fit_mod2.re <- predict(mod.2)
    fit_mod3.fe <- predict(mod.3, re.form = ~ 0)
    fit_mod3.re <- predict(mod.3)
})

We then prepare the data for plotting:

Data_long <- reshape(Data[1:Nx, ], direction = "long", sep = ".",
                     timevar = "effect", varying = grep("\\.", names(Data[1:Nx, ])))
Data_long$id <- 1:nrow(Data_long)
Data_long <- reshape(Data_long, direction = "long", sep = "_",
                     timevar = "modelcode", varying = grep("_", names(Data_long)))
Data_long$model <- factor(
    c("~ 1", "~ 1 + x", "~ 1 + x + xm", "~ 1 + I(x - xm)")
    [match(Data_long$modelcode, c("mod0", "mod1", "mod2", "mod3"))]
)

Predictions based on the random-intercept model mod.0 for the first 10 students are shown in the following graph:

(plot +
    xyplot(fit ~ x, subset(Data_long, modelcode == "mod0" & effect == "fe"),
           groups=group, type="l", lwd=2) +
    xyplot(fit ~ x, subset(Data_long, modelcode == "mod0" & effect == "re"),
           groups=group, type="l", lwd=2, lty=3)
) |> update(
    main="Model: y ~ 1 + (1 | group)",
    key=list(
        corner=c(0.05, 0.05),
        text=list(c("fixed effects only", "fixed and random")),
        lines=list(lty=c(1, 3))))

The fixed-effect predictions for the various individuals are identical—the estimated fixed-effects intercept or estimated general mean of $$y$$—while the BLUPs are the sums of the fixed-effects intercept and the random intercepts, and are only slightly shrunken towards the general mean. Because in our artificial data there is no population relationship between age and skill, the fixed-effect-only predictions and the BLUPs are not very different.
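The slight shrinkage has a simple explanation. For a random-intercept model, the BLUP shrinks each cluster mean toward the overall mean by the factor

$$\lambda_j = \frac{\hat{\sigma}^2_b}{\hat{\sigma}^2_b + \hat{\sigma}^2_\varepsilon / n_j}$$

With the estimates from mod.0 ($$\hat{\sigma}^2_b = 2.90$$, $$\hat{\sigma}^2_\varepsilon = 2.71$$) and $$n_j = 5$$ occasions per student, $$\lambda_j = 2.90/(2.90 + 2.71/5) \approx 0.84$$, so each BLUP retains about 84% of the deviation of its cluster mean from the overall mean.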

Our next model, mod.1, includes a fixed intercept and fixed effect of x along with a random intercept:

summary(mod.1)
#> Linear mixed model fit by REML ['lmerMod']
#> Formula: y ~ x + (1 | group)
#>    Data: Data
#>
#> REML criterion at convergence: 1564.5
#>
#> Scaled residuals:
#>     Min      1Q  Median      3Q     Max
#> -2.9016 -0.6350  0.0188  0.5541  2.8293
#>
#> Random effects:
#>  Groups   Name        Variance Std.Dev.
#>  group    (Intercept) 192.941  13.890
#>  Residual               0.257   0.507
#> Number of obs: 500, groups:  group, 100
#>
#> Fixed effects:
#>             Estimate Std. Error t value
#> (Intercept) -33.9189     1.5645   -21.7
#> x             0.9653     0.0158    61.0
#>
#> Correlation of Fixed Effects:
#>   (Intr)
#> x -0.460

Predictions from this model appear in the following graph:

(plot +
    xyplot(fit ~ x, subset(Data_long, modelcode == "mod1" & effect == "fe"),
           groups=group, type="l", lwd=2) +
    xyplot(fit ~ x, subset(Data_long, modelcode == "mod1" & effect == "re"),
           groups=group, type="l", lwd=2, lty=3)
) |> update(
    main="Model: y ~ 1 + x + (1 | group)",
    ylim=c(-15, 35),
    key=list(
        corner=c(0.95, 0.05),
        text=list(c("fixed effects only", "fixed and random")),
        lines=list(lty=c(1, 3))))

The BLUPs fit the observed data very closely, but predictions based on the fixed effects alone, with a common intercept and slope for all clusters, are very poor—indeed, much worse than the fixed-effects-only predictions based on the simpler random-intercept model, mod.0. We therefore anticipate (and show later in this section) that case-based cross-validation will prefer mod.1 to mod.0, but that cluster-based cross-validation will prefer mod.0 to mod.1.

Our third model, mod.2, includes the contextual effect of $$x$$—that is, the cluster mean xm—along with $$x$$ and the intercept in the fixed-effect part of the model, and a random intercept:

summary(mod.2)
#> Linear mixed model fit by REML ['lmerMod']
#> Formula: y ~ x + xm + (1 | group)
#>    Data: Data
#>
#> REML criterion at convergence: 1169.2
#>
#> Scaled residuals:
#>     Min      1Q  Median      3Q     Max
#> -2.9847 -0.6375  0.0019  0.5568  2.7325
#>
#> Random effects:
#>  Groups   Name        Variance Std.Dev.
#>  group    (Intercept) 3.399    1.844
#>  Residual             0.255    0.505
#> Number of obs: 500, groups:  group, 100
#>
#> Fixed effects:
#>             Estimate Std. Error t value
#> (Intercept)   9.4787     0.6171    15.4
#> x             0.9915     0.0160    62.1
#> xm           -0.9800     0.0206   -47.7
#>
#> Correlation of Fixed Effects:
#>    (Intr) x
#> x   0.000
#> xm -0.600 -0.777

This model is equivalent to fitting y ~ I(x - xm) + xm + (1 | group), which is the model that generated the data once the coefficient of the contextual predictor xm is set to 0 (as it is in mod.3, discussed below).
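The equivalence of the two parametrizations is easy to confirm numerically; a minimal sketch, where mod.2a is a hypothetical name for the refit in centered form:

```r
mod.2a <- lmer(y ~ I(x - xm) + xm + (1 | group), Data)
# The two models span the same fixed-effects column space (x = (x - xm) + xm),
# so their fitted values and REML criterion should agree up to numerical
# tolerance, although the coefficients are transformed:
all.equal(fitted(mod.2), fitted(mod.2a))
```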

Predictions from model mod.2 appear in the following graph:

(plot +
    xyplot(fit ~ x, subset(Data_long, modelcode == "mod2" & effect == "fe"),
           groups=group, type="l", lwd=2) +
    xyplot(fit ~ x, subset(Data_long, modelcode == "mod2" & effect == "re"),
           groups=group, type="l", lwd=2, lty=3)
) |> update(
    main="Model: y ~ 1 + x + xm + (1 | group)",
    ylim=c(4, 16),
    key=list(
        corner=c(0.05, 0.05),
        text=list(c("fixed effects only", "fixed and random")),
        lines=list(lty=c(1, 3))))

Depending on the estimated variance parameters of the model, a mixed model like mod.2 will apply varying degrees of shrinkage to the random-intercept BLUPs, which correspond to variation in the heights of the parallel fitted lines for the individual students. In our contrived data, mod.2 applies little shrinkage, allowing substantial variability in the heights of the fitted lines, which closely approach the observed values for each student. The fit of the mixed model mod.2 is consequently similar to that of a fixed-effects model with age and a categorical predictor for individual students (i.e., treating students as a factor; not shown here).

The mixed model mod.2 therefore fits individual observations well, and we anticipate a favorable assessment from case-based cross-validation. In contrast, the large variability in the BLUPs produces larger residuals for predictions based on the fixed effects alone, and so we expect that cluster-based cross-validation won't show an advantage for mod.2 over the smaller model mod.0, which includes only fixed and random intercepts.
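The degree of shrinkage can be checked directly from the variance estimates: for the random intercepts, each cluster's deviation is shrunk toward 0 by approximately the factor $$\lambda = \hat{\sigma}^2_b/(\hat{\sigma}^2_b + \hat{\sigma}^2_\varepsilon / n_j)$$. A sketch of the computation for mod.2, using lme4's VarCorr():

```r
vc <- as.data.frame(VarCorr(mod.2))
sig2.b <- vc$vcov[vc$grp == "group"]    # between-student variance
sig2.e <- vc$vcov[vc$grp == "Residual"] # residual variance
lambda <- sig2.b/(sig2.b + sig2.e/Nw)   # Nw = 5 occasions per student
lambda # close to 1, hence little shrinkage
```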

Had the mixed model applied considerable shrinkage, then neither cluster-based nor case-based cross-validation would show much improvement over the random-intercept-only model. In our experience, the degree of shrinkage does not vary smoothly as parameters are changed but tends to be “all or nothing,” and near the tipping point, the behavior of estimates can be affected considerably by the choice of algorithm used to fit the model.

Finally, mod.3 directly estimates the model used to generate the data. As mentioned, it is a constrained version of mod.2, with the coefficient of xm set to 0, and with x expressed as a deviation from the cluster mean xm:

summary(mod.3)
#> Linear mixed model fit by REML ['lmerMod']
#> Formula: y ~ I(x - xm) + (1 | group)
#>    Data: Data
#>
#> REML criterion at convergence: 1163.2
#>
#> Scaled residuals:
#>     Min      1Q  Median      3Q     Max
#> -2.9770 -0.6320  0.0063  0.5603  2.7249
#>
#> Random effects:
#>  Groups   Name        Variance Std.Dev.
#>  group    (Intercept) 3.391    1.842
#>  Residual             0.255    0.505
#> Number of obs: 500, groups:  group, 100
#>
#> Fixed effects:
#>             Estimate Std. Error t value
#> (Intercept)   10.002      0.185    53.9
#> I(x - xm)      0.992      0.016    62.1
#>
#> Correlation of Fixed Effects:
#>           (Intr)
#> I(x - xm) 0.000

The predictions from mod.3 are therefore similar to those from mod.2:

(plot +
    xyplot(fit ~ x, subset(Data_long, modelcode == "mod3" & effect == "fe"),
           groups=group, type="l", lwd=2) +
    xyplot(fit ~ x, subset(Data_long, modelcode == "mod3" & effect == "re"),
           groups=group, type="l", lwd=2, lty=3)
) |> update(
    main="Model: y ~ 1 + I(x - xm) + (1 | group)",
    ylim=c(4, 16),
    key=list(
        corner=c(0.05, 0.05),
        text=list(c("fixed effects only", "fixed and random")),
        lines=list(lty=c(1, 3))))