Cross-validation (CV) is an essentially simple and intuitively reasonable approach to estimating the predictive accuracy of regression models. CV is developed in many standard sources on regression modeling and “machine learning”—we particularly recommend James, Witten, Hastie, & Tibshirani (2021, secs. 5.1, 5.3)—and so we will describe the method only briefly here before taking up computational issues and some examples. See Arlot & Celisse (2010) for a wide-ranging, if technical, survey of cross-validation and related methods that emphasizes the statistical properties of CV.

Validating research by replication on independently collected data is a common scientific norm. Emulating this process in a single study by data-division is less common: The data are randomly divided into two, possibly equal-size, parts; the first part is used to develop and fit a statistical model; and then the second part is used to assess the adequacy of the model fit to the first part of the data. Data-division, however, suffers from two problems: (1) Dividing the data decreases the sample size and thus increases sampling error; and (2), even more disconcertingly, particularly in smaller samples, the results can vary substantially based on the random division of the data: See Harrell (2015, sec. 5.3) for this and other remarks about data-division and cross-validation.

Cross-validation speaks to both of these issues. In CV, the data are randomly divided as equally as possible into several, say \(k\), parts, called “folds.” The statistical model is fit \(k\) times, leaving each fold out in turn. Each fitted model is then used to predict the response variable for the cases in the omitted fold. A CV criterion or “cost” measure, such as the mean-squared error (“MSE”) of prediction, is then computed using these predicted values. In the extreme case, \(k = n\), the number of cases in the data, individual cases are omitted in turn and the model is refit \(n\) times—a procedure termed “leave-one-out (LOO) cross-validation.”

Because the \(n\) models are each fit to \(n - 1\) cases, LOO CV produces a nearly unbiased estimate of prediction error. The \(n\) regression models are highly statistically dependent, however, based as they are on nearly the same data, and so the resulting estimate of prediction error has relatively large variance. In contrast, the estimated prediction error for \(k\)-fold CV with \(k = 5\) or \(10\) (commonly employed choices) is somewhat biased but has smaller variance. It is also possible to correct \(k\)-fold CV for bias (see below).
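The basic \(k\)-fold procedure is simple enough to sketch in a few lines of base R. The following illustrative function is a sketch of the method just described, not the optimized code used by the **cv** package; it assumes a linear-model formula and a data frame with a numeric response:

```
# Illustrative k-fold CV for a linear model; a sketch, not the
# cv package's optimized implementation.
kfoldMSE <- function(formula, data, k=10, seed=123) {
  set.seed(seed)
  n <- nrow(data)
  folds <- sample(rep(1:k, length.out=n))  # random, nearly equal-size folds
  yhat <- numeric(n)
  for (i in 1:k) {
    m.i <- lm(formula, data=data[folds != i, ])  # fit with fold i omitted
    yhat[folds == i] <- predict(m.i, newdata=data[folds == i, ])
  }
  y <- data[[all.vars(formula)[1]]]  # observed response
  mean((y - yhat)^2)                 # CV estimate of the MSE of prediction
}
```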

## The `Auto` data

The data for this example are drawn from the **ISLR2** package for R, associated with James et al. (2021). The presentation here is close (though not identical) to that in the original source (James et al., 2021, secs. 5.1, 5.3), and it demonstrates the use of the `cv()` function in the **cv** package.^{1}

The `Auto` dataset contains information about 392 cars:

```
data("Auto", package="ISLR2")
head(Auto)
#>   mpg cylinders displacement horsepower weight acceleration year origin
#> 1  18         8          307        130   3504         12.0   70      1
#> 2  15         8          350        165   3693         11.5   70      1
#> 3  18         8          318        150   3436         11.0   70      1
#> 4  16         8          304        150   3433         12.0   70      1
#> 5  17         8          302        140   3449         10.5   70      1
#> 6  15         8          429        198   4341         10.0   70      1
#>                        name
#> 1 chevrolet chevelle malibu
#> 2         buick skylark 320
#> 3        plymouth satellite
#> 4             amc rebel sst
#> 5               ford torino
#> 6          ford galaxie 500
dim(Auto)
#> [1] 392 9
```

With the exception of `origin` (which we don’t use here), these variables are largely self-explanatory, except possibly for units of measurement; for details, see `help("Auto", package="ISLR2")`.

We’ll focus here on the relationship of `mpg` (miles per gallon) to `horsepower`, as displayed in the following scatterplot:

The relationship between the two variables is monotone, decreasing,
and nonlinear. Following James et al.
(2021), we’ll consider approximating the relationship by a
polynomial regression, with the degree of the polynomial \(p\) ranging from 1 (a linear regression) to
10.^{2}
Polynomial fits for \(p = 1\) to \(5\) are shown in the following figure:

```
plot(mpg ~ horsepower, data=Auto)
horsepower <- with(Auto,
                   seq(min(horsepower), max(horsepower),
                       length=1000))
for (p in 1:5){
  m <- lm(mpg ~ poly(horsepower, p), data=Auto)
  mpg <- predict(m, newdata=data.frame(horsepower=horsepower))
  lines(horsepower, mpg, col=p + 1, lty=p, lwd=2)
}
legend("topright", legend=1:5, col=2:6, lty=1:5, lwd=2,
       title="Degree", inset=0.02)
```

The linear fit is clearly inappropriate; the fits for \(p = 2\) (quadratic) through \(4\) are very similar; and the fit for \(p = 5\) may over-fit the data by chasing one or two relatively high `mpg` values at the right (but see the CV results reported below).

The following graph shows two measures of estimated (squared) error as a function of polynomial-regression degree: The mean-squared error (“MSE”), defined as \(\mathsf{MSE} = \frac{1}{n}\sum_{i=1}^n (y_i - \widehat{y}_i)^2\), and the usual residual variance, defined as \(\widehat{\sigma}^2 = \frac{1}{n - p - 1} \sum_{i=1}^n (y_i - \widehat{y}_i)^2\). The former necessarily declines with \(p\) (or, more strictly, can’t increase with \(p\)), while the latter gets slightly larger for the largest values of \(p\), with the “best” value, by a small margin, for \(p = 7\).
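The divergence of the two measures is no mystery: both are based on the same residual sum of squares, so \(\widehat{\sigma}^2 = \frac{n}{n - p - 1} \mathsf{MSE}\). The factor \(n/(n - p - 1)\) grows with \(p\), which is why \(\widehat{\sigma}^2\) can eventually increase with the degree of the polynomial even though the MSE never does.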

```
library("cv") # for mse() and other functions
var <- mse <- numeric(10)
for (p in 1:10){
  m <- lm(mpg ~ poly(horsepower, p), data=Auto)
  mse[p] <- mse(Auto$mpg, fitted(m))
  var[p] <- summary(m)$sigma^2
}
plot(c(1, 10), range(mse, var), type="n",
     xlab="Degree of polynomial, p",
     ylab="Estimated Squared Error")
lines(1:10, mse, lwd=2, lty=1, col=2, pch=16, type="b")
lines(1:10, var, lwd=2, lty=2, col=3, pch=17, type="b")
legend("topright", inset=0.02,
       legend=c(expression(hat(sigma)^2), "MSE"),
       lwd=2, lty=2:1, col=3:2, pch=17:16)
```

The code for this graph uses the `mse()` function from the **cv** package to compute the MSE for each fit.

## The `cv()` function

The generic `cv()` function has an `"lm"` method, which by default performs \(k = 10\)-fold CV:

```
m.auto <- lm(mpg ~ poly(horsepower, 2), data=Auto)
summary(m.auto)
#>
#> Call:
#> lm(formula = mpg ~ poly(horsepower, 2), data = Auto)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -14.714 -2.594 -0.086 2.287 15.896
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 23.446 0.221 106.1 <2e-16 ***
#> poly(horsepower, 2)1 -120.138 4.374 -27.5 <2e-16 ***
#> poly(horsepower, 2)2 44.090 4.374 10.1 <2e-16 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 4.37 on 389 degrees of freedom
#> Multiple R-squared: 0.688, Adjusted R-squared: 0.686
#> F-statistic: 428 on 2 and 389 DF, p-value: <2e-16
cv(m.auto)
#> R RNG seed set to 398475
#> 10-Fold Cross Validation
#> method: Woodbury
#> criterion: mse
#> cross-validation criterion = 19.238
#> bias-adjusted cross-validation criterion = 19.224
#> full-sample criterion = 18.985
```

The `"lm"` method by default uses `mse()` as the CV criterion and the Woodbury matrix identity to update the regression with each fold deleted, without having literally to refit the model. Computational details are discussed in the final section of this vignette. The function reports the CV estimate of MSE, a bias-adjusted estimate of the MSE (the bias adjustment is explained in the final section), and the MSE for the original, full-sample regression. Because the division of the data into 10 folds is random, `cv()` explicitly (randomly) generates and saves a seed for R’s pseudo-random number generator, to make the results replicable. The user can also specify the seed directly via the `seed` argument to `cv()`.

To perform LOO CV, we can set the `k` argument to `cv()` to the number of cases in the data, here `k=392`, or, more conveniently, to `k="loo"` or `k="n"`:

```
cv(m.auto, k="loo")
#> n-Fold Cross Validation
#> method: hatvalues
#> criterion: mse
#> cross-validation criterion = 19.248
```

For LOO CV of a linear model, `cv()` by default uses the hatvalues from the model fit to the full data for the LOO updates, and reports only the CV estimate of MSE. Alternative methods are to use the Woodbury matrix identity or the “naive” approach of literally refitting the model with each case omitted. All three methods produce exact results for a linear model (within the precision of floating-point computations):

```
cv(m.auto, k="loo", method="naive")
#> n-Fold Cross Validation
#> criterion: mse
#> cross-validation criterion = 19.248
#> bias-adjusted cross-validation criterion = 19.248
#> full-sample criterion = 18.985
cv(m.auto, k="loo", method="Woodbury")
#> n-Fold Cross Validation
#> method: Woodbury
#> criterion: mse
#> cross-validation criterion = 19.248
#> bias-adjusted cross-validation criterion = 19.248
#> full-sample criterion = 18.985
```

The `"naive"` and `"Woodbury"` methods also return the bias-adjusted estimate of MSE and the full-sample MSE, but bias isn’t an issue for LOO CV.

This is a small regression problem and all three computational approaches are essentially instantaneous, but it is still of interest to investigate their relative speed. In this comparison, we include the `cv.glm()` function from the **boot** package, which takes the naive approach, and for which we have to fit the linear model as an equivalent Gaussian GLM. We use the `microbenchmark()` function from the package of the same name for the timings (Mersmann, 2023):

```
m.auto.glm <- glm(mpg ~ poly(horsepower, 2), data=Auto)
boot::cv.glm(Auto, m.auto.glm)$delta
#> [1] 19.248 19.248
microbenchmark::microbenchmark(
  hatvalues = cv(m.auto, k="loo"),
  Woodbury = cv(m.auto, k="loo", method="Woodbury"),
  naive = cv(m.auto, k="loo", method="naive"),
  cv.glm = boot::cv.glm(Auto, m.auto.glm),
  times=10
)
#> Warning in microbenchmark::microbenchmark(hatvalues = cv(m.auto, k = "loo"), :
#> less accurate nanosecond times to avoid potential integer overflows
#> Unit: microseconds
#> expr min lq mean median uq max neval cld
#> hatvalues 986.13 1010.6 1131.7 1176.3 1197.7 1205.1 10 a
#> Woodbury 9970.58 10173.4 10292.8 10351.1 10419.4 10462.7 10 a
#> naive 209703.40 210921.2 233033.5 219749.1 270959.1 277474.9 10 b
#> cv.glm 375319.86 382713.4 395619.9 384162.0 387250.0 448388.6 10 c
```

On our computer, using the hatvalues is about an order of magnitude
faster than employing Woodbury matrix updates, and more than two orders
of magnitude faster than refitting the model.^{3}
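The speed of the hatvalues method reflects the standard leave-one-out identity for linear models: the deleted prediction error for case \(i\) is \(e_i/(1 - h_i)\), where \(e_i\) is the ordinary residual and \(h_i\) the hatvalue, so the LOO MSE can be computed from the full-sample fit alone, with no refitting at all. A minimal check:

```
# LOO MSE from the full-model fit alone, using the identity
# e[-i] = e[i]/(1 - h[i]) for linear models
mean((residuals(m.auto) / (1 - hatvalues(m.auto)))^2)
#> [1] 19.248
```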

The `cv()` function also has a method that can be applied to a list of regression models for the same data, composed using the `models()` function. For \(k\)-fold CV, the same folds are used for the competing models, which reduces random error in their comparison. This result can also be obtained by specifying a common seed for R’s random-number generator while applying `cv()` separately to each model, but employing a list of models is more convenient for both \(k\)-fold and LOO CV (where there is no random component to the composition of the \(n\) folds).

We illustrate with the polynomial regression models of varying degree for the `Auto` data (discussed previously), beginning by fitting and saving the 10 models:

```
for (p in 1:10){
  assign(paste0("m.", p),
         lm(mpg ~ poly(horsepower, p), data=Auto))
}
objects(pattern="m\\.[0-9]")
#> [1] "m.1" "m.10" "m.2" "m.3" "m.4" "m.5" "m.6" "m.7" "m.8" "m.9"
summary(m.2) # for example, the quadratic fit
#>
#> Call:
#> lm(formula = mpg ~ poly(horsepower, p), data = Auto)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -14.714 -2.594 -0.086 2.287 15.896
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 23.446 0.221 106.1 <2e-16 ***
#> poly(horsepower, p)1 -120.138 4.374 -27.5 <2e-16 ***
#> poly(horsepower, p)2 44.090 4.374 10.1 <2e-16 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 4.37 on 389 degrees of freedom
#> Multiple R-squared: 0.688, Adjusted R-squared: 0.686
#> F-statistic: 428 on 2 and 389 DF, p-value: <2e-16
```

We then apply `cv()` to the list of 10 models (the `data` argument is required):

```
# 10-fold CV
cv.auto.10 <- cv(models(m.1, m.2, m.3, m.4, m.5,
                        m.6, m.7, m.8, m.9, m.10),
                 data=Auto, seed=2120)
cv.auto.10[1:2] # for the linear and quadratic models
#>
#> Model model.1:
#> 10-Fold Cross Validation
#> method: Woodbury
#> cross-validation criterion = 24.246
#> bias-adjusted cross-validation criterion = 24.23
#> full-sample criterion = 23.944
#>
#> Model model.2:
#> 10-Fold Cross Validation
#> method: Woodbury
#> cross-validation criterion = 19.346
#> bias-adjusted cross-validation criterion = 19.327
#> full-sample criterion = 18.985
# LOO CV
cv.auto.loo <- cv(models(m.1, m.2, m.3, m.4, m.5,
                         m.6, m.7, m.8, m.9, m.10),
                  data=Auto, k="loo")
cv.auto.loo[1:2] # linear and quadratic models
#>
#> Model model.1:
#> n-Fold Cross Validation
#> method: hatvalues
#> cross-validation criterion = 24.232
#> Model model.2:
#> n-Fold Cross Validation
#> method: hatvalues
#> cross-validation criterion = 19.248
```

Because we didn’t supply names for the models in the calls to the `models()` function, the names `model.1`, `model.2`, etc., are generated by the function.

Finally, we extract and graph the adjusted MSEs for \(10\)-fold CV and the MSEs for LOO CV:

```
cv.mse.10 <- sapply(cv.auto.10, function(x) x[["adj CV crit"]])
cv.mse.loo <- sapply(cv.auto.loo, function(x) x[["CV crit"]])
plot(c(1, 10), range(cv.mse.10, cv.mse.loo), type="n",
     xlab="Degree of polynomial, p",
     ylab="Cross-Validated MSE")
lines(1:10, cv.mse.10, lwd=2, lty=1, col=2, pch=16, type="b")
lines(1:10, cv.mse.loo, lwd=2, lty=2, col=3, pch=17, type="b")
legend("topright", inset=0.02,
       legend=c("10-Fold CV", "LOO CV"),
       lwd=2, lty=2:1, col=3:2, pch=17:16)
```

Alternatively, we can use the `plot()` method for `"cvModList"` objects to compare the models, though with separate graphs for 10-fold and LOO CV:

```
plot(cv.auto.10, main="Polynomial Regressions, 10-Fold CV",
     axis.args=list(labels=1:10), xlab="Degree of Polynomial, p")
plot(cv.auto.loo, main="Polynomial Regressions, LOO CV",
     axis.args=list(labels=1:10), xlab="Degree of Polynomial, p")
```

In this example, 10-fold and LOO CV produce generally similar results, and also results that are similar to those produced by the estimated error variance \(\widehat{\sigma}^2\) for each model, reported above (except for the highest-degree polynomials, where the CV results more clearly suggest over-fitting).

## The `Mroz` data

The `Mroz` data set from the **carData** package (associated with Fox & Weisberg, 2019) has been used by several authors to illustrate binary logistic regression; see, in particular, Fox & Weisberg (2019). The data were originally drawn from the U.S. Panel Study of Income Dynamics and pertain to married women. Here are a few cases in the data set:

```
data("Mroz", package="carData")
head(Mroz, 3)
#>   lfp k5 k618 age wc hc    lwg   inc
#> 1 yes  1    0  32 no no 1.2102 10.91
#> 2 yes  0    2  30 no no 0.3285 19.50
#> 3 yes  1    3  35 no no 1.5141 12.04
tail(Mroz, 3)
#>     lfp k5 k618 age wc hc     lwg    inc
#> 751  no  0    0  43 no no 0.88814  9.952
#> 752  no  0    0  60 no no 1.22497 24.984
#> 753  no  0    3  39 no no 0.85321 28.363
```

The response variable in the logistic regression is `lfp`, labor-force participation, a factor coded `"yes"` or `"no"`. The remaining variables are predictors:

- `k5`, number of children 5 years old or younger in the woman’s household;
- `k618`, number of children between 6 and 18 years old;
- `age`, in years;
- `wc`, wife’s college attendance, `"yes"` or `"no"`;
- `hc`, husband’s college attendance;
- `lwg`, the woman’s log wage rate if she is employed, or her *imputed* wage rate if she is not (a variable that Fox & Weisberg, 2019 show is problematically defined); and
- `inc`, family income, in $1000s, exclusive of the wife’s income.

We use the `glm()` function to fit a binary logistic regression to the `Mroz` data:

```
m.mroz <- glm(lfp ~ ., data=Mroz, family=binomial)
summary(m.mroz)
#>
#> Call:
#> glm(formula = lfp ~ ., family = binomial, data = Mroz)
#>
#> Coefficients:
#> Estimate Std. Error z value Pr(>|z|)
#> (Intercept) 3.18214 0.64438 4.94 7.9e-07 ***
#> k5 -1.46291 0.19700 -7.43 1.1e-13 ***
#> k618 -0.06457 0.06800 -0.95 0.34234
#> age -0.06287 0.01278 -4.92 8.7e-07 ***
#> wcyes 0.80727 0.22998 3.51 0.00045 ***
#> hcyes 0.11173 0.20604 0.54 0.58762
#> lwg 0.60469 0.15082 4.01 6.1e-05 ***
#> inc -0.03445 0.00821 -4.20 2.7e-05 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> (Dispersion parameter for binomial family taken to be 1)
#>
#> Null deviance: 1029.75 on 752 degrees of freedom
#> Residual deviance: 905.27 on 745 degrees of freedom
#> AIC: 921.3
#>
#> Number of Fisher Scoring iterations: 4
BayesRule(ifelse(Mroz$lfp == "yes", 1, 0),
          fitted(m.mroz, type="response"))
#> [1] 0.30677
#> attr(,"casewise loss")
#> [1] "y != round(yhat)"
```

In addition to the usual summary output for a GLM, we show the result of applying the `BayesRule()` function from the **cv** package to predictions derived from the fitted model. Bayes rule predicts a “success” in a binary regression model when the fitted probability of success [i.e., \(\phi = \Pr(y = 1)\)] is \(\widehat{\phi} \ge .5\) and a “failure” if \(\widehat{\phi} \lt .5\).^{4} The first argument to `BayesRule()` is the binary {0, 1} response, and the second argument is the predicted probability of success. `BayesRule()` returns the proportion of predictions that are *in error*, as appropriate for a “cost” function.

The value returned by `BayesRule()` is associated with an “attribute” named `"casewise loss"` and set to `"y != round(yhat)"`, signifying that the Bayes rule CV criterion is computed as the mean of casewise values, here 0 if the prediction for a case matches the observed value and 1 if it does not (signifying a prediction error). The `mse()` function for numeric responses is also calculated as a casewise average. Some other criteria, such as the median absolute error, computed by the `medAbsErr()` function in the **cv** package, aren’t averages of casewise components. The distinction is important because, to our knowledge, the statistical theory of cross-validation, for example, in Davison & Hinkley (1997), S. Bates, Hastie, & Tibshirani (2023), and Arlot & Celisse (2010), is developed for CV criteria like MSE that are means of casewise components. As a consequence, we limit computation of bias adjustment and confidence intervals (see below) to criteria that are casewise averages.

In this example, the fitted logistic regression incorrectly predicts 31% of the responses; we expect this estimate to be optimistic given that the model is used to “predict” the data to which it is fit.
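Because the Bayes-rule criterion is just the casewise misclassification rate, the full-sample value can be reproduced directly from the fitted model:

```
# Bayes rule applied directly: predict "yes" when the fitted
# probability is at least 0.5, then count prediction errors
phat <- fitted(m.mroz)              # fitted probabilities
y <- as.numeric(Mroz$lfp == "yes")  # 0/1 response
mean(y != as.numeric(phat >= 0.5))  # proportion in error, about 0.31
```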

The `"glm"` method for `cv()` is largely similar to the `"lm"` method, although the default algorithm, selected explicitly by `method="exact"`, refits the model with each fold removed (and is thus equivalent to `method="naive"` for `"lm"` models). For generalized linear models, `method="Woodbury"` or (for LOO CV) `method="hatvalues"` provide approximate results (see the last section of the vignette for details):

```
cv(m.mroz, criterion=BayesRule, seed=248)
#> R RNG seed set to 248
#> 10-Fold Cross Validation
#> criterion: BayesRule
#> cross-validation criterion = 0.32404
#> bias-adjusted cross-validation criterion = 0.31952
#> 95% CI for bias-adjusted CV criterion = (0.28607, 0.35297)
#> full-sample criterion = 0.30677
cv(m.mroz, criterion=BayesRule, seed=248, method="Woodbury")
#> R RNG seed set to 248
#> 10-Fold Cross Validation
#> method: Woodbury
#> criterion: BayesRule
#> cross-validation criterion = 0.32404
#> bias-adjusted cross-validation criterion = 0.31926
#> 95% CI for bias-adjusted CV criterion = (0.28581, 0.35271)
#> full-sample criterion = 0.30677
```

To ensure that the two methods use the same 10 folds, we specify the seed for R’s random-number generator explicitly; here, as is common in our experience, the `"exact"` and `"Woodbury"` algorithms produce nearly identical results. The CV estimates of prediction error are slightly higher than the estimate based on all of the cases.

The printed output includes a 95% confidence interval for the bias-adjusted Bayes rule CV criterion. S. Bates et al. (2023) show that these confidence intervals are unreliable for models fit to small samples, and by default `cv()` computes them only when the sample size is 400 or larger and when the CV criterion employed is an average of casewise components, as is the case for Bayes rule. See the final section of the vignette for details of the computation of confidence intervals for bias-adjusted CV criteria.

Here are results of applying LOO CV to the Mroz model, using both the exact and the approximate methods:

```
cv(m.mroz, k="loo", criterion=BayesRule)
#> n-Fold Cross Validation
#> criterion: BayesRule
#> cross-validation criterion = 0.32005
#> bias-adjusted cross-validation criterion = 0.3183
#> 95% CI for bias-adjusted CV criterion = (0.28496, 0.35164)
#> full-sample criterion = 0.30677
cv(m.mroz, k="loo", criterion=BayesRule, method="Woodbury")
#> n-Fold Cross Validation
#> method: Woodbury
#> criterion: BayesRule
#> cross-validation criterion = 0.32005
#> bias-adjusted cross-validation criterion = 0.3183
#> 95% CI for bias-adjusted CV criterion = (0.28496, 0.35164)
#> full-sample criterion = 0.30677
cv(m.mroz, k="loo", criterion=BayesRule, method="hatvalues")
#> n-Fold Cross Validation
#> method: hatvalues
#> criterion: BayesRule
#> cross-validation criterion = 0.32005
```

To the number of decimal digits shown, the three methods produce identical results for this example.

As for linear models, we report some timings for the various `cv()` methods of computation in LOO CV, as well as for the `cv.glm()` function from the **boot** package (which, recall, refits the model with each case removed, and thus is comparable to `cv()` with `method="exact"`):

```
microbenchmark::microbenchmark(
  hatvalues=cv(m.mroz, k="loo", criterion=BayesRule, method="hatvalues"),
  Woodbury=cv(m.mroz, k="loo", criterion=BayesRule, method="Woodbury"),
  exact=cv(m.mroz, k="loo", criterion=BayesRule),
  cv.glm=boot::cv.glm(Mroz, m.mroz, cost=BayesRule),
  times=10)
#> Unit: milliseconds
#> expr min lq mean median uq max neval
#> hatvalues 1.3145 1.3296 1.6209 1.3431 1.3785 4.0053 10
#> Woodbury 37.0752 39.6014 40.5334 40.0595 40.5473 47.5972 10
#> exact 1714.0892 1747.4177 1782.3877 1785.8192 1813.3066 1848.2108 10
#> cv.glm 2003.4197 2032.1928 2066.9809 2072.7632 2104.7543 2115.3950 10
#> cld
#> a
#> b
#> c
#> d
```

There is a substantial time penalty associated with exact computations.

The fundamental analogy for cross-validation is to the collection of new data. That is, predicting the response in each fold from the model fit to data in the other folds is like using the model fit to all of the data to predict the response for new cases from the values of the predictors for those new cases. As we explained, the application of this idea to independently sampled cases is straightforward—simply partition the data into random folds of equal size and leave each fold out in turn, or, in the case of LOO CV, simply omit each case in turn.

In contrast, mixed-effects models are fit to *dependent* data, in which cases are clustered, as in hierarchical data, where the clusters comprise higher-level units (e.g., students clustered in schools), or longitudinal data, where the clusters are individuals and the cases are repeated observations on the individuals over time.^{5}

We can think of two approaches to applying cross-validation to clustered data:^{6}

1. Treat CV as analogous to predicting the response for one or more cases in a *newly observed cluster*. In this instance, the folds comprise one or more whole clusters; we refit the model with all of the cases in clusters in the current fold removed; and then we predict the response for the cases in clusters in the current fold. These predictions are based only on fixed effects because the random effects for the omitted clusters are presumably unknown, as they would be for data on cases in newly observed clusters.
2. Treat CV as analogous to predicting the response for a newly observed case in an *existing cluster*. In this instance, the folds comprise one or more individual cases, and the predictions can use both the fixed and random effects.
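In **lme4**, these two kinds of prediction correspond to the `re.form` argument of the `predict()` method for mixed models; a sketch, assuming a fitted `lmer` model `m` and a data frame `newdat` of cases to predict:

```
# New cluster: fixed effects only; the random effects for an
# unseen cluster are unknown, so they are set to zero
p.fixed <- predict(m, newdata=newdat, re.form = NA)

# Existing cluster: fixed effects plus the estimated random
# effects (BLUPs) for each case's own cluster
p.blup <- predict(m, newdata=newdat, re.form = NULL)
```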

Following their use by Raudenbush & Bryk
(2002), data from the 1982 *High School and Beyond* (HSB)
survey have become a staple of the literature on mixed-effects models.
The HSB data are used by Fox & Weisberg
(2019, sec. 7.2.2) to illustrate the application of linear mixed
models to hierarchical data, and we’ll closely follow their example
here.

The HSB data are included in the `MathAchieve` and `MathAchSchool` data sets in the **nlme** package (Pinheiro & Bates, 2000). `MathAchieve` includes individual-level data on 7185 students in 160 high schools, and `MathAchSchool` includes school-level data:

```
data("MathAchieve", package="nlme")
dim(MathAchieve)
#> [1] 7185 6
head(MathAchieve, 3)
#> Grouped Data: MathAch ~ SES | School
#>   School Minority    Sex    SES MathAch MEANSES
#> 1   1224       No Female -1.528   5.876  -0.428
#> 2   1224       No Female -0.588  19.708  -0.428
#> 3   1224       No   Male -0.528  20.349  -0.428
tail(MathAchieve, 3)
#> Grouped Data: MathAch ~ SES | School
#>      School Minority    Sex    SES MathAch MEANSES
#> 7183   9586       No Female  1.332  19.641   0.627
#> 7184   9586       No Female -0.008  16.241   0.627
#> 7185   9586       No Female  0.792  22.733   0.627
data("MathAchSchool", package="nlme")
dim(MathAchSchool)
#> [1] 160 7
head(MathAchSchool, 2)
#>      School Size   Sector PRACAD DISCLIM HIMINTY MEANSES
#> 1224   1224  842   Public   0.35   1.597       0  -0.428
#> 1288   1288 1855   Public   0.27   0.174       0   0.128
tail(MathAchSchool, 2)
#>      School Size   Sector PRACAD DISCLIM HIMINTY MEANSES
#> 9550   9550 1532   Public   0.45   0.791       0   0.059
#> 9586   9586  262 Catholic   1.00  -2.416       0   0.627
```

The first few students are in school number 1224 and the last few in school 9586.

We’ll use only the `School`, `SES` (students’ socioeconomic status), and `MathAch` (their score on a standardized math-achievement test) variables in the `MathAchieve` data set, and `Sector` (`"Catholic"` or `"Public"`) in the `MathAchSchool` data set.

Some data-management is required before fitting a mixed-effects model
to the HSB data, for which we use the **dplyr** package
(Wickham, François, Henry, Müller, & Vaughan,
2023):

```
library("dplyr")
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
MathAchieve %>% group_by(School) %>%
  summarize(mean.ses = mean(SES)) -> Temp
Temp <- merge(MathAchSchool, Temp, by="School")
HSB <- merge(Temp[, c("School", "Sector", "mean.ses")],
             MathAchieve[, c("School", "SES", "MathAch")], by="School")
names(HSB) <- tolower(names(HSB))
HSB$cses <- with(HSB, ses - mean.ses)
```

In the process, we created two new variables: `mean.ses`, the average SES for students in each school; and `cses`, each student’s SES centered at the average for their school. For details, see Fox & Weisberg (2019, sec. 7.2.2).

Still following Fox and Weisberg, we proceed to use the `lmer()` function in the **lme4** package (D. Bates, Mächler, Bolker, & Walker, 2015) to fit a mixed model for math achievement to the HSB data:

```
library("lme4")
#> Loading required package: Matrix
hsb.lmer <- lmer(mathach ~ mean.ses*cses + sector*cses
                 + (cses | school), data=HSB)
summary(hsb.lmer, correlation=FALSE)
#> Linear mixed model fit by REML ['lmerMod']
#> Formula: mathach ~ mean.ses * cses + sector * cses + (cses | school)
#> Data: HSB
#>
#> REML criterion at convergence: 46504
#>
#> Scaled residuals:
#> Min 1Q Median 3Q Max
#> -3.159 -0.723 0.017 0.754 2.958
#>
#> Random effects:
#> Groups Name Variance Std.Dev. Corr
#> school (Intercept) 2.380 1.543
#> cses 0.101 0.318 0.39
#> Residual 36.721 6.060
#> Number of obs: 7185, groups: school, 160
#>
#> Fixed effects:
#> Estimate Std. Error t value
#> (Intercept) 12.128 0.199 60.86
#> mean.ses 5.333 0.369 14.45
#> cses 2.945 0.156 18.93
#> sectorCatholic 1.227 0.306 4.00
#> mean.ses:cses 1.039 0.299 3.48
#> cses:sectorCatholic -1.643 0.240 -6.85
```

We can then cross-validate at the cluster (i.e., school) level,

```
cv(hsb.lmer, k=10, clusterVariables="school", seed=5240)
#> R RNG seed set to 5240
#> 10-Fold Cross Validation based on 160 {school} clusters
#> cross-validation criterion = 39.157
#> bias-adjusted cross-validation criterion = 39.148
#> 95% CI for bias-adjusted CV criterion = (38.066, 40.231)
#> full-sample criterion = 39.006
```

or at the case (i.e., student) level,

```
cv(hsb.lmer, seed=1575)
#> R RNG seed set to 1575
#> Warning in checkConv(attr(opt, "derivs"), opt$par, ctrl = control$checkConv, :
#> Model failed to converge with max|grad| = 0.00587228 (tol = 0.002, component 1)
#> boundary (singular) fit: see help('isSingular')
#> 10-Fold Cross Validation
#> cross-validation criterion = 37.445
#> bias-adjusted cross-validation criterion = 37.338
#> 95% CI for bias-adjusted CV criterion = (36.288, 38.388)
#> full-sample criterion = 36.068
```

For cluster-level CV, the `clusterVariables` argument tells `cv()` how the clusters are defined. Were there more than one clustering variable, say classes within schools, these would be provided as a character vector of variable names: `clusterVariables = c("school", "class")`. For cluster-level CV, the default is `k = "loo"`, that is, leave one cluster out at a time; we instead specify `k = 10` folds of clusters, each fold therefore comprising \(160/10 = 16\) schools.

If the `clusterVariables` argument is omitted, then case-level CV is employed, with `k = 10` folds as the default, here each with \(7185/10 \approx 719\) students. Notice that one of the 10 models refit with a fold removed failed to converge. Convergence problems are common in mixed-effects modeling. The apparent issue here is that an estimated variance component is close to or equal to 0, which is at a boundary of the parameter space. That shouldn’t disqualify the fitted model for the kind of prediction required for cross-validation.

There is also a `cv()` method for linear mixed models fit by the `lme()` function in the **nlme** package, and the arguments for `cv()` in this case are the same as for a model fit by `lmer()` or `glmer()`. We illustrate with the mixed model fit to the HSB data:

```
library("nlme")
#>
#> Attaching package: 'nlme'
#> The following object is masked from 'package:lme4':
#>
#> lmList
#> The following object is masked from 'package:dplyr':
#>
#> collapse
hsb.lme <- lme(mathach ~ mean.ses*cses + sector*cses,
               random = ~ cses | school, data=HSB,
               control=list(opt="optim"))
summary(hsb.lme)
#> Linear mixed-effects model fit by REML
#> Data: HSB
#> AIC BIC logLik
#> 46525 46594 -23252
#>
#> Random effects:
#> Formula: ~cses | school
#> Structure: General positive-definite, Log-Cholesky parametrization
#> StdDev Corr
#> (Intercept) 1.541177 (Intr)
#> cses 0.018174 0.006
#> Residual 6.063492
#>
#> Fixed effects: mathach ~ mean.ses * cses + sector * cses
#> Value Std.Error DF t-value p-value
#> (Intercept) 12.1282 0.19920 7022 60.886 0e+00
#> mean.ses 5.3367 0.36898 157 14.463 0e+00
#> cses 2.9421 0.15122 7022 19.456 0e+00
#> sectorCatholic 1.2245 0.30611 157 4.000 1e-04
#> mean.ses:cses 1.0444 0.29107 7022 3.588 3e-04
#> cses:sectorCatholic -1.6421 0.23312 7022 -7.044 0e+00
#> Correlation:
#> (Intr) men.ss cses sctrCt mn.ss:
#> mean.ses 0.256
#> cses 0.000 0.000
#> sectorCatholic -0.699 -0.356 0.000
#> mean.ses:cses 0.000 0.000 0.295 0.000
#> cses:sectorCatholic 0.000 0.000 -0.696 0.000 -0.351
#>
#> Standardized Within-Group Residuals:
#> Min Q1 Med Q3 Max
#> -3.170106 -0.724877 0.014892 0.754263 2.965498
#>
#> Number of Observations: 7185
#> Number of Groups: 160
cv(hsb.lme, k=10, clusterVariables="school", seed=5240)
#> R RNG seed set to 5240
#> 10-Fold Cross Validation based on 160 {school} clusters
#> cross-validation criterion = 39.157
#> bias-adjusted cross-validation criterion = 39.149
#> 95% CI for bias-adjusted CV criterion = (38.066, 40.232)
#> full-sample criterion = 39.006
cv(hsb.lme, seed=1575)
#> R RNG seed set to 1575
#> 10-Fold Cross Validation
#> cross-validation criterion = 37.442
#> bias-adjusted cross-validation criterion = 37.402
#> 95% CI for bias-adjusted CV criterion = (36.351, 38.453)
#> full-sample criterion = 36.147
```

We used the same random-number generator seeds as in the previous
example cross-validating the model fit by `lmer()`, and so the same
folds are employed in both cases.^{7} The estimated covariance
components and fixed effects in the summary output differ slightly
between the `lmer()` and `lme()` solutions, although both functions
seek to maximize the REML criterion. This is, of course, to be expected
when different algorithms are used for numerical optimization. To the
precision reported, the cluster-level CV results for the `lmer()` and
`lme()` models are identical, while the case-level CV results are very
similar but not identical.

We introduce an artificial data set that exemplifies aspects of cross-validation particular to hierarchical models. Using this data set, we show that model comparisons employing cluster-based and those employing case-based cross-validation may not agree on a “best” model. Furthermore, commonly used measures of fit, such as mean-squared error, do not necessarily become smaller as models become larger, even when the models are nested, and even when the measure of fit is computed for the whole data set.

Consider a researcher studying improvement in a skill, yodeling, for example, among students enrolled in a four-year yodeling program. The plan is to measure each student’s skill level at the beginning of the program and every year thereafter until the end of the program, resulting in five annual measurements for each student. It turns out that yodeling appeals to students of all ages, and students enrolling in the program range in age from 20 to 70. Moreover, participants’ untrained yodeling skill is similar at all ages, as is their rate of progress with training. All students complete the four-year program.

The researcher, who has more expertise in yodeling than in modeling, decides to model the response, \(y\), yodeling skill, as a function of age, \(x\), reasoning that students get older during their stay in the program, and (incorrectly) that age can serve as a proxy for elapsed time. The researcher knows that a mixed model should be used to account for clustering due to the expected similarity of measurements taken from each student.

We start by generating the data, using parameters consistent with the
description above and meant to highlight the issues that arise in
cross-validating mixed-effects models:^{8}

```
# Parameters:
set.seed(9693)
Nb <- 100       # number of groups
Nw <- 5         # number of individuals within groups
Bb <- 0         # between-group regression coefficient on group mean
SDre <- 2.0     # between-group SD of random level relative to group mean of x
SDwithin <- 0.5 # within-group SD
Bw <- 1         # within-group effect of x
Ay <- 10        # intercept for response
Ax <- 20        # starting level of x
Nx <- Nw*10     # number of distinct x values
Data <- data.frame(
  group = factor(rep(1:Nb, each=Nw)),
  x = Ax + rep(1:Nx, length.out = Nw*Nb)
) |>
  within(
    {
      xm <- ave(x, group, FUN = mean) # within-group mean
      y <- Ay +
        Bb * xm +                     # contextual effect
        Bw * (x - xm) +               # within-group effect
        rnorm(Nb, sd=SDre)[group] +   # random level by group
        rnorm(Nb*Nw, sd=SDwithin)     # random error within groups
    }
  )
```

Here is a scatterplot of the data for a representative subset of 10 of
the 100 students (without loss of generality, the first 10), showing the
95% concentration ellipse for each cluster:^{9}

```
library("lattice")
library("latticeExtra")
plot <- xyplot(y ~ x, data=Data[1:Nx, ], group=group,
               ylim=c(4, 16),
               par.settings=list(superpose.symbol=list(pch=1, cex=0.7))) +
  layer(panel.ellipse(..., center.cex=0))
plot # display graph
```

The between-student effect of age is 0 but the within-student effect is 1. Due to the large variation in ages between students, the least-squares regression of yodeling skill on age (for the 500 observations among all 100 students) produces an estimated slope close to 0 (though with a small \(p\)-value), because the slope is heavily weighted toward the between-student effect:

```
summary(lm(y ~ x, data=Data))
#> 
#> Call:
#> lm(formula = y ~ x, data = Data)
#> 
#> Residuals:
#>    Min     1Q Median     3Q    Max 
#> -5.771 -1.658 -0.089  1.552  7.624 
#> 
#> Coefficients:
#>             Estimate Std. Error t value Pr(>|t|)
#> (Intercept)  9.05043    0.34719   26.07   <2e-16 ***
#> x            0.02091    0.00727    2.87   0.0042 ** 
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 2.35 on 498 degrees of freedom
#> Multiple R-squared:  0.0163, Adjusted R-squared:  0.0143
#> F-statistic: 8.26 on 1 and 498 DF,  p-value: 0.00422
```

The initial mixed-effects model that we fit to the data is a simple random-intercepts model:

```
# random intercept only:
mod.0 <- lmer(y ~ 1 + (1 | group), Data)
summary(mod.0)
#> Linear mixed model fit by REML ['lmerMod']
#> Formula: y ~ 1 + (1 | group)
#>    Data: Data
#> 
#> REML criterion at convergence: 2103.1
#> 
#> Scaled residuals:
#>     Min      1Q  Median      3Q     Max 
#> -2.0351 -0.7264 -0.0117  0.7848  2.0438 
#> 
#> Random effects:
#>  Groups   Name        Variance Std.Dev.
#>  group    (Intercept) 2.90     1.70
#>  Residual             2.71     1.65
#> Number of obs: 500, groups:  group, 100
#> 
#> Fixed effects:
#>             Estimate Std. Error t value
#> (Intercept)   10.002      0.186    53.9
```

We will shortly consider three other, more complex, mixed models; because of data-management considerations, it is convenient to fit them now, but we defer discussion of these models:

```
# effect of x and random intercept:
mod.1 <- lmer(y ~ x + (1 | group), Data)
# effect of x, contextual (student) mean of x, and random intercept:
mod.2 <- lmer(y ~ x + xm + (1 | group), Data)
# equivalent to y ~ I(x - xm) + xm + (1 | group)
# model generating the data (where Bb = 0)
mod.3 <- lmer(y ~ I(x - xm) + (1 | group), Data)
```

We proceed to obtain predictions from the random-intercept model
(`mod.0`) and the other models (`mod.1`, `mod.2`, and `mod.3`), based
on fixed effects alone, as would be used for cross-validation based on
clusters (i.e., students), and on fixed and random effects—so-called
best linear unbiased predictions or BLUPs—as would be used for
cross-validation based on cases (i.e., occasions within students):

```
Data <- within(Data, {
  fit_mod0.fe <- predict(mod.0, re.form = ~ 0) # fixed effects only
  fit_mod0.re <- predict(mod.0)                # fixed and random effects (BLUPs)
  fit_mod1.fe <- predict(mod.1, re.form = ~ 0)
  fit_mod1.re <- predict(mod.1)
  fit_mod2.fe <- predict(mod.2, re.form = ~ 0)
  fit_mod2.re <- predict(mod.2)
  fit_mod3.fe <- predict(mod.3, re.form = ~ 0)
  fit_mod3.re <- predict(mod.3)
})
```

We then prepare the data for plotting:

```
Data_long <- reshape(Data[1:Nx, ], direction = "long", sep = ".",
                     timevar = "effect",
                     varying = grep("\\.", names(Data[1:Nx, ])))
Data_long$id <- 1:nrow(Data_long)
Data_long <- reshape(Data_long, direction = "long", sep = "_",
                     timevar = "modelcode",
                     varying = grep("_", names(Data_long)))
Data_long$model <- factor(
  c("~ 1", "~ 1 + x", "~ 1 + x + xm", "~ 1 + I(x - xm)")
  [match(Data_long$modelcode, c("mod0", "mod1", "mod2", "mod3"))]
)
```

Predictions based on the random-intercept model `mod.0` for the first
10 students are shown in the following graph:

```
(plot +
   xyplot(fit ~ x, subset(Data_long, modelcode == "mod0" & effect == "fe"),
          groups=group, type="l", lwd=2) +
   xyplot(fit ~ x, subset(Data_long, modelcode == "mod0" & effect == "re"),
          groups=group, type="l", lwd=2, lty=3)
) |> update(
  main="Model: y ~ 1 + (1 | group)",
  key=list(
    corner=c(0.05, 0.05),
    text=list(c("fixed effects only", "fixed and random")),
    lines=list(lty=c(1, 3))))
```

The fixed-effect predictions for the various individuals are identical—the estimated fixed-effects intercept or estimated general mean of \(y\)—while the BLUPs are the sums of the fixed-effects intercept and the random intercepts, and are only slightly shrunken towards the general mean. Because in our artificial data there is no population relationship between age and skill, the fixed-effect-only predictions and the BLUPs are not very different.
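The degree of shrinkage just described can be checked directly. For a random-intercepts model with \(n_j\) cases per cluster, the BLUP shrinks each cluster mean toward the grand mean by the factor \(\lambda = \hat\sigma^2_b / (\hat\sigma^2_b + \hat\sigma^2 / n_j)\). A minimal sketch for `mod.0` follows; the intermediate variable names `vc`, `sigma2.b`, and `sigma2` are ours:

```r
vc <- as.data.frame(VarCorr(mod.0))
sigma2.b <- vc$vcov[1] # between-student (intercept) variance
sigma2   <- vc$vcov[2] # residual variance
# shrinkage factor with Nw = 5 occasions per student:
lambda <- sigma2.b/(sigma2.b + sigma2/Nw)
lambda # close to 1 (roughly 0.84 with the estimates shown above)
```

A value of \(\lambda\) near 1 means the BLUPs stay close to the cluster means, consistent with the mild shrinkage visible in the graph.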

Our next model, `mod.1`, includes a fixed intercept and a fixed effect
of `x`, along with a random intercept:

```
summary(mod.1)
#> Linear mixed model fit by REML ['lmerMod']
#> Formula: y ~ x + (1 | group)
#>    Data: Data
#> 
#> REML criterion at convergence: 1564.5
#> 
#> Scaled residuals:
#>     Min      1Q  Median      3Q     Max 
#> -2.9016 -0.6350  0.0188  0.5541  2.8293 
#> 
#> Random effects:
#>  Groups   Name        Variance Std.Dev.
#>  group    (Intercept) 192.941  13.890
#>  Residual               0.257   0.507
#> Number of obs: 500, groups:  group, 100
#> 
#> Fixed effects:
#>             Estimate Std. Error t value
#> (Intercept) -33.9189     1.5645   -21.7
#> x             0.9653     0.0158    61.0
#> 
#> Correlation of Fixed Effects:
#>   (Intr)
#> x -0.460
```

Predictions from this model appear in the following graph:

```
(plot +
   xyplot(fit ~ x, subset(Data_long, modelcode == "mod1" & effect == "fe"),
          groups=group, type="l", lwd=2) +
   xyplot(fit ~ x, subset(Data_long, modelcode == "mod1" & effect == "re"),
          groups=group, type="l", lwd=2, lty=3)
) |> update(
  main="Model: y ~ 1 + x + (1 | group)",
  ylim=c(-15, 35),
  key=list(
    corner=c(0.95, 0.05),
    text=list(c("fixed effects only", "fixed and random")),
    lines=list(lty=c(1, 3))))
```

The BLUPs fit the observed data very closely, but predictions based
on the fixed effects alone, with a common intercept and slope for all
clusters, are very poor—indeed, much worse than the fixed-effects-only
predictions based on the simpler random-intercept model, `mod.0`. We
therefore anticipate (and show later in this section) that case-based
cross-validation will prefer `mod.1` to `mod.0`, but that cluster-based
cross-validation will prefer `mod.0` to `mod.1`.

Our third model, `mod.2`, includes the contextual effect of
\(x\)—that is, the cluster mean `xm`—along with \(x\) and the
intercept in the fixed-effect part of the model, and a random
intercept:

```
summary(mod.2)
#> Linear mixed model fit by REML ['lmerMod']
#> Formula: y ~ x + xm + (1 | group)
#>    Data: Data
#> 
#> REML criterion at convergence: 1169.2
#> 
#> Scaled residuals:
#>     Min      1Q  Median      3Q     Max 
#> -2.9847 -0.6375  0.0019  0.5568  2.7325 
#> 
#> Random effects:
#>  Groups   Name        Variance Std.Dev.
#>  group    (Intercept) 3.399    1.844
#>  Residual             0.255    0.505
#> Number of obs: 500, groups:  group, 100
#> 
#> Fixed effects:
#>             Estimate Std. Error t value
#> (Intercept)   9.4787     0.6171    15.4
#> x             0.9915     0.0160    62.1
#> xm           -0.9800     0.0206   -47.7
#> 
#> Correlation of Fixed Effects:
#>    (Intr) x
#> x   0.000
#> xm -0.600 -0.777
```

This model is equivalent to fitting
`y ~ I(x - xm) + xm + (1 | group)`, which is the model that generated
the data once the coefficient of the contextual predictor `xm` is set
to 0 (as it is in `mod.3`, discussed below).
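Because the two parameterizations are linearly equivalent, this claim is easy to verify numerically. A sketch follows; the refit object `mod.2a` is ours:

```r
# refit mod.2 in the centered parameterization:
mod.2a <- lmer(y ~ I(x - xm) + xm + (1 | group), Data)
# the two fits should produce the same predictions,
# up to the optimizer's convergence tolerance:
all.equal(fitted(mod.2), fitted(mod.2a))
```

The fixed-effect model matrices of the two specifications span the same column space, so REML arrives at the same fitted model in either form.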

Predictions from model `mod.2` appear in the following graph:

```
(plot +
   xyplot(fit ~ x, subset(Data_long, modelcode == "mod2" & effect == "fe"),
          groups=group, type="l", lwd=2) +
   xyplot(fit ~ x, subset(Data_long, modelcode == "mod2" & effect == "re"),
          groups=group, type="l", lwd=2, lty=3)
) |> update(
  main="Model: y ~ 1 + x + xm + (1 | group)",
  ylim=c(4, 16),
  key=list(
    corner=c(0.05, 0.05),
    text=list(c("fixed effects only", "fixed and random")),
    lines=list(lty=c(1, 3))))
```

Depending on the estimated variance parameters of the model, a mixed
model like `mod.2` will apply varying degrees of shrinkage to the
random-intercept BLUPs, which correspond to variation in the heights of
the parallel fitted lines for the individual students. In our contrived
data, `mod.2` applies little shrinkage, allowing substantial
variability in the heights of the fitted lines, which closely approach
the observed values for each student. The fit of the mixed model
`mod.2` is consequently similar to that of a fixed-effects model with
age and a categorical predictor for individual students (i.e., treating
students as a factor; not shown here).

The mixed model `mod.2` therefore fits individual observations well,
and we anticipate a favorable assessment from case-based
cross-validation. In contrast, the large variability in the BLUPs
produces larger residuals for predictions based on fixed effects alone,
and so we expect that cluster-based cross-validation won't show an
advantage for model `mod.2` compared to the smaller model `mod.0`,
which includes only fixed and random intercepts.
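These expectations can eventually be checked by cross-validating all four models on common folds. A sketch of the calls, using the `models()` helper from the **cv** package to collect the fitted models (the seed values here are arbitrary, and no output is shown):

```r
# cluster-based CV: folds are composed of whole students
cv(models(mod.0, mod.1, mod.2, mod.3),
   data=Data, clusterVariables="group", k=10, seed=6449)
# case-based CV: folds are composed of individual occasions
cv(models(mod.0, mod.1, mod.2, mod.3),
   data=Data, k=10, seed=9693)
```

Using the same seed within each comparison ensures that every model is evaluated on identical folds.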

Had the mixed model applied considerable shrinkage, then neither cluster-based nor case-based cross-validation would show much improvement over the random-intercept-only model. In our experience, the degree of shrinkage does not vary smoothly as parameters are changed but tends to be “all or nothing,” and near the tipping point, the behavior of estimates can be affected considerably by the choice of algorithm used to fit the model.

Finally, `mod.3` directly estimates the model used to generate the
data. As mentioned, it is a constrained version of `mod.2`, with the
coefficient of `xm` set to 0 and with `x` expressed as a deviation from
the cluster mean `xm`:

```
summary(mod.3)
#> Linear mixed model fit by REML ['lmerMod']
#> Formula: y ~ I(x - xm) + (1 | group)
#>    Data: Data
#> 
#> REML criterion at convergence: 1163.2
#> 
#> Scaled residuals:
#>     Min      1Q  Median      3Q     Max 
#> -2.9770 -0.6320  0.0063  0.5603  2.7249 
#> 
#> Random effects:
#>  Groups   Name        Variance Std.Dev.
#>  group    (Intercept) 3.391    1.842
#>  Residual             0.255    0.505
#> Number of obs: 500, groups:  group, 100
#> 
#> Fixed effects:
#>             Estimate Std. Error t value
#> (Intercept)   10.002      0.185    53.9
#> I(x - xm)      0.992      0.016    62.1
#> 
#> Correlation of Fixed Effects:
#>           (Intr)
#> I(x - xm) 0.000
```

The predictions from `mod.3` are therefore similar to those from
`mod.2`:

```
(plot +
   xyplot(fit ~ x, subset(Data_long, modelcode == "mod3" & effect == "fe"),
          groups=group, type="l", lwd=2) +
   xyplot(fit ~ x, subset(Data_long, modelcode == "mod3" & effect == "re"),
          groups=group, type="l", lwd=2, lty=3)
) |> update(
  main="Model: y ~ 1 + I(x - xm) + (1 | group)",
  ylim=c(4, 16),
  key=list(
    corner=c(0.05, 0.05),
    text=list(c("fixed effects only", "fixed and random")),
    lines=list(lty=c(1, 3))))
```