---
title: "Take a `moderndive` into introductory linear regression with R"
author: "Albert Y. Kim, Chester Ismay, and Max Kuhn"
date: "`r Sys.Date()`"
vignette: >
%\VignetteIndexEntry{Take a `moderndive` into introductory linear regression with R}
%\VignetteEncoding{UTF-8}
%\VignetteEngine{knitr::rmarkdown}
tags:
- R
- Rstats
- tidyverse
- regression
- moderndive
- modeling
- modelling
- kaggle
authors:
- name: Albert Y. Kim
orcid: 0000-0001-7824-306X
affiliation: 1
- name: Chester Ismay
orcid: 0000-0003-2820-2547
affiliation: 2
- name: Max Kuhn
orcid: 0000-0003-2402-136X
affiliation: 3
affiliations:
- name: Assistant Professor of Statistical and Data Sciences, Smith College, Northampton, MA, USA.
index: 1
- name: Data Science Evangelist, DataRobot, Portland, OR, USA.
index: 2
- name: Software Engineer, RStudio, USA.
index: 3
bibliography: paper.bib
# output:
# rticles::joss_article:
# keep_md: yes
# number_sections: yes
---
```{r, include = FALSE}
# knitr settings
knitr::opts_chunk$set(
# Code output:
warning = FALSE,
message = FALSE,
echo = TRUE,
# Figure:
out.width = "100%",
fig.width = 16 / 2.5,
fig.height = 9 / 2.5,
fig.align = "center",
fig.show = "hold",
# Etc:
collapse = TRUE,
comment = "##"
# tidy = FALSE
)
# Needed packages in vignette
library(moderndive)
library(ggplot2)
library(dplyr)
library(knitr)
library(broom)
# Needed packages internally
library(patchwork)
# Random number generator seed value
set.seed(76)
# Set ggplot defaults for rticles output:
if (!knitr::is_html_output()) {
# Grey theme:
theme_set(theme_light())
scale_colour_discrete <- ggplot2::scale_colour_viridis_d
}
# Set output width for rticles:
options(width = 70)
```
# Summary
We present the [`moderndive`](https://moderndive.github.io/moderndive/){target="_blank"} R package of datasets and functions for [tidyverse](https://www.tidyverse.org/)-friendly introductory linear regression [@tidyverse2019]. These tools leverage the well-developed `tidyverse` and `broom` packages to facilitate 1) working with regression tables that include confidence intervals, 2) accessing regression outputs on an observation level (e.g. fitted/predicted values and residuals), 3) inspecting scalar summaries of regression fit (e.g. $R^2$, $R^2_{adj}$, and mean squared error), and 4) visualizing parallel slopes regression models using `ggplot2`-like syntax [@R-ggplot2; @R-broom]. This R package is designed to supplement the book "Statistical Inference via Data Science: A ModernDive into R and the Tidyverse" [@ismay2019moderndive]. Note that the book is also available online at https://moderndive.com and is referred to as "ModernDive" for short.
# Statement of Need
Linear regression has long been a staple of introductory statistics courses. While the curricula of introductory statistics courses have evolved considerably of late, the overall importance of regression remains the same [@ASAGuidelines]. Furthermore, while the use of the R statistical programming language for statistical analysis is not new, recent developments such as the `tidyverse` suite of packages have made statistical computation with R accessible to a broader audience [@tidyverse2019]. We go one step further by leveraging the `tidyverse` and the `broom` packages to make linear regression accessible to students taking an introductory statistics course [@R-broom]. Such students are likely to be new to statistical computation with R; we designed `moderndive` with these students in mind.
# Introduction
Let's load all the R packages we are going to need.
```{r}
library(moderndive)
library(ggplot2)
library(dplyr)
library(knitr)
library(broom)
```
Let's consider data gathered from end-of-semester student evaluations for a sample of 463 courses taught by 94 professors at the University of Texas at Austin [@diez2015openintro]. These data are included in the `evals` data frame from the `moderndive` package.
```{r, echo=FALSE}
evals_sample <- evals %>%
select(ID, prof_ID, score, age, bty_avg, gender, ethnicity, language, rank) %>%
sample_n(5)
```
In the following table, we present a subset of `r ncol(evals_sample)` of the `r ncol(evals)` variables included for a random sample of `r nrow(evals_sample)` courses^[For details on the remaining `r ncol(evals) - ncol(evals_sample)` variables, see the help file by running `?evals`.]:
1. `ID` uniquely identifies the course whereas `prof_ID` identifies the professor who taught this course. This distinction is important since many professors taught more than one course.
1. `score` is the outcome variable of interest: average professor evaluation score out of 5 as given by the students in this course.
1. The remaining variables are demographic variables describing that course's instructor, including `bty_avg` (average "beauty" score) for that professor as given by a panel of 6 students.^[Note that `gender` was collected as a binary variable at the time of the study (2005).]
```{r random-sample-courses, echo=FALSE}
evals_sample %>%
kable()
```
## Regression analysis the "good old-fashioned" way
Let's fit a simple linear regression model of teaching `score` as a function of instructor `age` using the `lm()` function.
```{r}
score_model <- lm(score ~ age, data = evals)
```
Let's now study the output of the fitted model `score_model` "the good old-fashioned way": using `summary()`, which calls `summary.lm()` behind the scenes (we'll refer to them interchangeably throughout this paper).
```{r}
summary(score_model)
```
## Regression analysis using `moderndive`
As an improvement to base R's regression functions, we've included three functions in the `moderndive` package that take a fitted model object as input and return the same information as `summary.lm()`, but output it in a tidyverse-friendly format [@tidyverse2019]. As we'll see later, while these three functions are thin wrappers to existing functions in the `broom` package for converting statistical objects into tidy tibbles, we modified them with the introductory statistics student in mind [@R-broom].
1. Get a tidy regression table **with confidence intervals**:
```{r}
get_regression_table(score_model)
```
2. Get information on each point/observation in your regression, including fitted/predicted values and residuals, in a single data frame:
```{r}
get_regression_points(score_model)
```
3. Get scalar summaries of a regression fit including $R^2$ and $R^2_{adj}$ but also the (root) mean-squared error:
```{r}
get_regression_summaries(score_model)
```
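For the curious reader, the correspondence between these wrappers and their `broom` counterparts can be sketched as follows. This is a rough sketch, not the exact internals: the `moderndive` functions additionally rename columns and round values to be friendlier to introductory students.

```{r}
# Approximate broom counterparts of the three wrappers:
tidy(score_model, conf.int = TRUE) # ~ get_regression_table()
augment(score_model) # ~ get_regression_points()
glance(score_model) # ~ get_regression_summaries()
```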
Furthermore, say you would like to create a visualization of the relationship between two numerical variables and a third categorical variable with $k$ levels. Let's create this using a colored scatterplot via the `ggplot2` package for data visualization [@R-ggplot2]. Using `geom_smooth(method = "lm", se = FALSE)` yields a visualization of an *interaction model* where each of the $k$ regression lines has its own intercept and slope. For example in \autoref{fig:interaction-model}, we extend our previous regression model by now mapping the categorical variable `ethnicity` to the `color` aesthetic.
```{r interaction-model, fig.cap="Visualization of interaction model."}
# Code to visualize interaction model:
ggplot(evals, aes(x = age, y = score, color = ethnicity)) +
geom_point() +
geom_smooth(method = "lm", se = FALSE) +
labs(x = "Age", y = "Teaching score", color = "Ethnicity")
```
Many introductory statistics courses, however, start with the easier-to-teach "common slope, different intercepts" regression model, also known as the *parallel slopes* model. Unfortunately, no argument to plot such models exists within `geom_smooth()`.
[Evgeni Chasnovski](https://github.com/echasnovski){target="_blank"} thus wrote a custom `geom_` extension to `ggplot2` called `geom_parallel_slopes()`; this extension is included in the `moderndive` package. Much like `geom_smooth()` from the `ggplot2` package, you add `geom_parallel_slopes()` as a layer to the code, resulting in \autoref{fig:parallel-slopes-model}.
```{r parallel-slopes-model, fig.cap="Visualization of parallel slopes model."}
# Code to visualize parallel slopes model:
ggplot(evals, aes(x = age, y = score, color = ethnicity)) +
geom_point() +
geom_parallel_slopes(se = FALSE) +
labs(x = "Age", y = "Teaching score", color = "Ethnicity")
```
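The parallel slopes model visualized above corresponds to the additive regression model `score ~ age + ethnicity`: one common slope for `age` plus an intercept offset for each level of `ethnicity`. A minimal sketch of fitting this model and inspecting its coefficients (the variable name `score_model_parallel` is ours, not from the package):

```{r}
# Fit the corresponding additive model (common slope, different intercepts):
score_model_parallel <- lm(score ~ age + ethnicity, data = evals)
get_regression_table(score_model_parallel)
```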
# Repository README
In the GitHub repository README, we present an in-depth discussion of six features of the `moderndive` package:
1. Focus less on p-value stars and more on confidence intervals
2. Outputs as tibbles
3. Produce residual analysis plots from scratch using `ggplot2`
4. A quick-and-easy Kaggle predictive modeling competition submission!
5. Visual model selection: plot parallel slopes & interaction regression models
6. Produce metrics on the quality of regression model fits
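As a taste of item 3 above, here is one way a residual analysis plot might be built "from scratch" using the output of `get_regression_points()` (a sketch under our assumptions, not necessarily the README's exact code):

```{r}
# Sketch: residuals vs. fitted values from get_regression_points() output
score_model_points <- get_regression_points(score_model)
ggplot(score_model_points, aes(x = score_hat, y = residual)) +
  geom_point() +
  geom_hline(yintercept = 0, linetype = "dashed") +
  labs(x = "Fitted value", y = "Residual")
```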
Furthermore, we discuss the inner workings of the `moderndive` package:
1. It leverages the `broom` package in its wrappers.
1. It builds a custom `ggplot2` geometry, `geom_parallel_slopes()`, that allows for quick visualization of parallel slopes regression models.
# Author contributions
Albert Y. Kim and Chester Ismay contributed equally to the development of the `moderndive` package. Albert Y. Kim wrote a majority of the initial version of this manuscript with Chester Ismay writing the rest. Max Kuhn provided guidance and feedback at various stages of the package development and manuscript writing.
# Acknowledgments
Many thanks to Jenny Smetzer [\@smetzer180](https://github.com/smetzer180){target="_blank"}, Luke W. Johnston [\@lwjohnst86](https://github.com/lwjohnst86){target="_blank"}, and Lisa Rosenthal [\@lisamr](https://github.com/lisamr){target="_blank"} for their helpful feedback on this paper, and to Evgeni Chasnovski [\@echasnovski](https://github.com/echasnovski){target="_blank"} for contributing the `geom_parallel_slopes()` function via GitHub [pull request](https://github.com/moderndive/moderndive/pull/55){target="_blank"}. The authors do not have any financial support to disclose.
# References