---
title: "How to Use the unmconf Package"
output: html_document
vignette: >
%\VignetteIndexEntry{How to Use the unmconf Package}
%\VignetteEngine{knitr::rmarkdown}
%\VignetteEncoding{UTF-8}
---
```{r, include = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>"
)
```
# Introduction
`unmconf` is a package that employs a fully Bayesian hierarchical framework for modeling under the presence of unmeasured confounders using JAGS (Just Another Gibbs Sampler). The package handles both internal and external validation scenarios for up to two unmeasured confounders. Bayesian data analysis can be summarized in the following four steps: specifying the data model and prior, estimating model parameters, evaluating sampling quality and model fit, and summarizing and interpreting results. For users new to Markov Chain Monte Carlo (MCMC) software who wish to implement models involving unmeasured confounding, challenges arise in regard to understanding the syntax and how these programs handle missing data. The primary objective of `unmconf` is to address these challenges by creating a function that resembles `glm()` on the front end, while seamlessly implementing the necessary JAGS code on the back end. Functions are implemented to simplify the workflow using this model by acquiring data, modeling data, conducting diagnostics testing on the model, and analyzing results. With this package, users can perform robust fully Bayesian analyses, even without previous familiarity with JAGS syntax or data processing intricacies.
## Bayesian Unmeasured Confounding Model
For the statistical model, we denote the continuous or discrete response $Y$, the binary main exposure variable $X$, the vector of $p$ other perfectly observed covariates $\boldsymbol{C}$, and the unmeasured confounder(s) relating to both $Y$ and $X$ $U$. In the event of more unmeasured covariates, we denote them $U_{1}$, $U_{2}$, and so forth; these unmeasured confounders can be either binary or normally distributed.
In the scenario where there is a single unmeasured confounder, the joint distribution can be factorized as $f(y, u|x, \boldsymbol{c}) = f(y|x, u, \boldsymbol{c}) f(u|x, \boldsymbol{c})$. For the second unmeasured confounder, the joint distribution can be factorized as $f(y, u_{1}, u_{2}|x, \boldsymbol{c}) = f(y|x, \boldsymbol{c}, u_{1}, u_{2}) f(u_{1}|x, \boldsymbol{c}, u_{2}) f(u_{2}|x, \boldsymbol{c})$, giving the Bayesian unmeasured confounding model:
\begin{equation}\label{unmconf-two-unmeasured-confounders}
\begin{split}
Y|x, \boldsymbol{c}, u_{1}, u_{2} & \ \ \sim \ \ D_{Y}(g_Y^{-1}(\boldsymbol{v}'\boldsymbol{\beta} +\boldsymbol{\lambda}'\boldsymbol{u}), \ \xi_y) \\
U_{1}|x, \boldsymbol{c}, u_{2} & \ \ \sim \ \ D_{U_{1}}(g_{U_1}^{-1}(\boldsymbol{v}'\boldsymbol{\gamma} + \zeta u_{2}), \ \xi_{u_{1}}) \\
U_{2}|x, \boldsymbol{c} & \ \ \sim \ \ D_{U_{2}}(g_{U_2}^{-1}(\boldsymbol{v}'\boldsymbol{\delta}), \ \xi_{u_{2}}),
\end{split}
\end{equation}
where $\boldsymbol{v}' = [x~\boldsymbol{c}']'$ denotes the vector of the main exposure variable and all of the perfectly observed covariates and $\boldsymbol{u}' = [\boldsymbol{u_1}, \boldsymbol{u_2}]'$ denotes the vector of two unmeasured confounders in the response model. This model is completed by the specification of a link function $g_*$ and some family of distributions $D_{*}$. Additional parameters for certain distributions--if any--are denoted $\xi_{y}, \xi_{u_1}, \xi_{u_2}$. Examples of these would be $\sigma^2$ for the variance of a normal distribution or $\alpha$ for the shape parameter in the gamma distribution. `unmconf` allows for the user to work with a response from the normal, Poisson, gamma, or binomial distribution and unmeasured confounder(s) from the normal or binomial distribution. The package supports identity (normal), log (Poisson or gamma), and logit (Bernoulli) link functions.
### Prior Structure
Prior distributions for the model parameters will be jointly defined as $\pi(\boldsymbol{\theta})$, where $\boldsymbol{\theta} = (\boldsymbol{\beta}, \boldsymbol{\lambda}, \boldsymbol{\gamma}, \zeta, \boldsymbol{\delta}, \boldsymbol{\xi})$. The default prior structure is weakly informative. The regression coefficients have a relatively non-informative prior with a mean of 0 and precision (inverse of the variance) of 0.1 when the response is discrete. When the response is continuous, the regression coefficients have a relatively non-informative prior with a mean of 0 and precision of 0.001. To further customize the analysis, users can specify custom priors using the `priors` argument within the modeling function, `unm_glm()`. The format for specifying custom priors is `c("{parameter}[{covariate}]" = "{distribution}")`. An example of this is below.
Families specified for the response and unmeasured confounder(s) may present nuisance parameters, necessitating the inclusion of their prior distributions as well. The precision parameter, $\tau_*$, on a normal response or normal unmeasured confounder will have a Gamma(0.001, 0.001) as the default prior. Priors can also be elicited in terms of $\sigma_*$ or $\sigma_*^2$ through `priors`. The nuisance parameter, $\alpha_y$, for a gamma response has a gamma distribution as the prior with both scale and rate set to 0.1. The aforementioned nuisance parameters are tracked and posterior summaries are provided as a default setting in the package, but this can be modified.
# Generating Data
The function `runm()` is used to generate data. In the workflow of this package, this function is not required to use if one wishes to analyze a data set of interest. `runm()` can be used to perform a simulation study, if desired, and data is generated as follows. The perfectly measured covariates and unmeasured confounder(s) are generated independently of one another. The user can specify these variables' families and their respective distributions as named lists in `runm()`. Then, the data frame will be generated using the named vector of response model coefficients, $\boldsymbol{\beta}_R$, and treatment model coefficients, $\boldsymbol{\beta}_T$, that the user provides in the function. `runm()` will model the following:
```{=tex}
\begin{equation}
\begin{split}
\boldsymbol{C} & \ \ \sim \ \ D_{\boldsymbol{C}}(g^{-1}_{\boldsymbol{C}}(\theta_{\boldsymbol{C}})) \\
U_1 & \ \ \sim \ \ D_{U_1}(g^{-1}_{U_1}(\theta_{U_1}), \xi_{u_1}) \\
U_2 & \ \ \sim \ \ D_{U_2}(g^{-1}_{U_2}(\theta_{U_2}), \xi_{u_2}) \\
X | \boldsymbol{c}, u_{1}, u_{2} & \ \ \sim \ \ \text{Bernoulli}(\text{logit}^{-1}(\boldsymbol{w}'\boldsymbol{\beta}_T)) \\
Y| x, \boldsymbol{c}, u_{1}, u_{2} & \ \ \sim \ \ D_{Y}(g^{-1}_Y(\boldsymbol{z}'\boldsymbol{\beta_R}), \ \xi_y),
\end{split}
\end{equation}
```
where $\boldsymbol{w}' = [\boldsymbol{c}'~u_1~u_2]'$ and $\boldsymbol{z}' = [x~\boldsymbol{c}'~u_1~u_2]'$. All arguments in `runm()` have a default value assigned other than $n$. So, in its simplest form, one can generate a data set consisting of 100 observations by calling `runm(100)`. The default arguments can be customized to the user's preference if there is a desired data generation structure. An example of internal and external validation data is provided.
### Internal Validation
When the validation type is internal (default), the data will first be generated from some sample size `n`. Internal validation is available when the unmeasured confounder is ascertained for a, typically, small subset of the main study data. In certain cases, only internal validation data may be accessible and information on the unmeasured confounder is only known for a subset of the patients. A researcher can vary the argument `missing_prop` to determine how large validation studies need to be in a simulation. `missing_prop` will set the assigned proportion of the unmeasured confounder's observations to `NA`. A more detailed example is below, assigned `df`, and will be the data frame used throughout the remainder of this vignette.
```{r, message=FALSE}
library("unmconf")
library("bayesplot")
library("ggplot2"); theme_set(theme_minimal())
set.seed(13L)
df <-
runm(n = 100,
type = "int",
missing_prop = .75,
covariate_fam_list = list("norm", "bin", "norm"),
covariate_param_list = list(c(mean = 0, sd = 1), c(.3), c(0, 2)),
unmeasured_fam_list = list("norm", "bin"),
unmeasured_param_list = list(c(mean = 0, sd = 1), c(.3)),
treatment_model_coefs =
c("int" = -1, "z1" = .4, "z2" = .5, "z3" = .4,
"u1" = .75, "u2" = .75),
response_model_coefs =
c("int" = -1, "z1" = .4, "z2" = .5, "z3" = .4,
"u1" = .75, "u2" = .75, "x" = .75),
response = "norm",
response_param = c("si_y" = 1))
rbind(head(df, 5), tail(df, 5))
```
### External Validation
For external validation, the sample size argument can be a vector of length 2 to represent the number of observations in the main study data and external validation data, respectively. The sample size argument can also be of length 1, where the sample size will be split in half for the two types of data (main study data will obtain the additional observation if `n` is odd). The main study data has the unmeasured confounders completely missing for all observations. The external data fully observes the unmeasured confounders but typically has no reference on the exposure-unmeasured confounder relationship. Thus, informative priors on the bias parameters for this relationship are needed to achieve convergence. That is, in the above model, $\gamma_x$ and $\delta_x$.
If a user has a main study data set and an external validation data set that are collected and separate from one another, they can be joined together through `dplyr::bind_rows()`. This function binds the data sets by row and keeps all of the columns, even if the columns do not match between data sets. Say that a researcher has a main study data set with a binary outcome, a binary treatment, three perfectly measured standard normal covariates, and two unmeasured confounders that are completely missing. Further, say that the researcher also has an external validation data set with the same outcome, two of the perfectly measured covariates from the main study data, and the two unmeasured confounders completely observed. If the third covariate that is missing from the external validation data is deemed unimportant for modeling, then one can still perform inference using this data. An example of this scenario is below
```{r}
# Main Study Data
M <- tibble::tibble(
"y" = rbinom(100, 1, .5),
"x" = rbinom(100, 1, .3),
"z1" = rnorm(100, 0, 1),
"z2" = rnorm(100, 0, 1),
"z3" = rnorm(100, 0, 1),
"u1" = NA,
"u2" = NA
)
# External Validation Data
EV <- tibble::tibble(
"y" = rbinom(100, 1, .5),
"z1" = rnorm(100, 0, 1),
"z2" = rnorm(100, 0, 1),
"u1" = rnorm(100, 0, 1),
"u2" = rnorm(100, 0, 1)
)
df_ext <- dplyr::bind_rows(M, EV) |>
dplyr::mutate(x = ifelse(is.na(x), 0, x))
rbind(head(df_ext, 5), tail(df_ext, 5))
```
# Specifying and Fitting the Model
The main focus of this vignette should be around `unm_glm()`, which fits the posterior results of the Bayesian unmeasured confounding model through MCMC iterations. Upon acquiring all the relevant information, the `unm_glm()` function carries out two main tasks. Initially, it constructs a JAGS model and subsequently pre-processes the data for utilization by JAGS. This simplifies the process of performing a fully Bayesian analysis, as it spares users from the necessity of being familiar with JAGS syntax for both the model and data processing. The primary aim of the `unmconf` package, akin to `rstanarm` and `brms`, is to offer a user-friendly interface for Bayesian analysis, utilizing programming techniques familiar to R users. Users can input model information into `unm_glm()` in a similar manner as they would for the standard `stats::glm()` function, providing arguments like `formula`, `family`, and `data`.
The R language provides a straightforward syntax for denoting linear models, typically written as `response ~ terms`, where the coefficients are implicitly represented. Like other R functions for model fitting, users have the option to use the `. - {vars}` syntax instead of listing all predictors. For instance, if the user wants to model the first unmeasured confounder, `smoking`, given predictors `age, weight, height,` and `salary`, they can use either `form2 = smoking ~ age + weight + height + salary` or `form2 = smoking ~ . - {response varaible}`. To estimate the unmeasured confounder(s), we often model them conditioned on some or all of the perfectly measured covariates and the treatment. Once estimated, these unmeasured confounder(s) become predictors in the higher levels of the model structure. On the right-hand side of the `~`, we define the linear combination of predictors that models the response variable. Additionally, the predictors can include polynomial regression (e.g.`~ poly(z, 2)`) and interactive effects (e.g.`~x*z`).
`unm_glm()` also accepts arguments that facilitate MCMC computation on the posterior distribution to be passed to `coda.samples`, such as such as `n.iter, n.adapt, thin,` and `n.chains`. The arguments specified have default values, but the user is encouraged to supply their own values given the lack of convergence that is sometimes observed when validation sample sizes are small or priors are particularly diffuse.
### Internal Validation
For instance, consider the Bayesian unmeasured confounding model for the generated data set above, `df`, with a normal response, normal first unmeasured confounder, binary second unmeasured confounder, and three perfectly observed covariates. Further, let's say that, through expert opinion, we have prior information that the effect on $y$ from $u_1$, $\lambda_{u_1}$, is normally distributed with a mean of 0.5 and standard deviation of 1. JAGS parameterizes the normal distribution with precision rather than standard deviation, so we would use $\tau_{u_1} = 1 / \sigma_{u_1}^2 = 1$ for the prior distribution $\lambda_{u_1} \sim N(.5, \tau_{u_1} = 1)$. Using `unm_glm()`, we fit:
```{r}
unm_mod <-
unm_glm(form1 = y ~ x + z1 + z2 + z3 + u1 + u2, # y ~ .,
form2 = u1 ~ x + z1 + z2 + z3 + u2, # u1 ~ . - y,
form3 = u2 ~ x + z1 + z2 + z3, # u2 ~ . - y - u1,
family1 = gaussian(),
family2 = gaussian(),
family3 = binomial(),
priors = c("lambda[u1]" = "dnorm(.5, 1)"),
n.iter = 10000, n.adapt = 4000, thin = 1,
data = df)
```
### External Validation
We fit the Bayesian unmeasured confounding model from the external validation data set above, `df_ext`. Suppose that we have knowledge on the relationship between the treatment effect and both unmeasured confounders through expert opinion, where each relationship is normally distributed. JAGS parameterizes the normal distribution with precision rather than standard deviation. So, from expert opinion, we get the prior distributions $\gamma_x \sim N(1.1, 0.9)$ and $\delta_x \sim N(1.1, 4.5)$. Using `unm_glm()`, we fit:
```{r, eval=FALSE}
unm_mod_ext <-
unm_glm(form1 = y ~ x + z1 + z2 + u1 + u2, # y ~ . - z3,
form2 = u1 ~ x + z1 + z2 + u2, # u1 ~ . - y - z3,
form3 = u2 ~ x + z1 + z2, # u2 ~ . - y - u1 - z3,
family1 = binomial(),
family2 = gaussian(),
family3 = gaussian(),
priors = c("gamma[x]" = "dnorm(1.1, 0.9)",
"delta[x]" = "dnorm(1.1, 4.5)"),
n.iter = 4000, n.adapt = 2000, thin = 1,
data = df_ext)
```
### Export JAGS Code
By leveraging `unm_glm()`, users can conveniently implement complex Bayesian models, particularly those involving unmeasured confounders, without grappling with the intricacies of JAGS syntax or handling missing data. The Bayesian unmeasured confounding model structure of `unmconf` is currently set up to work with at most two unmeasured confounders. Instances may arise where the user may want to work with more than two unmeasured confounders. As an attempt to resolve this concern, the user can explicitly call the JAGS code that `unm_glm()` generates either through the argument `code_only = TRUE` in the function itself or through the separate function, `jags_code()`. With a starting point created by the model in `unmconf`, the user should be able to identify the syntax for JAGS code and thus add the layer(s) to the model structure and the respective prior distribution(s). Stopping at two unmeasured confounders allows for the package to run with confidence in the instance that individuals do not check for convergence and report the results of a poor model. Below shows both ways to extract a model's JAGS code.
```{r, eval=FALSE}
unm_glm(..., code_only = TRUE)
jags_code(unm_mod)
```
# Evaluate Convergence and Model Fit
After the model is fit and before using the MCMC samples for inference, it is necessary for users assess whether the chains have converged appropriately. Hierarchical models with unmeasured confounding are often confronted with convergence issues. To aid in chain convergence, we heavily increased the burn-in length and MCMC iterations from the default values of 1000 and 2000 in the function `unm_glm()` to 6000 and 10000, respectively. Additional checks include the posterior kernel density plots appearing relatively smooth in shape and the trace, or history, plots of the chains should have very similar values across the iterations (i.e., they "mix" well and the chains intermingle).
The `bayesplot` package provides a variety of `ggplot2`-based plotting functions for use after fitting Bayesian models. `bayesplot::mcmc_hist()` plots a histogram of the MCMC draws from all chains, and `bayesplot::mcmc_trace()` performs a trace plot of the chains. Given that $\beta_x$ is our parameter of interest, we only displayed the density and trace plots of this parameter below. The histogram appears smooth and without any jaggedness. The trace plot appears to mix well, as one cannot differentiate or identify patterns in the chains across the iterations for this model. Thus, there is no lack of convergence evident here. All parameters in the model upheld convergence standards.
```{r, fig.align='center', fig.height = 4, fig.width = 6, fig.cap="Histogram of the MCMC draws for the internal validation example. Smooth histogram is an indication of good convergence."}
bayesplot::mcmc_hist(unm_mod, pars = "beta[x]")
```
```{r, fig.align='center', fig.height = 4, fig.width = 6, fig.cap="Trace plot of the MCMC draws for the internal validation example. No patterns or diverging chains in the trace plot is an indication of good convergence."}
bayesplot::mcmc_trace(unm_mod, pars = "beta[x]")
```
```{r, echo=FALSE}
# df2 <-
# runm_extended(n = 100,
# covariate_fam_list = list("norm", "bin", "norm"),
# covariate_param_list = list(c(mean = 0, sd = 1), c(.3), c(0, 2)),
# unmeasured_fam_list = list("norm", "bin"),
# unmeasured_param_list = list(c(mean = 0, sd = 1), c(.3)),
# treatment_model_coefs =
# c("int" = -1, "z1" = .4, "z2" = .5, "z3" = .4,
# "u1" = .75, "u2" = .75),
# response_model_coefs =
# c("int" = -1, "z1" = .4, "z2" = .5, "z3" = .4,
# "u1" = .75, "u2" = .75, "x" = .75),
# response = "norm",
# response_param = c("si_y" = 1),
# type = "int",
# missing_prop = .99)
#
# unm_mod2 <-
# unm_glm(y ~ x + z1 + z2 + z3 + u1 + u2, # y ~ .,
# u1 ~ x + z1 + z2 + z3 + u2, # u1 ~ . - y,
# u2 ~ x + z1 + z2 + z3, # u2 ~ . - y - u1,
# family1 = gaussian(),
# family2 = gaussian(),
# family3 = binomial(),
# #priors = c("lambda[u1]" = "dnorm(0, 4)"),
# n.iter = 4000, n.adapt = 2000, thin = 1,
# data = df2)
#
# write_rds(unm_mod2, "/Users/ryanhebdon/Graduate School/Research/unmconf_personal/Diagnostics/bad_convergence.rds")
#bad_mod <- readRDS("/Users/ryanhebdon/Graduate School/Research/unmconf_personal/Diagnostics/bad_convergence.rds")
```
# Summarizing Results
Once the posterior samples have been computed, the `unm_glm()` function returns an R object as a list of the posterior samples, where the length of the list matches the number of chains. Calling the returned object explicitly, in our example `unm_mod`, will output the model's call and a named vector of coefficients at each level of the Bayesian unmeasured confounding model. A more formal model summary comes through `unm_summary()`, which obtains and prints a summary table of the results. Every parameter is summarized by the mean of the posterior distribution along with its two-sided 95% credible intervals based on quantiles. When generating a data set via `runm()`, adding the `data` argument appends a column to the summary table consisting of the true parameter values that were assigned.
```{r}
unm_summary(unm_mod, df) |>
head(10)
```
For model comparison, the deviance information criterion (DIC) and penalized expected deviance are provided through `unm_dic()`. DIC, a Bayesian version of AIC, is calculated by adding the "effective number of parameters" to the expected deviance and is computed through a wrapper around `rjags::dic_samples()`.
As mentioned above, unmeasured confounders can be considered as parameters in the Bayesian paradigm and are therefore estimated when performing MCMC. Yet, `unm_summary()` does not track the estimate of these "parameters" from the model fit. For this, `unm_backfill()` pairs with the original data set to impute the missing values for the unmeasured confounders with the posterior estimates. The ten observations below come from `df`, where five of the unmeasured confounders are observed and five are unobserved in the internal validation data. Columns `u1_observed` and `u2_observed` are logical variables added to the original data frame to display which variables were originally observed versus imputed. Additionally, the values of `u1` and `u2` that respectively have `FALSE` in the previously mentioned columns are the imputed posterior estimates.
```{r}
unm_backfill(df, unm_mod)[16:25, ]
```
To visualize the results from the model fit, the quantile-based posterior credible intervals can be drawn using `bayesplot::mcmc_intervals()`. A credible interval plot from the worked example is below. This plots the credible intervals for all parameters from the posterior draws of all the chains. We modify the default setting for the outer credible interval to be 0.95 for comparison with the output results from `unm_summary()`. The light blue circle in the middle of each parameter's interval portrays the posterior median. The bold, dark blue line displays the 50% credible interval and the thin, blue line covers the 95% credible interval. For simulation studies, where the true parameter values are known, calling the argument `true_params` will add a layer of light red circles to each credible interval to illustrate the true value used to generate the data. Here, the gamma and delta parameters do not have a true value because we generated the unmeasured confounders $u_1$ and $u_2$ as independent normal/Bernoulli random variables.
```{r, message=FALSE, warning=FALSE, fig.align='center', fig.height = 4, fig.width = 6, fig.cap= "A credible interval plot for the internal validation example. The light blue circle indicates the posterior median for each parameter, with the bold and thin blue lines displaying the 50% and 95% credible interval, respectively. Red circles are the true parameter values used in the simulation study."}
bayesplot::mcmc_intervals(unm_mod, prob_outer = .95,
regex_pars = "(beta|lambda|gamma|delta|zeta).+") +
geom_point(
aes(value, name), data = tibble::enframe(attr(df, "params")) |>
dplyr::mutate(name = gsub("int", "1", name)),
color = "red", fill = "pink", size = 4, shape = 21
)
```