This **R** package performs association tests between
the observed data and their systematic patterns of variation. Systematic
variation can be modeled by latent variables, which can arise from
biological processes, experimental conditions, environmental factors,
and other sources. We often estimate these patterns using principal component
analysis (PCA), factor analysis (FA), logistic factor analysis (LFA),
K-means clustering, partitioning around medoids (PAM), and related methods.
The jackstraw methods learn the over-fitting characteristics inherent in
unsupervised learning, where the observed data are used to estimate the
systematic patterns and are then tested against those same patterns (a form
of circular analysis).

Using a variety of unsupervised learning techniques, the jackstraw provides a resampling strategy and testing scheme to estimate statistical significance of association between the observed data and their systematic patterns of variation. For example, the cell cycle in microarray data may be estimated by principal components (PCs). Then, we can use the jackstraw for PCA to identify genes that are significantly associated with these PCs. On the other hand, cell identities in single cell RNA-seq (scRNA-seq) data are often determined by K-means clustering or other unsupervised clustering algorithms. Then, the jackstraw for clustering can identify single cells that are significant members of a given cluster.
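
To make the resampling idea concrete, here is a minimal base-R sketch of a jackstraw-style test for a single PC. It is an illustration under stated assumptions, not the package's implementation: the simulated data, the helper names `top_pc` and `f_stat`, the settings `s = 10` and `B = 50`, and the `+1` correction in the empirical p-value are all hypothetical choices made for this example.

```r
# Illustrative base-R sketch of the jackstraw idea for PCA with r = 1.
# Not the package's implementation; names and settings here are hypothetical.
set.seed(1)

m <- 200; n <- 20                       # m variables (rows), n observations (columns)
lv <- rnorm(n)                          # simulated latent variable
dat <- matrix(rnorm(m * n), m, n)
dat[1:20, ] <- dat[1:20, ] + outer(rep(2, 20), lv)  # rows 1-20 depend on lv

# Estimate the top PC as the first right singular vector of row-centered data
top_pc <- function(x) svd(x - rowMeans(x))$v[, 1]

# F-statistic for regressing each row on one PC (intercept included)
f_stat <- function(x, pc) {
  x <- x - rowMeans(x)
  pc <- pc - mean(pc)
  fit_ss <- as.vector(x %*% pc)^2 / sum(pc^2)
  res_ss <- rowSums(x^2) - fit_ss
  fit_ss / (res_ss / (ncol(x) - 2))
}

obs_stat <- f_stat(dat, top_pc(dat))

s <- 10; B <- 50                        # s synthetic nulls per iteration, B iterations
null_stat <- replicate(B, {
  jack <- dat
  idx <- sample(m, s)
  # Permute each selected row independently, breaking its link to the latent variable
  jack[idx, ] <- t(apply(dat[idx, , drop = FALSE], 1, sample))
  # Re-estimate the PC on the modified data, then score only the synthetic nulls
  f_stat(jack[idx, , drop = FALSE], top_pc(jack))
})

# Empirical p-value: share of pooled null statistics at least as large as the observed one
p_value <- sapply(obs_stat, function(f) (sum(null_stat >= f) + 1) / (length(null_stat) + 1))
```

Because the PC is re-estimated on data that contain the permuted rows, the null statistics absorb the over-fitting that arises from estimating and testing on the same data.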

Using `jackstraw_pca`, we can find statistically
significant variables with regard to the top `r` principal
components (PCs). If we only specify `r`, we conduct
association tests with all `r` PCs simultaneously.
Alternatively, we could test association with respect to a subset of
`r` PCs, using an optional argument `r1`. By
specifying `r` (a total number of significant PCs) and
`r1` (a numeric vector of target PCs),
`jackstraw_pca` helps find statistically significant
variables with respect to `r1` PCs, while accounting for the
fact that there are `r` significant PCs. The package also
supports truncated PCA, using augmented implicitly restarted Lanczos
bidiagonalization algorithm (IRLBA; `jackstraw_irlba`) or
randomized Singular Value Decomposition (RSVD;
`jackstraw_rpca`).
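
The calls described above might look like the following sketch on simulated data. The argument names `s` (number of synthetic nulls per iteration), `B` (number of iterations), and `verbose`, as well as the `p.value` element of the returned list, are assumptions taken from the package's help pages; check `?jackstraw_pca` for your installed version. The simulated matrix and its dimensions are made up for illustration.

```r
# Hedged usage sketch; argument and output names are assumed from the package help.
set.seed(1)
m <- 500; n <- 20                       # m variables (rows), n observations (columns)
lv <- rnorm(n)                          # simulated latent variable
dat <- matrix(rnorm(m * n), m, n)
dat[1:50, ] <- dat[1:50, ] + outer(rep(1.5, 50), lv)  # rows 1-50 carry signal

if (requireNamespace("jackstraw", quietly = TRUE)) {
  # Test association with all r = 1 PCs
  js <- jackstraw::jackstraw_pca(dat, r = 1, s = 50, B = 20, verbose = FALSE)
  head(js$p.value)

  # Test association with the first PC only, while accounting for r = 2 PCs
  js1 <- jackstraw::jackstraw_pca(dat, r = 2, r1 = 1, s = 50, B = 20, verbose = FALSE)
  head(js1$p.value)
}
```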

Logistic factor analysis (LFA) estimates population structure from genetic
data (single-nucleotide polymorphisms; SNPs). `jackstraw_lfa` provides
corresponding association tests between SNPs and population structure,
as estimated by LFA. Due to the requirements of a CRAN package, please
manually install `lfa` from Bioconductor. See the R help on `lfa`. In general, one
could directly specify an estimation method for latent variables in
`jackstraw_subspace`.

Instead of continuous latent variables that are estimated by PCA,
LFA, or other methods, one may be interested in estimating discrete clusters
from high-dimensional data. For K-means clustering,
`jackstraw_kmeans` evaluates whether data points are
significant members of a given cluster, by testing association between
observed data and cluster centers. This can help select data points that
are reliable members of clusters and further improve the cluster
membership. Note that in order to use the jackstraw for clustering, it is
necessary to first apply the clustering algorithm to the data and
provide the resulting object (e.g., `kmeans.dat`).
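
As a rough sketch of that workflow, the example below first runs `kmeans` from base R and then passes the result to `jackstraw_kmeans`. The simulated two-cluster data are hypothetical, and the argument names `s`, `B`, and `verbose` and the `p.F` output element are assumptions based on the package help; consult `?jackstraw_kmeans` before relying on them. The two clusters differ in their pattern across columns rather than in their overall level, since the function is assumed to center rows by default.

```r
# Hedged usage sketch; argument and output names are assumed from the package help.
set.seed(1)
pattern <- c(rep(1, 5), rep(-1, 5))     # hypothetical cluster profile over 10 columns

# Two simulated clusters of 100 data points (rows) with opposite profiles
dat <- rbind(
  sweep(matrix(rnorm(100 * 10), 100, 10), 2, pattern, "+"),
  sweep(matrix(rnorm(100 * 10), 100, 10), 2, -pattern, "+")
)

# Step 1: apply the clustering algorithm to the rows of the data
kmeans.dat <- kmeans(dat, centers = 2, nstart = 10)

# Step 2: test each data point for membership in its assigned cluster
if (requireNamespace("jackstraw", quietly = TRUE)) {
  js <- jackstraw::jackstraw_kmeans(dat, kmeans.dat, s = 20, B = 50, verbose = FALSE)
  head(js$p.F)                          # one p-value per data point
}
```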

Related algorithms, such as Partitioning Around
Medoids (PAM) or k-medoids and Mini Batch K-means
algorithms, are supported by `jackstraw_pam` and
`jackstraw_MiniBatchKmeans`, respectively. Generally,
`jackstraw_cluster` can be used for other clustering
algorithms.

There are a few additional functions to support statistical inference
for unsupervised learning, such as finding the number of PCs or clusters.
Based on p-values, we could estimate posterior inclusion probabilities
(PIPs) using `pip`.
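
A minimal sketch of converting p-values to PIPs is shown below. The p-values are simulated for illustration (a mixture of uniform nulls and small non-null values), and calling `pip` with only a p-value vector is an assumption about its interface; see `?pip` for the available options. The call is guarded because `pip` depends on the Bioconductor package `qvalue`.

```r
# Hedged usage sketch with simulated p-values; not output from a real analysis.
set.seed(1)
pvalue <- c(runif(800), rbeta(200, 0.5, 10))   # 800 null-like, 200 small p-values

if (requireNamespace("jackstraw", quietly = TRUE) &&
    requireNamespace("qvalue", quietly = TRUE)) {
  prob <- jackstraw::pip(pvalue)        # one posterior inclusion probability per p-value
  summary(prob)
}
```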

*Chung, N.C.* (2020) Statistical significance of cluster
membership for unsupervised evaluation of cell identities.
Bioinformatics, 36(10): 3107–3114
https://doi.org/10.1093/bioinformatics/btaa087

*Chung, N.C.* and *Storey, J.D.* (2015) Statistical
significance of variables driving systematic variation in
high-dimensional data. Bioinformatics, 31(4): 545–554
https://doi.org/10.1093/bioinformatics/btu674

Association Test with Principal Components with a Gentle Introduction to Latent Variable Models

Statistical Test of Cluster Memberships with a Toy Data Set (`mtcars`)

Unsupervised Evaluation of Cell Identities in Single Cell Genomics using the 10X Genomics Data

Install the Bioconductor dependencies, `lfa`, `gcatest`, and `qvalue`,
manually first:

```
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install(c('qvalue', 'lfa', 'gcatest'))
```

The following `jackstraw` functions require Bioconductor
packages:

- `jackstraw_lfa`, `pseudo_Rsq`, and `efron_Rsq` require `lfa`.
- `jackstraw_lfa` and `jackstraw_alstructure` require `gcatest`.
- `pip` requires the package `qvalue`.

This package is in active development. Install jackstraw from GitHub:

```
install.packages("devtools")
library("devtools")
install_github("ncchung/jackstraw")
```

To use `jackstraw_alstructure`, install the optional
`alstructure` package from GitHub:

```
library(devtools)
install_github("StoreyLab/alstructure")
```

The stable version **jackstraw v1.3.17** is on CRAN. To
install from CRAN:

`install.packages("jackstraw")`

Here are some implementations of the jackstraw in different contexts and application domains.

jackstraw (Python) by Iain Carmichael

Jackstraw significance testing for JIVE in Python