0.1 Introduction

The accumulation of single-cell RNA-seq (scRNA-seq) studies highlights the potential benefits of integrating multiple data sets. By increasing sample sizes and improving analytical robustness, integration can lead to more insightful biological conclusions. However, challenges arise from the inherent diversity and batch discrepancies within and across studies. SCIntRuler, a novel R package, addresses these challenges by guiding the integration of multiple scRNA-seq data sets. It was designed using the Seurat framework and offers both existing and novel single-cell analytic workflows.

Integrating scRNA-seq data sets can be complex due to various factors, including batch effects and sample diversity. Key decisions – whether to integrate data sets, which method to choose for integration, and how to best handle inherent data discrepancies – are crucial. SCIntRuler offers a statistical metric to aid in these decisions, ensuring more robust and accurate analyses.

0.2 1. Installation

Before installing SCIntRuler, install the batchelor and MatrixGenerics packages from Bioconductor.


# Check if 'BiocManager' is installed, install if not
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

# Check if 'batchelor' is installed, install if not
if (!requireNamespace("batchelor", quietly = TRUE))
    BiocManager::install("batchelor")

# Check if 'MatrixGenerics' is installed, install if not
if (!requireNamespace("MatrixGenerics", quietly = TRUE))
    BiocManager::install("MatrixGenerics")

SCIntRuler can then be installed with the following command; the source code is available on GitHub.

BiocManager::install("SCIntRuler") 
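
Before moving on, it can help to confirm that the installation steps above succeeded. The following is a minimal sketch using only base R; the variable names `pkgs` and `installed` are illustrative and not part of SCIntRuler.

```r
# Packages named in the installation steps above
pkgs <- c("batchelor", "MatrixGenerics", "SCIntRuler")

# requireNamespace() returns TRUE if a package can be loaded, FALSE otherwise
installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)

# Named logical vector: one entry per package
installed

# List any packages that still need to be installed
names(installed)[!installed]
```

Any package reported as missing can be installed with BiocManager::install() as shown above.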

After installation, load the package along with Seurat, dplyr, and ggplot2:

library(SCIntRuler)
library(Seurat)
library(dplyr)
library(ggplot2)

0.3 2. Explore with an example data set

Let’s start with an example data set. We conducted a series of simulation studies to assess how well SCIntRuler guides integration decisions under scenarios with varying degrees of shared information among data sets. We generated the simulation data from a real peripheral blood mononuclear cell (PBMC) scRNA-seq data set.

0.3.1 Overview of the data

This data set is a subset of the one used in our Simulation 2, which contains three studies. In each study, we randomly drew different numbers of CD4 T helper cells, B cells, CD14 monocytes, and CD56 NK cells to mimic four real-world scenarios with three data sources. Simulation 2 introduces a moderate overlap, with 20.3% of cells sharing the same cell type identity. There are 2000 B cells and 400 CD4T cells in the first study, 700 CD14Mono cells and 400 CD4T cells in the second study, and 2000 CD56NK cells and 400 CD4T cells in the third study. In total, Simulation 2 contains 32738 genes and 5900 cells. The example used here is a subset of 800 cells and 3000 genes. It is distributed with the package under /data as sim_data_sce and is converted to a Seurat object below.
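
The totals quoted above can be checked with a few lines of base R. This is a sketch that assumes "sharing the same cell type identity" refers to cells whose type appears in more than one study (here, only CD4T); the matrix `composition` and its labels are written out by hand from the counts in the text and are not read from the package data.

```r
# Cell counts per study and cell type, copied from the description above
composition <- rbind(
  Study1 = c(Bcell = 2000, CD14Mono = 0,   CD56NK = 0,    CD4T = 400),
  Study2 = c(Bcell = 0,    CD14Mono = 700, CD56NK = 0,    CD4T = 400),
  Study3 = c(Bcell = 0,    CD14Mono = 0,   CD56NK = 2000, CD4T = 400)
)

# Total number of cells across the three studies
total_cells <- sum(composition)

# Assumption: a cell type is "shared" when it occurs in more than one study
shared_types <- colSums(composition > 0) > 1
shared_cells <- sum(composition[, shared_types])

# Percentage of cells that belong to a shared cell type
overlap_pct <- round(100 * shared_cells / total_cells, 1)

total_cells  # 5900
overlap_pct  # 20.3
```

Under this reading, the 1200 CD4T cells out of 5900 give the 20.3% overlap stated in the text.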

data("sim_data_sce", package = "SCIntRuler")
sim_data <- as.Seurat(sim_data_sce)
head(sim_data[[]])
#>                     orig.ident nCount_RNA nFeature_RNA CellType  Study
#> AATTACGAATCGGT-1 SeuratProject       1035          365    Bcell Study1
#> AGAGCGGAGTCCTC-1 SeuratProject       1157          448    Bcell Study1
#> TTCTTACTGGTACT-1 SeuratProject       2824          884    Bcell Study1
#> TGACGCCTACACCA-1 SeuratProject       1801          644    Bcell Study1
#> ATTTGCACCTATGG-1 SeuratProject       2501          749    Bcell Study1
#> AGAACAGATGGAGG-1 SeuratProject       1113          391    Bcell Study1
#>                  RNA_snn_res.0.5 seurat_clusters ident
#> AATTACGAATCGGT-1               0               0     0
#> AGAGCGGAGTCCTC-1               0               0     0
#> TTCTTACTGGTACT-1               0               0     0
#> TGACGCCTACACCA-1               0               0     0
#> ATTTGCACCTATGG-1               0               0     0
#> AGAACAGATGGAGG-1               0               0     0

0.3.2 Data pre-processing and visualization with Seurat

Following the Seurat tutorial, we first pre-process the data with the Seurat functions NormalizeData, FindVariableFeatures, ScaleData, RunPCA, FindNeighbors, FindClusters, and RunUMAP, and then draw UMAP plots with DimPlot, grouped by Study and by CellType.

# Normalize the data
sim_data <- NormalizeData(sim_data)
# Identify highly variable features
sim_data <- FindVariableFeatures(sim_data, selection.method = "vst", nfeatures = 2000)
# Scale the data
all.genes <- rownames(sim_data)
sim_data <- ScaleData(sim_data, features = all.genes)
# Perform linear dimensional reduction
sim_data <- RunPCA(sim_data, features = VariableFeatures(object = sim_data))
# Cluster the cells
sim_data <- FindNeighbors(sim_data, dims = 1:20)
sim_data <- FindClusters(sim_data, resolution = 0.5)
#> Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
#> 
#> Number of nodes: 800
#> Number of edges: 46490
#> 
#> Running Louvain algorithm...
#> Maximum modularity in 10 random starts: 0.7653
#> Number of communities: 4
#> Elapsed time: 0 seconds
# Run non-linear dimensional reduction (UMAP) on the same principal components
sim_data <- RunUMAP(sim_data, dims = 1:20)
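
The clustering step above reports four communities. One way to see how those clusters line up with the known labels is a cross-tabulation of the metadata. The sketch below uses only base R; `cluster_crosstab` is a hypothetical helper (not a Seurat or SCIntRuler function), and `toy_meta` is made-up data for illustration only.

```r
# Hypothetical helper: cross-tabulate cluster labels against a metadata column.
# `meta` is a data frame such as the one returned by sim_data[[]].
cluster_crosstab <- function(meta, by = "CellType", cluster = "seurat_clusters") {
  stopifnot(is.data.frame(meta), by %in% names(meta), cluster %in% names(meta))
  table(Cluster = meta[[cluster]], Group = meta[[by]])
}

# Toy metadata for illustration only (not taken from the package data)
toy_meta <- data.frame(
  seurat_clusters = factor(c(0, 0, 1, 1, 1)),
  CellType = c("Bcell", "Bcell", "CD4T", "CD4T", "Bcell")
)
cluster_crosstab(toy_meta)

# With the object from this vignette, one would call:
# cluster_crosstab(sim_data[[]], by = "CellType")
# cluster_crosstab(sim_data[[]], by = "Study")
```

Clusters dominated by a single study but mixing several cell types would hint at batch effects, whereas clusters that follow cell type across studies suggest shared biology.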

0.3.2.1 UMAP separated by Study

p1 <- DimPlot(sim_data, reduction = "umap", label = FALSE, pt.size = .5, group.by = "Study", repel = TRUE)
p1

0.3.2.2 UMAP separated by cell type


p2 <- DimPlot(sim_data, reduction = "umap", label = TRUE, pt.size = .8, group.by = "CellType", repel = TRUE)
p2