SCIBER A Simple Method For Removing Batch Effects

SCIBER is a new method for removing batch effects from single-cell RNA sequencing data. It matches cell clusters across batches based on the overlap of differentially expressed genes between clusters. Compared to existing methods, SCIBER has better scalability to large datasets and is easier to tune. It also directly outputs gene expression data without dimension reduction. SCIBER is implemented as an R package to integrate multiple single-cell RNA sequencing datasets and remove batch effects.

Bioinformatics, 2022, 1–8

https://doi.org/10.1093/bioinformatics/btac819
Advance Access Publication Date: 22 December 2022
Original Paper

Gene expression

Downloaded from https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btac819/6957084 by guest on 29 December 2022


SCIBER: a simple method for removing batch effects
from single-cell RNA-sequencing data
Dailin Gan and Jun Li *
Department of Applied and Computational Mathematics and Statistics, University of Notre Dame, Notre Dame, IN 46556, USA
*To whom correspondence should be addressed.
Associate Editor: Valentina Boeva

Received on September 20, 2022; revised on November 27, 2022; editorial decision on December 15, 2022; accepted on December 21, 2022

Abstract
Motivation: Integrative analysis of multiple single-cell RNA-sequencing datasets allows for more comprehensive characterizations of cell types, but systematic technical differences between datasets, known as ‘batch effects’, need to be removed before integration to avoid misleading interpretation of the data. Although many batch-effect-removal methods have been developed, there is still considerable room for improvement: most existing methods give only dimension-reduced data instead of expression data of individual genes, are based on computationally demanding models, and are black-box models that are difficult to interpret or tune.
Results: Here, we present a new batch-effect-removal method called SCIBER (Single-Cell Integrator and Batch Effect Remover) and study its performance on real datasets. SCIBER matches cell clusters across batches according to the overlap of their differentially expressed genes. As a simple algorithm that scales better to data with a large number of cells and is easy to tune, SCIBER shows comparable and sometimes better accuracy in removing batch effects on real datasets compared to the state-of-the-art methods, which are much more complicated. Moreover, SCIBER outputs expression data in the original space, that is, the expression of individual genes, which can be used directly for downstream analyses. Additionally, SCIBER is a reference-based method, which assigns one of the batches as the reference batch and keeps it untouched during the process, making it especially suitable for integrating user-generated datasets with standard reference data such as the Human Cell Atlas.
Availability and implementation: SCIBER is publicly available as an R package on CRAN: https://cran.r-project.org/web/packages/SCIBER/. A vignette is included in the CRAN R package.
Contact: [email protected]
Supplementary information: Supplementary data are available at Bioinformatics online.

© The Author(s) 2022. Published by Oxford University Press.
This is an Open Access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.

1 Introduction

Single-cell RNA-sequencing (scRNA-seq) enables transcriptomic profiling at a single-cell resolution in a high-throughput manner (Fan et al., 2020; Klein et al., 2015; Macosko et al., 2015; Rosenberg et al., 2018; Zheng et al., 2017). While scRNA-seq from a single experiment can be used to describe transcriptomic heterogeneity of different cell types, combining data from multiple datasets further enhances the power. However, there are often systematic differences between datasets called ‘batch effects’ that are caused purely by technical differences between datasets instead of biological variations. These technical differences include differences in scRNA-seq protocols, platforms, technologies (Hashimshony et al., 2012; Klein et al., 2015; Macosko et al., 2015; Picelli et al., 2014; Ramsköld et al., 2012), read processing, quality control, normalization procedures (Stegle et al., 2015), as well as different donors, capturing time and handling personnel (Fan et al., 2020; Korsunsky et al., 2019; Tran et al., 2020). Batch effects are often substantial in scRNA-seq data; they may confound meaningful biological variations, lead to false discoveries and mislead the interpretation of the data (Hicks et al., 2018; Leek et al., 2010). Although ingenious experimental designs have been recommended to minimize batch effects, such as pooling cells from multiple batches into a single one for sequencing (Cao et al., 2017; Kang et al., 2018; McGinnis et al., 2019; Rosenberg et al., 2018), practical issues arise from logistical and temporal limitations and sample acquisition (Fan et al., 2020). As a result, computational methods that remove batch effects are often essential for unified single-cell transcriptomics analysis (Fan et al., 2020; Welch et al., 2019).
Current batch-effect-removal methods can be roughly divided into five categories: anchor-based, graph-based, anchor-graph-based, deep learning-based and model-based (Fan et al., 2020; Forcato et al., 2021). Below, we give examples or representatives in each category and briefly describe their ideas.
Anchor-based methods first decide pairs of cells that share similar expression pattern across batches, which are called ‘anchors’,
and then use these pairs to simultaneously integrate batches and remove batch effects. The method mutual nearest neighbors (MNN)-Correct (Haghverdi et al., 2018) proposes to use MNN to identify anchors, and this idea was then applied and extended by SMNN (Yang et al., 2021), BEER (Zhang et al., 2019a), Seurat (Stuart et al., 2019), fastMNN (Lun, 2019), Scanorama (Hie et al., 2019) and scMerge (Lin et al., 2019).
Graph-based methods construct weighted graphs between and within batches and then use community detection methods to detect shared cell populations across batches. Conos (Barkas et al., 2019) performs pairwise comparisons of batches to establish initial graphs between cells across different batches, creates a joint graph by combining inter-batch edges with lower-weight intra-batch edges and then applies community detection methods, such as Walktrap, Louvain and Leiden, to the joint graph to decide clusters of different cell types. BBKNN (Polański et al., 2020) first constructs graphs for each cell by finding its k-nearest neighbors in each batch independently and then merges sets of cell neighbors based on the distance metrics used in UMAP (Uniform Manifold Approximation and Projection) (McInnes et al., 2018). scPopCorn (Wang et al., 2019) relies on a co-membership propensity graph used in Google’s PageRank algorithm (Andersen et al., 2006) to compute intra-dataset edges, uses a cosine similarity matrix to define inter-dataset edges and then uses k-partitioning to decide shared populations across batches.
Anchor-graph-based methods make use of both anchors and graph representation. As a representative, LIGER (Welch et al., 2019) first uses integrative non-negative matrix factorization to determine a low-dimensional space where each cell can be characterized by two sets of factors called dataset-specific factors and shared factors, and then constructs a neighborhood graph for the shared factors, which is then used to identify joint clusters across batches that serve as anchors, by connecting cells with similar factor loading.
Deep learning-based methods use deep neural networks to learn patterns in datasets. MMD-ResNet (Shaham et al., 2017) assumes that batches differ in distribution and develops a residual network to learn a map that calibrates the distribution of the source batch to match that of the target batch. By combining transfer learning with variational auto-encoders, scGen (Lotfollahi et al., 2019) first learns a distribution from the reference batch and then uses the trained network to predict the query dataset. SAUCIE (Amodio et al., 2019) uses a deep neural network with autoencoders to learn the cellular manifold and obtains the batch-effect corrected data by one of its hidden layers. scAlign (Johansen and Quon, 2019) uses encoder networks to learn mappings from gene expression spaces of individual conditions into a common alignment space. scVI (Lopez et al., 2018) uses a hierarchical Bayesian model with conditional distributions specified by deep neural networks to embed cells into a low-dimensional space, where batch effects have been removed.
Model-based methods impose assumptions on the distribution of (count or transformed) data or the clusters of cell types. ComBat (Johnson et al., 2007) relies on an empirical Bayes framework. It assumes that the standardized gene expression of a given sample in a given batch follows a normal distribution with pre-specified prior distributions. Limma (Smyth and Speed, 2003) uses a linear model to fit the input data, and the linear model contains a blocking term to capture batch effects. ZINB-WaVE (Risso et al., 2018) models the original count data by a zero-inflated negative binomial distribution. Harmony (Korsunsky et al., 2019) develops a novel soft K-means method to cluster cells, which implicitly assumes a linear mixture model for the cell clusters.
Although many methods have been developed and have demonstrated their ability to remove batch effects in real data effectively, there is still considerable room for improvement. First, most existing methods, no matter which of the five categories they belong to, do not output expression data in the original space/dimension, i.e. expression of individual genes. Instead, they give dimension-reduced data, where each dimension is typically a complex combination of many or all genes. While dimension-reduced data is often good enough for visualization, many other essential tasks of single-cell data analysis, such as differential expression analysis and pseudotime analysis, require expression data of individual genes. Second, many existing methods are based on computationally demanding models. As examples, graph-based methods often have a large algorithmic complexity, mixture non-Gaussian (e.g. zero-inflated negative binomial) models often require iterations and could have convergence issues [and also, many parametric models could be unsatisfactory to model real data; see e.g. Svensson (2020), Silverman et al. (2020) and Zhao et al. (2022) about the current debate on whether zeros are truly inflated in scRNA-seq data], and deep learning models are not only computationally demanding but also require GPU resources. Third, many existing algorithms rely on black-box models and thus are hard to interpret or tune. These models are complex and have a number of parameters that need to be optimized on individual datasets (e.g. the number of nodes in each hidden layer of a neural network and the learning rate when training a neural network), which can be very difficult for the user to tune, especially if these parameters do not have clear biological or statistical meanings.
In this article, we present a new method for batch-effect removal called Single-Cell Integrator and Batch Effect Remover, or SCIBER (pronounced as ‘cyber’) for short. SCIBER is a simple method that outputs the batch-effect corrected expression data in the original space/dimension. These expression data of individual genes can be directly used for all follow-up analyses. SCIBER has four steps, each having a clear biological meaning. The algorithms used for the four steps are k-means clustering, t-test, Fisher’s exact test and linear regression, respectively, all of which are easily comprehensible. The simplicity of SCIBER also makes it computationally light and scalable to data with a large number of cells. SCIBER requires minimum tuning: it has few tuning parameters and the default values work decently on a large range of datasets. We have shown that SCIBER, as a much simpler algorithm, performs comparably on real datasets or arguably better in some real datasets compared to the state-of-the-art methods.
There is one more feature of SCIBER: it is a ‘reference-based’ method. That is, one of the batches is treated as the ‘reference’ and kept untouched, and what SCIBER does is to ‘project’ other batches to the reference batch. Most existing methods are not reference-based: they are often done by projecting all batches to a new shared space, thus all batches have been changed during batch-effect removal and no batch has the luxury of staying the same. Both reference-based and non-reference-based are commonly used strategies in the literature on batch-effect removal in general, and they each have their own characteristics and advantages. For single-cell expression data, reference-based methods can be a better option when there is a natural choice of the reference batch, or when the ‘standard’ data are available and one wants to ‘add’ new data to them. For example, the Human Cell Atlas project (Regev et al., 2017), which aims to build comprehensive reference maps of all human cells, provides high-quality data that can serve as calibrated reference batches. SCIBER can add other data, with batch-effect removed, to them. On the other hand, if a non-reference-based method is used, these ‘standard’ data, as well as other data, will be modified, and thus the visualization, differential expression, cell clusters and other results on these standard data will also change, which is often undesired.

2 Materials and methods

2.1 Overview
SCIBER takes scRNA-seq data from $B + 1$ batches, including a reference batch and $B$ query batches ($B \geq 1$). When not given, the reference batch can be assigned as the batch generated by the most reliable sequencing technique or protocol among the $B + 1$ batches, the batch with the largest number of cell types or cells, the batch with the largest number of reads, etc. The scRNA-seq data is assumed to be pre-processed, e.g. normalized by the sequencing depth and log-transformed to stabilize the variance.
Keeping the reference batch untouched, SCIBER projects each query batch to the space of the reference batch. The projected query
batches are the batch-effect corrected data, or ‘corrected data’ for short, which can be directly compared with the untouched reference data for downstream analyses such as differential expression analysis, or directly concatenated with the untouched reference batch for downstream analyses such as cell type identification or lineage analysis.
The basic idea of SCIBER, different from all other batch-effect-removal algorithms, is to match cell clusters across batches by differentially expressed (DE) genes. In other words, SCIBER assumes that cell clusters corresponding to the same cell type should have an essentially identical set of DE genes, despite the existence of batch effects. This idea is implemented in the following four steps.
Assume that the expression of $p$ genes is measured in $n_b$ cells in batch $b$ ($b = 0, 1, \ldots, B$, and $b = 0$ refers to the reference batch). Let $x_{ij}^{(b)}$ be the expression of gene $i$ in cell $j$ from batch $b$, where $i = 1, \ldots, p$, and $j = 1, \ldots, n_b$.

2.2 Step 1: K-means clustering for each individual batch
In Step 1, a K-means clustering is performed on each batch. That is, the $n_b$ cells in batch $b$ are clustered into $K$ clusters using K-means clustering, based on their expression on the $p$ genes. By default, $K = \sqrt{n_0}$, where $n_0$ is the number of cells in the reference batch.

2.3 Step 2: Marker gene identification
In Step 2, for each of the $K$ clusters in each batch, DE genes are identified by a two-sample t-test, with one group being cells in the cluster and the other group being the other cells (i.e. all the cells not in this cluster). The $h$ (by default, $h = 75$) genes that are up-regulated with the smallest P-values are claimed as the ‘marker genes’ for the cluster.

2.4 Step 3: Fisher’s exact test for cluster alignment
For each cluster in the query batch (‘query cluster’ for short), Step 3 finds the cluster in the reference batch (‘reference cluster’ for short) that has the largest number of shared marker genes with it. These two clusters are called a ‘matched pair’. A P-value is assigned to each matched pair using a Fisher’s exact test. To be more specific, for each matched pair, a two-by-two contingency table is formed, with rows being whether a gene is in or not in the marker gene set of the query cluster and columns being whether a gene is in or not in the marker gene set of the reference cluster, and a P-value is given by a Fisher’s exact test on this contingency table.
Note that not every matched pair is ‘good enough’ or ‘true’. Biologically, some cell types are only present in the query batch but not the reference, so they should not have a match in the reference batch. SCIBER takes this into account and only keeps a subset of matched pairs, called ‘putative matches’. SCIBER keeps the $\omega$ proportion of matched pairs with the smallest P-values. Here, $\omega$ is a user-specified pre-fixed parameter between 0 and 1, and it has a clear biological meaning: roughly speaking, it is the proportion of cells that are present in both the query and the reference batch. SCIBER sets $\omega = 0.5$ by default.

2.5 Step 4: Projection of the query batch to the reference batch
Step 4 projects gene expression from each query batch to the reference batch based on the putative matches. Here is how the projection is made for each query batch. First, for each putative match, we compute the centroid of the query cluster and the centroid of the reference cluster. Each centroid is a vector with length $p$, with element $i$ being the average expression of gene $i$ for cells in the cluster.
Then, for each cell $j \in \{1, \ldots, n_b\}$ in the query batch, no matter whether it is in a cluster that has a putatively matched cluster or in a cluster that does not, we decompose its expression $x_j^{(b)}$ (on all genes, and so it is a length $p$ vector) by a linear combination of the centroids of clusters that each have a putatively matched cluster. Let $X_c^{(b)}$ be a matrix whose columns are the centroids of clusters that each have a putatively matched cluster, where subscript $c$ stands for ‘centroids’, and let $X_c^{(0)}$ be the corresponding matrix of the reference batch. The decomposition can be written as $\hat{x}_j^{(b)} = X_c^{(b)} \hat{\beta}_j^{(b)}$ with $\hat{\beta}_j^{(b)} = [(X_c^{(b)})^T X_c^{(b)}]^{-1} (X_c^{(b)})^T x_j^{(b)}$. Here, $\hat{x}_j^{(b)}$ is the fitted expression. After the decomposition, the projected data for cell $j$ (i.e. batch-effect removed expression of cell $j$), denoted as $\tilde{x}_j^{(b)}$, is obtained by $X_c^{(0)} \hat{\beta}_j^{(b)}$.
The above description is for each cell. Actually, this decomposition and projection can be done simultaneously for all cells in a batch by using a single formula $\tilde{X}^{(b)} = X_c^{(0)} [(X_c^{(b)})^T X_c^{(b)}]^{-1} (X_c^{(b)})^T X^{(b)}$, where $X^{(b)}$ and $\tilde{X}^{(b)}$ (without subscript $c$) are the raw and projected data matrices of batch $b$.
SCIBER is publicly available as an R package SCIBER on CRAN.

2.6 Some considerations regarding the algorithm
Setting the number of clusters $K$ for the clustering algorithm (e.g. K-means) is often difficult when clustering single-cell data. For one reason, the number of cell types present in the data is often unknown. Moreover, even if the number of cell types is known, using it as $K$ may still give poor clustering results. This is because different cell types often have quite unbalanced numbers of cells and there may be outliers in the data, and clustering algorithms often do not handle unbalanced clusters and outliers perfectly. For example, K-means clustering is sensitive to outliers and tends to combine small clusters and divide large clusters (Hastie et al., 2009; Tan et al., 2016). Fortunately, this difficulty barely matters in the K-means clustering in Step 1 of SCIBER. The aim of applying K-means clustering in SCIBER is to divide cells into many small groups; within each group, the expression of cells is highly similar. In fact, SCIBER wants the number of groups to be significantly larger than the number of cell types so that a large cell type can be divided into multiple clusters, resulting in multiple centroids in Step 4 for a ‘high-resolution’ decomposition, while outliers form clusters by themselves and can then be left out from the putative matches in Step 3. SCIBER sets $K = \sqrt{n_0}$ by default.
Before figuring out the matching of cells (Step 3), SCIBER does all analyses, including the clustering in Step 1 and the DE analysis in Step 2, strictly separately for each batch. Any clustering or DE analysis on pooled data from multiple batches can be misled by batch effects.
In Step 3, SCIBER only assigns a subset of matched pairs as putative matches. This takes care of the important fact that some cell types are not present in both the reference batch and the query batch.
In Step 4, we use a simple linear model to project the query batch to the reference batch. We do not use a more complicated nonlinear model, such as a deep learning-based model or an ensemble decision tree, because a complicated model is not only often hard to tune and time-consuming to train but also likely to suffer from over-correction (Argelaguet et al., 2021). On the contrary, our linear model has a closed-form solution without any tuning parameter, a high computational efficiency and a low risk of over-correction. Moreover, its linear nature can be highly appreciated when projecting cell types in the query batch that are absent in the reference batch. For example, if such a cell type lies in the very middle of two other cell types that are present in both the query and the reference batches, it will be decomposed as a combination of half and half from the two cell types, and in the corrected data, it will still be in the very middle of the two cell types. This is ‘interpolation’ in the linear model. Similarly, outliers will also be projected into appropriate locations by ‘extrapolation’ in the linear model.

3 Results

3.1 SCIBER removes batch effects effectively while preserving separation between cell types
We apply SCIBER in all the eight real datasets used in the review paper by Tran et al. (2020), except for a dataset from the human cell atlas, of which the cell type information is unavailable. This
review paper systematically compared 14 batch-effect-removal methods from all the five categories on these real datasets. Their results show that Harmony (Korsunsky et al., 2019), Seurat (v3) (Stuart et al., 2019) and LIGER (Welch et al., 2019) are the best methods in terms of accuracy. A summary of these eight datasets is given in Table 1, and details are given in Supplementary Material. It is worth noting that Dataset 5, which we downloaded from the review paper and contains 83 323 cells, is a subsampled dataset (the subsampling was done by the review paper) from the original dataset sequenced by Rosenberg et al. (2018) and Saunders et al. (2018), which contains 833 206 cells.
The raw scRNA-seq data are normalized to remove the difference in sequencing depth and log-transformed to stabilize the variance. Then high-variable genes are identified for each batch, and genes that are not identified as high-variable in any of the batches are discarded. All of these pre-processing steps are conducted using the R package Seurat (Stuart et al., 2019).
When applying SCIBER, we use the default values of all the parameters: $K = \sqrt{n_0}$ in Step 1, $h = 75$ in Step 2 and $\omega = 0.5$ in Step 3. That is, a common set of parameter values is used for all datasets, and no tuning of parameters is involved.
We compare the performance of SCIBER with the three state-of-the-art methods reported by the review paper (Tran et al., 2020): Harmony, Seurat and LIGER. Details of how these algorithms are used are given in Supplementary Material.
We use cLISI proposed by Korsunsky et al. (2019) as the numeric measure of performance. cLISI measures local cell-type purity: the idea is that, in clean data, the cells should cluster according to the cell types, and thus the cell types in a cell’s neighborhood should be quite ‘pure’. In Supplementary Material, we give the mathematical definition of cLISI and why a few other measures are less appropriate. Smaller cLISI values (values closer to 1.0) indicate higher local cell-type purities and are thus preferred.
However, cLISI has its limitations in measuring performance (reasons given in Supplementary Material). We also plot the batch-effect corrected data into a 2D space using UMAP and check the plot visually. Such a 2D plot contains much more information than any single numeric summary statistic and gives a more comprehensive picture of the performance.
Supplementary Tables S1–S8 give the cLISI scores for different methods in the eight datasets. The cell type cLISI score is the averaged cLISI score for a cell type, and the overall cLISI score is the average across the batch. We see that SCIBER performs comparably to the best performer of the other three methods in six out of the eight datasets, which is Seurat in some datasets and Harmony in the others, and significantly better than all the other three methods in the other two datasets (Datasets 3 and 6). Figure 1 and Supplementary Figures S1–S7 give the UMAP plots of all the datasets. The overall impression from the UMAP plots is consistent with that from cLISI.
Here, we use the human dendritic cells dataset (Dataset 3) as an example. The two batches of this dataset contain four different types of human dendritic cells (DCs): pDC (CD11C⁻ CD123⁺ plasmacytoid DCs), DoubleNeg (CD11C⁺ CD141⁻ CD1C⁻ DC), CD141 (CD11C⁺ CD141⁺ DC) and CD1C (CD11C⁺ CD1C⁺ DC). Both Batches 1 and 2 contain 96 pDC cells and 96 DoubleNeg cells. Other than that, Batch 1 contains 96 CD141 cells, while Batch 2 contains 96 CD1C cells. Note that CD141 and CD1C are two cell types that are quite similar, so the main challenge in this dataset is to keep these two cell types separated while merging each of the other two cell types (pDC and DoubleNeg) from the two different batches. In this dataset, SCIBER achieves a much lower cLISI value (1.12) than those from the other methods (1.29 from Harmony, 1.43 from LIGER and 1.38 from Seurat). Figure 1 gives the 2D UMAP plots of the batch-effect corrected data from different methods, as well as the raw (i.e. uncorrected) data. Cells in the first row of sub-plots are colored according to their batches, and cells in the second row of sub-plots are colored according to the cell types they belong to. Ideally, the cells should be separated according to the cell type but not the batch. In the figure, this means that points of different colors in a plot in the second row (which denote different cell types) should form separate clusters, and at the same time, points of the same color in a plot in the second row (which denote the same cell type) should not form sub-clusters that each have a different color in the corresponding plot in the first row (which denote different batches). Following this guidance, we find that while all methods successfully keep pDC and DoubleNeg separated from other cell types, only SCIBER keeps CD141 and CD1C as two well-separated clusters. Meanwhile, Harmony, Seurat and LIGER mistakenly mix CD141 and CD1C, which indicates that they likely have incorrectly aligned cells from CD141 and CD1C. The success of SCIBER in separating CD141 and CD1C could be attributed to the fact that it does not try to align all cell clusters across batches (Step 3 of the algorithm). Additionally, in LIGER, the CD1C cells are further divided into two sub-clusters, determined by which batch they come from, and one of the two sub-clusters is thoroughly mixed with the CD141 cells. Interestingly, the UMAP plots of the raw data indicate that batch effects in this data might be minor; it seems that methods other than SCIBER likely have misidentified a proportion of the differences between cell types as batch effects and removed them.
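
To give a concrete feel for what a local cell-type purity score measures, the sketch below computes a simplified, unweighted inverse Simpson index over each cell's nearest neighbors. It is only an illustration and not the cLISI implementation of Korsunsky et al. (2019), which weights neighbors with a perplexity-based kernel; the function name `simple_clisi`, the parameter `k` and the use of plain Euclidean distance are assumptions introduced here.

```python
from collections import Counter
from math import dist


def simple_clisi(points, cell_types, k=3):
    """Simplified, unweighted local purity score (NOT the published cLISI).

    points: list of coordinate tuples, one per cell (e.g. a 2D embedding).
    cell_types: list of cell-type labels, aligned with points.
    Returns the inverse Simpson index of the labels among each cell's k
    nearest neighbors (the cell itself excluded), averaged over all cells.
    A value of 1.0 means every neighborhood contains a single cell type.
    """
    scores = []
    for i, p in enumerate(points):
        # Indices of the k nearest other cells by Euclidean distance.
        neighbors = sorted(
            (j for j in range(len(points)) if j != i),
            key=lambda j: dist(p, points[j]),
        )[:k]
        counts = Counter(cell_types[j] for j in neighbors)
        n = len(neighbors)
        # Inverse Simpson index: 1 / sum of squared label proportions.
        simpson = sum((c / n) ** 2 for c in counts.values())
        scores.append(1.0 / simpson)
    return sum(scores) / len(scores)
```

With two well-separated groups of cells, labels that follow the groups give a score of 1.0, while labels that are mixed within each group push the score above 1.0, which is the direction the text describes as less preferred.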

Table 1. Description of the eight datasets on which the batch correction algorithms were tested

No.  Dataset           No. of batches  No. of cells  Technologies                                     References
1    Mouse cell atlas  2               6954          Microwell-Seq, Smart-Seq2                        Tabula Muris Consortium et al. (2018) and Han et al. (2018)
2    Human PBMCs       2               15 476        10× 3′, 10× 5′                                   Zheng et al. (2017)
3    Human DCs         2               576           Smart-Seq2                                       Villani et al. (2017)
4    Mouse retina      2               71 638        Drop-seq                                         Macosko et al. (2015) and Shekhar et al. (2016)
5    Mouse brain       2               83 323        Drop-seq, SPLIT-seq                              Rosenberg et al. (2018) and Saunders et al. (2018)
6    Mouse HSPCs       2               4649          MARS-seq, Smart-Seq2                             Nestorowa et al. (2016) and Paul et al. (2015)
7    Human pancreas    5               14 767        inDrop, CEL-Seq2, Smart-Seq2, SMARTer, SMARTer   Baron et al. (2016), Muraro et al. (2016), Segerstolpe et al. (2016), Wang et al. (2016) and Xin et al. (2016)
8    Cell line         3               9531          10×                                              Hie et al. (2019) and Zheng et al. (2017)

PBMCs, peripheral blood mononuclear cells; DCs, dendritic cells; HSPCs, hematopoietic stem and progenitor cells.

Fig. 1. Scatter plots of cells from all batches of the human dendritic cell data (Dataset 3) on a 2D UMAP space. Different sub-plots show corrected data from different methods
(from left to right: SCIBER, Harmony, Seurat and LIGER), as well as the raw (uncorrected) data. Cells in the first row of sub-plots are colored according to their batches, while
cells in the second row are colored according to their actual cell types. Only SCIBER keeps CD141 and CD1C as two separate clusters, while the other three
methods mix these two cell types

A detailed discussion of the results on the other datasets is available in the Supplementary Material. Overall, as we have summarized, SCIBER shows similar, and sometimes better, accuracy in removing batch effects compared with the other three methods. SCIBER also has clear advantages in computational load and in other respects, which are discussed in the following sections.

3.2 Computational efficiency of SCIBER


As a simple algorithm, SCIBER is computationally more efficient than Harmony, LIGER and Seurat, and it scales to datasets with a large number of cells. To show this, we use the original full mouse brain data (Dataset 5). We down-sample the number of cells from 833 206 to 249 000, to 83 000, to 24 900 and finally to 8300, and record the runtime of each method on an eight-core machine with a 2.50 GHz CPU and 16 GB RAM. The results are shown in Figure 2. (There is no result for Seurat on the full dataset because it ran out of memory.) We find that the computational time of SCIBER increases roughly linearly with the number of cells, although this may be hard to see from the figure, whose x-axis is on the log scale but whose y-axis is not. In general, SCIBER is 2.5 times faster than Harmony and LIGER, which in turn are much faster than Seurat.

Fig. 2. Runtime of different methods on the full and the down-sampled mouse brain data (Dataset 5). The full dataset contains 833 206 cells, and it is down-sampled to smaller numbers of cells. The runtime of Seurat on the full dataset is missing as it was terminated for excessive memory requests. It is clear that SCIBER has a significantly shorter runtime than all the other three methods
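The down-sampling and the "roughly linear" claim of Section 3.2 can be made concrete with a small sketch. The two helper functions below (`downsample_indices`, `scaling_exponent`) are hypothetical names introduced here for illustration and are not part of the SCIBER package. The runtimes are placeholders generated from an exactly linear law, not measurements from the paper; with real timings, a log-log slope close to 1 would support linear scaling even when the plot uses a log-scaled x-axis.

```python
import numpy as np

def downsample_indices(n_cells, target, seed=0):
    """Pick `target` distinct cell indices out of `n_cells` (sampling without replacement)."""
    rng = np.random.default_rng(seed)
    return np.sort(rng.choice(n_cells, size=target, replace=False))

def scaling_exponent(n_cells, runtimes):
    """Slope of log(runtime) against log(n_cells); a slope near 1 means roughly linear growth."""
    slope, _intercept = np.polyfit(np.log(n_cells), np.log(runtimes), deg=1)
    return float(slope)

# Cell counts quoted in Section 3.2.
sizes = np.array([8300, 24900, 83000, 249000, 833206])

# Placeholder runtimes in seconds (an assumption, NOT the paper's measurements):
# an exactly linear law, used only to show what the fit returns in that case.
placeholder_runtimes = 2e-4 * sizes

# Indices of the cells kept in the smallest down-sampled dataset.
idx = downsample_indices(833206, 8300)

print(len(idx), scaling_exponent(sizes, placeholder_runtimes))
```

Fitting on the log-log scale is a convenient choice here because the cell counts span two orders of magnitude, so a fit on the raw scale would be dominated by the largest dataset.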

3.3 SCIBER keeps the reference batch untouched while projecting query batches

As a reference-based method, SCIBER does not change the reference batch when removing batch effects from the query batches. This makes it easy to add batches: when additional batches are added, all downstream analyses on the reference batch, such as visualization (dimension reduction), differential expression analysis, regulatory network construction and pseudotime analysis, remain unchanged.
Using visualization as an example, Figure 3 demonstrates this point on the human pancreas data (Dataset 7), which contains human pancreas cells from five batches. We use the batch with the largest number of cells as the reference batch and integrate it with different numbers of other batches. As the plot shows, as new query batches are added, neither the reference batch nor the query batches that have already been integrated change.

3.4 SCIBER facilitates downstream analyses

The batch-effect corrected data obtained by SCIBER are the expression of individual genes, which can be directly used for downstream analyses. Here, we use pseudotime analysis and marker gene analysis on the mouse hematopoietic stem and progenitor cell data (Dataset 6) as an example. This dataset contains mouse hematopoietic stem and progenitor cells from two batches. Each batch contains common myeloid progenitor (CMP) cells, megakaryocyte–erythrocyte progenitor (MEP) cells, granulocyte–monocyte progenitor (GMP) cells and a few other types of cells. Biologically, these three cell types are of particular interest: CMPs may differentiate into either GMPs or MEPs, which form the cells of the granulocyte/macrophage or megakaryocyte/erythroid lineages, respectively (Akashi et al., 2000). In the expression data, the clustering of cells is expected to reflect this developmental lineage, with GMPs and MEPs forming their respective clusters that are both close to CMPs (Haghverdi et al., 2018). However, in the raw data, the cells are clustered mainly according to the batches they come from instead of their cell types (see the last column of sub-plots in Supplementary Fig. S5). This incorrect clustering is largely corrected by removing batch effects using any of the four methods, as shown in Supplementary Figure S5, indicating the success of these four methods. However, SCIBER is the only method among them that outputs the expression of individual genes.
We pool the batch-effect corrected data by SCIBER from all batches and conduct a pseudotime analysis with the R package monocle (Trapnell et al., 2014). The result is shown in Figure 4A, where each point corresponds to a cell, with its shape (circle, cross or diamond) representing the cell type and color representing the inferred pseudotime. It is clear that the cells form two branches, both starting at the CMPs (the blue circles in the middle), one ending at the GMPs (yellow crosses on the left) and the other ending at the MEPs (red diamonds on the right). This correctly reflects the lineage. In contrast, results from the same analysis on the raw data (Fig. 4B) contradict the biological fact. For example, the MEPs (diamonds), which are supposed to have large pseudotime values, form two groups, one (blue diamonds, on the top left) having very small pseudotime and the other having very large pseudotime (red diamonds, on the bottom left).
We further investigate how the expression of marker genes of these three cell types changes as a function of pseudotime. We use marker genes defined in the CellMarker database (Zhang et al., 2019b). Figure 4C shows the expression of gene Itga2b on the batch-effect corrected data along the inferred pseudotime (x-axis). In the plot, each point represents a cell, with different colors representing different cell types, and the black curve is the fitted (averaged

Fig. 3. Multiple batch projection using UMAP visualization for the human pancreas data (Dataset 7). Cells in the first row of sub-plots are colored according to batch labels, while cells in the second row are colored according to cell type labels. The leftmost plots show the reference batch only, and the other plots from left to right show the batch-effect removed data when more batches are integrated by SCIBER. It is clear that the reference batch does not change when new batches are integrated
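The property described in Section 3.3 and illustrated in Fig. 3 follows from the reference matrix only ever being read, never written. The sketch below shows this with a toy projection. It is an illustration under stated assumptions, not SCIBER's projection step: the function `project_to_reference` is a hypothetical name, the cluster matching is supplied by hand, and each matched query cluster is simply shifted so that its mean equals that of its reference cluster. All data are synthetic.

```python
import numpy as np

def project_to_reference(query, query_labels, ref, ref_labels, matches):
    """Shift each matched query cluster so its mean equals its reference cluster's mean.

    `matches` maps a query cluster label to a reference cluster label.
    The reference matrix is read but never modified.
    Mean shifting is a stand-in chosen for brevity, not SCIBER's own projection.
    """
    corrected = query.astype(float).copy()
    for q_label, r_label in matches.items():
        q_mask = query_labels == q_label
        r_mask = ref_labels == r_label
        shift = ref[r_mask].mean(axis=0) - query[q_mask].mean(axis=0)
        corrected[q_mask] += shift
    return corrected

rng = np.random.default_rng(0)

# Synthetic reference batch: 60 cells x 5 genes, two clusters.
ref_labels = np.repeat([0, 1], 30)
ref = rng.normal(size=(60, 5))
ref[ref_labels == 1] += 4.0
ref_before = ref.copy()

def make_query(offset):
    """Synthetic query batch with the same two clusters plus a constant batch offset."""
    labels = np.repeat([0, 1], 20)
    cells = rng.normal(size=(40, 5))
    cells[labels == 1] += 4.0
    return cells + offset, labels

q1, l1 = make_query(2.0)
q2, l2 = make_query(-3.0)
matches = {0: 0, 1: 1}  # hand-supplied matching, assumed correct for this toy example

c1 = project_to_reference(q1, l1, ref, ref_labels, matches)
c2 = project_to_reference(q2, l2, ref, ref_labels, matches)      # adding a second query batch
c1_again = project_to_reference(q1, l1, ref, ref_labels, matches)

print(np.array_equal(ref, ref_before), np.array_equal(c1, c1_again))
```

Because every query batch is corrected against the same fixed reference, the result for the first query batch does not depend on whether a second one is added later.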

Fig. 4. Pseudotime and marker gene expression on the mouse hematopoietic stem and progenitor cell data (Dataset 6). (A) Inferred pseudotime from output data of SCIBER. Each point represents a cell. The point types represent cell types, and the colors represent the inferred pseudotime. Starting from the top blue colored cells, there are two clear branches. (B) Inferred pseudotime from the raw data. The triangle points are broken into two far apart clusters, one in blue color and the other in red color. (C, D) The expression of gene Itga2b based on (C) the output data of SCIBER and (D) the raw data. According to the annotation database, this gene is a marker gene for both cell types CMP and GMP, but not a marker gene for MEP. The points are colored according to their actual cell types and are arranged from left to right according to the inferred pseudotime. The black curves are the fitted expression level along the pseudotime. In (C), most CMPs appear at the very beginning, GMPs appear in the middle, and MEPs appear last. Also, the black curve shows a sudden drop, indicating a much lower expression level in MEPs compared to CMPs and GMPs. In (D), such an ordered arrangement of different cell types and a drop in expression are not observed
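A fitted expression curve of the kind described in the caption of Fig. 4 (C, D) can be approximated by ordering cells by pseudotime and averaging expression in a sliding window. The sketch below is a stand-in under stated assumptions: it uses a plain moving average rather than the smoother used for the published figure, the function name `smooth_along_pseudotime` is hypothetical, and the pseudotime and expression values are synthetic, built so that expression drops in late cells.

```python
import numpy as np

def smooth_along_pseudotime(pseudotime, expression, window=51):
    """Order cells by pseudotime and return a moving average of expression.

    `window` must be odd; the series is padded with its edge values so that the
    smoothed curve has one value per cell.
    """
    if window % 2 == 0:
        raise ValueError("window must be odd")
    order = np.argsort(pseudotime)
    x = pseudotime[order]
    y = expression[order]
    half = window // 2
    padded = np.pad(y, half, mode="edge")
    smooth = np.convolve(padded, np.ones(window) / window, mode="valid")
    return x, smooth

rng = np.random.default_rng(1)
n_cells = 600

# Synthetic pseudotime in [0, 1] and a synthetic marker gene that is high in
# early and middle cells and low in late cells (an assumption for illustration only).
pseudotime = rng.uniform(0.0, 1.0, size=n_cells)
expression = np.where(pseudotime < 0.7, 5.0, 1.0) + rng.normal(scale=0.5, size=n_cells)

x, curve = smooth_along_pseudotime(pseudotime, expression)
early = curve[x < 0.5].mean()
late = curve[x > 0.9].mean()
print(early > late)
```

A wider window gives a smoother curve at the cost of blurring a sharp drop, so the window size would need to be chosen with the number of cells in mind.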

and smoothed) expression level. We notice that most red points (CMPs) appear at the very beginning, green points (GMPs) appear in the middle, and blue points (MEPs) appear last. This is consistent with the two branches observed in Figure 4A. Moreover, the black curve drops sharply as MEPs appear, indicating a much lower expression in MEPs. This reflects the biological truth: this gene is a marker gene (actually the only marker gene of this type in the CellMarker database) that is common to CMPs and GMPs but not to MEPs. Figure 4D shows the expression of Itga2b on the raw data, where its drop in MEPs has been distorted by batch effects. Supplementary Figures S8–S11 give plots for marker genes of other types, in which we have similar observations.

4 Discussion

We have presented SCIBER, a simple algorithm that removes batch effects from scRNA-seq data in a computationally efficient manner. While its accuracy is similar to, and sometimes better than, that of state-of-the-art methods, its ability to output expression data at the gene level facilitates follow-up analyses, and its reference-based nature makes it especially suitable for mapping data to standard reference datasets.
Unlike existing algorithms, SCIBER identifies matched cell clusters across batches according to DE genes. The four steps of the SCIBER algorithm rely on simple algorithms and/or tests, which do not impose strong assumptions on the distribution of the data or the shape of each cluster. (Although K-means clustering implicitly assumes that clusters are round and of similar sizes, SCIBER uses a large K, which breaks up large clusters and hence accommodates virtually any shape for the cell types.) The simplicity of the algorithm also makes it computationally fast. In our current implementation of SCIBER, the most time-consuming step is K-means clustering, and thus SCIBER can be further accelerated by replacing K-means with a more efficient clustering algorithm or by using a more efficient implementation of K-means.
The SCIBER algorithm has three parameters: the number of clusters K in Step 1, the number of marker genes h in Step 2 and the proportion of matched clusters x in Step 3. In Section 3, we have shown that no tuning of these parameters is needed for any of the datasets we consider and that the default values work well. To confirm the insensitivity of our method to the tuning parameters, we systematically study their effect. The results for different values of K ($0.8\sqrt{n_0}$, $\sqrt{n_0}$ and $1.2\sqrt{n_0}$), h (50, 75 and 100) and x (0.4, 0.5, 0.6 and 0.7) in all eight datasets we consider are shown in Supplementary Tables S9–S11. It is clear that the performance of SCIBER changes little under different values of the parameters.

Acknowledgement

We thank Zixuan Song for proofreading the manuscript.

Funding

This work was supported by the National Institutes of Health [R01CA252878 and R01CA222405] and the DOD BCRP Breakthrough Award, Level 2 [W81XWH2110432] to J.L.

Conflict of Interest: none declared.

References

Akashi,K. et al. (2000) A clonogenic common myeloid progenitor that gives rise to all myeloid lineages. Nature, 404, 193–197.
Amodio,M. et al. (2019) Exploring single-cell data with deep multitasking neural networks. Nat. Methods, 16, 1139–1145.
Andersen,R. et al. (2006) Local graph partitioning using pagerank vectors. In: 2006 47th Annual IEEE Symposium on Foundations of Computer Science (FOCS’06), Berkeley, CA, pp. 475–486. IEEE.
Argelaguet,R. et al. (2021) Computational principles and challenges in single-cell data integration. Nat. Biotechnol., 39, 1202–1215.
Barkas,N. et al. (2019) Joint analysis of heterogeneous single-cell RNA-seq dataset collections. Nat. Methods, 16, 695–698.
Baron,M. et al. (2016) A single-cell transcriptomic map of the human and mouse pancreas reveals inter- and intra-cell population structure. Cell Syst., 3, 346–360.e4.
Cao,J. et al. (2017) Comprehensive single-cell transcriptional profiling of a multicellular organism. Science, 357, 661–667.
Fan,J. et al. (2020) Single-cell transcriptomics in cancer: computational challenges and opportunities. Exp. Mol. Med., 52, 1452–1465.
Forcato,M. et al. (2021) Computational methods for the integrative analysis of single-cell data. Brief. Bioinform., 22, 20–29.
Haghverdi,L. et al. (2018) Batch effects in single-cell RNA-sequencing data are corrected by matching mutual nearest neighbors. Nat. Biotechnol., 36, 421–427.
Han,X. et al. (2018) Mapping the mouse cell atlas by Microwell-seq. Cell, 172, 1091–1107.e17.
Hashimshony,T. et al. (2012) CEL-Seq: single-cell RNA-seq by multiplexed linear amplification. Cell Rep., 2, 666–673.
Hastie,T. et al. (2009) The Elements of Statistical Learning: Data Mining, Inference, and Prediction, Vol. 2. Springer, New York.
Hicks,S.C. et al. (2018) Missing data and technical variability in single-cell RNA-sequencing experiments. Biostatistics, 19, 562–578.
Hie,B. et al. (2019) Efficient integration of heterogeneous single-cell transcriptomes using scanorama. Nat. Biotechnol., 37, 685–691.
Johansen,N. and Quon,G. (2019) Scalign: a tool for alignment, integration, and rare cell identification from scRNA-seq data. Genome Biol., 20, 1–21.
Johnson,W.E. et al. (2007) Adjusting batch effects in microarray expression data using empirical bayes methods. Biostatistics, 8, 118–127.
Kang,H.M. et al. (2018) Multiplexed droplet single-cell RNA-sequencing using natural genetic variation. Nat. Biotechnol., 36, 89–94.
Klein,A.M. et al. (2015) Droplet barcoding for single-cell transcriptomics applied to embryonic stem cells. Cell, 161, 1187–1201.
Korsunsky,I. et al. (2019) Fast, sensitive and accurate integration of single-cell data with harmony. Nat. Methods, 16, 1289–1296.
Leek,J.T. et al. (2010) Tackling the widespread and critical impact of batch effects in high-throughput data. Nat. Rev. Genet., 11, 733–739.
Lin,Y. et al. (2019) Scmerge leverages factor analysis, stable expression, and pseudoreplication to merge multiple single-cell RNA-seq datasets. Proc. Natl. Acad. Sci. USA, 116, 9775–9784.
Lopez,R. et al. (2018) Deep generative modeling for single-cell transcriptomics. Nat. Methods, 15, 1053–1058.
Lotfollahi,M. et al. (2019) scGen predicts single-cell perturbation responses. Nat. Methods, 16, 715–721.
Lun,A. (2019) Further MNN Algorithm Development. Github repository, https://marionilab.github.io/FurtherMNN2018/theory/description.html.
Macosko,E.Z. et al. (2015) Highly parallel genome-wide expression profiling of individual cells using nanoliter droplets. Cell, 161, 1202–1214.
McGinnis,C.S. et al. (2019) Multi-seq: sample multiplexing for single-cell RNA sequencing using lipid-tagged indices. Nat. Methods, 16, 619–626.
McInnes,L. et al. (2018) UMAP: uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426.
Muraro,M.J. et al. (2016) A single-cell transcriptome atlas of the human pancreas. Cell Syst., 3, 385–394.e3.
Nestorowa,S. et al. (2016) A single-cell resolution map of mouse hematopoietic stem and progenitor cell differentiation. Blood, 128, e20–e31.
Paul,F. et al. (2015) Transcriptional heterogeneity and lineage commitment in myeloid progenitors. Cell, 163, 1663–1677.
Picelli,S. et al. (2014) Full-length RNA-seq from single cells using smart-seq2. Nat. Protoc., 9, 171–181.
Polanski,K. et al. (2020) BBKNN: fast batch alignment of single cell transcriptomes. Bioinformatics, 36, 964–965.
Ramsköld,D. et al. (2012) Full-length mRNA-seq from single-cell levels of RNA and individual circulating tumor cells. Nat. Biotechnol., 30, 777–782.
Regev,A. et al.; Human Cell Atlas Meeting Participants. (2017) Science forum: the human cell atlas. elife, 6, e27041.
Risso,D. et al. (2018) A general and flexible method for signal extraction from single-cell RNA-seq data. Nat. Commun., 9, 1–17.
Rosenberg,A.B. et al. (2018) Single-cell profiling of the developing mouse brain and spinal cord with split-pool barcoding. Science, 360, 176–182.
Saunders,A. et al. (2018) Molecular diversity and specializations among the cells of the adult mouse brain. Cell, 174, 1015–1030.e16.
Segerstolpe,Å. et al. (2016) Single-cell transcriptome profiling of human pancreatic islets in health and type 2 diabetes. Cell Metab., 24, 593–607.

Shaham,U. et al. (2017) Removal of batch effects using distribution-matching residual networks. Bioinformatics, 33, 2539–2546.
Shekhar,K. et al. (2016) Comprehensive classification of retinal bipolar neurons by single-cell transcriptomics. Cell, 166, 1308–1323.e30.
Silverman,J.D. et al. (2020) Naught all zeros in sequence count data are the same. Comput. Struct. Biotechnol. J., 18, 2789–2798.
Smyth,G.K. and Speed,T. (2003) Normalization of cdna microarray data. Methods, 31, 265–273.
Stegle,O. et al. (2015) Computational and analytical challenges in single-cell transcriptomics. Nat. Rev. Genet., 16, 133–145.
Stuart,T. et al. (2019) Comprehensive integration of single-cell data. Cell, 177, 1888–1902.e21.
Svensson,V. (2020) Droplet scRNA-seq is not zero-inflated. Nat. Biotechnol., 38, 147–150.
Tabula Muris Consortium, et al. (2018) Single-cell transcriptomics of 20 mouse organs creates a tabula muris. Nature, 562, 367.
Tan,P.-N. et al. (2016) Introduction to Data Mining. Pearson Education India, Delhi.
Tran,H.T.N. et al. (2020) A benchmark of batch-effect correction methods for single-cell RNA sequencing data. Genome Biol., 21, 1–32.
Trapnell,C. et al. (2014) The dynamics and regulators of cell fate decisions are revealed by pseudotemporal ordering of single cells. Nat. Biotechnol., 32, 381–386.
Villani,A.-C. et al. (2017) Single-cell RNA-seq reveals new types of human blood dendritic cells, monocytes, and progenitors. Science, 356,
Wang,Y. et al. (2019) Subpopulation detection and their comparative analysis across single-cell experiments with scpopcorn. Cell Syst., 8, 506–513.e5.
Wang,Y.J. et al. (2016) Single-cell transcriptomics of the human endocrine pancreas. Diabetes, 65, 3028–3038.
Welch,J.D. et al. (2019) Single-cell multi-omic integration compares and contrasts features of brain cell identity. Cell, 177, 1873–1887.e17.
Xin,Y. et al. (2016) RNA sequencing of single human islet cells reveals type 2 diabetes genes. Cell Metab., 24, 608–615.
Yang,Y. et al. (2021) SMNN: batch effect correction for single-cell RNA-seq data via supervised mutual nearest neighbor detection. Brief. Bioinformatics, 22, bbaa097.
Zhang,F. et al. (2019a) A novel approach to remove the batch effect of single-cell data. Cell Discov., 5, 1–4.
Zhang,X. et al. (2019b) Cellmarker: a manually curated resource of cell markers in human and mouse. Nucleic Acids Res., 47, D721–D728.
Zhao,P. et al. (2022) Modeling zero inflation is not necessary for spatial transcriptomics. Genome Biol., 23, 1–19.
Zheng,G.X. et al. (2017) Massively parallel digital transcriptional profiling of single cells. Nat. Commun., 8, 14049–14012.
