SCIBER A Simple Method For Removing Batch Effects
https://doi.org/10.1093/bioinformatics/btac819
Advance Access Publication Date: 22 December 2022
Original Paper
Gene expression
Received on September 20, 2022; revised on November 27, 2022; editorial decision on December 15, 2022; accepted on December 21, 2022
Abstract
Motivation: Integrative analysis of multiple single-cell RNA-sequencing datasets allows for more comprehensive
characterizations of cell types, but systematic technical differences between datasets, known as ‘batch effects’, need
to be removed before integration to avoid misleading interpretation of the data. Although many batch-effect-removal
methods have been developed, there is still much room for improvement: most existing methods only
give dimension-reduced data instead of expression data of individual genes, are based on computationally demanding
models, and are black-box models that are difficult to interpret or tune.
Results: Here, we present a new batch-effect-removal method called SCIBER (Single-Cell Integrator and Batch Effect
Remover) and study its performance on real datasets. SCIBER matches cell clusters across batches according to the
overlap of their differentially expressed genes. As a simple algorithm that has better scalability to data with a large
number of cells and is easy to tune, SCIBER shows comparable and sometimes better accuracy in removing batch
effects on real datasets compared to the state-of-the-art methods, which are much more complicated. Moreover,
SCIBER outputs expression data in the original space, that is, the expression of individual genes, which can be used
directly for downstream analyses. Additionally, SCIBER is a reference-based method, which assigns one of the
batches as the reference batch and keeps it untouched during the process, making it especially suitable for integrating
user-generated datasets with standard reference data such as the Human Cell Atlas.
Availability and implementation: SCIBER is publicly available as an R package on CRAN:
https://cran.r-project.org/web/packages/SCIBER/. A vignette is included in the CRAN R package.
Contact: [email protected]
Supplementary information: Supplementary data are available at Bioinformatics online.
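As a rough illustration of the cluster-matching idea stated in the abstract (matching clusters across batches by the overlap of their differentially expressed genes), the following Python sketch scores each pair of clusters by the size of the overlap of their marker-gene sets. The hypergeometric p-value, the threshold `alpha`, the function names and the toy gene sets are all assumptions made for this illustration; they are not taken from the SCIBER R package.

```python
from math import comb

def overlap_pvalue(n_genes, markers_a, markers_b):
    """Upper-tail hypergeometric p-value for the overlap of two gene sets.

    Assumption for this sketch: overlap significance is scored this way.
    """
    k = len(markers_a & markers_b)
    a, b = len(markers_a), len(markers_b)
    # P(overlap >= k) if the b genes were drawn at random from n_genes genes
    tail = sum(comb(a, i) * comb(n_genes - a, b - i)
               for i in range(k, min(a, b) + 1))
    return tail / comb(n_genes, b)

def match_clusters(n_genes, ref_markers, query_markers, alpha=0.05):
    """Match each query cluster to the reference cluster with the smallest
    overlap p-value; clusters with no p-value below alpha stay unmatched."""
    matches = {}
    for q_name, q_set in query_markers.items():
        pvals = {r_name: overlap_pvalue(n_genes, r_set, q_set)
                 for r_name, r_set in ref_markers.items()}
        best = min(pvals, key=pvals.get)
        if pvals[best] < alpha:
            matches[q_name] = best
    return matches

# Toy marker sets (hypothetical, for illustration only)
ref_markers = {
    "ref_T": {"CD3D", "CD3E", "IL7R", "CCR7"},
    "ref_B": {"MS4A1", "CD79A", "CD79B", "CD19"},
}
query_markers = {
    "q1": {"CD3D", "CD3E", "IL7R", "LTB"},
    "q2": {"MS4A1", "CD79A", "CD19", "IGHM"},
    "q3": {"HBB", "HBA1"},  # shares no markers with any reference cluster
}
matches = match_clusters(100, ref_markers, query_markers)
print(matches)
```

In this toy example, `q1` and `q2` each share three of four markers with one reference cluster and are matched to it, while `q3` has no overlapping markers and is left unmatched.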
and then use these pairs to simultaneously integrate batches and remove batch effects. The method mutual nearest
neighbors (MNN)-Correct (Haghverdi et al., 2018) proposes to use MNN to identify anchors, and this idea was then
applied and extended by SMNN (Yang et al., 2021), BEER (Zhang et al., 2019a), Seurat (Stuart et al., 2019),
fastMNN (Lun, 2019), Scanorama (Hie et al., 2019) and scMerge (Lin et al., 2019).
Graph-based methods construct weighted graphs between and

differential expression analysis and pseudotime analysis, require expression data of individual genes. Second, many
existing methods are based on computationally demanding models. As examples, graph-based methods often have a large
algorithmic complexity, mixture non-Gaussian (e.g. zero-inflated negative binomial) models often require iterations
and could have convergence issues [and also, many parametric models could be unsatisfactory to model real data,
see e.g. Svensson (2020), Silverman et al. (2020) and Zhao et al.
batches are the batch-effect corrected data, or ‘corrected data’ for short, which can be directly compared with the
untouched reference data for downstream analyses such as differential expression analysis, or directly concatenated
with the untouched reference batch for downstream analyses such as cell type identification or lineage analysis.
The basic idea of SCIBER, different from all other batch-effect-removal algorithms, is to match cell clusters across
batches by differ-

batch. The decomposition can be written as $\hat{x}_j^{(b)} = X_c^{(b)} \hat{\beta}^{(b)}$ with
$\hat{\beta}^{(b)} = [(X_c^{(b)})^T X_c^{(b)}]^{-1} (X_c^{(b)})^T x_j^{(b)}$. Here, $\hat{x}_j^{(b)}$ is the fitted
expression. After the decomposition, the projected data for cell j (i.e. batch-effect removed expression of cell j),
denoted as $\tilde{x}_j^{(b)}$, is obtained by $X_c^{(0)} \hat{\beta}^{(b)}$.
The above description is for each cell. Actually, this decompos-
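The least-squares decomposition and projection described above can be sketched numerically as follows. This is a minimal illustration, assuming that X_c^(b) and X_c^(0) are genes-by-clusters matrices whose columns are the centroids of matched clusters in the query batch b and in the reference batch 0 (this excerpt does not define them), and that the coefficients are obtained by ordinary least squares as in the formula above. The function name and the toy numbers are hypothetical.

```python
import numpy as np

def project_to_reference(X_query, C_query, C_ref):
    """Project query cells onto the reference batch.

    X_query: genes x cells expression matrix of the query batch.
    C_query: genes x K matrix playing the role of X_c^(b) (assumed centroids).
    C_ref:   genes x K matrix playing the role of X_c^(0) (assumed centroids).
    """
    # Solves min ||C_query @ beta - X_query|| for all cells at once, which
    # equals [(C^T C)^{-1} C^T] x_j column by column when C has full rank.
    beta, *_ = np.linalg.lstsq(C_query, X_query, rcond=None)
    fitted = C_query @ beta      # fitted expression in the query batch
    projected = C_ref @ beta     # batch-effect removed expression
    return beta, fitted, projected

# Toy example: 3 genes, 2 matched clusters, 2 query cells
C_ref = np.array([[2.0, 0.0],
                  [0.0, 2.0],
                  [1.0, 1.0]])
C_query = C_ref + 1.0            # query centroids shifted by a batch effect
X_query = np.column_stack([
    C_query[:, 0],                           # a cell at query centroid 1
    0.5 * (C_query[:, 0] + C_query[:, 1]),   # a cell halfway between centroids
])

beta, fitted, projected = project_to_reference(X_query, C_query, C_ref)
print(np.round(projected, 6))
```

In this toy case, the first cell is mapped to the first reference centroid and the second cell to the midpoint of the two reference centroids, so the constant shift between batches is removed while the reference centroids themselves are never modified.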
review paper systematically compared 14 batch-effect-removal methods from all the five categories on these real
datasets. Their results show that Harmony (Korsunsky et al., 2019), Seurat (v3) (Stuart et al., 2019) and LIGER
(Welch et al., 2019) are the best methods in terms of accuracy. A summary of these eight datasets is given in
Table 1, and details are given in Supplementary Material. It is worth noting that Dataset 5, which we downloaded
from the review paper and contains 83 323 cells, is a subsampled dataset (the

others, and significantly better than all the other three methods in the other two datasets (Datasets 3 and 6).
Figure 1 and Supplementary Figures S1–S7 give the UMAP plots of all the datasets. The overall impression from the
UMAP plots is consistent with that from cLISI.
Here, we use the human dendritic cells dataset (Dataset 3) as an example. The two batches of this dataset contain
four different types of human dendritic cells (DCs): pDC (CD11C−CD123+ plasmacy-
Table 1. Description of the eight datasets on which the batch correction algorithms were tested
PBMCs, peripheral blood mononuclear cells; DCs, dendritic cells; HSPCs, hematopoietic stem and progenitor cells.
3.3 SCIBER keeps the reference batch untouched while projecting query batches
As a reference-based method, SCIBER does not change the reference batch when removing batch effects from the query
batches. This facilitates adding additional batches: when additional batches are added, all downstream analyses on
the reference batch, such as visualization (dimension reduction), differential expression analysis, regulatory
network construction and pseudotime analysis, will remain unchanged.
Using visualization as an example, Figure 3 demonstrates this point on the human pancreas data (Dataset 7), which
contains human pancreas cells from five batches. We use the batch with the largest number of cells as the reference
batch and integrate it with different numbers of other batches. As the plot shows, as new query batches are added,
neither the reference batch nor the query batches that have already been integrated change.

3.4 SCIBER facilitates downstream analyses
The batch-effect corrected data obtained by SCIBER are the expression of individual genes, which can be directly
used for downstream analyses. Here, we use pseudotime analysis and marker gene analysis on the mouse hematopoietic
stem and progenitor cell data (Dataset 6) as an example. This dataset contains mouse hematopoietic stem and
progenitor cells from two batches. Each batch contains common myeloid progenitor (CMP) cells,
megakaryocyte–erythrocyte progenitor (MEP) cells, granulocyte–monocyte progenitor (GMP) cells and a few other types
of cells. Biologically, these three cell types are of particular interest: CMPs may differentiate into either GMPs
or MEPs, which form the cells of the granulocyte/
macrophage or megakaryocyte/erythroid lineages, respectively (Akashi et al., 2000). In the expression data, the
clustering of cells is expected to reflect this developmental lineage, with GMPs and MEPs forming their respective
clusters that are both close to CMPs (Haghverdi et al., 2018). However, in the raw data, the cells are clustered
mainly according to the batches they come from instead of their cell types (see the last column of sub-plots in
Supplementary

pseudotime. It is clear that the cells form two branches, both starting at the CMP (the blue circles in the middle),
one ending at GMP (yellow crosses on the left) and the other ending at MEP (red diamonds on the right). This
correctly reflects the lineage. On the contrary, results from the same analysis in the raw data (Fig. 4B) contradict
the biological fact. For example, the MEPs (diamonds), which are supposed to have large pseudotime values, form two
groups, one
Fig. 3. Multiple batch projection using UMAP visualization for the human pancreas data (Dataset 7). Cells in the first row of sub-plots are colored according to batch labels,
while cells in the second row are colored according to cell type labels. The leftmost plots show the reference batch only, and the other plots from left to right show the
batch-effect removed data when more batches are integrated by SCIBER. It is clear that the reference batch does not change when new batches are integrated.
Fig. 4. Pseudotime and marker gene expression on the mouse hematopoietic stem and progenitor cell data (Dataset 6). (A) Inferred pseudotime from output data of SCIBER.
Each point represents a cell. The point types represent cell types, and the colors represent the inferred pseudotime. Starting from the top blue colored cells, there are two clear
branches. (B) Inferred pseudotime from the raw data. The triangle points are broken into two far apart clusters, one in blue color and the other in red color. (C, D) The
expression of gene Itga2b based on (C) the output data of SCIBER and (D) the raw data. According to the annotation database, this gene is a marker gene for both cell types CMP
and GMP, but not a marker gene for MEP. The points are colored according to their actual cell types and are arranged from left to right according to the inferred pseudotime.
The black curves are the fitted expression level along the pseudotime. In (C), most CMPs appear at the very beginning, GMPs appear in the middle, and MEPs appear last.
Also, the black curve shows a sudden drop, indicating a much lower expression level in MEPs compared to CMPs and GMPs. In (D), such an ordered arrangement of different
cell types and a drop in expression are not observed.
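To make the kind of curve shown in Fig. 4C and D concrete, the sketch below orders cells by pseudotime and smooths the expression of one gene along that order. A simple running mean is used here as a stand-in smoother; this excerpt does not state which fitting method was used for the black curves, so the smoother, the function name and the toy values are all assumptions for illustration and are not results from the paper.

```python
def smooth_along_pseudotime(pseudotime, expression, window=3):
    """Sort cells by pseudotime and return (sorted_times, running_mean).

    The running mean uses a centered window that is truncated at both ends.
    """
    order = sorted(range(len(pseudotime)), key=lambda i: pseudotime[i])
    times = [pseudotime[i] for i in order]
    values = [expression[i] for i in order]
    half = window // 2
    smoothed = []
    for k in range(len(values)):
        lo, hi = max(0, k - half), min(len(values), k + half + 1)
        smoothed.append(sum(values[lo:hi]) / (hi - lo))
    return times, smoothed

# Toy data (hypothetical): a marker that is high early and low late
pseudotime = [0.9, 0.1, 0.5, 0.2, 0.8, 0.6]
expression = [0.0, 5.0, 4.0, 5.0, 1.0, 4.0]

times, smoothed = smooth_along_pseudotime(pseudotime, expression)
for t, s in zip(times, smoothed):
    print(f"{t:.1f}\t{s:.2f}")
```

With these toy values the smoothed curve stays high for the early cells and falls for the late ones, which is the qualitative pattern the caption describes for a gene that is not a marker of the latest cell type.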
and smoothed) expression level. We notice that most red points (CMPs) appear at the very beginning, green points
(GMPs) appear in the middle, and blue points (MEPs) appear last. This is consistent with the two branches observed
in Figure 4A. Moreover, the black curve has a drastic drop as MEPs appear, indicating a much lower expression in
MEPs. This reflects the biological truth: this gene is a marker gene (actually the only marker gene of this type in

Argelaguet,R. et al. (2021) Computational principles and challenges in single-cell data integration. Nat. Biotechnol., 39, 1202–1215.
Barkas,N. et al. (2019) Joint analysis of heterogeneous single-cell RNA-seq dataset collections. Nat. Methods, 16, 695–698.
Baron,M. et al. (2016) A single-cell transcriptomic map of the human and mouse pancreas reveals inter- and intra-cell population structure. Cell Syst., 3, 346–360.e4.
Cao,J. et al. (2017) Comprehensive single-cell transcriptional profiling of a
Shaham,U. et al. (2017) Removal of batch effects using distribution-matching residual networks. Bioinformatics, 33, 2539–2546.
Shekhar,K. et al. (2016) Comprehensive classification of retinal bipolar neurons by single-cell transcriptomics. Cell, 166, 1308–1323.e30.
Silverman,J.D. et al. (2020) Naught all zeros in sequence count data are the same. Comput. Struct. Biotechnol. J., 18, 2789–2798.
Smyth,G.K. and Speed,T. (2003) Normalization of cDNA microarray data. Methods, 31, 265–273.
Villani,A.-C. et al. (2017) Single-cell RNA-seq reveals new types of human blood dendritic cells, monocytes, and progenitors. Science, 356,
Wang,Y. et al. (2019) Subpopulation detection and their comparative analysis across single-cell experiments with scpopcorn. Cell Syst., 8, 506–513.e5.
Wang,Y.J. et al. (2016) Single-cell transcriptomics of the human endocrine pancreas. Diabetes, 65, 3028–3038.
Welch,J.D. et al. (2019) Single-cell multi-omic integration compares and con-