doi: 10.1093/bib/bbz062
Advance Access Publication Date:
Review article
Abstract
Appropriate ways to measure the similarity between single-cell RNA-sequencing (scRNA-seq) data are ubiquitous in
bioinformatics, but using single clustering or classification methods to process scRNA-seq data is generally difficult. This
has led to the emergence of integrated methods and tools that aim to automatically process specific problems associated
with scRNA-seq data. These approaches have attracted a lot of interest in bioinformatics and related fields. In this paper, we
systematically review the integrated methods and tools, highlighting the pros and cons of each approach. We not only pay
particular attention to clustering and classification methods but also discuss methods that have emerged recently as
powerful alternatives, including nonlinear and linear methods and dimensionality reduction methods. Finally, we focus on
clustering and classification methods for scRNA-seq data, in particular, integrated methods, and provide a comprehensive
description of scRNA-seq data and download URLs.
Key words: single-cell RNA-seq; clustering; classification; similarity metric; sequences analysis; machine learning
Ren Qi is a doctoral student at the School of Computer Science and Technology, College of Intelligence and Computing, Tianjin University, Tianjin, China.
Her research interests include machine learning, metric learning and bioinformatics.
Anjun Ma is a doctoral student at the Ohio State University, USA. His research topic mainly focuses on the single-cell gene transcription regulation analysis,
including biclustering algorithm development, regulon identification and functional pathway elucidation.
Qin Ma is the director of the Bioinformatics and Mathematical Biosciences Laboratory and an associate professor at the Department of Biomedical
Informatics, College of Medicine, The Ohio State University. His email is [email protected].
Quan Zou is a professor at the University of Electronic Science and Technology of China. He is a senior member of Institute of Electrical and Electronics
Engineers (IEEE) and Association for Computing Machinery (ACM). He was named a Clarivate Analytics Highly Cited Researcher in 2018. His research focuses on
bioinformatics, machine learning and algorithms. His email is [email protected].
Submitted: 15 February 2019; Received (in revised form): 24 April 2019
© The Author(s) 2019. Published by Oxford University Press. All rights reserved. For Permissions, please email: [email protected]
This article is published and distributed under the terms of the Oxford University Press, Standard Journals Publication Model (https://academic.oup.com/journals/pages/open_access/funder_policies/chorus/standard_publication_model)
2 Qi et al.
Single-cell RNA-seq (scRNA-seq) provides precision and detail. It uses optimized next-generation sequencing technologies and acquires transcriptomic information from individual cells to provide a better understanding of cell functions at genetic and cellular levels. scRNA-seq has been used to study cancer, metagenomics and regulatory and evolutionary recurrence. Individualized clinical treatment will also be a target and direction of scRNA-seq technology. Generally, scRNA-seq has three advantages (applications) over traditional bulk RNA-seq: (i) it can reveal complex and rare cell populations, (ii) uncover heterogeneous gene regulatory relationships among cells, and (iii) track the trajectories of distinct cell lineages in
difficulties in sequencing single cells. Sequencing has the risks of low coverage, low mappability, high duplicate rate and high error rate. Some companies have developed tools to process raw scRNA-seq data, but these are only in the early stages.
The workflow for scRNA-seq is summarized in Figure 1. Single-cell data are high-dimensional and contain a lot of noise,
(PCA) [11], which is unsupervised and aims to find a lower dimensional representation of the data. Peng et al. [12] proposed two models, unsupervised Gene Ontology AutoEncoder (GOAE) and supervised Gene Ontology Neural Network (GONN), to reduce dimensionality. They combined scRNA-seq data and gene ontology information to extract the hidden layer information
t-SNE has been used recently to reduce the dimensionality of scRNA-seq data [46, 47].
assumption could lead to loss of information and decreased accuracy.
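As a rough illustration of what an unsupervised, lower dimensional representation looks like, the sketch below computes the first principal component of a tiny, made-up expression matrix by power iteration on the covariance matrix. The function names, the toy data and the choice of power iteration are assumptions made for this example; they are not taken from reference [11] or from any package mentioned in this review.

```python
import math


def first_principal_component(rows, iterations=200):
    """Return column means and the unit-length first principal component.

    `rows` is a cells x genes matrix given as a list of lists.
    """
    n, d = len(rows), len(rows[0])
    # Center each column (gene) so that PCA captures variance, not the mean.
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Sample covariance matrix (d x d).
    cov = [[sum(c[a] * c[b] for c in centered) / (n - 1) for b in range(d)]
           for a in range(d)]
    # Power iteration converges to the eigenvector with the largest eigenvalue
    # (assuming the start vector is not orthogonal to it).
    v = [1.0 / math.sqrt(d)] * d
    for _ in range(iterations):
        w = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:  # no variance left to explain
            break
        v = [x / norm for x in w]
    return means, v


def project(rows, means, component):
    """Project each centered row onto the component (1-D representation)."""
    return [sum((r[j] - means[j]) * component[j] for j in range(len(means)))
            for r in rows]


# Hypothetical toy matrix: 4 cells x 2 genes lying on the line y = 2x.
cells = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0], [4.0, 8.0]]
means, pc1 = first_principal_component(cells)
scores = project(cells, means, pc1)
```

For real scRNA-seq matrices with thousands of genes, an established PCA implementation would normally be preferred; this sketch only makes the centering and projection steps explicit.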
1   PCA [11]                                  1987   MATLAB, Python, R etc. all have a free package
3   FA [51]                                   1980   http://personality-project.org/r
4   Classical multidimensional scaling [52]   1978   https://www.statmethods.net/advstats/mds.html
of supervised models and has been shown to perform best in unsupervised models.
Some of the open dimensionality reduction methods are listed in Table 1. Each method has its own characteristics, so the most appropriate dimensionality reduction method should be chosen according to the characteristics of the target dataset. In 2007, van der Maaten published a MATLAB Toolbox for Dimensionality Reduction that contains implementations of many methods for dimensionality reduction [50].
Classic clustering methods
Clustering methods aim to group data objects into multiple classes or clusters so that objects in the same cluster are similar and different from objects in different clusters [62]. Clustering can be based on partitioning or hierarchy, where partitioning divides objects into different clusters, and hierarchical clustering organizes objects into levels. Clustering based on distance groups similar objects that are close to each other. Clustering based on a probability distribution model finds a set of objects in a group of objects that conform to a specific distribution model. The objects are not necessarily the closest or most similar, but they fit the model described by the probability distribution. Most clustering methods need prior knowledge of the number of clusters, and the quality of clustering needs to be improved. Classic clustering methods such as K-means [21], X-means [63], spectral clustering [23] and EM [22] can be used in single-cell clustering directly. K-means [21] is an unsupervised machine-learning technique that operates on a complete dataset without the need for a special training dataset. X-means [63] is an extended version of K-means with an improve-structure part, where Euclidean distance or other distance functions are used to calculate the distance between any two instances. Spectral clustering [23] is an algorithm that evolved from graph theory. The main idea is to consider all the data as points in space that can be connected by edges. EM [22] was proposed by Dempster et al. in 1977. The EM algorithm assigns a probability distribution to each instance, which indicates the probability of it belonging to each of the clusters.
Popular clustering analysis tools
Clustering is another effective method to detect cell types. Poisson and error models can be used to model count data and explain technical noise among the various sources of noise in single-cell data. scRNA-seq data analysis can help in studying the heterogeneity and evolution of cancer cells. Except for a few early methods, most of the currently available integrated methods achieve state-of-the-art performances on some problems. Although the analyses of scRNA-seq data are complex and procedures may vary depending on the purpose, many mature tools have been developed to integrate two or more functions that greatly simplify the independent operations.
The challenges of clustering in scRNA-seq research are mainly reflected in an unclear number of single-cell clusters, unfixed cell types and poor scalability. In the past few years, the number of cells in scRNA-seq experiments has grown by several orders of magnitude. Although researchers have developed a variety of tools, they are not user-friendly because they use different programming languages and require different input data formats. In this section, we describe 11 of the most advanced scRNA-seq tools currently available. A summary of these methods and tools, the download URLs and other useful information are given in Table 2.
SC3, an R package for clustering
SC3 was proposed by Kiselev et al. [64] in 2017. It is an interactive R package that uses a parallelization approach to avoid the need for user-specified parameters. SC3 was verified experimentally on 12 scRNA-seq datasets. SC3 constrained parameter values via a pipeline and was found to be superior to five other tested methods in terms of accuracy and stability. Because SC3 has a long run time, Kiselev et al. proposed randomly selecting subsets and constructing clusters based on random matrix theory. They found that the estimated value was consistent with the number of original clusters suggested by them. SC3 is based on PCA and spectral dimensionality reductions, and it utilizes k-means and additionally performs consensus clustering.
Single-cell regulatory network inference and clustering
SCENIC [64, 65] was proposed by Aibar et al. in 2017 [63], who used it to identify stable cell states in tumor and brain scRNA-seq data based on the activity of the gene regulatory networks in each cell. The authors proposed two complementary methods to handle the large dimensions of single-cell data: (i) small sample extraction to infer the gene regulatory network and (ii) gradient boosting instead of random forest (RF) to achieve a more efficient solution. They demonstrated that single-cell data were suitable for gene regulation and that genomic regulatory codes can be used to guide the identification of transcription factors and cell states.
tic regression models that predict gene sequences, providing a valuable tool for analyzing scRNA-seq data. SINCERA is based on hierarchical clustering, by which data are converted to z-scores before clustering, and the number of clusters k is determined by finding the first singleton in the hierarchy.
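The idea of converting expression values to z-scores before hierarchical clustering can be sketched as follows. This is a minimal standard-library illustration and not SINCERA's implementation: scaling each gene across cells, average linkage with Euclidean distance, and stopping at a caller-supplied k (rather than detecting the first singleton in the hierarchy) are all assumptions made for the example, as are the function names and toy data. It requires Python 3.8 or later for `math.dist`.

```python
import math
from statistics import mean, pstdev


def zscore_genes(matrix):
    """Convert each gene (column) of a cells x genes matrix to z-scores."""
    scaled_columns = []
    for col in zip(*matrix):
        mu, sigma = mean(col), pstdev(col)
        # A gene with zero variance carries no signal; map it to zeros.
        scaled_columns.append([(x - mu) / sigma if sigma > 0 else 0.0
                               for x in col])
    return [list(row) for row in zip(*scaled_columns)]


def average_linkage(a, b, data):
    """Mean Euclidean distance between all pairs drawn from clusters a and b."""
    return mean(math.dist(data[i], data[j]) for i in a for j in b)


def hierarchical_clusters(data, k):
    """Agglomerative clustering: merge the two closest clusters until k remain."""
    clusters = [[i] for i in range(len(data))]
    while len(clusters) > k:
        pairs = [(average_linkage(clusters[p], clusters[q], data), p, q)
                 for p in range(len(clusters))
                 for q in range(p + 1, len(clusters))]
        _, p, q = min(pairs)
        clusters[p] = clusters[p] + clusters[q]
        del clusters[q]
    return [sorted(c) for c in clusters]


# Hypothetical toy data: 4 cells x 2 genes measured on very different scales.
expression = [[1.0, 10.0], [2.0, 12.0], [9.0, 50.0], [10.0, 52.0]]
scaled = zscore_genes(expression)
groups = hierarchical_clusters(scaled, k=2)
```

Scaling first keeps a gene with a large numeric range (the second column here) from dominating the distance computation.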
contains the general functions of a single-cell toolkit, such as quality control, differential analysis etc. In the monocle package, it is interesting to note that dimensionality reduction must be followed by clustering in order to be visualized. In addition, Monocle provides a function to infer the developmental trajectory, which is the highlight of this tool.
BackSPIN
Zeisel et al. [75] developed BackSPIN in 2015 and tested it on the adult nervous system, which is highly complex and has many cell types that are challenging to identify. scRNA-seq data were used to classify mammalian cortical cells. BackSPIN detected different types of cells based on molecule clustering and showed that transcription factors formed a complex hierarchical regulatory code, revealing the diversity of brain cell types and their transcriptomes.
BiSNN-Walk
Shi and Huang [76] proposed BiSNN-Walk, an iterative biclustering method based on SNN-Cliq [72]. BiSNN-Walk differs from SNN-Cliq in that it returns a sorted list as a reliable indicator of a cluster. In addition, BiSNN-Walk uses a metric method based on entropy to select the starting point of clustering, and its clustering ability was tested on three scRNA-seq datasets.
Single-cell representation learning
Single-cell representation learning (SCRL) [47] is a nonlinear dimensionality reduction method based on machine learning and clustering that was developed by Li et al. in 2017. To process drop-out events of single-cell RNA data, SCRL uses biological knowledge such as high-throughput RNA sequencing and adopts a network-embedding method to learn a richer, low-dimensional representation of scRNA-seq data.
We have provided the download URLs and references for 14 other popular scRNA-seq analysis tools in Table 3, so interested readers can find them easily.
Performances of the methods on scRNA-seq datasets
Transcriptome datasets
The 12 datasets summarized in Table 4 are named after the primary provider of the published dataset. The first six datasets are benchwork labeled and are considered as the gold standard; the other six datasets are computationally labeled and are considered as silver standard. Yan's dataset consists of the transcriptomes of human oocytes and early embryonic cells at seven key stages of development, using two to three embryos per stage. Deng's dataset includes the transcriptomes of single cells isolated from mouse embryos at different pre-implantation stages. Treutlein's dataset contains transcriptome data from single distal lung epithelium cells. Zeisel's dataset contains 19 972 genes from 3005 cells and was used to study specialized cell types in mouse cortex and hippocampus.
Dataset size
We considered a small dataset as one with RNA-seq data of fewer than 100 cells, a large dataset as one with more than 1000 cells and a medium-size dataset as one in between. As shown in Table 4, Biase's and Treutlein's datasets were classified as small, and Klein's and Zeisel's datasets were large with 2717 and 3005 cells, respectively. The numbers of genes in these 12 datasets were extremely large. Except for Patel's dataset, which contained 5948 genes, the other 11 datasets contained from 19 972 to 41 480 genes. These scRNA-seq datasets are large and contain a lot of expression data.
Performances on raw scRNA-seq datasets
To better understand the performance of each method on scRNA-seq data, we conducted classification and clustering experiments on the raw datasets. The experimental results are particularly important because they can be used to analyze and judge whether the data preprocessing steps and algorithm improvements are effective.
We used the raw scRNA-seq data of the 12 datasets without any preprocessing with four widely used machine-learning classification methods, including KNN, RF, J48 and bagging. KNN is one of the most commonly used classification methods; it determines the class of a sample to be classified from the classes of its k adjacent samples. J48 is a decision tree-based algorithm. RF and bagging are ensemble machine-learning algorithms. These methods are freely available and efficient in the Weka software. The results are shown in Figure 2. The methods' classification performance was measured with accuracy.
As shown in Figure 2, although the four methods showed differences in the results for the 12 datasets, the classification of expression data showed accuracies that could reach over 80%. Overall, bagging was the most stable, achieving good classification accuracy on all 12 datasets, which may be explained by its ensemble classification mechanism. For the six gold datasets, RF was better than the other three methods. Ting's dataset showed the worst results among all the datasets and methods, possibly because the dataset contains too much noise, which affected the ability of the algorithms to accurately classify the expression data. Thus, for complex datasets, machine learning still has room for improvement.
Figure 2. True Positive (TP) rate of four classification methods based on expression data. The blue, orange, gray and yellow bars represent the classification performance of KNN, RF, J48 and bagging, respectively.
Unsupervised clustering is currently the core part of the scRNA-seq analysis. It does not require researchers to make any input to the known expressed genes and can directly cluster similar cells by an algorithm. Because many subsequent analyses of single cells are based on clustering, the results of clustering have a great impact on the final conclusion. We performed experiments on the 12 datasets with three clustering methods. The results are shown in Figure 3. Greedy [87] is a hierarchical clustering algorithm adapted to large networks, which can detect high modularity partitions quickly and without limit to the number of nodes. Overall, the spectral method was more stable on all 12 datasets, which may be because of its integrated classification mechanism. The number of clusters must be supplied as input for all the clustering methods used except X-means and greedy. The default parameters of the algorithm were used in all experiments. In the second panel of Figure 3, it can be easily seen that in the overall experiment results, spectral and greedy were significantly better than EM. EM showed the worst result because EM is unable to recognize expression data when there is a lot of noise.
To some extent, these diversities are also reflected in the effectiveness of the algorithm, that is, some methods are better for certain types of data. Because of the complexity of clustering problems, it is unlikely that one method is superior to all other methods.
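Most of the clustering methods discussed here need the number of clusters as an input. A minimal k-means sketch (Lloyd's algorithm) makes that dependence concrete: `k` is a required argument and the result changes with it. The function name, the seeded random initialization and the toy points are assumptions for illustration only; this is not the Weka implementation used in the experiments. It requires Python 3.8 or later for `math.dist`.

```python
import math
import random


def kmeans(points, k, iterations=100, seed=0):
    """Lloyd's algorithm; the number of clusters `k` is chosen by the caller."""
    rng = random.Random(seed)
    # Initialize centroids from k distinct data points (seeded for repeatability).
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        new_labels = [min(range(k), key=lambda c: math.dist(p, centroids[c]))
                      for p in points]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, new_labels) if lab == c]
            if members:  # keep the old centroid if a cluster becomes empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
        if new_labels == labels:  # assignments stopped changing
            break
        labels = new_labels
    return labels, centroids


# Hypothetical toy data: two well-separated groups of three points each.
points = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
          [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]]
labels, centroids = kmeans(points, k=2)
```

On noisy, high-dimensional expression data the result also depends on the initialization, which is one reason repeated runs are used when assessing stability.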
Clustering and classification methods 9
Performances of seven clustering methods and toolboxes on the 12 datasets
Cell classification results vary because of different parameter combinations and different dataset sizes, making it difficult and time-consuming to find the optimized parameters for the best result. Hence, methods that integrate machine learning to provide a profound and easy-to-operate pipeline have been developed. At present, many clustering methods have been developed for single-cell data, but these methods show very different characteristics in model assumptions and clustering. Therefore, there is an urgent need for large-scale comparison and evaluation of these methods. To this end, we compared several clustering methods and tools, evaluated their clustering capabilities and conducted analysis to explore their effects on clustering cell types and detecting differentially expressed genes in the context of real data.
To evaluate the performance of clustering methods, we tested the clustering methods SC3, pcaReduce and tSNE+kmeans on 12 published datasets. The analysis uses the adjusted Rand index (ARI) because of its wide adoption in the field. In order to test the stability of each method, we repeated the experiment 100 times with fixed parameters. The clustering performance shows that SC3 is the most robust, while pcaReduce and tSNE+kmeans perform worse. Figure 4 shows that most of the ARI values of SC3 are concentrated and located in the upper half of the graph, indicating that SC3 is stable and performs best. The ARI values of tSNE+kmeans and pcaReduce are widely distributed.
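For readers unfamiliar with the ARI, the following sketch computes it from two label vectors using the standard pair-counting formula. It is an illustration using only the Python standard library (3.8 or later for `math.comb`), not the code used for the experiments reported here; the function name and the toy labels are assumptions, and at least two samples are assumed.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand index between two labelings of the same samples."""
    n = len(labels_true)
    # Contingency counts: how many samples share each (true, predicted) pair.
    contingency = Counter(zip(labels_true, labels_pred))
    row_sums = Counter(labels_true)
    col_sums = Counter(labels_pred)
    # Numbers of sample pairs that fall together in each cell, row and column.
    sum_cells = sum(comb(c, 2) for c in contingency.values())
    sum_rows = sum(comb(c, 2) for c in row_sums.values())
    sum_cols = sum(comb(c, 2) for c in col_sums.values())
    expected = sum_rows * sum_cols / comb(n, 2)  # agreement expected by chance
    max_index = (sum_rows + sum_cols) / 2
    if max_index == expected:  # degenerate case, e.g. one cluster in both
        return 1.0
    return (sum_cells - expected) / (max_index - expected)


# Hypothetical labels for four cells.
reference = [0, 0, 1, 1]
same_partition = [1, 1, 0, 0]   # identical grouping, different label names
crossed = [0, 1, 0, 1]          # every reference cluster is split

ari_same = adjusted_rand_index(reference, same_partition)
ari_crossed = adjusted_rand_index(reference, crossed)
```

An ARI of 1 means the two partitions agree exactly regardless of label names, values near 0 correspond to chance-level agreement, and negative values indicate agreement worse than chance.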
The ARI values of the three methods on the Pollen dataset are almost all between 0.25 and 0.6, where SC3 results are more concentrated at the endpoints of the interval, and pcaReduce values are more concentrated between 0.25 and 0.4. In comparison, tSNE+kmeans results are better, most of which are around 0.6. This shows that tSNE+kmeans is effective for dimensionality reduction of the Pollen dataset. On the Biase dataset, SC3 results were stable above 0.9, and most ARI values of pcaReduce were distributed between 0.8 and 0.9. On the contrary, more than half of the tSNE+kmeans results were below 0.1, which indicated that tSNE+kmeans was completely invalid for dimensionality reduction of the Biase dataset. Usually, no dimensionality reduction method or clustering is suitable for all datasets. We can observe the characteristics of statistical datasets and find the effective dimensionality reduction method corresponding to the dataset.
Then we conducted an experiment on all datasets with the default parameters of each toolbox and recorded the experimental results, as shown in Figure 5, which shows the clustering performances of five toolboxes: SC3, SNN-Cliq, SINCERA, SEURAT and pcaReduce. First, we analyzed the experimental results on the Pollen dataset and found that the results of the five clustering algorithms are very close, from which we can conclude that Pollen is universal to the algorithms. In terms of algorithmic comparison, SC3 produced the best performances, beating the other four clustering methods tested. SC3 worked best on most datasets, and its powerful integration features made it a stable toolbox. However, the SEURAT algorithm performed poorly with most datasets, and it is the most unstable method. SEURAT is a toolbox for spatial cell localization, and it may be that the noise in the single-cell RNA dataset affects the performance of the algorithm.
Discussion and conclusions
scRNA-seq provides deep scrutiny into the gene expression character of diverse cell types. The main challenge now is the noisy nature of the single-cell RNA data. This noise makes it difficult to distinguish very similar cell types, and this is where the technology needs to be improved. In this review, we covered several aspects of scRNA-seq applications, from quantitative analysis and characterization of cell types through clustering and classification to gene regulatory network reconstruction and cell-state identification. Using 12 published sets of single-cell RNA data, we showed the classification and clustering performance of each algorithm. We have listed the classical machine-learning methods, summarized the currently generally accepted toolkits and conducted experiments on the single-cell RNA dataset. At present, single-cell RNA classification is relatively mature, but there is still a lot of room for improvement in clustering. scRNA-seq technologies allow researchers to uncover new and potentially unexpected biological discoveries compared with traditional profiling methods. scRNA-seq has also been applied to identify subclones from the transcriptomes of neoplastic cells, and the technique holds enormous potential for both basic biology and clinical applications.
Experimental studies emphasize that no single method performs best in all situations. Some of the shortcomings of these approaches, such as scalability, robustness and in some cases unavailability, need to be addressed in studies. Several recommendations have been made based on the above experiments and analysis, with details showcased in Table 4. Specifically, for Pollen's dataset, we recommend the algorithm with the fastest convergence (SIMLR) as it showed little difference in clustering performance among various algorithms. SIMLR is a kernel-based similarity learning method, based on dimensionality reduction of scRNA-seq data, and can typically be applied to relatively large-scale datasets. For Biase's dataset, SC3, pcaReduce and SINCERA all achieved a better effect than the other algorithms. For Goolam's dataset, we recommend using pcaReduce for dimensionality reduction and clustering due to its superior prediction performance, while for the Klein's
4. Kanter I, Dalerba P, Kalisky T, et al. A cluster robustness score for identifying cell subpopulations in single cell gene expression datasets from heterogeneous tissues and tumors. Bioinformatics 2019;35(6):962–71.
5. Xie J, Ma A, Zhang Y, et al. QUBIC2: a novel biclustering algorithm for large-scale bulk RNA-sequencing and single-
24. Grun D, Lyubimova A, Kester L, et al. Single-cell messenger RNA sequencing reveals rare intestinal cell types. Nature 2015;525(7568):251.
25. Žurauskienė J, Yau C. pcaReduce: hierarchical clustering of single cell transcriptional profiles. BMC Bioinformatics 2016;17(1):140.
46. Lin C, Jain S, Kim H, et al. Using neural networks for reducing the dimensions of single-cell RNA-Seq data. Nucleic Acids Res 2017;45(17):e156.
47. Li X, Chen W, Chen Y, et al. Network embedding-based representation learning for single cell RNA-seq data. Nucleic Acids Res 2017;45(19):e166.
69. Lihua Z, Shihua CB. Comparison of computational methods for imputing single-cell RNA-sequencing data. IEEE/ACM Trans Comput Biol Bioinform 2018;1. doi: 10.1109/TCBB.2018.2848633.
70. Guo M, Xu Y. Single-cell Transcriptome analysis using SIN-