Clustering and Classification Methods for Single-cell RNA-Sequencing Data

Article in Briefings in Bioinformatics · May 2019


DOI: 10.1093/bib/bbz062


All content following this page was uploaded by Anjun Ma on 08 December 2019.



Briefings in Bioinformatics, 00(00), 2019, 1–13

doi: 10.1093/bib/bbz062
Advance Access Publication Date:
Review article

Downloaded from https://academic.oup.com/bib/advance-article-abstract/doi/10.1093/bib/bbz062/5528236 by Tianjin University user on 10 September 2019


Clustering and classification methods for single-cell
RNA-sequencing data
Ren Qi, Anjun Ma, Qin Ma and Quan Zou
Corresponding authors: Quan Zou, Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu,
China, and Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu, China. Tel: 170-9226-1008;
E-mail: [email protected]; Qin Ma, Department of Biomedical Informatics, College of Medicine, The Ohio State University, Columbus, OH, USA.
Tel: 614-688-9857(O); E-mail: [email protected]

Abstract
Appropriate ways to measure the similarity between single-cell RNA-sequencing (scRNA-seq) data are ubiquitous in
bioinformatics, but using single clustering or classification methods to process scRNA-seq data is generally difficult. This
has led to the emergence of integrated methods and tools that aim to automatically process specific problems associated
with scRNA-seq data. These approaches have attracted a lot of interest in bioinformatics and related fields. In this paper, we
systematically review the integrated methods and tools, highlighting the pros and cons of each approach. We not only pay
particular attention to clustering and classification methods but also discuss methods that have emerged recently as
powerful alternatives, including nonlinear and linear methods and dimensionality reduction methods. Finally, we focus on
clustering and classification methods for scRNA-seq data, in particular, integrated methods, and provide a comprehensive
description of scRNA-seq data and download URLs.

Key words: single-cell RNA-seq; clustering; classification; similarity metric; sequences analysis; machine learning

Introduction

Many bioinformatics problems involve DNA and protein sequence analyses or temporal series analyses. Each cell
has its unique phenotype and biological function, which is reflected in the differences between histologies. Bulk
RNA sequencing (RNA-seq) is based on a large number of cells, and the expression level it measures is the relative
average level of a group of cells. Therefore, in a mixed cell population, traditional bulk RNA-seq cannot analyze
the critical differences between individual cells. In particular, it cannot study a complex system with
ever-changing expression and the expression characteristics of genes in that system. The emergence of single-cell
RNA-seq solves this problem by providing the expression profile information of single cells. Although it is
impossible to obtain the complete information of each RNA expressed by each cell with limited raw materials, gene
clustering analysis and identification of gene expression patterns can reveal the existence of rare cell types
in the cell population.

Ren Qi is a doctoral student at the School of Computer Science and Technology, College of Intelligence and Computing, Tianjin University, Tianjin, China.
Her research interests include machine learning, metric learning and bioinformatics.
Anjun Ma is a doctoral student at the Ohio State University, USA. His research topic mainly focuses on the single-cell gene transcription regulation analysis,
including biclustering algorithm development, regulon identification and functional pathway elucidation.
Qin Ma is the director of the Bioinformatics and Mathematical Biosciences Laboratory and an associate professor at the Department of Biomedical
Informatics, College of Medicine, The Ohio State University. His email is [email protected].
Quan Zou is a professor at the University of Electronic Science and Technology of China. He is a senior member of Institute of Electrical and Electronics
Engineers (IEEE) and Association for Computing Machinery (ACM). He won the Clarivate Analytics Highly Cited Researchers in 2018. He majors in
bioinformatics, machine learning and algorithms. His email is [email protected].
Submitted: 15 February 2019; Received (in revised form): 24 April 2019
© The Author(s) 2019. Published by Oxford University Press. All rights reserved. For Permissions, please email: [email protected]
This article is published and distributed under the terms of the Oxford University Press, Standard Journals Publication Model
(https://academic.oup.com/journals/pages/open_access/funder_policies/chorus/standard_publication_model)


Single-cell RNA-seq (scRNA-seq) provides precision and detail. It uses optimized next-generation sequencing
technologies and acquires transcriptomic information from individual cells to provide a better understanding of
cell functions at genetic and cellular levels. scRNA-seq has been used to study cancer, metagenomics and regulatory
and evolutionary networks [1–3]. Identification of genes that are essential for a given cell type is critical for
understanding the biological characteristics of cells. Kater et al. [4] introduced a clustering robustness score to
address the problem that most clustering methods are not robust to noise. By artificially adding noise, they
obtained biologically meaningful clusters of cells in single-cell expression datasets. Studies have shown that
there is significant heterogeneity in gene expression between individual cells of the same cell type. The rapid
development of molecular biology technologies has dramatically improved the ability to analyze transcriptomes; in
particular, high-throughput sequencing technology and transcriptome sequencing (RNA-seq) analysis are now common
experimental methods. Xie et al. [5] presented a novel biclustering algorithm for the analysis of large-scale bulk
RNA-seq and scRNA-seq data.

In recent years, scRNA-seq technology has become an essential tool for molecular biology research. Compared with
traditional bulk RNA-seq, scRNA-seq can better reflect the molecular biological processes within a particular cell
population. In addition, scRNA-seq enables more precise subpopulation analysis of specific cell types and allows
the detection of different responses of individual cells to the same stimulus in the same cell type [6, 7]. In many
studies on gene transcription [8, 9], what is detected is the average gene expression of a population of somatic
cells, tissue cells or organism cells. Although these studies have helped the progress of gene transcriptome
research, traditional bulk RNA-seq cannot clearly show the heterogeneity between the cells in an organism.
scRNA-seq can provide information on individual cell transcriptomes and can be used to define cell subsets and to
determine the time stage of cell differentiation and the progression of single cells. In addition to cell
heterogeneity, the analysis of cell differentiation requires the clustering of scRNA-seq data. Such clustering
helps in understanding potential cellular mechanisms, which can promote the discovery of new markers on specific
types of cells and the recognition of tumor subtypes.

Köster et al. [10] proposed a Bayesian model for analyzing the transcript expression of single cells.
Transcriptome analysis helps to predict gene expression from genotype data. However, although the cells in our
bodies have almost the same genotype, the transcriptome information reflects the activity of only some genes. In
addition, gene expression is heterogeneous even within similar cell types, and the transcriptome of individual
cells is critical to elucidating stochastic biological processes. The analysis of scRNA-seq data presents some
challenges. scRNA-seq provides deep scrutiny into the gene expression character of diverse cell types. The first
and currently main challenge is the noisy nature of scRNA-seq data. Many of the features of scRNA-seq data are zero
or nearly zero, so handling noise is of great significance. This noise makes it difficult to distinguish very
similar cell types, and this is where the technology needs to be improved. Second, scRNA-seq data are high
dimensional, which increases the difficulty of the analysis to some extent.

Therefore, dimensionality reduction methods have been used to process the original data. The most common
dimensionality reduction method is principal components analysis (PCA) [11], which is unsupervised and aims to find
a lower dimensional representation of the data. Peng et al. [12] proposed two models, the unsupervised Gene
Ontology AutoEncoder (GOAE) and the supervised Gene Ontology Neural Network (GONN), to reduce dimensionality. They
combined scRNA-seq data and gene ontology information to extract the hidden layer information and obtain lower
dimensional representations of the data. Because of the large number of zeros in scRNA-seq data, classical
dimensionality reduction methods often fail. Pierson and Yau [13] proposed zero-inflated factor analysis (ZIFA),
which treats dropout events as exact zeros when modeling scRNA-seq data, thereby improving the modeling precision.
Although ZIFA is easy to use, it is likely to lose information; however, dropout events can be recovered to reduce
the loss.

Sequence analysis is essential in bioinformatics [14–17], and metric learning often plays an important role in
this. Common metric learning algorithms for classification include information-theoretic metric learning (ITML)
[18], large margin nearest neighbor (LMNN) [19] and geometric mean metric learning (GMML) [20]. ITML, which was
proposed by Davis et al. in 2007 [18], uses a Bregman divergence to measure similarity. LMNN [19] was proposed by
Weinberger et al. in 2005 and is the most commonly used Mahalanobis distance metric learning method. GMML [20, 21]
was proposed by Zadeh and Hosseini in 2016 [19], and the main innovation of this method is that it makes the
distance between similar points small and the distance between dissimilar points large using only one objective
function. On several scRNA-seq datasets, the classification accuracy can be more than 90% and sometimes close to
100%, whereas the clustering task is more difficult. Common clustering machine-learning methods include K-means
[21], expectation maximization (EM) [22] and spectral clustering [23]. K-means [21] is a classical clustering
method that is easy to operate, and the required number of clusters is set before the experiment. The K-means
method is robust on a variety of datasets. Many of the newly proposed clustering methods are extensions of K-means.
EM [22] is an integrated clustering method that also has been shown to be stable and reliable. Spectral clustering
[23] is based on graph theory; it introduces the concept of degree and then uses K-means to cluster after steps
such as eigenvalue decomposition. Common combinations of methods include t-distributed stochastic neighbor
embedding (t-SNE) with K-means [24] and PCA with hierarchical clustering [25]. Jiang et al. embedded the cell
similarity measurement method based on variance analysis into the hierarchical clustering framework and developed
the clustering algorithm ‘Corr’ [26].

scRNA-seq technology can be used to perform gene expression studies on a variety of cells simultaneously, avoiding
the need to label each cell. The results of existing scRNA-seq studies have shown that gene expression profiles
differ significantly between individual cells of the same cell type, indicating that only a few of the expressed
genes detected in a cell group are shared by the individual cells and that most of the genes are expressed randomly
in each cell. Monier et al. [27] present a tool called IRIS-EDA, which is a Shiny web server for expression data
analysis. scRNA-seq analysis has incomparable advantages over traditional sequencing analysis in the fields of
cancer evolution and drug resistance analysis [28, 29] in the treatment process, as well as epithelial-mesenchymal
transition [30], which is extremely important in the cancer transformation process, and the measurement of
mutation rate in cancer cells [31–33]. Single-cell level analysis can help in understanding the complex processes
of cancer occurrence, development, metastasis and
recurrence. Individualized clinical treatment will also be a target and direction of scRNA-seq technology.
Generally, scRNA-seq has three advantages (applications) over traditional bulk RNA-seq: (i) it can reveal complex
and rare cell populations, (ii) uncover heterogeneous gene regulatory relationships among cells, and (iii) track
the trajectories of distinct cell lineages in development in terms of both temporal and spatial information.

This review is organized as follows. In ‘The challenges of scRNA-seq’, the challenges of scRNA-seq are discussed.
In ‘Methods for dimensionality reduction and clustering analysis tools’, we review dimensionality reduction
methods and describe some of the advanced clustering tools and trends, including SC3, an R package for clustering,
and single-cell regulatory network inference and clustering (SCENIC). In ‘Performances of the methods on scRNA-seq
datasets’, we describe scRNA-seq datasets and provide the download URLs. In ‘Discussion and conclusions’, we
discuss the current limitations of some of the methods based on the existing literature.

The challenges of scRNA-seq

Single-cell studies are irreplaceable for exploring the mysteries of biology, but they also face significant
challenges. Gene transcription is not stable and continuous but occurs in sporadic bursts of activity. For a given
cell, transcription is in a constantly changing state [34–36]. Due to the technical limitations of single-cell
transcription level measurement, it is difficult to detect low-level gene expression, and most detection methods
can detect only about 10–20% of the actual mRNA molecules [37, 38]. Since very little material is available from a
single cell, an amplification step is generally required to generate a larger amount. However, because
amplification is nonlinear, the proportion of cDNA in cells is not balanced, and the amplification is biased, so
some markers cannot be amplified.

scRNA-seq has great limitations in obtaining information. The main limitation is caused by biological noise during
gene expression [34, 36]. A significant feature of scRNA-seq data is a large number of zero-inflated counts due to
dropout or transient gene expression, which may mislead downstream analyses. The read count is connected with
gene-specific expression level, while the nuisance variables are difficult to estimate. The commonly used methods
of inter-sample normalization are trimmed mean of M values (TMM) and differential expression analysis for sequence
count data (DESeq) [39–41]. Both methods eliminate some genes based on a weighted average or median of samples, but
both perform poorly when a large number of zero counts are present. In addition to normalization, confounding
factors such as biological variables and technical noise also influence the observed read counts.

In 1992, Eberwine et al. [42] used in vitro transcriptional amplification to study acutely dissociated cells in
restricted areas of rat brain and analyzed gene expression characteristics in single living neurons. However, the
number of genes detected and the detection throughput were low. With the continuous development of detection
technology, Tang et al. combined single-cell RNA with high-throughput sequencing for the first time [43],
significantly increasing the detection throughput. However, due to the limitation of single-cell isolation
technology, not all laboratories can successfully complete single-cell sequencing experiments. Although many
researchers began to study scRNA-seq and there were some experimental methods that were easy to start using, only a
very small number of single cells were labeled, individual cells could not be enriched, and the computational
pipelines for processing original data were limited, which caused difficulties in sequencing single cells.
Sequencing has the risks of low coverage, low mappability, high duplicate rate and high error rate. Some companies
have developed tools to process scRNA-seq raw data, but these are still at an early stage.

The workflow for scRNA-seq is summarized in Figure 1. Single-cell data are high-dimensional and contain a lot of
noise, so the raw single-cell data generally are processed for dimensionality reduction and denoising, before being
classified for analysis, including clustering analysis, cell type identification, sorting and other operations.

Methods for dimensionality reduction and clustering analysis tools

Clustering plays an essential role in single-cell analysis. Given the high dimensionality of single-cell data, many
approaches combine classic clustering and dimension reduction. Effective dimensionality reduction methods are
critical because most scRNA-seq data are large and noisy, with the characteristics of a small number of samples but
a large number of dimensions. Dimensionality reduction is usually carried out after count normalization to avoid
the curse of dimensionality.

Dimensionality reduction methods

Principal components analysis

PCA [11] is a commonly used unsupervised dimensionality reduction method [44]. PCA assumes that the data are
normally distributed and diagonalizes the covariance matrix of the original data, so that the new variables (the
principal components) have a diagonal covariance matrix. An orthogonal transformation is used to transform a set of
potentially linearly correlated variables into linearly uncorrelated variables, which realizes linear
dimensionality reduction. One of the main problems with linear dimensionality reduction algorithms is that they
concentrate on placing dissimilar data points far apart in the lower dimensional representation. By projecting
cells into two-dimensional space, PCA makes samples easy to visualize and improves interpretability. An extended
version of PCA, pcaReduce [25], relates the data patterns to cell types. pcaReduce is a hierarchical clustering
method that combines PCA and k-means iteratively. It starts with a large number of clusters and iteratively merges
similar clusters. After each merge, the principal component with the smallest variance is removed.

T-distributed stochastic neighbor embedding

To represent high-dimensional data on a low-dimensional, nonlinear manifold, similar data points must also be kept
close together, which linear dimensionality reduction algorithms cannot do. t-SNE [45] is a nonlinear
dimensionality reduction method. It converts the distances between points in high-dimensional space into
probabilities of similarity and minimizes the sum of the differences between the conditional probabilities of each
pair of points in the high-dimensional and low-dimensional spaces. At the same time, the long tail of the
t-distribution is used to solve the overlapping problem that arises when high-dimensional data are mapped to low
dimensions. The t-SNE algorithm defines a soft boundary between the local and global structure of the data, so
that points are scattered locally and aggregated globally, and it takes care of points at close range and far range
at the same time.
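As a minimal sketch of the two kinds of similarity that t-SNE compares, the pure-Python code below computes
Gaussian similarities between points in the high-dimensional space, Student-t similarities between points in a
candidate low-dimensional embedding, and the Kullback-Leibler divergence between the two, which is the quantity
t-SNE minimizes. A single fixed bandwidth `sigma` is an assumption made for brevity (t-SNE normally tunes one
bandwidth per point from a perplexity target), the toy data are invented, and all function names are hypothetical.

```python
import math


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def high_dim_similarities(points, sigma=1.0):
    """Symmetrized Gaussian similarities p_ij (assumption: one shared sigma)."""
    n = len(points)
    conditional = []
    for i in range(n):
        weights = [
            0.0 if j == i
            else math.exp(-squared_distance(points[i], points[j]) / (2.0 * sigma ** 2))
            for j in range(n)
        ]
        total = sum(weights)
        conditional.append([w / total for w in weights])  # p_{j|i}, each row sums to 1
    # p_ij = (p_{j|i} + p_{i|j}) / (2n), so the whole matrix sums to 1
    return [
        [(conditional[i][j] + conditional[j][i]) / (2.0 * n) for j in range(n)]
        for i in range(n)
    ]


def low_dim_similarities(embedding):
    """Student-t (one degree of freedom) similarities q_ij in the embedding."""
    n = len(embedding)
    kernel = [
        [0.0 if i == j else 1.0 / (1.0 + squared_distance(embedding[i], embedding[j]))
         for j in range(n)]
        for i in range(n)
    ]
    total = sum(map(sum, kernel))
    return [[k / total for k in row] for row in kernel]


def kl_divergence(p, q):
    """KL(P || Q) over all pairs with p_ij > 0."""
    return sum(
        pij * math.log(pij / qij)
        for p_row, q_row in zip(p, q)
        for pij, qij in zip(p_row, q_row)
        if pij > 0.0
    )


# Toy data: two pairs of similar "cells" in a three-gene space.
cells = [[0, 0, 1], [0, 1, 0], [5, 5, 5], [5, 6, 5]]
good_embedding = [[0, 0], [0, 1], [5, 5], [5, 6]]   # keeps each pair together
bad_embedding = [[0, 0], [5, 5], [0, 1], [5, 6]]    # splits each pair

P = high_dim_similarities(cells)
kl_good = kl_divergence(P, low_dim_similarities(good_embedding))
kl_bad = kl_divergence(P, low_dim_similarities(bad_embedding))
print(kl_good < kl_bad)
```

An embedding that keeps similar cells together gives a lower divergence than one that separates them, which is the
behavior the optimization exploits. A full implementation would also search for per-point bandwidths and update the
embedding by gradient descent; neither step is shown here.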
Figure 1. The workflow of scRNA-seq data and the pipeline of scRNA-seq data application. The 1st stage is dataset processing. Cells–genes matrix was obtained by
processing the original data with the effective dimensionality reduction method. The 2nd stage is to calculate the similarity matrix. The similarity matrix can be used
for cell sequencing, cell type recognition and other applications. In addition, machine learning can be used to cluster or classify single-cell data.
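The second stage in Figure 1 can be illustrated with a small sketch. The code below builds a cell-by-cell
similarity matrix from a toy cells-genes count matrix using log-transformed counts and Pearson correlation. The
choice of Pearson correlation, the log1p transform, the toy counts and all names are assumptions made for
illustration; the caption does not prescribe a particular similarity measure.

```python
import math


def log_transform(counts):
    """Apply log(1 + x) to every count to dampen the effect of large values."""
    return [[math.log1p(c) for c in row] for row in counts]


def pearson(x, y):
    """Pearson correlation of two equal-length vectors (0.0 if either is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if norm_x == 0.0 or norm_y == 0.0:
        return 0.0
    return cov / (norm_x * norm_y)


def similarity_matrix(counts):
    """Cell-by-cell similarity from a cells x genes count matrix."""
    expr = log_transform(counts)
    return [[pearson(a, b) for b in expr] for a in expr]


# Toy cells x genes matrix: rows are cells, columns are genes (invented counts).
counts = [
    [10, 0, 3, 0],
    [8, 1, 4, 0],
    [0, 7, 0, 9],
    [1, 9, 0, 6],
]
S = similarity_matrix(counts)
for row in S:
    print([round(value, 2) for value in row])
```

In this toy matrix the first two cells and the last two cells form two groups, so within-group entries of `S` are
larger than between-group entries. Such a matrix could then be passed to a clustering or classification step.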

t-SNE has been used recently to reduce the dimensionality of scRNA-seq data [46, 47].

Zero-inflated factor analysis

A large number of dropout events in scRNA-seq data cause most dimensionality reduction algorithms to fail. ZIFA [13]
is a linear dimensionality reduction method for scRNA-seq data that is built by modifying probabilistic PCA/factor
analysis (PPCA/FA). In the absence of dropout events, it is equivalent to PPCA/FA. ZIFA treats zeros in the data as
regular observations and models them explicitly: dropout events are assumed to result in zero counts and are set to
exactly 0 rather than approximately 0. However, these dropout events could be caused by the detection technology or
environmental effects, so this assumption could lead to loss of information and decreased accuracy.

Neural networks

Neural networks can progressively extract the main features through hidden layers to achieve dimensionality
reduction. For example, denoising autoencoders have been used widely to reconstruct data from higher to lower
dimensions [48, 49]. Lin et al. [46] proposed a supervised method for scRNA-seq dimensionality reduction based on a
neural network. This method uses protein–protein and protein–DNA interaction data to learn the structure and
parameters of the neural network. Lin et al.’s method promotes the development

Table 1. Summary of dimensionality reduction methods

No.  Dimensionality reduction method               Reference  Year  Usage or download URL
1    PCA                                           [11]       1987  MATLAB, Python, R, etc. all have a free package
3    FA                                            [51]       1980  http://personality-project.org/r
4    Classical multidimensional scaling            [52]       1978  https://www.statmethods.net/advstats/mds.html
5    Sammon mapping                                [53]       1969  MATLAB, Python, R, etc. all have a free package
6    Linear discriminant analysis                  /          1936  MATLAB, Python, R, etc. all have a free package
9    Local linear embedding (LLE)                  [55]       2000  https://cs.nyu.edu/~roweis/lle/code.html
10   Laplacian eigenmaps                           [56]       2003  MATLAB, Python, R, etc. all have a free package
11   Hessian LLE                                   [57]       2003  https://cs.nyu.edu/~roweis/code.html
12   Local tangent space alignment                 [58]       2004  https://manifoldlearningjl.readthedocs.io/en/latest/ltsa.html
18   Generalized discriminant analysis             [59]       2000  https://github.com/mhaghighat/gda
20   Neighborhood preserving embedding             [60]       2005  http://www.cad.zju.edu.cn/home/dengcai/Data/code/NPE.m
21   Locality preserving projection                [61]       2005  http://www.cad.zju.edu.cn/home/dengcai/Data/code/LPP.m
22   t-distributed stochastic neighbor embedding   [45]       2008  https://lvdmaaten.github.io/tsne/
23   LMNN                                          [19]       2005  http://kilian.cs.cornell.edu/code/lmnn/lmnn.html

of supervised models and has been shown to outperform unsupervised models.

Some of the open dimensionality reduction methods are listed in Table 1. Each method has its own characteristics,
so the most appropriate dimensionality reduction method should be chosen according to the characteristics of the
target dataset. In 2007, van der Maaten published a MATLAB Toolbox for Dimensionality Reduction that contains
implementations of many methods for dimensionality reduction [50].

Classic clustering methods

Clustering methods aim to group data objects into multiple classes or clusters so that objects in the same cluster
are similar and different from objects in different clusters [62]. Clustering can be based on partitioning or
hierarchy, where partitioning divides objects into different clusters and hierarchical methods organize objects
into levels. Distance-based clustering groups objects that are close to each other. Clustering based on a
probability distribution model finds, within a group of objects, the set of objects that conforms to a specific
distribution model. These objects are not necessarily the closest or most similar, but they fit the model described
by the probability distribution. Most clustering methods need prior knowledge of the number of clusters, and the
quality of clustering needs to be improved. Classic clustering methods such as K-means [21], X-means [63], spectral
clustering [23] and EM [22] can be used in single-cell clustering directly. K-means [21] is an unsupervised
machine-learning technique that operates on a complete dataset without the need for a special training dataset.
X-means [63] extends K-means with an improve-structure step that decides whether clusters should be split, so the
number of clusters does not have to be fixed in advance; Euclidean distance is typically used to calculate the
distance between instances, although other distance functions can be used. Spectral clustering [23] is derived from
graph theory. The main idea is to consider all the data as points in space that can be connected by edges. EM [22]
was proposed by Dempster et al. in 1977. The EM algorithm assigns a probability distribution to each instance,
which indicates the probability of it belonging to each of the clusters.

Popular clustering analysis tools

Clustering is another effective method to detect cell types. Poisson and error models can be used to model count
data and explain technical noise among the various sources of noise in single-cell data. scRNA-seq data analysis
can help in studying the heterogeneity and evolution of cancer cells. Except for a few early methods, most of the
currently available integrated methods achieve state-of-the-art performance on some problems. Although the analyses
of scRNA-seq data are complex and procedures may vary depending on the purpose, many mature tools have been
developed that integrate two or more functions and greatly simplify the independent operations.

The challenges of clustering in scRNA-seq research are mainly reflected in an unclear number of single-cell
clusters, unfixed cell types and poor scalability. In the past few years, the number of cells in scRNA-seq
experiments has grown by several orders of magnitude. Although researchers have developed a variety of tools, they
are not user-friendly because they use different programming languages and require different input data formats. In
this section, we describe 11 of the most advanced scRNA-seq tools currently available. A summary of these methods
and tools, the download URLs and other useful information are given in Table 2.

SC3, an R package for clustering

SC3 was proposed by Kiselev et al. [64] in 2017. It is an interactive R package that uses a parallelization
approach to avoid the need for user-specified parameters. SC3 was verified experimentally on 12 scRNA-seq datasets.
SC3 constrained parameter values via a pipeline and was found to be superior to five other tested methods in terms
of accuracy and stability. Because SC3 has a long run time, Kiselev et al. proposed randomly selecting subsets and
constructing clusters based on random matrix theory. They found that the estimated number of clusters was
consistent with the number suggested by the original authors of the datasets. SC3 is based on PCA and spectral
dimensionality reductions, and it uses k-means followed by consensus clustering.

Single-cell regulatory network inference and clustering

SCENIC [64, 65] was proposed by Aibar et al. in 2017 [63], who used it to identify stable cell states in tumor and
brain scRNA-seq data based on the activity of the gene regulatory networks in each cell. The authors proposed two
complementary methods to handle the large dimensions of single-cell data: (i) small sample extraction to infer the
gene regulatory network and (ii) gradient

Table 2. Summary of advanced tools

No.  Tool        Language        Method      Download                                              DOI(s)
1    SC3         R               Cluster     https://github.com/hemberg-lab/sc3                    10.1101/036558; 10.1038/nmeth.4236
2    SCENIC      R/Python        Cluster     https://github.com/aertslab/SCENIC                    10.1101/144501; 10.1038/nmeth.4463
3    BackSPIN    Python          Bicluster   https://github.com/linnarsson-lab/BackSPIN            10.1126/science.aaa1934
4    BiSNN-Walk  /               Bicluster   /                                                     10.1089/cmb.2017.0049
5    SNN-Cliq    MATLAB/Python   Cluster     http://bioinfo.uncc.edu/SNNCliq                       10.1093/bioinformatics/btv088
6    NMF         Python          Cluster     https://github.com/ccshao/nimfa                       10.1093/bioinformatics/btw607
7    SIMLR       MATLAB          ML+cluster  https://github.com/BatzoglouLabSU/SIMLR               10.1038/nmeth.4207
8    SINCERA     R               Cluster     https://github.com/xu-lab/SINCERA                     10.1371/journal.pcbi.1004575; 10.1007/978-1-4939-7710-9_15
9    SEURAT      R               Cluster     https://github.com/satijalab/seurat                   10.1038/nbt.3192; 10.1101/164889
10   Monocle     R               Cluster     https://github.com/cole-trapnell-lab/monocle-release  10.1038/nbt.2859; 10.1038/nmeth.4150; 10.1101/110668; 10.1038/nmeth.4402
11   SCRL        C++             ML+cluster  https://github.com/SuntreeLi/SCRL                     10.1093/nar/gkx750

enhancement instead of the random forest (RF) to achieve a more efficient solution. They demonstrated that single-cell data were suitable for gene regulation and that genomic regulatory codes can be used to guide the identification of transcription factors and cell states.

SEURAT

Satija et al. [66] proposed SEURAT, a toolbox for spatial cell localization. SEURAT combines scRNA-seq data with in situ RNA patterns to predict spatial cell localization. The toolkit was applied to infer the spatial location of a complete transcriptome and correctly located unusual subpopulations. The reliability of SEURAT was verified using the RNA-seq data of 851 single cells from Danio rerio embryos. SEURAT's test dataset is Pollen [67], which includes the following four cell types: 'NPC', 'GW16', 'GW21' and 'GW21+3'. The toolkit's expression matrix includes the number of genes, the number of cells and the number of genes in each cell, as well as the number of cells in which each gene is expressed. In addition, users can look for genes that fluctuate significantly and then use those genes rather than all of them for subsequent analysis to reduce the amount of computation. Users can inspect the cell populations with the toolkit and then look for markers for each subpopulation.

Single-cell interpretation via multi-kernel learning

In 2017, Wang et al. [68] proposed single-cell interpretation via multi-kernel learning (SIMLR), a kernel-based similarity learning method, for dimensionality reduction of scRNA-seq data. SIMLR can also be applied to large-scale datasets. They conducted single-kernel comparisons on four datasets without weight terms and showed that adding weight terms significantly enhanced SIMLR performance. Further experimental studies conducted by Zhang et al. [69] verified the robustness of SIMLR to drop-out events in single-cell data and indicated that applying it to data imputed by a low-rank method performs better than general clustering algorithms.

SINCERA

Guo et al. [70, 71] proposed SINCERA, a pipeline for scRNA-seq profiling analysis. SINCERA can identify cell types and gene signatures and can determine key nodes. Analysis of mouse lung cells using the SINCERA pipeline distinguished the main cell types of the fetal lung. Guo et al. subsequently introduced logistic regression models that predict gene sequences, providing a valuable tool for analyzing scRNA-seq data. SINCERA is based on hierarchical clustering, in which the data are converted to z-scores before clustering, and the number k of clusters is determined by finding the first singleton in the hierarchy.

Shared nearest neighbor (SNN-Cliq)

Xu et al. [72] developed SNN-Cliq in 2015 for grouping cells of the same type. scRNA-seq data usually have tens of thousands of dimensions, and only a few of the thousands of genes are significantly expressed in different types of cells, which makes the clustering problem difficult. SNN-Cliq, combined with an SNN similarity metric, can automatically determine the number of clusters, especially in high-dimensional single-cell data, which is a great advantage.

Nonnegative matrix factorization

In 2016, Shao et al. [73] proposed nonnegative matrix factorization (NMF) to identify subgroups in scRNA-seq datasets. Identifying cell types from single-cell data is an unsupervised problem. Although PCA is used widely, single-cell data are generally too noisy. The first few principal components extracted from PCA can explain only a small part of the differences, and cell subgroups are not easy to distinguish through the projection of the first several dimensions. The NMF approach is different from PCA because its feature superposition constraint is nonnegative. NMF was designed specifically to detect individual parts, which helps to detect the natural groupings of individual cells and functional cell subsets.

Monocle

To study cell differentiation, the expression profiles of individual cells are required. Monocle was developed by Trapnell et al. [74] as an unsupervised algorithm for analyzing single-cell gene expression data to reveal the expression sequence of key regulatory factors and the interactions associated with differentiation. The authors used the Monocle algorithm to study mouse myoblasts and found eight transcription factors that had not been considered previously. scRNA-seq data collected at different time points can help to reveal key events in differentiation. Monocle requires users to prepare the phenotype data and feature data required by Monocle objects as well as the expression matrix, and the expression matrix is a count matrix. This tool not only

Table 3. Summary of other popular analytical tools

Tools | Download
SAMtools | https://github.com/samtools/samtools
STAR | https://github.com/alexdobin/STAR
MAST | https://github.com/RGLab/MAST
Kallisto | https://github.com/pachterlab/kallisto
BPSC | https://github.com/nghiavtr/BPSC
salmon | https://github.com/COMBINE-lab/salmon
UMI-tools | https://github.com/CGATOxford/UMI-tools
SCDE | https://github.com/hms-dbmi/scde
GeneQC | http://bmbl.sdstate.edu/GeneQC/home.html
IRIS-EDA | http://bmbl.sdstate.edu/IRIS/
QUBIC2 | https://github.com/maqin2001/qubic2
CellRanger | https://github.com/10XGenomics/cellranger
Scater | https://github.com/davismcc/scater
SAVER | https://github.com/mohuangx/SAVER
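
The NMF-based grouping described for Shao et al. [73] can be illustrated with a minimal sketch. It assumes scikit-learn's `NMF` as a stand-in (the tool in Table 2 is distributed separately and may differ in detail), a synthetic nonnegative expression matrix, and a hypothetical helper `nmf_cluster` that assigns each cell to its dominant component.

```python
import numpy as np
from sklearn.decomposition import NMF


def nmf_cluster(expr, n_groups, seed=0):
    """Assign each cell (row) to the NMF component with the largest scaled weight.

    expr: nonnegative array of shape (cells, genes). Hypothetical helper,
    not the published implementation.
    """
    model = NMF(n_components=n_groups, init="nndsvda", random_state=seed, max_iter=500)
    weights = model.fit_transform(expr)  # shape: (cells, components)
    # Rescale each component so that columns are comparable before taking argmax.
    scaled = weights / np.maximum(weights.max(axis=0), 1e-12)
    return scaled.argmax(axis=1)


# Synthetic data (assumption): two groups of 20 cells, each expressing its own gene block.
rng = np.random.default_rng(0)
expr = rng.poisson(0.2, size=(40, 30)).astype(float)
expr[:20, :15] += rng.poisson(8.0, size=(20, 15))
expr[20:, 15:] += rng.poisson(8.0, size=(20, 15))
truth = np.repeat([0, 1], 20)

labels = nmf_cluster(expr, n_groups=2)
# Agreement with the planted groups, allowing for swapped cluster labels.
agreement = max((labels == truth).mean(), (labels != truth).mean())
print(f"agreement with planted groups: {agreement:.2f}")
```

Because every factor is nonnegative, each component tends to describe an additive "part" of the expression profile, which is why the dominant component can serve as a group label in this sketch.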

contains the general functions of a single-cell toolkit, such as quality control and differential analysis. In the Monocle package, it is worth noting that dimensionality reduction must be followed by clustering before the results can be visualized. In addition, Monocle provides a function to infer the developmental trajectory, which is the highlight of this tool.

BackSPIN

Zeisel et al. [75] developed BackSPIN in 2015 and tested it on the adult nervous system, which is highly complex and has many cell types that are challenging to identify. scRNA-seq data were used to classify mammalian cortical cells. BackSPIN detected different types of cells based on molecule clustering and showed that transcription factors formed a complex hierarchical regulatory code, revealing the diversity of brain cell types and their transcriptomes.

BiSNN-Walk

Shi and Huang [76] proposed BiSNN-Walk, an iterative biclustering method based on SNN-Cliq [72]. BiSNN-Walk differs from SNN-Cliq in that it returns a sorted list as a reliable indicator of a cluster. In addition, BiSNN-Walk uses an entropy-based metric to select the starting point of clustering, and its clustering ability was tested on three scRNA-seq datasets.

Single-cell representation learning

Single-cell representation learning (SCRL) [47] is a nonlinear dimensionality reduction method based on machine learning and clustering that was developed by Li et al. in 2017. To handle drop-out events in single-cell RNA data, SCRL uses biological knowledge such as high-throughput RNA sequencing and adopts a network-embedding method to obtain a richer, low-dimensional representation of scRNA-seq data.

We have provided the download URLs and references for 14 other popular scRNA-seq analysis tools in Table 3, so interested readers can find them easily.

Performances of the methods on scRNA-seq datasets

Transcriptome datasets

The 12 datasets summarized in Table 4 are named after the primary provider of the published dataset. The first six datasets are labeled by bench work and are considered the gold standard; the other six datasets are computationally labeled and are considered the silver standard. Yan's dataset consists of the transcriptomes of human oocytes and early embryonic cells at seven key stages of development, using two to three embryos per stage. Deng's dataset includes the transcriptomes of single cells isolated from mouse embryos at different pre-implantation stages. Treutlein's dataset contains transcriptome data from single distal lung epithelium cells. Zeisel's dataset contains 19 972 genes from 3005 cells and was used to study specialized cell types in mouse cortex and hippocampus.

Dataset size

We considered a dataset small if it contained RNA-seq data of fewer than 100 cells, large if it contained more than 1000 cells and medium-size otherwise. As shown in Table 4, Biase's and Treutlein's datasets were classified as small, and Klein's and Zeisel's datasets were large with 2717 and 3005 cells, respectively. The numbers of genes in these 12 datasets were extremely large. Except for Patel's dataset, which contained 5948 genes, the other 11 datasets contained from 19 972 to 41 480 genes. These scRNA-seq datasets are large and contain a lot of expression data.

Performances on raw scRNA-seq datasets

To better understand the performance of each method on scRNA-seq data, we conducted classification and clustering experiments on the raw datasets. The experimental results are particularly important because they can be used to analyze and judge whether the data preprocessing steps and algorithm improvements are effective.

We applied four widely used machine-learning classification methods, KNN, RF, J48 and bagging, to the raw scRNA-seq data of the 12 datasets without any preprocessing. KNN is a widely used classification method that assigns a sample to a class according to the classes of its k nearest samples. J48 is a decision tree-based algorithm. RF and bagging are ensemble machine-learning algorithms. All four methods are freely available and run efficiently in the Weka software. The results are shown in Figure 2. The methods' classification performance was measured with accuracy.

As shown in Figure 2, although the four methods showed differences in the results for the 12 datasets, the classification of expression data generally reached accuracies of over 80%. Overall, bagging was the most stable, achieving good classification accuracy on all 12 datasets, which may be explained by its ensemble classification mechanism. For the six gold datasets, RF was better than the other three methods. Ting's dataset showed the worst results among all the datasets and methods, possibly because the dataset contains too much noise, which affected the ability of the algorithms to accurately classify the expression data. Thus, for complex datasets, machine learning still has room for improvement.

Unsupervised clustering is currently the core part of the scRNA-seq analysis. It does not require researchers to make

Table 4. Summary of scRNA-seq datasets

No. | Dataset | # of genes | # of cells | # of clusters | Cells in each cluster | Standard | Cell resource | Recommended methods
1 | Biase's [77] | 25 737 | 49 | 3 | 9 + 20 + 20 | Bench | Two and four-cell mouse embryo | SC3, pcaReduce and SINCERA
2 | Yan's [78] | 20 214 | 124 | 7 | 3 + 3 + 6 + 12 + 20 + 16 + 30 | Bench | Human preimplantation embryos and embryonic stem cells | pcaReduce
3 | Goolam's [79] | 41 480 | 124 | 5 | 6 + 64 + 42 + 6 + 6 | Bench | Four-cell mouse embryos | pcaReduce
4 | Deng's [80] | 22 457 | 268 | 10 | 50 + 14 + 37 + 8 + 43 + 10 + 30 + 12 + 60 + 4 | Bench | Mouse preimplantation embryos | SC3 and pcaReduce
5 | Pollen's [67] | 23 730 | 301 | 11 | 22 + 17 + 11 + 37 + 31 + 54 + 24 + 40 + 24 + 15 + 25 | Bench | Human | SC3, SIMLR and pcaReduce
6 | Kolodziejczyk's [81] | 38 653 | 704 | 3 | 295 + 159 + 250 | Bench | Mouse embryonic stem cell | SC3, SINCERA and SEURAT
7 | Treutlein's [82] | 23 271 | 80 | 5 | 41 + 14 + 12 + 11 + 3 | Computational | Human lung epithelium | SC3
8 | Ting's [83] | 29 018 | 149 | 7 | 24 + 41 + 11 + 34 + 12 + 12 + 15 | Computational | Human pancreatic circulating tumor cells | SC3
9 | Patel's [84] | 5948 | 430 | 5 | 118 + 94 + 75 + 73 + 70 | Computational | Human glioblastomas | SC3, tSNE+kmeans and SINCERA
10 | Usoskin's [85] | 25 334 | 622 | 11 | 125 + 233 + 26 + 48 + 12 + 17 + 32 + 64 + 22 + 31 + 12 | Computational | Human neuron | SC3
11 | Klein's [86] | 24 175 | 2717 | 4 | 933 + 303 + 683 + 798 | Computational | Human embryonic stem cells | SC3, tSNE+kmeans and SINCERA
12 | Zeisel's [75] | 19 972 | 3005 | 9 | 290 + 390 + 948 + 820 + 98 + 175 + 198 + 26 + 60 | Computational | Mouse cortex | SC3

Figure 2. True Positive (TP) rate of four classification methods based on expression data. The blue, orange, gray and yellow bars represent the classification performance
of KNN, RF, J48 and bagging, respectively.
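
A comparison of the kind summarized in Figure 2 can be sketched as follows. This is not the authors' Weka setup: it assumes scikit-learn, a synthetic matrix standing in for a raw expression dataset, 5-fold cross-validation and a CART decision tree as a stand-in for J48 (Weka's C4.5 implementation). All names and parameter values are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in (assumption): 150 "cells", 200 "genes", 3 cell types.
X, y = make_classification(
    n_samples=150,
    n_features=200,
    n_informative=20,
    n_classes=3,
    n_clusters_per_class=1,
    random_state=0,
)

models = {
    "KNN": KNeighborsClassifier(n_neighbors=5),
    "RF": RandomForestClassifier(n_estimators=100, random_state=0),
    "Tree": DecisionTreeClassifier(random_state=0),  # CART, used here in place of J48
    "Bagging": BaggingClassifier(n_estimators=10, random_state=0),
}

# Mean cross-validated accuracy per classifier.
scores = {
    name: cross_val_score(model, X, y, cv=5, scoring="accuracy").mean()
    for name, model in models.items()
}

for name, acc in scores.items():
    print(f"{name}: mean accuracy = {acc:.3f}")
```

On real scRNA-seq data the ranking of the four classifiers would depend on the dataset, which is consistent with the per-dataset differences reported in the text.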

any input to the known expressed genes and can directly cluster similar cells by an algorithm. Because many subsequent analyses of single cells are based on clustering, the results of clustering have a great impact on the final conclusion. We performed experiments on the 12 datasets with three clustering methods. The results are shown in Figure 3. Greedy [87] is a hierarchical clustering algorithm adapted to large networks, which can detect high-modularity partitions quickly and without a limit on the number of nodes. Overall, the spectral method was more stable on all 12 datasets, which may be because of its integrated classification mechanism. The number of clusters must be supplied to all of the clustering methods used except X-means and greedy. The default parameters of the algorithms were used in all experiments. In the second panel of Figure 3, it can be easily seen that, over all experiments, spectral and greedy were significantly better than EM. EM showed the worst result because it is unable to recognize expression data when there is a lot of noise.

To some extent, this diversity is also reflected in the effectiveness of the algorithms; that is, some methods are better for certain types of data. Because of the complexity of clustering problems, it is unlikely that one method is superior to all other methods.


Figure 3. Accuracy of 3 clustering methods based on 12 expression datasets. The figure above shows the accuracy of the spectral, greedy and EM clustering on the
dataset, while the figure below intuitively shows the percentage of each of the three methods in clustering performance accuracy.
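
Clustering "accuracy" as plotted in Figure 3 requires matching arbitrary cluster labels to the reference labels first. The sketch below shows one common way to do this, assuming an optimal one-to-one matching (Hungarian algorithm) and scikit-learn's spectral clustering and Gaussian mixture (fitted by EM) on synthetic blobs. The text does not state how accuracy was computed, so the helper `clustering_accuracy` is an assumption, and the greedy method [87] is not reproduced here.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.cluster import SpectralClustering
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture


def clustering_accuracy(true_labels, pred_labels):
    """Fraction of samples correctly grouped under the best one-to-one label matching."""
    true_ids, t = np.unique(true_labels, return_inverse=True)
    pred_ids, p = np.unique(pred_labels, return_inverse=True)
    # Contingency table: rows are predicted clusters, columns are true classes.
    counts = np.zeros((len(pred_ids), len(true_ids)), dtype=int)
    np.add.at(counts, (p, t), 1)
    # Maximize matched samples by minimizing the negated counts.
    rows, cols = linear_sum_assignment(-counts)
    return float(counts[rows, cols].sum() / len(t))


# Synthetic stand-in for an expression matrix (assumption): 3 groups, 10 features.
X, y = make_blobs(n_samples=150, centers=3, n_features=10, random_state=0)

spectral_pred = SpectralClustering(
    n_clusters=3, affinity="nearest_neighbors", n_neighbors=10, random_state=0
).fit_predict(X)
em_pred = GaussianMixture(
    n_components=3, covariance_type="diag", random_state=0
).fit_predict(X)

acc_spectral = clustering_accuracy(y, spectral_pred)
acc_em = clustering_accuracy(y, em_pred)
print(f"spectral: {acc_spectral:.3f}, EM: {acc_em:.3f}")
```

Both methods here receive the number of clusters as input, in line with the remark that only X-means and greedy determine it themselves.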

Performances of seven clustering methods and toolboxes on the 12 datasets

Cell classification results vary because of different parameter combinations and different dataset sizes, making it difficult and time-consuming to find the optimal parameters for the best result. Hence, methods that integrate machine learning to provide a thorough and easy-to-operate pipeline have been developed. At present, many clustering methods have been developed for single-cell data, but these methods show very different characteristics in model assumptions and clustering. Therefore, there is an urgent need for large-scale comparison and evaluation of these methods. To this end, we compared several clustering methods and tools, evaluated their clustering capabilities and conducted analysis to explore their effects on clustering cell types and detecting differentially expressed genes in the context of real data.

To evaluate the performance of clustering methods, we tested the clustering methods SC3, pcaReduce and tSNE+kmeans on the 12 published datasets. The analysis uses the adjusted Rand index (ARI) because of its wide adoption in the field. To test the stability of each method, we repeated the experiment 100 times with fixed parameters. The clustering performance shows that SC3 has the best robustness, while pcaReduce and tSNE+kmeans perform worse. Figure 4 shows that most of the ARI values of SC3 are concentrated and located in the upper half of the graph, indicating that SC3 is stable and has the best effect. The ARI values of tSNE+kmeans and pcaReduce are widely distributed.


Figure 4. The clustering results of SC3, pcaReduce and tSNE+kmeans. We use 12 colors to represent different datasets, labeled on the right side of the picture. The
column of the box in the picture represents the ARI distribution. The wider the horizontal direction is, the more clustering results are distributed in the value of ARI.
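
The stability experiment behind Figure 4 (repeated runs scored by ARI) can be sketched as follows. This is an illustration under stated assumptions, not the authors' pipeline: PCA followed by k-means is used as a fast stand-in for tSNE+kmeans, the data are synthetic, 20 repeats replace the 100 used in the text, and `ari_over_runs` is a hypothetical helper.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.metrics import adjusted_rand_score


def ari_over_runs(X, truth, n_clusters, n_runs=20):
    """ARI of PCA + k-means against the reference labels, one value per random seed."""
    scores = []
    for seed in range(n_runs):
        reduced = PCA(n_components=5, random_state=seed).fit_transform(X)
        # A single initialization per run, so that seed-to-seed variation is visible.
        pred = KMeans(n_clusters=n_clusters, n_init=1, random_state=seed).fit_predict(reduced)
        scores.append(adjusted_rand_score(truth, pred))
    return np.array(scores)


# Synthetic stand-in (assumption): 4 overlapping groups in 50 dimensions.
X, truth = make_blobs(
    n_samples=200, centers=4, n_features=50, cluster_std=6.0, random_state=1
)
ari_scores = ari_over_runs(X, truth, n_clusters=4)

print(f"ARI min = {ari_scores.min():.3f}, "
      f"median = {np.median(ari_scores):.3f}, max = {ari_scores.max():.3f}")
```

A narrow spread of ARI values across seeds corresponds to the concentrated distributions described for SC3, whereas a wide spread corresponds to the behavior reported for tSNE+kmeans and pcaReduce.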

The ARI values of the three methods on the Pollen dataset are almost all between 0.25 and 0.6, where SC3 results are more concentrated at the endpoints of the interval, and pcaReduce values are more concentrated between 0.25 and 0.4. In comparison, tSNE+kmeans results are better, most of which are around 0.6. This shows that tSNE+kmeans is effective for dimensionality reduction of the Pollen dataset. On the Biase dataset, SC3 results were stable above 0.9, and most ARI values of pcaReduce were distributed between 0.8 and 0.9. In contrast, more than half of the tSNE+kmeans results were below 0.1, which indicated that tSNE+kmeans was completely ineffective for dimensionality reduction of the Biase dataset. Usually, no dimensionality reduction or clustering method is suitable for all datasets. We can observe the statistical characteristics of a dataset and find the dimensionality reduction method that is effective for it.

Then we conducted an experiment on all datasets with the default parameters of each toolbox and recorded the results, shown in Figure 5, which gives the clustering performances of five toolboxes: SC3, SNN-Cliq, SINCERA, SEURAT and pcaReduce. First, we analyzed the experimental results on the Pollen dataset and found that the results of the five clustering algorithms are very close, from which we conclude that Pollen is universal to the algorithms. In terms of algorithmic comparison, SC3 produced the best performances, beating the other four clustering methods tested. SC3 worked best on most datasets, and its powerful integration features made it a stable toolbox. However, the SEURAT algorithm performed poorly on most datasets, and it is the most unstable method. SEURAT is a toolbox for spatial cell localization, and it may be that the noise in the single-cell RNA dataset affects the performance of the algorithm.

Discussion and conclusions

scRNA-seq provides deep scrutiny into the gene expression character of diverse cell types. The main challenge now is the noisy nature of the single-cell RNA data. This noise makes it difficult to distinguish very similar cell types, and this is where the technology needs to be improved. In this review, we covered several aspects of scRNA-seq applications, from quantitative analysis and characterization of cell types through clustering and classification to gene regulatory network reconstruction and cell-state identification. Using 12 published sets of single-cell RNA data, we showed the classification and clustering performance of each algorithm. We have listed the classical machine-learning methods, summarized the currently generally accepted toolkits and conducted experiments on the single-cell RNA datasets. At present, single-cell RNA classification is relatively mature, but there is still a lot of room for improvement in clustering. scRNA-seq technologies allow researchers to uncover new and potentially unexpected biological discoveries compared with traditional profiling methods. scRNA-seq has also been applied to identify subclones from the transcriptomes of neoplastic cells, and the technique holds enormous potential for both basic biology and clinical applications.

Experimental studies emphasize that no single method performs best in all situations. Some of the shortcomings of these approaches, such as scalability, robustness and in some cases unavailability, need to be addressed in future studies. Several recommendations have been made based on the above experiments and analysis, with details given in Table 4. Specifically, for Pollen's dataset, we recommend the algorithm with the fastest convergence (SIMLR) because the various algorithms showed little difference in clustering performance. SIMLR is a kernel-based similarity learning method, based on dimensionality reduction of scRNA-seq data, and can typically be applied to relatively large-scale datasets. For Biase's dataset, SC3, pcaReduce and SINCERA all achieved a better effect than the other algorithms. For Goolam's dataset, we recommend using pcaReduce for dimensionality reduction and clustering due to its superior prediction performance, while for the Klein's


Figure 5. ARI of five scRNA clustering analysis tools. We use the abbreviations SNN, SIN, SEU and pcaR to represent the clustering tools SNN-Cliq, SINCERA, SEURAT and pcaReduce, respectively. Colors closer to red indicate a better clustering effect, and colors closer to blue indicate a worse clustering effect.

dataset, we recommend using the tSNE+kmeans or SC3 method. From the perspective of the algorithm, it is easy to see, according to Figures 4 and 5, that SC3 is stable on most of the benchmark datasets and scales well with most datasets in terms of their sizes and varying dimensionality. SC3, pcaReduce and SINCERA are more robust on almost all datasets than the other tools in this review. For single-cell datasets created from heterogeneous sources, multi-modal and multi-view learning can be introduced to combine all the gene expression data from a single cell so that the data can be complementary to each other, to make better use of the data for analysis. Further research into individual cells will contribute to the field of personalized medicine with a deeper understanding of the underlying processes of various developmental, physiological and disease systems.

Key Points
• The paper reviewed machine-learning approaches for clustering and classification based on the characteristics of single-cell RNA-sequencing (scRNA-seq). Efficient methods and tools for dimensionality reduction were summarized in detail.
• Various tools applied in scRNA-seq were explained clearly, and we highlighted the pros and cons of each approach, which could help readers to select proper tools for distinct tasks.
• We provided a comprehensive description of scRNA-seq data and download URLs, and the performances of the methods on scRNA-seq datasets were shown in the paper.
• The paper stated clear recommendations after running various methods and tools on a series of scRNA-seq datasets. The corresponding summary can be found in Table 4 and the discussion section at the end of this article.

Funding

National Key R&D Program of China (2018YFC0910405); Natural Science Foundation of China (61771331); R01 award #1R01GM131399-01 from the National Institute of General Medical Sciences of the National Institutes of Health. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health.

References
1. Xu Y, Zhou X. Applications of single-cell sequencing for multiomics. Methods Mol Biol 2018;1754:327–74.
2. Yang J, Gruenewald S, Wan X-F. Quartet-net: a quartet-based method to reconstruct phylogenetic networks. Mol Biol Evol 2013;30(5):1206–17.
3. Yang JL, Grunewald S, Xu YF, et al. Quartet-based methods to reconstruct phylogenetic networks. BMC Syst Biol 2014;8:12.

4. Kanter I, Dalerba P, Kalisky T, et al. A cluster robustness score for identifying cell subpopulations in single cell gene expression datasets from heterogeneous tissues and tumors. Bioinformatics 2019;35(6):962–71.
5. Xie J, Ma A, Zhang Y, et al. QUBIC2: a novel biclustering algorithm for large-scale bulk RNA-sequencing and single-cell RNA-sequencing data analysis. 2018;409961. doi: https://doi.org/10.1101/409961.
6. Marinov GK, Williams BA, McCue K, et al. From single-cell to cell-pool transcriptomes: stochasticity in gene expression and RNA splicing. Genome Res 2014;24(3):496–510.
7. Conesa A, Madrigal P, Tarazona S, et al. A survey of best practices for RNA-seq data analysis. Genome Biol 2016;17(13):2.
8. Pan G, Tang J, Guo F. Analysis of co-associated transcription factors via ordered adjacency differences on motif distribution. Sci Rep 2017;7:43597.
9. Yang J, Huang T, Petralia F, et al. Synchronized age-related gene expression changes across multiple tissues in human and the link to complex diseases. Sci Rep 2015;5:15145.
10. Johannes K, Myles B, Shirley LX. A Bayesian model for single cell transcript expression analysis on MERFISH data. Bioinformatics 2019;35(6):995–1001.
11. Wold S, Esbensen K, Geladi PJC, et al. Principal component analysis. Chemometr Intell Lab Syst 1987;2(1):37–52.
12. Peng J, Wang X, Shang XJ. Combining gene ontology with deep neural networks to enhance the clustering of single cell RNA-seq data. 2018;437020. doi: https://doi.org/10.1101/437020.
13. Pierson E, Yau C. ZIFA: dimensionality reduction for zero-inflated single-cell gene expression analysis. Genome Biol 2015;16(1):241.
14. Wei L, Ran S, Bing W, et al. Integration of deep feature representations and handcrafted features to improve the prediction of N6-methyladenosine sites. Neurocomputing 2019;324:3–9.
15. Su R, Wu H, Xu B, et al. Developing a multi-dose computational model for drug-induced hepatotoxicity prediction based on toxicogenomics data. IEEE/ACM Trans Comput Biol Bioinform 2018. doi: 10.1109/TCBB.2018.2858756.
16. Wei L, Xing P, Zeng J, et al. Improved prediction of protein–protein interactions using novel negative samples, features, and an ensemble classifier. Artif Intell Med 2017;83:67–74.
17. Yang J, Zhang L. Run probabilities of seed-like patterns and identifying good transition seeds. J Comput Biol 2008;15(10):1295–313.
18. Davis JV, Kulis B, Jain P, et al. Information-theoretic metric learning. In: Icml 07: International Conference on Machine Learning, 2007. Corvalis, Oregon, USA.
19. Weinberger KQ, Blitzer J, Saul LK. Distance metric learning for large margin nearest neighbor classification. In: NIPS. Vancouver, British Columbia, Canada, 2005 pp. 1473–80.
20. Zadeh PH, Hosseini R, Sra S. Geometric mean metric learning. In ICML. New York City, NY, USA, 2016, pp. 2464–71.
21. Hartigan JA, Wong MA. Algorithm AS 136: a K-means clustering algorithm. J R Stat Soc Ser C Appl Stat 1979;28(1):100–8.
22. Dempster AP, Laird NM, Rubin DBJJRSS. Maximum likelihood from incomplete data via the EM algorithm. J R Stat Soc Series B Stat Methodol 1977;39(1):1–38.
23. Ng AY, Jordan MI, Weiss Y. On spectral clustering: analysis and an algorithm. In: Proceedings of the 14th International Conference on Neural Information Processing Systems: Natural and Synthetic, 2001. Kitakyushu, Japan.
24. Grun D, Lyubimova A, Kester L, et al. Single-cell messenger RNA sequencing reveals rare intestinal cell types. Nature 2015;525(7568):251.
25. Žurauskienė J, Yau C. pcaReduce: hierarchical clustering of single cell transcriptional profiles. BMC Bioinformatics 2016;17(1):140.
26. Jiang H, Sohn LL, Huang H, et al. Single cell clustering based on cell-pair differentiability correlation and variance analysis. Bioinformatics 2018;34(21):3684–94.
27. Monier B, McDermaid A, Wang C, et al. IRIS-EDA: an integrated RNA-Seq interpretation system for gene expression data analysis. PLoS Comput Biol 2019;15(2):e1006792.
28. Navin NE. Tumor evolution in response to chemotherapy: phenotype versus genotype. Cell Rep 2014;6(3):417–9.
29. Liu X, Yang J, Zhang Y, et al. A systematic study on drug-response associated genes using baseline gene expressions of the Cancer Cell Line Encyclopedia. Sci Rep 2016;6:22811.
30. Almendro V, Cheng Y-K, Randles A, et al. Inference of tumor evolution during chemotherapy by computational modeling and in situ analysis of genetic and phenotypic cellular diversity. Cell Rep 2014;6(3):514–27.
31. Chenghang Z, Sijia L, Chapman AR, et al. Genome-wide detection of single-nucleotide and copy-number variations of a single human cell. Science 2012;338(6114):1622–6.
32. Wang J, Fan HC, Behr B, et al. Genome-wide single-cell analysis of recombination activity and de novo mutation rates in human sperm. Cell 2012;150(2):402–12.
33. Wang Y, Waters J, Leung ML, et al. Clonal evolution in breast cancer revealed by single nucleus genome sequencing. Nature 2014;512(7513):155.
34. Ross IL, Browne CM, Hume DA, et al. Transcription of individual genes in eukaryotic cells occurs randomly and infrequently. Immunol Cell Biol 1994;72(2):177–85.
35. Ozbudak EM, Mukund T, Iren K, et al. Regulation of noise in the expression of a single gene. Nat Genet 2002;31(1):69–73.
36. Raj A, van den Bogaard P, Rifkin SA, et al. Imaging individual mRNA molecules using multiple singly labeled probes. Nat Methods 2008;5(10):877–9.
37. Islam S, Zeisel A, Joost S, et al. Quantitative single-cell RNA-seq with unique molecular identifiers. Nat Methods 2014;11(2):163.
38. Svensson V, Natarajan KN, Ly LH, et al. Power analysis of single-cell RNA-sequencing experiments. Nat Methods 2016;14(4):381–7.
39. Robinson MD, Oshlack AJGB. A scaling normalization method for differential expression analysis of RNA-seq data. Genome Biol 2010;11(3):1–9.
40. Anders S, Huber W. Differential expression analysis for sequence count data. Genome Biol 2010;11:R106.
41. Li J, Witten DM, Johnstone IM, et al. Normalization, testing, and false discovery rate estimation for RNA-sequencing data. Biostatistics 2012;13(3):523–38.
42. Eberwine J, Yeh H, Miyashiro K, et al. Analysis of gene expression in single live neurons. Proc Natl Acad Sci U S A 1992;89(7):3010–4.
43. Tang F, Barbacioru C, Wang Y, et al. mRNA-Seq whole-transcriptome analysis of a single cell. Nat Methods 2009;6(5):377–U86.
44. Zeng X, Lin W, Guo M, et al. A comprehensive overview and evaluation of circular RNA detection tools. PLoS Comput Biol 2017;13(6):e1005420.
45. van der Maaten L, Hinton G. Visualizing data using t-SNE. J Mach Learn Res 2008;9:2579–605.

46. Lin C, Jain S, Kim H, et al. Using neural networks for reducing the dimensions of single-cell RNA-Seq data. Nucleic Acids Res 2017;45(17):e156.
47. Li X, Chen W, Chen Y, et al. Network embedding-based representation learning for single cell RNA-seq data. Nucleic Acids Res 2017;45(19):e166.
48. Zeng X, Zhang X, Zou Q. Integrative approaches for predicting microRNA function and prioritizing disease-related microRNA using biological interaction networks. Brief Bioinform 2016;17(2):193–203.
49. Zou Q, Li J, Song L, et al. Similarity computation strategies in the microRNA-disease network: a survey. Brief Funct Genomics 2016;15(1):55–64.
50. Maaten L. An introduction to dimensionality reduction using Matlab. Maastricht: Maastricht University, 2007.
51. Chatfield C, Collins AJ. Factor analysis. In: Introduction to Multivariate Analysis. NY, US: Springer, 1980.
52. Kruskal JB, Wish M. Multidimensional Scaling. Thousand Oaks, CA, USA: Sage Publications, 1978.
53. Sammon JW Jr. A nonlinear mapping for data structure analysis. IEEE Trans Comput 1969.
54. Fisher RA. The use of multiple measurements in taxonomic problems. Ann Hum Genet 2012;7(2):179–88.
55. Roweis ST, Saul LK. Nonlinear dimensionality reduction by locally linear embedding. Science 2000;290(5500):2323–6.
56. Belkin M, Niyogi P. Laplacian Eigenmaps for Dimensionality Reduction and Data Representation. Cambridge, MA: MIT Press, 2003.
57. Donoho DL, Grimes C. Hessian eigenmaps: locally linear embedding techniques for high-dimensional data. Proc Natl Acad Sci U S A 2003;100(10):5591–6.
58. Zhang Z, Zha H. Principal manifolds and nonlinear dimensionality reduction via tangent space alignment. SIAM J Sci Comput 2004;8(4):406–24.
59. Baudat G, Anouar F. Generalized discriminant analysis using a kernel approach. Neural Comput 2000;12(10):2385–404.
60. He X, Deng C, Yan S, et al. Neighborhood preserving embedding. In: Tenth IEEE International Conference on Computer Vision, 2005. Beijing, China.
61. He X, Niyogi P. Locality preserving projections. In: NIPS. Vancouver, British Columbia, Canada, 2003.
62. Xu Y, Guo M, Liu X, et al. Identify bilayer modules via pseudo-3D clustering: applications to miRNA-gene bilayer networks. Nucleic Acids Res 2016;44(20):e152.
63. Ishioka T. Extended k-means with an efficient estimation of the number of clusters. In: Seventeenth International Conference on Machine Learning, 2000. Stanford, CA, USA.
64. Kiselev VY, Kirschner K, Schaub MT, et al. SC3: consensus clustering of single-cell RNA-seq data. Nat Methods 2017;14(5):483–6.
65. Aibar S, González-Blas CB, Moerman T, et al. SCENIC: single-cell regulatory network inference and clustering. Nat Methods 2017;14(11):1083–6.
66. Satija R, Farrell JA, Gennert D, et al. Spatial reconstruction of single-cell gene expression data. Nat Biotechnol 2015;33(5):495–502.
67. Pollen AA, Nowakowski TJ, Shuga J, et al. Low-coverage single-cell mRNA sequencing reveals cellular heterogeneity and activated signaling pathways in developing cerebral cortex. Nat Biotechnol 2014;32(10):1053.
68. Wang B, Zhu J, Pierson E, et al. Visualization and analysis of single-cell RNA-seq data by kernel-based similarity learning. Nat Methods 2017;14(4):414.
69. Zhang L, Zhang S. Comparison of computational methods for imputing single-cell RNA-sequencing data. IEEE/ACM Trans Comput Biol Bioinform 2018;1. doi: 10.1109/TCBB.2018.2848633.
70. Guo M, Xu Y. Single-cell transcriptome analysis using SINCERA pipeline. Methods Mol Biol 2018;1751:209.
71. Guo M, Wang H, Potter SS, et al. SINCERA: a pipeline for single-cell RNA-Seq profiling analysis. PLoS Comput Biol 2015;11(11):e1004575.
72. Xu C, Su Z. Identification of cell types from single-cell transcriptomes using a novel clustering method. Bioinformatics 2015;31(12):1974–80.
73. Shao C, Höfer T. Robust classification of single-cell transcriptome data by nonnegative matrix factorization. Bioinformatics 2016;33(2):btw607.
74. Trapnell C, Cacchiarelli D, Grimsby J, et al. The dynamics and regulators of cell fate decisions are revealed by pseudotemporal ordering of single cells. Nat Biotechnol 2014;32(4):381–6.
75. Zeisel A, Munoz-Manchado AB, Codeluppi S, et al. Cell types in the mouse cortex and hippocampus revealed by single-cell RNA-seq. Science 2015;347(6226):1138–42.
76. Shi F, Huang H. Identifying cell subpopulations and their genetic drivers from single-cell RNA-Seq data using a biclustering approach. J Comput Biol 2017;24(7):663–74.
77. Biase FH, Cao X, Zhong S. Cell fate inclination within 2-cell and 4-cell mouse embryos revealed by single-cell RNA sequencing. Genome Res 2014;24(11):1787–96.
78. Yan L, Yang M, Guo H, et al. Single-cell RNA-Seq profiling of human preimplantation embryos and embryonic stem cells. Nat Struct Mol Biol 2013;20(9):1131–9.
79. Goolam M, Scialdone A, Graham SJL, et al. Heterogeneity in Oct4 and Sox2 targets biases cell fate in 4-cell mouse embryos. Cell 2016;165(1):61–74.
80. Deng Q, Ramsköld D, Reinius B, et al. Single-cell RNA-Seq reveals dynamic, random monoallelic gene expression in mammalian cells. Science 2014;343(6167):193–6.
81. Kolodziejczyk AA, Kim JK, Tsang JCH, et al. Single cell RNA-sequencing of pluripotent states unlocks modular transcriptional variation. Cell Stem Cell 2015;17(4):471–85.
82. Treutlein B, Brownfield DG, Wu AR, et al. Reconstructing lineage hierarchies of the distal lung epithelium using single-cell RNA-seq. Nature 2014;509(7500):371.
83. Ting DT, Wittner BS, Ligorio M, et al. Single-cell RNA sequencing identifies extracellular matrix gene expression by pancreatic circulating tumor cells. Cell Rep 2014;8(6):1905–18.
84. Patel AP, Tirosh I, Trombetta JJ, et al. Single-cell RNA-seq highlights intratumoral heterogeneity in primary glioblastoma. Science 2014;344(6190):1396–401.
85. Usoskin D, Furlan A, Islam S, et al. Unbiased classification of sensory neuron types by large-scale single-cell RNA sequencing. Nat Neurosci 2015;18(1):145.
86. Klein AM, Mazutis L, Akartuna I, et al. Droplet barcoding for single-cell transcriptomics applied to embryonic stem cells. Cell 2015;161(5):1187–201.
87. Blondel VD, Guillaume JL, Lambiotte R, et al. Fast unfolding of community hierarchies in large networks. J Stat Mech 2008;abs/0803.0476. doi: 10.1088/1742-5468/2008/10/P10008.
