
Review of Single-cell RNA-seq Data Clustering
to a lower-dimensional space using dimension reduction
that can improve and refine the clustering results. In this
section, we review several commonly used dimension
reduction methods, including principal component analysis,
t-distributed stochastic neighbor embedding, deep
learning models, and others.
2.3.1. PCA
Principal Component Analysis (PCA) is a typical linear
projection method that transforms a set of possibly correlated
variables into a set of linearly uncorrelated, mutually
orthogonal variables (principal components). Due to its
conceptual simplicity and
efficiency, PCA has been widely used in single-cell RNA-
seq processing (Jiang et al., 2016a; Buettner et al., 2015;
Shalek et al., 2014; Usoskin et al., 2015; Žurauskienė and
Yau, 2016; Kiselev et al., 2017). Notably, SC3 (Kiselev
et al., 2017) applied PCA to transform the distance matrices
as the input of consensus clustering; Shalek et al. (2014)
used PCA for single-cell RNA-seq data spanning several
experimental conditions. In addition, several extended and
improved PCA-based methods have been developed: pcaReduce
(Žurauskienė and Yau, 2016) applied PCA iteratively to
provide low-dimensional principal component representations,
and Usoskin et al. (2015) proposed an unbiased iterative
PCA-based process to identify distinct large-scale
expression data patterns. However, as a linear method, PCA
cannot capture the nonlinear relationships between cells,
and its results are further affected by the high levels of
dropout and noise (Kiselev et al., 2019).
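As a minimal sketch of the linear projection described above,
the following pure-Python example centers a toy cells-by-genes
matrix, estimates the leading principal component by power
iteration on the covariance matrix, and projects each cell onto
it. The toy matrix, the function names, and the choice of power
iteration are illustrative assumptions and are not taken from
any of the cited tools; practical pipelines typically rely on an
SVD-based routine from a numerical library.

```python
import math


def center(rows):
    """Subtract the per-gene (column) mean from every cell (row)."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    return [[r[j] - means[j] for j in range(d)] for r in rows]


def covariance(centered):
    """Sample covariance matrix (genes x genes) of centered data."""
    n, d = len(centered), len(centered[0])
    return [
        [sum(r[a] * r[b] for r in centered) / (n - 1) for b in range(d)]
        for a in range(d)
    ]


def top_component(cov, iters=200):
    """Leading eigenvector of `cov` by power iteration.

    Assumes the all-ones start vector is not orthogonal to the
    leading eigenvector, which holds for the toy data below.
    """
    d = len(cov)
    v = [1.0] * d
    for _ in range(iters):
        w = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v


# Hypothetical toy data: 6 cells x 3 genes. Genes 1 and 2 are
# strongly correlated; gene 3 is nearly constant.
cells = [
    [1.0, 2.0, 0.5],
    [2.0, 4.1, 0.4],
    [3.0, 5.9, 0.6],
    [4.0, 8.2, 0.5],
    [5.0, 9.8, 0.4],
    [6.0, 12.1, 0.6],
]

centered = center(cells)
cov = covariance(centered)
pc1 = top_component(cov)

# One-dimensional representation: the score of each cell on PC1.
scores = [sum(r[j] * pc1[j] for j in range(len(pc1))) for r in centered]

# Fraction of the total variance captured by PC1.
variance_pc1 = sum(
    pc1[a] * cov[a][b] * pc1[b] for a in range(3) for b in range(3)
)
explained = variance_pc1 / sum(cov[j][j] for j in range(3))
```

In this toy case a single component captures almost all of the
variance because two of the three genes move together, which is
the situation in which a linear projection works well.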
2.3.2. t-SNE
t-distributed Stochastic Neighbor Embedding (t-SNE)
is the most commonly used nonlinear dimension reduction
method which can uncover the relationships between cells.
t-SNE converts pairwise similarities between data points into
probabilities and minimizes the Kullback-Leibler divergence
between the high- and low-dimensional similarity distributions
by gradient descent until convergence. t-SNE has become a
cornerstone of dimension reduction and visualization for
high-dimensional single-cell RNA-seq data
(Linderman et al., 2019; Lin et al., 2017b; Butler et al., 2018;
Haghverdi et al., 2018; Ntranos et al., 2016; Prabhakaran
et al., 2016; Zeisel et al., 2015; Zhang et al., 2018; Li et al.,
2017). In particular, Linderman et al. (2019) developed a fast
interpolation-based t-SNE that dramatically accelerates the
processing and visualization of rare cell populations in large
datasets. Nonetheless, t-SNE has limitations: its loss function
is non-convex, so different runs can converge to different
local optima, and its parameters need to be tuned.
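For reference, the t-SNE objective is commonly written as
follows. The notation is our assumption and is not taken from
the cited works: $x_i$ is a cell in the original space, $y_i$
its low-dimensional embedding, $n$ the number of cells, and
$\sigma_i$ a per-point bandwidth that is set through the
perplexity parameter.

```latex
\begin{align}
p_{j|i} &= \frac{\exp\!\left(-\lVert x_i - x_j\rVert^2 / 2\sigma_i^2\right)}
               {\sum_{k \neq i} \exp\!\left(-\lVert x_i - x_k\rVert^2 / 2\sigma_i^2\right)},
\qquad
p_{ij} = \frac{p_{j|i} + p_{i|j}}{2n}, \\
q_{ij} &= \frac{\left(1 + \lVert y_i - y_j\rVert^2\right)^{-1}}
              {\sum_{k \neq l} \left(1 + \lVert y_k - y_l\rVert^2\right)^{-1}}, \\
C &= \mathrm{KL}(P \,\Vert\, Q)
   = \sum_{i \neq j} p_{ij} \log \frac{p_{ij}}{q_{ij}}, \\
\frac{\partial C}{\partial y_i}
  &= 4 \sum_{j} (p_{ij} - q_{ij})\,(y_i - y_j)
     \left(1 + \lVert y_i - y_j\rVert^2\right)^{-1}.
\end{align}
```

Since $C$ is non-convex in the embedding coordinates $y_i$,
gradient descent started from different initial embeddings can
end in different local optima, and the perplexity that fixes
$\sigma_i$ is one of the parameters that has to be chosen.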
2.3.3. Deep learning models
In recent years, deep learning models (such as neural networks
and variational auto-encoders) have shown superior
performance in interpreting complex high-dimensional data.
SCNN (Lin et al., 2017a) tested various neural network
architectures and incorporated prior biological knowledge
to obtain a reduced-dimension representation of single-cell
expression data. SCVIS (Ding et al., 2018) and VASC
(Wang and Gu, 2018) are both based on variational
auto-encoders, which can capture nonlinear relationships
between cells and visualize the low-dimensional embedding of
single-cell gene expression data. So far, these methods have
shown superior interpretability and compatibility on
high-dimensional single-cell RNA-seq data.
2.3.4. Other methods
In addition, there are other dimension reduction methods.
CIDR (Lin et al., 2017b) applied principal coordinate
analysis, which preserves in the low-dimensional space the
distance information of the high-dimensional space. Seurat
(Butler et al., 2018) is a toolkit for the analysis of
single-cell RNA sequencing data that provides several
dimension reduction methods, such as PCA and t-SNE. Uniform
Manifold Approximation and Projection (UMAP) (McInnes et al.,
2018) is a widely used technique for dimension reduction.
UMAP provides increased speed and better preservation of
the global structure of high-dimensional datasets, and
Becht et al. (2019) reported that it outperforms t-SNE.
3. Clustering methods for single-cell RNA-seq
Diverse types of clustering methods have been developed
for detecting cell types from single-cell RNA-seq data.
These methods can be roughly classified into four
categories: 𝑘-means clustering, hierarchical clustering,
community-detection-based clustering, and density-based
clustering. We review several computational applications
of these clustering methods along with their strengths and
limitations. Table 1 gives an overview of the state-of-the-art
clustering methods for single-cell RNA-seq data.
3.1. 𝑘-means clustering
𝑘-means clustering is the most popular clustering
approach. It iteratively finds a predefined number 𝑘 of
cluster centers (centroids) by minimizing the sum of the
squared Euclidean distances between each cell and its closest
centroid. In addition, it is suitable for large datasets since
it scales linearly with the number of data points (Lloyd,
1982).
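The iterative procedure described above can be sketched as
follows. This is a minimal pure-Python illustration of
Lloyd-style 𝑘-means on toy two-dimensional points (which could
stand for cells after dimension reduction). The data, the
function names, and the explicit `init` argument are
assumptions made for the example and do not reproduce any
cited tool; in practice the starting centroids are usually
drawn at random.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, init, max_iter=100):
    """Lloyd-style k-means starting from the centroids in `init`."""
    centroids = [list(c) for c in init]
    k = len(centroids)
    labels = None
    for _ in range(max_iter):
        # Assignment step: each cell goes to its closest centroid.
        new_labels = [
            min(range(k), key=lambda c: sq_dist(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:  # assignments are stable: converged
            break
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = [
                    sum(col) / len(members) for col in zip(*members)
                ]
    # Objective: sum of squared distances to the assigned centroid.
    inertia = sum(
        sq_dist(p, centroids[lab]) for p, lab in zip(points, labels)
    )
    return labels, centroids, inertia


# Hypothetical toy data: two well-separated groups of three cells.
points = [
    (0.0, 0.1), (0.2, 0.0), (0.1, 0.3),
    (5.0, 5.1), (5.2, 4.9), (4.9, 5.3),
]
labels, centroids, inertia = kmeans(points, init=[points[0], points[3]])
```

Each pass over the data costs a number of distance evaluations
proportional to the number of points times 𝑘, which is the
source of the linear scaling mentioned above.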
Several clustering tools based on 𝑘-means have been
developed for interpreting single-cell RNA-seq data. SAIC
(Yang et al., 2017) utilized an iterative 𝑘-means clustering to
identify the optimal subset of signature genes that separate
single cells into distinct clusters. pcaReduce (Žurauskienė
and Yau, 2016) is a hierarchical clustering method, but
it relies on 𝑘-means results as the initial clusters. RaceID
(Grün et al., 2015) applied 𝑘-means to unravel the
heterogeneity of rare intestinal cell types, using the gap
statistic (Tibshirani et al., 2001) to choose 𝑘.
However, 𝑘-means clustering is a greedy algorithm
that may fail to find the global optimum; the predefined
number of clusters 𝑘 can affect the clustering results; and
it is sensitive to outliers because it tends to identify
globular clusters, which can cause it to fail to detect
rare cell types.
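The dependence on the starting centroids can be made concrete
with a small example. The sketch below runs a compact
Lloyd-style 𝑘-means (an illustrative implementation, not the
code of any cited tool) on four hypothetical points that form
a wide rectangle. One initialization converges to a poor local
optimum, another to the better partition, and keeping the run
with the lowest objective is one simple mitigation.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def lloyd(points, init, max_iter=100):
    """Run k-means from `init`; return (labels, objective)."""
    centroids = [list(c) for c in init]
    k = len(centroids)
    labels = None
    for _ in range(max_iter):
        new_labels = [
            min(range(k), key=lambda c: sq_dist(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            break
        labels = new_labels
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [
                    sum(col) / len(members) for col in zip(*members)
                ]
    objective = sum(
        sq_dist(p, centroids[lab]) for p, lab in zip(points, labels)
    )
    return labels, objective


# Hypothetical data: corners of a 4 x 1 rectangle, k = 2.
points = [(0.0, 0.0), (0.0, 1.0), (4.0, 0.0), (4.0, 1.0)]

starts = [
    [points[0], points[1]],  # both starting centroids on the left edge
    [points[0], points[2]],  # one starting centroid on each side
]
runs = [lloyd(points, init) for init in starts]

# The first start gets stuck splitting top from bottom
# (objective 16.0); the second finds the left/right split
# (objective 1.0).
best_labels, best_objective = min(runs, key=lambda run: run[1])
```

Both runs are fixed points of the algorithm, so neither can be
improved by further iterations; only a different start reveals
the better solution.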
To overcome the above drawbacks, SC3 (Kiselev et al.,
2017) combined individual 𝑘-means clustering results obtained
with different initial conditions into consensus clusters. RaceID2
S. Zhang et al. Page 3 of 12