Single-cell RNA-seq Data Clustering: A Review of Cell Type Identification and Characterization


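This paper is a review of single-cell RNA sequencing data clustering as applied to cell type identification and characterization. The authors are from the Department of Computer Science at City University of Hong Kong. The paper discusses how advances in single-cell RNA-seq technology have enabled high-throughput transcriptomic analysis of single cells, and the central role that unsupervised learning, such as data clustering, plays in identifying and characterizing novel cell types and gene expression patterns.

Over the past few years, single-cell RNA sequencing (scRNA-seq) technology has advanced considerably, allowing large-scale transcriptomic analysis at the single-cell level in a high-throughput manner. This development has greatly advanced biological research, particularly in understanding cellular heterogeneity and revealing cell differentiation pathways. Unsupervised learning, especially data clustering, has become a key tool for identifying and characterizing novel cell types.

In the review, the authors survey existing scRNA-seq clustering methods in detail and analyze the strengths and limitations of each. These methods are typically distance-based, density-based, hierarchical, or model-driven. The authors stress the importance of preprocessing scRNA-seq data before clustering, which includes quality control (removing low-quality reads and abnormal cells), normalization (removing differences in sequencing depth between samples), and dimension reduction (reducing data complexity and extracting key features).

The authors also run performance comparison experiments on several popular scRNA-seq clustering methods, which may include Seurat, Scanpy, and Cell Ranger. Such experiments typically use simulated data or real scRNA-seq datasets to assess the accuracy and stability of the clustering, examining how well each method discovers cell populations, preserves cell type structure, and identifies rare cell types. Based on this evaluation, the authors point out that the choice of clustering method depends on the experimental design, the data quality, and the research goal. For example, some methods may perform well on large numbers of cells, while others may be better suited to identifying complex structure in small datasets.

The authors also discuss future research directions, such as developing new clustering algorithms better adapted to the characteristics of scRNA-seq data, and strategies that combine prior biological knowledge with supervised or semi-supervised learning. The review offers a comprehensive guide to using scRNA-seq data clustering for cell type identification and characterization, and it is an important reference for bioinformatics researchers and biomedical scientists. By understanding and applying the methods and techniques it describes, scientists can better resolve the cellular composition of complex tissues.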
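As a rough illustration of the normalization step mentioned in the summary above, the following sketch scales every cell to the same total count and then applies a log transform. The function name `normalize_log` and the default `target_sum` of 10,000 are assumptions chosen for the example; the paper does not prescribe either.

```python
import numpy as np

def normalize_log(counts, target_sum=1e4):
    """Scale each cell (row) to the same total count, then apply log(1 + x).

    counts: array of shape (n_cells, n_genes) with raw counts.
    target_sum: assumed per-cell total after scaling (illustrative default).
    """
    totals = counts.sum(axis=1, keepdims=True)
    # Cells with no counts would divide by zero; leave them as all zeros.
    totals = np.where(totals == 0, 1.0, totals)
    return np.log1p(counts / totals * target_sum)

# Toy matrix: the second cell is the first one sequenced at twice the depth.
counts = np.array([[1.0, 2.0, 3.0],
                   [2.0, 4.0, 6.0],
                   [0.0, 0.0, 0.0]])
normed = normalize_log(counts)
```

After this step the first two toy cells have identical profiles, which is the point of depth normalization: differences caused only by sequencing depth disappear.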

Review of Single-cell RNA-seq Data Clustering

to a lower-dimensional space using dimension reduction that can improve and refine the clustering results. In this section, we review several commonly used dimension reduction methods, including principal component analysis, t-distributed stochastic neighbor embedding algorithm, deep learning models, and others.

2.3.1. PCA

Principal Component Analysis (PCA) is a typical linear projection method that projects a set of possibly correlated variables onto a set of linearly uncorrelated, orthogonal variables (principal components). Due to its conceptual simplicity and efficiency, PCA has been widely used in single-cell RNA-seq processing (Jiang et al., 2016a; Buettner et al., 2015; Shalek et al., 2014; Usoskin et al., 2015; Žurauskienė and Yau, 2016; Kiselev et al., 2017). Notably, SC3 (Kiselev et al., 2017) applied PCA to transform the distance matrices used as the input of consensus clustering; Shalek et al. (2014) used PCA for single-cell RNA-seq data spanning several experimental conditions. In addition, some extended and improved PCA-based methods have been developed, including pcaReduce (Žurauskienė and Yau, 2016), which applied PCA iteratively to provide low-dimensional principal component representations; Usoskin et al. (2015) proposed an unbiased iterative PCA-based process to identify distinct large-scale expression data patterns. However, as a linear method, PCA cannot capture the nonlinear relationships between cells, and its performance is further limited by the high levels of dropout and noise in the data (Kiselev et al., 2019).
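The linear projection described above can be sketched with a singular value decomposition of the centered expression matrix. This is a minimal illustration on random toy counts, not the pipeline of any tool cited here; `pca_project` and the toy data are assumptions made for the example.

```python
import numpy as np

def pca_project(X, n_components=2):
    """Project the rows of X (cells x genes) onto the top principal components."""
    X_centered = X - X.mean(axis=0)  # center each gene
    # Rows of Vt are the principal axes, ordered by decreasing singular value.
    U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
    components = Vt[:n_components]
    scores = X_centered @ components.T  # shape: cells x n_components
    explained_var = (S ** 2) / (X.shape[0] - 1)
    return scores, components, explained_var[:n_components]

rng = np.random.default_rng(0)
# Toy data: 100 cells x 50 genes of Poisson counts, log-transformed.
expr = rng.poisson(lam=5.0, size=(100, 50)).astype(float)
log_expr = np.log1p(expr)
scores, components, var = pca_project(log_expr, n_components=3)
```

The returned `scores` are mutually uncorrelated and the `components` are orthonormal, which is the sense in which PCA replaces correlated genes with orthogonal variables.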

2.3.2. t-SNE

t-distributed Stochastic Neighbor Embedding (t-SNE) is the most commonly used nonlinear dimension reduction method, and it can uncover the relationships between cells. t-SNE converts the similarities between data points into probabilities and minimizes the Kullback-Leibler divergence between the high-dimensional and low-dimensional distributions by gradient descent until convergence. In single-cell RNA-seq data analysis, t-SNE has become a cornerstone of dimension reduction and visualization for high-dimensional data (Linderman et al., 2019; Lin et al., 2017b; Butler et al., 2018; Haghverdi et al., 2018; Ntranos et al., 2016; Prabhakaran et al., 2016; Zeisel et al., 2015; Zhang et al., 2018; Li et al., 2017). In particular, Linderman et al. (2019) developed a fast interpolation-based t-SNE that dramatically accelerates the processing and visualization of rare cell populations for large datasets. Nonetheless, t-SNE has limitations: its loss function is non-convex, so different runs can converge to different local optima, and its parameters need to be tuned.
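To make the objective concrete, the sketch below evaluates the Kullback-Leibler divergence that t-SNE minimizes for a given candidate embedding. It is deliberately simplified: it uses one shared Gaussian bandwidth `sigma` instead of the per-point bandwidths that t-SNE derives from a perplexity setting, and it performs no gradient descent. The names `tsne_kl` and `pairwise_sq_dists` are hypothetical.

```python
import numpy as np

def pairwise_sq_dists(Z):
    """Squared Euclidean distances between all rows of Z."""
    sq = np.sum(Z ** 2, axis=1)
    D = sq[:, None] + sq[None, :] - 2.0 * (Z @ Z.T)
    return np.maximum(D, 0.0)  # clip tiny negatives caused by rounding

def tsne_kl(X, Y, sigma=2.0):
    """KL(P || Q) between data X and a candidate low-dimensional embedding Y.

    Simplification (assumption): a single shared bandwidth sigma and a joint
    normalization replace the perplexity-based calibration of real t-SNE.
    """
    n = X.shape[0]
    off_diag = ~np.eye(n, dtype=bool)
    # High-dimensional similarities: Gaussian kernel.
    P = np.exp(-pairwise_sq_dists(X) / (2.0 * sigma ** 2))
    np.fill_diagonal(P, 0.0)
    P /= P.sum()
    # Low-dimensional similarities: Student-t kernel with one degree of freedom.
    Q = 1.0 / (1.0 + pairwise_sq_dists(Y))
    np.fill_diagonal(Q, 0.0)
    Q /= Q.sum()
    p, q = P[off_diag], Q[off_diag]
    mask = p > 0  # terms with p = 0 contribute nothing
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 10))                     # toy data: 30 cells, 10 features
kl_random = tsne_kl(X, rng.normal(size=(30, 2)))  # unrelated random embedding
kl_first_two = tsne_kl(X, X[:, :2])               # embedding from the first two features
```

Because the loss is non-convex in the embedding coordinates, gradient descent started from different initial embeddings can settle at different values of this quantity, which is the limitation noted above.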

2.3.3. Deep learning models

In recent years, deep learning models (neural networks and variational auto-encoders) have shown superior performance in interpreting complex high-dimensional data. SCNN (Lin et al., 2017a) tested various neural network architectures and incorporated prior biological knowledge to obtain a reduced-dimension representation of single-cell expression data. SCVIS (Ding et al., 2018) and VASC (Wang and Gu, 2018) are both based on variational auto-encoders, which can capture nonlinear relationships between cells and visualize the low-dimensional embedding of single-cell gene expression data. To date, these methods have demonstrated strong interpretability and compatibility on high-dimensional single-cell RNA-seq data.

2.3.4. Other methods

In addition, there are other dimension reduction methods. CIDR (Lin et al., 2017b) applied principal coordinate analysis, which preserves in the low-dimensional space the distance information of the high-dimensional space. Seurat (Butler et al., 2018) is a toolkit for the analysis of single-cell RNA sequencing data and provides many dimension reduction methods, such as PCA and t-SNE. Uniform Manifold Approximation and Projection (UMAP) (McInnes et al., 2018) is a widely used technique for dimension reduction. UMAP provides increased speed and better preservation of the global structure of high-dimensional datasets, and Becht et al. (2019) reported that it outperforms t-SNE.
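Principal coordinate analysis, mentioned above in connection with CIDR, can be sketched as classical multidimensional scaling on a distance matrix. The example assumes plain Euclidean distances between toy points; CIDR itself works from its own dissimilarity measure, which is not reproduced here, and the function name `pcoa` is hypothetical.

```python
import numpy as np

def pcoa(D, n_components=2):
    """Classical principal coordinate analysis from an (n x n) distance matrix D."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n   # centering matrix
    B = -0.5 * J @ (D ** 2) @ J           # double-centered Gram matrix
    eigvals, eigvecs = np.linalg.eigh(B)  # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:n_components]
    # Clip tiny negative eigenvalues that arise from rounding.
    lam = np.clip(eigvals[order], 0.0, None)
    return eigvecs[:, order] * np.sqrt(lam)

rng = np.random.default_rng(0)
points = rng.normal(size=(20, 2))  # toy 2-D points
D = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)
embedding = pcoa(D, n_components=2)
D_embedded = np.linalg.norm(embedding[:, None, :] - embedding[None, :, :], axis=2)
```

For distances that are exactly Euclidean, as in this toy case, the embedding reproduces the original pairwise distances up to rotation and reflection.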

3. Clustering methods for single-cell RNA-seq

Diverse types of clustering methods have been developed for detecting cell types from single-cell RNA-seq data. These methods can be roughly classified into four categories: 𝑘-means clustering, hierarchical clustering, community-detection-based clustering, and density-based clustering. We review several computational applications of these clustering methods along with their strengths and limitations. Table 1 gives an overview of the state-of-the-art clustering methods for single-cell RNA-seq data.

3.1. 𝑘-means clustering

𝑘-means clustering is the most popular clustering approach. It iteratively finds a predefined number 𝑘 of cluster centers (centroids) by minimizing the sum of the squared Euclidean distances between each cell and its closest centroid. In addition, it is suitable for large datasets since it scales linearly with the number of data points (Lloyd, 1982).
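The iteration described above can be sketched as a basic Lloyd-style loop that alternates between assigning cells to the nearest centroid and recomputing each centroid as the mean of its members. This is a minimal sketch on toy data, with random initialization from the data points as an assumption; the tools reviewed in this section add their own initialization and gene selection steps.

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Basic Lloyd-style k-means; returns labels, centroids, and within-cluster SSE."""
    rng = np.random.default_rng(seed)
    # Assumption: initialize centroids from k distinct random data points.
    centroids = X[rng.choice(X.shape[0], size=k, replace=False)].copy()
    for _ in range(n_iter):
        # Assignment step: squared Euclidean distance to every centroid.
        d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # Update step: move each centroid to the mean of its members.
        new_centroids = centroids.copy()
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:  # keep the old centroid if a cluster is empty
                new_centroids[j] = members.mean(axis=0)
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    # Recompute assignments and the objective for the final centroids.
    d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    labels = d2.argmin(axis=1)
    inertia = float(d2.min(axis=1).sum())
    return labels, centroids, inertia

rng = np.random.default_rng(1)
# Toy data: two well-separated groups of 50 "cells" in two dimensions.
group_a = rng.normal(loc=0.0, scale=0.5, size=(50, 2))
group_b = rng.normal(loc=5.0, scale=0.5, size=(50, 2))
X = np.vstack([group_a, group_b])
labels, centroids, inertia = kmeans(X, k=2)
```

Each iteration costs on the order of n × k distance evaluations for n cells, which is where the linear scaling in the number of data points comes from.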

Several clustering tools based on 𝑘-means have been developed for interpreting single-cell RNA-seq data. SAIC (Yang et al., 2017) utilized iterative 𝑘-means clustering to identify the optimal subset of signature genes that separates single cells into distinct clusters. pcaReduce (Žurauskienė and Yau, 2016) is a hierarchical clustering method, but it relies on 𝑘-means results as its initial clusters. RaceID (Grün et al., 2015) applied 𝑘-means to unravel the heterogeneity of rare intestinal cell types (Tibshirani et al., 2001).

However, 𝑘-means clustering is a greedy algorithm that may fail to find the global optimum, and the predefined number of clusters 𝑘 can affect the clustering results. Another disadvantage is its sensitivity to outliers: because it tends to identify globular clusters, it can fail to detect rare cell types.

To overcome the above drawbacks, SC3 (Kiselev et al., 2017) integrated individual 𝑘-means clustering results obtained with different initial conditions into consensus clusters. RaceID2

S. Zhang et al. Page 3 of 12
