Single-cell RNA-seq technology: new challenges in unsupervised clustering and an in-depth analysis


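"Challenges in unsupervised clustering of single-cell RNA-seq data" is a bioinformatics review that focuses on single-cell sequencing, specifically single-cell RNA sequencing (scRNA-seq), and its use in biological research. The cell is regarded as the fundamental unit of biology, and a multicellular organism is made up of many distinct cell types. Although the notion of a cell type is intuitive, a consistent definition has long been elusive. Traditional classification relied on morphology observed under the microscope, such as size and shape, which captures physical appearance but not deeper functional differences. Advances in molecular biology later made it possible to characterize cell types by the presence or absence of surface proteins, giving a finer, molecular-level classification. However, surface proteins are only a small fraction of the proteome, and many important differences may not be manifested at the cell membrane.

More recently, rapid progress in microfluidics and in RNA isolation and amplification has made it possible to profile the transcriptomes of individual cells at high throughput using next-generation sequencing (NGS). This technology offers unprecedented resolution, reveals gene expression patterns at the level of single cells, and has greatly promoted the use of unsupervised clustering in single-cell data analysis.

The paper concentrates on the challenges that scRNA-seq data sets pose for unsupervised clustering. Unsupervised clustering groups similar cells automatically, without any predefined class labels. Because single-cell data are complex (highly heterogeneous, noisy, subject to technical error, and rich in low-abundance genes), the following challenges are especially prominent in practice:

1. **Heterogeneity**: each cell may express many genes at once, and distinct subtypes may exist within a single cell type, which makes clustering harder and raises the accuracy required.
2. **Noise and low-abundance genes**: because of the limited sensitivity of sequencing, genes expressed at very low levels may be missed or measured unreliably, leading to false negatives or false positives. Inconsistencies in experimental technique can introduce further noise.
3. **Dimensionality**: scRNA-seq data sets contain expression values for a large number of genes, and clustering algorithms operating in such high-dimensional spaces may get stuck in local optima and fail to find the global best solution.
4. **Biostatistical issues**: keeping the analysis biologically sound while handling large, high-dimensional data, for example interpreting the biological meaning of clusters correctly, is another key challenge.
5. **Method selection**: the many available methods, including clustering algorithms and the dimensionality-reduction techniques often applied before them (such as PCA, t-SNE and UMAP), each have strengths and weaknesses on single-cell data, and choosing a suitable method to extract the most valuable biological information remains an open research question.

The paper examines the complexity of scRNA-seq data in unsupervised clustering and the challenges involved, gives bioinformatics researchers a deeper view of this technology, and points to directions for future methodological and technical improvement. For anyone working in the life sciences, bioinformatics or single-cell biology, understanding these challenges is essential, because they directly affect the reliability of single-cell analysis results and the accuracy of their biological interpretation.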
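The point about noise and low-abundance genes can be made concrete with a small sketch. It measures how sparse a toy count matrix is and drops genes detected in too few cells before any clustering is attempted. The matrix, the gene names and the `min_cells` threshold are all hypothetical values chosen for illustration; this is one simple filtering heuristic, not a procedure taken from the paper.

```python
from typing import List, Tuple


def zero_fraction(counts: List[List[int]]) -> float:
    """Fraction of entries in a cells x genes count matrix that are zero."""
    total = sum(len(row) for row in counts)
    zeros = sum(1 for row in counts for value in row if value == 0)
    return zeros / total


def filter_genes(
    counts: List[List[int]], gene_names: List[str], min_cells: int = 2
) -> Tuple[List[List[int]], List[str]]:
    """Keep only genes with a non-zero count in at least `min_cells` cells."""
    n_genes = len(gene_names)
    # Number of cells in which each gene is detected (count > 0).
    detected = [sum(1 for row in counts if row[g] > 0) for g in range(n_genes)]
    keep = [g for g in range(n_genes) if detected[g] >= min_cells]
    kept_names = [gene_names[g] for g in keep]
    filtered = [[row[g] for g in keep] for row in counts]
    return filtered, kept_names


# Hypothetical toy data: 4 cells (rows) x 4 genes (columns).
toy_counts = [
    [5, 0, 0, 1],
    [3, 0, 2, 0],
    [0, 0, 4, 0],
    [7, 1, 0, 0],
]
toy_genes = ["geneA", "geneB", "geneC", "geneD"]

sparsity = zero_fraction(toy_counts)
filtered_counts, kept_genes = filter_genes(toy_counts, toy_genes, min_cells=2)
print(f"zero fraction: {sparsity:.4f}")
print("genes kept:", kept_genes)
```

In this toy matrix more than half of the entries are zero, and two of the four genes are seen in only one cell each. Real data sets are much larger, and a sensible threshold depends on the protocol and the number of cells.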

The cell can be considered the fundamental unit in biology. For centuries, biologists have known that multicellular organisms are characterized by a plethora of distinct cell types. Although the notion of a cell type is intuitively clear, a consistent and rigorous definition has remained elusive. Cells can be distinguished by their size and shape using a microscope, and attributes based on their physical appearance have traditionally been the primary determinant of cell type. Later, discoveries in molecular biology made it possible to characterize cell types on the basis of the presence or absence of surface proteins. However, surface proteins represent only a small fraction of the proteome, and it is likely that important differences are not manifested at the cell membrane.

Advances in microfluidics have made it possible to isolate a large number of cells, and along with improvements in RNA isolation and amplification methods, it is now possible to profile the transcriptome of individual cells using next-generation sequencing technologies. Technological developments have advanced at a breathtaking speed. The first single-cell RNA sequencing (scRNA-seq) experiment was published in 2009, and the authors profiled only eight cells. Only 7 years later, 10X Genomics released a data set of more than 1.3 million cells. Thus, we are now in an era where large volumes of scRNA-seq data make it possible to provide detailed catalogues of the cells found in a sample.

For researchers to be able to take full advantage of these rich data sets, efficient computational methods are required. There are several steps involved in the computational analysis of scRNA-seq data, including quality control, mapping, quantification, normalization, clustering, finding trajectories and identifying differentially expressed genes (Fig. 1). The steps upstream of clustering may have a substantial impact on the outcome, and for each step numerous tools are available. Moreover, there are also software packages that implement the entire clustering workflow, for example, Seurat, scanpy and SINCERA. We encourage the reader to consult recently published overviews of this workflow (refs 6–10), as this Review focuses on clustering alone. As clustering is the key step in defining cell types based on the transcriptome, one must carefully consider both the computational and biological aspects.
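To illustrate why steps upstream of clustering matter, here is a minimal sketch of one widely used normalization scheme: scale each cell's counts to a common total, then apply a log(1 + x) transform. The `target_sum` of 10,000 and the toy matrix are assumptions made for this example, and the function is a plain-Python illustration rather than the implementation used by any of the packages named above.

```python
import math
from typing import List


def normalize_log1p(
    counts: List[List[float]], target_sum: float = 10_000.0
) -> List[List[float]]:
    """Scale each cell (row) to `target_sum` total counts, then take log(1 + x)."""
    normalized = []
    for row in counts:
        total = sum(row)
        if total == 0:
            # A cell with no counts at all stays all-zero.
            normalized.append([0.0 for _ in row])
            continue
        scale = target_sum / total
        normalized.append([math.log1p(value * scale) for value in row])
    return normalized


# Hypothetical toy data: two cells with identical expression proportions
# but a tenfold difference in sequencing depth.
toy_counts = [
    [10, 0, 90],
    [1, 0, 9],
]
norm = normalize_log1p(toy_counts)
for row in norm:
    print([round(value, 3) for value in row])
```

After normalization the two cells have the same profile, so a distance-based clustering method would no longer separate them on sequencing depth alone. Whether this particular scheme is appropriate depends on the data; it is shown only as an example of an upstream choice that changes what the clustering step sees.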

The ability to define cell types through unsupervised clustering on the basis of transcriptome similarity has emerged as one of the most powerful applications of scRNA-seq. Broadly speaking, the goal of clustering is to discover the natural groupings of a set of objects. Defining cell types on the basis of the transcriptome is attractive because it provides a data-driven, coherent and unbiased approach that can be applied to any sample. This opportunity has spurred the creation of several atlas projects (refs 12–17), most notably the Human Cell Atlas. These atlas projects aim to build comprehensive references for all cell types present in an organism or tissue at various stages of development. In addition to providing a deeper understanding of the basic biology, atlases will also be useful as references for disease studies. For a cell atlas to be of practical use, reliable methods for unsupervised clustering of the cells will be one of the key computational challenges.
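As a concrete picture of what "discovering natural groupings" means, the sketch below runs k-means (Lloyd's algorithm) on a few two-dimensional toy points standing in for cells. k-means is used here only because it is short and familiar; it is not presented as the method recommended by the Review. The data, the choice of two clusters and the starting centroids are all assumptions made for the example.

```python
from typing import List, Sequence, Tuple


def squared_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two expression vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(
    points: List[List[float]],
    init_centroids: List[List[float]],
    max_iter: int = 100,
) -> Tuple[List[int], List[List[float]]]:
    """Lloyd's algorithm: alternate assignment and centroid update until stable."""
    centroids = [list(c) for c in init_centroids]
    labels = [-1] * len(points)
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest centroid.
        new_labels = [
            min(range(len(centroids)), key=lambda k: squared_distance(p, centroids[k]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments did not change, so the algorithm has converged
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for k in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:
                centroids[k] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


# Hypothetical toy "cells": two features, two visually obvious groups.
cells = [
    [1.0, 0.1], [0.9, 0.0], [1.1, 0.2],
    [0.0, 1.0], [0.1, 0.9], [0.2, 1.1],
]
labels, centroids = kmeans(cells, init_centroids=[cells[0], cells[3]])
print("labels:", labels)
```

No cell-type labels are supplied; the groups come only from distances between the vectors. The number of clusters and the starting centroids still have to be chosen by the analyst, and the result depends on both.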

Although considerable progress has been made in terms of clustering algorithms over the past few years, a number of questions remain unanswered. In particular, there is no strong consensus about what is the best approach or how cell types can be defined based on scRNA-seq data. In this Review, we discuss several computational and biological aspects related to clustering. We first discuss the types of available clustering methods and when it is appropriate to use them, because one of the underlying assumptions is that discrete clusters are present in the data. Next, we outline why unsupervised clustering is a difficult problem and what considerations need to be taken from both experimental and computational points of view. We then discuss the challenges

Unsupervised clustering
The process of grouping objects based on similarity but without any ground truth or labelled training data.

Challenges in unsupervised clustering of single-cell RNA-seq data

Vladimir Yu Kiselev, Tallulah S. Andrews and Martin Hemberg*

Abstract

Single-cell RNA sequencing (scRNA-seq) allows researchers to collect large catalogues detailing the transcriptomes of individual cells. Unsupervised clustering is of central importance for the analysis of these data, as it is used to identify putative cell types. However, there are many challenges involved. We discuss why clustering is a challenging problem from a computational point of view and what aspects of the data make it challenging. We also consider the difficulties related to the biological interpretation and annotation of the identified clusters.

Wellcome Sanger Institute, Wellcome Genome Campus, Hinxton, UK.

*e-mail: mh26@sanger.ac.uk

https://doi.org/10.1038/s41576-018-0088-9

SINGLE-CELL OMICS

Corrected: Publisher Correction

Nature Reviews Genetics, Reviews, volume 20, May 2019, page 273
