The cell can be considered the fundamental unit in
biology. For centuries, biologists have known that
multicellular organisms are characterized by a plethora of
distinct cell types. Although the notion of a cell type is
intuitively clear, a consistent and rigorous definition has
remained elusive. Cells can be distinguished by their
size and shape using a microscope, and attributes based
on their physical appearance have traditionally been
the primary determinant of cell type. Later, discoveries
in molecular biology made it possible to characterize
cell types on the basis of the presence or absence of
surface proteins. However, surface proteins represent
only a small fraction of the proteome, and it is likely
that important differences are not manifested at the
cell membrane.
Advances in microfluidics have made it possible to
isolate a large number of cells, and along with improvements
in RNA isolation and amplification methods, it is
now possible to profile the transcriptome of individual
cells using next-generation sequencing technologies.
Technological developments have advanced at a breathtaking
speed. The first single-cell RNA sequencing
(scRNA-seq) experiment was published in 2009, and the
authors profiled only eight cells¹. Only 7 years later, 10X
Genomics released a data set of more than 1.3 million
cells². Thus, we are now in an era where large volumes
of scRNA-seq data make it possible to provide detailed
catalogues of the cells found in a sample.
For researchers to be able to take full advantage of
these rich data sets, efficient computational methods are
required. There are several steps involved in the computational
analysis of scRNA-seq data, including quality
control, mapping, quantification, normalization, clustering,
finding trajectories and identifying differentially
expressed genes (FIG. 1). The steps upstream of clustering
may have a substantial impact on the outcome, and for
each step numerous tools are available. Moreover, several
software packages implement the entire clustering
workflow, for example, Seurat³, scanpy⁴ and SINCERA⁵.
We encourage the reader to consult recently published
overviews of this workflow⁶⁻¹⁰, as this Review focuses on
clustering alone. As clustering is the key step in defining
cell types based on the transcriptome, one must carefully
consider both the computational and biological aspects.
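To make the dependence of clustering on its upstream steps concrete, the following minimal sketch chains a library-size normalization and log transform into a plain k-means clustering. It is an illustration under stated assumptions, not the method used by Seurat, scanpy or SINCERA: the count matrix `TOY_COUNTS` is hypothetical, the scale factor of 10,000 is a common convention chosen here for illustration, k-means stands in for any clustering algorithm, and the deterministic farthest-point seeding is an assumption made so that the result is reproducible.

```python
import math

# Hypothetical toy count matrix (rows = cells, columns = genes); not real data.
TOY_COUNTS = [
    [90, 80, 5, 3],
    [70, 95, 2, 6],
    [40, 35, 2, 1],   # similar profile to the first cell, at lower sequencing depth
    [4, 2, 88, 75],
    [6, 3, 60, 92],
    [2, 1, 45, 40],
]


def normalize(counts, scale=10_000):
    """Scale every cell to the same total count, then apply log(1 + x)."""
    normalized = []
    for cell in counts:
        total = sum(cell)
        if total == 0:
            raise ValueError("cell with zero counts; remove it during quality control")
        normalized.append([math.log1p(scale * c / total) for c in cell])
    return normalized


def squared_distance(a, b):
    """Squared Euclidean distance between two expression vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def farthest_point_init(points, k):
    """Deterministic seeding (an assumption of this sketch): start at the first
    point, then repeatedly add the point farthest from the chosen centroids."""
    centroids = [list(points[0])]
    while len(centroids) < k:
        farthest = max(
            points, key=lambda p: min(squared_distance(p, c) for c in centroids)
        )
        centroids.append(list(farthest))
    return centroids


def kmeans(points, k, max_iter=100):
    """Plain k-means: alternate assignment and centroid update until stable."""
    centroids = farthest_point_init(points, k)
    labels = None
    for _ in range(max_iter):
        new_labels = [
            min(range(k), key=lambda j: squared_distance(p, centroids[j]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments no longer change
        labels = new_labels
        for j in range(k):
            members = [p for p, label in zip(points, labels) if label == j]
            if members:  # keep the old centroid if a cluster is empty
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def cluster_cells(counts, k):
    """Normalize raw counts, then cluster the cells into k groups."""
    return kmeans(normalize(counts), k)


if __name__ == "__main__":
    print(cluster_cells(TOY_COUNTS, k=2))
```

On this toy matrix the first three cells and the last three cells end up in separate clusters. Note that the number of clusters `k` has to be supplied by the user here, which already hints at one of the difficulties of unsupervised clustering.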
The ability to define cell types through unsupervised
clustering on the basis of transcriptome similarity has
emerged as one of the most powerful applications of
scRNA-seq. Broadly speaking, the goal of clustering is
to discover the natural groupings of a set of objects¹¹.
Defining cell types on the basis of the transcriptome
is attractive because it provides a data-driven, coherent
and unbiased approach that can be applied to any
sample. This opportunity has spurred the creation of
several atlas projects¹²⁻¹⁷, most notably the Human Cell
Atlas¹⁸. These atlas projects aim to build comprehensive
references for all cell types present in an organism or
tissue at various stages of development. In addition to
providing a deeper understanding of the basic biology,
atlases will also be useful as references for disease studies.
For a cell atlas to be of practical use, developing reliable
methods for unsupervised clustering of the cells will be one
of the key computational challenges.
Although considerable progress has been made in
terms of clustering algorithms over the past few years,
a number of questions remain unanswered. In particular,
there is no strong consensus about which approach is best
or how cell types can be defined based on
scRNA-seq data. In this Review, we discuss several
computational and biological aspects related to clustering.
We first discuss the types of available clustering methods
and when it is appropriate to use them, because one of
the underlying assumptions is that discrete clusters are
present in the data. Next, we outline why unsupervised
clustering is a difficult problem and what considerations
need to be taken into account from both experimental and
computational points of view. We then discuss the challenges
Unsupervised clustering: The process of grouping
objects based on similarity but without any ground truth or
labelled training data.
Challenges in unsupervised clustering
of single-cell RNA-seq data
Vladimir Yu Kiselev, Tallulah S. Andrews and Martin Hemberg*
Abstract | Single-cell RNA sequencing (scRNA-seq) allows researchers to collect large catalogues
detailing the transcriptomes of individual cells. Unsupervised clustering is of central importance
for the analysis of these data, as it is used to identify putative cell types. However, there are many
challenges involved. We discuss why clustering is a challenging problem from a computational
point of view and what aspects of the data make it challenging. We also consider the difficulties
related to the biological interpretation and annotation of the identified clusters.
Wellcome Sanger Institute,
Wellcome Genome Campus,
Hinxton, UK.
*e-mail: mh26@sanger.ac.uk
https://doi.org/10.1038/s41576-018-0088-9
SINGLE-CELL OMICS
Corrected: Publisher Correction
Nature Reviews Genetics | volume 20 | May 2019 | 273