1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
DOI: 10.1002/minf.201600059
DMclust, a Density-based Modularity Method for Accurate
OTU Picking of 16S rRNA Sequences
Ze-Gang Wei,
[a]
Shao-Wu Zhang,*
[a]
and Yi-Zhai Zhang
[a]
Abstract: Clustering 16S rRNA sequences into operational
taxonomic units (OTUs) is a crucial step in analyzing
metagenomic data. Although many methods have been
developed, how to obtain an appropriate balance between
clustering accuracy and computational efficiency is still a
major challenge. A novel density-based modularity cluster-
ing method, called DMclust, is proposed in this paper to bin
16S rRNA sequences into OTUs with high clustering
accuracy. The DMclust algorithm consists of four main
phases. It first searches for the sequence dense group
defined as n-sequence community, in which the distance
between any two sequences is less than a threshold. Then
these dense groups are used to construct a weighted
network, where dense groups are viewed as nodes, each
pair of dense groups is connected by an edge, and the
distance of pairwise groups represents the weight of the
edge. Then, a modularity-based community detection
method is employed to generate the preclusters. Finally, the
remaining sequences are assigned to their nearest preclus-
ters to form OTUs. Compared with existing widely used
methods, the experimental results on several metagenomic
datasets show that DMclust has higher accurate clustering
performance with acceptable memory usage.
Keywords: modularity · clustering · OTUs · 16S rRNA · metagenomic
1 Introduction
The rapid development of next-generation sequencing
technology provides an excellent tool to explore the complex
microbial communities that contribute to many biological
processes in various environments. The high-throughput
sequencing technology can yield millions of DNA/RNA frag-
ments (or sequences) in a single run for the environmental
samples, and helps us uncover the microbial diversity and
better understand how these unknown microbes live and co-
exist with unprecedented resolution.
[1–3]
The 16S rRNA sequences are widely applied to infer the
phylogenetic relationships among microbial species, be-
cause the genomic variable regions (or reads) of 16S rRNA
can be used as the marker gene to obtain quick estimates
of taxonomic diversity.
[4,5]
The essential step in analyzing
these 16S rRNA sequence data to get the taxonomic
diversity profile of an environmental sample is to bin them
into operational taxonomic units (OTUs) that contain similar
reads.
[6,7]
The existing binning methods are usually catego-
rized into taxonomy-dependent methods and taxonomy-
independent methods. In taxonomy-dependent methods,
such as BLAST,
[8]
RDP classifier
[9]
and bioOTU,
[10]
individual
reads are taxonomically classified by comparing them with
the annotated sequences in reference databases. However,
most of reads originate from the genomes of unknown
organisms, and these reads cannot be mapped to the
known ‘taxonomic reference’ tree due to the lack of
genomic reference. In contrast, taxonomy-independent
methods bin or group reads in a given dataset into OTUs
based on their mutual similarity, and such methods do not
depend on any reference database.
[11, 12]
These methods can
be used to analyze these reads from unknown micro-
organisms, and they belong to the category of unsuper-
vised machine learning.
For taxonomy-independent methods, several different
clustering algorithms are developed to generate OTUs by
computing the pairwise sequence distance either with
multiple sequence alignment (MSA) or pairwise sequence
alignment (PSA). These algorithms can be further catego-
rized into hierarchical clustering, heuristic clustering, model-
based methods and network-based methods. Hierarchical
clustering methods such as DOTUR,
[13]
MOTHUR
[14]
and
ESPRIT
[15]
require a distance matrix between all reads to
construct a hierarchical tree, and then group the reads into
OTUs with a predetermined distance threshold. The overall
space and computational complexity of hierarchical cluster-
ing methods is O(N
2
), where N is the number of reads. In
order to address the time and memory bottleneck in
hierarchical clustering algorithms, heuristic clustering meth-
ods such as CD-HIT,
[16]
UPARSE,
[17]
UCLUST,
[18]
DNACLUST,
[19]
GramCluster,
[20]
ESPRIT-Tree
[21]
and MSClust
[22]
were devel-
oped by using a greedy clustering strategy. For each read,
[a] Z.-G. Wei, S.-W. Zhang, Y.-Z. Zhang
Key Laboratory of Information Fusion Technology of Ministry of
Education, School of Automation, Northwestern Polytechnical
University, Xi’an, 710072, China
Tel.:
+
86 02988431308
E-mail: zhangsw@nwpu.edu.cn
Supporting information for this article is available on the WWW
under https://doi.org/10.1002/minf.201600059
Full Paper www.molinf.com
© 2017 Wiley-VCH Verlag GmbH & Co. KGaA, Weinheim Mol. Inf. 2017, 36, 1600059 (1 of 9) 1600059