Journal of Theoretical Biology 425 (2017) 80–87
Contents lists available at ScienceDirect
Journal of Theoretical Biology
journal homepage: www.elsevier.com/locate/jtbi
DBH: A de Bruijn graph-based heuristic method for clustering
large-scale 16S rRNA sequences into OTUs
Ze-Gang Wei, Shao-Wu Zhang
∗
Key Laboratory of Information Fusion Technology of Ministry of Education, College of Automation, Northwestern Polytechnical University, Xi’an 710072,
China
a r t i c l e i n f o
Article history:
Received 10 November 2016
Revised 28 March 2017
Accepted 20 April 2017
Available online 26 April 2017
Keywords:
de Bruijn graph
Clustering
Operational taxonomic units
16S rRNA
Metagenomic
a b s t r a c t
Recent sequencing revolution driven by high-throughput technologies has led to rapid accumulation
of 16S rRNA sequences for microbial communities. Clustering short sequences into operational taxo-
nomic units (OTUs) is an initial crucial process in analyzing metagenomic data. Although many heuris-
tic methods have been proposed for OTU inferences with low computational complexity, they just se-
lect one sequence as the seed for each cluster and the results are sensitive to the selected sequences
that represent the clusters. To address this issue, we present a de Bruijn graph-based heuristic cluster-
ing method (DBH) for clustering massive 16S rRNA sequences into OTUs by introducing a novel seed
selection strategy and greedy clustering approach. Compared with existing widely used methods on
several simulated and real-life metagenomic datasets, the results show that DBH has higher clustering
performance and low memory usage, facilitating the overestimation of OTUs number. DBH is more ef-
fective to handle large-scale metagenomic datasets. The DBH software can be freely downloaded from
https://github.com/nwpu134/DBH.git for academic users.
© 2017 Elsevier Ltd. All rights reserved.
1. Introduction
Metagenomics is a recently-born field that studies the genomic
content of microbial communities. A number of recent large-
scale researches have taken advantage of metagenomics to under-
stand microbial community structure and function, including the
MetaHIT project ( Qin et al., 2010 ) and the Human Microbiome
Project ( Consortium, 2012 ). Many of these projects assess microbial
communities by sequencing the 16S ribosomal RNA (rRNA) marker
genes. With the development of the next-generation sequencing
technology, the amount of genetic data is growing from a few tens
of thousands to several million reads (i.e., short sequences), faster
than the rate at which it can be analyzed ( Caporaso et al., 2010 ).
An essential first step in handling these large scale data is
to cluster them into the meaningful operational taxonomic units
(OTUs), that is, clusters of similar reads that are relative to tax-
onomic lineage in the sample ( Wei et al., 2016 ). Traditionally, hi-
erarchical clustering algorithms implemented in MOTHUR ( Schloss
et al., 2009 ), ESPRIT ( Sun et al., 2009 ), HPC-CLUST ( Rodrigues and
von Mering, 2014 ) and mcClust ( Cole et al., 2013 ) have been widely
used for detecting clusters. These methods need a pairwise dis-
tance matrix that is derived either from pairwise sequence align-
∗
Corresponding author.
E-mail addresses: david_nwpu@163.com (Z.-G. Wei), zhangsw@nwpu.edu.cn (S.-
W. Zhang).
ment or multiple sequence alignment, resulting in that they have
a high complexity in terms of both time and space for large-scale
data sets. Many heuristic approaches were developed to decrease
the computational demand, such as CD-HIT ( Li and Godzik, 2006 ),
Uclust ( Edgar, 2010 ), DySC ( Zheng et al., 2012 ), ESPRIT-Tree ( Cai
and Sun, 2011 ) and MSClust ( Chen et al., 2013 ). These methods
first select an input sequence as a seed to form the initial clus-
ter, then distinguish each input sequence sequentially. If the dis-
tance between the query sequence and representative sequences
in the existing clusters is within a pre-defined threshold, the input
sequence will be added to the corresponding cluster, otherwise a
new cluster is created and the query sequence is stored as the new
seed. This procedure is repeated until all sequences are assigned.
Although these tools are scalable and efficient, they generally pro-
duce clusters of lower quality than hierarchical clustering.
Different from both hierarchical and heuristic clustering meth-
ods that choose a distance threshold (e.g., 3% and 5%) to de-
fine OTUs at different taxonomic levels, several model-based ap-
proaches have been proposed, such as CROP ( Hao et al., 2011 ), BE-
BaC ( Cheng et al., 2012 ), M-pick ( Wang et al., 2013 ) and MtHc ( Wei
and Zhang, 2015 ). CROP ( Chen et al., 2013 ) builds a Bayesian model
with a Gaussian mixture model and a birth-death process to clus-
ter a set of sequences. It uses a lower bound and an upper bound,
which can be transformed to a cutoff to avoid the use of a constant
threshold. BEBaC ( Hao et al., 2011 ) first adopts heuristics to assign
the highly similar sequences to form a pre-group, then searches for
http://dx.doi.org/10.1016/j.jtbi.2017.04.019
0022-5193/© 2017 Elsevier Ltd. All rights reserved.