DBH: A de Bruijn graph heuristic for clustering large-scale 16S rRNA sequences into OTUs

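"DBH: A de Bruijn graph-based heuristic method for clustering large-scale 16S rRNA sequences into OTUs". This research paper describes DBH (de Bruijn Graph-based Heuristic method), an algorithm that uses de Bruijn graphs to cluster large-scale 16S rRNA sequence data.

16S rRNA is a key molecular marker for classification and phylogenetic analysis in microbiology, because it combines conserved and variable regions across microbial species. With the rapid development of high-throughput sequencing, 16S rRNA sequences have accumulated in large numbers, and clustering them effectively has become a key step in analyzing microbial community data. Operational Taxonomic Units (OTUs) represent groups of similar sequences and are usually defined by a sequence similarity threshold. OTU clustering identifies and groups highly similar 16S rRNA sequences in order to reveal the structure and composition of a microbial community.

Many heuristic methods with low computational complexity have been proposed for inferring OTUs, but they usually select only one sequence as the seed of each cluster, which may fail to capture the relationships and diversity among the sequences. DBH addresses this by introducing a de Bruijn graph. A de Bruijn graph is a graph built by connecting short overlapping subsequences (k-mers) and is used to represent the structure of longer sequences. In DBH, the 16S rRNA sequences are split into k-mers, which define the edges of the de Bruijn graph. The graph exposes overlaps and similarities among sequences instead of relying on a single seed sequence. By traversing and analyzing the graph, DBH finds and merges related groups of sequences to form OTUs.

According to the paper, the strengths of DBH are efficiency and accuracy: compared with existing clustering algorithms, it keeps computational cost low on large datasets while maintaining high clustering accuracy. This makes it well suited to very large 16S rRNA datasets, especially in microbiome and metagenomics studies. The summary also states that DBH takes sequence variability and phylogenetic information into account, which helps in understanding the diversity and evolutionary relationships of microbial communities.

In short, DBH is a de Bruijn graph-based heuristic that offers a new approach to large-scale 16S rRNA sequence clustering and a useful tool for analyzing complex microbial ecosystems.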

Journal of Theoretical Biology 425 (2017) 80–87

journal homepage: www.elsevier.com/locate/jtbi

DBH: A de Bruijn graph-based heuristic method for clustering large-scale 16S rRNA sequences into OTUs

Ze-Gang Wei, Shao-Wu Zhang ∗

Key Laboratory of Information Fusion Technology of Ministry of Education, College of Automation, Northwestern Polytechnical University, Xi’an 710072, China

Article info

Article history: Received 10 November 2016; revised 28 March 2017; accepted 20 April 2017; available online 26 April 2017

Keywords: de Bruijn graph; Clustering; Operational taxonomic units; 16S rRNA; Metagenomic

Abstract

Recent sequencing revolution driven by high-throughput technologies has led to rapid accumulation of 16S rRNA sequences for microbial communities. Clustering short sequences into operational taxonomic units (OTUs) is an initial crucial process in analyzing metagenomic data. Although many heuristic methods have been proposed for OTU inferences with low computational complexity, they just select one sequence as the seed for each cluster and the results are sensitive to the selected sequences that represent the clusters. To address this issue, we present a de Bruijn graph-based heuristic clustering method (DBH) for clustering massive 16S rRNA sequences into OTUs by introducing a novel seed selection strategy and greedy clustering approach. Compared with existing widely used methods on several simulated and real-life metagenomic datasets, the results show that DBH has higher clustering performance and low memory usage, facilitating the overestimation of OTUs number. DBH is more effective to handle large-scale metagenomic datasets. The DBH software can be freely downloaded from https://github.com/nwpu134/DBH.git for academic users.
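As background for the term "de Bruijn graph", the sketch below builds one from a handful of reads: each k-mer contributes an edge from its (k-1)-length prefix to its (k-1)-length suffix, and edge weights count how often the k-mer occurs. This is the generic textbook construction only. The function name `build_de_bruijn`, the toy reads, and the choice of k are illustrative assumptions, and the code does not reproduce the seed selection strategy of DBH, which this excerpt does not describe.

```python
from collections import defaultdict


def build_de_bruijn(sequences, k):
    """Build a weighted de Bruijn graph from the k-mers of all sequences.

    Nodes are (k-1)-mers; every k-mer adds an edge prefix -> suffix.
    Returns a plain dict: {prefix: {suffix: k-mer count}}.
    (Hypothetical helper for illustration, not the DBH implementation.)
    """
    graph = defaultdict(lambda: defaultdict(int))
    for seq in sequences:
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            graph[kmer[:-1]][kmer[1:]] += 1
    # Convert to plain dicts so later lookups cannot create empty nodes.
    return {node: dict(successors) for node, successors in graph.items()}


# Two toy reads that overlap heavily (assumed data, not from the paper).
reads = ["ACGTAC", "CGTACG"]
graph = build_de_bruijn(reads, k=3)

for node in sorted(graph):
    print(node, "->", graph[node])
```

Because the two toy reads share all of their 3-mers, every edge ends up with weight 2, which is the sense in which the graph exposes overlap between reads.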

1. Introduction

Metagenomics is a recently born field that studies the genomic content of microbial communities. A number of recent large-scale studies have taken advantage of metagenomics to understand microbial community structure and function, including the MetaHIT project (Qin et al., 2010) and the Human Microbiome Project (Consortium, 2012). Many of these projects assess microbial communities by sequencing the 16S ribosomal RNA (rRNA) marker genes. With the development of next-generation sequencing technology, the amount of genetic data is growing from a few tens of thousands to several million reads (i.e., short sequences), faster than the rate at which it can be analyzed (Caporaso et al., 2010).

∗ Corresponding author. E-mail addresses: david_nwpu@163.com (Z.-G. Wei), zhangsw@nwpu.edu.cn (S.-W. Zhang).

An essential first step in handling these large-scale data is to cluster them into meaningful operational taxonomic units (OTUs), that is, clusters of similar reads that relate to taxonomic lineages in the sample (Wei et al., 2016). Traditionally, hierarchical clustering algorithms implemented in MOTHUR (Schloss et al., 2009), ESPRIT (Sun et al., 2009), HPC-CLUST (Rodrigues and von Mering, 2014) and mcClust (Cole et al., 2013) have been widely used for detecting clusters. These methods need a pairwise distance matrix derived from either pairwise sequence alignment or multiple sequence alignment, which gives them high time and space complexity on large-scale data sets. Many heuristic approaches were developed to decrease the computational demand, such as CD-HIT (Li and Godzik, 2006), Uclust (Edgar, 2010), DySC (Zheng et al., 2012), ESPRIT-Tree (Cai and Sun, 2011) and MSClust (Chen et al., 2013). These methods first select an input sequence as a seed to form the initial cluster, then process each remaining input sequence in turn. If the distance between the query sequence and the representative sequence of an existing cluster is within a pre-defined threshold, the query sequence is added to that cluster; otherwise a new cluster is created and the query sequence is stored as its seed. This procedure is repeated until all sequences are assigned. Although these tools are scalable and efficient, they generally produce clusters of lower quality than hierarchical clustering.
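The greedy seed-based procedure described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the code of CD-HIT, Uclust or any other tool named here: the distance is a toy mismatch fraction over equal-length strings (real tools use alignment or k-mer based measures), and the names `mismatch_fraction` and `greedy_cluster`, the toy reads, and the 0.15 threshold are all hypothetical.

```python
def mismatch_fraction(a, b):
    """Toy distance: fraction of positions that differ.

    Assumes equal-length sequences and performs no alignment.
    """
    if len(a) != len(b):
        raise ValueError("toy distance requires equal-length sequences")
    return sum(x != y for x, y in zip(a, b)) / len(a)


def greedy_cluster(sequences, threshold, distance=mismatch_fraction):
    """Assign each sequence to the first cluster whose seed is within threshold.

    A sequence that matches no existing seed starts a new cluster and
    becomes that cluster's seed. Returns a list of clusters (lists of sequences).
    """
    seeds = []
    clusters = []
    for seq in sequences:
        for idx, seed in enumerate(seeds):
            if distance(seq, seed) <= threshold:
                clusters[idx].append(seq)
                break
        else:
            # No seed was close enough: open a new cluster.
            seeds.append(seq)
            clusters.append([seq])
    return clusters


# The same four toy reads in two different input orders (assumed data).
order_a = ["AAAAAAAAAA", "AAAAAAAAAT", "CCCCCCCCCC", "AAAAAAAATT"]
order_b = ["AAAAAAAAAT", "AAAAAAAAAA", "CCCCCCCCCC", "AAAAAAAATT"]

clusters_a = greedy_cluster(order_a, threshold=0.15)
clusters_b = greedy_cluster(order_b, threshold=0.15)
print(len(clusters_a), clusters_a)
print(len(clusters_b), clusters_b)
```

With these toy reads the first order yields three clusters and the second yields two, because a different sequence becomes the first seed. The example shows only that a single-seed greedy scheme depends on which sequences end up as seeds.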

Unlike hierarchical and heuristic clustering methods, which choose a distance threshold (e.g., 3% or 5%) to define OTUs at different taxonomic levels, several model-based approaches have been proposed, such as CROP (Hao et al., 2011), BEBaC (Cheng et al., 2012), M-pick (Wang et al., 2013) and MtHc (Wei and Zhang, 2015). CROP (Hao et al., 2011) builds a Bayesian model with a Gaussian mixture model and a birth-death process to cluster a set of sequences. It uses a lower bound and an upper bound, which can be transformed to a cutoff to avoid the use of a constant threshold. BEBaC (Cheng et al., 2012) first adopts heuristics to assign highly similar sequences to a pre-group, then searches for

http://dx.doi.org/10.1016/j.jtbi.2017.04.019
