DMclust: A Density-Based Modularity Clustering Method for 16S rRNA Sequences

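The research paper "DMclust, a Density-based Modularity Method for Accurate OTU Picking of 16S rRNA Sequences" introduces DMclust, a new method for picking operational taxonomic units (OTUs) from large volumes of 16S rRNA sequence data. OTU picking is a key step in analyzing metagenomic data in microbiology: it groups similar microbial sequences together so that each group can be treated as one representative unit.

The 16S rRNA gene is a molecular marker widely used for microbial classification and community structure analysis. It is highly conserved across microorganisms, yet it contains variable regions that can distinguish different species or strains. Processing massive 16S rRNA datasets remains challenging, because a method has to balance clustering accuracy against computational efficiency.

The DMclust algorithm consists of four main phases:

1. Density search: the algorithm first looks for dense groups of sequences, called n-sequence communities, in which the distance between any two sequences is below a threshold. These high-density groups may correspond to specific microbial taxa.
2. Network construction: the dense groups are used to build a weighted network. Each dense group is a node, every pair of dense groups is joined by an edge, and the edge weight is the distance between the two groups.
3. Modularity-based community detection: a modularity-based method partitions the network into communities, so that nodes within a community are tightly connected and nodes in different communities are only weakly connected. These communities serve as preclusters. Because modularity is computed over the whole network, this step takes global structure into account.
4. Cluster generation: finally, the remaining sequences are assigned to their nearest preclusters, and each resulting cluster represents one OTU.

DMclust's strength lies in combining density-based search with modularity optimization, a strategy intended to handle large datasets while keeping accuracy high and limiting over-splitting and erroneous merging. Compared with traditional OTU picking approaches, such as distance-threshold methods (e.g., UPGMA, VSEARCH) and density-based methods (e.g., DBSCAN), DMclust may resolve community structure more precisely. The paper (DOI: 10.1002/minf.201600059) presents DMclust as a useful tool for microbiology research, particularly for large-scale data analysis aimed at understanding the complexity and diversity of microbial ecosystems.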
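The first two phases can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: `mismatch_distance` (a mismatch fraction on pre-aligned, equal-length reads), the greedy seeding order in `find_dense_groups`, the `min_size` parameter, and the mean inter-group distance used as the edge weight are all hypothetical choices that the source does not specify.

```python
from itertools import combinations


def mismatch_distance(a, b):
    """Fraction of differing positions between two pre-aligned, equal-length reads.

    Assumption: a simple stand-in for an alignment-based sequence distance.
    """
    if len(a) != len(b):
        raise ValueError("reads must be pre-aligned to the same length")
    return sum(x != y for x, y in zip(a, b)) / len(a)


def find_dense_groups(reads, threshold, min_size=2):
    """Greedily build groups whose members are all pairwise closer than `threshold`.

    Returns (groups, leftovers) as lists of read indices. The greedy seeding
    order and `min_size` are hypothetical choices made for this sketch.
    """
    unassigned = list(range(len(reads)))
    groups, leftovers = [], []
    while unassigned:
        seed, rest = unassigned[0], unassigned[1:]
        group = [seed]
        for idx in rest:
            # Admit a read only if it is close to every current member.
            if all(mismatch_distance(reads[idx], reads[m]) < threshold for m in group):
                group.append(idx)
        if len(group) >= min_size:
            groups.append(group)
        else:
            # Too small to count as a dense group: set the seed aside.
            leftovers.append(seed)
            group = [seed]
        unassigned = [i for i in unassigned if i not in group]
    return groups, leftovers


def build_group_network(reads, groups):
    """Complete weighted graph over dense groups.

    Edge weight = mean distance between members of the two groups
    (an assumed definition of inter-group distance).
    """
    edges = {}
    for (i, g1), (j, g2) in combinations(enumerate(groups), 2):
        dists = [mismatch_distance(reads[a], reads[b]) for a in g1 for b in g2]
        edges[(i, j)] = sum(dists) / len(dists)
    return edges


# Toy reads, invented for illustration only.
reads = ["ACGTACGT", "ACGTACGA", "ACGTACGT", "TTGTCCGA", "TTGTCCGT", "GGGGGGGG"]
groups, leftovers = find_dense_groups(reads, threshold=0.2)
edges = build_group_network(reads, groups)
print(groups)     # [[0, 1, 2], [3, 4]]
print(leftovers)  # [5]
print(edges)      # {(0, 1): 0.4375}
```

In this toy run, read 5 belongs to no dense group, so it would be one of the remaining sequences that the final phase assigns to its nearest precluster.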

DOI: 10.1002/minf.201600059

DMclust, a Density-based Modularity Method for Accurate OTU Picking of 16S rRNA Sequences

Ze-Gang Wei,[a] Shao-Wu Zhang,*[a] and Yi-Zhai Zhang[a]

Abstract: Clustering 16S rRNA sequences into operational taxonomic units (OTUs) is a crucial step in analyzing metagenomic data. Although many methods have been developed, obtaining an appropriate balance between clustering accuracy and computational efficiency is still a major challenge. A novel density-based modularity clustering method, called DMclust, is proposed in this paper to bin 16S rRNA sequences into OTUs with high clustering accuracy. The DMclust algorithm consists of four main phases. It first searches for sequence dense groups, defined as n-sequence communities, in which the distance between any two sequences is less than a threshold. Then these dense groups are used to construct a weighted network, where dense groups are viewed as nodes, each pair of dense groups is connected by an edge, and the distance between the two groups is the weight of the edge. Next, a modularity-based community detection method is employed to generate the preclusters. Finally, the remaining sequences are assigned to their nearest preclusters to form OTUs. Compared with existing widely used methods, the experimental results on several metagenomic datasets show that DMclust achieves more accurate clustering with acceptable memory usage.
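To make the modularity step more concrete, the sketch below computes the standard weighted modularity score Q for a given partition of a small network. It is a minimal illustration, not the DMclust implementation: the function name `weighted_modularity`, the toy distances, and the conversion similarity = 1 - distance (so that closer groups get heavier edges) are assumptions made here.

```python
from collections import defaultdict


def weighted_modularity(edges, communities):
    """Newman modularity Q of a partition of an undirected weighted graph.

    edges: {(u, v): weight}, each undirected edge listed once, no self-loops.
    communities: list of sets of nodes; every node appears in exactly one set.
    Computes Q = sum over communities c of [ L_c / m - (d_c / (2 m))^2 ],
    where m is the total edge weight, L_c the weight inside c, and d_c the
    summed strength (weighted degree) of the nodes in c.
    """
    total = sum(edges.values())
    strength = defaultdict(float)
    for (u, v), w in edges.items():
        strength[u] += w
        strength[v] += w
    label = {node: c for c, members in enumerate(communities) for node in members}
    internal = defaultdict(float)
    for (u, v), w in edges.items():
        if label[u] == label[v]:
            internal[label[u]] += w
    q = 0.0
    for c, members in enumerate(communities):
        degree_sum = sum(strength[n] for n in members)
        q += internal[c] / total - (degree_sum / (2 * total)) ** 2
    return q


# Toy network of four dense groups; the distances are invented for illustration.
distances = {
    (0, 1): 0.1, (2, 3): 0.1,                              # two close pairs
    (0, 2): 0.9, (0, 3): 0.9, (1, 2): 0.9, (1, 3): 0.9,    # far apart otherwise
}
# Assumption: turn distances into similarities so heavier edges mean closer groups.
similarity = {pair: 1.0 - d for pair, d in distances.items()}

q_pairs = weighted_modularity(similarity, [{0, 1}, {2, 3}])
q_single = weighted_modularity(similarity, [{0, 1, 2, 3}])
q_mixed = weighted_modularity(similarity, [{0, 2}, {1, 3}])
print(q_pairs, q_single, q_mixed)
```

On this toy network, grouping the two close pairs gives the highest Q (about 0.318), putting everything in one community gives Q of about 0, and splitting the close pairs gives a negative Q. A modularity-based community detection method searches for the partition with the highest score.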

Keywords: modularity · clustering · OTUs · 16S rRNA · metagenomic

1 Introduction

The rapid development of next-generation sequencing technology provides an excellent tool to explore the complex microbial communities that contribute to many biological processes in various environments. High-throughput sequencing technology can yield millions of DNA/RNA fragments (or sequences) in a single run for environmental samples, and helps us uncover microbial diversity and better understand, with unprecedented resolution, how these unknown microbes live and co-exist.[1–3] The 16S rRNA sequences are widely applied to infer the phylogenetic relationships among microbial species, because the genomic variable regions (or reads) of 16S rRNA can be used as a marker gene to obtain quick estimates of taxonomic diversity.[4,5] The essential step in analyzing these 16S rRNA sequence data to get the taxonomic diversity profile of an environmental sample is to bin them into operational taxonomic units (OTUs) that contain similar reads.[6,7] The existing binning methods are usually categorized into taxonomy-dependent methods and taxonomy-independent methods. In taxonomy-dependent methods, such as BLAST,[8] RDP classifier[9] and bioOTU,[10] individual reads are taxonomically classified by comparing them with the annotated sequences in reference databases. However, most reads originate from the genomes of unknown organisms, and these reads cannot be mapped to the known ‘taxonomic reference’ tree due to the lack of genomic references. In contrast, taxonomy-independent methods bin or group reads in a given dataset into OTUs based on their mutual similarity, and such methods do not depend on any reference database.[11,12] These methods can be used to analyze reads from unknown microorganisms, and they belong to the category of unsupervised machine learning.

For taxonomy-independent methods, several different clustering algorithms have been developed to generate OTUs by computing the pairwise sequence distance with either multiple sequence alignment (MSA) or pairwise sequence alignment (PSA). These algorithms can be further categorized into hierarchical clustering, heuristic clustering, model-based methods and network-based methods. Hierarchical clustering methods such as DOTUR,[13] MOTHUR[14] and ESPRIT[15] require a distance matrix between all reads to construct a hierarchical tree, and then group the reads into OTUs with a predetermined distance threshold. The overall space and computational complexity of hierarchical clustering methods is O(N^2), where N is the number of reads. In order to address the time and memory bottleneck in hierarchical clustering algorithms, heuristic clustering methods such as CD-HIT,[16] UPARSE,[17] UCLUST,[18] DNACLUST,[19] GramCluster,[20] ESPRIT-Tree[21] and MSClust[22] were developed by using a greedy clustering strategy. For each read,

[a] Z.-G. Wei, S.-W. Zhang, Y.-Z. Zhang
Key Laboratory of Information Fusion Technology of Ministry of Education, School of Automation, Northwestern Polytechnical University, Xi’an, 710072, China
Tel.: 86 02988431308
E-mail: zhangsw@nwpu.edu.cn

Supporting information for this article is available on the WWW under https://doi.org/10.1002/minf.201600059

Full Paper www.molinf.com
