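SOAPdenovo2: A Memory-Efficient Short-Read Assembler

The research paper "SOAPdenovo2: an empirically improved memory-efficient short-read de novo assembler" introduces SOAPdenovo2, an upgraded de novo genome assembler for next-generation sequencing (NGS) short reads. Published in 2012, it focuses on improving memory efficiency and on addressing key challenges in genome assembly.

Background: With the rapid development of NGS technology, demand for de novo assembly of new genomes keeps growing. The process still faces several major challenges, including improving assembly continuity, accuracy and coverage, especially in complex repeat regions.

Findings: To address these challenges, the researchers developed SOAPdenovo2. Compared with its predecessor SOAPdenovo, it uses a new algorithm design that reduces memory consumption during graph construction, so large genome datasets can be assembled more effectively on limited computing resources. The paper also reports that SOAPdenovo2 handles repeat sequences better; repeats are a difficulty in many genomes because they tend to confuse traditional assembly algorithms.

Methods: The core improvement in SOAPdenovo2 is an optimized graph construction strategy. Smarter data structures and algorithms reduce the memory needed to store the edges and nodes of the assembly graph. This is particularly useful for high-coverage short-read data, because larger and more complex genomes can be processed without sacrificing accuracy.

Results: In a series of experiments on genomes from several organisms, SOAPdenovo2 outperformed SOAPdenovo in continuity, accuracy and memory use, especially in repeat regions. This gives researchers a more powerful and resource-friendly tool for resolving genome structure, particularly in resource-limited environments.

Conclusions: SOAPdenovo2 is an important advance in genome assembly. It improves the efficiency and quality of short-read de novo assembly and is optimized for memory management in particular, so that researchers can process large-scale NGS data more effectively. This matters for genomics research, especially for genomes with complex repeat structure.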

TECHNICAL NOTE Open Access

SOAPdenovo2: an empirically improved memory-efficient short-read de novo assembler

Ruibang Luo1,2†, Binghang Liu1,2†, Yinlong Xie1,2,3†, Zhenyu Li1,2†, Weihua Huang, Jianying Yuan, Guangzhu He, Yanxiang Chen, Qi Pan, Yunjie Liu, Jingbo Tang, Gengxiong Wu, Hao Zhang, Yujian Shi, Yong Liu, Chang Yu, Bo Wang, Yao Lu, Changlei Han, David W Cheung, Siu-Ming Yiu, Shaoliang Peng, Zhu Xiaoqian, Guangming Liu, Xiangke Liao, Yingrui Li1,2, Huanming Yang, Jian Wang, Tak-Wah Lam and Jun Wang

Abstract

Background: There is a rapidly increasing amount of de novo genome assembly using next-generation sequencing (NGS) short reads; however, several big challenges remain to be overcome in order for this to be efficient and accurate. SOAPdenovo has been successfully applied to assemble many published genomes, but it still needs improvement in continuity, accuracy and coverage, especially in repeat regions.

Findings: To overcome these challenges, we have developed its successor, SOAPdenovo2, which has the advantage of a new algorithm design that reduces memory consumption in graph construction, resolves more repeat regions in contig assembly, increases coverage and length in scaffold construction, improves gap closing, and is optimized for large genomes.

Conclusions: Benchmarks using the Assemblathon1 and GAGE datasets showed that SOAPdenovo2 greatly surpasses its predecessor SOAPdenovo and is competitive with other assemblers on both assembly length and accuracy. We also provide an updated assembly version of the 2008 Asian (YH) genome using SOAPdenovo2. Here, the contig and scaffold N50 of the YH genome were ~20.9 kbp and ~22 Mbp, respectively, which is 3-fold and 50-fold longer than the first published version. The genome coverage increased from 81.16% to 93.91%, and peak memory consumption was ~2/3 lower.

Keywords: Genome, Assembly, Contig, Scaffold, Error correction, Gap-filling
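
The contig and scaffold N50 values quoted in the abstract are a standard continuity metric: the N50 is the length L such that sequences of length L or longer together make up at least half of the total assembled length. The following is a minimal sketch of that calculation in Python. The function name `n50` and the toy lengths are illustrative assumptions; they are not taken from the paper or from the SOAPdenovo2 code.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths (in bp)."""
    if not lengths:
        raise ValueError("need at least one sequence length")
    total = sum(lengths)
    running = 0
    # Walk from the longest sequence down until half the total is covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length


# Hypothetical contig lengths, for illustration only.
toy_contigs = [2, 3, 4, 5, 6, 7, 8, 9, 10]
print(n50(toy_contigs))  # 8
```

In the toy input the total length is 54, and the three longest contigs (10 + 9 + 8 = 27) reach half of it, so the N50 is 8. A higher N50 at the same total length means the assembly is split into fewer, longer pieces.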

Findings

The increased use of next generation sequencing (NGS) has resulted in rapid growth in the number of de novo genome assemblies being carried out using short reads. Although there are several de novo assemblers available, there remains room for improvement as shown in recent assembly evaluation projects such as Assemblathon 1 [1] and GAGE [2]. Since the publication of the first version of SOAPdenovo [3], it has been used to assemble many large eukaryotic genomes, but reports have indicated areas that would benefit from updates, including assembly coverage and length [4,5].
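
SOAPdenovo and several other short-read assemblers are built around a de Bruijn graph, in which overlapping k-mers from the reads are linked to each other. As a rough illustration of the idea, the sketch below uses the textbook formulation, where each k-mer becomes an edge between its (k-1)-mer prefix and suffix. It is not SOAPdenovo2's memory-optimized construction; the function name `build_de_bruijn` and the toy reads are assumptions made for this example, and reverse complements and sequencing errors are ignored.

```python
from collections import defaultdict


def build_de_bruijn(reads, k):
    """Map each (k-1)-mer node to the set of (k-1)-mer nodes it points to."""
    if k < 2:
        raise ValueError("k must be at least 2")
    graph = defaultdict(set)
    for read in reads:
        # Slide a window of width k along the read.
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            # Each k-mer is an edge from its prefix to its suffix.
            graph[kmer[:-1]].add(kmer[1:])
    return dict(graph)


# Hypothetical toy reads, for illustration only.
reads = ["ACGTAC", "CGTACG"]
graph = build_de_bruijn(reads, k=3)
for node in sorted(graph):
    print(node, "->", sorted(graph[node]))
```

Because a k-mer seen in many reads is stored only once, the size of the graph grows with the number of distinct k-mers rather than with the raw read count, which is why the graph construction step tends to dominate memory use on large genomes.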

SOAPdenovo2, as with SOAPdenovo, is made up of six modules that handle read error correction, de Bruijn graph (DBG) construction, contig assembly, paired-end (PE) reads mapping, scaffold construction, and gap closure. The major improvements we have made in SOAPdenovo2 are: 1) enhancing the error correction algorithm, 2) reducing memory consumption in DBG construction, 3) resolving longer repeat regions in contig assembly, 4) increasing assembly length and coverage in scaffolding and 5) improving gap closure. Our data show that SOAPdenovo2 outperforms its predecessor on the majority of the metrics benchmarked in Assemblathon 1 as well as GAGE and, in addition, was able to substantially improve the original assembly

* Correspondence: twlam@cs.hku.hk; wangj@genomics.org.cn

† Equal contributors

HKU-BGI Bioinformatics Algorithms and Core Technology Research Laboratory & Department of Computer Science, University of Hong Kong, Pokfulam, Hong Kong

BGI HK Research Institute, 16 Dai Fu Street, Tai Po Industrial Estate, Hong Kong

Full list of author information is available at the end of the article

Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Luo et al. GigaScience 2012, 1:18

http://www.gigasciencejournal.com/content/1/1/18
