TECHNICAL NOTE Open Access
SOAPdenovo2: an empirically improved
memory-efficient short-read de novo assembler
Ruibang Luo1,2†, Binghang Liu1,2†, Yinlong Xie1,2,3†, Zhenyu Li1,2†, Weihua Huang1, Jianying Yuan1, Guangzhu He1,
Yanxiang Chen1, Qi Pan1, Yunjie Liu1, Jingbo Tang1, Gengxiong Wu1, Hao Zhang1, Yujian Shi1, Yong Liu1,
Chang Yu1, Bo Wang1, Yao Lu1, Changlei Han1, David W Cheung2, Siu-Ming Yiu2, Shaoliang Peng4, Zhu Xiaoqian4,
Guangming Liu4, Xiangke Liao4, Yingrui Li1,2, Huanming Yang1, Jian Wang1, Tak-Wah Lam2* and Jun Wang1*
Abstract
Background: De novo genome assembly from next-generation sequencing (NGS) short reads is increasingly
common; however, several major challenges must still be overcome before it can be both efficient and
accurate. SOAPdenovo has been successfully applied to assemble many published genomes, but it still needs
improvement in continuity, accuracy and coverage, especially in repeat regions.
Findings: To overcome these challenges, we have developed its successor, SOAPdenovo2, which has a
new algorithm design that reduces memory consumption in graph construction, resolves more
repeat regions in contig assembly, increases coverage and length in scaffold construction, improves gap closing,
and is optimized for large genomes.
Conclusions: Benchmarks using the Assemblathon1 and GAGE datasets showed that SOAPdenovo2 greatly
surpasses its predecessor SOAPdenovo and is competitive with other assemblers in both assembly length and
accuracy. We also provide an updated assembly of the 2008 Asian (YH) genome using SOAPdenovo2. Here,
the contig and scaffold N50 of the YH genome were ~20.9 kbp and ~22 Mbp, respectively, which are 3-fold and
50-fold longer than in the first published version. The genome coverage increased from 81.16% to 93.91%, and
peak memory consumption was ~2/3 lower.
Keywords: Genome, Assembly, Contig, Scaffold, Error correction, Gap-filling
Findings
The increased use of next generation sequencing (NGS)
has led to growth in the number of de novo genome
assemblies being carried out using short reads. Although
several de novo assemblers are available, there remains
room for improvement, as shown in recent assembly
evaluation projects such as Assemblathon 1 [1] and GAGE [2].
Since the publication of the first version of SOAPdenovo [3],
it has been used to assemble many large eukaryotic genomes,
but reports have indicated areas that would benefit from
updates, including assembly coverage and length [4,5].
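Assembly length is commonly summarized by the N50 statistic, which the abstract quotes for the YH contigs and scaffolds. As a minimal sketch, assuming the usual definition (the largest length L such that sequences of length L or longer contain at least half of the total assembled bases), N50 can be computed as follows. The function name n50 and the toy lengths are illustrative only and are not taken from SOAPdenovo2.

```python
def n50(lengths):
    """Return the N50 of a collection of contig or scaffold lengths.

    Assumed definition: the largest length L such that sequences of
    length >= L together contain at least half of the total bases.
    """
    if not lengths:
        raise ValueError("need at least one sequence length")
    total = sum(lengths)
    running = 0
    # Walk from the longest sequence down, accumulating bases.
    for length in sorted(lengths, reverse=True):
        running += length
        # Integer comparison avoids floating-point issues with total / 2.
        if 2 * running >= total:
            return length


# Toy example (hypothetical lengths, total = 35 bases):
# 8 + 7 = 15 is below half of 35, while 8 + 7 + 6 = 21 reaches it.
toy_contigs = [2, 3, 4, 5, 6, 7, 8]
print(n50(toy_contigs))  # prints 6
```

A larger N50 for the same total length means that more of the assembly sits in long sequences, which is why the statistic is reported alongside genome coverage.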
SOAPdenovo2, as with SOAPdenovo, is made up of
six modules that handle read error correction, de Bruijn
graph (DBG) construction, contig assembly, paired-end
(PE) read mapping, scaffold construction, and gap closure.
The major improvements we have made in
SOAPdenovo2 are: 1) enhancing the error correction
algorithm, 2) reducing memory consumption in DBG
construction, 3) resolving longer repeat
regions in contig assembly, 4) increasing assembly length
and coverage in scaffolding and 5) improving gap closure.
Our data show that SOAPdenovo2 outperforms its
predecessor on the majority of the metrics benchmarked
in Assemblathon 1 as well as GAGE, and in addition
was able to substantially improve the original assembly
* Correspondence: twlam@cs.hku.hk; wangj@genomics.org.cn
†Equal contributors
2HKU-BGI Bioinformatics Algorithms and Core Technology Research
Laboratory & Department of Computer Science, University of Hong Kong,
Pokfulam, Hong Kong
1BGI HK Research Institute, 16 Dai Fu Street, Tai Po Industrial Estate, Hong
Kong
Full list of author information is available at the end of the article
© 2012 Luo et al.; licensee BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative
Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and
reproduction in any medium, provided the original work is properly cited.
Luo et al. GigaScience 2012, 1:18
http://www.gigasciencejournal.com/content/1/1/18