CGAP-Align: A High Performance DNA Short Read
Alignment Tool
Yaoliang Chen
1
, Ji Hong
1
, Wanyun Cui
1
, Jacques Zaneveld
2
, Wei Wang
1
, Richard Gibbs
2
, Yanghua Xiao
1
*,
Rui Chen
2
*
1 School of Computer Science, Fudan University, Shanghai, China, 2 Human Genome Sequencing Center, Department of Molecular and Human Genetics, Baylor College of
Medicine, Houston, Texas, United States of America
Abstract
Background:
Next generation sequencing platforms have greatly reduced sequencing costs, leading to the production of
unprecedented amounts of sequence data. BWA is one of the most popular alignment tools due to its relatively high
accuracy. However, mapping reads using BWA is still the most time consuming step in sequence analysis. Increasing
mapping efficiency would allow the community to better cope with ever expanding volumes of sequence data.
Results:
We designed a new program, CGAP-align, that achieves a performance improvement over BWA without sacrificing
recall or precision. This is accomplished through the use of Suffix Tarray, a novel data structure combining elements of Suffix
Array and Suffix Tree. We also utilize a tighter lower bound estimation for the number of mismatches in a read, allowing for
more effective pruning during inexact mapping. Evaluation of both simulated and real data suggests that CGAP-align
consistently outperforms the current version of BWA and can achieve over twice its speed under certain conditions, all while
obtaining nearly identical results.
Conclusion:
CGAP-align is a new time efficient read alignment tool that extends and improves BWA. The increase in
alignment speed will be of critical assistance to all sequence-based research and medicine. CGAP-align is freely available to
the academic community at http://sourceforge.net/p/cgap-align under the GNU General Public License (GPL).
Citation: Chen Y, Hong J, Cui W, Zaneveld J, Wang W, et al. (2013) CGAP-Align: A High Performance DNA Short Read Alignment Tool. PLoS ONE 8(4): e61033.
doi:10.1371/journal.pone.0061033
Editor: Haixu Tang, Indiana University, United States of America
Received October 28, 2012; Accepted March 5, 2013; Published April 11, 2013
Copyright: ß 2013 Chen et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits
unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Funding: Exome and whole genome sequencing was performed at the BCM-FGI core facility supported by NIH shared instrum ent grant 1S10RR026550 to R.C.
This work was largely supported by Microsoft Research Asia with sponsorship number FY10-RES-OPP-010 and partially supported by the National Natural Science
Foundation of China under grant No 61003001; Specialized Research Fund for the Doctoral Program of Higher Education No. 20100071120032; Key Program of
National Natural Science Foundation of China under grant No. 61033010; NIH training grant T32 EY007102; National Science and Technology Major Project of the
Ministry of Science and Technology of China under grant No. 2010ZX01042-003-004. This work is also partially supported by grants from the Retinal Research
Foundation and National Eye Institute (R01EY018571) to R.C. and the National Human Genome Research Institute to R.G. (U54HG003273). The funders had no role
in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
Competing Interests: This work was largely supported by Microsoft Research Asia with spon sorship number FY10-RES-OPP-010. There are no patents, products
in development or marketed products to declare. This does not alter the authors’ adherence to all the PLOS ONE policies on sharing data and materials, as
detailed online in th e guide for authors.
* E-mail: shawyh@fudan.edu.cn (YX); ruichen@bcm.edu (RC)
Introduction
In recent years, advances in sequencing have led to the production
of unprecedented amount of sequence data. Alignment, which maps
reads to the reference sequence, is one of the most computationally
demanding tasks performed in typical sequence data processing.
Accurate sequence alignment is critical for SNP calling [1], structural
variation detection [2] and further downstream analysis.
In order to efficiently and accurately map large numbers of short
reads many new alignment programs have been developed. The
algorithms underlying most of these tools can be classified into two
major categories [3]. The first category uses hash tables to hash
either read sequences, as in MAQ [1], SeqMap [4] and CloudBurst
[5], or the genome reference, as in SOAPv1 [6], PASS [7], MOM
[8] and ProbeMatch [9]. Although this technique can be easily
parallelized, the major drawback of using hash tables is that either
they must scan the whole genome, even when few reads are aligned,
or they require a large amount of memory to build an index for the
reference. The second category is based on string matching using a
representation of prefix/suffix trie [2], including suffix tree, suffix
array [10], enhanced suffix array [11] and FM-index [12]. The first
three representations are used by TRELLIS [13], MUMmer [14]
and Vmatch (http://www.vmatch.de). Unfortunately, these pro-
grams have poor performance on large-scale references including
the human genome due to memory constraints. The FM-index is a
type of substring index based on the Burrows-Wheeler transform,
with some similarities to suffix array. Owing to its small memory
requirement, it is utilized by numerous state of the art programs
including SOAPv2 [15], Bowtie [16], Bowtie2 [17] and BWA [18].
BWA has become one of the most widely used alignment tools
owing to its efficiency and accuracy. BWA possesses higher recall
and precision than SOAPv2 or Bowtie. However, both Bowtie and
SOAP2 are significantly faster than BWA. Therefore, it is highly
desirable to improve the speed of BWA while maintaining its
alignment quality. In this paper, we describe a new efficient
alignment tool, CGAP-align, following the framework of BWA. As
PLOS ONE | www.plosone.org 1 April 2013 | Volume 8 | Issue 4 | e61033