Comparative analysis of algorithms for next-generation
sequencing read alignment
Matthew Ruffalo
1∗
, Thomas LaFramboise
2,3
and Mehmet Koyut¨urk
1,3
1
Department of Electrical Engineering & Computer Science, Case Western Reserve University,
Cleveland, OH 44106, USA
2
Department of Genetics, Case Western Reserve University, Cleveland, OH 44106, USA
3
Center for Proteomics and Bioinformatics, Case Western Reserve University, Cleveland, OH
44106, USA
ABSTRACT
Motivation: The advent of next-generation sequencing (NGS)
techniques presents many novel opportunities for many applications
in life sciences. The vast number of short reads produced by these
techniques, however, pose significant computational challenges. The
first step in many types of g e n o mic anal ysis is the mapping o f short
reads to a reference genome, and several groups have developed
dedicated algorithms and software packages to perform this function.
As the dev elopers of these packages optimize their algorithms with
respect to various considerations, the relative merits of different
software packages remain unclear. However, for scientists who
generate and use NGS data for their specific research projects, an
important consideration is choosing the software that is most suitable
for their application.
Results: With a view to comparing existing shor t read alignment
software, we develop a simulation and evaluation suite, SEAL,which
simulates NGS runs for different configurations of various factors,
including sequencing error, indels, and coverage. We also develop
criteria to compare the performances of software with disparate
output structure (e.g.,somepackagesreturnasinglealignmentwhile
some return multiple possible alignments). Using these criteria, we
comprehensively evaluate the performances of Bowtie, BWA, mr- and
mrsFAST, Novoalign, SHRiMP and SOAPv2, with regard to accuracy
and r untime.
Conclusion: We expect that the results presented here will be useful
to investigators in choosing the alignment software that is most
suitable for their specific research aims. Our results also provide
insights into the factors that should be considered to use alignment
results effectively. SEAL can also be used to evaluate the performance
of algorithms that use deep sequencing data for various purposes
(e.g.,identificationofgenomicvariants).
Availability: SEAL is available as open-source at http://
compbio.case.edu/seal/.
1INTRODUCTION
Next-generation sequencing t echniques are demonstrating promise
in transforming research in life sciences (Schuster, 2007). These
techniques support many applications including metagenomics (Qin
∗
to whom correspondence should be addressed
et al.,2010), detectionofSNPs(VanTassellet al.,2008)
and genomic structural variants (Alkan et al.,2009;Medvedev
et al.,2009)inapopulation, DNAmethylationstudies(Taylor
et al.,2007), analysisofmRNAexpression(Sultanet al.,
2008), cancer genomics (Guffanti et al.,2009),andpersonalized
medicine (Auffray et al.,2009). Someapplications(e.g.,
metagenomics) require de novo sequencing of a sample (Miller
et al.,2010), whilemanyothers(e.g.,variantdetection,cancer
genomics) require resequencing. For all of these applications, the
vast amount of data produced by sequencing runs poses many
computational challenges (Horner et al.,2010).
In resequencing, a reference genome is already available for
the species (e.g.,thehumangenome)andoneisinterestedin
comparing short reads obtained from the genome of one or more
donors (individual members of the species) to the reference genome.
Therefore, the first step in any kind of analysis is the mapping
of short reads to a reference genome. This task is complicated
by many factors, including genetic variation in the population,
sequencing error, short read length, and the huge volume of short
reads to be mapped. So far, many algorithms have been developed
to overcome these challenges and these algorithms have been made
available to the scientific community as software packages (Li and
Homer, 2010). Currently available software packages for short read
alignment include Bowtie (Langmead et al.,2009),SOAP(Liet al.,
2009), BWA (Li and Durbin, 2009, 2010), mrFAST (Alkan et al.,
2009), mrsFAST (Hach et al.,2010),Novoalign(Novocraft,2010),
and SHRiMP (Rumble et al.,2009).
In this paper, we assess the performance of currently available
alignment algorithms, with a view to (i) understanding the effect
of various factors on accuracy and runtime performance and (ii)
comparing existing algorithms in terms of their performance in
various settings. For this purpose, we develop a simulation and
evaluation suite, SEAL,thatsimulatesshortreadsequencingruns
for a given set of confi gurations and evaluates the output of each
software using novel performance criteria that are specifically
designed for the current application. Our results show significant
differences in performance and accuracy as quality of the reads
and the characteristics of the genome vary. In the next section, we
briefly describe the alignment algorithms that are ev aluated in this
paper. Subsequently, in Section 3, we describe the simulation suite
implemented in SEAL and our performance criteria in detail. We
1
Associate Editor: Dr. Jonathan Wren
© The Author (2011). Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oup.com
Bioinformatics Advance Access published August 19, 2011
at University of California, Santa Barbara on September 25, 2011bioinformatics.oxfordjournals.orgDownloaded from