A Comparative Study of Next-Generation Sequencing Read Alignment Algorithms


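Next-generation sequencing (NGS) technologies have opened new opportunities for life science research, particularly in genome analysis. They produce very large numbers of short reads, which creates substantial computational challenges. Mapping short reads to a reference genome is the first step in many analyses, and many groups have developed dedicated algorithms and software packages for this task. Developers optimize for different considerations, such as alignment speed, accuracy, memory consumption, multithreading, and scalability to large datasets, so the relative merits of the packages are not obvious.

The paper "Comparative analysis of algorithms for next-generation sequencing read alignment" is by Matthew Ruffalo, Thomas LaFramboise and Mehmet Koyutürk of Case Western Reserve University, whose affiliations span the Department of Electrical Engineering & Computer Science, the Department of Genetics, and the Center for Proteomics and Bioinformatics. It evaluates the accuracy and runtime of the short read aligners Bowtie, BWA, mrFAST, mrsFAST, Novoalign, SHRiMP and SOAPv2.

To make the comparison, the authors build a simulation and evaluation suite, SEAL, which simulates NGS runs under different configurations of sequencing error, indels, and coverage. They also define criteria for comparing packages whose output differs in structure, since some return a single alignment per read and others return several candidates.

The results are intended to help investigators choose the alignment software best suited to their research aims and to highlight the factors that matter when using alignment results.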

Comparative analysis of algorithms for next-generation sequencing read alignment

Matthew Ruffalo (1,*), Thomas LaFramboise (2,3) and Mehmet Koyutürk (1,3)

(1) Department of Electrical Engineering & Computer Science, Case Western Reserve University, Cleveland, OH 44106, USA

(2) Department of Genetics, Case Western Reserve University, Cleveland, OH 44106, USA

(3) Center for Proteomics and Bioinformatics, Case Western Reserve University, Cleveland, OH 44106, USA

ABSTRACT

Motivation: The advent of next-generation sequencing (NGS) techniques presents many novel opportunities for many applications in life sciences. The vast number of short reads produced by these techniques, however, poses significant computational challenges. The first step in many types of genomic analysis is the mapping of short reads to a reference genome, and several groups have developed dedicated algorithms and software packages to perform this function. As the developers of these packages optimize their algorithms with respect to various considerations, the relative merits of different software packages remain unclear. However, for scientists who generate and use NGS data for their specific research projects, an important consideration is choosing the software that is most suitable for their application.

Results: With a view to comparing existing short read alignment software, we develop a simulation and evaluation suite, SEAL, which simulates NGS runs for different configurations of various factors, including sequencing error, indels, and coverage. We also develop criteria to compare the performances of software with disparate output structure (e.g., some packages return a single alignment while some return multiple possible alignments). Using these criteria, we comprehensively evaluate the performances of Bowtie, BWA, mr- and mrsFAST, Novoalign, SHRiMP and SOAPv2, with regard to accuracy and runtime.

Conclusion: We expect that the results presented here will be useful to investigators in choosing the alignment software that is most suitable for their specific research aims. Our results also provide insights into the factors that should be considered to use alignment results effectively. SEAL can also be used to evaluate the performance of algorithms that use deep sequencing data for various purposes (e.g., identification of genomic variants).

Availability: SEAL is available as open-source at http://compbio.case.edu/seal/.
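The kind of simulation described under Results can be illustrated with a minimal sketch. It draws reads uniformly from a reference, sizes the run from a target coverage (reads × read length / genome length), and applies substitution and single-base indel errors. This is not SEAL's implementation: the function names, the default rates, and the uniform per-base error model are all assumptions made for illustration.

```python
import random

BASES = "ACGT"


def reads_for_coverage(genome_len, read_len, coverage):
    """Reads needed so that n_reads * read_len / genome_len ~= coverage."""
    return max(1, round(coverage * genome_len / read_len))


def mutate(fragment, sub_rate, indel_rate, rng):
    """Apply per-base substitutions and single-base indels (hypothetical error model)."""
    out = []
    for base in fragment:
        r = rng.random()
        if r < indel_rate / 2:
            continue  # deletion: drop this base
        if r < indel_rate:
            out.append(rng.choice(BASES))  # insertion before this base
        if rng.random() < sub_rate:
            # substitution: pick one of the three other bases
            base = rng.choice([b for b in BASES if b != base])
        out.append(base)
    return "".join(out)


def simulate_reads(reference, read_len, coverage,
                   sub_rate=0.01, indel_rate=0.001, seed=0):
    """Return (true_start, read) pairs sampled uniformly from the reference."""
    rng = random.Random(seed)
    n_reads = reads_for_coverage(len(reference), read_len, coverage)
    reads = []
    for _ in range(n_reads):
        start = rng.randrange(len(reference) - read_len + 1)
        fragment = reference[start:start + read_len]
        reads.append((start, mutate(fragment, sub_rate, indel_rate, rng)))
    return reads
```

Recording the true start position alongside each read is what allows a later evaluation step to check whether an aligner placed the read correctly. Because of indels, simulated reads may be slightly shorter or longer than `read_len`.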

1 INTRODUCTION

Next-generation sequencing techniques are demonstrating promise in transforming research in life sciences (Schuster, 2007). These techniques support many applications including metagenomics (Qin et al., 2010), detection of SNPs (Van Tassell et al., 2008) and genomic structural variants (Alkan et al., 2009; Medvedev et al., 2009) in a population, DNA methylation studies (Taylor et al., 2007), analysis of mRNA expression (Sultan et al., 2008), cancer genomics (Guffanti et al., 2009), and personalized medicine (Auffray et al., 2009). Some applications (e.g., metagenomics) require de novo sequencing of a sample (Miller et al., 2010), while many others (e.g., variant detection, cancer genomics) require resequencing. For all of these applications, the vast amount of data produced by sequencing runs poses many computational challenges (Horner et al., 2010).

* To whom correspondence should be addressed.

In resequencing, a reference genome is already available for the species (e.g., the human genome) and one is interested in comparing short reads obtained from the genome of one or more donors (individual members of the species) to the reference genome. Therefore, the first step in any kind of analysis is the mapping of short reads to a reference genome. This task is complicated by many factors, including genetic variation in the population, sequencing error, short read length, and the huge volume of short reads to be mapped. So far, many algorithms have been developed to overcome these challenges and these algorithms have been made available to the scientific community as software packages (Li and Homer, 2010). Currently available software packages for short read alignment include Bowtie (Langmead et al., 2009), SOAP (Li et al., 2009), BWA (Li and Durbin, 2009, 2010), mrFAST (Alkan et al., 2009), mrsFAST (Hach et al., 2010), Novoalign (Novocraft, 2010), and SHRiMP (Rumble et al., 2009).

In this paper, we assess the performance of currently available alignment algorithms, with a view to (i) understanding the effect of various factors on accuracy and runtime performance and (ii) comparing existing algorithms in terms of their performance in various settings. For this purpose, we develop a simulation and evaluation suite, SEAL, that simulates short read sequencing runs for a given set of configurations and evaluates the output of each software using novel performance criteria that are specifically designed for the current application. Our results show significant differences in performance and accuracy as quality of the reads and the characteristics of the genome vary. In the next section, we briefly describe the alignment algorithms that are evaluated in this paper. Subsequently, in Section 3, we describe the simulation suite implemented in SEAL and our performance criteria in detail. We

Associate Editor: Dr. Jonathan Wren

Bioinformatics Advance Access published August 19, 2011


