CGAP-Align：提升DNA短读比对效率的新型工具

91 浏览量更新于2024-08-26 收藏 478KB PDF 举报

"CGAP-Align：一种高性能的DNA短读比对工具" CGAP-Align是一种专门为DNA短读序列比对设计的高性能工具，由复旦大学计算机科学学院和贝勒医学院人类基因组测序中心的研究人员开发。该工具旨在解决下一代测序平台产生的大量数据的快速比对问题，以提高序列分析的效率。随着测序技术的发展，DNA测序的成本大幅降低，导致海量的测序数据产生，这对现有的比对工具提出了更高的性能要求。传统的比对工具如BWA（Burrows-Wheeler Aligner）因其相对较高的准确度而被广泛使用。然而，BWA在处理大规模数据时的映射效率成为了瓶颈，限制了研究人员处理日益增长的测序数据能力。为了应对这一挑战，CGAP-Align通过优化算法和数据结构，实现了比BWA更高的性能提升，同时保持了相同的召回率和精确度。 CGAP-Align的关键创新在于使用了一种名为“Suffix Tarray”的新型数据结构，它是Suffix Array（后缀数组）和Suffix Tree（后缀树）两种经典数据结构的结合。Suffix Array可以快速查找字符串的后缀，而Suffix Tree则能高效地处理模式匹配问题。将两者结合的Suffix Tarray能够更有效地进行比对操作，减少了计算时间和内存消耗。此外，CGAP-Align还采用了更紧密的下界估计（tighter lower bound estimation），这有助于减少不必要的比对尝试，进一步提高了效率。这一改进使得CGAP-Align在处理大规模DNA短读数据时，能够更快地找到正确的配对位置，从而加速了整个生物信息学分析流程。 CGAP-Align是针对高通量测序数据分析中的核心问题——DNA短读序列比对的优化解决方案。它的出现为生物信息学领域提供了更高效的工具，有助于研究人员更有效地处理和分析大量的基因组数据，推动了基因组学研究的进步。通过对传统比对工具的性能优化和技术创新，CGAP-Align展示了在处理未来更大规模测序数据时的潜力。

CGAP-Align: A High Performance DNA Short Read

Alignment Tool

Yaoliang Chen

, Ji Hong

, Wanyun Cui

, Jacques Zaneveld

, Wei Wang

, Richard Gibbs

, Yanghua Xiao

Rui Chen

1 School of Computer Science, Fudan University, Shanghai, China, 2 Human Genome Sequencing Center, Department of Molecular and Human Genetics, Baylor College of

Medicine, Houston, Texas, United States of America

Abstract

Background:

Next generation sequencing platforms have greatly reduced sequencing costs, leading to the production of

unprecedented amounts of sequence data. BWA is one of the most popular alignment tools due to its relatively high

accuracy. However, mapping reads using BWA is still the most time consuming step in sequence analysis. Increasing

mapping efficiency would allow the community to better cope with ever expanding volumes of sequence data.

Results:

We designed a new program, CGAP-align, that achieves a performance improvement over BWA without sacrificing

recall or precision. This is accomplished through the use of Suffix Tarray, a novel data structure combining elements of Suffix

Array and Suffix Tree. We also utilize a tighter lower bound estimation for the number of mismatches in a read, allowing for

more effective pruning during inexact mapping. Evaluation of both simulated and real data suggests that CGAP-align

consistently outperforms the current version of BWA and can achieve over twice its speed under certain conditions, all while

obtaining nearly identical results.

Conclusion:

CGAP-align is a new time efficient read alignment tool that extends and improves BWA. The increase in

alignment speed will be of critical assistance to all sequence-based research and medicine. CGAP-align is freely available to

the academic community at http://sourceforge.net/p/cgap-align under the GNU General Public License (GPL).

Citation: Chen Y, Hong J, Cui W, Zaneveld J, Wang W, et al. (2013) CGAP-Align: A High Performance DNA Short Read Alignment Tool. PLoS ONE 8(4): e61033.

doi:10.1371/journal.pone.0061033

Editor: Haixu Tang, Indiana University, United States of America

Received October 28, 2012; Accepted March 5, 2013; Published April 11, 2013

Copyright: ß 2013 Chen et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits

unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Funding: Exome and whole genome sequencing was performed at the BCM-FGI core facility supported by NIH shared instrum ent grant 1S10RR026550 to R.C.

This work was largely supported by Microsoft Research Asia with sponsorship number FY10-RES-OPP-010 and partially supported by the National Natural Science

Foundation of China under grant No 61003001; Specialized Research Fund for the Doctoral Program of Higher Education No. 20100071120032; Key Program of

National Natural Science Foundation of China under grant No. 61033010; NIH training grant T32 EY007102; National Science and Technology Major Project of the

Ministry of Science and Technology of China under grant No. 2010ZX01042-003-004. This work is also partially supported by grants from the Retinal Research

Foundation and National Eye Institute (R01EY018571) to R.C. and the National Human Genome Research Institute to R.G. (U54HG003273). The funders had no role

in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Competing Interests: This work was largely supported by Microsoft Research Asia with spon sorship number FY10-RES-OPP-010. There are no patents, products

in development or marketed products to declare. This does not alter the authors’ adherence to all the PLOS ONE policies on sharing data and materials, as

detailed online in th e guide for authors.

* E-mail: shawyh@fudan.edu.cn (YX); ruichen@bcm.edu (RC)

Introduction

In recent years, advances in sequencing have led to the production

of unprecedented amount of sequence data. Alignment, which maps

reads to the reference sequence, is one of the most computationally

demanding tasks performed in typical sequence data processing.

Accurate sequence alignment is critical for SNP calling [1], structural

variation detection [2] and further downstream analysis.

In order to efficiently and accurately map large numbers of short

reads many new alignment programs have been developed. The

algorithms underlying most of these tools can be classified into two

major categories [3]. The first category uses hash tables to hash

either read sequences, as in MAQ [1], SeqMap [4] and CloudBurst

[5], or the genome reference, as in SOAPv1 [6], PASS [7], MOM

[8] and ProbeMatch [9]. Although this technique can be easily

parallelized, the major drawback of using hash tables is that either

they must scan the whole genome, even when few reads are aligned,

or they require a large amount of memory to build an index for the

reference. The second category is based on string matching using a

representation of prefix/suffix trie [2], including suffix tree, suffix

array [10], enhanced suffix array [11] and FM-index [12]. The first

three representations are used by TRELLIS [13], MUMmer [14]

and Vmatch (http://www.vmatch.de). Unfortunately, these pro-

grams have poor performance on large-scale references including

the human genome due to memory constraints. The FM-index is a

type of substring index based on the Burrows-Wheeler transform,

with some similarities to suffix array. Owing to its small memory

requirement, it is utilized by numerous state of the art programs

including SOAPv2 [15], Bowtie [16], Bowtie2 [17] and BWA [18].

BWA has become one of the most widely used alignment tools

owing to its efficiency and accuracy. BWA possesses higher recall

and precision than SOAPv2 or Bowtie. However, both Bowtie and

SOAP2 are significantly faster than BWA. Therefore, it is highly

desirable to improve the speed of BWA while maintaining its

alignment quality. In this paper, we describe a new efficient

alignment tool, CGAP-align, following the framework of BWA. As

PLOS ONE | www.plosone.org 1 April 2013 | Volume 8 | Issue 4 | e61033

下载后可阅读完整内容，剩余7页未读，立即下载

weixin_38514872

粉丝: 6
资源: 879

CGAP-Align：提升DNA短读比对效率的新型工具

CGAP-Align:在BWA顶部构建的一种新的省时读取对齐工具-开源

Python库 | foursight_cgap-1.1.1b1-py3-none-any.whl

Python库 | submit_cgap-1.1.1-py3-none-any.whl

Python库 | foursight_cgap-1.5.0-py3-none-any.whl

PyPI 官网下载 | submit_cgap-1.1.1-py3-none-any.whl

Python库 | foursight_cgap-1.1.1b0-py3-none-any.whl

PyPI 官网下载 | foursight-cgap-0.4.9.1b24.tar.gz

CGAP-扶贫金融服务提供商临时纸质转移资金战略指南.pdf

大数据-算法-应用肿瘤基因解剖工程数据库及RNA干扰技术检测大肠癌基因表达谱.pdf

CGPA结果预测：计算您的GPA和CGPA分数，然后从您的分数中获得平均估计分数，最后获得第一，第二低位

最新资源