EMMA: An Efficient Massive Mapping Algorithm


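"EMMA is an efficient massive mapping algorithm that uses improved approximate mapping filtering to speed up the mapping of large numbers of cDNA sequences onto genomic sequences. The algorithm combines an enhanced suffix array, a pruned fast hash table, block alignment extension, and a k-longest-paths strategy to improve mapping efficiency and accuracy."

EMMA (Efficient Massive Mapping Algorithm) is designed for the problem of efficiently mapping large numbers of cDNA sequences onto genomic sequences. In bioinformatics, this kind of mapping underlies the study of gene expression, transcriptome analysis, and gene function. Traditional mapping methods tend to be slow on very large data sets; EMMA introduces a series of optimizations that raise mapping speed and precision.

First, the core improvement in EMMA is approximate mapping filtering based on an enhanced suffix array. A suffix array is a data structure for quickly locating patterns in a string; in EMMA it is strengthened to cope with large-scale data. This lets the algorithm find and filter out non-matching cDNA sequences more effectively, which reduces unnecessary computation.

Second, the algorithm uses a pruned fast hash table. The fast hash table stores and retrieves data quickly, while the pruning strategy avoids further processing of potentially low-quality matches. This speeds up mapping further while keeping accuracy high.

In addition, EMMA uses block alignment extension and k-longest paths. Block alignment extension grows an initial matching fragment into a larger region of the whole cDNA sequence, which keeps the mapping contiguous and complete. The k-longest-paths strategy selects the k longest among the candidate mapping paths in order to determine the most likely correct mapping, which matters most when handling repetitive sequences and complex gene structures.

Compared with traditional mapping algorithms, EMMA is more efficient on large-scale cDNA data. It completes the mapping task in less time and, thanks to its optimized filtering, still maintains high accuracy, which is essential for bioinformatics analysis. EMMA is therefore a powerful tool for biologists and researchers: it can accelerate genomics and transcriptomics research and help reveal more biological phenomena and mechanisms. By combining several techniques, EMMA achieves high performance and high precision in large-scale genome mapping and provides strong support for bioinformatics research.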

ISSN 1672-9145 Acta Biochimica et Biophysica Sinica 2006, 38(12): 857–864 CN 31-1940/Q

©Institute of Biochemistry and Cell Biology, SIBS, CAS

EMMA: An Efficient Massive Mapping Algorithm Using Improved Approximate Mapping Filtering

Xin ZHANG, Zhi-Wei CAO, Zhi-Xin LIN, Qing-Kang WANG*, and Yi-Xue LI

Institute of Micro/Nano Science and Technology, Shanghai Jiaotong University, Shanghai 200030, China;

Shanghai Center for Bioinformation Technology, Shanghai 200235, China;

College of Life Science and Technology, Shanghai Jiaotong University, Shanghai 200030, China;

Bioinformation Center of Shanghai Institutes for Biological Sciences, Chinese Academy of Sciences, Shanghai 200031, China

Abstract Efficient massive mapping algorithm (EMMA), an algorithm for efficiently mapping massive numbers of cDNAs onto genomic sequences, has recently been developed. It improves the mapping process through improved approximate mapping filtering based on an enhanced suffix array coupled with a pruned fast hash table, together with algorithms for block alignment extension and k-longest paths. Compared with the classical BLAT software in this field, EMMA runs two to forty-one times faster at similar prediction precision.

Key words cDNA mapping; maximal exact match; enhanced suffix array; pruned fast hash table; extension algorithm

Received: August 26, 2006 Accepted: October 10, 2006

This work was supported by a grant from the Major State Basic Research Development Program of China (No. 2004CB720103)

*Corresponding authors:
Qing-Kang WANG: Tel/Fax, 86-21-62933290; E-mail, wangqingkang@sjtu.edu.cn
Yi-Xue LI: Tel, 86-21-64836199; Fax, 86-21-64838882; E-mail, yxli@sibs.ac.cn

DOI: 10.1111/j.1745-7270.2006.00237.x

Mapping cDNAs (e.g., mRNAs and ESTs) onto genomic sequences has become a common and potentially powerful technique in genome research. The resulting alignments are often used in gene finding, alternative splicing prediction, and single nucleotide polymorphism studies [1−5]. However, as more and more sequences accumulate, the mapping computation has become increasingly expensive for most researchers. Faster cDNA-genome mapping software is therefore always in demand. Existing tools include SIM4 [6], SPIDEY [7], GENESEQER [8,9], BLAT [10], SQUALL [11], GMAP [12] and ESTMAPPER [13]. Most of these algorithms are derived from BLAST [14] and share a common four-phase framework: first, finding exact matches longer than a given size; second, extending each exact match pair in both directions by an ungapped alignment until the score drops significantly; third, linking the extended matches together to outline the plausible splice patterns; and finally, refining the outlined splicing patterns to produce precise mapping alignments.
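The first two phases of this framework can be illustrated with a minimal Python sketch. It is not EMMA's or BLAT's implementation: the function names `find_seeds` and `extend_ungapped`, the scoring values (`match=1`, `mismatch=-3`), and the `x_drop` stopping threshold are assumptions chosen only to make the idea concrete.

```python
def find_seeds(cdna, genome, w):
    """Phase 1 (illustrative): exact word matches of length w.

    Returns (cdna_pos, genome_pos) pairs whose w-letter words are identical.
    """
    index = {}
    for g in range(len(genome) - w + 1):
        index.setdefault(genome[g:g + w], []).append(g)
    seeds = []
    for c in range(len(cdna) - w + 1):
        for g in index.get(cdna[c:c + w], []):
            seeds.append((c, g))
    return seeds


def extend_ungapped(cdna, genome, c, g, w, match=1, mismatch=-3, x_drop=5):
    """Phase 2 (illustrative): ungapped extension of one seed in both directions.

    Extension in a direction stops once the running score falls x_drop or more
    below the best score seen; only the best-scoring prefix is kept.
    Returns (cdna_start, genome_start, length) of the extended match.
    """
    def extend(ci, gi, step):
        score = best = 0
        length = best_len = 0
        while 0 <= ci < len(cdna) and 0 <= gi < len(genome):
            score += match if cdna[ci] == genome[gi] else mismatch
            length += 1
            if score > best:
                best, best_len = score, length
            elif best - score >= x_drop:
                break  # the score has dropped significantly
            ci += step
            gi += step
        return best_len

    right = extend(c + w, g + w, +1)   # grow past the end of the seed
    left = extend(c - 1, g - 1, -1)    # grow before the start of the seed
    return (c - left, g - left, left + w + right)
```

For example, with `genome = "TTTTACGTTGCATTTT"`, `cdna = "ACGTTGCA"` and `w = 4`, every seed lies on the same diagonal, and extending any one of them recovers the full 8-letter match starting at genome position 4. Linking and refinement (phases three and four) are not covered by this sketch.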

Despite their various implementation methodologies, the first step in most existing algorithms is implemented by computing the word pairs (exact matches of fixed length) of the cDNA and the genome. To improve the computation speed, the genome is pre-processed by indexing all of its words in a table. Early algorithms like SIM4 and SPIDEY use simple look-up-table (LUT) functions to store these genome words, which require $O(4^w)$ memory, where $w$ is the word size. For large $w$ values, the memory required for the LUT becomes impractical for normal computers. To address this problem, modern cDNA mapping algorithms including BLAT and GMAP mostly use hash tables and only consider non-overlapped words, which not only reduces the size of the word table but also improves the computation speed by thousands of times without significant loss of precision [10,12].
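The memory argument and the non-overlapping word table can be made concrete with a short Python sketch. It is an illustration only, not the code of any of the tools named above. The helper names `lut_entries`, `build_nonoverlapping_index` and `lookup` are hypothetical, and the choice to tile only the genome while still scanning every overlapping cDNA word is an assumption of this sketch.

```python
from collections import defaultdict


def lut_entries(w):
    """Number of slots in a direct look-up table over the alphabet {A, C, G, T}."""
    return 4 ** w


def build_nonoverlapping_index(genome, w):
    """Hash table of genome words taken at positions 0, w, 2w, ... only.

    Only words that actually occur are stored, so the table holds at most
    len(genome) // w positions regardless of how large 4**w is.
    """
    index = defaultdict(list)
    for g in range(0, len(genome) - w + 1, w):
        index[genome[g:g + w]].append(g)
    return index


def lookup(cdna, index, w):
    """Scan every overlapping cDNA word and report (cdna_pos, genome_pos) hits."""
    hits = []
    for c in range(len(cdna) - w + 1):
        # .get() avoids inserting empty lists into the defaultdict on a miss
        for g in index.get(cdna[c:c + w], ()):
            hits.append((c, g))
    return hits
```

With `w = 11` a direct table needs `lut_entries(11) = 4194304` slots, while `w = 16` already needs `4294967296`, which shows how quickly the $O(4^w)$ cost grows. The non-overlapping index stores roughly one position per `w` genome letters, at the price of reporting fewer word hits per true match: for `genome = "TTTTACGTTGCATTTT"` and `cdna = "ACGTTGCA"` with `w = 4`, only two of the five overlapping word matches are found.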

The idea of the word-based method is simple enough, but it requires additional treatment to concatenate neighboring word pairs into longer ones. It has been estimated that such concatenation processes take up to 18% of the entire
