ISSN 1672-9145 Acta Biochimica et Biophysica Sinica 2006, 38(12): 857–864 CN 31-1940/Q
©Institute of Biochemistry and Cell Biology, SIBS, CAS
EMMA: An Efficient Massive Mapping Algorithm Using Improved
Approximate Mapping Filtering
Xin ZHANG
1
, Zhi-Wei CAO
2
, Zhi-Xin LIN
3
, Qing-Kang WANG
1
*, and Yi-Xue LI
4
*
1
Institute of Micro/Nano Science and Technology, Shanghai Jiaotong University, Shanghai 200030, China;
2
Shanghai Center for Bioinformation Technology, Shanghai 200235, China;
3
College of Life Science and Technology, Shanghai Jiaotong University, Shanghai 200030, China;
4
Bioinformation Center of Shanghai Institutes for Biological Sciences, Chinese Academy of Sciences, Shanghai 200031, China
Abstract Efficient massive mapping algorithm (EMMA), an algorithm on efficiently mapping massive
cDNAs onto genomic sequences, has recently been developed. The process of mapping massive cDNAs
onto genomic sequences has been improved using more approximate mapping filtering based on an enhanced
suffix array coupled with a pruned fast hash table, algorithms of block alignment extensions, and k-longest
paths. When compared with the classical BLAT software in this field, the computing of EMMA ranges from
two to forty-one times faster under similar prediction precisions.
Key words cDNA mapping; maximal exact match; enhanced suffix array; pruned fast hash table;
extension algorithm
Received: August 26, 2006 Accepted: October 10, 2006
This work was supported by a grant from the Major State Basic
Research Development Program of China (No. 2004CB720103)
*Corresponding authors:
Qing-Kang WANG: Tel/Fax, 86-21-62933290; E-mail,
wangqingkang@sjtu.edu.cn
Yi-Xue LI: Tel, 86-21-64836199; Fax, 86-21-64838882; E-mail,
yxli@sibs.ac.cn
DOI: 10.1111/j.1745-7270.2006.00237.x
Mapping cDNAs (e.g., mRNAs and ESTs) onto genomic
sequences has become a common and potentially powerful
technique in the field of genome research. The resulting
alignments are often used in fields of gene finding,
alternative splicing prediction and single nucleotide
polymorphism studies [1−5]. However with more and more
sequences accumulated, the mapping computation has
become more and more expensive for most researches.
Faster cDNA-genome mapping software is always highly
expected, some of which are SIM4 [6], SPIDEY [7],
GENESEQER [8,9], BLAT [10], SQUALL [11], GMAP
[12] and ESTMAPPER [13]. Most of these algorithms are
derived from BLAST [14] and featured as a common four-
phase framework: first, finding exact matches longer than
given size; second, extending each exact match pair to
both directions by an ungapped alignment until the score
drops significantly; third, linking the extended matches
together to outline the plausible splice patterns; and finally,
refining the outlined splicing patterns to produce precise
mapping alignments.
Despite their various implementation methodologies, the
first step in most existing algorithms is implemented by
computing the word pairs (exact matches of fixed length)
of the cDNA and the genome. To improve the computation
speed, the genome is pre-processed by indexing all of its
words in a table. Early algorithms like SIM4 and SPIDEY
use simple look-up-table (LUT) functions to store these
genome words, which require O(4
w
) memory, where w is
the word size. For large w values, the memory required
for LUT would become impractical for normal computers.
To address this problem, modern cDNA mapping
algorithms including BLAT and GMAP mostly use hash
tables and only consider non-overlapped words, which
not only reduces the size of word table but also improves
the computation speed by thousands of times without
significant loss of precision [10,12].
The idea of the word-based method is simple enough
but requires additional treatment to concatenate neighboring
word pairs into longer ones. It has been evaluated that
such concatenation processes take up to 18% of the entire
at Jiangxi Agricultural University on December 28, 2013http://abbs.oxfordjournals.org/Downloaded from