A MapReduce-based Algorithm for Motif Search
Hongwei Huo, Shuai Lin, Qiang Yu and Yipu Zhang
School of Computer Science and Technology
Xidian University
Xi’an, 710071, China
{hwhuo, lin_s2009, feqond, zephyr26026}@mail.xidian.edu.cn
Vojislav Stojkovic
Department of Computer Science
Morgan State University
Baltimore, MD 21251, USA
vojislav.stojkovic@morgan.edu
Abstract—Motif search plays an important role in gene finding
and understanding gene regulation relationship. Motif search
is one of the most challenging problems in bioinformatics. In
this paper, we present three data partitions for the PMSP
algorithm and propose the PMSP MapReduce algorithm
(PMSPMR) for solving the motif search problem. For
instances of the problem with different difficulties, the
experimental results on the Hadoop cluster demonstrate that
PMSPMR has good scalability. In particular, for the more
difficult motif search problems, PMSPMR shows its advantage
because the speedup is almost linearly proportional to the
number of nodes in the Hadoop cluster. We also present
experimental results on realistic biological data by identifying
known transcriptional regulatory motifs in eukaryotes as well
as in actual promoter sequences extracted from
Saccharomyces cerevisiae.
Keywords- Motif search; data partition; scalability;
MapReduce; Hadoop
I. INTRODUCTION
Motif search is one of the most challenging problems
in biology, molecular biology, bioinformatics, and
computer science [1]. Motif search in unaligned DNA
sequences plays an important role in gene finding and
understanding gene regulation relationship. Das and Dai [2]
made a survey of the recent developments in DNA motif
search algorithms. Hu et al [3] extended earlier works to
prokaryotic datasets and clarified the limitations and the
potentials of existing motif search algorithms.
DNA motif discovery algorithms can be divided into
two categories based on the combinatorial approach used in
their design, word-based methods that mostly rely on
exhaustive enumeration and probabilistic sequence models
[2]. The enumerative approach is exact and guarantees
finding optimal solutions in the restricted search space. The
probabilistic approach involves representation of the motif
model by a position weight matrix.
Most probabilistic motif discovery algorithms apply
potent statistical techniques such as Expectation
Maximization (EM) and Gibbs sampling and its extensions.
Among the probabilistic methods, Gibbs sampling method
[4] has been used extensively for motif discovery
algorithms. It is initialized by choosing random starting
positions within the various sequences and then proceeds
through much iteration to execute the two steps of the
Gibbs sampler: predictive update step and sampling step.
The MEME algorithm [5] extended the EM algorithm for
identifying motifs in unaligned biopolymer sequences,
aiming to discover new motifs in a set of biopolymer
sequences where little is known in advance about any motif
that may be present. In PROJECTION [6], each l-mer of
input sequences was projected into a smaller space through
the projection template, and then, the EM algorithm is used
to do refinement. GARPS [7] used an efficient hashing-
based random projection strategy for processing input data,
reducing the search space, and then, the genetic algorithm is
used to do refinement.
The word-based enumerative methods guarantee
global optimality. They can be very fast when they are
implemented with optimized data structures such as suffix
trees. Buhler and Tompa [6] defined the challenging
instances of the planted motif search problems, such as (9,
2)-, (11, 3)-, (13, 4)-, (15, 5)-, (17, 6)- and (19, 7)-motif
problems. In WINNOWER [8], the graph-theoretic method
was introduced to motif search for the first time by finding
the maximum clique. However, it cannot handle the (15, 5)-
motif problem due to the huge search space. RISOTTO [9]
was the fastest algorithm in the family of suffix tree
algorithms for solving motif search. Davila [11]
implemented it on the machine with a Pentium4 2.40 GHz
processor and a core memory size of 1 Gbyte, and found
that it is capable of solving the (17, 6) challenging instance
in 12 hours. Recently, the research of exact motif search
algorithms mainly concentrated on the pattern-driven
method. PMSP [10] tracked the following simple idea. For
every l-mer x in the first sequence it generates d-neighbors
of x and tries to guess if an l-mer y in that neighborhood is a
motif by checking whether there is any l-mer in s
i
for i =
2,…,t at the distance d from it. PMSprune [11] was an
improvement over PMSP by using the branch and bound
strategy.
Although the exact enumeration is an advantage of
these methods, one limitation is that searching for long
patterns is computationally expensive, and an exhaustive
search through the sequence space of 4
L
words often
becomes impractical for L > 10 [8].
For different biological characteristics of the
regulatory elements, several researchers have developed
new methods/techniques such as parallel algorithms or
2012 IEEE 26th International Parallel and Distributed Processing Symposium Workshops
978-0-7695-4676-6/12 $26.00 © 2012 IEEE
DOI 10.1109/IPDPSW.2012.255
2063
2012 IEEE 26th International Parallel and Distributed Processing Symposium Workshops & PhD Forum
978-0-7695-4676-6/12 $26.00 © 2012 IEEE
DOI 10.1109/IPDPSW.2012.255
2046
2012 IEEE 26th International Parallel and Distributed Processing Symposium Workshops & PhD Forum
978-0-7695-4676-6/12 $26.00 © 2012 IEEE
DOI 10.1109/IPDPSW.2012.255
2052