A Motif Search Algorithm Implemented with MapReduce


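This research paper explores a MapReduce-based motif search algorithm aimed at the motif search problem in bioinformatics. Motif search plays an important role in gene finding and in understanding gene regulatory relationships, and it is one of the most challenging problems in bioinformatics. The paper proposes the PMSP MapReduce (PMSPMR) algorithm, which uses the MapReduce framework and data partitioning to parallelize the PMSP algorithm and applies to motif search problems of different difficulty. Experiments on a Hadoop cluster show that PMSPMR scales well; for the more difficult motif search tasks in particular, its speedup is almost linearly proportional to the number of nodes in the cluster. Experiments on real biological data, in which the algorithm identified known transcription factor motifs, further support its effectiveness in practice.

In bioinformatics, a motif is a short sequence pattern that recurs across multiple nucleic acid or protein sequences and is usually associated with a specific biological function. The goal of motif search is to find these patterns and so help scientists understand the mechanisms of gene expression and regulation. Because sequence data is large and complex, the problem is computationally very challenging.

MapReduce is a distributed computing model proposed by Google for processing and generating large data sets. It breaks a large job into two phases: the Map phase splits the input data and processes the pieces in parallel on the worker nodes, and the Reduce phase aggregates and merges the results of the Map phase.

The paper's PMSP MapReduce algorithm adapts the original PMSP algorithm to a distributed environment through three data partitioning strategies. Experiments on a Hadoop cluster show that, for motif search tasks of varying complexity, PMSPMR makes effective use of multiple nodes, and performance improves markedly as nodes are added. The paper also applies PMSPMR to real biological data; by identifying known transcription factor motifs, it confirms the algorithm's accuracy and practicality on real-world bioinformatics problems. This gives biologists a useful tool and lays groundwork for future work in areas such as gene regulatory network analysis and disease research.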
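The Map and Reduce phases described in this summary can be illustrated with a small in-process sketch. It is not Hadoop code and it is not the PMSPMR data partitioning from the paper: the function names (`map_phase`, `shuffle`, `reduce_phase`, `run_job`) and the l-mer counting task are assumptions, chosen only to show how map output is grouped by key before it is reduced.

```python
from collections import defaultdict


def map_phase(record, l):
    """Map: emit (l-mer, 1) for every length-l window of one input sequence."""
    return [(record[i:i + l], 1) for i in range(len(record) - l + 1)]


def shuffle(pairs):
    """Group mapped values by key, as the framework does between the two phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values emitted for one key."""
    return key, sum(values)


def run_job(records, l):
    """Run map, shuffle and reduce sequentially (a real cluster runs maps in parallel)."""
    mapped = [pair for record in records for pair in map_phase(record, l)]
    return dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())


# Hypothetical toy input: two short DNA strings, counting 3-mers.
counts = run_job(["ACGTAC", "GTACGT"], 3)
```

In a real deployment each call to `map_phase` would run on a different worker, which is why the map function only ever sees one record at a time.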

A MapReduce-based Algorithm for Motif Search

Hongwei Huo, Shuai Lin, Qiang Yu and Yipu Zhang

School of Computer Science and Technology

Xidian University

Xi’an, 710071, China

{hwhuo, lin_s2009, feqond, zephyr26026}@mail.xidian.edu.cn

Vojislav Stojkovic

Department of Computer Science

Morgan State University

Baltimore, MD 21251, USA

vojislav.stojkovic@morgan.edu

Abstract—Motif search plays an important role in gene finding and in understanding gene regulatory relationships, and it is one of the most challenging problems in bioinformatics. In this paper, we present three data partitions for the PMSP algorithm and propose the PMSP MapReduce algorithm (PMSPMR) for solving the motif search problem. For instances of the problem with different difficulties, experimental results on a Hadoop cluster demonstrate that PMSPMR has good scalability. In particular, for the more difficult motif search problems, PMSPMR shows its advantage because the speedup is almost linearly proportional to the number of nodes in the Hadoop cluster. We also present experimental results on realistic biological data by identifying known transcriptional regulatory motifs in eukaryotes as well as in actual promoter sequences extracted from Saccharomyces cerevisiae.

Keywords- Motif search; data partition; scalability; MapReduce; Hadoop

I. INTRODUCTION

Motif search is one of the most challenging problems in biology, molecular biology, bioinformatics, and computer science [1]. Motif search in unaligned DNA sequences plays an important role in gene finding and in understanding gene regulatory relationships. Das and Dai [2] surveyed recent developments in DNA motif search algorithms. Hu et al. [3] extended earlier work to prokaryotic datasets and clarified the limitations and potential of existing motif search algorithms.

DNA motif discovery algorithms can be divided into two categories based on the combinatorial approach used in their design: word-based methods that mostly rely on exhaustive enumeration, and probabilistic sequence models [2]. The enumerative approach is exact and guarantees finding optimal solutions in the restricted search space. The probabilistic approach represents the motif model by a position weight matrix.
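As an illustration of the position weight matrix representation, the sketch below builds one from a set of aligned binding sites and scores a candidate l-mer against it. The helper names (`build_pwm`, `log_odds_score`), the pseudocount of 1, the uniform background of 0.25, and the example sites are all assumptions made for this example, not details taken from the cited algorithms.

```python
import math

ALPHABET = "ACGT"


def build_pwm(sites, pseudocount=1.0):
    """Return one dict per motif position, mapping base -> estimated probability."""
    length = len(sites[0])
    if any(len(site) != length for site in sites):
        raise ValueError("all sites must have the same length")
    pwm = []
    for pos in range(length):
        # Start from the pseudocount so unseen bases keep a nonzero probability.
        counts = {base: pseudocount for base in ALPHABET}
        for site in sites:
            counts[site[pos]] += 1
        total = sum(counts.values())
        pwm.append({base: counts[base] / total for base in ALPHABET})
    return pwm


def log_odds_score(pwm, lmer, background=0.25):
    """Sum of per-position log2 ratios of motif probability to background."""
    return sum(math.log2(col[base] / background) for col, base in zip(pwm, lmer))


# Hypothetical aligned sites, used only to exercise the functions.
sites = ["ACGT", "ACGA", "ACGT", "TCGT"]
pwm = build_pwm(sites)
consensus_score = log_odds_score(pwm, "ACGT")
background_like_score = log_odds_score(pwm, "TTTT")
```

An l-mer that matches the consensus scores higher than one that does not, which is the property the probabilistic methods exploit when refining a motif model.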

Most probabilistic motif discovery algorithms apply powerful statistical techniques such as Expectation Maximization (EM) and Gibbs sampling and its extensions. Among the probabilistic methods, the Gibbs sampling method [4] has been used extensively in motif discovery algorithms. It is initialized by choosing random starting positions within the various sequences and then proceeds through many iterations of the two steps of the Gibbs sampler: the predictive update step and the sampling step. The MEME algorithm [5] extended the EM algorithm to identify motifs in unaligned biopolymer sequences, aiming to discover new motifs in a set of biopolymer sequences where little is known in advance about any motif that may be present. In PROJECTION [6], each l-mer of the input sequences is projected into a smaller space through a projection template, and the EM algorithm is then used for refinement. GARPS [7] uses an efficient hashing-based random projection strategy to process the input data and reduce the search space, and then applies a genetic algorithm for refinement.

Word-based enumerative methods guarantee global optimality. They can be very fast when implemented with optimized data structures such as suffix trees. Buhler and Tompa [6] defined the challenging instances of the planted motif search problem, such as the (9, 2)-, (11, 3)-, (13, 4)-, (15, 5)-, (17, 6)- and (19, 7)-motif problems. WINNOWER [8] introduced the graph-theoretic method to motif search for the first time, by finding the maximum clique. However, it cannot handle the (15, 5)-motif problem due to the huge search space. RISOTTO [9] was the fastest algorithm in the family of suffix tree algorithms for motif search. Davila [11] ran it on a machine with a 2.40 GHz Pentium 4 processor and 1 GB of main memory and found that it can solve the (17, 6) challenging instance in 12 hours. Recently, research on exact motif search algorithms has concentrated mainly on the pattern-driven method. PMSP [10] follows a simple idea: for every l-mer x in the first sequence, it generates the d-neighbors of x and tests whether an l-mer y in that neighborhood is a motif by checking whether there is an l-mer in s_i, for i = 2, …, t, at distance at most d from y. PMSprune [11] improved on PMSP by using a branch and bound strategy.
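The pattern-driven idea attributed to PMSP above can be sketched in a few lines of Python. This is a brute-force illustration under stated assumptions, not the published implementation: it uses Hamming distance, treats "d-neighbors" as all l-mers within distance d, omits every pruning step, and the function names (`hamming`, `neighbors`, `occurs_within`, `motif_search`) and the toy sequences are invented for the example.

```python
from itertools import combinations, product


def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def neighbors(x, d, alphabet="ACGT"):
    """All strings within Hamming distance d of x (x itself included)."""
    result = {x}
    for k in range(1, d + 1):
        for positions in combinations(range(len(x)), k):
            for letters in product(alphabet, repeat=k):
                y = list(x)
                for p, c in zip(positions, letters):
                    y[p] = c
                result.add("".join(y))
    return result


def occurs_within(y, seq, d):
    """True if some l-mer of seq is within distance d of y."""
    l = len(y)
    return any(hamming(y, seq[i:i + l]) <= d for i in range(len(seq) - l + 1))


def motif_search(seqs, l, d):
    """Candidates come from the first sequence; each is checked against the rest."""
    first, rest = seqs[0], seqs[1:]
    motifs = set()
    for i in range(len(first) - l + 1):
        for y in neighbors(first[i:i + l], d):
            if y not in motifs and all(occurs_within(y, s, d) for s in rest):
                motifs.add(y)
    return motifs


# Hypothetical toy instance: "ACGTA" planted with at most one mutation per sequence.
seqs = ["TTACGTATT", "GGACCTAGG", "CCAGGTACC"]
found = motif_search(seqs, l=5, d=1)
```

Because the candidate l-mers of the first sequence are independent of one another, the outer loop of `motif_search` is the natural place to split work across parallel tasks.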

Although exact enumeration is an advantage of these methods, one limitation is that searching for long patterns is computationally expensive, and an exhaustive search through the sequence space of 4^L words often becomes impractical for L > 10 [8].
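To make the growth concrete, the following sketch tabulates the number of candidate words of length L over the four-letter DNA alphabet, 4^L, next to the number of words within Hamming distance d of a single l-mer, which is the sum over i = 0..d of C(l, i) * 3^i. The function names and the chosen (l, d) pairs are assumptions for illustration.

```python
from math import comb


def pattern_space(l, alphabet_size=4):
    """Number of distinct words of length l over the alphabet."""
    return alphabet_size ** l


def neighborhood_size(l, d, alphabet_size=4):
    """Number of words within Hamming distance d of one fixed word of length l."""
    return sum(comb(l, i) * (alphabet_size - 1) ** i for i in range(d + 1))


# A few of the challenging (l, d) instances named in the text.
for l, d in [(9, 2), (11, 3), (13, 4), (15, 5)]:
    print(f"l={l:2d} d={d}: 4^l = {pattern_space(l):>13,d}, "
          f"d-neighborhood = {neighborhood_size(l, d):>9,d}")
```

The d-neighborhood of one l-mer is far smaller than the full pattern space, which is why pattern-driven methods enumerate neighborhoods of l-mers that actually occur in the input rather than all 4^L words.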

For different biological characteristics of the

regulatory elements, several researchers have developed

new methods/techniques such as parallel algorithms or

2012 IEEE 26th International Parallel and Distributed Processing Symposium Workshops

DOI 10.1109/IPDPSW.2012.255

