PairMotif+: An Efficient Algorithm for De Novo Motif Discovery in DNA Sequences

Research paper


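PairMotif+ is an efficient algorithm for de novo motif discovery in DNA sequences. It was published in 2013 in the International Journal of Biological Sciences, volume 9, issue 4, with DOI 10.7150/ijbs.5786. The paper was written by Qiang Yu, Hongwei Huo (corresponding author), Yipu Zhang, Hongzhi Guo, and Haitao Guo of the School of Computer Science and Technology, Xidian University, Xi'an; their telephone, fax, and email details are given in the paper. The title, "PairMotif+: A Fast and Effective Algorithm for De novo Motif Discovery in DNA Sequences", states the core topic: finding motifs, that is, short recurring sequence patterns with a particular structure or function, in DNA sequences. "De novo" emphasizes that the discovery does not rely on known sequences or prior knowledge, which matters in genomics and biology because it helps identify previously unknown regulatory elements and genetic markers.

Planted motif search (PMS) is an important problem in this area: it asks for a pattern that has been planted in otherwise random sequence data. PairMotif+ was introduced to address this problem with improved efficiency, which is of practical value for analyzing large-scale DNA data, for example when locating gene regulatory regions, comparing sequence similarity across species, or studying the mechanisms that regulate gene expression. According to the abstract, the main contribution is a new search strategy that processes large amounts of DNA sequence data in less time while improving both accuracy and speed. In the page excerpted here, the algorithm extracts pairs of l-mers from different input sequences and traverses the candidate motifs derived from each pair. The manuscript was received in January 2013, accepted on April 15, and published on April 29.

For genomics researchers, PairMotif+ may mean faster mining of potential biological signals, with less experimental time and cost, and a better understanding of DNA sequence patterns. The article is open access, so its results can be widely distributed and used, subject to the terms of its Creative Commons license.

In short, PairMotif+ gives bioinformatics researchers a practical tool for finding key sequence patterns hidden in large volumes of data.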


Int. J. Biol. Sci. 2013, Vol. 9

http://www.ijbs.com

414

… $P_{00}(x_1, x_2, y)$, $P_{01}(x_1, x_2, y)$, $P_{10}(x_1, x_2, y)$ and $P_{11}(x_1, x_2, y)$. For each position $i$ ($1 \le i \le l$), assume that it belongs to $P_{ab}(x_1, x_2, y)$. Then, $a$ is 1 if $x_1[i] = x_2[i]$, 0 otherwise; $b$ is 1 if either $y[i] = x_1[i]$ or $y[i] = x_2[i]$, 0 otherwise. Fig. 1 shows an example for partitioning the positions in the alignment of three l-mers.

Fig. 1 An example for partitioning positions in the alignment of three l-mers.
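The position partition can be sketched in a few lines of Python. This is only an illustration of the rule for $a$ and $b$ stated above; the function name `partition_positions` and the three toy l-mers are assumptions made for this example and do not come from the paper or from Fig. 1.

```python
def partition_positions(x1: str, x2: str, y: str) -> dict[str, list[int]]:
    """Assign each 1-based position i to P_ab, following the rule in the text.

    a = 1 if x1[i] == x2[i], else 0
    b = 1 if y[i] equals x1[i] or x2[i], else 0
    """
    if not (len(x1) == len(x2) == len(y)):
        raise ValueError("all three l-mers must have the same length")
    sets: dict[str, list[int]] = {"P00": [], "P01": [], "P10": [], "P11": []}
    for i, (c1, c2, cy) in enumerate(zip(x1, x2, y), start=1):
        a = 1 if c1 == c2 else 0
        b = 1 if cy in (c1, c2) else 0
        sets[f"P{a}{b}"].append(i)
    return sets


# Hypothetical l-mers of length 5, chosen so that every set is non-empty.
parts = partition_positions("ACGTA", "ACCTT", "AGGTC")
print(parts)
```

Every position lands in exactly one of the four sets, so the sets are disjoint and together cover positions 1 to $l$.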

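Definition 3. Given a pair of l-mers $x_1$ and $x_2$ and another l-mer $y \in M_d(x_1, x_2)$, the mapping relation from $x_1$ and $x_2$ to $y$, $R(x_1, x_2, y)$, is defined to be the 2-tuple $\langle |P_{10}(x_1, x_2, y)|, |P_{00}(x_1, x_2, y)| \rangle$. Furthermore, the mapping relation from $x_1$ and $x_2$ to $M_d(x_1, x_2)$, $R(x_1, x_2)$, is defined to be the set of all tuples $R(x_1, x_2, y)$ taken over $y \in M_d(x_1, x_2)$ (equation (1)).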
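As a rough illustration of Definition 3, the Python sketch below computes the tuple for a single $y$ and then builds $R(x_1, x_2)$ by brute force over all $4^l$ strings. The function names and the toy l-mers are assumptions made for this example. Exhaustive enumeration is only feasible for small $l$ and is not how the paper computes $R(x_1, x_2)$.

```python
from itertools import product


def hamming(u: str, v: str) -> int:
    """Hamming distance between two equal-length strings."""
    return sum(a != b for a, b in zip(u, v))


def mapping_relation(x1: str, x2: str, y: str) -> tuple[int, int]:
    """Return (|P10|, |P00|) for the alignment of x1, x2 and y."""
    alpha = sum(1 for a, b, c in zip(x1, x2, y) if a == b and c != a)
    beta = sum(1 for a, b, c in zip(x1, x2, y) if a != b and c not in (a, b))
    return alpha, beta


def mapping_relation_set(x1: str, x2: str, d: int, alphabet: str = "ACGT") -> set:
    """Collect R(x1, x2, y) over every common d-neighbor y (brute force)."""
    result = set()
    for chars in product(alphabet, repeat=len(x1)):
        y = "".join(chars)
        if hamming(y, x1) <= d and hamming(y, x2) <= d:
            result.add(mapping_relation(x1, x2, y))
    return result


# Hypothetical toy instance: l = 5, d = 2, Hamming distance 2 between x1 and x2.
print(mapping_relation("ACGTA", "ACCTT", "AGGTC"))
print(sorted(mapping_relation_set("ACGTA", "ACCTT", 2)))
```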

Given a pair of l-mers $x_1$ and $x_2$, the elements in $R(x_1, x_2)$ imply the approach to partitioning and traversing the candidate motif set $M_d(x_1, x_2)$. We first discuss how to compute $R(x_1, x_2)$. For any candidate motif $y$ in $M_d(x_1, x_2)$, let $R(x_1, x_2, y) = \langle \alpha, \beta \rangle$. From Definitions 2 and 3, $\alpha$ represents the number of positions at which $x_1[\cdot] = x_2[\cdot]$, $y[\cdot] \ne x_1[\cdot]$ and $y[\cdot] \ne x_2[\cdot]$; $\beta$ represents the number of positions at which $x_1[\cdot] \ne x_2[\cdot]$, $y[\cdot] \ne x_1[\cdot]$ and $y[\cdot] \ne x_2[\cdot]$. Thus, we have $0 \le \alpha \le l - d_H(x_1, x_2)$, $0 \le \beta \le d_H(x_1, x_2)$ and $d_H(y, x_1) + d_H(y, x_2) = 2\alpha + 2\beta + (d_H(x_1, x_2) - \beta)$. Furthermore, we have $d_H(y, x_1) + d_H(y, x_2) \le 2d$ because $y$ is a d-neighbor of both $x_1$ and $x_2$. Combining the last two relations gives $2\alpha + \beta + d_H(x_1, x_2) \le 2d$, which together with the two range constraints forms inequalities (2). For fixed $l$ and $d$, the admissible values of $\alpha$ and $\beta$ therefore depend only on $d_H(x_1, x_2)$, and $R(x_1, x_2)$ can be calculated by listing all 2-tuples $\langle \alpha, \beta \rangle$ satisfying (2). For example, for the PMS instance (15, 4), $R(x_1, x_2)$ = {<0, 0>, <0, 1>, <0, 2>, <0, 3>, <0, 4>, <1, 0>, <1, 1>, <1, 2>, <2, 0>} when $d_H(x_1, x_2) = 4$.
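Listing the admissible tuples is a small double loop. The sketch below assumes the three constraints derived in the paragraph above ($0 \le \alpha \le l - d_H$, $0 \le \beta \le d_H$ and $2\alpha + \beta + d_H \le 2d$). The function name is hypothetical, and `dh` stands for $d_H(x_1, x_2)$.

```python
def enumerate_mapping_relations(l: int, d: int, dh: int) -> list[tuple[int, int]]:
    """List every <alpha, beta> allowed by the three constraints.

    l  : motif length
    d  : maximum number of mutations
    dh : Hamming distance between the two l-mers of the pair
    """
    tuples = []
    for alpha in range(l - dh + 1):          # 0 <= alpha <= l - dh
        for beta in range(dh + 1):           # 0 <= beta <= dh
            if 2 * alpha + beta + dh <= 2 * d:
                tuples.append((alpha, beta))
    return tuples


# The PMS instance (15, 4) with dh = 4, as in the example in the text.
print(enumerate_mapping_relations(15, 4, 4))
```

For this instance the loop returns the same nine tuples as the example in the text.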

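Based on the different 2-tuples in $R(x_1, x_2)$, the candidate motif set $M_d(x_1, x_2)$ can be partitioned into $|R(x_1, x_2)|$ mutually disjoint subsets. For each $\langle \alpha, \beta \rangle$ in $R(x_1, x_2)$, the corresponding subset of $M_d(x_1, x_2)$ is denoted by $M_{d\langle \alpha, \beta \rangle}(x_1, x_2)$, namely $M_{d\langle \alpha, \beta \rangle}(x_1, x_2) = \{y : y \in M_d(x_1, x_2) \text{ and } R(x_1, x_2, y) = \langle \alpha, \beta \rangle\}$. Assume that $\langle \alpha, \beta \rangle$ and $\langle \alpha', \beta' \rangle$ are two different elements of $R(x_1, x_2)$; then we have $M_{d\langle \alpha, \beta \rangle}(x_1, x_2) \cap M_{d\langle \alpha', \beta' \rangle}(x_1, x_2) = \emptyset$ according to Definition 3. Since $R(x_1, x_2)$ represents the mapping relation from $x_1$ and $x_2$ to all candidate motifs, the partition of $M_d(x_1, x_2)$ is:

$$M_d(x_1, x_2) = \bigcup_{\langle \alpha, \beta \rangle \in R(x_1, x_2)} M_{d\langle \alpha, \beta \rangle}(x_1, x_2) \tag{3}$$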

In terms of equation (3), we can traverse the candidate motifs derived from $x_1$ and $x_2$ by generating the mutually disjoint subsets of $M_d(x_1, x_2)$ one by one. For each $\langle \alpha, \beta \rangle$ in $R(x_1, x_2)$, the candidate motifs in $M_{d\langle \alpha, \beta \rangle}(x_1, x_2)$ are generated as follows. First, set the initial candidate motif $y$ as $x_1$. Second, select $\alpha$ positions from the positions at which $x_1[\cdot] = x_2[\cdot]$, and for each of these $\alpha$ positions, change $y[\cdot]$ to one of the three characters different from $x_1[\cdot]$. Third, select $\beta$ positions from the positions at which $x_1[\cdot] \ne x_2[\cdot]$, and for each of these $\beta$ positions, change $y[\cdot]$ to one of the two characters different from $x_1[\cdot]$ and $x_2[\cdot]$. Fourth, select some of the positions at which $x_1[\cdot] \ne x_2[\cdot]$, excluding those selected in the previous step, and change $y[\cdot]$ to $x_2[\cdot]$ at each of them; the number of positions selected here must keep $y$ within Hamming distance $d$ of both $x_1$ and $x_2$. More details about these steps can be found in reference [29]. According to the process of generating candidate motifs, the size of $M_{d\langle \alpha, \beta \rangle}(x_1, x_2)$ is calculated by (4), where $d_H$ denotes the Hamming distance between $x_1$ and $x_2$.
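The four generation steps can be sketched as a Python generator, together with the count they imply. Several things here are assumptions made for illustration: the function names, the toy l-mers, and the bound on the number $i$ of positions switched to $x_2$ in the fourth step, $\max(0, d_H + \alpha - d) \le i \le \min(d_H - \beta, d - \alpha - \beta)$. That bound is derived from requiring $d_H(y, x_1) = \alpha + \beta + i \le d$ and $d_H(y, x_2) = \alpha + d_H - i \le d$. The paper's own procedure is detailed in its reference [29].

```python
from itertools import combinations, product
from math import comb


def subset_size(l: int, d: int, dh: int, alpha: int, beta: int) -> int:
    """Count implied by the four steps (expects 0 <= beta <= dh, alpha <= l - dh)."""
    lo = max(0, dh + alpha - d)
    hi = min(dh - beta, d - alpha - beta)
    inner = sum(comb(dh - beta, i) for i in range(lo, hi + 1))
    return comb(l - dh, alpha) * 3**alpha * comb(dh, beta) * 2**beta * inner


def generate_subset(x1, x2, d, alpha, beta, alphabet="ACGT"):
    """Yield each y with R(x1, x2, y) = <alpha, beta> that is a common d-neighbor."""
    same = [p for p in range(len(x1)) if x1[p] == x2[p]]
    diff = [p for p in range(len(x1)) if x1[p] != x2[p]]
    dh = len(diff)
    lo = max(0, dh + alpha - d)
    hi = min(dh - beta, d - alpha - beta)
    for a_pos in combinations(same, alpha):            # step 2: pick alpha positions
        a_opts = [[c for c in alphabet if c != x1[p]] for p in a_pos]
        for b_pos in combinations(diff, beta):         # step 3: pick beta positions
            b_opts = [[c for c in alphabet if c not in (x1[p], x2[p])]
                      for p in b_pos]
            rest = [p for p in diff if p not in b_pos]
            for i in range(lo, hi + 1):                # step 4: how many to switch
                for c_pos in combinations(rest, i):
                    for a_chars in product(*a_opts):
                        for b_chars in product(*b_opts):
                            y = list(x1)               # step 1: start from x1
                            for p, c in zip(a_pos, a_chars):
                                y[p] = c
                            for p, c in zip(b_pos, b_chars):
                                y[p] = c
                            for p in c_pos:
                                y[p] = x2[p]
                            yield "".join(y)


# Hypothetical toy instance: l = 5, d = 2, dh = 2.
x1, x2, d = "ACGTA", "ACCTT", 2
for alpha, beta in [(0, 0), (0, 1), (0, 2), (1, 0)]:
    generated = list(generate_subset(x1, x2, d, alpha, beta))
    print((alpha, beta), len(generated), subset_size(len(x1), d, 2, alpha, beta))
```

For this toy input the four subsets contain 4, 8, 4 and 18 candidates, 34 in total, and the generated counts agree with `subset_size`.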

Step 1: Extracting Pairs of l-mers

PairMotif+ only extracts pairs of l-mers whose two l-mers $x_1$ and $x_2$ come from different sequences, i.e., $x_1 \in_l s_i$, $x_2 \in_l s_j$ and $i \ne j$, where $x \in_l s$ means that $x$ is an l-mer of sequence $s$. Thus, the pair of l-mers $x_1$ and $x_2$ can be denoted by $(x_1, x_2)$ if $i < j$, $(x_2, x_1)$ otherwise. The set of all pairs of l-mers in the input sequences $S$ is denoted by $L = \{(x_1, x_2) : 1 \le i < j \le t,\ x_1 \in_l s_i \text{ and } x_2 \in_l s_j\}$.

The aim of Step 1 is to extract as few pairs of l-mers as possible from $L$ while ensuring that sufficiently many (more than half) of the pairs of motif instances are included among them. We set a threshold $k$ ($0 \le k \le l$), and then extract the pairs of l-mers $(x_1, x_2)$ from $L$ with $d_H(x_1, x_2) \le k$. The set of the extracted pairs of l-mers is denoted by $L_k = \{(x_1, x_2) : (x_1, x_2) \in L \text{ and } d_H(x_1, x_2) \le k\}$.
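A direct way to picture Step 1 is a brute-force scan over every pair of l-mers taken from two different sequences, keeping those within Hamming distance $k$. The sketch below is only that naive scan. The function names and the toy sequences are assumptions, and the paper may organize this extraction differently.

```python
from itertools import combinations


def hamming(u: str, v: str) -> int:
    """Hamming distance between two equal-length strings."""
    return sum(a != b for a, b in zip(u, v))


def lmers(seq: str, l: int) -> list[str]:
    """All length-l substrings of seq, in order of occurrence."""
    return [seq[p:p + l] for p in range(len(seq) - l + 1)]


def extract_pairs(sequences: list[str], l: int, k: int) -> list[tuple[str, str]]:
    """Pairs (x1, x2) with x1 from s_i, x2 from s_j, i < j, and distance <= k."""
    pairs = []
    # combinations over the enumerated sequences yields each (i, j) with i < j once
    for (_, s_i), (_, s_j) in combinations(enumerate(sequences), 2):
        for x1 in lmers(s_i, l):
            for x2 in lmers(s_j, l):
                if hamming(x1, x2) <= k:
                    pairs.append((x1, x2))
    return pairs


# Hypothetical input: three short sequences, l = 4.
seqs = ["ACGTAC", "ACGTTT", "GGGACG"]
print(extract_pairs(seqs, 4, 0))
print(len(extract_pairs(seqs, 4, 4)))
```

With $k = l$ nothing is filtered and the result is the whole set $L$; lowering $k$ shrinks the set of pairs that later steps have to process.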

For a proper choice of the threshold $k$, we consider two probabilities. One is the probability that the Hamming distance between two random l-mers is less than or equal to $k$, denoted by $p_k$. The other is the probability that the Hamming distance between two …



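$$R(x_1, x_2) = \bigcup_{y \in M_d(x_1, x_2)} \{R(x_1, x_2, y)\} \tag{1}$$

$$\begin{cases} 2\alpha + \beta + d_H(x_1, x_2) \le 2d, \\ 0 \le \alpha \le l - d_H(x_1, x_2), \\ 0 \le \beta \le d_H(x_1, x_2). \end{cases} \tag{2}$$

$$|M_{d\langle \alpha, \beta \rangle}(x_1, x_2)| = \binom{l - d_H}{\alpha} \times 3^{\alpha} \times \binom{d_H}{\beta} \times 2^{\beta} \times \sum_{\substack{d_H + \alpha - d \,\le\, i \,\le\, d - \alpha - \beta \\ 0 \,\le\, i \,\le\, d_H - \beta}} \binom{d_H - \beta}{i} \tag{4}$$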
