Int. J. Biol. Sci. 2013, Vol. 9
http://www.ijbs.com
P
00
(x
1
, x
2
, y), P
01
(x
1
, x
2
, y), P
10
(x
1
, x
2
, y) and P
11
(x
1
, x
2
, y).
For each position i (1 ≤ i ≤ l), assume that it belongs to
P
ab
(x
1
, x
2
, y). Then, a is 1 if x
1
[i] = x
2
[i], 0 otherwise; b is
1 if either y[i] = x
1
[i] or y[i] = x
2
[i], 0 otherwise. Fig. 1
shows an example for partitioning the positions in the
alignment of three l-mers.
Fig. 1 An example for partitioning positions in the alignment of
three l-mers.
Definition 3. Given a pair of l-mers x
1
and x
2
and
another l-mer y ∈ M
d
(x
1
, x
2
), the mapping relation
from x
1
and x
2
to y, R(x
1
, x
2
, y), is defined to be a
2-tuple <|P
10
(x
1
, x
2
, y)|, |P
00
(x
1
, x
2
, y)|>. Furthermore,
the mapping relation from x
1
and x
2
to M
d
(x
1
, x
2
), R(x
1
,
x
2
), is defined to be
…(1)
Given a pair of l-mers x
1
and x
2
, the elements in
R(x
1
, x
2
) implies the approach to partitioning and
traversing the candidate motif set M
d
(x
1
, x
2
). We first
discuss how to compute R(x
1
, x
2
). For any candidate
motif y in M
d
(x
1
, x
2
), let R(x
1
, x
2
, y) = <α, β>. From
Definition 2 and 3, α represents the number of posi-
tions at which x
1
[·] = x
2
[·], y[·] ≠ x
1
[·] and y[·] ≠ x
2
[·]; β
represents the number of positions at which x
1
[·] ≠
x
2
[·], y[·] ≠ x
1
[·] and y[·] ≠ x
2
[·]. Thus, we have 0 ≤ α ≤ l -
d
H
(x
1
, x
2
), 0 ≤ β ≤ d
H
(x
1
, x
2
) and d
H
(y, x
1
) + d
H
(y, x
2
) = 2α
+ 2β + (d
H
(x
1
, x
2
) -β). Furthermore, we have d
H
(y, x
1
) +
d
H
(y, x
2
) ≤ 2d because y is the d-neighbor of both x
1
and
x
2
. Based on these considerations, we obtain inequali-
ties (2). Obviously, the values of α and β are deter-
mined by d
H
(x
1
, x
2
), and R(x
1
, x
2
) can be calculated by
listing all 2-tuples <α, β> satisfying (2). For example,
for the PMS instance (15, 4), R(x
1
, x
2
) = {<0, 0>, <0, 1>,
<0, 2>, <0, 3>, <0, 4>, <1, 0>, <1, 1>, <1, 2>, <2, 0>}
when d
H
(x
1
, x
2
) = 4.
…(2)
Based on the different 2-tuples in R(x
1
, x
2
), the
candidate motif set M
d
(x
1
, x
2
) can be partitioned to
|R(x
1
, x
2
)| mutually disjoint subsets. For each <α, β>
in R(x
1
, x
2
), the corresponding subset of M
d
(x
1
, x
2
) is
denoted by M
d<α, β>
(x
1
, x
2
), namely M
d<α, β>
(x
1
, x
2
) = {y:
y∈M
d
(x
1
, x
2
) and R(x
1
, x
2
, y) = <α, β>}. Assume that <α,
β> and <α', β'> are two different elements of R(x
1
, x
2
),
then we have M
d<α, β>
(x
1
, x
2
) ∩ M
d<α', β'>
(x
1
, x
2
) = Ф ac-
cording to Definition 3. Since R(x
1
, x
2
) represents the
mapping relation from x
1
and x
2
to all candidate mo-
tifs, the partition of M
d
(x
1
, x
2
) is:
M
d
(x
1
, x
2
) = {M
d<α, β>
(x
1
, x
2
): <α, β>∈R(x
1
, x
2
)} …(3)
In terms of equation (3), we can traverse the
candidate motifs derived from x
1
and x
2
, by generat-
ing the mutually disjoint subsets of M
d
(x
1
, x
2
) one by
one. For each <α, β> in R(x
1
, x
2
), the candidate motifs
in M
d<α, β>
(x
1
, x
2
) are generated as follows. First, set the
initial candidate motif y as x
2
. Second, select α posi-
tions from the positions at which x
1
[·] = x
2
[·], and for
each of these α positions, change y[·] to one of the
three characters different from x
1
[·]. Third, select β
positions from the positions at which x
1
[·] ≠ x
2
[·], and
for each of these β positions, change y[·] to one of the
two characters different from x
1
[·] and x
2
[·]. Fourth,
select a part of positions from the positions at which
x
1
[·] ≠ x
2
[·] except for those selected in the previous
step, and change y[·] to x
1
[·] for each of these posi-
tions. More details about these steps can be found in
the reference [29]. According to the process of gener-
ating candidate motifs, the size of M
d<α, β>
(x
1
, x
2
) is
calculated by (4) where d
H
denotes the Hamming
distance between x
1
and x
2
.
…(4)
Step 1: Extracting Pairs of l-mers
PairMotif+ only extracts the pair of l-mers that
contains two l-mers x
1
and x
2
coming from different
sequences, i.e., x
1
∈
l
s
i
, x
2
∈
l
s
j
and i ≠ j. Thus, the pair of
l-mers x
1
and x
2
can be denoted by (x
1
, x
2
) if i < j, (x
2
,
x
1
) otherwise. The set of all pairs of l-mers in input
sequences S is denoted by L = {(x
1
, x
2
): (∀i, j)(1 ≤ i < j ≤
t, x
1
∈
l
s
i
and x
2
∈
l
s
j
)}.
The aim of Step 1 is to extract as few pairs of
l-mers as possible from L, and ensure that sufficient
(more than half of) pairs of motif instances are in-
cluded in them. We set a threshold k (0 ≤ k ≤ l), and
then extract the pairs of l-mers (x
1
, x
2
) from L with
d
H
(x
1
, x
2
) ≤ k. The set of the extracted pairs of l-mers is
denoted by L
1
={(x
1
, x
2
): (x
1
, x
2
)∈L and d
H
(x
1
, x
2
) ≤ k}.
For a proper choice of the threshold k, we con-
sider two probabilities. One is the probability that the
Hamming distance between two random l-mers is less
than or equal to k, denoted by p
k
. The other is the
probability that the Hamming distance between two
),(
2121
2
1
)},
,({),(
xxMy
d
yxx
RxxR
∈
=
≤≤
−≤≤
≤++
).,(0
),,(0
,2)
,(2
21
21
21
xx
d
xx
dl
dxxd
H
H
H
β
α
βα
∑
−−≤≤−+
−≤≤
>
<
−
××
××
−
=
βαα
β
βα
βα
β
βα
HH
H
didd
di
HHH
d
i
dddl
xxM
,0
21,
2
3),
(