420 L. Ilie et al. / Journal of Discrete Algorithms 8 (2010) 418–428
The LCE problem is: given a string s and a set of pairs (i, j), compute LCE(i, j) for each pair. It can be solved by prepro-
cessing the string s in linear time so that each
LCE(i, j) is computed in constant time. The first solution uses constant-time
computation of the Lowest Common Ancestor [8,23,2,1] applied to the suffix tree; see an example in Fig. 1. The second,
more efficient, uses constant-time computation of Range Minimum Queries (RMQ) in arrays [2,1,4,5] applied to the
LCP
array. In general, we have LCE(i, j) = RMQ
LCP
(SA
−1
[i]+1, SA
−1
[ j]). Note the need for the inverse suffix array SA
−1
;an
example is shown in Fig. 1.
We shall denote the LCE algorithm of [5] based on constant-time RMQ computation by RMQ
const
. The practically most
efficient algorithm of [5] computes each
LCE(i, j) in (suboptimal) O(log n) time; it will be denoted by RMQ
log
.
3. Average LCE
We shall assume throughout the paper that the letters of the alphabet A are independent and identically distributed.
The starting point of our approach is the observation that most
LCE values are very small. The main result of this section
estimates the average value of the
LCE over all strings of a given length n,thatis,
Avg_LCE(n,)=
1
n
s∈A
n
1
n
2
1i< jn
LCE
s
(i, j)
.
Theorem 1.
(i) For any
2, lim
n→∞
Avg_LCE(n,)=
1
−1
.
(ii) For any n
2 and 2, Avg_LCE(n,)<
1
−1
.
Proof. Reorganizing the formula for
Avg_LCE(n,) gives
Avg_LCE(n,)=
2
n(n − 1)
n
n
−1
k=1
k
1i< jn−k+1
card
s
LCE
s
(i, j) = k
.
(i) For fixed k, i, j,denoteK
k,i, j
={s | LCE
s
(i, j) = k}. We compute the cardinality of K
k,i, j
. Recall that, in any string
s
∈ K
k,i, j
,wehaves[i ..i + k − 1]=s[ j .. j + k − 1].
(i.1) Assume first that j
n − k.Ifalso j − i k, then there are
k
possibilities for the strings letters contained in the
substrings s
[i ..i + k − 1] and s[ j .. j + k − 1]. The letters right after those, s[i + k] and s[ j + k],canbechosenin( − 1)
different ways as they must be different. There are
n−2(k+1)
possibilities to choose the remaining letters of s.Intotalwe
obtain card
(K
k,i, j
) =
n−k−1
( − 1).
Now, if j
− i < k,thens[i ..i + k − 1]=x
p
x
,with|x|= j − i, x
aprefixofx, and p 1. The letters contained in the
substrings s
[i ..i +k −1] and s[ j .. j + k − 1] are completely determined by x which can be any string out of
j−i
possibilities.
The letter in position j
+ k canbechosenin − 1 ways, since it has to be different from s[i + k]. The remaining letters can
be chosen in
n−(k+ j−i+1)
ways. In total, card(K
k,i, j
) =
n−k−1
( − 1).
(i.2) Assume next j
= n − k + 1. We no longer need the condition that s[i + k]=s[ j + k], as above, since s[ j + k] is
undefined. Therefore, by a reasoning similar to the one above, card
(K
k,i, j
) =
n−k
.
There are
n−k
2
pairs (i, j) verifying (i.1) above and n − k that verify (i.2). Consequently, we obtain (i) as follows:
Avg_LCE(n,) =
2
n(n − 1)
n
n
−1
k=1
k
n − k
2
n−k−1
( − 1) + (n − k)
n−k
=
2
n(n − 1)
n
n
−1
k=1
(n − k)
k
2
k−1
( − 1) + k
k
=
1
n(n − 1)
n
n
−1
k=1
(n − k)
k(k + 1)
k
− k(k − 1)
k−1
=
1
n(n − 1)
n
n(n − 1)
n−1
+
n−1
k=1
k(k − 1)
k−1
=
n
n − 1
1
− 1
−
1
n − 1
+ 1
( − 1)
2
+
1
n(n − 1)
2(
n
− 1)
n
( − 1)
3
n
→∞
−→
1
− 1
.