The Longest Common Extension Problem Revisited and Its Applications to Approximate String Searching

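"This article, published in the Journal of Discrete Algorithms, volume 8 (2010), revisits the longest common extension problem and its applications to approximate string searching. It was written by Lucian Ilie, Gonzalo Navarro, and Liviu Tinta, from the University of Western Ontario in Canada and the Department of Computer Science of the University of Chile. The article discusses in detail how to solve the Longest Common Extension (LCE) problem and presents its use in approximate string searching algorithms."

In string processing, the longest common extension problem is a basic and important one. Given a string and two positions i and j in it, the task is to find the length of the longest substring that starts at both positions, that is, the longest common prefix of the suffixes starting at i and j. It arises as a subproblem in many fundamental string problems. The standard solution preprocesses the string in linear time so that the longest common extension of any pair of positions can be computed in constant time in the worst case.

The article mentions two known solutions, both built on efficient data structures. The first applies constant-time Lowest Common Ancestor queries to the suffix tree. The second, more efficient, applies constant-time Range Minimum Queries to the LCP array, using the suffix array and its inverse. Besides reviewing these existing solutions, the authors start from the observation that most LCE values are very small, and they may also explore new algorithms or improvements aimed at faster queries or lower space usage.

Approximate string searching is the other key topic, particularly for large text collections or data containing spelling errors, mutations, or noise. The goal is to design search algorithms that tolerate a bounded number of differences, measured for example by Levenshtein (edit) distance. Such algorithms are commonly used in bioinformatics, information retrieval, and text mining. The article describes how solutions to the LCE problem are applied to approximate string searching; this may include using LCE information during the search to narrow the search range, reduce the number of comparisons, or improve the performance of existing algorithms.

The paper contributes to the understanding of string algorithms and to tuning search performance in practice, and gives researchers and developers a fresh perspective on the LCE problem and approximate string searching. Keywords: strings, algorithms, longest common extension, approximate string searching.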

420 L. Ilie et al. / Journal of Discrete Algorithms 8 (2010) 418–428

The LCE problem is: given a string $s$ and a set of pairs $(i, j)$, compute $\mathrm{LCE}(i, j)$ for each pair. It can be solved by preprocessing the string $s$ in linear time so that each $\mathrm{LCE}(i, j)$ is computed in constant time. The first solution uses constant-time computation of the Lowest Common Ancestor [8,23,2,1] applied to the suffix tree; see an example in Fig. 1. The second, more efficient, uses constant-time computation of Range Minimum Queries (RMQ) in arrays [2,1,4,5] applied to the LCP array. In general, we have $\mathrm{LCE}(i, j) = \mathrm{RMQ}_{\mathrm{LCP}}(\mathrm{SA}^{-1}[i] + 1, \mathrm{SA}^{-1}[j])$. Note the need for the inverse suffix array $\mathrm{SA}^{-1}$; an example is shown in Fig. 1.

We shall denote the LCE algorithm of [5] based on constant-time RMQ computation by $\mathrm{RMQ}_{\mathrm{const}}$. The practically most efficient algorithm of [5] computes each $\mathrm{LCE}(i, j)$ in (suboptimal) $O(\log n)$ time; it will be denoted by $\mathrm{RMQ}_{\log}$.
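The RMQ-over-LCP formulation can be illustrated with a minimal Python sketch. It is not the data structure of [5]: it assumes a naive suffix-array construction by sorting, Kasai's algorithm for the LCP array, and a sparse table for range minima (O(n log n) preprocessing rather than linear). The names `LCEIndex`, `build_sparse_table`, and `range_min` are hypothetical, and positions are 0-based, unlike the 1-based positions in the text.

```python
def build_suffix_array(s):
    # Naive construction by sorting suffixes; fine for a sketch, not linear time.
    return sorted(range(len(s)), key=lambda i: s[i:])


def build_rank_and_lcp(s, sa):
    # Kasai's algorithm. rank is the inverse suffix array;
    # lcp[r] is the common prefix length of suffixes sa[r-1] and sa[r], lcp[0] = 0.
    n = len(s)
    rank = [0] * n
    for r, i in enumerate(sa):
        rank[i] = r
    lcp = [0] * n
    h = 0
    for i in range(n):
        if rank[i] == 0:
            h = 0
            continue
        j = sa[rank[i] - 1]
        while i + h < n and j + h < n and s[i + h] == s[j + h]:
            h += 1
        lcp[rank[i]] = h
        if h > 0:
            h -= 1
    return rank, lcp


def build_sparse_table(a):
    # table[l][i] holds min(a[i : i + 2**l]).
    table = [list(a)]
    k = 1
    while 2 * k <= len(a):
        prev = table[-1]
        table.append([min(prev[i], prev[i + k]) for i in range(len(a) - 2 * k + 1)])
        k *= 2
    return table


def range_min(table, lo, hi):
    # Minimum of a[lo..hi], both ends inclusive, lo <= hi.
    level = (hi - lo + 1).bit_length() - 1
    return min(table[level][lo], table[level][hi - (1 << level) + 1])


class LCEIndex:
    """Hypothetical helper answering LCE queries via RMQ on the LCP array."""

    def __init__(self, s):
        self.s = s
        self.sa = build_suffix_array(s)
        self.rank, self.lcp = build_rank_and_lcp(s, self.sa)
        self.table = build_sparse_table(self.lcp)

    def query(self, i, j):
        if i == j:
            return len(self.s) - i
        lo, hi = sorted((self.rank[i], self.rank[j]))
        # Same shape as LCE(i, j) = RMQ_LCP(SA^-1[i] + 1, SA^-1[j]).
        return range_min(self.table, lo + 1, hi)
```

For instance, `LCEIndex("abracadabra").query(0, 7)` compares the suffixes `abracadabra` and `abra` and returns 4. The sparse table is a swapped-in choice made for brevity; it gives constant-time queries but uses more preprocessing time and space than the linear-time structures the text refers to.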

3. Average LCE

We shall assume throughout the paper that the letters of the alphabet $A$ are independent and identically distributed. The starting point of our approach is the observation that most LCE values are very small. The main result of this section estimates the average value of the LCE over all strings of a given length $n$, that is,

$$\mathrm{Avg\_LCE}(n,\sigma) = \frac{1}{\sigma^{n}\binom{n}{2}} \sum_{s \in A^{n}} \; \sum_{1 \le i < j \le n} \mathrm{LCE}_s(i, j),$$

where $\sigma = \mathrm{card}(A)$ is the alphabet size.
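The definition can be evaluated directly for small parameters. The sketch below is a brute-force illustration, assuming 0-based positions and an alphabet modeled as the integers `0..sigma-1`; the function names `naive_lce` and `avg_lce_bruteforce` are hypothetical. It enumerates all `sigma**n` strings, so it is only usable for tiny `n` (and requires `n >= 2`).

```python
from fractions import Fraction
from itertools import product


def naive_lce(s, i, j):
    """Length of the longest common prefix of s[i:] and s[j:] (0-based)."""
    k = 0
    while i + k < len(s) and j + k < len(s) and s[i + k] == s[j + k]:
        k += 1
    return k


def avg_lce_bruteforce(n, sigma):
    """Exact average of LCE over all sigma**n strings and all pairs i < j."""
    total = 0
    for s in product(range(sigma), repeat=n):
        for i in range(n):
            for j in range(i + 1, n):
                total += naive_lce(s, i, j)
    pairs = n * (n - 1) // 2
    return Fraction(total, sigma**n * pairs)
```

For example, `avg_lce_bruteforce(3, 2)` returns `Fraction(7, 12)`: the eight binary strings of length 3 contribute a total LCE of 14 over 3 pairs each. Exact rational arithmetic is used so that small cases can be compared against closed-form expressions without rounding.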



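Theorem 1.

(i) For any $\sigma \ge 2$, $\lim_{n \to \infty} \mathrm{Avg\_LCE}(n,\sigma) = \frac{1}{\sigma - 1}$.

(ii) For any $n \ge 2$ and $\sigma \ge 2$, $\mathrm{Avg\_LCE}(n,\sigma) < \frac{1}{\sigma - 1}$.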

Proof. Reorganizing the formula for $\mathrm{Avg\_LCE}(n,\sigma)$ gives

$$\mathrm{Avg\_LCE}(n,\sigma) = \frac{2}{\sigma^{n}\, n(n-1)} \sum_{k=1}^{n-1} k \sum_{1 \le i < j \le n-k+1} \mathrm{card}\bigl(\{\, s \mid \mathrm{LCE}_s(i, j) = k \,\}\bigr).$$

(i) For fixed $k, i, j$, denote $K_{k,i,j} = \{\, s \mid \mathrm{LCE}_s(i, j) = k \,\}$. We compute the cardinality of $K_{k,i,j}$. Recall that, in any string $s \in K_{k,i,j}$, we have $s[i\,..\,i+k-1] = s[j\,..\,j+k-1]$.

(i.1) Assume first that $j \le n - k$. If also $j - i \ge k$, then there are $\sigma^{k}$ possibilities for the letters contained in the substrings $s[i\,..\,i+k-1]$ and $s[j\,..\,j+k-1]$. The letters right after those, $s[i+k]$ and $s[j+k]$, can be chosen in $\sigma(\sigma - 1)$ different ways, as they must be different. There are $\sigma^{n-2(k+1)}$ possibilities to choose the remaining letters of $s$. In total we obtain $\mathrm{card}(K_{k,i,j}) = \sigma^{n-k-1}(\sigma - 1)$.

Now, if $j - i < k$, then $s[i\,..\,i+k-1] = x^{p}x'$, with $|x| = j - i$, $x'$ a prefix of $x$, and $p \ge 1$. The letters contained in the substrings $s[i\,..\,i+k-1]$ and $s[j\,..\,j+k-1]$ are completely determined by $x$, which can be any string out of $\sigma^{j-i}$ possibilities. The letter in position $j + k$ can be chosen in $\sigma - 1$ ways, since it has to be different from $s[i+k]$. The remaining letters can be chosen in $\sigma^{n-(k+j-i+1)}$ ways. In total, $\mathrm{card}(K_{k,i,j}) = \sigma^{n-k-1}(\sigma - 1)$.

(i.2) Assume next $j = n - k + 1$. We no longer need the condition that $s[i+k] \ne s[j+k]$, as above, since $s[j+k]$ is undefined. Therefore, by a reasoning similar to the one above, $\mathrm{card}(K_{k,i,j}) = \sigma^{n-k}$.

There are $\binom{n-k}{2}$ pairs $(i, j)$ verifying (i.1) above and $n - k$ that verify (i.2). Consequently, we obtain (i) as follows:
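The two cardinalities derived in cases (i.1) and (i.2) can be checked by exhaustive enumeration for small parameters. The sketch below is only such a check, not part of the proof; `card_K` and `predicted_card` are hypothetical names, the alphabet is modeled as the integers `0..sigma-1`, and positions `i < j` are 1-based as in the text.

```python
from itertools import product


def card_K(n, sigma, k, i, j):
    """Count strings s of length n with LCE_s(i, j) == k (1-based i < j)."""
    count = 0
    for s in product(range(sigma), repeat=n):
        a, b, length = i - 1, j - 1, 0
        # Extend the match until a mismatch or the end of the string (b > a).
        while b + length < n and s[a + length] == s[b + length]:
            length += 1
        if length == k:
            count += 1
    return count


def predicted_card(n, sigma, k, j):
    """Cardinality of K_{k,i,j} as stated in cases (i.1) and (i.2)."""
    if j <= n - k:          # case (i.1): a mismatch must follow the match
        return sigma ** (n - k - 1) * (sigma - 1)
    if j == n - k + 1:      # case (i.2): the match runs to the end of s
        return sigma ** (n - k)
    return 0                # the suffix at j is shorter than k
```

Calling `card_K(n, sigma, k, i, j)` and `predicted_card(n, sigma, k, j)` over all `1 <= i < j <= n` and `1 <= k <= n - 1` for small `n` confirms that the count does not depend on `i`, including the overlapping case `j - i < k`.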

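$$
\begin{aligned}
\mathrm{Avg\_LCE}(n,\sigma)
&= \frac{2}{\sigma^{n}\, n(n-1)} \sum_{k=1}^{n-1} k \left[ \binom{n-k}{2} \sigma^{n-k-1}(\sigma - 1) + (n-k)\,\sigma^{n-k} \right] \\
&= \frac{2}{\sigma^{n}\, n(n-1)} \sum_{k=1}^{n-1} (n-k) \left[ \binom{k}{2} \sigma^{k-1}(\sigma - 1) + k\,\sigma^{k} \right] \\
&= \frac{1}{\sigma^{n}\, n(n-1)} \sum_{k=1}^{n-1} (n-k) \left[ k(k+1)\,\sigma^{k} - k(k-1)\,\sigma^{k-1} \right] \\
&= \frac{1}{\sigma^{n}\, n(n-1)} \left[ n(n-1)\,\sigma^{n-1} + \sum_{k=1}^{n-1} k(k-1)\,\sigma^{k-1} \right] \\
&= \frac{n+1}{(n-1)(\sigma-1)} - \frac{2\sigma}{(n-1)(\sigma-1)^{2}} + \frac{2\sigma\,(\sigma^{n}-1)}{\sigma^{n}\, n(n-1)(\sigma-1)^{3}}
\;\xrightarrow{\;n \to \infty\;}\; \frac{1}{\sigma - 1}.
\end{aligned}
$$

The second line substitutes $n - k$ for $k$, the fourth follows by telescoping the bracketed difference, and the last evaluates the remaining sum in closed form.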

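The counting argument for part (i) can also be checked numerically with exact arithmetic. The sketch below evaluates the sum obtained from the cardinalities $\binom{n-k}{2}\sigma^{n-k-1}(\sigma-1) + (n-k)\sigma^{n-k}$ and compares it with an equivalent telescoped form, $\sum_{k=1}^{n-1} k(k+1)\sigma^{k}$ divided by $\sigma^{n} n(n-1)$. That telescoped form is our own re-indexing used for the check, and the function names are hypothetical.

```python
from fractions import Fraction
from math import comb


def avg_lce_counting(n, sigma):
    """Avg_LCE(n, sigma) from the per-k string counts of the proof."""
    total = sum(
        k * (comb(n - k, 2) * sigma ** (n - k - 1) * (sigma - 1)
             + (n - k) * sigma ** (n - k))
        for k in range(1, n)
    )
    return Fraction(2 * total, sigma**n * n * (n - 1))


def avg_lce_telescoped(n, sigma):
    """Equivalent form after telescoping: sum of k(k+1) * sigma**k."""
    total = sum(k * (k + 1) * sigma**k for k in range(1, n))
    return Fraction(total, sigma**n * n * (n - 1))


def gap_to_limit(n, sigma):
    """Distance between the limit 1/(sigma - 1) and the exact average."""
    return Fraction(1, sigma - 1) - avg_lce_telescoped(n, sigma)
```

For `n = 3` and `sigma = 2` both functions return `Fraction(7, 12)`. Evaluating `gap_to_limit` for growing `n` shows the gap staying positive and shrinking, in line with both parts of Theorem 1.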
