Approximate String Matching: Edit Distance and Dynamic Programming


"String Matching, Edit Distance, Dynamic Programming Algorithm" 在信息技术和计算机科学中，字符串匹配是一个基础且重要的问题。这个领域关注的是在一个文本串（通常称为“文本”或“主串”）中寻找一个模式串（也称为“模式”或“查询”）的出现。随着信息检索和计算生物学等领域的快速发展，允许错误的字符串匹配（即近似字符串匹配）变得越来越重要。本文主要探讨了在线搜索和编辑距离这两个关键概念，并深入解析它们的算法设计与复杂性分析。编辑距离（Edit Distance），又称为Levenshtein距离，是衡量两个字符串之间差异度的一个度量。它定义为通过插入、删除和替换字符操作将一个字符串转换成另一个字符串所需的最小操作次数。编辑距离的概念在许多应用中都起着关键作用，比如拼写检查、DNA序列比对等。动态规划算法在解决编辑距离问题上发挥着重要作用。其基本思想是构建一个二维矩阵，其中每个单元格表示两个字符串在某个位置的编辑距离。通过递归地填充这个矩阵，我们可以找到最小的编辑距离。动态规划方法的时间复杂度为O(mn)，其中m和n分别为两个比较字符串的长度。然而，优化的技术，如Wagner-Fischer算法，可以采用空间优化来降低内存需求。在线搜索，即实时字符串匹配，是指在输入流中不断接收字符时进行匹配的过程。对于允许错误的在线搜索，算法需要能够快速适应新字符的出现，并调整匹配策略。这类问题的挑战在于平衡查找效率与错误容忍度。文章中提到的实验部分对比了不同近似字符串匹配算法的性能，评估了它们在处理各种任务时的速度和准确性。这些比较有助于确定在特定应用场景下最合适的算法选择。最后，文章指出了未来的研究方向和开放问题。这可能包括进一步提高算法效率，尤其是在大数据背景下；探索新的错误模型以适应更复杂的匹配需求；以及研究如何利用统计行为和机器学习技术改进现有算法。总结来说，"String Matching"与"Edit Distance"结合动态规划算法，构成了一个强大的工具集，用于处理允许错误的字符串匹配问题。这项工作不仅提供了理论背景，还通过实验展示了实际应用中的性能差异，为相关领域的研究者和开发者提供了有价值的参考。

G. Navarro

Fig. 2. The DAWG or the suffix automaton for the sample string. If all the states are final, it is a DAWG. If only the 2nd, 5th and rightmost states are final then it is a suffix automaton.

interesting in itself, but also essential for the average case analysis of many search algorithms, as will be seen later. We now present the existing results and an empirical validation. In this section we consider the edit distance only. Some variants can be adapted to these results.

The effort in analyzing the probabilistic behavior of the edit distance has not given good results in general [Kurtz and Myers 1997]. An exact analysis of the probability of the occurrence of a fixed pattern allowing $k$ substitution errors (i.e. Hamming distance) can be found in Régnier and Szpankowski [1997], although the result is not easy to average over all the possible patterns. The results we present here apply to the edit distance model and, although not exact, are easier to use in general.

The result of Régnier and Szpankowski [1997] holds under the assumption that the characters of the text are independently generated with fixed probabilities, i.e. a Bernoulli model. In the rest of this paper we consider a simpler model, the "uniform Bernoulli model," where all the characters occur with the same probability $1/\sigma$. Although this is a gross simplification of the real processes that generate the texts in most applications, the results obtained are quite reliable in practice. In particular, all the analyses apply quite well to biased texts if we replace $\sigma$ by $1/p$, where $p$ is the probability that two random text characters are equal.
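As an illustration of the substitution of $\sigma$ by $1/p$, the sketch below estimates $p$ from a sample text. It assumes that $p$ can be estimated as the sum of squared empirical character frequencies (two characters drawn independently); the function name `effective_sigma` is hypothetical.

```python
from collections import Counter


def effective_sigma(text: str) -> float:
    """Return 1/p, where p is the probability that two characters drawn
    independently from the empirical distribution of `text` are equal."""
    counts = Counter(text)
    n = len(text)
    p = sum((c / n) ** 2 for c in counts.values())
    return 1.0 / p


print(effective_sigma("acgt" * 100))  # uniform over 4 symbols: 1/p equals 4
print(effective_sigma("aaab"))        # biased text: 1/p is below 2
```

For a uniform text the value coincides with the alphabet size; for a biased text it is smaller, which reflects that matches become more likely.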

Although the problem of the average edit distance between two strings is closely related to the better studied LCS, the well known results of Chvátal and Sankoff [1975] and Deken [1979] can hardly be applied to this case. It can be shown that the average edit distance between two random strings of length $m$ tends to a constant fraction of $m$ as $m$ grows, but the fraction is not known. It holds that for any two strings of length $m$, $m - lcs \le ed \le 2(m - lcs)$, where $ed$ is their edit distance and $lcs$ is the length of their longest common subsequence. As proved in Chvátal and Sankoff [1975], the average LCS is between $m/\sqrt{\sigma}$ and $me/\sqrt{\sigma}$ for large $\sigma$, and therefore the average edit distance is between $m(1 - e/\sqrt{\sigma})$ and $2m(1 - 1/\sqrt{\sigma})$. For large $\sigma$ it is conjectured that the true value is $m(1 - 1/\sqrt{\sigma})$ [Sankoff and Mainville 1983].
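The relation $m - lcs \le ed \le 2(m - lcs)$ can be checked numerically. The sketch below is only an illustration: the helper names, the alphabet "acgt", the string length 30 and the number of trials are arbitrary choices, not taken from the paper.

```python
import random


def lcs_length(a, b):
    """Length of the longest common subsequence (row-by-row DP)."""
    prev = [0] * (len(b) + 1)
    for ca in a:
        curr = [0]
        for j, cb in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if ca == cb
                        else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def edit_distance(a, b):
    """Edit distance with unit-cost insertions, deletions, substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1,
                            prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]


def bounds_hold(a, b):
    """Check m - lcs <= ed <= 2(m - lcs) for two strings of equal length m."""
    m = len(a)
    lcs, ed = lcs_length(a, b), edit_distance(a, b)
    return m - lcs <= ed <= 2 * (m - lcs)


rng = random.Random(0)
pairs = [("".join(rng.choice("acgt") for _ in range(30)),
          "".join(rng.choice("acgt") for _ in range(30)))
         for _ in range(200)]
print(all(bounds_hold(a, b) for a, b in pairs))
```

The lower bound holds because every character outside a common subsequence costs at least one operation on one side; the upper bound corresponds to deleting the $m - lcs$ unmatched characters of one string and inserting those of the other.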

For our purposes, bounding the probability of a match allowing errors is more important than the average edit distance. Let $f(m,k)$ be the probability of a random pattern of length $m$ matching a given text position with $k$ errors or less under the edit distance (i.e. the text position is reported as the end of a match). In Baeza-Yates and Navarro [1999], Navarro [1998], and Navarro and Baeza-Yates [1999b] upper and lower bounds on the maximum error level $\alpha^*$ for which $f(m,k)$ is exponentially decreasing on $m$ are found. This is important because many algorithms search for potential matches that have to be verified later, and the cost of such verifications is polynomial in $m$, typically $O(m^2)$. Therefore, if that event occurs with probability $O(\gamma^m)$ for some $\gamma < 1$ then the total cost of verifications is $O(m^2 \gamma^m) = o(1)$, which makes the verification cost negligible.
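To see how quickly such a product vanishes, the following sketch evaluates $m^2 \gamma^m$, assuming a quadratic verification cost and a match probability of exactly $\gamma^m$. The value $\gamma = 0.9$ is an arbitrary illustration, not a figure from the paper.

```python
def expected_verification_cost(m: int, gamma: float) -> float:
    """Quadratic verification cost times an exponentially small match
    probability: m^2 * gamma^m (constant factors ignored)."""
    return m ** 2 * gamma ** m


for m in (10, 50, 100, 200, 400):
    print(m, expected_verification_cost(m, 0.9))
```

Even with a base as close to 1 as 0.9, the exponential factor dominates the polynomial one once $m$ reaches the hundreds.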

We first show the analytical bounds for $f(m,k)$, then give a new result on average edit distance, and finally present an experimental verification.

A Guided Tour to Approximate String Matching. ACM Computing Surveys, Vol. 33, No. 1, March 2001.

Fig. 3. A nondeterministic suffix automaton to recognize any suffix of "abracadabra." Dashed lines represent ε-transitions (i.e. they occur without consuming any input).

4.1 An Upper Bound

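The upper bound for $\alpha^*$ comes from the proof that the matching probability is $f(m,k) = O(\gamma^m)$ for

$$\gamma = \left(\frac{1}{\sigma\,\alpha^{\frac{2\alpha}{1-\alpha}}\,(1-\alpha)^2}\right)^{1-\alpha} \le \left(\frac{e^2}{\sigma\,(1-\alpha)^2}\right)^{1-\alpha} \qquad (1)$$

where we note that $\gamma$ is $1/\sigma$ for $\alpha = 0$ and grows to 1 as $\alpha$ grows. This matching probability is exponentially decreasing on $m$ as long as $\gamma < 1$, which is equivalent to

$$\alpha < 1 - \frac{e}{\sqrt{\sigma}} - O(1/\sigma) \le 1 - \frac{e}{\sqrt{\sigma}} \qquad (2)$$

Therefore, $\alpha < 1 - e/\sqrt{\sigma}$ is a conservative condition on the error level which ensures "few" matches. Therefore, the maximum level $\alpha^*$ satisfies $\alpha^* > 1 - e/\sqrt{\sigma}$.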
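The behavior of $\gamma$ can be explored numerically. The sketch below assumes the form $\gamma = \bigl(1/(\sigma\,\alpha^{2\alpha/(1-\alpha)}(1-\alpha)^2)\bigr)^{1-\alpha}$ with its simpler upper bound $\bigl(e^2/(\sigma(1-\alpha)^2)\bigr)^{1-\alpha}$; the function names and the choice $\sigma = 32$ are only for illustration.

```python
import math


def gamma(alpha: float, sigma: int) -> float:
    """Base of the exponential bound on the matching probability,
    for an error level 0 <= alpha < 1 (Python evaluates 0.0 ** 0.0 as 1.0)."""
    inner = 1.0 / (sigma * alpha ** (2 * alpha / (1 - alpha)) * (1 - alpha) ** 2)
    return inner ** (1 - alpha)


def gamma_upper(alpha: float, sigma: int) -> float:
    """Simpler upper bound on gamma, obtained by replacing the alpha power by e^2."""
    return (math.e ** 2 / (sigma * (1 - alpha) ** 2)) ** (1 - alpha)


def conservative_alpha(sigma: int) -> float:
    """Error level below which the simpler bound is already below 1."""
    return 1 - math.e / math.sqrt(sigma)


sigma = 32
limit = conservative_alpha(sigma)
for step in range(0, 10):
    alpha = limit * step / 10
    print(round(alpha, 3), gamma(alpha, sigma), gamma_upper(alpha, sigma))
```

Note that $1 - e/\sqrt{\sigma}$ is positive only for $\sigma > e^2 \approx 7.39$, so this conservative condition says nothing for very small alphabets.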

The proof is obtained using a combinatorial model. Based on the observation that $m - k$ common characters must appear in the same order in two strings that match with $k$ errors, all the possible alternatives to select the matching characters from both strings are enumerated. This model, however, does not take full advantage of the properties of the edit distance: even if $m - k$ characters match, the distance can be larger than $k$. For example, $ed(abc, bcd) = 2$; that is, although two characters match, the distance is not 1.

4.2 A Lower Bound

On the other hand, the only optimistic bound we know of is based on the consideration that only substitutions are allowed (i.e. Hamming distance). This distance is simpler to analyze but its matching probability is much lower. Using a combinatorial model again it is shown that the matching probability is $f(m,k) \ge \delta^m m^{-1/2}$, where

$$\delta = \left(\frac{1}{(1-\alpha)\,\sigma}\right)^{1-\alpha}$$

Therefore an upper bound for the maximum $\alpha^*$ value is $\alpha^* \le 1 - 1/\sigma$, since otherwise it can be proved that $f(m,k)$ is not exponentially decreasing on $m$ (i.e. it is $\Omega(m^{-1/2})$).
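The step from $\delta$ to the bound on $\alpha^*$ can be spelled out as follows, assuming $\delta = \bigl(1/((1-\alpha)\sigma)\bigr)^{1-\alpha}$ and $0 \le \alpha < 1$, so that the exponent $1 - \alpha$ is positive:

```latex
\delta < 1
\iff \left(\frac{1}{(1-\alpha)\,\sigma}\right)^{1-\alpha} < 1
\iff (1-\alpha)\,\sigma > 1
\iff \alpha < 1 - \frac{1}{\sigma}
```

For instance, with $\sigma = 4$ the lower bound on $f(m,k)$ stops decreasing exponentially once $\alpha \ge 0.75$.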

4.3 A New Result on Average Edit Distance

We can now prove that the average edit distance is larger than $m(1 - e/\sqrt{\sigma})$ for any $\sigma$ (recall that the result of Chvátal and Sankoff [1975] holds for large $\sigma$). We define $p(m,k)$ as the probability that the edit distance between two strings of length $m$ is at most $k$. Note that $p(m,k) \le f(m,k)$ because in the latter case we can match with any text suffix of length from $m-k$ to $m+k$. Then the average edit distance is

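$$\sum_{k=0}^{m} k \Pr(ed = k) = \sum_{k=0}^{m-1} \Pr(ed > k) = \sum_{k=0}^{m-1} \bigl(1 - p(m,k)\bigr) = m - \sum_{k=0}^{m-1} p(m,k)$$

which, since $p(m,k)$ increases with $k$, is larger than

$$m - \bigl(K\,p(m,K) + (m-K)\bigr) = K\,\bigl(1 - p(m,K)\bigr)$$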

for any $K$ of our choice. In particular, for $K/m < 1 - e/\sqrt{\sigma}$ we have that $p(m,K) \le f(m,K) = O(\gamma^m)$ for $\gamma < 1$. Therefore choosing $K = m(1 - e/\sqrt{\sigma}) - 1$ yields that the edit distance is at least $m(1 - e/\sqrt{\sigma}) + O(1)$, for any $\sigma$. As we see later, this proof converts a conjecture about the average running time of an algorithm [Chang and Lampe 1992] into a fact.

Fig. 4. Taxonomy of the types of solutions for online searching.
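The lower bound $m(1 - e/\sqrt{\sigma})$ on the average edit distance from Section 4.3 can be compared with a small Monte Carlo estimate. This is a sketch under the uniform Bernoulli model; the parameters ($m = 60$, $\sigma = 32$, 50 trials) and the function names are arbitrary choices made here.

```python
import math
import random


def edit_distance(a, b):
    """Edit distance between two sequences, two-row dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1,
                            prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]


def average_edit_distance(m, sigma, trials, seed=0):
    """Mean edit distance between pairs of uniformly random strings of
    length m over an alphabet of sigma symbols (symbols are integers)."""
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        a = [rng.randrange(sigma) for _ in range(m)]
        b = [rng.randrange(sigma) for _ in range(m)]
        total += edit_distance(a, b)
    return total / trials


m, sigma = 60, 32
ratio = average_edit_distance(m, sigma, trials=50) / m
print("estimated ed/m:", ratio)
print("lower bound 1 - e/sqrt(sigma):", 1 - math.e / math.sqrt(sigma))
```

The estimate is noisy for so few trials, but it is enough to see on which side of the bound the empirical average falls.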

4.4 Empirical Veriﬁcation

We verify the analysis experimentally in this section (this is also taken from Baeza-Yates and Navarro [1999] and Navarro [1998]). The experiment consists of generating a large random text ($n = 10$ MB) and running the search of a random pattern on that text, allowing $k = m$ errors. At each text character, we record the minimum allowed error $k$ for which that text position matches the pattern. We repeat the experiment with 1,000 random patterns.

Finally, we build the cumulative histogram, finding how many text positions have matched with up to $k$ errors, for each $k$ value. We consider that $k$ is "low enough" up to where the histogram values become significant, that is, as long as few text positions have matched. The threshold is set to $n/m^2$, since $m^2$ is the normal cost of verifying a match. However, the selection of this threshold is not very important, since the histogram is extremely concentrated. For example, for $m$ in the hundreds, it moves from almost zero to almost $n$ in just five or six increments of $k$.
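A heavily scaled-down version of this experiment can be sketched as follows. It uses the classical column-wise dynamic programming for searching, where the first cell of every column is 0 so that a match may start at any text position. The function names, the small sizes (far below the 10 MB text and 1,000 patterns of the paper) and the aggregated threshold of patterns times $n/m^2$ are assumptions of this sketch.

```python
import random
from itertools import accumulate


def min_errors_per_position(pattern, text):
    """For each text position, the minimum k such that an occurrence of
    `pattern` with at most k errors ends at that position."""
    m = len(pattern)
    col = list(range(m + 1))  # column for the empty text
    out = []
    for t in text:
        prev_diag, col[0] = col[0], 0  # 0 on top: a match can start anywhere
        for i in range(1, m + 1):
            cost = 0 if pattern[i - 1] == t else 1
            new = min(prev_diag + cost,  # match or substitution
                      col[i] + 1,        # extra text character
                      col[i - 1] + 1)    # missing pattern character
            prev_diag, col[i] = col[i], new
        out.append(col[m])
    return out


def estimate_alpha_star(m, sigma, n, patterns, seed=0):
    """Largest k/m for which the cumulative histogram stays below the
    threshold patterns * n / m^2."""
    rng = random.Random(seed)
    text = [rng.randrange(sigma) for _ in range(n)]
    hist = [0] * (m + 1)
    for _ in range(patterns):
        pattern = [rng.randrange(sigma) for _ in range(m)]
        for k in min_errors_per_position(pattern, text):
            hist[k] += 1
    cumulative = list(accumulate(hist))  # positions matching with <= k errors
    threshold = patterns * n / m ** 2
    k_star = max((k for k in range(m + 1) if cumulative[k] <= threshold),
                 default=0)
    return k_star / m


print(estimate_alpha_star(m=20, sigma=32, n=5000, patterns=5))
```

With such short patterns and so little text the estimate is coarse (it can only take multiples of $1/m$), so it should be read as an illustration of the procedure and not as a reproduction of the reported figures.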

Figure 5 shows the results for $\sigma = 32$. On the left we show the histogram we have built, where the matching probability undergoes a sharp increase at $\alpha^*$. On the right we show the $\alpha^*$ value as $m$ grows. It is clear that $\alpha^*$ is essentially independent of $m$, although it is a bit lower for short patterns. The increase in the left plot at $\alpha^*$ is so sharp that the right plot would be the same if we plotted the value of the average edit distance divided by $m$.

Figure 6 uses a stable $m = 300$ to show the $\alpha^*$ value as a function of $\sigma$. The curve $\alpha = 1 - 1/\sqrt{\sigma}$ is included to show its closeness to the experimental data. Least squares give the approximation $\alpha^* \approx 1 - 1.09/\sqrt{\sigma}$, with a relative error smaller than 1%. This shows that the upper bound analysis (Eq. (2)) matches reality better, provided we replace $e$ by 1.09 in the formulas.

Therefore, we have shown that the matching probability has a sharp behavior: for low $\alpha$ it is very low, not as low as $1/\sigma^m$ like exact string matching, but still exponentially decreasing in $m$, with an exponent base larger than $1/\sigma$. At some $\alpha$ value (that we call $\alpha^*$) it sharply increases and quickly becomes almost 1. This point is close to $\alpha^* = 1 - 1/\sqrt{\sigma}$ in practice.

This is why the problem is of interest only up to a given error level, since for higher errors almost all text positions match. This is also the reason that some algorithms have good average behavior only for low enough error levels. The point $\alpha^* = 1 - 1/\sqrt{\sigma}$ matches the conjecture of Sankoff and Mainville [1983].
