Using Negative Factors to Improve the Efficiency of Regular-Expression Matching on Strings


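In modern information technology, the regular expression (RE) is a core tool for string matching, widely used in text editing, biological sequence search, shell commands, and other applications. When exact matches must be found among a large number of candidates, however, existing techniques can be inefficient, because the candidate verification phase consumes substantial computing resources. Traditional methods usually take two steps: first, scan the string to identify substrings that the RE might match; then verify those substrings one by one with a state machine or automaton to decide whether they conform to the RE. The drawback is that verifying a large number of non-matching substrings wastes computation.

The paper proposes introducing the concept of negative factors. A negative factor is a substring that cannot appear in any match of the regular expression, so its presence can be used to rule out false candidates early. The authors, Xiaochun Yang et al. of the College of Information Science and Engineering at Northeastern University, China, analyze the structure and semantics of the regular expression and design a new algorithm that uses negative factors to guide the search and reduce useless verification steps.

The key to the technique is identifying which substrings are negative factors and integrating them into the search. For example, if a regular expression matches only non-digit characters, then every digit substring is a negative factor, and any candidate containing one can be skipped without verification. This speeds up matching, especially on large data sets or complex regular expressions.

The paper also examines how to compute and store negative-factor information efficiently, and how to combine automata theory with dynamic-programming strategies to optimize the search. It further discusses how negative factors perform in different scenarios, the challenges they may face, and directions for future improvement. In short, the paper offers a new way to improve the performance of RE matching on strings by introducing negative factors, and the result could be applied in software systems that depend on regular expressions.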

matching suffix until it fails, and it reports an occurrence whenever a final state is reached. We call this kind of approach "algorithm PFilter," where "P" stands for "Prefix."

[Figure 2: Checking candidate occurrences of the RE $Q = (G|T)A^*GA^*T^*$ in the text T = TACTAGACGTTAATTTACGTA. (a) Using prefixes. (b) Combination of prefixes and last matching suffix.]
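The prefix-based filtering and verification described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the RE is Q = (G|T)A*GA*T* with l_min = 2 (a reading consistent with the prefixes and suffixes quoted in the text), and the names `DELTA`, `ACCEPTING`, `PREFIXES`, and `pfilter` are hypothetical.

```python
# Hand-built DFA for the ASSUMED RE Q = (G|T)A*GA*T* (hypothetical names).
# States: 0 = start, 1 = after the first G/T (reading A*),
#         2 = after the mandatory G (reading A*), 3 = reading the trailing T*.
DELTA = {
    (0, "G"): 1, (0, "T"): 1,
    (1, "A"): 1, (1, "G"): 2,
    (2, "A"): 2, (2, "T"): 3,
    (3, "T"): 3,
}
ACCEPTING = {2, 3}
PREFIXES = {"GA", "GG", "TA", "TG"}  # length-l_min prefixes of strings in R(Q)
L_MIN = 2


def pfilter(text):
    """Return all inclusive (start, end) pairs with text[start..end] in R(Q)."""
    occurrences = []
    for start in range(len(text) - L_MIN + 1):
        if text[start:start + L_MIN] not in PREFIXES:
            continue  # filtering step: no matching prefix starts here
        state = 0
        for end in range(start, len(text)):
            state = DELTA.get((state, text[end]))
            if state is None:
                break  # the automaton fails, so this verification stops
            if state in ACCEPTING:
                occurrences.append((start, end))  # a final state was reached
    return occurrences


print(pfilter("TACTAGACGTTAATTTACGTA"))
```

Under these assumptions the sketch reports the occurrences T[3, 5] = TAG and T[3, 6] = TAGA; every other matching prefix leads to a verification that fails without reaching a final state.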

An alternative approach is adopted in NR-grep [9, 11], which uses a sliding window of size $l_{min}$ on the text T and recognizes reversed matching prefixes in the sliding window using a reversed automaton. Similar to the definition of prefix, a suffix w.r.t. an RE Q is defined as a suffix with length $l_{min}$ of a string in R(Q). For example, for the RE $Q = (G|T)A^*GA^*T^*$, the suffixes w.r.t. Q are TT, AT, AA, GA, GT, AG, GG, and TG, and the matching suffixes are T[4, 5], T[5, 6], T[8, 9], T[9, 10], T[11, 12], T[12, 13], T[13, 14], T[14, 15], and T[18, 19]. We call the suffix-based approach "algorithm SFilter," which is also similar to the algorithm PFilter. It runs a reversed automaton from the end position of each suffix to the beginning of the text.
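The prefix and suffix sets can be checked by brute force. The sketch below assumes the RE is Q = (G|T)A*GA*T*, written as `[GT]A*GA*T*` in Python syntax, and enumerates only strings up to a small length bound (`MAX_LEN`, an assumption that happens to suffice for this small RE). All names are hypothetical.

```python
import itertools
import re

Q = re.compile(r"[GT]A*GA*T*")  # ASSUMED RE, in Python regex syntax
SIGMA = "ACGT"
MAX_LEN = 5  # enumeration bound; an assumption, adequate for this example


def language_sample(max_len=MAX_LEN):
    """All strings over SIGMA of length <= max_len that fully match Q."""
    sample = []
    for n in range(1, max_len + 1):
        for chars in itertools.product(SIGMA, repeat=n):
            s = "".join(chars)
            if Q.fullmatch(s):
                sample.append(s)
    return sample


sample = language_sample()
l_min = min(len(s) for s in sample)       # length of the shortest match
prefixes = {s[:l_min] for s in sample}    # prefixes w.r.t. Q
suffixes = {s[-l_min:] for s in sample}   # suffixes w.r.t. Q


def matching_positions(text, factors):
    """Start positions i such that text[i : i + l_min] is in factors."""
    return [i for i in range(len(text) - l_min + 1)
            if text[i:i + l_min] in factors]


T = "TACTAGACGTTAATTTACGTA"
print(sorted(suffixes))
print(matching_positions(T, prefixes))
print(matching_positions(T, suffixes))
```

With this assumed RE, the computed suffix set and the matching-suffix start positions 4, 5, 8, 9, 11, 12, 13, 14, and 18 agree with the lists given in the text.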

2.3 Improving Prefix-Based Approaches Using Last Matching Suffix

In Figure 2(a), the matching prefix T[19, 20] = TA could not be used to produce an answer string in R(Q) since it is not among the suffixes identified above from the RE. Therefore, we could use the last matching suffix to do an early termination in each verification step. Figure 2(b) shows the example of improving the algorithm PFilter. As we can see, by using the last matching suffix T[18, 19] = GT in the text T, a verification can terminate early at position 19.

We call this approach the "algorithm PS." It only verifies those substrings starting from every matching prefix $S_p$ to the last matching suffix $S_s$ if the starting position of $S_p$ is less than or equal to the starting position of $S_s$. We call $S_s$ a valid matching suffix and each $S_p$ a valid matching prefix w.r.t. its valid matching suffix $S_s$. For example, the substring T[18, 19] is a valid matching suffix and the substrings T[0, 1], T[3, 4], T[5, 6], T[10, 11], T[15, 16] are valid matching prefixes, whereas T[19, 20] is an invalid matching prefix.

The algorithm PS requires $O(m \cdot n \cdot n')$ time to do verifications for $n'$ valid matching prefixes of T, assuming m is |Q|, and the verification for each valid matching prefix requires $O(m \cdot n)$ time.
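The selection of valid matching prefixes in algorithm PS can be sketched from the positions listed in the text alone. The function name `ps_candidates` and its return shape are hypothetical; the input lists are the matching prefix and suffix start positions from the running example, and l_min = 2 is the length of those prefixes and suffixes.

```python
def ps_candidates(prefix_starts, suffix_starts, l_min):
    """Return (valid_prefix_starts, end_bound, candidates) for algorithm PS.

    A matching prefix is valid if it starts no later than the last matching
    suffix; each candidate runs from a valid prefix to the end of that suffix.
    """
    if not suffix_starts:
        return [], None, []  # no matching suffix, so nothing can be verified
    last_suffix = max(suffix_starts)
    end_bound = last_suffix + l_min - 1  # verification never needs to go past here
    valid = [p for p in prefix_starts if p <= last_suffix]
    candidates = [(p, end_bound) for p in valid]
    return valid, end_bound, candidates


prefix_starts = [0, 3, 5, 10, 15, 19]
suffix_starts = [4, 5, 8, 9, 11, 12, 13, 14, 18]
valid, end_bound, candidates = ps_candidates(prefix_starts, suffix_starts, 2)
print(valid)       # valid matching prefixes; 19 is excluded
print(end_bound)   # position at which every verification can stop
print(candidates)
```

For the running example this yields the five candidates T[0, 19], T[3, 19], T[5, 19], T[10, 19], and T[15, 19], and drops the invalid matching prefix T[19, 20].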

3. NEGATIVE FACTORS

In this section, we develop the concept of negative factor, which can be used to improve the performance of matching algorithms. Contrary to positive factors, a negative factor is a substring that must not appear in an occurrence. We show that a negative factor can not only prune unnecessary verifications, but also terminate verifications early. We first present a formal definition of negative factors, then show a good pattern to prune candidates.

Definition 1. (Negative factor, or N-factor for short) Given a regular expression Q, a string w is called a negative factor with respect to Q, or simply a negative factor when Q is clear in the context, if no string in $\Sigma^* w \Sigma^*$ is in R(Q).

For a text T, an N-factor w.r.t. an RE Q must not appear in an answer to Q in T. For example, consider the RE $Q = (G|T)A^*GA^*T^*$: the strings C, AGG, and TTA are N-factors, since they cannot appear in an answer as a substring.
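One way to test whether a string is an N-factor is to run it through an automaton for Q from every state: if it dies everywhere, no string of R(Q) can contain it. The sketch below assumes the RE is Q = (G|T)A*GA*T* and uses a hand-built DFA in which every state is reachable from the start state and can reach a final state (the test relies on that property). `DELTA`, `STATES`, `survives`, and `is_negative_factor` are hypothetical names.

```python
# Hand-built DFA for the ASSUMED RE Q = (G|T)A*GA*T*.
# Every state is reachable from state 0 and can reach a final state (2 or 3),
# so a string is a factor of R(Q) exactly when it survives from some state.
DELTA = {
    (0, "G"): 1, (0, "T"): 1,
    (1, "A"): 1, (1, "G"): 2,
    (2, "A"): 2, (2, "T"): 3,
    (3, "T"): 3,
}
STATES = {0, 1, 2, 3}


def survives(state, w):
    """True if the DFA can read all of w starting from the given state."""
    for ch in w:
        state = DELTA.get((state, ch))
        if state is None:
            return False
    return True


def is_negative_factor(w):
    """True if w cannot occur inside any string of R(Q)."""
    return not any(survives(q, w) for q in STATES)


for w in ["C", "AGG", "TTA", "GA", "TT"]:
    print(w, is_negative_factor(w))
```

Under this assumption C, AGG, and TTA are reported as N-factors, while GA and TT are not, since both occur in strings such as GAG and GGTT.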

Lemma 1. An N-factor w.r.t. an RE Q cannot be a substring of a prefix or a suffix w.r.t. Q.

Theorem 1. Given a text with length n, the number of N-factors w.r.t. an RE Q cannot be greater than $\sum_{i=1}^{n} |\Sigma|^{i}$.
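As a rough illustration of how large this bound is, assume it counts all nonempty strings of length at most n over the alphabet Σ (an N-factor longer than the text cannot occur in it, and there are $|\Sigma|^i$ strings of length i). Under that reading, and for $|\Sigma| \ge 2$, the sum is a geometric series:

```latex
\sum_{i=1}^{n} |\Sigma|^{i} = \frac{|\Sigma|^{n+1} - |\Sigma|}{|\Sigma| - 1}
```

For the DNA alphabet ($|\Sigma| = 4$) and the 21-character example text, this gives $(4^{22} - 4)/3 \approx 5.9 \times 10^{12}$, which shows why only a small set of well-chosen N-factors is practical.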

A PNS Pattern: Intuitively, we say a substring of T has a PNS pattern if it starts with a prefix of Q, has an N-factor in the middle, and ends with a suffix of Q. Formally, let $\pi_p$, $\pi_n$, $\pi_s$ be the starting positions of a matching prefix P, a matching N-factor N, and a matching suffix S in a text T, respectively. The substring $T[\pi_p, \pi_s + l_{min} - 1]$ conforms to a PNS pattern if N is a substring of $T[\pi_p, \pi_s + l_{min} - 1]$. Figure 3 shows that a substring conforms to a PNS pattern if and only if $\pi_p \le \pi_n < \pi_s$ and $\pi_n + |N| \le \pi_s + l_{min}$.

[Figure 3: A substring conforming to a PNS pattern iff $\pi_p \le \pi_n < \pi_s$ and $\pi_n + |N| \le \pi_s + l_{min}$. (a) A PNS pattern. (b) Not a PNS pattern.]
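The PNS condition is a pair of position comparisons and can be written as a small predicate. In this sketch the parameter names (`pi_p`, `pi_n`, `n_len`, `pi_s`) are hypothetical stand-ins for the starting positions of the matching prefix, N-factor, and suffix and for the N-factor's length; the inequalities assume the reading "prefix start ≤ N-factor start < suffix start, and the N-factor ends no later than the suffix."

```python
def conforms_to_pns(pi_p, pi_n, n_len, pi_s, l_min):
    """True if text[pi_p .. pi_s + l_min - 1] has a PNS pattern.

    pi_p, pi_n, pi_s: start positions of the matching prefix, N-factor, suffix.
    n_len: length of the N-factor; l_min: length of prefixes and suffixes.
    """
    starts_inside = pi_p <= pi_n < pi_s
    ends_inside = pi_n + n_len <= pi_s + l_min
    return starts_inside and ends_inside


# Candidate T[15, 19] with the N-factor T[17, 17] and the suffix T[18, 19].
print(conforms_to_pns(15, 17, 1, 18, 2))
# The same candidate with the N-factor T[2, 2], which lies before the prefix.
print(conforms_to_pns(15, 2, 1, 18, 2))
```

The first call returns True and the second False, because an N-factor that starts before the prefix is not part of the candidate substring.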

Obviously, a substring of T conforming to a PNS pattern cannot be an occurrence of Q. Based on this observation, we can prune unnecessary verifications using N-factors. Figure 4 shows an example of the benefit of using PNS patterns.

[Figure 4: Using N-factors T[2, 2] and T[17, 17] to prune candidates of $Q = (G|T)A^*GA^*T^*$ in the text T = TACTAGACGTTAATTTACGTA. Compared with Figure 2(b), candidates T[0, 19] and T[15, 19] are pruned and the verifications starting from positions 3, 5, and 10 can be terminated early by using the N-factors T[7, 7] and T[14, 16], respectively.]
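The pruning and early termination shown in Figure 4 can be reproduced with a short sketch that uses only data quoted in the text: the text T, the matching prefix and suffix positions, and the N-factor set listed in this section. The rule it applies is one possible reading, not necessarily the paper's procedure: an occurrence starting at a prefix must end before the last character of any later N-factor and must end with a matching suffix. All function names are hypothetical.

```python
T = "TACTAGACGTTAATTTACGTA"
L_MIN = 2
PREFIX_STARTS = [0, 3, 5, 10, 15, 19]
SUFFIX_STARTS = [4, 5, 8, 9, 11, 12, 13, 14, 18]
N_FACTORS = ["C", "AGG", "ATA", "ATG", "GGG", "GTA",
             "GTG", "TAT", "TGG", "TTA", "TTG"]


def n_factor_hits(text, n_factors):
    """All (start, length) pairs where some N-factor occurs in text."""
    return [(i, len(f))
            for f in n_factors
            for i in range(len(text) - len(f) + 1)
            if text.startswith(f, i)]


def plan_verifications(text, prefix_starts, suffix_starts, n_factors, l_min):
    """Return (pruned_starts, {start: last position worth verifying})."""
    hits = n_factor_hits(text, n_factors)
    last_suffix = max(suffix_starts)
    pruned, to_verify = [], {}
    for p in prefix_starts:
        if p > last_suffix:
            continue  # invalid matching prefix, already excluded by PS
        # An occurrence starting at p must end before the last character
        # of every N-factor that starts at or after p.
        end = last_suffix + l_min - 1
        for start, length in hits:
            if start >= p:
                end = min(end, start + length - 2)
        # It must also end with a matching suffix inside text[p..end].
        ends = [s + l_min - 1 for s in suffix_starts
                if p <= s and s + l_min - 1 <= end]
        if ends:
            to_verify[p] = max(ends)
        else:
            pruned.append(p)
    return pruned, to_verify


print(plan_verifications(T, PREFIX_STARTS, SUFFIX_STARTS, N_FACTORS, L_MIN))
```

On the example this prunes the candidates starting at positions 0 and 15 and limits the remaining verifications to T[3, 6], T[5, 6], and T[10, 15], which matches the pruning and early termination described for Figure 4.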

Although the number of N-factors w.r.t. $Q = (G|T)A^*GA^*T^*$ is large, we can still generate a small number of high-quality N-factors, such as {C, AGG, ATA, ATG, GGG, GTA, GTG, TAT, TGG, TTA, TTG}. (We will provide details in Section 5.) For the example in Figure 2(b), all five candidate substrings conform to a PNS pattern, and two of them can be pruned using the N-factors T[2, 2] and T[17, 17]. For instance, the substring T[15, 19] is such a substring that consists of a matching prefix T[15, 16] = TA, a matching suffix

