matching suffix until it fails, and it reports an occurrence
whenever a final state is reached. We call this kind of ap-
proach “algorithm PFilter,” where “P” stands for “Prefix.”
TACTAGACGTTAATTTACGTA
0
1
2
3
4
5
6
7
8
9
10
11 12
13
14
15
16
17
18
19
20
C andidate occurrences
(a) Using prefixes.
TACTAGACGTTAATTTACGTA
0
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
C andidate occurrences
(b) Combination of prefixes and last matching suffix.
Figure 2: Checking candidate occurrences of the RE
Q =(G|T)A
∗
GA
∗
T
∗
.
An alternative approach is adopted in NR-grep [9, 11],
which uses a sliding window of size l
min
on the text T and
recognizes reversed matching prefixes in the sliding win-
dow using a reversed automaton. Similar to the defini-
tion of prefix, a suffix w.r.t. an RE Q is defined as a suf-
fix with length l
min
of a string in R(Q). For example, for
the RE Q =(G|T)A
∗
GA
∗
T
∗
, the suffixes w.r.t. Q are TT,
AT, AA, GA, GT, AG, GG,andTG, and the matching suffix-
es are T [4, 5], T [5, 6], T [8, 9], T [9, 10], T [11, 12], T [12, 13],
T [13, 14], T [14, 15], and T [18, 19]. We call the suffix-based
approach “algorithm SFilter,” which is also similar to the
algorithm PFilter. It runs a reversed automaton from the
end position of each suffix to the beginning of the text.
2.3 Improving Prefix-Based Approaches Us-
ing Last Matching Suffix
In Figure 2(a), the matching prefix T [19, 20] = TA could
not be used to produce an answer string in R(Q) since it is
not among the suffixes identified above from the RE. There-
fore, we could use the last matching suffix to do an early
termination in each verification step. Figure 2(b) shows the
example of improving the algorithm PFilter.Aswecan
see, by using the last matching suffix T [18, 19] = GT in the
text T , a verification can terminate early at position 19.
We call this approach the “algorithm PS.” It only verifies
those substrings starting from every matching prefix S
p
to
the last matching suffix S
s
if the starting position of S
p
is
less than or equals to the starting position of S
s
.WecallS
s
a valid matching suffix and each S
p
a valid matching prefix
w.r.t. its valid matching suffix S
s
. For example, the sub-
string T [18, 19] is a valid matching suffix and the substrings
T [0, 1], T [3, 4], T [5, 6], T [10, 11], T [15, 16] are valid match-
ing prefixes, whereas T [19, 20] is an invalid matching prefix.
The algorithm PS requires O(m·n·n
p
) time to do verifica-
tions for n
p
valid matching prefixes of T , assuming m is |Q|,
and the verification for each valid matching prefix requires
O(m·n)time.
3. NEGATIVE FACTORS
In this section, we develop the concept of negative factor,
which can be used to improve the performance of matching
algorithms. Contrary to positive factors, a negative factor
is a substring that must not appear in an occurrence. We
show that a negative factor can not only prune unnecessary
verifications, but also terminate verifications early. We first
present a formal definition of negative factors, then show a
good pattern to prune candidates.
Definition 1. (Negative factor, or N-factor for short) Giv-
en a regular expression Q and a string w, a string w is called
a negative factor with respect to Q, or simply a negative fac-
tor when Q is clear in the context, if there is no string Σ
∗
wΣ
∗
does in R(Q).
For a text T , an N-factor w.r.t. an RE Q must not appear
in an answer to Q in T . For example, consider the RE Q
=(G|T)A
∗
GA
∗
T
∗
, the strings C, AGG,andTTA are N-factors,
since they cannot appear in an answer as a substring.
Lemma 1. An N-factor w.r.t. an RE Q can not be a sub-
string of a prefix or a suffix w.r.t. Q.
Theorem 1. Given a text with length n, the number of
N-factors w.r.t. an RE Q cannot be greater than
n
i=1
|Σ|
i
.
A PNS Pattern: Intuitively, we say a substring of T has a
PNS pattern if it starts with a prefix of Q, has an N-factor
in the middle, and ends with a suffix of Q.Formerly,let
π
p
,π
n
,π
s
be the set of starting positions of a matching prefix
P , a matching N-factor N , and a matching suffix S in a text
T , respectively. The substring T [π
p
,π
s
+ l
min
− 1] conforms
to a PNS pattern if N is a substring of T [π
p
,π
s
+ l
min
− 1].
Figure 3 shows that a substring conforms to a PNS pattern
if and only if π
p
≤ π
n
<π
s
and π
n
+ |N|≤π
s
+ l
min
.
T
S
1
P
1
N
1
S
2
P
2
N
2
T
(a) A PNS pattern. (b) Not a PNS pattern.
ʌ
p
ʌ
n
ʌ
s
ʌ
p
ʌ
n
ʌ
s
Figure 3: A substring conforming to a PNS pattern
iff π
p
≤ π
n
<π
s
and π
n
+ |N|≤π
s
+ l
min
.
Obviously, a substring of T conforming to a PNS pattern
cannot be an occurrence of Q. Based on this observation,
we can prune unnecessary verifications using N-factors. Fig-
ure 4 shows an example of the benefit by using PNS patterns.
TACTAGACGTTAATTTACGTA
0
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
Candidate occurrences
Figure 4: Using N-factors T [2, 2] and T [17, 17] to
prune candidates of Q =(G|T)A
∗
GA
∗
T
∗
. Compared
with Figure 2(b), candidates T [0, 19] and T [15, 19] are
pruned and the verifications starting from positions
3, 5,and10 can be terminated early by using the
N-factors T [7, 7] and T [14, 16], respectively.
Although the number of N-factors w.r.t. Q =(G|T)A
∗
GA
∗
T
∗
is large, we can still generate a small number of high-quality
N-factors, such as {C, AGG, ATA, ATG, GGG, GTA, GTG, TAT,
TGG, TTA, TTG}. (We will provide details in Section 5.) For
the example in Figure 2(b), all the five candidate substrings
conform to a PNS pattern, in which two of them can be
pruned using the N-factors T [2, 2] and T [17, 17]. For in-
stance, the substring T [15, 19] is such a substring that con-
sists of a matching prefix T [15, 16] = TA, a matching suffix