
pat:
AT-THAT
string:
...
WHICH-FINALLY-HALTS.--AT-THAT-POINT ...
÷
Since "F" is known not to occur in
pat,
we can appeal
to Observation 1 and move the pointer (and thus
pat)
down by 7:
pat:
AT-THAT
string:
... WHICH-FINALLY-HALTS.--AT-THAT-POINT ...
÷
Appealing to Observation 2, we can move the pointer
down 4 to align the two hyphens:
pat:
AT-THAT
string:
... WHICH-FINALLY-HALTS.--AT-THAT-POINT ...
÷
Now
char
matches its opposite in
pat.
Therefore we
step left by one:
pat:
AT-THAT
string:
... WHICH-FINALLY-HALTS.--AT T~IAT POINT ...
÷
Appealing to Observation 3(a), we can move the
pointer to the right by 7 positions because "L" does
not occur in
pat. 2
Note that this only moves
pat
to the
right by 6.
pat:
AT-THAT
string:
... WHICH-FINALLY-HALTS.--AT-THAT-POINT ...
÷
Again
char
matches the last character of
pat.
Stepping
to the left we see that the previous character in
string
also matches its opposite in
pat.
Stepping to the left a
second time produces:
pat:
AT-THAT
string:
... WHICH-FINALLY-HALTS.--AT-THAT-POINT ...
÷
Noting that we have a mismatch, we appeal to Obser-
vation 3(b). The
delta2
move is best since it allows us
to push the pointer to the right by 7 so as to align the
discovered substring "AT" with the beginning of
pat. ~
pat:
AT-THAT
string:
... WHICH-FINALLY-HALTS.--AT-THAT-POINT ...
÷
This time we discover that each character of
pat
matches the corresponding character in
string
so we
have found the pattern. Note that we made only 14
references to
string.
Seven of these were required to
confirm the final match. The other seven allowed us to
move past the first 22 characters of
string.
2 Note that
deltaz
would allow us to move the pointer to the
right only 4 positions in order to align the discovered substring "T"
in
string
with its second from last occurrence at the beginning of the
word "THAT" in
pat.
3 The
delta~
move only allows the pointer to be pushed to the
right by 4 to align the hyphens.
764
4. The Algorithm
We now specify the algorithm. The notation
pat(j)
refers to the jth character in
pat
(counting from 1 on
the left).
We assume the existence of two tables,
delta1
and
deltas.
The first has as many entries as there are
characters in the alphabet. The entry for some charac-
ter
char
will be denoted by
deltas(char).
The second
table has as many entries as there are character posi-
tions in the pattern. The jth entry will be denoted by
delta2(j).
Both tables contain non-negative integers.
The tables are initialized by preprocessing
pat,
and
their entries correspond to the values
deltaa
and
delta2
referred to earlier. We will specify their precise con-
tents after it is clear how they are to be used.
Our search algorithm may be specified as follows:
stringlen ,,--
length of
string.
i ~ patlen.
top:
if i >
stringlen
then return false.
j ,,-- patlen.
loop:
ifj = 0 then returnJ + 1.
if
string(i) = pat(j)
then
j~"-j-1.
i,~--i-1.
goto
loop.
close;
i ~-- i + max(delta1 (string(i)), delta2 (j)).
goto
top.
If the above algorithm returns false, then
pat
does not
occur in
string.
If the algorithm returns a number,
then it is the position of the left end of the first
occurrence
of pat
in
string.
The
deltal
table has an entry for each character
char
in the alphabet. The definition of
delta~
is:
deltas(char)
= If
char
does not occur in
pat,
then
pat-
len;
else
patlen - j,
where j is the
maximum integer such that
pat(j) =
char.
The
deltaz
table has one entry for each of the integers
from 1 to
patlen.
Roughly speaking,
delta2(j)
is (a) the
distance we can slide
pat
down so as to align the
discovered occurrence (in
string)
of the last
patlen-j
characters of
pat
with its rightmost plausible reoccurr-
ence, plus (b) the additional distance we must slide the
"pointer" down so as to restart the process at the right
end of
pat.
To define
delta2
precisely we must define
the rightmost plausible reoccurrence of a terminal
substring of
pat.
To this end let us make the following
conventions: Let $ be a character that does not occur
in
pat
and let us say that if i is less than 1 then
pat(i)
is
$. Let us also say that two sequences of characters [c~
• . . c,] and [d~ . . . d,] "unify" if for all i from 1 to n
either c~ =di or c~ = $ or d~ = $.
Finally, we define the position of the rightmost
plausible reoccurrence of the terminal substring which
starts at positionj + 1,
rpr(j),
forj from 1
topatlen,
to
be the greatest k less than or equal to
patlen
such that
Communications October 1977
of Volume 20
the ACM Number 10