work during an iteration that is not constant, and as the number of iterations is bounded by
O(log(n)), this implies a complexity of O(m + log(n)).
When only one occurrence is needed, the index is returned the first time the search
string has been matched. If the indices of all occurrences should be returned, this can be
done at an extra cost of O(#occ). The algorithm in [3] does not terminate when the
first occurrence is found. Instead, the next iteration proceeds with R = M, and the algorithm
terminates when R − L < 2. When the algorithm terminates, it is guaranteed that R is
the smallest index i where Q ≤ S_{SA[i]}, and thus the index of the leftmost occurrence,
if any occurrences exist. With a symmetric algorithm the rightmost occurrence in the
suffix array can be found. Having found the leftmost and the rightmost occurrences in
O(m + log(n)) time, it is straightforward to return the indices in O(#occ) time.
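As an illustration (this is not the pseudo code from [3]), the two symmetric binary searches and the O(#occ) reporting step can be sketched in Python. This plain version skips the lcp speed-up, so each comparison costs O(m) and each search runs in O(m · log(n)); all names are illustrative.

```python
def build_suffix_array(s):
    # Naive O(n^2 log n) construction, for illustration only.
    return sorted(range(len(s)), key=lambda i: s[i:])

def leftmost(s, sa, q):
    # Smallest index i with q <= s[sa[i]:], i.e. the leftmost occurrence
    # in the suffix array if q occurs at all.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if s[sa[mid]:] < q:
            lo = mid + 1
        else:
            hi = mid
    return lo

def rightmost(s, sa, q):
    # Smallest index i whose suffix does not start with q; one past the
    # rightmost occurrence.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if s[sa[mid]:sa[mid] + len(q)] <= q:
            lo = mid + 1
        else:
            hi = mid
    return lo

def occurrences(s, sa, q):
    # Report all occurrence positions in O(#occ) extra time
    # (the sort is only for readable output).
    L, R = leftmost(s, sa, q), rightmost(s, sa, q)
    return sorted(sa[L:R])

s = "mississippi"
sa = build_suffix_array(s)
print(occurrences(s, sa, "ssi"))   # -> [2, 5]
```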
The pseudo code for finding the leftmost occurrence of the query string is provided in
[3, p. 6]^1. The pseudo code for finding the rightmost occurrence is symmetric, and is for
the sake of completeness provided in algorithm 2, with the notation used in this thesis. It
is easy to turn the pseudo code into the simple search algorithm, as the only difference is
that the algorithm should return the first time an occurrence is found.
The algorithm uses three different arrays to reach its complexity: the suffix array
and the two auxiliary arrays. The suffix array can be created in O(n · log(|Σ|)) time in
multiple ways. Given a suffix tree with sorted children, the simplest solution is a depth
first traversal of the suffix tree, where the children are visited in lexicographical order. This
is an O(n · log(|Σ|)) time solution, as a depth first traversal in lexicographical order clearly
can be done in O(n · log(|Σ|)) time, and section 2.1 shows how to build a suffix tree in
O(n · log(|Σ|)) time.
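A minimal sketch of this depth first construction, under assumed toy data structures: for brevity a suffix trie with one character per edge is built naively (which costs O(n^2) space, unlike a real suffix tree), and the class and helper names are illustrative, not from the thesis.

```python
class Node:
    def __init__(self):
        self.children = {}        # edge character -> child node
        self.suffix_start = None  # set on the node ending a suffix

def build_suffix_trie(s):
    # Naive trie over all suffixes of s; s is assumed to end with a
    # unique terminator such as '$', so every suffix ends in a leaf.
    root = Node()
    for i in range(len(s)):
        node = root
        for ch in s[i:]:
            node = node.children.setdefault(ch, Node())
        node.suffix_start = i
    return root

def suffix_array_from_tree(root):
    # Depth first traversal visiting children in lexicographical order;
    # the leaves are then reported in sorted suffix order, which is
    # exactly the suffix array.
    sa = []
    stack = [root]
    while stack:
        node = stack.pop()
        if node.suffix_start is not None:
            sa.append(node.suffix_start)
        # Push children in reverse order so the smallest edge character
        # is expanded first.
        for ch in sorted(node.children, reverse=True):
            stack.append(node.children[ch])
    return sa

print(suffix_array_from_tree(build_suffix_trie("banana$")))
# -> [6, 5, 3, 1, 0, 4, 2]
```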
Creating the two auxiliary arrays with a complexity as good as or better than that of the
suffix array is, on the other hand, not straightforward. Finding lcp(S_i, S_j) can be reduced to
finding the nearest common ancestor (NCA) of the two leaves representing S_i and S_j in the
suffix tree. Harel et al. [2] show how NCA(S_i, S_j) can be found in constant time, at the cost
of O(n) preprocessing. The time complexity for creating the auxiliary arrays thus becomes
O(n), while the complexity for the entire enhanced suffix array becomes O(n · log(|Σ|)).
The data structures used to solve the NCA problem can be created within O(n) space,
giving the enhanced suffix array a space complexity of O(n), as was the case for the
suffix tree. The following subsections explain how Harel et al. [2] solve the
NCA problem in constant time.
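As an aside, constant-time lcp queries can also be obtained without the NCA machinery of [2], for example by answering range-minimum queries over an lcp array with a sparse table, at the cost of O(n · log(n)) preprocessing instead of O(n). A sketch under that assumption, with illustrative names and a naive lcp-array construction:

```python
def lcp_array(s, sa):
    # lcp[i] = longest common prefix length of the suffixes at sa[i-1]
    # and sa[i]. Kasai's algorithm would give this in O(n); a naive
    # O(n^2) version keeps the sketch short.
    def lcp(a, b):
        k = 0
        while a + k < len(s) and b + k < len(s) and s[a + k] == s[b + k]:
            k += 1
        return k
    return [0] + [lcp(sa[i - 1], sa[i]) for i in range(1, len(sa))]

class RMQ:
    # Sparse table: O(n log n) preprocessing, O(1) minimum queries.
    def __init__(self, a):
        self.table = [a[:]]           # table[t][i] = min of a[i .. i+2^t-1]
        k = 1
        while 2 * k <= len(a):
            prev = self.table[-1]
            self.table.append([min(prev[i], prev[i + k])
                               for i in range(len(prev) - k)])
            k *= 2

    def query(self, i, j):
        # Minimum over a[i..j], inclusive, for i <= j.
        t = (j - i + 1).bit_length() - 1
        return min(self.table[t][i], self.table[t][j - (1 << t) + 1])

def lcp_of_ranks(rmq, i, j):
    # lcp of the suffixes at suffix-array positions i < j is the minimum
    # of the lcp array over positions i+1 .. j.
    return rmq.query(i + 1, j)

s = "banana"
sa = sorted(range(len(s)), key=lambda i: s[i:])   # [5, 3, 1, 0, 4, 2]
rmq = RMQ(lcp_array(s, sa))
print(lcp_of_ranks(rmq, 1, 2))   # lcp("ana", "anana") -> 3
```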
Solving the NCA problem
To solve the NCA problem, it is separated into two subproblems, called the nca depth
problem and the depth problem.
^1 If the pseudo code from [3, p. 6] is followed, the user should notice that there is an error on line 5, as
line 5 should have been w_r > a_{Pos[N−1]+r} instead of w_r ≤ a_{Pos[N−1]+r}.