Technically, this work owes very much to two papers [9, 8]. For instance, our construction algorithm of Theorem 1 is essentially the same as the grammar compression algorithm based on recompression presented in [9]. Our contribution is in discovering the above-mentioned property that can be used for fast LCE queries. Also, we use the property to upper bound the size of our data structure in terms of z rather than the smallest grammar size g*. Since it is known that z ≤ g* holds, an upper bound in terms of z is preferable. Our construction algorithm of Theorem 2 owes to [8], in which the recompression technique solves the fully-compressed pattern matching problems. Basically, our results can be obtained by applying the technique. However, we make some contributions on top of it: we give a new observation that simplifies the implementation and analysis of a component of recompression called BComp (see Section 4.1.2). Also, we show that we can improve the time complexity from O(n lg N) to O(n lg(N/n)).
2 Preliminaries
An alphabet Σ is a set of characters. A string over Σ is an element in Σ∗. For any string w ∈ Σ∗, |w| denotes the length of w. Let ε be the empty string, i.e., |ε| = 0. Let Σ+ = Σ∗ \ {ε}. For any 1 ≤ i ≤ |w|, w[i] denotes the i-th character of w. For any 1 ≤ i ≤ j ≤ |w|, w[i..j] denotes the substring of w beginning at i and ending at j. For convenience, let w[i..j] = ε if i > j. For any 0 ≤ i ≤ |w|, w[1..i] (resp. w[|w| − i + 1..|w|]) is called the prefix (resp. suffix) of w of length i. We say that a string x occurs at position i in w iff w[i..i + |x| − 1] = x. A substring w[i..j] = c^d (c ∈ Σ, d ≥ 1) of w is called a block iff it is a maximal run of a single character, i.e., w[i − 1] ≠ c and w[j + 1] ≠ c.
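As a concrete illustration, the block decomposition of a string can be computed by a single left-to-right scan; the function name below is hypothetical and not part of the paper:

```python
def blocks(w: str) -> list[tuple[int, str, int]]:
    """Decompose w into its blocks, i.e., maximal runs c^d of a single
    character, returned as (start, c, d) with 1-based start positions."""
    result = []
    i = 0
    while i < len(w):
        j = i
        # extend the run of the character w[i] as far as possible
        while j < len(w) and w[j] == w[i]:
            j += 1
        result.append((i + 1, w[i], j - i))
        i = j
    return result

# blocks("aabbbc") == [(1, 'a', 2), (3, 'b', 3), (6, 'c', 1)]
```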
The text on which LCE queries are performed is denoted by T ∈ Σ∗ with N = |T| throughout this paper. We assume that Σ is an integer alphabet [1..N^{O(1)}] and the standard word RAM model with word size Ω(lg N).
The size of our compressed LCE data structure is bounded by O(z lg(N/z)), where z is the size of the LZ77 factorization of T, defined as follows:
Definition 4 (LZ77 factorization). The factorization T = f1f2 · · · fz is the LZ77 factorization of T iff the following condition holds: for any 1 ≤ i ≤ z, let pi = |f1f2 · · · fi−1| + 1; then fi = T[pi] if T[pi] does not appear in T[1..pi − 1], and otherwise fi is the longest prefix of T[pi..N] that occurs in T[1..pi − 1].
Example 5. The LZ77 factorization of abaabaabb is a · b · a · aba · ab · b and z = 6.
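For reference, Definition 4 can be transcribed directly into a naive quadratic-time procedure (this is only an illustrative sketch, not one of the efficient algorithms discussed in this paper; the function name is hypothetical):

```python
def lz77_factorize(t: str) -> list[str]:
    """Naive LZ77 factorization following Definition 4: fi is the single
    character t[pi] if it has not appeared before, and otherwise the
    longest prefix of t[pi..] that occurs in t[..pi - 1]."""
    factors = []
    p = 0  # 0-based position of the next factor
    n = len(t)
    while p < n:
        if t[p] not in t[:p]:
            factors.append(t[p])
            p += 1
        else:
            # grow the candidate prefix while it still occurs in t[:p]
            l = 1
            while p + l <= n and t[p:p + l] in t[:p]:
                l += 1
            factors.append(t[p:p + l - 1])
            p += l - 1
    return factors

# lz77_factorize("abaabaabb") == ["a", "b", "a", "aba", "ab", "b"]
```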
In this article, we deal with grammar compressed strings, in which a string is represented by a Context-Free Grammar (CFG) generating the string only. In particular, we consider Straight-Line Programs (SLPs), which are CFGs in Chomsky normal form. Formally, an SLP that generates a string T is a triple S = (Σ, V, D), where Σ is the set of characters (terminals), V is the set of variables (non-terminals), D is the set of deterministic production rules whose righthand sides are in V^2 ∪ Σ, and the last variable derives T.³
Let n = |V|. We treat variables as integers in [1..n] (which should be distinguishable from Σ by having one extra bit), and D as an injective function that maps a variable to its righthand side. We assume that, given any variable X, we can access in O(1) time the data space storing the information of X, e.g., D(X). We refer to n as the size of S since S can be encoded in O(n) space. Note that N can be as large as 2^{n−1}, and so SLPs have the potential to achieve exponential compression.
We extend SLPs by allowing run-length encoded rules, whose righthand sides are of the form X^d with X ∈ V and d ≥ 2, and call such CFGs run-length SLPs (RLSLPs). Since a run-length encoded rule can be stored in O(1) space, we still define the size of an RLSLP by the number of variables.
Let us consider the derivation tree 𝒯 of an RLSLP S that generates a string T, where we delete all the nodes labeled with terminals for simplicity. That is, every node in 𝒯 is labeled with a variable. The height of S is the height of 𝒯. We say that a sequence C = v1 · · · vm of nodes is a chain iff the nodes are all adjacent in this order, i.e., the beginning position of vi+1 is the ending position of vi plus one for any 1 ≤ i < m. C is labeled with the sequence of labels of v1 · · · vm.
For any sequence p ∈ V∗ of variables, let val_S(p) denote the string obtained by concatenating the strings derived from all variables in the sequence. We omit S when it is clear from context. We say that p generates val(p). Also, we say that p occurs at position i iff there is a chain that is labeled with p and begins at i.
The next lemma, which is somewhat standard for SLPs, also holds for RLSLPs.
Lemma 6. For any RLSLP S of height h generating T, by storing |val(X)| for every variable X, we can support Extract(i, ℓ) in O(h + ℓ) time.
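One way Lemma 6 can be realized is sketched below, assuming a hypothetical dictionary D (a terminal character, a pair of variables, or a run-length rule) and a precomputed table length[X] = |val(X)|: the stored lengths guide the descent to position i along an O(h)-length path, and the subtrees fully contained in the queried range contribute only O(ℓ) extra work.

```python
def extract(D, length, root, i, l):
    """Extract(i, l): return T[i..i+l-1] (1-based) from an RLSLP.

    D[X] is a terminal character, a pair (Y, Z), or a run-length rule
    ('rep', Y, d) representing Y^d; length[X] = |val(X)| is precomputed.
    """
    out = []

    def walk(X, lo, hi):
        # emit val(X) restricted to positions [lo..hi] within val(X)
        if lo > hi:
            return
        rhs = D[X]
        if isinstance(rhs, str):          # terminal rule
            out.append(rhs)
        elif rhs[0] == 'rep':             # run-length rule X -> Y^d
            _, Y, d = rhs
            m = length[Y]
            # visit only the copies of Y overlapping [lo..hi]
            for k in range((lo - 1) // m, (hi - 1) // m + 1):
                walk(Y, max(lo - k * m, 1), min(hi - k * m, m))
        else:                             # binary rule X -> Y Z
            Y, Z = rhs
            m = length[Y]
            walk(Y, lo, min(hi, m))
            walk(Z, max(lo - m, 1), hi - m)

    walk(root, i, i + l - 1)
    return "".join(out)

# Example: T = "abababab" via 1 -> a, 2 -> b, 3 -> (1, 2), 4 -> 3^4;
# extract({1: 'a', 2: 'b', 3: (1, 2), 4: ('rep', 3, 4)},
#         {1: 1, 2: 1, 3: 2, 4: 8}, 4, 3, 4) == "abab"
```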
³ We treat the last variable as the starting variable.