Technically, this work owes very much to two papers [9, 8]. For instance, our construction algorithm of Theorem 1 is essentially the same as the grammar compression algorithm based on recompression presented in [9]. Our contribution is in discovering the above-mentioned property that can be used for fast LCE queries. Also, we use the property to upper bound the size of our data structure in terms of z rather than the smallest grammar size g*. Since it is known that z ≤ g* holds, an upper bound in terms of z is preferable. Our construction algorithm of Theorem 2 owes to [8], in which the recompression technique solves the fully-compressed pattern matching problems. Basically, our results can be obtained by applying the technique. However, we make some contributions on top of it: we give a new observation that simplifies the implementation and analysis of a component of recompression called BComp (see Section 4.1.2). Also, we show that we can improve the time complexity from O(n lg N) to O(n lg(N/n)).
2 Preliminaries
An alphabet Σ is a set of characters. A string over Σ is an element in Σ∗. For any string w ∈ Σ∗, |w| denotes the length of w. Let ε be the empty string, i.e., |ε| = 0. Let Σ+ = Σ∗ \ {ε}. For any 1 ≤ i ≤ |w|, w[i] denotes the i-th character of w. For any 1 ≤ i ≤ j ≤ |w|, w[i..j] denotes the substring of w beginning at i and ending at j. For convenience, let w[i..j] = ε if i > j. For any 0 ≤ i ≤ |w|, w[1..i] (resp. w[|w| − i + 1..|w|]) is called the prefix (resp. suffix) of w of length i. We say that a string x occurs at position i in w iff w[i..i + |x| − 1] = x. A substring w[i..j] = c^d (c ∈ Σ, d ≥ 1) of w is called a block iff it is a maximal run of a single character, i.e., w[i − 1] ≠ c and w[j + 1] ≠ c.
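As a concrete illustration, the block decomposition of a string can be computed by a single left-to-right scan; the function name below is hypothetical and not part of the paper:

```python
def blocks(w: str) -> list[tuple[int, str, int]]:
    """Decompose w into its blocks, i.e., maximal runs c^d of a single
    character, returned as (start, c, d) with 1-based start positions."""
    result = []
    i = 0
    while i < len(w):
        j = i
        # extend the run of the character w[i] as far as possible
        while j < len(w) and w[j] == w[i]:
            j += 1
        result.append((i + 1, w[i], j - i))
        i = j
    return result

# blocks("aabbbc") == [(1, 'a', 2), (3, 'b', 3), (6, 'c', 1)]
```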
The text on which LCE queries are performed is denoted by T ∈ Σ∗ with N = |T| throughout this paper. We assume that Σ is an integer alphabet [1..N^{O(1)}] and the standard word RAM model with word size Ω(lg N).
The size of our compressed LCE data structure is bounded by O(z lg(N/z)), where z is the size of the LZ77 factorization of T, defined as follows:
Definition 4 (LZ77 factorization). The factorization T = f1f2 · · · fz is the LZ77 factorization of T iff the following condition holds: for any 1 ≤ i ≤ z, let pi = |f1f2 · · · fi−1| + 1; then fi = T[pi] if T[pi] does not appear in T[1..pi − 1], and otherwise fi is the longest prefix of T[pi..N] that occurs in T[1..pi − 1].
Example 5. The LZ77 factorization of abaabaabb is a · b · a · aba · ab · b and z = 6.
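For reference, Definition 4 can be transcribed directly into a naive quadratic-time procedure (this is only an illustrative sketch, not one of the efficient algorithms discussed in this paper; the function name is hypothetical):

```python
def lz77_factorize(t: str) -> list[str]:
    """Naive LZ77 factorization following Definition 4: fi is the single
    character t[pi] if it has not appeared before, and otherwise the
    longest prefix of t[pi..] that occurs in t[..pi - 1]."""
    factors = []
    p = 0  # 0-based position of the next factor
    n = len(t)
    while p < n:
        if t[p] not in t[:p]:
            factors.append(t[p])
            p += 1
        else:
            # grow the candidate prefix while it still occurs in t[:p]
            l = 1
            while p + l <= n and t[p:p + l] in t[:p]:
                l += 1
            factors.append(t[p:p + l - 1])
            p += l - 1
    return factors

# lz77_factorize("abaabaabb") == ["a", "b", "a", "aba", "ab", "b"]
```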
In this article, we deal with grammar compressed strings, in which a string is represented by a Context-Free Grammar (CFG) generating the string only. In particular, we consider Straight-Line Programs (SLPs), which are CFGs in Chomsky normal form. Formally, an SLP that generates a string T is a triple S = (Σ, V, D), where Σ is the set of characters (terminals), V is the set of variables (non-terminals), D is the set of deterministic production rules whose righthand sides are in V^2 ∪ Σ, and the last variable derives T.³
Let n = |V|. We treat variables as integers in [1..n] (which should be distinguishable from Σ by having one extra bit), and D as an injective function that maps a variable to its righthand side. We assume that, given any variable X, we can access in O(1) time the data space storing the information of X, e.g., D(X). We refer to n as the size of S since S can be encoded in O(n) space. Note that N can be as large as 2^{n−1}, and so SLPs have the potential to achieve exponential compression.
We extend SLPs by allowing run-length encoded rules, whose righthand sides are of the form X^d with X ∈ V and d ≥ 2, and call such CFGs run-length SLPs (RLSLPs). Since a run-length encoded rule can be stored in O(1) space, we still define the size of an RLSLP by the number of variables.
Let us consider the derivation tree 𝒯 of an RLSLP S that generates a string T, where we delete all the nodes labeled with terminals for simplicity. That is, every node in 𝒯 is labeled with a variable. The height of S is the height of 𝒯. We say that a sequence C = v1 · · · vm of nodes is a chain iff the nodes are all adjacent in this order, i.e., the beginning position of vi+1 is the ending position of vi plus one for any 1 ≤ i < m. C is labeled with the sequence of labels of v1 · · · vm.
For any sequence p ∈ V∗ of variables, let val_S(p) denote the string obtained by concatenating the strings derived from all variables in the sequence. We omit S when it is clear from context. We say that p generates val(p). Also, we say that p occurs at position i iff there is a chain that is labeled with p and begins at i.
The next lemma, which is somewhat standard for SLPs, also holds for RLSLPs.
Lemma 6. For any RLSLP S of height h generating T, by storing |val(X)| for every variable X, we can support Extract(i, ℓ) in O(h + ℓ) time.
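One way Lemma 6 can be realized is sketched below, assuming a hypothetical dictionary D (a terminal character, a pair of variables, or a run-length rule) and a precomputed table length[X] = |val(X)|: the stored lengths guide the descent to position i along an O(h)-length path, and the subtrees fully contained in the queried range contribute only O(ℓ) extra work.

```python
def extract(D, length, root, i, l):
    """Extract(i, l): return T[i..i+l-1] (1-based) from an RLSLP.

    D[X] is a terminal character, a pair (Y, Z), or a run-length rule
    ('rep', Y, d) representing Y^d; length[X] = |val(X)| is precomputed.
    """
    out = []

    def walk(X, lo, hi):
        # emit val(X) restricted to positions [lo..hi] within val(X)
        if lo > hi:
            return
        rhs = D[X]
        if isinstance(rhs, str):          # terminal rule
            out.append(rhs)
        elif rhs[0] == 'rep':             # run-length rule X -> Y^d
            _, Y, d = rhs
            m = length[Y]
            # visit only the copies of Y overlapping [lo..hi]
            for k in range((lo - 1) // m, (hi - 1) // m + 1):
                walk(Y, max(lo - k * m, 1), min(hi - k * m, m))
        else:                             # binary rule X -> Y Z
            Y, Z = rhs
            m = length[Y]
            walk(Y, lo, min(hi, m))
            walk(Z, max(lo - m, 1), hi - m)

    walk(root, i, i + l - 1)
    return "".join(out)

# Example: T = "abababab" via 1 -> a, 2 -> b, 3 -> (1, 2), 4 -> 3^4;
# extract({1: 'a', 2: 'b', 3: (1, 2), 4: ('rep', 3, 4)},
#         {1: 1, 2: 1, 3: 2, 4: 8}, 4, 3, 4) == "abab"
```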
³ We treat the last variable as the starting variable.