Compressed Suffix Arrays: Practical Construction and Efficient Application to Self-Indexing


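Summary of "A Practical Implementation of Compressed Suffix Arrays with Applications to Self-Indexing": This paper examines how to build compressed suffix arrays (CSA) in practice and how to use them as self-indexes. A compressed suffix array is a space-efficient data structure that indexes the suffixes of a text so that string queries can be answered quickly. It matters in text indexing because it offers fast search in a small amount of space. The authors, Hongwei Huo, Longgang Chen, Jeffrey Scott Vitter, and Yakov Nekrich, propose a CSA that can be constructed in linear time and needs only $2nH_k + n + o(n)$ bits of space for any $k \le c\log_\sigma n - 1$ and any constant $c < 1$, where $n$ is the number of characters in the text, $\sigma$ is the alphabet size, and $H_k$ is the $k$th order entropy.

The authors compare their method with two established compressed indexes, the FM-index and the Sad-CSA. In experiments on the Canterbury Corpus and the Pizza&Chili Corpus, their algorithm shows advantages over both indexes in compression ratio and query time. The new storage scheme does better on unevenly distributed data; on evenly distributed data it may fall behind the other methods.

A self-index contains enough information to reproduce the indexed text itself, so no separate copy of the original data has to be stored or accessed. Using a compressed suffix array as a self-index reduces storage requirements while keeping queries fast, which is useful for managing and searching large text collections in areas such as large-scale data analysis and text mining. For applications with strict requirements on both space and query performance, a CSA-based self-index may therefore be a good fit, particularly when the text is unevenly distributed.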

* Logarithms in this paper are in base 2 unless the base is stated explicitly.

A Practical Implementation of Compressed Suffix Arrays with Applications to Self-Indexing

Hongwei Huo, Longgang Chen, Jeffrey Scott Vitter, and Yakov Nekrich

Xidian University

No.2 Taibai South Road

Xi’an, Shaanxi 710071, China

{hwhuo,lgchen}@mail.xidian.edu.cn

The University of Kansas

1450 Jayhawk Blvd.

Lawrence, KS 66045, USA

{jsv,yakov}@ittc.ku.edu

Abstract: In this paper we develop a simple and practical text indexing scheme for compressed suffix arrays (CSA). For a text of $n$ characters, our CSA can be constructed in linear time and needs $2nH_k + n + o(n)$ bits of space for any $k \le c\log_\sigma n - 1$ and any constant $c < 1$, where $H_k$ denotes the $k$th order entropy. We compare the performance of our method with two established compressed indexing methods, the FM-index and the Sad-CSA. Experiments on the Canterbury Corpus and the Pizza&Chili Corpus show significant advantages of our algorithm over two other indexes in terms of compression and query time. Our storage scheme achieves better performance on all types of data present in these two corpora, except for evenly distributed data, such as DNA. The source code for our CSA is available online.
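To give a feel for the space bound, the following is a minimal Python sketch that computes the $k$th order empirical entropy of a string and evaluates the leading terms $2nH_k + n$. It uses the standard definition of empirical entropy (each symbol is grouped by the $k$ symbols preceding it); the function names `empirical_entropy` and `leading_space_bits` are illustrative and do not come from the paper or its source code. The $o(n)$ term is ignored, and toy inputs like the ones below do not satisfy the condition $k \le c\log_\sigma n - 1$, so the numbers only illustrate the arithmetic.

```python
import math
from collections import Counter, defaultdict


def h0(counts, total):
    """Zeroth-order empirical entropy (bits per symbol) of a frequency table."""
    return -sum(c / total * math.log2(c / total) for c in counts.values() if c)


def empirical_entropy(text, k):
    """kth-order empirical entropy H_k(text) in bits per symbol.

    Groups every symbol by the k symbols that precede it and sums the
    zeroth-order entropy of each group, weighted by group size, over n.
    """
    n = len(text)
    if n == 0:
        return 0.0
    if k == 0:
        return h0(Counter(text), n)
    followers = defaultdict(Counter)
    for i in range(k, n):
        followers[text[i - k:i]][text[i]] += 1
    return sum(h0(cnt, sum(cnt.values())) * sum(cnt.values())
               for cnt in followers.values()) / n


def leading_space_bits(text, k):
    """Leading terms 2*n*H_k + n of the stated bound; the o(n) term is ignored."""
    n = len(text)
    return 2 * n * empirical_entropy(text, k) + n
```

For the string `"abab"`, `empirical_entropy("abab", 0)` is 1 bit per symbol, while `empirical_entropy("abababab", 1)` is 0 because each symbol is fully determined by its predecessor. This is the sense in which higher-order entropy rewards regular, unevenly distributed text.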

1. Introduction

Suffix trees [15,22] and suffix arrays [14] are versatile data structures that play a key role in numerous string processing applications in such areas as string matching, information retrieval, genome analysis, and text compression. Both the suffix tree and the suffix array support pattern matching queries in optimal or almost-optimal time and use linear space of $O(n \log n)$ bits. However in practice these data structures occupy 5 to 20 times more space than the raw string data; the latter needs only $n \log \sigma$ bits of space, where $\sigma$ denotes the alphabet size.

The compressed suffix array (CSA) [9,11,19–21] and the FM-index [2–4] overcome the space limitation by exploiting the text compressibility and index regularities, while supporting the functionalities of suffix arrays and suffix trees. The compressed suffix array and FM-index are self-indexes in that they do not require a copy of the original data; that is, they each serve as an index as well as a compressed version of the original text.

Grossi and Vitter [9,11] introduced the compressed suffix array (GV-CSA), which uses $O(n \log \sigma)$ bits* of space and answers string matching queries in $o(|P| / \log_\sigma n + occ \log_\sigma n)$ time, where $|P|$ is the length of the query pattern $P$ and $occ$ denotes the number of times $P$ occurs in the source string. Sadakane [20,21] showed how to convert the GV-CSA into a self-index, and the resulting index, called Sad-CSA, needs $(1/\epsilon)nH_0 + O(n \log\log \sigma)$ bits of space and answers queries in $O(|P| \log n + occ \log^{\epsilon} n)$ time, where $0 < \epsilon \le 1$ is an arbitrary constant. Henceforth $H_k$ for $k \ge 0$ denotes the $k$th order empirical entropy of the source string $T$. Ferragina and Manzini designed the FM-index that relies on the Burrows-Wheeler transform (BWT) [1] and the backward searching approach [4,9]. Their original index uses at most $5nH_k(T) + o(n \log \sigma)$ bits of

2014 Data Compression Conference

DOI 10.1109/DCC.2014.49

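The Burrows-Wheeler transform and the backward searching approach mentioned for the FM-index can be sketched in a few lines of Python. This is a minimal, uncompressed illustration under stated assumptions: the text does not contain the sentinel `$`, the sentinel sorts before every other symbol, and the occurrence counts are stored as plain tables rather than in the compressed form a real FM-index uses. The names `bwt` and `backward_count` are illustrative, not taken from any library or from the authors' code.

```python
def bwt(text, sentinel="$"):
    """Burrows-Wheeler transform via sorted rotations (naive, for illustration).

    Assumes the sentinel does not occur in text and sorts before every symbol.
    """
    s = text + sentinel
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rot[-1] for rot in rotations)


def backward_count(bwt_text, pattern):
    """Count occurrences of pattern using backward search over the BWT."""
    # C[c] = number of symbols in the text strictly smaller than c.
    symbols = sorted(set(bwt_text))
    C, total = {}, 0
    for c in symbols:
        C[c] = total
        total += bwt_text.count(c)
    # occ[c][i] = number of occurrences of c in bwt_text[:i].
    occ = {c: [0] * (len(bwt_text) + 1) for c in symbols}
    for i, ch in enumerate(bwt_text):
        for c in symbols:
            occ[c][i + 1] = occ[c][i] + (ch == c)
    lo, hi = 0, len(bwt_text)  # half-open range of matching sorted rotations
    for c in reversed(pattern):  # process the pattern right to left
        if c not in C:
            return 0
        lo = C[c] + occ[c][lo]
        hi = C[c] + occ[c][hi]
        if lo >= hi:
            return 0
    return hi - lo
```

For example, `bwt("banana")` yields `"annb$aa"`, and `backward_count("annb$aa", "ana")` returns 2. Each pattern symbol costs two table lookups, so counting takes a number of steps proportional to $|P|$ and never touches the original text, which is what makes the self-index idea possible.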
