Compressed Data Structures and Applications: Optimizing Search Performance

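Updated 2024-07-14. PDF, 259 KB.

"Opportunistic Data Structures with Applications" is a 2000 computer science paper by Paolo Ferragina and Giovanni Manzini. It belongs to the line of research on designing succinct data structures for basic searching problems.

The paper's central question is how to design space-efficient data structures at a time when the volume of electronic data is growing exponentially, faster than improvements in memory and disk capacity. Space optimization is attractive not only because it improves storage efficiency but also because it is closely tied to performance: as authors such as Knuth and Bentley have pointed out, reducing auxiliary information can improve query performance. Implicit data structures aim to minimize the auxiliary information stored alongside the input data without significantly degrading query performance. However, they still represent the input data in full and do not exploit any repetitiveness it may contain. Programmers care about this problem and often use various tricks to compress data while keeping queries fast, but those approaches are essentially heuristics whose effectiveness is limited.

The paper calls a data structure "opportunistic" when it takes advantage of the compressibility of the input: its space occupancy decreases when the input is compressible, with no significant slowdown in query performance. The basic searching problems mentioned in the paper presumably include common tasks such as string search, sorting, and index construction. Succinct data structures aim to reduce storage overhead without sacrificing much query speed, which matters most for systems that handle large amounts of data. Structures such as tries, B-trees, and compressed indexes illustrate this trade-off between space and lookup speed.

Overall, the paper discusses how novel data structures can balance storage space against query performance in an era of rapidly growing data, which is of theoretical and practical value for data management in large-data settings.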

of block addressing is that it can achieve both sublinear space overhead and sublinear query time, whereas inverted indices achieve only the second goal [4]. Unfortunately, up to now all the known block addressing indices [18, 4] achieve this goal only under some restrictive conditions on the block size. We show how to use our opportunistic data structure to devise a novel block addressing scheme, called CGlimpse (standing for Compressed Glimpse), which always achieves time and space sublinearity.

2 Background

Let T[1, u] be a text drawn from a constant-size alphabet Σ. A central concept in our discussion is the suffix array data structure [17]. The suffix array A built on T[1, u] is an array containing the lexicographically ordered sequence of the suffixes of T, represented via pointers to their starting positions (i.e., integers). For instance, if T = ababc then A = [1, 3, 2, 4, 5]. In practice A occupies 4u bytes, actually a lot when indexing large text collections. It is a long standing belief that suffix arrays are uncompressible because of the “apparently random” permutation of the suffix pointers. Recent results in the data compression field have opened the door to revolutionary ways to compress suffix arrays and are the basic tools of our solution. In [7], Burrows and Wheeler proposed a transformation (BWT from now on) consisting of a reversible permutation of the text characters which gives a new string that is “easier to compress”. The BWT tends to group together characters which occur adjacent to similar text substrings. This nice property is exploited by locally-adaptive compression algorithms, such as move-to-front coding [6], in combination with statistical (i.e. Huffman or Arithmetic coders) or structured coding models. The BWT-based compressors are among the best compressors currently available since they achieve a very good compression ratio using relatively small resources (time and space).
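The suffix array example above (T = ababc) can be checked with a short Python sketch. This is a naive construction that sorts the suffixes directly, chosen for clarity only; the function name `suffix_array` is hypothetical and not taken from the paper.

```python
def suffix_array(text: str) -> list[int]:
    """Return the 1-based starting positions of the suffixes of `text`,
    listed in lexicographic order of the suffixes.

    Naive sketch: sorting whole suffixes is simple but far slower than
    the specialised construction algorithms used in practice.
    """
    return sorted(range(1, len(text) + 1), key=lambda i: text[i - 1:])


print(suffix_array("ababc"))  # [1, 3, 2, 4, 5]
```

The sorted suffixes are ababc, abc, babc, bc, c, which start at positions 1, 3, 2, 4, 5 respectively.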

The reversible BW-transform. We distinguish between a forward transformation, which produces the string to be compressed, and a backward transformation which gives back the original text from the transformed one. The forward BWT consists of three basic steps: (1) Append to the end of T a special character # smaller than any other text character; (2) form a conceptual matrix M whose rows are the cyclic shifts of the string T# sorted in lexicographic order; (3) construct the transformed text L by taking the last column of M. Notice that every column of M is a permutation of the last column L, and in particular the first column of M, call it F, is obtained by lexicographically sorting the characters in L.
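The three steps of the forward BWT can be sketched in Python as follows. This version materialises the matrix M explicitly, which is only reasonable for tiny inputs; the name `bwt_forward` and the custom sort key (used to force # below every other character regardless of its character code) are assumptions made for this illustration.

```python
def bwt_forward(text: str, sentinel: str = "#") -> tuple[str, str]:
    """Return (L, F): the last and first columns of the sorted matrix M."""
    if sentinel in text:
        raise ValueError("the sentinel must not occur in the text")

    # Step 1: append the special end character.
    s = text + sentinel

    def rank(ch: str) -> int:
        # Treat the sentinel as smaller than any other character.
        return -1 if ch == sentinel else ord(ch)

    # Step 2: build all cyclic shifts of T# and sort them lexicographically.
    rotations = [s[i:] + s[:i] for i in range(len(s))]
    rows = sorted(rotations, key=lambda row: [rank(ch) for ch in row])

    # Step 3: the transformed text L is the last column of M.
    last = "".join(row[-1] for row in rows)
    first = "".join(row[0] for row in rows)
    return last, first


L, F = bwt_forward("ababc")
print(L)  # c#baab
print(F)  # #aabbc
```

For T = ababc the sorted rows are #ababc, ababc#, abc#ab, babc#a, bc#aba, c#abab, so L = c#baab, and F is the same multiset of characters in sorted order.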

There is a strong apparent relation between the matrix M and the suffix array A of the string T#. When sorting the rows of the matrix M we are essentially sorting the suffixes of T#. Consequently, entry A[i] points to the suffix of T# occupying (a prefix of) the ith row of M. The cost of performing the forward BWT is given by the cost of constructing the suffix array A, and this requires O(u) time [20].
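The relation between M and the suffix array means that L can be read off A without building the matrix: the last character of row i is the character that cyclically precedes position A[i] in T#. The Python sketch below illustrates this under stated assumptions: the suffix array is built by naive sorting (so it does not achieve the O(u) bound cited above), and the name `bwt_from_suffix_array` is hypothetical.

```python
def bwt_from_suffix_array(text: str, sentinel: str = "#") -> tuple[list[int], str]:
    """Return (A, L) for T#, where A is the 1-based suffix array of T#
    and L is the BWT derived from it."""
    s = text + sentinel

    def suffix_key(i: int) -> list[int]:
        # Compare suffixes with the sentinel ranked below every character.
        return [-1 if ch == sentinel else ord(ch) for ch in s[i - 1:]]

    # Naive suffix array of T# (1-based starting positions).
    A = sorted(range(1, len(s) + 1), key=suffix_key)

    # Row i of M starts at position A[i], so its last character is the one
    # just before that position. For A[i] == 1 the index -1 wraps around
    # to the final character of T#, which is the sentinel.
    L = "".join(s[i - 2] for i in A)
    return A, L


A, L = bwt_from_suffix_array("ababc")
print(A)  # [6, 1, 3, 2, 4, 5]
print(L)  # c#baab
```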

The cyclic shift of the rows of M is crucial to define the backward BWT, which is based on two easy to prove observations [7]:

a. Given the ith row of M, its last character L[i] precedes its first character F[i] in the original text T, namely T = ··· L[i]F[i] ···.

b. Let L[i] = c and let r_i be the rank of the row M[i] among all the rows ending with the character c. Take the row M[j] as the r_i-th row of M starting with c. Then the character corresponding to L[i] in the first column F is located at F[j] (we call this LF-mapping, where LF[i] = j).

We are therefore ready to describe the backward BWT:

1. Compute the array C[1 ... |Σ|] storing in C[c] the number of occurrences of characters {#, 1, ..., c − 1} in the text T. Notice that C[c] + 1 is the position of the first occurrence of c in F (if any).

2. Define the LF-mapping LF[1 ... u + 1] as follows: LF[i] = C[L[i]] + r_i, where r_i equals the number of occurrences of character L[i] in the prefix L[1, i] (see observation (b) above).

3. Reconstruct T backward as follows: set s = 1 and T[u] = L[1] (because M[1] = #T); then, for each i = u − 1, ..., 1 do s = LF[s] and T[i] = L[s].
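The three steps of the backward BWT can be sketched in Python as follows. The sketch uses 0-based indices internally, so the LF values are shifted by one relative to the text; the name `bwt_inverse` and the dictionary-based representation of C are assumptions made for this illustration.

```python
def bwt_inverse(L: str, sentinel: str = "#") -> str:
    """Recover T from the transformed string L of T# (0-based internally)."""

    def rank(ch: str) -> int:
        # The sentinel is smaller than any other character.
        return -1 if ch == sentinel else ord(ch)

    # Step 1: C[c] = number of characters in L that are smaller than c.
    counts: dict[str, int] = {}
    for ch in L:
        counts[ch] = counts.get(ch, 0) + 1
    C: dict[str, int] = {}
    total = 0
    for ch in sorted(counts, key=rank):
        C[ch] = total
        total += counts[ch]

    # Step 2: LF[i] = C[L[i]] + (occurrences of L[i] in L[0..i]) - 1.
    # The "- 1" converts the 1-based formula to 0-based indices.
    seen: dict[str, int] = {}
    LF: list[int] = []
    for ch in L:
        seen[ch] = seen.get(ch, 0) + 1
        LF.append(C[ch] + seen[ch] - 1)

    # Step 3: row 0 of M is #T, so L[0] is the last character of T.
    # Walk the LF-mapping to fill T from right to left.
    u = len(L) - 1
    T = [""] * u
    s = 0
    for i in range(u - 1, -1, -1):
        T[i] = L[s]
        s = LF[s]
    return "".join(T)


print(bwt_inverse("c#baab"))  # ababc
```

For L = c#baab the 0-based mapping is LF = [5, 0, 3, 1, 2, 4], and following it from row 0 emits c, b, a, b, a, which is T = ababc read backward.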
