An Analysis of Compressed Text Index Construction and Application Techniques


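"This document is a study of the construction and application of compressed text indexes, by Hon, Wing-kai (韩永楷), published in 2004. It contains the main technical part of a thesis, Chapters 3 to 8, drawn from the following papers: Chapter 3 presents a method for constructing compressed suffix arrays over large alphabets; Chapter 4 discusses breaking the time-and-space barrier in constructing full-text indexes; Chapter 5 proposes a space-economical algorithm for finding maximal unique matches. These studies are important for understanding efficient compressed text indexing and its applications in areas such as information retrieval and big-data analysis."

The thesis examines how compressed text indexes are built and what they are worth in practice. A compressed text index is a data structure that supports fast searching over large amounts of text while occupying little storage. This matters for large-scale text processing, for example building search-engine indexes, aligning gene sequences in bioinformatics, or analyzing log data.

Chapter 3 discusses how to construct a compressed suffix array over an alphabet with a large number of characters. A suffix array is a data structure that quickly locates the occurrences of a substring in a text, and it is commonly used for string pattern matching. Compression reduces the storage requirement while largely preserving query efficiency, which is especially useful for large texts.

Chapter 4 focuses on improving the time and space efficiency of full-text index construction. Building a conventional full-text index can require a large amount of computation and memory; the method in this chapter breaks that barrier, so that an index can be built quickly with limited resources. This is an advantage in real-time and resource-constrained settings.

Chapter 5 presents a space-economical algorithm for finding Maximal Unique Matches (MUMs) in texts. MUMs are particularly important in bioinformatics, because they can be used to align DNA sequences and identify the differences between them. The algorithm in this chapter improves space efficiency, which helps when processing very large data sets.

Taken together, the research in this thesis studies the construction of compressed text indexes in depth and shows their usefulness on practical problems, especially in information retrieval and big-data analysis. The results help speed up text processing and reduce storage cost, and they give later researchers a theoretical basis and technical reference.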

CHAPTER 1. INTRODUCTION 14

Table 1.5: Managing a dynamic dictionary.

Description                           Space (bits)   Dictionary Matching Time         Insertion/Deletion Time
Existing: via suffix tree [6]         O(d log d)     O((t + occ) log d / log log d)   O(p log d / log log d)
Existing: via fat-tree [76]           O(d log d)     O(t + occ)                       O(p)
This thesis: via CST (special case)   O(d log |Σ|)   O((t + occ) log d)               amor. O(p log^{2+ε} …

Managing a Dynamic Text

In the above discussion, we have covered the simple text searching problem, in which we need to maintain a single piece of static text, and the library management problem, in which we need to maintain a dynamic collection of texts. Another related problem is to maintain a single piece of text that is subject to updates over time. This problem is useful in managing DNA texts, as they are frequently updated due to errors in the sequencing process.

Ferragina and Grossi [24] proposed an interval partitioning scheme that exploits the generalized suffix tree to give an index occupying O(n log n) bits of space, where n is the length of the text T. It supports searching for a pattern P of length p in O(p + occ) time. In addition, it supports insertion (and deletion) of a substring of length y at an arbitrary position in T in O(y + √n) time. Later, Sahinalp and Vishkin [76] proposed the fat-tree and further improved the insertion and deletion time to O(y + log n).

It was open whether there is a compressed index (i.e., one using O(n log |Σ|) bits) that can manage a dynamic text efficiently. In this thesis, we report progress on this dynamic problem. Precisely, we propose an index that occupies O(n log |Σ|) bits of space for any fixed ε > 0, while supporting pattern searching in O((p log n)(log^ε n + log |Σ|) + occ log^{2+ε} n) time, and insertion/deletion of a substring of length y in O((y + √n) log^{2+ε} n) amortized time.
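To give a feel for the gap between O(n log n) bits and O(n log |Σ|) bits, the following sketch evaluates the two leading terms with all hidden constants dropped. This is only an illustration: the function name `index_space_bits` and the sample values (a four-letter DNA alphabet and a text of 2.88 Gbases) are assumptions chosen for the example, and real indexes carry constant factors that this calculation ignores.

```python
import math


def index_space_bits(n: int, sigma: int) -> tuple[float, float]:
    """Return (n * log2(n), n * log2(sigma)).

    These are the two asymptotic space terms with constants dropped,
    so the values show orders of magnitude, not real index sizes.
    """
    return n * math.log2(n), n * math.log2(sigma)


# Illustrative values only: a DNA text over {A, C, G, T} of 2.88 Gbases.
n, sigma = 2_880_000_000, 4
classic_bits, compressed_bits = index_space_bits(n, sigma)

print(f"n log n   term: {classic_bits / 8 / 1e9:.1f} Gbytes")
print(f"n log |S| term: {compressed_bits / 8 / 1e9:.1f} Gbytes")
print(f"ratio: {classic_bits / compressed_bits:.1f}x")
```

For a small alphabet such as DNA, log |Σ| is a small constant while log n grows with the text, which is why the compressed bound matters most for long genomes.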

Briefly speaking, we use the interval partitioning technique of [24] to reduce the dynamic text problem to the dynamic dictionary management and dynamic library management problems. Then, applying the compressed solutions to the latter two problems, we obtain the required compressed index. A summary of the results is shown in Table 1.6.


Table 1.6: Managing a dynamic text.

Description                      Space (bits)   Searching Time             Insertion/Deletion Time
Existing: via suffix tree [24]   O(n log n)     O(p + occ)                 O(y + √n)
Existing: via fat-tree [76]      O(n log n)     O(p + occ)                 O(y + log n)
This thesis: via CSA and CST     O(n log |Σ|)   O((p + occ) log^{2+ε} n)   amor. O((y + √n) log^{2+ε} n)

1.3.3 Experimental Results on the Practical Aspects of CSA and FM-index

While theoretical bounds are important, the success of a data structure is often measured by its performance in practice. Indeed, both CSA and FM-index have demonstrated their practicality for text indexing in the literature. For instance, for the DNA sequence of E. coli, Ferragina and Manzini have shown in their experimental paper [26] that the corresponding FM-index occupies 2.689n bits, while its performance is comparable to that of suffix arrays when searching for short patterns (8-15 characters). On the other hand, Grossi, Gupta and Vitter [33] have shown that the most space-efficient variant of the CSA [32] can be implemented in 2.392n bits while supporting fast searching queries. Moreover, the performance of these indexes has been tested extensively across various texts, whose lengths vary from 4 million to 70 million characters. Nevertheless, these lengths do not yet cover some of the popular DNA sequences such as fruit fly, human, fugu, and rice. One possible reason is the lack of a space-efficient construction algorithm for these indexes, such as the one proposed in the previous section. This, together with some other issues to be explained later, motivates us to study the practical aspects of the CSA and FM-index from a different point of view. In this thesis, we attempt to answer the following questions:

1. What is the largest DNA sequence whose CSA and FM-index can be constructed in main memory? Will the construction time be acceptable?

2. Previous studies on searching focus on short patterns. In real life, DNA sequences are often searched against genes whose lengths are much longer. Will the searching performance in this case be consistent with that for searching short patterns? Also, will the length of the DNA sequences affect the performance?


3. In the literature, there are two types of searching methodology for CSA or FM-index. One of them is called forward search, which is the classical approach for suffix arrays. The other one is called backward search, which is the method tailored for CSA or FM-index and is better than forward search in theory. However, in practice, will backward search always beat forward search?

4. Finally, can we conclude which one of the two indexes is best suited for indexing DNA sequences?
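The two search directions mentioned in question 3 can be sketched on a toy text as follows. This is a minimal illustration under stated assumptions, not the implementation evaluated in the thesis: it uses an uncompressed suffix array and plain Python tables in place of a CSA or FM-index, and the names `suffix_array`, `forward_search`, `build_fm_tables` and `backward_search` are hypothetical. The text is assumed to end with a unique sentinel `$` that is smaller than every other character.

```python
def suffix_array(text: str) -> list[int]:
    """Naive suffix array: start positions of all suffixes in sorted order."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def forward_search(text: str, sa: list[int], pattern: str) -> tuple[int, int]:
    """Classical search: binary search over the suffix array, comparing the
    pattern against a prefix of each probed suffix. Returns the half-open
    range [lo, hi) of suffix-array rows whose suffixes start with pattern."""
    p = len(pattern)
    lo, hi = 0, len(sa)
    while lo < hi:  # first row whose p-character prefix is >= pattern
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + p] < pattern:
            lo = mid + 1
        else:
            hi = mid
    start, hi = lo, len(sa)
    while lo < hi:  # first row whose p-character prefix is > pattern
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + p] <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return start, lo


def build_fm_tables(text: str, sa: list[int]):
    """Build the BWT-based tables used by backward search (uncompressed)."""
    bwt = "".join(text[i - 1] for i in sa)  # text[-1] is the sentinel
    alphabet = sorted(set(text))
    smaller, total = {}, 0  # smaller[c] = number of characters less than c
    for c in alphabet:
        smaller[c] = total
        total += text.count(c)
    occ = {c: [0] * (len(bwt) + 1) for c in alphabet}  # occ[c][i] = count of c in bwt[:i]
    for i, ch in enumerate(bwt):
        for c in alphabet:
            occ[c][i + 1] = occ[c][i] + (ch == c)
    return smaller, occ


def backward_search(smaller, occ, n: int, pattern: str) -> tuple[int, int]:
    """Process the pattern from its last character to its first, narrowing
    the suffix-array range once per character. Returns [lo, hi)."""
    lo, hi = 0, n
    for c in reversed(pattern):
        if c not in smaller:
            return 0, 0
        lo = smaller[c] + occ[c][lo]
        hi = smaller[c] + occ[c][hi]
        if lo >= hi:
            return 0, 0
    return lo, hi


text = "GATTACAGATTACA$"
sa = suffix_array(text)
smaller, occ = build_fm_tables(text, sa)
lo, hi = backward_search(smaller, occ, len(text), "ATTA")
print(sorted(sa[lo:hi]))  # start positions of "ATTA" in text
```

Both functions return the same suffix-array range for a pattern that occurs in the text. Forward search performs a logarithmic number of string comparisons, whereas backward search performs one narrowing step per pattern character.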

We conducted experiments on construction and searching performance on an ordinary PC equipped with a 1.7 GHz Pentium IV processor with 256 Kbytes of L2 cache and 4 Gbytes of RAM. The operating system was Solaris 9. Note that this modest configuration can easily be acquired by most research laboratories nowadays. Our results can be briefly summarized as follows. For construction limits, we have successfully constructed the CSA and FM-index for DNA sequences of length up to 3 Gbases. The construction times are 24 and 28 hours, respectively. For the searching performance, we have constructed the CSA and the FM-index for E. coli (4.6 Mbases), Fly (98 Mbases) and Human (2.88 Gbases). In each setting, we tested the searching times (for both forward search and backward search) using patterns of length from 10 to 10,000, where the patterns are extracted from random positions in the corresponding DNA sequence to stress the worst-case performance. From our experiments, we find that backward search is sensitive to the length of the pattern, while forward search is not. On the other hand, searching different DNAs against patterns of similar length shows similar timing, indicating that the length of the DNA has little effect on the searching performance.

For the comparison between forward search and backward search, we observe that using backward search, FM-index is consistently faster than CSA. However, using forward search, CSA is faster than FM-index. The most surprising result is that, for long patterns, forward search is more efficient than backward search. Roughly speaking, for patterns of length less than 2000, FM-index with backward search is most efficient; otherwise, CSA with forward search is fastest, while FM-index with forward search is comparable. See Figure 1.1 for the timing of the experiments on the CSA and FM-index of Human, with each index occupying about 2.2 Gbytes (6n bits) of space.
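The space figures quoted in this section can be checked with simple arithmetic. The sketch below is only a worked calculation: the helper `index_size_gbytes` is a hypothetical name, a Gbyte is taken to be 10^9 bytes, and applying the 2.689n and 2.392n figures from [26] and [33] to the 4.6 Mbase E. coli length used in this thesis is an assumption made for illustration.

```python
def index_size_gbytes(n_bases: float, bits_per_base: float) -> float:
    """Size in Gbytes (10^9 bytes) of an index occupying bits_per_base * n bits."""
    return n_bases * bits_per_base / 8 / 1e9


# Human: 2.88 Gbases at 6n bits, as stated in the text.
human = index_size_gbytes(2.88e9, 6)
print(f"Human index: {human:.2f} Gbytes")  # 2.16, i.e. about 2.2 Gbytes

# E. coli: 4.6 Mbases, using the per-base figures quoted from [26] and [33].
ecoli_fm = index_size_gbytes(4.6e6, 2.689)
ecoli_csa = index_size_gbytes(4.6e6, 2.392)
print(f"E. coli FM-index: {ecoli_fm * 1e3:.2f} Mbytes")
print(f"E. coli CSA:      {ecoli_csa * 1e3:.2f} Mbytes")
```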
