An LZ77 Self-Index Method for Efficient Approximate Substring Matching in Compressed Strings

Research paper


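Abstract: "This research paper studies efficient approximate substring matching over compressed strings. Existing approximate string matching methods built on the LZ77 self-index concentrate on space efficiency; this paper instead focuses on searching for similar strings efficiently without decompressing the whole text. The authors propose the RS search algorithm, which merges all occurrences of a substring to narrow the potential matching region, and design new filters to reduce the number of candidate strings. Experimental results show that the algorithm performs very well for approximate matching on compressed strings and achieves an interesting time-space trade-off. Keywords: LZ77, self-index, approximate string matching, edit distance."

The paper addresses large-scale string matching when the data is stored in compressed form. LZ77 is a widely used text compression method that replaces repeated substrings with references to their earlier occurrences instead of storing the text verbatim, which saves space on large, repetitive collections. Existing LZ77-based approximate matching techniques optimize storage and give comparatively little attention to query time.

The proposed algorithm, RS (the name is not expanded here; it may stand for "Rapid Search"), finds sequences similar to a target substring without decompressing the whole text. By merging all occurrence positions of a substring, it locates the potential matching regions more precisely and shrinks the search range. New filtering strategies further reduce the number of candidate strings that must be verified.

Edit distance measures the similarity of two strings: it is the minimum number of single-character edit operations needed to transform one string into the other. Computing it is usually the core step of approximate string matching, and RS balances the cost of this computation against search time and memory use. According to the reported experiments, the algorithm keeps memory usage low while answering queries quickly on large compressed texts, which is useful for text analysis and retrieval over large data sets.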


186 Y. Han et al.

3 Preliminary

3.1 Definition

Let T be a long sequence in which each character T[i] belongs to Σ, a finite alphabet. T[i, j] denotes the substring of T from the i-th character to the j-th character. In particular, T[1, j] is the prefix of T that ends at the j-th character.

Problem. This paper focuses on approximate string matching over the LZ77 compressed representation. Given a long string T of length n, a pattern P of length m (m ≪ n), and a threshold k, the goal is to locate all substrings of T whose edit distance [14] to P is at most k.
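To make the problem statement concrete, the following minimal sketch solves it on an uncompressed string with the classical dynamic-programming approach (Sellers' algorithm). It is not the RS algorithm of the paper and does not use the LZ77 index; the function name `approx_match_ends` is hypothetical, and the sketch assumes unit-cost insertions, deletions, and substitutions. It reports the 1-based end positions of substrings of T within edit distance k of P.

```python
def approx_match_ends(text, pattern, k):
    """Return 1-based end positions j such that some substring of `text`
    ending at j has edit distance <= k to `pattern` (unit edit costs assumed)."""
    m = len(pattern)
    # col[i] = edit distance between pattern[:i] and the best-matching
    # substring ending at the current text position.
    col = list(range(m + 1))
    ends = []
    for j, c in enumerate(text, start=1):
        prev_diag = col[0]
        col[0] = 0  # an occurrence may start at any text position
        for i in range(1, m + 1):
            cur = col[i]
            col[i] = min(
                prev_diag + (pattern[i - 1] != c),  # match or substitution
                col[i] + 1,                         # skip the text character
                col[i - 1] + 1,                     # skip the pattern character
            )
            prev_diag = cur
        if col[m] <= k:
            ends.append(j)
    return ends


if __name__ == "__main__":
    print(approx_match_ends("abdacadbedabbedacbdacad", "dac", 0))
```

This baseline takes O(mn) time and reads the whole text, which is exactly the cost that an index over the compressed representation aims to avoid.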

The LZ77 parsing of text T[1, n] is a sequence Z[1, n′] of phrases such that T = Z[1]Z[2]...Z[n′], built as in [8]. Assume we have already processed T[1, i−1], producing the sequence Z[1, p−1]. Then we find the longest prefix T[i, i′−1] of T[i, n] that occurs in T[1, i−1], set Z[p] = T[i, i′], and continue with i = i′+1. The occurrence in T[1, i−1] of the prefix T[i, i′−1] is called the source of the phrase Z[p].
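The parsing rule above can be sketched as a short quadratic-time routine. This is an illustration under stated assumptions, not the construction used in [8]: the function name `lz77_parse` is hypothetical, the source is required to lie entirely inside T[1, i−1], and when the copied prefix reaches the end of T the last phrase simply ends there with no extra trailing character.

```python
def lz77_parse(text):
    """Return phrases as (start, end, source_start) with 1-based inclusive
    positions; source_start is None when the copied prefix is empty."""
    n = len(text)
    phrases = []
    i = 0  # 0-based start of the next phrase
    while i < n:
        length = 0
        source = None
        # Grow the prefix while it still occurs entirely inside text[:i].
        while i + length < n:
            pos = text.find(text[i:i + length + 1], 0, i)
            if pos < 0:
                break
            source = pos
            length += 1
        # Phrase = copied prefix plus one explicit character (if any is left).
        end = min(i + length, n - 1)
        phrases.append((i + 1, end + 1, None if source is None else source + 1))
        i = end + 1
    return phrases


if __name__ == "__main__":
    T = "abdacadbedabbedacbdacad"
    for ident, (s, e, src) in enumerate(lz77_parse(T), start=1):
        print(ident, T[s - 1:e], (s, e), src)
```

On the sequence T = abdacadbedabbedacbdacad this sketch yields nine phrases, including Z[3] = T[3] and Z[8] = T[13, 17] with a source starting at position 8.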

Example 1. Consider the sequence T = abdacadbedabbedacbdacad. Figure 1 shows its LZ77 parsing, with an identifier above every phrase. The character T[3] does not occur in T[1, 2], so Z[3] = T[3]. The longest prefix of T[13, n] that occurs earlier is T[13, 16], copied from the source T[8, 11], so the 8th phrase is Z[8] = T[13, 17].

3.2 LZ77 Self-Index

The LZ77 self-index is built on top of the LZ77 parsing. Figure 2 shows an example. The index consists of two tries and a range structure. The suffix trie at the top of Fig. 2 indexes all the suffixes of T that start at phrase beginnings. The reverse trie is shown on the left of Fig. 2. Each leaf node of both tries stores the identifier of a phrase. The range structure links adjacent phrases through points in a grid.
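The layout just described can be mimicked with sorted lists standing in for the tries. The sketch below rests on assumptions the text does not state: that the reverse trie holds the reversed phrases, and that each grid point pairs a phrase p (by its rank in the reverse trie) with phrase p + 1 (by the rank of the suffix starting there). The function name `build_index_sketch` is hypothetical, and a real index would use compact tries and a succinct range structure rather than plain Python lists.

```python
def build_index_sketch(phrases):
    """Build list-based stand-ins for the two tries and the grid.

    `phrases` is the list of phrase strings Z[1..n'] in text order.
    Returns (suffix_order, reverse_order, points), all using 1-based phrase ids.
    """
    text = "".join(phrases)
    starts = []
    pos = 0
    for z in phrases:
        starts.append(pos)  # 0-based start of each phrase in the text
        pos += len(z)
    ids = range(1, len(phrases) + 1)
    # Stand-in for the suffix trie: phrase ids ordered by the suffix of the
    # text that starts at the beginning of each phrase.
    suffix_order = sorted(ids, key=lambda p: text[starts[p - 1]:])
    # Stand-in for the reverse trie (assumed to hold reversed phrases).
    reverse_order = sorted(ids, key=lambda p: phrases[p - 1][::-1])
    suffix_rank = {p: r for r, p in enumerate(suffix_order)}
    reverse_rank = {p: r for r, p in enumerate(reverse_order)}
    # One grid point per pair of adjacent phrases (p, p + 1).
    points = [(reverse_rank[p], suffix_rank[p + 1]) for p in range(1, len(phrases))]
    return suffix_order, reverse_order, points


if __name__ == "__main__":
    Z = ["a", "b", "d", "ac", "ad", "be", "dab", "bedac", "bdacad"]
    suffix_order, reverse_order, points = build_index_sketch(Z)
    print(suffix_order, reverse_order, points, sep="\n")
```

Under these assumptions, an occurrence that crosses a phrase boundary corresponds to a range of reversed phrases (matching the left part of the pattern) and a range of suffixes (matching the right part), so the grid points inside that rectangle identify the boundaries where the pattern occurs.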

By the definition of the LZ77 parsing, a text T of length n is split into n′ phrases such that T = Z[1]Z[2]...Z[n′]. Given a pattern P, its exact occurrences in the LZ77 parsing fall into three types: primary, special primary, and secondary occurrences.

Example 2. In Fig. 1, the occurrence of ab that spans the first two phrases is a primary occurrence. The occurrence of ab that is a suffix of the seventh phrase is a special primary occurrence. The substring dac beginning at position 19, inside the last phrase, is a secondary occurrence.
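The three cases of Example 2 can be checked mechanically. The sketch below assumes classification rules inferred from that example rather than quoted from the paper: an occurrence is primary if it spans more than one phrase, special primary if it lies inside one phrase and ends at that phrase's last character, and secondary otherwise. The phrase list is the one obtained by applying the parsing definition to T from Example 1, and the function name `classify_occurrences` is hypothetical.

```python
from bisect import bisect_left
from itertools import accumulate


def classify_occurrences(phrases, pattern):
    """Return (start, kind) for every exact occurrence of `pattern`,
    where start is the 1-based position in the concatenated text."""
    text = "".join(phrases)
    # ends[p] is the 1-based position of the last character of phrase p + 1.
    ends = list(accumulate(len(z) for z in phrases))
    result = []
    start = text.find(pattern)
    while start >= 0:
        first, last = start + 1, start + len(pattern)  # 1-based, inclusive
        p_first = bisect_left(ends, first)  # phrase containing `first`
        p_last = bisect_left(ends, last)    # phrase containing `last`
        if p_first != p_last:
            kind = "primary"
        elif ends[p_last] == last:
            kind = "special primary"
        else:
            kind = "secondary"
        result.append((first, kind))
        start = text.find(pattern, start + 1)
    return result


if __name__ == "__main__":
    Z = ["a", "b", "d", "ac", "ad", "be", "dab", "bedac", "bdacad"]
    print(classify_occurrences(Z, "ab"))
    print(classify_occurrences(Z, "dac"))
```

With these rules, ab at position 1 is primary, ab at position 11 is special primary, and dac at position 19 is secondary, which agrees with Example 2.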

Fig. 1. Example for LZ77 parsing
