Suffix Sorting: A Novel and Efficient Algorithm

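"This article discusses a new, efficient suffix sorting algorithm that is economical in both time and memory. The authors, Michael A. Maniscalco and Simon J. Puglisi, depart from the traditional approach: rather than sorting the suffixes of a string directly, they first assign each suffix its position (rank) in the final sorted order and, once every suffix has been ranked, convert the ranks into the sorted array if required. Suffix sorting is widely used in lossless compression (the Burrows-Wheeler transform) and text indexing (the suffix array), and the authors' experiments show that their algorithm outperforms other leading algorithms in a variety of real-world settings. The paper is classified under analysis of algorithms, nonnumerical problems, data structures (arrays), and information storage and retrieval."

Suffix sorting is a basic and important task in string processing, particularly in lossless compression and text indexing. The Burrows-Wheeler transform, a data compression technique, relies on suffix sorting to transform the input text, and the suffix array, a data structure for fast pattern lookup in text, is likewise built by suffix sorting. Traditional suffix sorting algorithms usually operate directly on the suffix array, repeatedly moving suffixes toward their final positions until the sort is complete, which can be costly in time and memory.

The algorithm of Maniscalco and Puglisi takes a different route. It first computes the rank of every suffix in the final order instead of sorting the suffixes directly. This step may involve string comparisons and structural analysis, possibly using tools such as longest common prefixes, suffix trees, or suffix automata. Once all ranks are known, they can be converted into the sorted array in a single pass if needed.

The paper goes on to describe several powerful extensions of this basic idea, which may include optimizations, complexity analysis, and variants adapted to different scenarios. These extensions make the algorithm more efficient on large inputs. The experimental results show that, on real-world text data, the new method outperforms the leading algorithms of the time in both running time and memory use.

In short, the paper offers a novel suffix sorting strategy that lowers resource requirements and improves performance. It is of theoretical and practical value in areas that rely heavily on suffix sorting, such as big-data analysis, bioinformatics, and text mining.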

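The rank-first idea described in the summary can be illustrated with a minimal Python sketch. It assumes the ranks are already available as an inverse suffix array and shows only the final step, inverting the ranks to obtain the sorted suffix array. The helper `naive_ranks` is a brute-force stand-in used to produce test input, not the authors' ranking method, and both function names are hypothetical.

```python
def naive_ranks(text):
    """Brute-force ranks: isa[i] is the sorted position of the suffix text[i:].

    For illustration only; this sorts whole suffixes and is slow on long inputs.
    """
    order = sorted(range(len(text)), key=lambda i: text[i:])
    isa = [0] * len(text)
    for position, start in enumerate(order):
        isa[start] = position
    return isa


def ranks_to_suffix_array(isa):
    """Invert a rank array in one pass: sa[isa[i]] = i."""
    sa = [0] * len(isa)
    for start, position in enumerate(isa):
        sa[position] = start
    return sa


isa = naive_ranks("banana")
sa = ranks_to_suffix_array(isa)
```

For `"banana"` the ranks are `[3, 2, 5, 1, 4, 0]` and the resulting suffix array is `[5, 3, 1, 0, 4, 2]`.
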
An Efficient, Versatile Approach to Suffix Sorting


3.2 Prefix Doubling

The prefix-doubling technique was first applied to suffix-sorting by Manber and Myers [1993], inspired by the earlier work of Karp et al. [1972] in string matching. The most efficient implementation is that of Larsson and Sadakane [1999].

Generally, the approach works in rounds: at the beginning of round $h$, the suffixes are sorted on their $2^{h-1}$ prefix in $SA_h$, with corresponding ranks in $ISA_h$. It is then observed that a sort using the integer pairs $(ISA_h[i], ISA_h[i+h])$ as keys, $i + h \le n$, computes a $2h$-order of the suffixes $i$ (suffixes $i > n - h$ are necessarily already fully ordered).
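
As an illustration of the pair-key observation, here is a minimal Python sketch of prefix doubling. It is not the Manber and Myers or the Larsson and Sadakane implementation: it re-sorts all suffixes every round with Python's built-in sort, uses 0-based indices, takes `h` to be the prefix length already sorted (1, 2, 4, ...), and uses -1 as the second key when suffix `i + h` does not exist. All of these are assumptions of the sketch, and the function name is hypothetical.

```python
def prefix_doubling_suffix_array(text):
    """Build a suffix array by repeatedly sorting on (rank[i], rank[i + h])."""
    n = len(text)
    if n == 0:
        return []
    rank = [ord(c) for c in text]  # order-preserving ranks for prefixes of length 1
    sa = list(range(n))
    h = 1
    while True:
        def key(i):
            # Pair of h-ranks; -1 sorts a suffix with no position i + h first.
            return (rank[i], rank[i + h] if i + h < n else -1)

        sa.sort(key=key)  # suffixes are now in 2h-order

        # Re-rank: suffixes with equal key pairs share a rank.
        new_rank = [0] * n
        for j in range(1, n):
            new_rank[sa[j]] = new_rank[sa[j - 1]] + (key(sa[j]) != key(sa[j - 1]))
        rank = new_rank

        if rank[sa[-1]] == n - 1:  # all ranks distinct: fully sorted
            return sa
        h *= 2


result = prefix_doubling_suffix_array("banana")
```

Because each round does a full comparison sort, this sketch takes O(n log² n) time rather than the O(n log n) achieved by the implementations discussed in the text; it is meant only to show why sorting on rank pairs doubles the sorted prefix length.
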

The two main implementations of the prefix-doubling approach differ primarily in their application of the above observation. Manber and Myers do an implicit $2h$-sort by performing a left-to-right scan of $SA_h$ that induces the $2h$-rank of $SA_h[j] - h$, $j \in 0..n$. On the other hand, Larsson and Sadakane explicitly sort each $h$-group using the ternary-split quicksort (TSQS) of Bentley and McIlroy [1993]. Both approaches require $8n$ bytes of working space. Prefix-doubling sorters have the advantage of being alphabet independent and taking $O(n \log n)$ time in the worst case.
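
The explicit sorting of an h-group can be sketched with a ternary-split (three-way) quicksort. The Python below is a simplified illustration, not Bentley and McIlroy's routine or Larsson and Sadakane's code: it uses the plain Dutch-national-flag partition with a middle-element pivot, and the names `ternary_split_sort`, `rank`, and `group` are hypothetical. Within one h-group all suffixes share the same first key, so the sketch compares only the second key, the rank of suffix `i + h`.

```python
def ternary_split_sort(items, key, lo=0, hi=None):
    """Sort items[lo:hi] in place by key, splitting into <, ==, > parts."""
    if hi is None:
        hi = len(items)
    while hi - lo > 1:
        pivot = key(items[(lo + hi) // 2])
        lt, i, gt = lo, lo, hi
        # Invariant: items[lo:lt] < pivot, items[lt:i] == pivot, items[gt:hi] > pivot
        while i < gt:
            k = key(items[i])
            if k < pivot:
                items[lt], items[i] = items[i], items[lt]
                lt += 1
                i += 1
            elif k > pivot:
                gt -= 1
                items[i], items[gt] = items[gt], items[i]
            else:
                i += 1
        ternary_split_sort(items, key, lo, lt)  # recurse on the "<" part
        lo = gt                                 # continue with the ">" part


# One h-group (h = 1) of "banana": the suffixes that start with "a".
rank = [1, 0, 2, 0, 2, 0]  # ranks by first character: a -> 0, b -> 1, n -> 2
h = 1
group = [1, 3, 5]
ternary_split_sort(group, key=lambda i: rank[i + h] if i + h < len(rank) else -1)
```

After the call, suffix 5 (`a`) comes first, and suffixes 1 and 3 (both starting `an`) stay together in the "equal" part, which would form a new group to be refined in a later round.
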

3.3 Copy and Variants

Seward [2000] describes an important heuristic algorithm for suffix-sorting called copy. The main idea bears a resemblance to two stage. Algorithm copy initially sorts the suffixes into 1- and 2-groups, based on their first two characters (using a counting sort). 1-groups refer to contiguous portions of the suffix array where suffixes share the same first character, and 2-groups ("contained" in 1-groups) refer to contiguous portions sharing the same first two characters. Seward sorts the 1-groups in order of smallest to largest (i.e., those containing the fewest suffixes to those containing the most). Let $G_\lambda$ denote the 1-group whose member suffixes all start with the letter $\lambda \in \Sigma$. When $G_\lambda$ is completely sorted, passing back over the portion of SA containing $G_\lambda$ (now in order) allows, for each suffix $i$ encountered, the order of the suffixes in the 2-group prefixed $x[i-1]\lambda$ to be induced. As sorting of 1-groups proceeds, ever more 2-groups will be already ordered, allowing the sort routine to skip those portions of the 1-group. Seward shows how the sorting of 1-groups can be made still more efficient by avoiding the sorting of suffixes in $G_\lambda$ prefixed $\lambda\lambda$. If such suffixes are left until after the other members of $G_\lambda$ are sorted, their order can also be induced. This ability of copy to deal with long runs of identical characters efficiently gives it a distinct advantage over two stage, which has no such mechanism.
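
Two ingredients of the description above can be sketched in Python: the initial counting sort into 2-groups and the induction step for a sorted 1-group. This is a minimal illustration under stated assumptions, not Seward's implementation: the input is a `bytes` object (so at most 256 symbols), a second key of 0 marks the missing second character of the last suffix, and the function names are hypothetical. The induction helper takes an already sorted 1-group as input; it does not show how that group was sorted or the order in which groups are processed.

```python
def sort_into_two_groups(data: bytes):
    """Counting sort of suffix start positions by their first two bytes."""
    n = len(data)
    width = 257  # second key in 0..256; 0 means "no second character"

    def key(i):
        second = data[i + 1] + 1 if i + 1 < n else 0
        return data[i] * width + second

    counts = [0] * (256 * width)
    for i in range(n):
        counts[key(i)] += 1

    # Prefix sums give the start of each 2-group inside the array.
    starts = [0] * len(counts)
    total = 0
    for k, c in enumerate(counts):
        starts[k] = total
        total += c

    sa = [0] * n
    next_free = starts[:]
    for i in range(n):
        k = key(i)
        sa[next_free[k]] = i
        next_free[k] += 1
    return sa


def induce_two_group(data: bytes, sorted_group, c: int):
    """Given the sorted 1-group G_lambda, return the suffixes of the 2-group
    prefixed (c, lambda) in sorted order: suffix i - 1 inherits the order of i."""
    return [i - 1 for i in sorted_group if i > 0 and data[i - 1] == c]


text = b"banana"
grouped = sort_into_two_groups(text)  # ordered by the first two bytes only
sorted_a = [5, 3, 1]                  # G_a in sorted order: a, ana, anana
na_group = induce_two_group(text, sorted_a, ord("n"))
```

For `b"banana"`, the 2-group prefixed `na` comes out as `[4, 2]` (`na` before `nana`) without comparing those two suffixes directly; their order is read off the order of suffixes 5 and 3 in the sorted group for `a`.
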

It is worth noting that copy was intended for use in a character-based BWT setting, where it is assumed $|\Sigma| \le 2^{\ldots}$. This assumption keeps the space required for the $|\Sigma|^2$ buckets reasonable. If, however, $|\Sigma| = 2^{\ldots}$, the memory requirements for the algorithm would increase dramatically, making the algorithm impractical in some applications. This weakness is inherited by algorithms which extend copy. Several very fast suffix sorters are based on copy, namely, cache [Seward 2000], deep-shallow (ds) [Manzini and Ferragina 2004], and bucket pointer refinement (bpr) [Schürmann and Stoye 2005]. These

ACM Journal of Experimental Algorithmics, Vol. 12, Article No. 1.2, Publication date: June 2008.
