Compressed Relative Suffix Trees: Application to Repetitive Sequence Collections


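"Relative Suffix Trees" (arXiv:1508.02550, version of 15 Dec 2017, computer science) introduces a data structure called the relative suffix tree. It is joint work by Andrea Farruggia, Travis Gagie, Gonzalo Navarro, Simon J. Puglisi, and Jouni Sirén. The authors are affiliated with the Department of Computer Science of the University of Pisa (Italy), the Center for Biotechnology and Bioengineering (Chile), EIT at Diego Portales University (Chile), the Department of Computer Science of the University of Chile, the Department of Computer Science of the University of Helsinki (Finland), and the Sanger Institute (UK).

Suffix trees are widely used in string processing and bioinformatics, where they efficiently solve problems such as pattern matching and computing longest common prefixes. Their main drawback is memory usage: a conventional suffix tree can be tens of times larger than the input sequence. Compressed suffix trees were developed to address this. They keep operations efficient while using space proportional to a compressed representation of the input sequence.

In this work, the authors propose a new approach to compressed suffix trees for repetitive sequence collections, such as collections of individual genomes. The suffix tree of each individual sequence is compressed relative to the suffix tree of a reference sequence. The approach exploits the similarity and repetition between sequences to reduce storage further. In outline, building and using relative suffix trees may involve the following steps:

1. **Building the reference suffix tree**: a suffix tree is first built for the reference sequence. The individual sequences share most of their content with it.
2. **Compressing the individual trees**: the suffix tree of each individual sequence is not stored from scratch but is represented relative to the reference suffix tree. This avoids storing large amounts of repeated information.
3. **Queries and operations**: despite the compression, relative suffix trees still support fast queries and operations, such as pattern search and longest common prefix computation, without a large loss of efficiency.

The work is of practical value for large-scale bioinformatics data, genome sequence analysis in particular. Relative compression saves storage space while keeping query times competitive, which matters in resource-constrained computing environments. Future research may explore how to optimize the data structure further for more complex and varied applications.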

4 A. Farruggia et al.

The inverse function of LF is Ψ, with $\Psi(i) = \mathrm{select}_c(\mathrm{BWT}, i - C[c])$, where $c$ is the largest character value with $C[c] < i$. With functions Ψ and LF, we can move forward and backward in the text, while maintaining the lexicographic rank of the current suffix. If the sequence $S$ is not evident from the context, we write $\mathrm{LF}_S$ and $\Psi_S$.
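
As a minimal sketch of how LF and Ψ relate, the following Python code builds the BWT and the C array for a toy text and evaluates both functions with naive rank and select scans. The function names (`build_index`, `lf`, `psi`) are illustrative and not from the paper. Indexing is 0-based, so the condition C[c] < i from the text becomes C[c] <= i, and the text is assumed to end with a unique terminator `$` that sorts before every other character.

```python
def build_index(text):
    """Return (SA, BWT, C) for a text ending in a unique smallest terminator."""
    n = len(text)
    # Naive suffix sorting; fine for a toy example, far too slow for genomes.
    sa = sorted(range(n), key=lambda i: text[i:])
    # BWT[i] is the character preceding suffix SA[i] (text[-1] wraps around).
    bwt = "".join(text[i - 1] for i in sa)
    # C[c] = number of characters in the text that are smaller than c.
    C, total = {}, 0
    for c in sorted(set(text)):
        C[c] = total
        total += text.count(c)
    return sa, bwt, C


def lf(bwt, C, i):
    """Map the rank of suffix T[j..] to the rank of suffix T[j-1..]."""
    c = bwt[i]
    return C[c] + bwt[:i].count(c)  # rank of c over BWT[0..i-1]


def psi(bwt, C, i):
    """Inverse of lf: map the rank of T[j..] to the rank of T[j+1..]."""
    c = max(ch for ch in C if C[ch] <= i)  # first character of suffix i
    pos = -1
    for _ in range(i - C[c] + 1):  # select the (i - C[c])-th c, 0-based
        pos = bwt.index(c, pos + 1)
    return pos


sa, bwt, C = build_index("banana$")
forward = [psi(bwt, C, i) for i in range(len(bwt))]
backward = [lf(bwt, C, i) for i in range(len(bwt))]
```

Here `backward[i]` is the rank of the suffix that starts one text position earlier than suffix `i`, and `forward[i]` the rank of the one that starts one position later, so applying `psi` after `lf` returns the original rank.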

Compressed suffix arrays (CSA) [54, 55, 56] are text indexes supporting a functionality similar to the suffix array. This includes the following queries: i) find(P) = [sp, ep] determines the lexicographic range of suffixes starting with pattern P[1, ℓ]; ii) locate(sp, ep) = SA[sp, ep] returns the starting positions of these suffixes; and iii) extract(i, j) = T[i, j] extracts substrings of the text. In practice, the find performance of CSAs can be competitive with suffix arrays, while locate queries are orders of magnitude slower [57]. Typical index sizes are less than the size of the uncompressed text.
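
To make the semantics of the three queries concrete, here is a small Python sketch that answers them with a plain, uncompressed suffix array. It only illustrates the interface: a real CSA answers the same queries from compressed structures. The helper names are illustrative, indexing is 0-based with inclusive ranges, and `find` returns `None` when the pattern does not occur (a convention chosen for this sketch).

```python
def suffix_array(text):
    """Naive suffix array of a text ending in a unique smallest terminator."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def find(text, sa, pattern):
    """Return the inclusive range (sp, ep) of suffixes starting with pattern."""
    m = len(pattern)
    lo, hi = 0, len(sa)
    while lo < hi:  # first suffix whose m-character prefix is >= pattern
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    sp, hi = lo, len(sa)
    while lo < hi:  # first suffix whose m-character prefix is > pattern
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid
    ep = lo - 1
    return (sp, ep) if sp <= ep else None


def locate(sa, sp, ep):
    """Return the text positions SA[sp..ep]."""
    return sa[sp:ep + 1]


def extract(text, i, j):
    """Return the substring T[i..j], both ends inclusive."""
    return text[i:j + 1]


text = "banana$"
sa = suffix_array(text)
occurrences = locate(sa, *find(text, sa, "ana"))
```

With a plain suffix array, `locate` is a simple array read. In a CSA most suffix array values are not stored, which is why locate queries are so much slower there.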

The FM-index (FMI) [55] is a common type of compressed suffix array. A typical implementation [58] stores the BWT in a wavelet tree [52]. The index implements find queries via backward searching. Let [sp, ep] be the lexicographic range of the suffixes of the text starting with suffix P[i + 1, ℓ] of the pattern. We can find the range matching suffix P[i, ℓ] with a generalization of function LF as

$$\mathrm{LF}([sp, ep], P[i]) = \bigl[\, C[P[i]] + \mathrm{rank}_{P[i]}(\mathrm{BWT}, sp - 1) + 1,\; C[P[i]] + \mathrm{rank}_{P[i]}(\mathrm{BWT}, ep) \,\bigr].$$
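
The backward search step can be sketched in a few lines of Python. This is an illustration under simplifying assumptions, not the implementation cited in the text: rank is computed by scanning a plain string instead of querying a wavelet tree, indexing is 0-based with inclusive ranges (which shifts the +1 and −1 terms of the formula), and the names `build_bwt` and `backward_search` are made up for this example.

```python
def build_bwt(text):
    """Return (BWT, C) for a text ending in a unique smallest terminator."""
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    bwt = "".join(text[i - 1] for i in sa)
    C, total = {}, 0
    for c in sorted(set(text)):
        C[c] = total  # number of text characters smaller than c
        total += text.count(c)
    return bwt, C


def backward_search(bwt, C, pattern):
    """Return the inclusive suffix range (sp, ep) matching pattern, or None."""
    sp, ep = 0, len(bwt) - 1  # range of the empty pattern: all suffixes
    for c in reversed(pattern):  # extend the match one character to the left
        if c not in C:
            return None
        # 0-based form of the generalized LF step from the text.
        sp = C[c] + bwt[:sp].count(c)
        ep = C[c] + bwt[:ep + 1].count(c) - 1
        if sp > ep:
            return None
    return sp, ep


bwt, C = build_bwt("banana$")
match_range = backward_search(bwt, C, "ana")
```

Each loop iteration performs two rank queries, so a pattern of length ℓ costs 2ℓ rank operations regardless of how many suffixes match.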

We support locate queries by sampling some suffix array pointers. If we want to determine a value SA[i] that has not been sampled, we can compute it as SA[i] = SA[j] + k, where SA[j] is a sampled pointer found by iterating LF k times, starting from position i. Given sample interval d, the samples can be chosen in suffix order, sampling SA[i] at positions divisible by d, or in text order, sampling T[i] at positions divisible by d and marking the sampled SA positions in a bitvector. Suffix-order sampling requires less space, often resulting in better time/space trade-offs in practice, while text-order sampling guarantees better worst-case performance. We also sample the ISA pointers for extract queries. To extract T[i, j], we find the nearest sampled pointer after T[j], and traverse backwards to T[i] with function LF.
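
The following Python sketch illustrates locate with text-order sampling, the variant with the worst-case guarantee: every text position divisible by d is sampled, so at most d − 1 LF steps are needed. It is a toy model under stated assumptions. The bitvector is a plain list, rank on it is a linear scan, and the function names are hypothetical.

```python
def build_index(text):
    """Return (SA, BWT, C) for a text ending in a unique smallest terminator."""
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    bwt = "".join(text[i - 1] for i in sa)
    C, total = {}, 0
    for c in sorted(set(text)):
        C[c] = total
        total += text.count(c)
    return sa, bwt, C


def lf(bwt, C, i):
    """Move from the suffix of rank i to the suffix starting one position earlier."""
    c = bwt[i]
    return C[c] + bwt[:i].count(c)


def sample_text_order(sa, d):
    """Mark SA positions whose text position is divisible by d; keep those values."""
    marked = [pos % d == 0 for pos in sa]        # bitvector over SA positions
    samples = [pos for pos in sa if pos % d == 0]  # sampled values, in SA order
    return marked, samples


def locate_one(bwt, C, marked, samples, i):
    """Recover SA[i] as SA[j] + k by walking backwards to a sampled position j."""
    k = 0
    while not marked[i]:
        i = lf(bwt, C, i)
        k += 1
    j = sum(marked[:i])  # rank_1(marked, i) gives the index into samples
    return samples[j] + k


sa, bwt, C = build_index("banana$")
marked, samples = sample_text_order(sa, 3)
recovered = [locate_one(bwt, C, marked, samples, i) for i in range(len(sa))]
```

Only `bwt`, `C`, `marked`, and `samples` are needed at query time. The full array `sa` is used here just to build the samples and to check the result.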

Compressed suffix trees (CST) [5] are compressed text indexes supporting the full functionality of a suffix tree (see Table 1). They combine a compressed suffix array, a compressed representation of the LCP array, and a compressed representation of suffix tree topology. For the LCP array, there are several common representations:

• LCP-byte [51] stores the LCP array as a byte array. If LCP[i] < 255, the LCP value is stored in the byte array. Larger values are marked with a 255 in the byte array and stored separately. As many texts produce small LCP values, LCP-byte usually requires n to 1.5n bytes of space.

• We can store the LCP array by using variable-length codes. LCP-dac uses directly addressable codes [59] for the purpose, resulting in a structure that is typically somewhat smaller and somewhat slower than LCP-byte.

• The permuted LCP (PLCP) array [5] PLCP[1, n] is the LCP array stored in text order and used as LCP[i] = PLCP[SA[i]]. Because PLCP[i + 1] ≥ PLCP[i] − 1, the array can be stored as a bitvector of length 2n in 2n + o(n) bits. If the text is repetitive, run-length encoding can be used to compress the bitvector to take even less space [6]. Because accessing PLCP uses locate, it is much slower than the above two encodings.
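
Two of the representations above can be sketched in Python. The code is a toy illustration with 0-based indexing, and all names are made up for this example. The LCP-byte part keeps oversized values in a dictionary, which is an assumption of this sketch rather than the layout used in [51]. The PLCP part relies on PLCP[i] + i being non-decreasing, which follows from PLCP[i + 1] ≥ PLCP[i] − 1: it places the i-th 1-bit at position PLCP[i] + 2i, and a linear scan stands in for a constant-time select structure.

```python
def lcp_arrays(text):
    """Return (SA, LCP, PLCP); LCP[r] compares suffixes SA[r-1] and SA[r]."""
    n = len(text)
    sa = sorted(range(n), key=lambda i: text[i:])
    lcp = [0] * n
    for r in range(1, n):
        a, b, h = sa[r - 1], sa[r], 0
        while a + h < n and b + h < n and text[a + h] == text[b + h]:
            h += 1
        lcp[r] = h
    plcp = [0] * n
    for r in range(n):
        plcp[sa[r]] = lcp[r]  # same values, stored in text order
    return sa, lcp, plcp


def lcp_byte_encode(lcp):
    """One byte per value; 255 marks a value that is stored separately."""
    small = bytearray(min(v, 255) for v in lcp)
    large = {i: v for i, v in enumerate(lcp) if v >= 255}
    return small, large


def lcp_byte_get(small, large, i):
    return small[i] if small[i] < 255 else large[i]


def plcp_to_bits(plcp):
    """Encode PLCP so that the i-th 1-bit is at position PLCP[i] + 2i."""
    bits, prev = [], 0
    for i, v in enumerate(plcp):
        zeros_before = v + i  # non-decreasing because PLCP[i+1] >= PLCP[i] - 1
        bits.extend([0] * (zeros_before - prev))
        bits.append(1)
        prev = zeros_before
    return bits


def plcp_from_bits(bits, i):
    """Decode PLCP[i] as select_1(bits, i) - 2i, using a linear scan."""
    ones = -1
    for pos, bit in enumerate(bits):
        ones += bit
        if bit and ones == i:
            return pos - 2 * i
    raise IndexError(i)


sa, lcp, plcp = lcp_arrays("banana$")
bits = plcp_to_bits(plcp)
```

Reading LCP[r] from the bitvector requires SA[r] first, as in `plcp_from_bits(bits, sa[r])`. In a compressed index that SA value comes from a locate query, which is the source of the slowdown mentioned above.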

Suffix tree topology representations are the main difference between the various CST proposals. While the compressed suffix arrays and the LCP arrays are interchangeable, the tree representation determines how various suffix tree operations are implemented. There are three main families of compressed suffix trees:

• Sadakane's compressed suffix tree (CST-Sada) [5] uses a balanced parentheses representation for the tree. Each node is encoded as an opening parenthesis, followed by the encodings of its children and a closing parenthesis. This can be encoded as a bitvector of length 2n′, where n′ is the number of nodes, requiring up to 4n + o(n) bits. CST-Sada tends to be larger and faster than the other compressed suffix trees [11, 13].

• The fully compressed suffix tree (FCST) of Russo et al. [10, 14] aims to use as little space as possible. It does not require an LCP array at all, and stores a balanced parentheses representation for a sampled subset of suffix tree nodes in o(n) bits. Unsampled nodes are retrieved by following suffix links. FCST is smaller and much slower than the other compressed suffix trees [10, 13].

• Fischer et al. [6] proposed an intermediate representation, CST-NPR, based on lcp-intervals. Tree navigation is handled by searching for the values defining the lcp-intervals. Range minimum queries rmq(sp, ep) find the leftmost minimal value in LCP[sp, ep], while next/previous smaller value queries nsv(i)/psv(i) find the next/previous LCP value smaller than LCP[i]. After the improvements by various authors [7, 9, 8, 11, 13], the CST-NPR is perhaps the most practical compressed suffix tree.
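
As an illustration of navigation over lcp-intervals, the Python sketch below implements rmq, psv, and nsv as linear scans over a plain LCP array and uses them for two operations: the parent of a node and the string depth of an internal node. The parent rule shown is the commonly used one for lcp-intervals and is not spelled out in the text above, so treat it as an assumption of this sketch. Indexing is 0-based, nodes are inclusive ranges (sp, ep) of suffix array positions, and the boundary conventions (psv returns 0 and nsv returns n when no smaller value exists) are choices made for this example.

```python
def rmq(lcp, sp, ep):
    """Position of the leftmost minimum in LCP[sp..ep]."""
    best = sp
    for j in range(sp + 1, ep + 1):
        if lcp[j] < lcp[best]:
            best = j
    return best


def psv(lcp, i):
    """Previous position with an LCP value smaller than LCP[i], or 0."""
    for j in range(i - 1, -1, -1):
        if lcp[j] < lcp[i]:
            return j
    return 0


def nsv(lcp, i):
    """Next position with an LCP value smaller than LCP[i], or n."""
    for j in range(i + 1, len(lcp)):
        if lcp[j] < lcp[i]:
            return j
    return len(lcp)


def parent(lcp, sp, ep):
    """Parent of the node (sp, ep), or None for the root."""
    n = len(lcp)
    if (sp, ep) == (0, n - 1):
        return None
    right = lcp[ep + 1] if ep + 1 < n else -1  # sentinel past the last position
    # The larger boundary value is the string depth of the parent.
    k = sp if lcp[sp] > right else ep + 1
    return psv(lcp, k), nsv(lcp, k) - 1


def string_depth(lcp, sp, ep):
    """String depth of an internal node (requires sp < ep)."""
    return lcp[rmq(lcp, sp + 1, ep)]


# LCP array of "banana$" (suffix order: $, a$, ana$, anana$, banana$, na$, nana$).
lcp = [0, 0, 1, 3, 0, 0, 2]
node_ana = (2, 3)
node_a = parent(lcp, *node_ana)
```

In this example `node_a` is the range of suffixes starting with "a", and applying `parent` once more yields the root range covering all suffixes. A practical CST-NPR replaces the linear scans with succinct structures that answer the same three queries quickly.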

For typical texts and component choices, the size of compressed suffix trees ranges from the 1.5n to 3n bytes of CST-Sada to the 0.5n to n bytes of FCST [11, 13]. There are also some CST variants for repetitive texts, such as versioned document collections and collections of individual genomes. Abeliuk et al. [13] developed a variant of CST-NPR that can sometimes be smaller
