previous or following position of its neighbor, which
not only makes it possible to restore the text, but
also leads to compression to the higher-order entropy
of the text.
2.2 The Burrows-Wheeler transform. Let T be
a string of n characters from an alphabet Σ of size σ,
and let P be a query pattern of length p. The string T
has n suffixes, one starting at each of the n positions in
the text. The ith suffix, which starts at position i, is
denoted by T[i..n]. The suffix array SA[1..n] of T is an
array of n integers that gives the sorted order of the
suffixes of T. That is, SA[i] = j if T[j..n] is the ith
smallest suffix of T in lexicographical order. Similarly,
the inverse suffix array is defined by SA^{-1}[j] = i.
All the suffixes prefixed by P occupy a contiguous
range in the sorted array SA.
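The definitions above can be sketched in a few lines of Python on the example string of Figure 1 (0-indexed, as in the figure; the naive sort below is for illustration only, not an efficient construction):

```python
T = "abaabab#"   # '#' is the lexicographically smallest character
n = len(T)

# Naive suffix array: sort the suffix start positions lexicographically.
SA = sorted(range(n), key=lambda j: T[j:])
# SA == [7, 2, 5, 0, 3, 6, 1, 4], matching the SA column of Figure 1

# Inverse suffix array: ISA[j] = i if and only if SA[i] = j.
ISA = [0] * n
for i, j in enumerate(SA):
    ISA[j] = i

# All suffixes prefixed by P occupy a contiguous range of SA.
P = "ab"
rows = [i for i in range(n) if T[SA[i]:].startswith(P)]
# rows == [2, 3, 4]: a contiguous run of rows, as claimed
```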
In the CSA [9, 10], the suffix array values are
encoded indirectly by instead storing the Φ function,
where Φ(i) = SA^{-1}[SA[i] + 1]. The Φ function can
be compressed into optimal space (in the entropy
sense), and each SA value can then be computed by
consulting a small portion of the Φ function in
O(polylog n) time. More recently, Huo et al. [13] gave
a practical implementation of the CSA that encodes
the differences Φ(i) − Φ(i − 1) using Elias's gamma
coding, exploiting a remarkable property of Φ: it
forms an increasing sequence of positions within each
Σ list.
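A minimal sketch of Φ on Figure 1's data, computing it from the suffix array (0-indexed, wrapping at the end of T) and checking the increasing-runs property that the gamma coding exploits:

```python
T = "abaabab#"
n = len(T)
SA = sorted(range(n), key=lambda j: T[j:])  # naive, for illustration
ISA = [0] * n
for i, j in enumerate(SA):
    ISA[j] = i

# Phi(i) = ISA[(SA[i] + 1) mod n]
Phi = [ISA[(SA[i] + 1) % n] for i in range(n)]
# Phi == [3, 4, 5, 6, 7, 0, 1, 2], the Phi column of Figure 1

# Within the block of rows whose suffixes start with the same
# character c (the "Sigma list" of c), Phi is strictly increasing,
# so the differences Phi(i) - Phi(i-1) are cheap to gamma-code.
for c in set(T):
    block = [Phi[i] for i in range(n) if T[SA[i]] == c]
    assert block == sorted(block)
```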
    i  SA   Φ  LF    F        L
    0   7   3   5    # abaaba b
    1   2   4   6    a abab#a b
    2   5   5   7    a b#abaa b
    3   0   6   0    a baabab #
    4   3   7   1    a bab#ab a
    5   6   0   2    b #abaab a
    6   1   1   3    b aabab# a
    7   4   2   4    b ab#aba a

Figure 1: Example of the BWT of T = abaabab#.
The Burrows-Wheeler transform (BWT) of T is
an invertible permutation of T, denoted by L, such
that L[i] is the character in the text just preceding
the ith lexicographically smallest suffix of T. That is,
L[i] = T[(SA[i] − 1) mod n]. Intuitively, the sequence
L is easier to compress because adjacent characters
often share higher-order contexts, and thus space can
be reduced even further, to about nH_k bits. The LF
function [3] stands for last-to-first column mapping:
the character L[i] in the last column of Figure 1
is located in column F at position LF(i), i.e.,
L[i] = F[LF(i)]. In the example of Figure 1, L[6] and
F[LF(6)] = F[3] both correspond to the third a in the
string abaabab#. Thus we can walk backwards through
the text T using the function LF. That is, if T[k] = L[i],
then T[k − 1] = L[LF(i)].
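The following sketch computes L from the suffix array, derives LF by matching occurrences of each character between the last column L and the first column F, and then inverts the BWT by the backward walk just described (0-indexed, on Figure 1's example):

```python
T = "abaabab#"
n = len(T)
SA = sorted(range(n), key=lambda j: T[j:])  # naive, for illustration

# BWT: L[i] is the character just before the i-th smallest suffix.
L = ''.join(T[(SA[i] - 1) % n] for i in range(n))
# L == "bbb#aaaa", the last column of Figure 1

# LF: the k-th occurrence of c in L corresponds to the k-th
# occurrence of c in the first column F = sorted(L).
F = ''.join(sorted(L))
start = {c: F.index(c) for c in set(F)}  # start of c's block in F
seen = {}
LF = []
for ch in L:
    k = seen.get(ch, 0)
    LF.append(start[ch] + k)
    seen[ch] = k + 1
# LF == [5, 6, 7, 0, 1, 2, 3, 4], the LF column of Figure 1

# Walking backwards with LF recovers T, reversed, terminator first.
i, out = 0, []
for _ in range(n):
    out.append(L[i])
    i = LF[i]
s = ''.join(reversed(out))   # '#' followed by T without its terminator
assert s[1:] + s[0] == T
```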
The FM-index and the CSA are closely related:
the LF function and the CSA neighbor function Φ are
inverses of one another. That is, SA[LF(i)] =
(SA[i] − 1) mod n; equivalently, LF(i) =
SA^{-1}[(SA[i] − 1) mod n] = Φ^{-1}(i).
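This inverse relation can be checked directly against the columns of Figure 1 (with the subtraction taken mod n to account for the wraparound row):

```python
# SA, Phi, and LF columns as listed in Figure 1 (0-indexed).
SA  = [7, 2, 5, 0, 3, 6, 1, 4]
Phi = [3, 4, 5, 6, 7, 0, 1, 2]
LF  = [5, 6, 7, 0, 1, 2, 3, 4]
n = 8
for i in range(n):
    assert SA[LF[i]] == (SA[i] - 1) % n   # LF steps one text position back
    assert LF[Phi[i]] == i and Phi[LF[i]] == i   # LF and Phi are inverses
```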
Since the BWT does not change the distribution of
characters, the 0th-order empirical entropy of T remains
the same. However, the BWT tends to move characters
with similar contexts close together, so the resulting
string L exhibits good locality (equal characters tend to
appear in consecutive runs along the lexicographic order,
as shown in Figure 1), which makes the move-to-front or
wavelet tree transformation more effective.
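A small sketch of why this locality helps: move-to-front coding of the BWT string from Figure 1 turns its runs of equal characters into runs of zeros, which a 0th-order coder compresses well.

```python
L = "bbb#aaaa"                  # BWT of T = "abaabab#"
symbols = sorted(set(L))        # MTF list, initially ['#', 'a', 'b']
codes = []
for ch in L:
    j = symbols.index(ch)       # current rank of ch in the list
    codes.append(j)
    symbols.insert(0, symbols.pop(j))   # move ch to the front
# codes == [2, 0, 0, 1, 2, 0, 0, 0]: runs in L become runs of zeros
```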
2.3 The FM-index. Ferragina and Manzini intro-
duced the elegant FM-index [3, 4], based upon the
Burrows-Wheeler transform. The FM-index was the
first self-index shown to have both fast performance
and space usage within a constant factor of the desired
entropy bound for constant-sized alphabets. The core
problem of the FM-index is to provide a compressed
representation of L together with some auxiliary
structures that make it possible to compute the LF
mapping efficiently: LF(i) = C[L[i]] + Occ(i, L[i]) − 1,
where Occ(i, c) is the number of occurrences of
character c in the prefix L[0, i], and C[] is an array of
length σ + 1 such that C[c] is the total number of text
characters that are alphabetically smaller than c. For
the example in Figure 1, C = {0, 1, 5, 8}.
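The LF formula above can be sketched directly on Figure 1's data (0-indexed; the linear-scan Occ below is for illustration, whereas a real FM-index answers Occ with compressed rank structures):

```python
L = "bbb#aaaa"                     # BWT of T = "abaabab#"

def Occ(i, c):
    # Number of occurrences of c in the prefix L[0..i].
    return L[:i + 1].count(c)

# C[c] = number of text characters alphabetically smaller than c.
C = {}
total = 0
for c in sorted(set(L)):           # ['#', 'a', 'b']
    C[c] = total
    total += L.count(c)
# C == {'#': 0, 'a': 1, 'b': 5}, i.e. the array {0, 1, 5, 8} of the text

LF = [C[L[i]] + Occ(i, L[i]) - 1 for i in range(len(L))]
# LF == [5, 6, 7, 0, 1, 2, 3, 4], the LF column of Figure 1
```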
The FM-index is based upon a procedure called
backward search [4, 10], which finds the range of rows
in the BWT matrix M (the last three columns in Fig-
ure 1) that begin with a given pattern P. This range
represents the occurrences of P in T, which answers
the count query (returning the number of occurrences
of P in T). Thus backward search turns the count
query on T into a sequence of rank queries on L. With
a slight extension, we can implement the locate and
extract queries, where locate reports the positions at
which P occurs in T and extract displays the text
substring T[start, start + len − 1], given start and len.
The count algorithm describes the backward-search-
based counting operation, in which the while loop
performs p iterations, from p − 1 down to 0, where p is
the length of the pattern P. The algorithm maintains
the following invariant: after k iterations, the variable l
points to the first row of M prefixed by P[p − k, p − 1],
and the variable r points to the last row of M prefixed
by P[p − k, p − 1]. After p iterations, occ = r − l + 1,
the number of occurrences of P in T.
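The counting loop can be sketched as follows (0-indexed, on Figure 1's data; the linear-scan Occ again stands in for the index's rank structures):

```python
L = "bbb#aaaa"                   # BWT of T = "abaabab#"
C = {'#': 0, 'a': 1, 'b': 5}     # characters smaller than c, from Figure 1

def Occ(i, c):
    # Number of occurrences of c in L[0..i]; 0 when i < 0.
    return L[:i + 1].count(c) if i >= 0 else 0

def count(P):
    # Invariant: [l, r] spans the rows of M prefixed by the
    # last k characters of P after k iterations.
    l, r = 0, len(L) - 1
    for c in reversed(P):        # iterate from P[p-1] down to P[0]
        l = C[c] + Occ(l - 1, c)
        r = C[c] + Occ(r, c) - 1
        if l > r:
            return 0             # P does not occur in T
    return r - l + 1

print(count("ab"))   # 3 occurrences of "ab" in "abaabab#"
print(count("ba"))   # 2
```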
Copyright © 2015 by the Society for Industrial and Applied Mathematics.