work during an iteration that is not constant, and as the number of iterations is bounded by
O(log(n)), this implies a complexity of O(m + log(n)).
When only one occurrence is needed, the index is returned the first time the search
string has been matched. If the indices of all occurrences should be returned, this can be
done at an extra cost of O(#occ). The algorithm in [3] does not terminate when the
first occurrence is found. Instead, the next iteration proceeds with R = M, and the algorithm
terminates when R − L < 2. When the algorithm terminates, it is guaranteed that R is
the smallest index i where Q ≤ S_{SA[i]}, and thus the index of the leftmost occurrence,
if any occurrences exist. With a symmetric algorithm the rightmost occurrence in the
suffix array can be found. Having found the leftmost and the rightmost occurrences in
O(m + log(n)) time, it is straightforward to return the indices in O(#occ) time.
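As an illustration (this is not the pseudo code from [3]), the two symmetric binary searches and the O(#occ) reporting step can be sketched in Python. This plain version skips the lcp speed-up, so each comparison costs O(m) and each search runs in O(m · log(n)); all names are illustrative.

```python
def build_suffix_array(s):
    # Naive O(n^2 log n) construction, for illustration only.
    return sorted(range(len(s)), key=lambda i: s[i:])

def leftmost(s, sa, q):
    # Smallest index i with q <= s[sa[i]:], i.e. the leftmost occurrence
    # in the suffix array if q occurs at all.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if s[sa[mid]:] < q:
            lo = mid + 1
        else:
            hi = mid
    return lo

def rightmost(s, sa, q):
    # Smallest index i whose suffix does not start with q; one past the
    # rightmost occurrence.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if s[sa[mid]:sa[mid] + len(q)] <= q:
            lo = mid + 1
        else:
            hi = mid
    return lo

def occurrences(s, sa, q):
    # Report all occurrence positions in O(#occ) extra time
    # (the sort is only for readable output).
    L, R = leftmost(s, sa, q), rightmost(s, sa, q)
    return sorted(sa[L:R])

s = "mississippi"
sa = build_suffix_array(s)
print(occurrences(s, sa, "ssi"))   # -> [2, 5]
```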
The pseudo code for finding the leftmost occurrence of the query string is provided in
[3, p. 6]^1. The pseudo code for finding the rightmost occurrence is symmetric, and is for
the sake of completeness provided in algorithm 2, with the notation used in this thesis. It
is easy to turn the pseudo code into the simple search algorithm, as the only difference is
that the algorithm should return the first time an occurrence is found.
The algorithm uses three different arrays to reach its complexity: the suffix array
and the two auxiliary arrays. The suffix array can be created in O(n · log(|Σ|)) time in
multiple ways. Given a suffix tree with sorted children, the simplest solution is a depth
first traversal of the suffix tree, where the children are visited in lexicographical order. This
is an O(n · log(|Σ|)) time solution, as a depth first traversal in lexicographical order clearly
can be done in O(n · log(|Σ|)) time, and section 2.1 shows how to build a suffix tree in
O(n · log(|Σ|)) time.
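A minimal sketch of this depth first construction, under assumed toy data structures: for brevity a suffix trie with one character per edge is built naively (which costs O(n^2) space, unlike a real suffix tree), and the class and helper names are illustrative, not from the thesis.

```python
class Node:
    def __init__(self):
        self.children = {}        # edge character -> child node
        self.suffix_start = None  # set on the node ending a suffix

def build_suffix_trie(s):
    # Naive trie over all suffixes of s; s is assumed to end with a
    # unique terminator such as '$', so every suffix ends in a leaf.
    root = Node()
    for i in range(len(s)):
        node = root
        for ch in s[i:]:
            node = node.children.setdefault(ch, Node())
        node.suffix_start = i
    return root

def suffix_array_from_tree(root):
    # Depth first traversal visiting children in lexicographical order;
    # the leaves are then reported in sorted suffix order, which is
    # exactly the suffix array.
    sa = []
    stack = [root]
    while stack:
        node = stack.pop()
        if node.suffix_start is not None:
            sa.append(node.suffix_start)
        # Push children in reverse order so the smallest edge character
        # is expanded first.
        for ch in sorted(node.children, reverse=True):
            stack.append(node.children[ch])
    return sa

print(suffix_array_from_tree(build_suffix_trie("banana$")))
# -> [6, 5, 3, 1, 0, 4, 2]
```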
Creating the two auxiliary arrays with a complexity as good as or better than that of the
suffix array is, on the other hand, not straightforward. Finding lcp(S_i, S_j) can be reduced to
finding the nearest common ancestor (NCA) of the two leaves representing S_i and S_j in the
suffix tree. Harel et al. [2] show how NCA(S_i, S_j) can be found in constant time, at the cost
of O(n) preprocessing. The time complexity for creating the auxiliary arrays thus becomes
O(n), while the complexity for the entire enhanced suffix array becomes O(n · log(|Σ|)).
The data structures used to solve the NCA problem can be created within O(n) space,
giving the enhanced suffix array a space complexity of O(n), as was the case for the
suffix tree. The following subsections explain how Harel et al. [2] solve the
NCA problem in constant time.
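As an aside, constant-time lcp queries can also be obtained without the NCA machinery of [2], for example by answering range-minimum queries over an lcp array with a sparse table, at the cost of O(n · log(n)) preprocessing instead of O(n). A sketch under that assumption, with illustrative names and a naive lcp-array construction:

```python
def lcp_array(s, sa):
    # lcp[i] = longest common prefix length of the suffixes at sa[i-1]
    # and sa[i]. Kasai's algorithm would give this in O(n); a naive
    # O(n^2) version keeps the sketch short.
    def lcp(a, b):
        k = 0
        while a + k < len(s) and b + k < len(s) and s[a + k] == s[b + k]:
            k += 1
        return k
    return [0] + [lcp(sa[i - 1], sa[i]) for i in range(1, len(sa))]

class RMQ:
    # Sparse table: O(n log n) preprocessing, O(1) minimum queries.
    def __init__(self, a):
        self.table = [a[:]]           # table[t][i] = min of a[i .. i+2^t-1]
        k = 1
        while 2 * k <= len(a):
            prev = self.table[-1]
            self.table.append([min(prev[i], prev[i + k])
                               for i in range(len(prev) - k)])
            k *= 2

    def query(self, i, j):
        # Minimum over a[i..j], inclusive, for i <= j.
        t = (j - i + 1).bit_length() - 1
        return min(self.table[t][i], self.table[t][j - (1 << t) + 1])

def lcp_of_ranks(rmq, i, j):
    # lcp of the suffixes at suffix-array positions i < j is the minimum
    # of the lcp array over positions i+1 .. j.
    return rmq.query(i + 1, j)

s = "banana"
sa = sorted(range(len(s)), key=lambda i: s[i:])   # [5, 3, 1, 0, 4, 2]
rmq = RMQ(lcp_array(s, sa))
print(lcp_of_ranks(rmq, 1, 2))   # lcp("ana", "anana") -> 3
```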
Solving the NCA problem
To solve the NCA problem, it is separated into two subproblems, called the nca depth
problem and the depth problem.
^1 If the pseudo code from [3, p. 6] is followed, the user should notice that there is an error on line 5, as
line 5 should have been w_r > a_{Pos[N−1]+r} instead of w_r ≤ a_{Pos[N−1]+r}.