String similarity join is a primitive operation in many
applications such as merge-purge [13], record linkage [10],
[27], object matching [23], reference reconciliation [7],
deduplication [21], [2], and approximate string join [11].
To avoid verifying every pair of strings in the data set and
improve performance, string similarity join typically con-
sists of two phases: candidate generation and verification
[9], [19]. In the candidate generation phase, a signature
assignment (or blocking) process is invoked to group
candidate strings, using either an approximate or an exact
approach, depending on whether some amount of error can
be tolerated. Since we aim to provide exact answers, we
focus on the exact approaches. Recent
works that provide exact answers are typically built on top
of some traditional indexing methods, such as tree-based
and inverted-index-based structures. In [5], a Trie-tree-based
approach was proposed for edit similarity search, where
an in-memory Trie-tree is built to support edit similarity
search by incrementally probing it. The edit similarity join
method based on the Trie-tree was proposed in [26], in
which sub-trie pruning techniques are applied. In [30], a
B+-tree-based method was proposed to support edit
similarity queries; it transforms the strings into digits and
indexes them in the B+-tree. However, these algorithms are
constrained to in-memory processing and are neither efficient
nor scalable for processing large-scale data sets.
The methods making use of the inverted index are based
on the fact that similar strings share common parts, and
consequently, they transform the similarity constraints into
set overlap constraints. Based on the property of set overlap
[4], the prefix filtering was proposed to prune false
positives [4], [3], [29], [11]. In these methods, the partial
result of the candidate generation phase is a superset of the
final result. The AllPairs method proposed in [3] builds
an inverted index on the prefix tokens, and each pair of
strings appearing in the same inverted list is considered a
candidate pair. This method significantly reduces the number
of false positives compared to the method that indexes all
tokens of each string [22]. To prune false positives more
aggressively, the PPJoin method exploits the positional
information of the prefix tokens of each string. Building
on PPJoin, the PPJoin+ method uses
the position information of suffix tokens to prune false
positives further [29]. As these methods need to merge
the inverted lists during the candidate generation phase,
some optimization techniques for the inverted list merging
were introduced in [16], [29]. The exact computation
method proposed in [1] is based on the pigeonhole
principle. It transforms similarity constraints into Hamming
distance constraints and transforms each record into a
binary vector. The binary vector is divided into partitions
and then hashed into signatures, and the strings that
produce the same signatures are considered as candidate
pairs. However, the signature scheme is time-consuming
and introduces unnecessary false positives.
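
To make the prefix filtering idea concrete, the following Python sketch (our own illustration with hypothetical helper names, not the exact algorithm of [3]) performs AllPairs-style candidate generation: tokens are sorted under a global ordering, only the prefix tokens are indexed, and any two strings that share an indexed prefix token become a candidate pair.

```python
import math
from collections import defaultdict

def prefix_length(num_tokens, threshold):
    # For a Jaccard threshold t, indexing the first |s| - ceil(t*|s|) + 1
    # tokens of a string s is enough to guarantee no true match is missed.
    return num_tokens - math.ceil(threshold * num_tokens) + 1

def allpairs_candidates(records, threshold, global_order):
    """records: dict id -> token set; global_order: dict token -> rank."""
    inverted = defaultdict(list)      # prefix token -> record ids seen so far
    candidates = set()
    for rid, tokens in records.items():
        ordered = sorted(tokens, key=lambda t: global_order[t])
        for tok in ordered[:prefix_length(len(tokens), threshold)]:
            for other in inverted[tok]:
                candidates.add((min(rid, other), max(rid, other)))
            inverted[tok].append(rid)
    return candidates
```

The global ordering is typically chosen as increasing token frequency, so that rare tokens fall into the prefixes and the inverted lists stay short.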
A common drawback of the above proposals is that they
cannot be easily parallelized to run efficiently on a
MapReduce framework. In the MapReduce framework,
the global information about the whole data set cannot be
accessed easily, and therefore, the filtering strategies used in
a centralized system are not effective in this share-nothing
computing environment. In a demonstration paper [25], a
framework was briefly introduced without much detail. In
the recent work [24], two methods for similarity join on
MapReduce were proposed. One is RIDPairsImproved, in
which each prefix token of a string is used as its signature
(key) in the Map procedure, which performs candidate
generation. The strings with the same signature are then
shuffled into one group for further verification. In the
verification process, filtering methods are applied to avoid
similarity computation for as many false positives as
possible. The other method is RIDPairsPPJoin, which has
the same Map implementation. In the Reduce phase,
RIDPairsPPJoin builds an inverted index for each group of
strings to accelerate processing. However, the filtering
method needs to scan each string pair more than once
to compute the similarity upper bound for pruning purposes.
This incurs high overhead in a distributed environment.
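
As a rough illustration of this signature-based scheme (not the implementation of [24]), the Python sketch below mimics the two steps: Map emits each prefix token of a record as a key, and Reduce verifies every pair of records shuffled to the same key. All function and parameter names are our own assumptions.

```python
import math
from itertools import combinations

def map_phase(rid, tokens, threshold, global_order):
    # Emit (prefix token, record) pairs; each prefix token acts as the
    # record's signature and becomes a shuffle key.
    ordered = sorted(tokens, key=lambda t: global_order[t])
    p = len(tokens) - math.ceil(threshold * len(tokens)) + 1
    return [(tok, (rid, frozenset(tokens))) for tok in ordered[:p]]

def reduce_phase(signature, group, threshold):
    # Verify every pair of records that was shuffled to the same signature.
    results = []
    for (rid1, s1), (rid2, s2) in combinations(group, 2):
        jaccard = len(s1 & s2) / len(s1 | s2)
        if jaccard >= threshold:
            results.append((rid1, rid2, jaccard))
    return results
```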
In this paper, we propose an efficient and MapReduce-friendly
multiple prefix filtering approach based on
different global orderings. In the Map phase, we apply
one global ordering to generate signatures for the strings
and apply other global orderings to get different prefix
token sets that are appended to the string. In the Reduce
phase, for each string pair, the prefix token sets obtained
under the same global ordering are checked, and the pair is
pruned if it is clearly not a candidate. As each prefix token
set is much shorter than the string itself and the checking is
applied in a pipelined manner, the verification process is,
therefore, very efficient.
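
The following minimal Python sketch captures the idea just described, under the assumption that a plain prefix-overlap test is applied for each ordering; the actual filtering conditions are developed later in the paper, and the helper names here are illustrative only.

```python
import math

def build_record(rid, tokens, threshold, orderings):
    # One global ordering supplies the Map-phase signature prefix; the other
    # orderings supply additional prefix token sets appended to the record.
    p = len(tokens) - math.ceil(threshold * len(tokens)) + 1
    prefixes = [set(sorted(tokens, key=lambda t: order[t])[:p])
                for order in orderings]
    return rid, prefixes, set(tokens)

def survives_multi_prefix_filter(prefixes1, prefixes2):
    # If two strings are truly similar, their prefixes under *any* global
    # ordering must share at least one token, so the orderings are checked
    # one by one and the pair is dropped as soon as one check fails
    # (a pipelined, early-terminating test).
    return all(p1 & p2 for p1, p2 in zip(prefixes1, prefixes2))
```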
3 PROBLEM DEFINITION AND PRELIMINARIES
3.1 Definitions
A string s is considered as a set of tokens, each of which can
be either a word or an n-gram (a substring of s with length
n). For example, the tokens of the string s = "Parallel
Relational Database Systems" are {Parallel, Relational,
Database, Systems}.
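
For concreteness, a small Python sketch of the two tokenization options (word tokens and n-grams); the function names are our own.

```python
def word_tokens(s):
    # View a string as the set of its whitespace-separated words.
    return set(s.split())

def ngram_tokens(s, n):
    # View a string as the set of its length-n substrings (n-grams).
    return {s[i:i + n] for i in range(len(s) - n + 1)}

print(word_tokens("Parallel Relational Database Systems"))
# {'Parallel', 'Relational', 'Database', 'Systems'}
```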
Definition 1 (String similarity join). Given a set of strings $\mathcal{S}$
and a join threshold $\tau$, string similarity join finds all string
pairs $(s_i, s_j)$ in $\mathcal{S}$ such that $\mathrm{sim}(s_i, s_j) \geq \tau$.
Table 4 lists the symbols and their definitions that will be
used throughout this paper.
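
Definition 1 corresponds directly to the naive nested-loop join sketched below (a reference illustration only; the similarity function and threshold are passed in as parameters), whose quadratic cost is precisely what the candidate generation phase is designed to avoid.

```python
from itertools import combinations

def naive_similarity_join(records, sim, threshold):
    # Enumerates every string pair, exactly as in Definition 1; the cost is
    # quadratic in the number of strings, which is what candidate generation
    # is meant to avoid.
    return [(rid1, rid2)
            for (rid1, s1), (rid2, s2) in combinations(records.items(), 2)
            if sim(s1, s2) >= threshold]
```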
3.2 Similarity Measures
A similarity function measures how similar two strings are
and returns a value in [0,1]. Typically, the larger the value,
the more similar the two strings. In this paper, we utilize
three widely used similarity functions, namely Dice [20],
Jaccard [18], and Cosine [28], whose computation can be
reduced to a set overlap problem [3]. They are based
on the fact that similar strings share common components.
Clearly, the similarity between two strings is zero if they do
not have any token in common. In other words, we only
verify string pairs with at least one common token. For
the sake of better understanding, the similarity measures
and their definitions are summarized in Table 5. Unless
otherwise specified, we use Jaccard as the default function,
i.e., $\mathrm{sim}(s_i, s_j) = \mathrm{sim}_{\mathrm{jaccard}}(s_i, s_j)$.
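
For reference, the set-overlap forms of the three measures can be written as in the short Python sketch below; this reflects the standard definitions used in this line of work, with Table 5 giving the exact forms adopted in the paper.

```python
import math

def jaccard(s1, s2):
    # |s1 ∩ s2| / |s1 ∪ s2|
    return len(s1 & s2) / len(s1 | s2)

def dice(s1, s2):
    # 2|s1 ∩ s2| / (|s1| + |s2|)
    return 2 * len(s1 & s2) / (len(s1) + len(s2))

def cosine(s1, s2):
    # |s1 ∩ s2| / sqrt(|s1| * |s2|)
    return len(s1 & s2) / math.sqrt(len(s1) * len(s2))
```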
3.3 Prefix Filtering
The prefix filtering technique is commonly used in the refine-
ment step to further prune false positives among candidate pairs
that share a certain number of common tokens. In [3], [4],
the methods sort the tokens of each string based on some