String similarity join is a primitive operation in many
applications such as merge-purge [13], record linkage [10],
[27], object matching [23], reference reconciliation [7],
deduplication [21], [2], and approximate string join [11].
To avoid verifying every pair of strings in the data set and
improve performance, string similarity join typically con-
sists of two phases: candidate generation and verification
[9], [19]. In the candidate generation phase, a signature
assignment (or blocking) process is invoked to group
candidate strings, using either an approximate or an exact
approach, depending on whether some amount of error can
be tolerated. Since we aim to provide exact answers, we
focus on the exact approaches. Recent
works that provide exact answers are typically built on top
of some traditional indexing methods, such as tree-based
and inverted-index-based structures. In [5], a Trie-tree-based
approach was proposed for edit similarity search, where
an in-memory Trie-tree is built to support edit similarity
search by incrementally probing it. The edit similarity join
method based on the Trie-tree was proposed in [26], in
which sub-trie pruning techniques are applied. In [30], a
B+-tree-based method was proposed to support edit
similarity queries; it transforms the strings into digits and
indexes them in the B+-tree. However, these algorithms are
constrained to in-memory processing and are neither efficient
nor scalable for processing large-scale data sets.
The methods making use of the inverted index are based
on the fact that similar strings share common parts, and
consequently, they transform the similarity constraints into
set overlap constraints. Based on the property of set overlap
[4], the prefix filtering was proposed to prune false
positives [4], [3], [29], [11]. In these methods, the partial
result of the candidate generation phase is a superset of the
final result. The AllPairs method proposed in [3] builds
an inverted index on the prefix tokens, and each pair of
strings appearing in the same inverted list is considered a
candidate pair. This method significantly reduces the number
of false positives compared to the method that indexes all
tokens of each string [22]. To prune false positives more
aggressively, the PPJoin method exploits the positional
information of the prefix tokens of each string. Building
on PPJoin, the PPJoin+ method uses
the position information of suffix tokens to prune false
positives further [29]. As these methods need to merge
the inverted lists during the candidate generation phase,
some optimization techniques for the inverted list merging
were introduced in [16], [29]. The exact computation
method proposed in [1] is based on the pigeonhole
principle. It transforms similarity constraints into Hamming
distance constraints and transforms each record into a
binary vector. The binary vector is divided into partitions
and then hashed into signatures, and the strings that
produce the same signatures are considered as candidate
pairs. However, the signature scheme is time-consuming
and introduces unnecessary false positives.
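
To make the prefix filtering idea concrete, the following Python sketch (our own illustration with hypothetical helper names, not the exact algorithm of [3]) performs AllPairs-style candidate generation: tokens are sorted under a global ordering, only the prefix tokens are indexed, and any two strings that share an indexed prefix token become a candidate pair.

```python
import math
from collections import defaultdict

def prefix_length(num_tokens, threshold):
    # For a Jaccard threshold t, indexing the first |s| - ceil(t*|s|) + 1
    # tokens of a string s is enough to guarantee no true match is missed.
    return num_tokens - math.ceil(threshold * num_tokens) + 1

def allpairs_candidates(records, threshold, global_order):
    """records: dict id -> token set; global_order: dict token -> rank."""
    inverted = defaultdict(list)      # prefix token -> record ids seen so far
    candidates = set()
    for rid, tokens in records.items():
        ordered = sorted(tokens, key=lambda t: global_order[t])
        for tok in ordered[:prefix_length(len(tokens), threshold)]:
            for other in inverted[tok]:
                candidates.add((min(rid, other), max(rid, other)))
            inverted[tok].append(rid)
    return candidates
```

The global ordering is typically chosen as increasing token frequency, so that rare tokens fall into the prefixes and the inverted lists stay short.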
A common drawback of the above proposals is that they
cannot be easily parallelized to run efficiently on a
MapReduce framework. In the MapReduce framework,
the global information about the whole data set cannot be
accessed easily, and therefore, the filtering strategies used in
a centralized system are not effective in this share-nothing
computing environment. In a demonstration paper [25], a
framework was briefly introduced without much detail. In
the recent work [24], two methods for similarity join on
MapReduce were proposed. One is RIDPairsImproved, in
which each prefix token of a string is used as its signature
(key) in the Map procedure, which performs candidate
generation. The strings with the same signature are then
shuffled into one group for further verification. In the
verification process, filtering methods are applied to avoid
similarity computation for as many false positives as
possible. The other method is RIDPairsPPJoin, which has
the same Map implementation. In the Reduce phase,
RIDPairsPPJoin builds an inverted index for each group of
strings to accelerate processing. However, the filtering
method needs to scan each string pair more than once
to compute the similarity upper bound for pruning purposes.
This incurs high overhead in a distributed environment.
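
As a rough illustration of this signature-based scheme (not the implementation of [24]), the Python sketch below mimics the two steps: Map emits each prefix token of a record as a key, and Reduce verifies every pair of records shuffled to the same key. All function and parameter names are our own assumptions.

```python
import math
from itertools import combinations

def map_phase(rid, tokens, threshold, global_order):
    # Emit (prefix token, record) pairs; each prefix token acts as the
    # record's signature and becomes a shuffle key.
    ordered = sorted(tokens, key=lambda t: global_order[t])
    p = len(tokens) - math.ceil(threshold * len(tokens)) + 1
    return [(tok, (rid, frozenset(tokens))) for tok in ordered[:p]]

def reduce_phase(signature, group, threshold):
    # Verify every pair of records that was shuffled to the same signature.
    results = []
    for (rid1, s1), (rid2, s2) in combinations(group, 2):
        jaccard = len(s1 & s2) / len(s1 | s2)
        if jaccard >= threshold:
            results.append((rid1, rid2, jaccard))
    return results
```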
In this paper, we propose an efficient and MapReduce-friendly
multiple prefix filtering approach based on
different global orderings. In the Map phase, we apply
one global ordering to generate signatures for the strings
and apply other global orderings to get different prefix
token sets that are appended to the string. In the Reduce
phase, for each string pair, the prefix token sets obtained
under the same global ordering are checked, and the pair is
pruned if it is clearly not a candidate. As each prefix token
set is much shorter than the string itself and the checking is
applied in a pipelined manner, the verification process is,
therefore, very efficient.
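
The following minimal Python sketch captures the idea just described, under the assumption that a plain prefix-overlap test is applied for each ordering; the actual filtering conditions are developed later in the paper, and the helper names here are illustrative only.

```python
import math

def build_record(rid, tokens, threshold, orderings):
    # One global ordering supplies the Map-phase signature prefix; the other
    # orderings supply additional prefix token sets appended to the record.
    p = len(tokens) - math.ceil(threshold * len(tokens)) + 1
    prefixes = [set(sorted(tokens, key=lambda t: order[t])[:p])
                for order in orderings]
    return rid, prefixes, set(tokens)

def survives_multi_prefix_filter(prefixes1, prefixes2):
    # If two strings are truly similar, their prefixes under *any* global
    # ordering must share at least one token, so the orderings are checked
    # one by one and the pair is dropped as soon as one check fails
    # (a pipelined, early-terminating test).
    return all(p1 & p2 for p1, p2 in zip(prefixes1, prefixes2))
```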
3 PROBLEM DEFINITION AND PRELIMINARIES
3.1 Definitions
A string s is considered as a set of tokens, each of which can
be either a word or an n-gram (a substring of s with length
n). For example, the tokens of the string s = "Parallel
Relational Database Systems" are {Parallel, Relational,
Database, Systems}.
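
For concreteness, a small Python sketch of the two tokenization options (word tokens and n-grams); the function names are our own.

```python
def word_tokens(s):
    # View a string as the set of its whitespace-separated words.
    return set(s.split())

def ngram_tokens(s, n):
    # View a string as the set of its length-n substrings (n-grams).
    return {s[i:i + n] for i in range(len(s) - n + 1)}

print(word_tokens("Parallel Relational Database Systems"))
# {'Parallel', 'Relational', 'Database', 'Systems'}
```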
Definition 1 (String similarity join). Given a set of strings $\mathcal{S}$
and a join threshold $\tau$, string similarity join finds all string
pairs $(s_i, s_j)$ in $\mathcal{S}$ such that $\mathrm{sim}(s_i, s_j) \geq \tau$.
Table 4 lists the symbols and their definitions that will be
used throughout this paper.
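
Definition 1 corresponds directly to the naive nested-loop join sketched below (a reference illustration only; the similarity function and threshold are passed in as parameters), whose quadratic cost is precisely what the candidate generation phase is designed to avoid.

```python
from itertools import combinations

def naive_similarity_join(records, sim, threshold):
    # Enumerates every string pair, exactly as in Definition 1; the cost is
    # quadratic in the number of strings, which is what candidate generation
    # is meant to avoid.
    return [(rid1, rid2)
            for (rid1, s1), (rid2, s2) in combinations(records.items(), 2)
            if sim(s1, s2) >= threshold]
```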
3.2 Similarity Measures
A similarity function measures how similar two strings are
and returns a value in [0,1]. Typically, the larger the value,
the more similar the two strings. In this paper, we utilize
three widely used similarity functions, namely Dice [20],
Jaccard [18], and Cosine [28], whose computation can be
reduced to a set overlap problem [3]. They are based
on the fact that similar strings share common components.
Clearly, the similarity between two strings is zero if they do
not have any token in common. In other words, we only
verify string pairs with at least one common token. For
the sake of better understanding, the similarity measures
and their definitions are summarized in Table 5. Unless
otherwise specified, we use Jaccard as the default function,
i.e., $\mathrm{sim}(s_i, s_j) = \mathrm{sim}_{\mathrm{jaccard}}(s_i, s_j)$.
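
For reference, the set-overlap forms of the three measures can be written as in the short Python sketch below; this reflects the standard definitions used in this line of work, with Table 5 giving the exact forms adopted in the paper.

```python
import math

def jaccard(s1, s2):
    # |s1 ∩ s2| / |s1 ∪ s2|
    return len(s1 & s2) / len(s1 | s2)

def dice(s1, s2):
    # 2|s1 ∩ s2| / (|s1| + |s2|)
    return 2 * len(s1 & s2) / (len(s1) + len(s2))

def cosine(s1, s2):
    # |s1 ∩ s2| / sqrt(|s1| * |s2|)
    return len(s1 & s2) / math.sqrt(len(s1) * len(s2))
```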
3.3 Prefix Filtering
The prefix filtering technique is commonly used in the refine-
ment step to further prune false positives among candidate pairs
that share a certain number of common tokens. In [3], [4],
the methods sort the tokens of each string based on some