Front. Comput. Sci.
3
different social network platforms [28]. Liu et al. [28] con-
sider username popularity and proposed a heuristic rule to
automatically determine if two accounts from different sites
with the same username belong to the same person. The rule
labeled user account pairs are then used for training a clas-
sifier for user linkage. Many state-of-the-art works also con-
sider the social connectivity for account association or user
linkage [29, 30]. The relevant topic is also studied in multi-
ple research communities. However, it is not easy to find the
exact matched pairs for training classifiers.
2.2 Probabilistic approaches
Fellegi-Sunter’s approach is also a rule-based linkage method
for two data sources, and solves the record linkage prob-
lem via using an unsupervised and probabilistic linkage ap-
proach [23,24]. It works well only when the linkage problem
is simple and exact one-to-one matching of record attributes.
Based on the same idea, Sadinle et al. generalize Fellegi-
Sunter’s model to present a probabilistic method for linking
multiple data files [25]. However, the approach can only han-
dle exactly one-to-one matching of entity attributes. It is un-
suitable to match multiple data sources since they have poor
quality, including heterogeneous in attributes, error, incom-
plete and missing values, etc. That is the focus of this work.
Gao et al. also extend Fellegi-Sunter’s approach to link users
across two different social networks. Their approach can han-
dle heterogeneous user attributes, missing values, and social
similarity in user profiles [26]. Unfortunately, the approach
cannot be applied to link multiple data sources (at least three).
3 Entity Alignment Approach
In this section, before we overview our proposed approach,
we describe a formal definition of the entity alignment prob-
lem.
3.1 The Problem Definition
We assume that there are N data sources (N > 1). Let E
i
,
1 ≤ i ≤ N, be the set of entities from the i−th source and
α
i
(e
i
j
) represent the observed features of e
i
j
∈ E
i
, i.e., α
i
(e
i
j
)
represents the observed feature vector of e
i
j
from source E
i
.
Let α
i
(E
i
) be the set of attribute feature vectors of entities
from source E
i
. The set of all candidates tuples T can be
represented as
Q
k
i=1
α
i
(E
i
). The entity matching problem is
to determine the matched tuples M and unmatched tuples U
in T , i.e.,
M = {(α
1
(e
1
j
1
), . . . , α
N
(e
N
j
N
))|e
1
j
1
= ... = e
N
j
N
, e
i
j
i
∈ E
i
}
U = {(α
1
(e
1
j
1
), . . . , α
N
(e
N
j
N
))|∃u , v, s.t. e
u
j
u
, e
v
j
v
}.
When (α
1
(e
1
j
1
), α
2
(e
2
j
2
), . . . , α
N
(e
N
j
N
)) ∈ M indicates that
entities e
i
j
i
, 1 ≤ i ≤ N, are the same, while
(α
1
(e
1
j
1
), α
2
(e
2
j
2
), . . . , α
N
(e
N
j
N
)) ∈ U indicates there is at least
an entity e
i
j
i
different from the others. Due to not exist any
duplicate entities in a data source, M therefore contains at
most min(|E
1
|, |E
2
|, · · · , |E
N
|) matched entity tuples. We may
ideally consider T = M ∪ U as candidate tuples, while T
is usually an extremely large set in real whose size can be
Π
N
i=1
|E
i
|. To improve the scalability of our proposed solution,
we therefore utilize a blocking technique to reduce the size of
candidate tuples.
3.2 Overview of EnAli
Our proposed entity alignment approach, EnAli, consists of
four components as following.
Step 1. Candidate tuple generation: By default, the num-
ber of candidate tuples is Π
N
i=1
|E
i
| when we identify the
identical entities from N data sources. Thus, candidate
tuple generation is a computationally expensive task. To
speed up the processing, we employ Locality Sensitive
Hashing (shorted as LSH) to block entities. Two enti-
ties do not contain in any candidate tuples if they do not
belong to the same bucket of LSH.
Step 2. Similarity computation: An entity may have multi-
ple attributes, such as individual demographic attributes
(name, address, age, etc.), temporal behaviors, and spa-
tial behavior, and so on. These attributes can be modeled
as different data types, such as string, set, distribution,
etc. We employ a set of similarity functions, {s
j
}
m
j=1
, to
evaluate the similarities, where s
j
evaluates similarity of
the j−th attributes of entities. For tuple
t
i
= (α
1
(e
1
i
1
), α
2
(e
2
i
2
), · · · , α
N
(e
N
i
N
)),
there are
N
2
m-dimension similarity vectors between d-
ifferent entity pairs. We take the minimum value of each
entry over
N
2
similarity vectors to model the similarity
of tuple t
i
, denoted as a m-dimension vector γ
i
.
Step 3. Parameter learning: Given similarity vector γ
i
of
tuple t
i
, EnAli constructs a generative probabilistic mod-
el via using exponential family which incorporates het-
erogeneous attribute similarities. EnAli employ EM-
algorithm to infer parameters of the generative model
(more details in Section 4).