EnAli: An Entity Alignment Method for Automatically Matching Entities Across Multiple Heterogeneous Data Sources

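PDF | 619 KB | Updated 2024-08-26

"EnAli is a research paper on entity alignment across multiple heterogeneous data sources, published by Chao Kong, Ming Gao, and colleagues. The paper proposes an unsupervised method for matching entities across large-scale heterogeneous data sources, in support of data cleaning, data integration, information retrieval, and machine learning."

Data is a valuable resource in every field, but it is usually scattered across different data sources. These sources are heterogeneous: their attribute types, structures, and representations all differ. Entity alignment is the key technique for dealing with this. Its goal is to identify the records in different data sources that refer to the same real-world entity, a step that data cleaning, data integration, information retrieval, and machine learning all depend on.

EnAli, short for "Entity Alignment", is a new unsupervised method designed for entity matching across two or more heterogeneous data sources. Traditional entity alignment methods usually need large amounts of manually labeled data, which is slow and expensive at scale. EnAli introduces a generative probabilistic model to handle heterogeneous entity attributes. According to the paper, this model captures the underlying associations between data sources even when their structures and attributes differ. Because it works without a pre-labeled dataset, it reduces the dependence on manual intervention. The model also lets EnAli learn the distribution of entities in each data source and infer which entities are likely to be the same.

EnAli's workflow may include the following steps: first, preprocess each data source and extract the key entity attributes; next, use the generative model to learn the latent relationships among these attributes; then, build similarity measures on these relationships to compare entities from different data sources; finally, find the most likely corresponding entity pairs with clustering or graph algorithms.

The paper's main contributions are:

1. It proposes an unsupervised entity alignment method that reduces the dependence on labeled data.
2. It introduces a generative probabilistic model to handle the matching of differing attributes across heterogeneous data sources.
3. It may offer a more efficient way to align entities across large-scale data sources.

EnAli offers a novel and practical solution to entity alignment across multiple heterogeneous data sources, and it can help improve the efficiency and accuracy of data fusion and analysis.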

Front. Comput. Sci.

different social network platforms [28]. Liu et al. [28] consider username popularity and propose a heuristic rule to automatically determine whether two accounts from different sites with the same username belong to the same person. The rule-labeled user account pairs are then used to train a classifier for user linkage. Many state-of-the-art works also consider social connectivity for account association or user linkage [29, 30]. The topic is also studied in multiple research communities. However, it is not easy to find exactly matched pairs for training classifiers.

2.2 Probabilistic approaches

Fellegi-Sunter's approach is also a rule-based linkage method for two data sources; it solves the record linkage problem with an unsupervised, probabilistic linkage approach [23, 24]. It works well only when the linkage problem is simple and record attributes match exactly one-to-one. Based on the same idea, Sadinle et al. generalize Fellegi-Sunter's model into a probabilistic method for linking multiple data files [25]. However, that approach can only handle exact one-to-one matching of entity attributes. It is unsuitable for matching multiple data sources of poor quality, which exhibit heterogeneous attributes, errors, and incomplete or missing values. That is the focus of this work. Gao et al. also extend Fellegi-Sunter's approach to link users across two different social networks. Their approach can handle heterogeneous user attributes, missing values, and social similarity in user profiles [26]. Unfortunately, it cannot be applied to link multiple (at least three) data sources.
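As a rough illustration of the Fellegi-Sunter idea discussed above, the sketch below fits a two-class (match / non-match) mixture over binary agreement vectors with EM, assuming attributes agree independently given the class. The function names `fit_fellegi_sunter` and `match_weight`, the toy agreement vectors, and the initial values are illustrative assumptions, not taken from [23, 24] or from EnAli itself.

```python
import math

def _clip(x, eps=1e-6):
    """Keep probabilities strictly inside (0, 1) so logs and ratios stay finite."""
    return min(max(x, eps), 1.0 - eps)

def fit_fellegi_sunter(vectors, n_iter=50, p=0.2, m_init=0.9, u_init=0.1):
    """EM for a match / non-match mixture over 0/1 agreement vectors.

    Returns (p, m, u, post): p = P(match), m[j] = P(agree on j | match),
    u[j] = P(agree on j | non-match), post[i] = P(match | vectors[i]).
    """
    n, k = len(vectors), len(vectors[0])
    m, u, post = [m_init] * k, [u_init] * k, []
    for _ in range(n_iter):
        # E-step: posterior match probability of every record pair.
        post = []
        for g in vectors:
            lm, lu = p, 1.0 - p
            for j, agree in enumerate(g):
                lm *= m[j] if agree else 1.0 - m[j]
                lu *= u[j] if agree else 1.0 - u[j]
            post.append(lm / (lm + lu))
        # M-step: re-estimate the mixing weight and per-attribute rates.
        total = max(sum(post), 1e-12)
        rest = max(n - sum(post), 1e-12)
        p = _clip(total / n)
        for j in range(k):
            m[j] = _clip(sum(w * g[j] for w, g in zip(post, vectors)) / total)
            u[j] = _clip(sum((1 - w) * g[j] for w, g in zip(post, vectors)) / rest)
    return p, m, u, post

def match_weight(g, m, u):
    """Log-likelihood ratio of an agreement vector (higher means more match-like)."""
    return sum(
        math.log(m[j] / u[j]) if agree else math.log((1 - m[j]) / (1 - u[j]))
        for j, agree in enumerate(g)
    )

# Hypothetical agreement vectors over three attributes (1 = attribute agrees).
pairs = [[1, 1, 1]] * 4 + [[1, 1, 0]] + [[0, 0, 0]] * 10 + [[0, 1, 0]] * 3 + [[1, 0, 0]] * 2
p_hat, m_hat, u_hat, post = fit_fellegi_sunter(pairs)
```

Binary agreement is exactly the "exact one-to-one matching of record attributes" limitation noted above: graded or heterogeneous similarities do not fit this encoding without further modeling.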

3 Entity Alignment Approach

In this section, we first give a formal definition of the entity alignment problem and then present an overview of our proposed approach.

3.1 The Problem Definition

We assume that there are $N$ data sources ($N > 1$). Let $E_i$, $1 \le i \le N$, be the set of entities from the $i$-th source, and let $\alpha(e_i)$ represent the observed feature vector of entity $e_i \in E_i$. Let $\alpha(E_i)$ be the set of attribute feature vectors of entities from source $E_i$. The set of all candidate tuples $T$ can be represented as $\prod_{i=1}^{N} \alpha(E_i)$. The entity matching problem is to determine the matched tuples $M$ and unmatched tuples $U$ in $T$, i.e.,

$$M = \{(\alpha(e_1), \ldots, \alpha(e_N)) \mid e_1 = \cdots = e_N,\ e_i \in E_i\}$$

$$U = \{(\alpha(e_1), \ldots, \alpha(e_N)) \mid \exists u, v \ \text{s.t.}\ e_u \neq e_v\}$$

Here $(\alpha(e_1), \alpha(e_2), \ldots, \alpha(e_N)) \in M$ indicates that the entities $e_i$, $1 \le i \le N$, are the same, while $(\alpha(e_1), \alpha(e_2), \ldots, \alpha(e_N)) \in U$ indicates that at least one entity $e_i$ differs from the others. Since no data source contains duplicate entities, $M$ contains at most $\min(|E_1|, |E_2|, \cdots, |E_N|)$ matched entity tuples. Ideally we would consider $T = M \cup U$ as the candidate tuples, but in practice $T$ is usually an extremely large set whose size can be $\prod_{i=1}^{N} |E_i|$. To improve the scalability of our proposed solution, we therefore use a blocking technique to reduce the number of candidate tuples.
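To make these set sizes concrete, the following sketch enumerates $T$ for three tiny sources and splits it into $M$ and $U$. The records and the ground-truth entity ids are hypothetical and exist only so the split can be checked; in the unsupervised setting such ids are unknown.

```python
from itertools import product
from math import prod

# Hypothetical toy sources: each record is (true_entity_id, feature_vector).
sources = [
    [("a", ("Ann Lee", 31)), ("b", ("Bob Ray", 45)), ("c", ("Cy Doe", 28))],
    [("a", ("A. Lee", 31)), ("b", ("Robert Ray", 45))],
    [("b", ("Bob Ray", 44)), ("a", ("Ann Lee", 30)),
     ("d", ("Di Fox", 52)), ("c", ("Cy Doe", 28))],
]

# T is the Cartesian product of the sources: one record from each source.
T = list(product(*sources))

# A tuple is matched when all of its records share one true entity id.
M = [t for t in T if len({rec[0] for rec in t}) == 1]
U = [t for t in T if len({rec[0] for rec in t}) > 1]

size_T = prod(len(s) for s in sources)   # product of the |E_i|
bound_M = min(len(s) for s in sources)   # upper bound on |M|
print(len(T), len(M), len(U), size_T, bound_M)
```

Even at this scale almost every candidate tuple is unmatched, and $|T|$ grows multiplicatively with each added source, which is why blocking is needed.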

3.2 Overview of EnAli

Our proposed entity alignment approach, EnAli, consists of four components, as follows.

Step 1. Candidate tuple generation: By default, the number of candidate tuples is $\prod_{i=1}^{N} |E_i|$ when we identify the identical entities from $N$ data sources. Candidate tuple generation is therefore computationally expensive. To speed up processing, we employ Locality Sensitive Hashing (LSH) to block entities. Two entities never appear together in a candidate tuple unless they fall into the same LSH bucket.

Step 2. Similarity computation: An entity may have multiple attributes, such as individual demographic attributes (name, address, age, etc.), temporal behaviors, and spatial behaviors. These attributes can be modeled as different data types, such as string, set, or distribution. We employ a set of similarity functions, $\{s_j\}_{j=1}^{m}$, to evaluate the similarities, where $s_j$ evaluates the similarity of the $j$-th attribute of entities. For a tuple $t_k = (\alpha(e_1), \alpha(e_2), \cdots, \alpha(e_N))$, there are $\binom{N}{2}$ $m$-dimensional similarity vectors between different entity pairs. We take the minimum value of each entry over the $\binom{N}{2}$ similarity vectors to model the similarity of tuple $t_k$, denoted as an $m$-dimensional vector $\gamma_k$.

Step 3. Parameter learning: Given the similarity vector $\gamma_k$ of tuple $t_k$, EnAli constructs a generative probabilistic model based on the exponential family, which incorporates heterogeneous attribute similarities. EnAli employs the EM algorithm to infer the parameters of the generative model (more details in Section 4).
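Steps 1 and 2 above can be sketched roughly as follows. The excerpt names LSH but not a particular hash family, so this sketch assumes MinHash with banding over character 3-grams of a name attribute. The similarity functions `jaccard` and `numeric_sim`, the toy records, and all other names are likewise illustrative assumptions rather than EnAli's actual choices.

```python
import hashlib
from itertools import combinations, product

def shingles(text, k=3):
    """Character k-grams of a lower-cased string."""
    s = text.lower()
    return {s[i:i + k] for i in range(max(len(s) - k + 1, 1))}

def minhash(tokens, num_perm=8):
    """MinHash signature; salted MD5 stands in for random permutations."""
    return [
        min(int(hashlib.md5(f"{seed}:{tok}".encode()).hexdigest(), 16) for tok in tokens)
        for seed in range(num_perm)
    ]

def candidate_tuples(sources, bands=4, rows=2):
    """Step 1: bucket records by LSH band, then form tuples inside each bucket."""
    buckets = {}
    for i, source in enumerate(sources):
        for rec in source:
            sig = minhash(shingles(rec[0]), num_perm=bands * rows)
            for b in range(bands):
                key = (b, tuple(sig[b * rows:(b + 1) * rows]))
                buckets.setdefault(key, [set() for _ in sources])[i].add(rec)
    tuples = set()
    for per_source in buckets.values():
        if all(per_source):  # every source must contribute a record to the bucket
            tuples.update(product(*per_source))
    return tuples

def jaccard(a, b):
    """String similarity: Jaccard overlap of character 3-grams."""
    sa, sb = shingles(a), shingles(b)
    return len(sa & sb) / len(sa | sb)

def numeric_sim(a, b):
    """Numeric similarity in (0, 1]; equals 1 when the values are equal."""
    return 1.0 / (1.0 + abs(a - b))

SIM_FUNCS = [jaccard, numeric_sim]  # one similarity function per attribute

def tuple_similarity(t):
    """Step 2: element-wise minimum over all pairwise similarity vectors."""
    pair_vectors = [
        [s(x[j], y[j]) for j, s in enumerate(SIM_FUNCS)]
        for x, y in combinations(t, 2)
    ]
    return [min(column) for column in zip(*pair_vectors)]

# Hypothetical records of the form (name, age) from three sources.
sources = [
    [("Ann Lee", 31), ("Bob Ray", 45)],
    [("Ann Lee", 31), ("Cy Doe", 28)],
    [("Ann Lee", 30), ("Bob Ray", 44)],
]
cands = candidate_tuples(sources)
gamma = {t: tuple_similarity(t) for t in cands}
```

Taking the minimum makes a tuple only as similar as its least similar pair, so a single mismatched entity pulls the whole tuple's vector down.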
