Deep Sequence-to-Sequence Entity Matching for
Heterogeneous Entity Resolution
Hao Nie (1,3), Xianpei Han (1,2), Ben He (3,1)*, Le Sun (1,2), Bo Chen (1), Wei Zhang (4), Suhui Wu (4), Hao Kong (4)
(1) Chinese Information Processing Laboratory, Institute of Software, Chinese Academy of Sciences, Beijing, China
(2) State Key Laboratory of Computer Science, Institute of Software, Chinese Academy of Sciences, Beijing, China
(3) University of Chinese Academy of Sciences, Beijing, China
(4) Alibaba Group, Hangzhou, China
(1) {niehao2016, xianpei, sunle, chenbo}@iscas.ac.cn
(3) benhe@ucas.ac.cn
(4) {lantu.zw, linnai.wsh, konghao.kh}@alibaba-inc.com
ABSTRACT
Entity Resolution (ER) identifies records from different data sources
that refer to the same real-world entity. Conventional ER approaches
usually employ a structure matching mechanism, where attributes
are aligned, compared and aggregated for the ER decision. The structure
matching approaches, unfortunately, often suffer from heterogeneous
and dirty ER problems. That is, entities from different
data sources are described using different schemas, and attribute
values may be misplaced, missing, or noisy. In this paper, we propose
a deep sequence-to-sequence entity matching model, denoted
Seq2SeqMatcher, which can effectively solve the heterogeneous and
dirty problems by modeling ER as a token-level sequence-to-sequence
matching task. Specifically, we propose an align-compare-aggregate
neural network for Seq2Seq entity matching, which can learn the
representations of tokens, capture the semantic relevance between
tokens, and aggregate matching evidence for accurate ER decisions
in an end-to-end manner. Experimental results show that, by comparing
entity records at the token level and learning all components
in an end-to-end manner, our Seq2Seq entity matching model can
achieve remarkable performance improvements on 9 standard entity
resolution benchmarks.
CCS CONCEPTS
• Information systems → Entity resolution; Deduplication; •
Computing methodologies → Ontology engineering.
KEYWORDS
entity resolution; attribute heterogeneity; matching; deep learning
ACM Reference Format:
Hao Nie, Xianpei Han, Ben He, Le Sun, Bo Chen, Wei Zhang, Suhui Wu,
Hao Kong. 2019. Deep Sequence-to-Sequence Entity Matching for Heterogeneous
Entity Resolution. In The 28th ACM International Conference on
Information and Knowledge Management (CIKM ’19), November 3–7, 2019,
Beijing, China. ACM, New York, NY, USA, 10 pages.
https://doi.org/10.1145/3357384.3358018
*Corresponding authors.
Permission to make digital or hard copies of all or part of this work for personal or
classroom use is granted without fee provided that copies are not made or distributed
for profit or commercial advantage and that copies bear this notice and the full citation
on the first page. Copyrights for components of this work owned by others than
ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or
republish, to post on servers or to redistribute to lists, requires prior specific permission
and/or a fee. Request permissions from permissions@acm.org.
CIKM ’19, November 3–7, 2019, Beijing, China
© 2019 Association for Computing Machinery.
ACM ISBN 978-1-4503-6976-3/19/11…$15.00
https://doi.org/10.1145/3357384.3358018
[Figure 1 content: record e_1 (Name: "Apple iphone 8 plus"; Brand: no value shown; Address: "CA") and record e_2 (Name: "iphone 8p"; Manufacturer: "Apple"; Location: "California"; Price: 699.9) are flattened into the token sequences [Name, Apple] [Name, iphone] [Name, 8] [Name, plus] [Address, CA] and [Name, iphone] [Name, 8p] [Manufacturer, Apple] [Location, California] [Price, 699.9]; sequence-to-sequence matching then decides match or non-match.]
Figure 1: An entity resolution example under the Seq2Seq
entity matching framework.
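The flattening illustrated in Figure 1, where each record becomes a sequence of [attribute, token] pairs, can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' preprocessing: the function name `flatten_record`, the whitespace tokenization, and the empty Brand value of the first record are assumptions made here.

```python
from typing import Dict, List, Tuple

Record = Dict[str, str]
Token = Tuple[str, str]  # (attribute name, token)


def flatten_record(record: Record) -> List[Token]:
    """Turn a record into a sequence of (attribute, token) pairs.

    Assumption: values are split on whitespace, and an empty value
    contributes no tokens.
    """
    sequence: List[Token] = []
    for attribute, value in record.items():
        for token in value.split():
            sequence.append((attribute, token))
    return sequence


# The two records of Figure 1 (Brand is assumed to be empty).
e1 = {"Name": "Apple iphone 8 plus", "Brand": "", "Address": "CA"}
e2 = {
    "Name": "iphone 8p",
    "Manufacturer": "Apple",
    "Location": "California",
    "Price": "699.9",
}

seq1 = flatten_record(e1)
seq2 = flatten_record(e2)
```

Under these assumptions, `seq1` and `seq2` reproduce the two token sequences shown in Figure 1; the matching step itself is not covered by this sketch.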
1 INTRODUCTION
Entity resolution (ER) aims to identify records referring to the same
real-world entity. For example, in Figure 1, the entity records e_1 and e_2,
from Walmart and Amazon respectively, would be resolved
as the same entity because they refer to the same real-world product.
ER is important in knowledge integration [8–10], data cleaning
[6] and information management [16], and has therefore received
significant research attention in recent years [13, 15, 17, 27, 35].
In entity resolution, each record is a structured object composed
of one or more <attribute, value> pairs. Conventional approaches
usually model entity resolution as a structure-to-structure matching
task. Concretely, attributes are first aligned, either manually
or automatically; then similarities between corresponding attribute
values are computed; finally, the similarity scores of the aligned
attributes are aggregated into the final similarity between records.
For example, to resolve e_1 and e_2 in Figure 1, structure matching
systems first identify Name and Address (Location) as
aligned attributes, then compute the total similarity by aggregating
the similarities of the corresponding attribute values
(“Apple iphone 8 plus”, “iphone 8p”) under Name and (“CA”, “California”)
under Address (Location).
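The three steps of structure matching (align, compare, aggregate) can be made concrete with a small Python sketch applied to the two records of Figure 1. The similarity measure (token-set Jaccard), the aggregation (an unweighted mean), the hand-written alignment, and the names `jaccard` and `structure_match` are all assumptions chosen for illustration; the paper does not prescribe them.

```python
from typing import Dict, List, Tuple

Record = Dict[str, str]


def jaccard(a: str, b: str) -> float:
    """Token-set Jaccard similarity between two attribute values."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    if not tokens_a and not tokens_b:
        return 0.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def structure_match(r1: Record, r2: Record,
                    alignment: List[Tuple[str, str]]) -> float:
    """Compare each aligned attribute pair, then aggregate by averaging."""
    scores = [jaccard(r1.get(a1, ""), r2.get(a2, "")) for a1, a2 in alignment]
    return sum(scores) / len(scores) if scores else 0.0


# Records of Figure 1 (Brand is assumed to be empty).
e1 = {"Name": "Apple iphone 8 plus", "Brand": "", "Address": "CA"}
e2 = {
    "Name": "iphone 8p",
    "Manufacturer": "Apple",
    "Location": "California",
    "Price": "699.9",
}

# Step 1 (align): attribute pairs assumed to be given manually.
alignment = [("Name", "Name"), ("Address", "Location")]

# Step 2 (compare): similarity of each aligned pair of values.
name_score = jaccard(e1["Name"], e2["Name"])
address_score = jaccard(e1["Address"], e2["Location"])

# Step 3 (aggregate): overall record similarity.
total = structure_match(e1, e2, alignment)
```

With these particular choices the Name values share only the token "iphone", and "CA" shares no token with "California", so the aggregated score stays low even though the two records describe the same product. A different similarity function or alignment would give different numbers.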
The structure matching approaches, unfortunately, often face
problems when entity records are heterogeneous or dirty. Firstly,