computed by using genetic programming for the record linkage problem. Their experiments showed that
neither approach performed well on real data.
Some other studies exploit more complex constraints that include relationships between different entity
types to link all types of entities in coordination (Bhattacharya and Getoor, 2007; Dong et al., 2005; On
et al., 2007). Using such constraints can indeed yield better linkage results, but they are in many
cases domain-dependent. In contrast, we aim to develop an approach that is applicable across domains.
To address the problem of limited labeled data, we mainly consider semi-supervised approaches. Few
semi-supervised approaches have been designed specifically for the record linkage problem. Some
studies on improving self-training algorithms are related to our work. Self-training with editing (Li and
Zhou, 2005) helps reduce mislabeled pseudo training examples, and reserved self-training (Guan
and Yang, 2013) is designed to handle imbalanced data. Our focus is very different from theirs, namely
incorporating instance correlations into the learning algorithm, which can be applied to other self-training
variants.
3 Problem Definition
In this section, we first introduce the preliminaries related to our task. Then we formally define the task
we study.
Product record. A product record r is characterized by a referred product entity e and a set of attribute
values V = {v_i}_i, where v_i denotes the value of the i-th attribute in r. We use r.e and r.V to denote
the product entity and the attribute value set of the record r, respectively. A product record corresponds to a
unique product entity, but a product entity can map to multiple product records across multiple databases.
Attribute values are represented as strings, i.e. sequences of characters. An attribute of a product might
correspond to different descriptive text across websites.
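The notation above can be illustrated with a minimal Python sketch; the class and field names are our own illustrative choices, not part of the task definition.

```python
from dataclasses import dataclass

# Illustrative representation of a product record r: entity_id plays the
# role of r.e (the referred product entity) and values plays the role of
# r.V (the attribute values, represented as strings).
@dataclass(frozen=True)
class ProductRecord:
    entity_id: str        # r.e, the referred product entity
    values: tuple         # r.V, attribute values as strings

# Two records from different databases may refer to the same entity
# while using different descriptive text for the same attributes.
r1 = ProductRecord("iphone-5s", ("Apple iPhone 5s", "16 GB", "Silver"))
r2 = ProductRecord("iphone-5s", ("iPhone 5S 16GB", "16GB", "silver"))
assert r1.entity_id == r2.entity_id   # same entity
assert r1.values != r2.values         # different descriptive text
```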
Product record linkage. The task of product record linkage is to judge whether two product records refer
to the same product entity. Given two product records r and r′, we aim to judge whether r.e is the same
as r′.e. Usually, r and r′ come from different product databases. Although different product databases
can have different attributes for the same product and different attribute names for the same attribute, we
make an assumption about the task: candidate record pairs share the same set of attributes. It is relatively
easy to automatically identify and align common attributes (Härder et al., 1999; Rundensteiner,
1999; Hassanzadeh et al., 2013), which is not our focus in this paper. We mainly study product record
linkage under the same set of attributes, and this assumption makes our study more focused. If r and r′
refer to the same product entity, we denote it by r ∼ r′; otherwise, we denote it by r ≁ r′.
4 A General Machine Learning based Approach
As mentioned above, given a product type, we assume that it corresponds to a specific set of attributes
and that all product records of that type share this set of attributes, possibly with different descriptive
text for the attribute values. In this section, we further present a general supervised approach based on
similarity features.
4.1 Defining the similarity function
Given two product records r and r′, we can obtain the similarity between their descriptive text for an
attribute by using a similarity function. The major intuition is that if two records refer to the same product,
they should have similar text for the same attribute, i.e. the similarity function should return a large sim-
ilarity value. Let f(·, ·) denote a similarity function, which takes two text strings and returns a similarity
value within the interval [0, 1] for these two strings. As revealed in (Bilenko and Mooney, 2003), differ-
ent attributes or fields may need different similarity functions to achieve the best similarity evaluation. Thus,
instead of fixing a single similarity function, we consider using the following widely used similarity
functions: 1) Exact match; 2) Cosine similarity; 3) Jaccard coefficient; 4) K-Gram similarity (Kondrak,
2005); 5) Levenshtein similarity (Levenshtein, 1966); 6) Affine Gap similarity (Needleman and Wunsch,
1970).