computed by using genetic programming for the record linkage problem. Their experiments showed that
neither approach performed well on real data.
Some other studies exploit more complex constraints that include relationships between different entity
types to link all types of entities in coordination (Bhattacharya and Getoor, 2007; Dong et al., 2005; On
et al., 2007). Using such constraints can indeed yield better linkage results, but they are in many
cases domain-dependent. In contrast, we aim to develop an approach that is applicable across domains.
To address the problem of limited labeled data, we mainly consider semi-supervised approaches. Few
semi-supervised approaches have been designed specifically for the record linkage problem. Some
studies on improving self-training algorithms are related to our work. Self-training with editing (Li and
Zhou, 2005) helps reduce mislabeled pseudo training examples, and reserved self-training (Guan
and Yang, 2013) is designed to handle imbalanced data. Our focus is very different from theirs, namely
incorporating instance correlations into the learning algorithm, which can be applied to other self-training
variants.
3 Problem Definition
In this section, we first introduce the preliminaries related to our task. Then we formally define the task
we study.
Product record. A product record r is characterized by a referred product entity e and a set of attribute
values V = {v_i}_i, where v_i denotes the value of the i-th attribute in r. We use r.e and r.V to denote
the product entity and the attribute value set of the record r, respectively. A product record corresponds to a
unique product entity, but a product entity can map to multiple product records across multiple databases.
Attribute values are represented as strings, i.e. sequences of characters. An attribute of a product might
correspond to different descriptive text across websites.
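The notation above can be illustrated with a minimal Python sketch; the class and field names are our own illustrative choices, not part of the task definition.

```python
from dataclasses import dataclass

# Illustrative representation of a product record r: entity_id plays the
# role of r.e (the referred product entity) and values plays the role of
# r.V (the attribute values, represented as strings).
@dataclass(frozen=True)
class ProductRecord:
    entity_id: str        # r.e, the referred product entity
    values: tuple         # r.V, attribute values as strings

# Two records from different databases may refer to the same entity
# while using different descriptive text for the same attributes.
r1 = ProductRecord("iphone-5s", ("Apple iPhone 5s", "16 GB", "Silver"))
r2 = ProductRecord("iphone-5s", ("iPhone 5S 16GB", "16GB", "silver"))
assert r1.entity_id == r2.entity_id   # same entity
assert r1.values != r2.values         # different descriptive text
```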
Product record linkage. The task of product record linkage is to judge whether two product records refer
to the same product entity. Given two product records r and r′, we aim to judge whether r.e is the same
as r′.e. Usually, r and r′ come from different product databases. Although different product databases
can have different attributes for the same product and different attribute names for the same attribute, we
make an assumption about the task: candidate record pairs share the same set of attributes. It is relatively
easy to automatically identify and align common attributes (Härder et al., 1999; Rundensteiner,
1999; Hassanzadeh et al., 2013), which is not our focus in this paper. We mainly study product record
linkage under the same set of attributes, and this assumption makes our study more focused. If r and r′
refer to the same product entity, we denote it by r ∼ r′; otherwise, we denote it by r ≁ r′.
4 A General Machine Learning based Approach
As mentioned above, given a product type, we assume that it corresponds to a specific set of attributes
and that all product records of that type share this set of attributes, possibly with different descriptive
text for the attribute values. In this section, we further present a general supervised approach based on
similarity features.
4.1 Defining the similarity function
Given two product records r and r′, we can obtain the similarity between their descriptive text for an
attribute by using a similarity function. The major intuition is that if two records refer to the same product,
they should have similar text for the same attribute, i.e. the similarity function should return a large sim-
ilarity value. Let f(·, ·) denote a similarity function, which takes two text strings and returns a similarity
value within the interval [0, 1] for these two strings. As revealed in (Bilenko and Mooney, 2003), differ-
ent attributes or fields may need different similarity functions to achieve the best similarity evaluation. Thus,
instead of fixing a single similarity function, we consider using the following widely used similarity
functions: 1) Exact match; 2) Cosine similarity; 3) Jaccard coefficient; 4) K-Gram similarity (Kondrak,
2005); 5) Levenshtein similarity (Levenshtein, 1966); 6) Affine Gap similarity (Needleman and Wunsch,
1970).