Spectral Label Refinement for Noisy and Missing Text Labels
Yangqiu Song (a), Chenguang Wang (b), Ming Zhang (b), Hailong Sun (c), Qiang Yang (d)
(a) University of Illinois at Urbana-Champaign, yqsong@illinois.edu
(b) Peking University, {wangchenguang,mzhang cs}@pku.edu.cn
(c) Beihang University, sunhl@act.buaa.edu.cn
(d) Hong Kong University of Science and Technology, qyang@cse.ust.hk
Abstract
With the recent growth of online content on the Web, there is an increasing amount of user-generated data with noisy and missing labels, e.g., social tags and voted labels from Amazon's Mechanical Turk. Most machine learning methods require accurate label sets and cannot be trusted when the labels are unreliable. In this paper, we present a text label refinement algorithm that adjusts the labels of such noisy and incompletely labeled datasets. We assume that the label sets can be refined based on the labels with certain confidence, and that the similarity between data samples should be consistent with the labels. We propose a label smoothness ratio criterion to measure the smoothness of the labels and the consistency between labels and data. We demonstrate the effectiveness of the label refinement algorithm on eight labeled document datasets, and validate that the results are useful for generating better labels.
Introduction
With the recent growth of online content generation, many datasets come with noisy and missing labels. Supervised machine learning methods, such as classification and ranking, have demonstrated their effectiveness in broad applications such as recommendation systems and natural language processing tasks. On one hand, the more accurately labeled data a supervised learning method receives, the more its performance improves. On the other hand, noisy and missing labels can considerably hurt performance, and different learning algorithms degrade differently; e.g., naive Bayes outperforms support vector machines with sequential minimal optimization (SMO) when trained on noisy labels (Nettleton, Orriols-Puig, and Fornells 2010). In the real world, however, the situation can be even worse: labeled data on the Web can be extremely noisy and incomplete.
For example, online crowdsourcing systems such as Amazon's Mechanical Turk (https://www.mturk.com/mturk/welcome) and Rent-A-Coder (https://www.freelancer.com/) can facilitate labeling tasks by matching "labelers" with well-defined "tasks." However, since the labelers may lack expertise, dedication, and interest, the resulting labels are often noisy and can affect the decisions of learners (Raykar et al. 2010). Even with further processing of the labels annotated by non-expert labelers, such as voting, the resulting labels can still be noisy (Sheng, Provost, and Ipeirotis 2008).

Copyright (c) 2015, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

Moreover, in social networks, such as Facebook and
Twitter, users are often allowed to provide tags or profile information to gain attention from others sharing similar interests. However, not all users want to publicly annotate their private profile information. In addition, the provided labels can be very noisy (Law, Settles, and Mitchell 2010), since different users have different habits and preferences. For example, the labels "movie" and "film" mean the same thing, but may appear in different users' tags. As another example, a user may be an expert on artificial intelligence and tag herself with that term, yet publish only movie-related content; in this case, the tag does not accurately characterize the content she publishes. Thus, noisy and missing labels are common in social networks. Furthermore, traditional natural language processing (NLP) tasks can also benefit from noisy data labeled by non-experts, provided there are mechanisms to reduce the label noise (Pal, Mann, and Minerich 2007; Snow et al. 2008). However, in some more difficult tasks, such as event extraction, the mutual agreement of human labels is only around 40-50% (Ji and Grishman 2008). In such cases, non-expert annotations could be much worse. All of the above examples indicate that more effective algorithms should be developed to deal with the noisy and missing label problem.
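The refinement approach developed below rests on a standard intuition from graph-based learning: labels should vary smoothly over a content-similarity graph, so label assignments that disagree with many similar neighbors are suspect. The following is a minimal, hypothetical sketch of such a graph-Laplacian smoothness score on a toy dataset (the kNN construction, function names, and data are our own illustration, not the paper's exact criterion):

```python
import numpy as np

def knn_similarity(X, k=3):
    """Symmetric kNN graph from cosine similarity over row-vector documents."""
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    S = Xn @ Xn.T
    np.fill_diagonal(S, 0.0)          # no self-loops
    W = np.zeros_like(S)
    for i in range(len(S)):
        nn = np.argsort(S[i])[-k:]    # indices of the k most similar documents
        W[i, nn] = S[i, nn]
    return np.maximum(W, W.T)         # symmetrize: keep an edge if either end chose it

def smoothness(W, y):
    """Graph Laplacian quadratic form y^T (D - W) y.

    Equals 0.5 * sum_ij W_ij * (y_i - y_j)^2: small when neighboring
    documents carry the same label, large when labels conflict with content.
    """
    L = np.diag(W.sum(axis=1)) - W
    return float(y @ L @ y)

# Toy corpus: docs 0-1 are about one topic, docs 2-3 about another.
X = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
W = knn_similarity(X, k=1)
y_consistent = np.array([1.0, 1.0, -1.0, -1.0])  # labels agree with content
y_noisy      = np.array([1.0, -1.0, -1.0, 1.0])  # two labels flipped
print(smoothness(W, y_consistent), smoothness(W, y_noisy))
```

On this toy graph the consistent labeling scores (near) zero while the flipped labeling is penalized on both edges, which is the signal a refinement procedure can exploit to detect and correct noisy labels.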
In this paper, we address the noisy and missing label problem with a label refinement mechanism. Instead of proposing a supervised learning algorithm that can handle the noise, we propose an algorithm that modifies the labels themselves. The refined labels can then be used for other machine learning and data mining tasks. Under the assumptions that the data samples are static and i.i.d., and that the labels of data samples are consistent with those of their nearest neighbors, we propose a label smoothness ratio criterion to refine the noisy and missing labels. Our approach considers both the content of the data (by constraining the refined labels to be smooth on the content graph) and the initial labels (by constraining the refined labels to be smooth on the graph constructed from the initial labels). We relax the estimated labels to real values and use spectral analysis to solve the problem. The final solution is given by a generalized eigenvalue decompo-