Spectral Label Refinement for Noisy and Missing Text Labels
Yangqiu Song (a), Chenguang Wang (b), Ming Zhang (b), Hailong Sun (c), Qiang Yang (d)
(a) University of Illinois at Urbana-Champaign, yqsong@illinois.edu
(b) Peking University, {wangchenguang,mzhang cs}@pku.edu.cn
(c) Beihang University, sunhl@act.buaa.edu.cn
(d) Hong Kong University of Science and Technology, qyang@cse.ust.hk
Abstract
With the recent growth of online content on the Web, there is an increasing amount of user-generated data with noisy and missing labels, e.g., social tags and voted labels from Amazon's Mechanical Turk. Most machine learning methods require accurate label sets and cannot be trusted when the labels are unreliable. In this paper, we present a text label refinement algorithm that adjusts the labels of such noisy and incompletely labeled datasets. We assume that the label sets can be refined based on the labels with certain confidence, and that the similarity between data samples should be consistent with the labels. We propose a label smoothness ratio criterion to measure the smoothness of the labels and the consistency between labels and data. We demonstrate the effectiveness of the label refinement algorithm on eight labeled document datasets, and validate that the results are useful for generating better labels.
Introduction
With the recent growth of online content generation, many datasets come with noisy and missing labels. Supervised machine learning methods, such as classification and ranking, have demonstrated their effectiveness in broad applications such as recommendation systems and natural language processing tasks. On one hand, the more accurately labeled data a supervised learning method receives, the more its performance improves. On the other hand, noisy and missing labels can considerably hurt performance, and different learning algorithms degrade differently; e.g., naive Bayes outperforms support vector machines with sequential minimal optimization (SMO) when trained on noisy labels (Nettleton, Orriols-Puig, and Fornells 2010). In the real world, however, the situation can be even worse: labeled data on the Web can be extremely noisy and incomplete.
For example, online crowdsourcing systems such as Amazon's Mechanical Turk (https://www.mturk.com/mturk/welcome) and Rent-A-Coder (https://www.freelancer.com/) can facilitate labeling tasks by matching "labelers" with well-defined "tasks." However, since the labelers may lack expertise, dedication, and interest, the resulting labels are often noisy and can affect the decisions of learners (Raykar et al. 2010). Even with further processing of the labels annotated by non-expert labelers, such as voting, the resulting labels can still be noisy (Sheng, Provost, and Ipeirotis 2008).

Copyright (c) 2015, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

Moreover, in social networks, such as Facebook and
Twitter, users are often allowed to provide tags or profile information to gain attention from others sharing similar interests. However, not all users want to publicly annotate their private profile information. In addition, the provided labels can be very noisy (Law, Settles, and Mitchell 2010), since different users have different habits and preferences. For example, the labels "movie" and "film" mean the same thing, but may appear in different users' tags. As another example, a user may be an expert on artificial intelligence and tag herself with that term, yet publish only movie-related content; in this case, the tag does not accurately characterize the content she publishes. Thus, noisy and missing labels are common in social networks. Furthermore, traditional natural language processing (NLP) tasks can also benefit from noisy data labeled by non-experts, provided there are mechanisms to reduce the label noise (Pal, Mann, and Minerich 2007; Snow et al. 2008). However, in some more difficult tasks, such as event extraction, the mutual agreement of human labels is only around 40-50% (Ji and Grishman 2008). In such cases, non-expert annotations could be much worse. All of the above examples indicate that more effective algorithms should be developed to deal with the noisy and missing label problem.
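The refinement approach developed below rests on a standard intuition from graph-based learning: labels should vary smoothly over a content-similarity graph, so label assignments that disagree with many similar neighbors are suspect. The following is a minimal, hypothetical sketch of such a graph-Laplacian smoothness score on a toy dataset (the kNN construction, function names, and data are our own illustration, not the paper's exact criterion):

```python
import numpy as np

def knn_similarity(X, k=3):
    """Symmetric kNN graph from cosine similarity over row-vector documents."""
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    S = Xn @ Xn.T
    np.fill_diagonal(S, 0.0)          # no self-loops
    W = np.zeros_like(S)
    for i in range(len(S)):
        nn = np.argsort(S[i])[-k:]    # indices of the k most similar documents
        W[i, nn] = S[i, nn]
    return np.maximum(W, W.T)         # symmetrize: keep an edge if either end chose it

def smoothness(W, y):
    """Graph Laplacian quadratic form y^T (D - W) y.

    Equals 0.5 * sum_ij W_ij * (y_i - y_j)^2: small when neighboring
    documents carry the same label, large when labels conflict with content.
    """
    L = np.diag(W.sum(axis=1)) - W
    return float(y @ L @ y)

# Toy corpus: docs 0-1 are about one topic, docs 2-3 about another.
X = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
W = knn_similarity(X, k=1)
y_consistent = np.array([1.0, 1.0, -1.0, -1.0])  # labels agree with content
y_noisy      = np.array([1.0, -1.0, -1.0, 1.0])  # two labels flipped
print(smoothness(W, y_consistent), smoothness(W, y_noisy))
```

On this toy graph the consistent labeling scores (near) zero while the flipped labeling is penalized on both edges, which is the signal a refinement procedure can exploit to detect and correct noisy labels.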
In this paper, we address the noisy and missing label problem with a label refinement mechanism. Instead of proposing a supervised learning algorithm that can handle the noise, we propose an algorithm that modifies the labels themselves. The refined labels can then be used for other machine learning and data mining tasks. Under the assumptions that the data samples are static and i.i.d., and that the labels of data samples are consistent with those of their nearest neighbors, we propose a label smoothness ratio criterion to refine the noisy and missing labels. Our approach considers both the content of the data (by constraining the refined labels to be smooth on the content graph) and the initial labels (by constraining the refined labels to be smooth on the graph constructed from the initial labels). We relax the estimated labels to real values and use spectral analysis to solve the problem. The final solution is given by a generalized eigenvalue decompo-