Exploring Outliers in Crowdsourced Ranking for QoE
Qianqian Xu^1, Ming Yan^2, Chendi Huang^3, Jiechao Xiong^{3,4}, Qingming Huang^{5,6,7}, Yuan Yao^{8∗}
^1 State Key Laboratory of Information Security (SKLOIS), Institute of Information Engineering, CAS, Beijing, 100093, China
^2 Department of Computational Mathematics, Science and Engineering and Department of Mathematics, Michigan State University, East Lansing, MI, 48824, USA
^3 BICMR-LMAM-LMEQF-LMP, School of Mathematical Sciences, Peking University, Beijing, 100871, China
^4 Tencent AI Lab, Shenzhen, 518057, China
^5 University of Chinese Academy of Sciences, Beijing, 100049, China
^6 Key Lab of Intell. Info. Process., Inst. of Comput. Tech., CAS, Beijing, 100190, China
^7 Key Lab of Big Data Mining and Knowledge Management, CAS, Beijing, 100190, China
^8 Department of Mathematics, Hong Kong University of Science and Technology, Hong Kong
xuqianqian@iie.ac.cn, yanm@math.msu.edu, cdhuang@pku.edu.cn, jcxiong@tencent.com, qmhuang@ucas.ac.cn, yuany@ust.hk
ABSTRACT
Outlier detection is a crucial part of robust evaluation for crowdsourceable assessment of Quality of Experience (QoE) and has attracted much attention in recent years. In this paper, we propose simple and fast algorithms for outlier detection and robust QoE evaluation based on the principle of nonconvex optimization. Several iterative procedures are designed that operate with or without knowledge of the number of outliers in the samples. Theoretical analysis shows that such procedures can reach statistically good estimates under mild conditions. Finally, experimental results on simulated and real-world crowdsourcing datasets show that the proposed algorithms produce performance comparable to the Huber-LASSO approach in robust ranking, yet with nearly an 8-fold or 90-fold speed-up, without or with prior knowledge of the sparsity size of outliers, respectively. The proposed methodology therefore provides a set of helpful tools for robust QoE evaluation with crowdsourcing data.
CCS CONCEPTS
• Information systems → Data cleaning; Rank aggregation;
KEYWORDS
HodgeRank; Outlier Detection; $\ell_0$-regularization; Iterative Hard Thresholding; Iterative Least Trimmed Squares; Adaptive Algorithms
∗ Corresponding author.
Permission to make digital or hard copies of all or part of this work
for personal or classroom use is granted without fee provided that
copies are not made or distributed for profit or commercial advantage
and that copies bear this notice and the full citation on the first page.
Copyrights for components of this work owned by others than ACM
must be honored. Abstracting with credit is permitted. To copy
otherwise, or republish, to post on servers or to redistribute to lists,
requires prior specific permission and/or a fee. Request permissions
from permissions@acm.org.
MM’17, October 23–27, 2017, Mountain View, CA, USA.
© 2017 ACM. ISBN 978-1-4503-4906-2/17/10...$15.00
DOI: https://doi.org/10.1145/3123266.3123267
1 INTRODUCTION
In recent years, Quality of Experience (QoE) [20, 23] has become a major research theme within the multimedia community. QoE measures a user's subjective expectation, feeling, perception, and satisfaction with respect to multimedia content. Measuring and ensuring good QoE of multimedia content is highly subjective in nature.
A variety of approaches can be employed to conduct subjective tests, among which Mean Opinion Score (MOS) [1] and paired comparison are the two most popular. In the MOS test, individuals are asked to grade the quality of a stimulus by specifying a rating from "Bad" to "Excellent" (e.g., Bad-1, Poor-2, Fair-3, Good-4, and Excellent-5); in the paired comparison approach, raters are instead asked to make intuitive comparative judgments rather than mapping their perception onto a categorical or numerical scale. Between these approaches there is a tradeoff in the amount of information a preference label contains and the bias associated with obtaining it. For example, while a graded relevance judgment on a five-point scale may contain more information than a binary judgment, raters may also make more errors due to the complexity of assigning finer-grained judgments.
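To make the two protocols concrete, the following minimal Python sketch contrasts how the two kinds of labels are typically encoded and naively aggregated; the toy data and variable names are our own illustration, not part of this paper's experiments or datasets.

```python
# Illustrative sketch with assumed toy data: MOS vs. paired-comparison labels.
import numpy as np

# MOS: each rater maps perception onto a 5-point scale
# (rows = raters, columns = items; entries in {1,...,5}, Bad=1 ... Excellent=5).
mos_ratings = np.array([[4, 2, 5],
                        [3, 1, 4],
                        [5, 2, 4]])
mos_scores = mos_ratings.mean(axis=0)      # naive aggregate: mean opinion score per item

# Paired comparison: each record is a binary judgment (winner, loser).
comparisons = [(0, 1), (2, 1), (2, 0), (0, 1)]   # e.g., item 0 preferred over item 1
wins = np.zeros(3)
for winner, _loser in comparisons:
    wins[winner] += 1                      # naive aggregate: win count per item

print("MOS per item: ", mos_scores)        # finer-grained, but scale-dependent
print("wins per item:", wins)              # coarser, but free of personal scale bias
```

More principled aggregation of paired comparisons, such as HodgeRank, replaces the naive win count with a globally consistent ranking score.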
It is shown in [5] that MOS may suffer from three fundamental problems: (i) it is unable to concretely define the concept of scale; (ii) the interpretations of the scale differ substantially among raters; (iii) it is difficult to verify whether a rater gives false ratings, either intentionally or carelessly. The paired comparison method is therefore gaining growing attention. It not only promises assessments that are easier and faster to obtain, with a less demanding task for raters, but also yields more reliable data with less personal scale bias in practice. However, a shortcoming of paired comparison is its higher sampling complexity compared with the MOS test, since the number of pairs grows quadratically with the number of items to be ranked.
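To quantify this quadratic growth: a complete paired-comparison design over $n$ items requires

\[
\binom{n}{2} = \frac{n(n-1)}{2}
\]

distinct pairs, so, for example, $n = 20$ items already call for 190 comparisons per rater, whereas an MOS test needs only 20 ratings.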
To tackle the cost problem, with the growth of crowdsourcing platforms such as MTurk, InnoCentive, CrowdFlower, CrowdRank, and AllOurIdeas, researchers who wish to