Semi-supervised support vector classification with self-constructed Universum

Yingjie Tian a,b, Ying Zhang c, Dalian Liu d,*

a Research Center on Fictitious Economy & Data Science, Chinese Academy of Sciences, Beijing 100190, China
b Key Laboratory of Big Data Mining and Knowledge Management, Chinese Academy of Sciences, Beijing 100190, China
c School of Mathematical Sciences, University of Chinese Academy of Sciences, Beijing 100190, China
d Department of Basic Course Teaching, Beijing Union University, Beijing 100101, China
Article info
Neurocomputing 189 (2016) 33–42
Article history:
Received 12 March 2015
Received in revised form 16 October 2015
Accepted 15 November 2015
Available online 26 November 2015
Communicated by Yongdong Zhang
Keywords:
Semi-supervised
Classification
Universum
Support vector machine
Abstract
In this paper, we propose a strategy for the semi-supervised classification problem in which a support vector machine with self-constructed Universum is solved iteratively. Universum data, which belong to neither class of interest, have been shown to encode prior knowledge by representing meaningful concepts in the same domain as the problem at hand. Our method seeks increasingly reliable positive and negative examples from the unlabeled dataset step by step, applying the Universum support vector machine (U-SVM) at each iteration. Since different Universum data yield different performance, several effective approaches for constructing Universum datasets are explored. Experimental results demonstrate that an appropriately constructed Universum improves accuracy and reduces the number of iterations.
© 2016 Published by Elsevier B.V.
1. Introduction
In traditional supervised learning, the decision function is acquired only by learning from a labeled dataset. However, in many applications of machine learning, such as image retrieval [1], text classification [2], and natural language parsing [3], abundant unlabeled data can be acquired cheaply and automatically, whereas labeling samples manually is labor-intensive and very time consuming. In such situations, traditional supervised learning usually degrades for lack of sufficient supervised information. Semi-supervised learning (SSL) [4–9] has attracted increasing interest; it addresses this problem by using a large amount of unlabeled data, together with the labeled data, to build a better classifier.
Semi-supervised learning problem: Given a training set

$T = \{(x_1, y_1), \ldots, (x_l, y_l)\} \cup \{x_{l+1}, \ldots, x_{l+q}\}, \quad (1)$

where $x_i \in \mathbb{R}^n$, $y_i \in \{1, -1\}$ for $i = 1, \ldots, l$; $x_i \in \mathbb{R}^n$ for $i = l+1, \ldots, l+q$; and the set $\{x_{l+1}, \ldots, x_{l+q}\}$ is a collection of unlabeled inputs known to belong to one of the classes, predict the outputs $y_{l+1}, \ldots, y_{l+q}$ for $\{x_{l+1}, \ldots, x_{l+q}\}$ and find a real function $g(x)$ on $\mathbb{R}^n$ such that the output $y$ for any input $x$ can be predicted by

$f(x) = \operatorname{sgn}(g(x)). \quad (2)$
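As a concrete illustration (a toy sketch, not code from the paper), the following Python snippet lays out a small instance of problem (1) and applies the decision rule (2) with an arbitrary linear score function; all names and data here are hypothetical.

```python
import numpy as np

# Toy instance of problem (1); the data and names are illustrative only.
rng = np.random.default_rng(0)
l, q, n = 10, 40, 2
X_lab = rng.normal(size=(l, n))                 # x_1, ..., x_l
y_lab = np.where(X_lab[:, 0] > 0.0, 1, -1)      # y_i in {1, -1}
X_unl = rng.normal(size=(q, n))                 # x_{l+1}, ..., x_{l+q}: labels unknown

# Any real-valued score g: R^n -> R induces the classifier (2), f(x) = sgn(g(x)).
def g(X, w=np.array([1.0, 0.0]), b=0.0):
    return X @ w + b                            # e.g. a linear score <w, x> + b

y_pred = np.sign(g(X_unl))                      # predicted outputs for the unlabeled inputs
```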
The motivation of semi-supervised methods is to take advantage of the unlabeled data to improve performance. There are roughly five families of methods for solving the semi-supervised learning problem: Generative methods [10–13], Graph-based methods [14–16], Co-training methods [17,18], Low-density separation methods [19,20], and Self-training methods [21–23].
Self-training is probably the earliest idea for using unlabeled data in classification, and it remains a commonly used technique. It is also known as self-learning, self-labeling, or bootstrapping (not to be confused with the statistical procedure of the same name). It is a wrapper algorithm that repeatedly applies a supervised method: first, a classifier is trained on the small set of labeled examples; it then classifies the unlabeled data, and the most confidently predicted unlabeled points are added to the training set. The classifier is re-trained on the enlarged set and the process is repeated. This idea has been used in many applications [24–26]. Our method belongs to this family, as sketched below.
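A minimal sketch of this wrapper loop, assuming scikit-learn's SVC as the base learner and class-probability thresholds as the confidence measure (both our illustrative choices, not the paper's U-SVM procedure), might look like this:

```python
import numpy as np
from sklearn.svm import SVC

def self_training(X_lab, y_lab, X_unl, confidence=0.9, max_iter=20):
    """Generic self-training wrapper; a sketch, not the paper's exact algorithm."""
    X_lab, y_lab, X_unl = X_lab.copy(), y_lab.copy(), X_unl.copy()
    clf = SVC(probability=True)                  # base learner with confidence scores
    for _ in range(max_iter):
        clf.fit(X_lab, y_lab)
        if len(X_unl) == 0:
            break
        proba = clf.predict_proba(X_unl)
        keep = proba.max(axis=1) >= confidence   # most confident unlabeled points
        if not keep.any():
            break
        # Move confident points, with their pseudo-labels, into the training set.
        X_lab = np.vstack([X_lab, X_unl[keep]])
        y_lab = np.concatenate([y_lab, clf.classes_[proba[keep].argmax(axis=1)]])
        X_unl = X_unl[~keep]
    clf.fit(X_lab, y_lab)                        # final re-training on the enlarged set
    return clf
```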
The Universum, defined as a collection of unlabeled points known not to belong to either class, was first proposed in [27]. It captures a general backdrop against the problem of interest and is expected to represent meaningful information connected with the classification task at hand. A Universum dataset is easy to acquire, since few requirements are imposed on it. Additionally, it can capture some prior information about the ground-truth decision function.
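As a concrete example of construction, one common scheme in the Universum literature (possibly differing from the approaches explored later in this paper) generates "in-between" Universum points by averaging randomly paired examples from the two classes, so the generated points plausibly belong to neither class:

```python
import numpy as np

def in_between_universum(X_pos, X_neg, size, seed=0):
    """Generate Universum points as midpoints of random positive/negative pairs.

    A minimal sketch of the 'in-between' construction; the function name and
    parameters are illustrative, not taken from the paper.
    """
    rng = np.random.default_rng(seed)
    i = rng.integers(0, len(X_pos), size)       # random positive indices
    j = rng.integers(0, len(X_neg), size)       # random negative indices
    return 0.5 * (X_pos[i] + X_neg[j])          # midpoints lie between the classes
```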
* Corresponding author.
E-mail addresses: tyj@ucas.ac.cn (Y. Tian), zhangying112@mails.ucas.ac.cn (Y. Zhang), ldlluck@sina.com (D. Liu).