ENCYCLOPEDIA ENHANCED SEMANTIC EMBEDDING FOR ZERO-SHOT LEARNING
Zhen Jia^{1,2}, Junge Zhang^{1,2}, Kaiqi Huang^{1,2,3}, Tieniu Tan^{1,2,3}
^1 CRIPAC & NLPR, Institute of Automation, Chinese Academy of Sciences
^2 University of Chinese Academy of Sciences
^3 CAS Center for Excellence in Brain Science and Intelligence Technology
{zhen.jia, jgzhang, kqhuang, tnt}@nlpr.ia.ac.cn
ABSTRACT
The real world contains far more object categories than those covered by image
datasets. Zero-shot learning aims to recognize image categories that are unseen
in the training set. Many previous zero-shot learning models directly use the
word vectors of class labels as category prototypes in the semantic embedding
space. However, word vectors alone cannot sufficiently capture the global
knowledge of an image category. In this paper, we propose an encyclopedia-enhanced
semantic embedding model that improves the discriminative capability of
word-vector prototypes with the global knowledge of each image category. The
proposed model extracts TF-IDF keywords from encyclopedia articles to acquire
the global knowledge of each category, and a convex combination of the keywords'
word vectors serves as the prototype of each object category. The prototypes of
seen and unseen classes build the embedding space, in which nearest-neighbour
search is performed to recognize unseen images. Experiments show that the
proposed method achieves state-of-the-art performance on the challenging
ImageNet Fall 2011 1k2hop dataset.
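The pipeline described in the abstract can be sketched as follows. This is a minimal illustration, not the paper's implementation: the tokenization, the simple TF-IDF formula, and the uniform convex weights are all assumptions made for the example.

```python
import math
from collections import Counter

def tfidf_keywords(doc_tokens, corpus, k=3):
    """Return the top-k TF-IDF keywords of one tokenized document.

    doc_tokens: token list of one category's encyclopedia article.
    corpus: list of tokenized documents (one per category), used for IDF.
    """
    n_docs = len(corpus)
    tf = Counter(doc_tokens)
    scores = {}
    for word, count in tf.items():
        df = sum(1 for d in corpus if word in d)          # document frequency
        idf = math.log(n_docs / (1 + df)) + 1.0           # smoothed IDF
        scores[word] = (count / len(doc_tokens)) * idf    # TF * IDF
    return [w for w, _ in sorted(scores.items(), key=lambda x: -x[1])[:k]]

def prototype(keywords, word_vecs, weights=None):
    """Convex combination of the keywords' word vectors.

    With weights omitted, a uniform convex combination (i.e. the mean of
    the keyword vectors) is used; the paper may weight keywords differently.
    """
    if weights is None:
        weights = [1.0 / len(keywords)] * len(keywords)
    # convex combination: non-negative weights summing to one
    assert all(w >= 0 for w in weights) and abs(sum(weights) - 1.0) < 1e-9
    dim = len(next(iter(word_vecs.values())))
    proto = [0.0] * dim
    for w, kw in zip(weights, keywords):
        proto = [p + w * v for p, v in zip(proto, word_vecs[kw])]
    return proto
```

A category's prototype is then just `prototype(tfidf_keywords(article, corpus), word_vecs)`, placing the class in the same semantic space as the word vectors themselves.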
Index Terms— zero-shot learning, image classification
1. INTRODUCTION
Image classification has made great progress in recent years, owing to
impressive advances in deep learning methods such as convolutional neural
networks (CNNs) [1, 2, 3, 4] and to large-scale datasets [5]. Some CNN-based
image classification methods [6] even surpass human performance on the
ImageNet classification task. Meanwhile, almost all of the successful image
classification methods mentioned above are supervised models, which require
large-scale labeled image data to converge. Early research on human cognition
[7] shows that humans can recognize more than 30,000 object categories,
including objects with components removed or under non-rigid deformation.
Moreover, humans can recognize objects they have never seen before. For
instance, humans can easily tell different cat breeds apart just by reading
their text descriptions, and a child can recognize a zebra at first sight
after having seen a horse and learned that a zebra looks like a horse with
black and white stripes. We hope that machine image classification systems
gain a similar ability to transfer knowledge from other modalities to the
visual domain, i.e., to recognize image categories that do not appear in the
training set.
Zero-shot learning (ZSL) addresses the image classification setting in which
the test categories have no overlap with the training categories. The topic
has drawn increasing attention from computer vision researchers, and many
computer vision and machine learning methods, including probabilistic models
[8, 9, 10], canonical correlation analysis [11, 12], metric learning [13, 14]
and graphical models [15], have been exploited to solve the ZSL problem. To
classify unseen images, the first step is to build a semantic embedding space
in which every image class is represented by its prototype. Attribute
features, word vectors and textual descriptions of the categories are the
typical side information used to form the embedding space. C. Lampert et
al. [8, 9] propose probabilistic models, the direct and indirect attribute
prediction models (DAP and IAP), which classify unseen images using their
attribute features as prototypes. The deep visual-semantic embedding model
(DeViSE) [16] maps CNN image features into the word-vector embedding space,
exploiting the semantic and syntactic properties of word vectors shown
in [17]. Recently, Z. Akata et al. [18, 19] utilize image descriptions as
side information to build the embedding space. Among these three kinds of
side information, word vectors have an advantage over attribute features and
image descriptions for materializing prototypes, because they require no
human annotation, which is expensive and time-consuming. Word vectors are
therefore well suited to large-scale ZSL.
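Once every class, seen or unseen, has a prototype in the semantic space, classifying an unseen image reduces to nearest-neighbour search among the prototypes. A minimal sketch of this step follows; the cosine similarity metric and the toy prototypes are assumptions for illustration, and `image_embedding` stands in for the output of a learned visual-to-semantic mapping (e.g. a linear layer on CNN features).

```python
import math

def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def zero_shot_classify(image_embedding, prototypes):
    """Assign the class whose prototype is nearest to the image's embedding.

    prototypes: dict mapping class label -> prototype vector in the
    semantic space; it may include classes never seen during training.
    """
    return max(prototypes, key=lambda c: cosine(image_embedding, prototypes[c]))
```

Because the unseen classes' prototypes live in the same space as the seen ones, no retraining is needed to add a new category: one simply inserts its prototype into the dictionary.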
Many proposed ZSL methods [11, 12, 16, 20, 21, 22] directly use the word
vectors of the class labels as classification prototypes, which hurts
zero-shot classification. Word-vector learning algorithms, such as the
skip-gram method [17], usually use a small training window, so the resulting
vectors cannot capture the global knowledge of a category in the corpus. The global
knowledge is the more comprehensive and scientific repre-
978-1-5090-2175-8/17/$31.00 ©2017 IEEE    ICIP 2017