Deep Semantic Structural Constraints for Zero-Shot Learning
Yan Li*1,2, Zhen Jia*1,2, Junge Zhang1,2, Kaiqi Huang1,2,3, Tieniu Tan1,2,3
1CRIPAC & NLPR, Institute of Automation, Chinese Academy of Sciences
2University of Chinese Academy of Sciences
3CAS Center for Excellence in Brain Science and Intelligence Technology
yan.li@cripac.ia.ac.cn, {zhen.jia, jgzhang, kqhuang, tnt}@nlpr.ia.ac.cn
Abstract
Zero-shot learning aims to classify unseen image categories by learning a visual-semantic embedding space. In most cases, traditional methods adopt a separate two-step pipeline: image features are first extracted from pre-trained CNN models, and these fixed features are then used to learn the embedding space. As a result, the image features lack the structural semantic information specific to the zero-shot learning task. In this paper, we propose an end-to-end trainable Deep Semantic Structural Constraints model to address this issue. The proposed model contains an Image Feature Structure constraint and a Semantic Embedding Structure constraint, which aim to learn structure-preserving image features and to endow the learned embedding space with stronger generalization ability, respectively. With the assistance of semantic structural information, the model gains more auxiliary clues for zero-shot learning. State-of-the-art performance demonstrates the effectiveness of the proposed method.
Introduction
As one of the most fundamental problems in computer vision, image classification has made huge progress in recent years with the impressive development of deep learning. Although ResNet (He et al. 2016), an outstanding representative of Convolutional Neural Network (CNN) classification models, achieves a top-5 error rate as low as 3.57% on the ImageNet classification task, its classification ability is still limited to the image categories in the training dataset. This limitation, that models can only classify image categories within the training set, prevents them from becoming as intelligent as human beings. As a simple example, humans are able to recognize different kinds of animals by merely reading their descriptions rather than seeing them. More and more researchers try to break through this limitation by introducing Zero-Shot Learning (ZSL) into image classification (Lampert, Nickisch, and Harmeling 2009; Frome et al. 2013; Norouzi et al. 2013; Socher et al. 2013; Fu et al. 2015; Akata et al. 2015; Romera-Paredes and Torr 2015; Bucher, Herbin, and Jurie 2016; Akata et al. 2016; Huang, Loy, and Tang 2016; Changpinyo et al. 2016; Xian et al. 2017; Morgado and Vasconcelos 2017).
*The first two authors contributed equally to this work.
Copyright © 2018, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.
Zero-shot learning seeks to make image classification models able to classify image categories that never appear in the training dataset. In the zero-shot learning task, we refer to the image categories in the training set as seen classes and those in the test set as unseen classes. The category characteristics of unseen classes are learned from side information, i.e., the semantic features of the images. Commonly used side information includes human-annotated attribute features of images (Lampert, Nickisch, and Harmeling 2009; Akata et al. 2016), text descriptions of the image categories (Reed et al. 2016), word vectors of the category labels (Frome et al. 2013; Norouzi et al. 2013), and so on.
A large number of previous state-of-the-art methods focus on building a common space in which image features and semantic features are embedded (Frome et al. 2013; Socher et al. 2013; Akata et al. 2015; Romera-Paredes and Torr 2015; Akata et al. 2016). The embedding space is built on the correspondence between the seen images and their semantic features. At test time, unseen image features are mapped into the embedding space, where a classification method such as nearest-neighbour (NN) search can be applied easily. Most of these methods adopt a separate two-step pipeline, i.e., extracting image features from pre-trained CNN models and then using the fixed image features to learn the embedding space.
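To make this pipeline concrete, the following is a minimal sketch of the two-step approach, using a ridge-regression embedding and cosine nearest-neighbour search as one representative instantiation. The function names, feature dimensions, and random stand-in matrices are hypothetical placeholders rather than the interface of any particular published method.

```python
import numpy as np

def learn_embedding(X_seen, S_seen, lam=1.0):
    """Step 2 of the two-step pipeline: fit a linear map W that
    projects fixed CNN features X (n x d) onto the semantic
    vectors S (n x k) of each image's class via ridge regression."""
    d = X_seen.shape[1]
    # Closed-form solution: W = (X^T X + lam*I)^{-1} X^T S
    return np.linalg.solve(X_seen.T @ X_seen + lam * np.eye(d),
                           X_seen.T @ S_seen)

def classify_unseen(X_test, W, S_unseen):
    """Map test features into the semantic space and label each
    image with the nearest unseen-class vector (cosine similarity)."""
    P = X_test @ W
    P /= np.linalg.norm(P, axis=1, keepdims=True)
    S = S_unseen / np.linalg.norm(S_unseen, axis=1, keepdims=True)
    return np.argmax(P @ S.T, axis=1)

# Hypothetical shapes: 2048-d pre-extracted CNN features, 85-d attributes.
X_seen = np.random.randn(1000, 2048)   # stand-in for fixed CNN features
S_seen = np.random.randn(1000, 85)     # semantic vector of each image's class
W = learn_embedding(X_seen, S_seen)
labels = classify_unseen(np.random.randn(10, 2048), W, np.random.randn(5, 85))
```

Note that the CNN never appears in this code: its weights are frozen, and only the pre-extracted feature matrices enter the optimization, which is exactly the separation the next paragraph criticizes.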
However, we argue that separating image feature extraction from embedding space construction severely harms ZSL models. The separation means that the model cannot adapt the image features to the specific ZSL task during training. Moreover, image features extracted from a fixed pre-trained CNN model cannot capture the rich semantic information contained in the side information. The semantic information of human-annotated attributes, text descriptions, or word vectors constitutes the semantic structure of a specific category. We believe that combining the learning of image features and the embedding space in an end-to-end manner, while incorporating this structural information into the whole learning process, would lead to much better zero-shot performance.
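In contrast to the snippet above, an end-to-end formulation backpropagates the embedding loss through the CNN itself, so the features adapt to the ZSL task. The sketch below is a hypothetical illustration of this general idea with a PyTorch backbone and a plain regression loss; it is not the DSSC model, whose structural constraints are defined in the following sections, and the dimensions and dummy batch are assumptions for illustration only.

```python
import torch
import torch.nn as nn
import torchvision.models as models

class JointEmbeddingNet(nn.Module):
    """CNN backbone + linear projection trained jointly, so the
    image features themselves are shaped by the embedding loss."""
    def __init__(self, sem_dim=85):
        super().__init__()
        backbone = models.resnet50(weights=None)
        backbone.fc = nn.Identity()          # keep the 2048-d pooled features
        self.backbone = backbone
        self.project = nn.Linear(2048, sem_dim)

    def forward(self, images):
        return self.project(self.backbone(images))

# One hypothetical training step: pull each image's projection toward
# its class semantic vector (a simple MSE loss, for illustration only).
model = JointEmbeddingNet()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
images = torch.randn(4, 3, 224, 224)         # dummy image batch
targets = torch.randn(4, 85)                 # semantic vectors of the labels
loss = nn.functional.mse_loss(model(images), targets)
loss.backward()                              # gradients reach the CNN weights
optimizer.step()
```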
In this paper, we propose a new Deep Semantic Structural Constraints (DSSC) model for zero-shot learning, aiming to train the model in an end-to-end manner and to use the semantic structural information to supervise