Zero-Shot Learning Through Cross-Modal Transfer
Richard Socher, Milind Ganjoo, Christopher D. Manning, Andrew Y. Ng
Computer Science Department, Stanford University, Stanford, CA 94305, USA
richard@socher.org, {mganjoo, manning}@stanford.edu, ang@cs.stanford.edu
Abstract
This work introduces a model that can recognize objects in images even if no
training data is available for the object class. The only necessary knowledge about
unseen visual categories comes from unsupervised text corpora. Unlike previous
zero-shot learning models, which can only differentiate between unseen classes,
our model can operate on a mixture of seen and unseen classes, simultaneously obtaining state-of-the-art performance on classes with thousands of training images and reasonable performance on unseen classes. This is achieved by viewing the distributions of words in texts as a semantic space for understanding what objects look like. Our deep learning model does not require any manually defined semantic or visual features for either words or images. Images are mapped to be close to the semantic word vectors corresponding to their classes, and the resulting image embeddings can be used to distinguish whether an image is of a seen or unseen class. We then use novelty detection methods to differentiate unseen classes from seen classes. We demonstrate two novelty detection strategies: the first gives high accuracy on unseen classes, while the second is conservative in its prediction of novelty and keeps the seen classes' accuracy high.
1 Introduction
The ability to classify instances of an unseen visual class, called zero-shot learning, is useful in several situations. Many species and products lack labeled data, and new visual categories, such as the latest gadgets or car models, are introduced frequently. In this work, we show how
to make use of the vast amount of knowledge about the visual world available in natural language
to classify unseen objects. We attempt to model people's ability to identify unseen objects even when their only knowledge about those objects comes from reading about them. For instance, after reading the
description of a two-wheeled self-balancing electric vehicle, controlled by a stick, with which you
can move around while standing on top of it, many would be able to identify a Segway, possibly after
being briefly perplexed because the new object looks different from previously observed classes.
We introduce a zero-shot model that can predict both seen and unseen classes. For instance, without
ever seeing a cat image, it can determine whether an image shows a cat or a known category from
the training set such as a dog or a horse. The model is based on two main ideas.
Fig. 1 illustrates the model. First, images are mapped into a semantic space of words that is learned by a neural network model [15]. Word vectors capture distributional similarities from a large, unsupervised text corpus. By learning an image mapping into this space, the word vectors get implicitly grounded by the visual modality, allowing us to give prototypical instances for various words.
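To make this first idea concrete, the following is a minimal sketch, not the authors' exact architecture: it fits a single linear map from image feature vectors into a pre-trained word-vector space so that each training image lands near the embedding of its class label. The dimensionalities, the toy word vectors, and the plain squared-error SGD update are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(0)
IMG_DIM, WORD_DIM = 128, 50                     # assumed dimensionalities
word_vectors = {                                # stand-ins for pre-trained word vectors (e.g. from [15])
    "dog": rng.normal(size=WORD_DIM),
    "horse": rng.normal(size=WORD_DIM),
}

def train_image_to_word_map(feats, labels, lr=0.005, epochs=100):
    """Fit W so that W @ x lies close to the word vector of x's class (squared error)."""
    W = rng.normal(scale=0.01, size=(WORD_DIM, IMG_DIM))
    for _ in range(epochs):
        for x, y in zip(feats, labels):
            residual = W @ x - word_vectors[y]
            W -= lr * np.outer(residual, x)     # gradient of 0.5 * ||W x - w_y||^2
    return W

# Toy training data: random "image features" labeled with seen classes.
feats = [rng.normal(size=IMG_DIM) for _ in range(20)]
labels = ["dog" if i % 2 == 0 else "horse" for i in range(20)]
W = train_image_to_word_map(feats, labels)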
Second, because classifiers prefer to assign test images to classes for which they have seen training examples, the model incorporates novelty detection, which determines whether a new image lies on the manifold of known categories. If the image is of a known category, a standard classifier can be used. Otherwise, images are assigned to a class based on the likelihood of being an unseen category. We explore two strategies for novelty detection, both of which are based on ideas from outlier detection methods. The first strategy prefers high accuracy for unseen classes, the second for seen classes.
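The following sketch illustrates this two-stage decision under simplifying assumptions: a fixed distance threshold stands in for the outlier-detection strategies developed later in the paper, and the class vectors and threshold value are invented for illustration. An image that has already been mapped into the word-vector space is first compared against the seen classes and, if it looks novel, is assigned to the nearest unseen-class word vector.

import numpy as np

rng = np.random.default_rng(1)
WORD_DIM = 50
seen_word_vectors = {"dog": rng.normal(size=WORD_DIM), "horse": rng.normal(size=WORD_DIM)}
unseen_word_vectors = {"cat": rng.normal(size=WORD_DIM)}        # zero-shot classes

def classify(z, threshold=5.0):
    """z: an image already mapped into the word-vector space (see the previous sketch)."""
    seen_dists = {c: np.linalg.norm(z - v) for c, v in seen_word_vectors.items()}
    if min(seen_dists.values()) < threshold:
        return min(seen_dists, key=seen_dists.get)              # known category: standard decision
    unseen_dists = {c: np.linalg.norm(z - v) for c, v in unseen_word_vectors.items()}
    return min(unseen_dists, key=unseen_dists.get)              # novel: nearest unseen class

# An embedding close to "dog" stays within the seen classes; a far-away embedding
# is treated as novel and falls back to the unseen class.
print(classify(seen_word_vectors["dog"] + 0.1 * rng.normal(size=WORD_DIM)))   # -> dog
print(classify(10.0 * rng.normal(size=WORD_DIM)))                             # -> cat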
Unlike previous work on zero-shot learning, which can only predict intermediate features or differentiate between various zero-shot classes [21, 27], our joint model achieves both state-of-the-art accuracy on known classes and reasonable performance on unseen classes. Furthermore, compared to related work on knowledge transfer [21, 28], we do not require manually defined semantic