A Survey of Zero-Shot Learning: Settings, Methods, and Applications
WEI WANG,
Nanyang Technological University, Singapore
VINCENT W. ZHENG,
WeBank, China
HAN YU and CHUNYAN MIAO,
Nanyang Technological University, Singapore
Most machine-learning methods focus on classifying instances whose classes have already been seen in training. In practice, many applications require classifying instances whose classes have not been seen previously. Zero-shot learning is a powerful and promising learning paradigm, in which the classes covered by training instances and the classes we aim to classify are disjoint. In this paper, we provide a comprehensive survey of zero-shot learning. First of all, we provide an overview of zero-shot learning. According to the data utilized in model optimization, we classify zero-shot learning into three learning settings. Second, we describe different semantic spaces adopted in existing zero-shot learning works. Third, we categorize existing zero-shot learning methods and introduce representative methods under each category. Fourth, we discuss different applications of zero-shot learning. Finally, we highlight promising future research directions of zero-shot learning.
CCS Concepts: • Computing methodologies → Transfer learning;
Additional Key Words and Phrases: Zero-shot learning survey
ACM Reference format:
Wei Wang, Vincent W. Zheng, Han Yu, and Chunyan Miao. 2019. A Survey of Zero-Shot Learning: Settings,
Methods, and Applications. ACM Trans. Intell. Syst. Technol. 10, 2, Article 13 (January 2019), 37 pages.
https://doi.org/10.1145/3293318
1 INTRODUCTION
Supervised classification methods have achieved significant success in research and have been applied in many areas. Especially in recent years, benefiting from the fast development of deep
learning techniques, they have made much progress. However, there are some restrictions for methods under this learning paradigm. In supervised classification, sufficient labeled training instances are needed for each class. In addition, the learned classifier can only classify instances belonging to classes covered by the training data, and it lacks the ability to deal with previously unseen classes. However, in practical applications, there may not be sufficient training instances for each class. There could also be situations in which classes not covered by the training instances appear among the testing instances.
To deal with these problems, methods under different learning paradigms have been proposed. To tackle the problem of learning classifiers for classes with few training instances, few-shot learning/one-shot learning methods [35, 122] have been proposed. In these methods, while learning classifiers for the classes with few instances, knowledge contained in instances of other classes is utilized. To deal with previously unseen classes, a range of learning methods have been proposed. In open set recognition methods [64, 132], when learning the classifier with the training data, the possible existence of unseen classes is taken into consideration. The learned classifier can determine whether a testing instance belongs to the unseen classes, but it cannot determine which specific unseen class the instance belongs to. The cumulative learning [34] and class-incremental learning [123] methods have been proposed for problems in which labeled instances belonging to some previously unseen classes progressively appear after model learning. The learned classifier can be adapted with these newly available labeled instances so that it can classify the classes covered by them. The open world recognition [9] methods follow the process of "unseen class detection, acquisition of labeled instances for the unseen classes, and model adaptation," adapting the classifier with the acquired labeled instances so that it can classify previously unseen classes.
For methods under the above learning paradigms, if the testing instances belong to unseen classes that have no available labeled instances during model learning (or adaptation), the learned classifier cannot determine their class labels. However, in many practical applications, we need the classifier to be able to determine the class labels of instances belonging to such classes. The following are some popular application scenarios:
• The number of target classes is large. An example is object recognition in computer vision. Generally, human beings can recognize at least 30,000 object classes [10]. However, collecting sufficient labeled instances for such a large number of classes is challenging. Thus, existing image datasets can only cover a small subset of these classes, and many object classes have no labeled instances in existing datasets. A similar example is activity recognition [178], where the number of human activities is large, in contrast with the limited number of activity classes covered by existing datasets. Many activity classes have no labeled instances in existing datasets.
• Target classes are rare. An example is fine-grained object classification. Suppose we want to recognize flowers of different breeds [8, 31]. It is hard to collect sufficient image instances for each specific flower breed. For many rare breeds, we cannot find the corresponding labeled instances.
• Target classes change over time. An example is recognizing images of products belonging to a certain style and brand. As products of new styles and new brands appear frequently, it is difficult to find corresponding labeled instances for some new products [98].
• In some particular tasks, it is expensive to obtain labeled instances. In some learning tasks related to classification, the instance labeling process is expensive and time-consuming. Thus, the number of classes covered by existing datasets is limited, and many classes have no labeled instances. For example, in the image semantic segmentation problem [93, 104], the images used as training data should be labeled at the pixel level. This problem can be seen
as a pixel-level classification problem for the images. The number of object classes covered by existing datasets is limited, with many object classes having no labeled instances. As another example, in the image captioning problem [140], each image in the training data should have a corresponding caption. This problem can be seen as a sequential classification problem. The number of object classes covered by existing image-text corpora is limited, with many object classes not being covered.
In these applications, there are many classes that have no labeled instances, and it is important for a classifier to be able to determine the class labels of instances belonging to these classes. To solve this problem, zero-shot learning (also known as zero-data learning [81]) was proposed. The aim of zero-shot learning is to classify instances belonging to classes that have no labeled instances. Since its inception [80, 81, 113], zero-shot learning has become a fast-developing field in machine learning, with a wide range of applications in computer vision, natural language processing, and ubiquitous computing.
1.1 Overview of Zero-Shot Learning
In zero-shot learning, there are some labeled training instances in the feature space. The classes covered by these training instances are referred to as the seen classes. In the feature space, there are also some unlabeled testing instances, which belong to another set of classes; these classes are referred to as the unseen classes. The feature space is usually a real number space, and each instance is represented as a vector within it. Each instance is usually assumed to belong to one class. (There are some works on zero-shot learning under the multilabel setting, which focus on classifying instances that each have more than one class label; we will separately discuss these works in Section 3.3.)
Now, we give the definition of zero-shot learning. Denote $\mathcal{S} = \{c_i^s \mid i = 1, \dots, N_s\}$ as the set of seen classes, where each $c_i^s$ is a seen class. Denote $\mathcal{U} = \{c_i^u \mid i = 1, \dots, N_u\}$ as the set of unseen classes, where each $c_i^u$ is an unseen class. Note that $\mathcal{S} \cap \mathcal{U} = \emptyset$. Denote $\mathcal{X}$ as the feature space, which is $D$-dimensional; usually it is a real number space $\mathbb{R}^D$. Denote $D^{tr} = \{(x_i^{tr}, y_i^{tr}) \in \mathcal{X} \times \mathcal{S}\}_{i=1}^{N_{tr}}$ as the set of labeled training instances belonging to seen classes; for each labeled instance $(x_i^{tr}, y_i^{tr})$, $x_i^{tr}$ is the instance in the feature space, and $y_i^{tr}$ is the corresponding class label. Denote $X^{te} = \{x_i^{te} \in \mathcal{X}\}_{i=1}^{N_{te}}$ as the set of testing instances, where each $x_i^{te}$ is a testing instance in the feature space. Denote $Y^{te} = \{y_i^{te} \in \mathcal{U}\}_{i=1}^{N_{te}}$ as the corresponding class labels for $X^{te}$, which are to be predicted.
Definition 1.1 (Zero-Shot Learning). Given labeled training instances $D^{tr}$ belonging to the seen classes $\mathcal{S}$, zero-shot learning aims to learn a classifier $f^u(\cdot): \mathcal{X} \to \mathcal{U}$ that can classify testing instances $X^{te}$ (i.e., to predict $Y^{te}$) belonging to the unseen classes $\mathcal{U}$.
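To ground the notation, the following is a minimal sketch in Python of the data layout the definition describes; the class names, feature dimensionality, and array contents are hypothetical toy values we chose for illustration, not taken from the survey.

```python
import numpy as np

# Toy dimensions for illustration only.
D = 4                      # feature space X = R^D
N_tr, N_te = 6, 3          # numbers of training and testing instances

# Disjoint label spaces: seen classes S and unseen classes U, with S ∩ U = ∅.
seen_classes = {"horse", "tiger", "panda"}    # S, so N_s = 3
unseen_classes = {"zebra"}                    # U, so N_u = 1
assert seen_classes.isdisjoint(unseen_classes)

# D_tr: labeled training instances (x_i, y_i), with every label y_i in S.
X_tr = np.random.randn(N_tr, D)
y_tr = ["horse", "horse", "tiger", "tiger", "panda", "panda"]

# X_te: unlabeled testing instances; their unknown labels Y_te lie in U.
X_te = np.random.randn(N_te, D)

# The goal is a zero-shot classifier f_u: X -> U that predicts Y_te,
# even though no training instance carries an unseen-class label.
```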
From the definition, we can see that the general idea of zero-shot learning is to transfer the knowledge contained in the training instances $D^{tr}$ to the task of testing instance classification. The label spaces covered by the training and the testing instances are disjoint. Thus, zero-shot learning is a subfield of transfer learning [114, 115]. In transfer learning, knowledge contained in the source domain and source task is transferred to the target domain for learning the model in the target task [114, 115]. According to [23, 114], based on whether the feature spaces and label spaces in the source and target domains/tasks are the same, transfer learning can be classified into homogeneous transfer learning and heterogeneous transfer learning. In homogeneous transfer learning, the feature spaces and the label spaces are the same, while in heterogeneous transfer learning, the feature spaces and/or the label spaces are different. In zero-shot learning, the source feature space is the feature space of the training instances, and the target feature space is the feature space of the testing instances. They are the same; both are $\mathcal{X}$. However, the source label space is the seen class set $\mathcal{S}$, while the target label space is the unseen class set $\mathcal{U}$. They are different. In view of this, zero-shot learning belongs to heterogeneous transfer learning. Specifically, it belongs to heterogeneous transfer learning with different label spaces (we briefly refer to it as HTL-DLS). Many existing methods for HTL-DLS are proposed for problems under the setting in which there are some labeled instances for the target label space classes [23]. However, in zero-shot learning, no labeled instances belonging to the target label space classes (the unseen classes) are available. This makes the problems in zero-shot learning different from those studied in HTL-DLS.
Auxiliary information. As no labeled instances belonging to the unseen classes are available, some auxiliary information is necessary to solve the zero-shot learning problem. Such auxiliary information should contain information about all of the unseen classes; this guarantees that each of the unseen classes is provided with corresponding auxiliary information. Meanwhile, the auxiliary information should be related to the instances in the feature space; this guarantees that the auxiliary information is usable.
In existing works, the approach to incorporating auxiliary information is inspired by the way human beings recognize the world. Humans can perform zero-shot learning with the help of some semantic background knowledge. For example, with the knowledge that "a zebra looks like a horse, and with stripes," we can recognize a zebra even without having seen one before, as long as we know what a horse looks like and what the pattern "stripe" looks like [45]. Accordingly, the auxiliary information involved by existing zero-shot learning methods is usually some form of semantic information. It forms a space that contains both the seen and the unseen classes. As this space contains semantic information, it is often referred to as the semantic space. Similar to the feature space, the semantic space is usually a real number space. In the semantic space, each class has a corresponding vector representation, which is referred to as the class prototype (or prototype for short) of that class.
(In some works, for each class there is more than one corresponding prototype in the semantic space; we will separately discuss them in Section 3.3.) We denote $\mathcal{T}$ as the semantic space. Suppose $\mathcal{T}$ is $M$-dimensional; it is usually $\mathbb{R}^M$. Denote $t_i^s \in \mathcal{T}$ as the class prototype for seen class $c_i^s$, and $t_i^u \in \mathcal{T}$ as the class prototype for unseen class $c_i^u$. Denote $T^s = \{t_i^s\}_{i=1}^{N_s}$ as the set of prototypes for seen classes, and $T^u = \{t_i^u\}_{i=1}^{N_u}$ as the set of prototypes for unseen classes. Denote $\pi(\cdot): \mathcal{S} \cup \mathcal{U} \to \mathcal{T}$ as a class prototyping function that takes a class label as input and outputs the corresponding class prototype. In zero-shot learning, along with the training instances $D^{tr}$, the class prototypes $T^s$ and $T^u$ are also involved in obtaining the zero-shot classifier $f^u(\cdot)$. In Section 2, we will categorize and introduce different kinds of semantic spaces.
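As a concrete illustration, the sketch below builds a toy attribute semantic space around the zebra example above and classifies a testing instance by the nearest unseen-class prototype, one common strategy among the methods surveyed later. The attribute values, the second unseen class, and the fixed projection matrix are all hypothetical; in practice the mapping from the feature space to the semantic space is typically learned from the seen-class training data.

```python
import numpy as np

# A toy attribute semantic space T = R^3 with attributes
# (horse-like, has_stripes, lives_in_savanna); values are made up.
prototypes = {
    "horse": np.array([1.0, 0.0, 0.0]),   # seen-class prototypes T^s
    "tiger": np.array([0.0, 1.0, 0.0]),
    "zebra": np.array([1.0, 1.0, 1.0]),   # unseen: "a horse with stripes"
    "okapi": np.array([0.8, 0.5, 0.0]),   # second hypothetical unseen class
}
unseen_classes = ["zebra", "okapi"]

# Stand-in projection from feature space R^4 into semantic space R^3.
# Random here only to keep the sketch self-contained and runnable.
rng = np.random.default_rng(0)
W = rng.standard_normal((3, 4))

def classify_unseen(x: np.ndarray) -> str:
    """Assign x to the unseen class with the nearest prototype in T."""
    t_hat = W @ x  # predicted semantic representation of instance x
    return min(unseen_classes,
               key=lambda c: np.linalg.norm(t_hat - prototypes[c]))

print(classify_unseen(np.ones(4)))  # labels the instance "zebra" or "okapi"
```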
We summarize the key notations used throughout this article in Table 1.
1.2 Learning Settings
In zero-shot learning, the goal is to learn the zero-shot classifier $f^u(\cdot)$. During model learning, if information about the testing instances is involved, the learned model is transductive for these specific testing instances. In zero-shot learning, this transduction can be embodied in two progressive degrees: transductive for specific unseen classes and transductive for specific testing instances. This is different from the well-known transductive setting in semisupervised learning, which is just for the testing instances. In the setting that is transductive for specific unseen classes, information about the unseen classes is involved in model learning, and the model is optimized for these specific unseen classes. In the setting that is transductive for specific testing instances, the transductive degree goes further: the testing instances are also involved in model learning, and the model is optimized for these specific testing instances. Based on the degree of transduction, we categorize zero-shot learning into three learning settings.
Table 1. Key Notations Used in This Article

Notation | Description
$\mathcal{X}$ | Feature space, which is $D$-dimensional
$\mathcal{T}$ | Semantic space, which is $M$-dimensional
$\mathcal{S}$, $\mathcal{U}$ | Set of seen classes and set of unseen classes, respectively
$N_{tr}$, $N_{te}$ | Number of training instances and number of testing instances, respectively
$N_s$, $N_u$ | Number of seen classes and number of unseen classes, respectively
$D^{tr}$ | The set of labeled training data from seen classes
$X^{te}$ | The set of testing instances from unseen classes
$Y^{te}$ | Labels for testing instances
$(x_i^{tr}, y_i^{tr})$ | The $i$th labeled training instance: features $x_i^{tr} \in \mathcal{X}$ and label $y_i^{tr} \in \mathcal{S}$
$x_i^{te}$ | The $i$th unlabeled testing instance: features $x_i^{te} \in \mathcal{X}$
$T^s$, $T^u$ | The set of prototypes for seen classes and unseen classes, respectively
$(c_i^s, t_i^s)$ | The $i$th seen class $c_i^s \in \mathcal{S}$ and its class prototype $t_i^s \in \mathcal{T}$
$(c_i^u, t_i^u)$ | The $i$th unseen class $c_i^u \in \mathcal{U}$ and its class prototype $t_i^u \in \mathcal{T}$
$\pi(\cdot)$ | A class prototyping function $\pi(\cdot): \mathcal{S} \cup \mathcal{U} \to \mathcal{T}$
$f^u(\cdot)$ | A zero-shot classifier $f^u(\cdot): \mathcal{X} \to \mathcal{U}$
Fig. 1. Different learning settings for zero-shot learning.
Definition 1.2 (Class-Inductive Instance-Inductive (CIII) Setting). Only labeled training instances $D^{tr}$ and seen class prototypes $T^s$ are used in model learning.

Definition 1.3 (Class-Transductive Instance-Inductive (CTII) Setting). Labeled training instances $D^{tr}$, seen class prototypes $T^s$, and unseen class prototypes $T^u$ are used in model learning.

Definition 1.4 (Class-Transductive Instance-Transductive (CTIT) Setting). Labeled training instances $D^{tr}$, seen class prototypes $T^s$, unlabeled testing instances $X^{te}$, and unseen class prototypes $T^u$ are used in model learning.
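The three definitions differ only in which inputs the learner may consume. The stub signatures below, a sketch of our own with hypothetical names rather than an API from the survey, make the contrast explicit.

```python
from typing import Callable, Mapping, Sequence, Tuple
import numpy as np

Instance = np.ndarray                        # a point in the feature space X
Prototype = np.ndarray                       # a point in the semantic space T
Classifier = Callable[[Instance], str]       # f_u : X -> U
LabeledSet = Sequence[Tuple[Instance, str]]  # D_tr

# CIII: only seen-class training data and seen-class prototypes are used;
# the model is not tuned to any particular unseen class or testing instance.
def learn_ciii(D_tr: LabeledSet, T_s: Mapping[str, Prototype]) -> Classifier:
    ...

# CTII: unseen-class prototypes T_u are also used, so the model can be
# optimized for the specific unseen classes it will face.
def learn_ctii(D_tr: LabeledSet, T_s: Mapping[str, Prototype],
               T_u: Mapping[str, Prototype]) -> Classifier:
    ...

# CTIT: the unlabeled testing instances X_te are also used, so the model
# can further be optimized for the specific instances to be classified.
def learn_ctit(D_tr: LabeledSet, T_s: Mapping[str, Prototype],
               T_u: Mapping[str, Prototype],
               X_te: Sequence[Instance]) -> Classifier:
    ...
```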
We summarize the above three learning settings in Figure 1. As we can see, from CIII to CTIT, the classifier $f^u(\cdot)$ is learned with increasingly specific information about the testing instances. In machine-learning methods, as the distributions of the training and the testing instances are different, the performance of a model learned with the training instances decreases when it is applied to the testing instances. This phenomenon is more severe in zero-shot learning, as the classes covered by the training and the testing instances are disjoint [6, 41]. In zero-shot learning, this phenomenon is usually referred to as domain shift [41].
Under the CIII setting, as no information about the testing instances is involved in model learning, the problem of domain shift is severe in some methods under this setting. However, as the models under this setting are not optimized for specific unseen classes and testing instances, when