A Dual Attention Network with Semantic Embedding for Few-shot Learning
Shipeng Yan∗, Songyang Zhang∗, Xuming He†
School of Information Science and Technology, ShanghaiTech University
{yanshp, zhangsy1, hexm}@shanghaitech.edu.cn
Abstract
Despite recent success of deep neural networks, it remains
challenging to efficiently learn new visual concepts from lim-
ited training data. To address this problem, a prevailing strat-
egy is to build a meta-learner that learns prior knowledge
on learning from a small set of annotated data. However,
most existing meta-learning approaches rely on a global
representation of images and a meta-learner with complex
model structures, which are sensitive to background clutter
and difficult to interpret. We propose a novel meta-learning
method for few-shot classification based on two simple at-
tention mechanisms: one is a spatial attention to localize
relevant object regions and the other is a task attention to
select similar training data for label prediction. We imple-
ment our method via a dual-attention network and design a
semantic-aware meta-learning loss to train the meta-learner
network in an end-to-end manner. We validate our model on
three few-shot image classification datasets with an extensive
ablation study, and our approach shows competitive performance
on these datasets with fewer parameters. To facilitate
future research, the code and data splits are available at:
https://github.com/tonysy/STANet-PyTorch
1 Introduction
A particularly intriguing property of human cognition is
being able to learn a new concept from only a few examples,
which, despite the recent success of deep learning, remains
a challenging task for machine learning systems (Lake et
al. 2017). Such a few-shot learning problem setting has at-
tracted much attention recently, and in particular, for the
task of classification (Lake, Salakhutdinov, and Tenenbaum
2015; Vinyals et al. 2016; Triantafillou, Zemel, and Urta-
sun 2017). To tackle the issue of data deficiency, a pre-
vailing strategy of few-shot classification is to formulate
it as a meta-learning problem, aiming to learn a prior on
the few-shot classifiers from a set of similar classification
tasks (Vinyals et al. 2016; Mishra et al. 2018). Typically, a
meta-learner learns an embedding that maps the input into
a feature space and a predictor that transfers the label infor-
mation from the training set of each task to its test instance.
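The embedding-plus-predictor formulation above can be sketched in a minimal, matching-network-style form. This is an illustrative sketch only: the function names, the cosine-similarity measure, and the attention-weighted label vote are assumptions for exposition, not the model proposed in this paper.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def transfer_labels(support_feats, support_labels, query_feat, n_classes):
    """Predict class probabilities for a query embedding by attending
    over the support (training) embeddings of one few-shot task."""
    # Cosine similarity between the query and each support example.
    s = support_feats / np.linalg.norm(support_feats, axis=1, keepdims=True)
    q = query_feat / np.linalg.norm(query_feat)
    attn = softmax(s @ q)               # one weight per support example
    one_hot = np.eye(n_classes)[support_labels]
    return attn @ one_hot               # attention-weighted label vote
```

In this formulation the meta-learner's job reduces to learning the embedding that produces `support_feats` and `query_feat`, so that similarity in feature space aligns with class membership across tasks.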
∗Authors contributed equally and are listed in alphabetical order.
†In part supported by the NSFC Grant No. 61703195.
Copyright © 2019, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.
While this learning framework is capable of extracting an effective
meta-level prediction strategy, it suffers from several limitations
in the task of image classification. First, the i.i.d. assumption
on tasks tends to ignore the semantic relations between
image classes that reflect the intrinsic similarity between
individual tasks. This can lead to inefficient embedding
feature learning. Second, most existing works rely on
an off-the-shelf deep network to compute a holistic feature
of each input image, which is sensitive to nuisance variations,
e.g., background clutter. This makes it challenging to
learn an effective meta-learner, particularly for the methods
based on feature similarity. Moreover, recent attempts typi-
cally resort to learning complex prediction strategies to in-
corporate the context of training set in each task (Santoro et
al. 2016; Mishra et al. 2018), which are difficult to interpret
in terms of the prior knowledge that has been learned.
In this work, we aim to address the aforementioned weak-
nesses by a semantic-aware meta-learning framework, in
which we explicitly incorporate class sharing across tasks
and focus on only the semantically informative parts of input
images in each task. To this end, we make use of attention
mechanisms (Vaswani et al. 2017) to develop a novel mod-
ularized deep network for the problem of few-shot classi-
fication. Our deep network consists of two main modules:
an embedding network that computes a semantic-aware fea-
ture map for each image, and a meta-learning network that
learns a similarity-based classification strategy to transfer
the training label cues to a test example.
Specifically, given a few-shot classification task, our em-
bedding network first generates a convolutional feature map
for each image. Taking as input all these feature maps, the
meta-learning network then extracts a task-specific repre-
sentation of input data with a dual-attention mechanism,
which is used for few-shot class prediction. To achieve this,
the meta-learning network first infers a spatial attention map
for each image to capture relevant regions on the feature
maps and produces a selectively pooled feature vector for
every image (Xu et al. 2015). Given these image features,
the network employs a second attention module, referred to as
task attention, to compute an attention map over the train-
ing set of the task. This attention encodes the relevance of
each training example to the test image class in the task
and is used to calculate a context-aware representation of
the test instance (Vinyals et al. 2016) for its class prediction.
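The two attention steps described above can be sketched as follows. The sketch assumes flattened (H·W, C) feature maps and a learned spatial scoring vector `w_spatial`; these names and the dot-product scoring are hypothetical simplifications of the dual-attention network, not its actual parameterization.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def spatial_pool(feat_map, w_spatial):
    """Spatial attention: collapse an (H*W, C) feature map into one
    C-dim vector by softmax-weighting the H*W locations."""
    scores = feat_map @ w_spatial       # relevance score per location
    attn = softmax(scores)
    return attn @ feat_map              # selectively pooled feature

def task_attend(support_vecs, query_vec):
    """Task attention: weight each pooled support feature by its
    relevance to the query, yielding a context-aware representation."""
    attn = softmax(support_vecs @ query_vec)
    context = attn @ support_vecs
    return context, attn
```

Chaining the two modules, spatial attention suppresses background clutter within each image, while task attention selects the support examples most relevant to the query, and its weights are directly inspectable.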