Mixed Attention Mechanism for Small-Sample
Fine-grained Image Classification
Xiaoxu Li∗, Jijie Wu∗, Dongliang Chang†, Weifeng Huang‡, Zhanyu Ma† and Jie Cao∗
∗Lanzhou University of Technology, Lanzhou, China
E-mail: xiaoxulilut@gmail.com
†Beijing University of Posts and Telecommunications, Beijing, China
E-mail: mazhanyu@bupt.edu.cn
‡South-to-North Water Diversion Middle Route Information Technology Co., Ltd., China
E-mail: huangweifeng@nsbd.cn
Abstract—Fine-grained image classification is an important task in computer vision. The main challenges of the task are that inter-class similarity is high and that the training data points in each class are insufficient for training a deep neural network. Intuitively, if we can learn more discriminative and more detailed features from fine-grained images, the classification performance can be improved. Considering that channel attention can learn more discriminative features and that spatial attention can learn more detailed features, this paper proposes a new spatial attention mechanism by modifying the Squeeze-and-Excitation block, and a new mixed attention mechanism by combining the channel attention with the proposed spatial attention. Experimental results on two small-sample fine-grained image classification datasets demonstrate that, on both the VGG16 and ResNet-50 networks, the two proposed attention mechanisms achieve good performance and outperform the other fine-grained image classification methods compared.
I. INTRODUCTION
With the rapid development of deep learning, Convolutional Neural Networks (CNNs) have been widely used for fine-grained image classification, the task of distinguishing one subordinate category from others within the same superordinate category [1]. Fine-grained image classification based on CNNs has obtained impressive performance, either by replacing hand-crafted features with CNN features or by adopting an end-to-end fashion. However, big challenges remain, since inter-class similarity is high and the training data points in each class are insufficient in fine-grained images [2], [3], [4].
Works on fine-grained image classification based on CNNs mainly focus on learning more subtle and more discriminative features. Some works improved the network structure [5], [6], [4], some proposed a new loss [3], and some improved fine-grained classification by introducing an attention mechanism [7], [8], [9]. Attention mechanisms in neural networks are derived from the visual attention mechanism found in humans. Human visual attention focuses on a certain region of an image with “high resolution” while perceiving the surrounding image in “low resolution”, and then adjusts the focal point over time [10]. Fine-grained classification with an attention mechanism can learn more delicate differences than other methods [11].
There are three types of attention mechanisms: channel attention, such as the SE (Squeeze-and-Excitation) block [12]; spatial attention, such as the Spatial Transformer [13]; and mixed attention, such as the two-level attention model [8] and the recurrent attention model [10]. Channel attention aims to learn more discriminative features. The SE block [12] is a classical channel attention method, which focuses on the channel relationship and adaptively recalibrates channel-wise feature responses by explicitly modeling interdependencies between channels. Spatial attention aims to learn more detailed features. The Spatial Transformer [13] is a learnable and differentiable module that explicitly allows spatial manipulation of data and can be inserted into existing convolutional architectures. It is conditioned on the feature map itself and can learn invariance to scale, rotation, and so on. The Residual Attention Network [14] is built by stacking attention modules which generate attention-aware features.
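As a concrete illustration of the channel attention described above, the following is a minimal PyTorch sketch of an SE block. The class name is ours, and the reduction ratio of 16 is a commonly used default; only the squeeze (global average pooling) and excitation (bottleneck gating) structure follows [12].

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Squeeze-and-Excitation channel attention: recalibrates channel responses."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)              # squeeze: global spatial average
        self.fc = nn.Sequential(                         # excitation: channel-wise gating
            nn.Linear(channels, channels // reduction, bias=False),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels, bias=False),
            nn.Sigmoid(),
        )

    def forward(self, x):
        b, c, _, _ = x.shape
        w = self.pool(x).view(b, c)                      # (B, C): one descriptor per channel
        w = self.fc(w).view(b, c, 1, 1)                  # (B, C, 1, 1): weights in [0, 1]
        return x * w                                     # reweight the channel responses
```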
Mixed attention aims to learn more discriminative and more detailed features simultaneously. The two-level attention model [8] combines three types of attention: bottom-up attention, object-level top-down attention, and part-level top-down attention, which are responsible for proposing candidate patches, selecting patches relevant to a certain object, and localizing discriminative parts, respectively, in order to find object parts and extract discriminative features. The recurrent attention model [10] is a recurrent neural network that extracts information from an image by adaptively selecting a sequence of regions or locations and only processing the selected regions at high resolution. Compared with convolutional neural networks, it greatly reduces the amount of computation.
Intuitively, if we can learn more discriminative and more detailed features from fine-grained images, the classification performance can be improved. Therefore, building on existing mixed attention works, this paper proposes a new spatial attention mask by modifying the Squeeze-and-Excitation block, and a new mixed attention method by combining the channel attention with the proposed spatial attention.
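The precise definition of the proposed spatial mask and of its combination with channel attention is deferred to the method section. Purely as an illustration of the idea, the sketch below derives a spatial mask in the spirit of the SE block by squeezing the channel axis with 1×1 convolutions and gating every location, and then stacks it after the SEBlock from the previous sketch. The module names, the 1×1-convolution squeeze, and the sequential combination are our assumptions, not the authors' specification.

```python
import torch
import torch.nn as nn

class SpatialAttention(nn.Module):
    """Spatial mask in the spirit of an SE block: squeeze the channel axis with
    1x1 convolutions and gate every spatial location.
    Illustrative only -- not necessarily the construction used in this paper."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.gate = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, kernel_size=1, bias=False),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, 1, kernel_size=1, bias=False),
            nn.Sigmoid(),
        )

    def forward(self, x):
        return x * self.gate(x)          # (B, 1, H, W) mask broadcast over channels


class MixedAttention(nn.Module):
    """One possible combination: channel attention followed by the spatial mask."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.channel = SEBlock(channels, reduction)      # SEBlock from the earlier sketch
        self.spatial = SpatialAttention(channels, reduction)

    def forward(self, x):
        return self.spatial(self.channel(x))
```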
In order to evaluate the two proposed attention methods,
we use two widely used networks, VGG16 and ResNet-50,
and select two small-sample fine-grained image classification
datasets, the Stanford Cars-196 dataset and the FGVC-Aircraft