Wang et al. [27] propose the Residual Attention Network, which uses an encoder-decoder style attention module. By refining the feature maps, the network not only performs well but is also robust to noisy inputs. Instead of directly computing the 3D attention map, we decompose the process into steps that learn channel attention and spatial attention separately. This separate attention generation process for a 3D feature map incurs far less computational and parameter overhead, and the module can therefore be used as a plug-and-play component in pre-existing base CNN architectures.
Closer to our work, Hu et al. [28] introduce a compact module to exploit the inter-channel relationship. In their Squeeze-and-Excitation module, they use global average-pooled features to compute channel-wise attention. However, we show that these are suboptimal features for inferring fine channel attention, and we suggest using max-pooled features as well. Their module also misses spatial attention, which plays an important role in deciding ‘where’ to focus, as shown in [29]. In our CBAM, we exploit both spatial and channel-wise attention based on an efficient architecture and empirically verify that exploiting both is superior to using only channel-wise attention as in [28]. Moreover, we empirically show that our module is effective in detection tasks (MS COCO and VOC). In particular, we achieve state-of-the-art performance on the VOC 2007 test set simply by placing our module on top of an existing one-shot detector [30].
Concurrently, BAM [31] takes a similar approach, decomposing 3D attention map inference into channel and spatial attention. They place the BAM module at every bottleneck of the network, whereas we plug ours into every convolutional block.
3 Convolutional Block Attention Module
Given an intermediate feature map $\mathbf{F} \in \mathbb{R}^{C \times H \times W}$ as input, CBAM sequentially infers a 1D channel attention map $\mathbf{M}_c \in \mathbb{R}^{C \times 1 \times 1}$ and a 2D spatial attention map $\mathbf{M}_s \in \mathbb{R}^{1 \times H \times W}$, as illustrated in Fig. 1. The overall attention process can be summarized as:
\[
\mathbf{F}' = \mathbf{M}_c(\mathbf{F}) \otimes \mathbf{F}, \qquad
\mathbf{F}'' = \mathbf{M}_s(\mathbf{F}') \otimes \mathbf{F}',
\tag{1}
\]
where $\otimes$ denotes element-wise multiplication. During multiplication, the attention values are broadcasted (copied) accordingly: channel attention values are broadcasted along the spatial dimension, and vice versa. $\mathbf{F}''$ is the final refined output. Fig. 2 depicts the computation process of each attention map. The following describes the details of each attention module.
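As a concrete reference point before the per-module details, the snippet below is a minimal PyTorch-style sketch of the sequential process in Eq. (1). The average- and max-pooling choices mirror the aggregation discussed in this section, while the shared-MLP reduction ratio and the 7×7 convolution kernel size are assumptions of this sketch, not prescribed by the equation above.

```python
import torch
import torch.nn as nn


class ChannelAttention(nn.Module):
    """Infers M_c in R^{C x 1 x 1} from a feature map of shape (B, C, H, W)."""

    def __init__(self, channels: int, reduction: int = 16):  # reduction ratio is an assumption
        super().__init__()
        # Shared MLP applied to both the average- and max-pooled descriptors.
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = x.shape
        avg = self.mlp(x.mean(dim=(2, 3)))   # squeeze spatial dims by averaging
        mx = self.mlp(x.amax(dim=(2, 3)))    # squeeze spatial dims by max
        return torch.sigmoid(avg + mx).view(b, c, 1, 1)


class SpatialAttention(nn.Module):
    """Infers M_s in R^{1 x H x W} by pooling along the channel axis."""

    def __init__(self, kernel_size: int = 7):  # kernel size is an assumption
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        avg = x.mean(dim=1, keepdim=True)    # (B, 1, H, W)
        mx = x.amax(dim=1, keepdim=True)     # (B, 1, H, W)
        return torch.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))


class CBAM(nn.Module):
    """Eq. (1): F' = M_c(F) * F, then F'' = M_s(F') * F'."""

    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.channel_att = ChannelAttention(channels, reduction)
        self.spatial_att = SpatialAttention()

    def forward(self, f: torch.Tensor) -> torch.Tensor:
        # Attention values are broadcast during the element-wise product:
        # M_c over spatial positions, M_s over channels.
        f1 = self.channel_att(f) * f
        f2 = self.spatial_att(f1) * f1
        return f2
```

For example, `CBAM(channels=64)` maps a tensor of shape `(2, 64, 32, 32)` to a refined tensor of the same shape, which is what allows the module to be dropped after a convolutional block without changing the surrounding architecture.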
Channel attention module. We produce a channel attention map by exploiting the inter-channel relationship of features. As each channel of a feature map is considered a feature detector [32], channel attention focuses on ‘what’ is meaningful given an input image. To compute the channel attention efficiently, we squeeze the spatial dimension of the input feature map. For aggregating spatial information, average-pooling has been commonly adopted so far. Zhou et al.