sigmoid) and sequential techniques [11, 37]. Recent work
has shown its applicability to tasks such as image captioning
[4, 44] and lip reading [7], in which it is exploited to effi-
ciently aggregate multi-modal data. In these applications,
it is typically used on top of one or more layers represent-
ing higher-level abstractions for adaptation between modal-
ities. Highway networks [36] employ a gating mechanism
to regulate the shortcut connection, enabling the learning
of very deep architectures. Wang et al. [42] introduce
a powerful trunk-and-mask attention mechanism using an
hourglass module [27], inspired by its success in semantic
segmentation. This high capacity unit is inserted into deep
residual networks between intermediate stages. In contrast,
our proposed SE block is a lightweight gating mechanism,
specialised to model channel-wise relationships in a com-
putationally efficient manner and designed to enhance the
representational power of modules throughout the network.
3. Squeeze-and-Excitation Blocks
The Squeeze-and-Excitation block is a computational unit which can be constructed for any given transformation $\mathbf{F}_{tr}: \mathbf{X} \rightarrow \mathbf{U}$, $\mathbf{X} \in \mathbb{R}^{W' \times H' \times C'}$, $\mathbf{U} \in \mathbb{R}^{W \times H \times C}$.
For simplicity of exposition, in the notation that follows we take $\mathbf{F}_{tr}$ to be a standard convolutional operator. Let $\mathbf{V} = [\mathbf{v}_1, \mathbf{v}_2, \ldots, \mathbf{v}_C]$ denote the learned set of filter kernels, where $\mathbf{v}_c$ refers to the parameters of the $c$-th filter. We can then write the outputs of $\mathbf{F}_{tr}$ as $\mathbf{U} = [\mathbf{u}_1, \mathbf{u}_2, \ldots, \mathbf{u}_C]$, where
$$\mathbf{u}_c = \mathbf{v}_c * \mathbf{X} = \sum_{s=1}^{C'} \mathbf{v}_c^s * \mathbf{x}^s. \tag{1}$$
Here $*$ denotes convolution, $\mathbf{v}_c = [\mathbf{v}_c^1, \mathbf{v}_c^2, \ldots, \mathbf{v}_c^{C'}]$ and $\mathbf{X} = [\mathbf{x}^1, \mathbf{x}^2, \ldots, \mathbf{x}^{C'}]$ (to simplify the notation, bias terms are omitted). Here $\mathbf{v}_c^s$ is a 2D spatial kernel, and therefore represents a single channel of $\mathbf{v}_c$ which acts on the corresponding channel of $\mathbf{X}$. Since the output is produced by a summation through all channels, the channel dependencies are implicitly embedded in $\mathbf{v}_c$, but these dependencies
are entangled with the spatial correlation captured by the
filters. Our goal is to ensure that the network is able to in-
crease its sensitivity to informative features so that they can
be exploited by subsequent transformations, and to suppress
less useful ones. We propose to achieve this by explicitly
modelling channel interdependencies to recalibrate filter re-
sponses in two steps, squeeze and excitation, before they are
fed into the next transformation. A diagram of an SE building
block is shown in Fig. 1.
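To make the point about entangled channel dependencies concrete, the following sketch (ours, in PyTorch; the shapes and the 3×3 kernel are illustrative assumptions, not part of the paper) verifies Eq. (1) numerically: each $\mathbf{u}_c$ sums 2D convolutions over every input channel.

```python
# Minimal numerical check of Eq. (1), assuming F_tr is a plain 3x3 convolution.
# Shapes, kernel size and the use of PyTorch are illustrative choices;
# bias terms are omitted as in the text.
import torch
import torch.nn.functional as F

C_in, C_out, H, W = 4, 8, 16, 16            # C', C and spatial dimensions
X = torch.randn(1, C_in, H, W)              # input X (NCHW layout)
V = torch.randn(C_out, C_in, 3, 3)          # filter bank V = [v_1, ..., v_C]

U = F.conv2d(X, V, padding=1)               # U = F_tr(X)

# Eq. (1): u_c is a sum over input channels of 2D convolutions v_c^s * x^s,
# so every output channel implicitly mixes information from all input channels.
u_0 = sum(F.conv2d(X[:, s:s + 1], V[0:1, s:s + 1], padding=1) for s in range(C_in))
assert torch.allclose(U[:, 0:1], u_0, atol=1e-5)
```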
3.1. Squeeze: Global Information Embedding
In order to tackle the issue of exploiting channel depen-
dencies, we first consider the signal to each channel in the
output features. Each of the learned filters operates with a
local receptive field and consequently each unit of the trans-
formation output U is unable to exploit contextual informa-
tion outside of this region. This is an issue that becomes
more severe in the lower layers of the network whose re-
ceptive field sizes are small.
To mitigate this problem, we propose to squeeze global
spatial information into a channel descriptor. This is
achieved by using global average pooling to generate
channel-wise statistics. Formally, a statistic $\mathbf{z} \in \mathbb{R}^C$ is generated by shrinking $\mathbf{U}$ through spatial dimensions $W \times H$, where the $c$-th element of $\mathbf{z}$ is calculated by:
$$z_c = \mathbf{F}_{sq}(\mathbf{u}_c) = \frac{1}{W \times H} \sum_{i=1}^{W} \sum_{j=1}^{H} u_c(i, j). \tag{2}$$
Discussion. The transformation output $\mathbf{U}$ can be interpreted as a collection of local descriptors whose statistics are expressive for the whole image. Exploiting such information is prevalent in feature engineering work [31, 34, 45]. We opt for the simplest aggregation strategy, global average pooling, although more sophisticated strategies could be employed here as well.
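As a brief illustration (ours; the tensor shapes are arbitrary), the squeeze operation of Eq. (2) amounts to a spatial mean per channel:

```python
# Sketch of the squeeze step, Eq. (2): global average pooling over the spatial
# dimensions of U yields one scalar descriptor per channel.
import torch

N, C, H, W = 1, 8, 16, 16          # illustrative shapes
U = torch.randn(N, C, H, W)
z = U.mean(dim=(2, 3))             # z_c = (1 / (W*H)) * sum_{i,j} u_c(i, j)
# Equivalent: torch.nn.functional.adaptive_avg_pool2d(U, 1).flatten(1)
```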
3.2. Excitation: Adaptive Recalibration
To make use of the information aggregated in the squeeze
operation, we follow it with a second operation which aims
to fully capture channel-wise dependencies. To fulfil this
objective, the function must meet two criteria: first, it must
be flexible (in particular, it must be capable of learning
a nonlinear interaction between channels) and second, it
must learn a non-mutually-exclusive relationship, since multiple channels are allowed to be emphasised as opposed to a one-hot activation. To meet these criteria, we opt to employ a
simple gating mechanism with a sigmoid activation:
$$\mathbf{s} = \mathbf{F}_{ex}(\mathbf{z}, \mathbf{W}) = \sigma(g(\mathbf{z}, \mathbf{W})) = \sigma(\mathbf{W}_2 \delta(\mathbf{W}_1 \mathbf{z})), \tag{3}$$
where $\delta$ refers to the ReLU [26] function, $\mathbf{W}_1 \in \mathbb{R}^{\frac{C}{r} \times C}$ and $\mathbf{W}_2 \in \mathbb{R}^{C \times \frac{C}{r}}$. To limit model complexity and aid generalisation, we parameterise the gating mechanism by forming a bottleneck with two fully-connected (FC) layers around the non-linearity, i.e. a dimensionality-reduction layer with parameters $\mathbf{W}_1$ with reduction ratio $r$ (we set it to be 16, and this parameter choice is discussed in Sec. 6.3), a ReLU and then a dimensionality-increasing layer with parameters $\mathbf{W}_2$. The final output of the block is obtained by rescaling the transformation output $\mathbf{U}$ with the activations:
$$\tilde{\mathbf{x}}_c = \mathbf{F}_{scale}(\mathbf{u}_c, s_c) = s_c \cdot \mathbf{u}_c, \tag{4}$$
where $\tilde{\mathbf{X}} = [\tilde{\mathbf{x}}_1, \tilde{\mathbf{x}}_2, \ldots, \tilde{\mathbf{x}}_C]$ and $\mathbf{F}_{scale}(\mathbf{u}_c, s_c)$ refers to channel-wise multiplication between the feature map $\mathbf{u}_c \in \mathbb{R}^{W \times H}$ and the scalar $s_c$.
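Putting the pieces together, the sketch below (ours, in PyTorch; the module structure, layer names and the use of nn.Linear are implementation choices consistent with, but not prescribed by, the text) implements Eqs. (2)–(4) with reduction ratio $r = 16$:

```python
# Sketch of an SE block following Eqs. (2)-(4): squeeze by global average
# pooling, excitation through an FC bottleneck (W_1, ReLU, W_2, sigmoid),
# then channel-wise rescaling of U. Biases are kept for simplicity even
# though the paper's notation omits them.
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    def __init__(self, channels: int, r: int = 16):
        super().__init__()
        self.fc1 = nn.Linear(channels, channels // r)        # W_1: C -> C/r
        self.fc2 = nn.Linear(channels // r, channels)        # W_2: C/r -> C
        self.relu = nn.ReLU(inplace=True)
        self.sigmoid = nn.Sigmoid()

    def forward(self, u: torch.Tensor) -> torch.Tensor:
        n, c, _, _ = u.shape
        z = u.mean(dim=(2, 3))                               # squeeze, Eq. (2)
        s = self.sigmoid(self.fc2(self.relu(self.fc1(z))))   # excitation, Eq. (3)
        return u * s.view(n, c, 1, 1)                        # scale, Eq. (4)

x_tilde = SEBlock(channels=64)(torch.randn(2, 64, 32, 32))   # example usage
```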