Squeeze-and-Excitation Networks

Jie Hu [0000-0002-5150-1003], Li Shen [0000-0002-2283-4976], Samuel Albanie [0000-0001-9736-5134],
Gang Sun [0000-0001-6913-6799], and Enhua Wu [0000-0002-2174-1428]
Abstract—The central building block of convolutional neural networks (CNNs) is the convolution operator, which enables networks to
construct informative features by fusing both spatial and channel-wise information within local receptive fields at each layer. A broad
range of prior research has investigated the spatial component of this relationship, seeking to strengthen the representational power of
a CNN by enhancing the quality of spatial encodings throughout its feature hierarchy. In this work, we focus instead on the channel
relationship and propose a novel architectural unit, which we term the “Squeeze-and-Excitation” (SE) block, that adaptively recalibrates
channel-wise feature responses by explicitly modelling interdependencies between channels. We show that these blocks can be
stacked together to form SENet architectures that generalise extremely effectively across different datasets. We further demonstrate
that SE blocks bring significant improvements in performance for existing state-of-the-art CNNs at slight additional computational cost.
Squeeze-and-Excitation Networks formed the foundation of our ILSVRC 2017 classification submission which won first place and
reduced the top-5 error to 2.251%, surpassing the winning entry of 2016 by a relative improvement of ∼25%. Models and code are
available at https://github.com/hujie-frank/SENet.
Index Terms—Squeeze-and-Excitation, Image representations, Attention, Convolutional Neural Networks.
1 INTRODUCTION
Convolutional neural networks (CNNs) have proven
to be useful models for tackling a wide range of visual
tasks [1], [2], [3], [4]. At each convolutional layer in the net-
work, a collection of filters expresses neighbourhood spatial
connectivity patterns along input channels—fusing spatial
and channel-wise information together within local recep-
tive fields. By interleaving a series of convolutional layers
with non-linear activation functions and downsampling op-
erators, CNNs are able to produce image representations
that capture hierarchical patterns and attain global theo-
retical receptive fields. A central theme of computer vision
research is the search for more powerful representations that
capture only those properties of an image that are most
salient for a given task, enabling improved performance.
As a widely-used family of models for vision tasks, the
development of new neural network architecture designs
now represents a key frontier in this search. Recent research
has shown that the representations produced by CNNs can
be strengthened by integrating learning mechanisms into
the network that help capture spatial correlations between
features. One such approach, popularised by the Inception
family of architectures [5], [6], incorporates multi-scale pro-
cesses into network modules to achieve improved perfor-
mance. Further work has sought to better model spatial
dependencies [7], [8] and incorporate spatial attention into
the structure of the network [9].

• Jie Hu and Enhua Wu are with the State Key Laboratory of Computer
Science, Institute of Software, Chinese Academy of Sciences, Beijing,
100190, China. They are also with the University of Chinese Academy of
Sciences, Beijing, 100049, China. Jie Hu is also with Momenta and Enhua
Wu is also with the Faculty of Science and Technology & AI Center at
University of Macau. E-mail: hujie@ios.ac.cn, ehwu@umac.mo
• Gang Sun is with LIAMA-NLPR at the Institute of Automation, Chinese
Academy of Sciences. He is also with Momenta. E-mail: sungang@momenta.ai
• Li Shen and Samuel Albanie are with the Visual Geometry Group at the
University of Oxford. E-mail: {lishen,albanie}@robots.ox.ac.uk

In this paper, we investigate a different aspect of network
design - the relationship between channels. We introduce
a new architectural unit, which we term the Squeeze-and-
Excitation (SE) block, with the goal of improving the quality
of representations produced by a network by explicitly mod-
elling the interdependencies between the channels of its con-
volutional features. To this end, we propose a mechanism
that allows the network to perform feature recalibration,
through which it can learn to use global information to
selectively emphasise informative features and suppress less
useful ones.
The structure of the SE building block is depicted in
Fig. 1. For any given transformation $F_{tr}$ mapping the
input $X$ to the feature maps $U$, where $U \in \mathbb{R}^{H \times W \times C}$,
e.g. a convolution, we can construct a corresponding SE
block to perform feature recalibration. The features U are
first passed through a squeeze operation, which produces a
channel descriptor by aggregating feature maps across their
spatial dimensions (H × W ). The function of this descriptor
is to produce an embedding of the global distribution of
channel-wise feature responses, allowing information from
the global receptive field of the network to be used by
all its layers. The aggregation is followed by an excitation
operation, which takes the form of a simple self-gating
mechanism that takes the embedding as input and pro-
duces a collection of per-channel modulation weights. These
weights are applied to the feature maps U to generate
the output of the SE block which can be fed directly into
subsequent layers of the network.
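To make the data flow concrete, below is a minimal shape-level sketch of this squeeze-excite-rescale pipeline, assuming PyTorch tensors in NCHW layout; the function name and the placeholder gating function are illustrative and are not taken from the paper's released code.

```python
import torch

def se_recalibrate(U: torch.Tensor, excitation) -> torch.Tensor:
    """U: feature maps of shape (N, C, H, W); `excitation` is any self-gating
    function mapping a C-dimensional descriptor to C weights in [0, 1]."""
    N, C, H, W = U.shape
    z = U.mean(dim=(2, 3))            # squeeze: aggregate over H x W -> (N, C)
    s = excitation(z)                 # excitation: per-channel modulation weights, (N, C)
    return U * s.view(N, C, 1, 1)     # rescale each channel of U by its weight

# Toy usage with a placeholder gating function (the actual form is given in Section 3.2):
U = torch.randn(1, 8, 16, 16)
out = se_recalibrate(U, torch.sigmoid)
print(out.shape)  # torch.Size([1, 8, 16, 16])
```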
Fig. 1. A Squeeze-and-Excitation block.

It is possible to construct an SE network (SENet) by
simply stacking a collection of SE blocks. Moreover, these
SE blocks can also be used as a drop-in replacement for the
original block at a range of depths in the network architec-
ture (Section 6.4). While the template for the building block
is generic, the role it performs at different depths differs
throughout the network. In earlier layers, it excites infor-
mative features in a class-agnostic manner, strengthening
the shared low-level representations. In later layers, the SE
blocks become increasingly specialised, and respond to dif-
ferent inputs in a highly class-specific manner (Section 7.2).
As a consequence, the benefits of the feature recalibration
performed by SE blocks can be accumulated through the
network.
The design and development of new CNN architectures
is a difficult engineering task, typically requiring the se-
lection of many new hyperparameters and layer configura-
tions. By contrast, the structure of the SE block is simple and
can be used directly in existing state-of-the-art architectures
by replacing components with their SE counterparts, where
the performance can be effectively enhanced. SE blocks are
also computationally lightweight and impose only a slight
increase in model complexity and computational burden.
To provide evidence for these claims, we develop several
SENets and conduct an extensive evaluation on the Ima-
geNet dataset [10]. We also present results beyond ImageNet
that indicate that the benefits of our approach are not
restricted to a specific dataset or task. By making use of
SENets, we ranked first in the ILSVRC 2017 classification
competition. Our best model ensemble achieves a 2.251%
top-5 error on the test set¹. This represents roughly a 25%
relative improvement when compared to the winning entry
of the previous year (top-5 error of 2.991%).
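The quoted ∼25% figure follows directly from the two error rates:

$$\frac{2.991 - 2.251}{2.991} \approx 0.247 \approx 25\%.$$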
2 RELATED WORK
Deeper architectures. VGGNets [11] and Inception mod-
els [5] showed that increasing the depth of a network could
significantly increase the quality of representations that
it was capable of learning. By regulating the distribution
of the inputs to each layer, Batch Normalization (BN) [6]
added stability to the learning process in deep networks
and produced smoother optimisation surfaces [12]. Building
on these works, ResNets demonstrated that it was pos-
sible to learn considerably deeper and stronger networks
through the use of identity-based skip connections [13], [14].
Highway networks [15] introduced a gating mechanism to
regulate the flow of information along shortcut connections.
Following these works, there have been further reformula-
tions of the connections between network layers [16], [17],
which show promising improvements to the learning and
representational properties of deep networks.

1. http://image-net.org/challenges/LSVRC/2017/results
An alternative, but closely related line of research has
focused on methods to improve the functional form of
the computational elements contained within a network.
Grouped convolutions have proven to be a popular ap-
proach for increasing the cardinality of learned transforma-
tions [18], [19]. More flexible compositions of operators can
be achieved with multi-branch convolutions [5], [6], [20],
[21], which can be viewed as a natural extension of the
grouping operator. In prior work, cross-channel correlations
are typically mapped as new combinations of features, ei-
ther independently of spatial structure [22], [23] or jointly
by using standard convolutional filters [24] with 1 × 1
convolutions. Much of this research has concentrated on the
objective of reducing model and computational complexity,
reflecting an assumption that channel relationships can be
formulated as a composition of instance-agnostic functions
with local receptive fields. In contrast, we claim that provid-
ing the unit with a mechanism to explicitly model dynamic,
non-linear dependencies between channels using global in-
formation can ease the learning process, and significantly
enhance the representational power of the network.
Algorithmic Architecture Search. Alongside the works
described above, there is also a rich history of research
that aims to forgo manual architecture design and instead
seeks to learn the structure of the network automatically.
Much of the early work in this domain was conducted in
the neuro-evolution community, which established methods
for searching across network topologies with evolutionary
methods [25], [26]. While often computationally demand-
ing, evolutionary search has had notable successes which
include finding good memory cells for sequence models
[27], [28] and learning sophisticated architectures for large-
scale image classification [29], [30], [31]. With the goal of re-
ducing the computational burden of these methods, efficient
alternatives to this approach have been proposed based on
Lamarckian inheritance [32] and differentiable architecture
search [33].
By formulating architecture search as hyperparameter
optimisation, random search [34] and other more sophis-
ticated model-based optimisation techniques [35], [36] can
also be used to tackle the problem. Topology selection
as a path through a fabric of possible designs [37] and
direct architecture prediction [38], [39] have been proposed
as additional viable architecture search tools. Particularly
strong results have been achieved with techniques from
reinforcement learning [40], [41], [42], [43], [44]. SE blocks
can be used as atomic building blocks for these search
algorithms, and were demonstrated to be highly effective
in this capacity in concurrent work [45].
Attention and gating mechanisms. Attention can be in-
terpreted as a means of biasing the allocation of available
computational resources towards the most informative com-
ponents of a signal [46], [47], [48], [49], [50], [51]. Attention
mechanisms have demonstrated their utility across many
tasks including sequence learning [52], [53], localisation
and understanding in images [9], [54], image captioning
[55], [56] and lip reading [57]. In these applications, it
can be incorporated as an operator following one or more
layers representing higher-level abstractions for adaptation
between modalities. Some works provide interesting studies
into the combined use of spatial and channel attention [58],
[59]. Wang et al. [58] introduced a powerful trunk-and-mask
attention mechanism based on hourglass modules [8] that is
inserted between the intermediate stages of deep residual
networks. By contrast, our proposed SE block comprises a
lightweight gating mechanism which focuses on enhancing
the representational power of the network by modelling
channel-wise relationships in a computationally efficient
manner.
3 SQUEEZE-AND-EXCITATION BLOCKS
A Squeeze-and-Excitation block is a computational unit
which can be built upon a transformation $F_{tr}$ mapping an
input $X \in \mathbb{R}^{H' \times W' \times C'}$ to feature maps $U \in \mathbb{R}^{H \times W \times C}$.
In the notation that follows we take $F_{tr}$ to be a convolutional
operator and use $V = [v_1, v_2, \dots, v_C]$ to denote the learned
set of filter kernels, where $v_c$ refers to the parameters of the
$c$-th filter. We can then write the outputs as $U = [u_1, u_2, \dots, u_C]$,
where

$$u_c = v_c * X = \sum_{s=1}^{C'} v_c^s * x^s. \qquad (1)$$

Here $*$ denotes convolution, $v_c = [v_c^1, v_c^2, \dots, v_c^{C'}]$,
$X = [x^1, x^2, \dots, x^{C'}]$ and $u_c \in \mathbb{R}^{H \times W}$. $v_c^s$ is a 2D spatial
kernel representing a single channel of $v_c$ that acts on the
corresponding channel of $X$. To simplify the notation, bias
terms are omitted. Since the output is produced by a summation
through all channels, channel dependencies are implicitly
embedded in $v_c$, but are entangled with the local spatial
correlation captured by the filters. The channel relationships
modelled by convolution are inherently implicit and local
(except the ones at top-most layers). We expect the learning
of convolutional features to be enhanced by explicitly mod-
elling channel interdependencies, so that the network is able
to increase its sensitivity to informative features which can
be exploited by subsequent transformations. Consequently,
we would like to provide it with access to global information
and recalibrate filter responses in two steps, squeeze and
excitation, before they are fed into the next transformation.
A diagram illustrating the structure of an SE block is shown
in Fig. 1.
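As a sanity check on Eq. (1), the following sketch, assuming PyTorch, recomputes one output channel of a standard convolution as an explicit sum of single-channel 2D convolutions; the shapes and random seed are arbitrary.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Eq. (1): output channel u_c is the sum over input channels s of v_c^s * x^s.
torch.manual_seed(0)
Cin, Cout, H, W = 3, 4, 8, 8
x = torch.randn(1, Cin, H, W)
conv = nn.Conv2d(Cin, Cout, kernel_size=3, padding=1, bias=False)
u = conv(x)                                   # standard convolution, (1, Cout, H, W)

c = 0                                         # recompute channel c explicitly
u_c = sum(
    F.conv2d(x[:, s:s + 1], conv.weight[c:c + 1, s:s + 1], padding=1)
    for s in range(Cin)
)
print(torch.allclose(u[:, c:c + 1], u_c, atol=1e-6))  # True
```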
3.1 Squeeze: Global Information Embedding
In order to tackle the issue of exploiting channel depen-
dencies, we first consider the signal to each channel in the
output features. Each of the learned filters operates with
a local receptive field and consequently each unit of the
transformation output U is unable to exploit contextual
information outside of this region.
To mitigate this problem, we propose to squeeze global
spatial information into a channel descriptor. This is
achieved by using global average pooling to generate
channel-wise statistics. Formally, a statistic $z \in \mathbb{R}^C$ is
generated by shrinking $U$ through its spatial dimensions $H \times W$,
such that the $c$-th element of $z$ is calculated by:

$$z_c = F_{sq}(u_c) = \frac{1}{H \times W} \sum_{i=1}^{H} \sum_{j=1}^{W} u_c(i, j). \qquad (2)$$
Discussion. The output of the transformation U can be
interpreted as a collection of the local descriptors whose
statistics are expressive for the whole image. Exploiting
such information is prevalent in prior feature engineering
work [60], [61], [62]. We opt for the simplest aggregation
technique, global average pooling, noting that more sophis-
ticated strategies could be employed here as well.
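In code, the squeeze of Eq. (2) is a single reduction over the spatial dimensions; a minimal sketch assuming PyTorch, with arbitrary example shapes:

```python
import torch

U = torch.randn(2, 64, 56, 56)                  # feature maps (N, C, H, W)
z = U.mean(dim=(2, 3))                          # Eq. (2): z_c = (1/(H*W)) * sum_ij u_c(i, j)
print(z.shape)                                  # torch.Size([2, 64])

# Equivalently, via the pooling module:
z2 = torch.nn.AdaptiveAvgPool2d(1)(U).flatten(1)
print(torch.allclose(z, z2))                    # True
```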
3.2 Excitation: Adaptive Recalibration
To make use of the information aggregated in the squeeze
operation, we follow it with a second operation which aims
to fully capture channel-wise dependencies. To fulfil this
objective, the function must meet two criteria: first, it must
be flexible (in particular, it must be capable of learning
a nonlinear interaction between channels) and second, it
must learn a non-mutually-exclusive relationship since we
would like to ensure that multiple channels are allowed to
be emphasised (rather than enforcing a one-hot activation).
To meet these criteria, we opt to employ a simple gating
mechanism with a sigmoid activation:
$$s = F_{ex}(z, W) = \sigma(g(z, W)) = \sigma(W_2 \, \delta(W_1 z)), \qquad (3)$$

where $\delta$ refers to the ReLU [63] function, $W_1 \in \mathbb{R}^{\frac{C}{r} \times C}$ and
$W_2 \in \mathbb{R}^{C \times \frac{C}{r}}$. To limit model complexity and aid general-
isation, we parameterise the gating mechanism by forming
a bottleneck with two fully-connected (FC) layers around
the non-linearity, i.e. a dimensionality-reduction layer with
reduction ratio r (this parameter choice is discussed in Sec-
tion 6.1), a ReLU and then a dimensionality-increasing layer
returning to the channel dimension of the transformation
output U. The final output of the block is obtained by
rescaling U with the activations s:
$$\widetilde{x}_c = F_{scale}(u_c, s_c) = s_c \, u_c, \qquad (4)$$

where $\widetilde{X} = [\widetilde{x}_1, \widetilde{x}_2, \dots, \widetilde{x}_C]$ and $F_{scale}(u_c, s_c)$ refers to
channel-wise multiplication between the scalar $s_c$ and the
feature map $u_c \in \mathbb{R}^{H \times W}$.
Discussion. The excitation operator maps the input-
specific descriptor z to a set of channel weights. In this
regard, SE blocks intrinsically introduce dynamics condi-
tioned on the input, which can be regarded as a self-
attention function on channels whose relationships are not
confined to the local receptive field the convolutional filters
are responsive to.
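Putting Eqs. (2)-(4) together, a compact SE block might look as follows. This is an illustrative sketch assuming PyTorch rather than the authors' released implementation; r = 16 is used only as a common default for the reduction ratio (see Section 6.1), and bias terms are dropped to mirror Eq. (3).

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Squeeze-and-Excitation block: recalibrates the channels of U (N, C, H, W)."""
    def __init__(self, channels: int, r: int = 16):
        super().__init__()
        self.fc1 = nn.Linear(channels, channels // r, bias=False)  # W1: C -> C/r
        self.fc2 = nn.Linear(channels // r, channels, bias=False)  # W2: C/r -> C

    def forward(self, U: torch.Tensor) -> torch.Tensor:
        z = U.mean(dim=(2, 3))                                     # squeeze, Eq. (2)
        s = torch.sigmoid(self.fc2(torch.relu(self.fc1(z))))       # excitation, Eq. (3)
        return U * s.unsqueeze(-1).unsqueeze(-1)                   # rescale, Eq. (4)

# Usage: recalibrate the output U of any transformation F_tr (here a plain conv).
x = torch.randn(2, 64, 32, 32)
U = nn.Conv2d(64, 64, kernel_size=3, padding=1)(x)
print(SEBlock(64)(U).shape)                                        # torch.Size([2, 64, 32, 32])
```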