by the attention vector but also adds an identity connection.
GCT can be written as:
s = F_gct(X, θ) = tanh(γ CN(α Norm(X)) + β)   (14)
Y = sX + X,   (15)
where α, β and γ are trainable parameters. Norm(·) indicates the L2-norm of each channel and CN(·) denotes channel normalization.
A GCT block has fewer parameters than an SE block and, being lightweight, can be added after each convolutional layer of a CNN.
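As an illustration of Eqs. (14)–(15), the following is a minimal PyTorch-style sketch of a GCT block; the small epsilon terms for numerical stability and the parameter initialization are assumptions rather than details of the reference implementation.

```python
import torch
import torch.nn as nn

class GCT(nn.Module):
    """Sketch of a gated channel transformation (GCT) block, Eqs. (14)-(15)."""
    def __init__(self, channels, eps=1e-5):  # eps: assumed stability constant
        super().__init__()
        # Trainable per-channel parameters alpha, beta, gamma.
        self.alpha = nn.Parameter(torch.ones(1, channels, 1, 1))
        self.beta = nn.Parameter(torch.zeros(1, channels, 1, 1))
        self.gamma = nn.Parameter(torch.zeros(1, channels, 1, 1))
        self.eps = eps

    def forward(self, x):                                   # x: (N, C, H, W)
        # Norm(X): L2 norm of each channel over its spatial positions, scaled by alpha.
        embedding = self.alpha * x.pow(2).sum(dim=(2, 3), keepdim=True).add(self.eps).sqrt()
        # CN(.): channel normalization (normalize the embedding across the channel axis).
        norm = embedding.pow(2).mean(dim=1, keepdim=True).add(self.eps).sqrt()
        s = torch.tanh(self.gamma * (embedding / norm) + self.beta)  # Eq. (14)
        return x * s + x                                     # Eq. (15): Y = sX + X
```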
3.2.5 ECANet
To avoid high model complexity, SENet reduces the number
of channels. However, this strategy fails to directly model
correspondence between weight vectors and inputs, reducing
the quality of results. To overcome this drawback, Wang et al. [37] proposed the efficient channel attention (ECA) block, which uses a 1D convolution to determine the interaction between channels instead of dimensionality reduction.
An ECA block has a similar formulation to an SE block, including a squeeze module for aggregating global spatial information and an efficient excitation module for modeling cross-channel interaction. Instead of indirect correspondence, an ECA block only considers direct interaction between each channel and its k-nearest neighbors to control model complexity. Overall, the formulation of an ECA block is:
s = F_eca(X, θ) = σ(Conv1D(GAP(X)))   (16)
Y = sX,   (17)
where Conv1D(·) denotes a 1D convolution with a kernel of size k across the channel domain, modeling local cross-channel interaction. The parameter k determines the coverage of the interaction; in ECA, the kernel size k is adaptively determined from the channel dimensionality C instead of by manual tuning via cross-validation:
k = ψ(C) = | log₂(C)/γ + b/γ |_odd   (18)
where γ and b are hyperparameters and |x|_odd indicates the nearest odd number to x.
Compared to SENet, ECANet has an improved excitation
module, and provides an efficient and effective block which
can readily be incorporated into various CNNs.
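To make Eqs. (16)–(18) concrete, here is a minimal PyTorch-style sketch of an ECA block; the defaults γ = 2 and b = 1 are assumed hyperparameter values, and details may differ from the reference implementation.

```python
import math
import torch
import torch.nn as nn

class ECA(nn.Module):
    """Sketch of an efficient channel attention (ECA) block, Eqs. (16)-(18)."""
    def __init__(self, channels, gamma=2, b=1):  # gamma, b: assumed defaults
        super().__init__()
        # Eq. (18): kernel size k adapted to the channel dimensionality C,
        # rounded to the nearest odd number.
        t = int(abs(math.log2(channels) / gamma + b / gamma))
        k = t if t % 2 == 1 else t + 1
        self.gap = nn.AdaptiveAvgPool2d(1)                      # squeeze: GAP
        # 1D convolution across channels: local interaction between each
        # channel and its k nearest neighbours.
        self.conv = nn.Conv1d(1, 1, kernel_size=k, padding=k // 2, bias=False)

    def forward(self, x):                                       # x: (N, C, H, W)
        y = self.gap(x).squeeze(-1).transpose(1, 2)             # (N, 1, C)
        s = torch.sigmoid(self.conv(y))                         # Eq. (16)
        return x * s.transpose(1, 2).unsqueeze(-1)              # Eq. (17): Y = sX
```

For example, with C = 512 and the assumed γ = 2, b = 1, Eq. (18) gives log₂(512)/2 + 1/2 = 5, so a kernel of size k = 5 is used.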
3.2.6 FcaNet
Using only global average pooling in the squeeze module limits representational ability. To obtain a more powerful representation, Qin et al. [57] rethought the capture of global information from the viewpoint of compression and analysed global average pooling in the frequency domain.
They proved that global average pooling is a special case of
the discrete cosine transform (DCT) and used this observa-
tion to propose a novel multi-spectral channel attention.
Given an input feature map X ∈ R^{C×H×W}, multi-spectral channel attention first splits X into many parts x_i ∈ R^{C′×H×W}. Then it applies a 2D DCT to each part x_i. Note that a 2D DCT can use pre-processing results to reduce computation. After processing each part, all results are concatenated into a vector. Finally, fully connected layers, ReLU activation and a sigmoid are used to get the attention vector as in an SE block. This can be formulated as:
s = F_fca(X, θ) = σ(W_2 δ(W_1 [DCT(Group(X))]))   (19)
Y = sX,   (20)
where Group(·) indicates dividing the input into many groups and DCT(·) is the 2D discrete cosine transform. This work, based on information compression and the discrete cosine transform, achieves excellent performance on classification tasks.
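As a sketch of the idea, the code below implements a simplified multi-spectral channel attention layer following Eqs. (19)–(20). The frequency components assigned to the channel groups, the fixed input spatial size, and the reduction ratio r = 16 are assumptions; the paper selects its frequency components empirically.

```python
import math
import torch
import torch.nn as nn

def dct_basis_2d(h, w, u, v):
    """2D DCT basis function for frequency component (u, v) on an h x w grid."""
    bx = torch.cos(math.pi * (torch.arange(h).float() + 0.5) * u / h)   # (h,)
    by = torch.cos(math.pi * (torch.arange(w).float() + 0.5) * v / w)   # (w,)
    return bx[:, None] * by[None, :]                                    # (h, w)

class MultiSpectralChannelAttention(nn.Module):
    """Sketch of multi-spectral channel attention, Eqs. (19)-(20).
    Assumes the input spatial size matches (h, w) given at construction."""
    def __init__(self, channels, h, w,
                 freqs=((0, 0), (0, 1), (1, 0), (1, 1)),  # assumed frequency components
                 r=16):                                    # assumed reduction ratio
        super().__init__()
        assert channels % len(freqs) == 0
        self.groups = len(freqs)
        # One fixed DCT basis per channel group (Group + DCT in Eq. (19)).
        self.register_buffer(
            "basis", torch.stack([dct_basis_2d(h, w, u, v) for u, v in freqs]))
        self.fc = nn.Sequential(                           # sigma(W2 delta(W1 .))
            nn.Linear(channels, channels // r), nn.ReLU(inplace=True),
            nn.Linear(channels // r, channels), nn.Sigmoid())

    def forward(self, x):                                  # x: (N, C, H, W)
        n, c, h, w = x.shape
        xg = x.reshape(n, self.groups, c // self.groups, h, w)
        # Project each group onto its DCT basis and sum over spatial positions,
        # then concatenate the per-group results into one vector.
        freq = (xg * self.basis[None, :, None]).sum(dim=(3, 4)).reshape(n, c)
        s = self.fc(freq)                                  # Eq. (19): attention vector
        return x * s.view(n, c, 1, 1)                      # Eq. (20): Y = sX
```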
3.2.7 EncNet
Inspired by SENet, Zhang et al. [53] proposed the context
encoding module (CEM) incorporating semantic encoding loss
(SE-loss) to model the relationship between scene context
and the probabilities of object categories, thus utilizing global
scene contextual information for semantic segmentation.
Given an input feature map X ∈ R^{C×H×W}, a CEM first learns K cluster centers D = {d_1, . . . , d_K} and a set of smoothing factors S = {s_1, . . . , s_K} in the training phase. Next, it sums the difference between the local descriptors in the input and the corresponding cluster centers using soft-assignment weights to obtain a permutation-invariant descriptor. Then, it applies aggregation to the descriptors of the K cluster centers instead of concatenation, for computational efficiency. Formally, CEM can be written as:
e_k = Σ_{i=1}^{N} [ e^{−s_k ||X_i − d_k||²} / ( Σ_{j=1}^{K} e^{−s_j ||X_i − d_j||²} ) ] (X_i − d_k)   (21)
e = Σ_{k=1}^{K} φ(e_k)   (22)
s = σ(W e)   (23)
Y = sX,   (24)
where d_k ∈ R^C and s_k ∈ R are learnable parameters and φ denotes batch normalization with ReLU activation. In addition to channel-wise scaling vectors, the compact contextual descriptor e is also used to compute the SE-loss to regularize training, which improves the segmentation of small objects.
Not only does CEM enhance class-dependent feature maps, it also forces the network to consider large and small objects equally by incorporating the SE-loss. Due to its lightweight architecture, CEM can be applied to various backbones with low computational overhead.
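The following is a minimal PyTorch-style sketch of the CEM computation in Eqs. (21)–(24); the number of codewords K, the direct (memory-heavy) computation of all residuals, and the optional head producing logits for the SE-loss are assumptions, not the authors' exact configuration.

```python
import torch
import torch.nn as nn

class ContextEncodingModule(nn.Module):
    """Sketch of EncNet's context encoding module (CEM), Eqs. (21)-(24)."""
    def __init__(self, channels, num_codes=32, num_classes=None):  # K = num_codes (assumed)
        super().__init__()
        # Learnable cluster centers d_k and smoothing factors s_k.
        self.codewords = nn.Parameter(torch.randn(num_codes, channels))
        self.smoothing = nn.Parameter(torch.ones(num_codes))
        self.phi = nn.Sequential(nn.BatchNorm1d(num_codes), nn.ReLU(inplace=True))
        self.fc = nn.Linear(channels, channels)            # W in Eq. (23)
        # Optional head predicting category presence, used for the SE-loss.
        self.se_head = nn.Linear(channels, num_classes) if num_classes else None

    def forward(self, x):                                  # x: (N, C, H, W)
        n, c, h, w = x.shape
        feats = x.reshape(n, c, -1).transpose(1, 2)        # (N, HW, C) local descriptors X_i
        # Residuals X_i - d_k for every descriptor/codeword pair: (N, HW, K, C).
        resid = feats.unsqueeze(2) - self.codewords[None, None]
        # Soft-assignment weights of Eq. (21), normalized over the K codewords.
        logits = -self.smoothing[None, None] * resid.pow(2).sum(dim=-1)   # (N, HW, K)
        assign = torch.softmax(logits, dim=2)
        e_k = (assign.unsqueeze(-1) * resid).sum(dim=1)    # Eq. (21): (N, K, C)
        e = self.phi(e_k).sum(dim=1)                       # Eq. (22): aggregate over K
        s = torch.sigmoid(self.fc(e))                      # Eq. (23)
        y = x * s.view(n, c, 1, 1)                         # Eq. (24): Y = sX
        se_logits = self.se_head(e) if self.se_head is not None else None
        return y, se_logits
```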
3.2.8 Bilinear Attention
Following GSoP-Net [54], Fang et al. [146] claimed that
previous attention models only use first-order information
and disregard higher-order statistical information. They thus
proposed a new bilinear attention block (bi-attention) to capture
local pairwise feature interactions within each channel, while
preserving spatial information.
Bi-attention employs the attention-in-attention (AiA) mech-
anism to capture second-order statistical information: the
outer point-wise channel attention vectors are computed
from the output of the inner channel attention. Formally,