by the attention vector but also adds an identity connection.
GCT can be written as:
s = F_gct(X, θ) = tanh(γ CN(α Norm(X)) + β)   (14)
Y = sX + X,   (15)
where α, β and γ are trainable parameters. Norm(·) indicates the L2-norm of each channel and CN(·) denotes channel normalization.
A GCT block has fewer parameters than an SE block and, being lightweight, can be added after each convolutional layer of a CNN.
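As an illustration of Eqs. (14)–(15), the following is a minimal PyTorch-style sketch of a GCT block; the small epsilon terms for numerical stability and the parameter initialization are assumptions rather than details of the reference implementation.

```python
import torch
import torch.nn as nn

class GCT(nn.Module):
    """Sketch of a gated channel transformation (GCT) block, Eqs. (14)-(15)."""
    def __init__(self, channels, eps=1e-5):  # eps: assumed stability constant
        super().__init__()
        # Trainable per-channel parameters alpha, beta, gamma.
        self.alpha = nn.Parameter(torch.ones(1, channels, 1, 1))
        self.beta = nn.Parameter(torch.zeros(1, channels, 1, 1))
        self.gamma = nn.Parameter(torch.zeros(1, channels, 1, 1))
        self.eps = eps

    def forward(self, x):                                   # x: (N, C, H, W)
        # Norm(X): L2 norm of each channel over its spatial positions, scaled by alpha.
        embedding = self.alpha * x.pow(2).sum(dim=(2, 3), keepdim=True).add(self.eps).sqrt()
        # CN(.): channel normalization (normalize the embedding across the channel axis).
        norm = embedding.pow(2).mean(dim=1, keepdim=True).add(self.eps).sqrt()
        s = torch.tanh(self.gamma * (embedding / norm) + self.beta)  # Eq. (14)
        return x * s + x                                     # Eq. (15): Y = sX + X
```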
3.2.5 ECANet
To avoid high model complexity, SENet reduces the number
of channels. However, this strategy fails to directly model
correspondence between weight vectors and inputs, reducing
the quality of results. To overcome this drawback, Wang et al. [37] proposed the efficient channel attention (ECA) block, which uses a 1D convolution to determine the interaction between channels instead of dimensionality reduction.
An ECA block has a similar formulation to an SE block, including a squeeze module for aggregating global spatial information and an efficient excitation module for modeling cross-channel interaction. Instead of indirect correspondence, an ECA block only considers direct interaction between each channel and its k-nearest neighbors to control model complexity. Overall, the formulation of an ECA block is:
s = F_eca(X, θ) = σ(Conv1D(GAP(X)))   (16)
Y = sX,   (17)
where Conv1D(·) denotes a 1D convolution with a kernel of size k across the channel domain, modeling local cross-channel interaction. The parameter k determines the coverage of the interaction; in ECA, the kernel size k is adaptively determined from the channel dimensionality C instead of by manual tuning via cross-validation:
k = ψ(C) = | log₂(C)/γ + b/γ |_odd   (18)
where γ and b are hyperparameters and |x|_odd indicates the nearest odd number to x.
Compared to SENet, ECANet has an improved excitation
module, and provides an efficient and effective block which
can readily be incorporated into various CNNs.
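To make Eqs. (16)–(18) concrete, here is a minimal PyTorch-style sketch of an ECA block; the defaults γ = 2 and b = 1 are assumed hyperparameter values, and details may differ from the reference implementation.

```python
import math
import torch
import torch.nn as nn

class ECA(nn.Module):
    """Sketch of an efficient channel attention (ECA) block, Eqs. (16)-(18)."""
    def __init__(self, channels, gamma=2, b=1):  # gamma, b: assumed defaults
        super().__init__()
        # Eq. (18): kernel size k adapted to the channel dimensionality C,
        # rounded to the nearest odd number.
        t = int(abs(math.log2(channels) / gamma + b / gamma))
        k = t if t % 2 == 1 else t + 1
        self.gap = nn.AdaptiveAvgPool2d(1)                      # squeeze: GAP
        # 1D convolution across channels: local interaction between each
        # channel and its k nearest neighbours.
        self.conv = nn.Conv1d(1, 1, kernel_size=k, padding=k // 2, bias=False)

    def forward(self, x):                                       # x: (N, C, H, W)
        y = self.gap(x).squeeze(-1).transpose(1, 2)             # (N, 1, C)
        s = torch.sigmoid(self.conv(y))                         # Eq. (16)
        return x * s.transpose(1, 2).unsqueeze(-1)              # Eq. (17): Y = sX
```

For example, with C = 512 and the assumed γ = 2, b = 1, Eq. (18) gives log₂(512)/2 + 1/2 = 5, so a kernel of size k = 5 is used.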
3.2.6 FcaNet
Using only global average pooling in the squeeze module limits representational ability. To obtain a more powerful representation, Qin et al. [57] rethought the capture of global information from the viewpoint of compression and analysed global average pooling in the frequency domain.
They proved that global average pooling is a special case of
the discrete cosine transform (DCT) and used this observa-
tion to propose a novel multi-spectral channel attention.
Given an input feature map X ∈ R^{C×H×W}, multi-spectral channel attention first splits X into many parts x_i ∈ R^{C′×H×W}. Then it applies a 2D DCT to each part x_i. Note that a 2D DCT can use pre-processing results to reduce computation. After processing each part, all results are concatenated into a vector. Finally, fully connected layers, ReLU activation and a sigmoid are used to get the attention vector as in an SE block. This can be formulated as:
s = F_fca(X, θ) = σ(W_2 δ(W_1 [DCT(Group(X))]))   (19)
Y = sX,   (20)
where Group(·) indicates dividing the input into many groups and DCT(·) is the 2D discrete cosine transform. This work, based on information compression and the discrete cosine transform, achieves excellent performance on classification tasks.
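As a sketch of the idea, the code below implements a simplified multi-spectral channel attention layer following Eqs. (19)–(20). The frequency components assigned to the channel groups, the fixed input spatial size, and the reduction ratio r = 16 are assumptions; the paper selects its frequency components empirically.

```python
import math
import torch
import torch.nn as nn

def dct_basis_2d(h, w, u, v):
    """2D DCT basis function for frequency component (u, v) on an h x w grid."""
    bx = torch.cos(math.pi * (torch.arange(h).float() + 0.5) * u / h)   # (h,)
    by = torch.cos(math.pi * (torch.arange(w).float() + 0.5) * v / w)   # (w,)
    return bx[:, None] * by[None, :]                                    # (h, w)

class MultiSpectralChannelAttention(nn.Module):
    """Sketch of multi-spectral channel attention, Eqs. (19)-(20).
    Assumes the input spatial size matches (h, w) given at construction."""
    def __init__(self, channels, h, w,
                 freqs=((0, 0), (0, 1), (1, 0), (1, 1)),  # assumed frequency components
                 r=16):                                    # assumed reduction ratio
        super().__init__()
        assert channels % len(freqs) == 0
        self.groups = len(freqs)
        # One fixed DCT basis per channel group (Group + DCT in Eq. (19)).
        self.register_buffer(
            "basis", torch.stack([dct_basis_2d(h, w, u, v) for u, v in freqs]))
        self.fc = nn.Sequential(                           # sigma(W2 delta(W1 .))
            nn.Linear(channels, channels // r), nn.ReLU(inplace=True),
            nn.Linear(channels // r, channels), nn.Sigmoid())

    def forward(self, x):                                  # x: (N, C, H, W)
        n, c, h, w = x.shape
        xg = x.reshape(n, self.groups, c // self.groups, h, w)
        # Project each group onto its DCT basis and sum over spatial positions,
        # then concatenate the per-group results into one vector.
        freq = (xg * self.basis[None, :, None]).sum(dim=(3, 4)).reshape(n, c)
        s = self.fc(freq)                                  # Eq. (19): attention vector
        return x * s.view(n, c, 1, 1)                      # Eq. (20): Y = sX
```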
3.2.7 EncNet
Inspired by SENet, Zhang et al. [53] proposed the context
encoding module (CEM) incorporating semantic encoding loss
(SE-loss) to model the relationship between scene context
and the probabilities of object categories, thus utilizing global
scene contextual information for semantic segmentation.
Given an input feature map X ∈ R^{C×H×W}, a CEM first learns K cluster centers D = {d_1, . . . , d_K} and a set of smoothing factors S = {s_1, . . . , s_K} in the training phase. Next, it sums the difference between the local descriptors in the input and the corresponding cluster centers using soft-assignment weights to obtain a permutation-invariant descriptor. Then, it applies aggregation to the descriptors of the K cluster centers instead of concatenation, for computational efficiency. Formally, CEM can be written as:
e_k = Σ_{i=1}^{N} [ e^{−s_k ||X_i − d_k||²} / ( Σ_{j=1}^{K} e^{−s_j ||X_i − d_j||²} ) ] (X_i − d_k)   (21)
e = Σ_{k=1}^{K} φ(e_k)   (22)
s = σ(W e)   (23)
Y = sX,   (24)
where d_k ∈ R^C and s_k ∈ R are learnable parameters and φ denotes batch normalization with ReLU activation. In addition to channel-wise scaling vectors, the compact contextual descriptor e is also used to compute the SE-loss to regularize training, which improves the segmentation of small objects.
Not only does CEM enhance class-dependent feature maps, it also forces the network to consider large and small objects equally by incorporating the SE-loss. Due to its lightweight architecture, CEM can be applied to various backbones with low computational overhead.
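The following is a minimal PyTorch-style sketch of the CEM computation in Eqs. (21)–(24); the number of codewords K, the direct (memory-heavy) computation of all residuals, and the optional head producing logits for the SE-loss are assumptions, not the authors' exact configuration.

```python
import torch
import torch.nn as nn

class ContextEncodingModule(nn.Module):
    """Sketch of EncNet's context encoding module (CEM), Eqs. (21)-(24)."""
    def __init__(self, channels, num_codes=32, num_classes=None):  # K = num_codes (assumed)
        super().__init__()
        # Learnable cluster centers d_k and smoothing factors s_k.
        self.codewords = nn.Parameter(torch.randn(num_codes, channels))
        self.smoothing = nn.Parameter(torch.ones(num_codes))
        self.phi = nn.Sequential(nn.BatchNorm1d(num_codes), nn.ReLU(inplace=True))
        self.fc = nn.Linear(channels, channels)            # W in Eq. (23)
        # Optional head predicting category presence, used for the SE-loss.
        self.se_head = nn.Linear(channels, num_classes) if num_classes else None

    def forward(self, x):                                  # x: (N, C, H, W)
        n, c, h, w = x.shape
        feats = x.reshape(n, c, -1).transpose(1, 2)        # (N, HW, C) local descriptors X_i
        # Residuals X_i - d_k for every descriptor/codeword pair: (N, HW, K, C).
        resid = feats.unsqueeze(2) - self.codewords[None, None]
        # Soft-assignment weights of Eq. (21), normalized over the K codewords.
        logits = -self.smoothing[None, None] * resid.pow(2).sum(dim=-1)   # (N, HW, K)
        assign = torch.softmax(logits, dim=2)
        e_k = (assign.unsqueeze(-1) * resid).sum(dim=1)    # Eq. (21): (N, K, C)
        e = self.phi(e_k).sum(dim=1)                       # Eq. (22): aggregate over K
        s = torch.sigmoid(self.fc(e))                      # Eq. (23)
        y = x * s.view(n, c, 1, 1)                         # Eq. (24): Y = sX
        se_logits = self.se_head(e) if self.se_head is not None else None
        return y, se_logits
```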
3.2.8 Bilinear Attention
Following GSoP-Net [54], Fang et al. [146] claimed that
previous attention models only use first-order information
and disregard higher-order statistical information. They thus
proposed a new bilinear attention block (bi-attention) to capture
local pairwise feature interactions within each channel, while
preserving spatial information.
Bi-attention employs the attention-in-attention (AiA) mech-
anism to capture second-order statistical information: the
outer point-wise channel attention vectors are computed
from the output of the inner channel attention. Formally,