The theorem is proven constructively by selecting the parameters of the multi-head self-attention
layer so that the latter acts like a convolutional layer. In the proposed construction, the attention
scores of each self-attention head should attend to a different relative shift within the set $\Delta\!\!\Delta_K = \{-\lfloor K/2 \rfloor, \dots, \lfloor K/2 \rfloor\}^2$ of all pixel shifts in a $K \times K$ kernel. The exact condition can be found in
the statement of Lemma 1.
Then, Lemma 2 shows that the aforementioned condition is satisfied for the relative positional en-
coding that we refer to as the quadratic encoding:
$$v^{(h)} := -\alpha^{(h)}\,\big(1,\, -2\Delta^{(h)}_1,\, -2\Delta^{(h)}_2\big) \qquad r_{\delta} := \big(\lVert\delta\rVert^2,\, \delta_1,\, \delta_2\big) \qquad W_{qry} = W_{key} := 0 \qquad \widehat{W}_{key} := I \qquad (9)$$
The learned parameters $\Delta^{(h)} = (\Delta^{(h)}_1, \Delta^{(h)}_2)$ and $\alpha^{(h)}$ determine the center and width of attention
of each head, respectively. On the other hand, $\delta = (\delta_1, \delta_2)$ is fixed and expresses the relative shift
between query and key pixels.
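To make the roles of $\alpha^{(h)}$ and $\Delta^{(h)}$ concrete, the following minimal numpy sketch (purely illustrative; the function name, the choice $K = 5$, and the values of $\Delta^{(h)}$ and $\alpha^{(h)}$ are arbitrary) evaluates the attention probabilities that the quadratic encoding of eq. (9) assigns to every relative shift of a $K \times K$ window. As $\alpha^{(h)}$ grows, the softmax concentrates on the single shift $\Delta^{(h)}$, which is precisely the condition required by Lemma 1.

```python
import numpy as np

def quadratic_attention(delta_center, alpha, K=5):
    """Attention probabilities over all relative shifts of a K x K window
    for one head under the quadratic encoding of eq. (9)."""
    half = K // 2
    shifts = np.array([(d1, d2) for d1 in range(-half, half + 1)
                                for d2 in range(-half, half + 1)])      # the set Delta_K
    # r_delta = (||delta||^2, delta_1, delta_2), v^(h) = -alpha * (1, -2*Delta_1, -2*Delta_2)
    r = np.concatenate([np.sum(shifts ** 2, axis=1, keepdims=True), shifts], axis=1)
    v = -alpha * np.array([1.0, -2.0 * delta_center[0], -2.0 * delta_center[1]])
    scores = r @ v          # equals -alpha * (||delta - Delta||^2 - ||Delta||^2)
    probs = np.exp(scores - scores.max())
    return (probs / probs.sum()).reshape(K, K)

# With a moderately large alpha, the distribution is (numerically) one-hot at the
# shift Delta^(h) = (1, -1): the attention pattern required by Lemma 1.
print(np.round(quadratic_attention(delta_center=(1, -1), alpha=10.0), 3))
```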
It is important to stress that the above encoding is not the only one for which the conditions of
Lemma 1 are satisfied. In fact, in our experiments, the relative encoding learned by the neural
network also matched the conditions of the lemma (despite being different from the quadratic encoding). Nevertheless, the encoding defined above is very efficient in terms of size, as only $D_p = 3$
dimensions suffice to encode the relative position of pixels, while also reaching similar or better
empirical performance (than the learned one).
The theorem covers the general convolution operator as defined in eq. (17). However, machine
learning practitioners using differentiable programming frameworks (Paszke et al., 2017; Abadi et al.,
2015) might wonder whether the theorem holds for all hyper-parameters of 2D convolutional layers:
• Padding: a multi-head self-attention layer uses by default the "SAME" padding, while a
convolutional layer would decrease the image size by $K - 1$ pixels. The correct way to
alleviate these boundary effects is to pad the input image with $\lfloor K/2 \rfloor$ zeros on each side.
In this case, the cropped output of an MHSA layer and that of a convolutional layer coincide
(both this and the stride case below are illustrated in the sketch after this list).
• Stride: a strided convolution can be seen as a convolution followed by a fixed pooling
operation—with computational optimizations. Theorem 1 is defined for stride 1, but a
fixed pooling layer could be appended to the Self-Attention layer to simulate any stride.
• Dilation: a multi-head self-attention layer can express any dilated convolution, as each head
can attend to a value at any pixel shift and form a (dilated) grid pattern.
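As a sanity check of the first two points, the following PyTorch sketch (purely illustrative; the tensor shapes and the values of $K$ and the stride are arbitrary) pads the input with $\lfloor K/2 \rfloor$ zeros so that the convolution preserves the spatial size, and reproduces a strided convolution as a stride-1 convolution followed by a fixed subsampling step, the simplest instance of the fixed pooling mentioned above.

```python
import torch
import torch.nn.functional as F

# Arbitrary shapes, chosen only for illustration.
K, s = 3, 2
x = torch.randn(1, 16, 32, 32)   # (batch, D_in, H, W)
w = torch.randn(32, 16, K, K)    # (D_out, D_in, K, K)

# Padding: with floor(K/2) zeros on each side, the convolution keeps the
# spatial size, matching the "SAME" behaviour of self-attention.
y_same = F.conv2d(x, w, padding=K // 2)
assert y_same.shape[-2:] == x.shape[-2:]

# Stride: a strided convolution equals a stride-1 convolution followed by a
# fixed subsampling step, which is how a pooling stage appended to the
# self-attention layer can simulate any stride.
y_strided = F.conv2d(x, w, stride=s, padding=K // 2)
y_subsampled = F.conv2d(x, w, stride=1, padding=K // 2)[:, :, ::s, ::s]
assert torch.allclose(y_strided, y_subsampled)
```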
Remark for the 1D case. Convolutional layers acting on sequences are commonly used in the lit-
erature for text (Kim, 2014), as well as audio (van den Oord et al., 2016) and time series (Franceschi
et al., 2019). Theorem 1 can be straightforwardly extended to show that multi-head self-attention
with $N_h$ heads can also simulate a 1D convolutional layer with a kernel of size $K = N_h$ and
$\min(D_h, D_{out})$ output channels, using a positional encoding of dimension $D_p \geq 2$. Since we have
not tested empirically whether the preceding construction matches the behavior of 1D self-attention in
practice, we cannot claim that it actually learns to convolve an input sequence, only that it has the
capacity to do so.
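To illustrate why two dimensions suffice in the 1D case, the sketch below uses a hypothetical 1D analogue of the quadratic encoding of eq. (9), with $r_\delta = (\delta^2, \delta)$ and $v^{(h)} = -\alpha^{(h)}(1, -2\Delta^{(h)})$; this particular form, the function name, and the parameter values are our own choices for illustration and are not spelled out in the text.

```python
import numpy as np

def quadratic_attention_1d(delta_center, alpha, K=9):
    """Attention over the K relative positions of a 1D kernel using a D_p = 2
    positional encoding (a hypothetical 1D analogue of eq. (9))."""
    half = K // 2
    shifts = np.arange(-half, half + 1)                 # relative shifts delta
    r = np.stack([shifts ** 2, shifts], axis=1)         # r_delta = (delta^2, delta)
    v = -alpha * np.array([1.0, -2.0 * delta_center])   # v^(h) = -alpha * (1, -2*Delta)
    scores = r @ v                                      # -alpha * ((delta - Delta)^2 - Delta^2)
    probs = np.exp(scores - scores.max())
    return probs / probs.sum()

# With N_h = K heads, head h would pick a distinct center Delta^(h) in
# {-K//2, ..., K//2}; for a large alpha it attends (almost) only to that shift.
print(np.round(quadratic_attention_1d(delta_center=-2, alpha=10.0), 3))
```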
PROOF OF MAIN THEOREM
The proof follows directly from Lemmas 1 and 2 stated below:
Lemma 1. Consider a multi-head self-attention layer consisting of $N_h = K^2$ heads, $D_h \geq D_{out}$,
and let $f : [N_h] \to \Delta\!\!\Delta_K$ be a bijective mapping of heads onto shifts. Further, suppose that for
every head the following holds:
$$\mathrm{softmax}\big(A^{(h)}_{q,:}\big)_{k} = \begin{cases} 1 & \text{if } f(h) = q - k \\ 0 & \text{otherwise.} \end{cases} \qquad (10)$$
Then, for any convolutional layer with a $K \times K$ kernel and $D_{out}$ output channels, there exists
$\{W^{(h)}_{val}\}_{h \in [N_h]}$ such that $\mathrm{MHSA}(X) = \mathrm{Conv}(X)$ for every $X \in \mathbb{R}^{W \times H \times D_{in}}$.