The theorem is proven constructively by selecting the parameters of the multi-head self-attention
layer so that the latter acts like a convolutional layer. In the proposed construction, the attention
scores of each self-attention head should attend to a different relative shift within the set $\Delta\!\!\Delta_K = \{-\lfloor K/2 \rfloor, \dots, \lfloor K/2 \rfloor\}^2$ of all pixel shifts in a $K \times K$ kernel. The exact condition can be found in
the statement of Lemma 1.
Then, Lemma 2 shows that the aforementioned condition is satisfied for the relative positional en-
coding that we refer to as the quadratic encoding:
$$v^{(h)} := -\alpha^{(h)}\,\big(1,\, -2\Delta^{(h)}_1,\, -2\Delta^{(h)}_2\big) \qquad r_{\delta} := \big(\lVert\delta\rVert^2,\, \delta_1,\, \delta_2\big) \qquad W_{qry} = W_{key} := 0 \qquad \widehat{W}_{key} := I \qquad (9)$$
The learned parameters $\Delta^{(h)} = (\Delta^{(h)}_1, \Delta^{(h)}_2)$ and $\alpha^{(h)}$ determine the center and width of attention
of each head, respectively. On the other hand, $\delta = (\delta_1, \delta_2)$ is fixed and expresses the relative shift
between query and key pixels.
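To make the roles of $\alpha^{(h)}$ and $\Delta^{(h)}$ concrete, the following minimal numpy sketch (purely illustrative; the function name, the choice $K = 5$, and the values of $\Delta^{(h)}$ and $\alpha^{(h)}$ are arbitrary) evaluates the attention probabilities that the quadratic encoding of eq. (9) assigns to every relative shift of a $K \times K$ window. As $\alpha^{(h)}$ grows, the softmax concentrates on the single shift $\Delta^{(h)}$, which is precisely the condition required by Lemma 1.

```python
import numpy as np

def quadratic_attention(delta_center, alpha, K=5):
    """Attention probabilities over all relative shifts of a K x K window
    for one head under the quadratic encoding of eq. (9)."""
    half = K // 2
    shifts = np.array([(d1, d2) for d1 in range(-half, half + 1)
                                for d2 in range(-half, half + 1)])      # the set Delta_K
    # r_delta = (||delta||^2, delta_1, delta_2), v^(h) = -alpha * (1, -2*Delta_1, -2*Delta_2)
    r = np.concatenate([np.sum(shifts ** 2, axis=1, keepdims=True), shifts], axis=1)
    v = -alpha * np.array([1.0, -2.0 * delta_center[0], -2.0 * delta_center[1]])
    scores = r @ v          # equals -alpha * (||delta - Delta||^2 - ||Delta||^2)
    probs = np.exp(scores - scores.max())
    return (probs / probs.sum()).reshape(K, K)

# With a moderately large alpha, the distribution is (numerically) one-hot at the
# shift Delta^(h) = (1, -1): the attention pattern required by Lemma 1.
print(np.round(quadratic_attention(delta_center=(1, -1), alpha=10.0), 3))
```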
It is important to stress that the above encoding is not the only one for which the conditions of
Lemma 1 are satisfied. In fact, in our experiments, the relative encoding learned by the neural
network also matched the conditions of the lemma (despite being different from the quadratic encoding). Nevertheless, the encoding defined above is very efficient in terms of size, as only $D_p = 3$
dimensions suffice to encode the relative position of pixels, while also reaching similar or better
empirical performance (than the learned one).
The theorem covers the general convolution operator as defined in eq. (17). However, machine
learning practitioners using differentiable programming frameworks (Paszke et al., 2017; Abadi et al.,
2015) might wonder whether the theorem holds for all hyper-parameters of 2D convolutional layers:
• Padding: a multi-head self-attention layer uses by default the "SAME" padding, while a
convolutional layer would decrease the image size by $K - 1$ pixels. The correct way to
alleviate these boundary effects is to pad the input image with $\lfloor K/2 \rfloor$ zeros on each side.
In this case, the cropped output of an MHSA layer and that of a convolutional layer coincide
(both this and the stride case below are illustrated in the sketch after this list).
• Stride: a strided convolution can be seen as a convolution followed by a fixed pooling
operation—with computational optimizations. Theorem 1 is defined for stride 1, but a
fixed pooling layer could be appended to the Self-Attention layer to simulate any stride.
• Dilation: a multi-head self-attention layer can express any dilated convolution, as each head
can attend to a value at any pixel shift and form a (dilated) grid pattern.
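As a sanity check of the first two points, the following PyTorch sketch (purely illustrative; the tensor shapes and the values of $K$ and the stride are arbitrary) pads the input with $\lfloor K/2 \rfloor$ zeros so that the convolution preserves the spatial size, and reproduces a strided convolution as a stride-1 convolution followed by a fixed subsampling step, the simplest instance of the fixed pooling mentioned above.

```python
import torch
import torch.nn.functional as F

# Arbitrary shapes, chosen only for illustration.
K, s = 3, 2
x = torch.randn(1, 16, 32, 32)   # (batch, D_in, H, W)
w = torch.randn(32, 16, K, K)    # (D_out, D_in, K, K)

# Padding: with floor(K/2) zeros on each side, the convolution keeps the
# spatial size, matching the "SAME" behaviour of self-attention.
y_same = F.conv2d(x, w, padding=K // 2)
assert y_same.shape[-2:] == x.shape[-2:]

# Stride: a strided convolution equals a stride-1 convolution followed by a
# fixed subsampling step, which is how a pooling stage appended to the
# self-attention layer can simulate any stride.
y_strided = F.conv2d(x, w, stride=s, padding=K // 2)
y_subsampled = F.conv2d(x, w, stride=1, padding=K // 2)[:, :, ::s, ::s]
assert torch.allclose(y_strided, y_subsampled)
```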
Remark for the 1D case. Convolutional layers acting on sequences are commonly used in the lit-
erature for text (Kim, 2014), as well as audio (van den Oord et al., 2016) and time series (Franceschi
et al., 2019). Theorem 1 can be straightforwardly extended to show that multi-head self-attention
with $N_h$ heads can also simulate a 1D convolutional layer with a kernel of size $K = N_h$ and
$\min(D_h, D_{out})$ output channels, using a positional encoding of dimension $D_p \geq 2$. Since we have
not tested empirically whether the preceding construction matches the behavior of 1D self-attention in
practice, we cannot claim that it actually learns to convolve an input sequence, only that it has the
capacity to do so.
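To illustrate why two dimensions suffice in the 1D case, the sketch below uses a hypothetical 1D analogue of the quadratic encoding of eq. (9), with $r_\delta = (\delta^2, \delta)$ and $v^{(h)} = -\alpha^{(h)}(1, -2\Delta^{(h)})$; this particular form, the function name, and the parameter values are our own choices for illustration and are not spelled out in the text.

```python
import numpy as np

def quadratic_attention_1d(delta_center, alpha, K=9):
    """Attention over the K relative positions of a 1D kernel using a D_p = 2
    positional encoding (a hypothetical 1D analogue of eq. (9))."""
    half = K // 2
    shifts = np.arange(-half, half + 1)                 # relative shifts delta
    r = np.stack([shifts ** 2, shifts], axis=1)         # r_delta = (delta^2, delta)
    v = -alpha * np.array([1.0, -2.0 * delta_center])   # v^(h) = -alpha * (1, -2*Delta)
    scores = r @ v                                      # -alpha * ((delta - Delta)^2 - Delta^2)
    probs = np.exp(scores - scores.max())
    return probs / probs.sum()

# With N_h = K heads, head h would pick a distinct center Delta^(h) in
# {-K//2, ..., K//2}; for a large alpha it attends (almost) only to that shift.
print(np.round(quadratic_attention_1d(delta_center=-2, alpha=10.0), 3))
```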
PROOF OF MAIN THEOREM
The proof follows directly from Lemmas 1 and 2 stated below:
Lemma 1. Consider a multi-head self-attention layer consisting of $N_h = K^2$ heads, $D_h \geq D_{out}$,
and let $f : [N_h] \to \Delta\!\!\Delta_K$ be a bijective mapping of heads onto shifts. Further, suppose that for
every head the following holds:
$$\mathrm{softmax}\big(A^{(h)}_{q,:}\big)_{k} = \begin{cases} 1 & \text{if } f(h) = q - k \\ 0 & \text{otherwise.} \end{cases} \qquad (10)$$
Then, for any convolutional layer with a $K \times K$ kernel and $D_{out}$ output channels, there exists
$\{W^{(h)}_{val}\}_{h \in [N_h]}$ such that $\mathrm{MHSA}(X) = \mathrm{Conv}(X)$ for every $X \in \mathbb{R}^{W \times H \times D_{in}}$.