Non-Local Neural Networks with Grouped Bilinear Attentional Transforms
Lu Chi^{1,2}, Zehuan Yuan^{2}, Yadong Mu^{1∗}, Changhu Wang^{2}
^{1}Peking University, Beijing, China    ^{2}ByteDance AI Lab, Beijing, China
{chilu,myd}@pku.edu.cn, {yuanzehuan,wangchanghu}@bytedance.com
∗Corresponding author.
Abstract
Modeling spatial or temporal long-range dependencies plays a key role in deep neural networks. Conventional dominant solutions include recurrent operations on sequential data or deeply stacking convolutional layers with small kernel sizes. Recently, a number of non-local operators (such as self-attention-based ones [57]) have been devised. They are typically generic and can be plugged into many existing network pipelines to globally compute interactions between any two neurons in a feature map. This work proposes a novel non-local operator. It is inspired by the attention mechanism of the human visual system, which can quickly attend to important local parts in sight and suppress less-relevant information. The core of our method is a learnable and data-adaptive bilinear attentional transform (BA-Transform), whose merits are threefold: first, the BA-Transform is versatile and can model a wide spectrum of local or global attentional operations, such as emphasizing specific local regions, and each BA-Transform is learned in a data-adaptive way; second, to address the discrepancy among features, we further design grouped BA-Transforms, which essentially apply different attentional operations to different groups of feature channels; third, whereas many existing non-local operators are computation-intensive, the proposed BA-Transform is implemented by simple matrix multiplications and is thus more efficient. For empirical evaluation, we perform comprehensive experiments on two large-scale benchmarks, ImageNet and Kinetics, for image and video classification respectively. The achieved accuracies and various ablation experiments consistently demonstrate significant improvements.
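The transform summarized above can be made concrete with a short sketch. The following minimal PyTorch example is illustrative only and is not the authors' implementation: the class name GroupedBATransform, the pooled-context linear generators for P and Q, and the softmax normalization are all assumptions made here for the sake of a small runnable example of the bilinear form Y = P(X) X Q(X) applied per channel group.

```python
import torch
import torch.nn as nn


class GroupedBATransform(nn.Module):
    """Toy grouped bilinear attentional transform: Y_g = P_g(X) @ X_g @ Q_g(X)."""

    def __init__(self, channels: int, height: int, width: int, groups: int = 4):
        super().__init__()
        assert channels % groups == 0, "channels must be divisible by groups"
        self.groups, self.h, self.w = groups, height, width
        group_ch = channels // groups
        # One (P, Q) generator per channel group, conditioned on pooled group features
        # (an assumption for this sketch; any data-dependent generator would do).
        self.to_p = nn.ModuleList([nn.Linear(group_ch, height * height) for _ in range(groups)])
        self.to_q = nn.ModuleList([nn.Linear(group_ch, width * width) for _ in range(groups)])

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (B, C, H, W)
        b = x.size(0)
        outs = []
        for xg, fp, fq in zip(x.chunk(self.groups, dim=1), self.to_p, self.to_q):
            ctx = xg.mean(dim=(2, 3))                           # (B, C/G) global context
            p = fp(ctx).view(b, 1, self.h, self.h).softmax(-1)  # data-adaptive P(X); rows sum to 1
            q = fq(ctx).view(b, 1, self.w, self.w).softmax(-2)  # data-adaptive Q(X); columns of X mixed convexly
            outs.append(p @ xg @ q)                             # bilinear transform on this channel group
        return torch.cat(outs, dim=1)                           # reassemble channel groups


# Usage: transform a 64-channel 14x14 feature map with 4 channel groups.
x = torch.randn(2, 64, 14, 14)
y = GroupedBATransform(64, 14, 14, groups=4)(x)
print(y.shape)  # torch.Size([2, 64, 14, 14])
```

Each group only pays for two small matrix multiplications (H x H and W x W), which is the efficiency argument made in the abstract; how P and Q are actually generated in the paper is left to the method description.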
1. Introduction
This era has witnessed the vigorous development of deep
neural networks, with significant empirical success in a
plethora of important real-life vision tasks [28, 36, 45, 56].
Figure 1: (a) Typical architecture of neural networks with non-local operators, where non-local neural blocks (highlighted in blue) are sparsely added into the original network pipeline to instantaneously achieve large receptive fields. (b) Illustration of our proposed bilinear attentional transform (BA-Transform). With properly-learned matrices $P^{(X)}$ and $Q^{(X)}$ in the transformation formula $Y = P^{(X)} X Q^{(X)}$, the BA-Transform can conduct a variety of operations on the attended features (e.g., selective zooming and dispersing to distant positions, as shown in this sub-figure). The superscripts in $P$ and $Q$ emphasize their dependence on $X$.

The neural architectures of convolutional networks are still undergoing rapid evolution. Much of recent endeavor has
been devoted to designing deeper [48, 17] or wider [61, 14] network architectures, or more effective atomic convolutional operators [6, 20]. The main interest of this work is modeling long-range spatial [57] or temporal [56] dependencies in deep convolutional networks. To this end, classic neural networks such as VGG-Net [48] or ResNet [17] mostly adopt a scheme of deeply stacking many convolutional layers with small receptive fields (e.g., 3 × 3 kernels in ResNet [17] and 3 × 3 × 3 spatio-temporal kernels in C3D [52]).
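As a rough back-of-the-envelope illustration (a standard receptive-field calculation, not a result from the cited papers): with unit stride and no pooling, stacking $n$ convolutions of kernel size $k \times k$ yields a receptive field of only
$$ r_n \;=\; 1 + n\,(k-1), \qquad k = 3 \;\Rightarrow\; r_n = 2n + 1, $$
so bridging positions that are, say, 100 pixels apart with 3 × 3 kernels alone would take on the order of fifty layers; downsampling shortens this, but the dependency path between distant positions remains long.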
One of the current research fronts in effectively enlarging neural receptive fields is to sparsely insert non-local operators into an existing network pipeline. An illustration of such an architecture is shown in Figure 1(a). The main challenge for sparse insertion of non-local operators is their