Non-Local Neural Networks with Grouped Bilinear Attentional Transforms
Lu Chi^{1,2}, Zehuan Yuan^{2}, Yadong Mu^{1∗}, Changhu Wang^{2}
^{1}Peking University, Beijing, China    ^{2}ByteDance AI Lab, Beijing, China
{chilu,myd}@pku.edu.cn, {yuanzehuan,wangchanghu}@bytedance.com
∗Corresponding author.
Abstract
Modeling spatial or temporal long-range dependencies plays a key role in deep neural networks. Conventional dominant solutions include recurrent operations on sequential data or deeply stacking convolutional layers with small kernel sizes. Recently, a number of non-local operators (such as self-attention-based ones [57]) have been devised. They are typically generic and can be plugged into many existing network pipelines to globally compute interactions between any two neurons in a feature map. This work proposes a novel non-local operator. It is inspired by the attention mechanism of the human visual system, which can quickly attend to important local parts in sight and suppress less-relevant information. The core of our method is a learnable and data-adaptive bilinear attentional transform (BA-Transform), whose merits are threefold: first, the BA-Transform is versatile and can model a wide spectrum of local or global attentional operations, such as emphasizing specific local regions, and each BA-Transform is learned in a data-adaptive way; second, to address the discrepancy among features, we further design grouped BA-Transforms, which essentially apply different attentional operations to different groups of feature channels; third, whereas many existing non-local operators are computation-intensive, the proposed BA-Transform is implemented by simple matrix multiplications and is thus more efficient. For empirical evaluation, we perform comprehensive experiments on two large-scale benchmarks, ImageNet and Kinetics, for image and video classification respectively. The achieved accuracies and various ablation experiments consistently demonstrate significant improvements.
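The transform summarized above can be made concrete with a short sketch. The following minimal PyTorch example is illustrative only and is not the authors' implementation: the class name GroupedBATransform, the pooled-context linear generators for P and Q, and the softmax normalization are all assumptions made here for the sake of a small runnable example of the bilinear form Y = P(X) X Q(X) applied per channel group.

```python
import torch
import torch.nn as nn


class GroupedBATransform(nn.Module):
    """Toy grouped bilinear attentional transform: Y_g = P_g(X) @ X_g @ Q_g(X)."""

    def __init__(self, channels: int, height: int, width: int, groups: int = 4):
        super().__init__()
        assert channels % groups == 0, "channels must be divisible by groups"
        self.groups, self.h, self.w = groups, height, width
        group_ch = channels // groups
        # One (P, Q) generator per channel group, conditioned on pooled group features
        # (an assumption for this sketch; any data-dependent generator would do).
        self.to_p = nn.ModuleList([nn.Linear(group_ch, height * height) for _ in range(groups)])
        self.to_q = nn.ModuleList([nn.Linear(group_ch, width * width) for _ in range(groups)])

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (B, C, H, W)
        b = x.size(0)
        outs = []
        for xg, fp, fq in zip(x.chunk(self.groups, dim=1), self.to_p, self.to_q):
            ctx = xg.mean(dim=(2, 3))                           # (B, C/G) global context
            p = fp(ctx).view(b, 1, self.h, self.h).softmax(-1)  # data-adaptive P(X); rows sum to 1
            q = fq(ctx).view(b, 1, self.w, self.w).softmax(-2)  # data-adaptive Q(X); columns of X mixed convexly
            outs.append(p @ xg @ q)                             # bilinear transform on this channel group
        return torch.cat(outs, dim=1)                           # reassemble channel groups


# Usage: transform a 64-channel 14x14 feature map with 4 channel groups.
x = torch.randn(2, 64, 14, 14)
y = GroupedBATransform(64, 14, 14, groups=4)(x)
print(y.shape)  # torch.Size([2, 64, 14, 14])
```

Each group only pays for two small matrix multiplications (H x H and W x W), which is the efficiency argument made in the abstract; how P and Q are actually generated in the paper is left to the method description.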
1. Introduction
This era has witnessed the vigorous development of deep
neural networks, with significant empirical success in a
plethora of important real-life vision tasks [28, 36, 45, 56].
Figure 1: (a) Typical architecture of neural networks with non-local operators, where non-local neural blocks (highlighted in blue) are sparsely added into the original network pipeline to instantaneously achieve large receptive fields. (b) Illustration of our proposed bilinear attentional transform (BA-Transform). With properly-learned matrices $P^{(X)}$ and $Q^{(X)}$ in the transformation formula $Y = P^{(X)} X Q^{(X)}$, the BA-Transform can conduct a variety of operations on the attended features (e.g., selective zooming and dispersing to distant positions, as shown in this sub-figure). The superscripts in $P$ and $Q$ emphasize their dependence on $X$.

The neural architectures of convolutional networks are still undergoing rapid evolution. Much of recent endeavor has
been devoted to designing deeper [48, 17] or wider [61, 14] network architectures, or more effective atomic convolutional operators [6, 20]. The main interest of this work is modeling long-range spatial [57] or temporal [56] dependencies in deep convolutional networks. To this end, classic neural networks such as VGG-Net [48] or ResNet [17] mostly adopt a scheme of deeply stacking many convolutional layers with small receptive fields (e.g., 3 × 3 kernels in ResNet [17] and 3 × 3 × 3 spatio-temporal kernels in C3D [52]).
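As a rough back-of-the-envelope illustration (a standard receptive-field calculation, not a result from the cited papers): with unit stride and no pooling, stacking $n$ convolutions of kernel size $k \times k$ yields a receptive field of only
$$ r_n \;=\; 1 + n\,(k-1), \qquad k = 3 \;\Rightarrow\; r_n = 2n + 1, $$
so bridging positions that are, say, 100 pixels apart with 3 × 3 kernels alone would take on the order of fifty layers; downsampling shortens this, but the dependency path between distant positions remains long.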
One of the current research fronts in effectively enlarging neural receptive fields is to sparsely insert non-local operators into an existing network pipeline. An illustration of such an architecture is shown in Figure 1(a). The main challenge for sparse insertion of non-local operators is their