ResNeXt: Aggregated Residual Transformations for Deep Neural Networks


"ResNeXt网络架构的论文，探讨了深度神经网络中的一种新方法——聚合残差变换，提出了一种高度模块化的网络结构，强调了'基数'(transformations集合的大小)作为影响模型性能的新维度，除了深度和宽度之外。在ImageNet-1K数据集上的实验表明，增加基数比增加深度或宽度更能有效提高分类准确性。" 在深度学习领域，ResNeXt是由Saining Xie等人提出的一种创新的深度神经网络架构，它源自于ResNet（残差网络）。ResNet通过引入残差块解决了深度网络中的梯度消失和爆炸问题，而ResNeXt则进一步优化了这一概念。论文的主要贡献在于提出了“聚合残差变换”(Aggregated Residual Transformations)的理念，这是对ResNet的改进和扩展。 ResNeXt的核心是其构建块，这个块不只执行单一的变换，而是将一组相同拓扑的变换聚合在一起。这种设计使网络更加模块化，减少了超参数的数量，简化了网络结构。这种结构被称为“多分支结构”，每个分支都执行相同的计算，但拥有独立的权重。网络的“基数”（cardinality）指的是这些并行分支的数量，它与网络的深度（层数）和宽度（每层的特征图大小）一起决定了模型的复杂性和能力。研究发现，在保持计算复杂度不变的情况下，增加基数可以显著提升模型的分类精度。这表明，增加并行分支的数目可能比单纯增加网络的深度或宽度更能有效地利用计算资源。ResNeXt模型在ILSVRC 2016图像分类任务中的出色表现验证了这一点，显示了其在实际应用中的潜力。 ResNeXt提供了一种新的思考方式，即通过增加并行变换的基数来增强网络的表达能力和泛化能力，这为深度学习领域的网络设计提供了新的方向。这种方法不仅提高了模型的性能，而且通过减少需要调整的超参数数量，使得模型更易于训练和优化。

Aggregated Residual Transformations for Deep Neural Networks

Saining Xie (1), Ross Girshick (2), Piotr Dollár (2), Zhuowen Tu (1), Kaiming He (2)

(1) UC San Diego, (2) Facebook AI Research

{s9xie,ztu}@ucsd.edu {rbg,pdollar,kaiminghe}@fb.com

Abstract

We present a simple, highly modularized network architecture for image classification. Our network is constructed by repeating a building block that aggregates a set of transformations with the same topology. Our simple design results in a homogeneous, multi-branch architecture that has only a few hyper-parameters to set. This strategy exposes a new dimension, which we call “cardinality” (the size of the set of transformations), as an essential factor in addition to the dimensions of depth and width. On the ImageNet-1K dataset, we empirically show that even under the restricted condition of maintaining complexity, increasing cardinality is able to improve classification accuracy. Moreover, increasing cardinality is more effective than going deeper or wider when we increase the capacity. Our models, named ResNeXt, are the foundations of our entry to the ILSVRC 2016 classification task in which we secured 2nd place. We further investigate ResNeXt on an ImageNet-5K set and the COCO detection set, also showing better results than its ResNet counterpart. The code and models are publicly available online.

1. Introduction

Research on visual recognition is undergoing a transition from “feature engineering” to “network engineering” [25, 24, 44, 34, 36, 38, 14]. In contrast to traditional hand-designed features (e.g., SIFT [29] and HOG [5]), features learned by neural networks from large-scale data [33] require minimal human involvement during training, and can be transferred to a variety of recognition tasks [7, 10, 28]. Nevertheless, human effort has been shifted to designing better network architectures for learning representations.

Designing architectures becomes increasingly difficult with the growing number of hyper-parameters (width, filter sizes, strides, etc.), especially when there are many layers. The VGG-nets [36] exhibit a simple yet effective strategy of constructing very deep networks: stacking building blocks of the same shape. This strategy is inherited by ResNets [14] which stack modules of the same topology. This simple rule reduces the free choices of hyper-parameters, and depth is exposed as an essential dimension in neural networks. Moreover, we argue that the simplicity of this rule may reduce the risk of over-adapting the hyper-parameters to a specific dataset. The robustness of VGG-nets and ResNets has been proven by various visual recognition tasks [7, 10, 9, 28, 31, 14] and by non-visual tasks involving speech [42, 30] and language [4, 41, 20].

Code and models: https://github.com/facebookresearch/ResNeXt

Width refers to the number of channels in a layer.

[Figure 1 diagram. Left: 256-d in; layers (256, 1x1, 64), (64, 3x3, 64), (64, 1x1, 256); 256-d out. Right: 256-d in; 32 paths in total, each with layers (256, 1x1, 4), (4, 3x3, 4), (4, 1x1, 256); 256-d out.]

Figure 1. Left: A block of ResNet [14]. Right: A block of ResNeXt with cardinality = 32, with roughly the same complexity. A layer is shown as (# in channels, filter size, # out channels).
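The "roughly the same complexity" statement in the Figure 1 caption can be checked with simple arithmetic on the layer shapes shown in the figure. The sketch below counts only convolution weights and ignores biases and normalization layers, which is an assumption made for this illustration. The helper names `conv_params` and `bottleneck_params` are hypothetical.

```python
def conv_params(in_channels, kernel_size, out_channels):
    """Weights in a square convolution, ignoring biases (an assumption)."""
    return in_channels * kernel_size * kernel_size * out_channels


def bottleneck_params(in_dim, width):
    """1x1 reduce -> 3x3 -> 1x1 expand, using the layer notation of Figure 1."""
    return (
        conv_params(in_dim, 1, width)
        + conv_params(width, 3, width)
        + conv_params(width, 1, in_dim)
    )


# Left of Figure 1: a single path with width 64.
resnet_block = bottleneck_params(256, 64)

# Right of Figure 1: 32 paths, each with width 4.
resnext_block = 32 * bottleneck_params(256, 4)

print(resnet_block)   # 69632
print(resnext_block)  # 70144
```

Under this counting the two blocks differ by 512 weights, which is less than 1% of either total.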

Unlike VGG-nets, the family of Inception models [38, 17, 39, 37] have demonstrated that carefully designed topologies are able to achieve compelling accuracy with low theoretical complexity. The Inception models have evolved over time [38, 39], but an important common property is a split-transform-merge strategy. In an Inception module, the input is split into a few lower-dimensional embeddings (by 1×1 convolutions), transformed by a set of specialized filters (3×3, 5×5, etc.), and merged by concatenation. It can be shown that the solution space of this architecture is a strict subspace of the solution space of a single large layer (e.g., 5×5) operating on a high-dimensional embedding. The split-transform-merge behavior of Inception modules is expected to approach the representational power of large and dense layers, but at a considerably lower computational complexity.
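One ingredient of the "strict subspace" remark can be illustrated with a simplified 1-D analogy: a small filter gives the same output as a larger filter whose extra taps are fixed to zero, so everything a small filter can compute is also reachable by the large one. The sketch below is not taken from the paper. The function `conv1d_same` and the sample values are made up for illustration, and the example covers only the filter-size part of the argument, not the low-dimensional embeddings.

```python
def conv1d_same(signal, kernel):
    """Cross-correlate signal with an odd-length kernel, zero-padding the borders."""
    half = len(kernel) // 2
    padded = [0.0] * half + list(signal) + [0.0] * half
    return [
        sum(k * padded[i + j] for j, k in enumerate(kernel))
        for i in range(len(signal))
    ]


signal = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]

# A 3-tap filter, and the same filter embedded in a 5-tap filter with zero ends.
small = [0.5, -1.0, 2.0]
embedded = [0.0] + small + [0.0]

out_small = conv1d_same(signal, small)
out_large = conv1d_same(signal, embedded)

print(out_small == out_large)  # True
```

The reverse does not hold: a 5-tap filter with nonzero end taps has no 3-tap equivalent, which is why the inclusion is strict.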

Despite good accuracy, the realization of Inception models has been accompanied with a series of complicating fac-

arXiv:1611.05431v2 [cs.CV] 11 Apr 2017
