methods of network pruning [12, 13, 14, 15, 16, 17] and
connectivity learning [18, 19]. A substantial amount of
work has also been dedicated to changing the connectiv-
ity structure of the internal convolutional blocks such as
in ShuffleNet [20] or introducing sparsity [21] and oth-
ers [22].
Recently, [23, 24, 25, 26] opened up a new direction of bringing optimization methods, including genetic algorithms and reinforcement learning, to architectural search. However, one drawback is that the resulting networks end up very complex. In this paper, we pursue the goal of developing better intuition about how neural networks operate and use that to guide the simplest possible network design. Our approach should be seen as complementary to the one described in [23] and related work. In this vein our approach is similar to those taken by [20, 22] and allows us to further improve the performance, while providing a glimpse into its internal operation. Our network design is based on MobileNetV1 [27]. It retains its simplicity and does not require any special operators, while significantly improving its accuracy, achieving state of the art on multiple image classification and detection tasks for mobile applications.
3. Preliminaries, discussion and intuition
3.1. Depthwise Separable Convolutions
Depthwise Separable Convolutions are a key build-
ing block for many efficient neural network architectures
[27, 28, 20] and we use them in the present work as well.
The basic idea is to replace a full convolutional opera-
tor with a factorized version that splits convolution into
two separate layers. The first layer is called a depthwise convolution; it performs lightweight filtering by applying a single convolutional filter per input channel. The second layer is a $1 \times 1$ convolution, called a pointwise convolution, which is responsible for building new features by computing linear combinations of the input channels.
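To make this concrete, here is a minimal sketch of a depthwise separable block, assuming PyTorch; the class name and module structure are illustrative rather than taken from the paper's implementation:

```python
import torch
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    """Depthwise separable convolution: a depthwise conv followed by a
    1x1 pointwise conv, as described above (illustrative sketch)."""
    def __init__(self, d_in, d_out, k=3, stride=1):
        super().__init__()
        # Depthwise: one k x k filter per input channel (groups=d_in).
        self.depthwise = nn.Conv2d(d_in, d_in, kernel_size=k, stride=stride,
                                   padding=k // 2, groups=d_in, bias=False)
        # Pointwise: 1 x 1 conv building new features as linear
        # combinations of the input channels.
        self.pointwise = nn.Conv2d(d_in, d_out, kernel_size=1, bias=False)

    def forward(self, x):
        return self.pointwise(self.depthwise(x))

# Usage: a 64-channel feature map mapped to 128 channels.
x = torch.randn(1, 64, 56, 56)
y = DepthwiseSeparableConv(64, 128)(x)  # y.shape == (1, 128, 56, 56)
```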
Standard convolution takes an $h_i \times w_i \times d_i$ input tensor $L_i$, and applies a convolutional kernel $K \in \mathbb{R}^{k \times k \times d_i \times d_j}$ to produce an $h_i \times w_i \times d_j$ output tensor $L_j$. Standard convolutional layers have the computational cost of $h_i \cdot w_i \cdot d_i \cdot d_j \cdot k \cdot k$.
Depthwise separable convolutions are a drop-in replacement for standard convolutional layers. Empirically they work almost as well as regular convolutions but only cost

$$h_i \cdot w_i \cdot d_i \cdot (k^2 + d_j) \qquad (1)$$

which is the sum of the depthwise and $1 \times 1$ pointwise convolutions. Effectively, depthwise separable convolution reduces computation compared to traditional layers by almost a factor of $k^2$ (more precisely, by a factor of $k^2 d_j / (k^2 + d_j)$). MobileNetV2 uses $k = 3$ ($3 \times 3$ depthwise separable convolutions), so the computational cost is 8 to 9 times smaller than that of standard convolutions, at only a small reduction in accuracy [27].
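To see where the 8 to 9 times figure comes from, a short Python sketch comparing the two cost formulas; the layer shape below is illustrative, not taken from the paper:

```python
def standard_cost(h, w, d_in, d_out, k):
    # h * w * d_in * d_out * k * k multiply-adds for a standard convolution
    return h * w * d_in * d_out * k * k

def separable_cost(h, w, d_in, d_out, k):
    # Depthwise (h * w * d_in * k * k) plus pointwise (h * w * d_in * d_out), Eq. (1)
    return h * w * d_in * (k * k + d_out)

# Illustrative layer: 112x112 feature map, 64 -> 128 channels, k = 3.
h = w = 112
d_in, d_out, k = 64, 128, 3
ratio = standard_cost(h, w, d_in, d_out, k) / separable_cost(h, w, d_in, d_out, k)
print(f"reduction factor: {ratio:.2f}")  # ~8.41, i.e. k^2 * d_j / (k^2 + d_j)
```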
3.2. Linear Bottlenecks
Consider a deep neural network consisting of $n$ layers $L_i$, each of which has an activation tensor of dimensions $h_i \times w_i \times d_i$. Throughout this section we will be discussing the basic properties of these activation tensors, which we will treat as containers of $h_i \times w_i$ "pixels" with $d_i$ dimensions. Informally, for an input set of real images, we say that the set of layer activations (for any layer $L_i$) forms a "manifold of interest". It has long been assumed that manifolds of interest in neural networks could be embedded in low-dimensional subspaces. In other words, when we look at all individual $d$-channel pixels of a deep convolutional layer, the information encoded in those values actually lies in some manifold, which in turn is embeddable into a low-dimensional subspace (note that the dimensionality of the manifold differs from the dimensionality of a subspace that could be embedded via a linear transformation).
At first glance, such a fact could then be captured
and exploited by simply reducing the dimensionality of
a layer thus reducing the dimensionality of the oper-
ating space. This has been successfully exploited by
MobileNetV1 [27] to effectively trade off between com-
putation and accuracy via a width multiplier parameter,
and has been incorporated into efficient model designs
of other networks as well [20]. Following that intuition,
the width multiplier approach allows one to reduce the
dimensionality of the activation space until the mani-
fold of interest spans this entire space. However, this
intuition breaks down when we recall that deep convolutional neural networks actually have non-linear per-coordinate transformations, such as ReLU. For example, ReLU applied to a line in 1D space produces a 'ray', whereas in $\mathbb{R}^n$ space it generally results in a piece-wise linear curve with $n$ joints.
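This information-loss effect is easy to reproduce numerically. Below is an illustrative NumPy sketch (not code from the paper) in the spirit of the paper's embedding experiment: a low-dimensional manifold is embedded into $n$ dimensions with a random matrix $T$, passed through ReLU, and projected back using the pseudo-inverse of $T$:

```python
import numpy as np

rng = np.random.default_rng(0)

# A 1-D "manifold of interest": points on a spiral in 2-D input space.
t = np.linspace(0, 3 * np.pi, 200)
X = np.stack([t * np.cos(t), t * np.sin(t)], axis=1)  # shape (200, 2)

for n in (2, 5, 30):
    T = rng.normal(size=(2, n))        # random embedding into n dimensions
    Y = np.maximum(X @ T, 0.0)         # ReLU in the n-dim space
    X_back = Y @ np.linalg.pinv(T)     # project back with the pseudo-inverse
    err = np.linalg.norm(X - X_back) / np.linalg.norm(X)
    print(f"n={n:3d}  relative reconstruction error: {err:.3f}")

# Small n tends to collapse parts of the manifold under ReLU,
# while large n tends to preserve the information.
```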
It is easy to see that, in general, if a result of a layer transformation ReLU($Bx$) has a non-zero volume $S$, the points mapped to the interior of $S$ are obtained via a linear transformation $B$ of the input, thus indicating that the part of the input space corresponding to the full-dimensional output is limited to a linear transformation. In other words, deep networks only have the power of a linear classifier on the non-zero volume part of the output domain.
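As a quick numerical check of this claim (a sketch assuming a random square matrix $B$; not code from the paper): on the region of inputs whose image $Bx$ lies in the strictly positive orthant, ReLU($Bx$) $= Bx$ exactly, so the layer acts as the plain linear map $B$ there.

```python
import numpy as np

rng = np.random.default_rng(1)
B = rng.normal(size=(3, 3))  # a random square "layer" matrix

# Inputs mapped into the strictly positive orthant form the interior
# region where ReLU(Bx) = Bx, i.e. the layer is exactly the linear map B.
xs = rng.normal(size=(100000, 3))
mask = np.all(xs @ B.T > 0, axis=1)
interior = xs[mask]
assert np.allclose(np.maximum(interior @ B.T, 0.0), interior @ B.T)
print(f"{mask.mean():.1%} of random inputs fall in the linear region")
```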