        GPU (Batches/sec.)                     ARM (Images/sec.)
c1:c2   (c1,c2) for ×1   ×1     ×2    ×4      (c1,c2) for ×1   ×1     ×2     ×4
1:1     (128,128)        1480   723   232     (32,32)          76.2   21.7   5.3
1:2     (90,180)         1296   586   206     (22,44)          72.9   20.5   5.1
1:6     (52,312)         876    489   189     (13,78)          69.1   17.9   4.6
1:12    (36,432)         748    392   163     (9,108)          57.6   15.1   4.4
Table 1: Validation experiment for Guideline 1. Four different ratios of the number of
input/output channels (c1 and c2) are tested, while the total FLOPs under the four
ratios is kept fixed by varying the number of channels. Input image size is 56 × 56.

Other benchmark settings include: full optimization options (e.g. tensor fusion, which
is used to reduce the overhead of small operations) are switched on. The input
image size is 224 × 224. Each network is randomly initialized and evaluated
100 times. The average runtime is used.
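As a rough illustration of this evaluation protocol, the sketch below times a randomly initialized network in PyTorch; torchvision's MobileNet v2 stands in for the benchmarked models, and the warm-up loop is our addition, not part of the stated setup.

```python
# A minimal timing sketch of the protocol above, assuming PyTorch.
import time
import torch
import torchvision

model = torchvision.models.mobilenet_v2()  # randomly initialized by default
model.eval()
x = torch.randn(1, 3, 224, 224)  # input image size 224 x 224

with torch.no_grad():
    for _ in range(10):           # warm-up runs, excluded from the average
        model(x)
    start = time.perf_counter()
    for _ in range(100):          # evaluate 100 times, as in the setup above
        model(x)
    elapsed = time.perf_counter() - start

print(f"average runtime: {elapsed / 100 * 1000:.2f} ms")
```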
To initiate our study, we analyze the runtime performance of two state-
of-the-art networks, ShuffleNet v1 [15] and MobileNet v2 [14]. Both are
highly efficient and accurate on the ImageNet classification task, and both
are widely used on low-end devices such as mobile phones. Although we only
analyze these two networks, we note that they are representative of the current
trend. At their core are group convolution and depthwise convolution, which are
also crucial components of other state-of-the-art networks, such as ResNeXt [7],
Xception [12], MobileNet [13], and CondenseNet [16].
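For concreteness, the sketch below shows the two core components in PyTorch. This is an illustrative composition, not the exact block used by any of the cited networks.

```python
# Depthwise separable and group convolutions, the building blocks named above.
import torch
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    """3x3 depthwise convolution followed by a 1x1 pointwise convolution."""
    def __init__(self, c_in, c_out):
        super().__init__()
        # groups=c_in makes each filter act on a single input channel
        self.depthwise = nn.Conv2d(c_in, c_in, 3, padding=1, groups=c_in)
        self.pointwise = nn.Conv2d(c_in, c_out, 1)  # 1x1 convolution

    def forward(self, x):
        return self.pointwise(self.depthwise(x))

# A group convolution splits the channels into independent groups:
group_conv = nn.Conv2d(128, 128, kernel_size=1, groups=4)

x = torch.randn(1, 64, 56, 56)
y = DepthwiseSeparableConv(64, 128)(x)
print(y.shape)  # torch.Size([1, 128, 56, 56])
```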
The overall runtime is decomposed across different operations, as shown in Fig-
ure 2. We note that the FLOPs metric only accounts for the convolution part.
Although this part consumes most of the time, the other operations, including
data I/O, data shuffle, and element-wise operations (AddTensor, ReLU, etc.),
also occupy a considerable amount of time. Therefore, FLOPs is not an accurate
enough estimate of the actual runtime.
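One way to reproduce such a per-operator decomposition is PyTorch's autograd profiler; the paper does not specify its tooling, so the sketch below is only one possible setup.

```python
# A hedged sketch of decomposing runtime by operator type.
import torch
import torchvision

model = torchvision.models.mobilenet_v2().eval()
x = torch.randn(1, 3, 224, 224)

with torch.no_grad(), torch.autograd.profiler.profile() as prof:
    model(x)

# Convolutions dominate, but element-wise ops (add, ReLU, etc.) also
# account for a non-trivial share of the total time.
print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=10))
```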
Based on this observation, we perform a detailed analysis of runtime (or
speed) from several different aspects and derive several practical guidelines for
efficient network architecture design.
G1) Equal channel width minimizes memory access cost (MAC).
Modern networks usually adopt depthwise separable convolutions [12,13,15,14],
where the pointwise convolution (i.e., 1 × 1 convolution) accounts for most of
the complexity [15]. We study the kernel shape of the 1 × 1 convolution. The
shape is specified by two parameters: the number of input channels c1 and the
number of output channels c2. Let h and w be the spatial size of the feature
map; the FLOPs of the 1 × 1 convolution is B = hwc1c2.
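To make the formula concrete, the following sketch evaluates B for the (c1, c2) pairs from the GPU half of Table 1, assuming h = w = 56 as in the table's setting; it confirms that the four ratios yield roughly equal FLOPs.

```python
# Numeric check of B = h*w*c1*c2 for the Table 1 channel pairs.
def flops_1x1(h, w, c1, c2):
    return h * w * c1 * c2

for c1, c2 in [(128, 128), (90, 180), (52, 312), (36, 432)]:
    print(c1, c2, flops_1x1(56, 56, c1, c2))
# All four ratios give roughly equal FLOPs (about 5e7), as the
# fixed-FLOPs design of Table 1 requires.
```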
For simplicity, we assume the cache in the computing device is large enough
to store the entire feature maps and parameters. Thus, the memory access cost
(MAC), or the number of memory access operations, is MAC = hw(c1 + c2) + c1c2.
Note that the two terms correspond to the memory access for the input/output
feature maps and the kernel weights, respectively.
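Before the formal derivation, a quick numeric check: plugging the same fixed-FLOPs channel pairs from Table 1 into the MAC formula (again assuming h = w = 56) shows MAC growing as the ratio c1:c2 becomes more unbalanced, consistent with the runtime degradation observed in the table.

```python
# MAC = h*w*(c1 + c2) + c1*c2 for the fixed-FLOPs pairs of Table 1.
def mac_1x1(h, w, c1, c2):
    return h * w * (c1 + c2) + c1 * c2

for c1, c2 in [(128, 128), (90, 180), (52, 312), (36, 432)]:
    print(f"{c1}:{c2}  MAC = {mac_1x1(56, 56, c1, c2):,}")
# MAC rises from 819,200 at 1:1 to 1,483,200 at 1:12.
```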
From the mean value inequality, we have