        GPU (Batches/sec.)                     ARM (Images/sec.)
c1:c2   (c1,c2) for ×1   ×1     ×2    ×4      (c1,c2) for ×1   ×1     ×2     ×4
1:1     (128,128)        1480   723   232     (32,32)          76.2   21.7   5.3
1:2     (90,180)         1296   586   206     (22,44)          72.9   20.5   5.1
1:6     (52,312)         876    489   189     (13,78)          69.1   17.9   4.6
1:12    (36,432)         748    392   163     (9,108)          57.6   15.1   4.4
Table 1: Validation experiment for Guideline 1. Four different ratios of the number of
input/output channels (c1 and c2) are tested, while the total FLOPs under the four
ratios is kept fixed by varying the number of channels. Input image size is 56 × 56.

Other benchmark settings include: full optimization options (e.g. tensor fusion, which
is used to reduce the overhead of small operations) are switched on. The input
image size is 224 × 224. Each network is randomly initialized and evaluated
100 times. The average runtime is used.
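As a rough illustration of this evaluation protocol, the sketch below times a randomly initialized network in PyTorch; torchvision's MobileNet v2 stands in for the benchmarked models, and the warm-up loop is our addition, not part of the stated setup.

```python
# A minimal timing sketch of the protocol above, assuming PyTorch.
import time
import torch
import torchvision

model = torchvision.models.mobilenet_v2()  # randomly initialized by default
model.eval()
x = torch.randn(1, 3, 224, 224)  # input image size 224 x 224

with torch.no_grad():
    for _ in range(10):           # warm-up runs, excluded from the average
        model(x)
    start = time.perf_counter()
    for _ in range(100):          # evaluate 100 times, as in the setup above
        model(x)
    elapsed = time.perf_counter() - start

print(f"average runtime: {elapsed / 100 * 1000:.2f} ms")
```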
To initiate our study, we analyze the runtime performance of two state-
of-the-art networks, ShuffleNet v1 [15] and MobileNet v2 [14]. Both are
highly efficient and accurate on the ImageNet classification task, and both
are widely used on low-end devices such as mobile phones. Although we only
analyze these two networks, we note that they are representative of the current
trend. At their core are group convolution and depthwise convolution, which are
also crucial components of other state-of-the-art networks, such as ResNeXt [7],
Xception [12], MobileNet [13], and CondenseNet [16].
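For concreteness, the sketch below shows the two core components in PyTorch. This is an illustrative composition, not the exact block used by any of the cited networks.

```python
# Depthwise separable and group convolutions, the building blocks named above.
import torch
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    """3x3 depthwise convolution followed by a 1x1 pointwise convolution."""
    def __init__(self, c_in, c_out):
        super().__init__()
        # groups=c_in makes each filter act on a single input channel
        self.depthwise = nn.Conv2d(c_in, c_in, 3, padding=1, groups=c_in)
        self.pointwise = nn.Conv2d(c_in, c_out, 1)  # 1x1 convolution

    def forward(self, x):
        return self.pointwise(self.depthwise(x))

# A group convolution splits the channels into independent groups:
group_conv = nn.Conv2d(128, 128, kernel_size=1, groups=4)

x = torch.randn(1, 64, 56, 56)
y = DepthwiseSeparableConv(64, 128)(x)
print(y.shape)  # torch.Size([1, 128, 56, 56])
```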
The overall runtime is decomposed across different operations, as shown in Fig-
ure 2. We note that the FLOPs metric only accounts for the convolution part.
Although this part consumes most of the time, the other operations, including
data I/O, data shuffle, and element-wise operations (AddTensor, ReLU, etc.),
also occupy a considerable amount of time. Therefore, FLOPs is not an accurate
enough estimate of the actual runtime.
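One way to reproduce such a per-operator decomposition is PyTorch's autograd profiler; the paper does not specify its tooling, so the sketch below is only one possible setup.

```python
# A hedged sketch of decomposing runtime by operator type.
import torch
import torchvision

model = torchvision.models.mobilenet_v2().eval()
x = torch.randn(1, 3, 224, 224)

with torch.no_grad(), torch.autograd.profiler.profile() as prof:
    model(x)

# Convolutions dominate, but element-wise ops (add, ReLU, etc.) also
# account for a non-trivial share of the total time.
print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=10))
```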
Based on this observation, we perform a detailed analysis of runtime (or
speed) from several different aspects and derive several practical guidelines for
efficient network architecture design.
G1) Equal channel width minimizes memory access cost (MAC).
Modern networks usually adopt depthwise separable convolutions [12,13,15,14],
where the pointwise convolution (i.e., 1 × 1 convolution) accounts for most of
the complexity [15]. We study the kernel shape of the 1 × 1 convolution. The
shape is specified by two parameters: the number of input channels c1 and the
number of output channels c2. Let h and w be the spatial size of the feature
map; the FLOPs of the 1 × 1 convolution is B = hwc1c2.
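To make the formula concrete, the following sketch evaluates B for the (c1, c2) pairs from the GPU half of Table 1, assuming h = w = 56 as in the table's setting; it confirms that the four ratios yield roughly equal FLOPs.

```python
# Numeric check of B = h*w*c1*c2 for the Table 1 channel pairs.
def flops_1x1(h, w, c1, c2):
    return h * w * c1 * c2

for c1, c2 in [(128, 128), (90, 180), (52, 312), (36, 432)]:
    print(c1, c2, flops_1x1(56, 56, c1, c2))
# All four ratios give roughly equal FLOPs (about 5e7), as the
# fixed-FLOPs design of Table 1 requires.
```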
For simplicity, we assume the cache in the computing device is large enough
to store the entire feature maps and parameters. Thus, the memory access cost
(MAC), or the number of memory access operations, is MAC = hw(c1 + c2) + c1c2.
Note that the two terms correspond to the memory access for the input/output
feature maps and the kernel weights, respectively.
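Before the formal derivation, a quick numeric check: plugging the same fixed-FLOPs channel pairs from Table 1 into the MAC formula (again assuming h = w = 56) shows MAC growing as the ratio c1:c2 becomes more unbalanced, consistent with the runtime degradation observed in the table.

```python
# MAC = h*w*(c1 + c2) + c1*c2 for the fixed-FLOPs pairs of Table 1.
def mac_1x1(h, w, c1, c2):
    return h * w * (c1 + c2) + c1 * c2

for c1, c2 in [(128, 128), (90, 180), (52, 312), (36, 432)]:
    print(f"{c1}:{c2}  MAC = {mac_1x1(56, 56, c1, c2):,}")
# MAC rises from 819,200 at 1:1 to 1,483,200 at 1:12.
```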
From the mean value inequality, we have