Decoupled Convolutions for CNNs
Guotian Xie,1,2∗ Ting Zhang,4 Kuiyuan Yang,3 Jianhuang Lai,1,2 Jingdong Wang4
1School of Data and Computer Science, Sun Yat-Sen University
2Guangdong Province Key Laboratory of Information Security
3DeepMotion, 4Microsoft Research
xieguotian1990@gmail.com, {Ting.Zhang, jingdw}@microsoft.com
kuiyuanyang@deepmotion.ai, stsljh@mail.sysu.edu.cn
∗This work was done when Guotian Xie was an intern at Microsoft Research, Beijing, P.R. China.
Abstract
In this paper, we are interested in designing small CNNs by decoupling the convolution along the spatial and channel domains. Most existing decoupling techniques focus on approximating the filter matrix through decomposition. In contrast, we provide a two-step interpretation of the standard convolution, moving from the filter applied at a single location to all locations, that is exactly equivalent to the standard convolution. Motivated by the observations in this decoupled view, we propose an effective approach that relaxes the sparsity of the filter in spatial aggregation by learning a spatial configuration, and reduces the redundancy by reducing the number of intermediate channels. Our approach achieves classification performance comparable to the standard (uncoupled) convolution, but with a smaller model size, on CIFAR-100, CIFAR-10 and ImageNet.
Introduction
Since AlexNet (Krizhevsky, Sutskever, and Hinton 2012) successfully applied a Convolutional Neural Network (CNN) to ImageNet and won the challenge by a large margin in 2012, CNNs have become the most widely used models for image classification (He et al. 2016), object detection (Ren et al. 2015; Redmon and Farhadi 2016), image segmentation (Long, Shelhamer, and Darrell 2015; Kolesnikov and Lampert 2016), and other tasks. CNNs have become deeper and deeper (Simonyan and Zisserman 2014; Szegedy et al. 2015; He et al. 2015; 2016; Huang et al. 2016), ranging from tens of layers to thousands of layers in pursuit of better performance, and have become wider and wider as well, such as Wide Residual Networks (Zagoruyko and Komodakis 2016).
Another research direction is designing more effective filters. There have been many works on filter design, and most of them can be categorized into two types. One is to decompose the filter matrix into several low-rank matrices (Ioannou et al. 2015; Denton et al. 2014; Zhang et al. 2015; Kim et al. 2015; Tai et al. 2015; Jaderberg, Vedaldi, and Zisserman 2014; Mamalet and Garcia 2012); the other is to view the filter as a sparse matrix, where some works sparsify the
channel extent, e.g., group convolution (Ioannou et al. 2016; Zhang et al. 2017) and channel-wise convolution or separable filters (Chollet 2016), and other works sparsify the spatial extent with smaller filters, e.g., 3 × 3, 1 × 3 and 3 × 1 (Szegedy et al. 2016). In this paper, in contrast to designing the filters, we are interested in decoupling the convolution along the spatial and channel domains, and we propose an effective approach based on this decoupled interpretation.
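To make the two filter-design families concrete, the following PyTorch-style sketch (the framework and all channel counts are our own illustrative assumptions, not taken from the cited works) shows a low-rank spatial factorization, a channel-wise separable filter, and a group convolution.

import torch.nn as nn

# (a) Low-rank decomposition: a 3 x 3 filter approximated by a 1 x 3 followed by a 3 x 1 filter.
low_rank = nn.Sequential(
    nn.Conv2d(64, 64, kernel_size=(1, 3), padding=(0, 1)),
    nn.Conv2d(64, 64, kernel_size=(3, 1), padding=(1, 0)),
)

# (b) Sparsifying the channel extent: a channel-wise (depthwise) 3 x 3 convolution
# followed by a 1 x 1 convolution, as in separable filters, and a group convolution
# that restricts each filter to a subset of the input channels.
separable = nn.Sequential(
    nn.Conv2d(64, 64, kernel_size=3, padding=1, groups=64),  # channel-wise
    nn.Conv2d(64, 64, kernel_size=1),                        # pointwise
)
grouped = nn.Conv2d(64, 64, kernel_size=3, padding=1, groups=4)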
We start by analyzing the process of convolution on the input and decompose this process into two steps. First, each location in the input is projected across the channel domain; this projection does not involve the spatial information of the input. Second, we accumulate the projections of the locations across the spatial domain; this step depends only on the spatial relationship. We reformulate these two decoupled steps in a convolution form: first a 1 × 1 across channel-domain convolution, and then an across spatial-domain convolution with a spatial configuration. We denote this process as decoupling spatial convolution.
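As a sanity check of this interpretation, the following minimal sketch (assuming PyTorch; the shapes and variable names are our own) numerically verifies that a standard k × k convolution equals a 1 × 1 across channel-domain convolution, producing one intermediate channel per (output channel, spatial offset) pair, followed by a fixed sparse across spatial-domain aggregation.

import torch
import torch.nn.functional as F

C_in, C_out, k, H, W = 4, 8, 3, 10, 10
x = torch.randn(1, C_in, H, W)
w = torch.randn(C_out, C_in, k, k)            # standard filter bank

# Step 1: 1 x 1 across channel-domain convolution; intermediate channel
# (o, i, j) holds the response of output filter o at spatial offset (i, j).
w1 = w.permute(0, 2, 3, 1).reshape(C_out * k * k, C_in, 1, 1)
mid = F.conv2d(x, w1)                          # shape (1, C_out*k*k, H, W)

# Step 2: across spatial-domain aggregation; each intermediate channel gets a
# one-hot k x k kernel (the fixed spatial configuration), and every group of
# k*k intermediate channels is summed into one output channel.
w2 = torch.zeros(C_out, k * k, k, k)
for o in range(C_out):
    for i in range(k):
        for j in range(k):
            w2[o, i * k + j, i, j] = 1.0
out_decoupled = F.conv2d(mid, w2, padding=k // 2, groups=C_out)

out_standard = F.conv2d(x, w, padding=k // 2)
print(torch.allclose(out_standard, out_decoupled, atol=1e-5))  # expected: True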
From this decoupled view, we find that the decoupled structure of the standard spatial convolution is unbalanced: the 1 × 1 across channel-domain convolution lies in a high-dimensional space, which might lead to redundancy, whereas the across spatial-domain convolution is a structured sparse group convolution. To address this problem, we propose the balanced decoupling spatial convolution (BDSC), which relaxes the sparsity of the across spatial-domain convolution by learning a spatial configuration, and reduces the redundancy of the across channel-domain convolution by reducing the number of intermediate output channels. Our experiments show that models using our decoupling convolution perform only slightly worse than those using the standard spatial convolution, while having a smaller model size.
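One possible instantiation of this idea, assuming PyTorch, is sketched below; the module name, grouping pattern, and channel counts are our own assumptions and are only meant to illustrate the two modifications (a smaller intermediate space and learned, dense spatial kernels), not to reproduce the exact design of BDSC.

import torch.nn as nn

class DecoupledBlock(nn.Module):
    """Hypothetical sketch: a 1 x 1 across channel-domain convolution into a
    reduced intermediate space, followed by an across spatial-domain grouped
    convolution whose k x k kernels (the spatial configuration) are learned
    rather than fixed to one-hot patterns."""

    def __init__(self, in_channels, out_channels, k=3, mid_per_out=2):
        super().__init__()
        mid_channels = out_channels * mid_per_out  # far fewer than out_channels * k * k
        self.channel = nn.Conv2d(in_channels, mid_channels, kernel_size=1, bias=False)
        self.spatial = nn.Conv2d(mid_channels, out_channels, kernel_size=k,
                                 padding=k // 2, groups=out_channels, bias=False)

    def forward(self, x):
        return self.spatial(self.channel(x))

Under these illustrative channel counts, DecoupledBlock(64, 64) uses 64 x 128 = 8,192 weights in the 1 x 1 step and 64 x 2 x 3 x 3 = 1,152 in the spatial step, roughly 9.3K in total, versus 64 x 64 x 3 x 3 ≈ 36.9K for a standard 3 x 3 convolution.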
Our contributions in this paper are:
1. We decouple the standard spatial convolution of CNN into
two parts, an across channel-domain convolution and an
across spatial-domain convolution.
2. We propose the balanced decoupling spatial convolution to relax the sparsity of the filter in spatial aggregation by learning a spatial configuration, and to reduce the redundancy of the 1 × 1 across channel-domain convolution by reducing the number of intermediate output channels.