Active Convolution: Learning the Shape of Convolution for Image Classification
Yunho Jeon
EE, KAIST
jyh2986@kaist.ac.kr
Junmo Kim
EE, KAIST
junmo.kim@kaist.ac.kr
Abstract
In recent years, deep learning has achieved great suc-
cess in many computer vision applications. Convolutional
neural networks (CNNs) have lately emerged as a major ap-
proach to image classification. Most research on CNNs thus
far has focused on developing architectures such as the In-
ception and residual networks. The convolution layer is the
core of the CNN, but few studies have addressed the convo-
lution unit itself. In this paper, we introduce a convolution
unit called the active convolution unit (ACU). This new convolution has no fixed shape, so we can define any form of convolution. Its shape can be learned through backpropagation during training. Our proposed unit has a
few advantages. First, the ACU is a generalization of convo-
lution; it can define not only all conventional convolutions,
but also convolutions with fractional pixel coordinates. We
can freely change the shape of the convolution, which pro-
vides greater freedom to form CNN structures. Second, the
shape of the convolution is learned during training, and there is no need to tune it by hand. Third, the ACU can learn better than a conventional unit: we obtained improvements simply by replacing conventional convolutions with ACUs. We tested our proposed method on plain and residual networks, and it showed significant improvements over the baselines on various datasets and architectures.
1. Introduction
Following the success of deep learning in the ImageNet
Large Scale Visual Recognition Challenge (ILSVRC) [20],
the best performance in classification competitions has al-
most invariably been achieved on convolutional neural net-
work (CNN) architectures. AlexNet [16] is composed of convolutions with three receptive field sizes (3 × 3, 5 × 5, and 11 × 11). VGG [21] is based on the idea that a stack of two convolutional layers with 3 × 3 receptive fields is more effective than a single 5 × 5 convolution (see the note after this paragraph). GoogLeNet [24, 25, 26]
introduced an Inception layer for the composition of various
receptive fields. The residual network [10, 11, 29], which
adds shortcut connections to implement identity mapping,
allows more layers to be stacked without running into the
vanishing gradient problem. Recent research on CNNs has mostly focused on composing layers rather than on the convolution unit itself.
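To make VGG's stacking argument concrete: two 3 × 3 convolutions stacked with stride 1 cover the same 5 × 5 region as a single 5 × 5 convolution, because each layer enlarges the receptive field by k − 1 = 2 pixels; with C input and output channels, the stack uses 2 · 3² · C² = 18C² weights instead of 5² · C² = 25C², and it inserts an additional nonlinearity between the two layers.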
Other basic units, such as activation and pooling units,
have been studied with many variations. The sigmoid [7] and tanh functions were the basic activations in early neural networks. The rectified linear unit (ReLU) [19] was suggested to overcome the vanishing gradient problem and achieved good results without pre-training. Since then, many variants of the ReLU have been suggested, such as the leaky ReLU (LReLU) [18], randomized LReLU [27], parametric ReLU [9], and the exponential linear unit [3]. Other types of activa-
tion units have been suggested to learn subnetworks, such
as Maxout [5] and local winner-take-all [23].
Pooling is another basic operation in CNNs, used to reduce resolution and enable translation invariance. Max and aver-
age pooling are the most popular methods. Spatial pyramid
pooling [8] was introduced to deal with inputs of varying
resolution. The ROI pooling method was used to speed up
detection [4]. Recently, fractional pooling [6] has been ap-
plied to image classification. Lee et al. [17] proposed a gen-
eral pooling method that combines pooling and convolution
units. On the other hand, Springenberg et al. [22] showed
that using only convolution units is sufficient without any
pooling.
However, only a few studies have considered convolu-
tion units themselves. Dilated convolution [1, 28] has been suggested for dense prediction in segmentation; because it enlarges the receptive field without downsampling, it reduces the post-processing needed to restore the resolution of the segmentation result. Permutohedral lattice convolution [14] expands the convolved dimensions from the spatial domain to the color domain, which enables the pairwise potentials of conditional random fields to be learned.
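Dilated convolution is exposed directly in common deep learning frameworks. The following is a minimal sketch, assuming a recent version of PyTorch; the tensor sizes are arbitrary illustrative choices.

import torch
import torch.nn as nn

# With dilation=2, the taps of a 3 x 3 kernel are spaced two pixels apart,
# so the kernel spans a 5 x 5 region while keeping only 9 weights per
# channel pair; padding=2 preserves the spatial resolution.
x = torch.randn(1, 16, 32, 32)  # (batch, channels, height, width)
dilated = nn.Conv2d(16, 16, kernel_size=3, dilation=2, padding=2)
y = dilated(x)
print(y.shape)  # torch.Size([1, 16, 32, 32])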
In this paper, we propose a new convolution unit. Unlike
conventional convolution and its variants, this unit does not have a fixed receptive field shape, and can be used to realize more diverse forms of receptive fields. Moreover, its shape can be learned during the training procedure (see the sketch below). Since the shape of the unit is deformable and
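To make the proposed idea concrete, the following is a minimal sketch, assuming a recent version of PyTorch. The class name ActiveConv2d, the use of grid_sample for bilinear interpolation, and the 3 × 3 grid initialization are illustrative assumptions, not the paper's reference implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class ActiveConv2d(nn.Module):
    """Convolution whose sampling positions (its shape) are learnable."""

    def __init__(self, in_ch, out_ch, num_points=9):
        super().__init__()
        assert num_points == 9, "this sketch initializes a 3 x 3 grid"
        # One weight per (output channel, input channel, sampling point).
        self.weight = nn.Parameter(0.01 * torch.randn(out_ch, in_ch, num_points))
        self.bias = nn.Parameter(torch.zeros(out_ch))
        # Learnable (x, y) offsets in pixels, shared across all spatial
        # locations; initialized to a conventional 3 x 3 grid.
        init = [(float(x), float(y)) for y in (-1, 0, 1) for x in (-1, 0, 1)]
        self.offsets = nn.Parameter(torch.tensor(init))  # (num_points, 2)

    def forward(self, x):
        n, _, h, w = x.shape
        # Base sampling grid in the normalized [-1, 1] coordinates used by
        # grid_sample; the last dimension is (x, y).
        ys = torch.linspace(-1.0, 1.0, h, device=x.device)
        xs = torch.linspace(-1.0, 1.0, w, device=x.device)
        gy, gx = torch.meshgrid(ys, xs, indexing="ij")
        base = torch.stack((gx, gy), dim=-1)  # (h, w, 2)
        # Conversion factor from pixel offsets to normalized coordinates.
        scale = x.new_tensor([2.0 / max(w - 1, 1), 2.0 / max(h - 1, 1)])
        out = x.new_zeros(n, self.weight.shape[0], h, w)
        for k in range(self.offsets.shape[0]):
            grid = (base + self.offsets[k] * scale).expand(n, -1, -1, -1)
            # Bilinear interpolation at fractional positions; differentiable
            # with respect to both the input and the offsets.
            sampled = F.grid_sample(x, grid, mode="bilinear",
                                    padding_mode="zeros", align_corners=True)
            # Apply the k-th sampling point's weights as a 1 x 1 convolution.
            out = out + F.conv2d(sampled, self.weight[:, :, k, None, None])
        return out + self.bias.view(1, -1, 1, 1)

Because bilinear interpolation is differentiable with respect to the sampling coordinates, the offsets receive gradients just like the weights, which is what allows the shape of the convolution to be learned by backpropagation. For example, ActiveConv2d(16, 32)(torch.randn(1, 16, 8, 8)) produces a tensor of shape (1, 32, 8, 8).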