Xception: Deep Learning with Depthwise Separable Convolutions
François Chollet
Google, Inc.
fchollet@google.com
Abstract
We present an interpretation of Inception modules in con-
volutional neural networks as being an intermediate step
in-between regular convolution and the depthwise separable
convolution operation (a depthwise convolution followed by
a pointwise convolution). In this light, a depthwise separable
convolution can be understood as an Inception module with
a maximally large number of towers. This observation leads
us to propose a novel deep convolutional neural network
architecture inspired by Inception, where Inception modules
have been replaced with depthwise separable convolutions.
We show that this architecture, dubbed Xception, slightly
outperforms Inception V3 on the ImageNet dataset (which
Inception V3 was designed for), and significantly outper-
forms Inception V3 on a larger image classification dataset
comprising 350 million images and 17,000 classes. Since
the Xception architecture has the same number of param-
eters as Inception V3, the performance gains are not due
to increased capacity but rather to a more efficient use of
model parameters.
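To make the operation referred to above concrete, the following sketch expresses one depthwise separable convolution with standard Keras layers; the input shape and channel counts are illustrative and are not taken from the paper.

import tensorflow as tf
from tensorflow.keras import layers

inputs = tf.keras.Input(shape=(32, 32, 64))  # hypothetical feature map

# Depthwise step: one 3x3 spatial filter per input channel, no channel mixing.
x = layers.DepthwiseConv2D(kernel_size=3, padding="same", use_bias=False)(inputs)
# Pointwise step: a 1x1 convolution that mixes channels only.
outputs = layers.Conv2D(filters=128, kernel_size=1, use_bias=False)(x)

# Keras also provides a fused form of the same two steps:
fused = layers.SeparableConv2D(filters=128, kernel_size=3,
                               padding="same", use_bias=False)(inputs)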
1. Introduction
Convolutional neural networks have emerged as the mas-
ter algorithm in computer vision in recent years, and de-
veloping recipes for designing them has been a subject of
considerable attention. The history of convolutional neural
network design started with LeNet-style models [10], which
were simple stacks of convolutions for feature extraction
and max-pooling operations for spatial sub-sampling. In
2012, these ideas were refined into the AlexNet architec-
ture [9], where convolution operations were being repeated
multiple times in-between max-pooling operations, allowing
the network to learn richer features at every spatial scale.
What followed was a trend to make this style of network
increasingly deeper, mostly driven by the yearly ILSVRC
competition; first with Zeiler and Fergus in 2013 [25] and
then with the VGG architecture in 2014 [18].
At this point a new style of network emerged, the Incep-
tion architecture, introduced by Szegedy et al. in 2014 [20]
as GoogLeNet (Inception V1), later refined as Inception V2
[7], Inception V3 [21], and most recently Inception-ResNet [19]. Inception itself was inspired by the earlier Network-In-Network architecture [11]. Since its first introduction,
Inception has been one of the best performing families of
models on the ImageNet dataset [14], as well as internal
datasets in use at Google, in particular JFT [5].
The fundamental building block of Inception-style mod-
els is the Inception module, of which several different ver-
sions exist. In figure 1 we show the canonical form of an
Inception module, as found in the Inception V3 architec-
ture. An Inception model can be understood as a stack of
such modules. This is a departure from earlier VGG-style
networks which were stacks of simple convolution layers.
While Inception modules are conceptually similar to con-
volutions (they are convolutional feature extractors), they
empirically appear to be capable of learning richer repre-
sentations with fewer parameters. How do they work, and
how do they differ from regular convolutions? What design
strategies come after Inception?
1.1. The Inception hypothesis
A convolution layer attempts to learn filters in a 3D space,
with 2 spatial dimensions (width and height) and a chan-
nel dimension; thus a single convolution kernel is tasked
with simultaneously mapping cross-channel correlations and
spatial correlations.
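As a worked illustration of this coupling (the channel counts below are hypothetical, chosen only for the arithmetic), the weight tensor of a regular convolution layer spans both kinds of correlation at once:

# Parameter count of one regular 3x3 convolution layer, with hypothetical sizes.
# The single kernel tensor spans height, width, input channels and output
# channels, so spatial and cross-channel structure are learned jointly.
kernel_h, kernel_w = 3, 3
in_channels, out_channels = 256, 256
regular_conv_params = kernel_h * kernel_w * in_channels * out_channels
print(regular_conv_params)  # 589824 weights in a single jointly learned kernel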
The idea behind the Inception module is to make this
process easier and more efficient by explicitly factoring it
into a series of operations that would independently look at
cross-channel correlations and at spatial correlations. More
precisely, the typical Inception module first looks at cross-
channel correlations via a set of 1x1 convolutions, mapping
the input data into 3 or 4 separate spaces that are smaller than
the original input space, and then maps all correlations in
these smaller 3D spaces, via regular 3x3 or 5x5 convolutions.
This is illustrated in figure 1. In effect, the fundamental hypothesis behind Inception is that cross-channel correlations and spatial correlations are sufficiently decoupled that it is preferable not to map them jointly¹.
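A minimal sketch of this factoring, written as a single Inception-style tower in Keras, may help: a 1x1 convolution first maps cross-channel correlations into a smaller space, and a 3x3 convolution then maps spatial correlations within it. The module of figure 1 runs several such towers in parallel and concatenates their outputs; all channel counts below are illustrative, not the paper's.

import tensorflow as tf
from tensorflow.keras import layers

inputs = tf.keras.Input(shape=(32, 32, 256))  # hypothetical feature map

# Cross-channel step: a 1x1 convolution projects the input into a smaller space.
x = layers.Conv2D(filters=64, kernel_size=1, activation="relu")(inputs)
# Spatial step: a regular 3x3 convolution maps correlations within that space.
tower = layers.Conv2D(filters=64, kernel_size=3, padding="same",
                      activation="relu")(x)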
¹ A variant of the process is to independently look at width-wise corre-