Going Deeper with Convolutions
Christian Szegedy¹, Wei Liu², Yangqing Jia¹, Pierre Sermanet¹, Scott Reed³,
Dragomir Anguelov¹, Dumitru Erhan¹, Vincent Vanhoucke¹, Andrew Rabinovich⁴

¹Google Inc.   ²University of North Carolina, Chapel Hill
³University of Michigan, Ann Arbor   ⁴Magic Leap Inc.

¹{szegedy,jiayq,sermanet,dragomir,dumitru,vanhoucke}@google.com
²wliu@cs.unc.edu   ³reedscott@umich.edu   ⁴arabinovich@magicleap.com
Abstract
We propose a deep convolutional neural network architecture codenamed Inception that achieves the new state of the art for classification and detection in the ImageNet Large-Scale Visual Recognition Challenge 2014 (ILSVRC14). The main hallmark of this architecture is the improved utilization of the computing resources inside the network. By a carefully crafted design, we increased the depth and width of the network while keeping the computational budget constant. To optimize quality, the architectural decisions were based on the Hebbian principle and the intuition of multi-scale processing. One particular incarnation used in our ILSVRC14 submission is called GoogLeNet, a 22-layer-deep network whose quality is assessed in the context of classification and detection.
1. Introduction
In the last three years, our object classification and detection capabilities have dramatically improved due to advances in deep learning and convolutional networks [10].
Encouragingly, most of this progress is not just the result of more powerful hardware, larger datasets, and bigger models, but mainly a consequence of new ideas, algorithms, and improved network architectures. The top entries in the ILSVRC 2014 competition, for example, used no new data sources beyond the competition's own classification dataset (repurposed for detection). Our
GoogLeNet submission to ILSVRC 2014 actually uses 12 times fewer parameters than the winning architecture of Krizhevsky et al. [9] from two years ago, while being significantly more accurate. On the object detection front, the biggest gains have come not from the naive application of ever bigger deep networks, but from the synergy of deep architectures and classical computer vision, such as the R-CNN algorithm by Girshick et al. [6].
Another notable factor is that with the ongoing traction of mobile and embedded computing, the efficiency of our algorithms, especially their power and memory use, gains importance. It is noteworthy that the considerations leading to the design of the deep architecture presented in this paper included this factor, rather than a sheer fixation on accuracy numbers. For most of the experiments, the models were designed to keep a computational budget of 1.5 billion multiply-adds at inference time, so that they do not end up as a purely academic curiosity but can be put to real-world use, even on large datasets, at a reasonable cost.
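To give a sense of scale for this budget, the following sketch counts the multiply-adds of a single convolutional layer; the layer dimensions below are hypothetical, chosen only for illustration, not taken from the paper:

```python
def conv_multiply_adds(h, w, c_in, c_out, k):
    """Multiply-adds for a k x k convolution over an h x w x c_in input
    producing c_out feature maps (stride 1, 'same' padding): each of the
    h * w * c_out outputs needs k * k * c_in multiply-adds."""
    return h * w * c_out * (k * k * c_in)

# Hypothetical layer: 56x56 spatial resolution, 64 -> 192 channels, 3x3 kernels.
ops = conv_multiply_adds(56, 56, 64, 192, 3)
print(ops)  # 346816512 -- a single layer already consumes ~23% of a 1.5e9 budget
```

A single mid-resolution layer of this (plausible) size uses roughly a quarter of the stated 1.5 billion multiply-add budget, which illustrates why the network's width and depth cannot be increased naively.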
In this paper, we will focus on an efficient deep neural network architecture for computer vision, codenamed Inception, which derives its name from the Network in Network paper by Lin et al. [12] in conjunction with the famous "we need to go deeper" internet meme [1]. In our case, the word "deep" is used in two senses: first, in the sense that we introduce a new level of organization in the form of the "Inception module", and second, in the more direct sense of increased network depth. In general, one can view the Inception model as a logical culmination of [12], while taking inspiration and guidance from the theoretical work by Arora et al. [2]. The benefits of the architecture are experimentally verified on the ILSVRC 2014 classification and detection challenges, where it significantly outperforms the current state of the art.
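The Inception module itself is specified later in the paper; as a purely structural sketch of the idea, the following code runs four parallel branches (1x1, 3x3, and 5x5 convolutions plus max pooling, with 1x1 "reduce" layers before the expensive convolutions) and concatenates their outputs channel-wise. All weights are random and the branch widths are hypothetical, not GoogLeNet's actual values:

```python
import numpy as np

def conv2d(x, w):
    """Naive 'same'-padded convolution with ReLU.
    x: (H, W, C_in), w: (k, k, C_in, C_out) with odd k."""
    k = w.shape[0]
    p = k // 2
    xp = np.pad(x, ((p, p), (p, p), (0, 0)))
    H, W = x.shape[:2]
    out = np.zeros((H, W, w.shape[-1]))
    for i in range(H):
        for j in range(W):
            out[i, j] = np.tensordot(xp[i:i + k, j:j + k], w, axes=3)
    return np.maximum(out, 0.0)

def maxpool3x3(x):
    """3x3 max pooling, stride 1, 'same' padding."""
    xp = np.pad(x, ((1, 1), (1, 1), (0, 0)), constant_values=-np.inf)
    H, W = x.shape[:2]
    out = np.empty_like(x)
    for i in range(H):
        for j in range(W):
            out[i, j] = xp[i:i + 3, j:j + 3].max(axis=(0, 1))
    return out

def inception(x, c1, c3r, c3, c5r, c5, cp, rng):
    """One Inception-style module: four parallel branches concatenated
    along the channel axis. c3r/c5r are the widths of the 1x1 'reduce'
    layers placed before the more expensive 3x3/5x5 convolutions."""
    cin = x.shape[-1]
    w = lambda k, ci, co: rng.standard_normal((k, k, ci, co)) * 0.1
    b1 = conv2d(x, w(1, cin, c1))                          # 1x1 branch
    b3 = conv2d(conv2d(x, w(1, cin, c3r)), w(3, c3r, c3))  # 1x1 reduce -> 3x3
    b5 = conv2d(conv2d(x, w(1, cin, c5r)), w(5, c5r, c5))  # 1x1 reduce -> 5x5
    bp = conv2d(maxpool3x3(x), w(1, cin, cp))              # pool -> 1x1 project
    return np.concatenate([b1, b3, b5, bp], axis=-1)

rng = np.random.default_rng(0)
y = inception(rng.standard_normal((8, 8, 16)), 8, 6, 12, 2, 4, 4, rng)
print(y.shape)  # (8, 8, 28): 8 + 12 + 4 + 4 output channels
```

Note how the 1x1 reductions shrink the channel count before the 3x3 and 5x5 convolutions run; this is what lets the depth and width grow while the multiply-add budget stays roughly constant.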
2. Related Work
Starting with LeNet-5 [10], convolutional neural networks (CNN) have typically had a standard structure –
stacked convolutional layers (optionally followed by con-