Perhaps the most widely studied CNN macroarchitecture topic in the recent literature is the impact
of depth (i.e. the number of layers) in networks. Simonyan and Zisserman proposed the VGG (Simonyan
& Zisserman, 2014) family of CNNs with 12 to 19 layers and reported that deeper networks produce
higher accuracy on the ImageNet-1k dataset (Deng et al., 2009). K. He et al. proposed deeper CNNs
with up to 30 layers that deliver even higher ImageNet accuracy (He et al., 2015a).
The choice of connections across multiple layers or modules is an emerging area of CNN macroar-
chitectural research. Residual Networks (ResNet) (He et al., 2015b) and Highway Networks (Sri-
vastava et al., 2015) each propose the use of connections that skip over multiple layers, for example
additively connecting the activations from layer 3 to the activations from layer 6. We refer to these
connections as bypass connections. The authors of ResNet provide an A/B comparison of a 34-layer
CNN with and without bypass connections; adding bypass connections delivers a 2 percentage-point
improvement on Top-5 ImageNet accuracy.
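To make the notion of a bypass connection concrete, the following sketch (an illustration in PyTorch, not the authors' implementation; the layer sizes are hypothetical) adds the activations of an earlier layer to those of a later layer, skipping the layers in between:
```python
# Illustrative sketch of an additive bypass (skip) connection; layer sizes are
# hypothetical and this is not the exact ResNet or Highway Network architecture.
import torch
import torch.nn as nn

class BypassBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.conv_a = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.conv_b = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.relu = nn.ReLU()

    def forward(self, x):
        # the input x skips over conv_a and conv_b and is added back in additively
        return self.relu(x + self.conv_b(self.relu(self.conv_a(x))))

y = BypassBlock(64)(torch.randn(1, 64, 56, 56))  # output shape: (1, 64, 56, 56)
```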
2.4 NEURAL NETWORK DESIGN SPACE EXPLORATION
Neural networks (including deep and convolutional NNs) have a large design space, with numerous
options for microarchitectures, macroarchitectures, solvers, and other hyperparameters. It seems
natural that the community would want to gain intuition about how these factors impact a NN’s
accuracy (i.e. the shape of the design space). Much of the work on design space exploration (DSE)
of NNs has focused on developing automated approaches for finding NN architectures that deliver
higher accuracy. These automated DSE approaches include Bayesian optimization (Snoek et al.,
2012), simulated annealing (Ludermir et al., 2006), randomized search (Bergstra & Bengio, 2012),
and genetic algorithms (Stanley & Miikkulainen, 2002). To their credit, each of these papers pro-
vides a case in which the proposed DSE approach produces a NN architecture that achieves higher
accuracy compared to a representative baseline. However, these papers make no attempt to provide
intuition about the shape of the NN design space. Later in this paper, we eschew automated
approaches; instead, we refactor CNNs in such a way that we can perform principled A/B comparisons
to investigate how CNN architectural decisions influence model size and accuracy.
In the following sections, we first propose and evaluate the SqueezeNet architecture with and with-
out model compression. Then, we explore the impact of design choices in microarchitecture and
macroarchitecture for SqueezeNet-like CNN architectures.
3 SQUEEZENET: PRESERVING ACCURACY WITH FEW PARAMETERS
In this section, we begin by outlining our design strategies for CNN architectures with few param-
eters. Then, we introduce the Fire module, our new building block from which to build CNN
architectures. Finally, we use our design strategies to construct SqueezeNet, which consists
mainly of Fire modules.
3.1 ARCHITECTURAL DESIGN STRATEGIES
Our overarching objective in this paper is to identify CNN architectures that have few parameters
while maintaining competitive accuracy. To achieve this, we employ three main strategies when
designing CNN architectures:
Strategy 1. Replace 3x3 filters with 1x1 filters. Given a budget of a certain number of convolution
filters, we will choose to make the majority of these filters 1x1, since a 1x1 filter has 9X fewer
parameters than a 3x3 filter.
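As a concrete illustration of this saving (a PyTorch sketch with hypothetical channel and filter counts, not taken from the paper), compare the parameter counts of a 3x3 and a 1x1 filter bank with the same number of filters:
```python
# Illustrative comparison of parameter counts for 3x3 vs 1x1 convolution filters;
# the channel and filter counts (64 each) are hypothetical.
import torch.nn as nn

in_channels, num_filters = 64, 64

conv3x3 = nn.Conv2d(in_channels, num_filters, kernel_size=3, bias=False)
conv1x1 = nn.Conv2d(in_channels, num_filters, kernel_size=1, bias=False)

params3x3 = sum(p.numel() for p in conv3x3.parameters())  # 64*64*3*3 = 36864
params1x1 = sum(p.numel() for p in conv1x1.parameters())  # 64*64*1*1 = 4096

print(params3x3 // params1x1)  # 9, i.e. a 1x1 filter has 9X fewer parameters
```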
Strategy 2. Decrease the number of input channels to 3x3 filters. Consider a convolution layer
that is comprised entirely of 3x3 filters. The total quantity of parameters in this layer is (number of
input channels) * (number of filters) * (3*3). So, to maintain a small total number of parameters
in a CNN, it is important not only to decrease the number of 3x3 filters (see Strategy 1 above), but
also to decrease the number of input channels to the 3x3 filters. We decrease the number of input
channels to 3x3 filters using squeeze layers, which we describe in the next section.
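To illustrate Strategy 2 (a sketch with hypothetical channel counts, not the paper's exact layer sizes), the count (number of input channels) * (number of filters) * (3*3) shrinks substantially when a 1x1 squeeze layer first reduces the number of input channels seen by the 3x3 filters:
```python
# Illustrative sketch: a 1x1 "squeeze" layer reduces the number of input channels
# feeding into 3x3 filters, shrinking (input channels) * (filters) * (3*3).
# All channel and filter counts below are hypothetical.
def conv_params(in_channels, num_filters, k):
    return in_channels * num_filters * k * k  # ignoring bias terms

# Without squeezing: the 3x3 filters see all 128 input channels.
print(conv_params(128, 64, 3))   # 128*64*9 = 73728 parameters

# With squeezing: a 1x1 layer first maps 128 channels down to 16,
# so the 3x3 filters only see 16 input channels.
print(conv_params(128, 16, 1))   # squeeze layer: 128*16*1 = 2048 parameters
print(conv_params(16, 64, 3))    # 3x3 layer:      16*64*9 = 9216 parameters
```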
Strategy 3. Downsample late in the network so that convolution layers have large activation
maps. In a convolutional network, each convolution layer produces an output activation map with
a spatial resolution that is at least 1x1 and often much larger than 1x1. The height and width of
these activation maps are controlled by: (1) the size of the input data (e.g. 256x256 images) and (2)
the choice of layers in which to downsample in the CNN architecture.
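As a small illustration of Strategy 3 (hypothetical stride schedules, not the paper's architecture), placing stride-2 downsampling later in a network leaves the earlier convolution layers with larger activation maps:
```python
# Illustrative sketch: how the placement of stride-2 downsampling affects the
# spatial size of each layer's activation map for a 256x256 input.
# The stride schedules below are hypothetical.
def activation_sizes(input_size, strides):
    sizes, size = [], input_size
    for stride in strides:
        size //= stride
        sizes.append(size)
    return sizes

print(activation_sizes(256, [2, 2, 2, 1, 1]))  # early downsampling: [128, 64, 32, 32, 32]
print(activation_sizes(256, [1, 1, 2, 2, 2]))  # late downsampling:  [256, 256, 128, 64, 32]
```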