ImageNet datasets. But the feature computation in R-
CNN is time-consuming, because it repeatedly applies
the deep convolutional networks to the raw pixels
of thousands of warped regions per image. In this
paper, we show that we can run the convolutional
layers only once on the entire image (regardless of
the number of windows), and then extract features
by SPP-net on the feature maps. This method yields
a speedup of over one hundred times over R-CNN.
Note that training/running a detector on the feature
maps (rather than image regions) is actually a more
popular idea [23], [24], [20], [5]. But SPP-net inherits
the power of the deep CNN feature maps and also the
flexibility of SPP on arbitrary window sizes, which
leads to outstanding accuracy and efficiency. In our
experiment, the SPP-net-based system (built upon the
R-CNN pipeline) computes convolutional features 30-
170× faster than R-CNN, and is overall 24-64× faster,
while achieving better or comparable accuracy. We further
propose a simple model combination method to boost
the result on the Pascal VOC 2007 detection task.
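As a concrete illustration of the one-pass idea, the sketch below (in PyTorch, an assumption for illustration; not the paper's original implementation) runs the convolutional layers once per image and then pools each candidate window from the shared feature map. The names backbone, spp_pool, and windows are hypothetical placeholders; spp_pool stands for a pooling layer such as the one described in Sec. 2.2.

import torch

def extract_window_features(backbone, spp_pool, image, windows):
    # Run the convolutional layers once on the whole image,
    # regardless of the number of candidate windows.
    feature_map = backbone(image)  # (1, k, H, W)
    features = []
    for (x0, y0, x1, y1) in windows:
        # Crop the window's sub-region of the shared feature map
        # (window coordinates are assumed already projected onto
        # the feature map), then pool it to a fixed-length vector.
        region = feature_map[:, :, y0:y1, x0:x1]
        features.append(spp_pool(region))
    return torch.stack(features)

In contrast, R-CNN would call the network once per warped region, i.e., thousands of times per image.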
A preliminary version of this manuscript has been
published in ECCV 2014 [25]. Based on [25], we
attended the competition of ILSVRC 2014 [26], and
ranked #2 in object detection and #3 in image clas-
sification (both are provided-data-only tracks) among
all 38 teams. There are a few modifications made
over [25] for ILSVRC 2014. We show that the SPP-
nets can boost various networks that are deeper and
larger (Sec. 3.1.2-3.1.4) over the no-SPP counterparts.
Further, driven by our detection framework, we find
that multi-view testing on feature maps with flexibly
located/sized windows (Sec. 3.1.5) can increase the
classification accuracy. This manuscript also provides
the details of these modifications.
2 DEEP NETWORKS WITH SPATIAL PYRAMID POOLING
2.1 Convolutional Layers and Feature Maps
Consider the popular seven-layer architectures [3], [4].
The first five layers are convolutional, some of which
are followed by pooling layers. These pooling layers
can also be considered “convolutional”, in the sense
that they use sliding windows. The last two
layers are fully connected, with an N-way softmax as
the output, where N is the number of categories.
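For reference in the discussion below, a schematic PyTorch sketch of such a seven-layer network follows; the layer sizes are assumptions in an AlexNet-like configuration, not the exact architectures of [3], [4].

import torch.nn as nn

def seven_layer_net(num_classes):
    # Five convolutional layers, some followed by (sliding-window)
    # max pooling, then two fully-connected layers.
    return nn.Sequential(
        nn.Conv2d(3, 96, kernel_size=11, stride=4), nn.ReLU(),
        nn.MaxPool2d(kernel_size=3, stride=2),
        nn.Conv2d(96, 256, kernel_size=5, padding=2), nn.ReLU(),
        nn.MaxPool2d(kernel_size=3, stride=2),
        nn.Conv2d(256, 384, kernel_size=3, padding=1), nn.ReLU(),
        nn.Conv2d(384, 384, kernel_size=3, padding=1), nn.ReLU(),
        nn.Conv2d(384, 256, kernel_size=3, padding=1), nn.ReLU(),
        nn.MaxPool2d(kernel_size=3, stride=2),
        nn.Flatten(),
        # The fully-connected layers demand fixed-length inputs:
        # 256*6*6 holds only for a fixed input size (227x227 here).
        nn.Linear(256 * 6 * 6, 4096), nn.ReLU(),
        nn.Linear(4096, num_classes),  # N-way softmax at the loss
    )

The hard-coded 256*6*6 in the first fully-connected layer is exactly where the fixed-size requirement enters; the next paragraphs make this point explicit.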
The deep network described above needs a fixed
image size. However, we notice that the requirement
of fixed sizes is only due to the fully-connected layers
that demand fixed-length vectors as inputs. On the
other hand, the convolutional layers accept inputs of
arbitrary sizes. The convolutional layers use sliding
filters, and their outputs have roughly the same aspect
ratio as the inputs. These outputs are known as feature
maps [1]: they capture not only the strength of the
responses, but also their spatial positions.
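This is easy to verify; a minimal check (again assuming PyTorch) shows the feature-map size tracking the input size:

import torch
import torch.nn as nn

conv1 = nn.Conv2d(3, 96, kernel_size=11, stride=4)  # first layer of the sketch above
for h, w in [(224, 224), (180, 300)]:
    fmap = conv1(torch.randn(1, 3, h, w))
    # The spatial size of the output varies with the input size,
    # keeping roughly the same aspect ratio.
    print((h, w), '->', tuple(fmap.shape))  # (224, 224) -> (1, 96, 54, 54)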
In Figure 2, we visualize some feature maps. They
are generated by some filters of the conv5 layer
(the fifth convolutional layer). Figure 2(c) shows the
images in the ImageNet dataset that most strongly
activate these filters. We see that a filter can be
activated by some
semantic content. For example, the 55-th filter (Fig-
ure 2, bottom left) is most activated by a circle shape;
the 66-th filter (Figure 2, top right) is most activated
by a ∧-shape; and the 118-th filter (Figure 2, bottom
right) is most activated by a ∨-shape. These shapes
in the input images (Figure 2(a)) activate the feature
maps at the corresponding positions (the arrows in
Figure 2).
It is worth noting that we generate the feature
maps in Figure 2 without fixing the input size. These
feature maps generated by deep convolutional lay-
ers are analogous to the feature maps in traditional
methods [27], [28]. In those methods, SIFT vectors
[29] or image patches [28] are densely extracted and
then encoded, e.g., by vector quantization [16], [15],
[30], sparse coding [17], [18], or Fisher kernels [19].
These encoded features constitute the feature maps,
and are then pooled by Bag-of-Words (BoW) [16] or
spatial pyramids [14], [15]. Analogously, the deep
convolutional features can be pooled in a similar way.
2.2 The Spatial Pyramid Pooling Layer
The convolutional layers accept arbitrary input sizes,
but they produce outputs of variable sizes. The classi-
fiers (SVM/softmax) or fully-connected layers require
fixed-length vectors. Such vectors can be generated
by the Bag-of-Words (BoW) approach [16] that pools
the features together. Spatial pyramid pooling [14],
[15] improves BoW in that it can maintain spatial
information by pooling in local spatial bins. These
spatial bins have sizes proportional to the image size,
so the number of bins is fixed regardless of the image
size. This is in contrast to the sliding window pooling
of the previous deep networks [3], where the number
of sliding windows depends on the input size.
To adapt the deep network to images of arbitrary
sizes, we replace the last pooling layer (e.g., pool5,
after the last convolutional layer) with a spatial
pyramid pooling layer. Figure 3 illustrates our method.
In each spatial bin, we pool the responses of each
filter (throughout this paper we use max pooling).
The output of the spatial pyramid pooling is a kM-
dimensional vector, where M is the number of bins
and k is the number of filters in the last convolutional
layer. This fixed-dimensional vector is the
input to the fully-connected layer.
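A minimal sketch of this layer, assuming PyTorch, is given below. adaptive_max_pool2d is used as a shortcut for the bin-size arithmetic (the paper computes the pooling windows with ceiling/floor operations); a three-level pyramid of 4x4, 2x2, and 1x1 grids gives M = 21 bins.

import torch
import torch.nn.functional as F

def spatial_pyramid_pool(feature_map, levels=(4, 2, 1)):
    # feature_map: (batch, k, H, W) with arbitrary H and W.
    # Returns a (batch, k*M) tensor, M = sum(n*n for n in levels).
    batch = feature_map.shape[0]
    pooled = []
    for n in levels:
        # Bin sizes are proportional to the feature-map size, so
        # each level always produces n x n bins regardless of size.
        out = F.adaptive_max_pool2d(feature_map, output_size=(n, n))
        pooled.append(out.reshape(batch, -1))  # k*n*n values per level
    return torch.cat(pooled, dim=1)

Feature maps of different sizes then yield the same fixed-length vector: for example, spatial_pyramid_pool(torch.randn(1, 256, 13, 13)) and spatial_pyramid_pool(torch.randn(1, 256, 10, 17)) both have shape (1, 5376), with k = 256 and M = 21.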
With spatial pyramid pooling, the input image can
be of any size. This not only allows arbitrary aspect
ratios, but also allows arbitrary scales. We can resize
the input image to any scale (e.g., min(w, h)=180, 224,
...) and apply the same deep network. When the
input image is at different scales, the network (with