Figure 2. Visualization of the feature maps. (a) Two images in Pascal VOC 2007. (b) The feature maps of some conv5 filters (#55, #66, #118, #175 shown). The arrows indicate the strongest responses and their corresponding positions in the images. (c) The ImageNet images that have the strongest responses of the corresponding filters. The green rectangles mark the receptive fields of the strongest responses.
tivated by a ∧-shape; and the 118-th filter (Figure 2, bottom
right) is most activated by a ∨-shape. These shapes in the
input images (Figure 2(a)) activate the feature maps at the
corresponding positions (the arrows in Figure 2).
It is worth noting that we generate the feature maps
in Figure 2 without fixing the input size. These feature
maps generated by deep convolutional layers are analogous
to the feature maps in traditional methods [2, 4]. In those
methods, SIFT vectors [2] or image patches [4] are densely
extracted and then encoded, e.g., by vector quantization
[25, 17, 29], sparse coding [32, 30], or Fisher kernels [22].
These encoded features constitute the feature maps, and are
then pooled by Bag-of-Words (BoW) [25] or spatial pyra-
mids [14, 17]. The deep convolutional features can be
pooled analogously.
2.2. The Spatial Pyramid Pooling Layer
The convolutional layers accept arbitrary input sizes,
and thus produce outputs of variable sizes. The classifiers
(SVM/softmax) or fully-connected layers require fixed-
length vectors. Such vectors can be generated by the Bag-
of-Words (BoW) approach [25] that pools the features to-
gether. Spatial pyramid pooling [14, 17] improves BoW in
that it can maintain spatial information by pooling in local
spatial bins. These spatial bins have sizes proportional to
the image size, so the number of bins is fixed regardless
of the image size. This is in contrast to the sliding win-
dow pooling of the previous deep networks [16], where the
number of sliding windows depends on the input size.
To adapt the deep network to images of arbitrary sizes,
we replace the pool5 layer (the pooling layer after conv5)
with a spatial pyramid pooling layer. Figure 3 illustrates
our method. In each spatial bin, we pool the responses
of each filter (throughout this paper we use max pool-
ing). The outputs of the spatial pyramid pooling are 256M-
dimensional vectors, where M is the number of bins
(256 is the number of conv5 filters). These fixed-dimensional
vectors are the input to the fully-connected layer (fc6).
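To make the fixed-length property concrete, the pooling described above can be sketched in a few lines of NumPy. This is an illustrative sketch, not the authors' implementation; the 256 channels match the conv5 filter count stated above, and the (1, 2, 3) pyramid matches the 3-level configuration used later in the paper.

```python
import numpy as np

def spatial_pyramid_pool(fmap, pyramid=(1, 2, 3)):
    """fmap: (C, H, W) feature map of any spatial size.
    Returns a vector of length C * sum(n*n for n in pyramid)."""
    C, H, W = fmap.shape
    outputs = []
    for n in pyramid:
        # Bin edges proportional to the feature-map size, so the number
        # of bins is fixed regardless of H and W.
        h_edges = np.linspace(0, H, n + 1).astype(int)
        w_edges = np.linspace(0, W, n + 1).astype(int)
        for i in range(n):
            for j in range(n):
                bin_ = fmap[:,
                            h_edges[i]:max(h_edges[i + 1], h_edges[i] + 1),
                            w_edges[j]:max(w_edges[j + 1], w_edges[j] + 1)]
                outputs.append(bin_.max(axis=(1, 2)))  # max pool per filter
    return np.concatenate(outputs)

# Two different input sizes yield the same output dimension:
# M = 1 + 4 + 9 = 14 bins, so 256 * 14 = 3584.
v1 = spatial_pyramid_pool(np.random.rand(256, 13, 13))
v2 = spatial_pyramid_pool(np.random.rand(256, 10, 17))
assert v1.shape == v2.shape == (3584,)
```

Note that the bin boundaries, not the bin count, depend on the input size, which is exactly what makes the output dimension fixed.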
With spatial pyramid pooling, the input image can be of
any size; this not only allows arbitrary aspect ratios, but
also allows arbitrary scales. We can resize the input image
to any scale (e.g., min(w, h)=180, 224, ...) and apply the
same deep network. When the input image is at different
scales, the network (with the same filter sizes) will extract
features at different scales. The scales play important roles
in traditional methods, e.g., the SIFT vectors are often ex-
tracted at multiple scales [19, 2] (determined by the sizes
of the patches and Gaussian filters). We will show that the
scales are also important for the accuracy of deep networks.
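The rescaling step above (resizing so that min(w, h) equals a target scale) can be sketched as follows; the helper name and the rounding choice are ours, not from the paper.

```python
def resize_to_scale(w, h, s):
    """Return the (width, height) that makes min(w, h) equal to s
    while preserving the aspect ratio; the same deep network is
    then applied to each rescaled image."""
    if w <= h:
        return s, round(h * s / w)
    return round(w * s / h), s

# e.g., a 640x480 image at the scales mentioned above:
print(resize_to_scale(640, 480, 180))  # (240, 180)
print(resize_to_scale(640, 480, 224))  # (299, 224)
```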
2.3. Training the Network with the Spatial Pyramid
Pooling Layer
Theoretically, the above network structure can be trained
with standard back-propagation [18], regardless of the input
image size. But in practice the GPU implementations (such
as convnet [16] and Caffe [8]) preferably run on fixed-size
input images. Next we describe our training solution that
takes advantage of these GPU implementations while still
preserving the spatial pyramid pooling behaviors.
Single-size training
As in previous works, we first consider a network taking
a fixed-size input (224×224) cropped from images. The
cropping is for the purpose of data augmentation. For an
image with a given size, we can pre-compute the bin sizes
needed for spatial pyramid pooling. Consider the feature
maps after conv5 that have a size of a×a (e.g., 13×13).
With a pyramid level of n×n bins, we implement this pool-
ing level as a sliding window pooling, where the window
size win = ⌈a/n⌉ and stride str = ⌊a/n⌋, with ⌈·⌉ and ⌊·⌋
denoting the ceiling and floor operations. With an l-level pyra-
mid, we implement l such layers. The next fully-connected
layer (fc6) will concatenate the l outputs. Figure 4 shows
an example configuration of 3-level pyramid pooling (3×3,
2×2, 1×1) in the convnet style [16].
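As a sanity check on the window and stride formulas above, the configuration for a = 13 and the 3-level pyramid can be computed directly (a short illustrative script):

```python
import math

def pool_config(a, n):
    """Sliding-window pooling parameters for an a x a map and an
    n x n pyramid level: win = ceil(a/n), stride = floor(a/n)."""
    return math.ceil(a / n), math.floor(a / n)

a = 13
for n in (3, 2, 1):
    win, stride = pool_config(a, n)
    out = (a - win) // stride + 1  # output positions per side
    print(f"{n}x{n} level: win={win}, stride={stride} -> {out}x{out} output")
```

For a = 13 this yields win/stride pairs of (5, 4), (7, 6), and (13, 13), producing exactly the 3×3, 2×2, and 1×1 outputs required.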