Figure 2. Visualization of the feature maps. (a) Two images in Pascal VOC 2007. (b) The feature maps of some conv5 filters (#55, #66, #118, #175 shown). The arrows indicate the strongest responses and their corresponding positions in the images. (c) The ImageNet images that have the strongest responses of the corresponding filters. The green rectangles mark the receptive fields of the strongest responses.
tivated by a ∧-shape; and the 118-th filter (Figure 2, bottom
right) is most activated by a ∨-shape. These shapes in the
input images (Figure 2(a)) activate the feature maps at the
corresponding positions (the arrows in Figure 2).
It is worth noting that we generate the feature maps
in Figure 2 without fixing the input size. These feature
maps generated by deep convolutional layers are analogous
to the feature maps in traditional methods [2, 4]. In those
methods, SIFT vectors [2] or image patches [4] are densely
extracted and then encoded, e.g., by vector quantization
[25, 17, 29], sparse coding [32, 30], or Fisher kernels [22].
These encoded features constitute the feature maps, and are
then pooled by Bag-of-Words (BoW) [25] or spatial pyra-
mids [14, 17]. The deep convolutional features can be
pooled analogously.
2.2. The Spatial Pyramid Pooling Layer
The convolutional layers accept arbitrary input sizes,
and thus produce outputs of variable sizes. The classifiers
(SVM/softmax) or fully-connected layers require fixed-
length vectors. Such vectors can be generated by the Bag-
of-Words (BoW) approach [25] that pools the features to-
gether. Spatial pyramid pooling [14, 17] improves BoW in
that it can maintain spatial information by pooling in local
spatial bins. These spatial bins have sizes proportional to
the image size, so the number of bins is fixed regardless
of the image size. This is in contrast to the sliding win-
dow pooling of the previous deep networks [16], where the
number of sliding windows depends on the input size.
To adapt the deep network to images of arbitrary sizes,
we replace the pool5 layer (the pooling layer after conv5)
with a spatial pyramid pooling layer. Figure 3 illustrates
our method. In each spatial bin, we pool the responses
of each filter (throughout this paper we use max pool-
ing). The outputs of the spatial pyramid pooling are 256M-
dimensional vectors, where M is the number of bins
(256 is the number of conv5 filters). These fixed-dimensional
vectors are the input to the fully-connected layer (fc6).
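To make the fixed-length property concrete, the pooling described above can be sketched in a few lines of NumPy. This is an illustrative sketch, not the authors' implementation; the 256 channels match the conv5 filter count stated above, and the (1, 2, 3) pyramid matches the 3-level configuration used later in the paper.

```python
import numpy as np

def spatial_pyramid_pool(fmap, pyramid=(1, 2, 3)):
    """fmap: (C, H, W) feature map of any spatial size.
    Returns a vector of length C * sum(n*n for n in pyramid)."""
    C, H, W = fmap.shape
    outputs = []
    for n in pyramid:
        # Bin edges proportional to the feature-map size, so the number
        # of bins is fixed regardless of H and W.
        h_edges = np.linspace(0, H, n + 1).astype(int)
        w_edges = np.linspace(0, W, n + 1).astype(int)
        for i in range(n):
            for j in range(n):
                bin_ = fmap[:,
                            h_edges[i]:max(h_edges[i + 1], h_edges[i] + 1),
                            w_edges[j]:max(w_edges[j + 1], w_edges[j] + 1)]
                outputs.append(bin_.max(axis=(1, 2)))  # max pool per filter
    return np.concatenate(outputs)

# Two different input sizes yield the same output dimension:
# M = 1 + 4 + 9 = 14 bins, so 256 * 14 = 3584.
v1 = spatial_pyramid_pool(np.random.rand(256, 13, 13))
v2 = spatial_pyramid_pool(np.random.rand(256, 10, 17))
assert v1.shape == v2.shape == (3584,)
```

Note that the bin boundaries, not the bin count, depend on the input size, which is exactly what makes the output dimension fixed.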
With spatial pyramid pooling, the input image can be of
any size; this not only allows arbitrary aspect ratios, but
also allows arbitrary scales. We can resize the input image
to any scale (e.g., min(w, h)=180, 224, ...) and apply the
same deep network. When the input image is at different
scales, the network (with the same filter sizes) will extract
features at different scales. The scales play important roles
in traditional methods, e.g., the SIFT vectors are often ex-
tracted at multiple scales [19, 2] (determined by the sizes
of the patches and Gaussian filters). We will show that the
scales are also important for the accuracy of deep networks.
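The rescaling step above (resizing so that min(w, h) equals a target scale) can be sketched as follows; the helper name and the rounding choice are ours, not from the paper.

```python
def resize_to_scale(w, h, s):
    """Return the (width, height) that makes min(w, h) equal to s
    while preserving the aspect ratio; the same deep network is
    then applied to each rescaled image."""
    if w <= h:
        return s, round(h * s / w)
    return round(w * s / h), s

# e.g., a 640x480 image at the scales mentioned above:
print(resize_to_scale(640, 480, 180))  # (240, 180)
print(resize_to_scale(640, 480, 224))  # (299, 224)
```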
2.3. Training the Network with the Spatial Pyramid
Pooling Layer
Theoretically, the above network structure can be trained
with standard back-propagation [18], regardless of the input
image size. But in practice the GPU implementations (such
as convnet [16] and Caffe [8]) preferably run on fixed-size
input images. Next we describe our training solution that
takes advantage of these GPU implementations while still
preserving the spatial pyramid pooling behaviors.
Single-size training
As in previous works, we first consider a network taking
a fixed-size input (224×224) cropped from images. The
cropping is for the purpose of data augmentation. For an
image with a given size, we can pre-compute the bin sizes
needed for spatial pyramid pooling. Consider the feature
maps after conv5 that have a size of a×a (e.g., 13×13).
With a pyramid level of n×n bins, we implement this pool-
ing level as a sliding window pooling, where the window
size win = ⌈a/n⌉ and stride str = ⌊a/n⌋, with ⌈·⌉ and ⌊·⌋
denoting the ceiling and floor operations. With an l-level pyra-
mid, we implement l such layers. The next fully-connected
layer (fc6) will concatenate the l outputs. Figure 4 shows
an example configuration of 3-level pyramid pooling (3×3,
2×2, 1×1) in the convnet style [16].
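As a sanity check on the window and stride formulas above, the configuration for a = 13 and the 3-level pyramid can be computed directly (a short illustrative script):

```python
import math

def pool_config(a, n):
    """Sliding-window pooling parameters for an a x a map and an
    n x n pyramid level: win = ceil(a/n), stride = floor(a/n)."""
    return math.ceil(a / n), math.floor(a / n)

a = 13
for n in (3, 2, 1):
    win, stride = pool_config(a, n)
    out = (a - win) // stride + 1  # output positions per side
    print(f"{n}x{n} level: win={win}, stride={stride} -> {out}x{out} output")
```

For a = 13 this yields win/stride pairs of (5, 4), (7, 6), and (13, 13), producing exactly the 3×3, 2×2, and 1×1 outputs required.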