ImageNet datasets. But the feature computation in R-
CNN is time-consuming, because it repeatedly applies
the deep convolutional networks to the raw pixels
of thousands of warped regions per image. In this
paper, we show that we can run the convolutional
layers only once on the entire image (regardless of
the number of windows), and then extract features
by SPP-net on the feature maps. This method yields
a speedup of over one hundred times over R-CNN.
Note that training/running a detector on the feature
maps (rather than image regions) is actually a more
popular idea [23], [24], [20], [5]. But SPP-net inherits
the power of the deep CNN feature maps and also the
flexibility of SPP on arbitrary window sizes, which
leads to outstanding accuracy and efficiency. In our
experiment, the SPP-net-based system (built upon the
R-CNN pipeline) computes convolutional features 30-
170× faster than R-CNN, and is overall 24-64× faster,
while achieving better or comparable accuracy. We further
propose a simple model combination method to boost
the result on the Pascal VOC 2007 detection task.
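As a concrete illustration of the one-pass idea, the sketch below (in PyTorch, an assumption for illustration; not the paper's original implementation) runs the convolutional layers once per image and then pools each candidate window from the shared feature map. The names backbone, spp_pool, and windows are hypothetical placeholders; spp_pool stands for a pooling layer such as the one described in Sec. 2.2.

import torch

def extract_window_features(backbone, spp_pool, image, windows):
    # Run the convolutional layers once on the whole image,
    # regardless of the number of candidate windows.
    feature_map = backbone(image)  # (1, k, H, W)
    features = []
    for (x0, y0, x1, y1) in windows:
        # Crop the window's sub-region of the shared feature map
        # (window coordinates are assumed already projected onto
        # the feature map), then pool it to a fixed-length vector.
        region = feature_map[:, :, y0:y1, x0:x1]
        features.append(spp_pool(region))
    return torch.stack(features)

In contrast, R-CNN would call the network once per warped region, i.e., thousands of times per image.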
A preliminary version of this manuscript has been
published in ECCV 2014 [25]. Based on [25], we
attended the competition of ILSVRC 2014 [26], and
ranked #2 in object detection and #3 in image clas-
sification (both are provided-data-only tracks) among
all 38 teams. There are a few modifications made
over [25] for ILSVRC 2014. We show that the SPP-
nets can boost various networks that are deeper and
larger (Sec. 3.1.2-3.1.4) over the no-SPP counterparts.
Further, driven by our detection framework, we find
that multi-view testing on feature maps with flexibly
located/sized windows (Sec. 3.1.5) can increase the
classification accuracy. This manuscript also provides
the details of these modifications.
2 DEEP NETWORKS WITH SPATIAL PYRAMID POOLING
2.1 Convolutional Layers and Feature Maps
Consider the popular seven-layer architectures [3], [4].
The first five layers are convolutional, some of which
are followed by pooling layers. These pooling layers
can also be considered “convolutional”, in the sense
that they use sliding windows. The last two
layers are fully connected, with an N-way softmax as
the output, where N is the number of categories.
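For reference in the discussion below, a schematic PyTorch sketch of such a seven-layer network follows; the layer sizes are assumptions in an AlexNet-like configuration, not the exact architectures of [3], [4].

import torch.nn as nn

def seven_layer_net(num_classes):
    # Five convolutional layers, some followed by (sliding-window)
    # max pooling, then two fully-connected layers.
    return nn.Sequential(
        nn.Conv2d(3, 96, kernel_size=11, stride=4), nn.ReLU(),
        nn.MaxPool2d(kernel_size=3, stride=2),
        nn.Conv2d(96, 256, kernel_size=5, padding=2), nn.ReLU(),
        nn.MaxPool2d(kernel_size=3, stride=2),
        nn.Conv2d(256, 384, kernel_size=3, padding=1), nn.ReLU(),
        nn.Conv2d(384, 384, kernel_size=3, padding=1), nn.ReLU(),
        nn.Conv2d(384, 256, kernel_size=3, padding=1), nn.ReLU(),
        nn.MaxPool2d(kernel_size=3, stride=2),
        nn.Flatten(),
        # The fully-connected layers demand fixed-length inputs:
        # 256*6*6 holds only for a fixed input size (227x227 here).
        nn.Linear(256 * 6 * 6, 4096), nn.ReLU(),
        nn.Linear(4096, num_classes),  # N-way softmax at the loss
    )

The hard-coded 256*6*6 in the first fully-connected layer is exactly where the fixed-size requirement enters; the next paragraphs make this point explicit.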
The deep network described above needs a fixed
image size. However, we notice that the requirement
of fixed sizes is only due to the fully-connected layers
that demand fixed-length vectors as inputs. On the
other hand, the convolutional layers accept inputs of
arbitrary sizes. The convolutional layers use sliding
filters, and their outputs have roughly the same aspect
ratio as the inputs. These outputs are known as feature
maps [1]: they capture not only the strength of the
responses, but also their spatial positions.
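This is easy to verify; a minimal check (again assuming PyTorch) shows the feature-map size tracking the input size:

import torch
import torch.nn as nn

conv1 = nn.Conv2d(3, 96, kernel_size=11, stride=4)  # first layer of the sketch above
for h, w in [(224, 224), (180, 300)]:
    fmap = conv1(torch.randn(1, 3, h, w))
    # The spatial size of the output varies with the input size,
    # keeping roughly the same aspect ratio.
    print((h, w), '->', tuple(fmap.shape))  # (224, 224) -> (1, 96, 54, 54)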
In Figure 2, we visualize some feature maps. They
are generated by some filters of the conv5 layer
(the fifth convolutional layer). Figure 2(c) shows the
images in the ImageNet dataset that most strongly
activate these filters. We see that a filter can be
activated by some
semantic content. For example, the 55-th filter (Fig-
ure 2, bottom left) is most activated by a circle shape;
the 66-th filter (Figure 2, top right) is most activated
by a ∧-shape; and the 118-th filter (Figure 2, bottom
right) is most activated by a ∨-shape. These shapes
in the input images (Figure 2(a)) activate the feature
maps at the corresponding positions (the arrows in
Figure 2).
It is worth noting that we generate the feature
maps in Figure 2 without fixing the input size. These
feature maps generated by deep convolutional lay-
ers are analogous to the feature maps in traditional
methods [27], [28]. In those methods, SIFT vectors
[29] or image patches [28] are densely extracted and
then encoded, e.g., by vector quantization [16], [15],
[30], sparse coding [17], [18], or Fisher kernels [19].
These encoded features constitute the feature maps,
and are then pooled by Bag-of-Words (BoW) [16] or
spatial pyramids [14], [15]. Analogously, the deep
convolutional features can be pooled in a similar way.
2.2 The Spatial Pyramid Pooling Layer
The convolutional layers accept arbitrary input sizes,
but they produce outputs of variable sizes. The classi-
fiers (SVM/softmax) or fully-connected layers require
fixed-length vectors. Such vectors can be generated
by the Bag-of-Words (BoW) approach [16] that pools
the features together. Spatial pyramid pooling [14],
[15] improves BoW in that it can maintain spatial
information by pooling in local spatial bins. These
spatial bins have sizes proportional to the image size,
so the number of bins is fixed regardless of the image
size. This is in contrast to the sliding window pooling
of the previous deep networks [3], where the number
of sliding windows depends on the input size.
To adapt the deep network to images of arbitrary
sizes, we replace the last pooling layer (e.g., pool5,
after the last convolutional layer) with a spatial
pyramid pooling layer. Figure 3 illustrates our method.
In each spatial bin, we pool the responses of each
filter (throughout this paper we use max pooling).
The output of the spatial pyramid pooling is a kM-
dimensional vector, where M is the number of bins
and k is the number of filters in the last convolutional
layer. This fixed-dimensional vector is the
input to the fully-connected layer.
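A minimal sketch of this layer, assuming PyTorch, is given below. adaptive_max_pool2d is used as a shortcut for the bin-size arithmetic (the paper computes the pooling windows with ceiling/floor operations); a three-level pyramid of 4x4, 2x2, and 1x1 grids gives M = 21 bins.

import torch
import torch.nn.functional as F

def spatial_pyramid_pool(feature_map, levels=(4, 2, 1)):
    # feature_map: (batch, k, H, W) with arbitrary H and W.
    # Returns a (batch, k*M) tensor, M = sum(n*n for n in levels).
    batch = feature_map.shape[0]
    pooled = []
    for n in levels:
        # Bin sizes are proportional to the feature-map size, so
        # each level always produces n x n bins regardless of size.
        out = F.adaptive_max_pool2d(feature_map, output_size=(n, n))
        pooled.append(out.reshape(batch, -1))  # k*n*n values per level
    return torch.cat(pooled, dim=1)

Feature maps of different sizes then yield the same fixed-length vector: for example, spatial_pyramid_pool(torch.randn(1, 256, 13, 13)) and spatial_pyramid_pool(torch.randn(1, 256, 10, 17)) both have shape (1, 5376), with k = 256 and M = 21.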
With spatial pyramid pooling, the input image can
be of any size. This not only allows arbitrary aspect
ratios, but also allows arbitrary scales. We can resize
the input image to any scale (e.g., min(w, h)=180, 224,
...) and apply the same deep network. When the
input image is at different scales, the network (with