3. Background
This section provides a brief introduction to Deformable R-FCN [6], which is used in R-FCN-3000. In R-FCN [5], atrous convolution [4] is used in the conv5 layer to increase the resolution of the feature map while still utilizing the pre-trained weights from the ImageNet classification network. In Deformable-R-FCN [6], the atrous convolution is replaced by a deformable convolution structure, in which a separate branch predicts offsets for each pixel in the feature map, and the convolution kernel is applied after the offsets have been applied to the feature map. A region proposal network (RPN), a two-layer CNN on top of the conv4 features, is used for generating object proposals. Efficiently implemented local convolutions, referred to as position-sensitive filters, are used to classify these proposals.
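To make the offset-prediction branch concrete, the following is a minimal sketch of a deformable convolution block, assuming PyTorch with torchvision's DeformConv2d; the module and variable names are ours for illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn
from torchvision.ops import DeformConv2d

class DeformableConvBlock(nn.Module):
    def __init__(self, in_ch, out_ch, k=3):
        super().__init__()
        # A separate branch predicts a (2 * k * k)-channel offset map:
        # one (dy, dx) pair per kernel location at every pixel.
        self.offset_branch = nn.Conv2d(in_ch, 2 * k * k,
                                       kernel_size=k, padding=k // 2)
        # The convolution then samples the input at the offset locations
        # instead of on the regular grid.
        self.deform_conv = DeformConv2d(in_ch, out_ch,
                                        kernel_size=k, padding=k // 2)

    def forward(self, x):
        offsets = self.offset_branch(x)
        return self.deform_conv(x, offsets)

# Usage on a conv5-like feature map (batch 1, 1024 channels, 38x50 grid).
feat = torch.randn(1, 1024, 38, 50)
out = DeformableConvBlock(1024, 1024)(feat)
```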
4. Large Scale Fully-Convolutional Detector
This section describes the process of training a large-scale object detector. We first explain the training-data requirements, followed by a discussion of the challenges involved in training such a system: design decisions for making training and inference efficient, appropriate loss functions for a large number of classes, and mitigating the domain shift that arises when training on classification data.
4.1. Weakly Supervised vs. Supervised?
Obtaining an annotated dataset covering thousands of classes is a major challenge for large-scale detection. Ideally, a system that can learn to detect object instances using partial image-level tags (class labels) for the objects present in training images would be preferable, because large-scale training data is readily available on the internet in this format. Since the setting with partial annotations is very challenging, it is commonly assumed that labels are available for all the objects present in the image. This is referred to as the weakly supervised setting. Unfortunately, explicit boundaries of objects, or at least bounding-boxes, are required as the supervision signal for training accurate object detectors. This is the supervised setting. The performance gap between supervised and weakly supervised detectors is large: even supervised detectors from 2015 [15] outperformed recent weakly supervised detectors [8] by 40% on the PASCAL VOC 2007 dataset. This gap is a direct result of the insufficient learning signal coming from weak supervision, and can be explained with the help of an example. For classifying a dog among 1000 categories, body texture or facial features alone may be sufficient, so the network need not learn the visual properties of its tail or legs for correct classification. It may therefore never learn that legs or a tail are parts of the dog category, which is essential for obtaining accurate boundaries.
On one hand, the huge cost of annotating bounding boxes for thousands of classes under settings similar to popular detection datasets such as PASCAL or COCO makes it prohibitively expensive to collect and annotate a large-scale detection dataset. On the other hand, the poor performance of weakly supervised detectors impedes their deployment in real-life applications. Therefore, we ask: is there a middle ground that can alleviate the cost of annotation while yielding accurate detectors? Fortunately, the ImageNet database contains around 1-2 objects per image; therefore, annotating bounding boxes for the objects costs only a few seconds, compared to several minutes in COCO [24]. It is for this reason that bounding boxes were also collected while annotating ImageNet! A potential downside of using ImageNet for training object detectors is the loss of the variation in scale and context around objects that detection datasets provide, but we do have access to the bounding-boxes of the objects. Therefore, a natural question to ask is: how would an object detector perform on "detection" datasets if it were trained on classification datasets with bounding-box supervision? We show that careful design choices with respect to the CNN architecture, the loss function, and the training protocol can yield a large-scale detector, trained on the ImageNet classification set, with significantly better accuracy than weakly supervised detectors.
4.2. Super-class Discovery
Fully convolutional object detectors learn class-specific filters based on scale & aspect-ratio [23], or in the form of position-sensitive filters [5, 6], for each class. Therefore, when the number of classes becomes large, it becomes computationally infeasible to apply these detectors. Hence, we ask: is it necessary to have a set of filters for each class, or can they be shared across visually similar classes? In the extreme case, can detection be performed using just a foreground/background detector and a classification network?
To obtain sets of object-classes across which position-sensitive filters can be shared, we group classes by visual appearance. We obtain the $j^{th}$ object-class representation, $x_j$, by averaging the 2048-dimensional feature vectors ($x_j^i$) from the final layer of ResNet-101 over all the samples belonging to the $j^{th}$ object-class in the ImageNet classification dataset (validation set). Super-classes are then obtained by applying K-means clustering on $\{x_j : j \in \{1, 2, \ldots, C\}\}$, where $C$ is the number of object-classes, to obtain $K$ super-class clusters.
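As a concrete illustration, the sketch below computes the per-class representations by averaging pre-extracted features and clusters them with K-means; it assumes the 2048-dimensional ResNet-101 features are already extracted, and the function and argument names are ours, not the authors'.

```python
import numpy as np
from sklearn.cluster import KMeans

def discover_super_classes(features, labels, num_classes, K):
    """features: (N, 2048) final-layer ResNet-101 features for N samples,
    labels: (N,) object-class indices in [0, num_classes).
    Returns a (num_classes,) array mapping each class to a super-class."""
    # Average the feature vectors of all samples of class j to get x_j.
    reps = np.stack([features[labels == j].mean(axis=0)
                     for j in range(num_classes)])
    # K-means over {x_j} yields K super-class clusters.
    return KMeans(n_clusters=K, n_init=10).fit_predict(reps)
```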
4.3. Architecture
First, an RPN is used for generating proposals, as in [6]. Let the set of individual object-classes on which the detector is trained be $\mathcal{C}$, with $|\mathcal{C}| = C$, and the set of super-classes (SC) be $\mathcal{K}$, with $|\mathcal{K}| = K$. For each super-class $k$, suppose we