The three operations that are repeatedly applied by a typical CNN are illustrated in Fig. 8 (a). CNNs having a large number of layers, i.e., "deep" networks, are referred to as Deep CNNs (DCNNs), and a typical DCNN architecture is illustrated in Fig. 8 (b).
As can be seen from Fig. 8 (b), each layer of a CNN consists of a number of feature maps, within which each pixel acts like a neuron. Each neuron in a convolutional layer is connected to the feature maps of the previous layer through a set of weights (essentially a filter). The early layers of a CNN are typically composed of convolutional and pooling layers, while the later layers are normally fully connected. A nonlinearity (such as a ReLU activation) is normally applied between each pair of layers.
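Although the survey itself contains no code, the layer pattern just described can be made concrete. The following is a minimal PyTorch sketch (the framework choice and all layer sizes are illustrative assumptions, not taken from the survey): convolutional and pooling layers early, a nonlinearity between layers, and a fully connected layer at the end.

import torch
import torch.nn as nn

class TinyCNN(nn.Module):
    # Minimal sketch of the typical DCNN layout described above;
    # channel counts and kernel sizes are illustrative, not from the survey.
    def __init__(self, num_classes: int = 10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1),  # convolution: learned filters over feature maps
            nn.ReLU(),                                   # nonlinearity between layers
            nn.MaxPool2d(2),                             # pooling: downsampling enlarges the receptive field
            nn.Conv2d(16, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),
        )
        self.classifier = nn.Linear(32 * 8 * 8, num_classes)  # later layers: fully connected

    def forward(self, x):
        x = self.features(x)                  # e.g., 3x32x32 input -> 32x8x8 feature maps
        return self.classifier(x.flatten(1))

logits = TinyCNN()(torch.randn(1, 3, 32, 32))  # one 32x32 RGB image -> class scores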
From earlier to later layers, the input image repeatedly undergoes convolution, and with each layer the receptive field (the region of support) increases. In general, the initial CNN layers extract low-level features (e.g., edges), with later layers extracting features of increasing complexity [296, 13, 145, 195].
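The rate at which the receptive field grows can be computed with a standard layer-by-layer recurrence (a well-known property of stacked convolutions, not a formula given in the survey): each layer with kernel size k enlarges the receptive field by (k - 1) times the product of the strides of all preceding layers. A minimal sketch:

def receptive_field(layers):
    # layers: sequence of (kernel_size, stride) pairs, earliest layer first.
    # Standard recurrence: r += (k - 1) * jump, then jump *= stride.
    r, jump = 1, 1
    for kernel, stride in layers:
        r += (kernel - 1) * jump
        jump *= stride
    return r

# Two 3x3 convolutions (stride 1) followed by a 2x2 pooling (stride 2)
# give each output neuron a receptive field of 6x6 input pixels.
print(receptive_field([(3, 1), (3, 1), (2, 2)]))  # -> 6

This is why neurons in later layers, which see ever larger regions of the input, can respond to increasingly complex structures.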
DCNNs have a number of outstanding advantages: a hierarchical structure to learn representations of data with multiple levels of abstraction, the capacity to learn very complex functions, and the ability to learn feature representations directly and automatically from data with minimal domain knowledge. What has particularly made DCNNs feasible has been the availability of large-scale labeled datasets and of GPUs with very high computational capability.
Despite these great successes, known deficiencies remain. In particular, there is an extreme need for labeled training data and a requirement for expensive computing resources, and considerable skill and experience are still needed to select appropriate learning parameters and network architectures. Trained networks are poorly interpretable, lack robustness to image transformations and degradations, and many DCNNs have shown serious vulnerability to attacks, all of which currently limit the use of DCNNs in many real-world applications.
4 Datasets and Performance Evaluation
4.1 Datasets
Datasets have played a key role throughout the history of object recognition research, not only as a common ground for measuring and comparing the performance of competing algorithms, but also in pushing the field towards increasingly complex and challenging problems. In particular, with deep learning techniques recently revolutionizing many visual recognition problems, large amounts of annotated data have played a key role in their success. Present access to large numbers of images on the Internet makes it possible to build comprehensive datasets with increasing numbers of images and categories, in order to capture an ever greater richness and diversity of objects, enabling unprecedented performance in object recognition.
For generic object detection, there are four famous datasets: PASCAL VOC [66, 67], ImageNet [52], MS COCO [162] and Open Images [139]. Attributes of these datasets are summarized in Table 3, and selected sample images are shown in Fig. 9. There are three steps to creating a large-scale annotated dataset: determining the set of target object categories, collecting a diverse set of candidate images to represent the selected categories on the Internet, and annotating the large number of collected images, typically by designing crowdsourcing strategies (the most challenging step). Recognizing space limitations, we refer interested readers to the original papers [66, 67, 162, 230, 139] for detailed descriptions of these datasets in terms of construction and properties.

Table 2 Most frequent object classes for each detection challenge. The size of each word is proportional to the frequency of that class in the training dataset. (a) PASCAL VOC (20 classes); (b) MS COCO (80 classes); (c) ILSVRC (200 classes); (d) Open Images Detection Challenge (500 classes).
The four datasets form the backbone of their respective detection challenges. Each challenge consists of a publicly available dataset of images together with ground truth annotation and standardized evaluation software, and an annual competition and corresponding workshop. Statistics for the number of images and object instances in the training, validation and testing datasets for the detection challenges are given in Table 4; the annotations on the test sets are not publicly released, except for PASCAL VOC2007. The most frequent object classes in the VOC, COCO, ILSVRC and Open Images detection datasets are visualized in Table 2.
PASCAL VOC [66, 67] is a multiyear effort devoted to the creation and maintenance of a series of benchmark datasets for classification and object detection, creating the precedent for standardized evaluation of recognition algorithms in the form of annual competitions. Starting from only four categories in 2005, the dataset has increased to 20 categories that are common in everyday life, as shown in Fig. 9.