Two Decades of Progress in Deep Learning Object Detection: From RCNN to Future Trends

Updated 2024-07-01 · PDF, 6.67 MB

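This document is a survey of generic object detection (GOD), titled "GOD Object Detection Progress Survey 1". It reviews how the field developed over the past two decades, the key challenges it faced, and the main breakthroughs. It opens by contrasting object detection with traditional methods and by distinguishing the different ways detection approaches are categorized.

Over these two decades the main problems have been accuracy and efficiency. The accuracy challenge is to improve a model's recognition ability so that objects are localized and classified correctly. The efficiency challenge is to reduce computational cost and raise detection speed so that large volumes of images can be processed in real time.

Along the way researchers produced a series of milestone detectors. Two-stage frameworks include RCNN (Region-based Convolutional Neural Network), SPPNet (Spatial Pyramid Pooling Network), Fast RCNN, Faster RCNN, RFCN (Region-based Fully Convolutional Network), and Mask RCNN. One-stage (unified) pipelines include DetectorNet, OverFeat, YOLO (You Only Look Once) and its successors YOLOv2 and YOLO9000, and SSD (Single Shot MultiBox Detector).

The survey examines the basic subproblems in designing a detector. For object representation based on deep convolutional neural networks (DCNNs), it discusses popular CNN architectures such as VGG and ResNet, and ways to improve the representation: combining features from multiple layers, detecting at multiple CNN layers, modeling geometric transformations, and modeling object deformations. Context modeling is another key part, because understanding how objects relate to their scene helps detection performance. The survey also covers detection proposal methods and special issues such as weakly supervised or unsupervised learning and 3D object detection.

To evaluate model performance, the survey lists several popular datasets and explains common evaluation metrics such as precision, recall, and F1 score. It closes with future research directions: open world learning, more efficient detection frameworks, compact and efficient deep CNN feature extraction, robust object representations, context reasoning, object instance segmentation, and weakly supervised or unsupervised learning. Overall, it offers a comprehensive review of the history of object detection and a useful reference on the field's recent progress and open technical challenges.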

Deep Learning for Generic Object Detection: A Survey

achieving two competing goals: high quality/accuracy and high efficiency, as illustrated in Fig. 4. As illustrated in Fig. 5, high quality detection has to accurately localize and recognize objects in images or video frames, such that the large variety of object categories in the real world can be distinguished (i.e., high distinctiveness), and that object instances from the same category, subject to intraclass appearance variations, can be localized and recognized (i.e., high robustness). High efficiency requires the entire detection task to run at a sufficiently high frame rate with acceptable memory and storage usage. Despite several decades of research and significant progress, arguably the combined goals of accuracy and efficiency have not yet been met.

2.2.1 Accuracy related challenges

For accuracy, the challenge stems from 1) the vast range of intraclass variations and 2) the huge number of object categories.

We begin with intraclass variations, which can be divided into two types: intrinsic factors, and imaging conditions. For the former, each object category can have many different object instances, possibly varying in one or more of color, texture, material, shape, and size, such as the "chair" category shown in Fig. 5 (h). Even in a more narrowly defined class, such as human or horse, object instances can appear in different poses, with nonrigid deformations and different clothes.

For the latter, the variations are caused by changes in imaging conditions and unconstrained environments, which may have dramatic impacts on object appearance. In particular, different instances, or even the same instance, can be captured under a wide range of differences: different times, locations, weather conditions, cameras, backgrounds, illuminations, viewpoints, and viewing distances. All of these conditions produce significant variations in object appearance, such as illumination, pose, scale, occlusion, background clutter, shading, blur and motion, with examples illustrated in Fig. 5 (a-g). Further challenges may be added by digitization artifacts, noise corruption, poor resolution, and filtering distortions.

In addition to intraclass variations, the large number of object categories, on the order of 10^4 to 10^5, demands great discrimination power from the detector to distinguish between subtly different interclass variations, as illustrated in Fig. 5 (i). In practice, current detectors focus mainly on structured object categories, such as the 20, 200 and 91 object classes in PASCAL VOC [53], ILSVRC [179] and MS COCO [129] respectively. Clearly, the number of object categories under consideration in existing benchmark datasets is much smaller than the number that can be recognized by humans.

2.2.2 Efficiency related challenges

The exponentially increasing number of images calls for efficient and scalable detectors. The prevalence of social media networks and mobile/wearable devices has led to increasing demands for analyzing visual data. However, mobile/wearable devices have limited computational capabilities and storage space, in which case an efficient object detector is critical.

For efficiency, the challenges stem from the need to localize and recognize all object instances of a very large number of object categories, and the very large number of possible locations and scales within a single image, as shown by the example in Fig. 5 (c). A further challenge is that of scalability: a detector should be able to handle unseen objects, unknown situations, and rapidly increasing image data. For example, the scale of ILSVRC [179] is already imposing limits on the manual annotations that are feasible to obtain. As the number of images and the number of categories grow even larger, it may become impossible to annotate them manually, forcing algorithms to rely more on weakly supervised training data.
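The point about the very large number of possible locations and scales within a single image can be made concrete with a rough count. The sketch below is illustrative only: it assumes a square sliding window applied over an image pyramid, which is one classical way to enumerate candidate boxes and is not a procedure prescribed by the survey. The function name `count_windows` and every parameter value (window size, stride, scale step) are hypothetical choices made for the example.

```python
def count_windows(image_w, image_h, window=64, stride=8, scale_step=1.2):
    """Count candidate boxes for a square sliding window over an image pyramid.

    Assumption for illustration: the image is repeatedly shrunk by
    `scale_step`, and a fixed `window` x `window` box is slid over each
    pyramid level with the given `stride`.
    """
    total = 0
    per_level = []  # (level width, level height, number of windows)
    w, h = float(image_w), float(image_h)
    while w >= window and h >= window:
        # Number of window positions along each axis at this pyramid level.
        nx = (int(w) - window) // stride + 1
        ny = (int(h) - window) // stride + 1
        per_level.append((int(w), int(h), nx * ny))
        total += nx * ny
        # Shrinking the image is equivalent to enlarging the window.
        w /= scale_step
        h /= scale_step
    return total, per_level


if __name__ == "__main__":
    total, levels = count_windows(640, 480)
    print(f"{len(levels)} pyramid levels, {total} candidate windows")
```

Even with a single aspect ratio, the count grows quickly as the stride shrinks or the scale step approaches 1, and every candidate would still have to be scored against each object category.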

2.3 Progress in the Past Two Decades

Early research on object recognition was based on template matching techniques and simple part based models [57], focusing on specific objects whose spatial layouts are roughly rigid, such as faces. Before 1990 the leading paradigm of object recognition was based on geometric representations [149, 169], with the focus later moving away from geometry and prior models towards the use of statistical classifiers (such as Neural Networks [178], SVM [159] and Adaboost [213, 222]) based on appearance features [150, 181]. This successful family of object detectors set the stage for most subsequent research in this field.

In the late 1990s and early 2000s object detection research made notable strides. The milestones of object detection in recent years are presented in Fig. 2, in which two main eras (SIFT vs. DCNN) are highlighted. The appearance features moved from global representations [151, 197, 205] to local representations that are invariant to changes in translation, scale, rotation, illumination, viewpoint and occlusion. Handcrafted local invariant features gained tremendous popularity, starting from the Scale Invariant Feature Transform (SIFT) feature [139], and the progress on various visual recognition tasks was based substantially on the use of local descriptors [145] such as Haar like features [213], SIFT [140], Shape Contexts [11], Histogram of Gradients (HOG) [42], Local Binary Patterns (LBP) [153], and covariance [206]. These local features are usually aggregated by simple concatenation or feature pooling encoders such as the influential and efficient Bag of Visual Words approach introduced by Sivic and Zisserman [194] and Csurka et al. [37], Spatial Pyramid Matching (SPM) of BoW models [114], and Fisher Vectors [166].
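As a minimal sketch of how a feature pooling encoder such as Bag of Visual Words aggregates local descriptors, the example below assigns each descriptor to its nearest "visual word" and returns a normalized histogram of word counts. It is a simplified illustration under stated assumptions, not the implementation from the cited papers: the codebook is taken as given (in practice it is typically learned by clustering descriptors, for example with k-means), the descriptors are toy 2-D vectors rather than real SIFT or HOG features, and the names `nearest_word` and `bow_histogram` are hypothetical.

```python
def nearest_word(descriptor, codebook):
    """Return the index of the codebook entry closest to `descriptor`
    under squared Euclidean distance."""
    best_index, best_dist = 0, float("inf")
    for i, word in enumerate(codebook):
        dist = sum((a - b) ** 2 for a, b in zip(descriptor, word))
        if dist < best_dist:
            best_index, best_dist = i, dist
    return best_index


def bow_histogram(descriptors, codebook):
    """Encode a set of local descriptors as an L1-normalized histogram
    over visual words (hard assignment, order of descriptors is ignored)."""
    counts = [0] * len(codebook)
    for descriptor in descriptors:
        counts[nearest_word(descriptor, codebook)] += 1
    total = sum(counts)
    if total == 0:
        # No descriptors: return an all-zero histogram of the right length.
        return [0.0] * len(codebook)
    return [c / total for c in counts]


if __name__ == "__main__":
    # Toy data: two visual words and four 2-D descriptors (assumed values).
    codebook = [[0.0, 0.0], [10.0, 10.0]]
    descriptors = [[0.0, 1.0], [1.0, 0.0], [9.0, 9.0], [0.0, 0.0]]
    print(bow_histogram(descriptors, codebook))
```

The resulting fixed-length vector can be fed to a discriminative classifier regardless of how many local descriptors an image region produced, which is the property that made such encoders convenient in the handcrafted-feature pipelines described here.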

For years, the multistage handtuned pipelines of handcrafted local descriptors and discriminative classifiers dominated a variety of domains in computer vision, including object detection, until the significant turning point in 2012 when Deep Convolutional Neural Networks (DCNN) [109] achieved their record breaking results in image classification. The successful application of DCNNs to image classification [109] transferred to object detection, resulting in the milestone Region based CNN (RCNN) detector of Girshick et al. [65]. Since then, the field of object detection has dramatically evolved and many deep learning based approaches have been developed, thanks in part to available GPU computing resources and the availability of large scale datasets and challenges such as ImageNet [44, 179] and MS COCO [129]. With these new datasets, researchers can target more realistic and complex problems when detecting objects of hundreds of categories from images with large intraclass variations and interclass similarities [129, 179].

