TABLE 4
An overview of some popular scene text detection datasets.

Dataset | Year | Description | #Cites
ICDAR [71] | 2003 | ICDAR2003 is one of the first public datasets for text detection. ICDAR 2015 and 2017 are other popular iterations of the ICDAR challenge [72, 73]. url: http://rrc.cvc.uab.es/ | 530
SVT [74] | 2010 | Consists of ∼350 images and ∼720 text instances taken from Google Street View. url: http://tc11.cvc.uab.es/datasets/SVT_1 | 339
MSRA-TD500 [75] | 2012 | Consists of ∼500 indoor/outdoor images with Chinese and English text. url: http://www.iapr-tc11.org/mediawiki/index.php/MSRA_Text_Detection_500_Database_(MSRA-TD500) | 413
IIIT5K [76] | 2012 | Consists of ∼1,100 images and ∼5,000 words from both street scenes and born-digital images. url: http://cvit.iiit.ac.in/projects/SceneTextUnderstanding/IIIT5K.html | 165
Syn90k [77] | 2014 | A synthetic dataset with 9 million images generated from a 90,000-word vocabulary in multiple fonts. url: http://www.robots.ox.ac.uk/~vgg/data/text/ | 246
COCO-Text [78] | 2016 | The largest text detection dataset so far. Built on MS-COCO, it consists of ∼63,000 images and ∼173,000 text annotations. url: https://bgshih.github.io/cocotext/ | 69
TABLE 5
An overview of some popular traffic light detection and traffic sign detection datasets.

Dataset | Year | Description | #Cites
TLR [79] | 2009 | Captured by a moving vehicle in Paris. Consists of ∼11,000 video frames and ∼9,200 traffic light instances. url: http://www.lara.prd.fr/benchmarks/trafficlightsrecognition | 164
LISA [80] | 2012 | One of the first traffic sign detection datasets. Consists of ∼6,600 video frames and ∼7,800 instances of 47 US signs. url: http://cvrr.ucsd.edu/LISA/lisa-traffic-sign-dataset.html | 325
GTSDB [81] | 2013 | One of the most popular traffic sign detection datasets. Consists of ∼900 images with ∼1,200 traffic signs captured under various weather conditions at different times of day. url: http://benchmark.ini.rub.de/?section=gtsdb&subsection=news | 259
BelgianTSD [82] | 2012 | Consists of ∼7,300 static images, ∼120,000 video frames, and ∼11,000 traffic sign annotations of 269 types. The 3D location of each sign is also annotated. url: https://btsd.ethz.ch/shareddata/ | 224
TT100K [83] | 2016 | The largest traffic sign detection dataset so far, with ∼100,000 images (2048 × 2048) and ∼30,000 traffic sign instances of 128 classes. Each instance is annotated with a class label, bounding box, and pixel mask. url: http://cg.cs.tsinghua.edu.cn/traffic-sign/ | 111
BSTL [84] | 2017 | The largest traffic light detection dataset. Consists of ∼5,000 static images, ∼8,300 video frames, and ∼24,000 traffic light instances. url: https://hci.iwr.uni-heidelberg.de/node/6132 | 21
tion problems. Therefore, machine learning based detection
methods were beginning to prosper.
Machine learning based detection has gone through multiple periods, including statistical models of appearance (before 1998), wavelet feature representations (1998-2005), and gradient-based representations (2005-2012).
Building statistical models of an object, like Eigenfaces [95, 106] as shown in Fig. 5 (a), was the first wave of learning-based approaches in object detection history. In 1991, M. Turk et al. achieved real-time face detection in a lab environment by using Eigenface decomposition [95]. Compared with the rule-based or template-based approaches of its time [107, 108], a statistical model provides a more holistic description of an object's appearance by learning task-specific knowledge from data.
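To make the idea of an appearance subspace concrete, the following minimal sketch builds a small set of eigenfaces with plain NumPy and scores a candidate window by its reconstruction error; the 64×64 crop size, the number of components, and the random stand-in data are illustrative assumptions, not the setup of [95].

```python
import numpy as np

# Minimal Eigenface sketch: PCA over vectorized face crops.
# ASSUMPTION: 64x64 crops and random data stand in for an aligned face dataset.
rng = np.random.default_rng(0)
faces = rng.random((200, 64 * 64))           # 200 vectorized face crops

mean_face = faces.mean(axis=0)               # average face
centered = faces - mean_face                 # remove the mean appearance

# Principal components ("eigenfaces") via SVD of the centered data.
_, _, vt = np.linalg.svd(centered, full_matrices=False)
eigenfaces = vt[:16]                         # keep the top-16 components

def reconstruction_error(window_vec):
    """Distance of a window to the eigenface subspace.
    A small residual suggests a face-like appearance."""
    coeffs = eigenfaces @ (window_vec - mean_face)
    recon = mean_face + eigenfaces.T @ coeffs
    return np.linalg.norm(window_vec - recon)

print(reconstruction_error(faces[0]))
```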
Wavelet feature transforms began to dominate visual recognition and object detection after 2000. The essence of this group of methods is learning by transforming an image from pixels into a set of wavelet coefficients. Among these methods, the Haar wavelet, owing to its high computational efficiency, has been used most widely in object detection tasks such as general object detection [29], face detection [10, 11, 109], and pedestrian detection [30, 31]. Fig. 5 (d) shows a set of Haar wavelet basis functions learned by the VJ detector [10, 11] for human faces.
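As a concrete illustration of why Haar-like features are so cheap to evaluate, the sketch below computes a two-rectangle feature from an integral image with plain NumPy, in the spirit of the VJ detector; the 24×24 window and the specific rectangle layout are illustrative assumptions rather than the basis actually learned in [10, 11].

```python
import numpy as np

def integral_image(img):
    """Cumulative sums (with a zero border) so any rectangle sum costs four lookups."""
    ii = img.cumsum(axis=0).cumsum(axis=1)
    return np.pad(ii, ((1, 0), (1, 0)))

def rect_sum(ii, top, left, height, width):
    """Sum of pixels in the given rectangle via the integral image."""
    b, r = top + height, left + width
    return ii[b, r] - ii[top, r] - ii[b, left] + ii[top, left]

def two_rect_haar(ii, top, left, height, width):
    """Vertical two-rectangle feature: left half minus right half."""
    half = width // 2
    return (rect_sum(ii, top, left, height, half)
            - rect_sum(ii, top, left + half, height, half))

# ASSUMPTION: a random 24x24 patch stands in for a real detection window.
window = np.random.default_rng(0).random((24, 24))
ii = integral_image(window)
print(two_rect_haar(ii, top=4, left=4, height=16, width=16))
```

Once the integral image is built, every such feature costs a constant number of array lookups regardless of its size, which is what made exhaustive sliding-window evaluation feasible.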
• Early CNNs for object detection
The history of using CNNs to detect objects can be