深度学习驱动的目标检测：20年的演进与综述

需积分: 36 120 浏览量更新于2024-07-15 收藏 7.98MB PDF 举报

"《Object Detection in 20 Years: A Survey》是由Zhengxia Zou博士等人撰写的关于视觉目标检测的综述论文，全面回顾了从20世纪90年代到2019年的400多篇相关研究，探讨了深度学习在目标检测中的应用及其发展历程。" 目标检测是计算机视觉领域的重要且具有挑战性的问题，近年来随着深度学习的兴起，该领域取得了显著的进步。这篇论文将过去二十年的目标检测发展视为计算机视觉历史的一个缩影，从早期的“冷兵器时代”到现在的深度学习主导的技术美学。论文中详尽地分析了历史上的里程碑式检测器，这些早期的方法为后续的发展奠定了基础。随着深度学习的引入，如R-CNN、Fast R-CNN、Faster R-CNN和YOLO（You Only Look Once）等方法逐步提升了检测的精度和速度。这些模型通过卷积神经网络（CNN）学习特征表示，极大地改善了目标识别和定位的能力。此外，论文还讨论了检测数据集的关键作用，如PASCAL VOC、COCO等数据集推动了目标检测算法的基准测试和性能比较。同时，论文也涉及了评估检测性能的各种指标，如平均精度（mAP）、漏检率（False Negative Rate）和误报率（False Positive Rate）等，这些指标帮助研究人员量化算法的优劣。论文进一步剖析了检测系统的基石组件，包括提议生成、特征提取、分类与回归等步骤。提议生成如Selective Search等方法用于找出可能包含目标的区域；特征提取利用深度网络提取高维特征；分类和回归则用于确定目标类别和精确边界框。为了提高效率，论文还讨论了各种加速技术，如多尺度检测、并行计算和模型优化策略，这些技术使得实时目标检测成为可能，尤其在嵌入式和移动设备上。论文专门讨论了几个关键的应用场景，如行人检测、人脸检测和文本检测，这些都是现实世界中需求强烈的方向。每个应用都有其特定的挑战，例如行人检测中的遮挡问题，人脸检测中的姿态变化，以及文本检测中的形状和光照变化。最后，作者深入分析了这些应用面临的挑战以及近年来技术趋势，包括实例分割、多任务学习和自监督学习等新兴方向，它们为未来目标检测的发展指明了道路。《Object Detection in 20 Years: A Survey》是理解目标检测历史、现状和未来趋势的重要参考资料。

Dataset Year Description #Cites

MIT Ped.[30] 2000 One of the ﬁrst pedestrian detection datasets. Consists of ∼500 training and ∼200

testing images (built based on the LabelMe database). url: http://cbcl.mit.edu/

software-datasets/PedestrianData.html

1515

INRIA [12] 2005 One of the most famous and important pedestrian detection datasets at early time.

Introduced by the HOG paper [12]. url: http://pascal.inrialpes.fr/data/human/

24705

Caltech

[59, 60]

2009 One of the most famous pedestrian detection datasets and benchmarks. Consists

of ∼190,000 pedestrians in training set and ∼160,000 in testing set. The metric

is Pascal-VOC @ 0.5 IoU. url: http://www.vision.caltech.edu/Image Datasets/

CaltechPedestrians/

2026

KITTI [61] 2012 One of the most famous datasets for trafﬁc scene analysis. Captured in Karl-

sruhe, Germany. Consists of ∼100,000 pedestrians (∼6,000 individuals). url:

http://www.cvlibs.net/datasets/kitti/index.php

2620

CityPersons

[62]

2017 Built based on CityScapes dataset [63]. Consists of ∼19,000 pedestrians in training

set and ∼11,000 in testing set. Same metric with CalTech. url: https://bitbucket.

org/shanshanzhang/citypersons

EuroCity [64] 2018 The largest pedestrian detection dataset so far. Captured from 31 cities in 12

European countries. Consists of ∼238,000 instances in ∼47,000 images. Same

metric with CalTech.

TABLE 2

An overview of some popular pedestrian detection datasets.

Dataset Year Description #Cites

FDDB [65] 2010 Consists of ∼2,800 images and ∼5,000 faces from Yahoo! With occlusions, pose

changes, out-of-focus, etc. url: http://vis-www.cs.umass.edu/fddb/index.html

531

AFLW [66] 2011 Consists of ∼26,000 faces and 22,000 images from Flickr with rich facial landmark

annotations. url: https://www.tugraz.at/institute/icg/research/team-bischof/

lrs/downloads/aﬂw/

414

IJB [67] 2015 IJB-A/B/C consists of over 50,000 images and videos frames, for both

recognition and detection tasks. url: https://www.nist.gov/programs-projects/

face-challenges

279

WiderFace

[68]

2016 One of the largest face detection dataset. Consists of ∼32,000 images and 394,000

faces with rich annotations i.e., scale, occlusion, pose, etc. url: http://mmlab.ie.

cuhk.edu.hk/projects/WIDERFace/

193

UFDD [69] 2018 Consists of ∼6,000 images and ∼11,000 faces. Variations include weather-based

degradation, motion blur, focus blur, etc. url: http://www.ufdd.info/

WildestFaces

[70]

2018 With ∼68,000 video frames and ∼2,200 shots of 64 ﬁghting celebrities in uncon-

strained scenarios. The dataset hasn’t been released yet.

TABLE 3

An overview of some popular face detection datasets.

robot arm trying to grasp a spanner).

Recently, there are some further developments of the

evaluation in the Open Images dataset, e.g., by considering

the group-of boxes and the non-exhaustive image-level cate-

gory hierarchies. Some researchers also have proposed some

alternative metrics, e.g., “localization recall precision” [94].

Despite the recent changes, the VOC/COCO-based mAP is

still the most frequently used evaluation metric for object

detection.

2.3 Technical Evolution in Object Detection

In this section, we will introduce some important building

blocks of a detection system and their technical evolutions

in the past 20 years.

2.3.1 Early Time’s Dark Knowledge

The early time’s object detection (before 2000) did not follow

a uniﬁed detection philosophy like sliding window detec-

tion. Detectors at that time were usually designed based on

low-level and mid-level vision as follows.

• Components, shapes and edges

“Recognition-by-components”, as an important cogni-

tive theory [98], has long been the core idea of image

recognition and object detection [13, 99, 100]. Some early

researchers framed the object detection as a measurement of

similarity between the object components, shapes and con-

tours, including Distance Transforms [101], Shape Contexts

[35], and Edgelet [102], etc. Despite promising initial results,

things did not work out well on more complicated detec-

Dataset Year Description #Cites

ICDAR [71] 2003 ICDAR2003 is one of the ﬁrst public datasets for text detection. ICDAR 2015

and 2017 are other popular iterations of the ICDAR challenge [72, 73]. url: http:

//rrc.cvc.uab.es/

530

STV [74] 2010 Consists of ∼350 images and ∼720 text instances taken from Google StreetView.

url: http://tc11.cvc.uab.es/datasets/SVT 1

339

MSRA-TD500

[75]

2012 Consists of ∼500 indoor/outdoor images with Chinese and English texts. url:

http://www.iapr-tc11.org/mediawiki/index.php/MSRA Text Detection 500

Database (MSRA-TD500)

413

IIIT5k [76] 2012 Consists of ∼1,100 images and ∼5,000 words from both streets and born-digital

images. url: http://cvit.iiit.ac.in/projects/SceneTextUnderstanding/IIIT5K.html

165

Syn90k [77] 2014 A synthetic dataset with 9 million images generated from a 90,000 vocabulary of

multiple fonts. url: http://www.robots.ox.ac.uk/

∼

vgg/data/text/

246

COCOText

[78]

2016 The largest text detection dataset so far. Built based on MS-COCO, Consists

of ∼63,000 images and ∼173,000 text annotations. https://bgshih.github.io/

cocotext/.

TABLE 4

An overview of some popular scene text detection datasets.

Dataset Year Description #Cites

TLR [79] 2009 Captured by a moving vehicle in Paris. Consists of ∼11,000 video frames

and ∼9,200 trafﬁc light instances. url: http://www.lara.prd.fr/benchmarks/

trafﬁclightsrecognition

164

LISA [80] 2012 One of the ﬁrst trafﬁc sign detection dataset. Consists of ∼6,600 video

frames, ∼7,800 instances of 47 US signs. url: http://cvrr.ucsd.edu/LISA/

lisa-trafﬁc-sign-dataset.html

325

GTSDB [81] 2013 One of the most popular trafﬁc signs detection dataset. Consists of ∼900 images

with ∼1,200 trafﬁc signs capture with various weather conditions during differ-

ent time of a day. url: http://benchmark.ini.rub.de/?section=gtsdb&subsection=

news

259

BelgianTSD

[82]

2012 Consists of ∼7,300 static images, ∼120,000 video frames, and ∼11,000 trafﬁc sign

annotations of 269 types. The 3D location of each sign has been annotated. url:

https://btsd.ethz.ch/shareddata/

224

TT100K [83] 2016 The largest trafﬁc sign detection dataset so far, with ∼100,000 images (2048 x

2048) and ∼30,000 trafﬁc sign instances of 128 classes. Each instance is annotated

with class label, bounding box and pixel mask. url: http://cg.cs.tsinghua.edu.cn/

trafﬁc%2Dsign/

111

BSTL [84] 2017 The largest trafﬁc light detection dataset. Consists of ∼5000 static images, ∼8300

video frames, and ∼24000 trafﬁc light instances. https://hci.iwr.uni-heidelberg.

de/node/6132

TABLE 5

An overview of some popular trafﬁc light detection and trafﬁc sign detection datasets.

tion problems. Therefore, machine learning based detection

methods were beginning to prosper.

Machine learning based detection has gone through mul-

tiple periods, including the statistical models of appearance

(before 1998), wavelet feature representations (1998-2005),

and gradient-based representations (2005-2012).

Building statistical models of an object, like Eigenfaces

[95, 106] as shown in Fig 5 (a), was the ﬁrst wave of learning

based approaches in object detection history. In 1991, M.

Turk et al. achieved real-time face detection in a lab envi-

ronment by using Eigenface decomposition [95]. Compared

with the rule-based or template based approaches of its

time [107, 108], a statistical model better provides holistic

descriptions of an object’s appearance by learning task-

speciﬁc knowledge from data.

Wavelet feature transform started to dominate visual

recognition and object detection since 2000. The essence of

this group of methods is learning by transforming an image

from pixels to a set of wavelet coefﬁcients. Among these

methods, the Haar wavelet, owing to its high computational

efﬁciency, has been mostly used in many object detection

tasks, such as general object detection [29], face detection

[10, 11, 109], pedestrian detection [30, 31], etc. Fig 5 (d)

shows a set of Haar wavelets basis learned by a VJ detector

[10, 11] for human faces.

• Early time’s CNN for object detection

The history of using CNN to detecting objects can be

剩余39页未读，继续阅读

Brucechows

粉丝: 55
资源: 36

深度学习驱动的目标检测：20年的演进与综述

目标检测综述PPT-Object Detection in 20 Years: A Survey

Object Detection in 20 Years A Survey总结汇报.pptx

深度学习全面分析survey

object detection in 20 years: a survey

object detection in 20 years

Fine-Grained Feature Enhancement for Object Detection in Remote Sensing Images

high resolution image object detection

Exploring Classification Equilibrium in Long-Tailed Object Detection

What Can Human Sketches Do for Object Detection?

安装tensorflow object detection api

最新资源