Table 3 Statistics of commonly used object detection datasets

Challenge | Object classes | Number of images (Train / Val / Test) | Number of annotated objects (Train / Val) | Summary of Train+Val (Images / Boxes / Boxes per image)

PASCAL VOC object detection challenge
VOC07 | 20 | 2501 / 2510 / 4952 | 6301 (7844) / 6307 (7818) | 5011 / 12,608 / 2.5
VOC08 | 20 | 2111 / 2221 / 4133 | 5082 (6337) / 5281 (6347) | 4332 / 10,364 / 2.4
VOC09 | 20 | 3473 / 3581 / 6650 | 8505 (9760) / 8713 (9779) | 7054 / 17,218 / 2.3
VOC10 | 20 | 4998 / 5105 / 9637 | 11,577 (13,339) / 11,797 (13,352) | 10,103 / 23,374 / 2.4
VOC11 | 20 | 5717 / 5823 / 10,994 | 13,609 (15,774) / 13,841 (15,787) | 11,540 / 27,450 / 2.4
VOC12 | 20 | 5717 / 5823 / 10,991 | 13,609 (15,774) / 13,841 (15,787) | 11,540 / 27,450 / 2.4

ILSVRC object detection challenge
ILSVRC13 | 200 | 395,909 / 20,121 / 40,152 | 345,854 / 55,502 | 416,030 / 401,356 / 1.0
ILSVRC14 | 200 | 456,567 / 20,121 / 40,152 | 478,807 / 55,502 | 476,668 / 534,309 / 1.1
ILSVRC15 | 200 | 456,567 / 20,121 / 51,294 | 478,807 / 55,502 | 476,668 / 534,309 / 1.1
ILSVRC16 | 200 | 456,567 / 20,121 / 60,000 | 478,807 / 55,502 | 476,668 / 534,309 / 1.1
ILSVRC17 | 200 | 456,567 / 20,121 / 65,500 | 478,807 / 55,502 | 476,668 / 534,309 / 1.1

MS COCO object detection challenge
MS COCO15 | 80 | 82,783 / 40,504 / 81,434 | 604,907 / 291,875 | 123,287 / 896,782 / 7.3
MS COCO16 | 80 | 82,783 / 40,504 / 81,434 | 604,907 / 291,875 | 123,287 / 896,782 / 7.3
MS COCO17 | 80 | 118,287 / 5000 / 40,670 | 860,001 / 36,781 | 123,287 / 896,782 / 7.3
MS COCO18 | 80 | 118,287 / 5000 / 40,670 | 860,001 / 36,781 | 123,287 / 896,782 / 7.3

Open Images Challenge Object Detection (OICOD) (based on Open Images V4, Kuznetsova et al. 2018)
OICOD18 | 500 | 1,643,042 / 100,000 / 99,999 | 11,498,734 / 696,410 | 1,743,042 / 12,195,144 / 7.0
Object statistics for the VOC challenges count the non-difficult objects used in the evaluation (the numbers in parentheses count all annotated objects). For the COCO challenge, prior to 2017 the test set had four splits (Dev, Standard, Reserve, and Challenge), each with about 20K images. Starting in 2017, the train and val sets are arranged differently, and the test set is divided into only two roughly equally sized splits of about 20K images each, Test Dev and Test Challenge, with the Standard and Reserve splits removed. Note that the 2017 Test Dev/Challenge splits contain the same images as the 2015 Test Dev/Challenge splits, so results across the years are directly comparable.
classes in the dataset are exhaustively annotated, whereas
for Open Images V4 a classifier was applied to each image
and only those labels with sufficiently high scores were sent
for human verification. Therefore, in OICOD only the object
instances of human-confirmed positive labels are annotated.
4.2 Evaluation Criteria
There are three criteria for evaluating the performance of detection algorithms: detection speed in Frames Per Second (FPS), precision, and recall. The most commonly used metric is Average Precision (AP), derived from precision and recall. AP is usually evaluated in a category-specific manner, i.e., computed for each object category separately. To compare performance over all object categories, the mean AP (mAP) averaged over all object categories is adopted as the final measure of performance.³ More details on these metrics can be found in Everingham et al. (2010), Everingham et al. (2015), Russakovsky et al. (2015), and Hoiem et al. (2012).
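To make the computation of AP and mAP concrete, the following minimal Python sketch ranks the detections of a single category by confidence, accumulates precision and recall, and averages per-category AP values into mAP. It assumes that each detection has already been marked as a true or false positive by the matching rule given below, and it uses the all-point interpolation of the precision-recall curve; the 11-point interpolation of the early PASCAL VOC evaluations and COCO's averaging over multiple IOU thresholds are variations on the same idea. The function and variable names (average_precision, per_class_results, etc.) are illustrative and not taken from any benchmark's reference implementation.

```python
import numpy as np

def average_precision(scores, is_tp, num_gt):
    """Schematic AP for one object category (all-point interpolation).

    scores : detection confidences p_j for this category.
    is_tp  : 1 if the j-th detection was matched to a ground truth box of the
             same category with sufficient IOU, else 0.
    num_gt : number of ground truth boxes of this category.
    """
    order = np.argsort(-np.asarray(scores, dtype=float))    # rank by confidence
    tp = np.asarray(is_tp, dtype=float)[order]
    fp = 1.0 - tp
    tp_cum, fp_cum = np.cumsum(tp), np.cumsum(fp)
    recall = tp_cum / max(num_gt, 1)                         # guard against num_gt == 0
    precision = tp_cum / np.maximum(tp_cum + fp_cum, 1e-12)
    # Precision envelope: make precision non-increasing along the recall axis.
    precision = np.maximum.accumulate(precision[::-1])[::-1]
    prev_recall = np.concatenate(([0.0], recall[:-1])) if len(recall) else recall
    return float(np.sum((recall - prev_recall) * precision))

def mean_average_precision(per_class_results):
    """mAP: the AP averaged over all object categories.
    per_class_results maps a category id to a (scores, is_tp, num_gt) triple."""
    aps = [average_precision(s, t, n) for s, t, n in per_class_results.values()]
    return float(np.mean(aps)) if aps else 0.0
```

For example, three ranked detections marked (TP, FP, TP) against two ground truth boxes yield precision-recall points (1.0, 0.5), (0.5, 0.5), and (0.67, 1.0), giving an AP of about 0.83.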
The standard outputs of a detector applied to a testing image I are the predicted detections {(b_j, c_j, p_j)}_j, indexed by object j, of Bounding Box (BB) b_j, predicted category c_j, and confidence p_j. A predicted detection (b, c, p) is regarded as a True Positive (TP) if

• The predicted category c equals the ground truth label c_g.
• The overlap ratio IOU (Intersection Over Union) (Everingham et al. 2010; Russakovsky et al. 2015)

$$\mathrm{IOU}(b, b^{g}) = \frac{\mathrm{area}(b \cap b^{g})}{\mathrm{area}(b \cup b^{g})}, \qquad (4)$$

between the predicted BB b and the ground truth b^g is not smaller than a predefined threshold ε, where ∩ and ∪ denote intersection and union, respectively.
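As a minimal illustration of this matching rule, the Python sketch below evaluates Eq. (4) for axis-aligned boxes given by their (x1, y1, x2, y2) corners and combines it with the category-agreement condition. The box format, the function names, and the default threshold ε = 0.5 (the value traditionally used in the PASCAL VOC evaluation) are assumptions made here for illustration.

```python
def iou(box, gt_box):
    """Intersection Over Union of Eq. (4) for boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box[0], gt_box[0]), max(box[1], gt_box[1])
    ix2, iy2 = min(box[2], gt_box[2]), min(box[3], gt_box[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)        # area(b ∩ b^g)
    area_b = (box[2] - box[0]) * (box[3] - box[1])
    area_g = (gt_box[2] - gt_box[0]) * (gt_box[3] - gt_box[1])
    union = area_b + area_g - inter                          # area(b ∪ b^g)
    return inter / union if union > 0 else 0.0

def is_true_positive(pred, gt, eps=0.5):
    """pred = (b, c, p), gt = (b_g, c_g): TP if the categories agree and
    IOU(b, b_g) is not smaller than the (assumed) threshold eps."""
    (b, c, _p), (b_g, c_g) = pred, gt
    return c == c_g and iou(b, b_g) >= eps

# Example: a confident 'person' detection overlapping a 'person' ground truth.
print(is_true_positive(((10, 10, 50, 60), "person", 0.9),
                       ((12, 8, 48, 58), "person")))         # True (IOU ≈ 0.83)
```

In the full challenge protocols, each ground truth box can be matched to at most one detection (typically the highest-confidence one), so duplicate detections of the same object are counted as false positives before the precision-recall curve and AP are computed.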
³ In object detection challenges, such as PASCAL VOC and ILSVRC, the winning entry of each object category is that with the highest AP score, and the winner of the challenge is the team that wins on the most object categories. The mAP is also used as the measure of a team's performance, and is justified since the ranking of teams by mAP was always the same as the ranking by the number of object categories won (Russakovsky et al. 2015).