A Survey of Advances in Object Detection in the Deep Learning Era

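
"Recent Advances in Object Detection"

In the age of deep convolutional neural networks, object detection has made remarkable progress. Object detection, i.e. recognizing objects of specific categories (such as "car" or "airplane") in images, has received wide attention in recent years. This is mainly due to the importance of the task in many applications and to the breakthroughs achieved in this field since the emergence of deep convolutional neural networks (DCNNs). Since this paper was published in 2018, deep-learning-driven object detection methods have become a focus of research. The article comprehensively reviews the recent literature on object detection with deep CNNs and analyzes these advances in depth. It mainly discusses the following core concepts and techniques:

1. **Base architectures**: The article covers typical object detection models, such as the single-stage detectors SSD (Single Shot MultiBox Detector) and YOLO (You Only Look Once) and the two-stage detector Faster R-CNN. These models achieve fast and accurate object localization and classification in different ways.
   - SSD achieves real-time detection by predicting bounding boxes and class probabilities within a single network architecture.
   - YOLO is known for its end-to-end training and high efficiency, at the cost of some accuracy.
   - Faster R-CNN introduced the region proposal network (RPN), which improves localization accuracy but is comparatively slow.
2. **Challenges and solutions**: The challenges currently facing the community include detecting small objects, handling dense scenes, reducing computational complexity and improving real-time performance. Researchers address them by improving network architectures, introducing attention mechanisms and optimizing the detection pipeline.
3. **Extended problems**: Object detection is not only localization and classification; it also relates to instance segmentation and keypoint detection. The paper discusses how to extend the object detection task to these related areas to obtain a richer visual understanding.
4. **Loss functions and training strategies**: Different loss functions have a significant effect on detection performance, for example the multi-task loss, the smooth L1 loss and the Focal loss. These losses are designed to address class imbalance and the detection of hard examples.
5. **Data augmentation and pre-trained models**: Data augmentation techniques such as flipping, scaling and cropping are used to enlarge the training set, while pre-trained models (e.g. pre-trained on ImageNet) provide detection models with strong feature representations.
6. **Post-processing**: Non-maximum suppression (NMS) is the key post-processing step for detection results; it removes overlapping bounding boxes and keeps the best predictions.

This survey examines in depth the application of deep CNNs to object detection, analyzes the current technical bottlenecks and looks ahead to possible future research directions. As the technology develops, object detection systems are becoming more efficient and more accurate, which matters greatly for applications such as autonomous driving, surveillance and robot navigation.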

Further improvements: Many improvements to the above methodologies have been suggested concerning speed, performance and computational efficiency.

DeepBox [Kuo et al., 2015] proposed a lightweight generic objectness system by capturing semantic properties. It helped in reducing the burden of localization on the detector as the number of classes increased. Light-head R-CNN [Li et al., 2017f] proposed a smaller detection head and thin feature maps to speed up two-stage detectors. Singh et al. [2017] brought R-FCN to 30 fps by sharing position-sensitive feature maps across classes. Using slight architectural changes, they were also able to bring the number of classes predicted by R-FCN to 3000 without losing too much speed.

Several improvements have been made to RoI-Pooling. The spatial transformer of [Jaderberg et al., 2015] used a differentiable re-sampling grid based on bilinear interpolation and can be used in any detection pipeline. Chen et al. [2016b] used this for face detection, where faces were warped to fit canonical poses. Dai et al. [2016a] proposed another type of pooling called RoI Warping, based on bilinear interpolation. Ma et al. [2018] were the first to introduce a rotated RoI-Pooling working with oriented regions (more on oriented RoI-Pooling can be found in Section 3.2.2). Mask R-CNN [He et al., 2017] proposed RoI Align to address the problem of misalignment in RoI Pooling; it used bilinear interpolation to calculate the value of four regularly sampled locations on each cell. It was also the first step towards a RoI-Pooling that is differentiable with respect to the coordinates of the regions. It brought consistent improvements to all Faster R-CNN baselines on COCO. Recently, Jiang et al. [2018] introduced a Precise RoI Pooling based on interpolating not just 4 spatial locations but a dense region, which allowed full differentiability with no misalignments.
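To make the RoI Align idea concrete, the following is a minimal pure-Python sketch, not the Mask R-CNN implementation. It assumes a single-channel feature map stored as a list of rows, a RoI already expressed in feature-map coordinates, and average aggregation of the sampled points; the function names `bilinear_sample` and `roi_align` are illustrative.

```python
import math


def bilinear_sample(feature, y, x):
    """Bilinearly interpolate a 2-D feature map (list of rows) at real-valued (y, x)."""
    h, w = len(feature), len(feature[0])
    # Clamp so that the four neighbours stay inside the map.
    y = min(max(y, 0.0), h - 1.0)
    x = min(max(x, 0.0), w - 1.0)
    y0, x0 = int(math.floor(y)), int(math.floor(x))
    y1, x1 = min(y0 + 1, h - 1), min(x0 + 1, w - 1)
    dy, dx = y - y0, x - x0
    top = feature[y0][x0] * (1 - dx) + feature[y0][x1] * dx
    bottom = feature[y1][x0] * (1 - dx) + feature[y1][x1] * dx
    return top * (1 - dy) + bottom * dy


def roi_align(feature, roi, out_size=2, samples=2):
    """Pool roi = (y_min, x_min, y_max, x_max) into an out_size x out_size grid.

    No coordinate is rounded. Each cell averages samples x samples regularly
    spaced bilinear samples (4 points when samples=2). Averaging is an
    assumption of this sketch.
    """
    y_min, x_min, y_max, x_max = roi
    cell_h = (y_max - y_min) / out_size
    cell_w = (x_max - x_min) / out_size
    pooled = []
    for i in range(out_size):
        row = []
        for j in range(out_size):
            total = 0.0
            for sy in range(samples):
                for sx in range(samples):
                    y = y_min + (i + (sy + 0.5) / samples) * cell_h
                    x = x_min + (j + (sx + 0.5) / samples) * cell_w
                    total += bilinear_sample(feature, y, x)
            row.append(total / (samples * samples))
        pooled.append(row)
    return pooled


# Toy feature map f[y][x] = x + 10 * y and a RoI with non-integer borders.
feature = [[float(x + 10 * y) for x in range(4)] for y in range(4)]
pooled = roi_align(feature, roi=(0.5, 0.5, 2.5, 2.5))
```

Because the RoI borders are never snapped to the integer grid, a small shift of the RoI changes the pooled values smoothly, which is the misalignment issue RoI Align was designed to avoid.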

Li et al. [2016a] and Yu et al. [2016b] also used contextual information and aspect ratios, while StuffNet [Brahmbhatt et al., 2017] trained for segmenting amorphous categories such as ground and water for the same purpose. Chen and Gupta [2017] made use of memory to take advantage of context in detecting objects. Li et al. [2018b] incorporated a Global Context Module (GCM) to utilize contextual information and Row-Column Max Pooling (RCM Pooling) to better extract scores from the final feature map as compared to the R-FCN method.

Deformable R-FCN [Dai et al., 2017] brought flexibility to the fixed geometric transformations at the Position-Sensitive RoI-Pooling stage of R-FCN by learning additional offsets for each spatial sampling location using a different network branch, in addition to other tricks discussed in Section 2.1.5. Lin et al. [2017a] proposed to use a network with multiple final feature maps of different coarseness to adapt to objects of various sizes. Zagoruyko et al. [2016] used skip connections with the same motivation. Mask R-CNN [He et al., 2017], in addition to RoI Align, added a branch in parallel to the classification and bounding box regression for optimizing the segmentation loss. Additional training for segmentation led to an improvement in the performance of the object detection task as well.

Two-stage methods now clearly dominate among the best-performing DCNN object detectors. However, for certain applications two-stage methods are not enough to get rid of all the false positives.

2.1.4 Cascades

Traditional one-class object detection pipelines resorted to boosting-like approaches for improving performance, where uncorrelated weak classifiers (better than random chance but not too correlated with the true predictions) are combined to form a strong classifier. With modern CNNs, as the classifiers are quite strong, the attractiveness of those methods has plummeted. However, for some specific problems where there are still too many false positives, researchers still find them useful. Furthermore, if the weak CNNs used are very shallow, cascading can also sometimes increase the overall speed of the method.

One of the first ideas to be developed was to cascade multiple CNNs. Li et al. [2015] and Yang and Nevatia [2016] both used a three-stage approach by chaining three CNNs for face detection. The former approach scanned the image using a 12 × 12 patch CNN to reject 90% of the non-face regions in a coarse manner. The remaining detections were offset by a second CNN and given as input to a 24 × 24 CNN that continued rejecting false positives and refining regressions. The final candidates were then passed on to a 48 × 48 classification network which output the final score. The latter approach created separate score maps for different resolutions using the same FCN on different scales of the test image (image pyramid). These score maps were then up-sampled to the same resolution and added to create a final score map, which was then used to select proposals. Proposals were then passed to the second stage, where two different verification CNNs, trained on hard examples, eradicated the remaining false positives: the first a four-layer FCN trained from scratch, and the second an AlexNet [Krizhevsky et al., 2012] pre-trained on ImageNet.
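The control flow of such a hard cascade can be sketched in a few lines. This is only an illustration under stated assumptions: the three stages are stand-in scoring functions with made-up scores and thresholds, not the 12 × 12, 24 × 24 and 48 × 48 networks themselves, and `run_hard_cascade` is a hypothetical name.

```python
def run_hard_cascade(candidates, stages):
    """Filter candidates through (score_fn, threshold) stages, cheapest first.

    A candidate is dropped as soon as one stage scores it below that stage's
    threshold, so later (more expensive) stages see fewer candidates.
    Returns the survivors and the number of candidates each stage evaluated.
    """
    survivors = list(candidates)
    evaluated = []
    for score_fn, threshold in stages:
        evaluated.append(len(survivors))
        survivors = [c for c in survivors if score_fn(c) >= threshold]
    return survivors, evaluated


# Toy windows with precomputed, made-up scores for three stages.
windows = [
    {"id": 0, "coarse": 0.05, "medium": 0.10, "fine": 0.20},
    {"id": 1, "coarse": 0.60, "medium": 0.30, "fine": 0.90},
    {"id": 2, "coarse": 0.80, "medium": 0.70, "fine": 0.40},
    {"id": 3, "coarse": 0.90, "medium": 0.85, "fine": 0.95},
]
stages = [
    (lambda w: w["coarse"], 0.5),   # stand-in for the cheap first network
    (lambda w: w["medium"], 0.5),   # stand-in for the second network
    (lambda w: w["fine"], 0.5),     # stand-in for the final classifier
]
kept, evaluated = run_hard_cascade(windows, stages)
```

The `evaluated` counts shrink from stage to stage, which is where the speed benefit of a cascade with shallow early stages comes from.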

All the approaches mentioned in the last paragraph are ad hoc: the CNNs are independent of each other and there is no overall design. Therefore, they could benefit from integrating the elegant zooming module that is RoI-Pooling. RoI-Pooling can act like a glue to pass the detections from one network to the other, while doing the down-sampling operation locally. Dai et al. [2016a] used a Mask R-CNN like structure that first proposed bounding boxes, then predicted a mask and used a third stage to perform fine-grained discrimination on masked regions that are RoI-Pooled a second time.

Ouyang et al. [2017] and Wang et al. [2017a] optimized in an end-to-end manner a Faster R-CNN with multiple stages of RoI-Pooling. Each stage accepted only the highest-scored proposals from the previous stage and added more context and/or localized the detection better. Additional information about context was then used to do fine-grained discrimination between hard negatives and true positives in [Ouyang et al., 2017], for example. On the contrary, Zhang et al. [2016a] showed that, for pedestrian detection, RoI-Pooling on too coarse a feature map actually hurts the result. They therefore used the RPN proposals of a Faster R-CNN in a boosting pipeline involving a forest (Tang et al. [2017c] acted similarly for small vehicle detection). This problem has since been alleviated by the use of feature pyramid networks with higher resolution feature maps.

Yang et al. [2016a], aware of the problem raised by Zhang et al. [2016a], used RoI-Pooling on multiple scaled feature maps of all the layers of the network. The classification function on each layer was learned using the weak classifiers of AdaBoost and then approximated using a fully connected neural network.

While all the mentioned pipelines are hard cascades where the different classifiers are independent, it is sometimes possible to use a soft cascade where the final score is a linear weighted combination of the scores given by the different weak classifiers, as in Angelova et al. [2015]. They used 200 stages (instead of 2000 stages in their baseline with AdaBoost [Benenson et al., 2012]) to keep recall high enough while improving precision. To save computations that would otherwise be unmanageable, they terminated the computation of the weighted sum whenever the score for a certain number of classifiers fell under a specified threshold (there are, therefore, as many thresholds to learn as there are classifiers). These thresholds are then really important because they control the trade-off between speed, recall and precision.
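The early-termination rule of a soft cascade can be sketched as follows. This is an illustrative reading of the description above, not the code of Angelova et al. [2015]: the weak classifiers, weights and thresholds are toy values, and `soft_cascade` is a hypothetical name.

```python
def soft_cascade(x, classifiers, weights, thresholds):
    """Accumulate a weighted sum of weak scores with one rejection threshold per stage.

    After adding the k-th weighted score, the running sum is compared with
    thresholds[k]; if it falls below, evaluation stops and x is rejected.
    Returns (accepted, running_sum, number_of_classifiers_evaluated).
    """
    total = 0.0
    used = 0
    for clf, weight, threshold in zip(classifiers, weights, thresholds):
        total += weight * clf(x)
        used += 1
        if total < threshold:
            return False, total, used  # early exit saves the remaining stages
    return True, total, used


# Toy weak classifiers: each one just reads a feature of the input tuple.
classifiers = [lambda x: x[0], lambda x: x[1], lambda x: x[2]]
weights = [1.0, 0.5, 2.0]
thresholds = [0.0, 0.2, 1.0]

accepted_a = soft_cascade((0.5, 0.4, 0.6), classifiers, weights, thresholds)
rejected_b = soft_cascade((0.1, -0.2, 0.9), classifiers, weights, thresholds)
```

Raising the thresholds rejects candidates earlier (faster, higher precision, lower recall); lowering them does the opposite, which is the trade-off mentioned in the text.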

All the previous works in this Section involved a small fixed number of localization refinement steps, which might cause proposals to be imperfectly aligned with the ground truth, which in turn might impact the accuracy. That is why many works proposed iterative bounding box regression (a while loop on localization refinement until a condition is reached). Najibi et al. [2016] and Rajaram et al. [2016] started with a regularly spaced grid of sparse pyramid boxes (only 200 non-overlapping in Najibi et al. [2016], whereas Rajaram et al. [2016] used all Faster R-CNN anchors on the grid) that were iteratively pushed towards the ground truth according to the feature representation obtained from RoI-Pooling the current region. An interesting finding was that, even if the goal was to use as many refinement steps as necessary, regressing the boxes only twice can in fact be sufficient if the seed boxes or anchors span the space appropriately [Najibi et al., 2016]. Approaches proposed by Gidaris and Komodakis [2016a] and Li et al. [2017a] can also be viewed, internally, as iterative regression based methods proposing regions for detectors such as Fast R-CNN.
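The "while loop on localization refinement" can be sketched as below. This is a toy illustration only: the real regressor is a network applied to RoI-Pooled features, whereas here `toy_regressor` is a stand-in that moves a box halfway towards a hidden target; the stopping rule (`min_shift`, `max_steps`) and all names are assumptions of the sketch.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def refine_iteratively(box, regress_fn, max_steps=10, min_shift=0.5):
    """Apply regress_fn repeatedly until the box moves less than min_shift.

    regress_fn maps the current box to a refined box. max_steps bounds the
    loop in case the regressor never settles.
    """
    steps = 0
    while steps < max_steps:
        new_box = regress_fn(box)
        shift = max(abs(n - o) for n, o in zip(new_box, box))
        box = new_box
        steps += 1
        if shift < min_shift:
            break
    return box, steps


# Stand-in for a learned regressor: move halfway towards a hidden target.
target = (10.0, 10.0, 50.0, 50.0)


def toy_regressor(box):
    return tuple(b + 0.5 * (t - b) for b, t in zip(box, target))


seed = (0.0, 0.0, 20.0, 20.0)
refined, steps = refine_iteratively(seed, toy_regressor)
```

With a good regressor and well-placed seed boxes the loop stops after very few iterations, in line with the observation that two regression steps can already be sufficient.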

The boosting and multistage (> 2) methods we have seen previously exhibit very different possible combinations of DCNNs. But we thought it would be interesting to still have a Section for a special kind of method that was hinted at in the previous Sections, namely the part-based models, if not for their performance then at least for their historical importance.

2.1.5 Parts-Based Models

Before the reign of CNN methods, the algorithms based on the Deformable Parts-based Model (DPM) and HoG features used to win all the object detection competitions. In this algorithm, latent (not supervised) object parts were discovered for each class and optimized by minimizing the deformations of the full objects (connections were modeled by spring forces). The whole thing was built on a HoG image pyramid.

When region-based DCNNs started to beat the former champion, researchers began to wonder if it was only a matter of using better features. If this was the case, then the region-based approach would not necessarily be a more powerful algorithm. The DPM was flexible enough to integrate the newer, more discriminative CNN features. Therefore, some research works focused on this direction.

In 2014, Savalle and Tsogkas [2014] tried to get the best of both worlds: they replaced the HoG feature pyramids used in the DPM with CNN layers. Surprisingly, the performance they obtained, even if far superior to the DPM+HoG baseline, was considerably worse than the R-CNN method. The authors suspected the reason for it was the fixed-size aspect ratios used in the DPM together with the training strategy. Girshick et al. [2015] put more thought into how to mix CNN and DPM by coming up with the distance transform pooling, thus bringing the new DPM (DeepPyramidDPM) to the level of R-CNN (even slightly better). Ranjan et al. [2015] built on it and introduced a normalization layer that forced each scale-specific feature map to have the same activation intensities. They also implemented a new procedure of sampling optimal targets by using the closest root filter in the pyramid in terms of dimensions. This allowed them to further mimic the HoG-DPM strengths. Simultaneously, Wan et al. [2015] also improved the DeepPyramidDPM but fell short compared to the newest version of R-CNN, fine-tuned (R-CNN FT). Therefore, in 2015 it seemed that the DPM-based approaches had hit a dead end and that the community should focus on R-CNN type methods.

However, the flexibility of the RoI-Pooling of Fast R-CNN was going to help bring the two approaches together. Ouyang et al. [2015] combined Fast R-CNN, to get rid of most backgrounds, and a DeepID-Net, which introduced a max-pooling penalized by the deformation of the parts, called def-pooling. The combination improved over the state of the art. As we mentioned in Section 2.1.3, Dai et al. [2017] built on R-FCN and added deformations in the Position-Sensitive RoI-Pooling: an offset is learned from the classical Position-Sensitive pooled tensor with a fully connected network for each cell of the RoI-Pooling, thus creating "parts"-like features. This trick of moving RoI cells around is also present in [Mordan et al., 2017], although slightly different because it is closer to the original DPM. Dai et al. [2017] even added offsets to convolutional filter cells on Conv-5, which became doable thanks to bilinear interpolation. It thus became a truly deformable fully convolutional network. However, Mordan et al. [2017] got better performance on VOC without it. Several works used deformable R-FCN, like [Xu et al., 2017b] for aerial imagery, which used a different training strategy. However, even if it is still present in famous competitions like COCO, it is less used than its counterparts with fixed RoI-Pooling. It might come back though, thanks to recent best-performing models like [Singh and Davis, 2018] that used [Dai et al., 2017] as their baseline and selectively back-propagated gradients according to the object size.
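The idea of a max-pooling penalized by part deformation, as in the def-pooling mentioned above, can be illustrated with a small sketch. It is not the DeepID-Net layer: the quadratic penalty around an anchor position and the coefficients `c_x`, `c_y` are assumptions chosen for illustration (in the DPM spirit of spring-like deformation costs), and `def_pool` is a hypothetical name.

```python
def def_pool(part_map, anchor, c_x=1.0, c_y=1.0):
    """Max-pool a part response map with a deformation penalty.

    part_map is a 2-D list of part detector responses and anchor = (row, col)
    is the expected part position. Each location's response is reduced by an
    assumed quadratic cost for its displacement from the anchor.
    Returns (best_score, best_position).
    """
    anchor_y, anchor_x = anchor
    best_score, best_pos = float("-inf"), None
    for y, row in enumerate(part_map):
        for x, response in enumerate(row):
            penalty = c_x * (x - anchor_x) ** 2 + c_y * (y - anchor_y) ** 2
            score = response - penalty
            if score > best_score:
                best_score, best_pos = score, (y, x)
    return best_score, best_pos


part_map = [
    [5.0, 0.0, 0.0],
    [0.0, 2.0, 0.0],
    [0.0, 0.0, 0.0],
]
loose = def_pool(part_map, anchor=(1, 1), c_x=1.0, c_y=1.0)   # cheap deformation
stiff = def_pool(part_map, anchor=(1, 1), c_x=2.0, c_y=2.0)   # costly deformation
plain = def_pool(part_map, anchor=(1, 1), c_x=0.0, c_y=0.0)   # ordinary max-pooling
```

With zero coefficients the operation reduces to ordinary max-pooling; larger coefficients make the part "stiffer" and keep it close to its anchor.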

2.2 Model Training

The next important aspect of the detection model's design is the losses being used to converge the huge number of weights and the hyper-parameters that must be conducive to this convergence. Optimizing for a wrongly crafted loss may actually lead the model to diverge instead. Choosing incorrect hyper-parameters can, on the one hand, stagnate the model or trap it in a local optimum or, on the other hand, over-fit the training data (causing poor generalization). Since DCNNs are mostly trained with mini-batch SGD (see for instance [LeCun et al., 2012]), we focus the following discussion on losses and on the optimization tricks necessary to attain convergence. We also review the contribution of pre-training on some other dataset and of data augmentation techniques, which bring about an excellent initialization point and good generalization respectively.

2.2.1 Losses

Multi-variate cross entropy loss, or log loss, is generally used throughout the literature to classify images or regions in the context of detectors. However, detecting objects in large images comes with its own set of specific challenges: regressing bounding boxes to get precise localization, which is a hard problem that is not present at all in classification, and an imbalance between target object regions and background regions.

A binary cross entropy loss is formulated as shown in Eq. 1. It is used for learning the combined objectness. All object instances are marked as positive, i.e. given the label y = 1. This equation constrains the network to output a predicted confidence score, p, of 1 if it thinks there is an object and 0 otherwise.

$$
CE(p, y) =
\begin{cases}
-\log(p) & \text{if } y = 1 \\
-\log(1 - p) & \text{otherwise}
\end{cases}
\tag{1}
$$
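As a quick numerical illustration of Eq. 1, the sketch below evaluates the binary cross entropy for a few confidence values. The clamping constant `eps` is an implementation assumption added to avoid `log(0)`; it is not part of the equation.

```python
import math


def binary_cross_entropy(p, y, eps=1e-12):
    """Eq. 1: -log(p) if y == 1, otherwise -log(1 - p).

    p is the predicted confidence in [0, 1] and y the 0/1 label. p is clamped
    to [eps, 1 - eps] so that the logarithm stays finite (sketch assumption).
    """
    p = min(max(p, eps), 1.0 - eps)
    return -math.log(p) if y == 1 else -math.log(1.0 - p)


confident_right = binary_cross_entropy(0.9, 1)   # small loss
confident_wrong = binary_cross_entropy(0.1, 1)   # large loss
undecided = binary_cross_entropy(0.5, 1)         # equals log(2)
```

The loss grows without bound as a confident prediction turns out to be wrong, which is what pushes p towards 1 for objects and towards 0 otherwise.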

A multi-variate version of the log loss is used for classification (Eq. 2). $p_{o,c}$ predicts the probability of observation $o$ being class $c$, where $c \in \{1, \dots, C\}$. $y_{o,c}$ is 1 if observation $o$ belongs to class $c$ and 0 otherwise. $c = 0$ accounts for the special case of the background class.

$$
MCE(p, y) = -\sum_{c=0}^{C} y_{o,c} \log(p_{o,c})
\tag{2}
$$
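Eq. 2 can be evaluated for a single observation as in the sketch below. The softmax used to turn raw scores into probabilities is a common choice assumed here for illustration (the equation itself only needs probabilities), index 0 plays the role of the background class, and `eps` is again an added numerical safeguard.

```python
import math


def softmax(logits):
    """Turn raw class scores into probabilities (shifted by the max for stability)."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def multiclass_cross_entropy(probs, onehot, eps=1e-12):
    """Eq. 2 for one observation: -sum_c y_c * log(p_c), with c = 0 as background."""
    return -sum(y * math.log(max(p, eps)) for p, y in zip(probs, onehot))


# Background plus three object classes; the true class is index 1.
uniform = softmax([0.0, 0.0, 0.0, 0.0])
peaked = softmax([0.0, 4.0, 0.0, 0.0])
label = [0, 1, 0, 0]

loss_uniform = multiclass_cross_entropy(uniform, label)   # equals log(4)
loss_peaked = multiclass_cross_entropy(peaked, label)     # much smaller
```

With a one-hot label only the true-class term survives, so the loss is simply the negative log-probability assigned to the correct class.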

Fast R-CNN [Girshick, 2015] used a multitask loss (Eq. 3), which is the de facto equation used for classifying as well as regressing. The losses are summed over all the region proposals or default reference boxes, $i$. The ground-truth label, $p_i^*$, is 1 if the proposal box is positive, otherwise 0. Regression is learned only for positive proposal boxes.

$$
L(\{p_i\}, \{t_i\}) = \frac{1}{N_{cls}} \sum_i L_{cls}(p_i, p_i^*) + \lambda \frac{1}{N_{reg}} \sum_i p_i^* L_{reg}(t_i, t_i^*)
\tag{3}
$$

where $t_i$ is a vector representing the 4 coordinates of the predicted bounding box and, similarly, $t_i^*$ represents the 4 coordinates of the ground truth. Eq. 4 presents the equation for exact parameterized coordinates. $\{x_a, y_a, w_a, h_a\}$ are the center x and y coordinates, width and height of the default anchor box respectively. Similarly, $\{x^*, y^*, w^*, h^*\}$ are ground truths and $\{x, y, w, h\}$ are the coordinates to be predicted. The two terms are normalized by the mini-batch size, $N_{cls}$, and the number of proposals/default reference boxes, $N_{reg}$, and weighted by a balancing parameter $\lambda$.

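$$
\begin{aligned}
t_x &= \frac{x - x_a}{w_a}, & t_y &= \frac{y - y_a}{h_a}, & t_w &= \log\frac{w}{w_a}, & t_h &= \log\frac{h}{h_a} \\
t_x^* &= \frac{x^* - x_a}{w_a}, & t_y^* &= \frac{y^* - y_a}{h_a}, & t_w^* &= \log\frac{w^*}{w_a}, & t_h^* &= \log\frac{h^*}{h_a}
\end{aligned}
\tag{4}
$$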
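Eqs. 3 and 4 can be tied together in a short sketch. It assumes boxes in (center x, center y, width, height) form, a binary log loss for the classification term, and the smooth L1 loss for the regression term; the smooth L1 choice follows common Fast R-CNN practice and is an assumption here, since this passage does not fix the regression loss. All function names are illustrative.

```python
import math


def encode_box(box, anchor):
    """Eq. 4: parameterize box = (x, y, w, h) relative to anchor = (xa, ya, wa, ha)."""
    x, y, w, h = box
    xa, ya, wa, ha = anchor
    return ((x - xa) / wa, (y - ya) / ha, math.log(w / wa), math.log(h / ha))


def decode_box(t, anchor):
    """Inverse of encode_box: recover (x, y, w, h) from offsets t and the anchor."""
    tx, ty, tw, th = t
    xa, ya, wa, ha = anchor
    return (xa + tx * wa, ya + ty * ha, wa * math.exp(tw), ha * math.exp(th))


def smooth_l1(d):
    """Smooth L1 on one residual (assumed choice for the regression loss)."""
    d = abs(d)
    return 0.5 * d * d if d < 1.0 else d - 0.5


def log_loss(p, p_star, eps=1e-12):
    """Binary log loss between predicted score p and 0/1 label p_star."""
    p = min(max(p, eps), 1.0 - eps)
    return -math.log(p) if p_star == 1 else -math.log(1.0 - p)


def multitask_loss(p, t, p_star, t_star, lam=1.0, n_cls=None, n_reg=None):
    """Eq. 3: normalized classification loss plus lam times the regression loss.

    The factor p_star[i] switches the regression term off for negative boxes.
    n_cls and n_reg default to the number of boxes (sketch assumption).
    """
    n_cls = n_cls or len(p)
    n_reg = n_reg or len(p)
    cls_term = sum(log_loss(pi, ps) for pi, ps in zip(p, p_star)) / n_cls
    reg_term = sum(
        ps * sum(smooth_l1(a - b) for a, b in zip(ti, ts))
        for ps, ti, ts in zip(p_star, t, t_star)
    ) / n_reg
    return cls_term + lam * reg_term


anchor = (10.0, 10.0, 4.0, 8.0)
gt_box = (12.0, 14.0, 8.0, 8.0)
t_star_pos = encode_box(gt_box, anchor)

# One positive and one negative box; the negative's offsets are ignored.
loss = multitask_loss(
    p=[0.5, 0.5],
    t=[t_star_pos, (5.0, 5.0, 5.0, 5.0)],
    p_star=[1, 0],
    t_star=[t_star_pos, (0.0, 0.0, 0.0, 0.0)],
)
```

In this toy call the positive box is regressed perfectly and the negative box contributes nothing to the regression term, so the total reduces to the classification loss alone.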
