Fig. 4. Architecture of SPP-net for object detection [64].
However, more candidate boxes are required to achieve
results comparable to those of R-CNN.
2) SPP-Net: FC layers must take a fixed-size input. That
is why R-CNN chooses to warp or crop each region proposal
into the same size. However, the object may exist only partly in
the cropped region, and unwanted geometric distortion may be
introduced by the warping operation. Such content loss or
distortion reduces recognition accuracy, especially when
the scales of objects vary.
To solve this problem, He et al. [64] took the theory of
spatial pyramid matching (SPM) [89], [90] into consideration
and proposed a novel CNN architecture named SPP-net. SPM
partitions the image into a number of divisions at several
scales, from finer to coarser, and aggregates quantized local
features into mid-level representations.
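As a concrete illustration, the following is a minimal sketch of this aggregation, assuming local features have already been quantized into visual-word indices with pixel positions; the function name, the (1, 2, 4) grid sizes, and the plain histogram pooling are illustrative assumptions rather than the exact formulation of [89], [90]:

# Minimal sketch of SPM-style aggregation (illustrative only): per-cell
# visual-word histograms over 1x1, 2x2, and 4x4 grids are concatenated
# into one mid-level representation.
import numpy as np

def spm_representation(word_ids, positions, image_size, vocab_size,
                       levels=(1, 2, 4)):
    # word_ids: (N,) visual-word index of each quantized local feature.
    # positions: (N, 2) array of (x, y) feature locations in pixels.
    # image_size: (width, height) of the image.
    width, height = image_size
    blocks = []
    for grid in levels:                          # pyramid levels (grid x grid cells)
        col = np.minimum(positions[:, 0] * grid // width, grid - 1).astype(int)
        row = np.minimum(positions[:, 1] * grid // height, grid - 1).astype(int)
        cell = row * grid + col                  # cell index of every feature
        for c in range(grid * grid):             # histogram of words per cell
            hist = np.bincount(word_ids[cell == c], minlength=vocab_size)
            blocks.append(hist.astype(np.float32))
    return np.concatenate(blocks)                # length = vocab_size * (1 + 4 + 16)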
The architecture of SPP-net for object detection can be
found in Fig. 4. Different from R-CNN, SPP-net reuses
feature maps of the fifth conv layer (conv5) to project region
proposals of arbitrary sizes to fixed-length feature vectors. These
feature maps can be reused because they capture not only the
strength of local responses but also their spatial
positions [64]. The layer after the final conv layer is referred to
as the SPP layer. If the number of feature maps in conv5 is 256,
taking a three-level pyramid, the final feature vector for each
region proposal obtained after the SPP layer has a dimension
of 256 × (1^2 + 2^2 + 4^2) = 5376.
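A minimal sketch of such a three-level SPP layer is given below, assuming PyTorch; adaptive max pooling stands in for the level-wise pooling windows of [64], and the class name is an illustrative choice rather than the authors' implementation:

# Sketch of a three-level SPP layer (1x1, 2x2, 4x4 bins) over conv5 features.
# With 256 input channels it yields a 256 * (1 + 4 + 16) = 5376-dim vector.
import torch
import torch.nn as nn

class SpatialPyramidPooling(nn.Module):
    def __init__(self, levels=(1, 2, 4)):
        super().__init__()
        # One adaptive max pool per level; each outputs a fixed n x n grid
        # regardless of the spatial size of its input.
        self.pools = nn.ModuleList([nn.AdaptiveMaxPool2d(n) for n in levels])

    def forward(self, region_features):
        # region_features: (batch, channels, h, w) crop of the conv5 feature
        # map for one region proposal; h and w vary across proposals.
        pooled = [p(region_features).flatten(start_dim=1) for p in self.pools]
        return torch.cat(pooled, dim=1)          # fixed-length feature vector

spp = SpatialPyramidPooling()
conv5_crop = torch.randn(1, 256, 13, 9)          # arbitrary-sized proposal region
assert spp(conv5_crop).shape == (1, 5376)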
SPP-net not only gains better results by estimating different
region proposals at their correct scales but also improves
detection efficiency at test time by sharing the computation
before the SPP layer among different proposals.
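This saving can be pictured with the following hypothetical test-time loop, in which the conv layers run once per image and each proposal merely reuses a crop of the shared conv5 feature map; the stride of 16 and the callables passed in are assumptions for illustration:

# Illustrative test-time loop: conv features are computed once and shared,
# and only the SPP layer and the classifier run per proposal.
import torch

def detect(backbone, spp, classifier, image, proposals, stride=16):
    # image: (1, 3, H, W); proposals: iterable of (x1, y1, x2, y2) boxes
    # given in image coordinates.
    conv5 = backbone(image)                      # computed once per image
    outputs = []
    for (x1, y1, x2, y2) in proposals:
        # Project the proposal from image coordinates onto the feature map.
        fx1, fy1 = x1 // stride, y1 // stride
        fx2 = max(x2 // stride, fx1 + 1)
        fy2 = max(y2 // stride, fy1 + 1)
        region = conv5[:, :, fy1:fy2, fx1:fx2]   # arbitrary-sized crop
        outputs.append(classifier(spp(region)))  # fixed-length vector -> scores
    return torch.cat(outputs)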
3) Fast R-CNN: Although SPP-net has achieved impressive
improvements in both accuracy and efficiency over R-CNN,
it still has some notable drawbacks. SPP-net takes almost the
same multistage pipeline as R-CNN, including feature extrac-
tion, network fine-tuning, SVM training, and bounding-box
regressor fitting. Therefore, an additional expense on storage
space is still required. In addition, the conv layers preceding
the SPP layer cannot be updated with the fine-tuning algorithm
introduced in [64]. As a result, a drop in the accuracy of very
deep networks is unsurprising. To this end, Girshick [16] introduced
a multitask loss on classification and bounding box regression
and proposed a novel CNN architecture named Fast R-CNN.
The architecture of Fast R-CNN is exhibited in Fig. 5.
Fig. 5. Architecture of Fast R-CNN [16].
Similar to SPP-net, the whole image is processed with conv
layers to produce feature maps. Then, a fixed-length feature
vector is extracted from each region proposal with an RoI
pooling layer. The RoI pooling layer is a special case of the
SPP layer, which has only one pyramid level. Each feature
vector is then fed into a sequence of FC layers before finally
branching into two sibling output layers. One output layer is
responsible for producing softmax probabilities for all C + 1
categories (C object classes plus one “background” class)
and the other output layer encodes refined bounding-box
positions with four real-valued numbers. All parameters in
these procedures (except the generation of region proposals)
are optimized via a multitask loss in an end-to-end way.
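A minimal sketch of this head is given below, assuming PyTorch and using torchvision's roi_pool as the single-level SPP (RoI) pooling layer; the FC widths, the stride of 16, and the class name are illustrative assumptions rather than the exact configuration of [16]:

# Sketch of the Fast R-CNN head: RoI pooling followed by FC layers that branch
# into a (C + 1)-way classifier and a per-class bounding-box regressor.
import torch
import torch.nn as nn
from torchvision.ops import roi_pool

class FastRCNNHead(nn.Module):
    def __init__(self, num_classes, channels=256, pool_size=7, fc_dim=1024):
        super().__init__()
        self.pool_size = pool_size
        self.fc = nn.Sequential(
            nn.Linear(channels * pool_size * pool_size, fc_dim), nn.ReLU(),
            nn.Linear(fc_dim, fc_dim), nn.ReLU(),
        )
        # Two sibling output layers: class scores (softmax is applied in the
        # loss) and refined bounding-box offsets.
        self.cls_score = nn.Linear(fc_dim, num_classes + 1)
        self.bbox_pred = nn.Linear(fc_dim, 4 * (num_classes + 1))

    def forward(self, feature_map, rois, spatial_scale=1.0 / 16):
        # rois: (R, 5) tensor of (batch_index, x1, y1, x2, y2) in image
        # coordinates; spatial_scale maps them onto the conv feature map.
        pooled = roi_pool(feature_map, rois, self.pool_size, spatial_scale)
        x = self.fc(pooled.flatten(start_dim=1))
        return self.cls_score(x), self.bbox_pred(x)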
The multitask loss L is defined as follows to jointly
train classification and bounding-box regression:
L(p, u, t^u, v) = L_cls(p, u) + λ[u ≥ 1] L_loc(t^u, v)    (1)
where L_cls(p, u) = −log p_u calculates the log loss for the
ground-truth class u, and p_u is taken from the discrete
probability distribution p = (p_0, ..., p_C) over the C + 1
outputs of the last FC layer. L_loc(t^u, v) is defined over the
predicted offsets t^u = (t^u_x, t^u_y, t^u_w, t^u_h) and the
ground-truth bounding-box regression targets v = (v_x, v_y,
v_w, v_h), where x, y, w, and h denote the two coordinates of
the box center, the width, and the height, respectively. Each
t^u adopts the parameter settings in [15] to specify an object
proposal with a scale-invariant translation and a log-space
height/width shift. The Iverson bracket indicator function
[u ≥ 1] is employed to omit all background RoIs. To provide
more robustness against outliers and eliminate sensitivity to
exploding gradients, a smooth L_1 loss is adopted to fit the
bounding-box regressors as follows:
L_loc(t^u, v) = Σ_{i ∈ {x, y, w, h}} smooth_L1(t^u_i − v_i)    (2)
where
smooth_L1(x) = 0.5x^2 if |x| < 1, and |x| − 0.5 otherwise.    (3)
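A minimal sketch of this multitask loss, assuming PyTorch, is shown below; the class-specific selection of offsets and all helper names are illustrative, not the reference implementation of [16]:

# Sketch of the multitask loss in (1)-(3).
import torch
import torch.nn.functional as F

def smooth_l1(x):
    # Equation (3): 0.5 * x^2 if |x| < 1, else |x| - 0.5.
    abs_x = x.abs()
    return torch.where(abs_x < 1, 0.5 * x ** 2, abs_x - 0.5)

def fast_rcnn_loss(class_scores, box_offsets, labels, box_targets, lam=1.0):
    # class_scores: (R, C + 1) raw scores; box_offsets: (R, 4 * (C + 1));
    # labels: (R,) ground-truth class u per RoI (0 = background);
    # box_targets: (R, 4) regression targets v.
    # L_cls: log loss -log p_u over the softmax of the (C + 1) outputs.
    cls_loss = F.cross_entropy(class_scores, labels)
    # [u >= 1]: background RoIs contribute no localization loss.
    fg = labels >= 1
    if fg.any():
        # Select the four offsets t^u that correspond to each RoI's true class u.
        idx = labels[fg].unsqueeze(1) * 4 + torch.arange(4)
        t_u = box_offsets[fg].gather(1, idx)
        loc_loss = smooth_l1(t_u - box_targets[fg]).sum(dim=1).mean()
    else:
        loc_loss = box_offsets.sum() * 0.0
    return cls_loss + lam * loc_loss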
To accelerate the pipeline of Fast R-CNN, two further
tricks are necessary. On the one hand, if training sam-
ples (i.e., RoIs) come from different images, backpropagation
through the SPP layer becomes highly inefficient. Fast R-CNN
samples minibatches hierarchically, namely, N images sam-
pled randomly at first and then R/N RoIs sampled in each
image, where R represents the number of RoIs. Critically,
computation and memory are shared by RoIs from the same
image in the forward and backward pass. On the other hand,
much time is spent in computing the FC layers during the
forward pass [16]. The truncated singular value decomposition
(SVD) [91] can be utilized to compress large FC layers and
to accelerate the testing procedure.
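As an illustration of this compression, a rank-t truncated SVD can replace one large FC layer by two thinner ones; the sketch below assumes PyTorch, and the helper name and the chosen rank are hypothetical:

# Sketch of truncated-SVD compression of a trained FC layer, in the spirit of
# the test-time speed-up described in [16].
import torch
import torch.nn as nn

def compress_fc(fc, t):
    # Replace one Linear layer (weight W of shape out x in) by two smaller
    # layers using a rank-t truncated SVD: W ~= U_t diag(S_t) V_t^T.
    W = fc.weight.data                                   # (out_features, in_features)
    U, S, Vh = torch.linalg.svd(W, full_matrices=False)
    first = nn.Linear(fc.in_features, t, bias=False)
    first.weight.data = torch.diag(S[:t]) @ Vh[:t]       # (t, in_features)
    second = nn.Linear(t, fc.out_features, bias=(fc.bias is not None))
    second.weight.data = U[:, :t]                        # (out_features, t)
    if fc.bias is not None:
        second.bias.data = fc.bias.data
    # Parameter count drops from out * in to t * (out + in).
    return nn.Sequential(first, second)

fc = nn.Linear(4096, 4096)
compressed = compress_fc(fc, t=256)   # ~8x fewer multiply-adds for this layer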
In Fast R-CNN, except for region proposal genera-
tion, the training of all network layers can be processed in
a single stage with a multitask loss. It saves the additional