2.3. Bag of specials
We call “bag of specials” those plugin modules and post-processing methods that increase the inference cost only by a small amount but can significantly improve the accuracy of object detection. Generally speaking, these plugin modules are designed to enhance certain attributes of a model, such as enlarging the receptive field, introducing an attention mechanism, or strengthening the feature integration capability, while post-processing is a method for screening model prediction results.
Common modules that can be used to enlarge the receptive field are SPP [25], ASPP [5], and RFB [47]. The SPP module originated from Spatial Pyramid Matching (SPM) [39], whose original method splits a feature map into several d × d equal blocks, where d can be {1, 2, 3, ...}, thereby forming a spatial pyramid, and then extracts bag-of-words features. SPP integrates SPM into CNNs and uses a max-pooling operation instead of the bag-of-words operation. Since the SPP module proposed by He et al. [25] outputs a one-dimensional feature vector, it cannot be applied in a Fully Convolutional Network (FCN). Thus, in the design of YOLOv3 [63], Redmon and Farhadi improved the SPP module to the concatenation of max-pooling outputs with kernel size k × k, where k = {1, 5, 9, 13}, and stride equal to 1. Under this design, a relatively large k × k max-pooling effectively increases the receptive field of the backbone feature. After adding the improved version of the SPP module, YOLOv3-608 improves AP50 by 2.7% on the MS COCO object detection task at the cost of 0.5% extra computation.
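As a rough illustration of the improved SPP module described above, the following PyTorch-style sketch (the class and argument names are ours, not taken from the original implementation) concatenates the input with stride-1 max-pooling outputs of kernel sizes 5, 9, and 13; k = 1 corresponds to the identity branch.

```python
import torch
import torch.nn as nn

class SPPBlock(nn.Module):
    """Sketch of a YOLOv3-style SPP block: concatenation of stride-1
    max-pooling outputs (k = 1 is simply the identity branch)."""
    def __init__(self, kernel_sizes=(5, 9, 13)):
        super().__init__()
        # Padding of k // 2 keeps the spatial resolution unchanged at stride 1.
        self.pools = nn.ModuleList(
            nn.MaxPool2d(kernel_size=k, stride=1, padding=k // 2)
            for k in kernel_sizes
        )

    def forward(self, x):
        # The channel count grows by a factor of len(kernel_sizes) + 1.
        return torch.cat([x] + [pool(x) for pool in self.pools], dim=1)
```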
The main operational difference between the ASPP [5] module and the improved SPP module is that the single k × k max-pooling of stride 1 is replaced by several 3 × 3 dilated convolutions with dilation ratio k and stride 1. The RFB module uses several dilated convolutions with k × k kernels, dilation ratio k, and stride 1 to obtain a more comprehensive spatial coverage than ASPP. RFB [47] costs only 7% extra inference time to increase the AP50 of SSD on MS COCO by 5.7%.
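A minimal sketch of the dilated-convolution branches that distinguish ASPP and RFB from SPP is given below; the channel count and the set of dilation ratios are illustrative assumptions rather than the exact configuration of either paper.

```python
import torch
import torch.nn as nn

class DilatedBranches(nn.Module):
    """ASPP/RFB-style parallel 3x3 convolutions whose dilation ratio
    enlarges the receptive field while keeping stride 1."""
    def __init__(self, channels, dilations=(1, 3, 5)):
        super().__init__()
        # padding = dilation keeps the output resolution equal to the input.
        self.branches = nn.ModuleList(
            nn.Conv2d(channels, channels, kernel_size=3,
                      stride=1, padding=d, dilation=d)
            for d in dilations
        )

    def forward(self, x):
        return torch.cat([branch(x) for branch in self.branches], dim=1)
```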
The attention modules often used in object detection are mainly divided into channel-wise attention and point-wise attention, with Squeeze-and-Excitation (SE) [29] and Spatial Attention Module (SAM) [85] as the respective representatives of these two attention models. Although the SE module can improve ResNet50 by 1% top-1 accuracy on the ImageNet image classification task at the cost of only a 2% increase in computational effort, on a GPU it usually increases the inference time by about 10%, so it is more appropriate for mobile devices. SAM, in contrast, needs only 0.1% extra computation to improve ResNet50-SE by 0.5% top-1 accuracy on the ImageNet image classification task. Best of all, it does not affect the inference speed on the GPU at all.
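To make the channel-wise versus point-wise distinction concrete, a rough sketch of an SE-style block and a SAM-style block is given below; the reduction ratio of 16 and the 7 × 7 kernel are common choices, but the layer arrangement is a simplification of ours.

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Channel-wise attention: squeeze spatially, then re-weight channels."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x):
        b, c, _, _ = x.shape
        w = self.fc(x.mean(dim=(2, 3)))   # global average pool -> (b, c)
        return x * w.view(b, c, 1, 1)     # per-channel re-weighting

class SpatialAttention(nn.Module):
    """Point-wise (spatial) attention: re-weight every spatial position."""
    def __init__(self, kernel_size=7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, x):
        avg = x.mean(dim=1, keepdim=True)
        mx, _ = x.max(dim=1, keepdim=True)
        mask = torch.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))
        return x * mask                    # per-position re-weighting
```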
In terms of feature integration, the early practice was to use skip connections [51] or hyper-columns [22] to integrate low-level physical features with high-level semantic features. Since multi-scale prediction methods such as FPN have become popular, many lightweight modules that integrate different feature pyramids have been proposed. Modules of this sort include SFAM [98], ASFF [48], and BiFPN [77]. The main idea of SFAM is to use the SE module to perform channel-wise re-weighting on multi-scale concatenated feature maps. ASFF uses softmax for point-wise re-weighting and then adds feature maps of different scales. In BiFPN, multi-input weighted residual connections are proposed to perform scale-wise re-weighting, after which feature maps of different scales are added.
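The ASFF idea of point-wise softmax re-weighting can be sketched roughly as follows, assuming the feature maps of the different pyramid levels have already been resized to a common shape; the 1 × 1 weight-prediction convolution is a simplification of ours, not the exact layer used in the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SoftmaxFusion(nn.Module):
    """ASFF-style fusion: predict one weight map per input level and
    combine the levels with a point-wise softmax."""
    def __init__(self, channels, num_levels=3):
        super().__init__()
        # A 1x1 convolution produces one logit map per pyramid level.
        self.weight_conv = nn.Conv2d(channels * num_levels, num_levels, kernel_size=1)

    def forward(self, features):
        # features: list of tensors with identical shape (B, C, H, W).
        logits = self.weight_conv(torch.cat(features, dim=1))   # (B, L, H, W)
        weights = F.softmax(logits, dim=1)                       # point-wise weights
        stacked = torch.stack(features, dim=1)                   # (B, L, C, H, W)
        return (weights.unsqueeze(2) * stacked).sum(dim=1)       # (B, C, H, W)
```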
In deep learning research, some people focus on searching for good activation functions. A good activation function lets the gradient propagate more efficiently without causing too much extra computational cost. In 2010, Nair and Hinton [56] proposed ReLU, which substantially solves the gradient vanishing problem frequently encountered with the traditional tanh and sigmoid activation functions. Subsequently, LReLU [54], PReLU [24], ReLU6 [28], Scaled Exponential Linear Unit (SELU) [35], Swish [59], hard-Swish [27], and Mish [55], among others, which also address the gradient vanishing problem, have been proposed. The main purpose of LReLU and PReLU is to solve the problem that the gradient of ReLU is zero when the output is less than zero. ReLU6 and hard-Swish are specially designed for quantization networks. For self-normalizing a neural network, the SELU activation function was proposed to satisfy this goal. One thing to note is that both Swish and Mish are continuously differentiable activation functions.
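For reference, Swish and Mish can each be written in one line; both are smooth and continuously differentiable, unlike ReLU and its piecewise variants.

```python
import torch
import torch.nn.functional as F

def swish(x, beta=1.0):
    # Swish: x * sigmoid(beta * x); beta = 1 is also known as SiLU.
    return x * torch.sigmoid(beta * x)

def mish(x):
    # Mish: x * tanh(softplus(x)) = x * tanh(ln(1 + exp(x))).
    return x * torch.tanh(F.softplus(x))
```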
The post-processing method commonly used in deep-learning-based object detection is NMS, which filters out those BBoxes that poorly predict the same object and retains only the candidate BBoxes with higher response. The way NMS is improved is consistent with the way an objective function is optimized. The original NMS method does not consider context information, so Girshick et al. [19] added the classification confidence score in R-CNN as a reference, and greedy NMS was performed in order from high confidence score to low. Soft-NMS [1] addresses the problem that occlusion of an object may cause the degradation of its confidence score in greedy NMS with the IoU score. The idea behind DIoU-NMS [99] is to add the center point distance to the BBox screening process on top of soft-NMS. It is worth mentioning that, since none of the above post-processing methods directly refer to the captured image features, post-processing is no longer required in the subsequent development of anchor-free methods.
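For completeness, greedy NMS as described above can be sketched as follows; DIoU-NMS replaces the plain IoU test with IoU minus a normalized center-distance penalty, included here as an optional flag (a simplified sketch, not a reference implementation).

```python
import numpy as np

def greedy_nms(boxes, scores, iou_thresh=0.5, use_diou=False):
    """boxes: (N, 4) array of [x1, y1, x2, y2]; scores: (N,) confidences.
    Returns indices of kept boxes, processed from high to low score."""
    order = scores.argsort()[::-1]
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        rest = order[1:]
        # IoU between the current highest-scoring box and the remaining candidates.
        x1 = np.maximum(boxes[i, 0], boxes[rest, 0])
        y1 = np.maximum(boxes[i, 1], boxes[rest, 1])
        x2 = np.minimum(boxes[i, 2], boxes[rest, 2])
        y2 = np.minimum(boxes[i, 3], boxes[rest, 3])
        inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        area_r = (boxes[rest, 2] - boxes[rest, 0]) * (boxes[rest, 3] - boxes[rest, 1])
        iou = inter / (area_i + area_r - inter + 1e-9)
        score = iou
        if use_diou:
            # DIoU-NMS: subtract the squared center distance normalized by the
            # squared diagonal of the smallest enclosing box.
            cx_i, cy_i = (boxes[i, 0] + boxes[i, 2]) / 2, (boxes[i, 1] + boxes[i, 3]) / 2
            cx_r, cy_r = (boxes[rest, 0] + boxes[rest, 2]) / 2, (boxes[rest, 1] + boxes[rest, 3]) / 2
            ex1 = np.minimum(boxes[i, 0], boxes[rest, 0])
            ey1 = np.minimum(boxes[i, 1], boxes[rest, 1])
            ex2 = np.maximum(boxes[i, 2], boxes[rest, 2])
            ey2 = np.maximum(boxes[i, 3], boxes[rest, 3])
            diag2 = (ex2 - ex1) ** 2 + (ey2 - ey1) ** 2 + 1e-9
            score = iou - ((cx_i - cx_r) ** 2 + (cy_i - cy_r) ** 2) / diag2
        order = rest[score <= iou_thresh]   # drop candidates that overlap too much
    return keep
```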