YOLO-Former: YOLO Shakes Hand With ViT
Javad Khoramdel, Ahmad Moori, Yasamin Borhani, Armin Ghanbarzadeh, Esmaeil Najafi
Tarbiat Modares University Faculty of Mechanical Engineering, K. N. Toosi University of Technology
Tehran, Iran
j.khoramdel@modares.ac.ir ahmadmoori@email.kntu.ac.ir borhaniyasamin@gmail.com agz1986@gmail.com
najafi.e@kntu.ac.ir
Abstract—The proposed YOLO-Former method seamlessly
integrates the ideas of transformer and YOLOv4 to create a
highly accurate and efficient object detection system. The method
leverages the fast inference speed of YOLOv4 and incorporates
the advantages of the transformer architecture through the inte-
gration of convolutional attention and transformer modules. The
results demonstrate the effectiveness of the proposed approach,
with a mean average precision (mAP) of 85.76% on the Pascal
VOC dataset, while maintaining high prediction speed with a
frame rate of 10.85 frames per second. The contribution of this
work lies in the demonstration of how the innovative combination
of these two state-of-the-art techniques can lead to further
improvements in the field of object detection.
Index Terms—Article submission, IEEE, IEEEtran, journal, LaTeX, paper, template, typesetting.
I. INTRODUCTION
Many computer vision tasks, such as image classification,
image segmentation, and object detection, are dominated by
deep neural networks due to the recent advancements in deep
learning. Object detection is the task of detecting instances
of semantic objects of a certain class in digital images and
videos [1]. Some applications of such systems are license
plate character recognition, object tracking, human face and
body detection and recognition, activity recognition, medical
imaging, advanced driving assistant systems, manufacturing
industry, and robotics.
With the advent of big data and greater processing power, deep-neural-network-based methods for object detection have become more popular. These networks are capable of end-to-end object detection without the need for additional components and are mostly based on convolutional neural networks [2]. State-of-the-art object detection methods fall into two main categories. First, region-proposal-based models that prioritize detection accuracy over inference speed, such as R-CNN [3], Fast R-CNN [4], and Mask R-CNN [5]. Second, one-stage detection models that have high inference speeds and are capable of real-time detection; examples include the single shot multibox detector (SSD) [6], you only look once (YOLO) [7], EfficientDet [8], RetinaNet [9], CenterNet [10], and HourGlass [11].
Although all the previously mentioned object detectors rely solely on convolutional and pooling layers, the impressive results of the Vision Transformer (ViT) [12], which is based on attention layers, have inspired ViT-YOLO [13] and DETR [14], object detectors built on the idea of the transformer. The detection transformer (DETR) framework uses a transformer encoder-decoder architecture to perform end-to-end object detection [14]. ViT-YOLO embeds a scaled dot-product multi-head attention layer at the end of the YOLOv4 backbone by flattening the feature maps before the attention layer; it then reshapes the attention outputs to 2D to be consistent with the remainder of the network [13].
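The flatten-attend-reshape pattern used by ViT-YOLO can be sketched as follows. This is a minimal single-head NumPy illustration, not the ViT-YOLO implementation: the projection matrices and shapes are illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def flat_self_attention(fmap, Wq, Wk, Wv):
    """Flatten a (C, H, W) feature map to (H*W, C) tokens, apply
    scaled dot-product self-attention, and reshape back to (C, H, W)."""
    C, H, W = fmap.shape
    tokens = fmap.reshape(C, H * W).T                 # (HW, C)
    Q, K, V = tokens @ Wq, tokens @ Wk, tokens @ Wv
    attn = softmax(Q @ K.T / np.sqrt(Q.shape[-1]))    # (HW, HW) attention map
    out = attn @ V                                    # (HW, C)
    return out.T.reshape(C, H, W)                     # back to 2D layout

rng = np.random.default_rng(0)
C, H, W = 8, 4, 4
fmap = rng.standard_normal((C, H, W))
Wq, Wk, Wv = (0.1 * rng.standard_normal((C, C)) for _ in range(3))
out = flat_self_attention(fmap, Wq, Wk, Wv)
print(out.shape)  # (8, 4, 4)
```

The reshape back to (C, H, W) is what keeps the attention block drop-in compatible with the convolutional layers that follow it in the backbone.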
This paper improves the accuracy of YOLOv4 by introducing the YOLO-Former algorithm, which employs a novel convolutional self-attention module (CSAM) in the YOLOv4 structure. The CSAM is developed based on scaled dot-product self-attention (SDSA). To connect the proposed CSAM to the other components of the network, a convolutional transformer module has been implemented. The presented object detector is further enhanced by several augmentation policies that increase its generalization capability. As such, YOLO-Former provides more accurate results on the Pascal VOC dataset while preserving real-time execution.
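A convolutional self-attention module of this kind can be sketched as below. The exact CSAM design is detailed later in the paper; this NumPy sketch only assumes the general idea of deriving the query, key, and value projections from 3x3 convolutions (rather than linear maps) and adding a residual connection, so kernel size and the residual are illustrative assumptions.

```python
import numpy as np

def conv2d_3x3(x, w):
    """'Same'-padded 3x3 convolution: x is (Cin, H, W), w is (Cout, Cin, 3, 3)."""
    Cin, H, W = x.shape
    xp = np.pad(x, ((0, 0), (1, 1), (1, 1)))
    out = np.zeros((w.shape[0], H, W))
    for i in range(3):
        for j in range(3):
            # Accumulate each kernel tap over all input channels.
            out += np.einsum('oc,chw->ohw', w[:, :, i, j], xp[:, i:i + H, j:j + W])
    return out

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def conv_self_attention(x, wq, wk, wv):
    """Q/K/V come from 3x3 convolutions; attention is scaled dot-product
    over spatial positions; a residual connection preserves the input shape."""
    C, H, W = x.shape
    q = conv2d_3x3(x, wq).reshape(C, H * W).T   # (HW, C)
    k = conv2d_3x3(x, wk).reshape(C, H * W).T
    v = conv2d_3x3(x, wv).reshape(C, H * W).T
    attn = softmax(q @ k.T / np.sqrt(C))        # (HW, HW)
    out = (attn @ v).T.reshape(C, H, W)
    return x + out                              # residual connection

rng = np.random.default_rng(1)
C, H, W = 8, 6, 6
x = rng.standard_normal((C, H, W))
wq, wk, wv = (0.05 * rng.standard_normal((C, C, 3, 3)) for _ in range(3))
y = conv_self_attention(x, wq, wk, wv)
print(y.shape)  # (8, 6, 6)
```

Because the output shape matches the input, such a module can replace or augment a convolutional block anywhere in the YOLOv4 structure without altering the surrounding layers.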
The structure of the paper is as follows. Section II presents a
summary of the studies on augmentations, regularization, and
attention mechanisms. The network structure and developed
modules are discussed in Section III. A detailed description of the experiments conducted with the implemented model and YOLOv4, covering the dataset, training configuration, and evaluation, is available in Section IV. The results and a comparison to the literature are discussed in Section V and, finally, Section VI concludes the paper.
II. BACKGROUND
A brief review of prior studies on augmentation, regularization, and attention mechanisms is given in the following.
A. Augmentation
The great impact of augmentation on extending the generalization ability of models has made it inseparable from image processing. A network can benefit from augmentation methods such as translation, color jittering, and rotation not only as a means of providing more data but also as a means of becoming less sensitive to these transformations [15]. For instance, occlusion is a challenging problem in image recognition. One solution is the cutout method, which makes the dataset more versatile [16]. In this technique, a random region of each image is covered by a
arXiv:2401.06244v1 [cs.CV] 11 Jan 2024