YOLO-Former: YOLO Shakes Hand With ViT
Javad Khoramdel, Ahmad Moori, Yasamin Borhani, Armin Ghanbarzadeh, Esmaeil Najafi
Tarbiat Modares University Faculty of Mechanical Engineering, K. N. Toosi University of Technology
Tehran, Iran
j.khoramdel@modares.ac.ir ahmadmoori@email.kntu.ac.ir borhaniyasamin@gmail.com agz1986@gmail.com
najafi.e@kntu.ac.ir
Abstract—The proposed YOLO-Former method seamlessly
integrates the ideas of transformer and YOLOv4 to create a
highly accurate and efficient object detection system. The method
leverages the fast inference speed of YOLOv4 and incorporates
the advantages of the transformer architecture through the inte-
gration of convolutional attention and transformer modules. The
results demonstrate the effectiveness of the proposed approach,
with a mean average precision (mAP) of 85.76% on the Pascal
VOC dataset, while maintaining high prediction speed with a
frame rate of 10.85 frames per second. The contribution of this
work lies in the demonstration of how the innovative combination
of these two state-of-the-art techniques can lead to further
improvements in the field of object detection.
Index Terms—Article submission, IEEE, IEEEtran, journal, LaTeX, paper, template, typesetting.
I. INTRODUCTION
Many computer vision tasks, such as image classification,
image segmentation, and object detection, are dominated by
deep neural networks due to the recent advancements in deep
learning. Object detection is the task of detecting instances
of semantic objects of a certain class in digital images and
videos [1]. Some applications of such systems are license
plate character recognition, object tracking, human face and
body detection and recognition, activity recognition, medical
imaging, advanced driving assistant systems, manufacturing
industry, and robotics.
With the advent of big data and greater processing power, deep-neural-network-based methods for object detection have become more popular. These networks are capable of end-to-end object detection without the need for additional components and are mostly based on convolutional neural networks [2]. State-of-the-art object detection methods fall into two main categories. First, region-proposal-based models that prioritize detection accuracy over inference speed, such as R-CNN [3], Fast R-CNN [4], and Mask R-CNN [5]. Second, one-stage detection models that have high inference speeds and are capable of real-time detection; examples include the single shot multibox detector (SSD) [6], you only look once (YOLO) [7], EfficientDet [8], RetinaNet [9], CenterNet [10], and HourGlass [11].
Although all the previously mentioned object detectors rely solely on convolutional and pooling layers, the impressive results of the Vision Transformer (ViT) [12], which is based on attention layers, have inspired ViT-YOLO [13] and DETR [14], object detectors built on the idea of the transformer. The detection transformer (DETR) framework uses a transformer encoder-decoder architecture to perform end-to-end object detection [14]. ViT-YOLO embeds a scaled dot-product multi-head attention layer at the end of the YOLOv4 backbone by flattening the feature maps before the attention layer; it then reshapes the attention outputs to 2D to be consistent with the remainder of the network [13].
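The flatten-attend-reshape pattern used by ViT-YOLO can be sketched as follows. This is a minimal single-head NumPy illustration, not the ViT-YOLO implementation: the projection matrices and shapes are illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def flat_self_attention(fmap, Wq, Wk, Wv):
    """Flatten a (C, H, W) feature map to (H*W, C) tokens, apply
    scaled dot-product self-attention, and reshape back to (C, H, W)."""
    C, H, W = fmap.shape
    tokens = fmap.reshape(C, H * W).T                 # (HW, C)
    Q, K, V = tokens @ Wq, tokens @ Wk, tokens @ Wv
    attn = softmax(Q @ K.T / np.sqrt(Q.shape[-1]))    # (HW, HW) attention map
    out = attn @ V                                    # (HW, C)
    return out.T.reshape(C, H, W)                     # back to 2D layout

rng = np.random.default_rng(0)
C, H, W = 8, 4, 4
fmap = rng.standard_normal((C, H, W))
Wq, Wk, Wv = (0.1 * rng.standard_normal((C, C)) for _ in range(3))
out = flat_self_attention(fmap, Wq, Wk, Wv)
print(out.shape)  # (8, 4, 4)
```

The reshape back to (C, H, W) is what keeps the attention block drop-in compatible with the convolutional layers that follow it in the backbone.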
This paper improves the accuracy of YOLOv4 by introducing the YOLO-Former algorithm, which employs a novel convolutional self-attention module (CSAM) in the YOLOv4 structure. The CSAM is developed based on scaled dot-product self-attention (SDSA). To connect the proposed CSAM to the other components of the network, a convolutional transformer module has been implemented. The presented object detector is further enhanced by several augmentation policies that increase its generalization capability. As such, YOLO-Former provides more accurate results on the Pascal VOC dataset while preserving real-time execution.
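A convolutional self-attention module of this kind can be sketched as below. The exact CSAM design is detailed later in the paper; this NumPy sketch only assumes the general idea of deriving the query, key, and value projections from 3x3 convolutions (rather than linear maps) and adding a residual connection, so kernel size and the residual are illustrative assumptions.

```python
import numpy as np

def conv2d_3x3(x, w):
    """'Same'-padded 3x3 convolution: x is (Cin, H, W), w is (Cout, Cin, 3, 3)."""
    Cin, H, W = x.shape
    xp = np.pad(x, ((0, 0), (1, 1), (1, 1)))
    out = np.zeros((w.shape[0], H, W))
    for i in range(3):
        for j in range(3):
            # Accumulate each kernel tap over all input channels.
            out += np.einsum('oc,chw->ohw', w[:, :, i, j], xp[:, i:i + H, j:j + W])
    return out

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def conv_self_attention(x, wq, wk, wv):
    """Q/K/V come from 3x3 convolutions; attention is scaled dot-product
    over spatial positions; a residual connection preserves the input shape."""
    C, H, W = x.shape
    q = conv2d_3x3(x, wq).reshape(C, H * W).T   # (HW, C)
    k = conv2d_3x3(x, wk).reshape(C, H * W).T
    v = conv2d_3x3(x, wv).reshape(C, H * W).T
    attn = softmax(q @ k.T / np.sqrt(C))        # (HW, HW)
    out = (attn @ v).T.reshape(C, H, W)
    return x + out                              # residual connection

rng = np.random.default_rng(1)
C, H, W = 8, 6, 6
x = rng.standard_normal((C, H, W))
wq, wk, wv = (0.05 * rng.standard_normal((C, C, 3, 3)) for _ in range(3))
y = conv_self_attention(x, wq, wk, wv)
print(y.shape)  # (8, 6, 6)
```

Because the output shape matches the input, such a module can replace or augment a convolutional block anywhere in the YOLOv4 structure without altering the surrounding layers.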
The structure of the paper is as follows. Section II presents a
summary of the studies on augmentations, regularization, and
attention mechanisms. The network structure and developed
modules are discussed in Section III. A detailed description of the experiments conducted with the implemented model and YOLOv4, covering the dataset, training configuration, and evaluation, is available in Section IV. The results and a comparison to the literature are discussed in Section V and, finally, Section VI concludes the paper.
II. BACKGROUND
A brief review of prior studies on augmentation, regularization, and attention mechanisms is given in the following.
A. Augmentation
The great impact of augmentation on extending the generalization ability of models has made it inseparable from image processing. A network can benefit from augmentation methods such as translation, color jittering, and rotation not only as a means of providing more data but also as a means of becoming less sensitive to these transformations [15]. For instance, occlusion is a challenging problem in image recognition. One solution is the cutout method, which makes the dataset more versatile [16]. In this technique, a random region of each image is covered by a
arXiv:2401.06244v1 [cs.CV] 11 Jan 2024