DEYO: DETR with YOLO for End-to-End Object Detection
Haodong Ouyang
Southwest Minzu University
Chengdu, China
ouyanghaodong@stu.swun.edu.cn
Abstract
The training paradigm of DETRs is heavily contingent
upon pre-training their backbone on the ImageNet dataset.
However, the limited supervisory signals provided by the
image classification task and one-to-one matching strategy
result in an inadequately pre-trained neck for DETRs. Ad-
ditionally, the instability of matching in the early stages of
training engenders inconsistencies in the optimization ob-
jectives of DETRs. To address these issues, we have de-
vised an innovative training methodology termed step-by-
step training. Specifically, in the first stage of training, we
employ a classic detector, pre-trained with a one-to-many
matching strategy, to initialize the backbone and neck of
the end-to-end detector. In the second stage of training,
we freeze the backbone and neck of the end-to-end detector
and train the decoder from scratch.
Through the application of step-by-step training, we have
introduced the first real-time end-to-end object detection
model that utilizes a purely convolutional structure encoder,
DETR with YOLO (DEYO). Without reliance on any sup-
plementary training data, DEYO surpasses all existing real-
time object detectors in both speed and accuracy. Moreover,
the entire DEYO series can complete its second-stage
training on the COCO dataset using a single 8GB
RTX 4060 GPU, significantly reducing training cost.
Source code and pre-trained models are available at
https://github.com/ouyanghaodong/DEYO.
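The two training stages described above can be illustrated with a minimal PyTorch sketch. It is not the released DEYO code: `TinyDetector`, its single-convolution backbone, neck, and head, and the learning rate are hypothetical stand-ins chosen only to show the weight transfer and the freezing step.

```python
import torch
from torch import nn


class TinyDetector(nn.Module):
    """Hypothetical toy detector: backbone -> neck -> head."""

    def __init__(self, head: nn.Module):
        super().__init__()
        self.backbone = nn.Conv2d(3, 8, kernel_size=3, padding=1)
        self.neck = nn.Conv2d(8, 8, kernel_size=1)
        self.head = head

    def forward(self, x):
        return self.head(self.neck(self.backbone(x)))


# Stage 1 (assumed to be finished): a classic detector trained with
# one-to-many matching. Here its weights are simply randomly initialised.
stage1 = TinyDetector(head=nn.Conv2d(8, 4, kernel_size=1))

# Stage 2: the end-to-end detector gets a freshly initialised head
# (a stand-in for the decoder) and inherits the backbone and neck.
stage2 = TinyDetector(head=nn.Conv2d(8, 6, kernel_size=1))
stage2.backbone.load_state_dict(stage1.backbone.state_dict())
stage2.neck.load_state_dict(stage1.neck.state_dict())

# Freeze the inherited parts so that only the new head receives gradients.
for module in (stage2.backbone, stage2.neck):
    module.requires_grad_(False)

# Hand only the trainable parameters to the optimizer (lr is an assumption).
trainable = [p for p in stage2.parameters() if p.requires_grad]
optimizer = torch.optim.AdamW(trainable, lr=1e-4)
```

Because the frozen parameters need neither gradients nor optimizer state, a setup of this kind lowers memory use in the second stage, which is consistent with the single-GPU training claim above.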
1. Introduction
Object detection is a fundamental task in computer vision
that aims to precisely localize and identify objects of
various categories within images or
videos. This technology is a cornerstone for many com-
puter vision applications, including autonomous driving,
video surveillance, facial recognition, and object tracking.
In recent years, advancements in deep learning, particularly
methods based on Convolutional Neural Networks (CNNs)
[12], have led to groundbreaking progress in object detection
tasks, establishing themselves as the predominant technology
in this domain.
Figure 1. DEYO has surpassed other real-time object detectors in
speed and accuracy; all detectors were exclusively trained on the
COCO dataset without any additional datasets.
DEtection TRansformer (DETR) [3] introduces an end-
to-end approach for object detection, comprising a CNN
backbone, transformer encoder, and transformer decoder.
DETR employs a Hungarian loss that matches predictions to
ground-truth objects one-to-one, thereby eliminating the
hand-tuned Non-Maximum Suppression (NMS) step and
significantly streamlining the object detection pipeline
through end-to-end optimization.
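To make the one-to-one assignment concrete, the following sketch finds the minimum-cost matching between a few predicted boxes and ground-truth boxes. It is a simplified illustration under stated assumptions: the cost is a plain L1 box distance (DETR's matching cost also includes classification and generalized IoU terms), the search is brute force over permutations rather than the Hungarian algorithm, and the function names and box values are hypothetical.

```python
from itertools import permutations


def l1_cost(pred, target):
    """L1 distance between two boxes given as (cx, cy, w, h)."""
    return sum(abs(p - t) for p, t in zip(pred, target))


def one_to_one_match(preds, targets):
    """Return the minimum-cost assignment as sorted (pred_idx, target_idx) pairs.

    Brute force over permutations; assumes len(preds) >= len(targets).
    """
    best_cost, best_pairs = float("inf"), []
    for pred_ids in permutations(range(len(preds)), len(targets)):
        cost = sum(l1_cost(preds[p], targets[t]) for t, p in enumerate(pred_ids))
        if cost < best_cost:
            best_cost = cost
            best_pairs = [(p, t) for t, p in enumerate(pred_ids)]
    return sorted(best_pairs)


preds = [
    (0.50, 0.50, 0.20, 0.20),  # coincides with target 0
    (0.52, 0.49, 0.21, 0.20),  # near-duplicate of prediction 0
    (0.10, 0.80, 0.10, 0.30),  # coincides with target 1
]
targets = [(0.50, 0.50, 0.20, 0.20), (0.10, 0.80, 0.10, 0.30)]

matches = one_to_one_match(preds, targets)
print(matches)  # [(0, 0), (2, 1)]
```

Prediction 1, the near-duplicate, is left unmatched and would be supervised as "no object", so the model learns to suppress duplicates itself instead of relying on NMS. For realistic numbers of queries, a polynomial-time solver such as `scipy.optimize.linear_sum_assignment` would replace the brute-force search.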
Although end-to-end object detectors based on Trans-
formers (DETRs) have achieved notable success in terms of
performance, these detectors typically rely on pre-training
their backbone networks on the ImageNet dataset. When a
new backbone is selected, it must first be pre-trained on
ImageNet before the DETR is trained; otherwise, an existing
pre-trained backbone must be used. This dependency limits
flexibility in backbone design and raises development
costs. Moreover, when the task dataset diverges significantly from
ImageNet, this pre-training strategy may result in subopti-
arXiv:2402.16370v1 [cs.CV] 26 Feb 2024