DiffYOLO: Object Detection for Anti-Noise via YOLO
and Diffusion Models
Yichen Liu
liuyichen21@mails.ucas.ac.cn
Huajian Zhang
zhanghj@impcas.ac.cn
Daqing Gao
gaodq@impcas.ac.cn
Abstract
Object detection models represented by the YOLO series have been widely used and have achieved excellent results on high-quality datasets, but not all working conditions are ideal. To solve the problem of locating targets on low-quality datasets, existing methods either train a new object detection network or require a large collection of low-quality data for training. In this paper, we propose a framework applied to YOLO models, called DiffYOLO. Specifically, we extract feature maps from denoising diffusion probabilistic models to enhance well-trained models, which allows us to fine-tune YOLO on high-quality datasets and test on low-quality datasets. The results show that this framework can not only improve performance on noisy datasets, but also improve detection results on high-quality test datasets. We will supplement more experiments later (with various datasets and network architectures).
1 Introduction
YOLO has become prevalent in object detection tasks, from autonomous driving to medical image processing. Alice Froidevaux et al. used YOLO to detect vehicles in satellite images [3]; Sudipto Paul et al. applied YOLO to brain cancer recognition on MRI images [13]; Ethan Grooby et al. explored automated facial landmark detection using YOLO [7]. Although YOLO has achieved great success in object detection tasks, capturing objects in noisy images remains a great challenge. Object detection models are normally trained on high-quality images, but test conditions may not be so ideal. Fig. 1 shows that a YOLO model well trained on high-quality datasets yields poor detection results on noisy test images. If models trained on high-quality datasets could perform well on noisy test sets with simple enhancements, the trained models could be better utilized.
Transfer learning from pretrained models is an important way to make full use of them. It first appeared in language models as fine-tuning [9], bringing many benefits, such as making training more efficient and less dependent on high-quality training sets. We therefore hope to find a method that leverages other well-trained models to improve the performance of YOLO models.
Denoising diffusion probabilistic models (DDPMs), put forward by Sohl-Dickstein et al., have shown great advantages in many generation tasks [15, 8]. Othmane Laousy et al. demonstrated that the diffusion method is not susceptible to perturbations [10], so we decided to incorporate the diffusion model into the YOLO model.
Therefore, we propose a framework in this paper, called DiffYOLO, for improving the noise resistance of models already trained on high-quality datasets. We first extract features from the U-Net of an already trained diffusion model, fuse them, and then splice them into the neck module of YOLO. The features extracted by such a diffusion model can improve the YOLO model to obtain
Preprint. Work in progress.
arXiv:2401.01659v1 [cs.CV] 3 Jan 2024
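The extract-fuse-splice step described above can be illustrated with a minimal sketch. This is not the paper's implementation: the function names, the nearest-neighbour upsampling, and the toy feature shapes are all our assumptions, standing in for U-Net activations at several decoder scales that must be brought to one resolution before concatenation into the YOLO neck.

```python
import numpy as np

def upsample_nearest(fmap, size):
    """Nearest-neighbour upsampling of a (C, H, W) feature map to (C, size, size).

    Assumes the target size is an integer multiple of the input size.
    """
    c, h, w = fmap.shape
    assert size % h == 0 and size % w == 0
    return fmap.repeat(size // h, axis=1).repeat(size // w, axis=2)

def fuse_unet_features(feature_maps, out_size):
    """Fuse multi-scale diffusion U-Net features (hypothetical helper):
    upsample every map to a common resolution, then concatenate along
    the channel axis so the result can be spliced into a detector neck.
    """
    upsampled = [upsample_nearest(f, out_size) for f in feature_maps]
    return np.concatenate(upsampled, axis=0)

# Toy feature maps standing in for U-Net activations at three scales.
rng = np.random.default_rng(0)
feats = [rng.standard_normal((8, s, s)) for s in (8, 16, 32)]
fused = fuse_unet_features(feats, out_size=32)
print(fused.shape)  # (24, 32, 32): 8 + 8 + 8 channels at the finest scale
```

In a real network the concatenation would typically be followed by a learned 1x1 convolution to project the fused channels to whatever width the YOLO neck expects; the sketch stops at the fusion step itself.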