Lightweight Object Detection: A Study Based on YOLOv7 Integrated with
ShuffleNetv2 and Vision Transformer
Wenkai Gong
kai.901025@gmail.com
Abstract
As mobile computing technology rapidly evolves, deploying
efficient object detection algorithms on mobile devices has
become a pivotal research area in computer vision. This
study focuses on optimizing the YOLOv7 algorithm to
improve its efficiency and speed on mobile platforms while
preserving high accuracy. By combining Group Convolution,
ShuffleNetV2, and Vision Transformer, this research reduces
the model’s parameter count and memory usage, streamlines
the network architecture, and strengthens real-time object
detection on resource-constrained devices. The experimental
results show that the refined YOLO model markedly increases
processing speed while maintaining high detection accuracy.
1. Introduction
As the field of computer vision rapidly advances, object
detection has become a crucial component in various
applications, spanning areas such as security surveillance,
autonomous driving, and smart healthcare. Whereas
traditional object detection methods suffer from high
computational complexity and insufficient real-time
capability, deep learning-based algorithms have achieved
significant breakthroughs in both accuracy and real-time
performance. Among these, YOLO (You Only Look Once)
[1, 3, 4, 6, 8–10, 12] has established itself as a classic
real-time object detection algorithm, striking a balance
between computational speed and detection precision.
However, mobile devices
typically face limitations in computational power, memory
capacity, and energy consumption, which complicates the
deployment of deep learning models. Adapting the YOLO
model to these settings therefore requires further
improvement and optimization. This paper investigates an
enhanced YOLO model tailored for mobile deployment,
focusing on network structure optimization, model
compression and acceleration, robustness enhancement, and
performance evaluation across different application
scenarios.
The primary objectives of this study encompass the ex-
ploration and understanding of the YOLO algorithm and its
variants in the context of object detection tasks. The focus
of this work will be on grasping the fundamental principles
and core mechanisms of the YOLO algorithm, along with
its performance across various tasks and scenarios. This
includes, but is not limited to, an in-depth investigation
of YOLO’s network architecture, loss functions, training
strategies, and comparative analysis with other object de-
tection algorithms. Considering the characteristics of mo-
bile devices, this research aims to design and implement
enhancements to the YOLO model. Addressing the compu-
tational capabilities and memory constraints of mobile de-
vices, the study will strive to optimize the structure and al-
gorithms of the YOLO model. This may involve lightweight
model design, efficient algorithm implementation, and spe-
cific hardware optimizations, all intended to significantly
enhance the model’s performance and efficiency on mobile
devices while maintaining detection accuracy. Finally, the
improved model will be verified through experiments on
standard datasets and through deployment tests on actual
mobile devices. This evaluation will help ensure that the
improved model is not only sound in theory but also
feasible and effective in practical applications.
The main contributions of this paper are summarized as
follows:
1. The enhanced YOLO model draws on the design philosophy of
ShuffleNetV2 [7]. In particular, the combination of channel
shuffling and group convolution [5] effectively balances the
model’s complexity and performance. This design not only
improves the model’s efficiency but also retains robust
feature extraction capabilities, enabling real-time object
detection on mobile devices. Moreover, by incorporating
techniques like skip connections and depthwise separa-
arXiv:2403.01736v1 [cs.CV] 4 Mar 2024