a lightweight modification of DETR. Focus-DETR addresses the high complexity and parameter count of DETR by
selectively processing object feature vectors, prioritizing only
key objects. This selective approach can be tailored by adjust-
ing the number of input vectors.
Object detection methods based on CNNs typically utilize
deep and wide backbone networks for feature extraction. They
leverage multi-scale feature fusion [30] to capture extensive
semantic information without neglecting geometric texture
details, thereby enhancing the expressiveness of the detection
feature map. However, these methods come with a trade-off, as
the numerous convolutional operations and stacked feature ex-
traction networks significantly increase the model complexity
and parameter count. On the other hand, Transformer-based
object detection methods utilize self-attention mechanisms to
model the interrelationships of different feature maps and
globally contextualize them. These methods are adept at assim-
ilating information from multi-scale receptive fields. However,
the complexity inherent in the Transformer architecture, partic-
ularly the interactions among numerous feature vectors, results in
an increased number of parameters and higher computational
load. Moreover, challenges such as object occlusion and back-
ground distraction persist in detection tasks. Therefore, current
research primarily focuses on enhancing detection accuracy
and robustness, achieving real-time processing, and reducing
model complexity. This paper aims to bolster the detection of
small objects by augmenting the feature extraction capabilities
of the backbone network and the multi-scale feature fusion
in the neck network. Simultaneously, we intend to decrease
the overall parameter count and computational demands through the
application of lightweighting strategies.
B. Attention mechanisms
Inspired by the way humans perceive visual information,
attention mechanisms in computer vision aim to emulate se-
lective focus on objects, minimizing attention to backgrounds
and distractions. Attention mechanisms are broadly classified
into two types: channel attention and spatial attention.
One notable effort is the squeeze-and-excitation network
(SENet) [31]. It consists of two phases: the squeeze phase,
which assesses feature distribution across each channel-wise
feature map, and the excitation phase, which leverages this
distribution to discern the dependencies between channels and
assign appropriate weights to each. By focusing on channel-
specific feature weighting and dependencies, SENet effectively
concentrates on regions of interest (ROIs), offering an efficient
alternative to deep network architectures with minimal
computational and parameter demands.
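For concreteness, the following is a minimal PyTorch sketch of an SE block (our illustration, not the reference implementation; the reduction ratio of 16 is the default suggested in the original paper):

import torch.nn as nn

class SEBlock(nn.Module):
    """Squeeze-and-Excitation sketch: channel reweighting via a bottleneck MLP."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction, bias=False),  # squeeze to a bottleneck
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels, bias=False),  # excite back to C channels
            nn.Sigmoid(),
        )

    def forward(self, x):                   # x: (N, C, H, W)
        s = x.mean(dim=(2, 3))              # squeeze: per-channel global average pooling
        w = self.fc(s)                      # excitation: channel weights in (0, 1)
        return x * w[:, :, None, None]      # rescale each channel

Woo et al. introduced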
the convolutional block attention module (CBAM) [32], merg-
ing spatial attention module (SAM) with channel attention
module (CAM) to augment object detection capabilities. While
CAM focuses on enhancing channel features related to the ob-
ject, SAM is designed to capture spatial information, thereby
boosting the ability to understand spatial relationships.
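A compact sketch of this two-stage design follows (our simplified rendering of CBAM, not the reference code): CAM gates channels using a shared MLP over average- and max-pooled descriptors, after which SAM gates locations using a convolution over channel-wise statistics.

import torch
import torch.nn as nn

class CBAM(nn.Module):
    """CBAM sketch: channel attention (CAM) followed by spatial attention (SAM)."""
    def __init__(self, channels, reduction=16, spatial_kernel=7):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
        )
        self.spatial = nn.Conv2d(2, 1, spatial_kernel, padding=spatial_kernel // 2)

    def forward(self, x):                                  # x: (N, C, H, W)
        # CAM: shared MLP over average- and max-pooled channel descriptors
        ca = torch.sigmoid(self.mlp(x.mean(dim=(2, 3))) + self.mlp(x.amax(dim=(2, 3))))
        x = x * ca[:, :, None, None]
        # SAM: convolution over stacked channel-wise average and maximum maps
        sa = torch.sigmoid(self.spatial(torch.cat(
            [x.mean(dim=1, keepdim=True), x.amax(dim=1, keepdim=True)], dim=1)))
        return x * sa

Li et al.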
introduced selective kernel network (SKNet) [33], which em-
ploys selectivity coefficients to dynamically adjust convolution
kernel sizes, capturing multi-scale features effectively in com-
plex scenes.
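The selection idea can be sketched with two branches as below (our simplification; following the original paper, the 5x5 branch is realized as a dilated 3x3 convolution, and the branch weights come from a softmax over a shared squeezed descriptor):

import torch
import torch.nn as nn

class SKConv(nn.Module):
    """Selective-kernel sketch: softmax-weighted fusion of two receptive fields."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.branch3 = nn.Conv2d(channels, channels, 3, padding=1)
        self.branch5 = nn.Conv2d(channels, channels, 3, padding=2, dilation=2)  # 5x5 field
        mid = max(32, channels // reduction)
        self.fc = nn.Linear(channels, mid)
        self.fcs = nn.ModuleList([nn.Linear(mid, channels) for _ in range(2)])

    def forward(self, x):
        u3, u5 = self.branch3(x), self.branch5(x)
        s = (u3 + u5).mean(dim=(2, 3))          # fuse branches, then squeeze
        z = torch.relu(self.fc(s))
        attn = torch.softmax(torch.stack([f(z) for f in self.fcs], dim=1), dim=1)  # (N, 2, C)
        a3, a5 = attn[:, 0], attn[:, 1]         # selectivity coefficients per channel
        return u3 * a3[:, :, None, None] + u5 * a5[:, :, None, None]

Building on SENet, Wang et al. proposed efficient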
channel attention (ECA) [34], which optimizes channel feature
weighting using global average pooling and one-dimensional
convolution. ECA also dynamically calculates the convolution
kernel size based on the feature map channel count, striking a
balance between computational efficiency and performance.
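In the original formulation, the kernel size is chosen adaptively as k = |log2(C)/γ + b/γ|, rounded to the nearest odd number, with defaults γ = 2 and b = 1. A minimal sketch of this design (our illustration):

import math
import torch
import torch.nn as nn

class ECA(nn.Module):
    """ECA sketch: global average pooling plus a 1-D convolution across channels."""
    def __init__(self, channels, gamma=2, b=1):
        super().__init__()
        # Adaptive kernel size: k = |log2(C)/gamma + b/gamma|, forced to be odd
        t = int(abs(math.log2(channels) / gamma + b / gamma))
        k = t if t % 2 else t + 1
        self.conv = nn.Conv1d(1, 1, kernel_size=k, padding=k // 2, bias=False)

    def forward(self, x):                         # x: (N, C, H, W)
        w = x.mean(dim=(2, 3))                    # global average pooling -> (N, C)
        w = self.conv(w.unsqueeze(1)).squeeze(1)  # local cross-channel interaction
        return x * torch.sigmoid(w)[:, :, None, None]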
Coordinate attention (CA) [35], proposed by Hou et al., concentrates on spatial coordinate information in feature
maps. By learning coordinate weights, CA modifies feature
representations across different locations, enhancing the spatial
understanding and generalization abilities of the network.
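A simplified sketch of this factorized design is given below (our rendering; the original uses an h-swish activation where we use ReLU, and names such as CoordAtt are ours):

import torch
import torch.nn as nn

class CoordAtt(nn.Module):
    """Coordinate-attention sketch: pool along H and W separately, then gate both axes."""
    def __init__(self, channels, reduction=32):
        super().__init__()
        mid = max(8, channels // reduction)
        self.conv1 = nn.Conv2d(channels, mid, 1)
        self.bn = nn.BatchNorm2d(mid)
        self.act = nn.ReLU(inplace=True)
        self.conv_h = nn.Conv2d(mid, channels, 1)
        self.conv_w = nn.Conv2d(mid, channels, 1)

    def forward(self, x):                                   # x: (N, C, H, W)
        n, c, h, w = x.shape
        x_h = x.mean(dim=3, keepdim=True)                   # (N, C, H, 1): pool along width
        x_w = x.mean(dim=2, keepdim=True).transpose(2, 3)   # (N, C, W, 1): pool along height
        y = self.act(self.bn(self.conv1(torch.cat([x_h, x_w], dim=2))))
        y_h, y_w = torch.split(y, [h, w], dim=2)            # split back into the two axes
        a_h = torch.sigmoid(self.conv_h(y_h))                  # (N, C, H, 1) row weights
        a_w = torch.sigmoid(self.conv_w(y_w.transpose(2, 3)))  # (N, C, 1, W) column weights
        return x * a_h * a_w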
Zhang et al. developed the shuffle-attention (SA) [36], which
segments input feature maps into groups along the channel
dimension. Each group passes through a channel branch for
generating feature weights via global average pooling, and a
spatial branch for spatial statistical computation. These branch
outputs are then combined to enhance subgroup integration.
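The per-group pipeline can be sketched as follows (our simplification; the gates are modeled as learnable scale/shift pairs, and a final channel shuffle mixes information across the sub-branches):

import torch
import torch.nn as nn

class ShuffleAttention(nn.Module):
    """SA sketch: split each channel group into a channel branch and a spatial branch."""
    def __init__(self, channels, groups=8):
        super().__init__()
        self.groups = groups
        c = channels // (2 * groups)                      # channels per half-branch
        self.cw = nn.Parameter(torch.zeros(1, c, 1, 1))   # channel-branch scale
        self.cb = nn.Parameter(torch.ones(1, c, 1, 1))    # channel-branch shift
        self.sw = nn.Parameter(torch.zeros(1, c, 1, 1))   # spatial-branch scale
        self.sb = nn.Parameter(torch.ones(1, c, 1, 1))    # spatial-branch shift
        self.gn = nn.GroupNorm(c, c)                      # spatial statistics per channel

    def forward(self, x):                                 # x: (N, C, H, W)
        n, c, h, w = x.shape
        x = x.view(n * self.groups, c // self.groups, h, w)
        x1, x2 = x.chunk(2, dim=1)
        # channel branch: gate derived from global average pooling
        x1 = x1 * torch.sigmoid(self.cw * x1.mean(dim=(2, 3), keepdim=True) + self.cb)
        # spatial branch: gate derived from normalized spatial statistics
        x2 = x2 * torch.sigmoid(self.sw * self.gn(x2) + self.sb)
        out = torch.cat([x1, x2], dim=1).view(n, c, h, w)
        # channel shuffle (2 groups) to let the sub-branches communicate
        return out.view(n, 2, c // 2, h, w).transpose(1, 2).reshape(n, c, h, w)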
Despite its effectiveness, SA incurs a significant increase
in parameters and computational load due to its grouping
and aggregation process. Additionally, Zhang et al. proposed
RFAConv [37], which fuses standard convolution with spatial
attention. This method calculates the weights for each region
within a receptive field through spatial attention, assigning
unique convolution kernels to each region.
Existing attention mechanisms in computer vision, encom-
passing channel and spatial attention or their combination,
typically enhance object representation via feature weighting.
These mechanisms are advantageous as they amplify important
features without significantly adding to model complexity,
thanks to their straightforward modular architecture. Our study
builds on this philosophy, integrating attention mechanisms at
various positions within the backbone network. Unlike con-
ventional methods that focus solely on the channel or spatial
dimensions, our mechanism also captures the global features
of the image, offering a more comprehensive understanding of
the input feature maps.
III. THE PROPOSED YOLO-TLA
A. Motivation and baseline
In this study, we propose YOLO-TLA, an improved object
detection model based on YOLOv5, with a focus on small
object detection and reduced model complexity, as outlined
in Fig. 1. YOLOv5 comes in five versions, named YOLOv5n,
YOLOv5s, YOLOv5m, YOLOv5l, and YOLOv5x, in order
of increasing size, each scaling network depth and width to trade accuracy against model complexity. The model is
structured into three primary parts: the backbone network, the
neck network, and the head network. The backbone network is
built on CSPDarknet53, consisting of standard convolutional
layers with additional feature enhancement modules, tasked
with extracting geometric texture features like the shape and
color of objects. To enrich this basic information, the neck
network, drawing inspiration from FPNet [20] and PANet [24],
further combines feature maps from the backbone network
with deeper semantic information. This combination results in
feature maps rich in both semantic and geometric information.
These enhanced feature maps are then fed into the head
network, which performs the final detection and classification.
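For orientation, this three-part dataflow can be summarized in the following schematic sketch (module names and signatures are ours, not the actual YOLOv5 implementation; the backbone is assumed to emit features at strides 8, 16, and 32):

import torch.nn as nn

class DetectorSkeleton(nn.Module):
    """Schematic backbone -> neck -> head layout of a YOLOv5-style detector."""
    def __init__(self, backbone: nn.Module, neck: nn.Module, head: nn.Module):
        super().__init__()
        self.backbone, self.neck, self.head = backbone, neck, head

    def forward(self, img):
        p3, p4, p5 = self.backbone(img)       # geometric/texture features at 3 scales
        f3, f4, f5 = self.neck(p3, p4, p5)    # top-down (FPN) + bottom-up (PAN) fusion
        return self.head(f3, f4, f5)          # per-scale detection and classification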