IEEE TRANSACTIONS ON INSTRUMENTATION AND MEASUREMENT, VOL. 14, NO. 8, AUGUST 2021 4
nature of transformer models and detection accuracy remains a
crucial research scope in the current field of computer vision.
C. Combination of CNN and Transformer
In object detection, CNNs and transformers have distinct
applications and advantages. CNNs are known for their strong
image feature extraction, efficient multichannel processing, and capacity to learn spatial correlations.
However, CNN-based models have limitations in handling
objects of different sizes and proportions due to fixed window
sizes and strides. On the other hand, transformers exhibit
excellent performance in capturing long-range dependencies
within input sequences without prior knowledge, albeit at a
slower speed and requiring substantial amounts of training
data. The two architectures are thus complementary along several dimensions, and researchers have already explored numerous methodologies to combine them.
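The contrast in receptive-field growth can be made concrete with a back-of-the-envelope calculation (a hedged sketch; the kernel sizes and layer counts below are illustrative and not drawn from the paper):

```python
def conv_receptive_field(kernel_size: int, layers: int) -> int:
    """Receptive field of `layers` stacked stride-1 convolutions
    with a fixed kernel size (no dilation)."""
    return 1 + layers * (kernel_size - 1)

def layers_to_span(n_positions: int, kernel_size: int = 3) -> int:
    """Stacked conv layers needed before one output sees all positions."""
    return -(-(n_positions - 1) // (kernel_size - 1))  # ceiling division

# A single self-attention layer relates all N tokens pairwise, whereas a
# stack of 3x3 convolutions grows its receptive field only linearly:
print(conv_receptive_field(3, 5))   # 11
print(layers_to_span(224))          # 112
```

This linear growth is why CNNs with small fixed kernels struggle with long-range dependencies that one attention layer captures directly.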
The pioneering DETR model replaces fully connected and
convolutional layers with transformers while using ResNet
as the feature extractor, improving accuracy and efficiency.
Huawei’s CMTBlock combines depthwise separable convolu-
tion and the transformer’s multihead self-attention module for
local and global information fusion. The CMT model [44]
stacks the CMTBlock in a hybrid CNN-transformer structure.
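The local-plus-global pattern behind such hybrid blocks can be sketched in a few lines of numpy (a minimal toy, not Huawei's CMTBlock: single-head attention instead of multihead, plain depthwise convolution, and illustrative shapes throughout):

```python
import numpy as np

def depthwise_conv3x3(x, w):
    """Depthwise 3x3 convolution, stride 1, zero padding.
    x: (C, H, W); w: (C, 3, 3), one kernel per channel."""
    C, H, W = x.shape
    xp = np.pad(x, ((0, 0), (1, 1), (1, 1)))
    out = np.zeros_like(x)
    for c in range(C):
        for i in range(H):
            for j in range(W):
                out[c, i, j] = np.sum(xp[c, i:i+3, j:j+3] * w[c])
    return out

def self_attention(x, wq, wk, wv):
    """Single-head self-attention over spatial tokens.
    x: (N, C) token matrix; wq/wk/wv: (C, C) projections."""
    q, k, v = x @ wq, x @ wk, x @ wv
    scores = q @ k.T / np.sqrt(x.shape[1])
    scores -= scores.max(axis=-1, keepdims=True)   # numerical stability
    attn = np.exp(scores)
    attn /= attn.sum(axis=-1, keepdims=True)
    return attn @ v

def hybrid_block(x, conv_w, wq, wk, wv):
    """Local depthwise conv, then global attention, with residuals."""
    C, H, W = x.shape
    local = x + depthwise_conv3x3(x, conv_w)       # local information fusion
    tokens = local.reshape(C, H * W).T             # (H*W, C) token sequence
    out = tokens + self_attention(tokens, wq, wk, wv)  # global interaction
    return out.T.reshape(C, H, W)

rng = np.random.default_rng(0)
C, H, W = 8, 6, 6
y = hybrid_block(rng.normal(size=(C, H, W)),
                 rng.normal(size=(C, 3, 3)) * 0.1,
                 *(rng.normal(size=(C, C)) * 0.1 for _ in range(3)))
print(y.shape)  # (8, 6, 6)
```

The design choice the sketch captures is that the convolution enriches each token with its neighborhood before attention mixes all spatial positions, so the two operators specialize in local and global modeling, respectively.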
The Conformer [45] adopts a dual-network structure, where
the CNN branch enhances local perception of the transformer
branch. Mobile-Former [46] features parallel CNN and
transformer modules with bidirectional bridges, leveraging
MobileNet [47] for local processing and the transformer for
global interaction. However, networks or models employing
such hybrid structures face challenges in effectively balancing
accuracy and lightweight design. For instance, detectors such
as DETR, lacking FPN structures, exhibit suboptimal perfor-
mance in small object detection. While the CMT and Con-
former networks have proven effective in classification tasks,
their application to downstream tasks such as object detection
deviates from the realm of lightweight design. In contrast to the aforementioned models, which combine the two structures directly, an alternative approach applies transformer-style improvements within a pure CNN framework. ConvNeXt
[48] implements novel architectures and optimization strate-
gies similar to those of transformers, achieving competitive
results without attention structures. RepLKNet [49] employs
large convolutional kernels to widen the receptive field, thus
emulating the transformer-like capability for global feature
extraction. By investigating the computational principles of
transformers, ACMix [50] maps their operation process onto
convolutional operators, thereby combining them with tra-
ditional convolution operations to construct a novel CNN
architecture. ParC-Net [51] introduces circular convolution
for global information extraction within a pure convolutional
structure. Although these innovative networks may not achieve
SOTA performance, their greater significance lies in exploring
the factors contributing to the success of transformers from
a CNN perspective, providing inspiration for subsequent re-
search endeavors. The fusion of transformers and CNNs offers
a flexible and diverse range of integration methods. Future
research should strive to deepen the understanding of their
interactions to improve design and optimization.
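The circular-convolution idea can be illustrated with a toy 1-D numpy example (a hedged sketch under simplified assumptions; ParC-Net's actual operator is position-aware and two-dimensional):

```python
import numpy as np

def circular_conv1d(x, w):
    """1-D convolution with circular (wrap-around) padding.
    x: (L,) input sequence; w: (K,) kernel with K odd."""
    pad = len(w) // 2
    xp = np.pad(x, pad, mode="wrap")  # wrap-around padding closes the loop
    return np.array([np.dot(xp[i:i + len(w)], w) for i in range(len(x))])

# When the kernel spans the whole sequence, every output position mixes
# every input position -- global interaction from convolution alone.
x = np.arange(7, dtype=float)        # global mean is 3.0
w_global = np.ones(7) / 7.0          # kernel as long as the input
y = circular_conv1d(x, w_global)
print(y)  # every entry equals the global mean, 3.0
```

Because the padding wraps around, a kernel as long as the input gives each output a full view of the sequence, which is how a purely convolutional structure can emulate the transformer's global receptive field.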
D. Object Detection of Antenna Interference Sources
Regularly monitoring and mitigating antenna interference
sources has become one of the most critical tasks in the
wireless communication field. In the past, detecting antenna
interference sources mainly relied on traditional techniques
such as spectrum analysis, signal recognition, and positioning.
However, these methods have many limitations. For example,
when detection personnel identify a radio interference signal
through a spectrum analyzer, they can determine only the
approximate direction of the interference source based on the
strength of the received signal and cannot accurately determine
its position.
The rapid advancement of deep learning and computer
vision has facilitated the successful application of object detection to assist tasks in various industries. Examples include defect detection in industrial settings, pest and weed detection in agriculture, and vehicle and pedestrian detection in transportation [52]–[56]. These solutions
provide effective ideas for our antenna interference source
detection task. When investigators confirm the approximate
direction of the interference source antenna through a signal
receiver and spectrum analyzer, they can use drones with
cameras and related object detection algorithms to replace
manual accurate positioning work. Unfortunately, antenna interference source detection based on object detection has remained largely unexplored. Owing to the lack of training samples and dedicated models for this task, existing detection methods are ill suited to antennas. Therefore, it is urgent and meaningful
to create a professional dataset and train a model suitable for
this detection task to address the difficulty of locating antenna
interference sources in the wireless communication field.
III. PROPOSED DETECTION FRAMEWORK
A. Overall Model Structure
The overall idea of the network (Fig. 2) lies in the combination of a CNN and a transformer, exploiting both the inductive bias of the convolutional operation and the transformer's ability to extract global information, while also meeting the needs of a lightweight model with low computational complexity. YOLO-Ant adopts DSLKNet, which is composed
of DSLK-Blocks, as the backbone for downsampling and
feature extraction in images. In DSLKNet, four DSLK-Layers
employ convolutional kernels of varying sizes to sequentially
extract rich features from different receptive fields of the
image. To address the challenge of detecting small objects, we
incorporate FPN and PAN neck structures for multiscale feature learning. For the neck component, we pruned the structure based on YOLOv5-s (detailed data are provided in Section IV). In comparison to the baseline model, the
pruned neck model features an increased number of module
stacks and a reduced number of channels in each module.
This structural modification effectively alleviates redundancy