2. Related Work
2.1. Data Augmentation
The purpose of data augmentation is to expand the dataset so that the model becomes more robust to images obtained from different environments. Photometric and geometric distortions are widely used by researchers. For photometric distortion, we adjust the hue, saturation and value of the images. For geometric distortion, we add random scaling, cropping, translation, shearing and rotation. Beyond these global pixel-level augmentations, there are some more distinctive data augmentation methods. Several researchers have proposed augmentations that combine multiple images, i.e. MixUp [57], CutMix [56] and Mosaic [2]. MixUp randomly selects two samples from the training images and forms a random weighted sum of them, with the labels weighted accordingly. Unlike occlusion-style augmentations that generally cover part of an image with zero-valued pixels, CutMix fills the occluded area with a patch taken from another image. Mosaic is an improved version of CutMix: it stitches four images together, which greatly enriches the background of the detected objects. In addition, batch normalization then computes activation statistics over four different images at each layer.
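As a rough illustration of the weighted summation described above, the sketch below blends two detection samples. The function name, the Beta parameter, and the way each box carries its mixing weight are assumptions made for illustration, not the exact implementation used in TPH-YOLOv5.

```python
import numpy as np

def mixup(image_a, boxes_a, image_b, boxes_b, alpha=8.0):
    """Blend two same-sized training samples with a weight drawn from Beta(alpha, alpha)."""
    lam = np.random.beta(alpha, alpha)                   # random mixing ratio in (0, 1)
    mixed_image = lam * image_a + (1.0 - lam) * image_b  # weighted sum of the two images
    # For detection, a common practice is to keep the boxes of both images and
    # weight each box (e.g. its loss or confidence target) by its mixing ratio.
    mixed_boxes = [(b, lam) for b in boxes_a] + [(b, 1.0 - lam) for b in boxes_b]
    return mixed_image, mixed_boxes
```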
In TPH-YOLOv5, we use a combination of MixUp, Mosaic and traditional methods for data augmentation.
2.2. Multi-Model Ensemble Method in Object Detection
Deep neural networks are non-linear methods. They offer great flexibility and scale with the amount of training data. One drawback of this flexibility is that they are trained with stochastic algorithms, which makes them sensitive to the specifics of the training data: they may find a different set of weights each time they are trained, which in turn produces different predictions. This gives neural networks high variance. A successful way to reduce this variance is to train multiple models instead of a single model and to combine their predictions.
There are three main methods for ensembling boxes from different object detection models: non-maximum suppression (NMS) [36], Soft-NMS [53] and weighted boxes fusion (WBF) [43]. In NMS, boxes whose overlap, measured by intersection over union (IoU), exceeds a certain threshold are considered to belong to the same object. For each object, NMS keeps only the bounding box with the highest confidence and deletes the others. The filtering process therefore depends on the choice of this single IoU threshold, which has a large impact on model performance. Soft-NMS makes a slight change to NMS that yields a significant improvement over traditional NMS on standard benchmark datasets such as PASCAL VOC [10] and MS COCO [30]: instead of setting the confidence scores of neighboring boxes to zero and deleting them, it applies an attenuation function to their confidence based on the IoU value. WBF works differently. Whereas NMS and Soft-NMS both discard some boxes, WBF merges all boxes to form the final result, which can mitigate inaccurate predictions from individual models. We use WBF to ensemble the final models, and it performs much better than NMS.
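For concreteness, a minimal sketch of the Gaussian attenuation variant of Soft-NMS is shown below; plain NMS would instead zero out every score whose IoU with the kept box exceeds the threshold. The helper names and hyperparameter values are illustrative assumptions, not the exact settings used in our experiments.

```python
import numpy as np

def box_iou(box, boxes):
    """IoU between one box and an array of boxes, all in [x1, y1, x2, y2] format."""
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area_a = (box[2] - box[0]) * (box[3] - box[1])
    area_b = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area_a + area_b - inter + 1e-9)

def soft_nms(boxes, scores, sigma=0.5, score_thresh=0.001):
    """Gaussian Soft-NMS: decay the scores of overlapping boxes instead of deleting them."""
    boxes, scores = np.asarray(boxes, float), np.asarray(scores, float)
    kept_boxes, kept_scores = [], []
    while scores.size > 0:
        i = int(np.argmax(scores))                       # pick the current highest-scoring box
        kept_boxes.append(boxes[i])
        kept_scores.append(scores[i])
        ious = box_iou(boxes[i], boxes)
        scores = scores * np.exp(-(ious ** 2) / sigma)   # attenuation based on IoU
        mask = scores > score_thresh
        mask[i] = False                                  # drop the box we just kept
        boxes, scores = boxes[mask], scores[mask]
    return np.array(kept_boxes), np.array(kept_scores)
```

WBF goes one step further and, rather than decaying or discarding boxes, averages the coordinates of all clustered boxes weighted by their confidences.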
2.3. Object Detection
CNN-based object detectors can be divided into several types: 1) one-stage detectors: YOLOX [11], FCOS [48], DETR [65], Scaled-YOLOv4 [51], EfficientDet [45]; 2) two-stage detectors: VFNet [59], CenterNet2 [62]; 3) anchor-based detectors: Scaled-YOLOv4 [51], YOLOv5 [21]; 4) anchor-free detectors: CenterNet [63], YOLOX [11], RepPoints [55]. Some detectors, such as RRNet [4], PENet [46] and CenterNet [63], are specially designed for drone-captured images. From the perspective of components, however, detectors generally consist of two parts: a CNN-based backbone used for image feature extraction, and a detection head used to predict the class and bounding box of each object. In addition, object detectors developed in recent years often insert layers between the backbone and the head, and this part is usually called the neck of the detector. Next, we introduce these three structures in detail.
Backbone. Commonly used backbones include VGG [42], ResNet [17], DenseNet [20], MobileNet [19], EfficientNet [44], CSPDarknet53 [52], Swin Transformer [35], etc., rather than networks designed from scratch, because these networks have proven to have strong feature extraction capabilities on classification and other tasks. Researchers nevertheless often fine-tune the backbone to make it better suited to a specific task.
Neck. The neck is designed to make better use of the features extracted by the backbone: it reprocesses and rationally combines the feature maps produced by the backbone at different stages. Usually, a neck consists of several bottom-up paths and several top-down paths, and it is a key link in the object detection framework. The earliest necks used only up- and down-sampling blocks, without any aggregation across feature levels; SSD [34], for example, attaches the head directly to the multi-level feature maps. Commonly used path-aggregation blocks in the neck include FPN [28], PANet [33], NAS-FPN [12], BiFPN [45], ASFF [32] and SFAM [61]. What these methods have in common is that they repeatedly use up- and down-sampling, concatenation, and element-wise sum or product to design aggregation strategies.
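As a simplified sketch of the up-sampling and element-wise fusion pattern these aggregation blocks share, an FPN-style top-down path might look like the following. The channel widths, module names and number of levels are assumptions chosen for illustration, not the actual neck of TPH-YOLOv5.

```python
import torch.nn as nn
import torch.nn.functional as F

class TopDownNeck(nn.Module):
    """Minimal FPN-style top-down path: 1x1 lateral convs, upsampling, element-wise sum."""

    def __init__(self, in_channels=(256, 512, 1024), out_channels=256):
        super().__init__()
        # 1x1 convs project each backbone stage to a common channel width
        self.laterals = nn.ModuleList([nn.Conv2d(c, out_channels, 1) for c in in_channels])
        # 3x3 convs smooth the fused maps before they are passed to the head
        self.smooths = nn.ModuleList(
            [nn.Conv2d(out_channels, out_channels, 3, padding=1) for _ in in_channels])

    def forward(self, feats):
        # feats: backbone feature maps ordered from high to low resolution (e.g. C3, C4, C5)
        x = [lat(f) for lat, f in zip(self.laterals, feats)]
        for i in range(len(x) - 1, 0, -1):
            # upsample the coarser level and fuse it into the finer one by summation
            x[i - 1] = x[i - 1] + F.interpolate(x[i], size=x[i - 1].shape[-2:], mode="nearest")
        return [smooth(p) for smooth, p in zip(self.smooths, x)]
```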