2. Related Work
2.1. Data Augmentation
The purpose of data augmentation is to expand the dataset so that the model becomes more robust to images obtained from different environments. Photometric and geometric distortions are widely used by researchers. For photometric distortion, we adjust the hue, saturation and value of the images. For geometric distortion, we add random scaling, cropping, translation, shearing and rotation. Beyond these global pixel-level augmentations, there are some more distinctive data augmentation methods. Several researchers have proposed augmentations that combine multiple images, i.e. MixUp [57], CutMix [56] and Mosaic [2]. MixUp randomly selects two samples from the training images and forms a random weighted sum of them, with the labels weighted accordingly. Unlike occlusion-style augmentations that generally cover part of an image with zero-valued pixels, CutMix fills the occluded area with a patch taken from another image. Mosaic is an improved version of CutMix: it stitches four images together, which greatly enriches the background of the detected objects. In addition, batch normalization then computes activation statistics over four different images at each layer.
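As a rough illustration of the weighted summation described above, the sketch below blends two detection samples. The function name, the Beta parameter, and the way each box carries its mixing weight are assumptions made for illustration, not the exact implementation used in TPH-YOLOv5.

```python
import numpy as np

def mixup(image_a, boxes_a, image_b, boxes_b, alpha=8.0):
    """Blend two same-sized training samples with a weight drawn from Beta(alpha, alpha)."""
    lam = np.random.beta(alpha, alpha)                   # random mixing ratio in (0, 1)
    mixed_image = lam * image_a + (1.0 - lam) * image_b  # weighted sum of the two images
    # For detection, a common practice is to keep the boxes of both images and
    # weight each box (e.g. its loss or confidence target) by its mixing ratio.
    mixed_boxes = [(b, lam) for b in boxes_a] + [(b, 1.0 - lam) for b in boxes_b]
    return mixed_image, mixed_boxes
```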
In TPH-YOLOv5, we use a combination of MixUp, Mosaic and traditional methods for data augmentation.
2.2. Multi-Model Ensemble Method in Object Detection
Deep neural networks are non-linear methods. They offer great flexibility and scale with the amount of training data. One drawback of this flexibility is that they are trained with stochastic algorithms, which makes them sensitive to the specifics of the training data: they may find a different set of weights each time they are trained, which in turn produces different predictions. This gives neural networks high variance. A successful way to reduce this variance is to train multiple models instead of a single model and to combine their predictions.
There are three main methods for ensembling boxes from different object detection models: non-maximum suppression (NMS) [36], Soft-NMS [53] and weighted boxes fusion (WBF) [43]. In NMS, boxes whose overlap, measured by intersection over union (IoU), exceeds a certain threshold are considered to belong to the same object. For each object, NMS keeps only the bounding box with the highest confidence and deletes the others. The filtering process therefore depends on the choice of this single IoU threshold, which has a large impact on model performance. Soft-NMS makes a slight change to NMS that yields a significant improvement over traditional NMS on standard benchmark datasets such as PASCAL VOC [10] and MS COCO [30]: instead of setting the confidence scores of neighboring boxes to zero and deleting them, it applies an attenuation function to their confidence based on the IoU value. WBF works differently. Whereas NMS and Soft-NMS both discard some boxes, WBF merges all boxes to form the final result, which can mitigate inaccurate predictions from individual models. We use WBF to ensemble the final models, and it performs much better than NMS.
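For concreteness, a minimal sketch of the Gaussian attenuation variant of Soft-NMS is shown below; plain NMS would instead zero out every score whose IoU with the kept box exceeds the threshold. The helper names and hyperparameter values are illustrative assumptions, not the exact settings used in our experiments.

```python
import numpy as np

def box_iou(box, boxes):
    """IoU between one box and an array of boxes, all in [x1, y1, x2, y2] format."""
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area_a = (box[2] - box[0]) * (box[3] - box[1])
    area_b = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area_a + area_b - inter + 1e-9)

def soft_nms(boxes, scores, sigma=0.5, score_thresh=0.001):
    """Gaussian Soft-NMS: decay the scores of overlapping boxes instead of deleting them."""
    boxes, scores = np.asarray(boxes, float), np.asarray(scores, float)
    kept_boxes, kept_scores = [], []
    while scores.size > 0:
        i = int(np.argmax(scores))                       # pick the current highest-scoring box
        kept_boxes.append(boxes[i])
        kept_scores.append(scores[i])
        ious = box_iou(boxes[i], boxes)
        scores = scores * np.exp(-(ious ** 2) / sigma)   # attenuation based on IoU
        mask = scores > score_thresh
        mask[i] = False                                  # drop the box we just kept
        boxes, scores = boxes[mask], scores[mask]
    return np.array(kept_boxes), np.array(kept_scores)
```

WBF goes one step further and, rather than decaying or discarding boxes, averages the coordinates of all clustered boxes weighted by their confidences.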
2.3. Object Detection
CNN-based object detectors can be divided into several types: 1) one-stage detectors: YOLOX [11], FCOS [48], DETR [65], Scaled-YOLOv4 [51], EfficientDet [45]; 2) two-stage detectors: VFNet [59], CenterNet2 [62]; 3) anchor-based detectors: Scaled-YOLOv4 [51], YOLOv5 [21]; 4) anchor-free detectors: CenterNet [63], YOLOX [11], RepPoints [55]. Some detectors, such as RRNet [4], PENet [46] and CenterNet [63], are specially designed for drone-captured images. From the perspective of components, however, detectors generally consist of two parts: a CNN-based backbone used for image feature extraction, and a detection head used to predict the class and bounding box of each object. In addition, object detectors developed in recent years often insert layers between the backbone and the head, and this part is usually called the neck of the detector. Next, we introduce these three structures in detail.
Backbone. Commonly used backbones include VGG [42], ResNet [17], DenseNet [20], MobileNet [19], EfficientNet [44], CSPDarknet53 [52], Swin Transformer [35], etc., rather than networks designed from scratch, because these networks have proven to have strong feature extraction capabilities on classification and other tasks. Researchers nevertheless often fine-tune the backbone to make it better suited to a specific task.
Neck. The neck is designed to make better use of the features extracted by the backbone: it reprocesses and rationally combines the feature maps produced by the backbone at different stages. Usually, a neck consists of several bottom-up paths and several top-down paths, and it is a key link in the object detection framework. The earliest necks used only up- and down-sampling blocks, without any aggregation across feature levels; SSD [34], for example, attaches the head directly to the multi-level feature maps. Commonly used path-aggregation blocks in the neck include FPN [28], PANet [33], NAS-FPN [12], BiFPN [45], ASFF [32] and SFAM [61]. What these methods have in common is that they repeatedly use up- and down-sampling, concatenation, and element-wise sum or product to design aggregation strategies.
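As a simplified sketch of the up-sampling and element-wise fusion pattern these aggregation blocks share, an FPN-style top-down path might look like the following. The channel widths, module names and number of levels are assumptions chosen for illustration, not the actual neck of TPH-YOLOv5.

```python
import torch.nn as nn
import torch.nn.functional as F

class TopDownNeck(nn.Module):
    """Minimal FPN-style top-down path: 1x1 lateral convs, upsampling, element-wise sum."""

    def __init__(self, in_channels=(256, 512, 1024), out_channels=256):
        super().__init__()
        # 1x1 convs project each backbone stage to a common channel width
        self.laterals = nn.ModuleList([nn.Conv2d(c, out_channels, 1) for c in in_channels])
        # 3x3 convs smooth the fused maps before they are passed to the head
        self.smooths = nn.ModuleList(
            [nn.Conv2d(out_channels, out_channels, 3, padding=1) for _ in in_channels])

    def forward(self, feats):
        # feats: backbone feature maps ordered from high to low resolution (e.g. C3, C4, C5)
        x = [lat(f) for lat, f in zip(self.laterals, feats)]
        for i in range(len(x) - 1, 0, -1):
            # upsample the coarser level and fuse it into the finer one by summation
            x[i - 1] = x[i - 1] + F.interpolate(x[i], size=x[i - 1].shape[-2:], mode="nearest")
        return [smooth(p) for smooth, p in zip(self.smooths, x)]
```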