2.3. Bag of specials
We call “bag of specials” those plugin modules and post-processing methods that increase the inference cost only by a small amount but can significantly improve the accuracy of object detection. Generally speaking, these plugin modules are designed to enhance certain attributes of a model, such as enlarging the receptive field, introducing an attention mechanism, or strengthening the feature integration capability, while post-processing is a method for screening model prediction results.
Common modules that can be used to enlarge the receptive field are SPP [25], ASPP [5], and RFB [47]. The SPP module originated from Spatial Pyramid Matching (SPM) [39], whose original method splits a feature map into several d × d equal blocks, where d can be {1, 2, 3, ...}, thereby forming a spatial pyramid, and then extracts bag-of-words features. SPP integrates SPM into CNNs and uses a max-pooling operation instead of the bag-of-words operation. Since the SPP module proposed by He et al. [25] outputs a one-dimensional feature vector, it cannot be applied in a Fully Convolutional Network (FCN). Thus, in the design of YOLOv3 [63], Redmon and Farhadi improved the SPP module to the concatenation of max-pooling outputs with kernel size k × k, where k = {1, 5, 9, 13}, and stride equal to 1. Under this design, a relatively large k × k max-pooling effectively increases the receptive field of the backbone feature. After adding the improved version of the SPP module, YOLOv3-608 improves AP50 by 2.7% on the MS COCO object detection task at the cost of 0.5% extra computation.
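As a rough illustration of the improved SPP module described above, the following PyTorch-style sketch (the class and argument names are ours, not taken from the original implementation) concatenates the input with stride-1 max-pooling outputs of kernel sizes 5, 9, and 13; k = 1 corresponds to the identity branch.

```python
import torch
import torch.nn as nn

class SPPBlock(nn.Module):
    """Sketch of a YOLOv3-style SPP block: concatenation of stride-1
    max-pooling outputs (k = 1 is simply the identity branch)."""
    def __init__(self, kernel_sizes=(5, 9, 13)):
        super().__init__()
        # Padding of k // 2 keeps the spatial resolution unchanged at stride 1.
        self.pools = nn.ModuleList(
            nn.MaxPool2d(kernel_size=k, stride=1, padding=k // 2)
            for k in kernel_sizes
        )

    def forward(self, x):
        # The channel count grows by a factor of len(kernel_sizes) + 1.
        return torch.cat([x] + [pool(x) for pool in self.pools], dim=1)
```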
The main operational difference between the ASPP [5] module and the improved SPP module is that the single k × k max-pooling of stride 1 is replaced by several 3 × 3 dilated convolutions with dilation ratio k and stride 1. The RFB module uses several dilated convolutions with k × k kernels, dilation ratio k, and stride 1 to obtain a more comprehensive spatial coverage than ASPP. RFB [47] costs only 7% extra inference time to increase the AP50 of SSD on MS COCO by 5.7%.
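A minimal sketch of the dilated-convolution branches that distinguish ASPP and RFB from SPP is given below; the channel count and the set of dilation ratios are illustrative assumptions rather than the exact configuration of either paper.

```python
import torch
import torch.nn as nn

class DilatedBranches(nn.Module):
    """ASPP/RFB-style parallel 3x3 convolutions whose dilation ratio
    enlarges the receptive field while keeping stride 1."""
    def __init__(self, channels, dilations=(1, 3, 5)):
        super().__init__()
        # padding = dilation keeps the output resolution equal to the input.
        self.branches = nn.ModuleList(
            nn.Conv2d(channels, channels, kernel_size=3,
                      stride=1, padding=d, dilation=d)
            for d in dilations
        )

    def forward(self, x):
        return torch.cat([branch(x) for branch in self.branches], dim=1)
```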
The attention modules often used in object detection are mainly divided into channel-wise attention and point-wise attention, with Squeeze-and-Excitation (SE) [29] and Spatial Attention Module (SAM) [85] as the respective representatives of these two attention models. Although the SE module can improve ResNet50 by 1% top-1 accuracy on the ImageNet image classification task at the cost of only a 2% increase in computational effort, on a GPU it usually increases the inference time by about 10%, so it is more appropriate for mobile devices. SAM, in contrast, needs only 0.1% extra computation to improve ResNet50-SE by 0.5% top-1 accuracy on the ImageNet image classification task. Best of all, it does not affect the inference speed on the GPU at all.
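To make the channel-wise versus point-wise distinction concrete, a rough sketch of an SE-style block and a SAM-style block is given below; the reduction ratio of 16 and the 7 × 7 kernel are common choices, but the layer arrangement is a simplification of ours.

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Channel-wise attention: squeeze spatially, then re-weight channels."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x):
        b, c, _, _ = x.shape
        w = self.fc(x.mean(dim=(2, 3)))   # global average pool -> (b, c)
        return x * w.view(b, c, 1, 1)     # per-channel re-weighting

class SpatialAttention(nn.Module):
    """Point-wise (spatial) attention: re-weight every spatial position."""
    def __init__(self, kernel_size=7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, x):
        avg = x.mean(dim=1, keepdim=True)
        mx, _ = x.max(dim=1, keepdim=True)
        mask = torch.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))
        return x * mask                    # per-position re-weighting
```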
In terms of feature integration, the early practice was to use skip connections [51] or hyper-columns [22] to integrate low-level physical features with high-level semantic features. Since multi-scale prediction methods such as FPN have become popular, many lightweight modules that integrate different feature pyramids have been proposed. Modules of this sort include SFAM [98], ASFF [48], and BiFPN [77]. The main idea of SFAM is to use the SE module to perform channel-wise re-weighting on multi-scale concatenated feature maps. ASFF uses softmax for point-wise re-weighting and then adds feature maps of different scales. In BiFPN, multi-input weighted residual connections are proposed to perform scale-wise re-weighting, after which feature maps of different scales are added.
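The ASFF idea of point-wise softmax re-weighting can be sketched roughly as follows, assuming the feature maps of the different pyramid levels have already been resized to a common shape; the 1 × 1 weight-prediction convolution is a simplification of ours, not the exact layer used in the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SoftmaxFusion(nn.Module):
    """ASFF-style fusion: predict one weight map per input level and
    combine the levels with a point-wise softmax."""
    def __init__(self, channels, num_levels=3):
        super().__init__()
        # A 1x1 convolution produces one logit map per pyramid level.
        self.weight_conv = nn.Conv2d(channels * num_levels, num_levels, kernel_size=1)

    def forward(self, features):
        # features: list of tensors with identical shape (B, C, H, W).
        logits = self.weight_conv(torch.cat(features, dim=1))   # (B, L, H, W)
        weights = F.softmax(logits, dim=1)                       # point-wise weights
        stacked = torch.stack(features, dim=1)                   # (B, L, C, H, W)
        return (weights.unsqueeze(2) * stacked).sum(dim=1)       # (B, C, H, W)
```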
In deep learning research, some people focus on searching for good activation functions. A good activation function lets the gradient propagate more efficiently without causing too much extra computational cost. In 2010, Nair and Hinton [56] proposed ReLU, which substantially solves the gradient vanishing problem frequently encountered with the traditional tanh and sigmoid activation functions. Subsequently, LReLU [54], PReLU [24], ReLU6 [28], Scaled Exponential Linear Unit (SELU) [35], Swish [59], hard-Swish [27], and Mish [55], among others, which also address the gradient vanishing problem, have been proposed. The main purpose of LReLU and PReLU is to solve the problem that the gradient of ReLU is zero when the output is less than zero. ReLU6 and hard-Swish are specially designed for quantization networks. For self-normalizing a neural network, the SELU activation function was proposed to satisfy this goal. One thing to note is that both Swish and Mish are continuously differentiable activation functions.
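For reference, Swish and Mish can each be written in one line; both are smooth and continuously differentiable, unlike ReLU and its piecewise variants.

```python
import torch
import torch.nn.functional as F

def swish(x, beta=1.0):
    # Swish: x * sigmoid(beta * x); beta = 1 is also known as SiLU.
    return x * torch.sigmoid(beta * x)

def mish(x):
    # Mish: x * tanh(softplus(x)) = x * tanh(ln(1 + exp(x))).
    return x * torch.tanh(F.softplus(x))
```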
The post-processing method commonly used in deep-learning-based object detection is NMS, which filters out those BBoxes that poorly predict the same object and retains only the candidate BBoxes with higher response. The way NMS is improved is consistent with the way an objective function is optimized. The original NMS method does not consider context information, so Girshick et al. [19] added the classification confidence score in R-CNN as a reference, and greedy NMS was performed in order from high confidence score to low. Soft-NMS [1] addresses the problem that occlusion of an object may cause the degradation of its confidence score in greedy NMS with the IoU score. The idea behind DIoU-NMS [99] is to add the center point distance to the BBox screening process on top of soft-NMS. It is worth mentioning that, since none of the above post-processing methods directly refer to the captured image features, post-processing is no longer required in the subsequent development of anchor-free methods.
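For completeness, greedy NMS as described above can be sketched as follows; DIoU-NMS replaces the plain IoU test with IoU minus a normalized center-distance penalty, included here as an optional flag (a simplified sketch, not a reference implementation).

```python
import numpy as np

def greedy_nms(boxes, scores, iou_thresh=0.5, use_diou=False):
    """boxes: (N, 4) array of [x1, y1, x2, y2]; scores: (N,) confidences.
    Returns indices of kept boxes, processed from high to low score."""
    order = scores.argsort()[::-1]
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        rest = order[1:]
        # IoU between the current highest-scoring box and the remaining candidates.
        x1 = np.maximum(boxes[i, 0], boxes[rest, 0])
        y1 = np.maximum(boxes[i, 1], boxes[rest, 1])
        x2 = np.minimum(boxes[i, 2], boxes[rest, 2])
        y2 = np.minimum(boxes[i, 3], boxes[rest, 3])
        inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        area_r = (boxes[rest, 2] - boxes[rest, 0]) * (boxes[rest, 3] - boxes[rest, 1])
        iou = inter / (area_i + area_r - inter + 1e-9)
        score = iou
        if use_diou:
            # DIoU-NMS: subtract the squared center distance normalized by the
            # squared diagonal of the smallest enclosing box.
            cx_i, cy_i = (boxes[i, 0] + boxes[i, 2]) / 2, (boxes[i, 1] + boxes[i, 3]) / 2
            cx_r, cy_r = (boxes[rest, 0] + boxes[rest, 2]) / 2, (boxes[rest, 1] + boxes[rest, 3]) / 2
            ex1 = np.minimum(boxes[i, 0], boxes[rest, 0])
            ey1 = np.minimum(boxes[i, 1], boxes[rest, 1])
            ex2 = np.maximum(boxes[i, 2], boxes[rest, 2])
            ey2 = np.maximum(boxes[i, 3], boxes[rest, 3])
            diag2 = (ex2 - ex1) ** 2 + (ey2 - ey1) ** 2 + 1e-9
            score = iou - ((cx_i - cx_r) ** 2 + (cy_i - cy_r) ** 2) / diag2
        order = rest[score <= iou_thresh]   # drop candidates that overlap too much
    return keep
```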