achieving satisfactory accuracy with high efficiency. DPM
[12] is another popular method using mixtures of multi-
scale deformable part models to represent highly variable
object classes, maintaining top results on PASCAL VOC [8]
for many years. However, with the arrival of deep convolutional
networks, the object detection task was quickly dominated by
CNN-based detectors, which can be roughly divided into two
categories, i.e., the two-stage approach and the one-stage approach.
Two-Stage Approach. The two-stage approach consists of
two parts, where the first one (e.g., Selective Search [46],
EdgeBoxes [55], DeepMask [32, 33], RPN [36]) generates a
sparse set of candidate object proposals, and the second one
determines the accurate object regions and the correspond-
ing class labels using convolutional networks. Notably, the
two-stage approach (e.g., R-CNN [16], SPPnet [18], Fast R-CNN [15],
and Faster R-CNN [36]) achieves dominant performance on several
challenging datasets (e.g., PASCAL VOC 2012 [11] and MS COCO [29]).
After that, numerous effective techniques were proposed to further
improve performance, such as architecture design [5, 26, 54],
training strategy [41, 48], contextual reasoning [1, 14, 40, 50],
and multi-layer feature exploitation [3, 25, 27, 42].
One-Stage Approach. Owing to its high efficiency, the one-stage
approach has recently attracted much more attention.
Sermanet et al. [38] present the OverFeat method for classification,
localization and detection based on deep ConvNets, which is trained
end-to-end, from raw pixels to ultimate categories. Redmon et al. [34]
propose YOLO, which uses a single feed-forward convolutional network
to directly predict object classes and locations and is extremely
fast. After that, YOLOv2 [35] was proposed to improve YOLO in several
aspects, e.g., adding batch normalization on all convolution layers,
using a high-resolution classifier, and using convolution layers with
anchor boxes instead of fully connected layers to predict bounding
boxes. Liu et al.
[30] propose the SSD method, which spreads out anchors
of different scales to multiple layers within a ConvNet and
enforces each layer to focus on predicting objects of a cer-
tain scale. DSSD [13] introduces additional context into
SSD via deconvolution to improve the accuracy. DSOD
[39] designs an efficient framework and a set of principles to
learn object detectors from scratch, following the network
structure of SSD. To improve the accuracy, some one-stage
methods [24, 28, 53] aim to address the extreme class im-
balance problem by re-designing the loss function or clas-
sification strategies. Although the one-stage detectors have
made good progress, their accuracy still trails that of two-
stage methods.
3. Network Architecture
The overall network architecture is shown in Figure 1. Similar to
SSD [30], RefineDet is based on a feed-forward convolutional network
that produces a fixed number of bounding boxes and scores indicating
the presence of different classes of objects in those boxes, followed
by non-maximum suppression to produce the final result. RefineDet is
formed by two inter-connected modules, i.e., the ARM and the ODM.
The ARM aims to remove negative anchors so as to reduce the search
space for the classifier and also to coarsely adjust the locations
and sizes of anchors to provide better initialization for the
subsequent regressor, whereas the ODM aims to regress accurate object
locations and predict multi-class labels based on the refined
anchors. The
ARM is constructed from two base networks (i.e., VGG-16 [43] and
ResNet-101 [19] pretrained on ImageNet [37]) by removing their
classification layers and adding some auxiliary structures to meet
our needs. The ODM is composed of the outputs of TCBs followed by
the prediction layers (i.e., the convolution layers with 3 × 3 kernel
size), which generate the scores for object classes and shape offsets
relative to the refined anchor box coordinates. The following
explains the three core components of RefineDet, i.e., (1) the
transfer connection block (TCB), which converts the features from the
ARM to the ODM for detection; (2) two-step cascaded regression, which
accurately regresses the locations and sizes of objects; and (3)
negative anchor filtering, which rejects well-classified negative
anchors early and mitigates the imbalance issue.
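The non-maximum suppression step mentioned above as the final post-processing stage can be illustrated with a minimal greedy sketch. This is not the authors' implementation: the corner box format `(x1, y1, x2, y2)`, the function names `iou` and `nms`, and the IoU threshold of 0.5 are assumptions made for illustration only.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold=0.5):
    """Greedy NMS: return indices of the kept boxes, highest score first.

    iou_threshold is a hypothetical default, not a value taken from the text.
    """
    # Visit boxes from the most to the least confident.
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep a box only if it does not overlap too much with any kept box.
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep


# Two heavily overlapping boxes and one separate box.
boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
print(nms(boxes, scores))  # [0, 2]
```

In a multi-class detector this procedure would typically be run per class on the scored boxes; the sketch handles a single class.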
Transfer Connection Block. To link the ARM and the ODM, we
introduce the TCBs to convert features of different layers from
the ARM into the form required by the ODM, so that the ODM can
share features from the ARM. Notably, we apply the TCBs only to
the ARM feature maps associated with anchors. Another function of
the TCBs is to integrate large-scale context [13, 27] by adding
the high-level features to the transferred features to improve
detection accuracy. To match the dimensions between them, we use
the deconvolution operation to enlarge the high-level feature maps
and sum them element-wise. Then, we add a convolution layer after
the summation to ensure the discriminability of features for
detection. The architecture of the TCB is shown in Figure 2.
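The core merge inside a TCB (enlarge the high-level map by deconvolution, then sum element-wise) can be sketched on a single channel. This is a simplified illustration under stated assumptions, not the authors' block: the function names `deconv2x` and `tcb_merge`, the 2 × 2 kernel with stride 2, and the single-channel nested-list representation are all hypothetical, and the learned convolution layers before and after the summation are omitted.

```python
def deconv2x(feature, kernel):
    """Stride-2 transposed convolution of a single-channel map with a 2x2 kernel.

    Because the stride equals the kernel size (an assumption of this sketch),
    each input value writes to its own 2x2 output patch without overlap.
    """
    h, w = len(feature), len(feature[0])
    out = [[0.0] * (2 * w) for _ in range(2 * h)]
    for i in range(h):
        for j in range(w):
            for a in range(2):
                for b in range(2):
                    out[2 * i + a][2 * j + b] += feature[i][j] * kernel[a][b]
    return out


def tcb_merge(arm_feature, high_level_feature, kernel):
    """Enlarge the high-level map and add it element-wise to the ARM map."""
    up = deconv2x(high_level_feature, kernel)
    if len(up) != len(arm_feature) or len(up[0]) != len(arm_feature[0]):
        raise ValueError("upsampled map must match the ARM feature map size")
    return [[x + y for x, y in zip(row_arm, row_up)]
            for row_arm, row_up in zip(arm_feature, up)]


# A 4x4 ARM feature map and a 2x2 higher-level map.
arm = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
high = [[1, 2], [3, 4]]
ones = [[1.0, 1.0], [1.0, 1.0]]  # hypothetical fixed kernel; a real one is learned
merged = tcb_merge(arm, high, ones)
```

With the all-ones kernel used here, the transposed convolution reduces to nearest-neighbour upsampling, which makes the summation easy to check by hand; in a trained network the kernel weights would be learned and the maps would have many channels.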
Two-Step Cascaded Regression. Current one-stage methods
[13, 24, 30] rely on one-step regression based on various
feature layers with different scales to predict the locations
and sizes of objects, which is rather inaccurate in some
challenging scenarios, especially for small objects. To address
this, we present a two-step cascaded regression strategy to
regress the locations and sizes of objects. That is, we use
the ARM to first adjust the locations and sizes of anchors to
provide better initialization for the regression in the ODM.
Specifically, we associate n anchor boxes with each regu-
larly divided cell on the feature map. The initial position of
each anchor box relative to its corresponding cell is fixed.
At each feature map cell, we predict four offsets of the re-
fined anchor boxes relative to the original tiled anchors and