A New Two-Stage Object Detection Network
without RoI-Pooling
Chao Yan¹, Weihai Chen¹*, Peter C. Y. Chen², Kendrick Amezquita S.², Xingming Wu¹
1. School of Automation Science and Electrical Engineering, Beihang University, 100191, Beijing, China
E-mail: whchenbuaa@126.com
2. Department of Mechanical Engineering, National University of Singapore, 117576, Singapore
E-mail: mpechenp@nus.edu.sg
Abstract: Two-stage object detection networks typically propose a set of candidate boxes in the first stage and then refine those boxes in the second stage. Most original two-stage methods process the features inside each candidate box with RoI-Pooling [3]. Because the candidate boxes proposed in the first stage overlap, the second-stage computation is repetitive and single-frame detection is slow. RoI-Pooling also deforms the features of elongated objects. In this paper, we present a new two-stage object detection network, called Spatial Alignment Network (SAN), which does not use an RoI-Pooling layer and reduces the repeated computation of the second stage. We also use atrous convolution for the network fine-tuning. Our network achieves competitive accuracy and is faster than the original two-stage detectors.
Key Words: Object Detection, Deep Learning, Computer Vision
1 INTRODUCTION
In recent years, great progress has been made in object detection [4] and semantic segmentation [5]. Object detection architectures fall into two major categories: one-stage and two-stage. Two-stage object detection networks usually propose a set of candidate boxes in the first stage with a region proposal network (RPN) [6] and then fine-tune the candidate boxes in the second stage [6] [7] [8]. Methods of this kind usually achieve higher accuracy but run more slowly. Semantic segmentation networks usually use an encoder-decoder structure, and they often take the contextual relationships [9] [10] within the image into account.
This work is supported by International Scientific and Technological Cooperation Projects of China under Grant 2015DFG12650, the Singapore-China Joint Research Programme of the Science and Engineering Research Council in the Agency for Science, Technology and Research (A*STAR), Singapore, under SERC Project No. 1420200047, and National Natural Science Foundation of China under Grant 61620106012 and 61573048.
*Weihai Chen is the corresponding author.
RoI-Pooling is a commonly used structure in two-stage object detection architectures [6] [7] [8]. The RoI-Pooling layer crops the feature maps according to the candidate boxes proposed in the first stage and then rescales each crop to a specified size, such as 7 × 7 [3]. There are some variations on this operation, such as RoI-Align [11]. For large square candidate boxes, this operation can reduce the amount of computation to some extent. For small rectangular candidate boxes, however, it distorts the spatial information of the original small object. Most importantly, the candidate boxes proposed in the first stage overlap so much that the overall detection speed is reduced.
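The crop-and-rescale behaviour of RoI-Pooling can be illustrated with a minimal NumPy sketch. It is not the implementation used in [3]: the function name `roi_max_pool`, the single-channel feature map, and the bin-splitting rule are assumptions chosen for illustration only. It shows how a flat, elongated box is stretched to the same 7 × 7 grid as a square one.

```python
import numpy as np

def roi_max_pool(feature_map, box, out_size=7):
    """Max-pool one box of a 2-D feature map to an out_size x out_size grid.

    feature_map: array of shape (H, W) (single channel, for simplicity)
    box: (x0, y0, x1, y1) in feature-map cells, with x1 and y1 exclusive
    """
    x0, y0, x1, y1 = box
    region = feature_map[y0:y1, x0:x1]
    h, w = region.shape
    # Split the cropped region into out_size bins along each axis.
    ys = np.linspace(0, h, out_size + 1)
    xs = np.linspace(0, w, out_size + 1)
    out = np.empty((out_size, out_size), dtype=feature_map.dtype)
    for i in range(out_size):
        r0 = int(np.floor(ys[i]))
        r1 = max(int(np.ceil(ys[i + 1])), r0 + 1)  # every bin covers >= 1 row
        for j in range(out_size):
            c0 = int(np.floor(xs[j]))
            c1 = max(int(np.ceil(xs[j + 1])), c0 + 1)  # and >= 1 column
            out[i, j] = region[r0:r1, c0:c1].max()
    return out

fmap = np.arange(32 * 32, dtype=float).reshape(32, 32)
square = roi_max_pool(fmap, (0, 0, 28, 28))  # 28 x 28 cells -> 7 x 7
flat = roi_max_pool(fmap, (0, 0, 28, 2))     # 28 x 2 cells  -> 7 x 7
print(square.shape, flat.shape)
# The 2-row box has to fill 7 output rows, so rows are duplicated.
print(np.array_equal(flat[0], flat[1]))
```

In this sketch the two-row box is replicated vertically to fill seven output rows, which is one way to picture the deformation of elongated objects mentioned above.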
By contrast, some one-stage object detection frameworks [12] [13] [14] [15] work much like an RPN: at each location of the feature maps, they regress the offsets between default boxes of different aspect ratios and scales and the ground truth boxes [6]. This operation is fully convolutional [1] [2] and fast, although the precision may be slightly lower.
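The regression target described above can be sketched as follows. The text does not spell out the exact parameterization, so this example assumes the center-size form commonly used by Faster R-CNN-style detectors, with boxes given as (cx, cy, w, h); the function names `encode` and `decode` are illustrative.

```python
import numpy as np

def encode(default_boxes, gt_boxes):
    """Offsets (tx, ty, tw, th) that map each default box onto its ground truth box.

    Both inputs have shape (N, 4) with columns (cx, cy, w, h).
    """
    d = np.asarray(default_boxes, dtype=float)
    g = np.asarray(gt_boxes, dtype=float)
    t_xy = (g[:, :2] - d[:, :2]) / d[:, 2:]  # center shift, in units of box size
    t_wh = np.log(g[:, 2:] / d[:, 2:])       # log of the size ratio
    return np.concatenate([t_xy, t_wh], axis=1)

def decode(default_boxes, offsets):
    """Inverse of encode: apply predicted offsets to the default boxes."""
    d = np.asarray(default_boxes, dtype=float)
    t = np.asarray(offsets, dtype=float)
    xy = d[:, :2] + t[:, :2] * d[:, 2:]
    wh = d[:, 2:] * np.exp(t[:, 2:])
    return np.concatenate([xy, wh], axis=1)

defaults = np.array([[10.0, 10.0, 4.0, 8.0]])
truth = np.array([[11.0, 9.0, 8.0, 4.0]])
targets = encode(defaults, truth)
print(targets)
print(np.allclose(decode(defaults, targets), truth))
```

Under this assumed parameterization the network only has to predict four numbers per default box at each location, which is why the whole step can be expressed as a convolution.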
In this paper, we propose a new fully convolutional framework for two-stage object detection, called Spatial Alignment Network (SAN), which does not use RoI-Pooling. The first step of the detection process is the same as in Faster R-CNN [6]. In the second step, we use convolution again to regress the deviation between the candidate boxes obtained in the first stage and the ground truth boxes. Fig. 1(c) illustrates the basic procedure of our network. Some parts of this framework resemble R-FCN [8], but our second stage is handled differently: we use neither RoI-Pooling nor PS RoI-Pooling. We combine the candidate box information of the first and second stages by sequence number. We also use some tricks to combine the outputs of the RPN with the features. We test our model on the VOC2007 test set and obtain a test speed of 90 ms per image using ResNet-101, with an mAP of 76.5%. With the same backbone and hardware, our network is 3× faster than Faster R-CNN and 1.2× faster than R-FCN.
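One possible reading of "combining by sequence number" is sketched below. This is an assumption, not the authors' implementation: it supposes that both stages emit one offset vector per (row, column, anchor) position, so that flattening both outputs in the same order lets a single index select a proposal and its second-stage refinement. All names (`apply_offsets`, `two_stage_decode`, `top_k`) and the (cx, cy, w, h) box format are hypothetical.

```python
import numpy as np

def apply_offsets(boxes, offsets):
    """Shift and scale (cx, cy, w, h) boxes by (tx, ty, tw, th) offsets."""
    xy = boxes[:, :2] + offsets[:, :2] * boxes[:, 2:]
    wh = boxes[:, 2:] * np.exp(offsets[:, 2:])
    return np.concatenate([xy, wh], axis=1)

def two_stage_decode(anchors, rpn_offsets, rpn_scores, refine_offsets, top_k):
    """Refine the top-scoring proposals without cropping any features.

    anchors, rpn_offsets, refine_offsets: arrays of shape (H, W, A, 4)
    rpn_scores: array of shape (H, W, A)
    """
    # Flatten every map in the same order, so row k of each array refers
    # to the same (row, column, anchor) position.
    flat_anchors = anchors.reshape(-1, 4)
    flat_rpn = rpn_offsets.reshape(-1, 4)
    flat_refine = refine_offsets.reshape(-1, 4)

    # Stage 1: turn anchors into candidate boxes and keep the best top_k.
    proposals = apply_offsets(flat_anchors, flat_rpn)
    keep = np.argsort(-rpn_scores.reshape(-1))[:top_k]

    # Stage 2: the same indices pick the matching refinement offsets.
    refined = apply_offsets(proposals[keep], flat_refine[keep])
    return keep, refined

rng = np.random.default_rng(0)
anchors = np.abs(rng.normal(size=(2, 2, 3, 4))) + 1.0
rpn_offsets = rng.normal(scale=0.1, size=(2, 2, 3, 4))
refine_offsets = rng.normal(scale=0.05, size=(2, 2, 3, 4))
scores = rng.random((2, 2, 3))
keep, refined = two_stage_decode(anchors, rpn_offsets, scores, refine_offsets, top_k=5)
print(keep.shape, refined.shape)
```

Because the second stage in this sketch is only an index lookup on a dense output map, overlapping candidate boxes do not trigger any repeated per-box feature computation.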
2 RELATED WORK
2.1 Two-stage Detectors
R-CNN [16] was the first to successfully apply convolutional networks to object detection. It uses the selective search algorithm to extract about 2000 region proposals from the image, extracts features from each region proposal by convolution, and classifies the extracted features
978-1-5386-1243-9/18/$31.00 © 2018 IEEE