Fig. 4. Architecture of SPP-net for object detection [64].
However, more candidate boxes are required to achieve
results comparable to those of R-CNN.
2) SPP-Net: FC layers must take a fixed-size input. That
is why R-CNN chooses to warp or crop each region proposal
into the same size. However, the object may exist only partly in
the cropped region, and unwanted geometric distortion may be
introduced by the warping operation. Such content loss or
distortion reduces recognition accuracy, especially when
the scales of objects vary.
To solve this problem, He et al. [64] took the theory of
spatial pyramid matching (SPM) [89], [90] into consideration
and proposed a novel CNN architecture named SPP-net. SPM
partitions the image into a number of divisions at several
scales, from finer to coarser, and aggregates quantized local
features into mid-level representations.
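As a concrete illustration, the following is a minimal sketch of this aggregation, assuming local features have already been quantized into visual-word indices with pixel positions; the function name, the (1, 2, 4) grid sizes, and the plain histogram pooling are illustrative assumptions rather than the exact formulation of [89], [90]:

# Minimal sketch of SPM-style aggregation (illustrative only): per-cell
# visual-word histograms over 1x1, 2x2, and 4x4 grids are concatenated
# into one mid-level representation.
import numpy as np

def spm_representation(word_ids, positions, image_size, vocab_size,
                       levels=(1, 2, 4)):
    # word_ids: (N,) visual-word index of each quantized local feature.
    # positions: (N, 2) array of (x, y) feature locations in pixels.
    # image_size: (width, height) of the image.
    width, height = image_size
    blocks = []
    for grid in levels:                          # pyramid levels (grid x grid cells)
        col = np.minimum(positions[:, 0] * grid // width, grid - 1).astype(int)
        row = np.minimum(positions[:, 1] * grid // height, grid - 1).astype(int)
        cell = row * grid + col                  # cell index of every feature
        for c in range(grid * grid):             # histogram of words per cell
            hist = np.bincount(word_ids[cell == c], minlength=vocab_size)
            blocks.append(hist.astype(np.float32))
    return np.concatenate(blocks)                # length = vocab_size * (1 + 4 + 16)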
The architecture of SPP-net for object detection can be
found in Fig. 4. Different from R-CNN, SPP-net reuses
feature maps of the fifth conv layer (conv5) to project region
proposals of arbitrary sizes to fixed-length feature vectors. These
feature maps can be reused because they capture not only the
strength of local responses but also their spatial
positions [64]. The layer after the final conv layer is referred to
as the SPP layer. If the number of feature maps in conv5 is 256,
taking a three-level pyramid, the final feature vector for each
region proposal obtained after the SPP layer has a dimension
of 256 × (1^2 + 2^2 + 4^2) = 5376.
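A minimal sketch of such a three-level SPP layer is given below, assuming PyTorch; adaptive max pooling stands in for the level-wise pooling windows of [64], and the class name is an illustrative choice rather than the authors' implementation:

# Sketch of a three-level SPP layer (1x1, 2x2, 4x4 bins) over conv5 features.
# With 256 input channels it yields a 256 * (1 + 4 + 16) = 5376-dim vector.
import torch
import torch.nn as nn

class SpatialPyramidPooling(nn.Module):
    def __init__(self, levels=(1, 2, 4)):
        super().__init__()
        # One adaptive max pool per level; each outputs a fixed n x n grid
        # regardless of the spatial size of its input.
        self.pools = nn.ModuleList([nn.AdaptiveMaxPool2d(n) for n in levels])

    def forward(self, region_features):
        # region_features: (batch, channels, h, w) crop of the conv5 feature
        # map for one region proposal; h and w vary across proposals.
        pooled = [p(region_features).flatten(start_dim=1) for p in self.pools]
        return torch.cat(pooled, dim=1)          # fixed-length feature vector

spp = SpatialPyramidPooling()
conv5_crop = torch.randn(1, 256, 13, 9)          # arbitrary-sized proposal region
assert spp(conv5_crop).shape == (1, 5376)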
SPP-net not only gains better results by estimating different
region proposals at their correct scales but also improves
detection efficiency at test time by sharing the computation
before the SPP layer among different proposals.
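This saving can be pictured with the following hypothetical test-time loop, in which the conv layers run once per image and each proposal merely reuses a crop of the shared conv5 feature map; the stride of 16 and the callables passed in are assumptions for illustration:

# Illustrative test-time loop: conv features are computed once and shared,
# and only the SPP layer and the classifier run per proposal.
import torch

def detect(backbone, spp, classifier, image, proposals, stride=16):
    # image: (1, 3, H, W); proposals: iterable of (x1, y1, x2, y2) boxes
    # given in image coordinates.
    conv5 = backbone(image)                      # computed once per image
    outputs = []
    for (x1, y1, x2, y2) in proposals:
        # Project the proposal from image coordinates onto the feature map.
        fx1, fy1 = x1 // stride, y1 // stride
        fx2 = max(x2 // stride, fx1 + 1)
        fy2 = max(y2 // stride, fy1 + 1)
        region = conv5[:, :, fy1:fy2, fx1:fx2]   # arbitrary-sized crop
        outputs.append(classifier(spp(region)))  # fixed-length vector -> scores
    return torch.cat(outputs)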
3) Fast R-CNN: Although SPP-net has achieved impressive
improvements in both accuracy and efficiency over R-CNN,
it still has some notable drawbacks. SPP-net takes almost the
same multistage pipeline as R-CNN, including feature extrac-
tion, network fine-tuning, SVM training, and bounding-box
regressor fitting. Therefore, an additional expense on storage
space is still required. In addition, the conv layers preceding
the SPP layer cannot be updated with the fine-tuning algorithm
introduced in [64]. As a result, a drop in the accuracy of very
deep networks is unsurprising. To this end, Girshick [16] introduced
a multitask loss on classification and bounding box regression
and proposed a novel CNN architecture named Fast R-CNN.
The architecture of Fast R-CNN is exhibited in Fig. 5.
Fig. 5. Architecture of Fast R-CNN [16].
Similar to SPP-net, the whole image is processed with conv
layers to produce feature maps. Then, a fixed-length feature
vector is extracted from each region proposal with an RoI
pooling layer. The RoI pooling layer is a special case of the
SPP layer, which has only one pyramid level. Each feature
vector is then fed into a sequence of FC layers before finally
branching into two sibling output layers. One output layer is
responsible for producing softmax probabilities for all C + 1
categories (C object classes plus one “background” class)
and the other output layer encodes refined bounding-box
positions with four real-valued numbers. All parameters in
these procedures (except the generation of region proposals)
are optimized via a multitask loss in an end-to-end way.
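A minimal sketch of this head is given below, assuming PyTorch and using torchvision's roi_pool as the single-level SPP (RoI) pooling layer; the FC widths, the stride of 16, and the class name are illustrative assumptions rather than the exact configuration of [16]:

# Sketch of the Fast R-CNN head: RoI pooling followed by FC layers that branch
# into a (C + 1)-way classifier and a per-class bounding-box regressor.
import torch
import torch.nn as nn
from torchvision.ops import roi_pool

class FastRCNNHead(nn.Module):
    def __init__(self, num_classes, channels=256, pool_size=7, fc_dim=1024):
        super().__init__()
        self.pool_size = pool_size
        self.fc = nn.Sequential(
            nn.Linear(channels * pool_size * pool_size, fc_dim), nn.ReLU(),
            nn.Linear(fc_dim, fc_dim), nn.ReLU(),
        )
        # Two sibling output layers: class scores (softmax is applied in the
        # loss) and refined bounding-box offsets.
        self.cls_score = nn.Linear(fc_dim, num_classes + 1)
        self.bbox_pred = nn.Linear(fc_dim, 4 * (num_classes + 1))

    def forward(self, feature_map, rois, spatial_scale=1.0 / 16):
        # rois: (R, 5) tensor of (batch_index, x1, y1, x2, y2) in image
        # coordinates; spatial_scale maps them onto the conv feature map.
        pooled = roi_pool(feature_map, rois, self.pool_size, spatial_scale)
        x = self.fc(pooled.flatten(start_dim=1))
        return self.cls_score(x), self.bbox_pred(x)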
The multitask loss L is defined as follows to jointly
train classification and bounding-box regression:
L(p, u, t^u, v) = L_cls(p, u) + λ[u ≥ 1] L_loc(t^u, v)    (1)
where L_cls(p, u) = −log p_u calculates the log loss for the
ground-truth class u, and p_u is taken from the discrete
probability distribution p = (p_0, ..., p_C) over the C + 1
outputs of the last FC layer. L_loc(t^u, v) is defined over the
predicted offsets t^u = (t^u_x, t^u_y, t^u_w, t^u_h) and the
ground-truth bounding-box regression targets v = (v_x, v_y,
v_w, v_h), where x, y, w, and h denote the two coordinates of
the box center, the width, and the height, respectively. Each
t^u adopts the parameter settings in [15] to specify an object
proposal with a scale-invariant translation and a log-space
height/width shift. The Iverson bracket indicator function
[u ≥ 1] is employed to omit all background RoIs. To provide
more robustness against outliers and eliminate sensitivity to
exploding gradients, a smooth L_1 loss is adopted to fit the
bounding-box regressors as follows:
L_loc(t^u, v) = Σ_{i ∈ {x, y, w, h}} smooth_L1(t^u_i − v_i)    (2)
where
smooth_L1(x) = 0.5x^2 if |x| < 1, and |x| − 0.5 otherwise.    (3)
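A minimal sketch of this multitask loss, assuming PyTorch, is shown below; the class-specific selection of offsets and all helper names are illustrative, not the reference implementation of [16]:

# Sketch of the multitask loss in (1)-(3).
import torch
import torch.nn.functional as F

def smooth_l1(x):
    # Equation (3): 0.5 * x^2 if |x| < 1, else |x| - 0.5.
    abs_x = x.abs()
    return torch.where(abs_x < 1, 0.5 * x ** 2, abs_x - 0.5)

def fast_rcnn_loss(class_scores, box_offsets, labels, box_targets, lam=1.0):
    # class_scores: (R, C + 1) raw scores; box_offsets: (R, 4 * (C + 1));
    # labels: (R,) ground-truth class u per RoI (0 = background);
    # box_targets: (R, 4) regression targets v.
    # L_cls: log loss -log p_u over the softmax of the (C + 1) outputs.
    cls_loss = F.cross_entropy(class_scores, labels)
    # [u >= 1]: background RoIs contribute no localization loss.
    fg = labels >= 1
    if fg.any():
        # Select the four offsets t^u that correspond to each RoI's true class u.
        idx = labels[fg].unsqueeze(1) * 4 + torch.arange(4)
        t_u = box_offsets[fg].gather(1, idx)
        loc_loss = smooth_l1(t_u - box_targets[fg]).sum(dim=1).mean()
    else:
        loc_loss = box_offsets.sum() * 0.0
    return cls_loss + lam * loc_loss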
To accelerate the pipeline of Fast R-CNN, two further
tricks are necessary. On the one hand, if training sam-
ples (i.e., RoIs) come from different images, backpropagation
through the SPP layer becomes highly inefficient. Fast R-CNN
samples minibatches hierarchically, namely, N images sam-
pled randomly at first and then R/N RoIs sampled in each
image, where R represents the number of RoIs. Critically,
computation and memory are shared by RoIs from the same
image in the forward and backward pass. On the other hand,
much time is spent in computing the FC layers during the
forward pass [16]. The truncated singular value decomposition
(SVD) [91] can be utilized to compress large FC layers and
to accelerate the testing procedure.
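As an illustration of this compression, a rank-t truncated SVD can replace one large FC layer by two thinner ones; the sketch below assumes PyTorch, and the helper name and the chosen rank are hypothetical:

# Sketch of truncated-SVD compression of a trained FC layer, in the spirit of
# the test-time speed-up described in [16].
import torch
import torch.nn as nn

def compress_fc(fc, t):
    # Replace one Linear layer (weight W of shape out x in) by two smaller
    # layers using a rank-t truncated SVD: W ~= U_t diag(S_t) V_t^T.
    W = fc.weight.data                                   # (out_features, in_features)
    U, S, Vh = torch.linalg.svd(W, full_matrices=False)
    first = nn.Linear(fc.in_features, t, bias=False)
    first.weight.data = torch.diag(S[:t]) @ Vh[:t]       # (t, in_features)
    second = nn.Linear(t, fc.out_features, bias=(fc.bias is not None))
    second.weight.data = U[:, :t]                        # (out_features, t)
    if fc.bias is not None:
        second.bias.data = fc.bias.data
    # Parameter count drops from out * in to t * (out + in).
    return nn.Sequential(first, second)

fc = nn.Linear(4096, 4096)
compressed = compress_fc(fc, t=256)   # ~8x fewer multiply-adds for this layer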
In Fast R-CNN, except for region proposal genera-
tion, the training of all network layers can be processed in
a single stage with a multitask loss. It saves the additional