6 Li Liu et al.
The research community has started moving towards the challenging goal of building general purpose object detection systems whose ability to detect many object categories matches that of humans. This is a major challenge: according to cognitive scientists, human beings can identify around 3,000 entry-level categories and 30,000 visual categories overall, and the number of categories distinguishable with domain expertise may be on the order of 10^5 [14]. Despite the remarkable progress of the past years, designing an accurate, robust, efficient detection and recognition system that approaches human-level performance on 10^4–10^5 categories is undoubtedly an open problem.
3 Frameworks
There has been steady progress in object feature representations
and classifiers for recognition, as evidenced by the dramatic change
from handcrafted features [213, 42, 55, 76, 212] to learned DCNN
features [65, 160, 64, 175, 40].
In contrast, the basic “sliding window” strategy [42, 56, 55] for localization remains mainstream, despite some efforts in [113, 209]. However, the number of windows is large and grows quadratically with the number of pixels, and the need to search over multiple scales and aspect ratios further increases the search space. This huge search space results in high computational complexity; therefore, the design of an efficient and effective detection framework plays a key role. Commonly adopted strategies include cascading, sharing feature computation, and reducing per-window computation.
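To make the quadratic growth concrete, the number of candidate windows for a single image can be roughly estimated as follows. This is a toy calculation; the base size, stride, scales, and aspect ratios are illustrative assumptions, not values taken from any particular detector:

```python
# Toy estimate of the sliding-window search space for one image.
# base size, stride, scales, and aspect ratios are illustrative assumptions.

def count_windows(img_w, img_h, base=64, stride=8,
                  scales=(1.0, 1.5, 2.0), ratios=(0.5, 1.0, 2.0)):
    total = 0
    for s in scales:
        for r in ratios:
            # window width/height at this scale and aspect ratio
            w = int(base * s * r ** 0.5)
            h = int(base * s / r ** 0.5)
            if w > img_w or h > img_h:
                continue
            nx = (img_w - w) // stride + 1   # horizontal placements
            ny = (img_h - h) // stride + 1   # vertical placements
            total += nx * ny
    return total

small = count_windows(320, 240)
large = count_windows(640, 480)   # 4x the pixels
# window count grows roughly with pixel count (quadratically in image side)
print(small, large)
```

Quadrupling the pixel count multiplies the window count by roughly four (slightly more here, because border effects matter relatively less in the larger image), which is why per-window cost and window pruning dominate detector design.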
In this section, we review the milestone detection frameworks proposed for generic object detection since deep learning entered the field, as listed in Fig. 6 and summarized in Table 10. Nearly all detectors proposed over the last several years are based on one of these milestone detectors, attempting to improve on one or more aspects. Broadly, these detectors can be organized into two main categories:
A. Two-stage detection frameworks, which include a pre-processing step for region proposal, making the overall pipeline two-stage.
B. One-stage detection frameworks, or region-proposal-free frameworks, which do not have a separate region proposal step, making the overall pipeline single-stage.
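The two families can be contrasted schematically as follows. All function bodies are illustrative stubs introduced here for the sketch, not components of any actual detector:

```python
# Schematic contrast of the two detector families; all stubs are illustrative.

def propose_regions(image):
    # stand-in for a category-independent region proposal method
    return [(0, 0, 32, 32), (16, 16, 64, 64)]

def classify_and_refine(image, box):
    # stand-in for per-region classification plus box refinement
    return ("object", box, 0.9)

def dense_predictions(image):
    # stand-in for a single dense pass predicting classes and boxes directly
    return [("object", (0, 0, 32, 32), 0.8)]

def two_stage_detect(image):
    # A: separate proposal step, then per-region classification
    return [classify_and_refine(image, p) for p in propose_regions(image)]

def one_stage_detect(image):
    # B: no proposal step; one pass over the image yields the detections
    return dense_predictions(image)
```

The structural difference is simply whether detection is conditioned on an explicit, separately computed set of candidate regions.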
Section 4 will build on this review by discussing in greater detail the fundamental subproblems involved in the detection framework, including DCNN features, detection proposals, context modeling, bounding box regression, and class imbalance handling.
3.1 Region Based (Two Stage Framework)
In a region based framework, category-independent region propos-
als are generated from an image, CNN [109] features are extracted
from these regions, and then category-specific classifiers are used
to determine the category labels of the proposals. As can be ob-
served from Fig. 6, DetectorNet [198], OverFeat [183], MultiBox
[52] and RCNN [65] independently and almost simultaneously
proposed using CNNs for generic object detection.
RCNN: Inspired by the breakthrough image classification re-
sults obtained by CNN and the success of selective search in re-
gion proposal for hand-crafted features [209], Girshick et al. were
among the first to explore CNN for generic object detection and
developed RCNN [65, 67], which integrates AlexNet [109] with
the region proposal method selective search [209]. As illustrated in Fig. 7, training in the RCNN framework is a multistage pipeline:
1. Class-agnostic region proposals, which are candidate regions that might contain objects, are obtained via selective search [209];
2. Region proposals, which are cropped from the image and warped to the same size, are used as the input for finetuning a CNN model pre-trained on a large-scale dataset such as ImageNet;
3. A set of class-specific linear SVM classifiers are trained on fixed-length features extracted with the CNN, replacing the softmax classifier learned by finetuning;
4. Bounding box regression is learned for each object class with the CNN features.
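At test time, the components trained in the steps above are applied per proposal. The following sketch shows that flow with stand-in parts; `selective_search`, the feature extractor, the per-class scorers, and the regressor here are placeholders, not the actual RCNN code:

```python
# Schematic RCNN test-time pipeline; every component is a stand-in stub.
import random

random.seed(0)

def selective_search(image):
    """Stand-in for class-agnostic region proposals (step 1)."""
    w, h = image["size"]
    return [(random.randint(0, w // 2), random.randint(0, h // 2), 64, 64)
            for _ in range(5)]

def warp(image, box, size=(224, 224)):
    """Crop each proposal and warp it to a fixed input size (step 2)."""
    return {"box": box, "input_size": size}

def cnn_features(patch):
    """Stand-in for the finetuned CNN producing a fixed-length feature."""
    return [random.random() for _ in range(10)]

def svm_score(features, cls):
    """Stand-in for a per-class linear SVM (step 3)."""
    return sum(features) / len(features)

def bbox_regress(features, box):
    """Stand-in for class-specific bounding box regression (step 4)."""
    x, y, w, h = box
    return (x + 1, y + 1, w, h)   # nudges the box; real RCNN predicts offsets

def rcnn_detect(image, classes, threshold=0.4):
    detections = []
    for box in selective_search(image):
        feats = cnn_features(warp(image, box))   # one CNN pass PER proposal
        for cls in classes:
            score = svm_score(feats, cls)
            if score > threshold:
                detections.append((cls, bbox_regress(feats, box), score))
    return detections

dets = rcnn_detect({"size": (640, 480)}, classes=["person", "car"])
```

Note that `cnn_features` is called once per proposal; this per-proposal feature extraction is exactly the cost that SPPnet and Fast RCNN later remove.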
In spite of achieving high object detection quality, RCNN has no-
table drawbacks [64]:
1. Training is a multistage complex pipeline, which is inelegant,
slow and hard to optimize because each individual stage must
be trained separately.
2. Numerous region proposals, which provide only rough localization, need to be externally generated.
3. Training the SVM classifiers and bounding box regressors is expensive in both disk space and time, since CNN features are extracted independently from each region proposal in each image, posing great challenges for large-scale detection, especially with deep networks such as AlexNet [109] and VGG [191].
4. Testing is slow, since CNN features are extracted per object
proposal in each testing image.
SPPNet: During testing, CNN feature extraction is the main bottleneck of the RCNN detection pipeline, which requires extracting CNN features from thousands of warped region proposals per image. Noticing these obvious disadvantages, He et al. [77] introduced traditional spatial pyramid pooling (SPP) [68, 114] into CNN architectures. Since convolutional layers accept inputs of arbitrary size, the requirement of fixed-size input images in CNNs comes only from the Fully Connected (FC) layers. Based on this observation, He et al. added an SPP layer on top of the last convolutional (CONV) layer to obtain fixed-length features for the FC layers. With this SPPnet, RCNN obtains a significant speedup without sacrificing any detection quality, because the convolutional layers need to be run only once on the entire test image to generate fixed-length features for region proposals of arbitrary size. While SPPnet accelerates RCNN evaluation by orders of magnitude, it does not result in a comparable speedup of detector training. Moreover, finetuning in SPPnet [77] is unable to update the convolutional layers before the SPP layer, which limits the accuracy of very deep networks.
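The key idea of the SPP layer, pooling an arbitrary-sized CONV feature map into a fixed-length vector, can be sketched as follows. This is a minimal single-channel version under assumed pyramid levels {1, 2, 4}; the real layer applies the same binning per channel:

```python
# Minimal sketch of spatial pyramid pooling for one feature-map channel.
# Pyramid levels (1, 2, 4) are an assumed configuration for illustration.

def spp(feature_map, levels=(1, 2, 4)):
    """Max-pool an HxW map into a fixed-length vector, for any H and W."""
    H = len(feature_map)
    W = len(feature_map[0])
    pooled = []
    for n in levels:                      # an n x n grid of bins per level
        for i in range(n):
            for j in range(n):
                # bin boundaries: cover the whole map, never an empty bin
                r0, r1 = (i * H) // n, max((i + 1) * H // n, i * H // n + 1)
                c0, c1 = (j * W) // n, max((j + 1) * W // n, j * W // n + 1)
                pooled.append(max(feature_map[r][c]
                                  for r in range(r0, r1)
                                  for c in range(c0, c1)))
    return pooled   # length = sum(n * n for n in levels) = 21 here

# Feature maps of different sizes yield the same output length:
small = [[float(r * c) for c in range(5)] for r in range(7)]
large = [[float(r + c) for c in range(13)] for r in range(9)]
assert len(spp(small)) == len(spp(large)) == 1 + 4 + 16
```

Because the output length depends only on the pyramid levels, the FC layers see a fixed-size input regardless of the region's original size, which is what lets the CONV layers run once per image instead of once per proposal.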
Fast RCNN: Girshick [64] proposed Fast RCNN that addresses
some of the disadvantages of RCNN and SPPnet, while improv-
ing on their detection speed and quality. As illustrated in Fig. 8,
Fast RCNN enables end-to-end detector training (when ignoring