Rich feature hierarchies for accurate object detection and semantic segmentation
Tech report
Ross Girshick¹    Jeff Donahue¹,²    Trevor Darrell¹,²    Jitendra Malik¹
¹UC Berkeley and ²ICSI
{rbg,jdonahue,trevor,malik}@eecs.berkeley.edu
Abstract
Can a large convolutional neural network trained for whole-image classification on ImageNet be coaxed into detecting objects in PASCAL? We show that the answer is yes, and that the resulting system is simple, scalable, and boosts mean average precision, relative to the venerable deformable part model, by more than 40% (achieving a final mAP of 48% on VOC 2007). Our framework combines powerful computer vision techniques for generating bottom-up region proposals with recent advances in learning high-capacity convolutional neural networks. We call the resulting system R-CNN: Regions with CNN features. The same framework is also competitive with state-of-the-art semantic segmentation methods, demonstrating its flexibility. Beyond these results, we execute a battery of experiments that provide insight into what the network learns to represent, revealing a rich hierarchy of discriminative and often semantically meaningful features.
1. Introduction
Image features are the engine of recognition. Better features immediately propel a wide array of computer vision techniques forward. The last feature revolution was, arguably, established through the introduction of SIFT [30] and then HOG [7]. Nearly all modern object detection and semantic segmentation systems (e.g., [5, 17]) are built on top of one, or both, of these low-level features, a testament to their effectiveness.
Yet, the hypothesis that SIFT and HOG are now bottlenecks throttling recognition performance has emerged over the last few years. This hypothesis is grounded, for example, in the wide range of papers that attempt to boost detection accuracy along four axes: (1) rich structured models [20, 42]; (2) multiple feature learning [38, 41]; (3) learned histogram-based features [11, 29, 32]; or (4) unsupervised feature learning [34].
[Figure 1 graphic: the "R-CNN: Regions with CNN features" pipeline: (1) input image, (2) extract region proposals (~2k), (3) warped region fed to a CNN to compute features, (4) classify regions with per-class scores (aeroplane? no. ... person? yes. tvmonitor? no.)]
Figure 1: Object detection system overview. Our system (1) takes an input image, (2) extracts around 2000 bottom-up region proposals, (3) computes features for each proposal using a large convolutional neural network (CNN), and then (4) classifies each region using class-specific linear SVMs. This system achieves a mean average precision (mAP) of 43.5% on PASCAL VOC 2010. For comparison, [36] reports a mAP of 35.1% using the same region proposals, but with a spatial pyramid and bag-of-visual-words approach. Deformable part models [19] perform at 29.6%.
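To make the caption's four stages concrete, the sketch below traces one image through an R-CNN-style pipeline. It is a minimal illustration under stated assumptions, not the paper's implementation: every callable passed in (propose_regions, warp_to_cnn_input, cnn_features, and the per-class SVM scorers) is a hypothetical stand-in for the component the caption names.

    def detect(image, propose_regions, warp_to_cnn_input, cnn_features, svms):
        """Trace one image through the four stages of Figure 1.

        image:              H x W x 3 array (stage 1).
        propose_regions:    image -> ~2000 (x1, y1, x2, y2) boxes (stage 2).
        warp_to_cnn_input:  cropped region -> fixed-size CNN input (stage 3).
        cnn_features:       warped region -> feature vector (stage 3).
        svms:               {class_name: feats -> float}, one trained
                            linear SVM scorer per class (stage 4).
        """
        detections = []
        # Stage 2: extract bottom-up region proposals.
        for (x1, y1, x2, y2) in propose_regions(image):
            # Stage 3: warp the proposal to the CNN's fixed input size
            # and compute its feature vector.
            feats = cnn_features(warp_to_cnn_input(image[y1:y2, x1:x2]))
            # Stage 4: score the region with each class-specific linear
            # SVM and keep positively scored regions as detections.
            for cls, score_fn in svms.items():
                score = score_fn(feats)
                if score > 0:
                    detections.append((cls, (x1, y1, x2, y2), score))
        return detections

A full detector would add post-processing (e.g., suppressing overlapping boxes for the same class); the sketch keeps only the four captioned stages.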
The PASCAL Visual Object Classes (VOC) Challenge serves as the main benchmark for assessing object detector performance [15]. The 2010 and 2011 challenges were won by combining multiple types of features and making extensive use of context from ensembles of object detectors and scene classifiers. Using multiple features improved mean average precision (mAP) by at most 10% (relative), with diminishing returns for each additional feature. In the final year of the challenge (2012), systems performed no better than in the previous year. This plateau suggests that current methods may be limited by the available features. Here, we instead take a supervised feature learning approach. Figure 1 overviews our method and highlights some of our results.
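For context, a "relative" mAP improvement here means the gain divided by the baseline. As a worked example using the VOC 2010 numbers from the caption of Figure 1 (not a new result):

    (43.5 - 29.6) / 29.6 ≈ 0.47,

i.e., roughly a 47% relative improvement over deformable part models, of the same order as the more-than-40% gain the abstract quotes for VOC 2007.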
At the same time, researchers working on a broad array of "deep learning" methods were making steady progress on improving whole-image classification. (See Bengio et al. [3] for an excellent survey.) However, until recently these results were isolated to datasets such as CIFAR [25] and MNIST [28], slowing their adoption by computer vision researchers for use on other tasks and image domains.
Then, Krizhevsky et al. [26] rekindled broader interest in convolutional neural networks (CNNs) [27, 28] by showing substantially lower error rates on the 2012 ImageNet Large Scale Visual Recognition Challenge (ILSVRC) [9, 10]. The significance of their result was vigorously debated during the ILSVRC 2012 workshop.