3. Background
This section provides a brief introduction to Deformable R-FCN [6], which is used in R-FCN-3000. In R-FCN [5], atrous convolution [4] is used in the conv5 layer to increase the resolution of the feature map while still utilizing the pre-trained weights from the ImageNet classification network. In Deformable-R-FCN [6], the atrous convolution is replaced by a deformable convolution structure, in which a separate branch predicts offsets for each pixel in the feature map, and the convolution kernel is applied after the offsets have been applied to the feature map. A region proposal network (RPN), a two-layer CNN on top of the conv4 features, is used for generating object proposals. Efficiently implemented local convolutions, referred to as position-sensitive filters, are used to classify these proposals.
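To make the offset-prediction branch concrete, the following is a minimal sketch of a deformable convolution block, assuming PyTorch with torchvision's DeformConv2d; the module and variable names are ours for illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn
from torchvision.ops import DeformConv2d

class DeformableConvBlock(nn.Module):
    def __init__(self, in_ch, out_ch, k=3):
        super().__init__()
        # A separate branch predicts a (2 * k * k)-channel offset map:
        # one (dy, dx) pair per kernel location at every pixel.
        self.offset_branch = nn.Conv2d(in_ch, 2 * k * k,
                                       kernel_size=k, padding=k // 2)
        # The convolution then samples the input at the offset locations
        # instead of on the regular grid.
        self.deform_conv = DeformConv2d(in_ch, out_ch,
                                        kernel_size=k, padding=k // 2)

    def forward(self, x):
        offsets = self.offset_branch(x)
        return self.deform_conv(x, offsets)

# Usage on a conv5-like feature map (batch 1, 1024 channels, 38x50 grid).
feat = torch.randn(1, 1024, 38, 50)
out = DeformableConvBlock(1024, 1024)(feat)
```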
4. Large Scale Fully-Convolutional Detector
This section describes the process of training a large-scale object detector. We first explain the training-data requirements, followed by a discussion of the challenges involved in training such a system: design decisions for making training and inference efficient, appropriate loss functions for a large number of classes, and mitigating the domain shift that arises when training on classification data.
4.1. Weakly Supervised vs. Supervised?
Obtaining an annotated dataset covering thousands of classes is a major challenge for large-scale detection. Ideally, a system that can learn to detect object instances using partial image-level tags (class labels) for the objects present in training images would be preferable, because large-scale training data is readily available on the internet in this format. Since the setting with partial annotations is very challenging, it is commonly assumed that labels are available for all the objects present in the image. This is referred to as the weakly supervised setting. Unfortunately, explicit boundaries of objects, or at least bounding-boxes, are required as the supervision signal for training accurate object detectors. This is the supervised setting. The performance gap between supervised and weakly supervised detectors is large: even supervised detectors from 2015 [15] outperformed recent weakly supervised detectors [8] by 40% on the PASCAL VOC 2007 dataset. This gap is a direct result of the insufficient learning signal coming from weak supervision, and can be explained with the help of an example. For classifying a dog among 1000 categories, body texture or facial features alone may be sufficient, so the network need not learn the visual properties of its tail or legs for correct classification. It may therefore never learn that legs or a tail are parts of the dog category, which is essential for obtaining accurate boundaries.
On one hand, the huge cost of annotating bounding boxes for thousands of classes under settings similar to popular detection datasets such as PASCAL or COCO makes it prohibitively expensive to collect and annotate a large-scale detection dataset. On the other hand, the poor performance of weakly supervised detectors impedes their deployment in real-life applications. Therefore, we ask: is there a middle ground that can alleviate the cost of annotation while yielding accurate detectors? Fortunately, the ImageNet database contains around 1-2 objects per image; therefore, annotating bounding boxes for the objects costs only a few seconds, compared to several minutes in COCO [24]. It is for this reason that bounding boxes were also collected while annotating ImageNet! A potential downside of using ImageNet for training object detectors is the loss of the variation in scale and context around objects that detection datasets provide, but we do have access to the bounding-boxes of the objects. Therefore, a natural question to ask is: how would an object detector perform on "detection" datasets if it were trained on classification datasets with bounding-box supervision? We show that careful design choices with respect to the CNN architecture, the loss function, and the training protocol can yield a large-scale detector, trained on the ImageNet classification set, with significantly better accuracy than weakly supervised detectors.
4.2. Super-class Discovery
Fully convolutional object detectors learn class-specific filters based on scale & aspect-ratio [23], or in the form of position-sensitive filters [5, 6], for each class. Therefore, when the number of classes becomes large, it becomes computationally infeasible to apply these detectors. Hence, we ask: is it necessary to have a set of filters for each class, or can they be shared across visually similar classes? In the extreme case, can detection be performed using just a foreground/background detector and a classification network?
To obtain sets of object-classes across which position-sensitive filters can be shared, we group classes by visual appearance. We obtain the $j^{th}$ object-class representation, $x_j$, by averaging the 2048-dimensional feature vectors ($x_j^i$) from the final layer of ResNet-101 over all the samples belonging to the $j^{th}$ object-class in the ImageNet classification dataset (validation set). Super-classes are then obtained by applying K-means clustering on $\{x_j : j \in \{1, 2, \ldots, C\}\}$, where $C$ is the number of object-classes, to obtain $K$ super-class clusters.
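As a concrete illustration, the sketch below computes the per-class representations by averaging pre-extracted features and clusters them with K-means; it assumes the 2048-dimensional ResNet-101 features are already extracted, and the function and argument names are ours, not the authors'.

```python
import numpy as np
from sklearn.cluster import KMeans

def discover_super_classes(features, labels, num_classes, K):
    """features: (N, 2048) final-layer ResNet-101 features for N samples,
    labels: (N,) object-class indices in [0, num_classes).
    Returns a (num_classes,) array mapping each class to a super-class."""
    # Average the feature vectors of all samples of class j to get x_j.
    reps = np.stack([features[labels == j].mean(axis=0)
                     for j in range(num_classes)])
    # K-means over {x_j} yields K super-class clusters.
    return KMeans(n_clusters=K, n_init=10).fit_predict(reps)
```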
4.3. Architecture
First, an RPN is used for generating proposals, as in [6]. Let the set of individual object-classes on which the detector is trained be $\mathcal{C}$, with $|\mathcal{C}| = C$, and the set of super-classes (SC) be $\mathcal{K}$, with $|\mathcal{K}| = K$. For each super-class $k$, suppose we