main (S) and the test data space as the target domain (T).
We assume that an annotated training image dataset in S
is supplied, but that only images in T are given (i.e. there
are no labels in T ). Our framework, visualized in Fig. 1,
consists of three main phases:
1. Object proposal mining: A standard Faster R-CNN,
trained on the source domain, is used to detect objects
in the target domain. The detected objects form a pro-
posal set in T .
2. Image classification training: Given the images ex-
tracted from bounding boxes in S, we train an image
classification model that predicts the class of objects
in each image. The resulting classifier is used to score
the proposed bounding boxes in T . This model aids in
training the robust object detection model in the next
phase. The reason for introducing image classification
is that i) this model may rely on representations differ-
ent than those used by the phase one detection model
(e.g., motion features) or it may use a more sophisti-
cated network architectures, and ii) this model can be
trained in a semi-supervised fashion using labeled im-
ages in S and unlabeled images in T .
3. Robust object detection training: In this phase, a robust object detection model is trained using object bounding boxes in S and object proposals in T (from phase one) that have been rescored using the image classification model (from phase two); a code sketch of the full pipeline follows this list.
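The following Python sketch traces the data flow through the three phases. The helpers `train_faster_rcnn`, `train_classifier`, `train_robust_detector`, and `crop`, as well as the attribute names `y_c`/`y_l`, are hypothetical placeholders for the components described above, not our actual implementation.

```python
# Sketch of the three-phase framework; all helper functions are
# hypothetical placeholders for the components described in the text.

def adapt_detector(source_images, source_boxes, target_images):
    # Phase 1: object proposal mining. Train a standard Faster R-CNN on
    # the labeled source domain and detect objects in the target domain;
    # the detections form the proposal set in T.
    detector = train_faster_rcnn(source_images, source_boxes)
    proposals = [detector.detect(x) for x in target_images]

    # Phase 2: image classification training. Train a classifier on crops
    # taken from source bounding boxes (optionally semi-supervised with
    # unlabeled target crops), then rescore every mined target proposal.
    src_crops = [crop(x, y.y_l) for x, y in zip(source_images, source_boxes)]
    src_labels = [y.y_c for y in source_boxes]
    classifier = train_classifier(src_crops, src_labels)
    scores = [[classifier.predict(crop(x, p.y_l)) for p in props]
              for x, props in zip(target_images, proposals)]

    # Phase 3: robust object detection training. Retrain a detector on
    # clean source labels plus the rescored, noisy target proposals.
    return train_robust_detector(source_images, source_boxes,
                                 target_images, proposals, scores)
```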
We organize the detailed method description as follows.
Firstly, we introduce background notation and provide a de-
scription of Faster R-CNN in Sec. 3.1 to define the model
used in phase one. Secondly, a probabilistic view of Faster
R-CNN in Sec. 3.2 provides a foundation for the robust ob-
ject detection framework presented in Sec. 3.3. This defines
the model used in phase three. Lastly, the image classifica-
tion model used in phase two is discussed in Sec. 3.4.
Notation: We are given training images in $S$ along with their object bounding box labels. This training set is denoted by $\mathcal{D}_S = \{(\boldsymbol{x}^{(s)}, \boldsymbol{y}^{(s)})\}$, where $\boldsymbol{x}^{(s)} \in S$ represents an image, $\boldsymbol{y}^{(s)}$ is the corresponding bounding box label for $\boldsymbol{x}^{(s)}$, and $s$ is an index. Each bounding box $\boldsymbol{y} = (y_c, \boldsymbol{y}_l)$ represents a class label by an integer, $y_c \in \mathcal{Y} = \{1, 2, \dots, C\}$, where $C$ is the number of foreground classes, and a 4-tuple, $\boldsymbol{y}_l \in \mathbb{R}^4$, giving the coordinates of the top left corner, height, and width of the box. To simplify notation, we associate each image with a single bounding box.²
In the target domain, images are given without bounding box annotations. At the end of phase one, we augment this dataset with proposed bounding boxes generated by Faster R-CNN. We denote the resulting set by $\mathcal{D}_T = \{(\boldsymbol{x}^{(t)}, \tilde{\boldsymbol{y}}^{(t)})\}$, where $\boldsymbol{x}^{(t)} \in T$ is an image, $\tilde{\boldsymbol{y}}^{(t)}$ is the corresponding proposed bounding box, and $t$ is an index. Finally, for each instance in $\mathcal{D}_T$, we obtain the image classification score produced at the end of phase two, $p_{\text{img}}(y_c \,|\, \boldsymbol{x}, \tilde{\boldsymbol{y}}_l)$, which represents the probability of assigning the image cropped to the bounding box $\tilde{\boldsymbol{y}}_l$ in $\boldsymbol{x}$ to the class $y_c \in \mathcal{Y} \cup \{0\}$, i.e., one of the foreground categories or background.

²This restriction is for notational convenience only. Our implementation makes no assumptions about the number of objects in each image.
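To make the notation concrete, one possible in-memory representation of elements of $\mathcal{D}_S$ and $\mathcal{D}_T$ is sketched below; the class and field names are our own illustration, not part of the method.

```python
from dataclasses import dataclass
from typing import List, Tuple

import numpy as np

@dataclass
class BoundingBox:
    """A label y = (y_c, y_l): a class index plus a location 4-tuple."""
    y_c: int                                # class label in {1, ..., C}
    y_l: Tuple[float, float, float, float]  # (top, left, height, width)

@dataclass
class SourceExample:
    """An element of D_S: a source image with its ground-truth box."""
    image: np.ndarray
    box: BoundingBox

@dataclass
class TargetExample:
    """An element of D_T after phases one and two: a target image, a
    proposed box mined by Faster R-CNN, and the phase-two scores
    p_img(y_c | x, y_l) over Y ∪ {0} (index 0 is background)."""
    image: np.ndarray
    proposed_box: BoundingBox
    p_img: List[float]  # length C + 1, sums to 1
```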
3.1. Faster R-CNN
Faster R-CNN [45] is a two-stage detector consisting of two main components: a region proposal network (RPN) that proposes regions of interest (ROIs) for object detection, and an ROI classifier that predicts object labels for the proposed bounding boxes. These two components share the first convolutional layers. Given an input image, the shared layers extract a feature map for the image. In the first stage, the RPN predicts the probability that each of a set of predefined anchor boxes is an object or background, along with refinements to their sizes and locations. The anchor boxes are a fixed predefined set of boxes with varying positions, sizes, and aspect ratios across the image. Similar to the RPN, the region classifier predicts object labels for the ROIs proposed by the RPN, as well as refinements to the location and size of the boxes. Features passed to the classifier are obtained with an ROI-pooling layer. Both networks are trained jointly by minimizing a loss function:
$$L = L_{RPN} + L_{ROI}. \qquad (1)$$
$L_{RPN}$ and $L_{ROI}$ represent the losses used for the RPN and the ROI classifier, respectively. Each consists of a cross-entropy cost measuring the misclassification error and a regression loss quantifying the localization error. The RPN is trained to detect and localize objects without regard to their classes, and the ROI classification network is trained to classify the object labels.
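A minimal sketch of Eq. 1 in PyTorch, assuming smooth-L1 for the regression terms (the actual Faster R-CNN loss also samples anchors/ROIs and normalizes each term):

```python
import torch.nn.functional as F

def faster_rcnn_loss(rpn_obj_logits, rpn_obj_labels,
                     rpn_box_pred, rpn_box_target,
                     roi_cls_logits, roi_cls_labels,
                     roi_box_pred, roi_box_target):
    """Simplified form of Eq. 1: L = L_RPN + L_ROI, where each term is a
    cross-entropy classification cost plus a box-regression cost."""
    # L_RPN: binary object-vs-background classification over anchors,
    # plus regression of the anchor refinements.
    l_rpn = (F.cross_entropy(rpn_obj_logits, rpn_obj_labels)
             + F.smooth_l1_loss(rpn_box_pred, rpn_box_target))
    # L_ROI: (C + 1)-way classification of the proposed ROIs, plus
    # regression of their location/size refinements.
    l_roi = (F.cross_entropy(roi_cls_logits, roi_cls_labels)
             + F.smooth_l1_loss(roi_box_pred, roi_box_target))
    return l_rpn + l_roi
```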
3.2. A Probabilistic View of Faster R-CNN
In this section, we provide a probabilistic view of Faster
R-CNN that will be used to define a robust loss function for
noisy detection labels. The ROI classifier in Faster R-CNN
generates an object classification score and object location
for each proposed bounding box generated by the RPN. A classification prediction $p_{\text{cls}}(y_c \,|\, \boldsymbol{x}, \tilde{\boldsymbol{y}}_l)$ represents the probability of a categorical random variable taking one of the $C + 1$ disjoint classes (i.e., the foreground classes plus background). This classification distribution is modeled using a softmax activation. Similarly, we model the location prediction $p_{\text{loc}}(\boldsymbol{y}_l \,|\, \boldsymbol{x}, \tilde{\boldsymbol{y}}_l) = \mathcal{N}(\boldsymbol{y}_l; \bar{\boldsymbol{y}}_l, \sigma \boldsymbol{I})$ with a multivariate Normal distribution³ with mean $\bar{\boldsymbol{y}}_l$ and constant diagonal covariance matrix $\sigma \boldsymbol{I}$. In practice, only $\bar{\boldsymbol{y}}_l$ is generated by the ROI classifier and is used to localize the object.
³This assumption follows naturally if the $L_2$-norm is used for the localization error in Eq. 1. In practice, however, a combination of $L_2$ and $L_1$ norms is used, which does not correspond to a simple probabilistic output.
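The two distributions can be written directly in code; a minimal sketch, assuming $\sigma$ is the constant per-coordinate variance (so the per-coordinate standard deviation is $\sqrt{\sigma}$):

```python
import torch
import torch.nn.functional as F

def roi_distributions(cls_logits, loc_mean, y_l, sigma=1.0):
    """Probabilistic view of the ROI classifier outputs for one proposal.

    cls_logits: (C + 1,) raw class scores; softmax gives p_cls.
    loc_mean:   (4,) predicted box, the mean of the Normal.
    y_l:        (4,) candidate location at which to evaluate p_loc.
    sigma:      constant per-coordinate variance (covariance sigma * I).
    """
    # p_cls(y_c | x, y~_l): categorical distribution over C + 1 classes.
    p_cls = F.softmax(cls_logits, dim=-1)
    # p_loc(y_l | x, y~_l) = N(y_l; y_bar_l, sigma * I); with a diagonal
    # covariance the density factorizes, so the log-density is the sum of
    # per-coordinate Normal log-probs with std sqrt(sigma).
    normal = torch.distributions.Normal(loc_mean, sigma ** 0.5)
    log_p_loc = normal.log_prob(y_l).sum()
    return p_cls, log_p_loc
```

Faster R-CNN itself only produces $\bar{\boldsymbol{y}}_l$ and never evaluates this density; the point of the probabilistic view is to provide the foundation for the robust loss of Sec. 3.3.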