Domain Adaptive Faster R-CNN for Object Detection in the Wild

Yuhua Chen¹  Wen Li¹  Christos Sakaridis¹  Dengxin Dai¹  Luc Van Gool¹,²
¹Computer Vision Lab, ETH Zurich    ²VISICS, ESAT/PSI, KU Leuven
{yuhua.chen,liwen,csakarid,dai,vangool}@vision.ee.ethz.ch
Abstract
Object detection typically assumes that training and test
data are drawn from an identical distribution, which, how-
ever, does not always hold in practice. Such a distribution
mismatch will lead to a significant performance drop. In
this work, we aim to improve the cross-domain robustness of
object detection. We tackle the domain shift on two levels:
1) the image-level shift, such as image style, illumination,
etc., and 2) the instance-level shift, such as object appear-
ance, size, etc. We build our approach based on the recent
state-of-the-art Faster R-CNN model, and design two do-
main adaptation components, on image level and instance
level, to reduce the domain discrepancy. The two domain
adaptation components are based on H-divergence theory,
and are implemented by learning a domain classifier in an
adversarial training manner. The domain classifiers on dif-
ferent levels are further reinforced with a consistency regu-
larization to learn a domain-invariant region proposal net-
work (RPN) in the Faster R-CNN model. We evaluate our
newly proposed approach using multiple datasets including
Cityscapes, KITTI, SIM10K, etc. The results demonstrate
the effectiveness of our proposed approach for robust ob-
ject detection in various domain shift scenarios.
1. Introduction
Object detection is a fundamental problem in computer
vision. It aims at identifying and localizing all object in-
stances of certain categories in an image. Driven by the
surge of deep convolutional networks (CNNs) [32], many
CNN-based object detection approaches have been pro-
posed, drastically improving performance [21, 51, 20, 8, 19,
39].
While excellent performance has been achieved on the
benchmark datasets [12, 37], object detection in the real
world still faces challenges from the large variance in view-
points, object appearance, backgrounds, illumination, im-
age quality, etc., which may cause a considerable domain
shift between the training and test data. Taking autonomous
Figure 1. Illustration of different datasets for autonomous driv-
ing: from top to bottom-right, example images are taken from
KITTI [17], Cityscapes [5], Foggy Cityscapes [49], and SIM10K [30].
Though all datasets cover urban scenes, images in those datasets
vary in style, resolution, illumination, object size, etc. The visual
difference between those datasets presents a challenge for apply-
ing an object detection model learned from one domain to an-
other.
driving as an example, the camera type and setup used in a
particular car might differ from those used to collect train-
ing data, and the car might be in a different city where
the appearance of objects is different. Moreover, the au-
tonomous driving system is expected to work reliably under
different weather conditions (e.g. in rain and fog), while the
training data is usually collected in dry weather with better
visibility. The recent trend of using synthetic data for train-
ing deep CNN models presents a similar challenge due to
the visual mismatch with reality. Several datasets focusing
on autonomous driving are illustrated in Figure 1, where we
can observe a considerable domain shift.
Such domain shifts have been observed to cause sig-
nificant performance drop [23]. Although collecting more
training data could possibly alleviate the impact of domain
shift, it is non-trivial because annotating bounding boxes is
an expensive and time-consuming process. Therefore, it is
highly desirable to develop algorithms to adapt object de-
tection models to a new domain that is visually different
from the training domain.
In this paper, we address this cross-domain object detec-
tion problem. We consider the unsupervised domain adap-
tation scenario: full supervision is given in the source do-
main while no supervision is available in the target domain.
Thus, the improved object detection in the target domain
should be achieved at no additional annotation cost.
We build an end-to-end deep learning model based on
the state-of-the-art Faster R-CNN model [48], referred to
as Domain Adaptive Faster R-CNN. Based on the covari-
ate shift assumption, the domain shift could occur on im-
age level (e.g., image scale, image style, illumination, etc.)
and instance level (e.g., object appearance, size, etc.), which
motivates us to minimize the domain discrepancy on both
levels. To address the domain shift, we incorporate two do-
main adaptation components on image level and instance
level into the Faster R-CNN model to minimize the H-
divergence between two domains. In each component, we
train a domain classifier and employ the adversarial training
strategy to learn robust features that are domain-invariant.
We further incorporate a consistency regularization between
the domain classifiers on different levels to learn a domain-
invariant region proposal network (RPN) in the Faster R-
CNN model.
The contribution of this work can be summarized as
follows: 1) We provide a theoretical analysis of the do-
main shift problem for cross-domain object detection from
a probabilistic perspective. 2) We design two domain adap-
tation components to alleviate the domain discrepancy at
the image and instance levels, resp. 3) We further propose
a consistency regularization to encourage the RPN to be
domain-invariant. 4) We integrate the proposed components
into the Faster R-CNN model, and the resulting system can
be trained in an end-to-end manner.
We conduct extensive experiments to evaluate our Do-
main Adaptive Faster R-CNN using multiple datasets in-
cluding Cityscapes [5], KITTI [17], SIM10K [30], etc. The
experimental results clearly demonstrate the effectiveness
of our proposed approach for addressing the domain shift
of object detection in multiple scenarios with domain dis-
crepancies.
2. Related Work
Object Detection: Object detection dates back a long
time, resulting in a plethora of approaches. Classical
work [9, 13, 56] usually formulated object detection as a
sliding-window classification problem. The rise of deep
convolutional networks (CNNs) [32] and their successes in
object detection have led to a swift paradigm shift. Among
the large number of approaches
proposed [21, 51, 20, 19, 39, 8], region-based CNNs (R-
CNN) [21, 20, 60] have received significant attention due
to their effectiveness. This line of work was pioneered by
R-CNN [21], which extracts region proposals from the im-
age and trains a network to classify each region of in-
terest (ROI) independently. The idea has been extended
by [20, 26] to share the convolution feature map among
all ROIs. Faster R-CNN [48] produces object proposals
with a Region Proposal Network (RPN). It achieved state-
of-the-art results and laid the foundation for many follow-
up works [19, 39, 8, 36, 60]. Faster R-CNN is also highly
flexible and can be extended to other tasks, e.g. instance
segmentation [7]. However, those works focused on the
conventional setting without considering the domain adap-
tation issue for object detection in the wild. In this paper,
we choose Faster R-CNN as our base detector, and improve
its generalization ability for object detection in a new target
domain.
Domain Adaptation: Domain adaptation has been
widely studied for image classification in computer vi-
sion [10, 11, 33, 23, 22, 14, 52, 40, 15, 18, 50, 45, 43, 35].
Conventional methods include domain transfer multiple
kernel learning [10, 11], asymmetric metric learning [33],
subspace interpolation [23], geodesic flow kernel [22], sub-
space alignment [14], covariance matrix alignment [52, 57],
etc. Recent works aim to improve the domain adaptability
of deep neural networks, including [40, 15, 18, 50, 45, 43,
34, 24, 41, 42]. Different from those works, we focus on
the object detection problem, which is more challenging as
both object location and category need to be predicted.
A few recent works have also been proposed to perform
unpaired image translation between two sets of data, which
can be seen as pixel-level domain adaptation [62, 31, 59,
38]. However, producing realistic high-resolution images,
as required by real-world applications like autonomous
driving, remains challenging.
Domain Adaptation Beyond Classification: Compared
to the research in domain adaptation for classification, much
less attention has been paid to domain adaptation for other
computer vision tasks. Recently, some works have ad-
dressed tasks such as semantic segmentation [4, 27, 61],
fine-grained recognition [16], etc. For the task of detec-
tion, [58] proposed to mitigate the domain shift problem
of the deformable part-based model (DPM) by introducing
an adaptive SVM. A recent work [47] uses the R-CNN
model as a feature extractor and then aligns the features
with the subspace alignment method. There also exists work to
learn detectors from alternative sources, such as from im-
ages to videos [54], from 3D models [46, 53], or from syn-
thetic models [25]. Previous works either cannot be trained
in an end-to-end fashion, or focus on a specific case. In this
work, we build an end-to-end trainable model for object de-
tection, which is, to the best of our knowledge, the first of
its kind.
3. Preliminaries
3.1. Faster R-CNN
We briefly review the Faster R-CNN [48] model, which
is the baseline model used in this work. Faster R-CNN is
a two-stage detector mainly consisting of three major com-
ponents: shared bottom convolutional layers, a region pro-
posal network (RPN) and a region-of-interest (ROI) based
classifier. The architecture is illustrated in the left part of
Figure 2.
First, an input image is represented as a convolutional
feature map produced by the shared bottom convolutional
layers. Based on that feature map, RPN generates candi-
date object proposals, whereafter the ROI-wise classifier
predicts the category label from a feature vector obtained
using ROI-pooling. The training loss is composed of the
loss of the RPN and the loss of the ROI classifiers:
L_{det} = L_{rpn} + L_{roi}.    (1)
The training losses of both the RPN and the ROI classifier
contain two terms: a classification loss measuring how ac-
curate the predicted probability is, and a regression loss on
the box coordinates for better localization. Readers are re-
ferred to [48] for more details about the architecture and
the training procedure.
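To make Eq. (1) concrete, here is a minimal PyTorch sketch of the two-part detection loss; the tensor names and the use of plain cross-entropy plus smooth-L1 regression are our own simplifications, omitting the anchor sampling and loss normalization of the full Faster R-CNN.

import torch.nn.functional as F

def detection_loss(rpn_logits, rpn_labels, rpn_deltas, rpn_targets,
                   roi_logits, roi_labels, roi_deltas, roi_targets):
    # L_rpn: binary objectness classification + anchor box regression
    l_rpn = (F.cross_entropy(rpn_logits, rpn_labels)
             + F.smooth_l1_loss(rpn_deltas, rpn_targets))
    # L_roi: K-way category classification + ROI box regression
    l_roi = (F.cross_entropy(roi_logits, roi_labels)
             + F.smooth_l1_loss(roi_deltas, roi_targets))
    return l_rpn + l_roi  # eq. (1): L_det = L_rpn + L_roi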
3.2. Distribution Alignment with H-divergence
The H-divergence [1] is designed to measure the diver-
gence between two sets of samples with different distribu-
tions. Let us denote by x a feature vector. A source domain
sample is denoted as x_S and a target domain sample as x_T.
We also denote by h : x → {0, 1} a domain classifier, which
aims to predict source samples x_S to be 0 and target domain
samples x_T to be 1. Let H denote the set of possible domain
classifiers; the H-divergence defines the distance between
the two domains as

d_H(S, T) = 2 ( 1 − min_{h∈H} [ err_S(h(x)) + err_T(h(x)) ] ),
where err_S and err_T are the prediction errors of h(x) on
source and target domain samples, resp. The above defini-
tion implies that the domain distance d_H(S, T) is inversely
proportional to the error rate of the domain classifier h. In
other words, if the error is high for the best domain classi-
fier, the two domains are hard to distinguish, so they are
close to each other, and vice versa.
In deep neural networks, the feature vector x usually
comprises the activations after a certain layer. Let us de-
note by f the network that produces x. To align the two
domains, we therefore need to enforce the network f to
output feature vectors that minimize the domain distance
d_H(S, T) [15], which leads to:

min_f d_H(S, T) ⇔ max_f min_{h∈H} { err_S(h(x)) + err_T(h(x)) }.
This can be optimized in an adversarial training manner.
Ganin and Lempitsky [15] implemented a gradient reverse
layer (GRL), and integrated it into a CNN for image classi-
fication in the unsupervised domain adaptation scenario.
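For reference, a gradient reverse layer takes only a few lines of PyTorch; this is a sketch of the mechanism from [15], not code from the paper. The forward pass is the identity, and the backward pass negates (and optionally scales) the incoming gradient.

import torch

class GradReverse(torch.autograd.Function):
    # Identity in the forward pass; negated, scaled gradient in the backward pass.
    @staticmethod
    def forward(ctx, x, lambd):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        # Reversing the gradient turns the classifier's minimization into a
        # maximization from the feature extractor's point of view.
        return -ctx.lambd * grad_output, None

def grad_reverse(x, lambd=1.0):
    return GradReverse.apply(x, lambd)

Inserting grad_reverse between the feature network f and the domain classifier h realizes the max-min objective above with a single backward pass.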
4. Domain Adaptation for Object Detection
Following the common terminology in domain adapta-
tion, we refer to the domain of the training data as source
domain, denoted by S, and to the domain of the test data
as target domain, denoted by T . For instance, when using
the Cityscapes dataset for training and the KITTI dataset
for testing, S is the Cityscapes dataset and T represents the
KITTI dataset.
We also follow the classic setting of unsupervised do-
main adaptation, where we have access to images and full
supervision in the source domain (i.e., bounding box and
object categories), but only unlabeled images are available
for the target domain. Our task is to learn an object detec-
tion model adapted to the unlabeled target domain.
4.1. A Probabilistic Perspective
The object detection problem can be viewed as learn-
ing the posterior P (C, B|I), where I is the image repre-
sentation, B is the bounding-box of an object and C ∈
{1,...,K} the category of the object (K being the total
number of categories).
Let us denote the joint distribution of training samples
for object detection as P(C, B, I), and use P_S(C, B, I) and
P_T(C, B, I) to denote the source and target domain joint
distributions, resp. Note that we use P_T(C, B, I) here to an-
alyze the domain shift problem, although the bounding box
and category annotations (i.e., B and C) are unknown dur-
ing training. When there is a domain shift, P_S(C, B, I) ≠
P_T(C, B, I).
Image-Level Adaptation: Using Bayes' formula, the
joint distribution can be decomposed as:

P(C, B, I) = P(C, B|I) P(I).    (2)
Similar to the classification problem, we make the covariate
shift assumption for object detection, i.e., the conditional
probability P(C, B|I) is the same for the two domains, and
the domain distribution shift is caused by the difference in
the marginal distribution P(I). In other words, the detector
is consistent between the two domains: given an image, the
detection results should be the same regardless of which do-
main the image belongs to. In the Faster R-CNN model, the
image representation I is actually the feature map output of
the base convolutional layers. Therefore, to handle the do-
main shift problem, we should enforce the distributions of
the image representations from the two domains to be the
same (i.e., P_S(I) = P_T(I)), which is referred to as image-
level adaptation.
Instance-Level Adaptation: On the other hand, the
joint distribution can also be decomposed as:

P(C, B, I) = P(C|B, I) P(B, I).    (3)
Figure 2. An overview of our Domain Adaptive Faster R-CNN
model: we tackle the domain shift on two levels, the image level
and the instance level. A domain classifier is built on each level
and trained in an adversarial manner. A consistency regularizer is
imposed between the two classifiers to learn a domain-invariant
RPN for the Faster R-CNN model.

With the covariate shift assumption, i.e., that the condi-
tional probability P(C|B, I) is the same for the two do-
mains, the domain distribution shift stems from the differ-
ence in the marginal distribution P(B, I). Intuitively, this im-
plies the semantic consistency between two domains: given
the same image region containing an object, its category
labels should be the same regardless of which domain it
comes from. Therefore, we can also enforce the distribu-
tions of the instance representations from the two domains
to be the same (i.e., P_S(B, I) = P_T(B, I)). We refer to this
as instance-level alignment.
Here the instance representation (B, I) refers to the fea-
tures extracted from the image region in the ground-truth
bounding box of each instance. Although the bounding-box
annotation is unavailable for the target domain, we can ob-
tain it via P(B, I) = P(B|I) P(I), where P(B|I) is a bound-
ing box predictor (e.g., the RPN in Faster R-CNN). This
holds only when P(B|I) is domain-invariant, for which we
provide a solution below.
Joint Adaptation: Ideally, one could perform domain
alignment on either the image or the instance level. Since
P(B, I) = P(B|I) P(I) and the conditional distribution
P(B|I) is assumed to be identical and non-zero for the two
domains, we have:

P_S(I) = P_T(I) ⇔ P_S(B, I) = P_T(B, I).    (4)
In other words, if the distributions of the image-level rep-
resentations are identical for two domains, the distributions
of the instance-level representations are also identical, and
vice versa. Yet, it is generally non-trivial to perfectly estimate the
conditional distribution P(B|I). The reasons are two-fold:
1) in practice it may be hard to perfectly align the marginal
distributions P(I), which means the input for estimating
P(B|I) is somewhat biased, and 2) the bounding box anno-
tation is only available for the source domain training data,
so P(B|I) is learned using the source domain data only and
is thus easily biased toward the source domain.
To this end, we propose to perform domain distribution
alignment on both the image and instance levels, and to ap-
ply a consistency regularization to alleviate the bias in esti-
mating P(B|I). As introduced in Section 3.2, to align the
distributions of two domains, one needs to train a domain
classifier h(x). In the context of object detection, x can be
the image-level representation I or the instance-level repre-
sentation (B, I). From a probabilistic perspective, h(x) can
be seen as estimating the probability that a sample x belongs
to the target domain.
Thus, denoting the domain label by D, the image-level
domain classifier can be viewed as estimating P(D|I), and
the instance-level domain classifier as estimating P(D|B, I).
By Bayes' theorem, we obtain:

P(D|B, I) P(B|I) = P(B|D, I) P(D|I).    (5)

In particular, P(B|I) is a domain-invariant bounding box
predictor, and P(B|D, I) a domain-dependent one. Recall
that in practice we can only learn a domain-dependent
bounding box predictor P(B|D, I), since we have no bound-
ing box annotations for the target domain. Thus, by enforc-
ing consistency between the two domain classifiers, i.e.,
P(D|B, I) = P(D|I), we drive the learned P(B|D, I) to
approach the domain-invariant P(B|I).
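As a supplementary note (not in the original text), the identity in Eq. (5) and the consistency argument can be spelled out: both sides are the chain rule applied to the joint conditional P(B, D|I) in two different orders.

% Two factorizations of the same joint conditional:
P(B, D \mid I) = P(D \mid B, I)\, P(B \mid I) = P(B \mid D, I)\, P(D \mid I).
% If the two domain classifiers agree, P(D \mid B, I) = P(D \mid I) > 0,
% dividing both factorizations by this common factor yields
P(B \mid D, I) = P(B \mid I),
% i.e., the learnable domain-dependent predictor coincides with the
% domain-invariant one.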
4.2. Domain Adaptation Components
This section introduces two domain adaptation compo-
nents for the image and instance levels, used to align the
feature representation distributions on those two levels.
Image-Level Adaptation: In the Faster R-CNN model,
the image-level representation refers to the feature map out-
puts of the base convolutional layers (see the green paral-
lelogram in Figure 2). To eliminate the domain distribution
mismatch on the image level, we employ a patch-based do-
main classifier as shown in the lower right part of Figure 2.
In particular, we train a domain classifier on each activa-
tion from the feature map. Since the receptive field of each
activation corresponds to an image patch of the input image
I_i, the domain classifier in effect predicts the domain label
for each image patch.
The benefits of this choice are twofold: 1) aligning
image-level representations generally helps to reduce the
shift caused by the global image difference such as image
style, image scale, illumination, etc. A similar patch-based
loss has been shown to be effective in recent work on style trans-
fer [29], which also deals with the global transformation,
and 2) the batch size is usually very small for training an
object detection network, due to the use of high-resolution
input. This patch-based design is helpful to increase the
number of training samples for training the domain classi-
fier.
Let us denote by D_i the domain label of the i-th training
image, with D_i = 0 for the source domain and D_i = 1 for
the target domain. We denote by φ_{u,v}(I_i) the activation
located at (u, v) of the feature map of the i-th image after
the base convolutional layers. Denoting the output of the
domain classifier as p_i^{(u,v)} and using the cross-entropy
loss, the image-level adaptation loss can be written as:

L_{img} = − Σ_{i,u,v} [ D_i log p_i^{(u,v)} + (1 − D_i) log(1 − p_i^{(u,v)}) ].    (6)
As discussed in Section 3.2, to align the domain distri-
butions, we should simultaneously optimize the parameters
of the domain classifier to minimize the above domain clas-
sification loss, and also optimize the parameters of the base
network to maximize this loss. For the implementation we
use the gradient reverse layer (GRL) [15]: ordinary gradient
descent is applied to train the domain classifier, while the
sign of the gradient is reversed when it passes through the
GRL to optimize the base network.
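A minimal sketch of such a patch-based classifier and the loss of Eq. (6) is given below, reusing the grad_reverse helper from Section 3.2; the 1x1-conv head, the layer widths, and the 512-channel input (a VGG-16-style backbone) are our assumptions rather than details given in the text.

import torch
import torch.nn as nn
import torch.nn.functional as F

class ImageDomainClassifier(nn.Module):
    # Patch-based image-level domain classifier: 1x1 convolutions emit one
    # domain logit per activation (u, v) of the base feature map.
    def __init__(self, in_channels=512):  # 512 assumes a VGG-16-like backbone
        super().__init__()
        self.head = nn.Sequential(
            nn.Conv2d(in_channels, 256, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(256, 1, kernel_size=1),  # logit for p_i^(u,v)
        )

    def forward(self, feat):
        # The GRL trains the classifier normally and the backbone adversarially.
        return self.head(grad_reverse(feat, lambd=1.0))

def image_level_loss(logits, domain_label):
    # Eq. (6): cross-entropy over every activation of every image, with
    # D_i = 0 for source and D_i = 1 for target (averaged here, summed in (6)).
    target = torch.full_like(logits, float(domain_label))
    return F.binary_cross_entropy_with_logits(logits, target)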
Instance-Level Adaptation: The instance-level rep-
resentation refers to the ROI-based feature vectors before
feeding into the final category classifiers (i.e., the rectangles
after the “FC” layer in Figure 2). Aligning the instance-
level representations helps to reduce the local instance dif-
ference such as object appearance, size, viewpoint etc. Sim-
ilar to the image-level adaptation, we train a domain classi-
fier for the feature vectors to align the instance-level distri-
bution. Let us denote the output of the instance-level do-
main classifier for the j-th region proposal in the i-th image
as p_{i,j}. The instance-level adaptation loss can then be
written as:

L_{ins} = − Σ_{i,j} [ D_i log p_{i,j} + (1 − D_i) log(1 − p_{i,j}) ].    (7)
We also add a gradient reverse layer before the domain clas-
sifier to apply the adversarial training strategy.
Consistency Regularization: As analyzed in Sec-
tion 4.1, enforcing consistency between the domain classi-
fiers on different levels helps to improve the cross-domain
robustness of the bounding box predictor (i.e., the RPN in the Faster
R-CNN model). Therefore, we further impose a consis-
tency regularizer. Since the image-level domain classifier
produces an output for each activation of the image-level
representation I, we take the average over all activations in
the image as its image-level probability. The consistency
regularizer can be written as:

L_{cst} = Σ_{i,j} ‖ (1/|I|) Σ_{u,v} p_i^{(u,v)} − p_{i,j} ‖_2,    (8)

where |I| denotes the total number of activations in a feature
map, and ‖·‖_2 is the ℓ2 distance.
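In code, Eq. (8) amounts to comparing the image's mean patch-level domain probability with the probability of each of its ROIs; since both quantities are scalars, the ℓ2 distance per pair reduces to an absolute difference (a sketch under that reading):

import torch

def consistency_loss(img_logits, ins_logits):
    # Eq. (8): distance between the image-level probability averaged over
    # all activations (u, v) and each instance-level probability p_ij.
    p_img = torch.sigmoid(img_logits).mean()     # (1/|I|) * sum_{u,v} p_i^(u,v)
    p_ins = torch.sigmoid(ins_logits).flatten()  # p_ij for every proposal j
    return (p_img - p_ins).abs().sum()           # sum_j |p_img - p_ij|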
4.3. Network Overview
An overview of our network is shown in Figure 2. We
augment the Faster R-CNN base architecture with our do-
main adaptation components, which leads to our Domain
Adaptive Faster R-CNN model.
The left part of Figure 2 is the original Faster R-CNN
model. The bottom convolutional layers are shared between
all components. Then the RPN and ROI pooling layers are
built on top, followed by two fully connected layers to ex-
tract the instance-level features.
Three novel components are introduced in our Domain
Adaptive Faster R-CNN. The image-level domain classifier
is added after the last convolutional layer, and the instance-
level domain classifier is added to the end of the ROI-wise
features. The two classifiers are linked with a consistency
loss to encourage the RPN to be domain-invariant. The fi-
nal training loss of the proposed network is a summation of
each individual part, which can be written as:
L = L_{det} + λ (L_{img} + L_{ins} + L_{cst}),    (9)
where λ is a trade-off parameter to balance the Faster R-
CNN loss and our newly added domain adaptation compo-
nents. The network can be trained in an end-to-end manner
using a standard SGD algorithm. Note that the adversar-
ial training for domain adaptation components is achieved
by using the GRL layer, which automatically reverses the
gradient during propagation. The overall network in Fig-
ure 2 is used in the training phase. During inference, one
can remove the domain adaptation components, and sim-
ply use the original Faster R-CNN architecture with adapted
weights.
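Putting the pieces together, one SGD training step could look like the sketch below; `model` is a hypothetical wrapper returning the individual losses described above, and lambd=0.1 is a placeholder value rather than one quoted in this text.

def train_step(model, src_batch, tgt_batch, optimizer, lambd=0.1):
    # One SGD step of eq. (9). L_det uses only the labeled source batch;
    # the adaptation losses use both domains.
    optimizer.zero_grad()
    # Source images: supervised detection loss + adaptation losses (D_i = 0).
    l_det, l_img_s, l_ins_s, l_cst_s = model(src_batch, domain_label=0)
    # Target images: adaptation losses only (D_i = 1), no box supervision.
    _, l_img_t, l_ins_t, l_cst_t = model(tgt_batch, domain_label=1)
    loss = l_det + lambd * (l_img_s + l_img_t + l_ins_s + l_ins_t
                            + l_cst_s + l_cst_t)
    loss.backward()  # the GRLs reverse the adaptation gradients automatically
    optimizer.step()
    return loss.item()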
5. Experiments
5.1. Experiment Setup
We adopt the unsupervised domain adaptation protocol
in our experiments. The training data consists of two parts: