CVPR 2018 Selected Conference Papers

Domain Adaptive Faster R-CNN for Object Detection in the Wild

Yuhua Chen¹, Wen Li¹, Christos Sakaridis¹, Dengxin Dai¹, Luc Van Gool¹,²

¹Computer Vision Lab, ETH Zurich

²VISICS, ESAT/PSI, KU Leuven

{yuhua.chen,liwen,csakarid,dai,vangool}@vision.ee.ethz.ch

Abstract

Object detection typically assumes that training and test data are drawn from an identical distribution, which, however, does not always hold in practice. Such a distribution mismatch will lead to a significant performance drop. In this work, we aim to improve the cross-domain robustness of object detection. We tackle the domain shift on two levels: 1) the image-level shift, such as image style, illumination, etc., and 2) the instance-level shift, such as object appearance, size, etc. We build our approach on the recent state-of-the-art Faster R-CNN model, and design two domain adaptation components, on the image level and the instance level, to reduce the domain discrepancy. The two domain adaptation components are based on H-divergence theory, and are implemented by learning a domain classifier in an adversarial training manner. The domain classifiers on different levels are further reinforced with a consistency regularization to learn a domain-invariant region proposal network (RPN) in the Faster R-CNN model. We evaluate our newly proposed approach using multiple datasets including Cityscapes, KITTI, SIM10K, etc. The results demonstrate the effectiveness of our proposed approach for robust object detection in various domain shift scenarios.

1. Introduction

Object detection is a fundamental problem in computer vision. It aims at identifying and localizing all object instances of certain categories in an image. Driven by the surge of deep convolutional networks (CNNs) [32], many CNN-based object detection approaches have been proposed, drastically improving performance [21, 51, 20, 8, 19, 39].

While excellent performance has been achieved on the benchmark datasets [12, 37], object detection in the real world still faces challenges from the large variance in viewpoints, object appearance, backgrounds, illumination, image quality, etc., which may cause a considerable domain shift between the training and test data. Taking autonomous driving as an example, the camera type and setup used in a particular car might differ from those used to collect training data, and the car might be in a different city where the appearance of objects is different. Moreover, the autonomous driving system is expected to work reliably under different weather conditions (e.g., in rain and fog), while the training data is usually collected in dry weather with better visibility. The recent trend of using synthetic data for training deep CNN models presents a similar challenge due to the visual mismatch with reality. Several datasets focusing on autonomous driving are illustrated in Figure 1, where we can observe a considerable domain shift.

Figure 1. Illustration of different datasets for autonomous driving. From top to bottom-right, example images are taken from KITTI [17], Cityscapes [5], Foggy Cityscapes [49], and SIM10K [30]. Though all datasets cover urban scenes, images in those datasets vary in style, resolution, illumination, object size, etc. The visual difference between those datasets presents a challenge for applying an object detection model learned from one domain to another domain.

Such domain shifts have been observed to cause a significant performance drop [23]. Although collecting more training data could possibly alleviate the impact of domain shift, it is non-trivial because annotating bounding boxes is an expensive and time-consuming process. Therefore, it is highly desirable to develop algorithms to adapt object detection models to a new domain that is visually different from the training domain.

In this paper, we address this cross-domain object detection problem. We consider the unsupervised domain adaptation scenario: full supervision is given in the source domain while no supervision is available in the target domain. Thus, the improved object detection in the target domain should be achieved at no additional annotation cost.

We build an end-to-end deep learning model based on the state-of-the-art Faster R-CNN model [48], referred to as Domain Adaptive Faster R-CNN. Under the covariate shift assumption, the domain shift could occur on the image level (e.g., image scale, image style, illumination, etc.) and on the instance level (e.g., object appearance, size, etc.), which motivates us to minimize the domain discrepancy on both levels. To address the domain shift, we incorporate two domain adaptation components, on the image level and the instance level, into the Faster R-CNN model to minimize the H-divergence between the two domains. In each component, we train a domain classifier and employ the adversarial training strategy to learn robust features that are domain-invariant. We further incorporate a consistency regularization between the domain classifiers on different levels to learn a domain-invariant region proposal network (RPN) in the Faster R-CNN model.

The contributions of this work can be summarized as follows: 1) We provide a theoretical analysis of the domain shift problem for cross-domain object detection from a probabilistic perspective. 2) We design two domain adaptation components to alleviate the domain discrepancy at the image and instance levels, respectively. 3) We further propose a consistency regularization to encourage the RPN to be domain-invariant. 4) We integrate the proposed components into the Faster R-CNN model, and the resulting system can be trained in an end-to-end manner.

We conduct extensive experiments to evaluate our Domain Adaptive Faster R-CNN using multiple datasets including Cityscapes [5], KITTI [17], SIM10K [30], etc. The experimental results clearly demonstrate the effectiveness of our proposed approach for addressing the domain shift of object detection in multiple scenarios with domain discrepancies.

2. Related Work

Object Detection: Object detection has a long history, resulting in a plethora of approaches. Classical work [9, 13, 56] usually formulated object detection as a sliding window classification problem. The rise of deep convolutional networks (CNNs) [32] finds its origin in object detection, where its successes have led to a swift paradigm shift. Among the large number of approaches proposed [21, 51, 20, 19, 39, 8], region-based CNNs (R-CNN) [21, 20, 60] have received significant attention due to their effectiveness. This line of work was pioneered by R-CNN [21], which extracts region proposals from the image and trains a network to classify each region of interest (ROI) independently. The idea has been extended by [20, 26] to share the convolutional feature map among all ROIs. Faster R-CNN [48] produces object proposals with a Region Proposal Network (RPN). It achieved state-of-the-art results and laid the foundation for many follow-up works [19, 39, 8, 36, 60]. Faster R-CNN is also highly flexible and can be extended to other tasks, e.g., instance segmentation [7]. However, those works focused on the conventional setting without considering the domain adaptation issue for object detection in the wild. In this paper, we choose Faster R-CNN as our base detector, and improve its generalization ability for object detection in a new target domain.

Domain Adaptation: Domain adaptation has been widely studied for image classification in computer vision [10, 11, 33, 23, 22, 14, 52, 40, 15, 18, 50, 45, 43, 35]. Conventional methods include domain transfer multiple kernel learning [10, 11], asymmetric metric learning [33], subspace interpolation [23], geodesic flow kernel [22], subspace alignment [14], covariance matrix alignment [52, 57], etc. Recent works aim to improve the domain adaptability of deep neural networks, including [40, 15, 18, 50, 45, 43, 34, 24, 41, 42]. Different from those works, we focus on the object detection problem, which is more challenging as both object location and category need to be predicted.

A few recent works have also been proposed to perform unpaired image translation between two sets of data, which can be seen as pixel-level domain adaptation [62, 31, 59, 38]. However, it is still challenging to produce realistic images at the high resolution required by real-world applications like autonomous driving.

Domain Adaptation Beyond Classification: Compared to the research on domain adaptation for classification, much less attention has been paid to domain adaptation for other computer vision tasks. Recently, some works have addressed tasks such as semantic segmentation [4, 27, 61] and fine-grained recognition [16]. For the task of detection, [58] proposed to mitigate the domain shift problem of the deformable part-based model (DPM) by introducing an adaptive SVM. A recent work [47] uses the R-CNN model as a feature extractor and then aligns the features with the subspace alignment method. There also exists work on learning detectors from alternative sources, such as from images to videos [54], from 3D models [46, 53], or from synthetic models [25]. Previous works either cannot be trained in an end-to-end fashion, or focus on a specific case. In this work, we build an end-to-end trainable model for object detection, which is, to the best of our knowledge, the first of its kind.

3. Preliminaries

3.1. Faster R-CNN

We briefly review the Faster R-CNN [48] model, which is the baseline model used in this work. Faster R-CNN is a two-stage detector consisting of three major components: shared bottom convolutional layers, a region proposal network (RPN) and a region-of-interest (ROI) based classifier. The architecture is illustrated in the left part of Figure 2.

First, an input image is represented as a convolutional feature map produced by the shared bottom convolutional layers. Based on that feature map, the RPN generates candidate object proposals, after which the ROI-wise classifier predicts the category label from a feature vector obtained using ROI pooling. The training loss is composed of the loss of the RPN and the loss of the ROI classifier:

$$L_{det} = L_{rpn} + L_{roi} \tag{1}$$

The training losses of both the RPN and the ROI classifier have two terms: a classification loss, which measures how accurate the predicted probability is, and a regression loss on the box coordinates for better localization. Readers are referred to [48] for more details about the architecture and the training procedure.

3.2. Distribution Alignment with H-divergence

The H-divergence [1] is designed to measure the divergence between two sets of samples with different distributions. Let us denote by $x$ a feature vector. A source domain sample can be denoted as $x_S$ and a target domain sample as $x_T$. We also denote by $h : x \to \{0, 1\}$ a domain classifier, which aims to predict the source samples $x_S$ to be 0 and the target domain samples $x_T$ to be 1. Suppose $\mathcal{H}$ is the set of possible domain classifiers; the H-divergence then defines the distance between two domains as follows:

$$d_{\mathcal{H}}(S, T) = 2\left(1 - \min_{h \in \mathcal{H}} \Big(\mathrm{err}_S(h(x)) + \mathrm{err}_T(h(x))\Big)\right),$$

where $\mathrm{err}_S$ and $\mathrm{err}_T$ are the prediction errors of $h(x)$ on source and target domain samples, respectively. The above definition implies that the domain distance $d_{\mathcal{H}}(S, T)$ decreases as the error rate of the best domain classifier $h$ increases. In other words, if the error is high for the best domain classifier, the two domains are hard to distinguish, so they are close to each other, and vice versa.
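The definition above can be illustrated with a minimal sketch. It assumes one-dimensional features and a small, hypothetical hypothesis class of threshold rules; the function names `domain_error`, `h_divergence` and `threshold_classifiers` are illustrative and do not come from the paper.

```python
def domain_error(samples, classifier, true_label):
    """Fraction of samples whose predicted domain differs from true_label."""
    wrong = sum(1 for x in samples if classifier(x) != true_label)
    return wrong / len(samples)


def h_divergence(source, target, classifiers):
    """Empirical d_H(S, T) = 2 * (1 - min_h (err_S(h) + err_T(h)))."""
    best = min(
        domain_error(source, h, 0) + domain_error(target, h, 1)
        for h in classifiers
    )
    return 2.0 * (1.0 - best)


def threshold_classifiers(thresholds):
    """Hypothetical hypothesis class: 1-D threshold rules in both orientations."""
    rules = []
    for t in thresholds:
        rules.append(lambda x, t=t: 1 if x > t else 0)
        rules.append(lambda x, t=t: 0 if x > t else 1)
    return rules


H = threshold_classifiers([0.5, 1.5, 2.5, 3.5])
# Well separated domains: some classifier makes no mistakes.
separated = h_divergence([0.0, 1.0], [3.0, 4.0], H)
# Identical samples: every classifier has err_S + err_T = 1.
identical = h_divergence([0.0, 1.0, 3.0, 4.0], [0.0, 1.0, 3.0, 4.0], H)
print(separated, identical)
```

In this toy setting the divergence is largest when a classifier separates the two domains perfectly and vanishes when the two sample sets coincide, which matches the reading given in the text.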

In deep neural networks, the feature vector $x$ usually comprises the activations after a certain layer. Let us denote by $f$ the network that produces $x$. To align the two domains, we therefore need to enforce the network $f$ to output feature vectors that minimize the domain distance $d_{\mathcal{H}}(S, T)$ [15], which leads to:

$$\min_f d_{\mathcal{H}}(S, T) \Leftrightarrow \max_f \min_{h \in \mathcal{H}} \{\mathrm{err}_S(h(x)) + \mathrm{err}_T(h(x))\}.$$

This can be optimized in an adversarial training manner. Ganin and Lempitsky [15] implemented a gradient reverse layer (GRL), and integrated it into a CNN for image classification in the unsupervised domain adaptation scenario.
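The effect of a gradient reverse layer can be sketched on a scalar toy problem with hand-written gradients. This is only an illustration under stated assumptions: the feature extractor is a single weight `w`, the domain classifier is a single weight `v` followed by a sigmoid, and the class `GradientReversal`, its `scale` parameter and the function `domain_step` are hypothetical names, not the implementation of [15].

```python
import math


class GradientReversal:
    """Identity in the forward pass; multiplies the gradient by -scale in the backward pass."""

    def __init__(self, scale=1.0):
        self.scale = scale

    def forward(self, x):
        return x

    def backward(self, grad_output):
        return -self.scale * grad_output


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def domain_step(w, v, x, d, lr=0.1, grl=None):
    """One SGD step on the domain cross entropy for f = w * x and p = sigmoid(v * f)."""
    grl = grl or GradientReversal()
    f = w * x                              # toy feature extractor
    p = sigmoid(v * grl.forward(f))        # domain classifier sees the unchanged feature
    dloss_dlogit = p - d                   # gradient of cross entropy w.r.t. the logit
    grad_v = dloss_dlogit * f              # classifier: ordinary gradient descent
    grad_f = grl.backward(dloss_dlogit * v)  # sign flipped before reaching the extractor
    grad_w = grad_f * x
    return w - lr * grad_w, v - lr * grad_v


# A target-domain sample (d = 1): the classifier weight moves to reduce the loss,
# while the feature weight moves to increase it.
new_w, new_v = domain_step(w=1.0, v=1.0, x=1.0, d=1)
print(new_w, new_v)
```

Both parameters are updated with the same descent rule; the sign flip inside the layer is what turns the update of the feature extractor into a maximization of the domain loss.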

4. Domain Adaptation for Object Detection

Following the common terminology in domain adaptation, we refer to the domain of the training data as the source domain, denoted by $S$, and to the domain of the test data as the target domain, denoted by $T$. For instance, when using the Cityscapes dataset for training and the KITTI dataset for testing, $S$ is the Cityscapes dataset and $T$ represents the KITTI dataset.

We also follow the classic setting of unsupervised domain adaptation, where we have access to images and full supervision in the source domain (i.e., bounding boxes and object categories), but only unlabeled images are available for the target domain. Our task is to learn an object detection model adapted to the unlabeled target domain.

4.1. A Probabilistic Perspective

The object detection problem can be viewed as learning the posterior $P(C, B \mid I)$, where $I$ is the image representation, $B$ is the bounding box of an object and $C \in \{1, \dots, K\}$ is the category of the object ($K$ being the total number of categories).

Let us denote the joint distribution of training samples for object detection as $P(C, B, I)$, and use $P_S(C, B, I)$ and $P_T(C, B, I)$ to denote the source domain joint distribution and the target domain joint distribution, respectively. Note that here we use $P_T(C, B, I)$ to analyze the domain shift problem, although the bounding box and category annotations (i.e., $B$ and $C$) are unknown during training. When there is a domain shift, $P_S(C, B, I) \neq P_T(C, B, I)$.

Image-Level Adaptation: Using Bayes' formula, the joint distribution can be decomposed as:

$$P(C, B, I) = P(C, B \mid I)\,P(I). \tag{2}$$

Similar to the classification problem, we make the covariate shift assumption for object detection, i.e., the conditional probability $P(C, B \mid I)$ is the same for the two domains, and the domain distribution shift is caused by the difference in the marginal distribution $P(I)$. In other words, the detector is consistent between the two domains: given an image, the detection results should be the same regardless of which domain the image belongs to. In the Faster R-CNN model, the image representation $I$ is actually the feature map output of the base convolutional layers. Therefore, to handle the domain shift problem, we should enforce the distributions of the image representation from the two domains to be the same (i.e., $P_S(I) = P_T(I)$), which is referred to as image-level adaptation.

Instance-Level Adaptation: On the other hand, the joint distribution can also be decomposed as:

$$P(C, B, I) = P(C \mid B, I)\,P(B, I). \tag{3}$$

With the covariate shift assumption, i.e., the conditional probability $P(C \mid B, I)$ is the same for the two domains, the domain distribution shift comes from the difference in the marginal distribution $P(B, I)$. Intuitively, this implies semantic consistency between the two domains: given the same image region containing an object, its category label should be the same regardless of which domain it comes from. Therefore, we can also enforce the distributions of the instance representation from the two domains to be the same (i.e., $P_S(B, I) = P_T(B, I)$). We refer to it as instance-level alignment.

Figure 2. An overview of our Domain Adaptive Faster R-CNN model: we tackle the domain shift on two levels, the image level and the instance level. A domain classifier is built on each level, trained in an adversarial training manner. A consistency regularizer is incorporated within these two classifiers to learn a domain-invariant RPN for the Faster R-CNN model.

Here the instance representation $(B, I)$ refers to the features extracted from the image region in the ground truth bounding box for each instance. Although the bounding-box annotation is unavailable for the target domain, we can obtain it via $P(B, I) = P(B \mid I)\,P(I)$, where $P(B \mid I)$ is a bounding box predictor (e.g., the RPN in Faster R-CNN). This holds only when $P(B \mid I)$ is domain-invariant, for which we provide a solution below.

Joint Adaptation: Ideally, one can perform domain alignment on either the image or the instance level. Since $P(B, I) = P(B \mid I)\,P(I)$ and the conditional distribution $P(B \mid I)$ is assumed to be the same and non-zero for the two domains, we have:

$$P_S(I) = P_T(I) \Leftrightarrow P_S(B, I) = P_T(B, I). \tag{4}$$

In other words, if the distributions of the image-level representations are identical for the two domains, the distributions of the instance-level representations are also identical, and vice versa. Yet, it is generally non-trivial to perfectly estimate the conditional distribution $P(B \mid I)$. The reasons are two-fold: 1) in practice it may be hard to perfectly align the marginal distributions $P(I)$, which means the input for estimating $P(B \mid I)$ is somewhat biased, and 2) the bounding box annotation is only available for source domain training data, therefore $P(B \mid I)$ is learned using the source domain data only, which is easily biased toward the source domain.

To this end, we propose to perform domain distribution alignment on both the image and instance levels, and to apply a consistency regularization to alleviate the bias in estimating $P(B \mid I)$. As introduced in Section 3.2, to align the distributions of two domains, one needs to train a domain classifier $h(x)$. In the context of object detection, $x$ can be the image-level representation $I$ or the instance-level representation $(B, I)$. From a probabilistic perspective, $h(x)$ can be seen as estimating the probability that a sample $x$ belongs to the target domain.

Thus, by denoting the domain label as $D$, the image-level domain classifier can be viewed as estimating $P(D \mid I)$, and the instance-level domain classifier can be seen as estimating $P(D \mid B, I)$. By Bayes' theorem, we obtain:

$$P(D \mid B, I)\,P(B \mid I) = P(B \mid D, I)\,P(D \mid I). \tag{5}$$

In particular, $P(B \mid I)$ is a domain-invariant bounding box predictor, and $P(B \mid D, I)$ a domain-dependent bounding box predictor. Recall that in practice we can only learn a domain-dependent bounding box predictor $P(B \mid D, I)$, since we have no bounding box annotations for the target domain. Thus, by enforcing consistency between the two domain classifiers, i.e., $P(D \mid B, I) = P(D \mid I)$, we could learn $P(B \mid D, I)$ to approach $P(B \mid I)$.
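Equation (5) and the consistency argument can be checked numerically on a toy example. The sketch below assumes binary variables $D$, $B$ and $I$ and a hypothetical joint distribution in which the box predictor $P(B \mid I)$ does not depend on the domain; all probability tables and helper names are made up for illustration.

```python
from itertools import product

# Hypothetical tables over binary toy variables (not taken from the paper).
p_d = {0: 0.5, 1: 0.5}                                     # P(D)
p_i_given_d = {0: {0: 0.7, 1: 0.3}, 1: {0: 0.2, 1: 0.8}}   # P(I | D)
p_b_given_i = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.4, 1: 0.6}}   # P(B | I), shared by both domains

joint = {
    (d, b, i): p_d[d] * p_i_given_d[d][i] * p_b_given_i[i][b]
    for d, b, i in product((0, 1), repeat=3)
}


def prob(**fixed):
    """Marginal probability of the event given by keyword arguments d, b, i."""
    total = 0.0
    for (d, b, i), p in joint.items():
        values = {"d": d, "b": b, "i": i}
        if all(values[k] == v for k, v in fixed.items()):
            total += p
    return total


def cond(event, given):
    """Conditional probability P(event | given); both arguments are dicts."""
    return prob(**event, **given) / prob(**given)


max_gap_eq5 = 0.0          # largest violation of Equation (5)
max_gap_consistency = 0.0  # largest difference between P(D | B, I) and P(D | I)
for d, b, i in product((0, 1), repeat=3):
    lhs = cond({"d": d}, {"b": b, "i": i}) * cond({"b": b}, {"i": i})
    rhs = cond({"b": b}, {"d": d, "i": i}) * cond({"d": d}, {"i": i})
    max_gap_eq5 = max(max_gap_eq5, abs(lhs - rhs))
    gap = abs(cond({"d": d}, {"b": b, "i": i}) - cond({"d": d}, {"i": i}))
    max_gap_consistency = max(max_gap_consistency, gap)

print(max_gap_eq5, max_gap_consistency)
```

Because the box predictor in this construction is domain-invariant, the two domain posteriors agree, which is the direction of the argument that motivates the consistency regularizer.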

4.2. Domain Adaptation Components

This section introduces two domain adaptation components for the image and instance levels, used to align the feature representation distributions on those two levels.

Image-Level Adaptation: In the Faster R-CNN model, the image-level representation refers to the feature map outputs of the base convolutional layers (see the green parallelogram in Figure 2). To eliminate the domain distribution mismatch on the image level, we employ a patch-based domain classifier as shown in the lower right part of Figure 2. In particular, we train a domain classifier on each activation from the feature map. Since the receptive field of each activation corresponds to an image patch of the input image $I_i$, the domain classifier actually predicts the domain label for each image patch.

The benefits of this choice are twofold: 1) aligning image-level representations generally helps to reduce the shift caused by global image differences such as image style, image scale, illumination, etc. A similar patch-based loss has been shown to be effective in recent work on style transfer [29], which also deals with a global transformation; and 2) the batch size is usually very small for training an object detection network, due to the use of high-resolution input. The patch-based design therefore helps to increase the number of training samples for the domain classifier.

Let us denote by $D_i$ the domain label of the $i$-th training image, with $D_i = 0$ for the source domain and $D_i = 1$ for the target domain. We denote as $\phi_{u,v}(I_i)$ the activation located at $(u, v)$ of the feature map of the $i$-th image after the base convolutional layers. By denoting the output of the domain classifier as $p_i^{(u,v)}$ and using the cross entropy loss, the image-level adaptation loss can be written as:

$$L_{img} = -\sum_{i,u,v} \Big[ D_i \log p_i^{(u,v)} + (1 - D_i) \log\big(1 - p_i^{(u,v)}\big) \Big]. \tag{6}$$
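As a minimal sketch, Equation (6) can be evaluated directly on nested lists of per-location probabilities. The function name `image_level_loss` and the data layout are assumptions made for this illustration; a real implementation would operate on feature-map tensors.

```python
import math


def image_level_loss(domain_labels, patch_probs):
    """Equation (6): cross entropy summed over images i and locations (u, v).

    domain_labels[i] is D_i (0 = source, 1 = target).
    patch_probs[i][u][v] is p_i^(u,v), the predicted probability of the target
    domain at location (u, v); every value must lie strictly between 0 and 1.
    """
    loss = 0.0
    for d_i, prob_map in zip(domain_labels, patch_probs):
        for row in prob_map:
            for p in row:
                loss -= d_i * math.log(p) + (1 - d_i) * math.log(1.0 - p)
    return loss


# Hypothetical 2x2 feature maps for one source image and one target image.
labels = [0, 1]
undecided = [[[0.5, 0.5], [0.5, 0.5]], [[0.5, 0.5], [0.5, 0.5]]]
confident = [[[0.1, 0.1], [0.1, 0.1]], [[0.9, 0.9], [0.9, 0.9]]]
loss_undecided = image_level_loss(labels, undecided)
loss_confident = image_level_loss(labels, confident)
print(loss_undecided, loss_confident)
```

Each location contributes one term, so a single image already yields as many training samples for the domain classifier as the feature map has activations.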

As discussed in Section 3.2, to align the domain distributions, we should simultaneously optimize the parameters of the domain classifier to minimize the above domain classification loss, and optimize the parameters of the base network to maximize this loss. For the implementation we use the gradient reverse layer (GRL) [15]: ordinary gradient descent is applied for training the domain classifier, while the sign of the gradient is reversed when passing through the GRL to optimize the base network.

Instance-Level Adaptation: The instance-level representation refers to the ROI-based feature vectors before they are fed into the final category classifiers (i.e., the rectangles after the "FC" layer in Figure 2). Aligning the instance-level representations helps to reduce local instance differences such as object appearance, size, viewpoint, etc. Similar to the image-level adaptation, we train a domain classifier for the feature vectors to align the instance-level distribution. Let us denote the output of the instance-level domain classifier for the $j$-th region proposal in the $i$-th image as $p_{i,j}$. The instance-level adaptation loss can now be written as:

$$L_{ins} = -\sum_{i,j} \Big[ D_i \log p_{i,j} + (1 - D_i) \log\big(1 - p_{i,j}\big) \Big]. \tag{7}$$

We also add a gradient reverse layer before the domain classifier to apply the adversarial training strategy.

Consistency Regularization: As analyzed in Section 4.1, enforcing consistency between the domain classifiers on different levels helps to learn a cross-domain robust bounding box predictor (i.e., the RPN in the Faster R-CNN model). Therefore, we further impose a consistency regularizer. Since the image-level domain classifier produces an output for each activation of the image-level representation $I$, we take the average over all activations in the image as its image-level probability. The consistency regularizer can be written as:

$$L_{cst} = \sum_{i,j} \Big\| \frac{1}{|I|} \sum_{u,v} p_i^{(u,v)} - p_{i,j} \Big\|_2, \tag{8}$$

where $|I|$ denotes the total number of activations in a feature map, and $\|\cdot\|$ is the $\ell_2$ distance.
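Equation (8) can be sketched in the same nested-list layout. The function name `consistency_loss` and the inputs are hypothetical; since each domain probability is a scalar, the $\ell_2$ distance is taken here as an absolute difference, and whether a practical implementation squares it is not stated in the text.

```python
def consistency_loss(patch_probs, instance_probs):
    """Equation (8) for scalar domain probabilities.

    patch_probs[i][u][v] is p_i^(u,v); instance_probs[i][j] is p_{i,j}.
    For scalars the l2 distance reduces to an absolute difference.
    """
    loss = 0.0
    for prob_map, proposals in zip(patch_probs, instance_probs):
        activations = [p for row in prob_map for p in row]
        image_prob = sum(activations) / len(activations)  # (1/|I|) * sum over (u, v)
        loss += sum(abs(image_prob - p_ij) for p_ij in proposals)
    return loss


# One hypothetical image with a 2x2 feature map and two region proposals.
patches = [[[0.2, 0.4], [0.6, 0.8]]]   # image-level probability is 0.5
proposals = [[0.5, 0.9]]               # one consistent and one inconsistent proposal
cst = consistency_loss(patches, proposals)
print(cst)
```

Only the second proposal contributes to the loss in this example, because its instance-level prediction disagrees with the averaged image-level prediction.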

4.3. Network Overview

An overview of our network is shown in Figure 2. We augment the Faster R-CNN base architecture with our domain adaptation components, which leads to our Domain Adaptive Faster R-CNN model.

The left part of Figure 2 is the original Faster R-CNN model. The bottom convolutional layers are shared between all components. Then the RPN and ROI pooling layers are built on top, followed by two fully connected layers to extract the instance-level features.

Three novel components are introduced in our Domain Adaptive Faster R-CNN. The image-level domain classifier is added after the last convolution layer and the instance-level domain classifier is added to the end of the ROI-wise features. The two classifiers are linked with a consistency loss to encourage the RPN to be domain-invariant. The final training loss of the proposed network is a summation of each individual part, which can be written as:

$$L = L_{det} + \lambda\,(L_{img} + L_{ins} + L_{cst}), \tag{9}$$

where $\lambda$ is a trade-off parameter to balance the Faster R-CNN loss and our newly added domain adaptation components. The network can be trained in an end-to-end manner using a standard SGD algorithm. Note that the adversarial training for the domain adaptation components is achieved by using the GRL, which automatically reverses the gradient during backpropagation. The overall network in Figure 2 is used in the training phase. During inference, one can remove the domain adaptation components, and simply use the original Faster R-CNN architecture with adapted weights.
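The combination in Equation (9) amounts to a weighted sum, sketched below. The function name `total_loss` and the numeric loss values are hypothetical placeholders; no particular value of the trade-off parameter is implied.

```python
def total_loss(l_det, l_img, l_ins, l_cst, lam):
    """Equation (9): detection loss plus lam times the three adaptation terms."""
    return l_det + lam * (l_img + l_ins + l_cst)


# Hypothetical loss values for one training step.
without_adaptation = total_loss(1.0, 0.5, 0.25, 0.25, lam=0.0)
with_adaptation = total_loss(1.0, 0.5, 0.25, 0.25, lam=0.5)
print(without_adaptation, with_adaptation)
```

Setting the trade-off parameter to zero recovers the plain Faster R-CNN training loss, which mirrors the fact that the adaptation components can be dropped at inference time.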

5. Experiments

5.1. Experiment Setup

We adopt the unsupervised domain adaptation protocol

in our experiments. The training data consists of two parts:
