A SURVEY ON DOMAIN ADAPTATION THEORY
Ievgen Redko, Emilie Morvant, Amaury Habrard, Marc Sebban
Univ Lyon, UJM-Saint-Etienne, CNRS, Institut d’Optique Graduate School
Laboratoire Hubert Curien UMR 5516, F-42023, Saint-Etienne, France
name.surname@univ-st-etienne.fr
Younès Bennani
Université Sorbonne Paris Nord, CNRS, Institut Galilée
Laboratoire d’Informatique de Paris Nord UMR 7030, F-93430, Villetaneuse, France
name.surname@sorbonne-paris-nord.fr
ABSTRACT
All famous machine learning algorithms that correspond to both supervised and semi-supervised learning work well only under a common assumption: training and test data follow the same distribution. When the distribution changes, most statistical models must be reconstructed from newly collected data that, for some applications, may be costly or impossible to obtain. It therefore became necessary to develop approaches that reduce the need and the effort of obtaining new labeled samples by exploiting data available in related areas and using it further in similar fields. This has given rise to a new machine learning framework called transfer learning: a learning setting inspired by the capability of a human being to extrapolate knowledge across tasks to learn more efficiently. Despite the large number of different transfer learning scenarios, the main objective of this survey is to provide an overview of the state-of-the-art theoretical results in a specific, and arguably the most popular, sub-field of transfer learning called domain adaptation. In this sub-field, the data distribution is assumed to change across the training and the test data while the learning task remains the same. We provide a first up-to-date description of existing results related to the domain adaptation problem, covering learning bounds based on different statistical learning frameworks.
Keywords Transfer learning · Domain adaptation · Learning theory
This survey is a shortened version of the recently published book "Advances in Domain Adaptation Theory" [Redko et al., 2019c] written by the authors of this survey. Its purpose is to provide a high-level overview of the latter work and update it with some recent references. All proofs and most of the mathematical developments are omitted in this version to keep the document at a reasonable length. For more details, we refer the interested reader to the original papers or the full version of the book available at http://tiny.cc/mj2dnz.
1 Introduction
The idea behind transfer learning is inspired by the human being’s ability to learn with minimal or no supervision
based on previously acquired knowledge. It is not surprising that this concept was not invented in the machine learning
community in the proper sense of the term, since the notion “transfer of learning” had been used long before the
construction of the first computer and is found in the psychology field papers from early 20th century. From the
statistical point of view, this learning scenario is different from supervised learning as the former does not assume that
the training and test data have to be drawn from the same probability distribution. It was argued that this assumption
is often too restrictive to hold in practice as in many real-world applications a hypothesis is learned and deployed in
environments that differ and exhibit an important shift.

arXiv:2004.11829v1 [cs.LG] 24 Apr 2020

Figure 1: Distinction between the usual supervised learning setting and transfer learning, and positioning of domain adaptation.

A typical example often used in transfer learning is to consider a spam filtering task where the spam filter is learned using an arbitrary classification algorithm for a corporate mailbox
of a given user. In this case, the vast majority of e-mails analyzed by the algorithm are likely to be of a professional
character, with very few of them related to the private life of the considered person. Imagine further a situation where this same user installs mailbox software on their personal computer and imports the settings of their corporate mailbox, hoping that it will work equally well there too. This, however, is not likely to be the case, as many personal e-mails may seem like spam to an algorithm learned purely on professional communications, due to the differences in their content and attached files as well as the non-uniformity of email addresses. Another illustrative example is
that of species classification in oceanographic studies where one relies on a video coverage of a certain sea area in
order to recognize species of the marine habitat. For instance, in the Mediterranean sea and in the Indian ocean, the
species of fish that can be found on the recorded videos are likely to belong to the same family, even though their actual
appearance may be quite dissimilar due to different climate and evolutionary background. In this case, the learning
algorithm trained on the video coverage of the Mediterranean sea will most likely fail to provide a correct classification
of species in the Indian ocean without being specifically adapted by an expert.
For this kind of application, it may be desirable to find a learning paradigm that remains robust to a changing environment and adapts to a new problem at hand by drawing parallels and exploiting the knowledge from the domain where it was learned in the first place. In response to this problem, the quest for new algorithms able to learn on a training sample and then perform well on a test sample coming from a different but related probability distribution gave rise to a new learning paradigm called transfer learning. Its definition is given as follows.
Definition 1 (Transfer learning). We consider a source data distribution $S$ called the source domain, and a target data distribution $T$ called the target domain. Let $X_S \times Y_S$ be the source input and output spaces associated to $S$, and $X_T \times Y_T$ be the target input and output spaces associated to $T$. We denote by $S_X$ and $T_X$ the marginal distributions of $X_S$ and $X_T$, and by $t_S$ and $t_T$ the source and target learning tasks depending on $Y_S$ and $Y_T$, respectively. Then, transfer learning aims to help to improve the learning of the target predictive function $f_T : X_T \to Y_T$ for $t_T$ using knowledge gained from $S$ and $t_S$, where $S \neq T$.
Note that the condition $S \neq T$ implies either $S_X \neq T_X$ (i.e., $X_S \neq X_T$ or $S_X(X) \neq T_X(X)$) or $t_S \neq t_T$ (i.e., $Y_S \neq Y_T$ or $S(Y|X) \neq T(Y|X)$). In transfer learning, one often distinguishes three possible learning settings based on these different relations (illustrated in Figure 1):
1. Inductive transfer learning, where $S_X = T_X$ and $t_S \neq t_T$. For example, $S_X$ and $T_X$ are the distributions of the data collected from the mailbox of one particular user, with $t_S$ being the task of detecting spam, while $t_T$ is the task of detecting a hoax;

2. Transductive transfer learning, where $S_X \neq T_X$ but $t_S = t_T$. For example, in the spam filtering problem, $S_X$ is the distribution of the data collected for one user, $T_X$ is the distribution of the data of another user, and $t_S$ and $t_T$ are both the task of detecting spam;

3. Unsupervised transfer learning, where $t_S \neq t_T$ and $S_X \neq T_X$. For example, $S_X$ generates the data collected from one user and $T_X$ generates the content of web pages collected on the web, with $t_S$ consisting in filtering out spam, while $t_T$ is to detect hoaxes.
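The transductive setting can be illustrated numerically: a model fit on source data degrades once the target marginal distribution shifts, even though the labeling task is unchanged. The sketch below is our own illustration (not from the survey), using synthetic Gaussian data and a nearest-centroid classifier chosen purely for simplicity; all names and parameter values are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_domain(mean_shift, n=1000):
    """Two Gaussian classes in 2-d; `mean_shift` translates the marginal of X."""
    y = rng.integers(0, 2, size=n) * 2 - 1            # labels in {-1, +1}
    x = rng.normal(loc=y[:, None] * 1.0, size=(n, 2)) + mean_shift
    return x, y

# Source and target marginals S_X != T_X differ by a translation,
# while the labeling task t_S = t_T is the same.
Xs, ys = sample_domain(mean_shift=np.array([0.0, 0.0]))
Xt, yt = sample_domain(mean_shift=np.array([2.0, 2.0]))

# Nearest-centroid classifier fit on the source sample only.
mu_pos = Xs[ys == 1].mean(axis=0)
mu_neg = Xs[ys == -1].mean(axis=0)

def predict(X):
    d_pos = np.linalg.norm(X - mu_pos, axis=1)
    d_neg = np.linalg.norm(X - mu_neg, axis=1)
    return np.where(d_pos < d_neg, 1, -1)

acc_source = (predict(Xs) == ys).mean()
acc_target = (predict(Xt) == yt).mean()
print(f"source accuracy: {acc_source:.2f}, target accuracy: {acc_target:.2f}")
```

Without adaptation, the source-trained classifier is near-optimal on its own domain but close to chance on the shifted target; quantifying this gap is exactly the purpose of the bounds surveyed below.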
Arguably, the vast majority of situations where transfer learning proves to be the most needed falls into the second category, known as domain adaptation, where we suppose that the source and the target tasks are the same, but where we have a source data set with an abundant amount of labeled observations and a target one with no (or little) labeled instances. In this survey, we concentrate on theoretical advances related to the latter case and highlight their differences with respect to the traditional supervised learning paradigm. A brief overview of the considered works is given in Tables 1 and 2 for learning bounds and hardness results, respectively.
Table 1: A summary of the contributions presented in this survey for learning bounds in domain adaptation. (Task) refers to the considered learning problem; (Framework) specifies the statistical learning framework used in the analysis; (Divergence) is the metric used to compare the source and target distributions; (Link) stands for the dependence between the source error and the divergence term; (Non-estim.) indicates the presence of a non-estimable term in the bounds.

| Reference | Task | Framework | Divergence | Link | Non-estim. |
|---|---|---|---|---|---|
| [Ben-David et al., 2007], [Blitzer et al., 2008], [Ben-David et al., 2010a] | Binary classification | VC | L1, H∆H | Add. | + |
| [Mansour et al., 2009a] | Classification/Regression | Rademacher | Discrepancy | Add. | + |
| [Cortes et al., 2010], [Cortes and Mohri, 2014], [Cortes et al., 2015] | Regression | Rademacher | (Generalized) Discrepancy | Add. | + |
| [Mansour et al., 2008] | Classification/Regression | – | – | – | – |
| [Mansour et al., 2009b], [Hoffman et al., 2018] | Classification/Regression | – | Rényi | Mult. | – |
| [Dhouib and Redko, 2018] | Binary classification/Similarity learning | – | L1, χ² | Mult. | + |
| [Redko et al., 2019a] | Binary classification | Rademacher | Discrepancy | Add. | + |
| [Zhang et al., 2012] | Regression/Classification | Uniform entropy number | IPM | Add. | – |
| [Redko, 2015] | Regression | Rademacher | IPM/MMD | Add. | + |
| [Redko et al., 2017] | Regression | – | IPM/Wasserstein | Add. | + |
| [Zhang et al., 2019] | Large-margin classification | Rademacher | IPM | Add. | + |
| [Johansson et al., 2019] | Classification | – | IPM | Add. | + |
| [Shen et al., 2018] | Classification | – | Wasserstein | Add. | + |
| [Courty et al., 2017] | Classification | – | Wasserstein | Add. | + |
| [Germain et al., 2013] | Classification | PAC-Bayes | Domain disagreement | Add. | + |
| [Germain et al., 2016] | Classification | PAC-Bayes | β-divergence | Mult. | + |
| [Li and Bilmes, 2007] | Classification | PAC-Bayes | – | Add. | – |
| [McNamara and Balcan, 2017] | Binary classification | VC/PAC-Bayes | – | Add. | – |
| [Mansour and Schain, 2014] | Classification | Robustness | λ-shift | Add. | – |
| [Kuzborskij and Orabona, 2013], [Kuzborskij and Orabona, 2017], [Du et al., 2017] | Regression | Stability | – | – | – |
| [Perrot and Habrard, 2015] | Classification/Similarity learning | Stability | – | – | – |
| [Morvant et al., 2012] | Classification/Similarity learning | Robustness/VC | H∆H | Add. | + |
Table 2: A summary of the contributions presented in this survey for hardness results in domain adaptation. (Type) is the type of the obtained result; (Setting) points out the presence or absence of target data (either labelled or unlabelled); (Assumptions) indicates the considered assumptions (individual or combined); (Proper) specifies if the learned model is required to belong to a predefined class; (Constr.) indicates if the result is of a constructive nature.

| Reference | Type | Setting | Assumptions | Proper | Constr. |
|---|---|---|---|---|---|
| [Ben-David et al., 2010b] | Impossibility/Sample compl. | Unlabelled target | Cov. shift, H∆H, λ_H | – | + |
| [Ben-David et al., 2012] | Impossibility/Sample compl. | No target/Unlabelled target | Cov. shift, C_B, Lipschitz | + | +/– |
| [Ben-David and Urner, 2012] | Impossibility/Sample compl. | Unlabelled target | Cov. shift, C_B, Realizab. | – | – |
| [Redko et al., 2019b] | Estimation/Sample compl. | Labelled target | – | – | – |
| [Zhao et al., 2019] | Impossibility | Unlabelled target | Cov. shift, H∆H, λ_H | – | + |
| [Johansson et al., 2019] | Impossibility | Unlabelled target | Cov. shift, H∆H, λ_H | – | + |
| [Hanneke and Kpotufe, 2019] | Sample compl. | Labelled target | Relaxed cov. shift, Noise cond. | – | – |
The rest of this survey is organized as follows. In Section 2, we briefly present the traditional statistical learning frameworks that are referred to throughout the survey. In Section 3, we present the first theoretical results of domain adaptation theory from the seminal works of [Ben-David et al., 2007, Mansour et al., 2009a, Cortes and Mohri, 2011] that rely on the famous $\mathcal{H}\Delta\mathcal{H}$ and discrepancy distances. We then turn our attention to hardness results for the domain adaptation problem in Section 4. Section 5 presents several works establishing generalization bounds for domain adaptation based on the popular integral probability metrics (IPMs). In Section 6, we highlight several learning bounds proved using the PAC-Bayesian framework. Finally, in Section 7 we give an overview of the contributions that take the actual learning algorithm into account when deriving the learning bounds, and conclude the survey in Section 8.
2 Preliminary knowledge
Below, we recall the usual supervised learning setup and the different quantities used to derive generalization bounds in this context. This includes the notions of Vapnik-Chervonenkis (VC) [Vapnik, 2006, Vapnik and Chervonenkis, 1971] and Rademacher complexities [Koltchinskii and Panchenko, 1999], the definitions related to the PAC-Bayesian theory [McAllester, 1999], and those from the more recent algorithmic stability [Bousquet and Elisseeff, 2002] and algorithmic robustness [Xu and Mannor, 2010] frameworks.
2.1 Definitions
Let a pair $(X, Y)$ define the input and the output spaces, where $X$ is described by real-valued vectors of finite dimension $d$, i.e., $X \subseteq \mathbb{R}^d$. For $Y$, we distinguish between two possible scenarios: 1) when $Y$ is continuous, e.g., $Y = [-1, 1]$ or $Y = \mathbb{R}$, we talk about regression; 2) when $Y$ is discrete and takes values from a finite set, we talk about classification. Two important cases of classification are binary classification and multi-class classification, where $Y = \{-1, 1\}$ (or $Y = \{0, 1\}$) and $Y = \{1, 2, \ldots, C\}$ with $C > 2$, respectively.
We assume that the pairs in $X \times Y$ are drawn from an unknown joint probability distribution $D$ and that we observe them through a finite training sample (also called learning sample) $S = \{(x_i, y_i)\}_{i=1}^{m} \sim (D)^m$ of $m$ independent and identically distributed (i.i.d.) pairs (also called examples or data instances). We further denote by $H = \{h \mid h : X \to Y\}$ a hypothesis space (also called hypothesis class) consisting of functions that map each element of $X$ to $Y$. These functions $h$ are usually called hypotheses, or more specifically classifiers or regressors depending on the nature of $Y$. Let us now consider a loss function $\ell : Y \times Y \to [0, 1]$ that gives the cost of $h(x)$ deviating from the true output $y \in Y$. We can then define the true risk and the empirical risk, with respect to $D$ and $S$ respectively, as follows.
Definition 2 (True risk). Given a loss function $\ell : Y \times Y \to [0, 1]$, the true risk (also called generalization error) $R_D^{\ell}(h)$ for a given hypothesis $h \in H$ on a distribution $D$ over $X \times Y$ is defined as
$$R_D^{\ell}(h) = \mathbb{E}_{(x,y)\sim D}\, \ell(h(x), y).$$
By abuse of notation, for a given pair of hypotheses $(h, h') \in H^2$, we write
$$R_D^{\ell}(h, h') = \mathbb{E}_{(x,y)\sim D}\, \ell(h(x), h'(x)).$$
Definition 3 (Empirical risk). Given a loss function $\ell : Y \times Y \to [0, 1]$ and a training sample $S = \{(x_i, y_i)\}_{i=1}^{m}$, where each example is drawn i.i.d. from $D$, the empirical risk $R_{\hat{D}}^{\ell}(h)$ for a given hypothesis $h \in H$ is defined as
$$R_{\hat{D}}^{\ell}(h) = \frac{1}{m}\sum_{i=1}^{m} \ell(h(x_i), y_i),$$
where $\hat{D}$ is the empirical distribution associated to the sample $S$.
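As a minimal sketch of Definition 3 (our own illustration, with a hypothetical threshold classifier and toy data), the empirical risk is just the average of the per-example losses over the sample; with the 0-1 loss discussed next, it reduces to the misclassification rate:

```python
import numpy as np

def empirical_risk(h, loss, sample):
    """R_hat(h) = (1/m) * sum_i loss(h(x_i), y_i), as in Definition 3."""
    xs, ys = sample
    return np.mean([loss(h(x), y) for x, y in zip(xs, ys)])

# Hypothetical example: a 1-d threshold classifier with the 0-1 loss.
zero_one = lambda y_pred, y_true: float(y_pred != y_true)
h = lambda x: 1 if x >= 0.0 else -1

xs = np.array([-2.0, -0.5, 0.3, 1.5])
ys = np.array([-1, 1, 1, 1])        # the example (-0.5, +1) is misclassified

print(empirical_risk(h, zero_one, (xs, ys)))   # -> 0.25
```

The true risk of Definition 2 would replace the sample average by an expectation over the unknown distribution $D$, which is exactly what the empirical risk estimates.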
The most natural loss function that can be used to count the number of errors committed by a hypothesis $h \in H$ on the distribution $D$ is the $0\text{-}1$ loss function $\ell_{0\text{-}1} : Y \times Y \to \{0, 1\}$, defined for a training example $(x, y)$ as
$$\ell_{0\text{-}1}(h(x), y) = \mathbb{I}\left[h(x) \neq y\right] = \begin{cases} 1, & \text{if } h(x) \neq y, \\ 0, & \text{otherwise}. \end{cases} \quad (1)$$
Figure 2: Illustration of the zero-one, hinge, and linear loss functions.
A popular convex proxy for this non-convex function is the hinge loss, defined for a given pair $(x, y)$ by
$$\ell_{\mathrm{hinge}}(h(x), y) = \left[1 - yh(x)\right]_{+} = \max\left(0, 1 - yh(x)\right).$$
Another loss function often used in practice, which extends the $0\text{-}1$ loss to the case of real values, is the linear loss $\ell_{\mathrm{lin}} : \mathbb{R} \times \mathbb{R} \to [0, 1]$, defined by
$$\ell_{\mathrm{lin}}(h(x), y) = \frac{1}{2}\left(1 - yh(x)\right).$$
The three above-mentioned loss functions are illustrated in Figure 2. Note that in this figure, the x-axis shows the values of $yh(x)$, as $h(x) = y$ is equivalent to $yh(x) \geq 0$ when $Y = \{-1, 1\}$.
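The three losses above translate directly from their formulas; the following sketch (our own, for binary labels $y \in \{-1, 1\}$ and a real-valued score $h(x)$ for the margin-based losses) can be used to check a few values:

```python
def zero_one_loss(hx, y):
    """l_{0-1}(h(x), y) = I[h(x) != y]  (Equation 1)."""
    return float(hx != y)

def hinge_loss(hx, y):
    """l_hinge(h(x), y) = [1 - y*h(x)]_+ = max(0, 1 - y*h(x))."""
    return max(0.0, 1.0 - y * hx)

def linear_loss(hx, y):
    """l_lin(h(x), y) = (1 - y*h(x)) / 2."""
    return 0.5 * (1.0 - y * hx)

# A correct prediction with margin y*h(x) = 1 incurs no loss under any of them.
print(zero_one_loss(1, 1))    # -> 0.0
print(hinge_loss(1.0, 1))     # -> 0.0
print(linear_loss(1.0, 1))    # -> 0.0

# A score on the wrong side of the margin is penalized proportionally
# by the hinge and linear losses.
print(hinge_loss(-0.5, 1))    # -> 1.5
print(linear_loss(-0.5, 1))   # -> 0.75
```

Note that the linear loss stays in $[0, 1]$ only when $yh(x) \in [-1, 1]$, so its signature $\ell_{\mathrm{lin}} : \mathbb{R} \times \mathbb{R} \to [0, 1]$ is understood on bounded predictions.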
Notations

Below, we present the notations that are used throughout the survey.

| Notation | Meaning |
|---|---|
| $X$ | Input space |
| $Y$ | Output space |
| $D$ | A domain: a yet unknown distribution over $X \times Y$ |
| $D_X$ | Marginal distribution of $D$ on $X$ |
| $\hat{D}_X$ | Empirical distribution associated with a sample drawn from $D_X$ |
| $\mathrm{SUPP}(D)$ | Support of distribution $D$ |
| $\Pr(\cdot)$ | Probability of an event |
| $\mathbb{E}(\cdot)$ | Expectation of a random variable |
| $x = (x_1, \ldots, x_d)^{\top} \in \mathbb{R}^d$ | A $d$-dimensional real-valued vector |
| $(x, y) \sim D$ | $(x, y)$ is drawn i.i.d. from $D$ |
| $S = \{(x_i, y_i)\}_{i=1}^{m} \sim (D)^m$ | Labeled learning sample constituted of $m$ examples drawn i.i.d. from $D$ |
| $S_u = \{(x_i)\}_{i=1}^{m} \sim (D_X)^m$ | Unlabeled learning sample constituted of $m$ examples drawn i.i.d. from $D_X$ |