multi-source transfer learning, which mainly has the following two steps in each iteration.
1. Candidate Classifier Construction: A group of candidate weak classifiers are respectively trained on the weighted instances in the pairs of each source domain and the target domain, i.e., $D_{S_i} \cup D_T$ ($i = 1, \cdots, m^S$).
2. Instance Weighting: A classifier which has the minimal classification error rate $\bar{\delta}$ on the target-domain instances is selected (denoted by $j$), and is then used for updating the weights of the instances in $D_{S_j}$ and $D_T$.
Finally, the selected classifiers from each iteration are combined to form the final classifier. Another parameter-based algorithm, i.e., TaskTrAdaBoost, is also proposed in that work [26], which is introduced in Section 5.3.
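To make the two-step iteration concrete, the following is a minimal sketch of one boosting round under this strategy. The helper choices (decision stumps as weak learners, the simplified re-weighting rule, and the factor beta_t) are illustrative assumptions rather than the exact formulation in [26].

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def multi_source_boosting_iteration(sources, target, w_sources, w_target, beta_t):
    """One illustrative iteration of multi-source instance-weighted boosting.

    sources   : list of (X_Si, y_Si) pairs, one per source domain
    target    : (X_T, y_T) labeled target-domain data
    w_sources : list of per-instance weight vectors, one per source domain
    w_target  : weight vector for the target-domain instances
    beta_t    : multiplicative factor for the weight update (assumed given)
    """
    X_T, y_T = target
    candidates, errors = [], []

    # Step 1. Candidate classifier construction: train a weak classifier on the
    # weighted union D_Si ∪ D_T for every source domain i.
    for (X_S, y_S), w_S in zip(sources, w_sources):
        X = np.vstack([X_S, X_T])
        y = np.concatenate([y_S, y_T])
        w = np.concatenate([w_S, w_target])
        clf = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w / w.sum())
        # Weighted error rate measured on the target-domain instances only.
        err = np.average(clf.predict(X_T) != y_T, weights=w_target / w_target.sum())
        candidates.append(clf)
        errors.append(err)

    # Step 2. Instance weighting: keep the candidate with the smallest target
    # error and update the weights of the instances in D_Sj and D_T.
    j = int(np.argmin(errors))
    clf_j = candidates[j]
    X_Sj, y_Sj = sources[j]
    # Down-weight misclassified source instances and up-weight hard target
    # instances (a simplified stand-in for the TrAdaBoost-style update rules).
    w_sources[j] *= np.where(clf_j.predict(X_Sj) != y_Sj, beta_t, 1.0)
    w_target *= np.where(clf_j.predict(X_T) != y_T, 1.0 / beta_t, 1.0)
    return clf_j, errors[j], j
```

In a full algorithm, this routine would be called for a fixed number of iterations and the selected classifiers combined into the final ensemble.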
Some approaches realize the instance weighting strategy in a heuristic way. For example, Jiang and Zhai proposed a general weighting framework for the adaptation of instances [27]. According to the paper, three types of instances (i.e., labeled source-domain, labeled target-domain, and unlabeled target-domain instances) are used to construct the target classifier. The objective function contains three terms, designed according to the instances' types, each minimizing a cross-entropy loss.
• Labeled Target-domain Instance: The classifier should minimize the cross-entropy loss on them, which is actually a standard supervised learning task.
• Unlabeled Target-domain Instance: These instances' true conditional distributions $P(y|x_i^{T,U})$ are unknown and should be estimated. A possible solution is to train an auxiliary classifier on the labeled source-domain and target-domain instances to help estimate the conditional distributions or assign pseudo labels to these instances.
• Labeled Source-domain Instance: The authors define the weight of $x_i^{S,L}$ as the product of two parts, i.e., $\alpha_i$ and $\beta_i$. The weight $\beta_i$ is ideally equal to $P_T(x_i)/P_S(x_i)$, which can be estimated by non-parametric methods such as KMM or can be set uniformly in the worst case. The weight $\alpha_i$ is used to filter out the source-domain instances that differ greatly from the target domain.
A heuristic method can be used to produce the value of $\alpha_i$, which consists of the following three steps (a code sketch is given after the list).
1. Auxiliary Classifier Construction: An auxiliary classifier trained on the labeled target-domain instances is used to classify the unlabeled source-domain instances.
2. Instance Ranking: The source-domain instances are
ranked based on the probabilistic prediction results.
3. Heuristic Weighting: The weights $\alpha_i$ of the top-k source-domain instances with wrong predictions are set to zero, and the weights of the others are set to one.
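The three steps above can be sketched roughly as follows; the choice of logistic regression as the auxiliary classifier and the way the ranking is resolved are assumptions made purely for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def heuristic_alpha(X_target_labeled, y_target_labeled, X_source, y_source, k):
    """Illustrative computation of the filtering weights alpha_i.

    1. Train an auxiliary classifier on the labeled target-domain instances.
    2. Rank the source-domain instances by the probability the auxiliary
       classifier assigns to their recorded labels.
    3. Set alpha_i = 0 for the top-k wrongly predicted source instances,
       and alpha_i = 1 for all the others.
    """
    aux = LogisticRegression(max_iter=1000).fit(X_target_labeled, y_target_labeled)

    pred = aux.predict(X_source)
    proba = aux.predict_proba(X_source)
    # Probability assigned to each source instance's own label.
    label_idx = np.searchsorted(aux.classes_, y_source)
    conf_in_label = proba[np.arange(len(y_source)), label_idx]

    alpha = np.ones(len(y_source))
    wrong = np.where(pred != y_source)[0]
    # Among the wrongly predicted instances, zero out the k whose recorded
    # labels receive the lowest probability, treating them as the most
    # "target-unlike" source instances.
    top_k_wrong = wrong[np.argsort(conf_in_label[wrong])[:k]]
    alpha[top_k_wrong] = 0.0
    return alpha
```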
The objective function of this framework consists of four terms, i.e., the above-mentioned three terms, with three tradeoff parameters controlling the balance among the types of instances, and a regularizer controlling the complexity of the model.
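Schematically, and with the notation simplified relative to [27], the objective can be written as

$$\min_{\theta}\;\; \lambda_{T,L}\sum_{i=1}^{n_{T,L}} \ell\!\left(y_i^{T,L}, f_\theta(x_i^{T,L})\right) + \lambda_{T,U}\sum_{i=1}^{n_{T,U}} \sum_{y} \tilde{P}\!\left(y \mid x_i^{T,U}\right) \ell\!\left(y, f_\theta(x_i^{T,U})\right) + \lambda_{S}\sum_{i=1}^{n_{S,L}} \alpha_i \beta_i\, \ell\!\left(y_i^{S,L}, f_\theta(x_i^{S,L})\right) + R(\theta),$$

where $\ell$ denotes the cross-entropy loss, $\tilde{P}(y \mid x_i^{T,U})$ is the estimated conditional distribution (or pseudo-label distribution) for the unlabeled target-domain instances, the $\lambda$'s are the three tradeoff parameters, and $R(\theta)$ is the regularizer.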
4.2 Feature Transformation Strategy
The feature transformation strategy is often adopted in feature-based approaches, which transform each original feature into a new feature representation for transfer learning. The objectives of constructing a new feature representation include minimizing the marginal and the conditional distribution difference, preserving the properties or the potential structures of the data, and finding the correspondence between features. The operations of feature transformation can be divided into three types, i.e., feature augmentation, feature reduction, and feature alignment. Besides, feature reduction can be further divided into several types such as feature mapping, feature clustering, feature selection, and feature encoding. A complete feature transformation process designed in an algorithm may consist of several operations.

TABLE 2
Metrics Adopted in Transfer Learning.

Measurement                            | Related Algorithms
Maximum Mean Discrepancy               | [28], [29], [30], [31], [32], ...
Kullback-Leibler Divergence            | [33], [34], [35], [36], [37], ...
Jensen-Shannon Divergence              | [38], [39], [40], [41], [42], ...
Bregman Divergence                     | [43], [44], [45], [46], [47], ...
Hilbert-Schmidt Independence Criterion | [48], [29], [49], [50], [51], ...
4.2.1 Distribution Difference Metric
One primary objective of feature transformation is to reduce the distribution difference of the source and the target domain instances. Therefore, how to measure the distribution difference or the similarity between domains effectively is an important issue.
The measurement termed Maximum Mean Discrepancy
(MMD) is widely used in the field of transfer learning,
which is formulated as follows [28]:
$$\mathrm{MMD}(X_S, X_T) = \left\| \frac{1}{n_S}\sum_{i=1}^{n_S}\Phi\big(x_i^S\big) - \frac{1}{n_T}\sum_{j=1}^{n_T}\Phi\big(x_j^T\big) \right\|_{\mathcal{H}}^2 .$$
MMD can be easily computed by using the kernel trick. Briefly, MMD quantifies the distribution difference by calculating the distance between the mean values of the instances in a reproducing kernel Hilbert space (RKHS). Note that the above-mentioned KMM actually produces the weights of instances by minimizing the MMD distance between domains.
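As an illustration, a biased empirical estimate of the (squared) MMD can be obtained from kernel evaluations alone, since the squared RKHS norm expands into expectations of kernel values. The RBF kernel and its bandwidth below are assumptions; scikit-learn is used only for the kernel computation.

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

def mmd2_rbf(X_s, X_t, gamma=1.0):
    """Biased empirical estimate of the squared MMD between X_s and X_t.

    By the kernel trick, ||mean_phi(X_s) - mean_phi(X_t)||^2_H expands to
    E[k(s, s')] - 2 E[k(s, t)] + E[k(t, t')], so only kernel values are needed.
    """
    K_ss = rbf_kernel(X_s, X_s, gamma=gamma)
    K_tt = rbf_kernel(X_t, X_t, gamma=gamma)
    K_st = rbf_kernel(X_s, X_t, gamma=gamma)
    return K_ss.mean() - 2.0 * K_st.mean() + K_tt.mean()

# Example: two shifted Gaussians yield a clearly positive MMD estimate.
rng = np.random.default_rng(0)
X_source = rng.normal(0.0, 1.0, size=(200, 5))
X_target = rng.normal(1.5, 1.0, size=(150, 5))
print(mmd2_rbf(X_source, X_target, gamma=0.5))
```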
Table 2 lists some commonly used metrics and the related algorithms. In addition to those in Table 2, there are some other measurement criteria adopted in transfer learning, including the Wasserstein distance [52], [53], Central Moment Discrepancy [54], etc. Some studies focus on optimizing and improving the existing measurements. Take MMD as an example. Gretton et al. proposed a multi-kernel version of MMD, i.e., MK-MMD [55], which takes advantage of multiple kernels. Besides, Yan et al. proposed a weighted version of MMD [56], which attempts to address the issue of class weight bias.
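As a rough illustration of the multi-kernel idea, several RBF kernels with different bandwidths can be combined into a single discrepancy; the fixed (uniform) kernel weights below are an assumed simplification, since MK-MMD [55] actually optimizes these weights. The sketch reuses the mmd2_rbf helper defined in the previous example.

```python
def mmd2_multi_kernel(X_s, X_t, gammas=(0.25, 0.5, 1.0, 2.0), weights=None):
    """Squared MMD under a convex combination of RBF kernels.

    MK-MMD learns the kernel weights; here they are simply fixed (uniform by
    default) to illustrate the multi-kernel construction.
    """
    if weights is None:
        weights = [1.0 / len(gammas)] * len(gammas)
    return sum(w * mmd2_rbf(X_s, X_t, gamma=g) for w, g in zip(weights, gammas))
```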
4.2.2 Feature Augmentation
Feature augmentation operations are widely used in feature transformation, especially in symmetric feature-based approaches. To be more specific, there are several ways to realize feature augmentation such as feature replication and