The training set is given as a set of pairs of corresponding images $\{(s_i, x_i)\}$, where $x_i$ is a natural photo and $s_i$ is a corresponding semantic label map. The $i$th-layer feature extractor of discriminator $D_k$ is denoted $D_k^{(i)}$ (from the input to the $i$th layer of $D_k$). The feature matching loss $L_{FM}(G, D_k)$ is:
\[
L_{FM}(G, D_k) = \mathbb{E}_{(s,x)} \sum_{i=1}^{T} \frac{1}{N_i} \Big[ \big\| D_k^{(i)}(s, x) - D_k^{(i)}(s, G(s)) \big\|_1 \Big], \tag{23}
\]
where $N_i$ is the number of elements in each layer and $T$ denotes the total number of layers. The final objective function of [157] is
\[
\min_G \max_{D_1, D_2, D_3} \sum_{k=1,2,3} \big( L_{GAN}(G, D_k) + \lambda L_{FM}(G, D_k) \big). \tag{24}
\]
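As an illustration of how (23) and (24) combine, the sketch below computes the feature matching term for one discriminator scale and then sums the three scales; the feature lists, the weight lam, and the surrounding multi-scale discriminators are illustrative assumptions rather than the implementation of [157].

\begin{verbatim}
import torch
import torch.nn.functional as F

def feature_matching_loss(feats_real, feats_fake):
    """Eq. (23) for one D_k: mean L1 distance between intermediate
    discriminator features of (s, x) and (s, G(s)).
    feats_real / feats_fake are lists of feature maps, one per layer
    i = 1..T; averaging over all elements of a layer plays the role
    of the 1/N_i factor."""
    loss = 0.0
    for f_real, f_fake in zip(feats_real, feats_fake):
        loss = loss + F.l1_loss(f_fake, f_real.detach())
    return loss

def full_generator_objective(gan_losses, fm_losses, lam=10.0):
    """Eq. (24): sum over the three discriminator scales k = 1, 2, 3."""
    return sum(g + lam * fm for g, fm in zip(gan_losses, fm_losses))
\end{verbatim}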
3.2.3 CycleGAN
Image-to-image translation is a class of graphics and vision problems whose goal is to learn the mapping between an input image and an output image using a training set of aligned image pairs. When paired training data are available, reference [156] can be used for these image-to-image translation tasks. However, reference [156] cannot be used for unpaired data (no input/output pairs), a setting that is well addressed by Cycle-consistent GANs (CycleGAN) [53]. CycleGAN is an important advance for unpaired data: it learns two mappings jointly and constrains them with a cycle-consistency loss. It has been proved that cycle-consistency is an upper bound of the conditional entropy [158]. CycleGAN can also be derived as a special case within the variational inference (VI) framework proposed in [159], which naturally establishes its relationship with approximate Bayesian inference methods.
The basic ideas of DiscoGAN [54] and CycleGAN [53] are nearly the same; the two models were proposed independently at almost the same time. The only difference between CycleGAN [53] and DualGAN [55] is that DualGAN uses the loss format advocated by Wasserstein GAN (WGAN) rather than the sigmoid cross-entropy loss used in CycleGAN.
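As a minimal sketch of the cycle-consistency term shared by CycleGAN, DiscoGAN, and DualGAN, the snippet below penalizes the forward and backward reconstruction errors; the generator modules G (mapping X to Y) and F (mapping Y to X) and the weight lambda_cyc are illustrative assumptions, not the authors' released code.

\begin{verbatim}
import torch
import torch.nn.functional as F_nn

def cycle_consistency_loss(G, F, real_x, real_y, lambda_cyc=10.0):
    """Forward cycle x -> G(x) -> F(G(x)) should reconstruct x;
    backward cycle y -> F(y) -> G(F(y)) should reconstruct y.
    Both reconstruction errors are measured with the L1 norm."""
    forward_term = F_nn.l1_loss(F(G(real_x)), real_x)
    backward_term = F_nn.l1_loss(G(F(real_y)), real_y)
    return lambda_cyc * (forward_term + backward_term)
\end{verbatim}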
3.2.4 f-GAN
The Kullback-Leibler (KL) divergence measures the difference between two given probability distributions. A large class of assorted divergences are the so-called Ali-Silvey distances, also known as the f-divergences [160]. Given two probability distributions $P$ and $Q$ that have, respectively, absolutely continuous density functions $p$ and $q$ with regard to a base measure $dx$ defined on the domain $X$, the f-divergence is defined as
\[
D_f(P \,\|\, Q) = \int_X q(x)\, f\!\left( \frac{p(x)}{q(x)} \right) dx. \tag{25}
\]
Different choices of $f$ recover popular divergences as special cases of the f-divergence. For example, if $f(a) = a \log a$, the f-divergence becomes the KL divergence. The original GAN [3] is a special case of f-GAN [17], which is based on the f-divergence. Reference [17] shows that any f-divergence can be used for training GANs. Furthermore, reference [17] discusses the advantages of different choices of divergence function for both the quality of the produced generative models and the training complexity. Im et al. [161] quantitatively evaluated GANs with the divergences proposed for training. Uehara et al. [162] extend f-GAN further: the f-divergence is directly minimized in the generator step, and the ratio of the distributions of real and generated data is predicted in the discriminator step.
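As a numerical illustration of (25) and the $f(a) = a \log a$ special case, the sketch below evaluates a discrete f-divergence; the two example distributions are assumptions chosen only for illustration.

\begin{verbatim}
import numpy as np

def f_divergence(p, q, f):
    """Discrete analogue of Eq. (25): sum_x q(x) * f(p(x) / q(x))."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return float(np.sum(q * f(p / q)))

p = [0.5, 0.3, 0.2]
q = [0.4, 0.4, 0.2]

# f(a) = a * log(a) recovers the KL divergence KL(P || Q)
kl = f_divergence(p, q, lambda a: a * np.log(a))
print(kl)  # equals sum_x p(x) * log(p(x) / q(x))
\end{verbatim}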
3.2.5 Integral Probability Metrics (IPMs)
Denote by $\mathcal{P}$ the set of all Borel probability measures on a topological space $(M, \mathcal{A})$. The integral probability metric (IPM) [163] between two probability distributions $P \in \mathcal{P}$ and $Q \in \mathcal{P}$ is defined as
\[
\gamma_{\mathcal{F}}(P, Q) = \sup_{f \in \mathcal{F}} \left| \int_M f \, dP - \int_M f \, dQ \right|, \tag{26}
\]
where $\mathcal{F}$ is a class of real-valued bounded measurable functions on $M$. Nonparametric density estimation and convergence rates for GANs under Besov IPM losses are discussed in [164]. IPMs include the RKHS-induced maximum mean discrepancy (MMD) as well as the Wasserstein distance used in Wasserstein GANs (WGAN).
3.2.5.1 Maximum Mean Discrepancy (MMD):
The maximum mean discrepancy (MMD) [165] is a measure of the difference between two distributions $P$ and $Q$, given by the supremum, over a function space $\mathcal{F}$, of the differences between expectations with regard to the two distributions. The MMD is defined by:
\[
\mathrm{MMD}(\mathcal{F}, P, Q) = \sup_{f \in \mathcal{F}} \big( \mathbb{E}_{X \sim P}[f(X)] - \mathbb{E}_{Y \sim Q}[f(Y)] \big). \tag{27}
\]
MMD has been used for deep generative models [166]–[168]
and model criticism [169].
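A common concrete instance of (27) takes $\mathcal{F}$ to be the unit ball of an RKHS, which gives a closed-form kernel estimate of the (squared) MMD. The sketch below is the standard biased empirical estimator with a Gaussian kernel; the bandwidth and the toy samples are illustrative assumptions.

\begin{verbatim}
import numpy as np

def gaussian_kernel(a, b, sigma=1.0):
    """k(a, b) = exp(-||a - b||^2 / (2 sigma^2)) for all pairs of rows."""
    sq_dists = np.sum((a[:, None, :] - b[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq_dists / (2.0 * sigma ** 2))

def mmd2_biased(x, y, sigma=1.0):
    """Biased empirical estimate of squared MMD between samples x ~ P, y ~ Q."""
    return (gaussian_kernel(x, x, sigma).mean()
            + gaussian_kernel(y, y, sigma).mean()
            - 2.0 * gaussian_kernel(x, y, sigma).mean())

rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, size=(500, 2))  # samples from P
y = rng.normal(0.5, 1.0, size=(500, 2))  # samples from a shifted Q
print(mmd2_biased(x, y))                 # grows as P and Q move apart
\end{verbatim}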
3.2.5.2 Wasserstein GAN (WGAN):
WGAN [18] conducted a comprehensive theoretical analysis
of how the Earth Mover (EM) distance behaves in com-
parison with popular probability distances and divergences
such as the total variation (TV) distance, the Kullback-
Leibler (KL) divergence, and the Jensen-Shannon (JS) diver-
gence utilized in the context of learning distributions. The
definition of the EM distance is
\[
W(p_{data}, p_g) = \inf_{\gamma \in \Pi(p_{data}, p_g)} \mathbb{E}_{(x,y) \sim \gamma} \big[ \|x - y\| \big], \tag{28}
\]
where $\Pi(p_{data}, p_g)$ denotes the set of all joint distributions $\gamma(x, y)$ whose marginals are $p_{data}$ and $p_g$, respectively. However, the infimum in (28) is highly intractable. Reference [18] uses the following equation to approximate the EM distance:
\[
\max_{w \in \mathcal{W}} \; \mathbb{E}_{x \sim p_{data}(x)}[f_w(x)] - \mathbb{E}_{z \sim p_z(z)}[f_w(G(z))], \tag{29}
\]
where $\{f_w\}_{w \in \mathcal{W}}$ is a parameterized family of functions that are all $K$-Lipschitz for some $K$, and $f_w$ can be realized by the discriminator $D$. When $D$ is optimized, (29) approximates the EM distance. The aim of $G$ is then to minimize (29) so as to make the generated distribution as close to the real distribution as possible. Therefore, the overall objective function of WGAN is
\[
\begin{aligned}
\min_G \max_{w \in \mathcal{W}} \; & \mathbb{E}_{x \sim p_{data}(x)}[f_w(x)] - \mathbb{E}_{z \sim p_z(z)}[f_w(G(z))] \\
= \min_G \max_{D} \; & \mathbb{E}_{x \sim p_{data}(x)}[D(x)] - \mathbb{E}_{z \sim p_z(z)}[D(G(z))].
\end{aligned} \tag{30}
\]
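As a minimal sketch of how (29) and (30) are used in practice, the snippet below forms the critic and generator losses and enforces the $K$-Lipschitz constraint by weight clipping as in [18]; the modules critic and gen, the clipping range, and the surrounding training loop are illustrative assumptions.

\begin{verbatim}
import torch

def critic_loss(critic, gen, real_x, z):
    """Negation of the inner maximization in Eq. (30): the critic ascends
    E_x[f_w(x)] - E_z[f_w(G(z))], so its minimized loss is the negative."""
    return -(critic(real_x).mean() - critic(gen(z).detach()).mean())

def generator_loss(critic, gen, z):
    """The generator minimizes -E_z[f_w(G(z))], the only part of (30) it affects."""
    return -critic(gen(z)).mean()

def clip_critic_weights(critic, c=0.01):
    """Keep f_w inside a K-Lipschitz family by clipping each weight to [-c, c]."""
    with torch.no_grad():
        for p in critic.parameters():
            p.clamp_(-c, c)
\end{verbatim}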