3. Related work
Apart from their application in computer vision [28, 29, 30, 31, 33, 77, 82, 85], GANs have also been employed in natural language processing [43, 44, 80], medicine [67, 37], and several other fields [16, 51, 60]. Much recent research has accordingly focused on providing ways to avoid the problems discussed in Section 2 [45, 53].
Mode collapse
For instance, Metz et al. [53] unroll the optimization of the discriminator to obtain a better estimate of the optimal discriminator at each step, which remedies mode collapse. However, due to its high computational complexity, this approach does not scale to large datasets. VEEGAN [69] adds a reconstruction term to the bi-directional GAN [15, 14] objective that does not depend on the discriminator; this term can provide a training signal to the generator even when the discriminator does not. PacGAN [45] changes the discriminator to make decisions based on a pack of samples. This change mitigates mode collapse by making it easier for the discriminator to detect a lack of diversity and naturally penalize the generator when mode collapse happens. Lucic et al. [49], motivated by the better performance of supervised GANs, propose using a small set of labels and a semi-supervised method to infer labels for the entire dataset. They further improve performance by utilizing an auxiliary rotation loss similar to that of RotNet [17].
Mode connecting
Based on Theorem 1, to avoid mode connecting one has to either use a latent variable $z$ with a disconnected support, or allow $G_\theta$ to be a discontinuous function [27, 36, 40, 46, 64].
To obtain a disconnected latent space, DeLiGAN [22] samples $z$ from a mixture of Gaussians, while Odena et al. [59] add a discrete dimension to the latent variable. Other methods dissect the latent space post-training using some variant of rejection sampling: for example, Azadi et al. [2] perform rejection sampling based on the discriminator's score, and Tanielian et al. [70] reject samples for which the norm of the generator's Jacobian is higher than a certain threshold.
The discontinuous-generator approach is mostly realized by learning multiple generators, with the primary motivation being to remedy mode collapse, which also reduces mode connecting. Both MGAN [27] and DMWGAN [36] employ K different generators while penalizing them for overlapping with each other. However, these works do not explicitly address the case where some of the data modes are not captured. Also, as shown in Liu et al. [46], MGAN is quite sensitive to the choice of K. By contrast, Self-Conditioned GAN [46] clusters the space using the discriminator's final layer and uses the cluster labels as self-supervised conditions. However, in practice, their clustering does not seem to be reliable (e.g., in terms of NMI on labeled datasets), and the features depend heavily on the choice of the discriminator's architecture. In addition, there is no guarantee that the generators will be guided to generate from their assigned clusters. GAN-Tree [40] uses hierarchical clustering to address continuous multi-modal data, with the number of parameters increasing linearly with the number of clusters. It is thus limited to very few clusters (e.g., 5) and can only capture a few modes.
Another recently expanding direction explores the benefit of using image augmentation techniques for generative modeling. Some works simply augment the data using various perturbations (e.g., random crop, horizontal flipping) [34]. Others [9, 49, 84] incorporate regularization on top of the augmentations; for example, CRGAN [83] enforces consistency across different image perturbations. ADA [32] processes each image using non-leaking augmentations and adaptively tunes the augmentation strength during training. These works are orthogonal to ours and can be combined with our method.
4. Method
This section first describes how GANs are trained on a partitioned space using a mixture of generators/discriminators and the unified objective function required for this goal. We then explain our differentiable space partitioner and how we guide the generators towards the right region. We conclude the section by making connections to supervised GANs, which use an auxiliary classifier [56, 59].
Multi-generator/discriminator objective: Given a partitioning of the space, we train a generator ($G_i$) and a discriminator ($D_i$) for each region. To avoid over-parameterization and to allow information sharing across regions, we employ parameter sharing across the $G_i$'s ($D_i$'s) by tying their parameters except the input (last) layer. The mixture of these generators serves as our main generator $G$. We use the following objective function to train our GANs:
$$\sum_{i}^{k} \pi_i \Big[ \min_{G_i} \max_{D_i} V(D_i, G_i, A_i) \Big] \tag{1}$$

where $A_1, A_2, \ldots, A_k$ is a partitioning of the space, $\pi_i := p_{\text{data}}(x \in A_i)$, and:

$$V(D, G, A) = \mathbb{E}_{x \sim p_{\text{data}}(x \mid x \in A)}[\log D(x)] + \mathbb{E}_{z \sim p_z(z \mid G(z) \in A)}[\log(1 - D(G(z)))] \tag{2}$$
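As a rough illustration of how objective (1) can be computed in practice, the following PyTorch-style sketch sums the per-region GAN losses weighted by $\pi_i$. It is a minimal sketch under our own naming assumptions: `generators`, `discriminators`, `pi`, and the per-region batches are hypothetical placeholders; the networks are kept fully separate here, whereas in our method the $G_i$'s ($D_i$'s) share all parameters except the input (last) layer; and $z$ is drawn from the unconditional prior for each $G_i$, whereas Eq. (2) conditions on $G_i(z) \in A_i$ (handled by the guidance described later in this section).

```python
import torch
import torch.nn.functional as F

def mixture_gan_losses(generators, discriminators, pi, region_batches, z_dim, device):
    """Per-region GAN losses combined as in Eq. (1).

    generators / discriminators: lists of k region-specific networks
    pi: length-k tensor with pi[i] = p_data(x in A_i)
    region_batches: list of k real-image batches, batch i drawn from region A_i
    """
    d_loss_total, g_loss_total = 0.0, 0.0
    for i, (G_i, D_i, x_real) in enumerate(zip(generators, discriminators, region_batches)):
        z = torch.randn(x_real.size(0), z_dim, device=device)
        x_fake = G_i(z)

        real_logits = D_i(x_real)
        fake_logits = D_i(x_fake.detach())

        # V(D_i, G_i, A_i): D_i should output 1 on real samples from A_i and 0 on fakes.
        d_loss = F.binary_cross_entropy_with_logits(real_logits, torch.ones_like(real_logits)) \
               + F.binary_cross_entropy_with_logits(fake_logits, torch.zeros_like(fake_logits))

        # Non-saturating generator loss for the same value function.
        gen_logits = D_i(x_fake)
        g_loss = F.binary_cross_entropy_with_logits(gen_logits, torch.ones_like(gen_logits))

        # Weight each region's loss by pi_i, as in Eq. (1).
        d_loss_total = d_loss_total + pi[i] * d_loss
        g_loss_total = g_loss_total + pi[i] * g_loss
    return d_loss_total, g_loss_total
```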
We motivate this objective by making a connection to the Jensen–Shannon distance (JSD) between the distribution of our mixture of generators and the data distribution in the following theorem.
Theorem 2. Let $P = \sum_i^k \pi_i p_i$, $Q = \sum_i^k \pi_i q_i$, and $A_1, A_2, \ldots, A_k$ be a partitioning of the space, such that the support of each distribution $p_i$ and $q_i$ is $A_i$. Then:

$$JSD(P \,\|\, Q) = \sum_i \pi_i \, JSD(p_i \,\|\, q_i) \tag{3}$$