Algorithm 1: Adversarial training of refiner network R_θ

Input: Sets of synthetic images x_i ∈ X and real images y_j ∈ Y, max number of steps (T), number of discriminator network updates per step (K_d), number of generative network updates per step (K_g).
Output: ConvNet model R_θ.

for t = 1, ..., T do
    for k = 1, ..., K_g do
        1. Sample a mini-batch of synthetic images x_i.
        2. Update θ by taking an SGD step on the mini-batch loss L_R(θ) in (4).
    end
    for k = 1, ..., K_d do
        1. Sample a mini-batch of synthetic images x_i and real images y_j.
        2. Compute x̃_i = R_θ(x_i) with the current θ.
        3. Update φ by taking an SGD step on the mini-batch loss L_D(φ) in (2).
    end
end
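The alternating updates of Algorithm 1 can be sketched directly in code. Below is a minimal PyTorch-style sketch, not the paper's implementation: it assumes R and D are nn.Modules, that D outputs per-patch probabilities of the 'fake' class, and that sample_synthetic and sample_real are hypothetical mini-batch loaders.

import torch

def train(R, D, sample_synthetic, sample_real,
          T=10000, K_g=2, K_d=1, lam=0.1, lr=1e-4, eps=1e-8):
    opt_R = torch.optim.SGD(R.parameters(), lr=lr)
    opt_D = torch.optim.SGD(D.parameters(), lr=lr)
    for t in range(T):
        # K_g refiner updates: minimize L_R(theta) in Eq. (4), phi held fixed
        for _ in range(K_g):
            x = sample_synthetic()            # mini-batch of synthetic images
            x_ref = R(x)
            loss_R = (-torch.log(1 - D(x_ref) + eps).sum()      # realism term
                      + lam * (x_ref - x).abs().sum())          # l1 self-regularization
            opt_R.zero_grad(); loss_R.backward(); opt_R.step()
        # K_d discriminator updates: minimize L_D(phi) in Eq. (2), theta held fixed
        for _ in range(K_d):
            x, y = sample_synthetic(), sample_real()
            x_ref = R(x).detach()             # refined images with current theta
            loss_D = (-torch.log(D(x_ref) + eps).sum()          # refined -> fake
                      - torch.log(1 - D(y) + eps).sum())        # real -> not fake
            opt_D.zero_grad(); loss_D.backward(); opt_D.step()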
Figure 3. Illustration of local adversarial loss. The discriminator network outputs a w × h probability map. The adversarial loss function is the sum of the cross-entropy losses over the local patches.
loss function (1) used in our implementation is:

$$
\mathcal{L}_R(\theta) = -\sum_i \log\bigl(1 - D_\phi(R_\theta(x_i))\bigr)
+ \lambda \left\lVert R_\theta(x_i) - x_i \right\rVert_1, \qquad (4)
$$
where ‖·‖₁ is the ℓ₁ norm. We implement R_θ as a fully convolutional neural net without striding or pooling. This modifies the synthetic image on a pixel level, rather than holistically modifying the image content as in, e.g., a fully connected encoder network, and thus preserves the global structure and the annotations. We learn the refiner and discriminator parameters by minimizing L_R(θ) and L_D(φ) alternately. While updating the parameters of R_θ, we keep φ fixed, and while updating D_φ, we fix θ. We summarize this training procedure in Algorithm 1.
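To make the architectural constraint concrete, here is a minimal sketch of a fully convolutional refiner with no striding or pooling, so the output keeps the input's resolution. The layer count and widths are illustrative assumptions, not the paper's exact architecture.

import torch.nn as nn

class Refiner(nn.Module):
    def __init__(self, channels=1, features=64, n_layers=4):
        super().__init__()
        layers = [nn.Conv2d(channels, features, 3, stride=1, padding=1), nn.ReLU()]
        for _ in range(n_layers):
            layers += [nn.Conv2d(features, features, 3, stride=1, padding=1), nn.ReLU()]
        layers += [nn.Conv2d(features, channels, kernel_size=1)]  # back to image channels
        self.net = nn.Sequential(*layers)

    def forward(self, x):
        # Stride-1, padded convolutions keep the spatial size unchanged,
        # so the image is modified pixel by pixel and the global structure
        # and annotations are preserved.
        return self.net(x)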
2.2. Local Adversarial Loss
Another key requirement for the refiner network is that it should learn to model the real image characteristics without introducing any artifacts.

Figure 4. Illustration of using a history of refined images. See text for details.
When we train a single strong discriminator network, the refiner network tends to over-emphasize certain image features to fool the current discriminator network, leading to drift and the introduction of artifacts. A key observation is that any local patch sampled from the refined image should have statistics similar to those of a real image patch. Therefore, rather than defining a global discriminator network, we can define a discriminator network that classifies all local image patches separately. This not only limits the receptive field, and hence the capacity, of the discriminator network, but also provides many samples per image for learning the discriminator network. It also improves training of the refiner network because we obtain multiple 'realism loss' values per image.
In our implementation, we design the discriminator D to be a fully convolutional network that outputs a w × h probability map, where each entry is the probability of the corresponding local patch belonging to the fake class and w × h is the number of local patches in the image. While training the refiner network, we sum the cross-entropy loss values over the w × h local patches, as illustrated in Figure 3.
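A minimal sketch of this local adversarial loss follows, assuming PyTorch; the strides and layer sizes (which determine w × h) are illustrative assumptions, not the paper's exact configuration.

import torch
import torch.nn as nn

class PatchDiscriminator(nn.Module):
    def __init__(self, channels=1, features=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(channels, features, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(features, features, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(features, 1, kernel_size=1),  # one logit per local patch
            nn.Sigmoid(),                           # probability of the 'fake' class
        )

    def forward(self, x):
        return self.net(x)                          # shape: (batch, 1, w, h)

def local_adversarial_loss(p_fake, is_fake, eps=1e-8):
    # Cross-entropy summed over all entries of the w x h probability map,
    # so each image contributes many 'realism loss' values.
    target = torch.ones_like(p_fake) if is_fake else torch.zeros_like(p_fake)
    bce = -(target * torch.log(p_fake + eps)
            + (1 - target) * torch.log(1 - p_fake + eps))
    return bce.sum()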
2.3. Updating Discriminator using a History of
Refined Images
Another problem with adversarial training is that the discriminator network focuses only on the latest refined images. This may cause (i) divergence of the adversarial training, and (ii) the refiner network re-introducing artifacts that the discriminator has forgotten about. Any refined image generated by the refiner network at any time during the entire training procedure is a 'fake' image for the discriminator. Hence, the discriminator should be able to classify all of these images as fake. Based on this observation, we introduce a method to improve the stability of adversarial training by updating the discriminator using a history of refined images, rather than only those in the current mini-batch. We slightly modify Algorithm 1 to keep a buffer of refined images generated by previous networks. Let B be the size of the buffer and b be the mini-batch size used in Algorithm 1.
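A minimal sketch of such a history buffer is below. The sampling scheme is an assumption here (the text's full description continues past this excerpt): draw half of each size-b discriminator mini-batch from the buffer and half from the current refiner, then overwrite randomly chosen buffer slots with the newly refined images.

import random
import torch

class ImageHistoryBuffer:
    def __init__(self, B):
        self.B = B              # maximum number of stored refined images
        self.images = []        # list of single-image tensors

    def push(self, refined):
        # refined: a (b, C, H, W) batch of newly refined images
        for img in refined.detach():
            if len(self.images) < self.B:
                self.images.append(img)
            else:
                self.images[random.randrange(self.B)] = img  # replace at random

    def sample(self, n):
        # Return n previously refined images (requires at least n in the buffer).
        return torch.stack(random.sample(self.images, n))

A discriminator step would then mix buffer.sample(b // 2) with b // 2 freshly refined images before computing L_D(φ), so the discriminator keeps seeing fakes produced by earlier versions of the refiner.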