frequency crispness, in many cases they nonetheless accu-
rately capture the low frequencies. For problems where this
is the case, we do not need an entirely new framework to
enforce correctness at the low frequencies. L1 will already
do.
This motivates restricting the GAN discriminator to only
model high-frequency structure, relying on an L1 term to
force low-frequency correctness (Eqn. 4). In order to model
high frequencies, it is sufficient to restrict our attention to
the structure in local image patches. Therefore, we design
a discriminator architecture – which we term a PatchGAN
– that only penalizes structure at the scale of patches. This
discriminator tries to classify if each N ×N patch in an im-
age is real or fake. We run this discriminator convolution-
ally across the image, averaging all responses to provide the
ultimate output of D.
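The patch-scoring-and-averaging scheme above can be sketched as follows. This is an illustrative numpy sketch, not the paper's implementation: the real D is a small convolutional network whose receptive field is the N×N patch, whereas here `score_fn`, the patch size N=70, and the stride are placeholder assumptions standing in for the learned network.

```python
import numpy as np

def patch_scores(img, score_fn, N=70, stride=16):
    """Score every N x N patch of a single-channel image.

    score_fn maps an N x N patch to a scalar in [0, 1]
    (1 = "real"); in a learned PatchGAN this role is played
    by a convolutional network run over the whole image.
    """
    H, W = img.shape
    scores = []
    for i in range(0, H - N + 1, stride):
        for j in range(0, W - N + 1, stride):
            scores.append(score_fn(img[i:i + N, j:j + N]))
    return np.array(scores)

def discriminator_output(img, score_fn, N=70, stride=16):
    # The ultimate output of D is the average of all patch responses.
    return patch_scores(img, score_fn, N, stride).mean()
```

Running the scorer convolutionally (with stride) rather than on every pixel offset is what makes the same discriminator applicable to arbitrarily large images.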
In Section 4.4, we demonstrate that N can be much
smaller than the full size of the image and still produce
high quality results. This is advantageous because a smaller
PatchGAN has fewer parameters, runs faster, and can be
applied to arbitrarily large images.
Such a discriminator effectively models the image as a
Markov random field, assuming independence between pix-
els separated by more than a patch diameter. This connec-
tion was previously explored in [38], and is also the com-
mon assumption in models of texture [17, 21] and style
[16, 25, 22, 37]. Therefore, our PatchGAN can be under-
stood as a form of texture/style loss.
3.3. Optimization and inference
To optimize our networks, we follow the standard ap-
proach from [24]: we alternate between one gradient de-
scent step on D, then one step on G. As suggested in
the original GAN paper, rather than training G to minimize
log(1 − D(x, G(x, z))), we instead train to maximize
log D(x, G(x, z)) [24]. In addition, we divide the objec-
tive by 2 while optimizing D, which slows down the rate at
which D learns relative to G. We use minibatch SGD and
apply the Adam solver [32], with a learning rate of 0.0002,
and momentum parameters β1 = 0.5 and β2 = 0.999.
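The per-example objectives used in each alternating step can be written down directly; the following is a minimal numpy sketch of just the GAN loss terms (the networks, the L1 term of Eqn. 4, and the Adam update itself are omitted, and `d_real`/`d_fake` stand for the scalar outputs D(x, y) and D(x, G(x, z))).

```python
import numpy as np

# Solver settings from the text: minibatch SGD with the Adam solver.
ADAM = dict(lr=2e-4, beta1=0.5, beta2=0.999)

def d_loss(d_real, d_fake):
    """Discriminator loss for one alternating step. The factor 1/2
    divides the objective while optimizing D, slowing the rate at
    which D learns relative to G."""
    return -0.5 * (np.log(d_real) + np.log(1.0 - d_fake))

def g_loss(d_fake):
    """Non-saturating generator loss: maximize log D(x, G(x, z))
    by minimizing its negative, rather than minimizing
    log(1 - D(x, G(x, z)))."""
    return -np.log(d_fake)
```

The non-saturating form gives G stronger gradients early in training, when D easily rejects its samples and log(1 − D) is nearly flat.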
At inference time, we run the generator net in exactly
the same manner as during the training phase. This differs
from the usual protocol in that we apply dropout at test time,
and we apply batch normalization [29] using the statistics of
the test batch, rather than aggregated statistics of the train-
ing batch. This approach to batch normalization, when the
batch size is set to 1, has been termed “instance normal-
ization” and has been demonstrated to be effective at im-
age generation tasks [54]. In our experiments, we use batch
sizes between 1 and 10 depending on the experiment.
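At batch size 1, normalizing with test-batch statistics reduces to normalizing each feature map by its own mean and variance, i.e. instance normalization. A minimal numpy sketch (the epsilon value and the absence of learned affine scale/shift parameters are simplifying assumptions):

```python
import numpy as np

def instance_norm(x, eps=1e-5):
    """Normalize each (sample, channel) feature map by its own
    statistics -- equivalent to batch normalization with
    test-batch statistics when the batch size is 1.

    x: array of shape (batch, channels, H, W).
    """
    mean = x.mean(axis=(2, 3), keepdims=True)
    var = x.var(axis=(2, 3), keepdims=True)
    return (x - mean) / np.sqrt(var + eps)
```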
4. Experiments
To explore the generality of conditional GANs, we test
the method on a variety of tasks and datasets, including both
graphics tasks, like photo generation, and vision tasks, like
semantic segmentation:
• Semantic labels↔photo, trained on the Cityscapes
dataset [12].
• Architectural labels→photo, trained on CMP Facades
[45].
• Map↔aerial photo, trained on data scraped from
Google Maps.
• BW→color photos, trained on [51].
• Edges→photo, trained on data from [65] and [60]; bi-
nary edges generated using the HED edge detector [58]
plus postprocessing.
• Sketch→photo: tests edges→photo models on human-
drawn sketches from [19].
• Day→night, trained on [33].
• Thermal→color photos, trained on data from [27].
• Photo with missing pixels→inpainted photo, trained
on Paris StreetView from [14].
Details of training on each of these datasets are provided
in the supplemental materials online. In all cases, the in-
put and output are simply 1-3 channel images. Qualita-
tive results are shown in Figures 8, 9, 11, 10, 13, 14, 15,
16, 17, 18, 19, 20. Several failure cases are highlighted
in Figure 21. More comprehensive results are available at
https://phillipi.github.io/pix2pix/.
Data requirements and speed We note that decent re-
sults can often be obtained even on small datasets. Our fa-
cade training set consists of just 400 images (see results in
Figure 14), and the day to night training set consists of only
91 unique webcams (see results in Figure 15). On datasets
of this size, training can be very fast: for example, the re-
sults shown in Figure 14 took less than two hours of training
on a single Pascal Titan X GPU. At test time, all models run
in well under a second on this GPU.
4.1. Evaluation metrics
Evaluating the quality of synthesized images is an open
and difficult problem [52]. Traditional metrics such as per-
pixel mean-squared error do not assess joint statistics of the
result, and therefore do not measure the very structure that
structured losses aim to capture.
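A toy numpy illustration of this failure mode (our own construction, not from the experiments): two sharp textures drawn from the same distribution are equally plausible outputs, yet under per-pixel MSE the structureless per-pixel mean scores better than either sharp sample.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two equally plausible sharp binary textures from one distribution,
# plus the "blurry" per-pixel mean of that distribution (all zeros).
target = rng.choice([-1.0, 1.0], size=(64, 64))
sample = rng.choice([-1.0, 1.0], size=(64, 64))
blur = np.zeros((64, 64))

def mse(a, b):
    return np.mean((a - b) ** 2)

# Per-pixel MSE prefers the structureless blur to the sharp sample,
# even though the blur matches the target's joint statistics worst.
blur_err = mse(blur, target)      # 1.0 exactly
sample_err = mse(sample, target)  # about 2.0 in expectation
```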
To more holistically evaluate the visual quality of our re-
sults, we employ two tactics. First, we run “real vs. fake”
perceptual studies on Amazon Mechanical Turk (AMT).
For graphics problems like colorization and photo gener-
ation, plausibility to a human observer is often the ultimate
goal. Therefore, we test our map generation, aerial photo
generation, and image colorization using this approach.