desirable to shuttle this information directly across the net. For example, in the case of image colorization, the input and output share the location of prominent edges.
To give the generator a means to circumvent the bottleneck for information like this, we add skip connections, following the general shape of a “U-Net” [34] (Figure 3). Specifically, we add skip connections between each layer i and layer n − i, where n is the total number of layers. Each skip connection simply concatenates all channels at layer i with those at layer n − i.
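For concreteness, the sketch below shows this concatenation pattern in a two-level encoder-decoder, written in PyTorch. It is a minimal illustration of the skip-connection scheme only, under assumed layer names and channel counts; it is not the paper's full architecture.

```python
# Minimal sketch of U-Net-style skip connections (assumed PyTorch
# rendering; layer names and channel counts are illustrative).
import torch
import torch.nn as nn

class TinyUNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.enc1 = nn.Conv2d(3, 64, 4, stride=2, padding=1)    # layer i
        self.enc2 = nn.Conv2d(64, 128, 4, stride=2, padding=1)  # bottleneck
        self.dec2 = nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1)
        # 64 + 64 input channels: the concatenation doubles the width
        self.dec1 = nn.ConvTranspose2d(128, 3, 4, stride=2, padding=1)

    def forward(self, x):
        e1 = torch.relu(self.enc1(x))   # activations at layer i
        e2 = torch.relu(self.enc2(e1))
        d2 = torch.relu(self.dec2(e2))
        # skip connection: concatenate channels of layer i with layer n - i
        d2 = torch.cat([d2, e1], dim=1)
        return torch.tanh(self.dec1(d2))
```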
2.2.2 Markovian discriminator (PatchGAN)
It is well known that the L2 loss – and L1, see Figure 4 – produces blurry results on image generation problems [22]. Although these losses fail to encourage high-frequency crispness, in many cases they nonetheless accurately capture the low frequencies. For problems where this is the case, we do not need an entirely new framework to enforce correctness at the low frequencies. L1 will already do.
This motivates restricting the GAN discriminator to only model high-frequency structure, relying on an L1 term to force low-frequency correctness (Eqn. 4). In order to model high frequencies, it is sufficient to restrict our attention to the structure in local image patches. Therefore, we design a discriminator architecture – which we term a PatchGAN – that only penalizes structure at the scale of patches. This discriminator tries to classify if each N × N patch in an image is real or fake. We run this discriminator convolutionally across the image, averaging all responses to provide the ultimate output of D.
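As a concrete illustration, the following PyTorch sketch implements a patch-level discriminator of this kind. The layer widths follow a common 70 × 70 PatchGAN recipe and should be read as an assumption rather than the paper's exact configuration; intermediate normalization layers are omitted for brevity.

```python
# Hedged sketch of a PatchGAN discriminator: a small fully
# convolutional net whose output is a grid of per-patch logits,
# averaged to give the final output of D. Layer widths are assumed.
import torch
import torch.nn as nn

class PatchDiscriminator(nn.Module):
    def __init__(self, in_ch=6):  # conditional D sees input and output stacked
        super().__init__()
        def block(ci, co, stride):
            return [nn.Conv2d(ci, co, 4, stride=stride, padding=1),
                    nn.LeakyReLU(0.2, inplace=True)]
        self.net = nn.Sequential(
            *block(in_ch, 64, 2),
            *block(64, 128, 2),
            *block(128, 256, 2),
            *block(256, 512, 1),
            nn.Conv2d(512, 1, 4, stride=1, padding=1),  # one logit per patch
        )

    def forward(self, x):
        patch_logits = self.net(x)               # (B, 1, H', W') patch scores
        return patch_logits.mean(dim=[1, 2, 3])  # average all responses
```

Because the network is fully convolutional, the same weights can be run across images of any size; the grid of patch responses simply grows with the input.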
In Section 3.4, we demonstrate that N can be much
smaller than the full size of the image and still produce
high quality results. This is advantageous because a smaller
PatchGAN has fewer parameters, runs faster, and can be
applied on arbitrarily large images.
Such a discriminator effectively models the image as a Markov random field, assuming independence between pixels separated by more than a patch diameter. This connection was previously explored in [25], and is also the common assumption in models of texture [8, 12] and style [7, 15, 13, 24]. Our PatchGAN can therefore be understood as a form of texture/style loss.
2.3. Optimization and inference
To optimize our networks, we follow the standard approach from [14]: we alternate between one gradient descent step on D, then one step on G. We use minibatch SGD and apply the Adam solver [20].
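A minimal sketch of this alternating update is given below, assuming a generator G and discriminator D (for instance, the sketches above), a dataloader `loader` of (input, target) pairs, and an L1 weight `lam`; these names and the specific Adam hyperparameters are illustrative assumptions, not values taken from this section.

```python
# Alternating optimization sketch: one gradient step on D, then one
# on G, using Adam. G, D, loader, and hyperparameters are assumed.
import torch
import torch.nn.functional as F

opt_d = torch.optim.Adam(D.parameters(), lr=2e-4, betas=(0.5, 0.999))
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4, betas=(0.5, 0.999))
lam = 100.0  # assumed weight on the L1 term

for x, y in loader:  # x: input image batch, y: target image batch
    # --- one step on D: real pairs labeled 1, generated pairs labeled 0 ---
    fake = G(x).detach()  # detach so this step does not update G
    d_real = D(torch.cat([x, y], dim=1))
    d_fake = D(torch.cat([x, fake], dim=1))
    loss_d = (F.binary_cross_entropy_with_logits(d_real, torch.ones_like(d_real))
              + F.binary_cross_entropy_with_logits(d_fake, torch.zeros_like(d_fake)))
    opt_d.zero_grad(); loss_d.backward(); opt_d.step()

    # --- one step on G: fool D while staying close to the target in L1 ---
    fake = G(x)
    d_fake = D(torch.cat([x, fake], dim=1))
    loss_g = (F.binary_cross_entropy_with_logits(d_fake, torch.ones_like(d_fake))
              + lam * F.l1_loss(fake, y))
    opt_g.zero_grad(); loss_g.backward(); opt_g.step()
```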
At inference time, we run the generator net in exactly the same manner as during the training phase. This differs from the usual protocol in that we apply dropout at test time, and we apply batch normalization [18] using the statistics of the test batch, rather than aggregated statistics of the training batch. This approach to batch normalization, when the batch size is set to 1, has been termed “instance normalization” and has been demonstrated to be effective at image generation tasks [38]. In our experiments, we use batch size 1 for certain experiments and 4 for others, noting little difference between these two conditions.
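The following few lines sketch this inference protocol in the same assumed PyTorch setting: instead of switching the generator to evaluation mode, it is left in training mode so that dropout remains active and batch normalization uses test-batch statistics (equivalent to instance normalization at batch size 1).

```python
# Inference sketch: keep the generator in training mode so dropout
# stays on and BatchNorm normalizes with test-batch statistics.
# G and x are assumed from the sketches above; x may be a batch of 1.
import torch

G.train()              # deliberately NOT G.eval()
with torch.no_grad():  # no gradients are needed at test time
    y_hat = G(x)
```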
3. Experiments
To explore the generality of conditional GANs, we test
the method on a variety of tasks and datasets, including both
graphics tasks, like photo generation, and vision tasks, like
semantic segmentation:
• Semantic labels↔photo, trained on the Cityscapes
dataset [4].
• Architectural labels→photo, trained on the CMP Facades dataset [31].
• Map↔aerial photo, trained on data scraped from
Google Maps.
• BW→color photos, trained on [35].
• Edges→photo, trained on data from [49] and [44]; binary edges generated using the HED edge detector [42] plus postprocessing.
• Sketch→photo: tests edges→photo models on human-
drawn sketches from [10].
• Day→night, trained on [21].
Details of training on each of these datasets are provided in the Appendix. In all cases, the input and output are simply 1-3 channel images. Qualitative results are shown in Figures 8, 9, 10, 11, 12, 14, 15, 16, and 13. Several failure cases are highlighted in Figure 17. More comprehensive results are available at https://phillipi.github.io/pix2pix/.
Data requirements and speed We note that decent results can often be obtained even on small datasets. Our facade training set consists of just 400 images (see results in Figure 12), and the day to night training set consists of only 91 unique webcams (see results in Figure 13). On datasets of this size, training can be very fast: for example, the results shown in Figure 12 took less than two hours of training on a single Pascal Titan X GPU. At test time, all models run in well under a second on this GPU.
3.1. Evaluation metrics
Evaluating the quality of synthesized images is an open
and difficult problem [36]. Traditional metrics such as per-
pixel mean-squared error do not assess joint statistics of the
result, and therefore do not measure the very structure that
structured losses aim to capture.
In order to more holistically evaluate the visual quality of our results, we employ two tactics. First, we run