desirable to shuttle this information directly across the net. For example, in the case of image colorization, the input and output share the location of prominent edges.
To give the generator a means to circumvent the bottleneck for information like this, we add skip connections, following the general shape of a “U-Net” [34] (Figure 3). Specifically, we add skip connections between each layer i and layer n − i, where n is the total number of layers. Each skip connection simply concatenates all channels at layer i with those at layer n − i.
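For concreteness, the sketch below shows this concatenation pattern in a two-level encoder-decoder, written in PyTorch. It is a minimal illustration of the skip-connection scheme only, under assumed layer names and channel counts; it is not the paper's full architecture.

```python
# Minimal sketch of U-Net-style skip connections (assumed PyTorch
# rendering; layer names and channel counts are illustrative).
import torch
import torch.nn as nn

class TinyUNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.enc1 = nn.Conv2d(3, 64, 4, stride=2, padding=1)    # layer i
        self.enc2 = nn.Conv2d(64, 128, 4, stride=2, padding=1)  # bottleneck
        self.dec2 = nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1)
        # 64 + 64 input channels: the concatenation doubles the width
        self.dec1 = nn.ConvTranspose2d(128, 3, 4, stride=2, padding=1)

    def forward(self, x):
        e1 = torch.relu(self.enc1(x))   # activations at layer i
        e2 = torch.relu(self.enc2(e1))
        d2 = torch.relu(self.dec2(e2))
        # skip connection: concatenate channels of layer i with layer n - i
        d2 = torch.cat([d2, e1], dim=1)
        return torch.tanh(self.dec1(d2))
```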
2.2.2 Markovian discriminator (PatchGAN)
It is well known that the L2 loss – and L1, see Figure 4 – produces blurry results on image generation problems [22]. Although these losses fail to encourage high-frequency crispness, in many cases they nonetheless accurately capture the low frequencies. For problems where this is the case, we do not need an entirely new framework to enforce correctness at the low frequencies. L1 will already do.
This motivates restricting the GAN discriminator to only model high-frequency structure, relying on an L1 term to force low-frequency correctness (Eqn. 4). In order to model high frequencies, it is sufficient to restrict our attention to the structure in local image patches. Therefore, we design a discriminator architecture – which we term a PatchGAN – that only penalizes structure at the scale of patches. This discriminator tries to classify if each N × N patch in an image is real or fake. We run this discriminator convolutionally across the image, averaging all responses to provide the ultimate output of D.
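As a concrete illustration, the following PyTorch sketch implements a patch-level discriminator of this kind. The layer widths follow a common 70 × 70 PatchGAN recipe and should be read as an assumption rather than the paper's exact configuration; intermediate normalization layers are omitted for brevity.

```python
# Hedged sketch of a PatchGAN discriminator: a small fully
# convolutional net whose output is a grid of per-patch logits,
# averaged to give the final output of D. Layer widths are assumed.
import torch
import torch.nn as nn

class PatchDiscriminator(nn.Module):
    def __init__(self, in_ch=6):  # conditional D sees input and output stacked
        super().__init__()
        def block(ci, co, stride):
            return [nn.Conv2d(ci, co, 4, stride=stride, padding=1),
                    nn.LeakyReLU(0.2, inplace=True)]
        self.net = nn.Sequential(
            *block(in_ch, 64, 2),
            *block(64, 128, 2),
            *block(128, 256, 2),
            *block(256, 512, 1),
            nn.Conv2d(512, 1, 4, stride=1, padding=1),  # one logit per patch
        )

    def forward(self, x):
        patch_logits = self.net(x)               # (B, 1, H', W') patch scores
        return patch_logits.mean(dim=[1, 2, 3])  # average all responses
```

Because the network is fully convolutional, the same weights can be run across images of any size; the grid of patch responses simply grows with the input.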
In Section 3.4, we demonstrate that N can be much
smaller than the full size of the image and still produce
high quality results. This is advantageous because a smaller
PatchGAN has fewer parameters, runs faster, and can be
applied on arbitrarily large images.
Such a discriminator effectively models the image as a Markov random field, assuming independence between pixels separated by more than a patch diameter. This connection was previously explored in [25], and is also the common assumption in models of texture [8, 12] and style [7, 15, 13, 24]. Our PatchGAN can therefore be understood as a form of texture/style loss.
2.3. Optimization and inference
To optimize our networks, we follow the standard approach from [14]: we alternate between one gradient descent step on D, then one step on G. We use minibatch SGD and apply the Adam solver [20].
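A minimal sketch of this alternating update is given below, assuming a generator G and discriminator D (for instance, the sketches above), a dataloader `loader` of (input, target) pairs, and an L1 weight `lam`; these names and the specific Adam hyperparameters are illustrative assumptions, not values taken from this section.

```python
# Alternating optimization sketch: one gradient step on D, then one
# on G, using Adam. G, D, loader, and hyperparameters are assumed.
import torch
import torch.nn.functional as F

opt_d = torch.optim.Adam(D.parameters(), lr=2e-4, betas=(0.5, 0.999))
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4, betas=(0.5, 0.999))
lam = 100.0  # assumed weight on the L1 term

for x, y in loader:  # x: input image batch, y: target image batch
    # --- one step on D: real pairs labeled 1, generated pairs labeled 0 ---
    fake = G(x).detach()  # detach so this step does not update G
    d_real = D(torch.cat([x, y], dim=1))
    d_fake = D(torch.cat([x, fake], dim=1))
    loss_d = (F.binary_cross_entropy_with_logits(d_real, torch.ones_like(d_real))
              + F.binary_cross_entropy_with_logits(d_fake, torch.zeros_like(d_fake)))
    opt_d.zero_grad(); loss_d.backward(); opt_d.step()

    # --- one step on G: fool D while staying close to the target in L1 ---
    fake = G(x)
    d_fake = D(torch.cat([x, fake], dim=1))
    loss_g = (F.binary_cross_entropy_with_logits(d_fake, torch.ones_like(d_fake))
              + lam * F.l1_loss(fake, y))
    opt_g.zero_grad(); loss_g.backward(); opt_g.step()
```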
At inference time, we run the generator net in exactly the same manner as during the training phase. This differs from the usual protocol in that we apply dropout at test time, and we apply batch normalization [18] using the statistics of the test batch, rather than aggregated statistics of the training batch. This approach to batch normalization, when the batch size is set to 1, has been termed “instance normalization” and has been demonstrated to be effective at image generation tasks [38]. In our experiments, we use batch size 1 for certain experiments and 4 for others, noting little difference between these two conditions.
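The following few lines sketch this inference protocol in the same assumed PyTorch setting: instead of switching the generator to evaluation mode, it is left in training mode so that dropout remains active and batch normalization uses test-batch statistics (equivalent to instance normalization at batch size 1).

```python
# Inference sketch: keep the generator in training mode so dropout
# stays on and BatchNorm normalizes with test-batch statistics.
# G and x are assumed from the sketches above; x may be a batch of 1.
import torch

G.train()              # deliberately NOT G.eval()
with torch.no_grad():  # no gradients are needed at test time
    y_hat = G(x)
```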
3. Experiments
To explore the generality of conditional GANs, we test
the method on a variety of tasks and datasets, including both
graphics tasks, like photo generation, and vision tasks, like
semantic segmentation:
• Semantic labels↔photo, trained on the Cityscapes
dataset [4].
• Architectural labels→photo, trained on the CMP Facades dataset [31].
• Map↔aerial photo, trained on data scraped from
Google Maps.
• BW→color photos, trained on [35].
• Edges→photo, trained on data from [49] and [44]; binary edges generated using the HED edge detector [42] plus postprocessing.
• Sketch→photo: tests edges→photo models on human-
drawn sketches from [10].
• Day→night, trained on [21].
Details of training on each of these datasets are provided in the Appendix. In all cases, the input and output are simply 1-3 channel images. Qualitative results are shown in Figures 8, 9, 10, 11, 12, 14, 15, 16, and 13. Several failure cases are highlighted in Figure 17. More comprehensive results are available at https://phillipi.github.io/pix2pix/.
Data requirements and speed We note that decent results can often be obtained even on small datasets. Our facade training set consists of just 400 images (see results in Figure 12), and the day to night training set consists of only 91 unique webcams (see results in Figure 13). On datasets of this size, training can be very fast: for example, the results shown in Figure 12 took less than two hours of training on a single Pascal Titan X GPU. At test time, all models run in well under a second on this GPU.
3.1. Evaluation metrics
Evaluating the quality of synthesized images is an open
and difficult problem [36]. Traditional metrics such as per-
pixel mean-squared error do not assess joint statistics of the
result, and therefore do not measure the very structure that
structured losses aim to capture.
In order to more holistically evaluate the visual quality of our results, we employ two tactics. First, we run