Figure 3: Samples from an unconditional diffusion model with classifier guidance to condition
on the class "Pembroke Welsh corgi". Using classifier scale 1.0 (left; FID: 33.0) does not produce
convincing samples in this class, whereas classifier scale 10.0 (right; FID: 12.0) produces much more
class-consistent images.
We can now substitute this into the score function for p(x_t)p(y|x_t):

∇_{x_t} log(p_θ(x_t) p_φ(y|x_t)) = ∇_{x_t} log p_θ(x_t) + ∇_{x_t} log p_φ(y|x_t)    (12)

= −(1/√(1 − ᾱ_t)) ε_θ(x_t) + ∇_{x_t} log p_φ(y|x_t)    (13)
Finally, we can define a new epsilon prediction ε̂(x_t) which corresponds to the score of the joint distribution:

ε̂(x_t) := ε_θ(x_t) − √(1 − ᾱ_t) ∇_{x_t} log p_φ(y|x_t)    (14)
We can then use the exact same sampling procedure as used for regular DDIM, but with the modified noise prediction ε̂(x_t) instead of ε_θ(x_t). Algorithm 2 summarizes the corresponding sampling algorithm.
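As a concrete illustration, the sketch below (our own minimal code, not the paper's implementation) applies the modified noise prediction of Equation 14 with a gradient scale s. The toy Gaussian classifier, `guided_eps` helper, and all constants are assumptions made up for the example; in practice the gradient would come from backpropagating through a trained classifier p_φ(y|x_t).

```python
import numpy as np

def guided_eps(eps, x_t, alpha_bar_t, grad_log_py, scale=1.0):
    """Modified noise prediction of Eq. (14), with the classifier
    gradient scale s of Section 4.3 folded in (hypothetical helper)."""
    return eps - np.sqrt(1.0 - alpha_bar_t) * scale * grad_log_py(x_t)

# Toy stand-in for the classifier gradient: a Gaussian class-conditional
# likelihood around mu_y gives grad_x log p(y|x) = (mu_y - x) / sigma**2.
mu_y, sigma = np.array([1.0, -1.0]), 0.5
grad = lambda x: (mu_y - x) / sigma**2

x_t = np.zeros(2)                    # current noisy sample
eps = np.array([0.1, 0.2])           # unconditional noise prediction
eps_hat = guided_eps(eps, x_t, alpha_bar_t=0.9, grad_log_py=grad, scale=2.0)
```

The guided prediction ε̂ shifts ε against the classifier gradient, so the implied denoised sample moves toward the region the classifier assigns to class y; a DDIM sampler would then consume `eps_hat` exactly where it would otherwise use `eps`.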
4.3 Scaling Classifier Gradients
To apply classifier guidance to a large scale generative task, we train classification models on
ImageNet. Our classifier architecture is simply the downsampling trunk of the UNet model with
an attention pool [49] at the 8x8 layer to produce the final output. We train these classifiers on the
same noising distribution as the corresponding diffusion model, and also add random crops to reduce
overfitting. After training, we incorporate the classifier into the sampling process of the diffusion
model using Equation 10, as outlined by Algorithm 1.
In initial experiments with unconditional ImageNet models, we found it necessary to scale the
classifier gradients by a constant factor larger than 1. When using a scale of 1, we observed that the
classifier assigned reasonable probabilities (around 50%) to the desired classes for the final samples,
but these samples did not match the intended classes upon visual inspection. Scaling up the classifier
gradients remedied this problem, and the class probabilities from the classifier increased to nearly
100%. Figure 3 shows an example of this effect.
To understand the effect of scaling classifier gradients, note that

s · ∇_x log p(y|x) = ∇_x log (1/Z) p(y|x)^s,

where Z is an arbitrary constant. As a result, the conditioning process is still theoretically grounded in a re-normalized classifier distribution proportional to p(y|x)^s. When s > 1, this distribution becomes sharper than p(y|x), since larger values are amplified by the exponent. In other words, using a larger gradient scale focuses more on the modes of the classifier, which is potentially desirable for producing higher fidelity (but less diverse) samples.
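The sharpening effect of the exponent can be checked numerically. Here `temper` is a hypothetical helper (not from the paper) that renormalizes p(y|x)^s over a made-up three-class distribution:

```python
import numpy as np

def temper(p, s):
    """Return the distribution proportional to p**s, re-normalized."""
    q = p ** s
    return q / q.sum()

p = np.array([0.5, 0.3, 0.2])   # toy classifier probabilities
sharp = temper(p, 10.0)
print(sharp.max())              # probability mass concentrates on the mode
```

As s grows, the tempered distribution collapses onto the arg-max class, mirroring the fidelity-versus-diversity trade-off described above.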
In the above derivations, we assumed that the underlying diffusion model was unconditional, modeling p(x). It is also possible to train conditional diffusion models, p(x|y), and use classifier guidance in
the exact same way. Table 4 shows that the sample quality of both unconditional and conditional
models can be greatly improved by classifier guidance. We see that, with a high enough scale, the
guided unconditional model can get quite close to the FID of an unguided conditional model, although
training directly with the class labels still helps. Guiding a conditional model further improves FID.
Table 4 also shows that classifier guidance improves precision at the cost of recall, thus introducing
a trade-off in sample fidelity versus diversity. We explicitly evaluate how this trade-off varies with