[Figure 10: qualitative comparison of bicubic upsampling, LDM-SR, and SR3.]
Figure 10. ImageNet 64→256 super-resolution on ImageNet-Val. LDM-SR has advantages at rendering realistic textures but SR3 can synthesize more coherent fine structures. See appendix for additional samples and cropouts. SR3 results from [72].
[72] and fix the image degradation to a bicubic interpolation with 4×-downsampling, and train on ImageNet following SR3's data processing pipeline. We use the f = 4 autoencoding model pretrained on OpenImages (VQ-reg., cf. Tab. 8) and concatenate the low-resolution conditioning y and the inputs to the UNet, i.e., τ_θ is the identity. Our qualitative and quantitative results (see Fig. 10 and Tab. 5) show competitive performance: LDM-SR outperforms SR3 in FID, while SR3 has a better IS. A simple image regression model achieves the highest PSNR and SSIM scores; however, these metrics do not align well with human perception [106] and favor blurriness over imperfectly aligned high-frequency details [72]. Further, we conduct a user study comparing the pixel baseline with LDM-SR. Following SR3 [72], human subjects were shown a low-res image in between two high-res images and asked for their preference. The results in Tab. 4 affirm the good performance of LDM-SR. PSNR and SSIM can be pushed by using a post-hoc guiding mechanism [15]; we implement this image-based guider via a perceptual loss, see Sec. D.6.
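To make the conditioning mechanism concrete, here is a minimal PyTorch-style sketch of concatenation conditioning with τ_θ as the identity; the wrapper class and argument names are our own illustration, not the paper's code.

```python
import torch
import torch.nn as nn

class LDMSuperResWrapper(nn.Module):
    """Sketch: condition the diffusion UNet by channel-wise concatenation."""

    def __init__(self, unet: nn.Module):
        super().__init__()
        self.unet = unet  # in_channels = latent channels + conditioning channels

    def forward(self, z_t: torch.Tensor, y: torch.Tensor, t: torch.Tensor):
        # y is the low-resolution conditioning image, brought to the latent's
        # spatial resolution before this call; since tau_theta is the identity,
        # no learned encoder is applied to y.
        unet_input = torch.cat([z_t, y], dim=1)  # concatenate along channels
        return self.unet(unet_input, t)          # predict the noise eps_theta
```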
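The post-hoc guider can be sketched similarly; the following is an illustrative reading of image-space guidance with a perceptual loss, where `decode`, `lpips`, and the scale `s` are assumed names rather than the paper's exact formulation (see Sec. D.6 for the latter).

```python
import torch

def perceptual_guidance(z0_pred, y_target, decode, lpips, s=0.1):
    # Differentiate a perceptual distance between the decoded estimate and
    # the (upsampled) conditioning image, then nudge the latent accordingly.
    z0_pred = z0_pred.detach().requires_grad_(True)
    loss = lpips(decode(z0_pred), y_target).sum()
    grad, = torch.autograd.grad(loss, z0_pred)
    return z0_pred - s * grad  # guided estimate handed back to the sampler
```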
| User Study | Pixel-DM (f=1) (SR on ImageNet) | LDM-4 (SR on ImageNet) | LAMA [88] (Inpainting on Places) | LDM-4 (Inpainting on Places) |
|---|---|---|---|---|
| Task 1: Preference vs GT ↑ | 16.0% | 30.4% | 13.6% | 21.0% |
| Task 2: Preference Score ↑ | 29.4% | 70.6% | 31.9% | 68.1% |

Table 4. Task 1: Subjects were shown a ground-truth and a generated image and asked for their preference. Task 2: Subjects had to decide between two generated images. More details in Sec. E.3.6.
Since the bicubic degradation process does not generalize well to images which do not follow this pre-processing, we also train a generic model, LDM-BSR, using more diverse degradations. The results are shown in Sec. D.6.1.
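As a rough illustration of what such diverse degradation can look like, the sketch below randomly composes blur, resampling, noise, and JPEG compression; the operations and parameter ranges are illustrative and not the exact LDM-BSR pipeline.

```python
import io
import random
import numpy as np
from PIL import Image, ImageFilter

def diverse_degrade(img: Image.Image, scale: int = 4) -> Image.Image:
    """Randomly composed degradations for robust SR training (illustrative)."""
    img = img.convert("RGB")
    if random.random() < 0.5:  # random Gaussian blur
        img = img.filter(ImageFilter.GaussianBlur(random.uniform(0.5, 2.0)))
    w, h = img.size            # downsample with a random interpolation kernel
    kernel = random.choice([Image.NEAREST, Image.BILINEAR, Image.BICUBIC])
    img = img.resize((w // scale, h // scale), kernel)
    arr = np.asarray(img, dtype=np.float32)  # additive Gaussian noise
    arr += np.random.normal(0.0, random.uniform(1.0, 10.0), arr.shape)
    img = Image.fromarray(np.clip(arr, 0, 255).astype(np.uint8))
    buf = io.BytesIO()         # random-quality JPEG compression artifacts
    img.save(buf, format="JPEG", quality=random.randint(30, 95))
    buf.seek(0)
    return Image.open(buf).convert("RGB")
```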
| Method | FID ↓ | IS ↑ | PSNR ↑ | SSIM ↑ | N params | samples/s ∗ |
|---|---|---|---|---|---|---|
| Image Regression [72] | 15.2 | 121.1 | 27.9 | 0.801 | 625M | N/A |
| SR3 [72] | 5.2 | 180.1 | 26.4 | 0.762 | 625M | N/A |
| LDM-4 (ours, 100 steps) | 2.8†/4.8‡ | 166.3 | 24.4±3.8 | 0.69±0.14 | 169M | 4.62 |
| *LDM-4* (ours, big, 100 steps) | 2.4†/4.3‡ | 174.9 | 24.7±4.1 | 0.71±0.15 | 552M | 4.5 |
| LDM-4 (ours, 50 steps, guiding) | 4.4†/6.4‡ | 153.7 | 25.8±3.7 | 0.74±0.12 | 184M | 0.38 |

Table 5. ×4 upscaling results on ImageNet-Val. (256²); †: FID features computed on validation split; ‡: FID features computed on train split; ∗: assessed on an NVIDIA A100.
| Model (reg.-type) | train throughput (samples/sec.) | sampling throughput† @256 | sampling throughput† @512 | train+val (hours/epoch) | FID@2k (epoch 6) |
|---|---|---|---|---|---|
| LDM-1 (no first stage) | 0.11 | 0.26 | 0.07 | 20.66 | 24.74 |
| LDM-4 (KL, w/ attn) | 0.32 | 0.97 | 0.34 | 7.66 | 15.21 |
| LDM-4 (VQ, w/ attn) | 0.33 | 0.97 | 0.34 | 7.04 | 14.99 |
| LDM-4 (VQ, w/o attn) | 0.35 | 0.99 | 0.36 | 6.66 | 15.95 |

Table 6. Assessing inpainting efficiency. †: Deviations from Fig. 7 due to varying GPU settings/batch sizes, cf. the supplement.
4.5. Inpainting with Latent Diffusion
Inpainting is the task of filling masked regions of an image with new content, either because parts of the image are corrupted or to replace existing but undesired content within the image. We evaluate how our general approach for conditional image generation compares to more specialized, state-of-the-art approaches for this task. Our evaluation follows the protocol of LaMa [88], a recent inpainting model that introduces a specialized architecture relying on Fast Fourier Convolutions [8]. The exact training & evaluation protocol on Places [108] is described in Sec. E.2.2.
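For orientation, below is a minimal sketch of how an inpainting condition can be assembled in latent space, assuming a first-stage encoder and PyTorch-style names of our own choosing; the conditioning actually used here is specified in Sec. E.2.2.

```python
import torch
import torch.nn.functional as F

def inpainting_unet_input(z_t, image, mask, encoder):
    """Assemble the concatenated UNet input for latent inpainting (sketch).

    z_t:     noisy latent, shape (B, C, H/f, W/f)
    image:   original image, shape (B, 3, H, W)
    mask:    binary mask (1 = region to fill), shape (B, 1, H, W)
    encoder: first-stage encoder mapping images to latents (assumed API)
    """
    masked_image = image * (1.0 - mask)              # erase the masked region
    c_image = encoder(masked_image)                  # encode to latent space
    c_mask = F.interpolate(mask, size=z_t.shape[-2:], mode="nearest")
    return torch.cat([z_t, c_image, c_mask], dim=1)  # channel-wise concat
```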
We first analyze the effect of different design choices for the first stage. In particular, we compare the inpainting efficiency of LDM-1 (i.e., a pixel-based conditional DM) with LDM-4, for both KL and VQ regularizations, as well as VQ-LDM-4 without any attention in the first stage (see Tab. 8), where the latter reduces GPU memory for decoding at high resolutions. For comparability, we fix the number of parameters for all models. Tab. 6 reports the training and sampling throughput at resolutions 256² and 512², the total training time in hours per epoch, and the FID score on the validation split after six epochs. Overall, we observe a speed-up of at least 2.7× between pixel- and latent-based diffusion models (e.g., training throughput rises from 0.11 to at least 0.32 samples/sec.), while improving FID scores by a factor of at least 1.6×.
The comparison with other inpainting approaches in Tab. 7 shows that our model with attention improves the overall image quality, as measured by FID, over that of [88]. LPIPS between the unmasked images and our samples is slightly higher than that of [88]. We attribute this to [88] producing only a single result, which tends to recover more of an average image compared to the diverse results produced by our LDM, cf. Fig. 21. Additionally, in a user study (Tab. 4) human subjects favor our results over those of [88].
Based on these initial results, we also trained a larger diffusion model (big in Tab. 7) in the latent space of the VQ-regularized first stage without attention. Following [15], the UNet of this diffusion model uses attention layers on three levels of its feature hierarchy, the BigGAN [3] residual block for up- and downsampling, and has 387M parameters.
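To give a sense of what this architecture entails, here is a hypothetical configuration sketch for such a UNet; only the three attention levels, the BigGAN-style residual resampling, and the roughly 387M-parameter budget come from the text, while all other values are illustrative guesses.

```python
# Hypothetical config sketch for the "big" latent inpainting UNet.
# Channel widths, multipliers, and head count are assumptions, not the
# paper's published hyperparameters.
big_unet_config = dict(
    in_channels=7,               # noisy latent + masked-image latent + mask
    out_channels=3,              # predicted noise in latent space
    model_channels=256,
    channel_mult=(1, 2, 3, 4),
    attention_levels=(1, 2, 3),  # attention on three feature-map scales
    resblock_updown=True,        # BigGAN residual block for resampling
    num_heads=8,
)
```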