[Figure 4 – Overview of the latent DiffiT framework. An encoder maps the H × W × 3 input to h × w × C feature maps, which are patch-embedded and processed by N latent DiffiT Transformer blocks (Layer Norm → TMSA → Layer Norm → MLP, with residual connections and time and label embeddings); a decoder then unpatchifies the output back to H × W × 3.]
DiffiT ResBlock  We define our final residual cell by combining our proposed DiffiT Transformer block with an additional convolutional layer, in the form

$$\hat{x}_s = \mathrm{Conv}_{3\times 3}\left(\mathrm{Swish}\left(\mathrm{GN}(x_s)\right)\right), \tag{9}$$

$$x_s = \text{DiffiT-Block}(\hat{x}_s, x_t) + x_s, \tag{10}$$

where GN denotes the group normalization operation [73] and DiffiT-Block is the DiffiT Transformer block defined in Eq. 7 and Eq. 8 (shown in Fig. 3). Our residual cell for image-space diffusion models is thus a hybrid cell, combining a convolutional layer with our Transformer block.
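As a concrete illustration, the following PyTorch sketch implements the residual cell of Eq. 9 and Eq. 10. This is a minimal sketch, not the paper's implementation: ToyDiffiTBlock is a simplified stand-in for the TMSA block of Eq. 7 and 8 (plain self-attention over spatial tokens with a projected time embedding added), and the module names, group count, and head count are our illustrative choices.

```python
import torch
import torch.nn as nn

class ToyDiffiTBlock(nn.Module):
    """Simplified stand-in for the DiffiT Transformer block (Eq. 7-8).
    The real block uses time-dependent multi-head self-attention (TMSA);
    here we simply add a projected time embedding to the spatial tokens
    before vanilla self-attention, for illustration only."""
    def __init__(self, channels: int, num_heads: int = 4):
        super().__init__()
        self.norm1 = nn.LayerNorm(channels)
        self.attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(channels)
        self.mlp = nn.Sequential(
            nn.Linear(channels, 4 * channels), nn.GELU(),
            nn.Linear(4 * channels, channels))
        self.time_proj = nn.Linear(channels, channels)

    def forward(self, x_s: torch.Tensor, x_t: torch.Tensor) -> torch.Tensor:
        # x_s: (B, C, H, W) feature map; x_t: (B, C) time embedding.
        b, c, h, w = x_s.shape
        tokens = x_s.flatten(2).transpose(1, 2)            # (B, HW, C)
        tokens = tokens + self.time_proj(x_t).unsqueeze(1)  # inject time
        z = self.norm1(tokens)
        tokens = tokens + self.attn(z, z, z)[0]             # self-attention
        tokens = tokens + self.mlp(self.norm2(tokens))      # MLP
        return tokens.transpose(1, 2).reshape(b, c, h, w)

class DiffiTResBlock(nn.Module):
    """Hybrid residual cell of Eq. 9-10: Conv3x3(Swish(GN(x_s))) feeds
    the Transformer block, with a residual skip from x_s."""
    def __init__(self, channels: int, num_groups: int = 8):
        super().__init__()
        self.gn = nn.GroupNorm(num_groups, channels)  # channels % num_groups == 0
        self.swish = nn.SiLU()                        # SiLU == Swish
        self.conv = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.block = ToyDiffiTBlock(channels)

    def forward(self, x_s: torch.Tensor, x_t: torch.Tensor) -> torch.Tensor:
        x_hat = self.conv(self.swish(self.gn(x_s)))   # Eq. 9
        return self.block(x_hat, x_t) + x_s           # Eq. 10

# Usage with illustrative shapes:
#   y = DiffiTResBlock(64)(torch.randn(2, 64, 16, 16), torch.randn(2, 64))
```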
3.2.2 Latent Space
Recently, latent diffusion models have been shown to be effective for generating high-quality, high-resolution images [56, 69]. In Fig. 4, we show the architecture of the latent DiffiT model. We first encode the images using a pre-trained variational autoencoder network [56]. The feature maps are then converted into non-overlapping patches and projected into a new embedding space. Similar to the DiT model [52], we use a vision transformer, without upsampling or downsampling layers, as the denoising network in the latent space. In addition, we utilize three-channel classifier-free guidance to improve the quality of generated samples. The final layer of the architecture is a simple linear layer that decodes the output.
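The latent denoiser described above can be summarized in a short sketch: patchify the VAE latents, apply a stack of isotropic Transformer blocks with no up- or down-sampling, and decode each token with a single linear layer before unpatchifying. This is a minimal sketch under stated assumptions: the hyperparameters (patch size 2, width 768, depth 12) are illustrative rather than the paper's configuration, a plain nn.TransformerEncoderLayer stands in for the TMSA block, and conditioning by adding a single embedding to every token is a simplification of the time/label embedding pathway shown in Fig. 4.

```python
import torch
import torch.nn as nn

class LatentDiffiT(nn.Module):
    """Sketch of the latent DiffiT denoiser: patchify VAE latents, run N
    isotropic Transformer blocks, and decode patches with one linear layer."""
    def __init__(self, latent_channels=4, patch=2, dim=768, depth=12, heads=12):
        super().__init__()
        self.patch = patch
        # Non-overlapping patch embedding via strided convolution.
        self.embed = nn.Conv2d(latent_channels, dim, kernel_size=patch, stride=patch)
        self.blocks = nn.ModuleList([
            nn.TransformerEncoderLayer(dim, heads, 4 * dim,
                                       batch_first=True, norm_first=True)
            for _ in range(depth)])
        # Final linear layer decodes each token back to a latent patch.
        self.decode = nn.Linear(dim, latent_channels * patch * patch)

    def forward(self, z_t: torch.Tensor, cond: torch.Tensor) -> torch.Tensor:
        # z_t: (B, C, h, w) noisy VAE latents; cond: (B, dim) time/label embedding.
        b, c, h, w = z_t.shape
        tok = self.embed(z_t).flatten(2).transpose(1, 2)  # (B, L, dim)
        tok = tok + cond.unsqueeze(1)                     # simplified conditioning
        for blk in self.blocks:                           # no up/down-sampling
            tok = blk(tok)
        out = self.decode(tok)                            # (B, L, C*p*p)
        # Unpatchify back to (B, C, h, w).
        p = self.patch
        out = out.reshape(b, h // p, w // p, c, p, p)
        out = out.permute(0, 3, 1, 4, 2, 5).reshape(b, c, h, w)
        return out
```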
4. Results
4.1. Image Space
We have trained the proposed DiffiT model on the CIFAR-10 and FFHQ-64 datasets. In Table 1, we compare the performance of our model against a variety of generative models, including other score-based diffusion models as well as GANs and VAEs. DiffiT achieves a state-of-the-art image generation FID score of 1.95 on the CIFAR-10 dataset, outperforming state-of-the-art diffusion models such as EDM [34] and LSGM [69]. Compared to two recent ViT-based diffusion models, our proposed DiffiT significantly outperforms the U-ViT [7] and GenViT [76] models in terms of FID score on the CIFAR-10 dataset. Additionally, DiffiT significantly outperforms the EDM [34] and DDPM++ [66] models, in both the VP and VE training configurations, in terms of FID score.
Table 1 – FID performance comparison against various generative approaches on the CIFAR-10 and FFHQ-64 datasets. VP and VE denote Variance Preserving and Variance Exploding, respectively. DiffiT outperforms competing approaches, sometimes by large margins.

Method                 | Class     | Space Type | CIFAR-10 (32×32) | FFHQ (64×64)
NVAE [68]              | VAE       | -          | 23.50            | -
GenViT [76]            | Diffusion | Image      | 20.20            | -
AutoGAN [22]           | GAN       | -          | 12.40            | -
TransGAN [31]          | GAN       | -          | 9.26             | -
INDM [38]              | Diffusion | Latent     | 3.09             | -
DDPM++ (VE) [66]       | Diffusion | Image      | 3.77             | 25.95
U-ViT [7]              | Diffusion | Image      | 3.11             | -
DDPM++ (VP) [66]       | Diffusion | Image      | 3.01             | 3.39
StyleGAN2 w/ ADA [33]  | GAN       | -          | 2.92             | -
LSGM [69]              | Diffusion | Latent     | 2.01             | -
EDM (VE) [34]          | Diffusion | Image      | 2.01             | 2.53
EDM (VP) [34]          | Diffusion | Image      | 1.99             | 2.39
DiffiT (Ours)          | Diffusion | Image      | 1.95             | 2.22
In Fig. 5, we illustrate images generated on the FFHQ-64 dataset. Please see the supplementary materials for CIFAR-10 generated images.
4.2. Latent Space
We have also trained the latent DiffiT model on the ImageNet-256 and ImageNet-512 datasets. In Table 2, we present a comparison against other approaches using various image quality metrics. For this comparison, we select the best reported metrics from each model, which may include techniques such as classifier-free guidance. On the ImageNet-256 dataset, the latent DiffiT model outperforms competing approaches such as MDT-G [21], DiT-XL/2-G [52], and StyleGAN-XL [61] in terms of FID score, setting a new SOTA FID score of 1.73. In terms of other metrics such as IS and sFID, the latent DiffiT model shows competitive performance, indicating the effectiveness of the proposed time-dependent self-attention. On the ImageNet-512 dataset, the latent DiffiT model significantly outperforms DiT-XL/2-G in terms of both FID and Inception Score (IS). Although StyleGAN-XL [61] shows better performance in terms of FID and IS, GAN-based models are known to suffer from issues such as low diversity that are not captured by the FID score. These issues are reflected in the sub-optimal performance of StyleGAN-XL in terms of both Precision and Recall. In addition, in Fig. 6, we show a visualization of uncurated images generated on the ImageNet-256 and ImageNet-512 datasets. We observe that the latent DiffiT model is capable of generating diverse, high-quality images across different classes.
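Several of the strongest entries in Table 2 (the -G variants) sample with classifier-free guidance, as does latent DiffiT (Sec. 3.2.2). For reference, the following is a minimal sketch of a standard guided prediction step for an ε-prediction denoiser; the model call signature and null label are hypothetical assumptions, and this shows the standard formulation rather than the three-channel variant used in DiffiT.

```python
import torch

@torch.no_grad()
def guided_eps(model, z_t, t, y, null_y, scale: float) -> torch.Tensor:
    # Standard classifier-free guidance:
    #   eps = eps_uncond + scale * (eps_cond - eps_uncond)
    # `model(z_t, t, y)` returning predicted noise is an assumed signature.
    eps_cond = model(z_t, t, y)          # conditional prediction
    eps_uncond = model(z_t, t, null_y)   # unconditional (null-label) prediction
    return eps_uncond + scale * (eps_cond - eps_uncond)
```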
5. Ablation
In this section, we present additional ablation studies to provide further insights into DiffiT. We address four main questions: