[Figure 4 – Overview of the latent DiffiT framework. An encoder maps the H × W × 3 input to h × w × C feature maps, which are patch-embedded and processed by N latent DiffiT Transformer blocks (Layer Norm → TMSA → Layer Norm → MLP, with residual connections and time and label embeddings); a decoder then unpatchifies the output back to H × W × 3.]
DiffiT ResBlock  We define our final residual cell by combining our proposed DiffiT Transformer block with an additional convolutional layer, in the form

$$\hat{x}_s = \mathrm{Conv}_{3\times 3}\left(\mathrm{Swish}\left(\mathrm{GN}(x_s)\right)\right), \tag{9}$$

$$x_s = \text{DiffiT-Block}(\hat{x}_s, x_t) + x_s, \tag{10}$$

where GN denotes the group normalization operation [73] and DiffiT-Block is the DiffiT Transformer block defined in Eq. 7 and Eq. 8 (shown in Fig. 3). Our residual cell for image-space diffusion models is thus a hybrid cell, combining a convolutional layer with our Transformer block.
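As a concrete illustration, the following PyTorch sketch implements the residual cell of Eq. 9 and Eq. 10. This is a minimal sketch, not the paper's implementation: ToyDiffiTBlock is a simplified stand-in for the TMSA block of Eq. 7 and 8 (plain self-attention over spatial tokens with a projected time embedding added), and the module names, group count, and head count are our illustrative choices.

```python
import torch
import torch.nn as nn

class ToyDiffiTBlock(nn.Module):
    """Simplified stand-in for the DiffiT Transformer block (Eq. 7-8).
    The real block uses time-dependent multi-head self-attention (TMSA);
    here we simply add a projected time embedding to the spatial tokens
    before vanilla self-attention, for illustration only."""
    def __init__(self, channels: int, num_heads: int = 4):
        super().__init__()
        self.norm1 = nn.LayerNorm(channels)
        self.attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(channels)
        self.mlp = nn.Sequential(
            nn.Linear(channels, 4 * channels), nn.GELU(),
            nn.Linear(4 * channels, channels))
        self.time_proj = nn.Linear(channels, channels)

    def forward(self, x_s: torch.Tensor, x_t: torch.Tensor) -> torch.Tensor:
        # x_s: (B, C, H, W) feature map; x_t: (B, C) time embedding.
        b, c, h, w = x_s.shape
        tokens = x_s.flatten(2).transpose(1, 2)            # (B, HW, C)
        tokens = tokens + self.time_proj(x_t).unsqueeze(1)  # inject time
        z = self.norm1(tokens)
        tokens = tokens + self.attn(z, z, z)[0]             # self-attention
        tokens = tokens + self.mlp(self.norm2(tokens))      # MLP
        return tokens.transpose(1, 2).reshape(b, c, h, w)

class DiffiTResBlock(nn.Module):
    """Hybrid residual cell of Eq. 9-10: Conv3x3(Swish(GN(x_s))) feeds
    the Transformer block, with a residual skip from x_s."""
    def __init__(self, channels: int, num_groups: int = 8):
        super().__init__()
        self.gn = nn.GroupNorm(num_groups, channels)  # channels % num_groups == 0
        self.swish = nn.SiLU()                        # SiLU == Swish
        self.conv = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.block = ToyDiffiTBlock(channels)

    def forward(self, x_s: torch.Tensor, x_t: torch.Tensor) -> torch.Tensor:
        x_hat = self.conv(self.swish(self.gn(x_s)))   # Eq. 9
        return self.block(x_hat, x_t) + x_s           # Eq. 10

# Usage with illustrative shapes:
#   y = DiffiTResBlock(64)(torch.randn(2, 64, 16, 16), torch.randn(2, 64))
```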
3.2.2 Latent Space
Recently, latent diffusion models have been shown to be effective for generating high-quality, high-resolution images [56, 69]. In Fig. 4, we show the architecture of the latent DiffiT model. We first encode the images using a pre-trained variational autoencoder network [56]. The feature maps are then converted into non-overlapping patches and projected into a new embedding space. Similar to the DiT model [52], we use a vision transformer, without upsampling or downsampling layers, as the denoising network in the latent space. In addition, we utilize three-channel classifier-free guidance to improve the quality of generated samples. The final layer of the architecture is a simple linear layer that decodes the output.
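The latent denoiser described above can be summarized in a short sketch: patchify the VAE latents, apply a stack of isotropic Transformer blocks with no up- or down-sampling, and decode each token with a single linear layer before unpatchifying. This is a minimal sketch under stated assumptions: the hyperparameters (patch size 2, width 768, depth 12) are illustrative rather than the paper's configuration, a plain nn.TransformerEncoderLayer stands in for the TMSA block, and conditioning by adding a single embedding to every token is a simplification of the time/label embedding pathway shown in Fig. 4.

```python
import torch
import torch.nn as nn

class LatentDiffiT(nn.Module):
    """Sketch of the latent DiffiT denoiser: patchify VAE latents, run N
    isotropic Transformer blocks, and decode patches with one linear layer."""
    def __init__(self, latent_channels=4, patch=2, dim=768, depth=12, heads=12):
        super().__init__()
        self.patch = patch
        # Non-overlapping patch embedding via strided convolution.
        self.embed = nn.Conv2d(latent_channels, dim, kernel_size=patch, stride=patch)
        self.blocks = nn.ModuleList([
            nn.TransformerEncoderLayer(dim, heads, 4 * dim,
                                       batch_first=True, norm_first=True)
            for _ in range(depth)])
        # Final linear layer decodes each token back to a latent patch.
        self.decode = nn.Linear(dim, latent_channels * patch * patch)

    def forward(self, z_t: torch.Tensor, cond: torch.Tensor) -> torch.Tensor:
        # z_t: (B, C, h, w) noisy VAE latents; cond: (B, dim) time/label embedding.
        b, c, h, w = z_t.shape
        tok = self.embed(z_t).flatten(2).transpose(1, 2)  # (B, L, dim)
        tok = tok + cond.unsqueeze(1)                     # simplified conditioning
        for blk in self.blocks:                           # no up/down-sampling
            tok = blk(tok)
        out = self.decode(tok)                            # (B, L, C*p*p)
        # Unpatchify back to (B, C, h, w).
        p = self.patch
        out = out.reshape(b, h // p, w // p, c, p, p)
        out = out.permute(0, 3, 1, 4, 2, 5).reshape(b, c, h, w)
        return out
```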
4. Results
4.1. Image Space
We have trained the proposed DiffiT model on the CIFAR-10 and FFHQ-64 datasets. In Table 1, we compare the performance of our model against a variety of generative models, including other score-based diffusion models as well as GANs and VAEs. DiffiT achieves a state-of-the-art image generation FID score of 1.95 on the CIFAR-10 dataset, outperforming state-of-the-art diffusion models such as EDM [34] and LSGM [69]. Compared to two recent ViT-based diffusion models, our proposed DiffiT significantly outperforms the U-ViT [7] and GenViT [76] models in terms of FID score on the CIFAR-10 dataset. Additionally, DiffiT significantly outperforms the EDM [34] and DDPM++ [66] models, in both the VP and VE training configurations, in terms of FID score.
Table 1 – FID performance comparison against various generative approaches on the CIFAR-10 and FFHQ-64 datasets. VP and VE denote Variance Preserving and Variance Exploding, respectively. DiffiT outperforms competing approaches, sometimes by large margins.

Method                 | Class     | Space Type | CIFAR-10 (32×32) | FFHQ (64×64)
NVAE [68]              | VAE       | -          | 23.50            | -
GenViT [76]            | Diffusion | Image      | 20.20            | -
AutoGAN [22]           | GAN       | -          | 12.40            | -
TransGAN [31]          | GAN       | -          | 9.26             | -
INDM [38]              | Diffusion | Latent     | 3.09             | -
DDPM++ (VE) [66]       | Diffusion | Image      | 3.77             | 25.95
U-ViT [7]              | Diffusion | Image      | 3.11             | -
DDPM++ (VP) [66]       | Diffusion | Image      | 3.01             | 3.39
StyleGAN2 w/ ADA [33]  | GAN       | -          | 2.92             | -
LSGM [69]              | Diffusion | Latent     | 2.01             | -
EDM (VE) [34]          | Diffusion | Image      | 2.01             | 2.53
EDM (VP) [34]          | Diffusion | Image      | 1.99             | 2.39
DiffiT (Ours)          | Diffusion | Image      | 1.95             | 2.22
In Fig. 5, we illustrate images generated on the FFHQ-64 dataset. Please see the supplementary materials for CIFAR-10 generated images.
4.2. Latent Space
We have also trained the latent DiffiT model on the ImageNet-256 and ImageNet-512 datasets. In Table 2, we present a comparison against other approaches using various image quality metrics. For this comparison, we select the best reported metrics from each model, which may include techniques such as classifier-free guidance. On the ImageNet-256 dataset, the latent DiffiT model outperforms competing approaches such as MDT-G [21], DiT-XL/2-G [52], and StyleGAN-XL [61] in terms of FID score, setting a new SOTA FID score of 1.73. In terms of other metrics such as IS and sFID, the latent DiffiT model shows competitive performance, indicating the effectiveness of the proposed time-dependent self-attention. On the ImageNet-512 dataset, the latent DiffiT model significantly outperforms DiT-XL/2-G in terms of both FID and Inception Score (IS). Although StyleGAN-XL [61] shows better performance in terms of FID and IS, GAN-based models are known to suffer from issues such as low diversity that are not captured by the FID score. These issues are reflected in the sub-optimal performance of StyleGAN-XL in terms of both Precision and Recall. In addition, in Fig. 6, we show a visualization of uncurated images generated on the ImageNet-256 and ImageNet-512 datasets. We observe that the latent DiffiT model is capable of generating diverse, high-quality images across different classes.
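Several of the strongest entries in Table 2 (the -G variants) sample with classifier-free guidance, as does latent DiffiT (Sec. 3.2.2). For reference, the following is a minimal sketch of a standard guided prediction step for an ε-prediction denoiser; the model call signature and null label are hypothetical assumptions, and this shows the standard formulation rather than the three-channel variant used in DiffiT.

```python
import torch

@torch.no_grad()
def guided_eps(model, z_t, t, y, null_y, scale: float) -> torch.Tensor:
    # Standard classifier-free guidance:
    #   eps = eps_uncond + scale * (eps_cond - eps_uncond)
    # `model(z_t, t, y)` returning predicted noise is an assumed signature.
    eps_cond = model(z_t, t, y)          # conditional prediction
    eps_uncond = model(z_t, t, null_y)   # unconditional (null-label) prediction
    return eps_uncond + scale * (eps_cond - eps_uncond)
```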
5. Ablation
In this section, we present additional ablation studies to provide further insights into DiffiT. We address four main questions: