Figure 3: Samples from an unconditional diffusion model with classifier guidance to condition
on the class "Pembroke Welsh corgi". Using classifier scale 1.0 (left; FID: 33.0) does not produce
convincing samples in this class, whereas classifier scale 10.0 (right; FID: 12.0) produces much more
class-consistent images.
We can now substitute this into the score function for p(x_t)p(y|x_t):

∇_{x_t} log(p_θ(x_t) p_φ(y|x_t)) = ∇_{x_t} log p_θ(x_t) + ∇_{x_t} log p_φ(y|x_t)    (12)

= −(1/√(1 − ᾱ_t)) ε_θ(x_t) + ∇_{x_t} log p_φ(y|x_t)    (13)
Finally, we can define a new epsilon prediction ε̂(x_t) which corresponds to the score of the joint distribution:

ε̂(x_t) := ε_θ(x_t) − √(1 − ᾱ_t) ∇_{x_t} log p_φ(y|x_t)    (14)
We can then use the exact same sampling procedure as used for regular DDIM, but with the modified noise prediction ε̂(x_t) instead of ε_θ(x_t). Algorithm 2 summarizes the corresponding sampling algorithm.
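As a concrete illustration, the sketch below (our own minimal code, not the paper's implementation) applies the modified noise prediction of Equation 14 with a gradient scale s. The toy Gaussian classifier, `guided_eps` helper, and all constants are assumptions made up for the example; in practice the gradient would come from backpropagating through a trained classifier p_φ(y|x_t).

```python
import numpy as np

def guided_eps(eps, x_t, alpha_bar_t, grad_log_py, scale=1.0):
    """Modified noise prediction of Eq. (14), with the classifier
    gradient scale s of Section 4.3 folded in (hypothetical helper)."""
    return eps - np.sqrt(1.0 - alpha_bar_t) * scale * grad_log_py(x_t)

# Toy stand-in for the classifier gradient: a Gaussian class-conditional
# likelihood around mu_y gives grad_x log p(y|x) = (mu_y - x) / sigma**2.
mu_y, sigma = np.array([1.0, -1.0]), 0.5
grad = lambda x: (mu_y - x) / sigma**2

x_t = np.zeros(2)                    # current noisy sample
eps = np.array([0.1, 0.2])           # unconditional noise prediction
eps_hat = guided_eps(eps, x_t, alpha_bar_t=0.9, grad_log_py=grad, scale=2.0)
```

The guided prediction ε̂ shifts ε against the classifier gradient, so the implied denoised sample moves toward the region the classifier assigns to class y; a DDIM sampler would then consume `eps_hat` exactly where it would otherwise use `eps`.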
4.3 Scaling Classifier Gradients
To apply classifier guidance to a large scale generative task, we train classification models on
ImageNet. Our classifier architecture is simply the downsampling trunk of the UNet model with
an attention pool [49] at the 8x8 layer to produce the final output. We train these classifiers on the
same noising distribution as the corresponding diffusion model, and also add random crops to reduce
overfitting. After training, we incorporate the classifier into the sampling process of the diffusion
model using Equation 10, as outlined by Algorithm 1.
In initial experiments with unconditional ImageNet models, we found it necessary to scale the
classifier gradients by a constant factor larger than 1. When using a scale of 1, we observed that the
classifier assigned reasonable probabilities (around 50%) to the desired classes for the final samples,
but these samples did not match the intended classes upon visual inspection. Scaling up the classifier
gradients remedied this problem, and the class probabilities from the classifier increased to nearly
100%. Figure 3 shows an example of this effect.
To understand the effect of scaling classifier gradients, note that

s · ∇_x log p(y|x) = ∇_x log (1/Z) p(y|x)^s,

where Z is an arbitrary constant. As a result, the conditioning process is still theoretically grounded in a re-normalized classifier distribution proportional to p(y|x)^s. When s > 1, this distribution becomes sharper than p(y|x), since larger values are amplified by the exponent. In other words, using a larger gradient scale focuses more on the modes of the classifier, which is potentially desirable for producing higher fidelity (but less diverse) samples.
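The sharpening effect of the exponent can be checked numerically. Here `temper` is a hypothetical helper (not from the paper) that renormalizes p(y|x)^s over a made-up three-class distribution:

```python
import numpy as np

def temper(p, s):
    """Return the distribution proportional to p**s, re-normalized."""
    q = p ** s
    return q / q.sum()

p = np.array([0.5, 0.3, 0.2])   # toy classifier probabilities
sharp = temper(p, 10.0)
print(sharp.max())              # probability mass concentrates on the mode
```

As s grows, the tempered distribution collapses onto the arg-max class, mirroring the fidelity-versus-diversity trade-off described above.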
In the above derivations, we assumed that the underlying diffusion model was unconditional, modeling p(x). It is also possible to train conditional diffusion models, p(x|y), and use classifier guidance in
the exact same way. Table 4 shows that the sample quality of both unconditional and conditional
models can be greatly improved by classifier guidance. We see that, with a high enough scale, the
guided unconditional model can get quite close to the FID of an unguided conditional model, although
training directly with the class labels still helps. Guiding a conditional model further improves FID.
Table 4 also shows that classifier guidance improves precision at the cost of recall, thus introducing
a trade-off in sample fidelity versus diversity. We explicitly evaluate how this trade-off varies with