During sampling, we apply this transition along an increasing sequence $\lambda_{\min} = \lambda_1 < \cdots < \lambda_T = \lambda_{\max}$ for $T$ timesteps; in other words, we follow the discrete time ancestral sampler of Sohl-Dickstein et al. (2015); Ho et al. (2020). If the model $x_\theta$ is correct, then as $T \to \infty$, we obtain samples from an SDE whose sample paths are distributed as $p(z)$ (Song et al., 2021b), and we use $p_\theta(z)$ to denote the continuous time model distribution. The variance is a log-space interpolation of $\tilde{\sigma}^2_{\lambda'|\lambda}$ and $\sigma^2_{\lambda|\lambda'}$ as suggested by Nichol & Dhariwal (2021); we found it effective to use a constant hyperparameter $v$ rather than a learned $z_\lambda$-dependent $v$. Note that the variances simplify to $\tilde{\sigma}^2_{\lambda'|\lambda}$ as $\lambda' \to \lambda$, so $v$ has an effect only when sampling with non-infinitesimal timesteps, as done in practice.
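For concreteness, the following NumPy sketch implements one step of this ancestral sampler. It assumes the variance-preserving specification $\alpha_\lambda^2 = \mathrm{sigmoid}(\lambda)$, $\sigma_\lambda^2 = 1 - \alpha_\lambda^2$ and the forward-process posterior of Ho et al. (2020), which this passage references but does not restate; `model_x` and the value `v = 0.3` are illustrative placeholders.

```python
import numpy as np

def alpha2(lam):
    """alpha_lambda^2 under the variance-preserving process (an assumption:
    alpha^2 = sigmoid(lambda), so sigma^2 = 1 - alpha^2)."""
    return 1.0 / (1.0 + np.exp(-lam))

def ancestral_step(z, lam, lam_next, model_x, v=0.3, rng=None):
    """One transition z_lambda -> z_lambda' for lam_next > lam (less noise)."""
    if rng is None:
        rng = np.random.default_rng()
    a, a_next = np.sqrt(alpha2(lam)), np.sqrt(alpha2(lam_next))
    s2, s2_next = 1.0 - alpha2(lam), 1.0 - alpha2(lam_next)
    x_hat = model_x(z, lam)                 # plug-in estimate of x from z_lambda
    r = np.exp(lam - lam_next)              # e^{lambda - lambda'} in (0, 1)
    # Mean of q(z_lambda' | z_lambda, x) with x replaced by x_hat (Ho et al., 2020).
    mean = r * (a_next / a) * z + (1.0 - r) * a_next * x_hat
    # Log-space interpolation of the two variances with constant hyperparameter v.
    var_reverse = (1.0 - r) * s2_next       # sigma_tilde^2_{lambda'|lambda}
    var_forward = (1.0 - r) * s2            # sigma^2_{lambda|lambda'}
    var = var_reverse ** (1.0 - v) * var_forward ** v
    return mean + np.sqrt(var) * rng.standard_normal(z.shape)
```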
The reverse process mean comes from an estimate $x_\theta(z_\lambda) \approx x$ plugged into $q(z_{\lambda'}|z_\lambda, x)$ (Ho et al., 2020; Kingma et al., 2021) ($x_\theta$ also receives $\lambda$ as input, but we suppress this to keep our notation clean). We parameterize $x_\theta$ in terms of $\epsilon$-prediction (Ho et al., 2020): $x_\theta(z_\lambda) = (z_\lambda - \sigma_\lambda \epsilon_\theta(z_\lambda))/\alpha_\lambda$, and we train on the objective
$$\mathbb{E}_{\epsilon,\lambda}\big[\|\epsilon_\theta(z_\lambda) - \epsilon\|_2^2\big] \qquad (5)$$
where $\epsilon \sim \mathcal{N}(0, I)$, $z_\lambda = \alpha_\lambda x + \sigma_\lambda \epsilon$, and $\lambda$ is drawn from a distribution $p(\lambda)$ over $[\lambda_{\min}, \lambda_{\max}]$.
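As a sketch, Eq. (5) can be estimated per example as follows. `model_eps` stands in for $\epsilon_\theta$, `sample_lambda` for a sampler from $p(\lambda)$ (one is sketched below), and the $\alpha_\lambda$ formula again assumes the variance-preserving process.

```python
import numpy as np

def training_loss(x, model_eps, sample_lambda, rng=None):
    """Single-example Monte Carlo estimate of the objective in Eq. (5)."""
    if rng is None:
        rng = np.random.default_rng()
    lam = sample_lambda(rng)                     # lambda ~ p(lambda)
    alpha = np.sqrt(1.0 / (1.0 + np.exp(-lam)))  # variance-preserving alpha_lambda
    sigma = np.sqrt(1.0 - alpha ** 2)
    eps = rng.standard_normal(x.shape)           # eps ~ N(0, I)
    z = alpha * x + sigma * eps                  # z_lambda = alpha_lambda x + sigma_lambda eps
    return np.sum((model_eps(z, lam) - eps) ** 2)  # ||eps_theta(z_lambda) - eps||_2^2
```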
This objective is denoising score matching (Vincent, 2011; Hyvärinen & Dayan, 2005) over multiple noise scales (Song & Ermon, 2019), and when $p(\lambda)$ is uniform, the objective is proportional to the variational lower bound on the marginal log likelihood of the latent variable model $\int p_\theta(x|z)\,p_\theta(z)\,dz$, ignoring the term for the unspecified decoder $p_\theta(x|z)$ and for the prior at $z_{\lambda_{\min}}$ (Kingma et al., 2021).
If $p(\lambda)$ is not uniform, the objective can be interpreted as a weighted variational lower bound whose weighting can be tuned for sample quality (Ho et al., 2020; Kingma et al., 2021). We use a $p(\lambda)$ inspired by the discrete time cosine noise schedule of Nichol & Dhariwal (2021): we sample $\lambda$ via $\lambda = -2\log\tan(au + b)$ for uniformly distributed $u \in [0, 1]$, where $b = \arctan(e^{-\lambda_{\max}/2})$ and $a = \arctan(e^{-\lambda_{\min}/2}) - b$. This represents a hyperbolic secant distribution modified to be supported on a bounded interval. For finite timestep generation, we use $\lambda$ values corresponding to uniformly spaced $u \in [0, 1]$, and the final generated sample is $x_\theta(z_{\lambda_{\max}})$.
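This $\lambda$ distribution and schedule transcribe directly into code; note that $u = 0$ maps to $\lambda_{\max}$ and $u = 1$ to $\lambda_{\min}$. The endpoint defaults below are illustrative placeholders, not values fixed by this passage.

```python
import numpy as np

def sample_lambda(rng, lam_min=-20.0, lam_max=20.0):
    """Draw lambda = -2 log tan(a u + b) for u ~ Uniform[0, 1]."""
    b = np.arctan(np.exp(-lam_max / 2.0))
    a = np.arctan(np.exp(-lam_min / 2.0)) - b
    return -2.0 * np.log(np.tan(a * rng.uniform() + b))

def lambda_schedule(T, lam_min=-20.0, lam_max=20.0):
    """T lambdas at uniformly spaced u; decreasing u from 1 to 0 yields the
    increasing sequence lambda_min = lambda_1 < ... < lambda_T = lambda_max."""
    b = np.arctan(np.exp(-lam_max / 2.0))
    a = np.arctan(np.exp(-lam_min / 2.0)) - b
    u = np.linspace(1.0, 0.0, T)
    return -2.0 * np.log(np.tan(a * u + b))
```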
Because the loss for $\epsilon_\theta(z_\lambda)$ is denoising score matching for all $\lambda$, the score $\epsilon_\theta(z_\lambda)$ learned by our model estimates the gradient of the log-density of the distribution of our noisy data $z_\lambda$; that is, $\epsilon_\theta(z_\lambda) \approx -\sigma_\lambda \nabla_{z_\lambda} \log p(z_\lambda)$. Note, however, that because we use unconstrained neural networks to define $\epsilon_\theta$, there need not exist any scalar potential whose gradient is $\epsilon_\theta$. Sampling from the learned diffusion model resembles using Langevin diffusion to sample from a sequence of distributions $p(z_\lambda)$ that converges to the distribution $p(x)$ of the original data $x$.
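In code, the score estimate implied by $\epsilon$-prediction is a one-line rescaling (again assuming the variance-preserving $\sigma_\lambda$):

```python
import numpy as np

def score_estimate(z, lam, model_eps):
    """grad_z log p(z_lambda) ~= -eps_theta(z_lambda) / sigma_lambda."""
    sigma = np.sqrt(1.0 - 1.0 / (1.0 + np.exp(-lam)))  # assumes VP process
    return -model_eps(z, lam) / sigma
```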
In the case of conditional generative modeling, the data $x$ is drawn jointly with conditioning information $c$, e.g., a class label for class-conditional image generation. The only modification to the model is that the reverse process function approximator receives $c$ as input, as in $\epsilon_\theta(z_\lambda, c)$.
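The interface change is only that the network takes $c$; how $c$ is injected is an architectural choice this passage leaves open. A toy fully-connected sketch follows, in which a learned class embedding is added to the hidden activation; the sizes and the embedding-addition scheme are illustrative assumptions.

```python
import numpy as np

class ToyConditionalEps:
    """eps_theta(z_lambda, c) for flattened z; c enters via a learned
    class embedding added to the hidden activation (one common choice)."""

    def __init__(self, dim, hidden, num_classes, rng=None):
        if rng is None:
            rng = np.random.default_rng()
        self.w1 = rng.standard_normal((dim + 1, hidden)) * 0.02  # +1 input for lambda
        self.emb = rng.standard_normal((num_classes, hidden)) * 0.02
        self.w2 = rng.standard_normal((hidden, dim)) * 0.02

    def __call__(self, z, lam, c):
        h = np.concatenate([z, [lam]]) @ self.w1 + self.emb[c]
        return np.maximum(h, 0.0) @ self.w2   # ReLU, then project back to data dim
```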
3 GUIDANCE
An interesting property of certain generative models, such as GANs and flow-based models, is the
ability to perform truncated or low temperature sampling by decreasing the variance or range of noise
inputs to the generative model at sampling time. The intended effect is to decrease the diversity of
the samples while increasing the quality of each individual sample. Truncation in BigGAN (Brock
et al., 2019), for example, yields a tradeoff curve between FID score and Inception score for low and
high amounts of truncation, respectively. Low temperature sampling in Glow (Kingma & Dhariwal,
2018) has a similar effect.
Unfortunately, straightforward attempts to implement truncation or low temperature sampling in diffusion models are ineffective. For example, scaling model scores or decreasing the variance of Gaussian noise in the reverse process causes the diffusion model to generate blurry, low quality samples (Dhariwal & Nichol, 2021).
3.1 CLASSIFIER GUIDANCE
To obtain a truncation-like effect in diffusion models, Dhariwal & Nichol (2021) introduce classifier guidance, where the diffusion score $\epsilon_\theta(z_\lambda, c) \approx -\sigma_\lambda \nabla_{z_\lambda} \log p(z_\lambda|c)$ is modified to include the