Bound tightness    When $p_\theta = p_g$, we have that $\|\nabla_x E_\theta(x) + \nabla_x \log p_g(x)\|_p = 0$. In Theorem 1, $m$ is a constant related to the Lipschitz constant of $\log f(x)$, satisfying $\|\nabla_x \log f(x)\|_p \le \|\nabla_y \log f(y)\|_p + m$ for all $x, y$ (see proof for details). When $p_\theta = p_g$ we also have $\|\nabla_x \log f(x)\|_p = 0$, such that $m = 0$. Our upper bound is then $\lceil \mathcal{L}(\theta) \rceil = \mathcal{L}(\theta) = \lfloor \mathcal{L}(\theta) \rfloor$, and hence it is tight.
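For intuition, the tightness argument can be summarized in one chain of implications. This is a sketch under the standard energy-based parameterization $p_\theta(x) \propto \exp(-E_\theta(x))$ and the identification, up to sign, of $\nabla_x \log f(x)$ with $\nabla_x E_\theta(x) + \nabla_x \log p_g(x)$ suggested by the statement above (see the proof of Theorem 1 for the precise definition of $f$):
\[
p_\theta = p_g
\;\Rightarrow\;
\nabla_x \log p_g(x) = \nabla_x \log p_\theta(x) = -\nabla_x E_\theta(x)
\;\Rightarrow\;
\|\nabla_x \log f(x)\|_p = \|\nabla_x E_\theta(x) + \nabla_x \log p_g(x)\|_p = 0 \;\;\forall x
\;\Rightarrow\;
m = 0 ,
\]
so the additive slack in Theorem 1 vanishes and both bounds collapse onto $\mathcal{L}(\theta)$.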
3 Numerical evaluation of the bounds
Evaluating the lower bound    To evaluate the lower bound in Equation (11), we need the smallest singular value of the Jacobian $J = \partial_z G(z)$. Recall that this singular value satisfies $s_1 = \|J v_{\min}\|_2 = \min_{v \neq 0} \frac{\|Jv\|_2}{\|v\|_2}$, where $v_{\min}$ is the corresponding unit-norm right singular vector. We can then evaluate the singular value by finding $v_{\min}$ with an iterative optimization algorithm, where we opt to use the celebrated single-vector LOBPCG algorithm (Knyazev, 1998). This method performs an iterative minimization of the generalized Rayleigh quotient,
\[
\rho(v) := \frac{v^\top J^\top J v}{v^\top v}, \qquad (16)
\]
which converges to $v_{\min}$.
The gradient of $\rho(v)$ is proportional to $r = J^\top J v - \rho(v)\, v$. To avoid computing the Jacobian $J$, we use Jacobian-vector products, which can be efficiently evaluated using automatic differentiation. To compute $J^\top J v$, we use the following trick (in pytorch-notation):
\[
J^\top J v = \big((Jv)^\top J\big)^\top = \nabla_z \big((Jv)^\top.\mathrm{detach}() \cdot G(z)\big)^\top. \qquad (17)
\]
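Concretely, the product $J^\top J v$ needs only one forward-mode and one reverse-mode Jacobian-vector product, so $J$ is never materialized. The following PyTorch sketch illustrates this; the helper name jtj_vec and the assumption that $z$ and $v$ are single tensors of matching shape are ours, and torch.autograd.functional.jvp internally relies on the same detach-style double-backward trick as Equation (17):

    import torch
    from torch.autograd.functional import jvp, vjp

    def jtj_vec(G, z, v):
        # J^T J v for J = dG(z)/dz, computed without ever forming J.
        _, Jv = jvp(G, z, v)                  # forward product: Jv = J v (lives in data space)
        _, JtJv = vjp(G, z, Jv.detach())      # reverse product: grad_z <Jv.detach(), G(z)> = J^T (J v)
        return JtJv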
The optimal learning rate for this iterative scheme can be found by minimizing the Rayleigh quotient (16) along the search direction. Finally, we follow the suggestions of Knyazev (2001) to improve numerical stability and accelerate convergence; we omit the details here for brevity.
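For concreteness, a simplified version of such an iteration is sketched below. It performs a two-term Rayleigh-Ritz step on span{v, r}, which is one way of realizing the optimal step size, rather than the full three-term LOBPCG recurrence; the function name and the assumption that the latent $z$ is a flattened 1-D tensor are ours, and jtj_vec is the helper from the sketch above:

    import torch

    def smallest_singular_value(G, z, num_iters=50, tol=1e-7):
        # Estimate s_1, the smallest singular value of J = dG(z)/dz, by minimizing
        # the Rayleigh quotient rho(v) = (v^T J^T J v) / (v^T v) of Eq. (16).
        v = torch.randn_like(z)
        v = v / v.norm()
        for _ in range(num_iters):
            Av = jtj_vec(G, z, v)                 # A v with A = J^T J (never formed)
            rho = torch.dot(v, Av)                # Rayleigh quotient, since v has unit norm
            r = Av - rho * v                      # proportional to the gradient of rho at v
            if r.norm() < tol:
                break
            Ar = jtj_vec(G, z, r)
            # Optimal update within span{v, r}: take the smallest eigenpair of the
            # projected 2x2 generalized eigenproblem (W^T A W) c = lambda (W^T W) c.
            W = torch.stack([v, r], dim=1)        # latent_dim x 2 basis
            AW = torch.stack([Av, Ar], dim=1)
            M = torch.linalg.solve(W.T @ W, W.T @ AW)
            evals, evecs = torch.linalg.eig(M)
            c = evecs[:, evals.real.argmin()].real
            v = W @ c
            v = v / v.norm()
        return rho.sqrt()                         # s_1 = sqrt(lambda_min(J^T J))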
Evaluating the upper bound    There are two challenges when evaluating Equation (15). The first is to compute $\nabla_x \log p_g(x)$, where we empirically found that existing methods (Shi et al., 2018; Li and Turner, 2018) were too inefficient for our needs. To evaluate the term $\mathbb{E}_{x \sim p_g(x)}\big[\|\nabla_x E_\theta(x) + \nabla_x \log p_g(x)\|_p\big]$, we further loosen the bound
\[
\|\nabla_{G(z)} E_\theta(G(z)) + \nabla_{G(z)} \log p_g(G(z))\|_p
\le \frac{\|\nabla_{G(z)} E_\theta(G(z))\, J_z + \nabla_{G(z)} \log p_g(G(z))\, J_z\|_p}{s_1^p}
\le \frac{\|\nabla_{G(z)} E_\theta(G(z))\, J_z + \nabla_z \log p_g(G(z))\|_p}{s_1^p}, \qquad (18)
\]
where $s_1$ is the smallest singular value of $J_z$. Detailed derivations are in the supplementary material.
If we choose $p = 2$, then we can use Hutchinson's estimator (1989):
\[
\|\nabla_x E_\theta(x)\, J_z + \nabla_z \log p_g(G(z))\|_2^2
= \mathbb{E}_v\big[\big(\nabla_x E_\theta(x)\, J_z v + \nabla_z \log p_g(G(z))\, v\big)^2\big], \qquad (19)
\]
where $v \sim \mathcal{N}(0, I_d)$. This is easily evaluated using automatic differentiation.
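By the chain rule, $\nabla_x E_\theta(x)\, J_z v + \nabla_z \log p_g(G(z))\, v$ is the directional derivative of the scalar map $z \mapsto E_\theta(G(z)) + \log p_g(G(z))$ along $v$, so each Monte Carlo sample in (19) costs a single Jacobian-vector product. A minimal PyTorch sketch; the helper name and the assumption that $\log p_g(G(z))$ is available as a differentiable scalar function of $z$ (e.g. via the entropy estimator discussed below) are ours:

    import torch
    from torch.autograd.functional import jvp

    def hutchinson_sq_grad_norm(scalar_fn, z, n_samples=10):
        # Monte Carlo estimate of ||grad_z scalar_fn(z)||_2^2 as in Eq. (19):
        # E_v[(grad_z scalar_fn(z) . v)^2] with v ~ N(0, I). Only directional
        # derivatives are computed; the gradient itself is never formed.
        total = torch.zeros(())
        for _ in range(n_samples):
            v = torch.randn_like(z)
            _, dv = jvp(scalar_fn, z, v)      # grad_z scalar_fn(z) . v
            total = total + dv ** 2
        return total / n_samples

    # Hypothetical usage for the term in Eq. (19), with E_theta, G and log_pg_of_Gz
    # standing in for the energy, the generator and the entropy-estimator term:
    # bound_term = hutchinson_sq_grad_norm(lambda z_: E_theta(G(z_)) + log_pg_of_Gz(z_), z)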
The second challenge is to evaluate $\log p_g(x)$, which requires the Jacobian of the generator $G(z)$, as dictated by Equation (8). Here, we opt to use our entropy estimator as described above. We could alternatively use Hutchinson's estimator as proposed by Kumar et al. (2020). Experimentally, we do not observe much difference between these two estimators.
4 Related work
In machine learning, there has been a long-standing interest in EBMs dating back to Hopfield
networks (Hopfield, 1982), Boltzmann machines (Hinton and Sejnowski, 1983; Ackley et al., 1985)
and restricted Boltzmann machines (Smolensky, 1986; Hinton, 2002), see e.g. reviews in the works by
LeCun et al. (2006) and Scellier (2020). Learning and evaluation of these models are difficult since the
normalization constant cannot be efficiently evaluated. MLE-based learning, such as the Boltzmann
learning rule, relies on expensive MCMC sampling to estimate the gradient, and more advanced
MCMC methods are used to reliably estimate the normalization constant (see e.g. Salakhutdinov and
Murray, 2008; Grosse et al., 2013; Burda et al., 2015; Frellsen et al., 2016). For images, MCMC-
based learning has been used to learn non-deep EBMs of both textures (Zhu et al., 1998; Zhu and