3.2. Variational training
The parameters of a DGMRF can be trained by maximizing the log marginal likelihood $\log p(y_m \mid \theta)$. This is however often infeasible as it requires computing the determinant of the posterior precision matrix $\tilde{Q}$ (Sidén & Lindsten, 2020).
For large $N$ one can instead resort to variational inference, maximizing the Evidence Lower Bound (ELBO),
$$\mathrm{ELBO}(\theta, \phi) = \mathbb{E}_{q(x \mid \phi)}\left[\log p(y_m, x \mid \theta)\right] + H[q(x \mid \phi)] \le \log p(y_m \mid \theta) \tag{7}$$
where $q$ is a variational distribution with parameters $\phi$ and $H[\cdot]$ refers to differential entropy. For a DGMRF with a Gaussian likelihood the first term of the ELBO is
$$\mathbb{E}_{q(x \mid \phi)}\left[\log p(y_m, x \mid \theta)\right] = -\frac{1}{2}\,\mathbb{E}_{q(x \mid \phi)}\left[ g(x)^{\top} g(x) + \frac{1}{\sigma^2}(y_m - x)^{\top} I_m (y_m - x) \right] + \log|\det(G)| - M \log \sigma + \text{const.} \tag{8}$$
where $M = \sum_{i=1}^{N} m_i$ is the number of observed nodes.
The expectation in Eq. 8 can be estimated using a set of samples drawn from $q$. As $G = G^{(L)} G^{(L-1)} \cdots G^{(1)}$, the log-(absolute)-determinant is given by
$$\log|\det(G)| = \sum_{l=1}^{L} \log\left|\det\left(G^{(l)}\right)\right|. \tag{9}$$
Computing this efficiently is one of the major challenges
with the general graph setting, as will be discussed further
in section 3.3.
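For concreteness, the sampling-based estimate of the expectation in Eq. 8 could be implemented as in the following sketch (PyTorch-style Python). All names, such as `g`, `x_samples` and `obs_mask`, are illustrative placeholders rather than the paper's implementation, and the log-determinant term is treated as a precomputed quantity (see Section 3.3).

```python
import torch

def data_term_estimate(x_samples, g, y, obs_mask, sigma, log_det_G):
    """Monte Carlo estimate of Eq. 8 (up to the additive constant).

    x_samples: (num_samples, N) tensor of samples from q(x | phi)
    g:         function returning g(x) for a batch of samples, shape (num_samples, N)
    y:         (N,) observations (entries outside obs_mask are ignored)
    obs_mask:  (N,) float 0/1 mask m selecting observed nodes
    sigma:     scalar tensor, observation noise standard deviation
    log_det_G: precomputed log|det(G)| (Eq. 9, Section 3.3)
    """
    gx = g(x_samples)                                  # (num_samples, N)
    prior_quad = (gx ** 2).sum(dim=1)                  # g(x)^T g(x), per sample
    resid = (y - x_samples) * obs_mask                 # I_m (y_m - x), per sample
    data_quad = (resid ** 2).sum(dim=1) / sigma ** 2   # (1/sigma^2)(y_m - x)^T I_m (y_m - x)
    M = obs_mask.sum()                                 # number of observed nodes
    return (-0.5 * (prior_quad + data_quad).mean()     # average over the samples
            + log_det_G - M * torch.log(sigma))
```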
The full set of model parameters $\theta$ consists of the trainable parameters of each layer and the noise standard deviation $\sigma$. Maximizing the ELBO w.r.t. $\theta$ and $\phi$ can be done using gradient-based stochastic optimization.
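Schematically, such an optimization loop could look as below; `elbo_estimate`, `model_params` and `var_params` are assumed, hypothetical names for an ELBO estimator and the parameter collections $\theta$ and $\phi$, not part of the original implementation.

```python
import torch

# model_params (theta) and var_params (phi) are iterables of torch.nn.Parameter
optimizer = torch.optim.Adam(list(model_params) + list(var_params), lr=1e-2)
num_steps = 10000

for step in range(num_steps):
    optimizer.zero_grad()
    loss = -elbo_estimate(model_params, var_params)  # maximize the ELBO = minimize its negative
    loss.backward()   # gradients flow through the reparametrized samples (Section 3.2.1)
    optimizer.step()
```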
3.2.1. VARIATIONAL DISTRIBUTION
A natural and useful way to choose the variational distribution is as another Gaussian $q(x \mid \phi) = \mathcal{N}(x \mid \nu, SS^{\top})$. This corresponds to defining $q$ by another affine transformation in the opposite direction of the DGMRF,
$$x = Sr + \nu, \quad r \sim \mathcal{N}(0, I). \tag{10}$$
Note the difference to Eq. 2 as we here parametrize the
covariance matrix instead of the precision matrix. This
parametrization additionally allows for computing gradi-
ents through the sampling process, by the use of the
reparametrization trick (Kingma & Welling, 2014).
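A minimal sketch of such reparametrized sampling, assuming a function `apply_S` that multiplies a batch of vectors by $S$ (for instance the structured choice in Eq. 11 below), is:

```python
import torch

def sample_q(num_samples, nu, apply_S):
    """Draw x = S r + nu with r ~ N(0, I) (Eq. 10).

    Since x is a deterministic, differentiable function of the variational
    parameters, ELBO gradients w.r.t. phi can be backpropagated through the
    samples (the reparametrization trick).
    """
    r = torch.randn(num_samples, nu.shape[0])  # standard Gaussian noise, shape (num_samples, N)
    return apply_S(r) + nu                     # shape (num_samples, N)
```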
Sidén & Lindsten (2020) use a simple mean field approximation with a diagonal $S$, making all components of $x$ independent (Bishop, 2006). However, we propose a more flexible $q$ by choosing
$$S = \operatorname{diag}(\xi)\, \tilde{G}\, \operatorname{diag}(\tau) \tag{11}$$
where $\xi, \tau \in \mathbb{R}^N$ are vectors containing positive parameters and $\tilde{G}$ is defined in the same way as the DGMRF layer in Eq. 6. Including the matrix $\tilde{G}$ in $S$ introduces off-diagonal elements in the covariance matrix of $q$, alleviating the independence assumption between nodes. Multiple such layers can also be used, introducing longer dependencies between nodes in the graph. The full set of variational parameters $\phi$ is then $\nu$, $\xi$, $\tau$ and all trainable parameters from the layer(s) $\tilde{G}$. In Appendix A.2 we empirically show that DGMRFs trained using our more flexible variational distribution consistently outperform those trained using the simple mean field approximation.
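In practice $S$ never needs to be formed as a dense matrix; it suffices to apply it to a batch of noise vectors using sparse operations on the graph. A sketch, assuming (consistently with Eq. 13) that the layer has the form $\tilde{G} = D^{\gamma}(\alpha I + \beta D^{-1}A)$ and with all variable names illustrative, is:

```python
import torch

def apply_S(r, xi, tau, alpha, beta, gamma, deg, A_sparse):
    """Apply S = diag(xi) G_tilde diag(tau) to a batch of vectors r (Eq. 11).

    r:        (num_samples, N) batch of noise vectors
    xi, tau:  (N,) positive variational parameter vectors
    deg:      (N,) node degrees (diagonal of D)
    A_sparse: (N, N) sparse adjacency matrix
    """
    z = r * tau                                # diag(tau) r
    Az = torch.sparse.mm(A_sparse, z.T).T      # A z for every sample in the batch
    z = alpha * z + beta * Az / deg            # (alpha I + beta D^{-1} A) z
    z = z * deg ** gamma                       # D^gamma z
    return z * xi                              # diag(xi) (...)
```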
With this choice of $S$ the entropy term of the ELBO is
$$H[q(x \mid \phi)] = \log|\det(S)| + \text{const.} = \log\left|\det\left(\tilde{G}\right)\right| + \sum_{i=1}^{N}\left(\log \xi_i + \log \tau_i\right) + \text{const.} \tag{12}$$
Re-using the DGMRF layer construct in $q$ has the added benefit that the techniques we develop for the log-determinant computation readily extend also to computing $H[q(x \mid \phi)]$.
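Given any estimate of $\log|\det(\tilde{G})|$, evaluating Eq. 12 then amounts to a few elementwise operations. In the sketch below, $\xi$ and $\tau$ are assumed to be parametrized through their logarithms to keep them positive; the names are again illustrative.

```python
import torch

def entropy_term(log_det_G_tilde, log_xi, log_tau):
    """Entropy of q up to an additive constant (Eq. 12)."""
    # log|det(S)| = log|det(G_tilde)| + sum_i (log xi_i + log tau_i)
    return log_det_G_tilde + log_xi.sum() + log_tau.sum()
```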
3.3. Computing the log-determinant
Computing the necessary log-determinants in Eq. 9 and 12 efficiently is a major challenge with the general graph setting. The CNN-based DGMRF was defined on a lattice graph, which creates a special structure in $G$ and allows for finding efficient closed-form expressions for the log-determinants (Sidén & Lindsten, 2020). As we do not make any such assumptions on the graph structure, we here propose new scalable methods to compute the log-determinants.
3.3.1. EIGENVALUE METHOD
One way to compute the log-determinant is based on the eigenvalues of the matrix. As the determinant is given by the product of all eigenvalues,
$$\log\left|\det\left(G^{(l)}\right)\right| = \log\left|\det\left(D^{\gamma_l}\right)\right| + \log\left|\det\left(\alpha_l I + \beta_l D^{-1} A\right)\right| = \sum_{i=1}^{N}\left[\gamma_l \log(d_i) + \log|\lambda_i|\right] \tag{13}$$
where $\{\lambda_i\}_{i=1}^{N}$ are the eigenvalues of $\alpha_l I + \beta_l D^{-1} A$. It can be shown¹ that $\lambda_i = \alpha_l + \beta_l \lambda'_i$, with $\lambda'_i$ being the $i$:th eigenvalue of $D^{-1}A$.
¹ For an eigenvector $v_i$ of $D^{-1}A$ with eigenvalue $\lambda'_i$, $D^{-1}A v_i = \lambda'_i v_i \Rightarrow (\alpha_l I + \beta_l D^{-1}A) v_i = (\alpha_l + \beta_l \lambda'_i) v_i$.
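A sketch of this eigenvalue method is given below. The eigenvalues of $D^{-1}A$ depend only on the graph, so they can be computed once before training and reused for every layer and iteration; the dense eigendecomposition shown here is purely illustrative, and all variable names are assumptions rather than the paper's implementation.

```python
import torch

# Precompute the eigenvalues of D^{-1} A once; they depend only on the graph,
# not on the per-layer parameters. A_dense is the (N, N) adjacency matrix and
# deg the (N,) vector of node degrees (both assumed given).
DinvA = A_dense / deg[:, None]            # D^{-1} A
lam_prime = torch.linalg.eigvals(DinvA)   # complex in general, since D^{-1} A is non-symmetric

def layer_log_det(alpha_l, beta_l, gamma_l, deg, lam_prime):
    """log|det(G^(l))| via Eq. 13."""
    # gamma_l * sum_i log(d_i)  +  sum_i log|alpha_l + beta_l * lam'_i|
    return (gamma_l * torch.log(deg).sum()
            + torch.log(torch.abs(alpha_l + beta_l * lam_prime)).sum())

# Eq. 9: the total log-determinant is the sum of the per-layer terms, e.g.
# log_det_G = sum(layer_log_det(a, b, g, deg, lam_prime) for (a, b, g) in layer_params)
```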