(Wan et al., 2013), multiplicative Gaussian noise (Srivastava et al., 2014), etc.). We show that the dropout objective, in effect, minimises the Kullback–Leibler divergence between an approximate distribution and the posterior of a deep Gaussian process (marginalised over its finite rank covariance function parameters). Due to space constraints we refer the reader to the appendix for an in-depth review of dropout, Gaussian processes, and variational inference (section 2), as well as the main derivation for dropout and its variations (section 3). The results are summarised here, and in the next section we obtain uncertainty estimates for dropout NNs.
Let $\widehat{\mathbf{y}}$ be the output of a NN model with $L$ layers and a loss function $E(\cdot, \cdot)$ such as the softmax loss or the Euclidean loss (square loss). We denote by $\mathbf{W}_i$ the NN's weight matrices of dimensions $K_i \times K_{i-1}$, and by $\mathbf{b}_i$ the bias vectors of dimensions $K_i$ for each layer $i = 1, \dots, L$. We denote by $\mathbf{y}_i$ the observed output corresponding to input $\mathbf{x}_i$ for $1 \le i \le N$ data points, and the input and output sets as $\mathbf{X}, \mathbf{Y}$. During NN optimisation a regularisation term is often added. We often use $L_2$ regularisation weighted by some weight decay $\lambda$, resulting in a minimisation objective (often referred to as cost),
$$\mathcal{L}_{\text{dropout}} := \frac{1}{N} \sum_{i=1}^{N} E(\mathbf{y}_i, \widehat{\mathbf{y}}_i) + \lambda \sum_{i=1}^{L} \big( \|\mathbf{W}_i\|_2^2 + \|\mathbf{b}_i\|_2^2 \big). \tag{1}$$
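For concreteness, eq. (1) can be sketched in NumPy for a regression network with the Euclidean (square) loss; the function and argument names below are illustrative and not taken from any particular library.

```python
import numpy as np

def dropout_objective(Y, Y_hat, weights, biases, lam):
    """Sketch of eq. (1): average per-point loss plus L2 weight decay.

    Y, Y_hat -- arrays of shape (N, D) with observed and predicted outputs.
    weights, biases -- lists of the L weight matrices W_i and bias vectors b_i.
    lam -- the weight-decay coefficient lambda.
    The Euclidean (square) loss is used for E(., .).
    """
    N = Y.shape[0]
    data_term = np.sum((Y - Y_hat) ** 2) / N                  # (1/N) sum_i E(y_i, y_hat_i)
    reg_term = lam * sum(np.sum(W ** 2) + np.sum(b ** 2)      # lambda sum_i (||W_i||^2 + ||b_i||^2)
                         for W, b in zip(weights, biases))
    return data_term + reg_term
```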
With dropout, we sample binary variables for every input point and for every network unit in each layer (apart from the last one). Each binary variable takes value 1 with probability $p_i$ for layer $i$. A unit is dropped (i.e. its value is set to zero) for a given input if its corresponding binary variable takes value 0. We use the same values in the backward pass propagating the derivatives to the parameters.
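A minimal sketch of this sampling scheme for a single input point is given below, assuming ReLU hidden layers; a unit's activation is zeroed before being multiplied by the next weight matrix, and the sampled masks are returned so that the backward pass can reuse exactly the same values (an autodiff framework would do this automatically). All names are illustrative.

```python
import numpy as np

def dropout_forward(x, weights, biases, probs, rng=None):
    """Dropout forward pass for one input point x (sketch).

    A Bernoulli(p_i) variable is drawn for every unit feeding into layer i;
    units of the last layer are never dropped.  The masks are returned so
    the backward pass can reuse the same values.
    """
    rng = np.random.default_rng() if rng is None else rng
    masks, h = [], x
    L = len(weights)
    for i, (W, b) in enumerate(zip(weights, biases)):
        z = rng.binomial(1, probs[i], size=h.shape)   # binary variables for layer i's inputs
        masks.append(z)
        h = W @ (h * z) + b                           # dropped units contribute zero
        if i < L - 1:
            h = np.maximum(h, 0.0)                    # ReLU on all but the output layer
    return h, masks
```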
In comparison to the non-probabilistic NN, the deep Gaussian process is a powerful tool in statistics that allows us to model distributions over functions. Assume we are given a covariance function of the form
$$\mathbf{K}(\mathbf{x}, \mathbf{y}) = \int p(\mathbf{w})\, p(b)\, \sigma(\mathbf{w}^T \mathbf{x} + b)\, \sigma(\mathbf{w}^T \mathbf{y} + b)\, \mathrm{d}\mathbf{w}\, \mathrm{d}b$$
with some element-wise non-linearity $\sigma(\cdot)$ and distributions $p(\mathbf{w}), p(b)$.
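Since $\mathbf{K}(\mathbf{x}, \mathbf{y})$ is an expectation over $p(\mathbf{w})$ and $p(b)$, it can be approximated by simple Monte Carlo; the sketch below assumes, purely for illustration, standard-normal $p(\mathbf{w})$, $p(b)$ and a $\tanh$ non-linearity.

```python
import numpy as np

def mc_covariance(x, y, n_samples=100_000, sigma=np.tanh, rng=None):
    """Monte Carlo estimate of K(x, y) = E_{w,b}[sigma(w.x + b) sigma(w.y + b)].

    Assumes standard-normal p(w) and p(b); x and y are 1-D arrays.
    """
    rng = np.random.default_rng() if rng is None else rng
    d = x.shape[0]
    w = rng.standard_normal((n_samples, d))     # draws from p(w)
    b = rng.standard_normal(n_samples)          # draws from p(b)
    return np.mean(sigma(w @ x + b) * sigma(w @ y + b))
```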
In sections 3 and 4 in the appendix we show that a deep Gaussian process with $L$ layers and covariance function $\mathbf{K}(\mathbf{x}, \mathbf{y})$ can be approximated by placing a variational distribution over each component of a spectral decomposition of the GPs' covariance functions. This spectral decomposition maps each layer of the deep GP to a layer of explicitly represented hidden units, as will be briefly explained next.
Let $\mathbf{W}_i$ be a (now random) matrix of dimensions $K_i \times K_{i-1}$ for each layer $i$, and write $\boldsymbol{\omega} = \{\mathbf{W}_i\}_{i=1}^{L}$. A priori, we let each row of $\mathbf{W}_i$ distribute according to the $p(\mathbf{w})$ above. In addition, assume vectors $\mathbf{m}_i$ of dimensions $K_i$ for each GP layer. The predictive probability of the deep GP model (integrated w.r.t. the finite rank covariance function parameters $\boldsymbol{\omega}$) given some precision parameter $\tau > 0$ can be parametrised as
$$p(\mathbf{y} \mid \mathbf{x}, \mathbf{X}, \mathbf{Y}) = \int p(\mathbf{y} \mid \mathbf{x}, \boldsymbol{\omega})\, p(\boldsymbol{\omega} \mid \mathbf{X}, \mathbf{Y})\, \mathrm{d}\boldsymbol{\omega} \tag{2}$$
$$p(\mathbf{y} \mid \mathbf{x}, \boldsymbol{\omega}) = \mathcal{N}\big(\mathbf{y};\, \widehat{\mathbf{y}}(\mathbf{x}, \boldsymbol{\omega}),\, \tau^{-1}\mathbf{I}_D\big)$$
$$\widehat{\mathbf{y}}\big(\mathbf{x}, \boldsymbol{\omega} = \{\mathbf{W}_1, \dots, \mathbf{W}_L\}\big) = \sqrt{\tfrac{1}{K_L}}\, \mathbf{W}_L\, \sigma\Big( \dots \sqrt{\tfrac{1}{K_1}}\, \mathbf{W}_2\, \sigma\big(\mathbf{W}_1 \mathbf{x} + \mathbf{m}_1\big) \dots \Big)
$$
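The middle layers are elided in the equation above; the sketch below spells out one plausible reading, in which each hidden layer adds its $\mathbf{m}_i$ before the non-linearity and each weight matrix is scaled by the square root of the inverse width of the layer it multiplies, following the inner terms of the equation. The names and this reading of the elided layers are illustrative assumptions.

```python
import numpy as np

def deep_gp_y_hat(x, Ws, ms, sigma=np.tanh):
    """Sketch of y_hat(x, omega = {W_1, ..., W_L}) for L >= 2.

    Ws[i-1] holds W_i with shape (K_i, K_{i-1}); ms[i-1] holds m_i.
    The elided middle layers are read as repeating the inner pattern
    sqrt(1/K) W_i sigma(...) + m_i.
    """
    h = sigma(Ws[0] @ x + ms[0])                          # sigma(W_1 x + m_1)
    for i in range(1, len(Ws) - 1):
        scale = np.sqrt(1.0 / Ws[i].shape[1])             # 1 / sqrt(width of the previous layer)
        h = sigma(scale * (Ws[i] @ h) + ms[i])
    scale = np.sqrt(1.0 / Ws[-1].shape[1])
    return scale * (Ws[-1] @ h)                           # linear output layer
```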
The posterior distribution $p(\boldsymbol{\omega} \mid \mathbf{X}, \mathbf{Y})$ in eq. (2) is intractable. We use $q(\boldsymbol{\omega})$, a distribution over matrices whose columns are randomly set to zero, to approximate the intractable posterior. We define $q(\boldsymbol{\omega})$ as:
$$\mathbf{W}_i = \mathbf{M}_i \cdot \text{diag}\big([\mathbf{z}_{i,j}]_{j=1}^{K_{i-1}}\big)$$
$$\mathbf{z}_{i,j} \sim \text{Bernoulli}(p_i) \quad \text{for } i = 1, \dots, L, \ j = 1, \dots, K_{i-1}$$
given some probabilities $p_i$ and matrices $\mathbf{M}_i$ as variational parameters. The binary variable $\mathbf{z}_{i,j} = 0$ then corresponds to unit $j$ in layer $i-1$ being dropped out as an input to layer $i$. The variational distribution $q(\boldsymbol{\omega})$ is highly multi-modal, inducing strong joint correlations over the rows of the matrices $\mathbf{W}_i$ (which correspond to the frequencies in the sparse spectrum GP approximation).
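One draw from $q(\boldsymbol{\omega})$ therefore amounts to zeroing random columns of each $\mathbf{M}_i$; a minimal sketch (names illustrative):

```python
import numpy as np

def sample_q(Ms, probs, rng=None):
    """Draw one sample {W_1, ..., W_L} from q(omega) as defined above.

    Ms -- variational parameter matrices M_i of shape (K_i, K_{i-1}).
    probs -- probabilities p_i, one per layer.
    Column j of W_i is zeroed exactly when z_{i,j} = 0, i.e. unit j of
    layer i-1 is dropped as an input to layer i.
    """
    rng = np.random.default_rng() if rng is None else rng
    Ws = []
    for M, p in zip(Ms, probs):
        z = rng.binomial(1, p, size=M.shape[1])   # z_{i,j} ~ Bernoulli(p_i)
        Ws.append(M * z)                          # equals M_i . diag(z_i)
    return Ws
```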
We minimise the KL divergence between the approximate posterior $q(\boldsymbol{\omega})$ above and the posterior of the full deep GP, $p(\boldsymbol{\omega} \mid \mathbf{X}, \mathbf{Y})$. This KL is our minimisation objective
$$- \int q(\boldsymbol{\omega}) \log p(\mathbf{Y} \mid \mathbf{X}, \boldsymbol{\omega})\, \mathrm{d}\boldsymbol{\omega} + \text{KL}\big(q(\boldsymbol{\omega})\,\|\,p(\boldsymbol{\omega})\big). \tag{3}$$
We rewrite the first term as a sum
$$- \sum_{n=1}^{N} \int q(\boldsymbol{\omega}) \log p(\mathbf{y}_n \mid \mathbf{x}_n, \boldsymbol{\omega})\, \mathrm{d}\boldsymbol{\omega}$$
and approximate each term in the sum by Monte Carlo integration with a single sample $\widehat{\boldsymbol{\omega}}_n \sim q(\boldsymbol{\omega})$ to get an unbiased estimate $-\log p(\mathbf{y}_n \mid \mathbf{x}_n, \widehat{\boldsymbol{\omega}}_n)$. We further approximate the second term in eq. (3) and obtain $\sum_{i=1}^{L} \big( \frac{p_i l^2}{2} \|\mathbf{M}_i\|_2^2 + \frac{l^2}{2} \|\mathbf{m}_i\|_2^2 \big)$ with prior length-scale $l$ (see section 4.2 in the appendix). Given model precision $\tau$ we scale the result by the constant $1/\tau N$ to obtain the objective:
$$\mathcal{L}_{\text{GP-MC}} \propto \frac{1}{N} \sum_{n=1}^{N} \frac{-\log p(\mathbf{y}_n \mid \mathbf{x}_n, \widehat{\boldsymbol{\omega}}_n)}{\tau} + \sum_{i=1}^{L} \bigg( \frac{p_i l^2}{2\tau N} \|\mathbf{M}_i\|_2^2 + \frac{l^2}{2\tau N} \|\mathbf{m}_i\|_2^2 \bigg). \tag{4}$$
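For a concrete regression reading of eq. (4), where $-\log p(\mathbf{y}_n \mid \mathbf{x}_n, \widehat{\boldsymbol{\omega}}_n) = \frac{\tau}{2}\|\mathbf{y}_n - \widehat{\mathbf{y}}_n\|^2$ up to additive constants, the objective can be sketched as below; `forward` stands for any implementation of $\widehat{\mathbf{y}}(\mathbf{x}, \boldsymbol{\omega})$ (e.g. the earlier sketch), and all names are illustrative assumptions rather than part of the derivation.

```python
import numpy as np

def gp_mc_objective(X, Y, Ms, ms, probs, tau, ell, forward, rng=None):
    """Sketch of eq. (4) with a Gaussian (regression) likelihood.

    One sample omega_hat_n ~ q(omega) is drawn per data point; additive
    constants independent of the variational parameters are dropped.
    ell is the prior length-scale l, tau the model precision.
    """
    rng = np.random.default_rng() if rng is None else rng
    N = X.shape[0]
    data_term = 0.0
    for n in range(N):
        Ws_n = [M * rng.binomial(1, p, size=M.shape[1])    # one draw from q(omega)
                for M, p in zip(Ms, probs)]
        resid = Y[n] - forward(X[n], Ws_n, ms)
        data_term += 0.5 * np.sum(resid ** 2)              # -log p(y_n | x_n, w_n) / tau + const
    data_term /= N
    reg_term = sum(p * ell**2 / (2 * tau * N) * np.sum(M ** 2)
                   + ell**2 / (2 * tau * N) * np.sum(m ** 2)
                   for M, m, p in zip(Ms, ms, probs))
    return data_term + reg_term
```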