If the cost of a false positive varies across different classes, it should be used to compute the risk and to choose the minimal-risk prediction.
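As an illustration of this decision rule (a minimal sketch; the class probabilities and cost values below are hypothetical and not taken from the paper), the minimal-risk prediction weights the predictive probabilities by a cost matrix and selects the class with the lowest expected cost:

```python
import numpy as np

# Hypothetical predictive distribution p(y | x) over three classes,
# e.g. obtained by averaging a BNN's sampled predictions.
predictive_probs = np.array([0.55, 0.30, 0.15])

# Hypothetical cost matrix: cost[a, y] is the cost of predicting class a
# when the true class is y; the off-diagonal entries encode how expensive
# each kind of false positive is.
cost = np.array([
    [0.0, 1.0, 10.0],
    [1.0, 0.0,  1.0],
    [2.0, 1.0,  0.0],
])

# Expected risk of each possible prediction under p(y | x).
risk = cost @ predictive_probs

# The minimal-risk prediction (class 1 here) differs from the most
# probable class (class 0) because the costs are asymmetric.
print(risk, int(np.argmin(risk)), int(np.argmax(predictive_probs)))
```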
3 MOTIVATION FOR BAYESIAN METHODS IN DEEP LEARNING
Defining a prior belief p(θ) on the model parametrization (Section 2) is regarded by some users as hard, if not impossible. Defining a prior for a simple functional model is considered intuitive, e.g., explicitly adding a regularization term to favor a lower-degree polynomial function or a smoother function [54]. However, defining priors is harder for the multi-layer models used in deep learning.
So, why do we bother to use Bayesian methods for deep learning, given that it is hard to clearly comprehend the behavior of deep neural networks when defining the priors? The functional relationship encoded by an artificial neural network implicitly represents the conditional probability p(y|x, θ), and Bayes' formula is an appropriate tool to invert conditional probabilities, even if one has a priori little insight about p(θ). While there are very strong theoretical principles and schemas on which Bayes' formula can be based [76], we focus in this section on some practical benefits of using Bayesian deep networks.
First, Bayesian methods provide a natural approach to quantify uncertainty in deep learning. Bayesian neural networks often have better calibration than classical neural networks [46, 58, 66], i.e., their predicted uncertainty is more consistent with the observed errors. In other words, they are neither overconfident nor underconfident compared to their non-Bayesian counterparts.
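To make "calibration" concrete, the sketch below computes the widely used expected calibration error (ECE), which compares average confidence with observed accuracy per confidence bin; the synthetic confidences and accuracies are made up purely for illustration and do not come from the cited works.

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE: average |accuracy - confidence| over equally spaced
    confidence bins, weighted by the fraction of samples in each bin.

    confidences: predicted probability of the predicted class, shape (N,)
    correct: 1.0 if the prediction was right, 0.0 otherwise, shape (N,)
    """
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            gap = abs(correct[mask].mean() - confidences[mask].mean())
            ece += mask.mean() * gap
    return ece

# Hypothetical overconfident model: it reports ~95% confidence
# but is right only ~70% of the time, giving a large ECE.
rng = np.random.default_rng(0)
conf = rng.uniform(0.9, 1.0, size=1000)
corr = (rng.uniform(size=1000) < 0.7).astype(float)
print(expected_calibration_error(conf, corr))
```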
Working with a Bayesian neural network makes it possible to distinguish between epistemic uncertainty, i.e., the uncertainty due to a lack of knowledge, which is measured by p(θ|D) and can be reduced with more data, and aleatoric uncertainty, i.e., the uncertainty due to the (partially) random nature of the data, which is measured by p(y|x, θ) [14, 44]. This makes BNNs very data efficient, as they can learn from a small dataset without overfitting. At prediction time, out-of-training-distribution points will simply lead to high epistemic uncertainty. It also makes BNNs an interesting tool for active learning [19, 88], as one can interpret the model predictions and check whether, for a given input, different probable parametrizations lead to different predictions. In that case, labelling this specific input will effectively reduce the epistemic uncertainty.
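A rough sketch of how this distinction plays out in practice is given below. The posterior samples are hypothetical stand-ins for p(y|x, θᵢ) with θᵢ drawn from an (approximate) posterior p(θ|D), and the split follows the commonly used entropy-based decomposition, in which the mutual information between the prediction and the parameters captures the epistemic part; variance-based decompositions are also used in the literature.

```python
import numpy as np

def entropy(p, axis=-1, eps=1e-12):
    """Shannon entropy of a categorical distribution."""
    return -np.sum(p * np.log(p + eps), axis=axis)

def uncertainty_decomposition(prob_samples):
    """Split predictive uncertainty into aleatoric and epistemic parts.

    prob_samples: array of shape (n_theta_samples, n_classes), where each
    row is p(y | x, theta_i) for one parametrization theta_i drawn from
    the (approximate) posterior p(theta | D).
    """
    mean_probs = prob_samples.mean(axis=0)       # p(y | x, D)
    total = entropy(mean_probs)                  # total predictive uncertainty
    aleatoric = entropy(prob_samples).mean()     # E_theta[H(p(y | x, theta))]
    epistemic = total - aleatoric                # mutual information between y and theta
    return total, aleatoric, epistemic

# Hypothetical posterior samples that disagree with each other:
# the epistemic term dominates (a candidate for active-learning labelling).
disagreeing = np.array([[0.9, 0.1], [0.1, 0.9], [0.8, 0.2], [0.2, 0.8]])
# Samples that agree but are individually uncertain: aleatoric dominates.
agreeing_noisy = np.array([[0.55, 0.45], [0.5, 0.5], [0.45, 0.55], [0.5, 0.5]])

print(uncertainty_decomposition(disagreeing))
print(uncertainty_decomposition(agreeing_noisy))
```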
Furthermore, the no-free-lunch theorem for machine learning [94] can be interpreted as saying that any supervised learning algorithm includes some kind of implicit prior (though this interpretation is more philosophical than mathematical, and thus subject to discussion). Bayesian methods, when used correctly, will at least make the prior explicit. Now, even if integrating prior knowledge seems hard with tools that are essentially black boxes, it is not impossible. In Bayesian deep learning, priors are often treated as soft constraints, akin to regularization. Most regularization methods already used for point-estimate neural networks can be understood from a Bayesian perspective as setting a prior, as demonstrated in Section 5.3. Moreover, a previously learned posterior can be recycled as a prior when new data becomes available. This makes Bayesian neural networks a valuable tool for online learning [64].
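Concretely, and assuming the new observations are conditionally independent of the old ones given θ, this recycling amounts to the sequential form of Bayes' formula,

p(θ | D_old, D_new) ∝ p(D_new | θ) p(θ | D_old),

so the posterior computed on the old data plays exactly the role of the prior for the new data.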
Last but not least, the Bayesian paradigm enables the analysis of learning methods and draws links between them. Some methods not initially presented as Bayesian can be implicitly understood as being approximately Bayesian, like regularization (Section 5.3) or ensembling (Section 8.2.2). This, in turn, helps explain why certain methods that are easier to use than a strict application of Bayesian algorithms can still give meaningful results from a Bayesian perspective. In fact, most Bayesian neural network architectures used in practice rely on methods that are approximately or implicitly Bayesian (Section 8), because the exact algorithms are often too expensive. The Bayesian paradigm also provides a systematic framework to design new learning and regularization strategies, even for point-estimate models.