For example, in Fig. 2A we visualize different evidential NIG distributions with varying model parameters. We illustrate that by increasing the evidential parameters (i.e., $\nu, \alpha$) of this distribution, the p.d.f. becomes tightly concentrated about its inferred likelihood function. Considering a single parameter realization of this higher-order distribution (Fig. 2B), we can subsequently sample many lower-order realizations of our likelihood function, as shown in Fig. 2C.
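To make this hierarchy concrete, the short sketch below draws samples in the same order, assuming the standard NIG factorization $\sigma^2 \sim \Gamma^{-1}(\alpha, \beta)$ and $\mu \mid \sigma^2 \sim \mathcal{N}(\gamma, \sigma^2/\nu)$; the hyperparameter values and the function name are illustrative rather than taken from the figure.

```python
# Sampling order behind Fig. 2, assuming the standard NIG factorization
# sigma^2 ~ InvGamma(alpha, beta) and mu | sigma^2 ~ N(gamma, sigma^2 / nu);
# hyperparameter values and the function name are illustrative.
import numpy as np

rng = np.random.default_rng(0)

def sample_nig(gamma, nu, alpha, beta, n_samples=1):
    """Draw (mu, sigma^2) realizations from NIG(gamma, nu, alpha, beta)."""
    # InvGamma(alpha, beta) via the reciprocal of a Gamma(alpha, rate=beta) draw.
    sigma2 = 1.0 / rng.gamma(shape=alpha, scale=1.0 / beta, size=n_samples)
    mu = rng.normal(loc=gamma, scale=np.sqrt(sigma2 / nu))
    return mu, sigma2

# One higher-order realization (as in Fig. 2B) ...
mu, sigma2 = sample_nig(gamma=0.0, nu=2.0, alpha=3.0, beta=1.0)
# ... then many lower-order likelihood samples y ~ N(mu, sigma^2) from it (Fig. 2C).
y = rng.normal(loc=mu[0], scale=np.sqrt(sigma2[0]), size=100)
```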
In this work, we use neural networks to infer, given an input, the hyperparameters, $m$, of this higher-order, evidential distribution. This approach presents several distinct advantages compared to prior work. First, our method enables simultaneous learning of the desired regression task, along with aleatoric and epistemic uncertainty estimation, by enforcing evidential priors and without leveraging any out-of-distribution data during training. Second, since the evidential prior is a higher-order NIG distribution, the maximum likelihood Gaussian can be computed analytically from the expected values of the $(\mu, \sigma^2)$ parameters, without the need for sampling. Third, we can effectively estimate the epistemic or model uncertainty associated with the network's prediction by simply evaluating the variance of our inferred evidential distribution.
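As one possible realization of such a network head (a sketch, not a prescribed architecture), the module below maps a feature vector to $m = (\gamma, \nu, \alpha, \beta)$ for a scalar target. The softplus constraints ($\nu > 0$, $\beta > 0$) and the $+1$ shift keeping $\alpha > 1$ are assumptions chosen so the moments in Sec. 3.2 remain well defined; the class name is illustrative.

```python
# A possible network head producing m = (gamma, nu, alpha, beta) for a scalar
# target. The softplus constraints (nu > 0, beta > 0) and the +1 shift keeping
# alpha > 1 are assumptions made so the moments in Sec. 3.2 stay well defined;
# the class name is illustrative.
import torch.nn as nn
import torch.nn.functional as F

class EvidentialHead(nn.Module):
    def __init__(self, in_features: int):
        super().__init__()
        self.linear = nn.Linear(in_features, 4)  # one output per hyperparameter

    def forward(self, x):
        gamma, nu_raw, alpha_raw, beta_raw = self.linear(x).chunk(4, dim=-1)
        nu = F.softplus(nu_raw)
        alpha = F.softplus(alpha_raw) + 1.0
        beta = F.softplus(beta_raw)
        return gamma, nu, alpha, beta
```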
3.2 Prediction and uncertainty estimation
The aleatoric uncertainty, also referred to as statistical or data uncertainty, is representative of unknowns that differ each time we run the same experiment. The epistemic (or model) uncertainty describes the estimated uncertainty in the prediction. Given a NIG distribution, we can compute the prediction, aleatoric, and epistemic uncertainty as
\[
\underbrace{\mathbb{E}[\mu] = \gamma}_{\text{prediction}}, \qquad
\underbrace{\mathbb{E}[\sigma^2] = \frac{\beta}{\alpha - 1}}_{\text{aleatoric}}, \qquad
\underbrace{\mathrm{Var}[\mu] = \frac{\beta}{\nu(\alpha - 1)}}_{\text{epistemic}}.
\tag{5}
\]
Complete derivations for these moments are available in Sec. S1.1. Note that $\mathrm{Var}[\mu] = \mathbb{E}[\sigma^2]/\nu$, which is expected as $\nu$ is one of our two evidential virtual-observation counts.
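A minimal sketch of Eq. 5: given the inferred hyperparameters, all three quantities are read off in closed form, with no sampling; the function name is illustrative.

```python
# Reading Eq. 5 off the inferred NIG hyperparameters; works for plain floats
# or for tensors of per-sample parameters. The function name is illustrative.
def nig_moments(gamma, nu, alpha, beta):
    prediction = gamma                     # E[mu]
    aleatoric = beta / (alpha - 1)         # E[sigma^2]  (requires alpha > 1)
    epistemic = beta / (nu * (alpha - 1))  # Var[mu] = E[sigma^2] / nu
    return prediction, aleatoric, epistemic
```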
3.3 Learning the evidential distribution
Having formalized the use of an evidential distribution to capture both aleatoric and epistemic
uncertainty, we next describe our approach for learning a model to output the hyperparameters of this
distribution. For clarity, we structure the learning process as a multi-task learning problem, with two
distinct parts: (1) acquiring or maximizing model evidence in support of our observations and (2)
minimizing evidence or inflating uncertainty when the prediction is wrong. At a high level, we can think of (1) as fitting our data to the evidential model, while (2) enforces a prior that removes incorrect evidence and inflates uncertainty.
(1) Maximizing the model fit. From Bayesian probability theory, the “model evidence”, or marginal likelihood, is defined as the likelihood of an observation, $y_i$, given the evidential distribution parameters $m$ and is computed by marginalizing over the likelihood parameters $\theta$:
\[
p(y_i \mid m) = \frac{p(y_i \mid \theta, m)\, p(\theta \mid m)}{p(\theta \mid y_i, m)}
= \int_{\sigma^2 = 0}^{\infty} \int_{\mu = -\infty}^{\infty} p(y_i \mid \mu, \sigma^2)\, p(\mu, \sigma^2 \mid m)\; d\mu \, d\sigma^2
\tag{6}
\]
The model evidence is, in general, not straightforward to evaluate since computing it involves integrating out the dependence on latent model parameters. However, in the case of placing a NIG evidential prior on our Gaussian likelihood function, an analytical solution does exist:
\[
p(y_i \mid m) = \mathrm{St}\!\left(y_i;\; \gamma,\; \frac{\beta(1 + \nu)}{\nu\,\alpha},\; 2\alpha\right),
\tag{7}
\]
where $\mathrm{St}\!\left(y;\, \mu_{\mathrm{St}}, \sigma^2_{\mathrm{St}}, \nu_{\mathrm{St}}\right)$ is the Student-t distribution evaluated at $y$ with location $\mu_{\mathrm{St}}$, scale $\sigma^2_{\mathrm{St}}$, and $\nu_{\mathrm{St}}$ degrees of freedom. We denote the loss, $\mathcal{L}^{\mathrm{NLL}}_i(w)$, as the negative logarithm of the model evidence:
\[
\mathcal{L}^{\mathrm{NLL}}_i(w) = \frac{1}{2}\log\!\left(\frac{\pi}{\nu}\right) - \alpha\log(\Omega) + \left(\alpha + \frac{1}{2}\right)\log\!\left((y_i - \gamma)^2\,\nu + \Omega\right) + \log\!\left(\frac{\Gamma(\alpha)}{\Gamma\!\left(\alpha + \frac{1}{2}\right)}\right)
\tag{8}
\]
where $\Omega = 2\beta(1 + \nu)$. Complete derivations for Eq. 7 and Eq. 8 are provided in Sec. S1.2. This loss provides an objective for training a NN to output the parameters of a NIG distribution that fit the observations by maximizing the model evidence.
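For illustration, Eq. 8 can be transcribed directly into a trainable loss as sketched below; the small $\varepsilon$ stabilizer and the mean reduction over a batch are numerical conveniences assumed here, not part of the derivation, and the function name is illustrative.

```python
# Illustrative transcription of Eq. 8 (negative log of the Student-t evidence
# in Eq. 7) as a PyTorch loss. The eps stabilizer and the mean reduction over
# a batch are assumptions for numerical convenience, not part of the derivation.
import math
import torch

def evidential_nll(y, gamma, nu, alpha, beta, eps=1e-8):
    omega = 2.0 * beta * (1.0 + nu)                        # Omega = 2*beta*(1 + nu)
    nll = (0.5 * torch.log(math.pi / (nu + eps))           # (1/2) log(pi / nu)
           - alpha * torch.log(omega + eps)                # - alpha log(Omega)
           + (alpha + 0.5) * torch.log((y - gamma) ** 2 * nu + omega + eps)
           + torch.lgamma(alpha) - torch.lgamma(alpha + 0.5))  # log Gamma(alpha)/Gamma(alpha + 1/2)
    return nll.mean()
```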