Neural machine-based forecasting 2905
dynamical system, Y is the future time series that needs
to be predicted based on the preceding time series his-
tory X.
Let

Z = Y|X

be the event that Y happens after X, and let P_m(Z; θ) be a family of probability distributions over the same parametric space, indexed by θ. In this paper, the authors use a deep recurrent neural network, parameterized by θ, as the surrogate model G(θ) to determine the conditional probability P_m(Z; θ) as an approximation to the true but unknown data-generating distribution P_d(Z). If the time series of event Z is drawn from a dynamical system with a certain initial condition, then the conditional probability

P_d(Z) ≡ 1

due to the determinism. However, in practice, P_m(Z; θ) can only be brought close to 1 by adjusting the value of θ, without necessarily achieving the above equality, especially for complex dynamical systems. To understand how one transforms a deterministic problem into a probabilistic one, there are two viewpoints to consider.
First, following the maximum likelihood principle
[15], the estimator for θ can be defined as
θ̃ = argmax_θ P_m(Z; θ),   (3a)
  = argmax_θ ∏_{k=1}^{r} P_m(Z_k; θ),   (3b)
where Z = {Z_k, k = 1, ..., r} are independent data sequences with batch size r, generated by the true but unknown P_d(Z). Eq. (3b) can be problematic in terms of numerical computation: the product runs over many probabilities that all lie between 0 and 1, so the computation is prone to numerical underflow. Hence, it is more convenient to take the logarithm of both sides of the equation. This results in the following equivalent optimization problem:
θ̃ = argmax_θ Σ_{k=1}^{r} log P_m(Z_k; θ).   (4)
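The underflow issue behind Eq. (3b), and why the logarithm in Eq. (4) resolves it, can be shown numerically. A minimal sketch (the batch size r = 1000 and the probability values are illustrative assumptions, not taken from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

# r model probabilities P_m(Z_k; theta), each strictly between 0 and 1
r = 1000
p = rng.uniform(0.1, 0.9, size=r)

# Direct product of Eq. (3b): the product of many values in (0, 1)
# shrinks below the smallest representable float64 and underflows to 0.0
likelihood = np.prod(p)
print(likelihood)

# Sum of logs of Eq. (4): stays a finite, usable objective
log_likelihood = np.sum(np.log(p))
print(log_likelihood)
```

The two objectives have the same maximizer in exact arithmetic, but only the log form survives finite-precision computation at realistic batch sizes.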
Typically, a large value of the batch size r gives a better estimate of θ, resulting in P_m(Z_k; θ̃) ≈ 1; the prediction of the future response based on this surrogate model is then more accurate. In reality, however, r is limited during the training stage, and the probability distribution represented by Z is an empirical data-generating distribution, labeled P̃_d(Z). As a result, Eq. (4) can be written as an expectation over the empirical distribution defined by the training dataset:
θ̃ = argmax_θ E_{Z∼P̃_d} log P_m(Z; θ).   (5)
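As a toy illustration of Eq. (5) (a hypothetical Gaussian model with unknown mean standing in for P_m, not the paper's recurrent network), maximizing the empirical expectation of the log-likelihood over a parameter grid recovers the familiar MLE, the sample mean:

```python
import numpy as np

rng = np.random.default_rng(1)

# Samples Z ~ P_d (here, a Gaussian with true mean 2.0 plays the data generator)
Z = rng.normal(loc=2.0, scale=1.0, size=500)

def mean_log_likelihood(theta, z, sigma=1.0):
    """Empirical expectation E_{Z ~ P_d~} log P_m(Z; theta) for a Gaussian model."""
    return np.mean(-0.5 * ((z - theta) / sigma) ** 2
                   - 0.5 * np.log(2 * np.pi * sigma ** 2))

# Maximize the empirical expectation over a grid of candidate theta (Eq. (5))
grid = np.linspace(0.0, 4.0, 4001)
theta_hat = grid[np.argmax([mean_log_likelihood(t, Z) for t in grid])]

# For a Gaussian, the MLE of the mean is the sample mean: the two agree
# up to the grid resolution
print(theta_hat, Z.mean())
```

A neural surrogate replaces the grid search with gradient ascent over θ, but the objective is the same empirical expectation.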
The second viewpoint is related to the Kullback–Leibler (KL) divergence [16], a measure of the distance between two probability distributions. The KL divergence between P̃_d, defined by the training dataset, and P_m, related to the surrogate model, is given by
D_KL = E_{Z∼P̃_d}[log P̃_d(Z) − log P_m(Z; θ)],   (6a)
     = E_{Z∼P̃_d} log P̃_d(Z) − E_{Z∼P̃_d} log P_m(Z; θ).   (6b)
The goal is to minimize D_KL by adjusting the model parameters in G(θ). The first term in Eq. (6b) is associated only with the probability of generating certain true time series and is not related to the surrogate model itself. Hence, the estimation of θ should come only from the second term, which is
θ̃ = argmin_θ (−E_{Z∼P̃_d} log P_m(Z; θ)).   (7)
Comparing with the maximum likelihood principle
from the first viewpoint, one can find that Eqs. (5)
and (7) are essentially the same.
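This equivalence can be checked numerically: since the first term of Eq. (6b) does not depend on θ, minimizing D_KL and minimizing the cross-entropy term of Eq. (7) alone select the same parameter. A small sketch with a hypothetical one-parameter softmax family (the distributions and the parameterization are illustrative assumptions):

```python
import numpy as np

# Empirical distribution P_d~ over three outcomes
p_d = np.array([0.5, 0.3, 0.2])

def p_m(theta):
    """A hypothetical one-parameter model family (softmax over fixed scores)."""
    logits = theta * np.array([1.0, 0.0, -1.0])
    e = np.exp(logits)
    return e / e.sum()

def kl(p, q):
    """D_KL(p || q) = E_p[log p - log q], as in Eq. (6a)."""
    return np.sum(p * (np.log(p) - np.log(q)))

def cross_entropy(p, q):
    """-E_p[log q]: the theta-dependent term of Eq. (6b), as in Eq. (7)."""
    return -np.sum(p * np.log(q))

thetas = np.linspace(-2.0, 2.0, 2001)
kl_min = thetas[np.argmin([kl(p_d, p_m(t)) for t in thetas])]
ce_min = thetas[np.argmin([cross_entropy(p_d, p_m(t)) for t in thetas])]

# Both criteria pick the same theta, since they differ only by the
# theta-independent entropy term E_{p_d} log p_d
print(kl_min, ce_min)
```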
2.2 Probability distributions and loss functions
Now, the authors are ready to discuss the relations between the surrogate model G(θ) and the conditional probability P_m. As mentioned earlier, G(θ) is a deep recurrent neural network, which in essence is the following mapping function:

G(X; θ) = Y.   (8)
Again, X and Y are the time series history and the future time series, sequentially generated from a certain dynamical system. In reality, the mapping output from the surrogate model is Ŷ = G(X; θ), which is an approximation to the true target value Y with certain types of associated errors. Here, three types of error distributions, corresponding to three different P_m(Y|X; θ) and loss functions, are considered, using a one-time-step univariate time series x and y without loss of generality.
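The interface of Eq. (8) can be sketched with a minimal, randomly initialized recurrent cell in NumPy (an illustrative assumption, not the paper's architecture; the hidden size, prediction horizon, and weight scales are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(2)

# theta: randomly initialized weights of a single-layer recurrent cell
hidden = 8
theta = {
    "W_xh": rng.normal(scale=0.3, size=(1, hidden)),       # input  -> hidden
    "W_hh": rng.normal(scale=0.3, size=(hidden, hidden)),  # hidden -> hidden
    "W_hy": rng.normal(scale=0.3, size=(hidden, 1)),       # hidden -> output
}

def G(X, theta, horizon=5):
    """Map a univariate history X to a predicted future Y_hat of given length."""
    h = np.zeros(hidden)
    for x in X:                              # encode the history into the state
        h = np.tanh(x * theta["W_xh"][0] + h @ theta["W_hh"])
    Y_hat, y = [], X[-1]
    for _ in range(horizon):                 # roll the cell forward in time
        h = np.tanh(y * theta["W_xh"][0] + h @ theta["W_hh"])
        y = float(h @ theta["W_hy"][:, 0])
        Y_hat.append(y)
    return np.array(Y_hat)

X = np.sin(np.linspace(0, 2 * np.pi, 50))    # a toy history
Y_hat = G(X, theta)                          # untrained approximation to Y
print(Y_hat.shape)
```

Training would adjust θ so that Ŷ = G(X; θ) maximizes the likelihood objective of the previous subsection; here the weights are left random purely to show the mapping.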