(i) Convolutional Neural Networks
Traditionally designed for image datasets, convolutional neural networks (CNNs) extract local relationships that are invariant across spatial dimensions [10, 22]. To adapt CNNs to time series datasets, researchers utilise multiple layers of causal convolutions [23, 24, 25] – i.e. convolutional filters designed to ensure only past information is used for forecasting. For an intermediate feature at hidden layer $l$, each causal convolutional filter takes the form below:
h^{l+1}_t = A\big( (W \ast h)(l, t) \big),   (2.4)

(W \ast h)(l, t) = \sum_{\tau=0}^{k} W(l, \tau)\, h^{l}_{t-\tau},   (2.5)
where $h^{l}_t \in \mathbb{R}^{H_{in}}$ is an intermediate state at layer $l$ at time $t$, $\ast$ is the convolution operator, $W(l, \tau) \in \mathbb{R}^{H_{out} \times H_{in}}$ is a fixed filter weight at layer $l$, and $A(\cdot)$ is an activation function, such as a sigmoid function, representing any architecture-specific non-linear processing.
Considering the 1-D case, we can see that Equation (2.5) bears a strong resemblance to finite impulse response (FIR) filters in digital signal processing [26]. This leads to two key implications for temporal relationships learnt by CNNs. Firstly, in line with the spatial invariance assumptions for standard CNNs, temporal CNNs assume that relationships are time-invariant – using the same set of filter weights at each time step and across all time. In addition, CNNs are only able to use inputs within their defined lookback window, or receptive field, to make forecasts. As such, the receptive field size $k$ needs to be tuned carefully to ensure that the model can make use of all relevant historical information. It is worth noting that a single causal CNN layer with a linear activation function is equivalent to an auto-regressive (AR) model.
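As a concrete illustration of Equations (2.4)–(2.5), the sketch below implements a single causal convolutional layer directly in NumPy; the function name, loop structure and toy dimensions are illustrative choices rather than part of any particular published architecture.

```python
import numpy as np

def causal_conv_layer(h, W, activation=np.tanh):
    """One causal convolutional layer in the spirit of Eqs. (2.4)-(2.5).

    h : (T, H_in) array of lower-layer features h^l_t.
    W : (k + 1, H_out, H_in) array of filter weights W(l, tau), tau = 0..k.
    Each output h^{l+1}_t depends only on h^l_{t - tau} with tau >= 0,
    i.e. on present and past inputs.
    """
    T, _ = h.shape
    n_taps, H_out, _ = W.shape
    out = np.zeros((T, H_out))
    for t in range(T):
        for tau in range(n_taps):
            if t - tau >= 0:                   # drop terms before the series starts
                out[t] += W[tau] @ h[t - tau]  # accumulate (W * h)(l, t)
    return activation(out)                     # h^{l+1}_t = A((W * h)(l, t))

# Toy usage: T = 10 steps, H_in = 3 channels, H_out = 2, receptive field k = 2.
rng = np.random.default_rng(0)
h = rng.standard_normal((10, 3))
W = rng.standard_normal((3, 2, 3))
print(causal_conv_layer(h, W).shape)  # (10, 2)
```

With `activation` set to the identity and a single input and output channel, this layer reduces to an AR($k$) model, which is the equivalence noted above.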
Dilated Convolutions
Using standard convolutional layers can be computationally challenging where long-term dependencies are significant, as the number of parameters scales directly with the size of the receptive field. To alleviate this, modern architectures frequently make use of dilated convolutional layers [23, 24], which extend Equation (2.5) as below:
(W \ast h)(l, t, d_l) = \sum_{\tau=0}^{\lfloor k / d_l \rfloor} W(l, \tau)\, h^{l}_{t - d_l \tau},   (2.6)
where $\lfloor \cdot \rfloor$ is the floor operator and $d_l$ is a layer-specific dilation rate. Dilated convolutions can hence be interpreted as convolutions of a down-sampled version of the lower-layer features – reducing resolution to incorporate information from the distant past. As such, by increasing the dilation rate with each layer, dilated convolutions can gradually aggregate information at different time blocks, allowing for more history to be used in an efficient manner. With the WaveNet architecture of [23], for instance, dilation rates are increased in powers of 2, with adjacent time blocks aggregated in each layer – allowing for $2^l$ time steps to be used at layer $l$, as shown in Figure 1a.
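The sketch below is a rough illustration of Equation (2.6) rather than WaveNet itself: it adds a dilation rate to the causal layer above and stacks layers with dilation rates 1, 2, 4, 8, so that a 2-tap filter covers exponentially more history with depth; names and dimensions are again illustrative.

```python
import numpy as np

def dilated_causal_conv(h, W, dilation, activation=np.tanh):
    """Dilated causal convolution in the spirit of Eq. (2.6): the output at
    time t mixes h^l_{t - d_l * tau} for tau = 0, 1, ..., skipping
    d_l - 1 steps between successive filter taps."""
    T, _ = h.shape
    n_taps, H_out, _ = W.shape
    out = np.zeros((T, H_out))
    for t in range(T):
        for tau in range(n_taps):
            idx = t - dilation * tau          # reach d_l * tau steps into the past
            if idx >= 0:
                out[t] += W[tau] @ h[idx]
    return activation(out)

# WaveNet-style stack: doubling the dilation rate at each layer lets a
# 2-tap filter reach roughly 2^l time steps into the past by layer l.
rng = np.random.default_rng(0)
h = rng.standard_normal((64, 1))              # toy univariate series, T = 64
width = 8
for dilation in [1, 2, 4, 8]:
    W = 0.1 * rng.standard_normal((2, width, h.shape[1]))
    h = dilated_causal_conv(h, W, dilation)
print(h.shape)  # (64, 8)
```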
(ii) Recurrent Neural Networks
Recurrent neural networks (RNNs) have historically been used in sequence modelling [22], with strong results on a variety of natural language processing tasks [27]. Given the natural interpretation of time series data as sequences of inputs and targets, many RNN-based architectures have been developed for temporal forecasting applications [28, 29, 30, 31]. At their core, RNN cells contain an internal memory state which acts as a compact summary of past information. The memory state is recursively updated with new observations at each time step, as shown in Figure 1b, i.e.:
z_t = \nu\left( z_{t-1}, \tilde{x}_t \right),   (2.7)
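To make the recursion in Equation (2.7) concrete, the sketch below unrolls a vanilla (Elman-style) cell, where $\nu$ is taken to be a single learnt affine map followed by a $\tanh$ non-linearity; this particular choice of $\nu$, together with all names and dimensions, is purely illustrative.

```python
import numpy as np

def rnn_step(z_prev, x_t, W_z, W_x, b, activation=np.tanh):
    """One application of Eq. (2.7), z_t = nu(z_{t-1}, x_t), with the simple
    Elman-style choice nu(z, x) = A(W_z z + W_x x + b)."""
    return activation(W_z @ z_prev + W_x @ x_t + b)

# Unroll over a toy series: z_t acts as a compact summary of all past inputs.
rng = np.random.default_rng(0)
T, H, D = 20, 4, 3                        # series length, state size, input size
W_z = 0.1 * rng.standard_normal((H, H))
W_x = 0.1 * rng.standard_normal((H, D))
b = np.zeros(H)
z = np.zeros(H)                           # initial memory state z_0
for t in range(T):
    x_t = rng.standard_normal(D)          # stand-in for the observed inputs at step t
    z = rnn_step(z, x_t, W_z, W_x, b)
print(z)                                  # final state summarising the whole history
```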