$$\begin{bmatrix} i_t \\ f_t \\ o_t \end{bmatrix} = \sigma\big( W\,[\,x_t \oplus \mathrm{vec}(\tilde{h}_{t-1})\,] + b \big) \tag{2}$$

$$c_t = f_t \odot c_{t-1} + i_t \odot \mathrm{vec}(\tilde{j}_t) \tag{3}$$

$$\tilde{h}_t = \mathrm{matricization}\big( o_t \odot \tanh(c_t) \big) \tag{4}$$

Equation set 1: IMV-Full
$$\begin{bmatrix} \tilde{i}_t \\ \tilde{f}_t \\ \tilde{o}_t \end{bmatrix} = \sigma\big( \mathcal{W} \circledast \tilde{h}_{t-1} + \mathcal{U} \circledast x_t + b \big) \tag{5}$$

$$\tilde{c}_t = \tilde{f}_t \odot \tilde{c}_{t-1} + \tilde{i}_t \odot \tilde{j}_t \tag{6}$$

$$\tilde{h}_t = \tilde{o}_t \odot \tanh(\tilde{c}_t) \tag{7}$$

Equation set 2: IMV-Tensor
IMV-Full: With the vectorization in Eq. (2) and Eq. (3), IMV-Full updates gates and memories using the full $\tilde{h}_{t-1}$ and $\tilde{j}_t$, regardless of the variable-wise data organization in them. By simply replacing the hidden-state update of the standard LSTM with $\tilde{j}_t$, IMV-Full behaves identically to the standard LSTM while enjoying the interpretability shown below.
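For concreteness, the following is a minimal NumPy sketch of one IMV-Full recurrent step, Eqs. (2)-(4). The shapes, the treatment of $x_t$ as one scalar per variable, and the packing of the three gates into a single weight matrix are illustrative assumptions rather than a prescribed implementation; the tensorized update $\tilde{j}_t$ from Eq. (1) is taken as given.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def imv_full_step(x_t, h_prev, c_prev, j_t, W, b):
    """One IMV-Full step (Eqs. 2-4): gates act on the flattened hidden state.

    x_t    : (N,)              current input, one value per variable (assumed)
    h_prev : (N, d)            hidden state matrix h~_{t-1}
    c_prev : (N*d,)            flat memory cell c_{t-1}
    j_t    : (N, d)            tensorized update j~_t from Eq. (1)
    W, b   : (3*N*d, N + N*d), (3*N*d,)  shared gate parameters (assumed packing)
    """
    N, d = h_prev.shape
    # Eq. (2): gates from x_t concatenated with vec(h~_{t-1})
    z = W @ np.concatenate([x_t, h_prev.reshape(-1)]) + b
    i_t, f_t, o_t = np.split(sigmoid(z), 3)
    # Eq. (3): memory update mixes vec(j~_t) into the flat cell
    c_t = f_t * c_prev + i_t * j_t.reshape(-1)
    # Eq. (4): gate the cell and fold it back into an N x d matrix
    h_t = (o_t * np.tanh(c_t)).reshape(N, d)
    return h_t, c_t
```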
IMV-Tensor: By applying the tensor-dot operations in Eq. (5), gates and memory cells are matrices as well, whose elements correspond to the input variables in the same way as the hidden state matrix $\tilde{h}_t$ does.

In IMV-Full and IMV-Tensor, gates only scale $\tilde{j}_t$ and $\tilde{c}_{t-1}$ and thus retain the variable-wise data organization in $\tilde{h}_t$. Meanwhile, based on the tensorized hidden-state update in Eq. (1) and the gate update in Eq. (5), IMV-Tensor can also be regarded as a set of parallel LSTMs, each of which processes one variable series. The derived hidden states specific to individual variables are then aggregated at both the temporal and variable levels through the attention mechanism described below.
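This parallel-LSTM view admits a direct sketch: below, Eqs. (5)-(7) are written as $N$ per-variable LSTMs driven by a single einsum, with the candidate update $\tilde{j}_t$ of Eq. (1) folded into the same per-variable parameter tensor for brevity. The parameter shapes and this bundling are assumptions made only for illustration; the point it makes explicit is that row $n$ of $\tilde{h}_t$ depends on variable $n$'s series alone, which is what enables the variable-wise interpretation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def imv_tensor_step(x_t, h_prev, c_prev, W, U, b):
    """One IMV-Tensor step (Eqs. 5-7) as N parallel per-variable LSTMs.

    x_t    : (N,)         one input value per variable (assumed)
    h_prev : (N, d)       hidden state matrix h~_{t-1}
    c_prev : (N, d)       memory cell matrix c~_{t-1}
    W      : (N, 4*d, d)  per-variable recurrent weights (tensor-dot)
    U      : (N, 4*d)     per-variable input weights
    b      : (N, 4*d)     per-variable bias
    """
    N, d = h_prev.shape
    # Eq. (5): the tensor-dot keeps every variable's gates separate
    z = np.einsum('nkd,nd->nk', W, h_prev) + U * x_t[:, None] + b
    i_t = sigmoid(z[:, :d])
    f_t = sigmoid(z[:, d:2 * d])
    o_t = sigmoid(z[:, 2 * d:3 * d])
    j_t = np.tanh(z[:, 3 * d:])          # candidate update j~_t (Eq. 1, bundled here)
    # Eq. (6): element-wise memory update, organised per variable
    c_t = f_t * c_prev + i_t * j_t
    # Eq. (7): row n of the new hidden state matrix only sees variable n
    h_t = o_t * np.tanh(c_t)
    return h_t, c_t
```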
3.2 MIXTURE ATTENTION
After feeding a sequence $\{x_1, \cdots, x_T\}$ into either IMV-Full or IMV-Tensor, we obtain a sequence of hidden state matrices $\{\tilde{h}_1, \cdots, \tilde{h}_T\}$, where the sequence of hidden states specific to variable $n$ is extracted as $\{h^n_1, \cdots, h^n_T\}$.
In this part, we present the novel mixture attention mechanism of IMV-LSTM, based on the following idea. Temporal attention is first applied to the sequence of hidden states of each variable, so as to obtain a summarized history of that variable. Then, using the history-enriched hidden state of each variable, the global variable attention is derived. These two steps are assembled into a probabilistic mixture model (Zong et al., 2018; Graves, 2013; Bishop, 1994), which facilitates the subsequent training, inference, and interpretation processes.
In particular, the mixture attention is formulated as:
$$\begin{aligned}
p(y_{T+1} \mid X_T) &= \sum_{n=1}^{N} p(y_{T+1} \mid z_{T+1} = n, X_T) \cdot p(z_{T+1} = n \mid X_T) \\
&= \sum_{n=1}^{N} p(y_{T+1} \mid z_{T+1} = n, h^n_1, \cdots, h^n_T) \cdot p(z_{T+1} = n \mid \tilde{h}_1, \cdots, \tilde{h}_T) \\
&= \sum_{n=1}^{N} \underbrace{p(y_{T+1} \mid z_{T+1} = n, h^n_T \oplus g^n)}_{\substack{\text{variable-wise} \\ \text{temporal attention}}} \cdot \underbrace{p(z_{T+1} = n \mid h^1_T \oplus g^1, \cdots, h^N_T \oplus g^N)}_{\text{overall variable attention}}
\end{aligned} \tag{8}$$
In Eq. (8), we introduce a latent random variable $z_{T+1}$ into the density function of $y_{T+1}$ to govern the generation process. $z_{T+1}$ is a discrete variable over the set of values $\{1, \cdots, N\}$ corresponding to the $N$ input variables. Mathematically, $p(y_{T+1} \mid z_{T+1} = n, h^n_T \oplus g^n)$ characterizes the density of $y_{T+1}$ conditioned on the historical data of variable $n$, while the prior of $z_{T+1}$, i.e. $p(z_{T+1} = n \mid h^1_T \oplus g^1, \cdots, h^N_T \oplus g^N)$, controls to what extent $y_{T+1}$ is driven by variable $n$.
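One natural parameterization of this prior, given here only as a hedged sketch rather than the prescribed form, is a softmax over per-variable scores of the concatenated features $h^n_T \oplus g^n$; the linear scorer (w, b) below is an assumption for illustration.

```python
import numpy as np

def softmax(x):
    x = x - x.max()
    e = np.exp(x)
    return e / e.sum()

def variable_prior(h_T, g, w, b):
    """Sketch of the variable attention p(z_{T+1} = n | .) in Eq. (8).

    h_T : (N, d)   last hidden state of each variable, h^n_T
    g   : (N, d)   temporal context vectors g^n
    w   : (2*d,)   scoring weights over h^n_T (+) g^n (assumed linear scorer)
    b   : scalar   scoring bias
    """
    feats = np.concatenate([h_T, g], axis=1)   # (N, 2d), one row per variable
    scores = feats @ w + b                     # unnormalised relevance of each variable
    return softmax(scores)                     # normalise across the N variables
```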
The context vector $g^n$ is computed as the temporal-attention-weighted sum of the hidden states corresponding to variable $n$, i.e., $g^n = \sum_t \alpha^n_t h^n_t$, where the attention weight is $\alpha^n_t = \frac{\exp\left(f^n(h^n_t)\right)}{\sum_k \exp\left(f^n(h^n_k)\right)}$. Here $f^n(\cdot)$ can be a flexible function specific to variable $n$, e.g., a neural network (Bahdanau et al., 2014).
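As a concrete illustration of the temporal attention, the context vectors $g^n$ for all variables can be computed at once; the sketch below assumes a simple linear scoring function in place of the flexible $f^n(\cdot)$ above.

```python
import numpy as np

def softmax(x, axis=0):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def temporal_context(h_seq, score_w, score_b):
    """Variable-wise temporal attention with a linear stand-in for f^n.

    h_seq   : (T, N, d)  hidden states h^n_t for all steps and variables
    score_w : (N, d)     per-variable scoring weights (assumed linear f^n)
    score_b : (N,)       per-variable scoring bias
    returns : g of shape (N, d) with g^n = sum_t alpha^n_t h^n_t, and alpha (T, N)
    """
    scores = np.einsum('tnd,nd->tn', h_seq, score_w) + score_b  # f^n(h^n_t)
    alpha = softmax(scores, axis=0)             # normalise over time, per variable
    g = np.einsum('tn,tnd->nd', alpha, h_seq)   # attention-weighted sums
    return g, alpha
```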
The density $p(y_{T+1} \mid z_{T+1} = n, h^n_T \oplus g^n)$ is a Gaussian distribution parameterized by $[\mu^n, \sigma^n] = \varphi^n(h^n_T \oplus g^n)$, where $\varphi^n(\cdot)$ can be a feedforward neural network.
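Putting the pieces together, Eq. (8) becomes a Gaussian mixture over the $N$ variables. The sketch below evaluates that density for a scalar target; sharing a single callable phi across variables (standing in for the per-variable $\varphi^n$) is an assumption made purely for brevity.

```python
import numpy as np

def gaussian_pdf(y, mu, sigma):
    return np.exp(-0.5 * ((y - mu) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))

def mixture_density(y, h_T, g, phi, p_z):
    """Evaluate the mixture in Eq. (8) for a scalar target y_{T+1}.

    h_T : (N, d)   last hidden states h^n_T
    g   : (N, d)   temporal contexts g^n
    phi : callable (2*d,) -> (mu, sigma); stands in for the per-variable phi^n
    p_z : (N,)     variable attention p(z_{T+1} = n | .)
    """
    density = 0.0
    for n in range(h_T.shape[0]):
        feat = np.concatenate([h_T[n], g[n]])   # h^n_T (+) g^n
        mu_n, sigma_n = phi(feat)               # [mu^n, sigma^n] = phi^n(.)
        density += p_z[n] * gaussian_pdf(y, mu_n, sigma_n)
    return density
```

Consistent with the probabilistic formulation above, a natural training objective would be the logarithm of this mixture density over observed targets, with the learned mixture weights supplying the variable-level interpretation.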