Submission and Formatting Instructions for ICML 2022
(FEA) conn ecting encoder and decoder, and the Mixture
Of Experts Decomposition block (MOEDe c omp). The de-
tailed description of FEB, FEA, a nd MOEDecomp blocks
will be given in the following Section
3.2, 3.3, and 3.4 re-
spectively.
The encoder adopts a multilayer structure as: X
l
en
=
Encoder(X
l−1
en
), where l ∈ {1, ··· , N} denotes the out-
put of l-th encoder layer and X
0
en
∈ R
I×D
is the embed ded
historical series. The Enc oder(·) is formalized as
S
l,1
en
,
−
= MOEDecomp(FEB
X
l−1
en
+ X
l−1
en
),
S
l,2
en
,
−
= MOEDecomp(FeedForward
S
l,1
en
+ S
l,1
en
),
X
l
en
= S
l,2
en
,
(1)
where S
l,i
en
, i ∈ {1, 2} represents the seasonal compo-
nent a fter the i-th decomposition block in th e l-th layer r e-
spectively. For FEB module, it has two d ifferent versions
(FEB-f & FEB-w) which are implemented through Discrete
Fourier transform (DFT) and Discrete Wavelet transform
(DWT) mechanism respec tively and can seamlessly re place
the self- attention bloc k.
The decoder also a dopts a multilayer structure as:
X
l
de
, T
l
de
= D e coder(X
l−1
de
, T
l−1
de
), where l ∈ {1, ··· , M}
denotes the output of l-th decoder lay er. The Decoder(·) is
formalized as
S
l,1
de
, T
l,1
de
= MOEDecomp
FEB
X
l−1
de
+ X
l−1
de
,
S
l,2
de
, T
l,2
de
= MOEDecomp
FEA
S
l,1
de
, X
N
en
+ S
l,1
de
,
S
l,3
de
, T
l,3
de
= MOEDecomp
FeedForward
S
l,2
de
+ S
l,2
de
,
X
l
de
= S
l,3
de
,
T
l
de
= T
l−1
de
+ W
l,1
·T
l,1
de
+ W
l,2
·T
l,2
de
+ W
l,3
·T
l,3
de
,
(2)
where S
l,i
de
, T
l,i
de
, i ∈ {1, 2, 3 } represent the seaso nal and
trend component after the i-th decomposition block in the l-
th layer respectively. W
l,i
, i ∈ {1, 2, 3} represents the pro-
jector for the i-th extracted trend T
l,i
de
. Similar to FEB, FEA
has two different versions (FEA-f & FEA-w) which are im-
plemented through DFT and DWT projectio n respectively
with attention design, and can replace the cross-attention
block. The detailed description of FEA(·) will be given in
the following Section
3.3.
The final prediction is the sum of the two r e fined decom-
posed components as W
S
· X
M
de
+ T
M
de
, where W
S
is to
project the d e ep transformed seasonal component X
M
de
to
the target dimensio n.
3.2. Fourier Enhanced Structure
Discrete Fourier Transform (DFT) The proposed
Fourier Enhanced Structures u se discrete Fourier transfor m
(DFT). Le t F denotes the Fourier transform and F
−1
de-
Figure 3.
Frequency Enhanced Block with Fourier transform
(FEB-f) structure.
Figure 4.
Frequency Enhanced Attention with Fourier transform
(FEA-f) structure, σ(·) is the activation function.
notes the inverse Fourier transform. Given a sequence of
real numbers x
n
in time domain, where n = 1, 2...N. DFT
is defined as X
l
=
P
N−1
n=0
x
n
e
−iωln
, where i is the im a g-
inary unit and X
l
, l = 1, 2...L is a sequence o f complex
numbers in the frequency domain. Similarly, the inverse
DFT is defined as x
n
=
P
L−1
l=0
X
l
e
iωln
. The complex-
ity of DFT is O(N
2
). With fast Fourier transform (FFT),
the computation complexity can be reduced to O(N log N).
Here a random subset of the Fourier basis is used and
the scale of the subset is b ounded by a scalar. When we
choose the mode index b efore DFT and reverse DFT oper-
ations, the computation complexity can be further red uced
to O(N).
Frequency Enhanced Block with Fourier Transform
(FEB-f) The FE B-f is used in both encoder and decoder
as shown in Figure
2. The input (x ∈ R
N×D
) of the
FEB-f block is first linearly projected with w ∈ R
D×D
, so
q = x·w. Then q is converted f rom the time dom ain to the
frequency domain. The Fourier transform of q is denoted
as Q ∈ C
N×D
. In frequency domain , only the randomly
selected M modes are kept so we use a select operator as
˜
Q = Select(Q) = Select(F(q)), (3)
where
˜
Q ∈ C
M×D
and M << N. Then, the FEB-f is
defined as
FEB-f(q) = F
−1
(Padding(
˜
Q ⊙ R)), (4)
where R ∈ C
D×D×M
is a parameterized kernel in itialized
randomly. Let Y = Q ⊙ C, with Y ∈ C
M×D
. The pro-
duction operator ⊙ is defined as: Y
m,d
o
=
P
D
d
i
=0
Q
m,d
i
·
R
d
i
,d
o
,m
, where d
i
= 1, 2...D is the input channel and
d
o
= 1, 2...D is the output chan nel. The result of Q ⊙ R
is then z e ro-padded to C
N×D
before performing inverse
Fourier transform bac k to the time domain. T he structure
is shown in Figure
3.