Yi ZHENG et al. Exploiting Multi-Channels Deep Convolutional Neural Networks for Multivariate Time Series Classification
where W should satisfy the three constraints above. Going one step further, for two multivariate time series X and Y, DTW between X and Y can be defined, similarly to Euclidean distance, as follows:
$$\mathrm{DTW}(X, Y) = \sum_{i=1}^{l} \mathrm{DTW}(x_i, y_i)$$
where $l$ denotes the number of components in the multivariate time series, and $x_i$ and $y_i$ represent the $i$-th univariate time series of $X$ and $Y$, respectively.
It is common to apply dynamic programming to compute DTW(Q, C) (or DTW(X, Y)), which is efficient and has a time complexity of $O(n^2)$ in this context. However, when the data set grows large and the time series become long, computing DTW combined with the k-NN method is very time consuming. Hence, to reduce the time cost, window constraint DTW has been widely adopted instead of full DTW in much previous work [10, 18–20]. Moreover, intuitively, the warping path is unlikely to stray far from the diagonal of the distance matrix [10]. In other words, for any element $w_k = d(q_i, c_j)$ in the warping path, the difference between $i$ and $j$ should not be too large. By limiting the warping path to a warping window, some previous works [10, 19] showed that relatively tight warping windows actually improve the classification accuracy.
According to the above discussion, we consider both Euclidean distance and window constraint DTW as the default distance measures in the following.
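To make the window constraint concrete, the following is a minimal sketch of window-constrained DTW with a Sakoe–Chiba-style band, extended to the multivariate case by summing per-component DTW distances as in the equation above. The function names, the squared-difference local cost, and the final square root are our illustrative assumptions, not details fixed by the text:

```python
import numpy as np

def dtw_window(q, c, w):
    """Window-constrained DTW between univariate series q and c,
    with a Sakoe-Chiba band of half-width w around the diagonal."""
    n, m = len(q), len(c)
    w = max(w, abs(n - m))            # the band must cover the length difference
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        # only cells with |i - j| <= w are ever filled in
        for j in range(max(1, i - w), min(m, i + w) + 1):
            d = (q[i - 1] - c[j - 1]) ** 2            # local squared cost
            D[i, j] = d + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return np.sqrt(D[n, m])

def dtw_multivariate(X, Y, w):
    """DTW between multivariate series: the sum of per-component DTWs."""
    return sum(dtw_window(x_i, y_i, w) for x_i, y_i in zip(X, Y))
```

Restricting `j` to the band reduces the work per row from `m` cells to at most `2w + 1`, which is where the speedup over full DTW comes from.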
3 Multi-Channels Deep Convolutional Neural
Networks
In this section, we will introduce a deep learning
framework for multivariate time series classification: Multi-
Channels Deep Convolutional Neural Networks (MC-
DCNN). Traditional Convolutional Neural Networks (CNN)
usually include two parts. One is a feature extractor,
which learns features from raw data automatically. The
other is a trainable fully-connected MLP, which performs
classification based on the learned features from the previous
part. Generally, the feature extractor is composed of multiple
similar stages, and each stage is made up of three cascading
layers: filter layer, activation layer and pooling layer. The
input and output of each layer are called feature maps [13].
In previous work on CNN [13], the feature extractor usually contains one, two or three such 3-layer stages. In the remainder of this section, we first briefly introduce the components of CNN; more details can be found in [13, 21]. Then, we present the gradient-based learning of our model. Finally, the related unsupervised pretraining is described at the end of this section.
3.1 Architecture
In contrast to image classification, the inputs of multivariate time series classification are multiple 1D subsequences rather than 2D image pixels. We modify the traditional CNN and apply it to the multivariate time series classification task as follows: we separate the multivariate time series into univariate ones and perform feature learning on each univariate series individually, and then a traditional MLP is concatenated at the end of the feature learning to perform the classification. For ease of understanding, we illustrate the architecture of MC-DCNN in Fig. 3. Specifically, this is an example of a 2-stage MC-DCNN with pretraining for activity classification. Once the pretraining is completed, the initial weights of the network are obtained. Then, the inputs of the 3 channels are fed into a 2-stage feature extractor, which learns hierarchical features through filter, activation and pooling layers. At the end of the feature extractor, the feature maps of each channel are flattened and combined as the input of the subsequent MLP for classification. Note that in Fig. 3, the activation layer is embedded into the filter layer in the form of a non-linear operation on each feature map. We describe how each layer works in the following subsections.
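The per-channel pipeline described above can be sketched numerically. The sketch below assumes, for brevity, a single filter + activation + average-pooling stage per channel, sigmoid activations, and a single fully-connected output layer; all function names, sizes and these simplifications are illustrative assumptions, not the paper's exact configuration:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def channel_stage(x, kernels, biases, pool=2):
    """One filter + activation + average-pooling stage on a univariate series."""
    maps = np.stack([sigmoid(np.convolve(x, k, mode="valid") + b)
                     for k, b in zip(kernels, biases)])
    n = maps.shape[1] // pool * pool          # drop the ragged tail before pooling
    return maps[:, :n].reshape(len(maps), -1, pool).mean(axis=2)

def mcdcnn_forward(channels, channel_params, W, b):
    """Per-channel feature learning, then flatten and combine into an MLP."""
    feats = [channel_stage(x, ks, bs).ravel()
             for x, (ks, bs) in zip(channels, channel_params)]
    z = np.concatenate(feats)                 # combined features of all channels
    return sigmoid(W @ z + b)                 # one fully-connected output layer
```

A 2-stage extractor would simply apply a second `channel_stage`-like step to the feature maps of the first, before flattening.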
3.1.1 Filter Layer
The input of each filter is a univariate time series, denoted $x_i^l \in \mathbb{R}^{n_2^l}$, $1 \le i \le n_1^l$, where $l$ denotes the layer from which the time series comes, and $n_1^l$ and $n_2^l$ are the number and the length of the input time series, respectively. To capture local temporal information, we restrict each trainable filter $k_{ij}$ to a small size, denoted $m_2^l$; the number of filters at layer $l$ is denoted $m_1^l$. Recalling the example described in Fig. 3, in the first stage of channel 1 we have $n_1^l = 1$, $n_2^l = 256$, $m_2^l = 5$ and $m_1^l = 8$. We compute the output of each filter according to

$$\sum_i x_i^{l-1} \ast k_{ij}^l + b_j^l$$

where $\ast$ is the convolution operator and $b_j^l$ is the bias term. To determine the size of each filter $k_{ij}$, we follow the earlier studies [22] and set it to 5 ($m_2 = 5$), as they suggested.
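The per-filter computation above maps directly onto a valid-mode 1D convolution; a minimal sketch follows, using the Fig. 3 example sizes ($n_1 = 1$, $n_2 = 256$, $m_2 = 5$, $m_1 = 8$). The function name and array layout are our assumptions:

```python
import numpy as np

def filter_layer(x_prev, kernels, bias):
    """Output feature maps: sum_i x_i^{l-1} * k_ij^l + b_j^l (valid convolution).
    x_prev: (n1, n2) input maps; kernels: (n1, m1, m2); bias: (m1,)."""
    n1, n2 = x_prev.shape
    _, m1, m2 = kernels.shape
    out = np.zeros((m1, n2 - m2 + 1))         # valid convolution shrinks the map
    for j in range(m1):                       # each of the m1 output maps
        for i in range(n1):                   # sum contributions of all input maps
            out[j] += np.convolve(x_prev[i], kernels[i, j], mode="valid")
        out[j] += bias[j]
    return out
```

With the Fig. 3 sizes, a (1, 256) input and (1, 8, 5) kernels produce 8 feature maps of length 252 each.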
3.1.2 Activation Layer
The activation function introduces non-linearity into neural networks and allows them to learn more complex models.
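To see why this non-linearity matters: without it, stacked filter layers collapse into a single linear map. A small numeric illustration (sigmoid is used here only as an example activation):

```python
import numpy as np

# Without an activation, two stacked linear layers equal one linear layer:
W1 = np.array([[2.0, 0.0], [0.0, 3.0]])
W2 = np.array([[1.0, 1.0], [0.0, 1.0]])
x = np.array([1.0, -1.0])
assert np.allclose(W2 @ (W1 @ x), (W2 @ W1) @ x)   # composition stays linear

# Inserting a non-linearity between the layers breaks this equivalence:
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
assert not np.allclose(W2 @ sigmoid(W1 @ x), (W2 @ W1) @ x)
```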