to describe the probability distribution of sequences. A test sequence is then assigned to the class whose model yields the highest likelihood of generating it.
Autoregressive models [18] and time-delay embedding [19] are designed for univariate time series, but their extension to vector sequences is non-trivial. The hidden Markov model (HMM) [20] is one of the most popular probabilistic dynamic system models; it models the joint probability of hidden states and observations. The conditional
random field (CRF) [21] directly models the conditional
probability of the hidden nodes given the observations, but
it requires more data and computation for training. Some model-based approaches, such as the HMM, can also be applied to concatenated sequences to perform implicit segmentation-based classification [2], where the sequence segmentation is obtained as a by-product of recognition via so-called cross training.
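For illustration, a minimal sketch of such likelihood-based classification with one HMM per class might look as follows; the hmmlearn library, Gaussian emissions, and the number of hidden states are our own illustrative assumptions, not the configurations used in the cited work:

```python
import numpy as np
from hmmlearn.hmm import GaussianHMM  # assumed available: pip install hmmlearn

def train_class_models(train_seqs_by_class, n_states=5):
    """Fit one HMM per class; each sequence is an (N_i, d) array."""
    models = {}
    for label, seqs in train_seqs_by_class.items():
        X = np.vstack(seqs)               # stack the frames of all sequences
        lengths = [len(s) for s in seqs]  # frame counts delimit the sequences
        model = GaussianHMM(n_components=n_states, covariance_type="diag")
        model.fit(X, lengths)
        models[label] = model
    return models

def classify(models, test_seq):
    """Assign test_seq to the class whose model gives the highest log-likelihood."""
    return max(models, key=lambda c: models[c].score(test_seq))
```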
Rather than employing models for classification, our pro-
posed approach utilizes the properties of such models to
define the statistics of the sequence class.
Distance-Based Approaches. The core of distance-based
approaches is to define a distance that measures the similar-
ity between sequences. For vector sequences with unequal
lengths, dynamic time warping (DTW) [22] is the most
widely used distance measure. Given two sequences
$\mathbf{X} = [x_1, x_2, \ldots, x_{N_x}] \in \mathbb{R}^{d \times N_x}$ and $\mathbf{Y} = [y_1, y_2, \ldots, y_{N_y}] \in \mathbb{R}^{d \times N_y}$ with lengths $N_x$ and $N_y$, respectively, DTW finds the optimal alignment from all possible sets of correspondences between vectors, which minimizes the sum of pairwise vector-to-vector distances

$$\min_{\mathbf{p}^x,\, \mathbf{p}^y} \sum_{t=1}^{T} \left\| x_{p^x_t} - y_{p^y_t} \right\|^2, \qquad (1)$$
where $T$ is the number of steps required to align the two sequences. $\mathbf{p}^x = [p^x_1, p^x_2, \ldots, p^x_T]^\top \in \{1{:}N_x\}^{T \times 1}$ and $\mathbf{p}^y = [p^y_1, p^y_2, \ldots, p^y_T]^\top \in \{1{:}N_y\}^{T \times 1}$ denote the aligned indices between vectors in sequences $\mathbf{X}$ and $\mathbf{Y}$ (the warping paths), respectively. Boundary, continuity, and monotonicity constraints are imposed on $\mathbf{p}^x$ and $\mathbf{p}^y$. Eq. (1) can be efficiently solved by dynamic programming.
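For concreteness, Eq. (1) admits the standard $O(N_x N_y)$ dynamic-programming solution; the following NumPy sketch (an illustration, not the implementation evaluated in this paper) computes the DTW distance under the usual boundary, continuity, and monotonicity constraints:

```python
import numpy as np

def dtw_distance(X, Y):
    """DTW distance between X (d x Nx) and Y (d x Ny)."""
    Nx, Ny = X.shape[1], Y.shape[1]
    # Pairwise squared Euclidean distances between column vectors.
    cost = ((X[:, :, None] - Y[:, None, :]) ** 2).sum(axis=0)
    # D[i, j]: minimal accumulated cost of aligning the first i columns
    # of X with the first j columns of Y.
    D = np.full((Nx + 1, Ny + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, Nx + 1):
        for j in range(1, Ny + 1):
            D[i, j] = cost[i - 1, j - 1] + min(
                D[i - 1, j],      # advance in X only
                D[i, j - 1],      # advance in Y only
                D[i - 1, j - 1],  # advance in both
            )
    return D[Nx, Ny]
```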
Some variations of DTW based on stretching and align-
ment have been proposed. The constrained DTW [23] restricts how far the alignment can deviate through a locality constraint. The longest common subsequence distance [24] allows unmatched elements in the stretched
sequences. In [25], each sequence is first mapped to a semi-
continuous HMM, and then the DTW distance between the
mixture weight vectors of the two HMMs is used as the final
distance between the two sequences.
Feature-Based Approaches. The basic concept of feature-
based approaches is to represent a sequence as a global fea-
ture vector such that vector sequences are mapped to feature
vectors with fixed dimensionality. Conventional supervised
learning methods, such as LDA, k-nearest neighbor and sup-
port vector machine (SVM), can then be applied.
For univariate time series, discrete Fourier trans-
forms [26] and discrete wavelet transforms [27] are applied
to individual time series, and only a portion of the coefficients is preserved. Consequently, the entire time series is
transformed to a new, shorter representation. These meth-
ods actually reduce the length of sequences rather than the
dimensionality of the component vectors, which differs
from the focus of this paper.
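As a sketch of this idea (for a univariate series; the number of retained coefficients k is an arbitrary illustrative choice):

```python
import numpy as np

def dft_representation(series, k=16):
    """Shorten a univariate time series by keeping only its first k
    (low-frequency) Fourier coefficients."""
    coeffs = np.fft.rfft(series)  # FFT for real-valued input
    kept = coeffs[:k]             # discard the high-frequency tail
    # Concatenate real and imaginary parts into a fixed-length vector.
    return np.concatenate([kept.real, kept.imag])
```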
A sequence can either be represented by a collection of
subsequences called shapelets in [28], or be transformed
into a bag-of-features representation as in [29]. In [16], the
vector representation of a sequence is obtained by mean
pooling the mapped vectors in the sequence. In [15], for
each video, local descriptors are pooled by the bag-of-words
or by Fisher vector to form a single vector representation
without considering the temporal positions of the descrip-
tors. To encode the temporal dynamics into the final repre-
sentation, the pooled time series [30] summarizes the
changes in the frame-wide feature elements over time, and
rank pooling [31] learns a function capable of ordering the frame-wide features in time and uses the function parameters as
the representation.
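As an illustration, the following sketch contrasts order-less mean pooling with a simplified form of rank pooling; it follows the spirit of [31] but substitutes a closed-form ridge regression for the ranking machine used there:

```python
import numpy as np

def mean_pooling(frames):
    """Order-less pooling: frames (N, d) -> a single d-dimensional mean."""
    return frames.mean(axis=0)

def rank_pooling(frames, reg=1.0):
    """Simplified rank pooling: learn weights w such that w @ v_t grows
    with the frame index t, and use w as the sequence representation."""
    N, d = frames.shape
    # Time-varying mean smooths the features while keeping their evolution.
    V = np.cumsum(frames, axis=0) / np.arange(1, N + 1)[:, None]
    t = np.arange(1, N + 1, dtype=float)
    # Closed-form ridge regression: w = (V^T V + reg*I)^{-1} V^T t.
    return np.linalg.solve(V.T @ V + reg * np.eye(d), V.T @ t)
```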
2.4 Dimensionality Reduction for Sequences (DRS)
Although DRS has begun to receive attention, it has
remained an under-explored area. In [3], [32], DTW is com-
bined with canonical correlation analysis to align multidi-
mensional feature sequences. Although these methods also
perform DRS, they can only be applied to multi-modal
sequences for alignment and cannot be extended to multi-
sequence classes for classification. In [33], a sequence-kernel-based DR approach that combines spatial, temporal, and
periodic information is proposed for time series data, where
labels are associated with the vectors in long time series.
The method is kernel-based, and its task is to predict a class
label for each frame.
In speech and handwriting recognition, LDA has been performed based on HMMs [1], [34]. This consists of two steps: 1) an HMM for each class is trained to create pseudo state labels for the vectors in the training sequences, and 2) LDA is then performed by treating all states of all HMMs as individual classes. We refer to this method as “state-LDA”.
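A minimal sketch of state-LDA under our own illustrative choices (hmmlearn GaussianHMMs, sklearn's LDA, and arbitrary state/output dimensions):

```python
import numpy as np
from hmmlearn.hmm import GaussianHMM
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

def state_lda(train_seqs_by_class, n_states=5, out_dim=3):
    """Step 1: per-class HMMs assign pseudo state labels to every frame.
    Step 2: LDA treats each (class, state) pair as a separate class."""
    frames, labels = [], []
    for c, seqs in enumerate(train_seqs_by_class.values()):
        X = np.vstack(seqs)
        lengths = [len(s) for s in seqs]
        hmm = GaussianHMM(n_components=n_states).fit(X, lengths)
        states = hmm.predict(X, lengths)      # Viterbi pseudo state labels
        frames.append(X)
        labels.append(c * n_states + states)  # unique id per (class, state)
    X_all, y_all = np.vstack(frames), np.concatenate(labels)
    # out_dim must be smaller than the number of distinct (class, state) labels.
    lda = LinearDiscriminantAnalysis(n_components=out_dim).fit(X_all, y_all)
    return lda  # lda.transform(...) then reduces each frame's dimensionality
```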
HLDA [11] can also be applied using a similar process,
which is denoted as “state-HLDA”. These methods ignore the temporal dependencies: the states within the same HMM are not independent, and a true label is associated with the entire state sequence rather than with a single state. In contrast, our proposed method exploits the temporal dependency and holistic structure in two respects: 1) the extracted statistics encode the hidden dynamics and temporal information, and 2) the objective function aims to holistically discriminate between different sequence classes.
An advantage of DRS over the feature-based methods introduced in Section 2.3 is that it preserves the length and structure of concatenated vector sequences. After DRS, the sequences remain unsegmented, but the differences between segments corresponding to different patterns become starker; therefore, more precise segmentation may be obtained. Either re-segmentation or a holistic approach can be applied in subsequent procedures. In contrast, feature-based methods require a pre-segmentation of the long sequences, which induces irreversible errors. Of course, feature-based methods can also be applied to the sequences after DRS for better classification performance.
Recurrent neural networks (RNNs) [35], such as long short-term memory (LSTM) networks [36], are powerful models for