TABLE I
F
REQUENTLY USED NOTATIONS
Notation Description
X a time series (x
1
,x
2
, ··· ,x
n
)
X(i, l) a length-l subsequence of X starting at offset i
ˆ
X the normalized series of time series X
X
i
the i
th
length-w disjoint window of X
μ
X
i
the mean value of the i
th
disjoint window of X
σ
X
i
the standard deviation of the i
th
disjoint window of X
WI a window interval containing continuous window positions
IS
i
a set of window intervals satisfying the criterion for Q
i
CS
i
, CS a set of candidates for Q
i
and for all Q
j
(1 ≤ j ≤ i)
n
I
,n
P
the number of window intervals and window positions
works in Section VIII. Finally, we conclude the paper and
look into the future work in Section IX.
II. P
RELIMINARY KNOWLEDGE
In this section, we introduce the definition of time series
and other useful notations.
A. Definitions and Problem Statement
A time series is a sequence of ordered values, denoted as
X =(x
1
,x
2
, ··· ,x
n
), where n = |X| is the length of X.A
length-l subsequence of X is a shorter time series, denoted as
X(i, l)=(x
i
,x
i+1
, ··· ,x
i+l−1
), where 1 ≤ i ≤ n − l +1.
For any subsequence S =(s
1
,s
2
, ··· ,s
m
), μ
S
and σ
S
are
the mean value and standard deviation of S respectively. Thus
the normalized series of S, denoted as
ˆ
S,is
ˆ
S =
s
1
− μ
S
σ
S
,
s
2
− μ
S
σ
S
, ···,
s
m
− μ
S
σ
S
.
Our work supports two common distance measures, Eu-
clidean distance and Dynamic Time Warping. Here we give
the definition of them.
Euclidean Distance (ED): Given two length-m sequences,
S and S
, their distance is ED(S, S
)=
m
i=1
(s
i
− s
i
)
2
.
Dynamic Time Warping (DTW): Given two length-m se-
quences, S and S
, their distance is
DTW(, )=0; DTW(S, )=DTW(,S
)=∞;
DTW(S, S
)=
(s
1
− s
1
)
2
+ min
⎧
⎪
⎨
⎪
⎩
DTW(suf(S), suf(S
))
DTW(S, suf(S
))
DTW(suf(S),S
)
where represents empty series and suf(S)=(s
2
, ··· ,s
m
)
is a suffix subsequence of S.
In DTW, the warping path is defined as a matrix to represent
the optimal alignment for two series. The matrix element (i, j)
represents that s
i
is aligned to s
j
. To reduce the computation
complexity, we use the Sakoe-Chiba band [10] to restrict the
width of warping, denoted as ρ. Any pair (i, j) should satisfy
|i − j|≤ρ. When ρ =0, it degenerates into ED.
We aim to support subsequence matching for both the raw
subsequence and the normalized subsequence simultaneously.
The problem statements are given here.
Raw Subsequence Matching (RSM): Given a long time
series X, a query sequence Q (|X|≥|Q|) and a distance
threshold ε (ε ≥ 0), find all subsequences S of length |Q|
from X, which satisfy D
S, Q
≤ ε. In this case, we call that
S and Q are in ε-match.
Normalized Subsequence Matching (NSM): Given a long
time series X, a query sequence Q and a distance threshold
ε (ε ≥ 0), find all subsequences S of length |Q| from X,
which satisfy D
ˆ
S,
ˆ
Q
≤ ε, where
ˆ
S and
ˆ
Q are
the normalized
series of S and Q respectively.
The cNSM problem adds two constraints to the NSM
problem. Thresholds α (α ≥ 1) and β (β ≥ 0) are introduced
to constrain the degree of amplitude scaling and offset shifting.
Constrained Normalized Subsequence Matching (cNSM):
Given a long time series X, a query sequence Q, a distance
threshold ε, and the constraint thresholds α and β, find all
subsequences S of length |Q| from X, which satisfy
D
ˆ
S,
ˆ
Q
≤ ε ∩
1
α
≤
σ
S
σ
Q
≤ α ∩−β ≤ μ
S
− μ
Q
≤ β.
The larger α and β, the looser the constraint. In this case, we
call that S and Q are in (ε, α, β)-match.
The distance D(·, ·) is either ED or DTW. In this paper,
we build an index to support four types of queries, RSM-ED,
RSM-DTW, cNSM-ED and cNSM-DTW simultaneously.
III. T
HEORETICAL FOUNDATION AND
APPROACH MOTIVATION
In this section, we establish the theoretical foundation of
our approach. We propose a condition to filter the unqualified
subsequences. For all four types of queries, the conditions
share the same format, which enables us to support all query
types with a single index.
Specifically, for the query Q and the subsequence S of
length-m, we segment them into aligned disjoint windows
of the same length w. The i
th
window of Q (or S)is
denoted as Q
i
(or S
i
), (1 ≤ i ≤ p =
m
w
), that is,
Q
i
=(q
(i−1)∗w+1
, ··· ,q
i∗w
).
For each window, we hope to find one or more features,
based on which we can construct the filtering condition. In
this work, we choose to utilize one single feature, the mean
value of the window. The advantages are two-folds. First, with
a single feature, we can build a one-dimensional index, which
improves the efficiency of index retrieval greatly. Second, the
mean value allows us to design the condition for both RSM
and cNSM queries.
We denote mean values of Q
i
and S
i
as μ
Q
i
and μ
S
i
.
The condition consists of p number of ranges. The i
th
one
is denoted as [LR
i
, UR
i
] (1 ≤ i ≤ p). If S is a qualified
subsequence, for any i, μ
S
i
must fall within [LR
i
, UR
i
].Ifany
μ
S
i
is outside the range, we can filter S safely.
A. RSM-ED Query Processing
In this section, we first present the condition for the simplest
case, RSM-ED query, and then illustrate our approach.
Lemma 1. If S and Q are in ε-match under ED measure, that
is, ED(S, Q) ≤ ε, then μ
S
i
(1 ≤ i ≤ p) must satisfy
μ
S
i
∈
μ
Q
i
−
ε
√
w
,μ
Q
i
+
ε
√
w
. (1)
Proof. Based on the ED definition, we have
ED
2
(S, Q)=
n
k=1
(s
k
− q
k
)
2
≥
i∗w
j=(i−1)∗w+1
(s
j
− q
j
)
2
868