Searching and Mining Trillions of Time Series
Subsequences under Dynamic Time Warping
Thanawin Rakthanmanon, Bilson Campana, Abdullah Mueen, Gustavo Batista², Brandon Westover¹, Qiang Zhu, Jesin Zakaria, Eamonn Keogh
UC Riverside, ¹Brigham and Women's Hospital, ²University of São Paulo
{rakthant, bcampana, mueen, qzhu, jzaka, eamonn}@cs.ucr.edu, gbatista@icmc.usp.br, mwestover@partners.org
ABSTRACT
Most time series data mining algorithms use similarity search as a
core subroutine, and thus the time taken for similarity search is the
bottleneck for virtually all time series data mining algorithms. The
difficulty of scaling search to large datasets largely explains why
most academic work on time series data mining has plateaued at
considering a few millions of time series objects, while much of
industry and science sits on billions of time series objects waiting
to be explored. In this work we show that by using a combination
of four novel ideas we can search and mine truly massive time
series for the first time. We demonstrate the following extremely
unintuitive fact: in large datasets we can exactly search under
DTW much more quickly than the current state-of-the-art
Euclidean distance search algorithms. We demonstrate our work on
the largest set of time series experiments ever attempted. In
particular, the largest dataset we consider is larger than the
combined size of all of the time series datasets considered in all data
mining papers ever published. We show that our ideas allow us to
solve higher-level time series data mining problems such as motif
discovery and clustering at scales that would otherwise be
untenable. In addition to mining massive datasets, we will show
that our ideas also have implications for real-time monitoring of
data streams, allowing us to handle much faster arrival rates
and/or use cheaper, lower-powered devices than are currently
possible.
Categories and Subject Descriptors
H.2.8 [Information Systems]: Database Applications — Data
Mining
General Terms
Algorithms, Experimentation
Keywords
Time series, Similarity Search, Lower Bounds
1. INTRODUCTION
Time series data is pervasive across almost all human endeavors,
including medicine, finance, science and entertainment. As such,
it is hardly surprising that time series data mining has attracted
significant attention and research effort. Most time series data
mining algorithms require similarity comparisons as a subroutine,
and in spite of the consideration of dozens of alternatives, there is
increasing evidence that the classic Dynamic Time Warping
(DTW) measure is the best measure in most domains [6].
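The classic DTW measure referenced above can be sketched as a simple dynamic program over all allowed alignments of two sequences. The following is a generic, unoptimized illustration (the function name and squared-Euclidean local cost are our own choices), not the accelerated search method this paper develops:

```python
import math

def dtw_distance(x, y):
    """Classic O(n*m) dynamic-programming DTW between two sequences."""
    n, m = len(x), len(y)
    # D[i][j] = cost of the best warping path aligning x[:i] with y[:j]
    D = [[math.inf] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = (x[i - 1] - y[j - 1]) ** 2  # squared local distance
            # extend the cheapest of the three allowed predecessor cells
            D[i][j] = cost + min(D[i - 1][j],      # insertion
                                 D[i][j - 1],      # deletion
                                 D[i - 1][j - 1])  # match
    return math.sqrt(D[n][m])
```

For example, `dtw_distance([0, 0, 1], [0, 1, 1])` is 0.0 even though the pointwise Euclidean distance between those sequences is 1, illustrating the warping invariance that makes DTW effective, and the nested loops make plain why naive DTW is too expensive at scale.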
Permission to make digital or hard copies of all or part of this work for
personal or classroom use is granted without fee provided that copies are
not made or distributed for profit or commercial advantage and that copies
bear this notice and the full citation on the first page. To copy otherwise,
or republish, to post on servers or to redistribute to lists, requires prior
specific permission and/or a fee.
KDD’12, August 12–16, 2012, Beijing, China.
Copyright 2012 ACM 978-1-4503-1462-6/12/08...$15.00.
It is difficult to overstate the ubiquity of DTW. It has been used in
robotics, medicine [5], biometrics, music/speech processing
[1][27][41], climatology, aviation, gesture recognition [3][38],
user interfaces [16][22][29][38], industrial processing,
cryptanalysis [7], mining of historical manuscripts [15], geology,
astronomy [20][31], space exploration, wildlife monitoring, etc.
As ubiquitous as DTW is, we believe that there are thousands of
research efforts that would like to use DTW, but find it too
computationally expensive. For example, consider the following:
“Ideally, dynamic time warping would be used to achieve this, but
due to time constraints…” [5]. Likewise, [3] bemoans that DTW is
“still too slow for gesture recognition systems”, and [1] notes,
even “a 30 fold speed increase may not be sufficient for scaling
DTW methods to truly massive databases.” As we shall show, our
subsequence search suite of four novel ideas (called the UCR
suite) removes all of these objections. We can reproduce all the
experiments in all these papers in well under a second.
We make an additional claim for our UCR suite which is almost
certainly true, but hard to prove, given the variability in how
search results are presented in the literature. We believe our exact
DTW sequential search is much faster than any current
approximate search or exact indexed search. In a handful of
papers the authors are explicit enough with their experiments to
see this is true. Consider [28], which says it can answer queries of
length 1,000 under DTW with 95% accuracy, in a random walk
dataset of one million objects in 5.65 seconds. We can exactly
search this dataset in 3.8 seconds (on a very similar machine).
Likewise, a recent paper that introduced a novel inner product
based DTW lower bound greatly speeds up exact subsequence
search for a wordspotting task in speech. The authors state: “the
new DTW-KNN method takes approximately 2 minutes” [41];
however, we can reproduce their results in less than a second. An
influential paper on gesture recognition on multi-touch screens
laments that “DTW took 128.26 minutes to run the 14,400 tests for
a given subject’s 160 gestures” [38]. However, we can reproduce
these results in under 3 seconds.
1.1 A Brief Discussion of a Trillion
Since we use the word “trillion” in this work, and since to our
knowledge it has never appeared in a data mining/database paper,
we will take a moment to discuss this number. By a trillion, we
mean the short-scale version of the word [14]: one million
million, 10¹², or 1,000,000,000,000.
If we have a single time series T of length one trillion, and we
assume it takes eight bytes to store each value, it will require 7.2
terabytes to store. If we sample an electrocardiogram at 256 Hz, a
trillion data points would allow us to record 123 years of data,
every single heartbeat of the longest-lived human [37].
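The arithmetic behind these figures is easy to check. The following quick script is our own illustration, assuming one 8-byte double per value and binary (2⁴⁰-byte) terabytes:

```python
N = 10 ** 12                    # one trillion data points
BYTES_PER_VALUE = 8             # one double-precision float per value

terabytes = N * BYTES_PER_VALUE / 2 ** 40        # storage in binary TB
SAMPLES_PER_YEAR = 256 * 60 * 60 * 24 * 365.25   # 256 Hz ECG sampling
years = N / SAMPLES_PER_YEAR                     # recording duration

print(round(terabytes, 2), round(years, 1))  # prints 7.28 123.8
```

Rounding down gives the 7.2 terabytes and 123 years quoted above.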
A time series of length one trillion is a very large data object. In
fact, it is more than all of the time series data considered in all
papers ever published in all data mining conferences combined.
This is easy to see with a quick back-of-the-envelope calculation.