iSAX 2.0: Indexing and Mining One Billion Time Series
Alessandro Camerra Themis Palpanas Jin Shieh Eamonn Keogh
University of Trento
a.camerra@studenti.unitn.it, themis@disi.unitn.eu
University of California, Riverside
{shiehj, eamonn}@cs.ucr.edu
Abstract—Several applications across diverse domains, including
astronomy, biology, and the web, have an increasingly pressing
need for techniques able to index and mine very large collections
of time series. It is not unusual for these applications to
involve on the order of hundreds of millions to billions of time
series. However, none of the relevant techniques proposed in the
literature so far has considered data collections much larger than
one million time series. In this paper, we describe iSAX 2.0, a
data structure designed for indexing and mining truly massive
collections of time series. We show that the main bottleneck in
mining such massive datasets is the time taken to build the
index, and we thus introduce a novel bulk loading mechanism,
the first of its kind specifically tailored to a time series index.
We show how our method allows mining of datasets that would
otherwise be completely untenable, including the first published
experiments to index one billion time series, and experiments in
mining massive data from domains as diverse as entomology, DNA,
and web-scale image collections.
Keywords-time series; data mining; representations; indexing
I. INTRODUCTION
The problem of indexing and mining time series has
captured the interest of the data mining and database
community for almost two decades. However, there remains a huge
gap between the scalability of the methods in the current
literature and the needs of practitioners in many domains. To
illustrate this gap, consider the following selection of quotes
from unsolicited emails sent to the current authors, asking for
help in indexing massive time series datasets.
• “…we have about a million samples per minute coming in
from 1000 gas turbines around the world… we need to be
able to do similarity search for...” Lane Desborough, GE.
• “…an archival rate of 3.6 billion points a day, how can
we (do similarity search) in this data?” Josh Patterson,
TVA.
Our communication with such companies and research institutions
has led us to a perhaps surprising conclusion: for all attempts at
large-scale mining of time series, it is the time complexity of
building the index that remains the most significant bottleneck.
For example, a state-of-the-art method [3] needs over 6 days to
build an index with 100 million items.
Additionally, there is a pressing need to reduce retrieval times,
especially as such data is clearly doomed to be disk resident.
Once a dimensionality-reduced representation (e.g., DFT, DWT, SAX)
has been decided on, the only way to improve retrieval times is by
optimizing the splitting algorithms of tree-based indexes (e.g.,
R-trees, M-trees), since a poor splitting policy leads to
excessive and useless subdivisions, which create unnecessarily
deep sub-trees and cause lengthier traversals.
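To make the preceding discussion concrete, here is a minimal
sketch of SAX symbolization, the representation underlying iSAX:
the series is first reduced by Piecewise Aggregate Approximation
(PAA), and each segment mean is then discretized against
breakpoints chosen so that the symbols are equiprobable under a
standard normal distribution. The function names and the alphabet
size of 4 are our own illustrative choices, not the authors'
implementation.

import numpy as np

# Breakpoints dividing N(0,1) into 4 equiprobable regions (alphabet size 4).
BREAKPOINTS_4 = [-0.6745, 0.0, 0.6745]

def paa(ts, n_segments):
    """Piecewise Aggregate Approximation: mean of each equal-width segment."""
    ts = np.asarray(ts, dtype=float)
    return np.array([seg.mean() for seg in np.array_split(ts, n_segments)])

def sax_word(ts, n_segments, breakpoints=BREAKPOINTS_4):
    """Map a z-normalized time series to a word of integer SAX symbols."""
    return np.searchsorted(breakpoints, paa(ts, n_segments))

# Example: reduce a z-normalized series of length 256 to a 4-symbol word.
rng = np.random.default_rng(0)
ts = rng.standard_normal(256)
ts = (ts - ts.mean()) / ts.std()
print(sax_word(ts, 4))  # four symbols, each in {0, 1, 2, 3}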
In this work, we solve both of these problems with significant
extensions to the recently introduced multi-resolution symbolic
representation, the indexable Symbolic Aggregate approXimation
(iSAX) [3]. As we will show with
the largest (by far) set of time series indexing experiments
ever attempted, we can reduce the index building time by
72% with a novel bulk loading scheme, which is the first
bulk loading algorithm for a time series index. Also, our new
splitting policy reduces the size of the index by 27%. The
number of disk page accesses is reduced by 50%, while more
than 99.5% of those accesses are sequential.
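The property of iSAX [3] that our extensions build on is its
multi-resolution nature: a symbol at cardinality 2^b is a b-bit
number, and its counterpart at any coarser cardinality is obtained
by keeping only the most significant bits, so words of mixed
cardinality remain directly comparable. The helper below is our
own minimal illustration of this scheme, not the authors' code.

def demote(symbol, from_bits, to_bits):
    """Convert an iSAX symbol from cardinality 2**from_bits down to
    cardinality 2**to_bits by dropping the least significant bits."""
    return symbol >> (from_bits - to_bits)

# Symbol 6 at cardinality 8 is '110'; at cardinality 4 it is '11' (= 3),
# and at cardinality 2 it is '1'. Splitting a node in the index amounts
# to adding one bit of cardinality to a single segment of its word.
assert demote(0b110, 3, 2) == 0b11
assert demote(0b110, 3, 1) == 0b1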
To push the limits of time series data mining, we
consider experiments that index 1,000,000,000 (one billion)
time series of length 256. To the best of our knowledge, this
is the first time a paper in the literature has reached the one
billion mark for similarity search on multimedia objects of
any kind. On four occasions the best paper winners at
SIGKDD/SIGMOD have looked at the problem of indexing
time series, with the largest dataset considered by each paper
being 500,000 objects [20], 100,000 objects [21], 6,480
objects [1], and 27,000 objects [23]. Thus the 1,000,000,000
objects considered here represent real progress, beyond the
inevitable improvements in hardware performance.
We further show that the scalability achieved by our
ideas allows us to consider interesting data mining problems
in entomology, biology, and the web that would otherwise be
untenable. The contributions we make in this paper can be
summarized as follows.
• We present mechanisms that allow iSAX 2.0, a data
structure suitable for indexing and mining time series, to
scale to very large datasets.
• We introduce the first bulk loading algorithm, specifically
designed to operate in the context of a time series index.
The proposed algorithm can dramatically reduce the number of
random disk page accesses (as well as the total number of disk
accesses), thus reducing the time required to build the index by
an order of magnitude (a sketch of the general buffering idea it
relies on follows this list).
• We also propose a new node splitting algorithm, based on
simple statistics that are accurate, yet efficient to compute.
This algorithm leads to an average reduction in the size of
the index by 27%.
• We present the first approach that is experimentally
validated to scale to data collections of time series with up
to 1 billion objects.
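The bulk loading algorithm itself is detailed later in the paper;
the sketch below only illustrates the general principle named in
the bullet above, namely grouping insertions in memory and
flushing each group with sequential rather than random writes.
The memory budget and the root_key and flush_to_disk callbacks are
hypothetical simplifications of ours, not the actual algorithm.

from collections import defaultdict

def bulk_load(series_iter, root_key, flush_to_disk, buffer_limit=1_000_000):
    """Generic grouped bulk loading: buffer insertions per target subtree
    in memory, then write each group out with one sequential flush instead
    of one random disk access per time series."""
    buffers = defaultdict(list)
    buffered = 0
    for ts in series_iter:
        buffers[root_key(ts)].append(ts)    # group by destination subtree
        buffered += 1
        if buffered >= buffer_limit:        # memory budget exhausted
            for key, group in buffers.items():
                flush_to_disk(key, group)   # one sequential I/O per group
            buffers.clear()
            buffered = 0
    for key, group in buffers.items():      # flush whatever remains
        flush_to_disk(key, group)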
The rest of the paper is organized as follows. We review
some background material in Section II. Section III
introduces the basic pillars of our scalable index, iSAX 2.0.