iSAX 2.0: Indexing and Mining One Billion Time Series
Alessandro Camerra Themis Palpanas Jin Shieh Eamonn Keogh
University of Trento
a.camerra@studenti.unitn.it, themis@disi.unitn.eu
University of California, Riverside
{shiehj, eamonn}@cs.ucr.edu
Abstract—Several applications across diverse domains, including
astronomy, biology, and the web, have an increasingly pressing
need for techniques able to index and mine very large collections
of time series. It is not unusual for these applications to
involve on the order of hundreds of millions to billions of time
series. However, none of the relevant techniques proposed in the
literature so far has considered data collections much larger than
one million time series. In this paper, we describe iSAX 2.0, a
data structure designed for indexing and mining truly massive
collections of time series. We show that the main bottleneck in
mining such massive datasets is the time taken to build the
index, and we thus introduce a novel bulk loading mechanism,
the first of its kind specifically tailored to a time series index.
We show how our method allows mining of datasets that would
otherwise be completely untenable, including the first published
experiments to index one billion time series, and experiments in
mining massive data from domains as diverse as entomology, DNA,
and web-scale image collections.
Keywords-time series; data mining; representations; indexing
I. INTRODUCTION
The problem of indexing and mining time series has
captured the interest of the data mining and database
community for almost two decades. However, there remains a huge
gap between the scalability of the methods in the current
literature and the needs of practitioners in many domains. To
illustrate this gap, consider the following selection of quotes
from unsolicited emails sent to the current authors, asking for
help in indexing massive time series datasets.
• “…we have about a million samples per minute coming in
from 1000 gas turbines around the world… we need to be
able to do similarity search for...” Lane Desborough, GE.
• “…an archival rate of 3.6 billion points a day, how can
we (do similarity search) in this data?” Josh Patterson,
TVA.
Our communication with such companies and research institutions
has led us to a perhaps surprising conclusion: for all attempts at
large-scale mining of time series, it is the time complexity of
building the index that remains the most significant bottleneck.
For example, a state-of-the-art method [3] needs over 6 days to
build an index with 100 million items.
Additionally, there is a pressing need to reduce retrieval times,
especially as such data is clearly doomed to be disk resident.
Once a dimensionality-reduced representation (e.g., DFT, DWT, SAX)
has been decided on, the only way to improve retrieval times is by
optimizing the splitting algorithms of tree-based indexes (e.g.,
R-trees, M-trees), since a poor splitting policy leads to
excessive and useless subdivisions, which create unnecessarily
deep sub-trees and cause lengthier traversals.
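To make the preceding discussion concrete, here is a minimal
sketch of SAX symbolization, the representation underlying iSAX:
the series is first reduced by Piecewise Aggregate Approximation
(PAA), and each segment mean is then discretized against
breakpoints chosen so that the symbols are equiprobable under a
standard normal distribution. The function names and the alphabet
size of 4 are our own illustrative choices, not the authors'
implementation.

import numpy as np

# Breakpoints dividing N(0,1) into 4 equiprobable regions (alphabet size 4).
BREAKPOINTS_4 = [-0.6745, 0.0, 0.6745]

def paa(ts, n_segments):
    """Piecewise Aggregate Approximation: mean of each equal-width segment."""
    ts = np.asarray(ts, dtype=float)
    return np.array([seg.mean() for seg in np.array_split(ts, n_segments)])

def sax_word(ts, n_segments, breakpoints=BREAKPOINTS_4):
    """Map a z-normalized time series to a word of integer SAX symbols."""
    return np.searchsorted(breakpoints, paa(ts, n_segments))

# Example: reduce a z-normalized series of length 256 to a 4-symbol word.
rng = np.random.default_rng(0)
ts = rng.standard_normal(256)
ts = (ts - ts.mean()) / ts.std()
print(sax_word(ts, 4))  # four symbols, each in {0, 1, 2, 3}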
In this work, we solve both of these problems with significant
extensions to the recently introduced multi-resolution symbolic
representation, the indexable Symbolic Aggregate approXimation
(iSAX) [3]. As we will show with
the largest (by far) set of time series indexing experiments
ever attempted, we can reduce the index building time by
72% with a novel bulk loading scheme, which is the first
bulk loading algorithm for a time series index. Also, our new
splitting policy reduces the size of the index by 27%. The
number of disk page accesses is reduced by 50%, while more
than 99.5% of those accesses are sequential.
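The property of iSAX [3] that our extensions build on is its
multi-resolution nature: a symbol at cardinality 2^b is a b-bit
number, and its counterpart at any coarser cardinality is obtained
by keeping only the most significant bits, so words of mixed
cardinality remain directly comparable. The helper below is our
own minimal illustration of this scheme, not the authors' code.

def demote(symbol, from_bits, to_bits):
    """Convert an iSAX symbol from cardinality 2**from_bits down to
    cardinality 2**to_bits by dropping the least significant bits."""
    return symbol >> (from_bits - to_bits)

# Symbol 6 at cardinality 8 is '110'; at cardinality 4 it is '11' (= 3),
# and at cardinality 2 it is '1'. Splitting a node in the index amounts
# to adding one bit of cardinality to a single segment of its word.
assert demote(0b110, 3, 2) == 0b11
assert demote(0b110, 3, 1) == 0b1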
To push the limits of time series data mining, we
consider experiments that index 1,000,000,000 (one billion)
time series of length 256. To the best of our knowledge, this
is the first time a paper in the literature has reached the one
billion mark for similarity search on multimedia objects of
any kind. On four occasions the best paper winners at
SIGKDD/SIGMOD have looked at the problem of indexing
time series, with the largest dataset considered by each paper
being 500,000 objects [20], 100,000 objects [21], 6,480
objects [1], and 27,000 objects [23]. Thus the 1,000,000,000
objects considered here represent real progress, beyond the
inevitable improvements in hardware performance.
We further show that the scalability achieved by our
ideas allows us to consider interesting data mining problems
in entomology, biology, and the web that would otherwise be
untenable. The contributions we make in this paper can be
summarized as follows.
• We present mechanisms that allow iSAX 2.0, a data
structure suitable for indexing and mining time series, to
scale to very large datasets.
• We introduce the first bulk loading algorithm, specifically
designed to operate in the context of a time series index.
The proposed algorithm can dramatically reduce the number of
random disk page accesses (as well as the total number of disk
accesses), thus reducing the time required to build the index by
an order of magnitude (a sketch of the general buffering idea it
relies on follows this list).
• We also propose a new node splitting algorithm, based on
simple statistics that are accurate, yet efficient to compute.
This algorithm leads to an average reduction in the size of
the index by 27%.
• We present the first approach that is experimentally
validated to scale to data collections of time series with up
to 1 billion objects.
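The bulk loading algorithm itself is detailed later in the paper;
the sketch below only illustrates the general principle named in
the bullet above, namely grouping insertions in memory and
flushing each group with sequential rather than random writes.
The memory budget and the root_key and flush_to_disk callbacks are
hypothetical simplifications of ours, not the actual algorithm.

from collections import defaultdict

def bulk_load(series_iter, root_key, flush_to_disk, buffer_limit=1_000_000):
    """Generic grouped bulk loading: buffer insertions per target subtree
    in memory, then write each group out with one sequential flush instead
    of one random disk access per time series."""
    buffers = defaultdict(list)
    buffered = 0
    for ts in series_iter:
        buffers[root_key(ts)].append(ts)    # group by destination subtree
        buffered += 1
        if buffered >= buffer_limit:        # memory budget exhausted
            for key, group in buffers.items():
                flush_to_disk(key, group)   # one sequential I/O per group
            buffers.clear()
            buffered = 0
    for key, group in buffers.items():      # flush whatever remains
        flush_to_disk(key, group)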
The rest of the paper is organized as follows. We review
some background material in Section II. Section III
introduces the basic pillars of our scalable index, iSAX 2.0.