Searching and Mining Trillions of Time Series
Subsequences under Dynamic Time Warping
Thanawin Rakthanmanon, Bilson Campana, Abdullah Mueen, Gustavo Batista², Brandon Westover¹, Qiang Zhu, Jesin Zakaria, Eamonn Keogh
UC Riverside, ¹Brigham and Women's Hospital, ²University of São Paulo
{rakthant, bcampana, mueen, qzhu, jzaka, eamonn}@cs.ucr.edu, gbatista@icmc.usp.br, mwestover@partners.org
ABSTRACT
Most time series data mining algorithms use similarity search as a
core subroutine, and thus the time taken for similarity search is the
bottleneck for virtually all time series data mining algorithms. The
difficulty of scaling search to large datasets largely explains why
most academic work on time series data mining has plateaued at
considering a few millions of time series objects, while much of
industry and science sits on billions of time series objects waiting
to be explored. In this work we show that by using a combination
of four novel ideas we can search and mine truly massive time
series for the first time. We demonstrate the following extremely
unintuitive fact: in large datasets we can exactly search under
DTW much more quickly than the current state-of-the-art
Euclidean distance search algorithms. We demonstrate our work on
the largest set of time series experiments ever attempted. In
particular, the largest dataset we consider is larger than the
combined size of all of the time series datasets considered in all data
mining papers ever published. We show that our ideas allow us to
solve higher-level time series data mining problems such as motif
discovery and clustering at scales that would otherwise be
untenable. In addition to mining massive datasets, we will show
that our ideas also have implications for real-time monitoring of
data streams, allowing us to handle much faster arrival rates
and/or use cheaper, lower-powered devices than are currently
possible.
Categories and Subject Descriptors
H.2.8 [Information Systems]: Database Applications — Data
Mining
General Terms
Algorithms, Experimentation
Keywords
Time series, Similarity Search, Lower Bounds
1. INTRODUCTION
Time series data is pervasive across almost all human endeavors,
including medicine, finance, science and entertainment. As such,
it is hardly surprising that time series data mining has attracted
significant attention and research effort. Most time series data
mining algorithms require similarity comparisons as a subroutine,
and in spite of the consideration of dozens of alternatives, there is
increasing evidence that the classic Dynamic Time Warping
(DTW) measure is the best measure in most domains [6].
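The classic DTW measure referenced above can be sketched as a simple dynamic program over all allowed alignments of two sequences. The following is a generic, unoptimized illustration (the function name and squared-Euclidean local cost are our own choices), not the accelerated search method this paper develops:

```python
import math

def dtw_distance(x, y):
    """Classic O(n*m) dynamic-programming DTW between two sequences."""
    n, m = len(x), len(y)
    # D[i][j] = cost of the best warping path aligning x[:i] with y[:j]
    D = [[math.inf] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = (x[i - 1] - y[j - 1]) ** 2  # squared local distance
            # extend the cheapest of the three allowed predecessor cells
            D[i][j] = cost + min(D[i - 1][j],      # insertion
                                 D[i][j - 1],      # deletion
                                 D[i - 1][j - 1])  # match
    return math.sqrt(D[n][m])
```

For example, `dtw_distance([0, 0, 1], [0, 1, 1])` is 0.0 even though the pointwise Euclidean distance between those sequences is 1, illustrating the warping invariance that makes DTW effective, and the nested loops make plain why naive DTW is too expensive at scale.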
Permission to make digital or hard copies of all or part of this work for
personal or classroom use is granted without fee provided that copies are
not made or distributed for profit or commercial advantage and that copies
bear this notice and the full citation on the first page. To copy otherwise,
or republish, to post on servers or to redistribute to lists, requires prior
specific permission and/or a fee.
KDD’12, August 12–16, 2012, Beijing, China.
Copyright 2012 ACM 978-1-4503-1462-6/12/08...$15.00.
It is difficult to overstate the ubiquity of DTW. It has been used in
robotics, medicine [5], biometrics, music/speech processing
[1][27][41], climatology, aviation, gesture recognition [3][38],
user interfaces [16][22][29][38], industrial processing,
cryptanalysis [7], mining of historical manuscripts [15], geology,
astronomy [20][31], space exploration, wildlife monitoring, etc.
As ubiquitous as DTW is, we believe that there are thousands of
research efforts that would like to use DTW, but find it too
computationally expensive. For example, consider the following:
“Ideally, dynamic time warping would be used to achieve this, but
due to time constraints…” [5]. Likewise, [3] bemoans that DTW is
“still too slow for gesture recognition systems”, and [1] notes,
even “a 30 fold speed increase may not be sufficient for scaling
DTW methods to truly massive databases.” As we shall show, our
subsequence search suite of four novel ideas (called the UCR
suite) removes all of these objections. We can reproduce all the
experiments in all these papers in well under a second.
We make an additional claim for our UCR suite which is almost
certainly true, but hard to prove, given the variability in how
search results are presented in the literature. We believe our exact
DTW sequential search is much faster than any current
approximate search or exact indexed search. In a handful of
papers the authors are explicit enough with their experiments to
see this is true. Consider [28], which says it can answer queries of
length 1,000 under DTW with 95% accuracy, in a random walk
dataset of one million objects in 5.65 seconds. We can exactly
search this dataset in 3.8 seconds (on a very similar machine).
Likewise, a recent paper that introduced a novel inner product
based DTW lower bound greatly speeds up exact subsequence
search for a wordspotting task in speech. The authors state: “the
new DTW-KNN method takes approximately 2 minutes” [41];
however, we can reproduce their results in less than a second. An
influential paper on gesture recognition on multi-touch screens
laments that “DTW took 128.26 minutes to run the 14,400 tests for
a given subject’s 160 gestures” [38]. However, we can reproduce
these results in under 3 seconds.
1.1 A Brief Discussion of a Trillion
Since we use the word “trillion” in this work, and since to our
knowledge it has never appeared in a data mining/database paper,
we will take a moment to discuss this number. By a trillion, we
mean the short-scale version of the word [14]: one million
million, 10¹², or 1,000,000,000,000.
If we have a single time series T of length one trillion, and we
assume it takes eight bytes to store each value, it will require 7.2
terabytes to store. If we sample an electrocardiogram at 256 Hz, a
trillion data points would allow us to record 123 years of data,
every single heartbeat of the longest-lived human [37].
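The arithmetic behind these figures is easy to check. The following quick script is our own illustration, assuming one 8-byte double per value and binary (2⁴⁰-byte) terabytes:

```python
N = 10 ** 12                    # one trillion data points
BYTES_PER_VALUE = 8             # one double-precision float per value

terabytes = N * BYTES_PER_VALUE / 2 ** 40        # storage in binary TB
SAMPLES_PER_YEAR = 256 * 60 * 60 * 24 * 365.25   # 256 Hz ECG sampling
years = N / SAMPLES_PER_YEAR                     # recording duration

print(round(terabytes, 2), round(years, 1))  # prints 7.28 123.8
```

Rounding down gives the 7.2 terabytes and 123 years quoted above.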
A time series of length one trillion is a very large data object. In
fact, it is more than all of the time series data considered in all
papers ever published in all data mining conferences combined.
This is easy to see with a quick back-of-the-envelope calculation.