
Time-series clustering – A decade review

Saeed Aghabozorgi, Ali Seyed Shirkhorshidi, Teh Ying Wah

Department of Information System, Faculty of Computer Science and Information Technology, University of Malaya (UM),

50603 Kuala Lumpur, Malaysia

article info

Article history:

Received 13 October 2014

Accepted 27 April 2015

Available online 6 May 2015

Keywords:

Clustering

Time-series

Distance measure

Evaluation measure

Representations

abstract

Clustering is a solution for classifying enormous data when there is no prior knowledge about classes. With emerging concepts such as cloud computing and big data and their vast applications in recent years, research on unsupervised solutions such as clustering algorithms has increased, with the aim of extracting knowledge from this avalanche of data. Clustering time-series data has been used in diverse scientific areas to discover patterns which empower data analysts to extract valuable information from complex and massive datasets. In case of huge datasets, using supervised classification solutions is almost impossible, while clustering can solve this problem using unsupervised approaches. In this research work, the focus is on time-series data, which is one of the popular data types in clustering problems and is broadly used from gene expression data in biology to stock market analysis in finance. This review will expose four main components of time-series clustering and is aimed to present an updated investigation on the trend of improvements in efficiency, quality and complexity of clustering time-series approaches during the last decade and enlighten new paths for future works.

1. Introduction

Clustering is a data mining technique where similar data

are placed into related or homogeneous groups without

advance knowledge of the groups’ definitions [1]. In detail,

clusters are formed by grouping objects that have maximum

similarity with other objects within the group, and minimum

similarity with objects in other groups. It is a useful approach

for exploratory data analysis as it identifies structure(s) in an

unlabelled dataset by objectively organizing data into similar

groups. Moreover, clustering is used for exploratory data

analysis for summary generation and as a pre-processing

step for other data mining tasks or as a part of a complex

system.

With the increasing power of data storage and processors,

real-world applications have found the chance to store and

keep data for a long time. Hence, data in many applications

is being stored in the form of time-series data, for example

sales data, stock prices, exchange rates in finance, weather

data, biomedical measurements (e.g., blood pressure and

electrocardiogram measurements), biometrics data (image

data for facial recognition), particle tracking in physics, etc.

Accordingly, different works are found in a variety of domains

such as Bioinformatics and Biology, Genetics, Multimedia

[2–4] and Finance. This amount of time-series data has

provided the opportunity of analysing time-series for many

researchers in data mining communities in the last decade.

Consequently, much research relevant to analysing time-series has been performed in various areas for different purposes such as: subsequence matching, anomaly detection, motif discovery [5], indexing, clustering, classification [6], visualization [7], segmentation [8], identifying patterns, trend analysis, summarization [9], and forecasting. Moreover, there are many on-going research projects aimed to improve the existing techniques [10,11].

Contents lists available at ScienceDirect

journal homepage: www.elsevier.com/locate/infosys

Information Systems

http://dx.doi.org/10.1016/j.is.2015.04.007

Corresponding author. Tel.: +60 196918918.

E-mail addresses: saeed@um.edu.my (S. Aghabozorgi),

shirkhorshidi_ali@siswa.um.edu.my,

Shirkhorshidi_ali@yahoo.co.uk (A. Seyed Shirkhorshidi),

tehyw@um.edu.my (T. Ying Wah).

Information Systems 53 (2015) 16–38

In the recent decade, there has been a considerable amount of changes and developments in the time-series clustering area, caused by emerging concepts such as big data and cloud computing, which have increased the size of datasets exponentially. For example, one hour of ECG (electrocardiogram) data occupies 1 gigabyte, a typical weblog requires 5 gigabytes per week, the space shuttle database has 200 gigabytes and updating it requires 2 gigabytes per day [12]. Consequently, clustering has needed improvements in recent years to cope with this incremental avalanche of data and to keep its reputation as a helpful data-mining tool for extracting useful patterns and knowledge from big datasets. This review is opportune because, despite the considerable changes in the area, there is no comprehensive review on the anatomy and structure of time-series clustering. There are some surveys and reviews that focus on comparative aspects of time-series clustering experiments [6,13–17], but none of them tends to be as comprehensive as this review. This research work is aimed to present an updated investigation on the trend of improvements in efficiency, quality and complexity of clustering time-series approaches during the last decade and enlighten new paths for future works.

1.1. Time-series clustering

A special type of clustering is time-series clustering. A sequence composed of a series of nominal symbols from a particular alphabet is usually called a temporal sequence, and a sequence of continuous, real-valued elements is known as a time-series [15]. A time-series is essentially classified as dynamic data because its feature values change as a function of time, which means that the value(s) of each point of a time-series is/are one or more observations that are made chronologically. Time-series data is a type of temporal data which is naturally high dimensional and large in data size [6,17,18]. Time-series data are of interest due to their ubiquity in various areas ranging from science, engineering, business, finance, economics and healthcare to government [16]. While each time-series consists of a large number of data points, it can also be seen as a single object [19]. Clustering such complex objects is particularly advantageous because it leads to discovery of interesting patterns in time-series datasets. As these patterns can be either frequent or rare patterns, several research challenges have arisen such as: developing methods to recognize dynamic changes in time-series, anomaly and intrusion detection, process control, and character recognition [20–22]. More applications of time-series data are discussed in Section 1.2. To highlight the importance and the need for clustering time-series datasets, potentially overlapping objectives for clustering of time-series data are given as follows:

1. Time-series databases contain valuable information that

can be obtained through pattern discovery. Clustering is

a common solution performed to uncover these patterns

on time-series datasets.

2. Time-series databases are very large and cannot be handled well by human inspectors. Hence, many users prefer to deal with structured datasets rather than very large datasets. As a result, time-series data are represented as a set of groups of similar time-series by aggregation of data in non-overlapping clusters or by a taxonomy as a hierarchy of abstract concepts.

3. Time-series clustering is the most-used approach as an

exploratory technique, and also as a subroutine in more

complex data mining algorithms, such as rule discovery,

indexing, classification, and anomaly detection [22].

4. Representing time-series cluster structures as visual

images (visualization of time-series data) can help users

quickly understand the structure of data, clusters,

anomalies, and other regularities in datasets.

The problem of clustering of time-series data is formally

defined as follows:

Definition 1: Time-series clustering. Given a dataset of n time-series data $D = \{F_1, F_2, \ldots, F_n\}$, the process of unsupervised partitioning of $D$ into $C = \{C_1, C_2, \ldots, C_k\}$, in such a way that homogeneous time-series are grouped together based on a certain similarity measure, is called time-series clustering. Then, $C_i$ is called a cluster, where $D = \bigcup_{i=1}^{k} C_i$ and $C_i \cap C_j = \emptyset$ for $i \neq j$.

Time-series clustering is a challenging issue because, first of all, time-series data are often far larger than memory size and consequently they are stored on disks. This leads to an exponential decrease in speed of the clustering process. The second challenge is that time-series data are often high dimensional [23,24], which makes handling these data difficult for many clustering algorithms [25] and also slows down the process of clustering [26]. Finally, the third challenge addresses the similarity measures that are used to make the clusters. To do so, similar time-series should be found, which needs time-series similarity matching, that is, the process of calculating the similarity among the whole time-series using a similarity measure. This process is also known as “whole sequence matching”, where whole lengths of time-series are considered during distance calculation. However, the process is complicated because time-series data are naturally noisy and include outliers and shifts [18]; in addition, the lengths of time-series vary while the distance among them still needs to be calculated. These common issues have made the similarity measure a major challenge for data miners.
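The effect of shifts and unequal lengths on whole sequence matching can be illustrated with a small sketch. It compares a lock-step Euclidean distance with dynamic time warping (DTW); DTW is chosen here only as one well-known elastic measure, and the function names and toy series are assumptions made for illustration, not taken from the review.

```python
import math

def euclidean(a, b):
    """Lock-step distance; only defined for equal-length series."""
    if len(a) != len(b):
        raise ValueError("Euclidean distance needs equal-length series")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def dtw(a, b):
    """Dynamic time warping distance with squared point cost, O(len(a) * len(b))."""
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j] = cheapest cumulative cost of aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = (a[i - 1] - b[j - 1]) ** 2
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return math.sqrt(cost[n][m])

base = [0, 0, 1, 2, 1, 0, 0]
shifted = [0, 0, 0, 1, 2, 1, 0]  # same shape, delayed by one step
shorter = [0, 1, 2, 1, 0]        # same shape, different length

lockstep_shifted = euclidean(base, shifted)  # penalises the one-step delay
elastic_shifted = dtw(base, shifted)         # warping absorbs the delay
elastic_shorter = dtw(base, shorter)         # lengths may differ
```

In this toy case the lock-step distance treats the delayed copy as different, while the elastic measure aligns the two peaks; the price is quadratic time per pair, which matters for the dataset sizes discussed above.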

1.2. Applications of time-series clustering

Clustering of time-series data is mostly utilized for discovery of interesting patterns in time-series datasets [27,28]. This task itself falls into two categories: the first group is used to find patterns that frequently appear in the dataset [29,30]. The second group consists of methods to discover patterns which happen in datasets surprisingly [31–34]. Briefly, finding the clusters of time-series can be advantageous in different domains to answer the following real-world problems:

Anomaly, novelty or discord detection: anomaly detection methods discover unusual and unexpected patterns which happen in datasets surprisingly [31–34]. For example, in sensor databases, time-series produced by the sensor readings of a mobile robot are clustered in order to discover events [35].

1- Recognizing dynamic changes in time-series: detection of correlation between time-series [36]. For example, in financial databases, it can be used to find the companies with similar stock price movements.

2- Prediction and recommendation: a hybrid technique combining clustering and function approximation per cluster can help users to predict and recommend [37–40]. For example, in scientific databases, it can address problems such as finding the patterns of solar magnetic wind to predict today’s pattern.

3- Pattern discovery: to discover the interesting patterns in databases. For example, in a marketing database, different daily patterns of sales of a specific product in a store can be discovered.

Table 1 depicts some applications of time-series data in

different domains.

1.3. Taxonomy of time-series clustering

Reviewing the literature, one can conclude that most works on time-series clustering fall into three categories: “whole time-series clustering”, “subsequence clustering” and “time point clustering”, as depicted in Fig. 1. The first two categories are mentioned by Keogh and Lin [242].



Fig. 1. Time-series clustering taxonomy.

Table 1
Samples of objectives of time-series clustering in different domains.

Category | Clustering application | Research works
Aviation/Astronomy | Astronomical data (star light curves): pre-processing for outlier detection | [41]
Biology | Multiple gene expression profile alignment for microarray time-series data clustering | [42]
Biology | Functional clustering of time series gene expression data | [43]
Biology | Identification of functionally related genes | [44–46]
Climate | Discovery of climate indices | [47,48]
Climate | Analysing PM10 and PM2.5 concentrations at a coastal location of New Zealand | [49]
Energy | Discovering energy consumption pattern | [50,51]
Environment and urban | Analysis of the regional variability of sea-level extremes | [52]
Environment and urban | Earthquake: analysing potential violations of a Comprehensive Test Ban Treaty (CTBT); pattern discovery and forecasting | [53,54]
Environment and urban | Analysis of the change of population distribution during a day in Salt Lake County, Utah, USA | [55]
Environment and urban | Investigating the relationship between the climatic indices and the clusters/trends detected based on clustering method | [56]
Finance | Finding seasonality patterns (retail pattern) | [57]
Finance | Personal income pattern | [58]
Finance | Creating efficient portfolio (a group of stocks owned by a particular person or company) | [59]
Finance | Discovery of patterns from stock time-series | [60]
Finance | Risk reduced portfolios by analyzing the companies and the volatility of their returns | [61]
Finance | Discovery of patterns from stock time-series | [29,62]
Finance | Investigating the correlation between hedging horizon and performance in financial time-series | [63]
Medicine | Detecting brain activity | [64,65]
Medicine | Exploring, identifying, and discriminating pathological cases from MS clinical samples | [66]
Psychology | Analysis of human behaviour in psychological domain | [67]
Robotics | Forming prototypical representations of the robot’s experiences | [68,69]
Speech/voice recognition | Speaker verification | [70]
Speech/voice recognition | Biometric voice classification using hierarchical clustering | [71]
User analysis | Analysing multivariate emotional behaviour of users in social network with the goal to cluster the users from a fully new perspective: emotions | [72]

Whole time-series clustering is considered as clustering of a set of individual time-series with respect to their similarity. Here, clustering usually means applying conventional clustering to discrete objects, where the objects are time-series.



Subsequence clustering means clustering on a set of

subsequences of a time-series that are extracted via a

sliding window, that is, clustering of segments from a

single long time-series.



Time point clustering is another category of clustering which is seen in some papers [74–76]. It is the clustering of time points based on a combination of their temporal proximity and the similarity of the corresponding values. This approach is similar to time-series segmentation. However, it differs from segmentation in that not all points need to be assigned to clusters, i.e., some of them are considered as noise.
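The sliding-window extraction that subsequence clustering relies on can be sketched as follows. The function name `sliding_windows` and its `width` and `step` parameters are hypothetical, introduced only for this illustration.

```python
def sliding_windows(series, width, step=1):
    """Return every length-`width` subsequence of `series`, taken every `step` points."""
    if width <= 0 or step <= 0:
        raise ValueError("width and step must be positive")
    return [series[i:i + width] for i in range(0, len(series) - width + 1, step)]

long_series = [1, 2, 3, 4, 5, 6]
windows = sliding_windows(long_series, width=3)            # consecutive windows overlap
sparse = sliding_windows(long_series, width=3, step=2)     # fewer, less overlapping windows
```

With `step=1`, neighbouring windows share all but one point, so the objects handed to the clustering algorithm are far from independent; this is a property of the extraction itself rather than of any particular clustering method.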

Essentially, subsequence clustering is performed on a single time-series, and Keogh and Lin [242] argued that this type of clustering is meaningless. Time-point clustering is also applied to a single time-series, and is similar to time-series segmentation, as the objective of time-point clustering is finding clusters of time points instead of clusters of time-series data. The focus of this study is on “whole time-series clustering”. A complete review on whole time-series clustering is performed and shown in Table 4. Reviewing the literature, it is noticeable that various techniques have been recommended for the clustering of whole time-series data. However, most of them take one of the following approaches to cluster time-series data:

1. Customizing the existing conventional clustering algorithms (which work with static data) such that they become compatible with the nature of time-series data. In this approach, usually the distance measure of the conventional algorithm is modified to be compatible with the raw time-series data [16].

2. Converting time-series data into simple objects (static

data) as input of conventional clustering algorithms [16].

3. Using multi resolutions of time-series as input of a

multi-step approach. This approach is discussed further

in Section 5.6.

Besides these approaches, there are generally three different ways to cluster time-series, namely shape-based, feature-based and model-based.

Fig. 2 gives a brief overview of these approaches. In the shape-based approach, the shapes of two time-series are matched as well as possible by a non-linear stretching and contracting of the time axes. This approach has also been labelled as a raw-data-based approach because it typically works directly with the raw time-series data. Shape-based algorithms usually employ conventional clustering methods, which are compatible with static data, while their distance/similarity measure is replaced with one appropriate for time-series. In the feature-based approach, the raw time-series are converted into feature vectors of lower dimension. Later, a conventional clustering algorithm is applied to the extracted feature vectors. Usually in this approach, an equal-length feature vector is calculated from each time-series, followed by the Euclidean distance measurement [77]. In model-based methods, a raw time-series is transformed into model parameters (a parametric model for each time-series), and then a suitable model distance and a clustering algorithm (usually a conventional clustering algorithm) are chosen and applied to the extracted model parameters [16]. However, it is shown that model-based approaches usually have scalability problems [78], and their performance reduces when the clusters are close to each other [79].

Fig. 2. The time-series clustering approaches.
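A minimal sketch of the feature-based route described above: each raw series, whatever its length, is mapped to an equal-length feature vector, and the vectors are compared with the Euclidean distance. The three features used here (mean, population standard deviation, lag-1 autocorrelation) are an assumption made for illustration; the review does not prescribe a feature set.

```python
import math
import statistics

def feature_vector(series):
    """Map a series of any length to (mean, population std, lag-1 autocorrelation)."""
    mean = statistics.fmean(series)
    std = statistics.pstdev(series)
    centered = [x - mean for x in series]
    denom = sum(c * c for c in centered)
    # Lag-1 autocorrelation; defined as 0.0 here for a constant series.
    lag1 = sum(a * b for a, b in zip(centered, centered[1:])) / denom if denom else 0.0
    return (mean, std, lag1)

def euclidean(u, v):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

oscillating_short = [0, 1, 0, 1]
oscillating_long = [0, 1, 0, 1, 0, 1]   # same behaviour, different length
trend = [0, 1, 2, 3, 4, 5, 6, 7]

d_similar = euclidean(feature_vector(oscillating_short), feature_vector(oscillating_long))
d_different = euclidean(feature_vector(oscillating_short), feature_vector(trend))
```

Because every series collapses to three numbers, series of different lengths become directly comparable and any conventional clustering algorithm can consume them; the quality of the result then depends entirely on how well the chosen features capture the behaviour of interest.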

Reviewing existing works in the literature, it is implied that essentially time-series clustering has four components (dimensionality reduction or representation method, distance measurement, clustering algorithm, and prototype definition), followed by evaluation. Fig. 3 shows an overview of these components.

The general process in time-series clustering uses some or all of these components depending on the problem. Usually, data is approximated using a representation method in such a way that it can fit in memory. Afterwards, a clustering algorithm is applied to the data by using a distance measure. In the clustering process, usually a prototype is required for summarization of the time-series. At last, the clusters are evaluated against evaluation criteria. In the following sub-sections, each component is discussed, and several related works and methods are reviewed.
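The general process just described can be sketched end to end on toy data. Every concrete choice below is an assumption made for illustration rather than something the review mandates: piecewise aggregate approximation (PAA) as the representation, Euclidean distance, plain k-means as the clustering algorithm, the point-wise mean as the prototype, and the sum of squared errors (SSE) as the evaluation criterion.

```python
import math

def paa(series, segments):
    """Representation: mean of `segments` equal-width chunks (length must divide evenly)."""
    width = len(series) // segments
    return [sum(series[i * width:(i + 1) * width]) / width for i in range(segments)]

def euclidean(a, b):
    """Distance measure between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def mean_prototype(members):
    """Prototype: point-wise average of the cluster members."""
    return [sum(col) / len(col) for col in zip(*members)]

def kmeans(data, init, iterations=10):
    """Clustering: plain k-means seeded with the prototypes in `init`."""
    prototypes = [list(p) for p in init]
    labels = []
    for _ in range(iterations):
        labels = [min(range(len(prototypes)), key=lambda k: euclidean(x, prototypes[k]))
                  for x in data]
        for k in range(len(prototypes)):
            members = [x for x, lab in zip(data, labels) if lab == k]
            if members:  # keep the old prototype if a cluster is empty
                prototypes[k] = mean_prototype(members)
    return labels, prototypes

def sse(data, labels, prototypes):
    """Evaluation: sum of squared distances to the assigned prototype (lower is tighter)."""
    return sum(euclidean(x, prototypes[lab]) ** 2 for x, lab in zip(data, labels))

raw = [
    [0, 0, 1, 1, 2, 2, 3, 3],  # rising
    [0, 1, 1, 2, 2, 3, 3, 4],  # rising
    [3, 3, 2, 2, 1, 1, 0, 0],  # falling
    [4, 3, 3, 2, 2, 1, 1, 0],  # falling
]
reduced = [paa(s, 4) for s in raw]                       # 8 points -> 4 points
labels, prototypes = kmeans(reduced, init=[reduced[0], reduced[2]])
score = sse(reduced, labels, prototypes)
```

Seeding k-means with two fixed series keeps the sketch deterministic; a real study would use a principled initialisation and would typically report an external or internal validity index instead of raw SSE.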

1.4. Organization of the review

In the rest of this paper, we provide a state-of-the-art review of the main components of time-series clustering, plus the evaluation methods and measures available for validating time-series clustering. In Section 2, time-series representation is discussed. Similarity and dissimilarity measures are presented in Section 3. Sections 4 and 5 are dedicated to clustering prototypes and clustering algorithms respectively. In Section 6, evaluation measures are discussed, and finally the paper is concluded in Section 7.

2. Representation methods for time series clustering

The first component of time-series clustering explained here is dimension reduction, which is a common solution for most whole time-series clustering approaches proposed in the literature [9,80–82]. This section reviews methods of time-series dimension reduction, which is also known as time-series representation. Dimensionality reduction represents the raw time-series in another space by transforming time-series to a lower dimensional space or by feature extraction. Dimensionality reduction is greatly important in clustering of time-series, firstly because it reduces memory requirements, as all raw time-series cannot fit in the main memory [9,24]. Secondly, distance calculation among raw data is computationally expensive, and dimensionality reduction significantly speeds up clustering [9,24]. Finally, when measuring the distance between two raw time-series, highly unintuitive results may be obtained, because some distance measures are highly sensitive to some “distortions” in the data [3,83], and consequently, by using raw time-series, one may cluster time-series which are similar in noise instead of clustering them based on similarity in shape. The potential to obtain a different type of cluster is the reason why choosing the appropriate approach for dimension reduction (feature extraction) and its ratio is a challenging task [26]. In fact, it is a trade-off between speed and quality, and all efforts must be made to obtain a proper balance point between quality and execution time.

Definition 2: Time-series representation. Given a time-series data $F_i = \{f_1, \ldots, f_T\}$, representation is transforming the time-series to another dimensionality reduced vector $F'_i = \{f'_1, \ldots, f'_x\}$ where $x < T$, and if two series are similar in the original space, then their representations should be similar in the transformation space too.
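One way to make Definition 2 concrete is piecewise aggregate approximation (PAA), used here purely as an assumed example of a representation with $x < T$. The sketch also checks the similarity-preservation requirement in a simple form: the PAA distance, scaled by the square root of the segment width, does not exceed the Euclidean distance between the raw series. All function names and toy series are hypothetical.

```python
import math

def paa(series, segments):
    """Piecewise aggregate approximation: mean of each equal-width segment."""
    if len(series) % segments:
        raise ValueError("series length must be a multiple of segments")
    width = len(series) // segments
    return [sum(series[i * width:(i + 1) * width]) / width for i in range(segments)]

def euclidean(a, b):
    """Euclidean distance between two equal-length series."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def paa_distance(a, b, segments):
    """Distance in the reduced space, scaled so it lower-bounds the raw distance."""
    width = len(a) // segments
    return math.sqrt(width) * euclidean(paa(a, segments), paa(b, segments))

f1 = [1, 3, 2, 4, 6, 8, 7, 9]   # T = 8
f2 = [2, 2, 4, 4, 6, 6, 8, 8]   # similar to f1
f3 = [9, 7, 8, 6, 4, 2, 3, 1]   # roughly f1 reversed

reduced_f1 = paa(f1, 4)                      # x = 4 < T
near_raw, near_reduced = euclidean(f1, f2), paa_distance(f1, f2, 4)
far_raw, far_reduced = euclidean(f1, f3), paa_distance(f1, f3, 4)
```

In this example the pair that is close in the original space stays close after reduction and the distant pair stays distant, while each series shrinks to half its length. How close the reduced distance comes to the raw one varies with the representation and the data.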

According to [83], choosing an appropriate data representation method can be considered as the key component which affects the efficiency and accuracy of the solution. High dimensionality and noise are characteristics of most time-series data [6]; consequently, dimensionality reduction methods are usually used in whole time-series clustering in order to address these issues and promote the performance. Time-series dimensionality reduction techniques have progressed a long way and are widely used with large-scale time-series datasets, and each has its own features and drawbacks. Accordingly, much research has been carried out focusing on representation and dimensionality reduction [84–90]. It is worth mentioning here one of the recent comparisons of representation methods. H. Ding et al. [91] have performed a comprehensive comparison of 8 representation methods on 38 datasets. Although they investigated the indexing effectiveness of representation methods, the results are advantageous for clustering purposes as well. They use tightness of lower bounds to compare representation methods. They show that there is very little difference between recent representation methods. In the taxonomy of representations, there are generally four representation types [9,83,92,93]: data adaptive, non-data adaptive, model-based and data dictated representation approaches, as depicted in Fig. 4.

Fig. 3. An overview of four components of whole time-series clustering.

Fig. 4. Hierarchy of different time-series representation approaches.
