Time-series clustering – A decade review
Saeed Aghabozorgi, Ali Seyed Shirkhorshidi*, Teh Ying Wah
Department of Information System, Faculty of Computer Science and Information Technology, University of Malaya (UM),
50603 Kuala Lumpur, Malaysia
article info
Article history:
Received 13 October 2014
Accepted 27 April 2015
Available online 6 May 2015
Keywords:
Clustering
Time-series
Distance measure
Evaluation measure
Representations
abstract
Clustering is a solution for classifying enormous data when there is no prior
knowledge about the classes. With emerging new concepts like cloud computing and big
data and their vast applications in recent years, research on unsupervised
solutions like clustering algorithms has increased, in order to extract knowledge from this
avalanche of data. Clustering time-series data has been used in diverse scientific areas
to discover patterns which empower data analysts to extract valuable information from
complex and massive datasets. In the case of huge datasets, using supervised classification
solutions is almost impossible, while clustering can solve this problem using
unsupervised approaches. In this research work, the focus is on time-series data, which is
one of the popular data types in clustering problems and is broadly used, from gene
expression data in biology to stock market analysis in finance. This review presents the four
main components of time-series clustering and aims to provide an updated
investigation of the trend of improvements in the efficiency, quality and complexity of
time-series clustering approaches during the last decade, and to highlight new paths for
future work.
© 2015 Elsevier Ltd. All rights reserved.
1. Introduction
Clustering is a data mining technique where similar data
are placed into related or homogeneous groups without
advance knowledge of the groups’ definitions [1]. In detail,
clusters are formed by grouping objects that have maximum
similarity with other objects within the group, and minimum
similarity with objects in other groups. It is a useful approach
for exploratory data analysis as it identifies structure(s) in an
unlabelled dataset by objectively organizing data into similar
groups. Moreover, clustering is used for exploratory data
analysis for summary generation and as a pre-processing
step for other data mining tasks or as a part of a complex
system.
With the increasing power of data storage and processors,
real-world applications have found the chance to store and
keep data for a long time. Hence, data in many applications
are being stored in the form of time-series data, for example
sales data, stock prices, exchange rates in finance, weather
data, biomedical measurements (e.g., blood pressure and
electrocardiogram measurements), biometrics data (image
data for facial recognition), particle tracking in physics, etc.
Accordingly, different works are found in a variety of domains
such as Bioinformatics and Biology, Genetics, Multimedia
[2–4] and Finance. This amount of time-series data has
given many researchers in data mining communities the
opportunity to analyse time-series in the last decade.
Consequently, much research and many projects relevant to
analysing time-series have been performed in various areas
for different purposes such as: subsequence matching,
anomaly detection, motif discovery [5], indexing, clustering,
classification [6], visualization [7], segmentation [8],
identifying patterns, trend analysis, summarization [9], and
forecasting. Moreover, there are many on-going research
projects aimed at improving the existing techniques [10,11].
In the recent decade, there has been a considerable amount
of change and development in the time-series clustering area,
caused by emerging concepts such as big data and
cloud computing, which have increased the size of datasets
exponentially. For example, one hour of ECG (electrocardiogram) data
occupies 1 gigabyte, a typical weblog requires 5 gigabytes per
week, the space shuttle database has 200 gigabytes and
updating it requires 2 gigabytes per day [12]. Consequently,
clustering has needed improvements in recent years to cope
with this incremental avalanche of data and to keep its reputation
as a helpful data-mining tool for extracting useful patterns and
knowledge from big datasets. This review is opportune
because, despite the considerable changes in the area, there is
no comprehensive review of the anatomy and structure of
time-series clustering. There are some surveys and reviews
that focus on comparative aspects of time-series clustering
experiments [6,13–17], but none of them tends to be as
comprehensive as this review. This research work
aims to provide an updated investigation of the trend of
improvements in the efficiency, quality and complexity of
time-series clustering approaches during the last decade and to
highlight new paths for future work.
Contents lists available at ScienceDirect
Information Systems
journal homepage: www.elsevier.com/locate/infosys
http://dx.doi.org/10.1016/j.is.2015.04.007
0306-4379/© 2015 Elsevier Ltd. All rights reserved.
* Corresponding author. Tel.: +60 196918918.
E-mail addresses: saeed@um.edu.my (S. Aghabozorgi),
shirkhorshidi_ali@siswa.um.edu.my,
Shirkhorshidi_ali@yahoo.co.uk (A. Seyed Shirkhorshidi),
tehyw@um.edu.my (T. Ying Wah).
Information Systems 53 (2015) 16–38
1.1. Time-series clustering
A special type of clustering is time-series clustering. A
sequence composed of a series of nominal symbols from a
particular alphabet is usually called a temporal sequence, and
a sequence of continuous, real-valued elements is known as a
time-series [15]. A time-series is essentially classified as
dynamic data because its feature values change as a function
of time, which means that the value(s) of each point of a
time-series is/are one or more observations that are made
chronologically. Time-series data is a type of temporal data
which is naturally high dimensional and large in data size
[6,17,18]. Time-series data are of interest due to their ubiquity
in various areas ranging from science, engineering, business,
finance, economics and healthcare to government [16]. While
each time-series consists of a large number of data points,
it can also be seen as a single object [19]. Clustering such
complex objects is particularly advantageous because it leads
to the discovery of interesting patterns in time-series datasets. As
these patterns can be either frequent or rare patterns, several
research challenges have arisen such as: developing methods
to recognize dynamic changes in time-series, anomaly and
intrusion detection, process control, and character recognition
[20–22]. More applications of time-series data are discussed
in Section 1.2. To highlight the importance of and the
need for clustering time-series datasets, potentially overlapping
objectives for clustering of time-series data are given as
follows:
1. Time-series databases contain valuable information that
can be obtained through pattern discovery. Clustering is
a common solution performed to uncover these patterns
on time-series datasets.
2. Time-series databases are very large and cannot be handled
well by human inspectors. Hence, many users prefer to deal
with structured datasets rather than very large datasets. As
a result, time-series data are represented as a set of groups
of similar time-series by aggregation of data in non-overlapping
clusters or by a taxonomy as a hierarchy of
abstract concepts.
3. Time-series clustering is the most-used approach as an
exploratory technique, and also as a subroutine in more
complex data mining algorithms, such as rule discovery,
indexing, classification, and anomaly detection [22].
4. Representing time-series cluster structures as visual
images (visualization of time-series data) can help users
quickly understand the structure of data, clusters,
anomalies, and other regularities in datasets.
The problem of clustering of time-series data is formally
defined as follows:
Definition 1: Time-series clustering. Given a dataset of $n$
time-series data $D = \{F_1, F_2, \ldots, F_n\}$, the process of
unsupervised partitioning of $D$ into $C = \{C_1, C_2, \ldots, C_k\}$,
in such a way that homogeneous time-series are grouped together based
on a certain similarity measure, is called time-series clustering.
Then, $C_i$ is called a cluster, where $D = \bigcup_{i=1}^{k} C_i$ and
$C_i \cap C_j = \emptyset$ for $i \neq j$.
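The two set conditions in Definition 1 (the clusters cover D and are pairwise disjoint) can be checked mechanically. The following is a minimal sketch, not part of the original paper; the function name `is_valid_partition` and the series identifiers `F1` to `F5` are hypothetical and chosen only for illustration.

```python
from typing import Hashable, Iterable, List, Set


def is_valid_partition(dataset_ids: Iterable[Hashable],
                       clusters: List[Set[Hashable]]) -> bool:
    """Return True if `clusters` is a hard partition of `dataset_ids`."""
    universe = set(dataset_ids)
    seen: Set[Hashable] = set()
    for cluster in clusters:
        # C_i and C_j must be disjoint for i != j:
        # no series may appear in two clusters.
        if seen & cluster:
            return False
        seen |= cluster
    # D must equal the union of all C_i: every series is assigned,
    # and nothing outside D appears in a cluster.
    return seen == universe


# Hypothetical identifiers standing in for the series F_1 .. F_5.
series_ids = ["F1", "F2", "F3", "F4", "F5"]
good = [{"F1", "F2"}, {"F3"}, {"F4", "F5"}]
overlapping = [{"F1", "F2"}, {"F2", "F3"}, {"F4", "F5"}]
incomplete = [{"F1", "F2"}, {"F4", "F5"}]

print(is_valid_partition(series_ids, good))
print(is_valid_partition(series_ids, overlapping))
print(is_valid_partition(series_ids, incomplete))
```

Only the first clustering satisfies both conditions; the second violates disjointness and the third leaves a series unassigned.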
Time-series clustering is a challenging issue because, first
of all, time-series data are often far larger than memory size
and consequently they are stored on disks. This leads to an
exponential decrease in speed of the clustering process.
The second challenge is that time-series data are often high
dimensional [23,24], which makes handling these data difficult
for many clustering algorithms [25] and also slows down
the process of clustering [26]. Finally, the third challenge
concerns the similarity measures that are used to make the
clusters. To do so, similar time-series must be found, which
requires time-series similarity matching, that is, the process of
calculating the similarity among whole time-series using
a similarity measure. This process is also known as “whole
sequence matching”, where the whole lengths of the time-series are
considered during distance calculation. However, the process
is complicated because time-series data are naturally noisy
and include outliers and shifts [18]; on the other hand, the
length of time-series varies and the distance among them
still needs to be calculated. These common issues have made the
similarity measure a major challenge for data miners.
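To make the effect of shifts on whole sequence matching concrete, the sketch below compares two series with the Euclidean distance before and after z-normalization. Both the choice of Euclidean distance and the use of z-normalization are assumptions made for illustration; the text above does not prescribe either, and the sample values are invented.

```python
import math
from statistics import mean, pstdev
from typing import List, Sequence


def euclidean(a: Sequence[float], b: Sequence[float]) -> float:
    """Whole-sequence Euclidean distance; both series must have equal length."""
    if len(a) != len(b):
        raise ValueError("whole-sequence Euclidean distance needs equal lengths")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def z_normalize(series: Sequence[float]) -> List[float]:
    """Rescale a series to zero mean and unit (population) standard deviation."""
    mu = mean(series)
    sigma = pstdev(series)
    if sigma == 0:
        # A constant series has no shape; map it to all zeros.
        return [0.0 for _ in series]
    return [(x - mu) / sigma for x in series]


# Invented data: the same shape, once at its original level and once shifted.
base = [0.0, 1.0, 2.0, 1.0, 0.0, 1.0, 2.0, 1.0]
shifted = [x + 10.0 for x in base]

raw_distance = euclidean(base, shifted)
normalized_distance = euclidean(z_normalize(base), z_normalize(shifted))

print(f"raw distance:        {raw_distance:.3f}")
print(f"normalized distance: {normalized_distance:.3f}")
```

The raw distance is dominated by the vertical shift even though the shapes are identical, while the normalized distance is close to zero. The length check also shows why series of varying length need a different measure or a preprocessing step.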
1.2. Applications of time-series clustering
Clustering of time-series data is mostly utilized for the
discovery of interesting patterns in time-series datasets [27,28].
This task itself falls into two categories. The first group is the
one used to find patterns that frequently appear in
the dataset [29,30]. The second group comprises methods to discover
patterns which happen in datasets surprisingly [31–34].
Briefly, finding the clusters of time-series can be advantageous
in different domains to answer the following real-world problems:
1- Anomaly, novelty or discord detection: anomaly detection
methods discover unusual and unexpected patterns
which happen in datasets surprisingly [31–34]. For example,
in sensor databases, time-series produced by the sensor
readings of a mobile robot can be clustered in order to discover
the events [35].
2- Recognizing dynamic changes in time-series: detection
of correlation between time-series [36]. For example,
in financial databases, it can be used to find the
companies with similar stock price movements.
3- Prediction and recommendation: a hybrid technique
combining clustering and function approximation per
cluster can help users to predict and recommend [37–40].
For example, in scientific databases, it can address
problems such as finding the patterns of solar magnetic
wind to predict today’s pattern.
4- Pattern discovery: to discover the interesting patterns
in databases. For example, in marketing databases, different
daily patterns of sales of a specific product in a store
can be discovered.
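As an illustration of detecting correlation between time-series, such as finding companies with similar stock price movements, the sketch below computes the Pearson correlation between price series and turns it into a dissimilarity. The Pearson coefficient and the `1 - r` distance are assumptions chosen for illustration, not a method prescribed by the text, and the price values are invented.

```python
import math
from typing import Sequence


def pearson(a: Sequence[float], b: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equal-length series."""
    n = len(a)
    if n != len(b) or n < 2:
        raise ValueError("series must have equal length of at least 2")
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / math.sqrt(var_a * var_b)


def correlation_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Dissimilarity in [0, 2]: 0 for perfectly correlated series."""
    return 1.0 - pearson(a, b)


# Invented closing prices for three hypothetical companies.
prices = {
    "A": [10.0, 11.0, 12.0, 11.0, 13.0],
    "B": [20.0, 22.0, 24.0, 22.0, 26.0],  # moves with A at a different scale
    "C": [13.0, 12.0, 11.0, 12.0, 10.0],  # moves against A
}

dist_ab = correlation_distance(prices["A"], prices["B"])
dist_ac = correlation_distance(prices["A"], prices["C"])
print(f"distance A-B: {dist_ab:.3f}")
print(f"distance A-C: {dist_ac:.3f}")
```

A and B end up close under this measure despite their different price levels, while A and C are far apart because they move in opposite directions.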
Table 1 depicts some applications of time-series data in
different domains.
1.3. Taxonomy of time-series clustering
Reviewing the literature, one can conclude that most
works related to clustering time-series are classified into three
categories: “whole time-series clustering”, “subsequence clustering”
and “time point clustering”, as depicted in Fig. 1. The
first two categories are mentioned by Keogh and Lin [242].
Table 1
Samples of objectives of time-series clustering in different domains.
| Category | Clustering application | Research works |
|---|---|---|
| Aviation/Astronomy | Astronomical data (star light curves) – pre-processing for outlier detection | [41] |
| Biology | Multiple gene expression profile alignment for microarray time-series data clustering | [42] |
| | Functional clustering of time series gene expression data | [43] |
| | Identification of functionally related genes | [44–46] |
| Climate | Discovery of climate indices | [47,48] |
| | Analysing PM10 and PM2.5 concentrations at a coastal location of New Zealand | [49] |
| Energy | Discovering energy consumption pattern | [50,51] |
| Environment and urban | Analysis of the regional variability of sea-level extremes | [52] |
| | Earthquake – Analysing potential violations of a Comprehensive Test Ban Treaty (CTBT) – Pattern discovery and forecasting | [53,54] |
| | Analysis of the change of population distribution during a day in Salt Lake County, Utah, USA | [55] |
| | Investigating the relationship between the climatic indices with the clusters/trends detected based on clustering method | [56] |
| Finance | Finding seasonality patterns (retail pattern) | [57] |
| | Personal income pattern | [58] |
| | Creating efficient portfolio (a group of stocks owned by a particular person or company) | [59] |
| | Discovery patterns from stock time-series | [60] |
| | Risk reduced portfolios by analyzing the companies and the volatility of their returns | [61] |
| | Discovery patterns from stock time-series | [29,62] |
| | Investigate the correlation between hedging horizon and performance in financial time-series | [63] |
| Medicine | Detecting brain activity | [64,65] |
| | Exploring, identifying, and discriminating pathological cases from MS clinical samples | [66] |
| Psychology | Analysis of human behaviour in psychological domain | [67] |
| Robotics | Forming prototypical representations of the robot’s experiences | [68,69] |
| Speech/voice recognition | Speaker verification | [70] |
| | Biometric voice classification using hierarchical clustering | [71] |
| User analysis | Analysing multivariate emotional behaviour of users in social network with the goal to cluster the users from a fully new perspective-emotions | [72] |
Fig. 1. Time-series clustering taxonomy.
Whole time-series clustering is considered as clustering
of a set of individual time-series with respect to their
similarity. Here, clustering means applying (usually
conventional) clustering to discrete objects, where the objects
are time-series.
Subsequence clustering means clustering on a set of
subsequences of a time-series that are extracted via a
sliding window, that is, clustering of segments from a
single long time-series.
Time point clustering is another category of clustering
which is seen in some papers [74–76]. It is the clustering of
time points based on a combination of their temporal
proximity and the similarity of the corresponding
values. This approach is similar to time-series
segmentation. However, it differs from segmentation
in that not all points need to be assigned to clusters, i.e.,
some of them are considered as noise.
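The sliding-window extraction that produces the input of subsequence clustering can be sketched as follows. This is a minimal illustration under assumed parameter names (`width`, `step`); the text does not fix a window length or step size, and the sample series is invented.

```python
from typing import List, Sequence


def sliding_windows(series: Sequence[float], width: int,
                    step: int = 1) -> List[List[float]]:
    """Extract every subsequence of length `width`, moving `step` points at a time."""
    if width <= 0 or step <= 0:
        raise ValueError("width and step must be positive")
    if width > len(series):
        return []
    return [list(series[start:start + width])
            for start in range(0, len(series) - width + 1, step)]


# Invented single long time-series.
long_series = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]

windows = sliding_windows(long_series, width=3)
for window in windows:
    print(window)
```

With a step of 1, neighbouring windows overlap in all but one point, so the number of extracted subsequences grows with the length of the original series.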
Essentially, subsequence clustering is performed on a
single time-series, and Keogh and Lin [242] showed that
this type of clustering is meaningless. Time-point clustering
is also applied to a single time-series, and is similar to time-series
segmentation, as the objective of time-point clustering
is finding clusters of time points instead of clusters of
time-series data. The focus of this study is on “whole
time-series clustering”. A complete review of whole time-series
clustering is performed and shown in Table 4. Reviewing
the literature, it is noticeable that various techniques have
been recommended for the clustering of whole time-series
data. However, most of them take one of the following
approaches to cluster time-series data:
1. Customizing the existing conventional clustering algorithms
(which work with static data) such that they become
compatible with the nature of time-series data. In this
approach, usually their distance measure (in conventional
algorithms) is modified to be compatible with the raw
time-series data [16].
2. Converting time-series data into simple objects (static
data) as input of conventional clustering algorithms [16].
3. Using multi resolutions of time-series as input of a
multi-step approach. This approach is discussed further
in Section 5.6.
Besides this common characteristic, there are generally
three different ways to cluster time-series, namely shape-based,
feature-based and model-based.
Fig. 2 gives a brief overview of these approaches. In the shape-based
approach, shapes of two time-series are matched as
well as possible, by a non-linear stretching and contracting
of the time axes. This approach has also been labelled as a
raw-data-based approach because it typically works directly
with the raw time-series data. Shape-based algorithms
usually employ conventional clustering methods, which
are compatible with static data, while their distance/similarity
measure has been replaced with one appropriate
for time-series. In the feature-based approach, the raw
time-series are converted into a feature vector of lower
dimension. Later, a conventional clustering algorithm is
applied to the extracted feature vectors. Usually in this
approach, an equal-length feature vector is calculated from
each time-series, followed by the Euclidean distance
measurement [77]. In model-based methods, a raw time-series
is transformed into model parameters (a parametric model
for each time-series), and then a suitable model distance
and a clustering algorithm (usually a conventional clustering
algorithm) are chosen and applied to the extracted model
parameters [16]. However, it is shown that model-based
approaches usually have scalability problems [78], and their
performance reduces when the clusters are close to each
other [79].
Fig. 2. The time-series clustering approaches.
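The non-linear stretching and contracting of the time axes used by shape-based approaches can be illustrated with dynamic time warping (DTW). DTW is named here as an assumption: it is a commonly used elastic measure, but this part of the text does not name a specific one. The sketch below is the textbook dynamic-programming formulation with squared point costs and no warping-window constraint; the sample series are invented.

```python
import math
from typing import Sequence


def dtw_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Dynamic time warping distance between two series of any lengths."""
    n, m = len(a), len(b)
    if n == 0 or m == 0:
        raise ValueError("both series must be non-empty")
    # cost[i][j] holds the minimal cumulative cost of aligning a[:i] with b[:j].
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            point_cost = (a[i - 1] - b[j - 1]) ** 2
            # A point may be matched by repeating a point of a (stretching a),
            # repeating a point of b (stretching b), or advancing both.
            cost[i][j] = point_cost + min(cost[i - 1][j],
                                          cost[i][j - 1],
                                          cost[i - 1][j - 1])
    return math.sqrt(cost[n][m])


def euclidean(a: Sequence[float], b: Sequence[float]) -> float:
    """Point-by-point distance, shown only for comparison."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Invented data: the same peak, occurring one time step earlier in `early`.
late = [0.0, 0.0, 1.0, 2.0, 1.0, 0.0, 0.0]
early = [0.0, 1.0, 2.0, 1.0, 0.0, 0.0, 0.0]

print(f"Euclidean: {euclidean(late, early):.3f}")
print(f"DTW:       {dtw_distance(late, early):.3f}")
```

The point-by-point distance penalizes the time shift, whereas the warped alignment matches the two peaks. The price is a quadratic number of cell updates per pair of series, which is one reason raw-data-based clustering becomes expensive on long series.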
Reviewing existing works in the literature, it is implied
that time-series clustering essentially has four components:
dimensionality reduction or representation method, distance
measurement, clustering algorithm, and prototype definition,
followed by evaluation. Fig. 3 shows an overview of these
components.
The general process in time-series clustering uses
some or all of these components depending on the problem.
Usually, data are approximated using a representation
method in such a way that they can fit in memory. Afterwards,
a clustering algorithm is applied to the data using a distance
measure. In the clustering process, a prototype is usually
required for summarization of the time-series. Finally, the
clusters are evaluated using evaluation criteria. In the following
subsections, each component is discussed, and several related
works and methods are reviewed.
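The general process described above can be sketched end to end on a toy dataset. Every concrete choice below is an assumption made for illustration and is not taken from the paper: a three-value feature vector as the representation, Euclidean distance, a small k-medoids-style loop as the clustering algorithm, the medoid as the prototype, and the summed distance to the prototype as the evaluation criterion. The data are invented.

```python
import math
from statistics import mean, pstdev
from typing import List, Sequence, Tuple

Vector = Tuple[float, float, float]


def represent(series: Sequence[float]) -> Vector:
    """Step 1, representation: (mean, standard deviation, net change)."""
    return (mean(series), pstdev(series), series[-1] - series[0])


def distance(u: Vector, v: Vector) -> float:
    """Step 2, distance measure: Euclidean distance between vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))


def assign(points: List[Vector], medoids: List[int]) -> List[int]:
    """Label each point with the position of its nearest medoid."""
    return [min(range(len(medoids)),
                key=lambda c: distance(p, points[medoids[c]]))
            for p in points]


def k_medoids(points: List[Vector], k: int,
              max_iter: int = 20) -> Tuple[List[int], List[int]]:
    """Step 3, clustering; the medoids serve as prototypes (step 4)."""
    if not 1 <= k <= len(points):
        raise ValueError("k must be between 1 and the number of points")
    medoids = [0]  # deterministic farthest-first initialisation
    while len(medoids) < k:
        medoids.append(max(
            range(len(points)),
            key=lambda i: min(distance(points[i], points[m])
                              for m in medoids)))
    for _ in range(max_iter):
        labels = assign(points, medoids)
        updated = []
        for c, old in enumerate(medoids):
            members = [i for i, lab in enumerate(labels) if lab == c]
            # The new medoid minimises total distance to its cluster mates;
            # an empty cluster keeps its previous medoid.
            updated.append(min(
                members,
                key=lambda i: sum(distance(points[i], points[j])
                                  for j in members)) if members else old)
        if updated == medoids:
            break
        medoids = updated
    return assign(points, medoids), medoids


def compactness(points: List[Vector], labels: List[int],
                medoids: List[int]) -> float:
    """Step 5, evaluation: total distance of points to their prototype."""
    return sum(distance(p, points[medoids[lab]])
               for p, lab in zip(points, labels))


# Invented data: three rising series followed by two flat ones.
dataset = [
    [0.0, 1.0, 2.0, 3.0, 4.0],
    [0.0, 1.1, 2.1, 2.9, 4.2],
    [0.2, 1.0, 2.0, 3.1, 3.9],
    [2.0, 2.1, 1.9, 2.0, 2.0],
    [2.1, 2.0, 2.0, 1.9, 2.1],
]

features = [represent(s) for s in dataset]
labels, medoids = k_medoids(features, k=2)
print("labels:     ", labels)
print("prototypes: ", [dataset[m] for m in medoids])
print("compactness:", round(compactness(features, labels, medoids), 3))
```

Each function corresponds to one component, so any of them can be swapped without touching the others, for example replacing the feature vector with another representation or the compactness score with an external index when ground-truth labels are available.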
1.4. Organization of the review
In the rest of this paper, we provide a state-of-the-art
review of the main components available in time-series
clustering, plus the evaluation methods and measures available
for validating time-series clustering. In Section 2, time-series
representation is discussed. Similarity and dissimilarity
measures are presented in Section 3. Sections 4 and 5
are dedicated to clustering prototypes and clustering algorithms,
respectively. In Section 6, evaluation measures are
discussed, and finally the paper is concluded in Section 7.
2. Representation methods for time series clustering
The first component of time-series clustering explained
here is dimension reduction, which is a common solution for
most whole time-series clustering approaches proposed in
the literature [9,80–82]. This section reviews methods of
time-series dimension reduction, which is also known as
time-series representation. Dimensionality reduction represents
the raw time-series in another space by transforming
time-series to a lower dimensional space or by feature
extraction. Dimensionality reduction is
greatly important in clustering of time-series, firstly because
it reduces memory requirements, as all raw time-series
cannot fit in the main memory [9,24]. Secondly, distance
calculation among raw data is computationally expensive,
and dimensionality reduction significantly speeds up
clustering [9,24]. Finally, when measuring the distance between two
raw time-series, highly unintuitive results may be obtained,
because some distance measures are highly sensitive to some
“distortions” in the data [3,83], and consequently, by using
raw time-series, one may cluster time-series which are
similar in noise instead of clustering them based on similarity
in shape. The potential to obtain a different type of cluster is
the reason why choosing the appropriate approach for
dimension reduction (feature extraction) and its ratio is a
challenging task [26]. In fact, it is a trade-off between speed
and quality, and all efforts must be made to obtain a proper
balance point between quality and execution time.
Definition 2: Time-series representation. Given a time-series
data $F_i = \{f_1, \ldots, f_t, \ldots, f_T\}$, representation is
transforming the time-series to another dimensionality reduced
vector $F'_i = \{f'_1, \ldots, f'_x\}$ where $x < T$ and, if two series are
similar in the original space, then their representations
should be similar in the transformation space too.
According to [83], choosing an appropriate data representation
method can be considered as the key component which
affects the efficiency and accuracy of the solution. High
dimensionality and noise are characteristics of most time-series
data [6]; consequently, dimensionality reduction methods
are usually used in whole time-series clustering in order to
address these issues and improve performance. Time-series
dimensionality reduction techniques have progressed a
long way and are widely used with large-scale time-series
datasets, and each has its own features and drawbacks. Accordingly,
much research has been carried out focusing on
representation and dimensionality reduction [84–90]. It is
worth mentioning here one of the recent comparisons
of representation methods. H. Ding et al. [91] have
performed a comprehensive comparison of 8 representation
methods on 38 datasets. Although they investigated the
indexing effectiveness of representation methods, the results
are advantageous for clustering purposes as well. They use
tightness of lower bounds to compare representation methods.
They show that there is very little difference between recent
representation methods. In the taxonomy of representations, there
are generally four representation types [9,83,92,93]: data
adaptive, non-data adaptive, model-based and data dictated
representation approaches, as depicted in Fig. 4.
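Definition 2 and the tightness-of-lower-bound criterion can be illustrated with one concrete representation. The sketch below uses piecewise aggregate approximation (PAA), which is an assumption: this part of the text does not name a particular method. The lower bound used is the standard PAA bound (segment length under the square root, times the Euclidean distance between the reduced vectors), and tightness is taken as the ratio of that bound to the true Euclidean distance. The sample series are invented, and the sketch requires the series length to be a multiple of the segment count.

```python
import math
from typing import List, Sequence


def paa(series: Sequence[float], segments: int) -> List[float]:
    """Reduce a series of length T to `segments` values (x < T) by segment means."""
    n = len(series)
    if segments <= 0 or n % segments != 0:
        raise ValueError("this sketch assumes len(series) is a multiple of segments")
    size = n // segments
    return [sum(series[i * size:(i + 1) * size]) / size
            for i in range(segments)]


def euclidean(a: Sequence[float], b: Sequence[float]) -> float:
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def paa_lower_bound(a: Sequence[float], b: Sequence[float],
                    segments: int) -> float:
    """Distance in the reduced space that never exceeds the true distance."""
    size = len(a) // segments
    return math.sqrt(size) * euclidean(paa(a, segments), paa(b, segments))


def tightness(a: Sequence[float], b: Sequence[float], segments: int) -> float:
    """Ratio of lower bound to true distance; 1.0 would be a perfect bound."""
    true_distance = euclidean(a, b)
    if true_distance == 0:
        return 1.0
    return paa_lower_bound(a, b, segments) / true_distance


# Invented series of length T = 8, reduced to x = 4 values each.
first = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
second = [2.0, 2.0, 4.0, 4.0, 6.0, 6.0, 8.0, 8.0]

reduced = paa(first, 4)
bound = paa_lower_bound(first, second, 4)
ratio = tightness(first, second, 4)
print("reduced series:", reduced)
print(f"true distance:  {euclidean(first, second):.3f}")
print(f"lower bound:    {bound:.3f}")
print(f"tightness:      {ratio:.3f}")
```

A tightness close to 1 means that distances computed on the reduced vectors are nearly as informative as distances on the raw series, which is what makes the comparison criterion relevant to clustering as well as to indexing.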
Fig. 3. An overview of four components of whole time-series clustering.
Fig. 4. Hierarchy of different time-series representation approaches.