Instance Selection Based on Outlier Detection in Time Series Classification


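"Outlier Detection as Instance Selection Method for Feature Selection in Time Series Classification" is a master's thesis by David Cemernek. It examines how outlier detection can serve as an instance selection method for feature selection in time series classification. The thesis was written as part of the master's programme in Software Engineering and Management and submitted to Graz University of Technology. It was supervised by Dr. Roman Kern, with Prof. Stefanie Lindstaedt as head of the Institute of Interactive Systems and Data Science. The thesis was published in February 2019 and later updated on arXiv.

Outlier detection is a data analysis technique for identifying observations that differ markedly from most other data points in a dataset. In time series analysis, such outliers may indicate errors in data collection, rare events, or potentially new patterns. In machine learning tasks such as classification, outliers can degrade model performance, so handling them effectively matters.

The main contribution of the thesis is the proposal to use outlier detection as an instance selection strategy within the feature selection process. Traditionally, feature selection picks the most relevant or most influential subset of the original feature set in order to reduce the risk of overfitting and improve model interpretability. The author argues that this process can be improved further by taking into account how anomalous each data instance (that is, each time series) is. Removing instances that are misleading because of outliers can improve the stability and predictive power of the model.

The thesis describes the steps for implementing this strategy: choosing a suitable outlier detection algorithm (for example distance-based, statistical, or clustering-based methods), evaluating the effect of different methods on time series classification performance, and integrating outlier detection into a feature selection framework. The author may also have compared the approach with other feature selection techniques (such as recursive feature elimination and penalty-based methods) and validated it experimentally.

The thesis also covers data preprocessing, model evaluation metrics, and possible application scenarios for time series data, such as financial trading, medical diagnosis, and industrial monitoring. The author stresses that handling outliers correctly is essential for machine learning models to extract information accurately from complex and sometimes noisy time series data.

In summary, the thesis explores the use of outlier detection for feature selection in time series classification and offers a novel instance selection strategy that may improve the performance of machine learning models. By understanding and exploiting the properties of outliers, the method helps improve model robustness and generalization, which is relevant to research in data mining, artificial intelligence, and machine learning.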

2. Related Work

• consider time as the primary axis

We provide an artificially generated example of time series data in the plot shown in Figure 2.1.

Figure 2.1: Randomly generated time series data. The x-axis represents the time dimension in daily intervals. The y-axis represents the artificially generated values of the different observations.
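Data of the kind shown in Figure 2.1 can be produced in a few lines. The following is a minimal sketch only: the thesis does not say how its example was generated, so the random walk, the start date, and the function name `generate_series` are assumptions made for illustration.

```python
import random
from datetime import date, timedelta


def generate_series(n_days, start=date(2019, 1, 1), seed=0):
    """Return a list of (day, value) pairs forming a Gaussian random walk.

    The random-walk model and the start date are illustrative assumptions.
    """
    rng = random.Random(seed)  # seeded so the example is reproducible
    value = 0.0
    series = []
    for i in range(n_days):
        value += rng.gauss(0.0, 1.0)  # accumulate one Gaussian step per day
        series.append((start + timedelta(days=i), value))
    return series


series = generate_series(30)
for day, value in series[:3]:
    print(day.isoformat(), round(value, 3))
```

Each observation is tied to a timestamp, and consecutive timestamps are exactly one day apart, which matches the daily interval described in the caption.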

Given that a large percentage of the data produced worldwide is time series data, and given the exponentially growing size of databases, there has recently been an explosion of interest in time series analysis. Among many others, the following data from various domains are examples of time series data [Rat+09]:

• Finance: Development of a company's stock market price over time.
• Meteorology: Temperature development over time for a specific area such as a country, state, or city.
• Trade: Historical store sales data, for example products sold over time.
• Medical: Electrocardiograms showing the electrical activity of a heart over time.


Clustering algorithms find natural groups, called clusters, in data. In contrast to classification, there are no predefined labels to mark classes in the data, so clustering is also referred to as unsupervised learning. Clustering aims to make each cluster internally homogeneous (similar instances should belong to the same cluster) while keeping the clusters as heterogeneous as possible (different clusters should be as distinct as possible) [EA12].
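The grouping idea can be sketched with k-means (Lloyd's algorithm) on equal-length series. This is an illustrative choice, not an algorithm taken from the thesis; the squared Euclidean distance, the random initialisation, and all function names are assumptions.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length series."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean_series(group):
    """Point-wise mean of a non-empty list of equal-length series."""
    return [sum(vals) / len(group) for vals in zip(*group)]


def kmeans(series, k, n_iter=20, seed=0):
    """Assign each series to one of k clusters; returns a list of labels."""
    rng = random.Random(seed)
    centroids = rng.sample(series, k)  # start from k distinct instances
    labels = [0] * len(series)
    for _ in range(n_iter):
        # Assignment step: each series joins its nearest centroid.
        labels = [
            min(range(k), key=lambda c: sq_dist(s, centroids[c]))
            for s in series
        ]
        # Update step: each centroid becomes the mean of its members.
        for c in range(k):
            members = [s for s, lab in zip(series, labels) if lab == c]
            if members:
                centroids[c] = mean_series(members)
    return labels


data = [[0.0, 0.1, 0.0], [0.1, 0.0, 0.1], [5.0, 5.1, 5.0], [5.1, 5.0, 4.9]]
labels = kmeans(data, k=2)
print(labels)
```

No labels are supplied to `kmeans`; the two low-valued series and the two high-valued series end up in separate clusters purely because of their mutual distances.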

Motif discovery deals with finding recurring patterns in subsequences of time series data. These recurring patterns are referred to as "motifs" [FV17].
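One simple way to make this concrete is a brute-force search for the pair of non-overlapping subsequences with the smallest Euclidean distance. The sketch below assumes a fixed motif length `m` and is quadratic in the series length; it is meant as an illustration, not as the method used in [FV17].

```python
def find_motif(series, m):
    """Return (distance, i, j) for the closest pair of length-m windows.

    Windows starting at i and j are required not to overlap (j >= i + m),
    which avoids trivial matches of a window with a shifted copy of itself.
    """
    best = (float("inf"), None, None)
    n_windows = len(series) - m + 1
    for i in range(n_windows):
        for j in range(i + m, n_windows):
            d = sum(
                (series[i + k] - series[j + k]) ** 2 for k in range(m)
            ) ** 0.5
            if d < best[0]:
                best = (d, i, j)
    return best


series = [0, 1, 5, 1, 0, 3, 7, 2, 0, 1, 5, 1, 4]
dist, i, j = find_motif(series, 4)
print(dist, i, j)  # 0.0 0 8
```

The pattern `[0, 1, 5, 1]` occurs at positions 0 and 8, so the search reports a distance of zero for that pair.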

Outlier detection focuses on finding abnormal (or anomalous) sequences in time series data. A first step in anomaly detection is often to create a model of normal time series and then find subsequences that deviate from this normal behavior [FV17]. We take a closer look at this topic in Sub-Section 2.2.4.
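The two steps named above (model normal behavior, then look for deviating subsequences) can be sketched as follows. The "model" here is just the mean and standard deviation of a reference series assumed to be normal, and the score is the mean absolute z-score of a sliding window; both choices, the threshold of 3, and the function names are assumptions for illustration.

```python
import statistics


def fit_normal_model(reference):
    """Model normal behavior as (mean, sample standard deviation)."""
    return statistics.mean(reference), statistics.stdev(reference)


def anomalous_windows(series, model, m, threshold=3.0):
    """Return start indices of length-m windows that deviate from the model."""
    mu, sigma = model
    hits = []
    for start in range(len(series) - m + 1):
        window = series[start:start + m]
        # Mean absolute z-score of the window under the normal model.
        score = sum(abs(x - mu) / sigma for x in window) / m
        if score > threshold:
            hits.append(start)
    return hits


reference = [1.0, 1.2, 0.9, 1.1, 1.0, 0.8, 1.1, 0.9]  # assumed normal data
series = [1.0, 1.1, 0.9, 4.0, 4.2, 3.9, 1.0, 1.1]
hits = anomalous_windows(series, fit_normal_model(reference), m=3)
print(hits)  # [1, 2, 3, 4, 5]
```

Every window that touches the elevated values at positions 3 to 5 is flagged, while the first window, which contains only normal values, is not.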

Prediction of subsequent or future values of time series data is based on the principle that observations close together in time are more closely related than observations far apart [Dag10]. Prediction tasks model these correlations and dependencies within time series data in order to forecast future values [FV17]. This task is therefore also referred to as time series forecasting.
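A first-order autoregressive model, AR(1), is perhaps the simplest way to exploit the dependence between neighbouring observations. The sketch below fits `x[t] = c + phi * x[t-1]` by ordinary least squares; the choice of AR(1) and the function names are assumptions, not something the thesis prescribes.

```python
def fit_ar1(series):
    """Least-squares fit of x[t] = c + phi * x[t-1]; returns (c, phi)."""
    x = series[:-1]  # predictors: previous values
    y = series[1:]   # targets: next values
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var = sum((a - mean_x) ** 2 for a in x)
    phi = cov / var
    c = mean_y - phi * mean_x
    return c, phi


def forecast(series, steps):
    """Forecast the next `steps` values by iterating the fitted model."""
    c, phi = fit_ar1(series)
    preds = []
    last = series[-1]
    for _ in range(steps):
        last = c + phi * last
        preds.append(last)
    return preds


# Generated by x[t] = 1 + 0.5 * x[t-1], so the fit recovers c=1, phi=0.5.
series = [1.0, 1.5, 1.75, 1.875, 1.9375]
c, phi = fit_ar1(series)
preds = forecast(series, 2)
print(c, phi, preds)
```

Because the example series follows the model exactly, the one-step forecast is `1 + 0.5 * 1.9375 = 1.96875`.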

Query by content deals with finding similar time series or time series subsequences for a given query time series. What counts as similar depends on the definition of a similarity measure between time series [EA12].

Rule discovery is also referred to as association rule mining and aims at finding relations between variables in (time series) data [FV17].

Segmentation focuses on creating approximations of time series data by reducing the dimensionality of potentially high-dimensional time series [EA12]. In [Rat+09], segmentation is also referred to as summarization.
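One common approximation of this kind is Piecewise Aggregate Approximation (PAA), which replaces each segment of the series by its mean. PAA is used here only as an example of dimensionality reduction; the text above does not single it out, and the function name `paa` is an assumption.

```python
def paa(series, n_segments):
    """Piecewise Aggregate Approximation: one mean value per segment.

    Assumes 1 <= n_segments <= len(series). Integer boundaries are chosen
    so that the segments cover the series without gaps or overlaps.
    """
    n = len(series)
    approx = []
    for s in range(n_segments):
        lo = s * n // n_segments
        hi = (s + 1) * n // n_segments
        chunk = series[lo:hi]
        approx.append(sum(chunk) / len(chunk))
    return approx


reduced = paa([1, 3, 2, 4, 10, 12, 0, 2], 4)
print(reduced)  # [2.0, 3.0, 11.0, 1.0]
```

Eight observations are reduced to four values, each summarizing two consecutive points.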

One fundamental problem for almost all of the above time series data mining tasks is the representation of time series data. A common approach is to transform the time series via some form of dimensionality reduction, followed by various indexing mechanisms. These techniques are complemented by similarity measures between time series and by segmentation of time series data [Fu11]. These steps almost match the definition of [EA12], in which the three major issues for dealing


Time Series Classification

Time series classification (TSC) differs from traditional classification in that the elements to be classified are ordered. This ordering may be used to derive discriminant features [Bag+17]. In "traditional" classification the ordering of features is not important, and interaction between features is considered independent of their relative positions [BL14].

A possibly more intuitive comparison of traditional classification to TSC is based on the assumption that the former only uses static features, whereas TSC uses dynamic features, for which the change in values over time is relevant.
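The distinction can be illustrated with two series that contain the same values in opposite order. The feature sets below (minimum, maximum, and mean as "static"; first differences as "dynamic") are assumptions chosen for the example, not definitions from the thesis.

```python
def static_features(series):
    """Order-independent summary: (min, max, mean)."""
    return (min(series), max(series), sum(series) / len(series))


def dynamic_features(series):
    """Order-dependent summary: change between consecutive observations."""
    return [b - a for a, b in zip(series, series[1:])]


rising = [1, 2, 3, 4, 5]
falling = [5, 4, 3, 2, 1]

print(static_features(rising) == static_features(falling))  # True
print(dynamic_features(rising))   # [1, 1, 1, 1]
print(dynamic_features(falling))  # [-1, -1, -1, -1]
```

A classifier that sees only the static features cannot tell the two series apart, while the dynamic features separate them immediately.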

The three main characteristics that make TSC so difficult are the small number of cases, the large number of features, and highly correlated and/or redundant features. In traditional classification we already have good solutions for these three characteristics. Traditional classification algorithms typically have problems with discriminating features in autocorrelation, phase independence in classes, and embedded discriminative sub-series. Nevertheless, this does not mean that these problems are present in every time series dataset. This qualifies traditional classification algorithms as valuable baseline approaches, and these algorithms may provide important insight into the problem characteristics of a specific dataset [Bag+17].

Until recently the default baseline algorithm for TSC was the 1-Nearest Neighbor classifier (1-NN) with Euclidean distance. The Nearest Neighbor Classifier is a representative of instance-based learning algorithms. Algorithms of this kind only store the training instances during their training phase. To classify a new (unseen) instance, it is compared with its closest neighbors among the stored training instances. To measure closeness, instance-based learners use various distance functions, of which the Euclidean distance is probably the best known [BM02].
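A minimal sketch of the 1-NN classifier with Euclidean distance is shown below, assuming all series have the same length. "Training" amounts to keeping the labelled instances in a list; the names `euclidean` and `predict_1nn` are hypothetical.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length series."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def predict_1nn(train, query):
    """Return the label of the stored instance closest to the query.

    `train` is a list of (series, label) pairs; nothing else is learned.
    """
    _, label = min(train, key=lambda item: euclidean(item[0], query))
    return label


train = [([0, 1, 2, 3], "up"), ([3, 2, 1, 0], "down")]
print(predict_1nn(train, [0, 1, 1, 3]))  # up
print(predict_1nn(train, [4, 2, 2, 0]))  # down
```

All of the work happens at prediction time, which is characteristic of instance-based learners.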

Since the authors of [BL14] stated that the 1-NN classifier is easy to beat, it should no longer be used as a baseline for TSC. Instead, the authors recommended 1-Nearest Neighbor with a dynamic time warping window (DTW1NN) set through cross-validation as a more meaningful baseline. Furthermore, due to its solid results, the authors selected Rotation Forest as their second benchmark algorithm. Rotation Forest is a variant of
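The DTW1NN baseline mentioned above can be sketched as follows. The exact procedure of [BL14] is not reproduced in this excerpt, so the details are assumptions: the warping window is implemented as a Sakoe-Chiba band, the window size is chosen by leave-one-out cross-validation on the training set, and all function names are hypothetical.

```python
def dtw(a, b, window):
    """DTW distance with a band constraint |i - j| <= window."""
    n, m = len(a), len(b)
    w = max(window, abs(n - m))  # band must at least span the length gap
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(max(1, i - w), min(m, i + w) + 1):
            d = (a[i - 1] - b[j - 1]) ** 2
            cost[i][j] = d + min(
                cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1]
            )
    return cost[n][m] ** 0.5


def predict_dtw1nn(train, query, window):
    """Label of the training series closest to the query under DTW."""
    _, label = min(train, key=lambda item: dtw(item[0], query, window))
    return label


def loocv_accuracy(train, window):
    """Leave-one-out accuracy of DTW1NN for a given window size."""
    hits = 0
    for i, (series, label) in enumerate(train):
        rest = train[:i] + train[i + 1:]
        hits += predict_dtw1nn(rest, series, window) == label
    return hits / len(train)


def select_window(train, candidates):
    """Pick the first candidate window with the best leave-one-out accuracy."""
    return max(candidates, key=lambda w: loocv_accuracy(train, w))


train = [
    ([0, 0, 1, 0, 0], "peak"),
    ([0, 1, 0, 0, 0], "peak"),
    ([0, 0, 0, 0, 0], "flat"),
    ([0.1, 0, 0, 0, 0.1], "flat"),
]
for w in (0, 1, 2):
    print(w, loocv_accuracy(train, w))
print(select_window(train, [0, 1, 2]))  # 1
```

With a window of 0 the band collapses to the diagonal and DTW equals the Euclidean distance, so the two shifted peaks look more like the flat series than like each other. Allowing a warp of one step aligns the peaks, which is why the cross-validated choice prefers a non-zero window on this toy data.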
