classifier that uses/yields a set of simple features that can contribute to the domain knowledge. For example, in manufacturing applications, specific properties of the time series signals that discriminate conforming from non-conforming products are invaluable to diagnose, correct, and improve processes.
3. Interval features
Interval features are calculated from a time series interval, e.g., ‘‘the interval between time 10 and time 30’’. Many types of features over a time interval can be considered, but one may prefer simple and interpretable features such as the mean and standard deviation, e.g., ‘‘the average of the time series segment between time 10 and time 30’’.
Let $K$ be the number of feature types and $f_k(\cdot)$ $(k = 1, 2, \ldots, K)$ be the $k$th type. Here we consider three types: $f_1$ = mean, $f_2$ = standard deviation, $f_3$ = slope. Let $f_k(t_1, t_2)$ for $1 \leq t_1 \leq t_2 \leq M$ denote the $k$th interval feature calculated over the interval between $t_1$ and $t_2$. Let $v_i$ be the value at time $i$ for a time series example. Then the three interval features for the example are calculated as follows:
$$f_1(t_1, t_2) = \frac{\sum_{i=t_1}^{t_2} v_i}{t_2 - t_1 + 1} \qquad (1)$$
$$f_2(t_1, t_2) = \begin{cases} \sqrt{\dfrac{\sum_{i=t_1}^{t_2} \left( v_i - f_1(t_1, t_2) \right)^2}{t_2 - t_1}} & t_2 > t_1 \\[1ex] 0 & t_2 = t_1 \end{cases} \qquad (2)$$
$$f_3(t_1, t_2) = \begin{cases} \hat{\beta} & t_2 > t_1 \\ 0 & t_2 = t_1 \end{cases} \qquad (3)$$
where $\hat{\beta}$ is the slope of the least squares regression line of the training set $\{(t_1, v_{t_1}), (t_1 + 1, v_{t_1 + 1}), \ldots, (t_2, v_{t_2})\}$.
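To make Eqs. (1)–(3) concrete, the following is a minimal sketch of the three interval features, assuming one time series is stored as a 1-D NumPy array `v` with 0-based indexing; the function names are illustrative choices, not the authors' implementation.

```python
import numpy as np

def f1_mean(v, t1, t2):
    """Eq. (1): mean of v over the closed interval [t1, t2]."""
    return v[t1:t2 + 1].sum() / (t2 - t1 + 1)

def f2_std(v, t1, t2):
    """Eq. (2): standard deviation over [t1, t2]; 0 when the interval
    contains a single point (t2 == t1), matching the case split."""
    if t2 == t1:
        return 0.0
    mean = f1_mean(v, t1, t2)
    return np.sqrt(((v[t1:t2 + 1] - mean) ** 2).sum() / (t2 - t1))

def f3_slope(v, t1, t2):
    """Eq. (3): slope of the least squares regression line fitted to
    the points (t1, v[t1]), ..., (t2, v[t2]); 0 when t2 == t1."""
    if t2 == t1:
        return 0.0
    t = np.arange(t1, t2 + 1)
    slope, _intercept = np.polyfit(t, v[t1:t2 + 1], deg=1)
    return slope
```

For example, `f3_slope(v, 10, 30)` returns the least squares slope over the segment between time 10 and time 30 (0-based).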
Interval features have been shown to be effective for time series classification [15,14,16]. However, the interval feature space is large ($O(M^2)$). Rodríguez et al. [15] considered using only intervals of lengths equal to powers of two, and, therefore, reduced the feature space to $O(M \log M)$. Here we consider the random sampling strategy used in a random forest [1] that reduces the feature space to $O(M)$ at each tree node.
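One way the $O(M)$ bound can be realized is sketched below under the assumption that on the order of $\sqrt{M}$ interval sizes and $\sqrt{M}$ starting positions are sampled at each node, giving roughly $M$ candidate intervals; the exact sampling scheme here is an illustrative assumption, not a quote of the authors' procedure.

```python
import numpy as np

def sample_intervals(M, rng=None):
    """Return O(M) randomly sampled (t1, t2) interval endpoints."""
    if rng is None:
        rng = np.random.default_rng()
    n = max(1, int(np.sqrt(M)))
    intervals = []
    for size in rng.integers(1, M + 1, size=n):           # sampled interval lengths
        for t1 in rng.integers(0, M - size + 1, size=n):  # sampled start points
            intervals.append((int(t1), int(t1 + size - 1)))
    return intervals
```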
4. Time series forest classifier
4.1. Splitting criterion
A time series tree is the base component of a time series forest, and the splitting criterion is used to determine the best way to split a node in a tree. A candidate split $S$ in a time series tree node tests the following condition (for simplicity and without loss of generality, we assume the root node here):
$$f_k(t_1, t_2) \leq \tau \qquad (4)$$
for a threshold $\tau$. The instances satisfying the condition are sent to the left child node. Otherwise, the instances are sent to the right child node.
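As a minimal sketch of the node test in Eq. (4), assuming the feature values $f_k(t_1, t_2)$ for the instances at the node are precomputed in a NumPy array:

```python
import numpy as np

def split_node(values, tau):
    """Partition instances by the test in Eq. (4): values <= tau go left."""
    left = values <= tau   # boolean mask for the left child node
    return left, ~left     # the right child gets the complement
```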
Let $\{f_k^n(t_1, t_2),\ n \in 1, 2, \ldots, N\}$ denote the set of values of $f_k(t_1, t_2)$ for all training instances at the node. To obtain a good threshold $\tau$ in Eq. (4), one can sort the feature values of all the training instances and then select the best threshold from the midpoints between pairs of consecutive values, but this can be too costly [14]. We consider the strategy employed in Rodríguez and Alonso [14]. The candidate thresholds for a particular feature type $f_k$ are formed such that the range $\left[ \min_{n=1}^{N} \left( f_k^n(t_1, t_2) \right), \max_{n=1}^{N} \left( f_k^n(t_1, t_2) \right) \right]$ is divided into equal-width intervals. The number of candidate thresholds is denoted as $\kappa$ and is fixed, e.g., 20. The best threshold is then selected from the candidate thresholds. In this manner, sorting is avoided, and only $\kappa$ tests are needed.
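A sketch of this equal-width strategy, under the assumption that the $\kappa$ candidate thresholds are the interior boundaries dividing the value range into $\kappa + 1$ equal-width intervals (one natural reading of the text):

```python
import numpy as np

def candidate_thresholds(values, kappa=20):
    """Return kappa equally spaced candidate thresholds over [min, max]."""
    lo, hi = values.min(), values.max()
    # linspace with kappa + 2 points includes both endpoints;
    # dropping them leaves the kappa interior boundaries
    return np.linspace(lo, hi, kappa + 2)[1:-1]
```

Each of the $\kappa$ thresholds is then evaluated with the splitting criterion below, so sorting the $N$ feature values is avoided.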
Furthermore, a splitting criterion is needed to define the best split $S^{*}: f_{k^{*}}(t_1^{*}, t_2^{*}) \leq \tau^{*}$. We employ a combination of entropy gain and a distance measure as the splitting criterion. Entropy gain is commonly used as the splitting criterion in tree models. Denote the proportions of instances corresponding to classes $\{1, 2, \ldots, C\}$ at a tree node as $\{\gamma_1, \gamma_2, \ldots, \gamma_C\}$, respectively.
The entropy at the node is defined as
$$\mathrm{Entropy} = -\sum_{c=1}^{C} \gamma_c \log \gamma_c \qquad (5)$$
The entropy gain $\Delta\mathrm{Entropy}$ for a split is then the difference between the weighted sum of entropy at the child nodes and the entropy at the parent node, where the weight at a child node is the proportion of instances assigned to that child node.
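A sketch of the computation, reading the gain with the usual sign convention (parent entropy minus the weighted child entropies, so that larger is better); the helper names are ours:

```python
import numpy as np

def entropy(labels):
    """Eq. (5): -sum_c gamma_c * log(gamma_c) over the class proportions."""
    if len(labels) == 0:
        return 0.0
    _, counts = np.unique(labels, return_counts=True)
    gamma = counts / counts.sum()
    return -(gamma * np.log(gamma)).sum()

def entropy_gain(labels, left_mask):
    """Delta Entropy of the split induced by a boolean left-child mask."""
    n = len(labels)
    left, right = labels[left_mask], labels[~left_mask]
    weighted = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(labels) - weighted
```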
$\Delta\mathrm{Entropy}$ evaluates the usefulness of separating the classes. However, in time series classification, the number of candidate splits can be large, and there are often cases where multiple candidate splits have the same $\Delta\mathrm{Entropy}$. Therefore we consider an additional measure called Margin, which calculates the distance between a candidate threshold and its nearest feature value. The Margin of split $f_k(t_1, t_2) \leq \tau$ is calculated as