Distributed and Parallel Time Series Feature Extraction for Industrial Big Data Applications


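In the paper "Distributed and Parallel Time Series Feature Extraction for Industrial Big Data Applications" (arXiv:1610.07717v1), released on October 25, 2016, Maximilian Christ, Andreas W. Kempa-Liehr, and Michael Feindt address a central challenge in industrial big data applications: how to extract time series features efficiently in a distributed and parallel fashion. The work was prepared for the Workshop on Learning on Big Data (WLBD) at ACML (Asian Conference on Machine Learning), held on November 16, 2016 in Hamilton, New Zealand.

Industrial tasks such as predictive maintenance and production line optimization often involve large volumes of time series data together with meta-information, which makes feature selection particularly difficult. Conventional feature selection methods may not cope with data of this dimensionality and degree of correlation, so the paper proposes an efficient and scalable feature extraction algorithm. The algorithm aims to identify the time series features that are strongly or weakly relevant to each label or regression target, while meeting the needs of distributed and parallel processing.

The design targets large data sets: the data are partitioned and processed in parallel on a distributed computing architecture, which substantially speeds up feature extraction. This helps to reduce computational cost and memory footprint, and lets the approach scale with growing industrial data volumes. The authors stress the importance of this parallelization strategy in practice, especially in settings with strict requirements on real-time behavior and accuracy. At the time, Michael Feindt was on leave from Karlsruhe Institute of Technology.

The paper presents an approach to handling complex time series data in the Industry 4.0 setting and offers practical tools and techniques for coping with growing data volumes. Its distributed and parallel feature extraction method is a contribution to industrial big data analytics that can support the performance and decision-making of industrial intelligent systems.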

3. Feature filtering

Typically, time series are noisy and contain redundancies. Therefore, one should keep the balance between extracting meaningful but probably fragile features and robust but probably non-significant features. Some features such as the median will not be heavily influenced by outliers, others such as max will be intrinsically fragile. The choice of the right time series feature mappings is crucial to capture the right characteristics for the task at hand.
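The contrast between a robust and a fragile feature can be illustrated with a small sketch. The series values and the helper name `summarize` are made up for illustration and do not come from the paper; the point is only that a single outlier barely moves the median while it completely determines the maximum.

```python
from statistics import median


def summarize(series):
    """Return two candidate features of a time series: median and maximum."""
    return {"median": median(series), "max": max(series)}


# Hypothetical sensor readings (illustrative values only).
clean = [1.0, 1.2, 0.9, 1.1, 1.0, 1.3, 0.8]
# Same series with the last sample replaced by a single outlier.
noisy = clean[:-1] + [50.0]

clean_features = summarize(clean)
noisy_features = summarize(noisy)

# How far each feature moved because of the one outlier.
median_shift = abs(noisy_features["median"] - clean_features["median"])
max_shift = abs(noisy_features["max"] - clean_features["max"])
print(f"median shift: {median_shift:.2f}, max shift: {max_shift:.2f}")
```

Whether the fragile feature is still worth extracting depends on the task: if the outlier itself carries the signal (for example a rare load peak), the maximum may be the more meaningful feature of the two.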

3.1. Relevance of features

A meaningless feature describes a characteristic of the time series that is not useful for the classification or regression task at hand. Radivojac et al. [2004] considered a binary target $Y$, stating that the relevance of feature $X$ is measured as the difference between the class conditional distributions $f_{X|Y=0}$ and $f_{X|Y=1}$. We adopt this definition and consider a feature $X$ being relevant for the classification of the binary target $Y$ if those distributions are not equal. In general, a feature $X$ is relevant for predicting target $Y$ if and only if

$$\exists\, y_1, y_2 \ \text{with}\ f_Y(y_1) > 0,\ f_Y(y_2) > 0 : \quad f_{X|Y=y_1} \neq f_{X|Y=y_2}. \tag{1}$$

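The condition from Equation (1) is equivalent to

$$
\begin{aligned}
& X \ \text{is not relevant for target}\ Y \\
\Leftrightarrow\ & \forall\, y_1, y_2 \ \text{with}\ f_Y(y_1) > 0,\ f_Y(y_2) > 0 : \quad f_{X|Y=y_1} = f_{X|Y=y_2} \\
\Leftrightarrow\ & \forall\, y_1 \ \text{with}\ f_Y(y_1) > 0 : \quad f_{X|Y=y_1} = f_X \\
\Leftrightarrow\ & f_{X,Y} = f_{X|Y}\, f_Y = f_X\, f_Y \\
\Leftrightarrow\ & X, Y \ \text{are statistically independent}
\end{aligned}
\tag{2}
$$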

We will use the statistical independence to derive a shorter definition of a relevant feature:

Definition 1 (A relevant feature) A feature $X_\phi$ is relevant or meaningful for the prediction of $Y$ if and only if $X_\phi$ and $Y$ are not statistically independent.
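For discrete variables with a known joint distribution, the independence criterion can be checked directly by comparing the joint probabilities with the product of the marginals. The following is a minimal sketch under that assumption; the function name `is_independent` and the two example distributions are hypothetical and chosen only to show one irrelevant and one relevant feature.

```python
from collections import defaultdict
from itertools import product


def is_independent(joint, tol=1e-12):
    """Check whether f_{X,Y}(x, y) == f_X(x) * f_Y(y) for all pairs (x, y).

    `joint` maps (x, y) pairs to probabilities; missing pairs count as 0.
    """
    f_x, f_y = defaultdict(float), defaultdict(float)
    for (x, y), p in joint.items():
        f_x[x] += p  # marginal distribution of the feature
        f_y[y] += p  # marginal distribution of the target
    return all(
        abs(joint.get((x, y), 0.0) - f_x[x] * f_y[y]) <= tol
        for x, y in product(list(f_x), list(f_y))
    )


# Feature carries no information about the target: joint = product of marginals.
irrelevant = {(0, 0): 0.3, (0, 1): 0.3, (1, 0): 0.2, (1, 1): 0.2}
# Feature and target tend to agree: the class conditional distributions differ.
relevant = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}

print(is_independent(irrelevant), is_independent(relevant))
```

In practice the joint distribution is unknown and only samples are available, which is why relevance has to be decided by hypothesis tests on data instead of an exact comparison like this one.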

3.2. Hypothesis tests

For every extracted feature $X_1, \ldots, X_{n_\phi}$ we will deploy a single statistical test checking the hypotheses

$$H_0^\phi = \{X_\phi \ \text{is irrelevant for predicting}\ Y\}, \qquad H_1^\phi = \{X_\phi \ \text{is relevant for predicting}\ Y\}. \tag{3}$$

The result of each hypothesis test $H_0^\phi$ is a so-called p-value $p_\phi$: the probability, computed under the assumption that feature $X_\phi$ is not relevant for predicting $Y$, of observing a test statistic at least as extreme as the one obtained. Small p-values indicate features which are relevant for predicting the target.
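How one p-value per feature can be obtained is sketched below with a simple permutation test for a binary target. This is an illustrative stand-in and not one of the tests proposed in the paper; the function name `permutation_p_value`, the choice of test statistic (absolute difference of class means), and the toy data are all assumptions.

```python
import random


def permutation_p_value(feature, target, n_permutations=2000, seed=0):
    """Estimate a p-value for H0: `feature` is independent of the binary `target`.

    Test statistic: absolute difference between the class means of the feature.
    Both classes must be present in `target`.
    """
    rng = random.Random(seed)

    def statistic(labels):
        ones = [x for x, y in zip(feature, labels) if y == 1]
        zeros = [x for x, y in zip(feature, labels) if y == 0]
        return abs(sum(ones) / len(ones) - sum(zeros) / len(zeros))

    observed = statistic(target)
    labels = list(target)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(labels)  # break any link between feature and target
        if statistic(labels) >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly positive.
    return (hits + 1) / (n_permutations + 1)


# Toy data: 10 samples of class 0 followed by 10 samples of class 1.
target = [0] * 10 + [1] * 10
features = {
    # Class 1 values are clearly larger than class 0 values.
    "shifted": [0.1 * i for i in range(10)] + [5 + 0.1 * i for i in range(10)],
    # Same alternating pattern in both classes, so the class means coincide.
    "same_pattern": [1.0, 2.0] * 10,
}

p_values = {name: permutation_p_value(x, target) for name, x in features.items()}
print(p_values)
```

The resulting dictionary plays the role of the vector of p-values: a small value for `shifted` and a large one for `same_pattern`. Deciding which p-values count as small enough across many features is a separate multiple testing question.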

Based on the vector $(p_1, \ldots, p_{n_\phi})$ of all hypothesis tests, a multiple testing approach will select the relevant features (Sec. 3.3). We propose to treat every feature individually with a statistical test that is chosen depending on whether the codomains of target and feature are binary or not. The usage of one general feature test for all constellations is not recommended. Specialized hypothesis tests yield a higher statistical power due to more assumptions about the codomains that can be used during the construction of those tests. The proposed feature significance tests are based on nonparametric hypothesis tests, that do not make
