UELM-MapReduce：高效处理不确定数据流的并行分类器

136 浏览量更新于2024-08-26 收藏 1.31MB PDF 举报

"该资源主要探讨了如何在高速不确定数据流中进行高效挖掘，提出了一种基于极端学习机（ELM）和MapReduce的并行集成分类器UELM-MapReduce。该方法旨在解决现实生活中由于仪器不精确、无线传输错误等因素导致的自然不确定性数据。" 在当前的数据挖掘领域，大多数算法都是针对精确且完整数据设计的。然而，许多实际应用中的数据存在不确定性，这可能是由仪器的不准确、无线传输错误等原因引起的。论文作者Donghong Han、Christophe Giraud-Carrier和Shuoru Li提出了一种新的解决方案，称为UELM-MapReduce，它是一种利用极端学习机（Extreme Learning Machine）和MapReduce框架处理不确定数据流的并行集成分类器。极端学习机（ELM）是一种快速的单隐藏层前馈神经网络训练方法，它的优势在于无需反向传播，可以快速得到权重。UELM-MapReduce通过将数据流分割成连续的训练块，训练出一个高效的并行ELM基础分类器集合。每个基础分类器的权重根据其在最新测试块上的均方误差进行调整，而准确率最低的分类器会被替换，以此来应对概念漂移（Concept drifts）问题。概念漂移是指数据分布随着时间推移发生变化的现象，这是数据流挖掘中的一个重要挑战。UELM-MapReduce通过动态调整分类器权重和替换低效分类器，能够有效地适应这种变化，保持模型的准确性。实验结果证明，UELM-MapReduce在处理不确定数据流时，既具有较高的效率，又能够保持良好的分类精度。同时，通过MapReduce并行化处理，使得该方法能够在大规模数据集上运行，提升了处理速度，适应了大数据时代的需求。这篇论文为不确定数据流的挖掘提供了一个新的视角，通过结合ELM的快速学习能力和MapReduce的并行计算能力，解决了数据不确定性带来的挑战，为实时和大规模的数据流分析提供了有力工具。

Efficient mining of high-speed uncertain data streams 775

for classification and prediction. A novel neural network

method was proposed in [15]. Finally, a novel algorithm

known as uHARMONY, was presented in [16]tohelptrain

either SVM or rule-based classifiers. The algorithm can

effectively mine discriminative patterns from uncertain data

as classification features/rules.

2.2 Streaming data classification algorithms

A variety of algorithms exist for classifying streaming

data with infinite data flow and concept drift. These can

generally be divided into two categories: single classifier

approaches and ensemble approaches.

Single classifier approaches extend traditional tech-

niques to handle streaming data. The Concept-adapting

Very Fast Decision Tree (CVFDT) algorithm presented

in [17] is a direct extension of the Very Fast Decision

Tree (VFDT) algorithm [24]. It is aimed at dealing with

high-speed evolving data streams and time-varying con-

cepts. Further developments to improve performance or

to integrate ambiguity by exploring multiple options at

each node have been proposed leading to the ambigu-

ous CVFDT (ACVFDT) algorithm [18]. Similarly, some

researchers have adapted decision-rule learners to stream-

ing data. In [19], a Very Fast Decision Rule (VFDR)

approach was introduced for streaming data classification

that can incrementally learn from incoming examples and

expand rules over time. In order to handle concept drift, the

authors later introduced Adaptive Very Fast Decision Rules

(AVFDR) [20].

Another category of algorithms for classifying data

streams is ensemble-based models. In [21], a streaming

ensemble algorithm (SEA) was presented that can build sep-

arate classifiers on sequential chunks of training points. In

[23], a general framework for mining concept-drifting data

streams based on a weighted ensemble classifier was pre-

sented. The weight of each base classifier in the ensemble

is adjusted as a function of its expected classification accu-

racy on the test data. In [22], the authors proposed a new

experimental data stream framework for studying concept

drift, which adopted two novel variants of bagging. In gen-

eral, ensembles tend to have better accuracy and they are

easier to parallelize than single classifier approaches [23].

Here, we train an ensemble of base-level classifiers on the

MapReduce framework.

2.3 Classification algorithms for uncertain data streams

The study of uncertain data stream mining is still in its

infancy, and only a few classification algorithms have been

proposed. A variant of CVFDT, known as uncertainty-

handling CVFDT, or UCVFDT for short, was introduced

in [25]. In order to choose the best splitting attribute

from uncertain attributes in the process of tree grow-

ing, UCVFDT uses a probabilistic Hoeffding bound and

improves the method for computing information gain under

uncertainty. In [26], another variant of CVFDT, known as

positive and unlabeled uncertain CVFDT, or puuCVFDT

for short, was presented with a specific focus on uncer-

tain streaming data classification with only positive and

unlabeled training data.

In addition, some studies have adapted ensemble-based

classifiers. In order to cope with the problem of one-class-

based uncertain data stream classification, an SVM-based

ensemble classifier was proposed in [27]. It adopted the

OC-SVM method for incorporating uncertainty to learn

base-level classifiers and extended the approach of adjust-

ing the weights of each base classifier [23] to handle

concept drift. In [28], an ensemble classification model

named ECluds was introduced, where the supervised k-

means clustering algorithm is used to train the base-level

classifiers. In [29], the NS-PDT algorithm [13] is extended

to learn base-level classifiers that can cope with uncer-

tainty. Two types of ensemble-based approaches are then

proposed: static classifier ensemble (SCE) and dynamic

classifier ensemble (DCE). Finally, the Weighted Ensem-

ble Classifier based on the ELM algorithm (WEC-ELM)

was proposed in [30]. It adopts the ELM technology to

train the base-level classifiers and adjusts the weights of

each base-level classifier dynamically to track the evolving

concept.

In all of the aforementioned studies, the work focused on

uncertain data streams classification where the uncertainty

was either in the feature vectors or in the class values, but

not both.

e must also note in passing that our approach is inspired

by the ELM-MapReduce algorithm [31]. However, ELM-

MapReduce is used to classify big spatial data that is precise

and can be scanned multiple times. Hence, it cannot be

applied directly to the problem of uncertain data streams

classification.

3 Preliminaries

In this section, we present our model of uncertainty over

both feature vectors and class labels, and show how the

model affects ELM learning and the analysis of error for

base classifiers.

3.1 Problem definition

In general, there are two ways to handle uncertainty:

probability-based models specified using probability den-

sity functions (PDFs) [5, 6], and statistic-based models

that work with errors or standard deviations [4]. We adopt

剩余12页未读，继续阅读

weixin_38623000

粉丝: 5
资源: 925

UELM-MapReduce：高效处理不确定数据流的并行分类器

数据流频繁项挖掘算法（Manku）

数据流--英文

流计算中的数据挖掘是什么

天气数据可视化的数据流图怎么画

天气数据可视化的数据流图

理解和定义数据流和数据存储用什么工具

鲁棒优化的不确定集有哪些创新点可以挖掘

学习平台数据需求分析，并提供数据流图

鲁棒优化不确定集的构建有哪些创新点可以挖掘

python爬虫数据挖掘

最新资源