Algorithm 1 Extreme Learning Machine
Input: Training set X = {(x_i, y_i) | x_i ∈ R^n, y_i ∈ R^L, i = 1, ..., N}; activation function g(x); number of hidden nodes Ñ.
Output: Input weights w_j, input biases b_j, and output weight β.
1: Randomly assign input weights w_j and biases b_j, where j = 1, ..., Ñ;
2: Calculate the hidden layer output matrix H;
3: Calculate the output weight β = H†Y, where H† is the Moore-Penrose generalized inverse of matrix H and Y = [y_1, ..., y_N]^T.
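For concreteness, a minimal sketch of Algorithm 1 in Python/NumPy is given below; the sigmoid choice for g(x) and the function names elm_train and elm_predict are illustrative assumptions rather than part of the original formulation.

import numpy as np

def elm_train(X, Y, n_hidden, rng=np.random.default_rng(0)):
    """Train a basic ELM: X has shape (N, n), Y has shape (N, L)."""
    n_features = X.shape[1]
    # Step 1: randomly assign input weights w_j and biases b_j
    W = rng.standard_normal((n_features, n_hidden))
    b = rng.standard_normal(n_hidden)
    # Step 2: hidden layer output matrix H (sigmoid chosen as g)
    H = 1.0 / (1.0 + np.exp(-(X @ W + b)))
    # Step 3: output weight beta = H†Y via the Moore-Penrose inverse
    beta = np.linalg.pinv(H) @ Y
    return W, b, beta

def elm_predict(X, W, b, beta):
    H = 1.0 / (1.0 + np.exp(-(X @ W + b)))
    return H @ beta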
2.2. Challenges in learning from big data for ELM
The ELM algorithm has exhibited satisfactory performance in various application scenarios. For instance, Mohammed et al. [22] developed a new human face recognition algorithm based on bidirectional two-dimensional principal component analysis and ELM, which achieves a hundredfold reduction in training time and minimal dependence on the number of prototypes. In addition, Chacko [6] combined wavelet energy features with ELM to address the handwritten character recognition problem and obtained high recognition accuracy. Moreover, Suresh [32] presented two schemes, a k-fold selection scheme and a real-coded genetic algorithm, to select the input weights and biases for ELM; both are effective in no-reference image quality assessment.
However, training a single ELM on massive, high-dimensional data is still a challenging problem. It is well known that the main time complexity of training an ELM lies in calculating the pseudo-inverse of the hidden layer output matrix, which places a high demand on both time and space when the matrix is large. Several directions for handling this problem are listed below.
1. Sequential learning: the big data set is divided into small subsets, and the training instances are then presented to the learning algorithm sequentially.
2. Divide-and-conquer strategy: the data matrix is divided into a number of small sub-matrices, a learner is trained for each sub-matrix, and the results are integrated based on linear algebra (see the sketch after this list).
3. Sample and feature selection: both feature selection and sample selection are performed on the big data set to refine the samples and remove data redundancy, and a learner is then trained on the refined data.
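As an example of the divide-and-conquer strategy (item 2), the following hedged sketch accumulates the small matrices H^T H and H^T Y chunk by chunk and then solves for β; the chunk interface, the sigmoid activation, and the small ridge term reg are our own assumptions, not the method proposed in this paper.

import numpy as np

def elm_train_chunked(chunks, W, b, reg=1e-6):
    """chunks yields (X_c, Y_c) pairs; W and b are the fixed random parameters."""
    n_hidden = W.shape[1]
    HtH = np.zeros((n_hidden, n_hidden))
    HtY = None
    for X_c, Y_c in chunks:
        H_c = 1.0 / (1.0 + np.exp(-(X_c @ W + b)))   # per-chunk hidden output
        HtH += H_c.T @ H_c                           # accumulate H^T H
        HtY = H_c.T @ Y_c if HtY is None else HtY + H_c.T @ Y_c
    # Solve (H^T H + reg*I) beta = H^T Y, which coincides with beta = H†Y
    # when H has full column rank and reg tends to zero.
    return np.linalg.solve(HtH + reg * np.eye(n_hidden), HtY)

Only the Ñ x Ñ and Ñ x L accumulators are kept in memory, so the full hidden layer output matrix never has to be formed at once.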
In this paper, another direction is pursued, namely discretization of conditional attributes and fuzzification of decision labels. Discrete and continuous are two classical ordinal data types with orders among the values. Generally, the number of discrete values of an attribute is finite and often small, while the number of continuous values can be infinite. This property makes discrete values easier to use and comprehend in data analysis. For example, when a decision tree is induced, continuous attributes can make the tree reach a pure state (all instances in a leaf node belonging to a single class) too quickly, resulting in poor performance [14]. There are many other advantages of using discrete values. For instance, it is mentioned in [27, 30] that discrete attributes provide a representation closer to the knowledge level than continuous ones.
When an attribute in a data set is continuous, it is hard to find samples with the same values. As a result, similar samples are treated as entirely different from each other, which leads to data redundancy. In contrast, with discrete attributes the data set is often compact and short, so learning is more effective and efficient. Thus, in big data analysis, the data volume can be reduced and simplified through discretization of conditional attributes. Basically, all the conditional attributes are discretized into a finite number of intervals, and the samples with the same discrete values are merged into one record. As a result, the data set is compressed and redundancy is removed to a certain extent. However, discretization may sometimes be intractable due to the heavy matrix and integral operations involved; for example, the discretization process becomes time consuming when there are too many training attributes. Moreover, discretization error always exists, so a tradeoff between accuracy and compression rate should be considered in real applications.
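The compression step can be illustrated with the following sketch; the equal-width binning rule and the function name discretize_and_merge are illustrative assumptions, and any discretizer producing a finite number of intervals would fit the description above.

import numpy as np

def discretize_and_merge(X, n_bins=5):
    """X holds (N, n) continuous conditional attributes."""
    codes = np.empty_like(X, dtype=int)
    for j in range(X.shape[1]):
        # Equal-width interval edges per attribute; digitize maps values to bin indices.
        edges = np.linspace(X[:, j].min(), X[:, j].max(), n_bins + 1)
        codes[:, j] = np.digitize(X[:, j], edges[1:-1])
    # Merge samples with identical discrete codes into one record.
    merged, inverse = np.unique(codes, axis=0, return_inverse=True)
    return merged, inverse.reshape(-1)   # inverse[i] is the group index of sample i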
In addition, in order to handle the veracity property of big data, we not only perform discretization for data compression, but also fuzzify the class labels by converting them into a set of memberships. In decision theory, a membership can be considered as a kind of capacity, which weakens the countable additivity axiom of probability. In other words, it reflects the likelihood of an event or condition. In theory, fuzzy classes contain more information on the relationships between observations and labels, which can help us make decisions in real applications. The fuzzification of class labels can be realized by computing the mean of the decision labels within the same conditional group. As a result, the problem is transformed into symbolic learning with fuzzy class labels.
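The following sketch illustrates one way to realize this fuzzification, assuming integer class labels and the group indices inverse produced by the discretization sketch above; interpreting the mean of the decision labels as the mean of one-hot encoded labels (i.e., class frequencies per group) is our assumption.

import numpy as np

def fuzzify_labels(y, inverse, n_groups):
    """y holds (N,) integer class labels; returns (n_groups, C) memberships."""
    n_classes = int(y.max()) + 1
    counts = np.zeros((n_groups, n_classes))
    np.add.at(counts, (inverse, y), 1.0)                 # label counts per group
    return counts / counts.sum(axis=1, keepdims=True)    # row-normalised memberships

Given the outputs of the previous sketch, the memberships of the merged records would be obtained as fuzzify_labels(y, inverse, merged.shape[0]).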
On the other hand, interval data is widely used in sym-