Algorithm 1 Extreme Learning Machine
Input: Training set X = {(x_i, y_i) | x_i ∈ R^n, y_i ∈ R^L, i = 1, ..., N}; activation function g(x); number of hidden nodes Ñ.
Output: Input weights w_j, input biases b_j, and output weight β.
1: Randomly assign input weights w_j and biases b_j, where j = 1, ..., Ñ;
2: Calculate the hidden layer output matrix H;
3: Calculate the output weight β = H†Y, where H† is the Moore-Penrose generalized inverse of matrix H and Y = [y_1, ..., y_N]^T.
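For concreteness, a minimal sketch of Algorithm 1 in Python/NumPy is given below; the sigmoid choice for g(x) and the function names elm_train and elm_predict are illustrative assumptions rather than part of the original formulation.

import numpy as np

def elm_train(X, Y, n_hidden, rng=np.random.default_rng(0)):
    """Train a basic ELM: X has shape (N, n), Y has shape (N, L)."""
    n_features = X.shape[1]
    # Step 1: randomly assign input weights w_j and biases b_j
    W = rng.standard_normal((n_features, n_hidden))
    b = rng.standard_normal(n_hidden)
    # Step 2: hidden layer output matrix H (sigmoid chosen as g)
    H = 1.0 / (1.0 + np.exp(-(X @ W + b)))
    # Step 3: output weight beta = H†Y via the Moore-Penrose inverse
    beta = np.linalg.pinv(H) @ Y
    return W, b, beta

def elm_predict(X, W, b, beta):
    H = 1.0 / (1.0 + np.exp(-(X @ W + b)))
    return H @ beta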
2.2. Challenges in learning from big data for ELM
The ELM algorithm has exhibited satisfactory performance in various application scenarios. For instance, Mohammed et al. [22] developed a new human face recognition algorithm based on bidirectional two-dimensional principal component analysis and ELM, which achieves a hundredfold reduction in training time and minimal dependence on the number of prototypes. In addition, Chacko [6] combined wavelet energy features with ELM to address the handwritten character recognition problem and obtained high recognition accuracy. Moreover, Suresh [32] presented two schemes, a k-fold selection scheme and a real-coded genetic algorithm, to select the input weights and biases for ELM; both are effective in no-reference image quality assessment.
However, training a single ELM on massive, high-dimensional data is still a challenging problem. It is well known that the main time complexity of training an ELM lies in calculating the pseudo-inverse of the hidden layer output matrix, which places a high demand on both time and space when the matrix is large. Several directions for handling this problem are listed below.
1. Sequential learning: the big data set is divided into small subsets, and the training instances are then presented to the learning algorithm sequentially.
2. Divide-and-conquer strategy: the data matrix is divided into a number of small sub-matrices, a learner is trained for each sub-matrix, and the results are integrated based on linear algebra (see the sketch after this list).
3. Sample and feature selection: both feature selection and sample selection are performed on the big data set to refine the samples and remove data redundancy, and a learner is then trained on the refined data.
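As an example of the divide-and-conquer strategy (item 2), the following hedged sketch accumulates the small matrices H^T H and H^T Y chunk by chunk and then solves for β; the chunk interface, the sigmoid activation, and the small ridge term reg are our own assumptions, not the method proposed in this paper.

import numpy as np

def elm_train_chunked(chunks, W, b, reg=1e-6):
    """chunks yields (X_c, Y_c) pairs; W and b are the fixed random parameters."""
    n_hidden = W.shape[1]
    HtH = np.zeros((n_hidden, n_hidden))
    HtY = None
    for X_c, Y_c in chunks:
        H_c = 1.0 / (1.0 + np.exp(-(X_c @ W + b)))   # per-chunk hidden output
        HtH += H_c.T @ H_c                           # accumulate H^T H
        HtY = H_c.T @ Y_c if HtY is None else HtY + H_c.T @ Y_c
    # Solve (H^T H + reg*I) beta = H^T Y, which coincides with beta = H†Y
    # when H has full column rank and reg tends to zero.
    return np.linalg.solve(HtH + reg * np.eye(n_hidden), HtY)

Only the Ñ x Ñ and Ñ x L accumulators are kept in memory, so the full hidden layer output matrix never has to be formed at once.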
In this paper, another direction is pursued, namely discretization of conditional attributes and fuzzification of decision labels. Discrete and continuous are two classical ordinal data types with orders among the values. Generally, the number of discrete values of an attribute is finite and often small, while the number of continuous values can be infinite. This property makes discrete values easier to use and comprehend in data analysis. For example, when a decision tree is induced, continuous attributes can make the tree reach a pure state (all instances in a leaf node belonging to a single class) too quickly, resulting in poor performance [14]. There are many other advantages of using discrete values. For instance, it is mentioned in [27, 30] that discrete attributes provide a representation closer to the knowledge level than continuous ones.
When an attribute in a data set is continuous, it is hard to find samples with the same values. As a result, similar samples are treated as entirely different from each other, which leads to data redundancy. In contrast, with discrete attributes the data set is often compact and short, so learning is more effective and efficient. Thus, in big data analysis, the data volume can be reduced and simplified through discretization of conditional attributes. Basically, all the conditional attributes are discretized into a finite number of intervals, and the samples with the same discrete values are merged into one record. As a result, the data set is compressed and redundancy is removed to a certain extent. However, discretization may sometimes be intractable due to the heavy matrix and integral operations involved; for example, the discretization process becomes time consuming when there are too many training attributes. Moreover, discretization error always exists, so a tradeoff between accuracy and compression rate should be considered in real applications.
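The compression step can be illustrated with the following sketch; the equal-width binning rule and the function name discretize_and_merge are illustrative assumptions, and any discretizer producing a finite number of intervals would fit the description above.

import numpy as np

def discretize_and_merge(X, n_bins=5):
    """X holds (N, n) continuous conditional attributes."""
    codes = np.empty_like(X, dtype=int)
    for j in range(X.shape[1]):
        # Equal-width interval edges per attribute; digitize maps values to bin indices.
        edges = np.linspace(X[:, j].min(), X[:, j].max(), n_bins + 1)
        codes[:, j] = np.digitize(X[:, j], edges[1:-1])
    # Merge samples with identical discrete codes into one record.
    merged, inverse = np.unique(codes, axis=0, return_inverse=True)
    return merged, inverse.reshape(-1)   # inverse[i] is the group index of sample i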
In addition, in order to handle the veracity property of big data, we not only perform discretization for data compression, but also fuzzify the class labels by converting them into a set of memberships. In decision theory, a membership can be considered as a kind of capacity, which weakens the countable additivity axiom of probability. In other words, it reflects the likelihood of an event or condition. In theory, fuzzy classes contain more information on the relationships between observations and labels, which can help us make decisions in real applications. The fuzzification of class labels can be realized by computing the mean of the decision labels within the same conditional group. As a result, the problem is transformed into symbolic learning with fuzzy class labels.
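The following sketch illustrates one way to realize this fuzzification, assuming integer class labels and the group indices inverse produced by the discretization sketch above; interpreting the mean of the decision labels as the mean of one-hot encoded labels (i.e., class frequencies per group) is our assumption.

import numpy as np

def fuzzify_labels(y, inverse, n_groups):
    """y holds (N,) integer class labels; returns (n_groups, C) memberships."""
    n_classes = int(y.max()) + 1
    counts = np.zeros((n_groups, n_classes))
    np.add.at(counts, (inverse, y), 1.0)                 # label counts per group
    return counts / counts.sum(axis=1, keepdims=True)    # row-normalised memberships

Given the outputs of the previous sketch, the memberships of the merged records would be obtained as fuzzify_labels(y, inverse, merged.shape[0]).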
On the other hand, interval data is widely used in sym-