Efficient mining of high-speed uncertain data streams 775
for classification and prediction. A novel neural network
method was proposed in [15]. Finally, a novel algorithm
known as uHARMONY, was presented in [16]tohelptrain
either SVM or rule-based classifiers. The algorithm can
effectively mine discriminative patterns from uncertain data
as classification features/rules.
2.2 Streaming data classification algorithms
A variety of algorithms exist for classifying streaming
data with infinite data flow and concept drift. These can
generally be divided into two categories: single classifier
approaches and ensemble approaches.
Single classifier approaches extend traditional tech-
niques to handle streaming data. The Concept-adapting
Very Fast Decision Tree (CVFDT) algorithm presented
in [17] is a direct extension of the Very Fast Decision
Tree (VFDT) algorithm [24]. It is aimed at dealing with
high-speed evolving data streams and time-varying con-
cepts. Further developments to improve performance or
to integrate ambiguity by exploring multiple options at
each node have been proposed leading to the ambigu-
ous CVFDT (ACVFDT) algorithm [18]. Similarly, some
researchers have adapted decision-rule learners to stream-
ing data. In [19], a Very Fast Decision Rule (VFDR)
approach was introduced for streaming data classification
that can incrementally learn from incoming examples and
expand rules over time. In order to handle concept drift, the
authors later introduced Adaptive Very Fast Decision Rules
(AVFDR) [20].
Another category of algorithms for classifying data
streams is ensemble-based models. In [21], a streaming
ensemble algorithm (SEA) was presented that can build sep-
arate classifiers on sequential chunks of training points. In
[23], a general framework for mining concept-drifting data
streams based on a weighted ensemble classifier was pre-
sented. The weight of each base classifier in the ensemble
is adjusted as a function of its expected classification accu-
racy on the test data. In [22], the authors proposed a new
experimental data stream framework for studying concept
drift, which adopted two novel variants of bagging. In gen-
eral, ensembles tend to have better accuracy and they are
easier to parallelize than single classifier approaches [23].
Here, we train an ensemble of base-level classifiers on the
MapReduce framework.
2.3 Classification algorithms for uncertain data streams
The study of uncertain data stream mining is still in its
infancy, and only a few classification algorithms have been
proposed. A variant of CVFDT, known as uncertainty-
handling CVFDT, or UCVFDT for short, was introduced
in [25]. In order to choose the best splitting attribute
from uncertain attributes in the process of tree grow-
ing, UCVFDT uses a probabilistic Hoeffding bound and
improves the method for computing information gain under
uncertainty. In [26], another variant of CVFDT, known as
positive and unlabeled uncertain CVFDT, or puuCVFDT
for short, was presented with a specific focus on uncer-
tain streaming data classification with only positive and
unlabeled training data.
In addition, some studies have adapted ensemble-based
classifiers. In order to cope with the problem of one-class-
based uncertain data stream classification, an SVM-based
ensemble classifier was proposed in [27]. It adopted the
OC-SVM method for incorporating uncertainty to learn
base-level classifiers and extended the approach of adjust-
ing the weights of each base classifier [23] to handle
concept drift. In [28], an ensemble classification model
named ECluds was introduced, where the supervised k-
means clustering algorithm is used to train the base-level
classifiers. In [29], the NS-PDT algorithm [13] is extended
to learn base-level classifiers that can cope with uncer-
tainty. Two types of ensemble-based approaches are then
proposed: static classifier ensemble (SCE) and dynamic
classifier ensemble (DCE). Finally, the Weighted Ensem-
ble Classifier based on the ELM algorithm (WEC-ELM)
was proposed in [30]. It adopts the ELM technology to
train the base-level classifiers and adjusts the weights of
each base-level classifier dynamically to track the evolving
concept.
In all of the aforementioned studies, the work focused on
uncertain data streams classification where the uncertainty
was either in the feature vectors or in the class values, but
not both.
W
e must also note in passing that our approach is inspired
by the ELM-MapReduce algorithm [31]. However, ELM-
MapReduce is used to classify big spatial data that is precise
and can be scanned multiple times. Hence, it cannot be
applied directly to the problem of uncertain data streams
classification.
3 Preliminaries
In this section, we present our model of uncertainty over
both feature vectors and class labels, and show how the
model affects ELM learning and the analysis of error for
base classifiers.
3.1 Problem definition
In general, there are two ways to handle uncertainty:
probability-based models specified using probability den-
sity functions (PDFs) [5, 6], and statistic-based models
that work with errors or standard deviations [4]. We adopt