learning algorithms []. Meanwhile, ELM also produces good generalization performance. It has been verified that ELM can achieve generalization performance comparable to that of the typical Support Vector Machine algorithm [].
2.2. Stochastic Gradient Boosting. The stochastic gradient boosting scheme was proposed by Friedman in [], and it is a variant of the gradient boosting method presented in []. Given a training set $\{(\mathbf{x}_i, y_i)\}_{i=1}^{N}$, the goal is to learn a hypothesis $F(\mathbf{x})$ that maps $\mathbf{x}$ to $y$ and minimizes the training loss as follows:
$$ F\left(\mathbf{x}\right) = \arg\min_{F_{K}(\mathbf{x})} \sum_{i=1}^{N} L\left(y_i, F_{K}\left(\mathbf{x}_i\right)\right), $$
where $L(\cdot,\cdot)$ is the loss function which evaluates the difference between the predicted value and the target, and $K$ denotes the number of iterations. In the boosting mechanism, $K$ additive individual learners are trained sequentially by
$$ f_k\left(\mathbf{x}\right) = \arg\min_{f_k(\mathbf{x})} \sum_{i=1}^{N} L\left(y_i, F_{k-1}\left(\mathbf{x}_i\right) + f_k\left(\mathbf{x}_i\right)\right) $$
and
$$ F_k\left(\mathbf{x}\right) = F_{k-1}\left(\mathbf{x}\right) + f_k\left(\mathbf{x}\right), $$
where $k = 1, 2, \cdots, K$. It is shown that this optimization problem depends heavily on the loss function and becomes unsolvable when $L(\cdot,\cdot)$ is complex. To circumvent this, gradient boosting constructs the weak individual learners based on the pseudo-residuals, which are the negative gradients of the loss function with respect to the model values predicted at the current learning step. Specifically, let $\epsilon_i^{(k)}$ denote the pseudo-residual of the $i$th sample at the $k$th iteration, written as
$$ \epsilon_i^{(k)} = -\left[\frac{\partial L\left(y_i, \hat{y}_i\right)}{\partial \hat{y}_i}\right]_{\hat{y}_i = F_{k-1}(\mathbf{x}_i)}, $$
and thus the $k$th weak learner $f_k(\mathbf{x})$ is trained by
$$ f_k\left(\mathbf{x}\right) = \arg\min_{f_k(\mathbf{x})} \sum_{i=1}^{N} L\left(\epsilon_i^{(k)}, f_k\left(\mathbf{x}_i\right)\right). $$
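To make the pseudo-residual fitting concrete, the sketch below assumes the squared loss $L(y, \hat{y}) = \frac{1}{2}(y - \hat{y})^2$, for which the pseudo-residual reduces to the ordinary residual $y_i - F_{k-1}(\mathbf{x}_i)$, and uses shallow decision trees as stand-ins for the generic weak learners $f_k(\mathbf{x})$; it is only an illustrative sketch of the boosting loop, not the SGB-ELM procedure, which employs ELM networks as its individual learners.

```python
# Minimal sketch of the gradient boosting loop (squared loss assumed), where the
# pseudo-residuals equal the ordinary residuals y_i - F_{k-1}(x_i).
# Shallow decision trees are illustrative stand-ins for the weak learners f_k(x).
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gradient_boosting_fit(X, y, K=50):
    f0 = y.mean()                          # base model f_0(x): a constant predictor
    F = np.full(len(y), f0)                # current ensemble outputs F_{k-1}(x_i)
    learners = []
    for _ in range(K):
        residuals = y - F                  # pseudo-residuals (negative gradients)
        f_k = DecisionTreeRegressor(max_depth=2).fit(X, residuals)
        F = F + f_k.predict(X)             # F_k(x) = F_{k-1}(x) + f_k(x)
        learners.append(f_k)
    return f0, learners
```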
As gradient boosting constructs the additive ensemble model by sequentially fitting a weak individual learner to the current pseudo-residuals of the whole training dataset at each iteration, it costs much training time and may suffer from the overfitting problem. In view of that, a minor modification named stochastic gradient boosting is proposed to incorporate some randomization into the procedure. Specifically, at each iteration a randomly selected subset instead of the full training dataset is used to fit the individual learner and compute the model update for the current iteration. Namely, let $\{\pi(i)\}_{i=1}^{N}$ be a random permutation of the integers $\{1, 2, \cdots, N\}$; then a subset of size $\tilde{N} < N$ of the entire training dataset can be given by $\{(\mathbf{x}_{\pi(i)}, y_{\pi(i)})\}_{i=1}^{\tilde{N}}$. Furthermore, the $k$th weak learner under the stochastic gradient boosting ensemble scheme is trained by solving the following optimization problem:
$$ f_k^{*}\left(\mathbf{x}\right) = \arg\min_{f_k^{*}(\mathbf{x})} \sum_{i=1}^{\tilde{N}} L\left(\epsilon_{\pi(i)}^{(k)}, f_k^{*}\left(\mathbf{x}_{\pi(i)}\right)\right). $$
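A sketch of this subsampled fitting step is given below, under the same squared-loss assumption as the previous sketch: a random permutation of $\{1, 2, \cdots, N\}$ is drawn at each iteration and only the first $\tilde{N}$ permuted samples are used to fit the $k$th weak learner; the `subsample` ratio and the tree-based weak learner are illustrative choices rather than part of the original formulation.

```python
# Sketch of one stochastic gradient boosting iteration (squared loss assumed):
# the k-th weak learner is fitted on a random subset of size N_tilde < N.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def fit_stochastic_step(X, y, F, subsample=0.5, rng=None):
    rng = rng if rng is not None else np.random.default_rng()
    n_tilde = int(subsample * len(y))          # subset size N_tilde < N
    pi = rng.permutation(len(y))[:n_tilde]     # first N_tilde indices of a random permutation
    residuals = y[pi] - F[pi]                  # pseudo-residuals on the subset only
    return DecisionTreeRegressor(max_depth=2).fit(X[pi], residuals)
```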
Given the base learner $f_0(\mathbf{x})$, which is trained on the initial training dataset, the final ensemble learning model constructed by the stochastic gradient boosting scheme predicts an unknown testing instance $\mathbf{x}$ as follows:
$$ F\left(\mathbf{x}\right) = f_0\left(\mathbf{x}\right) + \sum_{k=1}^{K} f_k^{*}\left(\mathbf{x}\right). $$
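As a small illustration of this prediction step, the sketch below sums the constant base value and the predictions of the $K$ weak learners; the names `f0` and `learners` refer to the outputs of the fitting sketches above and are assumptions of this illustration, not notation from the paper.

```python
# Sketch of the final ensemble prediction F(x) = f_0(x) + sum_k f_k*(x),
# where f0 is the constant base value and `learners` holds the K weak learners.
import numpy as np

def ensemble_predict(X_new, f0, learners):
    X_new = np.atleast_2d(X_new)               # accept a single instance or a batch
    return f0 + sum(f_k.predict(X_new) for f_k in learners)
```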
Stochastic gradient boosting can also be regarded as a special line-search optimization algorithm, which makes the newly added individual learner fit the fastest descent direction of the partial training loss at each learning step.
3. Stochastic Gradient Boosting-Based
Extreme Learning Machine (SGB-ELM)
SGB-ELM is a novel hybrid learning algorithm which introduces the stochastic gradient boosting method into the ELM ensemble procedure. As the boosting mechanism focuses on gradually reducing the training residuals at each iteration and ELM is a special multiparameter network (particularly for classification tasks), instead of combining ELM and stochastic gradient boosting naively, we design an enhanced training scheme to alleviate possible overfitting in our proposed SGB-ELM algorithm. The detailed implementation of SGB-ELM is presented in Algorithm , where the determination of the optimal output weights for each individual ELM learner is illustrated in Algorithm  accordingly.
There are many existing second-order approximation methods, including sequential quadratic programming (SQP) [] and the majorization-minimization (MM) algorithm []. SQP is an effective method for nonlinearly constrained optimization that works by solving quadratic subproblems. MM aims to optimize a local surrogate objective which is easier to solve than the original cost function. Instead of using second-order approximation directly, SGB-ELM designs an optimization criterion for the output-layer weights of each individual ELM. In view of that, quadratic approximation is merely employed as an optimization tool in SGB-ELM.
In SGB-ELM, the key issue is to determine the optimal output-layer weights of each weak individual ELM, which are expected to further decrease the training loss while keeping a simple network structure. Consequently, we design a learning objective that considers not only the fitting ability on the training instances but also the complexity of our ensemble model, as follows:
$$ \mathrm{Obj} = \sum_{i=1}^{N} L\left(y_i, \hat{y}_i\right) + \sum_{k=1}^{K} \Omega\left(f_k\right), $$
where $L(\cdot,\cdot)$ is a differentiable loss function that measures the difference between the predicted output $\hat{y}_i$ and the target