
If the activation function $g$ is infinitely differentiable, we can prove that the required number of hidden nodes $\tilde{N} \leqslant N$. Strictly speaking, we have²
Theorem 2.1. Given a standard SLFN with $N$ hidden nodes and an activation function $g : \mathbb{R} \rightarrow \mathbb{R}$ which is infinitely differentiable in any interval, for $N$ arbitrary distinct samples $(\mathbf{x}_i, \mathbf{t}_i)$, where $\mathbf{x}_i \in \mathbb{R}^n$ and $\mathbf{t}_i \in \mathbb{R}^m$, and for any $\mathbf{w}_i$ and $b_i$ randomly chosen from any intervals of $\mathbb{R}^n$ and $\mathbb{R}$, respectively, according to any continuous probability distribution, the hidden layer output matrix $\mathbf{H}$ of the SLFN is, with probability one, invertible and $\|\mathbf{H}\boldsymbol{\beta} - \mathbf{T}\| = 0$.
Proof. Let us consider the vector $\mathbf{c}(b_i) = [g_i(\mathbf{x}_1), \ldots, g_i(\mathbf{x}_N)]^{\mathrm{T}} = [g(\mathbf{w}_i \cdot \mathbf{x}_1 + b_i), \ldots, g(\mathbf{w}_i \cdot \mathbf{x}_N + b_i)]^{\mathrm{T}}$, the $i$th column of $\mathbf{H}$, in Euclidean space $\mathbb{R}^N$, where $b_i \in (a, b)$ and $(a, b)$ is any interval of $\mathbb{R}$.
Following the same proof method of Tamura and Tateishi ([23], p. 252) and our previous work ([10], Theorem 2.1), it can easily be proved by contradiction that the vector $\mathbf{c}$ does not belong to any subspace whose dimension is less than $N$.
Since the $\mathbf{w}_i$ are randomly generated based on a continuous probability distribution, we can assume that $\mathbf{w}_i \cdot \mathbf{x}_k \neq \mathbf{w}_i \cdot \mathbf{x}_{k'}$ for all $k \neq k'$. Let us suppose that $\mathbf{c}$ belongs to a subspace of dimension $N - 1$. Then there exists a vector $\boldsymbol{\alpha}$ which is orthogonal to this subspace:

$(\boldsymbol{\alpha}, \mathbf{c}(b_i) - \mathbf{c}(a)) = \alpha_1\, g(b_i + d_1) + \alpha_2\, g(b_i + d_2) + \cdots + \alpha_N\, g(b_i + d_N) - z = 0, \quad (6)$
where $d_k = \mathbf{w}_i \cdot \mathbf{x}_k$, $k = 1, \ldots, N$, and $z = \boldsymbol{\alpha} \cdot \mathbf{c}(a)$, $\forall b_i \in (a, b)$. Assuming $\alpha_N \neq 0$, Eq. (6) can be further written as

$g(b_i + d_N) = -\sum_{p=1}^{N-1} \gamma_p\, g(b_i + d_p) + z/\alpha_N, \quad (7)$
where $\gamma_p = \alpha_p/\alpha_N$, $p = 1, \ldots, N-1$. Since $g(x)$ is infinitely differentiable in any interval, we have

$g^{(l)}(b_i + d_N) = -\sum_{p=1}^{N-1} \gamma_p\, g^{(l)}(b_i + d_p), \quad l = 1, 2, \ldots, N, N+1, \ldots, \quad (8)$
where $g^{(l)}$ is the $l$th derivative of the function $g$ with respect to $b_i$. However, there are only $N - 1$ free coefficients $\gamma_1, \ldots, \gamma_{N-1}$ available to satisfy the more than $N - 1$ linear equations derived in (8), which is a contradiction. Thus, the vector $\mathbf{c}$ does not belong to any subspace whose dimension is less than $N$.
Hence, from any interval $(a, b)$ it is possible to randomly choose $N$ bias values $b_1, \ldots, b_N$ for the $N$ hidden nodes such that the corresponding vectors $\mathbf{c}(b_1), \mathbf{c}(b_2), \ldots, \mathbf{c}(b_N)$ span $\mathbb{R}^N$. This means that for any weight vectors $\mathbf{w}_i$ and bias values $b_i$ chosen from any intervals of $\mathbb{R}^n$ and $\mathbb{R}$, respectively, according to any continuous probability distribution, the column vectors of $\mathbf{H}$ can, with probability one, be made full-rank. □
Such activation functions include the sigmoidal functions as well as the radial basis, sine, cosine, exponential, and many other nonregular functions, as shown in Huang and Babri [11].
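As an informal numerical check of Theorem 2.1 (a sketch with illustrative names and toy dimensions, not code from the paper), one can draw the input weights and biases at random, form the $N \times N$ hidden layer output matrix $\mathbf{H}$ with a sigmoid activation, and verify that it is full-rank, so that $\boldsymbol{\beta} = \mathbf{H}^{-1}\mathbf{T}$ drives $\|\mathbf{H}\boldsymbol{\beta} - \mathbf{T}\|$ down to round-off level:

```python
import numpy as np

rng = np.random.default_rng(0)

N, n, m = 20, 5, 2                              # samples, input dim, output dim
X = rng.uniform(-1.0, 1.0, size=(N, n))         # N arbitrary distinct samples x_i
T = rng.uniform(-1.0, 1.0, size=(N, m))         # targets t_i

# Input weights w_i and biases b_i, randomly chosen from continuous distributions
W = rng.uniform(-1.0, 1.0, size=(N, n))         # one weight vector per hidden node
b = rng.uniform(-1.0, 1.0, size=N)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hidden layer output matrix: H[j, i] = g(w_i . x_j + b_i)
H = sigmoid(X @ W.T + b)

print("rank(H) =", np.linalg.matrix_rank(H))    # expected: N (full rank)
beta = np.linalg.solve(H, T)                    # beta = H^{-1} T
print("||H beta - T|| =", np.linalg.norm(H @ beta - T))   # ~ 0 up to round-off
```

With probability one the printed rank equals $N$, in line with the theorem.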
Furthermore, we have

Theorem 2.2. Given any small positive value $\varepsilon > 0$ and an activation function $g : \mathbb{R} \rightarrow \mathbb{R}$ which is infinitely differentiable in any interval, there exists $\tilde{N} \leqslant N$ such that, for $N$ arbitrary distinct samples $(\mathbf{x}_i, \mathbf{t}_i)$, where $\mathbf{x}_i \in \mathbb{R}^n$ and $\mathbf{t}_i \in \mathbb{R}^m$, and for any $\mathbf{w}_i$ and $b_i$ randomly chosen from any intervals of $\mathbb{R}^n$ and $\mathbb{R}$, respectively, according to any continuous probability distribution, $\|\mathbf{H}_{N \times \tilde{N}} \boldsymbol{\beta}_{\tilde{N} \times m} - \mathbf{T}_{N \times m}\| < \varepsilon$ holds with probability one.
Proof. The validity of the theorem is obvious; otherwise, one could simply choose $\tilde{N} = N$, which makes $\|\mathbf{H}_{N \times \tilde{N}} \boldsymbol{\beta}_{\tilde{N} \times m} - \mathbf{T}_{N \times m}\| < \varepsilon$ according to Theorem 2.1. □
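An informal way to see Theorem 2.2 at work numerically is to fix random hidden parameters, solve for $\boldsymbol{\beta}$ in the least-squares sense (anticipating Section 3), and watch the residual shrink as $\tilde{N}$ grows toward $N$. The sketch below does exactly that; the toy target, dimensions, and variable names are assumptions for illustration only:

```python
import numpy as np

rng = np.random.default_rng(1)

N, n = 50, 3
X = rng.uniform(-1.0, 1.0, size=(N, n))
T = np.sin(3.0 * X.sum(axis=1, keepdims=True))    # a toy target, m = 1

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

for N_tilde in (5, 10, 25, 50):                   # N_tilde <= N hidden nodes
    W = rng.uniform(-1.0, 1.0, size=(N_tilde, n)) # random input weights
    b = rng.uniform(-1.0, 1.0, size=N_tilde)      # random biases
    H = sigmoid(X @ W.T + b)                      # N x N_tilde output matrix
    beta, *_ = np.linalg.lstsq(H, T, rcond=None)  # least-squares output weights
    err = np.linalg.norm(H @ beta - T)
    print(f"N_tilde = {N_tilde:3d}  ||H beta - T|| = {err:.3e}")
```

The residual typically falls as $\tilde{N}$ increases and reaches numerical zero at $\tilde{N} = N$, so for any given $\varepsilon > 0$ some $\tilde{N} \leqslant N$ suffices.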
3. Proposed extreme learning machine (ELM)

Based on Theorems 2.1 and 2.2, we can propose in this section an extremely simple and efficient method to train SLFNs.
3.1. Conventional gradient-based solution of SLFNs
Traditionally, in order to train an SLFN, one may wish to find specific $\hat{\mathbf{w}}_i$, $\hat{b}_i$, $\hat{\boldsymbol{\beta}}$ ($i = 1, \ldots, \tilde{N}$) such that

$\|\mathbf{H}(\hat{\mathbf{w}}_1, \ldots, \hat{\mathbf{w}}_{\tilde{N}}, \hat{b}_1, \ldots, \hat{b}_{\tilde{N}})\, \hat{\boldsymbol{\beta}} - \mathbf{T}\| = \min_{\mathbf{w}_i, b_i, \boldsymbol{\beta}} \|\mathbf{H}(\mathbf{w}_1, \ldots, \mathbf{w}_{\tilde{N}}, b_1, \ldots, b_{\tilde{N}})\, \boldsymbol{\beta} - \mathbf{T}\|, \quad (9)$
which is equivalent to minimizing the cost function

$E = \sum_{j=1}^{N} \left( \sum_{i=1}^{\tilde{N}} \boldsymbol{\beta}_i\, g(\mathbf{w}_i \cdot \mathbf{x}_j + b_i) - \mathbf{t}_j \right)^2. \quad (10)$
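As a concrete sanity check of this equivalence, the following sketch evaluates the cost (10) sample by sample and compares it with the squared Frobenius norm of the residual $\mathbf{H}\boldsymbol{\beta} - \mathbf{T}$ for arbitrary parameter values; the toy dimensions and variable names are illustrative assumptions, not taken from the paper:

```python
import numpy as np

rng = np.random.default_rng(2)

N, n, m, N_tilde = 30, 4, 2, 10
X = rng.normal(size=(N, n))
T = rng.normal(size=(N, m))
W = rng.normal(size=(N_tilde, n))       # input weights w_i
b = rng.normal(size=N_tilde)            # hidden biases b_i
beta = rng.normal(size=(N_tilde, m))    # output weights beta_i

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

H = sigmoid(X @ W.T + b)                # N x N_tilde hidden layer output matrix

# Cost of Eq. (10): sum over samples of the squared output error
E = sum(np.sum((H[j] @ beta - T[j]) ** 2) for j in range(N))

# Same quantity written as the matrix residual, cf. ||H beta - T||
print(E, np.linalg.norm(H @ beta - T, "fro") ** 2)   # the two agree up to round-off
```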
When $\mathbf{H}$ is unknown, gradient-based learning algorithms are generally used to search for the minimum of $\|\mathbf{H}\boldsymbol{\beta} - \mathbf{T}\|$. In this minimization procedure, the vector $\mathbf{W}$, which collects the weight ($\mathbf{w}_i$, $\boldsymbol{\beta}_i$) and bias ($b_i$) parameters, is iteratively adjusted as follows:

$\mathbf{W}_k = \mathbf{W}_{k-1} - \eta\, \frac{\partial E(\mathbf{W})}{\partial \mathbf{W}}, \quad (11)$

where $\eta$ is a learning rate.
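To make the conventional procedure concrete, here is a minimal sketch of the update (11) applied to the cost (10), under simplifying assumptions: scalar outputs, sigmoid activation, hand-derived gradients, and illustrative names such as eta, Win, b, beta. It spells out the generic gradient-based training described here, not the method proposed later in this paper:

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy data: N samples, n inputs, scalar targets (m = 1 for simplicity)
N, n, N_tilde = 40, 2, 8
X = rng.uniform(-1.0, 1.0, size=(N, n))
t = np.sin(X[:, 0]) + 0.5 * X[:, 1]

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# The vector W of Eq. (11) collects all of: input weights, biases, output weights
Win = rng.normal(scale=0.5, size=(N_tilde, n))
b = rng.normal(scale=0.5, size=N_tilde)
beta = rng.normal(scale=0.5, size=N_tilde)

eta = 0.002                                   # deliberately small learning rate;
for k in range(5000):                         # larger values risk divergence
    A = X @ Win.T + b                         # pre-activations, N x N_tilde
    G = sigmoid(A)                            # hidden outputs g(w_i . x_j + b_i)
    e = G @ beta - t                          # per-sample errors, length N
    dG = G * (1.0 - G)                        # sigmoid derivative
    # Gradients of E = sum_j e_j^2 with respect to beta, Win and b
    grad_beta = 2.0 * G.T @ e
    grad_Win = 2.0 * (e[:, None] * dG * beta).T @ X
    grad_b = 2.0 * (e[:, None] * dG * beta).sum(axis=0)
    beta -= eta * grad_beta                   # W_k = W_{k-1} - eta * dE/dW
    Win -= eta * grad_Win
    b -= eta * grad_b

print("final E =", float(np.sum((sigmoid(X @ Win.T + b) @ beta - t) ** 2)))
```

Even on this toy problem the small learning rate requires thousands of iterations to bring $E$ down, which is exactly the kind of behaviour the issues listed next refer to.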
The popular learning algorithm used for feedforward neural networks is the BP learning algorithm, in which gradients can be computed efficiently by propagation from the output to the input. There are several issues with BP learning algorithms:

(1) When the learning rate $\eta$ is too small, the learning algorithm converges very slowly. However, when $\eta$ is too large, the algorithm becomes unstable and diverges.
²In fact, the theorem and its proof are also linearly valid for the case $g_i(\mathbf{x}) = g(\|\mathbf{x} - \mathbf{w}_i\|/b_i)$, $\mathbf{w}_i \in \mathbb{R}^n$, $b_i \in \mathbb{R}^+$.