3. Multiple kernel extreme learning machine
In [6], a kernel ELM is first proposed, in which a Gaussian kernel and a polynomial kernel are empirically specified. Such pre-specified kernels may not be suitable for various applications. This motivates us to design a learning algorithm that automatically learns a data-dependent optimal kernel for ELM in different applications. Inspired by the idea of MKL, we assume that the optimal kernel can be expressed as a linear combination of base kernels, and jointly learn the structural parameters of ELM and the optimal kernel combination coefficients. This extension enables ELM to handle heterogeneous data integration and extends ELM to a wider range of applications. Following the research on SVM-based MKL, we first design a sparse MK-ELM algorithm and then generalize it to the non-sparse case. After that, a radius-incorporated variant is proposed. Three efficient algorithms are given to solve the corresponding kernel learning problems.
3.1. The sparse MK-ELM
By incorporating the base kernel combination weights into ELM, and imposing an $\ell_1$-norm and a non-negativity constraint on the base kernel weights, we obtain the objective function of the proposed sparse MK-ELM as follows:
$$
\begin{aligned}
\min_{\gamma}\,\min_{\beta,\xi}\quad & \frac{1}{2}\|\beta\|_F^2+\frac{C}{2}\sum_{i=1}^{n}\|\xi_i\|^2\\
\text{s.t.}\quad & \beta^{\top}\phi(x_i;\gamma)=y_i-\xi_i,\ \forall i;\qquad \sum_{p=1}^{m}\gamma_p=1,\ \gamma_p\ge 0,\ \forall p,
\end{aligned}
\qquad(6)
$$
where $\beta=[\beta_1;\ldots;\beta_m]\in\mathbb{R}^{(|\phi_1(\cdot)|+\cdots+|\phi_m(\cdot)|)\times T}$, and $\beta_p\in\mathbb{R}^{|\phi_p(\cdot)|\times T}$ $(p=1,\ldots,m)$ is the $p$th component corresponding to the $p$th base kernel. Recall that $\xi\in\mathbb{R}^{T\times n}$ is the training error matrix on the training data, and $\xi_i=[\xi_{1i},\xi_{2i},\ldots,\xi_{Ti}]^{\top}$ $(1\le i\le n)$ is the $i$th column of $\xi$.
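As an illustration of the quantities involved in Eq. (6), the short NumPy sketch below constructs a set of base kernel matrices $\{K_p\}_{p=1}^{m}$ and a label matrix $Y\in\mathbb{R}^{T\times n}$ with the dimensions defined above. The choice of Gaussian base kernels, their bandwidths, and the one-hot label encoding are illustrative assumptions rather than part of the formulation.

```python
import numpy as np

def gaussian_kernel(X, bandwidth):
    """Base kernel K_p(x_i, x_j) = exp(-||x_i - x_j||^2 / (2 * bandwidth^2))."""
    sq_norms = np.sum(X ** 2, axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * (X @ X.T)
    return np.exp(-np.maximum(sq_dists, 0.0) / (2.0 * bandwidth ** 2))

# Illustrative data: n samples with d features and T target dimensions (assumed values).
rng = np.random.default_rng(0)
n, d, T = 100, 5, 3
X = rng.standard_normal((n, d))
labels = rng.integers(0, T, size=n)

# m base kernel matrices {K_p}_{p=1}^m, here Gaussian kernels with different widths.
bandwidths = [0.5, 1.0, 2.0, 4.0]
base_kernels = [gaussian_kernel(X, b) for b in bandwidths]

# Label matrix Y in R^{T x n}: the i-th column is the target y_i (one-hot here).
Y = np.zeros((T, n))
Y[labels, np.arange(n)] = 1.0
```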
As observed, Eq. (6) jointly optimizes the structural parameter $\beta$ of ELM and the kernel combination coefficients $\gamma$. We now show how to solve the optimization problem in Eq. (6) efficiently. By substituting Eq. (5) into Eq. (6), the problem can be rewritten as
$$
\begin{aligned}
\min_{\gamma}\,\min_{\beta,\xi}\quad & \frac{1}{2}\|\beta\|_F^2+\frac{C}{2}\sum_{i=1}^{n}\|\xi_i\|^2\\
\text{s.t.}\quad & \sum_{p=1}^{m}\sqrt{\gamma_p}\,\beta_p^{\top}\phi_p(x_i)=y_i-\xi_i,\ \forall i;\qquad \sum_{p=1}^{m}\gamma_p=1,\ \gamma_p\ge 0,\ \forall p.
\end{aligned}
\qquad(7)
$$
After defining
$$
\tilde{\beta}=[\tilde{\beta}_1;\tilde{\beta}_2;\ldots;\tilde{\beta}_m],
\qquad(8)
$$
where $\tilde{\beta}_p=\sqrt{\gamma_p}\,\beta_p$, $p=1,\ldots,m$, Eq. (7) can be equivalently reformulated as
$$
\begin{aligned}
\min_{\gamma}\,\min_{\tilde{\beta},\xi}\quad & \frac{1}{2}\sum_{p=1}^{m}\frac{\|\tilde{\beta}_p\|_F^2}{\gamma_p}+\frac{C}{2}\sum_{i=1}^{n}\|\xi_i\|^2\\
\text{s.t.}\quad & \sum_{p=1}^{m}\tilde{\beta}_p^{\top}\phi_p(x_i)=y_i-\xi_i,\ \forall i;\qquad \sum_{p=1}^{m}\gamma_p=1,\ \gamma_p\ge 0,\ \forall p.
\end{aligned}
\qquad(9)
$$
It is not difficult to verify that Eq. (9) is a jointly convex optimization problem [34], and its Lagrangian function is
$$
\begin{aligned}
L(\tilde{\beta},\xi,\gamma)=\; & \frac{1}{2}\sum_{p=1}^{m}\frac{\|\tilde{\beta}_p\|_F^2}{\gamma_p}+\frac{C}{2}\sum_{i=1}^{n}\|\xi_i\|^2\\
& -\sum_{t=1}^{T}\sum_{i=1}^{n}\alpha_{it}\left(\sum_{p=1}^{m}\tilde{\beta}_p^{\top}\phi_p(x_i)-y_{ti}+\xi_{ti}\right)+\tau\left(\sum_{p=1}^{m}\gamma_p-1\right),
\end{aligned}
\qquad(10)
$$
where $\alpha\in\mathbb{R}^{n\times T}$ and $\tau$ are the Lagrange multipliers. In Eq. (10), we omit the non-negativity constraints on $\gamma_p$ $(p=1,\ldots,m)$, since the newly updated kernel combination weights are automatically kept non-negative at each iteration, as will be validated later.
The KKT optimality conditions of Eq. (10) are as follows:
$$
\tilde{\beta}_p=\gamma_p\sum_{t=1}^{T}\sum_{i=1}^{n}\alpha_{it}\,\phi_p(x_i),\ \forall p,
\qquad(11)
$$
$$
\xi_{ti}=\frac{\alpha_{ti}}{C},\ \forall t,\ \forall i,
\qquad(12)
$$
$$
\sum_{p=1}^{m}\tilde{\beta}_p^{\top}\phi_p(x_i)=y_i-\xi_i,\ \forall i,
\qquad(13)
$$
which can be rewritten in matrix form as
$$
\left(K(\cdot,\cdot;\gamma)+\frac{I}{C}\right)\alpha=Y^{\top},
\qquad(14)
$$
where $K(x_i,x_j;\gamma)=\phi(x_i;\gamma)^{\top}\phi(x_j;\gamma)=\sum_{p=1}^{m}\gamma_p K_p(x_i,x_j)$ and $Y=[y_1,\ldots,y_n]\in\mathbb{R}^{T\times n}$ is the label matrix. From Eq. (14), $\alpha$, which corresponds to the structural parameter of ELM, can be obtained as
$$
\alpha=\left(K(\cdot,\cdot;\gamma)+\frac{I}{C}\right)^{-1}Y^{\top}.
\qquad(15)
$$
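As a minimal computational sketch of Eq. (15) (assuming precomputed base kernel matrices, current weights $\gamma$, the label matrix $Y$, and the constant $C$ as in the setup above; the function name `solve_alpha` is ours), the combined kernel of Eq. (14) is formed and a single regularized linear system is solved:

```python
import numpy as np

def solve_alpha(base_kernels, gamma, Y, C):
    """Closed-form solution of Eq. (15): alpha = (K(.,.;gamma) + I/C)^{-1} Y^T."""
    n = base_kernels[0].shape[0]
    # Combined kernel of Eq. (14): K(.,.;gamma) = sum_p gamma_p * K_p.
    K = sum(g * Kp for g, Kp in zip(gamma, base_kernels))
    # Solve the regularized linear system rather than forming an explicit inverse.
    return np.linalg.solve(K + np.eye(n) / C, Y.T)   # alpha in R^{n x T}
```

Solving the linear system directly is numerically preferable to computing the explicit inverse written in Eq. (15).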
We then show how to update the kernel combination coefficients $\gamma$ efficiently. Taking the derivative of Eq. (10) with respect to $\gamma_p$ $(p=1,\ldots,m)$ and setting it to zero, we obtain that the new kernel combination weights $\gamma^{\mathrm{new}}$ are updated as
$$
\gamma_p^{\mathrm{new}}=\frac{\|\tilde{\beta}_p\|_F}{\sum_{p=1}^{m}\|\tilde{\beta}_p\|_F},\ \forall p,
\qquad(16)
$$
where
$$
\|\tilde{\beta}_p\|_F=\gamma_p\sqrt{\sum_{s,t=1}^{T}\sum_{i,j=1}^{n}\alpha_{it}\,\alpha_{js}\,K_p(x_i,x_j)},
\qquad(17)
$$
and $\gamma_p$ $(p=1,\ldots,m)$ is the $p$th kernel combination weight from the previous iteration. As seen from Eqs. (16) and (17), the newly updated $\gamma_p^{\mathrm{new}}$ $(p=1,\ldots,m)$ remain non-negative at each iteration, so the non-negativity constraint is satisfied automatically. The detailed derivation of the kernel combination weight update is provided in the Appendix.
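Computationally, the weight update of Eqs. (16) and (17) only requires the base kernel matrices and the current $\alpha$. The sketch below is one possible implementation; reading the double sum over the output indices $s,t$ in Eq. (17) literally, the quadratic form reduces to $a^{\top}K_p a$ with $a_i=\sum_{t}\alpha_{it}$ (this reduction, and the function name, are ours, not part of the derivation above):

```python
import numpy as np

def update_gamma(base_kernels, gamma, alpha):
    """Kernel weight update of Eqs. (16)-(17)."""
    # Eq. (17): ||beta_tilde_p||_F = gamma_p * sqrt(sum_{s,t} sum_{i,j} alpha_it alpha_js K_p(x_i, x_j));
    # collapsing the output indices gives a^T K_p a with a_i = sum_t alpha_it.
    a = alpha.sum(axis=1)
    norms = np.array([g * np.sqrt(max(a @ Kp @ a, 0.0))
                      for g, Kp in zip(gamma, base_kernels)])
    # Eq. (16): normalize so that the weights stay non-negative and sum to one.
    return norms / norms.sum()
```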
The overall optimization procedure for solving the sparse MK-ELM is summarized in Algorithm 1.
Algorithm 1. The sparse MK-ELM.
1: Input: $\{K_p\}_{p=1}^{m}$, $Y$ and $C$.
2: Output: $\alpha$ and $\gamma$.
3: Initialize $\gamma=\gamma^{0}$ and $t=0$.
4: repeat
5:   Compute $K(\cdot,\cdot;\gamma^{t})=\sum_{p=1}^{m}\gamma_{p}^{t}K_{p}$.
6:   Update $\alpha^{t}$ by solving Eq. (15).
7:   Update $\gamma^{t+1}$ by Eq. (16).
8:   $t=t+1$.
9: until $\max\{|\gamma^{t+1}-\gamma^{t}|\}\le 1\mathrm{e}{-4}$
3.2. Non-sparse MK-ELM
Recent research on SVM-based MKL has indicated that non-sparse MKL algorithms usually outperform their sparse alternatives [20,35], arguing that some complementary information may be lost due to the sparsity constraint. In the following, we first design a non-sparse MK-ELM algorithm, and then propose an optimization algorithm to solve the resulting kernel learning