maximize the margin by solving the following optimization
problem with respect to γ:
$$
\begin{aligned}
\min_{\gamma,\,\omega,\,b,\,\xi}\quad & \frac{1}{2}\|\omega\|^{2}+C\sum_{i=1}^{n}\xi_{i}\\
\text{s.t.}\quad & y_{i}\left(\omega^{\top}\phi(x_{i};\gamma)+b\right)\ge 1-\xi_{i},\quad \xi_{i}\ge 0\quad\forall i\\
& \|\gamma\|_{1}=1,\quad \gamma\succeq 0
\end{aligned}
\qquad (1)
$$
where ω is the normal of the separating hyperplane, b is the bias term, and ξ = [ξ_1, …, ξ_n]^⊤ is the vector of slack variables.
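For a fixed weight vector γ, the inner problem of (1) is simply a standard SVM trained on the combined kernel K(γ) = Σ_p γ_p K_p. The following minimal sketch (ours, not the authors' implementation; the helper names, the toy data, and the use of scikit-learn are assumptions made for illustration) makes this concrete.

import numpy as np
from sklearn.svm import SVC

def combine_kernels(base_kernels, gamma):
    # base_kernels: list of n x n Gram matrices K_p; gamma: nonnegative kernel weights.
    return sum(g * K for g, K in zip(gamma, base_kernels))

def train_inner_svm(base_kernels, y, gamma, C=1.0):
    # Solve the inner SVM of (1) for a fixed gamma via a precomputed-kernel SVM.
    K = combine_kernels(base_kernels, gamma)
    clf = SVC(C=C, kernel="precomputed")
    clf.fit(K, y)
    return clf

# Toy usage with two Gaussian base kernels on random data (illustration only).
rng = np.random.RandomState(0)
X = rng.randn(40, 5)
y = np.where(rng.randn(40) > 0, 1, -1)
D2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
base_kernels = [np.exp(-D2 / s) for s in (1.0, 10.0)]
gamma = np.array([0.5, 0.5])        # satisfies the simplex constraint in (1)
clf = train_inner_svm(base_kernels, y, gamma, C=1.0)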
Recent work on MKL has considered incorporating the radius of the MEB into the traditional formulation and demonstrated that doing so helps to achieve better kernel learning performance [31], [32]. The theoretical justification for incorporating the radius lies in the fact that the generalization error bound of SVMs depends on both the margin and the radius of the MEB of the training data [10]. As pointed out in [32], maximizing only the margin with respect to γ leads to scaling and initialization issues. An arbitrarily large margin can be obtained by scaling γ to τγ (τ > 1), which affects the convergence of the optimization problem. Usually, a norm constraint is imposed on γ to address this issue. Nevertheless, identifying an appropriate norm constraint for a given kernel learning task remains an open issue in itself [14], [33], [43], [47], [48]. Moreover, even if a norm constraint is imposed, a good kernel can still be misjudged as a poor one simply because its kernel weight has been downscaled [32]. These issues can be removed or mitigated by incorporating radius information.
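To make the scaling issue concrete, the following short derivation (our own sketch, not quoted from [32]) shows why the margin can be inflated by rescaling γ alone. Since K(γ) = Σ_p γ_p K_p is linear in γ, we have K(τγ) = τK(γ), and one admissible feature map for the scaled weights is φ(x; τγ) = √τ φ(x; γ). If (ω*, b*, ξ*) satisfies the margin constraints in (1) under the weights γ, then (ω*/√τ, b*, ξ*) satisfies them under τγ (ignoring the norm constraint on γ, whose very purpose is to rule out such rescaling), because
$$
y_{i}\left(\frac{{\omega^{*}}^{\top}}{\sqrt{\tau}}\,\sqrt{\tau}\,\phi(x_{i};\gamma)+b^{*}\right)
= y_{i}\left({\omega^{*}}^{\top}\phi(x_{i};\gamma)+b^{*}\right)\ge 1-\xi_{i}^{*}
$$
while the norm term shrinks to ‖ω*‖²/τ. The attainable margin 1/‖ω‖ therefore grows at least by the factor √τ > 1 for any τ > 1, even though the decision function is unchanged.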
The following formulation is adopted by both works in [31] and [32], with an additional ℓ1-norm constraint used in [31]:
$$
\begin{aligned}
\min_{\gamma,\,\omega,\,b,\,\xi}\quad & \frac{1}{2}\|\omega\|^{2}R^{2}+C\sum_{i=1}^{n}\xi_{i}^{q}\\
\text{s.t.}\quad & y_{i}\left(\omega^{\top}\phi(x_{i};\gamma)+b\right)\ge 1-\xi_{i},\quad \xi_{i}\ge 0\quad\forall i;\quad \gamma\succeq 0
\end{aligned}
\qquad (2)
$$
where R² is the squared radius of the MEB and q = 1 or 2. Like the margin, R² is also a function of γ. The work in [31] focuses on approximating the optimization problem in (2) with one that can be solved more efficiently and does not address the scaling issue mentioned earlier. In contrast, the work in [32] solves the optimization problem in (2) directly and carefully discusses how the scaling issue can be addressed. Specifically, a trilevel optimization problem is proposed in that work:
$$
\min_{\gamma}\ J(\gamma)\quad\text{s.t.}\quad \gamma_{p}\ge 0\quad\forall p
\qquad (3)
$$
where
$$
\begin{aligned}
J(\gamma)=\max_{\alpha}\quad & \alpha^{\top}\mathbf{1}-\frac{1}{2R^{2}}(\alpha\circ y)^{\top}K(\gamma)(\alpha\circ y)\\
\text{s.t.}\quad & \alpha^{\top}y=0,\quad 0\le\alpha_{i}\le C\quad\forall i
\end{aligned}
\qquad (4)
$$
$$
\begin{aligned}
R^{2}=\max_{\beta}\quad & \beta^{\top}\operatorname{diag}\left(K(\gamma)\right)-\beta^{\top}K(\gamma)\beta\\
\text{s.t.}\quad & \beta^{\top}\mathbf{1}=1,\quad 0\le\beta_{i}\quad\forall i .
\end{aligned}
\qquad (5)
$$
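To make the two inner levels concrete, here is a minimal sketch (ours, not the authors' code) that solves the QPs in (5) and (4) for a fixed combined kernel K(γ) with the generic QP solver cvxopt; the function names and the choice of solver are assumptions made for illustration.

import numpy as np
from cvxopt import matrix, solvers

solvers.options["show_progress"] = False

def squared_radius(K):
    # Eq. (5): R^2 = max_beta beta'diag(K) - beta'K beta, s.t. sum(beta) = 1, beta >= 0.
    n = K.shape[0]
    P = matrix(2.0 * K)                                    # cvxopt minimizes (1/2) x'Px + q'x
    q = matrix(-np.diag(K))
    G, h = matrix(-np.eye(n)), matrix(np.zeros(n))         # beta >= 0
    A, b = matrix(np.ones((1, n))), matrix([1.0])          # sum(beta) = 1
    beta = np.array(solvers.qp(P, q, G, h, A, b)["x"]).ravel()
    return float(beta @ np.diag(K) - beta @ K @ beta), beta

def j_value(K, y, C, R2):
    # Eq. (4): the SVM dual with the kernel scaled by 1/R^2; returns J(gamma) and alpha.
    n = K.shape[0]
    Q = (np.outer(y, y) * K) / R2
    P, q = matrix(Q), matrix(-np.ones(n))
    G = matrix(np.vstack([-np.eye(n), np.eye(n)]))         # 0 <= alpha_i <= C
    h = matrix(np.hstack([np.zeros(n), C * np.ones(n)]))
    A, b = matrix(y.reshape(1, -1).astype(float)), matrix([0.0])   # alpha'y = 0
    alpha = np.array(solvers.qp(P, q, G, h, A, b)["x"]).ravel()
    return float(alpha.sum() - 0.5 * alpha @ Q @ alpha), alpha

Calling squared_radius on K(γ) and feeding its result to j_value reproduces one pass of the two inner levels; wrapping these calls in an outer update of γ that keeps γ_p ≥ 0 yields the trilevel procedure described next.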
To solve the optimization problem, a trilevel optimization structure is developed accordingly. Specifically, in the first step, R² is computed by solving the quadratic programming (QP) problem in (5) with a given γ. The obtained R² is then substituted into (4), and another QP is solved to calculate J(γ). The last step is to update the kernel weights γ. This procedure is repeated until a stopping criterion is satisfied. Compared with traditional MKL algorithms, an extra QP is introduced and solved at each iteration. This can considerably increase the computational cost of SVM-based MKL, particularly when the size of the training set is large. Worse still, the solution of the optimization problem in (5) is sensitive to outliers. As a result, the obtained R² can become noisy, and this noise will, in turn, affect the optimization of the kernel weights via the trilevel optimization structure. To reduce the sensitivity to outliers, we could simply impose a box constraint on β, and (5) becomes¹
$$
\begin{aligned}
R^{2}=\max_{\beta}\quad & \beta^{\top}\operatorname{diag}\left(K(\gamma)\right)-\beta^{\top}K(\gamma)\beta\\
\text{s.t.}\quad & \beta^{\top}\mathbf{1}=1,\quad 0\le\beta_{i}\le D\quad\forall i
\end{aligned}
\qquad (6)
$$
where D is a regularization parameter. More sophisticated variants can also be adopted by following the idea of support vector data description (SVDD) in [49]. However, these methods bring one more hyperparameter, D, that must be tuned within the trilevel optimization structure. This does not align well with our goal of developing an efficient approach to integrating the radius information into MKL.
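For reference, the only change needed in the earlier sketch for (5) to obtain the box-constrained problem (6) is the pair of inequality constraints, shown below in the generic G x ≤ h form accepted by QP solvers such as cvxopt (D is the regularization parameter introduced above; the helper name is ours).

import numpy as np

def box_constraints(n, D):
    # 0 <= beta_i <= D for all i, written as G @ beta <= h; the equality sum(beta) = 1 is unchanged.
    G = np.vstack([-np.eye(n), np.eye(n)])
    h = np.hstack([np.zeros(n), D * np.ones(n)])
    return G, h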
III. PROPOSED ℓ2-NORM tr(S_T) MKL ALGORITHM
A. Close Relationship Between R² and tr(S_T)
Recall that x_i (i = 1, ..., n) denotes the ith training sample. The total scatter matrix is defined as $S_{T}=\sum_{i=1}^{n}(x_{i}-m)(x_{i}-m)^{\top}$, where $m=(1/n)\sum_{i=1}^{n}x_{i}$ is the sample-based total mean. Although each training sample is implicitly mapped onto a feature space via the kernel trick and S_T in that space is inaccessible, its trace can be explicitly expressed by the kernel function as
$$
\operatorname{tr}(S_{T})=\operatorname{tr}\left(K(\gamma)\right)-\frac{1}{n}\mathbf{1}^{\top}K(\gamma)\mathbf{1}=\sum_{p=1}^{m}\gamma_{p}a_{p}
\qquad (7)
$$
where $a_{p}\triangleq\operatorname{tr}(K_{p})-(1/n)\mathbf{1}^{\top}K_{p}\mathbf{1}$.
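As a small numerical illustration of (7), the following sketch (ours) computes the per-kernel quantities a_p and tr(S_T) from precomputed base Gram matrices; base_kernels and gamma are assumed inputs.

import numpy as np

def a_coefficients(base_kernels):
    # a_p = tr(K_p) - (1/n) 1'K_p 1 for each base kernel K_p.
    n = base_kernels[0].shape[0]
    ones = np.ones(n)
    return np.array([np.trace(Kp) - ones @ Kp @ ones / n for Kp in base_kernels])

def trace_total_scatter(base_kernels, gamma):
    # tr(S_T) = tr(K(gamma)) - (1/n) 1'K(gamma)1, which equals gamma @ a_coefficients(base_kernels).
    K = sum(g * Kp for g, Kp in zip(gamma, base_kernels))
    n = K.shape[0]
    ones = np.ones(n)
    return np.trace(K) - ones @ K @ ones / n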
The close relationship between tr(S_T) and the squared radius of the MEB, R², has been revealed in the literature [44]. Both measure the scattering of samples in a kernel-induced feature space, and tr(S_T) can be shown to be an approximation of R². A detailed analysis of this relationship can be found in [44, Appendix].² In this paper, instead of incorporating the radius of the MEB directly, we incorporate tr(S_T), and the advantages are threefold.
1) In the definition of S_T, each sample is assigned an equal weight when measuring data scattering. This makes tr(S_T) less sensitive to an outlier that significantly deviates from the center of the data cloud. In contrast, such an outlier will become an important support vector
¹ As in SVMs, imposing a box constraint on β is equivalent to introducing a slack variable for each training sample in the primal problem of (5).
² http://users.cecs.anu.edu.au/wanglei/My_papers/FS_CSM_appendix_v02.pdf.