[29], [30] in the third category learns $G$ in the neighborhood of $K_0$ by imposing a penalty term $\|G - K_0\|_F^2$, where $\|\cdot\|_F$ denotes the Frobenius norm. The optimization problem becomes
$$\min_{G \succeq 0}\ \min_{f \in \mathcal{H}_G}\ \frac{1}{2}\|f\|^2 + C\sum_{i=1}^{n} \ell\big(y_i f(x_i)\big) + \frac{\rho}{2}\|G - K_0\|_F^2 \qquad (4)$$
where $\rho \ge 0$ is a regularization parameter and $G \succeq 0$ means that $G$ is a positive semi-definite matrix. As shown in [29], the classifier $f$ obtained in this way can achieve higher classification accuracy than the classifier trained with $K_0$. The performance of ONKL depends on the goodness of the pre-specified kernel $K_0$. However, how to identify an appropriate $K_0$ remains unsolved. Moreover, in [29] the setting of the pre-specified kernel $K_0$ is separate from the learning of the optimal kernel $G$.
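To make the role of the neighborhood penalty concrete, the following NumPy sketch (with toy matrices of our own, not values from the paper) evaluates the coupling term $\frac{\rho}{2}\|G - K_0\|_F^2$ in (4):

import numpy as np

def neighborhood_penalty(G, K0, rho):
    # (rho / 2) * ||G - K0||_F^2, the term that keeps G close to K0 in (4)
    return 0.5 * rho * np.linalg.norm(G - K0, ord="fro") ** 2

# Toy positive semi-definite matrices, for illustration only.
K0 = np.array([[1.0, 0.2, 0.1],
               [0.2, 1.0, 0.3],
               [0.1, 0.3, 1.0]])
G = K0 + 0.05 * np.eye(3)   # a kernel in the neighborhood of K0
print(neighborhood_penalty(G, K0, rho=1.0))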
III. OUR ADAPTIVE APPROACH TO LEARNING OPTIMAL NEIGHBORHOOD KERNELS
A. Problem Formulation
To address the above issues, we treat the pre-specified kernel as an extra variable and jointly learn it with the optimal kernel and the structure parameters of SVMs. We call our approach optimal neighborhood joint kernel learning (ONJKL) for short. Mathematically, we define the objective of ONJKL as
$$\min_{G,\,K_\nu,\,f}\ \frac{1}{2}\|f\|^2 + C\sum_{i=1}^{n} \ell\big(y_i f(x_i)\big) + \frac{\rho}{2}\|G - K_\nu\|_F^2$$
$$\text{s.t.}\ K_\nu \in \Omega,\ G \succeq 0,\ f \in \mathcal{H}_G,\ \nu \in \mathbb{R}^m. \qquad (5)$$
Note that to make this approach work, we constrain the pre-specified kernel with a parameterized model and denote it by $K_\nu \in \Omega$, where $\nu$ is the parameter vector and $\Omega$ is its domain, a subset of all positive semi-definite matrices. Applying such a model is necessary because 1) constraining the pre-specified kernel within an appropriate domain avoids the trivial solution of $K_\nu \equiv G$, which always minimizes the last term in (5); and 2) a parameterized model allows prior knowledge on the pre-specified kernel to be conveniently incorporated. Given different models, the above optimization will produce different $K_\nu$ and $G$ to fit classification tasks.
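As an illustration only, one plausible choice of such a parameterized model is a nonnegative combination of fixed base kernels, as is common in multiple kernel learning; the instantiations actually used in this paper are described later. A minimal sketch under this assumption:

import numpy as np

def combined_kernel(nu, base_kernels):
    # K_nu = sum_m nu_m * K_m; with nu_m >= 0 and PSD base kernels,
    # K_nu stays positive semi-definite, so Omega is a strict subset of
    # the PSD cone rather than the whole of it.
    nu = np.asarray(nu, dtype=float)
    if np.any(nu < 0):
        raise ValueError("nonnegative weights keep K_nu positive semi-definite")
    return sum(w * K for w, K in zip(nu, base_kernels))

# Two toy base kernels on three samples.
K1 = np.array([[1.0, 0.5, 0.2],
               [0.5, 1.0, 0.4],
               [0.2, 0.4, 1.0]])
K2 = np.eye(3)
K_nu = combined_kernel([0.7, 0.3], [K1, K2])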
We consider our approach as an adaptive way for ONKL and
highlight the adaptivity of our approach as follows:
• It adaptively sets the pre-specified kernel for a given
classification task via optimization;
• In this approach, the optimal neighborhood kernel and the
pre-specified kernel are adaptively adjusted with respect to
each other through the joint learning process;
• This approach adaptively produces the optimal neighbor-
hood kernel based on the parameterized model chosen by
a user.
In the following, we analyze the properties of this formulation, provide two instantiations, and discuss how to solve this optimization problem efficiently.
B. Properties of the Optimal Neighborhood Kernel G
This subsection first shows that the optimal neighborhood kernel $G$ can be obtained once the pre-specified kernel $K_\nu$ and the structure parameters of SVMs are optimized. It then proves that $G$ always achieves a higher kernel-alignment value than $K_\nu$, providing theoretical support for the better classification performance obtained by using $G$.
We derive the dual problem of (5) as follows:
$$\min_{G,\,K_\nu}\ \max_{\alpha \in \mathbb{R}^n}\ \alpha^\top \mathbf{1} - \frac{1}{2}(\alpha \otimes y)^\top G\,(\alpha \otimes y) + \frac{\rho}{2}\|G - K_\nu\|_F^2$$
$$\text{s.t.}\ \alpha^\top y = 0,\ \mathbf{0} \le \alpha \le C\mathbf{1},\ K_\nu \in \Omega \qquad (6)$$
where $\alpha$ denotes the structure parameters of SVMs (i.e., the Lagrange multipliers), $y$ is a column vector consisting of the labels of all training samples, and $\otimes$ denotes component-wise multiplication between two vectors.
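For concreteness, a minimal sketch that evaluates the dual objective in (6) for fixed $\alpha$, $G$, and $K_\nu$ (the variable names are ours):

import numpy as np

def dual_objective(alpha, y, G, K_nu, rho):
    # alpha^T 1 - 0.5 * (alpha (x) y)^T G (alpha (x) y)
    #           + (rho / 2) * ||G - K_nu||_F^2, as in (6)
    ay = alpha * y                       # component-wise product alpha (x) y
    quad = 0.5 * ay @ (G @ ay)
    penalty = 0.5 * rho * np.linalg.norm(G - K_nu, ord="fro") ** 2
    return alpha.sum() - quad + penalty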
Following Theorem 1 in [29], we can easily derive the analytical form of $G$ that minimizes (6), as stated in Theorem 1.
Theorem 1: The optimal neighborhood kernel $G^\star$ that minimizes (6) has the following analytical form:
$$G^\star = K_\nu + \frac{1}{2\rho}(\alpha \otimes y)(\alpha \otimes y)^\top. \qquad (7)$$
The result in Theorem 1 indicates that in our approach, we only need to optimize the pre-specified kernel $K_\nu$ (or, more precisely, its model parameters $\nu$) and the SVMs' structure parameters $\alpha$, instead of optimizing the matrix $G$.
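A minimal sketch of the closed form in (7), assuming $\alpha$ (e.g., from a standard SVM solver), $y$, $\rho$, and $K_\nu$ are already available:

import numpy as np

def optimal_neighborhood_kernel(alpha, y, K_nu, rho):
    # G* = K_nu + (1 / (2 rho)) * (alpha (x) y)(alpha (x) y)^T, as in (7)
    ay = (alpha * y).reshape(-1, 1)      # column vector alpha (x) y
    return K_nu + (ay @ ay.T) / (2.0 * rho)

Since the rank-one term $(\alpha \otimes y)(\alpha \otimes y)^\top$ is positive semi-definite, $G^\star$ computed this way remains positive semi-definite whenever $K_\nu$ is.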
Now, we evaluate the goodness of $G^\star$ and $K_\nu$ by using the kernel alignment criterion [34]. The alignment between two kernel matrices $K_1$ and $K_2$ is defined as $A(K_1, K_2) = \langle K_1, K_2\rangle / \sqrt{\langle K_1, K_1\rangle \langle K_2, K_2\rangle}$, where $\langle\cdot,\cdot\rangle$ is the inner product between two matrices.$^1$
Recall that $y$ is the column vector of the labels of all training samples. In the literature, $yy^\top$ is called the ideal kernel for a given classification task. The alignment of a kernel with the ideal one can be used to measure its goodness. As proven in [34], the estimate of the alignment is concentrated, which means that if a kernel achieves high alignment on the training set, it is expected to obtain high alignment on the test set, too. Higher alignment implies that the resulting classifier will have better generalization performance according to Theorem 4 in [34]. Specifically, the generalization error of a classifier is upper bounded by $1 - \hat{A}(S)$, where $\hat{A}(S)$ is the kernel alignment on the data set $S$. For more details on generalization capability, we refer the reader to the recent work in [35], [36], where new refinement techniques that maximize the uncertainty or combine multiple reducts have been proposed to improve the generalization capability of a learning system. For our approach, we can prove the following theorem.
Theorem 2: The kernel alignment of the optimal neighborhood kernel $G^\star$ to $yy^\top$ is higher than that of $K_\nu$.
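As a quick numerical illustration of Theorem 2 (toy labels and multipliers of our own, not the paper's experiments), one can compare the alignments of $K_\nu$ and the closed-form $G^\star$ from (7) with the ideal kernel $yy^\top$:

import numpy as np

def alignment(K1, K2):
    # A(K1, K2) = <K1, K2> / sqrt(<K1, K1> <K2, K2>), Frobenius inner products
    return np.sum(K1 * K2) / np.sqrt(np.sum(K1 * K1) * np.sum(K2 * K2))

# Toy labels, multipliers, and pre-specified kernel (not from the paper).
y = np.array([1.0, -1.0, 1.0])
alpha = np.array([0.5, 0.5, 0.0])
K_nu = np.eye(3)
ay = (alpha * y).reshape(-1, 1)
G_star = K_nu + (ay @ ay.T) / (2.0 * 1.0)    # rho = 1, closed form (7)
ideal = np.outer(y, y)                       # the ideal kernel y y^T
print(alignment(K_nu, ideal), alignment(G_star, ideal))
# about 0.577 vs 0.618: the alignment of G* is higher, consistent with Theorem 2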
$^1$ $\langle A, B\rangle = \sum_i \sum_j a_{ij} b_{ij}$, where $a_{ij}$ and $b_{ij}$ are the entries of $A$ and $B$.