maximize the margin by solving the following optimization
problem with respect to γ:
$$
\begin{aligned}
\min_{\gamma,\,\omega,\,b,\,\xi}\quad & \frac{1}{2}\|\omega\|^{2}+C\sum_{i=1}^{n}\xi_{i}\\
\text{s.t.}\quad & y_{i}\left(\omega^{\top}\phi(x_{i};\gamma)+b\right)\ge 1-\xi_{i},\quad \xi_{i}\ge 0\quad\forall i\\
& \|\gamma\|_{1}=1,\quad \gamma\succeq 0
\end{aligned}
\qquad (1)
$$
where ω is the normal of the separating hyperplane, b is the bias term, and ξ = [ξ_1, …, ξ_n]^⊤ is the vector of slack variables.
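For a fixed weight vector γ, the inner problem of (1) is simply a standard SVM trained on the combined kernel K(γ) = Σ_p γ_p K_p. The following minimal sketch (ours, not the authors' implementation; the helper names, the toy data, and the use of scikit-learn are assumptions made for illustration) makes this concrete.

import numpy as np
from sklearn.svm import SVC

def combine_kernels(base_kernels, gamma):
    # base_kernels: list of n x n Gram matrices K_p; gamma: nonnegative kernel weights.
    return sum(g * K for g, K in zip(gamma, base_kernels))

def train_inner_svm(base_kernels, y, gamma, C=1.0):
    # Solve the inner SVM of (1) for a fixed gamma via a precomputed-kernel SVM.
    K = combine_kernels(base_kernels, gamma)
    clf = SVC(C=C, kernel="precomputed")
    clf.fit(K, y)
    return clf

# Toy usage with two Gaussian base kernels on random data (illustration only).
rng = np.random.RandomState(0)
X = rng.randn(40, 5)
y = np.where(rng.randn(40) > 0, 1, -1)
D2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
base_kernels = [np.exp(-D2 / s) for s in (1.0, 10.0)]
gamma = np.array([0.5, 0.5])        # satisfies the simplex constraint in (1)
clf = train_inner_svm(base_kernels, y, gamma, C=1.0)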
Recent work on MKL has considered incorporating the radius of the MEB into the traditional formulation and demonstrated that doing so helps to achieve better kernel learning performance [31], [32]. The theoretical justification for incorporating the radius lies in the fact that the generalization error bound of SVMs depends on both the margin and the radius of the MEB of the training data [10]. As pointed out in [32], maximizing only the margin with respect to γ leads to scaling and initialization issues. An arbitrarily large margin can be obtained by scaling γ to τγ (τ > 1), which affects the convergence of the optimization problem. Usually, a norm constraint is imposed on γ to address this issue. Nevertheless, identifying an appropriate norm constraint for a given kernel learning task remains an open issue in itself [14], [33], [43], [47], [48]. Moreover, even if a norm constraint is imposed, a good kernel can still be misjudged as a poor one simply because its kernel weight has been downscaled [32]. These issues can be removed or mitigated by incorporating radius information.
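To make the scaling issue concrete, the following short derivation (our own sketch, not quoted from [32]) shows why the margin can be inflated by rescaling γ alone. Since K(γ) = Σ_p γ_p K_p is linear in γ, we have K(τγ) = τK(γ), and one admissible feature map for the scaled weights is φ(x; τγ) = √τ φ(x; γ). If (ω*, b*, ξ*) satisfies the margin constraints in (1) under the weights γ, then (ω*/√τ, b*, ξ*) satisfies them under τγ (ignoring the norm constraint on γ, whose very purpose is to rule out such rescaling), because
$$
y_{i}\left(\frac{{\omega^{*}}^{\top}}{\sqrt{\tau}}\,\sqrt{\tau}\,\phi(x_{i};\gamma)+b^{*}\right)
= y_{i}\left({\omega^{*}}^{\top}\phi(x_{i};\gamma)+b^{*}\right)\ge 1-\xi_{i}^{*}
$$
while the norm term shrinks to ‖ω*‖²/τ. The attainable margin 1/‖ω‖ therefore grows at least by the factor √τ > 1 for any τ > 1, even though the decision function is unchanged.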
The following formulation is adopted by both works in [31] and [32], with an additional ℓ1-norm constraint used in [31]:
$$
\begin{aligned}
\min_{\gamma,\,\omega,\,b,\,\xi}\quad & \frac{1}{2}\|\omega\|^{2}R^{2}+C\sum_{i=1}^{n}\xi_{i}^{q}\\
\text{s.t.}\quad & y_{i}\left(\omega^{\top}\phi(x_{i};\gamma)+b\right)\ge 1-\xi_{i},\quad \xi_{i}\ge 0\quad\forall i;\quad \gamma\succeq 0
\end{aligned}
\qquad (2)
$$
where R² is the squared radius of the MEB and q = 1 or 2. Like the margin, R² is also a function of γ. The work in [31] focuses on approximating the optimization problem in (2) with one that can be solved more efficiently and does not address the scaling issue mentioned earlier. In contrast, the work in [32] solves the optimization problem in (2) directly and carefully discusses how the scaling issue can be addressed. Specifically, a trilevel optimization problem is proposed in that work:
$$
\min_{\gamma}\ J(\gamma)\quad\text{s.t.}\quad \gamma_{p}\ge 0\quad\forall p
\qquad (3)
$$
where
$$
\begin{aligned}
J(\gamma)=\max_{\alpha}\quad & \alpha^{\top}\mathbf{1}-\frac{1}{2R^{2}}(\alpha\circ y)^{\top}K(\gamma)(\alpha\circ y)\\
\text{s.t.}\quad & \alpha^{\top}y=0,\quad 0\le\alpha_{i}\le C\quad\forall i
\end{aligned}
\qquad (4)
$$
$$
\begin{aligned}
R^{2}=\max_{\beta}\quad & \beta^{\top}\operatorname{diag}\left(K(\gamma)\right)-\beta^{\top}K(\gamma)\beta\\
\text{s.t.}\quad & \beta^{\top}\mathbf{1}=1,\quad 0\le\beta_{i}\quad\forall i .
\end{aligned}
\qquad (5)
$$
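To make the two inner levels concrete, here is a minimal sketch (ours, not the authors' code) that solves the QPs in (5) and (4) for a fixed combined kernel K(γ) with the generic QP solver cvxopt; the function names and the choice of solver are assumptions made for illustration.

import numpy as np
from cvxopt import matrix, solvers

solvers.options["show_progress"] = False

def squared_radius(K):
    # Eq. (5): R^2 = max_beta beta'diag(K) - beta'K beta, s.t. sum(beta) = 1, beta >= 0.
    n = K.shape[0]
    P = matrix(2.0 * K)                                    # cvxopt minimizes (1/2) x'Px + q'x
    q = matrix(-np.diag(K))
    G, h = matrix(-np.eye(n)), matrix(np.zeros(n))         # beta >= 0
    A, b = matrix(np.ones((1, n))), matrix([1.0])          # sum(beta) = 1
    beta = np.array(solvers.qp(P, q, G, h, A, b)["x"]).ravel()
    return float(beta @ np.diag(K) - beta @ K @ beta), beta

def j_value(K, y, C, R2):
    # Eq. (4): the SVM dual with the kernel scaled by 1/R^2; returns J(gamma) and alpha.
    n = K.shape[0]
    Q = (np.outer(y, y) * K) / R2
    P, q = matrix(Q), matrix(-np.ones(n))
    G = matrix(np.vstack([-np.eye(n), np.eye(n)]))         # 0 <= alpha_i <= C
    h = matrix(np.hstack([np.zeros(n), C * np.ones(n)]))
    A, b = matrix(y.reshape(1, -1).astype(float)), matrix([0.0])   # alpha'y = 0
    alpha = np.array(solvers.qp(P, q, G, h, A, b)["x"]).ravel()
    return float(alpha.sum() - 0.5 * alpha @ Q @ alpha), alpha

Calling squared_radius on K(γ) and feeding its result to j_value reproduces one pass of the two inner levels; wrapping these calls in an outer update of γ that keeps γ_p ≥ 0 yields the trilevel procedure described next.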
To solve the optimization problem, a trilevel optimization structure is developed accordingly. Specifically, in the first step, R² is computed by solving the quadratic programming (QP) problem in (5) with a given γ. The obtained R² is then substituted into (4), and another QP is solved to calculate J(γ). The last step is to update the kernel weights γ. This procedure is repeated until a stopping criterion is satisfied. Compared with traditional MKL algorithms, an extra QP is introduced and solved at each iteration. This can considerably increase the computational cost of SVM-based MKL, particularly when the size of the training set is large. Worse still, the solution of the optimization problem in (5) is sensitive to outliers. As a result, the obtained R² can become noisy, and this noise will, in turn, affect the optimization of the kernel weights via the trilevel optimization structure. To reduce the sensitivity to outliers, we could simply impose a box constraint on β, and (5) becomes¹
$$
\begin{aligned}
R^{2}=\max_{\beta}\quad & \beta^{\top}\operatorname{diag}\left(K(\gamma)\right)-\beta^{\top}K(\gamma)\beta\\
\text{s.t.}\quad & \beta^{\top}\mathbf{1}=1,\quad 0\le\beta_{i}\le D\quad\forall i
\end{aligned}
\qquad (6)
$$
where D is a regularization parameter. More sophisticated variants can also be adopted by following the idea of support vector data description (SVDD) in [49]. However, these methods bring one more hyperparameter, D, that must be tuned within the trilevel optimization structure. This does not align well with our goal of developing an efficient approach to integrating the radius information into MKL.
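For reference, the only change needed in the earlier sketch for (5) to obtain the box-constrained problem (6) is the pair of inequality constraints, shown below in the generic G x ≤ h form accepted by QP solvers such as cvxopt (D is the regularization parameter introduced above; the helper name is ours).

import numpy as np

def box_constraints(n, D):
    # 0 <= beta_i <= D for all i, written as G @ beta <= h; the equality sum(beta) = 1 is unchanged.
    G = np.vstack([-np.eye(n), np.eye(n)])
    h = np.hstack([np.zeros(n), D * np.ones(n)])
    return G, h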
III. PROPOSED ℓ2-NORM tr(S_T) MKL ALGORITHM
A. Close Relationship Between R² and tr(S_T)
Recall that x_i (i = 1, ..., n) denotes the ith training sample. The total scatter matrix is defined as $S_{T}=\sum_{i=1}^{n}(x_{i}-m)(x_{i}-m)^{\top}$, where $m=(1/n)\sum_{i=1}^{n}x_{i}$ is the sample-based total mean. Although each training sample is implicitly mapped onto a feature space via the kernel trick and S_T in that space is inaccessible, its trace can be explicitly expressed by the kernel function as
$$
\operatorname{tr}(S_{T})=\operatorname{tr}\left(K(\gamma)\right)-\frac{1}{n}\mathbf{1}^{\top}K(\gamma)\mathbf{1}=\sum_{p=1}^{m}\gamma_{p}a_{p}
\qquad (7)
$$
where $a_{p}\triangleq\operatorname{tr}(K_{p})-(1/n)\mathbf{1}^{\top}K_{p}\mathbf{1}$.
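As a small numerical illustration of (7), the following sketch (ours) computes the per-kernel quantities a_p and tr(S_T) from precomputed base Gram matrices; base_kernels and gamma are assumed inputs.

import numpy as np

def a_coefficients(base_kernels):
    # a_p = tr(K_p) - (1/n) 1'K_p 1 for each base kernel K_p.
    n = base_kernels[0].shape[0]
    ones = np.ones(n)
    return np.array([np.trace(Kp) - ones @ Kp @ ones / n for Kp in base_kernels])

def trace_total_scatter(base_kernels, gamma):
    # tr(S_T) = tr(K(gamma)) - (1/n) 1'K(gamma)1, which equals gamma @ a_coefficients(base_kernels).
    K = sum(g * Kp for g, Kp in zip(gamma, base_kernels))
    n = K.shape[0]
    ones = np.ones(n)
    return np.trace(K) - ones @ K @ ones / n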
The close relationship between tr(S_T) and the squared radius of the MEB, R², has been revealed in the literature [44]. Both measure the scattering of samples in a kernel-induced feature space, and tr(S_T) can be shown to be an approximation of R². A detailed analysis of this relationship can be found in [44, Appendix].² In this paper, instead of incorporating the radius of the MEB directly, we incorporate tr(S_T), and the advantages are threefold.
1) In the definition of S_T, each sample is assigned an equal weight when measuring data scattering. This makes tr(S_T) less sensitive to an outlier that significantly deviates from the center of the data cloud. In contrast, such an outlier will become an important support vector
¹ As in SVMs, imposing a box constraint on β is equivalent to introducing a slack variable for each training sample in the primal problem of (5).
² http://users.cecs.anu.edu.au/wanglei/My_papers/FS_CSM_appendix_v02.pdf.