Journal of Computer Research and Development (计算机研究与发展), ISSN 1000-1239/CN 11-1777/TP
45(7): 1189-1194, 2008
Received: 2007-08-15; Revised: 2008-01-09
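Funding: Natural Science Foundation of Jiangsu Province (BK2006095); Specialized Research Fund for the Doctoral Program of Higher Education, Ministry of Education (20040286009)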
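Weighted Subspace Outlier Detection Algorithm Based on Local Information Entropy
Ni Weiwei (倪巍伟)¹, Chen Geng (陈耿)², Lu Jieping (陆介平)³, Wu Yingjie (吴英杰)¹, Sun Zhihui (孙志挥)¹
¹ School of Computer Science and Engineering, Southeast University, Nanjing 210096
² Laboratory of Audit Information Engineering, Nanjing Audit University, Nanjing 210029
³ Zhenjiang Science and Technology Bureau, Zhenjiang, Jiangsu 212002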
niww 2007 yahoo com cn
Local Entropy Based Weighted Subspace Outlier Mining Algorithm
Ni Weiwei¹, Chen Geng², Lu Jieping³, Wu Yingjie¹, and Sun Zhihui¹
¹ College of Computer Science and Engineering, Southeast University, Nanjing 210096
² Laboratory of Audit Information Engineering, Nanjing Audit University, Nanjing 210029
³ Zhenjiang Science and Technology Bureau of Jiangsu Province, Zhenjiang, Jiangsu 212002
Abstract  Outlier mining has become a hot issue in the field of data mining. Its goal is to find exceptional objects that deviate from most of the rest of the data set. However, as dimensionality increases, some unusual characteristics appear: the spatial distribution of the data and distances in the full attribute space are no longer meaningful, a phenomenon called the curse of dimensionality. The curse of dimensionality degrades the validity of many existing outlier detection algorithms. To address this problem, a local entropy based weighted subspace outlier mining algorithm, SPOD, is proposed. It generates the outlier subspace and weighted attribute vector of each data object by analyzing the entropy of each attribute over the neighborhood of that object. For a given data object, the outlier attributes that constitute its outlier subspace are assigned larger weights. Furthermore, definitions such as the subspace weighted distance are introduced to perform density-based outlier processing on the data set and obtain each data point's subspace outlier influence factor. The larger this factor is, the more likely the corresponding data point is an outlier. Theoretical analysis and experimental results show that SPOD is suitable for high-dimensional data sets and is efficient and effective.
Key words  high dimensional data; outlier detection; information entropy; subspace mining; weighted vector
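Abstract
As an important research direction in data mining, outlier detection discovers, from large volumes of data, the small number of data objects that differ markedly from the majority. The curse of dimensionality renders many existing outlier detection algorithms ineffective on high-dimensional data. To address this problem, a local information entropy based weighted subspace outlier detection algorithm, SPOD, is proposed. By analyzing the neighborhood information entropy of each data object in every dimension, it generates the object's outlier subspace and attribute weight vector, assigning higher weights to the attributes in the outlier subspace. It further introduces concepts such as the subspace weighted distance and, following the idea of density-based outlier detection, computes each data object's subspace outlier influence factor to decide whether the object is an outlier. The algorithm adapts effectively to outlier detection in high-dimensional data. Theoretical analysis and experimental results show that the algorithm is effective and feasible.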
Key words  high-dimensional data; outlier detection; information entropy; subspace mining; weight vector
Chinese Library Classification number: TP311
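The pipeline summarized in the abstract (per-attribute neighborhood entropy, a per-object weight vector, a weighted distance, and a density-based score) can be illustrated with a minimal Python sketch. It is not the SPOD algorithm itself, since this section does not give the formal definitions. The names `attribute_entropy`, `attribute_weights`, `weighted_distance`, and `outlier_scores`, the equal-width binning, the rule that an attribute is an outlier attribute when its local entropy exceeds the mean local entropy, the weight value `high=2.0`, and the simplified LOF-style ratio standing in for the subspace outlier influence factor are all assumptions made for illustration.

```python
import math
from collections import Counter

def attribute_entropy(values, bins, lo, hi):
    """Shannon entropy (bits) of values over equal-width bins on [lo, hi]."""
    if hi == lo:
        return 0.0
    width = (hi - lo) / bins
    counts = Counter(min(int((v - lo) / width), bins - 1) for v in values)
    n = len(values)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

def knn(data, i, k):
    """Indices of the k nearest neighbours of point i (full-space Euclidean)."""
    order = sorted((math.dist(data[i], p), j)
                   for j, p in enumerate(data) if j != i)
    return [j for _, j in order[:k]]

def attribute_weights(data, i, k, bins=4, high=2.0):
    """Assumed rule: attributes whose local entropy exceeds the mean local
    entropy are treated as outlier attributes and weighted `high`, else 1.0."""
    hood = [data[j] for j in knn(data, i, k)] + [data[i]]
    entropies = []
    for d in range(len(data[i])):
        column = [p[d] for p in data]
        entropies.append(attribute_entropy([p[d] for p in hood], bins,
                                           min(column), max(column)))
    mean_entropy = sum(entropies) / len(entropies)
    return [high if e > mean_entropy else 1.0 for e in entropies]

def weighted_distance(x, y, w):
    """Euclidean distance with one weight per attribute."""
    return math.sqrt(sum(wd * (a - b) ** 2 for wd, a, b in zip(w, x, y)))

def mean_knn_distance(data, i, k):
    """Mean weighted distance from point i to its k neighbours."""
    w = attribute_weights(data, i, k)
    return sum(weighted_distance(data[i], data[j], w)
               for j in knn(data, i, k)) / k

def outlier_scores(data, k=3):
    """Simplified LOF-style ratio (a stand-in, not the paper's factor):
    a point's mean neighbour distance over that of its neighbours.
    Assumes len(data) > k."""
    spread = [mean_knn_distance(data, i, k) for i in range(len(data))]
    scores = []
    for i in range(len(data)):
        nbr_mean = sum(spread[j] for j in knn(data, i, k)) / k
        scores.append(spread[i] / nbr_mean if nbr_mean > 0 else float("inf"))
    return scores

# A 3x3 grid cluster plus one point far away along the first attribute.
points = [(float(x), float(y)) for x in range(3) for y in range(3)]
points.append((10.0, 1.0))
scores = outlier_scores(points, k=3)
most_outlying = max(range(len(points)), key=scores.__getitem__)
```

On this toy data the isolated point receives the largest score, matching the abstract's statement that a larger factor indicates a more likely outlier. In this sketch the neighborhoods are found in the full attribute space and only the distances inside the score are weighted, a simplification chosen to keep the example short.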