应对噪声、无关与新颖属性：实例学习算法的挑战

需积分: 10 168 浏览量更新于2024-07-25 收藏 1.45MB PDF 举报

"容忍噪音、无关和新颖属性的实例基学习算法" 本文主要探讨了在数据挖掘中的实例基学习算法如何处理噪声、不相关属性以及新颖属性的问题。实例基学习算法，特别是最近邻算法（K-Nearest Neighbor, KNN），在增量学习任务中表现出色，因其快速的学习速度、低更新成本以及在多个应用中取得的高分类准确率而备受青睐。然而，这些算法也存在一些显著的问题，限制了其实际应用。首先，针对最近邻算法的存储需求过大问题，已经有一些修改方法可以显著降低这个问题。然而，这些存储优化的变体对噪声非常敏感。噪声是指数据集中存在的错误或异常值，它们可能导致学习过程出错并影响预测准确性。噪声的存在会干扰算法对模式的识别，从而降低模型的稳健性。其次，实例基学习算法对不相关属性的处理能力较弱。不相关属性是指对学习任务无用或者影响甚微的特征。这些属性可能会增加计算负担，且不提供有用信息，甚至可能引入噪声，导致学习性能下降。因此，有效地识别和处理不相关属性对于提高学习效率和准确性至关重要。最后，最近邻算法的一个基本假设是所有实例都由相同的属性集来描述，这在现实世界的数据集中往往不成立。当后续处理的实例引入了新的、与学习任务相关的属性时，这种刚性会导致算法无法适应。这种灵活性的缺乏限制了算法在动态环境中或面对非结构化数据时的适用性。为了解决这些问题，研究者们提出了一系列策略。针对噪声，可以采用异常检测技术来识别和去除噪声数据，或者使用鲁棒的距离度量来减少噪声的影响。对于不相关属性，特征选择方法如过滤、包裹或嵌入式方法可以帮助识别和去除不相关特征。而对于新颖属性，需要开发适应性强的算法，如在线学习或自适应学习方法，以便于在遇到新属性时能动态调整模型。容忍噪音、无关和新颖属性是提升实例基学习算法性能的关键挑战。通过改进算法的噪声处理机制、优化特征选择流程以及增强算法的适应性，可以提升数据挖掘的效率和效果，使其更好地应用于实际问题中。

INSTANCE-BASED LEARNING ALGORITHMS

271

TABLE

A summary

algorithms IBl and IB2

Framework component IBl IB2

Similarity function

classification function

concept description updater

-Euclidean distance

nearest neighbor

save all instances

-Euclidean distance

nearest neighbor

save only misclassified instances

The simplest IBL algorithm is named IBl (Table 1). Its similarity function is

Similarity@, y) = -

~ygc-z

where x and y are instances in an n-dimensional instance space. IBl is identical to

the nearest neighbor algorithm except for two characteristics. First, it linearly

normalizes all instances’ attribute va1ues.t This ensures that each attribute has the

same (maximal) affect on the similarity function. Second, it uses a simple method

for tolerating missing attribute values; if either of an attribute

a’s

value is missing

during a similarity computation, it is assumed to be maximally different from the

other value. IBl predicts that an instance’s class is the same as that of its most

similar neighbor in the concept description. Finally, IBl saves all processed training

instances in its concept description.

IBl recorded high classification accuracies across a variety of applications, some

of which are summarized in Table 3. Table 2 briefly characterizes the domains and

the experiments. The results for C4, Quinlan’s (1987) extension of ID3 that prunes,

are also presented in Table 3 for comparison purposes. The first five results falsely

suggest that IBl is a relatively robust classifier (in comparison with C4). We see a

clearer picture of IBl’s capabilities in the last two results, in which its accuracy is

relatively poor. This occurred because instances in these data sets are described by

several irrelevant attributes.

IB2 (Table 1) is identical to IBl except that its concept description updater saves

only

misclassified

instances. Table 3 shows that IB2 significantly reduced IBl’s

TABLE

Database characteristics. All results are averaged over 50 trials, except for the last

database (25 trials), and used randomly drawn and separate training and test sets

Database

Train

size

Test

size

Number

attributes

Number

classes

LED display (7 attributes)

200

500

7 10

Waveform

300

500

21 3

Voting

350 85 16 2

Cleveland

250 53 13 2

Hungarian

250

13 2

LED display (24 attributes) 1000 100

24 10

t All of our IBL algorithms apply a (dynamically updated) linear normalization function to all attribute

values before processing. All attribute values are scaled to the range [0, 11.

剩余20页未读，继续阅读

雷神阿拉蕾

粉丝: 0
资源: 1

应对噪声、无关与新颖属性：实例学习算法的挑战

Hi-fi Playback: Tolerating Position Errors in Shift Operations of Racetrack Memory

A Survey of Fault Tolerant Methodologies for FPGAs

spark_3_2_0-master-3.2.3-1.el7.noarch.rpm

浙大城市学院在河南2021-2024各专业最低录取分数及位次表.pdf

第4周玩转案例分析.pdf

基于MATLAB的教室人数统计系统源代码+使用说明，带有丰富的人机交互GUI界面

java-ssm+jsp药品销售网站系统实现源码(项目源码-说明文档)

AI健身体能测试之基于paddlehub实现引体向上计数个数统计源码+模型+视频例子+视频结果文件.zip

CPA《财务成本管理》刘正兵 专题班 资本成本 债务成本的估计+加权平均资本成本.pdf

CPA《财务成本管理》刘正兵 专题班 风险与报酬 资本资产定价模型.pdf

最新资源

CPA《财务成本管理》刘正兵专题班资本成本债务成本的估计+加权平均资本成本.pdf

CPA《财务成本管理》刘正兵专题班风险与报酬资本资产定价模型.pdf