A Three-Way Decisions Clustering Algorithm for Incomplete Data Based on Attribute Significance and Miss Rate


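This research paper presents a three-way decisions clustering algorithm for incomplete data (A Three-Way Decisions Clustering Algorithm). Real-world data sets often contain missing values because data acquisition is difficult or restricted and because of random noise. As a result, many traditional clustering methods cannot be applied directly to incomplete data sets, so the paper proposes a new processing strategy. The authors focus on how to use attribute significance and miss rate to design a more adaptable clustering method. The three-way decisions approach uses interval sets to divide a cluster naturally into a positive region, a boundary region, and a negative region. This division suits soft clustering, in which the assignment of a data point to a cluster may be uncertain.

First, the authors use domain knowledge to divide the data set into four parts: sufficient data, valuable data, inadequate data, and invalid data. Sufficient data contains relatively complete and important features. Valuable data has missing values but is still judged, from domain knowledge, to carry useful information. Inadequate data has many missing features but may be completed or down-weighted by other means. Invalid data may be removed because it is of low quality or carries no meaning.

Next, the algorithm applies decision theory together with the properties of interval sets to process each type of data separately. Sufficient data can be clustered directly with standard methods; valuable data is preprocessed by filling in or imputing missing values; inadequate data may need dimensionality reduction or feature selection to reduce sensitivity to missing values; and invalid data is deleted or replaced with values such as the mean to keep the algorithm stable and accurate. During clustering, the algorithm adjusts the assignment of each data point according to attribute significance and miss rate, and it takes the uncertain boundary region into account, which makes the clustering result more flexible in practical settings. The method aims to overcome the difficulties caused by incomplete data and to improve the accuracy and robustness of cluster analysis.

In summary, the core contribution of the paper is a three-way decisions clustering algorithm based on attribute significance and miss rate. The algorithm targets incomplete data sets and uses interval sets and decision theory to distinguish and cluster the different types of data.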
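The split of one cluster into positive, boundary, and negative regions can be illustrated with a minimal sketch. It assumes each object already has a membership degree in [0, 1] for the cluster and that two thresholds decide the region; the function name `three_way_assign` and the thresholds `alpha` and `beta` are hypothetical and are not taken from the paper.

```python
from typing import Dict, List, Tuple


def three_way_assign(
    memberships: Dict[str, float],
    alpha: float = 0.7,  # assumed acceptance threshold, not from the paper
    beta: float = 0.3,   # assumed rejection threshold, not from the paper
) -> Tuple[List[str], List[str], List[str]]:
    """Split objects into the positive, boundary and negative regions of one cluster."""
    if not 0.0 <= beta < alpha <= 1.0:
        raise ValueError("thresholds must satisfy 0 <= beta < alpha <= 1")
    positive: List[str] = []
    boundary: List[str] = []
    negative: List[str] = []
    for obj, degree in memberships.items():
        if degree >= alpha:
            positive.append(obj)   # definitely belongs to the cluster
        elif degree > beta:
            boundary.append(obj)   # might belong to the cluster
        else:
            negative.append(obj)   # definitely does not belong
    return positive, boundary, negative


if __name__ == "__main__":
    example = {"x1": 0.9, "x2": 0.5, "x3": 0.1}
    print(three_way_assign(example))
    # → (['x1'], ['x2'], ['x3'])
```

With `alpha` equal to `beta` the boundary region would vanish, which corresponds to a two-way (hard) decision; the sketch requires `beta < alpha` so that a boundary region can exist.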

A Three-Way Decisions Clustering Algorithm for Incomplete Data 767

inadequate data and invalid data, according to the domain knowledge about the attribute significance and miss rate. Then, for sufficient data, the weighted distance between two incomplete objects and similar value estimation formula are defined, and a grid-based method is proposed to obtain an initial clustering result. For other types, the distance and membership between object and cluster are defined respectively, and three-way decisions rules are used to obtain the final clustering result. The experimental results on some data sets show preliminarily the effectiveness of the proposed algorithm.
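The paper's own weighted distance formula is not reproduced on this page. As a rough illustration of the idea, the sketch below uses a common substitute: a weighted Euclidean distance over the attributes observed in both objects, rescaled by the share of total weight those attributes carry. The function name `weighted_partial_distance`, the use of `None` for a missing value, and the rescaling rule are all assumptions.

```python
import math
from typing import Optional, Sequence


def weighted_partial_distance(
    x: Sequence[Optional[float]],
    y: Sequence[Optional[float]],
    w: Sequence[float],
) -> Optional[float]:
    """Weighted Euclidean distance over attributes present in both x and y.

    Returns None when the two objects share no observed attribute.
    This is an assumed stand-in, not the formula defined in the paper.
    """
    if not (len(x) == len(y) == len(w)):
        raise ValueError("x, y and w must have the same length")
    squared_sum = 0.0
    shared_weight = 0.0
    for xv, yv, wv in zip(x, y, w):
        if xv is None or yv is None:
            continue  # skip attributes missing in either object
        squared_sum += wv * (xv - yv) ** 2
        shared_weight += wv
    if shared_weight == 0.0:
        return None
    # Rescale so that objects with fewer shared attributes are comparable.
    return math.sqrt(squared_sum * sum(w) / shared_weight)


if __name__ == "__main__":
    a = [3, 2, None]
    b = [0, 2, 5]
    print(weighted_partial_distance(a, b, [1.0, 1.0, 1.0]))
```

In the usage example only the first two attributes are shared, so the squared sum 9 is scaled by 3/2 before the square root is taken.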

2 Classify an Incomplete Data Set

2.1 Representation of Clustering

To define our framework, let a universe be $U = \{x_1, \cdots, x_N\}$, and the clustering result is $C = \{C_1, \cdots, C_K\}$, which is a family of clusters of the universe.

The universe can be represented as an information system $S = (U, A, V, F, W)$. $U = \{x_1, x_2, \cdots, x_N\}$ and $A = \{a_1, \cdots, a_D\}$ are finite nonempty sets of objects and attributes respectively. $V = \{V_1, \cdots, V_D\}$ is the set of possible attribute values, $V_d$ is the possible attribute values of $a_d$, $f$ is an information function, $f : V_d = f(x_n^d) \in V_d$. $W = \{w_1, \cdots, w_D\}$ is a set of attribute weights, $w_d$ is the weight of $a_d$. The $x_n$ is an object which has $D$ attributes, namely, $x_n = (x_n^1, \cdots, x_n^D)$. The $x_n^d$ denotes the value of the $d$th attribute of object $x_n$, where $n \in \{1, \cdots, N\}$, and $d \in \{1, \cdots, D\}$.

When there are some missing values, the information system $S$ will be an incomplete information system. Table 1 shows an example, which contains 10 objects, and each object has 9 attributes. The missing value is expressed by the symbol ∗.

Table 1. An Incomplete Information System

U     a_1  a_2  a_3  a_4  a_5  a_6  a_7  a_8  a_9
x_1   3    2    1    25   5    1    ∗    9    ∗
x_2   2    ∗    8    15   4    2    4    6    9
x_3   ∗    ∗    ∗    ∗    ∗    6    5    ∗    10
x_4   2    ∗    ∗    23   7    5    ∗    ∗    ∗
x_5   ∗    8    9    20   4    7    5    6    ∗
x_6   ∗    ∗    ∗    ∗    ∗    5    8    6    9
x_7   ∗    ∗    ∗    19   20   2    4    9    4
x_8   2    3    9    ∗    ∗    ∗    3    4    6
x_9   3    2    1    25   5    ∗    ∗    ∗    2
x_10  3    5    ∗    ∗    4    ∗    ∗    ∗    ∗
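Table 1 can be held in memory and its miss rates computed directly, as the following sketch shows. It assumes the ten rows are objects x_1 to x_10 in order, uses `None` in place of ∗, and defines the miss rate as the fraction of missing entries; the names `TABLE_1`, `object_missing_rates`, and `attribute_missing_rates` are hypothetical.

```python
from typing import List, Optional

Row = List[Optional[int]]

# Table 1, one row per object; None stands for the missing-value symbol.
TABLE_1: List[Row] = [
    [3, 2, 1, 25, 5, 1, None, 9, None],
    [2, None, 8, 15, 4, 2, 4, 6, 9],
    [None, None, None, None, None, 6, 5, None, 10],
    [2, None, None, 23, 7, 5, None, None, None],
    [None, 8, 9, 20, 4, 7, 5, 6, None],
    [None, None, None, None, None, 5, 8, 6, 9],
    [None, None, None, 19, 20, 2, 4, 9, 4],
    [2, 3, 9, None, None, None, 3, 4, 6],
    [3, 2, 1, 25, 5, None, None, None, 2],
    [3, 5, None, None, 4, None, None, None, None],
]


def object_missing_rates(table: List[Row]) -> List[float]:
    """Fraction of missing attribute values for each object (row)."""
    return [sum(v is None for v in row) / len(row) for row in table]


def attribute_missing_rates(table: List[Row]) -> List[float]:
    """Fraction of objects with a missing value for each attribute (column)."""
    columns = list(zip(*table))  # transpose rows into columns
    return [sum(v is None for v in col) / len(col) for col in columns]


if __name__ == "__main__":
    print(attribute_missing_rates(TABLE_1))
    # → [0.4, 0.5, 0.5, 0.4, 0.3, 0.3, 0.4, 0.4, 0.4]
```

Objects x_3 and x_10 have the highest per-object miss rate here (6 of 9 values missing), which is the kind of information a partition by miss rate would rely on.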

We can look at the cluster analysis problem from the view of decision making. For a hard clustering, it is a typical two-way decisions in some sense; and for a soft clustering, it is a kind of three-way decisions. The positive decisions decide objects into the positive region of a cluster definitely, the negative decisions
