Attribute Weighted Fuzzy c-Means Clustering Algorithm for Incomplete Data Based on Statistical Imputation


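Handling incomplete data sets is a common challenge in modern information technology applications. The paper summarized here presents "An Attribute Weighted Fuzzy c-Means Algorithm for Incomplete Datasets Based on Statistical Imputation". The fuzzy c-means (FCM) clustering algorithm is traditionally used to group unlabeled multivariate data, but it cannot be applied directly when the data set contains missing values. Building on their previous work, the authors use a statistical representation to estimate the missing attribute values, and they add attribute weighting to emphasize the attributes that contribute most to clustering. The aim is to keep clustering accuracy high even when part of the information is missing. The workflow is as follows:

1. **Data preprocessing**: Identify and mark the missing values in the data set, then fill them in using the statistical representation, which infers the unknown values from patterns in the known data.
2. **Attribute weighting**: Assign each attribute a weight according to its importance and its influence on the clustering result. This limits the misleading effect of irrelevant or noisy attributes.
3. **Fuzzy c-means iteration**: Run the weighted fuzzy c-means algorithm on the weighted attribute values. Each data point receives a degree of membership in every cluster, reflecting its similarity to each cluster center.
4. **Clustering performance evaluation**: Validate the algorithm experimentally by comparing it with standard Fuzzy c-Means and other methods for handling missing values, in terms of clustering accuracy and robustness.

The reported experimental results indicate that the proposed algorithm clusters incomplete data well. This matters for applications such as recommender systems, market segmentation, and anomaly detection, which depend on data completeness. In summary, the paper's core contribution is an FCM variant that combines statistical imputation with attribute weighting to cope with missing data, which is particularly relevant for large, high-dimensional data sets that may contain many missing values.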
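The imputation and weighting steps summarized above can be illustrated with a minimal sketch. It is not the paper's method: column-mean imputation stands in for the statistical representation, which this excerpt does not define, and the attribute weights are arbitrary illustrative values. The function names `impute_column_means` and `weighted_sq_distance` are hypothetical, and missing values are assumed to be encoded as `None`.

```python
def impute_column_means(rows):
    """Fill None entries with the mean of the observed values in the same column.

    Assumption: a simple stand-in for the paper's statistical representation.
    Columns with no observed value at all are not handled.
    """
    n_cols = len(rows[0])
    means = []
    for j in range(n_cols):
        observed = [r[j] for r in rows if r[j] is not None]
        means.append(sum(observed) / len(observed))
    return [[means[j] if r[j] is None else r[j] for j in range(n_cols)]
            for r in rows]


def weighted_sq_distance(x, v, w):
    """Attribute-weighted squared distance: sum_j w_j * (x_j - v_j)^2."""
    return sum(wj * (xj - vj) ** 2 for xj, vj, wj in zip(x, v, w))


# Toy data: three objects with two attributes, two values missing.
rows = [[1.0, 2.0], [None, 4.0], [3.0, None]]
filled = impute_column_means(rows)

# Illustrative weights only; the paper's weighting scheme is not shown here.
weights = [0.75, 0.25]
d2 = weighted_sq_distance(filled[2], [1.0, 2.0], weights)
```

With weights that differ across attributes, a deviation in a heavily weighted attribute moves an object further from a prototype than the same deviation in a lightly weighted one, which is the effect attribute weighting is meant to achieve.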

An Attribute Weighted Fuzzy c-Means Algorithm for Incomplete Datasets Based on Statistical Imputation

Dan Li

School of Control Science and Engineering

Dalian University of Technology

Dalian, China

ldan@dlut.edu.cn

Chongquan Zhong

School of Control Science and Engineering

Dalian University of Technology

Dalian, China

zhongcq@dlut.edu.cn

Abstract—The problem of missing data is frequently encountered in real-world applications. In this paper, an attribute weighted fuzzy c-means algorithm for incomplete data sets is presented. The statistical representation proposed in our previous work is used here to impute the missing attribute values, and attribute weighting is applied to emphasize the contribution of important attributes. Experimental results indicate that the proposed approach has good clustering performance.

Keywords-fuzzy clustering; incomplete data; attribute weighted; statistical imputation

I. INTRODUCTION

Fuzzy clustering is an effective technique in pattern recognition: it partitions a collection of multivariate data into meaningful groups to reveal the structure of a data set. In real-world applications, many data sets contain missing values, and most clustering algorithms, including the widely used fuzzy c-means (FCM) algorithm [1], cannot handle incomplete data sets directly.

Over the past decades, numerous approaches to clustering incomplete data have been developed. In 2001, Hathaway proposed four strategies for applying FCM clustering to incomplete data [2]: the Whole Data Strategy (WDS), Partial Distance Strategy (PDS), Optimal Completion Strategy (OCS), and Nearest Prototype Strategy (NPS). By taking into account information about why data are missing, Timm developed a fuzzy clustering algorithm that extends the Gath-Geva algorithm [3]. Honda partitioned incomplete data sets into several linear fuzzy clusters [4]. In addition, Li put forward an FCM algorithm based on nearest-neighbor intervals for incomplete data [5], and Lim and Kiong proposed an autonomous and deterministic method for clustering data sets with missing values [6].
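To make one of these strategies concrete, the partial distance used in PDS is commonly computed over the observed attributes only and then rescaled by the ratio of the total number of attributes to the number observed. The sketch below assumes missing values are encoded as `None`; the function name `partial_sq_distance` is hypothetical and not taken from the cited work.

```python
def partial_sq_distance(x, v):
    """Partial squared distance between datum x and prototype v.

    Only attributes observed in x contribute; the sum is rescaled by
    s / (number of observed attributes), where s is the dimensionality.
    """
    observed = [(xj, vj) for xj, vj in zip(x, v) if xj is not None]
    if not observed:
        raise ValueError("datum has no observed attributes")
    s = len(x)
    return (s / len(observed)) * sum((xj - vj) ** 2 for xj, vj in observed)


prototype = [0.0, 0.0, 0.0]
d_full = partial_sq_distance([1.0, 2.0, 3.0], prototype)   # all attributes observed
d_part = partial_sq_distance([1.0, None, 3.0], prototype)  # second attribute missing
```

For a complete datum the scale factor is 1, so the partial distance reduces to the ordinary squared Euclidean distance.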

In this paper, we describe the development of an attribute weighted FCM algorithm for incomplete data. Section II introduces the FCM algorithm and FCM-based clustering algorithms for incomplete data. Section III presents the statistical imputation of missing attribute values and the proposed attribute weighted FCM algorithm for incomplete data sets. Section IV presents clustering results and a comparative study of the proposed algorithm against other methods. Finally, conclusions are drawn in Section V.

II. FUZZY C-MEANS ALGORITHMS FOR INCOMPLETE DATA CLUSTERING

A. Fuzzy c-Means Algorithm

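Let $X = \{x_1, x_2, \ldots, x_n\} \subset \mathbb{R}^s$ be a set of $s$-dimensional complete data. The fuzzy c-means (FCM) algorithm partitions $X$ into $c$ clusters that are characterized by prototypes $V = [v_1, v_2, \ldots, v_c]$. The FCM algorithm performs clustering by minimizing the objective function

$$J(U, V) = \sum_{i=1}^{c} \sum_{k=1}^{n} u_{ik}^{m} \, \|x_k - v_i\|^2, \tag{1}$$

where $x_k = [x_{1k}, x_{2k}, \ldots, x_{sk}]^T$ is an object datum; $U = [u_{ik}] \in \mathbb{R}^{c \times n}$ is a partition matrix with $\forall i, k: u_{ik} \in [0, 1]$ and $\sum_{i=1}^{c} u_{ik} = 1$ for every $k$; $m$ is a fuzzification parameter, $m \in (1, \infty)$; and $\|\cdot\|$ denotes the Euclidean norm.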
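As a small worked example of objective (1), the sketch below evaluates it in plain Python. The function name `fcm_objective` and the toy data are assumptions for illustration; `U[i][k]` holds the membership of datum `k` in cluster `i`, matching the $c \times n$ layout of the partition matrix.

```python
def fcm_objective(X, V, U, m=2.0):
    """Evaluate J(U, V) = sum_i sum_k u_ik^m * ||x_k - v_i||^2."""
    total = 0.0
    for i, v in enumerate(V):
        for k, x in enumerate(X):
            # Squared Euclidean distance between datum k and prototype i.
            d2 = sum((xj - vj) ** 2 for xj, vj in zip(x, v))
            total += U[i][k] ** m * d2
    return total


# Two data points, each coinciding with one prototype.
X = [[0.0, 0.0], [2.0, 0.0]]
V = [[0.0, 0.0], [2.0, 0.0]]

# Maximally fuzzy memberships versus a crisp assignment.
J_fuzzy = fcm_objective(X, V, [[0.5, 0.5], [0.5, 0.5]])
J_crisp = fcm_objective(X, V, [[1.0, 0.0], [0.0, 1.0]])
```

Here the crisp assignment gives a lower objective than the uniform one, because each point sits exactly on its own prototype.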

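FCM uses the Lagrange multiplier method, and the necessary conditions for minimizing (1) are [1]:

$$v_i = \frac{\sum_{k=1}^{n} u_{ik}^{m} x_k}{\sum_{k=1}^{n} u_{ik}^{m}}, \quad \text{for } i = 1, 2, \ldots, c \tag{2}$$

and

$$u_{ik} = \left[ \sum_{t=1}^{c} \left( \frac{\|x_k - v_i\|^2}{\|x_k - v_t\|^2} \right)^{\frac{1}{m-1}} \right]^{-1}. \tag{3}$$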
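A brief sketch of how the Lagrange multiplier method leads to conditions (2) and (3) may help. It follows the standard argument for FCM rather than anything specific to this paper. The shorthand $d_{ik} = \|x_k - v_i\|$ and the multipliers $\lambda_k$ are notation assumed here, and $d_{ik} > 0$ is assumed throughout.

$$
\begin{aligned}
L(U, V, \lambda) &= \sum_{i=1}^{c} \sum_{k=1}^{n} u_{ik}^{m} d_{ik}^{2} - \sum_{k=1}^{n} \lambda_k \Big( \sum_{i=1}^{c} u_{ik} - 1 \Big) \\
\frac{\partial L}{\partial u_{ik}} &= m \, u_{ik}^{m-1} d_{ik}^{2} - \lambda_k = 0
\;\Rightarrow\; u_{ik} = \Big( \frac{\lambda_k}{m \, d_{ik}^{2}} \Big)^{\frac{1}{m-1}} \\
\sum_{t=1}^{c} u_{tk} = 1 &\;\Rightarrow\; \Big( \frac{\lambda_k}{m} \Big)^{\frac{1}{m-1}} = \Big[ \sum_{t=1}^{c} d_{tk}^{-\frac{2}{m-1}} \Big]^{-1} \\
u_{ik} &= \Big[ \sum_{t=1}^{c} \Big( \frac{d_{ik}^{2}}{d_{tk}^{2}} \Big)^{\frac{1}{m-1}} \Big]^{-1} \\
\frac{\partial L}{\partial v_i} &= -2 \sum_{k=1}^{n} u_{ik}^{m} (x_k - v_i) = 0
\;\Rightarrow\; v_i = \frac{\sum_{k=1}^{n} u_{ik}^{m} x_k}{\sum_{k=1}^{n} u_{ik}^{m}}
\end{aligned}
$$

The multiplier $\lambda_k$ enforces the constraint that the memberships of datum $x_k$ sum to one; eliminating it yields the membership update, while the prototype update follows from the unconstrained derivative with respect to $v_i$.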

The procedure of FCM is to optimize the clustering objective function (1) by alternating optimization (AO), that is, the minimization steps (2) and (3) are repeated until the change in memberships and/or prototypes drops below a certain threshold
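The alternating optimization described above can be sketched in plain Python as follows. Several details are assumptions rather than the paper's choices: the loop starts from caller-supplied initial prototypes, stops when the largest prototype coordinate change falls below `tol`, and floors squared distances at a tiny constant to avoid division by zero when a datum coincides with a prototype. The function names are hypothetical.

```python
def memberships(X, V, m):
    """Membership update (3); returns U with U[i][k] for cluster i, datum k."""
    c, n = len(V), len(X)
    # Squared distances, floored so a datum sitting on a prototype is safe.
    D2 = [[max(sum((xj - vj) ** 2 for xj, vj in zip(x, v)), 1e-12) for x in X]
          for v in V]
    p = 1.0 / (m - 1.0)
    return [[1.0 / sum((D2[i][k] / D2[t][k]) ** p for t in range(c))
             for k in range(n)] for i in range(c)]


def prototypes(X, U, m):
    """Prototype update (2): weighted means with weights u_ik^m."""
    n, s = len(X), len(X[0])
    V = []
    for row in U:
        wts = [u ** m for u in row]
        total = sum(wts)
        V.append([sum(wts[k] * X[k][j] for k in range(n)) / total
                  for j in range(s)])
    return V


def fcm(X, V0, m=2.0, tol=1e-6, max_iter=100):
    """Alternate updates (3) and (2) until the prototypes stop moving."""
    V = [list(v) for v in V0]
    for _ in range(max_iter):
        U = memberships(X, V, m)
        V_new = prototypes(X, U, m)
        shift = max(abs(a - b) for vn, vo in zip(V_new, V)
                    for a, b in zip(vn, vo))
        V = V_new
        if shift < tol:
            break
    return memberships(X, V, m), V


# Two well-separated pairs of points and rough initial prototypes.
X = [[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]]
U, V = fcm(X, [[0.0, 0.0], [10.0, 10.0]])
```

On this toy data each prototype settles near the midpoint of its pair, and every datum's memberships sum to one across the two clusters.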

2015 7th International Conference on Intelligent Human-Machine Systems and Cybernetics

DOI 10.1109/IHMSC.2015.128

