A Fast Clustering Algorithm for Large Categorical Data Sets

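"This paper introduces a clustering algorithm called k-modes, an extension of the classic k-means algorithm that is particularly suited to data sets containing large amounts of categorical data. Although k-means is highly efficient on numeric data, it falls short in mining tasks that involve categorical data. The k-modes algorithm introduces new dissimilarity measures to handle categorical objects, uses modes instead of means to represent cluster centres, and uses a frequency-based method to update and optimise these modes."

In data mining, clustering is a fundamental operation whose goal is to partition a large set of objects into clusters with high internal similarity. The k-means algorithm is widely used on large data sets because of its efficiency: it iterates so as to minimise the distance between the objects in each cluster and that cluster's centre (the mean). However, k-means applies only to numeric data and cannot handle categorical data directly, while real data sets often contain many categorical attributes.

The k-modes algorithm addresses this with a dissimilarity measure designed for categorical data. Instead of the Euclidean distance used by k-means, it uses a category-based dissimilarity such as simple matching, that is, the number of attributes on which two objects take different categories (a Hamming-style distance). The centre of a cluster is no longer a numeric mean but a mode: for each attribute, the category that occurs most frequently in the cluster. This reflects the nature of categorical data better, because categories usually have no order or continuity. The algorithm proceeds as follows:

1. Initialisation: select k initial modes as the cluster centres.
2. Allocation: assign each object to the cluster whose mode is least dissimilar to it.
3. Update: recompute the mode of each cluster, taking for each attribute the most frequent category among the objects in that cluster.
4. Iteration: repeat the allocation and update steps until the modes no longer change or a preset number of iterations is reached.

The strength of k-modes is that it handles categorical data effectively, and in some cases its efficiency is close to that of k-means. It also has drawbacks: it may be sensitive to outliers, and it may need a large amount of memory on large data sets. Strategies such as early-stopping rules, approximation methods or sampling can be used to improve performance.

The k-modes algorithm offers a practical solution for data sets with many categorical attributes and widens the scope of cluster analysis, so that data mining can be applied to more kinds of data. In practice, combining k-means and k-modes, or pairing them with other clustering algorithms such as DBSCAN or spectral clustering, may further improve clustering quality and efficiency.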

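The four steps listed in the summary can be sketched in Python as follows. This is a minimal batch-style illustration under stated assumptions, not necessarily the procedure used in the paper: the function names (`matching_dissimilarity`, `cluster_mode`, `k_modes`), the random choice of distinct objects as initial modes, and the toy records are all chosen here for illustration.

```python
from collections import Counter
import random


def matching_dissimilarity(a, b):
    """Count the attributes on which two categorical objects differ."""
    return sum(x != y for x, y in zip(a, b))


def cluster_mode(objects):
    """Return the per-attribute most frequent category of a non-empty cluster."""
    return tuple(Counter(column).most_common(1)[0][0] for column in zip(*objects))


def k_modes(data, k, max_iter=100, seed=0):
    """Cluster categorical tuples; returns (modes, labels)."""
    rng = random.Random(seed)
    # Assumption: the initial modes are k distinct objects drawn at random.
    distinct = list(dict.fromkeys(data))
    modes = rng.sample(distinct, k)
    labels = None
    for _ in range(max_iter):
        # Allocation: each object joins the cluster of its nearest mode.
        new_labels = [
            min(range(k), key=lambda j: matching_dissimilarity(obj, modes[j]))
            for obj in data
        ]
        if new_labels == labels:
            break  # no object changed cluster, so the modes are stable
        labels = new_labels
        # Update: recompute each mode from the members of its cluster.
        for j in range(k):
            members = [obj for obj, lab in zip(data, labels) if lab == j]
            if members:  # an empty cluster keeps its previous mode
                modes[j] = cluster_mode(members)
    return modes, labels


# Hypothetical toy records with three categorical attributes each.
records = [
    ("a", "x", "p"), ("a", "x", "q"), ("a", "y", "p"),
    ("b", "z", "r"), ("b", "z", "s"), ("c", "z", "r"),
]
modes, labels = k_modes(records, k=2)
print(modes, labels)
```

Different initial modes can lead to different partitions, so a sketch like this would typically be run from several seeds and the result with the lowest total dissimilarity kept.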
A Fast Clustering Algorithm to Cluster Very Large Categorical Data Sets in Data Mining

Zhexue Huang

The author wishes to acknowledge that this work was carried out within the Cooperative Research Centre for Advanced Computational Systems (ACSys) established under the Australian Government’s Cooperative Research Centres Program.

Cooperative Research Centre for Advanced Computational Systems

CSIRO Mathematical and Information Sciences

GPO Box 664, Canberra 2601, AUSTRALIA

email:Zhexue.Huang@cmis.csiro.au

Abstract

Partitioning a large set of objects into homogeneous clusters is a fundamental operation in data mining. The k-means algorithm is best suited for implementing this operation because of its efficiency in clustering large data sets. However, working only on numeric values limits its use in data mining because data sets in data mining often contain categorical values. In this paper we present an algorithm, called k-modes, to extend the k-means paradigm to categorical domains. We introduce new dissimilarity measures to deal with categorical objects, replace means of clusters with modes, and use a frequency based method to update modes in the clustering process to minimise the clustering cost function. Tested with the well known soybean disease data set, the algorithm has demonstrated a very good classification performance. Experiments on a very large health insurance data set consisting of half a million records and 34 categorical attributes show that the algorithm is scalable in terms of both the number of clusters and the number of records.

1 Introduction

Partitioning a set of objects into homogeneous clusters is a fundamental operation in data mining. The operation is needed in a number of data mining tasks, such as unsupervised classification and data summation, as well as segmentation of large heterogeneous data sets into smaller homogeneous subsets that can be easily managed, separately modelled and analysed. Clustering is a popular approach used to implement this operation. Clustering methods partition a set of objects into clusters such that objects in the same cluster are more similar to each other than objects in different clusters according to some defined criteria. Statistical clustering methods (Anderberg 1973, Jain and Dubes 1988) use similarity measures to partition objects whereas conceptual clustering methods cluster objects according to the concepts objects carry (Michalski and Stepp 1983, Fisher 1987).

The most distinct characteristic of data mining is that it deals with very large data sets (gigabytes or even terabytes). This requires the algorithms used in data mining to be scalable. However, most algorithms currently used in data mining do not scale well when applied to very large data sets because they were initially developed for applications other than data mining, which involve small data sets. The study of scalable data mining algorithms has recently become a data mining research focus (Shafer et al. 1996).

In this paper we present a fast clustering algorithm used to cluster categorical data. The algorithm, called k-modes, is an extension to the well known k-means algorithm (MacQueen 1967). Compared to other clustering methods the k-means algorithm and its variants (Anderberg 1973) are efficient in clustering large data sets, thus very suitable for data mining. However, their use is often limited to numeric data because these algorithms minimise a cost function by calculating the means of clusters. Data mining applications frequently involve categorical data. The traditional approach to converting categorical data into numeric values does not necessarily produce meaningful results in the case where categorical domains are not ordered. The k-modes algorithm in this paper removes this limitation and extends the k-means paradigm to categorical domains whilst preserving the efficiency of the k-means algorithm.
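A small Python illustration may help show why numeric coding can mislead when a categorical domain has no order. The colour attribute and its integer coding below are hypothetical, chosen only for this sketch: once coded, "red" and "blue" appear four times as far apart as "red" and "green", although nothing in the data justifies that, whereas counting mismatches treats every pair of distinct categories alike.

```python
def squared_euclidean(a, b):
    """Squared Euclidean distance between two numeric vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def simple_matching(a, b):
    """Number of attributes on which two categorical objects differ."""
    return sum(x != y for x, y in zip(a, b))


# Hypothetical integer coding of an unordered attribute.
code = {"red": 0, "green": 1, "blue": 2}

pairs = [("red", "green"), ("green", "blue"), ("red", "blue")]

# Distances after numeric coding depend on the arbitrary code values.
numeric = {p: squared_euclidean([code[p[0]]], [code[p[1]]]) for p in pairs}

# Mismatch counts are the same for every pair of distinct categories.
matching = {p: simple_matching([p[0]], [p[1]]) for p in pairs}

print(numeric)
print(matching)
```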

In (Huang 1997) we have proposed an algorithm, called k-prototypes, to cluster large data sets with mixed numeric and categorical values. In the k-prototypes algorithm we define a dissimilarity measure that takes into account both numeric and categorical attributes. Assume $s_r$ is the dissimilarity measure on numeric attributes defined by the squared Euclidean distance and $s_c$ is the dissimilarity measure on categorical attributes defined as the number of mismatches of categories between two objects. We define the dissimilarity measure between two objects as $s_r + \gamma s_c$, where $\gamma$ is a weight to balance the two parts to avoid favouring either type of attribute. The clustering process of the k-prototypes algorithm is similar to the k-means algorithm except that a new method is used to update the categorical attribute values of cluster
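The mixed dissimilarity measure described in this paragraph (squared Euclidean distance on the numeric attributes plus a weighted count of mismatches on the categorical attributes) can be sketched as below. The function name `mixed_dissimilarity`, the variable names `s_r`, `s_c` and `gamma`, the index-list arguments, and the sample objects are assumptions made for this illustration, not notation or code taken from the paper.

```python
def mixed_dissimilarity(x, y, numeric_idx, categorical_idx, gamma):
    """Weighted sum of a numeric part and a categorical part.

    numeric_idx and categorical_idx list the positions of the numeric
    and categorical attributes; gamma weights the categorical part.
    """
    # Numeric part: squared Euclidean distance.
    s_r = sum((x[i] - y[i]) ** 2 for i in numeric_idx)
    # Categorical part: number of mismatched categories.
    s_c = sum(x[i] != y[i] for i in categorical_idx)
    return s_r + gamma * s_c


# Hypothetical objects with two numeric and two categorical attributes.
a = (1.0, 2.0, "red", "small")
b = (2.0, 4.0, "red", "large")

print(mixed_dissimilarity(a, b, [0, 1], [2, 3], gamma=0.5))
```

With a weight of zero the categorical attributes are ignored entirely, and a larger weight lets mismatches count for more relative to the numeric distance, which is the balancing role the text assigns to the weight.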
