A Comparative Analysis of the K-Means and K-Medoids Clustering Algorithms


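"kmeans vs kmedoids.pdf"

In data mining, clustering is a common unsupervised learning method that groups similar data objects into "clusters". This paper by Dr. Aishwarya Batra examines two widely used partitioning clustering algorithms, k-means and k-medoids. Clustering resembles classification, but it does not rely on predefined classes; the groups are determined by the structure of the data itself.

K-means is one of the best-known partitioning algorithms. It is built on the notion of a centroid and assigns each data point to the cluster of its nearest centroid. It has several notable limitations. First, it is sensitive to the choice of initial centroids: different starting configurations can produce very different clusterings. Second, every iteration computes the distance from every data point to every centroid, so the cost is high on large data sets. Third, it works best on roughly spherical clusters under Euclidean distance and performs poorly on non-convex or irregularly shaped clusters.

K-medoids, whose best-known realization is the Partitioning Around Medoids (PAM) algorithm, is a more robust alternative. It chooses actual data points, the medoids, as cluster representatives instead of the mean of the points. This makes it more robust to outliers, and it can handle non-numeric attributes because it needs only a pairwise dissimilarity between objects rather than coordinates that can be averaged. Its drawback is a higher computational complexity than k-means, especially on large data sets.

To improve k-means, researchers have proposed different initialization strategies (such as k-means++) and optimization techniques that reduce the dependence on the initial centroids and lower the computational burden. Even so, k-medoids is generally considered more reliable on noisy data.

In practice, the choice depends on the requirements and on the data. If the clusters are expected to be regular in shape and efficiency matters, k-means is likely the better choice. If the cluster shapes are uncertain, or the data contain noise and outliers, k-medoids may be preferable. In short, k-means is valued for its simplicity and efficiency but makes assumptions about initial conditions and data distribution, while k-medoids tolerates outliers better at a higher computational cost.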


Analysis and Approach: K-Means and K-Medoids Data Mining Algorithms

Dr. Aishwarya Batra

Asst Professor, L. J. Institute of Computer Applications, Ahmedabad, India.

E-mail: batra.aishwarya@gmail.com

Abstract

Clustering is similar to classification in that data are grouped, but the groups are not defined in advance. A cluster is therefore a collection of objects that are similar to one another and dissimilar to the objects belonging to other clusters. A large number of clustering algorithms exist in the literature. The choice of clustering algorithm depends both on the type of data available and on the particular purpose and application. Cluster analysis is one of the main analytical methods in data mining. K-means is the most popular partition-based clustering algorithm, but it is computationally expensive, and the quality of the resulting clusters depends heavily on the selection of the initial centroids and on the dimension of the data. Several methods have been proposed in the literature for improving the performance of the k-means clustering algorithm. In this research, the most representative algorithms, K-Means and K-Medoids, were examined and analyzed based on their basic approach. The best algorithm in each category was identified based on its performance. The input data points are generated in two ways: one using a normal distribution and the other using a uniform distribution.

Keywords: K-Means, K-Medoids, Clustering, Partitional Algorithm

Introduction

Clustering techniques are widely used today, and their importance tends to increase as the amount of data grows and the processing power of computers increases. Clustering applications are used extensively in fields such as artificial intelligence, pattern recognition, economics, ecology, psychiatry and marketing. Data clustering is under vigorous development. Contributing areas of research include data mining, statistics, machine learning, spatial database technology, biology, and marketing. Owing to the huge amounts of data collected in databases, cluster analysis has recently become a highly active topic in data mining research. As a branch of statistics, cluster analysis has been studied extensively for many years, focusing mainly on distance-based cluster analysis.

The main purpose of clustering techniques is to partition a set of entities into different groups, called clusters. Cluster analysis tools based on k-means, k-medoids, and several other methods have also been built into many statistical analysis software packages or systems, such as S-Plus, SPSS, and SAS.

Categorization of Major Clustering Methods

In general, the major clustering methods can be classified into the following categories:

• Partitioning methods

• Hierarchical methods

• Density-based methods

• Grid-based methods

• Model-based methods

Classical Partitioning Methods: K-Means & K-Medoids

The most well-known and commonly used partitioning methods are k-means, k-medoids, and their variations. Partitional clustering techniques create a one-level (un-nested) partitioning of the data points. If K is the desired number of clusters, then partitional approaches typically find all K clusters at once. There are a number of such techniques, but we shall describe only two approaches in this section: K-means and K-medoid. Both techniques are based on the idea that a centre point can represent a cluster. For K-means we use the notion of a centroid, which is the mean point of a group of points (variants such as k-medians use the median instead). Note that a centroid almost never corresponds to an actual data point. For K-medoid we use the notion of a medoid, which is the most representative (central) point of a group of points and is always one of the data points.
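The difference between a centroid and a medoid can be illustrated with a small sketch. The sample points below are made up for illustration and are not taken from the paper, and the helper names `centroid` and `medoid` are hypothetical. Euclidean distance is assumed.

```python
import numpy as np

def centroid(points):
    """Coordinate-wise mean of the points (usually not a data point)."""
    return points.mean(axis=0)

def medoid(points):
    """The data point with the smallest summed distance to all the others."""
    diffs = points[:, None, :] - points[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=-1))  # pairwise Euclidean distances
    return points[dists.sum(axis=1).argmin()]

# Four points close together plus one outlier (illustrative data only).
points = np.array([[1.0, 1.0], [2.0, 1.0], [1.0, 2.0], [2.0, 2.0], [10.0, 10.0]])
c = centroid(points)  # pulled toward the outlier
m = medoid(points)    # stays on an actual data point
print("centroid:", c)
print("medoid:", m)
```

In this example the centroid lands between the tight group and the outlier, while the medoid remains one of the four nearby points, which is the intuition behind the robustness usually attributed to k-medoids.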

Clustering

• k-means (MacQueen, 1967): Each cluster is represented by the center of the cluster.

• k-medoids or PAM (Partitioning Around Medoids) (Kaufman & Rousseeuw, 1987): Each cluster is represented by one of the objects in the cluster.
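To make the "represented by one of the objects" idea concrete, here is a minimal sketch of a medoid-based clustering loop. It is not PAM as defined by Kaufman & Rousseeuw: PAM searches over swaps between medoids and non-medoids, whereas this sketch uses a simpler alternating scheme (assign points, then re-pick each cluster's medoid), which can stop in a poorer local optimum. The function name `k_medoids_alternate`, the `init` parameter, and the sample data are all assumptions made for illustration.

```python
import numpy as np

def k_medoids_alternate(points, init, max_iter=100):
    """Alternating k-medoids sketch; `init` holds the starting medoid indices."""
    # Pairwise Euclidean distances; any dissimilarity matrix could be used here.
    dist = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)
    medoids = np.array(init)
    for _ in range(max_iter):
        # Assign every point to its closest medoid.
        labels = dist[:, medoids].argmin(axis=1)
        new_medoids = medoids.copy()
        for j in range(len(medoids)):
            members = np.flatnonzero(labels == j)
            if members.size:
                # New medoid: the member with the least total distance to the others.
                within = dist[np.ix_(members, members)].sum(axis=1)
                new_medoids[j] = members[within.argmin()]
        if np.array_equal(new_medoids, medoids):
            break
        medoids = new_medoids
    labels = dist[:, medoids].argmin(axis=1)
    return medoids, labels

# Two small groups of points (illustrative data only).
pts = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]])
medoids, labels = k_medoids_alternate(pts, init=[0, 1])
print("medoid indices:", medoids)
print("labels:", labels)
```

Because the loop only ever reads the distance matrix, the same sketch would work with a non-Euclidean dissimilarity, which is one reason medoid-based methods are described as applicable to non-numeric attributes.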

Centroid-Based Technique: The K-Means Method

Basic Algorithm

The K-means clustering technique is very simple, so we begin immediately with a description of the basic algorithm and elaborate in the following sections.

Basic K-means Algorithm for finding K clusters

1. Select K points as the initial centroids.

2. Assign all points to the closest centroid.

3. Recompute the centroid of each cluster.
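The steps listed above can be sketched in NumPy as follows. This is a minimal illustration, not the paper's implementation. Several details are assumptions: the initial centroids are drawn at random from the data, steps 2 and 3 are repeated until the centroids stop changing (the usual stopping rule, which this excerpt does not show), an empty cluster keeps its previous centroid, and the names `kmeans`, `max_iter` and `seed` are hypothetical.

```python
import numpy as np

def kmeans(points, k, max_iter=100, seed=0):
    """Basic K-means sketch following the numbered steps above."""
    rng = np.random.default_rng(seed)
    # Step 1: select K distinct data points as the initial centroids.
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(max_iter):
        # Step 2: assign all points to the closest centroid.
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: recompute the centroid of each cluster as the mean of its points.
        new_centroids = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Assumed stopping rule: stop once the centroids no longer move.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    # Final assignment against the returned centroids.
    dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
    return centroids, dists.argmin(axis=1)

# Two well-separated groups (illustrative data only).
pts = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]])
cents, labels = kmeans(pts, k=2)
print("centroids:", cents)
print("labels:", labels)
```

Which cluster receives label 0 depends on the random initial choice, so only the grouping, not the label numbering, should be relied on.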
