[Page No. 274]
Analysis and Approach: K-Means and K-Medoids Data Mining
Algorithms
Dr. Aishwarya Batra
Asst Professor, L. J. Institute of Computer Applications, Ahmedabad, India.
E–mail: batra.aishwarya@gmail.com
Abstract
Clustering is similar to classification in which data are
grouped. A cluster is therefore a collection of objects which
are similar between them and are dissimilar to the objects
belonging to other clusters .There exist a large number of
clustering algorithms in the literature. The choice of clustering
algorithm depends both on the type of data available and on
the particular purpose and application. Clustering analysis is
one of the main analytical methods in data mining. K-means is
the most popular and partition based clustering algorithm. But
it is computationally expensive and the quality of resulting
clusters heavily depends on the selection of initial centroid and
the dimension of the data. Several methods have been
proposed in the literature for improving performance of the k-
means clustering algorithm. In this research, the most
representative algorithms K-Means and K-Medoids were
examined and analyzed based on their basic approach. The
best algorithm in each category was found out based on their
performance. The input data points are generated by two ways,
one by using normal distribution and another by applying
uniform distribution.
Keywords: K Means, K Medoid, Clustering, Partitional
Algorithm
Introduction
Clustering techniques have a wide use and importance
nowadays. This importance tends to increase as the amount of
data grows and the processing power of the computers
increases. Clustering applications are used extensively in
various fields such as artificial intelligence, pattern
recognition, economics, ecology, psychiatry and marketing.
Data clustering is under vigorous development. Contributing
areas of research include data mining, statistics, machine
learning, spatial database technology, biology, and marketing.
Owing to the huge amounts of data collected in databases,
cluster analysis has recently become a highly active topic in
data mining research. As a branch of statistics, cluster analysis
has been extensively studied for many years, focusing mainly
on distance-based cluster analysis.
The main purpose of clustering techniques is to partition a
set of entities into different groups, called clusters. Cluster
analysis tools based on k-means, k-medoids, and several other
methods have also been built into many statistical analysis
software packages or systems, such as S-Plus, SPSS, and SA.
Categorization of Major Clustering Methods
In general, the major clustering methods can be classified into
the following categories:
• Partitioning methods
• Hierarchical methods
• Density-based methods
• Grid-based methods
• Model-based methods
Classical Partitioning Methods: K-Means & K-Medoids
The most well-known and commonly used partitioning
methods are k-means, k-medoids, and their variations.
Partitional clustering techniques create a one-level partitioning
of the data points. There are a number of such techniques, but
we shall only describe two approaches in this section: K-
means and K-medoid. Both these techniques are based on the
idea that a centre point can represent a cluster. For K-means
we use the notion of a centroid, which is the mean or median
point of a group of points. Note that a centroid almost never
corresponds to an actual data point. For K-medoid we use the
notion of a medoid, which is the most representative (central)
point of a group of points. Partitional techniques create a one-
level (un-nested) partitioning of the data points. If K is the
desired number of clusters, then partitional approaches
typically find all K clusters at once.
Clustering
k-means
(MacQueen’67): Each cluster is represented by the
center of the cluster.
k-medoids
or PAM (Partition around medoids) (Kaufman &
Rousseeuw’87): Each cluster is represented by one of the
objects in the cluster
Centroid-Based Technique: The K-Means Method
Basic Algorithm
The K-means clustering technique is very simple and we
immediately begin with a description of the basic algorithm.
We elaborate in the following sections.
Basic K-means Algorithm for finding K clusters
1. Select K points as the initial centroids.
2. Assign all points to the closest centroid.
3. Recompute the centroid of each cluster.