The author wishes to acknowledge that this work was carried out within the Cooperative Research Centre for Advanced Computational Systems (ACSys)
established under the Australian Government’s Cooperative Research Centres Program.
Cooperative Research Centre for Advanced Computational Systems
CSIRO Mathematical and Information Sciences
GPO Box 664, Canberra 2601, AUSTRALIA
email:Zhexue.Huang@cmis.csiro.au
Abstract
Partitioning a large set of objects into homogeneous clusters is a
fundamental operation in data mining. The
k
-means algorithm is
best suited for implementing this operation because of its
efficiency in clustering large data sets. However, working only on
numeric values limits its use in data mining because data sets in
data mining often contain categorical values. In this paper we
present an algorithm, called
k
-modes, to extend the
k
-means
paradigm to categorical domains. We introduce new dissimilarity
measures to deal with categorical objects, replace means of
clusters with modes, and use a frequency based method to update
modes in the clustering process to minimise the clustering cost
function. Tested with the well known soybean disease data set
the algorithm has demonstrated a very good classification
performance. Experiments on a very large health insurance data
set consisting of half a million records and 34 categorical
attributes show that the algorithm is scalable in terms of both the
number of clusters and the number of records.
1 Introduction
Partitioning a set of objects into homogeneous clusters is a
fundamental operation in data mining. The operation is
needed in a number of data mining tasks, such as
unsupervised classification and data summation, as well as
segmentation of large heterogeneous data sets into smaller
homogeneous subsets that can be easily managed,
separately modelled and analysed. Clustering is a popular
approach used to implement this operation. Clustering
methods partition a set of objects into clusters such that
objects in the same cluster are more similar to each other
than objects in different clusters according to some defined
criteria. Statistical clustering methods (Anderberg 1973,
Jain and Dubes 1988) use similarity measures to partition
objects whereas conceptual clustering methods cluster
objects according to the concepts objects carry (Michalski
and Stepp 1983, Fisher 1987).
The most distinct characteristic of data mining is that
it deals with very large data sets (gigabytes or even
terabytes). This requires the algorithms used in data
mining to be scalable. However, most algorithms currently
used in data mining do not scale well when applied to very
large data sets because they were initially developed for
other applications than data mining which involve small
data sets. The study of scalable data mining algorithms has
recently become a data mining research focus (Shafer et al.
1996).
In this paper we present a fast clustering algorithm
used to cluster categorical data. The algorithm, called
k-
modes
, is an extension to the well known
k-means
algorithm (MacQueen 1967).
Compared to other clustering
methods the
k
-means algorithm and its variants
(Anderberg 1973) are efficient in clustering large data sets,
thus very suitable for data mining. However, their use is
often limited to numeric data because these algorithms
minimise a cost function by calculating the means of
clusters. Data mining applications frequently involve
categorical data. The traditional approach to converting
categorical data into numeric values does not necessarily
produce meaningful results in the case where categorical
domains are not ordered. The
k
-modes algorithm in this
paper removes this limitation and extends the
k
-means
paradigm to categorical domains whilst preserving the
efficiency of the
k
-means algorithm.
In (Huang 1997) we have proposed an algorithm,
called
k-prototypes
, to cluster large data sets with mixed
numeric and categorical values. In the
k
-prototypes
algorithm we define a dissimilarity measure that takes into
account both numeric and categorical attributes. Assume
s
n
is the dissimilarity measure on numeric attributes
defined by the squared Euclidean distance and
s
c
is the
dissimilarity measure on categorical attributes defined as
the number of mismatches of categories between two
objects. We define the dissimilarity measure between two
objects as
s
n
+
γ
s
c
, where
γ
is a weight to balance the two
parts to avoid favouring either type of attribute. The
clustering process of the k-prototypes algorithm is similar
to the k-means algorithm except that a new method is used
to update the categorical attribute values of cluster