A Collaborative Divide-and-Conquer K-Means Clustering
Algorithm for Processing Large Data
Huimin Cui
SKL of Computer Architecture,
Institute of Computing
Technology, CAS, China
cuihm@ict.ac.cn
Gong Ruan
University of Chinese
Academy of Sciences, China
ruangong@ict.ac.cn
Jingling Xue
School of Computer Science
and Engineering, University of
New South Wales, Australia
jingling@cse.unsw.edu.au
Rui Xie
SKL of Computer Architecture,
Institute of Computing
Technology, CAS, China
xierui@ict.ac.cn
Lei Wang
SKL of Computer Architecture,
Institute of Computing
Technology, CAS, China
wlei@ict.ac.cn
Xiaobing Feng
SKL of Computer Architecture,
Institute of Computing
Technology, CAS, China
fxb@ict.ac.cn
ABSTRACT
K-means clustering plays a vital role in data mining. As an
iterative computation, its performance suffers when applied
to massive datasets, due to poor temporal locality across
its iterations. The state-of-the-art streaming
algorithm, which streams the data from disk into memory
and operates on the partitioned streams, improves temporal
locality but can misplace objects in clusters since different
partitions are processed locally. This paper presents a col-
laborative divide-and-conquer algorithm to significantly im-
prove the state-of-the-art, based on two key insights. First,
we introduce a break-and-recluster procedure to identify the
clusters with misplaced objects. Second, we introduce collaborative
seeding between different partitions to accelerate convergence
inside each partition. Using a number of Wikipedia webpages
as our datasets, our collaborative algorithm improves the
clustering quality of the streaming algorithm by up to 35.3%
(8.8% on average) while reducing its execution times by 0.3%
to 80.1% (48.6% on average).
Permission to make digital or hard copies of all or part of this work for
personal or classroom use is granted without fee provided that copies are
not made or distributed for profit or commercial advantage and that copies
bear this notice and the full citation on the first page. To copy otherwise, to
republish, to post on servers or to redistribute to lists, requires prior specific
permission and/or a fee.
CF '14, May 20 - 22 2014, Cagliari, Italy
Copyright 2014 ACM 978-1-4503-2870-8/14/05 ...$15.00.

1. INTRODUCTION
K-means clustering, proposed by S. P. Lloyd in 1957 [26], is
a commonly used algorithm for data mining. It partitions n
objects into k clusters such that similar objects belong to
the same cluster, according to some similarity function.
Although K-means was proposed over 50 years ago and
thousands of clustering algorithms
have been published since then, K-means is still widely used
in a variety of areas, including market segmentation,
computer vision, geostatistics, astronomy, and agriculture. As
a representative scenario, billions of Web pages create ter-
abytes of new data every day, with many of these data
streams being unstructured, adding to the difficulty in ana-
lyzing them. The K-means clustering algorithm can be used
to discover the natural groups of the data, allowing us to
understand, process and summarize the data.
Specifically, the K-means clustering algorithm [26] assigns
each object to the cluster whose center (also called cen-
troid) is the nearest. The algorithm starts with an initial set
of cluster centers, chosen at random or according to some
heuristic procedure. Then the algorithm iteratively assigns
each object to one of the clusters. In each iteration, each ob-
ject is assigned to its nearest cluster center according to the
Euclidean distance between the two. Then the cluster centers
are recalculated [31]. The pseudocode of the K-means
clustering algorithm is shown in Algorithm 1.
Algorithm 1 K-means clustering algorithm.
procedure KMeans(S, k)
1:  Initialize k empty clusters C_1, C_2, ..., C_k.
2:  Initialize cluster centers for C_1, C_2, ..., C_k randomly or heuristically.
3:  while the convergence criterion is not met do
4:    for each object s in S do
5:      Compute the distance from s to all centers.
6:      Assign s to its nearest cluster.
7:    end for
8:    for each cluster C_i in {C_1, C_2, ..., C_k} do
9:      Update the center of C_i according to all objects belonging to C_i.
10:   end for
11: end while
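As a concrete illustration of Algorithm 1, the following is a minimal executable sketch (not the paper's implementation): points are represented as tuples of floats, distance is squared Euclidean distance, initial centers are sampled at random from the data, and convergence is declared when no center moves between iterations. The function name and signature are illustrative choices, not drawn from the paper.

```python
import random

def kmeans(objects, k, max_iters=100, seed=0):
    """Lloyd's K-means over a list of equal-length tuples of floats."""
    rng = random.Random(seed)
    # Step 2: choose k distinct objects as the initial cluster centers.
    centers = rng.sample(objects, k)
    for _ in range(max_iters):  # Step 3: iterate until convergence
        # Steps 4-7 (assignment): each object joins its nearest center's cluster.
        clusters = [[] for _ in range(k)]
        for s in objects:
            dists = [sum((a - b) ** 2 for a, b in zip(s, c)) for c in centers]
            clusters[dists.index(min(dists))].append(s)
        # Steps 8-10 (update): recompute each center as the mean of its cluster.
        new_centers = []
        for i, cluster in enumerate(clusters):
            if cluster:
                dim = len(cluster[0])
                new_centers.append(tuple(sum(p[j] for p in cluster) / len(cluster)
                                         for j in range(dim)))
            else:
                new_centers.append(centers[i])  # keep an empty cluster's old center
        if new_centers == centers:  # convergence: no center moved
            break
        centers = new_centers
    return centers, clusters
```

Note the choice made for an empty cluster (retain its old center); other common policies, such as re-seeding from a far-away point, work equally well in this sketch.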
As data volumes continue to grow, the original K-means
algorithm faces a major challenge of reusing