A Collaborative Divide-and-Conquer K-Means Clustering
Algorithm for Processing Large Data
Huimin Cui
SKL of Computer Architecture,
Institute of Computing
Technology, CAS, China
cuihm@ict.ac.cn
Gong Ruan
University of Chinese
Academy of Sciences, China
ruangong@ict.ac.cn
Jingling Xue
School of Computer Science
and Engineering, University of
New South Wales, Australia
jingling@cse.unsw.edu.au
Rui Xie
SKL of Computer Architecture,
Institute of Computing
Technology, CAS, China
xierui@ict.ac.cn
Lei Wang
SKL of Computer Architecture,
Institute of Computing
Technology, CAS, China
wlei@ict.ac.cn
Xiaobing Feng
SKL of Computer Architecture,
Institute of Computing
Technology, CAS, China
fxb@ict.ac.cn
ABSTRACT
K-means clustering plays a vital role in data mining. As an
iterative computation, its performance suffers when applied
to massive datasets, due to poor temporal locality across
its iterations. The state-of-the-art streaming
algorithm, which streams the data from disk into memory
and operates on the partitioned streams, improves temporal
locality but can misplace objects in clusters since different
partitions are processed locally. This paper presents a col-
laborative divide-and-conquer algorithm to significantly im-
prove the state-of-the-art, based on two key insights. First,
we introduce a break-and-recluster procedure to identify the
clusters with misplaced objects. Second, we introduce collaborative
seeding between different partitions to accelerate convergence
inside each partition. Using a number of Wikipedia webpages
as our datasets, our collaborative algorithm improves the
clustering quality of the streaming algorithm by up to 35.3%
(8.8% on average) while reducing its execution times by 0.3%
to 80.1% (48.6% on average).
Permission to make digital or hard copies of all or part of this work for
personal or classroom use is granted without fee provided that copies are
not made or distributed for profit or commercial advantage and that copies
bear this notice and the full citation on the first page. To copy otherwise, to
republish, to post on servers or to redistribute to lists, requires prior specific
permission and/or a fee.
CF '14, May 20 - 22 2014, Cagliari, Italy
Copyright 2014 ACM 978-1-4503-2870-8/14/05 ...$15.00.

1. INTRODUCTION
K-means clustering, proposed by S. P. Lloyd in 1957 [26], is
a commonly used algorithm for data mining. It partitions n
objects into k clusters such that similar objects belong to
the same cluster, according to some similarity function.
Although K-means was proposed over 50 years ago and
thousands of clustering algorithms
have been published since then, K-means is still widely used
in a variety of areas, including market segmentation,
computer vision, geostatistics, astronomy, and agriculture. As
a representative scenario, billions of Web pages create ter-
abytes of new data every day, with many of these data
streams being unstructured, adding to the difficulty in ana-
lyzing them. The K-means clustering algorithm can be used
to discover the natural groups of the data, allowing us to
understand, process and summarize the data.
Specifically, the K-means clustering algorithm [26] assigns
each object to the cluster whose center (also called cen-
troid) is the nearest. The algorithm starts with an initial set
of cluster centers, chosen at random or according to some
heuristic procedure. Then the algorithm iteratively assigns
each object to one of the clusters. In each iteration, each ob-
ject is assigned to its nearest cluster center according to the
Euclidean distance between the two. Then the cluster centers
are recalculated [31]. The pseudocode of the K-means
clustering algorithm is shown in Algorithm 1.
Algorithm 1 K-means clustering algorithm.
procedure KMeans(S, k)
1:  Initialize k empty clusters C_1, C_2, ..., C_k.
2:  Initialize cluster centers for C_1, C_2, ..., C_k randomly or heuristically.
3:  while the convergence criterion is not met do
4:    for each object s in S do
5:      Compute the distance from s to all centers.
6:      Assign s to its nearest cluster.
7:    end for
8:    for each cluster C_i in {C_1, C_2, ..., C_k} do
9:      Update the center of C_i according to all objects belonging to C_i.
10:   end for
11: end while
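As a concrete illustration of Algorithm 1, the following is a minimal executable sketch (not the paper's implementation): points are represented as tuples of floats, distance is squared Euclidean distance, initial centers are sampled at random from the data, and convergence is declared when no center moves between iterations. The function name and signature are illustrative choices, not drawn from the paper.

```python
import random

def kmeans(objects, k, max_iters=100, seed=0):
    """Lloyd's K-means over a list of equal-length tuples of floats."""
    rng = random.Random(seed)
    # Step 2: choose k distinct objects as the initial cluster centers.
    centers = rng.sample(objects, k)
    for _ in range(max_iters):  # Step 3: iterate until convergence
        # Steps 4-7 (assignment): each object joins its nearest center's cluster.
        clusters = [[] for _ in range(k)]
        for s in objects:
            dists = [sum((a - b) ** 2 for a, b in zip(s, c)) for c in centers]
            clusters[dists.index(min(dists))].append(s)
        # Steps 8-10 (update): recompute each center as the mean of its cluster.
        new_centers = []
        for i, cluster in enumerate(clusters):
            if cluster:
                dim = len(cluster[0])
                new_centers.append(tuple(sum(p[j] for p in cluster) / len(cluster)
                                         for j in range(dim)))
            else:
                new_centers.append(centers[i])  # keep an empty cluster's old center
        if new_centers == centers:  # convergence: no center moved
            break
        centers = new_centers
    return centers, clusters
```

Note the choice made for an empty cluster (retain its old center); other common policies, such as re-seeding from a far-away point, work equally well in this sketch.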
As data volumes continue to grow, the original K-means
algorithm faces a major challenge of reusing