OpenMP Parallelization of the CURE Algorithm: An Efficient Clustering Method for Large Data Sets


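This paper examines how OpenMP (Open Multi-Processing) can be used to parallelize CURE (Clustering Using REpresentatives), a hierarchical data clustering algorithm. CURE is an agglomerative method: it represents each cluster by several representative points rather than a single centroid and repeatedly merges the closest pair of clusters to build the hierarchy. On large data sets CURE can hit a performance bottleneck, because its core consists of many iterations and expensive distance computations. These steps, however, are well suited to parallel execution.

OpenMP lets the algorithm's asymmetry and non-determinism be managed transparently. The programmer splits the algorithm into tasks or work units that can run concurrently, especially in the repeated loops of the hierarchical procedure, such as computing distances and assigning data points to clusters. The OpenMP runtime maps this loop-level parallelism onto the cores of a multiprocessor, which shortens execution time.

According to the reported experiments, the authors' OpenMP implementation scales well for different CURE problem parameters and copes with large data sets. As the data set grows, the parallelized CURE reduces execution time while preserving cluster quality.

In summary, the paper's main contribution is to show how OpenMP can be used to speed up a hierarchical clustering algorithm, making an otherwise time-consuming CURE run practical in a parallel environment. The approach is also a reference point for parallelizing other algorithms dominated by iterative computation.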

1. Initialization: Compute distances and find nearest neighbors pairs for all clusters
2. Clustering: Perform hierarchical clustering until the predefined number of clusters k has been computed
   While (number of remaining clusters > k) {
      a. Find the pair of clusters with the minimum distance
      b. Merge them:
         i. new size = size1 + size2
         ii. new centroid = a1*centroid1 + a2*centroid2,
             where a1 = size1/new size and a2 = size2/new size
         iii. find c new representative points
      c. Update nearest neighbors pairs for the clusters
      d. Reduce the number of remaining clusters
      e. If conditions are satisfied, apply pruning of clusters
   }
3. Output the representative points of each cluster

Fig. 1. Outline of CURE
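Steps 2b(i) and 2b(ii) of the outline can be sketched in C as follows. This is a minimal illustration, not the authors' code: the `Cluster` struct, the fixed dimensionality `DIM`, and the function name `merge_clusters` are assumptions introduced here, and the selection of the c new representative points (step 2b(iii)) is left out.

```c
#include <assert.h>

#define DIM 2  /* hypothetical fixed dimensionality, chosen for this sketch */

/* Hypothetical cluster record: only the fields needed for steps 2b(i)-(ii). */
typedef struct {
    int    size;           /* number of points in the cluster */
    double centroid[DIM];  /* mean of the cluster's points */
} Cluster;

/*
 * Merge two clusters following Fig. 1:
 *   new size     = size1 + size2
 *   new centroid = a1*centroid1 + a2*centroid2,
 *   with a1 = size1/new size and a2 = size2/new size.
 */
Cluster merge_clusters(const Cluster *c1, const Cluster *c2)
{
    Cluster merged;
    merged.size = c1->size + c2->size;
    assert(merged.size > 0);  /* guards the divisions below */

    double a1 = (double)c1->size / merged.size;
    double a2 = (double)c2->size / merged.size;

    for (int d = 0; d < DIM; d++)
        merged.centroid[d] = a1 * c1->centroid[d] + a2 * c2->centroid[d];

    return merged;
}
```

Weighting by cluster size makes the merged centroid equal to the mean of all points in both clusters, so it can be updated without revisiting the individual points.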

merged clusters. Moreover, to reduce the time complexity of the algorithm, the authors propose an improved merge procedure where the new c representative points are chosen from among the 2c points of the two merged clusters.

The worst-case time complexity of CURE is $O(n^2 \log n)$, where n is the number of points to be clustered. To allow CURE to handle very large data sets, CURE uses a random sample of the database. Sampling improves the performance of the algorithm since the sample can be designed to fit in main memory, thus eliminating significant I/O costs, and it also contributes to the filtering of outliers. To speed up the clustering process when the sample size increases, CURE partitions and partially clusters the data points in the partitions of the random sample. Instead of using a centroid to label the clusters, multiple representative points are used, and each data point is assigned to the cluster with the closest representative point. The use of multiple points enables the algorithm to identify arbitrarily shaped clusters. Empirical work with CURE has shown that the algorithm is insensitive to outliers and can identify clusters with interesting shapes. Moreover, sampling and partitioning speed up the clustering process without sacrificing cluster quality.
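The labeling rule described above, where each data point goes to the cluster owning its closest representative point, can be sketched as a loop-parallel OpenMP routine. This is a hedged sketch under stated assumptions rather than the paper's implementation: the flat row-major array layout, the `rep_cluster` mapping, and the names `sq_dist` and `label_points` are all hypothetical. Without OpenMP enabled at compile time the pragma is ignored and the loop runs serially.

```c
#include <assert.h>
#include <float.h>
#include <stddef.h>

/* Squared Euclidean distance between two dim-dimensional points. */
static double sq_dist(const double *a, const double *b, int dim)
{
    double sum = 0.0;
    for (int d = 0; d < dim; d++) {
        double diff = a[d] - b[d];
        sum += diff * diff;
    }
    return sum;
}

/*
 * Assign every point to the cluster that owns its nearest representative.
 *   points      : n_points x dim coordinates, row-major (assumed layout)
 *   reps        : n_reps x dim representative points, row-major
 *   rep_cluster : rep_cluster[r] is the cluster id owning representative r
 *   labels      : output, labels[i] is the cluster id chosen for point i
 */
void label_points(const double *points, int n_points,
                  const double *reps, const int *rep_cluster, int n_reps,
                  int dim, int *labels)
{
    assert(n_reps > 0);

    /* Iterations are independent: each one writes only labels[i]. */
    #pragma omp parallel for schedule(static)
    for (int i = 0; i < n_points; i++) {
        const double *p = &points[(size_t)i * dim];
        double best = DBL_MAX;
        int best_cluster = -1;

        for (int r = 0; r < n_reps; r++) {
            double d2 = sq_dist(p, &reps[(size_t)r * dim], dim);
            if (d2 < best) {
                best = d2;
                best_cluster = rep_cluster[r];
            }
        }
        labels[i] = best_cluster;
    }
}
```

Because every point costs the same amount of work here, a static schedule is a reasonable default. Comparing squared distances avoids a square root without changing which representative is nearest.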

2.2 Implementation

Our parallel implementation of CURE was inspired by the source code of Dr. Han and has been enhanced to handle large data sets. The algorithm uses a linear array of records that keeps information about the size, the centroid and the representative points of each cluster. Taking into consideration the improved procedure for merging clusters and that the labeling of data is a separate process,
