Minimum Spanning Tree Based Split-and-Merge: A New Hierarchical Clustering Method

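"Minimum spanning tree based split-and-merge: A hierarchical clustering method". This paper proposes a hierarchical clustering method aimed at a problem that conventional clustering algorithms face on heterogeneous datasets: when a dataset contains clusters of differing shapes, sizes, and densities, the performance of most clustering algorithms can degrade sharply. To address this, the method uses a minimum spanning tree (MST) and MST-based graphs to guide the splitting and merging of the data.

The MST plays the central role. In the split stage, an MST is first built over the data points, the vertex degrees (the number of incident edges) are examined, and high-degree vertices are chosen as initial prototypes. K-means then partitions the dataset by assigning each point to its nearest prototype, which yields the initial subclusters.

In the merge stage, the subclusters produced by the split are processed further. Not every pair of subclusters is eligible for merging; only pairs that are adjacent in the MST are considered, which helps preserve the connectivity and structure of the clusters. Merging neighboring subclusters step by step builds the cluster hierarchy, i.e. a dendrogram, which reveals the hierarchical relationships in the data.

One advantage of the method is that the user does not have to specify the number of clusters, a parameter that many clustering algorithms require but that is often hard to estimate. Instead, the user can choose a suitable number of clusters by inspecting the dendrogram. Because it relies only on the MST and K-means, the method is also relatively simple to implement. In the reported experiments it was effective on both synthetic and real datasets, particularly for clusters whose shapes and densities vary widely.

In summary, MST-based split-and-merge is a hierarchical clustering technique with few parameters that is suited to datasets with heterogeneous cluster characteristics. It combines the structural information of the MST with the partitioning ability of K-means.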


Definition 1. Let $X'$ be the pruned version of $X$ as in

$$X' = X \setminus \{\, v_i \mid v_i \in V,\ \mathrm{degree}(v_i) = 1 \,\}, \qquad (1)$$

where $\mathrm{degree}(v_i)$ denotes the degree of vertex $v_i$ in the MST of $X$.
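The pruning in Definition 1 can be illustrated with a short sketch. It assumes Euclidean distances on the complete graph and uses a plain O(n²) Prim's algorithm; the function names `mst_edges` and `prune_leaves` are hypothetical and are not taken from the paper.

```python
import math

def mst_edges(points):
    """Prim's algorithm on the complete Euclidean graph; returns n-1 edges (i, j)."""
    n = len(points)
    if n < 2:
        return []
    in_tree = [False] * n
    best_dist = [math.inf] * n   # cheapest known connection to the growing tree
    best_from = [-1] * n         # tree vertex that provides that connection
    best_dist[0] = 0.0
    edges = []
    for _ in range(n):
        # Pick the closest vertex that is not yet in the tree.
        u = min((i for i in range(n) if not in_tree[i]), key=lambda i: best_dist[i])
        in_tree[u] = True
        if best_from[u] >= 0:
            edges.append((best_from[u], u))
        # Relax the connection cost of every vertex still outside the tree.
        for v in range(n):
            if not in_tree[v]:
                d = math.dist(points[u], points[v])
                if d < best_dist[v]:
                    best_dist[v] = d
                    best_from[v] = u
    return edges

def prune_leaves(points):
    """Indices of the points kept in X': vertices whose MST degree is not 1."""
    degree = [0] * len(points)
    for i, j in mst_edges(points):
        degree[i] += 1
        degree[j] += 1
    return [i for i, d in enumerate(degree) if d != 1]

points = [(0, 0), (1, 0), (2, 0), (3, 0), (1, 1)]
print(prune_leaves(points))  # -> [1, 2]
```

In this toy dataset the MST is the path 0-1-2-3 plus the branch 1-4, so vertices 0, 3, and 4 are leaves and are pruned.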

2.3.2. Constructing a 3-MST graph

An MST describes the intrinsic skeleton of a dataset and accordingly can be used for clustering. In our proposed method, we use it to guide the splitting and merging processes. However, a single MST loses some neighborhood information that is crucial for splitting and merging. To overcome this drawback, we combine several MSTs and form a graph $G_{mst}(X, k)$ as follows:

Definition 2. Let $T_1 = f_{mst}(V, E)$ denote the MST of $G(X) = (V, E)$. The following iterations of an MST are defined as:

$$T_i = f_{mst}\Bigl(V,\ E \setminus \bigcup_{j=1}^{i-1} T_j\Bigr), \qquad (2)$$

where $f_{mst}: (V, E) \rightarrow T$ is a function to compute an MST from graph $G(X) = (V, E)$, and $i \geq 2$.

In theory, the above definition of $T_i$ is not rigorous because $E \setminus \bigcup_{j=1}^{i-1} T_j$ may produce an isolated subgraph. For example, if there exists a vertex $v$ in $T_1$ and the degree of $v$ is $|V| - 1$, $v$ will be isolated in $G(V, E \setminus T_1)$. Hence, the second MST ($T_2$) cannot be completed in terms of Definition 2. In practice, this is not a problem because the first MST $T_1$ is still connected, and the issue can simply be ignored as a minor artefact because it has no noticeable effect on the performance of the overall algorithm. However, for the sake of completeness, we solve this minor problem by always connecting such an isolated subgraph with an edge randomly selected from those connecting the isolated subgraph in $T_1$.

Let $G_{mst}(X, k)$ denote the k-MST graph, which is defined as a union of the $k$ MSTs: $G_{mst}(X, k) = T_1 \cup T_2 \cup \cdots \cup T_k$. In this paper, we use $G_{mst}(X', k)$ to determine the initial prototypes in the split stage and to calculate the merge index of a neighboring partition pair in the merge stage. Here, $k$ is set to 3 based on the following observation: one round of MST is not sufficient for the criterion-based merge, but three iterations are. The number itself is a small constant and can be justified from a computational point of view. Additional iterations do not add much to the quality and only increase processing time. A further discussion concerning the selection of $k$ can be found in Section 3.4.
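The k-MST construction of Definition 2 can be sketched as follows. This is a minimal illustration under stated assumptions: it uses Euclidean distances on the complete graph and Kruskal's algorithm, and when the remaining edge set is disconnected it simply returns a spanning forest instead of applying the reconnection fix described above. The names `kruskal` and `k_mst_graph` are hypothetical.

```python
import math
from itertools import combinations

def kruskal(n, weighted_edges):
    """Kruskal's algorithm; returns a spanning tree, or a forest if disconnected."""
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    tree = []
    for _, i, j in sorted(weighted_edges):
        ri, rj = find(i), find(j)
        if ri != rj:          # the edge joins two components, so keep it
            parent[ri] = rj
            tree.append((i, j))
    return tree

def k_mst_graph(points, k=3):
    """Union of k successive MSTs; each one avoids edges used by earlier ones."""
    n = len(points)
    # Edge set E of the complete graph, keyed by (i, j) with i < j.
    remaining = {(i, j): math.dist(points[i], points[j])
                 for i, j in combinations(range(n), 2)}
    union = set()
    for _ in range(k):
        tree = kruskal(n, [(w, i, j) for (i, j), w in remaining.items()])
        for edge in tree:
            del remaining[edge]   # remove T_i so that T_{i+1} cannot reuse it
        union.update(tree)
    return union

points = [(0, 0), (1, 0), (2, 0), (3, 0)]
print(sorted(k_mst_graph(points, k=1)))  # -> [(0, 1), (1, 2), (2, 3)]
```

With `k=1` the result is the ordinary MST; larger `k` adds the next-best spanning structures, which is the extra neighborhood information the text refers to.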

2.4. Split stage

In the split stage, initial prototypes are selected as the nodes of highest degree in the graph $G_{mst}(X', k)$. K-means is then applied to the pruned dataset using these prototypes. The produced partitions are adjusted to keep the clusters connected with respect to the MST.

2.4.1. Application of K-means

The pruned dataset is first split by K-means in the original Euclidean space, where the number of partitions $K'$ is set to $\sqrt{|X'|}$. This is done under the assumption that the number of clusters in a dataset is smaller than the square root of the number of patterns in the dataset [3,36]. If $\sqrt{|X'|} \leq K$, $K'$ can be set to $K + \lambda(|X'| - K)$ to guarantee that $K'$ is greater than $K$, where $0 < \lambda < 1$. Since this is not a normal situation, we do not discuss the parameter $\lambda$ in this paper. Moreover, if $|X'| \leq K$, the split-and-merge scheme will degenerate into a traditional agglomerative clustering.
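The rule for choosing the number of partitions can be written out as a small function. This is only a sketch: the paper does not state how fractional values are rounded, so the floor of the square root and the ceiling of the fallback term are assumptions, as are the name `num_partitions`, the argument name `factor` (the scaling parameter between 0 and 1), and the choice to return one partition per point in the degenerate case.

```python
import math

def num_partitions(n_pruned, K, factor=0.5):
    """Number of K-means partitions for a pruned dataset of n_pruned points.

    K is the desired number of final clusters; factor is the scaling
    parameter in (0, 1) used only when the square root rule gives too few.
    """
    if not 0 < factor < 1:
        raise ValueError("factor must lie strictly between 0 and 1")
    if n_pruned <= K:
        # Degenerate case (assumption): every point is its own partition,
        # which amounts to ordinary agglomerative clustering.
        return n_pruned
    k_prime = math.isqrt(n_pruned)          # floor of sqrt(|X'|), an assumption
    if k_prime <= K:
        # Fallback: the ceiling guarantees at least K + 1 partitions.
        k_prime = K + math.ceil(factor * (n_pruned - K))
    return k_prime

print(num_partitions(100, 3))   # -> 10
print(num_partitions(16, 5))    # -> 11
```

For 100 pruned points and 3 target clusters the square root rule applies directly; for 16 points and 5 clusters the square root (4) is too small, so the fallback yields 5 + ceil(0.5 × 11) = 11.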

However, determining the $K'$ initial prototypes is a difficult problem, and a random selection would give an unstable splitting result. For example, the method proposed in [30] uses K-means with randomly selected prototypes in its split stage, and the final clustering results are not unique. We therefore utilize the MST-based graph $G_{mst}(X', 3)$ to avoid this problem.
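A deterministic, degree-based choice of prototypes might look like the sketch below. The excerpt only states that the nodes of highest degree are used, so the details here are assumptions: the graph is given as a list of undirected edges, ties are broken by the smaller vertex index, and the name `top_degree_vertices` is hypothetical. The paper's full selection procedure may include further steps that this preview does not show.

```python
from collections import Counter

def top_degree_vertices(edges, num_prototypes):
    """Return the num_prototypes vertices with the highest degree.

    edges is an iterable of undirected (i, j) pairs, for example the
    edge set of a 3-MST graph. Ties are broken by the smaller index so
    that the result is reproducible (an assumption of this sketch).
    """
    degree = Counter()
    for i, j in edges:
        degree[i] += 1
        degree[j] += 1
    ranked = sorted(degree, key=lambda v: (-degree[v], v))
    return ranked[:num_prototypes]

edges = [(0, 1), (1, 2), (1, 3), (2, 3), (3, 4)]
print(top_degree_vertices(edges, 2))  # -> [1, 3]
```

Because the ranking depends only on the graph, repeated runs give the same prototypes, in contrast to random initialization.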

Fig. 1. The overview of split-and-merge. In Stage 1, the dataset $X$ is pruned into $X'$ according to the MST of $X$, and three iterations of MSTs of $X'$ are computed and combined into a 3-MST graph. In Stage 2, $X'$ is partitioned by K-means, where the initial prototypes are generated from the 3-MST graph. The partitions are then adjusted so that each partition is a subtree of the MST of $X'$. In Stage 3, the partitions are merged into the desired number of clusters and the pruned data points are distributed to the clusters.

C. Zhong et al. / Information Sciences 181 (2011) 3397–3410
