最小跨越随机森林在广义传递距离中的应用

需积分: 5 200 浏览量更新于2024-08-26 收藏 12.58MB PDF 举报

"这篇研究论文探讨了最小跨越随机森林的广义传递距离，这是一种用于聚类分析的新方法，它扩展了传统的基于最小生成树（MST）的传递距离概念，并将其应用于最小生成随机森林（MSRF）框架下。通过元素最大池化处理随机森林中的传递距离矩阵，该方法在噪声存在的情况下可以避免单一最小生成树导致的不良连接。" 正文: 在数据挖掘和机器学习领域，聚类是一种重要的无监督学习方法，用于发现数据中的结构和模式。传统的传递距离是基于超度量性质的距离度量，特别适用于聚类任务，因为它能有效地捕捉数据点之间的层次关系。然而，当数据中存在噪声或者复杂结构时，仅仅依赖于最小生成树（MST）计算的传递距离可能不足以准确地反映这些关系。这篇由Yu等人撰写的论文引入了一种新的概念——最小跨越随机森林的广义传递距离。最小生成随机森林（MSRF）是通过对原始数据集构建多棵随机森林来扩展传统的最小生成树方法。每棵树都生成一个传递距离矩阵，然后通过元素最大池化操作，将这些矩阵融合成一个单一的距离矩阵。这种策略的直觉在于，最大池化可以有效地抑制单棵最小生成树可能导致的不理想短链接，这些链接可能在有噪声的数据中出现，对聚类结果产生负面影响。从理论上讲，通过最大池化得到的距离矩阵仍然满足超度量性质，这意味着它保持了良好的聚类属性。超度量性质保证了距离矩阵满足三角不等式，这是许多聚类算法如层次聚类和DBSCAN的基础。此外，这种广义传递距离方法可能提高对非线性结构和异常值的鲁棒性，因为随机森林的多样性有助于捕捉数据的多个方面。在实验部分，作者可能对比了MSRF与仅使用MST的传递距离，以及其他聚类方法，例如K-means、谱聚类等，展示了新方法在不同数据集和噪声条件下的性能优势。论文可能会包括具体的结果分析，如聚类精度、轮廓系数等评价指标，以证明其有效性。 "最小跨越随机森林的广义传递距离"是一种创新的距离度量方法，旨在改善噪声环境下的聚类效果。这种方法结合了随机森林的多样性和传递距离的超度量特性，有望为数据挖掘和机器学习领域的聚类问题提供更强大、更稳健的解决方案。

Generalized Transitive Distance with Minimum Spanning Random Forest

Zhiding Yu

, Weiyang Liu

, Wenbo Liu

, Xi Peng

, Zhuo Hui

, B. V. K. Vijaya Kumar

Dept. of Electrical and Computer Eng., Carnegie Mellon University

School of Electronic and Computer Eng., Peking University, P.R. China

I2R, Agency for Sci., Tech. and Research (A*STAR), Singapore

yzhiding@andrew.cmu.edu, wyliu@pku.edu.cn, pangsaai@gmail.com, kumar@ece.cmu.edu

Abstract

Transitive distance is an ultrametric with elegant

properties for clustering. Conventional transitive

distance can be found by referring to the mini-

mum spanning tree (MST). We show that such dis-

tance metric can be generalized onto a minimum s-

panning random forest (MSRF) with element-wise

max pooling over the set of transitive distance ma-

trices from an MSRF. Our proposed approach is

both intuitively reasonable and theoretically attrac-

tive. Intuitively, max pooling alleviates undesired

short links with single MST when noise is present.

Theoretically, one can see that the distance metric

obtained max pooling is still an ultrametric, render-

ing many good clustering properties. Comprehen-

sive experiments on data clustering and image seg-

mentation show that MSRF with max pooling im-

proves the clustering performance over single MST

and achieves state of the art performance on the

Berkeley Segmentation Dataset.

1 Introduction

Over the past decades, clustering has been and is still one of

the most important and fundamental machine learning prob-

lem. A number of clustering methods have been proposed,

ranging from the famous k-means algorithm and graph-based

approaches (such as single linkage algorithm)

[

Sibson, 1973

]

to the family of mode seeking

[

Comaniciu and Meer, 2002

]

spectral clustering

[

Ng et al., 2002; Zelnik-Manor and Per-

ona, 2004; Shi and Malik, 2000

]

, and subspace cluster-

ing

[

Elhamifar and Vidal, 2009; Liu et al., 2013; 2013;

Peng et al., 2013; 2015

]

. Despite the large variety of different

methods, some general principles are commonly considered

when evaluating the performance among different methods.

These principles include:

• Ability to discover clusters with arbitrary shape.

• Robustness against noise

• Scalability

The family of spectral clustering methods received much

attention and found wide applications for the excellent clus-

tering performance. Given n data points, eigendecomposition

0 5 10 15 20 25 30

(a)

0 5 10 15 20 25 30

(b)

Figure 1: (a) Clustering with transitive distance on a single

MST. (b) Clustering with the proposed framework.

is conducted on an n × n normalized pairwise similarity ma-

trix, followed by k-means to generate clusters. A key reason

for spectral clustering’s success lies in its ability to discover

non-convex latent structures. This comes from the the fac-

t that eigendecomposition projects data in to a kernel space

with nicely shaped clusters.

Spectral clustering is not the only family of methods that

can handle clusters with arbitrary shapes. Transitive distance

clustering (also known as path based clustering) provides an

elegant and intuitive non-eigendecomposition alternative also

effective in handling non-convex clusters. Speciﬁcally, tran-

sitive distance emphasizes the connectivity rather than abso-

lute distance between pairwise data. This is achieved by ﬁnd-

ing the set of largest hops (edges) along all possible connect-

ing paths and deﬁning the pairwise distance as the minimum

hop.

[

Fischer and Buhmann, 2003b

]

proposed the concept

of transitive distance and an agglomerative bottom-up clus-

tering framework. The idea of connectivity kernel was later

proposed in

[

Fischer et al., 2004

]

. Other works include the

transitive closure

[

Ding et al., 2006

]

and transitive afﬁnity

[

Chang and Yeung, 2005; 2008

]

There exist some inherent connections between transitive

distance and minimum spanning tree. It was proved that the

transitive distance edge for all pairwise data lies on the mini-

mum spanning tree, if the maximum order (number of nodes)

of a path is equal to n. This, however, does not necessarily

mean that transitive distance clustering is identical to early

graph based method such as single linkage algorithm. There

are many nice properties associated with transitive distance,

one of them being that the transitive distance is an ultrametric

下载后可阅读完整内容，剩余6页未读，立即下载

weixin_38559727

粉丝: 6
资源: 924

最小跨越随机森林在广义传递距离中的应用

（7,4）循环码编码、最小广义距离译码

gmres.rar_GMRES_gmres迭代_valueuv6_广义最小残差_迭代广义

基于总体最小二乘法的离散广义系统的控制器设计方法

理解广义线性模型与随机森林

线性最小二乘问题与广义逆矩阵解析

广义时变系统参数辨识的最小二乘与随机梯度算法

非最小相位非线性广义系统状态反馈控制方法

矩阵求导解最小二乘问题：广义线性模型与概率分布

广义随机森林 效应估计

广义随机森林 平均效应估计

最新资源

广义随机森林效应估计

广义随机森林平均效应估计