Clustering by Passing Messages Between Data Points
Brendan J. Frey* and Delbert Dueck
Clustering data by identifying a subset of representative examples is important for processing
sensory signals and detecting patterns in data. Such “exemplars” can be found by randomly
choosing an initial subset of data points and then iteratively refining it, but this works well only if
that initial choice is close to a good solution. We devised a method called “affinity propagation,”
which takes as input measures of similarity between pairs of data points. Real-valued messages are
exchanged between data points until a high-quality set of exemplars and corresponding clusters
gradually emerges. We used affinity propagation to cluster images of faces, detect genes in
microarray data, identify representative sentences in this manuscript, and identify cities that are
efficiently accessed by airline travel. Affinity propagation found clusters with much lower error than
other methods, and it did so in less than one-hundredth the amount of time.
Clustering data based on a measure of similarity is a critical step in scientific data analysis and in engineering systems. A common approach is to use data to learn a set of centers such that the sum of squared errors between data points and their nearest centers is small. When the centers are selected from actual data points, they are called "exemplars." The popular k-centers clustering technique (1) begins with an initial set of randomly selected exemplars and iteratively refines this set so as to decrease the sum of squared errors. k-centers clustering is quite sensitive to the initial selection of exemplars, so it is usually rerun many times with different initializations in an attempt to find a good solution. However, this works well only when the number of clusters is small and chances are good that at least one random initialization is close to a good solution. We take a quite different approach and introduce a method that simultaneously considers all data points as potential exemplars. By viewing each data point as a node in a network, we devised a method that recursively transmits real-valued messages along edges of the network until a good set of exemplars and corresponding clusters emerges. As described later, messages are updated on the basis of simple formulas that search for minima of an appropriately chosen energy function. At any point in time, the magnitude of each message reflects the current affinity that one data point has for choosing another data point as its exemplar, so we call our method "affinity propagation." Figure 1A illustrates how clusters gradually emerge during the message-passing procedure.
Affinity propagation takes as input a collection of real-valued similarities between data points, where the similarity s(i,k) indicates how well the data point with index k is suited to be the exemplar for data point i. When the goal is to minimize squared error, each similarity is set to a negative squared error (Euclidean distance): for points $x_i$ and $x_k$, $s(i,k) = -\lVert x_i - x_k \rVert^2$. Indeed, the method described here can be applied when the optimization criterion is much more general. Later, we describe tasks where similarities are derived for pairs of images, pairs of microarray measurements, pairs of English sentences, and pairs of cities. When an exemplar-dependent probability model is available, s(i,k) can be set to the log-likelihood of data point i given that its exemplar is point k. Alternatively, when appropriate, similarities may be set by hand.
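For the squared-error case just described, a minimal sketch of the similarity computation (assuming NumPy and a data matrix X with one point per row; the function name is ours, not the authors'):

```python
import numpy as np

def negative_squared_error_similarities(X):
    """s(i,k) = -||x_i - x_k||^2 for every pair of rows of X."""
    # Pairwise differences via broadcasting: shape (n, n, d).
    diff = X[:, np.newaxis, :] - X[np.newaxis, :, :]
    return -(diff ** 2).sum(axis=2)  # shape (n, n)
```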
Rather than requiring that the number of clusters be prespecified, affinity propagation takes as input a real number s(k,k) for each data point k so that data points with larger values of s(k,k) are more likely to be chosen as exemplars. These values are referred to as "preferences." The number of identified exemplars (number of clusters) is influenced by the values of the input preferences, but also emerges from the message-passing procedure. If, a priori, all data points are equally suitable as exemplars, the preferences should be set to a common value; this value can be varied to produce different numbers of clusters. The shared value could be the median of the input similarities (resulting in a moderate number of clusters) or their minimum (resulting in a small number of clusters).
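A sketch of this preference-setting step under the same assumptions as above; whether the median is taken over all similarities or only the off-diagonal entries is a detail the text leaves open, so treat the choice below as ours:

```python
import numpy as np

def with_preferences(S, preference=None):
    """Return a copy of the similarity matrix S with a common value
    s(k,k) placed on the diagonal. The default, the median of the
    off-diagonal similarities, tends to yield a moderate number of
    clusters; passing S.min() instead yields fewer clusters."""
    S = S.copy()
    if preference is None:
        preference = np.median(S[~np.eye(len(S), dtype=bool)])
    S[np.diag_indices_from(S)] = preference
    return S
```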
There are two kinds of message exchanged between data points, and each takes into account a different kind of competition. Messages can be combined at any stage to decide which points are exemplars and, for every other point, which exemplar it belongs to. The "responsibility" r(i,k), sent from data point i to candidate exemplar point k, reflects the accumulated evidence for how well-suited point k is to serve as the exemplar for point i, taking into account other potential exemplars for point i (Fig. 1B). The "availability" a(i,k), sent from candidate exemplar point k to point i, reflects the accumulated evidence for how appropriate it would be for point i to choose point k as its exemplar, taking into account the support from other points that point k should be an exemplar (Fig. 1C). r(i,k) and a(i,k) can be viewed as log-probability ratios. To begin with, the availabilities are initialized to zero: a(i,k) = 0. Then, the responsibilities are computed using the rule
$$r(i,k) \;\leftarrow\; s(i,k) \,-\, \max_{k'\,\text{s.t.}\,k' \neq k} \bigl\{\, a(i,k') + s(i,k') \,\bigr\} \qquad (1)$$
In the first iteration, because the availabilities are zero, r(i,k) is set to the input similarity between point i and point k as its exemplar, minus the largest of the similarities between point i and other candidate exemplars. This competitive update is data-driven and does not take into account how many other points favor each candidate exemplar. In later iterations, when some points are effectively assigned to other exemplars, their availabilities will drop below zero as prescribed by the update rule below. These negative availabilities will decrease the effective values of some of the input similarities s(i,k′) in the above rule, removing the corresponding candidate exemplars from competition. For k = i, the responsibility r(k,k) is set to the input preference that point k be chosen as an exemplar, s(k,k), minus the largest of the similarities between point i and all other candidate exemplars. This "self-responsibility" reflects accumulated evidence that point k is an exemplar, based on its input preference tempered by how ill-suited it is to be assigned to another exemplar.
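In code, Eq. 1 can be applied to all pairs at once. The following vectorized sketch (our variable names, continuing the NumPy setting of the earlier sketches) tracks the largest and second-largest entry of each row of a(i,·) + s(i,·), so that the maximum over k′ ≠ k never includes column k itself:

```python
import numpy as np

def update_responsibilities(S, A):
    """One sweep of Eq. 1 over the (n, n) similarity matrix S and
    availability matrix A:
        r(i,k) <- s(i,k) - max over k' != k of { a(i,k') + s(i,k') }."""
    n = len(S)
    AS = A + S
    rows = np.arange(n)
    # Largest and second-largest entry of a(i,.) + s(i,.) in each row.
    k_max = np.argmax(AS, axis=1)
    first = AS[rows, k_max]
    AS[rows, k_max] = -np.inf
    second = np.max(AS, axis=1)
    # The maximum over k' != k equals the row maximum everywhere except
    # in the column attaining it, where it is the second-largest entry.
    R = S - first[:, np.newaxis]
    R[rows, k_max] = S[rows, k_max] - second
    return R
```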
Whereas the above responsibility update lets all candidate exemplars compete for ownership of a data point, the following availability update gathers evidence from data points as to whether each candidate exemplar would make a good exemplar:
$$a(i,k) \;\leftarrow\; \min\Bigl\{\, 0,\; r(k,k) + \sum_{i'\,\text{s.t.}\,i' \notin \{i,k\}} \max\bigl\{ 0, r(i',k) \bigr\} \,\Bigr\} \qquad (2)$$
The availability a(i,k) is set to the self-responsibility r(k,k) plus the sum of the positive responsibilities candidate exemplar k receives from other points. Only the positive portions of incoming responsibilities are added, because it is only necessary for a good exemplar to explain some data points well (positive responsibilities), regardless of how poorly it explains other data points (negative responsibilities). If the self-responsibility r(k,k) is negative (indicating that point k is currently better suited as belonging to another exemplar rather than being an exemplar itself), the availability of point k as an exemplar can be increased if some other points have positive responsibilities for point k being their exemplar. To limit the influence of strong incoming positive responsibilities, the total sum is thresholded so that it cannot go above zero.
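A matching sketch of Eq. 2, again vectorized over all pairs. Eq. 2 covers only the off-diagonal entries; the diagonal "self-availability" follows a separate rule that the paper gives after this passage, so the line restoring it below is our assumption rather than part of Eq. 2. The short driver at the end combines the two messages as the text above licenses, though the stopping rule and the damping used by the full procedure are details outside this excerpt:

```python
import numpy as np

def update_availabilities(R):
    """One sweep of Eq. 2 over the (n, n) responsibility matrix R:
        a(i,k) <- min{ 0, r(k,k) + sum over i' not in {i,k}
                                   of max{0, r(i',k)} }."""
    n = len(R)
    diag = np.diag_indices(n)
    Rp = np.maximum(R, 0)      # positive parts of the responsibilities
    Rp[diag] = R[diag]         # keep the self-responsibility r(k,k) as-is
    col_sums = Rp.sum(axis=0)  # r(k,k) + sum_{i' != k} max{0, r(i',k)}
    A = col_sums[np.newaxis, :] - Rp  # drop each point's own positive term
    self_avail = A[diag].copy()       # sum_{i' != k} max{0, r(i',k)}
    A = np.minimum(A, 0)              # the thresholding at zero in Eq. 2
    A[diag] = self_avail              # self-availability: rule given later in the paper
    return A

# Minimal driver, alternating the two updates for a fixed budget.
# S is a similarity matrix with preferences on its diagonal (see the
# earlier sketches); reading off exemplars from a(i,k) + r(i,k) is one
# way to combine the messages, as the text above allows.
S = with_preferences(negative_squared_error_similarities(np.random.randn(50, 2)))
A = np.zeros_like(S)
for _ in range(200):
    R = update_responsibilities(S, A)
    A = update_availabilities(R)
exemplars = np.argmax(A + R, axis=1)  # each point's chosen exemplar
```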
Department of Electrical and Computer Engineering, University of Toronto, 10 King's College Road, Toronto, Ontario M5S 3G4, Canada.
*To whom correspondence should be addressed. E-mail: frey@psi.toronto.edu