A Method for Accelerating Clustering Based on Pattern Similarity

Clustering methods


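The article examines a clustering method based on pattern similarity, which improves clustering efficiency by clustering on blocks of the data. The authors' new notion of similarity is not limited to traditional distance measures such as Manhattan or Euclidean distance; it focuses instead on the coherent patterns that objects exhibit on a subset of dimensions. This notion applies to a broader range of scenarios, for example genes whose expression levels change in step in DNA microarray analysis.

In data mining and machine learning, clustering is a common technique for identifying groups of similar objects in a data set. Traditional clustering methods usually base similarity on numerical distance, such as Manhattan distance, Euclidean distance, or the L_p distance, all of which assume that similar objects have close values on most or all dimensions. That assumption may not hold for some complex data patterns. The pCluster model proposed by Haixun Wang and Jian Pei introduces a new notion of similarity that does not rely entirely on closeness of values; it emphasizes the common pattern or trend that objects show on a particular subset of dimensions. The approach is more flexible, because it can handle data that behave consistently on certain dimensions while differing markedly on others. In bioinformatics, for example, two genes in a DNA microarray data set may have expression levels that rise or fall together under some conditions even though their levels differ widely at other time points; this synchronized change is an important form of similarity.

The pCluster workflow may include the following steps: first, preprocess the data and identify dimension subsets that may carry a pattern; next, compute the pattern similarity of objects on those subsets rather than comparing values directly; finally, cluster according to the similarity measure to form groups that share a common pattern. The advantage of this approach is that it can uncover structure and relationships that traditional distance measures may overlook. Pattern-based clustering can also be applied in other fields: in market segmentation, consumers' purchasing behavior may be consistent only in certain periods or situations rather than across all products or points in time; in social network analysis, users' behavior may be consistent on particular activities or topics, and those consistent patterns can reveal interest groups.

By focusing on pattern similarity in the data, the pCluster model offers a more comprehensive and adaptable clustering method that can improve clustering speed and accuracy, especially for data sets with complex patterns and structure, whose intrinsic characteristics it captures better.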
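To make the contrast between distance-based and pattern-based similarity concrete, here is a minimal Python sketch. The helper `shift_spread`, the two gene profiles, and the choice of coherent dimensions are all hypothetical illustrations, not part of the pCluster paper; the sketch only shows that two objects can be far apart in Euclidean distance and still move in lockstep on a subset of dimensions.

```python
import numpy as np

def shift_spread(x, y, dims):
    """Spread (max - min) of the differences x - y over the chosen dimensions.

    Illustrative helper, not taken from the paper: a spread of 0 means that
    y is exactly x shifted by a constant on those dimensions.
    """
    diff = np.asarray(x, dtype=float)[dims] - np.asarray(y, dtype=float)[dims]
    return float(diff.max() - diff.min())

# Hypothetical expression levels of two genes under five conditions.
gene_a = np.array([10.0, 40.0, 25.0, 90.0, 5.0])
gene_b = np.array([60.0, 90.0, 75.0, 20.0, 70.0])

# Assumed subset of conditions on which the two genes rise and fall together.
coherent_dims = [0, 1, 2]

euclidean = float(np.linalg.norm(gene_a - gene_b))
spread_subset = shift_spread(gene_a, gene_b, coherent_dims)
spread_all = shift_spread(gene_a, gene_b, list(range(len(gene_a))))

print(f"Euclidean distance over all dimensions: {euclidean:.1f}")
print(f"Shift spread on dimensions {coherent_dims}: {spread_subset:.1f}")
print(f"Shift spread on all dimensions: {spread_all:.1f}")
```

On the assumed subset the spread is zero even though the Euclidean distance is large, which is the kind of similarity a distance measure over all dimensions cannot express.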


484 J. Comput. Sci. & Technol., July 2008, Vol.23, No.4

mining pattern-based clusters. For example, the Pearson R model [18] studies the coherence among a set of objects, and Pearson R defines the correlation between two objects X and Y as:

$$R = \frac{\sum_i (X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum_i (X_i - \bar{X})^2 \times \sum_i (Y_i - \bar{Y})^2}}$$

where $X_i$ and $Y_i$ are the i-th attribute values of X and Y, and $\bar{X}$ and $\bar{Y}$ are the means of all attribute values in X and Y, respectively. From this formula, we can see that the Pearson R correlation measures the correlation between two objects with respect to all attribute values. A large positive value indicates a strong positive correlation while a large negative value indicates a strong negative correlation. However, some strong coherence may only exist on a subset of dimensions. For example, in collaborative filtering, six movies are ranked by viewers. The first three are action movies and the next three are family movies. Two viewers rank the movies as (8, 7, 9, 2, 2, 3) and (2, 1, 3, 8, 8, 9). The viewers' rankings can be grouped into two clusters, the first three movies in one cluster and the rest in another. It is clear that the two viewers have consistent bias within each cluster. However, the Pearson R correlation of the two viewers over all six movies does not reflect this agreement (it is in fact negative), because globally the two rankings share no explicit pattern.
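The movie example can be checked numerically. The following Python sketch applies the standard Pearson formula to the two rankings quoted in the text, first over all six movies and then within each genre; the function name `pearson_r` and the variable names are illustrative choices, not from the paper.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation of two equal-length sequences of attribute values."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx = x - x.mean()  # deviations of X from its mean
    dy = y - y.mean()  # deviations of Y from its mean
    return float((dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum()))

# Rankings from the text: three action movies followed by three family movies.
viewer_1 = [8, 7, 9, 2, 2, 3]
viewer_2 = [2, 1, 3, 8, 8, 9]

r_all = pearson_r(viewer_1, viewer_2)             # all six movies
r_action = pearson_r(viewer_1[:3], viewer_2[:3])  # action movies only
r_family = pearson_r(viewer_1[3:], viewer_2[3:])  # family movies only

print(f"all movies:    {r_all:.3f}")
print(f"action movies: {r_action:.3f}")
print(f"family movies: {r_family:.3f}")
```

Within each genre the two rankings differ only by a constant offset, so the coefficient is 1 in both groups, while the coefficient over all six movies comes out strongly negative (about -0.91) rather than close to zero. Either way, the global measure hides the agreement that exists on each subset of dimensions.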

2.2 Correlations in Subspaces

One way to discover the shifting pattern in Fig.2(a) using traditional subspace clustering algorithms (such as CLIQUE) is through data transformation. Given N attributes, $a_1, \ldots, a_N$, we define a derived attribute, $A_{ij} = a_i - a_j$, for every pair of attributes $a_i$ and $a_j$. Thus, our problem is equivalent to mining subspace clusters on the objects with the derived set of attributes. However, the converted data set will have N(N − 1)/2 dimensions and it becomes intractable even for a small N because of the curse of dimensionality.
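The transformation described above can be sketched in a few lines of Python. The function name `derive_pairwise_differences` and the small data matrix are hypothetical; the sketch only illustrates how objects related by a shift coincide in the derived space and how quickly the number of derived attributes grows.

```python
from itertools import combinations

import numpy as np

def derive_pairwise_differences(data):
    """Return derived attributes A_ij = a_i - a_j for every pair i < j.

    `data` has one row per object and one column per original attribute.
    """
    data = np.asarray(data, dtype=float)
    pairs = list(combinations(range(data.shape[1]), 2))
    derived = np.column_stack([data[:, i] - data[:, j] for i, j in pairs])
    return derived, pairs

# Hypothetical data: object 1 is object 0 shifted up by 5; object 2 is unrelated.
data = np.array([
    [1.0, 4.0, 2.0, 7.0],
    [6.0, 9.0, 7.0, 12.0],
    [3.0, 3.0, 8.0, 1.0],
])

derived, pairs = derive_pairwise_differences(data)
print("derived shape:", derived.shape)  # N = 4 attributes give 4 * 3 / 2 = 6 columns
print("objects 0 and 1 coincide:", np.allclose(derived[0], derived[1]))
print("objects 0 and 2 coincide:", np.allclose(derived[0], derived[2]))

# Growth of the derived dimensionality with N.
for n in (10, 100, 1000):
    print(n, "attributes ->", n * (n - 1) // 2, "derived attributes")
```

Objects that follow the same shifting pattern map to the same point in the derived space, so an ordinary subspace clustering algorithm could group them; the price is the quadratic growth in dimensions that the text points out.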

Cheng et al. introduced the bicluster concept [15] as a measure of the coherence of the genes and conditions in a submatrix of a DNA array. Let X be the set of genes and Y the set of conditions. Let I ⊂ X and J ⊂ Y be subsets of genes and conditions, respectively. The pair (I, J) specifies a submatrix $A_{IJ}$ with the following mean squared residue score:

$$H(I, J) = \frac{1}{|I||J|} \sum_{i \in I, j \in J} (d_{ij} - d_{iJ} - d_{Ij} + d_{IJ})^2, \tag{1}$$

where

$$d_{iJ} = \frac{1}{|J|} \sum_{j \in J} d_{ij}, \quad d_{Ij} = \frac{1}{|I|} \sum_{i \in I} d_{ij}, \quad d_{IJ} = \frac{1}{|I||J|} \sum_{i \in I, j \in J} d_{ij}$$

are the row and column means and the mean of the submatrix $A_{IJ}$, respectively. A submatrix $A_{IJ}$ is called a δ-bicluster if H(I, J) ≤ δ for some δ > 0. A random algorithm is designed for finding such clusters in a DNA array.
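As a minimal sketch of Equation (1), the Python function below computes the mean squared residue of a chosen submatrix with NumPy. The function name, the example matrices, and the injected outlier are assumptions made for illustration; this is not the search algorithm of Cheng et al., only the score it evaluates.

```python
import numpy as np

def mean_squared_residue(matrix, rows, cols):
    """Mean squared residue H(I, J) of the submatrix selected by rows I and cols J."""
    sub = np.asarray(matrix, dtype=float)[np.ix_(rows, cols)]
    row_means = sub.mean(axis=1, keepdims=True)  # d_iJ, one value per row
    col_means = sub.mean(axis=0, keepdims=True)  # d_Ij, one value per column
    overall_mean = sub.mean()                    # d_IJ
    residue = sub - row_means - col_means + overall_mean
    return float((residue ** 2).mean())

# Hypothetical 3 x 4 matrix in which every row is the first row plus a constant.
perfect = np.array([
    [1.0, 4.0, 2.0, 7.0],
    [6.0, 9.0, 7.0, 12.0],
    [3.0, 6.0, 4.0, 9.0],
])
rows, cols = [0, 1, 2], [0, 1, 2, 3]

h_perfect = mean_squared_residue(perfect, rows, cols)

# Same matrix with a single outlier cell.
with_outlier = perfect.copy()
with_outlier[2, 3] += 10.0
h_outlier = mean_squared_residue(with_outlier, rows, cols)

print(f"perfect shifting pattern: H = {h_perfect:.3f}")
print(f"one outlier cell:         H = {h_outlier:.3f}")
```

A perfect shifting pattern scores 0, and a single outlier cell raises the score for the whole submatrix. Because H is an average over all cells, a submatrix containing an outlier can still satisfy H(I, J) ≤ δ if δ is chosen loosely.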

Yang et al. [16] proposed a move-based algorithm to find biclusters more efficiently. It starts from a random set of seeds (initial clusters) and iteratively improves the clustering quality. It avoids the cluster overlapping problem as multiple clusters are found simultaneously. However, it still has the outlier problem, and it requires the number of clusters as an input parameter.

We noticed several limitations of this pioneering work as follows.

Fig.3. Mean squared residue cannot exclude outliers in a bicluster. (a) Dataset A: Residue 4.238 (without the outlier, the residue is 0). (b) Dataset B: Residue 5.722.
