An Analysis of Clustering Techniques in Data Mining

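"Survey of Clustering Data Mining Techniques"

Clustering is a data mining technique that divides a data set into groups of similar objects. Representing the data by fewer clusters loses some fine detail but achieves simplification. Clustering models data by its clusters, a concept with deep roots in the history of mathematics, statistics, and numerical analysis. From a machine learning perspective, clusters correspond to hidden patterns, and the search for clusters is a form of unsupervised learning: the system finds structure in the data without pre-labeled or pre-classified examples. The resulting system represents a data concept.

In practice, clustering plays an important role in scientific data exploration, information retrieval and text mining, spatial database applications, Web analysis, customer relationship management (CRM), marketing, medical diagnostics, computational biology, and many other fields. In recent years it has received wide attention in statistics, pattern recognition, and machine learning. Data mining adds further complications because it must handle very large data sets with many attributes. Such data sets may contain millions or even billions of records, each with hundreds of features, so a practical clustering algorithm must cope with high-dimensional data at an acceptable computational cost.

Clustering methods can be divided broadly into two classes: partitioning methods and hierarchical methods. Partitioning methods such as K-means, K-modes, and K-medoids take the number of clusters as given and then iteratively optimize the cluster centers. Hierarchical methods are either agglomerative or divisive, for example agglomerative clustering and DIANA (Divisive Analysis); they build a hierarchy of clusters by successively merging or splitting groups of objects. There are also density-based methods such as DBSCAN (Density-Based Spatial Clustering of Applications with Noise), which can find clusters of arbitrary shape in unevenly distributed data. Spectral clustering builds a graph from the similarity matrix of the data and forms clusters by cutting that graph.

Common measures of clustering quality include the silhouette coefficient, the Calinski-Harabasz index, and the Davies-Bouldin index. These measures describe how compact each cluster is and how well separated the clusters are, and so help to judge the quality of a clustering result.

Research on clustering is not limited to algorithm design. It also covers the handling of missing values and outliers and the choice of a suitable distance measure. With the arrival of big data, distributed clustering has become an active research topic as well, built on platforms such as Hadoop MapReduce and Spark and on graph-processing frameworks such as Giraph and GraphX, with the aim of making clustering efficient on very large data sets.

In summary, "Survey of Clustering Data Mining Techniques" reviews clustering as a key data mining technique: its theoretical foundations, application areas, classes of methods, and evaluation criteria. Progress in clustering continues to drive progress in data science and is essential for understanding and revealing hidden structure in data.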

While the algorithm CURE works with numerical attributes (particularly low-dimensional spatial data), the algorithm ROCK developed by the same researchers [Guha et al. 1999] targets hierarchical agglomerative clustering for categorical attributes. It is surveyed in the section Co-Occurrence of Categorical Data.

The hierarchical agglomerative algorithm CHAMELEON [Karypis et al. 1999a] utilizes dynamic modeling in cluster aggregation. It uses the connectivity graph G corresponding to the K-nearest neighbor model sparsification of the connectivity matrix: the edges of the K most similar points to any given point are preserved, the rest are pruned. CHAMELEON has two stages. In the first stage small tight clusters are built to ignite the second stage. This involves a graph partitioning [Karypis & Kumar 1999]. In the second stage an agglomerative process is performed. It utilizes measures of relative inter-connectivity $RI(C_i, C_j)$ and relative closeness $RC(C_i, C_j)$; both are locally normalized by quantities related to clusters $C_i$, $C_j$. In this sense the modeling is dynamic. Normalization involves certain non-obvious graph operations [Karypis & Kumar 1999]. CHAMELEON strongly relies on graph partitioning implemented in the library HMETIS (see the section Co-Occurrence of Categorical Data). The agglomerative process depends on user-provided thresholds. A decision to merge is made based on the combination

$RI(C_i, C_j) \cdot RC(C_i, C_j)$

of local relative measures. The algorithm does not depend on assumptions about the data model. This algorithm has been shown to find clusters of different shapes, densities, and sizes in 2D (two-dimensional) space. It has a complexity of $O(Nm + N \log(N) + m^2 \log(m))$, where m is the number of sub-clusters built during the first initialization phase. Figure 2 (analogous to the one in [Karypis & Kumar 1999]) presents a choice of four clusters (a)-(d) for a merge. While CURE would merge clusters (a) and (b), CHAMELEON makes the intuitively better choice of merging (c) and (d).

Figure 1: Agglomeration in CURE (Before / After).

Figure 2: CHAMELEON merges (c) and (d).
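The K-nearest neighbor sparsification and the merge decision described for CHAMELEON can be illustrated with a minimal Python sketch. It is not the CHAMELEON implementation: it omits the HMETIS partitioning stage and the computation of the relative measures, and the function names, the toy points, and the RI/RC values in `candidates` are hypothetical, chosen only for illustration.

```python
import math

def knn_graph(points, k):
    """Sparsified connectivity graph: keep only edges to each point's k nearest points."""
    n = len(points)
    edges = {i: set() for i in range(n)}
    for i in range(n):
        # Rank all other points by Euclidean distance to point i.
        others = sorted((j for j in range(n) if j != i),
                        key=lambda j: math.dist(points[i], points[j]))
        for j in others[:k]:
            # Symmetrized: an edge survives if either endpoint lists the other.
            edges[i].add(j)
            edges[j].add(i)
    return edges

def merge_score(ri, rc):
    """Combination RI * RC of relative inter-connectivity and relative closeness."""
    return ri * rc

def best_merge(candidates):
    """candidates maps a cluster pair to its (RI, RC) values; pick the best-scoring pair."""
    return max(candidates, key=lambda pair: merge_score(*candidates[pair]))

# Toy data: two tight groups far apart (made-up coordinates).
points = [(0.0, 0.0), (0.1, 0.0), (0.2, 0.0), (5.0, 5.0), (5.1, 5.0), (5.2, 5.0)]
graph = knn_graph(points, k=2)

# Made-up RI/RC values for two candidate merges.
candidates = {("a", "b"): (0.9, 0.2), ("c", "d"): (0.7, 0.8)}
chosen = best_merge(candidates)
```

With these toy points no edge connects the two groups, so the sparsified graph already separates them before any partitioning or merging takes place.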

2.3. Binary Divisive Partitioning

In linguistics, information retrieval, and document clustering applications binary taxonomies are very useful. Linear algebra methods based on singular value decomposition (SVD) are used for this purpose in collaborative filtering and information retrieval [Berry & Browne 1999]. Application of SVD to hierarchical divisive clustering of document collections resulted in the PDDP (Principal Direction Divisive Partitioning) algorithm [Boley 1998]. In our notation, object x is a document, the l-th attribute corresponds to a word (index term), and a matrix entry is a measure (such as TF-IDF) of the l-th term frequency in a document x. PDDP constructs the SVD decomposition of the matrix

$C = X - e\bar{x}, \quad \bar{x} = \frac{1}{N}\sum_{i=1:N} x_i, \quad e = (1, \ldots, 1)^T \in R^N,$

where $\bar{x}$ is the data centroid. This algorithm bisects data in Euclidean space by a hyperplane that passes through the data centroid and is orthogonal to the singular vector associated with the largest singular value (the principal direction). A k-way split is also possible if the k largest singular values are considered. Bisecting is a good way to categorize documents and it results in a binary tree. When k-means (2-means) is used for bisecting, the dividing hyperplane is orthogonal to the line connecting the two centroids. The comparative study of both approaches [Savaresi & Boley 2001] can be used for further references. Hierarchical divisive bisecting k-means was shown [Steinbach et al. 2000] to be preferable for document clustering.
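A single PDDP-style bisection can be sketched as follows. This is an illustrative sketch, not Boley's implementation: it approximates the principal direction by power iteration on the centered data instead of calling an SVD routine, and the function names, the iteration count, and the toy term-weight rows are assumptions.

```python
def principal_direction(rows, iters=200):
    """Centroid and leading right singular vector of the centered matrix (power iteration)."""
    n, d = len(rows), len(rows[0])
    xbar = [sum(r[l] for r in rows) / n for l in range(d)]
    centered = [[r[l] - xbar[l] for l in range(d)] for r in rows]
    # Start from the centered row of largest norm (an arbitrary but deterministic choice).
    v = list(max(centered, key=lambda c: sum(x * x for x in c)))
    for _ in range(iters):
        proj = [sum(c[l] * v[l] for l in range(d)) for c in centered]            # C v
        w = [sum(centered[i][l] * proj[i] for i in range(n)) for l in range(d)]  # C^T (C v)
        norm = sum(x * x for x in w) ** 0.5
        if norm == 0:
            break  # degenerate data: every row equals the centroid
        v = [x / norm for x in w]
    return xbar, v

def pddp_bisect(rows):
    """Split row indices by the hyperplane through the centroid orthogonal to v."""
    xbar, v = principal_direction(rows)
    left, right = [], []
    for i, r in enumerate(rows):
        score = sum((r[l] - xbar[l]) * v[l] for l in range(len(v)))
        (left if score <= 0 else right).append(i)
    return left, right

# Toy document-term weights (made up): rows 0-2 use term 0, rows 3-5 use terms 1 and 2.
docs = [[5, 0, 1], [4, 1, 0], [5, 1, 0], [0, 5, 4], [1, 4, 5], [0, 4, 5]]
left, right = pddp_bisect(docs)
```

Applying `pddp_bisect` recursively to each half would produce the binary tree mentioned above; which leaf to split next is a separate decision.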

While PDDP or 2-means are concerned with how to split a cluster, the problem of which cluster to split is also important. Simple strategies are: (1) split each node at a given level, (2) split the cluster with the highest cardinality, and (3) split the cluster with the largest intra-cluster variance. All three strategies have problems. For analysis regarding this subject and better alternatives, see [Savaresi et al. 2002].
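Strategies (2) and (3) can disagree on the same set of leaves, as the following sketch shows. It assumes that intra-cluster variance means the mean squared Euclidean distance to the centroid; the function names and the toy clusters are hypothetical.

```python
def variance(cluster):
    """Mean squared Euclidean distance of the points to their centroid (assumed definition)."""
    n, d = len(cluster), len(cluster[0])
    centroid = [sum(p[l] for p in cluster) / n for l in range(d)]
    return sum(sum((p[l] - centroid[l]) ** 2 for l in range(d)) for p in cluster) / n

def pick_by_cardinality(clusters):
    """Strategy (2): index of the cluster with the most points."""
    return max(range(len(clusters)), key=lambda i: len(clusters[i]))

def pick_by_variance(clusters):
    """Strategy (3): index of the cluster with the largest intra-cluster variance."""
    return max(range(len(clusters)), key=lambda i: variance(clusters[i]))

leaves = [
    [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1)],  # larger but tight
    [(0.0, 5.0), (4.0, 9.0)],                          # smaller but spread out
]
by_size = pick_by_cardinality(leaves)
by_spread = pick_by_variance(leaves)
```

Here the cardinality rule selects the tight four-point cluster, while the variance rule selects the spread-out two-point cluster.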

2.4. Other Developments

Ward’s method [Ward 1963] implements agglomerative clustering based not on a linkage metric, but on the objective function used in k-means (sub-section K-Means Methods). The merger decision is viewed in terms of its effect on the objective function.
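The effect of a merger on the objective function can be computed directly, as in this sketch. It assumes the k-means objective is the sum of squared Euclidean distances to cluster centroids and compares the direct computation with the standard closed form $\frac{|A||B|}{|A|+|B|}\|c_A - c_B\|^2$; the function names and toy clusters are illustrative.

```python
def centroid(points):
    """Component-wise mean of a list of points."""
    n = len(points)
    return [sum(p[l] for p in points) / n for l in range(len(points[0]))]

def sse(points):
    """Sum of squared Euclidean distances of the points to their centroid."""
    c = centroid(points)
    return sum(sum((x - y) ** 2 for x, y in zip(p, c)) for p in points)

def ward_cost(a, b):
    """Increase in the objective caused by merging clusters a and b."""
    return sse(a + b) - sse(a) - sse(b)

def ward_cost_closed_form(a, b):
    """Same quantity from the cluster sizes and centroids only."""
    gap = sum((x - y) ** 2 for x, y in zip(centroid(a), centroid(b)))
    return len(a) * len(b) / (len(a) + len(b)) * gap

a = [(0.0, 0.0), (2.0, 0.0)]
b = [(10.0, 0.0), (12.0, 0.0)]
c = [(3.0, 0.0), (5.0, 0.0)]
```

An agglomerative step in this scheme merges the pair with the smallest cost, so here a would be merged with c rather than with b.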

The popular hierarchical clustering algorithm for categorical data COBWEB [Fisher 1987] has two very important qualities. First, it utilizes incremental learning. Instead of following divisive or agglomerative approaches, it dynamically builds a dendrogram by processing one data point at a time. Second, COBWEB belongs to conceptual or model-based learning. This means that each cluster is considered as a model that can be described intrinsically, rather than as a collection of points assigned to it. COBWEB’s dendrogram is called a classification tree. Each tree node C, a cluster, is associated with the conditional probabilities for categorical attribute-value pairs,

$\Pr(x_l = v_{lp} \mid C), \quad l = 1:d, \quad p = 1:|A_l|.$

This can easily be recognized as a C-specific Naïve Bayes classifier. During the classification tree construction, every new point is descended along the tree and the tree is potentially updated (by an insert/split/merge/create operation). Decisions are based on an analysis of a category utility [Corter & Gluck 1992]

$CU\{C_1, \ldots, C_k\} = \Big(\sum_{j=1:k} CU(C_j)\Big)/k, \quad CU(C_j) = \sum_{l,p}\big(\Pr(x_l = v_{lp} \mid C_j)^2 - \Pr(x_l = v_{lp})^2\big)$

similar to the GINI index. It rewards clusters $C_j$ for increases in predictability of the categorical attribute values $v_{lp}$. Being incremental, COBWEB is fast with a complexity of $O(tN)$, though it depends non-linearly on tree characteristics packed into a constant t. There is a similar incremental hierarchical algorithm for all numerical attributes called CLASSIT [Gennari et al. 1989]. CLASSIT associates normal distributions with cluster nodes. Both algorithms can result in highly unbalanced trees.
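Category utility can be evaluated for a given partition with a few lines of Python. The sketch below follows the unweighted form used in this section (squared conditional probabilities minus squared overall probabilities, averaged over the k clusters) and estimates probabilities by relative frequencies; it does not implement COBWEB's incremental tree operations, and the function names and toy records are hypothetical.

```python
from collections import Counter

def squared_prob_sum(rows):
    """Sum over attributes l and values p of Pr(x_l = v_lp)^2, estimated from rows."""
    n, d = len(rows), len(rows[0])
    total = 0.0
    for l in range(d):
        counts = Counter(r[l] for r in rows)
        total += sum((c / n) ** 2 for c in counts.values())
    return total

def category_utility(clusters):
    """Average over clusters of (within-cluster predictability - overall predictability)."""
    all_rows = [r for c in clusters for r in c]
    base = squared_prob_sum(all_rows)
    return sum(squared_prob_sum(c) - base for c in clusters) / len(clusters)

# Toy records with attributes (color, shape); values are made up.
pure = [[("red", "round"), ("red", "round")],
        [("blue", "square"), ("blue", "square")]]
mixed = [[("red", "round"), ("blue", "square")],
         [("red", "round"), ("blue", "square")]]
cu_pure = category_utility(pure)
cu_mixed = category_utility(mixed)
```

The partition whose clusters make attribute values fully predictable gets a positive score, while the partition that mirrors the overall distribution in each cluster scores zero.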

Chiu et al. [2001] proposed another conceptual or model-based approach to hierarchical clustering. This development contains several useful features, such as the extension of BIRCH-like preprocessing to categorical attributes, outlier handling, and a two-step strategy for monitoring the number of clusters including BIC (defined below). The model associated with a cluster covers both numerical and categorical attributes and constitutes a blend of Gaussian and multinomial models. Denote the corresponding multivariate parameters by $\theta$. With every cluster C, we associate a logarithm of its (classification) likelihood

$l_C = \sum_{x_i \in C} \log p(x_i \mid \theta).$

The algorithm uses maximum likelihood estimates for the parameter $\theta$. The distance between two clusters is defined (instead of a linkage metric) as a decrease in log-likelihood

$d(C_1, C_2) = l_{C_1} + l_{C_2} - l_{C_1 \cup C_2}$

caused by merging of the two clusters under consideration. The agglomerative process continues until the stopping criterion is satisfied. As such, determination of the best k is automatic. This algorithm has a commercial implementation (in SPSS Clementine). The complexity of the algorithm is linear in N for the summarization phase.
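The log-likelihood distance can be illustrated for the simplest case. The sketch below is a deliberate simplification, not the algorithm of Chiu et al.: it assumes a single categorical attribute with a multinomial model and leaves out the Gaussian part, the preprocessing, and the stopping criterion. The function names and toy clusters are hypothetical.

```python
import math
from collections import Counter

def log_likelihood(values):
    """l_C = sum over x in C of log p(x | theta), with theta set to the observed frequencies."""
    n = len(values)
    counts = Counter(values)
    return sum(c * math.log(c / n) for c in counts.values())

def distance(c1, c2):
    """Decrease in log-likelihood caused by merging c1 and c2."""
    return log_likelihood(c1) + log_likelihood(c2) - log_likelihood(c1 + c2)

a = ["x", "x", "y", "y"]
b = ["x", "y", "x", "y"]   # same value frequencies as a
c = ["z", "z", "z", "z"]   # a different value altogether
d_ab = distance(a, b)
d_ac = distance(a, c)
```

Clusters with identical value frequencies are at distance zero, since merging them loses no likelihood, while merging clusters with different frequencies yields a positive distance.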

Traditional hierarchical clustering is inflexible due to its greedy approach: after a merge or a split is selected it is not refined. Though COBWEB does reconsider its decisions, it is so inexpensive that the resulting classification tree can also have sub-par quality. Fisher [1996] studied iterative hierarchical cluster redistribution to improve a dendrogram once it has been constructed. Karypis et al. [1999b] also researched refinement for hierarchical clustering. In particular, they brought attention to the relation of such refinement to the well-studied refinement of k-way graph partitioning [Kernighan & Lin 1970].

For references related to parallel implementation of hierarchical clustering see [Olson 1995].

3. Partitioning Relocation Clustering

In this section we survey data partitioning algorithms, which divide data into several subsets. Because checking all possible subset systems is computationally infeasible, certain greedy heuristics are used in the form of iterative optimization. Specifically, this means different relocation schemes that iteratively reassign points between the k clusters. Unlike traditional hierarchical methods, in which clusters are not revisited after being constructed, relocation algorithms gradually improve clusters. With appropriate data, this results in high-quality clusters.
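Iterative relocation can be sketched with a k-means-style loop. This is one possible relocation scheme, assuming squared Euclidean distance and centroid updates; the function names, the toy points, and the deliberately poor starting centers are illustrative.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(p, q))

def assign(points, centers):
    """Label each point with the index of its nearest center."""
    return [min(range(len(centers)), key=lambda j: sq_dist(p, centers[j]))
            for p in points]

def relocate(points, centers, max_iters=100):
    """Alternate reassignment and center updates until no point changes cluster."""
    centers = list(centers)
    labels = None
    for _ in range(max_iters):
        new_labels = assign(points, centers)
        if new_labels == labels:
            break  # no point was relocated: a local optimum is reached
        labels = new_labels
        for j in range(len(centers)):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old center if a cluster becomes empty
                centers[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return labels, centers

points = [(0, 0), (0, 1), (1, 0), (9, 9), (9, 10), (10, 9)]
# Poor start: both initial centers lie inside the first group.
labels, centers = relocate(points, [(0, 0), (1, 0)])
```

Even from this poor start, the point (1, 0) is first grouped with the distant points and is relocated in the next pass, which illustrates how clusters are revisited and gradually improved.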

One approach to data partitioning is to take a conceptual point of view that identifies the

cluster with a certain model whose unknown parameters have to be found. More
