The BIRCH Algorithm: An Efficient Clustering Solution for Large Datasets


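BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies) is a data clustering algorithm developed by Tian Zhang, Raghu Ramakrishnan, and Miron Livny. The version excerpted here was published in 1997 in Data Mining and Knowledge Discovery, volume 1, issue 2, pages 141-182. The work was motivated by the growing need to analyze very large datasets, where clustering is a core data mining task.

BIRCH targets the limits that earlier clustering methods hit on large inputs, chiefly limited memory and CPU time. It condenses the raw data into a hierarchy of compact summaries and makes clustering decisions on those summaries instead of on the individual points. Its main ideas are:

1. **Tree structure**: BIRCH builds a height-balanced CF-tree (clustering feature tree) incrementally as data points are inserted. Each entry in a node summarizes a subcluster, which reduces storage and computation.
2. **Summary representation**: An entry does not store its data points. It stores a clustering feature (CF), a small summary from which measurements such as the centroid, radius, and diameter of the subcluster can be computed.
3. **Hierarchical partitioning**: Each new point descends to the closest leaf entry and is absorbed there if the threshold allows. Full nodes are split and nearby entries may be merged, so the tree reflects the structure of the data.
4. **Scalability**: The tree size is governed by the threshold T, and node capacity by the page size P, so the tree can be tuned to datasets of different sizes.
5. **Applications**: The algorithm has been applied to tasks such as data classification and image processing, and it is suited to finding patterns in very large datasets.

In short, BIRCH is an efficient, scalable clustering technique that trades some detail for the ability to handle large datasets under limited resources, which makes it a useful tool for exploratory data analysis.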

148 ZHANG, RAMAKRISHNAN AND LIVNY

CF Representativity Theorem: Given the CF entries of subclusters, all the measurements defined in Section 3 can be computed accurately.

CF Additivity Theorem: Assume that $CF_1 = (N_1, \vec{LS}_1, SS_1)$ and $CF_2 = (N_2, \vec{LS}_2, SS_2)$ are the CF entries of two disjoint subclusters. Then the CF entry of the subcluster that is formed by merging the two disjoint subclusters is:

$$CF_1 + CF_2 = (N_1 + N_2,\ \vec{LS}_1 + \vec{LS}_2,\ SS_1 + SS_2) \tag{11}$$

The theorem’s proof consists of conventional vector space algebra (Zhang, 1996). According to the CF definition and the CF representativity theorem, one can think of a subcluster as a set of data points, and the CF entry stored as a summary. This CF entry is not only compact because it stores much less than all the data points in the subcluster, it is also accurate because it is sufficient for calculating all the measurements (as defined in Section 3) that we need for making clustering decisions in BIRCH. According to the CF additivity theorem, the CF entries can be stored and calculated incrementally and consistently as subclusters are merged or new data points are inserted.
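The two theorems can be illustrated with a minimal Python sketch. The class name `CF` and its method names are hypothetical. Section 3 is not reproduced in this excerpt, so the radius and diameter formulas below assume the usual root-mean-square definitions (distance to the centroid, and pairwise distance between points) rather than quoting the paper's equations.

```python
import math
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class CF:
    """Clustering feature of a subcluster: (N, LS, SS). Hypothetical helper class."""
    n: int                   # N: number of data points
    ls: Tuple[float, ...]    # LS: per-dimension linear sum of the points
    ss: float                # SS: sum of squared norms of the points

    @staticmethod
    def from_point(x):
        # A single data point is a subcluster with N = 1.
        return CF(1, tuple(x), sum(v * v for v in x))

    def __add__(self, other):
        # CF Additivity Theorem, equation (11): add the three parts component-wise.
        return CF(self.n + other.n,
                  tuple(a + b for a, b in zip(self.ls, other.ls)),
                  self.ss + other.ss)

    def centroid(self):
        return tuple(v / self.n for v in self.ls)

    def radius(self):
        # Assumed definition: RMS distance from the points to the centroid.
        c2 = sum(v * v for v in self.centroid())
        return math.sqrt(max(self.ss / self.n - c2, 0.0))

    def diameter(self):
        # Assumed definition: RMS pairwise distance between the points.
        if self.n < 2:
            return 0.0
        ls2 = sum(v * v for v in self.ls)
        return math.sqrt(max(2 * self.n * self.ss - 2 * ls2, 0.0)
                         / (self.n * (self.n - 1)))
```

Merging the CF entries of the points (0, 0) and (2, 0) with `+` gives `CF(2, (2.0, 0.0), 4.0)`, from which the centroid (1, 0), radius 1, and diameter 2 follow without revisiting either point.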

4.2. CF-tree

A CF-tree is a height-balanced tree with two parameters: branching factor (B for nonleaf node and L for leaf node) and threshold T. Each nonleaf node contains at most B entries of the form $[CF_i, \mathit{child}_i]$, where $i = 1, 2, \ldots, B$, ‘$\mathit{child}_i$’ is a pointer to its i-th child node, and $CF_i$ is the CF entry of the subcluster represented by this child. So a nonleaf node represents a subcluster made up of all the subclusters represented by its entries. A leaf node contains at most L entries, and each entry is a CF. In addition, each leaf node has two pointers, ‘prev’ and ‘next’, which are used to chain all leaf nodes together for efficient scans. A leaf node also represents a subcluster made up of all the subclusters represented by its entries. But all entries in a leaf node must satisfy a threshold requirement, with respect to a threshold value T: the diameter (alternatively, the radius) of each leaf entry has to be less than T.

The tree size is a function of T. The larger T is, the smaller the tree is. We require a node to fit in a page of size P, where P is a parameter of BIRCH. Once the dimension d of the data space is given, the sizes of leaf and nonleaf entries are known, and then B and L are determined by P. So P can be varied for performance tuning.
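How B and L follow from P and d can be made concrete with a little arithmetic. The byte layout below is purely an assumption for illustration (8-byte numbers, 8-byte pointers, a CF stored as N plus d components of LS plus SS); the excerpt does not state the sizes the authors used, and the function names are hypothetical.

```python
def entry_sizes(d, word=8, ptr=8):
    """Return (nonleaf_entry_bytes, leaf_entry_bytes) under the assumed layout."""
    cf = (d + 2) * word        # N, the d components of LS, and SS
    return cf + ptr, cf        # a nonleaf entry is [CF_i, child_i]; a leaf entry is a CF


def capacities(page_size, d, word=8, ptr=8):
    """Return (B, L): the most entries that fit in one page of `page_size` bytes."""
    nonleaf, leaf = entry_sizes(d, word, ptr)
    b = page_size // nonleaf
    # Assumption: a leaf page also reserves room for its 'prev' and 'next' pointers.
    leaf_cap = (page_size - 2 * ptr) // leaf
    return b, leaf_cap
```

Under these assumptions `capacities(1024, 2)` gives `(25, 31)`, `capacities(4096, 2)` gives `(102, 127)`, and `capacities(1024, 10)` gives `(9, 10)`: a larger page raises both limits, and a higher dimension lowers them.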

Such a CF-tree will be built dynamically as new data objects are inserted. It is used to guide a new insertion into the correct subcluster for clustering purposes just as a B+-tree is used to guide a new insertion into the correct position for sorting purposes. However, the CF-tree is a very compact representation of the dataset because each entry in a leaf node is not a single data point but a subcluster (which absorbs as many data points as the specific threshold value allows).
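The effect of T on compactness can be seen in a deliberately simplified sketch that keeps only a flat list of leaf entries, with no tree and no page limit. CF entries are plain tuples `(N, LS, SS)`. The closeness measure is the Euclidean distance between centroids, chosen here for simplicity, and the diameter is assumed to be the root-mean-square pairwise distance. All function names are hypothetical.

```python
import math


def merge(a, b):
    # CF additivity: component-wise sum of (N, LS, SS).
    return (a[0] + b[0], tuple(x + y for x, y in zip(a[1], b[1])), a[2] + b[2])


def diameter(cf):
    # Assumed definition: RMS pairwise distance, computed from the CF alone.
    n, ls, ss = cf
    if n < 2:
        return 0.0
    return math.sqrt(max(2 * n * ss - 2 * sum(v * v for v in ls), 0.0) / (n * (n - 1)))


def centroid_dist(a, b):
    return math.dist([v / a[0] for v in a[1]], [v / b[0] for v in b[1]])


def summarize(points, threshold):
    """Absorb each point into its closest entry if the merged diameter stays below T."""
    entries = []
    for p in points:
        ent = (1, tuple(p), sum(v * v for v in p))
        if entries:
            i = min(range(len(entries)), key=lambda k: centroid_dist(entries[k], ent))
            merged = merge(entries[i], ent)
            if diameter(merged) < threshold:
                entries[i] = merged
                continue
        entries.append(ent)    # no entry can absorb the point: start a new subcluster
    return entries
```

For the five points (0, 0), (0.1, 0), (5, 5), (5.1, 5), (10, 0), a threshold of 0.05 leaves five entries, 0.5 leaves three, and 100 leaves one, which matches the statement that a larger T gives a smaller summary.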

BIRCH: A NEW DATA CLUSTERING ALGORITHM AND ITS APPLICATIONS 149

4.3. Insertion Algorithm

We now present the algorithm for inserting a CF entry ‘Ent’ (a single data point or a subcluster) into a CF-tree.

1. Identifying the appropriate leaf: Starting from the root, recursively descend the CF-tree by choosing the closest child node according to a chosen distance metric: D0, D1, D2, D3 or D4 as defined in Section 3.

2. Modifying the leaf: Upon reaching a leaf node, find the closest leaf entry, say $L_i$, and then test whether $L_i$ can ‘absorb’ ‘Ent’ without violating the threshold condition. (That is, the cluster merged with ‘Ent’ and $L_i$ must satisfy the threshold condition. Note that the CF entry of the new cluster can be computed from the CF entries for $L_i$ and ‘Ent’.) If so, update the CF entry for $L_i$ to reflect this. If not, add a new entry for ‘Ent’ to the leaf. If there is space on the leaf for this new entry to fit in, we are done, otherwise we must split the leaf node. Node splitting is done by choosing the farthest pair of entries as seeds, and redistributing the remaining entries based on the closest criteria.

3. Modifying the path to the leaf: After inserting ‘Ent’ into a leaf, update the CF information for each nonleaf entry on the path to the leaf. In the absence of a split, this simply involves updating existing CF entries to reflect the addition of ‘Ent’. A leaf split requires us to insert a new nonleaf entry into the parent node, to describe the newly created leaf. If the parent has space for this entry, at all higher levels, we only need to update the CF entries to reflect the addition of ‘Ent’. In general, however, we may have to split the parent as well, and so on up to the root. If the root is split, the tree height increases by one.

4. A Merging Refinement: Splits are caused by the page size, which is independent of the clustering properties of the data. In the presence of skewed data input order, this can affect the clustering quality, and also reduce space utilization. A simple additional merging step often helps ameliorate these problems: Suppose that there is a leaf split, and the propagation of this split stops at some nonleaf node $N_j$, i.e., $N_j$ can accommodate the additional entry resulting from the split. We now scan node $N_j$ to find the two closest entries. If they are not the pair corresponding to the split, we try to merge them and the corresponding two child nodes. If there are more entries in the two child nodes than one page can hold, we split the merging result again. During the resplitting, in case one of the seeds attracts enough merged entries to fill a page, we just put the rest of the entries with the other seed. In summary, if the merged entries fit on a single page, we free a node (page) for later use and create space for one more entry in node $N_j$, thereby increasing space utilization and postponing future splits; otherwise we improve the distribution of entries in the closest two children.
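Steps 1 to 3 can be sketched for the case where no node overflows. This is an illustration under stated assumptions, not the authors' implementation: splitting is left out entirely, closeness is measured by the Euclidean distance between centroids (one simple choice, standing in for the metrics of Section 3), the diameter is assumed to be the root-mean-square pairwise distance, and `Node`, `insert`, and the helper functions are hypothetical names.

```python
import math


def add(a, b):
    # CF additivity on (N, LS, SS) tuples.
    return (a[0] + b[0], tuple(x + y for x, y in zip(a[1], b[1])), a[2] + b[2])


def point_cf(p):
    return (1, tuple(p), sum(v * v for v in p))


def centroid_dist(a, b):
    return math.dist([v / a[0] for v in a[1]], [v / b[0] for v in b[1]])


def diameter(cf):
    n, ls, ss = cf
    if n < 2:
        return 0.0
    return math.sqrt(max(2 * n * ss - 2 * sum(v * v for v in ls), 0.0) / (n * (n - 1)))


class Node:
    def __init__(self, leaf):
        self.leaf = leaf
        self.cfs = []        # one CF per entry
        self.children = []   # one child Node per entry (nonleaf nodes only)


def closest(cfs, ent):
    return min(range(len(cfs)), key=lambda k: centroid_dist(cfs[k], ent))


def insert(node, ent, threshold):
    """Insert CF entry `ent` below `node`. Node splitting is deliberately omitted."""
    if node.leaf:
        if node.cfs:
            i = closest(node.cfs, ent)
            merged = add(node.cfs[i], ent)
            if diameter(merged) < threshold:   # step 2: the closest entry absorbs Ent
                node.cfs[i] = merged
                return
        node.cfs.append(ent)                   # step 2: otherwise add a new leaf entry
        return
    i = closest(node.cfs, ent)                 # step 1: descend to the closest child
    insert(node.children[i], ent, threshold)
    node.cfs[i] = add(node.cfs[i], ent)        # step 3: update the CF on the path
```

Because of additivity, the path update in step 3 never needs to look inside the child: adding the CF of ‘Ent’ to the parent entry gives the same result whether the leaf absorbed the entry or stored it as a new one.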

The above steps work together to dynamically adjust the CF-tree to reduce its sensitivity to the data input ordering.
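The splitting rule of step 2 and the merging refinement of step 4 can be sketched together. The code below is a simplified illustration under assumptions: child nodes are plain lists of CF tuples `(N, LS, SS)`, closeness is the Euclidean distance between centroids, `capacity` stands for the number of entries one page can hold, and every function name is hypothetical.

```python
import math
from itertools import combinations


def add(a, b):
    return (a[0] + b[0], tuple(x + y for x, y in zip(a[1], b[1])), a[2] + b[2])


def summary(entries):
    # CF of a whole node, obtained by additivity over its entries.
    total = entries[0]
    for e in entries[1:]:
        total = add(total, e)
    return total


def dist(a, b):
    return math.dist([v / a[0] for v in a[1]], [v / b[0] for v in b[1]])


def resplit(entries, capacity):
    """Split entries into two groups, seeded by the farthest pair."""
    i, j = max(combinations(range(len(entries)), 2),
               key=lambda p: dist(entries[p[0]], entries[p[1]]))
    seeds = (entries[i], entries[j])
    groups = ([seeds[0]], [seeds[1]])
    for k, e in enumerate(entries):
        if k in (i, j):
            continue
        g = 0 if dist(e, seeds[0]) <= dist(e, seeds[1]) else 1
        if len(groups[g]) >= capacity:   # that seed's page is full: use the other seed
            g = 1 - g
        groups[g].append(e)
    return list(groups)


def merging_refinement(children, capacity, just_split):
    """children: child nodes under N_j; just_split: indices of the pair made by the split."""
    a, b = min(combinations(range(len(children)), 2),
               key=lambda p: dist(summary(children[p[0]]), summary(children[p[1]])))
    if {a, b} == set(just_split):
        return children                   # the closest pair is the split pair: no change
    merged = children[a] + children[b]
    rest = [c for k, c in enumerate(children) if k not in (a, b)]
    if len(merged) <= capacity:
        return rest + [merged]            # fits on one page: one node is freed
    return rest + resplit(merged, capacity)
```

When the merged entries fit on one page, the result has one child fewer than before, which corresponds to the freed node described in step 4. Otherwise the two closest children are replaced by a redistributed pair.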
