Clustering Data Streams: Theory and Practice

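"Clustering Data Streams: Theory and Practice"

Data stream clustering is a technique for processing a continuously arriving, potentially unbounded sequence of data. With the growth of big data, the data stream model has drawn wide attention because it can handle large, dynamic data sets effectively. This paper, co-authored by Sudipto Guha, Adam Meyerson, Nina Mishra, Rajeev Motwani, and Liadan O'Callaghan, examines both the theory and the practice of clustering data streams.

The paper notes that the data stream model arose from the needs of applications that handle data such as telephone records, web documents, and clickstreams. A key feature of the model is that the analysis must be completed in one pass, or a small number of passes (linear scans), over the data, using limited memory and computation. This matters most when processing massive volumes of real-time data.

The authors propose a clustering algorithm for data streams that remains efficient on large streams. Clustering is a central task in data mining: it groups similar data points together into "clusters". Traditional clustering algorithms may not suit data streams, because they usually assume that the data can be accessed repeatedly and that memory is plentiful.

The empirical analysis in the paper reports the algorithm's behavior on both synthetic and real data streams and supports its effectiveness and adaptability in the streaming setting. Through comparison and evaluation, the authors presumably show how the algorithm balances processing speed, memory footprint, and clustering quality, which is essential for real-time data analysis and decision making.

In the introduction, the paper defines the basic notion of a data stream: an ordered sequence of points (x1, ..., xn) that can be read only in order, and usually only once or a small number of times. This model fits scenarios that need fast response and real-time analysis, such as network traffic monitoring, social media analysis, or financial transaction monitoring.

In summary, the paper provides a theoretical foundation and practical experience for clustering data streams, and it offers guidance on handling continuously changing data sets under resource constraints. It highlights the challenges of designing and evaluating clustering algorithms in the data stream setting, along with ways to address them, and is of value to researchers and practitioners in big data analysis and related fields.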

Other known approaches such as DBSCAN [19], OPTICS [5], DENCLUE [39], STING [75], CLIQUE [3], Wave-Cluster [70], and OPTIGRID [40] are not designed to optimize the $k$-Median objective.

3 A Provable Stream Clustering Framework

3.1 Clustering in Small Space

Data stream algorithms must not have large space requirements, and so our first goal will be to show that clustering can be carried out in small space ($n^{\epsilon}$ for $n$ data points, and $0 < \epsilon < 1$), without being concerned with the number of passes. Subsequently we will develop a one-pass algorithm. We first investigate algorithms that examine the data in a piecemeal fashion. In particular, we study the performance of a divide-and-conquer algorithm, called Small-Space, that divides the data into pieces, clusters each of these pieces, and then again clusters the centers obtained (where each center is weighted by the number of points assigned to it). We show that this piecemeal approach is good, in that if we had a constant-factor approximation algorithm, running it in divide-and-conquer fashion would still yield a (slightly worse) constant-factor approximation. We then propose another algorithm (Smaller-Space) that is similar to the piecemeal approach except that instead of reclustering only once, it repeatedly reclusters weighted centers. For this algorithm, we prove that if we recluster a constant number of times, a constant-factor approximation is still obtained, although, as expected, the constant factor worsens with each successive reclustering.

3.1.1 Simple Divide-and-Conquer and Separability Theorems

For simplicity we start with the version of the algorithm that reclusters only once.

Algorithm Small-Space(S)

1. Divide $S$ into $\ell$ disjoint pieces $\chi_1, \ldots, \chi_\ell$.

2. For each $i$, find $O(k)$ centers in $\chi_i$. Assign each point in $\chi_i$ to its closest center.

3. Let $\chi'$ be the $O(\ell k)$ centers obtained in (2), where each center $c$ is weighted by the number of points assigned to it.

4. Cluster $\chi'$ to find $k$ centers.
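The four steps above can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the helper `weighted_kmedian` is a hypothetical stand-in that solves discrete k-median exactly by exhaustive search (usable only on tiny inputs), the pieces are contiguous chunks, and exactly $k$ centers are found per piece where the algorithm allows $O(k)$.

```python
from itertools import combinations

def dist(p, q):
    """Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5

def weighted_kmedian(points, weights, k):
    """Hypothetical stand-in subroutine: exact discrete k-median by
    exhaustive search. Exponential in k; for tiny illustrative inputs only."""
    k = min(k, len(points))
    def cost(meds):
        return sum(w * min(dist(x, c) for c in meds)
                   for x, w in zip(points, weights))
    return list(min(combinations(points, k), key=cost))

def small_space(S, k, num_pieces):
    # Step 1: divide S into disjoint pieces (assumption: contiguous chunks).
    size = -(-len(S) // num_pieces)  # ceiling division
    pieces = [S[i:i + size] for i in range(0, len(S), size)]
    centers, weights = [], []
    for piece in pieces:
        # Step 2: find k centers in the piece and assign each point
        # to its closest center.
        meds = weighted_kmedian(piece, [1] * len(piece), k)
        counts = [0] * len(meds)
        for x in piece:
            j = min(range(len(meds)), key=lambda j: dist(x, meds[j]))
            counts[j] += 1
        # Step 3: keep the centers, weighted by the number of assigned points.
        centers.extend(meds)
        weights.extend(counts)
    # Step 4: cluster the weighted centers to obtain the final k centers.
    return weighted_kmedian(centers, weights, k)

S = [(0.0,), (1.0,), (2.0,), (10.0,), (11.0,), (12.0,), (0.5,), (11.5,)]
print(small_space(S, k=2, num_pieces=2))
```

In practice the exhaustive subroutine would be replaced by a constant-factor approximation algorithm for weighted k-median; only the divide, weight, and recluster structure is the point here.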

We are interested in clustering in small space, so $\ell$ will be set so that each piece $\chi_i$ and the weighted set $\chi'$ fit in main memory. If $S$ is very large, no such $\ell$ may exist; we will address this issue later.
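As a rough illustration of this memory constraint, the sketch below picks the number of pieces under two simplifying assumptions that are not from the paper: memory is measured as a number of points, and each piece contributes exactly $k$ centers. The function name `choose_num_pieces` is hypothetical.

```python
def choose_num_pieces(n, k, memory):
    """Return the smallest number of pieces such that each piece
    (ceil(n / pieces) points) and the weighted centers (pieces * k points)
    both fit in `memory` points, or None if no such value exists."""
    lowest = max(1, -(-n // memory))  # pieces must be small enough to fit
    highest = memory // k             # the weighted centers must fit too
    return lowest if lowest <= highest else None

print(choose_num_pieces(n=1_000_000, k=10, memory=10_000))
print(choose_num_pieces(n=1_000_000_000, k=10, memory=10_000))
```

Under these assumptions a feasible value exists only when roughly `memory * memory >= n * k`, which is why a very large input can leave no valid choice after a single level of division.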

Definition 1 (The $k$-median Problem) Given an instance $(S, k)$ of $k$-Median, i.e., an integer $k$ and a set $S$ of $n$ points with metric $d(\cdot, \cdot)$, the $k$-Median cost (or simply the cost) of a set of medians $C_1, \ldots, C_k$ is $f(S, C_1, \ldots, C_k) = \sum_{x \in S} \min_{1 \le i \le k} d(x, C_i)$. That is, the cost of a solution is the sum of assignment distances. Define cost$(S, Q)$ to be the smallest possible cost if the medians are required to belong to the set $Q$. The optimization problem is to find cost$(S, S)$ for the discrete case and cost$(S, \mathbb{R}^d)$ for the Euclidean case.
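To make the definition concrete, here is a small Python sketch of the cost function and of the discrete optimum cost(S, Q) computed by brute force. The function names and the Euclidean metric are assumptions made for illustration; the exhaustive search is only feasible for very small inputs.

```python
from itertools import combinations

def dist(p, q):
    """Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5

def kmedian_cost(points, medians):
    """Sum over all points of the distance to the nearest median."""
    return sum(min(dist(x, c) for c in medians) for x in points)

def best_cost(points, candidates, k):
    """cost(S, Q): minimum cost over all k-subsets of the candidate set Q."""
    return min(kmedian_cost(points, subset)
               for subset in combinations(candidates, k))

S = [(0.0,), (1.0,), (2.0,), (10.0,), (11.0,)]
print(kmedian_cost(S, [(1.0,), (10.0,)]))  # 3.0
print(best_cost(S, S, 2))                  # 3.0
```

Here the medians 1 and 10 are already optimal among points of S: the left group {0, 1, 2} costs 2 and the right group {10, 11} costs 1.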

Before analyzing algorithm Small-Space, we describe the relationship between the discrete and continuous clustering problem. The following is folklore (the proof can be found in [34]):

Theorem 1 Given an instance $(S, k)$ of $k$-Median, cost$(S, S) \le 2\,$cost$(S, Q)$ for any $Q$.

The following separability theorem sets the stage for a divide-and-conquer algorithm. This theorem carries over to other clustering metrics such as the sum of squared distances.

Theorem 2 Consider an arbitrary partition of a set $S$ of $n$ points into $\chi_1, \ldots, \chi_\ell$. Then $\sum_{i=1}^{\ell}$ cost$(\chi_i, \chi_i) \le 2\,$cost$(S, S)$.

Proof: From Theorem 1, cost$(\chi_i, \chi_i) \le 2\,$cost$(\chi_i, S)$. Summing over $i$, the result follows.
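The separability bound can be checked numerically on a small instance. The sketch below is only an illustration under stated assumptions: random points in the unit square, the Euclidean metric, a fixed partition into three pieces, and exact costs computed by brute force with a hypothetical helper `cost`.

```python
import random
from itertools import combinations

def dist(p, q):
    """Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5

def cost(points, candidates, k):
    """Exact discrete k-median cost with medians drawn from `candidates`."""
    return min(sum(min(dist(x, c) for c in meds) for x in points)
               for meds in combinations(candidates, k))

random.seed(0)
S = [(random.random(), random.random()) for _ in range(12)]
k = 2
pieces = [S[0:4], S[4:8], S[8:12]]   # an arbitrary partition of S

lhs = sum(cost(piece, piece, k) for piece in pieces)
rhs = 2 * cost(S, S, k)
print(lhs <= rhs + 1e-9)   # small tolerance for floating-point rounding
```

Changing the seed or the partition should not break the inequality, since it holds for any partition of any finite metric instance.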

Next we show that the new instance, where all the points $x$ that have median $c$ shift their weight to the point $c$ (i.e., the weighted $O(\ell k)$ centers $\chi'$ in Step 2 of Algorithm Small-Space), has a good feasible clustering solution. Assigning a point $c$ of weight $w$ to a median at distance $d$ will cost $wd$; that is, assignment distances are multiplied by weights in the objective function. Notice that the set of points in the new instance is much smaller and may not even contain the optimum medians for $S$.
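The weighted objective can be illustrated with a short Python sketch. The data, the intermediate centers, and the candidate medians are made-up values, and `weighted_cost` is a hypothetical helper. The final check reflects the triangle inequality: moving every point onto its intermediate center changes the cost of any fixed set of medians by at most the total assignment distance.

```python
def dist(p, q):
    """Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5

def weighted_cost(points, weights, medians):
    """Each assignment distance is multiplied by the weight of the point."""
    return sum(w * min(dist(x, m) for m in medians)
               for x, w in zip(points, weights))

S = [(0.0,), (1.0,), (2.0,), (10.0,), (11.0,)]
centers = [(1.0,), (10.0,)]        # intermediate centers (made-up)

# Shift the weight of every point to its closest intermediate center.
weights = [0] * len(centers)
assign_cost = 0.0
for x in S:
    j = min(range(len(centers)), key=lambda j: dist(x, centers[j]))
    weights[j] += 1
    assign_cost += dist(x, centers[j])

medians = [(2.0,), (11.0,)]        # any candidate medians (made-up)
new_cost = weighted_cost(centers, weights, medians)
orig_cost = weighted_cost(S, [1] * len(S), medians)
print(weights, new_cost, orig_cost, assign_cost)
print(new_cost <= assign_cost + orig_cost)
```

With these values the weights are 3 and 2, the weighted instance costs 5.0, and the bound evaluates to 3.0 + 4.0 = 7.0.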

Theorem 3 If $C = \sum_{i=1}^{\ell}$ cost$(\chi_i, \chi_i)$ and $C^* =$ cost$(S, S) = \sum_{i=1}^{\ell}$ cost$(\chi_i, S)$, then there exists a solution of cost at most $2(C + C^*)$ to the new weighted instance $\chi'$.

Proof: For all $1 \le i \le \ell$, let $C_{i,1}, \ldots, C_{i,k} \in \chi_i$ be the medians that achieve the minimum cost$(\chi_i, \chi_i)$. Let the medians that achieve the minimum cost$(S, S)$ be $C^*_1, \ldots, C^*_k$. For $x \in \chi_i$, let $c(x)$ denote the closest of $C_{i,1}, \ldots, C_{i,k}$ to $x$, and let $C^*(x)$ denote the closest of $C^*_1, \ldots, C^*_k$ to $x$. Also let $w_{i,j}$ be the number of members $x$ of $\chi_i$ for which $c(x) = C_{i,j}$ (that is, $w_{i,j}$ is the weight in $\chi'$ of $C_{i,j}$). For each $C_{i,j}$, there is a member of $C^*_1, \ldots, C^*_k$ within a distance $\min_{c(x) = C_{i,j}} \left( d(x, c(x)) + d(x, C^*(x)) \right)$ by the triangle inequality. Therefore, $f(\chi', C^*_1, \ldots, C^*_k) \le$ ...

The factor 2 is not present in the familiar case where $S \subseteq Q = \mathbb{R}^d$.

Again the factor 2 is not present in the case that the data are points in $\mathbb{R}^d$ and the medians can be anywhere in $\mathbb{R}^d$.
