Research and Challenges of Dynamic Density-Based Clustering Algorithms

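This PDF discusses dynamic density-based clustering, focusing on DBSCAN (Density-Based Spatial Clustering of Applications with Noise) and its use on datasets that change over time. DBSCAN finds groups of density-connected objects without requiring the number of clusters to be fixed in advance. When the data is continuously updated, however, maintaining the clusters efficiently becomes a challenge.

DBSCAN is a popular data mining technique that uses proximity and density to classify the objects of a dataset as core objects, border objects, or noise. An object belongs to a cluster if its neighborhood is sufficiently dense, where density is defined by the neighborhood radius (eps) and the minimum number of objects in the neighborhood (minPts). The method is well suited to discovering irregularly shaped clusters and is tolerant of outliers.

In a dynamic setting where the dataset supports insertions and deletions, a conventional DBSCAN implementation has to recompute the neighborhood relationships of the whole dataset, which can incur significant computational cost. The literature has proposed ρ-approximate DBSCAN, which aims to reduce the computational complexity on static data. The paper points out, however, that on fully dynamic datasets, where both insertions and deletions are handled, ρ-approximate DBSCAN faces the same hardness as exact DBSCAN. The authors, Junhao Gan and Yufei Tao, study this problem further and expose the limitations of the ρ-approximate version on dynamic data. They may also discuss optimization strategies or new dynamic clustering algorithms to address the challenge, such as incremental updates, local adjustments, or exploiting data-structure properties to avoid unnecessary computation.

In practice, dynamic clustering matters for real-time analytics, large-scale stream processing, and the Internet of Things (IoT). In a monitoring system, for example, new sensor readings keep arriving while old ones may be dropped, so the clustering has to be adjusted in real time. Developing efficient and adaptable dynamic density-based clustering algorithms is therefore an important research direction. The paper provides a theoretical basis for understanding and improving dynamic clustering algorithms, and is a useful reference for developers and researchers who want to optimize the clustering of large-scale dynamic data.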

[Figure 2: Illustration of DBSCAN and ρ-approximate DBSCAN (ρ = 0.5, MinPts = 3). Panels: (a) Dataset; (b) Core graph; (c) One possible ρ-approximate core graph. Label in figure: (1 + ρ)ε.]

... MinPts, which can be regarded as a constant. Next, we review how the clusters are formed using graph terminology.

Given a point p ∈ P, we use B(p, r) to represent the ball that is centered at p and has radius r. The point p is said to be a core point if B(p, ε) covers at least MinPts points of P (including p itself); otherwise, it is a non-core point. To illustrate, consider the dataset of 18 points in Figure 2a, where ε is the radius of the inner solid circle, and MinPts = 3. The core points have been colored black, while the non-core points are colored white. The dashed circle can be ignored for the time being.
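The core-point test can be made concrete with a brute-force sketch. The function name `core_points` and the toy dataset are hypothetical, and the quadratic scan is chosen only for readability; it is not how an efficient DBSCAN implementation would find neighbors.

```python
import math

def core_points(points, eps, min_pts):
    """Return the indices of points whose eps-ball covers at least
    min_pts points of the dataset (the point itself included)."""
    core = []
    for i, p in enumerate(points):
        # Count every point q with dist(p, q) <= eps, including p itself.
        covered = sum(1 for q in points if math.dist(p, q) <= eps)
        if covered >= min_pts:
            core.append(i)
    return core

# Hypothetical toy dataset: three nearby points and one isolated point.
points = [(0.0, 0.0), (0.5, 0.0), (0.0, 0.5), (5.0, 5.0)]
core = core_points(points, eps=1.0, min_pts=3)
print(core)  # indices of the core points
```

With these values the three nearby points cover one another and are core points, while the isolated point covers only itself and is a non-core point.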

DBSCAN clusters are defined in two steps. The first one focuses exclusively on the core points, and groups them into preliminary clusters. The second step determines how the non-core points should be assigned to the clusters. Next, we explain the two steps in detail.

Step 1: Clustering Core Points. It will be convenient to imagine an undirected core graph G (this graph is conceptual and need not be materialized). Specifically, each vertex of G corresponds to a distinct core point in P. There is an edge between two core points (a.k.a. vertices) p_1, p_2 if and only if dist(p_1, p_2) ≤ ε, where dist(·, ·) represents the Euclidean distance between two points. Figure 2b shows the core graph for the dataset of Figure 2a.

Each connected component (CC) of G constitutes a preliminary cluster. In Figure 2b, there are 3 CCs (a.k.a. preliminary clusters). Note that every core point belongs to exactly one preliminary cluster.
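A minimal sketch of Step 1, assuming the core points are already known: a union-find structure merges any two core points within distance ε, so the resulting groups are the connected components of the core graph. The name `preliminary_clusters` and the sample coordinates are hypothetical, and the sketch checks all pairs, which is exactly the quadratic behavior that clever algorithms try to avoid.

```python
import math

def preliminary_clusters(core_pts, eps):
    """Group core points into the connected components of the core graph,
    returned as sorted lists of indices into core_pts."""
    parent = list(range(len(core_pts)))

    def find(i):
        # Follow parent links to the root, halving the path on the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # An edge exists between two core points iff their distance is <= eps.
    for i in range(len(core_pts)):
        for j in range(i + 1, len(core_pts)):
            if math.dist(core_pts[i], core_pts[j]) <= eps:
                parent[find(i)] = find(j)

    components = {}
    for i in range(len(core_pts)):
        components.setdefault(find(i), []).append(i)
    return sorted(components.values())

# Hypothetical core points: a chain of three and a separate pair.
core_pts = [(0.0, 0.0), (0.8, 0.0), (1.6, 0.0), (10.0, 0.0), (10.5, 0.0)]
clusters = preliminary_clusters(core_pts, eps=1.0)
print(clusters)
```

Points 0 and 2 are farther apart than ε, yet they land in the same component because both are connected to point 1; this chaining is what lets density-based clusters take irregular shapes.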

Step 2: Non-Core Assignment. This step augments the preliminary clusters with non-core points. For each non-core point p, DBSCAN looks at every core point p_core ∈ B(p, ε), and assigns p to the (only) preliminary cluster containing p_core. Note that, in this manner, p may be assigned to zero, one, or more than one preliminary cluster. After all the non-core points have been assigned, the preliminary clusters become final clusters.

It should be clear from the above that the DBSCAN clusters are uniquely defined by the parameters ε and MinPts, but they are not necessarily disjoint. A non-core point may belong to multiple clusters, while a core point must exist only in a single cluster. It is possible that a non-core point is not in any cluster; such a point is called noise.
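The assignment rule of Step 2 can be sketched as follows, assuming the core points and their preliminary cluster ids are given as inputs. The function name `assign_non_core` and all sample values are hypothetical; the toy core points are simply taken as given and are not checked against MinPts.

```python
import math

def assign_non_core(non_core_pts, core_pts, core_cluster, eps):
    """For each non-core point, return the set of preliminary cluster ids
    of the core points covered by its eps-ball. An empty set means noise."""
    assignment = []
    for p in non_core_pts:
        ids = {core_cluster[i]
               for i, c in enumerate(core_pts)
               if math.dist(p, c) <= eps}
        assignment.append(ids)
    return assignment

# Hypothetical input: two core points in different preliminary clusters.
core_pts = [(0.0, 0.0), (2.0, 0.0)]
core_cluster = [0, 1]  # cluster id of each core point
non_core_pts = [(1.0, 0.0), (0.0, 0.9), (9.0, 9.0)]
result = assign_non_core(non_core_pts, core_pts, core_cluster, eps=1.0)
print(result)
```

The first non-core point lies within ε of both core points and therefore joins two clusters, the second joins one, and the third is covered by no core point and is noise, which illustrates why the final clusters need not be disjoint.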

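In Figure 2a, there are two non-core points (the white points). The ε-ball of one of them covers a core point, so that non-core point is assigned to the preliminary cluster of the core point it covers. The ε-ball of the other, however, covers no core points, indicating that it is noise. The final DBSCAN clusters are therefore the 3 preliminary clusters of Figure 2b, one of which is augmented with the assigned non-core point.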

Remark. DBSCAN can also be defined under the notion of “density-reachable”; see [9]. The above graph-based definition is equivalent, perhaps more intuitive, and allows a simple extension to ρ-approximate DBSCAN, as we will see later.

Hardness of DBSCAN and USEC. It is easy to see that the DBSCAN clusters on a set P of n points can be computed in O(n^2) time, noticing that the core graph G has O(n^2) edges. However, clever algorithms should produce the clusters without generating all the edges, and thus avoid the quadratic trap. Indeed, when d = 2, the clusters can be computed in only O(n log n) time [11].

It would be highly desired to find an algorithm of O(n) time for d ≥ 3, but recently Gan and Tao [10] have essentially dispelled the possibility. They proved an O(n)-time reduction from the unit-spherical emptiness checking (USEC) problem to DBSCAN. In other words, any T(n)-time DBSCAN algorithm implies that the USEC problem can be solved in O(T(n)) time.

In USEC, we are given a set S_red of red points and a set S_blue of blue points in R^d. All the points have distinct coordinates on every dimension. The objective is to determine whether there exist a red point p_red and a blue point p_blue such that dist(p_red, p_blue) ≤ 1 (the distance threshold 1 can be replaced with any positive value by scaling). The problem has a lower bound of Ω(n^{4/3}) for d ≥ 5 in a broad class of algorithms [6,7]. For d = 3 and 4, beating the bound has been a grand open problem in theoretical computer science, and is widely believed [6] to be impossible. By the reduction of [10], no DBSCAN algorithm can have running time o(n^{4/3}) for d ≥ 5; for d = 3 and 4, this is also true unless unlikely ground-breaking improvements could be made on 3D USEC.

Approximation and the Sandwich Guarantee. Gan and Tao [10] developed ρ-approximate DBSCAN, which returns almost the same clusters as exact DBSCAN by offering a strong sandwich guarantee that will be introduced shortly. In contrast to the high time complexity of the latter, the approximate version takes only O(n) expected time to compute for any constant ρ > 0.

Besides the parameters ε and MinPts inherited from DBSCAN, the approximate version accepts a third parameter ρ, which is a small positive constant less than 1, and controls the clustering precision. Its clusters can also be defined in the same two steps as in exact DBSCAN, as explained below.

Step 1: Clustering Core Points. It will also be convenient to follow a graph-based approach. Let us define an undirected ρ-approximate core graph G_ρ on the dataset P (again, this graph is conceptual and need not be materialized). Each vertex of G_ρ corresponds to a distinct core point in P. Given two core points p_1, p_2, whether or not G_ρ has an edge between their vertices is determined as:

• The edge definitely exists if dist(p_1, p_2) ≤ ε.
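The edge rule can be sketched as a three-way classification. Only the first case (distance at most ε) is stated in the text above; the other two cases are an assumption based on the usual statement of ρ-approximate DBSCAN in [10], where an edge is forbidden beyond (1 + ρ)ε and left to the algorithm's discretion in between. The function name `edge_status` and its return labels are hypothetical.

```python
import math

def edge_status(p1, p2, eps, rho):
    """Classify the edge between two core points in a rho-approximate
    core graph as 'must', 'must_not', or 'either'."""
    d = math.dist(p1, p2)
    if d <= eps:
        return "must"       # stated in the text: the edge definitely exists
    if d > (1 + rho) * eps:
        return "must_not"   # assumption: no edge beyond (1 + rho) * eps
    return "either"         # assumption: the algorithm may choose freely

# Hypothetical values matching the parameters of Figure 2 (rho = 0.5).
eps, rho = 1.0, 0.5
print(edge_status((0.0, 0.0), (0.9, 0.0), eps, rho))
print(edge_status((0.0, 0.0), (1.2, 0.0), eps, rho))
print(edge_status((0.0, 0.0), (2.0, 0.0), eps, rho))
```

Under this reading, the freedom in the middle band is why Figure 2c is captioned as one possible ρ-approximate core graph rather than the only one.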
