DBSCAN: A Density-Based Spatial Clustering Algorithm

"这篇论文主要介绍了DBSCAN(Density-Based Spatial Clustering of Applications with Noise)算法，这是一种基于密度的空间聚类算法，特别适用于发现任意形状的聚类。DBSCAN仅需一个输入参数，并能帮助用户确定合适的参数值。实验结果显示，DBSCAN在发现任意形状聚类上的效果显著优于传统的CLARANS算法，并且在效率上比CLARANS高出100多倍。" DBSCAN算法是数据挖掘领域中的一个重要聚类方法，它由Martin Ester、Hans-Peter Kriegel、Jörg Sander和Xiaowei Xu于1996年提出。该算法的核心思想是通过密度来定义和发现聚类，而非像K-means那样依赖于预先设定的聚类数量。这使得DBSCAN能够处理具有复杂形状的聚类，以及在噪声数据中识别出有意义的结构。 DBSCAN算法有两个主要的参数：ε（epsilon）和MinPts。ε是一个距离阈值，表示在半径ε内的邻域；MinPts是邻域内必须包含的点的最小数目。如果一个点p的ε邻域内包含至少MinPts个点（包括p自身），那么这些点组成一个“核心对象”区域。基于这些核心对象，DBSCAN可以扩展聚类，将相邻的核心对象连接在一起。对于那些不是任何核心对象的ε邻域内的点，它们可能被认为是噪声或边缘点，不被包含在任何聚类中。与K-means相比，DBSCAN的优点在于它不需要预先知道聚类的数量，而且对异常值的容忍度较高。这是因为DBSCAN在计算时不会受到孤立点的影响，它可以自动忽略噪声。此外，由于其基于密度的特性，DBSCAN能够在数据分布不均匀的情况下有效地进行聚类。实验部分，论文对比了DBSCAN与CLARANS（一种快速的近似层次聚类算法）在发现任意形状聚类的效果和效率。实验结果表明，DBSCAN在发现复杂形状聚类上具有显著优势，而CLARANS则可能因假设球形聚类而失效。在执行速度上，DBSCAN的性能也远超CLARANS，表明DBSCAN更适合处理大规模数据集。总结来说，DBSCAN算法是一种强大的聚类工具，尤其在处理具有非凸形状的聚类和大量噪声数据时。它的单参数设置简化了用户调整参数的过程，而其高效率和鲁棒性使其成为大数据分析和挖掘的重要选择。对于毕业设计或研究项目，深入理解和应用DBSCAN算法可以帮助解决复杂的聚类问题。

Abstract

Clustering algorithms are attractive for the task of class identification in spatial databases. However, the application to large spatial databases raises the following requirements for clustering algorithms: minimal requirements of domain knowledge to determine the input parameters, discovery of clusters with arbitrary shape and good efficiency on large databases. The well-known clustering algorithms offer no solution to the combination of these requirements. In this paper, we present the new clustering algorithm DBSCAN relying on a density-based notion of clusters which is designed to discover clusters of arbitrary shape. DBSCAN requires only one input parameter and supports the user in determining an appropriate value for it. We performed an experimental evaluation of the effectiveness and efficiency of DBSCAN using synthetic data and real data of the SEQUOIA 2000 benchmark. The results of our experiments demonstrate that (1) DBSCAN is significantly more effective in discovering clusters of arbitrary shape than the well-known algorithm CLARANS, and that (2) DBSCAN outperforms CLARANS by a factor of more than 100 in terms of efficiency.

Keywords: Clustering Algorithms, Arbitrary Shape of Clusters, Efficiency on Large Spatial Databases, Handling Noise.

1. Introduction

Numerous applications require the management of spatial data, i.e. data related to space. Spatial Database Systems (SDBS) (Gueting 1994) are database systems for the management of spatial data. Increasingly large amounts of data are obtained from satellite images, X-ray crystallography or other automatic equipment. Therefore, automated knowledge discovery becomes more and more important in spatial databases.

Several tasks of knowledge discovery in databases (KDD) have been defined in the literature (Matheus, Chan & Piatetsky-Shapiro 1993). The task considered in this paper is class identification, i.e. the grouping of the objects of a database into meaningful subclasses. In an earth observation database, e.g., we might want to discover classes of houses along some river.

Clustering algorithms are attractive for the task of class identification. However, the application to large spatial databases raises the following requirements for clustering algorithms:

(1) Minimal requirements of domain knowledge to determine the input parameters, because appropriate values are often not known in advance when dealing with large databases.
(2) Discovery of clusters with arbitrary shape, because the shape of clusters in spatial databases may be spherical, drawn-out, linear, elongated etc.
(3) Good efficiency on large databases, i.e. on databases of significantly more than just a few thousand objects.

The well-known clustering algorithms offer no solution to the combination of these requirements. In this paper, we present the new clustering algorithm DBSCAN. It requires only one input parameter and supports the user in determining an appropriate value for it. It discovers clusters of arbitrary shape. Finally, DBSCAN is efficient even for large spatial databases. The rest of the paper is organized as follows. We discuss clustering algorithms in section 2, evaluating them according to the above requirements. In section 3, we present our notion of clusters, which is based on the concept of density in the database. Section 4 introduces the algorithm DBSCAN, which discovers such clusters in a spatial database. In section 5, we perform an experimental evaluation of the effectiveness and efficiency of DBSCAN using synthetic data and data of the SEQUOIA 2000 benchmark. Section 6 concludes with a summary and some directions for future research.

2. Clustering Algorithms

There are two basic types of clustering algorithms (Kaufman & Rousseeuw 1990): partitioning and hierarchical algorithms. Partitioning algorithms construct a partition of a database D of n objects into a set of k clusters. k is an input parameter for these algorithms, i.e. some domain knowledge is required which unfortunately is not available for many applications. The partitioning algorithm typically starts with an initial partition of D and then uses an iterative control strategy to optimize an objective function. Each cluster is represented by the gravity center of the cluster (k-means algorithms) or by one of the objects of the cluster located near its center (k-medoid algorithms). Consequently, partitioning algorithms use a two-step procedure. First, determine k representatives minimizing the objective function. Second, assign each object to the cluster with its representative “closest” to the considered object. The second step implies that a partition is equivalent to a Voronoi diagram and each cluster is contained in one of the Voronoi cells. Thus, the shape of all

A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise

Martin Ester, Hans-Peter Kriegel, Jörg Sander, Xiaowei Xu

Institute for Computer Science, University of Munich
Oettingenstr. 67, D-80538 München, Germany
{ester | kriegel | sander | xwxu}@informatik.uni-muenchen.de

Published in Proceedings of 2nd International Conference on Knowledge Discovery and Data Mining (KDD-96)
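For the two-step procedure of partitioning algorithms described in section 2, a small Python sketch may help. It assumes the k-means variant, with the gravity center as representative, and a fixed number of alternating passes as the "iterative control strategy"; the names `assign`, `gravity_center` and `k_means`, and the sample data, are illustrative choices rather than anything specified in the paper.

```python
from math import dist

def assign(points, reps):
    """Second step: give each object the index of its closest representative."""
    return [min(range(len(reps)), key=lambda k: dist(p, reps[k])) for p in points]

def gravity_center(cluster):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(cluster)
    return tuple(sum(coords) / n for coords in zip(*cluster))

def k_means(points, reps, iterations=10):
    """Alternate between assigning objects and recomputing the k representatives."""
    for _ in range(iterations):
        labels = assign(points, reps)
        new_reps = []
        for k, old in enumerate(reps):
            members = [p for p, label in zip(points, labels) if label == k]
            # Keep the old representative if no object was assigned to it.
            new_reps.append(gravity_center(members) if members else old)
        reps = new_reps
    return reps, assign(points, reps)

# k = 2 must be supplied up front, here through the two initial representatives.
points = [(0, 0), (0, 2), (10, 0), (10, 2)]
reps, labels = k_means(points, reps=[(1, 1), (9, 1)])
print(reps)    # [(0.0, 1.0), (10.0, 1.0)]
print(labels)  # [0, 0, 1, 1]
```

Because `assign` only compares distances to the k representatives, every cluster is exactly the set of objects falling into one Voronoi cell of those representatives.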
