algorithms is represented respectively by the mean value, the ‘‘mode’’ and the ‘‘medoid’’
of all objects belonging to the cluster. These algorithms run in two steps. First, k objects
are generated at random to represent the k clusters. Second, an iterative control strategy is
used to optimize the objective function, which is typically the average or the sum of the
distances between the objects in a cluster and its representative object.
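As a rough illustration of this two-step scheme, the following minimal k-means-style sketch (our illustration, not pseudo-code from any of the cited papers) alternates between assigning objects to their nearest representative and recomputing each representative as the mean of its cluster:

```python
import numpy as np

def kmeans_sketch(X, k, n_iter=100, seed=0):
    """Minimal k-means: each cluster is represented by its mean."""
    rng = np.random.default_rng(seed)
    # Step 1: randomly pick k objects as the initial representatives.
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assign every object to its nearest representative.
        dist = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dist.argmin(axis=1)
        # Step 2: update each representative (the mean of its cluster),
        # which decreases the sum-of-distances objective described above.
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)])
        if np.allclose(new_centers, centers):
            break  # converged
        centers = new_centers
    return labels, centers
```

The k-modes and k-medoids variants differ only in the update step, where the representative is recomputed as the mode or the medoid of the cluster instead of the mean.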
CLARANS [17] is an improved k-medoids-type algorithm that requires two parameters
from the user. It is well suited to huge data sets, and its efficiency is much higher than that
of PAM and CLARA, which are presented in [18].
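As a hedged sketch of the CLARANS search strategy, assuming (as in [17]) that the two user parameters are numlocal (the number of random restarts) and maxneighbor (the number of unsuccessful swap attempts before declaring a local optimum):

```python
import random

def clarans_sketch(dist, n, k, numlocal, maxneighbor):
    """Randomized search over sets of k medoids, in the spirit of CLARANS.

    dist(i, j) gives the distance between objects i and j.
    """
    def cost(medoids):
        # Sum of each object's distance to its nearest medoid.
        return sum(min(dist(i, m) for m in medoids) for i in range(n))

    best, best_cost = None, float("inf")
    for _ in range(numlocal):                  # numlocal random restarts
        current = random.sample(range(n), k)   # random initial medoid set
        current_cost = cost(current)
        fails = 0
        while fails < maxneighbor:
            # A "neighbor" solution swaps one medoid for one non-medoid.
            neighbor = current.copy()
            neighbor[random.randrange(k)] = random.choice(
                [i for i in range(n) if i not in current])
            c = cost(neighbor)
            if c < current_cost:               # downhill move: accept and reset
                current, current_cost, fails = neighbor, c, 0
            else:
                fails += 1
        if current_cost < best_cost:           # keep the best local optimum
            best, best_cost = current, current_cost
    return best, best_cost
```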
Streaming clustering algorithms for mining dynamic data sets are also very popular [6–10].
In [6], the authors proposed a constant-factor approximation algorithm for the k-median
problem in the data stream model, based on the Small-Space algorithm. The data set needs
to be scanned only once; however, the authors also show negative results implying that the
algorithm cannot be improved in a certain sense.
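A hedged sketch of the divide-and-conquer idea behind Small-Space; base_cluster is a hypothetical placeholder for any constant-factor k-median subroutine, not a routine from [6]:

```python
def small_space_sketch(points, k, n_pieces, base_cluster):
    """Two-level divide-and-conquer in the spirit of Small-Space.

    base_cluster(weighted_points, k) -> list of (center, weight) pairs;
    it stands in for any constant-factor k-median subroutine.
    """
    # Level 1: split the data into pieces small enough to cluster in memory,
    # and reduce each piece to k weighted centers.
    size = max(1, len(points) // n_pieces)
    weighted_centers = []
    for start in range(0, len(points), size):
        piece = [(p, 1) for p in points[start:start + size]]
        weighted_centers.extend(base_cluster(piece, k))
    # Level 2: cluster the weighted centers to obtain the final k medians.
    return base_cluster(weighted_centers, k)
```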
A framework for clustering evolving data streams proposed in [8] is divided into an online
component, which periodically stores detailed summary statistics, and an offline
component, which uses only these summary statistics to extract the clusters. In addition, a
pyramidal time frame is designed to store the ‘‘snapshots’’.
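The summary statistics in [8] are additively maintainable; below is a minimal sketch, with field names of our choosing, of a micro-cluster summary that absorbs stream points and merges with other summaries (the actual framework in [8] additionally keeps time statistics for the pyramidal snapshots):

```python
import numpy as np

class MicroCluster:
    """Additive summary statistics for a group of d-dimensional stream points.

    Stores the count, the linear sum and the squared sum, so the mean and
    variance can be recovered and two summaries can be merged by addition.
    """
    def __init__(self, d):
        self.n = 0
        self.ls = np.zeros(d)   # linear sum of the points
        self.ss = np.zeros(d)   # sum of the squared points

    def absorb(self, x):
        x = np.asarray(x, float)
        self.n += 1
        self.ls += x
        self.ss += x * x

    def merge(self, other):
        self.n += other.n
        self.ls += other.ls
        self.ss += other.ss

    def centroid(self):
        return self.ls / self.n

    def variance(self):
        return self.ss / self.n - self.centroid() ** 2
```

Because the summaries are additive, the offline component can extract clusters from stored snapshots alone, e.g. by clustering the micro-cluster centroids.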
However, this framework is not suited to arbitrarily shaped clusters. As a result, a density-
based clustering algorithm over an evolving data stream with noise is proposed in [9].
A novel clustering algorithm based on passing messages between data points is presented in
[19, 20]. The authors of [19] devised a method called ‘‘affinity propagation’’, which takes
as input measures of similarity between pairs of data points and clusters them by
exchanging messages between the points. Two incremental affinity propagation (IAP)
clustering algorithms based on message passing are proposed in [20]: IAP clustering based
on k-medoids (IAPKM) and IAP clustering based on nearest-neighbor assignment
(IAPNA).
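For concreteness, here is a compact sketch of the responsibility and availability message updates defined in [19]; the damping factor and the use of the diagonal of the similarity matrix S as exemplar preferences follow the usual conventions:

```python
import numpy as np

def affinity_propagation_sketch(S, n_iter=200, damping=0.5):
    """Affinity propagation on a similarity matrix S (diagonal = preferences)."""
    n = S.shape[0]
    R = np.zeros((n, n))  # responsibilities: point i -> candidate exemplar k
    A = np.zeros((n, n))  # availabilities:  candidate exemplar k -> point i
    for _ in range(n_iter):
        # r(i,k) = s(i,k) - max_{k' != k} [ a(i,k') + s(i,k') ]
        M = A + S
        idx = M.argmax(axis=1)
        first = M[np.arange(n), idx].copy()
        M[np.arange(n), idx] = -np.inf
        second = M.max(axis=1)
        R_new = S - first[:, None]
        R_new[np.arange(n), idx] = S[np.arange(n), idx] - second
        R = damping * R + (1 - damping) * R_new
        # a(i,k) = min(0, r(k,k) + sum_{i' not in {i,k}} max(0, r(i',k)))
        Rp = np.maximum(R, 0)
        np.fill_diagonal(Rp, R.diagonal())
        col = Rp.sum(axis=0)
        A_new = np.minimum(0, col[None, :] - Rp)
        # a(k,k) = sum_{i' != k} max(0, r(i',k))
        np.fill_diagonal(A_new, col - R.diagonal())
        A = damping * A + (1 - damping) * A_new
    return np.argmax(A + R, axis=1)  # each point's chosen exemplar
```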
Apart from the clustering algorithms introduced above, the algorithms with the greatest
influence on our clustering algorithm are DBSCAN [21] and OPTICS [5]. To our
knowledge, DBSCAN is the first density-based clustering algorithm that is not grid-based.
Most density-based clustering algorithms are partitioning algorithms, and so is ICA.
Unlike grid-based clustering algorithms, where density is defined over grid cells, the
density in DBSCAN is specified by two parameters, ε and MinPts, which are preset by the
user. The objects in a cluster are divided into two subsets: core objects and border objects.
For each core object in a cluster, the neighborhood of radius ε has to contain at least
MinPts objects; for each border object, there must be a corresponding core object whose
neighborhood contains the border object. Based on this definition of a cluster, the authors
of [21] give detailed pseudo-code.
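A minimal sketch (ours, not the pseudo-code of [21]) of how the core/border definitions drive the usual seed-expansion procedure:

```python
import numpy as np

def dbscan_sketch(X, eps, min_pts):
    """Label points by expanding clusters from core objects; -1 marks noise.

    A point is a core object if its eps-neighborhood (including itself)
    contains at least min_pts objects.
    """
    n = len(X)
    D = np.linalg.norm(X[:, None] - X[None, :], axis=2)
    neighbors = [np.where(D[i] <= eps)[0] for i in range(n)]
    labels = np.full(n, -1)
    cluster = 0
    for i in range(n):
        if labels[i] != -1 or len(neighbors[i]) < min_pts:
            continue  # already assigned, or not a core object
        # Grow a new cluster outward from core object i.
        labels[i] = cluster
        seeds = list(neighbors[i])
        while seeds:
            j = seeds.pop()
            if labels[j] == -1:
                labels[j] = cluster              # core or border object
                if len(neighbors[j]) >= min_pts:
                    seeds.extend(neighbors[j])   # core object: keep expanding
        cluster += 1
    return labels
```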
OPTICS is a more mature method that can both automatically find the hierarchical
clustering structure and present the inherent structure of the data set to users. As with
DBSCAN, users of OPTICS also need to preset ε and MinPts, though the reachability-plot
is rather insensitive to these parameters.
The authors of [5] suggest that the parameter values merely have to be ‘‘large’’ enough to
yield a good result. Based on the reachability-plot, automatic techniques are proposed to
find the clusters. Compared with DBSCAN, in our opinion, the biggest advantage of
OPTICS is the reachability-plot, which presents the inherent structure of the data set
intuitively. Note that the reachability-plot is independent of the dimensionality of the data
set. Users can obtain a good clustering result by setting proper parameters based on the
reachability-plot.
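To make the reachability-plot concrete, here is a simplified sketch (ours; [5] maintains the seed list more carefully) that computes each object's core-distance and emits a processing order together with reachability values:

```python
import heapq
import numpy as np

def optics_sketch(X, eps, min_pts):
    """Return a processing order and reachability values (inf = undefined)."""
    n = len(X)
    D = np.linalg.norm(X[:, None] - X[None, :], axis=2)
    # Core-distance: distance to the min_pts-th nearest neighbor within eps
    # (the neighborhood includes the point itself), else undefined (inf).
    core = np.full(n, np.inf)
    for p in range(n):
        d = np.sort(D[p][D[p] <= eps])
        if len(d) >= min_pts:
            core[p] = d[min_pts - 1]
    reach = np.full(n, np.inf)
    processed = np.zeros(n, bool)
    order = []
    for start in range(n):
        if processed[start]:
            continue
        heap = [(np.inf, start)]          # lazy-deletion priority queue
        while heap:
            r, p = heapq.heappop(heap)
            if processed[p]:
                continue                  # stale queue entry
            processed[p] = True
            order.append(p)
            if core[p] == np.inf:
                continue                  # not a core object: do not expand
            for q in np.where(D[p] <= eps)[0]:
                if not processed[q]:
                    # Reachability-distance of q with respect to p.
                    new_r = max(core[p], D[p, q])
                    if new_r < reach[q]:
                        reach[q] = new_r
                        heapq.heappush(heap, (new_r, q))
    return order, reach
```

Plotting the reachability values in the returned order yields the reachability-plot; valleys in the plot correspond to clusters, independently of the dimensionality of the data.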