membership, dependency of attributes and reduction of attributes.
The binary relations defined on attributes partition the points of
the universe into a collection of elementary classes. Any vague (imprecise) concept which is a subset of the universe can be expressed
by two exact sets: the lower and upper approximations. The lower
approximation consists of the objects surely belonging to the vague
concept, whereas the upper approximation consists of the objects
possibly belonging to the vague concept. Based on rough approx-
imations, the universe can be divided into three pair-wise dis-
joint regions: the positive region corresponds to the lower approx-
imation, the negative region corresponds to the complement of
the upper approximation, and the boundary region corresponds to
the difference between the upper and lower approximations. Based
on these regions, three-way decisions can be made: deterministic
decisions (immediate acceptance or immediate rejection) are made
for objects in the positive and negative regions, respectively,
according to their memberships in the given decision class, while
nondeterministic (deferred) decisions are made for objects in the
boundary region. The rough membership measures the degree to
which an object belongs to a set. The dependency of attributes is
used to discover dependencies between attributes. Lastly, a reduct
of attributes is a minimal subset of attributes that preserves the
partitioning ability of the original attributes.
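To make these notions concrete, the following Python sketch computes the lower and upper approximations and the three regions for a toy universe; the function name, the equivalence classes, and the target set are our own illustrative assumptions, not taken from the paper.

```python
# A minimal sketch of rough approximations over a toy universe.
# The equivalence classes and target set X are illustrative assumptions.

def rough_regions(universe, equivalence_classes, X):
    """Compute the lower/upper approximations of X and the three regions."""
    X = set(X)
    lower = set()   # objects surely belonging to X
    upper = set()   # objects possibly belonging to X
    for block in equivalence_classes:
        block = set(block)
        if block <= X:          # block entirely inside X
            lower |= block
        if block & X:           # block overlaps X
            upper |= block
    positive = lower
    negative = set(universe) - upper
    boundary = upper - lower
    return positive, boundary, negative

# Toy example: a universe of 6 objects partitioned by some attribute.
U = {1, 2, 3, 4, 5, 6}
blocks = [{1, 2}, {3, 4}, {5, 6}]
X = {1, 2, 3}                   # a vague concept
print(rough_regions(U, blocks, X))
# -> positive {1, 2}, boundary {3, 4}, negative {5, 6}
```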
Nowadays, RST has been widely applied in clustering analysis.
Some rough clustering studies constructed each cluster as an
interval or rough set characterized by its lower and upper approximations, or by a lower approximation and a boundary region, respectively [6,28–30]. Others applied the principle of attribute reduction in
RST to select the feature with the largest differentiation to split
the universe, and a series of divisive hierarchical clustering algorithms based on RST have been proposed that yield a hard partition of the
universe [31,32]. Still others combined clustering with three-way decision theory [33–35].
2.3. Cluster ensemble
Cluster ensemble methods involve roughly two major steps: ensemble generation and consensus aggregation. In the first step, a
set of diverse clustering solutions is produced by a generative mechanism, such as data perturbation techniques,
a homogeneous algorithm with different parameters (or initializations), or heterogeneous base clustering algorithms [36]. The
second step is the key step in any cluster ensemble approach,
where a consensus aggregation of the clustering solutions in the
ensemble is acquired. A large number of cluster ensemble methods have been proposed, which can be mainly divided into two
categories: methods based on object co-occurrence and
methods based on the median partition [2]. The first type of method
finds the final result by counting how many times two objects
belong to the same cluster, or how many times an object belongs
to a given cluster, across the ensemble. Voting and co-association
matrix based methods fall into this category.
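As a minimal sketch of the co-association idea, assuming each base clustering is given as an integer label array (the names and data below are illustrative, not the paper's notation), one can count how often each pair of objects shares a cluster:

```python
# Co-association matrix from a set of base clusterings, each given as an
# integer label array; names and data are illustrative assumptions.
import numpy as np

def co_association(labelings):
    """Fraction of base clusterings in which each pair of objects co-occurs."""
    labelings = np.asarray(labelings)        # shape: (m clusterings, n objects)
    m, n = labelings.shape
    C = np.zeros((n, n))
    for labels in labelings:
        C += (labels[:, None] == labels[None, :]).astype(float)
    return C / m                             # entry (i, j) lies in [0, 1]

# Three base clusterings of five objects.
P = [[0, 0, 1, 1, 1],
     [0, 0, 0, 1, 1],
     [1, 1, 0, 0, 0]]
C = co_association(P)
# C[0, 1] == 1.0: objects 0 and 1 share a cluster in every base clustering.
# A consensus partition can then be obtained by, e.g., hierarchical
# clustering on the derived distance matrix 1 - C.
```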
2.4. Random forests classifier
Among the different supervised machine learning methods,
k-Nearest-Neighbor classifiers (k-NN) [37], Support Vector Machines (SVM) [38] and Random Forests (RF) [39] are three typical
classifiers. k-NN is based on learning by analogy, that is, comparing an unknown sample with its k closest neighbors from the
training set and then assigning the unknown sample to the most
common class among those k nearest neighbors. The number k is a
user-defined constant that strongly influences the classification
and needs to be selected by various heuristic techniques. SVM finds a nonlinear mapping (by defining a suitable kernel function) to transform
the original training data into an appropriate higher-dimensional space and
then searches for a linear optimal separating hyperplane in this
new space, using support vectors and margins to classify the
training set. The effectiveness of SVM depends on the selection of
the kernel function and its parameters, as well as the soft margin
parameter. SVM also suffers from efficiency issues; that is, the training
time of even the fastest SVMs can be extremely long.
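For illustration only, the following scikit-learn snippet shows the two hyperparameter choices highlighted above, namely k for k-NN and the kernel and soft-margin parameter C for SVM; the dataset and parameter values are arbitrary examples, not recommendations from the paper.

```python
# Illustrative scikit-learn usage of the two classifiers discussed above;
# the parameter values and dataset are arbitrary examples.
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# k-NN: the user-defined k controls the size of the neighborhood vote.
knn = KNeighborsClassifier(n_neighbors=5)

# SVM: the kernel and the soft-margin parameter C must both be chosen.
svm = SVC(kernel='rbf', C=1.0, gamma='scale')

for name, clf in [('k-NN', knn), ('SVM', svm)]:
    scores = cross_val_score(clf, X, y, cv=5)
    print(name, scores.mean())
```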
Random Forests (RF) [39] is a more recently developed supervised
ensemble learning method for classification, regression and other
tasks, and it is capable of overcoming the above difficulties. The method
grows an ensemble of decision trees, with the growth governed
by random input vectors generated by bagging, random split selection or random subspaces, and then outputs the class that is most
popular among the individual trees in the forest. RF alleviates decision
trees' tendency to overfit their training set. The generalization
error of a random forest depends on two things: the strength
of the individual trees in the forest (the stronger, the better)
and the dependence between them (the smaller, the better).
Random forests can process datasets with a large number of
features without feature selection and estimate the importance of
features automatically. They run efficiently on data sets with missing
values, class-imbalanced data sets, and even large data sets.
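The following short scikit-learn sketch illustrates the RF properties mentioned above (trees grown on bootstrap samples with random feature subspaces, majority voting, and automatic feature-importance estimates); the dataset and parameter values are illustrative assumptions.

```python
# A short scikit-learn sketch of the RF properties discussed above.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

rf = RandomForestClassifier(
    n_estimators=100,       # number of trees grown on bootstrap samples
    max_features='sqrt',    # random subspace of features at each split
    random_state=0,
)
rf.fit(X_tr, y_tr)                          # prediction = majority vote of trees
print('accuracy:', rf.score(X_te, y_te))
# Feature importances are estimated automatically from the ensemble itself.
print('max importance:', rf.feature_importances_.max())
```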
3. Incremental fuzzy cluster ensemble learning based on rough
set theory
In this section, we first describe the basic idea of incremental
fuzzy cluster ensemble learning based on RST (IFCERS) and then
present the key steps of the proposed method in detail.
3.1. The basic idea of IFCERS
Fig. 1 provides an overview of IFCERS. In IFCERS, well-known
soft clustering techniques such as FCM, rough k-means,
and rough-fuzzy k-means, with the number of clusters set to the
true number of classes, are first employed to generate multiple
clustering solutions, and each clustering solution is formalized
as a fuzzy membership matrix. Then, the clustering solutions with
lower fuzzy partition entropy are selected to form a fuzzy clustering ensemble, and the positive region, boundary region and negative region of the fuzzy clustering ensemble are obtained based on
the core idea of RST (the principle of rough approximation construction). Next, a more accurate group structure of the data points
in the positive region is acquired by applying a fuzzy cluster ensemble method. Finally, a supervised learning method, namely random forests, is applied incrementally to the points in the boundary and
negative regions, together with the group structure of the positive region, to
yield better final clustering results.
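As a sketch of the entropy-based selection step, assuming each base clustering has already been formalized as an n × c membership matrix whose rows sum to one, one could compute Bezdek's partition entropy and keep the crisper (lower-entropy) members. The below-the-mean threshold used here is our own assumption for illustration, not necessarily the paper's exact selection rule.

```python
# Fuzzy-partition-entropy selection over base clusterings, assuming each
# is an n x c membership matrix; function names and the below-mean
# threshold are our own illustrative assumptions.
import numpy as np

def partition_entropy(U, eps=1e-12):
    """Bezdek's partition entropy: lower values mean a crisper partition."""
    U = np.asarray(U)
    n = U.shape[0]
    return -np.sum(U * np.log(U + eps)) / n

def select_members(memberships):
    """Keep base clusterings whose entropy is below the ensemble mean."""
    entropies = np.array([partition_entropy(U) for U in memberships])
    return [U for U, h in zip(memberships, entropies)
            if h <= entropies.mean()]
```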
3.2. The key steps of IFCERS
In the following, we describe the key steps of our method in
detail.
3.2.1. The generation and the formalization of base soft clusterings
In this step, well-known soft clustering techniques such as
FCM, rough k-means, and rough-fuzzy k-means, with the number
of clusters set to the true number of classes of the data set, are first
adopted to generate multiple clustering solutions.
Based on their underlying mathematical theories, these soft
clustering methods can be mainly divided into: soft clustering methods
based on FST, soft clustering methods based on RST or a combination of both theories, and soft clustering methods based on evidence theory. On the other hand, because these methods abstract,
describe and compute the uncertain cluster structure in different
ways, their clustering results take different forms, which
makes it difficult to combine the multiple base clustering results.
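One plausible way to put such heterogeneous base results on a common footing is to convert each of them into a fuzzy membership matrix. The conversion rules sketched below (one-hot rows for hard partitions, equal shares across boundary clusters for rough partitions) are illustrative assumptions rather than the paper's exact formalization.

```python
# Illustrative conversions of heterogeneous base clusterings into fuzzy
# membership matrices; the rules here are our own assumptions.
import numpy as np

def hard_to_membership(labels, c):
    """Hard partition -> one-hot membership matrix of shape (n, c)."""
    n = len(labels)
    U = np.zeros((n, c))
    U[np.arange(n), labels] = 1.0
    return U

def rough_to_membership(lower, boundary, c, n):
    """Rough clusters -> memberships: weight 1 inside a lower approximation,
    equal shares across the clusters whose boundary contains the object."""
    U = np.zeros((n, c))
    for k, objs in enumerate(lower):
        U[list(objs), k] = 1.0
    counts = np.zeros(n)                    # boundary multiplicity per object
    for objs in boundary:
        counts[list(objs)] += 1
    for k, objs in enumerate(boundary):
        for i in objs:
            U[i, k] = 1.0 / counts[i]
    return U
```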