ℓ1-SVM [10]. Other examples of the wrapper model could be any combination of a preferred
search strategy and a given classifier. Since the wrapper model depends on a given classifier,
cross-validation is usually required in the evaluation process. Wrapper methods are in general
more computationally expensive and biased toward the chosen classifier. Therefore, in real
applications the filter model is more popular, especially for problems with large datasets.
However, the wrapper model has been empirically shown to be superior to the filter model in
terms of classification accuracy. Due to these shortcomings of each model, the hybrid model
[13, 40] was proposed to bridge the gap between the filter and wrapper models. First, it
incorporates a statistical criterion, as the filter model does, to select several candidate feature
subsets with a given cardinality. Second, it chooses the subset with the highest classification
accuracy [40]. Thus, the hybrid model usually achieves both accuracy comparable to the
wrapper model and efficiency comparable to the filter model. Representative feature selection
algorithms of the hybrid model include BBHFS [13] and HGA [53].
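As a concrete illustration of this two-stage idea, the following minimal Python sketch ranks features with a filter criterion (mutual information) and then, wrapper-style, cross-validates the top-k candidate subsets with a simple classifier. The criterion, classifier, and function names are illustrative assumptions and do not reproduce BBHFS or HGA.

```python
# Hypothetical two-stage hybrid feature selection sketch (not a published algorithm).
import numpy as np
from sklearn.feature_selection import mutual_info_classif
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

def hybrid_select(X, y, max_cardinality=10, cv=5):
    # Stage 1 (filter): rank all features by a statistical criterion.
    scores = mutual_info_classif(X, y)
    ranking = np.argsort(scores)[::-1]          # best feature first

    # Stage 2 (wrapper): for each cardinality k, take the top-k candidate
    # subset and estimate its classification accuracy by cross-validation.
    best_subset, best_acc = None, -np.inf
    for k in range(1, max_cardinality + 1):
        subset = ranking[:k]
        acc = cross_val_score(KNeighborsClassifier(),
                              X[:, subset], y, cv=cv).mean()
        if acc > best_acc:
            best_subset, best_acc = subset, acc
    return best_subset, best_acc
```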
Finally, the embedded model performs feature selection during learning; in other words, it
achieves model fitting and feature selection simultaneously. Examples of the embedded model
include C4.5 [54], BlogReg [21], and SBMLR [21]. Based on the type of output, most feature
selection algorithms fall into one of three categories: subset selection [75], which returns a
subset of selected features identified by the indices of the features; feature weighting [59],
which returns a weight for each feature; and the hybrid of subset selection and feature
weighting, which returns a ranked subset of features.
Feature weighting, on the other hand, can be thought of as a generalization of feature selection
[70]. In feature selection, a feature is assigned a binary weight, where 1 means the feature is
selected and 0 otherwise. Feature weighting instead assigns each feature a value, usually in
the interval [0,1] or [-1,1]; the greater this value, the more salient the feature. Feature
weighting has been found to outperform feature selection in tasks where features vary in their
relevance [70], which is the case in most real-world problems. Feature weighting can also be
reduced to feature selection by setting a threshold and selecting features based on their
weights. Therefore, most of the feature selection algorithms mentioned in this chapter can be
viewed as using a feature weighting scheme.
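The reduction from feature weighting to feature selection mentioned above amounts to a simple thresholding step. The sketch below assumes the weights are already available (here a made-up example vector) and that the threshold value is an arbitrary choice.

```python
import numpy as np

# Example (made-up) feature weights in [0, 1]; in practice these would
# come from a feature weighting algorithm such as ReliefF.
weights = np.array([0.92, 0.05, 0.61, 0.13, 0.78])

# Feature selection as a special case of weighting: keep every feature
# whose weight exceeds a chosen threshold (0.5 here is arbitrary).
threshold = 0.5
selected = np.where(weights > threshold)[0]
print(selected)   # indices of the selected features -> [0 2 4]
```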
Typically, a feature selection method consists of four basic steps [40], namely, subset
generation, subset evaluation, stopping criterion, and result validation. In the first step, a
candidate feature subset is chosen based on a given search strategy and is then, in the second
step, evaluated according to a certain evaluation criterion. Once the stopping criterion is met,
the subset that best fits the evaluation criterion is chosen from all the candidates that have
been evaluated. In the final step, the chosen subset is validated using domain knowledge or a
validation set.
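The four steps can be arranged into a generic search loop. The skeleton below is an assumed illustration in which the search strategy, evaluation criterion, stopping rule, and validation step are passed in as callables; it is not any particular published algorithm.

```python
def select_features(generate, evaluate, should_stop, validate):
    """Generic four-step feature selection skeleton (illustrative only).

    generate(state)    -> next candidate subset   (subset generation)
    evaluate(subset)   -> score of a subset        (subset evaluation)
    should_stop(state) -> True when the search ends (stopping criterion)
    validate(subset)   -> check on held-out data    (result validation)
    """
    best_subset, best_score = None, float("-inf")
    state = {"iteration": 0}
    while not should_stop(state):                # step 3: stopping criterion
        candidate = generate(state)              # step 1: subset generation
        score = evaluate(candidate)              # step 2: subset evaluation
        if score > best_score:
            best_subset, best_score = candidate, score
        state["iteration"] += 1
    validate(best_subset)                        # step 4: result validation
    return best_subset
```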
0.1.3 Feature Selection for Clustering
The existence of irrelevant features in the data set may degrade learning quality and consume
memory and computational time that could be saved if these features were removed. From the
clustering point of view, removing irrelevant features does not negatively affect clustering
accuracy while it reduces the required storage and computational time. Figure 2 illustrates
this notion: (a) shows the relevant feature f1, which can discriminate the clusters, while
Figure 2(b) and (c) show that f2 and f3 cannot distinguish the clusters; hence, they do not
add any significant information to the clustering.
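To make this distinction concrete, the following small experiment (an assumed illustration, not the chapter's Figure 2) builds a data set in which only the first feature separates two clusters, and scores each feature by the silhouette of a univariate k-means clustering; the relevant feature obtains a clearly higher score than the noise features.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
n = 200
# f1 separates two clusters; f2 and f3 are pure noise (irrelevant).
f1 = np.concatenate([rng.normal(-3, 1, n), rng.normal(3, 1, n)])
f2 = rng.normal(0, 1, 2 * n)
f3 = rng.uniform(-5, 5, 2 * n)
X = np.column_stack([f1, f2, f3])

# Score each feature by how well a 2-cluster k-means separates it alone.
for j in range(X.shape[1]):
    col = X[:, [j]]
    labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(col)
    print(f"f{j + 1}: silhouette = {silhouette_score(col, labels):.2f}")
# f1 (the relevant feature) obtains the highest silhouette score.
```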
In addition, different relevant features may produce different clusterings. Figure 3(a) shows
four clusters obtained by utilizing knowledge from f1 and f2, while Figure 3(b) shows two
clusters if we use f1 only. Similarly, Figure 3(c) shows two different clusters if we use f2
instead. Therefore, different subsets of relevant features may result in different clusterings,
which greatly helps in discovering different hidden patterns in the data.
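The sketch below (an assumed illustration in the spirit of Figure 3, not a reproduction of it) clusters the same data three times, once on both features and once on each feature alone, to show that the choice of feature subset changes the clustering that is found.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
# Four groups laid out on a 2-by-2 grid in the (f1, f2) plane.
centers = [(-3, -3), (-3, 3), (3, -3), (3, 3)]
X = np.vstack([rng.normal(c, 0.5, size=(100, 2)) for c in centers])

# Clustering on {f1, f2} can recover four clusters ...
four = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X)
# ... while clustering on f1 alone or on f2 alone yields two clusters,
# and the two resulting partitions group the points differently.
two_f1 = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X[:, [0]])
two_f2 = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X[:, [1]])
print(np.unique(four).size, np.unique(two_f1).size, np.unique(two_f2).size)
```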