X. Huang et al. / Knowledge-Based Systems 0 0 0 (2018) 1–15 3
usually achieve good results in most cases. Using the βth power to constrain feature weights, Huang et al. [11] proposed a k-means type clustering framework that integrates both intra-cluster compactness and inter-cluster separation. However, when all objects in a cluster share the same value on a feature during the clustering process, i.e., the scatter of that feature is zero in the cluster, algorithms that constrain feature weights via the βth power assign a weight of one to the zero-scatter feature and weights of zero to all other features. In general, it is unreasonable to distinguish a cluster on a data set by only one feature.
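This degenerate behavior can be sketched with the closed-form weight update that the βth-power constraint yields (a minimal illustration, assuming the standard update in which the weight of feature j is proportional to D_j^{-1/(β-1)}; function and variable names are ours, not the cited papers'):

```python
import numpy as np

def beta_power_weights(D, beta=2.0):
    """Weight update under the beta-th power constraint.

    D    : per-feature within-cluster scatters (non-negative array)
    beta : weighting exponent (beta > 1)

    Minimizing sum_j w_j**beta * D_j subject to sum_j w_j = 1 gives
    w_j proportional to D_j**(-1/(beta-1)).  When some D_j == 0, the
    optimum degenerates: all weight goes to the zero-scatter feature(s).
    """
    D = np.asarray(D, dtype=float)
    zero = D == 0.0
    if zero.any():
        # Degenerate case described in the text: the zero-scatter
        # feature takes all the weight, the others get zero.
        w = zero.astype(float)
        return w / w.sum()
    inv = D ** (-1.0 / (beta - 1.0))
    return inv / inv.sum()

# A feature with zero scatter monopolizes the weight:
print(beta_power_weights([0.0, 3.0, 5.0]))   # → [1. 0. 0.]
# Without a zero scatter, the weight spreads across features:
print(beta_power_weights([2.0, 3.0, 5.0]))
```

With `beta=2.0`, the non-degenerate weights are simply proportional to the reciprocal scatters, so low-scatter features receive larger weights, as intended by the weighting framework.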
Another method, proposed in the EWkmeans algorithm [7], uses entropy to constrain feature weights; it encourages more features to participate in the clustering process by simultaneously minimizing the within-cluster scatter and maximizing the negative weight entropy. The updating rules of the EWkmeans algorithm are obtained by minimizing the following objective function:
P(U, W, Z) = \sum_{p=1}^{k} \sum_{i=1}^{n} u_{ip} \sum_{j=1}^{m} w_{pj} (x_{ij} - z_{pj})^2 + \gamma \sum_{p=1}^{k} \sum_{j=1}^{m} w_{pj} \log w_{pj}    (3)
subject to the constraint conditions in Eq. (2), where γ is a parameter that balances the scatter and the entropy of the weights. Following EWkmeans, Chen et al. [9,26] proposed two types of automated two-level variable weighting clustering algorithms for multiview data. Deng et al. [10] proposed an enhanced soft subspace clustering algorithm by integrating intra-cluster compactness and inter-cluster separation with an entropy constraint on the feature weights. In the EWkmeans algorithm, the updating rule for the weights, obtained by minimizing the objective function in Eq. (3), is as follows:
w_{pj} = \frac{\exp(-D_{pj}/\gamma)}{\sum_{l=1}^{m} \exp(-D_{pl}/\gamma)},    (4)
where $D_{pj}$ is the scatter of cluster p on feature j. From this updating rule, we can observe that when $D_{pj}$ is large, e.g., $D_{pj} = 1000$, $\exp(-D_{pj}/\gamma)$ is nearly zero (according to reference [7], the value of γ ranges from 0.1 to 7). Therefore, the clustering process of k-means algorithms with entropy regularization is often dominated by only a few features. Moreover, numeric underflow errors may occur when the scatter is large, since $\exp(-D_{pj}/\gamma)$ vanishes in floating-point arithmetic.
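This dominance and underflow behavior can be illustrated with a short sketch of the update in Eq. (4) (an illustrative Python fragment, not the authors' implementation; the max-shift stabilization is our addition):

```python
import numpy as np

def entropy_weights(D, gamma=1.0):
    """Entropy-regularized weight update of Eq. (4): a softmax of -D/gamma.

    D : per-feature scatters of one cluster; gamma : entropy parameter.
    The shift by the minimum scatter below is the standard numerically
    stable softmax trick; the naive form exp(-D/gamma) underflows to
    zero in float64 once a scatter exceeds roughly 745 * gamma.
    """
    D = np.asarray(D, dtype=float)
    e = np.exp(-(D - D.min()) / gamma)   # shift for numerical stability
    return e / e.sum()

# With gamma in [0.1, 7] as reported in [7], even a moderate scatter gap
# drives the weights to near one-hot, so the smallest-scatter features
# dominate the clustering process:
print(entropy_weights([1000.0, 2.0, 3.0], gamma=1.0))
# The naive exponential underflows outright for a large scatter:
print(np.exp(-1000.0))  # → 0.0
```

In the example, the feature with scatter 1000 receives a weight that underflows to exactly zero, while the two small-scatter features share essentially all the weight.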
2.2. k -modes type clustering on categorical data sets
2.2.1. k -modes type algorithms
Since working only on numerical data sets prevents k-means type algorithms from clustering real-life data containing categorical values, Huang proposed the k-modes algorithm [27], which employs a simple matching dissimilarity measure to handle categorical objects, replaces the means of clusters with modes, and uses a frequency-based method to update the modes during clustering. Based on this technique, most k-means type methods [1,7,8] working on numerical data can be converted into k-modes type algorithms for clustering categorical data sets simply by replacing means with modes. Building on the k-modes algorithm, Cao et al. proposed the W-k-modes algorithm [28] using complement entropy, and Bai et al. [29] proposed another weighting k-modes algorithm using the βth power to constrain feature weights. To further improve performance, Bai and Liang [30] introduced inter-cluster separation into the conventional k-modes algorithm and proved the convergence of their proposed algorithm. Qian et al. [31] proposed a novel data-representation scheme for categorical data that maps a set of categorical objects into a Euclidean space, and on this basis developed a general clustering framework for the space structure of categorical data.
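The two k-modes ingredients named above, the simple matching dissimilarity and the frequency-based mode update, can be sketched as follows (a minimal illustration; function names are ours):

```python
from collections import Counter

def matching_dissimilarity(x, z):
    """Simple matching dissimilarity of k-modes:
    the number of features on which two objects disagree."""
    return sum(a != b for a, b in zip(x, z))

def update_mode(cluster):
    """Frequency-based mode update: for each feature, pick the most
    frequent categorical value among the objects in the cluster."""
    return tuple(Counter(col).most_common(1)[0][0] for col in zip(*cluster))

cluster = [("red", "small"), ("red", "large"), ("blue", "small")]
print(update_mode(cluster))  # → ('red', 'small')
print(matching_dissimilarity(("red", "large"), ("red", "small")))  # → 1
```

The mode plays the role of the mean: it minimizes the total matching dissimilarity from the cluster's objects, feature by feature.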
However, since the centroid in each dimension is usually represented by a single feature value, the representability of the centroids in this type of algorithm is limited, especially when the distribution of feature values is uniform.
2.2.2. Generalization of centroid representability in k -modes type
algorithms
The conventional k-modes algorithm chooses the feature value with maximal frequency to represent a cluster on a feature. This method often ignores the representability of other feature values whose frequencies are close to the maximal one in the cluster. To eliminate this flaw, many improved algorithms [14,32–35] were proposed that allocate proper weights to the feature values that do not have the maximal frequency. San et al. [32] introduced a frequency-based centroid into the k-modes algorithm: the higher the frequency of a feature value in a cluster, the larger the representability of that feature value in the cluster. In references [34,36], relative feature frequencies are adopted as weights to reflect the representability of the cluster centroid; compared with the conventional k-modes algorithm, this improves the measure of intra-cluster compactness for categorical data. Lee and Pedrycz [35] generalized the k-modes algorithm with fuzzy p-modes prototypes. The algorithms mentioned above can be seen as special cases of the generalized k-modes algorithm. However, Bai et al. [14] argued that the aforementioned methods can converge to a local optimum only if they degenerate to the simple k-modes algorithm. To overcome this deficiency, Bai et al. [14] developed two modified k-modes type clustering algorithms, MKM_NOF and MKM_NDM, which employ entropy and $l_2$-norm regularization, respectively, to smooth the centroid representation on every feature.
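The frequency-based generalized centroid shared by these methods can be sketched as follows (a simplified illustration of the common idea in [32,34,36], not any one of the cited algorithms; names are ours):

```python
from collections import Counter

def frequency_centroid(cluster):
    """Generalized centroid: for each feature, the relative frequency of
    every categorical value in the cluster, instead of a single modal
    value.  Each feature is represented by a {value: frequency} map."""
    n = len(cluster)
    return [{v: c / n for v, c in Counter(col).items()} for col in zip(*cluster)]

cluster = [("red", "small"), ("red", "large"), ("blue", "small")]
print(frequency_centroid(cluster))
# feature 0: red appears 2/3, blue 1/3; feature 1: small 2/3, large 1/3
```

Representing each feature by a frequency distribution is what enlarges the centroid's dimensionality and hence the computational cost noted below.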
A generalized centroid is usually more representative than a traditional centroid, which is represented by a single feature value in each dimension. However, algorithms with generalized centroids require more computational cost due to the larger dimensionality of the centroid. Moreover, these algorithms have no capability for feature selection on noisy data sets.
2.3. k -means type clustering on mixed data sets
To handle the mixed-type objects frequently encountered in the real world, Huang [27] proposed a more practically useful algorithm, k-prototypes, which straightforwardly integrates the classic k-means and k-modes algorithms with a balancing parameter. Lee and Pedrycz [35] extended k-prototypes into a fuzzy p-modes clustering algorithm in which a more effective centroid representation is used for the categorical part in comparison with the classic k-prototypes algorithm. Ahmad and Dey [37] proposed another k-means type subspace clustering algorithm for mixed numerical and categorical data. However, this method also integrates numerical and categorical data by simple addition. There is a lack of effective methods for fusing numerical and categorical data under the k-means type clustering framework beyond the simple addition of the two parts. In the existing methods mentioned above, the numeric data and the categorical data are, in essence, still handled separately, which is not a truly unified treatment semantically.
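The simple-addition integration criticized here can be sketched as follows (an illustrative fragment in the spirit of the k-prototypes cost, with γ as the balancing parameter; names are ours):

```python
def prototype_distance(x_num, x_cat, z_num, z_cat, gamma=1.0):
    """k-prototypes style cost: squared Euclidean distance on the numeric
    part plus gamma times the simple matching dissimilarity on the
    categorical part.  The two parts are fused by simple addition,
    which is exactly the integration the text criticizes."""
    d_num = sum((a - b) ** 2 for a, b in zip(x_num, z_num))
    d_cat = sum(a != b for a, b in zip(x_cat, z_cat))
    return d_num + gamma * d_cat

# numeric part: (1-1)^2 + (2-0)^2 = 4; categorical part: 1 mismatch
print(prototype_distance([1.0, 2.0], ["red"], [1.0, 0.0], ["blue"], gamma=0.5))
# → 4.5
```

Because the two terms live on different scales, the result is sensitive to the choice of γ, which is one reason this additive fusion is not semantically unified.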
2.4. Characteristic of our proposed methods
At present, the existing k-means type algorithms can be summarized into two classes: (1) no weighting k-means algorithms; (2) weighting k-means algorithms. No weighting k-means algorithms
Please cite this article as: X. Huang et al., A new weighting k-means type clustering framework with an $l_2$-norm regularization, Knowledge-Based Systems (2018), https://doi.org/10.1016/j.knosys.2018.03.028