A nonnegative matrix factorization 631
There are generally two types of semi-supervised clustering methods: constraint-based
methods [22] and metric-based methods [42]. In constraint-based approaches, the algorithm
makes use of different constraints such as class labels or pairwise constraints to guide the
search for an appropriate clustering. The pairwise constraints (side information) usually take
the form of 'must-link' and 'cannot-link' relationships between objects and are used to enhance
unsupervised clustering algorithms. A must-link constraint specifies that the two
instances in the relation should be assigned to the same cluster, whereas a cannot-link
constraint specifies that they should not.
These sets of constraints act as a guide for constrained
clustering algorithms, which attempt to find clusters in a data set satisfying the specified must-
link and cannot-link constraints. Classical semi-supervised clustering with labeled seeding
points is covered in [1,2]. As for semi-supervised clustering with labeled constraints, Wagstaff
et al. [39] enforced the constraints during cluster assignment in the clustering
process. Davidson et al. [12] studied the feasibility of clustering under different types of
constraints. Lu et al. [29] proposed a probabilistic clustering method based on Gaussian mixture
models (GMM) of the data distribution, which provided a flexible framework encompassing
several other semi-supervised clustering models as special cases.
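To illustrate the must-link/cannot-link semantics described above, the following sketch enforces constraints during cluster assignment in the spirit of Wagstaff et al.'s approach [39]. The function names and the greedy nearest-feasible-center rule are illustrative choices, not a published implementation:

```python
import numpy as np

def violates(i, c, assign, must_link, cannot_link):
    """Check whether assigning point i to cluster c breaks any constraint."""
    for (a, b) in must_link:
        j = b if a == i else a if b == i else None
        # a must-linked partner already placed in a different cluster
        if j is not None and assign[j] is not None and assign[j] != c:
            return True
    for (a, b) in cannot_link:
        j = b if a == i else a if b == i else None
        # a cannot-linked partner already placed in this cluster
        if j is not None and assign[j] == c:
            return True
    return False

def constrained_assign(X, centers, must_link, cannot_link):
    """One constrained assignment pass: each point takes the nearest
    center whose choice satisfies all pairwise constraints."""
    assign = [None] * len(X)
    for i, x in enumerate(X):
        order = np.argsort([np.linalg.norm(x - m) for m in centers])
        for c in order:
            if not violates(i, int(c), assign, must_link, cannot_link):
                assign[i] = int(c)
                break
        if assign[i] is None:
            raise ValueError(f"no feasible cluster for point {i}")
    return assign
```

A cannot-link constraint can thus push a point away from its nearest center: the point falls back to the closest center for which no constraint is violated, and the pass fails if no feasible cluster exists.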
In metric-based approaches, an existing clustering algorithm that uses a distance measure
is employed. However, the distance measure is first trained to satisfy the labels or constraints in
the supervised data [21,47]. In constraint-based approaches, the clustering algorithm itself is
modified so that the constraints bias the search for an appropriate clustering
of the data [8]. Notable works include the following: Hu et al. [19] integrated the constraints
into the K-means objective function, which was expressed as an equivalent trace formulation.
Chang et al. [7] proposed a metric learning method which performs nonlinear transformation
globally and linear transformation locally. The trend for current research in semi-supervised
clustering is to combine both of these approaches. Yin et al. [45] proposed an adaptive
semi-supervised clustering kernel method based on metric learning, which can deal with the
problems of manually tuning kernel parameters and of violated pairwise constraints.
In addition, Gu et al. [17] proposed a dual regularized co-clustering method based on
semi-nonnegative matrix tri-factorization, which considered the geometric structure of the data.
The co-clustering method was formulated as semi-nonnegative matrix tri-factorization with
two graph regularizers.
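To make the metric-based idea concrete, the sketch below learns a simple diagonal feature weighting from must-link pairs, down-weighting features that vary within pairs that should cluster together. This is a deliberately crude heuristic stand-in for the metric-learning methods cited above, not the actual algorithm of any cited work:

```python
import numpy as np

def learn_diag_metric(X, must_link, eps=1e-8):
    """Heuristic diagonal metric: weight each feature inversely to its
    average squared difference over must-link pairs, so features that
    disagree within must-linked pairs count less in the distance."""
    diffs = np.array([(X[a] - X[b]) ** 2 for a, b in must_link])
    w = 1.0 / (diffs.mean(axis=0) + eps)
    return w / w.sum()  # normalize the weights to sum to one

def metric_dist(x, y, w):
    """Weighted (diagonal Mahalanobis-style) distance under the learned metric."""
    return np.sqrt(np.sum(w * (x - y) ** 2))
```

Any existing distance-based clustering algorithm can then be run unchanged with `metric_dist` in place of the Euclidean distance, which is the defining trait of the metric-based family.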
Generally speaking, co-clustering algorithms deal with dyadic data, and many
co-clustering algorithms were initially developed for bioinformatics [23]. In terms of document
clustering, the similarity between words and the similarity between documents can be used
to co-cluster the term-document matrix. Thereafter, the co-occurrence frequencies can be
encoded in co-occurrence matrices, and then, matrix factorizations can be adopted to solve
the clustering problem [6,16]. Dhillon [14] proposed a co-clustering approach, modeling the
algorithm as an information-theoretic partition of the empirical joint probability distribution
of two sets of discrete random variables. Banerjee et al. [1] extended Dhillon’s method to a
general Bregman co-clustering and matrix factorization framework. Rege et al. [33] presented
a graph-theoretic approach to the problem of document-word co-clustering, which used the
isoperimetric co-clustering algorithm (ICA) for partitioning the document-word bipartite
graph. A Bayesian interpretation was also introduced [35,41] to formulate a two-sided
generative model for document and word co-occurrence. More recently, Song et al. [36]
proposed an approach combining the benefits of information-theoretic co-clustering and
constrained clustering.
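As a concrete instance of solving the co-clustering problem through matrix factorization of a term-document co-occurrence matrix, the sketch below applies the standard Lee-Seung multiplicative updates; the rank k, iteration count, and argmax cluster read-out are illustrative assumptions rather than the procedure of any particular cited method:

```python
import numpy as np

def nmf_cocluster(A, k, iters=200, seed=0, eps=1e-9):
    """Factorize a nonnegative term-document matrix A (terms x docs)
    as A ~ W H with Lee-Seung multiplicative updates; the argmax over
    rows of W clusters terms and over columns of H clusters documents."""
    rng = np.random.default_rng(seed)
    m, n = A.shape
    W = rng.random((m, k)) + eps
    H = rng.random((k, n)) + eps
    for _ in range(iters):
        # multiplicative updates keep W and H nonnegative throughout
        H *= (W.T @ A) / (W.T @ W @ H + eps)
        W *= (A @ H.T) / (W @ H @ H.T + eps)
    term_clusters = W.argmax(axis=1)
    doc_clusters = H.argmax(axis=0)
    return term_clusters, doc_clusters
```

On a co-occurrence matrix with block structure, the two factors recover matching term and document groupings, which is exactly the co-clustering of the term-document matrix described above.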
NMF was first proposed in 1994 by Paatero et al. [32] and began to be studied extensively
after the publication of an article by Lee et al. [24] in 1999. The classical NMF algorithms