DOCUMENTS CLUSTERING BASED ON MAX-CORRENTROPY
NONNEGATIVE MATRIX FACTORIZATION
Le Li¹, Jianjun Yang², Yang Xu³, Zhen Qin³, Honggang Zhang³
¹ David R. Cheriton School of Computer Science, University of Waterloo, ON N2L3G1, Canada
² Department of Computer Science, University of North Georgia, Oakwood, GA 30566, USA
³ Pattern Recognition and Intelligent System Laboratory, Beijing University of Posts and Telecommunications, Beijing, China
E-MAIL: l248li@uwaterloo.ca, jianjun.yang@ung.edu, {xj992adolphxy, qinzhenbupt}@gmail.com, zhhg@bupt.edu.cn
Abstract:
Nonnegative matrix factorization (NMF) has been successfully applied to many areas for classification and clustering. Commonly used NMF algorithms mainly target minimizing the l2 distance or the Kullback-Leibler (KL) divergence, which may not be suitable for nonlinear cases. In this paper, we propose a new decomposition method for document clustering that maximizes the correntropy between the original matrix and the product of two low-rank matrices. This method also allows us to learn the new basis vectors of the semantic feature space from the data. To our knowledge, no prior work has applied correntropy maximization in NMF to cluster high-dimensional document data. Our experimental results show the superiority of the proposed method over other variants of the NMF algorithm on the Reuters21578 and TDT2 datasets.
Keywords:
Document clustering; Nonnegative matrix factorization
1. Introduction
A corpus is a collection of documents where each document is associated with a ground-truth topic that summarizes its content. Document clustering is the process of finding the correct label for an input document, such that this label matches the ground-truth topic as closely as possible. Such clustering makes it possible to automatically organize millions of documents, websites, news articles, etc. into multiple partitions, where documents within the same partition share the same topic. As a consequence, we can apply this technique to different tasks, such as document organization and browsing, corpus summarization, and document classification [1].
Different types of algorithms have been used to cluster or classify data (e.g., SVM [34] and pLSA [12]). These algorithms have a variety of applications in different areas [32, 23, 19, 28, 33, 21, 20, 18, 17, 27]. Among these algorithms, we are especially interested in the nonnegative matrix factorization (NMF) method. NMF maps the original features into a latent semantic space where each basis vector represents a topic. More precisely, assuming each document is represented as a D-dimensional feature vector and we have N documents in the corpus, we can form a D × N matrix (denoted as X) to represent the whole corpus. NMF decomposes X into two low-rank nonnegative matrices, H and W, such that X ≈ HW. One of its main benefits is its inherent dimensionality reduction without losing too much useful information. This decomposition has demonstrated its effectiveness in many areas (e.g., bioinformatics [24]).
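To make the decomposition concrete, the following minimal NumPy sketch factorizes a toy term-document matrix using the classic multiplicative updates for the l2 objective. The function name nmf_l2 and all parameter choices are ours for illustration; this is the standard l2-based NMF baseline, not the MCC algorithm proposed in this paper.

```python
import numpy as np

def nmf_l2(X, K, n_iter=200, eps=1e-9, seed=0):
    """Minimal multiplicative-update NMF minimizing ||X - HW||_F^2.

    X : (D, N) nonnegative term-document matrix
    H : (D, K) basis vectors (topics in the latent semantic space)
    W : (K, N) per-document topic coefficients
    Illustrative sketch of the classic l2-based NMF baseline.
    """
    rng = np.random.default_rng(seed)
    D, N = X.shape
    H = rng.random((D, K))
    W = rng.random((K, N))
    for _ in range(n_iter):
        # Lee-Seung multiplicative updates preserve nonnegativity.
        W *= (H.T @ X) / (H.T @ H @ W + eps)
        H *= (X @ W.T) / (H @ W @ W.T + eps)
    return H, W

# Toy usage: 6 "terms" x 4 "documents", 2 latent topics.
X = np.random.default_rng(1).random((6, 4))
H, W = nmf_l2(X, K=2)
labels = W.argmax(axis=0)  # cluster each document by its dominant topic
```

As the last line shows, clustering follows directly from the factorization: each document is assigned to the topic (basis vector) with the largest coefficient in W.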
Much work has been done on applying NMF algorithms to document clustering [29, 16]. However, most of it aims to minimize the l2 distance or the KL divergence. Inspired by the recent work in [24], which combines correntropy with NMF for cancer clustering, we propose a similar max-correntropy nonnegative matrix factorization algorithm (MCC) for document clustering. The work in [24] is in line with ours in that both show the benefits of using this max-correntropy method for clustering. However, the two works target different domains. Moreover, the work in [24] only examines clustering performance on a limited number of topics (fewer than 10) and lower-dimensional data, whereas we systematically investigate its performance on more sophisticated clustering tasks with more documents, more topics, and higher-dimensional data.
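For readers unfamiliar with correntropy, the sketch below computes the entrywise Gaussian-kernel similarity between X and its reconstruction HW, which a max-correntropy objective maximizes in place of minimizing the l2 or KL reconstruction error. The kernel form and averaging here follow the general definition of correntropy and are our assumption; the exact objective and normalization in [24] may differ.

```python
import numpy as np

def correntropy_objective(X, H, W, sigma=1.0):
    """Entrywise correntropy between X and its reconstruction HW:
        V = mean( exp(-(X - HW)^2 / (2 * sigma^2)) ).
    Large residuals saturate the Gaussian kernel toward 0, so
    outlier entries are effectively down-weighted; sigma controls
    this robustness. Illustrative reading of the correntropy
    measure, not necessarily the exact objective of [24].
    """
    E = X - H @ W
    return np.exp(-(E ** 2) / (2.0 * sigma ** 2)).mean()
```

Unlike the quadratic loss, whose penalty grows without bound for large residuals, the kernel bounds each entry's contribution, which is what makes the correntropy criterion robust to noisy or nonlinear data.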
To achieve this, we implement the MCC algorithm and test its accuracy on the Reuters21578 and TDT2 corpora. We compare the MCC algorithm to classic loss functions (l2 distance and KL divergence), as well as other variants of NMF