DOCUMENTS CLUSTERING BASED ON MAX-CORRENTROPY
NONNEGATIVE MATRIX FACTORIZATION
Le Li¹, Jianjun Yang², Yang Xu³, Zhen Qin³, Honggang Zhang³
¹ David R. Cheriton School of Computer Science, University of Waterloo, ON N2L3G1, Canada
² Department of Computer Science, University of North Georgia, Oakwood, GA 30566, USA
³ Pattern Recognition and Intelligent System Laboratory, Beijing University of Posts and Telecommunications, Beijing, China
E-MAIL: l248li@uwaterloo.ca, jianjun.yang@ung.edu, {xj992adolphxy, qinzhenbupt}@gmail.com, zhhg@bupt.edu.cn
Abstract:
Nonnegative matrix factorization (NMF) has been successfully applied to many areas for classification and clustering. Commonly used NMF algorithms mainly target minimizing the l2 distance or the Kullback-Leibler (KL) divergence, which may not be suitable for nonlinear cases. In this paper, we propose a new decomposition method for document clustering that maximizes the correntropy between the original matrix and the product of two low-rank matrices. This method also allows us to learn the new basis vectors of the semantic feature space from the data. To our knowledge, no prior work has applied correntropy maximization in NMF to cluster high-dimensional document data. Our experimental results show the superiority of the proposed method over other variants of the NMF algorithm on the Reuters21578 and TDT2 datasets.
Keywords:
Document clustering; Nonnegative matrix factorization
1. Introduction
A corpus is a collection of documents where each document is associated with a ground-truth topic that summarizes its content. Document clustering is the process of finding the correct label for an input document, such that this label matches the ground-truth topic as closely as possible. Such clustering makes it possible to automatically organize millions of documents, websites, news articles, etc. into multiple partitions, where documents within the same partition share the same topic. As a consequence, we can apply this technique to different tasks, such as document organization and browsing, corpus summarization, and document classification [1].
Different types of algorithms have been used to cluster or classify data (e.g., SVM [34] and pLSA [12]). These algorithms have a variety of applications in different areas [32, 23, 19, 28, 33, 21, 20, 18, 17, 27]. Among these algorithms, we are especially interested in the nonnegative matrix factorization (NMF) method. NMF maps the original features into a latent semantic space where each basis vector represents a topic. More precisely, assuming each document is represented as a D-dimensional feature vector and we have N documents in the corpus, we can form a D × N matrix (denoted as X) to represent the whole corpus. NMF decomposes X into two low-rank nonnegative matrices, H and W, such that X ≈ HW. One of its main benefits is its inherent dimensionality reduction without losing too much useful information. This decomposition has demonstrated its effectiveness in many areas (e.g., bioinformatics [24]).
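To make the decomposition concrete, the following minimal NumPy sketch factorizes a toy term-document matrix using the classic multiplicative updates for the l2 objective. The function name nmf_l2 and all parameter choices are ours for illustration; this is the standard l2-based NMF baseline, not the MCC algorithm proposed in this paper.

```python
import numpy as np

def nmf_l2(X, K, n_iter=200, eps=1e-9, seed=0):
    """Minimal multiplicative-update NMF minimizing ||X - HW||_F^2.

    X : (D, N) nonnegative term-document matrix
    H : (D, K) basis vectors (topics in the latent semantic space)
    W : (K, N) per-document topic coefficients
    Illustrative sketch of the classic l2-based NMF baseline.
    """
    rng = np.random.default_rng(seed)
    D, N = X.shape
    H = rng.random((D, K))
    W = rng.random((K, N))
    for _ in range(n_iter):
        # Lee-Seung multiplicative updates preserve nonnegativity.
        W *= (H.T @ X) / (H.T @ H @ W + eps)
        H *= (X @ W.T) / (H @ W @ W.T + eps)
    return H, W

# Toy usage: 6 "terms" x 4 "documents", 2 latent topics.
X = np.random.default_rng(1).random((6, 4))
H, W = nmf_l2(X, K=2)
labels = W.argmax(axis=0)  # cluster each document by its dominant topic
```

As the last line shows, clustering follows directly from the factorization: each document is assigned to the topic (basis vector) with the largest coefficient in W.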
Much work has been done on applying NMF algorithms to document clustering [29, 16]. However, most of it aims to minimize the l2 distance or the KL divergence. Inspired by the recent work in [24], which combines correntropy with NMF for cancer clustering, we propose a similar max-correntropy nonnegative matrix factorization algorithm (MCC) for document clustering. The work in [24] is in line with ours in that both show the benefits of using this max-correntropy method for clustering. However, the two works target different domains. Moreover, the work in [24] only examines clustering performance on a limited number of topics (fewer than 10) and lower-dimensional data, whereas we systematically investigate its performance on more sophisticated clustering tasks with more documents, more topics, and higher-dimensional data.
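For readers unfamiliar with correntropy, the sketch below computes the entrywise Gaussian-kernel similarity between X and its reconstruction HW, which a max-correntropy objective maximizes in place of minimizing the l2 or KL reconstruction error. The kernel form and averaging here follow the general definition of correntropy and are our assumption; the exact objective and normalization in [24] may differ.

```python
import numpy as np

def correntropy_objective(X, H, W, sigma=1.0):
    """Entrywise correntropy between X and its reconstruction HW:
        V = mean( exp(-(X - HW)^2 / (2 * sigma^2)) ).
    Large residuals saturate the Gaussian kernel toward 0, so
    outlier entries are effectively down-weighted; sigma controls
    this robustness. Illustrative reading of the correntropy
    measure, not necessarily the exact objective of [24].
    """
    E = X - H @ W
    return np.exp(-(E ** 2) / (2.0 * sigma ** 2)).mean()
```

Unlike the quadratic loss, whose penalty grows without bound for large residuals, the kernel bounds each entry's contribution, which is what makes the correntropy criterion robust to noisy or nonlinear data.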
To achieve this, we implement the MCC algorithm and test its accuracy on the Reuters21578 and TDT2 corpora. We compare the MCC algorithm to classic loss functions (l2 distance and KL divergence), as well as other variants of NMF