Manifold Regularized Gaussian Mixture Model for Semi-supervised Clustering
Haitao Gan*, Nong Sang*, Rui Huang*†, Xi Chen*
*School of Automation, Huazhong University of Science and Technology, Wuhan, 430074, China
†NEC Laboratories China, Beijing, 100084, China
haitao_gan@hust.edu.cn, nsang@hust.edu.cn, ruihuang@hust.edu.cn, cut@mail.ustc.edu.cn
Abstract—Over the last few decades, the Gaussian Mixture Model (GMM) has attracted considerable interest in data mining and pattern recognition. GMM can be used to cluster a set of data by estimating the parameters of multiple Gaussian components using Expectation-Maximization (EM). Recently, Locally Consistent GMM (LCGMM) has been proposed to improve the clustering performance of GMM by exploiting the local manifold structure modeled by a p nearest neighbor graph. In practice, various kinds of prior knowledge may be available that can be used to guide the clustering process and improve the performance. In this paper, we introduce a semi-supervised method, called Semi-supervised LCGMM (Semi-LCGMM), where prior knowledge is provided in the form of class labels of partial data. Semi-LCGMM incorporates the prior knowledge into the maximum likelihood function of LCGMM and is solved by EM. It is worth noting that in our algorithm each class has multiple Gaussian components, while in the unsupervised setting each class has only one Gaussian component. Experimental results on several datasets demonstrate the effectiveness of our algorithm.

Keywords-Semi-supervised clustering; Gaussian Mixture Model; manifold structure
I. INTRODUCTION
Over the last few decades, clustering analysis has become one of the most important and interesting topics in data mining, and has been widely used in many related fields, such as image categorization [1], document categorization [2] and bioinformatics [3]. One of the most widely used clustering methods is the Gaussian Mixture Model (GMM) [4], [5], [6]. GMM assumes that the observations are drawn independently from a mixture of Gaussians, where each Gaussian density has its own mean and covariance. Clustering results can be obtained by estimating the parameters of the Gaussian components using Expectation-Maximization (EM).
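For concreteness, the standard mixture density and the role of EM can be summarized as follows (this is the textbook formulation, not anything specific to [7] or to our method):

\[
p(x) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x \mid \mu_k, \Sigma_k), \qquad \pi_k \ge 0, \quad \sum_{k=1}^{K} \pi_k = 1,
\]

where $\mathcal{N}(x \mid \mu_k, \Sigma_k)$ is a Gaussian density with mean $\mu_k$ and covariance $\Sigma_k$. EM alternates between computing the posteriors $p(k \mid x_i)$ of each observation (E-step) and re-estimating $(\pi_k, \mu_k, \Sigma_k)$ from those posteriors (M-step); each observation is finally assigned to the component with the largest posterior.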
that it assumes that the observations are generated from
Euclidean space. Recent studies [7], [8], [9] have shown
that clustering performance can be improved by exploiting
the underlying manifold structure of the data. To incorporate
such information, Locally Consistent GMM (LCGMM) [7]
and Laplacian regularized GMM (LapGMM) [8] have been
proposed, both of which assume that similar observations
should have similar conditional probability distributions.
They employ local consistency or Laplacian regularizer,
respectively, to modify the objective function of GMM. In
this way, the conditional probability density functions can
vary smoothly along the geodesics on the manifold. The
results show that both LCGMM and LapGMM can improve
clustering performance by adding a manifold regularization
term into the GMM objective function.
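As a rough illustration of this local-consistency idea, the following sketch fits a plain GMM and then smooths the resulting posteriors over a p-nearest-neighbor graph. It assumes scikit-learn is available, and the post-hoc smoothing is only an approximation of the behavior that LCGMM and LapGMM obtain by regularizing EM itself:

```python
# Illustrative sketch: plain GMM + post-hoc posterior smoothing over a
# p-nearest-neighbor graph (an approximation, not LCGMM's regularized EM).
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.neighbors import kneighbors_graph

def smoothed_gmm_posteriors(X, n_components=2, p=5, lam=0.5, seed=0):
    # Fit a standard GMM by EM.
    gmm = GaussianMixture(n_components=n_components, random_state=seed).fit(X)
    R = gmm.predict_proba(X)                       # n x K posterior matrix
    # Build a 0/1 adjacency matrix of the p-nearest-neighbor graph.
    W = kneighbors_graph(X, n_neighbors=p, mode='connectivity').toarray()
    deg = W.sum(axis=1, keepdims=True)             # neighbor count per point
    # Blend each posterior with the average posterior of its neighbors,
    # so that nearby points receive similar conditional distributions.
    R_smooth = (1.0 - lam) * R + lam * (W @ R) / np.maximum(deg, 1.0)
    return R_smooth.argmax(axis=1), R_smooth

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(4, 1, (50, 2))])
    labels, _ = smoothed_gmm_posteriors(X)
    print(labels)
```

Here lam plays the role of the regularization strength: at lam = 0 the method reduces to plain GMM, while larger values enforce more agreement among neighboring posteriors.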
Meanwhile, in many applications it is difficult to cluster complicated datasets well enough using unsupervised clustering analysis alone, while various kinds of prior knowledge are often available, e.g., in the form of class labels or pairwise constraints of partial data. Consequently, semi-supervised clustering [10] has become a recent topic of interest. Semi-supervised GMM (Semi-GMM) [11] has been proposed with the assumption that each class is associated with multiple Gaussian components. Semi-GMM exploits the information of the labeled observations by embedding the prior knowledge in the objective function of GMM; applied to image segmentation, it yields promising results. However, Semi-GMM does not take the manifold structure of the data space into account.
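To make the idea of embedding class labels concrete, here is a hedged sketch (our illustration, not Semi-GMM's actual update rules) of a common label-clamping heuristic: during EM, a labeled point only receives responsibility mass on the Gaussian components associated with its class. The component-to-class map comp_class is a hypothetical helper introduced for this example:

```python
# Illustrative label-clamping step for semi-supervised EM (not Semi-GMM's
# exact formulation): labeled points keep mass only on their class's components.
import numpy as np

def clamp_responsibilities(R, y, comp_class):
    """R: n x K responsibilities from the E-step; y: labels, -1 for unlabeled;
    comp_class: length-K array mapping each Gaussian component to a class."""
    R = R.copy()
    for i, yi in enumerate(y):
        if yi < 0:
            continue                          # unlabeled: keep the EM posterior
        allowed = (comp_class == yi)          # components belonging to class yi
        R[i, ~allowed] = 0.0                  # forbid other classes' components
        s = R[i].sum()
        # Renormalize; fall back to uniform over allowed components if needed.
        R[i] = R[i] / s if s > 0 else allowed / allowed.sum()
    return R
```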
In this paper, we propose a Semi-supervised LCGMM algorithm (Semi-LCGMM), where prior knowledge is provided in the form of class labels of partial data. This work is motivated by the observation that the information delivered by labeled data may help guide the clustering process and potentially improve the clustering performance. In our algorithm, LCGMM is adapted to a semi-supervised clustering framework by incorporating the prior knowledge into the maximum likelihood function of LCGMM. The EM algorithm is used to solve the resulting optimization problem. The proposed Semi-LCGMM uses a linear combination of Gaussians in which each class is composed of multiple Gaussian components, as in [11], but the parameter initialization is performed in a different manner. Compared to other unsupervised and semi-supervised clustering methods, our algorithm achieves comparable, if not better, results.
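One plausible way to initialize multiple components per class is to run k-means within each labeled class; this is a hypothetical scheme for illustration only, and the initialization actually used by Semi-LCGMM is described in Section III:

```python
# Hypothetical initialization sketch: k-means within each labeled class gives
# several starting component means per class (not the paper's exact scheme).
import numpy as np
from sklearn.cluster import KMeans

def init_class_components(X, y, comps_per_class=2, seed=0):
    """Return initial component means and a component-to-class map."""
    means, comp_class = [], []
    for c in np.unique(y[y >= 0]):             # iterate over observed classes
        Xc = X[y == c]
        k = min(comps_per_class, len(Xc))      # guard against tiny classes
        km = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(Xc)
        means.append(km.cluster_centers_)
        comp_class.extend([c] * k)
    return np.vstack(means), np.array(comp_class)
```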
The rest of the paper is organized as follows. Section II briefly reviews the related work. Section III describes our algorithm in detail. Section IV presents the experimental results on several datasets. Finally, Section V concludes the paper and discusses some future directions.
II. LOCALLY CONSISTENT GMM (LCGMM)
Given a set of observations $X = \{x_1, \cdots, x_n\}$, where $x_i \in \mathbb{R}^D$, we model this set through a mixture of Gaussian distributions. The expected value of the complete data log-likelihood is then maximized by the EM algorithm.