A Probabilistic Approach to Latent Cluster Analysis
Zhipeng Xie, Rui Dong, Zhengheng Deng, Zhenying He, Weidong Yang
School of Computer Science
Fudan University, Shanghai, China
{xiezp, 11210240011, 11210240082, zhenying, wdyang}@fudan.edu.cn

Abstract
Faced with a large number of clustering solutions,
cluster ensemble methods provide an effective
approach to aggregating them into a better one. In
this paper, we propose a novel cluster ensemble
method from a probabilistic perspective. It assumes
that each clustering solution is generated from a
latent cluster model under the control of two
probabilistic parameters. The cluster ensemble
problem is thus reformulated as a maximum
likelihood optimization problem, and an EM-style
algorithm is designed to solve it, with the number
of clusters determined automatically.
Experimental results show that the proposed
algorithm outperforms the state-of-the-art methods
including EAC-AL, CSPA, HGPA, and MCLA.
Furthermore, our algorithm is stable in the
predicted numbers of clusters.
1 Introduction
The goal of cluster analysis is to discover the underlying
structure of a dataset (Jain et al., 1999; Jain, 2010). It
normally partitions a set of objects so that the objects within
the same group are similar while those from different groups
are dissimilar. A large number of clustering algorithms have
been proposed, e.g., k-Means, Spectral Clustering,
Hierarchical Clustering, and Self-Organizing Maps,
yet no single one is able to achieve this
goal for all datasets. On the same data, different algorithms,
or even multiple runs of the same algorithm with different
parameters, often lead to clustering solutions that are distinct
from each other.
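As a toy illustration of this instability (our own sketch, not from this paper; the data set and all parameter values are arbitrary choices), the following Python snippet runs k-Means five times with different random seeds on the same synthetic data and measures the pairwise agreement between the resulting partitions; adjusted Rand scores below 1.0 indicate distinct solutions.

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score

# Synthetic data with overlapping clusters, so that k-Means has
# several competing local optima.
X, _ = make_blobs(n_samples=300, centers=4, cluster_std=2.0, random_state=0)

# One initialization per run (n_init=1), so different seeds may
# converge to different local optima.
labelings = [KMeans(n_clusters=4, n_init=1, random_state=s).fit_predict(X)
             for s in range(5)]

# Pairwise agreement between runs; a score of 1.0 means identical partitions.
for i in range(len(labelings)):
    for j in range(i + 1, len(labelings)):
        print(i, j, adjusted_rand_score(labelings[i], labelings[j]))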
Confronted with a large number of clustering solutions,
cluster ensemble (or clustering aggregation) methods have
emerged, which try to combine different clustering solutions
into a consensus one in order to improve upon the quality of
the component solutions (Vega-Pons and
Ruiz-Shulcloper, 2011). Cluster ensemble methods usually
consist of two or three phases: an ensemble generation phase
that produces a variety of clustering solutions; an optional
ensemble selection phase that selects a subset of these
solutions; and finally a consensus phase
that induces a unified partition by combining the component
ones. In the generation phase, different clustering solutions
can be generated by different clustering algorithms, by the same
algorithm with different parameter settings or initializations,
or by injecting random disturbances into the data set, such as data
resampling (Minaei-Bidgoli et al., 2004), random projection
(Fern and Brodley, 2003), and random feature selection
(Strehl and Ghosh, 2002). Following the generation phase, an
optional ensemble selection phase selects or prunes these
clustering solutions according to their quality and
diversity (Fern and Lin, 2008; Azimi and Fern, 2009).
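As a concrete sketch of the generation phase (our own illustration; the function generate_ensemble and its parameter values are assumptions, not part of any cited method), the snippet below combines two of the strategies above: it varies the number of clusters and the random seed while bootstrap-resampling the data.

import numpy as np
from sklearn.cluster import KMeans

def generate_ensemble(X, ks=(2, 3, 4, 5), runs_per_k=5, seed=0):
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    ensemble = []
    for k in ks:
        for _ in range(runs_per_k):
            # Bootstrap resampling: fit on a resampled copy of the data,
            # then assign every original object to its nearest centroid.
            idx = rng.integers(0, n, size=n)
            km = KMeans(n_clusters=k, n_init=1,
                        random_state=int(rng.integers(1 << 31))).fit(X[idx])
            ensemble.append(km.predict(X))
    return ensemble  # one label vector per clustering solution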
In this paper, we focus on the final phase, clustering
combination. Many combination algorithms exist, and they
can be categorized according to the kind of information they
exploit. The algorithm proposed here falls
into the category that makes use of the pairwise similarities
between objects, which form a co-association matrix in the
context of cluster ensembles. Any clustering algorithm can
be applied to this new similarity matrix to find a consensus
partition. Evidence Accumulation Clustering (Fred and
Jain, 2005), or EAC for short, performs hierarchical
clustering with average linkage (AL) or single linkage (SL) on
the co-association matrix, where a maximum lifetime criterion
is proposed to determine the number of clusters.
The Cluster-based Similarity Partitioning Algorithm (CSPA)
(Strehl and Ghosh, 2002) uses a graph-partitioning algorithm
instead, but requires the number of clusters to be specified
manually. Another algorithm, HGPA (Strehl and Ghosh, 2002),
can be thought of as an approximation to CSPA. Outside this
category, the MCLA algorithm clusters the clusters themselves
based on inter-cluster similarities, and then assigns each
object to its closest meta-cluster. For a thorough list of
related algorithms, please refer to the survey paper by
Vega-Pons and Ruiz-Shulcloper (2011).
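To make the co-association idea concrete, here is a minimal sketch (our own simplified rendering under stated assumptions, not the authors' code): each entry of the matrix counts the fraction of base clusterings that place a pair of objects in the same cluster, and average-linkage hierarchical clustering on the induced distances yields a consensus partition in the spirit of EAC-AL. For simplicity the number of clusters is passed in explicitly rather than chosen by the lifetime criterion.

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def co_association(labelings):
    # C[i, j] = fraction of base clusterings that put objects i and j together.
    n = len(labelings[0])
    C = np.zeros((n, n))
    for labels in labelings:
        labels = np.asarray(labels)
        C += (labels[:, None] == labels[None, :]).astype(float)
    return C / len(labelings)

def consensus_al(labelings, n_clusters):
    C = co_association(labelings)
    D = 1.0 - C                    # turn similarity into distance
    np.fill_diagonal(D, 0.0)
    Z = linkage(squareform(D, checks=False), method='average')
    return fcluster(Z, t=n_clusters, criterion='maxclust')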
Although these methods have achieved some success, they
are still deficient in several aspects: first, they lack a
theoretical underpinning; second, they assume that all
clustering solutions are of the same quality, and thus assign
the same weight to each clustering solution; last but not
least, most of them (except EAC) require the number of
clusters to be specified manually. As to the maximum
lifetime criterion adopted by the EAC algorithm, it is more or
less a rule of thumb that lacks justification. As we shall
see in the experiments, the maximum lifetime criterion is
unstable in determining the number of clusters.
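For reference, the sketch below is our reading of that rule (not code from the cited paper): the chosen number of clusters is the k whose k-cluster partition survives the longest range of merge heights in the dendrogram.

import numpy as np

def max_lifetime_k(Z):
    # Z is a SciPy linkage matrix over n objects (n - 1 merge rows);
    # Z[i, 2] is the height at which the i-th merge occurs.
    n = Z.shape[0] + 1
    heights = Z[:, 2]
    best_k, best_life = None, -np.inf
    for k in range(2, n):
        # The k-cluster partition is created by merge n-k-1 and destroyed
        # by merge n-k; its lifetime is the gap between those heights.
        life = heights[n - k] - heights[n - k - 1]
        if life > best_life:
            best_k, best_life = k, life
    return best_k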