On defining affinity graph for spectral clustering through
ranking on manifolds
Tian Xia
a,b,
, Juan Cao
a
, Yong-dong Zhang
a
, Jin-tao Li
a
a
Center for Advanced Computing Technology Research, Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190, China
b
Graduate University of the Chinese Academy of Sciences, Beijing 100039, China
article info
Article history:
Received 3 November 20 08
Received in revised form
20 January 2009
Accepted 1 March 2009
Communicated by D. Tao
Available online 9 April 2009
Keywords:
Affinity graph
Spectral clustering
Ranking on manifolds
abstract
Spectral clustering consists of two distinct stages: (a) construct an affinity graph from the dataset and
(b) cluster the data points through finding an optimal partition of the affinity graph. The focus of the
paper is the first step. Existing spectral clustering algorithms adopt Gaussian function to define the
affinity graph since it is easy to impleme nt. However, Gaussian function is hard to depict the intrinsic
structure of the data, and it has to specify a scaling parameter whose selection is still an open issue in
spectral clustering. Therefore, we propose a new definition of affinity graph for spectral clustering from
the graph partition perspective. In particular, we propose two consistencies: smooth consistency and
constraint consistency, for affinity graph to hold, and then define the affinity graph respecting these
consistencies in a regularization framework of ranking on manifolds. Meanwhile the proposed
definition of affinity graph is applicable to both unsupervised and semi-supervised spectral clustering.
Encouraging experimental results on synthetic and real world data demonstrate the effectiveness of the
proposed approach.
& 2009 Elsevier B.V. All rights reserved.
1. Introduction
In recent years, spectral clustering has become one of the most
popular modern clustering algorithms. It clusters data points
through performing spectral analysis on the matrix derived from
the data. Specifically, spectral clustering consists of two distinct
stages: (a) construct an affinity graph from the dataset and (b)
cluster the data points through finding an optimal partition of the
affinity graph. Although a great deal of effort has been carried out
addressing the latter, little progress has been made on defining
affinity graph, whereas it encodes the intrinsic structure of the
data, and plays an important role in spectral clustering.
Existing spectral clustering algorithms [2–5] adopt Gaussian
function, i.e., Aðx
i
; x
j
Þ¼expðkx
i
x
j
k
2
=2
s
2
Þ, to define the affinity
graph, since it is simple to implement. However, Gaussian
function has a scaling parameter
s
to be specified manually, and
its selection is still an open issue in spectral clustering. In practice,
s
is often set by an empirical value, such as
s
is set as 0.05 of
the maximal pairwise Euclidean distance among the dataset in
normalized cut algorithm (NC) which is a representative spectral
clustering algorithm [2,6]. This setting of
s
makes spectral
clustering be very sensitive to outliers. Manor et al. propose to
use a local scale rather than a global one for Gaussian function [7];
however, this algorithm has limited success on real world data
although it works well on synthetic data.
More important, Gaussian function is hard to depict the
intrinsic structure of the data from graph partition perspective,
and makes spectral clustering perform badly on some data
distributed in complex shape. To illustrate this, let us consider a
toy example on two-moon data as shown in Fig. 1(a). The affinity
graph constructed in NC is shown in Fig. 1(b), in the form of
K-nearest neighborhood (KNN) graph. We can see that some data
pairs distributed on separate moons are also linked in the affinity
graph; it implies some wrong local neighborhood relationships,
and thus the clustering result of NC is somehow biased as shown
in Fig. 1(c). We also illustrate the example in semi-supervised
case: four constraints including two must-link constraints and
two cannot-link constraints are added as shown in Fig. 1(d); the
affinity graph as shown in Fig. 1(e) is constructed through a
representative distance metric learning method proposed in [8];
we can see that some very near neighbors are not linked in the
affinity graph, and thus it produces bad clustering result as shown
in Fig. 1(f). From this toy example, we can observe: (a) the success
of spectral clustering depends greatly on the constructed affinity
graph which encodes the intrinsic structure of the data; and (b)
Gaussian function based affinity graph still has problems to depict
the intrinsic structure of some data distributed in complex shape.
In this paper, we focus on the definition of affinity graph for
spectral clustering in both unsupervised and semi-supervised
ARTICLE IN PRESS
Contents lists available at ScienceDirect
journal homepage: www.elsevier.com/locate/neucom
Neurocomputing
0925-2312/$ - see front matter & 2009 Elsevier B.V. All rights reserved.
doi:10.1016/j.neucom.2009.03.012
Corresponding author at: Center for Advanced Computing Technology
Research, Institute of Computing Technology, Chinese Academy of Sciences, Beijing
100190, China.
E-mail address: txia@ict.ac.cn (T. Xia).
Neurocomputing 72 (2009) 3203–3211