Short Communication
DSKmeans: A new kmeans-type approach to discriminative subspace
clustering
Xiaohui Huang
a
, Yunming Ye
a,
⇑
, Huifeng Guo
a
, Yi Cai
b
, Haijun Zhang
a
, Yan Li
c
a
Shenzhen Key Laboratory of Internet Information Collaboration, Shenzhen Graduate School, Harbin Institute of Technology, Shenzhen 518055, China
b
School of Software Engineering, South China University of Technology, Guangzhou, China
c
Shenzhen Polytechnic, Liuxian Road, Shenzhen 518055, China
article info
Article history:
Received 23 February 2014
Received in revised form 15 June 2014
Accepted 15 July 2014
Available online 27 July 2014
Keywords:
Kmeans clustering
Feature selection
3-Order tensor
Data mining
Subspace clustering
abstract
Most of kmeans-type clustering algorithms rely on only intra-cluster compactness, i.e. the dispersions of
a cluster. Inter-cluster separation which is widely used in classification algorithms, however, is rarely
considered in a clustering process. In this paper, we present a new discriminative subspace kmeans-type
clustering algorithm (DSKmeans), which integrates the intra-cluster compactness and the inter-cluster
separation simultaneously. Different to traditional weighting kmeans-type algorithms, a 3-order tensor
is constructed to evaluate the importance of different features in order to integrate the aforementioned
two types of information. First, a new objective function for clustering is designed. To optimize the objec-
tive function, the corresponding updating rules for the algorithm are then derived analytically. The prop-
erties and performance of DSKmeans are investigated on several numerical and categorical data sets.
Experimental results corroborate that our proposed algorithm outperforms the state-of-the-art
kmeans-type clustering algorithms with respects to four metrics: Accuracy, RandIndex, Fscore and Nor-
mal Mutual Information(NMI).
Ó 2014 Elsevier B.V. All rights reserved.
1. Introduction
Clustering techniques have been used extensively in many
fields in nature [1], such as bioinformatics [2], text organizations
[3], and community detection [4], to name just a few. Clustering
is an unsupervised classification technique that aims at partition-
ing a data set into clusters such that the objects within a cluster
are similar and the objects in different clusters are dissimilar
according to certain pre-defined criteria [5].
The clustering algorithms [6] can be summarized as partition-
ing methods, hierarchical methods, density-based methods, grid-
based methods and model-based methods, etc. The kmeans-type
clustering algorithm is a widely used partitioning methods in
many real-life applications. Many researchers extended the
kmeans algorithms by different types of weighting ways. From
the weighting ways, existing kmeans-type algorithms can be clas-
sified into three categories: (1) No weighting kmeans-type algo-
rithms [7–10], which treat all features equally in the process of
minimizing the dispersions of clusters. Different features, however,
have different discriminative capabilities in real-world applica-
tions. Therefore, different types of feature selection and weighting
methods have been proposed in many clustering processes. (2)
Vector weighting kmeans-type algorithms, which have been
reported in [5,11–15]. (3) Matrix weighting kmeans-type algo-
rithms, the examples of which are proposed in [16–21,3,22,23].
Most of these weighting kmeans-type clustering algorithms only
consider that the objects in the same cluster are similar, i.e. mini-
mizing the dispersions of all the clusters, in a way that the features
are weighted by using different methods.
However, a feature in a cluster may have different discrimina-
tive capabilities when we compare this cluster with other clusters.
For example, there are three clusters (C1, C2 and C3) in Fig. 1 (the
distributions of features are listed in the table). W
12
and W
13
are
weighting vectors when we compare cluster 1 (C1) with cluster 2
(C2) and cluster 3 (C3), respectively. We can observe that the fea-
tures ‘‘Olympic, sport, chaos, riots’’ have more discriminative capa-
bilities when comparing C1 with C2. In contrast, comparing C1–C3,
the features ‘‘London, England, Beijing, China’’ have more discrim-
inative capabilities. The same features in C1‘‘London, England’’
have different discriminative capabilities when comparing to dif-
ferent clusters, i.e. ‘‘London, England’’ have less discriminative
capabilities in distinguishing C1 and C2, while they have more dis-
criminative capabilities in identifying C1 and C3.
Motivated by the example in Fig. 1, we propose a new kmeans-
type algorithm by integrating the intra-cluster compactness and
the inter-cluster separation with a 3-order tensor weighting
http://dx.doi.org/10.1016/j.knosys.2014.07.009
0950-7051/Ó 2014 Elsevier B.V. All rights reserved.
⇑
Corresponding author.
Knowledge-Based Systems 70 (2014) 293–300
Contents lists available at ScienceDirect
Knowledge-Based Systems
journal homepage: www.elsevier.com/locate/knosys