Iam-On et al. [23] refined the CA matrix by considering the
shared neighbors between clusters to improve the consensus
results. Wang [24] introduced a dendrogram-like hierarchical
data structure termed CA-tree to facilitate the CA-based
ensemble clustering process. Lourenço et al. [29] proposed
a new ensemble clustering approach which is based on the
EAC paradigm and is able to determine the probabilistic
assignments of data objects to clusters. Liu et al. [31]
employed spectral clustering on the CA matrix and developed
an efficient ensemble clustering approach termed spectral
ensemble clustering (SEC).
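For concreteness, the CA matrix that underlies the EAC paradigm can be constructed as in the following minimal NumPy sketch (the function name is ours); consensus methods such as hierarchical or spectral clustering [31] are then applied to this matrix:

```python
import numpy as np

def co_association_matrix(base_labels):
    """Co-association (CA) matrix of a clustering ensemble.

    base_labels is a list of length-n integer label vectors, one per
    base clustering.  Entry (i, j) of the CA matrix is the fraction of
    base clusterings in which objects i and j fall into the same cluster.
    """
    n = len(base_labels[0])
    ca = np.zeros((n, n))
    for labels in base_labels:
        labels = np.asarray(labels)
        # same[i, j] = 1 when objects i and j share a label in this clustering
        ca += (labels[:, None] == labels[None, :]).astype(float)
    return ca / len(base_labels)
```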
The graph partitioning-based approaches [17], [19], [28], [30] address the ensemble clustering problem by constructing a graph model to reflect the ensemble information. The consensus clustering is then obtained by partitioning the graph into a certain number of segments. Strehl and Ghosh [17] proposed three graph partitioning-based ensemble clustering algorithms, i.e., CSPA, HGPA, and MCLA. Fern and Brodley [19] constructed a bipartite graph for the clustering ensemble by treating both clusters and objects as nodes, and obtained the consensus clustering by partitioning the bipartite graph. Yu et al. [28] designed a double affinity propagation-based ensemble clustering framework, which is able to handle noisy attributes and obtains the final clustering by the normalized cut algorithm.
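The bipartite formulation of [19] can be illustrated by the object-cluster incidence matrix below, a minimal sketch assuming that cluster labels in each base clustering are coded 0, ..., k-1 (the function name is ours):

```python
import numpy as np

def bipartite_incidence(base_labels):
    """Object-cluster incidence matrix of the bipartite ensemble graph.

    Every cluster of every base clustering is a node on one side of the
    graph and every object a node on the other side; entry (i, c) is 1
    iff object i belongs to cluster c.  Partitioning this bipartite
    graph (e.g., spectrally) yields a consensus clustering.
    """
    n = len(base_labels[0])
    blocks = []
    for labels in base_labels:
        labels = np.asarray(labels)
        block = np.zeros((n, labels.max() + 1))
        block[np.arange(n), labels] = 1.0
        blocks.append(block)
    return np.hstack(blocks)  # shape: n x (total number of clusters)
```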
The median partition-based approaches [18], [21], [25], [32] formulate the ensemble clustering problem as an optimization problem, which aims to find a median partition (or clustering) that maximizes the similarity between this clustering and the multiple base clusterings. The median partition problem is NP-hard [21], and finding the globally optimal solution in the huge space of all possible clusterings is computationally infeasible for large datasets. Cristofor and Simovici [18] proposed to obtain an approximate solution using a genetic algorithm, where clusterings are treated as chromosomes. Topchy et al. [21] cast the median partition problem as a maximum likelihood problem and solved it approximately with the EM algorithm. Franek and Jiang [25] cast the median partition problem as a Euclidean median problem by embedding clusterings in vector spaces. Huang et al. [32] formulated the median partition problem as a binary linear programming problem and obtained an approximate solution by means of factor graph theory.
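The objective that the median partition maximizes can be stated compactly in code. The sketch below uses NMI as the similarity measure, which is one common choice (the cited works differ in both the similarity and the search strategy); it only evaluates the objective, since exhaustive search over all partitions is infeasible:

```python
from sklearn.metrics import normalized_mutual_info_score

def median_partition_objective(candidate, base_labels):
    """Sum of similarities (here: NMI) between a candidate consensus
    clustering and all base clusterings.  The median partition is the
    clustering maximizing this value; since the search space grows
    exponentially, the cited works resort to genetic algorithms, EM,
    or linear programming relaxations rather than exhaustive search."""
    return sum(normalized_mutual_info_score(candidate, labels)
               for labels in base_labels)
```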
These algorithms attempt to solve the ensemble clustering problem in various ways [17]–[29], [31], [32]. However, a common limitation of most existing methods is that they treat all clusters and all base clusterings in the ensemble equally, and may therefore suffer from low-quality clusters or low-quality base clusterings. To partially address this limitation, some weighted ensemble clustering approaches have recently been presented [30], [37], [38]. Li and Ding [37] cast the ensemble clustering problem as a non-negative matrix factorization problem and proposed a weighted consensus clustering approach, where each base clustering is assigned a weight in order to improve the consensus result. Yu et al. [38] exploited feature selection techniques to weight and select the base clusterings. In fact, clustering selection [38] can be viewed as a 0–1 weighting scheme, where 1 indicates selecting a clustering and 0 indicates removing it. Huang et al. [30] proposed to evaluate and weight the base clusterings based on the concept of normalized crowd agreement index (NCAI), and devised two weighted consensus functions to obtain the final clustering result.
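As a rough illustration of such clustering-level weighting, the sketch below weights each base clustering by its average NMI with the other members of the ensemble; this is a simplified stand-in for the NCAI of [30], whose exact normalization may differ:

```python
import numpy as np
from sklearn.metrics import normalized_mutual_info_score

def crowd_agreement_weights(base_labels):
    """Weight each base clustering by its average agreement (NMI) with
    the other base clusterings, normalized to sum to one.  A simplified
    stand-in for the NCAI weighting of [30]."""
    m = len(base_labels)
    w = np.zeros(m)
    for i in range(m):
        w[i] = np.mean([normalized_mutual_info_score(base_labels[i],
                                                     base_labels[j])
                        for j in range(m) if j != i])
    return w / w.sum()
```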
Although the above-mentioned weighted ensemble clustering approaches [30], [37], [38] are able to estimate the reliability of base clusterings and weight them accordingly, they generally treat a base clustering as a whole and neglect the local diversity of clusters inside the same base clustering. To explore the reliability of clusters, Alizadeh et al. [45] proposed to evaluate clusters in the ensemble by averaging the normalized mutual information (NMI) [17] between clusterings, which incurs a very high computational cost and is not feasible for large datasets. Zhong et al. [39] exploited the Euclidean distances between objects to estimate cluster reliability, which requires access to the original data features and is only applicable to numerical data. However, in the more general formulation of ensemble clustering [17]–[25], [32], the original data features are not available in the consensus process. Moreover, by measuring the within-cluster similarity based on Euclidean distances, the efficacy of the method in [39] relies heavily on implicit assumptions about the data distribution, which introduces an unstable factor into the consensus process. Different from [39], in this paper, our approach requires no access to the original data features. We propose to estimate the uncertainty of clusters by considering the cluster labels in the entire ensemble based on an entropic criterion, and then present the concept of ECI to evaluate cluster reliability without making any assumptions about the data distribution. Further, to obtain the consensus clustering results, two novel consensus functions are developed based on cluster uncertainty estimation and a local weighting strategy. Extensive experiments on a variety of real-world datasets show that our approach exhibits significant advantages in clustering accuracy and efficiency over the state-of-the-art approaches.
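To convey the flavor of such an entropic criterion (the precise definition of ECI is given later in this paper; the sketch and its names are ours), the uncertainty of a single cluster can be measured by how unpredictably its members are scattered over the clusters of the base clusterings:

```python
import numpy as np

def cluster_uncertainty(member_indices, base_labels):
    """Entropy-style uncertainty of one cluster w.r.t. the ensemble.

    For each base clustering, compute the entropy of the distribution
    of the cluster's members over that clustering's clusters, and
    accumulate.  A cluster that every base clustering keeps together
    scores 0; a cluster whose members are scattered scores high.
    """
    h = 0.0
    for labels in base_labels:
        member_labels = np.asarray(labels)[member_indices]
        _, counts = np.unique(member_labels, return_counts=True)
        p = counts / counts.sum()
        h -= np.sum(p * np.log2(p))
    return h
```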
III. PRELIMINARIES
A. Entropy
In this section, we briefly review the concept of entropy. In
information theory [46], the entropy is a measure of the uncer-
tainty associated with a random variable. The formal definition
of entropy is provided in Definition 1.
Definition 1: For a discrete random variable X, the entropy
H(X) is defined as
H(X) = -\sum_{x \in \mathcal{X}} p(x) \log_2 p(x)  (1)

where \mathcal{X} is the set of values that X can take, and p(x) is the probability mass function of X.
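As a quick numerical illustration of (1) (a minimal sketch; the function name is ours):

```python
import numpy as np

def entropy(p):
    """Shannon entropy, in bits, of a discrete distribution given as an
    array of probabilities; zero-probability outcomes contribute nothing."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

print(entropy([0.5, 0.5]))  # fair coin: 1.0 bit of uncertainty
print(entropy([1.0]))       # certain outcome: 0.0 bits
```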
The joint entropy is a measure of the uncertainty associated
with a set of random variables. The formal definition of joint
entropy is provided in Definition 2.