THEORY AND APPLICATIONS OF SOFT COMPUTING METHODS
OPE-HCA: an optimal probabilistic estimation approach
for hierarchical clustering algorithm
Jiancong Fan
1,2,3
Received: 26 February 2015 / Accepted: 21 July 2015
The Natural Computing Applications Forum 2015
Abstract The Survival of the Fittest is a principle which
selects the superior and eliminates the inferior in the nat-
ure. This principle has been used in many fields, especially
in optimization problem-s olving. Clustering in data mining
community endeavors to discover unknown representations
or patterns hidden in datasets. Hierarchical clustering
algorithm (HCA) is a method of cluster analysis which
searches the optimal distribution of clusters by a hierar-
chical structure. Strategies for hierarchical clustering gen-
erally have two types: agglomerative with a bottom-up
procedure and divisive with a top-down procedure. How-
ever, most of the clustering approaches have two disad-
vantages: the use of distance-based measurement and the
difficulty of the clusters integration. In this paper, we
propose an optimal probabilistic estimation (OPE)
approach by exploiting the Survival of the Fittest principle.
We devise a hierarchical clustering algorithm (HCA) based
on OPE, also called OPE-HCA. The OPE-HCA combines
optimization with probability and aggl omerative HCA.
Experimental results show that the OPE-HCA has the
ability of searching and discovering patterns at different
description levels and can also obtain better performance
than many clustering algorithms according to NMI and
clustering accuracy measures.
Keywords Clustering Hierarchical clustering
algorithm Data mining Probabilistic estimation
1 Introduction
The phrase ‘‘Survival of the Fittest’’ is originated from
evolutionary theory as a way of describing the natural
selection mechanism. Generally, it refers that the proba-
bility of survivors is high if the survivors are fit for the
natural environment. So it is more commonly used today to
refer to a supposed greater probability that ‘‘fit’’ as opposed
to ‘‘unfit’’ individuals will survive some context.
Clustering is a general task to be solved in data analysis
and mining. The clusters obtained by various clustering
algorithms differ significa ntly in their tasks and objectives.
So far, however , it is still a hard work to predict what
constitutes a cluster hidden in dataset and how to efficiently
discover them with high accuracy. One of the main reasons
is that clustering is an unsupervised analysis process. It is
unknown how many clusters there exist and what the
names of clusters are. But this problem occurs in most
practical applications, such as web data mining and big
data analysis, because it is difficult to foresee exactly the
hidden patterns in black box or the event that has not
occurred. There have emerged rich clustering strategies and
algorithms attempting to solve the blindness in clustering
process. However, most of them are specialized algorithms
that one algorithm is only suitable to solve one particular
dataset.
Among the numerous clustering methods, hierarchical
clustering [1, 2], also called connectivity-based clustering,
& Jiancong Fan
fanjiancong@sdust.edu.cn
1
State Key Laboratory of Mining Disaster Prevention and
Control Co-founded by Shandong Province and the Ministry
of Science and Technology, Shandong University of Science
and Technology, Qingdao 266590, China
2
College of Information Science and Engineering, Shandong
University of Science and Technology, Qingdao 266590,
China
3
State Key Laboratory for Novel Software Technology,
Nanjing University, Nanjing 210023, China
123
Neural Comput & Applic
DOI 10.1007/s00521-015-1998-5