Prototype Hierarchy Based Clustering for the
Categorization and Navigation of Web Coll ections
Zhao-Yan Ming
1,2
, Kai Wang
2
and Tat-Seng Chua
2
1
NUS Graduate School for Integrative Sciences and Engineering
2
Department of Computer Science, School of Computing
National University of Singapore
{mingzy,kwang,chuats}@comp.nus.edu.sg
ABSTRACT
This paper presents a novel prototype hierarchy based clus-
tering (PHC) framework for the organization of web collec-
tions. It solves simultaneously the problem of categorizing
web collections and interpreting the clustering results for
navigation. By utilizing prototype hierarchies and the un-
derlying topic structures of the collections, PHC is modeled
as a multi-criterion optimization problem based on mini-
mizing the hierarchy evolution, maximizing category cohe-
siveness and inter-hierarchy structural and semantic resem-
blance. The flexible design of metrics enables PHC to be
a general framework for applications in various domains.
In the experiments on categorizing 4 collections of distinct
domains, PHC achieves 30% improvement in μF
1
over the
state-of-the-art techniques. Further experiments provide in-
sights on performance variations with abstract and concrete
domains, completeness of the prototype hierarchy, and ef-
fects of different combinations of optimization criteria.
Categories and Subject Descriptors
H.3.3 [ Information Storage and Retrieval]: Informa-
tion Search and Retrieval—clustering
General Terms
Algorithms, Performance, Experimentation.
Keywords
Hierarchical Clustering, Prototype Hierarchy, Hierarchy In-
duction, Criterion Function
1. INTRODUCTION
With the flourishing of user contributed services like Ya-
hoo! Answers, discovering the utility of user-generated-contents
becomes a research topic of interest to many researchers.
The utility of user-generated-contents comes in two major
aspects, the quality and accessibility. Efforts have been put
Permission to make digital or hard copies of all or part of this work for
personal or classroom use is granted without fee provided that copies are
not made or distributed for profit or commercial advantage and that copies
bear this notice and the full citation on the first page. To copy otherwise, to
republish, to post on servers or to redistribute to lists, requires prior specific
permission and/or a fee.
SIGIR’10, July 19–23, 2010, Geneva, Switzerland.
Copyright 2010 ACM 978-1-60558-896-4/10/07 ...$10.00.
to distinguish the good and bad quality content [1]. To make
contents more accessible, state-of-the-art retrieval models
like translation based language model [18] and syntactic tree
matching [15] have achieved promising performance. Orga-
nizing the huge collections of data for information navigation
is another important direction in exploring web collections.
Categorization, especially hierarchical clustering with labels
and descriptions of clusters, enables browsing style of infor-
mation access. Users can navigate through the hierarchy
driven by their information needs [11, 17].
Currently, web services rely on users to construct topic
hierarchies and assign objects into their nodes. Open Di-
rectory Project (ODP) and Wikipedia are both examples
of hierarchically organized web collections formed by com-
munity of editors. Yahoo! Answers (YA) is organized in a
hierarchical tree containing 728 nodes with 26 top-level cate-
gories, relying on users to select a category for their postings.
Besides the reliance on manual assignment, a hierarchy as
large as YA’s directory is too coarse to contain a category
like IPod (it is in Music & Music players) whose subtopics
might be of interest to many users. These suggest the ne-
cessity of automatic fine-grained hierarchical categorization.
Toward automatic categorization of web collections into
hierarchies, supervised techniques that require manually-
labeled corpora are not appropriate for dynamic Web infor-
mation services [9]. Existing unsupervised techniques gen-
erally focus either on clustering the collections into smaller
groups [5, 17], or extracting labels for clustered groups [4].
SnakeT [6] is a successful hierarchical clustering engine that
performs sequential clustering and labeling on snippets re-
turned by search engines. However, the resulting clusters
and labels may not be consistent and systematic because of
its data-driven nature. LiveClassifier [9] addresses the cat-
egorization and navigation in one go by utilizing predefined
topic hierarchies and searching the training instances to feed
into a supervised learner. This approach, however, ignores
the underlying topic structure of the target collection; and
the result is confined to the predefined hierarchy which may
not be a perfect match to the collection.
In this paper, we propose an unsupervised approach called
Prototype Hierarchy based Clustering (PHC) to tackle
the problem of web collection categorization and navigation.
PHC utilizes the world knowledge in the form of prototype
hierarchies, while adapts to the underlying topic structures
of the collections. By following the structure of the proto-
type hierarchy, PHC eliminates the problem of determining
the number of clusters and assigning initial clusters.
Moreover, the PHC results are interpretable, comprehen-