Prototype Hierarchy Based Clustering: A New Framework for Organizing and Navigating Web Collections

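This resource is a SIGIR paper on building hierarchies. It introduces Prototype Hierarchy Based Clustering (PHC), a method for categorizing and navigating web collections. The authors are Zhao-Yan Ming, Kai Wang and Tat-Seng Chua of the Department of Computer Science, National University of Singapore. The paper proposes a framework that simultaneously categorizes a web collection and interprets the clustering results to support navigation.

As the volume of online information grows, organizing and navigating web collections effectively becomes essential. The paper, "Prototype Hierarchy Based Clustering for the Categorization and Navigation of Web Collections", proposes PHC to address this problem. PHC uses a prototype hierarchy to categorize a web collection systematically, and uses the same hierarchy to interpret the clustering results so that users can navigate them.

At its core, PHC treats the organization of a web collection as a multi-criterion optimization problem: it minimizes hierarchy evolution while maximizing category cohesiveness and the structural and semantic resemblance between hierarchies. Because the metrics are designed to be flexible, the framework can be adapted to applications in different domains. In the experiments on categorizing four collections from distinct domains, PHC improves μF1 by 30% over state-of-the-art techniques.

The paper also emphasizes the practicality of PHC: it generates meaningful categories from the underlying topic structure of the collection, which helps users understand the clustering results and navigate more effectively. The approach is potentially applicable to search engine optimization, information retrieval systems, web directory construction, and other settings that require large-scale categorization and navigation.

In short, Prototype Hierarchy Based Clustering is a study aimed at improving the categorization and navigation of web collections. By exploiting prototype hierarchies, it improves categorization accuracy and makes clustering results easier to interpret and navigate.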

Prototype Hierarchy Based Clustering for the Categorization and Navigation of Web Collections

Zhao-Yan Ming¹,², Kai Wang and Tat-Seng Chua

NUS Graduate School for Integrative Sciences and Engineering

Department of Computer Science, School of Computing

National University of Singapore

{mingzy,kwang,chuats}@comp.nus.edu.sg

ABSTRACT

This paper presents a novel prototype hierarchy based clustering (PHC) framework for the organization of web collections. It solves simultaneously the problem of categorizing web collections and interpreting the clustering results for navigation. By utilizing prototype hierarchies and the underlying topic structures of the collections, PHC is modeled as a multi-criterion optimization problem based on minimizing the hierarchy evolution, maximizing category cohesiveness and inter-hierarchy structural and semantic resemblance. The flexible design of metrics enables PHC to be a general framework for applications in various domains. In the experiments on categorizing 4 collections of distinct domains, PHC achieves 30% improvement in μF1 over the state-of-the-art techniques. Further experiments provide insights on performance variations with abstract and concrete domains, completeness of the prototype hierarchy, and effects of different combinations of optimization criteria.
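The multi-criterion formulation described above can be illustrated with a small sketch. It is not the paper's formulation: the metric definitions (`cohesiveness` as the mean cosine similarity of documents to their category centroid, `evolution` as the fraction of parent-child edges that differ between the prototype and the resulting hierarchy), the weight `alpha`, and all function names are assumptions chosen for illustration. The inter-hierarchy semantic resemblance criterion is omitted.

```python
import math


def cosine(u, v):
    """Cosine similarity between two sparse term-weight dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0


def centroid(docs):
    """Average term-weight vector of a non-empty list of sparse documents."""
    total = {}
    for d in docs:
        for t, w in d.items():
            total[t] = total.get(t, 0.0) + w
    return {t: w / len(docs) for t, w in total.items()}


def cohesiveness(clusters):
    """Assumed metric: mean similarity of each document to its category centroid."""
    sims = []
    for docs in clusters.values():
        if not docs:  # skip empty categories
            continue
        c = centroid(docs)
        sims.extend(cosine(d, c) for d in docs)
    return sum(sims) / len(sims) if sims else 0.0


def evolution(prototype_edges, result_edges):
    """Assumed metric: fraction of parent-child edges that differ between hierarchies."""
    union = prototype_edges | result_edges
    return len(prototype_edges ^ result_edges) / len(union) if union else 0.0


def combined_score(clusters, prototype_edges, result_edges, alpha=0.5):
    """Hypothetical weighted sum: reward cohesiveness, penalize hierarchy evolution."""
    return (alpha * cohesiveness(clusters)
            - (1 - alpha) * evolution(prototype_edges, result_edges))


# Toy example: one category with two identical documents, one added edge.
clusters = {"ipod": [{"ipod": 1.0, "battery": 1.0}, {"ipod": 1.0, "battery": 1.0}]}
prototype = {("music players", "ipod")}
result = {("music players", "ipod"), ("ipod", "battery")}
score = combined_score(clusters, prototype, result)
```

A weighted sum is only one way to combine several criteria. It is used here because it keeps the trade-off between staying close to the prototype hierarchy and fitting the data visible in a single parameter.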

Categories and Subject Descriptors

H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval - clustering

General Terms

Algorithms, Performance, Experimentation.

Keywords

Hierarchical Clustering, Prototype Hierarchy, Hierarchy Induction, Criterion Function

1. INTRODUCTION

With the flourishing of user-contributed services like Yahoo! Answers, discovering the utility of user-generated content has become a topic of interest to many researchers. The utility of user-generated content has two major aspects: quality and accessibility. Efforts have been made to distinguish good-quality content from bad [1]. To make content more accessible, state-of-the-art retrieval models such as the translation-based language model [18] and syntactic tree matching [15] have achieved promising performance. Organizing huge collections of data for information navigation is another important direction in exploring web collections. Categorization, especially hierarchical clustering with labels and descriptions of clusters, enables a browsing style of information access: users can navigate through the hierarchy driven by their information needs [11, 17].

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. SIGIR’10, July 19–23, 2010, Geneva, Switzerland.

Currently, web services rely on users to construct topic hierarchies and assign objects to their nodes. The Open Directory Project (ODP) and Wikipedia are both examples of hierarchically organized web collections formed by communities of editors. Yahoo! Answers (YA) is organized as a hierarchical tree containing 728 nodes with 26 top-level categories, and it relies on users to select a category for their postings. Besides the reliance on manual assignment, even a hierarchy as large as YA's directory is too coarse to contain a category like iPod (it falls under Music & Music players), whose subtopics might be of interest to many users. These limitations suggest the necessity of automatic fine-grained hierarchical categorization.

Toward automatic categorization of web collections into hierarchies, supervised techniques that require manually labeled corpora are not appropriate for dynamic Web information services [9]. Existing unsupervised techniques generally focus either on clustering the collections into smaller groups [5, 17] or on extracting labels for clustered groups [4]. SnakeT [6] is a successful hierarchical clustering engine that performs sequential clustering and labeling on snippets returned by search engines. However, the resulting clusters and labels may not be consistent and systematic because of its data-driven nature. LiveClassifier [9] addresses categorization and navigation in one go by utilizing predefined topic hierarchies and searching for training instances to feed into a supervised learner. This approach, however, ignores the underlying topic structure of the target collection, and the result is confined to the predefined hierarchy, which may not be a perfect match for the collection.

In this paper, we propose an unsupervised approach called Prototype Hierarchy based Clustering (PHC) to tackle the problem of web collection categorization and navigation. PHC utilizes world knowledge in the form of prototype hierarchies while adapting to the underlying topic structures of the collections. By following the structure of the prototype hierarchy, PHC eliminates the problem of determining the number of clusters and assigning initial clusters.
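To make concrete why a prototype hierarchy removes the need to choose a cluster count or initial clusters, here is a minimal sketch. It is not the PHC algorithm itself: the hierarchy (`PARENT`), the seed terms (`SEEDS`), the bag-of-words representation, and the function names are all hypothetical. Each leaf node of the prototype hierarchy acts as an initial cluster, and every document goes to the most similar node.

```python
import math
from collections import Counter

# Hypothetical prototype hierarchy: child -> parent (None marks the root).
PARENT = {"ipod": "music players", "headphones": "music players", "music players": None}

# Hypothetical seed terms describing each leaf node of the prototype hierarchy.
SEEDS = {
    "ipod": "ipod apple battery itunes sync",
    "headphones": "headphones earbuds noise bass",
}


def vectorize(text):
    """Bag-of-words term counts for lowercased, whitespace-tokenized text."""
    return Counter(text.lower().split())


def cosine(u, v):
    """Cosine similarity between two Counters (missing terms count as 0)."""
    dot = sum(c * v[t] for t, c in u.items())
    nu = math.sqrt(sum(c * c for c in u.values()))
    nv = math.sqrt(sum(c * c for c in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0


def assign(docs, seeds):
    """Assign each document to the most similar prototype node.

    The number of clusters and their starting points come from the hierarchy,
    not from a tuned parameter. Ties (including all-zero similarity) go to the
    first node in `seeds`.
    """
    seed_vecs = {node: vectorize(terms) for node, terms in seeds.items()}
    clusters = {node: [] for node in seeds}
    for doc in docs:
        vec = vectorize(doc)
        best = max(seed_vecs, key=lambda n: cosine(vec, seed_vecs[n]))
        clusters[best].append(doc)
    return clusters


def path(node, parent):
    """Root-to-node path, usable as a navigation label for a cluster."""
    out = []
    while node is not None:
        out.append(node)
        node = parent.get(node)
    return out[::-1]


docs = [
    "my ipod battery drains fast",
    "which earbuds have good bass",
    "how to sync itunes with ipod",
]
clusters = assign(docs, SEEDS)
```

In this toy run the two iPod questions land under `ipod` and the earbuds question under `headphones`, and `path("ipod", PARENT)` gives the label path a user would browse. A full method would also let the hierarchy adapt to the data, which this sketch does not attempt.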

Moreover, the PHC results are interpretable, comprehen-
