Navigating Massive Data Sets via Local Clustering
Michael E. Houle
IBM Research, Tokyo Research Laboratory
Shimotsuruma 1623-14, Yamato-shi
Kanagawa-ken 242-8502, Japan
meh@trl.ibm.com
ABSTRACT
This paper introduces a scalable method for feature extrac-
tion and navigation of large data sets by means of local clus-
tering, where clusters are modeled as overlapping neighbor-
hoods. Under the model, intra-cluster association and ex-
ternal differentiation are both assessed in terms of a natural
confidence measure. Minor clusters can be identified even
when they appear in the intersection of larger clusters. Scal-
ability of local clustering derives from recent generic tech-
niques for efficient approximate similarity search. The clus-
ter overlap structure gives rise to a hierarchy that can be
navigated and queried by users. Experimental results are
provided for two large text databases.
Categories and Subject Descriptors
H.2.8 [Database Management]: Database Applications—
Data mining; H.3.3 [Information Storage and Retrieval]:
Information Search and Retrieval—Clustering
Keywords
Soft clustering, nearest neighbor, association, confidence
1. INTRODUCTION
Much of the data available online is unstructured, with
the relationships among contents or attributes little under-
stood. For such data, the knowledge discovery process starts
with an attempt to understand the relationships as they ex-
ist within the data collection itself. This process, sometimes
referred to as feature extraction or data prospecting, assumes
no a priori knowledge of the data distribution. When suc-
cessful, feature extraction provides the raw patterns needed
for categorization and further correlation.
In this paper, we will primarily be concerned with the
problem of feature extraction from text-based data sets by
means of clustering, one of the most basic forms of pat-
tern identification. Traditional clustering techniques, how-
ever, are generally not well-suited for extracting small (but
important) clusters from document sets. Partition-based
techniques such as K-means and other squared-error heuris-
tics, expectation maximization heuristics [17], agglomera-
tive methods such as DBSCAN [6], and many hierarchical
hybrid metho ds all attempt to classify data points by as-
signing each to a single cluster (or in some cases, to reject it
as ‘noise’). Text data, however, can often be meaningfully
classified in more than one way. Any attempt to assign such
data to a single cluster would unjustifiably weaken any oth-
ers to which it also relates. Soft clustering methods do allow
membership in more than one cluster [2, 16]; however, they
tend to dissipate the contribution of individual data items
among several clusters through fractional assignment.
Partitional clustering techniques, as well as soft cluster-
ing methods, typically rely on the global minimization of
classification error in distributing data points among a fixed
number of disjoint clusters. Larger clusters have propor-
tionately greater influence on the final partition, and their
size allows them to resist the influences of other elements
and clusters. Valuable minor clusters, on the other hand,
tend to be broken up or combined with other clusters. For
more background on the many clustering techniques for data
mining contexts, see (for example) [10, 14].
This paper introduces a general model for clustering that
borrows from both information retrieval and association rule
discovery. The patch model assumes that data clusters can
be represented as the results of neighborhood queries based
on elements from the data set, according to some measure
of (dis)similarity appropriate to the domain. These patch
clusters can be represented very compactly by their query
elements plus their sizes; when needed, the cluster elements
themselves can be retrieved by means of a similarity query.
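As a minimal sketch of this representation (not the paper's implementation), a patch cluster can be stored as nothing more than a query element id and a neighborhood size, with its members recovered on demand by re-running the neighborhood query. The `brute_force_knn` routine below is a hypothetical stand-in for whatever exact or approximate similarity-search structure is actually available:

```python
from dataclasses import dataclass

import numpy as np

@dataclass(frozen=True)
class Patch:
    """Compact patch-cluster representation: the id of the query
    element plus the neighborhood size; members are not stored."""
    query_id: int
    size: int

def brute_force_knn(data: np.ndarray, q: int, k: int) -> list[int]:
    """Toy stand-in for a similarity-search index: return the ids
    of the k points nearest to point q under Euclidean distance
    (q itself included)."""
    dists = np.linalg.norm(data - data[q], axis=1)
    return list(np.argsort(dists)[:k])

def members(patch: Patch, knn_query) -> set[int]:
    """Materialize a patch on demand by re-running its query."""
    return set(knn_query(patch.query_id, patch.size))

# Example: a random 100-point data set; the size-10 patch about point 0.
data = np.random.default_rng(0).normal(size=(100, 5))
p = Patch(query_id=0, size=10)
print(members(p, lambda q, k: brute_force_knn(data, q, k)))
```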
Under the model, the relationship between two patch clusters
$C_i$ and $C_j$ is assessed according to a natural confidence
measure resembling that of association rule discovery [1]:
$$\mathrm{conf}(C_i, C_j) \;\stackrel{\Delta}{=}\; \frac{|C_i \cap C_j|}{|C_i|}.$$
That is, the confidence in the strength of the relevance of
concept $C_j$ to concept $C_i$ is expressed as the proportion of
elements forming $C_i$ that also contribute to the formation
of $C_j$. The model also assesses intra-cluster association and
external differentiation in terms of confidence values.
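Continuing the sketch above, once two patches have been materialized as sets the confidence measure is a direct evaluation of the definition; note that the denominator is $|C_i|$ alone, so the measure is asymmetric in general:

```python
def confidence(ci: set[int], cj: set[int]) -> float:
    """conf(C_i, C_j) = |C_i intersect C_j| / |C_i|: the fraction
    of the elements forming C_i that also contribute to C_j."""
    return len(ci & cj) / len(ci)

# Asymmetry example: a small patch overlapping a larger one.
ci, cj = {1, 2, 3, 4}, {3, 4, 5, 6, 7, 8, 9, 10}
print(confidence(ci, cj))  # 0.5  -- half of C_i lies in C_j
print(confidence(cj, ci))  # 0.25 -- only a quarter of C_j lies in C_i
```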
This paper also provides a generic local clustering strat-
egy based on the patch model, PatClust, that measures
the intra-cluster association within expanding sequences of
neighborhoods, or ‘patches’, about each data element. For
each element, as the size of its patch increases, a substan-