Navigating Massive Data Sets via Local Clustering
Michael E. Houle
IBM Research, Tokyo Research Laboratory
Shimotsuruma 1623-14, Yamato-shi
Kanagawa-ken 242-8502, Japan
meh@trl.ibm.com
ABSTRACT
This paper introduces a scalable method for feature extrac-
tion and navigation of large data sets by means of local clus-
tering, where clusters are modeled as overlapping neighbor-
hoods. Under the model, intra-cluster association and ex-
ternal differentiation are both assessed in terms of a natural
confidence measure. Minor clusters can be identified even
when they appear in the intersection of larger clusters. Scal-
ability of local clustering derives from recent generic tech-
niques for efficient approximate similarity search. The clus-
ter overlap structure gives rise to a hierarchy that can be
navigated and queried by users. Experimental results are
provided for two large text databases.
Categories and Subject Descriptors
H.2.8 [Database Management]: Database Applications—
Data mining; H.3.3 [Information Storage and Retrieval]:
Information Search and Retrieval—Clustering
Keywords
Soft clustering, nearest neighbor, association, confidence
1. INTRODUCTION
Much of the data available online is unstructured, with
the relationships among contents or attributes little under-
stood. For such data, the knowledge discovery process starts
with an attempt to understand the relationships as they ex-
ist within the data collection itself. This process, sometimes
referred to as feature extraction or data prospecting, assumes
no a priori knowledge of the data distribution. When suc-
cessful, feature extraction provides the raw patterns needed
for categorization and further correlation.
In this paper, we will primarily be concerned with the
problem of feature extraction from text-based data sets by
means of clustering, one of the most basic forms of pat-
tern identification. Traditional clustering techniques, how-
ever, are generally not well-suited for extracting small (but
important) clusters from document sets. Partition-based
techniques such as K-means and other squared-error heuris-
tics, expectation maximization heuristics [17], agglomera-
tive methods such as DBSCAN [6], and many hierarchical
hybrid metho ds all attempt to classify data points by as-
signing each to a single cluster (or in some cases, to reject it
as ‘noise’). Text data, however, can often be meaningfully
classified in more than one way. Any attempt to assign such
data to a single cluster would unjustifiably weaken any oth-
ers to which it also relates. Soft clustering methods do allow
membership in more than one cluster [2, 16]; however, they
tend to dissipate the contribution of individual data items
among several clusters through fractional assignment.
Partitional clustering techniques, as well as soft cluster-
ing methods, typically rely on the global minimization of
classification error in distributing data points among a fixed
number of disjoint clusters. Larger clusters have propor-
tionately greater influence on the final partition, and their
size allows them to resist the influences of other elements
and clusters. Valuable minor clusters, on the other hand,
tend to be broken up or combined with other clusters. For
more background on the many clustering techniques for data
mining contexts, see (for example) [10, 14].
This paper introduces a general model for clustering that
borrows from both information retrieval and association rule
discovery. The patch model assumes that data clusters can
be represented as the results of neighborhood queries based
on elements from the data set, according to some measure
of (dis)similarity appropriate to the domain. These patch
clusters can be represented very compactly by their query
elements plus their sizes; when needed, the cluster elements
themselves can be retrieved by means of a similarity query.
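As a minimal sketch of this representation (not the paper's implementation), a patch cluster can be stored as nothing more than a query element id and a neighborhood size, with its members recovered on demand by re-running the neighborhood query. The `brute_force_knn` routine below is a hypothetical stand-in for whatever exact or approximate similarity-search structure is actually available:

```python
from dataclasses import dataclass

import numpy as np

@dataclass(frozen=True)
class Patch:
    """Compact patch-cluster representation: the id of the query
    element plus the neighborhood size; members are not stored."""
    query_id: int
    size: int

def brute_force_knn(data: np.ndarray, q: int, k: int) -> list[int]:
    """Toy stand-in for a similarity-search index: return the ids
    of the k points nearest to point q under Euclidean distance
    (q itself included)."""
    dists = np.linalg.norm(data - data[q], axis=1)
    return list(np.argsort(dists)[:k])

def members(patch: Patch, knn_query) -> set[int]:
    """Materialize a patch on demand by re-running its query."""
    return set(knn_query(patch.query_id, patch.size))

# Example: a random 100-point data set; the size-10 patch about point 0.
data = np.random.default_rng(0).normal(size=(100, 5))
p = Patch(query_id=0, size=10)
print(members(p, lambda q, k: brute_force_knn(data, q, k)))
```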
Under the model, the relationship between two patch clusters
$C_i$ and $C_j$ is assessed according to a natural confidence
measure resembling that of association rule discovery [1]:
$$\mathrm{conf}(C_i, C_j) \;\stackrel{\Delta}{=}\; \frac{|C_i \cap C_j|}{|C_i|}.$$
That is, the confidence in the strength of the relevance of
concept $C_j$ to concept $C_i$ is expressed as the proportion of
elements forming $C_i$ that also contribute to the formation
of $C_j$. The model also assesses intra-cluster association and
external differentiation in terms of confidence values.
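Continuing the sketch above, once two patches have been materialized as sets the confidence measure is a direct evaluation of the definition; note that the denominator is $|C_i|$ alone, so the measure is asymmetric in general:

```python
def confidence(ci: set[int], cj: set[int]) -> float:
    """conf(C_i, C_j) = |C_i intersect C_j| / |C_i|: the fraction
    of the elements forming C_i that also contribute to C_j."""
    return len(ci & cj) / len(ci)

# Asymmetry example: a small patch overlapping a larger one.
ci, cj = {1, 2, 3, 4}, {3, 4, 5, 6, 7, 8, 9, 10}
print(confidence(ci, cj))  # 0.5  -- half of C_i lies in C_j
print(confidence(cj, ci))  # 0.25 -- only a quarter of C_j lies in C_i
```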
This paper also provides a generic local clustering strat-
egy based on the patch model, PatClust, that measures
the intra-cluster association within expanding sequences of
neighborhoods, or ‘patches’, about each data element. For
each element, as the size of its patch increases, a substan-