DisCo: Distributed Co-clustering with Map-Reduce
A Case Study Towards Petabyte-Scale End-to-End Mining
Spiros Papadimitriou Jimeng Sun
IBM T.J. Watson Research Center
Hawthorne, NY, USA
{spapadim,jimeng}@us.ibm.com
Abstract
Huge datasets are becoming prevalent; even as re-
searchers, we now routinely have to work with datasets that
are up to a few terabytes in size. Interesting real-world ap-
plications produce huge volumes of messy data. The mining
process involves several steps, starting from pre-processing
the raw data to estimating the final models.
As data become more abundant, scalable and easy-
to-use tools for distributed processing are also emerging.
Among those, Map-Reduce has been widely embraced by
both academia and industry. In database terms, Map-
Reduce is a simple yet powerful execution engine, which
can be complemented with other data storage and manage-
ment components, as necessary.
In this paper we describe our experiences and findings
in applying Map-Reduce, from raw data to final models,
to an important mining task. In particular, we focus on
co-clustering, which has been studied in many applications
such as text mining, collaborative filtering, bio-informatics,
and graph mining. We propose the Distributed Co-clustering
(DisCo) framework, which introduces practical approaches
for distributed data pre-processing and co-clustering. We
develop DisCo using Hadoop, an open source Map-Reduce
implementation. We show that DisCo can scale well and
efficiently process and analyze extremely large datasets (up
to several hundreds of gigabytes) on commodity hardware.
1 Introduction
It’s a cliché, but it’s true: huge volumes of data are col-
lected and need to be processed on a daily basis. For ex-
ample, Google now processes an estimated 20 petabytes of
data per day [13] and the Internet Archive (http://www.archive.org/) is growing at
20 terabytes a month, having reached 2 petabytes sometime
in 2006. Retail giants such as Walmart and online shop-
ping stores such as Amazon and eBay all deal with
petabytes of transactional data every day.
By definition, research on data mining focuses on scal-
able algorithms applicable to huge datasets. But let’s take
things from the beginning. Natural sources of data pro-
vide them in vast quantities, but in impure form. A repository
may consist of, e.g., a corpus of text documents, a large
web crawl, or system logs. Schemas do not arise sponta-
neously in nature. On the contrary, significant effort must
be invested to make the data fit a given schema. Most com-
monly, data are collected in a multitude of unstructured or
semi-structured formats. Aspects of the data that are rele-
vant to the task at hand need to be extracted and stored in an
appropriate representation. Most researchers start with the
assumption that the input is in the appropriate form. How-
ever, getting the data into the right form is not trivial (see
detailed discussion in Section 3).
Map-Reduce [12] is attracting a lot of attention, proving to
be both a source of inspiration [30] and a target of
polemic [14] for prominent researchers in databases. Re-
cently, some have questioned whether relational DBMSes
are appropriate for any and all data management tasks un-
der the sun [35, 34]. Moreover, [34] makes a strong case
that bundling data storage, indexing, query execution, trans-
action control, and logging components into a monolithic
system with a veneer of SQL is not always desirable. In the
spirit of this call for a component-based approach, Map-
Reduce is an execution engine, largely unconcerned with
data models and storage schemes. In the simplest case, data
reside on a distributed file system [19, 1, 26] but nothing
prevents pulling data from a large data store like BigTable
[7, 2, 38], or any other storage engine that (i) provides data
de-clustering and replication across many machines, and (ii)
allows computations to execute on local copies of the data.
Arguably, Map-Reduce is powerful both for the features it
provides and for the features it omits in order to keep the
programming abstraction clean and simple.
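To make this abstraction concrete, the sketch below shows the two user-supplied functions for the classic word-count task, written in Java against Hadoop’s org.apache.hadoop.mapreduce API; the task and class names are purely illustrative and are not part of DisCo itself. The map function emits intermediate (key, value) pairs, the framework groups them by key, and the reduce function aggregates each group; partitioning, shuffling, scheduling, and fault tolerance are handled entirely by the execution engine.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// map: (byte offset, line of text) --> (word, 1) for every token in the line
class TokenizeMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
  private static final IntWritable ONE = new IntWritable(1);
  private final Text word = new Text();

  @Override
  protected void map(LongWritable offset, Text line, Context ctx)
      throws IOException, InterruptedException {
    for (String tok : line.toString().split("\\s+")) {
      if (tok.isEmpty()) continue;
      word.set(tok);
      ctx.write(word, ONE);  // emit an intermediate (word, 1) pair
    }
  }
}

// reduce: (word, [1, 1, ...]) --> (word, total count); the framework has
// already grouped all intermediate values by key before reduce is invoked
class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
  @Override
  protected void reduce(Text word, Iterable<IntWritable> counts, Context ctx)
      throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable c : counts) sum += c.get();
    ctx.write(word, new IntWritable(sum));  // one output record per word
  }
}

A short driver program configures the job (input and output paths, key and value types) and submits it; the framework then schedules map tasks close to the data and routes the intermediate pairs to the reducers.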
Hadoop is an open source implementation of the core
components necessary for Map-Reduce. It focuses on pro-
viding the necessary minimum functionality, combining
simplicity of use with scalable performance.² However, if
additional functionality is needed by an application, other
open source components are available, which address, e.g.,
key-based data access [2] or more complex job and data
² While this article was being written, Hadoop won the TeraSort bench-
mark in the general-purpose category, completing the task in 209 seconds
using 900 eight-core nodes, beating the previous record of 297 seconds.