DisCo: Distributed Co-clustering with Map-Reduce
A Case Study Towards Petabyte-Scale End-to-End Mining
Spiros Papadimitriou Jimeng Sun
IBM T.J. Watson Research Center
Hawthorne, NY, USA
{spapadim,jimeng}@us.ibm.com
Abstract
Huge datasets are becoming prevalent; even as re-
searchers, we now routinely have to work with datasets that
are up to a few terabytes in size. Interesting real-world ap-
plications produce huge volumes of messy data. The mining
process involves several steps, starting from pre-processing
the raw data to estimating the final models.
As data become more abundant, scalable and easy-
to-use tools for distributed processing are also emerging.
Among those, Map-Reduce has been widely embraced by
both academia and industry. In database terms, Map-
Reduce is a simple yet powerful execution engine, which
can be complemented with other data storage and manage-
ment components, as necessary.
In this paper we describe our experiences and findings
in applying Map-Reduce, from raw data to final models,
to an important mining task. In particular, we focus on
co-clustering, which has been studied in many applications
such as text mining, collaborative filtering, bio-informatics,
and graph mining. We propose the Distributed Co-clustering
(DisCo) framework, which introduces practical approaches
for distributed data pre-processing and co-clustering. We
develop DisCo using Hadoop, an open source Map-Reduce
implementation. We show that DisCo can scale well and
efficiently process and analyze extremely large datasets (up
to several hundreds of gigabytes) on commodity hardware.
1 Introduction
It’s a cliché, but it’s true: huge volumes of data are col-
lected and need to be processed on a daily basis. For ex-
ample, Google now processes an estimated 20 petabytes of
data per day [13] and the Internet Archive (http://www.archive.org/) is growing at
20 terabytes a month, having reached 2 petabytes sometime
in 2006. Retail giants such as Walmart and online shop-
ping stores such as Amazon and eBay all deal with
petabytes of transactional data every day.
By definition, research on data mining focuses on scal-
able algorithms applicable to huge datasets. But let’s take
things from the beginning. Natural sources of data pro-
vide them in vast quantities, but in impure form. A repository
may consist of, e.g., a corpus of text documents, a large
web crawl, or system logs. Schemas do not arise sponta-
neously in nature. On the contrary, significant effort must
be invested to make the data fit a given schema. Most com-
monly, data are collected in a multitude of unstructured or
semi-structured formats. Aspects of the data that are rele-
vant to the task at hand need to be extracted and stored in an
appropriate representation. Most researchers start with the
assumption that the input is in the appropriate form. How-
ever, getting the data into the right form is not trivial (see
detailed discussion in Section 3).
Map-Reduce [12] is attracting a lot of attention, proving to
be both a source of inspiration [30] and a target of
polemic [14] for prominent researchers in databases. Re-
cently, some have questioned whether relational DBMSes
are appropriate for any and all data management tasks un-
der the sun [35, 34]. Moreover, [34] makes a strong case
that bundling data storage, indexing, query execution, trans-
action control, and logging components into a monolithic
system with a veneer of SQL is not always desirable. In the
spirit of this call for a component-based approach, Map-
Reduce is an execution engine, largely unconcerned with
data models and storage schemes. In the simplest case, data
reside on a distributed file system [19, 1, 26] but nothing
prevents pulling data from a large data store like BigTable
[7, 2, 38], or any other storage engine that (i) provides data
de-clustering and replication across many machines, and (ii)
allows computations to execute on local copies of the data.
Arguably, Map-Reduce is powerful both for the features it
provides and for the features it omits in order to keep the
programming abstraction clean and simple.
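To make this abstraction concrete, the sketch below shows the two user-supplied functions for the classic word-count task, written in Java against Hadoop’s org.apache.hadoop.mapreduce API; the task and class names are purely illustrative and are not part of DisCo itself. The map function emits intermediate (key, value) pairs, the framework groups them by key, and the reduce function aggregates each group; partitioning, shuffling, scheduling, and fault tolerance are handled entirely by the execution engine.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// map: (byte offset, line of text) --> (word, 1) for every token in the line
class TokenizeMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
  private static final IntWritable ONE = new IntWritable(1);
  private final Text word = new Text();

  @Override
  protected void map(LongWritable offset, Text line, Context ctx)
      throws IOException, InterruptedException {
    for (String tok : line.toString().split("\\s+")) {
      if (tok.isEmpty()) continue;
      word.set(tok);
      ctx.write(word, ONE);  // emit an intermediate (word, 1) pair
    }
  }
}

// reduce: (word, [1, 1, ...]) --> (word, total count); the framework has
// already grouped all intermediate values by key before reduce is invoked
class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
  @Override
  protected void reduce(Text word, Iterable<IntWritable> counts, Context ctx)
      throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable c : counts) sum += c.get();
    ctx.write(word, new IntWritable(sum));  // one output record per word
  }
}

A short driver program configures the job (input and output paths, key and value types) and submits it; the framework then schedules map tasks close to the data and routes the intermediate pairs to the reducers.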
Hadoop is an open source implementation of the core
components necessary for Map-Reduce. It focuses on pro-
viding the necessary minimum functionality, combining
simplicity of use with scalable performance.² However, if
additional functionality is needed by an application, other
open source components are available, which address, e.g.,
key-based data access [2] or more complex job and data
² While this article was being written, Hadoop won the TeraSort bench-
mark in the general-purpose category, completing the task in 209 seconds
using 900 eight-core nodes, beating the previous record of 297 seconds.