Large-Scale Distributed Co-clustering: DisCo and MapReduce

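"DisCo: Distributed Co-clustering with MapReduce": a study of distributed co-clustering with MapReduce and its application to large-scale data.

In the current era of big data, data volumes are growing explosively, and researchers routinely work with datasets of up to a few terabytes. Such data are often messy, and extracting useful information from them takes a series of steps, from pre-processing the raw data to building the final models. As data become more abundant, scalable and easy-to-use tools for distributed processing are attracting growing attention; among them, MapReduce is widely used in both academia and industry.

MapReduce is a simple yet powerful execution engine that can be combined with other data storage and management components when necessary. At its core, it decomposes a complex computation into two main phases, Map and Reduce, so that large-scale data can be processed in parallel. The paper describes the authors' experiences and findings in applying MapReduce to the whole process, from raw data to final models, on one key mining task: co-clustering. Co-clustering clusters the rows and columns of a data matrix simultaneously and is used in areas such as text mining, image analysis, and recommender systems. It can reveal hidden patterns and structure in the data and helps explain how features along different dimensions relate to each other. For large datasets, DisCo (Distributed Co-clustering with MapReduce) exploits the distributed nature of MapReduce to run the co-clustering algorithm across many machines.

DisCo's main advantages include:

1. **Parallel processing**: The Map phase splits the data into small chunks and assigns them to nodes, which work on their chunks independently, greatly speeding up processing.
2. **Fault tolerance**: The MapReduce framework has built-in failure recovery; when a node fails, its tasks are automatically rescheduled on other nodes, which keeps the system stable and reliable.
3. **Scalability**: As data grow, adding more compute nodes increases processing capacity, ideally close to linearly.
4. **Ease of programming**: MapReduce uses a simple programming model, so developers can quickly understand and implement complex algorithms.

The paper examines DisCo's performance in practice, including how to optimize MapReduce jobs and how to address challenges that co-clustering faces on large datasets, such as memory management and computational efficiency. It reports experiments on datasets of up to several hundred gigabytes, framed as a case study towards petabyte-scale end-to-end mining, and shows how DisCo fits into an end-to-end data mining pipeline. With DisCo, co-clustering of very large datasets can be completed in a short time, which matters when analysis results are needed quickly. The approach is particularly suited to scenarios that require finding similarities and patterns in large collections of documents, web pages, or other data.

DisCo offers a new perspective and method for co-clustering big data. It demonstrates the potential of MapReduce for large-scale processing and points the way for future distributed data mining techniques.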

DisCo: Distributed Co-clustering with Map-Reduce

A Case Study Towards Petabyte-Scale End-to-End Mining

Spiros Papadimitriou Jimeng Sun

IBM T.J. Watson Research Center

Hawthorne, NY, USA

{spapadim,jimeng}@us.ibm.com

Abstract

Huge datasets are becoming prevalent; even as researchers, we now routinely have to work with datasets that are up to a few terabytes in size. Interesting real-world applications produce huge volumes of messy data. The mining process involves several steps, starting from pre-processing the raw data to estimating the final models.

As data become more abundant, scalable and easy-to-use tools for distributed processing are also emerging. Among those, Map-Reduce has been widely embraced by both academia and industry. In database terms, Map-Reduce is a simple yet powerful execution engine, which can be complemented with other data storage and management components, as necessary.

In this paper we describe our experiences and findings in applying Map-Reduce, from raw data to final models, on an important mining task. In particular, we focus on co-clustering, which has been studied in many applications such as text mining, collaborative filtering, bio-informatics, graph mining. We propose the Distributed Co-clustering (DisCo) framework, which introduces practical approaches for distributed data pre-processing, and co-clustering. We develop DisCo using Hadoop, an open source Map-Reduce implementation. We show that DisCo can scale well and efficiently process and analyze extremely large datasets (up to several hundreds of gigabytes) on commodity hardware.
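
To make the co-clustering task more concrete, the following is a minimal single-process sketch of one building block: given current row and column cluster labels, count the nonzeros that fall into each (row cluster, column cluster) block, written as a map function and a reduce function. It is an illustration under stated assumptions, not the paper's implementation: the binary matrix is assumed to be stored as sparse adjacency lists, the labels are assumed to be given, and the names `map_row`, `reduce_group`, and `group_statistics` are hypothetical.

```python
from collections import defaultdict


def map_row(row_id, col_ids, row_label, col_label):
    """Map: summarize one sparse row as nonzero counts per column cluster."""
    counts = defaultdict(int)
    for j in col_ids:
        counts[col_label[j]] += 1
    # Key the summary by the row's cluster so rows of one cluster meet in reduce.
    yield row_label[row_id], dict(counts)


def reduce_group(row_cluster, partial_counts):
    """Reduce: add up the per-row summaries of a single row cluster."""
    total = defaultdict(int)
    for counts in partial_counts:
        for col_cluster, n in counts.items():
            total[col_cluster] += n
    return row_cluster, dict(total)


def group_statistics(rows, row_label, col_label):
    """Run map, group by key (shuffle), and reduce, all in one process."""
    shuffled = defaultdict(list)
    for row_id, col_ids in rows.items():
        for key, value in map_row(row_id, col_ids, row_label, col_label):
            shuffled[key].append(value)
    return dict(reduce_group(k, v) for k, v in sorted(shuffled.items()))


# Toy 4x4 binary matrix as adjacency lists (illustrative data only).
rows = {0: [0, 1], 1: [0, 1], 2: [2, 3], 3: [2, 3, 0]}
row_label = {0: 0, 1: 0, 2: 1, 3: 1}   # assumed current row clusters
col_label = {0: 0, 1: 0, 2: 1, 3: 1}   # assumed current column clusters
stats = group_statistics(rows, row_label, col_label)
print(stats)
```

Because each row is summarized independently, the map step can run on whichever machine holds that row; only the small per-cluster summaries need to travel to the reducers.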

1 Introduction

It’s a cliché, but it’s true: huge volumes of data are collected and need to be processed on a daily basis. For example, Google now processes an estimated 20 petabytes of data per day [13] and the Internet Archive (http://www.archive.org/) is growing at 20 terabytes a month, having reached 2 petabytes sometime in 2006. Retail giants such as Walmart and online shopping stores such as Amazon and eBay all deal with petabytes of transactional data every day.

By definition, research on data mining focuses on scalable algorithms applicable to huge datasets. But let’s take things from the beginning. Natural sources of data provide them in vast quantities, but in impure form. A repository may consist of, e.g., a corpus of text documents, a large web crawl, or system logs. Schemas do not arise spontaneously in nature. On the contrary, significant effort must be invested to make the data fit a given schema. Most commonly, data are collected in a multitude of unstructured or semi-structured formats. Aspects of the data that are relevant to the task at hand need to be extracted and stored in an appropriate representation. Most researchers start with the assumption that the input is in the appropriate form. However, getting the data into the right form is not trivial (see detailed discussion in Section 3).
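
As a small illustration of what "getting the data into the right form" can involve, the sketch below extracts (source, destination) pairs from semi-structured log lines and stores them as sparse adjacency lists with integer ids. The log format, the field names `src` and `dst`, and the sample lines are assumptions made up for this example; real logs will differ, and this is not the pre-processing pipeline described in the paper.

```python
import re
from collections import defaultdict

# Hypothetical log format: free text containing "src=<id> dst=<id>" fields.
LINE = re.compile(r"src=(?P<src>\S+)\s+dst=(?P<dst>\S+)")


def extract_pairs(lines):
    """Pull the relevant fields out of each line, skipping malformed ones."""
    for line in lines:
        m = LINE.search(line)
        if m:
            yield m.group("src"), m.group("dst")


def to_adjacency(pairs):
    """Assign dense integer ids and build sorted, de-duplicated adjacency lists."""
    row_ids, col_ids = {}, {}
    adjacency = defaultdict(set)
    for src, dst in pairs:
        i = row_ids.setdefault(src, len(row_ids))
        j = col_ids.setdefault(dst, len(col_ids))
        adjacency[i].add(j)
    return row_ids, col_ids, {i: sorted(js) for i, js in adjacency.items()}


raw_log = [
    "00:00:01 src=a dst=x bytes=120",
    "garbage line without the expected fields",
    "00:00:02 src=a dst=y bytes=40",
    "00:00:03 src=b dst=x bytes=77",
    "00:00:04 src=a dst=x bytes=12",
]
row_ids, col_ids, adjacency = to_adjacency(extract_pairs(raw_log))
print(adjacency)
```

Even in this toy version, the pre-processing has to handle malformed input, duplicate records, and the mapping from arbitrary identifiers to matrix indices, which hints at why this step is not trivial at scale.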

Map-Reduce [12] is attracting a lot of attention, proving both a source of inspiration [30] and a target of polemic [14] by prominent researchers in databases. Recently, some have questioned whether relational DBMSes are appropriate for any and all data management tasks under the sun [35, 34]. Moreover, [34] makes a strong case that bundling data storage, indexing, query execution, transaction control, and logging components into a monolithic system with a veneer of SQL is not always desirable. Starting from this call for a component-based approach, Map-Reduce is an execution engine, largely unconcerned about data models and storage schemes. In the simplest case, data reside on a distributed file system [19, 1, 26] but nothing prevents pulling data from a large data store like BigTable [7, 2, 38], or any other storage engine that (i) provides data de-clustering and replication across many machines, and (ii) allows computations to execute on local copies of the data. Arguably, Map-Reduce is powerful both for the features it provides and for the features it omits in order to provide a clean and simple programming abstraction.
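
The programming abstraction can be sketched in a few lines. The toy engine below runs the three conceptual steps (map, group by key, reduce) in a single process; it is only meant to show what the user supplies (a mapper and a reducer) versus what the engine does. The names `run_mapreduce`, `tokenize`, and `total` are hypothetical and are not Hadoop's API; a real engine distributes each step across machines and handles storage and failures.

```python
from collections import defaultdict
from itertools import chain


def run_mapreduce(records, mapper, reducer):
    """Toy, in-memory stand-in for a Map-Reduce execution engine."""
    # Map: each record independently yields zero or more (key, value) pairs.
    intermediate = chain.from_iterable(mapper(r) for r in records)
    # Shuffle: bring together all values that share a key.
    groups = defaultdict(list)
    for key, value in intermediate:
        groups[key].append(value)
    # Reduce: collapse each key's values into a single result.
    return {key: reducer(key, values) for key, values in groups.items()}


def tokenize(line):
    """User-supplied mapper: emit (word, 1) for every word in a line."""
    for word in line.lower().split():
        yield word, 1


def total(word, counts):
    """User-supplied reducer: sum the counts for one word."""
    return sum(counts)


lines = ["map reduce map", "reduce shuffle"]
word_counts = run_mapreduce(lines, tokenize, total)
print(word_counts)
```

Note that neither user function mentions where the data live or how records are partitioned, which matches the description of the engine as largely unconcerned with data models and storage schemes.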

Hadoop is an open source implementation of the core components necessary for Map-Reduce. It focuses on providing the necessary minimum functionality, combining simplicity of use with scalable performance. (While this article was being written, Hadoop won the TeraSort benchmark in the general purpose category, completing the task in 209 seconds using 900 eight-core nodes, beating the previous record of 297 seconds.) However, if additional functionality is needed by an application, other open source components are available, which address e.g., key-based data access [2], or more complex job and data
