MapReduce: Simplified Data Processing on Large Clusters


"MapReduce是Google开发的一种简化大型集群数据处理的编程模型，由Jeffrey Dean和Sanjay Ghemawat提出。它旨在解决在大量机器上并行处理大规模原始数据的问题，例如网页抓取文档、Web请求日志等，以生成各种派生数据，如倒排索引、Web文档的图结构表示、主机爬取页面的数量统计以及特定日期最频繁查询的集合。MapReduce的主要目标是将复杂的分布式计算细节隐藏起来，使开发者能更专注于实际的业务逻辑。” MapReduce的核心概念由两个主要阶段组成：Map（映射）和Reduce（规约）。在Map阶段，原始数据被分割成多个小块，然后在不同的机器上并行处理。每个Map任务接收一部分数据，执行指定的映射函数，生成键值对作为中间结果。这些中间结果随后在Reduce阶段进行汇聚和聚合。Reduce任务根据键对中间结果进行排序，然后调用对应的规约函数，将相同键的所有值合并成一个结果。 MapReduce模型还包括两个附加步骤：Shuffle和Sort。Shuffle阶段负责将Map产生的键值对按照键进行排序，并将相同键的数据分发到同一个Reduce任务。Sort阶段确保所有键值对在传递给Reduce之前按键排序，这是Reduce阶段能够正确处理数据的前提。为了应对大规模数据处理中的故障容错问题，MapReduce设计了一套机制。当工作节点出现故障时，系统能够自动检测并重新调度任务，确保整体处理的完整性。此外，数据的冗余存储（通常是通过复制）也增加了系统的可用性和可靠性。 MapReduce的另一个关键特性是其可扩展性。由于任务可以被拆分为大量独立的小任务，因此可以轻松地添加更多机器来处理更大的数据集或更快地完成计算。这使得MapReduce成为处理PB级别数据的理想选择。在Google，MapReduce被广泛应用于各种场景，包括构建搜索引擎索引、分析用户行为数据、日志分析等。这种模型的简单性和高效性使其在大数据处理领域获得了广泛应用，并且激发了其他类似框架的诞生，如Apache Hadoop，它在开源社区中扮演了重要角色，推动了大数据处理技术的发展。 MapReduce通过提供一种抽象的编程模型，使得开发人员可以专注于数据处理的逻辑，而无需深入理解底层的分布式系统细节。这种简化的方法极大地降低了大规模数据处理的复杂性，为处理海量数据提供了强大的工具。

MapReduce: Simplified Data Processing on Large Clusters

by Jeffrey Dean and Sanjay Ghemawat

1 Introduction

Prior to our development of MapReduce, the authors and many others at Google implemented hundreds of special-purpose computations that process large amounts of raw data, such as crawled documents, Web request logs, etc., to compute various kinds of derived data, such as inverted indices, various representations of the graph structure of Web documents, summaries of the number of pages crawled per host, and the set of most frequent queries in a given day. Most such computations are conceptually straightforward. However, the input data is usually large and the computations have to be distributed across hundreds or thousands of machines in order to finish in a reasonable amount of time. The issues of how to parallelize the computation, distribute the data, and handle failures conspire to obscure the original simple computation with large amounts of complex code to deal with these issues.

As a reaction to this complexity, we designed a new abstraction that allows us to express the simple computations we were trying to perform but hides the messy details of parallelization, fault tolerance, data distribution and load balancing in a library. Our abstraction is inspired by the map and reduce primitives present in Lisp and many other functional languages. We realized that most of our computations involved applying a map operation to each logical “record” in our input in order to compute a set of intermediate key/value pairs, and then applying a reduce operation to all the values that shared the same key in order to combine the derived data appropriately. Our use of a functional model with user-specified map and reduce operations allows us to parallelize large computations easily and to use reexecution as the primary mechanism for fault tolerance.
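To illustrate why a functional model makes reexecution a workable recovery strategy, here is a minimal Python sketch. It assumes the task is deterministic, so running it again on the same input yields the same output. The helper `run_with_reexecution` and the `FlakyWorker` class are hypothetical names for illustration, and the `RuntimeError` merely stands in for a lost machine.

```python
def run_with_reexecution(task, task_input, max_attempts=3):
    """Run a deterministic task, re-executing it if the (simulated) worker fails."""
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            return task(task_input), attempt
        except RuntimeError as err:  # stands in for a worker failure
            last_error = err
    raise RuntimeError(f"task failed after {max_attempts} attempts") from last_error


class FlakyWorker:
    """Hypothetical worker that crashes on its first call, then succeeds."""

    def __init__(self):
        self.calls = 0

    def __call__(self, text):
        self.calls += 1
        if self.calls == 1:
            raise RuntimeError("simulated machine failure")
        # A pure map step: the output depends only on the input.
        return [(word, 1) for word in text.split()]


result, attempts = run_with_reexecution(FlakyWorker(), "to be or not")
```

Because the map step has no side effects, the second attempt can simply replace the first; no partial state from the failed run has to be repaired.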

The major contributions of this work are a simple and powerful interface that enables automatic parallelization and distribution of large-scale computations, combined with an implementation of this interface that achieves high performance on large clusters of commodity PCs. The programming model can also be used to parallelize computations across multiple cores of the same machine.

Section 2 describes the basic programming model and gives several examples. In Section 3, we describe an implementation of the MapReduce interface tailored towards our cluster-based computing environment. Section 4 describes several refinements of the programming model that we have found useful. Section 5 has performance measurements of our implementation for a variety of tasks. In Section 6, we explore the use of MapReduce within Google including our experiences in using it as the basis for a rewrite of our production indexing system. Section 7 discusses related and future work.

2 Programming Model

The computation takes a set of input key/value pairs, and produces a set of output key/value pairs. The user of the MapReduce library expresses the computation as two functions: map and reduce.

Map, written by the user, takes an input pair and produces a set of intermediate key/value pairs. The MapReduce library groups together all intermediate values associated with the same intermediate key I and passes them to the reduce function.

The reduce function, also written by the user, accepts an intermediate key I and a set of values for that key. It merges these values together to form a possibly smaller set of values. Typically just zero or one output value is produced per reduce invocation. The intermediate values are supplied to the user’s reduce function via an iterator. This allows us to handle lists of values that are too large to fit in memory.
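The shape of the two user functions, and the grouping done by the library, can be sketched as a small in-memory driver in Python. This is an illustrative assumption rather than the library's interface: `map_reduce`, `index_map`, and `index_reduce` are hypothetical names, and the example builds a tiny inverted index, one of the derived datasets mentioned in the introduction. The values reach the reduce function through a lazy iterator, mirroring the description above.

```python
from itertools import groupby
from operator import itemgetter


def map_reduce(inputs, map_fn, reduce_fn):
    """Apply map_fn to each input pair, group by key, then apply reduce_fn."""
    intermediate = []
    for key, value in inputs:
        intermediate.extend(map_fn(key, value))
    # Sorting brings equal keys together so groupby can form one group per key.
    intermediate.sort(key=itemgetter(0))
    output = []
    for key, group in groupby(intermediate, key=itemgetter(0)):
        values = (v for _, v in group)  # lazy iterator over this key's values
        output.extend((key, out) for out in reduce_fn(key, values))
    return output


def index_map(doc_id, text):
    # Emit one (word, doc_id) pair per word occurrence.
    return [(word, doc_id) for word in text.split()]


def index_reduce(word, doc_ids):
    # One output value per key: the sorted list of distinct documents.
    return [sorted(set(doc_ids))]


docs = [("d1", "map reduce"), ("d2", "reduce")]
index = map_reduce(docs, index_map, index_reduce)
```

Here each reduce invocation produces exactly one output value, matching the typical case of zero or one output per key.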

2.1 Example

Consider the problem of counting the number of occurrences of each word in a large collection of documents. The user would write code similar to the following pseudocode.
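The pseudocode itself does not appear in this excerpt. As a stand-in, the following Python sketch shows one way the word count could be written in this model. It is not the authors' code: `map_fn`, `reduce_fn`, and the single-process `word_count` driver are assumptions made for illustration.

```python
from collections import defaultdict


def map_fn(key, value):
    # key: document name (unused here); value: document contents
    for word in value.split():
        yield word, 1


def reduce_fn(key, values):
    # key: a word; values: an iterator over that word's counts
    yield sum(values)


def word_count(documents):
    """Single-process driver (assumed) that groups map output and reduces it."""
    grouped = defaultdict(list)
    for name, contents in documents.items():
        for word, count in map_fn(name, contents):
            grouped[word].append(count)
    return {
        word: next(reduce_fn(word, iter(counts)))
        for word, counts in sorted(grouped.items())
    }


counts = word_count({"a.txt": "the quick the", "b.txt": "quick fox"})
```

The map function emits each word with a count of 1, and the reduce function sums the counts emitted for a particular word.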

Abstract

MapReduce is a programming model and an associated implementation for processing and generating large datasets that is amenable to a broad variety of real-world tasks. Users specify the computation in terms of a map and a reduce function, and the underlying runtime system automatically parallelizes the computation across large-scale clusters of machines, handles machine failures, and schedules inter-machine communication to make efficient use of the network and disks. Programmers find the system easy to use: more than ten thousand distinct MapReduce programs have been implemented internally at Google over the past four years, and an average of one hundred thousand MapReduce jobs are executed on Google’s clusters every day, processing a total of more than twenty petabytes of data per day.

Biographies

Jeff Dean (jeff@google.com) is a Google Fellow and is currently working on a large variety of large-scale distributed systems at Google’s Mountain View, CA, facility.

Sanjay Ghemawat (sanjay@google.com) is a Google Fellow and works on the distributed computing infrastructure used by most of the company’s products. He is based at Google’s Mountain View, CA, facility.

COMMUNICATIONS OF THE ACM January 2008/Vol. 51, No. 1 107
