Google MapReduce: Simplified Data Processing on Large Clusters

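"MapReduce: Simplified Data Processing on Large Clusters", published in 2008 by Jeffrey Dean and Sanjay Ghemawat of Google, describes a distributed computing framework designed to tame the complexity of running computations over very large datasets. Before MapReduce, Google had already written hundreds of special-purpose programs that processed large amounts of raw data, such as crawled documents and Web request logs, to produce derived data such as inverted indices, graph representations of Web document structure, and the set of most frequent queries in a given day. These computations are conceptually simple, but because the input is so large they usually have to be distributed across hundreds or thousands of machines. The code needed to parallelize the work, distribute the data, and handle failures is what makes the programs complex.

The core idea of MapReduce is to break a distributed computation into two main phases, Map and Reduce. In the Map phase, the input data is split into chunks that are processed in parallel on different nodes. The Map function takes key/value pairs as input, applies a user-defined operation (such as filtering or transformation), and emits new intermediate key/value pairs. The intermediate pairs can then be partitioned and sorted by key across nodes.

The Reduce phase aggregates the intermediate results of the Map phase. The Reduce function receives each intermediate key together with the list of values for that key and applies an aggregation such as a sum, a maximum, or a count. In this way the Reduce phase combines information from across the whole cluster into the final result.

The paper also emphasizes fault tolerance. The system handles node failures automatically, so the computation continues even when some nodes fail. This is achieved mainly through task re-execution, supported by data replication: when a node fails, its tasks are reassigned to other available nodes, and replicated copies of the data let the rescheduled tasks read their input again without it having to be regenerated. The framework also includes a central scheduler that assigns tasks to worker nodes and monitors task progress and resource usage. This centralized scheduling simplifies cluster management while allowing resource allocation to be adjusted as load changes.

MapReduce greatly simplified the programming model for large-scale data processing, so that programmers without distributed-systems expertise can process very large datasets. It is widely used inside Google and laid the groundwork for open-source projects such as Hadoop, advancing big-data processing technology in general. Because computation is expressed through the two basic operations Map and Reduce, developers can focus on application logic without needing to understand the details of the underlying distributed system.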

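The task re-execution described in the summary above can be illustrated with a small Python sketch. Everything here is hypothetical and written only for illustration: the names `WorkerFailure`, `run_task`, and `schedule`, the way failures are simulated, and the retry limit are assumptions, not the framework's actual scheduler. The sketch relies on tasks being deterministic, so running a lost task again yields the same result.

```python
class WorkerFailure(Exception):
    """Raised by the simulated worker to model a machine failure."""

def run_task(task_id, split, attempt, failing_attempts):
    """Hypothetical worker: fails on the listed (task, attempt) pairs."""
    if (task_id, attempt) in failing_attempts:
        raise WorkerFailure(f"task {task_id} lost on attempt {attempt}")
    return sum(split)  # deterministic task body

def schedule(splits, failing_attempts, max_attempts=3):
    """Run every task, re-executing any task whose worker fails."""
    results, attempts_used = {}, {}
    for task_id, split in enumerate(splits):
        for attempt in range(1, max_attempts + 1):
            try:
                results[task_id] = run_task(task_id, split, attempt,
                                            failing_attempts)
                attempts_used[task_id] = attempt
                break
            except WorkerFailure:
                continue  # reassign: run the same task again
        else:
            raise RuntimeError(f"task {task_id} failed {max_attempts} times")
    return results, attempts_used

splits = [[1, 2], [3, 4], [5]]
# Simulate the worker for task 1 failing on its first attempt.
results, attempts_used = schedule(splits, failing_attempts={(1, 1)})
print(results, attempts_used)
```

The failed task is simply run a second time from its input split, and the final results are the same as if no failure had occurred.
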
MapReduce: Simplified Data Processing on Large Clusters

by Jeffrey Dean and Sanjay Ghemawat

1 Introduction

Prior to our development of MapReduce, the authors and many others at Google implemented hundreds of special-purpose computations that process large amounts of raw data, such as crawled documents, Web request logs, etc., to compute various kinds of derived data, such as inverted indices, various representations of the graph structure of Web documents, summaries of the number of pages crawled per host, and the set of most frequent queries in a given day. Most such computations are conceptually straightforward. However, the input data is usually large and the computations have to be distributed across hundreds or thousands of machines in order to finish in a reasonable amount of time. The issues of how to parallelize the computation, distribute the data, and handle failures conspire to obscure the original simple computation with large amounts of complex code to deal with these issues.

As a reaction to this complexity, we designed a new abstraction that allows us to express the simple computations we were trying to perform but hides the messy details of parallelization, fault tolerance, data distribution and load balancing in a library. Our abstraction is inspired by the map and reduce primitives present in Lisp and many other functional languages. We realized that most of our computations involved applying a map operation to each logical “record” in our input in order to compute a set of intermediate key/value pairs, and then applying a reduce operation to all the values that shared the same key in order to combine the derived data appropriately. Our use of a functional model with user-specified map and reduce operations allows us to parallelize large computations easily and to use reexecution as the primary mechanism for fault tolerance.
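
To make the link to functional primitives concrete, here is a small Python sketch using the built-in `map` and `functools.reduce`. The sample records and the helper names `to_pair` and `sum_by_key` are assumptions made for illustration and do not come from the paper. The sketch shows only the functional shape of the computation, not the distributed library.

```python
from functools import reduce

# Hypothetical input: each logical record is a "host pages_crawled" line.
records = ["a.com 3", "b.com 5", "a.com 4"]

def to_pair(record):
    """Map step: turn one record into an intermediate (key, value) pair."""
    host, pages = record.split()
    return host, int(pages)

def sum_by_key(totals, pair):
    """Reduce step: fold one (key, value) pair into the per-key totals."""
    key, value = pair
    return {**totals, key: totals.get(key, 0) + value}

pairs = map(to_pair, records)           # apply the map operation to every record
totals = reduce(sum_by_key, pairs, {})  # combine values that share the same key
print(totals)
```

Because `to_pair` depends only on its own record, the map step could be applied to the records in any order or in parallel, which is the property the abstraction exploits.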

The major contributions of this work are a simple and powerful interface that enables automatic parallelization and distribution of large-scale computations, combined with an implementation of this interface that achieves high performance on large clusters of commodity PCs. The programming model can also be used to parallelize computations across multiple cores of the same machine.

Section 2 describes the basic programming model and gives several examples. In Section 3, we describe an implementation of the MapReduce interface tailored towards our cluster-based computing environment. Section 4 describes several refinements of the programming model that we have found useful. Section 5 has performance measurements of our implementation for a variety of tasks. In Section 6, we explore the use of MapReduce within Google including our experiences in using it as the basis for a rewrite of our production indexing system. Section 7 discusses related and future work.

2 Programming Model

The computation takes a set of input key/value pairs, and produces a set of output key/value pairs. The user of the MapReduce library expresses the computation as two functions: map and reduce.

Map, written by the user, takes an input pair and produces a set of intermediate key/value pairs. The MapReduce library groups together all intermediate values associated with the same intermediate key I and passes them to the reduce function.

The reduce function, also written by the user, accepts an intermediate key I and a set of values for that key. It merges these values together to form a possibly smaller set of values. Typically just zero or one output value is produced per reduce invocation. The intermediate values are supplied to the user’s reduce function via an iterator. This allows us to handle lists of values that are too large to fit in memory.
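
The data flow described above can be sketched as a single-process Python driver. The name `run_mapreduce` and the inverted-index example functions `index_map` and `index_reduce` are assumptions for illustration. The real library distributes this work across machines, which the sketch does not attempt.

```python
from collections import defaultdict

def run_mapreduce(inputs, map_fn, reduce_fn):
    """Sequential stand-in for the library: map, group by key, reduce."""
    groups = defaultdict(list)
    for key, value in inputs:
        # map_fn yields intermediate (key, value) pairs for one input pair.
        for ikey, ivalue in map_fn(key, value):
            groups[ikey].append(ivalue)
    output = []
    for ikey in sorted(groups):
        # Values reach reduce_fn through an iterator, not a list.
        output.extend(reduce_fn(ikey, iter(groups[ikey])))
    return output

def index_map(doc_id, text):
    """Emit (word, doc_id) for each distinct word in a document."""
    for word in set(text.split()):
        yield word, doc_id

def index_reduce(word, doc_ids):
    """Emit one (word, sorted list of document ids) pair."""
    yield word, sorted(doc_ids)

docs = [("d1", "map reduce"), ("d2", "reduce only")]
index = run_mapreduce(docs, index_map, index_reduce)
print(index)
```

Unlike this sketch, which keeps every group in memory, an iterator-based interface lets a real implementation stream values from disk, which is why very long value lists remain manageable.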

2.1 Example

Consider the problem of counting the number of occurrences of each word in a large collection of documents. The user would write code similar to the following pseudocode.
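
A minimal Python sketch of such a word count, written in the map/reduce style, might look as follows. It is not the paper's pseudocode: the function names `wc_map`, `wc_reduce`, and `word_count` are hypothetical, and the grouping step runs in memory on one machine.

```python
from itertools import groupby
from operator import itemgetter

def wc_map(doc_name, contents):
    """Emit (word, 1) for every word occurrence in one document."""
    for word in contents.split():
        yield word, 1

def wc_reduce(word, counts):
    """Sum the partial counts emitted for one word."""
    return word, sum(counts)

def word_count(documents):
    # Map phase: collect intermediate pairs from every document.
    pairs = [p for name, text in documents.items() for p in wc_map(name, text)]
    # Grouping phase: sort so that equal keys are adjacent, then group.
    pairs.sort(key=itemgetter(0))
    grouped = groupby(pairs, key=itemgetter(0))
    # Reduce phase: one call per distinct word.
    return dict(wc_reduce(w, (c for _, c in group)) for w, group in grouped)

docs = {"d1": "the quick fox", "d2": "the lazy dog the end"}
counts = word_count(docs)
print(counts["the"])
```

The map function emits a count of 1 per occurrence and leaves all summing to the reduce function, so neither function needs to know how many documents or machines are involved.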

Abstract

MapReduce is a programming model and an associated implementation for processing and generating large datasets that is amenable to a broad variety of real-world tasks. Users specify the computation in terms of a map and a reduce function, and the underlying runtime system automatically parallelizes the computation across large-scale clusters of machines, handles machine failures, and schedules inter-machine communication to make efficient use of the network and disks. Programmers find the system easy to use: more than ten thousand distinct MapReduce programs have been implemented internally at Google over the past four years, and an average of one hundred thousand MapReduce jobs are executed on Google’s clusters every day, processing a total of more than twenty petabytes of data per day.

Biographies

Jeff Dean (jeff@google.com) is a Google Fellow and is currently working on a large variety of large-scale distributed systems at Google’s Mountain View, CA, facility.

Sanjay Ghemawat (sanjay@google.com) is a Google Fellow and works on the distributed computing infrastructure used by most of the company’s products. He is based at Google’s Mountain View, CA, facility.

COMMUNICATIONS OF THE ACM January 2008/Vol. 51, No. 1 107
