MapReduce: Simplified Data Processing on Large Clusters
by Jeffrey Dean and Sanjay Ghemawat
Abstract
MapReduce is a programming model and an associated implementation for processing and generating large datasets that is amenable to a broad variety of real-world tasks. Users specify the computation in terms of a map and a reduce function, and the underlying runtime system automatically parallelizes the computation across large-scale clusters of machines, handles machine failures, and schedules inter-machine communication to make efficient use of the network and disks. Programmers find the system easy to use: more than ten thousand distinct MapReduce programs have been implemented internally at Google over the past four years, and an average of one hundred thousand MapReduce jobs are executed on Google's clusters every day, processing a total of more than twenty petabytes of data per day.
1 Introduction
Prior to our development of MapReduce, the authors and many others
at Google implemented hundreds of special-purpose computations that processed large amounts of raw data, such as crawled documents, Web
request logs, etc., to compute various kinds of derived data, such as
inverted indices, various representations of the graph structure of Web
documents, summaries of the number of pages crawled per host, and
the set of most frequent queries in a given day. Most such computations are conceptually straightforward. However, the input data is usually large and the computations have to be distributed across hundreds or thousands of machines in order to finish in a reasonable amount of time. The issues of how to parallelize the computation, distribute the data, and handle failures conspire to obscure the original simple computation with large amounts of complex code.
As a reaction to this complexity, we designed a new abstraction that
allows us to express the simple computations we were trying to perform
but hides the messy details of parallelization, fault tolerance, data distribution, and load balancing in a library. Our abstraction is inspired by the map and reduce primitives present in Lisp and many other functional languages. We realized that most of our computations involved applying a map operation to each logical record in our input in order to compute a
set of intermediate key/value pairs, and then applying a reduce operation
to all the values that shared the same key in order to combine the derived
data appropriately. Our use of a functional model with user-specified map
and reduce operations allows us to parallelize large computations easily
and to use reexecution as the primary mechanism for fault tolerance.
The major contributions of this work are a simple and powerful
interface that enables automatic parallelization and distribution of
large-scale computations, combined with an implementation of this
interface that achieves high performance on large clusters of commodity PCs. The programming model can also be used to parallelize
computations across multiple cores of the same machine.
Section 2 describes the basic programming model and gives several
examples. In Section 3, we describe an implementation of the MapReduce interface tailored towards our cluster-based computing environment. Section 4 describes several refinements of the programming model that we have found useful. Section 5 has performance measurements of our implementation for a variety of tasks. In Section 6, we explore the use of MapReduce within Google, including our experiences in using it as the basis for a rewrite of our production indexing system. Section 7 discusses related and future work.
2 Programming Model
The computation takes a set of input key/value pairs, and produces a
set of output key/value pairs. The user of the MapReduce library
expresses the computation as two functions: map and reduce.
Map, written by the user, takes an input pair and produces a set of
intermediate key/value pairs. The MapReduce library groups together
all intermediate values associated with the same intermediate key I
and passes them to the reduce function.
The reduce function, also written by the user, accepts an intermediate key I and a set of values for that key. It merges these values
together to form a possibly smaller set of values. Typically just zero or
one output value is produced per reduce invocation. The intermediate
values are supplied to the user’s reduce function via an iterator. This
allows us to handle lists of values that are too large to fit in memory.
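To make the library's role concrete, here is a minimal single-process sketch of this execution model in C++. It is illustrative only: the KV, MapFn, ReduceFn, and RunMapReduce names are placeholders and not the interface of the actual MapReduce library. The sketch applies the user's map function to every input pair, groups the intermediate values by key, and then invokes the user's reduce function once per distinct intermediate key.

  #include <functional>
  #include <map>
  #include <string>
  #include <utility>
  #include <vector>

  // Illustrative types: keys and values are strings, as in the word-count example.
  using KV = std::pair<std::string, std::string>;
  using MapFn = std::function<void(const KV& input, std::vector<KV>* intermediate)>;
  using ReduceFn = std::function<void(const std::string& key,
                                      const std::vector<std::string>& values,
                                      std::vector<std::string>* output)>;

  // Runs one MapReduce computation sequentially: map every input pair, group
  // intermediate values by key, then reduce each group.
  std::map<std::string, std::vector<std::string>> RunMapReduce(
      const std::vector<KV>& inputs, const MapFn& map_fn, const ReduceFn& reduce_fn) {
    // Map phase: produce intermediate key/value pairs.
    std::vector<KV> intermediate;
    for (const KV& input : inputs) map_fn(input, &intermediate);

    // Grouping phase: collect all values that share an intermediate key.
    std::map<std::string, std::vector<std::string>> grouped;
    for (const KV& kv : intermediate) grouped[kv.first].push_back(kv.second);

    // Reduce phase: one invocation per distinct intermediate key.
    std::map<std::string, std::vector<std::string>> results;
    for (const auto& [key, values] : grouped) reduce_fn(key, values, &results[key]);
    return results;
  }

In the real system the map and reduce invocations run in parallel across many machines, and the values for a key are streamed to reduce through an iterator rather than materialized in a single in-memory vector, but for deterministic map and reduce functions the result is intended to match this sequential version.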
2.1 Example
Consider the problem of counting the number of occurrences of each
word in a large collection of documents. The user would write code
similar to the following pseudocode:
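  map(String key, String value):
    // key: document name
    // value: document contents
    for each word w in value:
      EmitIntermediate(w, "1");

  reduce(String key, Iterator values):
    // key: a word
    // values: a list of counts
    int result = 0;
    for each v in values:
      result += ParseInt(v);
    Emit(AsString(result));

The map function emits each word plus an associated count of occurrences (just 1 in this simple example). The reduce function sums together all counts emitted for a particular word.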
Biographies
Jeff Dean (jeff@google.com) is a Google Fellow and is currently working on a wide variety of large-scale distributed systems at Google's Mountain View, CA, facility.
Sanjay Ghemawat (sanjay@google.com) is a Google Fellow and works on the distributed computing infrastructure used by most of the company's products. He is based at Google's Mountain View, CA, facility.