MapReduce: A Core Model for Simplified Data Processing on Large Clusters


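MapReduce: Simplified Data Processing on Large Clusters is an influential paper by Jeffrey Dean and Sanjay Ghemawat. It introduces a programming model, together with the implementation used widely inside Google, for processing data on large clusters. The model is designed to simplify the programming of complex data processing tasks.

The core idea is to split a computation into two phases, Map and Reduce. In the Map phase, a user-defined map function takes an input key/value pair and produces a set of intermediate key/value pairs. The framework groups the intermediate values by key so that the Reduce phase can aggregate them. In the Reduce phase, a user-defined reduce function merges all values associated with the same intermediate key, typically to produce the final result.

This pattern suits tasks that can be partitioned and aggregated by key, such as building a search engine's web index or running data analysis. Programs written this way are parallelized automatically across a large number of inexpensive machines, which greatly increases the volume of data that can be processed.

The paper stresses that the framework hides the complexity of parallelization and distributed systems management. Programmers with no background in parallel or distributed systems can still write programs that run efficiently, which lowers the barrier to using large distributed systems.

The authors' implementation runs on large numbers of commodity machines. The run-time system takes care of partitioning the input data, scheduling tasks, handling machine failures, and managing communication between machines. This flexibility and ease of use made MapReduce a foundation of modern big data processing and strongly influenced open-source projects such as Hadoop.

Besides describing the design of the model, the paper reports its performance in real workloads. It is a landmark for understanding and applying distributed computing, cloud computing, and big data analysis, and a valuable resource for anyone who works with very large data sets.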
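The two-phase model described above can be sketched in a few lines of single-process Python. This is a minimal illustration, not the authors' implementation: the `map_reduce` driver, the word-count functions, and the sample documents are all hypothetical names chosen for this example, and nothing here is distributed or fault tolerant.

```python
from collections import defaultdict


def map_reduce(inputs, map_fn, reduce_fn):
    """Apply map_fn to every input pair, group by intermediate key, then reduce."""
    groups = defaultdict(list)
    for key, value in inputs:
        # Map phase: each input pair yields zero or more intermediate pairs.
        for ikey, ivalue in map_fn(key, value):
            groups[ikey].append(ivalue)
    # Reduce phase: merge all values that share an intermediate key.
    return {ikey: reduce_fn(ikey, values) for ikey, values in sorted(groups.items())}


def word_count_map(doc_name, text):
    # Emit (word, 1) for every word occurrence in the document.
    for word in text.split():
        yield word.lower(), 1


def word_count_reduce(word, counts):
    # Sum the partial counts for one word.
    return sum(counts)


# Hypothetical sample input: (document name, document contents) pairs.
docs = [("a.txt", "the quick fox"), ("b.txt", "the lazy dog the end")]
counts = map_reduce(docs, word_count_map, word_count_reduce)
```

Only `word_count_map` and `word_count_reduce` are specific to the task; the driver stays the same for any computation that can be expressed as grouping and merging by key.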

[Figure 1: Execution overview. The user program (1) forks a master and several workers, and the master (2) assigns map and reduce tasks to the workers. In the map phase, workers (3) read the input file splits (split 0 to split 4) and (4) write intermediate files to their local disks. In the reduce phase, workers (5) read the intermediate files remotely and (6) write the output files (output file 0 and output file 1).]
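The numbered data-flow steps in Figure 1 can be mimicked in one process. The sketch below is an assumption-laden simulation: `run_job` and `partition` are hypothetical names, the "local disks" are plain Python lists, the hash partitioner (`crc32(key) mod r`) is one possible choice, and forking, task assignment, and failure handling (steps 1 and 2) are omitted entirely.

```python
from collections import defaultdict
from zlib import crc32


def partition(key, r):
    # Assumed partitioner: a stable hash of the key, modulo the number of reduce tasks.
    return crc32(key.encode("utf-8")) % r


def run_job(splits, map_fn, reduce_fn, r):
    # Steps (3) and (4): each map task reads one split and buffers its
    # intermediate pairs into r regions on its own "local disk".
    local_disks = []
    for split in splits:
        regions = [defaultdict(list) for _ in range(r)]
        for key, value in split:
            for ikey, ivalue in map_fn(key, value):
                regions[partition(ikey, r)][ikey].append(ivalue)
        local_disks.append(regions)

    # Steps (5) and (6): each reduce task collects its region from every
    # map task, processes keys in sorted order, and writes one output file.
    output_files = []
    for region_id in range(r):
        merged = defaultdict(list)
        for regions in local_disks:
            for ikey, values in regions[region_id].items():
                merged[ikey].extend(values)
        output_files.append({k: reduce_fn(k, merged[k]) for k in sorted(merged)})
    return output_files


# Hypothetical job: two input splits, two reduce tasks, counting words.
splits = [[("d1", "a b")], [("d2", "b c")]]
outputs = run_job(
    splits,
    map_fn=lambda name, text: [(w, 1) for w in text.split()],
    reduce_fn=lambda word, ones: sum(ones),
    r=2,
)
```

With `r=2` the job produces two output files, matching the two output files in the figure; every intermediate key lands in exactly one of them because the partitioner is deterministic.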

Inverted Index: The map function parses each document, and emits a sequence of ⟨word, document ID⟩ pairs. The reduce function accepts all pairs for a given word, sorts the corresponding document IDs and emits a ⟨word, list(document ID)⟩ pair. The set of all output pairs forms a simple inverted index. It is easy to augment this computation to keep track of word positions.
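A minimal single-process sketch of the inverted index computation follows. The function names and the sample documents are hypothetical, whitespace tokenization is an assumption, and the reduce step also removes duplicate document IDs, which the text above does not specify.

```python
from collections import defaultdict


def index_map(doc_id, text):
    # Emit a (word, document ID) pair for every word in the document.
    for word in text.lower().split():
        yield word, doc_id


def index_reduce(word, doc_ids):
    # Sort the document IDs for one word (duplicates dropped by assumption).
    return word, sorted(set(doc_ids))


def build_inverted_index(documents):
    # Group the emitted pairs by word, standing in for the framework's shuffle.
    grouped = defaultdict(list)
    for doc_id, text in documents.items():
        for word, emitted_id in index_map(doc_id, text):
            grouped[word].append(emitted_id)
    return dict(index_reduce(word, ids) for word, ids in grouped.items())


# Hypothetical documents keyed by integer document ID.
documents = {2: "map reduce map", 1: "map tasks run first"}
index = build_inverted_index(documents)
```

Tracking word positions would only require `index_map` to emit `(word, (doc_id, position))` instead of `(word, doc_id)`.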

Distributed Sort: The map function extracts the key from each record, and emits a ⟨key, record⟩ pair. The reduce function emits all pairs unchanged. This computation depends on the partitioning facilities described in Section 4.1 and the ordering properties described in Section 4.2.
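The sort can be sketched as follows, assuming a range partitioner and assuming that pairs within each partition are processed in increasing key order. Both assumptions stand in for the facilities the text refers to; the `boundaries` split points, the function names, and the sample records are hypothetical.

```python
from bisect import bisect_right


def sort_map(record, key_fn):
    # Extract the sort key and emit a (key, record) pair.
    return key_fn(record), record


def range_partition(key, boundaries):
    # Assumed range partitioner: partition i holds keys between
    # boundaries[i-1] (inclusive) and boundaries[i] (exclusive).
    return bisect_right(boundaries, key)


def distributed_sort(records, key_fn, boundaries):
    partitions = [[] for _ in range(len(boundaries) + 1)]
    for record in records:
        key, value = sort_map(record, key_fn)
        partitions[range_partition(key, boundaries)].append((key, value))
    # Identity reduce: pairs pass through unchanged, visited in key order
    # within each partition.
    return [sorted(p, key=lambda pair: pair[0]) for p in partitions]


# Hypothetical records that act as their own sort keys.
records = ["pear", "apple", "kiwi", "fig", "zucchini"]
parts = distributed_sort(records, key_fn=lambda r: r, boundaries=["g", "q"])
sorted_records = [record for part in parts for _, record in part]
```

Because the partitions cover disjoint, increasing key ranges, concatenating the per-partition outputs in partition order yields a globally sorted sequence. A hash partitioner would not have this property.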

3 Implementation

Many different implementations of the MapReduce interface are possible. The right choice depends on the environment. For example, one implementation may be suitable for a small shared-memory machine, another for a large NUMA multi-processor, and yet another for an even larger collection of networked machines.

This section describes an implementation targeted to the computing environment in wide use at Google: large clusters of commodity PCs connected together with switched Ethernet [4]. In our environment:

(1) Machines are typically dual-processor x86 processors running Linux, with 2-4 GB of memory per machine.

(2) Commodity networking hardware is used – typically either 100 megabits/second or 1 gigabit/second at the machine level, but averaging considerably less in overall bisection bandwidth.

(3) A cluster consists of hundreds or thousands of machines, and therefore machine failures are common.

(4) Storage is provided by inexpensive IDE disks attached directly to individual machines. A distributed file system [8] developed in-house is used to manage the data stored on these disks. The file system uses replication to provide availability and reliability on top of unreliable hardware.

(5) Users submit jobs to a scheduling system. Each job consists of a set of tasks, and is mapped by the scheduler to a set of available machines within a cluster.

3.1 Execution Overview

The Map invocations are distributed across multiple machines by automatically partitioning the input data

To appear in OSDI 2004
