An Analysis of the MapReduce Programming Model

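"Introduction to MapReduce: a paper presented at OSDI that describes the MapReduce programming model and its use for data processing on large clusters."

MapReduce is a programming model developed at Google for processing and generating large data sets. It was published by Jeffrey Dean and Sanjay Ghemawat in 2004 and simplifies data processing on large clusters. A computation is expressed as two user-defined phases, Map and Reduce, with a Shuffle step performed by the system in between.

Map phase: The user defines a Map function that takes an input key/value pair and produces a set of intermediate key/value pairs. This phase typically handles per-record work such as filtering, transforming, or tagging data with a grouping key. Map tasks run in parallel on different nodes of the cluster, so one large job is broken into many small, independent tasks.

Shuffle phase: Between Map and Reduce, the system automatically partitions and sorts the intermediate data by intermediate key, so that all values for the same intermediate key are delivered to the same Reduce task.

Reduce phase: The user defines a Reduce function that merges all intermediate values associated with the same intermediate key. This is the aggregation step, for example computing a sum, a maximum, or a minimum. Reduce tasks also run in parallel, each consuming results from many Map tasks.

Automation: Programs are written in a functional style, and the system parallelizes them and executes them on the cluster automatically. The runtime takes care of partitioning the input data, scheduling tasks, handling machine failures, and managing communication between machines. Programmers with no experience in parallel or distributed systems can therefore use the resources of a large distributed system.

Implementation and performance: Google's implementation runs on large clusters of commodity machines and is designed to be scalable and fault tolerant. When a machine fails, its tasks are re-executed on other machines, while the underlying distributed file system replicates data to provide availability. Because the job is split into many more tasks than there are machines, work can be rebalanced dynamically according to the current state of the cluster.

Applications: The model is used for tasks such as building search engine indexes, data mining, and machine learning. It has also been widely adopted through open-source implementations such as Apache Hadoop.

Summary: MapReduce hides much of the complexity of large-scale data processing, so that developers who are not specialists in parallel computing can process very large data sets. The combination of the Map and Reduce functions with an automated runtime makes it a practical and reliable basis for big data analysis.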
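The Map, Shuffle, and Reduce phases described above can be illustrated with a small single-process sketch. This is not Google's implementation: the names `run_mapreduce`, `map_word_count`, and `reduce_word_count` are hypothetical, and the word-count task and in-memory grouping are assumptions chosen for brevity.

```python
from collections import defaultdict
from typing import Callable, Iterable, Iterator


def map_word_count(doc_id: str, text: str) -> Iterator[tuple[str, int]]:
    """Map: emit an intermediate (word, 1) pair for every word in one document."""
    for word in text.lower().split():
        yield word, 1


def reduce_word_count(word: str, counts: Iterable[int]) -> int:
    """Reduce: merge all intermediate values that share the same key."""
    return sum(counts)


def run_mapreduce(
    inputs: Iterable[tuple[str, str]],
    map_fn: Callable[[str, str], Iterator[tuple[str, int]]],
    reduce_fn: Callable[[str, Iterable[int]], int],
) -> dict[str, int]:
    """Run Map, then Shuffle (group by key), then Reduce, all in one process."""
    # Shuffle: collect every intermediate value under its intermediate key.
    groups: dict[str, list[int]] = defaultdict(list)
    for key, value in inputs:
        for inter_key, inter_value in map_fn(key, value):
            groups[inter_key].append(inter_value)
    # Reduce: one call per distinct key, visited in sorted key order.
    return {key: reduce_fn(key, groups[key]) for key in sorted(groups)}


docs = [("d1", "the quick brown fox"), ("d2", "the lazy dog"), ("d3", "the fox")]
counts = run_mapreduce(docs, map_word_count, reduce_word_count)
```

In a real cluster the grouping step would move data between machines; here a dictionary stands in for it so the data flow stays visible.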

[Figure 1: Execution overview. The user program (1) forks a master and several workers; the master (2) assigns map and reduce tasks; map workers (3) read the input splits (split 0 to split 4) and (4) write intermediate files to local disks; reduce workers (5) read those files remotely and (6) write the output files (output file 0 and output file 1).]
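The numbered steps in Figure 1 can be traced with a single-process simulation. This is a minimal sketch under stated assumptions: the helper names (`partition`, `run_map_task`, `run_reduce_task`) are hypothetical, lists stand in for files, and the hash-modulo partitioning with `R = 2` is an illustrative choice rather than something shown in the figure.

```python
import zlib
from itertools import groupby
from operator import itemgetter

R = 2  # assumed number of reduce tasks, matching the two output files in Figure 1


def partition(key: str, r: int = R) -> int:
    """Choose a reduce task for a key (assumed scheme: stable hash modulo R)."""
    return zlib.crc32(key.encode("utf-8")) % r


def run_map_task(split, map_fn):
    """Steps (3) and (4): read one split, 'write' R partitioned intermediate regions."""
    regions = [[] for _ in range(R)]
    for record in split:
        for key, value in map_fn(record):
            regions[partition(key)].append((key, value))
    return regions


def run_reduce_task(r, all_regions, reduce_fn):
    """Steps (5) and (6): 'remote read' region r of every map task, sort, reduce."""
    pairs = [pair for regions in all_regions for pair in regions[r]]
    pairs.sort(key=itemgetter(0))  # sorting brings equal keys together
    return [
        (key, reduce_fn(key, [value for _, value in group]))
        for key, group in groupby(pairs, key=itemgetter(0))
    ]


def map_fn(line):
    """Example map function: emit (word, 1) for each word of a line."""
    for word in line.split():
        yield word, 1


def reduce_fn(key, values):
    """Example reduce function: sum the counts for one word."""
    return sum(values)


splits = [["a b a"], ["b c"], ["a c c"]]  # three input splits of one line each
intermediate = [run_map_task(split, map_fn) for split in splits]
output_files = [run_reduce_task(r, intermediate, reduce_fn) for r in range(R)]
merged = dict(pair for out in output_files for pair in out)
```

Each key lands in exactly one output file because the partition function depends only on the key, which is why the reduce tasks can run independently.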

Inverted Index: The map function parses each document, and emits a sequence of ⟨word, document ID⟩ pairs. The reduce function accepts all pairs for a given word, sorts the corresponding document IDs and emits a ⟨word, list(document ID)⟩ pair. The set of all output pairs forms a simple inverted index. It is easy to augment this computation to keep track of word positions.
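The inverted index computation can be sketched in a few lines. The function names and sample documents below are hypothetical, the grouping is done in memory rather than by a distributed runtime, and removing duplicate document IDs in the reduce step is an added assumption not stated in the text.

```python
from collections import defaultdict


def map_inverted_index(doc_id, text):
    """Map: parse one document and emit a (word, document ID) pair per word."""
    for word in text.lower().split():
        yield word, doc_id


def reduce_inverted_index(word, doc_ids):
    """Reduce: sort the document IDs for one word and emit (word, list of IDs).

    Duplicates are dropped so a word repeated in a document is listed once
    (an assumption of this sketch).
    """
    return word, sorted(set(doc_ids))


def build_index(documents):
    """Group the map output by word, then apply the reduce function per word."""
    groups = defaultdict(list)
    for doc_id, text in documents.items():
        for word, emitted_id in map_inverted_index(doc_id, text):
            groups[word].append(emitted_id)
    return dict(reduce_inverted_index(word, ids) for word, ids in groups.items())


documents = {3: "map reduce map", 1: "map shuffle reduce", 2: "shuffle"}
index = build_index(documents)
```

To track word positions as the text suggests, the map function could emit `(word, (doc_id, position))` instead, with the reduce function sorting those tuples.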

Distributed Sort: The map function extracts the key from each record, and emits a ⟨key, record⟩ pair. The reduce function emits all pairs unchanged. This computation depends on the partitioning facilities described in Section 4.1 and the ordering properties described in Section 4.2.
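The following sketch shows why the sort relies on partitioning and ordering: the reduce function does no work, so the result is sorted only if partitions cover increasing key ranges and each partition is processed in key order. The range partitioner, the `SPLIT_POINTS` boundaries, and the `id` field are hypothetical choices for illustration, since the sections the text refers to are not part of this excerpt.

```python
from bisect import bisect_right

SPLIT_POINTS = [100, 200]  # hypothetical key boundaries, giving 3 partitions


def map_sort(record):
    """Map: extract the key from a record and emit a (key, record) pair."""
    yield record["id"], record


def range_partition(key):
    """Assumed range partitioner: partition i holds only keys below partition i+1."""
    return bisect_right(SPLIT_POINTS, key)


def reduce_identity(key, records):
    """Reduce: emit every pair unchanged."""
    for record in records:
        yield key, record


def distributed_sort(records):
    """Partition the map output by key range, then reduce each partition in key order."""
    partitions = [[] for _ in range(len(SPLIT_POINTS) + 1)]
    for record in records:
        for key, value in map_sort(record):
            partitions[range_partition(key)].append((key, value))
    output_files = []
    for part in partitions:
        part.sort(key=lambda pair: pair[0])  # keys are visited in increasing order
        out = []
        for key, value in part:
            # The identity reduce is applied pair by pair, which is equivalent
            # to applying it per key group because it changes nothing.
            out.extend(reduce_identity(key, [value]))
        output_files.append(out)
    return output_files


records = [{"id": k} for k in (250, 17, 120, 99, 200, 5)]
files = distributed_sort(records)
sorted_ids = [key for out in files for key, _ in out]
```

Concatenating the output files in partition order yields the fully sorted data set; with a hash partitioner instead, each file would be sorted internally but the files could not simply be concatenated.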

3 Implementation

Many different implementations of the MapReduce interface are possible. The right choice depends on the environment. For example, one implementation may be suitable for a small shared-memory machine, another for a large NUMA multi-processor, and yet another for an even larger collection of networked machines.

This section describes an implementation targeted to the computing environment in wide use at Google: large clusters of commodity PCs connected together with switched Ethernet [4]. In our environment:

(1) Machines are typically dual-processor x86 processors running Linux, with 2-4 GB of memory per machine.

(2) Commodity networking hardware is used – typically either 100 megabits/second or 1 gigabit/second at the machine level, but averaging considerably less in overall bisection bandwidth.

(3) A cluster consists of hundreds or thousands of machines, and therefore machine failures are common.

(4) Storage is provided by inexpensive IDE disks attached directly to individual machines. A distributed file system [8] developed in-house is used to manage the data stored on these disks. The file system uses replication to provide availability and reliability on top of unreliable hardware.

(5) Users submit jobs to a scheduling system. Each job consists of a set of tasks, and is mapped by the scheduler to a set of available machines within a cluster.

3.1 Execution Overview

The Map invocations are distributed across multiple machines by automatically partitioning the input data

OSDI '04: 6th Symposium on Operating Systems Design and Implementation, USENIX Association
