Google MapReduce: A Big Data Processing Model


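"A technical article on MapReduce for big data processing, published by Google."

MapReduce is a programming model published by Google for processing and generating large data sets. It was proposed by Jeffrey Dean and Sanjay Ghemawat, both researchers at Google. Its core idea is to simplify complex distributed computation: users process massive data sets by defining only a map function and a reduce function.

Map phase: the user-defined map function receives input data, usually as key/value pairs, and converts it into a series of intermediate key/value pairs. This step can run in parallel, because each input pair is processed independently and does not depend on the result of any other pair.

Reduce phase: the intermediate key/value pairs produced by the map phase are grouped by key and passed to the reduce function. The reduce function merges all intermediate values associated with the same key and produces the final result. This phase can be used for aggregation, summarization, or filtering.

MapReduce is designed so that programmers without experience in parallel or distributed systems can still use the resources of a large distributed system. The runtime system automatically handles partitioning of the input data, scheduling of program execution, machine failures, and communication between machines. Developers therefore write only the business logic and do not deal with the underlying distributed details.

Google's implementation of MapReduce runs on large clusters of commodity hardware and is both scalable and fault tolerant. It automatically breaks a job into tasks, assigns them to different machines, and re-executes failed tasks when some nodes fail, which keeps the job reliable. The model fits many practical applications, such as building search engine indexes, data mining, and log analysis.

In summary, MapReduce is a key technique proposed by Google for processing big data. It simplifies distributed computation by hiding the complexity of the distributed system, so that developers can focus on business logic. The model and its implementation laid the groundwork for later big data frameworks such as Hadoop and strongly influenced modern cloud computing and big data processing.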
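The division of labor between the two user-defined functions can be illustrated with a small single-process sketch. It is only an illustration of the programming model, not Google's implementation: the names `map_fn`, `reduce_fn`, and `run_mapreduce` are hypothetical, the word-count task is chosen for the example, and the grouping step that a real runtime performs across machines is done here with an in-memory dictionary.

```python
from collections import defaultdict


def map_fn(key, value):
    """Hypothetical map function: emit (word, 1) for every word in a line."""
    for word in value.split():
        yield word.lower(), 1


def reduce_fn(key, values):
    """Hypothetical reduce function: sum all counts emitted for one word."""
    return key, sum(values)


def run_mapreduce(records, map_fn, reduce_fn):
    """Single-process stand-in for the runtime: map, group by key, reduce."""
    groups = defaultdict(list)
    # Map phase: each input pair is processed independently of the others.
    for key, value in records:
        for out_key, out_value in map_fn(key, value):
            groups[out_key].append(out_value)
    # Reduce phase: all intermediate values for one key are merged together.
    return dict(reduce_fn(k, vs) for k, vs in sorted(groups.items()))


# Input pairs are (line number, line text); both are made up for the example.
records = [(0, "the quick fox"), (1, "the lazy dog"), (2, "The end")]
word_counts = run_mapreduce(records, map_fn, reduce_fn)
print(word_counts)
```

Because `map_fn` never looks at more than one input pair, the loop over `records` is the part a distributed runtime could spread across many machines without changing the user code.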

Figure 1: Execution overview. The user program forks (1) the master and the workers; the master assigns (2) map and reduce tasks; map workers read (3) the input splits (split 0 to split 4) and write (4) intermediate files to their local disks; reduce workers read (5) those files remotely and write (6) the output files (output file 0, output file 1).

Inverted Index: The map function parses each document, and emits a sequence of ⟨word, document ID⟩ pairs. The reduce function accepts all pairs for a given word, sorts the corresponding document IDs and emits a ⟨word, list(document ID)⟩ pair. The set of all output pairs forms a simple inverted index. It is easy to augment this computation to keep track of word positions.
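The inverted index computation described above could be sketched as follows. This is a minimal in-memory illustration rather than the paper's code: `map_fn`, `reduce_fn`, and `build_inverted_index` are hypothetical names, documents are assumed to be plain strings split on whitespace, and each word is emitted at most once per document (an assumption made to keep the posting lists free of duplicates).

```python
from collections import defaultdict


def map_fn(doc_id, text):
    """Parse one document and emit a (word, document ID) pair per distinct word."""
    for word in set(text.lower().split()):
        yield word, doc_id


def reduce_fn(word, doc_ids):
    """Sort the document IDs for one word and emit (word, list of IDs)."""
    return word, sorted(doc_ids)


def build_inverted_index(documents):
    """Group the map output by word, then reduce each group."""
    groups = defaultdict(list)
    for doc_id, text in documents.items():
        for word, emitted_id in map_fn(doc_id, text):
            groups[word].append(emitted_id)
    return dict(reduce_fn(word, ids) for word, ids in groups.items())


# Hypothetical three-document corpus keyed by document ID.
documents = {3: "map reduce", 1: "map phase", 2: "reduce phase output"}
index = build_inverted_index(documents)
print(index["map"])
```

Tracking word positions, as the text suggests, would amount to emitting (word, (document ID, position)) from `map_fn` and sorting those tuples in `reduce_fn`.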

Distributed Sort: The map function extracts the key from each record, and emits a ⟨key, record⟩ pair. The reduce function emits all pairs unchanged. This computation depends on the partitioning facilities described in Section 4.1 and the ordering properties described in Section 4.2.
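To see why an identity reduce function can still produce sorted output, the sketch below simulates the two properties the text relies on. It rests on assumptions that are not spelled out in this excerpt: a hypothetical range partitioner (`partition`, driven by a made-up `boundaries` list) sends each key to the partition covering its range, and each partition is then processed in increasing key order. The record layout, with the sort key as the first field, is also invented for the example.

```python
from bisect import bisect_right
from itertools import groupby


def map_fn(record):
    """Extract the sort key (assumed to be the first field) and emit (key, record)."""
    return record[0], record


def reduce_fn(key, records):
    """Identity reduce: emit every pair for this key unchanged."""
    for record in records:
        yield key, record


def partition(key, boundaries):
    """Hypothetical range partitioner: index of the key range that contains `key`."""
    return bisect_right(boundaries, key)


def distributed_sort(records, boundaries):
    partitions = [[] for _ in range(len(boundaries) + 1)]
    for record in records:
        key, value = map_fn(record)
        partitions[partition(key, boundaries)].append((key, value))
    output = []
    for part in partitions:
        # Stand-in for the ordering property: keys within a partition
        # are handed to reduce in increasing order.
        part.sort(key=lambda pair: pair[0])
        for key, group in groupby(part, key=lambda pair: pair[0]):
            output.extend(reduce_fn(key, [value for _, value in group]))
    return output


records = [(42, "d"), (7, "a"), (99, "e"), (15, "b"), (30, "c")]
sorted_pairs = distributed_sort(records, boundaries=[20, 50])
print([key for key, _ in sorted_pairs])
```

With range partitioning, every key in partition i is smaller than every key in partition i + 1, so concatenating the per-partition outputs in order gives a globally sorted result. A hash partitioner would only give sorted runs within each partition.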

3 Implementation

Many different implementations of the MapReduce interface are possible. The right choice depends on the environment. For example, one implementation may be suitable for a small shared-memory machine, another for a large NUMA multi-processor, and yet another for an even larger collection of networked machines.

This section describes an implementation targeted to the computing environment in wide use at Google: large clusters of commodity PCs connected together with switched Ethernet [4]. In our environment:

(1) Machines are typically dual-processor x86 processors running Linux, with 2-4 GB of memory per machine.

(2) Commodity networking hardware is used – typically either 100 megabits/second or 1 gigabit/second at the machine level, but averaging considerably less in overall bisection bandwidth.

(3) A cluster consists of hundreds or thousands of machines, and therefore machine failures are common.

(4) Storage is provided by inexpensive IDE disks attached directly to individual machines. A distributed file system [8] developed in-house is used to manage the data stored on these disks. The file system uses replication to provide availability and reliability on top of unreliable hardware.

(5) Users submit jobs to a scheduling system. Each job consists of a set of tasks, and is mapped by the scheduler to a set of available machines within a cluster.

3.1 Execution Overview

The Map invocations are distributed across multiple machines by automatically partitioning the input data

To appear in OSDI 2004
