MapReduce：大规模集群简化数据处理

需积分: 14 158 浏览量更新于2024-09-20 收藏 186KB PDF 举报

MapReduce: Simplified Data Processing on Large Clusters MapReduce是一种编程模型和相应的实现，专为处理和生成大规模数据集而设计。它由Jeffrey Dean和Sanjay Ghemawat在Google公司开发，旨在简化在大型集群上进行复杂数据处理的过程。这个模型的核心思想是将复杂的任务分解为两个主要步骤：map（映射）和reduce（规约）。 1. **Map Function**: 用户编写map函数，它接收一个键值对作为输入，然后通过执行特定的操作将其转换为一系列中间键值对。在这个过程中，数据被分解成更小、易于处理的部分，便于分布式环境中的并行处理。这种函数式编程风格使得开发者无需深入理解并行和分布式系统的技术细节，就能利用集群资源。 2. **Reduce Function**: map阶段完成后，reduce函数将所有具有相同中间键的值进行合并，产生最终的结果。这个过程通常涉及聚合操作，如求和、计数或平均，将数据从分布式状态汇总到单一结果。 3. **自动并行化和调度**：MapReduce框架负责底层的复杂性管理，包括数据划分（partitioning）、任务调度、错误处理（如机器故障）以及跨机器间的通信。这确保了程序能够高效地在多台机器上并行执行，提高处理速度和吞吐量。 4. **平台兼容性**：MapReduce能够在大型的、由普通计算机组成的集群上运行，这意味着它不仅仅局限于特定硬件或操作系统，而是具有广泛的应用范围，使得大规模数据处理变得容易且经济。 5. **易用性**：对于没有分布式系统背景的程序员，MapReduce提供了一种友好的接口，使他们能够利用集群的强大能力，无需关注底层的复杂性，只需关注业务逻辑的实现。 6. **广泛应用领域**：论文指出，许多现实世界的任务，如网页抓取、数据分析、机器学习等，都可以通过MapReduce模型有效地表达和执行。 MapReduce作为一种强大的数据处理工具，通过将复杂的任务分解为简单的函数，并利用集群的分布式特性，极大地降低了大规模数据处理的复杂性和难度，使得非专家也能轻松实现高性能计算。其广泛的应用和易用性使其成为现代IT行业处理大数据的标准方法之一。

User

Program

Master

(1) fork

worker

(1) fork

worker

(1) fork

(2)

assign

map

(2)

assign

reduce

split 0

split 1

split 2

split 3

split 4

output

file 0

(6) write

worker

(3) read

worker

(4) local write

Map

phase

Intermediate files

(on local disks)

worker

output

file 1

Input

files

(5) remote read

Reduce

phase

Output

files

Figure 1: Execution overview

Inverted Index: The map function parses each docu-

ment, and emits a sequence of hword, document IDi

pairs. The reduce function accepts all pairs for a given

word, sorts the corresponding document IDs and emits a

hword, list(document ID)i pair. The set of all output

pairs forms a simple inverted index. It is easy to augment

this computation to keep track of word positions.

Distributed Sort: The map function extracts the key

from each record, and emits a hkey, recordi pair. The

reduce function emits all pairs unchanged. This compu-

tation depends on the partitioning facilities described in

Section 4.1 and the ordering properties described in Sec-

tion 4.2.

3 Implementation

Many different implementations of the MapReduce in-

terface are possible. The right choice depends on the

environment. For example, one implementation may be

suitable for a small shared-memory machine, another for

a large NUMA multi-processor, and yet another for an

even larger collection of networked machines.

This section describes an implementation targeted

to the computing environment in wide use at Google:

large clusters of commodity PCs connected together with

switched Ethernet [4]. In our environment:

(1) Machines are typically dual-processor x86 processors

running Linux, with 2-4 GB of memory per machine.

(2) Commodity networking hardware is used – typically

either 100 megabits/second or 1 gigabit/second at the

machine level, but averaging considerably less in over-

all bisection bandwidth.

(3) A cluster consists of hundreds or thousands of ma-

chines, and therefore machine failures are common.

(4) Storage is provided by inexpensive IDE disks at-

tached directly to individual machines. A distributed ﬁle

system [8] developed in-house is used to manage the data

stored on these disks. The ﬁle system uses replication to

provide availability and reliability on top of unreliable

hardware.

(5) Users submit jobs to a scheduling system. Each job

consists of a set of tasks, and is mapped by the scheduler

to a set of available machines within a cluster.

3.1 Execution Overview

The Map invocations are distributed across multiple

machines by automatically partitioning the input data

To appear in OSDI 2004 3

剩余12页未读，继续阅读

kyleyang8848

粉丝: 0
资源: 7

MapReduce：大规模集群简化数据处理

论文：MapReduce: Simplified Data Processing on Large Clusters

MapReduce: Simplified Data Processing on Large Clusters 英文原文

MapReduce: Simplified Data Processing on Large Clusters中文版

MapReduce: Simplified Data Processing on Large Clusters翻译

MapReduce: Simplified Data Processing on Large Clusters.pdf

MapReduce_ Simplified Data Processing on Large Clusters.pdf

MapReduce-Simplified Data Processing on Large Clusters.pdf

MapReduce Simplified Data Processing on Large Clusters.pdf

MapReduce_Simplified_Data_Processing_on_Large_Clusters

MapReduce-Simplified_Data_Processing_on_Large_Clusters中文版（免积分下载）

最新资源