While in the map phase every map task processes various ⟨key; value⟩ pairs, in the reduce phase
all the values for a given key are processed by a single reduce task. This is achieved by logically
partitioning the secondary memory of each worker processing a map task into r partitions, and then
determining in which particular partition an output pair should be stored, a process accomplished
by the shuffle step. This step can be viewed as a data routing step, determining which reduce
task will process a data pair based on the pair's key. It is performed by the workers while
processing the map tasks. Typically, a function such as (hash(key) mod r) is used, where the hash
function is a simple function, computable in a small constant time, that maps the keys to a more
manageable domain. Other partitioning functions can be defined by the user, especially if the keys
are numeric, such as partitioning the keys into r logical partitions representing various ranges
of values.
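As an illustration, a minimal partitioning sketch in Python is given below; the function names
hash_partition and range_partition, and the use of Python's built-in hash, are assumptions made
for the example rather than part of the framework.

    def hash_partition(key, r):
        # Map an arbitrary key into one of the r partitions, one per reduce task.
        # Python's built-in hash stands in for the framework's hash function.
        return hash(key) % r

    def range_partition(key, r, max_key):
        # A user-defined alternative for numeric keys: split the key space
        # [0, max_key) into r equal-width ranges, one per reduce task.
        return min(int(key * r / max_key), r - 1)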
When all the map tasks have finished, the r reduce tasks are assigned to the available workers
using the same process as for the map tasks. Each reduce task accesses the data assigned to it,
which is stored across the workers responsible for computing the q map tasks. All the pairs with
the same key are stored in the same partition, and each partition can hold pairs with different
keys. All this data is sorted by key and combined such that all values associated with a key are
grouped together in a single ⟨key; value⟩ pair. This is sometimes considered to be a second part
of the shuffle step.
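The grouping just described could be sketched as follows; group_partition is a hypothetical helper
that sorts one reduce task's partition by key and collapses equal keys into a single pair, shown
only to make the step concrete.

    from itertools import groupby
    from operator import itemgetter

    def group_partition(pairs):
        # pairs: list of (key, value) tuples belonging to one reduce task's partition.
        pairs.sort(key=itemgetter(0))                  # sort by key
        return [(k, [v for _, v in grp])               # group all values for each key
                for k, grp in groupby(pairs, key=itemgetter(0))]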
Each reduce task then reads its assigned data and processes it one ⟨key; value⟩ pair at a time
using the reduce function. The task's output is then written to global memory, where it can either
be the final output of the algorithm or serve as input to a new round of MapReduce. The input for
a new round of MapReduce is partitioned into q parts by the master processor and the process just
described is repeated.
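A possible sketch of the master's between-round partitioning is shown below; split_into_parts is a
hypothetical helper, and the round-robin distribution is an arbitrary choice made for illustration.

    def split_into_parts(pairs, q):
        # Master step between rounds: partition the round's input pairs into
        # q parts, one per map task, distributing them round-robin.
        parts = [[] for _ in range(q)]
        for i, pair in enumerate(pairs):
            parts[i % q].append(pair)
        return parts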
The relationship between the system processors and the map and reduce tasks leads to some
interesting aspects of the MapReduce framework. A number of map and reduce tasks can be
performed, in sequence, by a single processor in each round. In each of the map and reduce
phases, tasks are assigned to workers as they finish their previously assigned task. Therefore, if
the computation is divided equally between tasks, then every processor will perform about q/p map
tasks and r/p reduce tasks. If, on the other hand, the computation time differs between tasks,
then load balancing is achieved automatically, since a worker that finishes early simply picks up
the next available task. This assignment strategy also allows fault tolerance to be handled
efficiently.
However, the task assignment strategy also places some limitations on the framework. Between
rounds the data is split up into q parts and each part is assigned to a map task. After any task
finishes, the worker's primary memory is cleared, so data cannot be associated with a single
processor and accessed at will in different rounds. Therefore, any data that is required in
multiple rounds must be explicitly stored in global memory.
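The greedy assignment strategy described above can be made concrete with a small simulation;
assign_tasks and the priority queue of worker finish times are illustrative assumptions, not a
description of an actual scheduler. With similar task costs each of the p workers ends up with
about the same number of tasks; with uneven costs the faster-finishing workers absorb more tasks.

    import heapq

    def assign_tasks(task_costs, p):
        # Each of the p workers pulls the next unassigned task as soon as it
        # finishes its current one. The heap holds (time worker becomes free, id).
        workers = [(0, w) for w in range(p)]
        heapq.heapify(workers)
        counts = [0] * p
        for cost in task_costs:
            free_at, w = heapq.heappop(workers)   # earliest-free worker takes the task
            counts[w] += 1
            heapq.heappush(workers, (free_at + cost, w))
        return counts                             # roughly q/p tasks each when costs are similar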
Given a multiset of n ⟨key; value⟩ pairs as input, the above process describing a single
MapReduce round is defined by two functions, map and reduce, together with the shuffle step.
These are defined as follows:
– Given a single input pair from the multiset of the round's input pairs
  {⟨k₁; v₁⟩, ⟨k₂; v₂⟩, . . . , ⟨kₙ; vₙ⟩}, the map function performs some computation to produce a
  new intermediate multiset of ⟨key; value⟩ pairs {⟨l₁; w₁⟩, ⟨l₂; w₂⟩, . . . , ⟨lₘ; wₘ⟩}.
– The union of all the intermediate multisets produced by the map functions is acted upon in
  the shuffle step. All the pairs with the same key lᵢ are combined to produce a new set of lists
  of the form ⟨lᵢ; w₁, w₂, . . .⟩.
– Each of the lists produced by the shuffle step is passed to a separate reduce function that
  performs some computation to produce a new list ⟨jᵢ; x₁, x₂, . . .⟩.
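To tie the three definitions together, the sketch below simulates one MapReduce round in Python on
a word-counting example; word_map, count_reduce and map_reduce_round are hypothetical names, and
the in-memory grouping stands in for the distributed shuffle step.

    from collections import defaultdict

    def word_map(key, value):
        # Map function: emit one (word, 1) pair per word in the input line.
        return [(word, 1) for word in value.split()]

    def count_reduce(key, values):
        # Reduce function: combine all counts for a word into a single pair.
        return (key, sum(values))

    def map_reduce_round(input_pairs, map_fn, reduce_fn):
        # Map phase: every input pair produces an intermediate multiset of pairs.
        intermediate = []
        for k, v in input_pairs:
            intermediate.extend(map_fn(k, v))
        # Shuffle step: group all values sharing a key l_i into one list.
        groups = defaultdict(list)
        for l, w in intermediate:
            groups[l].append(w)
        # Reduce phase: each grouped list is processed by a separate reduce call.
        return [reduce_fn(l, ws) for l, ws in groups.items()]

    # Example usage:
    # map_reduce_round([(1, "to be or not to be")], word_map, count_reduce)
    # -> [('to', 2), ('be', 2), ('or', 1), ('not', 1)]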