MapReduce: Simplified Data Processing on Large Clusters
by Jeffrey Dean and Sanjay Ghemawat
Abstract
MapReduce is a programming model and an associated implementation for processing and generating large datasets that is amenable to a broad variety of real-world tasks. Users specify the computation in terms of a map and a reduce function, and the underlying runtime system automatically parallelizes the computation across large-scale clusters of machines, handles machine failures, and schedules inter-machine communication to make efficient use of the network and disks. Programmers find the system easy to use: more than ten thousand distinct MapReduce programs have been implemented internally at Google over the past four years, and an average of one hundred thousand MapReduce jobs are executed on Google's clusters every day, processing a total of more than twenty petabytes of data per day.
1 Introduction
Prior to our development of MapReduce, the authors and many others
at Google implemented hundreds of special-purpose computations that processed large amounts of raw data, such as crawled documents, Web
request logs, etc., to compute various kinds of derived data, such as
inverted indices, various representations of the graph structure of Web
documents, summaries of the number of pages crawled per host, and
the set of most frequent queries in a given day. Most such computations are conceptually straightforward. However, the input data is usually large and the computations have to be distributed across hundreds or thousands of machines in order to finish in a reasonable amount of time. The issues of how to parallelize the computation, distribute the data, and handle failures conspire to obscure the original simple computation with large amounts of complex code.
As a reaction to this complexity, we designed a new abstraction that
allows us to express the simple computations we were trying to perform
but hides the messy details of parallelization, fault tolerance, data distribution, and load balancing in a library. Our abstraction is inspired by the map and reduce primitives present in Lisp and many other functional languages. We realized that most of our computations involved applying a map operation to each logical record in our input in order to compute a
set of intermediate key/value pairs, and then applying a reduce operation
to all the values that shared the same key in order to combine the derived
data appropriately. Our use of a functional model with user-specified map
and reduce operations allows us to parallelize large computations easily
and to use reexecution as the primary mechanism for fault tolerance.
The major contributions of this work are a simple and powerful
interface that enables automatic parallelization and distribution of
large-scale computations, combined with an implementation of this
interface that achieves high performance on large clusters of commodity PCs. The programming model can also be used to parallelize
computations across multiple cores of the same machine.
Section 2 describes the basic programming model and gives several
examples. In Section 3, we describe an implementation of the MapReduce interface tailored towards our cluster-based computing environment. Section 4 describes several refinements of the programming model that we have found useful. Section 5 has performance measurements of our implementation for a variety of tasks. In Section 6, we explore the use of MapReduce within Google, including our experiences in using it as the basis for a rewrite of our production indexing system. Section 7 discusses related and future work.
2 Programming Model
The computation takes a set of input key/value pairs, and produces a
set of output key/value pairs. The user of the MapReduce library
expresses the computation as two functions: map and reduce.
Map, written by the user, takes an input pair and produces a set of
intermediate key/value pairs. The MapReduce library groups together
all intermediate values associated with the same intermediate key I
and passes them to the reduce function.
The reduce function, also written by the user, accepts an intermediate key I and a set of values for that key. It merges these values
together to form a possibly smaller set of values. Typically just zero or
one output value is produced per reduce invocation. The intermediate
values are supplied to the user’s reduce function via an iterator. This
allows us to handle lists of values that are too large to fit in memory.
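To make the library's role concrete, here is a minimal single-process sketch of this execution model in C++. It is illustrative only: the KV, MapFn, ReduceFn, and RunMapReduce names are placeholders and not the interface of the actual MapReduce library. The sketch applies the user's map function to every input pair, groups the intermediate values by key, and then invokes the user's reduce function once per distinct intermediate key.

  #include <functional>
  #include <map>
  #include <string>
  #include <utility>
  #include <vector>

  // Illustrative types: keys and values are strings, as in the word-count example.
  using KV = std::pair<std::string, std::string>;
  using MapFn = std::function<void(const KV& input, std::vector<KV>* intermediate)>;
  using ReduceFn = std::function<void(const std::string& key,
                                      const std::vector<std::string>& values,
                                      std::vector<std::string>* output)>;

  // Runs one MapReduce computation sequentially: map every input pair, group
  // intermediate values by key, then reduce each group.
  std::map<std::string, std::vector<std::string>> RunMapReduce(
      const std::vector<KV>& inputs, const MapFn& map_fn, const ReduceFn& reduce_fn) {
    // Map phase: produce intermediate key/value pairs.
    std::vector<KV> intermediate;
    for (const KV& input : inputs) map_fn(input, &intermediate);

    // Grouping phase: collect all values that share an intermediate key.
    std::map<std::string, std::vector<std::string>> grouped;
    for (const KV& kv : intermediate) grouped[kv.first].push_back(kv.second);

    // Reduce phase: one invocation per distinct intermediate key.
    std::map<std::string, std::vector<std::string>> results;
    for (const auto& [key, values] : grouped) reduce_fn(key, values, &results[key]);
    return results;
  }

In the real system the map and reduce invocations run in parallel across many machines, and the values for a key are streamed to reduce through an iterator rather than materialized in a single in-memory vector, but for deterministic map and reduce functions the result is intended to match this sequential version.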
2.1 Example
Consider the problem of counting the number of occurrences of each
word in a large collection of documents. The user would write code
similar to the following pseudocode:
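  map(String key, String value):
    // key: document name
    // value: document contents
    for each word w in value:
      EmitIntermediate(w, "1");

  reduce(String key, Iterator values):
    // key: a word
    // values: a list of counts
    int result = 0;
    for each v in values:
      result += ParseInt(v);
    Emit(AsString(result));

The map function emits each word plus an associated count of occurrences (just 1 in this simple example). The reduce function sums together all counts emitted for a particular word.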
Biographies
Jeff Dean (jeff@google.com) is a Google Fellow and is currently working on a wide variety of large-scale distributed systems at Google's Mountain View, CA, facility.
Sanjay Ghemawat (sanjay@google.com) is a Google Fellow and works on the distributed computing infrastructure used by most of the company's products. He is based at Google's Mountain View, CA, facility.