MapReduce: Simplified Data
Processing on Large Clusters
Jeffrey Dean and Sanjay Ghemawat
Google, Inc.
OSDI ’04: 6th Symposium on Operating
Systems Design and Implementation
What Is It?
• “. . . A programming model and an
associated implementation for processing
and generating large data sets.”
• Google's implementation runs on a typical Google
cluster: a large number of commodity machines
connected by switched Ethernet, with inexpensive
disks attached directly to each machine in the
cluster.
Motivation
• Data-intensive applications
• Huge amounts of data, fairly simple
processing requirements, but …
• For efficiency, parallelize
• MapReduce is designed to simplify
parallelization and distribution so
programmers don’t have to worry about
details.
Advantages of Parallel
Programming
• Improves performance and efficiency.
• Divide processing into several parts which
can be executed concurrently.
• Each part can run simultaneously on
different CPUs in a single machine, or on
the CPUs of a set of computers connected
via a network.
Programming Model
• The model is “inspired by” Lisp primitives
map and reduce.
• map applies the same operation to several
different data items; e.g.,
(mapcar #'abs '(3 -4 2 -5)) => (3 4 2 5)
• reduce applies a single operation to a set of
values to get a result; e.g.,
(reduce #'+ '(3 4 2 5)) => 14
Programming Model
• MapReduce was developed by Google to
process large amounts of raw data, for
example, crawled documents or web
request logs.
• There is so much data it must be
distributed across thousands of machines
in order to be processed in a reasonable
time.
Programming Model
• Input & Output: a set of key/value pairs
• The programmer supplies two functions:
• map (in_key, in_value) =>
list(intermediate_key, intermediate_value)
• reduce (intermediate_key,
list(intermediate_value)) =>
list(out_value)
• The program takes a set of input key/value pairs
and merges all the intermediate values for a
given key into a smaller set of final values.
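• A minimal sketch of these two signatures as Python type aliases (the names are illustrative assumptions; the paper's actual interface is a C++ library):

    from typing import Callable, List, Tuple

    # map:    (in_key, in_value) -> list of (intermediate_key, intermediate_value)
    # reduce: (intermediate_key, list of intermediate_value) -> list of out_value
    MapFn = Callable[[str, str], List[Tuple[str, str]]]
    ReduceFn = Callable[[str, List[str]], List[str]]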
Example: Count occurrences of words in a
set of files
• Map function: for each word in each file, count
occurrences
• Input_key: file name; Input_value: file contents
• Intermediate results: for each file, a list of words
and frequency counts
– out_key = a word; int_value = word count in this file
• Reduce function: for each word, sum its
occurrences over all files
• Input key: a word; Input value: a list of counts
• Final results: A list of words, and the number of
occurrences of each word in all the files.
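• A minimal Python sketch of this word-count example, assuming whitespace tokenization (the paper gives it as pseudocode over Google's C++ library):

    def wordcount_map(filename, contents):
        # in_key = file name, in_value = file contents
        counts = {}
        for word in contents.split():
            counts[word] = counts.get(word, 0) + 1
        # emit one (word, count-in-this-file) pair per distinct word
        return list(counts.items())

    def wordcount_reduce(word, per_file_counts):
        # per_file_counts = the counts emitted for this word by every map task
        return [sum(per_file_counts)]

• For example, wordcount_map("a.txt", "to be or not to be") emits ("to", 2), ("be", 2), ("or", 1), and ("not", 1).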
Other Examples
• Distributed Grep: find all occurrences of a
pattern supplied by the programmer
– Input: the pattern and set of files
• key = pattern (regexp), data = a file name
– Map function: run grep with the pattern over each file
– Intermediate results: lines in which the pattern
appeared, keyed to files
• key = file name, data = line
– Reduce function is the identity function:
passes on the intermediate results
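• A Python sketch of distributed grep in the same spirit; here the pattern is passed as an extra argument for readability, though in a real job it would be fixed when the computation is configured:

    import re

    def grep_map(filename, contents, pattern):
        # emit (file name, line) for every line that matches the pattern
        return [(filename, line)
                for line in contents.splitlines()
                if re.search(pattern, line)]

    def grep_reduce(filename, matching_lines):
        # identity reduce: pass the intermediate results through unchanged
        return matching_lines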
Other Examples
• Count URL Access Frequency
– Map function: counts requests for each URL in a
log of requests
• intermediate key: a URL; input data: a request log
– Intermediate results: (URL, count for this log)
pairs
– Reduce function: combines URL count for all
logs and emits (URL, total_count)
Implementation
• More than one way to implement
MapReduce, depending on environment
• Google chooses to use the same
environment that it uses for the GFS: large
(~1000 machines) clusters of PCs with
attached disks, based on 100 megabit/sec
or 1 gigabit/sec Ethernet.
• Batch environment: user submits job to a
scheduler (Master)
Implementation
• Job scheduling:
– User submits job to scheduler (one program
consists of many tasks)
– scheduler assigns tasks to machines.
General Approach
• The MASTER:
– initializes the problem; divides it up among a
set of workers
– sends each worker a portion of the data
– receives the results from each worker
• The WORKER:
– receives data from the master
– performs processing on its part of the data
– returns results to master
Overview
• The Map invocations are distributed
across multiple machines by automatically
partitioning the input data into a set of M
splits or shards.
• The worker process parses the input to
identify the key/value pairs and passes
them to the Map function (defined by the
programmer).
Overview
• The input shards can be processed in
parallel on different machines.
– It’s essential that the Map function be able to
operate independently – what happens on
one machine doesn’t depend on what
happens on any other machine.
• Intermediate results are stored on local
disks, partitioned into R regions as
determined by the user’s partitioning
function. (R <= # of output keys)
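• The paper's example of a default partitioning function is hash(key) mod R; a Python sketch, using a stable hash so a given key always maps to the same region:

    import zlib

    def default_partition(intermediate_key, R):
        # route this intermediate key to one of the R reduce regions
        return zlib.crc32(intermediate_key.encode("utf-8")) % R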
Overview
• The number of partitions (R) and the
partitioning function are specified by the user.
• Map workers notify Master of the location of the
intermediate key-value pairs; the master
forwards the addresses to the reduce workers.
• Reduce workers use RPC to read the data
remotely from the map workers and then
process it.
• Each reduction takes all the values associated
with a single key and reduces them to one or
more results.
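• A single-process sketch of what one reduce worker does with the data it fetches: sort the region's intermediate pairs by key so equal keys are adjacent, group the values, and call the user's reduce once per key (the paper notes that an external sort is used when the data does not fit in memory):

    from itertools import groupby
    from operator import itemgetter

    def reduce_region(intermediate_pairs, reduce_fn):
        # intermediate_pairs: every (key, value) pair routed to this reduce region
        output = []
        for key, group in groupby(sorted(intermediate_pairs, key=itemgetter(0)),
                                  key=itemgetter(0)):
            values = [value for _, value in group]
            output.extend(reduce_fn(key, values))   # one reduce call per distinct key
        return output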
Example
• In the word-count app, a worker emits a
list of word-frequency pairs; e.g. (a,
100), (an, 25), (ant, 1), …
• out_key = a word; value = word count
for some file
• All the results for a given out_key are
passed to a reduce worker for the next
processing phase.
Overview
• Final results are appended to an output file
that is part of the global file system.
• When all map and reduce tasks are complete,
the master wakes up the user program and
the MapReduce call returns.
Fault Tolerance
• Important: MapReduce relies on hundreds,
even thousands, of machines, so failures
are inevitable.
• Periodically, the master pings workers.
• Workers that don’t respond in a pre-
determined amount of time are considered
to have failed.
• Any map task or reduce task in progress
on a failed worker is reset to idle and
becomes eligible for rescheduling.
Fault Tolerance
• Any map tasks completed by the failed worker
are also reset to the idle state and become
eligible for rescheduling on other workers.
• Reason: since the results are stored on the
disk of the failed machine, they are
inaccessible.
• Completed reduce tasks on failed machines
don’t need to be redone because output
goes to a global file system.
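• A hedged sketch of the master's bookkeeping on a worker failure (the data structures and field names are illustrative assumptions, not the paper's code): completed map tasks are re-queued because their output lived on the failed worker's local disk, while completed reduce tasks stay finished because their output is already in the global file system.

    def handle_worker_failure(failed_worker, map_tasks, reduce_tasks):
        # each task is a dict with a "state" ("idle", "in_progress", "completed")
        # and the "worker" it was assigned to
        for task in map_tasks:
            if task["worker"] == failed_worker and task["state"] != "idle":
                task["state"] = "idle"   # map output on the local disk is lost; redo
                task["worker"] = None
        for task in reduce_tasks:
            if task["worker"] == failed_worker and task["state"] == "in_progress":
                task["state"] = "idle"   # completed reduce output is safe in GFS
                task["worker"] = None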
Failure of the Master
• Regular checkpoints of all the Master’s
data structures would make it possible to
roll back to a known state and start again.
• However, since there is only one master, its
failure is highly unlikely, so the current
approach is simply to abort the MapReduce
computation if the master fails.
Locality
• Recall the Google File System (GFS) implementation:
• Files are divided into 64 MB blocks, and each
block is replicated on several machines
(typically three).
• The Master knows the location of the input data
and tries to schedule map operations on
machines that already hold the necessary input.
If that's not possible, it schedules them on a
nearby machine (e.g., one on the same network
switch) to reduce network traffic.
Task Granularity
• Map phase is subdivided into M pieces
and the reduce phase into R pieces.
• Objective: make M and R much larger than
the number of worker machines.
– Improves dynamic load balancing
– Speeds up recovery in case of failure; failed
machine’s many completed map tasks can be
spread out across all other workers.
Task Granularity
• Practical limits on size of M and R:
– Master must make O(M + R) scheduling
decisions and store O(M * R) states
– Users typically restrict size of R, because the
output of each reduce worker goes to a
different output file
– Authors say they “often” set M = 200,000 and
R = 5,000. Number of workers = 2,000.
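– Worked example with the figures above: the master
makes on the order of M + R = 200,000 + 5,000 =
205,000 scheduling decisions, but must keep on the
order of M × R = 200,000 × 5,000 = 1,000,000,000
pieces of per-task-pair state, which is why M and R
cannot grow without bound.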
“Stragglers”
• A machine that takes an unusually long time to
complete one of the last few map or reduce
tasks in the computation.
– Causes: a bad disk that slows read operations,
other tasks scheduled on the same machine
competing for resources, etc.
– Solution: assign stragglers' unfinished tasks as
backup executions on machines that have
finished their own work; use the results from
whichever copy (the original or the backup)
finishes first.
Experience
• Google used MapReduce to rewrite the indexing
system that constructs the Google search engine
data structures.
• Input: GFS documents retrieved by the web
crawlers – about 20 terabytes of data.
• Benefits
– Simpler, smaller, more readable indexing code
– Many problems, such as machine failures, are dealt
with automatically by the MapReduce library.
Conclusions
• Easy to use. Programmers are shielded
from the problems of parallel processing
and distributed systems.
• Can be used for many classes of problems,
including generating data for the search
engine, sorting, data mining, machine
learning, and others.
• Scales to clusters of thousands of
machines.
• But ….
Not everyone agrees that MapReduce is
wonderful!
• The database community believes parallel
database systems are a better solution.