Lecture 2.1
MapReduce
Most of the content about MapReduce in this set of slides is borrowed from a presentation
by Michael Kleber of Google, dated January 14, 2008.
Part 2 Reference Texts
• Tom White, Hadoop: The Definitive Guide: Storage and Analysis at Internet Scale (4th Edition), O'Reilly Media, April 11, 2015, ISBN: 978-1491901632
• Jimmy Lin and Chris Dyer, Data-Intensive Text Processing with MapReduce, Morgan and Claypool Publishers, April 30, 2010, ISBN: 978-1608453429
• Jason Swartz, Learning Scala, O'Reilly Media, December 8, 2014, ISBN: 978-1449368814
• Holden Karau, Andy Konwinski, Patrick Wendell, and Matei Zaharia, Learning Spark: Lightning-Fast Big Data Analysis, O'Reilly Media, February 27, 2015, ISBN: 978-1449358624
• Bill Chambers and Matei Zaharia, Spark: The Definitive Guide: Big Data Processing Made Simple, O'Reilly Media, March 8, 2018, ISBN: 978-1491912218
Why do we want to use MapReduce?
• MapReduce is a distributed computing paradigm
• Distributed computing is hard
• Do we really need to do distributed computing?
Yes: some problems are simply too big for a single computer.
Distributed computing is hard
• Bad news I: programming work
Communication and coordination
Recovering from machine failures (which happen all the time!)
Status reporting
Debugging
Optimization
Data locality
• Bad news II: all of this work must be repeated for every problem you want to solve
MapReduce
• A simple programming model that can be applied to many large-scale
computing problems
• Hide messy details in MapReduce runtime library
Automatic parallelization
Load balancing
Network and disk transfer optimization
Automatic handling of machine failures
Robustness
Improvements to the core library benefit all users of the library!
Typical flow of problem solving by MapReduce
1. Read a lot of data
2. Map: extract something you care about from each record
3. (Hidden) shuffle and sort
4. Reduce: aggregate, summarize, filter, or transform
5. Write the results
• The above outline stays the same across different problems
Only Map and Reduce change to fit the particular problem (a minimal sketch of this flow follows below)
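To make the flow concrete, here is a minimal single-machine sketch in Python. It is purely illustrative (not the actual MapReduce or Hadoop API): a hypothetical run_mapreduce driver reads key/value records, applies a user-supplied map function, performs the hidden shuffle/sort by grouping values per key, applies a user-supplied reduce function, and returns the results.

from collections import defaultdict

def run_mapreduce(records, map_fn, reduce_fn):
    # Steps 1-2: read the input and Map each (key, value) record into intermediate pairs
    intermediate = []
    for key, value in records:
        intermediate.extend(map_fn(key, value))
    # Step 3: (hidden) shuffle and sort -- group all intermediate values by key
    groups = defaultdict(list)
    for key, value in intermediate:
        groups[key].append(value)
    # Steps 4-5: Reduce the values for each key and "write" (here: return) the results
    return {key: reduce_fn(key, values) for key, values in sorted(groups.items())}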
MapReduce paradigm
• Basic data type: the key-value pair (k, v)
For example, key = URL, value = HTML of the web page
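For instance, a single such pair could be written as follows in Python (a hypothetical illustration; the URL and HTML are made up):

# key = URL of the page, value = HTML contents of that page
pair = ("http://example.com/hamlet.html", "<html><body>to be or not to be</body></html>")
key, value = pair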
Example: word frequencies (or word count) in web pages
• Considered the “Hello, World!” example of cloud computing
• Input: files with one document per record
• Specify a map function that takes a key/value pair
key = document URL
value = document contents
For the document with contents “to be or not to be”, the map function emits one (word, 1) pair per word:
“to”, “1”
“be”, “1”
“or”, “1”
“not”, “1”
“to”, “1”
“be”, “1”
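A minimal sketch of such a map function in Python (illustrative only, not the Hadoop API): it ignores the document URL and emits a (word, 1) pair for every word in the document contents.

def word_count_map(key, value):
    # key: document URL (unused here), value: document contents
    return [(word, 1) for word in value.split()]

# word_count_map("http://example.com/hamlet.html", "to be or not to be")
# -> [("to", 1), ("be", 1), ("or", 1), ("not", 1), ("to", 1), ("be", 1)]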
Example: word frequencies (or word count) in web pages
• MapReduce library gathers together all pairs with the same key in the
shuffle/sort step
• Specify a reduce function that combines the values for a key
In this case, compute the sum
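A matching reduce function, again as an illustrative Python sketch: after the shuffle/sort step has grouped all pairs with the same word, the reducer receives that word together with the list of its counts and sums them. Combined with the hypothetical run_mapreduce driver sketched earlier, it completes the word count example.

def word_count_reduce(key, values):
    # key: a word, values: all counts emitted for that word
    return sum(values)

# run_mapreduce([("http://example.com/hamlet.html", "to be or not to be")],
#               word_count_map, word_count_reduce)
# -> {"be": 2, "not": 1, "or": 1, "to": 2}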
Under the hood: scheduling
• One master, many workers
Input data is split into M map tasks (typically 128 MB per split)
The amount of input data therefore determines how many map tasks are created
The reduce phase is partitioned into R reduce tasks (= the number of output files); how keys are assigned to these tasks is sketched below
Each reduce task generates one output file
The programmer chooses the number of reduce tasks
Tasks are assigned to workers dynamically
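Which of the R reduce tasks a given intermediate key ends up in is decided by a partition function; in the original MapReduce design the default is essentially a hash of the key modulo R. A minimal Python sketch (using a stable hash, since in a distributed setting every worker must route the same key to the same reducer):

import zlib

def partition(key, R):
    # Assign an intermediate key to one of the R reduce tasks.
    # zlib.crc32 is stable across processes, unlike Python's built-in hash().
    return zlib.crc32(key.encode("utf-8")) % R

# With R = 4, every ("be", 1) pair emitted by any map task
# is routed to reduce task number partition("be", 4).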
MapReduce: fault tolerance via re-execution
• Worker failure
Detect failure via periodic heartbeats
Re-execute completed and in-progress map tasks (completed map output lives on the failed worker's local disk, so it is lost with the worker)
Re-execute only in-progress reduce tasks (completed reduce output is already stored in the global file system)
Task completion committed through master
• Master failure
Master state is checkpointed to a replicated file system
A new master recovers from the last checkpoint and continues
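A minimal sketch of the worker-failure half of this logic, assuming a hypothetical master that records the last heartbeat time of each worker: when heartbeats stop, that worker's map tasks (even completed ones, since their output sat on its local disk) and its in-progress reduce tasks are put back on the pending queue for re-execution.

import time
from dataclasses import dataclass
from typing import Optional

HEARTBEAT_TIMEOUT = 30.0  # seconds of silence before a worker is presumed dead

@dataclass
class Task:
    kind: str                     # "map" or "reduce"
    state: str                    # "pending", "in_progress", or "completed"
    worker: Optional[str] = None  # id of the worker holding the task, if any

def find_failed_workers(last_heartbeat, now=None):
    # last_heartbeat: dict mapping worker id -> timestamp of its latest heartbeat
    now = time.time() if now is None else now
    return [w for w, t in last_heartbeat.items() if now - t > HEARTBEAT_TIMEOUT]

def reschedule_tasks_of(worker, tasks, pending_queue):
    for task in tasks:
        if task.worker != worker:
            continue
        # Map tasks are re-executed even if completed; reduce tasks only if in progress
        if task.kind == "map" or task.state == "in_progress":
            task.state, task.worker = "pending", None
            pending_queue.append(task)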