Lecture 2.1

Introduction to MapReduce

Most of the content about MapReduce in these slides is borrowed from a presentation by Michael Kleber of Google, dated January 14, 2008.

Part 2 Reference Texts
• Tom White, Hadoop: The Definitive Guide: Storage and Analysis at Internet Scale (4th Edition), O'Reilly Media, April 11, 2015, ISBN: 978-1491901632
• Jimmy Lin and Chris Dyer, Data-Intensive Text Processing with MapReduce, Morgan and Claypool Publishers, April 30, 2010, ISBN: 978-1608453429
• Jason Swartz, Learning Scala, O'Reilly Media, December 8, 2014, ISBN: 978-1449368814
• Holden Karau, Andy Konwinski, Patrick Wendell, and Matei Zaharia, Learning Spark: Lightning-Fast Big Data Analysis, O'Reilly Media, February 27, 2015, ISBN: 978-1449358624
• Bill Chambers and Matei Zaharia, Spark: The Definitive Guide: Big Data Processing Made Simple, O'Reilly Media, March 8, 2018, ISBN: 978-1491912218
Why do we want to use MapReduce?
• MapReduce is a distributed computing paradigm
• Distributed computing is hard
• Do we really need to do distributed computing?
– Yes: otherwise some problems are too big for a single computer

• Example: 20+ billion web pages × 20 KB each = 400+ terabytes
– One computer can read 30–35 MB/sec from disk
· ~4 months just to read the web
· ~400 hard drives / SSDs just to store the web
– Even more time to do something with the data
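
A quick back-of-the-envelope check of those figures, as a minimal Python sketch (the inputs are the slide's assumed numbers, not measurements):

```python
# Back-of-the-envelope check, using the slide's assumptions
# (20 billion pages, 20 KB each, ~35 MB/sec sequential read).
pages = 20e9          # 20+ billion web pages
page_size = 20e3      # 20 KB per page
read_rate = 35e6      # ~35 MB/sec from one disk

total_bytes = pages * page_size          # 4e14 bytes = 400 TB
days = total_bytes / read_rate / 86400   # seconds -> days
print(f"{total_bytes / 1e12:.0f} TB, ~{days:.0f} days to read")
# -> 400 TB, ~132 days (about 4 months) on a single machine
```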

Distributed computing is hard
• Bad news I: programming work
– Communication and coordination
– Recovering from machine failures (which happen all the time!)
– Status reporting
– Debugging
– Optimization
– Data locality

• Bad news II: repeat for every problem you want to solve

• How can we make this easier?

MapReduce
• A simple programming model that can be applied to many large-scale computing problems
• Hides messy details in the MapReduce runtime library:
– Automatic parallelization
– Load balancing
– Network and disk transfer optimization
– Automatic handling of machine failures
– Robustness
– Improvements to the core library benefit all of its users!

Typical flow of problem solving by MapReduce
1. Read a lot of data
2. Map: extract something you care about from each record
3. (Hidden) shuffle and sort
4. Reduce: aggregate, summarize, filter, or transform
5. Write the results

• The above outline stays the same across different problems; only Map and Reduce change to fit the particular problem (see the sketch below)
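
To make the five steps concrete, here is a minimal single-machine sketch in Python. It is an illustration only: a real MapReduce runtime spreads this work across many workers and disks.

```python
from collections import defaultdict

def run_mapreduce(records, map_fn, reduce_fn):
    """Toy single-machine model of the flow above: read, map,
    shuffle/sort, reduce, write. No distribution or fault tolerance."""
    # Steps 1-2: read each input record and map it to (key, value) pairs
    intermediate = []
    for key, value in records:
        intermediate.extend(map_fn(key, value))
    # Step 3: the "hidden" shuffle and sort -- group values by key
    groups = defaultdict(list)
    for key, value in intermediate:
        groups[key].append(value)
    # Steps 4-5: reduce each group and collect ("write") the results
    output = []
    for key in sorted(groups):
        output.extend(reduce_fn(key, groups[key]))
    return output
```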

MapReduce paradigm
• Basic data type: the key-value pair (k, v)
– For example, key = URL, value = HTML of the web page
• Programmer specifies two primary methods:
– Map: (k, v) → <(k1, v1), (k2, v2), (k3, v3), …, (kn, vn)>
– Reduce: (k′, <v′1, v′2, …, v′n>) → <(k′, v″1), (k′, v″2), …, (k′, v″m)>
· All v′ with the same k′ are reduced together
· (Remember the invisible “shuffle and sort” step)
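
Spelled out as Python type aliases (hypothetical names, just to restate the two signatures above):

```python
from typing import Callable, Iterable, Tuple, TypeVar

K, V = TypeVar("K"), TypeVar("V")        # input key/value types
K2, V2 = TypeVar("K2"), TypeVar("V2")    # intermediate key/value types
V3 = TypeVar("V3")                       # output value type

# Map:    (k, v) -> sequence of intermediate (k', v') pairs
MapFn = Callable[[K, V], Iterable[Tuple[K2, V2]]]
# Reduce: (k', all v' values for that k') -> sequence of (k', v'') pairs
ReduceFn = Callable[[K2, Iterable[V2]], Iterable[Tuple[K2, V3]]]
```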

Example: word frequencies (or word count) in web pages
• Considered the “Hello, World!” example of cloud computing
• Input: files with one document per record
• Specify a map function that takes a key/value pair
– key = document URL
– value = document contents
• Output of the map function is (potentially many) key/value pairs
– In this case, output (word, “1”) once per word in the document

Input: (“document 1”, “to be or not to be”)

Map output:
(“to”, “1”), (“be”, “1”), (“or”, “1”), (“not”, “1”), (“to”, “1”), (“be”, “1”)
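
Using the run_mapreduce sketch from earlier, the map function for word count could look like this (one possible sketch; it emits plain integer 1s instead of the slide's string “1”s):

```python
def word_count_map(url, contents):
    # Emit (word, 1) once for every word occurrence in the document;
    # the URL key is not needed for counting.
    return [(word, 1) for word in contents.split()]
```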
Example: word frequencies (or word count) in web pages
• MapReduce library gathers together all pairs with the same key in the shuffle/sort step
• Specify a reduce function that combines the values for a key
– In this case, compute the sum

After shuffle/sort:
– key = “be”,  values = <“1”, “1”> → “2”
– key = “not”, values = <“1”> → “1”
– key = “or”,  values = <“1”> → “1”
– key = “to”,  values = <“1”, “1”> → “2”

• Output of the reduce step:
– (“be”, “2”)
– (“not”, “1”)
– (“or”, “1”)
– (“to”, “2”)
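
The matching reduce function, run through the toy run_mapreduce from earlier (again an illustrative sketch with integer counts):

```python
def word_count_reduce(word, counts):
    # All counts for the same word arrive together after the shuffle
    return [(word, sum(counts))]

docs = [("document 1", "to be or not to be")]
print(run_mapreduce(docs, word_count_map, word_count_reduce))
# -> [('be', 2), ('not', 1), ('or', 1), ('to', 2)]
```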
Example: the overall process
[Diagram omitted in this text version]
Under the hood: scheduling
• One master, many workers
– Input data is split into M map tasks (typically 128 MB per split)
· The size of the input data determines how many map tasks are created
– The reduce phase is partitioned into R reduce tasks (= the number of output files)
· Each reduce task generates one output file
· The programmer chooses the number of reduce tasks
– Tasks are assigned to workers dynamically

• Master assigns each map task to a free worker
– Considers locality of data to workers when assigning tasks
– Worker reads task input (often from local disk!)
– Worker produces R local files containing intermediate (k, v) pairs

• Master assigns each reduce task to a free worker
– Worker reads intermediate (k, v) pairs generated by the map workers
– Worker applies the Reduce function to produce the output
• The user may specify a Partition function: it decides which intermediate keys go to which reducers (a sketch follows below)
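
A minimal sketch of a partition function, assuming the common default of hash partitioning (all pairs with the same key land on the same reduce task):

```python
def default_partition(key, num_reduce_tasks):
    # Hash partitioning: maps each intermediate key to one of the
    # R reduce tasks, so every value for a given key meets the same
    # reducer. A custom partitioner can replace this, e.g. to keep
    # related keys (like URLs from one host) in the same output file.
    return hash(key) % num_reduce_tasks
```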
MapReduce: the flow in diagram
[Diagram omitted in this text version]
MapReduce: fault tolerance via re-execution
• Worker failure
– Detected via periodic heartbeats
– Re-execute completed and in-progress map tasks (their intermediate output lives on the failed worker's local disk)
– Re-execute in-progress reduce tasks (completed reduce output is already safe in the global file system)
– Task completion is committed through the master

• Master failure
– State is checkpointed to a replicated file system
– A new master recovers from the checkpoint and continues

• Very robust: Google once lost thousands of machines during a run, yet the job finished successfully

