09b - MapReduce
Enabling Technologies: MapReduce
Key-Value Pairs
(key, value) pairs are used as the format for both data and intermediate results
MapReduce
A Brief View of MapReduce
Processing Granularity
• Mappers
• Run on a record-by-record basis
• Your code processes that record and may produce
• Zero, one, or many outputs
• Reducers
• Run on a group-of-records (having same key)
• Your code processes that group and may produce
• Zero, one, or many outputs
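The granularity above can be sketched in a few lines of Python. This is a single-process illustration, not a distributed framework: `run_mapreduce`, `length_mapper`, and `repeated_length_reducer` are hypothetical names chosen for this example, and the input lines are made up.

```python
from collections import defaultdict


def run_mapreduce(records, mapper, reducer):
    """Call mapper once per record, group by key, call reducer once per group."""
    groups = defaultdict(list)
    for record in records:                 # mappers: record-by-record
        for key, value in mapper(record):  # zero, one, or many outputs
            groups[key].append(value)
    output = []
    for key in sorted(groups):             # reducers: one key group at a time
        output.extend(reducer(key, groups[key]))
    return output


def length_mapper(line):
    # A blank line yields zero outputs; other lines yield one pair per word.
    for word in line.split():
        yield (len(word), word)


def repeated_length_reducer(length, words):
    # Emits nothing for a group of one word, otherwise a single output.
    if len(words) > 1:
        yield (length, sorted(words))


result = run_mapreduce(["a bb cc", "", "ddd e"],
                       length_mapper, repeated_length_reducer)
print(result)
```

The mapper never sees more than one record and the reducer never sees more than one key group, which is what lets a real framework run many copies of each in parallel.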
MapReduce: The Map Step
[Figure: each map call turns an input key-value pair (k, v) into intermediate key-value pairs (k, v)]
MapReduce: The Reduce Step
[Figure: key-value pairs (k, v) passing through the reduce step]
Warm up: Word Count
[Figure: Map Tasks and Reduce Tasks]
Word Count Example
• MAP: Read input and produce a set of key-value pairs
• Group by key: Collect all pairs with the same key
• Reduce: Collect all values belonging to the key and output

Input (sequentially read the data; only sequential reads):
"The crew of the space shuttle Endeavor recently returned to Earth as ambassadors, harbingers of a new era of space exploration. Scientists at NASA are saying that the recent assembly of the Dextre bot is the first step in a long-term space-based man/machine partnership. "The work we're doing now -- the robotics we're doing -- is what we're going to need …."

Map output: (The, 1) (crew, 1) (of, 1) (the, 1) (space, 1) (shuttle, 1) (Endeavor, 1) (recently, 1) …
Grouped by key: (crew, 1) (crew, 1) | (space, 1) | (the, 1) (the, 1) (the, 1) | (shuttle, 1) | (recently, 1) | …
Reduce output: (crew, 2) (space, 1) (the, 3) (shuttle, 1) (recently, 1) …
1. Each mapper receives some of the KV-pairs as input
2. The mappers process the KV-pairs one by one
3. Each KV-pair output by the mapper is sent to the reducer that is responsible for it
4. The reducers sort their input by key and group it
5. The reducers process their input one group at a time
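A minimal Python sketch of the word count flow, assuming a single mapper and a single reducer running in one process. The function names are illustrative, and lower-casing and punctuation stripping are assumptions of this sketch (the slide keeps "The" and "the" as written).

```python
from itertools import groupby
from operator import itemgetter


def map_words(line):
    # Map: emit (word, 1) for every word in the line.
    for raw in line.split():
        word = raw.strip(".,\"'").lower()  # normalization is an assumption
        if word:
            yield (word, 1)


def reduce_counts(word, ones):
    # Reduce: sum all values that belong to one key.
    yield (word, sum(ones))


def word_count(lines):
    pairs = [kv for line in lines for kv in map_words(line)]  # map step
    pairs.sort(key=itemgetter(0))                             # sort by key
    result = {}
    for word, group in groupby(pairs, key=itemgetter(0)):     # group by key
        for key, total in reduce_counts(word, (v for _, v in group)):
            result[key] = total
    return result


counts = word_count(["The crew of the space shuttle",
                     "the crew recently returned"])
print(counts)
```

Sorting before grouping mirrors step 4 above: once the pairs are sorted by key, each group is a contiguous run and can be processed with sequential reads only.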
What it looks like in Java
Map function
Job configuration
Example 2: Inverted Index
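One possible way to phrase an inverted index as map and reduce functions is sketched below in Python. The slide does not show its own solution, so this is an assumption: the mapper emits (word, document ID) and the reducer collects the sorted list of documents per word. The document IDs and texts are made up.

```python
from collections import defaultdict


def map_doc(doc_id, text):
    # Emit (word, doc_id) once per distinct word in the document.
    for word in set(text.lower().split()):
        yield (word, doc_id)


def reduce_postings(word, doc_ids):
    # Collect the documents that contain the word.
    yield (word, sorted(set(doc_ids)))


def build_index(docs):
    groups = defaultdict(list)
    for doc_id, text in docs.items():
        for word, found_in in map_doc(doc_id, text):
            groups[word].append(found_in)
    index = {}
    for word in sorted(groups):
        for key, postings in reduce_postings(word, groups[word]):
            index[key] = postings
    return index


index = build_index({"d1": "map reduce", "d2": "map side join"})
print(index["map"])
```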
Exercise 1: Find the maximum temperature for each year
• Given a large dataset of weather station readings, write down the Map and Reduce steps necessary to find the maximum temperature recorded in each year across all weather stations.
• The dataset contains lines with the following format: `stationID, year, month, day, max temperature (maxTemp), min temperature (minTemp)`
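One possible answer, sketched in Python under the assumption that fields are comma-separated exactly as in the format above. The mapper emits (year, maxTemp) and the reducer keeps the largest value per year. The sample readings are invented for illustration, and skipping malformed lines is a design choice of this sketch.

```python
def map_reading(line):
    # Map: (year, maxTemp) for each well-formed reading.
    fields = [field.strip() for field in line.split(",")]
    if len(fields) != 6:
        return  # malformed line: emit nothing
    station_id, year, month, day, max_temp, min_temp = fields
    try:
        yield (int(year), float(max_temp))
    except ValueError:
        return  # non-numeric year or temperature: emit nothing


def reduce_max(year, temps):
    # Reduce: the largest maxTemp seen for this year.
    yield (year, max(temps))


def max_temp_per_year(lines):
    groups = {}
    for line in lines:
        for year, temp in map_reading(line):
            groups.setdefault(year, []).append(temp)
    result = {}
    for year in sorted(groups):
        for key, value in reduce_max(year, groups[year]):
            result[key] = value
    return result


readings = [
    "S1, 2020, 1, 5, 12.5, 3.0",   # made-up sample data
    "S2, 2020, 7, 9, 35.1, 20.0",
    "S1, 2021, 8, 1, 33.0, 19.5",
    "not a reading",
]
result = max_temp_per_year(readings)
print(result)
```

The station ID, month, and day are dropped in the mapper because the question groups by year only.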
Exercise 2: How to process this SQL query in MapReduce?
Answer Q2:
Answer Q2 (Optimized)
Hadoop
Brief history
• Initially developed by Doug Cutting as a filesystem for Apache Nutch, a
web search engine
The origin of the name
• “Hadoop” is a made-up name, as explained by Doug Cutting:
Hadoop: How it Works
Hadoop Architecture
MapReduce Framework
Hadoop Distributed File System (HDFS)
One NameNode
Maintains metadata info about files:
• Maps a filename to a set of blocks
• Maps a block to the DataNodes where it resides
• Replication engine for blocks
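The two mappings kept by the NameNode can be pictured as two dictionaries. The Python sketch below is a toy model, not the HDFS API: the file name, block IDs, DataNode names, and the replication target of 3 are all hypothetical, and `under_replicated` only illustrates the kind of question a replication engine has to answer.

```python
# filename -> ordered list of block IDs
file_to_blocks = {"F": [1, 2, 3, 4, 5]}

# block ID -> set of DataNodes that hold a replica
block_to_datanodes = {
    1: {"dn1", "dn2", "dn3"},
    2: {"dn2", "dn3", "dn4"},
    3: {"dn1", "dn3", "dn4"},
    4: {"dn1", "dn2", "dn4"},
    5: {"dn2", "dn3", "dn4"},
}


def locate(filename):
    """Resolve a file to its blocks and the DataNodes holding each block."""
    return [(block, sorted(block_to_datanodes[block]))
            for block in file_to_blocks[filename]]


def under_replicated(failed_node, target=3):
    """Blocks left with fewer than `target` replicas if failed_node is lost."""
    return sorted(block for block, nodes in block_to_datanodes.items()
                  if len(nodes - {failed_node}) < target)


print(locate("F")[0])
print(under_replicated("dn1"))
```

A client reading file F would use the result of `locate` to fetch each block directly from one of the listed DataNodes.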
[Figure: File F split into blocks 1 to 5; components shown: Client, NameNode, Secondary NameNode, ClusterId, DataNodes]
Data Flow
[Figure: the User Program starts a Master, which assigns map tasks and reduce tasks]
Replication engine
• Upon detecting a DataNode failure
HDFS Erasure Coding
• New feature introduced in Hadoop 3.0
• Problem with Replication Mechanism in HDFS
• Each extra replica adds 100% storage overhead, so the default 3x replication results in 200% storage overhead.
• Cold replica
• Requires only 50% storage overhead.
• But can tolerate only 1 failure.
Reed-Solomon Algorithm
• RS multiplies 𝑚 data cells with a Generator Matrix (Gᵀ) to get an extended codeword with 𝑚 data cells and 𝑛 parity cells.
• Data can be recovered by multiplying the inverse of the generator matrix with the extended codeword, as long as 𝑚 out of the 𝑚 + 𝑛 cells are available.
• XOR is the special case with 𝑛 = 1
• Can tolerate up to 𝑛 failures
• But increases CPU load
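The 𝑛 = 1 special case is small enough to try by hand. The Python sketch below uses plain XOR parity over two made-up data cells and recovers a lost cell from the survivors. It is not the Reed-Solomon codec used by HDFS, and `storage_overhead` is a helper introduced here to express the overhead as 𝑛 / 𝑚.

```python
def xor_cells(cells):
    """Byte-wise XOR of equally sized cells."""
    out = bytearray(len(cells[0]))
    for cell in cells:
        for i, byte in enumerate(cell):
            out[i] ^= byte
    return bytes(out)


def storage_overhead(m, n):
    """Extra storage relative to the data: n parity cells per m data cells."""
    return n / m


data = [b"ABCD", b"wxyz"]      # m = 2 data cells (made-up contents)
parity = xor_cells(data)       # n = 1 parity cell

# Lose data[0]: XOR of the surviving data cell and the parity restores it.
recovered = xor_cells([data[1], parity])

print(recovered == data[0])
print(storage_overhead(2, 1))  # XOR with 2 data cells
print(storage_overhead(1, 2))  # 3x replication: 2 extra copies per cell
```

With two cells lost there is nothing left to XOR against, which is why this scheme tolerates only one failure, whereas a code with 𝑛 parity cells tolerates up to 𝑛.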
NameNode failure
• A single point of failure
Hadoop Map-Reduce
(Example: Color Count)
[Figure: input blocks on HDFS feed four Map tasks, each followed by a Parse-hash step; Shuffle & Sorting based on k delivers the pairs to three Reduce tasks]
• Map produces (k, v), e.g. (color, 1)
• Reduce consumes (k, [v]), e.g. (color, [1,1,1,1,1,1..])
• Reduce produces (k’, v’), e.g. (color, 100)
Hadoop MapReduce
[Figure: four Map tasks, each with a Parse-hash step, feeding three Reduce tasks]
In this example, 1 map-reduce job consists of 4 map tasks and 3 reduce tasks.
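The 4-map, 3-reduce layout can be imitated in one Python process. In the sketch below each input block gets its own map task and a hash of the key decides which of the three reduce tasks receives a pair. The block contents are made up, and `zlib.crc32` is used as a stand-in hash so that the run is deterministic; it is an assumption of this sketch, not what Hadoop uses internally.

```python
import zlib
from collections import defaultdict

NUM_REDUCE_TASKS = 3


def partition(key, num_reduce_tasks=NUM_REDUCE_TASKS):
    # "Parse-hash": hash the key to pick the responsible reduce task.
    return zlib.crc32(key.encode("utf-8")) % num_reduce_tasks


def map_task(block):
    # Produces (k, v): one (color, 1) pair per record in the block.
    for color in block:
        yield (color, 1)


def reduce_task(pairs):
    # Consumes (k, [v]) and produces (k', v'): the count per color.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return {key: sum(values) for key, values in sorted(groups.items())}


def run_job(blocks):
    inboxes = [[] for _ in range(NUM_REDUCE_TASKS)]
    for block in blocks:                       # one map task per block
        for key, value in map_task(block):
            inboxes[partition(key)].append((key, value))  # shuffle
    return [reduce_task(inbox) for inbox in inboxes]


blocks = [["red", "blue"], ["green", "red"],
          ["blue", "blue"], ["red", "green"]]  # 4 blocks -> 4 map tasks
outputs = run_job(blocks)                      # one dict per reduce task
totals = {k: v for part in outputs for k, v in part.items()}
print(totals)
```

Because the partition depends only on the key, every pair for a given color lands at the same reduce task, so each task can compute its final counts without talking to the others.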
Failures
Reference