MapReduce
From CS Foundations to MapReduce
• Consider a large data collection: {web, weed, green, sun, moon, land, part, web, green, …}
• Problem: Count the occurrences of the different words in the collection.
• Let's design a solution for this problem (before MR):
  • We will start from scratch
  • We will add and relax constraints
  • We will do incremental design, improving the solution for performance and scalability
Word Counter and Result Table (Before MR)
[Figure: Main uses a single WordCounter (parse(), count()) to read the data collection {web, weed, green, sun, moon, land, part, web, green, …} and build the result table: web 2, weed 1, green 2, sun 1, moon 1, land 1.]
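To make this starting point concrete, here is a minimal sequential sketch in Java. The class name WordCounter and the methods parse() and count() follow the diagram; everything else (for example the comma-separated input string) is an assumption made for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Single-threaded word counter: parse the collection, then count each token.
class WordCounter {
    private final Map<String, Integer> resultTable = new LinkedHashMap<>();

    String[] parse(String dataCollection) {
        return dataCollection.split(",\\s*");          // split on commas
    }

    void count(String[] words) {
        for (String w : words) {
            resultTable.merge(w, 1, Integer::sum);     // increment this word's count
        }
    }

    Map<String, Integer> result() { return resultTable; }

    public static void main(String[] args) {
        WordCounter wc = new WordCounter();
        wc.count(wc.parse("web, weed, green, sun, moon, land, part, web, green"));
        System.out.println(wc.result());               // {web=2, weed=1, green=2, ...}
    }
}
```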
Multiple Instances of Word Counter
[Figure: Main now spawns 1..* threads, each running a WordCounter (parse(), count()) over the data collection; together they fill the result table: web 2, weed 1, green 2, sun 1, moon 1, land 1, part 1.]
Improve Word Counter for Performance
[Figure: the WordCounter is split into separate Parser and Counter classes (1..* instances of each). The Parser emits an intermediate KEY list (web, weed, green, sun, moon, land, part, web, green, …) with associated VALUEs; the Counter consumes it to build the result table. No need for a lock.]
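A sketch of this split, assuming the Parser only reads the input and emits a flat KEY list, while the Counter alone owns the result table, so neither stage needs a lock:

```java
import java.util.*;
import java.util.stream.*;

// Parser and Counter decoupled by an intermediate KEY list.
// The parser only reads the input; the counter owns the result table.
class ParserCounterPipeline {
    static List<String> parse(String dataCollection) {                 // Parser stage
        return Arrays.asList(dataCollection.split(",\\s*"));
    }

    static Map<String, Long> count(List<String> keys) {                // Counter stage
        return keys.stream()
                   .collect(Collectors.groupingBy(k -> k, Collectors.counting()));
    }

    public static void main(String[] args) {
        List<String> keys = parse("web, weed, green, sun, moon, land, part, web, green");
        System.out.println(count(keys));   // e.g. {web=2, weed=1, green=2, sun=1, ...}
    }
}
```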
Peta-scale Data
[Figure: the same Parser/Counter design applied to a peta-scale data collection, still producing the intermediate KEY/VALUE list and the result table.]
Before MapReduce…
• Large-scale data processing was difficult!
• Managing hundreds or thousands of processors
• Managing parallelization and distribution
• I/O Scheduling
• Status and monitoring
• Fault/crash tolerance
• MapReduce provides all of these, easily!
MapReduce Overview
• How does it solve our previously mentioned problems?
• MapReduce is highly scalable and can be used across many
computers.
• Many small machines can be used to process jobs that normally
could not be processed by a large machine.
Map Abstraction
• Inputs a key/value pair
  – Key is a reference to the input value
  – Value is the data set on which to operate
• Evaluation
  – Function defined by user
  – Applies to every value in the input
• Might need to parse the input
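As a hedged illustration of this abstraction, a user-defined map function takes one key/value pair and emits a list of intermediate pairs. The interface and class names below are invented for the sketch and are not part of any particular library.

```java
import java.util.*;

// Generic map abstraction: a user-defined function from one input pair
// to a list of intermediate <key, value> pairs.
interface MapFunction<K1, V1, K2, V2> {
    List<Map.Entry<K2, V2>> map(K1 key, V1 value);
}

// Word-count map: the key references the input (a document id),
// the value is the data to operate on (its contents).
class WordCountMap implements MapFunction<String, String, String, Integer> {
    @Override
    public List<Map.Entry<String, Integer>> map(String docId, String contents) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String word : contents.split("\\W+")) {      // parse the input value
            if (!word.isEmpty()) {
                out.add(Map.entry(word, 1));               // emit <word, 1>
            }
        }
        return out;
    }
}
```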
Why is this approach better?
• Creates an abstraction for dealing with complex overhead
• The computations are simple, the overhead is messy
• Removing the overhead makes programs much smaller and thus
easier to use
• Less testing is required as well. The MapReduce libraries can be assumed to
work properly, so only user code needs to be tested
• Division of labor is also handled by the MapReduce libraries, so programmers only need to focus on the actual computation.
Peta Scale Data is Commonly Distributed
[Figure: the data collection is now split across many separate collections; Main's 1..* threads run Parser and Counter instances over the distributed pieces, still producing the KEY/VALUE list and the result table.]
Write Once Read Many (WORM) data
[Figure: the same distributed design, annotated with the class model DataCollection → WordList → ResultTable; each distributed data collection is write-once-read-many.]
WORM Data is Amenable to Parallelism
1. Data with WORM characteristics yields to parallel processing.
2. Data without dependencies yields to out-of-order processing.
[Figure: the distributed DataCollection → WordList → ResultTable design, with Parser and Counter instances working on the pieces in parallel.]
Map Operation
MAP: Input data ➔ <key, value> pair
[Figure: the data collection is split into split 1 … split n and distributed to processors; each Map task parses its split and emits a <word, 1> pair for every word it sees (web 1, weed 1, green 1, sun 1, moon 1, land 1, part 1, web 1, green 1, …), producing a KEY/VALUE list per split.]
Reduce Operation
MAP: Input data ➔ <key, value> pair
REDUCE: <key, value> pair ➔ <result>
[Figure: each data split (split 1 … split n) feeds a Map task; the intermediate <key, value> pairs are supplied to multiple Reduce tasks, which produce the results.]
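As a small, purely in-memory sketch of the two operations together (this is not the distributed implementation, just an illustration of the model): split the collection, map each split to <word, 1> pairs, group the intermediate pairs by key, and reduce each group by summing.

```java
import java.util.*;
import java.util.stream.*;

class MiniMapReduce {
    // MAP: one split of the data collection -> list of <word, 1> pairs
    static List<Map.Entry<String, Integer>> map(String split) {
        return Arrays.stream(split.split("\\W+"))
                     .filter(w -> !w.isEmpty())
                     .map(w -> Map.entry(w, 1))
                     .collect(Collectors.toList());
    }

    // REDUCE: <word, list of counts> -> total count for that word
    static int reduce(String word, List<Integer> counts) {
        return counts.stream().mapToInt(Integer::intValue).sum();
    }

    public static void main(String[] args) {
        List<String> splits = List.of("web weed green sun", "moon land part web green");

        // Shuffle: group all intermediate values by key.
        Map<String, List<Integer>> grouped = splits.stream()
                .flatMap(s -> map(s).stream())
                .collect(Collectors.groupingBy(Map.Entry::getKey,
                         Collectors.mapping(Map.Entry::getValue, Collectors.toList())));

        grouped.forEach((word, counts) ->
                System.out.println(word + " " + reduce(word, counts)));
    }
}
```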
Large scale data splits
[Figure: each large data split goes through Map and a Parse-hash step that assigns its <key, 1> pairs to partitions P-0000, P-0001, P-0002, …; each partition is consumed by a Reducer (say, Count), producing count1, count2, count3, ….]
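The partitions P-0000, P-0001, … in the figure come from a partition (hash) step applied to each intermediate key. A typical default, sketched here with assumed names, hashes the key modulo the number of reduce tasks R:

```java
class Partitioner {
    // Assign an intermediate key to one of R reduce partitions (P-0000 … P-000(R-1)).
    static int partition(String key, int numReduceTasks) {
        // Mask the sign bit so the modulo result is never negative.
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        int R = 3;
        for (String key : new String[]{"web", "weed", "green", "sun"}) {
            System.out.printf("%s -> P-%04d%n", key, partition(key, R));
        }
    }
}
```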
MapReduce Implementation
Computing Environment at Google
• Machines are dual-processor x86 systems running Linux, with 2-4 GB of memory per machine.
• Commodity networking hardware is used – typically either 100
megabits/second or 1 gigabit/second at the machine level
• A cluster consists of 100s or 1000s of machines, and therefore machine
failures are common.
• Storage is provided by inexpensive IDE disks attached directly to individual
machines. GFS is used to manage the data stored on these disks. The file
system uses replication to provide availability and reliability on top of
unreliable hardware.
• Users submit jobs to a scheduling system. Each job consists of a set of tasks
and is mapped by the scheduler to a set of available machines within a
cluster.
WORKING PROCESS OF MapReduce (1)
WORKING PROCESS OF MapReduce (2)
SHUFFLE MECHANISM
EXAMPLE: TYPICAL PROGRAM WORDCOUNT
FUNCTIONS OF WORDCOUNT
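As a sketch of the WordCount map and reduce functions, written here against the Hadoop MapReduce API (an open-source implementation of this model); the slides themselves do not show the code, so treat the details below as illustrative rather than the original.

```java
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map: for each word in a line of input, emit <word, 1>.
class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer itr = new StringTokenizer(value.toString());
        while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, ONE);
        }
    }
}

// Reduce: sum all the 1s emitted for a given word.
class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable val : values) {
            sum += val.get();
        }
        result.set(sum);
        context.write(key, result);
    }
}
```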
MAP PROCESS OF WORDCOUNT
REDUCE PROCESS OF WORDCOUNT
Master Data Structures
• For each map task and reduce task, the master stores its state (idle, in-progress, or completed) and the identity of the worker machine (for non-idle tasks).
• The master is the conduit through which the location of intermediate file
regions is propagated from map tasks to reduce tasks.
▪ Therefore, for each completed map task, the master stores the locations
and sizes of the R intermediate file regions produced by the map task.
▪ Updates to this location and size information are received as map tasks
are completed.
▪ The information is pushed incrementally to workers that have in-progress
reduce tasks.
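A Java sketch of this master-side bookkeeping; the field and method names are assumptions made for illustration, only the kind of information stored follows the description above.

```java
import java.util.*;

// Per-task state kept by the master.
enum TaskState { IDLE, IN_PROGRESS, COMPLETED }

class MapTaskInfo {
    TaskState state = TaskState.IDLE;
    String workerId;             // identity of the worker (for non-idle tasks)
    // For a completed map task: location and size of each of the R
    // intermediate file regions it produced, indexed by reduce partition.
    String[] regionLocations;
    long[]   regionSizes;
}

class Master {
    final Map<Integer, MapTaskInfo> mapTasks = new HashMap<>();

    // Called as each map task completes; this information is then pushed
    // incrementally to workers running in-progress reduce tasks.
    void onMapCompleted(int taskId, String worker, String[] locations, long[] sizes) {
        MapTaskInfo info = mapTasks.computeIfAbsent(taskId, id -> new MapTaskInfo());
        info.state = TaskState.COMPLETED;
        info.workerId = worker;
        info.regionLocations = locations;
        info.regionSizes = sizes;
    }
}
```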
Points that need to be emphasized
• No reduce can begin until map is complete
• Master must communicate locations of intermediate files
• Tasks scheduled based on location of data
• If a map worker fails at any time before the reduce finishes, its map tasks must be completely rerun
• MapReduce library does most of the hard work for us!
How to use it?
• User to-do list:
• indicate:
• Input/output files
• M: number of map tasks
• R: number of reduce tasks
• W: number of machines
• Write map and reduce functions
• Submit the job
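In Hadoop's implementation of the model, this to-do list roughly corresponds to a driver program like the sketch below. It reuses the TokenizerMapper and IntSumReducer from the earlier WordCount sketch; the input/output paths come from the command line, R is set explicitly, while M is normally derived from the input splits and the number of machines W is a property of the cluster rather than of the job.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCountDriver.class);

        job.setMapperClass(TokenizerMapper.class);       // user-written map
        job.setReducerClass(IntSumReducer.class);         // user-written reduce
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        job.setNumReduceTasks(4);                         // R: number of reduce tasks

        FileInputFormat.addInputPath(job, new Path(args[0]));    // input files
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output directory

        System.exit(job.waitForCompletion(true) ? 0 : 1);        // submit the job
    }
}
```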
MapReduce programming model
Determine whether the problem is parallelizable and solvable using MapReduce (e.g., is the data WORM? Is the data set large?).
Design and implement the solution as Mapper classes and a Reducer class.
MapReduce Characteristics
• Very large-scale data
• Write-once-read-many data allows for parallelism without mutexes
• Map and Reduce are the main operations: simple code
• There are other supporting operations such as combine and partition.
• All map tasks should be completed before the reduce operation starts.
• Map and reduce operations are typically performed by the same physical processor.
• The numbers of map tasks and reduce tasks are configurable.
• Operations are provisioned near the data.
• Commodity hardware and storage.
• Runtime takes care of splitting and moving data for operations.
Student Projects…
Distributed Grep
Reverse Web-Link Graph
Term-Vector per Host
Inverted Index
1. Distributed Grep
Grep is a command-line utility for searching plain-text data sets for lines that match a regular expression.
Example: $ grep "apple" fruitlist.txt
Map emits a line if it matches a supplied regular expression pattern.
Reduce is an identity function that just copies the supplied
intermediate data to the output.
[Figure: input data splits → grep (Map) → matching lines.]
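A minimal sketch of the grep map function under this model; the regular expression is assumed to be supplied to the task as a parameter, and the identity reduce is omitted.

```java
import java.util.*;
import java.util.regex.Pattern;

// Distributed grep: map emits a line only if it matches the pattern;
// reduce is the identity (it just copies intermediate pairs to the output).
class GrepMap {
    static List<Map.Entry<String, String>> map(String fileOffset, String line,
                                                Pattern pattern) {
        if (pattern.matcher(line).find()) {
            return List.of(Map.entry(fileOffset, line));   // emit the matching line
        }
        return List.of();                                  // emit nothing otherwise
    }
}
```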
3. Term-vector per host
• A term vector summarizes the most important words that occur in a document
or a set of documents as a list of <word, frequency> pairs.
• Map emits a <hostname, term vector> pair for each input document (where the
hostname is extracted from the URL of the document).
• Reduce is passed all per-document term vectors for a given host. It adds these
term vectors together, throwing away infrequent terms, and then emits a final
⟨hostname, term vector⟩ pair.
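A sketch of the reduce step, assuming per-document term vectors arrive as word-to-frequency maps and that "infrequent" means below a threshold passed in as a parameter:

```java
import java.util.*;

// Term-vector per host: reduce merges all per-document term vectors for one
// hostname and throws away terms below a minimum frequency.
class TermVectorReduce {
    static Map<String, Integer> reduce(String hostname,
                                       List<Map<String, Integer>> docVectors,
                                       int minFrequency) {
        Map<String, Integer> merged = new HashMap<>();
        for (Map<String, Integer> vector : docVectors) {
            vector.forEach((term, freq) -> merged.merge(term, freq, Integer::sum));
        }
        merged.values().removeIf(freq -> freq < minFrequency);  // drop infrequent terms
        return merged;                                           // final <hostname, term vector>
    }
}
```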
4. Inverted Index
• Map parses each document, and emits a sequence of ⟨word,
document ID⟩ pairs.
• Reduce function accepts all pairs for a given word, sorts the
corresponding document IDs and emits a ⟨word, list(document ID)⟩ pair.
• The set of all output pairs forms a simple inverted index. It is easy to
augment this computation to keep track of word positions.
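A sketch of both functions, with document IDs represented as plain strings:

```java
import java.util.*;
import java.util.stream.*;

// Inverted index: map emits <word, docId> for every word in a document;
// reduce sorts the document IDs for a word and emits <word, list(docId)>.
class InvertedIndex {
    static List<Map.Entry<String, String>> map(String docId, String contents) {
        return Arrays.stream(contents.split("\\W+"))
                     .filter(w -> !w.isEmpty())
                     .map(w -> Map.entry(w, docId))
                     .collect(Collectors.toList());
    }

    static Map.Entry<String, List<String>> reduce(String word, List<String> docIds) {
        List<String> sorted = docIds.stream().sorted().collect(Collectors.toList());
        return Map.entry(word, sorted);
    }
}
```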