
Distributed Systems

Eng. Asmaa AbdulQawy


MapReduce OVERVIEW

From CS Foundations to MapReduce
• Consider a large data collection: {web, weed, green, sun, moon, land,
part, web, green,…}
• Problem: Count the occurrences of the different words in the
collection.
• Let's design a solution for this problem (before MapReduce):
• We will start from scratch
• We will add and relax constraints
• We will do incremental design, improving the solution for performance and scalability
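As a starting point, the single-machine baseline can be sketched in a few lines (a minimal Python sketch; the function name and sample data are illustrative):

```python
from collections import Counter

def word_count(collection):
    """Sequential baseline: parse and count in a single pass."""
    result = Counter()
    for word in collection:
        result[word] += 1
    return dict(result)

data = ["web", "weed", "green", "sun", "moon", "land",
        "part", "web", "green"]
print(word_count(data))
# {'web': 2, 'weed': 1, 'green': 2, 'sun': 1, 'moon': 1, 'land': 1, 'part': 1}
```

This is the design the following slides incrementally improve.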

Word Counter and Result Table (Before MR)
{web, weed, green, sun, moon, land, part, web, green, …}

[Diagram: Main drives a WordCounter (parse(), count()) that reads a DataCollection and fills a ResultTable: web 2, weed 1, green 2, sun 1, moon 1, land 1, part 1]
Multiple Instances of Word Counter

[Diagram: Main spawns 1..* WordCounter threads (parse(), count()); all read the DataCollection and update a shared ResultTable: web 2, weed 1, green 2, sun 1, moon 1, land 1, part 1]

Observe:
• Multi-threaded
• Lock on shared data
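The multi-threaded design above can be sketched as follows (a Python sketch; the helper names and thread count are assumptions, and CPython threads illustrate the structure rather than a true parallel speedup):

```python
import threading
from collections import Counter

def threaded_word_count(collection, n_threads=4):
    """Each thread counts one slice; a lock guards the shared ResultTable."""
    result = Counter()
    lock = threading.Lock()

    def worker(chunk):
        local = Counter(chunk)   # count the slice locally
        with lock:               # merge into the shared table under the lock
            result.update(local)

    chunks = [collection[i::n_threads] for i in range(n_threads)]
    threads = [threading.Thread(target=worker, args=(c,)) for c in chunks]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return dict(result)
```

The lock is the cost of sharing one result table, which is exactly what the next slide removes.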
Improve Word Counter for Performance

[Diagram: Main spawns 1..* Parser threads and 1..* Counter threads; Parsers read the DataCollection into a WordList, and separate Counters fill the ResultTable (web 2, weed 1, green 2, sun 1, moon 1, land 1, part 1). No need for a lock.]

KEY: web weed green sun moon land part web green …
VALUE:

B.Ramamurthy & K.Madurai
Peta-scale Data

[Diagram: the same Parser/Counter design applied to a peta-scale data collection (DataCollection → WordList → ResultTable)]

KEY: web weed green sun moon land part web green …
VALUE:
Before MapReduce…
• Large scale data processing was difficult!
• Managing hundreds or thousands of processors
• Managing parallelization and distribution
• I/O Scheduling
• Status and monitoring
• Fault/crash tolerance
• MapReduce provides all of these, easily!
MapReduce Overview
• How does it solve our previously mentioned problems?
• MapReduce is highly scalable and can be used across many
computers.
• Many small machines can be used to process jobs that normally
could not be processed by a large machine.
Map Abstraction
• Inputs a key/value pair
– Key is a reference to the input value
– Value is the data set on which to operate
• Evaluation
– Function defined by the user
– Applied to every value in the input
• Might need to parse the input
• Produces a new list of key/value pairs
– Can be a different type from the input pair
Map Example

map(String key, String value):
    // key: document name
    // value: document contents
    for each word w in value:
        EmitIntermediate(w, "1");
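A runnable Python version of this pseudocode might look like the following (the `map_fn` name and list-based emit are illustrative; the real library streams pairs out rather than returning a list):

```python
def map_fn(key, value):
    """key: document name; value: document contents."""
    intermediate = []
    for w in value.split():
        intermediate.append((w, "1"))   # EmitIntermediate(w, "1")
    return intermediate

print(map_fn("doc1", "web weed green web"))
# [('web', '1'), ('weed', '1'), ('green', '1'), ('web', '1')]
```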
Reduce Abstraction
• Starts with intermediate Key / Value pairs
• Ends with finalized Key / Value pairs
• Starting pairs are sorted by key
• Iterator supplies the values for a given key to the Reduce function.
Reduce Abstraction
• Typically, a function that:
• Starts with a large number of key/value pairs
• One key/value pair for each word in all files being grepped (including multiple entries for the same word)
• Ends with very few key/value pairs
• One key/value pair for each unique word across all the files, with the number of instances summed into this entry
• The input is broken up so that a given worker works with pairs of the same key.
Reduce Example

reduce(String key, Iterator values):
    // key: a word
    // values: a list of counts
    int result = 0;
    for each v in values:
        result += ParseInt(v);
    Emit(AsString(result));
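As with the map side, this pseudocode translates directly into Python (a sketch; the `reduce_fn` name and string return type mirror the pseudocode, not a real library API):

```python
def reduce_fn(key, values):
    """key: a word; values: an iterator of string counts."""
    result = 0
    for v in values:
        result += int(v)        # ParseInt(v)
    return str(result)          # Emit(AsString(result))

print(reduce_fn("web", iter(["1", "1"])))  # 2
```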
How Do Map and Reduce Work Together?

[Diagram: Map returns information; Reduce accepts that information and applies a user-defined function to reduce the amount of data]
Why is this approach better?
• Creates an abstraction for dealing with complex overhead
• The computations are simple, the overhead is messy
• Removing the overhead makes programs much smaller and thus
easier to use
• Less testing is required as well. The MapReduce libraries can be assumed to
work properly, so only user code needs to be tested
• Division of labor also handled by the MapReduce libraries, so
programmers only need to focus on the actual computation.
Peta-Scale Data Is Commonly Distributed

[Diagram: many distributed data collections feed the same Parser/Counter design (DataCollection → WordList → ResultTable); issue: managing the large-scale data]

KEY: web weed green sun moon land part web green …
VALUE:
Write Once Read Many (WORM) Data

[Diagram: the same Parser/Counter design over many distributed data collections; the data is written once and read many times]
WORM Data Is Amenable to Parallelism

1. Data with WORM characteristics yields to parallel processing.
2. Data without dependencies yields to out-of-order processing.

[Diagram: the same Parser/Counter design over distributed data collections]
Map Operation

MAP: input data ➔ <key, value> pairs

[Diagram: the data collection is divided into split 1 … split n to supply multiple processors; each Map task emits one <word, 1> pair per word occurrence (web 1, weed 1, green 1, sun 1, moon 1, land 1, part 1, web 1, green 1, …), producing a per-split KEY/VALUE list]
Reduce Operation

MAP: input data ➔ <key, value> pairs
REDUCE: <key, value> pairs ➔ result

[Diagram: each split (split 1 … split n) feeds a Map task; the Map outputs flow into Reduce tasks that produce the final results]
Large-Scale Data Splits

[Diagram: Map emits <key, 1> pairs; a parse-hash step routes each key to one of the reducers (say, Count), producing output partitions P-0000 (count1), P-0001 (count2), P-0002 (count3), …]
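The parse-hash routing above can be sketched as a deterministic partition function (an illustrative sketch; the idea is simply hash(key) mod R, with R the number of reducers):

```python
import hashlib

def partition(key, R):
    """Deterministically route an intermediate key to one of R reducers."""
    digest = int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)
    return digest % R

def partition_name(key, R):
    """Name of the output partition, e.g. P-0000."""
    return f"P-{partition(key, R):04d}"

# Every occurrence of the same key lands in the same partition,
# so one reducer sees all counts for that key.
assert partition("web", 3) == partition("web", 3)
```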
MapReduce Implementation
Computing Environment at Google
• Machines are dual-processor x86 machines running Linux, with 2-4 GB of
memory per machine.
• Commodity networking hardware is used – typically either 100
megabits/second or 1 gigabit/second at the machine level
• A cluster consists of 100s or 1000s of machines, and therefore machine
failures are common.
• Storage is provided by inexpensive IDE disks attached directly to individual
machines. GFS is used to manage the data stored on these disks. The file
system uses replication to provide availability and reliability on top of
unreliable hardware.
• Users submit jobs to a scheduling system. Each job consists of a set of tasks
and is mapped by the scheduler to a set of available machines within a
cluster.
WORKING PROCESS OF MapReduce (1)

[Figure]

WORKING PROCESS OF MapReduce (2)

[Figure]

SHUFFLE MECHANISM

[Figure]

EXAMPLE: TYPICAL PROGRAM WORDCOUNT

[Figure]

FUNCTIONS OF WORDCOUNT

[Figure]

MAP PROCESS OF WORDCOUNT

[Figure]

REDUCE PROCESS OF WORDCOUNT

[Figure]
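The map, shuffle, and reduce processes of WordCount shown in the figures can be simulated in memory (a teaching sketch, not the distributed runtime; `splits` stands in for the input file splits):

```python
from collections import defaultdict

def run_mapreduce(splits, map_fn, reduce_fn):
    # Map phase: apply map_fn to each (split name, contents) pair.
    intermediate = []
    for name, contents in splits.items():
        intermediate.extend(map_fn(name, contents))
    # Shuffle phase: group the intermediate values by key.
    groups = defaultdict(list)
    for k, v in intermediate:
        groups[k].append(v)
    # Reduce phase: one reduce_fn call per unique key, keys sorted.
    return {k: reduce_fn(k, vs) for k, vs in sorted(groups.items())}

def wc_map(name, contents):
    return [(word, 1) for word in contents.split()]

def wc_reduce(word, counts):
    return sum(counts)

splits = {"split1": "web weed green", "split2": "web green sun"}
print(run_mapreduce(splits, wc_map, wc_reduce))
# {'green': 2, 'sun': 1, 'web': 2, 'weed': 1}
```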
Master Data Structures
• For each map task and reduce task, the master stores the state (idle, in-progress, or completed) and the identity of the worker machine (for non-idle tasks).
• The master is the conduit through which the location of intermediate file regions is propagated from map tasks to reduce tasks.
▪ Therefore, for each completed map task, the master stores the locations and sizes of the R intermediate file regions produced by that map task.
▪ Updates to this location and size information are received as map tasks are completed.
▪ The information is pushed incrementally to workers that have in-progress reduce tasks.
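The bookkeeping described above might be modeled like this (illustrative Python; the real master structures are internal to Google's implementation, so all names here are assumptions):

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class TaskState:
    status: str = "idle"              # idle | in-progress | completed
    worker: Optional[str] = None      # worker identity (non-idle tasks only)

@dataclass
class MapTaskState(TaskState):
    # For a completed map task: locations and sizes of the R intermediate
    # file regions, pushed incrementally to in-progress reducers.
    region_locations: List[str] = field(default_factory=list)
    region_sizes: List[int] = field(default_factory=list)

task = MapTaskState()
task.status, task.worker = "in-progress", "worker-7"
```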
Points to Be Emphasized
• No reduce can begin until the map phase is complete
• The master must communicate the locations of intermediate files
• Tasks are scheduled based on the location of data
• If a map worker fails any time before reduce finishes, its task must be
completely rerun
• MapReduce library does most of the hard work for us!
How to use it?
• User to-do list:
• indicate:
• Input/output files
• M: number of map tasks
• R: number of reduce tasks
• W: number of machines
• Write map and reduce functions
• Submit the job
MapReduce programming model
 Determine whether the problem is parallelizable and solvable using
MapReduce (e.g., is the data WORM? Is the data set large?).
 Design and implement the solution as Mapper classes and a Reducer
class.
MapReduce Characteristics
• Very large-scale data
• Write-once, read-many data allows for parallelism without mutexes
• Map and Reduce are the main operations: simple code
• There are other supporting operations, such as combine and partition
• All map tasks should be completed before the reduce operation starts
• Map and reduce operations are typically performed by the same physical
processor
• The numbers of map tasks and reduce tasks are configurable
• Operations are provisioned near the data
• Commodity hardware and storage
• The runtime takes care of splitting and moving data for operations
Student Projects…

 Distributed Grep
 Reverse Web-Link Graph
 Term-Vector per Host
 Inverted Index

Each group should try to explain its problem and write the MR functions.
1. Distributed Grep
 Grep is a command-line utility for searching plain-text data sets for
lines that match a regular expression.
 Example: $ grep "apple" fruitlist.txt
 Map emits a line if it matches a supplied regular expression pattern.
 Reduce is an identity function that just copies the supplied
intermediate data to the output.

[Diagram: very big data is split; each split is grepped for matches; all matches are concatenated (cat) into the final output]
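The two functions described above could be sketched as (a Python sketch; the function names and "1" values are illustrative):

```python
import re

def grep_map(filename, contents, pattern):
    """Emit each line of the file that matches the pattern."""
    return [(line, "1") for line in contents.splitlines()
            if re.search(pattern, line)]

def grep_reduce(line, values):
    """Identity: copy the matched line straight to the output."""
    return line

print(grep_map("fruitlist.txt", "apple pie\nbanana\napple tart", "apple"))
# [('apple pie', '1'), ('apple tart', '1')]
```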
2. Reverse Web-Link Graph
 A forward web-link graph is a graph that has an edge from node URL1 to node URL2
if the web page found at URL1 has a hyperlink to URL2.
 A reverse web-link graph is the same graph with the edges reversed.
 MapReduce can easily be used to construct a reverse web-link graph.
 Map outputs <target, source> pairs for each link to a target URL found in a
document named source.
 Reduce concatenates the list of all source URLs associated with a given target URL
and emits the pair <target, list of source URLs>.
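A sketch of those two functions in Python (illustrative names; `targets` stands for the URLs extracted from the source page):

```python
def rwlg_map(source, targets):
    """Emit a <target, source> pair for each link found in the source page."""
    return [(target, source) for target in targets]

def rwlg_reduce(target, sources):
    """Concatenate all source URLs that link to this target."""
    return (target, list(sources))

print(rwlg_map("URL1", ["URL2", "URL3"]))
# [('URL2', 'URL1'), ('URL3', 'URL1')]
```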

3. Term-vector per host
• A term vector summarizes the most important words that occur in a document
or a set of documents as a list of <word, frequency> pairs.
• Map emits a <hostname, term vector> pair for each input document (where the
hostname is extracted from the URL of the document).
• Reduce is passed all per-document term vectors for a given host. It adds these
term vectors together, throwing away infrequent terms, and then emits a final
⟨hostname, term vector⟩ pair.
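A minimal sketch of this job, assuming a crude hostname extraction and an illustrative frequency cutoff (`min_freq`):

```python
from collections import Counter

def tv_map(url, contents):
    """Emit one <hostname, term vector> pair per document."""
    hostname = url.split("/")[0]   # crude extraction, for illustration only
    return [(hostname, Counter(contents.split()))]

def tv_reduce(hostname, vectors, min_freq=2):
    """Add the per-document vectors; throw away infrequent terms."""
    total = Counter()
    for v in vectors:
        total.update(v)
    return (hostname, {w: n for w, n in total.items() if n >= min_freq})
```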

4. Inverted Index
• Map parses each document, and emits a sequence of ⟨word,
document ID⟩ pairs.
• The Reduce function accepts all pairs for a given word, sorts the
corresponding document IDs, and emits a ⟨word, list(document ID)⟩ pair.
• The set of all output pairs forms a simple inverted index. It is easy to
augment this computation to keep track of word positions.
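In Python, the basic version might be sketched as (illustrative names; this sketch also de-duplicates document IDs, a small liberty beyond the sorting the text describes):

```python
def ii_map(doc_id, contents):
    """Emit a <word, document ID> pair for every word in the document."""
    return [(word, doc_id) for word in contents.split()]

def ii_reduce(word, doc_ids):
    """Sort (and de-duplicate) the document IDs for this word."""
    return (word, sorted(set(doc_ids)))

print(ii_reduce("web", ["doc2", "doc1", "doc2"]))
# ('web', ['doc1', 'doc2'])
```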
