MapReduce
From CS Foundations to MapReduce
• Consider a large data collection: {web, weed, green, sun, moon, land, part, web, green, …}
• Problem: Count the occurrences of the different words in the collection.
• Let's design a solution for this problem (before MR):
  • We will start from scratch
  • We will add and relax constraints
  • We will do incremental design, improving the solution for performance and scalability
Word Counter and Result Table (Before MR)
[Figure: Main uses a single WordCounter (parse(), count()) to read the data collection {web, weed, green, sun, moon, land, part, web, green, …} and build the result table: web 2, weed 1, green 2, sun 1, moon 1, land 1.]
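To make this starting point concrete, here is a minimal sequential sketch in Java. The class name WordCounter and the methods parse() and count() follow the diagram; everything else (for example the comma-separated input string) is an assumption made for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Single-threaded word counter: parse the collection, then count each token.
class WordCounter {
    private final Map<String, Integer> resultTable = new LinkedHashMap<>();

    String[] parse(String dataCollection) {
        return dataCollection.split(",\\s*");          // split on commas
    }

    void count(String[] words) {
        for (String w : words) {
            resultTable.merge(w, 1, Integer::sum);     // increment this word's count
        }
    }

    Map<String, Integer> result() { return resultTable; }

    public static void main(String[] args) {
        WordCounter wc = new WordCounter();
        wc.count(wc.parse("web, weed, green, sun, moon, land, part, web, green"));
        System.out.println(wc.result());               // {web=2, weed=1, green=2, ...}
    }
}
```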
Multiple Instances of Word Counter
[Figure: Main now spawns 1..* threads, each running a WordCounter (parse(), count()) over the data collection; together they fill the result table: web 2, weed 1, green 2, sun 1, moon 1, land 1, part 1.]
Improve Word Counter for Performance
[Figure: the WordCounter is split into separate Parser and Counter classes (1..* instances of each). The Parser emits an intermediate KEY list (web, weed, green, sun, moon, land, part, web, green, …) with associated VALUEs; the Counter consumes it to build the result table. No need for a lock.]
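A sketch of this split, assuming the Parser only reads the input and emits a flat KEY list, while the Counter alone owns the result table, so neither stage needs a lock:

```java
import java.util.*;
import java.util.stream.*;

// Parser and Counter decoupled by an intermediate KEY list.
// The parser only reads the input; the counter owns the result table.
class ParserCounterPipeline {
    static List<String> parse(String dataCollection) {                 // Parser stage
        return Arrays.asList(dataCollection.split(",\\s*"));
    }

    static Map<String, Long> count(List<String> keys) {                // Counter stage
        return keys.stream()
                   .collect(Collectors.groupingBy(k -> k, Collectors.counting()));
    }

    public static void main(String[] args) {
        List<String> keys = parse("web, weed, green, sun, moon, land, part, web, green");
        System.out.println(count(keys));   // e.g. {web=2, weed=1, green=2, sun=1, ...}
    }
}
```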
Peta-scale Data
[Figure: the same Parser/Counter design applied to a peta-scale data collection, still producing the intermediate KEY/VALUE list and the result table.]
Before MapReduce…
• Large-scale data processing was difficult!
• Managing hundreds or thousands of processors
• Managing parallelization and distribution
• I/O Scheduling
• Status and monitoring
• Fault/crash tolerance
• MapReduce provides all of these, easily!
MapReduce Overview
• How does it solve our previously mentioned problems?
• MapReduce is highly scalable and can be used across many
computers.
• Many small machines can be used to process jobs that normally
could not be processed by a large machine.
Map Abstraction
• Inputs a key/value pair
  – Key is a reference to the input value
  – Value is the data set on which to operate
• Evaluation
  – Function defined by user
  – Applies to every value in the input
• Might need to parse the input
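As a hedged illustration of this abstraction, a user-defined map function takes one key/value pair and emits a list of intermediate pairs. The interface and class names below are invented for the sketch and are not part of any particular library.

```java
import java.util.*;

// Generic map abstraction: a user-defined function from one input pair
// to a list of intermediate <key, value> pairs.
interface MapFunction<K1, V1, K2, V2> {
    List<Map.Entry<K2, V2>> map(K1 key, V1 value);
}

// Word-count map: the key references the input (a document id),
// the value is the data to operate on (its contents).
class WordCountMap implements MapFunction<String, String, String, Integer> {
    @Override
    public List<Map.Entry<String, Integer>> map(String docId, String contents) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String word : contents.split("\\W+")) {      // parse the input value
            if (!word.isEmpty()) {
                out.add(Map.entry(word, 1));               // emit <word, 1>
            }
        }
        return out;
    }
}
```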
Why is this approach better?
• Creates an abstraction for dealing with complex overhead
• The computations are simple, the overhead is messy
• Removing the overhead makes programs much smaller and thus
easier to use
• Less testing is required as well. The MapReduce libraries can be assumed to
work properly, so only user code needs to be tested
• Division of labor is also handled by the MapReduce libraries, so programmers only need to focus on the actual computation.
Peta Scale Data is Commonly Distributed
[Figure: the data collection is now split across many separate collections; Main's 1..* threads run Parser and Counter instances over the distributed pieces, still producing the KEY/VALUE list and the result table.]
Write Once Read Many (WORM) data
[Figure: the same distributed design, annotated with the class model DataCollection → WordList → ResultTable; each distributed data collection is write-once-read-many.]
WORM Data is Amenable to Parallelism
1. Data with WORM characteristics yields to parallel processing.
2. Data without dependencies yields to out-of-order processing.
[Figure: the distributed DataCollection → WordList → ResultTable design, with Parser and Counter instances working on the pieces in parallel.]
Map Operation
MAP: Input data ➔ <key, value> pair
[Figure: the data collection is split into split 1 … split n and distributed to processors; each Map task parses its split and emits a <word, 1> pair for every word it sees (web 1, weed 1, green 1, sun 1, moon 1, land 1, part 1, web 1, green 1, …), producing a KEY/VALUE list per split.]
Reduce Operation
MAP: Input data ➔ <key, value> pair
REDUCE: <key, value> pair ➔ <result>
[Figure: each data split (split 1 … split n) feeds a Map task; the intermediate <key, value> pairs are supplied to multiple Reduce tasks, which produce the results.]
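As a small, purely in-memory sketch of the two operations together (this is not the distributed implementation, just an illustration of the model): split the collection, map each split to <word, 1> pairs, group the intermediate pairs by key, and reduce each group by summing.

```java
import java.util.*;
import java.util.stream.*;

class MiniMapReduce {
    // MAP: one split of the data collection -> list of <word, 1> pairs
    static List<Map.Entry<String, Integer>> map(String split) {
        return Arrays.stream(split.split("\\W+"))
                     .filter(w -> !w.isEmpty())
                     .map(w -> Map.entry(w, 1))
                     .collect(Collectors.toList());
    }

    // REDUCE: <word, list of counts> -> total count for that word
    static int reduce(String word, List<Integer> counts) {
        return counts.stream().mapToInt(Integer::intValue).sum();
    }

    public static void main(String[] args) {
        List<String> splits = List.of("web weed green sun", "moon land part web green");

        // Shuffle: group all intermediate values by key.
        Map<String, List<Integer>> grouped = splits.stream()
                .flatMap(s -> map(s).stream())
                .collect(Collectors.groupingBy(Map.Entry::getKey,
                         Collectors.mapping(Map.Entry::getValue, Collectors.toList())));

        grouped.forEach((word, counts) ->
                System.out.println(word + " " + reduce(word, counts)));
    }
}
```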
Large scale data splits
[Figure: each large data split goes through Map and a Parse-hash step that assigns its <key, 1> pairs to partitions P-0000, P-0001, P-0002, …; each partition is consumed by a Reducer (say, Count), producing count1, count2, count3, ….]
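The partitions P-0000, P-0001, … in the figure come from a partition (hash) step applied to each intermediate key. A typical default, sketched here with assumed names, hashes the key modulo the number of reduce tasks R:

```java
class Partitioner {
    // Assign an intermediate key to one of R reduce partitions (P-0000 … P-000(R-1)).
    static int partition(String key, int numReduceTasks) {
        // Mask the sign bit so the modulo result is never negative.
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        int R = 3;
        for (String key : new String[]{"web", "weed", "green", "sun"}) {
            System.out.printf("%s -> P-%04d%n", key, partition(key, R));
        }
    }
}
```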
MapReduce Implementation
Computing Environment at Google
• Machines are dual-processor x86 systems running Linux, with 2-4 GB of memory per machine.
• Commodity networking hardware is used – typically either 100
megabits/second or 1 gigabit/second at the machine level
• A cluster consists of 100s or 1000s of machines, and therefore machine
failures are common.
• Storage is provided by inexpensive IDE disks attached directly to individual
machines. GFS is used to manage the data stored on these disks. The file
system uses replication to provide availability and reliability on top of
unreliable hardware.
• Users submit jobs to a scheduling system. Each job consists of a set of tasks
and is mapped by the scheduler to a set of available machines within a
cluster.
WORKING PROCESS OF MapReduce (1)
WORKING PROCESS OF MapReduce (2)
SHUFFLE MECHANISM
EXAMPLE: TYPICAL PROGRAM WORDCOUNT
FUNCTIONS OF WORDCOUNT
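As a sketch of the WordCount map and reduce functions, written here against the Hadoop MapReduce API (an open-source implementation of this model); the slides themselves do not show the code, so treat the details below as illustrative rather than the original.

```java
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map: for each word in a line of input, emit <word, 1>.
class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer itr = new StringTokenizer(value.toString());
        while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, ONE);
        }
    }
}

// Reduce: sum all the 1s emitted for a given word.
class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable val : values) {
            sum += val.get();
        }
        result.set(sum);
        context.write(key, result);
    }
}
```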
MAP PROCESS OF WORDCOUNT
REDUCE PROCESS OF WORDCOUNT
Master Data Structures
• For each map task and reduce task, the master stores its state (idle, in-progress, or completed) and the identity of the worker machine (for non-idle tasks).
• The master is the conduit through which the location of intermediate file
regions is propagated from map tasks to reduce tasks.
▪ Therefore, for each completed map task, the master stores the locations
and sizes of the R intermediate file regions produced by the map task.
▪ Updates to this location and size information are received as map tasks
are completed.
▪ The information is pushed incrementally to workers that have in-progress
reduce tasks.
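A Java sketch of this master-side bookkeeping; the field and method names are assumptions made for illustration, only the kind of information stored follows the description above.

```java
import java.util.*;

// Per-task state kept by the master.
enum TaskState { IDLE, IN_PROGRESS, COMPLETED }

class MapTaskInfo {
    TaskState state = TaskState.IDLE;
    String workerId;             // identity of the worker (for non-idle tasks)
    // For a completed map task: location and size of each of the R
    // intermediate file regions it produced, indexed by reduce partition.
    String[] regionLocations;
    long[]   regionSizes;
}

class Master {
    final Map<Integer, MapTaskInfo> mapTasks = new HashMap<>();

    // Called as each map task completes; this information is then pushed
    // incrementally to workers running in-progress reduce tasks.
    void onMapCompleted(int taskId, String worker, String[] locations, long[] sizes) {
        MapTaskInfo info = mapTasks.computeIfAbsent(taskId, id -> new MapTaskInfo());
        info.state = TaskState.COMPLETED;
        info.workerId = worker;
        info.regionLocations = locations;
        info.regionSizes = sizes;
    }
}
```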
Points that need to be emphasized
• No reduce can begin until map is complete
• Master must communicate locations of intermediate files
• Tasks scheduled based on location of data
• If a map worker fails at any time before the reduce finishes, its map tasks must be completely rerun
• MapReduce library does most of the hard work for us!
How to use it?
• User to-do list:
• indicate:
• Input/output files
• M: number of map tasks
• R: number of reduce tasks
• W: number of machines
• Write map and reduce functions
• Submit the job
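In Hadoop's implementation of the model, this to-do list roughly corresponds to a driver program like the sketch below. It reuses the TokenizerMapper and IntSumReducer from the earlier WordCount sketch; the input/output paths come from the command line, R is set explicitly, while M is normally derived from the input splits and the number of machines W is a property of the cluster rather than of the job.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCountDriver.class);

        job.setMapperClass(TokenizerMapper.class);       // user-written map
        job.setReducerClass(IntSumReducer.class);         // user-written reduce
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        job.setNumReduceTasks(4);                         // R: number of reduce tasks

        FileInputFormat.addInputPath(job, new Path(args[0]));    // input files
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output directory

        System.exit(job.waitForCompletion(true) ? 0 : 1);        // submit the job
    }
}
```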
MapReduce programming model
Determine whether the problem is parallelizable and solvable using MapReduce (e.g., is the data WORM? Is the data set large?).
Design and implement the solution as Mapper classes and a Reducer class.
MapReduce Characteristics
• Very large-scale data
• Write-once-read-many data allows for parallelism without mutexes
• Map and Reduce are the main operations: simple code
• There are other supporting operations such as combine and partition.
• All map tasks should be completed before the reduce operation starts.
• Map and reduce operations are typically performed by the same physical processor.
• The numbers of map tasks and reduce tasks are configurable.
• Operations are provisioned near the data.
• Commodity hardware and storage.
• Runtime takes care of splitting and moving data for operations.
Student Projects…
Distributed Grep
Reverse Web-Link Graph
Term-Vector per Host
Inverted Index
1. Distributed Grep
Grep is a command-line utility for searching plain-text data sets for lines that match a regular expression.
Example: $ grep "apple" fruitlist.txt
Map emits a line if it matches a supplied regular expression pattern.
Reduce is an identity function that just copies the supplied
intermediate data to the output.
[Figure: input data splits → grep (Map) → matching lines.]
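A minimal sketch of the grep map function under this model; the regular expression is assumed to be supplied to the task as a parameter, and the identity reduce is omitted.

```java
import java.util.*;
import java.util.regex.Pattern;

// Distributed grep: map emits a line only if it matches the pattern;
// reduce is the identity (it just copies intermediate pairs to the output).
class GrepMap {
    static List<Map.Entry<String, String>> map(String fileOffset, String line,
                                                Pattern pattern) {
        if (pattern.matcher(line).find()) {
            return List.of(Map.entry(fileOffset, line));   // emit the matching line
        }
        return List.of();                                  // emit nothing otherwise
    }
}
```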
3. Term-vector per host
• A term vector summarizes the most important words that occur in a document
or a set of documents as a list of <word, frequency> pairs.
• Map emits a <hostname, term vector> pair for each input document (where the
hostname is extracted from the URL of the document).
• Reduce is passed all per-document term vectors for a given host. It adds these
term vectors together, throwing away infrequent terms, and then emits a final
⟨hostname, term vector⟩ pair.
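A sketch of the reduce step, assuming per-document term vectors arrive as word-to-frequency maps and that "infrequent" means below a threshold passed in as a parameter:

```java
import java.util.*;

// Term-vector per host: reduce merges all per-document term vectors for one
// hostname and throws away terms below a minimum frequency.
class TermVectorReduce {
    static Map<String, Integer> reduce(String hostname,
                                       List<Map<String, Integer>> docVectors,
                                       int minFrequency) {
        Map<String, Integer> merged = new HashMap<>();
        for (Map<String, Integer> vector : docVectors) {
            vector.forEach((term, freq) -> merged.merge(term, freq, Integer::sum));
        }
        merged.values().removeIf(freq -> freq < minFrequency);  // drop infrequent terms
        return merged;                                           // final <hostname, term vector>
    }
}
```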
4. Inverted Index
• Map parses each document, and emits a sequence of ⟨word,
document ID⟩ pairs.
• Reduce function accepts all pairs for a given word, sorts the
corresponding document IDs and emits a ⟨word, list(document ID)⟩ pair.
• The set of all output pairs forms a simple inverted index. It is easy to
augment this computation to keep track of word positions.
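A sketch of both functions, with document IDs represented as plain strings:

```java
import java.util.*;
import java.util.stream.*;

// Inverted index: map emits <word, docId> for every word in a document;
// reduce sorts the document IDs for a word and emits <word, list(docId)>.
class InvertedIndex {
    static List<Map.Entry<String, String>> map(String docId, String contents) {
        return Arrays.stream(contents.split("\\W+"))
                     .filter(w -> !w.isEmpty())
                     .map(w -> Map.entry(w, docId))
                     .collect(Collectors.toList());
    }

    static Map.Entry<String, List<String>> reduce(String word, List<String> docIds) {
        List<String> sorted = docIds.stream().sorted().collect(Collectors.toList());
        return Map.entry(word, sorted);
    }
}
```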