MapReduce: Simplified Data
Processing on Large Clusters
Jeffrey Dean and Sanjay Ghemawat
Google, Inc.
OSDI ’04: 6th Symposium on Operating
Systems Design and Implementation
What Is It?
• “. . . A programming model and an
associated implementation for processing
and generating large data sets.”
• Google's implementation runs on a typical Google
cluster: a large number of commodity machines
connected by switched Ethernet, with inexpensive
disks attached directly to each machine in the
cluster.
Motivation
• Data-intensive applications
• Huge amounts of data, fairly simple
processing requirements, but …
• For efficiency, parallelize
• MapReduce is designed to simplify
parallelization and distribution so
programmers don’t have to worry about
details.
Advantages of Parallel
Programming
• Improves performance and efficiency.
• Divide processing into several parts which
can be executed concurrently.
• Each part can run simultaneously on
different CPUs in a single machine, or on
the CPUs of a set of computers connected
via a network.
Programming Model
• The model is “inspired by” Lisp primitives
map and reduce.
• map applies the same operation to several
different data items; e.g.,
(mapcar #'abs '(3 -4 2 -5)) => (3 4 2 5)
• reduce applies a single operation to a set of
values to get a result; e.g.,
(reduce #'+ '(3 4 2 5)) => 14
Programming Model
• MapReduce was developed by Google to
process large amounts of raw data, for
example, crawled documents or web
request logs.
• There is so much data it must be
distributed across thousands of machines
in order to be processed in a reasonable
time.
Programming Model
• Input & Output: a set of key/value pairs
• The programmer supplies two functions:
• map (in_key, in_value) =>
list(intermediate_key, intermediate_value)
• reduce (intermediate_key,
list(intermediate_value)) =>
list(out_value)
• The program takes a set of input key/value pairs
and merges all the intermediate values for a
given key into a smaller set of final values.
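• A minimal sketch of these two signatures as Python type aliases (the names are illustrative assumptions; the paper's actual interface is a C++ library):

    from typing import Callable, List, Tuple

    # map:    (in_key, in_value) -> list of (intermediate_key, intermediate_value)
    # reduce: (intermediate_key, list of intermediate_value) -> list of out_value
    MapFn = Callable[[str, str], List[Tuple[str, str]]]
    ReduceFn = Callable[[str, List[str]], List[str]]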
Example: Count occurrences of words in a
set of files
• Map function: for each word in each file, count
occurrences
• Input_key: file name; Input_value: file contents
• Intermediate results: for each file, a list of words
and frequency counts
– out_key = a word; int_value = word count in this file
• Reduce function: for each word, sum its
occurrences over all files
• Input key: a word; Input value: a list of counts
• Final results: A list of words, and the number of
occurrences of each word in all the files.
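• A minimal Python sketch of this word-count example, assuming whitespace tokenization (the paper gives it as pseudocode over Google's C++ library):

    def wordcount_map(filename, contents):
        # in_key = file name, in_value = file contents
        counts = {}
        for word in contents.split():
            counts[word] = counts.get(word, 0) + 1
        # emit one (word, count-in-this-file) pair per distinct word
        return list(counts.items())

    def wordcount_reduce(word, per_file_counts):
        # per_file_counts = the counts emitted for this word by every map task
        return [sum(per_file_counts)]

• For example, wordcount_map("a.txt", "to be or not to be") emits ("to", 2), ("be", 2), ("or", 1), and ("not", 1).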
Other Examples
• Distributed Grep: find all occurrences of a
pattern supplied by the programmer
– Input: the pattern and set of files
• key = pattern (regexp), data = a file name
– Map function: run grep with the pattern over each file
– Intermediate results: lines in which the pattern
appeared, keyed to files
• key = file name, data = line
– Reduce function is the identity function:
passes on the intermediate results
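• A Python sketch of distributed grep in the same spirit; here the pattern is passed as an extra argument for readability, though in a real job it would be fixed when the computation is configured:

    import re

    def grep_map(filename, contents, pattern):
        # emit (file name, line) for every line that matches the pattern
        return [(filename, line)
                for line in contents.splitlines()
                if re.search(pattern, line)]

    def grep_reduce(filename, matching_lines):
        # identity reduce: pass the intermediate results through unchanged
        return matching_lines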
Other Examples
• Count URL Access Frequency
– Map function: counts requests for each URL in a
log of requests
• intermediate key: a URL; input data: a request log
– Intermediate results: (URL, count for this log)
pairs
– Reduce function: combines URL count for all
logs and emits (URL, total_count)
Implementation
• More than one way to implement
MapReduce, depending on environment
• Google chooses to use the same
environment that it uses for the GFS: large
(~1000 machines) clusters of PCs with
attached disks, based on 100 megabit/sec
or 1 gigabit/sec Ethernet.
• Batch environment: user submits job to a
scheduler (Master)
Implementation
• Job scheduling:
– User submits job to scheduler (one program
consists of many tasks)
– scheduler assigns tasks to machines.
General Approach
• The MASTER:
– initializes the problem; divides it up among a
set of workers
– sends each worker a portion of the data
– receives the results from each worker
• The WORKER:
– receives data from the master
– performs processing on its part of the data
– returns results to master
Overview
• The Map invocations are distributed
across multiple machines by automatically
partitioning the input data into a set of M
splits or shards.
• The worker process parses the input to
identify the key/value pairs and passes
them to the Map function (defined by the
programmer).
Overview
• The input shards can be processed in
parallel on different machines.
– It’s essential that the Map function be able to
operate independently – what happens on
one machine doesn’t depend on what
happens on any other machine.
• Intermediate results are stored on local
disks, partitioned into R regions as
determined by the user’s partitioning
function. (R <= # of output keys)
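• The paper's example of a default partitioning function is hash(key) mod R; a Python sketch, using a stable hash so a given key always maps to the same region:

    import zlib

    def default_partition(intermediate_key, R):
        # route this intermediate key to one of the R reduce regions
        return zlib.crc32(intermediate_key.encode("utf-8")) % R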
Overview
• The number of partitions (R) and the
partitioning function are specified by the user.
• Map workers notify Master of the location of the
intermediate key-value pairs; the master
forwards the addresses to the reduce workers.
• Reduce workers use RPC to read the data
remotely from the map workers and then
process it.
• Each reduction takes all the values associated
with a single key and reduces them to one or
more results.
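• A single-process sketch of what one reduce worker does with the data it fetches: sort the region's intermediate pairs by key so equal keys are adjacent, group the values, and call the user's reduce once per key (the paper notes that an external sort is used when the data does not fit in memory):

    from itertools import groupby
    from operator import itemgetter

    def reduce_region(intermediate_pairs, reduce_fn):
        # intermediate_pairs: every (key, value) pair routed to this reduce region
        output = []
        for key, group in groupby(sorted(intermediate_pairs, key=itemgetter(0)),
                                  key=itemgetter(0)):
            values = [value for _, value in group]
            output.extend(reduce_fn(key, values))   # one reduce call per distinct key
        return output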
Example
• In the word-count app, a worker emits a
list of word-frequency pairs; e.g. (a,
100), (an, 25), (ant, 1), …
• out_key = a word; value = word count
for some file
• All the results for a given out_key are
passed to a reduce worker for the next
processing phase.
Overview
• Final results are appended to an output file
that is part of the global file system.
• When all map and reduce tasks are complete,
the master wakes up the user program and
the MapReduce call returns.
Fault Tolerance
• Important: MapReduce relies on hundreds,
even thousands, of machines, so failures
are inevitable.
• Periodically, the master pings workers.
• Workers that don’t respond in a pre-
determined amount of time are considered
to have failed.
• Any map task or reduce task in progress
on a failed worker is reset to idle and
becomes eligible for rescheduling.
Fault Tolerance
• Any map tasks completed by the failed worker
are also reset to the idle state and become
eligible for rescheduling on other workers.
• Reason: since the results are stored on the
disk of the failed machine, they are
inaccessible.
• Completed reduce tasks on failed machines
don’t need to be redone because output
goes to a global file system.
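• A hedged sketch of the master's bookkeeping on a worker failure (the data structures and field names are illustrative assumptions, not the paper's code): completed map tasks are re-queued because their output lived on the failed worker's local disk, while completed reduce tasks stay finished because their output is already in the global file system.

    def handle_worker_failure(failed_worker, map_tasks, reduce_tasks):
        # each task is a dict with a "state" ("idle", "in_progress", "completed")
        # and the "worker" it was assigned to
        for task in map_tasks:
            if task["worker"] == failed_worker and task["state"] != "idle":
                task["state"] = "idle"   # map output on the local disk is lost; redo
                task["worker"] = None
        for task in reduce_tasks:
            if task["worker"] == failed_worker and task["state"] == "in_progress":
                task["state"] = "idle"   # completed reduce output is safe in GFS
                task["worker"] = None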
Failure of the Master
• Regular checkpoints of all the Master’s
data structures would make it possible to
roll back to a known state and start again.
• However, since there is only one master, its
failure is highly unlikely, so the current
approach is simply to abort the MapReduce
computation if the master fails.
Locality
• Recall the Google File System (GFS) implementation:
• Files are divided into 64 MB blocks, and each
block is replicated on several machines
(typically three).
• The Master knows the location of the input data
and tries to schedule map operations on
machines that already hold the necessary input.
If that's not possible, it schedules them on a
nearby machine (e.g., one on the same network
switch) to reduce network traffic.
Task Granularity
• Map phase is subdivided into M pieces
and the reduce phase into R pieces.
• Objective: make M and R much larger than
the number of worker machines.
– Improves dynamic load balancing
– Speeds up recovery in case of failure; failed
machine’s many completed map tasks can be
spread out across all other workers.
Task Granularity
• Practical limits on size of M and R:
– Master must make O(M + R) scheduling
decisions and store O(M * R) states
– Users typically restrict size of R, because the
output of each reduce worker goes to a
different output file
– Authors say they “often” set M = 200,000 and
R = 5,000. Number of workers = 2,000.
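– Worked example with the figures above: the master
makes on the order of M + R = 200,000 + 5,000 =
205,000 scheduling decisions, but must keep on the
order of M × R = 200,000 × 5,000 = 1,000,000,000
pieces of per-task-pair state, which is why M and R
cannot grow without bound.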
“Stragglers”
• A machine that takes an unusually long time to
complete one of the last few map or reduce
tasks in the computation.
– Causes: a bad disk that slows read operations,
other tasks scheduled on the same machine
competing for resources, etc.
– Solution: assign stragglers' unfinished tasks as
backup executions on machines that have
finished their own work; use the results from
whichever copy (the original or the backup)
finishes first.
Experience
• Google used MapReduce to rewrite the indexing
system that constructs the Google search engine
data structures.
• Input: GFS documents retrieved by the web
crawlers – about 20 terabytes of data.
• Benefits
– Simpler, smaller, more readable indexing code
– Many problems, such as machine failures, are dealt
with automatically by the MapReduce library.
Conclusions
• Easy to use. Programmers are shielded
from the problems of parallel processing
and distributed systems.
• Can be used for many classes of problems,
including generating data for the search
engine, sorting, data mining, machine
learning, and others.
• Scales to clusters of thousands of
machines.
• But ….
Not everyone agrees that MapReduce is
wonderful!
• The database community believes parallel
database systems are a better solution.