MapReduce: Simplified Data Processing on Large Clusters, by Jeffrey Dean and Sanjay Ghemawat. Presented by Jon Logan

This document provides an overview of MapReduce, a programming model for processing large datasets in a distributed computing environment. It begins with an example word counting problem to illustrate MapReduce concepts. Mappers process input data in parallel to produce intermediate key-value pairs, which are then grouped and sent to reducers. Reducers receive all values associated with a key and produce the final output. The document outlines the major MapReduce components and how data flows through the system, noting its advantages over traditional approaches for large-scale data processing.

MapReduce

Simplified Data Processing on Large Clusters

by Jeffrey Dean and Sanjay Ghemawat

Presented by Jon Logan
Outline

Problem Statement / Motivation

An Example Program
MapReduce vs Hadoop
GFS / HDFS
MapReduce Fundamentals
Example Code
Workflows
Conclusion / Questions
Why MapReduce?
Before MapReduce
Large Concurrent Systems
Grid Computing
Rolling Your Own Solution
Considerations
Threading is hard!
How do you scale to more machines?
How do you handle machine failures?
How do you facilitate communication between nodes?
Does your solution scale?

Scale out, not up!

An Example Program

I will present the concepts of MapReduce using its canonical example,
Word Count
The input of this program is a volume of raw text of unspecified size (it could
be KB, MB, or TB; it doesn't matter!)
The output is a list of words and their occurrence counts. Assume that words
are split correctly, ignoring capitalization and punctuation.
Example
The doctor went to the store. =>
The, 2
Doctor, 1
Went, 1
To, 1
Store, 1
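
As a point of reference, the Word Count specification above can be sketched on a single machine in plain Python. This is not MapReduce yet, only the behaviour the distributed version has to reproduce. The function name `word_count` and the tokenizing rule (runs of letters and digits, lowercased) are assumptions made for illustration.

```python
import re
from collections import Counter

def word_count(text: str) -> Counter:
    """Count words, ignoring capitalization and punctuation."""
    # Assumed tokenizer: lowercase the text, then keep runs of letters/digits.
    words = re.findall(r"[a-z0-9]+", text.lower())
    return Counter(words)

counts = word_count("The doctor went to the store.")
for word, n in counts.items():
    print(f"{word}, {n}")
# the, 2
# doctor, 1
# went, 1
# to, 1
# store, 1
```

This version holds every count in one process's memory, which is exactly what stops working once the input no longer fits on one machine.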
Map? Reduce?

Mappers read in data from the filesystem, and output (typically) modified data
Reducers collect all of the mappers' output, grouped by key, and output (typically)
reduced data
The output data is written to disk

All data is expressed as key-value pairs

MapReduce vs Hadoop

The paper is written by two researchers at Google, and describes their
programming paradigm
Unless you work at Google, or use Google App Engine, you won't use it!
The open-source implementation is Hadoop MapReduce
Not developed by Google
Created by Doug Cutting and Mike Cafarella, with much of its early development at Yahoo

Google's implementation (at least the one described) is written in C++

Hadoop is written in Java
GFS/HDFS

This is not a GFS/HDFS presentation! (But the following presentation is)

A few concepts are key to MapReduce though:

Google File System (GFS) and Hadoop Distributed File System (HDFS) are essentially
distributed filesystems
They are fault tolerant through replication
They allow data to be local to computation
Major Components

User Components:
Mapper
Reducer
Combiner (Optional)
Partitioner (Optional) (Shuffle)
Writable(s) (Optional)

System Components:
Master
Input Splitter*
Output Committer*
* You can use your own if you really want!

Image source: http://www.ibm.com/developerworks/java/library/l-hadoop-3/index.html

Key Notes

Mappers and Reducers are typically single-threaded and deterministic

Determinism allows for restarting of failed jobs, or speculative execution
Need to handle more data? Just add more Mappers/Reducers!
No need to write multithreaded code
Since they're all independent of each other, you can run an (almost) arbitrary number of nodes
Mappers/Reducers run on arbitrary machines. A machine typically has multiple map and
reduce slots available to it, typically one per processor core
Mappers/Reducers run entirely independently of each other
In Hadoop, they run in separate JVMs
Basic Concepts

All data is represented as key-value pairs of an arbitrary type

Data is read in from a file or list of files, from HDFS
Data is chunked based on an input split
A typical chunk is 64MB (more or less can be configured depending on your use case)

Mappers read in a chunk of data

Mappers emit (write out) a set of data, typically derived from their input
Intermediate data (the output of the mappers) is split among a number of reducers
Reducers receive each key of data, along with ALL of the values associated with it
(this means each key must always be sent to the same reducer)
Essentially, <key, list<value>> (duplicate values are kept)
Reducers emit a set of data, typically reduced from their input, which is written to disk
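
The grouping step between mappers and reducers can be sketched in a few lines of plain Python. This is an in-memory illustration only: the function name `shuffle` is hypothetical, and a real framework performs this step across machines and on disk.

```python
from collections import defaultdict

def shuffle(pairs):
    """Group intermediate (key, value) pairs into key -> list of values."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    # Sorting the keys mirrors the sorted order reducers usually see.
    return {key: groups[key] for key in sorted(groups)}

mapper_output = [("the", 1), ("store", 1), ("the", 1), ("was", 1)]
grouped = shuffle(mapper_output)
print(grouped)  # {'store': [1], 'the': [1, 1], 'was': [1]}
```

Note that the values for "the" are kept as two separate 1s. The reducer needs every value, including repeats, to produce a correct total.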
Data Flow

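[Figure: the input is divided into splits; each split is read by one mapper (Mapper 0, Mapper 1, Mapper 2); mapper output is routed to the reducers (Reducer 0, Reducer 1), and each reducer writes its own output (Out 0, Out 1).]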
Input Splitter

Is responsible for splitting your input into multiple chunks

These chunks are then used as input for your mappers
Splits on logical boundaries. The default is 64MB per chunk
Depending on what you're doing, 64MB might be a LOT of data! You can change it
Typically, you can just use one of the built-in splitters, unless you are reading
in a specially formatted file
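
Splitting on a logical boundary can be illustrated with a small plain-Python sketch that never breaks a line in half. The function name `split_input` and the `max_chunk_bytes` parameter are assumptions for this example; it is a simplification and not how Hadoop's built-in splitters are implemented.

```python
def split_input(text: str, max_chunk_bytes: int) -> list[str]:
    """Split text into chunks of roughly max_chunk_bytes, never breaking a line."""
    chunks, current, current_size = [], [], 0
    for line in text.splitlines(keepends=True):
        line_size = len(line.encode("utf-8"))
        # Start a new chunk when adding this line would exceed the limit.
        # A single line longer than the limit still gets a chunk of its own.
        if current and current_size + line_size > max_chunk_bytes:
            chunks.append("".join(current))
            current, current_size = [], 0
        current.append(line)
        current_size += line_size
    if current:
        chunks.append("".join(current))
    return chunks

text = "the teacher went\nto the store\nthe store was closed\n"
chunks = split_input(text, max_chunk_bytes=32)
print(len(chunks))  # 2
```

Each chunk can then be handed to a different mapper, and concatenating the chunks gives back the original input.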
Mapper

Reads in input pair <K,V> (a section as split by the input splitter)

Outputs a pair <K, V>

Ex. For our Word Count example, with the following input: The teacher went
to the store. The store was closed; the store opens in the morning. The store
opens at 9am.

The output would be:

<the, 1> <teacher, 1> <went, 1> <to, 1> <the, 1> <store, 1> <the, 1> <store, 1>
<was, 1> <closed, 1> <the, 1> <store, 1> <opens, 1> <in, 1> <the, 1> <morning, 1>
<the, 1> <store, 1> <opens, 1> <at, 1> <9am, 1>
Reducer

Accepts the Mapper output, and collects values on the key

All inputs with the same key must go to the same reducer!
Input is typically sorted by key; output is written exactly as emitted
For our example, the reducer input would be:
<the, 1> <teacher, 1> <went, 1> <to, 1> <the, 1> <store, 1> <the, 1> <store, 1>
<was, 1> <closed, 1> <the, 1> <store, 1> <opens, 1> <in, 1> <the, 1> <morning, 1>
<the, 1> <store, 1> <opens, 1> <at, 1> <9am, 1>
The output would be:
<the, 6> <teacher, 1> <went, 1> <to, 1> <store, 4> <was, 1> <closed, 1> <opens, 2>
<in, 1> <morning, 1> <at, 1> <9am, 1>
Combiner

Essentially an intermediate reducer

Is optional
Reduces the output of each mapper locally, which cuts bandwidth and sorting work
Cannot change the type of its input
Input types must be the same as output types
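
The bandwidth saving can be made concrete with a small plain-Python sketch. The names `map_words` and `combine` are hypothetical, and the combiner here simply sums counts per word within one mapper's output, so its input and output are both (word, count) pairs.

```python
from collections import Counter

def map_words(chunk: str):
    """Mapper: emit (word, 1) for every word in the chunk."""
    return [(word, 1) for word in chunk.lower().split()]

def combine(pairs):
    """Combiner: sum the values per key locally, keeping the (word, count) shape."""
    totals = Counter()
    for word, count in pairs:
        totals[word] += count
    return list(totals.items())

chunk = "the store was closed the store opens in the morning"
raw = map_words(chunk)
combined = combine(raw)
print(len(raw), len(combined))  # 10 7
```

Ten pairs shrink to seven before anything crosses the network. On real text, where common words repeat heavily within a chunk, the reduction is usually far larger.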
Output Committer

Is responsible for taking the reduce output, and committing it to a file

Typically, this committer needs a corresponding input splitter (so that another
job can read the output back in as its input)
Again, the built-in committers are usually good enough, unless you need to output a
special kind of file
Partitioner (Shuffler)

Decides which pairs are sent to which reducer

Default is simply:
Key.hashCode() % numOfReducers

User can override to:

Provide (more) uniform distribution of load between reducers
Ensure that related keys are sent to the same reducer
Ex. To compute the relative frequency of a pair of words <W1, W2> you would need to
make sure all pairs whose first word is W1 are sent to the same reducer

Binning of results
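
Both the default rule and a custom override can be sketched in plain Python. Python's built-in `hash()` varies between runs for strings, so this sketch uses `zlib.crc32` as a deterministic stand-in for Java's `hashCode()`. That substitution, and the function names, are assumptions made for illustration.

```python
import zlib

def stable_hash(key: str) -> int:
    """Deterministic stand-in for Java's hashCode() (an assumption of this sketch)."""
    return zlib.crc32(key.encode("utf-8"))

def default_partition(key: str, num_reducers: int) -> int:
    """Default rule: hash of the whole key, modulo the number of reducers."""
    return stable_hash(key) % num_reducers

def first_word_partition(key: tuple[str, str], num_reducers: int) -> int:
    """Custom rule: route a (w1, w2) pair by w1 only."""
    return stable_hash(key[0]) % num_reducers

pairs = [("new", "york"), ("new", "jersey"), ("new", "deal")]
targets = {first_word_partition(pair, 4) for pair in pairs}
print(len(targets))  # 1
```

Because the custom rule ignores the second word, every pair starting with "new" lands on the same reducer, which can then compute frequencies relative to "new" without seeing any other reducer's data.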
Master

Responsible for scheduling & managing jobs

Scheduled computation should be close to the data if possible

Bandwidth is expensive! (and slow)
This relies on a Distributed File System (GFS / HDFS)!

If a task fails to report progress (such as reading input, writing output, etc.),
crashes, or its machine goes down, it is assumed to be stuck and is killed,
and the step is re-launched (with the same input)

The Master is handled by the framework; no user code is necessary

Master Cont.

HDFS can replicate data to be local if necessary for scheduling

Because our nodes are (or at least should be) deterministic
The Master can restart failed nodes
Nodes should have no side effects!

If a node is in the last step, and is completing slowly, the master can launch a second
copy of that node
Slowness can be due to hardware issues, network issues, etc.
The first one to complete wins, then any other runs are killed
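
Why determinism makes re-launching safe can be shown with a toy plain-Python simulation. Everything here is hypothetical: `run_with_retries` stands in for the master, and the `failures_left` counter only simulates a machine that crashes twice. It is not part of any real task.

```python
def run_with_retries(task, task_input, max_attempts=3):
    """Re-launch a failed task with the same input, as a master would."""
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            return task(task_input), attempt
        except RuntimeError as error:
            last_error = error  # treat as a crashed task and try again
    raise RuntimeError(f"task failed {max_attempts} times") from last_error

failures_left = [2]  # simulation only: the first two attempts "crash"

def flaky_count_words(chunk):
    if failures_left[0] > 0:
        failures_left[0] -= 1
        raise RuntimeError("simulated machine failure")
    # The real work is deterministic: same chunk in, same count out.
    return len(chunk.split())

result, attempts = run_with_retries(flaky_count_words, "the store opens at 9am")
print(result, attempts)  # 5 3
```

The third attempt returns the same answer the first would have, so the rest of the job cannot tell that a retry happened.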
Writables

Are types that can be serialized / deserialized to a stream

Are required for input/output classes, as the framework will serialize your
data before writing it to disk
Users can implement this interface, and use their own types for their
input/output/intermediate values
There are defaults for basic values, like Strings, Integers, Longs, etc.
Can also handle containers, such as arrays, maps, etc.
Your application needs at least six writables
2 for your input
2 for your intermediate values (Map <-> Reduce)
2 for your output
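
The idea of a type that writes itself to a stream and reads itself back can be sketched in plain Python. The class `WordCountWritable` is hypothetical and its byte layout is an assumption of this example. It only loosely mirrors the write/read pair of Hadoop's Writable interface and is not compatible with it.

```python
import io
import struct

class WordCountWritable:
    """Hypothetical Writable-like type holding a word and its count."""

    def __init__(self, word: str = "", count: int = 0):
        self.word = word
        self.count = count

    def write(self, stream) -> None:
        data = self.word.encode("utf-8")
        # Length-prefixed UTF-8 bytes, then the count as a big-endian 64-bit integer.
        stream.write(struct.pack(">I", len(data)))
        stream.write(data)
        stream.write(struct.pack(">q", self.count))

    def read_fields(self, stream) -> None:
        (length,) = struct.unpack(">I", stream.read(4))
        self.word = stream.read(length).decode("utf-8")
        (self.count,) = struct.unpack(">q", stream.read(8))

buffer = io.BytesIO()
WordCountWritable("store", 4).write(buffer)
buffer.seek(0)
restored = WordCountWritable()
restored.read_fields(buffer)
print(restored.word, restored.count)  # store 4
```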
Mapper Code
Our input to our mapper is <LongWritable, Text>
The key (the LongWritable) can be assumed to be the position in the document at which our
input starts. This doesn't matter for this example.

Our output is a bunch of <Text, LongWritable> pairs. The key is the token, and the value
is the count. This is always 1.

For the purpose of this demonstration, just assume Text is a fancy String, and
LongWritable is a fancy Long. In reality, they're just the Writable equivalents.
Reducer Code

Our input is the output of our Mapper, a <Text, LongWritable> pair

Our output is still a <Text,LongWritable>, but it reduces N inputs for token T,
into one output <T, N>
Combiner Code

Do we need a combiner?
No, but it reduces bandwidth.

Our reducer can actually be our combiner in this case though!

That's it!

All that is needed to run the above code is an extremely simple runner class.
It simply specifies which components to use, and your input/output directories
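
The mapper, reducer, combiner and runner described in this section can be sketched together as a single-process simulation in plain Python. This is not Hadoop code: `run_job` and `_group` are hypothetical helpers, each input line is treated as one mapper's input, and plain `int` and `str` stand in for LongWritable and Text.

```python
import re
from collections import defaultdict

def word_count_mapper(offset: int, line: str):
    """Input <offset, line>; output <token, 1> for every token."""
    for token in re.findall(r"[a-z0-9]+", line.lower()):
        yield token, 1

def word_count_reducer(token: str, counts):
    """Input <token, [counts]>; output a single <token, total>."""
    yield token, sum(counts)

def _group(pairs):
    """Collect values per key and return the groups sorted by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())

def run_job(lines, mapper, reducer, combiner=None):
    """Hypothetical runner: wires the components together in one process."""
    intermediate = []
    offset = 0
    for line in lines:
        pairs = list(mapper(offset, line))
        if combiner is not None:
            # The combiner sees only this one mapper's output.
            pairs = [out for key, values in _group(pairs) for out in combiner(key, values)]
        intermediate.extend(pairs)
        offset += len(line.encode("utf-8"))
    return dict(out for key, values in _group(intermediate) for out in reducer(key, values))

lines = [
    "The teacher went to the store. ",
    "The store was closed; the store opens in the morning. ",
    "The store opens at 9am.",
]
result = run_job(lines, word_count_mapper, word_count_reducer, combiner=word_count_reducer)
print(result["the"], result["store"], result["opens"])  # 6 4 2
```

Passing the reducer as the combiner works here because summing is associative and commutative: adding partial sums gives the same total as adding the individual 1s.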
Workflows

Sometimes you need multiple steps to express your design

MapReduce does not directly allow for this, but there are solutions that do
Hadoop YARN lets frameworks that express a job as a Directed Acyclic Graph of nodes run on the cluster
Oozie also allows for a graph of nodes
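
A multi-step design amounts to feeding one job's output into the next job as input. The plain-Python sketch below chains two jobs by hand. The in-memory `run_job` helper and the choice of second step (grouping words by how often they occur) are assumptions for illustration and do not reflect how YARN or Oozie are configured.

```python
from collections import defaultdict

def run_job(records, mapper, reducer):
    """Minimal in-memory stand-in for one MapReduce job."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    return [out for key in sorted(groups) for out in reducer(key, groups[key])]

# Step 1: count words.
def count_map(line):
    return [(word, 1) for word in line.lower().split()]

def count_reduce(word, ones):
    return [(word, sum(ones))]

# Step 2: group words by how often they occur.
def invert_map(pair):
    word, count = pair
    return [(count, word)]

def invert_reduce(count, words):
    return [(count, sorted(words))]

lines = ["the store opens", "the store was closed", "the morning"]
step1 = run_job(lines, count_map, count_reduce)
step2 = run_job(step1, invert_map, invert_reduce)
print(step2)  # [(1, ['closed', 'morning', 'opens', 'was']), (2, ['store']), (3, ['the'])]
```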
Handling Data By Type

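[Figure: an example workflow. Input feeds a Fetch Data step, whose output is processed by type in two parallel steps (Process Data A, Process Data B); a Merge step combines their results into the Output.]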
Conclusion

MapReduce provides a simple way to scale your application

Scales out to more machines, rather than scaling up
Effortlessly scale from a single machine to thousands
Fault tolerant & High performance
If you can fit your use case to its paradigm, scaling is handled by the
framework
