
Big Data Analytics

Module 3: MapReduce Paradigm

Faculty Names: Ms. Varsha Sanap, Dr. Vivek Singh
Index

Lecture 19 – Distributed File Systems: Physical Organization of Compute Nodes, Large-Scale File-System Organization
Lecture 20 – MapReduce: The Map Tasks, Grouping by Key, The Reduce Tasks, Combiners, Details of MapReduce Execution, Coping with Node Failures
Lecture 19

Distributed File System


MapReduce



Single Node Architecture

[Diagram: a single machine running Machine Learning, Statistics, and “Classical” Data Mining workloads.]



Motivation: Google Example

 20+ billion web pages x 20 KB = 400+ TB
 One computer reads 30-35 MB/sec from disk
 ~4 months to read the web (see the back-of-the-envelope check below)
 ~1,000 hard drives to store the web
 Takes even more to do something useful with the data!
 Today, a standard architecture for such problems is emerging:
 Cluster of commodity Linux nodes
 Commodity network (Ethernet) to connect them

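A rough back-of-the-envelope check of the numbers above, assuming ~35 MB/sec per disk and ~400 GB per drive (era-typical illustrative values, not measurements):

# Back-of-the-envelope check of the slide's estimates (all values are rough assumptions).
pages = 20e9                                  # 20+ billion web pages
page_size = 20e3                              # ~20 KB per page
total_bytes = pages * page_size               # ~4e14 bytes = 400+ TB
read_rate = 35e6                              # one disk reads ~30-35 MB/sec
days_to_read = total_bytes / read_rate / 86400
drives_needed = total_bytes / 400e9           # assuming ~400 GB per drive
print(f"~{total_bytes / 1e12:.0f} TB of web data, "
      f"~{days_to_read / 30:.1f} months to read on one machine, "
      f"~{drives_needed:.0f} drives to store it")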


Cluster Architecture

In 2011 it was estimated that Google had about 1M machines (http://bit.ly/Shh0RO)





Large Scale Computing

 Large-scale computing for data mining problems on commodity hardware


 Challenges:
 How do you distribute computation?
 How can we make it easy to write distributed programs?
 Machines fail:
 One server may stay up 3 years (1,000 days)
 If you have 1,000 servers, expect to lose one per day
 Google was estimated to have ~1M machines in 2011
 So ~1,000 machines fail every day! (see the quick calculation below)

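The failure arithmetic above, spelled out (values are the rough estimates from the slide):

# Rough failure-rate arithmetic: if a server lives ~1,000 days on average,
# roughly 1/1,000 of the fleet fails on any given day.
servers = 1_000_000                # ~1M machines (2011 estimate)
mean_uptime_days = 1_000           # one server stays up ~3 years = ~1,000 days
print(f"~{servers / mean_uptime_days:.0f} machine failures per day")   # ~1000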


Idea and Solution

 Issue: Copying data over a network takes time


 Idea:
 Bring computation close to the data
 Store files multiple times for reliability
 Map-reduce addresses these problems
 Google’s computational/data manipulation model
 Elegant way to work with big data
 Storage Infrastructure – File system
 Google: GFS. Hadoop: HDFS
 Programming model
 Map-Reduce



Storage Architecture

 Problem:
 If nodes fail, how to store data persistently?
 Answer:
 Distributed File System:
 Provides global file namespace
 Google GFS; Hadoop HDFS;
 Typical usage pattern
 Huge files (100s of GB to TB)
 Data is rarely updated in place
 Reads and appends are common



Distributed File System

 Chunk servers
 File is split into contiguous chunks
 Typically each chunk is 16-64 MB
 Each chunk is replicated (usually 2x or 3x)
 Try to keep replicas in different racks (a placement sketch follows below)
 Master node
 a.k.a. Name Node in Hadoop’s HDFS
 Stores metadata about where files are stored
 Might be replicated
 Client library for file access
 Talks to master to find chunk servers
 Connects directly to chunk servers to access data

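A minimal sketch of the chunking-and-replication idea above; the 64 MB chunk size, 3x replication, and rack layout are assumptions for illustration, not actual GFS or HDFS internals.

# Illustrative only: split a file into fixed-size chunks and place each replica on a
# different rack. Chunk size, replication factor, and rack layout are assumed values.
import itertools

CHUNK_SIZE = 64 * 1024 * 1024                 # assume 64 MB chunks
REPLICATION = 3                               # assume 3x replication
RACKS = {                                     # hypothetical cluster layout
    "rack1": ["node1", "node2"],
    "rack2": ["node3", "node4"],
    "rack3": ["node5", "node6"],
}

def place_chunks(file_size_bytes):
    """Return {chunk_id: [replica nodes]}, spreading replicas over distinct racks."""
    n_chunks = -(-file_size_bytes // CHUNK_SIZE)          # ceiling division
    rack_cycle = itertools.cycle(RACKS.items())
    placement = {}
    for chunk_id in range(n_chunks):
        placement[chunk_id] = [nodes[chunk_id % len(nodes)]
                               for _rack, nodes in itertools.islice(rack_cycle, REPLICATION)]
    return placement

print(place_chunks(200 * 1024 * 1024))        # a 200 MB file -> 4 chunks, 3 replicas each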


Distributed File System

 Reliable distributed file system


 Data kept in “chunks” spread across machines
 Each chunk replicated on different machines
 Seamless recovery from disk or machine failure

[Diagram: chunks of a file replicated across Chunk server 1, Chunk server 2, Chunk server 3, …, Chunk server N]

 Bring Computation Directly to Data


 Chunk Servers also serve as Compute Servers



Lecture 20

MapReduce
Programming Model: MapReduce

Task: Word Count

 Case 1: File too large for memory, but all <word, count> pairs fit in memory
 Case 2: Count occurrences of words:
 words(doc.txt) | sort | uniq -c
 where words takes a file and outputs the words in it, one per line
 Case 2 captures the essence of MapReduce
 Great thing is that it is naturally parallelizable (a sequential single-machine baseline is sketched below for contrast)

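For contrast with what follows, a sequential single-machine baseline for the same word-count task; doc.txt is just a placeholder path:

# Sequential word count on one machine ("doc.txt" is a placeholder path).
from collections import Counter

with open("doc.txt") as f:
    counts = Counter(word for line in f for word in line.split())

for word, count in counts.most_common(10):
    print(count, word)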
MapReduce Overview

 Sequentially read a lot of data


 Map:
 Extract something you care about
 Group by key: Sort and Shuffle
 Reduce:
 Aggregate, summarize, filter or transform
 Write the result

Outline stays the same; Map and Reduce change to fit the problem

MapReduce: The Map step

[Diagram: map turns input key-value pairs into intermediate key-value pairs.]

MapReduce: The Reduce Step

[Diagram: intermediate key-value pairs are grouped by key into key-value groups, and reduce turns each group into an output key-value pair.]

More Specifically

 Input: a set of key-value pairs


 Programmer specifies two methods:
 Map(k, v) → <k’, v’>*
 Takes a key-value pair and outputs a set of key-value pairs
 E.g., key is the filename, value is a single line in the file
 There is one Map call for every (k, v) pair
 Reduce(k’, <v’>*) → <k’, v’’>*
 All values v’ with the same key k’ are reduced together and processed in v’ order
 There is one Reduce function call per unique key k’

MapReduce: Word Counting

Word Count Using MapReduce

map(key, value):
  // key: document name; value: text of the document
  for each word w in value:
    emit(w, 1)

reduce(key, values):
  // key: a word; values: an iterator over counts
  result = 0
  for each count v in values:
    result += v
  emit(key, result)
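To make the flow concrete, here is a minimal single-machine simulation of the pseudocode above (map, group by key, reduce). It is only an illustration of the model, not Hadoop's or Google's API, and the sample documents are made up.

# Toy single-machine simulation of the Map -> Group by key -> Reduce flow (illustration only).
from collections import defaultdict

def map_fn(doc_name, text):
    # key: document name; value: text of the document
    for word in text.split():
        yield (word, 1)

def reduce_fn(word, counts):
    # key: a word; values: all counts emitted for that word
    yield (word, sum(counts))

def mapreduce(documents):
    groups = defaultdict(list)
    for doc_name, text in documents:               # Map phase
        for key, value in map_fn(doc_name, text):
            groups[key].append(value)              # Group by key (shuffle)
    results = []
    for key, values in groups.items():             # Reduce phase
        results.extend(reduce_fn(key, values))
    return results

docs = [("d1.txt", "the quick brown fox"), ("d2.txt", "the lazy dog the end")]
print(sorted(mapreduce(docs)))
# [('brown', 1), ('dog', 1), ('end', 1), ('fox', 1), ('lazy', 1), ('quick', 1), ('the', 3)]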

MapReduce: Environment

Map-Reduce environment takes care of:


 Partitioning the input data
 Scheduling the program’s execution across a set of machines
 Performing the group by key step
 Handling machine failures
 Managing required inter-machine communication

MapReduce: A Diagram

MAP:
Reads input and produces a set of key-value pairs

Group by key:
Collects all pairs with the same key
(Hash merge, Shuffle, Sort, Partition)

Reduce:
Collects all values belonging to the key and outputs the result
MapReduce: In Parallel

All phases are distributed with many tasks doing the work

MapReduce

Data Flow

Coordinator: Master

Dealing with Failures

 Map worker failure


 Map tasks completed or in-progress at worker are reset to idle
 Reduce workers are notified when task is rescheduled on another worker
 Reduce worker failure
 Only in-progress tasks are reset to idle
 Reduce task is restarted
 Master failure
 MapReduce task is aborted and client is notified

How many Map and Reduce jobs?

 M map tasks, R reduce tasks


 Rule of thumb:
 Make M much larger than the number of nodes in the cluster
 One DFS chunk per map task is common
 Improves dynamic load balancing and speeds up recovery from worker failures
 Usually R is smaller than M
 Because output is spread across R files (a quick sizing example follows below)

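A quick sizing example with assumed numbers (1 TB of input, 64 MB chunks, one map task per chunk); the figures are illustrative, not a recommendation:

# Illustrative sizing with assumed numbers: one map task per DFS chunk.
input_size_gb = 1_000                         # suppose 1 TB of input
chunk_size_mb = 64                            # suppose 64 MB chunks
M = input_size_gb * 1024 // chunk_size_mb     # ~16,000 map tasks
R = 100                                       # far fewer reduce tasks -> output lands in 100 files
print(f"M = {M}, R = {R}")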
Task Granularity & Pipelining

 Fine granularity tasks: map tasks >> machines


 Minimizes time for fault recovery
 Can do pipeline shuffling with map execution
 Better dynamic load balancing

Refinements: Backup Tasks

 Problem
 Slow workers significantly lengthen the job completion time:
 Other jobs on the machine
 Bad disks
 Weird things
 Solution
 Near end of phase, spawn backup copies of tasks
 Whichever one finishes first “wins”
 Effect
 Dramatically shortens job completion time

Refinement: Combiners

 Often a Map task will produce many pairs of the form (k,v1), (k,v2), … for the same key k
 E.g., popular words in the word count example
 Can save network time by pre-aggregating values in the mapper:
 combine(k, list(v1)) → v2
 Combiner is usually the same as the reduce function
 Works only if the reduce function is commutative and associative (see the sketch below)

Refinement: Combiners

 Back to our word counting example:


 Combiner combines the values of all keys of a single mapper (single machine):

 Much less data needs to be copied and shuffled!

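Continuing the word-count example, a minimal sketch of map-side pre-aggregation where the combiner is the same sum the reducer performs; this is illustrative Python, not Hadoop's Combiner interface.

# Illustrative map-side combiner for word count: pre-sum counts per key on the mapper
# before anything is shuffled over the network. The combine step is the same sum the
# reducer performs, which is safe because addition is commutative and associative.
from collections import Counter

def map_with_combiner(doc_name, text):
    local_counts = Counter(text.split())       # combine(k, list(1, 1, ...)) -> partial sum
    return list(local_counts.items())          # one (word, partial_count) pair per distinct word

print(map_with_combiner("d1.txt", "to be or not to be"))
# [('to', 2), ('be', 2), ('or', 1), ('not', 1)]  -> 4 pairs shuffled instead of 6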
Refinement: Partition Function
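The body of this slide is not reproduced here. As a hedged illustration: the default partition function described in the original MapReduce paper routes key k' to reducer hash(k') mod R, and users can override it, for example to keep all URLs from one host on the same reducer. The helper names below are made up for the sketch.

# Sketch of the default partition function: key k' goes to reducer number hash(k') mod R.
def default_partition(key, num_reducers):
    return hash(key) % num_reducers

# A custom partitioner (hypothetical example) could instead send all URLs of one host
# to the same reducer:
from urllib.parse import urlparse

def host_partition(url_key, num_reducers):
    return hash(urlparse(url_key).netloc) % num_reducers

print(default_partition("apple", 4), host_partition("http://example.com/a", 4))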

Thank You
