
Big Data Analytics

Module 3: MapReduce Paradigm

Faculty Names: Ms. Varsha Sanap, Dr. Vivek Singh
Index

Lecture 19 – Distributed File Systems: Physical Organization of Compute Nodes, Large-Scale File-System Organization
Lecture 20 – MapReduce: The Map Tasks, Grouping by Key, The Reduce Tasks, Combiners, Details of MapReduce Execution, Coping with Node Failures
Lecture 19

Distributed File System


MapReduce



Single Node Architecture

[Diagram: a single machine running Machine Learning, Statistics, and “Classical” Data Mining workloads.]



Motivation: Google Example

 20+ billion web pages x 20 KB = 400+ TB
 One computer reads 30-35 MB/sec from disk
 ~4 months to read the web (see the back-of-the-envelope check below)
 ~1,000 hard drives to store the web
 Takes even more to do something useful with the data!
 Today, a standard architecture for such problems is emerging:
 Cluster of commodity Linux nodes
 Commodity network (Ethernet) to connect them

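A rough back-of-the-envelope check of the numbers above, assuming ~35 MB/sec per disk and ~400 GB per drive (era-typical illustrative values, not measurements):

# Back-of-the-envelope check of the slide's estimates (all values are rough assumptions).
pages = 20e9                                  # 20+ billion web pages
page_size = 20e3                              # ~20 KB per page
total_bytes = pages * page_size               # ~4e14 bytes = 400+ TB
read_rate = 35e6                              # one disk reads ~30-35 MB/sec
days_to_read = total_bytes / read_rate / 86400
drives_needed = total_bytes / 400e9           # assuming ~400 GB per drive
print(f"~{total_bytes / 1e12:.0f} TB of web data, "
      f"~{days_to_read / 30:.1f} months to read on one machine, "
      f"~{drives_needed:.0f} drives to store it")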


Cluster Architecture

In 2011 it was estimated that Google had about 1M machines (http://bit.ly/Shh0RO)





Large Scale Computing

 Large-scale computing for data mining problems on commodity hardware


 Challenges:
 How do you distribute computation?
 How can we make it easy to write distributed programs?
 Machines fail:
 One server may stay up 3 years (1,000 days)
 If you have 1,000 servers, expect to lose one per day
 Google was estimated to have ~1M machines in 2011
 So ~1,000 machines fail every day! (see the quick calculation below)

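The failure arithmetic above, spelled out (values are the rough estimates from the slide):

# Rough failure-rate arithmetic: if a server lives ~1,000 days on average,
# roughly 1/1,000 of the fleet fails on any given day.
servers = 1_000_000                # ~1M machines (2011 estimate)
mean_uptime_days = 1_000           # one server stays up ~3 years = ~1,000 days
print(f"~{servers / mean_uptime_days:.0f} machine failures per day")   # ~1000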


Idea and Solution

 Issue: Copying data over a network takes time


 Idea:
 Bring computation close to the data
 Store files multiple times for reliability
 Map-reduce addresses these problems
 Google’s computational/data manipulation model
 Elegant way to work with big data
 Storage Infrastructure – File system
 Google: GFS. Hadoop: HDFS
 Programming model
 Map-Reduce



Storage Architecture

 Problem:
 If nodes fail, how to store data persistently?
 Answer:
 Distributed File System:
 Provides global file namespace
 Google GFS; Hadoop HDFS;
 Typical usage pattern
 Huge files (100s of GB to TB)
 Data is rarely updated in place
 Reads and appends are common



Distributed File System

 Chunk servers
 File is split into contiguous chunks
 Typically each chunk is 16-64 MB
 Each chunk is replicated (usually 2x or 3x)
 Try to keep replicas in different racks (a placement sketch follows below)
 Master node
 a.k.a. Name Node in Hadoop’s HDFS
 Stores metadata about where files are stored
 Might be replicated
 Client library for file access
 Talks to master to find chunk servers
 Connects directly to chunk servers to access data

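A minimal sketch of the chunking-and-replication idea above; the 64 MB chunk size, 3x replication, and rack layout are assumptions for illustration, not actual GFS or HDFS internals.

# Illustrative only: split a file into fixed-size chunks and place each replica on a
# different rack. Chunk size, replication factor, and rack layout are assumed values.
import itertools

CHUNK_SIZE = 64 * 1024 * 1024                 # assume 64 MB chunks
REPLICATION = 3                               # assume 3x replication
RACKS = {                                     # hypothetical cluster layout
    "rack1": ["node1", "node2"],
    "rack2": ["node3", "node4"],
    "rack3": ["node5", "node6"],
}

def place_chunks(file_size_bytes):
    """Return {chunk_id: [replica nodes]}, spreading replicas over distinct racks."""
    n_chunks = -(-file_size_bytes // CHUNK_SIZE)          # ceiling division
    rack_cycle = itertools.cycle(RACKS.items())
    placement = {}
    for chunk_id in range(n_chunks):
        placement[chunk_id] = [nodes[chunk_id % len(nodes)]
                               for _rack, nodes in itertools.islice(rack_cycle, REPLICATION)]
    return placement

print(place_chunks(200 * 1024 * 1024))        # a 200 MB file -> 4 chunks, 3 replicas each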


Distributed File System

 Reliable distributed file system


 Data kept in “chunks” spread across machines
 Each chunk replicated on different machines
 Seamless recovery from disk or machine failure

[Diagram: chunks of a file replicated across Chunk server 1, Chunk server 2, Chunk server 3, …, Chunk server N]

 Bring Computation Directly to Data


 Chunk Servers also serve as Compute Servers



Lecture 20

MapReduce
Programming Model: MapReduce

Task: Word Count

 Case 1: File too large for memory, but all <word, count> pairs fit in memory
 Case 2: Count occurrences of words:
 words(doc.txt) | sort | uniq -c
 where words takes a file and outputs the words in it, one per line
 Case 2 captures the essence of MapReduce
 Great thing is that it is naturally parallelizable (a sequential single-machine baseline is sketched below for contrast)

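For contrast with what follows, a sequential single-machine baseline for the same word-count task; doc.txt is just a placeholder path:

# Sequential word count on one machine ("doc.txt" is a placeholder path).
from collections import Counter

with open("doc.txt") as f:
    counts = Counter(word for line in f for word in line.split())

for word, count in counts.most_common(10):
    print(count, word)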
MapReduce Overview

 Sequentially read a lot of data


 Map:
 Extract something you care about
 Group by key: Sort and Shuffle
 Reduce:
 Aggregate, summarize, filter or transform
 Write the result

Outline stays the same; Map and Reduce change to fit the problem

MapReduce: The Map step

[Diagram: map turns input key-value pairs into intermediate key-value pairs.]

MapReduce: The Reduce Step

[Diagram: intermediate key-value pairs are grouped by key into key-value groups, and reduce turns each group into an output key-value pair.]

More Specifically

 Input: a set of key-value pairs


 Programmer specifies two methods:
 Map(k, v) → <k’, v’>*
 Takes a key-value pair and outputs a set of key-value pairs
 E.g., key is the filename, value is a single line in the file
 There is one Map call for every (k, v) pair
 Reduce(k’, <v’>*) → <k’, v’’>*
 All values v’ with the same key k’ are reduced together and processed in v’ order
 There is one Reduce function call per unique key k’

MapReduce: Word Counting

Word Count Using MapReduce

map(key, value):
  // key: document name; value: text of the document
  for each word w in value:
    emit(w, 1)

reduce(key, values):
  // key: a word; values: an iterator over counts
  result = 0
  for each count v in values:
    result += v
  emit(key, result)
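To make the flow concrete, here is a minimal single-machine simulation of the pseudocode above (map, group by key, reduce). It is only an illustration of the model, not Hadoop's or Google's API, and the sample documents are made up.

# Toy single-machine simulation of the Map -> Group by key -> Reduce flow (illustration only).
from collections import defaultdict

def map_fn(doc_name, text):
    # key: document name; value: text of the document
    for word in text.split():
        yield (word, 1)

def reduce_fn(word, counts):
    # key: a word; values: all counts emitted for that word
    yield (word, sum(counts))

def mapreduce(documents):
    groups = defaultdict(list)
    for doc_name, text in documents:               # Map phase
        for key, value in map_fn(doc_name, text):
            groups[key].append(value)              # Group by key (shuffle)
    results = []
    for key, values in groups.items():             # Reduce phase
        results.extend(reduce_fn(key, values))
    return results

docs = [("d1.txt", "the quick brown fox"), ("d2.txt", "the lazy dog the end")]
print(sorted(mapreduce(docs)))
# [('brown', 1), ('dog', 1), ('end', 1), ('fox', 1), ('lazy', 1), ('quick', 1), ('the', 3)]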

MapReduce: Environment

Map-Reduce environment takes care of:


 Partitioning the input data
 Scheduling the program’s execution across a set of machines
 Performing the group by key step
 Handling machine failures
 Managing required inter-machine communication

MapReduce: A Diagram

MAP:
Reads input and produces a set of key-value pairs

Group by key:
Collects all pairs with the same key
(Hash merge, Shuffle, Sort, Partition)

Reduce:
Collects all values belonging to the key and outputs the result
MapReduce: In Parallel

All phases are distributed with many tasks doing the work

MapReduce

Data Flow

Coordinator: Master

Dealing with Failures

 Map worker failure


 Map tasks completed or in-progress at worker are reset to idle
 Reduce workers are notified when task is rescheduled on another worker
 Reduce worker failure
 Only in-progress tasks are reset to idle
 Reduce task is restarted
 Master failure
 MapReduce task is aborted and client is notified

How many Map and Reduce jobs?

 M map tasks, R reduce tasks


 Rule of thumb:
 Make M much larger than the number of nodes in the cluster
 One DFS chunk per map task is common
 Improves dynamic load balancing and speeds up recovery from worker failures
 Usually R is smaller than M
 Because output is spread across R files (a quick sizing example follows below)

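A quick sizing example with assumed numbers (1 TB of input, 64 MB chunks, one map task per chunk); the figures are illustrative, not a recommendation:

# Illustrative sizing with assumed numbers: one map task per DFS chunk.
input_size_gb = 1_000                         # suppose 1 TB of input
chunk_size_mb = 64                            # suppose 64 MB chunks
M = input_size_gb * 1024 // chunk_size_mb     # ~16,000 map tasks
R = 100                                       # far fewer reduce tasks -> output lands in 100 files
print(f"M = {M}, R = {R}")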
Task Granularity & Pipelining

 Fine granularity tasks: map tasks >> machines


 Minimizes time for fault recovery
 Can do pipeline shuffling with map execution
 Better dynamic load balancing

Refinements: Backup Tasks

 Problem
 Slow workers significantly lengthen the job completion time:
 Other jobs on the machine
 Bad disks
 Weird things
 Solution
 Near end of phase, spawn backup copies of tasks
 Whichever one finishes first “wins”
 Effect
 Dramatically shortens job completion time

Refinement: Combiners

 Often a Map task will produce many pairs of the form (k,v1), (k,v2), … for the same key k
 E.g., popular words in the word count example
 Can save network time by pre-aggregating values in the mapper:
 combine(k, list(v1)) → v2
 Combiner is usually the same as the reduce function
 Works only if the reduce function is commutative and associative (see the sketch below)

Refinement: Combiners

 Back to our word counting example:


 Combiner combines the values of all keys of a single mapper (single machine):

 Much less data needs to be copied and shuffled!

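Continuing the word-count example, a minimal sketch of map-side pre-aggregation where the combiner is the same sum the reducer performs; this is illustrative Python, not Hadoop's Combiner interface.

# Illustrative map-side combiner for word count: pre-sum counts per key on the mapper
# before anything is shuffled over the network. The combine step is the same sum the
# reducer performs, which is safe because addition is commutative and associative.
from collections import Counter

def map_with_combiner(doc_name, text):
    local_counts = Counter(text.split())       # combine(k, list(1, 1, ...)) -> partial sum
    return list(local_counts.items())          # one (word, partial_count) pair per distinct word

print(map_with_combiner("d1.txt", "to be or not to be"))
# [('to', 2), ('be', 2), ('or', 1), ('not', 1)]  -> 4 pairs shuffled instead of 6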
Refinement: Partition Function
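The body of this slide is not reproduced here. As a hedged illustration: the default partition function described in the original MapReduce paper routes key k' to reducer hash(k') mod R, and users can override it, for example to keep all URLs from one host on the same reducer. The helper names below are made up for the sketch.

# Sketch of the default partition function: key k' goes to reducer number hash(k') mod R.
def default_partition(key, num_reducers):
    return hash(key) % num_reducers

# A custom partitioner (hypothetical example) could instead send all URLs of one host
# to the same reducer:
from urllib.parse import urlparse

def host_partition(url_key, num_reducers):
    return hash(urlparse(url_key).netloc) % num_reducers

print(default_partition("apple", 4), host_partition("http://example.com/a", 4))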

Thank You
