T05 MapReduce

The document describes the key components of Hadoop including HDFS daemons like NameNode, Secondary NameNode, and DataNode as well as MapReduce daemons like JobTracker and TaskTracker. It explains how Hadoop works by having clients submit work to a master node which divides the work into tasks assigned to slave nodes. The slave nodes perform the tasks and return results to the master node which then returns the final results to the client. It also discusses MapReduce programming concepts like map, reduce, shuffle, and sort as well as advantages like parallel processing and data locality.

1

Hadoop Daemons

◼ HDFS Daemons
◼ NameNode (NN)
◼ Secondary NameNode (SNN)
◼ DataNode (DN)

◼ MapReduce Daemons
◼ JobTracker (JT)
◼ TaskTracker (TT)

2
3
Hadoop Architecture Description

◼ Clients (one or more) submit their work to the Hadoop system (a minimal client-side submission sketch follows this list).
◼ When the Hadoop system receives a client request, it is first received by a master node.
◼ The master node’s MapReduce component, the JobTracker, receives the client’s work, divides it into manageable, independent tasks, and assigns them to TaskTrackers.
◼ A slave node’s MapReduce component, the TaskTracker, receives its tasks from the JobTracker and performs them using the MapReduce components.
◼ Once all TaskTrackers have finished their assigned tasks, the JobTracker collects their results and combines them into a final result.
◼ Finally, the Hadoop system returns the final result to the client.
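
As an illustration of the client-side step in this flow, a minimal job-submission sketch using Hadoop’s Java MapReduce API is shown below. The WordCountMapper and WordCountReducer classes are user-supplied map and reduce implementations (a sketch of them appears on the word count slide later); input and output paths are illustrative.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");     // the client's "work"
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCountMapper.class);          // map tasks run on the slave nodes
        job.setReducerClass(WordCountReducer.class);        // reduce tasks run on the slave nodes
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));    // input directory in HDFS
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output directory in HDFS
        // Submitting the job hands it to the master (the JobTracker in classic MapReduce),
        // which splits it into tasks, schedules them on TaskTrackers, and collects the result.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```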

4
Advantages of MapReduce

◼ Parallel Processing

◼ Data locality: moving processing to the data.

5
MapReduce Process Flow

6
Hadoop Processing Framework

7
MapReduce Example: Word Count Program
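
The original slide shows the word count program as a figure. As a stand-in, a compact sketch of the corresponding Mapper and Reducer using Hadoop’s Java API is given below; these are the WordCountMapper/WordCountReducer classes assumed in the earlier driver sketch.

```java
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map: for each input line, emit (word, 1) for every word in the line.
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        StringTokenizer tokens = new StringTokenizer(line.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            context.write(word, ONE);
        }
    }
}

// Reduce: sum the 1s for each word after shuffle/sort has grouped them by key.
class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable c : counts) {
            sum += c.get();
        }
        context.write(word, new IntWritable(sum));
    }
}
```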

8
9
What is MapReduce

◼ MapReduce is a programming model and an associated implementation for processing and generating large data sets.

◼ Users specify a map function that processes a key/value pair to generate a set of
intermediate key/value pairs, and a reduce function that merges all intermediate
values associated with the same intermediate key.

◼ Many real-world tasks are expressible in this model.

◼ Programs written in this functional style are automatically parallelized and executed
on a large cluster of commodity machines.

◼ A typical MapReduce computation processes many terabytes of data on thousands of machines.

10
MapReduce Runtime System

◼ The MapReduce runtime system takes care of the details of:

◼ partitioning the input data,

◼ scheduling the program’s execution across a set of machines,

◼ handling machine failures, and

◼ managing the required inter-machine communication.

◼ This allows programmers without any experience with parallel and distributed systems to easily utilize the resources of a large distributed system.

11
MapReduce Programming Components

TT: RecordReader
      ↓ (k1, v1)
Mapper
      ↓ (k2, v2)
TT: Partition / Sort / Shuffle
      ↓ (k3, <v3>)
Reducer
      ↓ (k4, v4)
TT: RecordWriter
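
For concreteness, and assuming the word count job sketched earlier, the generic pairs in the flow above correspond to Hadoop types roughly as follows (a sketch, not something stated on the slide):

```java
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// The generic parameters spell out the (k, v) pairs from the flow above:
//   (k1, v1)   = (LongWritable byteOffset, Text line)        -- produced by the RecordReader
//   (k2, v2)   = (Text word, IntWritable 1)                  -- emitted by the Mapper
//   (k3, <v3>) = (Text word, Iterable<IntWritable> counts)   -- after partition/sort/shuffle
//   (k4, v4)   = (Text word, IntWritable total)              -- written by the RecordWriter
class WcMapper extends Mapper<LongWritable, Text, Text, IntWritable> { /* map() as in the word count sketch */ }
class WcReducer extends Reducer<Text, IntWritable, Text, IntWritable> { /* reduce() as in the word count sketch */ }
```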

12
Combine

◼ Combine is an intermediate step between Map and Reduce


◼ Combine is an optional process.
◼ The combiner is a reducer that runs individually on each mapper server.
◼ It reduces the data on each mapper further to a simplified form before passing it
downstream.
◼ This makes shuffling and sorting easier as there is less data to work with.
◼ Often, the combiner class is set to the reducer class itself, because the reduce function is commutative and associative (e.g., summation). However, if needed, the combiner can be a separate class as well (see the sketch below).
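
For the word count sketch, the sum computed in the reducer is commutative and associative, so the reducer class can double as the combiner. A minimal sketch of wiring this up in the driver (same hypothetical class names as before):

```java
// In the driver, after setMapperClass()/setReducerClass():
job.setCombinerClass(WordCountReducer.class);  // pre-aggregates (word, 1) pairs on each mapper node
```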

13
Partition

◼ Partition is an intermediate step between Map and Reduce


◼ Partitioning takes the <key, value> pairs produced by the mappers and groups them into the sets of <key, value> pairs that are fed to the reducers.
◼ It decides how the data has to be presented to the reducer and also assigns it to a
particular reducer.
◼ The default partitioner computes a hash value for each key emitted by the mapper and assigns the pair to a partition based on that hash value (see the sketch after this list).
◼ There are as many partitions as there are reducers.
◼ Once the partitioning is complete, the data from each partition is sent to a specific
reducer.
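
A simplified sketch of what the default partitioner (Hadoop’s HashPartitioner) does; the real class lives in the Hadoop MapReduce library, and this version only reproduces its core logic:

```java
import org.apache.hadoop.mapreduce.Partitioner;

// Simplified version of the default hash partitioner: the same key always maps to the
// same partition, and there is exactly one partition per reducer (numReduceTasks).
public class SimpleHashPartitioner<K, V> extends Partitioner<K, V> {
    @Override
    public int getPartition(K key, V value, int numReduceTasks) {
        // Mask the sign bit so negative hashCode() values still give a valid partition index.
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}
```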

14
Fault Tolerance

◼ The master pings every worker periodically.


◼ If no response is received from a worker within a certain amount of time, the master marks the worker as failed (the corresponding timeout and retry settings are sketched after this list).
◼ Any map tasks completed by the worker are reset back to their initial idle state, and therefore
become eligible for scheduling on other workers.
◼ Similarly, any map task or reduce task in progress on a failed worker is also reset to idle and
becomes eligible for rescheduling.
◼ Completed map tasks are re-executed on a failure because their output is stored on the local
disk(s) of the failed machine and is therefore inaccessible.
◼ Completed reduce tasks do not need to be re-executed since their output is stored in a global
file system.
◼ When a map task is executed first by worker A and later re-executed by worker B (because A failed), all workers executing reduce tasks are notified of the re-execution. Any reduce task that has not already read the data from worker A will read it from worker B.
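
In Hadoop MapReduce the timeout and retry behaviour described above is exposed through configuration properties. A hedged sketch using the Hadoop 2.x+ property names (classic JobTracker/TaskTracker releases used mapred.* equivalents; the values shown are illustrative):

```java
// Inside the driver's main(), before the Job is created:
Configuration conf = new Configuration();
// How long the framework waits without a progress report before declaring a task failed.
conf.setLong("mapreduce.task.timeout", 600_000L);   // milliseconds
// How many times a failed map/reduce task is rescheduled on another node before the job fails.
conf.setInt("mapreduce.map.maxattempts", 4);
conf.setInt("mapreduce.reduce.maxattempts", 4);
```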

15
Locality

◼ Network bandwidth is a relatively scarce resource in our computing environment.


◼ HDFS divides each file into 64 MB blocks and stores several copies of each block (typically 3) on different machines; both settings are configurable (see the sketch after this list).
◼ The MapReduce master takes the location information of the input files into
account and attempts to schedule a map task on a machine that contains a replica
of the corresponding input data.
◼ Failing that, it attempts to schedule a map task near a replica of that task’s input
data (e.g., on a slave machine that is on the same rack).
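
The block size and replication factor mentioned above are HDFS settings rather than MapReduce ones. As a rough sketch (property names are the Hadoop 2.x+ forms; 64 MB was the older default block size, newer releases default to 128 MB):

```java
// Typically set cluster-wide in hdfs-site.xml, but shown here via the client Configuration:
Configuration conf = new Configuration();
conf.setLong("dfs.blocksize", 64L * 1024 * 1024);  // 64 MB blocks for files written by this client
conf.setInt("dfs.replication", 3);                 // keep 3 copies of each block on different machines
```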

16
Limitations of Hadoop

◼ Issues with small files: Hadoop is not suited to handling very large numbers of small files, because the NameNode, which keeps all file system metadata, gets overloaded.
◼ Slow processing speed: the framework itself is large and heavyweight (many lines of code), which adds overhead to every job.
◼ Latency: converting data to key/value form in the Mapper and then processing it again in the Reducer takes time on every job.
◼ Security: Hadoop lacks encryption by default. It supports Kerberos authentication, which is hard to manage, so complex applications are challenging to manage and their data can be at risk.
◼ No real-time data processing: only batch processing is supported.
◼ Not easy to use and program: the model offers no higher-level abstractions.
◼ Hadoop is not efficient for iterative processing, i.e., chains of stages in which the output of each stage is the input to the next.

17
References

◼ https://www.systutorials.com/hadoop-mapreduce-tutorials/
◼ https://www.tutorialscampus.com/map-reduce/algorithm.htm
◼ https://www.journaldev.com/8848/mapreduce-algorithm-example

18
Acknowledgment

◼ Many of the figures in the slides are copied from Simplilearn and Edureka

19
Thank You

20
