BDA-Chapter-3
Map Reduce
• MapReduce is the data-processing model of Hadoop.
• The MapReduce programming model is designed to process huge
volumes of data in parallel by dividing the work into a set of
independent tasks.
Map Reduce Layer
• MapReduce is a patented software framework introduced by Google to support
distributed computing on large datasets across clusters of computers.
• It is essentially a programming model that runs in the Hadoop background,
providing simplicity, scalability, fault recovery, and speed, along with
straightforward solutions for data processing.
• The MapReduce framework can process tremendous amounts of data in
parallel on large clusters of computational nodes.
• MapReduce is a programming model that lets you process your data across an
entire cluster. It consists of Mappers and Reducers, the scripts or functions
you write when building a MapReduce program.
• Mappers transform your data in parallel across the computing cluster
in a highly efficient manner.
• Reducers are responsible for aggregating your data together.
• Mappers and Reducers put together can be used to solve complex problems
(see the signature sketch after this list).
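In the terms of the original MapReduce model, a Mapper and a Reducer are simply two functions with the shapes below, where k and v denote key and value types; this is the standard contract, not code from these slides:

map:    (k1, v1)       → list(k2, v2)
reduce: (k2, list(v2)) → list(k3, v3)

The framework guarantees that every intermediate value emitted under the same key k2 is delivered to a single reduce call.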
MapReduce Phases
• There are mainly three phases in MapReduce:
• the map phase,
• the shuffle phase,
• the reduce phase.
Map Phase
• In a census analogy, the phase wherein individuals count the population of
their assigned cities is called the map phase.
• There are some terms related to this phase.
• Mapper: The individual (census taker) involved in the actual
calculation is called a mapper.
• Input split: The part of the city each census taker is assigned is
known as the input split.
• Key-value pair: The output from each mapper is a key-value pair; for
example, the key is a city X and the value is its count, say, 5. (A mapper
sketch follows.)
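As a concrete sketch of this phase, the census mapper could look as follows in Hadoop's Java mapreduce API. The class name CensusMapper and the input layout (one resident's city name per record) are illustrative assumptions, not taken from the slides:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Illustrative mapper for the census analogy: each input record is assumed
// to hold one resident's city; the mapper emits (city, 1) for every record
// in its input split.
public class CensusMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text city = new Text();

    @Override
    protected void map(LongWritable offset, Text record, Context context)
            throws IOException, InterruptedException {
        city.set(record.toString().trim());
        context.write(city, ONE); // the key-value pair, e.g. (X, 1)
    }
}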
Shuffle Phase
• The phase in which the values from different mappers are copied or
transferred to reducers is known as the shuffle phase.
• The shuffle phase comes in between the map phase and the reduce phase,
as illustrated by the sketch below.
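To make the grouping concrete, here is a toy, plain-Java illustration of what the shuffle accomplishes. It is not Hadoop's implementation, only the grouping semantics: values from different mappers are collected under their common key before any reducer runs.

import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class ShuffleSketch {
    public static void main(String[] args) {
        // Pretend these pairs were emitted by two different mappers.
        List<Map.Entry<String, Integer>> mapperOutput = List.of(
                Map.entry("X", 5), Map.entry("Y", 3),  // mapper 1
                Map.entry("X", 2), Map.entry("Y", 4)); // mapper 2

        // The shuffle groups every value under its key.
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> kv : mapperOutput) {
            grouped.computeIfAbsent(kv.getKey(), k -> new ArrayList<>()).add(kv.getValue());
        }
        System.out.println(grouped); // {X=[5, 2], Y=[3, 4]}
    }
}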
Reduce Phase
• By this point, the large dataset has been broken down into input splits,
and the map tasks have processed each of them.
• This is when the reduce phase comes into play. Similar to the map
phase, the reduce phase processes each key separately.
• Reducers: The individuals who work at the headquarters are known as
the reducers, because they reduce or consolidate the outputs from many
different mappers.
• After the reducer has finished its task, a results file is generated
and stored in HDFS; HDFS then replicates these results. (A reducer
sketch follows.)
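Continuing the illustrative sketch from the map phase, the matching reducer could look as follows; the class name CensusReducer is an assumption:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Illustrative reducer for the census analogy: consolidates the counts that
// the shuffle grouped under each city and writes (city, total) to the output.
public class CensusReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable total = new IntWritable();

    @Override
    protected void reduce(Text city, Iterable<IntWritable> counts, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable c : counts) {
            sum += c.get(); // add up every mapper's partial count
        }
        total.set(sum);
        context.write(city, total); // result record, later stored in HDFS
    }
}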
Working of MapReduce
• Hadoop divides the job into tasks. There are two types of tasks,
corresponding to the phases above:
• Map tasks (splitting and mapping)
• Reduce tasks (shuffling and reducing)
• The complete execution process (of both Map and Reduce tasks) is
controlled by two types of entities:
• JobTracker: acts like a master (responsible for the complete execution of
the submitted job)
• Multiple TaskTrackers: act like slaves, each performing part of the job
• For every job submitted for execution in the system, there is one JobTracker,
which resides on the NameNode, and multiple TaskTrackers, which reside
on DataNodes.
• A job is divided into multiple tasks, which are then run on multiple data
nodes in a cluster.
• It is the responsibility of the JobTracker to coordinate this activity by
scheduling tasks to run on different data nodes.
• Execution of each individual task is then looked after by a TaskTracker,
which resides on every data node executing part of the job.
• The TaskTracker's responsibility is to send progress reports to the
JobTracker.
• In addition, the TaskTracker periodically sends a 'heartbeat' signal to the
JobTracker to notify it of the current state of the system.
• Thus the JobTracker keeps track of the overall progress of each job. In the
event of a task failure, the JobTracker can reschedule the task on a different
TaskTracker. (A driver sketch showing how such a job is submitted follows.)
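As a minimal sketch of how such a job reaches the cluster, the driver below wires the illustrative CensusMapper and CensusReducer from the earlier sketches into a Job and submits it; in classic Hadoop 1.x this submission is what the JobTracker then splits into tasks and schedules:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Illustrative driver: describes the job, then submits it to the cluster.
public class CensusJob {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "city population count");
        job.setJarByClass(CensusJob.class);
        job.setMapperClass(CensusMapper.class);   // sketch from the map phase
        job.setReducerClass(CensusReducer.class); // sketch from the reduce phase
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // input in HDFS
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // results in HDFS
        System.exit(job.waitForCompletion(true) ? 0 : 1);       // submit and wait
    }
}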
Example
• The architecture consists of four phases of execution, namely
splitting, mapping, shuffling, and reducing.
• Consider the input:
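The slides' original input is not shown here, so as an illustrative stand-in, suppose the input is the three lines below and the job is a word count; the four phases then unfold as follows:

Input:     Deer Bear River / Car Car River / Deer Car Bear
Splitting: each line becomes one split, handled by its own map task
Mapping:   (Deer,1) (Bear,1) (River,1); (Car,1) (Car,1) (River,1); (Deer,1) (Car,1) (Bear,1)
Shuffling: Bear → [1,1], Car → [1,1,1], Deer → [1,1], River → [1,1]
Reducing:  (Bear,2), (Car,3), (Deer,2), (River,2)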
Features of MapReduce
• The important features of MapReduce are as follows:
• Abstracts developers from the complexity of distributed programming
• In-built redundancy and fault tolerance
• The MapReduce programming model is language independent
• Automatic parallelization and distribution
• Enables the local processing of data
• Manages all inter-process communication
• Manages the distributed servers running the various tasks in parallel
• Manages all communications and data transfers between the various parts of the
system
• Provides redundancy and failure handling for the overall management of the whole
process