Unit-2 (MapReduce-I)
Unit-II
MapReduce
Table of Content
1. Traditional Approach
2. Challenges
3. Fundamentals of MapReduce
4. Example of MapReduce
5. Advantages of MapReduce
6. Developing a MapReduce Application
Fundamentals of MapReduce
• MapReduce performs the processing of large data sets
in a distributed and parallel manner.
• MapReduce consists of two tasks: Map and Reduce.
• Two essential daemons of MapReduce: Job Tracker
and Task Tracker.
Traditional Way
Before the MapReduce framework existed, parallel and distributed
processing was done in a traditional way.
Example: we have a weather log containing the daily average temperature
of the years from 2019 to 2022, and we want to find the day with the
highest temperature in each year.
Traditional Way
•In the traditional way, we split the data into smaller parts or blocks
and store them on different machines.
•Then, we find the highest temperature in each part stored on the
corresponding machine.
•At last, we combine the results received from each of the machines
to produce the final output (a small sketch of this approach follows below).
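As a rough, single-machine illustration of this split / compute / combine idea (plain Java, with threads standing in for separate machines; the class name, data values and thread pool are illustrative assumptions, not part of any framework):

import java.util.*;
import java.util.concurrent.*;

// Sketch of the traditional approach: each "machine" (thread) finds the local
// maximum of its block, and the results are combined into a global maximum.
public class TraditionalMaxTemperature {
  public static void main(String[] args) throws Exception {
    // Assume the temperature log has already been split into blocks.
    List<double[]> blocks = Arrays.asList(
        new double[]{30.1, 35.4, 28.9},
        new double[]{41.2, 33.0, 36.7},
        new double[]{29.5, 39.8, 31.2});

    ExecutorService machines = Executors.newFixedThreadPool(blocks.size());
    List<Future<Double>> localMaxima = new ArrayList<>();
    for (double[] block : blocks) {
      // Each "machine" computes the maximum of its own block.
      Callable<Double> task = () -> Arrays.stream(block).max().getAsDouble();
      localMaxima.add(machines.submit(task));
    }

    // Combine step: the final answer is the maximum of all local maxima.
    double globalMax = Double.NEGATIVE_INFINITY;
    for (Future<Double> f : localMaxima) {
      globalMax = Math.max(globalMax, f.get());
    }
    machines.shutdown();
    System.out.println("Highest temperature: " + globalMax);
  }
}

The challenges listed next (delays, uneven splits, machine failure, aggregation of results) all have to be handled by hand in code like this.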
Challenges associated with this traditional approach:
• Critical path problem: This is the amount of time taken
to finish the job without delaying the next milestone or
the actual completion date. So, if any of the machines
delays its part of the job, the whole work gets delayed.
• Equal split issue: How do we divide the data into
smaller chunks so that each machine gets an even share of
the data to work with? In other words, how do we equally
divide the data such that no individual machine is
overloaded or underutilized?
• A single split may fail: If any of the machines fails to
provide its output, we will not be able to compute the final
result. So, there should be a mechanism that ensures the
fault-tolerance capability of the system.
• Aggregation of the results: There should be a
mechanism to aggregate the results generated by each of
the machines to produce the final output.
These are the issues that we would have to take care of
individually while performing parallel processing of huge
data sets with the traditional approach.
To overcome these issues, we have the MapReduce
framework.
MapReduce
Mapper Class
• The first stage of a MapReduce job: the mapper reads the input data and
transforms it into intermediate key/value pairs.
Reducer Class
• The intermediate output generated by the mapper is fed to the reducer, which
processes it and generates the final output, which is then saved in HDFS.
Driver Class
• The major component of a MapReduce job is the Driver class. It is responsible for
setting up a MapReduce job to run in Hadoop. Here we specify the names
of the Mapper and Reducer classes along with the data types and their respective
job names.
A Word Count Example of MapReduce
Let us consider a text file called sample.txt whose contents are
as follows:
Dear Bear River
Car Car River
Deer Car Bear
Now, suppose, we have to perform a word count on the
sample.txt using MapReduce. So, we will be finding the
unique words and the number of occurrences of those unique
words.
•First, we divide the input into three splits as shown in the figure.
This will distribute the work among all the map nodes.
•Then, we tokenize the words in each of the mappers and give a
hardcoded value (1) to each of the tokens or words. The rationale
behind giving a hardcoded value equal to 1 is that every word, in
itself, will occur once.
• Now, a list of key-value pairs will be created where the key is nothing but the
individual word and the value is one. So, for the first line (Dear Bear River) we
have 3 key-value pairs – Dear, 1; Bear, 1; River, 1. The mapping process
remains the same on all the nodes.
• After the mapper phase, a partition process takes place where sorting and
shuffling happen so that all the tuples with the same key are sent to the
corresponding reducer.
• So, after the sorting and shuffling phase, each reducer will have a unique key
and a list of values corresponding to that very key. For example, Bear, [1,1];
Car, [1,1,1]; etc.
•Now, each reducer counts the values present in its
list of values. As shown in the figure, the reducer gets the list of
values [1,1] for the key Bear. It then counts the
number of ones in the list and gives the final output –
Bear, 2.
•Finally, all the output key/value pairs are collected and
written to the output file (a small end-to-end sketch of these phases follows below).
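To make the map, shuffle/sort and reduce phases concrete, here is a minimal in-memory sketch in plain Java (not the Hadoop API; the class name and data structures are illustrative only):

import java.util.*;

// Single-machine simulation of the word-count data flow.
public class WordCountSimulation {
  public static void main(String[] args) {
    List<String> splits = Arrays.asList("Dear Bear River", "Car Car River", "Deer Car Bear");

    // Map phase: tokenize each line into (word, 1) pairs.
    List<Map.Entry<String, Integer>> intermediate = new ArrayList<>();
    for (String line : splits) {
      for (String word : line.split("\\s+")) {
        intermediate.add(new AbstractMap.SimpleEntry<>(word, 1));
      }
    }

    // Shuffle and sort phase: group all values that share the same key.
    Map<String, List<Integer>> grouped = new TreeMap<>();
    for (Map.Entry<String, Integer> pair : intermediate) {
      grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
    }

    // Reduce phase: sum the list of ones for each key and emit the final pairs.
    for (Map.Entry<String, List<Integer>> entry : grouped.entrySet()) {
      int sum = 0;
      for (int v : entry.getValue()) {
        sum += v;
      }
      System.out.println(entry.getKey() + "\t" + sum); // e.g. Bear 2, Car 3
    }
  }
}

In Hadoop, the map and reduce steps above run on different nodes and the framework itself performs the grouping; the Mapper, Reducer and Driver code for this job is shown later in this unit.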
Advantages of MapReduce
1. Parallel Processing:
In MapReduce, we divide the job among multiple nodes and each node
works on a part of the job simultaneously. So, MapReduce is based on the Divide
and Conquer paradigm, which helps us process the data using different
machines.
As the data is processed by multiple machines in parallel instead of by a
single machine, the time taken to process the data is reduced tremendously.
2. Data Locality:
• Instead of moving data to the processing unit, we are moving
the processing unit to the data in the MapReduce Framework.
In the traditional system, we used to bring data to the
processing unit and process it. But, as the data grew and
became very huge, bringing this huge amount of data to the
processing unit posed the following issues:
• Moving huge data to the processing unit is costly and decreases
network performance.
• Processing takes time as the data is processed by a single unit
which becomes the bottleneck.
• The master node can get over-burdened and may fail.
Now, MapReduce allows us to overcome the above issues by
bringing the processing unit to the data. The data is
distributed among multiple nodes where each node processes
the part of the data residing on it. This allows us to have the
following advantages:
• It is very cost-effective to move the processing unit to the data.
• The processing time is reduced as all the nodes are
working with their part of the data in parallel.
• Every node gets a part of the data to process and therefore,
there is no chance of a node getting overburdened.
Developing a MapReduce Application:
The entire MapReduce program can be fundamentally divided into
three parts:
• Mapper Phase Code
• Reducer Phase Code
• Driver Code
Mapper code:
public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
  public void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    String line = value.toString();
    StringTokenizer tokenizer = new StringTokenizer(line);
    while (tokenizer.hasMoreTokens()) {
      // Emit (word, 1) for every token in the line
      value.set(tokenizer.nextToken());
      context.write(value, new IntWritable(1));
    }
  }
}
•We have created a class Map that extends the class Mapper, which is
already defined in the MapReduce framework.
•We define the data types of the input and output key/value pairs after the class
declaration using angle brackets.
•Both the input and the output of the Mapper are key/value pairs.
Input:
The key is nothing but the offset of each line
in the text file: LongWritable
The value is each individual line (as shown
in the figure at the right): Text
Output:
The key is the tokenized words: Text
We have the hardcoded value in our case
which is 1: IntWritable
Example – Dear 1, Bear 1, etc.
We have written Java code that tokenizes each word
and assigns it a hardcoded value equal to 1.
Reducer Code:
public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
  public void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    int sum = 0;
    // Sum all the ones emitted by the mappers for this key
    for (IntWritable x : values) {
      sum += x.get();
    }
    context.write(key, new IntWritable(sum));
  }
}
•We have created a class Reduce which extends class Reducer like that of Mapper.
•We define the data types of input and output key/value pair after the class
declaration using angle brackets as done for Mapper.
•Both the input and the output of the Reducer is a key-value pair.
Input:
The key is nothing but one of the unique words that have been generated after
the sorting and shuffling phase: Text
The value is a list of integers corresponding to that key: Iterable<IntWritable>
Example – Bear, [1, 1], etc.
Output:
The key is all the unique words present in the input text file: Text
The value is the number of occurrences of each of the unique
words: IntWritable
Example – Bear, 2; Car, 3, etc.
•We have aggregated the values present in each of the lists corresponding to each
key and produced the final answer.
•In general, a single reduce() call is made for each unique key, but the number of
reducer tasks for a job is configurable, e.g. through the mapreduce.job.reduces
property (whose cluster-wide default can be set in mapred-site.xml) or in the driver
code (see the example below).
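The reducer count can also be set per job in the driver using the standard Job API (the value 2 below is just an illustration):

// In the driver, before submitting the job:
job.setNumReduceTasks(2); // run the reduce phase with two reducer tasks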
Driver Code:
Configuration conf= new Configuration();
Job job = new Job(conf,"My Word Count Program");
job.setJarByClass(WordCount.class);
job.setMapperClass(Map.class);
job.setReducerClass(Reduce.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
job.setInputFormatClass(TextInputFormat.class);
job.setOutputFormatClass(TextOutputFormat.class);
Path outputPath = new Path(args[1]);
//Configuring the input/output path from the filesystem into the job
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
•In the driver class, we set the configuration of our MapReduce
job to run in Hadoop.
•We specify the name of the job and the data types of the input/output of
the mapper and reducer.
•We also specify the names of the mapper and reducer classes.
•The path of the input and output folder is also specified.
•The method setInputFormatClass() is used for specifying how a
Mapper will read the input data or what will be the unit of work.
Here, we have chosen TextInputFormat so that a single line is
read by the mapper at a time from the input text file.
•The main() method is the entry point for the driver. In this
method, we instantiate a new Configuration object for the job.
The complete WordCount program (Mapper, Reducer and Driver combined):

package co.edureka.mapreduce;

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.fs.Path;

public class WordCount {

  public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
    public void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      String line = value.toString();
      StringTokenizer tokenizer = new StringTokenizer(line);
      while (tokenizer.hasMoreTokens()) {
        value.set(tokenizer.nextToken());
        context.write(value, new IntWritable(1));
      }
    }
  }

  public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable x : values) {
        sum += x.get();
      }
      context.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = new Job(conf, "My Word Count Program");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(Map.class);
    job.setReducerClass(Reduce.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    job.setInputFormatClass(TextInputFormat.class);
    job.setOutputFormatClass(TextOutputFormat.class);
    Path outputPath = new Path(args[1]);
    // Configuring the input/output path from the filesystem into the job
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    // Deleting the output path automatically from HDFS so that we don't have to delete it explicitly
    outputPath.getFileSystem(conf).delete(outputPath);
    // Exiting only when the job has finished (non-zero exit code if it failed)
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
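As a usage sketch (the jar name and the HDFS paths below are assumptions for illustration), the program would typically be packaged into a jar and run as:

hadoop jar wordcount.jar co.edureka.mapreduce.WordCount /sample/input /sample/output
hadoop fs -cat /sample/output/part-r-00000

The second command prints the reducer output, e.g. Bear 2, Car 3, Dear 1, Deer 1, River 2.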
THANK YOU