Unit-2 (MapReduce-I)
Unit-II
MapReduce
Table of Content
1. Traditional Approach
2. Challenges
3. Fundamentals of MapReduce
4. Example of MapReduce
5. Advantages of MapReduce
6. Developing a MapReduce Application
Fundamentals of MapReduce
• MapReduce performs the processing of large data sets
in a distributed and parallel manner.
• MapReduce consists of two tasks: Map and Reduce.
• Two essential daemons of MapReduce: Job Tracker
and Task Tracker.
Traditional Way
Before the MapReduce framework existed, parallel and distributed
processing was done in a traditional way.
Example: we have a weather log containing the daily average temperature
of the years from 2019 to 2022, and we want to find the day with the
highest temperature in each year.
Traditional Way
•In the traditional way, we split the data into smaller parts or blocks
and store them on different machines.
•Then, we find the highest temperature in each part stored on the
corresponding machine.
•At last, we combine the results received from each of the machines
to produce the final output (a small sketch of this approach follows below).
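As a rough, single-machine illustration of this split / compute / combine idea (plain Java, with threads standing in for separate machines; the class name, data values and thread pool are illustrative assumptions, not part of any framework):

import java.util.*;
import java.util.concurrent.*;

// Sketch of the traditional approach: each "machine" (thread) finds the local
// maximum of its block, and the results are combined into a global maximum.
public class TraditionalMaxTemperature {
  public static void main(String[] args) throws Exception {
    // Assume the temperature log has already been split into blocks.
    List<double[]> blocks = Arrays.asList(
        new double[]{30.1, 35.4, 28.9},
        new double[]{41.2, 33.0, 36.7},
        new double[]{29.5, 39.8, 31.2});

    ExecutorService machines = Executors.newFixedThreadPool(blocks.size());
    List<Future<Double>> localMaxima = new ArrayList<>();
    for (double[] block : blocks) {
      // Each "machine" computes the maximum of its own block.
      Callable<Double> task = () -> Arrays.stream(block).max().getAsDouble();
      localMaxima.add(machines.submit(task));
    }

    // Combine step: the final answer is the maximum of all local maxima.
    double globalMax = Double.NEGATIVE_INFINITY;
    for (Future<Double> f : localMaxima) {
      globalMax = Math.max(globalMax, f.get());
    }
    machines.shutdown();
    System.out.println("Highest temperature: " + globalMax);
  }
}

The challenges listed next (delays, uneven splits, machine failure, aggregation of results) all have to be handled by hand in code like this.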
Challenges associated with this traditional approach:
• Critical path problem: This is the amount of time taken
to finish the job without delaying the next milestone or
the actual completion date. So, if any of the machines
delays its part of the job, the whole work gets delayed.
• Equal split issue: How do we divide the data into
smaller chunks so that each machine gets an even share of
the data to work with? In other words, how do we equally
divide the data such that no individual machine is
overloaded or underutilized?
• A single split may fail: If any of the machines fails to
provide its output, we will not be able to compute the final
result. So, there should be a mechanism that ensures the
fault-tolerance capability of the system.
• Aggregation of the results: There should be a
mechanism to aggregate the results generated by each of
the machines to produce the final output.
These are the issues that we would have to take care of
individually while performing parallel processing of huge
data sets with the traditional approach.
To overcome these issues, we have the MapReduce
framework.
MapReduce
Mapper Class
• The first stage of a MapReduce job: the mapper reads the input data and
transforms it into intermediate key/value pairs.
Reducer Class
• The intermediate output generated by the mapper is fed to the reducer, which
processes it and generates the final output, which is then saved in HDFS.
Driver Class
• The major component of a MapReduce job is the Driver class. It is responsible for
setting up a MapReduce job to run in Hadoop. Here we specify the names
of the Mapper and Reducer classes along with the data types and their respective
job names.
A Word Count Example of MapReduce
Let us consider a text file called sample.txt whose contents are
as follows:
Dear Bear River
Car Car River
Deer Car Bear
Now, suppose, we have to perform a word count on the
sample.txt using MapReduce. So, we will be finding the
unique words and the number of occurrences of those unique
words.
•First, we divide the input into three splits as shown in the figure.
This will distribute the work among all the map nodes.
•Then, we tokenize the words in each of the mappers and give a
hardcoded value (1) to each of the tokens or words. The rationale
behind giving a hardcoded value equal to 1 is that every word, in
itself, will occur once.
• Now, a list of key-value pairs will be created where the key is nothing but the
individual word and the value is one. So, for the first line (Dear Bear River) we
have 3 key-value pairs – Dear, 1; Bear, 1; River, 1. The mapping process
remains the same on all the nodes.
• After the mapper phase, a partition process takes place where sorting and
shuffling happen so that all the tuples with the same key are sent to the
corresponding reducer.
• So, after the sorting and shuffling phase, each reducer will have a unique key
and a list of values corresponding to that very key. For example, Bear, [1,1];
Car, [1,1,1]; etc.
•Now, each reducer counts the values present in its
list of values. As shown in the figure, the reducer gets the list of
values [1,1] for the key Bear. It then counts the
number of ones in the list and gives the final output –
Bear, 2.
•Finally, all the output key/value pairs are collected and
written to the output file (a small end-to-end sketch of these phases follows below).
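To make the map, shuffle/sort and reduce phases concrete, here is a minimal in-memory sketch in plain Java (not the Hadoop API; the class name and data structures are illustrative only):

import java.util.*;

// Single-machine simulation of the word-count data flow.
public class WordCountSimulation {
  public static void main(String[] args) {
    List<String> splits = Arrays.asList("Dear Bear River", "Car Car River", "Deer Car Bear");

    // Map phase: tokenize each line into (word, 1) pairs.
    List<Map.Entry<String, Integer>> intermediate = new ArrayList<>();
    for (String line : splits) {
      for (String word : line.split("\\s+")) {
        intermediate.add(new AbstractMap.SimpleEntry<>(word, 1));
      }
    }

    // Shuffle and sort phase: group all values that share the same key.
    Map<String, List<Integer>> grouped = new TreeMap<>();
    for (Map.Entry<String, Integer> pair : intermediate) {
      grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
    }

    // Reduce phase: sum the list of ones for each key and emit the final pairs.
    for (Map.Entry<String, List<Integer>> entry : grouped.entrySet()) {
      int sum = 0;
      for (int v : entry.getValue()) {
        sum += v;
      }
      System.out.println(entry.getKey() + "\t" + sum); // e.g. Bear 2, Car 3
    }
  }
}

In Hadoop, the map and reduce steps above run on different nodes and the framework itself performs the grouping; the Mapper, Reducer and Driver code for this job is shown later in this unit.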
Advantages of MapReduce
1. Parallel Processing:
In MapReduce, we divide the job among multiple nodes and each node
works on a part of the job simultaneously. So, MapReduce is based on the Divide
and Conquer paradigm, which helps us process the data using different
machines.
As the data is processed by multiple machines in parallel instead of by a
single machine, the time taken to process the data is reduced tremendously.
2. Data Locality:
• Instead of moving data to the processing unit, we are moving
the processing unit to the data in the MapReduce Framework.
In the traditional system, we used to bring data to the
processing unit and process it. But, as the data grew and
became very huge, bringing this huge amount of data to the
processing unit posed the following issues:
• Moving huge data to the processing unit is costly and decreases
network performance.
• Processing takes time as the data is processed by a single unit
which becomes the bottleneck.
• The master node can get over-burdened and may fail.
Now, MapReduce allows us to overcome the above issues by
bringing the processing unit to the data. The data is
distributed among multiple nodes where each node processes
the part of the data residing on it. This allows us to have the
following advantages:
• It is very cost-effective to move the processing unit to the data.
• The processing time is reduced as all the nodes are
working with their part of the data in parallel.
• Every node gets a part of the data to process and therefore,
there is no chance of a node getting overburdened.
Developing a MapReduce Application:
The entire MapReduce program can be fundamentally divided into
three parts:
• Mapper Phase Code
• Reducer Phase Code
• Driver Code
Mapper code:
public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
  public void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    String line = value.toString();
    StringTokenizer tokenizer = new StringTokenizer(line);
    while (tokenizer.hasMoreTokens()) {
      // Emit (word, 1) for every token in the line
      value.set(tokenizer.nextToken());
      context.write(value, new IntWritable(1));
    }
  }
}
•We have created a class Map that extends the class Mapper, which is
already defined in the MapReduce framework.
•We define the data types of the input and output key/value pairs after the class
declaration using angle brackets.
•Both the input and the output of the Mapper are key/value pairs.
Input:
The key is nothing but the offset of each line
in the text file: LongWritable
The value is each individual line (as shown
in the figure at the right): Text
Output:
The key is the tokenized words: Text
We have the hardcoded value in our case
which is 1: IntWritable
Example – Dear 1, Bear 1, etc.
We have written Java code that tokenizes each word
and assigns it a hardcoded value equal to 1.
Reducer Code:
public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
  public void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    int sum = 0;
    // Sum all the ones emitted by the mappers for this key
    for (IntWritable x : values) {
      sum += x.get();
    }
    context.write(key, new IntWritable(sum));
  }
}
•We have created a class Reduce which extends class Reducer like that of Mapper.
•We define the data types of input and output key/value pair after the class
declaration using angle brackets as done for Mapper.
•Both the input and the output of the Reducer is a key-value pair.
Input:
The key is nothing but one of the unique words that have been generated after
the sorting and shuffling phase: Text
The value is a list of integers corresponding to that key: Iterable<IntWritable>
Example – Bear, [1, 1], etc.
Output:
The key is all the unique words present in the input text file: Text
The value is the number of occurrences of each of the unique
words: IntWritable
Example – Bear, 2; Car, 3, etc.
•We have aggregated the values present in each of the lists corresponding to each
key and produced the final answer.
•In general, a single reduce() call is made for each unique key, but the number of
reducer tasks for a job is configurable, e.g. through the mapreduce.job.reduces
property (whose cluster-wide default can be set in mapred-site.xml) or in the driver
code (see the example below).
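The reducer count can also be set per job in the driver using the standard Job API (the value 2 below is just an illustration):

// In the driver, before submitting the job:
job.setNumReduceTasks(2); // run the reduce phase with two reducer tasks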
Driver Code:
Configuration conf= new Configuration();
Job job = new Job(conf,"My Word Count Program");
job.setJarByClass(WordCount.class);
job.setMapperClass(Map.class);
job.setReducerClass(Reduce.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
job.setInputFormatClass(TextInputFormat.class);
job.setOutputFormatClass(TextOutputFormat.class);
Path outputPath = new Path(args[1]);
//Configuring the input/output path from the filesystem into the job
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
•In the driver class, we set the configuration of our MapReduce
job to run in Hadoop.
•We specify the name of the job and the data types of the input/output of
the mapper and reducer.
•We also specify the names of the mapper and reducer classes.
•The path of the input and output folder is also specified.
•The method setInputFormatClass() is used for specifying how a
Mapper will read the input data or what will be the unit of work.
Here, we have chosen TextInputFormat so that a single line is
read by the mapper at a time from the input text file.
•The main() method is the entry point for the driver. In this
method, we instantiate a new Configuration object for the job.
The complete WordCount program (Mapper, Reducer and Driver combined):

package co.edureka.mapreduce;

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.fs.Path;

public class WordCount {

  public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
    public void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      String line = value.toString();
      StringTokenizer tokenizer = new StringTokenizer(line);
      while (tokenizer.hasMoreTokens()) {
        value.set(tokenizer.nextToken());
        context.write(value, new IntWritable(1));
      }
    }
  }

  public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable x : values) {
        sum += x.get();
      }
      context.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = new Job(conf, "My Word Count Program");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(Map.class);
    job.setReducerClass(Reduce.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    job.setInputFormatClass(TextInputFormat.class);
    job.setOutputFormatClass(TextOutputFormat.class);
    Path outputPath = new Path(args[1]);
    // Configuring the input/output path from the filesystem into the job
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    // Deleting the output path automatically from HDFS so that we don't have to delete it explicitly
    outputPath.getFileSystem(conf).delete(outputPath);
    // Exiting only when the job has finished (non-zero exit code if it failed)
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
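As a usage sketch (the jar name and the HDFS paths below are assumptions for illustration), the program would typically be packaged into a jar and run as:

hadoop jar wordcount.jar co.edureka.mapreduce.WordCount /sample/input /sample/output
hadoop fs -cat /sample/output/part-r-00000

The second command prints the reducer output, e.g. Bear 2, Car 3, Dear 1, Deer 1, River 2.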
THANK YOU