
Unit III

MAP REDUCE PROGRAMMING

Why MapReduce? Solving the Problem with MapReduce, Hadoop 2.x – MapReduce Architecture, Hadoop 2.x – MapReduce Components, Application Workflow, MapReduce Paradigm, MapReduce Programs – Word Count Program, Maximum Temperature Program.
What is MapReduce?
• It is a powerful paradigm for parallel computation.
• Hadoop uses MapReduce to execute jobs on files stored in HDFS.
• Read a lot of data.
• Map: extract something you care about from each record.
• Shuffle and sort.
• Reduce: aggregate, filter or transform.
• Write the results.
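In its general form the two functions can be sketched as follows (this is just the standard MapReduce signature; the concrete Java types appear in the Word Count program later in this unit):

map(k1, v1) -> list(k2, v2)            e.g. (byte offset, line) -> (word, 1)
reduce(k2, list(v2)) -> list(k3, v3)   e.g. (word, [1, 1, 1])   -> (word, 3)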
Example #1 – MapReduce
• Count the number of fans for each player / team
• Traditional way vs. smart way (illustrated on the slides)
Example #2 – MapReduce (illustration)
Example #3 – MapReduce (illustration)
Why MapReduce? (contd.) – Example of MapReduce (illustration)
MapReduce Flow (diagram)
Why MapReduce?
• The traditional model is not suitable for huge volumes of data.
• Such volumes cannot be accommodated by a standard database server.
• A centralized system creates too much of a bottleneck while processing multiple files simultaneously.
• Google solved this bottleneck issue using an algorithm called MapReduce.
• MapReduce divides a task into small parts and assigns them to many computers.
• Later, the results are collected at one place and integrated to form the result dataset.
Why MapReduce?..Contd..
Advantages:
• Taking processing to the data
• Processing data in parallel
Solving the Problem with MapReduce
• Searching
• Sorting
• Log processing
• Problems where the amount of computation required is significantly more than any one computer can handle
• Analytics
• Video and image analysis
Solving the Problem with MapReduce – Big Data Analytics (illustration)
Real-Time Scenario to Understand the Importance of MapReduce in Hadoop (illustration)
Difference between MRV1 and MRV2 (YARN)
• MRV1 (“classic”) has 3 components:
• API – for user-level programming of MR applications
• Framework – runtime services for running map and reduce processes, shuffling and sorting, etc.
• Resource Management – infrastructure to monitor nodes, allocate resources and schedule jobs
• MRV2 (“Next Gen”) moves resource management into YARN
Difference between MRV1 and MRV2 (YARN)
• MapReduce 1 hits scalability bottlenecks in the region of 4,000 nodes and 40,000 tasks.
• YARN is designed to scale up to 10,000 nodes and 100,000 tasks.
• Daemons:
  Hadoop 1 (MR1)          Hadoop 2 (MR2 / YARN)
  Name Node               Name Node
  Data Node               Data Node
  Secondary Name Node     Secondary Name Node
  Job Tracker             Resource Manager
  Task Tracker            Node Manager
Difference between MRV1 and MRV2 (YARN)
• Limitations of Hadoop 1:
• Single point of failure
• Low resource utilization
• Less scalability when compared to Hadoop 2
Big Data Analytics (illustration)
MapReduce Task / Anatomy of the Classic MapReduce Job Run
• Input
• Map
• Combiner
• Shuffle and sort
• Reducer
• Output
Anatomy of the Classic MapReduce Job Run
How MapReduce Works? / Explain the anatomy of the classic MapReduce job run / How does Hadoop run a MapReduce job?
#1 Architecture – MapReduce Flow Diagram
Contd..
• Input Phase − Here we have a Record Reader that translates each record in an input file and sends the parsed data to the mapper in the form of key-value pairs.
• Map − Map is a user-defined function, which takes a series of key-value pairs and processes each one of them to generate zero or more key-value pairs.
• Intermediate Keys − The key-value pairs generated by the mapper are known as intermediate keys.
Contd..
• Combiner − A combiner is a type of local reducer that groups similar data from the map phase into identifiable sets.
• It takes the intermediate keys from the mapper as input and applies user-defined code to aggregate the values within the small scope of one mapper.
• It is not a part of the main MapReduce algorithm; it is optional.
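As a minimal sketch (assuming, for illustration, that the word-count reducer shown later in this unit, ReduceForWordCount, is reused as the combiner, which is safe here because summing counts is associative and commutative):

// In the driver, after setting the mapper and reducer classes:
j.setCombinerClass(ReduceForWordCount.class);   // runs a local reduce on each mapper's output before the shuffle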
Contd..
• Shuffle and Sort − The Reducer task starts with the Shuffle and Sort
step.
• It downloads the grouped key-value pairs onto the local machine,
where the Reducer is running.
• The individual key-value pairs are sorted by key into a larger data list.
• The data list groups the equivalent keys together so that their values
can be iterated easily in the Reducer task.
Contd..
• Reducer − The Reducer takes the grouped key-value paired data as input and runs a reducer function on each one of them.
• Here, the data can be aggregated, filtered, and combined in a number of ways, and this can require a wide range of processing.
• Once the execution is over, it gives zero or more key-value pairs to the final step.
• Output Phase − In the output phase, we have an output formatter that translates the final key-value pairs from the Reducer function and writes them onto a file using a record writer.
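A minimal sketch of how the input and output phases are configured in the driver (TextInputFormat and TextOutputFormat are Hadoop's defaults for line-oriented text; their record reader produces the (byte offset, line) pairs and their record writer writes the final key-value pairs; j is the Job object created in the word-count driver later in this unit):

j.setInputFormatClass(org.apache.hadoop.mapreduce.lib.input.TextInputFormat.class);
j.setOutputFormatClass(org.apache.hadoop.mapreduce.lib.output.TextOutputFormat.class);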
Contd..
• Job Tracker
• The job tracker takes care of all MR jobs that are running on the various nodes present in the Hadoop cluster.
• The job tracker plays a vital role in scheduling jobs and it keeps track of all the map and reduce tasks.
• Task Tracker
• The actual map and reduce tasks are performed by the task tracker.
MAP REDUCE – EXAMPLE: WORD COUNT
Contd… Anatomy of the Classic MapReduce Job Run
How MapReduce Works? / Explain the anatomy of the classic MapReduce job run / How does Hadoop run a MapReduce job?

• The MapReduce workflow is divided into two steps:
1. Decomposing a problem into a MapReduce job
2. Running jobs

1. Decomposing a problem into a MapReduce job
• Create two scripts: a Map script and a Reduce script
• Map script:
 The Map script (which you write) takes some input data and maps it to <key, value> pairs according to your specifications.
 It emits a <word, 1> pair for each word in the input stream (e.g. the input A,B,C produces <A,1> <B,1> <C,1>).
 The Map script does no aggregation (i.e. no actual counting).
 The purpose of the Map script is to model the data into <key, value> pairs for the reducer to aggregate.
 The output is then shuffled, which means pairs with the same key are grouped (e.g. <A,1> <A,1>) and passed to a single machine.
1. Decomposing a problem into a MapReduce job (contd.)
• Reduce script:
 Takes a collection of <key, value> pairs and reduces them according to the user-specified Reduce script.
 For example, we want to count the occurrences of each word, so our Reduce script simply sums the values of the collection of <key, value> pairs that have the same key (e.g. <A,2> <B,2> <C,2>).
2. Running Dependent Jobs (a linear chain of jobs) or more complex Directed Acyclic Graph (DAG) jobs

• When there is more than one job in a MapReduce workflow, the question arises: how do you manage the jobs so they are executed in order?
There are several approaches:
1. a linear chain of jobs
2. a more complex directed acyclic graph (DAG) of jobs
Linear Chain of Jobs
• For a linear chain, the simplest approach is to run each job one after another, waiting until a job completes successfully before running the next (classic MRV1 API):
JobClient.runJob(conf1);
JobClient.runJob(conf2);
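The same idea with the newer org.apache.hadoop.mapreduce API can be sketched as follows (job1 and job2 are assumed to be already configured Job objects, as in the word-count driver later in this unit):

// Run job1 to completion; start job2 only if job1 succeeded
if (job1.waitForCompletion(true)) {
    job2.waitForCompletion(true);
}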
Contd.. Running Dependent Jobs (linear chain of jobs) or more complex DAG jobs
• For anything more complex than a linear chain, such as a DAG of jobs, there is a class called JobControl (see the sketch below).
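A minimal sketch of JobControl from org.apache.hadoop.mapreduce.lib.jobcontrol (job1 and job2 are the assumed, already configured Job objects from the previous sketch; the group name is illustrative):

ControlledJob cJob1 = new ControlledJob(job1.getConfiguration());
ControlledJob cJob2 = new ControlledJob(job2.getConfiguration());
cJob2.addDependingJob(cJob1);            // cJob2 runs only after cJob1 completes successfully

JobControl control = new JobControl("wordcount-chain");
control.addJob(cJob1);
control.addJob(cJob2);

new Thread(control).start();             // JobControl implements Runnable
while (!control.allFinished()) {
    Thread.sleep(1000);                  // poll until every job in the group has finished
}
control.stop();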
• An Oozie workflow is a DAG.
What is OOZIE?
• All sorts of programs can be pipelined in a desired order to work in Hadoop’s distributed environment.
• It provides a mechanism to run a job at a given schedule.
• Oozie runs on a server; a client submits a workflow to the server.
#2 Architecture – MapReduce Flow Diagram
MapReduce Objects
Important Question: MapReduce Program – Word Count
• Driver class (public static void main – the entry point)
• Map class (MapForWordCount), which extends the class Mapper<KEYIN,VALUEIN,KEYOUT,VALUEOUT> and implements the map function.
• Reduce class (ReduceForWordCount), which extends the class Reducer<KEYIN,VALUEIN,KEYOUT,VALUEOUT> and implements the reduce function.
• Serialization and deserialization of Java types in MapReduce are handled through the Writable interface.
Example..Explanation
• For example, the input line is “Hello”: Mapper<LongWritable, Text, Text, IntWritable>
Map logic:
Input: map key (LongWritable) = 0 (the byte offset), map value (Text) = hello
Output: hello (Text), 1 (IntWritable)

• Reduce logic: Reducer<Text, IntWritable, Text, IntWritable>
Input:
Hello Hello Hello -> (Hello, [1,1,1]) -> Hello (Text), 1,1,1 (IntWritable)
Hai Hai Hai -> (Hai, [1,1,1]) -> Hai (Text), 1,1,1 (IntWritable)
Output:
(Hello, 3) -> Hello (Text), 3 (IntWritable)
(Hai, 3) -> Hai (Text), 3 (IntWritable)
First Method: MapForWordCount

public static class MapForWordCount extends Mapper<LongWritable, Text, Text, IntWritable>
{
  public void map(LongWritable key, Text value, Context con) throws IOException, InterruptedException
  {
    String line = value.toString();     // input: key = byte offset (e.g. 0), value = the line, e.g. "hello,hello,hai"
    String[] words = line.split(",");   // split the line into words on commas
    for (String word : words)
    {
      Text outputKey = new Text(word.toUpperCase().trim());
      IntWritable outputValue = new IntWritable(1);   // emit (WORD, 1) for each word
      con.write(outputKey, outputValue);
    }
  }
}
Second Method: ReduceForWordCount

public static class ReduceForWordCount extends Reducer<Text, IntWritable, Text, IntWritable>
{
  // input: (HELLO, [1, 1, 1])
  public void reduce(Text word, Iterable<IntWritable> values, Context con) throws IOException, InterruptedException
  {
    int sum = 0;
    for (IntWritable value : values) {
      sum += value.get();               // 0+1 = 1; 1+1 = 2; 2+1 = 3
    }
    con.write(word, new IntWritable(sum));   // output: (HELLO, 3)
  }
}
Configuration in main()
public static void main(String[] args) throws Exception
{
  Configuration c = new Configuration();
  String[] files = new GenericOptionsParser(c, args).getRemainingArgs();
  Job j = new Job(c, "wordcount");
  j.setJarByClass(WordCount.class);
  j.setMapperClass(MapForWordCount.class);
  j.setReducerClass(ReduceForWordCount.class);
  j.setOutputKeyClass(Text.class);
  j.setOutputValueClass(IntWritable.class);
  FileInputFormat.addInputPath(j, new Path(files[0]));     // input path from the command line
  FileOutputFormat.setOutputPath(j, new Path(files[1]));   // output path from the command line
  System.exit(j.waitForCompletion(true) ? 0 : 1);
}
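For completeness, the imports these three fragments assume (a sketch; all classes are from the standard Hadoop 2.x MapReduce API):

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;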
Contd..Explanation
• LongWritable corresponds to a Java Long; Text is like a Java String.
• IntWritable is like a Java Integer.
• Context: the object through which the mapper and reducer interact with the rest of the framework, e.g. emitting output key-value pairs with context.write().
• java.lang.InterruptedException: thrown when a thread is waiting, sleeping, or otherwise occupied, and the thread is interrupted.
• GenericOptionsParser is a utility to parse command line arguments generic to the Hadoop framework. GenericOptionsParser recognizes several standard command line arguments.
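For example (the jar name, class name and paths below are hypothetical), generic options such as -D property=value are given before the application arguments and are consumed by GenericOptionsParser, while getRemainingArgs() returns the input and output paths:

hadoop jar wordcount.jar WordCount -D mapreduce.job.reduces=2 /user/input /user/output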
