Unit 3 - Big Data Technologies
• MapReduce divides a task into small parts and assigns them to many computers.
• Later, the results are collected at one place and integrated to form the result
dataset.
Why MapReduce? (contd.)
Advantages:
• Taking processing to the data
• Processing data in parallel
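The divide, process in parallel, and collect idea can be sketched on a single machine. This is a minimal illustration using a thread pool in place of many computers; the class name SplitAndCombineSketch and the method parallelSum are hypothetical and not part of Hadoop.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical single-machine analogy: threads stand in for cluster nodes.
class SplitAndCombineSketch {
    // Splits data into at most `parts` chunks (parts must be >= 1),
    // sums each chunk on its own worker, then combines the partial sums.
    static long parallelSum(int[] data, int parts) {
        ExecutorService workers = Executors.newFixedThreadPool(parts);
        try {
            int chunk = (data.length + parts - 1) / parts; // ceiling division
            List<Future<Long>> partials = new ArrayList<>();
            for (int start = 0; start < data.length; start += chunk) {
                final int from = start;
                final int to = Math.min(start + chunk, data.length);
                // Each worker processes only its own slice of the data.
                partials.add(workers.submit(() -> {
                    long sum = 0;
                    for (int i = from; i < to; i++) {
                        sum += data[i];
                    }
                    return sum;
                }));
            }
            // Collect the partial results in one place and integrate them.
            long total = 0;
            for (Future<Long> partial : partials) {
                total += partial.get();
            }
            return total;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException("a worker failed", e);
        } finally {
            workers.shutdown();
        }
    }

    public static void main(String[] args) {
        int[] data = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10};
        System.out.println(parallelSum(data, 3)); // prints 55
    }
}
```

In a real cluster the slices live on different machines, so the work is sent to where the data already is rather than the data being copied to one place.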
Solving the Problem with MapReduce
• Searching
• Sorting
• Log Processing
• Any problem where the amount of computation required is significantly more than any one
computer can handle.
• Analytics
• Video and Image Analysis
Solving the Problem with MapReduce
Two steps: (1) decomposing the problem into a MapReduce job, and (2) running the jobs.
1. Decomposing a problem into a MapReduce job
• Map Script:
The map script (which you write) takes some input data and maps it to <key, value> pairs
according to your specifications.
For word count, it emits a <word, 1> pair for each word in the input stream
(e.g. the input A, B, C produces <A,1> <B,1> <C,1>).
The purpose of the map script is to model the data as <key, value> pairs for the reducer to aggregate.
The emitted pairs are then shuffled: pairs with the same key are grouped (e.g. <A,1> <A,1>) and
passed to a single machine.
1. Decomposing a problem into a MapReduce job (contd.)
• Reduce Script:
The reduce script takes the grouped <key, value> pairs and aggregates them.
For example, to count the occurrences of each word, the reduce script
simply sums the values of all <key, value> pairs that have the same
key (e.g. <A,2> <B,2> <C,2>).
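The map, shuffle, and reduce steps for word count can be sketched in plain Java. This is a minimal in-memory illustration, not Hadoop code; the class WordCountSketch and its method names are hypothetical, and the input "A B C A B C" is an assumed example chosen to match the counts above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical in-memory sketch of the three phases; no Hadoop types are used.
class WordCountSketch {
    // Map step: emit a <word, 1> pair for each word in the input line.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String word : line.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                pairs.add(Map.entry(word, 1));
            }
        }
        return pairs;
    }

    // Shuffle step: group the values of all pairs that share the same key.
    static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        return grouped;
    }

    // Reduce step: sum the grouped values for each key.
    static Map<String, Integer> reduce(Map<String, List<Integer>> grouped) {
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> group : grouped.entrySet()) {
            int sum = 0;
            for (int value : group.getValue()) {
                sum += value;
            }
            counts.put(group.getKey(), sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = reduce(shuffle(map("A B C A B C")));
        System.out.println(counts); // prints {A=2, B=2, C=2}
    }
}
```

In a real cluster the map calls run on many machines at once, and the shuffle moves each key's pairs over the network to the machine that reduces that key.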
2. Running Dependent Jobs (a linear chain of jobs) or
a more complex Directed Acyclic Graph (DAG) of jobs
• When there is more than one job in a MapReduce workflow, the question arises:
how do you manage the jobs so they are executed in order?
• There are several approaches; the main consideration is whether you have:
1. a linear chain of jobs, or
2. a more complex directed acyclic graph (DAG) of jobs.
Linear Chain of Jobs
• For a linear chain, the simplest approach is to run each job one after another,
waiting until a job completes successfully before running the next:
JobClient.runJob(conf1);
JobClient.runJob(conf2);
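The run-one-then-the-next behaviour can be sketched without Hadoop. In the old Hadoop API, JobClient.runJob throws an IOException when a job fails, so later calls are never reached; the sketch below imitates that with a hypothetical Job class and runChain method, both stand-ins rather than Hadoop types.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a linear chain of MapReduce jobs.
class LinearChainSketch {
    // A fake job that either succeeds or fails with an IOException.
    static final class Job {
        final String name;
        final boolean succeeds;

        Job(String name, boolean succeeds) {
            this.name = name;
            this.succeeds = succeeds;
        }

        void run() throws IOException {
            if (!succeeds) {
                throw new IOException("job " + name + " failed");
            }
        }
    }

    // Runs the jobs in order and returns the names of those that completed.
    // The chain stops at the first failure, so later jobs are never started.
    static List<String> runChain(List<Job> jobs) {
        List<String> completed = new ArrayList<>();
        for (Job job : jobs) {
            try {
                job.run();
            } catch (IOException e) {
                break;
            }
            completed.add(job.name);
        }
        return completed;
    }

    public static void main(String[] args) {
        List<Job> jobs = List.of(
                new Job("job1", true), new Job("job2", false), new Job("job3", true));
        System.out.println(runChain(jobs)); // prints [job1]
    }
}
```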
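Contd.: Running Dependent Jobs (a linear chain of jobs) or
a more complex Directed Acyclic Graph (DAG) of jobs
• For anything more complex than a linear chain, such as a DAG of jobs, there is a
class called JobControl, which represents a graph of jobs and runs each job once the jobs it depends on have completed.
• An Oozie workflow is also a DAG.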
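To make the DAG idea concrete, here is a minimal sketch of ordering jobs by their dependencies using Kahn's topological sort. It is not the JobControl or Oozie API; the class JobDagSketch, its methods, and the job names are hypothetical.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: computes a valid run order for a DAG of jobs.
class JobDagSketch {
    // Job name -> names of the jobs it depends on.
    private final Map<String, List<String>> dependsOn = new LinkedHashMap<>();

    void addJob(String name, String... dependencies) {
        dependsOn.put(name, List.of(dependencies));
    }

    // Returns an order in which every job appears after all of its dependencies.
    List<String> executionOrder() {
        // Number of unfinished dependencies per job.
        Map<String, Integer> remaining = new LinkedHashMap<>();
        Deque<String> ready = new ArrayDeque<>();
        for (Map.Entry<String, List<String>> e : dependsOn.entrySet()) {
            remaining.put(e.getKey(), e.getValue().size());
            if (e.getValue().isEmpty()) {
                ready.add(e.getKey());
            }
        }
        List<String> order = new ArrayList<>();
        while (!ready.isEmpty()) {
            String done = ready.remove();
            order.add(done);
            // Finishing a job may make the jobs that depend on it ready.
            for (Map.Entry<String, List<String>> e : dependsOn.entrySet()) {
                if (e.getValue().contains(done)
                        && remaining.merge(e.getKey(), -1, Integer::sum) == 0) {
                    ready.add(e.getKey());
                }
            }
        }
        if (order.size() != dependsOn.size()) {
            throw new IllegalStateException("cycle or unknown dependency");
        }
        return order;
    }

    public static void main(String[] args) {
        JobDagSketch dag = new JobDagSketch();
        dag.addJob("clean");
        dag.addJob("countWords", "clean");
        dag.addJob("countUsers", "clean");
        dag.addJob("report", "countWords", "countUsers");
        System.out.println(dag.executionOrder());
        // prints [clean, countWords, countUsers, report]
    }
}
```

Here countWords and countUsers both depend only on clean, so a real scheduler could run them in parallel; the sketch only computes one valid sequential order.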
What is Oozie?
• With Oozie, all sorts of programs can be pipelined in a desired order to run in
Hadoop’s distributed environment.
• It provides a mechanism to run a job on a given schedule.
• Oozie runs on a server, and a client submits a workflow to the server.
#2 Architecture: MapReduce Flow Diagram
MapReduce Objects
Important Question: MapReduce Program - Word Count
• Driver class (contains public static void main, the entry point).
• Map class (MapForWordCount), which extends the class
Mapper<KEYIN,VALUEIN,KEYOUT,VALUEOUT> and implements the
map function.
• Reduce class (ReduceForWordCount), which extends the class
Reducer<KEYIN,VALUEIN,KEYOUT,VALUEOUT> and implements the
reduce function.
• Serialization and deserialization of keys and values in Java MapReduce are handled by Writable types (e.g. LongWritable, Text, IntWritable).
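The serialization point can be illustrated with plain JDK streams. Hadoop's Writable interface declares write(DataOutput) and readFields(DataInput); the sketch below defines a local stand-in interface, SimpleWritable, with those two methods so it runs without Hadoop. The WordCount record and the helper methods are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

class WritableSketch {
    // Local stand-in that mirrors the two methods of Hadoop's Writable.
    interface SimpleWritable {
        void write(DataOutput out) throws IOException;
        void readFields(DataInput in) throws IOException;
    }

    // Hypothetical <word, count> record that knows how to (de)serialize itself.
    static final class WordCount implements SimpleWritable {
        String word = "";
        int count;

        WordCount() { }

        WordCount(String word, int count) {
            this.word = word;
            this.count = count;
        }

        @Override
        public void write(DataOutput out) throws IOException {
            out.writeUTF(word);
            out.writeInt(count);
        }

        @Override
        public void readFields(DataInput in) throws IOException {
            // Fields must be read in the same order they were written.
            word = in.readUTF();
            count = in.readInt();
        }
    }

    // Serialization: object -> bytes.
    static byte[] serialize(SimpleWritable value) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            value.write(out);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    // Deserialization: bytes -> fields of an existing object.
    static void deserialize(byte[] data, SimpleWritable target) {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(data))) {
            target.readFields(in);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] data = serialize(new WordCount("Hello", 1));
        WordCount copy = new WordCount();
        deserialize(data, copy);
        System.out.println(copy.word + " " + copy.count); // prints Hello 1
    }
}
```

Filling an existing object in readFields, rather than constructing a new one, lets a framework reuse the same instance for many records.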
Example: Explanation
• For example, the input is “Hello” and the mapper is declared as Mapper<LongWritable, Text, Text, IntWritable>.
Input to the map logic:
map key (LongWritable)    map value (Text)
0                         Hello
Output of the map logic:
key (Text)    value (IntWritable)
Hello         1