CS702 Big Data Programs
Under Guidance of
DR. PARASHU RAM PAL Professor
Affiliated to
RAJIV GANDHI PROUDYOGIKI VISHWAVIDYALAYA BHOPAL,
MADHYA PRADESH
Assessment of Experiments
(10 Marks)
(20 Marks)
Execution
(5 Marks)
(5 Marks)
Record
Total
Viva
S.No. Experiment/ Program Signature
TOTAL: / 200
Average Marks Obtained: / 20
Signature : Signature :
Name : Dr. Parashu Ram Pal Name : Dr. Neelesh Jain
EXPERIMENT - 01
Hadoop is a Java-based programming framework that supports the processing and storage
of extremely large datasets on a cluster of inexpensive machines. It was the first major open
source project in the big data playing field and is sponsored by the Apache Software
Foundation.
Hadoop Common is the collection of utilities and libraries that support other Hadoop
modules.
HDFS, which stands for Hadoop Distributed File System, is responsible for persisting
data to disk.
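YARN, short for Yet Another Resource Negotiator, manages cluster resources and schedules the
jobs that process data stored in HDFS; it is often described as the "operating system" of a
Hadoop cluster.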
MapReduce is the original processing model for Hadoop clusters. It distributes work
within the cluster or map, then organizes and reduces the results from the nodes into a
response to a query. Many other processing models are available for the 2.x version of
Hadoop.
Hadoop clusters are relatively complex to set up, so the project includes a stand-alone mode
which is suitable for learning about Hadoop, performing simple operations, and debugging.
Procedure:
In this experiment we install Hadoop in stand-alone mode and run one of the example MapReduce
programs it includes to verify the installation.
Prerequisites:
Download Hadoop from www.hadoop.apache.org
Procedure to Run Hadoop
1. If Apache Hadoop 2.2.0 is not already installed, follow the post "Build, Install,
Configure and Run Apache Hadoop 2.2.0 in Microsoft Windows OS".
2. Start HDFS (Namenode and Datanode) and YARN (Resource Manager and Node
Manager)
Run wordcount MapReduce job
Create a text file with some content. We'll pass this file as input to
the wordcount MapReduce job for counting words.
C:\file1.txt
Install Hadoop
Create a directory (say 'input') in HDFS to keep all the text files (say 'file1.txt') to be used for
counting words.
C:\Users\abhijitg>cd c:\hadoop
C:\hadoop>bin\hdfs dfs -mkdir input
Copy the text file (say 'file1.txt') from the local disk to the newly created 'input' directory in HDFS.
Check content of the copied file.
EXPERIMENT - 02
AIM: To develop a MapReduce program to calculate the frequency of a given word in a given file.
Map Function – It takes a set of data and converts it into another set of data, where individual
elements are broken down into tuples (Key-Value pair).
Example – (Map function in Word Count)
Input
Set of data
Bus, Car, bus, car, train, car, bus, car, train, bus, TRAIN,BUS, buS, caR, CAR, car, BUS, TRAIN
Output
Convert into another set of data
(Key,Value)
(Bus,1), (Car,1), (bus,1), (car,1), (train,1), (car,1), (bus,1), (car,1), (train,1), (bus,1),
(TRAIN,1),(BUS,1), (buS,1), (caR,1), (CAR,1), (car,1), (BUS,1), (TRAIN,1)
Reduce Function – Takes the output from Map as an input and combines those data tuples
into a smaller set of tuples.
Example – (Reduce function in Word Count)
Input Set of Tuples
(output of Map function)
(Bus,1), (Car,1), (bus,1), (car,1), (train,1), (car,1), (bus,1), (car,1), (train,1), (bus,1),
(TRAIN,1),(BUS,1),
(buS,1),(caR,1),(CAR,1), (car,1), (BUS,1), (TRAIN,1)
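The Map and Reduce functions described above can be tried without a cluster. The following is a minimal plain-Java sketch, not the Hadoop program itself. The class name WordCountSketch and its two methods are illustrative names, and converting every word to upper case is an assumption made here so that 'Bus', 'bus' and 'BUS' are counted as the same word.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class WordCountSketch {
    // Map step: emit one (word, 1) pair for every comma-separated token.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String token : line.split(",")) {
            String word = token.trim().toUpperCase();
            if (!word.isEmpty()) {
                pairs.add(new AbstractMap.SimpleEntry<>(word, 1));
            }
        }
        return pairs;
    }

    // Reduce step: sum the values of all pairs that share the same key.
    static Map<String, Integer> reduce(List<Map.Entry<String, Integer>> pairs) {
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            counts.merge(pair.getKey(), pair.getValue(), Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        String input = "Bus, Car, bus, car, train, car, bus, car, train, bus, "
                + "TRAIN,BUS, buS, caR, CAR, car, BUS, TRAIN";
        System.out.println(reduce(map(input)));
    }
}
```

For the input set shown above, this sketch prints {BUS=7, CAR=7, TRAIN=4}.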
The MapReduce workflow consists of 5 steps:
1. Splitting – The input is split on a chosen delimiter, e.g. a space,
comma, semicolon, or new line ('\n').
2. Mapping – Each split is converted into (Key,Value) pairs, as explained above.
3. Intermediate splitting – The map tasks run in parallel on different nodes of the cluster. So
that they can be grouped in the reduce phase, all pairs with the same KEY must be brought to the same node.
4. Reduce – The values of each key are aggregated; this is essentially a group-by phase.
5. Combining – The last phase, where the individual result sets from each
node are combined to form the final result.
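The intermediate step, in which pairs with the same key are brought together, can be illustrated with a small plain-Java sketch. ShuffleSketch, partition and groupByKey are hypothetical names. The partition formula follows the one used by Hadoop's default HashPartitioner, but it is applied here to a Java String rather than a Hadoop Text key, so the reducer numbers it prints are only illustrative.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class ShuffleSketch {
    // A non-negative hash of the key, modulo the number of reducers:
    // equal keys always land on the same reducer.
    static int partition(String key, int numReducers) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReducers;
    }

    // Group the mapper output by key, as happens before the reduce phase.
    static Map<String, List<Integer>> groupByKey(String[] keys) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (String key : keys) {
            grouped.computeIfAbsent(key, k -> new ArrayList<>()).add(1);
        }
        return grouped;
    }

    public static void main(String[] args) {
        String[] mapperOutput = {"BUS", "CAR", "BUS", "TRAIN", "CAR", "BUS"};
        for (Map.Entry<String, List<Integer>> e : groupByKey(mapperOutput).entrySet()) {
            System.out.println(e.getKey() + " -> reducer " + partition(e.getKey(), 2)
                    + ", values " + e.getValue());
        }
    }
}
```

Each reducer then only has to sum the list of values it receives for a key.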
Make sure that Hadoop is installed on your system along with the Java JDK.
Steps to follow
Step 1. Open Eclipse> File > New > Java Project > (Name it – MRProgramsDemo) >
Finish
Step 2. Right Click > New > Package ( Name it - PackageDemo) > Finish
Step 3. Right Click on Package > New > Class (Name it - WordCount)
Step 4. Add the following reference libraries:
Right Click on Project > Build Path > Add External Archives
/usr/lib/hadoop-0.20/hadoop-core.jar
/usr/lib/hadoop-0.20/lib/commons-cli-1.2.jar
package PackageDemo;
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;
public class WordCount {
public static void main(String [] args) throws Exception
{
Configuration c=new Configuration();
String[] files=new GenericOptionsParser(c,args).getRemainingArgs();
Path input=new Path(files[0]);
Path output=new Path(files[1]);
Job j=new Job(c,"wordcount");
j.setJarByClass(WordCount.class);
j.setMapperClass(MapForWordCount.class);
j.setReducerClass(ReduceForWordCount.class);
j.setOutputKeyClass(Text.class);
j.setOutputValueClass(IntWritable.class);
FileInputFormat.addInputPath(j, input);
FileOutputFormat.setOutputPath(j, output);
System.exit(j.waitForCompletion(true)?0:1);
}
public static class MapForWordCount extends Mapper<LongWritable, Text, Text,
IntWritable>{
public void map(LongWritable key, Text value, Context con) throws IOException,
InterruptedException
{
String line = value.toString();
String[] words=line.split(",");
for(String word: words )
{
Text outputKey = new Text(word.toUpperCase().trim());
IntWritable outputValue = new IntWritable(1);
con.write(outputKey, outputValue);
}
}
}
public static class ReduceForWordCount extends Reducer<Text, IntWritable, Text,
IntWritable>
{
public void reduce(Text word, Iterable<IntWritable> values, Context con) throws
IOException,
InterruptedException
{
int sum = 0;
for(IntWritable value : values)
{
sum += value.get();
}
con.write(word, new IntWritable(sum));
}
}
}
To move the input file into HDFS, open the terminal and enter the following
command:
[training@localhost ~]$ hadoop fs -put wordcountFile wordCountFile
EXPERIMENT - 03
AIM: To Develop a MapReduce program to find the maximum temperature in each year.
Description: MapReduce is a programming model designed for processing large volumes of data
in parallel by dividing the work into a set of independent tasks. The previous experiment gave an
introduction to MapReduce; this one explains how to design a MapReduce program. The
aim of the program is to find the maximum temperature recorded for each year of NCDC data.
The input for our program is the weather data files for each year. This weather data is collected by
the National Climatic Data Center (NCDC) from weather sensors all over the world. You can find
weather data for each year at ftp://ftp.ncdc.noaa.gov/pub/data/noaa/. All files are zipped by year
and weather station. For each year, there are multiple files for different weather stations.
Here is an example for 1990 (ftp://ftp.ncdc.noaa.gov/pub/data/noaa/1901/).
010080-99999-1990.gz
010100-99999-1990.gz
010150-99999-1990.gz
…………………………………
MapReduce is based on sets of key-value pairs, so first we have to decide on the types of the
key-value pairs for the input and output of each phase.
Map Phase: The input for the Map phase is the set of weather data files as shown in the snapshot. The types
of the input key-value pairs are LongWritable and Text, and the types of the output key-value pairs are
Text and IntWritable. Each Map task extracts the temperature data from the given year file. The
output of the Map phase is a set of key-value pairs: the keys are the years and the values are the
temperatures recorded in each year.
Reduce Phase: The Reduce phase takes all the values associated with a particular key. That is, all the
temperature values belonging to a particular year are fed to the same reducer. Each reducer then finds
the highest recorded temperature for its year. The types of the output key-value pairs of the Map phase
are the same as the types of the input key-value pairs of the Reduce phase (Text and IntWritable). The types of
the output key-value pairs of the Reduce phase are also Text and IntWritable.
So, in this example we write three java classes:
HighestMapper.java
HighestReducer.java
HighestDriver.java
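Before turning to the Hadoop classes, the logic of the two phases can be tried in plain Java. The sketch below is only an illustration under stated assumptions: MaxTemperatureSketch, record and maxPerYear are hypothetical names, and the records are synthetic 93-character strings in which only the columns read by the mapper (year, signed temperature, quality code) are filled in. They are not real NCDC data.

```java
import java.util.Arrays;
import java.util.Map;
import java.util.TreeMap;

class MaxTemperatureSketch {
    static final int MISSING = 9999;

    // Build a synthetic record: every column is '0' except the three fields used below.
    static String record(String year, String signedTemp, char quality) {
        char[] chars = new char[93];
        Arrays.fill(chars, '0');
        year.getChars(0, 4, chars, 15);        // columns 15-18: year
        signedTemp.getChars(0, 5, chars, 87);  // columns 87-91: sign and temperature
        chars[92] = quality;                   // column 92: quality code
        return new String(chars);
    }

    // Map and reduce combined: keep the highest valid reading seen for each year.
    static Map<String, Integer> maxPerYear(String[] lines) {
        Map<String, Integer> max = new TreeMap<>();
        for (String line : lines) {
            String year = line.substring(15, 19);
            int start = line.charAt(87) == '+' ? 88 : 87; // skip a leading '+'
            int temperature = Integer.parseInt(line.substring(start, 92));
            String quality = line.substring(92, 93);
            if (temperature != MISSING && quality.matches("[01459]")) {
                max.merge(year, temperature, Math::max);
            }
        }
        return max;
    }

    public static void main(String[] args) {
        String[] lines = {
            record("1949", "+0111", '1'), record("1949", "+0078", '1'),
            record("1950", "-0011", '1'), record("1950", "+0022", '1'),
            record("1950", "+9999", '1')   // missing reading, ignored
        };
        System.out.println(maxPerYear(lines));
    }
}
```

With these synthetic records the sketch prints {1949=111, 1950=22}; the values are in the same units as the input, before the division by 10 that the reducer applies.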
Program: HighestMapper.java
import java.io.IOException;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;
public class HighestMapper extends MapReduceBase implements Mapper<LongWritable, Text,
Text, IntWritable>
{
public static final int MISSING = 9999;
public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output,
Reporter reporter) throws IOException
{
String line = value.toString();
String year = line.substring(15,19);
int temperature;
if (line.charAt(87)=='+')
temperature = Integer.parseInt(line.substring(88, 92));
else
temperature = Integer.parseInt(line.substring(87, 92));
String quality = line.substring(92, 93);
if(temperature != MISSING && quality.matches("[01459]"))
output.collect(new Text(year),new IntWritable(temperature));
}
}
HighestReducer.java
import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;
public class HighestReducer extends MapReduceBase implements Reducer<Text, IntWritable,
Text, IntWritable>
{
public void reduce(Text key, Iterator<IntWritable> values, OutputCollector<Text, IntWritable>
output, Reporter reporter) throws IOException
{
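// Start from the smallest int so that years with only negative readings are handled correctly
int max_temp = Integer.MIN_VALUE;
while (values.hasNext())
{
int current=values.next().get();
if ( max_temp < current)
max_temp = current;
}
// The readings are stored in tenths of a degree, so divide by 10 before writing
output.collect(key, new IntWritable(max_temp/10));
}
}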
HighestDriver.java
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;
import org.apache.hadoop.util.*;
public class HighestDriver extends Configured implements Tool{
public int run(String[] args) throws Exception
{
JobConf conf = new JobConf(getConf(), HighestDriver.class);
conf.setJobName("HighestDriver");
conf.setOutputKeyClass(Text.class);
conf.setOutputValueClass(IntWritable.class);
conf.setMapperClass(HighestMapper.class);
conf.setReducerClass(HighestReducer.class);
Path inp = new Path(args[0]);
Path out = new Path(args[1]);
FileInputFormat.addInputPath(conf, inp);
FileOutputFormat.setOutputPath(conf, out);
JobClient.runJob(conf);
return 0;
}
public static void main(String[] args) throws Exception
{
int res = ToolRunner.run(new Configuration(), new HighestDriver(),args);
System.exit(res);
}
}
Output:
EXPERIMENT - 04
import java.util.Scanner;
public class JavaExample
{
public static void main(String args[])
{
/* This program assumes that the student has 6 subjects,
* thats why I have created the array of size 6. You can
* change this as per the requirement.
*/
int marks[] = new int[6];
int i;
float total=0, avg;
Scanner scanner = new Scanner(System.in);
for(i=0; i<6; i++) {
System.out.print("Enter Marks of Subject"+(i+1)+":");
marks[i] = scanner.nextInt();
total = total + marks[i];
}
scanner.close();
//Calculating average here
avg = total/6;
System.out.print("The student Grade is: ");
if(avg>=80)
{
System.out.print("A");
}
else if(avg>=60 && avg<80)
{
System.out.print("B");
}
else if(avg>=40 && avg<60)
{
System.out.print("C");
}
else
{
System.out.print("D");
}
}
}
Expected Output:
EXPERIMENT - 05
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.*;
import org.apache.hadoop.mapreduce.lib.output.*;
import org.apache.hadoop.util.ReflectionUtils;
Pair() {
i = 0;
j = 0;
}
Pair(int i, int j) {
this.i = i;
this.j = j;
}
@Override
public void readFields(DataInput input) throws IOException {
i = input.readInt();
j = input.readInt();
}
@Override
public void write(DataOutput output) throws IOException {
output.writeInt(i);
output.writeInt(j);
}
@Override
public int compareTo(Pair compare) {
if (i > compare.i) {
return 1;
} else if ( i < compare.i) {
return -1;
} else {
if(j > compare.j) {
return 1;
} else if (j < compare.j) {
return -1;
}
}
return 0;
}
public String toString() {
return i + " " + j + " ";
}
}
public class Multiply {
public static class MatriceMapperM extends Mapper<Object,Text,IntWritable,Element>
{
@Override
public void map(Object key, Text value, Context context)
throws IOException, InterruptedException {
String readLine = value.toString();
String[] stringTokens = readLine.split(",");
if (tempElement.tag == 0) {
M.add(tempElement);
} else if(tempElement.tag == 1) {
N.add(tempElement);
}
}
for(int i=0;i<M.size();i++) {
for(int j=0;j<N.size();j++) {
job2.setMapperClass(MapMxN.class);
job2.setReducerClass(ReduceMxN.class);
job2.setMapOutputKeyClass(Pair.class);
job2.setMapOutputValueClass(DoubleWritable.class);
job2.setOutputKeyClass(Pair.class);
job2.setOutputValueClass(DoubleWritable.class);
job2.setInputFormatClass(TextInputFormat.class);
job2.setOutputFormatClass(TextOutputFormat.class);
#!/bin/bash
mkdir -p classes
javac -d classes -cp classes:`$HADOOP_HOME/bin/hadoop classpath` Multiply.java
jar cf multiply.jar -C classes .
echo "end"
stop-yarn.sh
stop-dfs.sh
myhadoop-cleanup.sh
Expected Output:
EXPERIMENT - 06
Input Data
The data is saved as sample.txt and given as input. The input file looks as shown below.
1979 23 23 2 43 24 25 26 26 26 26 25 26 25
1980 26 27 28 28 28 30 31 31 31 30 30 30 29
1981 31 32 32 32 33 34 35 36 36 34 34 34 34
1984 39 38 39 39 39 41 42 43 40 39 38 38 40
1985 38 39 39 39 39 41 41 41 00 40 39 39 45
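What the job does with this file can be previewed with a short plain-Java sketch. It is an illustration under assumptions, not the Hadoop program: ElectricitySketch and yearsAbove are hypothetical names, the last column of each line is treated as the yearly average, and the threshold of 30 is the value used in the reducer of the source code. The columns are split on any whitespace here, whereas the Hadoop mapper expects tab-separated fields.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class ElectricitySketch {
    static final int THRESHOLD = 30;

    // Keep the years whose last column (the yearly average) exceeds the threshold.
    static Map<String, Integer> yearsAbove(String[] lines) {
        Map<String, Integer> result = new LinkedHashMap<>();
        for (String line : lines) {
            String[] tokens = line.trim().split("\\s+");
            String year = tokens[0];
            int average = Integer.parseInt(tokens[tokens.length - 1]);
            if (average > THRESHOLD) {
                result.put(year, average);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String[] sample = {
            "1979 23 23 2 43 24 25 26 26 26 26 25 26 25",
            "1980 26 27 28 28 28 30 31 31 31 30 30 30 29",
            "1981 31 32 32 32 33 34 35 36 36 34 34 34 34",
            "1984 39 38 39 39 39 41 42 43 40 39 38 38 40",
            "1985 38 39 39 39 39 41 41 41 00 40 39 39 45"
        };
        System.out.println(yearsAbove(sample));
    }
}
```

For the five sample lines this prints {1981=34, 1984=40, 1985=45}.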
Source code:
import java.util.*;
import java.io.IOException;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;
import org.apache.hadoop.util.*;
public class ProcessUnits
{
//Mapper class
public static class E_EMapper extends MapReduceBase implements
Mapper<LongWritable ,/*Input key Type */ Text, /*Input value Type*/
Text, /*Output key Type*/ IntWritable> /*Output value Type*/
{
//Map function
public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output,
Reporter reporter) throws IOException
{
String line = value.toString(); String lasttoken = null;
StringTokenizer s = new StringTokenizer(line,"\t");
String year = s.nextToken();
while(s.hasMoreTokens())
{
lasttoken=s.nextToken();
}
int avgprice = Integer.parseInt(lasttoken);
output.collect(new Text(year), new IntWritable(avgprice));
}
}
//Reducer class
public static class E_EReduce extends MapReduceBase implements
Reducer< Text, IntWritable, Text, IntWritable >
{
//Reduce function
public void reduce( Text key, Iterator <IntWritable> values, OutputCollector<Text,
IntWritable> output, Reporter reporter) throws
IOException
{
int maxavg=30;
int val=Integer.MIN_VALUE;
while (values.hasNext())
{
if((val=values.next().get())>maxavg)
{
output.collect(key, new IntWritable(val));
}
}
}
}
//Main function
public static void main(String args[])throws Exception
{
JobConf conf = new JobConf(ProcessUnits.class);
conf.setJobName("max_eletricityunits");
conf.setOutputKeyClass(Text.class);
conf.setOutputValueClass(IntWritable.class);
conf.setMapperClass(E_EMapper.class);
conf.setCombinerClass(E_EReduce.class);
conf.setReducerClass(E_EReduce.class);
conf.setInputFormat(TextInputFormat.class);
conf.setOutputFormat(TextOutputFormat.class);
FileInputFormat.setInputPaths(conf, new Path(args[0]));
FileOutputFormat.setOutputPath(conf, new Path(args[1]));
JobClient.runJob(conf);
}
}
Expected OUTPUT:
Input:
Kolkata,56
Jaipur,45
Delhi,43
Mumbai,34
Goa,45
Kolkata,35
Jaipur,34
Delhi,32
Output:
Kolkata 56
Jaipur 45
Delhi 43
Mumbai 34
EXPERIMENT - 07
AIM: To develop a MapReduce program to analyze a weather data set and print whether the day
is a sunny or a cool day.
NCDC provides access to daily data from the U.S. Climate Reference Network /
U.S. Regional Climate Reference Network (USCRN/USRCRN) via anonymous ftp at:
Dataset ftp://ftp.ncdc.noaa.gov/pub/data/uscrn/products/daily01
After going through the wordcount MapReduce experiment, you now have the basic idea of
how a MapReduce program works. So, let us look at a more complex MapReduce program on a
weather dataset. Here we use one of the 2015 datasets for Austin, Texas.
We will analyze the dataset and classify each day as a hot day or a cold
day depending on the temperature recorded by NCDC.
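The classification rule can be sketched in plain Java before looking at the Hadoop code. The thresholds (maximum above 35, minimum below 10) and the token positions (6th and 7th token) are taken from the comments in the program of this experiment. Everything else is an assumption made for illustration: the names HotColdSketch and classify, the date being the 2nd token, the value -9999.0 marking a missing reading, and the two sample lines, which are invented and not real NCDC records.

```java
import java.util.ArrayList;
import java.util.List;

class HotColdSketch {
    static final double HOT_ABOVE = 35.0;
    static final double COLD_BELOW = 10.0;
    static final double MISSING = -9999.0; // assumed marker for a missing reading

    // Return the labels that apply to one whitespace-separated daily record.
    static List<String> classify(String line) {
        String[] tokens = line.trim().split("\\s+");
        String date = tokens[1];                         // assumed: 2nd token is the date
        double tempMax = Double.parseDouble(tokens[5]);  // 6th token
        double tempMin = Double.parseDouble(tokens[6]);  // 7th token
        List<String> labels = new ArrayList<>();
        if (tempMax != MISSING && tempMax > HOT_ABOVE) {
            labels.add("Hot Day " + date + " " + tempMax);
        }
        if (tempMin != MISSING && tempMin < COLD_BELOW) {
            labels.add("Cold Day " + date + " " + tempMin);
        }
        return labels;
    }

    public static void main(String[] args) {
        System.out.println(classify("23907 20150101 2.423 -98.08 30.62 2.2 -0.6"));
        System.out.println(classify("23907 20150801 2.423 -98.08 30.62 36.5 22.1"));
    }
}
```

The first invented line is labelled as a cold day and the second as a hot day; a line that meets neither condition yields an empty list.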
NCDC gives us all the weather data we need for this:
ftp://ftp.ncdc.noaa.gov/pub/data/uscrn/products/daily01/2015/CRND0103-2015-TX_Austin_33_NW.txt
Step 1
https://drive.google.com/file/d/0B2SFMPvhXPQ5bUdoVFZsQjE2ZDA/view?usp=sharing
import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.conf.Configuration;
public class MyMaxMin {
public static class MaxTemperatureMapper extends
Mapper<LongWritable, Text, Text, Text> {
/**
* @method map
* This method takes the input as a Text data type.
* Skipping the first five tokens, it takes the 6th token as temp_max and
* the 7th token as temp_min. Records with temp_max > 35 and records with
* temp_min < 10 are passed to the reducer.
*/
@Override
public void map(LongWritable arg0, Text Value, Context context) throws IOException,
InterruptedException {
//Converting the record (single line) to String and storing it in a String variable line
//date
// Cold day
new Text(String.valueOf(temp_Min)));
}
}
}
}
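//Reducer
/**
* MaxTemperatureReducer class is static and extends the Reducer abstract class
*/
public static class MaxTemperatureReducer extends Reducer<Text, Text, Text, Text> {
@Override
public void reduce(Text Key, Iterable<Text> Values, Context context) throws
IOException, InterruptedException {
// Write the first temperature received for this key
String temperature = Values.iterator().next().toString();
context.write(Key, new Text(temperature));
}
}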
public static void main(String[] args) throws Exception {
Configuration conf = new Configuration();
Job job = new Job(conf, "weather example");
job.setJarByClass(MyMaxMin.class);
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(Text.class);
job.setMapperClass(MaxTemperatureMapper.class);
job.setReducerClass(MaxTemperatureReducer.class);
job.setInputFormatClass(TextInputFormat.class);
job.setOutputFormatClass(TextOutputFormat.class);
OutputPath.getFileSystem(conf).delete(OutputPath);
System.exit(job.waitForCompletion(true) ? 0 : 1);
Import the project into the Eclipse IDE in the same way as in the earlier experiment and
replace the jar paths with the jar files present in the lib directory of this project.
When the project has no errors, export it as a jar file, as
we did in the wordcount MapReduce experiment: right-click on the project, click
Export, and select JAR file.
You can download the jar file directly using the link below:
temperature.jar
https://drive.google.com/file/d/0B2SFMPvhXPQ5RUlZZDZSR3FYVDA/view?usp=sharing
Link to weather_data.txt:
https://drive.google.com/file/d/0B2SFMPvhXPQ5aFVILXAxbFh6ejA/view?usp=sharing
EXPERIMENT - 08
Source code:
public class Driver extends Configured implements Tool {
enum Counters { DISCARDED_ENTRY
}
public static void main(String[] args) throws Exception { ToolRunner.run(new Driver(), args);
}
job.setMapperClass(Mapper.class); job.setMapOutputKeyClass(LongWritable.class);
job.setMapOutputValueClass(Text.class);
job.setCombinerClass(Combiner.class); job.setReducerClass(Reducer.class);
job.setOutputKeyClass(LongWritable.class); job.setOutputValueClass(Text.class);
Text value,
org.apache.hadoop.mapreduce.Mapper<
LongWritable,
Text,
LongWritable,
Text
>.Context context
.substring(values.get(2).length() - 4);
// convert time to minutes (e.g. 1542 -> 942)
* 60 + Integer.parseInt(time.substring(2,4));
);
} else
{
}}
String[] atom = iterator.next().toString().split(":");
a += Long.parseLong(atom[0]);
n += Long.parseLong(atom[1]);
}
context.write(key, new Text(Long.toString(a) + ":" + Long.toString(n)));
}
}
public class Reducer extends org.apache.hadoop.mapreduce.Reducer<
LongWritable,
Text,
LongWritable,
Text
> {
@Override
protected void reduce(
LongWritable key,
Iterable<Text> values,
Context context
) throws IOException, InterruptedException {
Long n = 0L;
Long a = 0L;
Iterator<Text> iterator = values.iterator();
// calculate the final aggregate
while (iterator.hasNext()) {
String[] atom = iterator.next().toString().split(":");
a += Long.parseLong(atom[0]);
n += Long.parseLong(atom[1]);
}
// cut off seconds
int average = Math.round(a / n);
Expected Output:
EXPERIMENT - 09
AIM: To Develop a MapReduce program to find the tags associated with each movie by
analyzing movie lens data.
For this analysis the Microsoft R Open distribution was used because of its
multithreaded performance. Most of the packages that were used come from the
tidyverse, a collection of packages that share a common philosophy of tidy data. The tidytext and
wordcloud packages were used for some text processing. Finally, the doMC package was used to
exploit multithreading in some of the custom functions, which are described later.
Note that the doMC package is not available on Windows; use the doParallel package instead.
Driver1.java
package KPI_1;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.MultipleInputs;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;
public class Driver1
{
public static void main(String[] args) throws Exception {
Path firstPath = new Path(args[0]);
Path sencondPath = new Path(args[1]);
//use MultipleInputs and specify a different mapper class and input format for each path
MultipleInputs.addInputPath(job, firstPath, TextInputFormat.class,
movieDataMapper.class);
MultipleInputs.addInputPath(job, sencondPath, TextInputFormat.class,
ratingDataMapper.class);
//set Reducer class
job.setReducerClass(dataReducer.class);
FileOutputFormat.setOutputPath(job, outputPath_1);
job.waitForCompletion(true);
Job job1 = Job.getInstance(conf, "Most Viewed Movies2");
job1.setJarByClass(Driver1.class);
//set Driver class
//set Mapper class
job1.setMapperClass(topTenMapper.class);
//set reducer class
job1.setReducerClass(topTenReducer.class);
//output format for mapper
job1.setMapOutputKeyClass(Text.class);
job1.setMapOutputValueClass(LongWritable.class);
job1.setOutputKeyClass(LongWritable.class);
job1.setOutputValueClass(Text.class);
FileInputFormat.addInputPath(job1, outputPath_1);
FileOutputFormat.setOutputPath(job1, outputPath_2);
job1.waitForCompletion(true);
}
}
dataReducer.java
import java.io.*;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.mapreduce.Reducer;
@Override
public void reduce(LongWritable key, Iterable<Text>values,Context
context)throws IOException,InterruptedException
{
//key(movie_id) values
//234 [ 1, ToyStory,1,1,1,1 ..... ]
long count = 0;
String movie_name = null;
for(Text val:values)
{
String token = val.toString();
else
{
movie_name = token; //means data from
movieDataMapper;
}
}
movieDataMapper.java
import java.io.*;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
@Override
public void map(Object key,Text value,Context context)throws
IOException,InterruptedException
{
String []tokens = value.toString().split("::");
long movie_id = Long.parseLong(tokens[0]);
String name = tokens[1];
context.write(new LongWritable(movie_id), new Text(name));
//movie_id name
}
}
ratingDataMapper.java
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
topTenMapper.java
import java.io.*;
import java.util.*;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.mapreduce.Mapper;
topTenReducer.java
import java.io.*;
import java.util.Map;
import java.util.TreeMap;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
String value = entry.getValue(); //movie_name
EXPERIMENT - 10
AIM: Develop a MapReduce program to find the frequency of books published each year
and find in which year the maximum number of books were published using the following data.
Description:
MapReduce is a software framework for easily writing applications which process vast amounts
of data residing on multiple systems. Although it is a very powerful framework, it doesn’t provide
a solution for all big data problems.
Before discussing MapReduce, let us first understand what a framework is in general. A framework is a
set of rules which we follow to obtain the desired result. So whenever we write
a MapReduce program we should fit our solution into the MapReduce framework.
Although MapReduce is very powerful, it has its limitations. Some problems, such as graph
algorithms and algorithms which require iterative processing, are tricky and challenging, so
implementing them in MapReduce is difficult. To overcome such problems we can
use MapReduce design patterns.
[Note: A Design pattern is a general repeatable solution to a commonly occurring problem in
software design. A design pattern isn’t a finished design that can be transformed directly into
code. It is a description or template for how to solve a problem that can be used in many different
situations.]
We generally use MapReduce for data analysis. An important part of data analysis is
finding outliers. An outlier is any value that is numerically distant from most of the other data points
in a set of data. These records are often the most interesting and unique pieces of data in the set.
The aim here is to develop a MapReduce design pattern for finding the Top K
records for a specific criterion, so that we can take a look at them and perhaps figure out the
reason which made them special.
This can be achieved by defining a ranking function or comparison function between two records
that determines whether one is higher than the other. We can apply this pattern with
MapReduce to find the records with the highest values across the entire data set.
Before discussing the MapReduce approach, let’s understand the traditional approach of finding the Top
K records in a file located on a single machine.
Steps to find K records
Traditional Approach: If we are dealing with a file located on a single system or in an RDBMS, we
can follow the steps below to find the Top K records:
1. Sort the data
2. Pick Top K records
MapReduce approach: Solving the same using MapReduce is a bit complicated because:
1. Data is not sorted
2. Data is processed across multiple nodes
Finding Top K records using MapReduce Design Pattern
For finding the Top K records in a distributed file system like Hadoop using MapReduce we should
follow the steps below:
1. In MapReduce, find the Top K for each mapper and send them to the reducer
2. The reducer will in turn find the global Top K of all the mappers
To achieve this we can follow the Top-K MapReduce design pattern, which is explained below with
the help of an algorithm:
1985,ATL,NL,barkele01,870000
1985,ATL,NL,bedrost01,550000
1985,ATL,NL,benedbr01,545000
1985,ATL,NL,campri01,633333
1985,ATL,NL,ceronri01,625000
1985,ATL,NL,chambch01,800000
Above data set contains 5 columns – yearID, teamID, lgID, playerID, salary. In this example we
are finding Top K records based on salary.
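The two-step idea (a local Top K per mapper, then a global Top K in the reducer) can be sketched in plain Java on the six sample records above. This is a minimal illustration, not the Hadoop implementation: TopKSketch, salary and topK are hypothetical names, the two lists stand in for the input of two mappers, and a min-heap is used here in place of the TreeMap used in the programs of this experiment.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

class TopKSketch {
    // Salary is the fifth comma-separated column (index 4) of each record.
    static int salary(String record) {
        return Integer.parseInt(record.split(",")[4]);
    }

    static final Comparator<String> BY_SALARY = Comparator.comparingInt(TopKSketch::salary);

    // Keep only the k records with the highest salary, using a min-heap of size k.
    static List<String> topK(List<String> records, int k) {
        PriorityQueue<String> heap = new PriorityQueue<>(BY_SALARY);
        for (String record : records) {
            heap.add(record);
            if (heap.size() > k) {
                heap.poll(); // drop the record with the lowest salary
            }
        }
        List<String> result = new ArrayList<>(heap);
        result.sort(BY_SALARY.reversed());
        return result;
    }

    public static void main(String[] args) {
        List<String> split1 = Arrays.asList("1985,ATL,NL,barkele01,870000",
                "1985,ATL,NL,bedrost01,550000", "1985,ATL,NL,benedbr01,545000");
        List<String> split2 = Arrays.asList("1985,ATL,NL,campri01,633333",
                "1985,ATL,NL,ceronri01,625000", "1985,ATL,NL,chambch01,800000");
        // Step 1: each "mapper" finds its local top 2.
        List<String> candidates = new ArrayList<>(topK(split1, 2));
        candidates.addAll(topK(split2, 2));
        // Step 2: the "reducer" finds the global top 2 among the candidates.
        for (String record : topK(candidates, 2)) {
            System.out.println(record);
        }
    }
}
```

With K = 2 this prints the records of barkele01 (870000) and chambch01 (800000), which are also the two highest salaries in the whole sample.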
For sorting the data easily we can use java.util.TreeMap. It sorts the keys automatically.
But by default a TreeMap keeps only one entry per key, so records with duplicate salary values
are overwritten, which will not give the correct results.
To overcome this we should create a TreeMap with our own comparator to include the duplicate
values and sort them.
Below is the implementation of the Comparator to sort and include the duplicate values:
Comparator code:
import java.util.Comparator;
public class Salary {
private int sum;
public int getSum() {
return sum;
}
public void setSum(int sum) {
this.sum = sum;
}
public Salary(int sum) {
super();
this.sum = sum;
}
}
class MySalaryComp1 implements Comparator<Salary>{
@Override
public int compare(Salary e1, Salary e2) {
if(e1.getSum()>e2.getSum()){
return 1;
} else {
return -1;
}
}
}
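The effect of a comparator that never returns 0 can be seen in a small stand-alone sketch. DuplicateKeySketch and its methods are hypothetical names, and plain Integer keys are used in place of the Salary class to keep the example short.

```java
import java.util.Comparator;
import java.util.TreeMap;

class DuplicateKeySketch {
    // Never returns 0, so two equal salaries are treated as two different keys.
    static final Comparator<Integer> KEEP_DUPLICATES = (a, b) -> a > b ? 1 : -1;

    // Natural ordering: a second put with the same key overwrites the first.
    static int sizeWithNaturalOrdering(int[] salaries) {
        TreeMap<Integer, String> map = new TreeMap<>();
        for (int i = 0; i < salaries.length; i++) {
            map.put(salaries[i], "record" + i);
        }
        return map.size();
    }

    // Custom comparator: every put creates a new entry.
    static int sizeWithCustomComparator(int[] salaries) {
        TreeMap<Integer, String> map = new TreeMap<>(KEEP_DUPLICATES);
        for (int i = 0; i < salaries.length; i++) {
            map.put(salaries[i], "record" + i);
        }
        return map.size();
    }

    public static void main(String[] args) {
        int[] salaries = {550000, 870000, 550000};
        System.out.println("natural ordering: " + sizeWithNaturalOrdering(salaries) + " entries");
        System.out.println("custom comparator: " + sizeWithCustomComparator(salaries) + " entries");
    }
}
```

This prints 2 entries for the natural ordering and 3 for the custom comparator. Note that such a comparator does not follow the usual Comparator contract, so lookups such as get() or containsKey() will not find a key in this map; it is suitable only for inserting entries and iterating over them in sorted order, which is all the Top K programs need.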
Mapper Code:
import java.io.IOException;
import java.util.Iterator;
import java.util.Map.Entry;
import java.util.TreeMap;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
public class Top20Mapper extends Mapper<LongWritable, Text, NullWritable, Text> {
    // create the TreeMap with the MySalaryComp1 comparator
    public static TreeMap<Salary, Text> ToRecordMap = new TreeMap<Salary, Text>(new
            MySalaryComp1());
    public void map(LongWritable key, Text value, Context context) throws IOException,
            InterruptedException {
        String line = value.toString();
        // split the comma-separated record and fetch the salary (fifth column)
        String[] tokens = line.split(",");
        int salary = Integer.parseInt(tokens[4]);
        // insert the salary object as key and the entire row as value;
        // the tree map sorts the records based on salary
        ToRecordMap.put(new Salary(salary), new Text(value));
        // If we have more than ten records, remove the one with the lowest salary.
        // As this tree map is sorted in ascending order, the record with
        // the lowest salary is the first key.
        Iterator<Entry<Salary, Text>> iter = ToRecordMap.entrySet().iterator();
        while (ToRecordMap.size() > 10) {
            iter.next();
            iter.remove();
        }
    }
    protected void cleanup(Context context) throws IOException, InterruptedException {
        // Output our ten records to the reducers with a null key
        for (Text t : ToRecordMap.values()) {
            context.write(NullWritable.get(), t);
        }
    }
}
Reducer Code:
import java.io.IOException;
import java.util.Iterator;
import java.util.TreeMap;
import java.util.Map.Entry;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.Reducer.Context;
public class Top20Reducer extends Reducer<NullWritable, Text, NullWritable, Text> {
public static TreeMap<Salary , Text> ToRecordMap = new TreeMap<Salary , Text>(new
MySalaryComp1());
public void reduce(NullWritable key, Iterable<Text> values,Context context) throws
IOException, InterruptedException {
for (Text value : values) {
String line=value.toString();
if(line.length()>0){
String[] tokens=line.split(",");
//split the comma-separated record and fetch the salary (fifth column)
int salary=Integer.parseInt(tokens[4]);
//insert salary as key and entire row as value
//tree map sort the records based on salary
Expected Output:
The output of the job is the Top K records.
In this way we can obtain the Top K records using MapReduce.