Big Data Practical 2

Practical 2: MapReduce Programming Examples – Word Count, Union, Intersection & GroupSum.

Theory:
MapReduce word count in Apache Hadoop counts the number of times each word appears in a set of
input files. The framework splits the input data into chunks, processes the chunks in parallel with
map tasks, sorts the map outputs, and passes them as input to the reduce tasks. Job input and
output are stored in a file system, and the framework schedules tasks, monitors them, and
re-executes any that fail.
The MapReduce word count example works in three stages (a worked trace follows this list):
 Mapper phase: The text from the input file is tokenized into words, and each word is emitted
as a key-value pair of the form (word, 1).
 Shuffle phase: The key-value pairs are sorted and grouped by key.
 Reducer phase: The values for each key are added up to give the number of times that word
appears.
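As an illustration (a hypothetical input, not taken from an actual run), a two-line input such as
"deer bear river" and "car car river" flows through the stages roughly as follows:

Map output:    (deer,1) (bear,1) (river,1) (car,1) (car,1) (river,1)
Shuffle/sort:  (bear,[1]) (car,[1,1]) (deer,[1]) (river,[1,1])
Reduce output: bear 1, car 2, deer 1, river 2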
MapReduce is a programming paradigm that allows for massive scalability across thousands
of servers in a Hadoop cluster.
MapReduce is a data processing tool used to process data in parallel in a distributed form. It was
introduced in 2004 in the Google paper "MapReduce: Simplified Data Processing on Large Clusters."
A MapReduce system is usually composed of three steps (even though it is generalized as the
combination of Map and Reduce operations/functions). The MapReduce operations are:
 Map: The input data is first split into smaller blocks. The Hadoop framework then
decides how many mappers to use, based on the size of the data to be processed and
the memory available on each mapper node. Each block is then assigned to a mapper for
processing. Each worker node applies the map function to its local data and writes the
output to temporary storage. The primary (master) node ensures that only a single copy
of any redundant input data is processed.
 Shuffle, combine and partition: Worker nodes redistribute data based on the output
keys produced by the map function, so that all data belonging to one key ends up on
the same worker node. As an optional step, a combiner (essentially a local reducer) can
run on each mapper node to shrink that mapper's output, which reduces the data footprint
and makes shuffling and sorting cheaper. Partitioning (which is not optional) decides
which reducer each key is sent to; a minimal partitioner sketch follows this list.
 Reduce: A reducer cannot start while a mapper is still in progress. Worker nodes
process each group of intermediate <key, value> pairs in parallel and produce
<key, value> pairs as output. All map output values that share the same key are routed
to a single reducer, which then aggregates the values for that key. Unlike the map
function, which is mandatory in order to filter and sort the initial data, the reduce
function is optional.
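To make the partition step concrete, the sketch below shows a minimal custom partitioner
(illustrative only; it simply reproduces the behaviour of Hadoop's default HashPartitioner, so
defining it is not required for the programs in this practical):

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class WordPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numReduceTasks) {
        // Route each key to a reducer based on its hash, spreading keys evenly
        // across the configured number of reduce tasks.
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}

// Wired into a job with: job.setPartitionerClass(WordPartitioner.class);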
Aim 2a: Write a simple program for Word Count Using Map Reduce Programming.
Steps:
1. Open Eclipse > File > New > Java Project > (Name it - MRProgramsDemo) > Finish.
2. Right Click > New > Package (Name it - PackageDemo) > Finish.
3. Right Click on Package > New > Class (Name it - WordCount).
4. Add the following reference libraries:
 Right Click on Project > Build Path > Add External Archives
 /usr/lib/hadoop-0.20/hadoop-core.jar
 /usr/lib/hadoop-0.20/lib/commons-cli-1.2.jar
Code:
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
public class WordCount {

  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {

    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    // Split each input line into words and emit (word, 1) for every token.
    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {

    private IntWritable result = new IntWritable();

    // Sum the counts received for each word and emit (word, total).
    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class); // local aggregation before the shuffle
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // input file/directory in HDFS
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // output directory (must not exist yet)
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
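To run the job (a sketch, not from an actual run: the jar name wordcount.jar and the HDFS paths
/input and /output are placeholders, and the fully qualified class name depends on the package the
class was created in):

hadoop jar wordcount.jar PackageDemo.WordCount /input /output
hdfs dfs -cat /output/part-r-00000

The output directory must not exist before the job is submitted; Hadoop creates it and writes one
part-r-* file per reducer.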
Aim 2b: Write a simple program for Union Using Map Reduce Programming.
Steps:
1. Open Eclipse > File > New > Java Project > (Name it - MRProgramsDemo) > Finish.
2. Right Click > New > Package (Name it - PackageDemo (optional)) > Finish.
3. Right Click on Package > New > Class (Name it - Union).
4. Add the following reference libraries:
 Right Click on Project > Build Path > Add External Archives
 /usr/lib/hadoop-0.20/hadoop-core.jar
 /usr/lib/hadoop-0.20/lib/commons-cli-1.2.jar
Code:
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class Union {

  private static Text emptyWord = new Text("");

  public static class UnionMapper extends Mapper<Object, Text, Text, Text> {
    // Emit every input line as a key; the value is irrelevant for a union.
    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      context.write(value, emptyWord);
    }
  }

  public static class UnionReducer extends Reducer<Text, Text, Text, Text> {
    // Duplicate lines arrive grouped under one key, so each distinct line is written once.
    @Override
    public void reduce(Text key, Iterable<Text> values, Context context)
        throws IOException, InterruptedException {
      context.write(key, key);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "Union");
    job.setJarByClass(Union.class);
    job.setMapperClass(UnionMapper.class);
    job.setCombinerClass(UnionReducer.class); // safe: its output types match the map output types
    job.setReducerClass(UnionReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(Text.class);
    Path input = new Path(args[0]);   // directory containing all the files to union
    Path output = new Path(args[1]);
    FileInputFormat.addInputPath(job, input);
    FileOutputFormat.setOutputPath(job, output);
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
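A hypothetical example (not from an actual run): if the input directory passed as the first
argument contains file1 with the lines apple, banana, cherry and file2 with the lines banana,
cherry, dates, the reducer is invoked once per distinct line, so the result contains apple,
banana, cherry and dates with no duplicates.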
OUTPUT:
Aim 2c: Write a simple program for Intersection Using Map Reduce Programming.
Steps:
1. Open Eclipse > File > New > Java Project > (Name it - MRProgramsDemo) > Finish.
2. Right Click > New > Package (Name it - PackageDemo) > Finish.
3. Right Click on Package > New > Class (Name it - Intersection).
4. Add the following reference libraries:
 Right Click on Project > Build Path > Add External Archives
 /usr/lib/hadoop-0.20/hadoop-core.jar
 /usr/lib/hadoop-0.20/lib/commons-cli-1.2.jar
Code:
import java.io.IOException;
import java.util.HashSet;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
public class Intersection {

  public static class IntersectionMapper extends Mapper<Object, Text, Text, Text> {
    private Text outputKey = new Text();
    private Text outputValue = new Text();

    public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
      String line = value.toString();
      String fileName = ((org.apache.hadoop.mapreduce.lib.input.FileSplit) context.getInputSplit())
          .getPath().getName();
      outputKey.set(line);
      outputValue.set(fileName);
      context.write(outputKey, outputValue);
    }
  }

  public static class IntersectionReducer extends Reducer<Text, Text, Text, Text> {
    private HashSet<String> fileNames = new HashSet<>();

    public void reduce(Text key, Iterable<Text> values, Context context) throws IOException, InterruptedException {
      fileNames.clear();
      for (Text value : values) {
        fileNames.add(value.toString());
      }
      if (fileNames.size() == 2) { // If the key appears in both input files
        context.write(key, new Text(""));
      }
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "Intersection");
    job.setJarByClass(Intersection.class);
    job.setMapperClass(IntersectionMapper.class);
    job.setReducerClass(IntersectionReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(Text.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // First input file
    FileInputFormat.addInputPath(job, new Path(args[1]));   // Second input file
    FileOutputFormat.setOutputPath(job, new Path(args[2])); // Output path
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
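A hypothetical example (not from an actual run): with a first input file containing apple, banana,
cherry and a second containing banana, cherry, dates, only banana and cherry occur in both files,
so only those two lines are written to the output. Note that the fileNames.size() == 2 check
assumes the two inputs are distinct files with different names.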

OUTPUT:
Aim 2d: Write a simple program for Group Sum Using Map Reduce Programming.
Steps:
1. Open Eclipse > File > New > Java Project > (Name it - MRProgramsDemo) > Finish.
2. Right Click > New > Package (Name it - PackageDemo) > Finish.
3. Right Click on Package > New > Class (Name it - GroupSum).
4. Add the following reference libraries:
 Right Click on Project > Build Path > Add External Archives
 /usr/lib/hadoop-0.20/hadoop-core.jar
 /usr/lib/hadoop-0.20/lib/commons-cli-1.2.jar
Code:
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
public class GroupSum {

  public static class GroupMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private Text groupKey = new Text();
    private IntWritable value = new IntWritable();

    public void map(LongWritable key, Text val, Context context) throws IOException, InterruptedException {
      String line = val.toString();
      String[] parts = line.split(",");
      groupKey.set(parts[0]);                // Set grouping key
      value.set(Integer.parseInt(parts[1])); // Set value
      context.write(groupKey, value);
    }
  }

  public static class GroupReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "Group Sum");
    job.setJarByClass(GroupSum.class);
    job.setMapperClass(GroupMapper.class);
    job.setReducerClass(GroupReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
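A hypothetical example (not from an actual run): given a comma-separated input with the lines
sales,10 and sales,5 and hr,3, the mapper emits (sales,10), (sales,5) and (hr,3), and the reducer
sums the values per key to produce hr 3 and sales 15.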
OUTPUT:

CONCLUSION:
Hadoop MapReduce programming enables distributed processing of large datasets in a scalable and fault-tolerant
manner. These examples illustrate how diverse data operations, such as word counting, set operations (union and
intersection), and data aggregation (group sum), can be effectively implemented, paving the way for handling
complex Big Data workloads.
