TECHNOLOGY
East Campus, Geeta Colony, NEW DELHI-110031
Practical File of
Business Intelligence
Course Code - CPAIE12
(M.TECH)
Practical – 1
Installation of Hadoop in standalone and pseudo-distributed mode
Commands
a) STANDALONE MODE:
Installation of JDK 8
Command: sudo apt-get install openjdk-8-jdk
Download and extract Hadoop
Command: wget http://archive.apache.org/dist/hadoop/core/hadoop-1.2.0/hadoop-1.2.0.tar.gz
Command: tar -xvf hadoop-1.2.0.tar.gz
Command: sudo mv hadoop-1.2.0 /usr/lib/hadoop
Set the path for Java and Hadoop by adding the following lines to .bashrc
Command: sudo gedit $HOME/.bashrc
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-i386
export PATH=$PATH:$JAVA_HOME/bin
export HADOOP_COMMON_HOME=/usr/lib/hadoop
export HADOOP_MAPRED_HOME=/usr/lib/hadoop
export PATH=$PATH:$HADOOP_COMMON_HOME/bin
export PATH=$PATH:$HADOOP_COMMON_HOME/sbin
Check the Java and Hadoop versions
Command: java -version
Command: hadoop version
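In standalone mode Hadoop runs as a single Java process against the local filesystem, so a quick sanity check is to run the bundled WordCount example on a local directory (this assumes the examples jar shipped with the 1.2.0 tarball at the path below and an existing local directory named input; output must not already exist):
Command: hadoop jar /usr/lib/hadoop/hadoop-examples-1.2.0.jar wordcount input output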
b) PSEUDO MODE:
A Hadoop single-node cluster runs on a single machine: the NameNode and the DataNode both run on that one machine. The installation and configuration steps are given below:
Installation of secured shell:
Command: sudo apt-get install openssh-server
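The remaining configuration steps are not reproduced here; a minimal pseudo-distributed setup for Hadoop 1.x typically enables passwordless SSH to localhost and points the filesystem and JobTracker at the local machine (the property names below are the standard Hadoop 1.x ones and the ports are the conventional defaults):
Command: ssh-keygen -t rsa -P "" -f $HOME/.ssh/id_rsa
Command: cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys
conf/core-site.xml: set fs.default.name to hdfs://localhost:9000
conf/hdfs-site.xml: set dfs.replication to 1
conf/mapred-site.xml: set mapred.job.tracker to localhost:9001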
Format the NameNode
Command: hadoop namenode -format
Start the NameNode and DataNode daemons
Command: start-dfs.sh
Start the JobTracker and TaskTracker daemons
Command: start-mapred.sh
To check if Hadoop started correctly
Command: jps
The output should list the NameNode, SecondaryNameNode, DataNode, JobTracker and TaskTracker processes.
Practical – 2
Implement the following file management tasks in Hadoop:
i) Adding files and directories, retrieving and deleting files
ii) Benchmark and stress test an Apache Hadoop cluster (example commands are given at the end of this practical).
The hadoop fs -cat command prints the contents of a file stored in HDFS to the terminal. To retrieve and view example.txt, we can run the following command:
hadoop fs -cat example.txt
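To actually copy a file from HDFS back to the local filesystem, the -get (or -copyToLocal) option can be used instead, for example:
hadoop fs -get example.txt example.txt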
Deleting Files from HDFS
hadoop fs -rm example.txt
The command for creating a directory in HDFS is "hdfs dfs -mkdir /newdir".
A local directory is added (copied) to HDFS through the command "hdfs dfs -put new_dir /".
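For task ii), a common way to benchmark and stress test the cluster is the TestDFSIO job shipped in Hadoop's test jar. The jar name and location below match the Hadoop 1.2.0 tarball used in Practical 1 and may differ on other distributions; -nrFiles and -fileSize (in MB) can be scaled up to increase the load:
Command: hadoop jar $HADOOP_COMMON_HOME/hadoop-test-1.2.0.jar TestDFSIO -write -nrFiles 10 -fileSize 100
Command: hadoop jar $HADOOP_COMMON_HOME/hadoop-test-1.2.0.jar TestDFSIO -read -nrFiles 10 -fileSize 100
Command: hadoop jar $HADOOP_COMMON_HOME/hadoop-test-1.2.0.jar TestDFSIO -clean
The -write run stress tests HDFS writes, -read re-reads the generated files and reports throughput, and -clean removes the benchmark output.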
Practical – 3
Run a basic Word Count MapReduce program to understand the MapReduce paradigm.
● Find the number of occurrences of each word appearing in the input file(s).
● Perform a MapReduce job for word search count (look for specific keywords in a file).
PROGRAM LOGIC:
WordCount is a simple program which counts the number of occurrences of each word in
a given text input data set. WordCount fits very well with the MapReduce programming
model making it a great example to understand the Hadoop Map/Reduce programming
style. Our implementation consists of three main parts:
1. Mapper
2. Reducer
3. Driver
● First Open Eclipse -> then select File -> New -> Java Project ->Name it
WordCount ->then Finish.
● Create 3 Java classes in the project. Name them WC_Runner (having the main
function), WC_Mapper and WC_Reducer.
You have to include two reference libraries for that:
● Right click on the project -> then select Build Path -> click on Configure Build Path.
● In the Libraries tab, click on Add External JARs and add the two files mentioned below.
You can find these files in /usr/lib/
1. /usr/lib/hadoop-0.20-mapreduce/hadoop-core-2.6.0-mr1-cdh5.13.0.jar
2. /usr/lib/hadoop/hadoop-common-2.6.0-cdh5.13.0.jar
WC_Mapper.java
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class WC_Mapper extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output,
                    Reporter reporter) throws IOException {
        // Tokenize the line and emit (word, 1) for every token.
        String line = value.toString();
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens()) {
            word.set(tokenizer.nextToken());
            output.collect(word, one);
        }
    }
}
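The WC_Reducer class referenced by the driver below is not reproduced in this copy of the file; a standard old-API (org.apache.hadoop.mapred) implementation consistent with the mapper and driver, given here as a sketch, would be:

WC_Reducer.java
import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class WC_Reducer extends MapReduceBase implements Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterator<IntWritable> values, OutputCollector<Text, IntWritable> output,
                       Reporter reporter) throws IOException {
        // Add up the counts emitted for this word by all mappers/combiners.
        int sum = 0;
        while (values.hasNext()) {
            sum += values.next().get();
        }
        output.collect(key, new IntWritable(sum));
    }
}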
WC_Runner.java
import java.io.IOException;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.TextInputFormat;
import org.apache.hadoop.mapred.TextOutputFormat;

public class WC_Runner {
    public static void main(String[] args) throws IOException {
        JobConf conf = new JobConf(WC_Runner.class);
        conf.setJobName("WordCount");
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);
        conf.setMapperClass(WC_Mapper.class);
        conf.setCombinerClass(WC_Reducer.class);
        conf.setReducerClass(WC_Reducer.class);
        conf.setInputFormat(TextInputFormat.class);
        conf.setOutputFormat(TextOutputFormat.class);
        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));
        JobClient.runJob(conf);
    }
}
Now you have to make a jar file. Right click on the project -> click on Export -> select the export
destination as JAR file -> name the jar file (WordCount.jar) -> click Next -> and finally click
Finish. Now copy this file into the workspace directory of Cloudera.
Open the terminal on CDH and change the directory to the workspace. You can do this by
using the "cd workspace/" command. Now create a text file (WCFile.txt) and move it to
HDFS, then run the job with the hadoop jar command shown below (remember you should be
in the same directory as the jar file you have just created).
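A typical sequence (assuming the input file is placed in your HDFS home directory and that the WCOutput directory does not exist yet) is:
hadoop fs -put WCFile.txt WCFile.txt
hadoop jar WordCount.jar WC_Runner WCFile.txt WCOutput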
After executing the job, you can see the result in the WCOutput directory or by running the
following command in the terminal.
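With the old mapred API used here, the reducer output is normally written to a file named part-00000 inside WCOutput, so it can be printed with (the exact part file name may differ on other setups):
hadoop fs -cat WCOutput/part-00000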
Practical – 4
Input: A large textual file containing one sentence per line, and a small file containing a set of
stop words (one stop word per line).
Output: A textual file containing the word count of the large input file, excluding the words
appearing in the small file.
Procedure:
● Step 1: One block is processed by one mapper at a time. In the mapper, a developer
can specify custom business logic as per the requirements. In this manner, Map
runs on all the nodes of the cluster and processes the data blocks in parallel.
● Step 2: The output of the mapper, also known as intermediate output, is written to the local
disk. The mapper output is not stored on HDFS because it is temporary data, and
writing it to HDFS would create many unnecessary copies.
● Step 3: The output of the mapper is shuffled to the reducer node (a normal slave node
on which the reduce phase runs, hence called the reducer node). The shuffling/copying
is a physical movement of data over the network.
● Step 4: Once all the mappers have finished and their output has been shuffled to the reducer
nodes, this intermediate output is merged and sorted, and then provided as
input to the reduce phase.
● Step 5: Reduce is the second phase of processing, where the user can specify
custom business logic as per the requirements. The input to a reducer is
provided from all the mappers. The output of the reducer is the final output, which
is written to HDFS.
TokenizerMapper.java
import java.io.IOException;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private static final Set<String> STOP_WORDS = new HashSet<>(Arrays.asList(
            "a", "an", "and", "are", "as", "at", "be", "but", "by", "for", "if",
            "in", "into", "is", "it", "no", "not", "of", "on", "or", "such", "that", "the",
            "their", "then", "there", "these", "they", "this", "to", "was", "will", "with"));
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
        // Lowercase the line, strip non-letter characters and split on whitespace.
        String[] words = value.toString().toLowerCase().replaceAll("[^a-zA-Z ]", "").split("\\s+");
        for (String w : words) {
            // Emit (word, 1) only for non-empty tokens that are not stop words.
            if (!STOP_WORDS.contains(w) && !w.isEmpty()) {
                word.set(w);
                context.write(word, one);
            }
        }
    }
}
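Note that the mapper above hardcodes the stop-word list rather than reading the small stop-word file from the problem statement. A sketch of a variant that loads the list in setup() from a file distributed with the job is shown below; the stopWords field name and the idea of registering the file with job.addCacheFile are assumptions for illustration, not part of the original program.

// Additions to TokenizerMapper: load stop words from a cached file instead of a hardcoded set.
// Extra imports needed: java.io.BufferedReader, java.io.InputStreamReader, java.net.URI,
// org.apache.hadoop.fs.FileSystem, org.apache.hadoop.fs.Path.
private final Set<String> stopWords = new HashSet<>();

@Override
protected void setup(Context context) throws IOException, InterruptedException {
    URI[] cacheFiles = context.getCacheFiles();   // files registered via job.addCacheFile(...)
    if (cacheFiles == null) {
        return;
    }
    FileSystem fs = FileSystem.get(context.getConfiguration());
    for (URI uri : cacheFiles) {
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(fs.open(new Path(uri.getPath()))))) {
            String line;
            while ((line = reader.readLine()) != null) {
                stopWords.add(line.trim().toLowerCase());   // one stop word per line
            }
        }
    }
}

The driver would then register the stop-word file with job.addCacheFile(new Path(args[2]).toUri()), and the map() method would test stopWords instead of STOP_WORDS.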
StopWordElimination.java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class StopWordElimination {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "stop word elimination");
        job.setJarByClass(StopWordElimination.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
IntSumReducer.java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        // Sum the counts emitted for this word by the mappers/combiners.
        int sum = 0;
        for (IntWritable val : values) {
            sum += val.get();
        }
        result.set(sum);
        context.write(key, result);
    }
}
Practical – 5
Write a MapReduce program that mines weather data. Weather sensors collecting data
every hour at many locations across the globe gather large volumes of log data, which is a
good candidate for analysis with MapReduce, since it is semi-structured and record-oriented.
1. Find the average, maximum and minimum temperature for each year in the NCDC data set.
2. Filter the readings of a set based on the value of the measurement: output the lines of the
input files associated with a temperature value greater than 30.0 and store them in a
separate file (a sketch for this task is given at the end of this practical).
Step 5: Run the MapReduce job
hadoop fs -rm -r /weatherMining/output
hadoop jar WeatherAnalysis.jar WeatherAnalysis /weatherMining/input /weatherMining/output
WeatherAnalysis.java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.FloatWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WeatherAnalysis {

    // Mapper: splits each line and extracts the date and temperature.
    public static class WeatherMapper extends Mapper<Object, Text, Text, FloatWritable> {
        private Text date = new Text();
        private FloatWritable temp = new FloatWritable();

        public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
            // Split the line by whitespace.
            String[] parts = value.toString().split("\\s+");
            // Check that there are enough tokens (we expect at least 7 fields).
            if (parts.length > 6) {
                // parts[1] is the date (YYYYMMDD)
                date.set(parts[1]);
                try {
                    // parts[5] is the temperature (e.g., -18.8)
                    float temperature = Float.parseFloat(parts[5]);
                    temp.set(temperature);
                    context.write(date, temp);
                } catch (NumberFormatException e) {
                    // Skip this record if the temperature is not a valid float.
                }
            }
        }
    }
WeatherAnalysis.java (continued): WeatherReducer
    // Reducer: emits the maximum temperature recorded for each date key.
    public static class WeatherReducer extends Reducer<Text, FloatWritable, Text, FloatWritable> {
        public void reduce(Text key, Iterable<FloatWritable> values, Context context)
                throws IOException, InterruptedException {
            float maxTemp = Float.NEGATIVE_INFINITY;
            for (FloatWritable val : values) {
                if (val.get() > maxTemp) {
                    maxTemp = val.get();
                }
            }
            context.write(key, new FloatWritable(maxTemp));
        }
    }
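The reducer above emits only the maximum temperature for each date key, while the task also asks for the average and minimum per year. A sketch of an alternative reducer is given below; it would be added as another member of WeatherAnalysis (reusing its imports), and keying by year would additionally require the mapper to emit parts[1].substring(0, 4) as the key. The class name and the formatted Text output are assumptions, not part of the original listing.

    // Sketch: emit minimum, maximum and average temperature for each key.
    public static class MinMaxAvgReducer extends Reducer<Text, FloatWritable, Text, Text> {
        public void reduce(Text key, Iterable<FloatWritable> values, Context context)
                throws IOException, InterruptedException {
            float min = Float.POSITIVE_INFINITY;
            float max = Float.NEGATIVE_INFINITY;
            float sum = 0f;
            int count = 0;
            for (FloatWritable val : values) {
                float t = val.get();
                if (t < min) { min = t; }
                if (t > max) { max = t; }
                sum += t;
                count++;
            }
            context.write(key, new Text(String.format("min=%.1f max=%.1f avg=%.1f", min, max, sum / count)));
        }
    }

Using this reducer would also require the driver to call job.setMapOutputValueClass(FloatWritable.class) and job.setOutputValueClass(Text.class).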
WeatherAnalysis.java (continued): driver (main method)
    // Driver: configures and submits the job.
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "Weather Data Analysis");
        job.setJarByClass(WeatherAnalysis.class);
        job.setMapperClass(WeatherMapper.class);
        job.setReducerClass(WeatherReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(FloatWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
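Task 2 (filtering the readings with a temperature greater than 30.0 into a separate output) is not covered by the listing above. A minimal map-only sketch, assuming the same whitespace-separated record format with the temperature in the sixth field, is given below; the class and path names are illustrative only.

// HotReadingsFilter.java (illustrative sketch, not part of the original listing)
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class HotReadingsFilter {

    public static class FilterMapper extends Mapper<Object, Text, Text, NullWritable> {
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            String[] parts = value.toString().split("\\s+");
            if (parts.length > 6) {
                try {
                    // Emit the whole input line when the temperature exceeds 30.0.
                    if (Float.parseFloat(parts[5]) > 30.0f) {
                        context.write(value, NullWritable.get());
                    }
                } catch (NumberFormatException e) {
                    // Skip malformed records.
                }
            }
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "hot readings filter");
        job.setJarByClass(HotReadingsFilter.class);
        job.setMapperClass(FilterMapper.class);
        job.setNumReduceTasks(0);   // map-only job: mapper output is written directly to HDFS
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(NullWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}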