
NETAJI SUBHAS UNIVERSITY OF TECHNOLOGY
East Campus, Geeta Colony, NEW DELHI-110031

Practical File of Business Intelligence
Course Code: CPAIE12
(M.TECH)

Submitted To:
Dr. Vishal Bhatnagar
Professor
Department of Computer Science and Engineering

Submitted By:
Name: Raghvender Tyagi
Roll No.: 2024PAI7329
Branch: Artificial Intelligence
Index

S.No  Practical                                                               Date   Signature

1.  Perform setting up and installing Hadoop in its 2 operating modes.
    Use web-based tools to monitor your Hadoop setup.
2.  Implement the following file management tasks in Hadoop:
    i) Adding files and directories, retrieving and deleting files
    ii) Benchmark and stress test an Apache Hadoop cluster.
3.  Run a basic Word Count MapReduce program to understand the MapReduce paradigm.
4.  Write a program for stop word elimination in a textual file.
5.  Write a MapReduce program that mines weather data.
Practical – 1
Perform setting up and installing Hadoop in its 2 operating modes. Use web-based tools
to monitor your Hadoop setup.

Commands
a) STANDALONE MODE:
Install JDK 8
Command: sudo apt-get install openjdk-8-jdk
Download and extract Hadoop
Command: wget http://archive.apache.org/dist/hadoop/core/hadoop-1.2.0/hadoop-1.2.0.tar.gz
Command: tar -xvf hadoop-1.2.0.tar.gz
Command: sudo mv hadoop-1.2.0 /usr/lib/hadoop
Set the paths for Java and Hadoop
Command: sudo gedit $HOME/.bashrc
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-i386
export PATH=$PATH:$JAVA_HOME/bin
export HADOOP_COMMON_HOME=/usr/lib/hadoop
export HADOOP_MAPRED_HOME=/usr/lib/hadoop
export PATH=$PATH:$HADOOP_COMMON_HOME/bin
export PATH=$PATH:$HADOOP_COMMON_HOME/sbin
Check the Java and Hadoop versions
Command: java -version
Command: hadoop version
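To verify the standalone installation, the examples jar bundled with Hadoop 1.2.0 can be run on a
local directory (a minimal check, assuming the archive was extracted to /usr/lib/hadoop as above;
the ~/input and ~/output paths are only illustrative):
Command: mkdir ~/input && cp /usr/lib/hadoop/conf/*.xml ~/input
Command: hadoop jar /usr/lib/hadoop/hadoop-examples-1.2.0.jar wordcount ~/input ~/output
Command: cat ~/output/part-r-00000
In standalone mode the job runs in a single JVM against the local filesystem, so no daemons need to be started.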
b) PSEUDO MODE:
A Hadoop single-node cluster runs on a single machine: the NameNode and DataNode
daemons run on the same host. The installation and configuration steps are given below:
Installation of secured shell:
Command: sudo apt-get install openssh-server
Create an SSH key for passwordless SSH configuration
Command: ssh-keygen -t rsa -P ""
Move the key to the authorized keys file
Command: cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys
Check the secured shell login
Command: ssh localhost
Add JAVA_HOME directory in hadoop-env.sh file
Command: sudo gedit /usr/lib/hadoop/conf/hadoop-env.sh
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-i386
Creating namenode and datanode directories for hadoop
Command: sudo mkdir -p /usr/lib/hadoop/dfs/namenode
Command: sudo mkdir -p /usr/lib/hadoop/dfs/datanode
Configure core-site.xml
Command: sudo gedit /usr/lib/hadoop/conf/core-site.xml
<property>
<name>fs.default.name</name>
<value>hdfs://localhost:8020</value>
</property>
Configure hdfs-site.xml
Command: sudo gedit /usr/lib/hadoop/conf/hdfs-site.xml
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
<property>
<name>dfs.permissions</name>
<value>false</value>
</property>
<property>
<name>dfs.name.dir</name>
<value>/usr/lib/hadoop/dfs/namenode</value>
</property>
<property>
<name>dfs.data.dir</name>
<value>/usr/lib/hadoop/dfs/datanode</value>
</property>
Configure mapred-site.xml
Command: sudo gedit /usr/lib/hadoop/conf/mapred-site.xml
<property>
<name>mapred.job.tracker</name>
<value>localhost:8021</value>
</property>

Format the name node
Command: hadoop namenode -format
Start the namenode, datanode
Command: start-dfs.sh
Start the task tracker and job tracker
Command: start-mapred.sh
To check if Hadoop started correctly
Command: jps
Expected processes: NameNode, SecondaryNameNode, DataNode, JobTracker, TaskTracker
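The web-based monitoring tools referred to in this practical are the built-in daemon web UIs. With
the default Hadoop 1.x ports (an assumption, since the configuration above does not override them)
they are reachable at:
http://localhost:50070 – NameNode (HDFS health, live/dead datanodes, browse the filesystem)
http://localhost:50030 – JobTracker (running and completed MapReduce jobs)
http://localhost:50060 – TaskTracker status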

c) FULLY DISTRIBUTED MODE:


All the daemons, such as the NameNode and DataNodes, run on different machines. The data
is replicated across the DataNodes according to the replication factor. The secondary
namenode periodically stores mirror images of the namenode. The namenode holds the
metadata describing where the blocks are stored and the number of replicas. The slaves and
the master communicate with each other periodically. The configuration of a multi-node
cluster is given below:
Configure the hosts in all nodes/machines
Command: sudo gedit /etc/hosts
192.168.1.58 pcetcse1
192.168.1.4 pcetcse2
192.168.1.5 pcetcse3
192.168.1.7 pcetcse4
192.168.1.8 pcetcse5
Passwordless SSH configuration: create an SSH key on the namenode/master.
Command: ssh-keygen -t rsa -P ""
Copy the generated public key to all datanodes/slaves.
Command: ssh-copy-id -i ~/.ssh/id_rsa.pub huser@pcetcse2
Command: ssh-copy-id -i ~/.ssh/id_rsa.pub huser@pcetcse3
Command: ssh-copy-id -i ~/.ssh/id_rsa.pub huser@pcetcse4
Command: ssh-copy-id -i ~/.ssh/id_rsa.pub huser@pcetcse5
NOTE: Verify the passwordless SSH environment from the namenode to all datanodes as the
"huser" user.
Log in from the master node to each node
Command: ssh pcetcse1
Command: ssh pcetcse2
Command: ssh pcetcse3
Command: ssh pcetcse4
Command: ssh pcetcse5
Add the JAVA_HOME directory in the hadoop-env.sh file on all nodes/machines
Command: sudo gedit /usr/lib/hadoop/conf/hadoop-env.sh
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-i386
Creating the namenode directory on the namenode/master
Command: sudo mkdir -p /usr/lib/hadoop/dfs/namenode
Creating the datanode directory on the datanodes/slaves
Command: sudo mkdir -p /usr/lib/hadoop/dfs/datanode
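The file stops after creating the directories; a minimal sketch of the remaining multi-node steps,
assuming the same /usr/lib/hadoop layout and the hostnames listed above:
Command: sudo gedit /usr/lib/hadoop/conf/masters (contains the line: pcetcse1)
Command: sudo gedit /usr/lib/hadoop/conf/slaves (contains pcetcse2 through pcetcse5, one per line)
Copy core-site.xml, hdfs-site.xml and mapred-site.xml to every node, with fs.default.name and
mapred.job.tracker pointing at pcetcse1 instead of localhost. Then, from the master:
Command: hadoop namenode -format
Command: start-dfs.sh
Command: start-mapred.sh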

Practical – 2
Implement the following file management tasks in Hadoop:
i) Adding files and directories, retrieving and deleting files
ii) Benchmark and stress test an Apache Hadoop cluster.

Adding Files and Directories to HDFS


Before you can run Hadoop programs on data stored in HDFS, you'll need to put the data
into HDFS first.
Let's create a directory and put a file in it. HDFS has a default working directory of
/user/$USER, where $USER is your login username. This directory isn't automatically
created for you, though, so let's create it with the mkdir command. For the purpose of
illustration, we use hdusr; substitute your own user name in the example commands.
hadoop fs -mkdir /user/hdusr
hadoop fs -put example.txt /user/hdusr

Retrieving Files from HDFS

Hadoop can copy files from HDFS back to the local filesystem with -get, or print them to the
terminal with -cat. To read example.txt, we can run the following command:
hadoop fs -cat /user/hdusr/example.txt
Deleting Files from HDFS
hadoop fs -rm /user/hdusr/example.txt
The command for creating a directory in HDFS is "hdfs dfs -mkdir /newdir".
Adding a directory is done through the command "hdfs dfs -put new_dir /".
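
The benchmarking part of this task is not shown in the original commands; a minimal sketch using
the TestDFSIO benchmark shipped in the Hadoop 1.2.0 test jar (the jar location /usr/lib/hadoop and
the file counts/sizes are assumptions):
Command: hadoop jar /usr/lib/hadoop/hadoop-test-1.2.0.jar TestDFSIO -write -nrFiles 10 -fileSize 100
Command: hadoop jar /usr/lib/hadoop/hadoop-test-1.2.0.jar TestDFSIO -read -nrFiles 10 -fileSize 100
Command: hadoop jar /usr/lib/hadoop/hadoop-test-1.2.0.jar TestDFSIO -clean
The write and read runs report throughput and average I/O rate per file, which can be compared
across runs to stress test the cluster; -clean removes the generated test data from HDFS.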

Practical – 3
Run a basic Word Count MapReduce program to understand the MapReduce paradigm.
● Find the number of occurrences of each word appearing in the input file(s).
● Perform a MapReduce job for word search count (look for specific keywords in a file).

PROGRAM LOGIC:
WordCount is a simple program which counts the number of occurrences of each word in
a given text input data set. WordCount fits very well with the MapReduce programming
model, making it a great example for understanding the Hadoop Map/Reduce programming
style. Our implementation consists of three main parts:
1. Mapper
2. Reducer
3. Driver

● First open Eclipse -> then select File -> New -> Java Project -> name it
WordCount -> then Finish.
● Create 3 Java classes in the project. Name them WC_Runner (having the main
function), WC_Mapper and WC_Reducer, matching the code below.
You have to include two reference libraries for that:
● Right click on the project -> then select Build Path -> click on Configure Build Path.

In the Configure Build Path dialog, use the Add External JARs option on the right-hand side.
Click on it and add the files mentioned below.
You can find these files in /usr/lib/:
1. /usr/lib/hadoop-0.20-mapreduce/hadoop-core-2.6.0-mr1-cdh5.13.0.jar
2. /usr/lib/hadoop/hadoop-common-2.6.0-cdh5.13.0.jar

Step-1. Write a Mapper Code (WC_Mapper.java)

package org.example;

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class WC_Mapper extends MapReduceBase implements Mapper<LongWritable,Text,Text,IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    // Tokenize each input line and emit (word, 1) for every token.
    public void map(LongWritable key, Text value, OutputCollector<Text,IntWritable> output,
                    Reporter reporter) throws IOException {
        String line = value.toString();
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens()) {
            word.set(tokenizer.nextToken());
            output.collect(word, one);
        }
    }
}

Step-2. Write a Reducer Code (WC_Reducer.java)

package org.example;

import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class WC_Reducer extends MapReduceBase implements Reducer<Text,IntWritable,Text,IntWritable> {
    public void reduce(Text key, Iterator<IntWritable> values, OutputCollector<Text,IntWritable> output,
                       Reporter reporter) throws IOException {
        int sum = 0;
        while (values.hasNext()) {
            sum += values.next().get();
        }
        output.collect(key, new IntWritable(sum));
    }
}

Step-3: Write Driver Code (WC_Runner.java)

package org.example;

import java.io.IOException;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.TextInputFormat;
import org.apache.hadoop.mapred.TextOutputFormat;

public class WC_Runner {
    public static void main(String[] args) throws IOException {
        JobConf conf = new JobConf(WC_Runner.class);
        conf.setJobName("WordCount");
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);
        conf.setMapperClass(WC_Mapper.class);
        conf.setCombinerClass(WC_Reducer.class);
        conf.setReducerClass(WC_Reducer.class);
        conf.setInputFormat(TextInputFormat.class);
        conf.setOutputFormat(TextOutputFormat.class);
        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));
        JobClient.runJob(conf);
    }
}

Now you have to make a jar file. Right click on the project -> click on Export -> select the export
destination as JAR file -> name the jar file (WordCount.jar) -> click Next -> finally click
Finish. Now copy this file into the workspace directory of Cloudera.

Open the terminal on CDH and change the directory to the workspace. You can do this by
using the "cd workspace/" command. Now create a text file (WCFile.txt) and move it to
HDFS. Open the terminal and run the job (remember, you should be in the same
directory as the jar file you have just created).
After executing the job, you can see the result in the WCOutput directory or by writing the
following command on the terminal.
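
The exact commands are not reproduced in the file; a typical sequence, assuming WCFile.txt is in the
current directory and the output directory WCOutput does not already exist, would be:
hadoop fs -put WCFile.txt WCFile.txt
hadoop jar WordCount.jar org.example.WC_Runner WCFile.txt WCOutput
hadoop fs -cat WCOutput/part-00000
(part-00000 is the output file name produced by the old org.apache.hadoop.mapred API used above.)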

Practical – 4

Stop word elimination problem:

Input: A large textual file containing one sentence per line, and a small file containing a set of
stop words (one stop word per line).
Output: A textual file containing the word count of the large input file, without the words
appearing in the small file.

Procedure:
● Step 1: One block is processed by one mapper at a time. In the mapper, a developer
can specify his own business logic as per the requirements. In this manner, Map
runs on all the nodes of the cluster and processes the data blocks in parallel.
● Step 2: The output of the mapper, also known as intermediate output, is written to the
local disk. The mapper output is not stored on HDFS because it is temporary data, and
writing it to HDFS would create many unnecessary copies.
● Step 3: The output of the mapper is shuffled to the reducer node (a normal slave node,
called the reducer node because the reduce phase runs there). The shuffling/copying
is a physical movement of data over the network.
● Step 4: Once all the mappers have finished and their output has been shuffled to the
reducer nodes, this intermediate output is merged and sorted, and then provided as
input to the reduce phase.
● Step 5: Reduce is the second phase of processing, where the user can specify his
own custom business logic as per the requirements. The input to a reducer comes
from all the mappers. The output of the reducer is the final output, which is
written to HDFS.

TokenizerMapper.java

import java.io.IOException;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private static final Set<String> STOP_WORDS = new HashSet<>(Arrays.asList(
        "a", "an", "and", "are", "as", "at", "be", "but", "by", "for", "if",
        "in", "into", "is", "it", "no", "not", "of", "on", "or", "such", "that", "the",
        "their", "then", "there", "these", "they", "this", "to", "was", "will", "with"
    ));
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
        String[] words = value.toString().toLowerCase().replaceAll("[^a-zA-Z ]", "").split("\\s+");
        for (String w : words) {
            if (!STOP_WORDS.contains(w) && !w.isEmpty()) {
                word.set(w);
                context.write(word, one);
            }
        }
    }
}
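
The problem statement reads the stop words from a small separate file, while the mapper above
hard-codes them. A hedged sketch of a variant that loads them in setup() is given below; the
configuration key stopwords.path and the extra imports (org.apache.hadoop.fs.FileSystem,
org.apache.hadoop.fs.Path, java.io.BufferedReader, java.io.InputStreamReader) are assumptions and
not part of the original program:

// Hypothetical addition to TokenizerMapper: load stop words from the file whose HDFS
// path the driver sets with conf.set("stopwords.path", args[2]).
@Override
protected void setup(Context context) throws IOException {
    String path = context.getConfiguration().get("stopwords.path");
    if (path == null) return;                         // fall back to the built-in set
    FileSystem fs = FileSystem.get(context.getConfiguration());
    try (BufferedReader br = new BufferedReader(new InputStreamReader(fs.open(new Path(path))))) {
        String line;
        while ((line = br.readLine()) != null) {
            STOP_WORDS.add(line.trim().toLowerCase()); // one stop word per line
        }
    }
}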

StopWordElimination.java

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class StopWordElimination {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "stop word elimination");
        job.setJarByClass(StopWordElimination.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

IntSumReducer.java

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable val : values) {
            sum += val.get();
        }
        result.set(sum);
        context.write(key, result);
    }
}
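
The practical does not show how the job is built and launched; one possible sequence, assuming the
three source files are in the current directory and the input text file is already on HDFS (the paths
/stopword/input and /stopword/output are only illustrative):
mkdir classes
javac -classpath "$(hadoop classpath)" -d classes TokenizerMapper.java IntSumReducer.java StopWordElimination.java
jar -cvf StopWordElimination.jar -C classes/ .
hadoop jar StopWordElimination.jar StopWordElimination /stopword/input /stopword/output
hadoop fs -cat /stopword/output/part-r-00000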

Practical – 5

Write a MapReduce program that mines weather data. Weather sensors collecting data
every hour at many locations across the globe gather large volumes of log data, which is a
good candidate for analysis with MapReduce, since it is semi-structured and record-oriented.
1. Find the average, max and min temperature for each year in the NCDC data set.
2. Filter the readings of the set based on the value of the measurement; output the lines of the
input files associated with a temperature value greater than 30.0 and store them in a
separate file.

Step 1: Set up directories on the local system
cd ~/Desktop
mkdir weatherMining && cd weatherMining
mkdir demo_classes input_data
mv ~/Desktop/CRND0103-2020-AK_Fairbanks_11_NE.txt ~/Desktop/weatherMining/input_data/

Step 2: Create the Java program
nano WeatherAnalysis.java
# (Paste the provided Java code, then save and exit)

Step 3: Compile and package the Java code
export JAVA_HOME=/usr/java/default
export PATH=${JAVA_HOME}/bin:${PATH}
export HADOOP_CLASSPATH=${JAVA_HOME}/lib/tools.jar
export HADOOP_CLASSPATH=$(hadoop classpath)
javac -classpath "$HADOOP_CLASSPATH" -d demo_classes WeatherAnalysis.java
jar -cvf WeatherAnalysis.jar -C demo_classes/ .

Step 4: Upload the input file to HDFS
hadoop fs -mkdir /weatherMining
hadoop fs -mkdir /weatherMining/input
hadoop fs -put ~/Desktop/weatherMining/input_data/CRND0103-2020-AK_Fairbanks_11_NE.txt /weatherMining/input

Step 5: Run the MapReduce job
hadoop fs -rm -r /weatherMining/output
hadoop jar WeatherAnalysis.jar WeatherAnalysis /weatherMining/input /weatherMining/output

Step 6: View the output
hadoop fs -ls /weatherMining/output
hadoop fs -cat /weatherMining/output/part-r-00000

WeatherAnalysis.java
The mapper, reducer and driver below are nested static members of a single WeatherAnalysis class;
they were shown as separate fragments in the original file, but the job is compiled from one source file.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.FloatWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WeatherAnalysis {

    // Mapper: splits each line and extracts the date and temperature.
    public static class WeatherMapper extends Mapper<Object, Text, Text, FloatWritable> {
        private Text date = new Text();
        private FloatWritable temp = new FloatWritable();

        public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
            // Split the line by whitespace.
            String[] parts = value.toString().split("\\s+");
            // Check that there are enough tokens (we expect at least 7 fields).
            if (parts.length > 6) {
                // parts[1] is the date (YYYYMMDD)
                date.set(parts[1]);
                try {
                    // parts[5] is the temperature (e.g., -18.8)
                    float temperature = Float.parseFloat(parts[5]);
                    temp.set(temperature);
                    context.write(date, temp);
                } catch (NumberFormatException e) {
                    // Skip this record if the temperature is not a valid float.
                }
            }
        }
    }

    // Reducer: emits the maximum temperature observed for each key.
    public static class WeatherReducer extends Reducer<Text, FloatWritable, Text, FloatWritable> {
        public void reduce(Text key, Iterable<FloatWritable> values, Context context)
                throws IOException, InterruptedException {
            float maxTemp = Float.NEGATIVE_INFINITY;
            for (FloatWritable val : values) {
                if (val.get() > maxTemp) {
                    maxTemp = val.get();
                }
            }
            context.write(key, new FloatWritable(maxTemp));
        }
    }

    // Driver: configures and submits the job.
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "Weather Data Analysis");
        job.setJarByClass(WeatherAnalysis.class);
        job.setMapperClass(WeatherMapper.class);
        job.setReducerClass(WeatherReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(FloatWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
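
The listing above only reports the maximum temperature per date; a hedged sketch of the remaining
requirements (per-year average/minimum/maximum and the > 30.0 filter), assuming the same input layout:
● Key the mapper output on the year rather than the full date: date.set(parts[1].substring(0, 4));
● In the reducer, track the minimum, maximum and a running sum/count and emit all three, e.g.
float min = Float.POSITIVE_INFINITY, max = Float.NEGATIVE_INFINITY, sum = 0; int n = 0;
for (FloatWritable v : values) { float t = v.get(); min = Math.min(min, t); max = Math.max(max, t); sum += t; n++; }
context.write(key, new Text("min=" + min + " max=" + max + " avg=" + (sum / n)));
which also means changing the reducer's output value type (and job.setOutputValueClass) to Text and
calling job.setMapOutputValueClass(FloatWritable.class).
● For the filter, run a separate map-only job (job.setNumReduceTasks(0)) whose mapper writes the
entire input line to the output when Float.parseFloat(parts[5]) > 30.0f, producing the separate file
of readings above 30.0.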
