BD Lab File
Practical no. 01
Aim: Set up and install Hadoop in its two operating modes:
➔ Pseudo distributed
➔ Fully distributed
Installation of Hadoop:
Hadoop software can be installed in three modes of operation:
● Stand-Alone Mode: Hadoop is distributed software and is designed to run on a cluster of
commodity machines. However, we can install it on a single node in stand-alone mode. In this mode, the
Hadoop software runs as a single monolithic Java process. This mode is extremely useful for debugging
purposes. You can first test-run your MapReduce application in this mode on small data, before
actually executing it on a cluster with big data.
● Pseudo-Distributed Mode: In this mode also, the Hadoop software is installed on a single node.
The various daemons of Hadoop run on the same machine as separate Java processes. Hence, all
the daemons, namely NameNode, DataNode, SecondaryNameNode, JobTracker, and TaskTracker, run
on a single machine (a minimal configuration sketch follows this list).
● Fully Distributed Mode: In fully distributed mode, the daemons NameNode, JobTracker, and
SecondaryNameNode (optional, and can be run on a separate node) run on the master node.
The daemons DataNode and TaskTracker run on the slave nodes.
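For the pseudo-distributed mode named in the aim, the stand-alone installation described below is switched over by editing two configuration files. A minimal sketch, assuming the standard single-node property values and the /usr/local/hadoop install location used later in this guide:
In /usr/local/hadoop/etc/hadoop/core-site.xml:
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>
In /usr/local/hadoop/etc/hadoop/hdfs-site.xml:
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>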
1. Installing Java
To get started, you’ll update the package list and install OpenJDK, the default Java
Development Kit on Ubuntu 20.04:
$ sudo apt update
$ sudo apt install default-jdk
Once the installation is complete, check the version:
$ java -version
2. Installing Hadoop
With Java in place, you’ll visit the Apache Hadoop Releases page to find the most recent
stable release.
Navigate to the binary for the release you’d like to install. In this guide you’ll install Hadoop
3.3.1, but you can substitute the version numbers with one of your choice.
On the next page, right-click and copy the link to the release binary.
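Then download and unpack the release. A sketch, where the URL stands in for the link you copied and the target directory matches the rest of this guide:
$ wget https://downloads.apache.org/hadoop/common/hadoop-3.3.1/hadoop-3.3.1.tar.gz
$ tar -xzvf hadoop-3.3.1.tar.gz
$ sudo mv hadoop-3.3.1 /usr/local/hadoop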
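3. Configuring Hadoop’s Java Home
Hadoop needs to know which Java installation to use, which is set in hadoop-env.sh. A minimal sketch of this step (the JAVA_HOME value shown is illustrative; point it at your own JDK location):
$ sudo nano /usr/local/hadoop/etc/hadoop/hadoop-env.sh
Inside the file, find the commented JAVA_HOME lines and set, for example:
export JAVA_HOME=/usr/lib/jvm/default-java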
If you have trouble finding these lines, use CTRL+W to quickly search through the text.
Once you’re done, exit with CTRL+X and save your file.
4. Running Hadoop
Now you should be able to run Hadoop:
$ /usr/local/hadoop/bin/hadoop
If the output lists Hadoop’s usage options and available commands, you’ve successfully configured Hadoop to run in stand-alone mode.
You’ll ensure that Hadoop is functioning properly by running the example MapReduce
program it ships with. To do so, create a directory called input in your home directory and
copy Hadoop’s configuration files into it to use those files as your data:
$ mkdir ~/input
$ cp /usr/local/hadoop/etc/hadoop/*.xml ~/input
Next, you can use the following command to run the MapReduce
hadoop-mapreduce-examples program, a Java archive with several options:
$ /usr/local/hadoop/bin/hadoop jar /usr/local/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.3.1.jar grep ~/input ~/grep_example 'allowed[.]*'
This invokes the grep program, one of the many examples included in
hadoop-mapreduce-examples, followed by the input directory input and the output directory
grep_example. The MapReduce grep program counts the matches of a literal word or
regular expression. Finally, the regular expression allowed[.]* is given to find
occurrences of the word allowed within or at the end of a declarative sentence. The
expression is case-sensitive, so you wouldn’t find the word if it were capitalized at the
beginning of a sentence.
When the task completes, it provides a summary of what has been processed and any errors it
has encountered, but this doesn’t contain the actual results.
Results are stored in the output directory and can be checked by running cat on the output
directory:
$ cat ~/grep_example/*
The MapReduce task found 19 occurrences of the word allowed followed by a period and
one occurrence where it was not. Running the example program has verified that the
stand-alone installation is working properly and that non-privileged users on the system
can run Hadoop for exploration or debugging.
Conclusion
In this practical, you’ve installed Hadoop in stand-alone mode and verified it by running an
example program it provided. To learn how to write your own MapReduce programs, you
might want to visit Apache Hadoop’s MapReduce tutorial which walks through the code behind
the example. When you’re ready to set up a cluster, see the Apache Foundation Hadoop Cluster
Setup guide.
Practical no. 02
Aim: Use web-based tools to monitor your Hadoop setup.
Procedure:
1. Open a supported web browser.
2. Enter the Ambari web URL in the browser address bar.
http://[YOUR_AMBARI_SERVER_FQDN]:8080
3. Type your user name and password on the Sign in page. If you are an Ambari administrator
accessing the Ambari Web UI for the first time, use the default Ambari administrator credentials:
admin/admin
4. Click Sign in. If the Ambari server is stopped, you can restart it from the command line.
5. If necessary, start the Ambari server on the Ambari server host machine with ambari-server start.
Typically, you start the Ambari server and Ambari Web as part of the installation process; a sketch of the commands follows below.
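A minimal sketch of checking and starting the server, assuming a default Ambari installation on the server host:
$ sudo ambari-server status
$ sudo ambari-server start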
In Ambari Web, click Dashboard to view the operating status of your cluster. Dashboard includes three options:
● Metrics
● Heatmaps
● Config History
The Metrics option displays by default. On the Metrics page, multiple widgets represent operating status
information of services in your cluster. Most widgets display a single metric; for example, HDFS Disk Usage
is represented by a load chart and a percentage figure.
Practical no. 03
Aim: Implement the following file management tasks in Hadoop:
➔ Adding files and directories
➔ Retrieving files
➔ Deleting files
File Management Tasks in Hadoop
1. Create a directory in HDFS at the given path(s).
Usage:
$ hadoop fs -mkdir <paths>
2. List the contents of a directory.
Usage:
$ hadoop fs -ls <args>
3. Upload: copy a single src file, or multiple src files, from the local file system to the Hadoop distributed file system.
Usage:
$ hadoop fs -put <localsrc> ... <dst>
4. Download: hadoop fs -get copies files from HDFS to the local file system.
Usage:
$ hadoop fs -get <src> <localdst>
5. hadoop fs -cp: copy files from source to destination within HDFS. This command allows multiple sources as well, in which case the destination must be a directory.
Usage:
$ hadoop fs -cp <src> ... <dst>
6. copyFromLocal: similar to the put command, except that the source is restricted to a local file reference.
Usage:
$ hadoop fs -copyFromLocal <localsrc> <dst>
7. copyToLocal: similar to the get command, except that the destination is restricted to a local file reference.
Usage:
$ hadoop fs -copyToLocal <src> <localdst>
8. Delete files from HDFS.
Usage:
$ hadoop fs -rm <path>
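A short end-to-end session tying the three tasks together; the user name, paths, and file names are illustrative:
$ hadoop fs -mkdir /user/hduser/lab
$ hadoop fs -put ~/sample.txt /user/hduser/lab/
$ hadoop fs -ls /user/hduser/lab
$ hadoop fs -get /user/hduser/lab/sample.txt ~/retrieved.txt
$ hadoop fs -rm /user/hduser/lab/sample.txt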
Practical no. 04
Aim: Run a basic Word Count MapReduce program to understand the MapReduce paradigm:
➔ Find the number of occurrences of each word appearing in the input file(s)
➔ Perform a MapReduce job for word search count (look for specific keywords in a file)
Code:
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.fs.Path;
Mapper code:
public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String line = value.toString();
        StringTokenizer tokenizer = new StringTokenizer(line);
        // Emit (word, 1) for every token in the input line
        while (tokenizer.hasMoreTokens()) {
            value.set(tokenizer.nextToken());
            context.write(value, new IntWritable(1));
        }
    }
}
Reducer Code:
public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        // Sum the counts emitted by the mappers for this word
        for (IntWritable x : values) {
            sum += x.get();
        }
        context.write(key, new IntWritable(sum));
    }
}
Driver Code:
public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "My Word Count Program");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(Map.class);
    job.setReducerClass(Reduce.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    job.setInputFormatClass(TextInputFormat.class);
    job.setOutputFormatClass(TextOutputFormat.class);
    // Configuring the input/output paths from the filesystem into the job
    FileInputFormat.addInputPath(job, new Path(args[0]));
    Path outputPath = new Path(args[1]);
    FileOutputFormat.setOutputPath(job, outputPath);
    // Submit the job and exit when it finishes
    System.exit(job.waitForCompletion(true) ? 0 : 1);
}
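To build and run the job, the fragments above are assembled into a single public class WordCount (imports at the top, Map and Reduce as nested classes, the driver code inside main). A sketch of the build-and-run commands, with the jar name and HDFS paths illustrative:
$ mkdir classes
$ javac -classpath `hadoop classpath` -d classes WordCount.java
$ jar -cvf wordcount.jar -C classes/ .
$ hadoop jar wordcount.jar WordCount /user/hduser/input /user/hduser/output
$ hadoop fs -cat /user/hduser/output/part-r-00000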
Practical no. 05
Aim: Stop-word elimination problem. Input:
➔ A large textual file containing one sentence per line
➔ A small file containing a set of stop words (one stop word per line)
Output:
A textual file containing the same sentences of the large input file without the
words appearing in the small file.
Code:
Mapper code (SkipMapper.java):
package com.hadoop.skipper;

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashSet;
import java.util.Set;
import java.util.StringTokenizer;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class SkipMapper extends Mapper<LongWritable, Text, Text, NullWritable> {

    private Text word = new Text();
    private Set<String> stopWordList = new HashSet<String>();

    /** (non-Javadoc)
     * @see org.apache.hadoop.mapreduce.Mapper#setup(org.apache.hadoop.mapreduce.Mapper.Context)
     */
    @SuppressWarnings("deprecation")
    @Override
    protected void setup(Context context) throws IOException, InterruptedException {
        try {
            // Stop-word files are shipped to every node via the DistributedCache
            Path[] stopWordFiles = context.getLocalCacheFiles();
            if (stopWordFiles != null && stopWordFiles.length > 0) {
                for (Path stopWordFile : stopWordFiles) {
                    readStopWordFile(stopWordFile);
                }
            }
        } catch (IOException e) {
            System.err.println("Exception while reading stop word files: " + e.toString());
        }
    }

    /*
     * Method to read the stop word file and get the stop words
     */
    private void readStopWordFile(Path stopWordFile) {
        try {
            BufferedReader fis = new BufferedReader(new FileReader(stopWordFile.toString()));
            String stopWord = null;
            while ((stopWord = fis.readLine()) != null) {
                stopWordList.add(stopWord);
            }
            fis.close();
        } catch (IOException ioe) {
            System.err.println("Exception while reading stop word file '" + stopWordFile + "' : " + ioe.toString());
        }
    }

    /*
     * (non-Javadoc)
     * @see org.apache.hadoop.mapreduce.Mapper#map(KEYIN, VALUEIN, org.apache.hadoop.mapreduce.Mapper.Context)
     */
    @Override
    public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        String line = value.toString();
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens()) {
            String token = tokenizer.nextToken();
            if (stopWordList.contains(token)) {
                context.getCounter(StopWordSkipper.COUNTERS.STOPWORDS).increment(1L);
            } else {
                context.getCounter(StopWordSkipper.COUNTERS.GOODWORDS).increment(1L);
                word.set(token);
                context.write(word, NullWritable.get());
            }
        }
    }
}
Driver code (StopWordSkipper.java):
package com.hadoop.skipper;

import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;

public class StopWordSkipper {

    // Counters updated by the mapper
    public enum COUNTERS {
        STOPWORDS,
        GOODWORDS
    }

    @SuppressWarnings("deprecation")
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        GenericOptionsParser parser = new GenericOptionsParser(conf, args);
        args = parser.getRemainingArgs();

        Job job = new Job(conf, "StopWordSkipper");
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(NullWritable.class);
        job.setJarByClass(StopWordSkipper.class);
        job.setMapperClass(SkipMapper.class);
        job.setNumReduceTasks(0); // map-only job: no reducer is needed
        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);

        List<String> other_args = new ArrayList<String>();

        // The -skip option registers the stop-word file in the DistributedCache;
        // all remaining arguments are treated as input/output paths
        for (int i = 0; i < args.length; i++) {
            if ("-skip".equals(args[i])) {
                DistributedCache.addCacheFile(new Path(args[++i]).toUri(), job.getConfiguration());
            } else {
                other_args.add(args[i]);
            }
        }

        FileInputFormat.setInputPaths(job, new Path(other_args.get(0)));
        FileOutputFormat.setOutputPath(job, new Path(other_args.get(1)));

        job.waitForCompletion(true);
    }
}
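The job is packaged and launched like any other MapReduce program; a sketch, where the jar name and HDFS paths are illustrative:
$ hadoop jar skipper.jar com.hadoop.skipper.StopWordSkipper -skip /user/hduser/stopwords.txt /user/hduser/input /user/hduser/output
$ hadoop fs -cat /user/hduser/output/part-m-00000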
Practical no. 06
Aim: Use various mathematical functions on the console in R.
Software Requirement: RStudio
Theory:
RStudio is an integrated development environment for R, a programming language for statistical computing and
graphics. It is available in two formats: RStudio Desktop is a regular desktop application while RStudio Server runs
on a remote server and allows accessing RStudio using a web browser.
● Addition:
a<-21
b<-34
c<-a+b
print(c)
● Subtraction:
a<-21
b<-34
c<-b-a
print(c)
● Multiplication:
a<-21
b<-34
c<-a*b
print(c)
● Division:
a<-210
b<-21
c<-a/b
print(c)
● Maximum:
max(32,35,45,56,23)
● Ceiling:
ceiling(15.6)
● Floor:
floor(45.1)
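A few more functions in the same spirit can be tried directly on the console; a brief sketch with arbitrary argument values:
min(32,35,45,56,23)
sqrt(64)
abs(-23.4)
round(15.567, digits=2)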
In conclusion, various mathematical functions on the console in R have been successfully implemented and
verified.
Practical no. 07
Aim: Write an R script to create R objects for the calculator application and save
them in a specified location on disk.
Software Requirement: RStudio
Theory:
R can be used as a powerful calculator by entering equations directly at the prompt in the command console. Simply
type your arithmetic expression and press ENTER. R will evaluate the expressions and respond with the result.
While this is a simple interaction interface, there could be problems if you are not careful. R will normally execute
your arithmetic expression by evaluating each item from left to right, but some operators have precedence in the
order of evaluation.
+ Addition
- Subtraction
* Multiplication
/ Division
Script:
operator = function(num1, num2, operat){
  if(operat == "+") return(num1 + num2)
  else if(operat == "-") return(num1 - num2)
  else if(operat == "*") return(num1 * num2)
  else if(operat == "/") return(num1 / num2)
  else return("invalid operat")
}
num1 = 20; num2 = 10; operat = "+"   # sample inputs (values are illustrative)
result = operator(num1,num2,operat)
print(paste("result", result))
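The aim also asks for the R objects to be saved at a specified location on disk; a minimal sketch using save() and load(), where the path is illustrative:
# save the calculator function and its inputs to disk
save(operator, num1, num2, operat, file = "D:/RScripts/calculator.RData")
# restore the objects in a later session
load("D:/RScripts/calculator.RData")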
Output:
In conclusion, the R script to create R objects for the calculator application and save them in a specified location
on disk has been successfully implemented and verified.
Practical no. 08
Aim: Write an R script to find basic descriptive statistics using the summary(), str(), and
quantile() functions on the mtcars & cars datasets.
Software Requirement: RStudio
Theory:
Summary (or descriptive) statistics are the first figures used to represent nearly every dataset. They also form the
foundation for much more complicated computations and analyses. Thus, in spite of being composed of simple
methods, they are essential to the analysis process.
A data set is a collection of related, discrete items of associated data that may be accessed individually, in
combination, or managed as a whole entity. A data set is organized into some type of data structure.
Script:
data("mtcars"): This command is used to import the mtcars dataset, which contains fuel consumption and design data for 32 car models.
summary(mtcars): This command gives a statistical summary (minimum, quartiles, median, mean, maximum) of each column in mtcars.
dim(mtcars): This command shows how many rows and columns are present in the dataset.
names(mtcars): This command lists the column names of the dataset.
quantile(cars$speed): This command returns the quartiles (0%, 25%, 50%, 75%, 100%) of the speed column. Note that speed belongs to the cars dataset; for mtcars, a numeric column such as mtcars$mpg would be used instead.
str(mtcars): This command displays the structure of the dataset, showing the type and first values of each column.
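The same functions apply to the cars dataset named in the aim; a brief sketch:
data("cars")          # 50 observations of speed and stopping distance
summary(cars)
str(cars)
quantile(cars$speed)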
In conclusion, the R script to find basic descriptive statistics using the summary, str, and quantile functions on the
mtcars & cars datasets has been successfully implemented and verified.
Practical no. 09
Aim: Write an R script to find a subset of a dataset by using the subset() and
aggregate() functions on the iris dataset.
Objective:
● Reading different types of datasets (.txt, .csv) from the web and disk, and writing to files in
specific disk locations.
● Reading an Excel datasheet in R.
● Reading an XML dataset in R.
Theory:
When a program is terminated, the entire data is lost. Storing in a file will preserve our data even if the program
terminates. If we have to enter a large number of data, it will take a lot of time to enter them all. However, if we
have a file containing all the data, we can easily access the contents of the file using a few commands in R. You can
easily move your data from one computer to another without any changes. Those files can be stored in various
formats: a .txt (tab-separated values) file, a tabular .csv (comma-separated values) file, or a location on the
internet or cloud. R provides very easy methods to read these files.
Find a subset of a dataset by using the subset() and aggregate() functions on the iris dataset:
data(iris)
subset(iris): called with no filtering condition, subset() simply returns the full dataset; a filtering example follows below.
aggregate(iris): aggregate() must be told how to group the data, so it is normally called with a grouping variable or formula, as in the sketch below.
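A working sketch of both functions on iris; the filter threshold of 6 is arbitrary:
# rows whose sepal length exceeds 6, keeping selected columns
subset(iris, Sepal.Length > 6, select = c(Sepal.Length, Sepal.Width, Species))
# mean sepal length and width per species, via the formula interface
aggregate(cbind(Sepal.Length, Sepal.Width) ~ Species, data = iris, FUN = mean)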
Script for reading a .csv dataset from the web and disk and writing to files:
You can check which directory the R workspace is pointing to using the getwd() function. You can also set a new
working directory using the setwd() function, as in the sketch below.
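A short sketch of reading and writing .csv files from disk and the web; the directory, file names, and URL are illustrative:
setwd("D:/RData")                                    # choose a working directory
mydata <- read.csv("sample.csv")                     # read a .csv from disk
webdata <- read.csv("https://example.com/data.csv")  # read a .csv from the web
write.csv(mydata, "copy_of_sample.csv")              # write a file back to disk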
install.packages("XML")
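With the package installed, an XML file can be parsed and flattened into a data frame; a minimal sketch with an illustrative file name:
library(XML)
doc <- xmlParse("records.xml")   # parse the XML document
df <- xmlToDataFrame(doc)        # convert the records to a data frame
print(df)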
View(import): opens the imported data in a spreadsheet-style viewer (note the capital V; R is case-sensitive).
print(is.data.frame(import)): This is used to verify that the dataset fetched from the .xlsm file is a data frame.
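The import object used above comes from reading the Excel sheet; a minimal sketch, assuming the readxl package (one common choice) and an illustrative file name:
library(readxl)
import <- read_excel("records.xlsm")   # read the Excel sheet into a data frame
View(import)
print(is.data.frame(import))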
print(amount)
In conclusion, we use the aggregate() function to calculate the mean sepal length and width for each species in the
original iris dataset. The formula interface of aggregate() specifies which variables to aggregate (Sepal.Length and
Sepal.Width) and how to group the data (by the Species variable), with mean() as the aggregation function. The
resulting output shows the mean sepal length and width for each species in the iris dataset.
In conclusion, reading different file types (.txt, .csv, .xlsx, .xml) as datasets has been successfully implemented
and verified using the commands and scripts shown above in the R language.
Practical no. 10
Aim: Visualization using R packages.
A. Find the data distribution using box and scatter plots.
B. Find the outliers using plots.
Theory:
Installation of the package and histogram presentation:
install.packages("ggplot2")
library(ggplot2)
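A minimal histogram sketch on the built-in mtcars dataset; the binwidth value is arbitrary:
ggplot(mtcars, aes(x = mpg)) +
  geom_histogram(binwidth = 2)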
Box Plot:
mtcars$cyl = factor(mtcars$cyl)
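With cyl converted to a factor, a box plot shows the mpg distribution per cylinder count; points drawn beyond the whiskers are the outliers the aim asks about. A minimal sketch:
ggplot(mtcars, aes(x = cyl, y = mpg)) +
  geom_boxplot()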
Scatter Plot:
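A scatter plot sketch of car weight against mileage; isolated points far from the main trend suggest potential outliers:
ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point()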