BD Lab File


Department of Information Technology

Practical no. 01
Aim: Set up and install Hadoop in its two operating modes:
➔ Pseudo-distributed
➔ Fully distributed
Installation of Hadoop:
Hadoop software can be installed in three modes of operation:
● Stand-Alone Mode: Hadoop is distributed software and is designed to run on a cluster of commodity
machines. However, we can install it on a single node in stand-alone mode. In this mode, Hadoop
runs as a single monolithic Java process. This mode is extremely useful for debugging
purposes. You can first test-run your MapReduce application in this mode on small data, before
actually executing it on a cluster with big data.
● Pseudo-Distributed Mode: In this mode too, Hadoop is installed on a single node, but the
various Hadoop daemons run on that machine as separate Java processes. Hence, all
the daemons, namely NameNode, DataNode, SecondaryNameNode, JobTracker and TaskTracker, run
on a single machine.
● Fully Distributed Mode: In fully distributed mode, the daemons NameNode, JobTracker and
SecondaryNameNode (optional, and can be run on a separate node) run on the master node,
while the DataNode and TaskTracker daemons run on the slave nodes.

Hadoop Installation: Ubuntu operating system in stand-alone mode


Steps for Installation

1. Installing Java

To get started, you’ll update your package list and install OpenJDK, the default Java
Development Kit on Ubuntu 20.04:
$ sudo apt update
$ sudo apt install default-jdk
Once the installation is complete, let’s check the version.
$ java -version
2. Installing Hadoop
With Java in place, you’ll visit the Apache Hadoop Releases page to find the most recent
stable release.
Navigate to the binary for the release you’d like to install. In this guide you’ll install Hadoop
3.3.1, but you can substitute the version numbers with a release of your choice.


On the next page, right-click and copy the link to the release binary.

On the server, you’ll use wget to fetch it:


$ wget https://dlcdn.apache.org/hadoop/common/hadoop-3.3.1/hadoop-3.3.1.tar.gz
In order to make sure that the file you downloaded hasn’t been altered, you’ll do a quick
check using SHA-512, or the Secure Hash Algorithm 512. Return to the releases page,
then right-click and copy the link to the checksum file for the release binary you
downloaded:

Again, you’ll use wget on the server to download the file:


$ wget https://downloads.apache.org/hadoop/common/hadoop-3.3.1/hadoop-3.3.1.tar.gz.sha512


Then run the verification:


$ shasum -a 512 hadoop-3.3.1.tar.gz
Compare this value with the SHA-512 value in the .sha512 file:
$ cat hadoop-3.3.1.tar.gz.sha512
The output of the command you ran against the file you downloaded from the mirror
should match the value in the file you downloaded from apache.org.
Now that you’ve verified that the file wasn’t corrupted or changed, you can extract it:
$ tar -xzvf hadoop-3.3.1.tar.gz
Use the tar command with the -x flag to extract, -z to uncompress, -v for verbose output,
and -f to specify that you’re extracting from a file.
Finally, you’ll move the extracted files into /usr/local, the appropriate place for
locally installed software:
$ sudo mv hadoop-3.3.1 /usr/local/hadoop
3. Configuring Hadoop’s Java Home
Hadoop requires that you set the path to Java, either as an environment variable or in the
Hadoop configuration file.
The path to Java, /usr/bin/java, is a symlink to /etc/alternatives/java, which is in turn a symlink
to the default Java binary. You will use readlink with the -f flag to follow every symlink in
every part of the path, recursively. Then, you’ll use sed to trim bin/java from the output to
give you the correct value for JAVA_HOME.
To find the default Java path, run:
$ readlink -f /usr/bin/java | sed "s:bin/java::"
You can copy this output to set Hadoop’s Java home to this specific version, which
ensures that if the default Java changes, this value will not. Alternatively, you can use the
readlink command dynamically in the file so that Hadoop will automatically use whatever
Java version is set as the system default.
To begin, open hadoop-env.sh:
$ sudo nano /usr/local/hadoop/etc/hadoop/hadoop-env.sh
Then, modify the file by choosing one of the following options:
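Both approaches edit the export JAVA_HOME line in hadoop-env.sh. A minimal sketch of the two options follows; the exact OpenJDK directory in Option 1 is an assumption (replace it with the output of the readlink command above):

# Option 1: set a static value (typical path for default-jdk on Ubuntu 20.04; verify with readlink)
export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64/

# Option 2: use readlink dynamically so Hadoop follows whatever Java the system default points to
export JAVA_HOME=$(readlink -f /usr/bin/java | sed "s:bin/java::")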


If you have trouble finding these lines, use CTRL+W to quickly search through the text.
Once you’re done, exit with CTRL+X and save your file.
4. Running Hadoop
Now you should be able to run Hadoop:
$ /usr/local/hadoop/bin/hadoop

Running this command prints Hadoop's usage help; this output means you’ve successfully configured Hadoop to run in stand-alone mode.


You’ll ensure that Hadoop is functioning properly by running the example MapReduce
program it ships with. To do so, create a directory called input in your home directory and
copy Hadoop’s configuration files into it to use those files as your data:
$ mkdir ~/input
$ cp /usr/local/hadoop/etc/hadoop/*.xml ~/input
Next, you can use the following command to run the hadoop-mapreduce-examples
program, a Java archive with several options:

$ /usr/local/hadoop/bin/hadoop jar /usr/local/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.3.1.jar grep ~/input ~/grep_example 'allowed[.]*'
This invokes the grep program, one of the many examples included in
hadoop-mapreduce-examples, followed by the input directory, input and the output directory
grep_example. The MapReduce grep program will count the matches of a literal word or
regular expression. Finally, the regular expression allowed[.]* is given to find
occurrences of the word allowed within or at the end of a declarative sentence. The
expression is
case-sensitive, so you wouldn’t find the word if it were capitalized at the beginning of
a sentence.
When the task completes, it provides a summary of what has been processed and errors it
has encountered, but this doesn’t contain the actual results.
Results are stored in the output directory and can be checked by running cat on the output
directory:
$ cat ~/grep_example/*
The MapReduce task found 19 occurrences of the word allowed followed by a period and
one occurrence where it was not. Running the example program has verified that your
stand-alone installation is working properly and that non-privileged users on the system
can run Hadoop for exploration or debugging.

Conclusion
In this practical, you’ve installed Hadoop in stand-alone mode and verified it by running an
example program it provided. To learn how to write your own MapReduce programs, you
might want to visit Apache Hadoop’s MapReduce tutorial which walks through the code behind
the example. When you’re ready to set up a cluster, see the Apache Foundation Hadoop Cluster
Setup guide.


Practical no. 02
Aim: Use web-based tools to monitor your Hadoop setup.
Procedure:
1. Open a supported web browser.
2. Enter the Ambari web URL in the browser address bar.
http://[YOUR_AMBARI_SERVER_FQDN]:8080

3. Type your user name and password on the Sign In page. If you are an Ambari administrator
accessing the Ambari Web UI for the first time, use the default Ambari administrator credentials:
admin/admin.
4. Click Sign In. If the Ambari server is stopped, you can restart it from the command line, as shown below.
5. If necessary, start the Ambari server on the Ambari server host machine with ambari-server start.
Typically, you start the Ambari server and Ambari Web as part of the installation process.

Access Ambari Admin page


Only an Ambari administrator can access the Ambari Admin page from Ambari Web. The Ambari Admin
page supports tasks such as creating a cluster, managing users, groups, roles, and permissions, and
managing stack versions.

Working with the cluster dashboard


Use Dashboard to view metrics and configuration history for your cluster. You monitor your Hadoop
cluster using Ambari Web. Dashboard provides Metrics and Heatmaps visualizations and cluster
configuration history options. You access the cluster dashboard by clicking Dashboard at the top left of
the Ambari Web UI main window.


In Ambari Web, click Dashboard to view the operating status of your cluster. Dashboard includes three options:
● Metrics
● Heatmaps
● Config History
The Metrics option displays by default. On the Metrics page, multiple widgets represent operating status
information of the services in your cluster. Most widgets display a single metric; for example, HDFS Disk
Usage is represented by a load chart and a percentage figure.


Practical no. 03
Aim: Implement the following file management tasks in Hadoop:
➔ Adding files and directories
➔ Retrieving files
➔ Deleting files
File Management tasks in Hadoop
1. Create a directory in HDFS at the given path(s).

Usage:
hadoop fs -mkdir <paths>

Example:
hadoop fs -mkdir /user/saurzcode/dir1 /user/saurzcode/dir2

2. List the contents of a directory.

Usage:
hadoop fs -ls <args>

Example:
hadoop fs -ls /user/saurzcode

3. Upload and download a file in HDFS.

Upload: hadoop fs -put
Copies single or multiple source files from the local file system to the Hadoop file system.

Usage:
hadoop fs -put <localsrc> ... <HDFS_dest_Path>

Example:
hadoop fs -put /home/saurzcode/Samplefile.txt /user/saurzcode/dir3/

Download: hadoop fs -get
Copies/downloads files from HDFS to the local file system.

Usage:
hadoop fs -get <hdfs_src> <localdst>

Example:
hadoop fs -get /user/saurzcode/dir3/Samplefile.txt /home/

4. See the contents of a file

Same as the Unix cat command.

Usage:

hadoop fs -cat <path[filename]>

Example:
hadoop fs -cat /user/saurzcode/dir1/abc.txt

5. Copy a file from source to destination

This command allows multiple sources as well, in which case the destination must be a directory.

Usage:
hadoop fs -cp <source> <dest>

Example:
hadoop fs -cp /user/saurzcode/dir1/abc.txt /user/saurzcode/dir2

6. Copy a file from/to the local file system to/from HDFS

copyFromLocal

Usage:
hadoop fs -copyFromLocal <localsrc> URI

Example:
hadoop fs -copyFromLocal /home/saurzcode/abc.txt /user/saurzcode/abc.txt

Similar to the put command, except that the source is restricted to a local file reference.

copyToLocal

Usage:
hadoop fs -copyToLocal [-ignorecrc] [-crc] URI <localdst>

Similar to the get command, except that the destination is restricted to a local file reference.

7. Move a file from source to destination.

Note: Moving files across filesystems is not permitted.

Usage:
hadoop fs -mv <src> <dest>

Example:
hadoop fs -mv /user/saurzcode/dir1/abc.txt /user/saurzcode/dir2

8. Remove a file or directory in HDFS.

Removes the files specified as arguments. Deletes a directory only when it is empty.

Usage:
hadoop fs -rm <arg>

Example:
hadoop fs -rm /user/saurzcode/dir1/abc.txt


Recursive version of delete.

Usage:
hadoop fs -rmr <arg>

Example:
hadoop fs -rmr /user/saurzcode/

9. Display last few lines of a file.

Similar to the tail command in Unix.

Usage:
hadoop fs -tail <path[filename]>

Example:
hadoop fs -tail /user/saurzcode/dir1/abc.txt

10. Display the aggregate length of a file.

Usage:
hadoop fs -du <path>

Example:
hadoop fs -du /user/saurzcode/dir1/abc.txt
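As a quick end-to-end check, the commands above can be combined; a minimal sketch (the directory name dir_demo is hypothetical, and the file names reuse the example paths above):

hadoop fs -mkdir /user/saurzcode/dir_demo                              # create a directory
hadoop fs -put /home/saurzcode/Samplefile.txt /user/saurzcode/dir_demo/   # upload a local file
hadoop fs -ls /user/saurzcode/dir_demo                                 # confirm it is there
hadoop fs -cat /user/saurzcode/dir_demo/Samplefile.txt                 # read it back
hadoop fs -rm /user/saurzcode/dir_demo/Samplefile.txt                  # clean up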


Practical no. 04
Aim: Run a basic Word Count MapReduce program to understand the MapReduce
paradigm:
➔ Find the number of occurrences of each word appearing in the input file(s)
➔ Perform a MapReduce job for word search count (look for specific
keywords in a file)
Code:
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.fs.Path;

public class WordCount


{
public static class Map extends Mapper<LongWritable,Text,Text,IntWritable> {
public void map(LongWritable key, Text value,Context context) throws
IOException,InterruptedException{
String line = value.toString();
StringTokenizer tokenizer = new StringTokenizer(line);
while (tokenizer.hasMoreTokens()) {
value.set(tokenizer.nextToken());
context.write(value, new IntWritable(1));


}
}
}
public static class Reduce extends Reducer<Text,IntWritable,Text,IntWritable> {

public void reduce(Text key, Iterable<IntWritable> values,Context context) throws


IOException,InterruptedException {
int sum=0;
for(IntWritable x: values) {
sum+=x.get();
}
context.write(key, new IntWritable(sum));
}
}
public static void main(String[] args) throws Exception {
Configuration conf= new Configuration();
Job job = new Job(conf,"My Word Count Program");
job.setJarByClass(WordCount.class);
job.setMapperClass(Map.class);
job.setReducerClass(Reduce.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
job.setInputFormatClass(TextInputFormat.class);
job.setOutputFormatClass(TextOutputFormat.class);
Path outputPath = new Path(args[1]);
//Configuring the input/output path from the filesystem into the job
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
//deleting the output path automatically from hdfs so that we don't have to delete it explicitly
outputPath.getFileSystem(conf).delete(outputPath);
//exiting the job only if the flag value becomes false
System.exit(job.waitForCompletion(true) ? 0 : 1);
}
}


Mapper code:
public static class Map extends Mapper<LongWritable,Text,Text,IntWritable> {
public void map(LongWritable key, Text value, Context context) throws
IOException,InterruptedException {
String line = value.toString();
StringTokenizer tokenizer = new StringTokenizer(line);
while (tokenizer.hasMoreTokens())
{
value.set(tokenizer.nextToken());
context.write(value, new IntWritable(1));
}

Reducer Code:
public static class Reduce extends Reducer<Text,IntWritable,Text,IntWritable>
{
public void reduce(Text key, Iterable<IntWritable> values, Context context)
throws IOException,InterruptedException {
int sum=0;
for(IntWritable x: values) {
sum+=x.get();
}
context.write(key, new IntWritable(sum));
}
}

Driver Code:
Configuration conf= new Configuration();
Job job = new Job(conf,"My Word Count Program");
job.setJarByClass(WordCount.class);
job.setMapperClass(Map.class);
job.setReducerClass(Reduce.class);
job.setOutputKeyClass(Text.class);

job.setOutputValueClass(IntWritable.class);


job.setInputFormatClass(TextInputFormat.class);
job.setOutputFormatClass(TextOutputFormat.class);
Path outputPath = new Path(args[1]);
//Configuring the input/output path from the filesystem into the job
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
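Before running the job, the WordCount class has to be compiled against the Hadoop libraries and packaged into a jar. A minimal sketch (the class directory name is an assumption; the jar name matches the run command below):

mkdir wordcount_classes
javac -classpath "$(hadoop classpath)" -d wordcount_classes WordCount.java   # compile against the Hadoop jars
jar -cvf hadoop-mapreduce-example.jar -C wordcount_classes/ .                # package the classes into a jar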

Run the MapReduce code:


The command for running a MapReduce code is:
hadoop jar hadoop-mapreduce-example.jar WordCount /sample/input /sample/output
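Once the job completes, the word counts can be read back from HDFS; a minimal sketch assuming the output path used above (reducer output files are typically named part-r-00000, part-r-00001, and so on):

hadoop fs -ls /sample/output
hadoop fs -cat /sample/output/part-r-*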


Practical no. 05
Aim: Solve the stop word elimination problem, given:
➔ A large textual file containing one sentence per line
➔ A small file containing a set of stop words (one stop word per line)
Output:
A textual file containing the same sentences as the large input file, without the
words that appear in the small file.
Code:
package com.hadoop.skipper;

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashSet;
import java.util.Set;
import java.util.StringTokenizer;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class SkipMapper extends Mapper<LongWritable, Text, Text, NullWritable> {

    private Text word = new Text();
    private Set<String> stopWordList = new HashSet<String>();
    private BufferedReader fis;

    /*
     * Load the stop word file(s) from the distributed cache before any map() calls are made.
     */
    @SuppressWarnings("deprecation")
    protected void setup(Context context) throws IOException, InterruptedException {
        try {
            Path[] stopWordFiles = context.getLocalCacheFiles();
            if (stopWordFiles != null && stopWordFiles.length > 0) {
                for (Path stopWordFile : stopWordFiles) {
                    readStopWordFile(stopWordFile);
                }
            }
        } catch (IOException e) {
            System.err.println("Exception reading stop word file: " + e);
        }
    }

    /*
     * Method to read the stop word file and populate the in-memory stop word set.
     */
    private void readStopWordFile(Path stopWordFile) {
        try {
            fis = new BufferedReader(new FileReader(stopWordFile.toString()));
            String stopWord = null;
            while ((stopWord = fis.readLine()) != null) {
                stopWordList.add(stopWord);
            }
        } catch (IOException ioe) {
            System.err.println("Exception while reading stop word file '" + stopWordFile + "' : " + ioe.toString());
        }
    }

    /*
     * Emit every token that is not a stop word; count both kinds with job counters.
     */
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String line = value.toString();
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens()) {
            String token = tokenizer.nextToken();
            if (stopWordList.contains(token)) {
                context.getCounter(StopWordSkipper.COUNTERS.STOPWORDS).increment(1L);
            } else {
                context.getCounter(StopWordSkipper.COUNTERS.GOODWORDS).increment(1L);
                word.set(token);
                context.write(word, null);
            }
        }
    }
}


package com.hadoop.skipper;

import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Counters;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;

@SuppressWarnings("deprecation")
public class StopWordSkipper {

    public enum COUNTERS {
        STOPWORDS,
        GOODWORDS
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        GenericOptionsParser parser = new GenericOptionsParser(conf, args);
        args = parser.getRemainingArgs();

        Job job = new Job(conf, "StopWordSkipper");
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        job.setJarByClass(StopWordSkipper.class);
        job.setMapperClass(SkipMapper.class);
        job.setNumReduceTasks(0);
        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);

        List<String> other_args = new ArrayList<String>();

        // Logic to read the location of the stop word file from the command line.
        // The argument after the -skip option is taken as the location of the stop word file.
        for (int i = 0; i < args.length; i++) {
            if ("-skip".equals(args[i])) {
                DistributedCache.addCacheFile(new Path(args[++i]).toUri(),
                        job.getConfiguration());
                if (i + 1 < args.length) {
                    i++;
                } else {
                    break;
                }
            }
            other_args.add(args[i]);
        }

        FileInputFormat.setInputPaths(job, new Path(other_args.get(0)));
        FileOutputFormat.setOutputPath(job, new Path(other_args.get(1)));

        job.waitForCompletion(true);

        // Print the counter values after the job finishes.
        Counters counters = job.getCounters();
        System.out.printf("Good Words: %d, Stop Words: %d\n",
                counters.findCounter(COUNTERS.GOODWORDS).getValue(),
                counters.findCounter(COUNTERS.STOPWORDS).getValue());
    }
}
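A typical way to package and run this job is sketched below; the jar name and the HDFS paths are assumptions, and the stop word file is passed after the -skip option exactly as the driver expects:

hadoop jar stopwordskipper.jar com.hadoop.skipper.StopWordSkipper /user/hadoop/input /user/hadoop/output -skip /user/hadoop/stopwords.txt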


Practical no. 06
Aim: Using various mathematical functions on the console in R.
Software Requirement: RStudio
Theory:
RStudio is an integrated development environment for R, a programming language for statistical computing and
graphics. It is available in two formats: RStudio Desktop is a regular desktop application while RStudio Server runs
on a remote server and allows accessing RStudio using a web browser.

Various mathematical Functions:


● Addition:
a<-21

b<-34

c<-a+b

print(c)

● Subtraction:
a<-21

b<-34

c<-b-a

print(c)


● Multiplication:
a<-21

b<-34

c<-a*b

print(c)

● Division:
a<-210

b<-21

c<-a/b

print(c)

● min() and max():


min(23,34,45,12)

max(32,35,45,56,23)


● sqrt(): square root of any number.


sqrt(225)

● abs(): the positive value of any number


abs(-34.45)

● ceiling() and floor():


ceiling() rounds a number up to the nearest integer.

floor() rounds a number down to the nearest integer.


ceiling(15.6)

floor(45.1)
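These functions can also be combined in a single script; a minimal sketch using the same values as above:

a <- 21
b <- 34
print(a + b)                     # 55
print(max(a, b) - min(a, b))     # 13
print(sqrt(abs(-225)))           # 15
print(ceiling(a / b))            # 1
print(floor(a / b))              # 0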

In conclusion, various mathematical functions on the console in R have been successfully implemented and
verified.


Practical no. 07
Aim: Write an R script, to create R objects for the calculator application and save
in a specified location on disk.
Software Requirement: RStudio
Theory:
R can be used as a powerful calculator by entering equations directly at the prompt in the command console. Simply
type your arithmetic expression and press ENTER. R will evaluate the expressions and respond with the result.
While this is a simple interaction interface, there could be problems if you are not careful. R will normally execute
your arithmetic expression by evaluating each item from left to right, but some operators have precedence in the
order of evaluation.

The operators R uses for basic arithmetic are:

+ Addition

- Subtraction

* Multiplication

/ Division

Script:
operator = function(num1, num2, operat){

  if(operat == "+"){
    return(num1 + num2)
  } else if(operat == "-"){
    return(num1 - num2)
  } else if(operat == "*"){
    return(num1 * num2)
  } else if(operat == "/"){
    return(num1 / num2)
  } else {
    return("invalid operator")
  }
}

num1 = as.numeric(readline(prompt = "enter first number: "))

num2 = as.numeric(readline(prompt = "enter second number: "))

operat = readline(prompt = "enter operator (+,-,*,/): ")

result = operator(num1, num2, operat)

print(paste("result:", result))

Output:

In conclusion, an R script to create R objects for the calculator application and save them in a specified location on disk
has been successfully implemented and verified.


Practical no. 08
Aim: Write an R script to find basic descriptive statistics using summary, str, and
quartile function on mtcars & cars datasets.
Software Requirement: RStudio
Theory:
Summary (or descriptive) statistics are the first figures used to represent nearly every dataset. They also form the
foundation for much more complicated computations and analyses. Thus, in spite of being composed of simple
methods, they are essential to the analysis process.

A data set is a collection of related, discrete items of associated data that may be accessed individually, in
combination, or managed as a whole entity. A data set is organized into some type of data structure

Script:
data("mtcars"): This script is used to load the built-in mtcars dataset, which contains data about various car models.


head(mtcars): This script is used to display the first six rows of the dataset.

summary(mtcars): This script is used to give summary statistics (minimum, first quartile, median, mean, third quartile, maximum) for each column of mtcars.

dim(mtcars): This script is used to show how many rows and columns are present in the dataset.

names(mtcars): This script is used to list the column names of the dataset.

quantile(cars$speed): This script is used to compute the quartiles of the speed column of the cars dataset (mtcars has no speed column; for mtcars you could use, for example, quantile(mtcars$mpg)).


str(mtcars): This script is used to display the structure of the dataset: the data type of each column and its first few values.
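The same functions apply to the cars dataset named in the aim; a minimal sketch:

data(cars)
summary(cars)          # descriptive statistics for the speed and dist columns
str(cars)              # structure: 50 observations of 2 numeric variables
quantile(cars$speed)   # quartiles of the speed column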

In conclusion, the R script to find basic descriptive statistics using summary, str, and quartile function on mtcars &
cars datasets have been successfully implemented and verified.


Practical no. 09
Aim: Write an R script to find a subset of a dataset by using subset(), and
aggregate() functions on the iris dataset.
Objective:
● Reading different types of data sets (.txt, .csv) from the web and disk and writing in files in
specific disk locations.
● Reading Excel data-sheet in R.
● Reading XML dataset in R.

Software Requirements: RStudio

Theory:
When a program is terminated, its data is lost. Storing data in a file preserves it even after the program
terminates. If we have to enter a large amount of data, it takes a lot of time to enter it all; however, if we
have a file containing all the data, we can easily access its contents using a few commands in R. You can
also easily move your data from one computer to another without any changes. Such files can be stored in various
formats: a .txt (tab-separated values) file, a tabular .csv (comma-separated values) file, or a file hosted on the
internet or in the cloud. R provides very easy methods to read these files.

Find a subset of a dataset by using subset(), and aggregate() function on the iris dataset:
data(iris)

subset(iris, Species == "setosa")

aggregate(cbind(Sepal.Length, Sepal.Width) ~ Species, data = iris, FUN = mean)
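The objective also calls for writing data to a specific disk location; a minimal sketch that saves a subset of iris as a CSV file (the path is an assumption):

setosa <- subset(iris, Species == "setosa")                                          # extract one species
write.csv(setosa, file = "C:/Users/admin/Desktop/iris_setosa.csv", row.names = FALSE)   # write it to disk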

Script for reading .csv dataset from the web and disk and writing in files:
You can check which directory the R workspace is pointing to using the getwd() function. You can also set a new
working directory using the setwd() function.

Reading a CSV File


Following is a simple example of the read.csv() function reading a CSV file available in your current working directory:
read.data <- read.csv("annual-enterprise-survey-2021-financial-year-provisional-csv.csv")
print(read.data)


print(ncol(read.data)): It is used for printing the number of columns.

print(nrow(read.data)): It is used for printing the number of rows.


Script for reading XML data set:


You can read an XML file in R using the "XML" package. This package can be installed using the following
command:

install.packages("XML")

Reading XML File


The XML file is read by R using the function xmlToDataFrame(). It is stored as a data frame in R.
# Load the package required to read XML files.
library("XML")
# Also load the other required package.
library("methods")
# Give the input file name to the function.
xml <- xmlToDataFrame("report.xml")
# Print the result.
print(xml)


Script for reading Excel data sheet in R:

Reading Excel File


The Excel file is read by R using the read_excel() function from the readxl package. It is stored as a data frame (tibble) in R.
library(readxl): It is used to load the library into the R workspace.
import <- read_excel("C:/Users/admin/Desktop/import.xlsx")


View(import): It displays the imported data in a spreadsheet-style viewer.

print(is.data.frame(import)): It is used to verify that the data set fetched from the .xlsx file is a data frame.

print(ncol(import)): It is used to print the number of columns.

print(nrow(import)): It is used to print the number of rows.


amount <- max(import$Amount)

print(amount)

In conclusion, we use the aggregate() function to calculate the mean sepal length and width for each species in the
original iris dataset. We use the formula interface of the aggregate() function to specify which variable we want to
aggregate (Sepal.Length or Sepal.Width) and how we want to group the data (by the Species variable). The mean()
function is used as the aggregation function to calculate the means. The resulting output shows the mean sepal
length and width for each species in the iris dataset.

In conclusion, reading different file types (.txt, .csv, .xlsx, .xml) has been successfully implemented and verified
using various commands and scripts in the R language.


Practical no. 10
Aim: Visualization using R packages
A. Find the data distribution using box and scatter plot.
B. Find the outliers using plot.
Theory:
Installation of package and Histogram presentation
install.packages("ggplot2")
library("ggplot2")

ggplot(data=iris, aes(x=Sepal.Length)) + geom_histogram()

ggplot(data=iris, aes(x=Sepal.Length)) + geom_histogram(binwidth=1)


ggplot(data=iris, aes(x=Sepal.Length)) + geom_histogram(color="black", fill="white", bins=10)

ggplot(iris, aes(x=Sepal.Length, color=Species)) + geom_histogram(fill="white", binwidth=1)


Density Plot: ggplot(iris, aes(x=Sepal.Length)) + geom_density()

Box Plot:

mtcars$cyl = factor(mtcars$cyl)

ggplot(mtcars, aes(x=cyl, y=disp)) + geom_boxplot()


We can see one outlier for 6 cylinders.
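The flagged point can also be listed programmatically; a minimal sketch using boxplot.stats() on the 6-cylinder group (cyl was converted to a factor above, so it is compared as "6"):

six_cyl_disp <- mtcars$disp[mtcars$cyl == "6"]
boxplot.stats(six_cyl_disp)$out   # values beyond the whiskers, i.e. the outlier visible in the plot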

To create a notched boxplot we write notch = TRUE

ggplot(mtcars, aes(x=cyl, y=disp)) + geom_boxplot(notch = TRUE)

Scatter Plot:

ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width, shape = Species, color = Species)) + geom_point()

