Training in R For Data Statistics

The document discusses using R for statistical analysis and visualization. It covers installing and using R and RStudio, performing basic calculations and statistical functions, writing scripts, and creating plots like line graphs, histograms, and bar charts. Examples are provided for various functions and plotting different types of charts using built-in R functions.

JIET Institute of Design and Technology, Jodhpur

Topic: R for statistics

Krishan Pal Singh


Assistant Professor
JDAT
Day 1-
Study of R language and its tools
Study of R language
● R is an environment for data manipulation, statistical computing, data analysis, and graphical display.

● Effective data handling and storage of outputs is possible.

● Simple as well as complicated calculations are possible.
Cont..
● Simulations are possible.

● Graphical display on screen and in hard copy is possible.

● R is an effective programming language that offers the facilities of any other good programming language.

● R has a statistical computing environment.


Cont..
● It has a computer language which is convenient to use for statistical and graphical applications.

● R is free, open-source software and therefore is not a black box.

● Built-in and contributed packages are available, and users are provided with tools to make packages.
Cont..
● It is possible to contribute your own packages.

● Commands can be saved in script files and run from them.

● R is available for Windows, Unix, Linux, and Macintosh platforms.

● Graphics can be saved directly in PostScript or PDF format.
Installation of R and RStudio
Step 1: Installation of R

● Go to the following link to install R:
http://cran.r-project.org

Step 2: Installation of RStudio

● Go to the following link to download RStudio:
http://www.rstudio.com/products/rstudio/download/

● Please make sure that you select the package compatible with your operating system.
Step 3: Installation of packages
● Run RStudio and explore the environment.

● Install the packages you need from your repository. You may use the install.packages() function available in R to do that.

● Run the following command in your R console, passing the package names as a character vector:
install.packages(c(<comma-separated package names>), dependencies = TRUE)

● For example, to install the packages plyr and dplyr, run the following command:
install.packages(c("plyr", "dplyr"), dependencies = TRUE)
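As a hedged sketch of the step above, the helper below installs only those packages that are not already available. The function name install_if_missing is hypothetical (it is not part of R); it relies only on the base functions requireNamespace() and install.packages().

```r
# Hypothetical helper: install only the packages that are not yet installed.
install_if_missing <- function(pkgs) {
  # requireNamespace() returns TRUE when a package can be loaded
  available <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  to_install <- pkgs[!available]
  if (length(to_install) > 0) {
    install.packages(to_install, dependencies = TRUE)
  }
  # Return (invisibly) the names of the packages that had to be installed
  invisible(to_install)
}

# Example call (needs an internet connection if a package is missing):
# install_if_missing(c("plyr", "dplyr"))
```

Passing the names as one character vector matters: in install.packages("plyr", "dplyr"), the second argument would be read as the library path, not as a second package.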
Libraries in R

To use a library, call the library() function with the name of the library in parentheses.

Thus, to load the cluster library, type:

library(cluster)
Similarly,
library(ggplot2) : loads package ggplot2
library(graphics): loads package graphics
Contents of Libraries

Use the help argument of the library() function to get the detailed contents of a library package.

Here is how to find out about the contents of the cluster library:
library(help=cluster) returns the following:

Information on package ‘cluster’


Description:

Package: cluster
Version: 1.14.4
Date: 2013-03-26
Priority: recommended
Author: Martin Maechler, based on S
Original by Peter … …… … …
Followed by a list of all the functions and data sets.
How to quit in R
Task

It is good practice to remove the variable names given to any data frame at the end of each session in R.

How can you remove all variable names, or only selected variable names?
Day 2-
Working at command level and
statistical functions
Working at command level
● > is the prompt sign in R.
R as a calculator
Calculations with Data Vectors

All arithmetic operations work with vectors and scalars in the same way as the power operation: they are applied element by element.
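A minimal sketch of element-wise arithmetic on data vectors. The vectors x and y and the result names are chosen here for illustration only.

```r
x <- c(2, 3, 4, 5)
y <- c(1, 2, 3, 4)

powers    <- x^2      # power applied to each element: 4 9 16 25
shifted   <- x + 10   # the scalar is added to each element: 12 13 14 15
halves    <- x / 2    # each element divided by the scalar: 1.0 1.5 2.0 2.5
products  <- x * y    # vector times vector, element by element: 2 6 12 20

print(powers)
print(shifted)
print(halves)
print(products)
```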
Built in commands
Overview over Other Functions
Statistical Functions

● To find the sum of squares:
> x = c(2,3,4,5)
● To find the sum of squares of deviations from the mean:
> x = c(2,3,4,5)
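The commands that follow the vector definition are not shown on the slide. One plausible way to compute both quantities with the base functions sum() and mean() is sketched below; the variable names sum_sq and sum_sq_dev are assumptions.

```r
x <- c(2, 3, 4, 5)

# Sum of squares: 2^2 + 3^2 + 4^2 + 5^2
sum_sq <- sum(x^2)
print(sum_sq)        # 54

# Sum of squares of deviations from the mean (the mean of x is 3.5)
sum_sq_dev <- sum((x - mean(x))^2)
print(sum_sq_dev)    # 5
```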
Day 3-
Work on Arithmetic, Logical and
Matrix operations.
Day 4-
Writing simple programs, saving,
and running programs.
Set the working directory
Create an R file
RStudio with script file open
Writing scripts
Saving R script file
Executing an R file
Add comments –single line
Add comments –Multiple lines
Clear the console
Clear the environment –rm()

● Single variable: Enter in console/R script: rm(variable)


● All variables: Enter in console/R script: rm(list=ls())
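A small sketch of both forms of rm(). To avoid clearing the reader's own workspace, it works in a scratch environment (the name demo is an assumption); at the console you would omit the envir argument and call rm(variable) or rm(list=ls()) directly.

```r
# Scratch environment standing in for the workspace
demo <- new.env()
assign("a", 10, envir = demo)
assign("b", "text", envir = demo)
assign("scores", c(1, 2, 3), envir = demo)

# Remove a single variable
rm("a", envir = demo)
after_single <- ls(demo)   # "b" "scores"

# Remove all remaining variables
rm(list = ls(demo), envir = demo)
after_all <- ls(demo)      # character(0)

print(after_single)
print(after_all)
```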
Day 5-
Writing simple programs of conditions
and iterations.
Sequence function

● Sequence function syntax: seq(from, to, by, length.out).
● Creates equally spaced points between 'from' and 'to'. Specify either by (the step size) or length.out (the number of points), not both.
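A few illustrative calls to seq(); the values are chosen here only as examples.

```r
# Step size given with 'by'
s1 <- seq(from = 1, to = 10, by = 3)          # 1 4 7 10

# Number of points given with 'length.out'
s2 <- seq(from = 0, to = 1, length.out = 5)   # 0.00 0.25 0.50 0.75 1.00

# A decreasing sequence needs a negative step
s3 <- seq(from = 10, to = 2, by = -2)         # 10 8 6 4 2

print(s1)
print(s2)
print(s3)
```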
Control structures
for loop, Nested for Loops
Exercise
1. Write an R program to input the marks of three subjects (Physics, Chemistry, and Math), then calculate the percentage and the grade according to the following rules, using switch:
   ● Percentage >= 80%: Distinction
   ● Percentage >= 60%: I DIV
   ● Percentage >= 50%: II DIV
   ● Percentage >= 40%: III DIV
   ● Percentage < 40%: FAIL

2. Write a program to find the factorial of a given number using a for loop.

3. Write a program to generate the Fibonacci series using a while loop.
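One possible approach to these exercises is sketched below; it is not the only solution. All function names are hypothetical. Two assumptions are made: each subject is marked out of 100, and the marks are passed as arguments instead of being read interactively (readline() could be used for that at the console).

```r
# Exercise 1: percentage and grade using switch().
# The percentage is reduced to its tens digit (capped at 8) so that
# switch() can select the grade band; empty cases fall through.
grade_for <- function(physics, chemistry, math, max_per_subject = 100) {
  percentage <- 100 * (physics + chemistry + math) / (3 * max_per_subject)
  band <- as.character(min(percentage %/% 10, 8))
  grade <- switch(band,
                  "8" = "Distinction",
                  "7" = ,
                  "6" = "I DIV",
                  "5" = "II DIV",
                  "4" = "III DIV",
                  "FAIL")
  list(percentage = percentage, grade = grade)
}

# Exercise 2: factorial using a for loop.
factorial_for <- function(n) {
  result <- 1
  for (i in seq_len(n)) {
    result <- result * i
  }
  result
}

# Exercise 3: first 'count' Fibonacci numbers using a while loop.
fibonacci_while <- function(count) {
  series <- numeric(0)
  current <- 0
  following <- 1
  while (length(series) < count) {
    series <- c(series, current)
    new_value <- current + following
    current <- following
    following <- new_value
  }
  series
}

print(grade_for(85, 90, 80)$grade)   # "Distinction"
print(factorial_for(5))              # 120
print(fibonacci_while(8))            # 0 1 1 2 3 5 8 13
```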
Day 6-
Plotting the data using line graph, histograms, multiple
graphs, etc.
Line Graphs
● A line graph is a pictorial representation of
information which changes continuously over
time.
● Line charts are used to identify trends in data. For line graph construction, R provides the plot() function, which has the following syntax:

plot(v, type, col, xlab, ylab, main)

S.No.  Parameter  Description
1.     v          A vector which contains the numeric values.
2.     type       Takes the value "l" to draw only the lines, "p" to draw only the points, and "o" to draw both lines and points.
3.     xlab       The label for the x-axis.
4.     ylab       The label for the y-axis.
5.     main       The title of the chart.
6.     col        The color for both the points and the lines.
Exercise
Write an R program to draw multiple lines.
# Create the data for the chart.
v <- c(7,12,28,3,41)
t <- c(14,7,6,19,3)
# Give the chart file a name.
png(file = "line_chart_2_lines.png")
# Plot the line chart.
plot(v, type = "o", col = "red", xlab = "Month", ylab = "Rain fall", main = "Rain fall chart")
# Add the second line to the same chart.
lines(t, type = "o", col = "blue")
# Save the file.
dev.off()
Histogram
● A histogram is a type of bar chart which shows how many values fall into each of a set of value ranges. A histogram is used to show a distribution, whereas a bar chart is used to compare different entities. In a histogram, the height of each bar represents the number of values present in the given range.

● For creating a histogram, R provides the hist() function, which takes a vector as input and accepts further parameters to add more functionality. The hist() function has the following syntax:

hist(v,main,xlab,ylab,xlim,ylim,breaks,col,border)
S.No.  Parameter  Description
1.     v          A vector that contains numeric values.
2.     main       The title of the chart.
3.     col        The color of the bars.
4.     border     The border color of each bar.
5.     xlab       The description of the x-axis.
6.     ylab       The description of the y-axis.
7.     xlim       The range of values on the x-axis.
8.     ylim       The range of values on the y-axis.
9.     breaks     The break points between the bars (or the number of bars), which determine the width of each bar.
Exercise
Write an R program to draw a histogram.
Suppose you own a school uniform shop, and you want to stock up the supply based on the age of the
students residing in your locality. You can go through your bill book and write the ages of all the
customers as follows:
Age Data: 5, 5, 7, 6, 8, 11, 13, 11, 14, 12, 15, 23, 24, 15, 14, 14, 15, 10, 24, 16, 16, 17, 11, 19, 23, 14,
18, 16, 15, 19, 14, 9, 11, 10, 12, 10, 10, 16, 13, 14, 12, 15, 23, 24, 15, 14, 14, 15, 12, 24, 16, 16, 17, 18,
19, 23, 18, 9, 23, 14, 11, 16, 6, 13, 11, 14, 12, 15, 22, 22, 15, 14, 14, 15, 10, 5, 7, 6, 8, 6, 13, 11, 14, 12,
15, 23, 21, 15, 14, 14, 15, 5, 7, 6, 8, 6, 13, 11, 14, 12, 15, 9, 24, 15, 14, 14, 15, 12.
The above data can be categorized into groups as follows:
Age Groups (Class Intervals)   Number of students (frequency)
5-8                            16
9-12                           24
13-16                          46
17-20                          8
21-24                          14
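A possible sketch for this exercise uses the age data listed above. The break points 4, 8, 12, 16, 20, 24 are an assumption chosen so that the bins match the class intervals in the table (hist() treats each bin as closed on the right by default); the colors and labels are also illustrative.

```r
# Age data from the bill book (108 customers)
ages <- c(5, 5, 7, 6, 8, 11, 13, 11, 14, 12, 15, 23, 24, 15, 14, 14, 15, 10,
          24, 16, 16, 17, 11, 19, 23, 14, 18, 16, 15, 19, 14, 9, 11, 10, 12,
          10, 10, 16, 13, 14, 12, 15, 23, 24, 15, 14, 14, 15, 12, 24, 16, 16,
          17, 18, 19, 23, 18, 9, 23, 14, 11, 16, 6, 13, 11, 14, 12, 15, 22,
          22, 15, 14, 14, 15, 10, 5, 7, 6, 8, 6, 13, 11, 14, 12, 15, 23, 21,
          15, 14, 14, 15, 5, 7, 6, 8, 6, 13, 11, 14, 12, 15, 9, 24, 15, 14,
          14, 15, 12)

# Bins (4,8], (8,12], (12,16], (16,20], (20,24] correspond to
# the class intervals 5-8, 9-12, 13-16, 17-20, 21-24
age_breaks <- c(4, 8, 12, 16, 20, 24)

# Frequency table of the class intervals
age_groups <- cut(ages, breaks = age_breaks,
                  labels = c("5-8", "9-12", "13-16", "17-20", "21-24"))
print(table(age_groups))

# Draw the histogram; hist() also returns the bin counts
h <- hist(ages, breaks = age_breaks, main = "Ages of customers",
          xlab = "Age", ylab = "Number of students",
          col = "lightblue", border = "black")
print(h$counts)   # 16 24 46 8 14
```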
Bar Charts
A bar chart is a pictorial representation in which numerical
values of variables are represented by length or height of lines or
rectangles of equal width. A bar chart is used for summarizing a
set of categorical data. In bar chart, the data is shown through
rectangular bars having the length of the bar proportional to the
value of the variable.

In R, we can create a bar chart to visualize the data in an efficient manner. For this purpose, R provides the barplot() function, which has the following syntax:

barplot(H, xlab, ylab, main, names.arg, col)
S.No.  Parameter  Description
1.     H          A vector or matrix which contains the numeric values used in the bar chart.
2.     xlab       A label for the x-axis.
3.     ylab       A label for the y-axis.
4.     main       A title of the bar chart.
5.     names.arg  A vector of names that appear under each bar.
6.     col        It is used to give colors to the bars in the graph.
Exercise

Write an R program to draw a bar graph.
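One possible sketch for this exercise reuses the age-group frequencies from the histogram exercise as the bar heights. The labels, title, and color are illustrative choices.

```r
# Frequencies of the five age groups from the histogram exercise
H <- c(16, 24, 46, 8, 14)
groups <- c("5-8", "9-12", "13-16", "17-20", "21-24")

# barplot() draws the chart and returns the x positions of the bar centers
bar_centers <- barplot(H, names.arg = groups, xlab = "Age group",
                       ylab = "Number of students",
                       main = "Students per age group", col = "steelblue")
```

To save the chart to a file, the call can be wrapped between png(file = "bar_chart.png") and dev.off(), as in the line graph exercise.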


Scatterplots

Scatter plots are used to compare variables. A comparison between variables is required when we need to determine how much one variable is affected by another variable. In a scatterplot, the data is represented as a collection of points. Each point on the scatterplot gives the values of the two variables: one variable is selected for the vertical axis and the other for the horizontal axis. In R, there are two ways of creating a scatterplot: using the plot() function, and using the functions of the ggplot2 package.
The plot() function has the following syntax for creating a scatterplot in R:
plot(x, y, main, xlab, ylab, xlim, ylim, axes)
S.No.  Parameter  Description
1.     x          The dataset whose values are the horizontal coordinates.
2.     y          The dataset whose values are the vertical coordinates.
3.     main       The title of the graph.
4.     xlab       The label on the horizontal axis.
5.     ylab       The label on the vertical axis.
6.     xlim       The limits of the x values used for plotting.
7.     ylim       The limits of the y values used for plotting.
8.     axes       Indicates whether both axes should be drawn on the plot.
Exercise

Write an R program to draw a scatter graph.
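A possible sketch for this exercise follows. The slides do not name a dataset, so the built-in mtcars data set (car weight against fuel consumption) is used here as an assumption; the axis limits are chosen to cover the data.

```r
# Built-in data set: wt is weight (1000 lbs), mpg is miles per gallon
weight  <- mtcars$wt
mileage <- mtcars$mpg

# Scatterplot of mileage against weight
plot(x = weight, y = mileage, main = "Weight vs mileage",
     xlab = "Weight (1000 lbs)", ylab = "Miles per gallon",
     xlim = c(1, 6), ylim = c(10, 35), axes = TRUE)

# The correlation is negative: heavier cars travel fewer miles per gallon
r <- cor(weight, mileage)
print(r)
```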


Day 7-8
Statistics for Analysis of Experimental Data: Given a sample of
50 observations, find out: sample average, sample variance,
sample standard deviation, and standard error of the mean.
Step 7: Divide the number you calculated in Step 6 by the square root
of the sample size (in this sample problem, the sample size is 4):

2.45 / √ (4) = 2.45/2 = 1.225
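The final step can be checked in R, and the four quantities named in the task can be computed with base functions. In the sketch below, the number from Step 6 is assumed to be the sample standard deviation; the 50 observations are simulated with rnorm() purely for illustration, and the function name standard_error is hypothetical.

```r
# Step 7 from the worked example: 2.45 divided by the square root of 4
se_example <- 2.45 / sqrt(4)
print(se_example)   # 1.225

# Hypothetical helper: standard error of the mean
standard_error <- function(x) sd(x) / sqrt(length(x))

# Illustrative sample of 50 observations (simulated, not real data)
set.seed(42)
obs <- rnorm(50, mean = 10, sd = 2)

sample_mean <- mean(obs)            # sample average
sample_var  <- var(obs)             # sample variance (divisor n - 1)
sample_sd   <- sd(obs)              # sample standard deviation
sample_se   <- standard_error(obs)  # standard error of the mean

print(c(mean = sample_mean, var = sample_var, sd = sample_sd, se = sample_se))
```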


Exercise
1. The following are the times taken (in seconds) by 20 participants in a race: 32, 35, 45, 83, 74, 55, 68, 38, 35, 55, 66, 65, 42, 68, 72, 84, 67, 36, 42, 58.

2. Suppose two data points are missing in the earlier example of the times taken (in seconds) by 20 participants in a race. They are recorded as NA.
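A sketch of how these two exercises might be approached. The slide does not say which two values are missing, so positions 5 and 12 are chosen here as an assumption. With NA values present, mean() returns NA unless na.rm = TRUE is given.

```r
# Exercise 1: complete data
times <- c(32, 35, 45, 83, 74, 55, 68, 38, 35, 55,
           66, 65, 42, 68, 72, 84, 67, 36, 42, 58)
mean_complete <- mean(times)                      # 56
sd_complete   <- sd(times)
se_complete   <- sd_complete / sqrt(length(times))

# Exercise 2: two values recorded as NA (positions are hypothetical)
times_na <- times
times_na[c(5, 12)] <- NA

mean_with_na <- mean(times_na)                    # NA
n_observed   <- sum(!is.na(times_na))             # 18
mean_na_rm   <- mean(times_na, na.rm = TRUE)      # 54.5
sd_na_rm     <- sd(times_na, na.rm = TRUE)
se_na_rm     <- sd_na_rm / sqrt(n_observed)

print(c(mean = mean_complete, sd = sd_complete, se = se_complete))
print(c(mean = mean_na_rm, sd = sd_na_rm, se = se_na_rm))
```

Note that the standard error divides by the number of observed values, not by the original sample size of 20.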
Day 9
Given a sample data of 100 observations, find out normal
distribution, plot the normal distribution using R Language, find
out the mean and standard deviation. Discuss the properties of
this distribution.
Statistical Probability Functions
● R provides various statistical probability functions to perform statistical tasks. These functions are helpful for finding the normal density, normal quantiles, and many other quantities.

● In a random collection of data from independent sources, it is generally observed that the distribution of the data is normal. This means that, on plotting a graph with the value of the variable on the horizontal axis and the count of the values on the vertical axis, we get a bell-shaped curve. The center of the curve represents the mean of the data set. In the graph, fifty percent of the values lie to the left of the mean and the other fifty percent lie to the right of the mean. This is referred to as the normal distribution in statistics.
R has four built-in functions for working with the normal distribution. They are described below:
dnorm(x, mean, sd), pnorm(x, mean, sd), qnorm(p, mean, sd), rnorm(n, mean, sd)

The parameters used in the above functions are described below:
● x is a vector of numbers.
● p is a vector of probabilities.
● n is the number of observations (sample size).
● mean is the mean value of the sample data. Its default value is zero.
● sd is the standard deviation. Its default value is 1.
S.No.  Function, description, and example

1. dnorm(x, mean=0, sd=1, log=FALSE)
   It is used to find the height of the probability density at each point for a given mean and standard deviation.
   Example:
   a <- seq(-7, 7, by=0.1)
   b <- dnorm(a, mean=2.5, sd=0.5)
   png(file="dnorm.png")
   plot(a, b)
   dev.off()

2. pnorm(q, mean=0, sd=1, lower.tail=TRUE, log.p=FALSE)
   It is used to find the probability that a normally distributed random number is less than a given value.
   Example:
   a <- seq(-7, 7, by=0.2)
   b <- pnorm(a, mean=2.5, sd=2)
   png(file="pnorm.png")
   plot(a, b)
   dev.off()

3. qnorm(p, mean=0, sd=1)
   It is used to find the number whose cumulative probability matches the given probability value.
   Example:
   a <- seq(0, 1, by=0.02)
   b <- qnorm(a, mean=2.5, sd=0.5)
   png(file="qnorm.png")
   plot(a, b)
   dev.off()

4. rnorm(n, mean=0, sd=1)
   It is used to generate random numbers whose distribution is normal.
   Example:
   y <- rnorm(40)
   png(file="rnorm.png")
   hist(y, main="Normal Distribution")
   dev.off()
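For the Day 9 task, a minimal sketch is given below. The 100 observations are simulated with rnorm() because no data set is supplied; the mean of 50 and standard deviation of 10 are arbitrary assumptions. It estimates the mean and standard deviation, overlays the fitted normal curve on the histogram, and checks two properties of the distribution.

```r
# Simulated sample of 100 observations (illustrative values only)
set.seed(123)
obs <- rnorm(100, mean = 50, sd = 10)

# Estimate the mean and standard deviation from the sample
sample_mean <- mean(obs)
sample_sd   <- sd(obs)
print(c(mean = sample_mean, sd = sample_sd))

# Histogram on the density scale with the fitted normal curve on top
hist(obs, freq = FALSE, main = "Normal Distribution",
     xlab = "Value", col = "lightgray")
curve(dnorm(x, mean = sample_mean, sd = sample_sd), add = TRUE, col = "red")

# Property 1: half of the distribution lies below the mean
p_below_mean <- pnorm(sample_mean, mean = sample_mean, sd = sample_sd)   # 0.5

# Property 2: about 68.27% lies within one standard deviation of the mean
p_within_one_sd <- pnorm(1) - pnorm(-1)
print(c(p_below_mean = p_below_mean, p_within_one_sd = p_within_one_sd))
```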
Thanks
