R Programming Merged PDF
R PROGRAMMING
CS1756
R Objects and Attributes
How to Download?
q Where to get R? Go to www.r-project.org
q Downloads: CRAN
q Set your mirror, e.g. INDIA: https://mirror.niser.ac.in/cran/
q Select "Windows (7 or later)"
q Select "base"
q Select R-4.1.1-win.exe (~65 MB)
q RStudio: https://www.rstudio.com/products/rstudio/download/
R Object
DATA TYPES:
q As R is an object-oriented programming language, basically everything in R is an object!
q It has a wide variety of data types, including vectors, matrices, data frames, arrays, factors and lists.
q The most basic object is a vector.
R Object
q R has five basic or "atomic" classes of objects (atomic vectors):
1. character
2. numeric (real numbers, double or decimal)
3. integer
4. complex
5. logical (TRUE/FALSE)
q A character vector stores small pieces of text.
q You can create a character vector in R by typing a character or string of characters surrounded by quotes.
EXAMPLE: A <- "apple"
or
A = "apple"
q The individual elements of a character vector are known as strings.
R Object: Numeric
q Numbers in R are generally treated as numeric objects (i.e. double-precision real numbers).
q If you explicitly want an integer, you need to specify the L suffix.
EXAMPLE: A <- 1
A <- 1L
Entering 1 in R gives a numeric object; entering 1L explicitly gives you an integer object.
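A quick check of the resulting classes:
A <- 1
class(A)   # "numeric"
B <- 1L
class(B)   # "integer"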
R Object: Numeric
q There is also a special number Inf which represents infinity; this allows us to represent entities like 1 / 0.
q Inf can be used in ordinary calculations; e.g., 1 / Inf is 0.
q The value NaN represents an undefined value ("not a number"); e.g. 0 / 0.
q NaN can also be thought of as a missing value.
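A short demonstration of these special values:
1 / 0           # Inf
-1 / 0          # -Inf
1 / Inf         # 0
0 / 0           # NaN
is.nan(0 / 0)   # TRUE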
R Object: Integer
q Integer vectors store integers, numbers that can be written without a
decimal component.
q Note that R won’t save a number as an integer unless you include
the L.
q Integer numbers without the L will be saved as doubles.
The only difference between 4 and 4L is how R saves the number in your
computer’s memory. Integers are defined more precisely in your
computer’s memory than doubles (unless the integer is very large or
small).
Why would you save your data as an integer
instead of a double?
q For example, the number π contains an endless sequence of digits to the right of the decimal place.
q The computer must round π to something close to, but not exactly equal to, π in order to store it in memory.
q As a result, each double is accurate to about 16 significant digits.
q This introduces a little bit of error.
q In most cases, this rounding error will go unnoticed.
q However, in some situations, the rounding error can cause surprising
results
q These errors are known as floating-point errors, and doing arithmetic in
these conditions is known as floating-point arithmetic.
q Floating-point arithmetic is not a feature of R; it is a feature of computer
programming.
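A classic illustration of a floating-point error in R:
sqrt(2)^2 == 2            # FALSE, because of a tiny rounding error
sqrt(2)^2 - 2             # 4.440892e-16, the leftover error
all.equal(sqrt(2)^2, 2)   # TRUE; all.equal() compares with a tolerance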
R Object: Logicals
q Logical vectors store TRUEs and FALSEs, R’s form of Boolean data.
EXAMPLE: 5>6
##FALSE
q Any time you type TRUE or FALSE in capital letters (without
quotation marks), R will treat your input as logical data.
q R also assumes that T and F are shorthand for TRUE and FALSE, unless they are defined elsewhere, e.g. T <- 500.
q Since the meaning of T and F can change, it's best to stick with TRUE and FALSE.
R Object: Complex
q Doubles, integers, characters, and logicals are the most common
types of atomic vectors in R, but R also recognizes two more types:
complex and raw.
q Complex vectors store complex numbers.
EXAMPLE: Comp <- 1 + 2i
R Object: Raw
q Raw vectors store raw bytes of data.
q Making raw vectors gets complicated, but you can make an empty
raw vector of length n with raw(n)
q EXAMPLE: raw(5)
[1] 00 00 00 00 00
How to create a vector?
q c() : To create a vector of objects.
>x <- c(0.5, 0.6) ## numeric
> x <- c(TRUE, FALSE) ## logical
> x <- c(T, F) ## logical
> x <- c("a", "b", "c") ## character
> x <- 9:29 ## integer
> x <- c(1+0i, 2+4i) ## complex
How to create a vector? Contd..
q vector() : To create a vector of objects.
>x <- vector("numeric", length=10) ## numeric
R Attributes
q Attributes of an object can be accessed using the attributes()
function.
q EXAMPLE:
a <- c(1, 2, 3, 4)
length(a)
R Attributes: EXAMPLE
• x <- c(1,2,3,4)
• attributes(x)
  NULL
• names(x)
  NULL
q Each of these attributes has its own helper function that can be used to give attributes to an object.
q The names won't affect the actual values of the vector, nor will the names be affected when you manipulate the values of the vector.
q names() can also be used to change the names attribute or remove it altogether.
R Attributes: Names
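A small sketch of setting, using, and removing names:
x <- c(1, 2, 3, 4)
names(x) <- c("a", "b", "c", "d")
x
## a b c d
## 1 2 3 4
x["b"]            # elements can now be selected by name
names(x) <- NULL  # removes the names attribute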
R Attributes: Dim
q You can transform an atomic vector into an n-dimensional array by giving it a dimensions attribute with dim.
q Set the dim attribute to a numeric vector of length n.
q R will reorganize the elements of the vector into n dimensions.
q Each dimension will have as many rows (or columns, etc.) as the nth value of the dim vector.
R Attributes: Dim contd..
q EXAMPLE:
## Create a matrix of dimension 2 rows and 2 columns
a <- c(1, 2, 3, 4)
dim(a) <- c(2, 2)
a
##      [,1] [,2]
## [1,]    1    3
## [2,]    2    4
q R always fills up each matrix by columns, instead of by rows.
q If you’d like more control over this process, you can use one of R’s
helper functions, matrix or array.
R objects: Matrix
q Matrices store values in a two-dimensional array, just like a matrix from
linear algebra.
q To create one, first give matrix an atomic vector to reorganize into a
matrix.
q Then, define how many rows should be in the matrix by setting the nrow
argument to a number.
q matrix will organize your vector of values into a matrix with the
specified number of rows.
q Alternatively, you can set the ncol argument, which tells R how many
columns to include in the matrix
R objects: matrix contd..
q EXAMPLE:
## Create a matrix of dimension 2 rows and 2 columns
m <- matrix(a, nrow = 2)
m
##      [,1] [,2]
## [1,]    1    3
## [2,]    2    4
q R will fill up the matrix column by column by default, but you can fill the matrix row by row if you include the argument byrow = TRUE.
R objects: matrix contd..
q EXAMPLE:
m <- matrix(a, nrow = 2, byrow = TRUE)
m
##      [,1] [,2]
## [1,]    1    2
## [2,]    3    4
q matrix also has other default arguments that you can use to
customize your matrix. You can read about them at matrix’s
help page (accessible by ?matrix).
R objects: Array
q An array is one of the R data objects that can store more than two dimensions.
q Arrays can store values of a single basic data type only.
q They store the data in the form of layered matrices.
q The array() function creates an n-dimensional array.
R objects: Arrays contd..
q EXAMPLE:
Array_name = array(data,dim = c(row_size,column_size,matrices),
dimnames = list(row_names,column_names,matrices_names))
q Where data: vector that provides the value to fill the array
q dim: vector that tells the dimensions of the array
q row_size: the number of rows in the array
q column_size : the number of columns in the array
q matrices : the number of matrices in the array
q dimnames : list of names for the dimensions of the array
q row_names : vector with the names for all the rows
q column_names : vector with the names for all the columns
q matrices_names : vector with the names for all the matrices in the array
R objects: Array contd..
q EXAMPLE: Creating a 2-D array
arr <- array(c(1, 2, 3, 4), dim = c(2, 2))
arr
##      [,1] [,2]
## [1,]    1    3
## [2,]    2    4
R objects: Array contd..
q EXAMPLE: Creating two 2-D matrices in one array (two equivalent ways)
arr <- array(c(1:8), dim = c(2, 2, 2))
## or
arr <- c(1:8)
dim(arr) <- c(2, 2, 2)
arr
## , , 1
##      [,1] [,2]
## [1,]    1    3
## [2,]    2    4
##
## , , 2
##      [,1] [,2]
## [1,]    5    7
## [2,]    6    8
R objects: : Array contd..
q EXAMPLE: Naming the dimensions of R arrays
rnames <- c("r1", "r2")
cnames <- c("c1", "c2")
mnames <- c("m1", "m2")
arr <- array(c(1:8), dim = c(2, 2, 2),
             dimnames = list(rnames, cnames, mnames))
arr
## , , m1
##    c1 c2
## r1  1  3
## r2  2  4
##
## , , m2
##    c1 c2
## r1  5  7
## r2  6  8
R objects : Array contd..
q Elements of an array can be accessed using the square brackets to
denote an index.
q There can be 4 types of indices:
1. positive integers
2. negative integers
3. logical values
4. characters
R objects: : Array contd..
EXAMPLE: Accessing the elements of R arrays using positive indices
## , , m1
##    c1 c2
## r1  1  3
## r2  2  4
##
## , , m2
##    c1 c2
## r1  5  7
## r2  6  8
> arr[2, 1, 2]
[1] 6
# The first 2 selects the 2nd row, the 1 selects the 1st column,
# and the last 2 selects the 2nd matrix.
R objects: : Array contd..
EXAMPLE: Accessing the elements of R arrays using negative indices
## , , m1
##    c1 c2
## r1  1  3
## r2  2  4
##
## , , m2
##    c1 c2
## r1  5  7
## r2  6  8
> arr[-1, -2, -2]
[1] 2
# The first -1 removes the 1st row, -2 removes the 2nd column,
# and the last -2 removes the 2nd matrix.
R objects: Array contd..
EXAMPLE: Accessing the elements of R arrays using logical vector indices
## , , m1
##    c1 c2
## r1  1  3
## r2  2  4
##
## , , m2
##    c1 c2
## r1  5  7
## r2  6  8
> arr[c(T, F), c(T, T), c(T, F)]
[1] 1 3
# The first vector selects the 1st row, the second selects both columns,
# and the third selects the 1st matrix.
R objects: Array contd..
EXAMPLE:Accessing the elements of R arrays using
character vector index
rnames=c("r1", "r2")
cnames= c("c1", "c2")
mnames=c("m1", "m2")
arr<-array(c(1:8),dim=c(2,2,2),
dimnames=list(rnames,cnames,mnames))
> arr[c("r1"),c("c2"),c("m1")]
[1] 3
R objects: List
q Lists are objects that consist of an ordered collection of objects.
q Lists are vectors that can contain elements of any type.
EXAMPLE:
List_ex <- list(3, 4)
List_ex
[[1]]
[1] 3
[[2]]
[1] 4

List_ex <- list(3, "four")
List_ex
[[1]]
[1] 3
[[2]]
[1] "four"
R objects: List contd..
q Creating a list using c().
EXAMPLE:
n<-c(1,2,3)
m<-c("one", "two","three")
p<-c(TRUE,FALSE, TRUE)
List_ex<-list(n,m,p)
List_ex
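Printing List_ex shows the three components:
[[1]]
[1] 1 2 3
[[2]]
[1] "one"   "two"   "three"
[[3]]
[1]  TRUE FALSE  TRUE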
R objects: List contd..
q Creating a list with tags.
q You can also name the elements of a list: $ operator is a short-cut for [[,
that works only for named elements
EXAMPLE:
List_ex <- list("list1" = 1, "list2" = "two", "list3" = TRUE)
List_ex
$list1
[1] 1
$list2
[1] "two"
$list3
[1] TRUE
list1, list2 and list3 are called tags, which make it easier to reference the components of the list.
R objects: List contd..
How to access the components of list?
q Lists can be accessed in similar fashion to vectors.
q We can retrieve a list with the single square brackets “[]”.
q Integer, logical or character vectors can be used for indexing
R objects: List contd..
EXAMPLE:
n <- c(1, 2, 3)
m <- c("one", "two", "three")
p <- c(TRUE, FALSE, TRUE)
List_ex <- list(n, m, p, 3)
List_ex

## Retrieving a single member of the list
List_ex[2]
[[1]]
[1] "one"   "two"   "three"

## Retrieving multiple members from the list
List_ex[c(2, 3)]
[[1]]
[1] "one"   "two"   "three"
[[2]]
[1]  TRUE FALSE  TRUE
R objects: List contd..
EXAMPLE:
n <- c(1, 2, 3)
m <- c("one", "two", "three")
p <- c(TRUE, FALSE, TRUE)
List_ex <- list("first" = n, "second" = m, "third" = p)
List_ex$first
[1] 1 2 3
R objects: Factors
q R factors can be of any type.
q Factors can have NA values: if a value that is not in the levels of a factor is entered into it, NA is assigned.
q Factors are closely related to vectors. In fact, factors are stored as integer vectors.
R objects: Factors contd..
Creating a factor:
q Factor can be created using the function factor()
Syntax:
factor_name = factor(x = character(), levels, labels, exclude, ordered, nmax)
x: a vector with the data
levels: an optional vector with unique values that x might take
labels: an optional vector of labels for the levels in the factor
exclude: a set of values that are excluded from the levels of the factor
ordered: a logical value that determines whether the factor is ordered or unordered
nmax: an upper limit on the number of levels
R objects: Factors contd..
EXAMPLE:
n <- factor(c("male", "female"))
n
[1] male female
Levels: female male

n <- factor(c("male", "female", "male", "female"))
n
[1] male female male female
Levels: female male
str(n)
Factor w/ 2 levels "female","male": 2 1 2 1

Using factors with labels is better than using integers because factors are self-describing. Having a variable that has values "Male" and "Female" is better than a variable that has values 1 and 2.
R objects: Factors contd..
Accessing the components of factors:
q Accessing components of a factor is similar to vectors, using positive integers, negative integers, or logical vectors.

n <- factor(c("male", "female", "male", "female"))
# using positive indices
n[3]
n[c(2:4)]
# using negative indices
n[-2]
n[c(-3, -4)]
# using logicals
n[c(T, F, T, T)]
R objects: Factors contd..
Modifying or adding new data or levels of an R factor:
# modifying components
n <- factor(c("male", "female", "male", "female"))
n[3] <- "female"
n
# modifying a component with a value outside the levels
n[3] <- "others"
** a warning is issued and <NA> is assigned **
# adding a new element whose level is not present
n[5] <- "others"
** a warning is issued and <NA> is assigned **
# adding new levels first makes such assignments possible
levels(n) <- c(levels(n), "others")
# (note: assigning a full vector, e.g. levels(n) <- c("male", "female", "others"),
#  relabels the existing levels by position, so its order must match levels(n))
n[5] <- "others"   # now succeeds
R objects: Factors ordered or unordered
q R factors can be classified as ordered or unordered .
q By default, the levels are arranged in alphabetical order and are all
considered equal irrespective of their arrangement
EXAMPLE:
##unordered levels
sizes <- c("s","m","s","l","m","xs","l","m","xl","xxl","s",
"l","xs","xl","m","l")
sizes<-factor(sizes)
sizes
R objects: Factors ordered or unordered
q We can provide a specific order of levels by using:
1. the levels argument
2. the ordered() function
EXAMPLE:
## ordered levels via the levels argument
sizes <- c("s","m","s","l","m","xs","l","m","xl","xxl","s","l","xs","xl","m","l")
sizes <- factor(sizes, levels = c("xs", "s", "m", "l", "xl", "xxl"))
sizes

## ordered levels via ordered()
sizes <- c("s","m","s","l","m","xs","l","m","xl","xxl","s","l","xs","xl","m","l")
sizes <- ordered(sizes, levels = c("xs", "s", "m", "l", "xl", "xxl"))
sizes

We can also convert unordered factors into ordered ones by using the as.ordered() function.
R PROGRAMMING
CS1756
Operators in R
Operators
q R has many operators to carry out different mathematical and logical tasks, including arithmetic, relational, logical, and bitwise operations.
q Operators in R can mainly be classified into the following categories.
1. Arithmetic operators
2. Relational operators
3. Logical operators
4. Assignment operators
Operators: Arithmetic operators
x <- 5
y <- 7
print(x + y)
print(x - y)
print(x / y)
print(x * y)
print(x %% y)
print(y %/% x)
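The expected console output:
[1] 12
[1] -2
[1] 0.7142857
[1] 35
[1] 5
[1] 1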
Operators: Relational operators
q Relational operators are used to compare between values.
Operator Description
< Less than
> Greater than
<= Less than or equal to
>= Greater than or equal to
== Equal to
!= Not equal to
Operators: Relational operators Example
x <- 5
y <- 7
print(x < y)
print(x > y)
print(x <= y)
print(x >= y)
print(x == y)
print(y != x)
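The expected console output:
[1] TRUE
[1] FALSE
[1] TRUE
[1] FALSE
[1] FALSE
[1] TRUE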
Operators on Vector
q All the arithmetic and relational operators work on vectors.
q The variables used in the examples above are single-element vectors.
q Element-wise operations can be carried out on longer vectors created with the c() function.
Vector operation Example
x <- c(5, 3, 2, 1)
y <- c(5, 6, 7, 1, 5)
print(x < y)
print(x > y)
print(x <= y)
print(x >= y)
print(x == y)
print(y != x)
Vector operation Example contd..
x<-c(5,3,2,1)
y<-c(5,6)
x+y
NOTE:
q When there is a mismatch in length (number of elements) of operand vectors, the elements in the shorter one are recycled in a cyclic manner to match the length of the longer one.
q R will issue a warning if the length of the longer vector is not an integral multiple of the
shorter vector.
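For the example above, y is recycled to c(5, 6, 5, 6), so:
x + y
[1] 10  9  7  7
Here no warning is issued, because the longer length (4) is an integral multiple of the shorter (2).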
Operators: Logical Operators
Zero is considered FALSE and non-zero numbers are taken as TRUE. An example run is shown below.
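A small sketch of R's logical operators (element-wise !, &, | and the single-value forms && and ||):
x <- c(TRUE, FALSE, 0, 6)     # 0 coerces to FALSE, 6 to TRUE
y <- c(FALSE, TRUE, FALSE, TRUE)
!x             # FALSE  TRUE  TRUE FALSE
x & y          # FALSE FALSE FALSE  TRUE
x | y          #  TRUE  TRUE FALSE  TRUE
x[1] && y[1]   # FALSE (AND on single values)
x[1] || y[1]   # TRUE  (OR on single values)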
Operators: Assignment Operators
q The following operators are used to assign values to variables.

Operator     Description
<-, <<-, =   Leftwards assignment
->, ->>      Rightwards assignment

• The operators <- and = can be used, almost interchangeably, to assign to a variable in the same environment.
• The <<- operator is used for assigning to variables in the parent environments (more like global assignment).
Operator: Assignment Example
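A small sketch of the different assignment forms:
a <- 10        # leftwards
15 -> b        # rightwards
d = 20         # equals sign
f <- function() {
  g <<- 99     # <<- assigns g in the parent (here, global) environment
}
f()
a; b; d; g
[1] 10
[1] 15
[1] 20
[1] 99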
Assignment
q Write an R program to add two vectors by considering the
constraints:
a) Two vectors of the same length
b) Two vectors of different lengths
R PROGRAMMING
CS1756
Reading and Writing Data
Reading a Data in R
Some primary function used to read a data in R are:
1. Reading data in tabular form:
q read.table()
q read.csv()
Some primary function used to write a data into R are:
1. Writing data in tabular form:
q write.table()
read.table(file, header = FALSE, sep = "", quote = "\"'",
dec = ".", numerals = c("allow.loss", "warn.loss", "no.loss"),
row.names, col.names, as.is = !stringsAsFactors,
na.strings = "NA", colClasses = NA, nrows = -1,
skip = 0, check.names = TRUE, fill = !blank.lines.skip,
strip.white = FALSE, blank.lines.skip = TRUE,
comment.char = "#",
allowEscapes = FALSE, flush = FALSE,
stringsAsFactors = default.stringsAsFactors(),
fileEncoding = "", encoding = "unknown", text, skipNul = FALSE)
read.table() contd..
q file: the name of a file
q header: logical indicating if the file has a header line
q sep: string indicating how the columns are separated
q colClasses: character vector indicating the class of each column in
the dataset
q nrows: the number of rows in the dataset.(By default read.table()
reads an entire file)
q comment.char: character string indicating the comment character.
Ø By default it is "#". If there are no commented lines in your file, it’s worth
setting this to be the empty string "".
q skip: the number of lines to skip from the beginning
read.table() contd..
q stringsAsFactors:
Ø Whether character variables should be coded as factors.
Ø It was set to TRUE by default in R versions before 4.0; since R 4.0 the default is FALSE.
Ø If you always want this to be FALSE, you can set a global option via options(stringsAsFactors = FALSE)
read.table() Example
q If the dataset is small or moderately sized, then you can usually call read.table without specifying any other arguments.
Example:
Data <- read.table("abc.txt")
NOTE:
R will automatically
q skip lines that begin with a #
q figure out how many rows there are (and how much memory needs to be
allocated)
q figure out what type of variable is in each column of the table.
q read.csv is the same as read.table, except that the default separator is a comma.
read.table() Example
test <- read.table("emp.txt", header = TRUE, sep = ",", quote = "\"")
test1 <- read.table("empNa.txt", header = TRUE, sep = ",", quote = "\"",
                    na.strings = "NA",   # na.strings expects character values
                    strip.white = TRUE,
                    comment.char = "$",
                    blank.lines.skip = TRUE)
print(test1)
read.table() Example
employeeNames <- c("empID", "FName", "LName", "Qualification", "Profession","Sal","Sales")
employees <- read.table("empNa.txt", TRUE, sep = ",", quote = "\"",
                        na.strings = "NA",
                        strip.white = TRUE, skip = 3,
                        as.is = c(TRUE, TRUE, FALSE, FALSE, TRUE),
                        col.names = employeeNames,
                        comment.char = "$", blank.lines.skip = TRUE)
print(employees)
str(employees)
#ACCESSING THE ELEMENTS
employees[[1]]
employees$FName
Example data
• empNa.txt
• emp.txt
• airquality.csv
write.table() Example
Example
# Reading a CSV file from the directory
x <- read.csv("airquality.csv", header = TRUE)
# Creating a new csv file that contains only the Month and Day data from airquality.csv
y <- x[, c("Month", "Day")]   # select only the Month and Day columns
write.csv(y, "New.csv")
Variable in R
q A variable in R can store an atomic vector, group of atomic vectors or a
combination of many R objects.
q A valid variable name consists of letters, numbers, and the dot or underscore characters.
q The variable name must start with a letter, or with a dot that is not followed by a number.
Variable in R: Example
q The variables can be assigned values using leftward, rightward and
equal to operator.
EXAMPLE:
# Leftward assignment operator
Var.1 <- c(1:5)
Var.2 <- c("TRUE", "FALSE")
Variable Assignment contd..
EXAMPLE:
# Rightward assignment operator
c(1:5) -> Var.3
c("TRUE", "FALSE") -> Var.4
# Equals operator
Var.5 = c("1", "a", "c")
Control Structures in R
INTRODUCTION
• Control structures in R allow you to control the flow of execution of a series of R expressions depending on runtime conditions.
• They allow you to respond to inputs or to features of the data and execute different R expressions accordingly.
Commonly used control
structures
Commonly used control structures are
• if and else: testing a condition and acting on it
• for: execute a loop a fixed number of times
• while: execute a loop while a condition is true
• repeat: execute an infinite loop (must break out
of it to stop)
• break: break the execution of a loop
• next: skip an iteration of a loop
• return: exit a function
Control Structure: if-else
if(<condition>) {
    ## do something
}

if(<condition>) {
    ## do something
} else {
    ## do something else
}

if(<condition1>) {
    ## do something
} else if(<condition2>) {
    ## do something different
} else {
    ## do something else
}
Control Structure: if-else
EXAMPLE
if(x > 5) {
    y <- 10
} else {
    y <- 0
}

## equivalently, since if is an expression it can be assigned:
y <- if(x > 5) {
    10
} else {
    0
}
Control Structure: for loop
• In R, for loops take an iterator variable and assign it successive values from a sequence or vector.
• For loops are most commonly used for
iterating over the elements of an object (list,
vector, etc.)
Control Structure: for loop
EXAMPLE
for(i in 1:10)
{
    print(i)
}

Output:
[1] 1
[1] 2
[1] 3
[1] 4
[1] 5
[1] 6
[1] 7
[1] 8
[1] 9
[1] 10

This loop takes the i variable and in each iteration of the loop gives it the values 1, 2, 3, ..., 10, executes the code within the curly braces, and then the loop exits.
Control Structure: for loop
EXAMPLE
x <- c("a", "b", "c", "d")
for(i in 1:4)
{
    print(x[i])
}

## equivalently
for(i in seq_along(x))
{
    print(x[i])
}

Output:
[1] "a"
[1] "b"
[1] "c"
[1] "d"

The seq_along() function is commonly used in conjunction with for loops in order to generate an integer sequence based on the length of an object (in this case, the object x).
Control Structure: for loop
EXAMPLE
x <- c("a", "b", "c", "d")
for(letter in x)
{
    print(letter)
}

Output:
[1] "a"
[1] "b"
[1] "c"
[1] "d"

It is not necessary to use an index-type variable.
Control Structure: nested for loop
• for loops can be nested inside one another, i.e. one loop runs inside the body of another.
• Nested for loops are commonly used for iterating over the elements of two-dimensional objects such as matrices.
Control Structure: nested for loop
EXAMPLE
x<-matrix(1:6,2,3) [1] 1
for(i in seq_len(nrow(x))) [1] 3
{ [1] 5
for(j in [1] 2
seq_len(ncol(x))) [1] 4
print(x[i,j]) [1] 6
}
Control Structure: while loop
• While loops begin by testing a condition. If it
is true, then they execute the loop body. Once
the loop body is executed, the condition is
tested again, and so forth, until the condition
is false, after which the loop exits
Control Structure: while loop
EXAMPLE
c <- 0
while(c < 10)
{
    print(c)
    c <- c + 1
}

Output:
[1] 0
[1] 1
[1] 2
[1] 3
[1] 4
[1] 5
[1] 6
[1] 7
[1] 8
[1] 9
Control Structure: repeat
• repeat initiates an infinite loop right from the
start. These are not commonly used in
statistical or data analysis applications but
they do have their uses.
• The only way to exit a repeat loop is to call
break.
Control Structure: repeat
EXAMPLE
x0 <- 1
tol <- 1e-8
repeat {
    x1 <- computeEstimate()   # placeholder for some estimation step
    if(abs(x1 - x0) < tol) {
        break
    } else {
        x0 <- x1
    }
}
Control Structure: next, break, return
• next is used to skip an iteration of a loop.
• break is used to exit a loop immediately, regardless of what iteration the loop may be on.
• return signals that a function should exit and return a given value.
Control Structure: next, break
EXAMPLE
for(i in 1:100) {
    if(i <= 20) {
        ## Skip the first 20 iterations
        next
    }
    print(i)
}

for(i in 1:100) {
    print(i)
    if(i > 20) {
        ## Stop the loop after 20 iterations
        break
    }
}
R PROGRAMMING
CS1756
Functions in R
Function in R
q Functions in R are “first class objects”, which means that they can be
treated much like any other R object.
q Functions can be passed as arguments to other functions. This is
very handy for the various apply functions, like lapply() and sapply().
q Functions can be nested, so that you can define a function inside of
another function.
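A small sketch of passing a function as an argument, here to sapply():
square <- function(n) {
    n^2
}
sapply(1:5, square)   # applies square() to each element
## [1]  1  4  9 16 25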
Creating a Function
q Functions are defined using the function() directive and are
stored as R objects just like anything else.
EXAMPLE:
## with no arguments
fun <- function() {
}
fun()   # an empty body returns NULL
Creating a Function contd..
EXAMPLE:
##with arguments
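A minimal sketch of a function with arguments:
fun <- function(a, b) {
    a + b
}
fun(2, 3)
## [1] 5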
Creating a Function contd..
EXAMPLE:
##with no default value
The user must specify the value of the argument num. If it is not specified by the user, R
will throw an error.
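A sketch consistent with the description above (the argument name num comes from the text; the body is illustrative):
fun <- function(num) {
    num^2
}
fun(4)   # [1] 16
fun()    # Error: argument "num" is missing, with no default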
Creating a Function contd…
EXAMPLE:
##with default value
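A matching sketch with a default value:
fun <- function(num = 2) {
    num^2
}
fun()    # uses the default: [1] 4
fun(5)   # [1] 25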
Creating a Function contd…
EXAMPLE:
##with return
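A sketch using an explicit return():
fun <- function(num) {
    if (num > 0) {
        return("positive")
    }
    return("non-positive")
}
fun(3)    # [1] "positive"
fun(-3)   # [1] "non-positive"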
q Calling an R function with arguments can be done in a
variety of ways
q R functions arguments can be matched positionally or by
name.
q Positional matching just means that R assigns the first
value to the first argument, the second value to second
argument and so on.
Example:
fun <- function(a, b, c) {
    a + 10 * b + 100 * c
}
fun(1, 2, 3)   # a = 1, b = 2, c = 3
Arguments matching in R contd…
q When specifying the function arguments by name, it doesn’t
matter in what order you specify them
Example:
fun <- function(a = 1, b = 2) {
    c(a = a, b = b)
}
fun(b = 10, a = 5)   # order does not matter: a = 5, b = 10
Arguments matching in R contd.…
q Positional matching can be mixed with matching by name.
q When an argument is matched by name, it is “taken out” of
the argument list and the remaining unnamed arguments are
matched in the order that they are listed in the function
definition.
Example:
fun <- function(a = 1, b) {
    c(a = a, b = b)
}
fun(10, a = 5)   # a is matched by name (5); the remaining 10 goes to b
Lazy Evaluation
q Arguments to functions are evaluated lazily, so they are evaluated only as needed in the body of the function.
q Example:
fun <- function(a, b) {
    a^3
}
fun(2)
## [1] 8

This function never actually uses the argument b, so calling fun(2) will not produce an error, because the 2 gets positionally matched to a.
String Manipulation
q R provides a set of built-in string manipulation functions. Using them, strings can be constructed with definite patterns or even at random, and can be changed and modified in any desired way.
String Manipulation Functions:
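A few commonly used base R string functions, as an assumed illustration of the functions the slide refers to:
s <- "Hello, R!"
nchar(s)                # number of characters: 9
toupper(s)              # "HELLO, R!"
substr(s, 1, 5)         # "Hello"
paste(s, "Welcome.")    # "Hello, R! Welcome."
gsub("R", "world", s)   # "Hello, world!"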
q A random forest (or random forests) is an ensemble classifier that consists of many decision trees and outputs the class that is the mode of the classes output by the individual trees.
qThe term came from random decision forests that was first proposed by
Tin Kam Ho of Bell Labs in 1995.
qDecision trees are individual learners that are combined. They are one of
the most popular learning methods commonly used for data exploration.
qOne type of decision tree is called CART(classification and regression
tree).
[Diagram: a data set is divided into N subsets by random sampling of rows and columns ("RS + CS"); each subset trains one of N decision trees, producing Output 1 ... Output N; the final output is decided by majority vote.]
DECISION TREE
EXAMPLE
Random Forest
install.packages("party")
library(party)
print(head(readingSkills))

library(randomForest)
# Create the forest.
output.forest <- randomForest(nativeSpeaker ~ age + shoeSize + score,
                              data = readingSkills)
# View the forest results.
print(output.forest)

Out-of-bag (OOB) error is a method of measuring the prediction error of random forests, boosted decision trees, and other machine learning models utilizing bootstrap aggregating (bagging) to sub-sample the data samples used for training.
Random Forest
require(randomForest)
require(MASS)   # package which contains the Boston housing dataset
attach(Boston)
set.seed(101)
dim(Boston)
# The model-fitting line was implied; a plausible fit predicts the median
# home value medv from all other variables:
Boston.rf <- randomForest(medv ~ ., data = Boston)
plot(Boston.rf)
Advantages of Random Forest
q It overcomes the problem of overfitting by averaging or combining the results of different decision trees.
q Random forests work well for a larger range of data items than a single decision tree does.
q A random forest has less variance than a single decision tree.
q Random forests are very flexible and possess very high accuracy.
q Scaling of data is not required by the random forest algorithm.
q Random forest algorithms maintain good accuracy even when a large proportion of the data is missing.
Disadvantages of Random Forest
q Complexity is the main disadvantage of random forest algorithms.
q Construction of random forests is much harder and more time-consuming than for decision trees.
q More computational resources are required to implement the random forest algorithm.
q It is less intuitive when we have a large collection of decision trees.
q The prediction process using random forests is very time-consuming in comparison with other algorithms.
R PROGRAMMING
CS1756
Decision Tree Example
Steps follows by Decision Tree Algorithm
ØStep 1: Select the feature (predictor variable)
that best classifies the data set into the desired
classes and assign that feature to the root node.
ØStep 2: Traverse down from the root node,
whilst making relevant decisions at each
internal node such that each internal node best
classifies the data.
ØStep 3: Route back to step 1 and repeat until
you assign a class to the input data.
Steps follows by Decision Tree Algorithm
q There is a popular R package known as rpart which is used to create decision trees in R.
q We will use recursive partitioning as well as conditional partitioning to build our decision tree.
q R builds decision trees as a two-stage process as follows:
1. Performing the identification of a single variable that best splits the data into groups.
2. Applying the above process to each subgroup until the subgroups reach a minimum size or no improvement in a subgroup is shown.
Example :
To create a decision tree for the readingSkills dataset
Step 1: Install the required packages and load the required libraries.
install.packages("party")
install.packages("caTools")
library(datasets)
library(caTools)
library(party)
library(dplyr)
library(magrittr)
Step 2: Load the dataset readingSkills and execute
head(readingSkills)
>data("readingSkills")
>head(readingSkills)
Step 3: Splitting the dataset in a 4:1 ratio into training (80%) and test (20%) data
>sample_data = sample.split(readingSkills, SplitRatio = 0.8)
>train_data <- subset(readingSkills, sample_data == TRUE)
>test_data <- subset(readingSkills, sample_data == FALSE)
Step 4: Create the decision tree model using ctree and plot the
model:
syntax: ctree(formula, data)
/*formula describes the predictor and response variables and data is
the data set used.*/
model<- ctree(nativeSpeaker ~ ., train_data)
plot(model)
Step 5: Making a prediction
# testing the people who are native speakers or not
predict_model<-predict(model, test_data)
# creates a table to count how many are classified native
speakers or not
m_at <- table(test_data$nativeSpeaker, predict_model)
m_at
Step 6: Determining the accuracy of the model developed
ac_Test <- sum(diag(m_at)) / sum(m_at)
print(paste('Accuracy for test is found to be', ac_Test))
[1] "Accuracy for test is found to be 0.74"
R PROGRAMMING
CS1756
Statistical measures in R
Statistics and R
qR Statistics concerns with data; their collection, analysis, and
interpretation.
qIt has the following two types:
1. Descriptive statistics
2. Inferential statistics
Descriptive Statistics
q It is about providing a description of the data.
q It makes the data easier to understand and gives us knowledge about the data which is necessary to perform further analysis.
q Central tendency: it describes the point around which the values in the set cluster (e.g. mean, median, mode).
q Variability: it refers to the scatter or the spread of values in the set.
q These two components give you a fair estimate of what the data means.
q There are several parameters included in these components that a descriptive statistics report comprises.
Descriptive Statistics: Central Tendency
Sl No  Parameter            Description
1      Mean                 Average of all the numerical observations in a dataset
2      Median               Midpoint that separates the data evenly into two halves; unlike the mean, it is not sensitive to extreme values and outliers
3      Mode                 The observation that occurs most frequently in the dataset; the mode can be taken for nominal data as well as numerical
4      Range                Difference between the extremes of your data
5      Interquartile Range  The central 50% of the data, between the 25th and 75th percentiles
6      Standard Deviation   Estimates the variation in numerical observations
7      Variance             Measures how spread out or scattered values are from the mean; the standard deviation squared is the variance
8      Skewness             How symmetric the data is around the average; depending on where the extreme values lie, the data may have a positive or negative skew
9      Kurtosis             Estimates how peaked or flat the distribution is; a normal distribution curve may be peaked or flat, and kurtosis estimates this property of the data
Descriptive Statistics Central tendency:
Example
##import built-in dataset of R "warpbreaks"
data(warpbreaks)
##summary()function is one of the widely used descriptive analysis
##It gives the range, mean, median and interquartile range
summary(warpbreaks)
##mean:
# enter a list in such functions instead of a data frame, as could be used with "summary()"
mean(warpbreaks$breaks)
##min
min(warpbreaks$breaks)
##max
max(warpbreaks$breaks)
##median
median(warpbreaks$breaks)
##range
range(warpbreaks$breaks)
##standard deviation:
sd(warpbreaks$breaks)
##Variance
var(warpbreaks$breaks)
Inferential statistics
qInferential statistics are used to draw inferences from the sample of
a huge data set.
qRandom samples of data are taken from a population, which are
then used to describe and make inferences and predictions about
the population.
qAlso, a conclusion is drawn about the larger population from a data
of a much smaller sample.
qCentral Limit Theorem, Hypothesis Testing, ANOVA are some of the
inferential statistics techniques.
Inferential Statistics: Central Limit Theorem (CLT)
qThe CLT states that, given a sufficiently large sample size from
a population, the mean of all samples from the same population
will be approximately equal to the mean of the original
population.
qIt also states that as you increase the number of samples and
the sample size, the distribution of all of the sample means will
approximate a normal distribution no matter what the population
distribution is.
qThis distribution is referred to as the “sampling distribution.”
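A quick simulation sketch of the CLT (assumptions of the sketch: an exponential population, 1,000 samples of size 40):
set.seed(1)
# 1000 sample means, each from 40 draws of a skewed (exponential) population
sample_means <- replicate(1000, mean(rexp(40, rate = 1)))
hist(sample_means, main = "Sampling distribution of the mean")
mean(sample_means)   # close to the population mean of 1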
Descriptive Statistics: z-score
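A z-score measures how many standard deviations a value x lies from the mean: z = (x − μ) / σ. With sample data, the sample mean and standard deviation are used, as in the example below.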
Descriptive Statistics: EXAMPLE
##Z-score
data <- c(6, 7, 7, 12, 13, 13, 15, 16, 19, 22)
z_scores <- (data-mean(data))/sd(data)
z_scores
qMean=13
qThe first raw data value of “6” is 1.323 standard deviations below the
mean.
qThe fifth raw data value of “13” is 0 standard deviations away from the
mean, i.e. it is equal to the mean.
qThe last raw data value of “22” is 1.701 standard deviations above the
mean.
Linear Regression
Using R programming
What is linear regression?
• Linear regression is used to predict the value of a continuous variable
Y based on one or more input predictor variables X.
• The aim is to establish a mathematical formula between the response
variable (Y) and the predictor variables (Xs). You can use this formula
to predict Y, only when X values are known.
Y = A + B·X + ϵ
where A is the intercept and B is the slope. Collectively, they are called the regression coefficients. ϵ is the error term, the part of Y the regression model is unable to explain.
Introduction
[Figure: a regression line of Y against the predictor X; the slope B is the change in Y for a 1-unit increase in X.]
Example Problem
• Use the cars dataset that comes with R by default.
• You can access this dataset simply by typing in cars in your R console.
You will find that it consists of 50 observations(rows) and 2 variables
(columns) – dist and speed.
• Let's print out the first six observations here:
> head(cars)
speed dist
1 4 2
2 4 10
3 7 4
4 7 22
5 8 16
6 9 10
Applying the summary functions:
> summary(cars)
speed dist
Min. : 4.0 Min. : 2.00
1st Qu.:12.0 1st Qu.: 26.00
Median :15.0 Median : 36.00
Mean :15.4 Mean : 42.98
3rd Qu.:19.0 3rd Qu.: 56.00
Max. :25.0 Max. :120.00
To find the mean of a subset:
> mean(cars$speed)
[1] 15.4
> mean(cars$dist)
[1] 42.98

Or use attach() to do the same:
> attach(cars)
> mean(speed)
[1] 15.4
> mean(dist)
[1] 42.98
Graphical Analysis
• Before we begin building the regression model, it is a good practice to
analyze and understand the variables.
• The graphical analysis and correlation study below will help with this.
Graphical Analysis:
• The aim of this exercise is to build a simple regression model that we
can use to predict Distance (dist) by establishing a statistically
significant linear relationship with Speed (speed).
Graphical Analysis
• Scatter plots can help visualize any linear relationships between the
dependent (response) variable and independent (predictor) variables.
• If you have multiple predictor variables, a scatter plot is drawn for each of them against the response, along with the line of best fit, as seen below.
scatter.smooth(x=cars$speed, y=cars$dist, main="Dist ~ Speed")
• The scatter plot along with the smoothing
line above suggests a linearly increasing
relationship between the ‘dist’ and ‘speed’
variables.
• This is a good thing, because, one of the
underlying assumptions in linear regression
is that the relationship between the
response and predictor variables is linear
and additive.
BoxPlot – Check for outliers
• Generally, any datapoint that lies outside the 1.5 * interquartile-range
(1.5 * IQR) is considered an outlier, where, IQR is calculated as the
distance between the 25th percentile and 75th percentile values for
that variable.
# divide graph area in 2 columns
>par(mfrow=c(1, 2))
# box plot for 'speed'
boxplot(cars$speed, main="Speed", sub=paste("Outlier rows: ",
boxplot.stats(cars$speed)$out))
# box plot for 'distance'
boxplot(cars$dist, main="Distance", sub=paste("Outlier rows: ",
boxplot.stats(cars$dist)$out))
Density plot
• Check if the response variable is close to normality
> library(e1071)
# divide graph area in 2 columns
> par(mfrow = c(1, 2))
# density plot for 'speed'
> plot(density(cars$speed), main = "Density Plot: Speed", ylab = "Frequency",
       sub = paste("Skewness:", round(e1071::skewness(cars$speed), 2)))
> polygon(density(cars$speed), col = "red")
# density plot for 'dist'
> plot(density(cars$dist), main = "Density Plot: Distance", ylab = "Frequency",
       sub = paste("Skewness:", round(e1071::skewness(cars$dist), 2)))
> polygon(density(cars$dist), col = "red")
Correlation
• Correlation is a statistical measure that suggests the level of linear dependence between two variables that occur in pairs, just like what we have here with speed and dist.
• Correlation can take values between -1 to +1. If we observe for every
instance where speed increases, the distance also increases along with it,
then there is a high positive correlation between them and therefore the
correlation between them will be closer to 1.
• The opposite is true for an inverse relationship, in which case, the
correlation between the variables will be close to -1.
• A value closer to 0 suggests a weak relationship between the variables.
• A low correlation (-0.2 < x < 0.2) probably suggests that much of variation
of the response variable (Y) is unexplained by the predictor (X), in which
case, we should probably look for better explanatory variables.
>cor(cars$speed, cars$dist)
> [1] 0.8068949
Build Linear Model
• Now that we have seen the linear relationship pictorially in the
scatter plot and by computing the correlation.
• The function used for building linear models is lm().
• The lm() function takes in two main arguments, namely: 1. Formula 2.
Data. The data is typically a data.frame and the formula is a object of
class formula.
• lets see the syntax for building the linear model.
# build linear regression model on full data
linearMod <- lm(dist ~ speed, data=cars)
print(linearMod)
Build Linear Model
# build linear regression model on full data
> linearMod <- lm(dist ~ speed, data = cars)
> print(linearMod)

Call:
lm(formula = dist ~ speed, data = cars)

Coefficients:
(Intercept)        speed
    -17.579        3.932

• Now that we have built the linear model, we have also established the relationship between the predictor and the response in the form of a mathematical formula for Distance (dist) as a function of speed.
• In the output above, you can notice the 'Coefficients' part having two components: Intercept: -17.579, speed: 3.932.
• These are also called the beta coefficients. In other words,
dist = Intercept + (β ∗ speed)
=> dist = −17.579 + 3.932 ∗ speed
Linear Regression Diagnostics
• Now the linear model is built and we have a formula that we can use
to predict the dist value if a corresponding speed is known.
• Is this enough to actually use this model?
• NO! Before using a regression model, you have to ensure that it is
statistically significant. How do you ensure this?
• Lets begin by printing the summary statistics for linearMod.
> summary(linearMod)
Call:
lm(formula = dist ~ speed, data = cars)
Residuals:
Min 1Q Median 3Q Max
-29.069 -9.525 -2.272 9.215 43.201
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -17.5791 6.7584 -2.601 0.0123 *
speed 3.9324 0.4155 9.464 1.49e-12 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 15.38 on 48 degrees of freedom
Multiple R-squared: 0.6511, Adjusted R-squared: 0.6438
F-statistic: 89.57 on 1 and 48 DF, p-value: 1.49e-12
The p Value: Checking for statistical significance
• The summary statistics above tells us a number of things. One of them is
the model p-Value (bottom last line) and the p-Value of individual predictor
variables (extreme right column under ‘Coefficients’).
• The p-values are very important because we can consider a linear model to be statistically significant only when both these p-values are less than the pre-determined statistical significance level, which is ideally 0.05.
• This is visually interpreted by the significance stars at the end of the row.
• The more the stars beside the variable’s p-Value, the more significant the
variable.
Exercise
• Build a linear regression for the datasets chickwts, beaver1.
Hypothesis Testing in R
What is hypothesis testing?
Null and Alternative Hypothesis
q When a predetermined number of subjects in a hypothesis test prove the "alternative hypothesis," then the original hypothesis (the "null hypothesis") is overturned or "rejected."
q You must decide the level of statistical significance in your hypothesis, as you can never be 100 percent confident in your findings.
q First, let's examine the steps to test a hypothesis.
How to Test a Hypothesis
State your null hypothesis.
qThe null hypothesis is a commonly accepted
fact.
qIt's the default, or what we'd believe if the
experiment was never conducted.
q It's the least exciting result, showing no
significant difference between two or more
groups.
qResearchers work to nullify or disprove null
hypotheses.
How to Test a Hypothesis
State an alternative hypothesis.
qYou'll want to prove an alternative
hypothesis.
q This is the opposite of the null hypothesis,
demonstrating or supporting a statistically
significant result.
qBy rejecting the null hypothesis, you
accept the alternative hypothesis.
How to Test a Hypothesis
Determine a significance level.
q This is the determiner, also known as the alpha (α).
q It defines the probability that the null hypothesis will be rejected.
q A typical significance level is set at 0.05 (or 5%).
q You may also see 0.1 or 0.01, depending on the area of study.
q If you set the alpha at 0.05, then there is a 5% chance you'll find support
for the alternative hypothesis (thus rejecting the null hypothesis) when,
in truth, the null hypothesis is actually true and you were wrong to reject
it.
q In other words, the significance level is a statistical way of
demonstrating how confident you are in your conclusion. If you set a
high alpha (0.25), then you'll have a better shot at supporting your
alternative hypothesis, since you don't need to find as big a difference
between your test groups.
q However, you'll also have a bigger chance at being wrong about your
conclusion.
How to Test a Hypothesis
Calculate the p-value.
q The p-value, or calculated probability, indicates
the probability of achieving the results of the null
hypothesis.
q While the alpha is the significance level you're
trying to achieve, the p-level is what your actual
data is showing when you calculate it.
• High P-Values: Your data are likely with a
true null
• Low P-Values: Your data are unlikely with a
true null
How to Test a Hypothesis
Draw a conclusion.
qIf your p-value meets your significance level
requirements, then your alternative
hypothesis may be valid and you may reject
the null hypothesis.
qIn other words, if your p-value is less than
your significance level (e.g., if your calculated
p-value is 0.02 and your significance level is
0.05), then you can reject the null hypothesis
and accept your alternative hypothesis.
How to Determine a p-Value When
Testing a Null Hypothesis?
qWhen you test a hypothesis about
a population, you can use your test
statistic to decide whether to reject the null
hypothesis, H0.
qYou make this decision by coming up with
a number, called a p-value.
p-Value
• A p-value is a probability associated with your
critical value. It measures the chance of getting
results at least as strong as yours if the claim
(H0) were true.
• The following figure shows the locations of a
test statistic and their corresponding
conclusions.
To find the p-value for your test
statistic:
• Look up your test statistic on the appropriate
distribution — in this case, on the standard normal
(Z-) distribution (see the Z-table).
• Find the probability that Z is beyond (more extreme than)
your test statistic:
1. If Ha contains a less-than alternative, find the probability
that Z is less than your test statistic (that is, look up your
test statistic on the Z-table and find its corresponding
probability). This is the p-value. (Note: In this case, your
test statistic is usually negative.)
To find the p-value for your test
statistic:
2. If Ha contains a greater-than alternative, find the probability
that Z is greater than your test statistic (look up your test
statistic on the Z-table, find its corresponding probability, and
subtract it from one). The result is your p-value. (Note: In this
case, your test statistic is usually positive.)
To find the p-value for your test
statistic:
3. If Ha contains a not-equal-to alternative, find the probability
that Z is beyond your test statistic and double it. There are two
cases:
Ø If your test statistic is negative, first find the probability
that Z is less than your test statistic (look up your test statistic
on the Z-table and find its corresponding probability). Then
double this probability to get the p-value.
Ø If your test statistic is positive, first find the probability
that Z is greater than your test statistic (look up your test
statistic on the Z-table, find its corresponding probability, and
subtract it from one). Then double this result to get the p-
value.
Confidence level   One-tailed test (α, z)   Two-tailed test (α per tail)
0.90               0.10 (z = 1.28)          0.05
0.95               0.05 (z = 1.645)         0.025
0.98               0.02 (z = 2.05)          0.01
0.99               0.01 (z = 2.33)          0.005
One-tailed Test
• A one-tailed test is a statistical hypothesis test in which the critical area of a distribution is one-sided, so that it is either greater than or less than a certain value, but not both.
• If the sample being tested falls into the one-sided
critical area, the alternative hypothesis will be
accepted instead of the null hypothesis.
• A one-tailed test is also known as a directional
hypothesis or directional test.
Two-Tailed Test
• A two-tailed test is a method in which the critical area of a distribution is two-sided and tests whether a sample is greater than or less than a certain range of values.
• If the sample being tested falls into either of the
critical areas, the alternative hypothesis is
accepted instead of the null hypothesis.
• By convention, two-tailed tests are used to
determine significance at the 5% level,
meaning each side of the distribution is cut at
2.5%
[Figure: left-tailed and right-tailed rejection regions of a distribution.]
Test Statistic
• Different hypothesis tests use different test
statistics based on the probability model
assumed in the null hypothesis. Common
tests and their test statistics include:
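For reference, the usual pairings are: z-test → z-statistic, t-test → t-statistic, ANOVA → F-statistic, chi-square test → χ² statistic.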
Types of errors
There are two types of errors that relate to incorrect
conclusions about the null hypothesis.
(a)Type-I Error:
q Type-I error occurs when the sample results, lead to
the rejection of the null hypothesis when it is in fact
true.
q Type-I errors are equivalent to false positives.
q Type-I errors can be controlled.
q The value of alpha, which is related to the level of significance that we selected, has a direct bearing on Type-I errors.
Types of errors
(b) Type-II Error:
qType-II error occurs when based on the
sample results, the null hypothesis is not
rejected when it is in fact false.
qType-II errors are equivalent to false
negatives.
Types of errors
Measured or Perceived vs. Actual:
                   Actual TRUE                      Actual FALSE
Perceived TRUE     True positive                    False positive (Type I error)
Perceived FALSE    False negative (Type II error)   True negative
Let,
Null Hypothesis H₀: μ = 20
Alternative Hypothesis, H₁ or Ha : μ > 20
The interpretation of the p-value as
follows:
• Now that our p-value is 3%, which is less than the significance level α (we are definitely below the threshold of committing a Type-I error), the probability of obtaining a sample statistic as extreme as this (x̄ >= 25), given that H₀ is true, is very small.
• In other words, we can't plausibly obtain our sample statistic as long as we assume H₀ is true. Hence, we reject the null hypothesis H₀ and accept the alternative hypothesis Ha.
The interpretation of the p-value as
follows:
• Suppose you get a p-value of 6%, i.e. the probability of obtaining a sample statistic as extreme as this, given that the null hypothesis is true, is higher. Compared with α, we can't take the risk of committing a Type-I error beyond the agreed level of significance. Hence, we fail to reject the null hypothesis and reject the alternative hypothesis.
One Sample Hypothesis Testing Example
Step 1: State the Null hypothesis.
The accepted fact is that the population mean is
100, so:
H0: μ=100.
Step 2: State the Alternate Hypothesis.
The claim is that the students have above
average IQ scores, so:
H1: μ > 100.
The fact that we are looking for scores “greater
than” a certain point means that this is a one-
tailed test.
One Sample Hypothesis Testing Example
Step 3: Draw a picture to help you
visualize the problem.
One Sample Hypothesis Testing
Example
Step 4: State the alpha level. If you aren’t given an alpha level, use
5% (0.05).
Step 5: Find the rejection region area (given by your alpha level above)
from the z-table. An area of 0.05 is equal to a z-score of 1.645.
Step 6: Find the test statistic using this formula: z = (x̄ − μ) / (σ / √n).
For this set of data: z = (112.5 − 100) / (15/√30) = 4.56.
Step 7: If Step 6 is greater than Step 5, reject the null hypothesis. If it’s
less than Step 5, you cannot reject the null hypothesis.
In this case, it is greater (4.56 > 1.645), so you can reject the null.
Conclusion: There is sufficient evidence to reject the null hypothesis, so the claim that the students have above-average IQ scores is accepted.
Student one sample t-test using R
SYNTAX:
t.test(x, mu = 0)
where x is the name of our variable of interest and mu is
set equal to the mean specified by the null hypothesis
One test Example:
• Determine if the average life of a bulb from
brand A is 10 years or not.
q In this case, when you want to check if the
sample mean represents the population mean,
then you should run One Sample t-test
One test Example:
• Determine if the average life of a bulb from
brand A is 10 years or not.
> set.seed(100)
> x<-rnorm(30,mean = 10,sd=1)
> t.test(x,mu=10)
One test Example:
p-value > alpha
Thus we fail to reject the null hypothesis. Here the null hypothesis was that the average life of the bulb is 10 years, and the alternative hypothesis was that it is not equal to 10.
Student one sample t-test using
R example
• The average IQ of the adult population is 100. A researcher believes the average IQ of adults is lower. A random sample of 4 adults are tested and score 69, 79, 89, 109.
• Is there enough evidence to suggest the average IQ is lower?
Student one sample t-test using R
H0: μ = 100
H1: μ < 100
(If mu is not specified, t.test() will test against zero.)
> data=c(69,79,89,109)
> t.test(data)
One Sample t-test
data: data
t = 10.13, df = 3, p-value = 0.002049
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
59.32469 113.67531
sample estimates:
mean of x
86.5
Student one sample t-test using R
> t.test(data, mu = 100, alternative = c("less"))   # left-tailed test

One Sample t-test
data: data
t = -1.581, df = 3, p-value = 0.106
alternative hypothesis: true mean is less than 100
95 percent confidence interval:
     -Inf 106.5957
sample estimates:
mean of x
     86.5

Since the p-value > 0.05, we fail to reject the null hypothesis.
Student one sample t-test using R
> t.test(data, mu = 100, alternative = c("two.sided"))   # two-sided test
One Sample t-test
data: data
t = -1.581, df = 3, p-value = 0.212
alternative hypothesis: true mean is not equal to 100
95 percent confidence interval:
59.32469 113.67531
sample estimates:
mean of x
86.5
Student one sample t-test using R
> t.test(data, mu = 100, alternative = c("two.sided"), conf.level = 0.99)   # the confidence level can be changed
One Sample t-test
data: data
t = -1.581, df = 3, p-value = 0.212
alternative hypothesis: true mean is not equal to 100
99 percent confidence interval:
36.62374 136.37626
sample estimates:
mean of x
86.5
When to use z-test or t-test?
Z-test:
• Z-tests are statistical calculations that can be used to compare population means to a sample's.
• The z-score tells you how far, in standard deviations, a data point is from the mean or average of a data set.
• A z-test compares a sample to a defined population and is typically used for dealing with problems relating to large samples (n > 30).
• Z-tests can also be helpful when we want to test a hypothesis.
• Generally, they are most useful when the standard deviation is known.

T-test:
• t-tests are calculations used to test a hypothesis, but they are most useful when we need to determine if there is a statistically significant difference between two independent sample groups.
• In other words, a t-test asks whether a difference between the means of two groups is unlikely to have occurred because of random chance.
• Usually, t-tests are most appropriate when dealing with problems with a limited sample size (n < 30).
• A t-test is used when the population parameters (mean and standard deviation) are not known.
When to use z-test or t-test?
Z-test: z = (x̄ − μ) / (σ / √n), where
x̄ = sample mean
μ = population mean
σ = population standard deviation
n = sample size (σ / √n is the standard error)

T-test: the formula is analogous, t = (x̄ − μ) / (s / √n), with the sample standard deviation s in place of σ.
Variation of t-test
There are three versions of t-test
1. Independent samples t-test which compares mean
for two groups
2. Paired sample t-test which compares means from
the same group at different times
3. One sample t-test which tests the mean of a
single group against a known mean.
Refer for example:
https://www.statisticshowto.datasciencecentral.com/probability-and-statistics/t-test/
Two sample t-test using R
To conduct a paired-samples test, we need
either two vectors of data, or we need one
vector of data with a second that serves as a
binary grouping variable. The test is then run
using the syntax :
t.test(x1, x2, paired=TRUE)
Two sample t-test using R
Ø data1=c(60,70,100,120)
Ø data=c(20, 30,40,150)
> t.test(data,data1)
Welch Two Sample t-test   ## i.e., an independent-samples t-test
t = -0.82681, df = 4.19, p-value = 0.4528
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-118.21718 63.21718
sample estimates:
mean of x mean of y
60.0 87.5
Two sample t-test using R
> t.test(data,data1,paired=T)
Paired t-test
data: data and data1
t = -1.3933, df = 3, p-value = 0.2578
alternative hypothesis: true difference in means
is not equal to 0
95 percent confidence interval:
-90.3147 35.3147
sample estimates:
mean of the differences
-27.5
Two sample t-test using R
> t.test(data, data1, paired = FALSE)   ## independent samples
Welch Two Sample t-test
data: data and data1
t = -0.82681, df = 4.19, p-value = 0.4528
alternative hypothesis: true difference in means is not
equal to 0
95 percent confidence interval:
-118.21718 63.21718
sample estimates:
mean of x mean of y
60.0 87.5
PAIRED SAMPLE t-test
q The null hypothesis assumes that the true mean difference between the paired samples is zero.
q H₁: Pre-placement training affected the participants' knowledge.
EXAMPLE:
PAIRED SAMPLE t-test
Ø before_score=c(12.2, 14.6, 13.4, 11.2, 12.7,
10.4, 15.8, 1.9, 9.5, 14.2)
data: score by time
t = 2.272, df = 9, p-value = 0.0246
alternative hypothesis: true difference in means is greater than 0
95 percent confidence interval:
 0.1043169       Inf
sample estimates:
mean of the differences
                   0.54

The results show that the p-value is lower than 0.05. The lower the p-value, the less evidence we have to support the null hypothesis. Based on this result, we reject the null hypothesis of no difference: the pre-placement training significantly improved the participants' knowledge.
ANOVA
'density', 'block', and 'fertilizer' are listed as categorical variables with the number of observations at each level (i.e. 48 observations at density 1 and 48 observations at density 2).
'yield' should be a quantitative variable with a numeric summary (minimum, median, mean, maximum).
Example
The p-value of the fertilizer variable is low (p < 0.001), so it appears that
the type of fertilizer used has a real impact on the final crop yield.
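A minimal sketch of how such an ANOVA could be fit in R, assuming a data frame crop.data with columns yield, fertilizer, density, and block (names inferred from the description above):
crop.aov <- aov(yield ~ fertilizer + density + block, data = crop.data)
summary(crop.aov)   # the fertilizer row carries the p < 0.001 result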
Chi-Square Test
q Chi-square test is used to compare categorical
variables.
q There are two types of chi-square test:
q 1. Goodness of fit test, which determines whether a sample
matches the population.
q 2. A chi-square test of independence, which compares two
variables in a contingency table to check whether they are
related (i.e., whether the data fit).
q A small chi-square value means that the data fit.
q A high chi-square value means that the data do not fit.
The hypothesis being tested for
chi-square is:
• Null: Variable A and Variable B are independent
• Alternate: Variable A and Variable B are not
independent.
• The statistic used to measure significance, in this case,
is called chi-square statistic. The formula used for
calculating the statistic is
Χ² = Σ [ (O(r,c) − E(r,c))² / E(r,c) ]
Where
O(r,c) = observed frequency count at level r of Variable A
and level c of Variable B
E(r,c) = expected frequency count at level r of Variable A and
level c of Variable B
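In R both versions are run with chisq.test(); a short sketch with made-up counts:
# Goodness of fit: do observed counts match hypothesised proportions?
chisq.test(c(50, 30, 20), p = c(0.5, 0.3, 0.2))
# Test of independence on a 2x2 contingency table
tab <- matrix(c(20, 15, 10, 25), nrow = 2)
chisq.test(tab)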
Assignment(Using dataset gapminder)
Ø install.packages("gapminder")
Ø library(gapminder)
Ø data(gapminder)
Ø summary(gapminder)
Ø mean(gapminder$gdpPercap)
Ø attach(gapminder)
Ø median(pop)
Ø hist(lifeExp)
Ø hist(log(pop))
Ø boxplot(pop)
Ø boxplot(pop~continent)
Ø plot(lifeExp~pop)
Ø plot(lifeExp~gdpPercap)
Assignment(Using dplyr for piping)
Ø install.packages("dplyr")
Ø library(dplyr)
Ø gapminder %>%
select(country, lifeExp) %>%
filter(country=="South Africa" | country=="Ireland") %>%
group_by(country) %>%
summarise(Average_life=mean(lifeExp))
Assignment(Using t-test)
Ø data1=gapminder%>%
select(country, lifeExp) %>%
filter(country=="India"|country=="China")
t.test(data=data1,lifeExp ~ country)
Assignment(Using ggplot)
Ø install.packages("ggplot2")
Ø library(ggplot2)
Ø gapminder %>%
filter(gdpPercap < 50000) %>%
ggplot(aes(x=log(gdpPercap), y=lifeExp, col=continent, size=pop)) +
geom_point(alpha=0.3) +
geom_smooth(method=lm) +
facet_wrap(~continent)
Ø x <- lm(lifeExp~gdpPercap, data=gapminder)
Ø summary(x)
K-means Clustering using R
Clustering analysis
• The purpose of clustering analysis is to
identify patterns in your data and create
groups according to those patterns.
• Therefore, if two points have similar
characteristics, that means they have the
same pattern and consequently, they belong
to the same group.
• By doing clustering analysis we should be
able to check what features usually appear
together and see what characterizes a group.
Clustering analysis
• In R’s partitioning approach, observations
are divided into K groups and reshuffled
to form the most cohesive clusters
possible according to a given criterion.
• There are two methods—K-means and
Partitioning Around Medoids (PAM).
K-means clustering
• The most common partitioning method is
the K-means cluster analysis.
• It partitions the given data set into k
predefined distinct clusters.
• A cluster is defined as a collection of data
points exhibiting certain similarities.
K-means clustering contd..
It partitions the data set such that-
1. Each data point belongs to a cluster with
the nearest mean.
2. Data points belonging to one cluster have
high degree of similarity.
3. Data points belonging to different clusters
have high degree of dissimilarity.
K-means clustering contd..
Conceptually, the K-means algorithm:
1. Selects K centroids (K rows chosen at random)
2. Assigns each data point to its closest centroid
3. Recalculates the centroids as the average of all data
points in a cluster (i.e., the centroids are p-length
mean vectors, where p is the number of variables)
4. Assigns data points to their closest centroids
5. Continues steps 3 and 4 until observations are
no longer reassigned or the maximum number of iterations
(R uses 10 as a default) is reached.
K-means clustering contd..
K-means function
kmeans(x, centers)
Ø plot(1:10, bss_tss, type="b", ylab = "betweenss/totalss", xlab = "cluster(k)")
type="b" means both lines and points.
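The bss_tss vector plotted above is not defined on the slide; a sketch of how it could be computed, here using the iris measurements as example data:
bss_tss <- sapply(1:10, function(i) {
  km <- kmeans(iris[, 1:4], centers = i, nstart = 25)
  km$betweenss / km$totss    # proportion of variance between clusters
})
plot(1:10, bss_tss, type="b", ylab = "betweenss/totalss", xlab = "cluster(k)")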
K-means clustering Example
## Fit k-means for 1 to 4 clusters (k is assumed to hold the fitted objects),
## then plot each solution coloured by cluster
k <- lapply(1:4, function(i) kmeans(iris[, 1:4], centers = i))
for(i in 1:4){
plot(iris[, 1:4], col=k[[i]]$cluster)
}
Some more example
## install.packages(c("cluster", "rattle","NbClust"))
Ø library(cluster)
Ø library(rattle)
Ø library(NbClust)
##Load the data and look at the first few rows
Ø data(wine, package="rattle")
Ø head(wine)
##Remove the first column from the data and scale it using
the scale() function
Ø df <- scale(wine[,-1])
Some more example
##How do you decide how many clusters to use if you don't know that
already?
Method 1: A plot of the total within-groups sums of squares against the
number of clusters in a K-means solution can be helpful. A bend in the
graph can suggest the appropriate number of clusters
Ø wssplot <- function(data, nc=15, seed=1234){
wss <- (nrow(data)-1)*sum(apply(data,2,var))
for (i in 2:nc){
set.seed(seed)
wss[i] <- sum(kmeans(data, centers=i)$withinss)}
plot(1:nc, wss, type="b", xlab="Number of Clusters",
ylab="Within groups sum of squares")
wss
}
Ø wssplot(df)
Some more example
Note, once you run wssplot(df):
There is a distinct drop in the within-groups sum of
squares when moving from 1 to 3 clusters. After three
clusters, this decrease drops off, suggesting that a 3-
cluster solution may be a good fit to the data.
Some more example
Method 2: Use the NbClust package, which runs many
experiments and gives a distribution of potential numbers of
clusters.
Ø library(NbClust)
Ø set.seed(1234)
Ø nc <- NbClust(df, min.nc=2, max.nc=15,
method="kmeans")
Ø barplot(table(nc$Best.n[1,]), xlab="Number of Clusters",
ylab="Number of Criteria", main="Number of Clusters
Chosen by 26 Criteria")
Ø table(nc$Best.n[1,])
Some more example
Once you’ve picked the number of clusters, run k-means
using this number of clusters. Output the result of calling
kmeans() into a variable fit.km
Ø set.seed(1234)
Ø fit.km <- kmeans(df, centers=3, nstart=25)
Ø fit.km$size
Ø table(fit.km$cluster,wine$Type)
Ø clusplot(pam(df,3))
R PROGRAMMING
CS1756
Linear Regression
Linear Regression
qRegression analysis is a very widely used statistical tool to establish a
relationship model between two variables.
qSupervised learning
qFirst variable is predictor variable whose value is gathered through experiments.
qThe second variable is called response variable whose value is derived from the
predictor variable.
qRegression models a target prediction value based on independent variables.
1. X – independent variables
2. Y- dependent variables
What is linear regression?
q Linear regression is used to predict the value of a continuous
variable Y based on one or more input predictor variables X.
q The aim is to establish a mathematical formula between the
response variable (Y) and the predictor variables (Xs). You can
use this formula to predict Y, only when X values are known.
Y = A + B·X + ϵ
where A is the intercept and B is the slope. Collectively, they
are called regression coefficients. ϵ is the error term, the
part of Y the regression model is unable to explain.
Linear Regression
[Figure: regression line of Y against the predictor (X); the slope B is the change in Y for a 1-unit change in X.]
Linear Regression (cont…)
qIn Linear Regression these two variables are related through an equation, where
exponent (power) of both these variables is 1.
qA non-linear relationship where the exponent of any variable is not equal to 1
creates a curve.
Example Problem
q Use the cars dataset that comes with R by default.
q You can access this dataset simply by typing cars in your R
console. You will find that it consists of 50 observations (rows)
and 2 variables (columns) – dist and speed.
q Let's print out the first six observations:
> head(cars)
speed dist
1 4 2
2 4 10
3 7 4
4 7 22
5 8 16
6 9 10
Applying the summary function:
> summary(cars)
     speed           dist
 Min.   : 4.0   Min.   :  2.00
 1st Qu.:12.0   1st Qu.: 26.00
 Median :15.0   Median : 36.00
 Mean   :15.4   Mean   : 42.98
 3rd Qu.:19.0   3rd Qu.: 56.00
 Max.   :25.0   Max.   :120.00
To find the mean of a column:
> mean(cars$speed)
[1] 15.4
> mean(cars$dist)
[1] 42.98
Graphical Analysis
q The goal is to build a simple regression model that we can use to predict
distance (dist) by establishing a statistically significant linear
relationship with speed.
Scatter plots
q Scatter plots can help visualize any linear relationships between
the dependent (response) variable and independent (predictor)
variables.
Graphical Analysis: Scatter Plot
scatter.smooth(x=cars$speed, y=cars$dist, main="Dist ~ Speed")
A linear model is then fitted with lm(formula, data). Following is the
description of the parameters used:
§ formula is a symbol presenting the relation between x and y.
§ data is the data on which the formula will be applied.
height <- c(151, 174, 138, 186, 128, 136, 179, 163, 152, 131)
age <- c(63, 81, 56, 91, 47, 57, 76, 72, 62, 48)
reg1 <- lm(height~age)
summary(reg1)
Steps to Establish a Regression
qA simple example of regression is predicting weight of a person when his height
is known.
qTo do this we need to have the relationship between height and weight of a
person.
1. Carry out the experiment of gathering a sample of observed values of height and
corresponding weight.
2. Create a relationship model using the lm() functions in R.
3. Find the coefficients from the model created and create the mathematical equation
4. Get a summary of the relationship model to know the average error in prediction. Also
called residuals.
5. To predict the weight of new persons, use the predict() function in R.
Linear Regression (cont…)
height=a+b*age
Coefficient :
qThe values of the intercept (“a” value) and the slope (“b” value) for the age.
qThese “a” and “b” values plot a line between all the points of the data.
Example:
ØIf there is a child that is 20.5 months old, a is 64.92 and b is 0.635, the model
predicts (on average) that its height in centimeters is around
64.92 + (0.635 * 20.5) = 77.94 cm
ØWhen a regression takes into account two or more predictors to create the linear
regression, it's called multiple linear regression.
height = a + b1 * Age + b2 * Number_of_Siblings
Linear Regression (cont…)
P value:
qA p-value indicates whether or not you can reject or accept a hypothesis.
qThe hypothesis, in this case, is that the predictor is not meaningful for your
model.
qThe p-value for age is 4.34e-10, or 0.000000000434. A very small value means
that age is probably an excellent addition to your model.
qThe p-value for the number of siblings is 0.85.
qIn other words, there is an 85% chance that this predictor is not meaningful for the
regression.
qA standard way to test whether the predictors are meaningful is to check if their
p-values are smaller than 0.05.
Linear Regression (cont…)
Residuals:
qA good way to test the quality of the fit of the model is to look at the residuals,
i.e. the differences between the real values and the predicted values.
qIn a plot of the fit, the straight line represents the predicted values, and the
vertical segment from the line to an observed data value is that point's residual.
qThe idea here is that the sum of the residuals is approximately zero, or as low
as possible.
Linear Regression (cont…)
R square:
qR² measures how much of the total variability in the response is explained by
the model, and is another way to test the quality of the fit.
qModels that poorly fit the data have R² near 0.
qModels that fit the data well have R² near 1.
qFor example, a model with an R² of 0.99 can explain 99% of the total
variability.
Linear Regression (cont…)
Predict Function predict() :
qTo predict in linear regression.
predict(object, newdata)
• object is the model which is already created using the lm() function.
• newdata is a data frame containing the new values for the predictor variables.
Linear Regression (cont…)
Predict Function:
height <- c(151, 174, 138, 186, 128, 136, 179, 163, 152, 131)
weight <- c(63, 81, 56, 91, 47, 57, 76, 72, 62, 48)  # here the second vector is the observed weight in kg
reg1 <- lm(weight~height) # Find weight of a person with height 170.
a <- data.frame(height = 170)
result <- predict(reg1, a)
print(result)
o/p:
76.22869
Linear Regression (cont…)
plot(weight, height, col = "blue", main = "Height & Weight Regression", abline(lm(height~weight)), cex = 1.3, pch = 16, xlab =
"Weight in Kg", ylab = "Height in cm")
EXAMPLE
Use dataset cars
• linearMod <- lm(dist ~ speed, data=cars)
• print(linearMod)
• summary(linearMod)
• Now that we have built the linear model, we also have
established the relationship between the predictor and
response in the form of a mathematical formula for
Distance (dist) as a function of speed.
• For the output below, you can notice the ‘Coefficients’
part having two components: Intercept: -17.579, speed:
3.932
• These are also called the beta coefficients.
• In other words,
dist = Intercept + (β ∗ speed)
=> dist = −17.579 + 3.932 ∗ speed

Call:
lm(formula = dist ~ speed, data = cars)

Residuals:
    Min      1Q  Median      3Q     Max
-29.069  -9.525  -2.272   9.215  43.201

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) -17.5791     6.7584  -2.601   0.0123 *
speed         3.9324     0.4155   9.464 1.49e-12 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 15.38 on 48 degrees of freedom
Multiple R-squared: 0.6511, Adjusted R-squared: 0.6438
F-statistic: 89.57 on 1 and 48 DF, p-value: 1.49e-12
Neural Network: EXAMPLE
Using R
Example-1
##Using the Boston dataset in the MASS package
##The Boston dataset is a collection of data about
housing values in the suburbs of Boston.
##Goal is to predict the median value of owner-
occupied homes (medv) using all the other
continuous variables available.
Example-1
Ø set.seed(500)
Ø library(MASS)
Ø data <- Boston
##to check that no datapoint is missing, otherwise we need to
fix the dataset.
Ø apply(data,2,function(x) sum(is.na(x)))
Example-1
##proceed by randomly splitting the data into a train and a
test set, then we fit a linear regression model and test it on the
test set using the glm() function
Ø index <- sample(1:nrow(data), round(0.75*nrow(data)))
Ø train <- data[index,]
Ø test <- data[-index,]
Ø lm.fit <- glm(medv~., data=train)
Ø summary(lm.fit)
Ø pr.lm <- predict(lm.fit,test)
Ø MSE.lm <- sum((pr.lm - test$medv)^2)/nrow(test)
Example-1
##first step, to normalize your data before training a neural
network.
## scale and split the data
Ø maxs <- apply(data, 2, max)
Ø mins <- apply(data, 2, min)
Ø scaled <- as.data.frame(scale(data, center = mins, scale =
maxs - mins))
Ø train_ <- scaled[index,]
Ø test_ <- scaled[-index,]
• Note that scale returns a matrix that needs to be coerced into a
data.frame.
Example-1
##to use 2 hidden layers with this configuration: 13:5:3:1. The
input layer has 13 inputs, the two hidden layers have 5 and 3
neurons and the output layer has a single output since we are
doing regression.
Ø library(neuralnet)
Ø n <- names(train_)
Ø f <- as.formula(paste("medv ~", paste(n[!n %in% "medv"],
collapse = " + ")))
Ø nn <-
neuralnet(f,data=train_,hidden=c(5,3),linear.output=T)
Example-1
## to plot the model
Ø plot(nn)
##Predicting medv using the neural network
Ø pr.nn <- compute(nn,test_[,1:13])
Ø pr.nn_<-pr.nn$net.result*(max(data$medv)-min(data$medv))+min(data$medv)
Ø test.r <- (test_$medv)*(max(data$medv)-min(data$medv))+min(data$medv)
Ø MSE.nn <- sum((test.r - pr.nn_)^2)/nrow(test_)
Ø plot(test$medv,pr.nn_,col='red',main='Real vs predicted NN',pch=18,cex=0.7)
Ø points(test$medv,pr.lm,col='blue',pch=18,cex=0.7)
Ø abline(0,1,lwd=2)
Ø legend('bottomright',legend=c('NN','LM'),pch=18,col=c('red','blue'))
Neural Network: Cross
Validation
Using R
A (fast) cross validation
Ø boxplot(cv.error,xlab='MSE CV',col='cyan',
border='blue',names='CV error (MSE)',
main='CV error (MSE) for NN',horizontal=TRUE)
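The cv.error vector plotted above is built by a cross-validation loop that the slide omits; a sketch of a 10-fold version, reusing data, scaled, and f from the earlier slides:
set.seed(450)
cv.error <- NULL
for (i in 1:10) {
  index <- sample(1:nrow(data), round(0.9 * nrow(data)))
  train.cv <- scaled[index, ]
  test.cv <- scaled[-index, ]
  nn <- neuralnet(f, data = train.cv, hidden = c(5, 3), linear.output = TRUE)
  # predict on the held-out fold and rescale back to the original units
  pr.nn <- compute(nn, test.cv[, 1:13])$net.result
  pr.nn <- pr.nn * (max(data$medv) - min(data$medv)) + min(data$medv)
  test.cv.r <- test.cv$medv * (max(data$medv) - min(data$medv)) + min(data$medv)
  cv.error[i] <- sum((test.cv.r - pr.nn)^2) / nrow(test.cv)
}
mean(cv.error)   # average MSE over the folds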
Cross Validation: Example
• The average MSE for the neural network (7.641292) is lower
than that of the linear model, although there seems to be a
certain degree of variation in the MSEs of the cross validation.
• This may depend on the splitting of the data or the random
initialization of the weights in the net.
• By running the simulation different times with different seeds
you can get a more precise point estimate for the average MSE.
SUMMARY
• Neural networks resemble black boxes a lot: explaining
their outcome is much more difficult than explaining
the outcome of a simpler model, such as a linear model.
• Therefore, depending on the kind of application you
need, you might want to take into account this factor
too.
• Furthermore, as you have seen above, extra care is
needed to fit a neural network and small changes can
lead to different results.
Multiple Linear Regression
What is multiple linear regression?
Multiple linear regression extends simple linear regression to two or more
predictors, modelling the response as y = b0 + b1*x1 + b2*x2 + ... + ϵ.
EXAMPLE
- install.packages("tidyverse")
- library(tidyverse)
- install.packages("datarium")
We’ll use the marketing data set [datarium package], which
contains the impact of the amount of money spent on three
advertising media (youtube, facebook and newspaper) on sales.
• data("marketing", package = "datarium")
• head(marketing, 4)
youtube facebook newspaper sales
1 276.12 45.36 83.04 26.52
2 53.40 47.16 54.12 12.48
3 20.64 55.08 83.16 11.16
4 181.80 49.56 70.20 22.20
EXAMPLE: Building model
Ø model <- lm(sales ~ youtube + facebook + newspaper,
data = marketing)
Ø summary(model)
EXAMPLE: Building model
Call:
lm(formula = sales ~ youtube + facebook + newspaper, data = marketing)
Residuals:
Min 1Q Median 3Q Max
-10.5932 -1.0690 0.2902 1.4272 3.3951
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.526667 0.374290 9.422 <2e-16 ***
youtube 0.045765 0.001395 32.809 <2e-16 ***
facebook 0.188530 0.008611 21.893 <2e-16 ***
newspaper -0.001037 0.005871 -0.177 0.86
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 2.023 on 196 degrees of freedom
Multiple R-squared: 0.8972, Adjusted R-squared: 0.8956
F-statistic: 570.3 on 3 and 196 DF, p-value: < 2.2e-16
Building model: Interpretation
Residual standard error: 2.023 on 196 degrees of freedom
Multiple R-squared: 0.8972, Adjusted R-squared: 0.8956
F-statistic: 570.3 on 3 and 196 DF, p-value: < 2.2e-16
The first step in interpreting the multiple regression analysis is to examine
the F-statistic and the associated p-value, at the bottom of model
summary.
In our example, it can be seen that p-value of the F-statistic is < 2.2e-16,
which is highly significant. This means that, at least, one of the predictor
variables is significantly related to the outcome variable.
To see which predictor variables are significant, you can examine the
coefficients table, which shows the estimate of the regression beta
coefficients and the associated t-statistic p-values:
summary(model)$coefficient
Building model: Interpretation
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.52667 0.37429 9.422 1.27e-17
## youtube 0.04576 0.00139 32.809 1.51e-81
## facebook 0.18853 0.00861 21.893 1.51e-54
## newspaper -0.00104 0.00587 -0.177 8.60e-01
• For a given predictor, the t-statistic evaluates whether or not there
is a significant association between the predictor and the outcome
variable, that is, whether the beta coefficient of the predictor is
significantly different from zero.
• It can be seen that changes in the youtube and facebook advertising
budgets are significantly associated with changes in sales, while changes in
the newspaper budget are not significantly associated with sales.
• For a given predictor variable, the coefficient (b) can be interpreted as
the average effect on y of a one unit increase in the predictor, holding all
other predictors fixed.
Building model: Interpretation
• We found that newspaper is not significant in the multiple regression
model. This means that, for a fixed amount of youtube and facebook
advertising budget, changes in the newspaper advertising budget will
not significantly affect sales units.
• As the newspaper variable is not significant, it is possible to remove it
from the model:
Ø model <- lm(sales ~ youtube + facebook, data = marketing)
Ø summary(model)
Building model: Interpretation
Call:
lm(formula = sales ~ youtube + facebook, data = marketing)
Residuals:
Min 1Q Median 3Q Max
-10.5572 -1.0502 0.2906 1.4049 3.3994
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.50532 0.35339 9.919 <2e-16 ***
youtube 0.04575 0.00139 32.909 <2e-16 ***
facebook 0.18799 0.00804 23.382 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 2.018 on 197 degrees of freedom
Multiple R-squared: 0.8972, Adjusted R-squared: 0.8962
F-statistic: 859.6 on 2 and 197 DF, p-value: < 2.2e-16
Finally, our model equation can be written as follows: sales = 3.5 + 0.045*youtube +
0.187*facebook.
Building model: Interpretation
The confidence intervals of the model coefficients can be extracted as
follows:
Ø confint(model)
Ø ## 2.5 % 97.5 %
Ø ## (Intercept) 2.808 4.2022
Ø ## youtube 0.043 0.0485
Ø ## facebook 0.172 0.2038
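With the final model, predictions for new budgets can be made with predict(); a small sketch in which the budget values are made up:
new_budget <- data.frame(youtube = 200, facebook = 30)
predict(model, newdata = new_budget)
# roughly 3.5 + 0.045*200 + 0.187*30 ≈ 18.1 sales units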
Model accuracy assessment
The overall quality of the model can be assessed by examining the residual
standard error (RSE) and the R-squared statistics reported by
summary(model): the lower the RSE and the higher the R², the better the fit.
Other Way
• If you have many predictor variables in your
data, you don't necessarily need to type their
names when computing the model.
To compute a multiple regression using all of the
predictors in the data set, simply type this:
model <- lm(sales ~., data = marketing)
Other Way
• If you want to perform the regression using all
of the variables except one, say newspaper,
type this:
model <- lm(sales ~. -newspaper, data = marketing)
OR
you can use the update function:
model1 <- update(model, ~. -newspaper)
ASSIGNMENT
• Create your own dataset containing 5 columns and 24
rows
• Column names: Year, Month, Interest_Rate,
Unemployment_Rate and Stock_Index_Price
• Consider:
– Year (2019,2018)
– Month(1-12)[each year will have 1-12 months]
– Interest_Rate (any random decimal number in the range
1.00-3.00)
– Unemployment_Rate (any random decimal number in the
range 5.00-6.50)
– Stock_Index_Price (any integer value from 700-1500)
ASSIGNMENT
• Check that a linear relationship exists between:
– The Stock_Index_Price (dependent variable) and
the Interest_Rate (independent variable); and
– The Stock_Index_Price (dependent variable) and the
Unemployment_Rate (independent variable)
• Use scatter plots to show the linearity.
[Note: when interest rates go up, the stock index
price also goes up]
[when the unemployment rates go up, the stock
index price goes down (here you still have a linear
relationship, but with a negative slope) ]
TO SUM UP:
• Adjusted R-squared reflects the fit of the model, where a
higher value generally indicates a better fit
• Intercept coefficient is the Y-intercept
• Interest_Rate coefficient is the change in Y due to a change
of one unit in the interest rate (everything else held
constant)
• Unemployment_Rate coefficient is the change in Y due to
a change of one unit in the unemployment rate (everything
else held constant)
• Std. Error reflects the level of accuracy of the coefficients
• Pr(>|t|) is the p-value. A p-value of less than 0.05 is
considered to be statistically significant
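A sketch of how the assignment's regression could be fit, assuming the data frame has been created under the name stock_data with the column names listed above:
model <- lm(Stock_Index_Price ~ Interest_Rate + Unemployment_Rate,
            data = stock_data)
summary(model)   # inspect Adjusted R-squared, coefficients, and Pr(>|t|)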
CS1756 R PROGRAMMING
Basic Concept of Neural Network
Using R
The Basics of Neural Network
q Activation function defines the output of a neuron in terms of a local
induced field.
q Activation functions are a single line of code that gives the neural
nets non-linearity and expressiveness.
q There are many activation functions:
1. Identity function
2. Binary Step Function
3. Sigmoid Function
4. Ramp Function
5. ReLU, the rectified linear unit:
q It is the most used activation function in the world. It outputs 0
for negative values of x, and x itself otherwise.
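For instance, ReLU and the sigmoid can each be written in a single line of R; a quick sketch:
relu <- function(x) pmax(0, x)            # 0 for negative x, x otherwise
sigmoid <- function(x) 1 / (1 + exp(-x))  # squashes x into (0, 1)
curve(relu, from = -3, to = 3)            # visualize the kink at 0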
Create Neural Network
##INSTALL THE neuralnet package
Ø install.packages("neuralnet")
Ø library(neuralnet)
# Split data
train_idx <- sample(nrow(iris), 2/3 * nrow(iris))
iris_train <- iris[train_idx, ]
iris_test <- iris[-train_idx, ]
# Binary classification
nn <- neuralnet(Species == "setosa" ~ Petal.Length + Petal.Width, iris_train, linear.output = FALSE)
pred <- predict(nn, iris_test)
table(iris_test$Species == "setosa", pred[, 1] > 0.5)
# Multiclass classification
nn <- neuralnet((Species == "setosa") + (Species == "versicolor") + (Species == "virginica")
~ Petal.Length + Petal.Width, iris_train, linear.output = FALSE)
pred <- predict(nn, iris_test)
table(iris_test$Species, apply(pred, 1, which.max))
Application
NN's wonderful properties offer many applications such as:
• Pattern Recognition: neural networks are very suitable for pattern
recognition problems such as facial recognition, object detection,
fingerprint recognition, etc.
• Anomaly Detection: neural networks are good at pattern detection,
and they can easily detect the unusual patterns that don’t fit in the
general patterns.
• Time Series Prediction: Neural networks can be used to predict
time series problems such as stock price, weather forecasting.
• Natural Language Processing: Neural networks offer a wide range
of applications in Natural Language Processing tasks such as text
classification, Named Entity Recognition (NER), Part-of-Speech
Tagging, Speech Recognition, and Spell Checking.
Time Series Analysis in R
Definitions, Applications
• Definition of Time Series: An ordered
sequence of values of a variable at equally
spaced time intervals.
• Applications: The usage of time series models
is twofold:
§ Obtain an understanding of the underlying forces
and structure that produced the observed data
§ Fit a model and proceed to forecasting,
monitoring or even feedback and feedforward
control.
Definitions, Applications
• Time Series Analysis is used for many applications such
as:
1. Economic Forecasting
2. Sales Forecasting
3. Budgetary Analysis
4. Stock Market Analysis
5. Yield Projections
6. Process and Quality Control
7. Inventory Studies
8. Workload Projections
9. Utility Studies
10. Census Analysis
Definitions
• A successful analysis can help professionals
observe patterns and ensure the smooth
functioning of the business.
• The most important part of time series analysis is
forecasting or prediction of future values using
the historical data.
• These predictions help in determining the future
course of action and give an approximate idea
about how the business will look a year from
now.
Components of Time Series
• Trend Component: By trend component, we
mean the general tendency of the data to
increase or decrease over a long period of
time. [population growth over the years]
• Seasonal Component: The variations in the time
series that arise due to the rhythmic forces which
operate over a span of less than 12 months, i.e.,
within a year. Short-term variation. [sales of ice
cream during summer]
Components of Time Series
• Cyclical Component: The oscillatory movements
in a time series that last for more than a year. [e.g.,
5 years of economic growth, then 2 years of decline]
• Random Component: Random or irregular
variations or fluctuations which are not
accounted for by the trend, seasonal and cyclical
components are defined as the random
component. These are also called episodic
fluctuations. [variation caused by earthquake,
flood, war, etc.]
Where time series can't be used?
1. When the values are constant over a period
of time.
2. When values can be represented by known
functions like cosine, sine, etc. (changing with a
fixed function)
Stationarity of Data
• Stationarity of data depends on:
– Mean
– Variance
– Covariance
• Non-stationary data contains components of
trend, cyclicity, seasonality and irregularity.
– They affect the forecasting of the time series.
Stationarity of Data
• The mean of the series should not be a function of
time, but rather a constant. [Figure: two series; the
left-hand graph has a constant mean, while the red
right-hand graph has a time-dependent mean.]
Stationarity of Data
• The variance of the series should not be a function
of time. This property is known as homoscedasticity.
[Figure: a stationary vs a non-stationary series; notice the
varying spread of the distribution in the right-hand graph.]
Stationarity of Data
• The covariance of the i-th term and the (i+m)-th
term should not be a function of time. [Figure: the
spread becomes closer as time increases, so the
covariance is not constant with time for the 'red series'.]
Moving Average
• It is a technique to get an overall idea of the trends in
a data set.
Example: Business sales in the last 3 months (Jan, Feb and
March), used to forecast for the month of April.
MONTH   SALES
Jan     150
Feb     170
March   145
• Moving Average(3) = (150+170+145)/3 = 155
[3 means taking 3 values]
Moving Average contd..
• Hence, based on the sales of Jan-March, the
forecast sales for April will be 155.
• Once the actual sales for April come in, they are
used to forecast the next month, and so on.
MONTH   SALES
Jan     150
Feb     170
March   145
April   155
May
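A quick sketch of the same computation in R, using the sales figures from the table above:
sales <- c(150, 170, 145)
mean(sales)                               # 155, the forecast for April
# For a longer series, a trailing 3-point moving average:
x <- c(150, 170, 145, 155)
stats::filter(x, rep(1/3, 3), sides = 1)  # NA NA 155.0000 156.6667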
Moving Data Example
Year  Quarter  Sales
1     1        2.8
1     2        2.1
1     3        4
1     4        4.5
2     1        3.8
2     2        3.2
2     3        4.8
2     4        5.4
3     1        4
3     2        3.6
3     3        5.5
3     4        5.8
4     1        4.3
4     2        3.9
4     3        6
4     4        6.4
[Figure: the 16 quarterly sales values plotted in sequence; the series is not stationary.]
Time code  Year  Quarter  Sales  MA(4)  Centered MA
1          1     1        2.8
2          1     2        2.1
3          1     3        4      3.4    3.5
4          1     4        4.5    3.6    3.7
5          2     1        3.8    3.9    4.0
6          2     2        3.2    4.1    4.2
7          2     3        4.8    4.3    4.3
8          2     4        5.4    4.4    4.4
9          3     1        4      4.5    4.5
10         3     2        3.6    4.6    4.7
11         3     3        5.5    4.7    4.8
12         3     4        5.8    4.8    4.8
13         4     1        4.3    4.9    4.9
14         4     2        3.9    5.0    5.1
15         4     3        6      5.2
16         4     4        6.4
[Figure: Sales (Yt) plotted with the centered moving average across the 16 quarters.]
ARIMA MODEL
• Auto Regressive Integrated Moving Average
• It is specified by three parameters:
– Number of Auto Regressive (AR) terms
– how many non-seasonal differences are needed
to achieve stationarity (I)
– number of lagged forecast errors in the
prediction equation (MA)
ARIMA MODEL
– p: Auto Regressive (AR) parameter
EX: ARIMA(2,0,0) has a value of p equals 2.
– d: Degree of Differencing
– q: Moving Average
What is an Autoregressive model?
• In regression, autoregressive ("regression on
itself") components refer to prior values of the
series being used to model the current value.
X(t) = current value
Then the first AR component = a1 * X(t-1),
where a1 is a fitted coefficient.
Likewise, the second AR component will be a2 * X(t-2), and so on.
• These are often referred to as the lagged terms.
• So, the prior value is called the first lag, the value
before that the second lag, and so on.
What is degree of differencing?
• It is equal to the number of non-seasonal
differences needed to achieve stationarity.
1 level of differencing means you take the current
value and subtract the prior value from it.
• EXAMPLE (subtracting the prior value from the current value):
Values   1st-order differencing   Result
5        NA
4        (4-5)                    -1
6        (6-4)                     2
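In R, first-order differencing is simply diff(); reproducing the table above:
x <- c(5, 4, 6)
diff(x)                     # -1  2
diff(x, differences = 2)    # 3, i.e. differencing twice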
What is Moving Average?
• In ARIMA, the moving average (MA) component models the
current value using the lagged forecast errors of the
prediction equation.
To test stationarity
• The ARIMA model works on the assumption that
the data is stationary.
• Trend and seasonality of the data are removed.
In order to test whether or not the series and
its error terms are auto-correlated, we use:
1. ACF (Auto correlation function)
2. PACF (Partial Auto correlation function)
What is Auto-correlation?
• Auto-correlation is the similarity between
values of the same variable across observations.
• ACF (Auto correlation function):
– Tells us how correlated points are with each
other based on how many time steps they are
separated by.
– It is used to determine how past and future data
points are related in a time series.
– Its values can range from -1 to 1.
What is Partial Auto-correlation?
• PACF (Partial Auto correlation function):
– It is the degree of association between two
variables while adjusting for the effect of one or
more additional variables.
– It gives the partial correlation of a time series with its
own lagged values.
– Its values can range from -1 to 1.
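Both functions are built into R; a quick sketch on the AirPassengers series used in the example below:
acf(AirPassengers)    # autocorrelation at increasing lags
pacf(AirPassengers)   # partial autocorrelation at increasing lags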
EXAMPLE
>library(forecast)
>data("AirPassengers")
> class(AirPassengers)
>start(AirPassengers)
>end(AirPassengers)
>frequency(AirPassengers)
##To check the missing values
>tsdata=is.na(AirPassengers)
>sum(tsdata)
>summary(AirPassengers)
EXAMPLE
>plot(AirPassengers)
>cycle (AirPassengers)
>boxplot(AirPassengers~cycle(AirPassengers))
>tsdata<-ts(AirPassengers,frequency = 12)
##To decompose the different time series components
>ddata<-decompose(tsdata,"multiplicative")
>plot(ddata)
Or
Ø plot(ddata$trend)
Ø plot(ddata$seasonal)
Ø plot(ddata$random)
EXAMPLE
>plot(AirPassengers)
> abline(reg = lm(AirPassengers~time(AirPassengers)))
>cycle(AirPassengers)
##To get boxplot by cycle
>boxplot(AirPassengers~cycle(AirPassengers))
To build the ARIMA model
>mymodel<-auto.arima(AirPassengers)
##To see the different possible parameters of ARIMA, to
compare information criteria
>auto.arima(AirPassengers,ic="aic",trace=TRUE)
>plot.ts(mymodel$residuals)
EXAMPLE
##To test the model
>library(tseries)
>plot.ts(mymodel$residuals)
>acf(ts(mymodel$residuals),main="ACF Residuals")
>pacf(ts(mymodel$residuals),main="PACF Residuals")
EXAMPLE
##To forecast for the next 10 years
>myforecast<-
forecast(mymodel,level=c(95),h=10*12)
>plot(myforecast)
>Box.test(mymodel$residuals,lag=5,type="Ljung-
Box")
>Box.test(mymodel$residuals,lag=10,type="Ljung-
Box")
>Box.test(mymodel$residuals,lag=15,type="Ljung-
Box")
IMPUTATION OF MISSING VALUES
MICE Package
• It is one of the commonly used packages by R users.
Creating multiple imputations as compared to a single
imputation (such as mean) takes care of uncertainty in
missing values.
• MICE assumes that the missing data are Missing at Random
(MAR), which means that the probability that a value is
missing depends only on the observed values and can be
predicted using them.
• It imputes data on a variable by specifying an imputation
model per variable.
>install.packages("mice")
>library(mice)
>data("iris")
Min. :4.300 Min. :2.00 Min. :1.00 Min. :0.100 setosa :47
Median :5.700 Median :3.00 Median :4.50 Median :1.300 virginica :45
Mean :5.789 Mean :3.06 Mean :3.87 Mean :1.183 NA's :10
>library(VIM)
### Using pbox(parallel boxplot) function to draw boxplot
>pbox(iris.mis, delimiter = NULL, selection = "any")
>pbox(iris.mis , pos=1, cex=0.6)
To perform imputations:
>val <- mice(iris.mis)
>val$imp$Sepal.Length
##Build a model on each imputed data set and pool the results
>fit <- with(val, lm(Sepal.Length ~ Petal.Length + Sepal.Width))
>pool(fit)
Amelia Package
• This package also performs multiple imputation (generate
imputed data sets) to deal with missing values.
• Multiple imputation helps to reduce bias and increase
efficiency.
• It is enabled with bootstrap based EMB(Expectation-
Maximization) algorithm which makes it faster and robust to
impute many variables including cross sectional, time
series data etc.
• Also, it is enabled with parallel imputation feature using
multicore CPUs.
EXAMPLE:
> library(Amelia)
#load data
> data("iris")
#seed 10% missing values (prodNA from missForest, as before), then run
#Amelia; m = 5 imputed data sets, with "Species" declared nominal
> iris.mis <- prodNA(iris, noNA = 0.1)
> summary(iris.mis)
> amelia_fit <- amelia(iris.mis, m = 5, noms = "Species")
> amelia_fit$imputations[[2]]
> amelia_fit$imputations[[3]]
> amelia_fit$imputations[[4]]
> amelia_fit$imputations[[5]]
To check a particular column in a data set, use the following
commands
> amelia_fit$imputations[[5]]$Sepal.Length
#to open the interactive Amelia GUI
> AmeliaView()
missForest
#missForest
>install.packages("missForest")
> library(missForest)
#loaddata
> data("iris")
#seed 10% missing values
>iris.mis <- prodNA(iris, noNA = 0.1)
> summary(iris.mis)
#impute missing values, using all parameters as default values
> iris.imp <- missForest(iris.mis)
#check imputed values
> iris.imp$ximp
#check the out-of-bag imputation error estimate
> iris.imp$OOBerror
NRMSE        PFC
0.14148554   0.02985075
NRMSE is the normalized root mean squared error. It is used to represent
the error derived from imputing continuous values. PFC (proportion of
falsely classified) is used to represent the error derived from
imputing categorical values.
#compare the imputed values against the original (complete) data
> iris.err <- mixError(iris.imp$ximp, iris.mis, iris)
> iris.err
NRMSE       PFC
0.1535103   0.0625000