IDS - Unit 3 - 5
R - Vectors
Vectors in R are analogous to arrays in the C language: they hold multiple data
values of the same type. One key difference is that in R the indexing of a vector starts
from ‘1’ and not from ‘0’. We can create numeric vectors and character vectors as well.
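As a minimal sketch of these points, the following lines (values chosen for illustration) create a numeric vector and a character vector and show that indexing starts at 1:

```r
# numeric vector
num_vec <- c(10, 20, 30)
# character vector
chr_vec <- c("a", "b", "c")
# indexing starts at 1, so the first element is num_vec[1]
print(num_vec[1])   # 10
print(chr_vec[3])   # "c"
```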
Vector Creation
Single Element Vector
Even when you write just one value in R, it becomes a vector of length 1 and belongs to one of
the basic vector types (character, double, integer, logical, complex, or raw).
print("abc");
print(12.5)
print(63L)
print(TRUE)
print(2+3i)
print(charToRaw('hello'))
Multiple Elements Vector
Using colon operator with numeric data
v <- 5:13
print(v)
v <- 6.6:12.6
print(v)
# If the final element specified does not belong to the sequence then it is discarded.
v <- 3.8:11.4
print(v)
# Two vectors used in the examples below (values assumed for illustration)
v1 <- c(3, 8, 4, 5, 0, 11)
v2 <- c(4, 11, 0, 8, 1, 2)
# Vector addition.
add.result <- v1+v2
print(add.result)
# Vector subtraction.
sub.result <- v1-v2
print(sub.result)
# Vector multiplication.
multi.result <- v1*v2
print(multi.result)
# Vector division.
divi.result <- v1/v2
print(divi.result)
Types of vectors
Vectors are of different types which are used in R. Following are some of the types of
vectors:
Numeric vectors
Numeric vectors are those which contain numeric values such as integer, double (float), etc.
# double and integer vectors (values assumed for illustration)
v1 <- c(4, 5.5)
v2 <- c(1L, 4L)
print(typeof(v1))
print(typeof(v2))
Output:
[1] "double"
[1] "integer"
Character vectors
Character vectors contain alphanumeric values and special characters.
# character vector (values assumed for illustration)
v1 <- c("abc", "A4", "#")
print(typeof(v1))
Output:
[1] "character"
Logical vectors
Logical vectors contain the Boolean values TRUE and FALSE, as well as NA for missing
values.
# logical vector (values assumed for illustration)
v1 <- c(TRUE, FALSE, NA)
typeof(v1)
Output:
[1] "logical"
Modifying a vector
Modification of a vector is the process of applying an operation to individual elements
of a vector to change their values. There are different ways through which we can
modify a vector:
X <- c(2, 7, 9, 7, 8, 2)
# Modify using the subscript operator
X[3] <- 1
X[2] <- 9
cat('subscript operator', X, '\n')
# Modify using logical indexing
X[X > 5] <- 0
cat('logical indexing', X, '\n')
# Modify by specifying positions with the combine() function
# (positions assumed to match the output shown)
X <- c(X[3], X[2], X[1])
cat('combine() function', X)
Output
subscript operator 2 9 1 7 8 2
logical indexing 2 0 1 0 0 2
combine() function 1 0 2
Deleting a vector
Deletion of a vector is the process of deleting all of the elements of the vector. This can be
done by assigning NULL to the vector.
M <- c(8, 10, 2, 5)
# Delete by assigning NULL
M <- NULL
print(M)
Output:
NULL
Sorting a vector
Elements of a vector can be sorted with the sort() function, in ascending order by default.
# Creation of vector (values assumed to match the output shown)
X <- c(8, 2, 7, 1, 11, 2)
A <- sort(X)
cat('ascending order', A, '\n')
B <- sort(X, decreasing = TRUE)
cat('descending order', B, '\n')
Output:
ascending order 1 2 2 7 8 11
descending order 11 8 7 2 2 1
Named vectors
A named vector can be created directly with c():
xc <- c(a = 5, b = 6, c = 7, d = 8)
which results in:
> xc
a b c d
5 6 7 8
With the setNames function, two vectors of the same length can be used to create a named
vector:
x <- 5:8
y <- letters[1:4]
xy <- setNames(x, y)
which results in a named integer vector:
> xy
a b c d
5 6 7 8
You may also use the names function to get the same result:
xy <- 5:8
names(xy) <- letters[1:4]
# With such a vector it is also possible to select elements by name:
xy["a"]
Vector sub-setting
In R Programming Language, subsetting allows the user to access elements from an object. It
takes out a portion from the object based on the condition provided.
Method 1: Subsetting in R Using [ ] Operator
Using the ‘[ ]’ operator, elements of vectors and observations from data frames can be
accessed. To exclude certain indexes, ‘-‘ is used, which accesses all other indexes of the vector or data
frame.
x <- 1:15
# Print vector
cat("Original vector: ", x, "\n")
# Subsetting vector
cat("First 5 values of vector: ", x[1:5], "\n")
cat("Without values present at index 1, 2 and 3: ", x[-c(1, 2, 3)], "\n")
Matrices
A matrix is a rectangular arrangement of numbers in rows and columns. In a matrix,
rows are the ones that run horizontally and columns are the ones that run vertically.
Creating and Naming a Matrix
To create a matrix in R you need to use the function called matrix(). The arguments to
matrix() are the set of elements in the vector, along with how many rows and how many
columns you want your matrix to have.
A = matrix(
# Vector of elements to fill the matrix (values assumed for illustration)
c(1, 2, 3, 4, 5, 6, 7, 8, 9),
# No of rows
nrow = 3,
# No of columns
ncol = 3,
# By default matrices are filled in column-wise order
# So this parameter decides how to arrange the matrix
byrow = TRUE
)
# Naming rows
rownames(A) = c("r1", "r2", "r3")
# Naming columns
colnames(A) = c("c1", "c2", "c3")
Matrix where all rows and columns are filled by a single constant ‘k’:
To create such a matrix the syntax is given below:
Syntax: matrix(k, m, n)
Parameters:
k: the constant
m: no of rows
n: no of columns
print(matrix(5, 3, 3))
Diagonal matrix:
A diagonal matrix is a matrix in which the entries outside the main diagonal are all zero. To
create such a matrix the syntax is given below:
print(diag(c(5, 3, 3), 3, 3))
Identity matrix:
A square matrix in which all the elements of the principal diagonal are ones and all other
elements are zeros. To create such a matrix the syntax is given below:
print(diag(1, 3, 3))
Matrix metrics
Matrix metrics are the basic properties of a matrix that can be inspected once the matrix has
been created, such as its number of rows, columns, and elements:
cat("Number of rows:\n")
print(nrow(A))
cat("Number of columns:\n")
print(ncol(A))
cat("Number of elements:\n")
print(length(A))
# OR
print(prod(dim(A)))
Matrix subsetting
A matrix is subset with two arguments within single brackets, [], and separated by a comma.
The first argument specifies the rows, and the second the columns.
M_new<-matrix(c(25,23,25,20,15,17,13,19,25,24,21,19,20,12,30,17),ncol=4)
#M_new<-matrix(1:16,4)
M_new
colnames(M_new)<-c("C1","C2","C3","C4")
rownames(M_new)<-c("R1","R2","R3","R4")
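A few illustrative subsetting operations on the M_new matrix above (the chosen rows and columns are arbitrary; the matrix is repeated here so the sketch is self-contained):

```r
M_new <- matrix(c(25,23,25,20,15,17,13,19,25,24,21,19,20,12,30,17), ncol = 4)
colnames(M_new) <- c("C1","C2","C3","C4")
rownames(M_new) <- c("R1","R2","R3","R4")
# single element: row 2, column 3
M_new[2, 3]
# an entire row, selected by name
M_new["R1", ]
# an entire column, selected by position
M_new[, 2]
# a 2x2 block: rows 1-2, columns 3-4
M_new[1:2, 3:4]
```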
Arrays
Arrays are the R data objects which can store data in more than two dimensions. For example,
if we create an array of dimension (2, 3, 4), then it creates 4 rectangular matrices, each with 2
rows and 3 columns. Arrays can store only one data type.
An array is created using the array() function. It takes vectors as input and uses the values in
the dim parameter to create an array.
Example
The following example creates an array of two 3x3 matrices, each with 3 rows and 3 columns
(the input vectors are reconstructed to match the output shown below):
vector1 <- c(5, 9, 3)
vector2 <- c(10, 11, 12, 13, 14, 15)
new.array <- array(c(vector1, vector2), dim = c(3, 3, 2))
print(new.array)
Accessing arrays
The arrays can be accessed by using indices for different dimensions separated by commas.
Different components can be specified by any combination of elements’ names or positions.
# accessing elements (vec is assumed to be a previously created vector)
vec <- c(5, 9, 3)
cat ("Third element of vector is : ", vec[3])
# Print the element in the 1st row and 3rd column of the 1st matrix.
print(new.array[1,3,1])
# Use apply to calculate the sum of the rows across all the matrices.
result <- apply(new.array, c(1), sum)
print(result)
When we execute the above code, it produces the following result −
,,1
[,1] [,2] [,3]
[1,] 5 10 13
[2,] 9 11 14
[3,] 3 12 15
,,2
[,1] [,2] [,3]
[1,] 5 10 13
[2,] 9 11 14
[3,] 3 12 15
[1] 56 68 60
Accessing subset of array elements
A smaller subset of the array elements can be accessed by defining a range of row or column
limits.
c(vector, values): c() function allows us to append values to the end of the array. Multiple
values can also be added together.
append(vector, values): This method allows values to be inserted at any position in the
vector. By default, this function adds the elements at the end.
append(vector, values, after = n) inserts the new values after position n of the
vector, given in the last argument of the function.
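A short sketch of both functions (values chosen for illustration):

```r
x <- c(1, 2, 3)
# c() appends values at the end
y <- c(x, 4, 5)
print(y)   # 1 2 3 4 5
# append() inserts at any position; here after the first element
z <- append(x, c(9, 10), after = 1)
print(z)   # 1 9 10 2 3
```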
Class in R
A class is the blueprint that helps to create an object and contains its member variables along
with their attributes. As discussed in the previous section, R has two main class systems, S3
and S4, along with Reference classes.
S3 Class
The S3 class is somewhat primitive in nature. It lacks a formal definition, and an object of an
S3 class can be created simply by adding a class attribute to it.
This simplicity accounts for the fact that it is widely used in the R programming language. In
fact, most of the built-in classes in R are of this type.
Example 1: S3 class
# create a list with required components
s <- list(name = "John", age = 21, GPA = 3.5)
# name the class appropriately
class(s) <- "student"
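Once the class attribute is set, the object reports its class and its components are still accessed like list elements; a small sketch continuing the example above:

```r
# S3 object: a plain list with a class attribute
s <- list(name = "John", age = 21, GPA = 3.5)
class(s) <- "student"
# the object now reports its S3 class
print(class(s))   # "student"
# components are accessed like ordinary list elements
print(s$name)     # "John"
```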
S4 Class
S4 classes are an improvement over S3 classes. They have a formally defined structure,
which helps make objects of the same class look more or less similar.
Class components are properly defined using the setClass() function and objects are created
using the new() function.
Example 2: S4 class
setClass("student", slots=list(name="character", age="numeric", GPA="numeric"))
# objects of an S4 class are created with new()
s <- new("student", name = "John", age = 21, GPA = 3.5)
Reference Class
Reference classes were introduced later than the other two. They are more similar to the
object-oriented programming we are used to seeing in other major programming languages.
Reference classes are basically S4 classes with an environment added to them.
Example 3: Reference class
setRefClass("student")
Factors
Introduction to Factors:
Factors in R Programming Language are data structures that are implemented to categorize the
data or represent categorical data and store it on multiple levels.
They can be stored as integers with a corresponding label for every unique integer. Though
factors may look similar to character vectors, they are integers, and care must be taken while
using them as strings. A factor accepts only a restricted number of distinct values. For
example, a data field such as gender may contain only the values female and male.
Creating a Factor in R Programming Language
The command used to create or modify a factor in R language is – factor() with a vector as
input.
The two steps to creating a factor are:
Creating a vector
Converting the vector created into a factor using function factor()
Example:
# creating a vector (values assumed for illustration)
data <- c("East", "West", "East", "North", "North", "East")
print(data)
print(is.factor(data))
# converting the vector into a factor
factor_data <- factor(data)
print(factor_data)
print(is.factor(factor_data))
Example:
v <- gl(3, 4, labels = c("A", "B","C"))
print(v)
Summarizing a Factor
The summary function in R returns the results of basic statistical calculations (minimum, 1st
quartile, median, mean, 3rd quartile, and maximum) for a numerical vector. The general way
to write the R summary function is summary(x, na.rm = FALSE/TRUE). Here, x refers to the
object being summarized, while na.rm specifies whether to remove missing (NA) values from
the calculation. For a factor, summary() instead returns the count of each level.
Example:
v <- gl(3, 4, labels = c("A", "B","C"))
print(v)
summary(v)
# creating a factor (values assumed for illustration)
x <- factor(c("A", "B", "A", "C", "B", "A", "C", "B"))
print(x)
print(is.factor(x))
In the above code, x is a factor with 8 elements and 3 levels. Levels are the unique elements
in the data and can be found using the levels() function.
Ordered factors are an extension of factors. They arrange the levels in increasing order. We
create them using the factor() function along with the argument ordered = TRUE.
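A minimal sketch of an ordered factor (the levels and values are assumed for illustration):

```r
# an ordered factor: levels are given in increasing order
sizes <- factor(c("small", "large", "medium", "small"),
                levels = c("small", "medium", "large"),
                ordered = TRUE)
print(sizes)
# ordering makes comparisons between levels meaningful
print(sizes[1] < sizes[2])   # TRUE: small < large
```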
Data Frames
A data frame is created with the data.frame() function. Example:
# creating a data frame (reconstructed to match the output shown)
friend.data <- data.frame(
friend_id = c(1:5),
friend_name = c("Sachin", "Sourav", "Dravid", "Sehwag", "Dhoni")
)
print(friend.data)
Output:
friend_id friend_name
1 1 Sachin
2 2 Sourav
3 3 Dravid
4 4 Sehwag
5 5 Dhoni
# creating an employee data frame (columns other than start_date are assumed for illustration)
emp.data <- data.frame(
emp_id = c(1:5),
emp_name = c("Rick", "Dan", "Michelle", "Ryan", "Gary"),
salary = c(623.3, 515.2, 611.0, 729.0, 843.25),
start_date = as.Date(c("2012-01-01","2013-09-23","2014-11-15","2014-05-11","2015-03-27"))
)
# Extract Specific columns.
result <- data.frame(emp.data$emp_name,emp.data$salary)
print(result)
Extract 3rd and 5th row with 2nd and 4th column
result <- emp.data[c(3,5),c(2,4)]
Add Column
Just add the column vector using a new column name.
Add Row
To add more rows permanently to an existing data frame, we need to bring in the new rows in the
same structure as the existing data frame and use the rbind() function.
In the example below we create a data frame with new rows and merge it with the existing data frame
to create the final data frame.
Sample output (fragment; the code for this example is not shown), with Pulse and Duration
columns:
Pulse Duration
2 150 30
3 120 45
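A self-contained sketch of adding a row with rbind(); the column names follow the Pulse/Duration output above, and the values are assumed:

```r
df <- data.frame(Pulse = c(100, 150, 120), Duration = c(60, 30, 45))
# the new row must have the same structure as the existing data frame
new_row <- data.frame(Pulse = 110, Duration = 50)
df <- rbind(df, new_row)
print(df)
print(nrow(df))   # 4
```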
subset(emp.data, emp_id == 3)
Sorting Data
To sort a data frame in R, use the order( ) function. By default, sorting is ASCENDING.
Prepend the sorting variable by a minus sign to indicate DESCENDING order. Here are some
examples.
data = data.frame(
rollno = c(1, 5, 4, 2, 3),
subjects = c("java", "python", "php", "sql", "c"))
print(data)
print("sort the data in decreasing order based on subjects ")
print(data[order(data$subjects, decreasing = TRUE), ] )
print("sort the data in decreasing order based on rollno ")
print(data[order(data$rollno, decreasing = TRUE), ] )
Output:
rollno subjects
1 1 java
2 5 python
3 4 php
4 2 sql
5 3 c
[1] "sort the data in decreasing order based on subjects "
rollno subjects
4 2 sql
2 5 python
3 4 php
1 1 java
5 3 c
[1] "sort the data in decreasing order based on rollno "
rollno subjects
2 5 python
3 4 php
5 3 c
4 2 sql
1 1 java
Lists
Lists are one-dimensional, heterogeneous data structures. The list can be a list of vectors, a
list of matrices, a list of characters and a list of functions, and so on.
A list is a vector but with heterogeneous data elements. A list in R is created with the
list() function. R allows accessing elements of a list with the use of an index value. In R, the
indexing of a list starts with 1, unlike many other programming languages where it starts with 0.
Creating a List
To create a List in R you need to use the function called “list()”. In other words, a list is a
generic vector containing other objects. To illustrate how a list looks, we take an example
here. We want to build a list of employees with the details. So for this, we want attributes
such as ID, employee name, and the number of employees.
# components of the list (empName assumed to match the later example)
empId = c(1, 2, 3, 4)
empName = c("Debi", "Sandeep", "Subham", "Shiba")
numberOfEmp = 4
empList = list(empId, empName, numberOfEmp)
print(empList)
or
list_data <- list("Red", "Green", c(21,32,11), TRUE, 51.23, 119.1)
print(list_data)
Access components by names: All the components of a list can be named, and we can use
those names to access the components of the list using the dollar ($) operator.
empId = c(1, 2, 3, 4)
empName = c("Debi", "Sandeep", "Subham", "Shiba")
numberOfEmp = 4
empList = list(
"ID" = empId,
"Names" = empName,
"Total Staff" = numberOfEmp
)
print(empList)
Merging list
We can merge lists by combining them into a single list.
# merging two lists (example lists assumed for illustration)
list1 <- list(1, 2, 3)
list2 <- list("Sun", "Mon", "Tue")
merged.list <- c(list1, list2)
print(merged.list)
Unit-4
Conditionals and control flow
R - Operators
An operator is a symbol that tells the interpreter to perform a specific mathematical or logical
manipulation. The R language is rich in built-in operators and provides the following types of
operators.
Types of Operators
We have the following types of operators in R programming −
Arithmetic Operators
Relational Operators
Logical Operators
Assignment Operators
Miscellaneous Operators
Arithmetic Operators
Following table shows the arithmetic operators supported by R language. The operators act on
each element of the vector.
v <- c( 2,5.5,6)
t <- c(8, 3, 4)
print(v+t)
it produces the following result −
[1] 10.0 8.5 10.0
Relational Operators
Following table shows the relational operators supported by R language. Each element of the
first vector is compared with the corresponding element of the second vector. The result of
comparison is a Boolean value.
== (Equal to)
Checks if each element of the first vector is equal to the corresponding element of the
second vector.
v <- c(2,5.5,6,9)
t <- c(8,2.5,14,9)
print(v == t)
it produces the following result −
!= (Not equal to)
Checks if each element of the first vector is unequal to the corresponding element of the
second vector.
v <- c(2,5.5,6,9)
t <- c(8,2.5,14,9)
print(v!=t)
it produces the following result −
Logical Operators
Following table shows the logical operators supported by R language. They are applicable only
to vectors of type logical, numeric or complex. All non-zero numbers are treated as the
logical value TRUE, and zero as FALSE.
Each element of the first vector is compared with the corresponding element of the second
vector. The result of comparison is a Boolean value.
! (Logical NOT)
Takes each element of the vector and gives the opposite logical value.
v <- c(3,0,TRUE,2+2i)
print(!v)
it produces the following result −
[1] FALSE TRUE FALSE FALSE
The logical operators && and || consider only the first element of each vector and give a
single-element vector as output.
|| (Logical OR)
Takes the first element of both vectors and gives TRUE if one of them is TRUE.
v <- c(0,0,TRUE,2+2i)
t <- c(0,3,TRUE,2+3i)
print(v||t)
it produces the following result −
[1] FALSE
Assignment Operators
These operators (<-, =, ->, <<- and ->>) are used to assign values to vectors.
Miscellaneous Operators
These operators are used for specific purposes and not for general mathematical or logical
computation.
%in%
This operator is used to identify if an element belongs to a vector.
v1 <- 8
v2 <- 12
t <- 1:10
print(v1 %in% t)
print(v2 %in% t)
it produces the following result −
[1] TRUE
[1] FALSE
%*%
This operator is used to multiply a matrix with its transpose.
M = matrix( c(2,6,5,1,10,4), nrow = 2, ncol = 3, byrow = TRUE)
t = M %*% t(M)
print(t)
it produces the following result −
[,1] [,2]
[1,] 65 82
[2,] 82 117
R provides the following types of decision-making statements.
1 if statement
An if statement consists of a Boolean expression followed by one or more statements.
2 if...else statement
An if statement can be followed by an optional else statement, which executes when the
Boolean expression is false.
3 switch statement
A switch statement allows a variable to be tested for equality against a list of values.
R - If Statement
An if statement consists of a Boolean expression followed by one or more statements.
Syntax
The basic syntax for creating an if statement in R is −
if(boolean_expression) {
# statement(s) will execute if the boolean expression is true.
}
If the Boolean expression evaluates to be true, then the block of code inside the if statement
will be executed. If Boolean expression evaluates to be false, then the first set of code after the
end of the if statement (after the closing curly brace) will be executed.
Flow Diagram
Example
x <- 30L
if(is.integer(x)) {
print("X is an Integer")
}
When the above code is compiled and executed, it produces the following result −
[1] "X is an Integer"
R - If...Else Statement
An if statement can be followed by an optional else statement which executes when the boolean
expression is false.
Syntax
The basic syntax for creating an if...else statement in R is −
if(boolean_expression) {
# statement(s) will execute if the boolean expression is true.
} else {
# statement(s) will execute if the boolean expression is false.
}
If the Boolean expression evaluates to be true, then the if block of code will be executed,
otherwise else block of code will be executed.
Flow Diagram
Example
x <- c("what","is","truth")
if("Truth" %in% x) {
print("Truth is found")
} else {
print("Truth is not found")
}
When the above code is compiled and executed, it produces the following result −
[1] "Truth is not found"
Here "Truth" and "truth" are two different strings.
The if...else if...else Statement
An if statement can be followed by an optional else if...else statement, which is very useful
for testing various conditions using a single if...else if construct.
When using if, else if, else statements there are few points to keep in mind.
An if can have zero or one else and it must come after any else if's.
An if can have zero to many else if's and they must come before the else.
Once an else if succeeds, none of the remaining else if's or else's will be tested.
Syntax
The basic syntax for creating an if...else if...else statement in R is −
if(boolean_expression 1) {
# Executes when the boolean expression 1 is true.
} else if( boolean_expression 2) {
# Executes when the boolean expression 2 is true.
} else if( boolean_expression 3) {
# Executes when the boolean expression 3 is true.
} else {
# Executes when none of the above conditions is true.
}
Example
x <- c("what","is","truth")
if("Truth" %in% x) {
print("Truth is found the first time")
} else if ("truth" %in% x) {
print("truth is found the second time")
} else {
print("No truth found")
}
When the above code is compiled and executed, it produces the following result −
[1] "truth is found the second time"
Nested If Statements
You can also have if statements inside if statements, this is called nested if statements.
Example
x <- 41
if (x > 10) {
print("Above ten")
if (x > 20) {
print("and also above 20!")
} else {
print("but not above 20.")
}
} else {
print("below 10.")
}
AND
The & symbol (and) is a logical operator, and is used to combine conditional statements:
Example
a <- 200
b <- 33
c <- 500
# both conditions must be true (the if statement was omitted in the original; reconstructed)
if (a > b & c > a) {
print("Both conditions are true")
}
OR
The | symbol (or) is a logical operator, and is used to combine conditional statements:
Example
a <- 200
b <- 33
c <- 500
# only one condition needs to be true (the if statement was omitted in the original; reconstructed)
if (a > b | a > c) {
print("At least one of the conditions is true")
}
R - Switch Statement
A switch statement allows a variable to be tested for equality against a list of values. Each
value is called a case, and the variable being switched on is checked for each case.
Syntax
The basic syntax for creating a switch statement in R is −
switch(expression, case1, case2, case3....)
The following rules apply to a switch statement −
If the value of expression is not a character string it is coerced to integer.
You can have any number of case statements within a switch.
If the value of the integer is between 1 and nargs() − 1 (the maximum number of
arguments), then the corresponding case element is evaluated and the result is
returned.
If expression evaluates to a character string then that string is matched (exactly) to the
names of the elements.
If there is more than one match, the first matching element is returned.
No default argument is available in the usual sense. In the case of no match, if there is an
unnamed element of ..., its value is returned.
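The last rule can be sketched as follows (values assumed for illustration): when no named case matches, the unnamed element acts as a default.

```r
grade <- "x"
res <- switch(grade,
              "a" = "excellent",
              "b" = "good",
              "no match found")   # unnamed element: returned when nothing matches
print(res)   # "no match found"
```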
Flow Diagram
Example
x <- switch(
3,
"first",
"second",
"third",
"fourth"
)
print(x)
When the above code is compiled and executed, it produces the following result −
[1] "third"
Example 2:
# Mathematical calculation
val1 = 6
val2 = 7
val3 = "s"
result = switch(
val3,
"a"= cat("Addition =", val1 + val2),
"d"= cat("Subtraction =", val1 - val2),
"r"= cat("Division = ", val1 / val2),
"s"= cat("Multiplication =", val1 * val2),
"m"= cat("Modulus =", val1 %% val2),
"p"= cat("Power =", val1 ^ val2)
)
print(result)  # cat() prints its output and returns NULL, so this prints NULL
Iterative Programming in R
R - Loops
Introduction:
There may be a situation when you need to execute a block of code several times. In
general, statements are executed sequentially: the first statement in a function is
executed first, followed by the second, and so on.
Programming languages provide various control structures that allow for more
complicated execution paths.
A loop statement allows us to execute a statement or group of statements multiple
times and the following is the general form of a loop statement in most of the
programming languages −
1 repeat loop
Executes a sequence of statements repeatedly; the loop terminates only when a break
statement inside the body is reached.
2 while loop
Repeats a statement or group of statements while a given condition is true. It
tests the condition before executing the loop body.
3 for loop
Executes a sequence of statements once for each element of a vector and abbreviates
the code that manages the loop variable.
R - For Loop
A For loop is a repetition control structure that allows you to efficiently write a loop
that needs to execute a specific number of times.
Syntax
The basic syntax for creating a for loop statement in R is −
for (value in vector) {
statements
}
Flow Diagram
R’s for loops are particularly flexible in that they are not limited to integers, or even
numbers in the input. We can pass character vectors, logical vectors, lists or
expressions.
Example
v <- LETTERS[1:4]
for ( i in v) {
print(i)
}
When the above code is compiled and executed, it produces the following result −
[1] "A"
[1] "B"
[1] "C"
[1] "D"
Example
for (x in 1:10) {
print(x)
}
# iterating over a character vector (the vector's first line was lost; reconstructed)
days <- c('Monday',
'Tuesday',
'Wednesday',
'Thursday',
'Friday',
'Saturday')
for (day in days)
{
print(day)
}
R - While Loop
The While loop executes the same code again and again until a stop condition is met.
Syntax
The basic syntax for creating a while loop in R is −
while (test_expression) {
statement
}
Flow Diagram
Here key point of the while loop is that the loop might not ever run. When the condition
is tested and the result is false, the loop body will be skipped and the first statement
after the while loop will be executed.
Example1
# prints the values 1 to 5 (the loop body was omitted in the original; reconstructed)
val = 1
while (val <= 5)
{
print(val)
val = val + 1
}
Example2
n<-5
factorial <- 1
i<-1
while (i <= n)
{
factorial = factorial * i
i=i+1
}
print(factorial)
R - Repeat Loop
It is a simple loop that will run the same statement or a group of statements
repeatedly until the stop condition has been encountered. The repeat loop has
no built-in condition to terminate it; the programmer must explicitly place a
condition within the loop's body and use a break statement to terminate the
loop. If no such condition is present in the body of the repeat loop, it will
iterate infinitely.
Syntax
The basic syntax for creating a repeat loop in R is −
repeat
{
statement
if( condition )
{
break
}
}
Flow Diagram
Example1
val = 1
repeat
{
print(val)
val = val + 1
if(val > 5)
{
break
}
}
Example 2:
i<-0
repeat
{
print("Geeks 4 geeks!")
i=i+1
if (i == 5)
{
break
}
}
1 break statement
Terminates the loop statement and transfers execution to the statement
immediately following the loop.
2 next statement
The next statement skips the remainder of the current iteration and advances the
loop to its next iteration.
R - Break Statement
The break statement in R programming language has the following two usages −
When the break statement is encountered inside a loop, the loop is immediately
terminated and program control resumes at the next statement following the
loop.
It can be used to terminate a case in the switch statement
Syntax
The basic syntax for creating a break statement in R is −break
Flow Diagram
Example
for (val in 1: 5)
{
# checking condition
if (val == 3)
{
# using break keyword
break
}
# printed until the break is reached (the print and closing brace were omitted; reconstructed)
print(val)
}
R - Next Statement
The next statement in R programming language is useful when we want to skip the
current iteration of a loop without terminating it. On encountering next, the R parser
skips further evaluation and starts next iteration of the loop.
Syntax
The basic syntax for creating a next statement in R is −next
Flow Diagram
Example
for (val in 1: 5)
{
# checking condition
if (val == 3)
{
# using next keyword
next
}
# skipped for val == 3 (the print and closing brace were omitted; reconstructed)
print(val)
}
R - Functions
Functions are useful when you want to perform a certain task multiple times. A function
accepts input arguments and produces output by executing valid R commands inside the
function. In the R programming language, the function name and the file in which the
function is created need not be the same, and a single R file can contain one or more
function definitions.
Built-in Functions: Built-in functions in R, such as sqrt(), mean() and max(), can be called
directly in a program by users.
Functions in R Language
Functions are created in R by using the command function(). The general structure of a
function definition is: function_name <- function(arguments) { function body }
Built-in Function in R Programming Language
Here we will use built-in function like sum(), max() and min().
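The calls below also reference several user-defined functions (evenOdd(), areaOfCircle(), Rectangle()) whose definitions were omitted from these notes; plausible sketches consistent with the calls are:

```r
# returns "even" or "odd" for an integer (reconstructed definition)
evenOdd = function(x) {
  if (x %% 2 == 0)
    return("even")
  else
    return("odd")
}

# area of a circle of a given radius (reconstructed definition)
areaOfCircle = function(radius) {
  pi * radius^2
}

# returns area and perimeter of a rectangle; the default sizes are assumed
Rectangle = function(length = 5, width = 4) {
  area = length * width
  perimeter = 2 * (length + width)
  list("Area" = area, "Perimeter" = perimeter)
}

print(evenOdd(4))        # "even"
print(areaOfCircle(2))
print(Rectangle(2, 3))
```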
print(sum(4:6))
print(evenOdd(4))
print(evenOdd(3))
print(areaOfCircle(2))
resultList = Rectangle(2, 3)
print(resultList["Area"])
print(resultList["Perimeter"])
print(f(4))
# Case 1:
print(Rectangle(2, 3))
# Case 2:
print(Rectangle(width = 8, length = 4))
# Case 3:
print(Rectangle())
Example1:
Cal= function(a,b,c){
v = a*b
return(v)
}
# works even though c is missing: R evaluates arguments lazily and c is never used
print(Cal(5, 10))
Example2:
Cal= function(a,b,c){
v = a*b*c
return(v)
}
# calling with only two arguments raises an error here, since c is used but has no default
print(Cal(5, 10))
Adding Arguments in R
We can pass an argument to a function while calling the function by simply giving the value
as an argument inside the parenthesis. Below is an implementation of a function with a single
argument.
divisibleBy5 <- function(n){
if(n %% 5 == 0)
{
return("number is divisible by 5")
}
else
{
return("number is not divisible by 5")
}
}
# Function call
divisibleBy5(100)
Adding Multiple Arguments in R
A function in R programming can have multiple arguments too. Below is an
implementation of a function with multiple arguments.
divisible <- function(a, b){
if(a %% b == 0)
{
return(paste(a, "is divisible by", b))
}
else
{
return(paste(a, "is not divisible by", b))
}
}
# Function call
divisible(7, 3)
# Function call
divisible(10, 5)
divisible(12)   # error: argument "b" is missing, with no default
Dots Argument
Dots argument (…) is also known as ellipsis which allows the function to take an undefined
number of arguments. It allows the function to take an arbitrary number of arguments.
Below is an example of a function with an arbitrary number of arguments.
fun <- function(n, ...){
l <- list(n, ...)
paste(l, collapse = " ")
}
# Function call
fun(5, 1L, 6i, 15.2, TRUE)
Recursive functions use the concept of recursion to perform iterative tasks: they call
themselves, again and again, which acts as a loop. These kinds of functions need a stopping
condition so that they do not loop forever.
Recursive functions call themselves. They break down the problem into smaller components.
The function() calls itself within the original function() on each of the smaller components.
After this, the results will be put together to solve the original problem.
Example1:
fac <- function(x){
if(x==0 || x==1)
{
return(1)
}
else
{
return(x*fac(x-1))
}
}
fac(3)
Nested Functions
There are two ways to create a nested function: call a function within another function, or
write a function within a function.
Example
# calling a function within another function (the definition was omitted; reconstructed)
Nested_function <- function(x, y) {
a <- x + y
return(a)
}
Nested_function(Nested_function(2,2), Nested_function(3,3))
Write a function within a function.
Example
# a function defined inside another function (the code was omitted; reconstructed)
Outer_func <- function(x) {
Inner_func <- function(y) {
a <- x + y
return(a)
}
return (Inner_func)
}
output <- Outer_func(3)
output(5)
Loading an R package
Packages
Packages are collections of R functions, data, and compiled code in a well-defined format. The
directory where packages are stored is called the library. R comes with a standard set of
packages. Others are available for download and installation. Once installed, they have to be
loaded into the session to be used.
Adding Packages
You can expand the types of analyses you do by adding other packages. A complete list of
contributed packages is available from CRAN.
Load an R Package
There are basically two extremely important functions when it comes to R packages:
install.packages(), which installs a package, and library(), which loads an installed
package into the current session.
To install packages, you need administrator privileges. For example, ggplot2 is a powerful
package for data visualization; once installed, it can be loaded with library(ggplot2) and
used to create a plot of two variables of the mtcars data frame.
Mathematical Functions in R
R provides various mathematical functions to perform mathematical calculations. These
functions are very helpful for finding absolute values, square roots, logarithms, and much
more. Commonly used functions include abs(), sqrt(), ceiling(), floor(), round(), exp()
and log().
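A few of the commonly used mathematical functions, sketched with sample values:

```r
print(abs(-4.5))       # absolute value: 4.5
print(sqrt(16))        # square root: 4
print(ceiling(4.1))    # smallest integer >= x: 5
print(floor(4.9))      # largest integer <= x: 4
print(round(4.567, 2)) # rounded to 2 decimal places: 4.57
print(exp(1))          # e^1
print(log(exp(1)))     # natural logarithm: 1
```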
Numerosity reduction techniques replace the original data volume by alternative, smaller
forms of data representation.
These techniques may be parametric or nonparametric.
For parametric methods, a model is used to estimate the data, so that typically only the
data parameters need to be stored, instead of the actual data. (Outliers may also be
stored.) Regression and log-linear models are examples.
Nonparametric methods for storing reduced representations of the data include
histograms, clustering, sampling, and data cube aggregation.
The technique also works to remove noise without smoothing out the main features of
the data, making it effective for data cleaning as well. Given a set of coefficients, an
approximation of the original data can be constructed by applying the inverse of the
DWT used.
The DWT is closely related to the discrete Fourier transform (DFT), a signal
processing technique involving sines and cosines. In general, however, the DWT
achieves better lossy compression.
Unlike the DFT, wavelets are quite localized in space, contributing to the conservation
of local detail.
There is only one DFT, yet there are several families of DWTs. Figure 3.4 shows some wavelet
families. Popular wavelet transforms include the Haar-2, Daubechies-4, and Daubechies-6. The
general procedure for applying a discrete wavelet transform uses a hierarchical pyramid
algorithm that halves the data at each iteration, resulting in fast computational speed.
Wavelet transforms can be applied to multidimensional data such as a data cube. This
is done by first applying the transform to the first dimension, then to the second, and so
on.
Lossy compression by wavelets is reportedly better than JPEG compression, the current
commercial standard.
Wavelet transforms have many real world applications, including the compression of
fingerprint images, computer vision, analysis of time-series data, and data cleaning.
Attribute subset selection reduces the data set size by removing irrelevant or redundant
attributes (or dimensions). The goal of attribute subset selection is to find a minimum set of
attributes such that the resulting probability distribution of the data classes is as close as
possible to the original distribution obtained using all attributes. Mining on a reduced set of
attributes has an additional benefit: It reduces the number of attributes appearing in the
discovered patterns, helping to make the patterns easier to understand.
Therefore, heuristic methods that explore a reduced search space are commonly used for
attribute subset selection. These methods are typically greedy in that, while searching through
attribute space, they always make what looks to be the best choice at the time. Their strategy
is to make a locally optimal choice in the hope that this will lead to a globally optimal solution.
Such greedy methods are effective in practice and may come close to estimating an optimal
solution.
The “best” (and “worst”) attributes are typically determined using tests of statistical
significance, which assume that the attributes are independent of one another. Many other
attribute evaluation measures can be used such as the information gain measure used in
building decision trees for classification.
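As a concrete instance of such an evaluation measure, information gain can be computed for a candidate split. The sketch below uses toy two-class labels; all data values are invented:

```python
# Hedged sketch of the information gain measure:
# gain = entropy(parent) - weighted average entropy of the partitions
# induced by splitting on an attribute.

from math import log2

def entropy(labels):
    n = len(labels)
    return -sum((labels.count(c) / n) * log2(labels.count(c) / n)
                for c in set(labels))

def information_gain(labels, partitions):
    """partitions: the label lists obtained by splitting on an attribute."""
    n = len(labels)
    return entropy(labels) - sum(len(p) / n * entropy(p) for p in partitions)

labels = ["yes"] * 5 + ["no"] * 5
# An attribute that separates the classes perfectly gives the maximum gain:
print(information_gain(labels, [["yes"] * 5, ["no"] * 5]))  # 1.0
```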
Basic heuristic methods of attribute subset selection include the techniques that follow, some
of which are illustrated in Figure 3.6.
1. Stepwise forward selection: The procedure starts with an empty set of attributes as the
reduced set. The best of the original attributes is determined and added to the reduced set. At
each subsequent iteration or step, the best of the remaining original attributes is added to the
set.
2. Stepwise backward elimination: The procedure starts with the full set of attributes. At each
step, it removes the worst attribute remaining in the set.
3. Combination of forward selection and backward elimination: The stepwise forward
selection and backward elimination methods can be combined so that, at each step, the
procedure selects the best attribute and removes the worst from among the remaining attributes.
4. Decision tree induction: Decision tree algorithms (e.g., ID3, C4.5, and CART) were
originally intended for classification. Decision tree induction constructs a flowchart like
structure where each internal (nonleaf) node denotes a test on an attribute, each branch
corresponds to an outcome of the test, and each external (leaf) node denotes a class prediction.
At each node, the algorithm chooses the “best” attribute to partition the data into individual
classes.
When decision tree induction is used for attribute subset selection, a tree is constructed from
the given data. All attributes that do not appear in the tree are assumed to be irrelevant. The set
of attributes appearing in the tree form the reduced subset of attributes.
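The greedy strategy behind techniques 1 to 3 can be sketched as follows. The `score` argument stands in for whatever attribute-evaluation measure is used (a statistical test, information gain, and so on); the merit values below are purely illustrative:

```python
# Sketch of stepwise forward selection: repeatedly add the best remaining
# attribute according to a supplied scoring function.

def forward_selection(attributes, score, k):
    selected = []
    remaining = list(attributes)
    for _ in range(k):
        best = max(remaining, key=lambda a: score(selected + [a]))
        selected.append(best)
        remaining.remove(best)
    return selected

# Toy score: pretend each attribute contributes an independent merit.
merits = {"income": 0.9, "age": 0.4, "zip": 0.1, "credit": 0.7}
toy_score = lambda subset: sum(merits[a] for a in subset)

print(forward_selection(merits, toy_score, 2))  # ['income', 'credit']
```

Backward elimination is the mirror image: start from the full set and repeatedly drop the attribute whose removal hurts the score least.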
Regression and Log-Linear Models
Regression and log-linear models can be used to approximate the given data. In (simple) linear
regression, the data are modeled to fit a straight line. For example, a random variable, y (called
a response variable), can be modeled as a linear function of another random variable, x (called
a predictor variable), with the equation
y = wx + b,
where the variance of y is assumed to be constant. In the context of data mining, x and y are
numeric database attributes. The coefficients, w and b (called regression coefficients), specify
the slope of the line and the y-intercept, respectively.
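A minimal sketch of estimating w and b by least squares, using the standard closed-form estimates w = cov(x, y) / var(x) and b = mean(y) - w * mean(x); the toy data are invented and lie exactly on a line:

```python
# Closed-form simple linear regression for y = wx + b.

def fit_line(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    w = (sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
         / sum((xi - mx) ** 2 for xi in x))
    b = my - w * mx
    return w, b

w, b = fit_line([1, 2, 3, 4], [3, 5, 7, 9])  # points on y = 2x + 1
print(w, b)  # 2.0 1.0
```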
Histograms
Histograms use binning to approximate data distributions and are a popular form of data
reduction. A histogram for an attribute, A, partitions the data distribution of A into disjoint
subsets, referred to as buckets or bins. If each bucket represents only a single attribute–
value/frequency pair, the buckets are called singleton buckets. Often, buckets instead represent
continuous ranges for the given attribute.
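A minimal sketch of an equal-width histogram, where the value range of attribute A is cut into buckets of uniform width and each bucket stores only its range and frequency (the price list is a toy example):

```python
# Equal-width histogram as a data reduction: raw values are replaced by
# bucket ranges and their frequencies.

def equal_width_histogram(values, n_buckets):
    lo, hi = min(values), max(values)  # assumes hi > lo
    width = (hi - lo) / n_buckets
    counts = [0] * n_buckets
    for v in values:
        i = min(int((v - lo) / width), n_buckets - 1)  # clamp the maximum value
        counts[i] += 1
    return [(lo + i * width, lo + (i + 1) * width, counts[i])
            for i in range(n_buckets)]

prices = [1, 1, 5, 5, 5, 8, 8, 10, 14, 14, 15, 18, 20, 21, 25, 28, 30]
for b_lo, b_hi, count in equal_width_histogram(prices, 3):
    print(f"[{b_lo:.1f}, {b_hi:.1f}): {count}")
```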
Clustering
Clustering techniques consider data tuples as objects.
They partition the objects into groups, or clusters, so that objects within a cluster are
“similar” to one another and “dissimilar” to objects in other clusters.
Similarity is commonly defined in terms of how “close” the objects are in space, based
on a distance function.
The “quality” of a cluster may be represented by its diameter, the maximum distance
between any two objects in the cluster.
Centroid distance is an alternative measure of cluster quality and is defined as the
average distance of each cluster object from the cluster centroid.
In data reduction, the cluster representations of the data are used to replace the actual
data.
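The two quality measures described above can be sketched for 2-D points under Euclidean distance (the cluster is a toy example):

```python
# Cluster diameter (maximum pairwise distance) and centroid distance
# (average distance of members from the centroid).

from itertools import combinations
from math import dist

def diameter(cluster):
    """Maximum distance between any two objects in the cluster."""
    return max(dist(p, q) for p, q in combinations(cluster, 2))

def centroid_distance(cluster):
    """Average distance of each object from the cluster centroid."""
    n = len(cluster)
    centroid = tuple(sum(coord) / n for coord in zip(*cluster))
    return sum(dist(p, centroid) for p in cluster) / n

square = [(0, 0), (2, 0), (0, 2), (2, 2)]
print(diameter(square))           # sqrt(8): the diagonal of the square
print(centroid_distance(square))  # sqrt(2): each point's distance from (1, 1)
```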
Sampling
Sampling can be used as a data reduction technique because it allows a large data set to
be represented by a much smaller random data sample (or subset). Suppose that a large data
set, D, contains N tuples. Let’s look at the most common ways that we could sample D for data
reduction, as illustrated in Figure 3.9.
Simple random sample without replacement (SRSWOR) of size s: This is created by drawing s of
the N tuples from D (s < N), where the probability of drawing any tuple in D is 1/N, that is,
all tuples are equally likely to be sampled.
Simple random sample with replacement (SRSWR) of size s: This is similar to SRSWOR,
except that each time a tuple is drawn from D, it is recorded and then replaced. That is, after a
tuple is drawn, it is placed back in D so that it may be drawn again.
Cluster sample: If the tuples in D are grouped into M mutually disjoint “clusters,” then an SRS
of s clusters can be obtained, where s < M. For example, tuples in a database are usually
retrieved a page at a time, so that each page can be considered a cluster. A reduced data
representation can be obtained by applying, say, SRSWOR to the pages, resulting in a cluster
sample of the tuples. Other clustering criteria conveying rich semantics can also be explored.
For example, in a spatial database, we may choose to define clusters geographically based on
how closely different areas are located.
Stratified sample: If D is divided into mutually disjoint parts called strata, a stratified sample
of D is generated by obtaining an SRS at each stratum. This helps ensure a representative
sample, especially when the data are skewed. For example, a stratified sample may be obtained
from customer data, where a stratum is created for each customer age group. In this way, the
age group having the smallest number of customers will be sure to be represented.
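The sampling schemes above can be sketched with the standard library's `random` module; the customer records and strata here are invented for illustration:

```python
# SRSWOR, SRSWR, and stratified sampling sketches.

import random

def srswor(data, s):
    """Simple random sample without replacement."""
    return random.sample(data, s)

def srswr(data, s):
    """Simple random sample with replacement: a tuple may be drawn again."""
    return [random.choice(data) for _ in range(s)]

def stratified(data, key, s_per_stratum):
    """Draw an SRS within each stratum defined by key(record)."""
    strata = {}
    for rec in data:
        strata.setdefault(key(rec), []).append(rec)
    return [rec for group in strata.values()
            for rec in random.sample(group, min(s_per_stratum, len(group)))]

# Skewed toy data: 90 "young" customers, only 10 "senior" ones.
customers = [("young", i) for i in range(90)] + [("senior", i) for i in range(10)]
sample = stratified(customers, key=lambda r: r[0], s_per_stratum=5)
print(len(sample))  # 10: five from each stratum, so "senior" is represented
```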
An advantage of sampling for data reduction is that the cost of obtaining a sample is
proportional to the size of the sample, s, as opposed to N, the data set size. Hence, sampling
complexity is potentially sublinear to the size of the data. Other data reduction techniques can
require at least one complete pass through D. For a fixed sample size, sampling complexity
increases only linearly as the number of data dimensions, n, increases, whereas techniques
using histograms, for example, increase exponentially in n.
When applied to data reduction, sampling is most commonly used to estimate the answer to an
aggregate query. It is possible (using the central limit theorem) to determine a sufficient sample
size for estimating a given function within a specified degree of error. This sample size, s, may
be extremely small in comparison to N. Sampling is a natural choice for the progressive
refinement of a reduced data set. Such a set can be further refined by simply increasing the
sample size.
Data Cube Aggregation
Data cubes store multidimensional aggregated information. For example, Figure 3.11 shows a data
cube for multidimensional analysis of sales data with respect to annual sales per item type for
each AllElectronics branch. Each cell holds an aggregate data value, corresponding to the data
point in multidimensional space. (For readability, only some cell values are shown.) Concept
hierarchies may exist for each attribute, allowing the analysis of data at multiple abstraction
levels. For example, a hierarchy for branch could allow branches to be grouped into regions,
based on their address. Data cubes provide fast access to precomputed, summarized data,
thereby benefiting online analytical processing as well as data mining.
The cube created at the lowest abstraction level is referred to as the base cuboid. The
base cuboid should correspond to an individual entity of interest such as sales or customer. In
other words, the lowest level should be usable, or useful for the analysis. A cube at the highest
level of abstraction is the apex cuboid. For the sales data in Figure 3.11, the apex cuboid would
give one total—the total sales for all three years, for all item types, and for all branches. Data
cubes created for varying levels of abstraction are often referred to as cuboids, so that a data
cube may instead refer to a lattice of cuboids. Each higher abstraction level further reduces the
resulting data size. When replying to data mining requests, the smallest available cuboid
relevant to the given task should be used.
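The cuboid idea can be sketched as grouped aggregation over toy sales records (dimension names and amounts are invented). The base cuboid groups by every dimension; dropping dimensions climbs the lattice, and grouping by nothing yields the apex cuboid:

```python
# Cuboids as grouped aggregation over (item, branch, year, amount) records.

from collections import defaultdict

sales = [
    ("TV", "A", 2022, 100), ("TV", "B", 2022, 150),
    ("TV", "A", 2023, 120), ("PC", "A", 2023, 200),
]

def cuboid(records, dims):
    """Total amounts grouped by the dimension indices in `dims`."""
    totals = defaultdict(int)
    for rec in records:
        totals[tuple(rec[i] for i in dims)] += rec[3]
    return dict(totals)

print(cuboid(sales, (0, 1, 2)))  # base cuboid: one cell per (item, branch, year)
print(cuboid(sales, (0,)))       # item cuboid: {('TV',): 370, ('PC',): 200}
print(cuboid(sales, ()))         # apex cuboid: {(): 570}
```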
Data Visualization
Data visualization aims to communicate data clearly and effectively through graphical
representation. Data visualization has been used extensively in many applications—for
example, at work for reporting, managing business operations, and tracking progress of tasks.
More popularly, we can take advantage of visualization techniques to discover data
relationships that are otherwise not easily observable by looking at the raw data. Nowadays,
people also use data visualization to create fun and interesting graphics.
This section starts with multidimensional data such as those stored in relational
databases. We discuss several representative approaches, including pixel-oriented techniques,
geometric projection techniques, icon-based techniques, and hierarchical and graph-based
techniques. We then discuss the visualization of complex data and relations.
Pixel-Oriented Visualization Techniques
A simple way to visualize the value of a dimension is to use a pixel where the color of
the pixel reflects the dimension’s value. For a data set of m dimensions, pixel-oriented
techniques create m windows on the screen, one for each dimension. The m dimension values
of a record are mapped to m pixels at the corresponding positions in the windows.
The colors of the pixels reflect the corresponding values. Inside a window, the data
values are arranged in some global order shared by all windows. The global order may be
obtained by sorting all data records in a way that’s meaningful for the task at hand.
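The layout idea can be sketched by sorting records into one global order and producing one shade sequence per dimension window (toy records; shades normalized to [0, 1]):

```python
# Pixel-oriented layout sketch: all windows share a single global order
# (here, ascending income); each dimension becomes a sequence of shades,
# lighter for smaller values. Records are hypothetical (income, credit) pairs.

records = [(30, 10), (90, 45), (50, 20), (70, 35)]

ordered = sorted(records)  # global order: income-ascending

def window(dim):
    """One dimension's shade sequence, normalized to [0, 1] (0 = lightest)."""
    values = [rec[dim] for rec in ordered]
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

income_window = window(0)
credit_window = window(1)
print(income_window)  # strictly darker left to right: it defines the order
print(credit_window)  # also increasing here, suggesting correlation with income
```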
Pixel-oriented visualization. AllElectronics maintains a customer information table,
which consists of four dimensions: income, credit limit, transaction volume, and age. Can we
analyze the correlation between income and the other attributes by visualization? We can sort
all customers in income-ascending order, and use this order to lay out the customer data in the
four visualization windows, as shown in Figure 2.10. The pixel colors are chosen so that the
smaller the value, the lighter the shading. Using pixel-based visualization, we can easily observe
the following: credit limit increases as income increases; customers whose income is in the
middle range are more likely to purchase more from AllElectronics; there is no clear correlation
between income and age.
In pixel-oriented techniques, data records can also be ordered in a query-dependent
way. For example, given a point query, we can sort all records in descending order of similarity
to the point query.
Filling a window by laying out the data records in a linear way may not work well for
a wide window. The first pixel in a row is far away from the last pixel in the previous row,
though they are next to each other in the global order. Moreover, a pixel is next to the one
above it in the window, even though the two are not next to each other in the global order. To
solve this problem, we can lay out the data records in a space-filling curve to fill the windows.
A space-filling curve is a curve with a range that covers the entire n-dimensional unit
hypercube. Since the visualization windows are 2-D, we can use any 2-D space-filling curve.
Figure 2.11 shows some frequently used 2-D space-filling curves. Note that the
windows do not have to be rectangular. For example, the circle segment technique uses
windows in the shape of segments of a circle, as illustrated in Figure 2.12. This technique can
ease the comparison of dimensions because the dimension windows are located side by side
and form a circle.
Geometric Projection Visualization Techniques
A scatter plot displays 2-D data points using Cartesian coordinates. A third dimension
can be added using different colors or shapes to represent different data points. Figure 2.13
shows an example, where X and Y are two spatial attributes and the third dimension is
represented by different shapes. Through this visualization, we can see that points of types “+”
and “_” tend to be colocated.
A 3-D scatter plot uses three axes in a Cartesian coordinate system. If it also uses color,
it can display up to 4-D data points (Figure 2.14).
For data sets with more than four dimensions, scatter plots are usually ineffective. The
scatter-plot matrix technique is a useful extension to the scatter plot. For an n dimensional
data set, a scatter-plot matrix is an n × n grid of 2-D scatter plots that provides a visualization of
each dimension with every other dimension. Figure 2.15 shows an example, which visualizes
the Iris data set. The data set consists of 50 samples from each of three species of Iris flowers.
There are five dimensions in the data set: length and width of sepal and petal, and species. The
scatter-plot matrix becomes less effective as the dimensionality increases. Another popular
technique, called parallel coordinates, can handle higher dimensionality. To visualize n-
dimensional data points, the parallel coordinates technique draws n equally spaced axes, one
for each dimension, parallel to one of the display axes.
A data record is represented by a polygonal line that intersects each axis at the point
corresponding to the associated dimension value.
A major limitation of the parallel coordinates technique is that it cannot effectively
show a data set of many records. Even for a data set of several thousand records, visual clutter
and overlap often reduce the readability of the visualization and make the patterns hard to find.
Viewing large tables of data can be tedious. By condensing the data, Chernoff faces
make the data easier for users to digest. In this way, they facilitate visualization of regularities
and irregularities present in the data, although their power in relating multiple relationships is
limited. Another limitation is that specific data values are not shown.
Furthermore, facial features vary in perceived importance. This means that the
similarity of two faces (representing two multidimensional data points) can vary depending on
the order in which dimensions are assigned to facial characteristics. Therefore, this mapping
should be carefully chosen. Eye size and eyebrow slant have been found to be important.
Asymmetrical Chernoff faces were proposed as an extension to the original technique.
Since a face has vertical symmetry (along the y-axis), the left and right sides of a face
are identical, which wastes space. Asymmetrical Chernoff faces double the number of facial
characteristics, thus allowing up to 36 dimensions to be displayed. The stick figure
visualization technique maps multidimensional data to five-piece stick figures, where each
figure has four limbs and a body. Two dimensions are mapped to the display (x and y) axes
and the remaining dimensions are mapped to the angle and/or length of the limbs. Figure 2.18
shows census data, where age and income are mapped to the display axes, and the remaining
dimensions (gender, education, and so on) are mapped to stick figures. If the data items are
relatively dense with respect to the two display dimensions, the resulting visualization shows
texture patterns, reflecting data trends.
Visualizing Complex Data and Relations
In the early days, visualization techniques were mainly used for numeric data. Recently, more
and more non-numeric data, such as text and social networks, have become available.
Visualizing and analyzing such data attracts a lot of interest. There are many new visualization
techniques dedicated to these kinds of data.
For example, many people on the Web tag various objects such as pictures, blog entries,
and product reviews. A tag cloud is a visualization of statistics of user-generated tags. Often,
in a tag cloud, tags are listed alphabetically or in a user-preferred order. The importance of a
tag is indicated by font size or color. Figure 2.21 shows a tag cloud for visualizing the popular
tags used in a Web site. Tag clouds are often used in two ways. First, in a tag cloud for a single
item, we can use the size of a tag to represent the number of times that the tag is applied to this
item by different users.
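This sizing scheme can be sketched by mapping each tag's count linearly onto a font-size range; the tags, counts, and point sizes below are all invented:

```python
# Tag-cloud sizing sketch: linear interpolation from a tag's count to a
# font size between chosen minimum and maximum point sizes.

tag_counts = {"camera": 120, "laptop": 80, "phone": 45, "cable": 5}

def font_sizes(counts, min_pt=10, max_pt=40):
    lo, hi = min(counts.values()), max(counts.values())
    return {tag: min_pt + (n - lo) * (max_pt - min_pt) / (hi - lo)
            for tag, n in counts.items()}

sizes = font_sizes(tag_counts)
print(sizes["camera"], sizes["cable"])  # 40.0 10.0: most vs. least popular tag
```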
Second, when visualizing the tag statistics on multiple items, we can use the size of a
tag to represent the number of items that the tag has been applied to, that is, the popularity of
the tag. In addition to complex data, complex relations among data entries also raise challenges
for visualization. For example, Figure 2.22 uses a disease influence graph to visualize the
correlations between diseases. The nodes in the graph are diseases, and the size of each node
is proportional to the prevalence of the corresponding disease. Two nodes are linked by an edge
if the corresponding diseases have a strong correlation.
The width of an edge is proportional to the strength of the correlation pattern of the two
corresponding diseases.
In summary, visualization provides effective tools to explore data. We have introduced
several popular methods and the essential ideas behind them. There are many existing tools
and methods. Moreover, visualization can be used in data mining in various aspects. In addition
to visualizing data, visualization can be used to represent the data mining process, the patterns
obtained from a mining method, and user interaction with the data. Visual data mining is an
important research and development direction.