r lang-Unit-02
You can use the following command in the R console to install the "xlsx" package. It may ask you to install some
additional packages on which this package depends. Run the same command with the required package
name to install each additional package.
install.packages("xlsx")
Use the following command to verify and load the "xlsx" package.
[1] TRUE
Loading required package: rJava
Loading required package: methods
Loading required package: xlsxjars
1.3 Input as xlsx File
Open Microsoft Excel. Copy and paste the following data into the worksheet named sheet1.
Also copy and paste the following data to another worksheet and rename this worksheet to "city".
name city
Rick Seattle
Dan Tampa
Michelle Chicago
Ryan Seattle
Gary Houston
Nina Boston
Simon Mumbai
Guru Dallas
Save the Excel file as "input.xlsx". You should save it in the current working directory of the R workspace.
The input.xlsx is read by using the read.xlsx() function as shown below. The result is stored as a data frame
in the R environment.
In R, we can read data from files stored outside the R environment. We can also write data into files which
will be stored and accessed by the operating system. R can read and write into various file formats like csv,
excel, xml etc.
In this chapter we will learn to read data from a csv file and then write data into a csv file. The file should
be present in current working directory so that R can read it. Of course we can also set our own directory and
read files from there.
The csv file is a text file in which the values in the columns are separated by a comma. Let's consider
the following data present in the file named input.csv.
You can create this file using Windows Notepad by copying and pasting this data. Save the file
as input.csv using Save As with the "All files (*.*)" option in Notepad.
id,name,salary,start_date,dept
1,Rick,623.3,2012-01-01,IT
2,Dan,515.2,2013-09-23,Operations
3,Michelle,611,2014-11-15,IT
4,Ryan,729,2014-05-11,HR
5,Gary,843.25,2015-03-27,Finance
6,Nina,578,2013-05-21,IT
7,Simon,632.8,2013-07-30,Operations
8,Guru,722.5,2014-06-17,Finance
The following is a simple example of the read.csv() function reading a CSV file available in your current working
directory:
data <- read.csv("input.csv")
print(data)
Executing this code prints the contents of the file as a data frame.
By default the read.csv() function gives the output as a data frame. This can be easily checked as
follows. Also we can check the number of columns and rows.
print(is.data.frame(data))
print(ncol(data))
print(nrow(data))
[1] TRUE
[1] 5
[1] 8
Once we read data in a data frame, we can apply all the functions applicable to data frames as explained in a
subsequent section.
[1] 843.25
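As a minimal sketch of applying data frame functions to data read from a CSV file, the snippet below finds the highest salary (the 843.25 shown above) and filters rows. To stay self-contained it reads an inline copy of the sample data through the text argument of read.csv() instead of opening input.csv. The variable names max_salary, top_earner, and it_staff are illustrative only.

```r
# Inline copy of the sample input.csv so the sketch runs without the file on disk
csv_text <- "id,name,salary,start_date,dept
1,Rick,623.3,2012-01-01,IT
2,Dan,515.2,2013-09-23,Operations
3,Michelle,611,2014-11-15,IT
4,Ryan,729,2014-05-11,HR
5,Gary,843.25,2015-03-27,Finance
6,Nina,578,2013-05-21,IT
7,Simon,632.8,2013-07-30,Operations
8,Guru,722.5,2014-06-17,Finance"

data <- read.csv(text = csv_text)

# Highest salary in the data frame
max_salary <- max(data$salary)
print(max_salary)

# Row of the person earning that salary
top_earner <- subset(data, salary == max_salary)
print(top_earner)

# All rows for the IT department
it_staff <- subset(data, dept == "IT")
print(it_staff)
```

With the real file in the working directory, read.csv("input.csv") would take the place of read.csv(text = csv_text).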
R can create a csv file from an existing data frame. The write.csv() function is used to create the csv file. This file gets
created in the working directory.
Here the column X comes from the data set newper. This can be dropped using additional parameters while
writing the file.
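A minimal sketch of dropping that column, assuming the X column is the row-name column that write.csv() adds by default: passing row.names = FALSE leaves it out. The staff data frame and the temporary file paths are made up for illustration and are not the newper data set from the text.

```r
# Small illustrative data frame (hypothetical values)
staff <- data.frame(
  id = 1:3,
  name = c("Rick", "Dan", "Michelle"),
  salary = c(623.3, 515.2, 611)
)

# Temporary files so nothing is written to the working directory
path_default <- tempfile(fileext = ".csv")
path_clean <- tempfile(fileext = ".csv")

# Default behaviour: row names are written as an unnamed first column
write.csv(staff, path_default)

# row.names = FALSE leaves the row names out of the file
write.csv(staff, path_clean, row.names = FALSE)

# Reading both files back shows the difference in column names
cols_default <- names(read.csv(path_default))
cols_clean <- names(read.csv(path_clean))
print(cols_default)
print(cols_clean)
```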
"If" Statements
if syntax
if statements let you execute statements conditionally; that is, which statements execute depends on
whether some condition is true or false. For example:
func max(a int, b int) int {
    var m int
    if a > b {
        m = a
    } else {
        m = b
    }
    return m
}
The general form is:
if CONDITION {
    THEN PART
} else {
    ELSE PART
}
If CONDITION is true, then the statements in the THEN PART will be executed.
If CONDITION is false, then the statements in the ELSE PART will be executed.
Test yourself! What kind of computer instruction do you think plays an important role in how if
statements are executed?
Conditions
The CONDITION part of an if statement can be any expression that evaluates to a Boolean value. For
example, the following comparison operators return Boolean values.
e1 == e2 e1 is equal to e2
e1 != e2 e1 is not equal to e2
a > 10 * b + c
10 == 10
The Boolean operators && and || (and and or) are particularly useful in if conditions.
Additional if examples
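A max function:
func max(a, b int) int {
    if a > b {
        return a
    }
    return b
}
The same function written with an else clause:
func max(a, b int) int {
    if a > b {
        return a
    } else {
        return b
    }
}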
The else clause is optional, and you can have as many statements as you want inside the THEN PART and
ELSE PART :
if temperature > 100 {
    fmt.Println("Warning: too hot!")
    fmt.Println("Run away!")
}
Go requires that
Another example:
if a < 10 {
    a = a*a
}
if a * a > 3*b {
    // swap a and b
    var t int = a
    a = b
    b = t
}
if a < b {
    ...
} else {
    ...
}
if statements can be nested: an if statement can appear inside the THEN PART or ELSE PART of
another if statement. This is common, and lets you make complex decisions.
// returns the smallest even number among 2 ints; returns 0 if both are odd
func smallestEven(a, b int) int {
    if a % 2 == 0 {
        if b % 2 == 0 {
            // both a and b are even, so return smaller one
            if a < b {
                return a
            } else {
                return b
            }
        } else {
            // only a is even
            return a
        }
    } else if b % 2 == 0 { // ***
        // only b is even
        return b
    } else {
        // both a and b are odd
        return 0
    }
}

if a % 2 == 0 {
    ...
} else {
    ...
}
"For" Loops
for syntax
Sometimes we want to execute a sequence of instructions many times. For this, a loop is what we need.
The statements in the body of the loop will be executed until the loop condition is false. Each time through the
loop is called an iteration.
An example:
func factorial(n int) int {
    var f int = 1
    var i int
    for i = 1; i <= n; i = i + 1 {
        f = f * i
    }
    return f
}
There are 3 parts following the for and before the opening {, and these parts are separated by semicolons.
These parts work as follows:
INITIALIZATION_STATEMENT: a single statement that is executed one time before the loop starts.
CONDITION: the loop will repeatedly execute until this condition is false.
POST-ITERATION_STATEMENT: this is run after each time the FOR-BODY is executed.
Test yourself! How many times is the word "Hi" printed by this loop?
var i int
for i = 10; i < 20; i = i+1 {
    fmt.Println("Hi")
}
"While" loops
If we only include the CONDITION in a for loop, we will execute the FOR-BODY "while" CONDITION is true.
You could re-write factorial as:
var f int = 1
var i int = 1
for i <= n { // only condition in for
    f = f * i
    i = i + 1
}
The i++ and i-- operators add 1 or subtract 1 from a variable. These are particularly useful in the
POST-ITERATION_STATEMENT part of a for loop, where you can write i++ instead of the longer (but
exactly equivalent) i = i + 1.
:= variable declarations
Notice in the factorial example that we had to create a variable that just served to count how many
times we had executed the loop. This is quite common. Go provides a shorthand for this so that you can
declare a variable inside of the INITIALIZATION_STATEMENT:
v := 1
The := operator both declares and initializes a variable. The above is equivalent to:
var v int = 1
The answer is that Go knows v must be an integer because 1 is an integer. This works for strings
and float64s too:
r := 3.14159
s := "Hi there"
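These statements save you typing var and the type. We can now rewrite factorial in a clearer,
more typical way:
func factorial(n int) int {
    var f int = 1
    for i := 1; i <= n; i++ {
        f = f * i
    }
    return f
}
What is the difference between
var f int = 1
for i := 1; i <= n; i++ {
    f = f * i
}
and
var f int = 1
i := 1
for i <= n {
    f = f * i
    i++
}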
The answer is the scope of the variable i. In the first snippet, i lasts only for the loop, while in the second
it remains in scope after the loop ends.
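// BAD CODE
func sumSquares() {
    // meant to print the running sum of the squares
    // of the numbers 1 to 10
    for i := 1; i <= 10; i++ {
        var j int
        j = j + i * i
        fmt.Println(i, j)
    }
}
This prints: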
1 1
2 4
3 9
4 16
5 25
6 36
7 49
8 64
9 81
10 100
which are the first 10 squares, not their sums. Why does this happen?
Variable j is created and destroyed each time through the loop!
Nested loops
fmt.Println("")
will print:
10
11
4. Functions
A function in R is an object containing multiple interrelated statements that are run together in a
predefined order every time the function is called. Functions in R can be built-in or created by the user (user-
defined). The main purpose of creating a user-defined function is to optimize our program, avoid the
repetition of the same block of code used for a specific task that is frequently performed in a particular
project, prevent us from inevitable and hard-to-debug errors related to copy-paste operations, and make the
code more readable. A good practice is creating a function whenever we're supposed to run a certain set of
commands more than twice.
• min(), max(), mean(), median() – return the minimum / maximum / mean / median value of a numeric
vector, correspondingly
• sum() – returns the sum of a numeric vector
• range() – returns the minimum and maximum values of a numeric vector
• abs() – returns the absolute value of a number
• str() – shows the structure of an R object
• print() – displays an R object on the console
• ncol() – returns the number of columns of a matrix or a dataframe
• length() – returns the number of items in an R object (a vector, a list, etc.)
• nchar() – returns the number of characters in a character object
• sort() – sorts a vector in ascending or descending (decreasing=TRUE) order
• exists() – returns TRUE or FALSE depending on whether or not a variable is defined in the R environment
vector <- c(3, 5, 2, 3, 1, 4)
print(min(vector))
print(mean(vector))
print(median(vector))
print(sum(vector))
print(range(vector))
print(str(vector))
print(length(vector))
print(sort(vector, decreasing=TRUE))
print(exists('vector'))
[1] 1
[1] 3
[1] 3
[1] 18
[1] 1 5
num [1:6] 3 5 2 3 1 4
NULL
[1] 6
[1] 5 4 3 3 2 1
[1] TRUE
Function Name
This is the name of the function object that will be stored in the R environment after the function definition and
used for calling that function. It should be concise but clear and meaningful so that the user who reads our code
can easily understand what exactly this function does. For example, if we need to create a function for calculating
the circumference of a circle with a known radius, we'd better call this function circumference rather
than function_1 or circumference_of_a_circle. (Side note: While commonly we use verbs in function names, it's ok to use
just a noun if that noun is very descriptive and unambiguous.)
Function Parameters
Sometimes, they are called formal arguments. Function parameters are the variables in the function
definition placed inside the parentheses and separated with a comma that will be set to actual values
(called arguments) each time we call the function. For example:
circumference <- function(r){
  2*pi*r
}
print(circumference(2))
[1] 12.56637
Above, we created a function to calculate the circumference of a circle with a known radius using the
formula C = 2πr, so the function has only one parameter, r. After defining the function, we called it with
the radius equal to 2 (hence, with the argument 2).
It's possible, even though rarely useful, for a function to have no parameters:
hello_world <- function(){
  'Hello, World!'
}
print(hello_world())
[1] "Hello, World!"
Also, some parameters can be set to default values (those related to a typical case) inside the function
definition, which then can be reset when calling the function. Returning to our circumference function, we
can set the default radius of a circle as 1, so if we call the function with no argument passed, it will
calculate the circumference of a unit circle (i.e., a circle with a radius of 1). Otherwise, it will calculate the
circumference of a circle with the provided radius:
circumference <- function(r=1){
  2*pi*r
}
print(circumference())
print(circumference(2))
[1] 6.283185
[1] 12.56637
Function Body
The function body is a set of commands inside the curly braces that are run in a predefined order every
time we call the function. In other words, in the function body, we place what exactly we need the function
to do:
sum_two_nums <- function(x, y){
  x + y
}
print(sum_two_nums(1, 2))
[1] 3
Note that the statements in the function body (in the above example – the only statement x + y) should
be indented by 2 or 4 spaces, depending on the IDE where we run the code, but the important thing is to
be consistent with the indentation throughout the program. While it doesn't affect the code performance
and isn't obligatory, it makes the code easier to read.
It's possible to drop the curly braces if the function body contains a single statement. For example:
sum_two_nums <- function(x, y) x + y
print(sum_two_nums(1, 2))
[1] 3
As we saw from all the above examples, in R, it usually isn't necessary to explicitly include the return
statement when defining a function, since an R function automatically returns the last evaluated
expression in the function body. However, we can still add the return statement inside the function body
using the syntax return(expression_to_be_returned). This makes the returned value explicit, which helps when
we return more than one result from a function, combined into a single object such as a vector. For example:
mean_median <- function(vector){
  mean <- mean(vector)
  median <- median(vector)
  return(c(mean, median))
}
print(mean_median(c(1, 1, 1, 2, 3)))
[1] 1.6 1.0
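When the returned results should be retrievable by name rather than by position, a named list is a common alternative to a plain vector. The sketch below is illustrative only; the function name mean_median_list and the parameter name values are not from the original example.

```r
# Return several results as a named list (hypothetical variant of mean_median)
mean_median_list <- function(values) {
  list(mean = mean(values), median = median(values))
}

result <- mean_median_list(c(1, 1, 1, 2, 3))

# Each result can be picked out by name with $
print(result$mean)
print(result$median)
```

Accessing result$mean and result$median by name avoids having to remember which position in the vector holds which statistic.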
Calling a Function in R
In all the above examples, we actually already called the created functions many times. To do so, we just
wrote the function name and added the necessary arguments inside the parentheses. In R, function
arguments can be passed by position, by name (so-called named arguments), by mixing position-based
and name-based matching, or by omitting the arguments altogether.
If we pass the arguments by position, we need to follow the same sequence of arguments as defined in
the function:
subtract_two_nums <- function(x, y){
  x - y
}
print(subtract_two_nums(3, 1))
[1] 2
If we pass the arguments by name, i.e., explicitly specify what value each parameter defined in the
function takes, the order of the arguments doesn't matter:
subtract_two_nums <- function(x, y){
  x - y
}
print(subtract_two_nums(x=3, y=1))
print(subtract_two_nums(y=1, x=3))
[1] 2
[1] 2
Since we explicitly assigned x=3 and y=1, we can pass them either as x=3, y=1 or y=1, x=3 – the result
will be the same.
It's possible to mix position- and name-based matching of the arguments. Let's look at the example of the
function for calculating BMR (basal metabolic rate), or daily consumption of calories, for women based on
their weight (in kg), height (in cm), and age (in years). The formula that will be used in the function is
the Mifflin-St Jeor equation for women:
BMR = 10 × weight + 6.25 × height − 5 × age − 161
Now, let's calculate the calories for a woman 30 years old, with a weight of 60 kg and a height of 165 cm.
However, for the age parameter, we'll pass the argument by name and for the other two parameters, we'll
pass the arguments by position:
In the case like above (when we mix matching by name and by position), the named arguments are
extracted from the whole succession of arguments and are matched first, while the rest of the arguments
are matched by position, i.e., in the same order as they appear in the function definition. However, this
practice isn't recommended and can lead to confusion.
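The call described above can be sketched as follows. The function body follows the Mifflin-St Jeor equation for women (10 × weight + 6.25 × height − 5 × age − 161). The parameter names weight, height, and age are assumptions based on the description in the text, not the original definition.

```r
# BMR for women, Mifflin-St Jeor equation
# (parameter names are assumed: weight in kg, height in cm, age in years)
calculate_calories_women <- function(weight, height, age) {
  (10 * weight) + (6.25 * height) - (5 * age) - 161
}

# age is matched by name first; 60 and 165 are then matched by position
# to the remaining parameters, weight and height
mixed <- calculate_calories_women(age = 30, 60, 165)
print(mixed)  # 1320.25

# The same call with every argument named, which is easier to read
named <- calculate_calories_women(weight = 60, height = 165, age = 30)
print(named)
```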
Finally, we can omit some (or all) of the arguments altogether. This is possible if we set some (or all) of the
parameters to default values inside the function definition. Let's return to
our calculate_calories_women function and set the default age of a woman to 30 years:
calculate_calories_women <- function(weight, height, age=30){
  (10 * weight) + (6.25 * height) - (5 * age) - 161
}
print(calculate_calories_women(60, 165))
[1] 1320.25
In the above example, we passed only two arguments to the function, despite it having three parameters
in its definition. However, since one of the parameters has a default value assigned to it when we pass
two arguments to the function, R interprets that the third missing argument should be set to its default
value and makes the calculations accordingly, without throwing an error.
When calling a function, we usually assign the result of this operation to a variable, to be able to use it
later:
circumference <- function(r){
  2*pi*r
}
circumference_radius_5 <- circumference(5)
print(circumference_radius_5)
[1] 31.41593
Using Functions Inside Other Functions
Inside the definition of an R function, we can use other functions. We've already seen such an example
earlier, when we used the built-in mean() and median() functions inside a user-defined
function mean_median:
mean_median <- function(vector){
  mean <- mean(vector)
  median <- median(vector)
  return(c(mean, median))
}
It's also possible to pass the output of calling one function directly as an argument to another function:
radius_from_diameter <- function(d){
  d/2
}
circumference <- function(r){
  2*pi*r
}
print(circumference(radius_from_diameter(4)))
[1] 12.56637
In the above piece of code, we created two simple functions first: for calculating the radius of a circle
given its diameter and for calculating the circumference of a circle given its radius. Since originally we
knew only the diameter of a circle (equal to 4), we called the radius_from_diameter function inside
the circumference function to calculate first the radius from the provided value of diameter and then
calculate the circumference of the circle. While this approach can be useful in many cases, we should be
careful with it and avoid passing too many functions as arguments to other functions since it can affect the
code readability.
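One way to keep such code readable, sketched below, is to store the intermediate result in a named variable instead of nesting the calls. The variable names nested, radius, and stepwise are illustrative only.

```r
radius_from_diameter <- function(d) d / 2
circumference <- function(r) 2 * pi * r

# Nested call: compact, but read from the inside out
nested <- circumference(radius_from_diameter(4))

# Step-by-step: each intermediate value gets a descriptive name
radius <- radius_from_diameter(4)
stepwise <- circumference(radius)

# Both approaches give the same value
print(nested)
print(stepwise)
```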
Finally, functions can be nested, meaning that we can define a new function inside another function. Let's
say that we need a function that sums up the circle areas of 3 non-intersecting circles:
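sum_circle_ares <- function(r1, r2, r3){
  circle_area <- function(r){
    pi*r^2
  }
  circle_area(r1) + circle_area(r2) + circle_area(r3)
}
print(sum_circle_ares(1, 2, 3))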
[1] 43.9823
Above, we defined the circle_area function inside the sum_circle_ares function. We then called that
inner function three times (circle_area(r1), circle_area(r2), and circle_area(r3)) inside the outer
function to calculate the area of each circle for further summing up those areas. Now, if we try to call
the circle_area function outside the sum_circle_ares function, the program throws an error, because the
inner function exists and works only inside the function where it was defined:
print(circle_area(10))
Error in circle_area(10): could not find function "circle_area"
Traceback:
1. print(circle_area(10))
1. Similar to creating any function, the inner function is supposed to be used at least 3
times inside the outer function. Otherwise, it isn't viable to create it.
2. If we want to be able to use the function independent of the bigger function, we should
create it outside the bigger function instead of nesting these functions. For example, if
we were going to use the circle_area function outside the sum_circle_ares function,
we would write the following code:
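circle_area <- function(r){
  pi*r^2
}
sum_circle_ares <- function(r1, r2, r3){
  circle_area(r1) + circle_area(r2) + circle_area(r3)
}
print(sum_circle_ares(1, 2, 3))
print(circle_area(10))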
[1] 43.9823
[1] 314.1593
Here, we again used the circle_area function inside the sum_circle_ares function. However, this
time, we were also able to call it outside that function and get the result rather than an error.
5. Dataset
There is a popular built-in data set in R called "mtcars" (Motor Trend Car Road
Tests), which is retrieved from the 1974 Motor Trend US Magazine.
You can use the question mark (?) to get information about the mtcars data set, for example by typing ?mtcars in the R console.
2.Get Information
a.Use the dim() function to find the dimensions of the data set, and
the names() function to view the names of the variables:
Example:
Data_Cars <- mtcars # create a variable of the mtcars data set for better organization
# Use dim() to find the dimensions of the data set
dim(Data_Cars)
# Use names() to find the names of the variables from the data set
names(Data_Cars)
b.Use the rownames() function to get the name of each row in the first
column, which is the name of each car:
Example:
rownames(Data_Cars)
If you want to print all values that belong to a variable, access the data frame
by using the $ sign, and the name of the variable (for example cyl )
Example:
Data_Cars$cyl
sort(Data_Cars$cyl)
Now that we have some information about the data set, we can start to
analyze it with some statistical numbers.
For example, we can use the summary() function to get a statistical summary of
the data:
Example:
summary(Data_Cars)
The summary() function returns six statistical numbers for each variable:
• Min
• First quartile (25th percentile)
• Median
• Mean
• Third quartile (75th percentile)
• Max
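To see these six numbers for a single variable, summary() can also be applied to one column. This is a small sketch using the hp (horsepower) column; the variable name hp_summary is illustrative.

```r
Data_Cars <- mtcars

# Summary of one variable instead of the whole data frame
hp_summary <- summary(Data_Cars$hp)
print(hp_summary)

# The six values are named, so individual entries can be picked out
print(names(hp_summary))
print(hp_summary[["Median"]])
```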
6. Max Min
In the previous chapter, we introduced the mtcars data set. We will continue
to use this data set throughout the next pages.
You learned from the R Math chapter that R has several built-in math
functions. For example, the min() and max() functions can be used to find the lowest
or highest value in a set:
max(Data_Cars$hp)
min(Data_Cars$hp)
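max() and min() return only the values. To find out which cars they belong to, one option (a sketch, not the only approach) is to combine which.max() and which.min() with rownames(). The variable names strongest and weakest are illustrative.

```r
Data_Cars <- mtcars

# which.max()/which.min() return the row index of the largest/smallest value
strongest <- rownames(Data_Cars)[which.max(Data_Cars$hp)]
weakest <- rownames(Data_Cars)[which.min(Data_Cars$hp)]

print(strongest)  # "Maserati Bora"
print(weakest)    # "Honda Civic"
```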
Mean
To calculate the average value (mean) of a variable from the mtcars data set, find the sum of all
values, and divide the sum by the number of values.
Example
Find the average weight (wt) of a car:
mean(Data_Cars$wt)
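The definition above (sum of all values divided by the number of values) can be checked by hand. This sketch compares the manual calculation with the result of mean(); the variable name manual_mean is illustrative.

```r
Data_Cars <- mtcars

# Sum of all weights divided by how many weights there are
manual_mean <- sum(Data_Cars$wt) / length(Data_Cars$wt)
print(manual_mean)

# Agrees with the built-in mean() up to floating-point rounding
print(isTRUE(all.equal(manual_mean, mean(Data_Cars$wt))))
```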
Median
The median value is the value in the middle, after you have sorted all the values.
If we take a look at the values of the wt variable (from the mtcars data set), we will see that
there are two numbers in the middle:
Sorted observation of wt (weight)
Note: If there are two numbers in the middle, you must divide the sum of those numbers by
two, to find the median.
Luckily, R has a function that does all of that for you: Just use the median() function to find the
middle value:
Example
Find the mid point value of weight (wt):
median(Data_Cars$wt)
Result:
[1] 3.325
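The same value can be reproduced step by step, following the note above about two middle numbers. In this sketch the variable names sorted_wt, n, and manual_median are illustrative.

```r
Data_Cars <- mtcars

# Sort the weights and count them
sorted_wt <- sort(Data_Cars$wt)
n <- length(sorted_wt)  # 32 observations, an even number

# With an even count there are two middle values: positions n/2 and n/2 + 1
middle_two <- sorted_wt[c(n / 2, n / 2 + 1)]
print(middle_two)

# The median is their average
manual_median <- sum(middle_two) / 2
print(manual_median)  # 3.325
```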
Mode
The mode value is the value that appears most often.
R does not have a built-in function to calculate the statistical mode (the built-in mode() function reports
an object's storage type instead). However, we can create our own function to find it.
If we take a look at the values of the wt variable (from the mtcars data set), we will see that the
number 3.440 appears several times:
Sorted observation of wt (weight)
Instead of counting it ourselves, we can use the following code to find the mode:
Example
Data_Cars <- mtcars
names(sort(-table(Data_Cars$wt)))[1]
Result:
[1] "3.44"
From the example above, we now know that the value that appears most often
in the mtcars wt variable is 3.44. Because wt is recorded in units of 1000 lbs, this corresponds to 3,440 lbs.
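The one-line expression can be wrapped in a reusable function. The sketch below is one possible version; the name get_mode and its parameter values are hypothetical, and when several values tie for most frequent it returns only the smallest of them.

```r
# Return the most frequent value of a numeric vector (hypothetical helper)
get_mode <- function(values) {
  counts <- table(values)  # frequency of each distinct value
  # which.max() picks the first (smallest) value with the highest count
  as.numeric(names(counts)[which.max(counts)])
}

wt_mode <- get_mode(mtcars$wt)
print(wt_mode)  # 3.44
```

Unlike the names(sort(-table(...)))[1] expression, this version converts the result back to a number instead of leaving it as a character string.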