Introduction To R
Johnny Lo
February 2020
Contents
1 Fundamentals of Programming in R
1.1 Variable assignment
1.2 Data Types
1.3 Data Structures
1.4 Coercion of Data Types
3 Piping
To start the session, open a blank R script by going to the toolbar in R Studio and selecting File > New File >
R Script, or by pressing Ctrl+Shift+N on your keyboard. Then commence coding by following the notes below.
1 Fundamentals of Programming in R
Everything that we create in R is an object. An object is a data structure having some attributes and
methods which act on its attributes. For beginners of R, it is very important that you have a strong
understanding of variables, and the basic data types and data structures. These are the types of
objects that you will create, operate on, and manipulate on a day-to-day basis. Once you have the basics
down, then you are well on your way to being a competent R programmer.
In the example below, the number 35 is assigned to the variable x. Note that when you run this command,
the object or variable x is created but nothing will be displayed.
x <- 35
To recall or print out the value that is assigned to variable x, you just need to run the variable name as a
command.
x
## [1] 35
You can overwrite the stored value by assigning another value to the same variable. For example,
x <- 18; x
## [1] 18
Suppose we want to add 8 to x and store the sum in a new object, say y. This can be done as follows.
y <- x + 8; y
## [1] 26
Suppose now we have a linear equation given by y = 3x + 5. Since x = 18, then we have,
y <- 3*x+5; y
## [1] 59
If we wish to evaluate y for another value of x, say 17, then all we need to do is overwrite the value of x,
i.e. x <- 17, and re-run the above command for y. The ability to write an equation as code, such as the
one above, allows us to perform computations much more efficiently.
When coding, it is always a good idea to give meaningful names to variables and, better yet, ones that are
self-explanatory. Meaningful variable names help the reader and reduce the number of
comments required. For example, suppose we want to calculate the area of a circle (i.e. πr²), given that
its diameter is 10 cm.
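The circle calculation might be written as follows. This is a minimal sketch: the variable names diameter_cm, radius_cm and circle_area are illustrative choices, not names taken from the original notes.

```r
diameter_cm <- 10                 # given diameter, in cm
radius_cm   <- diameter_cm / 2    # the radius is half the diameter
circle_area <- pi * radius_cm^2   # area = pi * r^2, in square cm
circle_area
## [1] 78.53982
```

Because each name states what the value represents, the code needs very little additional commentary.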
1.2.1 Character
A character is a string value and is defined by using quotes (" "). A character can be a number (e.g. 1,
2, 3), letter (e.g. a, b, C), symbol (e.g. #, $, @), or a combination of the three (e.g. a word or sentence).
Note that if a number is defined as a character in R, then you will not be able to perform any mathematical
operation on it.
"a"
## [1] "a"
"abc123"
## [1] "abc123"
"apples"
## [1] "apples"
"I hate apples"
## [1] "I hate apples"
You can also run multiple commands with a single line of code by using a semi-colon (;) as the separator.
"a"; "abc"; "apples"; "I hate apples"
## [1] "a"
## [1] "abc"
## [1] "apples"
## [1] "I hate apples"
1.2.2 Numeric
A numeric is any real or decimal value. You can define them in R simply by typing in the value and
executing it. For example,
5.5; 2.75; pi
## [1] 5.5
## [1] 2.75
## [1] 3.141593
We can also perform mathematical operations on numerics, just as you would on a calculator.
5.5+2.7 #Addition
## [1] 8.2
5.5-2.7 #Subtraction
## [1] 2.8
5.5/2 #Division
## [1] 2.75
5.5*2 #Multiplication
## [1] 11
5.5^2 #Squaring
## [1] 30.25
5.5^4 #To the power of 4
## [1] 915.0625
sqrt(5.5) #Square root
## [1] 2.345208
exp(5.5) #Exponential
## [1] 244.6919
log(5.5) #Natural log
## [1] 1.704748
1.2.3 Integer
An integer is a whole number that can be positive or negative. To define an integer in R, we just need to
type L after a whole number. For example,
5L #5
## [1] 5
-3L #-3
## [1] -3
The mathematical operations that were performed previously on numerics can also be applied to integers.
1.2.4 Logical
A logical value indicates whether an item/statement is TRUE or FALSE, i.e. a Boolean value. A list of
the R logical operators is given in Figure 1.
## [1] TRUE
5==2 #Does 5 equal 2?
## [1] FALSE
5!=2 #Does 5 not equal 2?
## [1] TRUE
5>2 #Is 5 greater than 2?
## [1] TRUE
5>=2 #Is 5 greater than or equal to 2?
## [1] TRUE
5<2 #Is 5 less than 2?
## [1] FALSE
5<=2 #Is 5 less than or equal to 2?
## [1] FALSE
What about for the following? See if you can determine the answers to them.
(5>2)&(5>4)
(5>2)&(5<4)
(5>2)|(5<4)
1.3.1 Vector
A vector is a collection of elements of the same data type. A vector can be created by using the c(.)
command.
## NULL
c(1,2,3) #A numeric vector
## [1] 1 2 3
c("a","b","c") #A character vector
## [1] TRUE FALSE FALSE
a <- c(1:5); a
## [1] 1 2 3 4 5
We use square brackets, i.e. [.], if we wish to access particular element(s) within the vector, rather than the
whole vector.
## [1] 2
a[3] #3rd element
## [1] 3
a[3:5] #3rd to 5th element
## [1] 3 4 5
a[c(2,5)] #2nd and 5th elements
## [1] 2 5
b <- c(6:10); b #Create a new vector b containing the integers 6, 7, 8, 9 and 10.
## [1] 6 7 8 9 10
c <- c(a,b); c #Combine vectors a and b, and assign the combined vector to a new vector, c.
## [1] 1 2 3 4 5 6 7 8 9 10
## [1] 7 9 11 13 15
b-a #Subtract the elements of vector a from vector b
## [1] 5 5 5 5 5
a*b #Multiply the elements of vector a and vector b together
## [1] 6 14 24 36 50
b/a #Divide the elements of vector b by the elements of vector a
Notice that the operations are performed element by element. For instance with a+b, the 1st element
of vector a is added to the 1st element of vector b to form the 1st element of the vector sum, then the 2nd
element of vector a is added to the 2nd element of vector b to form the 2nd element of the vector sum, and
so on. This is also the case for the other mathematical operations.
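The element-by-element behaviour can be checked directly. The sketch below rebuilds a and b as defined earlier; the name vec_sum is introduced here purely for illustration.

```r
a <- c(1:5)
b <- c(6:10)

vec_sum <- a + b              # element-wise sum of the two vectors
vec_sum[1] == a[1] + b[1]     # 1st element of the sum is 1 + 6
vec_sum[2] == a[2] + b[2]     # 2nd element of the sum is 2 + 7

b / a                         # division is also done element by element
```

Both comparisons return TRUE, confirming that each element of the result depends only on the elements in the same position of a and b.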
1.3.2 Factor
A factor is similar to a vector, but it is a collection of elements from a finite set of values, i.e. a categorical
variable. To create a factor, we start by creating a numeric or character vector then convert it to a factor
using the factor(.) command.
## [1] 1 2 3 4 5
## Levels: 1 2 3 4 5
f2 <- c("Male","Female","Female","Male","Female") #A character vector
f2 <- factor(f2); f2 #Convert f2 to a factor
With f2, we can see that it is a factor with 2 levels, i.e. Male or Female.
Note that by default, the levels of a factor are ordered alphabetically. This is often not ideal in cases where
there is an order to said levels. For example, suppose we have an ordinal factor (f3) with three levels: low (L),
medium (M) and high (H).
f3 <- factor(c("L","M","H","H","M","L")); f3
## [1] L M H H M L
## Levels: H L M
As you can see, the default order for the levels is H, L, M. This can be annoying if you want to summarise
or plot a set of data accordingly for these levels and compare them. Naturally, one would prefer to have the
order be L, M, H. We can ensure this is the case by including the levels argument to the factor(.) command,
and specifying the desired order.
## [1] L M H H M L
## Levels: L M H
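A minimal sketch of the levels argument described above, which should give the level order L, M, H. The optional ordered = TRUE variant is an addition not covered in these notes, and f3_ordered is an illustrative name.

```r
# Specify the desired order of the levels explicitly
f3 <- factor(c("L","M","H","H","M","L"), levels = c("L","M","H"))
levels(f3)
## [1] "L" "M" "H"

# Optionally, declare the factor as ordered so that comparisons make sense
f3_ordered <- factor(c("L","M","H","H","M","L"),
                     levels = c("L","M","H"), ordered = TRUE)
f3_ordered[1] < f3_ordered[3]   # Is L lower than H?
## [1] TRUE
```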
1.3.3 Matrix
A matrix (plural: matrices) is a 2-dimensional array whose elements are all of the same data type. A
matrix is defined using the matrix(.) command, which requires the data and the matrix dimensions as inputs. By default,
data are read into the matrix in a column-wise manner.
We can also read the data in a row-wise manner by changing the byrow argument to TRUE.
Note that in some instances, the command may have several input arguments and can be difficult to read.
To improve the readability of an extended command, we can break it up and write it across multiple lines.
This works because R will examine the next line if the current command line is incomplete. Another
advantage of this is that we can add commentary to each of the input arguments. We will use the previous
example to illustrate this.
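The following sketch illustrates column-wise versus row-wise filling, and a command written across several lines with a comment on each argument. The data (1 to 6) and the names Mat.col and Mat.row are assumptions made for illustration.

```r
# Column-wise fill (the default)
Mat.col <- matrix(1:6, nrow = 2, ncol = 3)
Mat.col
##      [,1] [,2] [,3]
## [1,]    1    3    5
## [2,]    2    4    6

# Row-wise fill, written across multiple lines
Mat.row <- matrix(1:6,          # Data
                  nrow = 2,     # Number of rows
                  ncol = 3,     # Number of columns
                  byrow = TRUE  # Fill the matrix row by row
                  )
Mat.row
##      [,1] [,2] [,3]
## [1,]    1    2    3
## [2,]    4    5    6
```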
Another way to create a matrix, is by binding multiple vectors of the same length together. We can either
bind the vectors together as column vectors using the cbind(.) command, or as row vectors using the
rbind(.) command.
Mat.A <- cbind(v1,v2,v3); Mat.A #binding vectors as column vectors
## v1 v2 v3
## [1,] 1 4 7
## [2,] 2 5 8
## [3,] 3 6 9
Mat.B <- rbind(v1,v2,v3); Mat.B #binding vectors as row vectors
The elements of two or more matrices can be multiplied together only if they have the same dimension.
Since Mat.A and Mat.B are both 3 × 3 matrices, we can multiply them by one another. Note that the
multiplication is done element-to-element. In this instance, A × B = B × A.
Mat.A*Mat.B
## v1 v2 v3
## [1,] 1 8 21
## [2,] 8 25 48
## [3,] 21 48 81
However, in matrix multiplication, say A × B, the number of columns in matrix A must equal the number
of rows in matrix B. Each row of matrix A is multiplied element by element with each column of matrix B, and the products are summed.
Khan Academy offers an online tutorial on manual matrix multiplication. Matrix multiplication in
R is performed using the %*% operator. In matrix multiplication, A × B ≠ B × A in general.
## v1 v2 v3
## v1 14 32 50
## v2 32 77 122
## v3 50 122 194
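A short sketch contrasting element-wise and matrix multiplication. The vectors v1, v2 and v3 are not defined in the surviving text, so their values here are inferred from the printed matrices; AB and BA are illustrative names.

```r
v1 <- 1:3; v2 <- 4:6; v3 <- 7:9    # values inferred from the printed output
Mat.A <- cbind(v1, v2, v3)         # vectors bound as columns
Mat.B <- rbind(v1, v2, v3)         # vectors bound as rows

# Element-wise multiplication: the order does not matter
all(Mat.A * Mat.B == Mat.B * Mat.A)

# Matrix multiplication: the order generally does matter
AB <- Mat.A %*% Mat.B
BA <- Mat.B %*% Mat.A
AB[1, 1]   # 1*1 + 4*4 + 7*7 = 66
BA[1, 1]   # 1*1 + 2*2 + 3*3 = 14
```

The top-left entries alone (66 versus 14) are enough to show that A × B and B × A differ for these two matrices.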
To access the elements of a matrix, say A, we need to specify the corresponding row(s) and column(s)
within the square brackets, i.e. A[row(s),column(s)]. If the rows are not specified, but the columns are,
then all the rows will be extracted and vice versa.
Mat.A[2,3] #Access the element located in row 2 and column 3
## v3
## 8
Mat.A[1:2,3] #Access the elements located in rows 1 and 2 along column 3
## [1] 7 8
Mat.A[1,2:3] #Access the elements located along row 1 and in columns 2 and 3
## v2 v3
## 4 7
Mat.A[1,] #Access all elements in row 1
## v1 v2 v3
## 1 4 7
Mat.A[,3] #Access all elements in column 3
## [1] 7 8 9
Mat.A[,c(1,3)] #Access all elements in columns 1 and 3
## v1 v3
## [1,] 1 7
## [2,] 2 8
## [3,] 3 9
Mat.A[,-1] #Access all data, except those in column 1
## v2 v3
## [1,] 4 7
## [2,] 5 8
## [3,] 6 9
df <- data.frame(Name,Age,Gender); df
## 4 Beth 55 Female
## 5 Lachlan 43 Male
We can add new column(s) to an existing data frame by using either the data.frame(.) or cbind(.) command.
df[1,c(1:3)]
## Name Gender
## 1 John Male
## 2 Sarah Female
## 3 Zach Male
## 4 Beth Female
## 5 Lachlan Male
df[,"Name"]
df[,c("Name","Gender")]
## Name Gender
## 1 John Male
## 2 Sarah Female
## 3 Zach Male
## 4 Beth Female
## 5 Lachlan Male
Another way is to use the $ syntax and the column name to access the column. However, we can only
access one column at a time using this approach.
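The sketch below rebuilds a small data frame using the values shown in the printed output of these notes, and compares three ways of pulling out a single column. The double-bracket form is an addition not discussed in the notes.

```r
df <- data.frame(
  Name   = c("John","Sarah","Zach","Beth","Lachlan"),
  Age    = c(35,28,33,55,43),
  Gender = factor(c("Male","Female","Male","Female","Male"))
)

df[,"Age"]    # Access by column name within square brackets
df$Age        # Access using the $ syntax
df[["Age"]]   # Access using double square brackets (not covered above)

# All three return the same numeric vector
identical(df$Age, df[,"Age"])
identical(df$Age, df[["Age"]])
```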
## [1] 35 28 33 55 43
The $ syntax can also be used, along with a new name, to add a new column to an existing data frame,
without having to overwrite the variable assigned with the original data frame, unlike with the data.frame(.)
or cbind(.) command. Note that the overwriting process was not done previously with the Coffee.Drinker
vector and therefore the vector is actually not a part of the data frame df at the moment.
#Add the new variable, Coffee.Drinker, to df and recall the data frame
df$Coffee.Drinker <- c(TRUE,TRUE,FALSE,TRUE,FALSE); df
The str(.) command is particularly informative as it provides a summary of the data frame and its data.
A tibble is a modern take on a data frame, without the annoying features. A tibble is defined using the
tibble(.) command.
## # A tibble: 5 x 4
## Name Age Gender Coffee.Drinker
## <chr> <dbl> <fct> <lgl>
## 1 John 35 Male TRUE
## 2 Sarah 28 Female TRUE
## 3 Zach 33 Male FALSE
## 4 Beth 55 Female TRUE
## 5 Lachlan 43 Male FALSE
One of the ways that tibble(.) and data.frame(.) differ is how the data frames are printed to screen. If
you recall a data frame in R, it will attempt to print all the rows and columns until it reaches the maximum
allowed. When you print a tibble, only the first ten rows, and all the columns that fit on one screen will be
displayed and this makes it easier to work with large datasets. Furthermore, an abbreviated description of
the column type is provided under the column names.
Another key difference is in subsetting. Subsetting a tibble with square brackets will always return a tibble by default,
even if only one column is accessed. On the other hand, when you access a single column in a data frame,
the end result is either a vector or a factor. The latter is true if the column that is accessed is a factor.
This difference often leads to code in older packages not working properly due to mismatched data types.
Thankfully, we can convert from one structure to the other (and vice versa) to overcome this issue.
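A small sketch of the subsetting difference and of converting between the two structures. It assumes the tibble package is installed; the tibble part is skipped otherwise. The object names df_small, tb_small and df_back are illustrative.

```r
df_small <- data.frame(Name = c("John","Sarah"), Age = c(35,28))

class(df_small[,"Age"])                 # A single column drops to a vector
## [1] "numeric"
class(df_small[,"Age",drop=FALSE])      # drop=FALSE keeps it as a data frame
## [1] "data.frame"

# The tibble part only runs if the tibble package is available
if (requireNamespace("tibble", quietly = TRUE)) {
  tb_small <- tibble::as_tibble(df_small)   # data frame -> tibble
  print(class(tb_small[,"Age"]))            # Still a tibble
  df_back <- as.data.frame(tb_small)        # tibble -> data frame
  print(class(df_back[,"Age"]))             # A vector again
}
```

Converting with as.data.frame(.) before passing data to an older function is one way of avoiding the mismatch described above.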
## # A tibble: 5 x 5
## Name Age Gender Coffee.Drinker Diabetes
## <chr> <dbl> <fct> <lgl> <fct>
## 1 John 35 Male TRUE Yes
## 2 Sarah 28 Female TRUE No
## 3 Zach 33 Male FALSE No
## 4 Beth 55 Female TRUE No
## 5 Lachlan 43 Male FALSE Yes
Refer to the link here for more details regarding other differences between these two structures.
1.3.5 List
A list is a collection of data structures. It is the most flexible data structure in R in that each component
of a list can be of different type, dimension or length to the other components. You can even store a list
within a list. A list is created using the list(.) command.
Let us create a list consisting of a vector, a matrix and a data frame that were defined previously.
## [[1]]
## [1] 1 2 3 4 5 6 7 8 9 10
##
## [[2]]
## v1 v2 v3
## [1,] 1 4 7
## [2,] 2 5 8
## [3,] 3 6 9
##
## [[3]]
## Name Age Gender Coffee.Drinker Diabetes
## 1 John 35 Male TRUE Yes
## 2 Sarah 28 Female TRUE No
## 3 Zach 33 Male FALSE No
## 4 Beth 55 Female TRUE No
## 5 Lachlan 43 Male FALSE Yes
str(list1) #examine the structure of the list and its components.
## List of 3
## $ : int [1:10] 1 2 3 4 5 6 7 8 9 10
## $ : int [1:3, 1:3] 1 2 3 4 5 6 7 8 9
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : NULL
## .. ..$ : chr [1:3] "v1" "v2" "v3"
## $ :'data.frame': 5 obs. of 5 variables:
## ..$ Name : chr [1:5] "John" "Sarah" "Zach" "Beth" ...
## ..$ Age : num [1:5] 35 28 33 55 43
## ..$ Gender : Factor w/ 2 levels "Female","Male": 2 1 2 1 2
## ..$ Coffee.Drinker: logi [1:5] TRUE TRUE FALSE TRUE FALSE
## ..$ Diabetes : Factor w/ 2 levels "No","Yes": 2 1 1 1 2
To access the components of a list, we use the double square brackets [[.]].
## [1] 1 2 3 4 5 6 7 8 9 10
list1[[2]] #Access the 2nd component, i.e. matrix Mat.A
## v1 v2 v3
## [1,] 1 4 7
## [2,] 2 5 8
## [3,] 3 6 9
list1[[3]] #Access the 3rd component, i.e. data frame df
If the components in a list are labelled, then we can access them by using the $ syntax as well.
## List of 3
## $ VecC : int [1:10] 1 2 3 4 5 6 7 8 9 10
## $ MatA : int [1:3, 1:3] 1 2 3 4 5 6 7 8 9
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : NULL
## .. ..$ : chr [1:3] "v1" "v2" "v3"
## $ DatFrame:'data.frame': 5 obs. of 5 variables:
## ..$ Name : chr [1:5] "John" "Sarah" "Zach" "Beth" ...
## ..$ Age : num [1:5] 35 28 33 55 43
## ..$ Gender : Factor w/ 2 levels "Female","Male": 2 1 2 1 2
## ..$ Coffee.Drinker: logi [1:5] TRUE TRUE FALSE TRUE FALSE
## ..$ Diabetes : Factor w/ 2 levels "No","Yes": 2 1 1 1 2
## [1] 1 2 3 4 5 6 7 8 9 10
list1$MatA #Access the matrix component
## v1 v2 v3
## [1,] 1 4 7
## [2,] 2 5 8
## [3,] 3 6 9
list1$DatFrame #Access the data frame component
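The two access methods can be compared on a small, self-contained list. The list list2 and its component names (Vec, Mat, Inner) are hypothetical and are used here only for illustration.

```r
list2 <- list(Vec   = 1:3,                      # A vector
              Mat   = matrix(1:4, nrow = 2),    # A matrix
              Inner = list(a = "x", b = TRUE)   # A list within a list
              )

list2[[1]]       # Access the 1st component by position
list2$Vec        # Access the same component by its label
list2$Inner$b    # Access a component of the nested list

class(list2[[1]])   # Double brackets return the component itself
## [1] "integer"
class(list2[1])     # Single brackets return a list of length 1
## [1] "list"
```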
In such instances, the elements whose data type is the most flexible will be coerced into the least flexible
data type. For the four data types that were discussed, the order of flexibility (least to most) is (1)
Character, (2) Numeric, (3) Integer and (4) Logical.
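The direction of coercion can be checked with class(.). In practice, logical values are converted to integers, integers to numerics, and anything combined with a character becomes a character. The values below are arbitrary examples.

```r
class(c(TRUE, 2L))      # Logical and integer
## [1] "integer"
class(c(2L, 3.5))       # Integer and numeric
## [1] "numeric"
class(c(3.5, "a"))      # Numeric and character
## [1] "character"

# A number stored as a character must be converted back before doing arithmetic
as.numeric("4.6") + 1
## [1] 5.6
```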
## [1] 1 1
#The 1st three elements are coerced into characters.
matrix(c(5,FALSE,4.6,"No"),nrow=2,ncol=2,byrow=FALSE)
## [,1] [,2]
## [1,] "5" "4.6"
## [2,] "FALSE" "No"
Sometimes coercions are done intentionally, and other times they are not. For the latter, it can be a
frustrating experience if you are unaware that this is happening, and in particular, when this leads to coding
errors later on.
## [1] 4 9 10 6 8 6 4 4 9 1 8 4 4 7 5 6 2 5 7 2
#20 random numbers between 5 and 15 from a Uniform distribution
x2 <- runif(20,5,15); x2
## [8] 3.884554 13.628431 15.898036 14.227182 12.777260 11.921751 11.164433
## [15] 13.891406 11.678051 14.047780 8.773800 16.200333 7.003848
Note: The numbers shown here are likely to be different to what you will have as these are randomly generated
numbers.
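If you need the same random numbers each time the code is run, one option is to fix the seed of the random number generator with set.seed(.) beforehand. This is a sketch only; the seed value 123 and the names x_a and x_b are arbitrary choices.

```r
set.seed(123)              # Fix the seed (123 is an arbitrary value)
x_a <- runif(5,5,15)       # 5 random numbers between 5 and 15

set.seed(123)              # Reset to the same seed
x_b <- runif(5,5,15)       # The same 5 numbers are generated again

identical(x_a, x_b)
## [1] TRUE
```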
## [1] 1.0 1.5 2.0 2.5 3.0 3.5 4.0 4.5 5.0 5.5 6.0 6.5 7.0 7.5 8.0
## [16] 8.5 9.0 9.5 10.0 10.5 11.0 11.5 12.0 12.5 13.0 13.5 14.0 14.5 15.0 15.5
## [31] 16.0 16.5 17.0 17.5 18.0 18.5 19.0 19.5 20.0
#Repeat the sequence {1,2,3,4,5} 5 times over
x5 <- rep(c(1:5),times=5); x5
## [1] 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5
#Repeat each element of the sequence {1,2,3,4,5} by 5 times
x6 <- rep(c(1:5),each=5); x6
## [1] 1 1 1 1 1 2 2 2 2 2 3 3 3 3 3 4 4 4 4 4 5 5 5 5 5
#Repeat each element of the sequence {1,2,3,4,5} by 3 times, and then repeat
#that sequence the 2nd time. Note that "each" has precedence over "times".
x7 <- rep(c(1:5),times=2,each=3); x7
## [1] 1 1 1 2 2 2 3 3 3 4 4 4 5 5 5 1 1 1 2 2 2 3 3 3 4 4 4 5 5 5
2.1.3 Alphabets
## [1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j"
LETTERS[1:10] #List first 10 letters (in UPPER case) of the alphabet
## [1] "A" "B" "C" "D" "E" "F" "G" "H" "I" "J"
If you have not already done so, download the ElderlyPopWA.csv dataset from Blackboard to your working
directory. Then import this data into R using this approach with either the “From text (base)” or “From
text (readr)” option. The first option imports the data and stores it as a data frame, whereas the 2nd option
stores it as a tibble.
Once the dataset is imported, R Studio will open it (courtesy of the View(.) command) in a new
tab for you to view. Furthermore, the following command line(s) (depending on which option you went
with) will appear in the Console. Notice that the 2nd approach requires the readr package to be loaded first.
Figure 2: Importing Data
Copy the above command lines and paste them into your R script. This way, you do not need to browse for
your data file again if you ever need to restart your session and/or re-run your code.
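The generated command lines are not reproduced in these notes. As a rough sketch of the base approach, the code below writes a small hypothetical data frame (demo) to a temporary CSV file and reads it back with read.csv(.); the file and column names are invented for illustration.

```r
# Create a small CSV file to import (hypothetical data)
csv_path <- tempfile(fileext = ".csv")
demo <- data.frame(ID = 1:3, Age = c(71.2, 74.5, 78.9))
write.csv(demo, csv_path, row.names = FALSE)

# Base approach: the data is stored as a data frame
demo_in <- read.csv(csv_path)
str(demo_in)

# readr approach (requires the readr package): the data is stored as a tibble
# demo_tb <- readr::read_csv(csv_path)
```

For your own work, replace csv_path with the path to the ElderlyPopWA.csv file in your working directory.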
Suppose you have amended the dataset and would like to export the updated data frame/tibble to a new
CSV file. To do this, we can use either the write.csv(.) or write_csv(.) command.
#Method 1
write.csv(ElderlyPopWA, #Name of the data frame/tibble to be exported
"ElderlyPopWA_updated.csv" #Name of the new CSV file to write the data to.
)
#Method 2
write_csv(ElderlyPopWA, #Name of the data frame/tibble to be exported
"ElderlyPopWA_updated.csv" #Name of the new CSV file to write the data to.
)
For example, suppose we wish to categorise the elderly female participants (from the ElderlyPopWA dataset)
into their respective BMI classes (see Figure 3).
Figure 3: BMI classification
The cut(.) command will allow us to do this. Note that by default, the left side of the interval is open, and
the right side is closed, i.e. (a,b]. Type and run the command ?cut for help if you want to know how to
change this setting.
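A minimal sketch of cut(.) and its right argument. Figure 3 is not reproduced here, so the cut-off values (22 and 27) and the BMI values below are hypothetical and should be replaced with the thresholds from the figure.

```r
bmi <- c(18.5, 22, 25, 27.3, 31)          # Hypothetical BMI values
cutoffs <- c(0, 22, 27, Inf)              # Hypothetical class boundaries
class_names <- c("Underweight","Healthy_Weight","Overweight")

# Default: intervals are open on the left and closed on the right, i.e. (a,b]
bmi_right <- cut(bmi, breaks = cutoffs, labels = class_names)

# right=FALSE: intervals are closed on the left and open on the right, i.e. [a,b)
bmi_left <- cut(bmi, breaks = cutoffs, labels = class_names, right = FALSE)

data.frame(bmi, bmi_right, bmi_left)      # Compare the two classifications
```

The two versions differ only for the value that sits exactly on a boundary: a BMI of 22 falls in the lower class by default, and in the upper class when right=FALSE.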
Exercise: Within the sample, how many elderly females are in each of the three BMI classes?
In many instances, we are often interested in having a closer examination of a sub-sample of individuals
based on particular group(s) they belong to (e.g. those of a particular gender, ethnicity, income bracket,
etc), or by some thresholds (e.g. those who are 35 years old or younger). The subset(.) command allows us
to perform this task.
Make sure you view your subsets, i.e. View(.), to ensure you have done this correctly.
We can also use the which(.) command for this purpose. This command outputs the indices of the elements
for which the set condition is true, which we can use to select the rows of data that we want.
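Both approaches can be sketched on a small hypothetical data frame (people); the column names and values are invented for illustration.

```r
people <- data.frame(Gender = c("Male","Female","Female","Male"),
                     Age    = c(30, 41, 35, 28))

# subset(.): keep the rows that satisfy the condition
young <- subset(people, Age <= 35)
young_females <- subset(people, Gender == "Female" & Age <= 35)

# which(.): find the row indices first, then use them to select the rows
idx <- which(people$Age <= 35); idx
## [1] 1 3 4
young_again <- people[idx,]
```

Here young and young_again contain the same three rows.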
#Measures of centre
mean(ElderlyPopWA$Age) #Mean (i.e. average) age
## [1] 74.31748
median(ElderlyPopWA$Age) #Median age
## [1] 74.08767
#Measures of spread
sd(ElderlyPopWA$Age) #Standard deviation of age
## [1] 2.596677
range(ElderlyPopWA$Age) #range of age, i.e. minimum and maximum
## [1] 4.158219
fivenum(ElderlyPopWA$Age) #Five-number summary, i.e. min, Q1, Q2, Q3, max.
Skewness and kurtosis are two commonly used measures of shape. Skewness is a measure of symmetry where,
1. Skewness = 0, distribution is symmetrical;
2. Skewness < 0, distribution is left- or negatively-skewed, i.e. longer left tail;
3. Skewness > 0, distribution is right- or positively-skewed, i.e. longer right tail.
Kurtosis is a measure of “tailedness” in a distribution relative to a normal distribution.
1. Kurtosis = 3, tails are that of a normal distribution, i.e. mesokurtic;
2. Kurtosis < 3, tails are comparatively shorter than a normal distribution, i.e. platykurtic;
3. Kurtosis > 3, tails are comparatively longer than a normal distribution, i.e. leptokurtic.
To compute these two measures in R, we require the moments package.
#Measures of shape
skewness(ElderlyPopWA$Age)
## [1] 0.2670584
kurtosis(ElderlyPopWA$Age)
## [1] 2.021052
The above values indicate that the variable Age is approximately symmetrically distributed with a slight
positive skewness, and is platykurtic.
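To see what these two measures compute, they can be written out by hand using central moments. The sketch below uses the moment-based definitions (skewness = m3 / m2^1.5, kurtosis = m4 / m2^2), which, to my understanding, are the ones the moments package follows; the data vector is hypothetical.

```r
x <- c(2, 4, 4, 5, 7, 9, 15)       # Hypothetical data with a long right tail
n <- length(x)

m2 <- sum((x - mean(x))^2) / n     # 2nd central moment
m3 <- sum((x - mean(x))^3) / n     # 3rd central moment
m4 <- sum((x - mean(x))^4) / n     # 4th central moment

skew <- m3 / m2^1.5                # Skewness
kurt <- m4 / m2^2                  # Kurtosis (equals 3 for a normal distribution)
skew; kurt
```

The skewness is positive here because the single large value (15) stretches the right tail.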
Exercise: Standardise the Age variable (i.e. (x-mean)/sd), add it to the existing data frame (call it
z_Age), and summarise it.
There are other functions that allow you to obtain the summary statistics more efficiently, in particular
for large data frames.
Note that the colMeans(.) function only works for data frames. If your data is stored as a tibble (as in the case
here), then you will need to convert it to a data frame beforehand.
apply(ElderlyPopWA[,2:8], # Data
MARGIN=2, # Apply the function in a column-wise manner. MARGIN = 1 implies row-wise.
FUN=mean # The function to be applied.
)
The apply(.) function is much more versatile here as you can apply other functions to your data frame
besides mean(.). Suppose now, we wish to determine the standard deviations for a data frame. All we need
to do here is set FUN to sd.
apply(ElderlyPopWA[,2:8], # Data
MARGIN=2, # Apply the function in a column-wise manner. MARGIN = 1 implies row-wise.
FUN=sd # The function to be applied.
)
Exercise: Using the apply(.) function, find the five number summary for each of the continuous variables
in the ElderlyPopWA dataset.
##
## Underweight Healthy_Weight Overweight
## 27 100 21
BMI.prop <- prop.table(BMI.freq); BMI.prop #Proportions of sample for each BMI class
##
## Underweight Healthy_Weight Overweight
## 0.1824324 0.6756757 0.1418919
We can also use the table(.) command to cross-tabulate two or more categorical/ordinal variables, and then
use prop.table(.) to compute the marginal proportions.
##
## Underweight Healthy_Weight Overweight
## <75years 16 62 11
## 75+years 11 38 10
prop.table(tab,1) #Proportions by row, i.e. within each age group
##
## Underweight Healthy_Weight Overweight
## <75years 0.1797753 0.6966292 0.1235955
## 75+years 0.1864407 0.6440678 0.1694915
prop.table(tab,2) #Proportions by column, i.e. within each BMI class
##
## Underweight Healthy_Weight Overweight
## <75years 0.5925926 0.6200000 0.5238095
## 75+years 0.4074074 0.3800000 0.4761905
tab/sum(tab) #Proportions relative to overall sample size
##
## Underweight Healthy_Weight Overweight
## <75years 0.10810811 0.41891892 0.07432432
## 75+years 0.07432432 0.25675676 0.06756757
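The meaning of the margin argument can be verified on a small cross-tabulation. The vectors grp and cls, and the table tab2, are hypothetical.

```r
grp <- c("A","A","B","B","B","A")               # Hypothetical grouping variable
cls <- c("Low","High","Low","Low","High","Low") # Hypothetical class variable
tab2 <- table(grp, cls); tab2                   # Cross-tabulation

rowSums(prop.table(tab2,1))   # Row proportions sum to 1 within each row
colSums(prop.table(tab2,2))   # Column proportions sum to 1 within each column
sum(tab2/sum(tab2))           # Overall proportions sum to 1 across the table
```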
#Summarise the waist circumference of the individuals across the three BMI classes
## BMI.class Waist
## 1 Underweight 76.78519
## 2 Healthy_Weight 88.96800
## 3 Overweight 100.72381
#Compute the standard deviation
aggregate(Waist~BMI.class,data=ElderlyPopWA,FUN=sd)
## BMI.class Waist
## 1 Underweight 5.717362
## 2 Healthy_Weight 6.081282
## 3 Overweight 6.115383
We can also compute the summary statistics across the combination of two or more categorical/ordinal
variables.
#Compute the mean
aggregate(Waist~BMI.class+Age.grp,data=ElderlyPopWA,FUN=mean)
## <75years 75+years
## Underweight 77.01250 76.45454
## Healthy_Weight 88.98871 88.93421
## Overweight 102.80909 98.43000
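The sketch below shows aggregate(.) with one and two grouping variables on a small hypothetical data frame (dat). It also shows tapply(.), which is not covered in these notes but returns the group means laid out as a two-way table.

```r
dat <- data.frame(
  Waist     = c(75, 78, 88, 90, 101, 99),       # Hypothetical values
  BMI.class = c("U","U","H","H","O","O"),
  Age.grp   = c("<75","75+","<75","75+","<75","75+")
)

# One grouping variable: one row per BMI class
res1 <- aggregate(Waist~BMI.class, data = dat, FUN = mean); res1

# Two grouping variables: one row per combination (long format)
res2 <- aggregate(Waist~BMI.class+Age.grp, data = dat, FUN = mean); res2

# Same means arranged as a two-way table
tapply(dat$Waist, list(dat$BMI.class, dat$Age.grp), mean)
```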
Exercise: Determine the mean and standard deviation of hip, triceps and arm girth across all BMI classes.
3 Piping
In modern R scripts, you will often find code written with the %>% syntax. This is the most popular
pipe operator in R. Pipes allow us to clearly express a sequence of multiple operations, without having to
nest these functions within each other. The latter will often involve multiple sets of parentheses and
can make the code hard to read and understand. This is where pipes can help.
Note: In order to use pipes, we need to install and load the magrittr package, or we can just load the
tidyverse package (which we already have), as it includes the magrittr pipe.
Let us return to the ElderlyPopWA dataset. Suppose that we wish to find the proportions of
individuals in each of the BMI classes to 3 decimal places. We can do just that with a single line
of code with nested commands (see below); however, another person reading this code may need some time
before they can determine its purpose.
round(prop.table(table(ElderlyPopWA$BMI.class)),3)
##
## Underweight Healthy_Weight Overweight
## 0.182 0.676 0.142
##
## Underweight Healthy_Weight Overweight
## 0.182 0.676 0.142
There is nothing wrong with the above approach as the sequence of operations is clear. The issue here, in
particular in loops (to be introduced later), is that a new variable is created to store the value at each of
the first two steps, thus requiring memory space. Pipes are a way for us to avoid these consequences. Here
is how we would do this.
ElderlyPopWA$BMI.class %>%
table(.) %>%
prop.table(.) %>%
round(.,digits=3)
## .
## Underweight Healthy_Weight Overweight
## 0.182 0.676 0.142
Note: The dot “.” acts as an argument placeholder for the object created from the previous step. By
default, the created object is always assumed to be the first argument to be passed to the command in the
following step. Therefore, it is actually not necessary to include the dot if this applies to your situation.
That is, we can easily write this as:
ElderlyPopWA$BMI.class %>%
table() %>%
prop.table() %>%
round(digits=3)
## .
## Underweight Healthy_Weight Overweight
## 0.182 0.676 0.142
In fact, you can even exclude the brackets if there is only one argument to pass through. Having said this,
it is probably better to include . as a beginner user of pipes and to avoid any confusion later.
The use of . is particularly useful if you wish to pass the object to an argument other than the first. Suppose that we want to
display π to 5 decimal places.
5 %>% round(pi,digits=.)
## [1] 3.14159
Here, notice how the value of 5 is passed as the 2nd argument using ., instead of first.
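The three styles discussed in this section (step-by-step, nested and piped) can be compared side by side. This is a sketch on a hypothetical vector bmi_class; the pipe part assumes the magrittr package is installed and is skipped otherwise.

```r
bmi_class <- c("Underweight","Healthy_Weight","Healthy_Weight",
               "Overweight","Healthy_Weight","Underweight")   # Hypothetical data

# Step-by-step: a new variable is created at each step
step1 <- table(bmi_class)
step2 <- prop.table(step1)
stepwise <- round(step2, 3)

# Nested: a single line, read from the inside out
nested <- round(prop.table(table(bmi_class)), 3)
identical(stepwise, nested)

# Piped: read from top to bottom, with no intermediate variables
if (requireNamespace("magrittr", quietly = TRUE)) {
  `%>%` <- magrittr::`%>%`
  piped <- bmi_class %>%
    table() %>%
    prop.table() %>%
    round(digits = 3)
  print(all(as.vector(piped) == as.vector(nested)))   # Same values as before
}
```

All three give the same proportions; the choice between them is about readability and about avoiding intermediate variables.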
There are other forms of pipes that you may find useful. Click here if you want to find out more.
Control structures in R allow you to control the flow of execution of a series of R commands, without
having to run the same set of commands repeatedly. Control structures are typically, though not necessarily,
used inside of functions. The inclusion of control structures and functions will often improve the readability
of the code.
In this section, you will learn how to write your own function. You will also be introduced to two of the
more commonly used control structures (if and else, and for loops). Other forms of control structures are
discussed but not emphasized.
4.1 Functions
Functions are defined using the function(.) directive. Like anything else in R, functions are stored as R
objects. Let’s begin with a simple function that prints Hello! My name is [insert your name here]
when called. We will call this function Greeting.
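A minimal sketch of such a function is given below. The name "Alex" is a placeholder; replace it with your own.

```r
#Define a function with no arguments
Greeting <- function()
{
  print("Hello! My name is Alex")   # "Alex" is a placeholder name
}

Greeting()   #Call the function
## [1] "Hello! My name is Alex"
```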
The command(s) between the curly brackets {} form the body of a function. For the above function, it
is just a simple print(.) command. When you execute the above code, the Greeting function is created
(check your Global Environment panel). To call or execute a function, just type the function name along
with the empty brackets, and run.
Greeting()
The next aspect of a basic function in R is the function arguments. These are the options that the user
may explicitly set for the function. The argument list is defined within
the brackets of the function(.) directive. Values assigned to the arguments are passed to, and are typically
used by, the code within the body of a function. Otherwise there is no point in having an argument list.
Suppose we wish to create a function that adds 3 to any value that we pass into the function and outputs
the sum.
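One way to write this function is sketched below. The function name add3 and the argument x follow the calls shown in these notes; the body is an assumption.

```r
#A function with a single argument, x
add3 <- function(x)
{
  x + 3   # The result of the last line is the value returned by the function
}

add3(2)
## [1] 5
```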
Here, a user can assign any value to the argument x, to which the function will then add 3. The sum is then
returned as the function value.
add3(5)
## [1] 8
add3(10)
## [1] 13
add3(15)
## [1] 18
This is the basis of how all the built-in functions in R are created, e.g. mean(.), sd(.), etc.
Let’s step it up a bit and create a function that finds the sum, product and ratio between any two values,
say x and y. The two values are to be specified by the user.
Let’s assign the values, 3 and 4 to x and y respectively, and pass them into the function.
x <- 3; y <- 4;
## [1] 0.75
Notice how only the ratio of x to y is given, i.e. 0.75. This is because, by default, a function in R will
always output the result of the last line of code within the body of a function. In this case, the result of
the final line is the creation of the variable ratio_xy, which contains the ratio of x to y. If we want the
function to return all three measures, then we just need to add a command (as the final line of code) that
concatenates them.
Here, the value of the function is a vector with 3 elements, the first of which is the sum, the second is the
product and the third is the ratio of x and y.
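A sketch of the version of the function that returns all three measures. The names add_mult and ratio_xy appear in these notes; sum_xy and prod_xy are assumed names for the other two intermediate results.

```r
add_mult <- function(x, y)
{
  sum_xy   <- x + y               # Sum of x and y
  prod_xy  <- x * y               # Product of x and y
  ratio_xy <- x / y               # Ratio of x to y
  c(sum_xy, prod_xy, ratio_xy)    # Final line: concatenate and return all three
}

result <- add_mult(x = 3, y = 4)
result[1]   # Sum, i.e. 7
result[2]   # Product, i.e. 12
result[3]   # Ratio, i.e. 0.75
```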
add_mult(x=3,y=4)
R function arguments can be matched positionally or by name. With the above command, the arguments
are matched by name. In this instance, the order of the argument does not matter. We can specify the y
argument before x if we want, and the result will be the same.
add_mult(y=4,x=3)
Positional matching just means that R assigns the first value to the first argument, the second value to
the second argument, etc. For example,
add_mult(3,4)
Here, the first value, i.e. 3, is assigned to the first argument (i.e. x) and the second value (i.e. 4) is assigned
to the 2nd argument.
Note that with this approach, the order in which you specify the values does matter. For the previous code,
swap the order of the two values and see how the output differs.
This concludes the introduction to functions in R. Click here if you would like to find out more about
functions in R and their nuances that are not discussed here.
Exercise: Write a function that accepts a vector, say vec_x of length greater than 2, and returns the
mean and standard deviation of that vector.
Care needs to be taken when using a while loop. When the condition no longer holds true, a while loop
then permits exit. However, if you are not careful and the condition never becomes false, then the loop will
never end. The for loop, on the other hand, will stop after a set number of iterations. Furthermore, what
is done with a while loop can be achieved with a for loop. Given this, the while loop will not be discussed
further here. We will also not discuss next and return controls as they are not often used. Click here for
more information about while loops and other control structures in R.
4.2.1 If-and-Else
The if-and-else control structures are used to execute a set of commands only when a particular condition
is satisfied, and to perform an alternate set of commands if the condition is not met. The else clause is not
mandatory, and can be omitted in instances where no action is taken if the condition is not met.
To start, let’s create an if statement that checks if a given integer > 0 is odd, and prints a statement that
indicates this if it is true.
#If statement
x <- 5 #Integer to be checked
if (x%%2 != 0)
{
cat(paste(x," is an odd integer\n",sep="")) #print command
}
## 5 is an odd integer
Note: %% is the modulus operator and != is the not equals to logical operator.
If x is an even number, then no action is taken. Try this out and change x to, say 22, and re-run the above
code.
Let’s now extend the above if statement to include else. Suppose now we want the code to print a statement
that an integer > 0 is odd if it is true, otherwise print a statement that the integer is even.
x <- 10 #Integer to be checked
if (x%%2 != 0)
{
cat(paste(x," is an odd integer\n",sep=""))
}else{
cat(paste(x," is an even integer\n",sep=""))
}
## 10 is an even integer
We can also include multiple conditions in an if-and-else statement by using else if. For example, suppose
you wish to know your grade (i.e. N, P, C, D and HD) for a unit based on your final score.
score <- 72 #Example final score (a hypothetical value)
if (score < 0)
{
print("Invalid score!") #Cannot have a negative score
}else if (score < 50){
print("Your final grade is N.")
}else if (score <60){
print("Your final grade is P.")
}else if (score <70){
print("Your final grade is C.")
}else if (score <80){
print("Your final grade is D.")
}else if (score <=100){
print("Your final grade is HD.")
}else{
print("Invalid score!") #Cannot have a score greater than 100%
}
#For loop
for (I in 1:10)
{
print(I)
}
## [1] 1
## [1] 2
## [1] 3
## [1] 4
## [1] 5
## [1] 6
## [1] 7
## [1] 8
## [1] 9
## [1] 10
The variable I in the for loop acts as a counter that starts at 1, and the command(s) within the loop (in this instance just
a print command) are then executed. I then increases by 1 to be 2, and the commands are
executed again. This process continues until I equals 10. Note that I is just a variable name, and you
can quite easily use another name. Furthermore, the count does not necessarily have to start at 1.
We often take advantage of the counter and incorporate it into the commands and run them systematically
through, say a data frame or list. For example, suppose we want to generate the means for the continuous
measurements (i.e. columns 2 to 8) in the ElderlyPopWA dataset. We can systematically move from one
column to the next and generate their means using a for loop.
## [1] 74.31748
## [1] 26.62886
## [1] 88.41351
## [1] 104.3297
## [1] 28.86149
## [1] 31.24932
## [1] 38.29458
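The same idea is sketched below on a small hypothetical data frame (dat_small), where the measurements are in columns 2 and 3. Double square brackets are used so that each column is extracted as a vector, whether the data is stored as a data frame or as a tibble.

```r
dat_small <- data.frame(ID    = 1:4,
                        Age   = c(70, 72, 75, 79),    # Hypothetical values
                        Waist = c(80, 85, 90, 95))

#For loop over columns 2 to 3
for (I in 2:3)
{
  print(mean(dat_small[[I]]))   # Mean of the I-th column
}
## [1] 74
## [1] 87.5
```

For the ElderlyPopWA dataset, the counter would run over 2:8 instead.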
If-and-else statements are often defined within a for loop to control the execution of certain sets of commands.
For example, suppose we want to print out only the numbers between 1 and 100 that are divisible by 7.
#For loop
for (I in 1:100)
{
#If statement to check if the number is divisible by 7
if (I%%7 == 0)
{
print(I)
}
}
## [1] 7
## [1] 14
## [1] 21
## [1] 28
## [1] 35
## [1] 42
## [1] 49
## [1] 56
## [1] 63
## [1] 70
## [1] 77
## [1] 84
## [1] 91
## [1] 98
If-and-else statements in for loops are often used to break from the loop when a condition is met. For
example, suppose we want to print out the first 10 numbers between 500 and 800 that are divisible by 13.
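count <- 0 #Initialise the count of numbers printed so far
for (I in 500:800)
{
#1st if statement: Checks if the number is divisible by 13
if (I%%13 == 0)
{
print(I) #Print the number
count <- count + 1 #Add 1 to count
}
#2nd if statement: Exits the loop once 10 numbers have been printed
if (count == 10)
{
break
}
}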
## [1] 507
## [1] 520
## [1] 533
## [1] 546
## [1] 559
## [1] 572
## [1] 585
## [1] 598
## [1] 611
## [1] 624
THE END