R Tutorial
KDnuggets: R remains leading tool, but Python usage growing very fast.
O’Reilly 2016 Data Science Salary Survey
install.packages("swirl") # Install swirl
require(swirl)            # Load it
swirl()                   # Start the interactive lessons
Design of the R system
I CRAN https://cran.r-project.org
I GitHub
I Bioconductor https://www.bioconductor.org/
I Mostly bioinformatics and genomics stuff
I Neuroconductor https://www.neuroconductor.org/
I The new kid on the block: computational imaging software for
brain imaging
I RForge http://rforge.net/
I Not so well known
I Some .tar.gz file you downloaded from a website
I Use caution!
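For that last case, a source tarball can be installed directly; a minimal sketch (the filename foo_1.0.tar.gz is hypothetical):

```r
# Install a package from a local source tarball (hypothetical filename):
install.packages("foo_1.0.tar.gz", repos = NULL, type = "source")
```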
Installing from CRAN
install.packages("devtools") # Install it
require(devtools) # Load it
library(devtools) # Alternatively, load like this
install.packages("devtools",
repos = "http://stat.ethz.ch/CRAN/")
Installing from Bioconductor
source("https://bioconductor.org/biocLite.R")
biocLite() # Install core Bioconductor packages
biocLite("GenomicFeatures") # For example
Installing from Neuroconductor
source("https://neuroconductor.org/neurocLite.R")
neuro_install(c("fslr", "hcp")) # Install these two
Installing from GitHub
Tip: What if the package you want has been removed from, say,
CRAN or Bioconductor? Both have read-only GitHub mirrors of their
repositories, which can be a useful alternative source for packages
that have disappeared.
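Installing from GitHub goes through devtools; the user/repo slug below is only illustrative:

```r
install.packages("devtools")          # If not already installed
devtools::install_github("user/repo") # e.g. "tidyverse/ggplot2"
```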
How to get help
I Typing ?command will display the help page for command (e.g. ?plot).
I recommend this.
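For reference, a few of the standard help utilities beyond ?:

```r
help("plot")              # Same as ?plot, as a function call
help.search("smoothing")  # Full-text search of the help pages
apropos("plot")           # Names of objects containing "plot"
```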
A good way to learn R
[Scatter plot of dist vs. speed from the cars dataset]
Obviously, you can replace Prof. Frink with (for example) the logo for your
institute or experiment.
Meow
[Cat-themed joke plot: "Random Cats", with axes labelled "some cats" and "other cats"]
Fundamentals
Data types
.Machine
# For information on interpreting the output:
?.Machine
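For example, the machine epsilon for doubles:

```r
eps <- .Machine$double.eps # Smallest x such that 1 + x != 1
1 + eps > 1                # TRUE on IEEE-754 doubles
```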
Attributes
R objects can have attributes:
I Names
I Dimensions (e.g., matrices)
I Length (e.g., vectors)
I Class
m <- matrix(1:6, nrow = 3, ncol = 2)
dim(m)
## [1] 3 2
class(m)
## [1] "matrix"
Evaluation
I When a complete expression is entered at the prompt, it is
evaluated and the result of the evaluated expression is returned.
The result may be auto-printed.
I <- is the assignment operator (you might also see = in code)
I Tip: keyboard shortcut in RStudio is Alt+- (Windows/Linux)
or Option+- (Mac)
# This is a comment!
x <- 42 # Nothing printed
x # Auto-printed
## [1] 42
print(x) # Explicitly printed
## [1] 42
Sequences
1:5 # The colon operator creates integer sequences
## [1] 1 2 3 4 5
vector("numeric", length = 10) # A numeric vector of zeroes
## [1] 0 0 0 0 0 0 0 0 0 0
Mixing objects
x <- 0:6
class(x)
## [1] "integer"
as.logical(x)
## [1] FALSE TRUE TRUE TRUE TRUE TRUE TRUE
as.character(x)
## [1] "0" "1" "2" "3" "4" "5" "6"
as.numeric(c("a", "b", "c")) # Nonsensical coercion yields NAs
## [1] NA NA NA
Matrices
Matrices are vectors with a dimension attribute. The dimension
attribute is itself an integer vector of length 2 (nrow, ncol):
m <- matrix(1:6, nrow = 2, ncol = 3)
dim(m)
## [1] 2 3
v <- 1:10
v
## [1] 1 2 3 4 5 6 7 8 9 10
We can also reshape a matrix (or unroll it into a vector) by assigning to the dim attribute in the same way.
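A minimal sketch of that reshaping:

```r
v <- 1:10
dim(v) <- c(2, 5) # v is now a 2x5 matrix, filled column-wise
dim(v) <- NULL    # Unrolls v back into a plain vector
```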
cbind-ing and rbind-ing
Matrices can be created by column-binding or row-binding with
cbind() and rbind():
x <- 1:3
y <- 10:12
cbind(x, y)
## x y
## [1,] 1 10
## [2,] 2 11
## [3,] 3 12
rbind(x, y)
##   [,1] [,2] [,3]
## x    1    2    3
## y   10   11   12
Lists
Lists are vectors that can contain elements of different classes:
x <- list(1, "a", TRUE, 1 + 4i)
x
## [[1]]
## [1] 1
##
## [[2]]
## [1] "a"
##
## [[3]]
## [1] TRUE
##
## [[4]]
## [1] 1+4i
Factors
Factors represent categorical data, e.g.:
I signal/background
I bjet/light
I male/female
I benign/malignant
I low/middle/high
I etc.
x <- factor(c("signal", "signal", "background", "signal"),
            levels = c("signal", "background"))
unclass(x)
## [1] 1 1 2 1
## attr(,"levels")
## [1] "signal" "background"
Missing values
Missing values are denoted by NA; undefined mathematical
operations produce NaN (a NaN is also NA, but not vice versa):
x <- 0/0
is.na(x) # Is x NA?
## [1] TRUE
is.nan(x) # Is x NaN?
## [1] TRUE
Data frames
Data frames store tabular data; each column can have a different class:
x <- data.frame(foo = 1:3, bar = c(TRUE, TRUE, FALSE))
x
##   foo   bar
## 1   1  TRUE
## 2   2  TRUE
## 3   3 FALSE
Names
R objects can have names:
x <- 1:3
names(x) <- c("foo", "bar", "qux")
x
## foo bar qux
##   1   2   3
Matrices can have dimnames:
m <- matrix(1:4, nrow = 2, ncol = 2)
dimnames(m) <- list(c("a", "b"), c("c", "d"))
m
##   c d
## a 1 3
## b 2 4
Subsetting
x <- c("A", "B", "C", "D", "E")
x[1]
## [1] "A"
x[1:5]
## [1] "A" "B" "C" "D" "E"
Matrices are subset with (row, column) indices:
x <- matrix(1:6, 2, 3)
x[1, 2]
## [1] 3
x[1, ] # The whole first row
## [1] 1 3 5
Subsetting lists
x <- list(foo = 1:4, bar = 0.6)
x[1]
## $foo
## [1] 1 2 3 4
class(x[1])
## [1] "list"
x[[1]]
## [1] 1 2 3 4
class(x[[1]])
## [1] "integer"
Subsetting lists
x$bar
## [1] 0.6
class(x["bar"])
## [1] "list"
class(x[["bar"]])
## [1] "numeric"
Removing missing values
x <- c(1, 2, NA, 4, 5)
x[!is.na(x)]
## [1] 1 2 4 5
Removing missing values
airquality[3:6, ] # A toy dataset
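For data frames, a common idiom is complete.cases(), which keeps only rows with no NA in any column:

```r
data(airquality)
good <- complete.cases(airquality) # TRUE where the row has no NAs
clean <- airquality[good, ]        # Fully observed rows only
```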
RAM considerations
This is what happens if you try to read in more data than will fit in RAM.
Interfaces to the outside world
devtools::install_github("lyonsquark/RootTreeToR")
require(RootTreeToR)
# Open and load ROOT tree:
rt <- openRootChain("TreeName", "FileName")
N <- nEntries(rt) # Number of rows of data
# Names of branches:
branches <- RootTreeToR::getNames(rt)
# Read in a subset of branches (vars), M rows:
df <- toR(rt, vars, nEntries = M) # df: a data.frame
Other packages for reading ROOT data
I AlphaTwirl: a Python library for summarising event data in
ROOT Trees (https://github.com/TaiSakuma/AlphaTwirl)
I For more details, ask the author Tai Sakuma (he’s here)
Control structures
I Structures that will be familiar to C++ programmers include:
if, else, for, while, break, and return. repeat executes
an infinite loop, next skips an iteration of a loop.
for (letter in letters[1:5]) {
    print(letter)
}
## [1] "a"
## [1] "b"
## [1] "c"
## [1] "d"
## [1] "e"
Functions in R are “first class objects”, which means that they can
be treated much like any other R object. Importantly, they can be
passed as arguments to other functions. Typing a function’s name
at the prompt (e.g. myplot) auto-prints its definition.
Vectorised operations
Many operations act element-wise on whole vectors:
x <- 1:4
y <- 5:8
x * y # Element-wise multiplication
## [1] 5 12 21 32
Vectorised matrix operations
x <- matrix(1:4, 2, 2)
y <- matrix(rep(10, 4), 2, 2)
x * y # Element-wise multiplication
##      [,1] [,2]
## [1,]   10   30
## [2,]   20   40
x %*% y # True matrix multiplication
##      [,1] [,2]
## [1,]   40   40
## [2,]   60   60
Loop functions
lapply() applies a function over a list and returns a list:
x <- list(a = 1:5, b = rnorm(10))
lapply(x, mean)
## $a
## [1] 3
##
## $b
## [1] 0.1878958
sapply(x, mean)
## a b
## 3.0000000 0.1878958
Loop functions: apply
x <- matrix(rnorm(12), 4, 3)
apply(x, 2, mean) # Mean of each column: a vector of length 3
str(mtcars)
summary(mtcars[1:3])
head(mtcars[1:5])
Probability distributions
Every distribution has functions with four prefixes:
I d for density
I r for random number generation
I p for cumulative distribution
I q for quantile function
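For the normal distribution the four functions are dnorm, rnorm, pnorm, and qnorm:

```r
dnorm(0)     # Density at 0: 1/sqrt(2*pi), about 0.3989
pnorm(0)     # Cumulative probability P(X <= 0) = 0.5
qnorm(0.975) # Quantile function: about 1.96
rnorm(3)     # Three random draws
```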
Graphics
Plotting using R base plots: histograms
[Histogram of airquality$Ozone: Frequency (0-30) vs. Ozone (0-150)]
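The histogram above can be reproduced with base graphics; a minimal sketch:

```r
data(airquality)
hist(airquality$Ozone) # Frequency histogram; NAs are dropped
```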
Plotting using R base plots: (Tukey) boxplots
[Tukey box plots of a variable x across groups]
Plotting using R base plots: scatter plots
smoothScatter(x, y)
[Smoothed colour-density scatter plot of y vs. x]
Plotting using R base plots: scatter plot matrices
[Scatter plot matrix of the iris data: Sepal.Length, Sepal.Width, Petal.Length, Petal.Width, Species]
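A scatter plot matrix like the one above comes from pairs(); a sketch:

```r
data(iris)
pairs(iris) # All pairwise scatter plots of the iris columns
```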
Plotting using R base plots: line graphs
data("UKDriverDeaths")
plot(UKDriverDeaths)
[Monthly time series of UKDriverDeaths, ranging roughly 1000-2500, plotted against Time]
Plotting with ggplot2
[Bar chart of diamond counts by cut: Fair, Good, Very Good, Premium, Ideal]
[Scatter plot of gdpPercap (log scale) vs. lifeExp, coloured by continent: Africa, Americas, Asia, Europe, Oceania]
Example ggplot2 plot: box plot
ggplot(InsectSprays,
aes(x = spray, y = count, fill = spray)) +
geom_boxplot()
[Box plots of count for sprays A-F]
Example ggplot2 plot: violin plot
ggplot(InsectSprays,
aes(x = spray, y = count, fill = spray)) +
geom_violin()
[Violin plots of count for sprays A-F]
Example ggplot2 plot: scatter plot with a LOESS (locally
weighted scatterplot smoothing) fit and 95% confidence
interval band
[Scatter plot of dist vs. speed (cars data) with LOESS fit and 95% confidence band]
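The plot can be reproduced with ggplot2's geom_smooth(), which defaults to a LOESS fit with a 95% confidence ribbon on small datasets; a sketch using the built-in cars data:

```r
library(ggplot2)
ggplot(cars, aes(x = speed, y = dist)) +
  geom_point() +
  geom_smooth(method = "loess") # Adds the fit and its 95% CI band
```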
Machine learning
Machine learning in R
I There are a huge number of packages available for machine
learning in R
I See CRAN Machine Learning task view
I Rather than interacting with algorithms directly, it’s easier to
drive them from a framework that provides a unified interface
for performing common operations
I These provide functions for data preprocessing (cleaning), data
splitting (creating partitions and resampling), training and
testing functions, and utilities for generating confusion matrices
and ROC plots
I Two excellent frameworks for machine learning are the caret
and mlr packages
I Another useful package is h2o, which includes algorithms
implemented in Java (for speed) for deep learning and boosted
decision trees, and has an interface to Spark for parallelised
machine learning on large datasets — well worth checking out!
Machine learning in R
ggplot(gbmFit)
[Tuning profile from ggplot(gbmFit): cross-validated accuracy on the y-axis]
[Variable importance plot: V11, V12, V9, V36, V4, V13, V10, V45, V43, V48]
mlr
A simple stratified cross-validation of linear discriminant analysis
with mlr:
require(mlr)
data(iris)
# Define the task
task <- makeClassifTask(id = "tutorial",
data = iris,
target = "Species")
# Define the learner
lrn <- makeLearner("classif.lda")
# Define the resampling strategy
rdesc <- makeResampleDesc(method = "CV",
stratify = TRUE)
# Do the resampling
r <- resample(learner = lrn, task = task,
resampling = rdesc, show.info = FALSE)
mlr
Results (mmce is the mean misclassification error):
## mmce.test.mean
## 0.02
Deep learning with H2O
require(h2o)
h2o.no_progress() # Suppress progress bars
localH2O <- h2o.init() # Initialise a local H2O cluster
## Connection successful!
##
## R is connected to the H2O cluster:
## H2O cluster uptime: 1 hours 54 minutes
## H2O cluster version: 3.10.3.6
## H2O cluster version age: 1 month
## H2O cluster name: H2O_started_from_R_andy_
## H2O cluster total nodes: 1
## H2O cluster total memory: 0.84 GB
## H2O cluster total cores: 2
## H2O cluster allowed cores: 2
## H2O cluster healthy: TRUE
## H2O Connection ip: localhost
## H2O Connection port: 54321
Deep learning with H2O
Load prostate cancer dataset, partition into training and test sets:
Train deep learning neural net with 5 hidden layers, ReLU with
dropout, 10000 epochs, then predict on held-out test set:
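The elided code might have looked roughly like the sketch below; the file name, response column (CAPSULE), and layer sizes are assumptions, not the author's exact settings. ROCR supplies the p object used for the ROC plot:

```r
require(h2o)
require(ROCR)

# Hypothetical file and response column; adjust to the real dataset.
prostate <- h2o.importFile("prostate.csv")
prostate$CAPSULE <- as.factor(prostate$CAPSULE)   # Binary target
splits <- h2o.splitFrame(prostate, ratios = 0.75) # 75/25 train/test split
train <- splits[[1]]
test  <- splits[[2]]

# Five hidden layers, rectifier (ReLU) activation with dropout:
fit <- h2o.deeplearning(x = setdiff(names(prostate), "CAPSULE"),
                        y = "CAPSULE",
                        training_frame = train,
                        hidden = rep(200, 5),
                        activation = "RectifierWithDropout",
                        epochs = 10000)

# Predict on the held-out test set; p1 is the predicted probability
# of the positive class. Build a ROCR prediction object and get the AUC:
pred <- as.data.frame(h2o.predict(fit, newdata = test))
p <- prediction(pred$p1, as.data.frame(test)$CAPSULE)
performance(p, measure = "auc")@y.values
```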
Calculate AUC:
## [[1]]
## [1] 0.7744834
Deep learning with H2O
Plot ROC curve:
# p is a ROCR prediction object
plot(performance(p, measure = "tpr", x.measure = "fpr"),
     col = "red")
abline(a = 0, b = 1, lty = 2) # Diagonal = random classifier
[ROC curve: true positive rate vs. false positive rate, with the diagonal reference line]