Data Visualization in R - With Cheat Sheets PDF
1. Overview
Michael Friendly
SCS Short Course
Sep/Oct, 2018
http://datavis.ca/courses/RGraphics/
Course outline
1. Overview of R graphics
2. Standard graphics in R
3. Grid & lattice graphics
4. ggplot2
Outline: Session 1
• Session 1: Overview of R graphics, the big picture
Getting started: R, R Studio, R package tools
Roles of graphics in data analysis
• Exploration, analysis, presentation
What can I do with R graphics?
• Anything you can think of!
• Standard data graphs, maps, dynamic, interactive graphics –
we’ll see a sampler of these
• R packages: many application-specific graphs
Reproducible analysis and reporting
• knitr, R markdown
• R Studio
Outline: Session 2
• Session 2: Standard graphics in R
R object-oriented design
Lecture notes for this session are available on the web page
Outline: Session 4
• Session 4: ggplot2
Most powerful approach to statistical graphs,
based on the “Grammar of Graphics”
A graphics language, composed of layers, “geoms”
(points, lines, regions), each with graphical
“aesthetics” (color, size, shape)
part of a workflow for “tidy” data manipulation
and graphics
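As a minimal sketch of these ideas, the plot below maps variables to aesthetics and then adds two geom layers. It assumes the ggplot2 package is installed; the built-in mtcars data and the variables chosen are illustrative assumptions, not taken from the course materials.

```r
library(ggplot2)

# Map variables to aesthetics (x, y, color), then add layers ("geoms").
# mtcars and the chosen variables are illustrative assumptions.
p <- ggplot(mtcars, aes(x = wt, y = mpg, color = factor(cyl))) +
  geom_point(size = 3) +                                     # layer 1: points
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE) +  # layer 2: one fitted line per group
  labs(x = "Weight (1000 lbs)", y = "Miles per gallon", color = "Cylinders")

print(p)
```

Because the color aesthetic is set in ggplot(), both layers inherit it, so the points and the fitted lines are grouped by number of cylinders.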
Resources: Books
Paul Murrell, R Graphics, 2nd Ed.
Covers everything: traditional (base) graphics, lattice, ggplot2, grid graphics, maps, network diagrams, …
R code for all figures: https://www.stat.auckland.ac.nz/~paul/RG2e/
Hadley Wickham, ggplot2: Elegant graphics for data analysis, 2nd Ed.
1st Ed: Online, http://ggplot2.org/book/
ggplot2 Quick Reference: http://sape.inf.usi.ch/quick-reference/ggplot2/
Complete ggplot2 documentation: http://docs.ggplot2.org/current/
Resources: cheat sheets
R Studio provides a variety of handy cheat sheets for aspects of data analysis &
graphics. See: https://www.rstudio.com/resources/cheatsheets/
Download, laminate,
paste them on your
fridge
Getting started: Tools
• To profit best from this course, you need to install
both R and R Studio on your computer
Publish: A variety of R packages make it easy to write and publish research reports
and slide presentations in various formats (HTML, Word, LaTeX, …), all within R
Studio
Web apps: R now has several powerful tools for building dynamic, web-based
data display and analysis applications.
Getting started: R Studio
[Screenshot: R Studio window. Panes: R console (just like Rterm); command history; workspace (your variables); files, plots, packages, help]
R Studio navigation
R folder navigation commands:
• Where am I?
> getwd()
[1] "C:/Dropbox/Documents/6135"
• Go somewhere:
> setwd("C:/Dropbox")
> setwd(dirname(file.choose()))   # file.choose() returns a file path; move to its folder
R Studio GUI
R Studio projects
R Studio projects are a handy way to
organize your work
R Studio projects
An R Studio project for a research paper: R files (scripts), Rmd files (text, R “chunks”)
Organizing an R project
• Use a separate folder for each project
• Use sub-folders for various parts
Organizing an R project
• Use separate R files for different steps:
Data import, data cleaning, … → save as an RData file
Analysis: load RData, …
read-mydata.R
# read the data; better yet: use RStudio File -> Import Dataset ...
mydata <- read.csv("data/mydata.csv")
# (clean the data here), then save it for the analysis step
save(mydata, file="data/mydata.RData")
analyse.R
# analysis
load("data/mydata.RData")
# fit models
mymod.1 <- lm(y ~ X1 + X2 + X3, data=mydata)
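The two-step workflow above can be sketched end to end as follows. This is only an illustration: the project folder is a temporary directory, and the data are simulated stand-ins for a real data/mydata.csv file.

```r
# Work in a temporary folder so nothing in a real project is touched
proj <- file.path(tempdir(), "myproject")
dir.create(file.path(proj, "data"), recursive = TRUE, showWarnings = FALSE)

# Stand-in for a raw data file (simulated values, for illustration only)
set.seed(1)
raw <- data.frame(X1 = rnorm(50), X2 = rnorm(50), X3 = rnorm(50))
raw$y <- 1 + 2 * raw$X1 - raw$X2 + rnorm(50)
write.csv(raw, file.path(proj, "data", "mydata.csv"), row.names = FALSE)

# Step 1 (read-mydata.R): import, clean, save as an RData file
mydata <- read.csv(file.path(proj, "data", "mydata.csv"))
save(mydata, file = file.path(proj, "data", "mydata.RData"))
rm(mydata)

# Step 2 (analyse.R): load the saved data and fit the model
load(file.path(proj, "data", "mydata.RData"))
mymod.1 <- lm(y ~ X1 + X2 + X3, data = mydata)
coef(mymod.1)
```

Keeping the import step separate means the analysis script can be re-run without re-reading and re-cleaning the raw data.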
The 80-20 rule: Graphics
• Analysis graphs: Happily, 20% of effort can give 80% of a
desired result
Default settings for plots often give something reasonable
90-10 rule: Plot annotations (regression lines, smoothed curves, data
ellipses, …) add additional information to help understand patterns,
trends and unusual features, with only 10% more effort
• Presentation graphs: Sadly, 80% of total effort may be
required to give the remaining 20% of your final graph
Graph title, axis and value labels: should be directly readable
Grouping attributes: visually distinct, allowing for BW vs color
• color, shape, size of point symbols;
• color, line style, line width of lines
Legends: Connect the data in the graph to interpretation
Aspect ratio: need to consider the H x V size and shape
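A small sketch of the first point, using base graphics: the default plot takes one line, and a regression line plus a smoothed curve take only a few more. The mtcars data and the colors are illustrative choices, not part of the course materials.

```r
# A default scatterplot: little effort, reasonable result
plot(mpg ~ wt, data = mtcars)

# A little more effort: annotations that help show the trend
fit <- lm(mpg ~ wt, data = mtcars)
abline(fit, col = "blue", lwd = 2)                    # regression line
lines(lowess(mtcars$wt, mtcars$mpg),
      col = "red", lwd = 2, lty = 2)                  # smoothed curve
legend("topright", legend = c("linear fit", "lowess smooth"),
       col = c("blue", "red"), lwd = 2, lty = c(1, 2))
```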
What can I do with R graphics?
A wide variety of standard plots (customized)
line graph: plot()
bar chart: barplot()
hist()
3D plot: persp()
boxplot()
pie()
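As a rough sampler, the six plot types listed above can be drawn on one page with base graphics alone. The built-in datasets used here are illustrative choices.

```r
# A sampler of base-graphics plot types on one page
op <- par(mfrow = c(2, 3), mar = c(4, 4, 2, 1))

plot(pressure, type = "b", main = "plot()")               # line graph
barplot(VADeaths[, "Rural Male"], main = "barplot()")     # bar chart
hist(faithful$eruptions, main = "hist()")                 # histogram
persp(volcano, theta = 30, phi = 30, main = "persp()")    # 3D surface
boxplot(count ~ spray, data = InsectSprays, main = "boxplot()")
pie(table(mtcars$cyl), main = "pie()")

par(op)   # restore the previous graphics settings
```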
Bivariate plots
R base graphics provide a wide variety of different plot types for bivariate data
Some plotting
functions take a
matrix argument &
plot all columns
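One base function that takes a matrix and plots all of its columns is matplot(). A minimal sketch, with made-up curves as the matrix columns:

```r
# matplot() plots each column of a matrix against a common x
x <- seq(0, 2 * pi, length.out = 100)
Y <- cbind(sin = sin(x), cos = cos(x), damped = exp(-x / 3) * sin(2 * x))

matplot(x, Y, type = "l", lty = 1, lwd = 2, col = 1:3,
        xlab = "x", ylab = "value")
legend("topright", legend = colnames(Y), col = 1:3, lty = 1, lwd = 2)
```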
Bivariate plots
A number of specialized plot types are also available in base R graphics
Plot methods for factors and tables are designed to show the association between
categorical variables
The vcd & vcdExtra
packages provide more
and better plots for
categorical data
Mosaic plots
Similar to a grouped bar chart
Shows a frequency table with tiles,
area ~ frequency
> data(HairEyeColor)
> HEC <- margin.table(HairEyeColor, 1:2)
> HEC
       Eye
Hair    Brown Blue Hazel Green
  Black    68   20    15     5
  Brown   119   84    54    29
  Red      26   17    14    14
  Blond     7   94    10    16
> chisq.test(HEC)
data: HEC
X-squared = 140, df = 9, p-value <2e-16
> round(residuals(chisq.test(HEC)),2)
       Eye
Hair    Brown  Blue Hazel Green
  Black  4.40 -3.07 -0.48 -1.95
  Brown  1.23 -1.95  1.35 -0.35
  Red   -0.07 -1.73  0.85  2.28
  Blond -5.85  7.05 -2.23  0.61
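The table above can be shown as a mosaic plot with the base function mosaicplot(). This is a minimal sketch; the title and the use of shade = TRUE are illustrative choices (the vcd package offers richer versions).

```r
# Two-way table of hair color by eye color, collapsed over sex
HEC <- margin.table(HairEyeColor, 1:2)

# Base-graphics mosaic plot; shade = TRUE colors the tiles by the
# Pearson residuals from the independence model
mosaicplot(HEC, shade = TRUE, main = "Hair color and eye color")

# The residuals behind the shading
res <- residuals(chisq.test(HEC))
round(res, 2)
```

Tiles with large positive residuals (e.g., blond hair with blue eyes) occur more often than independence would predict; large negative residuals occur less often.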
data(Duncan, package="carData")   # since car 3.0 the data sets live in carData
plot(~ prestige + income + education,
     data=Duncan)
pairs(~ prestige + income + education,
      data=Duncan)
Multivariate plots
These basic plots can be enhanced in
many ways to be more informative.
library(car)
scatterplotMatrix(~prestige + income + education, data=Duncan, id=list(n=2))   # car >= 3.0; older versions used id.n=2
Multivariate plots: corrgrams
For larger data sets, visual
summaries are often more useful
than direct plots of the raw data
See: Friendly, M. Corrgrams: Exploratory displays for correlation matrices. The American Statistician, 2002, 56, 316-324
Multivariate plots: corrgrams
For even larger data sets, more
abstract visual summaries are
necessary to see the patterns of
relationships.
Generalized pairs plots
Generalized pairs plots from the gpairs
package handle both categorical (C) and
quantitative (Q) variables in sensible ways
 x  y  plot
 Q  Q  scatterplot
 C  Q  boxplot
 Q  C  barcode
 C  C  mosaic
library(gpairs)
data(Arthritis, package="vcd")   # the Arthritis data come from the vcd package
gpairs(Arthritis[, c(5, 2:5)], …)
Models: diagnostic plots
Linear statistical models (ANOVA,
regression), y = X β + ε, require some
assumptions: ε ~ N(0, σ²)
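A minimal sketch of checking these assumptions graphically: calling plot() on a fitted lm object draws the standard regression diagnostic plots. The model and the mtcars data are illustrative assumptions, not the course example.

```r
# Fit a linear model to a built-in dataset (an illustrative choice)
mod <- lm(mpg ~ wt + hp, data = mtcars)

# plot() on an lm object gives four diagnostic displays:
# residuals vs fitted, normal QQ, scale-location, residuals vs leverage
op <- par(mfrow = c(2, 2))
plot(mod)
par(op)   # restore the previous layout
```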
Models: Added variable plots
The car package has many more functions for plotting linear model objects
Among these, added variable plots show the partial relations of y to each x, holding all
other predictors constant.
library(car)
avPlots(duncan.mod, id=list(n=2), ellipse=TRUE, …)   # car >= 3.0; older versions used id.n=2
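To see what "holding all other predictors constant" means, an added-variable plot can be built by hand in base R: regress both y and the focal x on the other predictors and plot the two sets of residuals. This sketch uses mtcars as an illustrative stand-in for the Duncan model.

```r
# Full model (built-in data, an illustrative choice)
full <- lm(mpg ~ wt + hp, data = mtcars)

# Added-variable plot for wt: remove the effect of hp from both mpg and wt
e_y <- residuals(lm(mpg ~ hp, data = mtcars))   # mpg with hp partialled out
e_x <- residuals(lm(wt  ~ hp, data = mtcars))   # wt with hp partialled out

plot(e_x, e_y, xlab = "wt | hp", ylab = "mpg | hp",
     main = "Added-variable plot for wt")
av <- lm(e_y ~ e_x)
abline(av)

# The slope of this line equals the coefficient of wt in the full model
c(av_slope = unname(coef(av)[2]), full_coef = unname(coef(full)["wt"]))
```

The two printed values agree, which is why the plot shows the partial relation of y to x directly.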
Models: Interpretation
Fitted models are often difficult to interpret from tables of coefficients
Call:
lm(formula = prestige ~ income + education + type, data = Duncan)
Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept)  -0.18503    3.71377  -0.050  0.96051
income        0.59755    0.08936   6.687 5.12e-08 ***
education     0.34532    0.11361   3.040  0.00416 **
typeprof     16.65751    6.99301   2.382  0.02206 *
typewc      -14.66113    6.10877  -2.400  0.02114 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
How to understand the effect of each predictor?
Models: Effect plots
Fitted models are more easily interpreted by plotting the predicted values.
Effect plots do this nicely, making plots for each high-order term, controlling for others
library(effects)
duncan.eff1 <- allEffects(duncan.mod1)
plot(duncan.eff1)
Models: Coefficient plots
Sometimes you need to report or display the coefficients from a fitted model.
A plot of coefficients with CIs is sometimes more effective than a table.
library(coefplot)
duncan.mod2 <- lm(prestige ~ income * education, data=Duncan)
coefplot(duncan.mod2, intercept=FALSE, lwdInner=2, lwdOuter=1,
title="Coefficient plot for duncan.mod2")
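For readers without the coefplot package, the same idea can be sketched in base R: plot each estimate as a point with a segment for its confidence interval. The model below uses mtcars as an illustrative stand-in; it is not the Duncan model from the slide.

```r
# Fit a model to built-in data (an illustrative stand-in)
mod <- lm(mpg ~ wt + hp + qsec, data = mtcars)

est <- coef(mod)[-1]                       # drop the intercept
ci  <- confint(mod)[-1, , drop = FALSE]    # 95% confidence intervals
k   <- length(est)

# A dot for each estimate, a horizontal segment for its interval
plot(est, seq_len(k), xlim = range(ci, 0), pch = 16, yaxt = "n",
     xlab = "Estimate (95% CI)", ylab = "", main = "Coefficient plot")
segments(ci[, 1], seq_len(k), ci[, 2], seq_len(k), lwd = 2)
axis(2, at = seq_len(k), labels = names(est), las = 1)
abline(v = 0, lty = 2)                     # reference line at zero
```

Intervals that cross the dashed line correspond to coefficients that are not significantly different from zero at the 5% level.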
Coefficient plots become
increasingly useful as:
(a) models become more complex
(b) we have several models to
compare
[Figure: coefficient plot; visible labels include "family income - wife's income" and "log wage rate for working women"]
It uses:
lattice::wireframe(z ~ x + y, …)
3D graphics: code
1. Generate data for the model z = 10 + .5x +.3y + .2 x*y
b0 <- 10 # intercept
b1 <- .5 # x coefficient
b2 <- .3 # y coefficient
int12 <- .2 # x*y coefficient
g <- expand.grid(x = 1:20, y = 1:20)
g$z <- b0 + b1*g$x + b2*g$y + int12*g$x*g$y
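The generated grid can then be drawn as a surface with lattice::wireframe(). A minimal sketch follows; it rebuilds the same data so that it runs on its own, and the drape and screen settings are illustrative choices.

```r
library(lattice)

# Same data as above: z = 10 + .5x + .3y + .2 x*y on a 20 x 20 grid
g <- expand.grid(x = 1:20, y = 1:20)
g$z <- 10 + 0.5 * g$x + 0.3 * g$y + 0.2 * g$x * g$y

# Draw the response surface; drape colors the surface by height,
# screen sets the viewing angles
p <- wireframe(z ~ x + y, data = g, drape = TRUE,
               screen = list(z = -40, x = -60))
print(p)
```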
3D graphics
Statistical animations
Statistical concepts can often be
illustrated in a dynamic plot of some
process.
Data animations
Time-series data are often plotted
against time on an X axis.
library(HistData)
SnowMap(density=TRUE,
        main="Snow's Cholera Map, Death Intensity")
SnowMap(density=TRUE,
        main="Snow's Cholera Map with Pump Neighborhoods")
library(igraph)
# a tree with 10 nodes, drawn with a tree layout
tree <- graph.tree(10)
tree <- set.edge.attribute(tree, "color", value="black")
plot(tree,
     layout=layout.reingold.tilford(tree, root=1, flip.y=FALSE))
# a complete graph with 10 nodes, drawn on a circle
full <- graph.full(10)
full <- set.edge.attribute(full, "color", value="black")
plot(full, layout=layout.circle)
Diagrams: Network diagrams
Graphviz (http://www.graphviz.org/) is a comprehensive program for drawing
network diagrams and abstract graphs. It uses a simple notation to describe nodes
and edges.
The Rgraphviz package (from Bioconductor) provides an R interface
Diagrams: Flow charts
The diagram package:
library(sem)
union.mod <- specifyEquations(covs="x1, x2", text="
y1 = gam12*x2
y2 = beta21*y1 + gam22*x2
y3 = beta31*y1 + beta32*y2 + gam31*x1
")
union.sem <- sem(union.mod, union, N=173)
pathDiagram(union.sem,
edge.labels="values",
file="union-sem1",
min.rank=c("x1", "x2"))
Dynamically updated data visualizations
The wind map app, http://hint.fm/wind/, is one of a growing number of R-based
applications that harvest data from standard sources and present a visualization.
Web scraping: CRAN package history
R has extensive facilities for extracting and processing information obtained from web
pages. The XML package is one useful tool for this purpose.
This example:
• downloads information about all R
packages from the CRAN web site,
• finds & counts all of those available for
each R version,
• plots the counts with ggplot2, adding a
smoothed curve, and plot annotations
shiny: Interactive R applications
shiny, from R Studio, makes it easier to develop interactive applications
Output formats and templates
The integration of R, R Studio, knitr,
rmarkdown and other tools is now
highly advanced.
Writing it up
• Use simple Markdown to write text
• Include code chunks for analysis & graphs
[Screenshot: mypaper.Rmd, created from a template, showing the yaml header and a level-2 header. See also: Help -> Markdown quick reference]
rmarkdown basics
rmarkdown uses simple markdown formatting for all standard document elements
R code chunks
R code chunks are run by knitr, and the results are inserted in the output document
An R chunk:
```{r name, options}
# R code here
```
The R Markdown Cheat Sheet provides most of the details
https://www.rstudio.com/wp-content/uploads/2016/03/rmarkdown-cheatsheet-2.0.pdf
R notebooks
Often, you just want to “compile” an R script, and get the output embedded in the
result, in HTML, Word, or PDF. Just type Ctrl-Shift-K or tap the Compile Report button
Summary & Homework
• Today was mostly an overview of R
graphics, with emphasis on:
R, R Studio, R package tools
Roles of graphics in data analysis,
A small gallery of examples of different kinds of graphic applications in
R; only small samples of R code
Work flow: How to use R productively in analysis & reporting
• Next week: start on skills with traditional graphics
• Homework:
Install R & R Studio
Find one or more examples of data graphs from your research area
• What are the graphic elements: points, lines, areas, regions, text, labels, ???
• How could they be “described” to software such as R?
• How could they be improved?