Data Visualisation Slides 1-6
[Figure: course overview with the labels Statistics, Visualization, Project, Machine Learning, Privacy Engineering, and On Demand: Programming in Python, Programming in R and Shiny]
Practical Information
• Data Visualisation on Thursday 9:00 - 11:30
Big Data
Data Explosion
Insight, Decision, Action
Why Visualization?
Useful in two phases…
Static Infographics
Tangible Visualisations
Data Animations
source: https://www.youtube.com/watch?v=4gIhRkCcD4U
Interactive Visualisations
source: http://getdolphins.com/blog/interactive-data-visualizations-new-york-times/
In-Class Assignment 1.1
Truth or Beauty?
Terminology?
• Absolute values
• Aggregate • …
• Filter
• Summarize
Assignment 1.2
Follow the tutorial on R Data Structures and Graphics.
Make notes of things you don’t understand.
https://www.w3schools.com/R/r_vectors.asp
Base Data Types
Numbers: Discrete Numbers, e.g. Number of children, Floor in a building, …
Categories: e.g. Man || Woman, Pass || Fail || Inconclusive…
Interactive R Demo
Example Data Sets
• R contains a number of example Data Sets
Line Graph
Scatter Diagram
Histogram versus Bar Graph
Assignment 1.3
Install RStudio on your laptop. Explore some of the data
sets that are packaged with R.
• What container data structure, what data types?
• Which plots can be used to explore the data?
https://www.rstudio.com/products/rstudio/download/
Homework Assignment: Iris
• Use the built-in data.frame “iris”. For all the plots, make sure that you have
human-readable titles and clear labelling (please don’t use just the variable
names!)
• Use > help(iris) to understand what attributes there are. Make sure that you
understand what they all mean.
• Make a histogram with > hist() with 20 bins of petal width for the Iris Setosa.
• Make a scatterplot of sepal length versus petal length. Show each of the three
species of iris on the same plot with a colored legend to separate them.
• Make a scatterplot of sepal length versus sepal width for all irises whose petal
width is larger than 1.5
• Make one more plot that shows something interesting about the differences
between the species of Irises.
Iris
If something is unclear or you need additional help please contact me!
Email: [email protected]
Recap …
base data types, vectors,
data.frames, subsetting,
visualizing
Logical (a.k.a. Boolean): TRUE, FALSE
0D: 25
1D: Vector, c(2, 4, 6) or c("Amsterdam", "Berlin"); List, list(1, "Rotterdam", TRUE)
2D: Matrix, Data.Frame
multi-D: Array
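The containers in this overview can be tried out directly in the console. A minimal sketch; all variable names and values below are made up for illustration:

```r
# 0D: in R a single value is simply a vector of length 1
single <- 25

# 1D: a vector holds values of one type, a list can mix types
numbers <- c(2, 4, 6)
cities  <- c("Amsterdam", "Berlin")
mixed   <- list(1, "Rotterdam", TRUE)

# 2D: a matrix holds one type, a data frame can have a different type per column
m  <- matrix(1:6, nrow = 2)
df <- data.frame(city = cities, visited = c(TRUE, FALSE))

# multi-D: an array, here with three dimensions
a <- array(1:24, dim = c(2, 3, 4))

str(mixed)   # shows the type of every list element
dim(m)       # rows and columns of the matrix
dim(a)       # size of each of the three dimensions
```

str() is a convenient first check for any of these: it reports both the container and the data types inside it.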
Vectors
• Construct your own with: c(1, 2, 3)
• A vector can be given a name: my.vector <- c("A", "B", "C")
• The elements of a vector can also be named (e.g. built-in
data set islands). Use names() to manipulate them.
> str(islands)
Named num [1:48] 11506 5500 16988 2968 16 ...
- attr(*, "names")= chr [1:48] "Africa" "Antarctica" "Asia" "Australia" ...
> head(islands)
      Africa   Antarctica         Asia    Australia Axel Heiberg       Baffin
       11506         5500        16988         2968           16          184
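A short sketch of how names() can be used, first on islands and then on a made-up vector (the heights and person names are invented for illustration):

```r
# islands is a named numeric vector that ships with R
largest <- sort(islands, decreasing = TRUE)[1:3]
names(largest)                            # names of the three largest landmasses

# names() can also set names on your own vector
heights <- c(1.82, 1.75, 1.68)            # made-up values
names(heights) <- c("Ann", "Bob", "Cas")  # made-up names
heights["Bob"]                            # select an element by name
```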
Data Frames (1 of 2)
• Construct your own with: data.frame(col1 =
c(1,2,3), col2 = c("A", "B", "C"))
• A data frame can be given a name: my.df <-
data.frame(col1 = c(1,2,3), col2 = c("A", "B",
"C"))
• The rows and columns of a data frame can also be named (e.g. built-in
data set women). Use rownames() or colnames()
to manipulate them.
Data Frames (2 of 2)
> str(women)
'data.frame': 15 obs. of 2 variables:
$ height: num 58 59 60 61 62 63 64 65 66 67 ...
$ weight: num 115 117 120 123 126 129 132 135 139 142 ...
> head(women)
  height weight
1     58    115
2     59    117
3     60    120
4     61    123
5     62    126
6     63    129
Each row is an Observation, each column is a Variable.
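As a small illustration of naming and indexing a data frame, here is a sketch that reuses the my.df example; the new row and column names are invented:

```r
my.df <- data.frame(col1 = c(1, 2, 3), col2 = c("A", "B", "C"))

# rename columns (variables) and rows (observations)
colnames(my.df) <- c("number", "letter")
rownames(my.df) <- c("first", "second", "third")

my.df["second", "letter"]   # one cell, addressed by row name and column name
my.df$number                # one variable as a vector

# the same functions work on the built-in data set women
nrow(women)                 # number of observations
ncol(women)                 # number of variables
women$height[1:3]           # first three heights
```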
Explore real-world Data Sets
Discover?
Import
Clean
Transform
Visualize
Import Data Sets from
files on the Internet
Data Structures in Memory
(Vector, Matrix, Data.Frame,
List, …) are different from Data
Structures stored in a File.
Data File Formats
TXT JPEG
CSV PNG
JSON MP3
GeoJSON AVI
XML …
HTML
R Code Hints
• > read.csv() for reading “comma separated value” files (“.csv”).
• > read.csv2() variant used in countries that use a comma “,” as
decimal point and a semicolon “;” as field separator.
• > read.delim() for reading “tab-separated value” files (“.txt”). By
default, a point (“.”) is used as decimal point.
• > read.delim2() for reading “tab-separated value” files (“.txt”). By
default, a comma (“,”) is used as decimal point.
• > install.packages("rjson")
> library("rjson")
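To see the difference between the two CSV conventions without needing a file from the Internet, the sketch below writes a small made-up table to temporary files and reads it back. The table contents are invented; write.csv() and write.csv2() are the base R counterparts of the readers listed above.

```r
# a small made-up table
cities <- data.frame(city = c("Rotterdam", "Berlin"), temp = c(12.5, 9.8))

f1 <- tempfile(fileext = ".csv")
f2 <- tempfile(fileext = ".csv")
write.csv(cities, f1, row.names = FALSE)    # "," as separator, "." as decimal point
write.csv2(cities, f2, row.names = FALSE)   # ";" as separator, "," as decimal point

readLines(f2)        # look at the raw text that is stored in the file

a <- read.csv(f1)
b <- read.csv2(f2)
str(b)               # temp is numeric again after reading
all.equal(a, b)      # both files give the same data frame
```

Reading the second file with read.csv() instead of read.csv2() would not split the columns correctly, which is a common cause of "polluted" imports.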
example:
What would you like to explore?
• Example:
> time <- seq(from=0, to=10, by=0.1)
> growth <- time*time
> plot(x=time, y=growth, type="l")
visual comparison
over time
Trend Line
visual comparison
between variables
Boxplot
• Purpose: Get an impression of the distribution of a
variable in a data set: Center, Quartiles, Outliers…?
> boxplot(islands)
visual inspection
of distribution
Histogram
• Purpose: Get an impression of the distribution of a
variable in a data set: Symmetrical or Skewed?
Uniform distribution? Normal distribution? Something else?
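A minimal sketch of such a histogram, using the built-in data set faithful (chosen here only as an example; the title and axis label are our own). Note that breaks = 20 is a suggestion to R, not an exact bin count.

```r
# durations of Old Faithful eruptions, in minutes
h <- hist(faithful$eruptions, breaks = 20,
          main = "Duration of Old Faithful eruptions",
          xlab = "Duration (minutes)")

h$breaks   # the bin boundaries that R actually chose
h$counts   # number of observations per bin
```

This distribution is neither uniform nor normal: it has two peaks, which a boxplot of the same variable would hide.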
visual inspection
of distribution
visual inspection
of relationship
3D plot
> persp(volcano)
Heat Map
> image(volcano)
Contour Graph (isobars,
contour lines)
> contour(volcano)
Multiple Data Sets in a
single plot
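One way to put several series in a single base R plot is to draw the first with plot() and add the others with lines() or points(). A sketch with two made-up growth curves:

```r
time   <- seq(from = 0, to = 10, by = 0.1)
linear <- 10 * time      # made-up series 1
square <- time * time    # made-up series 2

# the first series sets up the axes
plot(x = time, y = linear, type = "l", col = "blue",
     main = "Linear versus quadratic growth",
     xlab = "Time", ylab = "Growth")

# later series are added to the existing plot
lines(x = time, y = square, col = "red")
legend("topleft", legend = c("linear", "quadratic"),
       col = c("blue", "red"), lty = 1)
```

The axes are fixed by the first plot() call, so it helps to start with the series that has the widest range or to set ylim explicitly.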
visual comparison
between variables
Geospatial Plots
London Cholera Map
Dr. John Snow,1854
Coordinates on Earth
Longitude, Latitude
Different Projections …
… result in different maps.
https://www.timeanddate.com/time/map/
Date and Time formats
Quick Demo
> df <- data.frame(Date = c("10/9/2009 0:00:00", "10/15/2009 0:00:00"))
> as.Date(df$Date, format = "%m/%d/%Y %H:%M:%S")
> install.packages("tidyverse")
> library("tidyverse")
• Contains many packages that are useful in data science; for this course, mainly ggplot2
ggplot2
• Midwest dataset: a built-in dataset
> g <- ggplot(midwest, aes(x=area, y=poptotal)) +
geom_point() + geom_smooth(method="lm")
# set se=FALSE to turn off confidence bands
Homework week 2
Exercise 3
For the dataset state.x77
a. Remove column Frost
> sta <- data.frame(state.x77)  # data.frame() turns "Life Exp" into Life.Exp
> keep <- c("Population", "Income", "Illiteracy", "Life.Exp",
"Murder", "HS.Grad", "Area")
> sta <- sta[keep]
b. Add a variable to the data frame which should categorize the level of illiteracy: [0,1) is low,
[1,2) is some, [2, inf) is high.
> sta$illlvl <- ifelse(sta$Illiteracy < 1, 'low', ifelse(sta$Illiteracy
< 2, 'some', ifelse(sta$Illiteracy >= 2, 'high', NA)))
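An alternative to nested ifelse() calls is cut(), which maps a numeric variable onto labelled intervals in one step. This is a sketch of a different technique than the one used in the exercise, not the intended answer; right = FALSE makes the intervals closed on the left, matching [0,1), [1,2), [2, inf).

```r
sta <- data.frame(state.x77)

# breaks define the interval boundaries, labels name the intervals
sta$illlvl <- cut(sta$Illiteracy,
                  breaks = c(0, 1, 2, Inf),
                  labels = c("low", "some", "high"),
                  right  = FALSE)

table(sta$illlvl)   # how many states fall in each category
```

Unlike ifelse(), cut() returns a factor, which keeps the categories in a meaningful order when plotting.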
Different data structures
Data Cleaning
How is it done in R?
Data transformation
> install.packages("dplyr")
This package will allow you to manipulate the data easily.
An example:
> library(readr)  # read_delim() and cols() come from readr (part of the tidyverse)
> df <- read_delim("heartatk4R.txt", delim = "\t",
    col_types = cols(AGE = col_integer(),
    DIAGNOSIS = col_character(), DIED = col_character(), DRG =
    col_character(), LOS = col_integer()))
Data transformation
# pipe operator (%>%):
# data is sent to
# the next step
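A sketch of a dplyr pipeline, assuming the dplyr package is installed. It uses the built-in data set mtcars rather than the heart attack file, because that file is not available here; the threshold of 100 hp is arbitrary.

```r
library(dplyr)

# each step receives the result of the previous step as its first argument
result <- mtcars %>%
  filter(hp > 100) %>%                          # keep cars with more than 100 hp
  group_by(cyl) %>%                             # one group per number of cylinders
  summarise(mean_mpg = mean(mpg), n = n()) %>%  # one summary row per group
  arrange(desc(mean_mpg))                       # most economical group first

result
```

Without the pipe the same computation would be written inside-out, as arrange(summarise(group_by(filter(mtcars, ...)))), which is much harder to read.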
Data transformation
- data bias
- data loss
Possible solutions:
y = a + b*x
Homework / In class
https://www.kaggle.com/code/rtatman/data-cleaning-challenge-cleaning-numeric-columns/notebook
‘Data Analytics & Visualisation’
Minor ‘Data Science’
Hogeschool Rotterdam, CMI
Week 4
Import
Clean
Transform
Visualize
Practical Problems …
• You can’t find data: CBS, Kaggle, https://datahub.io/collections, Google Dataset Search, …
• Data is polluted or in the wrong format: clean! Use gsub(), use the lubridate library for
Dates and Times, use options of read.csv(): stringsAsFactors = FALSE
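As an illustration of cleaning with gsub(), the sketch below turns polluted amount strings into numbers. The input values are invented; they use a "." as thousands separator and a "," as decimal point.

```r
raw <- c("EUR 1.250,50", "EUR 980,00", "  EUR 12,75 ")   # made-up polluted values

step1 <- gsub("[^0-9.,]", "", raw)             # keep only digits, "." and ","
step2 <- gsub(".", "", step1, fixed = TRUE)    # drop the thousands separator
step3 <- gsub(",", ".", step2, fixed = TRUE)   # decimal comma becomes decimal point
amount <- as.numeric(step3)

amount
```

fixed = TRUE matters in the second step: without it the pattern "." is a regular expression that matches every character.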
Matching of arguments
R function arguments can be matched positionally or by
name, so the following calls to sd are all equivalent:
> mydata <- rnorm(100)
> sd(mydata)
> sd(x = mydata)
> sd(x = mydata, na.rm = FALSE)
> sd(na.rm = FALSE, x = mydata)
> sd(na.rm = FALSE, mydata)
y = a + b*x
Visualisation to
present data sets
Predefined R Functions
> help()
> str()
> head()
> nrow()
> ncol()
> summary()
> plot()
> install.packages()
> library()
> merge()
> aggregate()
…
[Diagram labels: input or actual parameters; output or return value; default values; side effects?]
User Defined R Functions
Functions can be created using the function() keyword and are stored as R
objects just like anything else. In particular, they are R objects of class “function”.
f <- function(<formal parameters>) {
…
return(variable)
}
Functions in R are “first class objects”, which means that they can be treated
much like any other R object.
Importantly,
• Functions can be passed as arguments to other functions.
• Functions can be nested, so that you can define a function inside of another
function.
• The return value of a function is the last expression in the function body to be
evaluated.
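The three points above can be sketched as follows. All function names here are made up for illustration.

```r
# a function with a default value for its second formal parameter
power <- function(x, exponent = 2) {
  x^exponent                 # the last evaluated expression is the return value
}

# functions can be passed as arguments to other functions
apply.twice <- function(f, x) {
  f(f(x))
}

# functions can be defined inside other functions
make.scaler <- function(factor) {
  function(x) x * factor     # the inner function remembers 'factor'
}
triple <- make.scaler(3)

power(4)                 # uses the default exponent
power(2, exponent = 3)   # overrides the default
apply.twice(power, 3)    # power(power(3))
triple(5)
class(power)             # functions are objects of class "function"
```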
▪ fluidPage() is a layout function that sets up the basic visual structure of the page.
▪ selectInput() is an input control that lets the user interact with the app
by providing a value. In this case, it’s a select box with the label “Dataset”
and lets you choose one of the built-in datasets that come with R.
▪ verbatimTextOutput() and tableOutput() are output controls that tell Shiny where to put
rendered output. verbatimTextOutput() displays code and tableOutput() displays tables.
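Put together, these controls could form an app like the sketch below, assuming the shiny package is installed. The UI follows the description above; the server function and the output IDs "summary" and "table" are assumptions added to make the example complete.

```r
library(shiny)

ui <- fluidPage(
  selectInput("dataset", label = "Dataset", choices = ls("package:datasets")),
  verbatimTextOutput("summary"),
  tableOutput("table")
)

server <- function(input, output, session) {
  # summary of the chosen data set, shown as code-like text
  output$summary <- renderPrint({
    dataset <- get(input$dataset, "package:datasets")
    summary(dataset)
  })
  # the chosen data set itself, shown as a table
  output$table <- renderTable({
    dataset <- get(input$dataset, "package:datasets")
    dataset
  })
}

# creates the app object; run it with runApp(app)
app <- shinyApp(ui, server)
```

Each output ID in the UI is matched by an assignment to output$<ID> in the server function; that pairing is how Shiny knows where to put rendered output.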
Reactive programming
Reactive programming is another programming paradigm: it is
programming with asynchronous data streams.
You are able to create data streams of anything, not just from click and hover events.
Streams are cheap and ubiquitous, anything can be a stream: variables, user inputs,
properties, caches, data structures, etc.
If your UI contains an input control with the ID count, then you can access the value
of that input with input$count. It will
initially contain the value 100, and it will be automatically updated as the
user changes the value in the browser.
The input object is read-only. If you try to assign to input$count inside the server function, you get an error:
shinyApp(ui, server)
#> Error: Can't modify read-only reactive value
'count'
One more important thing about input: it’s selective about who is allowed to read it.
To read from an input, you must be in a reactive context.
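A sketch of reading an input correctly, assuming the shiny package is installed. The input ID count and its initial value of 100 come from the text; the label, the output ID "echo" and the message are invented. The two commented-out lines show what would fail.

```r
library(shiny)

ui <- fluidPage(
  numericInput("count", label = "Number of values", value = 100),
  textOutput("echo")
)

server <- function(input, output, session) {
  # input$count <- 10        # would fail: inputs are read-only
  # message(input$count)     # would fail: not inside a reactive context

  # renderText() creates a reactive context, so reading input$count is allowed
  output$echo <- renderText({
    paste("You asked for", input$count, "values")
  })
}

# creates the app object; run it with runApp(app)
app <- shinyApp(ui, server)
```

Because the read happens inside renderText(), Shiny re-runs that block whenever the user changes the number.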
Exercise
Create an app that greets the user by name. You don’t know all the functions you need to do this
yet, so I’ve included some lines of code below. Think about which lines you’ll use and then copy
and paste them into the right place in a Shiny app.
Exercise
Suppose your friend wants to design an app that allows the user to set a
number (x) between 1 and 50, and displays the result of multiplying this
number by 5. This is their first attempt:
Homework
Read this:
https://mastering-shiny.org/basic-app.html
https://mastering-shiny.org/basic-case-study.html
Page Lay-out with Panels
ui1 <- fluidPage(
  sidebarLayout(
    sidebarPanel(
      # input controls go here
    ),
    mainPanel(
      tabsetPanel(
        # tabPanel(...) entries go here
      )
    )
  )
)
Exercise A
textOutput("greeting")
})
}
Data storytelling
Data storytelling is the practice of building a
compelling narrative based on complex data
and analytics, one that helps tell your story and
influences and informs a particular audience.
WHY?
• Adding value to your data and insights.
The star of
the show: DATA
• Think about your theory. What do you want to prove or disprove? What do you think the
data will tell you?
• Collect data. Collate the data you’ll need to develop your story.
• Define the purpose of your story. Using the data you gathered, you should be able to write
what the goal of your story is in a single sentence.
• Think about what you want to say. Outline everything from the intro to the conclusion.
• Ask questions. Were you right or wrong in your hypothesis? How do these answers shape the
narrative of your data story?
• Create a goal for your audience. What actions would you like them to take after reading
your story?
Avoid Data
Decoration!
Stephen Few’s pitfalls
1. Exceeding the boundaries of a single screen
2. Supplying inadequate context for the data
3. Displaying excessive detail or precision
4. Expressing measures indirectly
5. Choosing inappropriate media of display
6. Introducing meaningless variety
7. Using poorly designed display media
8. Encoding quantitative data inaccurately
9. Arranging the data poorly
10. Ineffectively highlighting what’s important
11. Cluttering the screen with useless decoration
12. Misusing or overusing color
13. Designing an unappealing visual display
Levels of Understanding
1. Describe data sets (Descriptive Statistics /
Summary Statistics, Visualization)
New Terminology
• Feature: What we called Variable
Pearson's Correlation Coefficient
is a measure of linear correlation between two sets of data
Correlation
• Strong / Weak?
• Direction?
• Linear / Non-linear?
• Pearson's correlation
coefficient: a number
between -1 and +1.
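In R, cor() computes Pearson's correlation coefficient by default. The sketch below uses two built-in data sets and one made-up parabola to show a strong positive, a strong negative and a non-linear relation.

```r
# strong positive linear relation: taller women weigh more
r.women <- cor(women$height, women$weight)   # method = "pearson" is the default

# negative relation: heavier cars drive fewer miles per gallon
r.cars <- cor(mtcars$wt, mtcars$mpg)

# Pearson only measures linear relations: a symmetric parabola gives 0
x <- -5:5
r.parabola <- cor(x, x^2)

round(c(women = r.women, cars = r.cars, parabola = r.parabola), 2)
```

The last case is the reason to always plot the data as well: y depends completely on x, yet the coefficient reports no correlation.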
How to discover
correlations?
• A causes B
• B causes A
Correlation
Source: xkcd.com/552
How to discover
correlations?
• Make some plots! Of course …
R demo: USJudgeRatings
y = f(x)
Multivariate dependencies:
z = f(x, y)
• Gaussian:
Guessing functions
Often there is not a single right answer.
Fitting a function
Linear Regression
• We want to get a function that describes our data well,
but we know that there are some uncertainties that
cause some scatter in the data points.
co <- coef(fit)
persp(fit, …)
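A minimal sketch of a linear fit of the form y = a + b*x, using the built-in data set women as an example. The choice of data set, the plot labels and the variable names a and b are ours.

```r
# fit weight = a + b * height
fit <- lm(weight ~ height, data = women)

co <- coef(fit)
a <- co[["(Intercept)"]]   # intercept
b <- co[["height"]]        # slope

# data points with the fitted line on top
plot(women$height, women$weight,
     main = "Weight versus height of American women",
     xlab = "Height (inches)", ylab = "Weight (pounds)")
abline(a = a, b = b, col = "red")

summary(fit)$r.squared     # fraction of the variance explained by the line
```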
Non-Linear Regression
• We want to get a function that describes our data
well, but we know that there are some uncertainties
that cause some scatter in the data points.
R: Polynomial Fit
# polynomial fit
co <- coef(fit)
f <- function(x,a,b){a*exp(b*x)}
co <- coef(fit)
f <- function(x,a,b){a*log(x) + b}
co <- coef(fit)
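A sketch of a polynomial fit on simulated data. The data are made up: a known quadratic plus random scatter, so the fitted coefficients can be compared with the true values 1, 2 and 3. Using lm() with poly(..., raw = TRUE) is one common way to do this, not necessarily the one used in the demo.

```r
set.seed(42)
x <- seq(0, 10, by = 0.5)
y <- 1 + 2 * x + 3 * x^2 + rnorm(length(x), sd = 0.5)   # made-up data

# polynomial fit of degree 2; raw = TRUE gives plain coefficients for 1, x, x^2
fit <- lm(y ~ poly(x, 2, raw = TRUE))
co <- coef(fit)

# the fitted function, written out like the other model functions
f <- function(x, a, b, c) { a + b * x + c * x^2 }

plot(x, y, main = "Quadratic fit to simulated data")
curve(f(x, co[1], co[2], co[3]), add = TRUE, col = "red")

co
```

For models that are not linear in their parameters, such as a*exp(b*x), base R offers nls(), which additionally needs starting values for a and b.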
“With four parameters I can fit an elephant, and
with five I can make him wiggle his trunk.”
Residual deviance
R squared
Homework:
R² vs Pearson correlation
Import
Communicate
1. Image datasets
2. Natural Language
https://cran.r-project.org/web/packages/magick/vignettes/intro.html
NLP
Natural Language Processing
NLP: many problems
Headlines:
Source: https://www.quora.com/What-is-a-tf-idf-vector
TF-IDF
• Term Frequency (TF) is the ratio of number of
times a word occurred in a document to the total
number of words in the document.
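The definition of Term Frequency can be turned into a few lines of base R. The two documents and the helper function tf() are made up for illustration; splitting on single spaces is a deliberately naive way to find words.

```r
docs <- c(d1 = "the cat sat on the mat",
          d2 = "the dog chased the cat")      # made-up documents

# split each document into words
tokens <- strsplit(docs, " ", fixed = TRUE)

# TF = times the word occurs in the document / total words in the document
tf <- function(word, doc.tokens) {
  sum(doc.tokens == word) / length(doc.tokens)
}

tf("the", tokens$d1)   # 2 of the 6 words
tf("cat", tokens$d2)   # 1 of the 5 words
tf("dog", tokens$d1)   # the word does not occur in d1
```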
Regression
Regression in R
Residual standard error: an estimate of the standard deviation of the residuals,
in the units of the response variable. The smaller it is, the better the model fits the data.
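The residual standard error reported by summary() can be reproduced by hand, which shows what it measures. A sketch using the built-in data set mtcars; the choice of model is arbitrary.

```r
fit <- lm(mpg ~ wt, data = mtcars)

# value as reported by R
rse <- sigma(fit)

# the same value computed from the residuals:
# square root of (sum of squared residuals / residual degrees of freedom)
manual <- sqrt(sum(residuals(fit)^2) / df.residual(fit))

c(reported = rse, manual = manual)
summary(fit)   # the line "Residual standard error" shows the same number
```

Here the degrees of freedom are the 32 observations minus the 2 estimated coefficients.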
Reminder: how we
did it so far
Overweight?
With what we have learned so far we can distinguish 7 main elements of a data visualization
The basis: first three of
seven elements
• Data: the actual variables to be
plotted
• Aesthetics: visual
characteristics that represent
data, e.g. position, size, color,
shape, transparency
Source: http://www.science-craft.com/category/data-visualisation/
Three more, advanced
elements
Some Examples 1
Source: https://www.theguardian.com/news/datablog/2011/mar/08/international-womens-day-pay-gap#_
Some Examples 2
Some Examples 3
Data still visible
Scales selected
Geometry: colored dots
Some Examples 4
Statistical element added
Florence Nightingale
https://www.sciencenews.org/pictures/mathtrek/112608/nightingale.swf
Bad Practice
Typical Exam
Questions
OLD EXAM, WE DID NOT COVER ALL! AND NO ANSWER WILL
BE PROVIDED FOR THAT EXAM, THIS IS JUST ANOTHER EXAMPLE
Theory Exam
Case Study
2. Visualization Design
Since 2014 there have been earthquakes in Groningen, a region in the north of Holland where
natural gas is extracted. Initially NAM, the responsible company, denied
responsibility for the earthquakes and the collateral damage to houses. In 2014 the
Dutch government decided to put a cap on the quantity of gas that could be extracted
per year. This cap was lowered in the subsequent years, when the earthquakes did
not stop.
• 2014: Decision to limit the extraction of natural gas to 42.5 billion cubic meters and
with 80% around Loppersum (where the heaviest quakes occurred).
• January 2015: Decision to lower the cap to 39.4 billion cubic meters.