Exploratory Data Analysis and Data Visualization. Credits: Chris Volinsky - Columbia University
Outline
• EDA
• Visualization
– One variable
– Two variables
– More than two variables
– Other types of data
– Dimension reduction
EDA and Visualization
• Exploratory Data Analysis (EDA) and
Visualization are very important steps in any
analysis task.
Data Visualization – cake bakery
Exploratory Data Analysis (EDA)
• Goal: get a general sense of the data
– means, medians, quantiles, histograms, boxplots
• You should always look at every variable - you will learn
something!
• data-driven (model-free)
• Think interactive and visual
– Humans are the best pattern recognizers
– You can use more than 2 dimensions!
• x,y,z, space, color, time….
Summary Statistics
• not visual
• sample statistics of data X
– mean: $\mu = \sum_i X_i / n$
– mode: most common value in X
– median: X = sort(X), median = $X_{n/2}$ (half below, half above)
– quartiles of sorted X: Q1 value = $X_{0.25n}$, Q3 value = $X_{0.75n}$
• interquartile range: value(Q3) - value(Q1)
• range: max(X) - min(X) = $X_n - X_1$
– variance: $\sigma^2 = \sum_i (X_i - \mu)^2 / n$
– skewness: $\sum_i (X_i - \mu)^3 / [\sum_i (X_i - \mu)^2]^{3/2}$
• zero if symmetric; right-skewed more common (what kind of
data is right skewed?)
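The statistics listed above can be sketched in a few lines of Python. This is a minimal illustration rather than a reference implementation: the `summary` helper and the `incomes` values are hypothetical, and the quantiles use a simple "value at position ceil(q*n) of the sorted data" rule, which is only one of several conventions in use.

```python
import math
from collections import Counter

def summary(xs):
    """Sample statistics following the definitions on the slide."""
    x = sorted(xs)
    n = len(x)
    mean = sum(x) / n
    mode = Counter(x).most_common(1)[0][0]

    def q(p):
        # Order-statistic quantile: value at 1-based position ceil(p * n).
        return x[max(0, math.ceil(p * n) - 1)]

    ss2 = sum((v - mean) ** 2 for v in x)   # sum of squared deviations
    ss3 = sum((v - mean) ** 3 for v in x)   # sum of cubed deviations
    return {
        "mean": mean,
        "mode": mode,
        "median": q(0.5),
        "q1": q(0.25),
        "q3": q(0.75),
        "iqr": q(0.75) - q(0.25),
        "range": x[-1] - x[0],
        "variance": ss2 / n,
        # Ratio as written on the slide; multiplying by sqrt(n) gives the
        # usual moment coefficient of skewness. The sign is the same.
        "skewness": ss3 / ss2 ** 1.5,
    }

# Hypothetical right-skewed sample: one large value pulls the mean upward.
incomes = [20, 22, 25, 25, 30, 35, 40, 120]
stats = summary(incomes)
print(stats)
```

In this made-up sample the mean sits well above the median and the skewness is positive, which is the typical signature of right-skewed data.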
Issues with Histograms
But be careful with axes and scales!
Smoothed Histograms - Density Estimates
• Kernel estimates smooth out the
contribution of each datapoint over a local
neighborhood of that point.
$\hat{f}(x) = \frac{1}{nh} \sum_{i=1}^{n} K\left(\frac{x - x_i}{h}\right)$
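The kernel estimate can be written out directly from the formula. The Python sketch below assumes a Gaussian kernel for K (the slide does not say which kernel is meant), and the `data` values, the evaluation grid and the three bandwidths are made up for illustration.

```python
import math

def gaussian_kernel(u):
    """Standard normal density, used here as the kernel K (an assumption)."""
    return math.exp(-0.5 * u * u) / math.sqrt(2 * math.pi)

def kde(x, data, h):
    """f_hat(x) = 1/(n*h) * sum_i K((x - x_i) / h)."""
    n = len(data)
    return sum(gaussian_kernel((x - xi) / h) for xi in data) / (n * h)

data = [1.0, 1.2, 1.9, 2.1, 2.4, 5.0, 5.3]   # hypothetical sample
step = 0.1
grid = [i * step for i in range(-30, 101)]   # evaluation points from -3 to 10

# Try several bandwidths: small h gives a spiky estimate, large h a flat one.
for h in (0.2, 0.5, 1.5):
    density = [kde(x, data, h) for x in grid]
    area = sum(density) * step               # Riemann sum, should be near 1
    peak = grid[density.index(max(density))]
    print(f"h={h}: area ~ {area:.3f}, highest density near x = {peak:.1f}")
```

Plotting `density` against `grid` for each `h` is the usual way to compare the candidate bandwidths by eye.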
Bandwidth choice is an art.
Usually want to try several.
Boxplots
Time Series
If your data has a temporal component, be sure to exploit it.
(Figure annotations: steady growth trend; summer peaks; summer bifurcations in air travel (favor early/late))
Time-Series Example 3
Scotland experiment: “milk in kid diet → better health”?
20,000 kids: 5k raw, 5k pasteurized, 10k control (no supplement)
(Figure: mean weight vs mean age for 10k control group)
Would expect smooth weight growth plot.
Visually reveals unexpected pattern (steps), not apparent from raw data table.
Possible explanations:
Grow less early in year than later?
No steps in height plots; so why height uniformly, weight spurts?
Kids weighed in clothes: summer garb lighter than winter?
Spatial Data
• Data from cities/states/zip codes – easy to get lat/long
• Can plot as scatterplot
Spatial data: choropleth Maps
• Maps using color shadings to represent numerical values are called choropleth maps
• http://elections.nytimes.com/2008/results/president/map.html
Two Continuous Variables
(Figure: two plots, each annotated “interesting?”)
2D Scatterplots
(Figure: two scatterplots, each annotated “interesting?”)
Scatter Plot: No apparent relationship
Scatter Plot: Linear relationship
Scatter Plot: Quadratic relationship
Scatter plot: Homoscedastic
Scatter plot: Heteroscedastic
Two variables - continuous
• Scatterplots
– But can be hard to read with lots of data: overlapping points hide the structure
Two variables - continuous
Transparent plotting
Alpha-blending:
• plot(rnorm(1000), rnorm(1000), col="#0000ff22", pch=16, cex=3)  # last two hex digits of col set the alpha (opacity)
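To see why a low alpha helps with overplotting, it is worth working out how opacity builds up where points overlap. The Python sketch below assumes standard "over" compositing of identical layers, which is an assumption about the graphics device rather than something stated on the slide; `parse_rgba` and `stacked_opacity` are hypothetical helper names.

```python
def parse_rgba(hex_color):
    """Split '#RRGGBBAA' into (r, g, b, alpha), with alpha scaled to 0..1."""
    h = hex_color.lstrip("#")
    r, g, b, a = (int(h[i:i + 2], 16) for i in (0, 2, 4, 6))
    return r, g, b, a / 255

def stacked_opacity(alpha, k):
    """Opacity after k identical layers, assuming 'over' compositing."""
    return 1 - (1 - alpha) ** k

r, g, b, alpha = parse_rgba("#0000ff22")   # the color used in the plot() call
for k in (1, 5, 20):
    print(f"{k:2d} overlapping points -> opacity {stacked_opacity(alpha, k):.3f}")
```

A single point is faint, while a dense cluster approaches solid color, so darkness in the plot tracks local point density.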
Jittering
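Jittering adds a small random offset to each point so that repeated values no longer sit exactly on top of one another. A minimal Python sketch, assuming uniform noise; the `jitter` helper, the `amount` of 0.2 and the `ratings` data are all hypothetical.

```python
import random

def jitter(values, amount, seed=0):
    """Add uniform noise in [-amount, +amount] to each value."""
    rng = random.Random(seed)   # fixed seed so the plot is reproducible
    return [v + rng.uniform(-amount, amount) for v in values]

ratings = [1, 1, 1, 2, 2, 3, 3, 3, 3]   # hypothetical discrete data with ties
jittered = jitter(ratings, amount=0.2)
print(jittered)
```

The amount should stay well below the spacing between distinct values, so that points remain visibly attached to their original category.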
Displaying Two Variables
• If one variable is categorical, use small multiples
• Many software packages have this implemented as ‘lattice’ or ‘trellis’ packages
library(lattice)
histogram(~DiastolicBP | TimesPregnant == 0)  # one panel per value of the condition; pass data= if the columns live in a data frame
Two Variables - one categorical
Barcharts and Spineplots
Stacked barcharts can be used to compare continuous values across two or more categorical ones.
(Figure legend: orange=M, blue=F)
Spineplots show proportions well, but can be hard to interpret.
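One way to read a spineplot is as two sets of proportions: bar widths show the share of each category on the horizontal axis, and the segments within a bar show the conditional split of the second variable. The Python sketch below computes both from raw pairs; `spineplot_layout` is a hypothetical helper and the department/sex counts are invented for illustration.

```python
from collections import Counter

def spineplot_layout(pairs):
    """pairs: iterable of (x_category, y_category).

    Returns {x: (bar_width, {y: segment_height})}, all as proportions.
    """
    pairs = list(pairs)
    total = len(pairs)
    by_x = Counter(x for x, _ in pairs)   # marginal counts set the bar widths
    joint = Counter(pairs)                # joint counts set the segment heights
    layout = {}
    for x, n_x in by_x.items():
        heights = {y: c / n_x for (xx, y), c in joint.items() if xx == x}
        layout[x] = (n_x / total, heights)
    return layout

# Hypothetical data: (department, sex)
rows = [("A", "M")] * 30 + [("A", "F")] * 10 + [("B", "M")] * 5 + [("B", "F")] * 15
layout = spineplot_layout(rows)
for dept, (width, heights) in layout.items():
    print(dept, round(width, 3), heights)
```

Because both width and height encode proportions, absolute counts are no longer visible, which is part of why these plots can be hard to interpret.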
More than two variables
Pairwise scatterplots
Can be somewhat ineffective for categorical data
Multivariate: More than two variables
• Get creative!
• Conditioning on variables
– trellis or lattice plots
– Cleveland models on human perception,
all based on conditioning
– Infinite possibilities
• Earthquake data:
– locations of 1000 seismic events of MB > 4.0.
The events occurred in a cube near Fiji since
1964
– Data collected on the severity of the
earthquake
How many dimensions are represented here?
Petal, a nonreproductive part of the flower
Sepal, a nonreproductive part of the flower
The famous iris data!
Parallel Coordinates
(Figure: a single axis, Sepal Length, with the value 5.1 marked)
Parallel Coordinates: 2 D
(Figure: two parallel axes, Sepal Length and Sepal Width, with the values 5.1 and 3.5 marked)
Parallel Coordinates: 4 D
(Figure: four parallel axes with the values 5.1, 3.5, 1.4 and 0.2 marked)
Parallel Visualization of Iris data
(Figure: iris data on four parallel axes, with the values 5.1, 3.5, 1.4 and 0.2 marked)
Multivariate: Parallel coordinates
Alpha blending can be effective
Courtesy Unwin, Theus, Hofmann
Parallel coordinates
• Useful in an interactive setting
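The construction behind a parallel coordinates plot is simple: give each variable its own vertical axis, rescale each variable to a common range, and draw every observation as a polyline across the axes. A minimal Python sketch of the rescaling step follows. The `parallel_coordinates` helper is hypothetical, min-max scaling is one common choice rather than the only one, and the axis order (sepal length, sepal width, petal length, petal width) is assumed.

```python
def parallel_coordinates(rows):
    """Rescale each column to [0, 1] and return one polyline per row.

    Each polyline is a list of (axis_index, scaled_value) points.
    """
    cols = list(zip(*rows))
    lo = [min(c) for c in cols]
    hi = [max(c) for c in cols]
    lines = []
    for row in rows:
        line = []
        for j, v in enumerate(row):
            span = hi[j] - lo[j]
            # A constant column has no spread; place it mid-axis.
            line.append((j, (v - lo[j]) / span if span else 0.5))
        lines.append(line)
    return lines

# The first row uses the values traced on the slides (5.1, 3.5, 1.4, 0.2);
# the other two rows are illustrative values, not taken from the source.
flowers = [
    (5.1, 3.5, 1.4, 0.2),
    (7.0, 3.2, 4.7, 1.4),
    (6.3, 3.3, 6.0, 2.5),
]
polylines = parallel_coordinates(flowers)
for line in polylines:
    print(line)
```

Each polyline can then be drawn with any plotting tool by connecting its points left to right; in an interactive setting, reordering the axes or highlighting a subset of lines is what makes the display useful.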
Networks and Graphs
Network Visualization
• Graphviz (open source software) is a nice layout tool
for big and small graphs
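Graphviz reads graphs described in its DOT text format, so a small script that writes DOT is often all that is needed to get a layout. The Python sketch below emits an undirected graph; the `to_dot` helper and the edge list are hypothetical.

```python
def to_dot(edges, name="G"):
    """Render an undirected edge list as Graphviz DOT source text."""
    lines = [f"graph {name} {{"]
    for a, b in edges:
        # Quote node names so they may contain spaces or punctuation.
        lines.append(f'    "{a}" -- "{b}";')
    lines.append("}")
    return "\n".join(lines)

edges = [("alice", "bob"), ("bob", "carol"), ("carol", "alice"), ("carol", "dave")]
dot_source = to_dot(edges)
print(dot_source)
```

Saving the output to a file such as `graph.dot` and running `dot -Tpng graph.dot -o graph.png` should produce an image, assuming Graphviz is installed.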
What’s missing?
• pie charts
– very popular
– good for showing simple relations of proportions
– Human perception not good at comparing arcs
– barplots, histograms usually better (but less pretty)
• 3D
– nice to be able to show three dimensions
– hard to do well
– often done poorly
– 3D best shown through “spinning” in 2D
• uses various types of projecting into 2D
• http://www.stat.tamu.edu/~west/bradley/
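"Spinning" amounts to rotating the 3D point cloud a little at a time and projecting each rotated copy onto the screen plane. The Python sketch below assumes rotation about the vertical axis and an orthographic projection (simply dropping depth), which is only one of the projection types alluded to above; the points and the 30-degree step are made up.

```python
import math

def rotate_y(point, angle):
    """Rotate a 3D point about the vertical (y) axis by `angle` radians."""
    x, y, z = point
    c, s = math.cos(angle), math.sin(angle)
    return (c * x + s * z, y, -s * x + c * z)

def project(point):
    """Orthographic projection: drop the depth coordinate."""
    x, y, _ = point
    return (x, y)

cloud = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)]   # hypothetical points

# One 2D frame per rotation step; showing the frames in sequence "spins" the cloud.
frames = [
    [project(rotate_y(p, math.radians(deg))) for p in cloud]
    for deg in range(0, 360, 30)
]
print(frames[0])
print(frames[3])
```

Points that coincide in one frame separate in the next, which is how motion lets the eye recover depth from a flat display.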
Worst graphic in the world?
Dimension Reduction
– Variable selection
• e.g. stepwise
– Principal Components
• find linear projection onto p-space with maximal variance
– Multi-dimensional scaling
• takes a matrix of (dis)similarities and embeds the points in p-dimensional space to retain those similarities
More on this in next Topic
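As a small illustration of the principal components idea with p = 1, the direction of maximal variance can be found by power iteration on the covariance matrix. This Python sketch is a teaching aid under stated assumptions, not how a statistics package would do it (those typically use an eigen or singular value decomposition); `first_principal_component` and the sample points are hypothetical.

```python
import math

def first_principal_component(rows, iterations=200):
    """Unit direction of maximal variance, via power iteration."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    cov = [[sum(c[a] * c[b] for c in centered) / n for b in range(d)]
           for a in range(d)]
    # Start from the all-ones vector; this would fail only if it happened to
    # be exactly orthogonal to the leading direction.
    v = [1.0] * d
    for _ in range(iterations):
        w = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    variance = sum(v[a] * cov[a][b] * v[b] for a in range(d) for b in range(d))
    return v, variance

# Hypothetical 2-D points lying close to the line y = x
points = [(1.0, 1.1), (2.0, 1.9), (3.0, 3.2), (4.0, 3.9), (5.0, 5.1)]
direction, variance = first_principal_component(points)
print(direction, variance)
```

For these points the direction comes out close to the diagonal, and the variance along it exceeds the variance of either original coordinate, which is the sense in which the projection has maximal variance.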
Visualization done right
• http://www.youtube.com/watch?v=jbkSRLYSojo