Lecture 06-Describing Data Visual Information
Visual Information
Summarizing Data
Describing A Dataset
When describing a dataset, we generally consider the following three questions:
What is the general shape of the data?
Where are the data values centered?
How do the data vary?
These are all aspects of what we call the distribution of the data.
We use some simple arithmetic, which depends (a little) on what the data distribution represents,
to describe these aspects of the data.
Shape
General Shape Of A Distribution
A symmetric distribution is one in which the left and right hand sides of the distribution are
roughly equally balanced.
A skewed (asymmetric) distribution is one in which there is no such equal balance. Right-skewness refers to a longer right tail, while left-skewness corresponds to a longer left tail.
A uniform distribution is a specific symmetric distribution in which all outcomes are equally
likely.
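One way to put a number on shape is a moment-based skewness coefficient. The sketch below is only illustrative: the two samples are made up, and the coefficient used (mean cubed deviation divided by the cubed population standard deviation) is one common convention, assumed here rather than taken from the lecture.

```matlab
% Made-up samples for illustration only (not lecture data).
symmetric   = [1 2 3 4 5 6 7];
rightSkewed = [1 1 2 2 3 4 15];

% Moment-based skewness: mean cubed deviation / (population SD)^3.
% std(x, 1) normalizes by n instead of n - 1.
skew = @(x) mean((x - mean(x)).^3) / std(x, 1)^3;

fprintf('symmetric:    mean = %.2f, median = %.2f, skewness = %.2f\n', ...
        mean(symmetric), median(symmetric), skew(symmetric));
fprintf('right-skewed: mean = %.2f, median = %.2f, skewness = %.2f\n', ...
        mean(rightSkewed), median(rightSkewed), skew(rightSkewed));
```

A long right tail pulls the mean above the median and makes the coefficient positive; a symmetric sample gives a coefficient of zero.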
Example: Distributions
[Figures: Internet access in the world; time between Old Faithful geyser eruptions]
Central Tendency
The Mean
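The arithmetic mean is generally calculated as:
$\bar{x} = \frac{1}{N} \sum_{i=1}^{N} x_i$
We can generalize for unequal weights. Let $w(x|a) = \sum_i \mathbb{1}[x_i = a]$ be an indicator (count) of value $a$; then
$\bar{x} = \frac{1}{N} \sum_a a \, w(x|a)$
The Median
Let $n$ be the number of observations, sorted in increasing order. The median is the $\left(\frac{n+1}{2}\right)$th observation (for even $n$, the average of the two middle observations).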
The Mode
The mode is the value that occurs most often in the dataset. If no value in the dataset is repeated, then there
is no mode.
What is the mode in the dataset below?
4 5 9 5 11 7 5 3 7 8 6 5 12
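As a quick check, the three measures of central tendency can be computed for this dataset with MATLAB's built-in `mean`, `median`, and `mode`. This is a minimal sketch; the variable names are illustrative.

```matlab
% Dataset from the slide.
data = [4 5 9 5 11 7 5 3 7 8 6 5 12];

dataMean   = mean(data);     % arithmetic mean: sum of values / number of values
dataMedian = median(data);   % middle value of the sorted data
dataMode   = mode(data);     % most frequent value

fprintf('mean = %.4f, median = %g, mode = %g\n', dataMean, dataMedian, dataMode);
```

Note that `mode` always returns a value: with ties it returns the smallest of the tied values, and it does so even when no value is repeated, so the "no mode" case has to be checked separately.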
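The sample variance is:
$s^2 = \frac{(x_1 - \bar{x})^2 + (x_2 - \bar{x})^2 + \dots + (x_n - \bar{x})^2}{n - 1}$ (denoted $\sigma^2$ for a population)
$s^2 = \frac{\text{sum of observed squared distance from sample mean}}{\text{number of observations} - 1}$
For standard deviation, we use: $s = \sqrt{s^2}$ and $\sigma = \sqrt{\sigma^2}$.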
The standard deviation puts the variance into the same units as the data, providing a measure of the typical distance of the observations from the center.
Note: The use of 𝑛 − 1 in calculating variance helps to ensure that our estimate of the population variance is
unbiased and accounts for the extra uncertainty introduced by estimating the population mean from the
sample itself.
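The effect of dividing by 𝑛 − 1 rather than 𝑛 can be seen on a small example. The values below are made up for illustration; `var(x)` and `var(x, 1)` are MATLAB's built-ins for the two normalizations.

```matlab
% Illustrative values (not lecture data).
x = [2 4 4 4 5 5 7 9];
n = numel(x);

% Sum of squared distances from the sample mean.
ss = sum((x - mean(x)).^2);

sampleVar = ss / (n - 1);     % s^2: divides by n - 1
popVar    = ss / n;           % divides by n (treats x as the whole population)
sampleSd  = sqrt(sampleVar);  % s: back in the units of the data

% The built-ins agree: var(x) uses n - 1, var(x, 1) uses n.
fprintf('s^2 = %.4f (var: %.4f)\n', sampleVar, var(x));
fprintf('divide by n: %.4f (var(x,1): %.4f)\n', popVar, var(x, 1));
fprintf('s = %.4f\n', sampleSd);
```

The 𝑛 − 1 version is always slightly larger, and the gap shrinks as 𝑛 grows.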
Bias
Sampling bias: when a sampling method
systematically yields results that are either too
high or too low.
The normal distribution has density:
$f(x) = \frac{1}{\sigma\sqrt{2\pi}} \, e^{-\frac{(x - \mu)^2}{2\sigma^2}}$
$\mu$ = Central Tendency ($E[X]$)
$\sigma$ = Spread ($\sqrt{E[(X - E[X])^2]}$)
$x$ = Specific value of the continuous variable
68% of the observations lie within 1 standard "distance" of the center
95% lie within 1.96 standard "distance" of the center
99% lie within 2.58 standard "distance" of the center
Often denoted N(𝜇, 𝜎²), the normal distribution is special because of the Central Limit Theorem: properly standardized sums of many independent, identically distributed variables with finite variance converge to this form, regardless of their initial distribution.
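The 68% / 95% / 99% figures can be verified numerically. For a normal variable, the probability of falling within 𝑘 standard deviations of the mean is erf(𝑘/√2); the sketch below just evaluates that identity with MATLAB's built-in `erf`.

```matlab
% Number of standard deviations from the center.
k = [1 1.96 2.58];

% P(|X - mu| < k * sigma) for a normal distribution.
coverage = erf(k / sqrt(2));

for i = 1:numel(k)
    fprintf('within %.2f sigma: %.4f\n', k(i), coverage(i));
end
```

The result does not depend on 𝜇 or 𝜎, which is why the rule can be stated in units of standard "distance".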
Let's Get MAD
MAD(𝑥) = median(|𝑥𝑖 − median(𝑥)|)
The median absolute deviation (MAD) is a measure of the variability of a dataset. It is calculated
by taking the median of the absolute differences between each data point and the median of the
dataset.
MAD is a robust measure of variability, meaning it is less affected by outliers than other measures of variability such as the variance or the standard deviation.
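The robustness of MAD is easy to see by adding one extreme value to a small sample. The data below are made up, and MAD is computed directly from the formula above rather than with a toolbox function.

```matlab
% Illustrative values (not lecture data).
clean       = [2 3 3 4 5 5 6];
withOutlier = [clean 100];      % same data plus one extreme value

% MAD(x) = median(|x_i - median(x)|), written out from the definition.
madFun = @(x) median(abs(x - median(x)));

fprintf('clean:        MAD = %.2f, SD = %.2f\n', madFun(clean), std(clean));
fprintf('with outlier: MAD = %.2f, SD = %.2f\n', madFun(withOutlier), std(withOutlier));
```

A single outlier barely moves the MAD, while the standard deviation grows by more than an order of magnitude.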
Coefficient of Variation (CV)
Introduced by Karl Pearson to compare relative variability
of different datasets, in an attempt to mitigate confusion
in interpreting standard deviation.
Mathematically, it is defined as the ratio of the standard deviation to the mean: $CV = \frac{\sigma}{\mu}$
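Taking the coefficient of variation as the standard deviation divided by the mean, a short sketch shows why it helps when comparing datasets: it does not change with the unit of measurement. The heights below are made up, and the sample standard deviation is used as the estimate.

```matlab
% Illustrative heights (not lecture data), in two different units.
heightsCm = [150 160 170 180 190];
heightsM  = heightsCm / 100;

% CV = standard deviation / mean (sample SD used here).
cv = @(x) std(x) / mean(x);

fprintf('SD: %.4f cm vs %.4f m\n', std(heightsCm), std(heightsM));
fprintf('CV: %.4f vs %.4f\n', cv(heightsCm), cv(heightsM));
```

The standard deviations differ by a factor of 100, but the CV is the same in both units. The CV is only meaningful for data with a mean well away from zero.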
$m = \frac{p + 1}{3} = \frac{0.50 + 1}{3} = \frac{1.5}{3} = 0.5$
$\text{median} = \frac{0.411 + 0.95}{2} = 0.6805$
Interquartile Range
The median divides the data into two equal halves (it is the 50th percentile). If we divide each of those halves again, we obtain two additional statistics known as the first (Q1) and third (Q3) quartiles, which are the 25th and 75th percentiles.
Interquartile range: IQR = Q3 − Q1
A value is considered an outlier if it is:
Smaller than Q1 − 1.5 × IQR
or
Larger than Q3 + 1.5 × IQR
MATLAB Code
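% Assumption: 'method', 8 refers to the Hyndman-Fan type 8 quantile estimator,
% which MATLAB's quantile does not expose, so it is computed by interpolation.
data = sort([-0.977, -0.151, -0.103, 0.4, 0.411, 0.95, 0.979, 1.764, 1.868, 2.241]);
n = numel(data);
% Type 8: interpolate at position h = n*p + (p + 1)/3 in the sorted data
q8 = @(p) interp1(1:n, data, min(max(n*p + (p + 1)/3, 1), n));
Q1 = q8(0.25);
Q3 = q8(0.75);
IQR = Q3 - Q1;
% Keep the values that fall outside the 1.5 x IQR fences
outliers = data(data < Q1 - 1.5 * IQR | data > Q3 + 1.5 * IQR);
disp(outliers);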
Outliers
An outlier is an observed value that is notably distinct from the other
values in a dataset. Usually, an outlier is much larger or much smaller than
the rest of the data values.
Displaying the data: Boxplot
A boxplot is a graphical display of the five-number summary for a quantitative variable. It shows the general
shape of the distribution, identifies the middle 50% of the data, and highlights any outliers.
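A boxplot includes:
a box spanning the first quartile (Q1) to the third quartile (Q3)
a line inside the box at the median
whiskers extending to the most extreme values that are not outliers
individual points for any outliers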
Release The Datasaurus!
X Mean: 54.26
Y Mean: 47.83
X SD: 16.76
Y SD: 26.93
Correlation: -0.06
"Same Stats, Different Graphs: Generating Datasets with Varied Appearance and Identical Statistics through Simulated Annealing," Justin Matejka and George Fitzmaurice, ACM SIGCHI Conference on Human Factors in Computing Systems (2017)
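The same point can be made with a toy example: two tiny samples with identical mean and standard deviation but clearly different shapes. This is only an illustration with hand-picked values; it is not the Datasaurus data and does not use the simulated annealing method of the paper.

```matlab
% Two hand-picked samples (illustrative only).
a = [-2 -1 0 1 2];                 % evenly spread values
b = [-sqrt(5) 0 0 0 sqrt(5)];      % most values at the center, two far out

fprintf('a: mean = %.4f, SD = %.4f\n', mean(a), std(a));
fprintf('b: mean = %.4f, SD = %.4f\n', mean(b), std(b));

% Same summary statistics, different data.
disp(sort(a));
disp(sort(b));
```

Summary statistics alone cannot tell these two samples apart; a plot of the raw values can.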
Visualizing The Distribution
A dotplot is a common way to visualize the shape of a moderately sized dataset.
Species       Longevity  Species     Longevity  Species       Longevity  Species  Longevity  Species     Longevity
Baboon        20         Chimpanzee  20         Fox           7          Leopard  12         Rabbit      5
Black bear    18         Chipmunk    6          Giraffe       10         Lion     15         Rhinoceros  15
Grizzly bear  25         Cow         15         Goat          8          Monkey   15         Sea lion    12
Polar bear    20         Deer        8          Gorilla       20         Moose    12         Sheep       12
Beaver        5          Dog         12         Guinea Pig    4          Mouse    3          Squirrel    10
Buffalo       15         Donkey      12         Hippopotamus  25         Opossum  1          Tiger       16
Camel         12         Elephant    40         Horse         20         Pig      10         Wolf        5
Cat           12         Elk         15         Kangaroo      7          Puma     12         Zebra       15
Note: For this particular dataset, values are integers and can be easily stacked.
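Because the longevity values are integers, a dotplot amounts to stacking one mark per observation above each distinct value. The sketch below prints a text version (rotated, one row per value) using the 40 values from the table; the layout is just one possible rendering.

```matlab
% Longevity values from the table, read column by column.
longevity = [20 18 25 20 5 15 12 12, ...
             20 6 15 8 12 12 40 15, ...
             7 10 8 20 4 25 20 7, ...
             12 15 15 12 3 1 10 12, ...
             5 15 12 12 10 16 5 15];

% One row per distinct value, one '*' per observation with that value.
values = unique(longevity);
for v = values
    count = sum(longevity == v);
    fprintf('%2d | %s\n', v, repmat('*', 1, count));
end
```

The longest rows show where the data concentrate (12 and 15), and the isolated row at 40 stands out immediately.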
From Dotplot To Histogram
A dotplot becomes hard to construct and read when many values are similar and the dots overlap; in that case it can be replaced by a histogram. Histograms aggregate similar values into bins and display the count in each bin, which shows the shape of the distribution.
Process to construct a histogram
1) Define "boundaries" (they form bins), e.g. 45, 50, 55, ..., 100, 105
2) Count the number of elements inside each bin
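The two steps can be sketched directly in code. The measurements below are hypothetical, and the bins are taken as half-open intervals [lo, hi), which is an assumption about how boundary values are assigned.

```matlab
% Hypothetical measurements (not lecture data), all inside 45..105.
x = [47 52 53 58 61 61 64 66 70 72 73 78 81 88 95 104];

% Step 1: define the boundaries; consecutive pairs form the bins [lo, hi).
edges = 45:5:105;
nBins = numel(edges) - 1;

% Step 2: count the number of elements inside each bin.
counts = zeros(1, nBins);
for k = 1:nBins
    counts(k) = sum(x >= edges(k) & x < edges(k + 1));
end

% Print one row per bin.
for k = 1:nBins
    fprintf('[%3d, %3d): %d\n', edges(k), edges(k + 1), counts(k));
end
```

Every observation lands in exactly one bin, so the counts add up to the number of observations.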
Histogram Characteristics
Histograms can be sensitive to parameter choices! In particular, the bin width and the bin offset can change the way they look.
[Figure: histogram (Count vs. value, axis from 40 to 110) drawn with bin width 5 and bin offset 0]
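A small sketch makes the sensitivity concrete: the same values, binned with the same width but two different offsets, give histograms with different shapes. The data and the helper name `binIndex` are made up for illustration.

```matlab
% Hypothetical values (not lecture data).
x = [48 49 51 52 58 59 61 62];
width = 10;

% Index of the bin each value falls into, for a given bin offset.
binIndex = @(x, offset) floor((x - offset) / width);

% Offset 0: bins [40,50), [50,60), [60,70).
idx0 = binIndex(x, 0);
counts0 = accumarray((idx0 - min(idx0) + 1)', 1)';

% Offset 5: bins [45,55), [55,65).
idx5 = binIndex(x, 5);
counts5 = accumarray((idx5 - min(idx5) + 1)', 1)';

disp(counts0);   % peak in the middle bin
disp(counts5);   % two equal bins
```

With offset 0 the data look unimodal; with offset 5 they look flat. Nothing about the data changed, only where the bin boundaries fall.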
Bar graphs are evil
1) Part of the range covered by the bar might have never been observed in the sample
2) They conceal the variance and the underlying distribution of the data
Plunger plots only: who would know that the values were skewed [...] and that
the common statistical tests would be inappropriate?
"For better characterization of a sample, we prefer box, swarm, or violin plots for their ability to show the distribution of the data."
You've been warned before!
A Better Option: Dotplot
If the number of data points is relatively small, it is best to show the raw data directly, accompanied by the mean or median.
A Better Option: Beeswarm
A Beeswarm is a dot plot that shows the distribution of data points in a way that avoids overlap.
[Figure: 10 random points (parameters 50 and 5) displayed three ways, as a dotplot, with jitter, and as a beeswarm; axis from 45 to 54]
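Jitter is the simplest of these to sketch: each point keeps its value but receives a small random sideways offset so that nearby points no longer sit on top of each other. The sketch assumes the 10 points are drawn from a normal distribution with mean 50 and standard deviation 5, and the jitter width of 0.2 is an arbitrary choice.

```matlab
n = 10;

% Assumed generator: normal with mean 50 and SD 5.
y = 50 + 5 * randn(1, n);

% Plain dotplot: every point at the same position on the category axis.
xDot = zeros(1, n);

% Jitter: random sideways offset, uniform in [-jitterWidth/2, jitterWidth/2].
jitterWidth = 0.2;
xJitter = xDot + jitterWidth * (rand(1, n) - 0.5);

% To draw: plot(xJitter, y, 'o'); xlim([-1 1]);
disp([xJitter; y]');
```

A beeswarm replaces the random offsets with computed ones, moving each point sideways only as far as needed to avoid its neighbors, so the outline of the swarm reflects the density of the data.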
A Better Option: Boxplot
A boxplot is a graphical display of the five-number summary for a quantitative variable. It shows the general
shape of the distribution, identifies the middle 50% of the data, and highlights any outliers.
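The five-number summary behind a boxplot (minimum, Q1, median, Q3, maximum) can be computed in a few lines. This sketch reuses the ten values from the MATLAB example earlier in the lecture and assumes the Hyndman-Fan type 8 quantile rule, with position 𝑛𝑝 + (𝑝 + 1)/3 in the sorted data; other quantile rules give slightly different quartiles.

```matlab
data = sort([-0.977, -0.151, -0.103, 0.4, 0.411, 0.95, 0.979, 1.764, 1.868, 2.241]);
n = numel(data);

% Assumed quantile rule (type 8): interpolate at position n*p + (p + 1)/3,
% clamped to the valid index range 1..n.
q8 = @(p) interp1(1:n, data, min(max(n*p + (p + 1)/3, 1), n));

% Minimum, first quartile, median, third quartile, maximum.
fiveNum = [data(1), q8(0.25), q8(0.5), q8(0.75), data(end)];

fprintf('min = %.4f, Q1 = %.4f, median = %.4f, Q3 = %.4f, max = %.4f\n', fiveNum);
```

The box of the boxplot runs from Q1 to Q3, so its length is the IQR.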
Showing The Data Is Best
"Same Stats, Different Graphs: Generating Datasets with Varied Appearance and Identical Statistics through Simulated Annealing"