Measures of Spread and Dispersion
• This is a continuation of our venture into
summary statistics.
• In this video we will cover:
• Variance and Standard Deviation.
• IQR
• Measures of Skew and the Concept of the
Normal Distribution.
Standard Deviation
• The most commonly used measure of spread or
dispersion is the Standard Deviation.
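• There’s a formula for this:
$$\mathrm{sd}(\text{population}) = \sqrt{\frac{\sum_i \left(x_i - \text{mean}_{\text{population}}\right)^2}{N}}$$
• The core of this formula is the summation of the squared differences, $\sum_i \left(x_i - \text{mean}_{\text{population}}\right)^2$.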
• The first step is therefore to compute the mean
of the population.
• Let’s compute $\sum_i \left(x_i - \text{mean}_{\text{population}}\right)^2$ on an example.
• Here is a list of student assessment
scores in my (small) class.
scores = [90, 75, 60, 80, 95, 50, 75]
• The mean is:
$$\text{mean} = \frac{90 + 75 + 60 + 80 + 95 + 50 + 75}{7} = \frac{525}{7} = 75$$
• Now we need to compute the rest…
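• Subtract the mean (75) from each score, then square the difference:
Score   Diff   Squared
90       15     225
75        0       0
60      -15     225
80        5      25
95       20     400
50      -25     625
75        0       0
Total          1500
$$\mathrm{sd}(\text{population}) = \sqrt{\frac{\sum_i \left(x_i - \text{mean}_{\text{population}}\right)^2}{N}} = \sqrt{\frac{1500}{7}} = 14.639$$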
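The worked example can be reproduced in a few lines of Python. This is a minimal sketch, not a definitive implementation: the function name `population_sd` is an illustrative choice, and the cross-check uses `statistics.pstdev` from the Python standard library.

```python
import math
import statistics

scores = [90, 75, 60, 80, 95, 50, 75]

def population_sd(values):
    """Population SD: square root of the mean squared difference from the mean."""
    n = len(values)
    mean = sum(values) / n
    squared_diffs = [(x - mean) ** 2 for x in values]
    return math.sqrt(sum(squared_diffs) / n)  # divide by N for a population

mean = sum(scores) / len(scores)
sum_sq = sum((x - mean) ** 2 for x in scores)
sd = population_sd(scores)

print(mean)          # 75.0
print(sum_sq)        # 1500.0
print(round(sd, 3))  # 14.639

# Cross-check against the standard library's population SD.
print(round(statistics.pstdev(scores), 3))  # 14.639
```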
Standard Deviation of a Sample
• In some cases we are computing the standard
deviation of a sample.
• The formula is only slightly different:
$$\mathrm{sd}(\text{sample}) = \sqrt{\frac{\sum_i \left(x_i - \text{mean}_{\text{sample}}\right)^2}{N - 1}}$$
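To see the effect of dividing by N − 1, here is a small sketch that treats the seven class scores as if they were a sample from a larger group. That treatment is an assumption made purely for illustration, and `sample_sd` is an illustrative name.

```python
import math
import statistics

scores = [90, 75, 60, 80, 95, 50, 75]  # treated as a sample here (assumption)

def sample_sd(values):
    """Sample SD: same sum of squared differences, divided by N - 1."""
    n = len(values)
    if n < 2:
        raise ValueError("sample SD needs at least two values")
    mean = sum(values) / n
    return math.sqrt(sum((x - mean) ** 2 for x in values) / (n - 1))

s = sample_sd(scores)
print(round(s, 3))  # 15.811, i.e. sqrt(1500 / 6)

# The standard library's stdev() also uses the N - 1 denominator.
print(round(statistics.stdev(scores), 3))  # 15.811
```

The sample value (15.811) is slightly larger than the population value (14.639) because the same sum of squares is divided by 6 rather than 7.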
Sample SD versus Population SD
• Which do I use? You need to know the difference between
samples and populations.
• Samples are subsets of populations.
• If, for example, I am interested in all the people in my class, and I
have that data for all 30 of them, then we apply the population SD.
• By contrast, if I only had a subset, say 10 of the 30 people in my class,
then that’s a sample.
• Given that we are normally looking to generalise to a wider population
(e.g. we study 100 people and see if this generalises to
populations of millions), sample SDs are more commonly used.
• Notation: In practice, 𝜇 is the population mean (and 𝜎 its
standard deviation), whilst for samples, we use 𝑥̅ for the
sample mean (and 𝑠 to represent its standard deviation).
This is important when you read formulas.
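For reference, here are the two standard deviation formulas restated in this notation. The index $i$ running over the $N$ values is assumed:

$$\sigma = \sqrt{\frac{1}{N}\sum_{i=1}^{N} (x_i - \mu)^2} \qquad\qquad s = \sqrt{\frac{1}{N-1}\sum_{i=1}^{N} (x_i - \bar{x})^2}$$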
Variance
• Sometimes you’ll read about variance being used instead
of the standard deviation.
• This is simply the square of the standard deviation.
$$\mathrm{Var}(x) = \mathrm{sd}(x)^2$$
• As with Standard Deviation, it is possible to have both a
population and sample variance.
• The key thing is that the variance is the square of the standard
deviation!
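A quick check of this relationship on the class scores, as a sketch using the standard library's `statistics` module. `pvariance`/`pstdev` are the population versions and `variance`/`stdev` are the sample versions.

```python
import math
import statistics

scores = [90, 75, 60, 80, 95, 50, 75]

pop_var = statistics.pvariance(scores)   # sum of squares / N
samp_var = statistics.variance(scores)   # sum of squares / (N - 1)

print(f"{pop_var:.3f}")   # 214.286  (1500 / 7)
print(f"{samp_var:.3f}")  # 250.000  (1500 / 6)

# In both cases the variance is the square of the matching standard deviation.
print(math.isclose(pop_var, statistics.pstdev(scores) ** 2))   # True
print(math.isclose(samp_var, statistics.stdev(scores) ** 2))   # True
```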
IQR and some other measures of spread
• Like the arithmetic mean, the standard deviation can be
distorted by outliers.
• Let’s start with the Range. If I have a set of numbers, then
the Range is the largest number minus the smallest:
$$\mathrm{range}(x) = \max(x) - \min(x)$$
• An example. Suppose I wanted to find the range of the
following numbers:
scores = [90, 75, 60, 80, 95, 50, 75]
• I would sort those numbers from lowest to highest:
sort(scores) = [50, 60, 75, 75, 80, 90, 95] (minimum 50, maximum 95)
$$\mathrm{range}(x) = \max(x) - \min(x) = 95 - 50 = 45$$
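The same calculation as a short Python sketch. The helper name `value_range` is illustrative, chosen to avoid shadowing Python's built-in `range`. Sorting is shown only to mirror the step above, since `max` and `min` do not need sorted input.

```python
scores = [90, 75, 60, 80, 95, 50, 75]

def value_range(values):
    """Range: largest value minus smallest value."""
    return max(values) - min(values)

print(sorted(scores))       # [50, 60, 75, 75, 80, 90, 95]
print(value_range(scores))  # 45
```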
• Now for the IQR, or interquartile range. This does not use the
min and max, but the top and bottom quartiles.
• More formally, the IQR is the difference between the third
(upper) and first (lower) quartiles. Given an even $N = 2n$ or
odd $N = 2n + 1$ number of values:
• The first quartile $Q_1$ is the median of the $n$ smallest values
• The third quartile $Q_3$ is the median of the $n$ largest values
• (The second quartile $Q_2$ is the median of all $N$ values)
• Let’s go back to our sorted array of scores…
sort(scores) = [50, 60, 75, 75, 80, 90, 95]
• Here $N = \mathrm{length}(\text{scores}) = 7$, which is odd, so $2n + 1 = 7 \implies n = 3$.
• The lowest $n$ numbers are [50, 60, 75], so $Q_1 = 60$. The highest $n$ numbers are [80, 90, 95], so $Q_3 = 90$. The middle value is $Q_2 = 75$.
$$\mathrm{IQR} = Q_3 - Q_1 = 90 - 60 = 30$$
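The quartile rule described here (median of the n smallest and the n largest values, leaving out the middle value when N is odd) can be sketched as follows. The function names are illustrative. Note that library quantile functions may use other conventions and can return different quartiles for the same data, so this sketch implements the rule by hand.

```python
import statistics

def quartiles(values):
    """Return (Q1, Q2, Q3) using the 'median of each half' rule."""
    ordered = sorted(values)
    if len(ordered) < 2:
        raise ValueError("need at least two values")
    n = len(ordered) // 2            # N = 2n (even) or N = 2n + 1 (odd)
    lower = ordered[:n]              # the n smallest values
    upper = ordered[-n:]             # the n largest values
    return (statistics.median(lower),
            statistics.median(ordered),
            statistics.median(upper))

def iqr(values):
    """Interquartile range: Q3 - Q1."""
    q1, _, q3 = quartiles(values)
    return q3 - q1

scores = [90, 75, 60, 80, 95, 50, 75]
print(quartiles(scores))  # (60, 75, 90)
print(iqr(scores))        # 30
```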
Skew and Normality
• Skew is about the shape of a distribution of numbers.
• We can plot a histogram of the data, which illustrates how the data is
distributed.
• We can describe the level of skew using descriptive statistics.
• One example is Pearson’s first skewness coefficient:
$$sk_1 = \frac{\text{mean} - \text{mode}}{\text{standard deviation}}$$
• Another example is Pearson’s second skewness coefficient:
$$sk_2 = \frac{3(\text{mean} - \text{median})}{\text{standard deviation}}$$
• The first one is more commonly used, unless you do not know
the mode (or it is not stable due to a small sample size).
• A value of zero means no skew.
• Higher magnitude (i.e. absolute value) means increased skew.
• The sign indicates the direction of the skew: positive when the mean lies above the mode (or median), negative when it lies below.
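Both coefficients can be computed with the standard library, as sketched below. Two assumptions are made for illustration: the population standard deviation (`statistics.pstdev`) is used as the denominator, and the list `skewed` is hypothetical data invented to show a non-zero result.

```python
import statistics

def pearson_first_skew(values):
    """(mean - mode) / standard deviation."""
    return (statistics.mean(values) - statistics.mode(values)) / statistics.pstdev(values)

def pearson_second_skew(values):
    """3 * (mean - median) / standard deviation."""
    return 3 * (statistics.mean(values) - statistics.median(values)) / statistics.pstdev(values)

scores = [90, 75, 60, 80, 95, 50, 75]
# Mean, median and mode are all 75 here, so both coefficients are zero.
print(pearson_first_skew(scores))   # 0.0
print(pearson_second_skew(scores))  # 0.0

skewed = [1, 2, 2, 3, 10]  # hypothetical data with one large value
sk1 = pearson_first_skew(skewed)
sk2 = pearson_second_skew(skewed)
print(sk1 > 0, sk2 > 0)    # True True (mean 3.6 is above mode and median, both 2)
```

For the class scores the result is zero, matching the "no skew" case. The hypothetical list gives positive values because its single large value pulls the mean above the median and mode.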
[Figure: three examples with different degrees of skew]