Statistical Measures 2024 (Part 2) - Word
Measures of Dispersion:
Spread about the mean: measuring the distance between the items and their common
mean.
i) the mean deviation
ii) the standard deviation
We are going to focus solely on the standard deviation for the purposes of this
module.
Central percentage spread of items: these measures have links with the median.
i) the 10 to 90 percentile range
ii) the quartile deviation
The Range:
The difference between the smallest and largest values of items in a set or distribution.
Example:
The daily numbers of books sold by two separate bookstores over twelve days were:
Bookstore 1: 3, 5, 1, 4, 5, 3, 6, 8, 6, 2, 3, 7
Bookstore 2: 2, 3, 2, 1, 4, 3, 2, 2, 1, 3, 4, 1
The range of values for Bookstore 1 is 8-1=7, and for Bookstore 2 is 4-1=3.
Thus, daily sales are more variable for Bookstore 1.
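The range calculation above can be reproduced in a few lines. This is a minimal sketch; the function name value_range is chosen here for illustration and is not part of the module notes.

```python
# Daily book sales over twelve days, copied from the example above.
bookstore_1 = [3, 5, 1, 4, 5, 3, 6, 8, 6, 2, 3, 7]
bookstore_2 = [2, 3, 2, 1, 4, 3, 2, 2, 1, 3, 4, 1]


def value_range(values):
    """Return the difference between the largest and smallest value."""
    return max(values) - min(values)


range_1 = value_range(bookstore_1)  # 8 - 1
range_2 = value_range(bookstore_2)  # 4 - 1
print(range_1, range_2)  # 7 3
```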
Standard Deviation
The standard deviation is a common measure of dispersion. The standard deviation
formula below has been adapted from the formula for a set and can be used for
both simple discrete and grouped distributions.
● The standard deviation is a measure of the average deviation from the mean value.
● It is the most common measure of dispersion. (Remember: dispersion =
spread/variability).
● It is used as a measure for comparison only when the units in the distribution are
the same and the respective means are comparable.
Formula:
1) A set: $\sigma = \sqrt{\dfrac{\sum (x - \bar{x})^2}{n}}$
2) A frequency distribution: $\sigma = \sqrt{\dfrac{\sum f (x - \bar{x})^2}{\sum f}}$
We are going to focus solely on the standard deviation of a frequency distribution for
the purposes of this module.
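As a rough illustration of the frequency-distribution calculation, the sketch below weights each squared deviation from the mean by its frequency f, divides by the total frequency, and takes the square root. It reuses the Bookstore 2 sales from the range example as sample data; the function names freq_mean and freq_std are assumptions made for this sketch, and the result is cross-checked against Python's statistics.pstdev on the raw values.

```python
import math
import statistics
from collections import Counter

# Bookstore 2 daily sales from the range example, reused as sample data.
sales = [2, 3, 2, 1, 4, 3, 2, 2, 1, 3, 4, 1]

# Frequency distribution: each value x mapped to its frequency f.
freq = Counter(sales)


def freq_mean(freq):
    """Mean of a frequency distribution: sum(f * x) / sum(f)."""
    total_f = sum(freq.values())
    return sum(f * x for x, f in freq.items()) / total_f


def freq_std(freq):
    """Population standard deviation: sqrt(sum(f * (x - mean)^2) / sum(f))."""
    total_f = sum(freq.values())
    mean = freq_mean(freq)
    squared = sum(f * (x - mean) ** 2 for x, f in freq.items())
    return math.sqrt(squared / total_f)


mean = freq_mean(freq)
sigma = freq_std(freq)

# Cross-check against the standard library's population standard deviation.
check = statistics.pstdev(sales)
print(round(mean, 1), round(sigma, 1))  # 2.3 1.0
```

Working from the frequency table gives the same answer as working from the twelve raw values, which is why the one formula serves both cases.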
The data below relates the number of successful sales made by the sales-force in a
particular quarter. Calculate the mean and standard deviation (to one decimal place).
Solution:
Mean = 1225 / 80 = 15.3 sales
Standard deviation = 6.1 sales
Number of weeks 3 7 15 20 9 4
● When two distributions and their means are compared, their variability must
also be taken into account.
● While the standard deviation is an important measure of spread, it cannot be used
as the sole basis for comparing two distributions.
● This is because it is an absolute measure of dispersion: it measures variation in
the same units as the original data. (Here "absolute" means expressed in the data's
own units rather than relative to the mean.)
● For example, if we have a standard deviation of 10 and a mean of 5, the values
vary by an amount twice as large as the mean itself. If, on the other hand, we have
a standard deviation of 10 and a mean of 5,000, the variation relative to the mean
is insignificant. Therefore, we cannot know the dispersion of a set of data until we
know how the standard deviation compares with the mean.
● A relative measure of dispersion, which compares the mean to the standard
deviation, is the coefficient of variation, which is found by dividing the standard
deviation by the mean.
Algebraically, this is:
$CV = \dfrac{\sigma}{\mu} \times 100\%$
Example
A: μ= 120, σ = 55
B: μ= 90, σ = 50
Solution:
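A: Coefficient of Variation = 55 / 120 = 45.8%
B: Coefficient of Variation = 50 / 90 = 55.6%
B therefore shows more variation relative to its mean, even though its standard
deviation is smaller than that of A.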
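The same comparison can be sketched in code. The function name coefficient_of_variation is chosen here for illustration; it divides the standard deviation by the mean and expresses the result as a percentage.

```python
def coefficient_of_variation(mean, std_dev):
    """Return the standard deviation as a percentage of the mean."""
    return std_dev / mean * 100


cv_a = coefficient_of_variation(120, 55)
cv_b = coefficient_of_variation(90, 50)
print(f"A: {cv_a:.1f}%  B: {cv_b:.1f}%")  # A: 45.8%  B: 55.6%
```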
Skewness
[Figure: two skewed frequency curves, Curve A and Curve B]
● Curve A is skewed to the right (or positively skewed) because it tails off
toward the high end of the scale.
● Curve B is skewed to the left (or negatively skewed) because it tails off toward
the low end of the scale.
Measuring Skewness
● If a distribution has positive skewness (Psk > 0), the mode is smaller than the
mean, and vice versa.
● If the distribution is symmetrical (Psk = 0), then the mean equals the mode,
and the skewness is zero.
● If a distribution has negative skewness (Psk < 0), then the mode is greater than
the mean, and vice versa.
● Dividing by the standard deviation allows distributions with different units to
be compared.
Pearson’s skew = $\dfrac{3(\text{mean} - \text{median})}{\text{standard deviation}}$
This may be useful if only the mean and median are known for a distribution.
Generally, the values of skewness are low (highly skewed distributions may have
values of ± 1, while values up to ± 3 are theoretically possible).
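A hedged sketch of a skewness calculation follows. It assumes the median-based form of Pearson's coefficient, 3 × (mean − median) / standard deviation, with the population standard deviation; the function name pearson_skew is made up for this example. The data are the two bookstore series from the range example.

```python
import statistics

bookstore_1 = [3, 5, 1, 4, 5, 3, 6, 8, 6, 2, 3, 7]
bookstore_2 = [2, 3, 2, 1, 4, 3, 2, 2, 1, 3, 4, 1]


def pearson_skew(values):
    """Median-based Pearson skewness: 3 * (mean - median) / std deviation.

    Assumes the population standard deviation (divide by n).
    """
    mean = statistics.mean(values)
    median = statistics.median(values)
    sigma = statistics.pstdev(values)
    return 3 * (mean - median) / sigma


skew_1 = pearson_skew(bookstore_1)
skew_2 = pearson_skew(bookstore_2)
print(round(skew_1, 2), round(skew_2, 2))  # -0.12 0.97
```

Bookstore 1 is almost symmetrical (slightly negative), while Bookstore 2 is noticeably positively skewed because its mean lies above its median.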
● It can be shown for normally distributed populations that about 68% of the data
lie within one standard deviation of the mean, about 95% within two standard
deviations of the mean, and about 99% within three standard deviations of the
mean (the empirical rule). This is shown in the diagram below.
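[Figure: normal curve with 34% of the data in each band within one standard
deviation of the mean, 13.5% in each band between one and two standard deviations,
and 2% in each band between two and three standard deviations]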
Example
46 58 65 70 76
49 59 66 71 78
50 59 66 71 79
53 60 66 72 80
54 62 66 73 82
55 63 68 73 83
55 64 68 73 84
57 65 69 74 88
Given that the mean and standard deviation of this data are 67 and 10 respectively, does
this data satisfy the empirical rule?
Solution:
Examining the data, we see that 26 of the numbers lie in the range 57 - 77, i.e. within
one standard deviation of the mean. These 26 numbers represent 26/40 or 65% of the
data, which is very close to the 68% expected within one standard deviation under the
empirical rule. Further calculations are shown below:
Within two standard deviations (47 - 87): 38/40 = 95% of the data.
Within three standard deviations (37 - 97): 40/40 = 100% of the data.
As the percentages for this sample are very close to those of the empirical rule, it is
reasonable to conclude that this sample comes from a normal population.
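The counting in this example can be checked with a short script. This is a sketch only: the function name share_within is an assumption, and values exactly on a boundary (such as 57) are counted as inside the range, matching the count of 26 in the solution.

```python
# The forty values from the example, with the stated mean and standard deviation.
scores = [46, 58, 65, 70, 76, 49, 59, 66, 71, 78,
          50, 59, 66, 71, 79, 53, 60, 66, 72, 80,
          54, 62, 66, 73, 82, 55, 63, 68, 73, 83,
          55, 64, 68, 73, 84, 57, 65, 69, 74, 88]
mean, sd = 67, 10


def share_within(values, mean, sd, k):
    """Percentage of values lying within k standard deviations of the mean."""
    low, high = mean - k * sd, mean + k * sd
    inside = sum(1 for v in values if low <= v <= high)
    return inside * 100 / len(values)


shares = {k: share_within(scores, mean, sd, k) for k in (1, 2, 3)}
print(shares)  # {1: 65.0, 2: 95.0, 3: 100.0}
```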
DECILES
Deciles are similar to quartiles in that they are used to divide up a cumulative
distribution. However, in this case, they break the distribution up into tenths
rather than quarters (the name shares its root with "decade", a block of ten years). Thus,
● The first decile has 10% of values below it and 90% above it
● The second decile has 20% of values below it and 80% above it, and so on.
PERCENTILES
Percentiles are the ninety-nine points of a distribution that divide it into one hundred
equal parts. They are normally denoted as P1, P2, … P99.
● Thus, the tenth percentile (P10) has ten percent of the values of the distribution
below it (and ninety percent of the values above it).
Note that the 50th percentile (P50) is the median, and the 25th percentile (P25) and 75th
percentile (P75) are equal to Q1 and Q3 respectively.
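Deciles and percentiles can be computed with Python's statistics.quantiles. The sketch below uses the forty values from the empirical rule example and assumes the "inclusive" interpolation method; other conventions exist, so hand-calculated values may differ slightly. It also checks the relationships noted above: P50 is the median, P25 equals Q1, and P10 equals the first decile.

```python
import math
import statistics

scores = [46, 58, 65, 70, 76, 49, 59, 66, 71, 78,
          50, 59, 66, 71, 79, 53, 60, 66, 72, 80,
          54, 62, 66, 73, 82, 55, 63, 68, 73, 83,
          55, 64, 68, 73, 84, 57, 65, 69, 74, 88]

# n cut points split the data into n equal parts: 3 quartiles,
# 9 deciles, 99 percentiles.
quartiles = statistics.quantiles(scores, n=4, method="inclusive")
deciles = statistics.quantiles(scores, n=10, method="inclusive")
percentiles = statistics.quantiles(scores, n=100, method="inclusive")

p10 = percentiles[9]   # tenth percentile (list index starts at 0)
p25 = percentiles[24]
p50 = percentiles[49]
p75 = percentiles[74]

print(len(deciles), len(percentiles))  # 9 99
print(math.isclose(p50, statistics.median(scores)))  # True
```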
The terms outlier and extreme value are often used interchangeably. Both refer to a
data value that is atypical of the data set, i.e. a value which differs markedly from
most of the numbers in the set.
For example, suppose that the number of championship matches played by a team in
the last five years is as follows:
2 1 10 1 1
The average (mean) of these numbers is 3, which is heavily influenced by the outlier
of 10.
Discarding the outlier, we obtain a modified mean of 1.25 which is perhaps more
meaningful for comparisons or for setting a norm.
Alternatively, we could decide that the median is a better measure of average for data
sets with outliers.
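The effect of the outlier on each measure of average can be seen in a brief sketch. The outlier is picked out by hand here (the value 10), purely for illustration; no outlier-detection rule is implied.

```python
import statistics

# Championship matches played in each of the last five years.
matches = [2, 1, 10, 1, 1]

mean_all = statistics.mean(matches)  # pulled upwards by the 10

# Drop the hand-picked outlier and recompute the mean.
outlier = 10
trimmed = [m for m in matches if m != outlier]
mean_without = statistics.mean(trimmed)

# The median is unaffected by how large the outlier is.
median_all = statistics.median(matches)

print(mean_all, mean_without, median_all)  # 3 1.25 1
```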
Data sets should always be examined for outliers, as the reasons for such values can
vary: an outlier may reflect a genuine cause, such as weather conditions, or simply a
recording error.