STAE Lecture Notes - LU3
LEARNING OBJECTIVES
• Understand the concepts of and calculate the mean, median, mode and percentiles
• Understand the concepts of and calculate the range, interquartile range, standard deviation, variance
and coefficient of variation
• Choose the appropriate measures of central tendency and variability for any given variable
Textbook reference:
3.1. Introduction
Descriptive statistics are numerical summary measures used to describe the data collected from a sample in
terms of central tendency, variability, skewness and kurtosis. These measures are used in most statistical
analyses. In this course, measures of central tendency and variability are calculated using raw and frequency
data, skewness is only evaluated visually, and kurtosis is not assessed.
3.1.1. Mode
For raw data and ungrouped frequency data the mode is the value of the variable that occurs most frequently.
A variable can have one, two, more than two, or no mode.
• Unimodal = one mode
• Bimodal = two modes
• Multimodal = more than two modes
For grouped frequency data it is not possible to identify the value occurring most frequently, since the data
were grouped into class intervals and information was lost. For such data the class with the highest
frequency is the modal class, and the mode is generally estimated using the midpoint of the modal class.
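As a minimal sketch of the definitions above, the following Python snippet finds the mode(s) of raw data with the standard library and estimates the mode of grouped data as the midpoint of the modal class. The data values, the class intervals and the helper name `grouped_mode` are made up for illustration and do not come from the course material.

```python
from statistics import multimode

# Hypothetical raw data: 2 and 5 each occur three times, so the data are bimodal.
raw = [2, 5, 2, 7, 5, 2, 5, 9]
modes = multimode(raw)  # all values tied for the highest frequency


def grouped_mode(classes):
    """Estimate the mode as the midpoint of the modal class.

    `classes` is a list of (lower, upper, frequency) tuples.
    If several classes tie for the highest frequency, the first one is used.
    """
    lower, upper, _ = max(classes, key=lambda c: c[2])
    return (lower + upper) / 2


# Hypothetical grouped frequency table: the modal class is (10, 20).
table = [(0, 10, 4), (10, 20, 9), (20, 30, 5)]

print(modes)                # [2, 5]
print(grouped_mode(table))  # 15.0
```

Note that `multimode` returns every value when all values occur equally often, whereas these notes would describe such a variable as having no mode.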
3.1.2. Median
The median is the value of the variable in the middle of the ordered set of data values. Therefore at most
50% of observations are below the median value, and at most 50% of observations are above the median
value.
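• If n is odd, the median position, (n + 1)/2, is a whole number
o The median value is the value in the median position of the ordered data
o For example, for the ordered observations: 3 4 6 9 13
Since n = 5, the median position = $\frac{5 + 1}{2} = \frac{6}{2} = 3$. The value in position 3 of the ordered data is 6, i.e.,
median = 6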
• If n is even, the median position value will be a fraction
o The median value is the average of the two variable values on either side of the median position in
the ordered data
o For example, for the ordered observations: 3 4 6 9
Since n = 4, the median position = $\frac{4 + 1}{2} = \frac{5}{2} = 2.5$. The value in position 2 of the ordered data is 4,
and the value in position 3 of the ordered data is 6, i.e., median = $\frac{4 + 6}{2} = \frac{10}{2} = 5$
For ungrouped frequency tables the median is calculated using cumulative frequencies. For grouped
frequency tables the median is estimated using cumulative frequencies and an interpolation formula.
However, this is beyond the scope of this course.
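The raw-data median rule can be sketched in Python as follows. The function name `median_raw` is an illustrative choice, and the two calls reuse the ordered observations from the examples above.

```python
def median_raw(values):
    """Median of raw data via the position (n + 1) / 2 in the ordered data."""
    if not values:
        raise ValueError("the median of empty data is undefined")
    ordered = sorted(values)
    n = len(ordered)
    position = (n + 1) / 2   # 1-based median position
    k = int(position)        # whole-number part of the position
    if position == k:        # n is odd: the position is a whole number
        return ordered[k - 1]
    # n is even: average the values on either side of the position
    return (ordered[k - 1] + ordered[k]) / 2


print(median_raw([3, 4, 6, 9, 13]))  # 6
print(median_raw([3, 4, 6, 9]))      # 5.0
```

The subtraction of 1 in `ordered[k - 1]` is needed only because Python lists are indexed from 0, while positions in the notes start at 1.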
3.1.3. Mean
The mean of a variable is also referred to as the arithmetic mean or the average. For raw data the mean is
calculated by adding all the values of the variable together and dividing by the total number of observations.
For a random variable X the population mean is denoted by the Greek letter µ (mu):
$\mu = \frac{1}{N}\sum x$
For an ungrouped frequency table, the mean is calculated using a formula based on the values of the variable
and the frequency of occurrence. For a grouped frequency table, the mean is estimated using a formula based
on the midpoint of the class intervals and the frequency of occurrence. For the purpose of this course, it is
sufficient to calculate/estimate the mean from frequency tables using the calculator.
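For readers who want to see what the calculator is doing, here is a hedged sketch in Python. It assumes the usual frequency-weighted form, mean = Σ(f·x) / Σf, which the notes do not state explicitly, and it uses made-up tables. For grouped data the class midpoints stand in for the values, so the result is only an estimate.

```python
def mean_from_frequencies(values, freqs):
    """Mean of a frequency table: sum(f * x) / sum(f)."""
    n = sum(freqs)  # total number of observations
    return sum(f * x for x, f in zip(values, freqs)) / n


# Hypothetical ungrouped table: the values 1, 2, 3 occur 2, 3 and 1 times.
values = [1, 2, 3]
freqs = [2, 3, 1]
ungrouped_mean = mean_from_frequencies(values, freqs)

# Hypothetical grouped table: replace each class by its midpoint.
classes = [(0, 10), (10, 20), (20, 30)]
class_freqs = [4, 9, 5]
midpoints = [(lower + upper) / 2 for lower, upper in classes]
grouped_mean = mean_from_frequencies(midpoints, class_freqs)

print(round(ungrouped_mean, 2))  # 1.83
print(round(grouped_mean, 2))    # 15.56
```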
3.1.4. Concluding notes
| Measure | Advantages | Disadvantages |
|---|---|---|
| Mode | Valid for categorical and numerical data; not affected by outliers | More than one mode can exist |
| Median | Not affected by outliers; best measure to use for skewed data | Only appropriate for ordinal and numerical data |
| Mean | Calculated using every value in the dataset, i.e. very accurate; best measure to use for symmetrical data | Affected by extreme values; only appropriate for numerical data |
3.3. Measures Of Relative Standing
Measures of relative standing show where particular values stand relative to the whole distribution of the
variable. Relative standing is measured through percentiles. Percentiles are points which partition an ordered
dataset into a hundred parts. The rth percentile, Pr, is the value of the variable that separates the lowest r%
of the distribution from the remaining (100 – r)% of the distribution. The formal interpretation of Pr is: at
most r% is less than Pr and at most (100 – r)% is more than Pr.
For example, if 10% of students scored at least 80 on a test, then a student who scored 82 performed in the
top 10% of the distribution. The value “80” is the minimum value obtained by the top 10% of the distribution
and is therefore the 90th percentile, i.e. P90 = 80, as it separates the lowest 90% from the remaining 10% of
the distribution. Thus, at most 90% of students scored less than 80 and at most 10% of students scored more
than 80. Recall the interpretation of the median, namely at most 50% of observations are below the median
value and at most 50% of observations are above the median value. The median of a distribution is the 50th
percentile value, i.e. P50 = median.
Other commonly used percentiles are deciles, which divide the distribution into ten equal parts (D1, D2, …,
D10) and quartiles, which divide the distribution into four equal parts (Q1, Q2, Q3, Q4). Both deciles and
quartiles can be expressed in terms of percentiles. For example, D5 = Q2 = P50 = median. For raw data any
percentile value is obtained by first sorting the data from lowest to highest, locating the percentile position
and then using a formula to calculate the percentile value. Percentile calculations from frequency data are
beyond the scope of this course.
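• $P_r = x_{(k)} + d\,\left(x_{(k+1)} - x_{(k)}\right)$
o Where $x_{(k)}$ is the value in position k of the ordered dataset, k is the whole-number part of the percentile position and d is its decimal part
o $P_{20} = x_{(2)} + 0.6\,\left(x_{(3)} - x_{(2)}\right) = 5 + 0.6\,(8 - 5) = 6.8$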
o At most 20% of observations are less than 6.8 and at most 80% of observations are greater than 6.8
• Q3 = P75
o position = $\frac{75}{100}(12 + 1) = 9.75$. Therefore k = 9 and d = 0.75
o The value in position 9 (k) is 15 and the value in position 10 (k + 1) is 17
o $P_{75} = x_{(9)} + 0.75\,\left(x_{(10)} - x_{(9)}\right) = 15 + 0.75\,(17 - 15) = 16.5$
o At most 75% of observations are less than 16.5 and at most 25% of observations are greater than
16.5
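The percentile steps above (sort, locate the position, interpolate) can be sketched in Python as follows. The position is taken as (r/100)(n + 1), as in the Q3 calculation. Two things are assumptions rather than course material: the 12-value dataset, which is invented so that positions 2, 3, 9 and 10 hold 5, 8, 15 and 17, and the handling of positions that fall outside the data, which are clamped here to the smallest or largest value.

```python
def percentile(values, r):
    """r-th percentile of raw data using the position (r / 100) * (n + 1)."""
    ordered = sorted(values)
    n = len(ordered)
    position = r / 100 * (n + 1)
    k = int(position)   # whole-number part of the position
    d = position - k    # decimal part of the position
    if k < 1:           # assumption: position before the first observation
        return ordered[0]
    if k >= n:          # assumption: position at or beyond the last observation
        return ordered[-1]
    # x_(k) is ordered[k - 1] because Python lists are indexed from 0
    return ordered[k - 1] + d * (ordered[k] - ordered[k - 1])


# Hypothetical ordered dataset with n = 12.
data = [2, 5, 8, 9, 10, 11, 12, 14, 15, 17, 20, 22]

print(round(percentile(data, 20), 2))   # 6.8
print(percentile(data, 75))             # 16.5
print(percentile([3, 4, 6, 9, 13], 50)) # 6.0, the median of the earlier example
```

Statistical software often uses a different position rule by default, so results from other tools may differ slightly from this method.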
3.4. Measures Of Variability
Measures of variability (or spread or dispersion) describe the extent to which data are spread around their
central tendency and across the scale. The commonly used measures of variability are range, interquartile
range, variance, standard deviation and coefficient of variation.
3.3.1. Range
The range is an approximate measure of variability and shows how much of the scale is utilised. For raw
data and ungrouped frequency data the range is the difference between the maximum and the minimum
values of a variable. For grouped frequency data the range is the difference between the upper limit of the
last class interval and the lower limit of the first class interval.
Range = maximum – minimum
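A short Python sketch of both versions of the range follows. The function names and the class intervals are illustrative assumptions; the raw data reuse the observations 3 4 6 9 13 from the median example.

```python
def range_raw(values):
    """Range of raw or ungrouped data: maximum minus minimum."""
    return max(values) - min(values)


def range_grouped(classes):
    """Range of grouped data.

    `classes` is a list of (lower, upper) class limits in increasing order;
    the range is the upper limit of the last class minus the lower limit
    of the first class.
    """
    return classes[-1][1] - classes[0][0]


print(range_raw([3, 4, 6, 9, 13]))                   # 10
print(range_grouped([(0, 10), (10, 20), (20, 30)]))  # 30
```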
3.3.4. Variance
To solve the problem encountered with the average deviation measure, differences are considered as
distances which must always be positive. There are two ways in which negative values can be removed:
either take the absolute value (i.e., remove the sign), or square the value. The variance is the average squared
deviation around the mean. It is the most commonly used measure of variability in statistics. The larger the
value of the variance the more the data values vary around the mean and the greater the spread of the data.
The variance is expressed in the squared unit of measurement of a variable, which is of no practical value
and is difficult to interpret.
The population variance is denoted by the Greek letter σ² (sigma-squared) and is calculated as follows:
$\sigma^2 = \frac{1}{N}\sum (x - \mu)^2$
The sample variance is denoted by the Roman letter s² (s-squared) and is calculated as follows:
$s^2 = \frac{1}{n-1}\sum (x - \bar{x})^2 = \frac{n\sum x^2 - \left(\sum x\right)^2}{n(n-1)}$
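As an illustration, the two forms of the sample variance can be checked against each other by hand. The worked example below treats the observations 3 4 6 9 13 from the median example as a sample of n = 5; using that dataset here is an illustrative choice.

```latex
\begin{align*}
\bar{x} &= \frac{3 + 4 + 6 + 9 + 13}{5} = \frac{35}{5} = 7 \\
\sum (x - \bar{x})^2 &= (-4)^2 + (-3)^2 + (-1)^2 + 2^2 + 6^2
                      = 16 + 9 + 1 + 4 + 36 = 66 \\
s^2 &= \frac{1}{n-1}\sum (x - \bar{x})^2 = \frac{66}{4} = 16.5 \\[6pt]
\sum x &= 35, \qquad \sum x^2 = 9 + 16 + 36 + 81 + 169 = 311 \\
s^2 &= \frac{n\sum x^2 - \left(\sum x\right)^2}{n(n-1)}
     = \frac{5(311) - 35^2}{5(4)}
     = \frac{1555 - 1225}{20} = \frac{330}{20} = 16.5
\end{align*}
```

Both routes give 16.5. The second form needs only the two sums Σx and Σx², which is why it is convenient for hand calculation.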
For an ungrouped frequency table, the variance is calculated based on the values of the variable and the
frequency of occurrence. For a grouped frequency table, the variance is estimated using a formula based on
the midpoint of the class intervals and the frequency of occurrence. For the purpose of this course, it is
sufficient to calculate/estimate the variance from frequency tables using the calculator. The calculators give
the population or sample standard deviations, which must be squared to obtain the variance. Steps to perform
calculations are discussed in Section 3.3.5.
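For reference, a hedged Python sketch of the frequency-table calculation follows. It assumes the computational formula for the sample variance with each term weighted by its frequency, i.e. Σx becomes Σ(f·x) and Σx² becomes Σ(f·x²); the notes leave this step to the calculator. The table is made up, and for a grouped table the class midpoints would be passed in place of the values.

```python
from statistics import variance


def sample_variance_from_frequencies(values, freqs):
    """s^2 = (n * sum(f x^2) - (sum(f x))^2) / (n * (n - 1)), with n = sum(f)."""
    n = sum(freqs)
    sum_x = sum(f * x for x, f in zip(values, freqs))
    sum_x2 = sum(f * x ** 2 for x, f in zip(values, freqs))
    return (n * sum_x2 - sum_x ** 2) / (n * (n - 1))


# Hypothetical ungrouped table: the values 1, 2, 3 occur 2, 3 and 1 times.
values = [1, 2, 3]
freqs = [2, 3, 1]
s2 = sample_variance_from_frequencies(values, freqs)

# Cross-check: expand the table back into raw data and use the standard library.
raw = [x for x, f in zip(values, freqs) for _ in range(f)]

print(round(s2, 4), round(variance(raw), 4))  # 0.5667 0.5667
```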
The sample standard deviation is denoted by the Roman letter s and is calculated as follows:
$s = \sqrt{\frac{1}{n-1}\sum (x - \bar{x})^2} = \sqrt{\frac{n\sum x^2 - \left(\sum x\right)^2}{n(n-1)}}$
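The relationship between the standard deviation and the variance can be checked with Python's standard library, as in the sketch below. The data are the observations 3 4 6 9 13 from the median example, reused here purely for illustration.

```python
import math
from statistics import pstdev, stdev

data = [3, 4, 6, 9, 13]
n = len(data)

s = stdev(data)       # sample standard deviation (divides by n - 1)
sigma = pstdev(data)  # population standard deviation (divides by N)

# The same sample value from the computational formula.
sum_x = sum(data)
sum_x2 = sum(x ** 2 for x in data)
s_computational = math.sqrt((n * sum_x2 - sum_x ** 2) / (n * (n - 1)))

print(round(s, 3))                # 4.062
print(round(s_computational, 3))  # 4.062
print(round(s ** 2, 2))           # 16.5, the sample variance
print(round(sigma ** 2, 2))       # 13.2, the population variance
```

Squaring the standard deviation recovers the variance, which is the step needed when a calculator reports only standard deviations.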
3.3.6. Coefficient of variation
The coefficient of variation (CV) is a measure of relative variability and is used to compare the variability
of different variables measured in different units or to compare the variability of the same variables
measured at different times.
The CV is the ratio of the standard deviation to the mean, expressed as a percentage, i.e. the variability in
the variable is expressed as a percentage of the mean of that variable. This measures variability on
comparable scales for multiple variables. Note, this value is not bounded by 100% and can be greater than
100%, which implies more variability.
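A minimal Python sketch of the sample CV, (s / x̄) × 100, follows. The two variables are invented for illustration: they have the same standard deviation but different means, so their relative variability differs.

```python
from statistics import mean, stdev


def coefficient_of_variation(values):
    """Sample CV as a percentage: (s / x-bar) * 100."""
    return stdev(values) / mean(values) * 100


# Hypothetical variables measured in different units.
heights_cm = [150, 160, 170, 180, 190]
weights_kg = [50, 60, 70, 80, 90]

print(round(stdev(heights_cm), 2), round(stdev(weights_kg), 2))  # 15.81 15.81
print(round(coefficient_of_variation(heights_cm), 1))            # 9.3
print(round(coefficient_of_variation(weights_kg), 1))            # 22.6
```

Although both standard deviations are about 15.81, the weights vary more relative to their mean (about 22.6%) than the heights do (about 9.3%).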
3.3.7. Concluding notes
| Measure | Advantages | Disadvantages |
|---|---|---|
| Range | Easy to calculate | Affected by extreme values |
| Interquartile range | Not affected by outliers; best measure of variability for skewed data | Does not utilise all the data |
| Average deviation | None | Always zero |
| Variance/Standard deviation | Uses all available data; best measure of variability for symmetrical data | Affected by outliers |
| Coefficient of variation | Best measure of relative variability | Affected by outliers |
Exercise 3.1
1) The sums for X = coffee consumption are $\sum x = 59$ and $\sum x^2 = 251$. The following table shows the
frequency distribution for coffee consumption. Calculate the mean, range, variance, standard deviation
and coefficient of variation using the computational formulae as well as the calculator. Compare the
results.
Coffee consumption Frequency
1 5
2 6
3 3
4 2
5 2
6 0
7 1
8 1
Total 20
From table
Mean =
Range =
Variance =
Standard deviation =
Coefficient of variation =
From sums: $\sum x = 59$, $\sum x^2 = 251$
Mean = $\bar{x} = \frac{1}{n}\sum x$ =
Range =
Variance = $s^2 = \frac{n\sum x^2 - \left(\sum x\right)^2}{n(n-1)}$ =
Standard deviation = $s$ =
Coefficient of variation = $\frac{s}{\bar{x}} \times 100$ =
Comparison
2) Use the raw data for the coffee affinity score as well as the grouped frequency table to calculate the
mean, range, variance, standard deviation and coefficient of variation. Compare the results.
Coffee affinity score Frequency Midpoint
(0, 1] 7
(1, 2] 4
(2, 3] 2
(3, 4] 4
(4, 5] 3
Total 20
From table
Mean =
Range =
Variance =
Standard deviation =
Coefficient of variation =
From raw data
0.1 0.2 0.4 0.4 0.6 0.8 1.0 1.4 1.8 1.9 1.9 2.3 2.4 3.1 3.1 3.4 3.6 4.4 4.6 4.9
Mean =
Range =
Variance =
Standard deviation =
Coefficient of variation =
Comparison
3) Use the following stem-and-leaf plot of age (leaf unit = 1) and calculate the mode(s), median, D2 and
IQR
Stem Leaf
1 9 9 9
2 1 4 4 5 6 6 8 9 9
3 0 2 4 5 6 7
4 0 3
Median
D2
P25
P75
IQR