STAE Lecture Notes - LU3

Uploaded by

aneenzenda06
Copyright
© © All Rights Reserved

Learning Unit 3: DESCRIPTIVE STATISTICS

LEARNING OBJECTIVES
• Understand the concepts of and calculate the mean, median, mode and percentiles
• Understand the concepts of and calculate the range, interquartile range, standard deviation, variance
and coefficient of variation
• Choose the appropriate measures of central tendency and variability for any given variable

Textbook reference:

3.1. Introduction
Descriptive statistics are numerical summary measures used to describe the data collected from a sample in
terms of central tendency, variability, skewness and kurtosis. These measures are used in most statistical
analyses. In this course, measures of central tendency and variability are calculated using raw and frequency
data, skewness is only evaluated visually, and kurtosis is not assessed.

3.2. Measures Of Central Tendency


A measure of central tendency (or location) is a single value that summarises the centre of a distribution.
The commonly used measures of central tendency are mode, median and mean.

3.1.1. Mode
For raw data and ungrouped frequency data the mode is the value of the variable that occurs most frequently.
A variable can have one, two, more than two, or no mode.
• Unimodal = one mode
• Bimodal = two modes
• Multimodal = more than two modes
For grouped frequency data it is not possible to identify the value occurring most frequently, since the data
were grouped into class intervals and information was lost. For such data the class with the highest
frequency is the modal class, and the mode is generally estimated using the midpoint of the modal class.
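
As an illustration only, the definition of the mode for raw data could be sketched in Python as follows. The function name `modes` and the example data are hypothetical, and the sketch assumes that data in which every value occurs exactly once are treated as having no mode.

```python
from collections import Counter


def modes(values):
    """Return every value that occurs most frequently, in sorted order.

    Assumption (not from the notes): if all values occur exactly once,
    the data are treated as having no mode and an empty list is returned.
    """
    counts = Counter(values)
    if not counts:
        return []
    highest = max(counts.values())
    if highest == 1 and len(counts) > 1:
        return []
    return sorted(value for value, count in counts.items() if count == highest)


print(modes([2, 3, 3, 5, 7]))     # one mode (unimodal)
print(modes([1, 1, 2, 2, 3]))     # two modes (bimodal)
print(modes([4, 5, 6]))           # no mode under the stated assumption
```

The length of the returned list indicates whether the data are unimodal, bimodal or multimodal.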

3.1.2. Median
The median is the value of the variable in the middle of the ordered set of data values. Therefore at most
50% of observations are below the median value, and at most 50% of observations are above the median
value.

To find the median for raw data:

• Order the data from lowest to highest
• Find the median position = $\frac{n + 1}{2}$
• If n is odd, the median position value will be a whole number
o The median value is the value of the variable in the median position of the ordered data
o For example, for the ordered observations: 3 4 6 9 13
Since n = 5, the median position = $\frac{5 + 1}{2} = \frac{6}{2} = 3$. The value in position 3 of the ordered data is 6, i.e.,
median = 6
• If n is even, the median position value will be a fraction
o The median value is the average of the two variable values on either side of the median position in
the ordered data
o For example, for the ordered observations: 3 4 6 9
Since n = 4, the median position = $\frac{4 + 1}{2} = \frac{5}{2} = 2.5$. The value in position 2 of the ordered data is 4,
and the value in position 3 of the ordered data is 6, i.e., median = $\frac{4 + 6}{2} = \frac{10}{2} = 5$
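
The steps above can be sketched in Python as shown below. This is a minimal sketch rather than a prescribed method; the function name `median_from_raw` is hypothetical, and positions are counted from 1 as in the notes.

```python
def median_from_raw(values):
    """Median of raw data using the (n + 1) / 2 position rule."""
    if not values:
        raise ValueError("at least one observation is required")
    ordered = sorted(values)            # step 1: order the data
    n = len(ordered)
    position = (n + 1) / 2              # step 2: median position (1-based)
    k = int(position)                   # whole-number part of the position
    if n % 2 == 1:
        # n is odd: the position is a whole number, take that value
        return ordered[k - 1]
    # n is even: average the values on either side of the position
    return (ordered[k - 1] + ordered[k]) / 2


print(median_from_raw([3, 4, 6, 9, 13]))   # n = 5, position 3
print(median_from_raw([3, 4, 6, 9]))       # n = 4, position 2.5
```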

For ungrouped frequency tables the median is calculated using cumulative frequencies. For grouped
frequency tables the median is estimated using cumulative frequencies and an interpolation formula.
However, this is beyond the scope of this course.

3.1.3. Mean
The mean of a variable is also referred to as the arithmetic mean or the average. For raw data the mean is
calculated by adding all the values of the variable together and dividing by the total number of observations.

For a random variable X the population mean is denoted by the Greek letter µ (mu):

$\mu = \frac{1}{N}\sum x$

For a random variable X the sample mean is denoted by $\bar{x}$ (x-bar):

$\bar{x} = \frac{1}{n}\sum x$

For example, consider the random sample with observations: 9 2 4 13 6

$\bar{x} = \frac{9 + 2 + 4 + 13 + 6}{5} = \frac{34}{5} = 6.8$
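
The same calculation can be checked with a short Python sketch. The function name `sample_mean` is hypothetical; the data are the five observations from the example.

```python
def sample_mean(values):
    """Sample mean: the sum of the observations divided by n."""
    n = len(values)
    if n == 0:
        raise ValueError("at least one observation is required")
    return sum(values) / n


sample = [9, 2, 4, 13, 6]
print(sum(sample))            # sum of the observations
print(sample_mean(sample))    # sum divided by n = 5
```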

For an ungrouped frequency table, the mean is calculated using a formula based on the values of the variable
and the frequency of occurrence. For a grouped frequency table, the mean is estimated using a formula based
on the midpoint of the class intervals and the frequency of occurrence. For the purpose of this course, it is
sufficient to calculate/estimate the mean from frequency tables using the calculator.
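
Although the calculator is sufficient for this course, the idea behind the frequency-table formula can be sketched in Python: each value (or class midpoint) is weighted by its frequency. The function name and both small tables below are hypothetical and are not taken from the course material.

```python
def mean_from_frequency_table(values, frequencies):
    """Mean from a frequency table: sum of (value x frequency) divided by n.

    For a grouped table, pass the class midpoints as `values`; the result
    is then an estimate, because the original observations are unknown.
    """
    n = sum(frequencies)                       # total number of observations
    if n == 0:
        raise ValueError("the frequencies must sum to more than zero")
    weighted_total = sum(x * f for x, f in zip(values, frequencies))
    return weighted_total / n


# Hypothetical ungrouped table: values 1, 2, 3 with frequencies 2, 5, 3
print(mean_from_frequency_table([1, 2, 3], [2, 5, 3]))

# Hypothetical grouped table: midpoints 0.5, 1.5, 2.5 with frequencies 2, 5, 3
print(mean_from_frequency_table([0.5, 1.5, 2.5], [2, 5, 3]))
```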

Steps to find the mean using the calculator

1) Enter data
2) AC
3) STAT → 4:VAR → 2: $\bar{x}$ → =

3.1.4. Concluding notes
Measure | Advantages | Disadvantages
Mode | Valid for categorical and numerical data; not affected by outliers | More than one mode can exist
Median | Not affected by outliers; best measure to use for skewed data | Only appropriate for ordinal and numerical data
Mean | Calculated using every value in the dataset, i.e. very accurate; best measure to use for symmetrical data | Affected by extreme values; only appropriate for numerical data

3.3. Measures Of Relative Standing
Measures of relative standing show where particular values stand relative to the whole distribution of the
variable. Relative standing is measured through percentiles. Percentiles are points which partition an ordered
dataset into a hundred parts. The rth percentile, Pr, is the value of the variable that separates the lowest r%
of the distribution from the remaining (100 – r)% of the distribution. The formal interpretation of Pr is: at
most r% is less than Pr and at most (100 – r)% is more than Pr.

For example, if 10% of students scored at least 80 on a test, then a student who scored 82 performed in the
top 10% of the distribution. The value “80” is the minimum value obtained by the top 10% of the distribution
and is therefore the 90th percentile, i.e. P90 = 80, as it separates the lowest 90% from the remaining 10% of
the distribution. Thus, at most 90% of students scored less than 80 and at most 10% of students scored more
than 80. Recall the interpretation of the median, namely at most 50% of observations are below the median
value and at most 50% of observations are above the median value. The median of a distribution is the 50th
percentile value, i.e. P50 = median.

Other commonly used percentiles are deciles, which divide the distribution into ten equal parts (D1, D2, …,
D10) and quartiles, which divide the distribution into four equal parts (Q1, Q2, Q3, Q4). Both deciles and
quartiles can be expressed in terms of percentiles. For example, D5 = Q2 = P50 = median. For raw data any
percentile value is obtained by first sorting the data from lowest to highest, locating the percentile position
and then using a formula to calculate the percentile value. Percentile calculations from frequency data are
beyond the scope of this course.

To find Pr for raw data:

• Order the data from lowest to highest
• Find the percentile position = $\frac{r}{100}(n + 1)$
o This yields a value in the format k.d, where k = the integer portion and d = the decimal portion (in
decimal format)
• $P_r = x_{(k)} + d\,(x_{(k+1)} - x_{(k)})$
o Where $x_{(k)}$ is the value in position k of the ordered dataset
o Where $x_{(k+1)}$ is the value in position (k + 1) of the ordered dataset


For example, find and interpret P20 and Q3 for the following 12 observations (already ordered):
4 5 8 9 11 12 12 14 15 17 19 21
• P20
o position = $\frac{20}{100}(12 + 1) = 2.6$, therefore k = 2 and d = 0.6
o The value in position 2 (k) is 5 and the value in position 3 (k + 1) is 8
o $P_{20} = x_{(2)} + 0.6\,(x_{(3)} - x_{(2)}) = 5 + 0.6(8 - 5) = 6.8$
o At most 20% of observations are less than 6.8 and at most 80% of observations are greater than 6.8

• Q3 = P75
o position = $\frac{75}{100}(12 + 1) = 9.75$, therefore k = 9 and d = 0.75
o The value in position 9 (k) is 15 and the value in position 10 (k + 1) is 17
o $P_{75} = x_{(9)} + 0.75\,(x_{(10)} - x_{(9)}) = 15 + 0.75(17 - 15) = 16.5$
o At most 75% of observations are less than 16.5 and at most 25% of observations are greater than
16.5
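
The percentile procedure can be sketched in Python as follows. This is a minimal sketch of the position rule used in these notes; other textbooks and software packages use different interpolation rules and may give slightly different answers. The function name `percentile` is hypothetical, and raising an error when the position falls outside the data is an assumption, since the notes do not cover that case.

```python
def percentile(values, r):
    """r-th percentile of raw data using the position rule r/100 * (n + 1)."""
    ordered = sorted(values)                 # order the data
    n = len(ordered)
    position = r * (n + 1) / 100             # percentile position (1-based)
    k = int(position)                        # integer portion
    d = position - k                         # decimal portion
    if k < 1 or k > n or (k == n and d > 0):
        # Assumption: positions before the first or beyond the last
        # observation are not handled by the formula in the notes.
        raise ValueError("percentile position falls outside the data")
    lower = ordered[k - 1]                   # value in position k
    if d == 0:
        return lower
    upper = ordered[k]                       # value in position k + 1
    return lower + d * (upper - lower)


data = [4, 5, 8, 9, 11, 12, 12, 14, 15, 17, 19, 21]
print(round(percentile(data, 20), 2))   # P20, position 2.6
print(round(percentile(data, 75), 2))   # Q3 = P75, position 9.75
```

The results are rounded when printed only to hide floating-point rounding in the decimal portion.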
3.4. Measures Of Variability
Measures of variability (or spread or dispersion) describe the extent to which data are spread around their
central tendency and across the scale. The commonly used measures of variability are range, interquartile
range, variance, standard deviation and coefficient of variation.

3.3.1. Range
The range is an approximate measure of variability and shows how much of the scale is utilised. For raw
data and ungrouped frequency data the range is the difference between the maximum and the minimum
values of a variable. For grouped frequency data the range is the difference between the upper limit of the
last class interval and the lower limit of the first class interval.
Range = maximum – minimum

3.3.2. Interquartile range


The interquartile range (IQR) is the distance between the 1st and 3rd quartiles. It gives a measure of how the
middle 50% of the distribution is spread around the median.
IQR = Q3 – Q1
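
As a rough illustration, the range and IQR can be computed in Python for the 12 ordered observations used in the percentile example (4 5 8 9 11 12 12 14 15 17 19 21). The helper `percentile` is a hypothetical, simplified version of the course's position rule and assumes the position falls strictly inside the data.

```python
def percentile(ordered, r):
    """Percentile of an already ordered list using position r/100 * (n + 1)."""
    n = len(ordered)
    position = r * (n + 1) / 100
    k = int(position)                  # integer portion of the position
    d = position - k                   # decimal portion of the position
    if k < 1 or k >= n:
        # Simplifying assumption: only positions between 1 and n are handled
        raise ValueError("percentile position falls outside the data")
    return ordered[k - 1] + d * (ordered[k] - ordered[k - 1])


data = sorted([4, 5, 8, 9, 11, 12, 12, 14, 15, 17, 19, 21])

data_range = max(data) - min(data)     # Range = maximum - minimum
q1 = percentile(data, 25)              # Q1 = P25
q3 = percentile(data, 75)              # Q3 = P75
iqr = q3 - q1                          # IQR = Q3 - Q1

print(data_range, q1, q3, iqr)
```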
3.3.3. Average deviation
The average deviation is the arithmetic mean of the differences between each observation and the mean of
the variable.
Average deviation = $\frac{1}{n}\sum (x - \bar{x})$

Consider the observations: 2 2 1 3

$\bar{x} = 2$ → $(x - \bar{x})$: 0 0 −1 1 → $\frac{1}{4}\sum (x - \bar{x}) = 0$
Because the mean is the arithmetic centre of the distribution, some observations are less than the mean (i.e.
negative difference) and some observations are greater than the mean (i.e. positive difference). The negative
and positive values completely cancel out across all observations. The sum of the differences is therefore
always equal to zero, which makes the average deviation a redundant measure of variability. It only serves
as a starting point for measuring how data are spread around the mean.
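
The cancelling-out can be verified with a few lines of Python. This is only a sketch using the four observations from the example; the function name `average_deviation` is hypothetical.

```python
def average_deviation(values):
    """Arithmetic mean of the signed differences between each value and the mean."""
    n = len(values)
    mean = sum(values) / n
    deviations = [x - mean for x in values]   # signed differences from the mean
    return deviations, sum(deviations) / n


deviations, avg_dev = average_deviation([2, 2, 1, 3])
print(deviations)   # negative and positive differences cancel out
print(avg_dev)      # zero for these observations
```

With non-integer data the computed sum may differ from zero by a tiny floating-point rounding error, which does not affect the conclusion.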

3.3.4. Variance
To solve the problem encountered with the average deviation measure, differences are considered as
distances which must always be positive. There are two ways in which negative values can be removed:
either take the absolute value (i.e., remove the sign), or square the value. The variance is the average squared
deviation around the mean. It is the most commonly used measure of variability in statistics. The larger the
value of the variance the more the data values vary around the mean and the greater the spread of the data.
The variance is expressed in the squared unit of measurement of a variable, which is of no practical value
and is difficult to interpret.

The population variance is denoted by the Greek letter $\sigma^2$ (sigma-squared) and is calculated as follows:

$\sigma^2 = \frac{1}{N}\sum (x - \mu)^2$

The sample variance is denoted by the Roman letter $s^2$ (s-squared) and is calculated as follows:

$s^2 = \frac{1}{n - 1}\sum (x - \bar{x})^2 = \frac{n\sum x^2 - (\sum x)^2}{n(n - 1)}$
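
To show that the definitional and computational forms of the sample variance agree, the following Python sketch applies both to the sample 9 2 4 13 6 from the mean example. The function names are hypothetical.

```python
def population_variance(values):
    """Average squared deviation around the mean, dividing by N."""
    n = len(values)
    mean = sum(values) / n
    return sum((x - mean) ** 2 for x in values) / n


def sample_variance(values):
    """Definitional form: sum of squared deviations divided by n - 1."""
    n = len(values)
    mean = sum(values) / n
    return sum((x - mean) ** 2 for x in values) / (n - 1)


def sample_variance_computational(values):
    """Computational form: (n * sum(x^2) - (sum x)^2) / (n * (n - 1))."""
    n = len(values)
    sum_x = sum(values)
    sum_x2 = sum(x * x for x in values)
    return (n * sum_x2 - sum_x ** 2) / (n * (n - 1))


sample = [9, 2, 4, 13, 6]
print(round(sample_variance(sample), 4))                # definitional form
print(round(sample_variance_computational(sample), 4))  # same value
print(round(population_variance(sample), 4))            # divides by N instead
```

For this sample the sum of x is 34 and the sum of x squared is 306, so the computational form gives (5 × 306 − 34²) / (5 × 4) = 374 / 20 = 18.7.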
For an ungrouped frequency table, the variance is calculated based on the values of the variable and the
frequency of occurrence. For a grouped frequency table, the variance is estimated using a formula based on
the midpoint of the class intervals and the frequency of occurrence. For the purpose of this course, it is
sufficient to calculate/estimate the variance from frequency tables using the calculator. The calculators give
the population or sample standard deviations, which must be squared to obtain the variance. Steps to perform
calculations are discussed in Section 3.3.5.

3.3.5. Standard deviation


The standard deviation is the positive square root of the variance. It is expressed in terms of the unit of
measurement of the variable. Under certain distributional assumptions, the standard deviation has a very
particular and practical interpretation (further detail is given in Section 6.3.3).
The population standard deviation is denoted by the Greek symbol $\sigma$ (sigma) and is calculated as follows:

$\sigma = \sqrt{\frac{1}{N}\sum (x - \mu)^2}$

The sample standard deviation is denoted by the Roman letter s and is calculated as follows:

$s = \sqrt{\frac{1}{n - 1}\sum (x - \bar{x})^2} = \sqrt{\frac{n\sum x^2 - (\sum x)^2}{n(n - 1)}}$

Steps to find the standard deviation using the calculator

1) Enter data
2) AC
3) STAT → 4:VAR → 3: $\sigma_x$ → = (population standard deviation)
STAT → 4:VAR → 4: $s_x$ → = (sample standard deviation)
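
For readers without the calculator, Python's standard `statistics` module offers the same two quantities: `pstdev` for the population standard deviation and `stdev` for the sample standard deviation. The sketch below uses the sample 9 2 4 13 6 from the mean example and squares the results to recover the variances.

```python
import statistics

sample = [9, 2, 4, 13, 6]

sigma = statistics.pstdev(sample)   # population standard deviation (divides by N)
s = statistics.stdev(sample)        # sample standard deviation (divides by n - 1)

print(round(sigma, 4), round(s, 4))

# Squaring the standard deviations gives the corresponding variances
print(round(sigma ** 2, 4), round(s ** 2, 4))
```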

3.3.6. Coefficient of variation
The coefficient of variation (CV) is a measure of relative variability and is used to compare the variability
of different variables measured in different units or to compare the variability of the same variables
measured at different times.

The CV is the ratio of the standard deviation to the mean, expressed as a percentage, i.e. the variability in
the variable is expressed as a percentage of the mean of that variable. This measures variability on
comparable scales for multiple variables. Note, this value is not bounded by 100% and can be greater than
100%, which implies more variability.

The sample coefficient of variation is calculated as follows:

$CV = \frac{s}{\bar{x}} \times 100$
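
A minimal Python sketch of the CV is given below. The function name is hypothetical, and so are the two small datasets (heights in cm and weights in kg), which are included only to illustrate comparing variables measured in different units.

```python
import statistics


def coefficient_of_variation(values):
    """Sample CV: the sample standard deviation as a percentage of the mean."""
    mean = statistics.mean(values)
    if mean == 0:
        raise ValueError("the CV is undefined when the mean is zero")
    return statistics.stdev(values) / mean * 100


# Hypothetical data measured in different units
heights_cm = [160, 165, 170, 175, 180]
weights_kg = [55, 62, 70, 81, 95]

print(round(coefficient_of_variation(heights_cm), 2))   # CV of heights, in %
print(round(coefficient_of_variation(weights_kg), 2))   # CV of weights, in %
```

In this hypothetical comparison the weights have the larger CV, so they vary more relative to their mean than the heights do.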

3.3.7. Concluding notes
Measure | Advantages | Disadvantages
Range | Easy to calculate | Affected by extreme values
Interquartile range | Not affected by outliers; best measure of variability for skewed data | Does not utilise all the data
Average deviation | None | Always zero
Variance/Standard deviation | Uses all available data; best measure of variability for symmetrical data | Affected by outliers
Coefficient of variation | Best measure of relative variability | Affected by outliers

Exercise 3.1
1) The sums for X = coffee consumption are $\sum x = 59$ and $\sum x^2 = 251$. The following table shows the
frequency distribution for coffee consumption. Calculate the mean, range, variance, standard deviation
and coefficient of variation using the computational formulae as well as the calculator. Compare the
results.
Coffee consumption Frequency
1 5
2 6
3 3
4 2
5 2
6 0
7 1
8 1
Total 20

From table
Mean =
Range =
Variance =
Standard deviation =
Coefficient of variation =

From sums: $\sum x = 59$, $\sum x^2 = 251$

Mean = $\bar{x} = \frac{1}{n}\sum x =$

Range =

Variance = $s^2 = \frac{n\sum x^2 - (\sum x)^2}{n(n - 1)} =$

Standard deviation = s =

Coefficient of variation = $\frac{s}{\bar{x}} \times 100 =$

Comparison

2) Use the raw data for the coffee affinity score as well as the grouped frequency table to calculate the
mean, range, variance, standard deviation and coefficient of variation. Compare the results.
Coffee affinity score Frequency Midpoint
(0, 1] 7
(1, 2] 4
(2, 3] 2
(3, 4] 4
(4, 5] 3
Total 20

From table
Mean =
Range =
Variance =
Standard deviation =
Coefficient of variation =
From raw data
0.1 0.2 0.4 0.4 0.6 0.8 1.0 1.4 1.8 1.9 1.9 2.3 2.4 3.1 3.1 3.4 3.6 4.4 4.6 4.9
Mean =
Range =
Variance =
Standard deviation =
Coefficient of variation =

Comparison

3) Use the following stem-and-leaf plot of age (leaf unit = 1) and calculate the mode(s), median, D2 and
IQR

Stem Leaf
1 9 9 9
2 1 4 4 5 6 6 8 9 9
3 0 2 4 5 6 7
4 0 3

From raw data


Mode(s)

Median

D2

P25

P75

IQR