217 - Chapter 3 - Descriptive Statistics - Numerical Measures
● When given a set of raw data, one of the most useful ways
of summarising it is to find an average of the data set.
● An average is a measure of the centre of the data set.
● There are three common ways of describing the centre of
a set of numbers.
● They are the mean, the median and the mode and are
calculated as follows.
● Exercises 1 :
Ten patients at a doctor’s surgery wait for the following
lengths of time to see their doctor.
5 mins 9 mins 17 mins 22 mins 8 mins 11 mins
2 mins 16 mins 55 mins 5 mins
(1) What are the mean, median and mode for these data?
(2) What measure of central tendency would you use here?
● Exercises 1 (solution) :
1. Mean = 150/10 = 15 mins, Median = (9 + 11)/2 = 10 mins,
Mode = 5 mins.
2. The median would be the preferred measure of central
tendency to use here and not the mean, since there is an
outlier of 55 mins. This is making the assumption that the
outlier is a freak value and should be disregarded. The
mode would not be suitable, because it is just chance that
two people waited for the same period of time, and all the
others waited for different time periods.
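The three measures for Exercise 1 can be checked with a short Python sketch. It uses only the standard library `statistics` module; the variable names `waits` and `trimmed` are illustrative and not part of the original exercise. The last two lines drop the 55-minute value to show how strongly a single outlier pulls the mean compared with the median.

```python
from statistics import mean, median, mode

# Waiting times in minutes, taken from Exercise 1.
waits = [5, 9, 17, 22, 8, 11, 2, 16, 55, 5]

print("mean  :", mean(waits))    # sum of the values divided by 10
print("median:", median(waits))  # average of the 5th and 6th sorted values
print("mode  :", mode(waits))    # most frequent value

# Illustration only: remove the 55-minute outlier and recompute.
trimmed = [w for w in waits if w != 55]
print("mean without 55  :", round(mean(trimmed), 2))
print("median without 55:", median(trimmed))
```

Removing the outlier shifts the mean from 15 to roughly 10.6 minutes, while the median only moves from 10 to 9 minutes.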
● Exercises 2 :
2. What is the appropriate measure of central tendency to
use with these data?
● Exercises 2 (solution) :
The mode is the only possible measure of central tendency
to use here, since we are dealing with category data. The
modal category is ‘train’.
● Exercises 3 :
Which measure of central tendency is best used to measure
the average house price in KSA?
● Exercises 3 (solution) :
• The median is used to indicate average house prices in
KSA.
• The inclusion of the very expensive houses (those worth
millions of SAR) in the calculation of the mean would
make the ‘average’ house price too high to be
representative of the general market.
• Nor is the mode suitable because it could happen by
chance that a very large number of houses all had the
same non-representative value.
● Exercises 4 :
Without doing any calculation, estimate the mean of the
distribution in the figure below :
● Exercises 4 (solution) :
• The actual value for the mean is 56.
• How close to this value did you get with your guess?
MEASURES OF DISPERSION
● Example:
▪ A national sampling of prices for new and used cars found that the mean
price for a new car is 20,100 SAR and the standard deviation is 6,125 SAR
and that the mean price for a used car is 5,485 SAR with a standard
deviation equal to 2,730 SAR.
▪ In terms of absolute variation, the standard deviation of price for new cars is
more than twice that of used cars.
▪ However, in terms of relative variation, there is more relative variation in the
price of used cars than in new cars.
The CV for used cars is 2,730 / 5,485 = 49.8%
and the CV for new cars is 6,125 / 20,100 = 30.5%
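As a quick check of the car-price example, the coefficient of variation (standard deviation divided by the mean, expressed as a percentage) can be computed as in the sketch below. The function name `coefficient_of_variation` is a hypothetical helper chosen for illustration.

```python
def coefficient_of_variation(std_dev: float, mean: float) -> float:
    """Return the standard deviation as a percentage of the mean."""
    return std_dev / mean * 100


# Figures in SAR, taken from the example above.
cv_new = coefficient_of_variation(6125, 20100)
cv_used = coefficient_of_variation(2730, 5485)

print(f"new cars : {cv_new:.1f}%")
print(f"used cars: {cv_used:.1f}%")
```

Because the CV is unit-free, it allows the spread of two groups with very different means to be compared directly.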
● Example:
▪ The mean salary for deputies in Douglas County is $27,500 and the standard
deviation is $4,500.
▪ The mean salary for deputies in Hall County is $24,250 and the standard
deviation is $2,750.
▪ A deputy in Douglas County who makes $30,000 earns $1,500 more than a
deputy in Hall County who makes $28,500. Which deputy has the
higher salary relative to the county in which he works?
● Example:
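▪ For the deputy in Douglas County who makes $30,000, the z score is
z = (30,000 − 27,500) / 4,500 ≈ 0.56.
▪ For the deputy in Hall County who makes $28,500, the z score is
z = (28,500 − 24,250) / 2,750 ≈ 1.55.
▪ The Hall County deputy therefore has the higher salary relative to the
county in which he works.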
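The comparison of the two deputies can be sketched in Python, assuming the usual definition of a z score as (value − mean) / standard deviation. The helper name `z_score` is illustrative only.

```python
def z_score(x: float, mean: float, std_dev: float) -> float:
    """Number of standard deviations that x lies above (or below) the mean."""
    return (x - mean) / std_dev


# Salaries, means and standard deviations from the example above.
z_douglas = z_score(30000, 27500, 4500)
z_hall = z_score(28500, 24250, 2750)

print(f"Douglas County deputy: z = {z_douglas:.2f}")
print(f"Hall County deputy   : z = {z_hall:.2f}")

# The larger z score marks the salary that is higher relative to its county.
higher = "Hall" if z_hall > z_douglas else "Douglas"
print("Higher relative salary:", higher, "County")
```

Although the Douglas County salary is larger in absolute terms, the Hall County salary sits further above its own county's mean when measured in standard deviations.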
● The median is 20.5 (half way between the 6th and 7th
observations), and divides the data into two equal sets
with exactly 50% of the observations in each: the 1st to
the 6th observations in the first set and the 7th to 12th
observations in the other.
● We see that 50% of the area is between the first and third
quartiles.
● This means that 50% of the observations lie between the
first and third quartiles.
● For the following data sets, calculate the quartiles and find
the interquartile range.
1 - The following numbers represent the time in minutes
that twelve employees took to get to work on a particular
day. 18 34 68 22 10 92 46 52 38 29 45 37
2 - The number of people killed in road traffic accidents
in New South Wales from 1989 to 1996 is given below.
960 797 663 652 560 619 623 583
Source: Statistics–A Powerful Edge, Australian Bureau of Statistics, 1998.
Solutions :
1. First quartile = 25.5, Median = 37.5, Third quartile = 49, IQR = 23.5.
2. First quartile = 601, Median = 637.5, Third quartile = 730, IQR = 129.
3. First quartile = 52, Median = 70.5, Third quartile = 86, IQR = 34.
4. Our estimate puts the first quartile at 40, the median at 50 and the third
quartile at 60. This gives an interquartile range of 20. This means that
the middle 50% of marks lie within 20 marks of each other.
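Solutions 1 and 2 can be reproduced with the sketch below. It assumes the median-of-halves convention that matches the answers given: the ordered data are split into a lower and an upper half, and each quartile is the median of its half. Other conventions (including the one used by `statistics.quantiles`) can give slightly different quartiles. The function name `quartiles` is a hypothetical helper, and for an odd number of values this version leaves the middle value out of both halves, which is also an assumption.

```python
from statistics import median


def quartiles(data):
    """Return (Q1, median, Q3) by the median-of-halves method.

    Assumes at least two values. For an odd count, the middle value
    is excluded from both halves.
    """
    ordered = sorted(data)
    half = len(ordered) // 2
    lower = ordered[:half]                  # lower half of the ordered data
    upper = ordered[len(ordered) - half:]   # upper half of the ordered data
    return median(lower), median(ordered), median(upper)


# Data set 1: travel times in minutes for twelve employees.
commute = [18, 34, 68, 22, 10, 92, 46, 52, 38, 29, 45, 37]
# Data set 2: road deaths in New South Wales, 1989 to 1996.
road = [960, 797, 663, 652, 560, 619, 623, 583]

for name, values in [("commute", commute), ("road", road)]:
    q1, med, q3 = quartiles(values)
    print(name, "Q1 =", q1, "median =", med, "Q3 =", q3, "IQR =", q3 - q1)
```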
● Solution :
1) s² = 84.93.
2) x̄ = 74.75, s = 13.14.
THE BOX-PLOT
● The box-plot is another way of representing a data set
graphically.
● It is constructed using the quartiles, and gives a good indication
of the spread of the data set and its symmetry (or lack of
symmetry).
● It is a very useful method for comparing two or more data sets.
● The box-plot consists of a scale, a box drawn between the first
and third quartile, the median placed within the box, whiskers
on both sides of the box and outliers (if any).
● This is best illustrated using a diagram such as in the following
figure.
● Step 1:
▪ Order the data and calculate the quartiles.
44 46 47 48 49 49 50
51 52 52 53 53 53 54
54 54 55 55 56 57 57
59 59 60 61 62 66 68
▪ Now we calculate the median, the first quartile and the third quartile.
▪ For these data, median = 54, the first quartile = 50.5 and the third quartile
= 58.
▪ With this information we can begin to construct the box-plot.
● Step 2:
▪ Draw the scale and mark on the quartiles.
▪ Mark the median at the correct place above the scale with an asterisk, draw
a box around this asterisk with the left hand side of the box at the first
quartile, 50.5, and the right hand side of the box at the third quartile, 58.
▪ This is illustrated in the following figure.
● Step 3:
▪ Calculate the interquartile range and determine the position of the outlier
thresholds.
Interquartile range = third quartile − first quartile = 58 − 50.5 = 7.5.
▪ The position of the lower outlier threshold is found by subtracting the
interquartile range from the first quartile, 50.5 − 7.5 = 43.
▪ The position of the upper outlier threshold is found by adding the
interquartile range to the third quartile, 58 + 7.5 = 65.5. (Some texts add
or subtract 1.5 × interquartile range.)
▪ We now add the outlier thresholds to our diagram. This is illustrated in the
following figure.
● Step 4:
▪ Use the outlier thresholds to draw the whiskers.
▪ To draw the left hand whisker, we need the smallest data value that lies
inside the outlier thresholds.
▪ In this example, it is the value 44. This is drawn on our diagram with a
small cross level with the asterisk. A horizontal line is now drawn from the
cross to the left hand side of the box.
▪ To draw the right hand whisker, we find the largest data value that lies
inside the outlier thresholds.
▪ In this example, the value is 62. This is drawn on the right hand side of
the box with a small cross and connected to the box by a horizontal line.
● Step 5:
▪ Determine the outliers and remove the outlier thresholds.
▪ Values (if any) that lie outside the outlier thresholds are called outliers. In
this example, 66 and 68 are outliers. These are placed on the diagram
using a small square or circle.
▪ Finally, the outlier thresholds are removed. The completed box-plot is
illustrated in the following figure :
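The numbers behind the five steps can be reproduced with the Python sketch below. It is a minimal illustration, not a plotting routine: it assumes the median-of-halves quartiles used in Step 1 and the 1 × interquartile range thresholds used in Step 3 (pass `k=1.5` for the convention used in some texts). Values lying exactly on a threshold are treated as inside it, which is an assumption. The function name `box_plot_summary` and its dictionary keys are hypothetical.

```python
from statistics import median

# The 28 ordered observations from Step 1.
data = [44, 46, 47, 48, 49, 49, 50,
        51, 52, 52, 53, 53, 53, 54,
        54, 54, 55, 55, 56, 57, 57,
        59, 59, 60, 61, 62, 66, 68]


def box_plot_summary(values, k=1.0):
    """Return the quantities needed to draw a box-plot by hand."""
    ordered = sorted(values)
    half = len(ordered) // 2
    q1 = median(ordered[:half])                  # Step 1: first quartile
    q3 = median(ordered[len(ordered) - half:])   # Step 1: third quartile
    iqr = q3 - q1                                # Step 3: interquartile range
    low, high = q1 - k * iqr, q3 + k * iqr       # Step 3: outlier thresholds
    inside = [x for x in ordered if low <= x <= high]
    return {
        "q1": q1, "median": median(ordered), "q3": q3, "iqr": iqr,
        "lower_threshold": low, "upper_threshold": high,
        "whisker_low": min(inside),              # Step 4: left whisker end
        "whisker_high": max(inside),             # Step 4: right whisker end
        "outliers": [x for x in ordered if x < low or x > high],  # Step 5
    }


summary = box_plot_summary(data)
for key, value in summary.items():
    print(f"{key:16s} {value}")
```

With `k=1.0` the sketch reports whiskers ending at 44 and 62 and the outliers 66 and 68, matching the worked example.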