
King Khalid University

College of Sciences and Arts of Tanumah


Department of Computer Science

217CSM-3: Statistical Programming

Chapter 3 - Descriptive Statistics: Numerical Measures

Dr. Moheddine Imsatfia




NUMERICAL MEASURES

● If the measures are computed for data from a sample,
they are called sample statistics.
● If the measures are computed for data from a population,
they are called population parameters.
● A sample statistic is referred to as the point estimator of
the corresponding population parameter.



THE MEAN, MEDIAN AND MODE

● When given a set of raw data one of the most useful ways
of summarising that data is to find an average of that set
of data.
● An average is a measure of the centre of the data set.
● There are three common ways of describing the centre of
a set of numbers.
● They are the mean, the median and the mode and are
calculated as follows.


● The mean − add up all the numbers and divide by how
many numbers there are.
● The median − is the middle number. It is found by putting
the numbers in order and taking the actual middle number
if there is one, or the average of the two middle numbers if
not.
● The mode − is the most commonly occurring number.
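
As a minimal sketch of the three definitions above, the following Python snippet uses the standard-library `statistics` module. The data values are hypothetical and were chosen only for illustration; they are not the luggage data from the slides.

```python
import statistics

# Hypothetical data, chosen only to illustrate the three definitions.
values = [3, 5, 5, 7, 10]

mean_value = statistics.mean(values)      # sum of the values divided by their count
median_value = statistics.median(values)  # middle value of the sorted data
mode_value = statistics.mode(values)      # most frequently occurring value

print(mean_value, median_value, mode_value)
```

Here the sum is 30 over 5 values, the middle value of the sorted list is 5, and 5 is the only value that occurs twice.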


● Let’s illustrate these by calculating the mean, median and
mode for the following data.
● Weight of luggage presented by airline passengers at the
check-in (measured to the nearest kg).


● Central tendency describes the tendency of the
observations to bunch around a particular value, or
category.
● The mean, median and mode are all measures of central
tendency.
● They are all measures of the ‘average’ of the distribution.
● The best one to use in a given situation depends on the
type of variable given.


● For example, suppose a class of 20 students own among
them a total of 17 pets as shown in the following table.

● Which measure of central tendency should we use here?


● If our focus of interest were on the type of pet owned, we
would use the mode as our average. → ‘Cat’ would be
described as the ‘modal category’, as this is the category
that occurs most often.
● If, on the other hand, we were not interested in the type of
pet kept but the average number of pets owned then the
mean would be an appropriate measure of central
tendency. → Here the mean is 17/20 = 0.85.


● The mean has some advantages over the median as a
measure of central tendency of quantity variables.
● One of them is that all the observed values are used to
calculate the mean. To calculate the median, by contrast,
all the observed values are used in the ranking, but only
the middle or middle two values are used in the calculation.
● Another is that the mean is fairly stable from sample to
sample.
● This means that if we take several samples from the same
population, their means are likely to vary less than their
medians.


● However, the median is used as a measure of central
tendency if there are a few extreme values observed.
● The mean is very sensitive to extreme values and it may
not be an appropriate measure of central tendency in
these cases.
● With the exception of cases where there are obvious
extreme values, the mean is the value usually used to
indicate the centre of a distribution.
● We can also think of the mean as the balance point of a
distribution.
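
The sensitivity of the mean to extreme values can be sketched in a few lines of Python. The numbers below are hypothetical: four ordinary observations, to which one deliberately extreme value is added.

```python
import statistics

# Hypothetical observations; the appended value 100 is deliberately extreme.
without_extreme = [10, 12, 13, 15]
with_extreme = without_extreme + [100]

mean_before = statistics.mean(without_extreme)
mean_after = statistics.mean(with_extreme)
median_before = statistics.median(without_extreme)
median_after = statistics.median(with_extreme)

print("mean:  ", mean_before, "->", mean_after)
print("median:", median_before, "->", median_after)
```

In this sketch the single extreme value moves the mean from 12.5 to 30, while the median only moves from 12.5 to 13.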


● For example, consider the distribution of students’ marks
on a test:

● Without doing any calculation, we would guess the
balance point of the distribution to be approximately 58.
(Think of it as the centre of a see-saw.)


● Exercise 1:
Ten patients at a doctor’s surgery wait for the following
lengths of time (in minutes) to see their doctor.
5 9 17 22 8 11 2 16 55 5
(1) What are the mean, median and mode for these data?
(2) What measure of central tendency would you use here?


● Exercise 1 (solution):
1. Mean = 15 mins, Median = (9 + 11)/2 = 10 mins, Mode = 5
mins.
2. The median would be the preferred measure of central
tendency to use here and not the mean, since there is an
outlier of 55 mins. This is making the assumption that the
outlier is a freak value and should be disregarded. The
mode would not be suitable, because it is just chance that
two people waited for the same period of time, and all the
others waited for different time periods.
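
The three values in this solution can be checked with a short Python sketch using the standard-library `statistics` module (the variable names are illustrative only).

```python
import statistics

# Waiting times in minutes from Exercise 1.
waiting_times = [5, 9, 17, 22, 8, 11, 2, 16, 55, 5]

mean_wait = statistics.mean(waiting_times)      # 150 / 10
median_wait = statistics.median(waiting_times)  # average of the 5th and 6th sorted values
mode_wait = statistics.mode(waiting_times)      # 5 occurs twice

print(mean_wait, median_wait, mode_wait)
```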


● Exercise 2:
2. What is the appropriate measure of central tendency to
use with these data?


● Exercise 2 (solution):
The mode is the only possible measure of central tendency
to use here, since we are dealing with categorical data. The
modal category is ‘train’.


● Exercise 3:
Which measure of central tendency is best used to measure
the average house price in KSA?


● Exercise 3 (solution):
• The median is used to indicate average house prices in
KSA.
• The inclusion of the very expensive houses (those worth
millions of SAR) in the calculation of the mean would
make the ‘average’ house price too high to be
representative of the general market.
• Nor is the mode suitable because it could happen by
chance that a very large number of houses all had the
same non-representative value.


● Exercise 4:
Without doing any calculation, estimate the mean of the
distribution in the figure below:


● Exercise 4 (solution):
• The actual value for the mean is 56.
• How close to this value did you get with your guess?

MEASURES OF DISPERSION

● The mean is the value usually used to indicate the centre
of a distribution.
● If we are dealing with quantity variables our description of
the data will not be complete without a measure of the
extent to which the observed values are spread out from
the average.
● We will consider several measures of dispersion and
discuss the merits and pitfalls of each.



THE RANGE

● One very simple measure of dispersion is the range.
● Let’s consider the two distributions given in A and B.
● They represent the marks of a group of thirty students on
two tests.


● Here it is clear that the marks on test A are more spread
out than the marks on test B, and we need a measure of
dispersion that will accurately indicate this.
• On test A, the range of marks is 70 − 45 = 25.
• On test B, the range of marks is 65 − 45 = 20.
● Here the range gives us an accurate picture of the
dispersion of the two distributions.


● However, as a measure of dispersion the range is
severely limited.
● Since it depends only on two observations, the lowest and
the highest, we will get a misleading idea of dispersion if
these values are outliers.


● This is illustrated very well if the students’ marks are
distributed as in C and D:


● On test C, the range is still 70 − 45 = 25.
● On test D, the range is now 72 − 40 = 32.
● Apart from the outliers, the distribution of marks on test D
is clearly less spread out than that of C.
● We want a measure of dispersion that will accurately give
a measure of the variability of the observations.
● We will concentrate now on the measure of dispersion
most commonly used, the standard deviation.
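
The weakness of the range can be sketched in Python. The marks below are hypothetical stand-ins for tests C and D (the original figures are not reproduced here); they were chosen only so that the ranges match the values 25 and 32 quoted above.

```python
def value_range(data):
    """Range = largest observation minus smallest observation."""
    return max(data) - min(data)

# Hypothetical marks: C is spread evenly, D is tightly bunched apart from two outliers.
test_c = [45, 48, 52, 55, 58, 60, 63, 66, 68, 70]
test_d = [40, 56, 57, 57, 58, 58, 58, 59, 59, 72]

range_c = value_range(test_c)
range_d = value_range(test_d)

# Dropping the smallest and largest mark of D shows how much two values drive its range.
range_d_without_extremes = value_range(sorted(test_d)[1:-1])

print(range_c, range_d, range_d_without_extremes)
```

In this made-up data the range of D is larger than that of C only because of its two extreme marks; without them the range of D falls to 3.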



STANDARD DEVIATION

● Suppose we have a set of data where there is no
variability in the observed values.
● Each observation would have the same value, say 3, 3, 3,
3 and the mean would be that same value, 3.
● No observation would differ, or deviate, from
the mean.
● Now suppose we have a set of observations where there
is variability. The observed values would deviate from the
mean by varying amounts.


● The standard deviation is a kind of average of these
deviations from the mean.
● This is best explained by considering the following example.
• Take, for example, the following grades of 6 students:
56 48 63 60 51 52.
• Mean = 55.
• To find how much our observed values deviate from the
mean, we subtract the mean from each.
• Observed values:       56  48  63  60  51  52
• Deviations from mean:  +1  −7  +8  +5  −4  −3


● We cannot, at this stage, simply take the average of the
deviations as their sum is zero.
(+1) + (−7) + (+8) + (+5) + (−4) + (−3) = 0
● We get around this difficulty by taking the square of the
deviations.
● This gets rid of the minus signs.
● Deviations:          +1  −7  +8  +5  −4  −3
● Squared deviations:   1  49  64  25  16   9


● We can now take the mean of these squared deviations.
This is called the variance.
• Variance = (1 + 49 + 64 + 25 + 16 + 9)/6 = 27.33.
● The variance is a very useful measure of dispersion for
statistical inference, but for our purposes it has a major
disadvantage. Because we squared the deviations, we
now have a quantity in square units. So to get the
measure of dispersion back into the same units as the
observed values, we define standard deviation as the
square root of the variance.
• Standard Deviation = √Variance =√27.33 = 5.228.
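
The steps of this worked example (deviations, squared deviations, variance, square root) can be reproduced with the following Python sketch. It divides by the number of observations, 6, exactly as the example does; the variable names are illustrative.

```python
import math

grades = [56, 48, 63, 60, 51, 52]

mean_grade = sum(grades) / len(grades)               # 330 / 6
deviations = [x - mean_grade for x in grades]        # these sum to zero
squared_deviations = [d ** 2 for d in deviations]    # squaring removes the minus signs
variance = sum(squared_deviations) / len(grades)     # mean of the squared deviations
standard_deviation = math.sqrt(variance)             # back in the original units

print(mean_grade, sum(deviations), round(variance, 2), round(standard_deviation, 3))
```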


● Standard deviation for samples
● The variance and the standard deviation of a sample are
computed as follows:
● The variance of a sample of size n is represented by s²
and is given by
$$ s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1} $$
● The standard deviation of a sample of size n is
represented by s and is given by
$$ s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}} $$


● Standard deviation for a population
● The variance and the standard deviation of a population are
computed as follows:
The variance of a population of size N is represented
by σ² and is given by
$$ \sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N} $$
The standard deviation of a population of size N is
represented by σ and is given by
$$ \sigma = \sqrt{\frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}} $$


● The mean for a sample consisting of n observations is
$$ \bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} $$
● and the mean for a population consisting of N
observations is
$$ \mu = \frac{\sum_{i=1}^{N} x_i}{N} $$
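
The difference between the sample and population formulas can be seen on the six grades used earlier. This is a sketch using Python's standard-library `statistics` module, in which `pvariance`/`pstdev` divide by N and `variance`/`stdev` divide by n − 1.

```python
import statistics

grades = [56, 48, 63, 60, 51, 52]

# Treating the six grades as a whole population: divide by N = 6.
population_variance = statistics.pvariance(grades)
population_sd = statistics.pstdev(grades)

# Treating the six grades as a sample: divide by n - 1 = 5.
sample_variance = statistics.variance(grades)
sample_sd = statistics.stdev(grades)

print(round(population_variance, 2), round(population_sd, 3))
print(round(sample_variance, 2), round(sample_sd, 3))
```

The sum of squared deviations is 164 in both cases; only the divisor changes, so the sample values are slightly larger.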


● The standard deviation may be thought of as the ‘give or
take’ number.
● That is, on average, the student’s grade will be 55, give or
take 5 marks.
● The standard deviation is a very good measure of
dispersion and is the one to use when the mean is used
as the measure of central tendency.



COEFFICIENT OF VARIATION

● The coefficient of variation is a helpful statistic for comparing
the degree of variation from one data series to another,
even when the means are considerably different from each
other.
● The standard deviation of an exponential distribution is
equal to its mean, making its coefficient of variation
equal to 1.
● Distributions with a coefficient of variation less than 1
are considered to be low-variance, whereas those with a CV
higher than 1 are considered to be high-variance.


● The coefficient of variation is equal to the standard
deviation divided by the mean. The result is usually
multiplied by 100 to express it as a percent.
● The coefficient of variation for a sample is given by
$$ CV = \frac{s}{\bar{x}} \times 100\% $$
● and the coefficient of variation for a population is given
by
$$ CV = \frac{\sigma}{\mu} \times 100\% $$


● Example:
▪ A national sampling of prices for new and used cars found that the mean
price for a new car is 20,100 SAR and the standard deviation is 6,125 SAR
and that the mean price for a used car is 5,485 SAR with a standard
deviation equal to 2,730 SAR.
▪ In terms of absolute variation, the standard deviation of price for new cars is
more than twice that of used cars.
▪ However, in terms of relative variation, there is more relative variation in the
price of used cars than in new cars.
▪ The CV for used cars is 2,730 / 5,485 = 49.8%
▪ and the CV for new cars is 6,125 / 20,100 = 30.5%
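
The two percentages in this example can be checked with a small Python sketch. The helper name `coefficient_of_variation` is illustrative only.

```python
def coefficient_of_variation(std_dev, mean):
    """Standard deviation divided by the mean, expressed as a percentage."""
    return std_dev / mean * 100

# Prices in SAR, taken from the example above.
cv_new = coefficient_of_variation(6125, 20100)
cv_used = coefficient_of_variation(2730, 5485)

print(f"new cars:  {cv_new:.1f}%")
print(f"used cars: {cv_used:.1f}%")
```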



Z SCORES

● A z score is the number of standard deviations that a given
observation, x, is below or above the mean.
● For sample data, the z score is
$$ z = \frac{x - \bar{x}}{s} $$
● and for population data, the z score is
$$ z = \frac{x - \mu}{\sigma} $$


● Example:
▪ The mean salary for deputies in Douglas County is $27,500 and the standard
deviation is $4,500.
▪ The mean salary for deputies in Hall County is $24,250 and the standard
deviation is $2,750.
▪ A deputy who makes $30,000 in Douglas County makes $1,500 more than a
deputy does in Hall County who makes $28,500. Which deputy has the
higher salary relative to the county in which he works?


● Example:
▪ For the deputy in Douglas County who makes $30,000, the z score is
$$ z = \frac{30000 - 27500}{4500} \approx 0.56 $$
▪ For the deputy in Hall County who makes $28,500, the z score is
$$ z = \frac{28500 - 24250}{2750} \approx 1.55 $$
▪ When the county of employment is taken into consideration, the $28,500
salary is a higher relative salary than the $30,000 salary.
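
The comparison of the two salaries can be sketched in Python as follows. The function name `z_score` is illustrative; the figures are those of the example.

```python
def z_score(x, mean, std_dev):
    """Number of standard deviations that x lies above (+) or below (-) the mean."""
    return (x - mean) / std_dev

z_douglas = z_score(30000, mean=27500, std_dev=4500)
z_hall = z_score(28500, mean=24250, std_dev=2750)

print(round(z_douglas, 2), round(z_hall, 2))
```

The Hall County salary lies about 1.55 standard deviations above its county mean, against about 0.56 for the Douglas County salary.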



THE INTERQUARTILE RANGE

● The interquartile range is another useful measure of
dispersion or spread.
● It is used when the median is used as the measure of
central tendency.
● It gives the range in which the middle 50% of the
distribution lies.
● In order to describe this in detail, we first need to discuss
what we mean by quartiles.



QUARTILES

● Suppose we start with a large set of data, say the heights
of all adult males in Sydney.
● We can represent these data in a graph, which if
smoothed out a bit, may look like:

● As the name ‘quartile’ suggests, we want to divide the
data into four equal parts.
● In the above example, we want to divide the area under
our curve into four equal areas.

THE SECOND QUARTILE OR MEDIAN

● It is easy to see how to divide the area in the last figure
into two equal parts, since the graph is symmetric.
● The point which gives us 50% of the area to the left of it
and 50% to the right of it is called the second quartile or
median.
● This is illustrated as follows:

● This exactly corresponds to our previous idea of median
as the middle value.

THE FIRST QUARTILE

● The first quartile is the point which gives us 25% of the
area to the left of it and 75% to the right of it.
● This means that 25% of the observations are less than or
equal to the first quartile and 75% of the observations are
greater than or equal to the first quartile.
● The first quartile is also called the 25th percentile.
● This is illustrated as follows :



THE THIRD QUARTILE

● The third quartile is the point which gives us 75% of the
area to the left of it and 25% of the area to the right of it.
● This means that 75% of the observations are less than or
equal to the third quartile and 25% of the observations are
greater than or equal to the third quartile.
● The third quartile is also called the 75th percentile.
● This is illustrated as follows :



SUMMARY

● The first (Q1), second (Q2) and third (Q3) quartiles
divide the distribution into four equal parts.
● This is illustrated as follows :

QUARTILES FOR SMALL DATA SETS

● Suppose now we have a small data set of twelve
observations which we can write in ascending order as
follows.
● A data set, where the number of observations is a multiple
of four, has been chosen to avoid some technical
difficulties.
15 18 19 20 20 20 21 23 23 24 24 25

● In this case, we want to divide the data into four equal
sets, so that there are 25% of the observations in each.


● First, we find the median just as we did previously.

● The median is 20.5 (half way between the 6th and 7th
observations), and divides the data into two equal sets
with exactly 50% of the observations in each: the 1st to
the 6th observations in the first set and the 7th to 12th
observations in the other.


● To find the first quartile we consider the observations less
than the median.
15 18 19 20 20 20
● The first quartile is the median of these data. In this case,
the first quartile is half way between the 3rd and 4th
observations and is equal to 19.5.
● Now, we consider the observations which are greater than
the median.
21 23 23 24 24 25


● The third quartile is the median of these data and is equal
to 23.5.
● So, for our small data set of 12 observations, the quartiles
divide the set into four subsets.

● We will now use the quartiles to define a measure of
spread called the interquartile range.


● The interquartile range quantifies the difference between
the third and first quartiles.
● If we were to remove the median (Q2) from the previous
figure, we would have a graph as follows:

● We see that 50% of the area is between the first and third
quartiles.
● This means that 50% of the observations lie between the
first and third quartiles.


● We define the interquartile range as:
The interquartile range = Third quartile − First quartile.
● For our small data set, the first quartile was 19.5 and our
third quartile was 23.5.
● So, the interquartile range is 23.5 − 19.5 = 4.
● We will use the interquartile range later to draw a box-plot.
● For now we are interested in it as a measure of spread.
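
The median-of-halves method used for the small data set can be sketched in Python as below. The function name `quartiles` is illustrative, and the handling of an odd number of observations (leaving the middle value out of both halves) is an assumption, since the slides deliberately avoid that case. Library routines such as `statistics.quantiles` use interpolation rules and may return slightly different values.

```python
import statistics

def quartiles(data):
    """Return (Q1, median, Q3) as the medians of the lower half, all data, and upper half."""
    ordered = sorted(data)
    n = len(ordered)
    lower_half = ordered[: n // 2]
    # Assumption: when n is odd, the middle observation belongs to neither half.
    upper_half = ordered[(n + 1) // 2 :]
    return (statistics.median(lower_half),
            statistics.median(ordered),
            statistics.median(upper_half))

data = [15, 18, 19, 20, 20, 20, 21, 23, 23, 24, 24, 25]
q1, q2, q3 = quartiles(data)
iqr = q3 - q1  # interquartile range = third quartile - first quartile

print(q1, q2, q3, iqr)
```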


● The interquartile range is particularly useful to describe
data sets where there are a few extreme values.
● Unlike the range, and to a lesser extent the standard
deviation, it is not sensitive to extreme values as it relies
on the spread of the middle 50% of the distribution.
● So, if there are data sets which have extreme values, it
can be more appropriate to use the median to describe
central tendency and the interquartile range to describe
the spread.

EXERCISES

● For the following data sets, calculate the quartiles and find
the interquartile range.
1. The following numbers represent the time in minutes
that twelve employees took to get to work on a particular
day. 18 34 68 22 10 92 46 52 38 29 45 37
2. The number of people killed in road traffic accidents
in New South Wales from 1989 to 1996 is given below.
960 797 663 652 560 619 623 583
Source: Statistics–A Powerful Edge, Australian Bureau of Statistics, 1998.

3. The following data are the final marks of 40 students for the
University Preparation Course, ‘Preparatory Mathematics’ in 1998.
61 77 51 85 55 77 70 56 41 61 28 87 23 22 86 63 99 94 38 25
90 59 87 53 29 86 33 87 75 50 59 77 77 71 99 78 70 93 78 93
Source: Mathematics Learning Centre, 1998.
4. The curve below represents the marks of a large number of students
on an English exam. Estimate the quartiles and calculate the
interquartile range.


● Solutions:
1. First quartile = 25.5, Median = 37.5, Third quartile = 49, IQR = 23.5.
2. First quartile = 601, Median = 637.5, Third quartile = 730, IQR = 129.
3. First quartile = 52, Median = 70.5, Third quartile = 86, IQR = 34.
4. Our estimate puts the first quartile at 40, the median at 50 and the third
quartile at 60. This gives an interquartile range of 20. This means that
the middle 50% of marks lie within 20 marks of each other.
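
Solutions 1 to 3 can be reproduced with the same median-of-halves approach. The sketch below is self-contained; the helper name `quartile_summary` is illustrative, and all three data sets have an even number of observations, so each half is simply the lower or upper half of the sorted data.

```python
import statistics

def quartile_summary(data):
    """Return (Q1, median, Q3, IQR) for a data set with an even number of observations."""
    ordered = sorted(data)
    half = len(ordered) // 2
    q1 = statistics.median(ordered[:half])
    q3 = statistics.median(ordered[half:])
    return q1, statistics.median(ordered), q3, q3 - q1

travel_times = [18, 34, 68, 22, 10, 92, 46, 52, 38, 29, 45, 37]
road_deaths = [960, 797, 663, 652, 560, 619, 623, 583]
final_marks = [61, 77, 51, 85, 55, 77, 70, 56, 41, 61, 28, 87, 23, 22, 86, 63, 99, 94, 38, 25,
               90, 59, 87, 53, 29, 86, 33, 87, 75, 50, 59, 77, 77, 71, 99, 78, 70, 93, 78, 93]

summary_1 = quartile_summary(travel_times)
summary_2 = quartile_summary(road_deaths)
summary_3 = quartile_summary(final_marks)

for summary in (summary_1, summary_2, summary_3):
    print(summary)
```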



ESTIMATES OF THE MEAN AND VARIANCE
● We have, so far, concerned ourselves with the mean, variance,
and standard deviation of a population.
● However, in statistics we are mainly concerned with analysing
data from a sample taken from a population, in order to make
inferences about that population.
● Our data sets are usually random samples drawn from the
population.
● When we have a random sample of size n, we use the sample
information to estimate the population mean and population
variance in the following way.

● The mean of a sample of size n is written as x̄ (read ‘x bar’).
● To find the sample mean we add up all the sample scores and
divide by the number of sample scores.
● This can be written using sigma notation as:
$$ \bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} $$

● The sample mean is used to estimate the population mean.
● If we took many samples of size n from the population,
calculated their sample means, and then averaged them, we
would get a value very close to the population mean.
● We say that the sample mean is an unbiased estimator of the
population mean.
● An estimate of the population variance from a sample of size n is
given by s², where
$$ s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1} $$


● Notice that we are dividing by n−1 instead of n as we did to
find the population variance.
● We need to do this because the value obtained if we divide by
n tends to underestimate the population variance.
● Calculated in this way, s² is an unbiased estimator of
population variance.
● In fact, s² can be described as the estimated population
variance.
● It is sometimes called the ‘sample variance’ but this is strictly
speaking not accurate.
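
The underestimation can be seen exactly on a tiny example. The sketch below assumes a hypothetical population of four values and lists every possible sample of size 2 drawn with replacement (all equally likely); it then averages the two candidate estimates over all of those samples.

```python
from itertools import product
from statistics import pvariance, variance

# Hypothetical population, small enough to list every possible sample.
population = [1, 2, 3, 4]
sample_size = 2

# All 16 ordered samples of size 2 drawn with replacement.
samples = list(product(population, repeat=sample_size))

# Average of each estimate over every possible sample.
average_dividing_by_n = sum(pvariance(s) for s in samples) / len(samples)
average_dividing_by_n_minus_1 = sum(variance(s) for s in samples) / len(samples)

true_variance = pvariance(population)
print(true_variance, average_dividing_by_n, average_dividing_by_n_minus_1)
```

In this sketch the estimates that divide by n−1 average out to the population variance, while those that divide by n average out to a smaller value.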

EXERCISES

1) The following values are the number of customers a restaurant
served for lunch on ten consecutive days: 46 50 51 60 62 64
72 41 53 55
Suppose that this data set is a random sample of 10 days
taken from a restaurant’s records. Calculate the estimated
population variance, s², for these data.
2) The raw scores that eight students got on a history test were:
69 84 93 61 79 88 57 67
Suppose that these data are a random sample of scores on a
history test. Calculate the mean, x̄, and the estimated
population standard deviation, s, of these data.


● Solutions:
1) s² = 84.93.
2) x̄ = 74.75, s = 13.14.
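
These answers can be checked with the standard-library `statistics` module, whose `variance` and `stdev` functions divide by n−1. The variable names below are illustrative.

```python
import statistics

customers = [46, 50, 51, 60, 62, 64, 72, 41, 53, 55]
history_scores = [69, 84, 93, 61, 79, 88, 57, 67]

# Exercise 1: estimated population variance (divides by n - 1).
s_squared = statistics.variance(customers)

# Exercise 2: sample mean and estimated population standard deviation.
x_bar = statistics.mean(history_scores)
s = statistics.stdev(history_scores)

print(round(s_squared, 2), x_bar, round(s, 2))
```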

THE BOX-PLOT
● The box-plot is another way of representing a data set
graphically.
● It is constructed using the quartiles, and gives a good indication
of the spread of the data set and its symmetry (or lack of
symmetry).
● It is a very useful method for comparing two or more data sets.
● The box-plot consists of a scale, a box drawn between the first
and third quartile, the median placed within the box, whiskers
on both sides of the box and outliers (if any).
● This is best illustrated using a diagram such as in the following
figure.

● The two dashed vertical lines in Figure 28 are the lower and
upper outlier thresholds and are not normally included in a box-
plot.
● The following data set was used to construct the box-plot in the
above figure :
57 46 61 66 48 59 55
56 60 49 44 53 68 57
55 54 49 50 52 54 62
59 51 52 53 54 47 53



CONSTRUCTING A BOX-PLOT

● Step 1:
▪ Order the data and calculate the quartiles.
44 46 47 48 49 49 50
51 52 52 53 53 53 54
54 54 55 55 56 57 57
59 59 60 61 62 66 68
▪ Now we calculate the median, the first quartile and the third quartile.
▪ For these data, median = 54, the first quartile = 50.5 and the third quartile
= 58.
▪ With this information we can begin to construct the box-plot.


● Step 2:
▪ Draw the scale and mark on the quartiles.
▪ Mark the median at the correct place above the scale with an asterisk, draw
a box around this asterisk with the left hand side of the box at the first
quartile, 50.5, and the right hand side of the box at the third quartile, 58.
▪ This is illustrated in the following figure.


● Step 3:
▪ Calculate the interquartile range and determine the position of the outlier
thresholds.
Interquartile range = third quartile − first quartile = 58 − 50.5 = 7.5.
▪ The position of the lower outlier threshold is found by subtracting the
interquartile range from the first quartile, 50.5 − 7.5 = 43.
▪ The position of the upper outlier threshold is found by adding the
interquartile range to the third quartile, 58 + 7.5 = 65.5. (Some texts add
or subtract 1.5 × interquartile range.)
▪ We now add the outlier thresholds to our diagram. This is illustrated in the
following figure.


● Step 4:
▪ Use the outlier thresholds to draw the whiskers.
▪ To draw the left hand whisker, we need the smallest data value that lies
inside the outlier thresholds.
▪ In this example, it is the value 44. This is drawn on our diagram with a
small cross level with the asterisk. A horizontal line is now drawn to the left
hand side of the box.
▪ To draw the right hand whisker, we find the largest data value that lies
inside the outlier thresholds.
▪ In this example, the value is 62. This is drawn on the right hand side of
the box with a small cross and connected to the box by a horizontal line.


This is illustrated in the following figure:


● Step 5:
▪ Determine the outliers and remove the outlier thresholds.
▪ Values (if any) that lie outside the outlier thresholds are called outliers. In
this example, 66 and 68 are outliers. These are placed on the diagram
using a small square or circle.
▪ Finally, the outlier thresholds are removed. The completed box-plot is
illustrated in the following figure:
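
The numbers needed for Steps 1 to 5 can be computed with the Python sketch below. It only derives the values (quartiles, thresholds, whisker ends, outliers) and does not draw the plot. It follows the slides in using 1 × the interquartile range for the thresholds; the variable names are illustrative.

```python
import statistics

marks = [57, 46, 61, 66, 48, 59, 55,
         56, 60, 49, 44, 53, 68, 57,
         55, 54, 49, 50, 52, 54, 62,
         59, 51, 52, 53, 54, 47, 53]

# Step 1: order the data and find the quartiles (28 values, so two halves of 14).
ordered = sorted(marks)
half = len(ordered) // 2
q1 = statistics.median(ordered[:half])
median = statistics.median(ordered)
q3 = statistics.median(ordered[half:])

# Step 3: interquartile range and outlier thresholds.
# The slides use 1 x IQR; some texts use 1.5 x IQR instead.
iqr = q3 - q1
lower_threshold = q1 - iqr
upper_threshold = q3 + iqr

# Step 4: whiskers end at the smallest and largest values inside the thresholds.
inside = [x for x in ordered if lower_threshold <= x <= upper_threshold]
left_whisker, right_whisker = inside[0], inside[-1]

# Step 5: values outside the thresholds are outliers.
outliers = [x for x in ordered if x < lower_threshold or x > upper_threshold]

print(q1, median, q3, iqr)
print(lower_threshold, upper_threshold, left_whisker, right_whisker, outliers)
```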



USING BOX-PLOTS TO COMPARE DATA SETS
● Box-plots are frequently used to compare data sets as the
differences in shape, spread and location are easily seen.
● For example, the following gives box-plots for the final marks of a
university preparation course, Preparatory Mathematics, for the years 1996,
1997 and 1998.

● The marks from all years are left-skewed, but those from 1996
and 1997 quite markedly so. 1996 had the highest median
score but the least spread.
● The marks from 1998 vary more than those in 1996 and 1997.
● In all years, over 75% of students passed the course.
