
Chapter 3 Descriptive Statistics

This document provides an overview of descriptive statistics including measures of central tendency (mean, median, mode), dispersion, and shape. It discusses how to calculate and interpret the mean, median, and mode for both continuous and grouped data. Examples are provided to demonstrate calculating each measure. The key learning outcomes are to be able to calculate and understand measures of central tendency, dispersion, and shape and use SPSS software to analyze summary statistics.

Uploaded by

G Gጂጂ Tube
Copyright
© All Rights Reserved

University of Gondar

College of Medicine and Health Sciences


Institute of Public Health
Department of Epidemiology and Biostatistics

Chapter three: Descriptive statistics


Prepared BY: Department of Epidemiology and Biostatistics

University of
Gondar, Ethiopia
Learning outcomes
• After completing this chapter, a student will be able to:

• List and calculate measures of central tendency

• List and calculate measures of dispersion

• List and calculate measures of shape

• Use SPSS software to compute summary measures

07/09/2021 2
A good average should possess the following properties:

• It should be rigidly defined.
• It should be based on all observations under investigation.
• It should be affected as little as possible by extreme observations.
• It should be capable of further algebraic treatment.
• It should be affected as little as possible by fluctuations of sampling.
• It should be easy to calculate and simple to understand.

Measures of Central Tendency / Measures of Location

• Measures of central tendency are the various methods of determining the value around which the data tend to concentrate.
• Hence, a measure of central tendency is a single value that sums up or describes the mass of the data.
• The measures of central tendency include:
• Mean,
• Median, and
• Mode.

Arithmetic Mean / Simple Mean ($\bar{x}$)
• Definition: the arithmetic mean is the sum of all observations divided by the number of observations. It is usually denoted by $\bar{x}$.

• If the $x_i$ are population observations, the population mean is:

$$\mu = \frac{\sum_{i=1}^{N} x_i}{N}$$

• Let $X_1, X_2, \ldots, X_n$ be the list of n measurements obtained from n subjects. Then the mean for ungrouped data is defined as:

$$\bar{x} = \frac{\sum_{i=1}^{n} X_i}{n}$$

Example
• Consider the data on birth weight (in kilograms) of 10 newborn children at the University of Gondar hospital:
2.51, 3.01, 3.25, 2.02, 1.98, 2.33, 2.33, 2.98, 2.88, 2.43.
Then the average birth weight can be computed as:

$$\bar{x} = \frac{2.51 + 3.01 + \cdots + 2.43}{10} = \frac{25.72}{10} = 2.572 \text{ kg}$$
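
The calculation can be checked with a short Python sketch. It applies the definition (sum divided by count) to the ten birth weights from the example and compares the result with the standard library's `statistics.mean`; the variable names are illustrative only.

```python
from statistics import mean

# Birth weights (kg) of the 10 newborns from the example.
birth_weights = [2.51, 3.01, 3.25, 2.02, 1.98, 2.33, 2.33, 2.98, 2.88, 2.43]

# Arithmetic mean by definition: sum of observations / number of observations.
average_weight = sum(birth_weights) / len(birth_weights)

print(round(average_weight, 3))        # mean from the definition
print(round(mean(birth_weights), 3))   # same value from the standard library
```
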

Arithmetic mean cont…
When the data are given in the form of a frequency distribution, i.e. there are k distinct values such that a value $X_i$ has frequency $f_i$ (i = 1, 2, …, k), the arithmetic mean is given by:

$$\bar{x} = \frac{\sum_{i=1}^{k} f_i X_i}{\sum_{i=1}^{k} f_i}$$

Example

Solution:

$$\bar{x} = \frac{\sum f_i \times x_i}{\sum f_i} = \frac{18 \times 2 + 19 \times 1 + 20 \times 4 + \cdots + 29 \times 12}{2 + 1 + 4 + \cdots + 12} = \frac{3180}{130} = 24.46$$

Exercise
• Consider the following frequency distribution table:

Data value   10   20   30   40   50   60   70   80   90   100   110
Frequency     3    5    6    8   10   12   15   10   10    12     5

Calculate the average of this data set.

Mean for Grouped Data

• In calculating the mean from grouped data, we assume that all values falling into a particular class interval are located at the midpoint of that interval. Therefore, the mean for grouped data is calculated as:

$$\bar{x} = \frac{\sum f_i \times x_m}{n}$$

where $x_m$ is the class midpoint, $f_i$ is the class frequency, and $n = \sum f_i$.

Mean for Grouped Data
Example: Calculate the mean for the grouped frequency distribution given below:

Class    Frequency
6-10     35
11-15    23
16-20    15
21-25    12
26-30    9
31-35    6

Example cont…
• Solution

Class    Class mid (Xm)   Frequency (fi)   fi × Xm
6-10     8                35               280
11-15    13               23               299
16-20    18               15               270
21-25    23               12               276
26-30    28               9                252
31-35    33               6                198
Total                     100              1,575

• Therefore,

$$\bar{x} = \frac{\sum f_i \times x_m}{n} = \frac{1575}{100} = 15.75$$

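
As a hedged illustration of the midpoint assumption, the sketch below recomputes the grouped mean from the class limits and frequencies in the table. The function name `grouped_mean` is an illustrative choice, not part of any library.

```python
# Class limits and frequencies from the grouped table in the example.
classes = [(6, 10), (11, 15), (16, 20), (21, 25), (26, 30), (31, 35)]
frequencies = [35, 23, 15, 12, 9, 6]


def grouped_mean(classes, frequencies):
    """Mean of grouped data, treating every value as its class midpoint."""
    midpoints = [(low + high) / 2 for low, high in classes]
    weighted_sum = sum(f * m for f, m in zip(frequencies, midpoints))
    return weighted_sum / sum(frequencies)


print(grouped_mean(classes, frequencies))
```

The same function also covers an ungrouped frequency distribution if each "class" is given with equal lower and upper limits.
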
Properties of the arithmetic mean
• The mean can be used as a summary measure for both discrete and continuous data; in general, however, it is not appropriate for nominal or ordinal data.
• For a given set of data there is one and only one arithmetic mean.
• The algebraic sum of the deviations of the given values from their arithmetic mean is always zero.
• The mean is used in computing other statistics, such as the variance.
• The mean is affected by extremely high or low values, called outliers, and may not be the appropriate average to use in these situations.

Reading assignment

• What is the geometric mean?
• What is the harmonic mean?
• Combined mean
• Weighted mean

Median
• An alternative measure of central location, perhaps second in popularity to the arithmetic mean.
• Suppose there are n observations in a sample, ordered from smallest to largest. The median is then defined as follows:
• The median is a value such that at least half of the observations are less than or equal to it and at least half of the observations are greater than or equal to it.
• The median is the midpoint of the data array.

Median
Ungrouped data
• To find the median of a data set:
• Arrange the data in ascending order.
• Find the middle observation of this ordered data.
• If the number of observations is odd, the median is the [(n+1)/2]th observation.
• If the number of observations is even, the median is the average of the two middle values, the (n/2)th and [(n/2)+1]th, i.e.

$$\text{Median} = \frac{x_{(n/2)} + x_{(n/2+1)}}{2}$$

Example 1 (n is even): 19, 20, 20, 21, 22, 24, 27, 27, 27, 34

• Here n = 10, so the median is the average of the 5th and 6th values: median = (22 + 24)/2 = 23

Example 2
The number of children with asthma during a specific year in seven local district clinics is shown. Find the median for this data set.
253, 125, 328, 417, 201, 70, 90
Solution:
First we must arrange the data in ascending order:
70, 90, 125, 201, 253, 328, 417
Here n = 7 is odd, so the median is the [(7+1)/2]th = 4th observation, i.e. the value 201.

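
The positional rule for odd and even n can be sketched in Python as follows. The helper `median_by_position` is a hypothetical name used only for illustration; `statistics.median` from the standard library gives the same results.

```python
from statistics import median


def median_by_position(values):
    """Median via the positional rule: sort, then take the middle value(s)."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # Odd n: the [(n+1)/2]th observation (index n//2 when counting from 0).
        return ordered[mid]
    # Even n: average of the (n/2)th and [(n/2)+1]th observations.
    return (ordered[mid - 1] + ordered[mid]) / 2


even_example = [19, 20, 20, 21, 22, 24, 27, 27, 27, 34]
asthma_counts = [253, 125, 328, 417, 201, 70, 90]

print(median_by_position(even_example), median(even_example))
print(median_by_position(asthma_counts), median(asthma_counts))
```
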
Exercise
• The actual waiting time for the first job for a selected sample of nine people with different fields of specialization is given below.

Waiting time (in months): 11.6, 11.3, 10.7, 18.0, 3.3, 9.2, 8.3, 3.8, 6.8

• Calculate the median of the waiting time.

Median cont…
Median for grouped data
• If the data are given in the form of a continuous frequency distribution, the median is defined as:

$$\text{Median} = L_{med} + \frac{W}{f_{med}}\left(\frac{n}{2} - f_c\right)$$

Where:
L_med = lower class boundary of the median class
f_med = frequency of the median class
f_c = cumulative frequency (less-than type) of the class preceding the median class
W = size (width) of the median class
n = total number of observations
Note: The median class is the class with the smallest cumulative frequency (less-than type) greater than or equal to n/2.

Median for grouped data cont…
Example: Find the median for the following distribution.

Class    Frequency
40-44    7
45-49    10
50-54    22
55-59    15
60-64    12
65-69    6
70-74    4

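Median for grouped data …
• Solution

Class    Frequency   Cumulative frequency
40-44    7           7
45-49    10          17
50-54    22          39
55-59    15          54
60-64    12          66
65-69    6           72
70-74    4           76
Total    76

Median in grouped data …
• n/2 = 76/2 = 38. The smallest cumulative frequency greater than or equal to 38 is 39, so the median class is 50-54.
• Hence L_med = 49.5, f_med = 22, f_c = 17 and W = 5, which gives

$$\text{Median} = 49.5 + \frac{5}{22}(38 - 17) = 49.5 + 4.77 \approx 54.27$$
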
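
A minimal sketch of the grouped-median calculation follows. It assumes the usual interpolation formula, Median = L_med + (W / f_med)(n/2 − f_c), and assumes integer class limits so that each lower class boundary lies 0.5 below the lower limit. The function name and argument layout are illustrative.

```python
def grouped_median(lower_limits, width, frequencies):
    """Median of grouped data by interpolation inside the median class."""
    n = sum(frequencies)
    half = n / 2
    cumulative = 0  # cumulative frequency of the classes before the current one
    for lower, f in zip(lower_limits, frequencies):
        if cumulative + f >= half:
            # Assumption: integer class limits, so the boundary is 0.5 lower.
            boundary = lower - 0.5
            return boundary + (width / f) * (half - cumulative)
        cumulative += f
    raise ValueError("frequencies must not be empty")


lower_limits = [40, 45, 50, 55, 60, 65, 70]
frequencies = [7, 10, 22, 15, 12, 6, 4]

print(round(grouped_median(lower_limits, 5, frequencies), 2))
```
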
Merits and Demerits of Median
Merits:

• The median is a positional average and hence is not influenced by extreme observations.
• It can be calculated for distributions with open-ended intervals.
• The median can be used as a summary measure for ordinal, discrete and continuous data; in general, however, it is not appropriate for nominal data.

Demerits:

• It is not a good representative of the data if the number of items is small.
• It is not amenable to further algebraic treatment.
• It is vulnerable to sampling fluctuations.

Mode
• The mode is the value that occurs most frequently in a set of values.
• The mode may not exist, and even if it does exist, it may not be unique.
• In the case of a discrete distribution, the value having the maximum frequency is the modal value.
• If all values in a set of observed values occur once or an equal number of times, there is no mode.
• Examples:

1. Find the mode of 5, 3, 5, 8, and 9. Mode = 5
2. Find the mode of 8, 9, 9, 7, 8, 2, 5. Mode = 8 and 9
3. Find the mode of 4, 12, 3, 6, and 7. No mode (the mode does not exist).

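
The three examples can be reproduced with a small Python sketch. The helper `modes` is a hypothetical function written for this slide's convention: it returns an empty list when every distinct value occurs equally often, whereas the standard library's `statistics.multimode` would return all values in that case.

```python
from collections import Counter


def modes(values):
    """Return the list of modes; empty if all values occur equally often."""
    if not values:
        return []
    counts = Counter(values)
    highest = max(counts.values())
    # Slide convention: no mode when all distinct values share one frequency.
    if len(counts) > 1 and all(c == highest for c in counts.values()):
        return []
    return sorted(v for v, c in counts.items() if c == highest)


print(modes([5, 3, 5, 8, 9]))        # one mode
print(modes([8, 9, 9, 7, 8, 2, 5]))  # two modes
print(modes([4, 12, 3, 6, 7]))       # no mode
```
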
Mode cont…

• NB: The mode for grouped data is the modal class. The modal class is the class with the largest frequency.

Mode for Grouped data
In grouped data, we usually refer to the modal class, the class with the highest frequency. If a single value for the mode of grouped data must be specified, it is taken as:

$$\text{Mode} = L + \frac{\Delta_1}{\Delta_1 + \Delta_2} \times w$$

Where:
L = the lower class boundary of the modal class
$\Delta_1 = f_{mod} - f_1$ and $\Delta_2 = f_{mod} - f_2$
w = the size of the modal class
f1 = frequency of the class preceding the modal class
f2 = frequency of the class succeeding the modal class
fmod = frequency of the modal class

The Mode…
Example: Calculate the modal age for the age distribution of 228 patients below.

Class interval   Number of women
15-19            6
20-24            19
25-29            50
30-34            57
35-39            48
40-44            27
45-49            21
Total            228

The Mode…
Solution: By inspection (simply looking at the frequencies), the mode lies in the fourth class, where L = 29.5, fmod = 57, f1 = 50, f2 = 48, w = 5, and

$$\Delta_1 = 57 - 50 = 7, \quad \Delta_2 = 57 - 48 = 9$$

Therefore, the modal age is

$$\hat{x} = 29.5 + \frac{7}{7 + 9} \times 5 = 29.5 + 2.2 = 31.7$$

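
The grouped-mode formula translates directly into a few lines of Python. This is a sketch only; the function name and parameter names are illustrative, and the inputs are the values read off the modal class in the example.

```python
def grouped_mode(lower_boundary, width, f_mod, f_prev, f_next):
    """Mode of grouped data: L + d1 / (d1 + d2) * w."""
    d1 = f_mod - f_prev  # excess of modal frequency over the preceding class
    d2 = f_mod - f_next  # excess of modal frequency over the succeeding class
    return lower_boundary + d1 / (d1 + d2) * width


modal_age = grouped_mode(lower_boundary=29.5, width=5, f_mod=57, f_prev=50, f_next=48)
print(round(modal_age, 1))
```
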
Properties of Mode
• The mode can be used as a summary measure for nominal, ordinal, discrete and continuous data; in general, however, it is more appropriate for nominal and ordinal data.
• It is not affected by extreme values.
• It can be calculated for distributions with open-ended classes.
• Sometimes its value is not unique.
• The main drawback of the mode is that it may not exist.

Merits and Demerits of Mode

Merits:
• It is not affected by extreme observations.
• It is easy to calculate and simple to understand.
• It can be calculated for distributions with open-ended classes.

Demerits:
• It is not rigidly defined.
• It is not based on all observations.
• It is not suitable for further mathematical treatment.
• It is not a stable average, i.e. it is affected by fluctuations of sampling to some extent.
• Often its value is not unique.

Quartiles

- Quartiles are measures that divide the frequency distribution into four equal parts.
- The values of the variable corresponding to these divisions are denoted Q1, Q2 and Q3, often called the first, second and third quartile respectively.
- Q1 is a value such that 25% of the items are less than or equal to it.
- Similarly, Q2 has 50% of the items with values less than or equal to it, and
- Q3 has 75% of the items with values less than or equal to it.
Quartile
• Steps to calculate quartiles for ungrouped data:
Arrange the data in increasing order.
If the number of observations is
A. odd: item,

B. Even:

Quartiles

$$Q_i = L_{Q_i} + \frac{W}{f_{Q_i}}\left(\frac{iN}{4} - C\right), \quad i = 1, 2, 3$$

Measure of variation/dispersion
Definition:

• The scatter or spread of the items of a distribution is known as dispersion or variation.
• In other words, the degree to which numerical data tend to spread about an average value is called the dispersion or variation of the data.
• Measures of dispersion are statistical measures that quantify the extent to which data are dispersed or spread out.

Measure of variation cont…
A good measure of variation should possess the following properties:
• It should be easy to compute and understand.
• It should be based on all observations.
• It should be uniquely defined.
• It should be capable of further algebraic treatment.
• It should be affected as little as possible by extreme values.

Measure of variation cont…
Absolute and relative measures

Measures of dispersion may be either absolute or relative.

1. Absolute measures of dispersion (AMD): An absolute measure is expressed in the same unit in which the original data are given, such as kilograms, tonnes, etc.
• These measures are suitable for comparing the variability in two distributions whose variables are expressed in the same units and have the same average size.
• These measures are not suitable for comparing the variability in two distributions whose variables are expressed in different units.
Measure of variation cont…

2. Relative measures of dispersion (RMD): used to compare the dispersion in two sets of data when the variables are measured in different units.

For example, we may wish to know, for a certain population, whether serum cholesterol levels, measured in milligrams per 100 ml, are more variable than body weight, measured in kilograms.

Furthermore, even when the same unit of measurement is used, the two measures of central tendency (means) may be quite different.
Types of measure of dispersion
Various measures of dispersion are in use. The most commonly used are:

Absolute measures of dispersion    Relative measures of dispersion
• Range                            • Relative range
• Variance                         • Coefficient of quartile deviation
• Quartile deviation               • Coefficient of mean deviation
• Mean deviation                   • Coefficient of variation
• Standard deviation               • Standard score
RANGE:
• It is the difference between the largest and smallest observations in the data.
• EXAMPLE: Consider the data on the weight (in kg) of 10 newborn children at the University of Gondar hospital within a month:
2.51, 3.01, 3.25, 2.02, 1.98, 2.33, 2.33, 2.98, 2.88, 2.43
Solution:
• The range for the data set can be computed by first arranging all observations in ascending order:
1.98, 2.02, 2.33, 2.33, 2.43, 2.51, 2.88, 2.98, 3.01, 3.25.
• Range = Maximum − Minimum = 3.25 − 1.98 = 1.27 kg

Range cont…

• Because it is based on only the two extreme cases in the entire distribution, the range may change considerably if either of the extreme cases happens to drop out, while the removal of any other case would not affect it at all.
• It wastes information, since it takes no account of the rest of the data.

Quartiles and Inter-quartile Range, Percentiles

• The inter-quartile range (IQR) is the difference between the third and the first quartiles.

• Example: Consider the age data of 15 patients to find the IQR:

35 35 36 37 37 38 42 43 43 44 45 48 48 51 55

Q1 = 37 (4th value), Q2 = 43 (8th value), Q3 = 48 (12th value)

• IQR = Q3 − Q1 = 48 − 37 = 11

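
The quartiles in this example can be reproduced with `statistics.quantiles` (Python 3.8 or later). The sketch assumes the "exclusive" method, which places the i-th quartile at position i(n+1)/4 in the ordered data; other software, and other quartile conventions, may give slightly different values for other data sets.

```python
from statistics import quantiles

# Ages of the 15 patients from the example.
ages = [35, 35, 36, 37, 37, 38, 42, 43, 43, 44, 45, 48, 48, 51, 55]

# Assumption: quartile i sits at position i * (n + 1) / 4 ("exclusive" method).
q1, q2, q3 = quantiles(ages, n=4, method="exclusive")
iqr = q3 - q1

print(q1, q2, q3, iqr)
```
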
Quartile deviation (QD)
• The range expresses the extreme variability of the observations of a variable.
• The quartile deviation is half of the inter-quartile range.
• Inter-quartile range = Q3 − Q1

$$QD = \frac{\text{Inter-quartile range}}{2} = \frac{Q_3 - Q_1}{2}$$

Coefficient of quartile deviation (CQD)
It gives the average amount by which the two quartiles differ from the median.

$$CQD = \frac{Q_3 - Q_1}{Q_3 + Q_1}$$

Variance and Standard Deviation
• Variance measures how far, on average, scores deviate or differ from the mean.
• Variance is the average of the squared distances of each value from the mean.

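Variance
• The sample variance, denoted by $s^2$, of a set of n observed values having mean $\bar{x}$ is the sum of the squared deviations from the mean divided by n − 1:

$$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$$

• For a frequency distribution it is expressed as:

$$s^2 = \frac{\sum_{i=1}^{k} f_i (x_i - \bar{x})^2}{n - 1}, \quad n = \sum_{i=1}^{k} f_i$$
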
Standard deviation
• The variance has a drawback: because the deviations are squared, its units are also squared. To return to the original unit of measurement, we take the square root of the variance:

$$s = \sqrt{s^2}$$

Standard deviation cont…
• Consider the following three data sets:

• Dataset 1: 7, 7, 7, 7, 7, 7; mean = 7, s.d. = 0
• Dataset 2: 6, 7, 7, 7, 7, 8; mean = 7, s.d. = 0.63
• Dataset 3: 3, 2, 7, 8, 9, 13; mean = 7, s.d. = 4.05

• The three data sets have the same mean but different variation.

Example 1
Find the variance and standard deviation of the data set given below.
35, 45, 30, 35, 40, 25
Solution
First, find the mean:

$$\bar{x} = \frac{35 + 45 + 30 + 35 + 40 + 25}{6} = \frac{210}{6} = 35$$

Next, subtract the mean from each value and square the result:

X     X − x̄    (X − x̄)²
35    0        0
45    10       100
30    −5       25
35    0        0
40    5        25
25    −10      100

Cont…
• Sum up all the squared values: 0 + 100 + 25 + 0 + 25 + 100 = 250
• Divide the sum by (n − 1) to get the variance: s² = 250/5 = 50
• Take the square root of the variance to get the standard deviation: s = √50 ≈ 7.07

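
The steps of Example 1 can be followed in a short Python sketch. It computes the sample variance (divisor n − 1) step by step and compares the result with `statistics.variance` and `statistics.stdev`, which use the same divisor.

```python
from math import sqrt
from statistics import mean, stdev, variance

data = [35, 45, 30, 35, 40, 25]

# Step 1: the mean.
x_bar = mean(data)

# Step 2: squared deviations from the mean.
squared_deviations = [(x - x_bar) ** 2 for x in data]

# Step 3: sample variance = sum of squared deviations / (n - 1).
s2 = sum(squared_deviations) / (len(data) - 1)

# Step 4: standard deviation = square root of the variance.
s = sqrt(s2)

print(x_bar, s2, round(s, 2))
print(variance(data), round(stdev(data), 2))  # library cross-check
```
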
Exercise
• The areas of sprayable surfaces with DDT from a sample of 15 houses are measured as follows (in m²):

101, 105, 110, 114, 115, 124, 125, 125, 130, 133, 135, 136, 137, 140, 145

Find the variance and standard deviation of the given data set.

Example 2
• Find the variance and the standard deviation for the frequency distribution given below.

Class          Frequency   Midpoint
5.5 – 10.5     1           8
10.5 – 15.5    2           13
15.5 – 20.5    3           18
20.5 – 25.5    5           23
25.5 – 30.5    4           28
30.5 – 35.5    3           33
35.5 – 40.5    2           38

Cont…
• Solution: the mean is x̄ = 490/20 = 24.5.

Class        Frequency (fi)   Midpoint (xm)   fi·xm   fi·(xm − x̄)²
5.5-10.5     1                8               8       1 × (8 − 24.5)² = 272.25
10.5-15.5    2                13              26      2 × (13 − 24.5)² = 264.5
15.5-20.5    3                18              54      3 × (18 − 24.5)² = 126.75
20.5-25.5    5                23              115     5 × (23 − 24.5)² = 11.25
25.5-30.5    4                28              112     4 × (28 − 24.5)² = 49
30.5-35.5    3                33              99      3 × (33 − 24.5)² = 216.75
35.5-40.5    2                38              76      2 × (38 − 24.5)² = 364.5
Total        n = 20                           490     1,305

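Cont…
• Therefore, the variance is calculated from the formula:

$$s^2 = \frac{\sum f_i (x_m - \bar{x})^2}{n - 1} = \frac{1305}{19} \approx 68.68$$

• The standard deviation is the square root of the variance:

$$s = \sqrt{68.68} \approx 8.29$$
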
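
The grouped calculation can be checked with the sketch below, which uses the class midpoints and frequencies from Example 2 and the n − 1 divisor used for the sample variance in this chapter. Variable names are illustrative.

```python
from math import sqrt

# Midpoints and frequencies from the frequency distribution in Example 2.
midpoints = [8, 13, 18, 23, 28, 33, 38]
frequencies = [1, 2, 3, 5, 4, 3, 2]

n = sum(frequencies)
grouped_mean = sum(f * m for f, m in zip(frequencies, midpoints)) / n

# Sum of frequency-weighted squared deviations from the grouped mean.
sum_sq = sum(f * (m - grouped_mean) ** 2 for f, m in zip(frequencies, midpoints))

grouped_variance = sum_sq / (n - 1)
grouped_sd = sqrt(grouped_variance)

print(grouped_mean, sum_sq, round(grouped_variance, 2), round(grouped_sd, 2))
```
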
Special properties of standard deviation/variance

1. If the standard deviation of $x_1, x_2, \ldots, x_n$ is s, then the standard deviation of
a) $x_1 + k, x_2 + k, x_3 + k, \ldots, x_n + k$ will also be s
b) $kx_1, kx_2, kx_3, \ldots, kx_n$ will be $|k|s$
c) $a + kx_1, a + kx_2, a + kx_3, \ldots, a + kx_n$ will be $|k|s$

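
These properties can be verified numerically. The sketch below uses Dataset 3 from the earlier slide (3, 2, 7, 8, 9, 13) and two arbitrary constants, a = 10 and k = −3, chosen purely for illustration; a negative k shows why the absolute value is needed.

```python
from math import isclose
from statistics import stdev

x = [3, 2, 7, 8, 9, 13]
a, k = 10, -3  # arbitrary illustrative constants

s = stdev(x)
s_shifted = stdev([xi + k for xi in x])     # adding a constant
s_scaled = stdev([k * xi for xi in x])      # multiplying by a constant
s_linear = stdev([a + k * xi for xi in x])  # shifting and scaling together

print(isclose(s_shifted, s))           # shift leaves s unchanged
print(isclose(s_scaled, abs(k) * s))   # scale multiplies s by |k|
print(isclose(s_linear, abs(k) * s))   # linear change multiplies s by |k|
```
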
Special properties of standard deviation

2. If a sample of $n_1$ observations has variance $S_1^2$ and a sample of $n_2$ observations has variance $S_2^2$, then the combined variance, called the pooled variance ($S_p^2$), is given by:

$$S_p^2 = \frac{(n_1 - 1)S_1^2 + (n_2 - 1)S_2^2}{n_1 + n_2 - 2}$$

Coefficient of variation
• The standard deviation is an absolute measure of the deviation of observations around their mean and is expressed in the same unit as the data.
• Because of this, the standard deviation cannot be used directly to compare variability between data sets.
• The coefficient of variation is often used for this purpose.
• The coefficient of variation (CV) is defined by:

$$CV = \frac{S}{\bar{x}} \times 100\%$$

• The coefficient of variation is most useful in comparing the variability of several different samples, each with a different mean.

Examples:
1. An analysis of the monthly wages paid (in Birr) to
workers in two firms A and B belonging to the same
pharmaceutical industry gives the following results

Coefficient of variation cont…
Exercise
2. A meteorologist interested in the consistency of temperatures in three cities during a given week collected the following data. The temperatures for five days of the week in the three cities were:

City 1: 25 24 23 26 17
City 2: 22 21 24 22 20
City 3: 32 27 35 24 28

Which city has the most consistent temperature, based on these data?

When to use the coefficient of variation
• When the comparison groups have very different means (the CV is suitable because it expresses the standard deviation relative to its corresponding mean).
• When different units of measurement are involved, e.g. group 1 is measured in mm and group 2 in gm (the CV is suitable for comparison because it is unit-free).
• In such cases, the standard deviation should not be used for comparison.

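
To illustrate why the CV is unit-free, the sketch below computes CV = (s / mean) × 100 for two made-up samples measured in different units. The data values are hypothetical and serve only as an example; they do not come from the chapter.

```python
from statistics import mean, stdev


def coefficient_of_variation(values):
    """CV in percent: sample standard deviation relative to the mean."""
    return stdev(values) / mean(values) * 100


# Hypothetical measurements in two different units.
lengths_mm = [152, 160, 148, 155, 165]
weights_gm = [48, 52, 50, 61, 39]

cv_length = coefficient_of_variation(lengths_mm)
cv_weight = coefficient_of_variation(weights_gm)

# Changing the unit (mm to cm) rescales both s and the mean, so the CV is unchanged.
cv_length_cm = coefficient_of_variation([v / 10 for v in lengths_mm])

print(round(cv_length, 1), round(cv_weight, 1), round(cv_length_cm, 1))
```

The sample with the larger CV is the more variable one relative to its own mean, regardless of the units involved.
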
Exercise
1. Based on the data set given below:
a. Calculate the mean, median and mode.
b. Calculate the variance, standard deviation and coefficient of variation.
15, 7, 13, 9, 10, 11
2. Calculate the variance and standard deviation for the data set given below:
5, 17, 12, 10, 8

Standard Score
If X is a measurement from a distribution with mean $\bar{X}$ and standard deviation S, then its value in standard units is

$$Z = \frac{X - \bar{X}}{S}$$

• Z gives the deviation from the mean in units of standard deviation.
• Z gives the number of standard deviations by which a particular observation lies above or below the mean.
• It is used to compare two observations coming from different groups.
Standard score cont…
Examples:
1. Two sections were given an Introduction to Biostatistics examination. The following information was given.

Value   Section 1   Section 2
Mean    78          90
SD      6           5

Student A from section 1 scored 90 and student B from section 2 scored 95. Relatively speaking, who performed better?
Cont…
Solution:
Calculate the standard score of both students:

$$Z_A = \frac{90 - 78}{6} = 2, \qquad Z_B = \frac{95 - 90}{5} = 1$$

• Student A performed better relative to his section, because the score of student A is two standard deviations above the mean score of his section, while the score of student B is only one standard deviation above the mean score of his section.
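
The comparison can be expressed as a small Python sketch using the standard score Z = (X − mean) / SD. The function name `z_score` is an illustrative choice.

```python
def z_score(x, mean, sd):
    """Number of standard deviations by which x lies above (+) or below (-) the mean."""
    return (x - mean) / sd


z_a = z_score(90, mean=78, sd=6)  # student A, section 1
z_b = z_score(95, mean=90, sd=5)  # student B, section 2

print(z_a, z_b)
print("A" if z_a > z_b else "B", "performed better relative to the section")
```
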
Standard score cont…
Exercise: Two groups of people were trained to perform a certain task and tested to find out which group is faster to learn the task. For the two groups the following information was given:

Value       Group one   Group two
Mean        10.4 min    11.9 min
Stand.dev.  1.2 min     1.3 min

Relatively speaking:
a) Which group is more consistent in its performance?
b) Suppose person A from group one takes 9.2 minutes while person B from group two takes 9.3 minutes. Who was faster in performing the task? Why?
Moments

• The rth moment about the mean (the rth central moment) is defined as:

$$m_r = \frac{\sum (X_i - \bar{X})^r}{n}, \quad r = 0, 1, 2, \ldots$$

• For continuous grouped data it is given by:

$$m_r = \frac{\sum f_i (X_i - \bar{X})^r}{n}$$

where the $X_i$ are class marks.

Example:

Find the first three central moments of the numbers 2, 3 and 7.
Solution: mean = (2 + 3 + 7)/3 = 4

$$m_1 = \frac{\sum (X_i - \bar{X})^1}{n} = \frac{(2 - 4) + (3 - 4) + (7 - 4)}{3} = 0$$

$$m_2 = \frac{\sum (X_i - \bar{X})^2}{n} = \frac{(2 - 4)^2 + (3 - 4)^2 + (7 - 4)^2}{3} = 4.67$$

$$m_3 = \frac{\sum (X_i - \bar{X})^3}{n} = \frac{(2 - 4)^3 + (3 - 4)^3 + (7 - 4)^3}{3} = 6$$

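
The central moments in this example can be reproduced with a short function. This is a sketch for ungrouped data using the divisor n, as in the definition of the rth central moment; `central_moment` is a hypothetical helper name.

```python
def central_moment(values, r):
    """r-th moment about the mean for ungrouped data (divisor n)."""
    n = len(values)
    x_bar = sum(values) / n
    return sum((x - x_bar) ** r for x in values) / n


numbers = [2, 3, 7]
m1 = central_moment(numbers, 1)
m2 = central_moment(numbers, 2)
m3 = central_moment(numbers, 3)

print(m1, round(m2, 2), m3)
```
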
Measures of shape
a. Skewness
• Skewness is the degree of asymmetry, or departure from symmetry, of a distribution.
• A skewed frequency distribution is one that is not symmetrical.
• Skewness is concerned with the shape of the curve, not its size.
• If the frequency curve (smoothed frequency polygon) of a distribution has a longer tail to the right of the central maximum than to the left, the distribution is said to be skewed to the right, or to have positive skewness.

Concept of skewness
• The skewness of a distribution is defined as the lack
of symmetry.
• In a symmetrical distribution, mean, median, and
mode are equal to each other.

Skewness
• If the frequency curve has a longer tail to the left of the central maximum than to the right, the distribution is said to be skewed to the left, or to have negative skewness.
• For a moderately skewed distribution, the following relation holds among the three commonly used measures of central tendency:
Mean − Mode = 3 × (Mean − Median)

Skewness
Measures of Skewness
Karl Pearson's coefficient of skewness (SK):

$$S_k = \frac{\text{Mean} - \text{Mode}}{\text{Standard deviation}} \quad \text{or} \quad S_k = \frac{3(\text{Mean} - \text{Median})}{\text{Standard deviation}}$$

If SK = 0, then the distribution is symmetrical.
If SK > 0, then the distribution is positively skewed.
If SK < 0, then the distribution is negatively skewed.

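
As a hedged illustration, the median-based form of Pearson's coefficient, Sk = 3(Mean − Median) / Standard deviation, can be computed as below. The data are Dataset 3 from the standard deviation slide (3, 2, 7, 8, 9, 13), and the sample standard deviation (divisor n − 1) is assumed.

```python
from statistics import mean, median, stdev


def pearson_skewness(values):
    """Pearson's median-based coefficient: 3 * (mean - median) / sd."""
    return 3 * (mean(values) - median(values)) / stdev(values)


data = [3, 2, 7, 8, 9, 13]
sk = pearson_skewness(data)

print(round(sk, 2))
print("negatively skewed" if sk < 0 else "positively skewed" if sk > 0 else "symmetrical")
```
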
Remarks Related to Skewness
• In a positively skewed distribution, smaller observations are more frequent than larger observations, i.e. the majority of the observations have a value below the average, and the distribution has a long tail in the positive direction.
• In a negatively skewed distribution, smaller observations are less frequent than larger observations, i.e. the majority of the observations have a value above the average.

Kurtosis
• Kurtosis is the degree of peakedness of a distribution, usually taken relative to a normal distribution.
• The peakedness of a distribution can be classified into three types:
• Leptokurtic:
- A distribution with a relatively high peak
- A large number of observations have the same values
• Mesokurtic:
- Normal peak
- The curve is properly peaked
• Platykurtic:
- Flat-topped
- A large number of observations with low frequency are spread over the middle interval

Measures of kurtosis

• The moment coefficient of kurtosis:

$$\beta_2 = \frac{m_4}{m_2^2}$$

• If $\beta_2 = 3$, then the distribution is mesokurtic.
• If $\beta_2 > 3$, then the distribution is leptokurtic.
• If $\beta_2 < 3$, then the distribution is platykurtic.

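
A minimal sketch of the moment coefficient of kurtosis, computed as the fourth central moment divided by the square of the second (both with divisor n). The data are Dataset 3 from the standard deviation slide, used here only as an illustration; with so few observations the classification is not meaningful in practice.

```python
def central_moment(values, r):
    """r-th moment about the mean for ungrouped data (divisor n)."""
    n = len(values)
    x_bar = sum(values) / n
    return sum((x - x_bar) ** r for x in values) / n


def moment_kurtosis(values):
    """Moment coefficient of kurtosis: m4 / m2**2."""
    return central_moment(values, 4) / central_moment(values, 2) ** 2


data = [3, 2, 7, 8, 9, 13]
beta_2 = moment_kurtosis(data)

if beta_2 > 3:
    shape = "leptokurtic"
elif beta_2 < 3:
    shape = "platykurtic"
else:
    shape = "mesokurtic"

print(round(beta_2, 2), shape)
```
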
Thank You
