Biostatistics-1

The document outlines the fundamentals of biostatistics, covering definitions, types of data, descriptive statistics, probability distributions, and statistical inference. It emphasizes the importance of data collection and analysis in medical and biological contexts, detailing methods for summarizing and interpreting data through various statistical measures. Key topics include measures of central tendency, dispersion, and the application of statistical tests for hypothesis testing and association between variables.

Outline of Biostatistics

Dr. M.M. Mweu
MBChB Level IV Biostatistics
21 November, 2023
Outline
Introduction to biostatistics:
  Definitions, types of data; descriptive statistics for quantitative & qualitative data: measures of central tendency, dispersion, data presentation (frequency tables, graphs)

Probability distributions
  Introduction to probability (mutually exclusive & independent events); Normal, Binomial & Poisson distributions

Statistical Inference: Single mean & proportion
  Sampling variability of a single mean and proportion, including confidence intervals and hypothesis tests; decision errors (Type I & II errors)

Statistical Inference: Two means & proportions
  Confidence intervals and hypothesis tests for two means and proportions; paired t-test for paired data

Association between two categorical variables
  Chi-squared test
Introduction
• Statistics is a field of study concerned with:
  - Collection, organisation, summarisation and analysis of data
  - Drawing of inferences about a body of data when only a part of the data is observed
• Biostatistics applies statistical methods to medical and biological problems
• Data are numbers resulting from a measurement (e.g. weighing a patient or taking a temperature) and contain information – thus statistics seeks to investigate and evaluate the meaning of this info
• Two branches:
  - Descriptive stats – deals with organising and summarising data
  - Analytic/inferential stats – deals with making inferences from samples about populations and drawing conclusions from data
• Population – the collection of all the individuals that we wish to say something about; a set representing all measurements of interest to the investigator
Introduction
• Sample – a part of the population on which measurements are taken; a subset of measurements selected from the pop of interest
• An example from the variable height:
  - Population: measurement of the height of everyone in a room (census). Summarising the measurements by calculating the average height gives a parameter value – an exact value (ignoring measurement error) where no estimation is involved
  - Sample: measurement of the height of some people in the room. The average height of the sample constitutes a statistic – our ‘guess’ at the parameter value
• Nutshell: parameters are numerical descriptive measures computed from population measurements whilst statistics are computed from sample measurements
• The field of statistics is dedicated to understanding how well ‘statistics’ estimate ‘parameters’
Introduction
• Data are often necessary to answer key questions, e.g. clinicians faced with a choice among competing treatment regimens
• Sources of data:
  - Routinely kept records – usually the first stopover in the quest for data, e.g. hospital medical records containing patient info
  - Surveys – often the logical source when recorded data are absent/insufficient, e.g. if a hospital administrator seeks to understand the mode of transportation used by patients to visit a hospital and admission forms don’t contain a question on transportation modes, then a survey amongst patients is necessary
  - Experiments – the investigator conducts an experiment, e.g. a nurse may wish to know which of several strategies is best for maximising patient compliance. Here, different strategies of motivating compliance are tried with different patients and responses to the different strategies evaluated to determine which strategy is most effective
  - External sources – data may already exist in the form of published reports, data banks (repositories) or research literature. These data may be reanalysed to answer key questions
Introduction
• Variable – an attribute or characteristic that takes different values from one individual to another
• Constant – takes the same value for all the individuals
• A variable is random if the value it takes for a particular individual is a chance or random event, i.e. the value cannot be exactly predicted in advance
• Types of variables:
  - Quantitative: measurable in terms of magnitude
    - Continuous – can take on any value in a given range, e.g. height, blood pressure, temperature
    - Discrete – can take on only whole numbers, e.g. counts of lesions
  - Qualitative: variables not measurable or described in magnitude – often called factors
    - Ordinal – have an order but no magnitude, e.g. education levels (primary/secondary/tertiary), severity (mild/moderate/severe), prognosis (good/fair/poor), socioeconomic status (low/medium/high)
    - Nominal – categorised without any implied order, e.g. sex, dead/alive, ethnicity, religion, marital status
Descriptive statistics
Quantitative variables
• We often summarise data using a single number – a descriptive measure
• Descriptive measures for quantitative data are:
  - Central tendency (location) – measures the centre of the distribution
    - Often the location where data tend to cluster
    - Captures the central/average value of a set of data
  - Dispersion (spread) – refers to the average distance between observations
    - Indicates the magnitude by which individuals tend to differ from each other
    - By determining how far an observation lies from the rest we can deduce whether such an observation is unusual (an outlier) or not
Descriptive statistics
Quantitative variables
Central tendency measures

• Most commonly used are: mean, median and mode

Arithmetic mean
• Considered the most familiar measure of central tendency
• The term ‘arithmetic’ distinguishes it from other means (we restrict ourselves to this only)
• The sample mean x̄ computed from a sample of values is expressed by:

  x̄ = (Σᵢ₌₁ⁿ xᵢ) / n

Where xᵢ is the typical value of a random variable and n denotes the no. of values in the sample
Descriptive statistics
Quantitative variables
Central tendency measures

Arithmetic mean

Example:
The plasma volumes of 8 healthy men are: 2.75, 2.86, 3.37, 2.76, 2.62, 3.49, 3.05, 3.12 litres

  x̄ = (2.75 + 2.86 + 3.37 + 2.76 + 2.62 + 3.49 + 3.05 + 3.12) / 8 = 3.0025 litres

• Properties:
  - Uniqueness – for a given set of data there’s only one mean
  - Simplicity – the mean is easily understood and easy to compute
  - Prone to distortion by extreme values, since each value in a dataset enters into the computation of the mean
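The worked example above can be checked with a short Python sketch (the data are the eight plasma volumes from the example):

```python
# Plasma volumes (litres) of the 8 healthy men in the example
volumes = [2.75, 2.86, 3.37, 2.76, 2.62, 3.49, 3.05, 3.12]

# Arithmetic mean: sum of the values divided by the number of values
mean = sum(volumes) / len(volumes)
print(round(mean, 4))  # 3.0025
```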
Descriptive statistics
Quantitative variables
Central tendency measures
Median

• Value that divides a set of values into two equal parts such that the no. of values equal to or greater than the median is equal to the no. of values equal to or less than the median
• If the no. of values is odd then the median is the middle value when all values are arranged in order of magnitude
• If the no. of values is even, then the median is the mean of the two middle values when all values are arranged in order of magnitude
• Properties:
  - Uniqueness – as with the mean, there’s only a single median for a given set of data
  - Simplicity – is easy to calculate
  - Not drastically affected by extreme values, unlike the mean
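The odd/even rule translates directly into code; a minimal sketch (the function name is illustrative):

```python
def median(values):
    """Middle value of the sorted data; mean of the two middle values when n is even."""
    s = sorted(values)
    n = len(s)
    mid = n // 2
    if n % 2 == 1:                       # odd no. of values: the single middle value
        return s[mid]
    return (s[mid - 1] + s[mid]) / 2     # even: mean of the two middle values

print(median([3, 1, 7]))     # 3
print(median([3, 1, 7, 5]))  # 4.0
```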
Descriptive statistics
Quantitative variables
Central tendency measures
Mode

• Is the value which occurs most frequently in a set of data
• If all the values are different then there’s no mode
• However, a set of values may have more than one mode
Example:
The number of days spent in hospital by 17 subjects after an operation are:
3, 4, 4, 6, 8, 8, 8, 10, 10, 12, 14, 14, 17, 25, 27, 37, 42

Mode is: 8
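Python's statistics module finds modes directly; `multimode` returns a list, so it also handles datasets with more than one mode (and returns every value when all values differ, the ‘no mode’ case above):

```python
from statistics import multimode

# Days spent in hospital by the 17 subjects in the example
days = [3, 4, 4, 6, 8, 8, 8, 10, 10, 12, 14, 14, 17, 25, 27, 37, 42]

print(multimode(days))          # [8] -- 8 occurs most often (three times)
print(multimode([1, 1, 2, 2]))  # [1, 2] -- two modes
```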

• When mean, median and mode are equal we get a symmetric distribution termed the bell-shaped curve – the most common being the normal distribution
• Data distributions may be classified as to whether they are symmetric or asymmetric (skewed):
  - Symmetric – left half of the graph (histogram/frequency polygon) is a mirror image of the right half
  - Right/positive skewed – distribution has a long tail to the right
  - Left/negative skewed – distribution has a long tail to the left
Descriptive statistics
Quantitative variables
Central tendency measures

[Figure: sketches of a negatively skewed distribution, a normal (symmetric) distribution and a positively skewed distribution, showing the relative positions of the mean, median and mode]

Dispersion measures
• Dispersion conveys info regarding the amount of variability present in a dataset
• If all values are the same there’s no dispersion
• The amount of dispersion may be small when the values – though different – are close together
Descriptive statistics
Quantitative variables
Dispersion measures

• There are a number of measures: range, variance, standard deviation, coefficient of variation, interquartile range

Range
• Is the difference betwn the largest and smallest value in a set of observations
• If we denote the range by R, the largest (maximum) value by x_L and the smallest (minimum) value by x_s, then the range is given by:

  R = x_L − x_s

• Since the range expressed as a single number imparts minimal info about a dataset, it is often preferable to express the range as a number pair: (x_s, x_L) – conveys more info
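Both forms of the range are one-liners (again using the plasma-volume data from earlier):

```python
obs = [2.75, 2.86, 3.37, 2.76, 2.62, 3.49, 3.05, 3.12]

r = max(obs) - min(obs)       # range as a single number
pair = (min(obs), max(obs))   # the more informative (min, max) pair

print(round(r, 2), pair)      # 0.87 (2.62, 3.49)
```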

Descriptive statistics
Quantitative variables
Dispersion measures

Variance
• Expresses dispersion relative to the scatter of values about their mean
• The sample variance s² is computed by subtracting the mean from each of a set of values, then squaring the resulting differences, followed by adding up the squared differences
• The resulting sum is then divided by the sample size minus 1:

  s² = Σᵢ₌₁ⁿ (xᵢ − x̄)² / (n − 1)

Where n − 1 is referred to as degrees of freedom (DF)

NB: the rationale for DF:
The sum of the deviations of the values from their mean is equal to zero. If then we know the values of n − 1 of the deviations from the mean, we know the nth one since it is automatically determined by the necessity for all n deviations to sum to zero
Descriptive statistics
Quantitative variables
Dispersion measures
Standard deviation
• The variance represents squared units and hence is not appropriate when we want to express dispersion in terms of the original units
• To obtain dispersion in the original units we take the square root of the variance – the standard deviation (s):

  s = √( Σᵢ₌₁ⁿ (xᵢ − x̄)² / (n − 1) )

Back to the example on plasma volumes…
Descriptive statistics
Quantitative variables
Dispersion measures
Standard deviation

    x      x − x̄     (x − x̄)²
   2.75   −0.2525    0.0638
   2.86   −0.1425    0.0203
   3.37    0.3675    0.1351
   2.76   −0.2425    0.0588
   2.62   −0.3825    0.1463
   3.49    0.4875    0.2377
   3.05    0.0475    0.0023
   3.12    0.1175    0.0138
  24.02    0         0.6781

x̄ = 3.0025

s² = Σ(x − x̄)² / (n − 1) = 0.6781 / 7 = 0.0968

s = √s² = √0.0968 = 0.3112
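The table's arithmetic can be reproduced in a few lines (tiny differences in the last decimal place come from the table rounding each squared deviation to four places):

```python
from math import sqrt

volumes = [2.75, 2.86, 3.37, 2.76, 2.62, 3.49, 3.05, 3.12]
n = len(volumes)
mean = sum(volumes) / n                                 # 3.0025

# Sample variance: sum of squared deviations over n - 1 degrees of freedom
var = sum((x - mean) ** 2 for x in volumes) / (n - 1)
sd = sqrt(var)

print(round(var, 4), round(sd, 4))
```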


Descriptive statistics
Quantitative variables
Dispersion measures
Coefficient of variation (CV)

• Useful when one wishes to compare the dispersion between two variables that are measured in different units – comparing their SDs (which have units) would lead to fallacious results, e.g. comparison betwn serum cholesterol levels (mg/100ml) and body weight (lbs)
• Even in situations where the two variables have the same units, their means may be different
• The CV is a measure of relative variation that expresses the standard deviation as a percentage of the mean:

  CV = (s / x̄) × 100%

• Since the mean and SD have the same units, CV has no units
Descriptive statistics
Quantitative variables
Dispersion measures
Coefficient of variation (CV)

Example:
Suppose two samples of male patients yield the following results:

                       Sample 1     Sample 2
  Age                  25 years     11 years
  Mean weight          145 pounds   80 pounds
  Standard deviation   10 pounds    10 pounds

We wish to know which is more variable, the weights of the 25-year-olds or the 11-year-olds

  25-yr-olds: CV = (10/145) × 100 = 6.9%
  11-yr-olds: CV = (10/80) × 100 = 12.5%

Variation is much higher in the sample of 11-year-olds
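The comparison above takes one line per sample; a tiny helper (the function name is illustrative):

```python
def cv(sd, mean):
    """Coefficient of variation: SD expressed as a percentage of the mean."""
    return sd / mean * 100

print(round(cv(10, 145), 1))  # 6.9  (25-year-olds)
print(round(cv(10, 80), 1))   # 12.5 (11-year-olds)
```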
Descriptive statistics
Quantitative variables
Dispersion measures
Interquartile range (IQR)

A special note first….

Percentiles & Quartiles
• Aside from means & medians, percentiles and quartiles are also location parameters
• A percentile is defined as follows:
  Given a set of n observations x₁, x₂, … xₙ, the pth percentile P is the value of x such that p percent or less of the observations are less than P and (100 − p) percent or less of the observations are greater than P
• The 10th percentile is designated as P₁₀, the 70th as P₇₀
• P₅₀, the 50th percentile, is the median
• P₂₅ is the first quartile (Q₁), P₅₀ is also the second/middle quartile (Q₂) and P₇₅ is the third quartile (Q₃)
• To find the quartiles for a dataset the following formulas are used:

  Q₁ = (n + 1)/4 th ordered observation
Descriptive statistics
Quantitative variables
Dispersion measures
Interquartile range (IQR)

A special note first….

Percentiles & Quartiles

  Q₂ = 2(n + 1)/4 th ordered observation

  Q₃ = 3(n + 1)/4 th ordered observation

Example:
The number of days spent in hospital by 17 subjects after an operation, arranged in increasing size, were: 3, 4, 4, 6, 8, 8, 8, 10, 10, 12, 14, 14, 17, 25, 27, 37, 42

  Q₁ = (17 + 1)/4 th observation = 4.5th obs. = (6 + 8)/2 = 7

  Q₂ = 2(17 + 1)/4 th observation = 9th obs. = 10

  Q₃ = 3(17 + 1)/4 th observation = 13.5th obs. = (17 + 25)/2 = 21
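The (n + 1)-based rule above matches `statistics.quantiles` with `method='exclusive'`, which places the p-th quantile at the (n + 1)p-th ordered observation:

```python
from statistics import quantiles

days = [3, 4, 4, 6, 8, 8, 8, 10, 10, 12, 14, 14, 17, 25, 27, 37, 42]

# method='exclusive' uses the (n + 1)p positions, interpolating between
# neighbouring ordered observations when the position is fractional
q1, q2, q3 = quantiles(days, n=4, method='exclusive')
iqr = q3 - q1

print(q1, q2, q3, iqr)  # 7.0 10.0 21.0 14.0
```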
Descriptive statistics
Quantitative variables
Dispersion measures
Interquartile range (IQR)

• Reflects the variability among the middle 50% of the observations in a dataset
• It is the difference betwn the 3rd and 1st quartiles:

  IQR = Q₃ − Q₁

• A large IQR indicates a large amount of variability among the middle 50% of the relevant observations


Descriptive statistics
Quantitative variables
Graphical techniques
(a) Frequency distribution and the histogram
• Besides using means, grouping data provides further useful summarisation of data
• Frequency distributions show the frequency of specific values or ranges of values in the data, i.e. the distribution
• Guidelines on creating a frequency distribution:
  - Select a set of contiguous, non-overlapping intervals where each value can only be placed in one of the intervals (class intervals)
  - Determine the number of intervals to include
    - Too few intervals result in loss of information – intervals shouldn’t be fewer than 5 or greater than 15
    - Intervals can be determined by Sturges’s rule: k = 1 + 3.322(log₁₀ n), where k is the no. of class intervals and n is the size of the dataset
  - Determine the width of the class intervals
    - Intervals should be of the same width
    - Width w may be determined by dividing the range R by k, the number of class intervals: w = R/k
    - Intervals are ordered from smallest to largest
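Sturges's rule and the width formula are easy to apply. For the n = 189 ages example that follows, the rule suggests about 9 intervals (the example table opts for 6 intervals of width 10, still within the 5–15 guideline); the range value below is illustrative:

```python
import math

n = 189                            # size of the dataset
k = 1 + 3.322 * math.log10(n)      # Sturges's rule; about 8.56 here
intervals = round(k)               # round to a whole number of intervals

# Width: range divided by the number of intervals (range value is illustrative)
R = 89 - 30                        # e.g. ages spanning 30 to 89
w = R / intervals

print(intervals, round(w, 1))
```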



Descriptive statistics
Quantitative variables
Graphical techniques
(a) Frequency distribution and the histogram
Example:
• The table below shows a frequency distribution – values of age are distributed among the specified class intervals:

Frequency distribution of ages of 189 subjects

  Class interval   Frequency
  30–39               11
  40–49               46
  50–59               70
  60–69               45
  70–79               16
  80–89                1
  Total              189

• Relative frequencies show the proportion rather than the number of values falling within a given class interval
• The proportions are obtained by dividing the no. of values in a given class interval by the total no. of values
Descriptive statistics
Quantitative variables
Graphical techniques
(a) Frequency distribution and the histogram

• To obtain cumulative frequencies (or cumulative relative freq.) of values falling within two or more classes we sum/cumulate the no. of values (or relative freq.) falling within the class intervals of interest:

  Class interval  Frequency  Cumulative  Relative   Cumulative
                             frequency   frequency  relative frequency
  30–39              11          11       0.0582       0.0582
  40–49              46          57       0.2434       0.3016
  50–59              70         127       0.3704       0.6720
  60–69              45         172       0.2381       0.9101
  70–79              16         188       0.0847       0.9948
  80–89               1         189       0.0053       1.0001
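The whole table can be generated from the frequencies alone. Note that the cumulative relative column in the table sums the already-rounded proportions, which is why it ends at 1.0001 rather than exactly 1; a direct computation gives 0.9947 and 1.0000 for the last two rows:

```python
intervals = ['30-39', '40-49', '50-59', '60-69', '70-79', '80-89']
freqs = [11, 46, 70, 45, 16, 1]

total = sum(freqs)                                      # 189
cum = [sum(freqs[:i + 1]) for i in range(len(freqs))]   # cumulative frequencies
rel = [f / total for f in freqs]                        # relative frequencies
cum_rel = [c / total for c in cum]                      # cumulative relative freqs

for row in zip(intervals, freqs, cum, rel, cum_rel):
    print('{}  {:3d}  {:3d}  {:.4f}  {:.4f}'.format(*row))
```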

• A histogram is a form of bar graph which displays a frequency (or relative freq.) distribution
Descriptive statistics
Quantitative variables
Graphical techniques
(a) Frequency distribution and the histogram

• How it’s constructed:
  - Values of the variable of interest are represented on the horizontal axis
  - The frequency (or relative freq.) is displayed on the vertical axis
  - Above each class interval on the horizontal axis a rectangular bar is erected so that the height corresponds to the frequency (or relative freq.)
  - The bars of the histogram are joined to prevent gaps from occurring
• The frequency polygon is a line graph that portrays a frequency distribution
• It is constructed from a histogram by placing a dot above the midpoint of each class interval then connecting the dots by straight lines
• The polygon is brought down to the horizontal axis at the ends, at points that would be midpoints if there were an additional bar
Descriptive statistics
Quantitative variables
Graphical techniques
(a) Frequency distribution and the histogram


60}- a Histogram & frequency
| sol a polygon for the agesof
em | 205 doctors

tal %
3 20
20+

10;

|
Owe 25 30 35 40 45 50 55
[Age (in years) ————>

(b) Stem-and-leaf plots

• Resembles the histogram and provides info on the range of the dataset, shows the location of the highest concentration of observations and reveals the presence/absence of symmetry
Descriptive statistics
Quantitative variables
Graphical techniques

(b) Stem-and-leaf plots

• Consists of:
  - Stem – has one or more of the initial digits of the measurement
    - Stems form an ordered column with the smallest stem at the top
    - The stem column has all stems within the range of the data even when an observation with that stem is not in the dataset
  - Leaf – consists of one or more of the remaining digits
    - Leaves form the rows of the plot, listed to the right of their respective stems
    - When leaves consist of more than one digit, all digits after the first may be deleted
    - Decimals when present in the original data are omitted in the plot

Example:

  Data: 44, 46, 47, 49, 63, 64, 66, 68, 68, 72, 72, 75, 76, 81, 84, 88, 105

  Stem | Leaf
    4  | 4679
    5  |
    6  | 34688
    7  | 2256
    8  | 148
    9  |
   10  | 5

  (Stems are separated from their leaves by a vertical line)
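A stem-and-leaf plot like the example's can be built by splitting each value into a tens ‘stem’ and a units ‘leaf’ (a sketch for the example data; note that stems with no leaves are still printed, as the guideline above requires):

```python
from collections import defaultdict

data = [44, 46, 47, 49, 63, 64, 66, 68, 68, 72, 72, 75, 76, 81, 84, 88, 105]

leaves = defaultdict(list)
for x in sorted(data):
    leaves[x // 10].append(x % 10)   # stem = leading digit(s), leaf = final digit

# Print every stem within the range of the data, including empty ones
for stem in range(min(leaves), max(leaves) + 1):
    print('{:3d} | {}'.format(stem, ''.join(str(d) for d in leaves[stem])))
```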
Descriptive statistics
Quantitative variables
Graphical techniques

(c) Box-and-Whisker plots

• A plot that makes use of the quartiles of a dataset
• Reveals info on the amount of spread, the location of concentration of obs. and the symmetry of the data
• Uses relationships among the median, lower and upper quartiles to describe the skewness of a distribution:
  - If the distribution is symmetric then the upper and lower quartiles should be equally spaced from the median
  - If the upper quartile is further from the median than the lower quartile – distribution is positively skewed
  - If the lower quartile is further from the median than the upper quartile – distribution is negatively skewed
Descriptive statistics
Quantitative variables
Graphical techniques

(c) Box-and-Whisker plots


[Figure: box-and-whisker plot showing the median, Q1 (25th percentile), Q3 (75th percentile), the interquartile range (IQR), whiskers extending to Q1 − 1.5×IQR and Q3 + 1.5×IQR, and outliers beyond the whiskers]

(d) Scatter plots

• Display the relationship between two continuous variables by plotting the values of one variable against the values of another variable
Descriptive statistics
Quantitative variables
Graphical techniques

(d) Scatter plots

• e.g. one axis of the plot could represent height and the other weight, with each person in the dataset receiving one data point on the plot that corresponds to his/her height and weight

[Figure: scatterplot of weight (kg) vs height (cm)]
Descriptive statistics
Qualitative variables
• Descriptive stats for analysing categorical variables include: frequencies, percentages and fractions (relative frequencies) obtained from a variable’s frequency distribution table

Example:
The following frequency table shows the frequency of each category of marital status in a sample of 80 people, along with the corresponding percentages:

  Marital status   Frequency   Percent
  Single              44        55.0
  Married             29        36.3
  Other                7         8.8
  Total               80       100
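The percentage column follows directly from the counts; a sketch using Decimal so that halves round up the way the table does (36.25 → 36.3, 8.75 → 8.8):

```python
from decimal import Decimal, ROUND_HALF_UP

counts = {'Single': 44, 'Married': 29, 'Other': 7}
total = sum(counts.values())                      # 80

# Percentages rounded half-up to one decimal place, as in the table
pct = {k: (Decimal(100 * v) / total).quantize(Decimal('0.1'), rounding=ROUND_HALF_UP)
       for k, v in counts.items()}

for status, f in counts.items():
    print('{:8s} {:3d}  {}'.format(status, f, pct[status]))
```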

• Besides frequency tables, we can display categorical data using graphs: bar charts and pie charts
Descriptive statistics
Qualitative variables
• Bar charts use bars (which don’t touch) to represent each category of the variable of interest – the lengths of the bars reflect the frequencies or percentages of the distribution of the variable:

[Figure: bar chart of the marital status counts in the sample]
Descriptive statistics
Qualitative variables
• In pie charts the percentages are represented by the angles of the different slices of a circle – the total angle (360 degrees) representing 100%:

[Figure: pie chart of the marital status counts]
Introduction to Biostatistics

Dr. M.M. Mweu,


Level IV MBChB Biostatistics,
5 December, 2023
Introduction
 Statistics is a field of study concerned with:
 Collection, organisation, summarisation and analysis of data
 Drawing of inferences about a body of data when only a part of the
data is observed
 Biostatistics applies statistical methods to medical and biological
problems
 Data are numbers resulting from a measurement (e.g. weighing a
patient or taking temperature ) and contain information – thus statistics
seeks to investigate and evaluate the meaning of this info
 Two branches:
 Descriptive stats – deals with organising and summarising data
 Analytic/Inferential stats – deals with making inferences from samples about
populations and drawing conclusions from data

 Population – collection of all the individuals that we wish to say


something about; a set representing all measurements of interest
to the investigator
Introduction
 Sample – a part of the population on which measurements are
taken; a subset of measurements selected from the pop of
interest
 An example from the variable height:
 Population: measurement of height of everyone in a room
(census). Then summarise the measurement by calculating the
average height – we get a parameter value - an exact value
(ignoring measurement error) – where no estimation is involved
 Sample: measurement of height of some people in the room. The
average height of the sample constitutes a statistic – our ‘guess’
at the parameter value
 Nutshell: parameters are numerical descriptive measures
computed from population measurements whilst statistics are
computed from sample measurements
 Field of statistics is dedicated to understanding how well
‘statistics’ estimate ‘parameters’
Introduction
 Data are often necessary to answer key questions e.g. clinicians faced
with a choice among competing treatment regimens

 Sources of data:
 Routinely kept records – usually the first stopover in quest for data
e.g. hospital medical records containing patient info

 Surveys – often the logical source when recorded data are


absent/insufficient e.g. if a hospital administrator seeks to understand the
mode of transportation used by patients to visit a hospital and admission forms
don’t contain a question on transportation modes then a survey amongst
patients is necessary

 Experiments – investigator conducts an experiment e.g. a nurse may


wish to know which of several strategies is best for maximising patient
compliance. Here, different strategies of motivating compliance are tried with
different patients with responses to the different strategies evaluated to
determine which strategy is most effective

 External sources – data may already exist in the form of published


reports, data banks (repositories) or research literature. These data
may be reanalysed to answer key questions
Introduction
 Variable – an attribute or characteristic that takes different
values from one individual to another
 Constant – takes the same value for all the individuals
 A variable is random if the value it takes for a particular
individual is a chance or random event i.e. the value cannot be exactly
predicted in advance

 Types of variables:
 Quantitative: measurable in terms of magnitude
 Continuous – can take on any value in a given range e.g. height, blood
pressure, temperature
 Discrete – can take on only whole numbers e.g. counts of lesions
 Qualitative: variables not measurable or described in magnitude –
often called factors
 Ordinal – have an order but no magnitude e.g. education levels –
primary/secondary/tertiary, severity – mild/moderate/severe, prognosis –
good/fair/poor, socioeconomic status (low/medium/high)
 Nominal – categorised without any implied order e.g. sex, dead/alive,
ethnicity, religion, marital status
Descriptive statistics
Quantitative variables
 We often summarise data using a single number – a
descriptive measure
 Descriptive measures for quantitative data are:
 Central tendency (location) – measures the centre of the
distribution
Often the location where data tend to cluster
Captures the central/average value of a set of data
 Dispersion (spread) – refers to the average distance between
observations
Indicates the magnitude by which individuals tend to differ from each
other
By determining how far an observation lies from the rest we can
deduce whether such an observation is unusual (outlier) or not
Descriptive statistics
Quantitative variables
Central tendency measures

 Most commonly used are: mean, median and mode


Arithmetic mean

 Considered the most familiar measure of central tendency


 The term ‘arithmetic’ distinguishes it from other means (we
restrict ourselves to this only)

 The sample mean 𝑥 computed from a sample of values is


expressed by:
𝑛
𝑖=1 𝑥𝑖
𝑥=
𝑛
Where 𝑥𝑖 is the typical value of a random variable and 𝑛 denotes the no. of values in
the sample
Descriptive statistics
Quantitative variables
Central tendency measures
Arithmetic mean
Example:
The plasma volumes of 8 healthy men are: 2.75, 2.86, 3.37, 2.76, 2.62, 3.49, 3.05,
3.12 litres

2.75 + 2.86 + 3.37 + 2.76 + 2.62 + 3.49 + 3.05 + 3.12


𝑥=
8
= 3.0025 𝑙𝑖𝑡𝑟𝑒𝑠

 Properties:
 Uniqueness – for a given set of data there’s only one mean
 Simplicity – the mean is easily understood and easy to compute
 Prone to distortion by extreme values since each value in a
dataset enters into the computation of the mean
Descriptive statistics
Quantitative variables
Central tendency measures
Median

 Value that divides a set of values into two equal parts such
that the no. of values equal to or greater than the median is
equal to the no. of values equal to or less than the median
 If no. of values is odd then the median is the middle value
when all values are arranged in order of magnitude
 If the no. of values is even, then the median is the mean of
the two middle values when all values are arranged in order
of magnitude
 Properties:
 Uniqueness – as with mean, there’s only a single median for a
given set of data
 Simplicity – is easy to calculate
 Not drastically affected by extreme values as is the mean
Descriptive statistics
Quantitative variables
Central tendency measures
Mode

 Is the value which occurs most frequently in a set of data


 If all the values are different then there’s no mode
 However, a set of values may have more than one mode
Example:
The number of days spent in hospital by 17 subjects after an operation are:
3, 4, 4, 6, 8, 8, 8, 10, 10, 12, 14, 14, 17, 25, 27, 37, 42
𝑀𝑜𝑑𝑒 𝑖𝑠: 8
 When mean, median and mode are equal we get a symmetric
distribution termed the bell-shaped curve – the most common
being the normal distribution
 Data distributions may classified as to whether they are
symmetric or asymmetric (skewed)
 Symmetric – left half of graph (histogram/frequency polygon) is mirror image of the
right half
 Right/positive skewed – distribution has long tail to the right
 Left/negative skewed – distribution has long tail to the left
Descriptive statistics
Quantitative variables
Central tendency measures

Dispersion measures

 Dispersion conveys info regarding the amount of variability


present in a dataset
 If all values are the same there’s no dispersion
 Amount of dispersion may be small when the values – though
different – are close together
Descriptive statistics
Quantitative variables
Dispersion measures

 There are a number of measures: range, variance, standard


deviation, coefficient of variation, interquartile range
Range

 Is the difference betwn the largest and smallest value in a set


of observations
 If we denote range by 𝑅, the largest (maximum) value by 𝑥𝐿
and the smallest (minimum) value by 𝑥𝑠 then range is given
by:
𝑅 = 𝑥𝐿 − 𝑥𝑠
 Since range expressed as a single number imparts minimal
info about a dataset it is often preferable to express range as
a number pair: (𝑥𝑠 , 𝑥𝐿 ) – conveys more info
Descriptive statistics
Quantitative variables
Dispersion measures
Variance

 Expresses dispersion relative to the scatter of values about


their mean
 The sample variance 𝑠 2 is computed by subtracting the mean
from each of a set of values, then squaring the resulting
differences, followed by adding up the squared differences
 The resulting sum is then divided by the sample size minus 1
𝑛 2
𝑖=1(𝑥𝑖 −𝑥)
𝑠2 = 𝑛−1
Where 𝑛 − 1 is referred to as degrees of freedom (DF)
NB: the rationale for DF:
The sum of the deviations of the values from their mean is equal to zero. If then we know
the values of 𝑛 − 1 of the deviations from the mean, we know the nth one since it is
automatically determined because of the necessity for all 𝑛 values to add to zero
Descriptive statistics
Quantitative variables
Dispersion measures
Standard deviation

 The variance represents squared units and hence not


appropriate when we want to express dispersion in terms of
the original units
 To obtain dispersion in the original units we get the square
root of variance – the standard deviation (𝑠):

𝑛
− 𝑥)2
𝑖=1(𝑥𝑖
𝑠=
𝑛−1

Back to example on plasma volumes…


Descriptive statistics
Quantitative variables
Dispersion measures
Standard deviation

𝑥       𝑥 − 𝑥̄      (𝑥 − 𝑥̄)²
2.75 -0.2525 0.0638
2.86 -0.1425 0.0203
3.37 0.3675 0.1351
2.76 -0.2425 0.0588
2.62 -0.3825 0.1463
3.49 0.4875 0.2377
3.05 0.0475 0.0023
3.12 0.1175 0.0138
24.02 0 0.6781
𝑥̄ = 3.0025
𝑠² = Σ(𝑥 − 𝑥̄)² / (𝑛 − 1) = 0.6781/7 = 0.0968
𝑠 = √𝑠² = √0.0968 = 0.3112
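The hand computation above can be checked with a few lines of Python (the slide works from squared deviations rounded to 4 decimal places; the unrounded variance comes out at 0.09685):

```python
# Sample variance and SD for the plasma volume data (n - 1 in the denominator)
data = [2.75, 2.86, 3.37, 2.76, 2.62, 3.49, 3.05, 3.12]

n = len(data)
mean = sum(data) / n                          # 3.0025
ss = sum((x - mean) ** 2 for x in data)       # sum of squared deviations
variance = ss / (n - 1)                       # degrees of freedom = n - 1
sd = variance ** 0.5

print(mean, variance, sd)
```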
Descriptive statistics
Quantitative variables
Dispersion measures
Coefficient of variation (CV)
 Useful when one wishes to compare the dispersion between two variables that are measured in different units – comparing their SDs (which have units) would lead to fallacious results, e.g. comparison between serum cholesterol levels (mg/100ml) and body weight (lbs)
 Even in situations where the two variables have the same units, their means may be different
 The CV is a measure of relative variation that expresses the
standard deviation as a percentage of the mean:
𝐶𝑉 = (𝑠/𝑥̄) × 100%
 Since the mean and SD have the same units, CV has no units
Descriptive statistics
Quantitative variables
Dispersion measures
Coefficient of variation (CV)

Example:
Suppose two samples of male patients yield the following results:

Sample 1 Sample 2
Age 25 years 11 years

Mean weight 145 pounds 80 pounds

Standard deviation 10 pounds 10 pounds


We wish to know which is more variable, the weights of the 25-year-olds or the 11-
year-olds
25-yr-olds: 𝐶𝑉 = (10/145) × 100 = 6.9%    11-yr-olds: 𝐶𝑉 = (10/80) × 100 = 12.5%
Variation is much higher in the sample of 11-year-olds
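As a quick sketch, the two CVs can be computed with a small helper (the function name `cv` is illustrative, not a library routine):

```python
# CV = (SD / mean) x 100: a unit-free measure of relative variability
def cv(sd, mean):
    return sd / mean * 100

print(cv(10, 145))   # 25-year-olds
print(cv(10, 80))    # 11-year-olds
```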
Descriptive statistics
Quantitative variables
Dispersion measures
Interquartile range (IQR)
A special note first….
Percentiles & Quartiles
 Aside from means & medians, percentiles and quartiles are also location
parameters
 A percentile is defined as follows:
Given a set of 𝒏 observations 𝒙𝟏 , 𝒙𝟐 ,… 𝒙𝒏, the 𝒑th percentile 𝑷 is the value of 𝒙 such that
𝒑 percent or less of the observations are less than 𝑷 and (𝟏𝟎𝟎 − 𝒑) percent or less of the
observations are greater than 𝑷

 The 10th percentile is designated as 𝑃10, the 70th as 𝑃70
 𝑃50, the 50th percentile, is the median
 𝑃25 is the first quartile (𝑄1), 𝑃50 is also the second/middle quartile (𝑄2) and 𝑃75 is the third quartile (𝑄3)
 To find the quartiles of a dataset the following formulas are used:
𝑄1 = [(𝑛 + 1)/4]th ordered observation
Descriptive statistics
Quantitative variables
Dispersion measures
Interquartile range (IQR)

𝑄2 = [2(𝑛 + 1)/4]th ordered observation
𝑄3 = [3(𝑛 + 1)/4]th ordered observation
Example:
The number of days spent in hospital by 17 subjects after an operation arranged
in increasing size were: 3, 4, 4, 6, 8, 8, 8, 10, 10, 12, 14, 14, 17, 25, 27, 37, 42
𝑄1 = [(17 + 1)/4]th observation = 4.5th obs. = (6 + 8)/2 = 7
𝑄2 = [2(17 + 1)/4]th observation = 9th obs. = 10
𝑄3 = [3(17 + 1)/4]th observation = 13.5th obs. = (17 + 25)/2 = 21
Descriptive statistics
Quantitative variables
Dispersion measures
Interquartile range (IQR)
 Reflects the variability among the middle 50% of the observations in a dataset
 It is the difference between the 3rd and 1st quartiles:
𝐼𝑄𝑅 = 𝑄3 − 𝑄1
 A large 𝐼𝑄𝑅 indicates a large amount of variability among the
middle 50% of the relevant observations
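A minimal Python sketch of the (𝑛 + 1)/4 rule, interpolating halfway for positions ending in .5 as done in the hospital-stay example, reproduces the quartiles and the IQR (the helper name `quartile` is illustrative):

```python
# Quartiles via the (n+1)/4 rule, then the interquartile range (IQR)
def quartile(sorted_data, q):
    pos = q * (len(sorted_data) + 1) / 4     # 1-based position of the q-th quartile
    lower = int(pos)                         # whole part of the position
    frac = pos - lower
    if frac == 0:
        return sorted_data[lower - 1]
    # interpolate between the two neighbouring ordered observations
    return sorted_data[lower - 1] + frac * (sorted_data[lower] - sorted_data[lower - 1])

days = [3, 4, 4, 6, 8, 8, 8, 10, 10, 12, 14, 14, 17, 25, 27, 37, 42]
q1, q2, q3 = (quartile(days, q) for q in (1, 2, 3))
print(q1, q2, q3, q3 - q1)   # 7.0 10 21.0 14.0
```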
Descriptive statistics
Quantitative variables
Graphical techniques
(a) Frequency distribution and the histogram

 Besides using means, grouping data provides further useful summarisation of data
 Frequency distributions show the frequency of specific values or ranges of values in the data i.e. the distribution
 Guidelines on creating a frequency distribution:
 Select a set of contiguous, non-overlapping intervals where each value can only be placed in one of the intervals (class intervals)
 Determine the number of intervals to include
 Too few intervals result in loss of information – intervals shouldn’t be fewer than 5 or greater than 15
 Intervals can be determined by Sturges’s rule:
𝑘 = 1 + 3.322(𝑙𝑜𝑔10 𝑛); where 𝑘 is the no. of class intervals and 𝑛 is the size of dataset
 Determine the width of the class intervals
 Intervals should be of the same width
 Width 𝑤 may be determined by dividing the range 𝑅 by 𝑘, the number of class intervals:
𝑤 = 𝑅/𝑘
 Intervals are ordered from smallest to largest
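The steps above can be sketched briefly, assuming a range of 59 (ages roughly 30 to 89) for the age example that follows; note Sturges’s rule suggests about 9 intervals here, whereas the worked table opts for 6 intervals of width 10 for convenience:

```python
import math

# Sturges's rule for the number of class intervals k, then width w = R / k
def sturges_k(n):
    return round(1 + 3.322 * math.log10(n))

n, r = 189, 59            # dataset size and an assumed range (ages ~30 to ~89)
k = sturges_k(n)          # suggested number of intervals
w = math.ceil(r / k)      # interval width, rounded up to a whole number
print(k, w)
```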
Descriptive statistics
Quantitative variables
Graphical techniques
(a) Frequency distribution and the histogram
Example:
 Table below shows a frequency distribution – values of age are distributed
among the specified class intervals:

Frequency distribution of ages of 189 subjects


Class interval Frequency
30-39 11
40-49 46
50-59 70
60-69 45
70-79 16
80-89 1
Total 189

 Relative frequencies show the proportion rather than the number of values falling within a given class interval
 The proportions are obtained by dividing the no. of values in a given class interval by the total no. of values
Descriptive statistics
Quantitative variables
Graphical techniques
(a) Frequency distribution and the histogram

 To obtain cumulative frequencies (or cumulative relative frequencies) of values falling within two or more classes we sum/cumulate the no. of values (or relative freq.) falling within the class intervals of interest:
Class interval | Frequency | Cumulative frequency | Relative frequency | Cumulative relative frequency
30-39 11 11 0.0582 0.0582
40-49 46 57 0.2434 0.3016
50-59 70 127 0.3704 0.6720
60-69 45 172 0.2381 0.9101
70-79 16 188 0.0847 0.9948
80-89 1 189 0.0053 1.0001
Total 189
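The relative and cumulative columns above can be reproduced directly from the class counts (note that summing the rounded relative frequencies is what yields the table’s 0.9948 and 1.0001; computing from the raw counts gives 0.9947 and 1.0000):

```python
from itertools import accumulate

# Relative and cumulative (relative) frequencies for the age distribution
intervals = ["30-39", "40-49", "50-59", "60-69", "70-79", "80-89"]
freq = [11, 46, 70, 45, 16, 1]
total = sum(freq)                          # 189

cum_freq = list(accumulate(freq))          # running totals of the counts
rel_freq = [f / total for f in freq]       # proportion per class
cum_rel = [c / total for c in cum_freq]    # cumulative proportions

for row in zip(intervals, freq, cum_freq, rel_freq, cum_rel):
    print(f"{row[0]}  {row[1]:3d}  {row[2]:3d}  {row[3]:.4f}  {row[4]:.4f}")
```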

 A histogram is a form of a bar graph which displays a frequency (or relative freq.) distribution
Descriptive statistics
Quantitative variables
Graphical techniques
(a) Frequency distribution and the histogram

 How it’s constructed:
 Values of the variable of interest are represented by the horizontal axis
 The frequency (or relative freq.) is displayed on the vertical axis
 Above each class interval on the horizontal axis a rectangular bar is erected so
that the height corresponds to the relative freq.
 The bars of the histogram are joined to prevent gaps from occurring

 The frequency polygon is a line graph that portrays a frequency distribution
 It is constructed from a histogram by placing a dot above the
midpoint of each class interval then connecting the dots by
straight lines
 The polygon is brought down to the horizontal axis at the ends at
points that would be midpoints if there was an additional bar
Descriptive statistics
Quantitative variables
Graphical techniques
(a) Frequency distribution and the histogram

(b) Stem-and-leaf plots

 Resembles the histogram and provides info on the range of the dataset, shows the location of the highest concentration of observations and reveals the presence/absence of symmetry
Descriptive statistics
Quantitative variables
Graphical techniques
(b) Stem-and-leaf plots

 Consists of:
 Stem – has one or more of the initial digits of the measurement
 Stems form an ordered column with the smallest stem at the top
 The stem column has all stems within the range of the data even when an
observation with that stem is not in the dataset
 Leaf – consists of one or more of the remaining digits
 Forms the rows of the plot, listed to the right of their respective stems
 When leaves consist of more than one digit, all digits after the first may be deleted
 Decimals when present in the original data are omitted in the plot

Example:
Data 44 46 47 49 63 64 66 68 68 72 72 75 76 81 84 88 106

Stem Leaf
4 4679
5
6 34688
7 2256
8 148
9
10 6
Descriptive statistics
Quantitative variables
Graphical techniques
(c) Box-and-Whisker plots
 A plot that makes use of the quartiles of a dataset
 Reveals info on the amount of spread, location of concentration of obs. and symmetry of the data
 Uses relationships among the median, lower and upper
quartiles to describe the skewness of a distribution:
 If the distribution is symmetric then the upper and lower quartiles
should be equally spaced from the median
 If upper quartile is further from the median than the lower quartile –
distribution is positively skewed
 If lower quartile is further from the median than the upper quartile –
distribution is negatively skewed
Descriptive statistics
Quantitative variables
Graphical techniques
(c) Box-and-Whisker plots

(d) Scatter plots
 Display the relationship between two continuous variables by plotting the values of one variable against the values of the other
Descriptive statistics
Quantitative variables
Graphical techniques
(d) Scatter plots
 e.g. one axis of the plot could represent height and the other weight with each
person in the dataset receiving one data point on the plot that corresponds to
his/her height and weight
Descriptive statistics
Qualitative variables
 Descriptive stats for analysing categorical variables include:
frequencies, percentages and fractions (relative frequencies)
obtained from a variable’s frequency distribution table
Example:
The following frequency table shows the frequency of each category of marital
status in a sample of 80 people along with the corresponding percentages:
Marital status | Frequency | Percent
Single 44 55.0
Married 29 36.3
Other 7 8.8
Total 80 100
 Besides frequency tables, we can display categorical data using graphs: bar charts and pie charts
Descriptive statistics
Qualitative variables
 Bar charts use bars (which don’t touch) to represent each
category of the variable of interest – the length of the bars
reflect the frequencies or percentages of the distribution of
the variable:
Descriptive statistics
Qualitative variables
 In Pie charts the percentages are represented by the angles in
different slices of a circle – the total angle (360 degrees)
representing 100%:
Introduction to Probability
Dr. M.M. Mweu,
Level II MBChB Biostatistics,
30 January, 2018
UoN School of Public Health
Introduction
 Probability is a measure of the likelihood that an event will occur
 Experiment/trial – any process of observation/measurement
e.g. checking if switch is turned off or counting no. of wounds
on a patient
 Outcomes – the results of the experiment e.g. counts, yes/no
answers etc.
 Sample space – list of all possible outcomes
 Event - the subset of the sample space
Example: One flip of a fair coin = experiment
Outcome = Head/Tail; Sample space = H, T
If we flip a fair coin twice, what is the probability of getting at least one head?
Sample space = HH, HT, TH, TT
Event (A) = at least one head
𝑛 = 3; 𝑁 = 4
𝑝(𝐴) = 𝑛/𝑁 = 3/4 = 0.75
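The sample-space count can be verified by brute-force enumeration:

```python
from itertools import product

# Enumerate the sample space of two fair-coin flips and count the event
space = list(product("HT", repeat=2))     # HH, HT, TH, TT
event = [o for o in space if "H" in o]    # at least one head
p = len(event) / len(space)
print(p)
```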
Probability rules
 Mutually exclusive events:
 Two events are mutually exclusive if the occurrence of one event excludes
the occurrence of the other
 E.g. If a baby is male, it can’t be female also or if a patient is malaria
positive, he can’t be malaria negative also
 The probability of occurrence of two mutually exclusive events is the
probability of either one of the two events occurring, denoted by
𝑝 𝐴 ∪ 𝐵 = 𝑝 𝐴 + 𝑝(𝐵) (add individual probabilities)
 E.g. 200 children tested for Entamoeba histolytica, where 59 are positive
& 141 are negative
 Probability of E. histolytica positivity = 59/200 = 0.295
 Probability of E. histolytica negativity = 141/200 = 0.705
 Probability of being positive or negative for E. histolytica = 0.295 + 0.705 = 1

 Independent events:
 Two events are independent if the occurrence of one event does not
influence the occurrence of the other – denoted by 𝑝 𝐴 ∩ 𝐵
 E.g. if the firstborn child is male, this doesn’t influence the sex of the
second born (can be male or female)
Probability rules
 Independent events:
 The probability of two independent events is obtained by multiplying
individual probabilities of the events
𝑝 𝐴 ∩ 𝐵 = 𝑝 𝐴 × 𝑝(𝐵)
E.g. in a blood bank the following distribution of blood groups was observed
Blood group    n      %
O             45     45
A             29     29
B             21     21
AB             5      5
Total        100    100
What is the probability that the next 2 persons will be in blood group O?
𝑝 𝐴 ∩ 𝐵 = 0.45 × 0.45 = 0.2025 ≅ 𝟎. 𝟐𝟎
Important probability distributions
 Probabilities can be assigned to each likely outcome for a
variable
 These probabilities usually follow a mathematical formula
called a probability distribution
 The probability distribution of a variable depends on the type
of variable:
 Continuous variables – temperature, weight, height – follow a Normal
distribution
 Binary (yes/no) variables – sex, disease, death – follow a Binomial
distribution
 Discrete or rare occurrences – death, rare diseases, counts – follow a
Poisson distribution. Poisson approaches Binomial as events become
more common and/or the population becomes smaller
Normal probability distribution
 All normally distributed variables are continuous but not
vice-versa
 For continuous variables all values within a certain range are
observable and the probability associated with one such
observation is negligibly small – on a continuous scale a zero
probability is assigned to individual points
 However, we can calculate the probability that a variable will
take a value betwn two points e.g. 𝑎 and 𝑏.
 For continuous variables a probability density is defined –
mathematical function such that area under curve betwn 2
points 𝑎 and 𝑏 is equal to the probability that the variable will
take a value betwn these 2 points
Normal probability distribution
 Two important parameters are needed to calculate
probabilities from the normal probability density:
 Mean (𝜇) – locates the central position
 Variance (𝜎 2 ) or its square root, the standard deviation (𝜎) measures the
degree of spread about the mean

 If the 𝜇 and 𝜎 of a normally distributed variable are known, we can determine the probability that such a variable will take a value that lies between 2 points 𝑎 and 𝑏
Normal probability distribution
 To estimate the probability for a normally distributed variable, we
normally standardise the variable into a standard normal deviate
𝑍 which has a mean 0 and variance 1:
𝑍 = (𝑌 − 𝜇)/𝜎
Such that: 𝑝[(𝑎 − 𝜇)/𝜎 ≤ 𝑍 ≤ (𝑏 − 𝜇)/𝜎]
Example
 The variable birth weight is known to be normally distributed with a mean of 3100𝑔 and variance of 2500𝑔². What is the probability that a baby will weigh greater than 3000𝑔? (First draw the probability distribution)
𝑍 = (3000 − 3100)/√2500 = −100/50 = −2
From the table, the area in the upper tail beyond 𝑍 = 2 (and, by symmetry, below 𝑍 = −2) is 0.02275
𝑝(𝑏𝑤𝑡 > 3000𝑔) = 𝑝(𝑍 > −2) = 1 − 0.02275 = 𝟎.𝟗𝟕𝟕𝟐𝟓
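The same tail probability can be checked with Python’s statistics module rather than the printed table:

```python
from statistics import NormalDist

# Birth weight ~ N(mean 3100 g, variance 2500 g^2), so SD = 50 g
bwt = NormalDist(mu=3100, sigma=50)
p = 1 - bwt.cdf(3000)      # P(weight > 3000 g) = P(Z > -2)
print(p)
```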
Binomial probability distribution
 Important for calculating the probability of d’se
 The binomial distribution describes the behaviour of a
random variable 𝑋 if the following conditions apply:
 The no. of observations 𝑛 is fixed
 Each observation is independent
 Each observation represents one of two outcomes (“success [𝑌]” or “failure”)
 The probability of success 𝜋 is the same for each observation
𝑝(𝑌 = 𝑦) = 𝐶(𝑛, 𝑦) 𝜋^𝑦 (1 − 𝜋)^(𝑛−𝑦)
Where: 𝐶(𝑛, 𝑦) = 𝑛!/(𝑦!(𝑛 − 𝑦)!) is the binomial coefficient
and: 𝑛 = no. of trials/observations
𝜋 = probability of success on a single trial
𝑦 = no. of successes after 𝑛 trials
 If the conditions described above are met then 𝑿 is said to have a binomial
distribution with parameters 𝝅 and 𝒏
Binomial probability distribution
Example
 According to CDC, 22% of adults in the United States smoke. Suppose we take
a sample of 10 people.
 What is the probability that 5 of them will smoke?
𝑝(𝑌 = 𝑦) = 𝐶(𝑛, 𝑦) 𝜋^𝑦 (1 − 𝜋)^(𝑛−𝑦)
𝑝(𝑌 = 5) = 𝐶(10, 5) × 0.22^5 × (1 − 0.22)^(10−5)
= 252 × 0.000515 × 0.289 = 𝟎.𝟎𝟑𝟕𝟓 𝒐𝒓 𝟑.𝟕𝟓%

 What is the probability that 2 or fewer will be smokers?
𝑝(𝑌 ≤ 2) = 𝑝(𝑌 = 0) + 𝑝(𝑌 = 1) + 𝑝(𝑌 = 2)
= 𝐶(10, 0) 0.22^0 (1 − 0.22)^(10−0) + 𝐶(10, 1) 0.22^1 (1 − 0.22)^(10−1) + 𝐶(10, 2) 0.22^2 (1 − 0.22)^(10−2)
= 0.083 + 0.235 + 0.298 = 𝟎.𝟔𝟏𝟔 𝒐𝒓 𝟔𝟏.𝟔%
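All three smoking probabilities can be checked with a small binomial pmf helper (`binom_pmf` is an illustrative name, not a library function); the slide’s 0.616 comes from summing rounded terms, while the unrounded sum is closer to 0.617:

```python
from math import comb

# Binomial pmf: P(Y = y) = C(n, y) * pi^y * (1 - pi)^(n - y)
def binom_pmf(y, n, pi):
    return comb(n, y) * pi**y * (1 - pi)**(n - y)

n, pi = 10, 0.22
p5 = binom_pmf(5, n, pi)                             # exactly 5 smokers
p_le2 = sum(binom_pmf(y, n, pi) for y in range(3))   # 2 or fewer smokers
p_ge1 = 1 - binom_pmf(0, n, pi)                      # at least one smoker
print(p5, p_le2, p_ge1)
```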
Binomial probability distribution
Example
 What is the probability that at least one will smoke?
Probability of at least one being a smoker = 1 − 𝑝(𝑌 = 0)
= 1 − 𝐶(10, 0) 0.22^0 (1 − 0.22)^(10−0) = 1 − 0.083 = 𝟎.𝟗𝟏𝟕 𝒐𝒓 𝟗𝟏.𝟕%

 The mean and variance of a binomial distribution can be shown to be:
𝜇 = 𝑛𝜋
𝜎² = 𝑛𝜋(1 − 𝜋)
 For large values of 𝑛 the distribution of the binomial variable 𝑋 (and of the sample proportion) is approximately normal. Hence, a normal approximation to the binomial distribution is possible when:
𝑛𝜋 ≥ 5 and 𝑛(1 − 𝜋) ≥ 5
Binomial probability distribution
Example
 If the probability of a certain disease is thought to be 0.2. What is the
probability that in a sample of 50 individuals, 2 or more will get the disease?
𝜋 = 0.2
𝑛 = 50
𝑛𝜋 = 50 × 0.2 = 10 𝒂𝒏𝒅 𝑛(1 − 𝜋) = 50 × (1 − 0.2) = 40
Hence normal approximation to the Binomial distribution is possible
𝑍 = (𝑦 − 𝜇)/𝜎
where 𝜇 = 0.2 × 50 = 𝟏𝟎 and 𝜎² = 50 × 0.2 × 0.8 = 𝟖
For 𝑝(𝑌 ≥ 2): 𝑍 = (2 − 10)/√8 = −2.83 (𝒏𝒆𝒆𝒅 𝒕𝒐 𝒅𝒓𝒂𝒘 𝒕𝒐 𝒔𝒆𝒆 𝒂𝒓𝒆𝒂)
𝑝(𝑌 ≥ 2) = 𝑝(𝑍 ≥ −2.83) = 1 − 0.00233 = 𝟎.𝟗𝟗𝟕𝟕
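The approximation can be sketched numerically; here `NormalDist` plays the role of the table lookup for 𝑍 = −2.83:

```python
from statistics import NormalDist

# Normal approximation to Binomial(n = 50, pi = 0.2): mu = n*pi, var = n*pi*(1-pi)
n, pi = 50, 0.2
mu = n * pi                            # 10
sigma = (n * pi * (1 - pi)) ** 0.5     # sqrt(8), about 2.83
p = 1 - NormalDist(mu, sigma).cdf(2)   # P(Y >= 2), i.e. P(Z >= -2.83)
print(p)
```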
Poisson probability distribution
 The Poisson distribution is a discrete probability distribution
for the counts (or rates) of events that occur randomly (can’t
predict when they will happen e.g. time to next phone call) in
a given interval of time or space
 If 𝑌 = the no. of events in a given interval and if the mean no.
of events per interval is 𝜆, the probability of observing y
events in a given interval is given by:
𝑝(𝑌 = 𝑦) = 𝜆^𝑦 𝑒^(−𝜆)/𝑦!   (𝑁𝐵: 𝑒 is a constant – see 𝑒^𝑥 on a calculator)
where 𝑦 = 0, 1, 2, 3 etc.

Example
 Births in a hospital occur randomly at an average of 1.8 births per hour. What
is the probability of observing 4 births in a given hour at the hospital?
𝑝(𝑌 = 4) = 1.8^4 𝑒^(−1.8)/4! = 𝟎.𝟎𝟕𝟐𝟑
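A one-line check of the Poisson pmf (`poisson_pmf` is an illustrative helper name):

```python
from math import exp, factorial

# Poisson pmf: P(Y = y) = lam^y * e^(-lam) / y!
def poisson_pmf(y, lam):
    return lam**y * exp(-lam) / factorial(y)

p4 = poisson_pmf(4, 1.8)   # P(4 births in an hour) at 1.8 births/hour
print(p4)
```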
Table 3 Areas in Upper Tail of the Normal Distribution
The function tabulated is 1 − Φ(𝑧), where Φ(𝑧) is the cumulative distribution function of a standardised Normal variable 𝑧; that is, the probability that a standardised Normal variate selected at random will be greater than a value of 𝑧 (for example, the entry for 𝑧 = 2.00 is 0.02275). (Table values not reproduced here.)
Table 7 Percentage Points of the 𝑡 Distribution
The table gives the value of 𝑡𝛼,𝜈 – the 100𝛼 percentage point of the 𝑡 distribution for 𝜈 degrees of freedom. The tabulation is for one tail only, that is, for positive values of 𝑡; for |𝑡| the column headings for 𝛼 should be doubled. (Table values not reproduced here.)
This table is taken from Table III of Fisher & Yates: Statistical Tables for Biological, Agricultural and Medical Research, reprinted by permission of Addison Wesley Longman Ltd; also from Table 12 of Biometrika Tables for Statisticians, by permission of the University Press and the Biometrika Trustees.
Table 8 Percentage Points of the 𝜒² Distribution – continued
The table gives the 100𝛼 percentage points of the 𝜒² distribution for 𝜈 degrees of freedom. (Table values not reproduced here.)
Table A4. Upper percentage points of the F distribution with ν1 , ν2 df
(a) 5% points

ν1 =df for the numerator


ν2 1 2 3 4 5 6 7 8 9 10 12 15 20 24 30 40 60 120 ∞

1 161.448 199.500 215.707 224.583 230.162 233.986 236.768 238.883 240.543 241.882 243.906 245.950 248.013 249.052 250.095 251.143 252.196 253.253 254.314
2 18.513 19.000 19.164 19.247 19.296 19.330 19.353 19.371 19.385 19.396 19.413 19.429 19.446 19.454 19.462 19.471 19.479 19.487 19.496
3 10.1280 9.5521 9.2766 9.1172 9.0135 8.9406 8.8867 8.8452 8.8123 8.7855 8.7446 8.7029 8.6602 8.6385 8.6166 8.5944 8.5720 8.5493 8.5265
4 7.7086 6.9443 6.5914 6.3882 6.2561 6.1631 6.0942 6.0410 5.9988 5.9644 5.9117 5.8578 5.8025 5.7744 5.7459 5.7170 5.6877 5.6581 5.6281
5 6.6079 5.7861 5.4095 5.1922 5.0503 4.9503 4.8759 4.8183 4.7725 4.7351 4.6777 4.6188 4.5581 4.5272 4.4957 4.4638 4.4314 4.3985 4.3650
6 5.9874 5.1433 4.7571 4.5337 4.3874 4.2839 4.2067 4.1468 4.0990 4.0600 3.9999 3.9381 3.8742 3.8415 3.8082 3.7743 3.7398 3.7047 3.6689
7 5.5914 4.7374 4.3468 4.1203 3.9715 3.8660 3.7870 3.7257 3.6767 3.6365 3.5747 3.5107 3.4445 3.4105 3.3758 3.3404 3.3043 3.2674 3.2298
8 5.3177 4.4590 4.0662 3.8379 3.6875 3.5806 3.5005 3.4381 3.3881 3.3472 3.2839 3.2184 3.1503 3.1152 3.0794 3.0428 3.0053 2.9669 2.9276
9 5.1174 4.2565 3.8625 3.6331 3.4817 3.3738 3.2927 3.2296 3.1789 3.1373 3.0729 3.0061 2.9365 2.9005 2.8637 2.8259 2.7872 2.7475 2.7067
10 4.9646 4.1028 3.7083 3.4780 3.3258 3.2172 3.1355 3.0717 3.0204 2.9782 2.9130 2.8450 2.7740 2.7372 2.6996 2.6609 2.6211 2.5801 2.5379
11 4.8443 3.9823 3.5874 3.3567 3.2039 3.0946 3.0123 2.9480 2.8962 2.8536 2.7876 2.7186 2.6464 2.6090 2.5705 2.5309 2.4901 2.4480 2.4045
12 4.7472 3.8853 3.4903 3.2592 3.1059 2.9961 2.9134 2.8486 2.7964 2.7534 2.6866 2.6169 2.5436 2.5055 2.4663 2.4259 2.3842 2.3410 2.2962
13 4.6672 3.8056 3.4105 3.1791 3.0254 2.9153 2.8321 2.7669 2.7144 2.6710 2.6037 2.5331 2.4589 2.4202 2.3803 2.3392 2.2966 2.2524 2.2064
14 4.6001 3.7389 3.3439 3.1122 2.9582 2.8477 2.7642 2.6987 2.6458 2.6022 2.5342 2.4630 2.3879 2.3487 2.3082 2.2664 2.2229 2.1778 2.1307
15 4.5431 3.6823 3.2874 3.0556 2.9013 2.7905 2.7066 2.6408 2.5876 2.5437 2.4753 2.4034 2.3275 2.2878 2.2468 2.2043 2.1601 2.1141 2.0659
16 4.4940 3.6337 3.2389 3.0069 2.8524 2.7413 2.6572 2.5911 2.5377 2.4935 2.4247 2.3522 2.2756 2.2354 2.1938 2.1507 2.1058 2.0589 2.0096
17 4.4513 3.5915 3.1968 2.9647 2.8100 2.6987 2.6143 2.5480 2.4943 2.4499 2.3807 2.3077 2.2304 2.1898 2.1477 2.1040 2.0584 2.0107 1.9604
18 4.4139 3.5546 3.1599 2.9277 2.7729 2.6613 2.5767 2.5102 2.4563 2.4117 2.3421 2.2686 2.1906 2.1497 2.1071 2.0629 2.0166 1.9681 1.9168
19 4.3807 3.5219 3.1273 2.8951 2.7401 2.6283 2.5435 2.4768 2.4227 2.3779 2.3080 2.2341 2.1555 2.1141 2.0712 2.0264 1.9795 1.9302 1.8780
20 4.3512 3.4928 3.0984 2.8661 2.7109 2.5990 2.5140 2.4471 2.3928 2.3479 2.2776 2.2033 2.1242 2.0825 2.0391 1.9938 1.9464 1.8963 1.8432
21 4.3248 3.4668 3.0725 2.8401 2.6848 2.5727 2.4876 2.4205 2.3660 2.3210 2.2504 2.1757 2.0960 2.0540 2.0102 1.9645 1.9165 1.8657 1.8117
22 4.3009 3.4434 3.0491 2.8167 2.6613 2.5491 2.4638 2.3965 2.3419 2.2967 2.2258 2.1508 2.0707 2.0283 1.9842 1.9380 1.8894 1.8380 1.7831
23 4.2793 3.4221 3.0280 2.7955 2.6400 2.5277 2.4422 2.3748 2.3201 2.2747 2.2036 2.1282 2.0476 2.0050 1.9605 1.9139 1.8648 1.8128 1.7570
24 4.2597 3.4028 3.0088 2.7763 2.6207 2.5082 2.4226 2.3551 2.3002 2.2547 2.1834 2.1077 2.0267 1.9838 1.9390 1.8920 1.8424 1.7896 1.7331
25 4.2417 3.3852 2.9912 2.7587 2.6030 2.4904 2.4047 2.3371 2.2821 2.2365 2.1649 2.0889 2.0075 1.9643 1.9192 1.8718 1.8217 1.7684 1.7110
26 4.2252 3.3690 2.9752 2.7426 2.5868 2.4741 2.3883 2.3205 2.2655 2.2197 2.1479 2.0716 1.9898 1.9464 1.9010 1.8533 1.8027 1.7488 1.6906
27 4.2100 3.3541 2.9604 2.7278 2.5719 2.4591 2.3732 2.3053 2.2501 2.2043 2.1323 2.0558 1.9736 1.9299 1.8842 1.8361 1.7851 1.7306 1.6717
28 4.1960 3.3404 2.9467 2.7141 2.5581 2.4453 2.3593 2.2913 2.2360 2.1900 2.1179 2.0411 1.9586 1.9147 1.8687 1.8203 1.7689 1.7138 1.6541
29 4.1830 3.3277 2.9340 2.7014 2.5454 2.4324 2.3463 2.2783 2.2229 2.1768 2.1045 2.0275 1.9446 1.9005 1.8543 1.8055 1.7537 1.6981 1.6377
30 4.1709 3.3158 2.9223 2.6896 2.5336 2.4205 2.3343 2.2662 2.2107 2.1646 2.0921 2.0148 1.9317 1.8874 1.8409 1.7918 1.7396 1.6835 1.6223
40 4.0847 3.2317 2.8387 2.6060 2.4495 2.3359 2.2490 2.1802 2.1240 2.0772 2.0035 1.9245 1.8389 1.7929 1.7444 1.6928 1.6373 1.5766 1.5089
60 4.0012 3.1504 2.7581 2.5252 2.3683 2.2541 2.1665 2.0970 2.0401 1.9926 1.9174 1.8364 1.7480 1.7001 1.6491 1.5943 1.5343 1.4673 1.3893
120 3.9201 3.0718 2.6802 2.4472 2.2899 2.1750 2.0868 2.0164 1.9588 1.9105 1.8337 1.7505 1.6587 1.6084 1.5543 1.4952 1.4290 1.3519 1.2539
∞ 3.8415 2.9957 2.6049 2.3719 2.2141 2.0986 2.0096 1.9384 1.8799 1.8307 1.7522 1.6664 1.5705 1.5173 1.4591 1.3940 1.3180 1.2214 1.0033
Statistical Inference:
Single mean and proportion
Dr. M.M. Mweu,
Level IV MBChB Biostatistics,
30 January, 2024
Statistical inference
 Is the procedure whereby conclusions about a population are made based on findings from a sample obtained from that population
 Since it’s often difficult to measure every individual in the population, samples are taken and inferences drawn from them about the population
 Two measures of statistical inference:
 Confidence intervals – give an estimated range of values which is likely
to include an unknown population parameter, the estimated range being
calculated from a set of sample data
 Hypothesis tests – test whether there’s sufficient evidence in a sample of
data to infer that a certain condition is true for the entire population

 These measures are linked to the concept of the sampling distribution
Statistical inference
Confidence intervals
 A confidence interval is a pair of numerical values defining
an interval, which with a specified degree of confidence
includes the parameter being estimated
 If we construct a CI for the population mean 𝜇 with a value for the lower confidence limit (𝐿𝐶𝐿) and a value for the upper confidence limit (𝑈𝐶𝐿) at the 95% degree of confidence, we can say that we are 95% certain that this CI encloses the true value of the population mean

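As a hypothetical sketch of a large-sample 95% CI for a mean, x̄ ± 1.96 × s/√n; with a sample as small as the earlier plasma-volume example a 𝑡 quantile would normally replace 1.96, so the figures below are illustrative only:

```python
from statistics import NormalDist

# Illustrative 95% CI for a mean: x_bar +/- z * s / sqrt(n)
x_bar, s, n = 3.0025, 0.3112, 8       # plasma volume sample from earlier
z = NormalDist().inv_cdf(0.975)       # about 1.96
se = s / n ** 0.5                     # standard error of the mean
lcl, ucl = x_bar - z * se, x_bar + z * se
print(lcl, ucl)
```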
Hypothesis testing
 In hypothesis testing we state that we will reject a certain hypothesis only if there is a 5% or less chance of obtaining a result as extreme as the one observed when that hypothesis is true
Statistical inference
Hypothesis testing
(a) Null hypothesis
 Frequently, there’s an expected/natural value for a parameter
– called the null value

 In hypothesis testing we assess whether the statistic computed from a sample is consistent with the null value
 If there’s consistency then the statistic will be considered
equal to the null value except for sampling & measurement
errors
 The argument that there’s consistency betwn the statistic and
the null value is the null hypothesis – denoted by 𝑯𝟎
 The 𝐻0 can be written as: 𝑯𝟎 : 𝝁 = 𝝁𝟎
Statistical inference
Hypothesis testing
(b) Alternative hypothesis
 Is the opposite of the 𝐻0 - the assertion that the statistic is
inconsistent with the null value – is denoted by 𝐻𝑎
 𝐻𝑎 states that the parameter is not equal to, is greater than, or
less than the null value
 Can be expressed as:
𝐻𝑎 : 𝜇 ≠ 𝜇0 (𝑡𝑤𝑜 𝑠𝑖𝑑𝑒𝑑)
𝐻𝑎 : 𝜇 > 𝜇0 (𝑜𝑛𝑒 𝑠𝑖𝑑𝑒𝑑)
𝐻𝑎 : 𝜇 < 𝜇0 (𝑜𝑛𝑒 𝑠𝑖𝑑𝑒𝑑)
 The choice of the 𝐻𝑎 will affect the way we conduct the test
 We choose an 𝐻𝑎 based on the prior knowledge we have about
possible values:
Statistical inference
Hypothesis testing
 Two-sided test: We are testing whether 𝜇 is, or is not equal to a specified
value 𝜇0 . We have no strong opinion whether 𝜇 is greater/less than 𝜇0 and
we state:
𝐻0 : 𝜇 = 𝜇0
𝐻𝑎 : 𝜇 ≠ 𝜇0
 One-sided test: We are testing 𝜇 to be greater/less than a given value 𝜇0 .
We need prior knowledge that 𝜇 is on a particular side of 𝜇0 . We restate
the hypothesis as:
𝐻0 : 𝜇 ≥ 𝜇0
𝐻𝑎 : 𝜇 < 𝜇0 𝑶𝑹
𝐻0 : 𝜇 ≤ 𝜇0
𝐻𝑎 : 𝜇 > 𝜇0
 Significance level: the probability threshold below which we judge the statistic inconsistent with the null value and reject the 𝐻0
 When the probability (𝑃) that the statistic is consistent with the null
value becomes too small, we say that the statistic is significantly different
from the null value and hence we reject 𝐻0 : 𝜇 = 𝜇0
 How small must 𝑃 be for us to reject the 𝐻0 ? – usually 0.05 is used
Statistical inference
Errors in hypothesis testing
Type I Error
 Denoted by 𝜶 – the probability that we reject the 𝐻0 : 𝜇 = 𝜇0 when the 𝐻0 was actually correct
 The probability that the 𝐻0 is true is the 𝑷-value
 More correctly, the 𝑃-value is the probability that we would observe a
statistic equal to, or more extreme, than the value we have observed if the
𝑯𝟎 is true
Type II Error
 Denoted by 𝜷 – occurs when the 𝐻0 is accepted when the 𝐻𝑎 is true (NB: 𝛽 is often set at 0.20)
 This allows us to calculate the power of a test, 𝟏 − 𝜷, which is the probability of rejecting 𝑯𝟎 if 𝑯𝒂 is true
 Also, the power of a test is the ability of the test to detect a real difference
when that difference exists and is of a certain magnitude

 For a given sample size 𝑛, lowering 𝛼 (say below 0.05) will increase 𝛽
 Probability of type II error (𝛽) decreases with increase in 𝑛
Statistical inference
Errors in hypothesis testing
                𝑯𝟎 True                     𝑯𝟎 False
Accept 𝑯𝟎       𝟏 − 𝜶 (confidence level)    Type II error (𝛽)
Reject 𝑯𝟎       Type I error (𝛼)            𝟏 − 𝜷 (power)
Sampling distribution
 Confidence intervals and hypothesis tests are linked to the
concept of sampling distribution
 When different samples of equal size are repeatedly taken
from the same pop & we repeatedly calculate the statistics (e.g.
estimates of 𝜇, 𝜎 and 𝜋) for each sample we get populations of
statistics with known probability distributions
 The pop originally sampled is called the parent
population/parent distribution while that of the computed
statistic is the sampling distribution
 NB: The idea behind estimation is that 𝑁, no. of individuals in the pop is very
large compared to 𝑛 the no. of individuals in the sample, so that sampling
doesn’t affect the probability of choosing a particular sample – means that
although we are not sampling with replacement, in terms of probability, it as
though we were sampling with replacement
Sampling distribution of a mean
 If all possible samples of a given size, 𝑛, were picked and a 𝑥̄
(sample mean) calculated for each, the population of 𝑥̄'s would have
a normal distribution with a mean equal to the mean of the
parent distribution and a variance that is 1/𝑛 times smaller than
that of the parent distribution i.e. the sampling distribution of 𝑥̄ is normal
with mean = 𝜇 and variance = 𝜎²/𝑛
 √(𝜎²/𝑛) = 𝜎/√𝑛 is called the standard error of the mean, which measures
the variability of the 𝑥̄'s obtained when taking repeated samples
of size 𝑛 (recall: 𝜎 measures the variability of the individual 𝑥's in the population)
 As the sample size (𝑛) increases, the variance 𝜎²/𝑛 of the sample mean
decreases, meaning that the 𝑥̄'s become clustered more closely around the
mean 𝜇 – we get more precise estimates as 𝑛 increases
Sampling distribution of a mean
 If 𝑛 (sample size) is large (𝑛 ≥ 20), the sampling distribution of 𝑥̄ will
be normal with mean = 𝜇 & variance = 𝜎²/𝑛 even if 𝑋 (the parent variable) is
not normally distributed – called the central limit theorem
Confidence interval for a mean
 To make inferences about the true mean 𝜇 we construct a CI
 We accept that the observed mean 𝑥 is generally within 1.96
(recall: 𝑍0.025 = 1.96) standard errors of the true mean 𝜇 so that the
interval: 𝑥 ± 1.96 × 𝑆𝐸(𝑥) will usually include the true value
 This means that on repeated sampling, 95% of sample means
would fall within 1.96 standard errors of the 𝜇 so that the
interval: 𝑥 ± 1.96 × 𝑆𝐸(𝑥) includes 𝝁 approx. 95% of the time (called the 95% CI)
Sampling distribution of a mean
Confidence interval for a mean
 A 99% CI is given by: 𝑥 ± 2.58 × 𝑆𝐸(𝑥)
Example
 The packed cell volume (PCV) was measured in 25 children sampled randomly
from children aged 4 yrs living in a large West African village, with the following
results:
𝑥̄ = 34.1    𝑠 = 4.3
Using the 𝑠 as an unbiased estimator of 𝜎 we obtain the 95% CI of:

34.1 ± 1.96 × 4.3/√25 = 𝟑𝟐.𝟒 𝒕𝒐 𝟑𝟓.𝟖
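The PCV calculation above can be checked with a short Python sketch (illustrative, not part of the original notes; the function name is my own):

```python
from math import sqrt

def mean_ci_z(xbar, s, n, z=1.96):
    """95% CI for a population mean using the normal multiplier."""
    se = s / sqrt(n)                    # standard error of the mean
    return xbar - z * se, xbar + z * se

# PCV example: x-bar = 34.1, s = 4.3, n = 25
lcl, ucl = mean_ci_z(34.1, 4.3, 25)
print(round(lcl, 1), round(ucl, 1))    # prints: 32.4 35.8
```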
Use of the 𝒕 distribution
 As the value of 𝜎 is generally unknown (recall: 95% 𝐶𝐼 = 𝑥̄ ± 1.96 ×
𝜎/√𝑛), we have to use 𝑠 as an estimate of 𝜎 – this introduces
sampling error into the calculation
 Due to this error, the interval 𝑥̄ ± 1.96 × 𝑠/√𝑛 includes 𝜇
less than 95% of the time i.e. the calculated interval is too narrow
Sampling distribution of a mean
Confidence interval for a mean
Use of the 𝒕 distribution
 To correct for this we use a multiplying factor larger than 1.96
– makes interval wider and restores confidence level to 95%
 The multiplying factor is contained in the 𝑡 𝑑𝑖𝑠𝑡𝑟𝑖𝑏𝑢𝑡𝑖𝑜𝑛
 The factor depends on the degrees of freedom (v) used to
calculate the sample SD 𝑠 (𝑑𝑓 𝑎𝑟𝑒 𝑜𝑛𝑒 𝑙𝑒𝑠𝑠 𝑡ℎ𝑎𝑛 𝑠𝑎𝑚𝑝𝑙𝑒 𝑠𝑖𝑧𝑒 𝑖. 𝑒. 𝒗 = 𝒏 − 𝟏)
 As 𝑛 increases the factor approaches 𝑍0.025 = 1.96 – hence t
distribution only needs to be used for 𝑛 < 20
Example
 In the PCV example, 𝑣 = 25 − 1 = 24. Using the 𝑡 distribution with 24 𝑑𝑓, the
95% CI is: 𝑥̄ ± 𝑡(𝑛−1),𝛼/2 × 𝑠/√𝑛 = 34.1 ± 2.064 × 4.3/√25 = 𝟑𝟐.𝟑 𝐭𝐨 𝟑𝟓.𝟗
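The same interval with the 𝑡 multiplier can be sketched as below (illustrative; the critical value 2.064 for 24 df is taken from the 𝑡 tables rather than computed, since the standard library has no 𝑡 quantile function):

```python
from math import sqrt

def mean_ci_t(xbar, s, n, t_crit):
    """CI for a mean using a t multiplier (t_crit from tables, df = n - 1)."""
    se = s / sqrt(n)
    return xbar - t_crit * se, xbar + t_crit * se

# PCV example: t with 24 df at the two-sided 5% level is 2.064
lcl, ucl = mean_ci_t(34.1, 4.3, 25, t_crit=2.064)
print(round(lcl, 1), round(ucl, 1))    # prints: 32.3 35.9
```

Note how the 𝑡-based interval (32.3 to 35.9) is slightly wider than the 𝑧-based one (32.4 to 35.8), as the notes state.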
Sampling distribution of a mean
Significance test for a mean
 We may wish to test a specific hypothesis about the pop mean 𝜇
e.g. if data on 4yr children in USA indicate a mean PCV of 37.1 we may test whether
our sample data (West African) are consistent with the 𝐻0 :
𝐻0 : 𝜇 = 𝜇0 = 37.1
𝐻𝑎 : 𝜇 ≠ 𝜇0
 One approach is to see whether the 95% CI includes the
hypothesised value (37.1) – it doesn't (some evidence against the 𝐻0)
 More objectively, we use a significance test and examine the 𝑃-value:

𝑍 = (𝑥̄ − 𝜇0)/𝑆𝐸(𝑥̄) = (34.1 − 37.1)/(4.3/√25) = −𝟑.𝟒𝟗

 From the 𝑍 𝑡𝑎𝑏𝑙𝑒𝑠 we get the 𝑃-value: 2 × 0.00024 = 𝟎.𝟎𝟎𝟎𝟒𝟖 (NB: the
tabulated tail probability is one-tailed, so multiply it by 2 for a two-sided test)

 Interpretation: The data provide strong evidence against 𝑯𝟎 hence the mean PCV in 4yr

old children in the West African village is different from that of children of the same
age in the USA
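A minimal sketch of this one-sample 𝑧 test (illustrative; `NormalDist` from the Python standard library replaces the 𝑍 tables):

```python
from math import sqrt
from statistics import NormalDist

def z_test_mean(xbar, mu0, s, n):
    """Two-sided one-sample z test for a mean; returns (z, p_value)."""
    z = (xbar - mu0) / (s / sqrt(n))
    p = 2 * (1 - NormalDist().cdf(abs(z)))   # two-tailed P-value
    return z, p

# West African PCV data against the hypothesised USA mean of 37.1
z, p = z_test_mean(34.1, 37.1, 4.3, 25)
print(round(z, 2), p < 0.001)    # prints: -3.49 True
```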
Sampling distribution of a mean
Significance test for a mean
 If 𝑛 < 20 then the 𝑡 𝑑𝑖𝑠𝑡𝑟𝑖𝑏𝑢𝑡𝑖𝑜𝑛 is more appropriate:

𝑡 = (𝑥̄ − 𝜇0)/𝑆𝐸(𝑥̄) = −𝟑.𝟒𝟗

 From the 𝑡 𝑡𝑎𝑏𝑙𝑒𝑠, 3.49 falls betwn 3.467 & 3.745 (24 𝑑𝑓) hence 𝟎.𝟎𝟎𝟏 <
𝑷 < 𝟎.𝟎𝟎𝟐
Sampling distribution of a proportion
Example: In a survey of 335 men attending a health centre in Guilford (UK), 127
(37.9%) men said they were current smokers

 How can we use the result from the above sample to say
something about the population which it represents? – we
use the concept of sampling distribution
 Suppose we repeatedly took a new sample of 335 men from
this health centre (assuming a large no. of men are registered there) and
calculated the proportion who smoked & then created a
histogram of these values – histogram would represent the
sampling distribution of the proportion:
(figure: histogram of the sample proportions 𝑝, centred on the true population proportion 𝝅)
Sampling distribution of a proportion
 In practice we only conduct one survey from which to
estimate 𝑝
 Is 𝑝 close to 𝜋 or is it very different from 𝜋?
 In any random sample there’s some sampling variation in 𝑝 so
that the larger the 𝑛 the smaller the sampling variation
 The sampling variation of a proportion is described by its
standard error:

𝑝 × (1 − 𝑝)
𝑛
 The 𝑆𝐸 of a proportion is a measure of how far our observed
proportion 𝑝 differs from the true pop proportion 𝜋
 In the previous UK example, the estimated 𝑆𝐸 of the
proportion of smokers is: √(0.379 × (1 − 0.379)/335) = 𝟎.𝟎𝟐𝟔𝟓 𝒐𝒓 𝟐.𝟔𝟓%
Sampling distribution of a proportion
 To estimate the interval of possible values within which the true
pop proportion 𝜋 lies we compute the CI:
𝑝 ± 𝑍𝛼/2 × √(𝑝(1 − 𝑝)/𝑛) = 0.379 ± 1.96 × √(0.379 × 0.621/335)

= 0.327 𝑡𝑜 0.431 𝑜𝑟 𝟑𝟐.𝟕% (𝑳𝑪𝑳) 𝒕𝒐 𝟒𝟑.𝟏% (𝑼𝑪𝑳)


Interpretation: We are 95% confident that the true percentage of smokers in
Guilford, UK lies between 32.7% and 43.1%
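The Guilford standard error and confidence interval can be reproduced with this illustrative sketch (not part of the original notes):

```python
from math import sqrt

def prop_ci(p, n, z=1.96):
    """SE and 95% CI for a population proportion (normal approximation)."""
    se = sqrt(p * (1 - p) / n)
    return se, p - z * se, p + z * se

# Guilford smokers: p = 127/335
se, lcl, ucl = prop_ci(127 / 335, 335)
print(round(se, 4), round(lcl, 3), round(ucl, 3))   # prints: 0.0265 0.327 0.431
```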

Significance test for a proportion

𝐻0 : 𝜋 = 𝜋0
𝐻𝑎 : 𝜋 ≠ 𝜋0

𝑍 = (𝑝 − 𝜋0)/𝑆𝐸(𝑝)

 For this 𝑍 test, again the conditions below must be satisfied:
𝑛𝑝 ≥ 5 and 𝑛(1 − 𝑝) ≥ 5
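The notes give no worked example for this test, so the sketch below uses a purely hypothetical null value 𝜋0 = 0.30 with the Guilford data, just to show the mechanics (function name is illustrative):

```python
from math import sqrt
from statistics import NormalDist

def z_test_prop(p, pi0, n):
    """Two-sided z test of H0: pi = pi0 (checks np >= 5 and n(1-p) >= 5)."""
    assert n * p >= 5 and n * (1 - p) >= 5, "normal approximation not valid"
    z = (p - pi0) / sqrt(p * (1 - p) / n)
    return z, 2 * (1 - NormalDist().cdf(abs(z)))

# hypothetical null value pi0 = 0.30 (not from the notes), Guilford sample
z, pval = z_test_prop(127 / 335, 0.30, 335)
print(round(z, 2), pval < 0.05)   # prints: 2.98 True
```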
Comparing two Population Means &
Proportions

Dr. M.M. Mweu,
Level II MBChB Biostatistics,
30 November, 2016
UoN School of Public Health

Comparing two means


 The objective here is to compare the mean value in two
populations, or in two sub-populations by:
 Calculating a confidence interval for the difference between 2
sample means which allows for sampling error in estimating the
difference between the true means
 Testing the hypothesis that the true means in the 2
populations are equal
 The underlying assumptions in these calculations are:
 Variable of interest is normally distributed
 Observations are independent i.e. random samples are chosen
independently from the 2 pops of interest and there’s no
connection, for example, between 1st observation in one sample
and the 1st observation in the other sample
Sampling distribution of the difference betwn 2 means


 Random samples of size n1 and n2 are taken from 2 pops of interest. The
means and SDs of a quantitative variable x in the 2 pops and samples are:

        Pop 1   Sample 1   Pop 2   Sample 2
Mean    𝜇1      𝑥̄1         𝜇2      𝑥̄2
SD      𝜎1      𝑠1         𝜎2      𝑠2

 If random samples of a given size of the variable x were taken repeatedly
in each of pop 1 & pop 2 and each time we measured (𝑥̄1 − 𝑥̄2 ) we would
find that:
 The values of 𝑥1 , 𝑠1, 𝑥2 , 𝑠2 would vary from sample to sample
 The values of (𝑥1 - 𝑥2 ) would be distributed symmetrically (normal
distr.) above and below the true population value (𝜇1 -𝜇2)
 Values near (𝜇1 -𝜇2) would occur more frequently than values far
from (𝜇1 -𝜇2)
Confidence interval for difference betwn 2 means


• Assuming a large sample size (n≥40) the 95% CI for the
difference betwn 2 means is given by:

(𝑥̄1 − 𝑥̄2 ) ± Z𝛼/2 × √(𝑠1²/𝑛1 + 𝑠2²/𝑛2)

NB: If there’s no real difference betwn the 2 means the CI


around the difference should include zero
Example:
 In a cohort study in northeast Brazil, the mental and psychomotor
development of low birth weight (1500-2499g) infants born at ≥37
weeks gestation (at term), were compared to those of a control sample
of infants born with appropriate birth weight (3000-3499g). Results
for mental devpt at 12 months of age in samples of low and
appropriate birth weight infants were as follows:
Confidence interval for difference betwn 2 means


Mental development score
                               n     Mean     SD
Appropriate b. weight (ABW)    84    115.1    9.44
Low b. weight (LBW)            84    108.1    11.50

 What can be said about the mental development scores of
children in these 2 groups?
n1 = 84, 𝑥̄1 = 115.1, s1 = 9.44
n2 = 84, 𝑥̄2 = 108.1, s2 = 11.50
The difference in means is (115.1 – 108.1) = 7.0
The standard error is √(9.44²/84 + 11.5²/84) = √2.635 = 1.623
Confidence interval for difference betwn 2 means


So the 95% CI for the difference (𝜇1 − 𝜇2 ) is given by:
(115.1 – 108.1) ± 1.96 × √(9.44²/84 + 11.5²/84) or 7.0 ± 1.96 × 1.623
which is: 3.82 to 10.18
So the data suggest that at age 12 months, ABW children have, on
average, a mental development score betwn 3.8 and 10.2 points
higher than LBW children
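The Brazilian-study interval can be checked with this illustrative sketch (function name is my own):

```python
from math import sqrt

def diff_means_ci(x1, s1, n1, x2, s2, n2, z=1.96):
    """Large-sample 95% CI for the difference between two means."""
    se = sqrt(s1**2 / n1 + s2**2 / n2)   # SE of the difference
    d = x1 - x2
    return d, d - z * se, d + z * se

# ABW vs LBW mental development scores at 12 months
d, lcl, ucl = diff_means_ci(115.1, 9.44, 84, 108.1, 11.50, 84)
print(round(d, 1), round(lcl, 2), round(ucl, 2))   # prints: 7.0 3.82 10.18
```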
Significance test for comparison of 2 means


 To test the hypothesis about the difference (𝜇1 -𝜇2) between 2
pop means, we write:
Ho: 𝜇1 = 𝜇2
Ha: 𝜇1 ≠ 𝜇2
Z = (𝑥̄1 − 𝑥̄2 )/√(𝑠1²/𝑛1 + 𝑠2²/𝑛2)

After calculating the Z, the P-value may be read from tables of
the normal distribution.
Example:
In the Brazilian study,
Z = (115.1 – 108.1)/√(9.44²/84 + 11.5²/84) or 7.0/1.623 = 4.31
From the Z tables, we find that P<0.001. In other words, there’s
strong evidence for a real difference betwn the 2 pop means
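The significance test for the same data can be sketched as follows (illustrative, reusing the same standard error as the CI):

```python
from math import sqrt
from statistics import NormalDist

def z_test_two_means(x1, s1, n1, x2, s2, n2):
    """Two-sided z test of H0: mu1 = mu2 for large independent samples."""
    z = (x1 - x2) / sqrt(s1**2 / n1 + s2**2 / n2)
    return z, 2 * (1 - NormalDist().cdf(abs(z)))

# Brazilian study: ABW vs LBW
z, pval = z_test_two_means(115.1, 9.44, 84, 108.1, 11.50, 84)
print(round(z, 2), pval < 0.001)   # prints: 4.31 True
```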
Significance test for comparison of 2 means


 NB: Note the close relationship between the significance test and the
confidence interval. The test will give a P-value less than 0.05 if the 95% CI
excludes the hypothesised value (0) and vice versa. In the example, the 95% CI
(3.82 to 10.2) excludes zero, and the P-value is less than 0.05.

 If sample sizes are small (n<40), and the distributions of the
individual values are approx. normal, a t distribution is used
 An additional assumption is that 𝜎1 and 𝜎2 are equal to a
common value (common variance). So,
CI: (𝑥̄1 − 𝑥̄2 ) ± t × √(𝑆𝑝²(1/𝑛1 + 1/𝑛2))   where   𝑆𝑝² = [(𝑛1 − 1)𝑠1² + (𝑛2 − 1)𝑠2²]/(𝑛1 + 𝑛2 − 2)

The value of t is read from the tables of the t-distribution with (𝑛1 + 𝑛2 − 2) degrees
of freedom

Significance test:   t = (𝑥̄1 − 𝑥̄2 )/√(𝑆𝑝²(1/𝑛1 + 1/𝑛2))   (Unpaired t-test)
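The pooled-variance calculation can be sketched as below; the sample figures fed in at the bottom are hypothetical, chosen only to exercise the function (the notes give no small-sample example here):

```python
from math import sqrt

def pooled_t(x1, s1, n1, x2, s2, n2):
    """Unpaired t statistic assuming a common variance; df = n1 + n2 - 2."""
    # pooled variance Sp^2
    sp2 = ((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2)
    t = (x1 - x2) / sqrt(sp2 * (1 / n1 + 1 / n2))
    return t, n1 + n2 - 2

# hypothetical small samples (not from the notes)
t, df = pooled_t(10.2, 1.5, 12, 9.1, 1.4, 10)
print(round(t, 2), df)   # prints: 1.76 20
```

The resulting t would then be compared against the t tables with 20 df.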
Significance test for comparison of 2 means


 However, if variances are different (unequal variances) then:
 First, test for equality of variances using an 𝐹 test:

𝐹(𝑉1 ,𝑉2 ) = 𝑆1²/𝑆2²

𝑤ℎ𝑒𝑟𝑒 𝑡ℎ𝑒 𝑙𝑎𝑟𝑔𝑒𝑟 𝑣𝑎𝑟𝑖𝑎𝑛𝑐𝑒 𝑓𝑜𝑟𝑚𝑠 𝑡ℎ𝑒 𝑛𝑢𝑚𝑒𝑟𝑎𝑡𝑜𝑟 𝑎𝑛𝑑 𝑉1 = 𝑛1 − 1 & 𝑉2 = 𝑛2 − 1 𝑑𝑓

 If the 𝐹 test-statistic calculated above is significant then
perform:

t = (𝑥̄1 − 𝑥̄2 )/√(𝑠1²/𝑛1 + 𝑠2²/𝑛2)   (Welch t-test)

 NB: It is common to see t-tests used where sample sizes are
large and adequate for use of the z test.
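The variance-ratio check and the Welch statistic can be sketched together (illustrative; the input figures are hypothetical, and in practice the F and Welch t would still be referred to their tables):

```python
from math import sqrt

def f_ratio(s1, n1, s2, n2):
    """F statistic for equality of variances: larger variance on top."""
    v1, v2 = s1**2, s2**2
    if v1 >= v2:
        return v1 / v2, n1 - 1, n2 - 1   # (F, numerator df, denominator df)
    return v2 / v1, n2 - 1, n1 - 1

def welch_t(x1, s1, n1, x2, s2, n2):
    """Welch t statistic for unequal variances."""
    return (x1 - x2) / sqrt(s1**2 / n1 + s2**2 / n2)

# hypothetical figures (not from the notes)
F, v1, v2 = f_ratio(2.0, 10, 1.2, 12)
t = welch_t(10.2, 2.0, 10, 9.1, 1.2, 12)
print(round(F, 2), v1, v2, round(t, 2))   # prints: 2.78 9 11 1.53
```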
Comparing two population means for paired data


 A paired t-test is used to compare 2 pop means when you
have 2 samples in which observations in one sample are
paired with observations in the other sample
 Examples of paired data:
 Before-and-after observations on the same subjects (e.g. students diagnostic
test results before and after a particular course)
 A comparison of 2 different methods of measurement or 2 different
treatments where the measurements/treatments are applied to the same
subjects (e.g. antigen and antibody ELISA tests on blood samples)
 In matched study designs e.g. in a matched clinical trial where we wish to test a
new therapy for leg ulcers in sickle cell anaemia relative to an existing therapy. We
form pairs of patients matched for age, sex and severity of ulcers and randomly
allocate one member of the pair to new therapy, and the other to the existing therapy
and compare the quantitative outcomes on the 2 treatments

 Observations in a pair are not independent, however different
pairs are independent
Comparing two population means for paired data


 Let’s assume 2 measurements (𝑥 𝑎𝑛𝑑 𝑦) are taken on each sample
drawn from different subjects. Steps involved in carrying out a
paired t-test are:
 Calculate the difference (𝑑𝑖 = 𝑥𝑖 − 𝑦𝑖 ) betwn the 2 observations on each pair,
making sure you distinguish betwn positive and negative differences

 Calculate the mean difference, 𝑑 (NB: the paired t-test assumes 𝑑𝑖 𝑠 are
normally distributed, if not then non-parametric tests are used)
 State the hypotheses:
𝐻0 : 𝐷 = 0
𝐻𝑎 : 𝐷 ≠ 0

 Calculate the standard deviation of the differences, 𝑠𝑑 , and use it to
calculate the standard error of the mean difference, 𝑆𝐸(𝑑̄) = 𝑠𝑑 /√𝑛
 Calculate the t-statistic, 𝑇 = 𝑑̄/𝑆𝐸(𝑑̄). Under the 𝐻0 , this
statistic follows a t-distribution with 𝑛 − 1 degrees of freedom
 Use t-tables to compare your value of 𝑇 to the critical 𝑡𝑛−1 distribution. This will
give the P-value for the paired t-test
Comparing two population means for paired data


Example
 During a nutritional survey, a quality control exercise was carried out to check
the agreement betwn 2 observers in taking skinfold measurements. Both
observers measured the same 15 adults, with the following results:
Skinfold measurement (mm) Difference
Subject Observer A Observer B A–B
1 21.5 18.3 +3.2
2 25.0 21.5 +3.5
3 19.3 16.3 +3.0
4 33.9 32.3 +1.6
5 15.9 19.1 -3.2
6 39.9 34.6 +5.3
7 20.8 16.8 +4.0
8 33.2 31.0 +2.2
9 34.4 32.5 +1.9
10 20.5 18.6 +1.9
11 14.6 14.0 +0.6
12 15.8 15.5 +0.3
13 18.4 16.4 +2.0
14 25.5 19.0 +6.5
15 19.0 17.6 +1.4
Comparing two population means for paired data


Example
 Mean difference 𝑑̄ = +2.28 and standard deviation 𝑠𝑑 = 2.25
 The 𝑇 statistic is given by: 𝑇 = 2.28/(2.25/√15) = 𝟑.𝟗𝟐 𝑤𝑖𝑡ℎ 14 𝑑𝑓
 This gives 𝟎.𝟎𝟎𝟏 < 𝑷 < 𝟎.𝟎𝟎𝟐, so we reject 𝐻0 and conclude that there's strong
evidence of a real difference between the two observers.
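The paired t statistic for the skinfold data can be reproduced directly from the column of differences (illustrative sketch, using the standard library's `mean` and `stdev`):

```python
from math import sqrt
from statistics import mean, stdev

# differences (A - B) for the 15 subjects in the skinfold example
d = [3.2, 3.5, 3.0, 1.6, -3.2, 5.3, 4.0, 2.2, 1.9, 1.9,
     0.6, 0.3, 2.0, 6.5, 1.4]

d_bar = mean(d)                     # mean difference
se = stdev(d) / sqrt(len(d))        # SE of the mean difference
T = d_bar / se                      # t statistic, df = n - 1 = 14
print(round(d_bar, 2), round(stdev(d), 2), round(T, 2))   # prints: 2.28 2.25 3.92
```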
Comparing two proportions


 As with means, to compare two proportions (percentages) we
use:
 Statistical test of significance
 95% CI for the difference in the 2 proportions
Statistical significance test:
Example:
In a clinical trial for advanced (metastatic) breast cancer,
patients were randomly assigned to L-Pam or CMF. Tumour
response was defined as a shrinkage of tumour surface area by
at least a half for a minimum of 2 weeks:
                          CMF          L-Pam        Total
Tumour response   Yes     49 (52.7%)   18 (19.8%)   67 (36.4%)
                  No      44           73           117
Total patients            93           91           184
Comparing two proportions


Statistical significance test
 Is CMF better than L-Pam in shrinkage of the tumour?

H o: 𝜋 1 = 𝜋 2
Ha: 𝜋1 ≠ 𝜋2

Z = (p1 − p2)/√(𝑝̄𝑞̄(1/𝑛1 + 1/𝑛2))   where   𝑝̄ = (𝑥1 + 𝑥2)/(𝑛1 + 𝑛2) and 𝑞̄ = 1 − 𝑝̄

 Here we make use of the fact that the difference between the 2 observed
proportions has approximately a normal distribution (normal
approximation to the binomial distribution): n𝜋 ≥ 5 ; n(1 − 𝜋) ≥ 5

 |Z| = (52.7% − 19.8%)/√(36.4 × 63.6 × (1/93 + 1/91)) = 32.9/7.1 = 4.63
P<0.001 (Strong evidence that CMF patients had better response than L-Pam patients)
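The pooled two-proportion test can be sketched as below (illustrative; note the slide's 4.63 comes from the rounded percentages, while the exact counts give ≈4.64):

```python
from math import sqrt
from statistics import NormalDist

def z_test_two_props(x1, n1, x2, n2):
    """Two-sided z test of H0: pi1 = pi2 using the pooled proportion."""
    p1, p2 = x1 / n1, x2 / n2
    p_bar = (x1 + x2) / (n1 + n2)                     # pooled proportion
    se = sqrt(p_bar * (1 - p_bar) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    return z, 2 * (1 - NormalDist().cdf(abs(z)))

# breast cancer trial: CMF 49/93 vs L-Pam 18/91
z, pval = z_test_two_props(49, 93, 18, 91)
print(round(z, 2), pval < 0.001)   # prints: 4.64 True
```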
Comparing two proportions


Confidence interval:
 The 95% CI for the difference betwn 2 proportions (percentages) is:
Observed difference (p) ± 1.96*Standard error of difference
p ± 1.96 × √(𝑝1(1 − 𝑝1)/𝑛1 + 𝑝2(1 − 𝑝2)/𝑛2)

 So in the previous example the 95% CI is given by:

32.9% ± 1.96 × √(52.7 × (100 − 52.7)/93 + 19.8 × (100 − 19.8)/91)
= 32.9% ± 13.04
= 19.86% to 45.94% (interval doesn't include zero)
So, we are 95% confident that the true pop difference in tumour
responses between CMF & L-Pam is between 19.86% and 45.94%
NB: Standard error formula in the above calculation doesn’t assume the Ho of the
2 proportions being equal (common variance).
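The unpooled confidence interval above can be checked with this illustrative sketch (function name is my own):

```python
from math import sqrt

def diff_props_ci(x1, n1, x2, n2, z=1.96):
    """95% CI for the difference between two proportions (unpooled SE)."""
    p1, p2 = x1 / n1, x2 / n2
    se = sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    d = p1 - p2
    return d, d - z * se, d + z * se

# breast cancer trial: CMF 49/93 vs L-Pam 18/91
d, lcl, ucl = diff_props_ci(49, 93, 18, 91)
print(round(100 * lcl, 1), round(100 * ucl, 1))   # prints: 19.9 45.9
```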