Biostatistics-1
Probability distributions
Introduction to probability (mutually exclusive & independent events), Normal, Binomial & Poisson distributions
Sources of data:
Routinely kept records – usually the first stopover in the quest for data,
e.g. hospital medical records containing patient info
Surveys – often the logical source when recorded data are
absent/insufficient, e.g. if a hospital administrator seeks to understand the
mode of transportation used by patients to visit a hospital and admission forms
don't contain a question on transportation modes, then a survey amongst
patients is necessary
Types of variables:
Quantitative: measurable in terms of magnitude
Continuous – can take on any value in a given range e.g. height, blood
pressure, temperature
Descriptive statistics
Quantitative variables
Central tendency measures
Arithmetic mean: 𝑥̄ = Σᵢ₌₁ⁿ 𝑥ᵢ / 𝑛
where 𝑥ᵢ is the 𝑖-th value of the variable and 𝑛 denotes the no. of values in the sample
Descriptive statistics
Quantitative variables
Central tendency measures
Arithmetic mean
Example:
The plasma volumes of 8 healthy men are: 2.75, 2.86, 3.37, 2.76, 2.62, 3.49, 3.05,
3.12 litres, giving a mean of 24.02/8 = 3.0025 litres
Descriptive statistics
Quantitative variables
Central tendency measures
Mode – the value that occurs most frequently in a set of data
[Figure: positions of the mean, median and mode on symmetric and skewed frequency distributions]
Descriptive statistics
Quantitative variables
Dispersion measures
Range – the difference between the largest (𝑥_𝐿) and smallest (𝑥_𝑆) values, given by:
𝑅 = 𝑥_𝐿 − 𝑥_𝑆
Descriptive statistics
Quantitative variables
Dispersion measures
Variance
𝑠² = Σᵢ₌₁ⁿ (𝑥ᵢ − 𝑥̄)² / (𝑛 − 1)
The sum of the deviations of the values from their mean is equal to zero. If we know
the values of 𝑛 − 1 of the deviations from the mean, the 𝑛th one is automatically
determined because all 𝑛 deviations must sum to zero
Descriptive statistics
Quantitative variables
Dispersion measures
Standard deviation
𝑠 = √( Σᵢ₌₁ⁿ (𝑥ᵢ − 𝑥̄)² / (𝑛 − 1) )
Coefficient of variation (CV): 𝐶𝑉 = (𝑠/𝑥̄) × 100%
Since the mean and SD have the same units, the CV has no units
Descriptive statistics
Quantitative variables
Dispersion measures
Coefficient of variation (CV)
Example:
Two samples of male patients are aged 25 years and 11 years respectively. We wish
to know which is more variable, the weights of the 25-year-olds or the 11-year-olds
[Figure: histogram and frequency polygon for the ages of 205 doctors; x-axis: age in years (25–55), y-axis: frequency]
Descriptive statistics
Quantitative variables
Graphical techniques
[Figure: box-and-whisker plot showing the median, Q1 (25th percentile), Q3 (75th percentile), the interquartile range (IQR), and outlier cut-offs at Q1 − 1.5×IQR and Q3 + 1.5×IQR]
Scatter plots show the relationship between two quantitative variables,
e.g. one axis of the plot could represent height and the other weight, with each
person in the dataset receiving one data point on the plot that corresponds to
his/her height and weight
Descriptive statistics
Qualitative variables
e Descriptive stats for analysing categorical variables include:
frequencies, percentages and fractions (relative frequencies)
obtained from a variable’s frequency distribution table
Example:
The following frequency table shows the frequency of each category of marital
status in a sample of 80 people along with the corresponding percentages:
Marital status   Frequency   Percent
Single           44          55.0
Married          29          36.3
Other             7           8.8
Total            80         100.0
Descriptive statistics
Qualitative variables
Bar charts use bars (which don't touch) to represent each category of the
variable of interest – the length of each bar is proportional to the frequency
of that category of the variable:
[Figure: bar chart of counts by marital status for the sample]
Descriptive statistics
Qualitative variables
In pie charts the percentages are represented by the angles of the different
slices of a circle – the total angle (360 degrees) representing 100%
Introduction to Biostatistics
Sources of data:
Routinely kept records – usually the first stopover in quest for data
e.g. hospital medical records containing patient info
Types of variables:
Quantitative: measurable in terms of magnitude
Continuous – can take on any value in a given range e.g. height, blood
pressure, temperature
Discrete – can take on only whole numbers e.g. counts of lesions
Qualitative: variables not measurable or described in magnitude –
often called factors
Ordinal – have an order but no magnitude e.g. education levels –
primary/secondary/tertiary, severity – mild/moderate/severe, prognosis –
good/fair/poor, socioeconomic status (low/medium/high)
Nominal – categorised without any implied order e.g. sex, dead/alive,
ethnicity, religion, marital status
Descriptive statistics
Quantitative variables
We often summarise data using a single number – a
descriptive measure
Descriptive measures for quantitative data are:
Central tendency (location) – measures the centre of the
distribution
Often the location where data tend to cluster
Captures the central/average value of a set of data
Dispersion (spread) – refers to the average distance between
observations
Indicates the magnitude by which individuals tend to differ from each
other
By determining how far an observation lies from the rest we can
deduce whether such an observation is unusual (outlier) or not
Descriptive statistics
Quantitative variables
Central tendency measures
Properties:
Uniqueness – for a given set of data there’s only one mean
Simplicity – the mean is easily understood and easy to compute
Prone to distortion by extreme values since each value in a
dataset enters into the computation of the mean
Descriptive statistics
Quantitative variables
Central tendency measures
Median
Value that divides a set of values into two equal parts such
that the no. of values equal to or greater than the median is
equal to the no. of values equal to or less than the median
If no. of values is odd then the median is the middle value
when all values are arranged in order of magnitude
If the no. of values is even, then the median is the mean of
the two middle values when all values are arranged in order
of magnitude
Properties:
Uniqueness – as with mean, there’s only a single median for a
given set of data
Simplicity – is easy to calculate
Not drastically affected by extreme values as is the mean
Descriptive statistics
Quantitative variables
Central tendency measures
Mode
Dispersion measures
𝑠 = √( Σᵢ₌₁ⁿ (𝑥ᵢ − 𝑥̄)² / (𝑛 − 1) )
𝑥       𝑥 − 𝑥̄     (𝑥 − 𝑥̄)²
2.75   −0.2525   0.0638
2.86   −0.1425   0.0203
3.37    0.3675   0.1351
2.76   −0.2425   0.0588
2.62   −0.3825   0.1463
3.49    0.4875   0.2377
3.05    0.0475   0.0023
3.12    0.1175   0.0138
Σ = 24.02   0   0.6781
𝑥̄ = 3.0025
𝑠² = Σ(𝑥 − 𝑥̄)² / (𝑛 − 1) = 0.6781 / 7 = 0.0968
𝑠 = √𝑠² = √0.0968 = 0.3112
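The mean, variance and SD in the worked example above can be reproduced in a few lines of Python (a minimal sketch, not part of the original slides):

```python
import math

# Plasma volumes (litres) of the 8 healthy men from the worked example
values = [2.75, 2.86, 3.37, 2.76, 2.62, 3.49, 3.05, 3.12]

n = len(values)
mean = sum(values) / n                         # x-bar = 24.02 / 8 = 3.0025
ss = sum((x - mean) ** 2 for x in values)      # sum of squared deviations, approx 0.6781
variance = ss / (n - 1)                        # sample variance s^2, divisor n - 1
sd = math.sqrt(variance)                       # sample standard deviation s
```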
Descriptive statistics
Quantitative variables
Dispersion measures
Coefficient of variation (CV)
Example:
Suppose two samples of male patients yield the following results:
Sample 1 Sample 2
Age 25 years 11 years
𝑄1 = (𝑛 + 1)/4 th ordered observation
Descriptive statistics
Quantitative variables
Dispersion measures
Interquartile range (IQR)
𝑄2 = 2(𝑛 + 1)/4 th ordered observation
𝑄3 = 3(𝑛 + 1)/4 th ordered observation
𝐼𝑄𝑅 = 𝑄3 − 𝑄1
Example:
The number of days spent in hospital by 17 subjects after an operation arranged
in increasing size were: 3, 4, 4, 6, 8, 8, 8, 10, 10, 12, 14, 14, 17, 25, 27, 37, 42
𝑄1 = (17 + 1)/4 th observation = 4.5th obs. = (6 + 8)/2 = 𝟕
𝑄2 = 2(17 + 1)/4 th observation = 9th obs. = 𝟏𝟎
𝑄3 = 3(17 + 1)/4 th observation = 13.5th obs. = (17 + 25)/2 = 𝟐𝟏
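The quartile rule above (take the k(n + 1)/4-th ordered observation, averaging the two neighbours when the position is fractional) can be sketched in Python; this reproduces the hospital-stay example:

```python
# Hospital-stay data from the example (n = 17), already sorted
days = [3, 4, 4, 6, 8, 8, 8, 10, 10, 12, 14, 14, 17, 25, 27, 37, 42]

def quartile(sorted_vals, k):
    """k-th quartile via the k(n + 1)/4-th ordered observation,
    averaging the two neighbours when the position is fractional."""
    pos = k * (len(sorted_vals) + 1) / 4      # 1-based position, e.g. 4.5
    lo = int(pos)
    if pos == lo:
        return sorted_vals[lo - 1]
    return (sorted_vals[lo - 1] + sorted_vals[lo]) / 2

q1, q2, q3 = (quartile(days, k) for k in (1, 2, 3))
iqr = q3 - q1    # IQR = Q3 - Q1
```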
Descriptive statistics
Quantitative variables
Dispersion measures
Interquartile range (IQR)
𝑤 = 𝑅/𝑘 – class interval width for a frequency distribution, where 𝑅 is the range and 𝑘 the no. of class intervals
Intervals are ordered from smallest to largest
Descriptive statistics
Quantitative variables
Graphical techniques
(a) Frequency distribution and the histogram
Example:
Table below shows a frequency distribution – values of age are distributed
among the specified class intervals:
Consists of:
Stem – has one or more of the initial digits of the measurement
Stems form an ordered column with the smallest stem at the top
The stem column has all stems within the range of the data even when an
observation with that stem is not in the dataset
Example:
Data 44 46 47 49 63 64 66 68 68 72 72 75 76 81 84 88 106
Stem Leaf
4 4679
5
6 34688
7 2256
8 148
9
10 6
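The stem-and-leaf construction described above (stems form an ordered column, including empty stems within the range) can be sketched in Python; this reproduces the display for the example data:

```python
from collections import defaultdict

# Data from the stem-and-leaf example
data = [44, 46, 47, 49, 63, 64, 66, 68, 68, 72, 72, 75, 76, 81, 84, 88, 106]

def stem_and_leaf(values):
    """Return lines 'stem | leaves'; every stem in range appears, even if empty."""
    leaves = defaultdict(list)
    for v in sorted(values):
        leaves[v // 10].append(v % 10)        # stem = leading digits, leaf = last digit
    lines = []
    for stem in range(min(leaves), max(leaves) + 1):
        lines.append(f"{stem} | {''.join(str(l) for l in leaves[stem])}")
    return lines

display = stem_and_leaf(data)
```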
Descriptive statistics
Quantitative variables
Graphical techniques
(c) Box-and-Whisker plots
Marital status
Frequency Percent
Single 44 55.0
Married 29 36.3
Other 7 8.8
Total 80 100
Independent events:
Two events are independent if the occurrence of one event does not
influence the occurrence of the other – denoted by 𝑝(𝐴 ∩ 𝐵)
E.g. if the firstborn child is male, this doesn’t influence the sex of the
second born (can be male or female)
UoN School of Public Health
Probability rules
Independent events:
The probability of two independent events is obtained by multiplying
individual probabilities of the events
𝑝(𝐴 ∩ 𝐵) = 𝑝(𝐴) × 𝑝(𝐵)
E.g. in a blood bank the following distribution of blood groups was observed
Blood group n %
O 45 45
A 29 29
B 21 21
AB 5 5
What is the probability that the next 2 persons will be in blood group O?
𝑝(𝐴 ∩ 𝐵) = 0.45 × 0.45 = 0.2025 ≅ 𝟎.𝟐𝟎
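The multiplication rule for the blood-group example, as a one-line check (a sketch, not part of the slides):

```python
# Probability that the next two donors are both blood group O,
# assuming independence: p(A and B) = p(A) * p(B)
p_O = 45 / 100            # 45% of the blood bank sample were group O
p_both_O = p_O * p_O      # approx 0.20
```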
Important probability distributions
Probabilities can be assigned to each likely outcome for a
variable
These probabilities usually follow a mathematical formula
called a probability distribution
The probability distribution of a variable depends on the type
of variable:
Continuous variables – temperature, weight, height – follow a Normal
distribution
Binary (yes/no) variables – sex, disease, death – follow a Binomial
distribution
Discrete or rare occurrences – death, rare diseases, counts – follow a
Poisson distribution. The Binomial approaches the Poisson as events become
rarer and/or the population becomes larger
Normal probability distribution
All normally distributed variables are continuous but not
vice-versa
For continuous variables all values within a certain range are
observable and the probability associated with one such
observation is negligibly small – on a continuous scale a zero
probability is assigned to individual points
However, we can calculate the probability that a variable will
take a value between two points e.g. 𝑎 and 𝑏.
For continuous variables a probability density is defined – a
mathematical function such that the area under the curve between 2
points 𝑎 and 𝑏 is equal to the probability that the variable will
take a value between these 2 points
Normal probability distribution
Two important parameters are needed to calculate
probabilities from the normal probability density:
Mean (𝜇) – locates the central position
Variance (𝜎 2 ) or its square root, the standard deviation (𝜎) measures the
degree of spread about the mean
𝑍 = (𝑌 − 𝜇)/𝜎
Such that: 𝑝[(𝑎 − 𝜇)/𝜎 ≤ 𝑍 ≤ (𝑏 − 𝜇)/𝜎]
Example
The variable birth weight is known to be normally distributed with a mean of
3100𝑔 and variance of 2500𝑔2 . What is the probability that a baby will weigh
= (1 − 0.02275 × 2)/2 + 0.5 = 0.47725 + 0.5 = 𝟎.𝟗𝟕𝟕𝟐𝟓
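The quoted answer 0.97725 corresponds to Z = 2; since (3200 − 3100)/50 = 2, the truncated question appears to ask for the probability of a weight below 3200 g. Under that assumption, the value can be checked with the error function (a sketch, not part of the slides):

```python
import math

def normal_cdf(x, mu, sigma):
    """P(Y <= x) for Y ~ N(mu, sigma^2), via the error function."""
    z = (x - mu) / sigma
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

# Birth weight ~ N(3100, 2500), so sigma = sqrt(2500) = 50
p = normal_cdf(3200, 3100, 50)    # Z = 2, approx 0.97725
```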
Binomial probability distribution
Important for calculating the probability of disease
The binomial distribution describes the behaviour of a
random variable 𝑋 if the following conditions apply:
The no. of observations 𝑛 is fixed
Each observation is independent
Each observation represents one of two outcomes (“success [𝑌]” or “failure”)
𝑝(𝑌 = 𝑦) = ⁿC𝑦 𝜋^𝑦 (1 − 𝜋)^(𝑛−𝑦)
Where: ⁿC𝑦 = 𝑛! / (𝑦! (𝑛 − 𝑦)!)
𝑝(𝑌 = 5) = ¹⁰C₅ × 0.22⁵ × (1 − 0.22)^(10−5)
𝑝(𝑌 ≤ 2) = ¹⁰C₀ × 0.22⁰ × (1 − 0.22)^(10−0) + ¹⁰C₁ × 0.22¹ × (1 − 0.22)^(10−1) + ¹⁰C₂ × 0.22² × (1 − 0.22)^(10−2)
= 0.083 + 0.235 + 0.298 = 𝟎.𝟔𝟏𝟔 𝒐𝒓 𝟔𝟏.𝟔%
Binomial probability distribution
Example
What is the probability that at least one will smoke?
Probability of at least one being a smoker = 1 − 𝑝(𝑌 = 0)
= 1 − ¹⁰C₀ × 0.22⁰ × (1 − 0.22)^(10−0)
= 1 − 0.083 = 𝟎.𝟗𝟏𝟕 𝐨𝐫 𝟗𝟏.𝟕%
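The binomial calculations above can be verified in Python; the prevalence π = 0.22 and sample size n = 10 are taken from the worked numbers (a sketch, not part of the slides):

```python
from math import comb

def binom_pmf(y, n, pi):
    """Binomial probability p(Y = y) = nCy * pi^y * (1 - pi)^(n - y)."""
    return comb(n, y) * pi**y * (1 - pi)**(n - y)

n, pi = 10, 0.22      # sample of 10, smoking prevalence 22%
p_at_most_2 = sum(binom_pmf(y, n, pi) for y in (0, 1, 2))   # approx 0.616
p_at_least_1 = 1 - binom_pmf(0, n, pi)                      # approx 0.917
```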
The normal approximation to the Binomial is adequate when 𝑛𝜋 ≥ 5 and 𝑛(1 − 𝜋) ≥ 5
Binomial probability distribution
Example
If the probability of a certain disease is thought to be 0.2. What is the
probability that in a sample of 50 individuals, 2 or more will get the disease?
𝜋 = 0.2
𝑛 = 50
𝑛𝜋 = 50 × 0.2 = 10 𝒂𝒏𝒅
𝑛 1 − 𝜋 = 50 × 1 − 0.2 = 40
Hence normal approximation to Binomial distribution is possible
𝑍 = (𝑦 − 𝜇)/𝜎
𝑤ℎ𝑒𝑟𝑒 𝜇 = 0.2 × 50 = 𝟏𝟎 𝑎𝑛𝑑 𝜎² = 50 × 0.2 × 0.8 = 𝟖
𝑝(𝑌 ≥ 2): 𝑍 = (2 − 10)/√8 = −2.83 (𝒏𝒆𝒆𝒅 𝒕𝒐 𝒅𝒓𝒂𝒘 𝒕𝒐 𝒔𝒆𝒆 𝒂𝒓𝒆𝒂)
𝑝(𝑌 ≥ 2) = (1 − 0.00233 × 2)/2 + 0.5 = 0.4977 + 0.5 = 𝟎.𝟗𝟗𝟕𝟕
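The normal approximation above (no continuity correction, matching the slide) can be checked in Python:

```python
import math

def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

n, pi = 50, 0.2
mu = n * pi                             # 10
sigma = math.sqrt(n * pi * (1 - pi))    # sqrt(8), approx 2.83
z = (2 - mu) / sigma                    # approx -2.83
p_at_least_2 = 1 - normal_cdf(z)        # approx 0.9977
```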
Poisson probability distribution
The Poisson distribution is a discrete probability distribution
for the counts (or rates) of events that occur randomly (can’t
predict when they will happen e.g. time to next phone call) in
a given interval of time or space
If 𝑌 = the no. of events in a given interval and if the mean no.
of events per interval is 𝜆, the probability of observing y
events in a given interval is given by:
𝑝(𝑌 = 𝑦) = λ^𝑦 𝑒^(−λ) / 𝑦!   (𝑁𝐵: 𝑒 𝑖𝑠 𝑎 𝑐𝑜𝑛𝑠𝑡𝑎𝑛𝑡 – 𝑠𝑒𝑒 𝑒^𝑥 𝑜𝑛 𝑎 𝑐𝑎𝑙𝑐𝑢𝑙𝑎𝑡𝑜𝑟)
𝒘𝒉𝒆𝒓𝒆 𝒚 = 𝟎, 𝟏, 𝟐, 𝟑 𝒆𝒕𝒄
Example
Births in a hospital occur randomly at an average of 1.8 births per hour. What
is the probability of observing 4 births in a given hour at the hospital?
𝑝(𝑌 = 4) = 1.8⁴ × 𝑒^(−1.8) / 4! = 𝟎.𝟎𝟕𝟐𝟑
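The Poisson calculation for the births example, as a short Python check (a sketch, not part of the slides):

```python
import math

def poisson_pmf(y, lam):
    """Poisson probability p(Y = y) = lam^y * e^(-lam) / y!"""
    return lam**y * math.exp(-lam) / math.factorial(y)

# Births occur at an average of 1.8 per hour; probability of 4 births in an hour
p_4_births = poisson_pmf(4, 1.8)    # approx 0.0723
```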
[Table 3: Areas in the upper tail of the Normal distribution. The function tabulated is 1 − Φ(z), where Φ(z) is the cumulative distribution function of a standardised Normal variable z, i.e. the probability that a standardised Normal variate selected at random will be greater than z]
[Table 7: Percentage points of the 𝑡 distribution for 𝑣 degrees of freedom. The tabulation is for one tail only, i.e. for positive values of 𝑡; for two tails the column headings for 𝛼 should be doubled. Taken from Fisher & Yates: Statistical Tables for Biological, Agricultural and Medical Research, reprinted by permission of Addison Wesley Longman Ltd, and from Table 12 of Biometrika Tables for Statisticians, by permission of Oxford University Press and the Biometrika Trustees]
[Table 8: Percentage points of the χ² distribution – continued]
Table A4. Upper percentage points of the F distribution with ν1 , ν2 df
(a) 5% points (rows: ν2 df; columns: ν1 = 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 12, 15, 20, 24, 30, 40, 60, 120, ∞)
1 161.448 199.500 215.707 224.583 230.162 233.986 236.768 238.883 240.543 241.882 243.906 245.950 248.013 249.052 250.095 251.143 252.196 253.253 254.314
2 18.513 19.000 19.164 19.247 19.296 19.330 19.353 19.371 19.385 19.396 19.413 19.429 19.446 19.454 19.462 19.471 19.479 19.487 19.496
3 10.1280 9.5521 9.2766 9.1172 9.0135 8.9406 8.8867 8.8452 8.8123 8.7855 8.7446 8.7029 8.6602 8.6385 8.6166 8.5944 8.5720 8.5493 8.5265
4 7.7086 6.9443 6.5914 6.3882 6.2561 6.1631 6.0942 6.0410 5.9988 5.9644 5.9117 5.8578 5.8025 5.7744 5.7459 5.7170 5.6877 5.6581 5.6281
5 6.6079 5.7861 5.4095 5.1922 5.0503 4.9503 4.8759 4.8183 4.7725 4.7351 4.6777 4.6188 4.5581 4.5272 4.4957 4.4638 4.4314 4.3985 4.3650
6 5.9874 5.1433 4.7571 4.5337 4.3874 4.2839 4.2067 4.1468 4.0990 4.0600 3.9999 3.9381 3.8742 3.8415 3.8082 3.7743 3.7398 3.7047 3.6689
7 5.5914 4.7374 4.3468 4.1203 3.9715 3.8660 3.7870 3.7257 3.6767 3.6365 3.5747 3.5107 3.4445 3.4105 3.3758 3.3404 3.3043 3.2674 3.2298
8 5.3177 4.4590 4.0662 3.8379 3.6875 3.5806 3.5005 3.4381 3.3881 3.3472 3.2839 3.2184 3.1503 3.1152 3.0794 3.0428 3.0053 2.9669 2.9276
9 5.1174 4.2565 3.8625 3.6331 3.4817 3.3738 3.2927 3.2296 3.1789 3.1373 3.0729 3.0061 2.9365 2.9005 2.8637 2.8259 2.7872 2.7475 2.7067
10 4.9646 4.1028 3.7083 3.4780 3.3258 3.2172 3.1355 3.0717 3.0204 2.9782 2.9130 2.8450 2.7740 2.7372 2.6996 2.6609 2.6211 2.5801 2.5379
11 4.8443 3.9823 3.5874 3.3567 3.2039 3.0946 3.0123 2.9480 2.8962 2.8536 2.7876 2.7186 2.6464 2.6090 2.5705 2.5309 2.4901 2.4480 2.4045
12 4.7472 3.8853 3.4903 3.2592 3.1059 2.9961 2.9134 2.8486 2.7964 2.7534 2.6866 2.6169 2.5436 2.5055 2.4663 2.4259 2.3842 2.3410 2.2962
13 4.6672 3.8056 3.4105 3.1791 3.0254 2.9153 2.8321 2.7669 2.7144 2.6710 2.6037 2.5331 2.4589 2.4202 2.3803 2.3392 2.2966 2.2524 2.2064
14 4.6001 3.7389 3.3439 3.1122 2.9582 2.8477 2.7642 2.6987 2.6458 2.6022 2.5342 2.4630 2.3879 2.3487 2.3082 2.2664 2.2229 2.1778 2.1307
15 4.5431 3.6823 3.2874 3.0556 2.9013 2.7905 2.7066 2.6408 2.5876 2.5437 2.4753 2.4034 2.3275 2.2878 2.2468 2.2043 2.1601 2.1141 2.0659
16 4.4940 3.6337 3.2389 3.0069 2.8524 2.7413 2.6572 2.5911 2.5377 2.4935 2.4247 2.3522 2.2756 2.2354 2.1938 2.1507 2.1058 2.0589 2.0096
17 4.4513 3.5915 3.1968 2.9647 2.8100 2.6987 2.6143 2.5480 2.4943 2.4499 2.3807 2.3077 2.2304 2.1898 2.1477 2.1040 2.0584 2.0107 1.9604
18 4.4139 3.5546 3.1599 2.9277 2.7729 2.6613 2.5767 2.5102 2.4563 2.4117 2.3421 2.2686 2.1906 2.1497 2.1071 2.0629 2.0166 1.9681 1.9168
19 4.3807 3.5219 3.1273 2.8951 2.7401 2.6283 2.5435 2.4768 2.4227 2.3779 2.3080 2.2341 2.1555 2.1141 2.0712 2.0264 1.9795 1.9302 1.8780
20 4.3512 3.4928 3.0984 2.8661 2.7109 2.5990 2.5140 2.4471 2.3928 2.3479 2.2776 2.2033 2.1242 2.0825 2.0391 1.9938 1.9464 1.8963 1.8432
21 4.3248 3.4668 3.0725 2.8401 2.6848 2.5727 2.4876 2.4205 2.3660 2.3210 2.2504 2.1757 2.0960 2.0540 2.0102 1.9645 1.9165 1.8657 1.8117
22 4.3009 3.4434 3.0491 2.8167 2.6613 2.5491 2.4638 2.3965 2.3419 2.2967 2.2258 2.1508 2.0707 2.0283 1.9842 1.9380 1.8894 1.8380 1.7831
23 4.2793 3.4221 3.0280 2.7955 2.6400 2.5277 2.4422 2.3748 2.3201 2.2747 2.2036 2.1282 2.0476 2.0050 1.9605 1.9139 1.8648 1.8128 1.7570
24 4.2597 3.4028 3.0088 2.7763 2.6207 2.5082 2.4226 2.3551 2.3002 2.2547 2.1834 2.1077 2.0267 1.9838 1.9390 1.8920 1.8424 1.7896 1.7331
25 4.2417 3.3852 2.9912 2.7587 2.6030 2.4904 2.4047 2.3371 2.2821 2.2365 2.1649 2.0889 2.0075 1.9643 1.9192 1.8718 1.8217 1.7684 1.7110
26 4.2252 3.3690 2.9752 2.7426 2.5868 2.4741 2.3883 2.3205 2.2655 2.2197 2.1479 2.0716 1.9898 1.9464 1.9010 1.8533 1.8027 1.7488 1.6906
27 4.2100 3.3541 2.9604 2.7278 2.5719 2.4591 2.3732 2.3053 2.2501 2.2043 2.1323 2.0558 1.9736 1.9299 1.8842 1.8361 1.7851 1.7306 1.6717
28 4.1960 3.3404 2.9467 2.7141 2.5581 2.4453 2.3593 2.2913 2.2360 2.1900 2.1179 2.0411 1.9586 1.9147 1.8687 1.8203 1.7689 1.7138 1.6541
29 4.1830 3.3277 2.9340 2.7014 2.5454 2.4324 2.3463 2.2783 2.2229 2.1768 2.1045 2.0275 1.9446 1.9005 1.8543 1.8055 1.7537 1.6981 1.6377
30 4.1709 3.3158 2.9223 2.6896 2.5336 2.4205 2.3343 2.2662 2.2107 2.1646 2.0921 2.0148 1.9317 1.8874 1.8409 1.7918 1.7396 1.6835 1.6223
40 4.0847 3.2317 2.8387 2.6060 2.4495 2.3359 2.2490 2.1802 2.1240 2.0772 2.0035 1.9245 1.8389 1.7929 1.7444 1.6928 1.6373 1.5766 1.5089
60 4.0012 3.1504 2.7581 2.5252 2.3683 2.2541 2.1665 2.0970 2.0401 1.9926 1.9174 1.8364 1.7480 1.7001 1.6491 1.5943 1.5343 1.4673 1.3893
120 3.9201 3.0718 2.6802 2.4472 2.2899 2.1750 2.0868 2.0164 1.9588 1.9105 1.8337 1.7505 1.6587 1.6084 1.5543 1.4952 1.4290 1.3519 1.2539
∞ 3.8415 2.9957 2.6049 2.3719 2.2141 2.0986 2.0096 1.9384 1.8799 1.8307 1.7522 1.6664 1.5705 1.5173 1.4591 1.3940 1.3180 1.2214 1.0000
Statistical Inference:
Single mean and proportion
Hypothesis testing
In hypothesis testing we state that we will reject a certain
hypothesis only if there is a 5% or less chance of observing data as extreme as
ours when that hypothesis is true
Statistical inference
Hypothesis testing
(a) Null hypothesis
Frequently, there’s an expected/natural value for a parameter
– called the null value
For a given sample size 𝑛, lowering 𝛼 (say below 0.05) will increase 𝛽
Probability of type II error (𝛽) decreases with increase in 𝑛
Statistical inference
Errors in hypothesis testing
[Table: outcomes of a test according to whether 𝐻₀ is true or false – rejecting a true 𝐻₀ is a type I error (𝛼); failing to reject a false 𝐻₀ is a type II error (𝛽)]
Sampling distribution of a mean
𝜎²/𝑛 is the variance of the sample mean 𝑥̄; its square root, 𝜎/√𝑛, is called the
standard error of the mean, which measures the precision of 𝑥̄ as an estimate of 𝜇
As the sample size (𝑛) increases, the variance 𝜎²/𝑛 of the sample mean decreases
Confidence interval for a mean
To make inferences about the true mean 𝜇 we construct a CI
We accept that the observed mean 𝑥 is generally within 1.96
(recall: 𝑍0.025 = 1.96) standard errors of the true mean 𝜇 so that the
interval: 𝑥 ± 1.96 × 𝑆𝐸(𝑥) will usually include the true value
This means that on repeated sampling, 95% of sample means
would fall within 1.96 standard errors of 𝜇, so that the
interval 𝑥̄ ± 1.96 × 𝑆𝐸(𝑥̄) includes 𝝁 approx. 95% of the time (called the 95% CI)
Sampling distribution of a mean
Confidence interval for a mean
A 99% CI is given by: 𝑥 ± 2.58 × 𝑆𝐸(𝑥)
Example
The packed cell volume (PCV) was measured in 25 children sampled randomly
from children aged 4 yrs living in a large West African village, with the following
results:
𝑥 = 34.1 𝑠 = 4.3
Using the 𝑠 as an unbiased estimator of 𝜎 we obtain the 95% CI of:
34.1 ± 1.96 × 4.3/√25 = 𝟑𝟐.𝟒 𝒕𝒐 𝟑𝟓.𝟖
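The PCV confidence interval can be reproduced in Python (a sketch, not part of the slides):

```python
import math

# PCV example: n = 25 children, sample mean 34.1, sample SD 4.3, z_0.025 = 1.96
n, xbar, s = 25, 34.1, 4.3
se = s / math.sqrt(n)         # standard error of the mean = 0.86
lower = xbar - 1.96 * se      # approx 32.4
upper = xbar + 1.96 * se      # approx 35.8
```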
Use of the 𝒕 distribution
As the value of 𝜎 is generally unknown (recall: 95% 𝐶𝐼 = 𝑥̄ ± 1.96 ×
𝜎/√𝑛), we have to use 𝑠 as an estimate of 𝜎 – this introduces
sampling error into the calculation
Due to this error, the interval 𝑥̄ ± 1.96 × 𝑠/√𝑛 includes 𝜇
less than 95% of the time i.e. the calculated interval is too narrow
Sampling distribution of a mean
Confidence interval for a mean
Use of the 𝒕 distribution
To correct for this we use a multiplying factor larger than 1.96
– makes interval wider and restores confidence level to 95%
The multiplying factor is contained in the 𝑡 𝑑𝑖𝑠𝑡𝑟𝑖𝑏𝑢𝑡𝑖𝑜𝑛
The factor depends on the degrees of freedom (v) used to
calculate the sample SD 𝑠 (𝑑𝑓 𝑎𝑟𝑒 𝑜𝑛𝑒 𝑙𝑒𝑠𝑠 𝑡ℎ𝑎𝑛 𝑠𝑎𝑚𝑝𝑙𝑒 𝑠𝑖𝑧𝑒 𝑖. 𝑒. 𝒗 = 𝒏 − 𝟏)
As 𝑛 increases the factor approaches 𝑍0.025 = 1.96 – hence t
distribution only needs to be used for 𝑛 < 20
Example
In the PCV example, 𝑣 = 25 − 1 = 24. Using the 𝑡 distribution with 24 𝑑𝑓, the
95% CI is: 𝑥̄ ± 𝑡(𝑛−1),𝛼/2 × 𝑠/√𝑛 = 34.1 ± 2.064 × 4.3/√25 = 𝟑𝟐.𝟑 𝐭𝐨 𝟑𝟓.𝟗
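The t-based interval uses the table multiplier rather than 1.96; a short check in Python (the value 2.064 is read from the t table with 24 df, not computed):

```python
import math

# Same PCV data, with the t multiplier t(24, 0.025) = 2.064 from the t table
n, xbar, s = 25, 34.1, 4.3
t24 = 2.064                     # 24 df; taken from the table, not computed here
se = s / math.sqrt(n)
lower = xbar - t24 * se         # approx 32.3
upper = xbar + t24 * se         # approx 35.9
```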
Sampling distribution of a mean
Significance test for a mean
We may wish to test a specific hypothesis about the pop mean 𝜇
e.g. if data on 4yr children in USA indicate a mean PCV of 37.1 we may test whether
our sample data (West African) are consistent with the 𝐻0 :
𝐻0 : 𝜇 = 𝜇0 = 37.1
𝐻𝑎 : 𝜇 ≠ 𝜇0
𝑍 = (𝑥̄ − 𝜇₀)/𝑆𝐸(𝑥̄) = (34.1 − 37.1)/(4.3/√25) = −𝟑.𝟒𝟗
From the 𝑍 𝑡𝑎𝑏𝑙𝑒𝑠 we get the 𝑃-value: 2 × 0.00024 = 𝟎.𝟎𝟎𝟎𝟒𝟖 (NB: the
table gives a one-tailed probability, so multiply by 2 for a two-sided 𝑃-value)
Interpretation: The data provide strong evidence against 𝑯𝟎 hence the mean PCV in 4yr
old children in the West African village is different from that of children of the same
age in the USA
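The Z test above can be reproduced in Python, replacing the table look-up with the error function (a sketch, not part of the slides):

```python
import math

def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

# PCV example: test H0: mu = 37.1 against the sample mean 34.1
n, xbar, s, mu0 = 25, 34.1, 4.3, 37.1
z = (xbar - mu0) / (s / math.sqrt(n))     # approx -3.49
p_two_sided = 2 * normal_cdf(-abs(z))     # approx 0.00048
```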
Sampling distribution of a mean
Significance test for a mean
If 𝑛 < 20 then 𝑡 𝑑𝑖𝑠𝑡𝑟𝑖𝑏𝑢𝑡𝑖𝑜𝑛 is more appropriate:
𝑡 = (𝑥̄ − 𝜇₀)/𝑆𝐸(𝑥̄) = −𝟑.𝟒𝟗
From 𝑡 𝑡𝑎𝑏𝑙𝑒𝑠, 3.49 falls between 3.467 and 3.745, hence 𝟎.𝟎𝟎𝟏 < 𝑷 < 𝟎.𝟎𝟎𝟐
Sampling distribution of a proportion
Example: In a survey of 335 men attending a health centre in Guilford (UK), 127
(37.9%) men said they were current smokers
How can we use the result from the above sample to say
something about the population which it represents? – we
use the concept of a sampling distribution
Suppose we repeatedly took a new sample of 335 men from
this health centre (assuming a large no. of men are registered there) and
calculated the proportion who smoked & then created a
histogram of these values – histogram would represent the
sampling distribution of the proportion:
[Figure: sampling distribution of the sample proportion 𝑝, centred on the population proportion 𝝅]
Sampling distribution of a proportion
In practice we only conduct one survey from which to
estimate 𝑝
Is 𝑝 close to 𝜋 or is it very different from 𝜋?
In any random sample there’s some sampling variation in 𝑝 so
that the larger the 𝑛 the smaller the sampling variation
The sampling variation of a proportion is described by its
standard error:
𝑆𝐸(𝑝) = √(𝑝 × (1 − 𝑝)/𝑛)
E.g. the standard error of the proportion of smokers is:
√(0.379 × (1 − 0.379)/335) = 𝟎.𝟎𝟐𝟔𝟓 𝒐𝒓 𝟐.𝟔𝟓%
Sampling distribution of a proportion
To estimate the interval of possible values within which the true
pop proportion 𝜋 lies we compute the CI:
𝑝 ± 𝑍𝛼/2 × √(𝑝 × (1 − 𝑝)/𝑛) = 0.379 ± 1.96 × √(0.379 × 0.621/335) = 𝟎.𝟑𝟐𝟕 𝒕𝒐 𝟎.𝟒𝟑𝟏
Valid provided: 𝑛𝑝 ≥ 5 and 𝑛(1 − 𝑝) ≥ 5
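The standard error and 95% CI for the smoking survey can be reproduced in Python (a sketch, not part of the slides):

```python
import math

# Smoking survey: 127 current smokers out of 335 men
n, x = 335, 127
p = x / n                                     # approx 0.379
se = math.sqrt(p * (1 - p) / n)               # approx 0.0265
lower = p - 1.96 * se                         # approx 0.327
upper = p + 1.96 * se                         # approx 0.431
```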
Comparing two Population Means &
Proportions
CI for the difference between two means: (𝑥̄₁ − 𝑥̄₂) ± 𝑍𝛼/2 × √(𝑠₁²/𝑛₁ + 𝑠₂²/𝑛₂)
𝑍 = (𝑥̄₁ − 𝑥̄₂) / √(𝑠₁²/𝑛₁ + 𝑠₂²/𝑛₂)
𝑡 = (𝑥̄₁ − 𝑥̄₂) / √(𝑠₁²/𝑛₁ + 𝑠₂²/𝑛₂) (Welch t-test)
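The two-sample statistic can be sketched as a small Python function; the slides give no worked data at this point, so the numbers in the example call are hypothetical:

```python
import math

def two_sample_z(x1, x2, s1, s2, n1, n2):
    """Z statistic for comparing two independent sample means
    (large samples; no equal-variance assumption)."""
    se = math.sqrt(s1**2 / n1 + s2**2 / n2)
    return (x1 - x2) / se

# Hypothetical illustration only (these numbers are not from the slides)
z = two_sample_z(34.1, 32.0, 4.3, 4.0, 100, 100)
```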
Calculate the mean difference, 𝑑̄ (NB: the paired t-test assumes the 𝑑ᵢ's are
normally distributed; if not, then non-parametric tests are used)
State the hypotheses:
𝐻0 : 𝐷 = 0
𝐻𝑎 : 𝐷 ≠ 0
Calculate the standard error of the mean difference, 𝑆𝐸(𝑑̄) = 𝑠_𝑑/√𝑛
Calculate the t-statistic, 𝑇 = 𝑑̄/𝑆𝐸(𝑑̄). Under 𝐻₀, this follows a 𝑡
distribution with 𝑛 − 1 𝑑𝑓
The 𝑇 statistic is given by: 𝑇 = 2.28 / (2.25/√15) = 𝟑.𝟗𝟐 𝑤𝑖𝑡ℎ 14 𝑑𝑓
This gives 𝟎. 𝟎𝟎𝟏 < 𝑷 < 𝟎. 𝟎𝟎𝟐, we reject 𝐻0 and conclude that there’s strong
evidence of a real difference between the two observers.
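The paired t-statistic above can be checked in Python, using the summary figures quoted on the slide (mean difference 2.28, SD of differences 2.25, n = 15):

```python
import math

# Paired example: mean difference 2.28, SD of the differences 2.25, n = 15 pairs
dbar, s_d, n = 2.28, 2.25, 15
se_d = s_d / math.sqrt(n)     # standard error of the mean difference
t_stat = dbar / se_d          # approx 3.92, with n - 1 = 14 df
```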
H o: 𝜋 1 = 𝜋 2
Ha: 𝜋1 ≠ 𝜋2
𝑍 = (𝑝₁ − 𝑝₂) / √(𝑝̄𝑞̄ (1/𝑛₁ + 1/𝑛₂))   where 𝑝̄ = (𝑥₁ + 𝑥₂)/(𝑛₁ + 𝑛₂) and 𝑞̄ = 1 − 𝑝̄
Here we make use of the fact that the difference between the 2 observed
proportions has approximately a normal distribution (normal
approximation to the binomial distribution): n𝜋 ≥ 5 ; n(1 − 𝜋) ≥ 5
(𝑝₁ − 𝑝₂) ± 1.96 × √(𝑝₁(1 − 𝑝₁)/𝑛₁ + 𝑝₂(1 − 𝑝₂)/𝑛₂)
32.9% ± 1.96 × √(52.7 × (100 − 52.7)/93 + 19.8 × (100 − 19.8)/91)
= 32.9% ± 13.04%
= 19.86% to 45.94% (interval doesn't include zero)
So, we are 95% confident that the true pop difference in tumour
responses between CMF & L-Pam is between 19.86% and 45.94%
NB: The standard error formula in the above calculation doesn't assume the 𝐻₀ of
the two proportions being equal (common variance).
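The two-proportion confidence interval can be reproduced in Python, working in percentage points as the slide does (a sketch, not part of the slides):

```python
import math

# Tumour response: 52.7% of 93 patients vs 19.8% of 91 patients (difference 32.9%)
p1, n1 = 52.7, 93    # percentages, as quoted on the slide
p2, n2 = 19.8, 91
diff = p1 - p2       # 32.9 percentage points
se = math.sqrt(p1 * (100 - p1) / n1 + p2 * (100 - p2) / n2)   # unpooled SE
lower = diff - 1.96 * se    # approx 19.86
upper = diff + 1.96 * se    # approx 45.94
```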