Probability and Probability Distributions
Learning Objectives
At the end of this session students should
a) be able to define probability by the ‘equally likely events’ and the ‘relative frequency’
approaches
b) be able to use the simple rules of probability
c) know the properties of the binomial distribution and be able to apply it to practical problems
d) know the properties of the Normal distribution and be able to apply it to practical problems
3.1 Introduction
Consider a simple experiment of tossing a coin. We cannot predict in advance the outcome
of the experiment (Heads or Tails) because of the role of chance, but we can if we wish
repeat the experiment as often as we like. Likewise, consider giving a new treatment to a
group of patients. We cannot predict in advance what the outcome of treatment for any given
patient will be (success or failure), but we can observe the outcome in large numbers of
similar patients.
Later in the course you will find that many of the conclusions obtained when employing
statistical methods in medicine and dentistry are also subject to chance and we will make use
of probability to express numerically the uncertainty of these conclusions.
3.3 Simple Laws of Probability
Probabilities possess certain properties that hold irrespective of which definition is used.
The probability of any event A, denoted P(A), must be between 0 and 1
P(A)=0 if and only if A is impossible
P(A)=1 if and only if A is certain
For any event A, the probability of A not happening (the complement of A) is
P(not A) = 1 - P(A)
The addition law states that if two events, A and B, are mutually exclusive (i.e. they
cannot occur simultaneously) then the probability of either event A or event B occurring,
denoted by P(A or B), is
P(A or B) = P(A) + P(B)
The multiplication law states that if two events, A and B, are independent (i.e. the risk of
one event occurring is unaffected by the other event occurring) then the probability of
both event A and event B occurring, denoted by P(A and B), is
P(A and B) = P(A) x P(B)
Suppose three individuals are chosen at random. What is the probability that all three are
group O?
Assuming the three are unrelated (and therefore independent as regards blood group),
P(all 3 individuals are group O) = P(O) x P(O) x P(O) = 0.46³ = 0.097
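The arithmetic above can be checked with a short Python sketch. It assumes, as in the example, that P(O) = 0.46 and that the three individuals are independent; the variable names are illustrative only.

```python
# Probability that a randomly chosen individual is blood group O (from the example).
p_o = 0.46

# Complement law: P(not O) = 1 - P(O).
p_not_o = 1 - p_o

# Multiplication law for independent events: multiply the individual probabilities.
p_all_three_o = p_o * p_o * p_o
p_none_o = p_not_o ** 3

# Complement law again: "at least one is group O" is the complement of "none is group O".
p_at_least_one_o = 1 - p_none_o

print(f"P(all three group O)    = {p_all_three_o:.3f}")
print(f"P(none group O)         = {p_none_o:.3f}")
print(f"P(at least one group O) = {p_at_least_one_o:.3f}")
```

The last line illustrates why the complement law is useful: "at least one" is obtained from a single subtraction rather than by adding several separate cases.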
If r represents the number of individuals out of three that are group O, then there are four
possible values (0, 1, 2 and 3). In example 3.4 we did not obtain the full probability
distribution since we did not calculate probabilities for all possible values. This could be done
using the laws of probability, but it is easier to use the binomial distribution (section 3.6).
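As a rough illustration of the "laws of probability" route, the sketch below lists all eight possible outcomes for three individuals, applies the multiplication law to each outcome, and uses the addition law to combine outcomes with the same number of group O individuals. It assumes P(O) = 0.46 and independence, as in the example; the names are hypothetical.

```python
from itertools import product

p_o = 0.46  # P(group O), taken from the example

# Each individual is either group O (True) or not (False): 2 x 2 x 2 = 8 outcomes.
distribution = {r: 0.0 for r in range(4)}
for outcome in product([True, False], repeat=3):
    # Multiplication law: the individuals are assumed independent.
    prob = 1.0
    for is_o in outcome:
        prob *= p_o if is_o else 1 - p_o
    # Addition law: different outcomes are mutually exclusive, so the
    # probabilities for the same count of group O individuals add up.
    distribution[sum(outcome)] += prob

for r, prob in distribution.items():
    print(f"P({r} of 3 are group O) = {prob:.4f}")
```

Enumeration like this becomes impractical as the number of individuals grows (the number of outcomes doubles with each extra individual), which is one reason a formula-based approach is preferred.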
The mean of a probability distribution is the average value that the variable takes in the long
run. It is really a weighted average of the possible values of the variable, the weights being
given by the probabilities. The standard deviation of a probability distribution measures the
amount of variation in the variable about its mean.
Although there are many useful probability distributions in statistics, we will study just two in
this chapter: the binomial distribution for discrete variables (section 3.6) and the Normal
distribution for continuous variables (section 3.7).
For the binomial distribution the mean and standard deviation are obtained from the formulae
Mean = np        Standard deviation = √(np(1 − p))
We can now examine the problem in the third part of example 3.4 using the binomial
distribution by noting that n=3 (the number of trials) and p=0.46 (the probability of “success”
defined as being blood group O). Then we can obtain the probability distribution simply by
successively substituting the various possible values of r.
r    Probability
0    3!/(0!3!) x (0.46)⁰ x (0.54)³ = (0.54)³ = 0.1575
1    3!/(1!2!) x (0.46)¹ x (0.54)² = 3 x (0.46) x (0.54)² = 0.4024
2    3!/(2!1!) x (0.46)² x (0.54)¹ = 3 x (0.46)² x (0.54) = 0.3428
3    3!/(3!0!) x (0.46)³ x (0.54)⁰ = (0.46)³ = 0.0973
In practice we can use SPSS to generate binomial probability distributions as follows:
File > New > Data
Enter the values 0,1,2,3 in a column. Name it r, and display 0 decimal places.
Transform > Compute Variable…
Set Target Variable: to P
In Function group: select PDF & NonCentral PDF.
In Functions and Special Variables: select Pdf.Binom.
Move it into the Numeric expression box by clicking the up arrow box.
Complete as PDF.BINOM(r, 3, 0.46) and press OK.
Display the probabilities in P to four decimal places.
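For readers without SPSS, the same probabilities can be sketched in Python using only the standard library. The function name binomial_pmf is an illustrative choice, not part of any package. It applies the binomial formula n!/(r!(n − r)!) x p^r x (1 − p)^(n − r). The final lines compare the mean and standard deviation computed directly from the distribution with the formulae np and √(np(1 − p)).

```python
from math import comb, sqrt

def binomial_pmf(r, n, p):
    """Probability of exactly r successes in n independent trials."""
    return comb(n, r) * p**r * (1 - p) ** (n - r)

n, p = 3, 0.46  # three individuals, P(group O) = 0.46
probs = [binomial_pmf(r, n, p) for r in range(n + 1)]
for r, prob in enumerate(probs):
    print(f"r = {r}: P = {prob:.4f}")

# Mean as a probability-weighted average of the possible values of r.
mean = sum(r * prob for r, prob in enumerate(probs))
# Standard deviation from the probability-weighted squared deviations.
sd = sqrt(sum((r - mean) ** 2 * prob for r, prob in enumerate(probs)))

print(f"mean = {mean:.4f}  (np = {n * p:.4f})")
print(f"sd   = {sd:.4f}  (sqrt(np(1-p)) = {sqrt(n * p * (1 - p)):.4f})")
```

Changing n and p at the top is enough to produce the distribution for a different number of trials or a different success probability.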
Although the binomial distribution might initially appear to be rather restricted in its scope
for application, in practice it can be used to describe a wide variety of medical phenomena in
a relatively simple way.
For some variables (including height) this function is a smooth bell-shaped curve, symmetric
about the mean (see diagram). This is a Normal distribution. Although in theory a Normal
distribution can take any value between −∞ and +∞, it can be shown that 68%, 95% and
99.7% of the total probability lies in the ranges μ ± σ, μ ± 2σ and μ ± 3σ respectively, i.e.
within 1, 2 or 3 standard deviations, σ, of the mean.
The above Normal distribution has
Mean = μ,
Standard deviation = σ.
We write this as N(μ, σ²). In practice μ and σ must often be estimated from sample data.
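The 68%, 95% and 99.7% figures can be checked numerically. The sketch below uses NormalDist from Python's standard-library statistics module; the mean and standard deviation chosen here (0 and 1) are arbitrary illustrative values, since the percentages are the same for any Normal distribution.

```python
from statistics import NormalDist

mu, sigma = 0.0, 1.0  # arbitrary illustrative values
dist = NormalDist(mu, sigma)

coverage = {}
for k in (1, 2, 3):
    # Probability of lying within k standard deviations of the mean.
    coverage[k] = dist.cdf(mu + k * sigma) - dist.cdf(mu - k * sigma)
    print(f"within {k} SD of the mean: {coverage[k]:.1%}")
```

The quoted percentages are rounded; for example, exactly 95% of the probability lies within about 1.96 (rather than 2) standard deviations of the mean.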
The Standard Normal Distribution is a Normal distribution with mean, μ=0 and standard
deviation, σ=1; hence N(0,1). The method for moving from a general Normal Distribution,
X, to the Standard Normal distribution, Z, is based on the transformation
Z = (X − μ)/σ
i.e. subtract the mean, μ, and then divide by the standard deviation, σ.
So Z represents the number of standard deviations X is from the mean. The standard Normal
distribution used to be important because the areas under the curve were conveniently
tabulated, but nowadays we are more likely to use computer packages (e.g. SPSS or Excel)
to evaluate Normal distribution tail areas. The following table gives some percentage points
of the standard Normal distribution.
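[Figure: two standard Normal N(0, 1) curves plotted against Z. In the first, an area of P/2 lies in each tail, below −z and above z; in the second, an area of P lies in the upper tail above z.]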
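Percentage points of this kind can also be computed rather than looked up. The sketch below uses the inv_cdf method of NormalDist from Python's standard-library statistics module. The tail probabilities listed are illustrative choices and are not taken from the course table.

```python
from statistics import NormalDist

std_normal = NormalDist()  # N(0, 1)

z_values = {}
# Total two-sided tail probabilities P (illustrative choices).
for p_two_sided in (0.10, 0.05, 0.01):
    # An area of P/2 lies above z and an area of P/2 lies below -z.
    z_values[p_two_sided] = std_normal.inv_cdf(1 - p_two_sided / 2)
    print(f"P = {p_two_sided:.2f}: z = {z_values[p_two_sided]:.3f}")
```

For a one-sided percentage point, with the whole area P in the upper tail, the call would be std_normal.inv_cdf(1 - P) instead.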
Example 1
Assume that diastolic blood pressure (DBP) is Normally distributed with mean μ = 100
mmHg and standard deviation σ = 10 mmHg in a certain population. Find the probability that
a randomly selected individual has DBP between 90 and 120 mmHg.
We can then use statistical tables or, alternatively, SPSS to obtain the tail areas directly:
File > New > Data
Enter the values of X = 90 and 120 in a column. Name it DBP.
Transform > Compute Variable…
Set Target Variable: to cumprob
In Function group: select CDF & NonCentral CDF
In Functions and Special Variables: select Cdf.Normal
Move it into the Numeric expression box by clicking the up arrow box.
Complete as CDF.NORMAL(DBP, 100, 10) and press OK.
Display the cumulative probabilities in cumprob to four decimal places.
This tells us that Prob (DBP < 120) = 0.9772 and Prob (DBP < 90) = 0.1587
So Prob (90 < DBP < 120) = 0.9772 – 0.1587 = 0.8185
So 82% of the DBP distribution lies between 90 and 120 mmHg.
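The same calculation can be sketched in Python with NormalDist from the standard-library statistics module, here also showing the Z transformation explicitly. The variable names are illustrative.

```python
from statistics import NormalDist

mu, sigma = 100, 10      # DBP mean and standard deviation (mmHg)
lower, upper = 90, 120   # limits of interest (mmHg)

# Direct approach: cumulative probabilities from N(100, 10^2).
dbp = NormalDist(mu, sigma)
prob_direct = dbp.cdf(upper) - dbp.cdf(lower)

# Equivalent approach: standardise each limit, then use N(0, 1).
z_lower = (lower - mu) / sigma   # 1 standard deviation below the mean
z_upper = (upper - mu) / sigma   # 2 standard deviations above the mean
std_normal = NormalDist()
prob_via_z = std_normal.cdf(z_upper) - std_normal.cdf(z_lower)

print(f"P(DBP < {upper}) = {dbp.cdf(upper):.4f}")
print(f"P(DBP < {lower})  = {dbp.cdf(lower):.4f}")
print(f"P({lower} < DBP < {upper}) = {prob_direct:.4f}")
print(f"via Z transformation  = {prob_via_z:.4f}")
```

Because the code subtracts unrounded cumulative probabilities, the final figure prints as 0.8186 rather than 0.8185; the difference is purely due to rounding and still corresponds to about 82%.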
Example 2
Forced vital capacity (FVC) in healthy middle-aged men is distributed Normally with
mean μ = 3,200 ml and standard deviation σ = 800 ml, i.e. N(3200, 800²).
[Figure: N(3200, 800²) curve with 2.5% of the area in each tail.]
3.8 Recommended Reading
Bland Chapter 6 – Probability
Chapter 7 – The Normal Distribution
3.9 Practical
1. Shown below is a life table which summarises the death rates in Northern Ireland in the
period 1980-82. If 1000 newborn boys or girls were to experience this mortality pattern, the
table shows the numbers who would still be alive at various ages. Source: Annual Report of
the Registrar General for N Ireland (1982)
2. The probability that a patient recovers from a stomach disease is 0.8. Suppose 10 people
are known to have contracted this disease. Assuming the conditions for the Binomial
distribution to hold, use SPSS to calculate the probabilities of the following events:
(a) exactly 4 recover (r = 4);
(b) at least 8 recover (r ≥ 8);
(c) at least 4 but not more than 8 recover (4 ≤ r ≤ 8);
(d) at most 6 recover (r ≤ 6);
(e) the mean of r.
3. The distribution of weight in a population can be approximated by the Normal
distribution with mean = 64 kg and standard deviation = 10 kg.
Use the SPSS Cumulative DF Function for the Normal distribution to answer the following:
(a) What percentage of the population has weight exceeding 80 kg?
(b) What percentage of the population has weight lying between 50 and 75 kg?
Use the SPSS Inverse DF Function for the Normal distribution to answer the following:
(c) Above what weight does 25% of the population of weights lie?
4. In the British Regional Heart Study, 270 out of 7,735 men aged between 40 and 59 years
developed ischaemic heart disease in a five year period. Statistical analysis showed the
probability of developing the disease to be related to a number of risk factors which could
be combined to form a “risk factor score” (British Medical Journal 1986; 293: 474-9).
This score was more successful at predicting heart disease than any single risk factor
taken on its own. The risk factor score in those who developed the disease was distributed
N(1070, 270²) while in the non-developers it was distributed N(900, 110²).
Sketch the two Normal probability distributions on the same scale of risk factor score paying
particular attention to the different means and standard deviations in the developers and the
non-developers.
A cut-off score of 1000 risk factor units was proposed to predict whether a subject would
become a developer or a non-developer. Determine how many actual developers would
be predicted to be non-developers and how many actual non-developers would be
predicted to be developers using this cut-off point. Then complete the following table.
                       Developer    Non-developer
Risk score > 1000
Risk score < 1000