
Probability and Probability Distributions

This document provides an overview of probability and probability distributions: 1. It defines probability using both the equally likely outcomes approach and the relative frequency approach. 2. It introduces key probability distributions like the binomial distribution for discrete variables and the Normal distribution for continuous variables. 3. It provides an example using the binomial distribution to calculate the probability of getting a certain number of successes (being blood type O) out of 3 trials, showing how to calculate the mean and standard deviation.

Uploaded by

Dr P

Chapter 3 PROBABILITY and PROBABILITY DISTRIBUTIONS

Learning Objectives
At the end of this session students should
a) be able to define probability by the ‘equally likely events’ and the ‘relative frequency’
approaches
b) be able to use the simple rules of probability
c) know the properties of the binomial distribution and be able to apply it to practical problems
d) know the properties of the Normal distribution and be able to apply it to practical problems

3.1 Introduction
Consider a simple experiment of tossing a coin. We cannot predict in advance the outcome
of the experiment (Heads or Tails) because of the role of chance, but we can if we wish
repeat the experiment as often as we like. Likewise, consider giving a new treatment to a
group of patients. We cannot predict in advance what the outcome of treatment for any given
patient will be (success or failure), but we can observe the outcome in large numbers of
similar patients.
Later in the course you will find that many of the conclusions obtained when employing
statistical methods in medicine and dentistry are also subject to chance and we will make use
of probability to express numerically the uncertainty of these conclusions.

3.2 Definition of Probability


The equally likely outcomes definition of probability relies on the assumption that each of
the possible outcomes of an experiment (tossing a coin, rolling a die) is equally likely. The
probability of any event, A, is then defined as
$$P(A) = \frac{\text{Number of equally likely outcomes in } A}{\text{Total number of outcomes}}$$
Therefore, in rolling a fair die, the probability of an even number is 3/6 = 0.5. In many
situations, however, we do not have equally likely outcomes.
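The equally likely outcomes definition can be sketched in a few lines of Python. This is only an illustration: the function name `equally_likely_probability` and the variable names are assumptions made here, not part of the course material.

```python
from fractions import Fraction

def equally_likely_probability(event, outcomes):
    """P(A) = (number of equally likely outcomes in A) / (total number of outcomes)."""
    outcomes = list(outcomes)
    favourable = [o for o in outcomes if o in event]
    return Fraction(len(favourable), len(outcomes))

die = range(1, 7)   # the six faces of a fair die, assumed equally likely
even = {2, 4, 6}    # the event "an even number is rolled"

p_even = equally_likely_probability(even, die)
print(p_even, float(p_even))  # 1/2 0.5
```

The calculation is only valid when every outcome in `outcomes` really is equally likely, which is exactly the assumption the definition relies on.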
The relative frequency definition of probability necessitates repeating the
experiment a large number of times and noting the proportion of times the event,
A, of interest occurs (its relative frequency). In the diagram the probability
of a Head would be defined as 0.5 since this is the value at which the relative
frequency is stabilising. This definition does not require equally likely outcomes.

[Figure: proportion of heads (vertical axis, 0.0 to 1.0) against number of tosses (horizontal axis, 0 to 100).]
It could therefore be used, for example, to assign probabilities to each face of a loaded die.
However, this definition is also rather limited.
An alternative is the Bayesian approach based on one’s degree of belief. However, this
approach does have its critics since different individuals will assign different probabilities to
the same event.
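The relative frequency definition can be illustrated by simulating coin tosses and tracking the running proportion of heads. The sketch below is an assumption-laden illustration rather than part of the original notes: the function name, the seed and the number of tosses are arbitrary choices, and a pseudo-random generator stands in for a real coin.

```python
import random

def running_proportion_heads(n_tosses, p_head=0.5, seed=1):
    """Simulate n_tosses coin tosses and return the proportion of heads after each toss."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    heads = 0
    proportions = []
    for toss in range(1, n_tosses + 1):
        if rng.random() < p_head:  # this toss counts as a head
            heads += 1
        proportions.append(heads / toss)
    return proportions

props = running_proportion_heads(10_000)
for n in (10, 100, 1_000, 10_000):
    print(n, round(props[n - 1], 3))
```

Early proportions can be far from 0.5, but they settle down as the number of tosses grows. Setting `p_head` to another value mimics a biased coin, for which the equally likely outcomes definition would not apply.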

3.3 Simple Laws of Probability
Probabilities possess certain properties that hold irrespective of which definition is used.
• The probability of any event A, denoted P(A), must be between 0 and 1
• P(A) = 0 if and only if A is impossible
• P(A) = 1 if and only if A is certain
• For any event A, the probability of A not happening (the complement of A) is
P(not A) = 1 - P(A)

• The addition law states that if two events, A and B, are mutually exclusive (i.e. they
cannot occur simultaneously) then the probability of either event A or event B occurring,
denoted by P(A or B), is
P(A or B) = P(A) + P(B)

• The multiplication law states that if two events, A and B, are independent (i.e. the risk of
one event occurring is unaffected by the other event occurring) then the probability of
both event A and event B occurring, denoted by P(A and B), is
P(A and B) = P(A) × P(B)

3.4 Example of the Simple Laws of Probability


In a large population the respective percentages of blood groups O, A, B and AB are 46, 43,
8 and 3. Suppose that an individual is randomly selected from the population. (Randomly
selected means that each individual in the population has the same chance of being chosen.)

• What is the probability that the individual is not group O?

P(individual is not group O) = P(not O) = 1 – P(O) = 1 – 0.46 = 0.54.

• What is the probability that the individual is group A or group B?

P(individual is group A or group B) = P(A or B) = P(A) + P(B) = 0.43 + 0.08 = 0.51,
since an individual can belong to only one blood group.

• Suppose three individuals are chosen at random. What is the probability that all three are
group O?
Assuming the three are unrelated (and therefore independent as regards blood group),
P(all 3 individuals are group O) = P(O) × P(O) × P(O) = 0.46³ = 0.097.
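The three calculations in this example can be checked with a short Python sketch. The dictionary `blood_groups` and the variable names are illustrative assumptions; the probabilities are the percentages quoted in the example.

```python
# Blood group probabilities from the example (46%, 43%, 8% and 3%).
blood_groups = {"O": 0.46, "A": 0.43, "B": 0.08, "AB": 0.03}

# The groups are mutually exclusive and exhaustive, so they must sum to 1.
assert abs(sum(blood_groups.values()) - 1.0) < 1e-9

# Complement rule: P(not O) = 1 - P(O)
p_not_o = 1 - blood_groups["O"]

# Addition law for mutually exclusive events: P(A or B) = P(A) + P(B)
p_a_or_b = blood_groups["A"] + blood_groups["B"]

# Multiplication law for independent events: P(O and O and O) = P(O)^3
p_all_three_o = blood_groups["O"] ** 3

print(round(p_not_o, 2), round(p_a_or_b, 2), round(p_all_three_o, 3))  # 0.54 0.51 0.097
```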

3.5 Probability Distributions


If a variable takes different values depending on some chance mechanism, then the
probabilities assigned to the different possible values of the variable form a probability
distribution. A probability distribution therefore describes how probability is distributed
across the various possible values. For example, in the last part of example 3.4, if the variable
represents the number of individuals out of three that are group O, then there are four
possible values (0, 1, 2 and 3). In example 3.4 we did not obtain the full probability
distribution since we did not calculate probabilities for all possible values. This could be done
using the laws of probability, but it is easier to use the binomial distribution (section 3.6).

The mean of a probability distribution is the average value that the variable takes in the long
run. It is really a weighted average of the possible values of the variable, the weights being
given by the probabilities. The standard deviation of a probability distribution measures the
amount of variation in the variable about its mean.

Although there are many useful probability distributions in statistics, we will study just two in
this chapter: the binomial distribution for discrete variables (section 3.6) and the Normal
distribution for continuous variables (section 3.7).

3.6 Binomial Distribution


The binomial distribution is appropriate when there are n independent trials each of which
may end in success or failure. Denote the probability of success on each trial by p, and let r
be the number of successes in the n trials. Then
$$P(r \text{ successes in } n \text{ trials}) = \frac{n!}{r!\,(n-r)!}\, p^{r} (1-p)^{n-r}, \qquad r = 0, \ldots, n$$
Note that n! is referred to as n factorial and equals n × (n-1) × (n-2) × ... × 3 × 2 × 1, so that
5! = 5 × 4 × 3 × 2 × 1 = 120 and 1! = 1, but note that 0! = 1. Also recall that $p^0 = 1$.

For the binomial distribution the mean and standard deviation are obtained from the formulae
$$\text{Mean} = np, \qquad \text{Standard deviation} = \sqrt{np(1-p)}$$

We can now examine the problem in the third part of example 3.4 using the binomial
distribution by noting that n=3 (the number of trials) and p=0.46 (the probability of “success”
defined as being blood group O). Then we can obtain the probability distribution simply by
successively substituting the various possible values of r.

Number with blood group O (successes), r    P(r successes) = $\frac{n!}{r!\,(n-r)!}\, p^{r} (1-p)^{n-r}$

0    $\frac{3!}{0!\,3!} (0.46)^0 (0.54)^3 = (0.54)^3 = 0.1575$
1    $\frac{3!}{1!\,2!} (0.46)^1 (0.54)^2 = 3 (0.46) (0.54)^2 = 0.4024$
2    $\frac{3!}{2!\,1!} (0.46)^2 (0.54)^1 = 3 (0.46)^2 (0.54) = 0.3428$
3    $\frac{3!}{3!\,0!} (0.46)^3 (0.54)^0 = (0.46)^3 = 0.0973$

Although the mean can be calculated as


(0 × 0.1575) + (1 × 0.4024) + (2 × 0.3428) + (3 × 0.0973) = 1.38
it is easier to use the formula Mean = np = 3 × 0.46 = 1.38.
Similarly, Standard deviation = $\sqrt{np(1-p)} = \sqrt{3 \times 0.46 \times (1 - 0.46)} = 0.86$.
So on average we would expect 1.38 of the three individuals to be blood group O (SD 0.86).
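The binomial calculation above can be reproduced in Python using only the standard library. This is a minimal sketch: the helper name `binomial_pmf` is an assumption made for illustration, and `math.comb` (Python 3.8 or later) supplies the n!/(r!(n-r)!) term.

```python
from math import comb, sqrt

def binomial_pmf(r, n, p):
    """P(r successes in n independent trials, each with success probability p)."""
    return comb(n, r) * p**r * (1 - p) ** (n - r)

n, p = 3, 0.46  # three individuals, P(blood group O) = 0.46

# Full probability distribution for r = 0, 1, 2, 3
dist = {r: binomial_pmf(r, n, p) for r in range(n + 1)}
for r, prob in dist.items():
    print(r, f"{prob:.4f}")  # 0.1575, 0.4024, 0.3428, 0.0973

# Mean and SD from the distribution itself (weighted averages) ...
mean = sum(r * prob for r, prob in dist.items())
sd = sqrt(sum((r - mean) ** 2 * prob for r, prob in dist.items()))

# ... and from the shortcut formulae np and sqrt(np(1 - p))
print(round(mean, 2), round(n * p, 2))                 # 1.38 1.38
print(round(sd, 2), round(sqrt(n * p * (1 - p)), 2))   # 0.86 0.86
```

Computing the mean and standard deviation both ways shows that the shortcut formulae agree with the weighted-average definitions given in section 3.5.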

In practice we can use SPSS to generate binomial probability distributions as follows:
File > New > Data
Enter the values 0,1,2,3 in a column. Name it r, and display 0 decimal places.
Transform > Compute Variable…
Set Target Variable: to P
In Function group: select PDF & NonCentral PDF.
In Functions and Special Variables: select Pdf.Binom.
Move it into the Numeric expression box by clicking the up arrow box.
Complete as PDF.BINOM(r, 3, 0.46) and press OK.
Display the probabilities in P to four decimal places.

Although the binomial distribution might initially appear to be rather restricted in its scope
for application, in practice it can be used to describe a wide variety of medical phenomena in
a relatively simple way.

3.7 Normal Distribution


The Normal (or Gaussian) distribution is the most important probability distribution in
statistics, partly because of its ability to describe certain measurements, but more
importantly because of theoretical results which show that it is useful for describing
sampling distributions in large samples (Section 4.7).
Suppose a measurement on a continuous scale (e.g. height) is made on a randomly selected
individual from a population.
Imagine the shape of the relative frequency histogram of height in a large sample of
individuals. This will be a curve called the probability density function (pdf).
It shows how probability is distributed over possible heights.
Areas under the curve correspond to probabilities, and the total area under the curve is one.

[Figure: Normal curve N(μ, σ²) with the X axis marked at μ-3σ, μ-2σ, μ-σ, μ, μ+σ, μ+2σ and μ+3σ; the central 68%, 95% and 99.7% of the area lie within 1, 2 and 3 standard deviations of μ.]

For some variables (including height) this function is a smooth bell-shaped curve, symmetric
about the mean, μ (see diagram). This is a Normal distribution. Although in theory a Normal
distribution can take any value between –∞ and +∞, it can be shown that 68%, 95% and
99.7% of the total probability lies in the ranges μ ± σ, μ ± 2σ and μ ± 3σ respectively, i.e.
within 1, 2 or 3 standard deviations, σ, of the mean.
The above Normal distribution has
Mean = μ,
Standard deviation = σ.
We write this as N(μ, σ²). In practice μ and σ must often be estimated from sample data.
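The 68%, 95% and 99.7% figures can be checked numerically. The sketch below uses `statistics.NormalDist` from the Python standard library (Python 3.8 or later); the helper name `prob_within` and the height parameters in the last line are illustrative assumptions.

```python
from statistics import NormalDist

def prob_within(k, mu=0.0, sigma=1.0):
    """Probability that a N(mu, sigma^2) variable lies within k standard deviations of mu."""
    dist = NormalDist(mu, sigma)
    return dist.cdf(mu + k * sigma) - dist.cdf(mu - k * sigma)

for k in (1, 2, 3):
    print(k, f"{prob_within(k):.4f}")  # 0.6827, 0.9545, 0.9973

# The result does not depend on the particular mean and standard deviation:
print(f"{prob_within(2, mu=170, sigma=10):.4f}")  # 0.9545
```

The quoted 95% is therefore a rounding of about 95.45% for exactly two standard deviations.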

The Standard Normal Distribution is a Normal distribution with mean, μ = 0 and standard
deviation, σ = 1; hence N(0, 1). The method for moving from a general Normal Distribution,
X, to the Standard Normal distribution, Z, is based on the transformation
Z = (X - μ)/σ
i.e. subtract the mean, μ, and then divide by the standard deviation, σ.

So Z represents the number of standard deviations X is from the mean. The standard Normal
distribution used to be important because the areas under the curve were conveniently
tabulated, but nowadays we are more likely to use computer packages (e.g. SPSS or Excel)
to evaluate Normal distribution tail areas. The following table gives some percentage points
of the standard Normal distribution.

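[Figure: two standard Normal curves N(0, 1). Left: two-sided probability, with area P/2 in each tail beyond -z and z. Right: one-sided probability, with area P in the tail beyond z.]

Percentage points, z, of N(0, 1)    Probability, P (two sided)    Probability, P (one sided)

± 0.00                              1.00                          0.50
± 1.28                              0.20                          0.10
± 1.645                             0.10                          0.05
± 1.96                              0.05                          0.025
± 2.58                              0.01                          0.005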

Example 1
Assume that diastolic blood pressure (DBP) is Normally distributed with mean μ = 100
mmHg and standard deviation σ = 10 mmHg in a certain population. Find the probability that
a randomly selected individual has DBP between 90 and 120 mmHg.

We can then use statistical tables or, alternatively, SPSS to obtain the tail areas directly:
File > New > Data
Enter the values of X = 90 and 120 in a column. Name it DBP.
Transform > Compute Variable…
Set Target Variable: to cumprob
In Function group: select CDF & NonCentral CDF
In Functions and Special Variables: select Cdf.Normal
Move it into the Numeric expression box by clicking the up arrow box.
Complete as CDF.NORMAL(DBP, 100, 10) and press OK.
Display the cumulative probabilities in cumprob to four decimal places.

This tells us that Prob (DBP < 120) = 0.9772 and Prob (DBP < 90) = 0.1587
So Prob (90 < DBP < 120) = 0.9772 – 0.1587 = 0.8185
So 82% of the DBP distribution lies between 90 and 120 mmHg.
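The same tail areas can be obtained outside SPSS. The following Python sketch uses `statistics.NormalDist` from the standard library as a stand-in for CDF.NORMAL; the variable names are illustrative assumptions.

```python
from statistics import NormalDist

dbp = NormalDist(mu=100, sigma=10)  # DBP ~ N(100, 10^2) in mmHg

# Standardised values Z = (X - mu) / sigma for the two limits
z_low = (90 - 100) / 10    # -1.0: 90 mmHg is one SD below the mean
z_high = (120 - 100) / 10  # 2.0: 120 mmHg is two SDs above the mean

p_below_120 = dbp.cdf(120)  # Prob(DBP < 120)
p_below_90 = dbp.cdf(90)    # Prob(DBP < 90)
p_between = p_below_120 - p_below_90

print(z_low, z_high)                                            # -1.0 2.0
print(f"{p_below_120:.4f} {p_below_90:.4f} {p_between:.4f}")    # 0.9772 0.1587 0.8186
```

The unrounded difference is 0.8186, whereas subtracting the two rounded cumulative probabilities gives 0.8185; either way about 82% of the distribution lies between 90 and 120 mmHg.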

Example 2
Forced vital capacity (FVC) in healthy middle-aged men is distributed Normally with mean
μ = 3,200 ml and standard deviation σ = 800 ml.

Obtain symmetric limits about the mean that enclose 95% of the distribution of FVC.

[Figure: Normal curve N(3200, 800²) for FVC, centred at μ = 3200, with 2.5% of the area in each tail beyond μ - 1.96σ = 3200 - (1.96 × 800) and μ + 1.96σ = 3200 + (1.96 × 800).]

This time the process must be reversed.
We obtain z = 1.96 from the table, and the limits are
μ ± 1.96σ or 3,200 ± 1.96 × 800, i.e. approximately 1,630 ml and 4,770 ml.
These limits give a 95% reference range (discussed further in Chapter D).
Alternatively in SPSS, insert the cumulative probabilities 0.025 and 0.975 in a column called
cumprob. Then use Transform > Compute Variable… This time use the Inverse DF Function
group and the idf.Normal function to generate the values of FVC.
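The reverse calculation can also be sketched in Python, with `statistics.NormalDist.inv_cdf` from the standard library playing the role of the inverse distribution function. The variable names are illustrative assumptions.

```python
from statistics import NormalDist

fvc = NormalDist(mu=3200, sigma=800)  # FVC ~ N(3200, 800^2) in ml

# Cumulative probabilities 0.025 and 0.975 leave 2.5% in each tail,
# so the values between them enclose the central 95% of the distribution.
lower = fvc.inv_cdf(0.025)
upper = fvc.inv_cdf(0.975)

print(round(lower), round(upper))  # 1632 4768
```

These agree with the hand calculation 3,200 ± 1.96 × 800 = 1,632 and 4,768 ml, which the example quotes to the nearest 10 ml.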

3.8 Recommended Reading
Bland Chapter 6 – Probability
Chapter 7 – The Normal Distribution

3.9 Practical
1. Shown below is a life table which summarises the death rates in Northern Ireland in the
period 1980-82. If 1000 newborn boys or girls were to experience this mortality pattern, the
table shows the numbers who would still be alive at various ages. Source: Annual Report of
the Registrar General for N Ireland (1982)

Age in years (x)    Number surviving (l_x)
                    Men     Women
0                   1000    1000
10                  981     987
20                  975     984
30                  962     979
40                  947     971
50                  909     949
60                  800     882
70                  578     743
80                  252     462
90                  20      55
100                 2       10
110                 0       0
Perform the following calculations separately for each sex:
(a) Find the probability that a randomly selected newborn will survive to his/her 20th birthday.
(b) What is the probability that he/she will not survive to his/her 70th birthday?
(c) If a man/woman survives to age 20, what are his/her chances of surviving to age 70?

2. The probability that a patient recovers from a stomach disease is 0.8. Suppose 10 people
are known to have contracted this disease. Assuming the conditions for the Binomial
distribution to hold, use SPSS to calculate the probabilities of the following events:
(a) exactly 4 recover (r = 4);
(b) at least 8 recover (r ≥ 8);
(c) at least 4 but not more than 8 recover (4 ≤ r ≤ 8);
(d) at most 6 recover (r ≤ 6);
(e) the mean of r.

3. The distribution of weight in a population can be approximated by the Normal
distribution with mean μ = 64 kg and standard deviation σ = 10 kg.

Use the SPSS Cumulative DF Function for the Normal distribution to answer the following:
(a) What percentage of the population has weight exceeding 80 kg?
(b) What percentage of the population has weight lying between 50 and 75 kg?

Use the SPSS Inverse DF Function for the Normal distribution to answer the following:
(c) Above what weight does 25% of the population of weights lie?

4. In the British Regional Heart Study, 270 out of 7,735 men aged between 40 and 59 years
developed ischaemic heart disease in a five year period. Statistical analysis showed the
probability of developing the disease to be related to a number of risk factors which could
be combined to form a “risk factor score” (British Medical Journal 1986; 293: 474-9).
This score was more successful at predicting heart disease than any single risk factor
taken on its own. The risk factor score in those who developed the disease was distributed
N(1070, 270²) while in the non-developers it was distributed N(900, 110²).
Sketch the two Normal probability distributions on the same scale of risk factor score paying
particular attention to the different means and standard deviations in the developers and the
non-developers.
A cut-off score of 1000 risk factor units was proposed to predict whether a subject would
become a developer or a non-developer. Determine how many actual developers would
be predicted to be non-developers and how many actual non-developers would be
predicted to be developers using this cut-off point. Then complete the following table.
                      Developer    Non-developer    Total

Risk score > 1000
Risk score < 1000

Total                 270          7465             7735


How useful do you think predictions made using this cut-off would be?
