Course Notes Part 1 - Chapters 1 To 4
Chapter 1 – Terminology
1.1 Definitions
Data/Data set – Set of values collected or obtained when gathering information on some
issue of interest.
Examples
4) The yields of a certain crop obtained after applying different types of fertilizer.
Statistics – Collection of methods for planning experiments, obtaining data, and then
organizing, summarizing, presenting, analyzing, interpreting the data and drawing
conclusions from it.
Statistics in the above sense refers to the methodology used in drawing meaningful
information from a data set. This use of the term should not be confused with statistics
(referring to a set of numerical values) or statistics (referring to measures of description
obtained from a data set).
Examples
2) The collection of all cars of a certain type manufactured during a particular month.
Examples
1) Study of the entire population carried out by the government every 10 years.
A census is usually very costly and time consuming. It is therefore not carried out very often.
A study of a population is usually confined to a subgroup of the population.
The number of values in the sample (sample size) is denoted by n. The number of values in
the population (population size) is denoted by N.
Discrete variables – Variables that can assume a finite or countable number of possible
values. Such variables are usually obtained by counting.
Examples
3) A person’s response (agree, not agree) to a statement. A one (1) is recorded when
the person agrees with the statement, a zero (0) is recorded when a person does not
agree.
Continuous variables – Variables that can assume any value within an interval, so that the
number of possible values is uncountably infinite. Such variables are usually obtained by
measurement.
Examples
Measurement scales
Examples
Nominal scale – Level of measurement which classifies data into categories in which no
order or ranking can be imposed on the data.
A variable can be treated as nominal when its values represent categories with no intrinsic
ranking. For example, the department of the company in which an employee works.
Examples of nominal variables include region, postal code, or religious affiliation.
Ordinal scale – Level of measurement which classifies data into categories that can be
ordered or ranked. Precise differences between the ranks do not exist.
A variable can be treated as ordinal when its values represent categories with some intrinsic
order or ranking.
Examples
Examples
Discrete and continuous variables examples given above.
Interval scale – Level of measurement which classifies data that can be ordered and ranked
and where differences are meaningful. However, there is no meaningful zero and ratios are
meaningless.
Examples
1) The difference between a temperature of 100 degrees and 90 degrees is the same
difference as that between 90 degrees and 80 degrees. Taking ratios in such a case
does not make sense.
Ratio scale – Level of measurement where differences and ratios are meaningful and there
is a natural zero. This is the “highest” level of measurement in terms of possible operations
that can be performed on the data.
Examples
Variables like height, weight, mark (in test) and speed are ratio variables. These variables
have a natural zero and ratios make sense when doing calculations e.g. a weight of 80
kilograms is twice as heavy as one of 40 kilograms.
2.1) Collecting data that compares reckless driving of female and male drivers.
2.2) Collecting data on smoking and lung cancer.
Examples
Examples
Sampling frame (synonyms: "sample frame", "survey frame") – This is the actual set of units
from which a sample is drawn
Example
Consider a survey aimed at establishing the number of potential customers for a new
service in a certain city. The research team has drawn 1000 numbers at random from a
telephone directory for the city, made 200 calls each day from Monday to Friday from 8am
to 5pm and asked some questions.
In this example, the population of interest is all the inhabitants in the city. The sampling
frame includes only those city dwellers that satisfy all the following conditions:
3) They are likely to be at home from 8am to 5pm from Monday to Friday;
The sampling frame in this case definitely differs from the population. For example, it
under-represents people who have no telephone (e.g. the poorest), who have an unlisted
number, who were not at home at the time of the calls (e.g. employed people), or who do
not like to participate in telephone interviews (e.g. busier and more active people).
Such differences between the sampling frame and the population of interest are a main cause
of bias when drawing conclusions based on the sample.
Probability samples – Samples drawn according to the laws of chance. These include simple
random sampling, systematic sampling and stratified random sampling.
Simple random sampling – Sampling in which each sample of a given size that can be drawn
will have the same chance of being drawn. Most of the theory in statistical inference is
based on random sampling being used.
Examples
1) The 6 winning numbers (drawn from 49 numbers) in a Lotto draw. Each potential
sample of 6 winning numbers has the same chance of being drawn.
Example
Suppose the first 6 random numbers in the table of random numbers are:
10480, 22368, 24130, 42167, 37570, 77921.
Use these numbers to select the 6 winning numbers in a Lotto draw.
The 49 numbers from which the draw is made all involve 2 digits i.e. 01, 02, . . . , 49.
Putting the above numbers from the table of random numbers next to each other in a string
of digits gives: 10 48 02 23 68 24 13 04 21 67 37 57 07 79 21 .
The winning numbers can be selected by taking pairs of digits between 01 and 49
(discarding any pairs outside this range and any repeats), working either from left to right
or from right to left in the above string.
By working from left to right the winning numbers are: 10, 48, 2, 23, 24 and 13.
By working from right to left the winning numbers are: 21, 7, 37, 4, 13 and 24 (the second
21 is a repeat and is discarded).
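The selection procedure can be sketched in Python as follows. This is a minimal illustration, not part of the original notes; the function name `pick_from_digits` and its parameters are hypothetical.

```python
def pick_from_digits(numbers, low=1, high=49, count=6, reverse=False):
    """Select `count` distinct values in [low, high] from successive digit pairs."""
    # Put the random numbers next to each other in one string of digits.
    digits = "".join(str(n) for n in numbers)
    # Cut the string into two-digit pairs: 10, 48, 02, 23, ...
    pairs = [int(digits[i:i + 2]) for i in range(0, len(digits) - 1, 2)]
    if reverse:
        pairs.reverse()  # work from right to left instead
    chosen = []
    for value in pairs:
        # Discard pairs outside the range and pairs already selected.
        if low <= value <= high and value not in chosen:
            chosen.append(value)
        if len(chosen) == count:
            break
    return chosen


table = [10480, 22368, 24130, 42167, 37570, 77921]
left_to_right = pick_from_digits(table)
right_to_left = pick_from_digits(table, reverse=True)
print(left_to_right)
print(right_to_left)
```

If the string of digits runs out before six valid numbers are found, the function returns fewer than six values; in practice more random numbers would then be read from the table.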
The advantage of simple random sampling is that it is simple and easy to apply when small
populations are involved. However, because every person or item in a population has to be
listed before the corresponding random numbers can be read, this method is very
cumbersome to use for large populations and cannot be used if no list of the population
items is available. It can also be very time consuming to try and locate every person included
in the sample. There is also a possibility that some of the persons in the sample cannot be
contacted at all.
Systematic sampling – Sampling in which data is obtained by selecting every kth object,
where k is approximately $N/n$.
Examples
1) A manufacturer might decide to select every 20th item on a production line to test
for defects and quality. This technique requires the first item to be selected at
random as a starting point for testing and, thereafter, every 20th item is chosen.
2) A market researcher might select every 10th person who enters a particular store,
after selecting a person at random as a starting point; or interview occupants of
every 5th house in a street, after selecting a house at random as a starting point.
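A minimal sketch of systematic sampling, assuming the population is available as a list. The function name `systematic_sample` and the item labels 1 to 200 are hypothetical and only serve as an illustration.

```python
import random


def systematic_sample(items, n, rng=random):
    """Pick a random start among the first k items, then take every kth item."""
    k = max(1, len(items) // n)   # k is approximately N / n
    start = rng.randrange(k)      # random starting point
    return items[start::k][:n]


production_line = list(range(1, 201))   # hypothetical item labels 1..200
sample = systematic_sample(production_line, n=10, rng=random.Random(1))
print(sample)
```

With N = 200 and n = 10 the step is k = 20, so consecutive sampled items are always 20 positions apart.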
Stratified random sampling – Sampling in which the population is divided into groups
(called strata) according to some characteristic. Each of these strata is then sampled using
random sampling.
A general problem with random sampling is that you could, by chance, miss out a particular
group in the sample. However, if you subdivide the population into groups, and sample from
each group, you can make sure the sample is representative. Some examples of strata
commonly used are those according to province, age and gender. Other strata may be
according to religion, academic ability or marital status.
Example
In a study investigating the expenditure pattern of consumers, they were divided into low,
medium and high income groups.
When sampling is proportional to size (an income group comprises the same percentage of
the sample as of the population) the sample sizes for the strata should be calculated as
follows.
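Assuming the standard proportional allocation rule (stratum sample size = total sample size × stratum size / population size), the calculation can be sketched as below. The stratum sizes and the total sample size of 200 are hypothetical figures chosen for illustration only.

```python
def proportional_allocation(stratum_sizes, n):
    """Allocate a total sample of size n to strata in proportion to their sizes."""
    total = sum(stratum_sizes.values())
    return {name: round(n * size / total) for name, size in stratum_sizes.items()}


# Hypothetical population sizes for the three income groups.
income_groups = {"low": 5000, "medium": 3000, "high": 2000}
allocation = proportional_allocation(income_groups, n=200)
print(allocation)
```

Because each stratum size is rounded separately, the allocated sizes may not add up exactly to n for other inputs and may need a small manual adjustment.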
Convenience Sampling – Sampling in which data that is readily available is used e.g. surveys
done on the internet. Quota sampling is one form of convenience sampling.
Example
A company is marketing a new product and needs to know how potential customers might
react to the product.
Stage 1: It is decided that age (the 3 groups under 20, 20-40, over 40) and gender
(male, female) are the characteristics that will determine the sample.
Stage 2: The 6 categories to be sampled from are (male under 20), (male 20-40),
(male over 40), (female under 20), (female 20-40) and (female over 40).
Stage 3: The numbers (sub-quotas) to be sampled are (male under 20) - 40,
(male 20-40) - 60, (male over 40) - 25, (female under 20) - 35, (female 20-40) - 65
and (female over 40) - 30. The total quota is the total of all the sub-quotas i.e. 255.
Stage 4: Visit a place where individuals to be interviewed are readily available e.g. a
large shopping center and interview people until all the quotas are filled.
Quota sampling is a cheap and convenient way of obtaining a sample in a short space of
time. However, this method of sampling is not based on the laws of chance and cannot
guarantee a sample that is representative of the population from which it is drawn.
When obtaining a quota sample, interviewers often choose who they like (within criteria
specifications) and may therefore select those who are easiest to interview. Therefore
sampling bias can result. It is also impossible to estimate the accuracy of quota sampling
(because sampling is not random).
All the data sets used in this chapter will be regarded as samples drawn from some
population. One of the main purposes of studying a sample is to get information about the
population. The main focus here is on summarizing and describing some features of the
data.
The graph above shows how a person's weight varied from the beginning of 1991 to the
beginning of 1995.
Bar charts
A bar chart or bar graph is a chart consisting of rectangular bars with heights proportional to
the values that they represent. Bar charts are used for comparing two or more values that
are taken over time or under different conditions.
In a simple bar chart the figures used to make comparisons are represented by bars. These
are either drawn vertically or horizontally. Only totals are represented. The height or length
of the bar is drawn in proportion to the size of the figure being presented. An example is
shown below.
When you want to draw a bar chart to illustrate your data, it is often the case that the totals
of the figures can be broken down into parts or components.
You start by drawing a simple bar chart with the total figures as shown above. The columns
or bars (depending on whether you draw the chart vertically or horizontally) are then
divided into the component parts.
Pareto chart
A Pareto chart is a special type of bar chart where the values being plotted are
arranged in descending order. The graph is accompanied by a line graph which shows
the cumulative totals of each category, left to right.
The graph below is a Pareto chart that shows the percentage of late arrivals at a
place of work organized according to cause of late arrival (from the most common to
the least common cause). The line shows the accumulated percentages.
[Figure: Pareto chart of late arrivals by reason. Bars (percent): traffic 33, child care 25,
transport 17, weather 15, overslept 8, emergency 2. The line shows the cumulative percent,
from 0% to 100%.]
Dot Plot
This is a diagram where a line is drawn according to a scale that is appropriate for the data set
and the values (in the data set) plotted at their positions on the scale. If the same value
occurs more than once, the multiple values are plotted on top of each other at the same
point on the scale. For small data sets (few values) this plot can provide useful information
regarding data patterns.
Example
Imagine that a medium-sized retailer, thinking of expanding into a new region
identifies a business that it considers as being ready for takeover. It finds the
following annual profit figures (in tens of thousands of pounds) for the target
retailer's last ten years trading:
9 9 7 7 7 6 5 4 3 3
To draw a dot plot we can begin by drawing a horizontal line across the page to
represent the range of values of all the numbers; then we can mark an 'x' above the
appropriate value along the line as follows:
Pie Chart
A Pie chart is a diagram that shows the subdivision of some entity/total into subgroups. The
diagram is in the form of a circle which is divided into slices with each slice having an area
according to the proportion that it makes up of the total.
Example
The pie chart below shows the ingredients used to make a sausage and mushroom pizza.
The degrees needed for each slice are found by calculating the appropriate percentage of 360
e.g. for sausage the degrees are 0.125 × 360 = 45 and for cheese 0.25 × 360 = 90 etc.
The complete calculations are shown in the table below.
Stem-and-leaf plot
A stem-and-leaf plot is a device used for summarizing quantitative data in a
table/graphical format to assist in visualizing the shape of a data set.
Examples
1) To construct a stem-and-leaf plot, the values must first be sorted in ascending order.
Here is the sorted set of data values that will be used in the example:
44 46 47 49 63 64 66 68 68 72 72 75 76 81 84 88 106
Next, it must be determined what the stems will represent and what the leaves will
represent. Typically, the leaf contains the last digit of the number and the stem
contains all of the other digits. In the case of very large or very small numbers, the
data values may be rounded to a particular place value (such as the hundredths
place) that will be used for the leaves. The remaining digits to the left of the rounded
place value are used as the stems.
In this example, the leaf represents the “ones” place and the stem the rest of the
number (“tens” place or higher).
The stem-and-leaf plot is drawn with two columns separated by a vertical line. The
stems are listed to the left of the vertical line. It is important that each stem is listed
only once and that no numbers are skipped, even if it means that some stems have
no leaves. The leaves are listed in increasing order in a row to the right of each stem.
 4 | 4 6 7 9
 5 |
 6 | 3 4 6 8 8
 7 | 2 2 5 6
 8 | 1 4 8
 9 |
10 | 6
key: 5|4 = 54
leaf unit: 1.0
stem unit: 10.0
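The construction described above can be sketched in Python. This is a minimal illustration for non-negative whole numbers with leaf unit 1 and stem unit 10; the function name `stem_and_leaf` is hypothetical.

```python
from collections import defaultdict


def stem_and_leaf(values):
    """Return the rows of a stem-and-leaf plot (stem unit 10, leaf unit 1)."""
    leaves = defaultdict(list)
    for v in sorted(values):
        leaves[v // 10].append(v % 10)   # stem = tens and higher, leaf = ones
    rows = []
    # List every stem once, including stems that have no leaves.
    for stem in range(min(leaves), max(leaves) + 1):
        row = " ".join(str(leaf) for leaf in leaves.get(stem, []))
        rows.append(f"{stem:>2} | {row}".rstrip())
    return rows


data = [44, 46, 47, 49, 63, 64, 66, 68, 68, 72, 72, 75, 76, 81, 84, 88, 106]
plot = stem_and_leaf(data)
print("\n".join(plot))
```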
As an example, suppose the fat contents (in grams) of English breakfasts and cold meat
sandwiches are to be compared. The fat contents are shown below.
Sandwiches: 6, 7, 12, 13, 17, 18, 20, 21, 21, 24, 26, 28, 30, 34
Breakfasts: 12, 14, 15, 16, 18, 23, 25, 25, 36, 36, 38, 41, 44, 45
Breakfasts | stem | Sandwiches
           |  0   | 6 7
 2 4 5 6 8 |  1   | 2 3 7 8
     3 5 5 |  2   | 0 1 1 4 6 8
     6 6 8 |  3   | 0 4
     1 4 5 |  4   |
Conclusion: The fat content in English breakfasts appears to be higher than that in
sandwiches.
Suppose the symbol x is used to denote some variable of interest in a study. In order to
distinguish between values of this variable, subscripts are used.
$x_1 + x_2 + \dots + x_n = \sum_{i=1}^{n} x_i$.
If it is understood that the range of subscript indices over which the summation is taken
involves all the x values, the summation can be written as just
$x_1 + x_2 + \dots + x_n = \sum x$.
Example 1: Suppose $x_1 = 70$, $x_2 = 74$, $x_3 = 66$, $x_4 = 68$, $x_5 = 71$. Then
$\sum_{i=1}^{5} x_i = x_1 + x_2 + \dots + x_5 = 70 + 74 + 66 + 68 + 71 = 349$.
$\sum_{i=1}^{5} x_i^2 = 70^2 + 74^2 + 66^2 + 68^2 + 71^2 = 24397$.
Note that $\sum_{i=1}^{n} x_i^2 \neq \left( \sum_{i=1}^{n} x_i \right)^2$,
e.g. for the abovementioned data $\sum_{i=1}^{5} x_i^2 = 24397 \neq 349^2 = 121801$.
The summation notation can also be used to write the sum of products of corresponding
values for 2 different sets of values.
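$\sum_{i=1}^{n} x_i y_i = x_1 y_1 + x_2 y_2 + \dots + x_n y_n$
i      1    2    3    4    5    6
x_i    11   13   7    12   10   8
y_i    8    5    7    6    9    11
$\sum_{i=1}^{6} x_i y_i$ = (11×8) + (13×5) + (7×7) + (12×6) + (10×9) + (8×11)
= 88 + 65 + 49 + 72 + 90 + 88
= 452.
Note that $\sum_{i=1}^{n} x_i y_i \neq \left( \sum_{i=1}^{n} x_i \right) \left( \sum_{i=1}^{n} y_i \right)$,
e.g. for the abovementioned data $\sum_{i=1}^{6} x_i = 61$ and $\sum_{i=1}^{6} y_i = 46$, so that
$\left( \sum_{i=1}^{6} x_i \right) \left( \sum_{i=1}^{6} y_i \right) = 2806 \neq \sum_{i=1}^{6} x_i y_i$.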
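The sums in the two examples can be checked with a few lines of Python. The variable names are chosen for illustration only.

```python
# Example 1: sum of values, sum of squares and square of the sum.
x = [70, 74, 66, 68, 71]
sum_x = sum(x)
sum_x_sq = sum(v ** 2 for v in x)
square_of_sum = sum_x ** 2

# Example 2: sum of products versus product of sums.
xs = [11, 13, 7, 12, 10, 8]
ys = [8, 5, 7, 6, 9, 11]
sum_xy = sum(a * b for a, b in zip(xs, ys))
product_of_sums = sum(xs) * sum(ys)

print(sum_x, sum_x_sq, square_of_sum)
print(sum_xy, product_of_sums)
```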
A frequency distribution is a table in which data are grouped into classes and the number of
values (frequencies) which fall in each class is recorded.
The main purpose of constructing a frequency distribution is to get insight into the
distribution pattern of the frequencies over the classes. Hence, the name frequency
distribution is used to refer to this pattern.
Example 1
In a survey of 40 families in a village, the number of children per family was recorded and
the following data obtained.
1 0 3 2 1 5 6 2
2 1 0 3 4 2 1 6
3 2 1 5 3 3 2 4
2 2 3 0 2 1 4 5
3 3 4 4 1 2 4 5
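For a discrete variable with few distinct values, such as the number of children per family, each value can serve as its own class. A minimal sketch of the tally using Python's `collections.Counter`:

```python
from collections import Counter

children = [
    1, 0, 3, 2, 1, 5, 6, 2,
    2, 1, 0, 3, 4, 2, 1, 6,
    3, 2, 1, 5, 3, 3, 2, 4,
    2, 2, 3, 0, 2, 1, 4, 5,
    3, 3, 4, 4, 1, 2, 4, 5,
]

freq = Counter(children)   # maps each value to the number of times it occurs
for value in sorted(freq):
    print(value, freq[value])
print("Total", sum(freq.values()))
```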
Example 2
Consider the following data of low temperatures (in degrees Fahrenheit to the nearest
degree) for 50 days. The highest temperature is 64 and the lowest temperature is 39.
The classes into which the above values can be sorted can be found by following the steps
shown below.
1. Find the maximum (=64) and minimum (=39) values and calculate the range = 64 – 39 = 25.
2. Decide on the number of classes. Use Sturges’ rule which states that
No. of classes = k
= the rounded up value of (1 + 1.44 ln n).
For n = 50, 1 + 1.44 × ln(50) = 6.63, which is rounded up to give k = 7.
3. Calculate the class width such that no. of classes × class width > range
4. Find the lower value that defines the first class. This is usually a value just below the
minimum value in the data set. Since the minimum value for this data set is 39, the
lowest class can have a minimum value one below this i.e. 38.
5. Find the lower values that define each of the classes that follow by successively adding
the class width to the lower value of the preceding class.
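The steps above can be sketched as follows for data recorded as whole numbers. The function name `sturges_classes` is hypothetical; taking the smallest whole-number class width that satisfies step 3 and starting one below the minimum are assumptions that match the temperature example.

```python
import math


def sturges_classes(minimum, maximum, n):
    """Return (number of classes, class width, inclusive class limits)."""
    k = math.ceil(1 + 1.44 * math.log(n))   # Sturges' rule, rounded up
    data_range = maximum - minimum
    width = data_range // k + 1             # smallest whole width with k * width > range
    first_lower = minimum - 1               # a value just below the minimum
    lowers = [first_lower + i * width for i in range(k)]
    # Inclusive upper limit for whole-number data: next lower limit minus 1.
    return k, width, [(lo, lo + width - 1) for lo in lowers]


k, width, classes = sturges_classes(minimum=39, maximum=64, n=50)
print(k, width)
print(classes)
```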
The table below shows the classes and their frequencies for the temperatures data set.
class limits f
38 – 41 4
42 – 45 10
46 – 49 8
50 – 53 15
54 – 57 9
58 – 61 3
62 – 65 1
Total 50
The values in the above example that define the classes of the frequency distribution are
called class limits. The classes of the type 38 – 41, 42 – 45, … in which both the upper and
lower limits are included are called “inclusive classes”. For example, the class 38 – 41
includes all the values from 38 to 41.
In spite of the great importance of classification in statistical analysis, no hard and fast
rules can be laid down for it.
The following points must be kept in mind for classification:
1) The classes should be clearly defined and should not lead to any ambiguity.
2) Each of the given values in the data set should be included in one of the classes.
3) The classes should be of equal width, otherwise the different class frequencies will
not be comparable. If the class widths are unequal, then comparable figures can
If we deal with a continuous variable, it is not possible to arrange the data in class
intervals of the above type. Consider the distribution of age in years. If the class intervals
are 15 – 19, 20 – 24, then persons with ages between 19 and 20 years are not taken into
consideration. In such a case we form the class intervals as 0 – 5, 5 – 10, 10 – 15,
15 – 20, … Here all persons, whatever their fraction of age, are included in one group or the
other. In these classes the upper limit of each class is excluded from that class and
included in the next class; such classes are known as ‘exclusive classes’.
The upper and lower class limits of the new exclusive type classes are known as class
boundaries.
If d is the gap between the upper limit of any class and the lower limit of the succeeding
class, the class boundaries for any class are then given by :
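Assuming the usual convention that the lower class boundary is the lower class limit minus d/2 and the upper class boundary is the upper class limit plus d/2, the calculation can be sketched as below. With d = 1 this reproduces the boundaries 37.5 – 41.5, …, 61.5 – 65.5 of the temperature data. The function name `class_boundaries` is hypothetical.

```python
def class_boundaries(limits):
    """Convert inclusive class limits to class boundaries (assumes at least 2 classes)."""
    # d = gap between the upper limit of a class and the lower limit of the next.
    d = limits[1][0] - limits[0][1]
    return [(lower - d / 2, upper + d / 2) for lower, upper in limits]


limits = [(38, 41), (42, 45), (46, 49), (50, 53), (54, 57), (58, 61), (62, 65)]
boundaries = class_boundaries(limits)
print(boundaries)
```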
Example 3
The monthly expenditures (thousands of rands) of 60 households are shown on the next
page. The values of this data set were accurately recorded (not rounded).
classes f
4.5 – 5.5 5
5.5 – 6.5 7
6.5 – 7.5 13
7.5 – 8.5 13
8.5 – 9.5 9
9.5 – 10.5 10
10.5 – 11.5 3
Total 60
For this distribution lower (upper) class limit = lower (upper) class boundary for each of the
classes.
A value that falls on the boundary of 2 classes is allocated to the higher of the two classes
e.g. 5.50000 is allocated to the class 5.5 – 6.5 (not 4.5 to 5.5).
Class midpoints
Examples
1) For the frequency distribution in example 2 (temperature data), the class midpoints
are given on the following page.
2) For the frequency distribution in example 3 (expenditure data), the class midpoints are
given below.
classes midpoints
4.5 – 5.5 5
5.5 – 6.5 6
6.5 – 7.5 7
7.5 – 8.5 8
8.5 – 9.5 9
9.5 – 10.5 10
10.5 – 11.5 11
Cumulative frequencies
The “less than” cumulative frequency of a class is the number of values in the sample that
are less than or equal to the upper class boundary of the class.
Examples
class boundaries   f    cumulative frequency   calculations
37.5 – 41.5        4    4                      4
41.5 – 45.5        10   14                     4+10
45.5 – 49.5        8    22                     4+10+8
49.5 – 53.5        15   37                     4+10+8+15
53.5 – 57.5        9    46                     4+10+8+15+9
57.5 – 61.5        3    49                     4+10+8+15+9+3
61.5 – 65.5        1    50                     4+10+8+15+9+3+1
classes        f    cumulative frequency   calculations
4.5 – 5.5      5    5                      5
5.5 – 6.5      7    12                     5+7
6.5 – 7.5      13   25                     5+7+13
7.5 – 8.5      13   38                     5+7+13+13
8.5 – 9.5      9    47                     5+7+13+13+9
9.5 – 10.5     10   57                     5+7+13+13+9+10
10.5 – 11.5    3    60                     5+7+13+13+9+10+3
Total          60
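The running totals in the two tables above can be reproduced with `itertools.accumulate` from the Python standard library. A minimal sketch:

```python
from itertools import accumulate

temperature_f = [4, 10, 8, 15, 9, 3, 1]    # example 2 frequencies
expenditure_f = [5, 7, 13, 13, 9, 10, 3]   # example 3 frequencies

# Each cumulative frequency is the sum of all frequencies up to that class.
temperature_cum = list(accumulate(temperature_f))
expenditure_cum = list(accumulate(expenditure_f))

print(temperature_cum)
print(expenditure_cum)
```

The last cumulative frequency always equals the sample size (50 and 60 respectively).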
Examples
1) The relative and percentage frequencies for the frequency distribution in example 2
(temperature data) are shown below.
2) The relative and percentage frequencies for the frequency distribution in example 3
(expenditure data) are shown below.
classes        f    relative frequency   percentage frequency
4.5 – 5.5      5    0.083                8.3
5.5 – 6.5      7    0.117                11.7
6.5 – 7.5      13   0.217                21.7
7.5 – 8.5      13   0.217                21.7
8.5 – 9.5      9    0.15                 15
9.5 – 10.5     10   0.167                16.7
10.5 – 11.5    3    0.05                 5
Total          60   1                    100
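The figures in the table are consistent with dividing each frequency by the sample size (relative frequency) and multiplying the result by 100 (percentage frequency). A minimal sketch for the expenditure data:

```python
f = [5, 7, 13, 13, 9, 10, 3]   # expenditure data frequencies
n = sum(f)                     # sample size

relative = [round(v / n, 3) for v in f]          # frequency / n
percentage = [round(100 * v / n, 1) for v in f]  # relative frequency x 100

print(relative)
print(percentage)
```

Because of rounding, the rounded relative frequencies may add up to slightly more or less than 1.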
Histogram
Example
[Figure: histogram of the temperature data. Horizontal axis: temperature, with classes
37.5-41.5, 41.5-45.5, 45.5-49.5, 49.5-53.5, 53.5-57.5, 57.5-61.5 and 61.5-65.5; vertical
axis: frequency (0 to 16).]
Frequency polygon
This is also a graphical representation of a frequency distribution. For each class the
class midpoint is plotted against the frequency and the plotted points joined by
means of straight lines.
Example
midpoint 35.5 39.5 43.5 47.5 51.5 55.5 59.5 63.5 67.5
f 0 4 10 8 15 9 3 1 0
[Figure: frequency polygon of the temperature data. Horizontal axis: midpoint (0 to 80);
vertical axis: frequency (0 to 16).]
Note:
The two plotted values at the lower and upper ends were added to anchor the graph to the
horizontal axis. The lower end value is a plot of 0 versus the midpoint of the class below the
first (lowest) class (35.5). This midpoint is obtained by subtracting the class width (4) from
the midpoint of the lowest class (39.5). The upper end value is a plot of 0 versus the
midpoint of the class above the last class (67.5). This midpoint is obtained by adding the
class width (4) to the midpoint of the last (highest) class (63.5).
The histogram and frequency polygon are equivalent graphical representations of the
pattern of the frequencies shown in the frequency distribution.
The histogram can provide an estimate of the probability (chance) that a value drawn at
random from the data set will lie between two values.
Examples
Example
For the “less than” ogive of the frequency distribution in example 2 (temperature data)
class boundary         37.5   41.5   45.5   49.5   53.5   57.5   61.5   65.5
cumulative frequency   0      4      14     22     37     46     49     50
[Figure: “less than” ogive (cumulative frequency) of the temperature data. Horizontal axis:
class boundary (0 to 70); vertical axis: cumulative frequency (0 to 60).]
Note:
The plotted value at the lower end was added to anchor the graph to the horizontal axis.
The lower end value is a plot of 0 versus the upper class boundary of the class below the
first (lowest) class (37.5). This upper class boundary is obtained by subtracting the class
width (4) from the upper class boundary of the lowest class (41.5).
A percentage “less than” ogive can be plotted by just changing the vertical scale. In this
example the frequencies add up to 50, so a frequency is converted to a percentage by
multiplying it by 2. To draw the percentage ogive, each cumulative frequency in the above
table is therefore multiplied by 2. The resulting graph is shown below. The value below which
a given percentage of the observations in the data set lie can be read off from the ogive.
[Figure: percentage “less than” ogive of the temperature data. Horizontal axis: boundaries
(0 to 70); vertical axis: % cumulative freq (0 to 120).]
The main purpose of drawing a histogram is to describe the clustering pattern of the values
in the data set. For a large sample size, the histogram (frequency polygon) can be fairly well
approximated by a smooth curve (called a frequency curve) that is fitted to the frequencies.
The following patterns of the shape of the frequency curve appear regularly in data sets.
[Figure: frequency curve. Horizontal axis: x (-4 to 4); vertical axis: frequency (0 to 0.45).]
This shape is for data sets where the majority of values are in the central portion of the
scale with fewer and fewer values the further away from the center (in both directions).
Many data sets have this shape. Examples are
[Figure: frequency curve. Horizontal axis: x (0 to 6); vertical axis: frequency (0 to 0.12).]
This shape occurs when all the values in the data set occur approximately the same number
of times. Examples are
3) Frequencies obtained when tossing an unbiased coin and recording 0 if tails come up and 1
if heads come up.
Bimodal shape
[Figure: bimodal frequency curve. Horizontal axis: body length (mm), 0 to 120; vertical axis:
frequency (0 to 60).]
This pattern, which shows two distinct peaks (hence the name bimodal), appears
when there are two subgroups with different sets of values in the same data set.
Examples
1) Measuring the body lengths of ants when there are adults and juveniles together in
the same data set. The two peaks in the curve reflect the fact that juvenile ants have
shorter body lengths than adult ants.
2) Heights of a population of males and females. Since the females are shorter than the
males, the frequency curve will have two peaks. One peak will be located where the
most female heights are concentrated and one where the most male heights are
concentrated.
[Figure: frequency curve. Horizontal axis: x (0 to 14); vertical axis: frequency (0 to 1.2).]
This shape shows a high clustering of values at the lower end of the scale and less and less
clustering further away from the lower end towards the upper end.
Example
The time it takes to serve a customer at a supermarket. For most customers the service time
is quite short. The longer the service time, the fewer the customers.
[Figure: frequency curve. Horizontal axis: x (0 to 16); vertical axis: frequency (0 to 0.3).]
This shape shows a high clustering of values at the upper end of the scale and less and less
clustering further away from the upper end towards the lower end.
Example
Marks in a test where most students did well, but a few performed poorly.
In the calculations a distinction will be made between methods used when the data are in
raw form (values as collected) or grouped form (form of a frequency distribution).
mean = $\bar{x} = \frac{1}{n} \sum x$.
$\bar{x}$ is pronounced “x bar”.
Example
The marks of seven students in a mathematics test with a maximum possible mark of 20 are
given below:
15 13 18 16 14 17 12
mean = $\bar{x} = \frac{\sum x}{n} = \frac{15 + 13 + 18 + 16 + 14 + 17 + 12}{7} = 15$.
Median:
The median is the value in the data set which is such that half of the values in the
data set are less than or equal to it and half greater than or equal to it.
For an odd number of values in the data set, the median is the middle value of the
data set when it has been arranged in ascending order. That is, from the smallest
value to the largest value.
If the number of values in the data set is even, then the median is the average of the
two middle values.
Examples
1) The marks of nine students in a geography test that had a maximum possible mark of 50
are given below:
47 35 37 32 38 39 36 34 35
Arrange the data values in order from the lowest value to the highest value:
32 34 35 35 36 37 38 39 47
2) Consider the above data set with the first value (47) omitted.
Arrange the data values in order from the lowest value to the highest value:
32 34 35 35 36 37 38 39
In this case the number of values n = 8 which is an even number. The two middle values in
the data set are in positions $\frac{n}{2} = \frac{8}{2} = 4$ and $\frac{n}{2} + 1 = 5$ i.e. the values 35 and 36.
Median = $\frac{35 + 36}{2} = 35.5$.
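The rule for odd and even numbers of values can be sketched as a small function. The name `median` is chosen for illustration; Python's standard library offers `statistics.median`, which gives the same results.

```python
def median(values):
    """Middle value for odd n, average of the two middle values for even n."""
    ordered = sorted(values)      # arrange in ascending order
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


geography = [47, 35, 37, 32, 38, 39, 36, 34, 35]
median_odd = median(geography)        # all 9 marks
median_even = median(geography[1:])   # first value (47) omitted, 8 marks
print(median_odd, median_even)
```

For the nine marks the middle (5th) ordered value is 36; with 47 omitted the result is 35.5, as calculated above.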
Mode:
The mode of a set of data values is the value(s) that occurs most often.
Example:
Find the mode of the following data set:
48 44 48 45 42 49 48
The mode is 48 since it occurs most often.
Note
1) It is possible for a set of data values to have more than one mode.
2) If there are two data values that occur most frequently, we say that the set of data
values is bimodal e.g. the data set 2 2 4 5 5 6 has two modes (2 and 5).
3) If no value in the data set occurs more than once, it has no mode e.g. the data set 4
5 7 9 has no mode.
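A minimal sketch that follows the conventions in the note above: all most frequent values are returned, and a data set in which no value repeats is treated as having no mode. The function name `modes` is hypothetical, and the data set is assumed to be non-empty.

```python
from collections import Counter


def modes(values):
    """Return the sorted list of modes, or an empty list if no value repeats."""
    counts = Counter(values)
    top = max(counts.values())   # highest frequency in the data set
    if top == 1:
        return []                # no value occurs more than once: no mode
    return sorted(v for v, c in counts.items() if c == top)


print(modes([48, 44, 48, 45, 42, 49, 48]))
print(modes([2, 2, 4, 5, 5, 6]))
print(modes([4, 5, 7, 9]))
```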
1) The mean is used as a measure of central tendency for symmetrical, bell-shaped data
that do not have extreme values (extreme values are called outliers).
2) The median may be more useful than the mean when there are extreme values in the
data set as it is not affected by the extreme values.
3) The mode is useful when the most common item, characteristic or value of a data set is
required.
Examples
1) The amounts (thousands) for which each of 7 properties were sold are shown below.
For this data set mean = $\bar{x}$ = 772.86. This value of the mean is not a central value for
the data set (it is greater than all the values but the largest one). The reason for this
is that the last value (2350) has a considerable influence on the value of the mean.
The median = 555 is a value that is more centrally located than the mean. Unlike the
mean, the median is not influenced by the large last value in the data set.
2) For qualitative (non-numerical) data only the mode can be calculated. For example,
suppose 10 rate payers are asked whether they think the percentage increase in
rates is reasonable. They can either agree (A), disagree (D) or be neutral (N) on the
issue. Their responses are shown below.
A, A, D, N, D, A, D, D, N, N.
For this data set the modal response is D (since D occurs more times than the other
responses). It is not possible to calculate a median or a mean for this data set.
When calculating the mean for raw data, it is usually assumed that all the values in the data
set are equally important. If the values are not all considered equally important, the
weighted mean ($\bar{x}_w$) is calculated according to the formula below.
In the formula x1, x2, . . . , xr are the values and w1, w2, . . . ,wr their respective weights.
Example
The final mark (percentage) in a certain course is based on an assignment mark (which
counts for 10% of the final mark), a test mark (which counts for 30% of the final mark) and
an exam mark (which counts for 60% of the final mark). Calculate the final mark of a student
who gets a 65% assignment mark, a 70% test mark and a 55% exam mark.
Solution:
The above formula is applied with
$x_1 = 65$, $x_2 = 70$, $x_3 = 55$,
$w_1 = 10$, $w_2 = 30$, $w_3 = 60$.
$\bar{x}_w = \frac{65 \times 10 + 70 \times 30 + 55 \times 60}{10 + 30 + 60} = \frac{6050}{100} = 60.5$.
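The calculation above divides the sum of the weighted values by the sum of the weights. A minimal sketch, with the hypothetical function name `weighted_mean`:

```python
def weighted_mean(values, weights):
    """Sum of weight x value divided by the sum of the weights."""
    return sum(w * x for x, w in zip(values, weights)) / sum(weights)


marks = [65, 70, 55]      # assignment, test and exam marks
weights = [10, 30, 60]    # percentage contribution of each mark
final_mark = weighted_mean(marks, weights)
print(final_mark)
```

When all weights are equal the function reduces to the ordinary mean.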
where xmid(i) is the midpoint of the ith class, k the number of classes and n the sample size.
This formula is a special case of the weighted mean formula with $w_i = f_i$ and
$\sum_{i=1}^{k} w_i = n$.
Example
mean = $\frac{2487}{50} = 49.74$.
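The figures 2487 and 50 agree with the temperature distribution of example 2 when each class midpoint is weighted by its class frequency. A minimal sketch of that calculation:

```python
midpoints = [39.5, 43.5, 47.5, 51.5, 55.5, 59.5, 63.5]   # class midpoints
f = [4, 10, 8, 15, 9, 3, 1]                               # class frequencies

n = sum(f)                                                # sample size
sum_fx = sum(fi * xi for fi, xi in zip(f, midpoints))     # sum of f x midpoint
grouped_mean = sum_fx / n
print(sum_fx, n, grouped_mean)
```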
Example
The performance of 2 different stocks is monitored over a period of 8 days. Their values are
shown in the table below.
day 1 2 3 4 5 6 7 8
A 103 120 112 108 130 106 120 112
B 112 97 85 123 153 85 146 110
The dot plot that follows shows the performance of each stock.
The mean values for the two stocks are the same (=113.875), but they differ in variability
(extent of spread around the mean). Stock B has a far wider spread around the mean than
stock A.
The larger (wider) spread in the stock B values is reflected in the larger range (more
than twice that of stock A).
Example
For stock A the standard deviation is calculated as follows.
x (stock A)   x^2
103           10609
120           14400
112           12544
108           11664
130           16900
106           11236
120           14400
112           12544
sum 911       104297
For stock B the standard deviation is 25.682 (check this using STATMODE).
Interpretation: The stock A values differ (on average) from the mean by 8.919, while stock
B values differ (on average) from the mean by almost 3 times this amount.
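The quoted standard deviations can be checked with the sketch below. It assumes the computational form of the sample variance, (sum of x^2 minus (sum of x)^2 / n) divided by (n - 1), which is consistent with the column totals 911 and 104297 in the table for stock A. The function name `sample_std` is hypothetical.

```python
import math


def sample_std(values):
    """Sample standard deviation from the sum and the sum of squares."""
    n = len(values)
    sum_x = sum(values)
    sum_x_sq = sum(v * v for v in values)
    variance = (sum_x_sq - sum_x ** 2 / n) / (n - 1)
    return math.sqrt(variance)


stock_a = [103, 120, 112, 108, 130, 106, 120, 112]
stock_b = [112, 97, 85, 123, 153, 85, 146, 110]

sd_a = sample_std(stock_a)
sd_b = sample_std(stock_b)
range_a = max(stock_a) - min(stock_a)
range_b = max(stock_b) - min(stock_b)
print(round(sd_a, 3), round(sd_b, 3), range_a, range_b)
```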
For grouped data, the raw data formulae for the variance and standard deviation can
be slightly modified.
Example
variance = S² = (125372.5 – 2487² / 50) / 49 = 34.06367
Example:
Since the two standard deviations that were calculated above are in different units, they
cannot be compared directly.
The coefficient of variation calculations show that, in relative terms, the variability of the
expenditure data set is greater than that of the temperature data set.
Example:
Men’s Heights have a bell-shaped distribution with a mean of 69.2 inches and a standard
deviation of 2.9 inches.
Approximately 68% of data values are within 69.2 ± 2.9 = (66.3, 72.1).
Approximately 95% of data values are within 69.2 ± 5.8 = (63.4, 75).
Approximately 99.7% of data values are within 69.2 ± 8.7 = (60.5, 77.9).
2.8.1 Definitions
The ith percentile, Pi, is the value that has i% of the values in a data set less than or equal to it
(0 < i ≤ 100).
Examples
The 9 deciles D1, D2, . . . , D9 are the values that have 10%, 20%, . . . , 90%
respectively of the values in the data set less than or equal to them.
Steps to be followed in calculating the first and third quartiles for raw data
3) Divide the data set into 2 portions of equal numbers of values – set 1 consists of
those values less than or equal to the median and set 2 consists of those values greater than or
equal to the median. When the data set has an odd number of values, the median is
excluded from the division of the data set into 2 portions.
4) The first quartile (Q1) is the median of set 1 and the third quartile (Q3) is the median
of set 2.
Example
The distances from home to work (kilometers) of 11 employees at a certain company are
shown below. Calculate Q1 and Q3.
1) Ordered data set: 6, 7, 15, 36, 39, 40, 41, 42, 43, 47, 49
2) Median = 40. After this step the median is deleted from the data set.
4) Set 2 – 5 values greater than the median i.e. 41, 42, 43, 47, 49.
Example
Suppose the data set consists of the above values and 56 (12 values).
1) Ordered Data Set: 6, 7, 15, 36, 39, 40, 41, 42, 43, 47, 49, 56
2) Median = (40 + 41)/2 = 40.5. Unlike what was done in example 1, no values are deleted
from the data set.
3) Set 1 – 6 values less than or equal to the median i.e. 6, 7, 15, 36, 39, 40
Set 2 – 6 values greater than or equal to the median i.e. 41, 42, 43, 47, 49, 56.
4) Q1 = median of set 1 = (15 + 36)/2 = 25.5, Q3 = median of set 2 = (43 + 47)/2 = 45.
The quartile deviation Q = (Q3 – Q1)/2 can also be used as a measure of variability.
For the data set in example 1, quartile deviation = Q = (43 – 15)/2 = 14.
The quartile deviation value shows the extent to which the values in the data set deviate
from the median. For a skew data set (heavy clustering at lower or upper end of the scale)
the quartile deviation is a more appropriate measure of variability than the standard
deviation (which is more suitable as a measure of variability for symmetric data sets).
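The median-split steps for the quartiles can be sketched in Python as follows. This is a minimal sketch of the method described in these notes (other textbooks and software use slightly different quartile conventions); the function names quartiles and quartile_deviation are illustrative.

```python
from statistics import median


def quartiles(values):
    """Q1 and Q3 by splitting the ordered data at the median.

    When the number of values is odd, the median itself is left out
    of both halves, as in the steps described in the notes.
    """
    data = sorted(values)
    n = len(data)
    half = n // 2
    lower = data[:half]
    upper = data[half + 1:] if n % 2 else data[half:]
    return median(lower), median(upper)


def quartile_deviation(values):
    """Quartile deviation Q = (Q3 - Q1) / 2."""
    q1, q3 = quartiles(values)
    return (q3 - q1) / 2


distances_11 = [6, 7, 15, 36, 39, 40, 41, 42, 43, 47, 49]
distances_12 = distances_11 + [56]

print(quartiles(distances_11), quartile_deviation(distances_11))
print(quartiles(distances_12), quartile_deviation(distances_12))
```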
A formula for calculating the ith percentile Pi for grouped data is shown below.
i = 1, 2, … , 100.
n = sample size
c = class width.
Example
class cumulative
boundaries f frequency
37.5 – 41.5 4 4
41.5 – 45.5 10 14
45.5 – 49.5 8 22
49.5 – 53.5 15 37
53.5 – 57.5 9 46
57.5 – 61.5 3 49
61.5 – 65.5 1 50
Total 50
Median
Step 1: Calculate position of median = i × n / 100 = 50 × 50 / 100 = 25.
Step 2: Median class (class that contains 25th observation) is the class 49.5 – 53.5.
First quartile
Step 2: First quartile class (class that contains 12.5th observation) is the class __________
Q1 =
Third quartile
Step 2: Third quartile class (class that contains 37.5th observation) is the class ___________
Q3 =
Fourth decile
65th Percentile
P65 = 49.5 + (32.5 – 22) × 4 / 15 = 52.3.
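The grouped-data percentile calculation can be sketched in Python. The interpolation rule used here is inferred from the worked P65 value (lower class boundary plus the position's distance into the class, times class width, divided by the class frequency); treat it as an assumption rather than the notes' exact formula. The function name grouped_percentile is illustrative.

```python
def grouped_percentile(i, boundaries, freqs):
    """Interpolated ith percentile for grouped data.

    boundaries: list of (lower, upper) class boundaries in increasing order
    freqs:      class frequencies in the same order
    """
    n = sum(freqs)
    position = i * n / 100          # position of the percentile in the data
    cumulative = 0                  # cumulative frequency before the current class
    for (lower, upper), f in zip(boundaries, freqs):
        if cumulative + f >= position:
            width = upper - lower
            return lower + (position - cumulative) * width / f
        cumulative += f
    raise ValueError("percentile position lies beyond the last class")


boundaries = [(37.5, 41.5), (41.5, 45.5), (45.5, 49.5), (49.5, 53.5),
              (53.5, 57.5), (57.5, 61.5), (61.5, 65.5)]
freqs = [4, 10, 8, 15, 9, 3, 1]

p65 = grouped_percentile(65, boundaries, freqs)
median_grouped = grouped_percentile(50, boundaries, freqs)
print(f"P65 = {p65:.1f}, median = {median_grouped:.1f}")
```

Calling the function with i = 25, 75 or 40 gives a way to check your own answers for the quartiles and the fourth decile.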
Example
The cumulative frequency graph on the following page shows the distribution of marks
scored by a class of 40 students in a test.
type                value(s)
central tendency    median
deviation           quartile deviation = Q = (Q3 – Q1)/2
extremes            minimum and maximum
Example
The IQ’s of 13 people are shown below.
92, 104, 93, 98, 112, 145, 88, 90, 104, 119, 101, 95, 154
minimum = 88
Q1 = 92.5
median = 101
Q3 = 115.5
maximum = 154
Box-and-Whisker plot
Example continued:
IQR = Q3 – Q1 = 23
Q* = Q1 – 1.5×IQR
= 92.5 – (1.5)(23) = 58
None of the values in the data set are smaller than the lower cut-off value so there are no
values that are “too small”.
Q** = Q3 + 1.5×IQR
=115.5 + (1.5)(23)
=150
The only value in the data set that is larger than this is 154. This value (154) is “too big” and
so is an outlier.
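The cut-off calculation for the IQ data can be reproduced with a few lines of Python. This sketch takes Q1 = 92.5 and Q3 = 115.5 as given in the example; the variable names are illustrative.

```python
iq_scores = [92, 104, 93, 98, 112, 145, 88, 90, 104, 119, 101, 95, 154]

q1, q3 = 92.5, 115.5                 # quartiles quoted in the example
iqr = q3 - q1                        # interquartile range
lower_cutoff = q1 - 1.5 * iqr        # values below this are "too small"
upper_cutoff = q3 + 1.5 * iqr        # values above this are "too big"

outliers = [x for x in iq_scores if x < lower_cutoff or x > upper_cutoff]

print(f"IQR = {iqr}, cut-offs = ({lower_cutoff}, {upper_cutoff})")
print(f"outliers: {outliers}")
```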
Example continued:
A Box-and-Whisker plot can also be used to assess the skewness (departure from symmetry)
of a variable.
For positively skewed data most of the values are at the lower end of the scale
(mean > median, “box” section of the plot towards the lower end of the scale).
For negatively skewed data most of the values are at the upper end of the scale
(mean < median, “Box” section of the plot towards the upper end of the scale).
In the previous example the data set is positively skewed.
When several data sets are to be compared, several Box-and-Whisker plots can be plotted
side-by-side.
Example
The Box-and-Whisker plot shown below enables one to compare delays in departing flights
(in minutes) for certain days in December (16th to the 26th).
For all the days the data sets are positively skewed (data sets all have the “box” section
closer to the lower end of the scale with a long upper whisker). This means that most
delays in flight departures were short on all the days. The long upper whiskers that are visible
show that there were some quite late departures on 16, 17, 21, 22, 23, 24 and 25
December.
Chapter 3 – Probability
3.1 Terminology
Probability (Chance)
A probability is the chance that something of interest will happen.
A probability is expressed as a proportion i.e. it ranges from 0 to 1.
Chance can be expressed as a percentage i.e. it ranges from 0 to 100.
Examples
2) The probability of winning the Lotto is 1/13983816.
Random experiment
This is an experiment that gives different outcomes when repeated under similar conditions.
3) The outcome that will occur when the experiment is performed depends on
chance.
Examples
4) Drawing a card from a deck of cards (possible outcomes: 13 hearts, 13 clubs, 13 spades,
13 diamonds).
Set
A set is a collection of outcomes.
Sample space
The sample space is the set of all possible outcomes of a random experiment. A
sample space is usually denoted by the symbol S and the collection of elements
contained in S enclosed in curly brackets { }.
Sample point
A sample point is an individual outcome (element) in a sample space.
Examples
5) Drawing a card from a deck of cards. The elements in the sample space are listed
below.
S = {2♦ 3♦ 4♦ 5♦ 6♦ 7♦ 8♦ 9♦ 10♦ J♦ Q♦ K♦ A♦
2♥ 3♥ 4♥ 5♥ 6♥ 7♥ 8♥ 9♥ 10♥ J♥ Q♥ K♥ A♥
2♣ 3♣ 4♣ 5♣ 6♣ 7♣ 8♣ 9♣ 10♣ J♣ Q♣ K♣ A♣
2♠ 3♠ 4♠ 5♠ 6♠ 7♠ 8♠ 9♠ 10♠ J♠ Q♠ K♠ A♠ }
Event
An event is a subset of a sample space i.e. a collection of sample points taken from a sample
space.
Impossible event
An impossible event is an event that cannot happen (has probability zero).
Certain event
A certain event is an event that is sure to happen (has probability 1).
Simple events are events that involve only one sample point (outcome) of the sample space.
Examples
1) Let E denote the event “an odd number is obtained when tossing a single die”.
Then E = {1, 3, 5}.
2) Let H denote the event “at least one head appears when tossing two coins”.
H = {hh, ht, th}.
3) Let B denote the event “obtaining a club and a heart in a single draw from a deck of
cards”. The event B is impossible. The set of outcomes of B is an empty set denoted by
B = { } = ∅.
4) Let A denote the event “obtaining a 1, 2, 3, 4, 5 or 6 when tossing a single die”. The
event A is a certain event i.e. one of the outcomes belonging to the set describing the
event must happen. This is denoted by A = S, where S is the sample space.
Venn diagrams
A Venn diagram is a drawing, in which circular areas represent groups of items
usually sharing common properties.
The drawing consists of two or more circles, each representing a specific group or
set, contained within a square that represents the sample space. Venn diagrams are
often used as a visual display when referring to sample spaces, events and
operations involving events.
Complementary events
The complementary event Ā (sometimes written A′) of an event A consists of all the outcomes in S
that are not in A.
Examples
1) Consider the experiment of tossing a single die. S = {1, 2, 3, 4, 5, 6}. The complement
of the event A = “obtaining a 3 or less” = {1, 2, 3} is
Ā = “obtaining a 4 or more” = {4, 5, 6}.
2) Consider the experiment of tossing two coins. S = {hh, ht, th, tt}. The complement of
the event H = “at least one head” = {hh, ht, th} is H̄ = “no heads” = {tt}.
The union of two events A and B, denoted by A ∪ B, is the set of outcomes that are
in A or in B or in both A and B i.e. the event that
“either A or B or both A and B occur”
or “at least one of A or B occurs”.
These definitions involving two events can be extended to ones involving 3 or more events
e.g. for the 3 events A1, A2 and A3 the event A1 ∪ A2 ∪ A3 is the event “at least one of A1, A2
or A3 occurs” and A1 ∩ A2 ∩ A3 the event “A1 and A2 and A3 occur”.
Examples
A ∪ B = {1, 2, 3, 5, 6, 7, 8, 9}, A ∩ B = {3, 7},
A ∩ B̄ = {2, 5, 9}, Ā ∩ B = {1, 6, 8}.
2) Let C be the event “drawing a face card from a deck of cards” and A the event “drawing
a king or an ace from a deck of cards”.
Examples
1) Let B be the event “drawing a black card from a deck of cards” and R the event “drawing
a red card from a deck of cards”.
The events B and R have no outcomes in common i.e. B ∩ R = ∅ (empty set). Hence B
and R are mutually exclusive.
2) Let E be the event “an even number with a single throw of a die” and O the event “an
odd number with a single throw of a die” i.e. E = {2, 4, 6} and O = {1, 3, 5}.
P(A) = N(A) / N(S) = m / n,
where N(A) = m is the number of outcomes favourable to the event A and N(S) = n
the number of outcomes in the sample space S i.e. the total number of outcomes.
Examples
Solution:
2) Two dice are rolled. Find the probability that a sum of 7 will occur.
Solution:
The number of sample points in S is 36 (see example 3 under sample space).
The classical definition of probability requires the assumption that all the outcomes in the
sample space are equally likely. If this assumption is not met, this formula cannot be used.
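When the outcomes are equally likely, the classical formula amounts to counting. The following Python sketch enumerates the 36 outcomes for two dice and counts those with a sum of 7; the variable names are illustrative.

```python
from fractions import Fraction
from itertools import product

# All ordered pairs (first die, second die): 36 equally likely outcomes.
sample_space = list(product(range(1, 7), repeat=2))

# Outcomes favourable to the event "the sum is 7".
event = [outcome for outcome in sample_space if sum(outcome) == 7]

p_sum_7 = Fraction(len(event), len(sample_space))
print(f"N(A) = {len(event)}, N(S) = {len(sample_space)}, P(A) = {p_sum_7}")
```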
Example
The possible temperatures (degrees Celsius) in Durban on a particular day in December are
15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39.
In December Durban is hot so, for example, 15 degrees is less likely than 30 degrees
i.e. P (temperature = 15) = 1 ÷ 25 = 0.04 does not seem reasonable.
P(A) = f / n.
Note: This formula differs from the classical formula in the sense that the classical
formula uses all the outcomes in the sample space as the total number of outcomes,
while the relative frequency formula uses the number of repetitions (n) of the
experiment as total number of outcomes. In the classical formula the number of
outcomes in the sample space is fixed, while the number of repetitions of an
experiment (n) can vary. It can be shown that the empirical probability is a good
approximation of the true probability when n is sufficiently large.
Examples
1) A bent coin is tossed 1000 times with heads coming up 692 times.
An estimate of P(h) is 692/1000 = 0.692.
Mark f
less than 30 6
30 – 39 26
40 – 49 45
50 – 59 64
60 – 69 82
70 – 79 37
80 – 89 22
90 – 99 8
Total 290
From the table (using the empirical formula) the following probabilities can be
estimated.
(a) P(mark less than 40) = (26 + 6)/290 = 0.110.
(b) P(pass) = (64 + 82 + 37 + 22 + 8)/290 = 1 – (6 + 26 + 45)/290 = 213/290 = 0.73.
(c) P(above 80) = (22 + 8)/290 = 0.103.
Example
The preference probabilities according to gender for 2 different brands of a certain product
are summarized in the table on the following page.
The gender marginal probabilities are obtained by summing the joint probabilities over the
brands. The brand marginal probabilities are obtained by summing the joint probabilities
over the genders.
                         Brand
                       1       2      Marginal probability
Gender   Male          0.2     0.32   0.52
         Female        0.4     0.08   0.48
Marginal probability   0.6     0.4    1
This rule can be extended to any finite number of experiments. If one experiment can be
done in n1 ways, a second one in n2 ways, . . . , a kth one in nk ways, then one of the k
experiments can be done in n1 + n2 +. . . + nk ways.
Example:
Suppose a man is standing in a room which has 2 doors to his left and 1 door to his
right. In how many ways can he leave the room?
Solution:
Let “leave the room by going to the left” be experiment 1 and “leave the room by
going to the right” be experiment 2. There are n=2 ways to do experiment 1 (he can
leave by door A or door B) and there is m=1 way to do experiment 2 (he can leave by
door C). In total there are n+m = 2+1 = 3 ways to leave the room.
This rule can be extended to any finite number of experiments. If one experiment can be
done in n1 ways, a second one in n2 ways, . . . , a kth one in nk ways, then the k
experiments together can be done in n1×n2×…×nk ways.
Example 1:
A basic meal consists of soup, a sandwich and a beverage. If a person having this
meal has 3 choices of soup, 4 choices of sandwiches and a choice of coffee or tea as
a beverage, how many such meals are possible?
Example 2:
A PIN to be used at an ATM can be formed by selecting 4 digits from the digits
0, 1, 2, . . . , 9 . How many choices of PIN are there if
Factorial notation
In how many ways can n (n – integer) objects be arranged in a row?
Note: 1! = 1, 0! = 1.
Examples
1) In how many ways can 7 people be placed in a queue at a bus stop?
Permutation
A permutation is an arrangement of a group of items where
order matters.
The number of permutations of n objects taken r at a time is calculated from
nPr = P(n, r) = n! / (n – r)!.
Combination
A combination is a selection of a group of items where order
does not matter.
The number of combinations of a group of n objects taken r at a time is calculated
from
nCr = C(n, r) = (n choose r) = n! / [(n – r)! r!].
Examples:
1) Four people (A, B, C, D) serve on a board of directors. A chairman and vice-chairman are
to be chosen from these 4 people. In how many ways can this be done?
Chairman Vice-chairman
A B
B A
A C
C A
A D
D A
B C
C B
B D
D B
C D
D C
2) Four people (A, B, C, D) serve on a board of directors. Two people are to be chosen from
them as members of a committee that will investigate fraud allegations. In how many
ways can this be done?
Number of ways = 6.
In both these examples a choice of 2 people from 4 people is made. However, in example 1
the order of choice of the 2 people matters (since the one person chosen is chairman and
the other one vice-chairman). In example 2 the order does not matter. The only interest is in
who serves on the committee.
Application of formulae.
In question 1 the permutations formula applies with n = 4, r = 2.
Number of ways = P(4, 2) = 4! / (4 – 2)! = 12.
In question 2 the combinations formula applies with n = 4, r = 2.
Number of ways = C(4, 2) = 4! / [2! (4 – 2)!] = 6.
3) Find the number of ways to take 4 people and place them in groups of 3 at a time where
order does not matter.
Solution:
Since order does not matter, use the combination formula.
C(4,3) = 4! / [3! (4 – 3)!] = 24/6 = 4.
4) Find the number of ways to arrange 6 items in groups of 4 at a time where order matters.
Solution: P(6,4) = 6! / (6 – 4)! = 720 / 2! = 360
There are 360 ways to arrange 6 items taken 4 at a time when order matters.
5) Find the number of ways to take 20 objects and arrange them in groups of 5 at a time
where order does not matter.
Solution: C(20,5) = 20! / [5! (20 – 5)!] = (20 × 19 × 18 × 17 × 16) / (1 × 2 × 3 × 4 × 5) = 15504
There are 15 504 ways to arrange 20 objects taken 5 at a time when order does not
matter.
6) Determine the total number of five-card hands that can be drawn from a deck of 52
cards.
Solution:
When a hand of cards is dealt, the order of the cards does not matter. Thus the
combinations formula is used.
There are 52 cards in a deck and we want to know in how many different ways we can
draw them in groups of five at a time when order does not matter. Using the
combination formula gives
C(52,5) = 2 598 960.
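The counts in examples 1 to 6 can be verified with Python's standard library, where math.perm(n, r) gives P(n, r) and math.comb(n, r) gives C(n, r) (both available from Python 3.8). The dictionary name counts and its labels are illustrative.

```python
from math import comb, factorial, perm

counts = {
    "chairman and vice-chairman from 4 people, P(4,2)": perm(4, 2),
    "committee of 2 from 4 people, C(4,2)": comb(4, 2),
    "groups of 3 from 4 people, C(4,3)": comb(4, 3),
    "arrangements of 4 from 6 items, P(6,4)": perm(6, 4),
    "selections of 5 from 20 objects, C(20,5)": comb(20, 5),
    "five-card hands from 52 cards, C(52,5)": comb(52, 5),
}

for description, value in counts.items():
    print(f"{description} = {value}")

# The formulas agree with the library functions, e.g. for P(6,4):
assert perm(6, 4) == factorial(6) // factorial(6 - 4)
```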
7) There are five women and six men in a group. From this group a committee of 4 is to be
chosen. In how many ways can the committee be formed if the committee is to have at least
3 women in it?
Solution:
8) In how many ways can a phone number consisting of 5 digits be chosen from the digits
1, 2, 3, . . . , 9 if no digits are to be repeated?
Solution:
9) In how many ways can the 6 winning numbers in a Lotto draw be selected?
Solution:
10) In how many ways can a five-card hand consisting of three eights and two sevens be dealt?
Solution:
11) How many different 5-card hands include 4 of a kind and one other card?
Solution:
We have 13 different ways to choose 4 of a kind: 2's, 3's, 4's, … Queens, Kings and
Aces.
Once a set of 4 of a kind has been removed from the deck, 48 cards are left.
The possible situations that will satisfy the above requirement are:
Complementary events
For any event A defined on some sample space,
P(Ā) = 1 – P(A).
These formulae can be extended to probabilities involving more than two events
e.g. for 3 events A, B and C defined on some sample space
This formula can easily be verified with the aid of the Venn diagram shown below.
From the above diagram the following sets can be written down.
De Morgan’s Laws
(1) P(Ā ∩ B̄) = P(complement of A ∪ B)
(2) P(Ā ∪ B̄) = P(complement of A ∩ B)
P(A) = P(A ∩ B) + P(A ∩ B̄)
P(B) = P(A ∩ B) + P(Ā ∩ B)
These formulae can be verified from the Venn diagram shown on the following page.
The formulae can be extended to probabilities involving more than two events.
Examples
1) There are two telephone lines – A and B. Line A is engaged 50% of the time and line B is
engaged 60% of the time. Both lines are engaged 30% of the time. Calculate the
probability that
Solution:
Let E1 denote the event “line A is engaged” and E2 the event “line B is engaged”.
(a) P(at least one of the lines is engaged) = P(E1 ∪ E2)
= P(E1) + P(E2) – P(E1 ∩ E2)
= 0.5 + 0.6 – 0.3
= 0.8
(b) P(none of the lines is engaged) = 1 – P(at least one of the lines is engaged)
= 1 – 0.8
= 0.2
(d) The event “line A is engaged, but line B is not engaged” can be written in symbols as E1 ∩ Ē2.
(e) P(only one line is engaged) = P(line A is engaged, but line B is not engaged)
+ P(line B is engaged, but line A is not engaged)
= P(E1 ∩ Ē2) + P(Ē1 ∩ E2)
P(Ē1 ∩ E2) = P(E2) – P(E1 ∩ E2) = 0.6 – 0.3 = 0.3. (Using the total probability
formula)
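The telephone-line calculations can be checked with exact fractions in Python. This is a small sketch using the addition, complement and total probability formulae; the variable names are illustrative.

```python
from fractions import Fraction

p_e1 = Fraction(50, 100)     # line A engaged
p_e2 = Fraction(60, 100)     # line B engaged
p_both = Fraction(30, 100)   # both lines engaged

p_at_least_one = p_e1 + p_e2 - p_both      # addition formula
p_none = 1 - p_at_least_one                # complementary event
p_only_a = p_e1 - p_both                   # A engaged, B not engaged
p_only_b = p_e2 - p_both                   # B engaged, A not engaged
p_only_one = p_only_a + p_only_b

print(float(p_at_least_one), float(p_none), float(p_only_a),
      float(p_only_b), float(p_only_one))
```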
2) Let O be the event that a certain lecturer will be in his/her office on a particular
afternoon and L the event that he/she will be at a lecture. Suppose P(O) = 0.48 and P(L)
= 0.27.
Solution:
(b) P( O L ) =
3) A batch of 20 computers contains 3 that are faulty. Four (4) computers are selected at
random without replacement from this batch. Calculate the probability that
Solution:
There are C(20,4) = 4845 [why not P(20,4) ?] ways of selecting the 4 computers from the
batch of 20. Since random selection is used, all 4845 selections are equally likely. Let A
denote the event “none of the 4 computers selected is faulty” and B the event “at least
2 of the computers selected are faulty”.
(a) P(A) =
(b) P(B) =
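One way to check your answers for (a) and (b) is to count selections with math.comb. This sketch assumes the batch splits into 17 non-faulty and 3 faulty computers, as stated in the question; the variable names are illustrative.

```python
from fractions import Fraction
from math import comb

total = comb(20, 4)   # all equally likely selections of 4 from 20

# (a) all 4 selected computers come from the 17 non-faulty ones
p_a = Fraction(comb(17, 4), total)

# (b) at least 2 faulty: exactly 2 faulty or exactly 3 faulty
ways_b = comb(3, 2) * comb(17, 2) + comb(3, 3) * comb(17, 1)
p_b = Fraction(ways_b, total)

print(f"total selections = {total}")
print(f"P(A) = {p_a} = {float(p_a):.4f}")
print(f"P(B) = {p_b} = {float(p_b):.4f}")
```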
The Conditional probability of an event A occurring given that another event B has occurred
is given by
P(A | B) = P(A ∩ B) / P(B), where P(B) > 0.
Also P(B|A) = P(A ∩ B) / P(A), where P(A) > 0.
Example 1
Five hundred (500) TV viewers consisting of 300 males and 200 females were asked whether
they were satisfied with the news coverage on a certain TV channel. Their replies are
summarized in the table below.
                         Answer
                  Satisfied   Not Satisfied   Total
Gender   Male        180          120          300
         Female       90          110          200
         Total       270          230          500
P(satisfied | male) = 180/300 = 0.6.
P(satisfied | female) = 90/200 = 0.45.
P(satisfied) = 270/500 = 0.54 and P(not satisfied) =
Note
2) The probability of a person being satisfied depends on the gender of the person being
interviewed. In this case females are less satisfied than males with the news coverage.
Example 2
At a certain university the probability of passing accounting is 0.68, the probability of
passing statistics 0.65 and the probability of passing both statistics and accounting is 0.57.
Calculate the probability that a student
(c) passes statistics when it is known that he/she did not pass accounting.
Solution:
(a) P(B|A) = P(A ∩ B) / P(A) = 0.57 / 0.68 = 0.838.
(b) P(A|B) = P(A ∩ B) / P(B) = 0.57 / 0.65 = 0.877.
(c) P(B | Ā) =
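The three conditional probabilities can be checked with a short Python sketch. It assumes A denotes “passes accounting” and B “passes statistics”, which is the labelling implied by part (c); for (c) it uses P(Ā ∩ B) = P(B) – P(A ∩ B) and P(Ā) = 1 – P(A).

```python
from fractions import Fraction

p_a = Fraction(68, 100)    # passes accounting
p_b = Fraction(65, 100)    # passes statistics
p_ab = Fraction(57, 100)   # passes both

p_b_given_a = p_ab / p_a                       # (a)
p_a_given_b = p_ab / p_b                       # (b)
p_b_given_not_a = (p_b - p_ab) / (1 - p_a)     # (c)

print(f"(a) {float(p_b_given_a):.3f}")
print(f"(b) {float(p_a_given_b):.3f}")
print(f"(c) {float(p_b_given_not_a):.3f}")
```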
Examples
1) A box has 12 bulbs, 3 of which are defective. If two bulbs are selected at random
without replacement, then what is the probability that both are defective?
Solution:
Let d1 denote the event “the first bulb is defective” and d2 the event “the second bulb is
defective”.
Then P(d1) = 3/12 and
P(d2|d1) = 2/11.
Using the above mentioned multiplication formula,
P(d2 ∩ d1) = P(d1) P(d2|d1) = (3/12) × (2/11) = 0.045.
2) Two cards are drawn at random from a deck of playing cards. What is the
probability that both these cards are aces?
Solution:
Since there are 4 aces in a deck of 52 cards, the probability of drawing one ace is 4/52.
Having removed one ace and not replacing it reduces the probabilities of drawing
another ace on the second draw. The 51 cards remaining contain 3 aces and therefore
the probability of drawing an ace on the second draw is 3/51. We can multiply these
probabilities and determine the probability of drawing two aces.
3) Three cards are drawn at random from a deck of playing cards. What is the
probability that all 3 of these cards are aces?
Independent events
Two events A and B are said to be independent if P(A|B) = P(A) or P(B|A) = P(B).
This means that the occurrence of B does not affect the probability that A occurs.
Substitution of the above result into the multiplication formula for two probabilities gives
P(A ∩ B) = P(A) P(B) if A and B are independent.
Examples
1) The probability that person A will be alive in 20 years is 0.7 and the probability that
person B will be alive in 20 years is 0.5, while the probability that they will both be alive
in 20 years is 0.45. Are the events E1 “A is alive in 20 years” and E2 “B is alive in 20 years”
independent?
Solution:
Since P(E1) P(E2) = 0.7 × 0.5 = 0.35 ≠ 0.45 = P(E1 ∩ E2), the events E1 and E2 are not
independent.
Since P(1st coin is heads) × P(2nd coin is heads) = ½ × ½ = ¼ = P(both tosses heads),
the events “heads on the first toss” and “heads on the second toss” are independent.
The multiplication rule for independent events can be extended to involve more
than 2 events. In general, if the events A1, A2, . . . , An are independent then
P(A1 ∩ A2 ∩ . . . ∩ An) = P(A1) P(A2) . . . P(An).
Examples
1) A coin is tossed and a single 6 sided die is rolled. Find the probability of “heads” and
rolling a 3 with the die.
P(head) = ½ and P(3) = 1/6.
Since the results of the coin and the die are independent,
P(heads and 3) = P(heads) P(3) = (1/2) × (1/6) = 1/12
2) A school survey found that 9 out of 10 students like pizza. If three students are
chosen at random with replacement, what is the probability that all three students
like pizza?
Solution
P(student 1 likes pizza) = 9/10 = P(student 2 likes pizza) = P(student 3 likes pizza).
P(student 1 likes pizza and student 2 likes pizza and student 3 likes pizza)
= P(student 1 likes pizza) × P(student 2 likes pizza) × P(student 3 likes pizza)
= (9/10)³ = 0.729.
3) It is known that 8% of all cars of a certain make that are sold encounter engine
overheating problems within 50 000 kilometers of travel. During the past week 4
such cars were sold. Assuming that engine overheating problems for the 4 cars are
encountered independently, what is the probability that
(a) all 4
(b) none
(c) at least one of these cars sold
encounter engine overheating problems within 50 000 kilometers of travel ?
Solution:
Let A denote the event “overheating problems within 50 000 kilometers of travel”.
So
P(none) =
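A way to check your answers for (a) to (c) is the multiplication rule for independent events, sketched below in Python. It assumes each of the 4 cars independently has probability 0.08 of overheating, as the question states; the variable names are illustrative.

```python
p_overheat = 0.08   # P(A) for a single car
n_cars = 4

p_all = p_overheat ** n_cars             # (a) all 4 cars overheat
p_none = (1 - p_overheat) ** n_cars      # (b) no car overheats
p_at_least_one = 1 - p_none              # (c) complement of "none"

print(f"(a) {p_all:.8f}")
print(f"(b) {p_none:.4f}")
print(f"(c) {p_at_least_one:.4f}")
```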
Bayes’ theorem
In order to apply the conditional probability formula P(A|B) = P(A ∩ B) / P(B),
values for P(A ∩ B) and P(B) are needed.
Suppose that only the values for P(A), P(B|A) and P(B|Ā) are available.
In this case the probabilities [P(A ∩ B) and P(B)] required for calculating P(A|B) can be
calculated from
P(A ∩ B) = P(A) P(B|A)
and
P(B) = P(A ∩ B) + P(Ā ∩ B) = P(A) P(B|A) + P(Ā) P(B|Ā).
Substituting these probabilities into the first conditional probability formula gives
P(A|B) = P(A) P(B|A) / [P(A) P(B|A) + P(Ā) P(B|Ā)].
This result is known as Bayes’ theorem (named after the person who proposed the
method).
Example 1
When testing a person for a certain disease, the test can show either a positive result (the
person has the disease) or a negative result (the person does not have the disease).
When a person actually has the disease, the test shows positive 99% of the time. When the
person actually does not have the disease the test shows negative 95% of the time. Suppose
it is known that only 0.1% of the people in the population have the disease.
a) If a test turns out to be positive, what is the probability that the person has the
disease?
b) If the test turns out to be negative, what is the probability that the person does not
have the disease?
Solution:
Denominator:
P(B) = P(A ∩ B) + P(Ā ∩ B)
= P(A) P(B|A) + P(Ā) P(B|Ā)
= (0.001 × 0.99) + (0.999 × 0.05)
= 0.00099 + 0.04995
= 0.05094
P(A|B) = P(A ∩ B) / P(B) = 0.00099 / 0.05094 = 0.0194.
(b) P(Ā|B̄) = P(Ā ∩ B̄) / P(B̄) = P(Ā) P(B̄|Ā) / [1 – P(B)] = (0.999 × 0.95) / 0.94906 = 0.9999895.
From the above it can be seen that a negative result of the test is very reliable (it will be
wrong only 105 times in 10 million cases). On the other hand, the chance that a person
has the disease when the result of the test shows positive is only 194 in 10 000.
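The two-event form of Bayes' theorem used in this example can be sketched as a small Python function. Here A is taken to be “has the disease” and B “test is positive”, which matches the numbers in the solution; the function name bayes_posterior and its parameter names are illustrative.

```python
def bayes_posterior(prior, likelihood, likelihood_given_complement):
    """P(A|B) from P(A), P(B|A) and P(B|not A)."""
    numerator = prior * likelihood
    evidence = numerator + (1 - prior) * likelihood_given_complement
    return numerator / evidence


# (a) P(disease | positive): prior 0.001, P(+ | disease) 0.99, P(+ | no disease) 0.05
p_disease_given_positive = bayes_posterior(0.001, 0.99, 0.05)

# (b) P(no disease | negative): prior 0.999, P(- | no disease) 0.95, P(- | disease) 0.01
p_healthy_given_negative = bayes_posterior(0.999, 0.95, 0.01)

print(f"(a) {p_disease_given_positive:.4f}")
print(f"(b) {p_healthy_given_negative:.7f}")
```

For part (b) the roles of the events are swapped: the “prior” is the probability of not having the disease and the two likelihoods are those of a negative result.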
Suppose A1, A2, …, An are mutually exclusive events whose union is the sample space
S and P(Ai) > 0 for each i. Then, for any event B with P(B) > 0, and any k = 1, 2, …, n,
P(Ak|B) = P(Ak) P(B|Ak) / [P(A1) P(B|A1) + P(A2) P(B|A2) + … + P(An) P(B|An)].
Example 2
Suppose that Bob can decide to go to work by one of three modes of transportation – car,
bus, or commuter train. Because of high traffic, if he decides to go by car, there is a 50%
chance he will be late. If he goes by bus, which has special reserved lanes but is sometimes
overcrowded, the probability of being late is only 20%. The commuter train is more
expensive than the other modes of transport but is late only 1% of the time.
a) Suppose that Bob is late one day and his boss wishes to estimate the probability that
he drove to work that day by car. Since he does not know which mode of
transportation Bob usually uses, he assumes that each mode is equally likely to be
used. What is the boss’ estimate of the probability that Bob drove to work by car?
b) Suppose that a co-worker of Bob’s knows that Bob drives to work by car 10% of the
time, he almost always takes the commuter train to work, and he never takes the bus.
Given that Bob is late to work today, the co-worker believes there is a ____% chance
that Bob came to work by train.
Solution
There are two events of interest –being late and choice of transport. There are 3 options for
the choice of transport.
Let
L = is late to work
B = takes bus
C = takes car
T = takes train
Solution (a)
Denominator:
Solution (b)
Try for yourself
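Once you have attempted both parts, the following Python sketch offers one way to check your answers using the general form of Bayes' theorem. In part (b) the prior for the train is taken as 0.9, which follows from “car 10% of the time” and “never takes the bus”; the function name posterior and the dictionary keys are illustrative.

```python
def posterior(priors, likelihoods):
    """Posterior probability of each mode of transport given that Bob is late."""
    joint = {mode: priors[mode] * likelihoods[mode] for mode in priors}
    total = sum(joint.values())          # P(L), the denominator
    return {mode: value / total for mode, value in joint.items()}


# P(L | mode) from the problem statement
p_late = {"car": 0.5, "bus": 0.2, "train": 0.01}

# (a) the boss assumes all three modes are equally likely
boss = posterior({"car": 1 / 3, "bus": 1 / 3, "train": 1 / 3}, p_late)

# (b) the co-worker's knowledge: car 10%, bus never, train the rest
coworker = posterior({"car": 0.1, "bus": 0.0, "train": 0.9}, p_late)

print(f"(a) P(C | L) = {boss['car']:.4f}")
print(f"(b) P(T | L) = {coworker['train']:.4f}")
```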
a/b = p/(1 – p).
From the above it can be shown that p = a/(a + b) and 1 – p = b/(a + b).
b/a = (1 – p)/p.
Examples
a) A pair of balanced dice is tossed. What are the odds in favour of the sum of the numbers
showing a 6?
Total number of outcomes = 6 × 6 = 36.
Possible ways of getting a sum of 6: (1, 5), (2, 4), (3, 3), (4, 2), (5, 1).
Number of ways of getting a sum of 6 is 5.
c) The table below shows data that were collected from 781 middle aged female patients at a
certain hospital.
no 90 346 436
(i) For smokers the odds in favour of heart problems are 172 to 173 or 1 to 1.0058.
(ii) For non-smokers the odds in favour of heart problems are 90 to 346 or 1 to 3.8444.
From this it can be seen that smokers are much more at risk for heart problems than non-
smokers.
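The relationship between a probability p and odds of a to b (a/b = p/(1 – p), p = a/(a + b)) can be sketched in Python with exact fractions. The function names are illustrative, and the odds are returned in lowest terms, so 90 to 346 appears as 45 to 173.

```python
from fractions import Fraction


def odds_in_favour(p):
    """Return (a, b) in lowest terms such that the odds in favour are a to b."""
    p = Fraction(p)
    ratio = p / (1 - p)
    return ratio.numerator, ratio.denominator


def probability_from_odds(a, b):
    """Return p = a / (a + b) for odds of a to b in favour."""
    return Fraction(a, a + b)


dice_odds = odds_in_favour(Fraction(5, 36))             # sum of 6 with two dice
smoker_odds = odds_in_favour(Fraction(172, 172 + 173))  # smokers with heart problems
nonsmoker_odds = odds_in_favour(Fraction(90, 436))      # non-smokers with heart problems

print(dice_odds, smoker_odds, nonsmoker_odds)
print(f"non-smokers: 1 to {nonsmoker_odds[1] / nonsmoker_odds[0]:.4f}")
```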
Chapter 4 – Probability distributions of discrete random variables
Examples:
Examples:
1) The variables T and X from the above examples are discrete random variables.
2) The variables H and V from the above examples are continuous random variables.
Examples:
1) As above, let T be the random variable that represents the number of tails obtained
when a coin is flipped three times. Then T has 4 possible values 0, 1, 2, and 3. The
outcomes of the experiment and the values of T are summarized in the next table.
Outcomes T
hhh 0
hht, hth, thh 1
tth, tht, htt 2
ttt 3
Assuming that the outcomes are all equally likely, the probability distribution for T is
given in the following table.
t 0 1 2 3 Total
p(t) 1/8 3/8 3/8 1/8 1
2) Let Y denote the number of tosses of a coin until heads appear first. Then
y 1 2 3 . . . Total
p(y) ½ (½)² (½)³ . . . 1
3) A pair of dice is tossed. Let X denote the sum of the digits. The probability
distribution of X can be found from the following table. The entry in any particular
cell is the sum of the row and column values.
                    1st die
              1   2   3   4   5   6
          1   2   3   4   5   6   7
          2   3   4   5   6   7   8
2nd die   3   4   5   6   7   8   9
          4   5   6   7   8   9  10
          5   6   7   8   9  10  11
          6   7   8   9  10  11  12
x 2 3 4 5 6 7 8 9 10 11 12
P(X=x) 1/36 2/36 3/36 4/36 5/36 6/36 5/36 4/36 3/36 2/36 1/36
Note:
For any discrete random variable X, the probabilities of the values that it can assume are such that
0 ≤ P(x) ≤ 1 and Σ P(x) = 1, where the sum is taken over all possible values x.
Examples
1) For the probability mass function in example 1 the cumulative distribution function is
x 0 1 2 3
F(x) 1/8 ½ 7/8 1
2) For the probability mass function in example 3 the cumulative distribution function is
x 2 3 4 5 6 7 8 9 10 11 12
F(x) 1/36 3/36 6/36 10/36 15/36 21/36 26/36 30/36 33/36 35/36 1
3) Consider a discrete random variable with probability mass function given below.
x 1 2 3 4
P(X=x) 0.1 0.3 0.4 0.2
The graphs on the previous page are plots of the probability mass function (graph on the
right) and cumulative distribution function (graph on the left).
A random variable can only take on one value at a time i.e. the events X = x1 and X = x2 for
x1 ≠ x2 are mutually exclusive. The probability of the variable taking on any number of
different values can be found by simply adding the appropriate probabilities.
Examples
1) Find the probability of getting 2 or more tails when a coin is flipped 3 times.
2) Find the probability of getting at least one tail when a coin is flipped 3 times.
Or
3) Find the probability of needing at most 3 tosses of a coin to get the first heads.
The mean or expected value of a random variable X is the average value that we would
expect for X when performing the random experiment many times.
E(X) = μ = Σ x p(x).
Examples
Thus if 3 coins are flipped a large number of times, we should expect the average
number of tails (per 3 flips) to be about 1.5. Since the number of tails is an integer
value, it will never actually assume the mean value of 1.5. This mean value more
reflects the fact that the extreme values (0 and 3) occur the same proportion of
times (an eighth) and the middle values occur the same proportion of times (three
eighths).
2) The score S obtained in a certain quiz is a random variable with probability distribution
given below.
s 0 1 2 3 4 5
p(s) 0.12 0.04 0.16 0.32 0.24 0.12
s 0 1 2 3 4 5 sum
p(s) 0.12 0.04 0.16 0.32 0.24 0.12 1
s × p(s) 0 0.04 0.32 0.96 0.96 0.60 2.88
μ = E(S) = 2.88
Variance
For a random variable X, the variance, denoted by σ², can be calculated by using the
formula
σ² = Σ x² p(x) – μ².
The standard deviation of X, denoted by σ, is just the positive square root of σ². This is a
measure of the extent to which the values are spread around the mean.
The calculation of the standard deviation for a random variable is similar to that of the
calculation of the standard deviation for grouped data.
Example
t 0 1 2 3 sum
p(t) 1/8 3/8 3/8 1/8 1
t × p(t) 0 3/8 6/8 3/8 1.5
t² × p(t) 0 3/8 12/8 9/8 3
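The column totals in the table can be turned into the mean and variance with a short Python sketch. It assumes the usual computational form of the variance, the sum of x² p(x) minus μ²; the function name mean_and_variance is illustrative.

```python
from fractions import Fraction


def mean_and_variance(pmf):
    """Mean and variance of a discrete random variable given as {value: probability}."""
    mu = sum(x * p for x, p in pmf.items())
    second_moment = sum(x * x * p for x, p in pmf.items())
    return mu, second_moment - mu ** 2


# Number of tails T in three flips of a coin
tails_pmf = {0: Fraction(1, 8), 1: Fraction(3, 8), 2: Fraction(3, 8), 3: Fraction(1, 8)}
mu_t, var_t = mean_and_variance(tails_pmf)

# Quiz score S
score_pmf = {0: Fraction(12, 100), 1: Fraction(4, 100), 2: Fraction(16, 100),
             3: Fraction(32, 100), 4: Fraction(24, 100), 5: Fraction(12, 100)}
mu_s, var_s = mean_and_variance(score_pmf)

print(f"T: mean = {float(mu_t)}, variance = {float(var_t)}, sd = {float(var_t) ** 0.5:.3f}")
print(f"S: mean = {float(mu_s)}")
```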
Bernoulli trial:
Consider an experiment in which there are two complementary outcomes. One
outcome is labelled “success” (s) and the other is labelled “failure” (f). Such an
experiment is called a Bernoulli trial.
We denote the probability of success as P(s) = p and the probability of failure as
P(f) = 1 – p = q.
Notation:
A shorthand way of referring to a binomially distributed random variable X, based
on n trials with probability of success p, is X ~ B(n,p) or X ~ Bin(n,p).
Examples:
1) Consider the experiment of flipping a coin 5 times. If we let the event of getting “tails”
on a flip be labeled “success” and “heads” failure, and if the random variable T
represents the number of tails obtained, then T will be binomially distributed with n = 5,
p = ½ and q = ½.
3) Fourteen percent of flights from a certain airport are delayed. If 20 flights are chosen at
random, then we can consider each flight to be an independent Bernoulli trial. If we
define a successful trial to be one where a flight takes off on time, then the random
variable Z representing the number of on-time flights will be binomially distributed with
n = 20, p = 0.86 and q = 0.14.
Tree diagram
The possible outcomes in a binomial experiment can be written down
from a diagram such as the one below. This diagram, called a tree diagram, enables
one to write down all the outcomes when the trial is performed 3 times.
                      ┌─ s
               ┌─ s ──┤
               │      └─ f
        ┌─ s ──┤
        │      │      ┌─ s
        │      └─ f ──┤
        │             └─ f
start ──┤
        │             ┌─ s
        │      ┌─ s ──┤
        │      │      └─ f
        └─ f ──┤
               │      ┌─ s
               └─ f ──┤
                      └─ f
The following outcomes and their respective number of successes (x) can be written down
from the above tree diagram.
Outcomes         x
fff              0
ffs, fsf, sff    1
ssf, sfs, fss    2
sss              3
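The table of outcomes can also be generated by enumerating every path through the tree. The following is a small sketch using Python's standard library; the variable names are illustrative only.

```python
from collections import defaultdict
from itertools import product

# Group every sequence of 3 trials by its number of successes (x).
outcomes_by_successes = defaultdict(list)
for sequence in product("sf", repeat=3):
    outcome = "".join(sequence)
    outcomes_by_successes[outcome.count("s")].append(outcome)

for x in sorted(outcomes_by_successes):
    print(x, outcomes_by_successes[x])
```

Each value of x collects the sequences containing exactly x successes, so the group sizes are 1, 3, 3 and 1, as in the table.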
A formula for the binomial probability mass function for the case n = 3 can be written down
from the above table by noting the following.
1) Each outcome is a sequence of s (success) and f (failure) values, e.g. fff, ffs, ssf etc.
3) Since the trials are independent, the probability of a particular sequence of s’s and
f’s is given by a product of p (the probability of success) and q (the probability of
failure) values, where p occurs x times and q occurs (3 – x) times, e.g. $P(fff) = q^3$,
$P(ffs) = pq^2$, $P(ssf) = p^2q$ etc.
4) The number of outcomes where there are x successes and (3 – x) failures can
be counted by using the formula $C(3, x) = {}_3C_x$.
By using the above, the binomial formula for n = 3 can be written down as
$f(x) = P(X = x) = {}_3C_x\,p^x q^{3-x}$, for x = 0, 1, 2, 3.
To write down the general formula, the same reasoning as explained above applies to
sequences with n outcomes consisting of s (x of these) and f (n – x of these) values. In the
formula the number 3 is just replaced by n, i.e.
$f(x) = P(X = x) = {}_nC_x\,p^x q^{n-x}$, for x = 0, 1, ..., n.
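The binomial formula translates directly into code: the number of sequences with x successes, times the probability of any one such sequence. The sketch below uses `math.comb` from the Python standard library; the function name `binomial_pmf` is an assumption for illustration.

```python
from math import comb

def binomial_pmf(x, n, p):
    """P(X = x) for X ~ Bin(n, p): nCx * p^x * q^(n - x)."""
    q = 1 - p
    return comb(n, x) * p ** x * q ** (n - x)

# Exactly 2 tails in 3 flips of a fair coin.
print(round(binomial_pmf(2, 3, 0.5), 3))  # 0.375

# The probabilities over x = 0, 1, ..., n add up to 1.
print(round(sum(binomial_pmf(x, 3, 0.5) for x in range(4)), 10))  # 1.0
```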
Examples
1) As in the previous examples, let T be the random variable representing the number of
tails when a coin is flipped 3 times. Then T ~ Bin(3, 0.5). Using the formula above with
n = 3 and p = 0.5, we can calculate the probability of exactly 2 tails as
$P(T = 2) = f(2) = {}_3C_2\,(0.5)^2 (0.5)^1 = 0.375$.
a) 3 answers correctly?
b) 7 answers correctly?
c) fewer than 3 answers correctly?
d) at least 5 answers correctly?
Solution:
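a) $P(X = 3) = f(3) = {}_{10}C_3\,(0.2)^3 (0.8)^7 = 0.2013$
b) $P(X = 7) = f(7) = {}_{10}C_7\,(0.2)^7 (0.8)^3 = 0.0008$
c) $P(X < 3) = f(0) + f(1) + f(2) = 0.1074 + 0.2684 + 0.3020 = 0.6778$
d) $P(X \geq 5) = 1 - P(X \leq 4) = 1 - 0.9672 = 0.0328$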
Notice that the calculations needed in parts (c) and (d) of the previous example are time
consuming. Instead of using the pdf f(x) to solve the problems, the CDF F(x) can be used.
Values for the CDF are found in the Cumulative Binomial Distribution tables at the end of
the notes (Table A).
There are several tables – one for each different value of n. The first column gives the value
of n while the second column gives the possible values that the random variable X can take
on. The top row gives common values of p.
Remember: These tables give cumulative probabilities, so situations that involve the
“<”, “>” and “≥” signs must be adjusted so that they are in a form that uses the “≤”
sign, i.e. a “less than or equal to” situation.
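The adjustment to a "≤" form can be illustrated in code, with a computed CDF standing in for the printed table. This sketch uses the Bin(10, 0.2) variable that appears in the solution above; the function name `binomial_cdf` is illustrative.

```python
from math import comb

def binomial_cdf(x, n, p):
    """F(x) = P(X <= x) for X ~ Bin(n, p), summing the pmf from 0 to x."""
    return sum(comb(n, k) * p ** k * (1 - p) ** (n - k) for k in range(x + 1))

n, p = 10, 0.2
p_less_than_3 = binomial_cdf(2, n, p)        # P(X < 3)  = P(X <= 2)
p_more_than_3 = 1 - binomial_cdf(3, n, p)    # P(X > 3)  = 1 - P(X <= 3)
p_at_least_5 = 1 - binomial_cdf(4, n, p)     # P(X >= 5) = 1 - P(X <= 4)

print(round(p_less_than_3, 4), round(p_at_least_5, 4))  # 0.6778 0.0328
```

Every case reduces to one or two evaluations of F(x), which is exactly how the tables are meant to be used.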
Examples
1) Suppose X ~ Bin(12 , 0.6). Find the probability that X is less than, or equal to, 5.
Part (d): What is the probability that the student chooses at least 5 answers
correctly?
P(X ≥ 5) = 1 – P(X ≤ 4) = 1 – 0.9672 = 0.0328
Example
Note: A Binomial random variable with n=1 is simply a Bernoulli trial and is sometimes
referred to as a Bernoulli distribution.
Consider a bowl with N marbles of which Np are blue and Nq red, where p + q = 1. If
drawing a blue marble is labeled “success” (red marble labeled “failure”) and sampling
is done with replacement, then P(success) = Np/N = p and P(failure) = Nq/N = q on
every draw. If P(x blue marbles in n draws) is required and sampling is with
replacement, the binomial formula will therefore apply. If sampling is without
replacement, P(success) is no longer constant (assumption 4 of binomial experiment is
violated) and the binomial formula will no longer apply for calculating the
abovementioned probability. In such a case the probability is found by counting
combinations (the hypergeometric formula):
$P(x \text{ blue marbles in } n \text{ draws}) = \dfrac{{}_{Np}C_x \times {}_{Nq}C_{n-x}}{{}_NC_n}$.
Example
A bowl contains 10 blue and 7 red marbles. Four (4) marbles are drawn at random from the
bowl. Calculate the probability of
(a) two
(b) at least 3
blue marbles drawn when sampling is done
1) with replacement.
2) without replacement.
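1a) $P(X = 2) = {}_4C_2 \left(\frac{10}{17}\right)^2 \left(\frac{7}{17}\right)^2 = 0.352$.
1b) $P(X \geq 3) = P(X = 3) + P(X = 4) = {}_4C_3 \left(\frac{10}{17}\right)^3 \left(\frac{7}{17}\right) + {}_4C_4 \left(\frac{10}{17}\right)^4$
= 0.335 + 0.120
= 0.455.
2a) $P(X = 2) = \dfrac{{}_{10}C_2 \times {}_7C_2}{{}_{17}C_4} = \dfrac{45 \times 21}{2380} = 0.397$.
2b) $P(X \geq 3) = \dfrac{{}_{10}C_3 \times {}_7C_1 + {}_{10}C_4 \times {}_7C_0}{{}_{17}C_4} = \dfrac{840 + 210}{2380} = 0.441$.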
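The marble example can be checked numerically for both sampling schemes. The sketch below assumes the binomial formula for sampling with replacement and a counting (hypergeometric) argument for sampling without replacement; the constant and function names are illustrative.

```python
from math import comb

BLUE, RED, DRAWS = 10, 7, 4
TOTAL = BLUE + RED

def p_blue_with_replacement(x):
    """Binomial: P(success) stays at BLUE / TOTAL on every draw."""
    p = BLUE / TOTAL
    return comb(DRAWS, x) * p ** x * (1 - p) ** (DRAWS - x)

def p_blue_without_replacement(x):
    """Counting argument: choose x blue and DRAWS - x red marbles."""
    return comb(BLUE, x) * comb(RED, DRAWS - x) / comb(TOTAL, DRAWS)

for label, pmf in [("with", p_blue_with_replacement),
                   ("without", p_blue_without_replacement)]:
    exactly_two = pmf(2)
    at_least_three = pmf(3) + pmf(4)
    print(label, round(exactly_two, 3), round(at_least_three, 3))
```

The two schemes give different answers here because 4 draws out of 17 marbles is a sizeable fraction of the bowl.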
Examples
1) The number of bad cheques presented for daily payment at a bank.
2) The number of road deaths per month.
3) The number of bacteria in a given culture.
4) The number of defects per square meter on metal sheets being manufactured.
5) The number of mistakes per typewritten page.
PDF
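The probability that x events occur in a given interval of time/space is given by
$f(x) = P(X = x) = \dfrac{e^{-\mu}\mu^x}{x!}$, for x = 0, 1, 2, ...,
where μ is the mean number of events in that interval.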
Examples
1) A secretary claims an average mistake rate of 1 per page. A sample page is selected
at random and 5 mistakes found. What is the probability of her making 5 or more
mistakes if her claim of 1 mistake per page on average is correct?
Solution:
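In this case μ = 1 is claimed and the event of interest is X ≥ 5, where X is the number of
mistakes on a page. If the claim is true,
P(X ≥ 5) = 1 – P(X ≤ 4)
$= 1 - \left(e^{-1} + e^{-1} + \dfrac{e^{-1}}{2!} + \dfrac{e^{-1}}{3!} + \dfrac{e^{-1}}{4!}\right)$
= 1 – 0.9963
= 0.0037.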
The above calculation shows that if the claim of 1 mistake per page on average is true,
there is only a 37 in 10 000 chance of getting 5 or more mistakes on a page. Such a remote
chance casts doubt on whether the claim of 1 mistake per page on average is in fact true.
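The calculation for the secretary example can be reproduced with the Poisson probability function. This is a minimal sketch; the function name `poisson_pmf` is an assumption for illustration.

```python
from math import exp, factorial

def poisson_pmf(x, mu):
    """P(X = x) for a Poisson variable with mean mu."""
    return exp(-mu) * mu ** x / factorial(x)

# Claimed mean of 1 mistake per page; 5 or more mistakes observed.
p_at_most_4 = sum(poisson_pmf(x, 1) for x in range(5))
p_at_least_5 = 1 - p_at_most_4
print(round(p_at_most_4, 4), round(p_at_least_5, 4))  # 0.9963 0.0037
```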
2) At a particular restaurant 4 plates are broken, on average, each week. What is the
probability that
a) 2 plates are broken next week?
b) at most 4 plates are broken next week?
c) more than 3 plates are broken next week?
Solution:
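a) $P(X = 2) = f(2) = \dfrac{e^{-4}\,4^2}{2!} = 0.1465$
b) $P(X \leq 4) = f(0) + f(1) + f(2) + f(3) + f(4) = e^{-4}\left(1 + 4 + \dfrac{4^2}{2!} + \dfrac{4^3}{3!} + \dfrac{4^4}{4!}\right) = 0.6288$
c) $P(X > 3) = 1 - P(X \leq 3) = 1 - e^{-4}\left(1 + 4 + \dfrac{4^2}{2!} + \dfrac{4^3}{3!}\right) = 1 - 0.4335 = 0.5665$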
Notice that the calculations needed in parts (b) and (c) of the previous example are time
consuming. Instead of using the pdf f(x) to solve the problems, the CDF F(x) can be used.
Values for the CDF are found in the Cumulative Poisson Distribution table at the end of the
notes (Table B).
The top row gives some values for µ and the first column gives some values that the Poisson
random variable X can take on. The cumulative probabilities F(x) = P(X ≤ x) can be found by
lining up the relevant row and column.
Reminder: As with the Cumulative Binomial Distribution tables, these tables give
cumulative probabilities so situations that involve the “<”, “>” and “≥” signs must be
adjusted so that they are in a form that uses the “≤” sign i.e. a “less than or equal to”
situation.
Example 2
Part (b): Step 1 – Find µ=4 in the top row of the table.
Step 2 – Find x=4 in the first column.
Step 3 – Line up the column and row.
At the intersection of the row and column is the value F(4) = P(X ≤ 4) = 0.6288.
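The table lookup can be cross-checked by computing the cumulative probability directly. The sketch below takes µ = 4 from the restaurant example; the function name `poisson_cdf` is illustrative.

```python
from math import exp, factorial

def poisson_cdf(x, mu):
    """F(x) = P(X <= x) for a Poisson variable with mean mu."""
    return sum(exp(-mu) * mu ** k / factorial(k) for k in range(x + 1))

mu = 4  # plates broken per week, on average
p_at_most_4 = poisson_cdf(4, mu)          # part (b): already a "<=" statement
p_more_than_3 = 1 - poisson_cdf(3, mu)    # part (c): P(X > 3) = 1 - P(X <= 3)
print(round(p_at_most_4, 4), round(p_more_than_3, 4))  # 0.6288 0.5665
```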
The Poisson random variable can also be seen as an approximation to a binomial random
variable with the number of trials (n) large and the probability of success (p) small such that
the mean μ = np is of moderate size. This approximation is good when n ≥ 20 and p ≤ 0.05,
or when n ≥ 100 and np ≤ 10.
Example
A life insurance company has found that the probability is 0.000015 that a person aged
40-50 will die from a certain rare disease. If the company has 100 000 policy holders in
this age group, what is the probability that this company will have to pay out 4 claims or
more because of death from this disease?
Solution:
For the following reasons a binomial distribution with n = 100 000 and p = 0.000015 is
reasonable in this case.
3) Whether or not one person dies from this disease does not affect whether another
person does.
The Poisson distribution with µ = 100 000×(0.000015) = 1.5 can be used to approximate this
probability.
P(X ≥ 4) = 1 – P(X ≤ 3)
= 1 – 0.9344
= 0.0656.
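The quality of the approximation in this example can be checked by comparing it with the exact binomial probability. This is a sketch only: the function names are illustrative, and the exact calculation relies on Python's arbitrary-precision integers in `math.comb`.

```python
from math import comb, exp, factorial

n, p = 100_000, 0.000015
mu = n * p  # mean number of claims, 1.5

def binomial_cdf(x):
    """Exact P(X <= x) for X ~ Bin(n, p)."""
    return sum(comb(n, k) * p ** k * (1 - p) ** (n - k) for k in range(x + 1))

def poisson_cdf(x):
    """Approximate P(X <= x) using a Poisson variable with mean mu."""
    return sum(exp(-mu) * mu ** k / factorial(k) for k in range(x + 1))

exact = 1 - binomial_cdf(3)
approx = 1 - poisson_cdf(3)
print(round(approx, 4))            # 0.0656
print(abs(exact - approx) < 1e-4)  # True
```

With n this large and p this small, the two answers differ by far less than the rounding used in the notes.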
The mean and variance of the Poisson distribution are given by E(X) = µ and
var(X) = µ.
In the case of the Poisson approximation to the binomial distribution
E(X) = var(X) = np and
standard deviation = $\sqrt{np}$.
Example
Calls arrive at a switchboard at an average rate of 1 every 15 seconds. What is the
probability of not more than 5 calls arriving during a particular minute?
Solution:
A mean rate of 1 every 15 seconds is equivalent to a mean rate of 4 every minute. Since the
question concerns an interval of 1 minute, µ = 4 (not µ = 1).
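The rescaling of the mean to the interval of interest, followed by the cumulative probability, can be sketched as follows. The variable names are illustrative, and the printed probability is computed by this sketch rather than read from Table B.

```python
from math import exp, factorial

rate_per_15_seconds = 1
interval_seconds = 60

# Rescale the mean to the interval in the question: 4 calls per minute.
mu = rate_per_15_seconds * interval_seconds / 15

# "Not more than 5 calls" is already a "<=" statement: P(X <= 5).
p_at_most_5 = sum(exp(-mu) * mu ** k / factorial(k) for k in range(6))
print(mu, round(p_at_most_5, 4))  # 4.0 0.7851
```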