Statistical Analysis With Software Application
Instructional Materials
In
STAT 20053
Statistical Analysis
with Software Application
Compiled by:
OVERVIEW:
You are taking part in a gameshow. The host of the show, who is known as Monty, shows
you three outwardly identical doors. Behind one of them is a prize (a sports car), and behind the
other two are goats. You are asked to select, but not open, one of the doors. After you have done
so, Monty, who knows where the prize is, opens one of the two remaining doors. He always opens
a door he knows will reveal a goat, and randomly chooses which door to open when he has more
than one option (which happens when your initial choice contains the prize). After revealing a
goat, Monty gives you the choice of either switching to the other unopened door or sticking with
your original choice. You then receive whatever is behind the door you choose. What should you
do, assuming you want to win the prize?
MODULE OBJECTIVES:
After successfully completing the module, you should be able to:
1. Identify the Monty Hall problem.
2. Explain decision making under uncertainty and uncertainty in the news.
3. Discuss simplicity and complexity needs for models.
4. Perform the process of model building and making assumptions.
COURSE MATERIALS:
The famous Monty Hall Problem is a classic example of decision making under
uncertainty. In module 2, we will solve this problem formally, but for now appreciate that at each
round of the game you, as the player, do not know where the sports car is. To begin with, the only
certainty you have is that the sports car must be behind one of the three doors. You may, or may
not, initially choose the 'correct' door (assuming you want to win the prize!) but there is no certainty
in your choice. Upon revealing a goat behind one of the doors you did not choose, you still face
uncertainty (the only certainty you have is that the sports car must be behind one of the two
unopened doors). The "controversy" arose over the American game show 'Let's Make a Deal',
and the New York Times (among others) devoted two pages to the problem, readers' letters etc.
Bewildered game show players wrote to Marilyn vos Savant, an advice columnist for
Parade Magazine, and asked for her opinion in her 'Ask Marilyn' column. Savant – who is credited
by the Guinness Book of Records as having the highest IQ of any woman in the world – gave her
decision. She said, “You should change your choice”. There then followed a long argument in the
correspondence columns, some supporting Savant's decision and others saying that it was
nonsense. What do you think, and why?
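Before seeing the formal solution in module 2, we can let a computer play the game many times and compare the two strategies. The sketch below is a minimal Python simulation (the function name `monty_hall` and the trial count are our own choices, not part of the original problem statement):

```python
import random

def monty_hall(switch, trials=100_000, seed=42):
    """Simulate the Monty Hall game and return the estimated win rate."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(trials):
        prize = rng.randrange(3)    # door hiding the sports car
        choice = rng.randrange(3)   # player's initial pick
        # Monty opens a goat door that is neither the prize nor the pick;
        # if he has two options (pick == prize) he chooses at random
        opened = rng.choice([d for d in range(3) if d != prize and d != choice])
        if switch:
            # switch to the one remaining unopened door
            choice = next(d for d in range(3) if d != choice and d != opened)
        wins += (choice == prize)
    return wins / trials

print(monty_hall(switch=False))  # sticking wins roughly 1/3 of games
print(monty_hall(switch=True))   # switching wins roughly 2/3 of games
```

Running this suggests switching wins about twice as often as sticking – a preview of the result we will derive formally with Bayes' theorem.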
1.2 Decision Making Under Uncertainty
To study, or not to study? To invest, or not to invest? To marry, or not to marry? These,
among others, are decisions many of us face during our lives. Of course, decisions have to be
taken in the present, with uncertain future outcomes.
In the workplace, for example, making decisions is the most important job of any executive.
However, it is also the toughest and riskiest job. Bad decisions can damage a business, a
reputation and a career, sometimes irreparably. Good decisions can result in promotion, a strong
reputation and making money! Today we are living in the age of technology, with two important
implications for everyone.
1. Technology has made it possible to collect vast amounts of data – the era of 'big data'.
2. Technology has given many more people the power and responsibility to analyse data and
make decisions on the basis of quantitative analysis.
A large amount of data already exists, and it will only increase further in the future. Many
companies, rightly, are seeing data-driven decision-making as a source of competitive advantage.
By using quantitative methods to uncover and extract the information in the data and then acting
on this information – guided by quantitative analysis – they are able to gain advantages which
their more qualitatively-oriented competitors are not able to gain.
Today, demand for people with quantitative skills far exceeds supply, creating a skills
deficit. With demand set to increase further, and supply failing to keep pace, Economics 101
tells you that the price rises whenever demand exceeds supply. Of course,
the "price" being referred to here is that of an employee, i.e. the salary which quantitative staff
can command (already high) is set to rise even further. Decision-making is the process undertaken when one
is faced with a problem or decision having more than one possible outcome.
The possible results from the decision are a function of both internal variables (which we
can control) and external variables (which we cannot control), each of which cannot be
expressed with certainty. Hence the outcome cannot be known in advance with certainty. When
evaluating all decision-making, we start with structuring the problem.
Determine the possible criteria which could be used to evaluate the alternatives:
qualitative analysis
quantitative analysis
This is all very well, but how much “higher”? How would we justify a specific increase of Px?
Note this is not an exhaustive list, but clearly market demand, competition, production
costs and advertising expenditure (among other factors) are likely to be relevant to the price-
setting problem.
For all decisions, we need to determine the influencing factors which could either be internal or
external, such as:
demand and competitive supply
availability of labour and materials
…
which are then used to derive expected results or consequences. Of course, determining which
are the influencing factors, and their corresponding weights of influence, is not necessarily easy,
but a thoughtful consideration of these is important due to their cumulative effect on the outcome.
In a qualitative analysis, once we have determined a preliminary list of the factors which
we think will affect the possible outcomes of the decision:
the management team “qualitatively” evaluates how each factor could affect the decision
this discussion leads to an assessment by the decision-maker
the decision is made followed by implementation, if necessary.
For example, in a qualitative analysis we might describe the potential options in a decision
tree (covered in module 6) in which we can include the concept of probable outcomes. We could
make this assessment using the (qualitative) qualifiers of:
optimistic
conservative
pessimistic
In a quantitative analysis, our objective becomes to define mathematically the relationships which might exist. Next,
we evaluate the significance of the predictive value of the relationships found. An assessment of
the relationships which our analysis defines leads us to be able to quantitatively express the
expected results or consequences of the decision we are making.
“News” is not “olds”. News reports new information about events taking place in the world
– ignoring fake news! Especially in business news, you will find numerous reports discussing the
many uncertainties being faced by business. While uncertainty certainly makes life exciting, it
makes decision-making particularly challenging. Should a firm increase production? Advertise?
Cut back? Merge?
Decisions are made in the present, with uncertain future outcomes. Hence many media
reports will comment on the uncertainties being faced. Of course, some eras are more uncertain
than others. Indeed, 2016 was the year of the black swan – low-probability, high-impact events
– with the Brexit referendum vote and the election of Donald Trump to the White House the main
geopolitical stories. Both outcomes were considered unlikely, yet they both happened. Some
prediction markets priced in a 25% probability of each of these outcomes, but a simple probability
calculation (as will be covered next module) would equate this to tossing a fair coin twice and
getting two heads – perhaps these were not such surprising results after all! Taking Brexit as an
example, immediately after the referendum result was known, uncertainty arose about exactly
what 'Brexit' meant. Exiting the single market? Exiting the customs union?
Financial markets, in particular, tend to be very sensitive to news. Even stories reporting
comments from influential people, such as politicians, can move markets – sometimes
dramatically! For example, read `Flash crash sees the pound gyrate in Asian trading' available at
http://www.bbc.co.uk/news/business-37582150.
At one stage it fell as much as 6% to $1.1841 – the biggest move since the Brexit Vote –
before recovering. It was recently trading 2% lower at $1.2388. It is not clear what triggered the
sudden sell-off. Analysts say it could have been automated trading systems reacting to a news
report. The Bank of England said it was 'looking into' the flash crash. The sharp drop came after
the Financial Times published a story online about French President Francois Hollande
demanding ‘tough Brexit negotiations’.
Increasingly, quantitative hedge funds and asset managers will trade algorithmically, with
computers designed to scan the internet for news stories and interpret whether news reports
contain any useful information which would allow a revision of probabilistic beliefs (we have
already seen an example of this with the Monty Hall problem, and will formally consider ‘Bayesian
updating’ in module 2). Here, the demand for ‘tough Brexit negotiations’ by the then French
President would be interpreted as being bad for the UK, which would lead to a further depreciation
in the pound sterling.
“These days some algos trade on the back of news sites, and even what is trending on
social media sites such as Twitter, so a deluge of negative Brexit headlines could have led to an
algo taking that as a major sell signal for the pound,” says Kathleen Brooks, research director at
City Index.
So, from now on, when you read (or listen to) the news, keep an eye out for (or ear open
to!) the word ‘uncertainty’ and consider what kinds of decisions are being made in the face of the
uncertainty.
1.4 Simplicity vs. Complexity – The Need for Models
Although we care about the real world, seek to understand it and make decisions in reality,
we have an inherent dislike of complexity. Indeed, in the social sciences the real world is a highly
complex web of interdependencies between countless variables.
So in order to make any sense of the real world, we will inevitably have to simplify reality.
Our tool for achieving this is a model. A model is a deliberate simplification of reality. A good
model retains the most important features of reality and ignores less important details.
Immediately we see that we face a trade-off (an opportunity cost in Economics 101). The benefit
of a model is that we simplify the complex real world. The cost of a model is that the consequence
of this simplification of reality is a departure from reality. Broadly speaking, we would be happy if
the benefit exceeded the cost, i.e. if the simplicity made it easier for us to understand and analyse
the real world while incurring only a minimal departure from reality.
Of course, an engineer would likely need to know these ‘less important details’, but for a
tourist visiting London such information is superfluous and the map is very much fit-for-purpose.
However, we said above that a model is a departure from reality, hence some caution should
always be exercised when using a model. Blind belief in a model might be misleading. For
example, the map above fails to accurately represent the precise geographic location of stations.
If we look at the geographically-accurate map we see the first map can be very misleading in
terms of the true distance between stations – for example, the quickest route from Edgware Road
to Marble Arch is to walk! Also, even line names can be a model – the Circle line (in yellow) is
clearly not a true circle! Does it matter? Well, the Circle line forms a loop and it is an easy name
to remember, so arguably here the simplification of the name outweighs the slight departure in
reality from a true circle!
Our key takeaway is that models inevitably involve trade-offs. As we further simplify reality
(a benefit), we further depart from reality (a cost). In order to determine whether or not a model is
"good", we must decide whether the benefit justifies the cost. Resolving this benefit–cost trade-off is subjective – further adding to life's complexities.
1.5 Assumptions
The normal distribution is frequently used in models. One example is that financial returns
on assets are often assumed to be normally distributed. Under this assumption of normality, the
probability of returns being within three standard deviations of the mean (mean and standard
deviation will be reviewed in modules 2 and 3) is approximately 99.7%. This means that the
probability of returns being more than three standard deviations from the mean is approximately
0.3%. (In the graph above, the mean is 0 and the standard deviation is 1, so ‘mean ±3 standard
deviations' equates to the interval [−3, 3]. This means that 99.7% of the total area under the curve
is between –3 and 3.) Assuming market returns follow a normal distribution is fundamental to
many models in finance, for example Markowitz's modern portfolio theory and the Black–Scholes–
Merton option pricing model. However, this assumption does not typically reflect actual observed
market returns and 'tail events', i.e. black swan events (which recall are low-probability, high-
impact events), tend to occur more frequently than a normal distribution would predict! For now,
the moral of the story is to beware assumptions – if you make a wrong or invalid assumption, then
decisions you make in good faith may lead to outcomes far from what you expected. As an
example, the subprime mortgage market in the United States assumed house prices would only
ever increase, but what goes up usually comes down at some point.
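The 99.7% figure quoted for the normality assumption can be checked numerically. A minimal Python sketch using only the standard library's error function (the helper name `normal_within` is our own): for a standard normal variable Z, P(|Z| ≤ k) = erf(k/√2).

```python
import math

def normal_within(k):
    """P(|Z| <= k) for a standard normal Z, via the error function."""
    return math.erf(k / math.sqrt(2))

print(round(normal_within(3), 4))      # ≈ 0.9973: within 3 standard deviations
print(round(1 - normal_within(3), 4))  # ≈ 0.0027: the 'tail' probability
```

Actual market returns put considerably more mass in the tails than this 0.27%, which is precisely why the normality assumption can mislead.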
ACTIVITIES/ASSESSMENT
True or False. Write “True” if the statement is correct, and “False” otherwise.
Watch:
“Evolution of the London Underground Map”
https://www.youtube.com/watch?v=1pMX7EkAhoA
MODULE 2 – QUANTIFYING UNCERTAINTY WITH PROBABILITY
OVERVIEW:
We're going to be considering quantifying uncertainty with probability. So in the first
module, this was just really the introduction to the course, and I wanted to get you thinking about
the general concepts of decision-making under uncertainty. And for example, we kicked off with
the Monty Hall problem. But of course, now we need to formalize things somewhat more. We
need to decide what exactly is probability and how do we quantify it? How do we determine this?
So to assist us with this, we do need to introduce some key vocabulary, some key lexicon, if
you will.
So we begin with the concept of an experiment, indeed a random experiment. Now
examples for this could be trivial things such as tossing a coin and seeing which is the uppermost
face. Maybe it's rolling a die and seeing the score on the uppermost face there. Or maybe it could
be more sort of real world examples, whereby we look at some stock index like the FTSE 100 and
see what the change was in the value of that index on a particular trading day. So there's some
random experiment, which could lead to one of several possible outcomes. Now sometimes the
number of outcomes might be quite small. Tossing the coin, we've only really got two options
here, either the uppermost face is going to be heads or it's going to be tails. If we consider
extending it to rolling a die, well of course there, on a standard die, you have six possible values
for that uppermost face. The integers 1, 2, 3, 4, 5 and 6. If instead we consider the FTSE 100
Index, well of course, there are lots of possible values we can have for percentage change on a
particular trading day. So we have a random experiment, which results in a particular outcome.
MODULE OBJECTIVES:
After successfully completing the module, you should be able to:
1. Quantify uncertainty with probability applied to some simple examples.
2. Recall a selection of common probability distributions.
3. Discuss how new information leads to revised beliefs.
COURSE MATERIALS:
2.1 Probability Principles
Probability is very important for statistics because it provides the rules which allow us to
reason about uncertainty and randomness, which is the basis of statistics and must be fully
understood in order to think clearly about any statistical investigation.
Probability, P(A), will be defined as a function which assigns probabilities (real numbers)
to events (sets). This uses the language and concepts of set theory. So we need to study the
basics of set theory first.
A set is a collection of elements (also known as ‘members’ of the set).
Example: The following are all examples of sets, where "|" can be read as "such that":
A = {Amy, Bob, Sam}
B = {1, 2, 3, 4, 5}
C = {x | x is a prime number} = {2, 3, 5, 7, 11, …}
D = {x | x ≥ 0} (that is, the set of all non-negative real numbers)
An experiment is a process which produces outcomes and which can have several different
outcomes. The sample space S is the set of all possible outcomes of the experiment. An event
is any subset A of the sample space such that A ⊂ S, where ⊂ denotes a subset.
Example:
If the experiment is ‘select a trading day at random and record the % change in the FTSE
100 index from the previous trading day’, then the outcome is the % change in the FTSE 100
index.
𝑆 = [−100, +∞) for the % change in the FTSE 100 index (in principle).
An event of interest might be A = {x | x > 0} – the event that the daily change is positive, i.e. the
FTSE 100 index gains value from the previous trading day. We would then denote the probability
of this event as P(A).
What does probability mean? Probability theory tells us how to work with the probability
function and derive probabilities of events from it. However, it does not tell us what `probability'
really means.
We define probabilities to span the unit interval, i.e. [0, 1], such that for any event A we
have:
0 ≤ P(A) ≤ 1
At the extremes, an impossible event occurs with a probability of zero, and a certain event
occurs with a probability of one, hence P(S) = 1 by definition of the sample space. For any event
A, P(A) → 1 as the event becomes more likely, and P(A) → 0 as the event becomes less likely.
Therefore, the probability value is a quantified measure of how likely an event is to occur.
Figure 2
There are several alternative interpretations of the real-world meaning of “probability” in
this sense. One of them is outlined below. The mathematical theory of probability and calculations
on probabilities are the same whichever interpretation we assign to “probability”.
Example: How should we interpret the following, as statements about the real world of coins and
babies?
‘The probability that a tossed coin comes up heads is 0.5.’ If we tossed a coin a large
number of times, and the proportion of heads out of those tosses was 0.5, the ‘probability
of heads’ could be said to be 0.5, for that coin.
‘The probability is 0.51 that a child born in the Philippines today is a boy.’ If the proportion
of boys among a large number of live births was 0.51, the ‘probability of a boy’ could be
said to be 0.51.
Probabilities can be determined in three broad ways:
subjectively
by experimentation (empirically)
theoretically
Ignoring extreme events like a world war, the determination of probabilities is usually done
empirically, by observing actual realizations of the experiment and using them to estimate
probabilities. In the simplest cases, this basically applies the frequency definition to observed
data.
Example:
If I toss a coin 10,000 times, and 5,050 of the tosses come up heads, it seems that,
approximately, P(heads) = 0.5, for that coin.
Of the 7,098,667 live births in England and Wales in the period 1999–2009, 51.26% were
boys. So we could assign the value of about 0.51 to the probability of a boy in that
population.
The estimation of probabilities of events from observed data is an important part of statistics!
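The frequency idea above can be illustrated by simulation: as the number of tosses grows, the relative frequency of heads settles near 0.5. A short Python sketch (the function name and the seed are our own choices):

```python
import random

def estimate_heads_probability(n_tosses, seed=1):
    """Estimate P(heads) as the relative frequency in n simulated fair tosses."""
    rng = random.Random(seed)
    heads = sum(rng.random() < 0.5 for _ in range(n_tosses))
    return heads / n_tosses

# The estimate gets closer to 0.5 as the number of tosses increases
for n in (100, 10_000, 1_000_000):
    print(n, estimate_heads_probability(n))
```

This is exactly the empirical approach: observe many realizations of the experiment and use the observed proportion as the probability estimate.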
2.2 Simple Probability Distributions
One can view probability as a quantifiable measure of one's degree of belief in a particular
event, or set of interest. Let us consider two simple experiments.
Example
i. The toss of a (fair) coin: S = {H, T} where H and T denote "heads" and "tails", respectively,
and are called the elements or members of the sample space.
ii. The roll of a (fair) die: S = {1, 2, 3, 4, 5, 6}.
So the coin toss sample space has two elementary outcomes, H and T, while the score
on a die has six elementary outcomes. These individual elementary outcomes are themselves
events, but we may wish to consider slightly more exciting events of interest. For example, for the
score on a die, we may be interested in the event of obtaining an even score, or a score greater
than 4 etc. Hence we proceed to define an event of interest.
Typically, we can denote events by letters for notational efficiency. For example, A = “an
even score” and B = “a score greater than 4”. Hence A = {2, 4, 6} and B = {5, 6}.
The universal convention is that we define probability to lie on a scale from 0 to 1 inclusive
(multiplying by 100 expresses a probability as a percentage). Hence the probability of any event A, say,
is denoted by P(A) and is a real number somewhere in the unit interval, i.e. P(A) ∈ [0, 1], where
"∈" means "is a member of". Note the following: for the die events above, A = {2, 4, 6} will turn
out to have probability 1/2 and B = {5, 6} probability 1/3.
Therefore, we have a probability scale from 0 to 1 on which we are able to rank events,
as evident from the P(A) > P(B) result above. However, we need to consider how best to quantify
these probabilities theoretically (we have previously considered determining probabilities
subjectively and by experimentation). Let us begin with experiments where each elementary
outcome is equally likely, hence our (fair) coin toss and (fair) die score fulfill this criterion
(conveniently).
Classical probability is a simple special case where values of probabilities can be found
by just counting outcomes. This requires that:
the sample space contains only a finite number of outcomes, N
all of the outcomes are equally probable (equally likely).
We will use these often, not because they are particularly important but because they provide
simple examples for illustrating various results in probability.
Suppose that the sample space S contains N equally likely outcomes, and that event A consists
of n ≤ N of these outcomes. We then have that:
P(A) = n/N = (number of outcomes in A) / (total number of outcomes in the sample space S)
That is, the probability of A is the proportion of outcomes which belong to A out of all possible
outcomes. In the classical case, the probability of any event can be determined by counting the
number of outcomes which belong to the event, and the total number of possible outcomes.
Example
i. For the coin toss, if A is the event "heads", then N = 2 (H and T) and n = 1 (H). So, for a
fair coin, P(A) = 1/2 = 0.5.
ii. For the die score, if A is the event “an even score”, then N = 6 (1, 2, 3, 4, 5 and 6) and
n = 3 (2, 4 and 6). So, for a fair die, P(A) = 3/6 = 1/2 = 0.5
Finally, if B is the event “score greater than 4”, then N = 6 (as before) and n = 2 (5 and 6).
Hence P(B) = 2/6 = 1/3.
Example: Rolling two dice, what is the probability that the sum of the two scores is 5?
Determine the sample space, which is the 36 ordered pairs (i, j) for i, j = 1, 2, …, 6.
Determine the outcomes in the event A = {(1, 4), (2, 3), (3, 2), (4, 1)}.
Determine the probability to be P(A) = 4/36 = 1/9.
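The counting argument above can be checked by enumerating the sample space in Python (a sketch; the variable names `S` and `A` mirror the notation in the example):

```python
from itertools import product
from fractions import Fraction

# Sample space: the 36 equally likely ordered pairs (i, j)
S = list(product(range(1, 7), repeat=2))

# Event A: the sum of the two scores is 5
A = [pair for pair in S if sum(pair) == 5]

print(A)                         # [(1, 4), (2, 3), (3, 2), (4, 1)]
print(Fraction(len(A), len(S)))  # 1/9
```

Using `Fraction` keeps the classical probability exact instead of a rounded decimal.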
A random variable is a ‘mapping’ of the elementary outcomes in the sample space to real
numbers. This allows us to attach probabilities to the experimental outcomes. Hence the concept
of a random variable is that of a measurement which takes a particular value for each possible
trial (experiment). Frequently, this will be a numerical value.
Example:
1. Suppose we sample at random five people and measure their heights, hence ‘height’ is the
random variable and the five (observed) values of this random variable are the realized
measurements for the heights of these five people.
2. Suppose a fair die is thrown four times and we observe two 6’s, a 3 and a 1. The random
variable is the ‘score on the die’, and for these four trials it takes the values 6, 6, 3 and 1. (In this
case, since we do not know the true order in which the values occurred, we could also say that
the results were 1, 6, 3 and 6 or 1, 3, 6 and 6, or . . .)
An example of an experiment with non-numerical outcomes would be a coin toss, for which
recall S = {H, T}. We can use a random variable, X, to convert the sample space elements to real
numbers such as:
X = 1 if heads and X = 0 if tails.
The value of any of the above variables will typically vary from sample to sample, hence
the name “random variable”. So each experimental random variable has a collection of possible
outcomes, and a numerical value associated with each outcome. We have already encountered
the term “sample space” which here is the set of all possible numerical values of the random
variable.
A natural question to ask is “what is the probability of any of these values?”. That is, we
are interested in the probability distribution of the experimental random variable. Be aware that
random variables come in two varieties – discrete and continuous.
Discrete: Synonymous with ‘count data’, that is, random variables which take non-
negative integer values, such as 0, 1, 2, …. For example, the number of heads in n coin
tosses.
Continuous: Synonymous with 'measured data' such as the real line, ℝ = (−∞, ∞), or
some subset of ℝ, for example the unit interval [0, 1]. For example, the height of adults in
centimeters.
To summarize, a probability distribution is the complete set of sample space values with
their associated probabilities which must sum to 1 for discrete random variables. The probability
distribution can be represented diagrammatically by plotting the probabilities against sample
space values. Finally, before we proceed, let us spend a moment to briefly discuss some
important issues with regard to the notation associated with random variables. For notational
efficiency reasons, we often use a capital letter to represent the random variable. The letter X is
often adopted, but it is perfectly legitimate to use any other letter: Y, Z etc. In contrast, a lower
case letter denotes a particular value of the random variable.
Example: Let X = “the upper-faced score after rolling a fair die”. If the die results in a 3, then this
is written as x = 3. The probability distribution of X is P(X = x) = 1/6 for each x = 1, 2, …, 6.
This is an example of the (discrete) uniform distribution. For discrete random variables,
we talk about a mass of probability at each respective sample space value. In the discrete uniform
case this mass is the same, i.e. 1/6, and this is plotted to show the probability distribution of X, as
shown below.
The obvious approach is to use the probability-weighted average of the sample space
values, known as the expected value of X. If x1, x2, …, xN are the possible values of the random
variable X, with corresponding probabilities p(x1), p(x2), …, p(xN), then:

E(X) = x1 p(x1) + x2 p(x2) + … + xN p(xN) = ∑ xi p(xi) (summing over i = 1, …, N)
Note that the expected value is also referred to as the population mean, which can be
written as E(X) (in words “the expectation of the random variable X”) or 𝜇 (in words “the
(population) mean of X”). So, for so-called ‘discrete’ random variables, E(X) is determined by
taking the product of each value of X and its corresponding probability, and then summing across
all values of X.
Example: Let X represent the value shown when a fair die is thrown once. Hence:

E(X) = (1 × 1/6) + (2 × 1/6) + … + (6 × 1/6) = 21/6 = 3.5
We should view 3.5 as a long-run average since, clearly, the score from a single roll of a die can
never be 3.5, as it is not a member of the sample space. However, if we rolled the die a (very)
large number of times, then the average of all of these outcomes would be (approximately) 3.5.
For example, suppose we rolled the die 600 times and observed the frequencies of each score.
Let us suppose we observed the following frequencies:
The average observed score is:
So we see that in the long run the average score is approximately 3.5. Note a different 600 rolls
of the die might lead to a different set of frequencies. Although we might expect 100 occurrences
of each score of 1 to 6 (that is, taking a relative frequency interpretation of probability, as each
score occurs with a probability of 1/6 we would expect one sixth of the time to observe each score
of 1 to 6), it is unlikely we would observe exactly 100 occurrences of each score in practice.
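The long-run-average interpretation can be illustrated by simulating 600 rolls in Python. This is a sketch with an arbitrary seed: the frequencies it produces are whatever the simulation happens to generate, not the particular table of frequencies the text refers to, and they will indeed not be exactly 100 per score.

```python
import random
from collections import Counter

rng = random.Random(0)
rolls = [rng.randint(1, 6) for _ in range(600)]  # 600 simulated die rolls
freqs = Counter(rolls)                            # observed frequency of each score

# Sample average = sum of (score × frequency) / number of rolls
average = sum(score * f for score, f in freqs.items()) / 600

print(dict(sorted(freqs.items())))  # frequencies near, but not exactly, 100 each
print(average)                      # close to E(X) = 3.5
```

Re-running with a different seed gives a different set of frequencies, yet the average stays near 3.5 – the long-run average in action.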
Example: Recall the toss of a fair coin, where we define the random variable X such that X = 1 if heads and X = 0 if tails.
Here, viewed as a long-run average, E(X) = 0.5 can be interpreted as the proportion of heads in
the long run (and, of course, the proportion of tails too).
Example:
Let us consider the game of roulette, from the point of view of the casino (The House). Suppose
a player puts a bet of $1 on ‘red’. If the ball lands on any of the 18 red numbers, the player gets
that $1 back, plus another $1 from The House. If the result is one of the 18 black numbers or the
green 0, the player loses the $1 to The House. We assume that the roulette wheel is unbiased,
i.e. that all 37 numbers have equal probabilities. What can we say about the probabilities and
expected values of wins and losses? Define the random variable X = “money received by The
House". Its possible values are −1 (the player wins) and +1 (the player loses). The probability
function is:
P(X = x) = p(x) = 18/37 for x = −1, 19/37 for x = 1, and 0 otherwise
where p(x) is a shortened version of P(X = x). Therefore, the expected value is:

E(X) = (−1 × 18/37) + (1 × 19/37) = 1/37 ≈ +0.027
On average, The House expects to win $0.027 (2.7 cents) for every $1 which players bet on red. This expected
gain is known as the house edge. It is positive for all possible bets in roulette.
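The house-edge calculation can be reproduced exactly with Python's `fractions` module (a sketch; the variable names are ours):

```python
from fractions import Fraction

# X = money received by The House on a $1 bet on red (European wheel, 37 slots)
values = {-1: Fraction(18, 37),   # player wins: 18 red numbers
          +1: Fraction(19, 37)}   # player loses: 18 black numbers + green 0

# Expected value: probability-weighted average of the outcomes
house_edge = sum(x * p for x, p in values.items())

print(house_edge, float(house_edge))  # 1/37 ≈ 0.027 per $1 staked
```

Exact rational arithmetic shows the edge is precisely 1/37, which rounds to the familiar 2.7%.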
The mean (expected value) E(X) of a probability distribution is analogous to the sample mean
(average) X̄ of a sample distribution (introduced in module 3). This is easiest to see when the
sample space is finite.
Suppose the random variable X can have K different values x1, …, xK, and their frequencies in a
random sample are f1, …, fK, respectively. Therefore, the sample mean of X is:
X̄ = (f1 x1 + … + fK xK) / (f1 + … + fK) = x1 p̂(x1) + … + xK p̂(xK) = ∑ xi p̂(xi) (summing over i = 1, …, K)
where p̂(xi) = fi / (f1 + … + fK) are the sample proportions of the values xi. The expected value
of the random variable X is:

E(X) = x1 p(x1) + … + xK p(xK) = ∑ xi p(xi) (summing over i = 1, …, K)

So X̄ uses the sample proportions, p̂(xi), whereas E(X) uses the population probabilities, p(xi).
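The contrast between the sample mean and the expected value can be made concrete for a fair die. In the sketch below, the sample frequencies are purely illustrative numbers we made up for a hypothetical 600 rolls, not data from the text:

```python
from fractions import Fraction

# Population: fair die, p(x) = 1/6 for each score
values = range(1, 7)
expected = sum(x * Fraction(1, 6) for x in values)  # E(X), population probabilities

# Hypothetical sample frequencies f_i for scores 1..6 (illustrative only)
freqs = {1: 90, 2: 105, 3: 110, 4: 98, 5: 102, 6: 95}
n = sum(freqs.values())
# Sample mean uses the sample proportions f_i / n in place of p(x_i)
sample_mean = sum(x * Fraction(f, n) for x, f in freqs.items())

print(expected)            # 7/2, i.e. E(X) = 3.5 exactly
print(float(sample_mean))  # close to, but not exactly, 3.5
```

The two formulas have the same shape; only the weights differ (sample proportions versus population probabilities).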
Bayesian updating is the act of updating your (probabilistic) beliefs in light of new
information. Formally named after Thomas Bayes (1701 – 1761), for two events A and B, the
simplest form of Bayes' theorem is:
P(A|B) = P(B|A) P(A) / P(B)
Suppose we define the event A to be "roll a 6". Unconditionally, i.e. a priori (before we receive
any additional information), we have P(A) = 1/6. Suppose we are now told that the event B = "an even score"
has occurred (where P(B) = 1/2), which means we can effectively revise our sample space, S*,
by eliminating 1, 3 and 5 (the odd scores), such that:
by eliminating 1, 3 and 5 (the odd scores), such that:
S* = {2, 4, 6}
So now the revised sample space contains three equally likely outcomes (instead of the original
six), so the Bayesian updated probability (known as a conditional probability or a posteriori
probability) is:
P(A|B) = 1/3
where "|" can be read as “given”, hence A|B means “A given B”. Deriving this result formally using
Bayes' theorem, we already have P(A) = 1/6 and also P(B) = 1/2, so we just need P(B|A), which
is the probability of an even score given a score of 6. Since 6 is an even score, P(B|A) = 1. Hence:

P(A|B) = P(B|A) P(A) / P(B) = (1 × 1/6) / (1/2) = 1/3
Suppose instead we consider the case where we are told that an odd score was obtained. Since
even scores and odd scores are mutually exclusive (they cannot occur simultaneously) and
collectively exhaustive (a die score must be even or odd), then we can view this as the
complementary event, denoted Bc, such that P(Bc) = 1 − P(B) = 1/2.
So, given an odd score, what is the conditional probability of obtaining a 6? Intuitively, this is zero
(an impossible event), and we can verify this with Bayes' theorem:

P(A|B^c) = \frac{P(B^c|A)\, P(A)}{P(B^c)} = \frac{0 \times 1/6}{1/2} = 0
where, clearly, we have P(B^c|A) = 0 (since 6 is an even, not odd, score, so it is impossible to obtain an odd score given the score is 6).
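The die example can be sketched in Python as a direct application of Bayes' theorem. The helper function below is a hypothetical illustration:

```python
def bayes(p_b_given_a, p_a, p_b):
    """Simplest form of Bayes' theorem: P(A|B) = P(B|A) * P(A) / P(B)."""
    return p_b_given_a * p_a / p_b

# A = "roll a 6", B = "roll an even score" on a fair die
p_a = 1 / 6   # a priori probability of a 6
p_b = 1 / 2   # probability of an even score
print(bayes(1.0, p_a, p_b))  # P(A|B) = 1/3, since P(B|A) = 1
print(bayes(0.0, p_a, p_b))  # P(A|B^c) = 0, since P(B^c|A) = 0 and P(B^c) = 1/2
```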
Example:
Suppose that 1 in 10,000 people (0.01%) has a particular disease. A diagnostic test for the disease has 99% sensitivity (if a person has the disease, the test will give a positive result with a probability of 0.99). The test has 99% specificity (if a person does not have the disease, the test will give a negative result with a probability of 0.99). If a randomly selected person receives a positive test result, what is the probability that the person actually has the disease?
Solution:
Let B denote the presence of the disease, and Bc denote no disease. Let A denote a positive test
result. We want to calculate P(A).
The probabilities we need are P(B) = 0.0001, P(B^c) = 0.9999, P(A|B) = 0.99 and also P(A|B^c) = 0.01, and hence:

P(A) = P(A|B)\, P(B) + P(A|B^c)\, P(B^c) = 0.99 \times 0.0001 + 0.01 \times 0.9999 = 0.010098
We want to calculate P(B|A), i.e. the probability that a person has the disease, given that the
person has received a positive test result.
P(B) = 0.0001
P(B c ) = 0.9999
P(A|B) = 0.99
P(A|B c ) = 0.01
and so:

P(B|A) = \frac{P(A|B)\, P(B)}{P(A)} = \frac{0.99 \times 0.0001}{0.010098} \approx 0.0098
Why is this so small? The reason is because most people do not have the disease and the test
has a small, but non-zero, false positive rate of P(A|B c ). Therefore, most positive test results are
actually false positives.
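The calculation above can be sketched in Python. The helper function name and its parameterization (prevalence, sensitivity, specificity) are illustrative choices, not from the original text:

```python
def posterior_positive(prevalence, sensitivity, specificity):
    """P(disease | positive test) via Bayes' theorem, expanding P(positive)
    by the total probability formula over disease / no-disease."""
    p_pos = sensitivity * prevalence + (1 - specificity) * (1 - prevalence)
    return sensitivity * prevalence / p_pos

p = posterior_positive(0.0001, 0.99, 0.99)
print(round(p, 4))  # about 0.0098 - most positive results are false positives
```

Even with a 99% accurate test, the low prevalence means a positive result still leaves the probability of disease below 1%.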
In order to revisit the ‘Monty Hall’ problem, we require a more general form of Bayes' theorem,
which we note as follows. For a general partition (partition is the division of the sample space
into mutually exclusive and collectively exhaustive events) of the sample space S into B1,B2,…,Bn,
and for some event A, then:
P(B_k | A) = \frac{P(A | B_k)\, P(B_k)}{\sum_{i=1}^{n} P(A | B_i)\, P(B_i)}
Example:
You are taking part in a gameshow. The host of the show, who is known as Monty, shows you
three outwardly identical doors. Behind one of them is a prize (a sports car), and behind the other
two are goats. You are asked to select, but not open, one of the doors. After you have done so,
Monty, who knows where the prize is, opens one of the two remaining doors. He always opens a
door he knows will reveal a goat, and randomly chooses which door to open when he has more
than one option (which happens when your initial choice contains the prize). After revealing a
goat, Monty gives you the choice of either switching to the other unopened door or sticking with
your original choice. You then receive whatever is behind the door you choose. What should you
do, assuming you want to win the prize?
Suppose the three doors are labelled A, B and C. Let us define the following events.
DA, DB, DC: the prize is behind Door A, B and C, respectively.
MA, MB, MC: Monty opens Door A, B and C, respectively.
Suppose you choose Door A first, and then Monty opens Door B (the answer works the same
way for all combinations of these). So Doors A and C remain unopened. What we want to know
now are the conditional probabilities P(DA |MB ) and P(DC |MB ).
You should switch doors if P(DC |MB ) > P(DA |MB ), and stick with your original choice otherwise.
(You would be indifferent about switching if it was the case that P(DC |MB ) = P(DA |MB ) .)
Suppose that you first choose Door A, and then Monty opens Door B. Bayes' theorem tells us
that:
P(D_C | M_B) = \frac{P(M_B | D_C)\, P(D_C)}{P(M_B | D_A)\, P(D_A) + P(M_B | D_B)\, P(D_B) + P(M_B | D_C)\, P(D_C)} = \frac{1 \times 1/3}{1/2 \times 1/3 + 0 \times 1/3 + 1 \times 1/3} = \frac{2}{3}
And hence, P(D_A | M_B) = 1 − P(D_C | M_B) = 1/3 [because P(M_B | D_B) = 0 and so P(D_B | M_B) = 0].
The same calculation applies to every combination of your first choice and Monty's choice.
Therefore, you will always double your probability of winning the prize if you switch from your
original choice to the door that Monty did not open. The Monty Hall problem has been called a
cognitive illusion, because something about it seems to mislead most people's intuition. In
experiments, around 85% of people tend to get the answer wrong at first. The most common
incorrect response is that the probabilities of the remaining doors after Monty's choice are both
1/2, so that you should not (or rather need not) switch.
This is typically based on ‘no new information’ reasoning. Since we know in advance that Monty
will open one door with a goat behind it, the fact that he does so appears to tell us nothing new
and should not cause us to favor either of the two remaining doors – hence a probability of 1/2 for
each (people see only two possible doors after Monty's action and implicitly apply classical
probability by assuming each door is equally likely to reveal the prize).
It is true that Monty's choice tells you nothing new about the probability of your original choice,
which remains at 1/3. However, it tells us a lot about the other two doors. First, it tells us everything
about the door he chose, namely that it does not contain the prize. Second, all of the probability
of that door gets ‘inherited’ by the door neither you nor Monty chose, which now has the probability
2/3.
So, the moral of the story is to switch! Note here we are using updated probabilities to form a
strategy – it is sensible to ‘play to the probabilities’ and choose as your course of action that which
gives you the greatest chance of success (in this case you double your chance of winning by
switching door). Of course, just because you pursue a course of action with the most likely chance
of success does not guarantee you success.
If you play the Monty Hall problem (and let us assume you switch to the unopened door), you can
expect to win with a probability of 2/3, i.e. you would win 2/3 of the time on average. In any single
play of the game, you are either lucky or unlucky in winning the prize. So you may switch and end
up losing (and then think you applied the wrong strategy – hindsight is a wonderful thing!) but in
the long run you can expect to win twice as often as you lose, such that in the long run you are
better off by switching!
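The long-run advantage of switching can be checked by simulation. Below is a Python sketch of the game under the rules described above; door labels, seed and trial count are arbitrary choices:

```python
import random

def play(switch, rng):
    """Play one round of the Monty Hall game; return True if the player wins."""
    doors = [0, 1, 2]
    prize = rng.choice(doors)
    choice = rng.choice(doors)
    # Monty opens a goat door that is neither the prize nor the player's choice
    opened = rng.choice([d for d in doors if d != prize and d != choice])
    if switch:
        # move to the one remaining unopened door
        choice = next(d for d in doors if d != choice and d != opened)
    return choice == prize

rng = random.Random(42)
n = 100_000
wins_switch = sum(play(True, rng) for _ in range(n)) / n
wins_stick = sum(play(False, rng) for _ in range(n)) / n
print(wins_switch, wins_stick)  # approximately 2/3 and 1/3
```

With a large number of simulated games, the win proportions converge to the theoretical probabilities of 2/3 (switching) and 1/3 (sticking).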
If you feel like playing the Monty Hall game again, I recommend visiting:
http://www.math.ucsd.edu/~crypto/Monty/monty.html.
In particular, note how at the end of the game it shows the percentage of winners based on
multiple participants’ results. Taking the view that in the long run you should win approximately
2/3 of the time from switching door, and approximately 1/3 of the time by not switching, observe
how the percentages of winners tend to 66.7% and 33.3%, respectively, based on a large sample
size. Indeed, when we touch on statistical inference later in the module, it is emphasized that as
the sample size increases we tend to get a more representative (random) sample of the
population. Here, this equates to the sample proportions of wins converging to their theoretical
probabilities. Note also the site has an alternative version of the game where Monty does not
know where the sports car is.
2.5 Parameters
Probability distributions may differ from each other in a broader or narrower sense. In the broader sense, we have different families of distributions which may have quite different characteristics, for example:

discrete distributions versus continuous distributions

among discrete distributions: a finite versus an infinite number of possible values

among continuous distributions: different sets of possible values (for example, all real numbers x, x > 0, or x ∈ [0, 1]); symmetric versus skewed distributions.
In the narrower sense, individual distributions within a family differ in having different values of
the parameters of the distribution. The parameters determine the mean and variance of the
distribution, values of probabilities from it etc.
Example:
An opinion poll on a referendum, where each Xi is an answer to the question “Will you vote ‘Yes’
or ‘No’ to joining/leaving1 the European Union?” has answers recorded as Xi = 0 if ‘No’ and Xi =
1 if ‘Yes’. In a poll of 950 people, 513 answered ‘Yes’.
Here we need a family of discrete distributions with only two possible values (0 and 1).
The Bernoulli distribution (discussed below), which has one parameter 𝜋 (the probability
that Xi = 1) is appropriate.
Within the family of Bernoulli distributions, we use the one where the value of π is our best estimate based on the observed data. This is π̂ = 513/950 = 0.54 (where π̂ denotes an estimate of the parameter π).
A Bernoulli trial is an experiment with only two possible outcomes. We will number these
outcomes 1 and 0, and refer to them as ‘success’ and ‘failure’, respectively. Note these are
notional successes and failures – the success does not necessarily have to be a ‘good’ outcome,
nor a failure a ‘bad’ outcome!
The Bernoulli distribution is the distribution of the outcome of a single Bernoulli trial, named
after Jacob Bernoulli (1654 – 1705). This is the distribution of a random variable X with the
following probability function (a probability function is simply a function which returns the probability of a particular value of X):

p(x) = \pi^x (1 - \pi)^{1-x} \quad \text{for } x = 0, 1

Therefore:

P(X = 1) = p(1) = \pi

And:

P(X = 0) = p(0) = 1 - \pi

and no other values are possible. We could express this family of Bernoulli distributions in tabular form as follows:

X = x:   0       1
p(x):    1 − π   π
where 0 ≤ 𝜋 ≤ 1 is the probability of ‘success’. Note that just as a sample space represents all
possible values of a random variable, a parameter space represents all possible values of a
parameter. Clearly, as a probability, we must have that 0 ≤ 𝜋 ≤ 1.
Such a random variable X has a Bernoulli distribution with (probability) parameter 𝜋. This is
often written as:

X ~ Bernoulli(π)
If X ~ Bernoulli(π), then we can determine its expected value, i.e. its mean, as the usual probability-weighted average:

E(X) = 0 \times (1 - \pi) + 1 \times \pi = \pi
Hence we can view 𝜋 as the long-run average (proportion) of successes if we were to draw a
large random sample from this distribution.
Example:
Consider the toss of a fair coin, where X = 1 denotes ‘heads’ and X = 0 denotes ‘tails’. As this is
a fair coin, heads and tails are equally likely and hence π = 0.5, leading to the specific Bernoulli distribution:

P(X = 1) = 0.5 \quad \text{and} \quad P(X = 0) = 0.5

Hence,

E(X) = 0 \times 0.5 + 1 \times 0.5 = 0.5
such that if we tossed a fair coin a large number of times, we would expect the proportion of heads
to be 0.5 (and in practice the long-run proportion of heads would be approximately 0.5).
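This long-run behavior can be checked by simulation. A Python sketch (the seed and number of tosses are arbitrary illustrative choices):

```python
import random

def bernoulli_mean(pi):
    """E(X) for X ~ Bernoulli(pi): the probability-weighted average."""
    return 0 * (1 - pi) + 1 * pi

# Simulate a large number of fair coin tosses (True = heads)
rng = random.Random(1)
tosses = [rng.random() < 0.5 for _ in range(100_000)]
proportion_heads = sum(tosses) / len(tosses)
print(bernoulli_mean(0.5), round(proportion_heads, 3))  # 0.5 and roughly 0.5
```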
Suppose we perform n independent and identical Bernoulli trials, each with success probability π. Let X denote the total number of successes in these n trials; then X follows a binomial distribution with parameters n and π, where n ≥ 1 is a known integer and 0 ≤ π ≤ 1. This is often written as:
X ~ Bin(n, 𝜋)
If X ~ Bin(n, 𝜋), then:
E(X) = n 𝜋
Example:
A multiple choice test has 4 questions, each with 4 possible answers. James is taking the test,
but has no idea at all about the correct answers. So he guesses every answer and, therefore, has
the probability of 1/4 of getting any individual question correct.
Let X denote the number of correct answers in James’ test. X follows the binomial distribution with
n = 4 and 𝜋 = 0.25, i.e. we have:
X ~ Bin(4, 0.25)
For example, what is the probability that James gets 3 of the 4 questions correct?
Here it is assumed that the guesses are independent, and each has the probability 𝜋 = 0.25 of
being correct.
The probability of any particular sequence of 3 correct and 1 incorrect answers, for example 1110,
is 𝜋 3 (1−𝜋)1, where ‘1’ denotes a correct answer and ‘0’ denotes an incorrect answer.
However, we do not care about the order of the 0s and 1s, only about the number of 1s. So 1101
and 1011, for example, also count as 3 correct answers. Each of these also has the probability
of 𝜋 3 (1−𝜋)1.
The total number of sequences with three 1s (and, therefore, one 0) is the number of locations for the three 1s which can be selected in the sequence of 4 answers. This is \binom{4}{3} = 4 (see below).
Therefore, the probability of obtaining three 1s is:
\binom{4}{3} \pi^3 (1 - \pi)^1 = 4 \times (0.25)^3 \times (0.75)^1 \approx 0.0469
where \binom{n}{x} is the binomial coefficient – in short, the number of ways of choosing x objects out of n when sampling without replacement and when the order of the objects does not matter.
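James' probability can be checked in Python; the standard library's math.comb computes the binomial coefficient. The function below is an illustrative sketch of the binomial probability function:

```python
from math import comb

def binomial_pmf(x, n, pi):
    """P(X = x) for X ~ Bin(n, pi)."""
    return comb(n, x) * pi**x * (1 - pi)**(n - x)

# James guesses all 4 questions, each correct with probability 0.25
p3 = binomial_pmf(3, 4, 0.25)
print(round(p3, 4))  # about 0.0469

# Sanity check: the probabilities over x = 0, ..., 4 sum to 1
print(sum(binomial_pmf(x, 4, 0.25) for x in range(5)))
```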
Poisson Distribution
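The Poisson distribution is a discrete distribution over the counts x = 0, 1, 2, …, with probability function p(x) = e^{−λ} λ^x / x! and mean λ. A brief Python sketch, using an illustrative rate of λ = 3, verifying that the probabilities sum to 1 and that the mean equals λ:

```python
from math import exp, factorial

def poisson_pmf(x, lam):
    """P(X = x) for X ~ Poisson(lambda), x = 0, 1, 2, ..."""
    return exp(-lam) * lam**x / factorial(x)

lam = 3.0  # illustrative rate of events per interval
probs = [poisson_pmf(x, lam) for x in range(20)]
ev = sum(x * poisson_pmf(x, lam) for x in range(100))
print(round(sum(probs), 4))  # close to 1 (the tail beyond x = 19 is negligible)
print(round(ev, 4))          # E(X) = lambda = 3
```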
Poisson Approximation of the Binomial Distribution
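When n is large and π is small, the Bin(n, π) distribution is well approximated by a Poisson distribution with λ = nπ. A Python sketch, using illustrative values n = 1000 and π = 0.002 (so λ = 2), comparing the two probability functions side by side:

```python
from math import comb, exp, factorial

def binomial_pmf(x, n, pi):
    """P(X = x) for X ~ Bin(n, pi)."""
    return comb(n, x) * pi**x * (1 - pi)**(n - x)

def poisson_pmf(x, lam):
    """P(X = x) for X ~ Poisson(lambda)."""
    return exp(-lam) * lam**x / factorial(x)

n, pi = 1000, 0.002  # large n, small pi, so Bin(n, pi) ~ Poisson(n * pi)
for x in range(4):
    print(x, round(binomial_pmf(x, n, pi), 5), round(poisson_pmf(x, n * pi), 5))
```

The two columns of probabilities agree to roughly three decimal places, which is the practical appeal of the approximation: the Poisson probabilities avoid the very large binomial coefficients.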
ACTIVITIES/ASSESSMENT:
1. An event is:
a. any subset A of the sample space
b. a probability
c. a function which assigns probabilities to events
2. Probability…
a. is always some value in the unit interval [0, 1]
b. can exceed 1
c. can be negative
4. When there are N equally likely outcomes in a sample space, where n < N of them agree with
some event A, then:
a. P(A) = n/N
b. P(A) = N/n
c. P(A) = 1
MODULE 3 – DESCRIPTIVE STATISTICS
OVERVIEW:
Descriptive statistics are a simple, yet powerful, tool for data reduction and summarization.
Many of the questions for which people use statistics to help them understand and make decisions
involve types of variables which can be measured. Obvious examples include height, weight,
temperature, lifespan, rate of inflation and so on. When we are dealing with such a variable – for
which there is a generally recognized method of determining its value – we say that it is a
measurable variable. The numbers which we then obtain come ready-equipped with an order relation, i.e. we can always tell if two measurements are equal (to the available accuracy) or if one is greater or less than the other.
MODULE OBJECTIVES:
After successfully completing the module, you should be able to:
1. Explain the different levels of measurement of variables.
2. Explain the importance of data visualization and descriptive statistics.
3. Compute common descriptive statistics for measurable variables.
COURSE MATERIALS:
3.1 Classifying Variables
Data are obtained on any desired variable. For most of this module, we will be dealing with
variables which can be partitioned into two types.
1. Discrete data – things you can count. Examples include the number of passengers on a flight
and the number of telephone calls received each day in a call center. Observed values for these
will be 0, 1, 2, . . . (i.e. non-negative integers).
2. Continuous data – things you can measure. Examples include height, weight and time which
can be measured to several decimal places.
Of course, before we do any sort of statistical analysis, we need to collect data. Module 4 will
discuss a range of different techniques which can be employed to obtain a sample. For now, we
just consider some examples of situations where data might be collected.
A pre-election opinion poll asks 1,000 people about their voting intentions.
A market research survey asks how many hours of television people watch per week.
A census interviewer asks each householder how many of their children are receiving full-time education.
A polling organization might be asked to determine whether, say, the political preferences of
voters were in some way linked to their job type – for example, do supporters of Party X tend to
be blue-collar workers? Other market research organizations might be employed to determine
whether or not users were satisfied with the service which they obtained from a commercial
organization (a restaurant, say) or a department of local or central government (housing
departments being one important instance).
This means that we are concerned, from time to time, with categorical variables in addition to
measurable variables. So we can count the frequencies with which an item belongs to particular
categories. Examples include:
In cases (a) and (b) we are doing simple counts, within a sample, of a single category, while in
cases (c) and (d) we are looking at some kind of cross-tabulation between variables in two
categories: worker type vs. political preference in (c), and political preference vs. worker type in
(d) (they are not the same!).
There is no unambiguous and generally agreed way of putting worker types in order (in the way
that we can certainly say that 1 < 2). It is similarly impossible to rank (as the technical term has
it) many other categories of interest: for instance in combating discrimination against people
organizations might want to look at the effects of gender, religion, nationality, sexual orientation,
disability etc. but the whole point of combating discrimination is that different ‘varieties’ within each
category cannot be ranked.
In case (e), by contrast, there is a clear ranking – the restaurant would be pleased if there were
lots of people who expressed themselves satisfied rather than dissatisfied. Such considerations
lead us to distinguish two main types of variable, the second of which is itself subdivided.
For a nominal variable (like gender), the numbers (values) serve only as labels or tags for
identifying and classifying cases. When used for identification, there is a strict one-to-one
correspondence between the numbers and the cases. For example, your passport or driving
license number uniquely identifies you.
Any numerical values do not reflect the amount of the characteristic possessed by the cases.
Counting is the only arithmetic operation on values measured on a nominal scale, and hence only
a very limited number of statistics, all of which are based on frequency counts, can be determined.
Ordinal Categorical Variables
An ordinal variable has a ranking scale in which numbers are assigned to cases to indicate the
relative extent to which the cases possess some characteristic. It is possible to determine if a
case has more or less of a characteristic than some other case, but not how much more or less.
Any series of numbers can be assigned which preserves the ordered relationships between the
cases. In addition to the counting operation possible with nominal variables, ordinal variables
permit the use of statistics based on centiles such as percentiles, quartiles and the median.
Interval-level variables have scales where numerically equal distances on the scale represent
equal value differences in the characteristic being measured. For example, if the temperatures on
three days were 0, 10 and 20 degrees, then there is a constant 10-degree differential between 0
and 10, and 10 and 20. This allows comparisons of differences between values. The location of
the zero point is not fixed – both the zero point and the units of measurement are arbitrary. For
example, temperature can be measured in different (arbitrary) units, such as degrees Celsius and
degrees Fahrenheit.
Any positive linear transformation of the form y = a + bx will preserve the properties of the scale,
hence it is not meaningful to take ratios of scale values. Statistical techniques which may be used
include all of those which can be applied to nominal and ordinal variables. In addition, statistics such as the mean and standard deviation are applicable.
Ratio-level variables possess all the properties of nominal, ordinal and interval variables. A ratio
variable has an absolute zero point and it is meaningful to compute ratios of scale values. Only
proportionate transformations of the form y = bx, where b is a positive constant, are allowed. All
statistical techniques can be applied to ratio data.
Example:
Consider the following three variables describing different characteristics of countries. Later, we
consider a sample of 155 countries in 2002 for these variables.
Region of the country. This is a nominal variable which could be coded (in alphabetical
order) as follows: 1 = Africa, 2 = Asia, 3 = Europe, 4 = Latin America, 5 = Northern
America, 6 = Oceania.
The level of democracy, i.e. a democracy index, in the country. This could be an 11-
point ordinal scale from 0 (lowest level of democracy) to 10 (highest level of
democracy).
Gross domestic product per capita (GDP per capita) (i.e. per person, in $000s) which
is a ratio scale.
Region and the level of democracy are discrete, with the possible values of 1, 2, …, 6, and 0, 1,
2, …, 10, respectively. GDP per capita is continuous, taking any non-negative value.
Many discrete variables have only a finite number of possible values. The region variable has 6
possible values, and the level of democracy has 11 possible values.
The simplest possibility is a binary, or dichotomous, variable, with just two possible values. For
example, a person's gender could be recorded as 1 = female and 2 = male. A discrete variable
can also have an unlimited number of possible values. For example, the number of visitors to a
website in a day: 0, 1, 2, 3, 4,…
The levels of democracy have a meaningful ordering, from less democratic to more democratic
countries. The numbers assigned to the different levels must also be in this order, i.e. a larger
number = more democratic.
In contrast, different regions (Africa, Asia, Europe, Latin America, Northern America and Oceania)
do not have such an ordering. The numbers used for the region variable are just labels for different
regions. A different numbering (such as 6 = Africa, 5 = Asia, 1 = Europe, 3 = Latin America, 2 =
Northern America and 4 = Oceania) would be just as acceptable as the one we originally used.
There are two main aims when analyzing data:

1. Descriptive statistics – summarize the data which were collected, in order to make them more understandable.
understandable.
2. Statistical inference – use the observed data to draw conclusions about some broader
population.
Sometimes ‘1.’ is the only aim. Even when ‘2.’ is the main aim, ‘1.’ is still an essential first step.
Data do not just speak for themselves. There are usually simply too many numbers to make sense
of just by staring at them. Descriptive statistics attempt to summarize some key features of the
data to make them understandable and easy to communicate. These summaries may be
graphical or numerical (tables or individual summary statistics).
Rows of the data matrix correspond to different units (subjects/observations). Here, each unit is
a country.
The number of units in a dataset is the sample size, typically denoted by n. Here, n = 155
countries.
Columns of the data matrix correspond to variables, i.e. different characteristics of the units. Here,
region, the level of democracy, and GDP per capita are the variables.
Sample distribution
Here ‘%’ is the percentage of countries in a region, out of the 155 countries in the sample. This
is a measure of proportion (that is, relative frequency).
‘Cumulative %’ for a value of the variable is the sum of the percentages for that value and all
lower-numbered values.
A bar chart is the graphical equivalent of the table of frequencies. The next figure displays the
region variable data as a bar chart. The relative frequencies of each region are clearly visible.
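A frequency table of the kind underlying such a bar chart can be tabulated directly. Here is a Python sketch using hypothetical region codes (1 = Africa, …, 6 = Oceania); the 155-country dataset itself is not reproduced, so the counts below are illustrative only:

```python
from collections import Counter

# Hypothetical region codes for a small sample of countries
regions = [1, 1, 2, 3, 1, 2, 4, 3, 3, 5, 6, 2, 1, 4, 3]
n = len(regions)
counts = Counter(regions)

# Frequency, % and cumulative % for each region code
cum = 0.0
print("value  freq      %   cum %")
for value in sorted(counts):
    pct = 100 * counts[value] / n
    cum += pct
    print(f"{value:5d}  {counts[value]:4d}  {pct:5.1f}  {cum:6.1f}")
```

The '%' column gives the relative frequencies (proportions) and 'cum %' accumulates them in value order, matching the structure described above.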
If a variable has many distinct values, listing frequencies of all of them is not very practical. A
solution is to group the values into non-overlapping intervals, and produce a table or graph of the
frequencies within the intervals.
The most common graph used for this is a histogram. A histogram is like a bar chart, but without
gaps between bars, and often uses more bars (intervals of values) than is sensible in a table.
Histograms are usually drawn using statistical software, such as Minitab, R or SPSS.
You can let the software choose the intervals and the number of bars. A table of frequencies for
GDP per capita where values have been grouped into non-overlapping intervals is shown below.
The next figure shows a histogram of GDP per capita with a greater number of intervals to better
display the sample distribution.
So far, we have tried to summarize (some aspect of) the sample distribution of one variable at a
time. However, we can also look at two (or more) variables together. The key question is then
whether some values of one variable tend to occur frequently together with particular values of
another, for example high values with high values. This would be an example of an association
between the variables. Such associations are central to most interesting research questions, so
you will hear much more about them in the next topics.
Some common methods of descriptive statistics for two-variable associations are introduced here,
but only very briefly now and mainly through examples.
The best way to summarize two variables together depends on whether the variables have ‘few’
or ‘many’ possible values. We illustrate one method for each combination, as listed below.
A scatterplot shows the values of two measurable variables against each other, plotted as points
in a two-dimensional coordinate system.
Example: A plot of data for 164 countries is shown below which plots the following variables.
On the horizontal axis (the x-axis): a World Bank measure of ‘control of corruption’, where
high values indicate low levels of corruption.
On the vertical axis (the y-axis): GDP per capita in $.
Interpretation: it appears that virtually all countries with high levels of corruption have relatively
low GDP per capita. At lower levels of corruption there is a positive association, where countries
with very low levels of corruption also tend to have high GDP per capita.
Boxplots are useful for comparisons of how the distribution of a measurable variable varies
across different groups, i.e. across different levels of a categorical variable.
The figure below shows side-by-side boxplots of GDP per capita for the different regions.
GDP per capita in African countries tends to be very low. There is a handful of countries
with somewhat higher GDPs per capita (shown as outliers in the plot).
The median for Asia is not much higher than for Africa. However, the distribution in Asia
is very much skewed to the right, with a tail of countries with very high GDPs per capita.
The boxplots for Northern America and Oceania are not very useful, because they are
based on very few countries (two and three countries, respectively).
A (two-way) contingency table (or cross-tabulation) shows the frequencies in the sample of each
possible combination of the values of two categorical variables. Such tables often show the
percentages within each row or column of the table.
Example: The table below reports the results from a survey of 972 private investors. The variables
are as follows.
Numbers in parentheses are percentages within the rows. For example, 25.3 = (37/146) × 100.
Interpretation: look at the row percentages. For example, 17.8% of those aged under 45, but only
5.2% of those aged 65 and over, think that short-term gains are ‘very important’. Among the
respondents, the older age groups seem to be less concerned with quick profits than the younger
age groups.
Frequency tables, bar charts and histograms aim to summarize the whole sample distribution of
a variable. Next we consider descriptive statistics which summarize (describe) one feature of the
sample distribution in a single number: summary (descriptive) statistics.
We begin with measures of central tendency. These answer the question: where is the ‘center’ or
‘average’ of the distribution?
In formula, a generic variable is denoted by a single letter. In this module, usually X. However,
any other letter (Y, W etc.) could also be used, as long as it is used consistently. A letter with a
subscript denotes a single observation of a variable.
We use Xi to denote the value of X for unit i, where i can take values 1, 2, 3, … , n, and n is the
sample size. Therefore, the n observations of X in the dataset (the sample) are X1, X2, X3, … ,Xn.
These can also be written as Xi, for i = 1, … , n.
The Sample Mean
The sample mean (‘arithmetic mean’, ‘mean’ or ‘average’) is the most common measure of
central tendency.
It is defined as:

\bar{X} = \frac{1}{n} \sum_{i=1}^{n} X_i = \frac{X_1 + X_2 + \cdots + X_n}{n}
Why is the mean a good summary of the central tendency?
Let X(1), X(2), …, X(n) denote the sample values of X when ordered from the smallest to the largest,
known as the order statistics, such that:

X_{(1)} \leq X_{(2)} \leq \cdots \leq X_{(n)}
The (sample) median, q50, of a variable X is the value which is ‘in the middle’ of the ordered
sample.
For our country data n = 155, so q50 = X(78). From a table of frequencies, the median is the value
for which the cumulative percentage first reaches 50% (or, if a cumulative % is exactly 50%, the
average of the corresponding value of X and the next highest value).
The median can be determined from the frequency table of the level of democracy:
Sensitivity to Outliers
For the following small ordered dataset, the mean and median are both 4:

1, 2, 4, 5, 8

Suppose we now add one unusually large observation to give:

1, 2, 4, 5, 8, 100

The median increases only slightly, to 4.5, but the mean jumps to 20.
In general, the mean is affected much more than the median by outliers, i.e. unusually small or
large observations. Therefore, you should identify outliers early on and investigate them –
perhaps there has been a data entry error, which can simply be corrected. If deemed genuine
outliers, a decision has to be made about whether or not to remove them.
Due to its sensitivity to outliers, the mean, more than the median, is pulled toward the longer tail
of the sample distribution.
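The outlier example above can be reproduced with Python's statistics module:

```python
from statistics import mean, median

data = [1, 2, 4, 5, 8]
print(mean(data), median(data))  # both are 4

# Adding a single outlier shifts the mean dramatically, the median barely
with_outlier = data + [100]
print(mean(with_outlier), median(with_outlier))  # mean 20, median 4.5
```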
The Sample Mode
The (sample) mode of a variable is the value which has the highest frequency (i.e. appears most
often) in the data.
For our country data, the modal region is 1 (Africa) and the mode of the level of democracy is 0.
The mode is not very useful for continuous variables which have many different values, such as
GDP per capita.
A variable can have several modes (i.e. be multimodal). For example, GDP per capita has modes
0.8 and 1.9, both with 5 countries out of the 155. The mode is the only measure of central tendency
which can be used even when the values of a variable have no ordering, such as for the (nominal)
region variable.
Central tendency is not the whole story. The two sample distributions shown below have the same
mean, but they are clearly not the same. In one (red) the values have more dispersion (variation)
than in the other.
One might imagine these represent the sample distributions of the daily returns of two stocks,
both with a mean of 0%.
The black stock exhibits a smaller variation, hence we may view this as a safer stock – although there is little chance of a large positive daily return, there is equally little chance of a large negative daily return. In contrast, the red stock would be classified as a riskier stock – now there is a non-negligible chance of a large positive daily return, however this coincides with an equally non-negligible chance of a large negative daily return, i.e. a loss.
Example: A small example determining the sum of the squared deviations from the (sample)
mean, used to calculate common measures of dispersion.
Example: Consider the following simple data set:
Sample Quartiles
The median, q50, is basically the value which divides the sample into the smallest 50% of observations and the largest 50%. If we consider other percentage splits, we get other (sample) quantiles (percentiles), qc. Some special quantiles are given below.
The first quartile, q25 or Q1, is the value which divides the sample into the smallest 25% of observations and the largest 75%.
The third quartile, q75 or Q3, divides the sample into the smallest 75% of observations and the largest 25%.
The extremes in this spirit are the minimum, X(1) (the ‘0% quantile’, so to speak), and the maximum, X(n) (the ‘100% quantile’).
These are no longer ‘in the middle’ of the sample, but they are more general measures of location
of the sample distribution. Two measures based on quartile-type statistics are the:
range: X(n) – X(1) = maximum – minimum
interquartile range (IQR): IQR = q75 – q25 = Q3 – Q1.
The range is, clearly, extremely sensitive to outliers, since it depends on nothing but the extremes
of the distribution, i.e. the minimum and maximum observations. The IQR focuses on the middle
50% of the distribution, so it is completely insensitive to outliers.
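These measures can be computed with Python's standard library. Note that different software packages use slightly different quartile conventions (here the 'inclusive' method of statistics.quantiles), so results may differ slightly from hand calculations. A sketch using the small dataset from the earlier outlier example:

```python
from statistics import quantiles

data = [1, 2, 4, 5, 8, 100]
q1, q2, q3 = quantiles(data, n=4, method="inclusive")
iqr = q3 - q1
data_range = max(data) - min(data)

print(q1, q2, q3)                       # the three sample quartiles
print("IQR:", iqr, "range:", data_range)
```

The range is dragged out to 99 by the single outlier, while the IQR, which depends only on the middle 50% of the data, stays small.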
Boxplots
A boxplot (in full, a box-and-whiskers plot) summarizes some key features of a sample
distribution using quartiles. The plot is comprised of the following.
The line inside the box, which is the median.
The box, whose edges are the first and third quartiles (Q1 and Q3). Hence the box
captures the middle 50% of the data. Therefore, the length of the box is the interquartile
range.
The bottom whisker extends either to the minimum or up to a length of 1.5 times the
interquartile range below the first quartile, whichever is closer to the first quartile.
The top whisker extends either to the maximum or up to a length of 1.5 times the
interquartile range above the third quartile, whichever is closer to the third quartile.
Points beyond 1.5 times the interquartile range below the first quartile or above the third
quartile are regarded as outliers, and plotted as individual points.
A much longer whisker (and/or outliers) in one direction relative to the other indicates a skewed
distribution, as does a median line not in the middle of the box. The boxplot below is of GDP per
capita using the sample of 155 countries.
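The whisker and outlier rules above can be computed directly, following the 1.5 × IQR convention described (same invented data set as before):

```python
import statistics

data = [2, 4, 4, 5, 7, 9, 30]  # invented data set

q1, _, q3 = statistics.quantiles(data, n=4, method='inclusive')
iqr = q3 - q1

# Fences: 1.5 * IQR beyond the box edges (the usual boxplot convention)
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Points beyond the fences are plotted individually as outliers
outliers = [x for x in data if x < lower_fence or x > upper_fence]

# Whiskers extend to the most extreme observations inside the fences
lower_whisker = min(x for x in data if x >= lower_fence)
upper_whisker = max(x for x in data if x <= upper_fence)

print(outliers)                      # [30]
print(lower_whisker, upper_whisker)  # 2 9
```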
The normal distribution is by far the most important probability distribution in statistics.
This is for three broad reasons.
Many variables have distributions which are approximately normal, for example heights of
humans or animals, and weights of various products.
The normal distribution has extremely convenient mathematical properties, which make it
a useful default choice of distribution in many contexts.
Even when a variable is not itself even approximately normally distributed, functions of
several observations of the variable (sampling distributions) are often approximately
normal, due to the central limit theorem (covered in Module 5.5). Because of this, the
normal distribution has a crucial role in statistical inference. This will be discussed later in
the course.
The figure below shows three normal distributions with different means and/or variances.
N(0, 1) and N(5, 1) have the same dispersion but different location: the N(5, 1) curve is
identical to the N(0, 1) curve, but shifted 5 units to the right.
N(0, 1) and N(0, 9) have the same location but different dispersion: the N(0, 9) curve is
centered at the same value, 0, as the N(0, 1) curve, but spread out more widely.
We now consider one of the convenient properties of the normal distribution. Suppose X is a
random variable, and we consider the linear transformation Y = aX + b, where a and b are
constants.
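The convenient property here is that a linear transformation of a normal random variable is itself normal: if X ~ N(𝜇, 𝜎²), then Y = aX + b ~ N(a𝜇 + b, a²𝜎²). A minimal sketch using Python's statistics.NormalDist, which supports scaling and shifting directly:

```python
from statistics import NormalDist

X = NormalDist(mu=0.0, sigma=1.0)   # X ~ N(0, 1)
Y = 3 * X + 5                       # linear transformation Y = aX + b, a = 3, b = 5

# Y ~ N(a*mu + b, a^2 * sigma^2) = N(5, 9)
print(Y.mean, Y.stdev, Y.variance)  # 5.0 3.0 9.0
```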
3.6 Variance of Random Variables
One very important average associated with a distribution is the expected value of the square of
the deviation of the random variable from its mean, 𝜇. This can be seen to be a measure – not
the only one, but the most widely used by far – of the dispersion of the distribution and is known
as the variance of the random variable. We distinguish between two different types of variance:
the sample variance, S2, which is a measure of the dispersion in a sample dataset
the population variance, Var(X) = 𝜎2, which reflects the variance of the whole population,
i.e. the variance of a probability distribution.
In essence, this is simply an average: specifically, the average squared deviation of the data
about the sample mean. (The division by n – 1, rather than by n, ensures that the sample variance
estimates the population variance correctly on average, making it an ‘unbiased estimator’.) We
define the population variance in an analogous way, i.e. as the average squared deviation about
the population mean.
and this represents the dispersion of a (discrete) probability distribution.
Example:
Returning to the example of a fair die, we had the following probability distribution:
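For a fair die, each outcome 1 to 6 has probability 1/6, so the mean and variance of the distribution can be computed directly; exact fractions keep the arithmetic transparent:

```python
from fractions import Fraction

# Fair die: X takes values 1..6, each with probability 1/6
outcomes = range(1, 7)
p = Fraction(1, 6)

mu = sum(x * p for x in outcomes)               # E(X) = 7/2 = 3.5
var = sum((x - mu) ** 2 * p for x in outcomes)  # E[(X - mu)^2] = 35/12

print(mu, var)  # 7/2 35/12
```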
Example:
Some Probabilities Around the Mean
The first two of these are illustrated graphically in the figure below.
ACTIVITIES/ASSESSMENT:
Given: Four observations are obtained: 7, 9, 10 and 11. For these four values, derive the following:
6. State how many of the four observations lie in the interval between 𝑋̅ − 2 × 𝑆 and 𝑋̅ + 2 × 𝑆.
Watch:
Finding mean, median, and mode | Descriptive statistics | Probability and Statistics | Khan
Academy
https://www.youtube.com/watch?v=k3aKKasOmIw
MODULE 4 – INFERENTIAL STATISTICS
OVERVIEW:
Statistical inference involves inferring unknown characteristics of a population based on
observed sample data. Inference can be subdivided into two main branches: estimation, our
focus in this module, and hypothesis testing, our focus in Module 5. Conceptually, what are we
trying to achieve? The word inference means to infer something about a wider population based
on an observed sample of data.
When we do statistical analysis, we tend to view the data we observe as a sample drawn
from some wider population. In everyday use, the term population may refer to the population of
a country or a city. We may indeed be studying those kinds of populations, but we are not
confined to such a simplistic definition; a population does not even have to refer to human
beings. It may be the population of companies whose shares are listed on some stock
exchange, the population of fish in the sea, planets in the universe, you name it. At the heart of
statistical inference is the assumption that our sample is fairly representative of that wider
population, and our goal when selecting a sample in the first place is to achieve this
representativeness. That may sound straightforward enough, but it is perhaps easier said than
done.
MODULE OBJECTIVES:
After successfully completing the module, you should be able to:
1. Summarize common data collection methods.
2. Explain what a sampling distribution is.
3. Discuss the principles of point and interval estimation.
COURSE MATERIALS:
4.1 Introduction to Sampling
Sampling is a key component of any research design. The key to the use of statistics in research
is being able to take data from a sample and make inferences about a large population. This idea
is depicted below.
Sampling design involves several basic questions.
Should a sample be taken?
If so, what process should be followed?
What kind of sample should be taken?
How large should it be?
Sample or Census?
The conditions which favor the use of a sample or census are summarized in the table below. Of
course, in practice, some of our factors may favor a sample while others favor a census, in which
case a balanced judgment is required.
Classification of Sampling Techniques
We draw a sample from the target population, which is the collection of elements or objects
which possess the information sought by the researcher and about which inferences are to be
made. We now consider the different types of sampling techniques which can be used in practice,
which can be decomposed into non-probability sampling techniques and probability
sampling techniques.
Non-probability sampling techniques are characterized by the fact that some units in the
population do not have a chance of selection in the sample. Other individual units in the population
have an unknown probability of being selected. There is also an inability to measure sampling
error. Examples of such techniques are:
convenience sampling
judgmental sampling
quota sampling
snowball sampling.
We now consider each of the listed techniques, explaining their strengths and weaknesses. To
illustrate each, we will use a running example of 25 students (labelled ‘1’ to ‘25’) spread across
five classes (labelled ‘A’ to ‘E’) as follows:
Convenience Sampling
Convenience sampling attempts to obtain a sample of convenient elements (hence the name).
Often, respondents are selected because they happen to be in the right place at the right time.
Examples include using students and members of social organizations; also ‘people-in-the-street’
interviews.
Suppose class D happens to assemble at a convenient time and place, so all elements (students)
in this class are selected. The resulting sample consists of students 16, 17, 18, 19 and 20. Note
in this case there are no students selected from classes A, B, C and E.
Strengths of convenience sampling include being the cheapest, quickest and most convenient
form of sampling. Weaknesses include selection bias and lack of a representative sample.
Judgmental Sampling
Judgmental sampling is a form of convenience sampling in which the population elements are
selected based on the judgment of the researcher. Examples include purchase engineers being
selected in industrial market research; also expert witnesses used in court. Suppose a researcher
believes classes B, C and E to be ‘typical’ and ‘convenient’. Within each of these classes one or
two students are selected based on typicality and convenience. The resulting sample here
consists of students 8, 10, 11, 13 and 24. Note in this case there are no students selected from
classes A and D.
Judgmental sampling is achieved at low cost, is convenient, not particularly time-consuming and
good for ‘exploratory’ research designs. However, it does not allow generalizations and is
subjective due to the judgment of the researcher.
Quota Sampling
Quota sampling may be viewed as two-stage restricted judgmental sampling. The first stage
consists of developing control categories, or quota controls, of population elements. In the second
stage, sample elements are selected based on convenience or judgment. Suppose a quota of
one student from each class is imposed. Within each class, one student is selected based on
judgment or convenience. The resulting sample consists of students 3, 6, 13, 20 and 22.
Quota sampling is advantageous in that a sample can be controlled for certain characteristics.
However, it suffers from selection bias and there is no guarantee of representativeness of the
sample.
Snowball Sampling
In snowball sampling an initial group of respondents is selected, usually at random. After being
interviewed, these respondents are asked to identify others who belong to the target population
of interest. Subsequent respondents are selected based on these referrals. Suppose students 2
and 9 are selected randomly from classes A and B. Student 2 refers students 12 and 13, while
student 9 refers student 18. The resulting sample consists of students 2, 9, 12, 13 and 18. Note
in this case there are no students from class E included in the sample.
Snowball sampling has the major advantage of being able to increase the chance of locating the
desired characteristic in the population and is also fairly cheap. However, it can be time-
consuming.
We have previously seen that the term target population represents the collection of units
(people, objects etc.) in which we are interested. In the absence of time and budgetary constraints
we conduct a census, that is a total enumeration of the population. Its advantage is that there is
no sampling error because all population units are observed and so there is no estimation of
population parameters. Due to the large size, N, of most populations, an obvious disadvantage
with a census is cost, so it is often not feasible in practice. Even with a census non-sampling error
may occur, for example if we have to resort to using cheaper (hence less reliable) interviewers
who may erroneously record data, misunderstand a respondent etc.
So we select a sample, that is a certain number of population members are selected and studied.
The selected members are known as elementary sampling units. Sample surveys (hereafter
‘surveys’) are how new data are collected on a population and tend to be based on samples rather
than a census. Selected respondents may be contacted in a variety of methods such as face-to-
face interviews, telephone, mail or email questionnaires.
Sampling error will occur (since not all population units are observed). However, non-sampling
error should be less since resources can be used to ensure high quality interviews or to check
completed questionnaires.
Types of Error
Several potential sources of error can affect a research design which we do our utmost to control.
The ‘total error’ represents the variation between the true value of a parameter in the population
of the variable of interest (such as a population mean) and the observed value obtained from the
sample. Total error is composed of two distinct types of error in sampling design.
Sampling error occurs as a result of us selecting a sample, rather than performing a
census (where a total enumeration of the population is undertaken).
- It is attributable to random variation due to the sampling scheme used.
- For probability sampling, we can estimate the statistical properties of the
sampling error, i.e. we can compute (estimated) standard errors which
facilitate the use of hypothesis testing and the construction of confidence
intervals.
Non-sampling error is a result of (inevitable) failures of the sampling scheme. In practice
it is very difficult to quantify this sort of error, other than through separate investigation. We
distinguish between two sorts of non-sampling error:
- Selection bias – this may be due to (1) the sampling frame not being equal
to the target population, (2) the sampling scheme not being strictly adhered
to, or (3) non-response bias.
- Response bias – the actual measurements might be wrong, for example
ambiguous question wording, misunderstanding of a word in a
questionnaire, or sensitivity of information which is sought. Interviewer bias
is another aspect of this, where the interaction between the interviewer and
interviewee influences the response given in some way, either intentionally
or unintentionally, such as through leading questions, the dislike of a
particular social group by the interviewer, the interviewer's manner or lack
of training, or perhaps the loss of a batch of questionnaires from one local
post office. These could all occur in an unplanned way and bias your survey
badly.
Both kinds of error can be controlled or allowed for more effectively by a pilot survey. A pilot
survey is used:
to find the standard error which can be attached to different kinds of questions and hence
to underpin the sampling design chosen
to sort out non-sampling questions, such as:
- do people understand the questionnaires?
- are our interviewers working well?
- are there particular organizational problems associated with this enquiry?
Probability Sampling
Probability sampling techniques mean every population element has a known, non-zero
probability of being selected in the sample. Probability sampling makes it possible to estimate the
margins of sampling error, therefore all statistical techniques (such as confidence intervals and
hypothesis testing – covered later in the module) can be applied.
In order to perform probability sampling, we need a sampling frame which is a list of all population
elements. However, we need to consider whether the sampling frame is:
i. Adequate (does it represent the target population?)
ii. Complete (are there any missing units, or duplications?)
iii. Accurate (are we researching dynamic populations?)
iv. Convenient (is the sampling frame readily accessible?).
In this section and Section 4.3 we illustrate each of these techniques using the same example
as in Section 4.1, i.e. we consider 25 students, numbered 1 to 25, spread across five classes,
A to E.
In a simple random sample (SRS) each element in the population has a known and equal probability of
selection. Each possible sample of a given size, n, has a known and equal probability of being
the sample which is actually selected. This implies that every element is selected independently
of every other element.
Suppose we select five random numbers (using a random number generator) from 1 to 25.
Suppose the random number generator returns 3, 7, 9, 16 and 24. The resulting sample therefore
consists of students 3, 7, 9, 16 and 24. Note in this case there are no students from class C.
SRS is simple to understand and results are readily projectable. However, there may be difficulty
constructing the sampling frame, lower precision (relative to other probability sampling methods)
and there is no guarantee of sample representativeness.
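A quick sketch of drawing an SRS of n = 5 from the 25 students using Python's random module (the seed is arbitrary, chosen only to make the illustration reproducible):

```python
import random

random.seed(20053)  # fixed seed so the draw is reproducible

students = list(range(1, 26))               # students labelled 1 to 25
srs = sorted(random.sample(students, k=5))  # every size-5 subset is equally likely

print(srs)  # five distinct student labels
```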
4.3 Further Random Sampling
Systematic Sampling
In systematic sampling, the sample is chosen by selecting a random starting point and then
picking every ith element in succession from the sampling frame. The sampling interval, i, is
determined by dividing the population size, N, by the sample size, n, and rounding to the nearest
integer. When the ordering of the elements is related to the characteristic of interest, systematic
sampling increases the representativeness of the sample. If the ordering of the elements
produces a cyclical pattern, systematic sampling may actually decrease the representativeness
of the sample.
Example:
Systematic sampling may or may not increase representativeness – it depends on whether there
is any ‘ordering’ in the sampling frame. It is easier to implement relative to SRS.
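A minimal sketch of the procedure for the running example (N = 25, n = 5, so the sampling interval is i = 5); the random starting point is illustrative:

```python
import random

random.seed(1)

N, n = 25, 5
i = round(N / n)              # sampling interval i = N/n = 5
start = random.randint(1, i)  # random starting point in 1..i

# Every i-th element from the random start onwards
sample = [start + i * k for k in range(n)]

print(sample)  # five equally spaced student labels
```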
Stratified Sampling
Stratified sampling is a two-step process in which the population is partitioned (divided up) into
subpopulations known as strata. The strata should be mutually exclusive and collectively
exhaustive in that every population element should be assigned to one and only one stratum and
no population elements should be omitted. Next, elements are selected from each stratum by a
random procedure, usually SRS. A major objective of stratified sampling is to increase the
precision of statistical inference without increasing cost.
The elements within a stratum should be as homogeneous as possible (i.e. as similar as possible),
but the elements between strata should be as heterogeneous as possible (i.e. as different as
possible). The stratification factors should also be closely related to the characteristic of
interest. Finally, the factors (variables) should decrease the cost of the stratification process by
being easy to measure and apply.
In proportionate stratified sampling, the size of the sample drawn from each stratum is
proportional to the relative size of that stratum in the total population. In disproportionate
(optimal) stratified sampling, the size of the sample from each stratum is proportional to the
relative size of that stratum and to the standard deviation of the distribution of the characteristic
of interest among all the elements in that stratum. Suppose we randomly select a number from 1
to 5 for each class (stratum) A to E. This might result, say, in the stratified sample consisting of
students 4, 7, 13, 19 and 21. Note in this case one student is selected from each class.
Stratified sampling includes all important subpopulations and ensures a high level of precision.
However, sometimes it might be difficult to select relevant stratification factors and the
stratification process itself might not be feasible in practice if it was not known to which stratum
each population element belonged.
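A sketch of proportionate stratified sampling for the running example, assuming students 1–5 are in class A, 6–10 in B, and so on (consistent with the class memberships used in the earlier examples); the random draws themselves are illustrative:

```python
import random

random.seed(7)

# Classes A..E as strata; five students per class (assumed layout)
strata = {
    'A': range(1, 6), 'B': range(6, 11), 'C': range(11, 16),
    'D': range(16, 21), 'E': range(21, 26),
}

# Proportionate allocation: SRS of one student within each stratum
sample = [random.choice(list(stratum)) for stratum in strata.values()]

print(sample)  # exactly one student from each of A, B, C, D, E
```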
Cluster Sampling
In cluster sampling the target population is first divided into mutually exclusive and collectively
exhaustive subpopulations known as clusters. A random sample of clusters is then selected,
based on a probability sampling technique such as SRS. For each selected cluster, either all the
elements are included in the sample (one-stage cluster sampling), or a sample of elements is
drawn probabilistically (two-stage cluster sampling).
Suppose we randomly select three clusters: B, D and E. Within each cluster, we randomly select
one or two elements. The resulting sample here consists of students 7, 18, 20, 21 and 23. Note
in this case there are no students selected from clusters A and C.
Cluster sampling is easy to implement and cost effective. However, the technique suffers from a
lack of precision and it can be difficult to compute and interpret results.
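A sketch of two-stage cluster sampling under the same assumed class layout (students 1–5 in A, 6–10 in B, etc.); the number of clusters and of elements drawn per cluster is illustrative:

```python
import random

random.seed(11)

clusters = {
    'A': [1, 2, 3, 4, 5], 'B': [6, 7, 8, 9, 10], 'C': [11, 12, 13, 14, 15],
    'D': [16, 17, 18, 19, 20], 'E': [21, 22, 23, 24, 25],
}

# Stage 1: randomly select three of the five clusters
chosen = random.sample(list(clusters), k=3)

# Stage 2: SRS of two elements within each selected cluster
sample = []
for c in chosen:
    sample.extend(random.sample(clusters[c], k=2))

print(sorted(sample))  # six students, drawn from exactly three classes
```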
Multistage Sampling
In multistage sampling selection is performed at two or more successive stages. This technique
is often adopted in large surveys. At the first stage, large ‘compound’ units are sampled (primary
units), and several sampling stages of this type may be performed until we at last sample the
basic units.
The technique is commonly used in cluster sampling so that we are at first sampling the main
clusters, and then clusters within clusters etc. We can also use multistage sampling with mixed
techniques, i.e. cluster sampling at Stage 1 and stratified sampling at Stage 2 etc.
A simple random sample is a sample selected by a process where every possible sample (of
the same size, n) has the same probability of selection. The selection process is left to chance,
therefore eliminating the effect of selection bias. Due to the random selection mechanism, we
do not know (in advance) which sample will occur. Every population element has a known, non-
zero probability of selection in the sample, but no element is certain to appear.
Example:
We consider all possible samples of size n = 2 (without replacement, i.e. once an object has been
chosen it cannot be selected again).
Since this is a simple random sample, each sample has a probability of selection of 1/15.
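The enumeration can be reproduced with itertools; the population itself is not listed in the text, so a set of N = 6 labelled objects is assumed (giving C(6, 2) = 15 possible samples, matching the 1/15 above):

```python
from itertools import combinations

# Assumed population of N = 6 labelled objects
population = ['A', 'B', 'C', 'D', 'E', 'F']

# All possible samples of size n = 2, without replacement
samples = list(combinations(population, 2))

print(len(samples))      # 15
print(1 / len(samples))  # selection probability of each sample
```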
A population has particular characteristics of interest such as the mean, 𝜇, and variance, 𝜎2.
Collectively, we refer to these characteristics as parameters. If we do not have population data,
the parameter values will be unknown.
‘Statistical inference’ is the process of estimating the (unknown) parameter values using the
(known) sample data.
We use a statistic (called an estimator) calculated from sample observations to provide a point
estimate of a parameter.
4.5 Sampling Distribution of the Sample Mean
Like any distribution, a sampling distribution has a mean and a variance; together, these allow
us to assess how ‘good’ an estimator is. First, consider the mean. We seek an estimator which
does not mislead us systematically, so the ‘average’ (mean) value of the estimator, over all
possible samples, should be equal to the population parameter itself.
Returning to our example:
An important difference between a sampling distribution and other distributions is that the values
in a sampling distribution are summary measures of whole samples (i.e. statistics, or estimators)
rather than individual observations.
Formally, the mean of a sampling distribution is called the expected value of the estimator,
denoted by E(∙).
An unbiased estimator has its expected value equal to the parameter being estimated. For our
example, E(𝑋̅) = 6 = 𝜇.
Fortunately, the sample mean 𝑋̅ is always an unbiased estimator of 𝜇 in simple random sampling,
regardless of the:
sample size, n
distribution of the (parent) population.
This is a good illustration of a population parameter (here, 𝜇) being estimated by its sample
counterpart (here, 𝑋̅).
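The unbiasedness property can be checked exhaustively for a small invented population: averaging the sample mean over every possible sample recovers 𝜇 exactly.

```python
import statistics
from itertools import combinations

population = [2, 4, 6, 8, 10]  # invented population with mean mu = 6
mu = statistics.mean(population)

# Sampling distribution of X-bar for n = 2, sampling without replacement:
# one sample mean per possible sample
sample_means = [statistics.mean(s) for s in combinations(population, 2)]

# E(X-bar) = mu: the mean of all possible sample means equals the population mean
print(statistics.mean(sample_means), mu)
```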
The unbiasedness of an estimator is clearly desirable. However, we also need to take into account
the dispersion of the estimator's sampling distribution. Ideally, the possible values of the estimator
should not vary much around the true parameter value. So, we seek an estimator with a small
variance.
Recall the variance is defined to be the mean of the squared deviations about the mean of the
distribution. In the case of sampling distributions, it is referred to as the sampling variance.
Returning to our example:
We now consider the relationship between 𝜎2 and the sampling variance. Intuitively, a larger 𝜎2
should lead to a larger sampling variance. For population size N and sample size n, we note the
following result when sampling without replacement:
We use the term standard error to refer to the standard deviation of the sampling distribution,
so:
Some implications are the following:
As the sample size n increases, the sampling variance decreases, i.e. the precision
increases.
Provided the sampling fraction, n / N, is small, the term (N – n)/(N – 1) ≈ 1.
Returning to our example, the larger the sample, the less variability there is between samples.
There is a striking improvement in the precision of the estimator, because the variability has
decreased considerably: the range of possible 𝑥̅ values narrows from 3.5–8.0 to 5.0–7.25, and
the sampling variance is reduced from 1.6 to 0.4.
The factor (N – n)/(N – 1) decreases steadily as n→N. When n = 1 the factor equals 1, and when
n = N it equals 0.
When sampling without replacement, increasing n must increase precision since less of the
population is left out. In much practical sampling N is very large (for example, several million),
while n is comparatively small (at most 1,000, say).
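Putting the pieces together, the standard result referred to above is Var(𝑋̅) = (𝜎²/n) × (N – n)/(N – 1) when sampling without replacement. A small sketch with invented numbers shows how the correction factor fades as N grows:

```python
def sampling_variance(sigma2, n, N):
    """Variance of the sample mean under SRS without replacement."""
    return (sigma2 / n) * (N - n) / (N - 1)

sigma2 = 2.0  # invented population variance

# Small population: the factor (N - n)/(N - 1) clearly matters
print(sampling_variance(sigma2, n=4, N=20))

# Large population: (N - n)/(N - 1) is approximately 1, so the result is ~ sigma2/n
print(sampling_variance(sigma2, n=4, N=5_000_000))
```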
Example:
Example:
A point estimate (such as a sample mean, 𝑥̅ ) is our ‘best guess’ of an unknown population
parameter (such as a population mean, 𝜇) based on sample data. Although:
𝐸(𝑋̅) = 𝜇
meaning that on average the sample mean is equal to the population mean, as it is based on a
sample there is some uncertainty (imprecision) in the accuracy of the estimate. Different random
samples would tend to lead to different observed sample means. Confidence intervals
communicate the level of imprecision by converting a point estimate into an interval estimate.
Formally, an x% confidence interval covers the unknown parameter with x% probability over
repeated samples. The shorter the confidence interval, the more precise the estimate.
If we assume we have either (i) known 𝜎, or (ii) unknown 𝜎 but a large sample size, say n ≥ 50,
then the endpoints of a confidence interval for a single mean are:
𝑥̅ ± 𝑧 × 𝜎/√𝑛 (or 𝑥̅ ± 𝑧 × 𝑠/√𝑛 when 𝜎 is unknown)
Here 𝑥̅ is the sample mean, 𝜎 is the population standard deviation, s is the sample standard
deviation, n is the sample size and z is the confidence coefficient, reflecting the confidence
level.
More simply, we can view the confidence interval for a mean as:
best guess ± margin of error
Therefore, we see that there are three influences on the size of the margin of error (and hence
on the width of the confidence interval). Specifically:
Confidence Coefficients
Other levels of confidence pose no problem, but require a different confidence coefficient. For
large n, we obtain this coefficient from the standard normal distribution.
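For example, a 95% confidence level gives z ≈ 1.96. A sketch, with an invented sample, of obtaining the coefficient from the standard normal distribution and forming the interval:

```python
from math import sqrt
from statistics import NormalDist

def conf_interval(xbar, s, n, level=0.95):
    """Large-sample confidence interval: best guess +/- margin of error."""
    z = NormalDist().inv_cdf((1 + level) / 2)  # confidence coefficient
    margin = z * s / sqrt(n)
    return xbar - margin, xbar + margin

# Invented sample: mean 50, standard deviation 10, n = 100
low, high = conf_interval(50, 10, 100, level=0.95)
print(round(low, 2), round(high, 2))  # 48.04 51.96
```

Raising the level to 99% would use z ≈ 2.576 and widen the interval accordingly.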
Example:
ACTIVITIES/ASSESSMENT:
9. Which of the following does not influence the size of the margin of error when considering the
confidence interval for a population mean?
a. the sample size, n
b. the sample mean, 𝑥̅
c. the standard deviation, 𝜎 or s
10. Other things equal, as the confidence level increases the margin of error:
a. decreases
b. stays the same
c. increases
11. Non-probability sampling techniques, such as convenience and quota sampling, suffer from
selection bias.
a. True
b. False
12. In probability sampling:
a. there is a known, non-zero probability of a population element being selected.
b. some population elements have a zero probability of being selected, while others have
an unknown probability of being selected.
14. We use a statistic (called an estimator) calculated from sample observations to provide a point
estimate of a:
a. parameter
b. random variable
Watch:
MODULE 5 – HYPOTHESIS TESTING
OVERVIEW:
We continue statistical inference with an examination of the fundamentals of hypothesis
testing – testing a claim or theory about a population parameter. Can we find evidence to support
or refute a claim or theory? Think of hypothesis testing as simple decision theory: we choose
between two competing statements, and the choice is binary, such that based on some data,
some evidence, we conclude in favor of one statement (hypothesis) or the other. A legal or
judicial analogy is a natural starting point, because many, if not all, of you are familiar with the
concept of a courtroom and a jury.
So let's imagine the following scenario. Suppose I have been a naughty boy – so naughty,
in fact, that the police have arrested me on suspicion of committing some crime, let's say murder.
Imagine that the police have completed their investigation and we are now in the courtroom. I am
the defendant in this trial for murder and you are a member of the jury. A jury, in essence,
conducts a hypothesis test: it chooses between two competing statements, trying to determine
whether the defendant is guilty or not guilty of the criminal offense. We will relay the statistical
form of testing onto this legal analogy; once we have done that, we will be in a position to
consider more statistical versions of hypothesis testing. In our statistical world of hypothesis
testing we have two competing statements known as hypotheses: a so-called null hypothesis,
H0, and an alternative hypothesis, H1. The jury would set the following hypotheses: H0, that the
defendant is not guilty of the alleged crime, and H1, that the defendant is guilty of said crime.
With the presumption of innocence, the jury has to assume that the defendant is innocent, i.e.
not guilty of the crime, until the evidence becomes sufficiently overwhelming that being not guilty
is an unlikely scenario, at which point the jury would return a verdict of guilty.
MODULE OBJECTIVES:
After successfully completing the module, you should be able to:
1. Explain the underlying philosophy of hypothesis testing.
2. Distinguish between the different inferential errors in testing.
3. Conduct simple tests of common parameters.
COURSE MATERIALS:
Module 5 considers hypothesis testing, i.e. decision theory whereby we make a binary decision
between two competing hypotheses:
The binary decision is whether to ‘reject H0’ or ‘fail to reject H0’. Before we consider statistical
tests, we begin with a legal analogy – the decision of a jury in a court trial.
Example:
In a criminal court, defendants are put on trial because the police suspect they are guilty of a
crime. Of course, the police are biased due to their suspicion of guilt so determination of whether
a defendant is guilty or not guilty is undertaken by an independent (and hopefully objective) jury.
In most jury-based legal systems around the world there is the ‘presumption of innocence until
proven guilty’. This equates to the jury initially believing H0, which is the working hypothesis. A
jury must continue to believe in the null hypothesis until they feel the evidence presented to the
court proves guilt ‘beyond a reasonable doubt’, which represents the burden of proof required to
establish guilt. In our statistical world of hypothesis testing, this will be known as the significance
level, i.e. the amount of evidence needed to reject H0.
The jury uses the following decision rule to make a judgment. If the evidence is:
sufficiently inconsistent with the defendant being not guilty, then reject the null hypothesis
(i.e. convict)
not indicating guilt beyond a reasonable doubt, then fail to reject the null hypothesis – note
that failing to prove guilt does not prove that the defendant is innocent!
Miscarriages of Justice
In a perfect world juries would always convict the guilty and acquit the innocent. Sadly, it is not a
perfect world and so sometimes juries reach incorrect decisions, i.e. convict the innocent and
acquit the guilty. One hopes juries get it right far more often than they get it wrong, but this is an
important reminder that miscarriages of justice do occur from time to time, demonstrating that the
jury system is not infallible!
Statistical hypothesis testing also risks making mistakes which we will formally define as Type I
errors and Type II errors in Module 5.2.
Note: The jury is not testing whether the defendant is guilty, rather the jury is testing the hypothesis
of not guilty. Failure to reject H0 does not prove innocence, rather the jury concludes the evidence
is not sufficiently inconsistent with H0 to indicate guilt beyond a reasonable doubt. Admittedly,
what constitutes a ‘reasonable doubt’ is subjective which is why juries do not always reach a
unanimous verdict.
5.2 Type I and Type II Errors
In any hypothesis test there are two types of inferential decision error which could be committed.
Clearly, we would like to reduce the probabilities of these errors as much as possible. These two
types of error are called a Type I error and a Type II error.
Type I error: rejecting H0 when it is true. This can be thought of as a ‘false positive’.
Denote the probability of this type of error by 𝛼.
Type II error: failing to reject H0 when it is false. This can be thought of as a ‘false
negative’. Denote the probability of this type of error by 𝛽.
Both errors are undesirable and, depending on the context of the hypothesis test, it could be
argued that either one is worse than the other.
However, on balance, a Type I error is usually considered to be more problematic. (Thinking back
to trials by jury, conventional wisdom is that it is better to let 100 guilty people walk free than to
convict a single innocent person. While you are welcome to disagree, this view is consistent with
Type I errors being more problematic.)
For example, if H0 was being ‘not guilty’ and H1 was being ‘guilty’, a Type I error would be finding
an innocent person guilty (bad for him/her), while a Type II error would be finding a guilty person
innocent (bad for the victim/society, but admittedly good for him/her).
The complement of a Type II error, that is 1 − 𝛽, is called the power of the test – the probability
that the test will reject a false null hypothesis. Hence power measures the ability of the test to
reject a false H0, and so we seek the most powerful test for any testing situation.
Unlike 𝛼, we do not directly control the power of a test. However, we can increase it by increasing the sample size, n (a larger sample size generally improves the accuracy of our statistical inference).
We have:
Other things equal, if you decrease 𝛼 you increase 𝛽 and vice-versa. Hence there is a trade-off.
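To see that the significance level really does control the Type I error rate, a short simulation helps. The following Python sketch (our own illustration; the setup of a two-tailed z-test of H0: 𝜇 = 0 with known 𝜎 = 1 and n = 25 is invented for the example) checks that a test run at 𝛼 = 0.05 rejects a true H0 about 5% of the time:

```python
import random
from statistics import NormalDist, mean

random.seed(42)

def z_test_p_value(sample, mu0=0.0, sigma=1.0):
    """Two-tailed z-test p-value for H0: mu = mu0, with sigma known."""
    n = len(sample)
    z = (mean(sample) - mu0) / (sigma / n ** 0.5)
    return 2 * NormalDist().cdf(-abs(z))

alpha = 0.05
trials = 10_000
rejections = 0
for _ in range(trials):
    sample = [random.gauss(0.0, 1.0) for _ in range(25)]  # H0 is true here
    if z_test_p_value(sample) < alpha:
        rejections += 1  # rejecting a true H0: a Type I error

print(rejections / trials)  # close to alpha = 0.05
```

Lowering 𝛼 in the snippet lowers the rejection rate, but for any false H0 it would raise 𝛽, which is exactly the trade-off.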
Significance Level
Since we control for the probability of a Type I error, 𝛼, what value should this be?
Well, in general we test at the 100𝛼% significance level, for 𝛼 ∈ [0, 1]. The default choice is 𝛼 =
0.05, i.e. we test at the 5% significance level. Of course, this value of 𝛼 is subjective, and a
different significance level may be chosen. The severity of a Type I error in the context of a specific
hypothesis test might for example justify a more conservative or liberal choice for 𝛼.
In fact, noting our look at confidence intervals in Module 4.6, we could view the significance level
as the complement of the confidence level. (Strictly speaking, this would apply to so-called ‘two-
tailed’ hypothesis tests.)
For example:
a 90% confidence level equates to a 10% significance level
a 95% confidence level equates to a 5% significance level
a 99% confidence level equates to a 1% significance level.
We introduce p-values, which are our principal tool for deciding whether or not to reject H0.
A p-value is the probability of the event that the ‘test statistic’ takes the observed value or more
extreme (i.e. more unlikely) values under H0. It is a measure of the discrepancy between the
hypothesis H0 and the data evidence.
Example:
Suppose one is interested in evaluating the mean income (in $000s) of a community. Suppose
income in the population is modelled as N(𝜇, 25) and a random sample of n = 25 observations is
taken, yielding the sample mean 𝑥̅ = 17.
Independently of the data, three expert economists give their own opinions as follows.
Dr. A claims the mean income is 𝜇 = 16.
Ms. B claims the mean income is 𝜇 = 15.
Mr. C claims the mean income is 𝜇 = 14.
Here, 𝑋̅ ~ 𝑁(𝜇, 𝜎²/𝑛) = 𝑁(𝜇, 25/25) = 𝑁(𝜇, 1). We assess the statements based on this distribution.
If Dr. A's claim is correct, 𝑋̅~ 𝑁 (16, 1). The observed value 𝑥̅ = 17 is one standard deviation away
from 𝜇, and may be regarded as a typical observation from the distribution. Hence there is little
inconsistency between the claim and the data evidence. This is shown below:
If Ms. B's claim is correct, 𝑋̅~ 𝑁 (15, 1). The observed value 𝑥̅ = 17 begins to look a bit ‘extreme’,
as it is two standard deviations away from 𝜇. Hence there is some inconsistency between the
claim and the data evidence. This is shown below:
If Mr. C's claim is correct, 𝑋̅~ 𝑁 (14, 1). The observed value 𝑥̅ = 17 is very extreme, as it is three
standard deviations away from 𝜇. Hence there is strong inconsistency between the claim and the
data evidence. This is shown below:
It follows that:
Under H0: 𝜇 = 16, 𝑃(𝑋̅ ≥ 17) + 𝑃(𝑋̅ ≤ 15) = 𝑃(|𝑋̅ − 16| ≥ 1) = 0.3173
Under H0: 𝜇 = 15, 𝑃(𝑋̅ ≥ 17) + 𝑃(𝑋̅ ≤ 13) = 𝑃(|𝑋̅ − 15| ≥ 2) = 0.0455
Under H0: 𝜇 = 14, 𝑃(𝑋̅ ≥ 17) + 𝑃(𝑋̅ ≤ 11) = 𝑃(|𝑋̅ − 14| ≥ 3) = 0.0027
In summary, we reject the hypothesis 𝜇 = 15 or 𝜇 = 14, as, for example, if the hypothesis 𝜇 = 14
is true, the probability of observing 𝑥̅ = 17, or more extreme values, would be as small as 0.003.
We are comfortable with this decision, as a small probability event would be very unlikely to occur
in a single experiment.
On the other hand, we cannot reject the hypothesis 𝜇 = 16. However, this does not imply that this
hypothesis is necessarily true, as, for example, 𝜇 = 17 or 18 are at least as likely as 𝜇 = 16.
Remember:
not reject ≠ accept.
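The three p-values above can be reproduced with a few lines of Python using only the standard library (a sketch, not part of the original example):

```python
from statistics import NormalDist

# Sampling distribution: Xbar ~ N(mu, 25/25), i.e. standard deviation 1.
x_bar = 17
for claimed_mu in (16, 15, 14):          # Dr. A, Ms. B, Mr. C
    distance = abs(x_bar - claimed_mu)   # |x_bar - mu|, in standard deviations since sd = 1
    p_value = 2 * NormalDist().cdf(-distance)  # P(|Xbar - mu| >= distance)
    print(claimed_mu, round(p_value, 4))       # 0.3173, 0.0455, 0.0027 as in the text
```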
Interpretation of p-values
Module 5.2 explained that we control for the probability of a Type I error through our choice of
significance level, 𝛼, where 𝛼 ∈ [0, 1]. Since p-values are also probabilities, as defined above, we
simply compare p-values with our chosen benchmark significance level, 𝛼.
The p-value decision rule is shown below for 𝛼 = 0.05
Clearly, the magnitude of the p-value (compared with 𝛼) determines whether or not H0 is rejected.
Therefore, it is important to consider two key influences on the magnitude of the p-value: the
effect size and the sample size.
The effect size reflects the difference between what you would expect to observe if the null
hypothesis is true and what is actually observed in a random experiment. Equality between our
expectation and observation would equate to a zero effect size, which (while not proof that H0 is
true) provides the most convincing evidence in favor of H0. As the difference between our
expectation and observation increases, the data evidence becomes increasingly inconsistent with
H0 making us more likely to reject H0. Hence as the effect size gets larger, the p-value gets
smaller (and so is more likely to be below 𝛼).
To illustrate this idea, consider the experiment of tossing a coin 100 times and observing the
number of heads. Quite rightly, you would not doubt the coin is fair (i.e. unbiased) if you observed
exactly 50 heads as this is what you would expect from a fair coin (50% of tosses would be
expected to be heads, and the other 50% tails). However, it is possible that you are:
somewhat skeptical that the coin is fair if you observe 40 or 60 heads, say
even more skeptical that the coin is fair if you observe 35 or 65 heads, say
highly skeptical that the coin is fair if you observe 30 or 70 heads, say.
In this situation, the greater the difference between the number of heads and tails, the more
evidence you have that the coin is not fair.
In fact, if we test:
H0 : 𝜋 = 0.5 vs. H1 : 𝜋 ≠ 0.5
where 𝜋 = P(heads), for n = 100 tosses of the coin we would expect 50 heads and 50 tails. It can
be shown that for this fixed sample size the p-value is sensitive to the effect size (the difference
between the observed sample proportion of heads and the expected proportion of 0.5) as follows:
So we clearly see the inverse relationship between the effect size and the p-value. The above is
an example of a sensitivity analysis where we consider the pure influence of the effect size on the
p-value while controlling for (fixing) the sample size. We now proceed to control the effect size to
examine the sample size influence.
Other things equal, a larger sample size should lead to a more representative random sample
and the characteristics of the sample should more closely resemble those of the population
distribution from which the sample is drawn.
In the context of the coin toss, this would mean the observed sample proportion of heads should
converge to the true probability of heads, 𝜋, as 𝑛 → ∞.
As such, we consider the sample size influence on the p-value. For a non-zero effect size (a zero effect size would result in non-rejection of H0, regardless of n), the p-value decreases as the sample size increases.
Continuing the coin toss example, let us fix the (absolute) effect size at 0.1, i.e. in each of the
following examples the observed sample proportion of heads differs by a fixed proportion of 0.1
(= 10%).
So we clearly see the inverse relationship between the sample size and the p-value.
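Both sensitivity analyses can be sketched in Python. Here the two-tailed p-value uses the normal approximation to the binomial (an assumption on our part; exact binomial p-values would differ slightly for small n):

```python
from statistics import NormalDist

def p_value(heads, n, pi0=0.5):
    """Two-tailed p-value for H0: pi = pi0 (normal approximation to the binomial)."""
    p_hat = heads / n
    se = (pi0 * (1 - pi0) / n) ** 0.5   # standard error under H0
    z = (p_hat - pi0) / se
    return 2 * NormalDist().cdf(-abs(z))

# Fixed sample size (n = 100): the p-value falls as the effect size grows.
for heads in (50, 60, 65, 70):
    print(heads, round(p_value(heads, 100), 4))

# Fixed effect size (0.1): the p-value falls as the sample size grows.
for n in (25, 100, 400):
    print(n, round(p_value(int(0.6 * n), n), 4))
```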
In Module 5.2, we defined the power of the test as the probability that the test will reject a false
null hypothesis. In order to reject the null hypothesis it is necessary to have a sufficiently small p-
value (less than 𝛼), hence we see that we can unilaterally increase the power of a test by
increasing the sample size. Of course, the trade-off would be the increase in data collection costs.
Example:
5.4 Testing a Population Mean Claim
We consider the hypothesis test of a population mean in the context of a claim made by a
manufacturer.
As an example, the amount of water in mineral water bottles exhibits slight variations attributable
to the bottle-filling machine at the factory not putting in identical quantities of water in each bottle.
The labels on each bottle may state ‘500 ml’ but this equates to a claim about the average
contents of all bottles produced (in the population of bottles).
Let X denote the quantity of water in a bottle. It would seem reasonable to assume a normal
distribution for X such that:
𝑋 ~ 𝑁(𝜇, 𝜎²)
Suppose a random sample of n = 100 bottles is to be taken, and let us assume that 𝜎 = 10 ml.
From our work in Module 4.5 we know that:
𝑋̅ ~ 𝑁(𝜇, 𝜎²/𝑛) = 𝑁(𝜇, 10²/100) = 𝑁(𝜇, 1)
Further suppose that the sample mean in our random sample of 100 is 𝑥̅ = 503 ml. Clearly, we
see that:
𝑥̅ = 503 ≠ 500 = 𝜇
The question is whether the difference between 𝑥̅ = 503 and the claim 𝜇 = 500 is: (a) simply attributable to sampling variation (chance), with the claim 𝜇 = 500 being true; or (b) statistically significant, indicating that the claim is false. Determination of the p-value will allow us to choose between explanations (a) and (b). We proceed
by standardizing 𝑋̅ such that:
𝑍 = (𝑋̅ − 𝜇)/(𝜎/√𝑛) ~ 𝑁(0, 1)
acts as our test statistic. Note the test statistic includes the effect size, 𝑋̅ − 𝜇, as well as the
sample size, n.
Using our sample data, we now obtain the test statistic value (noting the influence of both the
effect size and the sample size, and hence ultimately the influence on the p-value):
𝑧 = (503 − 500)/(10/√100) = 3
The p-value is the probability of our test statistic value or a more extreme value conditional on H0. Noting that H1: 𝜇 ≠ 500, ‘more extreme’ here means a z-score greater than 3 or less than −3. Due to the symmetry of the standard normal distribution about zero, this can be expressed as:
𝑃(𝑍 ≥ 3) + 𝑃(𝑍 ≤ −3) = 2 × 𝑃(𝑍 ≤ −3) = 0.0027
Note this value can easily be obtained using Microsoft Excel, say, as:
=NORM.S.DIST(-3)*2 or =(1-NORM.S.DIST(3))*2
Therefore, since 0.0027 < 0.05 we reject H0 and conclude that the result is “statistically significant”
at the 5% significance level (and also, at the 1% significance level). Hence there is (strong)
evidence that 𝜇 ≠ 500. Since 𝑥̅ > 𝜇 we might go further and suppose that 𝜇 > 500.
Although the p-value is very small, indicating it is highly unlikely that this is a Type I error,
unfortunately we cannot be certain which outcome has actually occurred.
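As an alternative to the Excel formula above, the same test can be sketched in standard-library Python:

```python
from statistics import NormalDist

mu0, sigma, n, x_bar = 500, 10, 100, 503  # claimed mean, known sd, sample size, sample mean

z = (x_bar - mu0) / (sigma / n ** 0.5)    # (503 - 500) / (10 / 10) = 3
p_value = 2 * NormalDist().cdf(-abs(z))   # two-tailed, mirroring =NORM.S.DIST(-3)*2

print(z, round(p_value, 4))  # 3.0 and 0.0027: reject H0 at the 5% level
```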
We have discussed (in Module 4.5) the very convenient result that if a random sample comes
from a normally-distributed population, the sampling distribution of 𝑋̅ is also normal. How about
sampling distributions of 𝑋̅ from other populations?
For this, we can use a remarkable mathematical result, the central limit theorem (CLT). In
essence, the CLT states that the normal sampling distribution of 𝑋̅ which holds exactly for random
samples from a normal distribution, also holds approximately for random samples from nearly any
distribution.
The CLT applies to ‘nearly any’ distribution because it requires that the variance of the population
distribution is finite. If it is not, the CLT does not hold. However, such distributions are not
common.
Suppose that {X1, X2, …, Xn} is a random sample from a population distribution which has mean 𝐸(𝑋𝑖) = 𝜇 < ∞ and variance 𝑉𝑎𝑟(𝑋𝑖) = 𝜎² < ∞, that is with a finite mean and finite variance. Let 𝑋̅𝑛 denote the sample mean calculated from a random sample of size n, then:
lim_{𝑛→∞} 𝑃( (𝑋̅𝑛 − 𝜇)/(𝜎/√𝑛) ≤ 𝑧 ) = Φ(𝑧)
for any 𝑧, where Φ(𝑧) = 𝑃(𝑍 ≤ 𝑧) denotes a cumulative probability of the standard normal
distribution.
The limit as 𝑛 → ∞ indicates that this is an asymptotic result, i.e. one which holds increasingly well as n increases, and exactly when the sample size is infinite.
In less formal language, the CLT says that for a random sample from nearly any distribution with
mean 𝜇 and variance 𝜎², then:
𝑋̅ ~ 𝑁(𝜇, 𝜎²/𝑛)
approximately, when 𝑛 is sufficiently large. We can then say that 𝑋̅ is asymptotically normally distributed with mean 𝜇 and variance 𝜎²/𝑛.
It may appear that the CLT is still somewhat limited, in that it applies only to sample means
calculated from random samples. However, this is not really true, for two main reasons.
There are more general versions of the CLT which do not require the observations 𝑋𝑖 to
be independent and identically distributed (IID).
Even the basic version applies very widely, when we realise that the “X” can also be a
function of the original variables in the data. For example, if X and Y are random variables
in the sample, we can also apply the CLT to:
∑ᵢ₌₁ⁿ log(𝑋ᵢ)/𝑛  or  ∑ᵢ₌₁ⁿ 𝑋ᵢ𝑌ᵢ/𝑛
Therefore, the CLT can also be used to derive sampling distributions for many statistics which do
not initially look at all like 𝑋̅ for a single random variable in a random sample. You may get to do
this in the next topics.
The larger the sample size n, the better the normal approximation provided by the CLT is. In
practice, we have various rules-of-thumb for what is “large enough” for the approximation to be
“accurate enough”. This also depends on the population distribution of 𝑋𝑖. For example:
for symmetric distributions, even small n is enough
for very skewed distributions, larger n is required.
For many distributions, n > 50 is sufficient for the approximation to be reasonably accurate.
Example:
In the first case, we simulate random samples of sizes n = 1, 5, 10, 30, 100 and 1000 from the Exponential(0.25) distribution (for which 𝜇 = 4 and 𝜎² = 16). This is clearly a skewed distribution,
as shown by the histogram for n = 1 in the figure below.
10,000 independent random samples of each size were generated. Histograms of the values of
𝑋̅ in these random samples are shown. Each plot also shows the approximating normal
distribution, N(4,16/n). The normal approximation is reasonably good already for n = 30, very
good for n = 100, and practically perfect for n = 1000.
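A scaled-down version of this simulation can be sketched in Python (here only n = 30, with 10,000 replications; the seed is arbitrary):

```python
import random
from statistics import mean, stdev

random.seed(0)  # arbitrary seed for reproducibility

rate = 0.25           # Exponential(0.25): mu = 1/rate = 4, sigma^2 = 1/rate^2 = 16
n, reps = 30, 10_000

# 10,000 sample means, each from a random sample of size n = 30.
sample_means = [mean(random.expovariate(rate) for _ in range(n)) for _ in range(reps)]

# The CLT predicts Xbar is approximately N(4, 16/30).
print(round(mean(sample_means), 2))   # close to mu = 4
print(round(stdev(sample_means), 2))  # close to (16/30) ** 0.5, about 0.73
```

Plotting a histogram of `sample_means` against the N(4, 16/30) density would reproduce the n = 30 panel of the figure.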
Example:
In the second case, we simulate 10,000 independent random samples of sizes n = 1, 10, 30, 50, 100 and 1000 from the Bernoulli(0.2) distribution (for which 𝜇 = 0.2 and 𝜎² = 0.16).
Here the distribution of 𝑋𝑖 itself is not even continuous, and has only two possible values, 0 and
1. Nevertheless, the sampling distribution of 𝑋̅ can be very well-approximated by the normal
distribution, when n is large enough.
Note that since here 𝑋𝑖 = 1 or 𝑋𝑖 = 0 for all 𝑖, we have 𝑋̅ = ∑ᵢ₌₁ⁿ 𝑋𝑖/𝑛 = 𝑚/𝑛, where m is the number of observations for which 𝑋𝑖 = 1. In other words, 𝑋̅ is the sample proportion of the value X = 1.
The normal approximation is clearly very bad for small n, but reasonably good already for n = 50,
as shown by the histograms below.
Note that as n increases:
there is convergence to 𝑁(𝜇, 𝜎²/𝑛)
the sampling variance decreases (although the histograms might at first seem to show the same variation, look closely at the scale on the x-axes).
The above example considered Bernoulli sampling where we noted that the sample mean was
the sample proportion of successes, which we now denote as P.
𝑋̅ ~ 𝑁(𝜇, 𝜎²/𝑛)
approximately, when n is sufficiently large, and noting that when X ~ Bernoulli(𝜋) then:
E(𝑋) = 0 × (1 − 𝜋) + 1 × 𝜋 = 𝜋 = 𝜇
and
Var(𝑋) = E(𝑋²) − [E(𝑋)]² = 0² × (1 − 𝜋) + 1² × 𝜋 − 𝜋² = 𝜋(1 − 𝜋) = 𝜎²
we have:
𝑋̅ = 𝑃 → 𝑁(𝜇, 𝜎²/𝑛) = 𝑁(𝜋, 𝜋(1 − 𝜋)/𝑛)
as 𝑛 → ∞
We see that:
E(𝑋̅) = E(𝑃) = π
hence the sample proportion is equal to the population proportion, on average. Also:
Var(𝑃) → 0 as 𝑛 → ∞
so the sampling variance tends to zero as the sample size tends to infinity, as we see in the
histograms in the previous examples.
We will now use this result to conduct statistical inference for proportions.
Confidence Intervals
As the sample proportion is a special case of the sample mean, the familiar construction of a confidence interval as point estimate ± margin of error continues to hold. Here, the:
point estimate = p
where p is the observed sample proportion, and the:
margin of error = confidence coefficient × standard error = 𝑧 × √(𝑝(1 − 𝑝)/𝑛)
giving the confidence interval:
𝑝 ± 𝑧 × √(𝑝(1 − 𝑝)/𝑛)
Example:
In opinion polling, sample sizes of about 1000 are used as this leads to a margin of error of
approximately three percentage points – deemed an acceptable tolerance on the estimation error
by most political scientists. Suppose 630 out of 1000 voters in a random sample said they would
vote ‘Yes’ in a binary referendum. The sample proportion is:
𝑝 = 630/1000 = 0.63
and a 95% confidence interval for 𝜋, the true proportion who would vote ‘Yes’ in the electoral
population, is:
0.63 ± 1.96 × √(0.63 × 0.37/1000) = 0.63 ± 0.03 ⟹ (0.60, 0.66) or (60%, 66%)
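The same interval can be computed in Python (a sketch using only the standard library):

```python
from statistics import NormalDist

p, n = 0.63, 1000
z = NormalDist().inv_cdf(0.975)        # about 1.96 for 95% confidence

margin = z * (p * (1 - p) / n) ** 0.5  # confidence coefficient x standard error

print(round(p - margin, 2), round(p + margin, 2))  # (0.6, 0.66)
```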
Hypothesis Testing
Suppose we wish to test H0: 𝜋 = 0.4 vs. H1: 𝜋 ≠ 0.4, and a random sample of n = 1000 returned a sample proportion of p = 0.44. To undertake this test, we follow a similar approach to that outlined in Module 5.4.
We proceed by standardizing P such that:
𝑍 = (𝑃 − 𝜋)/√(𝜋(1 − 𝜋)/𝑛) ~ 𝑁(0, 1)
approximately for large 𝑛, which is satisfied here since 𝑛 = 1000. Note the test statistic includes the effect size, 𝑃 − 𝜋, as well as the sample size, 𝑛.
Note: The true standard error of the sample proportion is √(𝜋(1 − 𝜋)/𝑛). In a hypothesis test we evaluate this at the hypothesized value of 𝜋 under H0, as below; when constructing a confidence interval 𝜋 is unknown (it is what we are estimating), hence we intuitively use the estimated standard error which estimates 𝜋 with 𝑝.
Using our sample data, we now obtain the test statistic value (noting the influence of both the
effect size and the sample size, and hence ultimately the influence on the p-value):
𝑧 = (0.44 − 0.4)/√(0.4 × (1 − 0.4)/1000) ≈ 2.58
The p-value is the probability of our test statistic value or a more extreme value conditional on H0. Noting that H1: 𝜋 ≠ 0.4, ‘more extreme’ here means a z-score greater than 2.58 or less than −2.58. Due to the symmetry of the standard normal distribution about zero, this can be expressed as:
𝑃(𝑍 ≥ 2.58) + 𝑃(𝑍 ≤ −2.58) = 2 × 𝑃(𝑍 ≤ −2.58) ≈ 0.0099
Note this value can easily be obtained using Microsoft Excel, say, as:
=NORM.S.DIST(-2.58)*2 or =(1-NORM.S.DIST(2.58))*2
Therefore, since 0.0099 < 0.05 we reject H0 and conclude that the result is ‘statistically significant’
at the 5% significance level (and also, just, at the 1% significance level). Hence there is (strong)
evidence that 𝜋 ≠ 0.4. Since 𝑝 > 𝜋 we might go further and suppose that 𝜋 > 0.4.
Finally, recall the possible decision space:
Although the p-value is very small, indicating it is highly unlikely that this is a Type I error,
unfortunately we cannot be certain which outcome has actually occurred.
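The proportion test above can likewise be sketched in standard-library Python:

```python
from statistics import NormalDist

pi0, p, n = 0.4, 0.44, 1000

se = (pi0 * (1 - pi0) / n) ** 0.5        # standard error under H0
z = (p - pi0) / se                       # about 2.58
p_value = 2 * NormalDist().cdf(-abs(z))  # two-tailed, as =NORM.S.DIST(-2.58)*2

print(round(z, 2), round(p_value, 4))    # 2.58 and a p-value just below 0.01
```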
ACTIVITIES/ASSESSMENT:
Hypothesis Testing:
You are to test the claim by a mineral water bottle manufacturer that its bottles contain an average
of 1000 ml (1 liter). A random sample of n=12 bottles resulted in the measurements (in ml):
992, 1002, 1000, 1001, 998, 999, 1000, 995, 1003, 1001, 997 and 997.
It is assumed that the true variance of water in all bottles is 𝜎² = 1.5, and that the amount of water
in bottles is normally distributed. Test the manufacturer's claim at the 1% significance level (you
may use Excel to calculate the p-value). Also, briefly comment on what the hypothesis test result
means about the manufacturer's claim, and if an error might have occurred which type of error it
would be.
Watch:
Statistics: Introduction to Hypothesis Testing in Filipino
https://www.youtube.com/watch?v=plAiYXYaqY0
MODULE 6 – APPLICATIONS
OVERVIEW:
We conclude the course with a cross-section of applications of content covered in previous
modules to more advanced modelling applications in the real world. We begin with the topic of decision tree analysis. Arguably, this brings us full circle to our initial look in Module 1 at decision making under uncertainty. Remember, we need to make decisions in the present
for which we don't know exactly what's going to happen in the future. So we have these unknown
or uncertain future outcomes. How do we look at modelling this kind of situation? Well, decision
trees help us along the way.
For example, when you're playing a game of chess, you're deciding which piece to move. But if you're a good chess player, you're not just deciding on your next move, you're also trying to anticipate the next move of your opponent. If you're a really good chess player, you'll then start to think not just those two moves ahead but further moves ahead as well. Of course, you cannot know exactly what your opponent is going to do – there is some uncertainty there – but your decision making will be based on your expectations of what your opponent may do.
MODULE OBJECTIVES:
After successfully completing the module, you should be able to:
1. Use simple decision tree analysis to model decision-making under uncertainty.
2. Interpret the beta of a stock as a common risk measure used in finance.
3. Describe the principles of linear programming and Monte Carlo simulation.
COURSE MATERIALS:
6.1 Decision Tree Analysis
Module 1 introduced the concept of decision making under uncertainty whereby decisions are
taken in the present with uncertain future outcomes.
Of course, Module 2.1 explained that we could determine probabilities using one of three
methods:
subjectively
by experimentation (empirically)
theoretically.
In what follows we will be concerned with decision analysis, i.e. where there is only one rational
decision-maker making non-strategic decisions.
(Game theory involves two or more rational decision-makers making strategic decisions.)
Example:
Imagine you are an ice-cream manufacturer, and for simplicity suppose your level of sales can be
either high or low. Hence high and low sales are mutually exclusive and collectively exhaustive in
this example. (Note: Clearly, in practice a continuum of sales levels could be expected, but applying that here would needlessly complicate the analysis – the binary set of outcomes of high and low sales is sufficient to demonstrate the principle and use of decision tree analysis.)
If sales are high you earn a profit of $300,000 (excluding advertising costs), but if sales are low
you experience a loss of $100,000 (excluding advertising costs).
You have the choice of whether to advertise your product or not. Advertising costs would be fixed
at $100,000.
If you advertise, sales are high with a probability of 0.9, but if you do not advertise, sales are high with a probability of just 0.6. Note advertising does not guarantee success (not all advertising
campaigns are successful!) so here we model advertising as increasing the probability of the
‘good’ outcome (i.e. high sales). Viewed as conditional probabilities, these are:
We are now in a position to draw the decision tree for the ice-cream manufacturer problem. Note
that decision trees are read from left to right, representing the time order of events in a logical
manner.
The decision tree is:
On the far left the tree begins with a decision node where we have to decide whether to advertise
or not without knowledge of whether sales turn out to be high or low. After the decision is made,
chance takes over and resolves the realized level of sales according to the respective probability
distribution, ultimately resulting in our payoff. Note that the payoffs in the top half of the decision
tree ($200,000 and – $200,000) are simply the corresponding payoffs of $300,000 and
– $100,000, less the fixed advertising costs of $100,000.
In order to solve the decision tree we calculate the expected monetary value (EMV) of each
option (advertise and not advertise) and proceed whereby the decision-maker maximizes
expected profits.
The EMV is simply an expected value, and so in this discrete setting we apply our usual
probability-weighted average approach. We have:
EMV(advertise) = 0.9 × $200,000 + 0.1 × (−$200,000) = $160,000
EMV(do not advertise) = 0.6 × $300,000 + 0.4 × (−$100,000) = $140,000
Hence the optimal (recommended) strategy is to advertise, since this results in a higher expected
payoff ($160,000 > $140,000).
Remember that an expected value should be viewed as a long-run average. Clearly, by deciding
to advertise we will not make $160,000 – the possible outcomes are either a profit of $200,000
(with probability 0.9) or a loss of $200,000 (with probability 0.1).
However, rather than a “one-shot” game, imagine the game was played annually over 10 years.
By choosing “advertise” each time, then we would expect high sales in 9 years (each time with a
profit of $200,000) and low sales in 1 year (with a loss of $200,000), reflecting the probability
distribution used. Hence in the long run this would average to be $160,000.
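The EMV calculations are simple enough to script. A minimal Python sketch of the decision tree (payoffs as given in the example, net of advertising costs where applicable):

```python
# Probability-payoff pairs for each option (payoffs net of advertising costs).
options = {
    "advertise": [(0.9, 200_000), (0.1, -200_000)],
    "do not advertise": [(0.6, 300_000), (0.4, -100_000)],
}

def emv(outcomes):
    """Expected monetary value: a probability-weighted average of payoffs."""
    return sum(prob * payoff for prob, payoff in outcomes)

for name, outcomes in options.items():
    print(name, round(emv(outcomes)))   # 160000 and 140000

best = max(options, key=lambda name: emv(options[name]))
print(best)                             # advertise
```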
Of course, this example has failed to account for risk – the subject of Module 6.2.
6.2 Risk
The decision tree analysis in Module 6.1 was solved by choosing the option with the maximum
expected profit. As such, we only considered the mean (average) outcome.
In practice people care about risk and tend to factor it into their decision making. Of course,
different people have different attitudes to risk so we can profile people's risk appetite as follows.
A decision-maker is risk-averse if s/he prefers the certain outcome of $x over a risky project with
a mean (EMV) of $x.
A decision-maker is risk-loving (also known as risk-seeking) if s/he prefers a risky project with a mean (EMV) of $x over the certain outcome of $x.
A decision-maker is risk-neutral if s/he is indifferent between the certain outcome of $x and a risky project with a mean (EMV) of $x.
The certainty equivalent (CE) of a risky project is the amount of money which makes the
decision-maker indifferent between receiving this amount for sure and the risky project.
Example:
A risky project pays out $100 with a probability of 0.5 and $0 with a probability of 0.5. The certainty
equivalent, X, makes the decision-maker indifferent between the certain outcome X and the risky
project.
Let X be the value which makes you indifferent between the safe and risky assets. Immediately
we see that:
E(risky) = 0.5 × $100 + 0.5 × $0 = $50.
Hence, for X = 50:
E(safe) = 1 × $50 = $50 = E(risky)
so an individual with X = 50 does not care about risk and focuses only on the expected return. Such a risk-neutral individual sees no difference between the safe and risky assets, even though the safe asset is risk-free while the risky asset is risky – hence the name.
Example:
In Module 3.4 you saw the following diagram used to illustrate the returns of two stocks:
At the time it was noted that the black stock was the safer stock due to its smaller variation. Hence
depending on an investor's risk appetite, they would invest in the black or red stock accordingly.
Risk Premium
risk premium = EMV − CE
Interpretation: the amount of expected payoff the decision-maker is willing to give up in order to receive the certainty equivalent for sure rather than face the risky project.
If risk-neutral, the decision-maker uses the EMV criterion as seen in Module 6.1.
6.3 Linear Regression
Linear regression analysis is one of the most frequently used statistical techniques. It aims to
model an explicit relationship between one dependent variable, denoted as y, and one or more
regressors (also called covariates, or independent variables), denoted as x1, …, xp.
The goal of regression analysis is to understand how y depends on x1, …, xp and to predict or
control the unobserved y based on the observed x1, …, xp. We only consider simple examples
with p = 1.
Example:
In a university town, the sales, y, of 10 pizza parlor restaurants are closely related to the student
population, x, in their neighborhoods.
The scatterplot above shows the sales (in thousands of pesos) in a period of three months
together with the numbers of students (in thousands) in their neighborhoods.
We plot y against x, and draw a straight line through the middle of the data points:
𝑦 = 𝛼 + 𝛽𝑥 + 𝜀
where 𝜀 stands for a random error term, 𝛼 is the intercept and 𝛽 is the slope of the straight line.
𝑦̂ = 𝛼 + 𝛽𝑥
Some other possible examples of y and x are shown in the following table:
𝑦 𝑥
Sales Price
Weight Gains Protein in Diet
Present PSE 100 index Past PSE 100 index
Consumption Income
Salary Tenure
Daughter’s Height Mother’s Height
We now present the simple linear regression model. Let the paired observations (x1, y1), …, (xn,
yn) be drawn from the model:
𝑦𝑖 = 𝛼 + 𝛽𝑥𝑖 + 𝜀𝑖
Where:
So the model has three parameters: 𝛼, 𝛽 and 𝜎². In a formal topic on regression you would
consider the following questions:
Example:
We can apply the simple linear regression model to study the relationship between two series of
financial returns – a regression of a stock's returns, 𝑦, on the returns of an underlying market
index, 𝑥. This regression model is an example of the capital asset pricing model (CAPM).
A stock's return is usually measured as the relative (percentage) change in its price, which is approximately equal to the difference in log prices when the difference between the two prices is small. Daily prices are definitely not independent. However, daily returns may be seen as a sequence of uncorrelated random variables.
The capital asset pricing model (CAPM) is a simple asset pricing model in finance given by:
𝑦𝑖 = 𝛼 + 𝛽𝑥𝑖 + 𝜀𝑖
Note:
So the ‘beta’ of a stock is a simple measure of the riskiness of that stock with respect to the market
index. By definition, the market index has 𝛽 = 1.
In summary:
if 𝛽 > 1 ⇒ risky stocks
as market movements are amplified in the stock's returns, and:
if 𝛽 < 1 ⇒ defensive stocks
as market movements are muted in the stock's returns.
6.4 Linear Programming
Linear programming is probably one of the most widely used types of quantitative business model. It can
be applied in any environment where finite resources must be allocated to competing activities or
processes for maximum benefit, for example:
Optimization Models
Microsoft Excel can be used to solve linear programming problems using Solver. Excel has its
own terminology for optimization.
The first step is the model development step. You must decide:
The third step is to perform a sensitivity analysis – to what extent is the final solution sensitive to
parameter values used in the model. We omit this stage in our simple example below.
Example:
A frequent problem in business is the product mix problem. Suppose a company must decide on
a product mix (how much of each product to introduce) to maximize profit.
Suppose a firm produces two types of chocolate bar – type A bars and type B bars. A type A bar
requires 10 grams of cocoa and 1 minute of machine time. A type B bar requires 5 grams of cocoa
and 4 minutes of machine time. So type A bars are more cocoa-intensive, while type B bars are
more intricate requiring a longer production time.
Altogether 2,000 grams of cocoa and 480 minutes of machine time are available each day.
Assume no other resources are required.
The manufacturer makes 10 dollars profit from each type A bar and 20 dollars profit from each
type B bar. Assume all chocolate bars produced are sold.
10x + 20y
x ≥ 0 and y ≥ 0
The maximum (and minimum) values of the objective function lie at a corner of the feasible region.
The two lines intersect where both constraint equations hold, i.e. solving 10x + 5y = 2000 and x + 4y = 480 simultaneously gives x = 160 and y = 80. This happens at the point (160, 80). We can find the values of the objective function at each corner by substitution into 10x + 20y. The four corners of the feasible region are (0, 0), (200, 0), (160, 80) and (0, 120), with objective function values of $0, $2,000, $3,200 and $2,400, respectively.
The optimal solution is a profit of $3,200, which occurs by making 160 type A bars and 80 type B bars.
In the accompanying Excel file, the problem is solved using Solver. Cells A1 and A2 are the
changing cells (where the solution x = 160 and y = 80 is returned). Cell B1 is the objective cell,
with formula =10*A1+20*A2 representing the objective function 10x+20y. Cells C1 and C2 contain
the constraints, with formula =10*A1+5*A2 and =A1+4*A2, respectively, representing the left-hand sides of the constraints 10x + 5y ≤ 2000 and x + 4y ≤ 480.
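Outside Excel, the corner-point method used above can be reproduced in a few lines of Python (the corner coordinates are those derived in the example):

```python
# Corner points of the feasible region bounded by
# 10x + 5y <= 2000, x + 4y <= 480, x >= 0, y >= 0.
corners = [(0, 0), (200, 0), (160, 80), (0, 120)]

def profit(x, y):
    return 10 * x + 20 * y  # the objective function

for corner in corners:
    print(corner, profit(*corner))

best = max(corners, key=lambda c: profit(*c))
print(best, profit(*best))  # (160, 80) with profit 3200
```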
6.5 Monte Carlo Simulation
We are forever bound by uncertainty. To study, or not to study? To invest, or not to invest? To
marry, or not to marry? Other examples are:
Often, several different system configurations are available. For example, from our R&D portfolio,
there are several products we could release, but we have to select just one due to our marketing
budget constraint.
Monte Carlo simulation works by simulating multiple hypothetical future worlds reflecting that
the outcome variable of interest is a random variable with a probability distribution. We can then
determine the expected outcome, as well as a quantification of risk, using the usual statistics
of mean and variance. Decision-makers can then make an informed judgment about the best
course of action based on their risk appetite.
To perform a Monte Carlo simulation, we generate N simulated values x₁, x₂, …, x_N of the
outcome variable and then compute the sample mean and sample variance:

x̄ = (1/N) ∑ᵢ₌₁ᴺ xᵢ

and

s² = (1/(N − 1)) ∑ᵢ₌₁ᴺ (xᵢ − x̄)²
The mean gives our estimate of the expected outcome, while the variance is our measure of risk.
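These two estimators are direct to compute. A minimal Python sketch follows; the two values used are illustrative simulated profits chosen for easy arithmetic, not data from the module, and N = 2 is far smaller than any N one would use in practice.

```python
# Sample mean and (unbiased) sample variance of simulated outcomes,
# matching the formulas for x-bar and s^2 above.
xs = [70_000, -30_000]  # illustrative simulated profits (assumed values)
N = len(xs)

mean = sum(xs) / N                                # x-bar: expected outcome
var = sum((x - mean) ** 2 for x in xs) / (N - 1)  # s^2: measure of risk

print(mean, var)  # 20000.0 5000000000.0
```

Note the divisor N − 1 in the variance, which makes s² an unbiased estimator.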
Example:
A company wants to analyze an investment whose uncertain revenues and costs are given by
the probability distributions below. Assume, for simplicity, that the revenues and costs are
uncorrelated.
By using Microsoft Excel’s ‘=RAND()’ function, we can generate a random number uniformly
distributed between 0 and 1:
Given the probability distribution for revenues, we could simulate revenues depending on the
value returned by ‘=RAND()’ according to:
which ensures each revenue value occurs with its required probability. Similarly, for costs, we
could use:
Suppose for the first simulation we had random numbers for revenues and costs of 0.7467 and
0.1672, respectively. This corresponds to:
Revenue = $100,000
Costs = $30,000
Profit = $70,000
Suppose for the second simulation we had random numbers for revenues and costs of 0.0384
and 0.9056, respectively. This corresponds to:
Revenue = $50,000
Costs = $80,000
Profit = −$30,000 (a loss)
This process would then be replicated, ideally thousands of times (N = the number of simulations),
after which the mean and variance of profit are calculated to estimate the expected outcome
and risk.
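The whole procedure can be sketched in Python. The module's probability tables are not reproduced in the text above, so the probabilities assigned below are illustrative assumptions; only the dollar values ($50,000, $100,000, $30,000, $80,000) match the two worked simulations.

```python
import random

random.seed(42)  # fixed seed so the run is reproducible

# (value, probability) pairs; the probabilities are assumed for illustration.
revenues = [(50_000, 0.4), (100_000, 0.6)]
costs = [(30_000, 0.5), (80_000, 0.5)]

def draw(dist):
    # Map a RAND()-style uniform number in [0, 1) to a discrete outcome,
    # so each value occurs with its required probability.
    u, cumulative = random.random(), 0.0
    for value, p in dist:
        cumulative += p
        if u < cumulative:
            return value
    return dist[-1][0]  # guard against floating-point round-off

N = 10_000  # number of simulated hypothetical future worlds
profits = [draw(revenues) - draw(costs) for _ in range(N)]

mean = sum(profits) / N                                # expected outcome
var = sum((p - mean) ** 2 for p in profits) / (N - 1)  # risk
```

Under the assumed probabilities, the true expected profit is 0.4 × 50,000 + 0.6 × 100,000 − 0.5 × 30,000 − 0.5 × 80,000 = $25,000, and the simulated mean should land close to that value.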
Of course, the quality of these expected outcome and risk estimates is only as good as the model
used. What if the true probability distributions were different? What if revenues and costs were
correlated? What if other factors affected profit as well?
Clearly, in practice we would wish to conduct a sensitivity analysis to see how sensitive (or
robust) the distribution of the outcome variable is to such issues.
ACTIVITIES/ASSESSMENT:
2. Ignoring risk, in decision tree analysis we choose the option which will maximize the expected
profit.
a. True
b. False
3. A decision-maker who prefers the certain outcome of $x over a risky project with a mean (EMV)
of $x is:
a. risk-averse
b. risk-neutral
c. risk-loving
8. In a standard product mix problem, non-negativity constraints are required for the decision
variables.
a. True
b. False
9. The results of a Monte Carlo simulation are never sensitive to the choice of input probability
distribution(s).
a. True
b. False
10. In Monte Carlo simulation, the expected outcome can be calculated using the:
a. sample mean of the simulated values of the outcome variable
b. sample variance of the simulated values of the outcome variable
12. Ignoring risk, in decision tree analysis we choose the option which will minimize the expected
payoff.
a. True
b. False
13. When the difference between prices is small, stock returns are approximately equal to:
a. log(current price / previous price)
b. log(previous price / current price)
c. log((current price − previous price) / previous price)
15. In Monte Carlo simulation, the quantification of risks can be calculated using the:
a. sample mean of the simulated values of the outcome variable.
b. sample variance of the simulated values of the outcome variable.
Module References: