Introduction To Statistics Module

Contents

1 Introduction
1.1. Overview of Statistics
1.2. Definition of Terms
1.3. Sampling Techniques
1.4. Probability Sampling Methods
1.4.1. Simple Random Sampling
1.4.2. Systematic Random Sampling
1.4.3. Stratified Sampling
1.4.4. Cluster Sampling
1.5. Non-probability Sampling Methods
1.5.1. Convenience or Availability
1.5.2. Quota / Proportionate
1.5.3. Expert or Judgemental
1.5.4. Chain Referral / Snowballing / Networking
1.6. Errors in Sampling
1.7. Data Collection Methods
1.7.1. Observation
1.7.2. Interview
1.7.3. Experimentation

4 Measures of Dispersion
4.1. Introduction
4.2. Range
4.3. Variance
4.4. Standard Deviation
4.5. Coefficient of Variation
4.6. Exercises

5 Basic Probability
5.1. Introduction
5.2. Definition
5.3. Approaches to Probability Theory
5.4. Properties of Probability
5.5. Basic Probability Concepts
5.6. Types of Events
5.7. Laws of Probability
5.8. Types of Probabilities
5.9. Contingency Tables
5.10. Tree Diagrams
5.11. Counting Rules
5.11.1. Multiplication Rule
5.11.2. Permutations
5.11.3. Combinations
5.12. Exercise

6 Probability Distributions
6.1. Introduction
6.2. Definition
6.3. Random Variables
6.4. Discrete Probability Distribution
6.5. Properties of the Discrete Probability Mass Function
6.6. Probability Terminology and Notation

7 Interval Estimation
7.1. Introduction
7.2. Confidence Intervals
7.3. Confidence Interval for the Population Mean
7.4. One-Sided Confidence Intervals for the Population Mean
7.5. Confidence Interval for the Population Proportion
7.6. Confidence Interval for the Population Variance
7.7. Confidence Interval for the Population Standard Deviation
7.8. Confidence Interval for the Difference of Two Population Means
7.8.1. Case 1: Known Population Variance
7.8.2. Case 2: Unknown (but Assumed Equal) Population Variances

8 Hypothesis Testing
8.1. Important Definitions and Critical Clarifications
8.2. General Procedure for Hypothesis Testing
8.3. Hypothesis Testing Concerning the Population Mean
8.3.1. Case 1: Known Population Variance
8.3.2. Guidelines to the Expected Solution
8.3.3. Case 2: Unknown Population Variance
8.4. Hypothesis Testing Concerning the Population Proportion
8.5. Comparison of Two Populations
8.5.1. Hypothesis Testing Concerning the Difference of Two Population Means
8.6. Independent Samples and Dependent/Paired Samples
8.6.1. Advantages of Paired Comparisons
8.6.2. Disadvantages of Paired Comparisons
8.7. Test Procedure Concerning the Difference of Two Population Proportions
8.8. Tests for Independence
8.9. Ending Remark(s)

9 Regression Analysis
9.1. Introduction
9.2. Uses of Regression Analysis
9.3. Abuses of Regression Analysis
9.4. The Simple Linear Regression Model
9.4.1. The Scatter Plot
9.4.2. The Correlation Coefficient
9.4.3. Regression Equation
9.4.4. Coefficient of Determination, r²

10 Index Numbers
10.1. Objectives
10.2. Introduction
10.3. What is an Index Number?
10.3.1. Characteristics of Index Numbers
10.3.2. Uses of Index Numbers
10.4. Types of Index Numbers
10.5. Methods of Constructing Index Numbers
10.5.1. Aggregate Method
10.5.2. Merits and Demerits of this Method
10.5.3. Weighted Aggregates Index
10.5.4. Laspeyres Method
10.5.5. Merits and Demerits of the Laspeyres Method
10.5.6. Paasche's Method
10.5.7. Merits and Demerits of Paasche's Index
10.6. Fisher Index
Chapter 1
Introduction
Statistics is the discipline in which individual data values are collected, summarised, analysed, presented and used for decision making. It is an important tool in transforming raw data into meaningful and usable information. Statistics can also be regarded as a decision-support tool. The table below shows the transformation process from data to information.
Statistics
Definition 1
Statistics refers to the methodology [collection techniques] for collection, presentation
and analysis of data and the use of such data [Neter J. et al (1988)].
Definition 2
In common usage, it refers to numerical data. This means any collection of data or
information constitutes what is referred to as Statistics. Some examples under this
definition are:
1. Vital statistics - These are numerical data on births, marriages, divorces, com-
municable diseases, harvests, accidents etc.
3. Social statistics - These are numeric data on housing, crime, education etc.
In Statistics (as in real life), we usually deal with large volumes of data, making it difficult to study each observation (each data point) in order to draw conclusions about the source of the data. We seek statistical methods that can summarise the data so that we can draw conclusions about these data without scrutinising each observation (which is rather difficult). Such methods fall under the area of statistics called descriptive statistics.
Parameter(s) - These are numeric measure(s) derived from a population, e.g. the population mean (µ), population variance (σ²), and population standard deviation (σ).
Data
Data is what is more readily available, from a variety of sources and of varying quality and quantity. Precisely, data consists of individual observations on an issue and in itself conveys no useful information.
Information
To make sound decisions, one needs good-quality information. Information must be timely, accurate, relevant, adequate and readily available. Information is defined as processed data. The table above summarises the relationship between data and information.
GIGO - Garbage In Garbage Out.
Random variable
A variable is any characteristic being measured or observed. Since a variable can take
on different values at each measurement, it is termed a random variable. Examples include sales, company turnover, weight, height, yield, number of babies born, etc.
Population
A population is a collection of elements about which we wish to make an inference.
The population must be clearly defined before the sample is taken.
Target population
The population whose properties are estimated via a sample or usually the ’total’ pop-
ulation.
Sample
A sample is a collection of sampling units drawn from a population. Data are obtained
from the sample and are used to describe characteristics of the population. A sample
can also be defined as a subset / part of or a fraction of a population.
Statistic(s)
These are numeric measure(s) derived from a sample, e.g. the sample mean (x̄), sample variance (s²), and sample standard deviation (s).
Sampling Frame
A sampling frame is a list of sampling units: a set of information used to identify a sample population for statistical treatment. It includes a numerical identifier for each individual, plus other identifying information about characteristics of the individuals, to aid in analysis and allow for division into further frames for more in-depth analysis.
Sampling Units
Sampling units are non-overlapping collections of elements from the population that
cover the entire population. It is a member of both the sampling frame and of the sam-
ple. The sampling units partition the population of interest for example households or
individual persons for census.
1.3. Sampling Techniques
We explore sampling techniques in order to decide which one is the most appropriate for each given situation. Sampling techniques are methods by which data can be collected from a given population.
Types of Sampling
Probability Sampling
Probability sampling has the distinguishing characteristic that each unit in the population has a known, nonzero probability of being included in the sample. These probabilities are often, but not necessarily, equal. Probability sampling eliminates the danger of bias in the selection process due to one's own opinions or desires.
Non-probability Sampling
Is a process where probabilities cannot be assigned to the units objectively, and hence
it becomes difficult to determine the reliability of the sample results in terms of proba-
bility. A sample is selected according to one’s convenience, or generality in nature. It is
a good technique for pilot or feasibility studies. Examples include purposive sampling,
convenience sampling, and quota sampling. In non-probability sampling, the units
that make up the sample are collected with no specific probability structure in mind
e.g. units making up the sample through volunteering.
Remark:
We shall focus on probability sampling because if an appropriate technique is chosen,
then it assures sample representativeness and hence the errors for the sampling can
be estimated.
Sample size
The sample size determines the degree of reliability of the conclusions that we can obtain, i.e. an estimate of the error that we are going to have. An inappropriate selection of the elements of the sample can cause further errors when we want to estimate the corresponding population parameters.
The four methods of probability sampling are simple random, systematic, stratified
and cluster sampling methods.
1.4.1. Simple Random Sampling
Simple random sampling requires that each element of the population has an equal chance of being selected. A simple random sample is selected by assigning a number to each element in the population list and then using a random number table to draw out the elements of the sample. The element whose number is drawn makes it into the sample. The population is "mixed up" before a previously specified number, n, of elements is selected at random. Each member of the population is selected one at a time, independently of one another. However, note that all elements of the study population must be either physically present or listed.
Regardless of the exact process used, this method can be laborious, especially when the list of the population is long or the selection is completed manually without the aid of a computer. A simple random sample can be obtained using a calculator (random key), a computer (e.g. the Excel function =rand()), or random number tables.
In this method, every set of n elements in the population has an equal chance of being
selected as the sample.
Advantages
Demerits
• Numbering of the population elements may be time consuming, e.g. for large populations.
Illustration
An example of simple random sampling is writing each member of the population on a piece of paper and putting the papers in a hat. Selecting the sample from the hat is random, and each member of the population has an equal chance of being selected. This approach is not feasible for large populations, but it can be carried out easily if the population is very small.
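To make the numbered-list procedure concrete, here is a minimal Python sketch of simple random sampling. The population size and seed are hypothetical; random.sample draws the requested number of distinct elements, each equally likely to be chosen.

```python
import random

# Hypothetical population: 500 numbered elements (e.g. a listed study population).
population = list(range(1, 501))

random.seed(42)                            # fixed seed so the illustration is reproducible
sample = random.sample(population, k=20)   # draw n = 20 distinct units at random
print(sorted(sample))
```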
1.4.2. Systematic Random Sampling
Illustration
An example of systematic sampling would be if an official from the Academic Registry
of a hypothetical university is to register students for a tour of regional universities.
The official may select at random the 15th student out of the first 20 students in a list
of all students in the university. This official would then keep adding twenty and se-
lecting the 35th student, 55th student, 75th student and so on to register for the tour of
regional universities until the end of the list is reached.
Remark:
In cases where the population is large and the population list is available, systematic
sampling is usually preferred over simple random sampling since it is more convenient
to the experimenter.
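A sketch of the systematic selection described in the illustration, assuming a hypothetical list of 400 student IDs and an interval of k = 20; the random start mirrors picking one of the first 20 students at random.

```python
import random

# Hypothetical listing of 400 students.
students = [f"S{i:03d}" for i in range(1, 401)]

k = 20                            # sampling interval
random.seed(1)
start = random.randrange(k)       # random start among the first k list positions
sample = students[start::k]       # every k-th student from the random start onwards
print(len(sample), sample[:5])
```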
1.4.3. Stratified Sampling
Stratified sampling is used when representatives from each homogeneous subgroup within the population need to be represented in the sample. The first step in stratified sampling is to divide the population into subgroups (strata) based on mutually exclusive criteria. Random or systematic samples are then taken from each subgroup. The sampling fraction for each subgroup may be taken in the same proportion as the subgroup has in the population.
Illustration
As an example, an owner of a local supermarket conducting a customer satisfaction survey may wish to select random customers from each customer type in proportion to the number of customers of that type in the population. Suppose 40 sample units are to be selected, and 10% of the customers are managers, 60% are users, 25% are operators and 5% are students from CUT; then 4 managers, 24 users, 10 operators and 2 students from CUT would be randomly selected.
Remark:
Stratified sampling can also sample an equal number of items from each subgroup.
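A small sketch of proportional allocation for the supermarket illustration above (n = 40, strata shares of 10%, 60%, 25% and 5%). In general the rounded counts may need a minor adjustment so that they still sum to n.

```python
# Proportional allocation of a stratified sample of n = 40 customers.
n = 40
strata_shares = {"managers": 0.10, "users": 0.60, "operators": 0.25, "students": 0.05}

allocation = {stratum: round(n * share) for stratum, share in strata_shares.items()}
print(allocation)   # {'managers': 4, 'users': 24, 'operators': 10, 'students': 2}
```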
1.4.4. Cluster Sampling
In cluster sampling, the population being sampled is divided into naturally occurring groups called clusters. Each cluster should be as heterogeneous as possible, matching the population, so that a cluster is representative of the population. A random sample is then taken from within one or more selected clusters.
Illustration
An organization with 300 small branches providing a service countrywide has an employee at the HQ who is interested in auditing for compliance with some coding standard. The employee might use cluster sampling to randomly select 40 branches as representatives for the audit and then randomly sample coding systems for auditing from just those 40.
Remark:
Cluster sampling can tell us a lot about that particular cluster, but unless the clusters
are selected randomly and a lot of clusters are sampled, generalizations cannot always
be made about the entire population.
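A two-stage sketch of the branch-audit illustration: clusters (branches) are selected at random first, and items are then sampled only from the selected clusters. The branch and system names are hypothetical.

```python
import random

random.seed(7)

# Hypothetical organization: 300 branches, each with 10 coding systems.
branches = {f"branch_{i}": [f"branch_{i}_system_{j}" for j in range(1, 11)]
            for i in range(1, 301)}

selected = random.sample(list(branches), k=40)              # stage 1: pick 40 clusters
audited = [random.choice(branches[b]) for b in selected]    # stage 2: sample within clusters
print(audited[:5])
```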
In summary:
Simple random
Each member of the study population has an equal probability of being selected.
Systematic
Each member of the study population is either assembled or listed, a random start is designated, then members of the population are selected at equal intervals.
Stratified
Each member of the study population is assigned to a homogeneous subgroup or stratum, and then a random sample is selected from each stratum.
Cluster
Each member of the study population is assigned to a heterogeneous subgroup or cluster, then clusters are selected at random and all members of a selected cluster are included in the sample.
1.5. Non-probability Sampling Methods
There are four methods of non-probability sampling discussed in this module: convenience, quota, expert and chain referral.
1.5.1. Convenience or Availability Sampling
This is sampling based on the proximity of the population elements to the decision maker: being at the right place at the right time. Elements nearby are selected, and those not in close physical or communication range are not considered.
1.5.3. Expert or Judgemental Sampling
This is sampling in which the decision maker has direct or indirect control over which elements are to be included in the sample. It is appropriate when the decision maker feels that some population members have better or more information than others, or that some members are more representative than others.
1.5.4. Chain Referral / Snowballing / Networking
The researcher starts with a person who displays the qualities of interest; that person then refers the researcher to the next, and so on.
1.6. Errors in Sampling
During sampling, errors can be committed by the statistician. These are either sampling or non-sampling errors. Errors can be reduced by sampling without bias. Some common sources of bias are (i) incorrect sampling operations and (ii) non-interviews (non-response). Some errors that arise in sampling are discussed below.
Selection error
Selection error occurs when some elements of the population have a higher probability of being selected than others. Consider a scenario where a manager of a local supermarket wishes to measure how satisfied his customers are. He proceeds to interview some of them from 08:00 to 12:00. Clearly, the customers who do their shopping in the afternoon are left out and will not be represented, making the sample unrepresentative of all the customers. Such errors can be avoided by choosing the sample so that all the customers have the same probability of being selected. This is a sampling error.
Non-Response Error
It is possible that some of the elements of the population do not want or cannot answer
certain questions. It may also happen when we have a questionnaire including per-
sonal questions, that some of the members of the population do not answer honestly or
would rather avoid answering. These errors are generally very complicated to avoid,
but in case that we want to check honesty in answers, we can include some questions
called filter questions to detect if the answers are honest. This is a non-sampling error.
Interviewer influence
The interviewer may fail to be impartial i.e. s/he can promote some answers more than
others.
Remark:
A sample that is not representative of the population is called a biased sample.
Questions relating to selecting n elements out of a population of N naturally arise. These are: when concluding about the population, how many of the population elements are represented by each one of the sample elements? What proportion of the population are we selecting? The responses lie in the following factors.

1.7. Data Collection Methods
1.7.1. Observation
This method includes direct observation and desk research. Direct observation involves collecting data by observing the item in action. Examples for this method are: pedestrian flow, vehicle traffic, purchase behavior of a commodity in a shop, quality control inspection, etc. An advantage of this method is that the respondent behaves in a natural way since he is not aware that he is being observed. A disadvantage is that it is a passive form of data collection; there is also no opportunity to investigate the behavior further. Desk research involves consulting source documents and extracting secondary data from them.
1.7.2. Interview
This method collects primary data through direct questioning. A questionnaire is the
instrument used to structure the data collection process. Three approaches in data
collection using interviews are: personal, postal and telephone interviews.
Personal Interviews
A questionnaire is completed through face-to-face contact with the respondent. Advantages of this method are: a high response rate, it allows probing for reasons, data collection is immediate, data accuracy is assured, it is useful when technical data is required, non-verbal responses can be observed and noted, more questions can be asked, responses are spontaneous, and the use of aided-recall questions is possible. Disadvantages of this method are that it is time consuming, it requires trained interviewers, fewer interviews are conducted because of cost and time constraints, and biased data can be collected if the interviewer is inexperienced.
Postal Surveys
When the target population is large and/or geographically dispersed, the use of postal questionnaires is considered most suitable. Advantages of this method are that a larger sample of respondents can be reached, it is more cost effective, interviewer bias is eliminated, respondents have more time to consider their responses, anonymity of respondents is assured resulting in more honest responses, and respondents are more willing to answer personal questions. The disadvantages of this method are: a low response rate, respondents cannot get clarity on some questions, mailed questionnaires must be short and simple to complete, limited possibilities of probing or further investigation, data collection takes a long time, no control over who answers the questionnaire, and no possibility of validating responses.
Telephone Interviews
The interview is conducted telephonically with the respondent. Advantages of this method are: it allows quicker contact with geographically dispersed respondents, call-backs can be made if the respondent is not initially available, low cost, interviewer probing is possible, clarity on questions can be provided by the interviewer, and a larger sample of respondents can be reached in a short space of time. Disadvantages are that respondent anonymity is lost, non-verbal responses cannot be observed, trained interviewers are required hence it is more costly, possible interviewer bias, the respondent may terminate the interview prematurely, and sampling errors are compounded if some respondents do not have telephones.
1.7.3. Experimentation
This is when primary data is generated through manipulation of variables under con-
trolled conditions. The method is mostly used in scientific and engineering research.
Data on the primary variable under study is monitored and recorded whilst the re-
searcher controls effects of a number of influencing factors. Examples include: De-
mand elasticity for a product, advertising effectiveness. Advantages of this method
are: quality data is collected and results are generally more objective and valid. The
disadvantages are that the method is costly and time consuming, and it may be impossible to control for certain factors which affect the results.
Chapter 2

Data and Data Presentation
2.1. Introduction
The world of statistics revolves around data; there is no statistics without data. What is data? How is it collected? Why do we collect it? These are the questions to be answered in this chapter.
The quality of data is influenced by three factors: the type of data, its source and the method used to collect it. The type of data gathered determines the type of analysis which
can be performed on the data. Certain statistical methods are valid for certain data
types only. An incorrect application of a statistical method to a particular data type
can render the findings invalid.
Data type is determined by the nature of the random variables which the data repre-
sents. Random variables are essentially of two kinds: qualitative and quantitative.
Qualitative random variables are variables which yield categorical (non-numeric) responses. The data generated by qualitative random variables are classified into one of a number of categories.
The numbers representing the categories are arbitrary codes: coded values cannot be meaningfully manipulated arithmetically.
Quantitative random variables are variables that yield numeric responses. The data
generated for quantitative random variables can be meaningfully manipulated using
conventional arithmetic operations.
Each random variable category is associated with a different type of data. There
are two classifications of data types.
Data measurement scales include Nominal, Ordinal, Interval and Ratio-scaled data.
Nominal-scaled data
Objects or events are distinguished on the basis of a name. Nominal-scaled data is
associated mainly with qualitative random variables. Where data of qualitative ran-
dom variables is assigned to one of a number of categories of equal importance, then
such data is referred to as nominal-scaled data. There is no implied ordering between
the groups of the random variable.
Each observation of the random variables is assigned to only one of the categories
provided. Arithmetic calculations cannot be meaningfully performed on the coded val-
ues assigned to each category. They are only numeric codes which are arbitrarily as-
signed and can only be counted. Nominal-scaled data is the weakest form of data, since
only a limited range of statistical analysis can be performed on such data.
Ordinal-scaled data
Objects or events are distinguished on the basis of the relative amounts of some characteristics they possess. The magnitude between measurements is not reflected in the
rank. Such data is associated mainly with qualitative random variables. Like nominal-
scaled data, ordinal-scaled data is also assigned to only one of a number of coded cat-
egories, but there is now a ranking implied between the categories in terms of being
better, bigger, longer, older, taller, or stronger, etc. While there is an implied differ-
ence between the categories, this difference cannot be measured exactly. That is, the
distance between categories cannot be quantified nor assumed to be equal. Ordinal-
scaled data is generated from ranked responses in market research studies.
There is a wider range of valid statistical methods (i.e. the area of non-parametric
statistics) available for the analysis of ordinal-scaled data than there is for nominal-
scaled data. Ordinal-scaled data is also generated from a ”counting process”.
Interval-scaled data
Interval-scaled data is associated with quantitative random variables. Differences
can be measured between values of a quantitative random variable. Thus interval-
scaled data possesses both order and distance properties. Interval-scaled data, how-
ever, does not possess an absolute origin. Therefore the ratio of values cannot be mean-
ingfully compared for interval-scaled data. The absolute difference makes sense when
interval-scaled data has been collected.
Ratio-scaled data
This data is associated mainly with quantitative random variables. If the full range of
arithmetic operations can be meaningfully performed on the observations of a random
variable, the data associated with that random variable is termed ratio-scaled. It is a
numeric data with a zero origin. The zero origin indicates the absence of the attribute
being measured.
Such data are the strongest form of statistical data which can be gathered and
lends itself to the widest range of statistical methods. Ratio-scaled data can be ma-
nipulated meaningfully through normal arithmetic operations. Ratio-scaled data is
When data capturing instruments are set up, care must be exercised to ensure that
the most useful form of data is captured. However, this is not always possible for
reasons of convenience, cost and sensitivity of information. This applies particularly
to random variables such as age, personal income, company turnover and consumer
behavior questions of a personal nature. The functional area of marketing generates
mostly categorical (i.e. nominal/ordinal) data arising from consumer studies, while the
areas of finance/accounting and production generate mainly quantitative (ratio) data.
Human resources management generates a mix of qualitative and quantitative data
for analysis.
Data type 2
Discrete data
A random variable whose observations can take on only specific values, usually only
integer values, is referred to as a discrete random variable. In such instances, certain
values are valid, while others are invalid.
Continuous data
A random variable whose observations take on any value in an interval is said to gen-
erate continuous data. This means that any value between a lower and an upper limit
is valid.
Data which is captured at the point where it is generated is called primary data. Such data is captured for the first time with a specific purpose in mind. Examples of such data sources are largely the same as for internal data sources, but also include survey data (personnel surveys, salary surveys, market research surveys).
• Combining various sources could lead to errors of collation and introduce bias.
Data can be presented in tables or graphs. Graphical techniques are pictorial or graphical representations of data such that the main features of the data are captured. The graphical techniques we will cover in this unit are pie charts, bar charts, histograms, box-and-whisker plots and stem-and-leaf displays. Some other important techniques, such as dotplots, the Lorenz curve and Z curves, are not discussed in this module.
A pie chart, as the name suggests, is a circle divided into segments like a pie cut into
pieces from the centre outwards. Each segment represents one or more values taken
by a variable. Such charts are used to display qualitative data. Let us now look at an
example, and see how we can construct and interpret a pie chart.
Example 1.1
The ages of 10 students doing BSCAC program at Chinhoyi University of Technology
are: 26, 28, 28, 16, 22, 35, 42, 19, 55, 28. Grouping the ages into classes of 25 and
below, 26-35, 36-45, and above 45, leads to a frequency distribution table below.
We now express these age groups as proportions or percentages and then indicate
the angle in degrees as in table below.
The angle of the segment for the ith category can be computed directly from the observations using

angle_i = (X_i / Σ_{i=1}^{n} X_i) × 360°

i.e. each observation (or category frequency) is multiplied by 360° and divided by the sum of the observations.
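As a quick check of the angle formula, the sketch below tallies the ten ages from Example 1.1 into the four classes and converts each class frequency to a pie-chart angle; the frequency counts are my own tally of the listed ages.

```python
# Frequencies tallied from the ages in Example 1.1:
# 26, 28, 28, 16, 22, 35, 42, 19, 55, 28.
freqs = {"25 and below": 3, "26-35": 5, "36-45": 1, "above 45": 1}

total = sum(freqs.values())
for group, f in freqs.items():
    angle = f / total * 360          # angle_i = (X_i / sum of X) * 360 degrees
    print(f"{group}: {angle:.0f} degrees")
```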
A bar chart, as the name suggests, is a visual presentation of data by means of bars
or blocks put side by side. Each bar represents a count of the different categories of
the data. Although both pie charts and bar graphs (as they are sometimes called) are used to illustrate qualitative data or discrete quantitative data, bar charts use the actual counts or frequencies of occurrence of each category of data. We need not use the
actual data; we can use the percentage to come up with the Bar graph. Let us use the
data in example 1.1 to illustrate the bar chart.
Example 1.2
We will now construct the bar chart using the data in Example 1.1. We come up with suitable scales for the height and width of the graph such that the graph is clear and representative. The bars represent each age group count in terms of height. You can choose to make the bars thin or wide; all you need to be certain of is that the bars represent each age group in terms of height. The bars should be of the same width. Often, we represent each category by different colours or shades. This is especially useful when we are comparing several groups. For instance, we could be comparing the age groups of different intakes, which would mean several graphs all put side by side. In this way we can compare the intakes of a given age group over different years.
2.4.3. Histograms
Exercise
Consider the results of a test written by 45 students and marked out of 70. The data is presented in categories in the table below. Use the data to draw a histogram of the mark distribution.

Marks     Frequencies
10 - 19        7
20 - 29       20
30 - 39        9
40 - 49        3
50 - 59        5
60 - 69        1
Example 1.3
A scientist interested in finding out the age groups of people interested in cultural
movies went to a movie theatre and collected the following information. The ages of the people watching the movie are shown below.
7 15 22 38 12 18 14 26 20 15 22 34 12 18 24
19 14 29 21 32 12 17 24 13 25 20 15 31 11 16
23 39 19 14 28 20 9 16 22 39 13 25 19 14 31
The stems 0, 1, 2 and 3 are listed on the left side of a vertical line and the leaves on the right side opposite the appropriate stem. The stem-and-leaf diagram of these data is shown below. A stem-and-leaf display should always have a key that indicates how the data is displayed. Key: 0|7 = 7, 3|8 = 38.
Also note that the 1st, 2nd, 3rd, etc. numbers on the leaf side should be in the same columns for the histogram feature to be revealed.
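A minimal sketch that rebuilds the stem-and-leaf display for the Example 1.3 ages, using the tens digit as the stem and the units digit as the leaf.

```python
from collections import defaultdict

# Ages of moviegoers from Example 1.3.
ages = [7, 15, 22, 38, 12, 18, 14, 26, 20, 15, 22, 34, 12, 18, 24,
        19, 14, 29, 21, 32, 12, 17, 24, 13, 25, 20, 15, 31, 11, 16,
        23, 39, 19, 14, 28, 20, 9, 16, 22, 39, 13, 25, 19, 14, 31]

stems = defaultdict(list)
for age in sorted(ages):
    stems[age // 10].append(age % 10)    # stem = tens digit, leaf = units digit

for stem in sorted(stems):
    print(f"{stem} | {' '.join(str(leaf) for leaf in stems[stem])}")
# Key: 0 | 7 = 7,  3 | 8 = 38
```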
Frequency polygons are an alternative to histograms. The only difference is that a frequency polygon is a line plot of the frequencies against the corresponding class mid-points. The points are joined by straight lines.
2.5. Exercises
1. Classify the following data sources as either Primary or secondary and Internal
or external
2. Define primary and secondary data. Include in your answers the advantages and
disadvantages of both data types. Give two examples of secondary data.
24 19 21 27 20 17 17 32 22 26 18 13 23 30 10
13 18 22 34 16 18 23 15 19 28 25 25 20 17 15
(a) Define the random variable, the data type and the measurement scale.
(b) From the data, prepare:
i. an absolute frequency distribution
ii. a relative frequency distribution and
iii. the (relative) less than ogive.
(c) Construct the following graphs:
i. a histogram of the relative frequency distribution,
ii. stem and leaf diagram of the original data
(d) From the graphs, read off what percentage of trips were:
i. between 25 and 30 km long
ii. under 25km
iii. 22km or more?
Chapter 3

Measures of Central Tendency
3.1. Introduction
In the previous unit, graphical displays were discussed. These are useful means of communicating broad overviews of the behaviour of a random variable. However, there is a need for numerical measures, called statistics, which convey more precise information about the behaviour pattern of a random variable. The behaviour or pattern of any random variable can be described by measures of central tendency, namely the:
• Mean,
• Mode and
• Median.
Each measure will be computed for ungrouped data and grouped data.
The arithmetic mean for ungrouped data from a sample is defined as

x̄ = (1/n) Σ_{i=1}^{n} xi

where n is the number of observations in the sample, xi is the value of the ith observation of random variable x and x̄ is the symbol for the sample arithmetic mean. Σ_{i=1}^{n} xi is the shorthand notation for the sum of the n individual observations, i.e.

Σ_{i=1}^{n} xi = x1 + x2 + x3 + ... + xn
• The arithmetic mean uses all values of the data set in its computation.
• The sum of the deviations of each observation from the mean value is equal to zero, i.e. Σ_{i=1}^{n} (xi − x̄) = 0. This makes the mean an unbiased statistical measure of central location.
There are other means that can be calculated for different distributions of values, namely the harmonic mean, the geometric mean and the weighted arithmetic mean. We will not discuss them at this time.
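A short sketch illustrating the sample mean and the zero-sum-of-deviations property stated above; the seven observations are an arbitrary small sample chosen purely for illustration.

```python
# A small illustrative sample.
x = [13, 7, 10, 15, 12, 18, 9]

n = len(x)
mean = sum(x) / n                          # x-bar = (1/n) * sum of the observations
deviations = [xi - mean for xi in x]

print(mean)                                # 12.0
print(round(sum(deviations), 10))          # 0.0: deviations about the mean sum to zero
```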
In an ungrouped data set, the mode is obtained by observing the data carefully and finding the most frequently occurring observation. However, if the number of observations is large, the mode can be found by arranging the data in ascending order and, by inspection, identifying the value that occurs most frequently.
Example 1
If B=Blue, G=Green, R=Red and Y=Yellow. Consider a sample: YGBRBBRGYB, picked
from a mixed bag. What is the modal colour?
Solution
The modal colour is Blue, because it appears most, with a frequency of 4.
In finding the mode for grouped data, we first identify the modal class, i.e. the class interval with the highest frequency. The mode lies in this class, and the modal value is then calculated using the formula

Mode = l_mo + c(f1 − f0) / (2f1 − f0 − f2)    (3.4)

where l_mo is the lower limit of the modal class, f1 is the frequency of the modal class, f0 is the frequency of the class preceding the modal class, f2 is the frequency of the class succeeding the modal class and c is the width of the modal class.
Example 2
Find the mode of the test marks given in the frequency table below.

Test mark, x    5 - 10   10 - 15   15 - 20   20 - 25   25 - 30
Frequency          3         5         7         2         4

Solution
We invoke the formula

Mode = l_mo + c(f1 − f0) / (2f1 − f0 − f2)

where 15 - 20 is the modal class (it has the highest frequency of 7), l_mo = 15, f1 = 7, f0 = 5, f2 = 2 and c = 5. Substituting these into the equation above yields

Mode = 15 + 5(7 − 5) / (2(7) − 5 − 2) = 15 + 10/7 ≈ 16.43
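The grouped-mode formula can be wrapped in a small helper function; the call below reproduces the Example 2 result using the same class values.

```python
def grouped_mode(l_mo, c, f1, f0, f2):
    """Mode = l_mo + c(f1 - f0) / (2*f1 - f0 - f2) for grouped data."""
    return l_mo + c * (f1 - f0) / (2 * f1 - f0 - f2)

# Example 2: modal class 15-20 (frequency 7), preceded by 5 and followed by 2, width 5.
print(round(grouped_mode(15, 5, 7, 5, 2), 2))   # 16.43
```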
The median is the value of a random variable which divides an ordered (ascending or descending order) data set into two equal parts. It is also called the second quartile Q2 or the 50th percentile. Half of the observations fall below the median value and the other half above it. If the number of observations, n, is odd, then the median is the ((n + 1)/2)th observation. If the number of observations is even, then the median is the mean of the (n/2)th and (n/2 + 1)th observations. First, we consider ungrouped data.
Given the following income data presented in a frequency distribution table, find the median.
Solution
The number of observations is 100, which is even, thus the median is the mean of the (n/2)th and (n/2 + 1)th observations, i.e. the mean of the 50th and 51st observations. To find these observations we first find the cumulative frequencies of the data set. The 50th observation is 4400 and the 51st observation is 4900. Thus

Median = (4400 + 4900) / 2 = 4650

Interpretation
This means 50% of the workers get incomes that are less than $4650 and the other 50% get incomes that are more than $4650.
Given the following grouped data in a frequency table, find the median.
We use a standard formula to calculate the median of grouped data:

Median = O_me + c(n/2 − F(<)) / f_me    (3.5)

where me denotes the median class, O_me is the lower limit of the median class, n is the sample size (i.e. the total number of observations), f_me is the frequency of the median class, F(<) is the cumulative frequency of the class prior to the median class and c is the width of the median class.
To use this formula, we calculate the cumulative frequencies and then identify the median class, which is the class containing the ((n + 1)/2)th observation.
Example 4:
Calculate the median of the following grouped data.
Solution:
First and foremost, order the data set; in this case it is already ordered. Then calculate the cumulative frequencies. Using

Median = O_me + c(n/2 − F(<)) / f_me

the median class is 20 - 30, c = 10, n = 50, F(<) = 14, f_me = 22 and O_me = 20. Substituting, we have

Me = 20 + 10(50/2 − 14) / 22 = 20 + 110/22 = 25

Interpretation
This implies that 50% of the students got less than 25 marks and the other 50% got more than 25 marks.
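Similarly, the grouped-median formula in equation (3.5) can be checked with a small helper; the values below are those of Example 4.

```python
def grouped_median(o_me, c, n, cum_before, f_me):
    """Median = O_me + c(n/2 - F(<)) / f_me for grouped data."""
    return o_me + c * (n / 2 - cum_before) / f_me

# Example 4: median class 20-30, c = 10, n = 50, F(<) = 14, f_me = 22.
print(grouped_median(20, 10, 50, 14, 22))   # 25.0
```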
The advantage of the median is that it is unaffected by outliers and is a useful measure
of central tendency when the distribution of a random variable is severely skewed. A
disadvantage of the median, however, is that it is inappropriate for categorical data.
It is best suited as a central location measure for interval-scaled data such as rating
scales.
3.6. Quartiles
Quartiles are those observations that divide an ordered data set into quarters (four equal parts). The lower quartile, Q1, is the first quartile or 25th percentile. It is the observation which separates the lower 25 percent of the observations from the top 75 percent of the ordered observations. The middle quartile, Q2, is the second quartile, 50th percentile or median. It divides an ordered data set into two equal halves. The upper quartile, Q3, is the third quartile or 75th percentile. It is the observation which separates the lower 75 percent of the observations from the top 25 percent.
To compute quartiles, a similar procedure is used as for calculating the median. The only difference lies in (i) the identification of the quartile position, and (ii) the choice of the appropriate quartile interval. Each quartile is determined as follows:
Q1 = O_q1 + c(n/4 − F(<)) / f_q1

where O_q1 is the lower limit of the class interval containing the lower quartile, F(<) is the cumulative frequency of the class interval before the lower quartile interval, f_q1 is the frequency of the lower quartile interval and c is the width of the lower quartile interval.
Exercise
Using the income data below, find Q1 and Q3.
Solution
n = 100, hence the Q1 position is at n/4 = 100/4 = 25, i.e. the 25th position. Arranging the number of workers cumulatively, i.e. coming up with a cumulative distribution table, the 25th value lies at income $4100. Hence Q1 is $4100. Show that Q3 is $5200.
For grouped data, a formula is used to find the exact value. Find the first, second and third quartile values from the distribution below.

Q1 = O_q1 + c(n/4 − F(<)) / f_q1

where Q1 is the lower quartile, O_q1 is the lower limit of the Q1 interval (class), n is the sample size (total number of observations), F(<) is the cumulative frequency of the interval before the Q1 interval, f_q1 is the frequency of the Q1 interval and c is the width of the Q1 interval.
Thus:

Q1 = O_q1 + c[n/4 − F(<)] / f_q1 = 10 + 10(50/4 − 2) / 12 = 18.75
Interpretation:
25 % of the students got below 18.75 marks.
Q2 = O_q2 + c[n/2 − F(<)] / f_q2

Q3 = O_q3 + c[3n/4 − F(<)] / f_q3

where Q3 is the upper quartile, O_q3 is the lower limit of the Q3 class interval, n is the sample size (i.e. the total number of observations), F(<) is the cumulative frequency of the interval before the Q3 interval, f_q3 is the frequency of the Q3 interval and c is the width of the Q3 interval.
Thus:

Q3 = O_q3 + c[3n/4 − F(<)] / f_q3 = 30 + 10(3 × 50/4 − 36) / 8 = 31.875
Interpretation:
75% of the students got below 31.875 marks. Alternatively, 25% of the students got
above 31.875 marks.
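The same interpolation generalises to any quartile (or percentile) once the class holding the required position is identified. The sketch below reproduces Q1 = 18.75 and Q3 = 31.875 from the worked example.

```python
def grouped_quantile(position, lower_limit, c, cum_before, f_class):
    """Value = lower_limit + c(position - F(<)) / f, for the class containing the position."""
    return lower_limit + c * (position - cum_before) / f_class

n = 50
q1 = grouped_quantile(n / 4, 10, 10, 2, 12)        # Q1 class 10-20: F(<) = 2, f = 12
q3 = grouped_quantile(3 * n / 4, 30, 10, 36, 8)    # Q3 class 30-40: F(<) = 36, f = 8
print(q1, q3)                                      # 18.75 31.875
```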
3.6.5. Percentiles
In general, any percentile value can be found by adjusting the median formula: (i) find the required percentile position and, from this, (ii) establish the percentile interval.
Example
90th percentile position = 0.9 × n, 35th percentile position = 0.35 × n, 25th percentile position (Q1) = 0.25 × n.
Uses of percentiles:
Percentiles are used to identify various non-central values. For example, if it is desired
to work with a truncated dataset which excludes extreme values at either end of the
ordered dataset.
3.7. Skewness
1. If mean = median = mode, the frequency distribution is symmetric.

2. If mean < median < mode, the frequency distribution is negatively skewed, i.e. skewed to the left.

3. If mean > median > mode, the frequency distribution is positively skewed, i.e. skewed to the right.
Remark:
1. If the frequency distribution is symmetric, the mean is usually the preferred measure of central location since it uses all the observations.

2. If the frequency distribution is skewed, the median may be the best measure of central location as it is not pulled by extreme values, nor is it as highly influenced by the frequency of occurrence.
3.8. Kurtosis
Kurtosis is the measure of the degree of peakedness of a distribution. Frequency distributions can be described as leptokurtic, mesokurtic or platykurtic.
• Leptokurtic: a very peaked distribution, i.e. the observations are highly concentrated about the central location.
• Mesokurtic: a moderately peaked distribution, between the two extremes.
• Platykurtic: a flat distribution, i.e. the observations are widely spread about the central location.
3.9. Exercises
1. The number of days in a year that employees in a certain company were away
from work due to illness is given in the following table:
Find the modal class and the modal sick days and interpret.
Sex F M F M F M M F F F F M
Seniority (yrs) 8 15 6 2 9 21 9 3 4 7 2 10
(a) Find the seniority mean, median and mode for the above data.
(b) Which of the mean, median and mode is the least useful measure of location
for the seniority data? Give a reason for your answer.
(c) Find the mode for the sex data. Does this indicate anything about the em-
ployment practice of the company when compared to the medians for the
seniority data for males and females?
Chapter 4
Measures of Dispersion
4.1. Introduction
Spread or dispersion refers to the extent to which the observations of a random variable are scattered about the central value. Measures of dispersion provide useful information with which the reliability of the central value may be judged. Widely dispersed observations indicate low reliability and less representativeness of the central value. Conversely, a high concentration of observations about the central value increases confidence in the reliability and representativeness of the central value.
4.2. Range
The range is the difference between the highest and the lowest observed values in a dataset, i.e. Range = x_max − x_min.
Example 6:
Given the following data in a frequency distribution table, find the range.
Solution:
For a grouped distribution with class intervals, x_min is the lower limit of the lowest class interval and x_max is the upper limit of the highest class interval.

Interquartile Range = Q3 − Q1
This modified range removes some of the instability inherent in the range if out-
liers are present, but it excludes 50 percent of all observations from further analysis.
This measure of dispersion, like the range, also provides no information on the clus-
tering of observations within the dataset as it uses only two observations.
Quartile deviation
A measure of variation based on this modified range is called quartile deviation (QD)
or the semi-interquartile range. It is found by dividing the interquartile range in half
i.e.
Quartile deviation = (Q3 − Q1) / 2
Remember, when calculating this measure, to order your dataset first in order to calculate Q3 and Q1. The quartile deviation is an appropriate measure of spread for the median. It
identifies the range below and above the median within which 50 percent of observa-
tions are likely to fall. It is a useful measure of spread if the sample of observations
contains excessive outliers as it ignores the top 25 percent and bottom 25 percent of
the ranked observations.
4.3. Variance
The most useful and reliable measures of dispersion are those that take every observa-
tion into account and are based on an average deviation from a central value. Variance
is such a measure of dispersion. Population variance is denoted by σ 2 whereas sample
variance is denoted by s2 .
The population and sample variances are defined as

σ² = Σ(xi − µ)² / N and s² = Σ(xi − x̄)² / (n − 1)

The main difference is in the denominator: the population variance divides by N, whereas the sample variance divides by n − 1.
Consider the ages, in years, of 7 second hand cars: 13, 7, 10, 15, 12, 18, 9. Find the
variance of the ages of cars.
Solution
Step 1: Find the sample mean, x̄ = 84/7 = 12 years.
Step 2: Find the squared deviation of each observation from the sample mean. See
table below.
Car age, xi    Mean, x̄    Deviation (xi − x̄)    Squared deviation (xi − x̄)²
    13            12               +1                         1
     7            12               −5                        25
    10            12               −2                         4
    15            12               +3                         9
    12            12                0                         0
    18            12               +6                        36
     9            12               −3                         9
 Total                     Σ(xi − x̄) = 0            Σ(xi − x̄)² = 84
Step 3: Find the average squared deviation, that is the variance, using the formula

S² = Σ(xi − x̄)² / (n − 1) = 84 / (7 − 1) = 14 years²

Note:
Division by n would appear logical, but the variance statistic would then be a biased measure of dispersion. It can be shown to be unbiased if division is by (n − 1). For large samples, i.e. n greater than 30, this distinction becomes less important.
The variance can also be calculated using the formula below, which gives the same result as the formula above:

S² = (Σ xi² − n x̄²) / (n − 1)

With Σ x² = 1092, Σ x = 84, n = 7 and x̄ = 12, substituting the values into the formula gives

S² = [1092 − 7(12²)] / (7 − 1) = 84 / 6 = 14 years²
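Both variance formulas can be verified directly for the car-age data; the sketch below computes the definitional and the computational forms and shows they agree.

```python
# Ages of the 7 second hand cars from the worked example.
x = [13, 7, 10, 15, 12, 18, 9]
n = len(x)
mean = sum(x) / n                                                        # 12.0

s2_definitional = sum((xi - mean) ** 2 for xi in x) / (n - 1)            # sum of squared deviations / (n - 1)
s2_computational = (sum(xi ** 2 for xi in x) - n * mean ** 2) / (n - 1)  # (sum x^2 - n*mean^2) / (n - 1)

print(s2_definitional, s2_computational)   # 14.0 14.0 (years squared)
```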
Variance for grouped data
Grouped data is data presented in a frequency distribution table. The sample variance for such grouped data is calculated using the formula

S² = Σ_{i=1}^{n} f(xi − x̄)² / (n − 1)

or, equivalently,

S² = (Σ f xi² − n x̄²) / (n − 1)

The population variance is given by

σ² = (Σ f xi² − N µ²) / N
Example 7:
Consider the data for student marks obtained from Test 1. Calculate the variance of the student marks.
Solution
The midpoint of an interval is calculated as the average of its lower and upper class limits. The mean is

x̄ = Σ f x / n = 1290 / 50 = 25.8

and Σ f x² = 38450. Using the above formula, the variance is

S² = (Σ f xi² − n x̄²) / (n − 1) = (38450 − 50(25.8)²) / (50 − 1) = 5168 / 49 = 105.47 marks²
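A sketch of the grouped-variance calculation. The class midpoints and frequencies shown are an assumption reconstructed from the cumulative frequencies quoted in the earlier quartile example; they reproduce the totals used in Example 7 (Σfx = 1290 and Σfx² = 38450).

```python
# Assumed Test 1 mark distribution (classes 0-10, 10-20, 20-30, 30-40, 40-50).
midpoints = [5, 15, 25, 35, 45]
freqs = [2, 12, 22, 8, 6]

n = sum(freqs)                                                   # 50
mean = sum(f * x for f, x in zip(freqs, midpoints)) / n          # 1290 / 50 = 25.8
sum_fx2 = sum(f * x ** 2 for f, x in zip(freqs, midpoints))      # 38450
variance = (sum_fx2 - n * mean ** 2) / (n - 1)

print(mean, round(variance, 2))                                  # 25.8 105.47
```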
The variance is a measure of average squared deviation about the arithmetic mean.
It is expressed in squared units. Consequently, the meaning in a practical sense is
obscure. To provide meaning, the measure should be expressed in the original units of
the random variable.
4.4. Standard deviation
A standard deviation is a measure which expresses the average deviation about the mean in the original units of the random variable. The standard deviation is the square root of the variance. Mathematically, S = √S² for a sample and σ = √σ² for a population.

4.5. Coefficient of variation
The coefficient of variation (CV) expresses the standard deviation as a percentage of the mean:

CV = (S / x̄) × 100%    (sample)

CV = (σ / µ) × 100%    (population)
This ratio describes how large the measure of dispersion is relative to the mean of the observations. A coefficient of variation value close to zero indicates low variability and a tight clustering of observations about the mean. Conversely, a large coefficient of variation value indicates that observations are more spread out about their mean value.
For the student marks data, S = √105.47 = 10.27 marks, so

CV = (S / x̄) × 100% = (10.27 / 25.8) × 100% = 39.8%.
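A two-line check of the standard deviation and coefficient of variation for the Test 1 marks.

```python
import math

variance = 105.47                       # marks squared, from Example 7
mean = 25.8

std_dev = math.sqrt(variance)           # back in the original units (marks)
cv = std_dev / mean * 100               # dispersion relative to the mean, in percent
print(round(std_dev, 2), round(cv, 1))  # 10.27 39.8
```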
4.6. Exercises
1. Find the mean and the standard deviation for the following data which records
the duration of 20 telephone hotline calls on the 0772 line for advice on car re-
pairs.
At a cost of $2.60 per minute, what was the average cost of a call, and what was the total cost paid by the 20 telephone callers? Calculate the coefficient of variation and interpret it.
47 31 42 33 58 51 25 28
62 29 65 46 51 30 43 72
73 37 29 39 53 61 52 35
3. Give three reasons why the standard deviation is regarded as a better measure
of dispersion than the range.
(a) Outliers
(b) Skewness
(c) Kurtosis
Chapter 5
Basic Probability
5.1. Introduction
This unit introduces simple concepts and terminology in probability. These
include events, types of probabilities and rules of probabilities. Probability theory
is fundamental to the area of statistical inference. Inferential statistics deals with
generalising the behaviour of random variables from sample findings to the broader
population. Probability theory is used to quantify the uncertainties involved in making
these generalisations.
5.2. Definition
An Event is a collection of possible outcomes from an experiment or a trial. For ex-
ample Heads or Tails are events which can be obtained from tossing a fair coin.
Most decisions are made in the face of uncertainty. Probability is therefore, concerned
with uncertainty.
Subjective probability
It is probability which is based on a personal judgement that a given event will occur.
There is no theoretical or empirical basis for producing subjective probabilities. In
other words, this is the probability of an event based on an educated guess, expert opinion or just plain intuition. Subjective probabilities cannot be statistically verified and they are not extensively used, hence they will not be considered further.
Examples
1. When commuters board a commuter omnibus, they assume that they will arrive
safely at their destinations, so P(arriving safely) = 1.
2. If you invest, you assume that you will get a good return, so P (good return) =
0.9.
Objective probabilities
These are probabilities that can be verified, through repeated experimentation or em-
pirical observations. Mathematically, it is defined as a ratio of two numbers:

P(A) = r / n

where r is the number of outcomes in which event A occurs and n is the total number of trials or possible outcomes. These probabilities can be determined:

• a priori, that is, when the possible outcomes are known in advance, such as tossing a coin or selecting cards from a deck of cards (classical probability). For example, the probability of a head if a fair coin is tossed once is P(Head) = 1/2 = 0.5.

• empirically, that is, when the values of r and n are not known in advance and have to be observed through data collection; from a relative frequency table the probabilities of the different outcomes can be deduced. For instance, if out of a random sample of 90 customers 50 said they prefer Bakers Inn bread, then the relative frequency that a randomly selected customer will prefer Bakers Inn bread is 50/90 ≈ 0.56.
Example
Consider the random process of drawing cards from a card deck. These probabilities are called a priori probabilities.
1. Let A = event of selecting a red card. Then P(Red card) = 26/52 = 1/2 (26 possible red cards out of 52 cards).
2. Let B = event of selecting a spade. Then P(Spade) = 13/52 = 1/4 (13 possible spades out of a total of 52).
3. Let C = event of selecting an ace. Then P(Ace) = 4/52 = 1/13 (4 possible aces out of a total of 52 cards).
4. Let D = event of selecting 'not an ace'. Then P(not an ace) = 1 − P(Ace) = 1 − 1/13 = 12/13.
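The a priori card probabilities above can also be checked empirically by simulating repeated draws; the deck representation and the number of trials are choices made purely for illustration.

```python
import random

# A 52-card deck: 4 suits x 13 ranks.
suits = ["hearts", "diamonds", "clubs", "spades"]
ranks = ["A", "2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K"]
deck = [(rank, suit) for suit in suits for rank in ranks]

random.seed(0)
trials = 100_000
draws = [random.choice(deck) for _ in range(trials)]

print(sum(s in ("hearts", "diamonds") for _, s in draws) / trials)  # close to 26/52 = 0.5
print(sum(s == "spades" for _, s in draws) / trials)                # close to 13/52 = 0.25
print(sum(r == "A" for r, _ in draws) / trials)                     # close to 4/52 = 0.077
```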
3. Complement of an event
The complement of an event A is the collection of all possible outcomes that are not contained in event A. That is, P(A^c) = 1 − P(A). Note that P(A^c) is also sometimes written as P(Ā) or P(A'). In other words, P(A) + P(A') = 1.
Examples
(a) Passing and failing the same examination are mutually exclusive. In other words, it is not possible to pass and fail the same examination at the same time.
(b) In tossing a fair die once, getting a 3 and a 5 are mutually exclusive. You get one outcome at a time and not both.
Examples
(a) In tossing a fair die once, getting an odd number or a number greater than 2 are non-mutually exclusive events, i.e. it is possible for the number to be odd and at the same time greater than 2.
(b) An individual can have more than one bank account i.e. if you open a bank
account it does not prevent you from opening another account with another
bank.
Example
Consider a random experiment of selecting companies from the Zimbabwe Stock
Exchange (ZSE). Let event A = small company, event B = medium company and
event C = large company. Then (A ∪ B ∪ C) = sample space (small, medium, large
companies) = all ZSE companies.
Example
Let A = event that an employee is over 30 years of age, and B = event that the employee is female. If it can be assumed or empirically verified that, in a large organisation, a randomly selected employee over 30 years of age is equally likely to be male or female, then the two events A and B are statistically independent.
Remark
The terms Statistically independent events and mutually exclusive events should
not be confused. They are two very different concepts. When two events are
mutually exclusive, they are NOT Statistically independent. They are dependent
in the sense that if one event happens, then the other event cannot happen. In
probability terms, the probability of the intersection of two mutually exclusive
events is zero, while the probability of two independent events is equal to the
product of the probabilities of the separate events.
Example
What is the probability of getting a 5 or a 6 if a fair die is tossed once? Since the two events are mutually exclusive, P(5 or 6) = P(5) + P(6) = 1/6 + 1/6 = 1/3.
For events that are not mutually exclusive, P(A and B) is subtracted to avoid double counting; the intersection sign, ∩, denotes the joint occurrence of events A and B.
Example
What is the probability of getting an even number or a number less than four if
a fair die is tossed once?
Answer
Let event A = getting an even number, with elements 2, 4, 6, and event B = getting a number less than four, with elements 1, 2, 3. Then P(A) = 3/6 and P(B) = 3/6. The only element common to A and B is 2, so P(A and B) = 1/6. Therefore
P(A or B) = P(A) + P(B) − P(A ∩ B) = 3/6 + 3/6 − 1/6 = 5/6.
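As a quick check, the addition law result above can be verified by enumerating the six equally likely outcomes of the die. The following is a minimal Python sketch; the set names A and B simply mirror the events in the example.

outcomes = [1, 2, 3, 4, 5, 6]

A = {x for x in outcomes if x % 2 == 0}   # even number: {2, 4, 6}
B = {x for x in outcomes if x < 4}        # less than four: {1, 2, 3}

def prob(event):
    return len(event) / len(outcomes)

print(prob(A), prob(B), prob(A & B))      # 0.5, 0.5, 0.1666...
print(prob(A) + prob(B) - prob(A & B))    # 0.8333... = 5/6
print(prob(A | B))                        # same answer from the union directly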
Exercise 1
Sixty per cent of the population of a town read either magazine A or magazine B and 10% read both. If 50% read magazine A, what is the probability that a person selected at random reads magazine B?
Multiplication Laws
Multiplication laws pertain to dependent and independent events. The key word is AND.
Example
What is the probability of getting a tail on both coins when two fair coins are tossed at the same time?
Answer
Let T1 = the event of getting a tail from coin 1 and T2 = the event of getting a tail from coin 2. The two outcomes do not affect each other, so the events are independent. Therefore
P(T1 and T2) = P(T1) × P(T2) = 1/2 × 1/2 = 1/4.
• Marginal Probability
• Joint Probability
• Conditional probabilities
Marginal Probability
It is the probability of a single event A occurring on its own, irrespective of the outcome of any other event. It is written as P(A). A frequency distribution describes the occurrence of only one characteristic of interest at a time and is used to estimate marginal probabilities.
Joint Probability
It is the chance that two or more events occur simultaneously, i.e. at the same time. If the joint probability of any two events is zero, then the events are mutually exclusive.
Conditional Probability
It is the probability that a given event occurs given that another event has already occurred. The symbol P(A|B) denotes the probability that event A will occur given that event B has already occurred.
P(A|B) = P(A ∩ B) / P(B)     (5.1)
Fees payment method by sex:

Payment Method    Male    Female    Total
Credit Card         10        15       25
Cash                 8         6       14
Total               18        21       39
P(B|A) = P(B ∩ A) / P(A)     (5.2)
Note: P(A ∩ B) = P(B|A) × P(A) = P(A|B) × P(B). P(A and B) is the joint probability of events A and B, and P(B) is the probability of event B, which is a marginal probability.
Example: Joint, Marginal and Conditional Probabilities
Consider the table of fees payment methods by sex given above.
What is the probability of getting a person who is
(a) (i) female and paying by credit card; (ii) male and paying cash?
(b) a credit card user?
(c) (i) female, given that the person pays cash; (ii) a credit card user, given that the person is male?
Answer
(a) Part (a) asks for joint probabilities of the events. The sample space has 39 people altogether.
(i) P(female and credit card) = 15/39 = 0.3846.
Note: joint events should not be confused with independent events. In this case, find the value in the intersection of the Female column and the Credit Card row, which is 15.
(ii) P(male and cash) = 8/39 = 0.2051.
A joint probability is the chance of two events occurring at the same time.
(b) This is a marginal probability.
(i) P(credit card user) = 25/39 = 0.641.
The condition which has been ignored here is sex. For joint probabilities, take the values inside the table as a ratio of the grand total 39. For marginal probabilities, take the row and column totals as ratios of the grand total.
(c) These are conditional probabilities, calculated using formula (5.1).
(i) P(female | cash) = P(Female and Cash) / P(Cash) = (6/39) / (14/39) = 6/14 = 0.4286
(ii) P(credit card | male) = P(Credit card and Male) / P(Male) = (10/39) / (18/39) = 10/18 = 5/9 = 0.5556
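The joint, marginal and conditional probabilities above can be reproduced mechanically from the table. A minimal Python sketch follows; the variable names are purely illustrative.

# Fees payment method by sex: rows = payment method, columns = (male, female).
table = {"credit": (10, 15), "cash": (8, 6)}
grand_total = sum(m + f for m, f in table.values())                # 39

# Joint probabilities: cell value / grand total
p_female_credit = table["credit"][1] / grand_total                 # 15/39 = 0.3846
p_male_cash = table["cash"][0] / grand_total                       # 8/39  = 0.2051

# Marginal probability: row total / grand total
p_credit = sum(table["credit"]) / grand_total                      # 25/39 = 0.641

# Conditional probability: joint / marginal, as in equation (5.1)
p_cash = sum(table["cash"]) / grand_total
p_female_given_cash = (table["cash"][1] / grand_total) / p_cash    # 6/14 = 0.4286

print(p_female_credit, p_male_cash, p_credit, p_female_given_cash)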
Exercise 2
A golfer has 12 golf shirts in his closet. Suppose 9 of these shirts are white and the
others are blue. He gets dressed in the dark, so he just grabs a shirt and puts it on.
He plays golf two days in a row and does not do laundry. What is the likelihood both
shirts selected are white?
Example
A survey of 150 students classified each as to gender and the number of movies at-
tended last month. Each respondent is classified according to two criteria, that is, the
number of movies attended and gender.
Movies Attended    Male    Female    Total
0                    20        40       60
1                    40        30       70
2 or more            10        10       20
Total                70        80      150
Exercise 3
Using the previous example of 150 students and a tree diagram, what is the probability that a selected student is male, given that he has seen one movie?
• Multiplication rule
• Permutations
• Combinations.
a) The total number of ways in which n objects can be arranged (ordered) is given
by:
n! = n factorial = n(n − 1)(n − 2)(n − 3) × · · · × 3 × 2 × 1
Note that 0! = 1.
Example
The number of different ways in which 7 horses can complete a race is given by:
7! = 7 × 6 × 5 × 4 × 3 × 2 × 1 = 5040 different arrangements.
If a random process consists of j trials, where trial i has ni possible outcomes, then the total number of outcomes for the j trials is: n1 × n2 × n3 × n4 × · · · × nj
Example
A restaurant menu has a choice of 4 starters, 10 main courses and 6 desserts. What is the total number of meals that can be ordered in this restaurant?
Solution
The total number of possible meals that can be ordered is: 4 × 10 × 6 = 240 meals.
5.11.2. Permutations
A permutation is the number of distinct ways in which a group of objects can be arranged. Each possible arrangement (ordering) is called a permutation. The number of ways of arranging r objects selected from n objects, where ordering is important, is given by the formula:
P^n_r = n! / (n − r)!     (5.3)
Example
10 horses compete in a race.
(i) How many distinct arrangements are there of the first 3 horses past the post?
(ii) What is the probability of predicting the order of the first 3 horses past the post?
Answer
(i) Since the order of 3 horses is important, it is appropriate to use the permutation
formula.
That is: P^n_r = P^10_3 = 10! / (10 − 3)! = 720
There are 720 distinct ways of selecting the first 3 horses out of 10 horses.
(ii) The probability of correctly predicting the order of the first 3 horses past the post is:
P(first 3 horses) = 1 / (number of ordered arrangements of 3 horses out of 10) = 1/720 chance of winning.
5.11.3. Combinations
A combination is the number of different ways of arranging a subset of objects selected
from a group of objects where the ordering is not important. Each possible arrange-
ment is called a combination. The number of ways of selecting r objects from n objects, where the ordering is not important, is given by:
C^n_r = n! / ((n − r)! r!)     (5.4)
where n! = n(n − 1)(n − 2)(n − 3) × · · · × 3 × 2 × 1, r! = r(r − 1)(r − 2)(r − 3) × · · · × 3 × 2 × 1, r = number of objects selected and n = total number of objects.
Example
10 horses compete in a race.
(i) How many arrangements are there of the first 3 horses past the post, not consid-
ering the order in which the first three pass the post?
(ii) What is the probability of predicting the first 3 horses past the post, in any order?
Answer
(i) The order of the first 3 horses is not important, hence apply the combination for-
mula.
C^n_r = n! / ((n − r)! r!) = 10! / ((10 − 3)! 3!) = 120
There are 120 different ways of selecting the first 3 horses out of 10 horses, with-
out regard to order.
(ii) The probability of selecting the first 3 horses past the post, disregarding order, is given by
P(first 3 horses) = 1 / (number of ways of selecting 3 horses out of 10) = 1/120 chance of winning.
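Both counts, and the associated probabilities, can be checked with Python's math module, which provides perm and comb directly. A small sketch of the horse-race example:

import math

n, r = 10, 3

ordered = math.perm(n, r)        # 10!/(10-3)! = 720 ordered arrangements
unordered = math.comb(n, r)      # 10!/(7! * 3!) = 120 unordered selections

print(ordered, 1 / ordered)      # 720, probability 1/720 of predicting the exact order
print(unordered, 1 / unordered)  # 120, probability 1/120 of predicting the set in any order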
5.12. Exercise
1. Find the values of:
(a) P^7_4
(b) C^8_2
its products at a time. How many different ways can this company compose a
display in the local newspaper?
Chapter 6
Probability Distributions
6.1. Introduction
This unit studies probability distributions. A probability distribution gives the entire range of values that can occur in an experiment, together with their probabilities. A probability distribution is similar to a relative frequency distribution; however, instead of describing the past, it describes how likely future outcomes are. For instance, a drug manufacturer may claim that a treatment will cause weight loss for 80% of the population. A consumer protection agency may test the treatment on a sample of six people. If the manufacturer's claim is true, it is almost impossible to have an outcome where no one in the sample loses weight, and it is most likely that 5 out of the 6 do lose weight.
6.2. Definition
A random variable is a function whose value is a real number determined by each element in the sample space. In other words, it is a quantity resulting from an experiment that, by chance, can assume different values. There are two types of random variables: discrete random variables and continuous random variables.
• Number of defective light bulbs obtained when three light bulbs are selected at
random from a consignment could be 0, 1, 2, or 3.
• The waiting time for customers to receive their order at a manufacturing com-
pany.
Example
Find the probability mass function of the total obtained when a pair of dice is thrown.
Answer
Let X be a random variable whose values x are the possible totals of the outcomes of the two dice. Then x can be any integer from 2 to 12. Two dice can fall in 6 × 6 = 36 equally likely ways, each with probability 1/36. For example, P(X = 3) = 2/36, since a total of 3 can occur in two ways, that is (1, 2) or (2, 1). The probability distribution (mass) function is:

x         2     3     4     5     6     7     8     9     10    11    12
P(X=x)   1/36  2/36  3/36  4/36  5/36  6/36  5/36  4/36  3/36  2/36  1/36
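This distribution can be reproduced by enumerating all 36 equally likely outcomes, as in the short Python sketch below.

from collections import Counter
from itertools import product
from fractions import Fraction

# Count how many of the 36 ordered outcomes give each total.
counts = Counter(d1 + d2 for d1, d2 in product(range(1, 7), repeat=2))

pmf = {total: Fraction(c, 36) for total, c in sorted(counts.items())}
print(pmf)   # {2: 1/36, 3: 1/18 (=2/36), ..., 7: 1/6 (=6/36), ..., 12: 1/36}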
Exercise 1
1. Three coins are tossed. Let X be the number of heads obtained. Find the probability mass function of X.
2. Suppose you are instead interested in the number of tails showing face up. What is the probability distribution for the number of tails?
This means that the probability of any value of x is never negative and that the sum of the probabilities over all values of the discrete random variable equals 1.
iii. The mean, or expected value, of a discrete random variable X is µ = E(X), given by:
E(X) = Σ_{all x} x_i P(X = x_i)
Example
Consider the following probability distribution for a discrete random variable. Verify the probability properties and find the standard deviation of the distribution.

x           0      1      2      5      10
P(X = x)   0.05   0.25   0.30   0.20   0.20
Solution
Each probability is non-negative and 0.05 + 0.25 + 0.30 + 0.20 + 0.20 = 1, so both probability properties hold.
Mean: E(X) = 0(0.05) + 1(0.25) + 2(0.30) + 5(0.20) + 10(0.20) = 3.85
E(X²) = 0(0.05) + 1(0.25) + 4(0.30) + 25(0.20) + 100(0.20) = 26.45
Variance: Var(X) = E(X²) − [E(X)]² = 26.45 − 3.85² = 11.6275
Standard deviation: σ = √Var(X) = √11.6275 = 3.410
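A short Python check of the mean, variance and standard deviation computed above:

import math

xs = [0, 1, 2, 5, 10]
ps = [0.05, 0.25, 0.30, 0.20, 0.20]

assert abs(sum(ps) - 1.0) < 1e-9 and all(p >= 0 for p in ps)   # probability properties

mean = sum(x * p for x, p in zip(xs, ps))        # E(X)   = 3.85
ex2 = sum(x**2 * p for x, p in zip(xs, ps))      # E(X^2) = 26.45
var = ex2 - mean**2                              # 11.6275
print(mean, var, math.sqrt(var))                 # 3.85, 11.6275, 3.410...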
b) At least 3, means not less than 3. The minimum that can be assumed is 3 since
3 is not less than itself. Notation: P (X ≥ 3) = P (X = 3) + P (X = 4) + P (X =
5) + . . . + P (X = n)
c) Less than 3 effectively means values below 3; 3 itself is excluded. Notation: P(X < 3) = P(X = 0) + P(X = 1) + P(X = 2)
d) More than 3 means values above 3; in discrete terms it is from 4 upwards. Notation: P(X > 3) = P(X = 4) + P(X = 5) + P(X = 6) + . . . + P(X = n), or, using the complementary rule, 1 − P(X ≤ 3)
f) Between 3 and 6 means the discrete values between 3 and 6, which are 4 and 5. However, it should be noted that the limits can be exclusive or inclusive. Notation for exclusive: P(3 < X < 6) = P(X = 4) + P(X = 5). Notation for inclusive: P(3 ≤ X ≤ 6) = P(X = 3) + P(X = 4) + P(X = 5) + P(X = 6)
Exercise 2
Consider the following probability distribution that characterises a marketing analyst's belief concerning the probabilities associated with the number, x, of sales that a company might expect per month for a new supercomputer:
x 0 1 2 3 4 5 6 7 8
P(X=x) 0.02 0.08 0.15 0.19 0.24 0.17 0.10 0.04 0.01
i. What is the probability that the company will sell as in cases (a), (b), (c), (d), (e) and (f) above?
P(X = 1) = p
P(X = 0) = 1 − p
i.e. P(X = 0) = 1 − P(X = 1) = 1 − p.
This distribution best describes all situations where a single "trial" is made resulting in either "success" or "failure," such as when tossing a coin, or when modelling the success or failure of a surgical procedure. The Bernoulli distribution is defined as:
P(X = x) = p^x (1 − p)^(1−x),   x = 0, 1
where p is the probability that a particular event (e.g. success) will occur.
Example: tossing a fair coin, you get a head or a tail, each with probability 0.5. Thus, if a head is labelled 1 and a tail 0, the random variable X representing the outcome takes values 0 or 1. If the probability that X = 1 is p, then we have that:
P(X = 1) = 1/2
P(X = 0) = 1 − 1/2 = 1/2,
since the events X = 1 and X = 0 are mutually exclusive.
Data often arise in the form of counts or proportions which are realizations of a discrete
random variable. A common situation is to record how many times an event occurs
in n repetitions of an experiment, i.e. for each repetition the event either occurs (a "success") or does not occur (a "failure"). More specifically, consider the following experimental process:
experimental process:
Suppose we repeat a Bernoulli(p) experiment n times and count the number X of successes. The distribution of X is called the Binomial, Bin(n, p), distribution. The quantities n and p are called parameters and they specify the distribution. The probability mass function is:
P(X = x) = C^n_x p^x (1 − p)^(n−x),   x = 0, 1, 2, . . . , n
where 0 < p < 1 and C^n_x = n! / ((n − x)! x!).
1. Mean, µ_X = E(X) = np
2. Variance, σ²_X = Var(X) = np(1 − p) = npq, where q = 1 − p
Example:
The notation Bin(n, p) means a Binomial distribution with parameters n and p. Suppose a fair coin is tossed 6 times and X is the number of heads obtained, so that X ∼ Bin(6, 1/2). Find:
i) P(X = 4);
ii) the mean;
iii) the variance.
Solution:
i) Let X be the number of heads (successes) when a fair coin is tossed 6 times. Thus:
P(X = 4) = C^6_4 (1/2)^4 (1 − 1/2)^(6−4) = 15/64
ii) Mean = np = 6 × 1/2 = 3
iii) Variance = npq = 6 × 1/2 × (1 − 1/2) = 1.5
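Assuming SciPy is available, the binomial example can be checked numerically; math.comb gives the same probability without SciPy. A minimal sketch:

import math
from scipy.stats import binom

n, p = 6, 0.5

# P(X = 4) directly from the formula and from scipy
p4_formula = math.comb(n, 4) * p**4 * (1 - p)**(n - 4)   # 15/64 = 0.234375
p4_scipy = binom.pmf(4, n, p)

print(p4_formula, p4_scipy)                 # both 0.234375
print(binom.mean(n, p), binom.var(n, p))    # 3.0 and 1.5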
Exercises:
1. A manufacturer of nails claims that only 3% of its nails are defective. A random
sample of 24 nails is selected, what is the probability that 5 of the nails are
defective?
2. A certain rare blood type can be found in only 0.05% of people. If the population
of a randomly selected group is 3000, what is the probability that at least two
persons in the group have this rare blood type?
The Poisson distribution, named after the French mathematician Siméon Denis Poisson, is
a discrete probability distribution that expresses the probability of a given number of
events occurring in a fixed interval of time and/or space if these events occur with a
known average rate and independently of the time since the last event. The Poisson
distribution can also be used for the number of events in other specified intervals such
as distance, area or volume.
The Poisson question: What is the probability of r occurrences of a given outcome be-
ing observed in a predetermined time, space or volume interval?
A Poisson random variable is a discrete random variable that can take integer values from 0 up to infinity (∞). The parameter of this distribution is λ, i.e. Po(λ). The Poisson probability mass function is given by:
P(X = x) = λ^x e^{−λ} / x!     (6.4)
Example:
The number of students arriving at a takeaway every 15 minutes is a Poisson random
variable with parameter λ = 0.2. Find the probability that zero, at most one, and at least two students arrive at the takeaway.
Solution:
Using the formula
P(X = x) = λ^x e^{−λ} / x!
we obtain
P(X = 0) = (0.2^0 e^{−0.2}) / 0! = 0.8187
P(X ≤ 1) = P(X = 0) + P(X = 1) = (0.2^0 e^{−0.2}) / 0! + (0.2^1 e^{−0.2}) / 1! = 0.8187 + 0.1637 = 0.9824
P(X ≥ 2) = P(X = 2) + P(X = 3) + . . . = 1 − (P(X = 0) + P(X = 1)) = 1 − (0.8187 + 0.1637) = 0.0176
1. Mean = E(X) = λ
2. Variance = V ar(X) = λ
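These Poisson probabilities are easy to confirm numerically. A small sketch, assuming SciPy is installed:

from scipy.stats import poisson

lam = 0.2   # average arrivals per 15-minute interval

print(poisson.pmf(0, lam))    # P(X = 0)  ~ 0.8187
print(poisson.cdf(1, lam))    # P(X <= 1) ~ 0.9825
print(poisson.sf(1, lam))     # P(X >= 2) = 1 - P(X <= 1) ~ 0.0175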
Exercises
1. A textile producer has established that a spinning machine stops randomly due
to thread breakages at an average rate of 5 stoppages per hour. What is the
probability that in a given hour:
2. The arrival of patients at a rural clinic is 2 per hour. In any given hour what is
the probability that:
Remark: As a general rule, always check that the time, space or volume interval over
which occurrences of the random variable are observed is the same as the time, space
or volume interval corresponding to the average rate of occurrences, λ. When they
differ, adjust the rate of occurrences to coincide with the observed interval.
i. It is bell-shaped.
ii. It is symmetrical about the mean, µ.
iii. The tails of the distribution never touch the horizontal axis (i.e. the curve is asymptotic to the axis).
f(x) = (1 / (σ√(2π))) e^{−(1/2)((x−µ)/σ)²},   −∞ < x < ∞     (6.5)
where µ = mean of the random variable X and σ 2 = variance of the random variable X.
The random variable X is represented as X ∼ N(µ, σ²). µ and σ² are said to be the parameters of X.
It is difficult to use the probability density function of the normal distribution to calculate probabilities for X directly. Hence the process of standardisation is used, so that probability values can be read directly from the standard normal distribution table. This table gives the probabilities corresponding to different values of Z, starting at −3. The process of standardisation involves calculating the value of Z using the formula:
Z = (X − µ) / σ     (6.6)
a. P (Z ≥ −2)
b. P (Z > 0.79)
e. P (Z ≤ −3)
Solution
a.
P (Z ≥ −2) = 1 − P (Z ≤ −2)
= 1 − Φ(−2)
= 1 − 0.0228
= 0.9772
b.
c.
d.
e.
P (Z ≤ −3) = Φ(−3)
= 0.0013
f.
= 0.4118
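For the parts whose working is not shown above, the standard normal probabilities can be obtained directly in Python. A sketch assuming SciPy is available (statistics.NormalDist would work equally well):

from scipy.stats import norm

print(norm.sf(-2))      # P(Z >= -2)   = 1 - Phi(-2)   ~ 0.9772
print(norm.sf(0.79))    # P(Z >  0.79) = 1 - Phi(0.79) ~ 0.2148
print(norm.cdf(-3))     # P(Z <= -3)   = Phi(-3)       ~ 0.0013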
i. The mean is given by
E(X) = (a + b) / 2     (6.8)
ii. The variance is given by
Var(X) = (b − a)² / 12     (6.9)
NB
The probability that X falls in some interval, say (c, d) with a ≤ c < d ≤ b, can easily be calculated by integrating the density function
f(x) = 1 / (b − a)
over (c, d) to obtain
P(c < X < d) = (d − c) / (b − a)
Example:
The marks of students from a certain examination are uniformly distributed in the
interval 50 to 75. The density function for the marks is given by:
f(X = x) = 1/(75 − 50) for 50 ≤ x ≤ 75, and f(X = x) = 0 elsewhere.
Find the mean and the variance of the marks.
Solution:
1. The mean is given by (a + b)/2 = (50 + 75)/2 = 62.5
2. The variance is given by (b − a)²/12 = (75 − 50)²/12 = 52.083
Interpretation:
The average mark for the examination was 62.5 with a variance of 52.083.
Exercise:
For the continuous uniform distribution defined on the interval [a, b], where b > a, show that
i) Mean = (a + b)/2, and
ii) Variance = (b − a)²/12.
Example
Suppose that the length of a phone call in minutes is exponentially distributed with
parameter, λ = 0.1. If someone arrives immediately ahead of you at a public telephone
booth, what is the probability that you will wait for at least 20 minutes?
Solution
Let X be the length of the phone call made in front of you. Then
P(X > 20) = ∫_20^∞ 0.1 e^{−0.1x} dx
          = [−e^{−0.1x}]_20^∞
          = e^{−2} = 0.1353
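A quick numerical check of this survival probability, assuming SciPy is available (the scale parameter of scipy.stats.expon is 1/λ):

import math
from scipy.stats import expon

lam = 0.1
print(math.exp(-lam * 20))            # e^{-2} ~ 0.1353
print(expon.sf(20, scale=1 / lam))    # same value: P(X > 20)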
Chapter 7
Interval Estimation
7.1. Introduction
We now know that a population parameter can be estimated from sample data by calculating the corresponding point estimate. This chapter is motivated by
the desire to understand the goodness of such a point estimate. However, due to sam-
pling variability, it is almost never the case that the population parameter equals the
sample statistic. Further, the point estimate does not provide any information about
its closeness to the true population parameter. Thus, we cannot rely on point estimates
for decision making and policy formulation in day-to-day living or in any organisation, institution or country. We need bounds that represent a range of plausible values
for a population parameter. Such ranges are called interval estimates.
To obtain the interval estimates, the same data from which the point estimate was
obtained is used. Interval estimates may be in the form of a confidence interval whose
purpose is to bound population parameters such as the mean, the proportion, the vari-
ance, and the standard deviation; a tolerance interval which bounds a selected propor-
tion of the population; and a prediction interval which places bounds on one or more
future observations from a population.
A two-sided 100(1 − α)% confidence interval has the general form
parameter estimate ± (reliability coefficient) × s.e.(parameter estimate)
where α is the level of significance, between zero and one; 1 − α is a value called the "confidence coefficient"; 100(1 − α)% is the confidence level; the parameter estimate is the value of the point estimate, such as the sample mean, x̄, or the sample proportion, p̂; the reliability coefficient is a probability point obtained from an appropriate table, for example z_{α/2} or t_{α/2,n−1}; and s.e.(parameter estimate), read "standard error of the parameter estimate", measures the closeness of the point estimate to the true population parameter, i.e. it measures the precision of the estimate.
Example
64.3 64.6 64.8 64.2 64.5 64.3 64.6 64.8 64.2 64.3
Assume that the population is normally distributed with unit population variance. For these data, construct a 95% confidence interval for the population mean.
Solution
Using the data, n = 10, x = 64.46, the level of significance, α = 5% = 0.05, and
from the given assumption, σ 2 = 1. Now, the resulting 95% confidence interval
(CI) for the population mean is
x̄ − z_{0.025} × σ/√n ≤ µ ≤ x̄ + z_{0.025} × σ/√n
Substituting we have
64.46 − 1.96 × 1/√10 ≤ µ ≤ 64.46 + 1.96 × 1/√10,
which gives
63.84 ≤ µ ≤ 65.08.
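The same interval can be computed in a few lines of Python. This is a sketch assuming SciPy is available; the sample mean 64.46 is taken from the data above.

import math
from scipy.stats import norm

n, xbar, sigma = 10, 64.46, 1.0
alpha = 0.05

z = norm.ppf(1 - alpha / 2)             # 1.9599... ~ 1.96
half_width = z * sigma / math.sqrt(n)

print(xbar - half_width, xbar + half_width)   # approximately 63.84 and 65.08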
Interpretation
From the above estimation of the confidence interval for the population mean,
it is tempting to conclude that the population mean, µ, is within 63.84 ≤ µ ≤
65.08 with a probability of 0.95. To be blunt, this statement is not true.
Well, the true value of the population mean, µ, is unknown (!), and the con-
fidence interval is a random interval that is a function of the sample mean (!).
In this scenario, to say that µ is within 63.84 ≤ µ ≤ 65.08 with a probability
0.95 is totally off the mark. Now, if you follow the argument, the statement
63.84 ≤ µ ≤ 65.08 is either correct with probability 1 or incorrect with proba-
bility 1. By and large, the correct interpretation of a 100(1 − α)% confidence
interval for a population parameter is that:
if a very large number of random samples is collected and a 100(1 − α)% confidence interval for the population parameter is computed from each sample, then approximately 100(1 − α) percent of these intervals will contain the true value of the population parameter.
In terms that are loose and respecting the frequency approach to probability,
for our case, this general interpretation says
we don’t know if the statement 63.84 ≤ µ ≤ 65.08 is true for this specific sam-
ple, but that, in repeated sampling, the method used to obtain the confidence interval for µ yields correct statements 95% of the time.
To illustrate this interpretation from the other end, let us consider a thousand samples taken, and their 100(1 − α)% confidence intervals for a specified population parameter constructed. Suppose that these were 95% confidence intervals. Then we would expect only about 5% of these confidence intervals, roughly fifty of them, to fail to contain the true value of the population parameter; that is, about 950 of the intervals would contain the true population parameter of interest.
We point out that in practice we obtain only one random sample and calculate
only one confidence interval. From the preceding standpoint, this confidence
interval either contains or does not contain the true population parameter. In
the end, one MUST therefore reject the obvious temptation!
Lecture Exercise
For the above example, separately construct a 90% and 99% confidence interval
for the population mean.
Task: Starting from the cases considered above, what is the general relation-
ship between confidence levels and their precision?
Remark:
The precision of a confidence interval decreases as the confidence level increases: higher confidence requires a wider interval. It is desirable to obtain a confidence interval that is short enough for purposes of decision making and that also has adequate confidence. This is largely the reason why the 95% confidence level is the default confidence level chosen by researchers and practitioners.
Using similar assumptions, one-sided confidence limits for the population mean,
µ, are obtained by setting either `1 to -∞ or `2 to ∞ and replacing z α2 by zα .
Lecture Exercise
For the data in the above example, construct the 90%, 95%, 99% lower-, and
upper- confidence limits. What observations can you make?
X̄ ∼ N(µ, σ²/n)
Example
A study was carried out in Zimbabwe to investigate pollutant contamination
in small fish. A sample of small fish was selected from 53 rivers across the
country and the pollutant concentration in the muscle tissue was measured
(ppm). The pollutant concentration values are shown below. Construct a 95%
confidence interval for the population mean, µ.
1.230 1.330 0.040 0.044 1.200 0.270 0.490 0.190 0.940 0.520 0.830
0.810 0.710 0.500 0.490 1.160 0.050 0.150 0.400 0.190 0.650 0.770
1.080 0.980 0.630 0.560 0.410 0.730 0.430 0.590 0.340 0.340 0.270
0.840 0.500 0.340 0.280 0.340 0.250 0.750 0.870 0.560 0.100 0.170
0.180 0.190 0.040 0.490 0.270 1.100 0.160 0.210 0.860
Solution
Since n > 30, then the 95% confidence interval for µ is
0.5250 − 1.96 × 0.3486/√53 ≤ µ ≤ 0.5250 + 1.96 × 0.3486/√53
which simplifies to
0.431 ≤ µ ≤ 0.619
Lecture Exercise
Construct the 90% and the 99% CI for µ using the above data. Further, using
the above data construct the 90%, 95%, and the 99% lower- and upper- CI for
the population mean.
For our purposes, it will be reasonable to assume that the population of interest
is normal with an unknown mean, µ, and an unknown variance, σ 2 . A small
random sample of size n is drawn. Let X̄ and S² be the sample mean and sample variance, respectively. We wish to construct a two-sided confidence interval on µ. The population variance, σ², is unknown and it is reasonable to use S² to estimate σ². Then the random variable Z is replaced with T, which is given by
T = √n (X̄ − µ) / S
which is a random variable that follows the student’s t distribution with n − 1
degrees of freedom which are associated with the estimated standard devia-
tion.
Notation
We let t_{α,n−1} and t_{α/2,n−1} denote the values of the random variable T with n − 1 degrees of freedom above which we find probability α or α/2, respectively. The resulting two-sided 100(1 − α)% confidence interval for µ is
x̄ − t_{α/2,n−1} × s/√n ≤ µ ≤ x̄ + t_{α/2,n−1} × s/√n,
where t_{α/2,n−1} is the upper 100(α/2) percentage point of the t distribution with n − 1 degrees of freedom.
Example
Consider the following data obtained from a local Transport Logistics company.
19.8 10.1 14.9 7.5 15.4 15.4 15.4 18.5 7.9 12.7 11.9
11.4 11.4 14.1 17.6 16.7 15.8 19.5 8.8 13.6 11.9 11.4
Construct a 95% confidence interval for the population mean.
Solution
Since our sample is small, n = 22, then the 95% confidence interval for the
population mean is given by
x̄ − t_{α/2,n−1} × s/√n ≤ µ ≤ x̄ + t_{α/2,n−1} × s/√n
Substituting yields
13.71 − 2.080 × 3.55/√22 ≤ µ ≤ 13.71 + 2.080 × 3.55/√22,
which gives, approximately, 12.14 ≤ µ ≤ 15.28.
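A sketch of the same computation in Python, taking the raw observations listed above and using scipy.stats.t for the critical value; small differences from the rounded working above are due only to rounding.

import math
import statistics
from scipy.stats import t

data = [19.8, 10.1, 14.9, 7.5, 15.4, 15.4, 15.4, 18.5, 7.9, 12.7, 11.9,
        11.4, 11.4, 14.1, 17.6, 16.7, 15.8, 19.5, 8.8, 13.6, 11.9, 11.4]

n = len(data)                       # 22
xbar = statistics.mean(data)        # ~13.71
s = statistics.stdev(data)          # sample standard deviation, ~3.55
tcrit = t.ppf(0.975, df=n - 1)      # ~2.080

half_width = tcrit * s / math.sqrt(n)
print(xbar - half_width, xbar + half_width)   # roughly 12.14 and 15.29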
Lecture Exercise
For the above data, construct the 90% and the 99% confidence intervals on the
population mean and interpret the two confidence intervals. Further, construct
the 90%, the 95% and the 99% lower - and upper - confidence limits. Give an
interpretation of each and all of them.
Suppose that a random sample of size n (large n) has been taken from a large population and that x (with x less than n) observations in this sample belong to a class of interest. Then p̂, calculated as x/n, is a point estimator of the proportion of the population, p, that belongs to this class. Note that n and p are the parameters of a binomial distribution (refer to earlier discussions). The sampling distribution of p̂ is approximately normal with mean p and variance p(1 − p)/n, provided p is not too close to either 0 or 1 and n is relatively large. To apply this, it is required that np and n(1 − p) both be greater than or equal to 5. We are saying that:
If n is large, then the distribution of
Z = (p̂ − p) / √(p(1 − p)/n) ∼ N(0, 1).
For large samples, which usually is the case when dealing with proportions, a
satisfactory 100(1 − α)% confidence interval on the population proportion p is
p̂ − z_{α/2} × √(p̂(1 − p̂)/n) ≤ p ≤ p̂ + z_{α/2} × √(p̂(1 − p̂)/n)
where p̂ is the point estimate of p, and z_{α/2} is the upper α/2 probability point of the standard normal distribution.
Example
In a random sample of 85 stone sculptures, 10 have a surface finish that is
rougher than the expected. Construct a 95% confidence interval for the popu-
lation proportion of stone sculptures with a surface finish that is rougher than
the expected.
Solution
A 95% two-sided confidence interval for p, with p̂ = 10/85 ≈ 0.12, is
0.12 − 1.96 × √(0.12(1 − 0.12)/85) ≤ p ≤ 0.12 + 1.96 × √(0.12(1 − 0.12)/85)
which simplifies to
0.05 ≤ p ≤ 0.19
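The interval can be reproduced with a few lines of Python; the endpoints differ slightly from those above because p̂ is not rounded to 0.12 here. A sketch assuming SciPy is available:

import math
from scipy.stats import norm

x, n = 10, 85
phat = x / n                 # ~0.1176
z = norm.ppf(0.975)          # ~1.96

half_width = z * math.sqrt(phat * (1 - phat) / n)
print(phat - half_width, phat + half_width)   # roughly 0.049 and 0.186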
Remark: The one-sided lower and upper confidence limits are respectively given as
p̂ − z_α × √(p̂(1 − p̂)/n) ≤ p
and
p ≤ p̂ + z_α × √(p̂(1 − p̂)/n).
Lecture Exercise
In the above example, construct and interpret the 95% and the 99% lower - and
upper - confidence limits for the population proportion.
V = (n − 1)S² / σ²
has a chi-square (χ²) distribution with n − 1 degrees of freedom. The resulting two-sided 100(1 − α)% confidence interval for σ² is
(n − 1)S² / χ²_{α/2,n−1} ≤ σ² ≤ (n − 1)S² / χ²_{1−α/2,n−1}
where χ²_{α/2,n−1} and χ²_{1−α/2,n−1} are the upper and lower 100(α/2) percentage points of the χ² distribution with n − 1 degrees of freedom, respectively.
Illustration
An Entrepreneur has got an automatic filling machine that she uses to fill
bottles with liquid detergent. A random sample of 20 bottles results in a sample
variance of fill volume of s2 equal to 0.0153 (f luid ounces)2 . Assume that the fill
volume is normally distributed. Then a 95% upper- CI is
σ² ≤ (n − 1)S² / χ²_{1−α,n−1}
Substituting yields
σ² ≤ (20 − 1) × 0.0153 / χ²_{1−0.05,20−1}
Simplifying we have
σ² ≤ 19 × 0.0153 / χ²_{0.95,19}
so we get
σ² ≤ 19 × 0.0153 / 10.117
giving
σ² ≤ 0.0287
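The chi-square percentage point and the resulting bound can be checked in Python. A sketch using scipy.stats.chi2; ppf(0.05, 19) gives the lower 5% point used above.

from scipy.stats import chi2

n, s2 = 20, 0.0153

chi_low = chi2.ppf(0.05, df=n - 1)      # ~10.117
upper_bound = (n - 1) * s2 / chi_low

print(chi_low, upper_bound)             # ~10.117 and ~0.0287 (fluid ounces)^2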
In general, the one-sided 100(1 − α)% lower and upper confidence limits for σ² are
(n − 1)S² / χ²_{α,n−1} ≤ σ²
and
σ² ≤ (n − 1)S² / χ²_{1−α,n−1}
Remark: Clearly, the lower- and upper- confidence intervals/ limits for σ are
the square roots of the corresponding limits in the above equations.
Lecture Exercise
Using the information from the above illustration, construct a 90% lower- and
upper- confidence limits for the population standard deviation, σ.
The overall assumptions remain in place, as does everything else. We are simply considering two populations and constructing confidence intervals for the difference between two population means, µ1 − µ2.
Illustrative Example
An entrepreneur is interested in reducing the drying time of a wall paint. Two
formulations of the paint are tested; formulation 1 is the standard, and formu-
lation 2 has a new drying ingredient that should reduce the drying time. From
experience, it is known that the standard deviation of drying time is 8 min-
utes, and this inherent variability should be unaffected by the addition of the
new ingredient. Ten specimens are painted with formulation 1, and another
10 specimens are painted with formulation 2; the 20 specimens are painted in
random order. The two sample mean drying times are 121 minutes and 112
minutes, respectively. Construct a 99% confidence interval for the difference in
the two population means.
Solution
To be provided in the lecture.
Illustrative Example
The following data is from two populations, A and B. Ten samples from A had
a mean of 90.0 with a sample standard deviation of s1 = 5.0, while 15 sam-
ples from B had a mean of 87.0 with a sample standard deviation of s2 = 4.0.
Assume that the populations, A and B are normally distributed and that both
normal populations have the same standard deviation. Construct a 95% confi-
dence interval on the difference in the two population means.
Solution
To be provided in the lecture.
Chapter 8
Hypothesis Testing
Hypotheses
A hypothesis is a statement about a population. Testing of hypotheses eval-
uates two hypotheses called the null and the alternative denoted H0 and H1
respectively. An H0 is the assertion that a population parameter takes on a
particular value. On the other hand, an H1 expresses the way in which the
value of a population parameter may deviate from that specified under H0 . The
direction of deviation may be specified (one-sided/tailed tests) or may not be specified (two-sided/tailed tests).
We take time to point out that the language and grammar of testing of hy-
potheses does not use the word ”accept” or any of its numerous synonyms. This
is beyond semantics. To say one ”accepts” the null hypothesis is to imply that
they have proved the null hypothesis to be true. This practice is incorrect. The
null hypothesis is the claim that is usually set up with the expectation of re-
jecting it. The null hypothesis is assumed true until proven otherwise. If the weight of evidence suggests that the null hypothesis is very unlikely, then there exists a statistical basis upon which we may reject the null hypothesis. The design of hypothesis tests is such that we stay with the null hypothesis until there is enough evidence to support the alternative hypothesis. Clearly, the design is never about selecting the more
likely of the two hypotheses. Let’s take this to our legal system. One is consid-
ered not guilty until proven otherwise. It is the job of the prosecutor to build
a case i.e. put evidence before the court of law that the person in question is
guilty. The jury or the judge will give their verdict as guilty or not guilty but
will NEVER give a verdict of "innocent". By and large, the courts of law are a classic example of the hypothesis testing procedure at work. So, let it be clear that on the basis of the data from the sample, we
either reject the null hypothesis or fail to reject the null hypothesis.
Remark 1: The H0 reflects the position of no change and will always be worded
as an equality.
Test Statistic
This is a value calculated from sample data and is used to decide on rejecting
H0 .
Critical Region
This is the range of values such that if the test statistic falls within it, then H0 is rejected.
Critical Value
This is a value that separates the rejection region from the non-rejection region.
Type I error
Occurs when a true null hypothesis is rejected. A null hypothesis is rejected
when in actual fact it is true.
Type II error
It occurs when a false null hypothesis is not rejected. Alternatively, it is when
a null hypothesis is not rejected when in actual fact it is false.
• Decide on the basis of a decision criterion that rejects H0 if, upon compar-
ison, the test statistic is more extreme than a critical value.
• Conclude on the basis of the decision’s import, and report in the context of
the problem.
claim?
It is given that the population is normal and the population standard deviation is known, so the Z-score is the test statistic. The test is two-sided. At the 0.05 level of significance and based on the sample evidence, we conclude that the population mean is different from 50.
Exercise
For the above exercise, instead of using the testing of hypothesis procedure,
construct a 95% confidence interval. Test the same hypothesis using the con-
fidence interval. Is the value specified under H0 contained in the confidence
interval? Or, is zero contained in the confidence interval? What conclusions
should be drawn?
Example
Let the mean cost of an Introduction to Statistics textbook be µ. In testing the
claim that the population mean is not USD34.50 a sample of 36 current text-
books had selling costs with a sample mean USD32.00 and a sample standard
deviation of USD6.30. Using a 10% level of significance, what conclusion can
be made?
Solution
A two-tailed test with n > 30 and α = 0.1, so the critical values are ±1.645. Detailed solution in the lecture.
Exercise
The increased availability of light materials with high strength has revolution-
ized the design and manufacture of golf clubs, particularly drivers. Clubs with
hollow heads and very thin faces can result in much longer tee shots, especially
for players of modest skills. This is due partly to the spring-like effect that the
thin face imparts to the ball. Firing a golf ball at the head of the club and mea-
suring the ratio of the outgoing velocity of the ball to the incoming velocity can
quantify this spring-like effect. The ratio of velocities is called the coefficient
of restitution of the club. An experiment was performed in which 15 drivers
produced by a particular club maker were selected at random and their coeffi-
cients of restitution measured. In the experiment the golf balls were fired from
an air cannon so that the incoming velocity and spin rate of the ball could be
precisely controlled. The sample mean and sample standard deviation are x =
0.83725 and s = 0.02456. Determine if there is evidence at the α = 0.05 level to
support the claim that the mean coefficient of restitution exceeds 0.82.
Exercise
For the above exercise, instead of using the testing of hypothesis procedure,
construct a 95% confidence interval. Test the same hypothesis using the confidence interval.
Illustrative Example
The advertised claim for batteries for cell phones is set at 48 operating hours,
with proper charging procedures. A study of 5000 batteries is carried out and
15 stop operating prior to 48 hours. Do these experimental results support the
claim that less than 0.2 percent of the company’s batteries will fail during the
advertised time period, with proper charging procedures? Use a hypothesis
testing procedure with α = 0.01. Is the conclusion the same at the 10% level of
significance?
Solution
We are testing H0 : p = 0.002 against H1 : p < 0.002 with p̂ = 15/5000 = 0.003.
Z_cal = (p̂ − p0) / √(p0(1 − p0)/n) = 1.5827
For this left-tailed test the rejection region is Z_cal < −z_{0.01} = −2.3263. Since 1.5827 does not fall in the rejection region, we fail to reject H0: the sample does not support the claim that less than 0.2 percent of the batteries fail within the advertised period. Because Z_cal is positive, the conclusion is the same at the 10% level of significance.
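A minimal Python sketch of this one-proportion z test; the critical value comes from scipy.stats.norm and the variable names are illustrative.

import math
from scipy.stats import norm

x, n, p0, alpha = 15, 5000, 0.002, 0.01
phat = x / n                                          # 0.003

z_cal = (phat - p0) / math.sqrt(p0 * (1 - p0) / n)    # ~1.583
z_crit = norm.ppf(alpha)                              # ~ -2.326 (left-tailed test)

print(z_cal, z_crit, z_cal < z_crit)                  # reject H0 only if z_cal < z_crit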
Exercise
Let p be the proportion of new car loans having a 48 months period. In some
year p = 0.74. Suppose it is believed that this has declined and accordingly we
wish to test this belief using a 1% level of significance. What is the conclusion
if 350 of a sample of 500 new car loans have a time period of 48 months?
We now extend the previous one population results to the difference of means
for two populations.
Example
Consider the following gasoline mileages of two makes of light trucks. Trucks 1 and 2 have sample means and population standard deviations of 28 and 6, and 24 and 9, respectively. If 35 of truck 1 and 40 of truck 2 are tested, test the claim that the mean difference is 4.
Solution
Exercise in the lecture.
Remark: In inferential applications the population variances σ12 and σ22 are
generally not known and must be estimated by s21 and s22 . The standard error is
estimated by
s.e. = √(s1²/n1 + s2²/n2)
Case 2: Unknown Population Variance, and Small Sample (n1 + n2 ≤ 31)
We assume that the variances of both distributions, σ1² and σ2², are unknown but equal. This common variance is estimated by a quantity called the pooled variance, denoted s_p² and calculated as
s_p² = [(n1 − 1)s1² + (n2 − 1)s2²] / (n1 + n2 − 2).
The test statistic is then
t_cal = [(x̄1 − x̄2) − (µ1 − µ2)] / √(s_p² (1/n1 + 1/n2))
Example
Consider the following data. n1 = 10, x1 = 90, s1 = 5, n2 = 15, x2 = 87 and
s2 = 4. Assume that the populations are normally distributed and that both
populations have the same standard deviation. At the 5% level of significance,
can we conclude that there is a difference in the two population means?
Solution
Left as an exercise for the lecture.
In testing for the equality of two population means, we may choose to select
two random samples one from each population and compare their means. If
these sample means exhibit a sufficiently large difference, then we reject the null hypothesis that H0 : µ1 − µ2 = 0. Another approach is to try and match the subjects from
the two populations according to variables which will be expected to have an
influence on the variable under study. The two samples are no longer indepen-
dent and the inferences are now based on the differences of the observations
from the matched pairs.
Example
Samples of two brands of pork sausage are tested for their fat content. The
results of the percentage of fat are summarised as follows: Brand A (n=50,
x=26.0, s=9.0) and Brand B (n=46, x=29.3, s=8.0). Can we conclude that there
is sufficient evidence to suggest that there is a difference in the fat content of the two brands of pork sausage? Use a 5% level of significance.
Solution
Left as an exercise for the lecture.
For the new single sample of differences, we find its mean, d̄, which estimates the population mean of the differences, µ_d, and its standard deviation, s_d. Assuming that the original populations are normally distributed with equal means (i.e. µ1 = µ2) and equal variances, the population mean of the differences, µ_d, is zero and the standard error of d̄ is estimated by s_d/√n.
The test statistic in this case is t_cal = d̄ / s.e.
The hypotheses tests concerning µ1 and µ2 are now based on the sample mean
using the single sample and we have a modified null hypothesis, H0 : µd = 0
against an appropriate alternative hypothesis as instructed by the situation.
Example
Five automachines are tested for wind resistance with two types of grills. Their
drag coefficients were determined and recorded as follows.
Automachine 1 2 3 4 5
Grill A 0.47 0.46 0.40 0.44 0.43
Grill B 0.50 0.45 0.47 0.44 0.48
Using a 5% level of significance test for the difference in the drag coefficients
due to type of grill.
Solution
Left as an exercise during the lecture.
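For this paired design, the test can be carried out on the differences directly. The sketch below, assuming SciPy is available, uses scipy.stats.ttest_rel, which is equivalent to a one-sample t test on the Grill B − Grill A differences; it is a check, not the lecture solution.

from scipy.stats import ttest_rel

grill_a = [0.47, 0.46, 0.40, 0.44, 0.43]
grill_b = [0.50, 0.45, 0.47, 0.44, 0.48]

t_cal, p_value = ttest_rel(grill_b, grill_a)   # paired (dependent-samples) t test
print(t_cal, p_value)                          # compare p_value with alpha = 0.05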
as measured by s2p in the case without pairing and s2d in the case of pairing.
Therefore, s2d > s2p implies a gain in precision due to pairing.
• It may be less expensive since in most cases fewer experimental units are
used when compared to a two sample design.
• A rest period may be required between applying the first and second treat-
ment in order to minimise the carry over effect from the first treatment.
Even, then the carry over effect may not be completely eliminated.
Suppose that two independent random samples of sizes n1 and n2 are taken
from two populations, and let x1 and x2 represent the number of observations
that belong to the class of interest in sample 1 and sample 2, respectively. In
testing the hypotheses
H0 : p1 − p2 = 0
H1 : p1 − p2 ≠ 0,
the test statistic is
Z_cal = (p̂1 − p̂2) / √( p̂(1 − p̂)(1/n1 + 1/n2) )
where p̂ = (x1 + x2)/(n1 + n2) is the pooled sample proportion.
Example
Consider the following situation in which comparison is made of two concept
exposition methods. Method A is the standard and method B is the proposed.
A class of 200 CUMT105 students at the Chinhoyi University of Technology is
used. The students were randomly assigned to two groups of equal size. One
group was exposed to method A, and the other group was exposed to method
B. At the end of the semester, 19 of the students exposed to method B showed
improvement, while 27 of those exposed to method A improved. At the 5% level
of significance, is there sufficient reason to believe that method A is more effective than method B in concept exposition?
Solution
First, we state the hypotheses:
H0 : pA − pB = 0
H1 : pA − pB 6= 0
The test statistic and the critical value are Zcal = 1.35 and Zcrit = 1.96 respec-
tively.
After comparing Z_cal and Z_crit, the decision is that we fail to reject H0. From this decision, we therefore conclude that, at the 5% level of significance, there is insufficient evidence to support the assertion that method A is more effective than method B in concept exposition.
Exercise
A study is made of business support of the immigration enforcement practices.
Suppose 73% of a sample of 300 cross border traders and 64% of the light man-
ufacturers said they fully supported the policies being proposed. Is there sufficient evidence to conclude that the proposed policies are equally supported by the two groups sampled? Use a 1% level of significance.
Tests for independence are performed on categorical data such as when testing
for independence of opinion on a public policy and gender. The data is con-
tained in what is called a contingency table. The hypotheses are tested using a
Chi - square test statistic, χ2cal .
Illustrative example
A company operates four machines on three shifts each day. From production
records, the following data on the number of breakdowns are collected:
Machines
Shifts A B C D
1 4 3 2 1
2 3 1 9 4
3 1 1 6 0
Using 5% level of significance, test the hypothesis that breakdowns are inde-
pendent of the shift.
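A hedged sketch of how this test of independence could be run in Python with scipy.stats.chi2_contingency, which computes the expected counts, the χ² statistic and the p-value in one call; note that some expected counts here are small, so the χ² approximation is only rough.

from scipy.stats import chi2_contingency

# Rows = shifts 1-3, columns = machines A-D (breakdown counts from the table).
observed = [[4, 3, 2, 1],
            [3, 1, 9, 4],
            [1, 1, 6, 0]]

chi2_cal, p_value, dof, expected = chi2_contingency(observed)
print(chi2_cal, dof, p_value)   # reject independence at the 5% level if p_value < 0.05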
Exercise
Grades in Statistics and Communication Skills taken simultaneously were
recorded as follows for a particular group of students.
Com. Skills Grade
Stats Grade 1 2.1 2.2 Other
1 25 6 17 13
2.1 17 16 15 6
2.2 18 4 18 10
Other 10 8 11 20
Are the grades in Statistics and Communication Skills related? Use α = 0.01
The student should be able to demonstrate that hypothesis testing and confidence intervals are equivalent procedures insofar as decision making or inference about population parameters is concerned. However, each procedure presents
different insights. What is the major difference between these two cousin pro-
cedures?
Chapter 9
Regression Analysis
9.1. Introduction
It is important to note that the approach used here first exposes the useful con-
cepts of the regression analysis technique, gives an illustrative example on the
application of these concepts, and then wraps up with a practice question.
Many problems that are encountered in everyday life involve exploring the
relationships between two or more variables. Without attempting to formally
define what regression analysis is, regression analysis is a statistical tool that
is very useful for these types of problems. For example, in the clothing indus-
try, the sales obtained from selling particular designer outfits is related to the
amount of time spent advertising the label. Regression analysis can be used
to build a model to predict the sales given the amount of time devoted to ad-
vertising the label. In the sciences, regression analysis models can be used for
process optimization. For instance, finding the temperature levels that max-
imise yield, or for purposes of process control.
After non-superficial, but serious and rigorous studying of this chapter, the stu-
dent is expected to be able to use simple linear regression for building models
to everyday data, apply the method of least squares to estimate the param-
eters in a linear regression model, use the fitted regression model to make a
prediction of a future observation and interpret the scatter plot, the correlation
coefficient, the coefficient of determination, and the regression parameters.
• prediction
• forecasting
• optimisation
• control purposes
Regression relationships are valid only for values of the explanatory variable
within the range of the original data. The linear relationship that we have
assumed may be valid over the original range of X, but is unlikely to remain so as we extrapolate, i.e. if we use values of X beyond the range in question to estimate the value of Y. Put differently, as we stray from the range of the values of X for which data were collected, our certainty about the validity of the assumed model tends to fade away. We caution that linear regression models are not necessarily valid for extrapolation purposes. Clearly, this is not saying NO to extrapolation; in many real-life situations extrapolation of a regression model may be the only way to approach a given problem. We are simply warning that one needs to be alive to the potential abuses of the technique. To soften the preceding a little, a modest extrapolation may be quite acceptable in most situations, whereas large extrapolations will almost always produce unacceptable results.
Models with only one explanatory (independent) variable are called simple linear regression models. Specifically, this will be our focus in this chapter.
Y = a + bX + ε
The random error term follows a normal distribution with a mean zero and
an unknown variance σ 2 . For completeness, we state that the random errors
corresponding to different observations are also assumed to be uncorrelated
random variables. To determine the appropriateness of employing simple linear regression we use (1) the scatter plot and/or (2) the correlation coefficient.
Interpretation of r
Remark: The interpretation of r must clearly state the magnitude/ size and
direction of the relationship between the random variables X and Y.
Having established that a linear relationship exists between the random vari-
ables X and Y, we proceed to fit the linear regression model/ line/ equation. To
fit a regression model is to estimate the regression coefficients a and b. The estimates are denoted â and b̂. The fitted model is written in the form
Ŷ = â + b̂X
Now, we have fitted a model and we wish to determine how good it is and then use it for prediction of new values for the system in question. To determine how good our model is, we calculate the fitted value of the response variable for each and every value of the explanatory variable and then note the difference. This difference, obtained by subtracting the fitted value from the actually observed value, is the error in our model for that observation and is called the residual. By performing what is called residual analysis we are able to come up with a statement on the adequacy of our fitted regression model.
After establishing the adequacy of our model we then proceed to predict future
values of the response variable for the system in question. This is technically
called forecasting.
The least squares estimates of the regression coefficients are
b̂ = (n Σxy − Σx Σy) / (n Σx² − (Σx)²)   and   â = ȳ − b̂x̄
Naturally, we ask how much of the variability in the response variable has been explained by fitting the regression model. To answer this question we compute the coefficient of determination, r², which satisfies 0 ≤ r² ≤ 1. Expressing r² as a percentage,
r² × 100%,
gives the amount of variability in the response variable that has been explained by fitting the regression model.
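A compact Python sketch of these calculations (least squares estimates, correlation coefficient and coefficient of determination) for paired data; the helper name fit_simple_regression is purely illustrative, and the data used are those of the illustrative example below.

import math

def fit_simple_regression(xs, ys):
    """Return (a_hat, b_hat, r, r_squared) for the model Y = a + bX."""
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    syy = sum(y * y for y in ys)

    b_hat = (n * sxy - sx * sy) / (n * sxx - sx ** 2)
    a_hat = sy / n - b_hat * sx / n                      # a_hat = ybar - b_hat * xbar
    r = (n * sxy - sx * sy) / math.sqrt((n * sxx - sx ** 2) * (n * syy - sy ** 2))
    return a_hat, b_hat, r, r ** 2

Y = [1, 0, 1, 2, 5, 1, 4, 6, 2, 3, 5, 4, 6, 8, 4]
X = [60, 63, 65, 70, 70, 70, 80, 90, 80, 80, 85, 89, 90, 90, 90]

print(fit_simple_regression(X, Y))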
Illustrative Example
Consider the following set of observations. Take X to be the exploratory vari-
able and Y to be the response variable.
Y 1 0 1 2 5 1 4 6 2 3 5 4 6 8 4
X 60 63 65 70 70 70 80 90 80 80 85 89 90 90 90
1. Draw a scatter plot for the above data. Comment on the suitability of
using simple linear regression to describe the relationship.
3. Fit the regression model using the method of least squares. Interpret the
regression coefficients.
4. State how much of the variation in Y has been accounted for by fitting the
linear regression model.
5. Using the fitted regression model, what is the value of Y when X = 60?
What is the residual?
Solution
A scatter diagram of the above data is shown in the figure below.
Exercise
Consider the following quantities for two random variables X and Y. Let X be
the cause variable and Y be the effect variable.
n = 20, Σx = 24, Σy = 1843, Σxy = 2215, Σy² = 170045 and Σx² = 29.
2. Fit the regression model using the method of least squares. What is the
meaning of the regression coefficients?
3. How much of the variability in Y has been explained by fitting the linear
regression model above.
6. Comment on the usefulness of the values in parts (d) and (e) given that, for the twenty observations, Σx = 24. Hint: You are expected to reflect on the uses and abuses of the regression analysis technique.
Chapter 10
Index numbers
10.1. Objectives
10.2. Introduction
Index numbers are today one of the most widely used statistical indicators.
Generally used to indicate the state of the economy, index numbers are aptly
called barometers of economic activity. Index numbers are used in comparing
production, sales, or changes in exports or imports over a certain period of time. The role played by index numbers in Indian trade and industry is impossible
to ignore. It is a very well known fact that the wage contracts of workers in our
country are tied to the cost of living index numbers.
It must be clearly understood that the index number for the base year is
always 100. An index number is commonly referred to as an index.
sumers' price index for urban non-manual employees increased from $100 in 2004 to $202 in 2006, the real purchasing power of the dollar can be found as follows:
100 / 202 = 0.495
It indicates that if the dollar was worth $100 in 2004, its purchasing power is $49.50 in 2006.
4. Deflates time series data - Index numbers play a vital role in adjusting
the original data to reflect reality. For example, nominal income (income
at current prices) can be transformed into real income(reflecting the ac-
tual purchasing power) by using income deflators. Similarly, assume that
industrial production is represented in value terms as a product of vol-
ume of production and price. If the subsequent year's industrial production were higher by 20% in value, the increase may not be the result of an increase in the volume of production, as one might think, but of an increase in prices. The inflation which has caused the increase in the series can be eliminated by using an appropriate price index, thus making the series real.
• Quantity index
• Value index.
1. Price Index - The most frequently used form of index numbers is the price
index. A price index compares changes in the prices of, say, edible oils. If an attempt
is being made to compare the prices of edible oils this year to the prices
of edible oils last year, it involves, firstly, a comparison of two price situ-
ations over time and secondly, the heterogeneity of the edible oils given
the various varieties of oils. By constructing a price index number, we are
summarizing the price movements of each type of oil in this group of edi-
ble oils into a single number called the price index. The Whole Price Index
(WPI) and the Consumer Price Index (CPI) are some of the popularly used
price indices.
1. Aggregate method
The unweighted (simple) aggregate price index is the ratio of the sum of current-year prices to the sum of base-year prices, expressed as a percentage: (ΣP1 / ΣP0) × 100.
Demerits - It does not consider the relative importance of the various commodities involved. The unweighted index does not reflect reality since the price changes are not linked to any usage/consumption levels.
Example
Construct an unweighted index for the three commodities taking 2010 as the
base year.
Commodity          2010 (P0)   2012 (P1)
Oranges (dozen)          20          28
Milk (litre)              5           8
Gas                      76         100
Interpretation
As is evident, the price index is 134.65, which means that prices rose by 34.65 percent from 2010 to 2012. By no means should this price index be taken at face value, since it ignores the quantities of the commodities actually consumed.
To make this clear, let us calculate the price index with the same data provided above but with the milk consumption changed from 1 litre to 100 litres. The following table provides the calculation of the price index.
2. Laspeyres Method
Laspeyres method uses the quantities consumed during the base period in com-
puting the index number. This method is also the most commonly used method
which incidentally requires quantity measures for only one period. Laspeyres
index can be calculated using the following formula:
Laspeyres Price Index (LPI) = (Σ P1 Q0 / Σ P0 Q0) × 100%     (10.3)
Where, P1 = Prices in the current year, P0 = Prices in the base year, Q0 = Quan-
tities in the base year.
In general, Laspeyres price index calculates the changes in the aggregate value
of the base years list of goods when valued at current year prices. In other
words, Laspeyres index measures the difference between the theoretical cost
in a given year and the actual cost in the base year of maintaining a standard
of living as in the base year. Also, Laspeyres quantity index can be calculated
by using the formula:
Laspeyres Quantity Index (LQI) = (Σ P0 Q1 / Σ P0 Q0) × 100%     (10.4)
Using the data below, calculate the Laspeyres price and quantity indices (base year 1985).

Product   Q0 (1985)   Q1 (1990)   P0 (1985)   P1 (1990)      P0Q0       P1Q0       P0Q1
Rice          46.60       58.00         700         910   32620.00   42406.00   40600.00
Sugar         14.57       17.92         620         950    9033.40   13841.50   11110.40
Salt          69.46       85.10         205         300   14239.30   20838.00   17445.50
Wheat         33.84       40.30         330         470   11167.20   15904.80   13299.00
Solution
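The full numerical solution is not reproduced here, but as a sketch the two Laspeyres indices can be computed directly from the table above in Python (values rounded; the list names mirror the table columns):

# Quantities and prices from the table above (base year 1985, current year 1990).
q0 = [46.60, 14.57, 69.46, 33.84]
q1 = [58.00, 17.92, 85.10, 40.30]
p0 = [700, 620, 205, 330]
p1 = [910, 950, 300, 470]

lpi = sum(a * b for a, b in zip(p1, q0)) / sum(a * b for a, b in zip(p0, q0)) * 100
lqi = sum(a * b for a, b in zip(p0, q1)) / sum(a * b for a, b in zip(p0, q0)) * 100

print(round(lpi, 2), round(lqi, 2))   # Laspeyres price index ~138.67, quantity index ~122.96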
3. Paasche Method
The Paasche method uses the quantities consumed in the current period when computing the price index. This enables easy comparability of one index with another.
Paasche Price Index (PPI) = (Σ P1 Q1 / Σ P0 Q1) × 100%     (10.5)
where P1 = prices in the current year, P0 = prices in the base year, and Q1 = quantities in the current year. The Paasche quantity index is given by:
Paasche Quantity Index (PQI) = (Σ P1 Q1 / Σ P1 Q0) × 100%     (10.6)
Demerits - Paasche index is not frequently used in practice when the number
of commodities is large. This is because for Paasche index, revised weights or
quantities must be computed for each year examined. Such information is either unavailable or hard to gather, adding to the data collection expense, which
makes the index unpopular. Paasche index tends to underestimate the rise in
prices or has a downward bias.
Let us understand the Paasche method with the help of an example. The table below represents the calculation of the Paasche index. In general, the Paasche index reflects the change in the aggregate value of the current year's (given period's) list of goods relative to its value at base period prices. From the table below, calculate the Paasche price and quantity indices.
Commodity   P0 (1992)   Q0 (1992)   P1 (1993)   Q1 (1993)   P0Q0   P0Q1   P1Q0   P1Q1
A                   3          18           4          15     54     45     72     60
B                   5           6           5           9     30     45     30     45
C                   4          20           6          26     80    104    120    156
D                   1          14           3          15     14     15     42     45
Total                                                        178    209    264    306
Solution
Paasche Price Index (PPI) = (Σ P1 Q1 / Σ P0 Q1) × 100% = (306 / 209) × 100% = 146.41%
The difference between the Paasche index and the Laspeyres index reflects the change in consumption patterns of the commodities A, B, C and D used in that table. As the weighted aggregate price index was 148.31% using the Laspeyres method and 146.41% using the Paasche method for the same set of data, it indicates a trend towards less expensive goods. Generally, Laspeyres
and Paasche methods tend to produce opposite extremes in index values com-
puted from the same data. The use of Paasche index requires the continuous
use of new quantity weights for each period considered. As opposed to the
Prof. Irving Fisher proposed a formula for constructing index numbers as the geometric mean of the Laspeyres and Paasche indices, i.e. Fisher's quantity and price indices are calculated as:
Fisher's Quantity Index = √(Laspeyres Quantity Index × Paasche Quantity Index)
Fisher's Price Index = √(Laspeyres Price Index × Paasche Price Index)
1. Theoretically, the geometric mean is considered the best average for the construction of index numbers, and Fisher's index uses the geometric mean.
3. Both the current year and base year prices and quantities are taken into account by this index. The index is not widely used owing to the practical limitations of collecting data. Fisher's Ideal Quantity Index can be found by the formula.