
business statistics using Excel®



business statistics
using Excel®
Second edition

Glyn Davis & Branko Pecar

Great Clarendon Street, Oxford, OX2 6DP,
United Kingdom
Oxford University Press is a department of the University of Oxford.
It furthers the University’s objective of excellence in research, scholarship,
and education by publishing worldwide. Oxford is a registered trade mark of
Oxford University Press in the UK and in certain other countries
© Glyn Davis and Branko Pecar 2013
The moral rights of the authors have been asserted
First Edition copyright 2010
Impression: 1
All rights reserved. No part of this publication may be reproduced, stored in
a retrieval system, or transmitted, in any form or by any means, without the
prior permission in writing of Oxford University Press, or as expressly permitted
by law, by licence or under terms agreed with the appropriate reprographics
rights organization. Enquiries concerning reproduction outside the scope of the
above should be sent to the Rights Department, Oxford University Press, at the
address above
You must not circulate this work in any other form
and you must impose this same condition on any acquirer
British Library Cataloguing in Publication Data
Data available
ISBN 978–0–19–965951–7
Printed in Italy by
L.E.G.O. S.p.A.—Lavis TN
Links to third party websites are provided by Oxford in good faith and
for information only. Oxford disclaims any responsibility for the materials
contained in any third party website referenced in this work.
Preface

Aims of the book


It has long been recognized that the development of modular undergraduate programmes
coupled with a dramatic increase in student numbers has led to a reconsideration of
teaching practices. This statement is particularly true in the teaching of statistics and, in
response, a more supportive learning process has been developed. A classic approach to
teaching statistics, unless one is teaching a class of future professional statisticians, can be
difficult and is often met with very little enthusiasm by the majority of students. A more
supportive learning process based on method application rather than method derivation
is clearly needed. The authors thought that by relying on some commonly available tools,
Microsoft Excel 2010 in particular, such an approach would be possible. To this effect, a
new programme relying on the integration of workbook based open learning materials
with information technology tools has been adopted. The current learning and assessment structure may be defined as follows:

(a) To help students ‘bridge the gap’ between school and university
(b) To enable a student to be confident in handling numerical data
(c) To enable students to appreciate the role of statistics as a business decision-making
tool
(d) To provide a student with the knowledge to use Excel 2010 to solve a range of
statistical problems.

This book is aimed at students who require a general introduction to business statistics
that would normally form a foundation-level business school module. The learning material in this book requires minimal input from a lecturer and can be used as a self-instruction guide. Furthermore, three online workbooks are available: two to help students with Excel and practise numerical skills, and an advanced workbook to help undertake factorial experiment analysis using Excel 2010.
The growing importance of spreadsheets in business is emphasized throughout the text
by the use of the Excel spreadsheet. The use of software in statistics modules is more or
less mandatory at both diploma and degree level, and the emphasis within the text is on
the use of Excel 2010 to undertake the required calculations.

How to use the book effectively


The sequence of chapters has been arranged so that there is a progressive accumulation
of knowledge. Each chapter guides students step by step through the theoretical and
spreadsheet skills required. Chapters also contain exercises that give students the chance
to check their progress.

Hints on using the book


(a) Be patient and work slowly and methodically, especially in the early stages when
progress may be slow.
(b) Do not omit or ‘jump around’ between chapters; each chapter builds upon
knowledge and skills gained previously. You may also find that the Excel
applications described earlier in the book are required to develop applications in
later chapters.
(c) Try not to compare your progress with others too much. Fastest is not always best!
(d) Don’t try to achieve too much in one session. Time for rest and reflection is
important.
(e) Mistakes are part of learning. Do not worry about them. The more you repeat
something, the fewer mistakes you will make.
(f ) Make time to complete the exercises, especially if you are learning on your own.
They are your best guide to your progress.
(g) The visual walkthroughs have been developed to solve a particular statistical
problem using Excel. If you are not sure about the Excel solution then use the visual
walkthrough (flash movies) as a reminder.
Brief contents

How to use this book xiv


How to use the Online Resource Centre xvi

1 Visualizing and presenting data 1

2 Data descriptors 58

3 Introduction to probability 107

4 Probability distributions 135

5 Sampling distributions and estimating 185

6 Introduction to parametric hypothesis testing 243

7 Chi-square and non-parametric hypothesis testing 296

8 Linear correlation and regression analysis 343

9 Time series data and analysis 406

Glossary 468
Index 477
Detailed contents

How to use this book xiv


How to use the Online Resource Centre xvi

1 Visualizing and presenting data 1


Overview 1
Learning objectives 2
1.1 The different types of data variable 2
1.2 Tables 3
1.2.1 What a table looks like 4
1.2.2 Creating a frequency distribution 6
1.2.3 Types of data 10
1.2.4 Creating a table using Excel PivotTable 11
1.2.5 Principles of table construction 21
1.3 Graphical representation of data 21
1.3.1 Bar charts 22
1.3.2 Pie charts 27
1.3.3 Histograms 31
1.3.4 Histograms with unequal class intervals 40
1.3.5 Frequency polygon 42
1.3.6 Scatter and time series plots 47
1.3.7 Superimposing two sets of data onto one graph 51
Techniques in practice 54
Summary 56
Key terms 57
Further reading 57

2 Data descriptors 58
Overview 58
Learning objectives 59
2.1 Measures of central tendency 59
2.1.1 Mean, median, and mode 59
2.1.2 Percentiles and quartiles 63
2.1.3 Averages from frequency distributions 67
2.1.4 Weighted averages 77
2.2 Measures of dispersion 80
2.2.1 The range 82
2.2.2 The interquartile range and semi-interquartile range (SIQR) 82
2.2.3 The standard deviation and variance 83
2.2.4 The coefficient of variation 88
2.2.5 Measures of skewness and kurtosis 89
2.3 Exploratory data analysis 94
2.3.1 Five-number summary 94
2.3.2 Box plots 96
2.3.3 Using the Excel ToolPak add-in 100
Techniques in practice 102
Summary 104
Key terms 105
Further reading 105

3 Introduction to probability 107


Overview 107
Learning objectives 107
3.1 Basic ideas 107
3.2 Relative frequency 109
3.3 Sample space 112
3.4 The probability laws 114
3.5 The general addition law 115
3.6 Conditional probability 117
3.7 Statistical independence 120
3.8 Probability tree diagrams 123
3.9 Introduction to probability distributions 124
3.10 Expectation and variance for a probability distribution 127
Techniques in practice 131
Summary 133
Key terms 133
Further reading 133

4 Probability distributions 135


Overview 135
Learning objectives 135
4.1 Continuous probability distributions 136
4.1.1 Introduction 136
4.1.2 The normal distribution 136
4.1.3 The standard normal distribution (Z distribution) 140
4.1.4 Checking for normality 149
4.1.5 Other continuous probability distributions 153
4.1.6 Probability density function and cumulative
distribution function 154
4.2 Discrete probability distributions 155
4.2.1 Introduction 155
4.2.2 Binomial probability distribution 155

4.2.3 Poisson probability distribution 165


4.2.4 Poisson approximation to the binomial distribution 173
4.2.5 Normal approximation to the binomial distribution 175
4.2.6 Normal approximation to the Poisson distribution 180
4.2.7 Other discrete probability distributions 182
Techniques in practice 182
Summary 183
Key terms 183
Further reading 184

5 Sampling distributions and estimating 185


Overview 185
Learning objectives 185
5.1 Introduction to the concept of a sample 186
5.1.1 Why sample? 186
5.1.2 Sampling terminology 187
5.1.3 Types of samples 188
5.1.4 Types of error 192
5.2 Sampling from a population 193
5.2.1 Introduction 193
5.2.2 Population versus sample 194
5.2.3 Sampling distributions 194
5.2.4 Sampling distribution of the mean 194
5.2.5 Sampling from a normal population 198
5.2.6 Sampling from a non-normal population 204
5.2.7 Sampling distribution of the proportion 210
5.2.8 Using Excel to generate a sample from a sampling
probability distribution 212
5.3 Population point estimates 217
5.3.1 Introduction 217
5.3.2 Types of estimate 218
5.3.3 Criteria of a good estimator 218
5.3.4 Point estimate of the population mean and variance 218
5.3.5 Point estimate for the population proportion and variance 222
5.3.6 Pooled estimates 224
5.4 Population confidence intervals 225
5.4.1 Introduction 225
5.4.2 Confidence interval estimate of the population mean, µ (σ known) 226
5.4.3 Confidence interval estimate of the population mean,
µ (σ unknown, n < 30) 228
5.4.4 Confidence interval estimate of the population mean,
µ (σ unknown, n ≥ 30) 232
5.4.5 Confidence interval estimate of a population proportion 235
5.5 Calculating sample size 237
Techniques in practice 239
Summary 241
Key terms 241
Further reading 242

6 Introduction to parametric hypothesis testing 243


Overview 243
Learning objectives 243
6.1 Hypothesis testing rationale 244
6.1.1 Hypothesis statements H0 and H1 244
6.1.2 Parametric versus non-parametric tests of difference 246
6.1.3 One and two sample tests 246
6.1.4 Choosing an appropriate statistical test 247
6.1.5 Significance level 248
6.1.6 Sampling distributions 248
6.1.7 One and two tail tests 249
6.1.8 Check t-test model assumptions 250
6.1.9 Types of error 251
6.1.10 P-values 251
6.1.11 Critical test statistic 252
6.2 One sample z-test for the population mean 253
6.3 One sample t-test for the population mean 257
6.4 Two sample z-test for the population mean 261
6.5 Two sample z-test for the population proportion 266
6.6 Two sample t-test for population mean (independent samples,
equal variances) 269
6.7 Two sample tests for population mean (independent samples,
unequal variances) 274
6.7.1 Two sample tests for independent samples
(unequal variances) 274
6.7.2 Equivalent non-parametric test: Mann–Whitney U test 279
6.8 Two sample tests for population mean (dependent or
paired samples) 279
6.8.1 Two sample tests for dependent samples 279
6.8.2 Equivalent non-parametric test: Wilcoxon matched pairs test 283
6.9 F test for two population variances (variance ratio test) 285
6.10 Calculating the size of the type II error and the statistical power 290
Techniques in practice 292
Summary 294
Key terms 294
Further reading 295

7 Chi-square and non-parametric hypothesis testing 296
Overview 296
Learning objectives 296
7.1 Chi-square tests 297
7.1.1 Chi-square test of association 298
7.1.2 Chi-square test for independent samples 303
7.1.3 McNemar’s test for matched (or dependent) pairs 307
7.1.4 Chi-square goodness-of-fit test 312
7.2 Non-parametric (or distribution-free) tests 318
7.2.1 Sign test 318
7.2.2 Wilcoxon signed rank sum test for dependent samples (or
matched pairs) 324
7.2.3 Mann–Whitney U test for two independent samples 331
Techniques in practice 338
Summary 340
Key terms 341
Further reading 341

8 Linear correlation and regression analysis 343


Overview 343
Learning objectives 343
8.1 Linear correlation analysis 344
8.1.1 Scatter plots 344
8.1.2 Covariance 347
8.1.3 Pearson’s correlation coefficient, r 348
8.1.4 Testing the significance of linear correlation between the
two variables 353
8.1.5 Spearman’s rank correlation coefficient 356
8.1.6 Testing the significance of Spearman’s rank
correlation coefficient, rs 358
8.2 Linear regression analysis 362
8.2.1 Construct scatter plot to identify model 364
8.2.2 Fit line to sample data 364
8.2.3 Sum of squares defined 369
8.2.4 Regression assumptions 370
8.2.5 Test model reliability 372
8.2.6 The use of t-test to test whether the predictor variable is a
significant contributor 374
8.2.7 The use of F test to test whether the predictor variable is a
significant contributor 378
8.2.8 Confidence interval estimate for slope β1 382
8.2.9 Prediction interval for an estimate of Y 383
8.2.10 Excel data analysis regression solution 385
8.3 Some advanced topics in regression analysis 390
8.3.1 Introduction to non-linear regression 390
8.3.2 Introduction to multiple regression analysis 397
Techniques in practice 401
Summary 404
Key terms 405
Further reading 405

9 Time series data and analysis 406


Overview 406
Learning objectives 406
9.1 Introduction to time series data 407
9.1.1 Stationary and non-stationary time series 407
9.1.2 Seasonal time series 409
9.1.3 Univariate and multivariate methods 409
9.1.4 Scaling the time series 410
9.2 Index numbers 411
9.2.1 Simple indices 412
9.2.2 Aggregate indices 415
9.2.3 Deflating values 416
9.3 Trend extrapolation 419
9.3.1 A trend component 420
9.3.2 Fitting a trend to a time series 420
9.3.3 Types of trends 423
9.3.4 Using a trend chart function to forecast time series 424
9.3.5 Trend parameters and calculations 426
9.4 Moving averages and time series smoothing 430
9.4.1 Forecasting with moving averages 431
9.4.2 Exponential smoothing concept 436
9.4.3 Forecasting with exponential smoothing 438
9.5 Forecasting seasonal series with exponential smoothing 445
9.6 Forecasting errors 450
9.6.1 Error measurement 450
9.6.2 Types of errors 453
9.6.3 Interpreting errors 455
9.6.4 Error inspection 456
9.7 Confidence intervals 458
9.7.1 Population and sample standard errors 458
9.7.2 Standard errors in time series 459
Techniques in practice 463
Summary 465
Key terms 466
Further reading 466

Glossary 468
Index 477
How to use this book

Learning objectives

Each chapter opens with a series of learning objectives outlining what you can expect to learn as you progress through the chapter. These also serve as helpful recaps of important concepts when revising.

Step-by-step Excel guidance

Excel screenshots are fully integrated throughout the text and visually demonstrate the Excel formulas, functions, and solutions to provide you with clear step-by-step guidance on how to solve the statistical problems posed.

Example boxes

Detailed worked examples run throughout each chapter to show you how the theory relates to practice. The authors break concepts down into clear step-by-step phases, which are often accompanied by a series of Excel screenshots, enabling you to assess your progress.

Note boxes

Note boxes draw your attention to key points, areas where extra care should be taken, or certain exceptions to the rules.

Interpretation boxes

Interpretation boxes appear throughout the chapters, providing you with further explanations to aid your understanding of the concepts being discussed.

Student exercises

Throughout each chapter you are regularly given the chance to test your knowledge and understanding of the topics covered through student exercises at the end of each section. You can then monitor your progress by checking the answers at the back of the textbook and online.

Techniques in practice

Techniques in practice exercises appear at the end of each chapter and reinforce learning by presenting questions to test the knowledge and skills covered in that unit. You can use these to check your understanding of a topic before moving on to the next chapter.

Chapter summary

Each chapter ends with an overview of the techniques covered and serves as an ideal tool for you to check your understanding of the skills you should have acquired in that chapter.

Key terms

Key terms are highlighted in green where they first appear in the text, along with their definition in the margin. You can also find these terms at the end of each chapter for quick reference.

Further reading

A list of recommended reading is included to allow you to explore a particular subject area in more depth. Annotated web links are also provided throughout the text to help you locate further statistical resources.
How to use the Online Resource Centre

www.oxfordtextbooks.co.uk/orc/davis_pecar2e/

For students
Numerical skills workbook
The authors have provided you with a numerical skills
refresher, packed with examples and exercises, to equip you
with the skills needed to confidently approach every topic in
the textbook.

Introduction to Excel workbook


This workbook serves as an introductory guide or refresher
course which will guide you through the features of Microsoft
Excel 2010.

Factorial experiments workbook


This workbook has been devised to offer you specific guidance
on how to identify and solve factorial experiments. The authors
have provided a wealth of exercises, solutions, and suggested
reading to help you further your understanding of this topic.

Self-test multiple-choice questions


Multiple-choice questions for each chapter of the book help
you to test your understanding of a topic.

Online glossary
The glossary of terms, along with their definitions from the
book, can now be found online for ease of reference.

Revision tips
The authors have provided you with revision tips to help
consolidate your learning and to assist you when preparing for
your exams.

Visual walkthroughs

Visual walkthroughs, complete with audio explanations, are provided


for each statistical process in the text to help guide you through the
techniques and Excel solutions.

For registered adopters


Instructor's manual
This resource includes a chapter-by-chapter guide to
structuring lectures and seminars as well as teaching tips and
solutions from the techniques and exercises in the text.

PowerPoint lecture slides


A suite of fully customizable PowerPoint slides has been designed by the authors to assist you in your lectures and presentations.

Test bank
Each chapter of the book is accompanied by a bank of assorted
questions, covering a variety of techniques for the topics
covered.

Excel data and solutions from the book


Excel spreadsheets and solutions can be found online for all
of the exercises and techniques in practice problems posed in
the book.
1 Visualizing and presenting data

The display of various types of data or information in the form of tables, graphs, and dia-
grams is quite a common spectacle these days. Newspapers, magazines, and television
all use these types of displays to try and convey information in an easy-to-assimilate way.
In a nutshell what these forms of display aim to do is to summarize large sets of raw data
such that we can see, at a glance, the ‘behaviour’ of the data. Figures 1.1 and 1.2 provide
examples of tables published in an English newspaper.

THE WORST OFFENDERS

Bank account             Cut       Mortgage rate cut
A&L Direct Saver         –4.95%    –2.20%
Abbey 50+                –4.65%    –2.85%
Halifax Web Saver        –4.65%    –3.50%
Nationwide e-Savings     –4.60%    –3.99%
Northern Rock E-Saver    –4.00%    –2.70%

SOURCE: Moneyfacts.co.uk

Figure 1.1
‘No better off after rate cuts’. Elizabeth Colman, The Sunday Times—Money, 12 April 2009, p. 6

This chapter and the next describe a variety of techniques for presenting data in a form that will make sense to people. In this chapter we will look at using tables
and graphical forms to represent the raw data, and in Chapter 2 we will explore methods
that can put a summary number to the raw data.

» Overview «
In this chapter we shall look at methods to summarize data using tables and charts:

» tabulating data;

» graphing data.

Rising attacks
Increase in robberies over past three months compared to previous year

Staffordshire          56%
North Yorkshire        47%
Lincolnshire           46%
Cambridgeshire         33%
Nottinghamshire        26%
Merseyside             14%
Greater Manchester     10%
South Wales            No increase
Metropolitan police    −14%

SOURCE: Police figures

[Time series chart: robberies (000s), 1998–2008. SOURCE: Home Office]

Figure 1.2
‘Muggings soar as recession bites’. David Leppard, The Sunday Times, 12 April 2009, p. 11

» Learning objectives «
On successful completion of the module you will be able to:

» understand the different types of data variables that can be used to represent a specific
measurement;

» know how to present data in table form;

» present data in a variety of graphical forms;

» construct frequency distributions from raw data;

» distinguish between discrete and continuous data;

» construct histograms for equal and unequal class widths;

» understand what we mean by a frequency polygon;

» solve problems using Microsoft Excel.

Key term definitions
Variable: a variable is a symbol that can take on any of a specified set of values.
Quantitative: variables can be classified using numbers.
Qualitative: variables can be classified as descriptive or categorical.
Categorical variables: a set of data is said to be categorical if the values or observations belonging to it can be sorted according to category.

1.1 The different types of data variable

A variable is any measured characteristic or attribute that differs for different subjects. For example, if the height of 1000 subjects was measured, then height would be a variable. Variables can be quantitative or qualitative (sometimes called categorical variables).

Quantitative variables (or numerical variables) are measured on one of three different scales: interval, ratio, or ordinal.
Qualitative variables are measured on a nominal scale. If a group of business students was asked to name their favourite browser to browse the Web, then the variable would be qualitative. If the time spent on the computer to research a topic was measured, then the variable would be quantitative. Nominal measurement consists of assigning items to groups or categories. No quantitative information is conveyed and no ordering of the items is implied. Nominal scales are therefore qualitative rather than quantitative. Football club allegiance, sex or gender, degree type, and courses studied are all examples of nominal scales.
Frequency distributions, described in Chapter 2, are used to analyse data measured on a nominal scale. The main statistic computed is the mode. Variables measured on a nominal scale are often referred to as categorical or qualitative variables. It is very important that you understand the type of data variable that you have, as the type of graph or summary statistic calculated will depend upon the type of data variable that you are handling.
Measurements with ordinal scales are ordered in the sense that higher numbers represent higher values. However, the intervals between the numbers are not necessarily equal. For example, on a five-point rating scale measuring student satisfaction, the difference between a rating of 1 (‘very poor’) and a rating of 2 (‘poor’) may not represent the same difference as the difference between a rating of 4 (‘good’) and a rating of 5 (‘very good’). The lowest point on the rating scale in the example was arbitrarily chosen to be 1 and this scale does not have a ‘true’ zero point. The only conclusion you can make is that one is better than the other (or even worse), but you cannot say that one is twice as good as the other.
On interval measurement scales, one unit on the scale represents the same magnitude of the characteristic being measured across the whole range of the scale. For example, if student stress was being measured on an interval scale, then a difference between a score of 5 and a score of 6 would represent the same difference in anxiety as would a difference between a score of 9 and a score of 10. Interval scales do not have a ‘true’ zero point, however; therefore it is not possible to make statements about how many times higher one score is than another. For the stress measurement, it would not be valid to say that a person with a score of 6 was twice as anxious as a person with a score of 3.
Ratio scales are like interval scales except they have true zero points. For example, a weight of 100 g is twice as much as 50 g. Interval and ratio measurements are also called continuous variables. Table 1.1 summarizes the different measurement scales with examples provided of these different scales.

Key term definitions
Interval scale: a scale of measurement where the distance between any two adjacent units of measurement (or ‘intervals’) is the same, but the zero point is arbitrary.
Ratio scale: consists not only of equidistant points but also has a meaningful zero point.
Ordinal scale: a scale where the values/observations belonging to it can be ranked (put in order) or have a rating scale attached. You can count and order, but not measure, ordinal data.
Nominal scale: a set of data is said to be nominal if the values or observations belonging to it can be sorted according to category.
Frequency distributions: a systematic method of showing the number of occurrences of observational data in order from least to greatest.
Statistic: a quantity that is calculated from a sample of data.
Graph: a picture designed to express words, particularly the connection between two or more quantities.
Continuous variable: a set of data is said to be continuous if the values belong to a continuous interval of real values.
Table: a table shows the number of times that items occur.
Classes: classes provide several convenient intervals into which the values of the variable of a frequency distribution may be grouped.

Measurement scale Recognizing a measurement scale


Nominal data 1. Classification data, e.g. male or female, red or black car.
2. Arbitrary labels, e.g. m or f, r or b, 0 or 1.
3. No ordering, e.g. it makes no sense to state that r > b.
Ordinal data 1. Ordered list, e.g. student satisfaction scale of 1, 2, 3, 4, and 5.
2. Differences between values are not important, e.g. political parties
can be given labels: far left, left, mid, right, far right, etc. and student
satisfaction scale of 1, 2, 3, 4, and 5.
Interval data 1. Ordered, constant scale, with no natural zero, e.g. temperature, dates.
2. Differences make sense, but ratios do not, e.g. temperature difference.
Ratio data 1. Ordered, constant scale, and a natural zero, e.g. length, height, weight,
and age.

Table 1.1

1.2 Tables

Presenting data in tabular form can make even the most comprehensive descriptive narrative of data more readily intelligible. Apart from taking up less room, a table enables figures to be located more quickly, easy comparisons between different classes to be made, and may reveal patterns that cannot otherwise be deduced. The simplest form of table indicates the frequency of occurrence of objects within a number of defined categories. Microsoft Excel provides a number of tables that can be constructed using raw data or data that is already in summary form.

1.2.1 What a table looks like


Tables come in a variety of formats, from simple tables to frequency distributions, that allow data sets to be summarized in a form that gives users access to the important information. The table presented in Figure 1.1 compares the interest rate and mortgage
rate cuts for five leading bank accounts that appeared in The Sunday Times newspaper on
12 April 2009. We can see from the table information about the lender, account, interest
rate cut, and mortgage rate cut. This table will have been created from a data set collected
by the researcher.

Example 1.1
When asked the question ‘If there was a general election tomorrow, which party would you
vote for’, 1110 students responded as follows: 400 said Conservative, 510 Labour, 78 Liberal
Democrats, 55 Green, and the rest some other party. We can put this information in table form
indicating the frequency within each category, either as a raw score or as a percentage of the
total number of responses (Table 1.2).

Party               Frequency    Frequency, %
Conservative        400          36
Labour              510          46
Liberal Democrat    78           7
Green               55           5
Other               67           6
Total               1110         100

Table 1.2 Proposed voting behaviour by 1110 university students (source: University Student Survey June 2012)

Key term definition
Raw data: raw data is data collected in original form.

Note
• When a secondary data source is used it is acknowledged.
• The title of the table is given.
• The total of the frequencies is given.
• When percentages are used for frequencies this is indicated together with the sample size, N.
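As a brief worked check of the percentage column in Table 1.2: the Conservative share is 400/1110 × 100 ≈ 36%, Labour is 510/1110 × 100 ≈ 46%, and so on; the rounded percentages sum to 100.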

Sometimes categories can be subdivided and tables can be constructed to convey this
information together with the frequency of occurrence within the subcategories. For example, Table 1.3 indicates the frequency of half-yearly sales of two cars produced by a
large company with the sales split by month.

Example 1.2

Half-yearly sales of XBAR Ltd


Month January February March April May June Total
Pink 5200 4100 6000 6900 6050 7000 35250
Blue 2100 1050 2950 5000 6300 5200 22600
Total 7300 5150 8950 11900 12350 12200 57850

Table 1.3 Half-yearly sales of XBAR Ltd

Further subdivisions of categories may also be displayed, as indicated in Table 1.4, showing a sample of adult males’ television viewing behaviour.

Example 1.3
Tabulated results from a survey undertaken to measure the television viewing habits of adult
males by marital status and age.

                               Single                          Married
                               Under 30 years    30+ years     Under 30 years    30+ years
Less than 15 hours per week    330               358           1162              484
15 hours or more per week      1719              241           643               1521
Total                          2049              599           1805              2005

Table 1.4 Viewing habits of adult males
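Reading Table 1.4 as a cross tabulation: of the 2049 single men under 30 years in the sample, 330 watched less than 15 hours of television per week and 1719 watched 15 hours or more; the column totals give the sample size in each marital status and age group.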



1.2.2 Creating a frequency distribution


When data is collected by survey or by some other form we have, initially, a set of unorganized raw data which, when viewed, would convey little information. A first step would be to organize the set into a frequency distribution such that ‘like’ quantities are collected and the frequency of occurrence of the quantities determined.

Example 1.4
Consider the set of data that represents the number of insurance claims processed each day by
an insurance firm over a period of 40 days: 3, 5, 9, 6, 4, 7, 8, 6, 2, 5, 10, 1, 6, 3, 6, 5, 4, 7, 8, 4, 5,
9, 4, 2, 7, 6, 1, 3, 5, 6, 2, 6, 4, 8, 3, 1, 7, 9, 7 and 2.
The frequency distribution can be used to show how many days it took for one claim to be
processed, how many days it took to process two claims, and so on. The simplest way of doing
this is by creating a tally chart.
Write down the range of values from the lowest (1) to the highest (10) then go through the data set recording each score in the table with a tally mark. It’s a good idea to cross out figures in the data set as you go through it to prevent double counting. Table 1.5 illustrates the frequency distribution for the data set given in Example 1.4.

Score    Tally (grouped in fives)    Frequency, f
1        |||                         3
2        ||||                        4
3        ||||                        4
4        |||||                       5
5        |||||                       5
6        ||||| ||                    7
7        |||||                       5
8        |||                         3
9        |||                         3
10       |                           1
                                     Σf = 40

Table 1.5
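If you prefer Excel to do the counting, here is a minimal worksheet sketch (the cell references are illustrative assumptions, not taken from the book): enter the 40 claim values in cells A1:A40 and the scores 1 to 10 in cells C1:C10, then type =COUNTIF($A$1:$A$40, C1) into cell D1 and fill down to D10. Column D reproduces the Frequency column of Table 1.5, and =SUM(D1:D10) returns 40 as a check.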

In this example there were relatively few cases. However, we may have increased our survey period to one year, and the range of claims may have been between 0 and 30. As our aim is to summarize information we may find it better to group ‘likes’ into classes to form a grouped frequency distribution. The next example illustrates this point.

Key term definitions
Tally chart: a method of counting frequencies, according to some classification, in a set of data.
Grouped frequency distribution: data arranged in intervals to show the frequency with which the possible values of a variable occur.

Example 1.5

Consider the following data set of miles travelled by 120 salesmen in one week (Table 1.6).

403 407 407 408 410 412 413 413


423 424 424 425 426 428 430 430
435 435 436 436 436 438 438 438
444 444 445 446 447 447 447 448
452 453 453 453 454 455 455 456
462 462 462 463 464 465 466 468
474 474 475 476 477 478 479 481
490 493 494 495 497 498 498 500
415 430 439 449 457 468 482 502
416 431 440 450 457 469 482 502
418 432 440 450 458 470 483 505
419 432 441 451 459 471 485 508
420 433 442 451 459 471 486 509
421 433 442 451 460 472 488 511
421 434 443 452 460 473 489 515

Table 1.6

This mass of data conveys little in terms of information. Because there would be too many
value scores, putting the data into an ungrouped frequency distribution would not portray an
adequate summary. Grouping the data, however, provides the following (Table 1.7).

Mileage     Tally (grouped in fives)                    Frequency, f
400–419     ||||| ||||| ||                              12
420–439     ||||| ||||| ||||| ||||| ||||| ||            27
440–459     ||||| ||||| ||||| ||||| ||||| ||||| ||||    34
460–479     ||||| ||||| ||||| ||||| ||||                24
480–499     ||||| ||||| |||||                           15
500–519     ||||| |||                                   8
                                                        Σf = 120

Table 1.7 Grouped frequency distribution for the Example 1.5 data

Excel solution—frequency distribution using Example 1.5 data

1 Input data into cells A6:H20

Figure 1.3

2 Excel Data Analysis solution to create a frequency distribution


Excel can construct grouped frequency distributions from raw data by using the
Data Analysis menu. Before we use this add-in we have to input the lower and upper
class boundaries into Excel. Excel calls this the Bin Range. In this example we have
decided to create a Bin Range that is based upon equal class widths. Let us choose the
following groups with the Bin Range calculated from these group values.

Mileage LCB–UCB Class width Bin range


399.5
400–419 399.5–419.5 20 419.5
420–439 419.5–439.5 20 439.5
440–459 439.5–459.5 20 459.5
460–479 459.5–479.5 20 479.5
480–499 479.5–499.5 20 499.5
500–519 499.5–519.5 20 519.5

Table 1.8 Class and bin range


LCB, lower class boundary; UCB, upper class boundary.

We can see that the class widths are all equal and the corresponding Bin Range is 399.5, 419.5, 439.5, . . ., 519.5. We can now use Excel to calculate the grouped frequency distribution.

Bin Range: Cells B24:B30 (with the label in cell B23).


Now create the histogram.
See Figure 1.4.

x
Class boundaries Class
boundaries separate
one class in a grouped
frequency distribution from
another. Figure 1.4
Histogram A histogram
is a way of summarizing Select Data.
data that are measured
on an interval scale (either Select Data Analysis menu.
discrete or continuous). Click on Histogram.
See Figure 1.5.

Figure 1.5

Click OK.
Input Data Range: Cells A6:H20.
Input Bin Range: Cells B24:B30.
Choose location of Output range: Cell D23.
See Figure 1.6.

Figure 1.6

Click OK.
Excel will now print out the grouped frequency table (Bin Range and frequency of
occurrence) as presented in cells D23–E31.
See Figure 1.7.

Figure 1.7

The grouped frequency distribution would now be as shown in Table 1.9.

Bin range Frequency


399.5 0
419.5 12
439.5 27
459.5 34
479.5 24
499.5 15
519.5 8
More 0

Table 1.9 Bin and frequency values

From Table 1.9 we can now create the grouped frequency distribution (Table 1.10).

Bin range Frequency Mileage


419.5 12 400–419
439.5 27 420–439
459.5 34 440–459
479.5 24 460–479
499.5 15 480–499
519.5 8 500–519

Table 1.10 Grouped frequency distribution
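As an alternative to the Data Analysis add-in, the same counts can be produced with Excel’s FREQUENCY function (a sketch; the output range is an illustrative assumption): select a blank range of eight cells, one more than the number of bins, say K2:K9, type =FREQUENCY(A6:H20, B24:B30), and press Ctrl+Shift+Enter to enter it as an array formula. Excel returns the count of values falling in each bin (0, 12, 27, 34, 24, 15, 8), with the final cell counting any values above the last bin, matching Table 1.9.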

1.2.3 Types of data


Data can exist in two forms: discrete and continuous. Discrete data occurs as an integer (whole number), for example 1, 2, 3, 4, 5, 6 . . . etc. Continuous data occurs as a continuous number and can take any level of accuracy, for example the number of miles travelled could be 440.3, or 440.34, or 440.342, and so on. It is important to note that whether data is discrete or continuous depends not upon how it is collected but how it occurs in reality. Thus, height, distance, and age are all examples of continuous data although they may be presented as whole numbers. Class limits are the extreme boundaries. The class limits given in a frequency distribution are called the stated limits. Two common types are illustrated in Table 1.11.

A              B
5–under 10     5–9
10–under 15    10–14

Table 1.11

Key term definitions
Discrete: a set of data is discrete if the values/observations belonging to it are distinct and separate, i.e. they can be counted (1, 2, 3 . . .).
Class limit: class limits separate one class in a grouped frequency distribution from another.
Stated limits: the lower and upper limits of a class interval.
True or mathematical limits: true or mathematical limits separate one class in a grouped frequency distribution from another.

To ensure that there are no gaps between classes and to help locate data in their appropriate class, we devise what are known as true or mathematical limits. All calculations are based on these true/mathematical limits. Their definition is determined by whether we are dealing with continuous or discrete data. Table 1.12 indicates how these limits may be defined.

                        Mathematical limit
     Stated limit       Discrete     Continuous
A    5–under 10         5–9          5–9.999999 . . .
     10–under 15        10–14        10–14.999999 . . .
B    5–9                5–9          4.5–9.5
     10–15              10–15        9.5–15.5

Table 1.12 Example of mathematical limits

Placing discrete data into an appropriate class usually presents few problems. If the data is continuous and the stated limits are as in style A, then a value of 9.9 would be placed in the 5–under 10 stated class; conversely, if style B were used, it would be placed in the 10–15 stated class. Using the true mathematical limits the width of a class can be found.
If CW = class width, UCB = upper class boundary, and LCB = lower class boundary,
then the class width is calculated using equation (1.1).

CW = UCB – LCB (1.1)

In Example 1.4, the true limits would be 0.5–1.5, 1.5–2.5, and the class width = 1.5 –
0.5 = 1.0. In Example 1.5, the true limits would be 399.5–419.5, 419.5–439.5, and the class
width = 419.5 – 399.5 = 20. Open ended classes are sometimes used at the two ends of a
distribution as a catch-all for extreme values and stated as, for example, up to 40, 40–50 . . .,
100 and over. There are no hard and fast rules for the number of classes to use, although
the following should be taken into consideration:

(a) Use between 5 and 12 classes. The actual number will depend on the size of the
sample and minimizing the loss of information.
(b) Class widths are easier to handle if in multiples of 2, 5, or 10 units.
(c) Although not always possible, try and keep classes at the same widths within a
distribution.
(d) As a guide, the following formula can be used to relate the class width, the number of classes, and the range of the data; a brief worked illustration is given below. Based upon this calculation we would construct the distribution in Example 1.5 with six classes.

Class width = (Highest Value − Lowest Value) / Number of Classes
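As a worked illustration using the Example 1.5 data: the smallest observation is 403 miles and the largest is 515 miles, so with six classes the formula gives a class width of (515 − 403)/6 ≈ 18.7, which is rounded up to the more convenient width of 20 used in Table 1.7.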

1.2.4 Creating a table using Excel PivotTable


A PivotTable organizes and summarizes large amounts of data. The data in one or more
columns (also known as fields) in your data set can become row and column labels in the
PivotTable. The data in one column is usually chosen for the values which are summa-
rized in the centre of the table using a specific calculation. It is called a PivotTable because
the headings can be rotated around the data to view or summarize it in different ways. You
can also filter the data to display just the details for areas of interest. Alternatively, you can
choose to create a PivotChart which will summarize the data in chart format rather than
as a table. Details on creating a PivotChart are set out later in this section.

The source data can be:

• an Excel worksheet database/list or any range that has labelled columns—we will use
Excel worksheets as examples in this chapter;
• a collection of ranges to be consolidated—the ranges must contain both labelled rows
and columns;
• a database file created in an external application such as Access or Dbase.

The data in a PivotTable cannot be changed as they are the summary of other data. The
data itself can be changed and the PivotTable recalculated thereafter. However, formatting
changes, such as bold, number formats, etc., can be made directly to the PivotTable data.
To rearrange the worksheet simply drag and drop column headings to a new location on
the worksheet, and Microsoft Excel rearranges the data accordingly. To begin, you need
raw data to work with. The general rule is you need more than two criteria of data to work
with, otherwise you have nothing to pivot. Figure 1.8 depicts a typical PivotTable where
we have tabulated department spends against month. Notice the black down-pointing
arrows in the PivotTable. On Row 1 we have Department.

Figure 1.8

If the black arrow was clicked, a drop-down box would appear showing a list of the
departments.
We could click on a department and view the departmental spend for the three months
measured, or we could select which departments to view, or choose only one month. But
Excel does most of the work for you and puts in those drop-down boxes as part of the wizard. In the example we can see that the advertising budget spend in June was €12,422.

Example 1.6
This example consists of a set of data that has been collected to measure the departmental
spend of individuals within three departments of Coco S.A.
The budget spends (in Euros) have been measured for April, May, and June 2007 (see Figure 1.9).

Figure 1.9
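Although Figure 1.9 shows the layout, it is worth noting the shape of the source table (a sketch; the exact column headings are assumptions inferred from the fields used later in the walkthrough): the range B2:E32 holds one row per person per month, with four labelled columns such as Name, Department, Month, and Budget spend, and it is these column headings that become the draggable fields in the PivotTable Field List.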

Excel solution—creating a PivotTable using Example 1.6 data


Select Insert > PivotTable.
The PivotTable wizard will walk you through the process of creating an initial PivotTable.
Select PivotTable from the two options illustrated in Figure 1.10.

Figure 1.10

Input in the Create PivotTable menu the cell range for the data table and where you
want the PivotTable to appear.
Select a table: Cells B2:E32.
Choose to insert PivotTable in Existing Worksheet: Cell G2.
Figure 1.11 illustrates the Create PivotTable menu.

Figure 1.11

Click OK.
Excel creates a blank PivotTable and the user must then drag and drop the various fields
from the items; the resulting report is displayed ‘on the fly’, as illustrated in Figure 1.12.

Figure 1.12

The PivotTable (Cells G2:I19) will be populated with data from the data table in Cells
B3:E32 with the completion of the PivotTable Field List, which is located at the right-
hand side of the worksheet. Presented in Figures 1.13 and 1.14 are but a few examples of

Figure 1.13

Figure 1.14

hundreds of possible reports that could be viewed with this data through the PivotTable
format. For Example 1.6 above choose:

• department—drop row fields here;


• month—drop column fields here;
• department budget—drop data items here.

The completed PivotTable for this problem is illustrated in Figure 1.15.

Figure 1.15

Modifying reports
The PivotTable field dialog box allows changes to be made to the PivotTable. For example,
we may decide to modify the PivotTable by including the individual staff spends in indi-
vidual departments. This can be achieved by selecting Name in the PivotTable Field List
with the outcome presented in Figure 1.16.

Figure 1.16

From Figure 1.16 we can observe that the individual staff contributions under each
department are presented. If you look at your Excel solution you will observe that the
Name variable is located in the Row Label dialog box. If we move the Name variable into
the Column dialog box then the solution will be as presented in Figure 1.17, where only
part of the solution is illustrated.

Figure 1.17

PivotTable options
A range of PivotTable options are available, as illustrated in Figure 1.18.

Figure 1.18

Changing the way data is summarized

By default, Excel will use a Sum function on numeric data and Count on non-numeric to
summarize or aggregate the data. To change this:

1. Click on the field you want to change (on the PivotTable itself or in the areas below
the Field list). For example, click inside the numbers within the PivotTable and right-
click on the mouse to bring up the menu illustrated in Figure 1.19.

Figure 1.19

2. Click on Value Field Settings and select the appropriate calculation. For example,
change the calculation from Sum to Average, as illustrated in Figure 1.20.

Figure 1.20

The final result is illustrated in Figure 1.21.

Figure 1.21

Note To display more than one calculation in the Values area add the same Field twice.

Formatting values
1. Display the Field Settings dialog box as shown in Figure 1.20.
2. Click on the Number Format button.
3. Select the Category you want and set any options. For example, select Number and
enter the number of decimal places to display the data to.
4. Click OK and OK again, and your cells will be reformatted.

Displaying values as a percentage


1. Display the Field Settings dialog box as shown in Figure 1.20.
2. Click on the Show Values As tab.
3. Select Percentage of Row Total or Percentage of Column Total.

Creating a PivotChart with a PivotTable


1. Click anywhere within your data set.
2. On the Insert tab, click on the PivotTable drop-down and select PivotChart from the
list (see Figure 1.22).

Figure 1.22

3. Choose the data set and location of the PivotTable and PivotChart as you would to
create a new PivotTable (see Figure 1.23). A new blank PivotTable and PivotChart will
be created.

Figure 1.23

Select a table or range location: B2:E32.


Choose location of PivotTable and PivotChart: G2.
4. Click and drag Fields from the Field List onto the different areas of the PivotTable in
the usual way, as illustrated in Figures 1.24 and 1.25.

Figure 1.24

Figure 1.25
5. The PivotChart and PivotTable will both be created simultaneously, as illustrated in
Figure 1.26.

Figure 1.26

Note Some Chart types (for example pie charts) are not suitable for PivotTables because
they can only show two variables.

Adding a PivotChart to an existing PivotTable


You can also add a PivotChart if you have already created the PivotTable.

1. Click anywhere on the PivotTable.


2. Click on PivotTable Tools menu and select Options > PivotChart (see Figure 1.27).

Figure 1.27

3. Select type of chart, e.g. Column, and click OK (see Figure 1.28).

Figure 1.28

Key term definition
Pie chart: a pie chart is a way of summarizing a set of categorical data.

4. A PivotChart will be added to your existing PivotTable (see Figure 1.29).

Figure 1.29

Grouping data

Data can be summarized into higher level categories by grouping items within PivotTable
fields. Depending on the data in the field there are three ways to group items:

• group selected items into custom categories;


• automatically group numeric items by a specific interval;
• automatically group dates and times by a specific interval.

Refreshing a PivotTable
When data is changed in the PivotTable source list the PivotTable does not automatically
recalculate. To refresh the table:

1. Select any part of the PivotTable


2. Click on Pivot Table Tools Options then click on the Refresh button (see Figure 1.30).

Figure 1.30

PivotTable Options can be set to refresh data every time a spreadsheet is opened.

Extending the dataset


If you add additional columns or rows you will need to extend the data source of the
PivotTable to include them.

1. Select Pivot Table Tools Options and click on Change Data Source.
2. Edit the range in the Table/Range box to include your entire dataset and click OK.

1.2.5 Principles of table construction


From our earlier discussions, we can conclude that when constructing tables good principles to be adopted are as follows: (a) aim for simplicity; (b) the table must have a comprehensive and explanatory title; (c) the source should be stated; (d) units must be stated clearly; (e) the headings for columns and rows should be unambiguous; (f) double counting should be avoided; (g) totals should be shown where appropriate; (h) percentages and ratios should be computed and shown where appropriate; and, overall, (i) use your imagination and common sense.

Student Exercises
X1.1 Criticize Table 1.13.

Castings Weight of metal Foundry


Up to 4 ton 60 210
Up to 10 ton 100 640
All other weights 110 800
Other 20 85
Total 290 2000

Table 1.13

X1.2 Table 1.14 represents the number of customers visited by a salesman over an 80-week period.

68 64 75 82 68 60 62 88 76 93 73 79 88 73 60 93
71 59 85 75 61 65 75 87 74 62 95 78 63 72 66 78
82 75 94 77 69 74 68 60 96 78 89 61 75 95 60 79
83 71 79 62 67 97 78 85 76 65 71 75 65 80 73 57
88 78 62 76 53 74 86 67 73 81 72 63 76 75 85 77

Table 1.14
Use Excel to construct a grouped frequency distribution from the data set in Table 1.14 and indicate both stated and mathematical limits (start at 50–54 with a class width of 5).

Key term definitions
Bar chart: a bar chart is a way of summarizing a set of categorical data.
Frequency polygon: a graph made by joining the middle-top points of the columns of a frequency histogram.
Scatter plot: a scatter plot is a plot of one variable against another variable.
Time series plot: a chart of a change in variable against time.
Ordinal variable: a set of data is said to be ordinal if the values belonging to it can be ranked.

1.3 Graphical representation of data

The next stage of analysis after the data has been tabulated is to graph it using a variety of methods to provide a suitable graph. In this section we will explore: bar charts, pie charts, histograms, frequency polygons, scatter plots, and time series plots. The type of graph you will use to graph the data depends upon the type of variable you are dealing with within your data set, for example category (or nominal), ordinal, or interval (or ratio) data (Table 1.15).

Data type Which graph to use?


Category or nominal Bar chart, pie chart, and cross tabulation tables (or contingency tables)
Ordinal Bar chart, pie chart, and scatter plots
Interval or ratio Histogram, frequency polygon, cumulative frequency curve (or ogive),
scatter plots, and time series plots

Table 1.15 Deciding which graph type given data type

1.3.1 Bar charts


Graph and chart are terms that are often used to refer to any form of graphical display.
Categorical data is represented largely by bar and pie charts. Bar charts are very useful
in providing a simple pictorial representation of several sets of data on one graph. Bar
charts are used for categorical data where each category is represented by each vertical
(or horizontal) bar. In bar charts each category is represented by a bar with the frequency
represented by the height of the bar. All bars should have equal width and the distance
between each bar is kept constant. It is important that the axes (X and Y) are labelled and
the chart has an appropriate title.
What each bar represents should be stated clearly within the chart. Figure 1.31 repre-
sents a component bar chart for half-yearly car sales.

Figure 1.31 Component bar chart of half-yearly car sales (number of cars by month, Blue and Pink)

Example 1.7
Consider the categorical data in Example 1.1, which represents the proposed voting behaviour
by a sample of university students. Excel can be used to create a bar chart to represent this data
set. For each category a vertical bar is drawn with the vertical height representing the number
of students in that category (or frequency) with the horizontal distance for each bar, and dis-
tance between each bar, equal.

Each bar represents the number of students who would vote for a particular UK political
party. From the bar chart you can easily detect the differences of frequency between the five
categories (Conservative, Labour, Liberal Democrat, Green, and Other). Figure 1.32 represents
a bar chart for the proposed voting behaviour.

Cross tabulation Cross tabulation is the process of tabulating the results of two or more data sources (variables) against one another.

Figure 1.32

Example 1.8
If you are interested in comparing totals then a component (or stacked) bar chart is constructed.
Figure 1.33 represents a component bar chart for the half-yearly car sales.
In this component bar chart you can see the variation in total sales from month to month
and the split between car type category per month.

Figure 1.33 Component (stacked) bar chart of half-yearly car sales, showing monthly totals split by car type

Example 1.9
A multiple column chart is used when you want to compare each component over time, but
the totals are of little importance.
Figure 1.34 represents a multiple bar chart for the half-yearly car sales.
Figure 1.34 Multiple bar chart of half-yearly car sales (number of cars by month, Pink and Blue)

Example 1.10
Excel solution—bar chart using Example 1.1 data

1 Input data series


The data in Example 1.1 consist of two columns of data. Column 1 represents party
membership and column 2 represents the number of students proposing to vote for
a particular political party (also called frequency of occurrence). We can use Excel to
create a bar chart for this data by placing the data in Excel as follows.

Party: Cells B6:B11.


Frequency: Cells C6:C11 (includes labels in cells B6 and C6).
Select cells B6:C11.
Figure 1.35 represents the Excel worksheet.

Figure 1.35

2 Select Insert > Chart type (choose Column) > select first option, as illustrated in
Figure 1.36

This will result in the graph illustrated in Figure 1.37.

3 Edit the chart


The chart can then be edited to improve the chart appearance: for example, include
chart and axes titles, change bar colours, modify legend, and add or remove axes grid
lines.
• Chart title

Figure 1.36

Figure 1.37

In Figure 1.37 the current chart title is ‘Frequency’. To change the title, click on the
current chart title ‘Frequency’, type in the new chart title ‘Proposed voting behaviour’,
and press the enter key. Figure 1.41 illustrates the final chart.
• Axes titles
In this case the axes titles are not currently available in Figure 1.37. To add the axes
titles click on the current chart and note that the Chart Tools menu on the Excel
menu will appear as illustrated in Figure 1.38.

Figure 1.38

Select Layout on the Chart Tools menu to access the Layout tool menu as illustrated
in Figure 1.39.

Figure 1.39

Select Axis title dialog box and choose either ‘Primary Horizontal Axis Title’ or ‘Primary
Vertical Axis Title’, and modify to add the axes titles, as illustrated in Figure 1.40.

Figure 1.40

Figure 1.41 illustrates the final chart.

Figure 1.41 Final bar chart of proposed voting behaviour (frequency by party: Conservative, Labour, Democrat, Green, Other)

• Change bar colours and modify legend


To change the bar colour select each bar in turn, right click, and select Format
Data Point > Fill > choose solid fill and select the bar colour. This is then repeated
for each bar in the chart. In Figure 1.37 the current bar colours are all blue. When

each bar has a unique colour then the chart legend will list each of the bar titles, for
example Conservative for the first bar. Figure 1.41 illustrates the final chart.
• Remove horizontal grid lines
To remove the horizontal grid lines click on the horizontal gridlines to select and
press the computer delete key. Further modifications to grid lines can be achieved
by choosing Gridlines on the Chart Tools Gridlines menu.
The final bar chart is illustrated in Figure 1.41.

Student exercise
X1.3 Draw a suitable bar chart for the data in Table 1.16.

Industrial sources for consumption and investment demand (thousand million)


Producing industry Consumption Investment
Agriculture, mining 1.1 0.1
Metal manufacturers 2.0 2.7
Other manufacturing 6.8 0.3
Construction 0.9 3.7
Gas, electricity, and water 1.2 0.2
Services 16.5 0.8
Total 28.5 7.8

Table 1.16

1.3.2 Pie charts


In a pie chart the relative frequencies are represented by a slice of a circle. Each section
represents a category, and the area of a section represents the frequency or number of
objects within a category. They are particularly useful in showing relative proportions, but
their effectiveness tends to diminish for more than eight categories.

Example 1.11
Example 1.1 proposed voting behaviour data is illustrated in Table 1.17.

Political party Voting behaviour


Conservative 400
Labour 510
Democrat 78
Green 55
Other 67

Table 1.17

This data can then be represented by a pie chart.


Figure 1.42 represents a pie chart for proposed voting behaviour.
We can see that different slices of the circle represent the different choices that people have
when it comes to voting.

Figure 1.42 Pie chart of proposed voting behaviour (Conservative 400, Labour 510, Democrat 78, Green 55, Other 67)

Note Calculation of circle segment angles for each political party:


A set of instructions is provided if you would like to calculate the angles of each slice in the cir-
cle that represents each voting category. From the table we can calculate that the total number
of students = 400 + 510 + 78 + 55 + 67 = 1110. Given that 360° represents the total number
of degrees in a circle, we can now calculate how many degrees would be represented by each
student. For this example we have 360° = 1110 students. Therefore, each student is represented
by (360/1110) degrees. Based upon this calculation we can now calculate each angle for each
political party category (see Table 1.18).

Political party Voting behaviour Angle calculation Angle (degrees; one decimal place)
Conservative 400 (360/1110)*400 129.7
Labour 510 (360/1110)*510 165.4
Democrat 78 (360/1110)*78 25.3
Green 55 (360/1110)*55 17.8
Other 67 (360/1110)*67 21.7
Total 1110 359.9

Table 1.18

The size of each slice (sector) depends on the angle at the centre of the circle which, in turn,
depends upon the number in the category the sector represents. Before drawing the pie chart
you should always check that the angles you have calculated sum to 360°. A pie chart may be
constructed on a percentage basis or the actual figures may be used.
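
The angle column of Table 1.18 can also be generated on the worksheet rather than by hand. The layout below is an illustrative sketch only (it assumes the frequencies sit in cells C7:C11; this is not a layout taken from the book's workbook):

Angle for Conservative Cell D7 Formula: =360*C7/SUM($C$7:$C$11)

Copying cell D7 down to D11 produces the remaining angles, and =SUM(D7:D11) returns exactly 360; the total of 359.9 in Table 1.18 arises only because each angle was rounded to one decimal place.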

Excel solution—pie chart using Example 1.1 data

1 Input data series


The data in Example 1.1 consists of two columns of data. Column 1 represents party
membership and column 2 represents the number of students proposing to vote for
a particular political party (also called frequency of occurrence). We can use Excel to
create a pie chart for this data by placing the data in Excel, as illustrated in Figure 1.43.

Party: Cells B6:B11.


Frequency: Cells C6:C11 (includes labels in cells B6 and C6).
Highlight cells B6:C11, as illustrated in Figure 1.43.

Figure 1.43

2 Select Insert > Chart type (choose Pie) > select first option, as illustrated in
Figure 1.44

Figure 1.44

This will result in the graph illustrated in Figure 1.45.

Figure 1.45 Default pie chart with legend (Conservative, Labour, Democrat, Green, Other)

3 Edit the chart


The chart can then be edited to improve the chart appearance, for example include:
chart title, change bar colours, change data label information (Chart Tools > Data
Labels > more data label options), as illustrated in Figure 1.46.

Figure 1.46

The final pie chart is illustrated in Figure 1.47.


Figure 1.47 Final pie chart of proposed voting behaviour with count and percentage data labels (Conservative 400, 36%; Labour 510, 46%; Democrat 78, 7%; Green 55, 5%; Other 67, 6%)

Student exercises
X1.4 Three thousand six hundred people who work in Bradford were asked about the
means of transport that they used for daily commuting. The data collected is shown in
Table 1.19.

Type of transport Frequency of response


Private car 1800
Bus 900
Train 300
Other 600

Table 1.19

Construct a pie chart to represent this data.


X1.5 The results of the voting in an election are shown in Table 1.20.

Mr P 2045 votes
Mr Q 4238 votes
Mrs R 8605 votes
Ms S 12,012 votes

Table 1.20

Represent this information on a pie chart.

1.3.3 Histograms
We have already mentioned the idea of a frequency distribution via the displaying of cat-
egory level data with tables and bar charts. This concept can now be extended to higher
levels of measurement. A point to remember when displaying any form of data is the aim
of summarizing information clearly and in such a form that information is not distorted
or lost. The method used to graph a grouped frequency table (or distribution) is to construct
a histogram. A histogram looks like a bar chart, but they are different and should not be
confused with each other.
Histograms are constructed on the following principles: (a) the horizontal axis (x–axis)
is a continuous scale; (b) each class is represented by a vertical rectangle, the base of
which extends from one true limit to the next; and (c) the area of the rectangle is pro-
portional to the frequency of the class. This is very important as it means that the area
of the bar represents the frequency of each category. In the bar chart the frequency is
represented by the height of each bar. This implies that if we double the class width for
one bar compared with all the other classes then we would have to halve the height of that
particular bar compared with all other bars.
In the special case where all class widths are the same then the height of the bar can be
taken to be representative of the frequency of occurrence for that category. It is important
to note that either frequencies or relative frequencies can be used to construct a histo-
gram, but the shape of the histogram would be exactly the same no matter which variable
you chose to graph.
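
As a small illustrative sketch of this last point (the cell references are assumptions, with the frequencies taken to be in cells C5:C14), a relative frequency column can be added to any frequency table with one formula:

Relative frequency Cell D5 Formula: =C5/SUM($C$5:$C$14)

Copying D5 down the column gives proportions that sum to 1; plotting column D rather than column C rescales the vertical axis but leaves the shape of the histogram unchanged.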

Example 1.12
Example 1.4 represents the number of insurance claims processed each day by an insurance
firm over a period of 40 days (see Table 1.21).

Score Frequency, f
1 3
2 4
3 4
4 5
5 5
6 7
7 5
8 3
9 3
10 1
Σf = 40

Table 1.21

The data variable ‘score’ is a discrete variable and the histogram is constructed as illustrated
in Table 1.22.
We can see from Table 1.22 that all the class widths have the same value 1 (constant, class
width = UCB – LCB). In this case the histogram can be constructed with the height of the bar
representing the frequency of occurrence.
To construct the histogram we would plot frequency (y–axis, vertical) against score (x-axis)
with the boundary between the bars determined by the upper and lower class boundaries.
Figure 1.48 illustrates the class boundary positions for each bar.

Score LCB–UCB Class width Frequency, f


1 0.5–1.5 1 3
2 1.5–2.5 1 4
3 2.5–3.5 1 4
4 3.5–4.5 1 5
5 4.5–5.5 1 5
6 5.5–6.5 1 7
7 6.5–7.5 1 5
8 7.5–8.5 1 3
9 8.5–9.5 1 3
10 9.5–10.5 1 1
Σf = 40
LCB, lower class boundary; UCB, upper class boundary.

Table 1.22

Figure 1.48 Histogram bars for the number of claims processed, with class boundaries marked at x = 0.5, 1.5, 2.5, …, 9.5, 10.5

The completed histogram would look like Figure 1.49.


Figure 1.49 Completed histogram for the number of claims processed

We can use the histogram to see how the number of claims varies in frequency from the
lowest claims value of 1 to the highest claims value of 10. If we look at the histogram we note:

• looking along the x-axis we can see that the claims are evenly spread out (1–10);
• the frequency rises from 1 to 6 claims per day, with 6 claims per day the most common value, occurring on 7 days;
• the frequency falls from 6 to 10 claims per day, with 10 claims per day the least common value, occurring on only 1 day.

These ideas will lead to the idea of average (central tendency) and data spread (dispersion)
which will be explored in Chapter 2.

Example 1.13
Example 1.5 represents the miles recorded by 120 salesmen in one week, as illustrated in
Table 1.23.

Mileage Frequency, f
400–419 12
420–439 27
440–459 34
460–479 24
480–499 15
500–519 8
Σf = 120

Table 1.23

Figure 1.50 represents the histogram for miles recorded by 120 salesmen.

Figure 1.50 Histogram of mileage travelled by salesmen

Excel spreadsheet solution—histogram with equal class widths using Example 1.5
data

1 Data series
Input data into cells A6:H20.
See Figure 1.51.

2 Excel Data Analysis tool


Check that the Analysis ToolPak add-in is installed. Now we can use Excel to create the histogram.
Before we use this technique we have to input the lower and upper class boundaries
into Excel. Excel calls this the Bin Range. In this example we have decided to create a Bin
Range that is based upon equal class widths. Let us choose the following groups with the
Bin Range calculated from these group values as illustrated in Table 1.24.

Figure 1.51

Mileage LCB–UCB Class width Bin range


399.5
400–419 399.5–419.5 20 419.5
420–439 419.5–439.5 20 439.5
440–459 439.5–459.5 20 459.5
460–479 459.5–479.5 20 479.5
480–499 479.5–499.5 20 499.5
500–519 499.5–519.5 20 519.5
LCB, lower class boundary; UCB, upper class boundary.

Table 1.24

We can see from Table 1.24 that the class widths are all equal and the corresponding
Bin Range is 399.5, 419.5 . . . 519.5. We can now use Excel to create the grouped
frequency distribution and corresponding histogram for equal classes. If you want,
you can leave the Bin Range box blank and the Excel Histogram tool will automatically
create evenly distributed bin intervals using the minimum and maximum values in
the input range as beginning and end points. The number of intervals is equal to the
square root of the number of input values (rounded down).
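
As a quick check of this rule for our data, the input range A6:H20 holds 120 values, so Excel would create ⌊√120⌋ = 10 automatic bins. If you wish to reproduce the count on the worksheet, one illustrative formula (not part of the solution that follows) is:

Number of automatic bins Formula: =FLOOR(SQRT(COUNT(A6:H20)),1)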

Bin Range: Cells B24:B30 (with the label in cell B23).


See Figure 1.52.

Figure 1.52

Now create the histogram.



Select Data > Data Analysis.


Click on Histogram.
See Figure 1.53.

Figure 1.53

Click OK .
Input Data Range: Cells A6:H20.
Input Bin Range: Cells B24:B30.
Choose location of Output range: Cell D23.
See Figure 1.54.

Figure 1.54

Press OK.
Excel will now print out the grouped frequency table (Bin Range and Frequency of
occurrence), as presented in cells D23–E31.
See Figure 1.55.

Figure 1.55
We can now use Excel to generate the histogram for equal class widths.

3 Input data series


Mileage: Cells B36:B42.
Frequency: Cells C36:C42 (includes labels in cells B36 and C36).
Highlight B36:C42.
See Figure 1.56.

Figure 1.56

4 Create column chart (Insert > Column > choose first option)
This will create the chart illustrated in Figure 1.57 with chart title and axes titles
updated.

Figure 1.57 Column chart of the mileage frequency distribution with chart title and axes titles updated

5 Transformation of the column chart to a histogram


Select bars by clicking on them, as illustrated in Figure 1.58.

Figure 1.58

Click on any one of the bars and right click on the computer mouse (Figure 1.59).

Figure 1.59

Select Format Data Series and reduce Gap Width to zero, as illustrated in Figure 1.60.

Figure 1.60

The final histogram is presented in Figure 1.61.

Figure 1.61 Final histogram with the gap width reduced to zero

Note Calculation process for the creation of the histogram


The data variable ‘mileage’ is a grouped variable and the histogram is constructed as follows.

Mileage LCB–UCB Class width Frequency, f


400–419 399.5–419.5 20 12
420–439 419.5–439.5 20 27
440–459 439.5–459.5 20 34
460–479 459.5–479.5 20 24
480–499 479.5–499.5 20 15
500–519 499.5–519.5 20 8
Σf = 120

Table 1.25 Calculation procedure to identify class limits for the histogram

We can see from the table that all the class widths are of the same value 20 (constant, class
width = UCB – LCB). In this case the histogram can be constructed with the height of the bar
representing the frequency of occurrence.
To construct the histogram we would plot frequency (y-axis, vertical) against score (x-axis)
with the boundary between the bars determined by the upper and lower class boundaries.
Figure 1.62 illustrates the class boundary positions for each bar.

Figure 1.62 Class boundary positions for each bar of the mileage histogram (x = 399.5, 419.5, 439.5, …, 499.5, 519.5)

Figure 1.63 illustrates the completed histogram for miles recorded by 120 salesmen.

Figure 1.63 Completed histogram for miles recorded by 120 salesmen

We can use the histogram to see how the frequency changes as the miles travelled changes
from the lowest group (400–419) to the highest group (500–519). If we look at the histogram
we can note:

• looking along the x-axis we can see that the miles recorded are evenly spread out;
• the frequencies rise from class 400–419 to class 440–459, the most frequently recorded
mileage group;
• the frequencies fall from class 440–459 to class 500–519, the least frequently recorded
mileage group.

These ideas will lead to the idea of average (central tendency) and data spread (disper-
sion), which will be explored in Chapter 2.

Student exercises
X1.6 Create a suitable histogram to represent the number of customers visited by a
salesman over an 80-week period (Table 1.26).

68 64 75 82 68 60 62 88 76 93 73 79 88 73 60 93
71 59 85 75 61 65 75 87 74 62 95 78 63 72 66 78
82 75 94 77 69 74 68 60 96 78 89 61 75 95 60 79
83 71 79 62 67 97 78 85 76 65 71 75 65 80 73 57
88 78 62 76 53 74 86 67 73 81 72 63 76 75 85 77

Table 1.26

X1.7 Create a suitable histogram to represent the spending on extracurricular activities


for a random sample of university students during the ninth week of the first term
(Table 1.27).

16.91 9.65 22.68 12.45 18.24 11.79 6.48 12.93 7.25 13.02
8.10 3.25 9.00 9.90 12.87 17.50 10.05 27.43 16.01 6.63
14.73 8.59 6.50 20.35 8.84 13.45 18.75 24.10 13.57 9.18
9.50 7.14 10.41 12.80 32.09 6.74 11.38 17.95 7.25 4.32
8.31 6.50 13.80 9.87 6.29 14.59 19.25 5.74 4.95 15.90

Table 1.27

1.3.4 Histograms with unequal class intervals


It may be necessary because of the way a set of data is distributed to use unequal class
intervals. In this case, special care needs to be taken in constructing the histogram given
that each bar (or rectangle) of the histogram is proportional to the frequency. Which of
the two histograms in Figure 1.64 for the following distribution illustrated in Table 1.28 is
correct?

Class 0–2 2–4 4–6 6–8 8–12


Frequency 3 3 3 3 3

Table 1.28

Consider Figure 1.64.

Figure 1.64 Two candidate histograms, A and B, for the Table 1.28 distribution

Although class (8–12) is twice the width of the other classes, histogram A gives equal
weighting to the frequency for all classes.
It is therefore incorrect. Keep in mind that the area of a rectangle is proportional to
frequency and thus:

Height = Class Frequency / Class Width        (1.2)

Histogram B indicates the correct weighting to the class (8–12). As the class width is
twice the width of the other classes, the height of the rectangle is halved. In general, if we
choose a standard class width, a class having twice the width will have a height of 1/2 of its
frequency; three times the width a height of 1/3 of its frequency, and so on.

Example 1.14
Construct a histogram for the following distribution of discrete data (Table 1.29).

Class 118–121 122–128 129–138 139–148 149–158 159–178


Frequency 2 6 14 31 63 28

Table 1.29

Taking the class (129–138) as our standard class width (class width = 10) then we can use the
following formula to calculate the heights of each individual bar (or rectangle).

h = (CWs/CW) * f        (1.3)

Where CWs = standard class width = 10, CW = class width, f = class frequency, and h = class
height (height of rectangle). Table 1.30 illustrates the calculation of the class heights when
classes are unequal.

Class 118–121 122–128 129–138 139–148 149–158 159–178


Frequency 2 6 14 31 63 28
LCB 117.5 121.5 128.5 138.5 148.5 158.5
UCB 121.5 128.5 138.5 148.5 158.5 178.5
Class width 4 7 10 10 10 20
Calculation of height (10/4)*2 (10/7)*6 (10/10)*14 (10/10)*31 (10/10)*63 (10/20)*28
Height 5 8.6 14 31 63 14

Table 1.30 Calculation process to calculate the class height


LCB, lower class boundary; UCB, upper class boundary.
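
Equation (1.3) translates directly into a worksheet formula if you would rather let Excel perform the arithmetic in Table 1.30. The layout below is an assumed sketch only (frequencies in cells B4:G4 and class widths in cells B5:G5, one column per class):

Height Cell B6 Formula: =(10/B5)*B4

Copying B6 across to G6 reproduces the height row of Table 1.30; for the first class this returns (10/4)*2 = 5.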

Figure 1.65 illustrates the completed histogram for the Example 1.14 data set.

Figure 1.65 Histogram with unequal class intervals for the Example 1.14 data set

Note It is important to note that:

(a) As the height of the rectangle depends on both the class frequency and the class width, we
can use the term frequency density (frequency per unit class width) rather than frequency.
(b) Total area is proportional to total frequency.

Unfortunately, you cannot create a histogram with unequal class widths (or intervals) using
Excel, but you can create the frequency distribution by inputting the upper and lower class
intervals. These are called Bins in Excel.

Histogram with unequal class intervals A histogram with unequal class intervals is a graphical representation showing a visual impression of the distribution of data where class widths are of different sizes.

1.3.5 Frequency polygon

A frequency polygon is formed from a histogram by joining the mid-points of the tops of
the rectangles by straight lines. The mid-points of the first and last class are joined to the
x-axis on either side at a distance equal to half the class interval of the first and last class.

Mid-point of class = (UCB + LCB)/2        (1.4)

Where UCB = Upper Class Boundary and LCB = Lower Class Boundary.
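
On a worksheet, equation (1.4) is a one-line formula. Assuming, purely for illustration, that the lower class boundaries are in column B and the upper class boundaries in column C from row 4 downwards:

Class mid-point Cell D4 Formula: =(B4+C4)/2

For the first class of Table 1.31 this returns (399.5 + 419.5)/2 = 409.5.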

Example 1.15
Table 1.31 illustrates the frequency polygon for the data set in Example 1.5.

Class True limits Class mid-point Frequency


400–419 399.5–419.5 409.5 12
420–439 419.5–439.5 429.5 27
440–459 439.5–459.5 449.5 34
460–479 459.5–479.5 469.5 24
480–499 479.5–499.5 489.5 15
500–519 499.5–519.5 509.5 8

Table 1.31

Figure 1.66 illustrates the frequency polygon for the travelling salesmen problem.

Figure 1.66 Frequency polygon for the miles travelled (frequency against class mid-point)

Excel solution—frequency polygon using Example 1.15 data

1 Data series
Class Mid-Point: cells D3:D9 (includes data label).
Frequency: cells E3:E9 (includes data label).
Highlight cells D3:E9.
Figure 1.67 illustrates the Excel solution.

Class mid-point The class mid-point is the midpoint of each class interval.

Figure 1.67

2 Select Insert > Line > select option 4, as illustrated in Figure 1.68

Figure 1.68

This will result in the graph illustrated in Figure 1.69.

Figure 1.69 Initial line chart showing both the class mid-point and frequency series plotted against the points 1–6

3 Edit the chart


From Figure 1.69 we note that we have two lines on the chart: class mid-point line and
frequency line. Click on the class mid-point line and press delete on the computer
keyboard. The chart illustrated in Figure 1.69 will now change to the chart illustrated
in Figure 1.70.

The chart is correct on the vertical axis (frequency), but we would like the horizontal
axis to use the class labels rather than the numbers 1, 2, 3, 4, 5, and 6. To modify the
horizontal axis label from these numbers to the class mid-point labels we need to edit
the data series. Right-click on the data line, as illustrated in Figure 1.71.

Choose Select Data (see Figure 1.72).



Figure 1.70

Figure 1.71

Figure 1.72

Click on Edit in the Horizontal (Category) Axis Labels and browse to the class mid-
point cell reference (D4:D9)(see Figure 1.73).

Figure 1.73

Click OK.
See Figure 1.74.

Figure 1.74

Click OK and the chart will be modified as illustrated in Figure 1.75.

Figure 1.75 Frequency polygon for the miles travelled, with the class mid-point labels on the horizontal axis

Figure 1.75 illustrates the frequency polygon after a degree of reformatting (removed
border, horizontal gridlines).

Student exercise
X1.8 Create a frequency polygon (line graph) for the data in Table 1.32.

Age, years Frequency, 1000s


16–17 4
18–20 73
21–24 185
25–29 104
30–34 34
35–39 33
40–44 22
45–54 10
55 and over 26

Table 1.32

1.3.6 Scatter and time series plots


A scatter plot is a graph which helps us assess visually the form of relationship between
two variables. To illustrate the idea of a scatter plot consider the following problem.

Example 1.16
A manufacturing firm has designed a training programme that is supposed to increase the
productivity of employees.
The personnel manager decides to examine this claim by analysing the data results from the
first group of 20 employees that attended the course.
The results are provided in Table 1.33.

Employee number Production, X % Raise in production, Y


1 47 4.2
2 71 8.1
3 64 6.8
4 35 4.3
5 43 5.0
6 60 7.5
7 38 4.7
8 59 5.9
9 67 6.9
10 56 5.7
11 67 5.7
12 57 5.4
13 69 7.5
14 38 3.8
15 54 5.9
16 76 6.3
17 53 5.7
18 40 4.0
19 47 5.2
20 23 2.2

Table 1.33

Figure 1.76 illustrates the scatterplot. As can be seen from the scatter plot there would
seem to be some form of relationship; as production increases then there is a tendency for
the percentage raise in production to increase. The data, in fact, would indicate a positive
relationship.
We will explore this concept in more detail in Chapter 8 when discussing measuring
correlation between two data variables.

Figure 1.76 Scatter plot of % raise in production (Y) against old production (X)

Time series analysis is concerned with data collected over a period of time. It attempts
to isolate and evaluate various factors which contribute to changes over time in such vari-
able series as imports and exports, sales, unemployment, and prices. If we can evaluate
the main components that determine the value of, say, sales for a particular month then
we can project the series into the future to obtain a forecast.

Example 1.17
Consider the following time series data presented in Table 1.34 and the resulting time series
graph.

Sales of Pip Ltd 2001–2004 (tons)


Year Quarter 1 Quarter 2 Quarter 3 Quarter 4
2001 654 620 698 723
2002 756 698 748 802
2003 843 799 856 889
2004 967 876 960 976

Table 1.34 Time series data for Example 1.17

The first step in analysing the data in Table 1.34 is to create the time series plot using the
technique discussed in the previous section.
Figure 1.77 illustrates the up and down pattern with the overall sales increasing between
the beginning of 2001 and the end of 2004.
This pattern consists of an upward trend and a seasonal component that repeats
between individual quarters (Q1–Q2–Q3–Q4).
We shall explore these ideas of trend and seasonal components in Chapter 9. In the
previous two sections we looked at creating scatter plots and time series plots that may
visually provide information about the possible relationship between one measured vari-
able and another.
Figure 1.77 Time series graph for the sales data (sales value against time point)

Care needs to be taken when using graphs to infer what this relationship may be. For
example, if we modify the y-axis scale then we have a very different picture of this poten-
tial relationship.
We noted that in Example 1.16 that the percentage change in production increases as
the value of the old production increases.
We can change the vertical y-axis so that the minimum value of y-axis is 0, but the maxi-
mum value is now increased to 60.
Figure 1.78 illustrates the effect on the graph of modifying the vertical scale.

Figure 1.78 The Example 1.16 scatter plot redrawn with the y-axis maximum increased to 60

We can see that the data points are now hovering above the x-axis with the increase in
the vertical direction not as pronounced as in the first graph in Figure 1.76. If we further
increased the y-axis scale then this pattern would be diminished even further.
Furthermore, in Figure 1.77 we note that the time series plot indicated that sales are
increasing with time (upward trend) and that we have a pattern within the data that
repeats between individual quarters (Q1–Q2–Q3–Q4).
We can change the y-axis so that the minimum value y-axis value is 0, but the maxi-
mum value is now increased to 3000. We can see that the pattern in the data still shows
an upward trend, but the distinct pattern is not as pronounced as in the first graph. If we

further increased the y-axis scale then this pattern would be diminished even further, as
illustrated in Figure 1.79.

Figure 1.79 The time series graph redrawn with the y-axis maximum increased to 3000

Excel solution—scatter plots using Example 1.16 data

1 Input data series


X: Cells C3:C23.
Y: Cells D3:D23 (includes data labels).
Highlight C3:D23.
See Figure 1.80.

Figure 1.80

2 Select Insert > Scatter > choose first option (Figure 1.81).

3 Edit the chart


The chart can then be edited to improve the chart appearance, for example include:
chart title, axes titles, removal of horizontal gridlines, and removal of the legend, as
illustrated in Figure 1.82.
Figure 1.81 Default scatter chart produced by Excel

Figure 1.82 Edited scatter plot of % raise in production against old production

We can now ask Excel to fit a straight line to this data chart by clicking on a data point
on the chart: right-click on a data point and choose Add Trendline. We will look at
fitting a trend line and curves to scatter and time series charts in Chapters 8 and 9.

1.3.7 Superimposing two sets of data onto one graph


In certain circumstances we may wish to combine different data sets onto the same Excel
graph.

Example 1.18
A market researcher has collected a set of data that measures the distance travelled by eight
salesmen. The market researcher has calculated the average and standard deviation, and
requires a graph of mileage travelled against identification (ID) and the two error measure-
ments provided in Table 1.35.

Identification Mileage Average + error Average – error


1 220 239 208
2 210 239 208
3 230 239 208
4 200 239 208
5 250 239 208
6 238 239 208
7 219 239 208
8 220 239 208
Average = 223.38
Standard deviation = 15.74

Table 1.35
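
The two summary figures at the foot of Table 1.35 can be checked with Excel's built-in functions. Assuming the eight mileage values sit in cells B5:B12 (an assumed layout; the exact worksheet arrangement is shown in Figure 1.83):

Average Formula: =AVERAGE(B5:B12) returns 223.38
Standard deviation Formula: =STDEV.S(B5:B12) returns 15.74

STDEV.S is the Excel 2010 name for the sample standard deviation; the older function name STDEV returns the same value.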

Excel solution—superimposing two data sets onto one graph for the Example 1.18
data

1 Input data series


Highlight all three columns (including labels) (Figure 1.83).

Figure 1.83

Mileage: Cells B4:B12.


Average + error: Cells C4:C12.
Average – error: Cells D4:D12 (include data labels in cells B4, C4, and D4).

2 Select Insert > Column > choose first option


See Figure 1.84

3 Edit the chart


From Figure 1.84 we note that we have three bars per ID. Edit the chart to include a
chart title, axes titles, and remove horizontal grid lines to give the chart illustrated in
Figure 1.85.

4 Further modify the chart by changing the bars representing the error term (average –
error, average + error) to be horizontal dashed lines rather than vertical bars. Select
average – error bar (this will select all the average – error bars for each ID). Right-
click on average – error bar > Select Change Series Chart Type > Select Line and
click OK. Repeat for average + error bars. The final chart is illustrated in Figure 1.86.
Figure 1.84 Initial column chart with three bars per ID (Mileage, Average + error, Average – error)

Figure 1.85 Column chart after editing (chart title and axes titles added, horizontal gridlines removed)

Figure 1.86 Final chart with the error terms shown as horizontal lines rather than vertical bars

Student exercises
X1.9 Obtain a scatter plot for the data in Table 1.36 and comment on whether there is a link
between road deaths and the number of vehicles on the road. Would you expect this
to be true? Provide reasons for your answer.

Countries Vehicles per 100 population Road deaths per 100,000 population
Great Britain 31 14
Belgium 32 30
Denmark 30 23
France 46 32
Germany 30 26
Irish Republic 19 20
Italy 35 21
Netherlands 40 23
Canada 46 30
USA 57 35

Table 1.36

X1.10 Obtain a scatter plot for the data in Table 1.37 that represents the passenger miles
flown by a UK-based airline (millions of passenger miles) during the period 2003–2004.
Comment on the relationship between miles flown and quarter.

Year Quarter 1 Quarter 2 Quarter 3 Quarter 4


2003 98.9 191.0 287.4 123.2
2004 113.4 228.8 316.2 155.7

Table 1.37

■ Techniques in practice
TP1 Coco S.A. supplies a range of computer hardware and software to 2000 schools within
a large municipal region of Germany. When Coco S.A. won the contract the issue of customer
service was considered to be central to the company being successful at the final bidding stage.
The company has now requested that its customer service director creates a series of graphi-
cal representations of the data to illustrate customer satisfaction with the service. The data in
Table 1.38 has been collected over the last six months and measures the time to respond to the
received complaint (days).

(a) Form a grouped frequency table.


(b) Plot the histogram.

5 24 34 6 61 56 38 32
87 78 34 9 67 4 54 23
56 32 86 12 81 32 52 53
34 45 21 31 42 12 53 21
43 76 62 12 73 3 67 12
78 89 26 10 74 78 23 32
26 21 56 78 91 85 15 12
15 56 45 21 45 26 21 34
28 12 67 23 24 43 25 65
23 8 87 21 78 54 76 79

Table 1.38

(c) Do the results suggest that there is a great deal of variation in the time taken to respond
to customer complaints?
(d) What conclusions can you draw from these results?

TP2 Bakers Ltd run a chain of bakery shops and is famous for the quality of its pies. The
management of the company is concerned at the number of complaints from customers who
say it takes too long to serve customers at a particular branch. The motto of the company is
‘Have your pie in two minutes’. The manager of the branch concerned has been told to provide
data on the time it takes for customers to enter the shop and be served by the shop staff (see
Table 1.39).

0.70 0.70 0.70 0.70 0.70 0.70 0.70 0.70 0.70 0.70
0.99 0.99 0.99 0.99 0.99 0.99 0.99 0.99 0.99 0.99
0.12 0.12 0.12 0.12 0.12 0.12 0.12 0.12 0.12 0.12
0.70 0.70 0.70 0.70 0.70 0.70 0.70 0.70 0.70 0.70
1.88 1.88 1.88 1.88 1.88 1.88 1.88 1.88 1.88 1.88
0.55 0.55 0.55 0.55 0.55 0.55 0.55 0.55 0.55 0.55
1.38 1.38 1.38 1.38 1.38 1.38 1.38 1.38 1.38 1.38
0.80 0.80 0.80 0.80 0.80 0.80 0.80 0.80 0.80 0.80
1.25 1.25 1.25 1.25 1.25 1.25 1.25 1.25 1.25 1.25
1.48 1.48 1.48 1.48 1.48 1.48 1.48 1.48 1.48 1.48

Table 1.39

(a) Form a grouped frequency table.


(b) Plot the histogram.
(c) Do the results suggest that there is a great deal of variation in the time taken to serve
customers?
(d) What conclusions can you draw from these results?

TP3 Skodel Ltd is a small brewery that is undergoing a major expansion after a takeover
by a large European brewery chain. Skodel Ltd produces a range of beers and lagers, and is

renowned for the quality of its beers, winning a number of prizes at trade fairs throughout
the European Union. The new parent company is reviewing the quality control mechanisms
being operated by Skodel Ltd and is concerned at the quantity of lager in its premium lager
brand, which should contain a mean of 330 ml and a standard deviation of 15 ml. The bottling
plant manager provided the parent company with quantity measurements from 100 bottles for
analysis (see Table 1.40).

326 326 326 326 326 326 326 326 326 326
344 344 344 344 344 344 344 344 344 344
333 333 333 333 333 333 333 333 333 333
346 346 346 346 346 346 346 346 346 346
339 339 339 339 339 339 339 339 339 339
353 353 353 353 353 353 353 353 353 353
310 310 310 310 310 310 310 310 310 310
351 351 351 351 351 351 351 351 351 351
350 350 350 350 350 350 350 350 350 350
348 348 348 348 348 348 348 348 348 348

Table 1.40

(a) Form a grouped frequency table.


(b) Plot the histogram.
(c) Do the results suggest that there is a great deal of variation in quantity in the bottles?
(d) What conclusions can you draw from these results?

■ Summary
The methods described in this chapter are very useful for describing data using a variety of
tabulated and graphical methods. These methods allow one to make sense of data by con-
structing visual representations of numbers within the data set. Table 1.41 provides a summary
of which table/graph to construct given the data type.

Which table or chart to be applied

                                    Numerical data                        Categorical data
Tabulating data                     Frequency distribution;               Summary table
                                    cumulative frequency distribution
Graphing data                       Histogram; frequency polygon          Bar chart; pie chart
Presenting a relationship           Scatterplot; time series graph        Contingency table
between data variables

Table 1.41 Which table or chart to use?

In the next chapter we will look at summarizing data using measures of average and
dispersion.

■ Key terms
Bar chart; Categorical; Class boundaries; Class limits; Class midpoint; Classes;
Continuous; Cross tabulation; Discrete; Frequency distributions; Frequency polygon;
Graph or chart; Grouped frequency distributions; Histogram; Histogram with unequal
class intervals; Interval; Nominal; Ordinal; Ordinal variable; Pie chart; Qualitative;
Quantitative; Ratio; Raw data; Scatter plot or scattergrams; Stated limits; Statistic;
Table; Tally chart; Time series plot; True or mathematical limits; Variable

■ Further reading
Textbook resources
1. Whigham, D. (2007) Business Data Analysis using Excel. Oxford: Oxford University Press.
2. Lindsey, J. K. (2003) Introduction to Applied Statistics: A Modelling Approach (2nd edn).
Oxford: Oxford University Press.

Web resources
1. StatSoft Electronic Textbook http://www.statsoft.com/textbook/stathome.html (accessed
25 May 2012).
2. HyperStat Online Statistics Textbook http://davidmlane.com/hyperstat/index.html
(accessed 25 May 2012).
3. Eurostat—website is updated daily and provides direct access to the latest and most com-
plete statistical information available on the European Union (EU), the EU Member States, the
Euro-zone and other countries http://epp.eurostat.ec.europa.eu (accessed 25 May 2012).
4. Economagic—contains international economic data sets http://www.economagic.com
(accessed 25 May 2012).
5. The International Statistical Institute (ISI) glossary of statistical terms provides definitions
in a number of different languages http://isi.cbs.nl/glossary/index.htm (accessed 25 May 2012)
2 Data descriptors

Although tables, diagrams, and graphs provide easy-to-assimilate summaries of data they
only go part way in describing data. Often a concise numerical description is preferable
which enables us to interpret the significance of the data.
Measures of central tendency (or location) attempt to quantify what we mean when
we think of the ‘typical’ or ‘average’ value for a particular data set. The concept of central
tendency is extremely important and is encountered in daily life. For example:

• What is the average carbon dioxide emission for a particular car compared with other,
similar cars?
• What is the average starting salary for new graduates starting employment with a
large city bank?

A further measure that is employed to describe a data set is the concept of dispersion
(or spread) about this middle value.

» Overview «
In this chapter we shall look at three key statistical measures that enable us to describe a data
set:

» the central tendency is the amount by which all the data values coexist about a defined
typical value. A number of different measures of central tendency exist, including mean (or
average), median, and mode, that can be calculated for both ungrouped (individual data values
known) and grouped (data values within class intervals) data sets;

» the dispersion (or spread) is the amount that all the data values are dispersed about this
typical central tendency value. A number of different measures of dispersion exist, including
range, interquartile range, semi-interquartile range (SIQR), standard deviation, and variance
that can be calculated for both ungrouped and grouped data sets;

» the shape of the distribution is the pattern that can be observed within the data set. This
shape can be classified into whether the distribution is symmetric (or skewed) and whether or
not there is evidence that the shape is peaked. Skewness is defined as a measure of the lack of

symmetry in a distribution and kurtosis is defined as a measure of the degree of peakedness in


the distribution.
Exploratory data analysis can be used to explore data sets and provide answers to questions
involving central tendency, spread, skewness, and the presence of outliers.

» Learning objectives «
On successful completion of the module, you will be able to:

» understand the concept of an average;

» recognize that three possible averages exist (mean, mode, and median) and calculate them
using a variety of graphical and formula methods in number and frequency distribution form;

» recognize when to use different measures of average;

» understand the concept of dispersion;

» recognize that different measures of dispersion exist (range, quartile range, SIQR, standard
deviation, and variance), and calculate them using a variety of graphical and formula methods
in number and frequency distribution form;

» recognize when to use different measures of dispersion;

» understand the idea of distribution shape, and calculate a value for symmetry and
peakedness;

» apply exploratory data analysis to a data set(s);

» use Microsoft Excel to calculate data descriptors.

2.1 Measures of central tendency

When reading market reports, newspapers, or undertaking a Web search to collect infor-
mation you will read about a term called the average. The average is an idea that allows
us to visualize, or put a measure on, what is considered to be the most representative
value of the group. This value is usually placed somewhere in the middle of the group and,
as such, is the best approximation of all other values. The mean (also called an average
or arithmetic average), mode, and median are different measures of central tendency.
Section 2.1 will explore the calculation of these measures for ungrouped and grouped
data respectively.

Mean The mean is a measure of the average data value for a data set.
Mode The mode is the most frequently occurring value in a set of discrete data.
Median The median is the value halfway through the ordered data set.
Central tendency Measures the location of the middle or the centre of a distribution.
Arithmetic mean The sum of a list of numbers divided by the number of numbers.

2.1.1 Mean, median, and mode

The three most commonly used measures of central tendency (or average) are: mean (or
arithmetic mean), median, and mode. If the mean is calculated from the entire population

of the data set then the mean is called the population mean. If we sample from this popu-
lation and calculate the mean then the mean is called the sample mean.
The population and sample mean would be calculated using the same formula:
mean = sum of data values/total number of data values. For example, if KC GmbH were
interested in the mean time taken for a consultant to travel by train from Munich to
Hamburg and if we assume that KC GmbH has gathered the time (rounded to the nearest
minute) for the last five trips (645, 638, 649, 630, and 647), then the mean time would be:

Mean = (645 + 638 + 649 + 630 + 647)/5 = 3209/5 = 641.8
The mean time to travel between Munich and Hamburg was calculated to be 642 min-
utes. We can see that the mean uses all the data values in the data set (645 + 638 + . . . + 647)
and provides an acceptable average if we do not have any values that can be considered
unusually large or small.
If we added an extra value of one trip that took 1000 minutes then the new mean would
be 701.5 minutes, which cannot be considered to be representative of the other data val-
ues in the data set that range in value from 630 to 649. These extreme values are called
outliers, which have a tendency to skew the data distribution. In this case we would use a
different measure to calculate the value of central tendency.
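
A quick worksheet sketch of this effect, with the trip times typed directly into the functions for illustration:

=AVERAGE(645,638,649,630,647,1000) returns 701.5
=MEDIAN(645,638,649,630,647,1000) returns 646

The single extreme trip pulls the mean up by almost 60 minutes, whereas the median moves only from 645 to 646.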
An alternative method to calculate the average is the median. The median is literally the
‘middle’ number if you list the numbers in order of size. The median is not susceptible to
extreme values to the extent of the mean, but this method of determining the average can
be used with ordinal data and continuous data variables.
The final method for determining average is the mode. The mode is defined as the
number that occurs most frequently in the data set and can be used for both numerical
and categorical (or nominal) data variables. A major problem with the mode is that it is
possible to have more than one modal value representing the average for numerical data
variables; therefore, the mean and median are used to provide averages for numerical
data variables.
Several examples are provided to demonstrate how these are calculated.

Example 2.1
Suppose the marks obtained in the statistics examination are as illustrated in Table 2.1.

24 27 36 48 52 52 53 53 59 60 85 90 95

Table 2.1

We can describe the overall performance of these 13 students by calculating an 'average'
score using the mean, median, and mode.

Population mean The population mean is the mean value of all possible values.
Extreme value An extreme value is an unusually large or an unusually small value compared with the others in the data set.
Outlier An outlier is an observation in a data set which is far removed in value from the others in the data set.

Excel solution for Example 2.1—mean, median, and mode

Data Series input into Cells B4:B16.
Figure 2.1 illustrates the Excel solution.

Figure 2.1

➜ Excel solution
Statistics marks Cell B4:B16 Values
n = Cell E4 Formula: =COUNT(B4:B16)
Σx = Cell E5 Formula: =SUM(B4:B16)
Mean = Cell E7 Formula: =E5/E4
Mean = Cell E12 Formula: =AVERAGE(B4:B16)
Median = Cell E13 Formula: =MEDIAN(B4:B16)
Mode = Cell E14 Formula: =MODE(B4:B16)

❉ Interpretation The above values imply that, depending on what measure we use,
the average mark for this group can be 56.5% (the mean), 53% (median), or 52% (mode). The
choice of the measure will depend on the type of numbers within the data set.

The mean
In general the mean can be calculated using the formula:

Mean (X) = Sum of Data Values / Total Number of Data Values = ∑X/∑f        (2.1)

Where X (X bar) represents the mean value for the sample data, ∑X represents the sum
of all the data values, and ∑f represents the number of data values.

❉ Interpretation From Excel, the value of the mean is 56.5%.

Note For the statistics marks example,

Mean (X) = ∑X/∑f = (24 + 27 + ... + 90 + 95)/13 = 734/13 = 56.4615

We can see from the formula method (cell E7) and Excel function method (cell E12) that the
mean examination mark is 56.5%.

The median
The median is defined as the middle number when the data is arranged in order of size.
The position of the median can be calculated as follows:

Position of Percentile = (P/100) * (N + 1)        (2.2)

Where P represents the percentile value and N represents the number of numbers in
the data set.

Note A percentile is a value on a scale of 100 that indicates the percent of a


distribution that is equal to or below it. For the median, P = 50.

Consider the data from Example 2.1 (note that the data is written in order of size—rank.
If it wasn’t ranked, the data would have to be put in the correct order before this method
was used manually) as presented in Table 2.2.

24 27 36 48 52 52 53 53 59 60 85 90 95

Table 2.2

❉ Interpretation From Excel, the value of the median is 53%. We can see from this
example that the mean and median are reasonably close (56% compared with 53%), and the
distance between the lowest mark and the median (53 – 24 = 29) is less than the distance
between the largest value and the median (95 – 53 = 42). It should be noted that the median
is not influenced by the presence of very small or very large data values in the data set
(extreme values or outliers). If we have a small number of these extreme values (or outliers)
we would use the median instead of the mean to represent the measure of central tendency.
This issue will be explored in greater detail when discussing measuring skewness.

Note In this example the median was calculated to be the seventh number in the
ordered list of data values. If we created an extra value then the calculation would be a little
more complex. For example, if we had 14 numbers (N = 14) then the position of the median
would now be
Position of Median = (50/100) * (14 + 1) = 7.5th number

The position of the median would now be the 7.5th number in the data set. To help us
understand what this means we can rewrite this into a slightly different form:

Position of Median = (7th number + 8th number)/2 = (53 + 53)/2 = 53

Skewness Skewness is defined as asymmetry in the distribution of the data values.

The mode
The mode is defined as the number which occurs most frequently (the most ‘popular’
number).

❉ Interpretation From Excel, the value of the mode is 52%.

Note We can see that Excel provides only one solution for the mode (52) even though
we have two modal values in the data set (two numbers 52 and 53 occurred twice).
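
One possible workaround in Excel 2010, sketched here using the Example 2.1 layout (marks in cells B4:B16), is the array function MODE.MULT. Select two vertical cells, type =MODE.MULT(B4:B16), and press Ctrl+Shift+Enter; the two cells return both modal values, 52 and 53.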

Example 2.2
Example 1.1 consists of category data that provide a measure of proposed student voting
behaviour at a university. We can see from the frequency count that Labour was the most
popular party for the students. The Labour party would represent the mode for this data set.

2.1.2 Percentiles and quartiles


The median represents the middle value of the data set, which corresponds to the fiftieth
percentile (P = 50) or the second quartile (Q2). A data set would be ranked in order of size
and we can then use the technique described above to calculate the values that would
represent individual percentiles or quartile values.

Example 2.3
Reconsider Example 2.1 to calculate the twenty-fifth percentile and quartile values.

Excel solution for Example 2.3—percentiles and quartiles


Figure 2.2 illustrates the Excel solution.

Figure 2.2

➜ Excel solution
25th percentile = Cell E15 Formula: =PERCENTILE.INC(B4:B16,0.25)
First quartile = Cell E16 Formula: =QUARTILE.INC(B4:B16,1)
Second quartile = Cell E17 Formula: =QUARTILE.INC(B4:B16,2)
Third quartile = Cell E18 Formula: =QUARTILE.INC(B4:B16,3)

❉ Interpretation From Excel, the twenty-fifth percentile = 48, first quartile = 48


(agrees with twenty-fifth percentile), second quartile = 53 (agrees with Excel Median function
method), and third quartile = 60.

Note The calculation process to calculate percentiles and quartiles is as follows for the
first, second, and third quartiles.
First Quartile, Q1
The first quartile corresponds to the twenty-fifth percentile and the position of this value
within the ordered data set is given by equation 2.2.

Position of Twenty-Fifth Percentile = (25/100) * (13 + 1) = (1/4) * (14) = 3.5th number

We therefore take the twenty-fifth percentile to be the number half the distance between the
fourth and third numbers. To solve this problem we use linear interpolation between the two
nearest ranks: Position of Twenty-Fifth Percentile = third number + 0.5*(fourth number – third
number). From the list of ordered data values the third number = 36 and the fourth number = 48.

Q1 = 36 + 0.5*(48 − 36) = 36 + 0.5*(12) = 36 + 6 = 42

The first quartile statistics examination mark is 42%.


Third Quartile, Q3
The third quartile corresponds to the seventy-fifth percentile and the position of this value
within the ordered data set is given by equation (2.2).

Position of Seventy-Fifth Percentile = (75/100) * (13 + 1) = (3/4) * (14) = 10.5th number

We therefore take the seventy-fifth percentile to be the number half the distance between
the tenth and eleventh numbers. To solve this problem we use linear interpolation between the
two nearest ranks: Position of Seventy-Fifth Percentile = tenth number + 0.5*(eleventh number
– tenth number). From the list of ordered data values the tenth number = 60 and the eleventh
number = 85.

Q3 = 60 + 0.5*(85 − 60) = 60 + 0.5*(25) = 60 + 12.5 = 72.5

The third quartile statistics examination mark is 73.

Quartiles Quartiles are values that divide a sample of data into four groups containing an equal number of observations.
Q1 Q1 is the lower quartile and is the data value a quarter way up through the ordered data set.
Q3 Q3 is the upper quartile and is the data value a quarter way down through the ordered data set.

Note Using Excel, the first quartile value is 48 and the third quartile value is 60, whereas the manual method provides a first quartile value of 42 and a third quartile value of 73. Unlike the median, which has a standard calculation method, there is no single standard for the calculation of the quartiles. Several definitions of the quartile exist, each resulting in a different calculation procedure. The method used by Excel is method 1 in Freund, J. and Perles, B. (1987) ‘A New Look at Quartiles of Ungrouped Data’, The American Statistician, 41 (3), 200–3.
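As a check on the Excel figures, the inclusive method used by PERCENTILE.INC and QUARTILE.INC locates the pth percentile at rank r = 1 + (n − 1) × p in the ordered data. For the n = 13 ordered marks this gives:

Rank of Q1 = 1 + (13 − 1) × 0.25 = 4, so Q1 = 4th value = 48
Rank of Q2 = 1 + (13 − 1) × 0.50 = 7, so Q2 = 7th value = 53
Rank of Q3 = 1 + (13 − 1) × 0.75 = 10, so Q3 = 10th value = 60

Because each rank happens to be a whole number here, no interpolation is needed and the values agree exactly with the QUARTILE.INC results above.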

Excel function solution for Examples 2.1 and 2.3


In the screenshot solutions presented in Figures 2.1 and 2.2 we can either type the Excel function directly (for example, in cell E12 we could type =AVERAGE(B4:B16)) or, if we cannot remember the name of the function, we can employ the Select Formulas > Select Insert Function method to select the required function and insert it into the required cell on the worksheet.

Note A complete list of Excel statistical functions is provided online.

Example 2.4
To illustrate the use of the Select Formulas > Select Insert Function method consider the prob-
lem of calculating the mean value in Example 2.1. In Figures 2.1 and 2.2 the mean value is
located in cell E12. To insert the correct Excel function into cell E12 we would click on cell E12
and then Select Formulas > Select Insert Function as illustrated in Figures 2.3 and 2.4.

Figure 2.3

Select Formulas > Select Insert Function (see Figure 2.4).


At this stage you can search for a function, choose a function from the most recently used
functions (default choice, see Figure 2.4), or select a category from the ‘Or select a category’
drop down menu. The type of functions we will be using in the textbook are statistical func-
tions. Therefore, choose Statistical, as illustrated in Figure 2.5.

Figure 2.4

Figure 2.5

At this final stage we can scroll down the list of the functions until we find the function we
require. To help the user Excel provides appropriate information on each function. We shall
choose the AVERAGE function and click OK. The Excel function average would then be placed
in cell E12. To familiarize yourself with the functions browse down the list to check on what
functions are available and their function names—see online for a complete list of all Excel
functions. To help data analysts, Excel also provides the Data Analysis tool to calculate descriptive statistics for a set of numbers; for example, use the Descriptive Statistics menu option to calculate the descriptive statistics described in this chapter.

Student exercises
X2.1 In 12 consecutive innings a batsman’s scores were: 6, 13, 16, 45, 93, 0, 62, 87, 136, 25,
14, and 31. Find his mean score and the median.
X2.2 The following are the IQs of 12 people: 115, 89, 94, 107, 98, 87, 99, 120, 100, 94, 100,
99. It is claimed that ‘the average person in the group has an IQ of over 100’. Is this a
reasonable assertion?

X2.3 A sample of six components was tested to destruction to establish how long they
would last. The times to failure (in hours) during testing were 40, 44, 55, 55, 64,
and 69. Which would be the most appropriate average to describe the life of these
components? What are the consequences of your choice?
X2.4 Find the mean, median, and mode of the following set of data: 1, 1, 1, 1, 1, 2, 2, 2, 2, 2,
2, 3, 3, 3, 3, 3, 4, 4, 4, 4, 5, 5, 5, 5, 5.
X2.5 The average salary paid to graduates in three companies is: £7000, £6000, and £9000
per annum respectively. If the respective number of graduates in these companies is 5,
12, and 3, find the mean salary paid to the 20 graduates.

2.1.3 Averages from frequency distributions


In this section we shall extend the calculation of data descriptors for a set of numbers to
the situation where we are dealing with frequency distributions. A frequency distribution
is a simple data table that shows how many times entities, or frequencies, fall into every
category. When data has been arranged into a frequency distribution, a number of meth-
ods are available to obtain averages: mean, median, and mode.

Averages from frequency distributions where X is known


Knowing the value of X implies that the data values are predefined and given as a single
number (not as a member of a group), as in Example 2.5.

Example 2.5
The distribution of insurance claims processed each day is presented in Table 2.3.

Claims (X) 1 2 3 4 5 6 7 8 9 10
Frequency (f) 3 4 4 5 5 7 5 3 3 1

Table 2.3

Calculate the mean, mode, and median.

Excel solution for Example 2.5—the mean and mode


Figure 2.6 illustrates the Excel solution.

Figure 2.6

➜ Excel solution
X Cells B4:B13 Values
f Cells C4:C13 Values
fX Cell D4 Formula: =C4*B4
Copy formula down D4:D13
Σf =Cell C17 Formula: =SUM(C4:C13)
ΣfX =Cell C18 Formula: =SUM(D4:D13)
Mean =Cell C19 Formula: =C18/C17
Mean =Cell C23 Formula: =SUMPRODUCT (B4:B13,C4:C13)/SUM(C4:C13)
Mode = Cell C24 Value

❉ Interpretation From Excel, the mean is five claims per day and the mode is six claims
per day.

Note According to Table 2.3, a number of claims corresponding to ‘one’ occurs three times, which will contribute three to the total, ‘two’ claims occur four times contributing eight to the sum, and so on. This can be written as follows:

Mean(X) = [(3×1) + (4×2) + … + (1×10)] / (3 + 4 + 4 + 5 + 5 + 7 + 5 + 3 + 3 + 1) = 206/40 = 5.15

As already pointed out, as we are dealing with discrete data we would indicate a mean of approximately five claims. Equation (2.3) can now be used to calculate the mean for a frequency distribution data set:

X̄ = ΣfX / Σf (2.3)

The following indicates the setting out of the calculation for finding X̄ using the data set in Table 2.3. Table 2.4 presents the frequency distribution for Example 2.5.

X̄ = ΣfX / Σf = 206/40 = 5.15 claims per day

Clearly, the number corresponds to the one obtained using the Excel Function method,
as expected.

Claims (X) Frequency (f) fX


1 3 3
2 4 8
3 4 12
4 5 20
5 5 25
6 7 42
7 5 35
8 3 24
9 3 27
10 1 10
Σf = 40 ΣfX = 206

Table 2.4

The mode
As the mode is the most frequently occurring score it can be determined directly from a
frequency distribution or a histogram. If we consider the distribution given in Example
2.5, the most frequently occurring score is six; it has the highest frequency of seven.
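Rather than typing the modal value in by hand (cell C24 in Figure 2.6), the mode can be looked up directly from the frequency table. The following single-cell formula is a sketch based on the Figure 2.6 layout (X in cells B4:B13, f in cells C4:C13):

=INDEX(B4:B13, MATCH(MAX(C4:C13), C4:C13, 0))

MAX finds the largest frequency (7), MATCH locates its position in the frequency column, and INDEX returns the corresponding value of X (6 claims). Note that if two values of X tie for the largest frequency, this formula returns only the first.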

The median and cumulative frequencies


Finding the mean and modal class from a frequency distribution should now cause little
difficulty. However, finding the median involves some further calculations. To this end
the cumulative frequency distribution is now introduced. If we consider the distribution
given in Example 2.5, the median of the 40 values is given by the (40 + 1)/2 value (or 20.5th
number). To find out which is the 20.5th number we would list all the scores in order of
size and create a cumulative frequency distribution.

Example 2.6
Reconsider the Example 2.5 data set.

Figure 2.7 illustrates the Excel solution to calculate the cumulative frequency table.

Figure 2.7

Cumulative frequency distribution: The cumulative frequency for a value x is the total number of scores that are less than or equal to x.

➜ Excel solution
X Cells B4:B13 Values
f Cells C4:C13 Values
≤Xcf Cells D4:D14 Values
CF Cell E4 Value
Cell E5 Formula: =C4
Cell E6 Formula: =E5+C5
Copy formula down E6:E14

Figure 2.8 illustrates the cumulative frequency curve and its use to find the median.

Figure 2.8 Cumulative frequency curve (or ogive), plotting cumulative frequency (CF) against the class boundary Xcf; reading across from the 20.5th number gives a median of 5.

❉ Interpretation From the cumulative frequency curve, the median is five claims per day.

Table 2.5 represents the cumulative frequency distribution for the data presented in
Example 2.5.
The median value of the above distribution lies at the 20.5th item, which corresponds to 5 claims. We know this because 21 items lie below the class boundary 5.5.

Claims (X) Frequency (f) Mathematical limit Upper limit Cumulative frequency (CF)
1 3 0.5–1.5 1.5 3
2 4 1.5–2.5 2.5 7
3 4 2.5–3.5 3.5 11
4 5 3.5–4.5 4.5 16
5 5 4.5–5.5 5.5 21
6 7 5.5–6.5 6.5 28
7 5 6.5–7.5 7.5 33
8 3 7.5–8.5 8.5 36
9 3 8.5–9.5 9.5 39
10 1 9.5–10.5 10.5 40

Table 2.5
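The median can also be picked out of the cumulative frequency column without drawing the ogive. The sketch below assumes the Figure 2.7 layout, with the claims values in cells B4:B13 and the running cumulative frequencies in cells E5:E14:

=INDEX(B4:B13, MATCH(20.5, E5:E14, 1) + 1)

MATCH with its third argument set to 1 finds the last cumulative frequency still below the 20.5th item (here 16, in the fourth row); adding 1 moves to the first class whose cumulative frequency reaches 20.5, and INDEX returns that value of X (5 claims).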

Averages from frequency distributions where X is class mid-point


Unlike the previous section where X was given in advance, certain frequencies can apply
to a number of classes or a range of values.

Example 2.7
Consider the distribution of miles travelled by salesmen. The layout is very similar to Example
2.5, except the mid-point value is shown for ‘X’.
Table 2.6 represents the grouped frequency distribution for Example 2.7.

Mileage Frequency (f)


400–419 12
420–439 27
440–459 34
460–479 24
480–499 15
500–519 8

Table 2.6

Excel solution for Example 2.7


Figure 2.9 illustrates the Excel solution.

Figure 2.9
LCB, lower class boundary; UCB, upper class boundary.

➜ Excel solution
Mileage Cells A4:A9 Values
LCB Cells B4:B9 Values
UCB Cells C4:C9 Values
Mid-point x Cell D4 Formula: =(B4+C4)/2
Copy down formula D4:D9

Frequency, f Cell E4:E9 Values


fx Cell F4 Formula: =E4*D4
Copy down formula F4:F9
CF Cell G4 =E4
Cell G5 =G4 + E5
Copy down formula G5:G9
Mean
Σ f = Cell B12 Formula: =SUM(E4:E9)
Σ fx = Cell B13 Formula: =SUM(F4:F9)
mean = Cell F12 Formula: =B13/B12
mean = Cell F13 Formula: =SUMPRODUCT(D4:D9,E4:E9)/SUM(E4:E9)
Median
N = Cell B16 Formula: =SUM(E4:E9)
Position median = Cell B17 Formula: =(B16 + 1)/2
Median class is 440−459
L = Cell B19 Formula: =B6
C = Cell B20 Formula: =C6−B6
F = Cell B21 Formula: =G5
f = Cell B22 Formula: =E6
Median = Cell B23 Formula: =B19+B20*(B17−B21)/B22
Mode
Class width c = Cell F16 Formula: =C6−B6
Modal Class 440-459
L = Cell F19 Formula: =B6
f1 = Cell F20 Formula: =E6
f0 = Cell F21 Formula: =E5
f2 = Cell F22 Formula: =E7
Mode = Cell F23 Formula: =F19 + (F20−F21)*F16/(2*F20−F21−F22)

❉ Interpretation From Excel, the average number of miles travelled are mean = 454,
mode = 448, and median = 452.

The mean
The explanation of the Excel solution is as follows.

Note
Table 2.7 represents the grouped frequency distribution for Example 2.7.
The Mean is calculated using equation (2.3):

Mean (X̄) = ΣfX / Σf = 54480/120 = 454 miles

Mileage Class Mid-Point (X) Frequency (f) fX


400–419 409.5 12 4914
420–439 429.5 27 11,596.5
440–459 449.5 34 15,283
460–479 469.5 24 11,268
480–499 489.5 15 7342.5
500–519 509.5 8 4076
Σf = 120 ΣfX = 54,480

Table 2.7

1. The value of X is the class mid-point which is computed using the true limits of each class.
This assumes that the data values within each class vary uniformly between the lowest and
highest data values within the class.
2. Even if a distribution has unequal class widths the same procedure is followed.
3. This form of layout transforms very easily to a spreadsheet.

The mode
As the mode is the most frequently occurring score it can be determined directly from a
frequency distribution or a histogram. If we consider the distribution given in Example
2.7, we can see that the most frequently occurring class is the class 440–459 miles. This
class is known as the modal class. If we look at the histogram associated with this example
the mode is very apparent: it is the class with the highest rectangle. We can estimate the
mode using a formula or graphical method:

(i) Formula method


Having established which is the class interval with the highest frequency (the modal
class), the mode can now be estimated using equation (2.4):

Mode = L + (f1 − f0) × c / (2f1 − f0 − f2) (2.4)

Where: L = lower class boundary of the modal class, f0 = frequency of the class below
the modal class, f1 = frequency of the modal class, f2 = frequency of the class above the
modal class, and c = modal class width.

Note Please note that this formula only works if the modal class and the two adjacent
classes are of equal width. Therefore, using Example 2.7 we have L = 439.5, f0 = 27, f1 = 34,
f2 = 24 and c = 20.

Mode = 439.5 + [(34 − 27) / (2×34 − 27 − 24)] × 20 = 448 miles to the nearest mile

The modal distance travelled is 448 miles.



(ii) Graphical method


We can estimate the mode graphically by constructing the histogram. For Example 2.7
the data set class widths are of the same size (constant) and therefore the height of the
column will represent the frequency of occurrence (f ). We can see from the histogram
that class 440–459 is the modal class and the mode will lie within this class. The frequen-
cies of the two adjacent classes (420–439, 460–479) can be used to estimate the value
of the mode: (a) construct the two crossed diagonals, (b) drop a perpendicular from
where these two lines meet to the horizontal axis, and (c) read from the horizontal axis
the value estimate for the mode. Figure 2.10 illustrates the graphical solution to estimate
the modal value.

Figure 2.10 Histogram of miles travelled, with frequency on the vertical axis and the mileage classes (400–419 to 500–519) on the horizontal axis; the highest class frequency identifies the modal class 440–459, and the crossed diagonals give an estimate of the mode of 448.

The modal value is estimated from the histogram to be approximately 448 miles travelled.

The median
Just as with the cases where X is known, finding the median from a frequency distribu-
tion where X is the class mid-point involves some further calculations. The cumulative
frequency distribution and cumulative frequency polygon (or ogive) are used. If we con-
sider the distribution given in Example 2.7, the median of the 120 values is given by the
(120 + 1)/2th value (or 60.5th value) and this value lies in the class (440–459) miles.

Cumulative frequency distribution


Mileage Frequency Upper class limit (UCL, XCF) Cumulative frequency (CF)
400–419 12 <419.5 12
420–439 27 <439.5 39
440–459 34 <459.5 73
460–479 24 <479.5 97
480–499 15 <499.5 112
500–519 8 <519.5 120

Table 2.8 Cumulative frequency distribution for Example 2.7

An estimate of the value of the median within that class can be determined either by
calculation or by using a graphical method:
(i) Formula method
Equation (2.5) can be used to estimate the median:

Median = L + C × [(N + 1)/2 − F] / f (2.5)

Where: L = true lower class boundary of the median class, C = median class width,
F = cumulative frequency before the median class, f = frequency within the median class,
and N = total frequency.

Position of Median = 60.5th item

Median Class = 440 – 459

From Table 2.8: L = 439.5, C = 459.5 − 439.5 = 20, F = 39, f = 34, and N = 120.

Median = 439.5 + 20 × [(120 + 1)/2 − 39] / 34 = 452 miles
The median number of miles travelled is 452 miles. This is quite close to the value obtained
for the mean (454 miles) and we would expect this given that the histogram for miles trav-
elled looks quite symmetrical. The concept of symmetry and a measure of how symmetri-
cal a distribution is will be explored when discussing the concept of skewness (see Section
2.2.5).

(ii) Graphical method


Figure 2.11 represents a cumulative frequency curve (or ogive) for the miles travelled
example.
We can use this curve to provide an estimate of the median. The median is then the
value which corresponds to half of the total frequency.
Therefore, the Position of the Median = (50/100) × (120 + 1) ≈ 60th number. We can now use the cumulative frequency curve to estimate the value of the median. From the graph the median is approximately 452 miles.

Figure 2.11 Cumulative frequency curve (or ogive) for miles travelled by salesmen, plotting cumulative frequency (cf) against Xcf (399.5 to 519.5); reading across from the 60th number and down to the horizontal axis gives the median value.

Percentiles and quartiles

Individual percentiles can be estimated using the following two methods: (i) the formula method and (ii) the graphical method.

Symmetrical: A data set is symmetrical when the data values are distributed in the same way above and below the middle value.

(i) Formula method


Equation (2.6) can be used to estimate the value at a particular percentile:

Percentile Value P = L + C × [(N + 1) × P/100 − F] / f (2.6)

Where: L = true lower class boundary of the percentile class, C = percentile class width,
F = cumulative frequency before the percentile class, f = frequency within the percentile
class, and N = total frequency.
For example, if we want to calculate the tenth percentile, then P = 10.

Position of Tenth Percentile = (10/100) × (120 + 1) ≈ 12th number

Therefore, the tenth percentile is in the (400–419) class. From Table 2.8: L = 399.5, C = 419.5 − 399.5 = 20, F = 0, f = 12, and N = 120.

Tenth Percentile = 399.5 + (20/12) × [(120 + 1) × 10/100 − 0] ≈ 420

The tenth percentile number of miles travelled is approximately 420 miles.

(ii) Graphical method


In the previous example we have used the cumulative frequency curve (or ogive) to find
the median. Other statistics can also be obtained from it and revolve around the idea of
percentiles. As the name suggests the tenth percentile for the example given in the previ-
ous paragraph will occur at P = 10.

Position of Tenth Percentile = (10/100) × (120 + 1) ≈ 12th number

Therefore, the tenth percentile is in the (400–419) class. The estimated value can, like
the median (the median is the fiftieth percentile), be read off the cumulative frequency
curve (or ogive).

❉ Interpretation Ten per cent of all the data in this data set have the value equal to or
below 420 miles.

Two important percentiles are the twenty-fifth and seventy-fifth percentiles. These are
known as the lower quartile (LQ) and the upper quartile (UQ) respectively.

Position of First Quartile = (25/100) × (120 + 1) ≈ 30th number

Position of Third Quartile = (75/100) × (120 + 1) ≈ 90th number

Figure 2.12 illustrates the first and third quartile positions on the cumulative frequency
curve. We observe that Q1 and Q3 are approximately 433 and 474 respectively.
Figure 2.12 Cumulative frequency curve for miles travelled by salesmen, with the 30th and 90th numbers marked; reading down to the horizontal axis gives the Q1 and Q3 values.

❉ Interpretation Twenty-five per cent of all the values in the data set are equal to or below approximately 433 miles, while 75% are equal to or below approximately 474 miles.

2.1.4 Weighted averages


In the previous examples we have calculated the mean using equation (2.1), which
assumes that each value of X is of equal importance to all other values of X. In many cases
we are faced with a situation where this is not true, for example module grades are often
computed using a weighted average as a different weighting is applied to different assess-
ments. However, from the calculation point of view, this is no different to the method of
calculating averages from frequency distributions. The weighted average is calculated using equation (2.7) (which is, in effect, identical to equation (2.3)).

X̄ = (w1X1 + w2X2 + w3X3) / (w1 + w2 + w3) = ΣwX / Σw (2.7)

Where w is the level of importance placed on each assessment element and X is the
actual mark associated with this weight. This can be laid out in a table format to aid the
calculation process.

Example 2.8
Suppose that Karen’s statistics module is assessed via a series of assessments (multiple choice
questions—mcq, in-course assignment—ica, end assignment—ea) with a weighting of 20%, 30%,
and 50% respectively. The actual marks awarded were 74, 66, and 88. Calculate the weighted
average. Figure 2.13 illustrates the Excel solution.

Figure 2.13

➜ Excel solution
mcq, w Cell C6 Value
mcq, X Cell D6 Value
ica, w Cell C7 Value
ica, X Cell D7 Value
ea, w Cell C8 Value
ea, X Cell D8 Value
wX Cell E6 Formula: =C6*D6
Copy formula down E6:E8
Total Cell E9 Formula: =SUM(E6:E8)
weighted average = Cell E11 Formula: =SUMPRODUCT (C6:C8, D6:D8)

❉ Interpretation Karen’s weighted average and therefore module grade would be 78.6%.

Note
1. If all the weights are equal, then the weighted mean is the same as the arithmetic mean.
2. As emphasized before, Excel does not contain a built-in function to calculate a weighted
average. Again, the SUMPRODUCT () function is used.
3. If the weights are given in percentages then the formula would be modified to SUMPRODUCT
(C6:C8, D6:D8)/SUM (C6:C8).
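As a check on the SUMPRODUCT result, the weighted average can be written out in full using equation (2.7), with the weights expressed as decimals:

X̄ = (0.20 × 74 + 0.30 × 66 + 0.50 × 88) / (0.20 + 0.30 + 0.50) = (14.8 + 19.8 + 44.0) / 1.00 = 78.6%

Because the weights already sum to one, dividing by Σw leaves the SUMPRODUCT value unchanged; with percentage weights (20, 30, 50) the division by SUM(C6:C8) = 100 performs the same scaling.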

Student exercises
X2.6 Cameos Ltd is employed by a leading market research organization based in Berlin.
The company is discussing with the firm whether to expand the catering facilities
provided to its employees to include a greater range of products. The initial research by
Cameos has identified the following set of weekly spend (€) by individual employees
(Table 2.9).

22 16 26 33 33 37 9 23 32 17
20 13 12 18 19 10 21 22 25 22
22 22 34 24 23 21 38 31 41 20

Table 2.9

(a) Plot the histogram and visually comment on the shape of the weekly expenditure.
Hint: use class width of 5.
(b) Calculate the values of the mean and median.
(c) Use descriptive statistics in conjunction with the histogram to comment on weekly
expenditure.
X2.7 Form a frequency distribution of the following data given in Table 2.10 with intervals
centred at 10, 15, 20, 25, 30, 35, and 40, and estimate the mean value.

9 26 33 24 41 24 37 39 30 28
24 42 17 26 18 33 40 28 31 20
32 21 39 25 16 17 26 11 30 28
34 24 19 23 27 18 32 21 40

Table 2.10

X2.8 The frequency distribution of the length of a sample of 98 nails is presented in Table
2.11 (measured to the nearest 0.1 mm).
(a) Find the mean length of this sample by hand and by using a spreadsheet.
(b) Construct the cumulative frequency graph and use this to estimate the median.
(c) Check the value of the median using the formula method.

Length Frequency
4.0–4.2 4
4.3–4.5 9
4.6–4.8 13
4.9–5.1 20
5.2–5.4 34
5.5–5.7 18

Table 2.11

X2.9 The distribution of marks of 400 candidates in an A-level examination is presented in


Table 2.12.
(a) Calculate the mean value.
(b) Construct the cumulative frequency curve and estimate the median, first, and third
quartile values.

Marks Frequency, f
0–10 6
11–20 15
21–30 31
31–40 80
41–50 93
51–60 69
61–70 54
71–80 33
81–90 12
91–100 7

Table 2.12

2.2 Measures of dispersion


In Section 2.1 we looked at the concept of central tendency that provides a measure of the
middle value of a set of data values, including: mean, median, and mode. This, however,
only gives a partial description. A fuller description can be obtained by also obtaining a
measure of the spread of the distribution. This kind of measure indicates whether the val-
ues in the distribution group are closely located about an average value or whether they
are more dispersed. These measures of dispersion are particularly important when we
wish to compare distributions.
To illustrate this consider the two hypothetical distributions in Figure 2.14 which meas-
ure the value of sales per week made by two salesmen in their respective sales areas.
Let us say the means of the two distributions, A and B, were 4000 and 5000 respectively.
But as you can see their shapes are very different, with B being far more spread out.
What would you infer from the two distributions given about the two salesmen and
the areas that they work in? We can see that both distributions, A and B, have different
mean values with distribution B being more spread out (or dispersed) than distribution A.

Dispersion: The variation between data values is called dispersion.

Figure 2.14 Two frequency distributions, A and B (with means X̄A and X̄B), where distribution B is more spread out than distribution A.

Furthermore, distribution B is taller than distribution A. In this section we shall explore methods that can be used to put a number to this idea of dispersion. The methods we will explore include: range, interquartile range, SIQR, variance, standard deviation, coefficient of variation, skewness, and kurtosis.

Example 2.9
Reconsider Example 2.1 and calculate measures of dispersion (or spread) of the statistics
examination marks presented in Table 2.13.

24 27 36 48 52 52 53 53 59 60 85 90 95

Table 2.13

Excel solution to Example 2.9


Figure 2.15 illustrates the Excel solution.

Figure 2.15

➜ Excel solution
Statistics marks Cell B4:B16 Values
X^2 Cell C4 Formula: =B4^2
Copy formula down C4:C16
n = Cell F4 Formula: =COUNT(B4:B16)
Σx = Cell F5 Formula: =SUM(B4:B16)
ΣX^2 = Cell F6 Formula: =SUM(C4:C16)
mean = Cell F7 Formula: =F5/F4
variance = Cell F8 Formula: =F6/F4−F7^2
standard deviation = Cell F9 Formula: =F8^0.5
Range = Cell F13 Formula: =MAX(B4:B16)−MIN(B4:B16)
Q1 = Cell F14 Formula: =QUARTILE.INC(B4:B16, 1)
Median = Cell F15 Formula: =MEDIAN(B4:B16)
Q3 = Cell F16 Formula: =QUARTILE.INC(B4:B16, 3)
QR = Cell F17 Formula: =F16−F14
SIQR = Cell F18 Formula: =(F16−F14)/2
Mean = Cell F19 Formula: =AVERAGE(B4:B16)
varp = Cell F20 Formula: =VAR.P(B4:B16)
std = Cell F21 Formula: =STDEV.P(B4:B16)

Range: The range of a data set is a measure of the dispersion of the observations.
Interquartile range: The interquartile range is a measure of the spread of, or dispersion within, a data set.
Variance: Measure of the dispersion of the observations.
Standard deviation: Measure of the dispersion of the observations (the square root of the variance).
Coefficient of variation: The coefficient of variation measures the spread of a set of data as a proportion of its mean.
Kurtosis: Kurtosis is a measure of the ‘peakedness’ of the distribution.

2.2.1 The range


The range is the simplest measure of distribution and indicates the ‘length’ a distribution
covers. It is determined by finding the difference between the lowest and highest value
in a distribution. A formula for calculating the range, depending on the type of data, is
defined by equations (2.8) or (2.9).

RANGE (ungrouped data) = Highest Extreme Value − Lowest Extreme Value (2.8)

RANGE (grouped data) = UCB Highest Class − LCB Lowest Class (2.9)

Where UCB represents the upper class boundary and LCB represents the lower class
boundary. Thus, for the statistics examination example (Example 2.1): ungrouped data,
lowest value = 24 and highest value = 95, and range = 95 – 24 = 71 marks.

❉ Interpretation From Excel, the range for the statistics examination marks implies that
the achieved marks are scattered over 71 marks between the highest and the lowest mark.

If we have data that is in the form of a grouped frequency distribution then we would
use the upper and lower class boundaries of the largest and smallest class values to calcu-
late the range. Thus, for the miles travelled by salesmen example (Example 2.7): grouped
data, LCB = 399.5, UCB = 519.5. Thus, range = 519.5 − 399.5 = 120 miles.

2.2.2 The interquartile range and semi-interquartile range (SIQR)
The interquartile range represents the difference between the third and first quartile and
can be used to provide a measure of spread within a data set which includes extreme data
values. The interquartile range is little affected by extreme data values in the data set and
is considered to be a good measure of spread for skewed distributions. The interquartile
range is defined by equation (2.10).

Interquartile range = Q3 − Q1 (2.10)

The SIQR is defined by equation (2.11).

SIQR = (Q3 − Q1) / 2 (2.11)
The SIQR is another measure of spread and is computed as one half of the interquartile
range which contains half of the data values. For Example 2.9, the first quartile value is

48% and the third quartile value is 60%. The manual method provides a first quartile value
of 42% and third quartile value of 73%.

❉ Interpretation From Excel, the interquartile range is equal to 12 marks. The


interquartile range is a measure of variation that ignores the extremes and focuses on the
middle 50% of the data, i.e. only the data between the third (Q3) and first quartile (Q1).

The interquartile and SIQRs are more stable than the range because they focus on the
middle half of the data values and, therefore, can’t be influenced by extreme values. The
SIQR is used in conjunction with the median to describe a highly skewed distribution or an ordinal data set. The interquartile range (and SIQR) are more influenced by sampling
fluctuations in normal distributions than is the standard deviation, and therefore are not
often used for data that are approximately normally distributed. Furthermore, the actual
data values aren’t used and we will now look at a method that provides a measure of
spread but uses all the data values within the calculation.

2.2.3 The standard deviation and variance


Standard deviation is the measure of spread most commonly used in statistics when the
mean is used to calculate central tendency. The variance and standard deviation provide
a measure of how dispersed the data values (X) are about the mean value ( X ). Because
of its close links with the mean, the standard deviation can be greatly affected if the mean
gives a poor measure of central tendency.
If we calculated for each data value (X − X̄), then some would be positive and some negative. Thus, if we were to sum all these differences then we would find that Σ(X − X̄) = 0, i.e. the positive and negative values would cancel out. To avoid this problem we would square each individual difference before undertaking the summation.
This would provide us with the squared average difference which is known as the vari-
ance (VAR(X)), as defined by equation (2.12).

Variance, VAR(X) = Σ(X − X̄)² / Σf (2.12)

To provide us with an average difference we take the square root of the variance to give
the standard deviation (SD(X)), as defined by equation (2.13).

Standard Deviation, SD(X) = √[Σ(X − X̄)² / Σf] (2.13)

By algebraic manipulation we can simplify equations (2.12) and (2.13) to equations (2.14) and (2.15).

Variance, VAR(X) = ΣX²/Σf − (X̄)² (2.14)

Variation: Variation is a measure that describes how spread out or scattered a set of data is.

Standard Deviation, SD(X) = √[ΣX²/Σf − (X̄)²] (2.15)
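The simplification from equation (2.12) to equation (2.14) follows by expanding the square (writing n = Σf for the number of observations):

Σ(X − X̄)² = ΣX² − 2X̄ΣX + nX̄² = ΣX² − 2X̄(nX̄) + nX̄² = ΣX² − n(X̄)²

Dividing through by n = Σf gives VAR(X) = ΣX²/Σf − (X̄)², which is equation (2.14); taking the square root gives equation (2.15).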

Variance describes how much the data values are scattered around the mean value or,
to put it differently, how tightly are the data values grouped around the mean. In a way,
the smaller the variance, the more representative the mean value is. Unfortunately, the
variance does not have the same dimension as the data set or the mean. In other words,
if the values are percentages, inches, degrees C, or any other measure, the variance is not
expressed in the same values because it is expressed in squared units.
As such, it is very useful as a comparison measure between the two data sets, as we will
discover later. To bring the variance into the same units of measure as the data set, the
standard deviation needs to be calculated. Although the standard deviation is less sus-
ceptible to extreme values than the range, standard deviation is still more sensitive than
the SIQR. If the possibility of outliers presents itself, then the standard deviation should
be supplemented by the SIQR.

❉ Interpretation From Excel, the variance equals 455.3 (marks2) with a standard
deviation equal to 21.3 marks.

Note Use the Excel Formulas > Insert Function method if you are not sure of the name of the function.
The mean and variance can be calculated for the Example 2.9 data set using equations (2.1) and (2.14) respectively. From Table 2.14 we can show that ΣX = 734 and ΣX² = 47,362.

X X2
24 576
27 729
36 1296
48 2304
52 2704
52 2704
53 2809
53 2809
59 3481
60 3600
85 7225
90 8100
95 9025
ΣX = 734 ΣX² = 47,362

Table 2.14

Mean (X̄) = ΣX/Σf = (24 + 27 + … + 90 + 95)/13 = 734/13 = 56.4615

Variance, VAR(X) = ΣX²/Σf − (X̄)² = 47362/13 − (56.4615)² = 455.3254438

Standard Deviation, SD(X) = √VAR(X) = 21.33835616

From the calculations we can now summarize the results: mean = 56.5 marks, standard
deviation = 21.3 marks, median = 53 marks, Q3 = 73 marks, Q1 = 42 marks, SIQR = 15.5 marks.

❉ Interpretation
A large proportion of the marks obtained in the statistical examination, as per Example 2.9, are
clustered within 21.3 marks around the mean mark of 56.4. We will explain later how large this
proportion is. Most of the marks are between 35.1 (56.4−21.3) and 77.7 (56.4 + 21.3). Eight out
of 13 marks are in this interval, which is 61% of all the marks.

Note It is very important to note that Excel contains two different functions (VAR.S (),
VAR.P ()) to calculate the value of the variance. The function that you use is dependent upon
whether the data set represents the complete population or is a sample from the population
being measured.

1. If the data set is the complete population then the population variance (σ2) is given by the
Excel function VAR.P ().
2. If the data set is a sample from the population then the sample variance (S2) is given by the
Excel function VAR.S ().

These issues will be explored in greater detail in Chapter 5 when discussing the issue of sam-
pling from populations and estimating population values from the sample data.
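For the Example 2.9 marks (cells B4:B16) the two functions return different values, and the sketch below illustrates the relationship between them:

=VAR.P(B4:B16) returns 455.33 (population variance, dividing by n = 13)
=VAR.S(B4:B16) returns 493.27 (sample variance, dividing by n − 1 = 12)

Note that 493.27 ≈ 455.33 × 13/12, confirming that the sample variance equals [n/(n − 1)] times the population variance. The matching standard deviation functions are STDEV.P() and STDEV.S().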

Example 2.10
Reconsider Example 2.7 grouped frequency data set and calculate the following descriptive
statistics: mean, variance, and standard deviation.
Table 2.15 illustrates the Example 2.7 data set.

Mileage Frequency, f
400–419 12
420–439 27
440–459 34
460–479 24
480–499 15
500–519 8

Table 2.15

Excel solution for Example 2.10


Figure 2.16 illustrates the Excel solution.

Figure 2.16

LCB and UCB are the lower and upper class boundaries.

➜ Excel solution
Mileage Cells A4:A9 Values
LCB Cells B4:B9 Values
UCB Cells C4:C9 Values
Mid-point x Cell D4 Formula: =(B4+C4)/2
Copy down formula D4:D9
Frequency, f Cell E4:E9 Values
fx Cell F4 Formula: =E4*D4
Copy down formula F4:F9
fx^2 Cell G4 Formula: =E4*D4^2
Copy down formula G4:G9
Σ f = Cell B11 Formula: =SUM(E4:E9)
Σ fx = Cell B12 Formula: =SUM(F4:F9)
Σ fx^2 = Cell B13 Formula: =SUM(G4:G9)
mean = Cell B15 Formula: =B12/B11
mean = Cell B16 Formula: =SUMPRODUCT(D4:D9,E4:E9)/SUM(E4:E9)
variance = Cell B17 Formula: =B13/B11−B15^2
standard deviation = Cell B18 Formula: =SQRT(B17)

Population variance: The population variance is the variance of all possible values.

❉ Interpretation From Excel, the mean number of miles travelled is 454 miles with a
standard deviation of 27.4 miles.

Example 2.11
Reconsider Example 2.10 and calculate the median and SIQR.

Figures 2.17 and 2.18 illustrate the Excel solution.

Figure 2.17
LCB, lower class boundary; UCB, upper class boundary.

➜ Excel solution
Mileage Cells A4:A9 Values
LCB Cells B4:B9 Values
UCB Cells C4:C9 Values
Mid-point x Cell D4 Formula: =(B4+C4)/2
Copy down formula D4:D9
Frequency, f Cell E4:E9 Values
CF Cell F4 Formula: =E4
Cell F5 Formula: =F4+E5
Copy formula down F5:F9
Median
N = Cell B12 Formula: =SUM(E4:E9)
Position median = Cell B13 Formula: =(B12+1)/2
Median class is 440-459
L = Cell B15 Formula: =B6
C = Cell B16 Formula: =C6−C5
F = Cell B17 Formula: =F5
f = Cell B18 Formula: =E6
Median = Cell B19 Formula: =B15+B16*(B13−B17)/B18

Figure 2.18

➜ Excel solution
Quartile 1
N = Cell B22 Formula: =B12
Position Q1 = Cell B23 Formula: =(25/100)*(B22+1)
Q1 class is 420-439
L = Cell B25 Formula: =B5
C = Cell B26 Formula: =C5−C4
F = Cell B27 Formula: =F4
f = Cell B28 Formula: =E5
Q1 = Cell B29 Formula: =B25+B26*(B23−B27)/B28
Quartile 3
N = Cell B32 Formula: =B12
Position Q3 = Cell B33 Formula: =(75/100)*(B32+1)
Q3 class is 460−479
L = Cell B35 Formula: =B7
C = Cell B36 Formula: =C7−B7
F = Cell B37 Formula: =F6
f = Cell B38 Formula: =E7
Q3 = Cell B39 Formula: =B35+B36*(B33−B37)/B38
QR = Cell B41 Formula: =B39−B29
SIQR = Cell B42 Formula: =B41/2

❉ Interpretation From Excel, the median number of miles travelled is 452 miles with a SIQR of approximately 20.6 miles.

2.2.4 The coefficient of variation


The coefficient of variation represents the ratio of the standard deviation to the mean
and it is a useful statistic for comparing the degree of variation from one data series to

another. Standard deviations vary according to the size of values in the distribution and
may not even be in the same unit of measurement. For example, the value of the standard
deviation of a set of weights will be different, depending on whether they are measured in
pounds or kilograms. The coefficient of variation, however, will be the same in both cases
as it does not depend on the unit of measurement. One way of overcoming this problem is
to use the coefficient of variation, V, as defined by equation (2.16).

V = (Standard Deviation / Mean) × 100% (2.16)

For example, if the coefficient of variation is 10% then this means that the standard
deviation is equal to 10% of the average. For some measures, the standard deviation
changes as the average changes. In this case, the coefficient of variation is the best way to
summarize the variation. In other cases the standard deviation does not change with the
average. In this case, the standard deviation is the best way to summarize the variation.

Example 2.12
Consider the following problem that compares the average earnings in the UK and USA:
• mean earnings in the UK are £125 per week with a standard deviation of £10;
• mean earnings in the USA are $1005 per week with a standard deviation of $170.
For the UK: V = (10/125) × 100% = 8%

For the USA: V = (170/1005) × 100% = 16.9%

❉ Interpretation The spread of earnings in the USA is greater than the spread in
earnings in the UK.
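If the two sets of weekly earnings were held in a worksheet, the coefficient of variation could be calculated directly. As a sketch, with the UK earnings in cells B4:B55 (an assumed range):

=STDEV.P(B4:B55)/AVERAGE(B4:B55)*100

Applied to the Example 2.9 marks, the same formula gives V = 21.3/56.5 × 100% ≈ 37.8%, a much larger relative spread than either earnings distribution.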

2.2.5 Measures of skewness and kurtosis


A fundamental task in many statistical analyses is to characterize the location and vari-
ability of a data set. A further characterization of the data includes measuring shape via
skewness and kurtosis. Skewness is a measure of the degree of asymmetry of a distribu-
tion and kurtosis is a measure of whether the data are peaked or flat relative to a normal
distribution. The histogram is an effective graphical technique for showing both the skew-
ness and kurtosis for a data set. Consider the following three distributions A, B, and C, as
illustrated in Figure 2.19.
Distribution A is said to be symmetrical; the mean, median, and mode have the
same value and thus coincide at the same point of the distribution. Distribution B has
a high frequency of relatively low values and a low frequency of relatively high values.
Consequently, the mean is ‘dragged’ toward the right (the high values) of the distribution.
It is known as a right or positively skewed distribution.

Shape: The shape of the distribution refers to the shape of a probability distribution and involves the calculation of skewness and kurtosis.

Figure 2.19 Three distributions: A, a symmetrical distribution (mean = median = mode); B, a positively skewed distribution; C, a negatively skewed distribution, with the relative positions of the mode, median, and mean marked in each panel.

Distribution C has a high frequency of relatively high values and a low frequency of relatively low values. Consequently, the mean is ‘dragged’ toward the left (the low values) of the distribution. It is known as a left
or negatively skewed distribution. The skewness of a frequency distribution can be an
important consideration. For example, if your data set is salary, you would prefer a situa-
tion that led to a positively skewed distribution of salary to one that is negatively skewed.
Positive skewness is more common than negative, for example the salaries of lecturers.
One measure of skewness is Pearson’s coefficient of skewness, as defined by equation
(2.17).

PCS = 3(Mean − Median) / Standard Deviation (2.17)
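As an illustration using the Example 2.9 summary statistics (mean = 56.4615, median = 53, standard deviation = 21.3384):

PCS = 3 × (56.4615 − 53) / 21.3384 = 0.49

The positive value points to a right-skewed distribution of examination marks, in line with the Fisher’s coefficient reported in Example 2.13.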

Excel uses an alternative measure of skewness based upon Fisher’s skewness coeffi-
cient as defined by equation (2.18).

Fisher’s skewness = [n / ((n − 1)(n − 2))] × Σ((X − X̄)/s)³ (2.18)

Where n is the sample size and s is the sample standard deviation.

Note
1. With skewed data, the mean is not a good measure of central tendency because it is sensi-
tive to extreme values. In this case the median would be used to provide the measure of central
tendency.
2. The value for skewness is zero for symmetric distributions (mean = median).
3. If mean < median, then the measure of skewness is negative and the distribution is said to
be negatively skewed.
4. If mean > median, then the measure of skewness is positive and the distribution is said to
be positively skewed.
5. The measure of skewness is independent of the units being measured.

The critical value of skewness is defined by equation (2.19).

Skewness critical value = ±2 × √(6/n) (2.19)

Where n is the sample size.


The value of skewness is said to be critically skewed when the skewness value > ±2 × √(6/n).

A different measure of shape is a measure of whether the curve of a distribution is bell-


shaped (Mesokurtic), peaked (Leptokurtic), or flat (Platykurtic). Consider the two distri-
butions in Figure 2.20.

Figure 2.20 Two distributions, A and B, plotted against X; distribution A is more peaked than distribution B.

We can see from the two distributions that distribution A is more peaked than distribu-
tion B, but the means and standard deviations are approximately the same.
This is a measure of kurtosis and Excel provides Fisher’s measure of kurtosis as defined
by equation (2.20).

Kurtosis = [n(n + 1) / ((n − 1)(n − 2)(n − 3))] × Σ((X − X̄)/s)⁴ − 3(n − 1)² / ((n − 2)(n − 3)) (2.20)

Where s represents the sample standard deviation (sample variance = [n/(n − 1)] × population variance).
The critical value of kurtosis is defined by equation (2.21).

Kurtosis critical value = ±2 × √(24/n) (2.21)

Where n is the sample size.

Example 2.13
Reconsider the marks obtained in the statistics examination (Example 2.1) as presented in Table
2.16. Calculate a measure of skewness and kurtosis.

24 27 36 48 52 52 53 53 59 60 85 90 95

Table 2.16

Excel solution for Example 2.13


Figure 2.21 illustrates the Excel solution to Example 2.13.

Figure 2.21

➜ Excel solution
X: Cells B4:B16 Values
n = Cell E5 Formula: =COUNT(B4:B16)
Fisher’s skew = Cell E7 Formula: =SKEW (B4:B16)
Critical skewness = Cell E8 Formula: =2*SQRT(6/E5)
Fisher’s kurtosis = Cell E10 Formula: =KURT(B4:B16)
Critical kurtosis = Cell E11 Formula: =2*SQRT(24/E5)

❉ Interpretation From Excel, the value of skewness is calculated to be 0.4410 (critical value ±1.36), which indicates a moderately right-skewed distribution, and the value of kurtosis is calculated to be −0.4253 (critical value ±2.72), which indicates a moderately flat (platykurtic) distribution.

Note Reporting the median along with the mean in skewed distributions is generally a
good idea.
Skewness:

1. Skewness = zero. A zero skew value indicates symmetry. Normal distributions produce a
skewness statistic of zero.
2. A positive skewness value indicates a positively skewed distribution (that is, with scores bunched up on the low end of the score scale). In this example we have a positively skewed distribution.
3. A negative value indicates a negatively skewed distribution (that is, with scores bunched up
on the high end of the scale).
4. Skewness > ±1.36 would suggest severe skewness. In this case we conclude a skewness value of 0.4410 is not significantly skewed (−1.36 < 0.4410 < +1.36).

Kurtosis:

1. Kurtosis = zero indicates a symmetrical distribution. Distributions with zero kurtosis are called mesokurtic. Normal distributions produce a kurtosis statistic of about zero (we say ‘about’ because small variations can occur by chance alone). A kurtosis value close to zero therefore indicates an approximately mesokurtic (that is, approximately normal) distribution.
2. A distribution with positive kurtosis is called leptokurtic. In terms of shape, the distribution
has a more acute 'peak' around the mean.
3. A distribution with negative kurtosis is called platykurtic. In terms of shape, the distribution
has a smaller 'peak' around the mean.
4. Kurtosis > ± 2.72 would suggest severe kurtosis. In this case we conclude a kurtosis value of
−0.4253 is not a significant value of kurtosis (−2.72 < −0.4253 < +2.72).

Student exercises
X2.10 Over a one-month period the number of vacant beds in a West Yorkshire hospital was
surveyed. The frequency distribution in Table 2.17 resulted.

Beds vacant 0 2 3 5 6 8
Frequency 4 8 12 4 2 1

Table 2.17

Determine the mean and standard deviation.


X2.11 In a number of towns, the distance of a sample of 122 supermarkets from the towns’ high street was measured to the nearest metre (Table 2.18).

Distance 57–59 60–62 63–65 66–68 69–71 72–74 75–77


Frequency 9 10 18 42 27 11 5

Table 2.18

Determine the range, mean, and standard deviation.


X2.12 In a debate on the alteration of a traffic system in the city centre, measurements of the
number of cars per minute were taken at two junctions; the results were as shown in
Table 2.19.

Number of cars per minute Junction A Junction B


10–14 0 5
15–19 3 8
20–24 13 10
25–29 24 12
30–34 17 14
35–39 3 5
40–44 0 3
45–49 0 3
Totals 60 60

Table 2.19

Compare the two distributions by plotting out their frequency polygons, and
determine the means and standard deviations.
X2.13 Greendelivery.com has recently decided to review the weekly mileage of the delivery
vehicles used to deliver shopping purchased online to customer homes from a central
parcel depot. The sample data collected (Table 2.20) is part of the first stage in analysing
the economic benefit of potentially moving all vehicles to biofuels from diesel.

80 165 159 143 140


136 138 118 120 124
159 131 93 145 109
163 136 163 142 80
106 111 123 161 179
144 145 91 112 146
170 105 131 141 122
137 152 109 122 126
114 155 92 143 165

Table 2.20

(a) Use Excel to construct a frequency distribution and plot the histogram with class
intervals of 10 and classes 75–84, 85–94 . . . 175–184. Comment on the pattern in
mileage travelled by the company vehicles.
(b) Use the raw data to determine the mean, median, standard deviation, and SIQR.
(c) Comment on which measure you would use to describe the average and measure
of dispersion. Explain using your answers to (a) and (b).
(d) Calculate the measure of skewness and kurtosis, and comment on the distribution
shape.

2.3 Exploratory data analysis


In the previous sections we explored methods to describe a data set by computing measures of average, spread, and shape. In this section we will explore exploratory data analysis techniques, including: the five-number summary, box plots, and using the Excel ToolPak add-in to calculate descriptive statistics.

Five-number summary: A five-number summary is especially useful when we have so many data that it is sufficient to present a summary of the data rather than the whole data set.
Box plot: A box plot is a way of summarizing a set of data measured on an interval scale.

2.3.1 Five-number summary

The five-number summary is a simple method that provides measures of average and spread with the added bonus of giving us an idea of the shape of the distribution. This five-number summary consists of the following numbers in the data set: smallest value, first quartile, median, third quartile, and largest value. For symmetrical distributions the following rule would hold:

• Q3 − Median = Median − Q1
• Largest value − Q3 = Q1 − smallest value
• Median = Midhinge = Midrange

The midrange is the average of the largest and smallest data values, and the midhinge
is the average of the first and third quartiles. For non-symmetry the following rule would
hold:

• right-skewed distributions: Largest value − Q3 greatly exceeds Q1 − Smallest value
• left-skewed distributions: Q1 − Smallest value greatly exceeds Largest value − Q3

Example 2.14
In this particular case we will assume that the values are as follows: first quartile Q1 = 15, minimum = 8, median = 33, maximum = 88, and third quartile Q3 = 62.
Input your data into Excel as illustrated in Figure 2.22.

Figure 2.22

We can see from the summary statistics that the data distribution is not symmetrical:

• the distance from Q3 to the median (62 − 33 = 29) is not the same as between Q1 and the median (33 − 15 = 18);
• the distance from Q3 to the largest value (88 − 62 = 26) is not the same as the distance between Q1 and the smallest value (15 − 8 = 7);
• the median (33), the midhinge ((62 + 15)/2 = 38.5), and the midrange ((88 + 8)/2 = 48) are not equal.

The summary numbers indicate right skewness because the distance between Q3 and the largest number (88 − 62 = 26) is longer than the distance between Q1 and the smallest value (15 − 8 = 7). The minimum and maximum points are identified and enable identification of any extreme values (or outliers).

Right-skewed: Right-skewed (or positive skew) indicates that the tail on the right side is longer than the left side and the bulk of the values lie to the left of the mean.
Left-skewed: Left-skewed (or negative skew) indicates that the tail on the left side of the probability density function is longer than the right side and the bulk of the values (possibly including the median) lie to the right of the mean.

Note A simple rule to identify an outlier (or suspected outlier) is that the largest value − smallest value (88 − 8 = 80) should be no longer than three times the length of the box (Q3 − Q1 = 62 − 15 = 47).

❉ Interpretation In this case the value of maximum − minimum is 80 and Q3 − Q1 is 47, and therefore no extreme values are present in the data set.
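When the five numbers have to be computed from a raw data range, rather than supplied as in this example, the following formulas are one possible sketch (assuming the raw observations sit in cells B4:B43):

Minimum =MIN(B4:B43)
Q1 =QUARTILE.INC(B4:B43,1)
Median =MEDIAN(B4:B43)
Q3 =QUARTILE.INC(B4:B43,3)
Maximum =MAX(B4:B43)
Midhinge =(QUARTILE.INC(B4:B43,1)+QUARTILE.INC(B4:B43,3))/2
Midrange =(MIN(B4:B43)+MAX(B4:B43))/2

Comparing the median, midhinge, and midrange then gives a quick symmetry check, as described above.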

2.3.2 Box plots


We have already discussed techniques for visually representing data (see histograms and
frequency polygons). In this section we present another important method, called box
plots. A box plot (or box-and-whisker plot) is a graphical method of displaying the sym-
metry or skewness in a data set. It shows a measure of central location (the median), two
measures of dispersion (the range and interquartile range), the skewness (from the orien-
tation of the median relative to the quartiles), and potential outliers.

Example 2.15
In this particular case we will assume that the values are the same as in Example 2.14: first quar-
tile Q1 = 15, minimum = 8, median = 33, maximum = 88, and third quartile Q3 = 62. The box
plot is then constructed, as illustrated in Figure 2.23.

Figure 2.23 Box and whisker plot for Example 2.15, with value on the vertical axis (0 to 100); the box runs from Q1 (15) to Q3 (62) with the median (33) inside it, and the whiskers extend to the minimum (8) and maximum (88).

The box plot is interpreted as follows:

• if the median within the box is not equidistant from the whiskers (or hinges), then the data is skewed. The box plot indicates right skewness because the distance between the median and the highest value is greater than the distance between the median and the lowest value. Furthermore, the top whisker (distance between Q3 and maximum) is longer than the lower whisker (distance between Q1 and minimum);
• the minimum and maximum points (or whiskers) are identified and enable identification of any extreme values (or outliers). A simple rule to identify an outlier (or suspected outlier) is that the whisker length (maximum value − minimum value) should be no longer than three times the length of the box (Q3 − Q1). In this case the value of maximum − minimum is 80 and Q3 − Q1 is 47, and no extreme values are present in the data set.
Box-and-whisker plot: A box-and-whisker plot is a way of summarizing a set of data measured on an interval scale.

Excel spreadsheet solution for Example 2.15

Unfortunately, Microsoft Excel does not have a built-in box plot chart type. You can create your own charts using stacked bar or column charts and error bars in combination with line or XY scatter chart series to show additional data. For your data set calculate: first quartile, minimum, median, maximum, and third quartile.

1 Input your data into Excel as illustrated in Figure 2.24.


Select the range B4:C9 (including table headers) as illustrated in Figure 2.24.

Figure 2.24

2 Plot chart
Select Insert > choose Line > choose the ‘Line with Markers’ as illustrated in
Figure 2.25.
Figure 2.25 Line chart with markers of the five values (Q1, Minimum, Median, Maximum, Q3), plotted on a vertical value axis from 0 to 100.

We note that this does not look like a box plot so we will now modify the line chart
in Figure 2.25 so that it looks like a box plot.

3 Convert line chart to a box plot chart


Click on anywhere on the chart.
Select Chart Tools > Select Design > Select Data and click on Switch Row/Column
button as illustrated in Figure 2.26.

Figure 2.26

Click OK
Figure 2.27 The chart after switching rows and columns: the five series (Q1, Minimum, Median, Maximum, Q3) are now plotted as separate points above a single category, Value.

Figure 2.27 illustrates the transformation of the Figure 2.25 line chart.
Figure 2.27 looks more like a box chart and we can modify the chart to improve the
chart appearance, for example, removing the line through the legend points.

4 Edit the box plot


Right-click on any of the data points on the chart. Select Format Data Series > Choose
Line Colour> Click radial button next to ‘no line’ > Click Close as illustrated in Figure
2.28. Repeat this for the other four data lines.

Figure 2.28

Figure 2.29 illustrates the transformation of the Figure 2.27 line chart to remove the
lines through the data points.
Figure 2.29 The chart with the connecting lines removed, leaving the five data points (Q1, Minimum, Median, Maximum, Q3) as markers above the category Value.

5 Add the lines and box to the data points


Select Layout > Analysis menu choose Lines > Click on High-Low Lines.
Figure 2.30 illustrates the new chart.

Figure 2.30 The chart with high-low lines added, connecting the smallest and largest of the five values.

6 Add box
Select Layout > Analysis menu choose Up/Down Bars > Select Up/Down Bars
button.
Figure 2.31 illustrates the new chart.

Figure 2.31 The chart with up/down bars added, forming the box between the quartile values.

7 Add chart and axes titles


The final step is to add a chart title and modify axes titles if required, as illustrated in
Figure 2.32.

Figure 2.32 The completed box and whisker plot, with the chart title added and value plotted on the vertical axis.

2.3.3 Using the Excel ToolPak add-in


You can use the descriptive statistics procedure in the Excel ToolPak add-in Data Analysis
to provide a set of summary statistics, including: mean, median, mode, standard devia-
tion, sample variance, kurtosis, skewness, range, minimum, maximum, sum, count,
largest, and smallest number. The skewness and kurtosis values can be used to provide
information about the shape of the distribution.

Example 2.16
If we consider Example 2.1 data then the Descriptive Statistics procedure in the Excel ToolPak
add-in would give the required descriptive statistics.

Select Data > Select Data Analysis.


Select Descriptive Statistics.
See Figure 2.33.

Figure 2.33

Click OK.
Input data range: B3:B16.
Grouped By: Columns.
Select Labels in first row.
Output Range: D3.
Click Summary statistics.
See Figure 2.34.
Click OK.

Figure 2.34

The Excel results would then be calculated and printed out in the Excel worksheet (see
Figure 2.35).

Figure 2.35
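Note The ToolPak output reports the sample versions of the variance and standard deviation (equivalent to VAR.S() and STDEV.S()), so for the Example 2.1 marks these values will be slightly larger than the population figures (VAR.P(), STDEV.P()) calculated earlier; the skewness and kurtosis values match the SKEW() and KURT() functions used in Example 2.13.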

Student exercises
X2.14 The manager at BIG JIMS restaurant is concerned about the time it takes to process
credit card payments at the counter by counter staff. The manager has collected the
following processing time data (time in minutes/seconds) (Table 2.21) and requested
that summary statistics are calculated.
(a) Calculate a five-number summary for this data set.
(b) Do we have any evidence for a symmetric distribution?
(c) Use the Excel Analysis-ToolPak to calculate descriptive statistics.
(d) Which measures would you use to provide a measure of average and spread?

Processing credit cards (n = 40)


1.57 1.38 1.97 1.52 1.39

1.09 1.29 1.26 1.07 1.76

1.13 1.59 0.27 0.92 0.71

1.49 1.73 0.79 1.38 2.46

0.98 2.31 1.23 1.56 0.89

0.76 1.23 1.56 1.98 2.01

1.40 1.89 0.89 1.34 3.21

0.76 1.54 1.78 4.89 1.98

Table 2.21

X2.15 The local regional development agency is conducting a major review of the economic
development of a local community. One economic measure to be collected is the
local house prices that reflect on the economic well-being of this community. The

development agency have collected the following house price data (£) as presented in
Table 2.22.
(a) Calculate a five-number summary
(b) Do we have any evidence for a symmetric distribution?
(c) Use the Excel Analysis-ToolPak to calculate descriptive statistics.
(d) Which measures would you use to provide a measure of average and spread?

Local house price values (£, n = 50)


162726 162726 162726 162726 162726
188656 188656 188656 188656 188656
165547 165547 165547 165547 165547
175806 175806 175806 175806 175806
190670 190670 190670 190670 190670
145810 145810 145810 145810 145810
169682 169682 169682 169682 169682
155044 155044 155044 155044 155044
149304 149304 149304 149304 149304
197847 197847 197847 197847 197847

Table 2.22

■ Techniques in practice
TP1 Coco S.A. supplies a range of computer hardware and software to 2000 schools within
a large municipal region of Spain. When Coco S.A. won the contract the issue of customer
service was considered to be central to the company being successful at the final bidding stage.
The company has now requested that its customer service director creates a series of graphical
representations of the data to illustrate customer satisfaction with the service. The following
data has been collected over the last six months and measures the time to respond to the
received complaint (days), as presented in Table 2.23.
The customer service director has analysed this data to create a grouped frequency table
and has plotted the histogram. From this he made a series of observations regarding the time
to respond to customer complaints. He now wishes to extend the analysis to use numerical
methods to describe this data.

(a) From the data set calculate the mean and median.
(b) Repeat the analysis to calculate the standard deviation, quartiles (Q1, Q2, and Q3), quar-
tile range, and SIQR.
(c) Describe the shape of the distribution. Do the results suggest that there is a great deal of
variation in the time taken to respond to customer complaints?

5 24 34 6 61 56 38 32
87 78 34 9 67 4 54 23
56 32 86 12 81 32 52 53
34 45 21 31 42 12 53 21
43 76 62 12 73 3 67 12
78 89 26 10 74 78 23 32
26 21 56 78 91 85 15 12
15 56 45 21 45 26 21 34
28 12 67 23 24 43 25 65
23 8 87 21 78 54 76 79

Table 2.23

(d) Which measures would you recommend the customer service manager uses to describe
the variation in time to respond to customer complaints?
(e) What conclusions can you draw from these results?

TP2 Bakers Ltd runs a chain of bakery shops and is famous for the quality of its pies. The
management of the company is concerned at the number of complaints from customers who
say it takes too long to serve customers at a particular branch. The motto of the company is
‘Have your pie in two minutes’. The manager of the branch concerned has been told to provide
data on the time it takes for customers to enter the shop and be served by the shop staff, and
has presented the data in Table 2.24.

0.70 0.70 0.70 0.70 0.70 0.70 0.70 0.70 0.70 0.70
0.99 0.99 0.99 0.99 0.99 0.99 0.99 0.99 0.99 0.99
0.12 0.12 0.12 0.12 0.12 0.12 0.12 0.12 0.12 0.12
0.70 0.70 0.70 0.70 0.70 0.70 0.70 0.70 0.70 0.70
1.88 1.88 1.88 1.88 1.88 1.88 1.88 1.88 1.88 1.88
0.55 0.55 0.55 0.55 0.55 0.55 0.55 0.55 0.55 0.55
1.38 1.38 1.38 1.38 1.38 1.38 1.38 1.38 1.38 1.38
0.80 0.80 0.80 0.80 0.80 0.80 0.80 0.80 0.80 0.80
1.25 1.25 1.25 1.25 1.25 1.25 1.25 1.25 1.25 1.25
1.48 1.48 1.48 1.48 1.48 1.48 1.48 1.48 1.48 1.48

Table 2.24

(a) From the data set calculate the mean and median.
(b) Repeat the analysis to calculate the standard deviation, quartiles (Q1, Q2, and Q3), quar-
tile range, and SIQR.
(c) Describe the shape of the distribution. Do the results suggest that there is a great deal of
variation in the time taken to serve customers?
(d) Which measures would you recommend the shop manager uses to describe the varia-
tion in the time taken to serve customers?
(e) What conclusions can you draw from these results?

TP3 Skodel Ltd is a small brewery that is undergoing a major expansion after a takeover
by a large European brewery chain. Skodel Ltd produces a range of beers and lagers, and is
renowned for the quality of its beers, winning a number of prizes at trade fairs throughout
the European Union. The new parent company are reviewing the quality control mechanisms
being operated by Skodel Ltd and are concerned at the quantity of lager in its premium lager
brand, which should contain a mean of 330 ml and a standard deviation of 15 ml. The bottling
plant manager provided the parent company with quantity measurements from 100 bottles for
analysis (Table 2.25).

326 326 326 326 326 326 326 326 326 326
344 344 344 344 344 344 344 344 344 344
333 333 333 333 333 333 333 333 333 333
346 346 346 346 346 346 346 346 346 346
339 339 339 339 339 339 339 339 339 339
353 353 353 353 353 353 353 353 353 353
310 310 310 310 310 310 310 310 310 310
351 351 351 351 351 351 351 351 351 351
350 350 350 350 350 350 350 350 350 350
348 348 348 348 348 348 348 348 348 348

Table 2.25

(a) From the data set calculate the mean and median.
(b) Repeat the analysis to calculate the standard deviation, quartiles (Q1, Q2, and Q3), quar-
tile range, and SIQR.
(c) Describe the shape of the distribution. Do the results suggest that there is a great deal
of variation in quantity within the bottle measurements? Compare the assumed bottle
average and spread with the measured average and spread.
(d) What conclusions can you draw from these results?

■ Summary
This chapter extends your knowledge from using tables and charts to summarizing data using
measures of average and dispersion. The mean is the most commonly calculated average to
represent the measure of central tendency, but this measurement uses all the data within the
calculation and therefore outliers will affect the value of the mean. This can imply that the value
of the mean may not be representative of the underlying data set. If outliers are present in the
data set then you can either eliminate these values or use the median to represent the average.
The average provides a measure of the central tendency (or middle value) and the next calcula-
tion to perform is to provide a measure of the spread of the data within the distribution. The
standard deviation is the most common type of measure of dispersion (or spread), but, like the
mean, the standard deviation is influenced by the presence of outliers within the data set. If
outliers are present in the data set then you can either eliminate these values or use the SIQR to
represent the degree of dispersion. You can estimate the degree of skewness in the data set by
calculating Pearson’s coefficient of skewness (or use Fisher’s skewness equation) and the degree
of ‘peakedness’ by calculating Fisher’s kurtosis coefficient statistic. Box plots are graph plots that
allow you to visualize the degree of symmetry or skewness in the data set.
The chapter explored the calculation process for raw data and frequency distributions, and
it is very important to note that the graphical method will not be as accurate as the raw data
method when calculating the summary statistics. Table 2.26 provides a summary of which sta-
tistics measures to use for different types of data.

Summary statistic to be applied

Data type            Average   Spread or dispersion
Nominal              Mode      NA
Ordinal              Mode      Range
                     Median    Range, interquartile range
Ratio or interval    Mode      Range
                     Median    Range, interquartile range
                     Mean      Variance, standard deviation, skewness, kurtosis

Table 2.26 Which summary statistic to use?

In the next chapter we will look at the concept of probability.

■ Key terms
Arithmetic mean; Box plot; Box-and-whisker plot; Central tendency; Coefficient of variation; Dispersion; Extreme value; Five-number summary; Interquartile range; Kurtosis; Left-skewed; Mean; Median; Mode; Outlier; Population mean; Population variance; Q1: first quartile; Q3: third quartile; Quartiles; Range; Right-skewed; Shape; Skewness; Standard deviation; Symmetrical; Variance; Variation

■ Further reading
Textbook resources
1. Whigham, D. (2007) Business Data Analysis using Excel. Oxford: Oxford University Press.
2. Lindsey, J. K. (2003) Introduction to Applied Statistics: A Modelling Approach (2nd edn).
Oxford: Oxford University Press.

Web resources
1. StatSoft Electronic Textbook http://www.statsoft.com/textbook/stathome.html (accessed
25 May 2012).
2. HyperStat Online Statistics Textbook http://davidmlane.com/hyperstat/index.html
(accessed 25 May 2012).
3. Eurostat—website is updated daily and provides direct access to the latest and most com-
plete statistical information available on the European Union (EU), the EU Member States, the
Euro-zone, and other countries http://epp.eurostat.ec.europa.eu (accessed 25 May 2012).
4. Economagic—contains international economic data sets http://www.economagic.com
(accessed 25 May 2012).
5. The International Statistical Institute (ISI) glossary of statistical terms provides definitions
in a number of different languages http://isi.cbs.nl/glossary/index.htm (accessed 25 May 2012).
3 Introduction to probability

» Overview «
The concept of probability is an important aspect of the study of statistics and, within this
chapter, we shall introduce the reader to some of the concepts that are relevant to probability.

» Learning objectives «
On completing this chapter you will be able to:

» understand the concept of the following terms: experiment, outcome, sample space, rela-
tive frequency, and sample probability;

» understand the concept of events being mutually exclusive and independent;

» use the probability laws to solve simple problems;


» use tree diagrams (or decision trees) as an aid to solving problems;

» understand the concept of a probability distribution;

» calculate expected values using probability distributions.

Probability Probability provides a quantitative description of the likely occurrence of a particular event.
Chance Chance is the unknown and unpredictable element in happenings that seems to have no assignable cause.
Probable Probable represents that an event (or events) is likely to happen or to be true.
Uncertainty Uncertainty is a state of having limited knowledge where it is impossible to describe exactly the existing state or future outcome of a particular event occurring.
Event An event is any collection of outcomes of an experiment.

3.1 Basic ideas

There are a number of words and phrases that encapsulate the basic concept of probability: chance, probable, odds, and so on. In all cases we are faced with a degree of uncertainty and concerned with the likelihood of a particular event happening. Statistically, these words and phrases are too vague; we need some measure of the likelihood of an event occurring. This measure is termed probability and is measured on a scale ranging between 0 and 1.
From Figure 3.1 we observe that the probability values lie between 0 and 1, with 0 representing no possibility of the event occurring and 1 representing the probability that the event is certain to occur.

Figure 3.1 The probability scale, P: from 0 (not possible) through 0.5 (equally possible outcome) to 1 (certainty)

In reality the value of the probability will lie between 0 and 1.


In order to determine the probability of an event occurring data has to be obtained. This
can be achieved through, for example, experience or observation, or empirical methods.
The procedure or situation that produces a definite result (or outcome) is termed a ran-
dom experiment. Tossing a coin, rolling a die, recording the income of a factory worker,
and determining defective items on an assembly line are all examples of experiments. The
characteristics of the random experiment are:

• each experiment is repeatable;


• all possible outcomes can be described;
• although individual outcomes appear haphazard, continual repeats of the
experiment will produce a regular pattern.

The result of an experiment is called an outcome. It is the single possible result of an


experiment; for example, tossing a coin produces 'a head', rolling a die gives a '3'. If we
accept the proposition that an experiment can produce a finite number of outcomes then
we could, in theory, define all these outcomes. The set of all possible outcomes is defined
as the sample space. For example, the experiment of rolling a die could produce the out-
comes 1, 2, 3, 4, 5, and 6, which would thus define the sample space. Another basic notion
is the concept of an event and is simply a set of possible outcomes. This implies that an
event is a subset of the sample space. For example, take the experiment of rolling a die, the
event of obtaining an even number would be defined as the subset {2, 4, and 6}. Finally,
two events are said to be mutually exclusive if they cannot occur together. Thus, in rolling
a die, the event 'obtaining a two' is mutually exclusive of the event 'obtaining a three'. The event 'obtaining a two' and the event 'obtaining an even number' are not mutually exclusive as both can occur together, i.e. {2} is a subset of {2, 4, 6}.

Outcome An outcome is the result of an experiment or other situation involving uncertainty.
Random experiment A random experiment is an experiment, trial, or observation that can be repeated numerous times under the same conditions.
Sample space The sample space is an exhaustive list of all the possible outcomes of an experiment.
Mutually exclusive Mutually exclusive represents events that cannot occur at the same time.

Student exercises
X3.1 Give an appropriate sample space for each of the following experiments:
(a) A card is chosen at random from a pack of cards
(b) A person is chosen at random from a group containing five females and six males
(c) A football team records the results of each of two games as 'win', 'draw', or 'lose'.
X3.2 For the respective sample spaces given in X3.1 indicate the outcomes that constitute the following events:
(a) A queen is chosen
(b) A female is chosen
(c) A football team draws at least once.
X3.3 From a group of males and females over the age of 16 years, one person is chosen at random. Which pair of events is mutually exclusive?
(a) Being a male and being aged 21 years or older
(b) Being a male and being a female.
X3.4 A dart is thrown at a board and is likely to land on any one of eight squares numbered
1–8 inclusive. A represents the event the dart lands in square 5 or 8. B represents the
event the dart lands in square 2, 3, or 4. C represents the event the dart lands in square
1, 2, 5 or 6. Which two events are mutually exclusive?

3.2 Relative frequency


Suppose we perform the experiment of throwing a die and note the score obtained. We
repeat the experiment a large number of times, say 1000, and note the number of times
each score was obtained. For each number we could derive the ratio of occurrence (event
A) to the total number of experiments (n = 1000). This ratio is called the relative frequency.
In general, if event A occurs m times, then your estimate of the probability that A will occur
is given by equation (3.1).

P(A) = m/n (3.1)

Example 3.1
Consider the result of running the die experiment where the die has been thrown 1000 times
and the number of times each possible outcome (1, 2, 3, 4, 5, and 6) recorded. The result of the
die experiment is illustrated in Table 3.1.

Score 1 2 3 4 5 6
Frequency 173 168 167 161 172 159
Relative frequency 0.173 0.168 0.167 0.161 0.172 0.159

Table 3.1
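An experiment of this kind is easy to reproduce in Excel. A minimal sketch, assuming a cell layout of our own choosing: 1000 simulated throws in A2:A1001 and the six possible scores 1–6 entered in C2:C7.

A2 Formula: =RANDBETWEEN(1,6) (one simulated throw; copy from A2:A1001)
D2 Formula: =COUNTIF($A$2:$A$1001,C2) (frequency of the score in C2; copy from D2:D7)
E2 Formula: =D2/1000 (relative frequency; copy from E2:E7)

Pressing F9 recalculates the simulated throws, and the relative frequencies can be seen fluctuating around the theoretical value of 1/6.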

This notion of relative frequency provides an approach to determine the probability of an event. As the number of experiments increases then the relative frequency stabilizes and approaches the probability of the event. Thus, if we had performed the above experiment 2000 times we might expect 'in the long run' the frequencies of all the scores to approach 0.167. This implies that P(1) = P(2) = P(3) = P(4) = P(5) = P(6) = 0.167. Actually, for this experiment, the theoretical values for each event would be P(1) = P(2) = P(3) = P(4) = P(5) = P(6) = 1/6.

Relative frequency Relative frequency is another term for proportion; it is the value calculated by dividing the number of times an event occurs by the total number of times an experiment is carried out.

There are many situations where probabilities are derived through this relative
frequency approach (also called empirical approach or experimental probability
approach). If a manufacturer indicates that it is 99% certain (P = 99%) that an electric
light bulb will last 200 hours, this figure will have been arrived at from experiments which
have tested numerous samples of light bulbs. If we are told that the probability of rain on a
June day is 0.42, this will have been determined through studying rainfall records for June
over, say, the past 20 years.
A number of important issues are assumed when approaching probability problems:

• the probability of each event within the probability experiment lies between zero and
one;
• the sum of probabilities of all events in this experiment equals one;
• if we know the probability of an event occurring in the experiment, then the
probability of it not occurring is P(event not occurring) = 1 – P(event occurring).

Example 3.2
Suppose that a particular production process has been in operation for 200 days with a
recorded accident on 150 days. Let A = the event that an accident occurs in future, then the
probability of an accident occurring in future, P (A) = 150/200 = 0.75. This provides an estimate
or probability of 75% that an accident will occur in the future on each separate day.

Example 3.3
Over the last 3 years a random sample of 1000 students was selected and classified according
to degree classification and gender. The results were as recorded in Table 3.2.

1st 2i 2 ii 3rd Total


Male 20 90 200 90 400
Female 40 150 230 180 600
Total 60 240 430 270 1000

Table 3.2

Calculate: (a) the probability that a student achieves a 2i and is female; (b) the probability that a student achieves a 2i and is male; (c) the probability that a student achieves a 2i and is female or male; and (d) the probability that a 2i classification is not achieved.

Empirical approach Empirical probability, also known as relative frequency, or experimental probability, is the ratio of the number of outcomes in which a specified event occurs to the total number of trials.
Experimental probability approach Experimental probability approach (see Empirical approach).

(a) Probability that a student achieves a 2i and is female:

P(2i and female) = Number of female students with a 2i / Total sample size = 150/1000 = 0.15

Probability that a student achieves a 2i and is female is 0.15 or 15%.

(b) Probability that a student achieves a 2i and is male:
P(2i and male) = Number of male students with a 2i / Total sample size = 90/1000 = 0.09

Probability that a student achieves a 2i and is male is 0.09 or 9%.

(c) Probability that a student achieves a 2i and is female or male:

P(2i student being female or male) = Number of male and female students with a 2i / Total sample size = 240/1000 = 0.24

Probability that a student achieves a 2i and is female or male is 0.24 or 24%.

(d) Probability that a 2i degree classification is not achieved:

P(not obtaining a 2i) = Number of students not obtaining a 2i / Total sample size = 760/1000 = 0.76

Probability that a 2i degree classification is not achieved is 0.76 or 76%.
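Probabilities of this kind can be computed directly from a worksheet copy of Table 3.2. A minimal sketch, assuming a layout of our own choosing in which the counts occupy B4:E5 (males in row 4, females in row 5, classifications 1st to 3rd in columns B to E), with column totals in row 6 and the grand total of 1000 in F6:

P(2i and female) Formula: =C5/$F$6 (C5 holds 150)
P(2i and male) Formula: =C4/$F$6 (C4 holds 90)
P(2i and female or male) Formula: =C6/$F$6 (C6 holds the column total 240)
P(not obtaining a 2i) Formula: =1-C6/$F$6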

Student exercises
X3.5 How would you give an estimate of the probability of a 25-year-old passing a driving
test at a first attempt?
X3.6 In an experiment we toss two unbiased coins 100 times and note the frequency of the
two possible outcomes (heads, tails). We are interested in calculating the probability
(or chance) that at least 1 head will occur from the 100 tosses of the 2 coins. Calculate:
(a) the theoretical probability that at least one head occurs, and (b) the value of this
probability from your experiment. What would you expect to occur between the
theoretical and experimental probability values if the overall number of attempts
increases?
X3.7 Table 3.3 provides information about 200 school leavers and their destination after
leaving school.

Leave school at 16 years of age Leave school at an older age


Full time education, E 14 18
Full time job, J 96 44
Other 15 13

Table 3.3

Determine the probabilities that a person, selected at random:


(a) Went into full-time education
(b) Went into a full-time job

(c) Either went into full-time education or went into a full-time job
(d) Left school at 16 years of age
(e) Left school at 16 years of age and went into full-time education.
X3.8 Consider Table 3.3 in X3.7.
(a) Are the events E and J mutually exclusive?
(b) Determine P(E and J).
(c) Using the values of P(E), P( J), and P(E and J) you have already determined in X3.7,
evaluate P(E) + P( J). What do you notice when you compare your answer with
P(E or J)?

3.3 Sample space


We already know that the sample space contains all likely outcomes of an experiment and
that one or more outcomes constitute an event. Here, rather than resort to the notion of
relative frequency, we will look at probability as defined by equation (3.2).

P(Event) = Number of outcomes in the event / Total number of outcomes (3.2)

A number of examples will be used to illustrate this notion via the construction of the
sample space.

Example 3.4
If an experiment consists of rolling a die then the possible outcomes are 1, 2, 3, 4, 5, and 6. The
theoretical probability of obtaining a 3 can then be calculated using equation (3.2):
P(obtaining a 3) = Number of outcomes producing a 3 / Total number of outcomes = 1/6 = 0.16666…

Probability of obtaining a 3 is 0.16666… or 16 2/3%.

Example 3.5
If an experiment consists of tossing two unbiased coins then the possible outcomes are: HH,
HT, TH, and TT. We could illustrate the sample space with individual sample points (*), as illus-
trated in Table 3.4.

First coin
H T
Second coin H * *
T * *

Table 3.4

From this sample space we can calculate individual probabilities. For example, the theoreti-
cal probability of achieving at least one head would be calculated as follows:

P(at least 1 head) = Number of outcomes producing at least 1 head / Total number of outcomes = 3/4 = 0.75

Therefore, the theoretical probability of achieving at least 1 head would be 0.75 or 75%.

Example 3.6
An experiment consists of throwing two dice and noting their two scores. The sample space
could be shown as illustrated in Table 3.5.

                      Score on first die, X
                      1   2   3   4   5   6
Score on          1   *   *   *   *   *   *
second die, Y     2   *   *   *   *   *   *
                  3   *   *   *   *   *   *
                  4   *   *   *   *   *   *
                  5   *   *   *   *   *   *
                  6   *   *   *   *   *   *

Table 3.5

From this sample space calculate the following theoretical probabilities to three decimal
places: (a) P(X = Y); (b) P(X + Y = 5); (c) P(X * Y = 36); (d) P(X < 3 and Y > 2).

(a) P(X = Y) = 6/36 = 0.167

(b) P(X + Y = 5) = 4/36 = 0.111
(c) P(X * Y = 36) = 1/36 = 0.028
(d) P(X < 3 and Y > 2) = 8/36 = 0.222
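Sample-space probabilities like these can be checked by enumerating all 36 outcomes in a worksheet. A minimal sketch, with a layout of our own choosing in which rows 2–37 list every (X, Y) pair:

A2 Formula: =INT((ROW()-2)/6)+1 (score on the first die; copy from A2:A37)
B2 Formula: =MOD(ROW()-2,6)+1 (score on the second die; copy from B2:B37)
P(X = Y) Formula: =SUMPRODUCT(--(A2:A37=B2:B37))/36
P(X + Y = 5) Formula: =SUMPRODUCT(--(A2:A37+B2:B37=5))/36
P(X < 3 and Y > 2) Formula: =SUMPRODUCT((A2:A37<3)*(B2:B37>2))/36

The double negation (--) converts the TRUE/FALSE comparisons into 1s and 0s so that SUMPRODUCT can count the qualifying outcomes.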

Student exercises
X3.9 An unbiased coin and a fair die are tossed together. What is the probability of
obtaining a head and a 6?
X3.10 Calculate the following probabilities if two unbiased dice are tossed: (a) the probability
of a 3 on the first die and 5 on the second die = P(3, 5); (b) the probability of a 3 on the
first die and 5 on the second die = P(3 and 5); and (c) the probability of a 3 on the first
die or 5 on the second die = P(3 or 5).
X3.11 The following coins are placed in a bag: 1p, 2p, 5p, and 10p. A coin is taken at random
and then replaced. A second coin is taken at random and then replaced. Calculate the
following probabilities: (a) P(1p chosen first and 2p chosen second); (b) P(sum is 3p);
(c) P(at least one 10p).

X3.12 Ten discs with a different number (0, 1, 2 . . .. 9) printed on them are placed in a bag.
Two discs are taken out of the bag one at a time at random to form a two digit number
(where 08 is counted as the number 8). Assuming the first disc is replaced before the
second is chosen, find the following probability that: (a) the number is even; (b) the
number is less than 30; (c) the number is 67; and (d) the two digits forming the number
are equal. What would happen to your answers to (a)–(d) if the first disc is not replaced
before the second is chosen?
X3.13 Five cards are labelled A1, B2, C3, D3, and E3 respectively. A card is selected at random
and then a second is selected before the first is replaced. (a) Show by listing
the sample space that there are just 20 possible outcomes. (b) Find the following
probabilities: (i) the first card chosen is A1; (ii) the second card chosen is A1; (iii) the
card A1 is chosen; (iv) the letter on the cards are adjacent in the alphabet; (v) the sum
of the numbers on the cards is odd; and (vi) the sum of the numbers on the cards is 5.
X3.14 A sample of 50 married women was asked how many children they had in their family.
The results are presented in Table 3.6.

Number of children 0 1 2 3 4 5+
Number of families 6 14 13 9 5 3

Table 3.6

Estimate the probability that if any married woman is asked the same question, she will
answer: (a) none; (b) between 1 and 3 inclusive; (c) more than 3; (d) neither 3 nor 4; and (e) less
than 2 or more than 4.

3.4 The probability laws


In the last section we informally introduced some of the laws of probability: (a) the prob-
ability of an event occurring lies between 0 and 1 (0 ≤ P(E) ≤ 1); (b) if a sample space com-
prises ‘n’ mutually exclusive events, then ΣP(En) = 1; (c) for any two mutually exclusive
events A and B within a sample space, then P(A or B) = P(A) + P(B); and (d) for any event
within a sample space, P(NOT E) = P(E') = 1 − P(E).

Example 3.7
An experiment consists of tossing three coins. Let events A, B, C, and D represent the events
obtained: three heads, obtained three tails, obtained only two heads, and obtained only two
tails respectively. Figure 3.2 illustrates the sample space for this experiment and the four mutu-
ally exclusive events.
Figure 3.2 Sample space for tossing three coins, showing the four mutually exclusive events A = {HHH}, B = {TTT}, C = {HHT, HTH, THH}, and D = {TTH, THT, HTT}

From Figure 3.2 we have: P(A) = 0.125, P(B) = 0.125, P(C) = 0.375, and P(D) = 0.375. As the
four mutually exclusive events exhaust the sample space then P (A) + P(B) + P(C) + P(D) = 1.0.
As A and B are mutually exclusive, then P(A or B) = P(A) + P(B) = 0.25. Similarly, P(A or B or
C) = P(A) + P(B) + P(C) = 0.625. As P (D) = 0.375, then P(D’) = 1 − P(D) = 1 − 0.375 = 0.625.

3.5 The general addition law


In the above example we demonstrated the addition law for mutually exclusive events,
P(A or B) = P(A) + P(B). When events are not mutually exclusive, for example when two or more events contain common outcomes within a sample space, this law does not hold.

Example 3.8
To illustrate this case consider a sample space consisting of the positive integers from 1 through
10. Let event A represent all odd integers and event B represent all integers less than or equal
to 5. These two events within the sample space are displayed in Figure 3.3.

Figure 3.3 Venn diagram for the sample space of integers 1–10: event A (odd integers) contains 7 and 9 plus the shared outcomes 1, 3, and 5; event B (integers ≤ 5) contains 2 and 4 plus the shared outcomes; 6, 8, and 10 lie outside both events

Addition law for mutually exclusive events Addition law for mutually exclusive events is a result used to determine the probability that event A or event B occurs, but both events cannot occur at the same time.

From Figure 3.3 we note that events A and B overlap with common sample points present. This would be represented by the event {odd integers and integers ≤ 5}.

Note It is important to note that when we ask for the probability of events A and B
occurring then this is written as P(A and B). Furthermore, you may see in certain information
sources that the mathematical operator (or symbol) ∩ may be used instead of ‘and’. This
implies that P(A and B) means the same as P(A ∩ B).

The event {A or B} contains the outcomes of either odd integers or integers < 5. A little
thought would indicate that the number containing event A or B is given by the equation
n{A or B} = n{A} + n{B} – n{A and B}. Consequently, by transforming the events into prob-
abilities the general addition probability law is given by equation (3.3).

P(A or B) = P(A) + P(B) − P(A and B) (3.3)

Note If two events are mutually exclusive, P (A and B) = 0.

Example 3.9
A card is chosen from an ordinary pack of cards. Write down the probabilities that the card is:
(a) black and an ace, (b) black or an ace, and (c) neither black nor an ace. Let events A and B represent the events of obtaining an ace card and a black card, respectively. The sample space
is represented by Figure 3.4.

Figure 3.4 Venn diagram for a pack of 52 cards: the 2 red aces lie only in A (aces), the 2 black aces lie in both A and B, 24 black non-ace cards lie only in B (black cards), and the remaining 24 red non-ace cards lie outside both events

General addition probability law General addition probability law is a result used to determine the probability that event A or event B occurs or both occur.

(a) P(B and A) = Number of outcomes in A and B / Total number of outcomes = 2/52 = 0.0385

(b) P(B or A) = P(B) + P(A) − P(B and A) = 26/52 + 4/52 − 2/52 = 28/52 = 0.538462

(c) P(neither B nor A) = 1 − P(B or A) = 1 − 28/52 = 0.4615

Student exercises
X3.15 For each question indicate whether the events are mutually exclusive: (a) thermometers
are inspected and rejected if any of the following are found: poor calibration; inability
to withstand extreme temperatures without breaking; and not within specified size
tolerances; and (b) a manager will reject a job applicant for any of the following
reasons: lack of relevant experience, slovenly appearance, too old.
X3.16 Consider two events, A and B, of an experiment, neither of which is empty. Display this information in a Venn diagram and shade the area representing the event {A or B'}.
X3.17 Consider two events, A and B, where the associated probabilities are as follows:
P(A or B) = 3/4, P(B) = 3/8 and n(A) = 4. Calculate P(A and B) if the total sample size is
eight.
X3.18 A survey shows that 80% of all households have a colour television and 30% have
a microwave oven. If 20% have both a colour television and a microwave, what
percentage has neither?
X3.19 In a group of 50 students, 30 study French or German. If 20 study French and 15 study
German find the probability that a student studies French and German.

3.6 Conditional probability


We will now develop the multiplication law of probability by considering the concepts of
conditional probability and statistical independence.
Consider the differences between choosing an item at random from a lot, with and without replacement. Let us say we have 100 items, of which 20 are defective and 80 are not defective, from which two are selected. Let A = event first item defective and B = event second item defective. If an item is replaced after the first selection the number of items remains constant and P(A) = P(B) = 20/100 = 0.2. If the first item is not replaced then what happens to the value of these probabilities?
The probability of event A is still equal to 0.2. In order to determine P(B) we need to know the composition of the lot at the time of the second selection. By not replacing the first item, the total number of items has been reduced to 99; if the first item is found to be defective then 19 defective items will remain. Thus, the probability of event B occurring, P(B), will now be conditional on whether event A has occurred. This we denote as P(B/A), the conditional probability of the event B given that A has occurred. Thus, for this example P(B/A) = 19/99 = 0.1919. In effect, we are computing P(B) with respect to the reduced sample space of A. The following example will make this clear, from which we will develop the multiplication law.

Multiplication law Multiplication law is a result used to determine the probability that two events, A and B, both occur.
Conditional probability Conditional probability is the probability of an event occurring given that another event has already occurred.
Statistical independence Two events are independent if the occurrence of one of the events gives us no information about whether or not the other event will occur.

Example 3.10
Of a group of 30 students, 15 are blue-eyed {B}, 5 are left-handed {L}, and 2 are both blue-eyed
and left-handed {B and L}.
The sample space is represented in Figure 3.5.

Figure 3.5 Venn diagram for the 30 students: 13 are only blue-eyed (B), 2 are both blue-eyed and left-handed (B and L), 3 are only left-handed (L), and the remaining 12 students are neither

Picking one student at random the probabilities would be as follows: P(L) = 5/30; P(B) = 15/30
and P(L and B) = 2/30. If we know that a student is blue-eyed then our sample space will be
reduced to 15 students, of which 2 are left-handed. Thus, P(L/B) = number in {L and B}/number
in {B}. Dividing top and bottom by the total sample space gives:

P(L/B) = P(L and B)/P(B) = (2/30)/(15/30) = 2/15 = 0.13333…

In general, if we have two events, A and B, then the probability of event A given that event
B has occurred is given by equation (3.4).

P(A/B) = P(A and B)/P(B) (3.4)

This general result can be converted to give the multiplication law for joint events and is
given by equation (3.5).

P(A and B) = P(A/B) * P(B) (3.5)

Example 3.11
Consider two events A and B which contain all sample points with P(A and B) = 1/4 and
P(A/B) = 1/3. Calculate (a) P(B), (b) P(A), and (c) P(B/A).
Probability of event A given that event B has occurred See Conditional probability.
Multiplication law for joint events See Multiplication law.

(a) P(B)? From equation (3.4) we have P(B) = P(A and B)/P(A/B) = (1/4)/(1/3) = 3/4. Therefore, the probability of event B occurring is 0.75 or 75%.
(b) P(A)? Because A and B exhaust the sample space, P(A or B) = 1.0. From the addition law, P(A or B) = P(A) + P(B) – P(A and B). Thus, P(A) = P(A and B) + P(A or B) – P(B) = 1.0 + 0.25 – 0.75 = 0.5. Thus, the probability of event A occurring is 0.5 or 50%.

(c) P(B/A)? P(A and B) is the same as P(B and A). Thus, from the multiplication law,
P(B and A) = P(B/A) * P(A). Re-arranging this equation gives P(B/A) = P(B and
A)/P(A) = 0.25/0.5 = 0.5. Thus, the probability that event B occurs given that event A has
already occurred is 0.5 or 50%.

Example 3.12
An office is due to be modernized with new office equipment. To aid the office manager a
survey has been undertaken to identify the following information: (a) the number of laptops,
(b) the number of desktop computers, and (c) whether the computers are old or new. The data
collected is provided in Table 3.7.

Laptops, L Desktops, D Totals


New, N 40 30 70
Old, O 20 10 30
Totals 60 40 100

Table 3.7

If a person picks one computer at random, calculate the following probabilities: (a) the com-
puter is new; (b) the computer is a laptop; and (c) the computer is new given that it is a lap-
top. Parts (a) and (b) deal with distinct, mutually exclusive events within the full sample space.
Hence, P(N) = 70/100 = 0.70 and P(L) = 60/100 = 0.60. In part (c) we are dealing with the
conditional probability P(N/L). By considering the reduced sample space L (60 laptops) then
P(N/L) = 40/60 = 0.66’ or by considering the definition of conditional probability P(N/L) = P(N
and L)/P(L) = (40/100)/(60/100) = 0.66’. Both methods will give us the same answer, 66 2/3%,
for the probability that it is new given it is a laptop.

Example 3.13
A box contains 6 red and 10 black balls. What is the probability that if three balls are cho-
sen one at a time without replacement that they are all black? Let B1 = Event first draw black,
B2 = Event second draw black, and B3 = Event third draw black. In this example we are deter-
mining the probability that all three balls chosen are black (P(B1 and B2 and B3)). On the first
draw P(B1) = 10/16. On the second draw the sample space has been reduced to 15 balls and
given the condition that the first ball is black then P(B2/B1) = 9/15. On the third draw the sam-
ple space has been reduced to 14 balls and, given the condition that the first and second balls
are black, then P(B3/(B2 and B1)) = 8/14. Thus, P(B1 ∩ B2 ∩ B3) = P(B1) * P(B2/B1) * P(B3/(B2 ∩
B1)) = (10/16) * (9/15) * (8/14) = 0.2143. Therefore, the probability that all three balls are black
when no replacement occurs is 0.2143 or 21.4%.
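Excel can confirm 'without replacement' calculations of this kind. A minimal sketch using built-in functions (no particular cell layout assumed):

P(3 black) Formula: =COMBIN(10,3)/COMBIN(16,3) (ways of choosing 3 of the 10 black balls, divided by ways of choosing any 3 of the 16 balls)
P(3 black) Formula: =HYPGEOM.DIST(3,3,10,16,FALSE) (Excel 2010 hypergeometric function: 3 black balls in a sample of 3, drawn from 10 black balls in a population of 16)

Both formulas return 0.2143, agreeing with the step-by-step conditional probability calculation.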

Student exercises
X3.20 A bowl contains three red chips and five blue chips. Two chips are drawn successively,
at random and without replacement. Calculate the probability that the first chip drawn
is red and the second blue.
X3.21 Two events, D and E, are found to have the following probability relationships:
P(D) = 1/3, P(E) = 1/4, and P(D or E) = 1/2. Calculate the following probabilities: (a) P(D
and E), (b) P(D/E), and (c) P(E/D).
X3.22 Two events A and B are found to have the following probability relationships:
P(A) = 1/3, P(B) = 1/2, and P(A or B) = 3/4. Calculate the following probabilities: (a)
P(A/B), (b) P(B/A), (c) P(B’/A’), and (d) P(A’/B’).
X3.23 A bag contains four red counters and six black counters. A counter is picked at
random from the bag and not replaced. A second counter is then picked. Calculate the
following probabilities: (a) the second counter is red, given that the first is red; (b) both
the counters are red; and (c) the counters are of different colours.
X3.24 The Gompertz Oil Company drills for oil in old oil fields that large companies have
stated are uneconomic. The decision to drill will depend upon a number of factors,
including the geology of the proposed sites. Drilling experience shows that there is
a 0.40 probability of a type A structure present at the site given a productive well. It
is also known that 50% of all wells are drilled in locations with a type A structure and
30% of all wells drilled are productive. Use the information provided to answer the
following questions: (a) What is the probability of a well drilled in a type A structure
and being productive? (b) What is the probability of having a productive well at the
location if the drilling process begins in a location with a type A structure? and (c) Is
finding a productive well independent of the type A structure?

3.7 Statistical independence


Independent events Two events are independent if the occurrence of one of the events has no influence on the occurrence of the other event.
Multiplication law for independent events The multiplication law for independent events states that the chance that two independent events both happen simultaneously is the product of the chances that each occurs individually, e.g. P(A and B) = P(A)*P(B).

We have already considered mutually exclusive events, such that events cannot occur at the same time. We have also noted that in some situations knowing that one event has occurred yields information that will affect the probability of another event. There will be many situations where the converse is true. For example, rolling a die twice, knowing that a six resulted on the first roll cannot influence the outcome of the second roll. Similarly, take the example of picking a ball from a bag. If it were replaced before another was picked nothing changes; the sample space remains the same. Drawing the first ball and replacing it cannot affect the outcome of the next selection. In these examples we have the notion of independent events. If two (or more) events are independent then the multiplication law for independent events is given by equation (3.6).

P(A and B) = P(A) * P(B) (3.6)

From which we can deduce that for independent events P(A/B) = P(A) and similarly
P(B/A) = P(B).

Note The terms independent and mutually exclusive are different concepts. If A and B
are events with non-zero probabilities, then we can show that for P(A and B):

• if events A and B are mutually exclusive, then P(A and B) = 0. Mutually exclusive events can-
not occur at the same time. For example, the two events ‘my favourite football team lost a
match’ and ‘my favourite football team won the same match’ are mutually exclusive events;
• if two events A and B are independent, then P( A and B) ≠ 0. The outcome of event A has no
effect on the outcome of event B. For example, the two events ‘it rained in Paris’ and ‘my car
broke down in London’ are independent events. When calculating the probabilities for inde-
pendent events you multiply the probabilities. You are effectively asking what the chance is
of both events happening, bearing in mind that the two were unrelated.

So, if events A and B are mutually exclusive, they cannot be independent. If events A and B
are independent, they cannot be mutually exclusive.

Example 3.14
Suppose a fair die is tossed twice. Let event A represent the event first die shows an even num-
ber and event B represent the event second die shows a five or six. Events A and B are intui-
tively unrelated and are, therefore, independent events. Thus, the probability of A occurring
is P(A) = 3/6 = 1/2 and the probability of event B occurring is P(B) = 2/6 = 1/3. Thus, P(A and
B) = P(A) * P(B) = (1/2) *(1/3) = 1/6. Thus, the probability of events A and B occurring together
is 1/6.

Example 3.15
Three marksmen take part in a shooting contest. Their chances of hitting the ‘bull’ are 1/2, 1/3,
and 1/4 respectively. If they fire simultaneously what are the chances that only one bullet will
hit the bull? Let event A, B, C represent the event that the first man hits the bull, the second
man hits the bull, and the third man hits the bull, respectively, with the following probabilities:
P(A) = 1/2; P(B) = 1/3; P(C) = 1/4. The probability problem can be written as follows:
P(only one bull hit) = P(A and B’ and C’ OR A’ and B and C’ OR A’ and B’ and C)
P(only one bull hit) = P(A and B’ and C’) + P(A’ and B and C’) + P(A’ and B’ and C)
P(only one bull hit) = 1/2 * 2/3 * 3/4 + 1/2 * 1/3 * 3/4 + 1/2 * 2/3 * 1/4 = 1/4 + 1/8 + 1/12
P(only one bull hit) = 11/24.

Thus, the probability that one bull is hit between the three marksmen is 11/24 or 45.83%.
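The calculation is easily verified with a single worksheet formula, using complements (1 − p) for the marksmen who miss. A minimal sketch, assuming (our own layout) the three hit probabilities 1/2, 1/3, and 1/4 are entered in cells A1, A2, and A3:

P(only one bull hit) Formula: =A1*(1-A2)*(1-A3)+(1-A1)*A2*(1-A3)+(1-A1)*(1-A2)*A3

This returns 0.458333…, i.e. 11/24.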

Note In the solution we have used the notation A’, B’, and C’. This notation represents
the event that the event does not occur, for example A’ would represent the event that event
A does not occur.

Example 3.16
Two football teams A and B are disputing the historical data of who is likely to win. To settle the
dispute the following probability data presented in Table 3.8 has been collected which meas-
ures the probability of each team scoring 0, 1, 2, or 3 goals. Calculate the probability: (a) that
team A wins, (b) that the teams draw, and (c) that team B wins.

Number of goals scored


0 1 2 3
Team A 0.3 0.3 0.3 0.1
Team B 0.2 0.4 0.3 0.1

Table 3.8

To solve this problem we need to find the total sample space. There are 16 possible results
(events) given the scores in Table 3.8, each of which is mutually exclusive. We will look at these
in a joint probability table assuming independence, i.e. this means that team A scoring does not
influence team B scoring (Table 3.9).

                   Team A scores
                   0      1      2      3
Team B scores  0   0.06   0.06   0.06   0.02
               1   0.12   0.12   0.12   0.04
               2   0.09   0.09   0.09   0.03
               3   0.03   0.03   0.03   0.01

Table 3.9

As the events are mutually exclusive then the probabilities are as follows:
P(A wins) = 0.06 + 0.06 + 0.02 + 0.12 + 0.04 + 0.03 = 0.33
P(Draw) = 0.06 + 0.12 + 0.09 + 0.01 = 0.28
P(B wins) = 1 − {P(A wins) + P(Draw)} = 1 − {0.33 + 0.28} = 0.39

From these results we can see that team B has the greater chance of winning a game.
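The joint probability table and the match outcome probabilities can also be produced with worksheet formulas. A minimal sketch, assuming (our own layout) team A's four scoring probabilities are in B2:E2 and team B's in B3:E3:

Joint probability cell (e.g. A scores 0 and B scores 0) Formula: =B$2*B$3 (independence means each Table 3.9 cell is the product of the two marginal probabilities)
P(Draw) Formula: =SUMPRODUCT(B2:E2,B3:E3) (multiplies the matching goal probabilities and sums them: 0.06 + 0.12 + 0.09 + 0.01 = 0.28)

P(A wins) and P(B wins) then follow by summing the appropriate cells of the joint table, as above.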

Student exercise
X3.25 A dart is thrown at a board and is equally likely to land in any one of eight squares
numbered 1–8 inclusive. Let A = Event dart lands in square 5 or 8; B = Event dart

lands in square 2, 3, or 4; and C = Event dart lands in square 1, 2, 5, or 6. From this


information calculate the following probabilities: (a) P(A ∩ B), (b) P(A and C), (c) P(B
and C), (d) P(A/B), (e) P(B/C), and (f) P(C/A). Which two events are mutually exclusive?
Which two events are statistically independent?

3.8 Probability tree diagrams


Probability tree diagrams provide a visual aid to help you solve complicated probability
problems.

Example 3.17
A bag contains three red and four white balls.
If one ball is taken at random and then replaced, and another ball is taken calculate the fol-
lowing probabilities:

(a) P(first ball red and second ball red)


(b) P( just one red)
(c) P(second ball white).

Figure 3.6 displays the experiment in a tree diagram.


Each branch of the tree indicates the possible result of a draw and associated probabilities.

Figure 3.6 Tree diagram for two draws with replacement: the first ball is R1 (probability 3/7) or W1 (4/7), and in either case the second ball is R2 (3/7) or W2 (4/7)

Multiplying along the branches provides the probability of a final outcome:

(a) P(first ball red and second ball red)


P(first ball red and second ball red) = P(R1) * P(R2)
P(first ball red and second ball red) = 3/7 * 3/7
P(first ball red and second ball red) = 9/49

(b) P( just one Red)


P( just one Red) = P(R1 and W2 or W1 and R2)
P( just one Red) = P(R1) * P(W2) + P(W1) * P(R2)
P( just one Red) = 3/7 * 4/7 + 4/7 * 3/7
P( just one Red) = 24/49

(c) P(second ball white)


P(second ball white) = P(R1 and W2 or W1 and W2)
P(second ball white) = P(R1 and W2) + P(W1 and W2)
P(second ball white) = 3/7 * 4/7 + 4/7 * 4/7
P(second ball white) = 28/49 = 4/7

Student exercises
X3.26 Each month DINGO Ltd receives a shipment of 100 parts from its supplier, which
will be checked on delivery for defective parts. Historically, the average number
of defective parts was 5. The new quality assurance procedure involves randomly
selecting a sample of three items (without replacement) for inspection. If more than
one of the sample is defective the order is returned. What proportion of shipments
might be expected to be returned?
X3.27 Susan takes examinations in Mathematics, French, and History. The probability that
she passes Mathematics is 0.7; the corresponding probabilities for French and History
are 0.8 and 0.6. Given that her performances in each subject are independent, draw
a tree diagram to show the possible outcomes. Use this tree diagram to calculate the
following probabilities: (a) fails all three examinations and (b) fails just one examination.

3.9 Introduction to probability distributions


We have already stated that the concept of relative frequency is one way to interpret prob-
ability. Relative frequency (rf ) represents the proportion of successful outcomes and is
defined by equation (3.7).

rf = Number of successful outcomes / Total number of attempts (3.7)

Frequency definition of probability Frequency definition of probability defines an event's probability as the limit of its relative frequency in a large number of trials.

An alternative view is to say that the relative frequency (or proportion) represents an estimate of the probability of the stated occurrence occurring within the total number of attempts (or experiment sample space). In this case equation (3.8) provides a frequency definition of probability which is known as the experimental (or empirical) approach.
P(X) = Number of successful outcomes / Total number of attempts (3.8)

Example 3.18
Consider the situation where a financial analyst collects data pertaining to the sales of a par-
ticular type of fridge freezer. From the data he is interested in the probability that this type of
fridge freezer will be sold in a particular region, which he will then use to produce sales estimates
for the next 12 months. From the data he finds that this region sold 230 out of a national num-
ber sold of 1670. From equation (3.8) the relative frequency, or proportion, or probability of
this type of fridge being sold is P(X) = 230/1670 = 0.137725 or 13.8%. This can then be used
within the sales forecast plan, as will be outlined when discussing expectation in Section 3.10.

Example 3.19
To illustrate the idea of a probability distribution, consider the following frequency distribution
representing the mileage travelled by 120 salesmen described in Chapters 1 and 2, as presented
in Table 3.10.

Miles travelled Frequency, f


400–419 12
420–439 27
440–459 34
460–479 24
480–499 15
500–519 8

Table 3.10

From Table 3.10 we can calculate the relative frequency for each class.

Figure 3.7 illustrates the calculation process.

Figure 3.7

We observe from Figure 3.7 that the relative frequency for 440–459 miles travelled is
0.283333. This implies that we have a chance, or probability, of 34/120 that the miles trav-
elled lies within this class.

➜ Excel solution
Mileage data: Cells B4:B9 Values
Frequency, f Cells C4:C9 Values
Relative frequency Cell D4 Formula: =C4/$C$11
Copy formula from D4:D9
Total f Cell C11 Formula: =SUM(C4:C9)
Total RF Cell D11 Formula: =SUM(D4:D9)

❉ Interpretation Thus, relative frequencies provide estimates of the probability for that
class, or value, to occur. If we were to plot the histogram of relative frequencies we would, in
fact, be plotting out the probabilities for each event, for example P(400 − 420 miles) = 0.10,
P(420 − 440 miles) = 0.225.

The distribution of probabilities given in Table 3.10 and graphically represented in


Figure 3.8 is a different way of illustrating the probability distribution.
Whereas for the frequency distribution the area under the histogram is proportional to
the total frequency, for the probability distribution the area is proportional to total proba-
bility (= 1.0). Given a particular probability distribution we can determine the probability
for any event associated with it. Thus, P(400 – 459 miles) = P(400 ≤ x ≤ 459) = Area under
the distribution from 400 to 460 = P(400 – 419) + P(420 – 439) + P(440 – 459) = 0.10 +
0.225 + 0.283 = 0.608. Thus, we have a probability estimate of 61% for the mileage travelled
to lie between 400 and 459 miles.
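In the worksheet of Figure 3.7 this estimate can be obtained directly from the relative frequency cells of the Excel solution above: the formula =SUM(D4:D6) adds the relative frequencies of the 400–419, 420–439, and 440–459 classes to give 0.608.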

Note
If in Figure 3.8 we decreased the class width towards zero and increased the
number of associated bars observable then Figure 3.8 would approximate to a curve—the
probability distribution curve.

Figure 3.8 Mileage travelled by 120 salesmen: relative frequency (probability) plotted for each mileage class from 400–419 to 500–519

3.10 Expectation and variance for a probability


distribution
The value of the mean and standard deviation can be calculated from the frequency dis-
tribution by using equations (2.3) and (2.13) to give a mean value of 454.0 and a standard
deviation of 27.38 miles travelled. By using relative frequencies to determine the mean
we have, in fact, found the mean of the probability distribution. The mean of a probability
distribution is called the expected value, E(X), and can be calculated from equation (3.9).

E(X) = ∑X × P(X) (3.9)

Further thought along the lines used in developing the notion of expectation would
reveal that the variance of the probability distribution, VAR(X), can be determined from
equation (3.10).

VAR(X) = ∑X² × P(X) − [∑X × P(X)]² (3.10)


From equation (3.10) the standard deviation can be calculated using the relationship
given in equation (3.11).

SD(X) = √VAR(X) (3.11)
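Equations (3.9)–(3.11) map directly onto Excel's SUMPRODUCT function. A minimal sketch, assuming (our own layout) the values X are in A2:A7 and the probabilities P(X) in B2:B7:

E(X) Formula: =SUMPRODUCT(A2:A7,B2:B7)
VAR(X) Formula: =SUMPRODUCT(A2:A7^2,B2:B7)-SUMPRODUCT(A2:A7,B2:B7)^2
SD(X) Formula: =SQRT(SUMPRODUCT(A2:A7^2,B2:B7)-SUMPRODUCT(A2:A7,B2:B7)^2)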

Example 3.20
Returning to the miles travelled by salesmen we can easily calculate the mean number of miles
travelled and the corresponding measure of dispersion, as illustrated in Figures 3.9–3.11.

Figure 3.9
LCB, lower class boundary; UCB, upper class boundary.

Figure 3.10

Figure 3.11

➜ Excel solution
Mileage travelled Cells A5:A10 Values
Frequency, f Cells B5:B10 Values
LCB Cells C5:C10 Values
UCB Cells D5:D10 Values
Class mid-point Cells E5 Formula: =(C5+D5)/2
Copy formula from E5:E10
Relative frequency Cell G5 Formula: =B5/$C$14
Copy formula from G5:G10
X*P(X) Cell I5 Formula: =E5*G5
Copy formula from I5:I10
X2*P(X) Cell K5 Formula: =E5^2*G5
Copy formula from K5:K10
N = Σf = Cell C14 Formula: =SUM(B5:B10)
ΣXP = Cell C15 Formula: =SUM(I5:I10)
ΣX2P = Cell C16 Formula: =SUM(K5:K10)
Mean = Cell C17 Formula: =C15
Variance = Cell C18 Formula: =C16−C17^2
Standard Deviation = Cell C19 Formula: =C18^0.5

❉ Interpretation From Excel, the expected value is 454 miles travelled with a standard
deviation of 27.38 miles travelled.

Note Rearranging equation (2.3) to give the expected value (mean) represented by
equation (3.9):

E(X) = ∑fX/∑f = ∑X × (f/∑f) = ∑X × P(X)

Rearranging equation (2.14) to give the variance value represented by equation (3.10):

VAR(X) = ∑f(X − X̄)²/∑f = ∑[X − E(X)]² × P(X) = ∑X² × P(X) − [∑X × P(X)]²

Example 3.21
Consider the problem of a stall at a fete running a game of chance. The game consists of a
customer taking turns to choose three balls from a bag that contains 3 white and 17 red balls
without replacement. For a customer to win he/she would have to choose 3 white, 2 white, or
1 white with winnings of €5, €2, and €0.50 respectively. On the day of the fete 2000 customers
tried the game.
How much money might be expected to have been paid out to each customer?

To solve this problem we first need to calculate the associated probabilities of choosing
3, 2, 1, and 0 white balls; a tree diagram (see Figure 3.12) visually enables identification of
these probabilities.
The final stage consists of calculating the associated expected value given we know
what the winnings are for 3, 2, 1, and 0 white balls.

Figure 3.12 Tree diagram for drawing three balls without replacement: the first draw gives W1 with probability 3/20 or R1 with 17/20, and the branch probabilities adjust at each later draw (e.g. P(W2/W1) = 2/19 and P(W3/(W1 and W2)) = 1/18), leading to the outcomes 3, 2, 1, or 0 white balls

From the tree diagram illustrated in Figure 3.12 we can identify the different routes that we
can achieve 3, 2, 1, and 0 white balls. The probability of 3 whites is P(3 White) = P(1st White
and 2nd White and 3rd White) = 3/20 * 2/19 * 1/18 = 0.0009. By a similar process: P(only 2
White) = 0.0447, P(only 1 White) = 0.3579, and P(no white) = 0.5965. The probability distri-
bution for the expected winnings can now be constructed and is illustrated in Figure 3.13.

Figure 3.13

➜ Excel solution
Number of white balls Cells B4:B7 Values
Amount won, X Cells C4:C7 Values
Probability, P(X) Cells D4:D7 Values
X*P(X) Cells E4 Formula: =C4*D4
Copy formula from E4:E7
E(X) Cell E9 Formula: =SUM(E4:E7)
Total Cell E10 Formula: =2000*E9

❉ Interpretation From Excel, we observe that the expected winnings for each game
played is E(X) = ΣX * P(X) = 0.27285 (or €0.27 to the nearest cent). Given that we have 2000
players (or games played), the total winnings are N * E(X) = 2000 * 0.27285 = €545.70 to the nearest cent.

Example 3.22
A company manufactures and sells product Xbar. The sales price of the product will be €6 per
unit, and estimates of sales demand and variable costs of sales are as presented in Tables 3.11
and 3.12.

Probability Sales demand


0.3 5000
0.6 6000
0.1 8000

Table 3.11

Probability Variable cost per unit, €


0.1 3.0
0.3 3.5
0.5 4.0
0.1 4.5

Table 3.12

The unit variable costs are not conditional on the volume of sales demand and fixed
costs are estimated to be €10,000. What is the expected profit? The expected profit can be
calculated if we realize that profit is determined from equation (3.12).

Profit = Sales – Variable Costs – Fixed Costs (3.12)

Table 3.13 illustrates the calculation of the expected sales demand using equation (3.9),
with the probability distribution employed to calculate the column statistics.

Probability, P Sales Demand, X XP


0.3 5000 1500
0.6 6000 3600
0.1 8000 800
Total = 5900

Table 3.13

The expected demand is E(X) = ∑ X × P(X) = 5900 units.



Table 3.14 illustrates the calculation of the expected value of the variable cost per unit
using equation (3.9), with the probability distribution employed to calculate the column
statistics.

Probability, P Variable Cost per unit (€), X XP


0.1 3 0.30
0.3 3.5 1.05
0.5 4 2.00
0.1 4.5 0.45
Total = 3.80

Table 3.14

The expected value per unit is E(X) = ∑ X × P(X) = €3.80.


From these calculations we can now calculate the overall value of sales, variable
costs, and expected profit as follows: sales = 5900 * €6 = €35400; variable costs = 5900 *
€3.80 = €22420; fixed costs = €10000; and expected profit E(profit) = sales – variable costs –
fixed costs = €35400 – €22420 – €10000 = €2980. The expected profit is therefore €2980.
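The whole calculation condenses to a few cells. A minimal sketch, assuming (our own layout) Table 3.11 sits with probabilities in A2:A4 and demand in B2:B4, and Table 3.12 with probabilities in D2:D5 and unit costs in E2:E5:

Expected demand Cell B6 Formula: =SUMPRODUCT(A2:A4,B2:B4) (returns 5900)
Expected unit cost Cell E7 Formula: =SUMPRODUCT(D2:D5,E2:E5) (returns 3.80)
Expected profit Formula: =B6*6-B6*E7-10000 (sales less variable and fixed costs; returns 2980)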

Student exercises
X3.28 A bag contains six white and four red counters, three of which are drawn at random
and without replacement. If X can take on the values of 0, 1, 2, 3 red counters,
construct the probability distribution of X. If the experiment was repeated 60 times,
how many times would we expect to draw more than one red counter?
X3.29 In a game you are offered the chance to toss a fair coin until a ‘tail’ appears. If a tail
appears on the first toss you win £2. If the first tail appears on the second toss you
win £4. If the first tail appears on the third toss you win £8. How much should you be
willing to pay to participate in the game if you intend to quit after the third toss, win or
lose?

■ Techniques in practice
TP1 CoCo S.A. are considering putting money into one of two investments, A and B. The
net profits for identical periods and probabilities of success for investments A and B are given
in Table 3.15.

(a) Which investment yields a higher net profit?


(b) Can you make a decision on which investment is better, given this extra information?

TP2 Bakers Ltd currently has 303 shops across the UK. Table 3.16 describes the location of
each shop within one of four regions (SW, SE, NE, and ML and NW) and the level of shop profit.

Probability of Return
Net profits, £ A B
8000 0.0 0.1
9000 0.3 0.2
10,000 0.4 0.4
11,000 0.3 0.2
12,000 0.0 0.1

Table 3.15

As part of Bakers Ltd's financial quality control initiative the company selects a shop at random
and undertakes a visit.

                          Region
Profit, £          SW    SE    NE    ML and NW
Under 8000         12     8    15    22
8000–<80,000       54    34    43    41
Over 80,000        34    12    23     5

Table 3.16

(a) Calculate the probability that the shop chosen to be visited will be in the SW region.
(b) Owing to a careless administrative leak of information the next set of visits are known to
be located in the SW region. What is the probability that a shop in the SW region will
be chosen?
(c) Compare the two probabilities of P(SW) and P(SW/over 80,000). Are these two events
independent or dependent?

TP3 The Skodel Ltd credit manager knows from past experience that if the company accepts
a ‘good risk’ applicant for a £60,000 loan the profit will be £15,000. If it accepts a ‘bad risk’
applicant it will lose £6000. If it rejects a ‘bad risk’ applicant nothing is gained or lost. If it rejects
a ‘good risk’ applicant it will lose £3000 in good will.

(a) Complete the profit and loss table for this situation.

Decision
Accept Reject
Type of Risk Good
Bad

Table 3.17

(b) The credit manager assesses the probability that a particular applicant is a ‘good risk’ is
1/3 and a ‘bad risk’ is 2/3. What would be the expected profits for each of the two decisions?
Consequently, what decision should be taken for the applicant?
(c) Another manager independently assesses the same applicant to be four times as likely to
be a bad risk as a good one. What should this manager decide?
(d) Let the probability of being a good risk be x. What value of x would make the company
indifferent between accepting or rejecting an applicant for a mortgage?

■ Summary

In this chapter we have defined the concept of probability using the idea of relative
frequency. Furthermore, the key terms have been defined, such as experiment, sample
space, laws of probability, and the relationship between a relative frequency distribution
and probability distribution. In the next chapter we will explore probability distributions,
such as the normal distribution, Student’s t distribution, F distribution, binomial
distribution, and Poisson distribution.

■ Key terms
Addition law for mutually exclusive events
Chance
Conditional probability
Cumulative frequency distribution
Empirical approach
Event
Experimental probability approach
Frequency definition of probability
General addition probability law
Independent events
Multiplication law
Multiplication law for independent events
Multiplication law for joint events
Mutually exclusive
Outcome
Probability
Probability of event A given that event B has occurred
Probable
Random experiment
Relative frequency
Sample space
Statistical independence
Uncertainty

■ Further reading
Textbook resources
1. Whigham, D. (2007) Business Data Analysis using Excel. Oxford: Oxford University Press.
2. Lindsey, J. K. (2003) Introduction to Applied Statistics: A Modelling Approach (2nd edn).
Oxford: Oxford University Press.

Web resources
1. StatSoft Electronic Textbook http://www.statsoft.com/textbook/stathome.html (accessed
25 May 2012).
2. HyperStat Online Statistics Textbook http://davidmlane.com/hyperstat/index.html
(accessed 25 May 2012).

3. Eurostat—website is updated daily and provides direct access to the latest and most complete
statistical information available on the European Union (EU), the EU Member States, the
Euro-zone and other countries http://epp.eurostat.ec.europa.eu (accessed 25 May 2012).
4. Economagic—contains international economic data sets http://www.economagic.com
(accessed 25 May 2012).
5. The International Statistical Institute (ISI) glossary of statistical terms provides definitions
in a number of different languages http://isi.cbs.nl/glossary/index.htm (accessed 25 May 2012).
4 Probability distributions

» Overview «
The concept of probability is an important aspect of the study of statistics and within Chapter 3
we introduced the reader to some of the concepts that are relevant to probability. However, the
main emphasis of Chapter 4 is to focus on the concepts of discrete and continuous probability
distributions and not on the fundamentals of probability theory. Initially, we will explore the
issue of continuous probability distributions (normal) and then introduce the concept of
discrete probability distributions (binomial, Poisson). Sections 4.1 and 4.2 will explore the
concept of a probability distribution and introduce two distinct types: (a) continuous and (b)
discrete. Table 4.1 summarizes the probability distributions that are applicable to whether the
data variables are discrete/continuous and whether the distributions are symmetric/skewed.

                                       Variable type
Measured characteristic   Discrete                  Continuous
Shape                     Symmetric    Skewed       Symmetric    Skewed
Distribution              Binomial     Poisson      Normal       Exponential

Table 4.1

» Learning objectives «
On completing this unit you will be able to:

» understand the concept of a probability distribution;

» describe a continuous probability distribution;

» use the normal distribution to calculate the values of a variable that correspond to a particular probability;

» calculate one parameter of the normal distribution if the other parameters are known;

» use the normal distribution to calculate the probability that a variable has a value between
specific limits;

» describe other continuous probability distributions: uniform, Student’s t distribution, and chi-square distribution;

» describe two discrete probability distributions;

» understand when to apply the binomial distribution;

» solve simple problems using both tree diagrams and the binomial formula;

» understand when to apply the Poisson distribution;

» solve simple problems using the Poisson formula;

» understand when to apply a normal distribution approximation to binomial and Poisson probability distribution problems to simplify the solution process;

» solve problems using the Microsoft Excel spreadsheet.

4.1 Continuous probability distributions

4.1.1 Introduction
A random variable is a variable that provides a measure of the possible values obtainable
from an experiment. For example, we may wish to count the number of times that the
number three appears on the tossing of a fair die, or we may wish to measure the weight of
people involved in measuring the success of a new diet programme.

In the first example, the random variable will consist of the numbers: 1, 2, 3, 4, 5, or 6. If
the die was fair then on each toss of the die each possible number (or outcome) will have
an equal chance of occurring. The numbers 1, 2, 3, 4, 5, or 6 represent the random variable
for this experiment. In the second example, the possible number values will represent the
weights of the people participating in the experiment. The random variable in this case
would be the values of all possible weights. It is important to note that in the first example
the values take whole number answers (1, 2, 3, 4, 5, 6)—this is an example of a discrete
random variable.

The second example consists of numbers that can take any value with respect to measured
accuracy (160.4 lb, 160.41 lb, 160.414 lb, etc.) and is an example of a continuous
random variable. In this section we shall explore the concept of a continuous probability
distribution with the focus on introducing the reader to the concept of a normal probability
distribution.

Random variable A random variable is a function that associates a unique numerical value with every outcome of an experiment.
Discrete random variable A discrete random variable is one which may take on only a countable number of distinct values such as 0, 1, 2, 3, 4 . . .
Continuous random variable A continuous random variable is one which takes an infinite number of possible values.
Continuous probability distribution If a random variable is a continuous variable, its probability distribution is called a continuous probability distribution.
Normal distribution The normal distribution is a symmetrical, bell-shaped curve, centred at its expected value.

4.1.2 The normal distribution

When a variable is continuous, and its value is affected by a large number of chance factors,
none of which predominates, then it will frequently appear as a normal distribution.
This distribution does occur frequently and is probably the most widely used statistical
distribution. Some of the real-life variables having a normal distribution can be found,
for example, in manufacturing (weights of tin cans) or can be associated with the human
population (people’s heights). The normal distribution is defined by equation (4.1):

f(X) = (1/(σ√(2π))) × e^(−(1/2) × ((X − µ)/σ)²)    (4.1)

This equation can be represented graphically by Figure 4.1 and illustrates the symmetrical
characteristics of the normal distribution.

Figure 4.1 The normal curve f(x), centred at the mean µ

For the normal distribution the mean, median, and mode all have the same numerical
value.

Note
1. The population mean and standard deviation are represented by the notations µ and σ
respectively.
2. If a variable X varies as a normal distribution then we would state that X ~ N(µ, σ²).
3. The total area under the curve represents the total probability of all events occurring, which
equals 1.0.

To calculate the probability of a particular value of X occurring we would calculate the
appropriate area represented in Figure 4.1 and use Excel (or statistical tables) to find the
corresponding value of the probability.

Example 4.1
A manufacturing firm quality assures components manufactured and historically the length of
a tube is found to be normally distributed with a population mean of 100 cm and a population
standard deviation of 5 cm.
Calculate the probability that a random sample of one tube will have a length of at least
110 cm.
From the information provided we define X as the tube length in centimetres, with population
mean µ = 100 and standard deviation σ = 5. This can be represented using the notation
X ~ N(100, 5²).

The problem we have to solve is to calculate the probability that 1 tube will have a length
of at least 110 cm.
This can be written as P(X ≥ 110) and is represented by the shaded area illustrated in
Figure 4.2.

Figure 4.2 The shaded area P(X ≥ 110) for X ~ N(100, 5²)

This problem can be solved by using the Excel function NORM.DIST(X, µ, σ, TRUE) (note that the third argument is the standard deviation, not the variance).
This function calculates the area illustrated in Figure 4.3.

Figure 4.3 NORM.DIST() returns the cumulative area P(X ≤ 110); the upper tail gives P(X ≥ 110) = 0.02275

Excel solution—Example 4.1


The Excel solution is illustrated in Figure 4.4.
The value of P(X ≥ 110) can be found with the NORM.DIST() function together with
Figure 4.3.

Figure 4.4

➜ Excel solution
Mean = Cell C5 Value
Standard deviation = Cell C6 Value
X = Cell C8 Value
P(X ≤ 110) = Cell C10 Formula: =NORM.DIST(C8,C5,C6,TRUE)
P(X ≥ 110) = Cell C12 Formula: =1−C10

❉ Interpretation From Excel, we observe that the probability that an individual tube
length is at least 110 cm is 0.02275 or 2.3% (P(X ≥ 110) = 0.02275).

Example 4.2
Calculate the probability that X lies between 85 and 105 cm for the problem outlined in
Example 4.1.
In this example we are required to calculate P (85 ≤ X ≤ 105) which represents the area
shaded in Figure 4.5.
The value of P (85 ≤ X ≤ 105) can be calculated using Excel’s NORM.DIST () function.
Figure 4.5 The shaded area P(85 ≤ X ≤ 105) = 0.839995, between X2 = 85 and X1 = 105

Excel solution—Example 4.2

The Excel solution is illustrated in Figure 4.6.

Figure 4.6

From Excel, the NORM.DIST () function can be used to calculate P(85 ≤ X ≤ 105) = 0.839995.

➜ Excel solution
Mean = Cell C5 Value
Standard deviation = Cell C6 Value
X1 = Cell C8 Value
X2 = Cell C9 Value
P(85 ≤ X ≤ 105) = P(X ≤ 105) − P(X ≤ 85)
P(X ≤ 85) = Cell C13 Formula: =NORM.DIST(C8,C5,C6,TRUE)
P(X ≤ 105) = Cell C14 Formula: =NORM.DIST(C9,C5,C6,TRUE)
P(85 ≤ X ≤ 105) = Cell C16 Formula: =C14−C13

❉ Interpretation We observe that the probability that an individual tube length lies
between 85 and 105 cm is 0.839995 or 84.0%.

Student exercise
X4.1 Use the NORM.DIST function to calculate the following probabilities, X ~ N(100, 25):
(a) P(X ≤ 95); (b) P(95 ≤ X ≤ 105); (c) P(105 ≤ X ≤ 115); and (d) P(93 ≤ X ≤ 99). For
each probability identify the region to be found by shading the area on the normal
probability distribution graph.

4.1.3 The standard normal distribution (Z distribution)


If we have two different populations, both following a normal distribution, it could be
difficult to compare them as the units might be different, the means and variances might
be different, and so on. If this is the case, we would like to be able to standardize these
distributions so that we can compare them. This is possible by creating the standard normal
distribution. The standard normal distribution is a normal distribution whose mean is
always 0 and whose standard deviation is always 1. Normal distributions can be transformed
to standard normal distributions by equation (4.2).

Z = (X − µ)/σ    (4.2)

where X, µ, and σ are the variable score value, population mean, and population standard
deviation, respectively, taken from the original normal distribution. Any distribution
can be converted to a standardized distribution using equation (4.2), and the shape of the
standardized version will be the same as the original distribution. If the original was symmetric
then the Z transformed version would still be symmetric, and if the original was
skewed then the Z transformed version would still be skewed. The corresponding probability
density function is given by equation (4.3).

f(Z) = (1/√(2π)) × e^(−Z²/2)    (4.3)

Standard normal distribution A standard normal distribution is a normal distribution with zero mean (µ = 0) and unit variance (σ² = 1).
The advantage of this method is that the Z values are not dependent on the original data
units and this allows tables of Z values to be produced with corresponding areas under the
curve. This allows for probabilities to be calculated if the Z value is known, and vice versa,
which allows a range of problems to be solved.
Figure 4.7 illustrates the standard normal distribution (or Z distribution) with Z scores
between –4 and +4.

Figure 4.7 The standard normal (Z) distribution, f(z), for Z scores between −4 and +4

The Excel function NORM.S.DIST(z, TRUE) returns the probability that the observed value of a
standard normal random variable will be less than or equal to z, as illustrated in Figure 4.8.

Note From calculation we can show that the proportion of values between ± 1, ± 2,
and ± 3 population standard deviations from the population mean of zero is 68%, 95%, and
99.7% respectively.
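These proportions can be verified directly with the NORM.S.DIST () function; a minimal sketch, entered in any blank cells:

=NORM.S.DIST(1,TRUE)-NORM.S.DIST(-1,TRUE) returns 0.6827
=NORM.S.DIST(2,TRUE)-NORM.S.DIST(-2,TRUE) returns 0.9545
=NORM.S.DIST(3,TRUE)-NORM.S.DIST(-3,TRUE) returns 0.9973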

Figure 4.8 The cumulative probability P(Z ≤ z) under the standard normal curve

Example 4.3
Reconsider Example 4.1. If a variable X varies as a normal distribution with a mean of 100 and
a standard deviation of 5, then the value of Z when X = 110 would be given by equation (4.2).
Z = (110−100)/5 = +2

Excel solution—Example 4.3


The value of P(Z ≥ 2) can be calculated using Excel’s NORM.S.DIST () function.
The Excel solution is illustrated in Figure 4.9.

Figure 4.9

➜ Excel solution
Mean = Cell C5 Value
Standard deviation = Cell C6 Value
X = Cell C8 Value
P(X ≤ 110) = Cell C10 Formula: =NORM.DIST(C8,C5,C6,TRUE)
P(X ≥ 110) = Cell C12 Formula: =1−C10
Z = Cell C14 Formula: =(C8−C5)/C6
P(Z ≤ +2) = Cell C15 Formula: =NORM.S.DIST(C14, TRUE)
P(Z ≥ +2) = Cell C16 Formula: =1−C15

This solution can be represented graphically by Figure 4.10.

Figure 4.10 NORM.S.DIST() gives the cumulative area P(Z ≤ 2); the upper tail is P(Z ≥ 2) = 0.02275

From Excel, the NORM.S.DIST () function can be used to calculate P(Z ≥ +2) = 0.02275.

❉ Interpretation
We observe that the probability that an individual tube length is at least 110 cm is 0.02275 or
2.3% (P (X ≥ 110) = P (Z ≥ 2) = 0.02275).

Note
1. This method is used to solve problems using tables of Z values and associated
probabilities.
2. The value of the Z score can be calculated using the Excel function STANDARDIZE ().
3. The Excel function NORM.DIST () calculates the value of the normal distribution for the
specified mean and standard deviation.
4. The Excel function NORM.S.DIST () calculates the value of the normal distribution for the
specified Z score value.

The value of this probability P(X ≥ 110) can be found from critical tables if we convert
P(X ≥ 110) to P(Z ≥ 2) and use the critical tables for the normal distribution provided
in Appendix 2. Table 4.2 illustrates an example of this critical table with the probability
P(Z ≥ 2) identified for a particular value of z.

Z 0.00 0.01 0.02 0.03 0.04


0.0 0.500 0.496 0.492 0.488 0.484
0.1 0.460 0.456 0.452 0.448 0.444
0.2 0.421 0.417 0.413 0.409 0.405
1.8 0.036 0.035 0.034 0.034 0.033
1.9 0.029 0.028 0.027 0.027 0.026
2.0 0.023 0.022 0.022 0.021 0.021
2.1 0.018 0.017 0.017 0.017 0.016

Table 4.2 Extract from the critical table, identifying P(Z ≥ 2) = 0.023

Example 4.4
If we reconsider Example 4.2 and transform the values of X to Z, then the problem can
be solved using the NORM.S.DIST () function.

Excel solution—Example 4.4


The Excel solution is illustrated in Figure 4.11.

➜ Excel solution
Mean = Cell C5 Value
Standard deviation = Cell C6 Value
X1 = Cell C8 Value
X2 = Cell C9 Value
P(85 ≤ X ≤ 105) = P(X ≤ 105) − P(X ≤ 85)
P(X ≤ 85) = Cell C13 Formula: =NORM.DIST(C8,C5,C6,TRUE)
P(X ≤ 105) = Cell C14 Formula: =NORM.DIST(C9,C5,C6,TRUE)

P(85 ≤ X ≤ 105) = Cell C16 Formula: =C14−C13


Z1 = Cell C18 Formula: =(C8−C5)/C6
Z2 = Cell C19 Formula: =(C9−C5)/C6
P(85 ≤ X ≤ 105) = P(Z ≤ 1) − P(Z ≤ −3)
P(Z ≤ −3) = Cell C23 Formula: =NORM.S.DIST(C18, TRUE)
P(Z ≤ 1) = Cell C24 Formula: =NORM.S.DIST(C19, TRUE)
P(85 ≤ X ≤ 105) = Cell C26 Formula: =C24−C23

Figure 4.11

From Excel, the NORM.S.DIST () function can be used to calculate P (85 ≤ X ≤ 105) = P
(− 3 ≤ Z ≤ +1) = 0.839995.
Figure 4.12 illustrates the solution, shaded on the normal distribution.

Figure 4.12 The shaded area P(−3 ≤ Z ≤ 1) = 0.839995, between Z2 = −3 and Z1 = 1

❉ Interpretation We observe that the probability that an individual tube length lies
between 85 and 105 cm is 0.839995 or 84.0%.

Student exercise
X4.2 Use the NORM.S.DIST () function to calculate the following probabilities, X ~ N(100,
25): (a) P(X ≤ 95); (b) P(95 ≤ X ≤ 105); (c) P(105 ≤ X ≤ 115); and (d) P(93 ≤ X ≤ 99). In
each case convert X to Z. Compare with your answers from Exercise X4.1.

Example 4.5
A local authority installs 2000 electric lamps. The life of lamps in hours (X) follows a normal
distribution, where X ~ N (1000, 40,000). Calculate: (a) the number of lamps that might be
expected to fail within the first 700 hours; (b) the number of lamps that may be expected to fail
between 900 and 1300 hours; and (c) after how many hours would we expect 10% of the lamps
to fail? From this information we have a population mean, µ, of 1000 hours and a variance, σ², of
40,000 hours². This problem can be solved using either the NORM.DIST () or NORM.S.DIST ()
Excel functions, as illustrated in Figure 4.13.

Excel solution—Example 4.5


(a) The first part of this problem can be split into two parts: (i) calculate the probability
that one lamp would fail in the first 700 hours, and (ii) calculate the number of lamps
from the 2000 that we would expect to fail in the first 700 hours.
The Excel solution is illustrated in Figure 4.13.

Figure 4.13

This solution is represented graphically by Figure 4.14.


This problem consists of solving P(X ≤ 700). Using the NORM.DIST () function we
find that 133 lamps are expected to fail within the first 700 hours out of the 2000
lamps.

➜ Excel solution
Mean = Cell C5 Value
Variance = Cell C6 Value
Standard deviation = Cell C7 Formula: =SQRT(C6)

X = Cell C9 Value
P(X ≤ 700) = Cell C11 Formula: =NORM.DIST(C9,C5,C7,TRUE)
Z = Cell C13 Formula: =(C9-C5)/C7
P(Z ≤ −1.5) = Cell C14 Formula: =NORM.S.DIST(C13, TRUE)
E(X) = N*P(X ≤ 700) = Cell C15 Formula: =2000*C11

Figure 4.14 The shaded area P(X ≤ 700) = 0.066807

❉ Interpretation From Excel, the NORM.DIST () or NORM.S.DIST () functions can be
used to calculate P(X ≤ 700) = 0.0668. The number of lamps that are expected to fail out of
the 2000 lamps, E(fail) = 2000*P(X ≤ 700) = 133 lamps.

(b) The second part of the problem requires the calculation of the probability that X lies
between 900 and 1300 hours, and the estimation of the number of lamps from 2000
which will fail between these limits.
The Excel solution is illustrated in Figure 4.15

Figure 4.15

This problem consists of solving P(900 ≤ X ≤ 1300). Using the NORM.DIST () function
we find that 1249 lamps are expected to fail between 900 and 1300 hours out of the
2000 lamps.
This solution is represented graphically by Figure 4.16.
Figure 4.16 The shaded area P(900 ≤ X ≤ 1300) = 0.624655, between X1 = 900 and X2 = 1300

➜ Excel solution
Mean μ = Cell C5 Value
Variance σ2 = Cell C6 Value
Standard deviation σ = Cell C7 Formula: =SQRT(C6)
X1 = Cell C9 Value
X2 = Cell C10 Value
P(X ≤ 900) = Cell C12 Formula: =NORM.DIST(C9,C5,C7,TRUE)
P(X ≤ 1300) = Cell C13 Formula: =NORM.DIST(C10,C5,C7,TRUE)
P(900 ≤ X ≤ 1300) = Cell C14 Formula: =C13−C12
E(X) = N*P(900 ≤ X ≤ 1300) = Cell C15 Formula: =2000*C14
Z1 = Cell C18 Formula: =(C9−C5)/C7
Z2 = Cell C19 Formula: =(C10−C5)/C7
P(Z ≤ −0.5) = Cell C21 Formula: =NORM.S.DIST(C18, TRUE)
P(Z ≤ 1.5) = Cell C22 Formula: =NORM.S.DIST(C19, TRUE)
P(−0.5 ≤ Z ≤ 1.5) = Cell C23 Formula: =C22−C21
E(X) = N*P(900 ≤ X ≤ 1300) = Cell C24 Formula: =2000*C23

❉ Interpretation From Excel, the NORM.DIST () or NORM.S.DIST () functions


can be used to calculate P (900 ≤ X ≤ 1300) = 0.624655. The number of lamps
that are expected to fail between 900 and 1300 hours out of the 2000 lamps,
E(fail) = 2000*P(900 ≤ X ≤ 1300) = 1249 lamps.

(c) The final part of this problem consists of calculating the number of hours for the first
10% to fail. This corresponds to calculating the value of x where P(X ≤ x) = 0.1.
This problem can be solved using the NORM.INV () or NORM.S.INV () functions, as
illustrated in Figure 4.17.
From Excel, the NORM.INV () or NORM.S.INV () functions can be used to calculate
the expected number of hours for 10% to fail.
This solution is represented graphically by Figure 4.18.

Figure 4.17

Figure 4.18 Finding the value x for which P(X ≤ x) = 0.1

➜ Excel solution
Mean μ = Cell C5 Value
Variance σ2 = Cell C6 Value
Standard deviation σ = Cell C7 Formula: =SQRT(C6)
P(X = x) = Cell C9 Value
X = Cell C11 Formula: =NORM.INV(C9,C5,C7)
Z = Cell C13 Formula: =NORM.S.INV(C9)
X = μ + Z*σ = Cell C14 Formula: =C5+C13*C7

❉ Interpretation From Excel, the expected number of hours for 10% to fail is 744
hours.

Note
1. This problem corresponds to finding the value of x such that P(X ≤ x) = 10% (or 0.1).
From Excel, we find that P(X ≤ x) = 0.1 corresponds to Z = −1.28. To find x we would
then solve the equation: −1.28 = (X − 1000)/200. Re-arranging this equation gives
X = 1000 + (−1.28)*(200) = 744.
2. The Excel function NORM.INV () calculates the value of X from a normal distribution for
the specified probability, mean and standard deviation.
3. The Excel function NORM.S.INV () calculates the value Z from normal distribution for the
specified probability value.

Student exercises
X4.3 Given that a normal variable has a mean of 10 and a variance of 25, calculate the
probability that a member chosen at random is: (a) ≥ 11, (b) ≤ 11, (c) ≤ 5, (d) ≥ 5, (e)
between 5 and 11.
X4.4 The lifetimes of certain types of car battery are normally distributed with a mean of
1248 days and standard deviation of 185 days. If the supplier guarantees them for 1080
days, what proportion of batteries will be replaced under guarantee?
X4.5 Electrical resistors have a design resistance of 500 ohms. The resistors are produced
by a machine with an output that is normally distributed N(501,9) ohms. Resistances
below 498 ohms and above 508 ohms are rejected. Find: (a) the proportion that will
be rejected; (b) the proportion which would be rejected if the mean was adjusted so
as to minimize the proportion of rejects; (c) how much the standard deviation would
need to be reduced (leaving the mean at 501 ohms) so that the proportion of rejects
below 498 ohms would be halved.

4.1.4 Checking for normality


Normality is a very important concept in business statistics and we shall see that the tests
described in Chapters 6–8 require the population distribution to be either normally or
approximately normally distributed. Checking whether the population is normally distributed
is therefore an important step, and can be achieved by constructing a five-number
summary (or box-and-whisker plot) to test for symmetry, as described in section 2.3,
and/or by constructing a normal probability plot to test for normality. Section 2.3
describes the concept of a five-number summary and corresponding box-and-whisker
plot to evaluate whether the distribution is symmetric:

• for symmetrical distributions the following rule would hold: Q3 − Median = Median −
Q1, Largest value − Q3 = Q1 − smallest value, and Median = Midhinge = Midrange. The
midrange is the average of the largest and smallest data values and the midhinge is
the average of the first and third quartiles;
• for non-symmetry the following rule would hold: right-skewed distributions: Largest
value − Q3 greatly exceeds Q1 − Smallest value, and left-skewed distributions: Q1 −
Smallest value greatly exceeds Largest value – Q3.

In Example 2.15 we were given the first quartile Q1 = 15, minimum = 8, median = 33,
maximum = 88, and third quartile Q3 = 62. From this data we concluded that the data
distribution is not symmetrical (the distance from Q3 to the median (62 − 33 = 29) is not the
same as between Q1 and the median (33 − 15 = 18), the distance from Q3 to the largest value
(88 − 62 = 26) is not the same as the distance between Q1 and the smallest value (15 − 8 = 7),
and the median (33), the midhinge ((62 + 15)/2 = 38.5), and the midrange ((88 + 8)/2 = 48) are
not equal). The summary numbers indicate right skewness because the distance between
Q3 and the largest number (88 − 62 = 26) is longer than the distance between Q1 and the
smallest value (15 − 8 = 7). The minimum and maximum points are identified and enable
identification of any extreme values (or outliers).

Normal probability plot Graphical technique to assess whether the data is normally distributed.

Note A simple rule to identify an outlier (or suspected outlier) is that the largest value –
smallest value (88 – 8 = 80) should be no longer than three times the length of the box (Q3 –
Q1 = 62 – 15 = 47). In this case the value of maximum – minimum is 80 and Q3 – Q1 is 47
and therefore no extreme values are present in the data set.
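If the raw observations themselves are available, the five-number summary and this outlier check can be generated directly in Excel; a minimal sketch, assuming (for illustration only) that the data sit in cells A2:A20:

Minimum Cell C2 Formula: =MIN(A2:A20)
Q1 Cell C3 Formula: =QUARTILE.EXC(A2:A20,1)
Median Cell C4 Formula: =MEDIAN(A2:A20)
Q3 Cell C5 Formula: =QUARTILE.EXC(A2:A20,3)
Maximum Cell C6 Formula: =MAX(A2:A20)
Midhinge Cell C7 Formula: =(C3+C5)/2
Midrange Cell C8 Formula: =(C2+C6)/2
Outlier check Cell C9 Formula: =(C6-C2)>3*(C5-C3)

Cell C9 returns TRUE when the range exceeds three times the box length, flagging a possible extreme value. Note that QUARTILE.EXC () is only one of several quartile conventions; QUARTILE.INC () may return slightly different values of Q1 and Q3.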

A normal probability plot consists of constructing a graph of data values against a corresponding
Z value, where Z is based upon the ordered value.

Example 4.6
The manager at BIG JIMS restaurant is concerned about the time it takes to process credit card
payments at the counter by counter staff.
The manager has collected the following processing time data (time in minutes/seconds)
and requested that the data be checked to see if it is normally distributed (Table 4.3).

Processing credit cards (n = 19)


0.64 0.71 0.85 0.89 0.92
0.96 1.07 0.76 1.09 1.13
1.23 0.76 1.18 0.79 1.26
1.29 1.34 1.38 1.5

Table 4.3

Excel solution—Example 4.6

Figure 4.19 illustrates the Excel solution to Example 4.6.

Figure 4.19

➜ Excel solution
n = Cell C3 Value
Ordered value Cells E4:E22 Values
Area Cell F4 Formula: =1/(C3 + 1)
Cell F5 Formula: =F4+$F$4
Copy formula from F5:F22

Z value Cell H4 Formula: = NORM.S.INV(F4)


Copy formula from H4:H22
Ordered value Cells J4:J22 Values

The method to create the normal probability plot is as follows:

• order the data values (1, 2, 3 . . . n) with 1 referring to the smallest data value and n
representing the largest data value;
• for the first data value (smallest) calculate the cumulative area using the formula: = 1/
(n + 1).
• calculate the value of Z for this cumulative area using the Excel function:
=NORM.S.INV (cumulative area);
• repeat for the other values where the cumulative area is given by the formula: =old
area + 1/(n + 1);
• input data values with smallest to largest value;
• plot data value y against Z value for each data point.

Figure 4.20 illustrates the normal probability curve plot for Example 4.6. We observe
from the graph that the relationship between the data values and Z is approximately a
straight line.

Figure 4.20 Normal probability plot: data value, y, plotted against Z value

For data that is normally distributed we would expect the relationship to be linear. In
this situation we would accept the statement that the data values are approximately nor-
mally distributed.

❉ Interpretation Owing to the fact that the normal probability plot shows more or less
a straight line, we conclude that the data is approximately normally distributed.

Note From Chapter 2 the decision on the symmetry of a distribution is as follows,
together with the shape of the normal probability curve.

(a) Figure 4.21 illustrates a normal distribution where Largest value – Q3 equals Q1 – Smallest value.
Figure 4.21 Normal probability plot—normal curve

In Example 4.6 we have: Largest value – Q3 = 0.18, approximately equal to Q1 – Smallest value = 0.26.

(b) Figure 4.22 illustrates a left-skewed distribution where Q1 – Smallest value greatly exceeds Largest value – Q3.

Figure 4.22 Normal probability plot—left-skew

(c) Figure 4.23 illustrates a right-skewed distribution where Largest value – Q3 greatly exceeds Q1 – Smallest value.

Figure 4.23 Normal probability plot—right-skew

4.1.5 Other continuous probability distributions


A number of other continuous probability distributions will be discussed in Chapters 5–9,
including: Student’s t distribution, chi-square distribution, and F distribution.

1. Student’s t distribution
The Student’s t distribution is a distribution that is used to estimate a mean value
when the population variable is normally distributed but the sample chosen to
measure the population value is small and the population standard deviation is
unknown. It is the basis of the popular Student’s t-tests for the statistical significance
of the difference between two sample means and for confidence intervals for the
difference between two population means.

2. Chi-square distribution
The chi-square distribution (χ² distribution) is a popular distribution that is used
to solve statistical inference problems involving contingency tables and assessing
goodness-of-fit tests between sample data and distributions.

3. F distribution
The F distribution is a distribution that can be used to test whether the ratios of two
variances from normally distributed statistics are statistically different. The test
statistic is defined as F = s1²/s2², where s1² and s2² are the sample 1 and sample 2
variances respectively. The shape of the distribution depends upon the numerator
and denominator degrees of freedom (df1 = n1 − 1, df2 = n2 − 1); the F distribution is
written as a function of n1, n2 as F(n1, n2).

Student’s t distribution The t distribution is the sampling distribution of the t statistic.
Chi-square distribution The chi-square distribution is a mathematical distribution that is used directly or indirectly in many tests of significance.
F distribution The F distribution (also known as the Fisher–Snedecor distribution) is a continuous probability distribution that arises frequently as the null distribution of a test statistic, most notably in the analysis of variance.

Note The normal, Student’s t, and chi-square distributions are special cases of the F
distribution, as follows:

• normal distribution = F(n1 = 1, n2 = infinite) distribution;
• Student’s t distribution = F(n1 = 1, n2) distribution;
• chi-square distribution = F(n1, n2 = infinite) distribution.

Two other continuous probability distributions are the uniform and exponential distributions.
The uniform distribution is used in the generation of random numbers for different
probability distributions, and the exponential probability distribution is important in
the area of queuing theory.

4.1.6 Probability density function and cumulative distribution function
If X is a random variable (continuous or discrete), then the probability density function
(PDF) is a function f(x) such that, for two values of x (say x = a and x = b) with a ≤ b,

PDF = P(a ≤ X ≤ b) = ∫_a^b f(x) dx, where f(x) ≥ 0 for all x.
If we assume that the probability distribution is normal, then Figure 4.24 represents
graphically what area the PDF represents.
Thus, the probability that X takes on a value in the interval (a, b) is the area under the
density function from a to b (see shaded region in Figure 4.24).

Figure 4.24 The probability P(a ≤ X ≤ b) as the area under the density function between a and b
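In Excel terms the shaded area of Figure 4.24 is simply the difference between two cumulative areas; a minimal sketch for the tube-length setting of Example 4.2 (a = 85, b = 105, µ = 100, σ = 5):

P(a ≤ X ≤ b) Formula: =NORM.DIST(105,100,5,TRUE)-NORM.DIST(85,100,5,TRUE)

This returns 0.839995, agreeing with the earlier result.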

Example 4.7
In Example 4.1, we calculated the probability that a tube length will be at least 110 cm and it
is known that the tube length is normally distributed with a population mean and standard
deviation equal to 100 and 5 respectively.
This can be written as X ~ N (100, 52) with the probability problem written as P(X ≥ 110) =
1 − P(X ≤ 110) = 1 − PDF (110, 100, 5, TRUE) = 1 − 0.97725 = 0.02275.
The term P(X ≤ 110) represents the PDF, which is calculated by the Excel NORM.DIST () func-
tion. Figure 4.25 illustrates this graphically.

Figure 4.25 NORM.DIST() finds PDF = P(X ≤ 110) = 0.97725; hence P(X ≥ 110) = 0.02275


The cumulative distribution function (CDF) is a function F(X) of a random variable, X,
which is defined for a number x by F(x) = P(X ≤ x) = ∫_{−∞}^x f(s) ds. That is, for a given value of
x, F(x) is the probability that the observed value of X will be, at most, x.

Example 4.8
In Example 4.1, we found that when X = 110 the value of P(X ≥ 110) = 0.02275. The CDF
calculates the value of X when you know the probability value, as illustrated in Figure 4.26.
Given P(X ≤ 110) = 0.97725, the CDF will calculate the value of x given the PDF = 0.97725.
Therefore, x = CDF (0.97725, 100, 5) = 110, which is calculated by the Excel NORM.INV
function.
Figure 4.26 Given PDF = P(X ≤ 110) = 0.97725, the CDF finds x = 110; the upper tail is P(X ≥ 110) = 0.02275

Cumulative distribution function The cumulative distribution function (CDF), or just distribution function, describes the probability that a real-valued random variable X with a given probability distribution will be found at a value less than or equal to x.
equal to x.
Discrete probability
distribution If a random
4.2 Discrete probability distributions variable is a discrete
variable, its probability
distribution is called
4.2.1 Introduction a discrete probability
distribution.

In this section we shall explore discrete probability distributions when dealing with dis- Binomial distribution A
binomial distribution can
crete random variables. Two specific distributions included are: binomial and Poisson be used to model a range
probability distributions. We will also explore how to approximate one distribution with of discrete random data
variables.
another, if appropriate.
Poisson probability
distribution The Poisson
distribution is a discrete
4.2.2 Binomial probability distribution probability distribution that
expresses the probability of
One of the most elementary discrete random variables—binomial—is associated with a given number of events
occurring in a fixed interval
questions that only allow ‘Yes’ or ‘No’ type answers, or a classification such as male or of time and/or space if
female, or recording a component as defective or not defective. If the outcomes are also these events occur with a
known average rate and
independent, for example, the possibility of a defective component does not influence the
independently of the time
possibility of finding another defective component then the variable is considered to be since the last event.
a binomial variable.

Consider the example of a supermarket that runs a two-week television campaign in
an attempt to increase the volume of trade at the supermarket. During the campaign all
customers are asked if they came to the supermarket because of the television advertising.
Each customer response can be classified as either yes or no. At the end of the campaign
the proportion of customers that responded yes is determined. For this study the experi-
ment is the process of asking customers if they came to the supermarket because of the
television advertising.
The random variable, X, is defined as the number of customers that responded yes.
Clearly, the random variable can assume only the values 0, 1, 2, 3 . . . n, where n is the total
number of customers. Consequently, the random variable is discrete.
Consider the characteristics that define the binomial experiment:

• the experiment consists of n identical trials;
• each trial results in one of two outcomes which, for convenience, we can define as
either a success or a failure;
• the outcomes from trial to trial are independent;
• the probability of success (p) is the same for each trial;
• the probability of failure (q), where q = 1 – p;
• the random variable equals the number of successes in the n trials and can take the
value from 0 to n.

These characteristics define the binomial experiment and are applicable for situations
of sampling from finite populations with replacement or for infinite populations
with or without replacement.

Example 4.9
A marksman shoots three rounds at a target. The probability of getting a ‘bull’ is 0.3. Develop
the probability distribution for getting 0, 1, 2, and 3 bulls. This experiment can be modelled by
a binomial distribution as:
• three identical trials (n = 3);
• each trial can result in either a bull (success) or not a bull (failure);
• the outcome of each trial is independent;
• the probability of a success (P(a bull) = p = 0.3) is the same for each trial;
• the random variable is discrete.

Binomial experiment A binomial experiment is an experiment with a fixed number of independent trials. Each trial has exactly two outcomes and the probability of each outcome in a binomial experiment remains the same for each trial.
Discrete variable A set of data is said to be discrete if the values belonging to it can be counted as 1, 2, 3 . . .

Figure 4.27 illustrates the tree diagram that represents the described experiment.
Let B represent the event that the marksman hits the bull and B′ represent the event that
the bull is missed. The corresponding individual event probabilities are: P(B) = 0.3 and
P(B′) = 1 − P(B) = 1 − 0.3 = 0.7.

Figure 4.27 Tree diagram of the three attempts: each attempt branches into B (bull) and B′ (miss)

From this tree diagram we can identify the possible routes for 0, 1, 2, and 3 bull hits as
follows: P(no bull hit) = P(X = 0 successes) = P(B′B′B′) = 0.7 × 0.7 × 0.7 = (0.7)³ = 0.343.
The important lesson is to note how we can use the tree diagram to undertake a calculation
of an individual probability, but also note the pattern identified in the relationship
between the probability, P, and the individual event probability of success, p, or failure, q.
From Figure 4.27 we observe:

P(no bull hit) = (0.7)³

P(no bull hit) = q³

If we continue the calculation procedure we find:

P(1 bull hit) = P(X = 1 success)
P(1 bull hit) = P(BB′B′ or B′BB′ or B′B′B)
P(1 bull hit) = 0.3 × 0.7 × 0.7 + 0.7 × 0.3 × 0.7 + 0.7 × 0.7 × 0.3
P(1 bull hit) = 0.3 × (0.7)² + 0.3 × (0.7)² + 0.3 × (0.7)²
P(1 bull hit) = 3 × 0.3 × (0.7)²
P(1 bull hit) = 0.441

Therefore, P(1 bull hit) = 3pq²

P(2 bull hits) = P(X = 2 successes)
P(2 bull hits) = P(BBB′ or BB′B or B′BB)
P(2 bull hits) = 0.3 × 0.3 × 0.7 + 0.3 × 0.7 × 0.3 + 0.7 × 0.3 × 0.3
P(2 bull hits) = (0.3)² × 0.7 + (0.3)² × 0.7 + (0.3)² × 0.7 = 3 × (0.3)² × 0.7
P(2 bull hits) = 0.189

Therefore, P(2 bull hits) = 3p²q

P(3 bull hits) = P(X = 3 successes)
P(3 bull hits) = P(BBB)
P(3 bull hits) = 0.3 × 0.3 × 0.3
P(3 bull hits) = (0.3)³ = 0.027

Therefore, P(3 bull hits) = p³
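Each of these tree diagram results can be confirmed with the BINOM.DIST () function that is used later in this section; a minimal sketch:

=BINOM.DIST(0,3,0.3,FALSE) returns 0.343
=BINOM.DIST(1,3,0.3,FALSE) returns 0.441
=BINOM.DIST(2,3,0.3,FALSE) returns 0.189
=BINOM.DIST(3,3,0.3,FALSE) returns 0.027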

From these calculations we can now note the probability distribution for this experiment
(see Table 4.4).

X    Formula    P(X)
0    q³         0.343
1    3pq²       0.441
2    3p²q       0.189
3    p³         0.027
     Total = 1.000

Table 4.4

This probability distribution is illustrated in Figure 4.28.

Figure 4.28 Binomial (N = 3 and P = 0.3): probability, P(X), against number of bulls hit, X

Note From the probability distribution we observe that the total probability equals one.
This is expected as the total probability would represent the total experiment.

Total probability for the experiment ∑P(X = r) = 1    (4.4)

If we increase the size of the experiment then it becomes quite difficult to calculate the
event probabilities. We really need to develop a formula for calculating binomial
probabilities. Using the ideas generated earlier, we have:

Total probability = P{X = 0 or X = 1 or X = 2 or X = 3}
Total probability = P(X = 0) + P(X = 1) + P(X = 2) + P(X = 3)
Total probability = (0.7)³ + 3 × (0.3) × (0.7)² + 3 × (0.3)² × (0.7) + (0.3)³
Total probability = q³ + 3pq² + 3p²q + p³
Total probability = p³ + 3p²q + 3pq² + q³

Repeating this experiment for increasing values of ‘n’ would enable the identification
of a pattern that can be used to develop equation (4.5) to calculate the probability of ‘r’
successes given ‘n’ attempts of the experiment.

P(X = r) = nCr × p^r × q^(n−r)    (4.5)

The term nCr calculates the binomial coefficients, which are the numbers in front of the
letter terms in the binomial expansion. For example, in the previous example we found
that the total probability = p³ + 3p²q + 3pq² + q³ with the numbers in front of the letters of
1, 3, 3, and 1. These numbers are called the binomial coefficients and are calculated using
equation (4.6).

nCr = n!/(r!(n − r)!)    (4.6)

Where n! (pronounced n factorial) is defined by equation (4.7).

n! = n*(n–1)*(n–2)*(n–3) ... 3*2*1 (4.7)

Note
1. The term nCr calculates the number of combinations of obtaining ‘r’ successes from ‘n’
attempts of the experiment. In certain information sources this term is written using the
bracketed notation (n over r).
2. It is important to note that 3! = 3*2*1 = 6, 2! = 2*1 = 2, 1! = 1, 0! = 1.

It can be shown that the mean and variance for a binomial distribution are given by
equations (4.8) and (4.9).

Mean of Binomial Distribution, E(X) = np    (4.8)

Variance of Binomial Distribution, VAR(X) = npq    (4.9)
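For example, for the marksman in Example 4.9 these equations give E(X) = np = 3 × 0.3 = 0.9 bulls and VAR(X) = npq = 3 × 0.3 × 0.7 = 0.63.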

In Example 4.9 we note that n = 3, p = 0.3, and q = 1 – p = 0.7.


Substituting these values into equation (4.5) gives:

P(no bulls hit) = P(X = 0) = 3C0 × (0.3)⁰ × (0.7)³

Inspecting this equation we note that the problem consists of three terms that are multiplied
together to provide the probability of no bulls hit. The terms are: (a) 3C0, (b) (0.3)⁰,
and (c) (0.7)³. Parts (b) and (c) are straightforward to calculate and part (a) can be calculated
from equation (4.6) as follows:

3C0 = 3!/(0!(3 − 0)!) = 3!/(0!3!) = (3 × 2 × 1)/(1 × 3 × 2 × 1) = 1

Substituting this value into the problem solution gives:

P(no bulls hit) = P(X = 0) = 3C0 × (0.3)⁰ × (0.7)³ = 1 × 1 × (0.7)³ = 0.343

Excel solution—Example 4.9


Figure 4.29 illustrates the Excel solution.

• Binomial probability of ‘r’ successes from ‘n’ attempts using Excel function BINOM.
DIST ().
• Binomial coefficients using Excel function COMBIN ().
• Factorial values using Excel function FACT ()

Figure 4.29

➜ Excel solution
Number of trials, n Cell D3 Value
Probability of hitting bull, p Cell D4 Value
Probability of missing bull, q Cell D5 Formula: =1−D4

Probability distribution
r Cells D10:D13 Values
P(X = r) Cell E10 Formula: =BINOM.DIST(D10,$D$3, $D$4,FALSE)
Copy formula down E10:E13
Total Cell E14 Formula: =SUM(E10:E13)
Number of combinations
r Cells D17:D20 Values
nCr Cell E17 Formula: =COMBIN($D$3,D17)
Copy formula down E17:E20
Factorials
r Cells D23:D26 Values
r! Cell E23 Formula: =FACT(D23)
Copy formula down E23:E26

Note Total probability for the experiment ∑P(X = r) = 1.

Example 4.10
A local authority surveyed the travel preferences of people who travelled to work by train
or bus. The initial analysis suggested that one in five people travelled by train to work. If five
people are interviewed, what is the probability that: (a) exactly three prefer travelling by train,
P(X = 3); (b) three or more prefer travelling by train, P(X ≥ 3); and (c) fewer than three prefer
travelling by train, P(X < 3).
This experiment can be modelled by a binomial distribution as:

• five identical trials (n = 5);
• each trial can result in either a person travelling by train (success) or not travelling by train
(failure);
• the outcome of each trial is independent;
• the probability of a success (P(travels by train) = p = 1/5 = 0.2) is the same for each trial;
• the random variable is discrete.

The random variable, X, represents the number of people travelling by train out of the
five people interviewed. From the information provided we note that P(success) = P(prefer
train) = p = 1/5 = 0.2, P(failure) = 1 − p = q = 0.8, and the number of identical trials n = 5.

Excel solution—Example 4.10


The Excel solution is illustrated in Figure 4.30.

Figure 4.30

➜ Excel solution
Number of trials n = Cell D3 Value
Probability travels by train, p = Cell D4 Value
Probability does not travel by
train, q = Cell D5 Formula: =1−D4
r Cells D10: D15 Values
P(X = r) Cells E10 Formula: =BINOM.DIST(D10,$D$3,$D$4,FALSE)
Copy formula down E10:E15
Total Cell E17 Formula: =SUM(E10:E15)
r Cell D20:D22 Values
P(X = 3) = Cell E20 Formula: =BINOM.DIST(D20,$D$3,$D$4,FALSE)
P(X ≥ 3) = Cell E21 Formula: =1−BINOM.DIST(D21,$D$3,$D$4,TRUE)
P(X < 3) = Cell E22 Formula: =BINOM.DIST(D22,$D$3,$D$4,TRUE)

❉ Interpretation
(a) P(exactly three prefer train) = P(X = 3) = 0.0512
(b) P(three or more prefer train) = P(X ≥ 3) = P(X = 3) + P(X = 4) + P(X = 5) = 0.05792
(c) P(fewer than three prefer train) = P(X < 3) = P(X ≤ 2) = 0.9421

Note Total probability for the experiment ∑P(X = r) = 1.

Example 4.11
A manufacturing company regularly conducts quality control checks at specified periods on all
products manufactured. A new order for 2000 light bulbs is due to be delivered to a national
do-it-yourself store. Historically, the manufacturing process has a failure rate of 15% and the
sample to be tested consists of four randomly selected light bulbs. From this information
estimate the following probabilities: (a) find the probability distribution for 0, 1, 2, 3, and 4
defective light bulbs; (b) calculate the probability that at least three will be defective; and (c)
determine the mean and variance of the distribution.
This example highlights the case of selecting without replacement from a large population.
The effect on the sample space can be considered negligible and therefore we can consider
the events as independent. Let the random variable, X, represent the number of defective
light bulbs from the random sample. This value of X can take the values: 0 defective from 4
bulbs, or 1 defective from 4 bulbs, or 2 defective from 4 bulbs, or 3 defective from 4 bulbs,
or all 4 bulbs defective. This can be written as X = 0, 1, 2, 3, 4. For this example we have
p = P(success) = P(defective bulb) = 0.15, q = P(not defective) = 1 – p = 0.85, and n = 4.

Excel solution—Example 4.11

The Excel solution is illustrated in Figure 4.31.

Figure 4.31

➜ Excel solution
Number of trials n = Cell D3 Value
Probability light bulb fails, p = Cell D4 Value
Probability light bulb
does not fail, q = Cell D5 Formula: =1−D4

(a) Probability distribution


r Cells D10:D14 Values
P(X = r) Cell E10 Formula: =BINOM.DIST(D10,$D$3,$D$4,FALSE)
Copy formula down E10:E14
Total Cell E16 Formula: =SUM(E10:E14)

(b) P(X ≥ 3) = P(X = 3) + P(X = 4) = 1 − P(X < 3) = 1 − P(X ≤ 2) = 0.0119813

r Cell D19 Value
P(X = r) Cell E19 Formula: =1−BINOM.DIST(D19,$D$3,$D$4,TRUE)

(c) Mean and variance


r Cells D22:D26 Values
P(X = r) Cell E22 Formula: =BINOM.DIST(D22,$D$3,$D$4,FALSE)
Copy formula down E22:E26
r*P(X = r) Cell F22 Formula: =D22*E22
Copy formula down F22:F26
r²*P(X = r) Cell H22 Formula: =D22*F22
Copy formula down H22:H26
E(x) = Cell E28 Formula: =SUM(F22:F26)
VAR(X) = Cell E29 Formula: =SUM(H22:H26)−E28^2
np = Cell E30 Formula: =D3*D4
npq = Cell E31 Formula: =D3*D4*D5

❉ Interpretation
(a) Table 4.5 represents the probability distribution for Example 4.11.

r P(X = r)
0 0.52201
1 0.36848
2 0.09754
3 0.01148
4 0.00051

Table 4.5

(b) The probability of at least three defective bulbs from the sample of four is 0.0119813.
(c) Mean and variance for the probability distribution is mean = 0.6 and variance = 0.51.

Note Total probability for the experiment ∑P(X = r) = 1.

Student exercises
X4.6 Evaluate the following: (a) 3C1, (b) 10C3, (c) 2C0.
X4.7 A binomial model has n = 4 and p = 0.6.
(a) Find the probabilities of each of the five possible outcomes (i.e. P(0), P(1) . . . P(4)).
(b) Construct a histogram of this data.

X4.8 Attendance at a cinema has been analysed and shows that audiences consist of 60%
men and 40% women for a particular film. If a random sample of six people was
selected from the audience during a performance, find the following probabilities:
(a) All women are selected
(b) Three men are selected
(c) Fewer than three women are selected.
X4.9 A quality control system selects a sample of three items from a production line. If
one or more is defective, a second sample is taken (also of size three), and if one or
more of these are defective then the whole production line is stopped. Given that the
probability of a defective item is 0.05, what is the probability that the second sample is
taken? What is the probability that the production line is stopped?
X4.10 Five people in seven voted in an election. If four of those on the roll are interviewed
what is the probability that at least three voted?
X4.11 A small tourist resort has a weekend traffic problem and is considering whether or not
to provide emergency services to help mitigate the congestion that results from an
accident or breakdown. Past records show that the probability of a breakdown or an
accident on any given day of a four-day weekend is 0.25. The cost to the community
caused by congestion resulting from an accident or breakdown is as follows:
• a weekend with 1 accident day costs £20,000;
• a weekend with 2 accident days costs £30,000;
• a weekend with 3 accident days costs £60,000;
• a weekend with 4 accident days costs £125,000.
As part of its contingency planning, the resort needs to know:
(a) The probability that a weekend will have no accidents
(b) The probability that a weekend will have at least two accidents
(c) The expected cost that the community will have to bear for an average weekend
period
(d) Whether or not to accept a tender from a private firm for emergency services of
£20,000 for each weekend during the season.

4.2.3 Poisson probability distribution


In the previous section we explored the concept of a binomial distribution, which represents
a discrete probability distribution that enables the probability of achieving ‘r’ successes
from ‘n’ independent experiments to be calculated. Each experiment (or event)
has two possible outcomes (‘success’ or ‘failure’) and the probability of ‘success’ (p) is
known.
Poisson distribution Poisson distributions model a range of discrete random data variables.

The Poisson distribution, developed by Simeon Poisson (1781–1840), is a discrete
probability distribution that enables the probability of ‘r’ events occurring during a specified
interval (time, distance, area, and volume) to be calculated, if the average occurrence
is known and the events are independent of the specified interval since the last event
occurred. It has
been usefully employed to describe probability functions of phenomena, such as product
demand, demands for service, numbers of accidents, numbers of traffic arrivals, and
numbers of defects in various types of lengths or objects.

Like the binomial it is used to describe a discrete random variable. With the binomial
distribution we have a sample of definite size and we know the number of ‘successes’ and
‘failures’. There are situations, however, when to ask how many ‘failures’ would not make
sense and/or the sample size is indeterminate. For example, if we watch a football match
we can report the number of goals scored but we cannot say how many were not scored. In
such cases we are dealing with isolated cases in a continuum of space and time, where the
number of experiments (n), probability of success (p), and failure (q) cannot be defined.
What we can do is divide the interval (time, distance, area, volume) into very small sections
and calculate the mean number of occurrences in the interval. This gives rise to the
Poisson distribution defined by equation (4.10).
Poisson distribution defined by equation (4.10).

P(X = r) = (λ^r × e^−λ)/r!    (4.10)

Where:

• P(X = r) is the probability of event ‘r’ occurring;


• the symbol ‘r’ represents the number of occurrences of an event and can take the
value 0 → ∞ (infinity);
• r! is the factorial of ‘r’ calculated using the Excel function FACT ();
• λ (Greek letter lambda) is a positive real number that represents the expected
number of occurrences for a given interval. For example, if we found that we had an
average of four stitching errors in a one-metre length of cloth, then for two metres of
cloth we would expect the average number of errors to be λ = 4*2 = 8;
• the symbol ‘e’ represents the base of the natural logarithm (e = 2.71828. . .).
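As a quick worked illustration of equation (4.10), using the stitching-error example above (λ = 8 errors expected in two metres of cloth), the probability of exactly five errors is P(X = 5) = (8^5 × e^−8)/5! ≈ 0.0916; in Excel this is returned by =POISSON.DIST(5,8,FALSE).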

Example 4.12
The data in Table 4.6 is a record of the number of times a river has flooded in a wet season over
the past 100 years.
Check if the distribution may be modelled using the Poisson distribution and determine
the expected frequencies for a 100-year period. The Excel solution is provided in Figures 4.32
and 4.33.

Number of floods, X Number of years with ‘X’ floods, f


0 24
1 35
2 24
3 12
4 4
5 1
Total = 100

Table 4.6

Excel solution—Example 4.12a


The first stage is to estimate the average number of floods per year, λ, based upon the
sample data. Figure 4.32 illustrates the Excel solution.

Figure 4.32

➜ Excel solution
(a) Calculate frequency distribution mean and variance
Number of floods X Cells B7:B12 Values
Number of years with X floods, f Cells C7:C12 Values
xf Cell D7 Formula: =B7*C7
Copy formula down D7:D12
Σf = Cell C13 Formula: =SUM(C7:C12)
ΣXf = Cell D13 Formula: =SUM(D7:D12)
X² Cell F7 Formula: =B7^2
Copy formula down F7:F12
fX² Cell H7 Formula: =C7*F7
Copy formula down H7:H12
ΣfX² Cell H13 Formula: =SUM(H7:H12)
Mean = Cell C16 Formula: =D13/C13
Variance = Cell D17 Formula: =H13/C13−C16^2

❉ Interpretation The average number of floods is 1.4 per year with a variance of 1.32.
They seem to be in close agreement (only 5.7% difference), which is one of the characteristics
of the Poisson distribution. The mean and variance of the Poisson distribution have the same
numerical value and, given the closeness of the two values in this numerical example, we
would conclude that the Poisson distribution should be a good model for the sample data.

Note The average number of floods per year (λ) and variance are calculated from the
frequency distribution.

Mean λ = ΣfX / Σf = 140/100 = 1.4

Variance VAR(X) = ΣfX² / Σf − (mean)² = 1.32

The chi-square goodness-of-fit test is used to see if the Poisson model is a significant fit
to the sample data (see section 7.1.4).

Excel solution—Example 4.12b


Thus, we can now determine the probability distribution using equation (4.10), as illus-
trated in Figure 4.33.

Figure 4.33

➜ Excel solution
(b) Calculate the Poisson probabilities and fit probability model to data set
r Cells J7:J12 Values
P(X = r) Cell K7 Formula: =POISSON.DIST( J7,$C$16,FALSE)
Copy formula down K7:K12
Total = Cell K14 Formula: =SUM(K7:K12)
Expected frequencies Cell M7 Formula: =$C$13*K7
Copy formula from M7:M12
Total = Cell M14 Formula: =SUM(M7:M12)

❉ Interpretation
(a) The probability distribution is given in Table 4.7.

r P(X = r)
0 0.2466
1 0.3452
2 0.2417
3 0.1128
4 0.0395
5 0.0111

Table 4.7

To check how well the Poisson probability distribution fits the data set we note that the
observed frequencies are given in the original table and that the expected frequencies can be
calculated from the Poisson probability fit using the equation EF = (∑f) × P(X = r). The manual
solution is now presented in Table 4.8.

r P(X = r) Observed frequency Expected frequency


0 0.2466 24 24.66
1 0.3452 35 34.52
2 0.2417 24 24.17
3 0.1128 12 11.28
4 0.0395 4 3.95
5 0.0111 1 1.11
Totals = 100 99.68

Table 4.8

We note that the expected frequencies are approximately equal to the observed frequency
values.
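As a cross-check on Table 4.8, the following sketch reproduces the fitted probabilities and expected frequencies in Python. It assumes the scipy library is available, and the variable names are our own.

from scipy.stats import poisson

observed = [24, 35, 24, 12, 4, 1]                     # years with 0 to 5 floods (Table 4.6)
n = sum(observed)                                     # total number of years = 100
lam = sum(r * f for r, f in enumerate(observed)) / n  # sample mean = 1.4

for r, f in enumerate(observed):
    p = poisson.pmf(r, lam)                           # P(X = r) from equation (4.10)
    print(r, round(p, 4), f, round(n * p, 2))         # r, probability, observed, expected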

Table 4.9 illustrates the calculation of the Poisson probability values for λ = 1.4 by
applying equation (4.10).

r    Poisson value                                Excel
0    P(X = 0) = 1.4^0 e^(−1.4)/0! = 0.2466        =POISSON.DIST(J7,$C$16,FALSE)
1    P(X = 1) = 1.4^1 e^(−1.4)/1! = 0.3452        =POISSON.DIST(J8,$C$16,FALSE)
2    P(X = 2) = 1.4^2 e^(−1.4)/2! = 0.2417        =POISSON.DIST(J9,$C$16,FALSE)
3    P(X = 3) = 1.4^3 e^(−1.4)/3! = 0.1128        =POISSON.DIST(J10,$C$16,FALSE)
4    P(X = 4) = 1.4^4 e^(−1.4)/4! = 0.0395        =POISSON.DIST(J11,$C$16,FALSE)
5    P(X = 5) = 1.4^5 e^(−1.4)/5! = 0.0111        =POISSON.DIST(J12,$C$16,FALSE)

Table 4.9

Figure 4.34 illustrates a Poisson probability plot for the number of floods example.

Figure 4.34 Poisson distribution (mean = 1.4 floods per year): probability, P(X = r), plotted against the number of floods, X.

The skewed nature of the distribution can be clearly seen (positive skew).
If we determine the mean and the variance, either using the frequency distribution or
the probability distribution, we would find that the relationship is as given in equation
(4.11).

λ = VAR(X) (4.11)

The characteristics of a Poisson distribution are:

• mean = variance
• events discrete and randomly distributed in time and space;
• mean number of events in a given interval is constant;
• events are independent;
• two or more events cannot occur simultaneously.

Note Once it has been identified that the mean and variance have the same numerical
value, ensure that the other conditions above are satisfied, indicating that the sample data
most likely follow the Poisson distribution.

Example 4.13
A company is reviewing the number of telephone lines available for customer support. The average number of calls received is three during a five-minute period. Estimate the proportion of phone calls that cannot be answered during a five-minute period: (a) if the company installs four lines and (b) if the company installs five lines.

Excel solution—Example 4.13


The Excel solution is illustrated in Figure 4.35.

Figure 4.35

➜ Excel solution
λ = Cell C3 Value
r Cells B6:B10 Values
P(X = r) Cell C6 Formula: =POISSON.DIST(B6,$C$3,FALSE)
Copy formula down C6:C10
P(X ≤ 4) = Cell C12 Formula: =POISSON.DIST(B10,C3,TRUE)
P(X ≤ 4) = Cell C13 Formula: =SUM(C6:C10)
P(X > 4) = Cell C15 Formula: =1−C12
P(X > 5) = Cell C17 Formula: =1−POISSON.DIST(5,C3,TRUE)

❉ Interpretation

(a) If the company has four lines then the probability that a call cannot be answered P (call
not answered) = 1 − P(X ≤ 4) = 1 − P(X = 0 or X = 1 or X = 2 or X = 3 or X = 4). From
Excel, P(call not answered) = 0.185263245, or 18.5%. That is, callers cannot connect 18.5% of the time.
(b) Should another line be installed? The corresponding calculation shows that if n = 5 then
the P(call not answered) = 1 − P(X ≤ 5) = 1 − P(X = 0 or X = 1 or X = 2 or X = 3 or X = 4
or X = 5). From Excel, P(call not answered) = 0.083917942 or 8.4%. The probability that
the switchboard could not handle all calls has been reduced to 8.4%. Whether or not this
was worthwhile depends upon the likely profits that this would create against the cost of
installation and running an extra telephone line.
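The two tail probabilities above can be reproduced with a few lines of Python; this is a hedged illustration (scipy assumed) rather than part of the Excel workbook.

from scipy.stats import poisson

lam = 3  # average number of calls in a five-minute period

# (a) four lines: a caller cannot connect when more than four calls arrive
print(1 - poisson.cdf(4, lam))  # approximately 0.185, or 18.5%

# (b) five lines: a caller cannot connect when more than five calls arrive
print(1 - poisson.cdf(5, lam))  # approximately 0.084, or 8.4%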

The probability that a call cannot be answered P (call not answered) = 1 − P(X ≤ 4). Table
4.10 illustrates the calculation of the Poisson probability values for λ = 3 by applying
equation (4.10).

r    Poisson value                            Excel
0    P(X = 0) = 3^0 e^(−3)/0! = 0.0498        =POISSON.DIST(B6,$C$3,FALSE)
1    P(X = 1) = 3^1 e^(−3)/1! = 0.1494        =POISSON.DIST(B7,$C$3,FALSE)
2    P(X = 2) = 3^2 e^(−3)/2! = 0.2240        =POISSON.DIST(B8,$C$3,FALSE)
3    P(X = 3) = 3^3 e^(−3)/3! = 0.2240        =POISSON.DIST(B9,$C$3,FALSE)
4    P(X = 4) = 3^4 e^(−3)/4! = 0.1680        =POISSON.DIST(B10,$C$3,FALSE)

Table 4.10

Student exercises
X4.12 Calculate P(0), P(1), P(2), P(3), P(4), P(5), P(6), and P(>6) for a Poisson variable with a
mean of 1.2. Using this probability distribution determine the mean and variance.
X4.13 In a machine shop the average number of machines out of operation is two. Assuming
a Poisson distribution for machines out of operation, calculate the probability that at
any one time there will be:
(a) Exactly one machine out of operation
(b) More than one machine out of operation.
X4.14 A factory estimates that 0.25% of its production of small components is defective.
These are sold in packets of 200. Calculate the percentage of the packets containing
one or more defectives.
X4.15 The average number of faults in a metre of cloth produced by a particular machine is
0.1. (a) What is the probability that a length of four metres is free from faults? (b) How
long would a piece have to be before the probability that it contains no flaws is less
than 0.95?
X4.16 A garage has three cars available for daily hire. Calculate the following probabilities if
the variable is a Poisson variable with a mean of 2: (a) find the probability that on a
given day that exactly none, one, two, and three cars will be hired, and determine the
mean number of cars hired per day; (b) the charge of hire of a car is £25 per day and
the total outgoings per car, irrespective of whether or not it is hired, are £5 per day.
Determine the expected daily profit from hiring these three cars.
X4.17 Accidents occur in a factory randomly and, on average, at the rate of 2.6 per month.
What is the probability that in a given month: (a) no accidents will occur and (b) more
than one accident will occur?

4.2.4 Poisson approximation to the binomial distribution


When the number of trials in a binomial situation is very large and when p is small then it
can be shown that the binomial probability function can be approximated by the Poisson
probability function with λ = np. The larger the n and the smaller the p, the better is the
approximation. The following equation (equation 4.12) for the Poisson probability is used
to approximate the true (binomial) result:

P(X = r) ≅ (np)^r e^(−np) / r!    (4.12)

The Poisson random variable theoretically ranges from 0 → ∞. However, when used
as an approximation to the binomial distribution, the Poisson random variable—the
number of successes out of n observations—cannot be greater than the sample size n.
With large n and small p, equation 4.12 implies that the probability of observing a large
number of successes becomes small and approaches zero quite rapidly. For small values
of p (<0.1) and large values of n, the Poisson distribution will approximate the binomial
distribution with λ = np. For the binomial distribution with p small (<0.1) the mean (or
expected) value = np and the variance = npq = np(1 − p) ≈ np. This implies that, for small p, the expected value and variance of the binomial distribution are approximated by the mean and variance of the Poisson distribution (λ = np, VAR(X) = np).

Example 4.14
In a large consignment of apples 3% are rotten. What is the probability that a carton of 60
apples will contain fewer than 2 rotten apples? We have here a binomial experiment and there-
fore could easily apply the binomial distribution with p = 0.03, q = 0.97 and n = 60.

Excel solution—Example 4.14


The Excel solution is illustrated in Figure 4.36.

Figure 4.36

➜ Excel solution
n = Cell C3 Value
p = Cell C4 Value
np = Cell C5 Formula: =C3*C4
Binomial: P( X < 2) = Cell D7 Formula: =BINOM.DIST(1,C3,C4,TRUE)
Poisson: P( X < 2) = Cell D8 Formula: =POISSON.DIST(1,C5,TRUE)

❉ Interpretation We can see from Excel that the binomial and Poisson distributions
provide approximately equal results, 45.92% and 46.28% respectively.

The degree of agreement between the binomial and Poisson probability distributions
for this problem can be observed in Figure 4.37.

Figure 4.37 Comparison between the binomial and Poisson distributions: P(X = r) plotted against X for both distributions.

Note
1. Binomial solution

P(fewer than two rotten apples) = P(X < 2)

P(X < 2) = P(X = 0) + P(X = 1)

P(X < 2) = 60C0 p^0 q^(60−0) + 60C1 p^1 q^(60−1)

P(X < 2) = 60C0 (0.03)^0 (0.97)^60 + 60C1 (0.03)^1 (0.97)^59

P(X < 2) = 0.1608 + 0.2984 = 0.4592

2. Poisson solution

P(fewer than two rotten) = P(X < 2) = P(X = 0) + P(X = 1)

As n is large and p is small we can use the Poisson distribution. To check if the Poisson distribution is appropriate calculate the mean and variance: mean = np = 60 * 0.03 = 1.8, and variance = npq = 60 * 0.03 * 0.97 = 1.746. Comparing the two values we see that they are approximately equal and the binomial distribution can be approximated using the Poisson distribution:

P(X < 2) = P(X = 0) + P(X = 1) = 1.8^0 e^(−1.8)/0! + 1.8^1 e^(−1.8)/1! = 0.1653 + 0.2975 = 0.4628

Table 4.11 compares the two solutions.

Problem     Binomial    Poisson
P(X < 2)    0.4592      0.4628

Table 4.11
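The closeness of the approximation is easy to confirm numerically; the sketch below (scipy assumed, our own illustration) mirrors the BINOM.DIST and POISSON.DIST calculations for this example.

from scipy.stats import binom, poisson

n, p = 60, 0.03
lam = n * p  # lambda = 1.8

print(binom.cdf(1, n, p))   # exact binomial P(X < 2), approximately 0.4592
print(poisson.cdf(1, lam))  # Poisson approximation,   approximately 0.4628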

Student exercise
X4.18 A new telephone directory is to be published. Before publication entries are proofread
for errors and any corrections made. Experience suggests that, on average, 0.1% of
the entries require correction and that entries requiring correction are randomly
distributed. The directory contains 800 pages with 300 entries per page. Two methods
for making corrections are proposed: Method A (costs 50p per page containing one
correction and £1.50 per page containing two or more corrections), and Method B
(costs £1 per page containing one or more corrections). Which method, based on cost,
should be used?

4.2.5 Normal approximation to the binomial distribution


Computing binomial probabilities using the binomial probability distribution can be
difficult for large values of n. If we were undertaking the calculation using tables then usu-
ally tables are supplied up to a value of n of 50 and for particular values of the probability
of success, p. We have seen that the Poisson distribution can be used to approximate the
binomial distribution when n >20 and p <0.1. The binomial mean and variance are given
by the terms: mean = np and variance = npq. Substituting these terms into equation (4.2)
gives equation (4.13):

Z = (X − np) / √(npq)    (4.13)

The normal distribution can be used to approximate the binomial probabilities (normal approximation to the binomial) when n is large, p is close to 0.5, and np > 5 (and nq > 5), with mean (μNormal ≈ μBinomial = np) and variance (σ²Normal ≈ σ²Binomial = npq).

Normal approximation to the binomial: if the number of trials, n, is large, the binomial distribution is approximately equal to the normal distribution.

Example 4.15
Assume you have a fair coin and wish to know the probability that you would get eight heads
out of ten flips. The binomial distribution has a mean of µ = np = 10 * 0.5 = 5 and a variance
of σ² = npq = 10 * 0.5 * 0.5 = 2.5. The standard deviation is therefore 1.5811. A total of 8 heads
is 1.8973 standard deviations above the mean of the distribution [(8–5)/1.5811]. The question
then is ‘What is the probability of getting a value exactly 1.8973 standard deviations above the
mean?’. The answer is to remember that, for a normal distribution, the probability of any single exact value of X is zero, because a single point has no area under the curve. The problem is that the binomial distribution is a discrete probability distribution, whereas the normal distribution is continuous. The solution is to round off and consider any value from 7.5 to 8.5 to represent an outcome of 8 heads.
Using this approach, we can solve discrete binomial problems with a normal approximation if
we transform X = 8 for the binomial to the region 7.5–8.5 for the normal distribution.
The area shaded in Figure 4.38 is an approximation of the probability of obtaining eight
heads.
Figure 4.38 Normal curve, f(x): the shaded region between X = 7.5 and X = 8.5 approximates P(X = 8) = 0.043495.

We can see that the binomial probability distribution solution is P(X = 8)Binomial ≈ P(7.5 ≤ X ≤ 8.5)Normal.

Excel solution—Example 4.15


The Excel solution to this problem is illustrated in Figure 4.39.

Figure 4.39

➜ Excel solution
Binomial
n = Cell D5 Value
p = Cell D6 Value
mean μ = Cell D7 Formula: =D5*D6
Variance σ2 = Cell D8 Formula: =D5*D6*(1−D6)
SD σ = Cell D9 Formula: =SQRT(D8)
Number of heads X = Cell D10 Value
P(X = 8) = Cell D11 Formula: =BINOM.DIST(D10,D5,D6,FALSE)
Normal
Lower X1 = Cell D13 Formula: =D10−0.5
Upper X2 = Cell D14 Formula: =D10+0.5
P(X1 ≤ 7.5) = Cell D15 Formula: =NORM.DIST(D13,D7,D9,TRUE)
P(X2 ≤ 8.5) = Cell D16 Formula: =NORM.DIST(D14,D7,D9,TRUE)
P(7.5 ≤ X ≤ 8.5) = Cell D18 Formula: =D16−D15

We can see from Excel that the two probabilities agree with one another. The binomial
probability of obtaining 8 heads from 10 flips is 0.043945 and the normal approximation
probability of containing 8 heads is 0.043495.

❉ Interpretation The probability of obtaining 8 heads from 10 flips of a fair coin is


approximately 4.3%.
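The continuity correction can be verified directly; the following sketch (scipy assumed, and our own illustration) compares the exact binomial P(X = 8) with the normal area between 7.5 and 8.5.

from scipy.stats import binom, norm

n, p = 10, 0.5
mu = n * p                        # mean = 5
sigma = (n * p * (1 - p)) ** 0.5  # standard deviation = 1.5811

exact = binom.pmf(8, n, p)                                    # 0.043945
approx = norm.cdf(8.5, mu, sigma) - norm.cdf(7.5, mu, sigma)  # approximately 0.0435
print(exact, approx)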

Example 4.16
Enquiries at a travel agent lead to a holiday booking being made only sometimes. The agent
needs to make 35 bookings per week to break even. If during a week there are 100 enquiries
and the probability of a booking in each case is 0.4, find the probability that the agent will at
least break even in this particular week. To solve this problem let X represent the number of
bookings per week, p represent the probability that a booking will be made p = 0.4, and n
represent the number of possible bookings over the week, n = 100.
The area shaded in Figure 4.40 is a normal approximation of the binomial probability of
obtaining at least 35 bookings.
We can see that the binomial probability distribution solution is P(X ≥ 35)Binomial ≈ 1 − P(X ≤ 34.5)Normal.

Figure 4.40 Normal curve, f(x): the shaded region to the right of X = 34.5 (binomial X = 35) gives P(X ≥ 34.5) = 0.86921388.

Excel solution—Example 4.16

The Excel solution to this problem is illustrated in Figure 4.41.

Figure 4.41

➜ Excel solution
n = Cell D3 Value
p = Cell D4 Value
mean μ = Cell D5 Formula: =D3*D4
Variance σ2 = Cell D6 Formula: =D3*D4*(1−D4)
SD σ = Cell D7 Formula: =SQRT(D6)
Binomial
P(X ≥ 35) = 1 − P(X ≤ 34)?
Binomial X = Cell D12 Value
P(X ≥ 35) = 1 − P(X ≤ 34) = Cell D13 Formula: =1−BINOM.DIST(D12,D3,D4,TRUE)
Normal
P(X ≥ 34.5)?
Normal X = Cell D18 Value
P(X ≥ 34.5) = Cell D19 Formula: =1−NORM.DIST(D18,D5,D7,TRUE)
Z = Cell D21 Formula: =(D18−D5)/D7
P(X ≥ 34.5) = Cell D22 Formula: =1−NORM.S.DIST(D21, TRUE)

We can see from Excel that the two probabilities agree with one another. The binomial
probability of obtaining at least 35 bookings is 0.86966347 and the normal approximation
probability of obtaining at least 35 bookings is 0.86921388.

❉ Interpretation The probability of obtaining at least 35 bookings is 87.0%.

Note
(a) Binomial solution:

P(X = 35 or more) = P(X ≥ 35) = P(X = 35 or 36 or 37 . . . or 100)

This would be quite difficult to solve without the aid of calculator or some other compu-
tational device, for example a spreadsheet. From Excel we find that this probability value
is P(X ≥ 35) = 0.8697.
(b) Normal approximation solution (n = 100, p = 0.4):

μ = np = 0.4 * 100 = 40 and σ = √(npq) = √(100 * 0.4 * 0.6) = 4.899.

P(X ≥ 35 for binomial) ≈ P(X ≥ 34.5 for normal)

P(X ≥ 35 for binomial) ≈ P(Z ≥ (34.5 − 40)/4.899) = P(Z ≥ −1.12) = 0.8692
Comparing the two answers we can see that good agreement has been reached.
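A matching sketch for this example (scipy assumed) shows the same agreement between the exact binomial value and its normal approximation.

from scipy.stats import binom, norm

n, p = 100, 0.4
mu = n * p                        # 40
sigma = (n * p * (1 - p)) ** 0.5  # 4.899

print(1 - binom.cdf(34, n, p))        # exact P(X >= 35), approximately 0.8697
print(1 - norm.cdf(34.5, mu, sigma))  # normal approximation, approximately 0.8692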

Student exercise
X4.19 Given X is a discrete binomial random variable with p = 0.3 and n = 20: (a) Can we
use the normal approximation to estimate the binomial probability? (b) What if n is
changed to 15? and (c) if n = 40 and p = 0.1 is the normal approximation appropriate?

4.2.6 Normal approximation to the Poisson distribution


The normal distribution can also be used to approximate the Poisson distribution whenever the parameter λ, the expected number of successes, equals or exceeds 5. As the mean and the variance of a Poisson distribution have the same value (μ = λ = σ²), σ = √λ. Substituting these terms into equation (4.2) gives equation (4.14).

Z = (X − λ) / √λ    (4.14)

The approximation improves as the value of the mean (λ) grows larger and at a particu-
lar value we can assume that the Z variable is normally distributed.

Example 4.17
The average number of broken eggs per lorry is known to be 50. What is the probability that
there will be more than 70 broken eggs on a particular lorry load?
We may use the normal approximation to the Poisson distribution, where the mean and variance are calculated as follows: mean (μNormal ≈ μPoisson = λ = 50) and variance (σ²Normal ≈ σ²Poisson = λ = 50).
Require P(X > 70 for Poisson) ≈ P(X > 70.5 for normal).
The area shaded in Figure 4.42 is an approximation of the probability of obtaining more than
70 broken eggs.
Figure 4.42 Normal curve, f(x): the shaded region to the right of X = 70.5 (Poisson X = 70) gives P(X ≥ 70.5) = 0.001871.

We can see that the Poisson probability distribution solution is P(X > 70)Poisson ≈ P(X ≥ 70.5)Normal.

Excel solution—Example 4.17


The Excel solution to this problem is illustrated in Figure 4.43.

Figure 4.43

➜ Excel solution
Mean λ = Cell D3 Value
Variance σ2 = Cell D4 Value
SD σ = Cell D5 Formula: =SQRT(D4)
Poisson
P(X >70) = 1 − P(X ≤ 70)?
Poisson X = Cell D10 Value
P(X >70) = 1 − P(X ≤ 70) = Cell D11 Formula: =1−POISSON.DIST(D10,D3,TRUE)
Normal P(X ≥ 70.5)?
Normal X = Cell D16 Value
P(X ≥ 70.5) = Cell D17 Formula: =1−NORM.DIST(D16,D3,D5,TRUE)
Z = Cell D19 Formula: =(D16−D3)/D5
P(X ≥ 70.5) = Cell D20 Formula: =1−NORM.S.DIST(D19, TRUE)

We can see from Excel that the two probabilities closely agree with one another. The
Poisson probability of obtaining more than 70 broken eggs is 0.002971 and the normal
approximation probability of obtaining more than 70 broken eggs is 0.001871.

❉ Interpretation The probability of obtaining more than 70 broken eggs is 0.2%.

Note
(a) Poisson solution:

P(X > 70) = P(X = 71 or 72 . . .)

This would be quite difficult to solve without the aid of a calculator or some other compu-
tational device, for example a spreadsheet. From Excel we find that this Poisson probability
value is P(X > 70) = 0.002971.
(b) Normal approximation solution:

μ = λ = 50 and σ = √λ = 7.071068.

P(X > 70 for Poisson) ≈ P(X ≥ 70.5 for normal)

P(X > 70 for Poisson) ≈ P(Z ≥ (70.5 − 50)/7.071068) = P(Z ≥ 2.899138) = 0.001871.
Comparing the two answers we can see that good agreement has been reached.
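Again, the two answers can be checked with a few lines of Python (scipy assumed; our own illustration).

from scipy.stats import norm, poisson

lam = 50
sigma = lam ** 0.5  # 7.071068

print(1 - poisson.cdf(70, lam))        # exact Poisson P(X > 70), approximately 0.0030
print(1 - norm.cdf(70.5, lam, sigma))  # normal approximation, approximately 0.0019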

Student exercise
X4.20 A local maternity hospital has an average of 36 births per week. Use this information
to calculate the following probabilities: (a) the probability that there are fewer than
30 births in a given week; (b) the probability that there will be more than 40 births in
a given week; and (c) the probability that there will be between 30 and 40 births in a
given week.

4.2.7 Other discrete probability distributions


Other types of discrete probability distributions include the hypergeometric discrete
probability distribution which measures, like the binomial distribution, the number of
successes from n observations of the experiment. Unlike the binomial, which involves
replacement and therefore the probability of success (p) is constant, the hypergeometric
distribution involves sampling without replacement. In this case the probability of suc-
cess (p) is dependent upon the outcome of the previous run of the experiment. This distribution is beyond the scope of this textbook.

■ Techniques in practice
TP1 CoCo S. A. is concerned about the time taken to react to customer complaints and has implemented a new set of procedures for its support centre staff. The customer service director plans to reduce the mean time for responding to customer complaints to 28 days and has collected the sample data given in Table 4.12, after implementation of the new procedures, to assess the time taken to react to complaints (days).

20 33 33 29 24 30
40 33 20 39 32 37
32 50 36 31 38 29
15 33 27 29 43 33
31 35 19 39 22 21
28 22 26 42 30 17
32 34 39 39 32 38

Table 4.12

(a) Estimate the mean time to react to customer complaints.


(b) Calculate the probability that the mean time to react is not greater than 28 days.
TP2 Bakers Ltd is currently in the process of reviewing the credit line available to supermar-
kets who they have defined as a ‘good’ or ‘bad’ risk. Based upon a £100,000 credit line the profit
is estimated to be £25,000 with a standard deviation of £5000. Calculate the probability that
the profit is greater than £28,000.
TP3 Skodel Ltd is developing a low calorie lager for the European market with a mean
designed calorie count of 43 calories per 100 ml. The new product development team are hav-
ing problems with the production process and have collected an independent random sample
to assess whether the target calorie count is being met (Table 4.13).

49.7 45.2 37.7 31.9 34.8 39.8


45.9 40.5 40.6 41.9 51.4 54.0
34.3 47.8 63.1 26.3 41.2 31.7
41.4 45.1 41.1 47.9

Table 4.13

(a) Estimate the mean and variance based upon the sample data.
(b) State the value of calorie count if the production manager would like this value to be
43 ± 5%.
(c) Estimate the probability that the calorie count lies between 43 ± 5% (assume that your
answers to question (a) represent the population values).

■ Summary
The notion of a discrete and continuous probability distribution was introduced and examples
provided to illustrate the different types of discrete (binomial, Poisson) and continuous (nor-
mal) distributions.
In Chapter 5 we shall explore the concept of data sampling from normal and non-normal
population distributions and introduce the reader to the central limit theorem. Furthermore,
we will introduce a range of continuous probability distributions (Student’s t distribution, F
distribution, and chi-square distribution), which will be used in later chapters to solve a range
of problems that require statistical inference tests to be applied.
In Chapter 6 we will apply the central limit theorem to provide point and interval estimates
to certain population parameters (mean, variance, proportion) based upon sample parameters
(sample mean, sample variance, sample proportion).

■ Key terms
Binomial • Binomial experiment • Chi-square distribution • Continuous probability distribution • Continuous random variable • Cumulative distribution function • Discrete probability distributions • Discrete random variable • Discrete variable • F distribution • Normal approximation to the binomial • Normal distribution • Normal probability plot • Poisson distribution • Poisson probability distribution • Random variable • Standard normal distribution • Student’s t distribution

■ Further reading
Textbook resources
1. Whigham, D. (2007) Business Data Analysis using Excel. Oxford: Oxford University Press.
2. Lindsey, J. K. (2003) Introduction to Applied Statistics: A Modelling Approach (2nd edn).
Oxford: Oxford University Press.

Web resources
1. StatSoft Electronic Textbook http://www.statsoft.com/textbook/stathome.html (accessed
25 May 2012).
2. HyperStat Online Statistics Textbook http://davidmlane.com/hyperstat/index.html
(accessed 25 May 2012).
3. Eurostat—website is updated daily and provides direct access to the latest and most com-
plete statistical information available on the European Union (EU), the EU Member States, the
Euro-zone and other countries http://epp.eurostat.ec.europa.eu (accessed 25 May 2012).
4. Economagic—contains international economic data sets http://www.economagic.com
(accessed 25 May 2012).
5. The International Statistical Institute (ISI) glossary of statistical terms provides definitions
in a number of different languages http://isi.cbs.nl/glossary/index.htm (accessed 25 May 2012).
5 Sampling distributions and estimating

» Overview «
In Chapters 3 and 4 we introduced the concept of a probability distribution via the idea of
relative frequency and introduced two distinct types: discrete and continuous. In this chapter
we will explore the concept of taking a sample from a population and use this sample to
provide population estimates for the mean, standard deviation, and proportion. The types
of statistics that we explored within earlier chapters are statistics that provide an answer
to a particular question, where we assume that the data collected is from the complete
population. In many situations this is not the case and the data collected represents a sample
from a population being measured. In this case, the statistics calculated from the sample
(mean, standard deviation, and proportion) represent estimates of the true value that could
be calculated if you had access to the complete population of data values. These estimates
provide point estimates of the population values with the disagreement between the sample
and population value representing the margin of error. The margin of error can be represented
by the concept of a confidence interval for the population parameter value estimated from
the sample. This interval can be estimated if we assume that the sampling distribution of the
mean is normally distributed. We will show that the Central Limit Theorem allows a normal
distribution approximation for the sampling distribution of the mean to be assumed, even if
the population is not normally distributed. This result will allow the methods described in this
chapter to be employed to solve a range of statistical hypothesis tests where we test whether
the population mean has a particular value based upon the collected sample data.

» Learning objectives «
On completing this unit you will be able to:

» distinguish between the concept of a population and sample;

» recognize different types of sampling—probability (simple, systematic, stratified, cluster) and non-probability (judgement, quota, chunk, convenience);

» recognize reasons for sampling error—coverage error, non-response error, sampling error,
measurement error;

» understand the concept of a sampling distribution: mean and proportion;

» understand sampling from a normal population;

» understand sampling from non-normal population—Central Limit Theorem;

» calculate point estimates for one and two samples;

» calculate sampling errors and confidence intervals when the population standard devia-
tion is known and unknown (z- and t-tests);

» calculate and plot the sampling distribution of proportions for large n;

» calculate confidence intervals for one and two samples;

» determine sample sizes;

» solve problems using Microsoft Excel.

5.1 Introduction to the concept of a sample


There are many research questions we would like to answer that involve populations that
are too large to be able to measure every member of the population. How have wages of
German car workers changed over the past ten years? What are the management prac-
tices of foreign exchange bankers working in Paris? How many voters are planning to vote
for a political party at a local election? These questions are all valid and require research-
ers to design methods to collect the relevant data, which, when pooled together, gives the
researcher potential answers to the questions identified. In many cases the size of the
population is such that it is impractical to measure the wages of all German car workers
or all voters who are entitled to vote at a local election. In this situation a proportion of
the population would be selected from the population of interest to the researcher. This
proportion is achieved by sampling from the population; the proportion selected is called
the sample.

5.1.1 Why sample?


Sampling is usually collected via survey instruments, but could also be achieved by obser-
vation, archival records, or other methods. What is important to realize is that no matter
what method is used to collect the data values, the purpose is to determine how much
and how well the data set can be used to generalize the findings from the sample to the
population. It is important to avoid data collection methods that inflate the associated errors, as a bad sample may well render findings meaningless. Sampling in conjunction with survey research is one of the most popular approaches to data collection in business research, and the concept of random sampling provides the foundation assumption that allows statistical hypothesis testing to be valid. It is also worth noting that we all tend to perform sampling from populations without realizing that we are

doing so. For example, we browse through the television channels to find a programme
we may wish to watch and then make a decision based upon this sampling process. From
the sampling undertaken we may also make conclusions on the overall quality of televi-
sion programmes based upon the sample of programmes observed. This concept is called
making an inference on the population based upon the sample observations.
The primary aim of sampling is to select a sample from the population that shares the
same characteristics as the population. For example, if the population average height of
grown men between the age of 20 and 50 years is 176 cm, then the sample average height
would also be expected to be 176 cm unless we have the problem of sampling error. This
concept of sampling error can be measured and will be discussed within this chapter.
This concept of sample and population values being in agreement allows us to state that
we expect the sample to be representative of the population values being measured.
Questions we should answer are:

• How well does the sample represent the larger population from which it was drawn?
• How closely do the features of the sample resemble those of the larger population?

Before we describe the main sampling methods we need to define the terminology we
will use in this, and later, chapter(s).

5.1.2 Sampling terminology


A few statements hold true, in general, when dealing with sampling:

• samples are always drawn from a population;


• the population to be sampled should coincide with the population about which
information is wanted (the target population). Sometimes, for reasons of practicality
or convenience, the sampled population is more restricted than the target
population. In such cases, precautions must be taken to secure that the conclusions
only refer to the sampled population;
• before selecting the sample, the population must be divided into parts that are called
sampling units. These units must cover the whole of the population and they must
not overlap in the sense that every element in the population belongs to one and only
one unit. For example, in sampling the supermarket spending habits of people living
in a town the unit may be an individual or family, or a group of individuals living in a
particular post code;
• the development of this list of sampling units, called a frame, is often one of the major practical problems. The frame is a list that contains the population of what you would like to measure. For example, market research firms will access local authority census data to create a sample. The list of registered students may be the sampling frame for a survey of the student body at a university. Problems can arise in sampling frame bias. Telephone directories are often used as sampling frames, for instance, but tend to under-represent the poor (who have fewer or no phones) and the wealthy (who have unlisted numbers);
• the final stage is to collect the sample using either probability or non-probability sampling.

Sampling error: sampling error refers to the error that results from taking one sample rather than taking a census of the entire population.
Sampling frame: a sampling frame is the source material or device from which a sample is drawn.

5.1.3 Types of samples


We may then consider different types of probability samples. Although there are a num-
ber of different methods that might be used to create a sample, they can, generally, be
grouped into one of two categories: probability samples or non-probability samples.

Probability sampling
The idea behind this type of probability sampling is random selection. More specifically,
each sample from the population of interest has a known probability of selection under a
given sampling scheme.
There are four categories of probability samples, as illustrated in Figure 5.1.

Figure 5.1 Types of probability sampling: simple random, systematic, stratified, and cluster.

(a) Simple random sampling


The most widely known type of a random sample is the simple random sample. This is
characterized by the fact that the probability of selection is the same for every case in the
population. Simple random sampling is a method of selecting ‘n’ samples from a popula-
tion of size ‘N’ such that every possible sample of size ‘n’ has equal chance of being drawn.

Example 5.1
Consider the situation that a marketing researcher will experience when selecting a random
sample of 200 shoppers who shop at a supermarket during a particular time period. The
researcher notes that the supermarket would like to seek the views of its customers on a pro-
posed re-development of the store and the total footfall (the number of people visiting a shop
or a chain of shops in a period of time is called its footfall) within this time period is 10,000.
With a footfall (or population) of this size we could employ a number of ways to select an
appropriate sample of 200 from the potential 10,000. For example, we could place 10,000 con-
secutively numbered pieces of paper (1–10000) in a box, draw a number at random from the
box, shake, and select another number to maximize the chances of the second pick being ran-
dom, shake, and continue the process until all 200 numbers are selected. These would then be
used to select a customer entering the store with the customer chosen based upon the number
selected from the random process. To maximize the chances that customers selected would
agree to complete the survey we could enter them into a prize draw. These 200 customers
will form our sample with each number in the selection having the same probability of being
chosen. When undertaking the collection of data via random sampling we generally find it dif-
ficult to devise a selection scheme to guarantee that we have a random sample. For example, the selection from a population might not be the total population that you wish to measure or, during the time period when the survey is conducted, we may find that the customers sampled may be unrepresentative of the population as a result of unforeseen circumstances.

Random sample: a random sample is a sampling technique where we select a sample from a population of values.

(b) Systematic random sampling


With systematic random sampling we create a list of every member of the population. From the list, we randomly select the first sample element from among the first k values on the population list, where k is the population size divided by the desired sample size. Thereafter, we select every kth value on the list. The method involves three steps:

1. Step 1—divide the number of cases in the population by the desired sample size to give the sampling interval, k.
2. Step 2—select a random number between one and the value attained in step 1. For example, we could pick the number 28.
3. Step 3—starting with the case number chosen in step 2, take every twenty-eighth record, as per this example.

The advantage of systematic sampling compared with simple random sampling is that the sample is easier to draw from the population. The disadvantage is that not every possible sample of size n is equally likely to be drawn.
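The three steps translate directly into code. The sketch below is our own illustration (the function name systematic_sample is ours) of drawing a systematic sample from a numbered population list.

import random

def systematic_sample(population, sample_size):
    # Step 1: divide the population size by the desired sample size to get the interval k
    k = len(population) // sample_size
    # Step 2: select a random starting point between 1 and k
    start = random.randint(1, k)
    # Step 3: starting from that case, take every kth record
    return population[start - 1::k][:sample_size]

customers = list(range(1, 10001))           # population list of 10,000 cases
sample = systematic_sample(customers, 200)  # k = 50; e.g. cases 28, 78, 128, ...
print(len(sample), sample[:5])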

(c) Stratified random sampling


With stratified random sampling, the population is divided into two or more mutually
exclusive groups, where each group is dependent upon the research area of interest. The
sampling procedure is to organize the population into homogenous subsets before sam-
pling, then draw a random sample from each group. With stratified random sampling the
population is divided into non-overlapping groups (subpopulations or strata) where all the
groups together would comprise the entire population. As an example, suppose we con-
duct a national survey in the Netherlands. We might divide the population into groups (or
strata) based on the regions of the Netherlands. Then we would randomly select from each
group (or strata). The advantage of this method is the guarantee that every group within
the population is selected and it provides an opportunity to undertake group comparisons.

Example 5.2
To illustrate, consider the situation where we wish to sample the views of graduate job applicants
to a major financial institution. The nature of this survey is to collect data on the application
process from the applicants’ perspective. The survey will therefore have to collect the views from
the different specified groups within the identified population. For example, this could be based
on gender, race, type of employment requested (full- or part-time), or whether an applicant is
classified as disabled. If we use simple random sampling it is possible that we may miss a repre-
sentative sample from one of these groups as a result, for example, of the relative size of the group
relative to the population. In this case, we would employ stratified random sampling to ensure
that appropriate numbers of sample values are drawn from each group in proportion to the per-
centage of the population as a whole. Stratified sampling offers several advantages over simple
random sampling: (a) it guards against an unrepresentative sample (e.g. all male from a predomi-
nately female group); (b) it provides sufficient group data for separate group analysis; (c) it requires
a smaller sample; and (d) greater precision is achievable compared with simple random sampling
for a sample of the same size. Stratified random sampling nearly always results in a smaller vari-
ance for the estimated mean or other population parameters of interest. The main disadvantage

of a stratified sample is that it may be more costly to collect and process the data compared with
a simple random sample. Two different categories of stratified random sampling are available:

• proportionate stratification—with proportionate stratification, the sample size of each stra-


tum is proportionate to the population size of the stratum (same sampling fraction). The
method provides greater precision than for simple random sampling with the same sample
size; this precision is better when dealing with characteristics that are the same (homogene-
ous) strata;
• disproportionate stratification—with disproportionate stratification the sampling fraction
may vary from one stratum to the next. If differences are explored in the characteristics
being measured across strata then disproportionate stratification can provide better preci-
sion than proportionate stratification, when the sample points are correctly allocated to
strata. In general, given similar costs you would always choose proportionate stratification.

(d) Cluster sampling


Cluster sampling is a sampling technique in which the entire population of interest is
divided into groups, or clusters, and a random sample of these clusters is selected. Each
cluster must be mutually exclusive and together the clusters must include the entire pop-
ulation. After clusters are selected, then all data points within the clusters are selected.
No data points from non-selected clusters are included in the sample. This differs from
stratified sampling, in which some data values are selected from each group. When all the
data values within a cluster are selected, the technique is referred to as one-stage cluster
sampling. If a subset of units is selected randomly from each selected cluster, it is called
two-stage cluster sampling. Cluster sampling can also be made in three or more stages: it
is then referred to as multistage cluster sampling. The main reason for using cluster sam-
pling is that it is usually much cheaper and more convenient to sample the population in
clusters rather than randomly. In some cases, constructing a sampling frame that identi-
fies every population element is too expensive or impossible. Cluster sampling can also
reduce cost when the population elements are scattered over a wide area.

(e) Multistage sampling


With multistage sampling, we select a sample by using combinations of different sam-
pling methods. For example, in stage 1, we might use cluster sampling to choose clusters
from a population. Then, in stage 2, we might use simple random sampling to select a
subset of elements from each chosen cluster for the final sample.

Non-probability sampling
In many situations it is not possible to select the kinds of probability samples used in
large-scale surveys. For example, we may be required to seek the views of local, family-run
businesses that have experienced financial difficulties during the bank credit crunch of
2007–2012. In this situation there are no easily accessible lists of businesses experiencing
difficulties, or there may never be a list created or available. The question of obtaining a
sample in this situation is achievable by using non-probability sampling methods to col-
lect the required sample data.
Figure 5.2 illustrates the four primary types of non-probability sampling methods.

Figure 5.2 Types of non-probability sampling: convenience and purposive (quota and snowball).

We can divide non-probability sampling methods into two broad types: convenience
or purposive.

(a) Convenience sampling


Convenience (or availability) sampling is a method of choosing subjects who are available
or easy to find. This method is also sometimes referred to as haphazard, accidental, or
availability sampling. The primary advantage of the method is that it is very easy to carry
out relative to other methods. Problems can occur with this survey method in that you can
never guarantee that the sample is representative of the population. Convenience sam-
pling is a popular method with researchers and provides some data that can analysed, but
the type of statistics that can be applied to the data is compromised by uncertainties over
the nature of the population that the survey data represents.

(b) Purposive sampling


Purposive sampling is a sampling method in which elements are chosen based on the
purpose of the study. Purposive sampling may involve studying the entire population of
some limited group (accounts department at a local engineering firm) or a subset of a
population (chartered accountants). As with other non-probability sampling methods,
purposive sampling does not produce a sample that is representative of a larger popula-
tion, but it can be exactly what is needed in some cases—study of organization, commu-
nity, or some other clearly defined and relatively limited group. Examples of two popular
purposive sampling methods include quota sampling and snowball sampling.

• Quota sampling
Quota sampling is designed to overcome the most obvious flaw of convenience (or
availability) sampling. Rather than taking just anyone, quotas are set to ensure that
the sample you get represents certain characteristics in proportion to their prevalence
in the population. Note that for this method you have to know something about the
characteristics of the population ahead of time. There are two types of quota sampling:
proportional and non-proportional.

• In proportional quota sampling you want to represent the major characteristics


of the population by sampling a proportional amount of each. For instance, if you
know the population has 25% women and 75% men, and that you want a total
sample size of 400, you will continue sampling until you get those percentages and
then you will stop. So, if you’ve already got 100 women for your sample, but not

300 men, you will continue to sample men, even if legitimate women respondents
come along—you will not sample them because you have already ‘met your quota’.
The primary problem with this form of sampling is that even when we know that
a quota sample is representative of the particular characteristics for which quotas
have been set, we have no way of knowing if the sample is representative in terms
of any other characteristics. If we set quotas for age, we are likely to attain a sample
with good representativeness on age, but one that may not be very representative
in terms of gender, education, or other pertinent factors.
• In non-proportional quota sampling you specify the minimum number of
sampled data points you want in each category. In this case you are not concerned
with having the correct proportions, but with achieving the numbers in each
category. This method is the non-probabilistic analogue of stratified random
sampling in that it is typically used to assure that smaller groups are adequately
represented in your sample.

Finally, researchers often introduce bias when allowed to self-select respondents,


which is usually the case in this form of survey research. In choosing males, inter-
viewers are more likely to choose those that are better-dressed, or who seem more
approachable or less threatening. That may be understandable from a practical point
of view, but it introduces bias into research findings.

• Snowball sampling
In snowball sampling, you begin by identifying someone who meets the criteria for
inclusion in your study. You then ask them to recommend others who they may know
who also meet the criteria. Thus, the sample group appears to grow like a rolling snow-
ball. This sampling technique is often used in hidden populations, which are dif-
ficult for researchers to access, including firms with financial difficulties or students
struggling with their studies. The method creates a sample with questionable repre-
sentativeness and it can be difficult to judge how a sample compares with a larger pop-
ulation. Furthermore, an issue arises in who the respondents refer you to, for example,
friends will refer you to friends but are less likely to refer to ones they don’t consider as
friends, for whatever reason. This creates a further bias within the sample that makes it
difficult to say anything about the population.

Note The primary difference between probability methods of sampling and non-
probability methods is that in the latter you do not know the likelihood that any element of a
population will be selected for study.

5.1.4 Types of error


In this textbook we will be concerned with sampling from populations using probability
sampling methods, which are the prerequisite for the application of the statistical tests. If
we base our decisions on a sample, rather than the whole population, by definition we are

going to make some errors. The concept of sampling implies that we’ll also have to deal
with a number of types of errors, including sampling error, coverage error, measurement
error, and non-response error.

(a) What is sampling error?


Sampling error is the calculated statistical imprecision due to surveying a random sample
instead of the entire population. The margin of error provides an estimate of how much
the results of the sample may differ because of chance when compared with what would
have been found if the entire population were interviewed.

(b) Coverage error


Coverage error is associated with the inability to contact portions of the population.
Telephone surveys usually exclude people who do not have access to a landline telephone
in their homes. This will exclude people who are not at home or unavailable for a number
of reasons: at work, in prison, on holiday, or unavailable at the times when the telephone
calls are made.

(c) Measurement error


Measurement error is error, or bias, that occurs when surveys do not survey what they
intended to measure. This type of error results from flaws in the measuring instrument,
for example, question wording, question order, interviewer error, timing, and question
response options. This is the most common type of error faced by the polling industry.

(d) Non-response error


Non-response error results from not being able to interview people who would be eligible
to take the survey. Many households now have telephone answering machines and caller
identification that prevent easy contact or people may simply not want to respond to calls.
Non-response bias is the difference in responses of those people who complete the survey
against those who refuse to for any reason. While the error itself cannot be calculated,
response rates can be calculated and there are countless ways to do so.
The rest of this text will focus on samples that have been randomly selected and the
associated statistical techniques that can be applied to a randomly selected data set.

5.2 Sampling from a population

5.2.1 Introduction
When we wish to know something about a particular population it is usually impractical, especially when considering large populations, to collect data from every unit of that population. It is more efficient to collect data from a sample of the population under study and from the sample make estimates of the population parameters. Essentially, based on a sample, we make generalizations about a population.

Estimate: an estimate is an indication of the value of an unknown quantity based on observed data.

5.2.2 Population versus sample


To describe the difference between a population and a sample, we can say:

• population—a complete set of counts or measurements derived from all objects


possessing one or more common characteristics, such as height, age, sales, and
income. Measures, such as means and standard deviations, derived from the
population data are known as population parameters;
• sample—a proportion of a population under study derived from sample data.
Measures, such as means and standard deviations, derived from sample statistics are
known as sample statistics or estimators.

The method of using samples to estimate population parameters is known as statistical


inference.
Statistical inference draws upon the probability results discussed in previous chap-
ters, especially the normal distribution. To distinguish between population and sample
parameters the following symbols are used, as presented in Table 5.1.

Parameter              Population    Sample
Size                   N             n
Mean                   μ             x̄
Standard deviation     σ             s
Proportion             π             ρ

Table 5.1

Note The symbols μ, σ, and ρ are the Greek letters mu, sigma, and rho.

5.2.3 Sampling distributions


The main issue that we shall explore shortly is that we wish to collect a sample (or samples)
from a population and use this sample to provide an estimate of the population parame-
ters (mean, standard deviation, proportion) by using the sample parameter value (sample
mean, sample proportion, and sample standard deviation). For example, you may wish to
check the quantity of muesli per bag produced by the manufacturing process. To answer
this type of question we need to know how the parameter being measured varies. This is
called the sampling distribution and we will now explore the sampling distribution of the
mean and proportion.

5.2.4 Sampling distribution of the mean

Sampling distribution: the sampling distribution describes probabilities associated with a statistic when a random sample is drawn from a population.

In this section we will explore what we mean by the sampling distribution of the mean. Let’s assume that a sample of size 200 is taken and that the mean weight is 135.5 kg. Another sample is taken and the mean weight is 132.5 kg. A large number of samples might be taken and the sample means calculated. Obviously, these means are unlikely to

be equal and they can be plotted as a frequency distribution of the means. What is really
important here is that the mean of all the sample means has some interesting properties.
It is identical to the overall population mean.

Note A sample mean is unbiased as the mean of all sample means of size n selected
from the population is equal to the population mean, μ.

Example 5.3
To illustrate this property consider the problem of tossing a fair die. The die has 6 numbers (1,
2, 3, 4, 5, and 6), with each number likely to have the same frequency of occurrence. If we then
take all possible samples of size 2 from this population then we will be able to illustrate two
important results of the sampling distribution of the sample means.

Excel solution—Example 5.3

Figure 5.3 illustrates the Excel solution.

Figure 5.3

➜ Excel solution
X Cells B8:B13 Values
X² Cell C8 Formula: =B8^2
Copy formula down C8:C13
N = Cell C15 Formula: =COUNT(B8:B13)
ΣX = Cell C16 Formula: =SUM(B8:B13)
ΣX² = Cell C17 Formula: =SUM(C8:C13)
Mean = Cell C18 Formula: =C16/C15
Mean = Cell C19 Formula: =AVERAGE(B8:B13)
Pop SDev = Cell C20 Formula: =SQRT(C17/C15−C18^2)
Pop SDev = Cell C21 Formula: =STDEV.P(B8:B13)

Unbiased: when the mean of the sampling distribution of a statistic is equal to a population parameter, that statistic is said to be an unbiased estimator of the parameter.

From the population data values (1, 2, 3, 4, 5, and 6) we can calculate the population
mean and standard deviation using equations (2.1) and (2.3):

Population mean, μ = ΣX / N = 21/6 = 3.5

Population standard deviation, σ = √(ΣX²/N − μ²) = √(91/6 − (3.5)²) = 1.7078

If we now sample all possible samples of size 2 (n = 2) from the population then we
would have the following sampling distribution of size 2. We can calculate the mean of
these sample means and corresponding standard deviation of the sample means, as illus-
trated in Figure 5.4.

Figure 5.4

➜ Excel solution
Sample pairs
Value 1 Cells F8:F28 Values
Value 2 Cells G8:G28 Values
Value mean Cell H8 Formula: =(F8+G8)/2
Copy formula down from H8:H28
f Cells J8:J28 Values
f*Xbar Cell K8 Formula: =J8*H8
Copy formula down from K8:K28
f*Xbar² Cell M8 Formula: =K8*H8
Copy formula down from M8:M28
Σf = Cell G30 Formula: =SUM(J8:J28)
ΣfX̄ = Cell G31 Formula: =SUM(K8:K28)
ΣfX̄² = Cell G32 Formula: =SUM(M8:M28)
Mean = Cell G33 Formula: =G31/G30
SD = Cell G34 Formula: =SQRT(G32/G30−G33^2)

The sample mean (X̄) is calculated using equation (5.1).

X̄ = ΣX / n    (5.1)

For sample pair (2, 6) the sample mean is equal to 4 (cell H18). For each sample pair we
would have a different sample mean, as can be observed in Figure 5.4 (column H). From
this list of sample means we can calculate the overall mean of the sample means using
equation (5.2).

X̿ = ΣfX̄ / Σf    (5.2)

From Excel, the mean of the sample means, X̿, is equal to 3.5. From the die experiment we observe X̿ = μ = 3.5. Furthermore, the mean of the sample means is an unbiased estimator of the population mean.

X̿ = μ    (5.3)

The standard error of the sample means (or standard deviation of the sample means) measures the standard deviation of all sample means from the overall mean. We know the population data ranges from 1 to 6 with a population standard deviation of 1.7078. We can repeat this exercise to calculate the standard deviation of the sample means using equation (5.4).

$$\sigma_{\bar{X}} = \sqrt{\frac{\sum f\bar{X}^2}{\sum f} - \bar{\bar{X}}^2} \qquad (5.4)$$

From Excel, the standard deviation of the sample means ($\sigma_{\bar{X}}$) is equal to 1.2076. From this we conclude that a difference exists between the two values. Why? Observe that in the sampling example we calculated a series of sample means of size 2 and then calculated the overall mean of the sample means. When averaging you replace the data set with a single
number that measures the middle value of the data set. The mean will be influenced by
any extreme data points in the sample, but by repeating the experiment to calculate a
series of means we should find that the range between the largest and smallest means
is less than the range within the original data sets. In other words, averages have smaller
variability than single observations. The standard error of the sampling mean distribution is not equal to the population standard deviation ($\sigma_{\bar{X}} < \sigma$). In fact, the standard deviation of the sample means is a biased estimate of the population standard deviation.

Note The standard deviation of the sample means is a biased estimate of the population standard deviation because it is systematically smaller than the population standard deviation, as equation (5.5) shows.

It can be shown that the relationship between sample and population is represented
by equation (5.5):

$$\sigma_{\bar{X}} = \frac{\sigma}{\sqrt{n}} \qquad (5.5)$$

Equation (5.5) is called the standard error of the sample means or just standard error.
From equation (5.5) we observe that as n increases, the value of the standard error of the sampling mean approaches zero ($\sigma_{\bar{X}} \to 0$). In other words, as n increases the spread of the sample means shrinks towards zero; in the limit the sample mean would no longer vary at all and would simply equal the population mean.

Note The law of large numbers implies that the sample mean X will approach the
population mean (μ) as n increases in value.

Using the numbers from our example, the values of the mean and standard deviation of the sampling means are calculated as follows.

Check:

$$\bar{\bar{X}} = \frac{\sum f\bar{X}}{\sum f} = \frac{126}{36} = 3.5$$

$$\sigma_{\bar{X}} = \sqrt{\frac{\sum f\bar{X}^2}{\sum f} - \bar{\bar{X}}^2} = \sqrt{\frac{493.5}{36} - (3.5)^2} = 1.2076$$

$$\frac{\sigma}{\sqrt{n}} = \frac{1.7078}{\sqrt{2}} = 1.2076 = \sigma_{\bar{X}}$$
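This check can also be reproduced outside Excel. The following is a minimal Python sketch (standard library only; the variable names are ours, not the book's) that enumerates all 36 ordered samples of size 2 and confirms both properties:

➜ Python sketch
# Example 5.3: sampling distribution of the mean for two rolls of a fair die.
from itertools import product
from math import sqrt

population = [1, 2, 3, 4, 5, 6]
N = len(population)

# Population mean and standard deviation (equations (2.1) and (2.3)).
mu = sum(population) / N
sigma = sqrt(sum(x ** 2 for x in population) / N - mu ** 2)

# All 36 ordered samples of size 2 and their means (equation (5.1)).
means = [(a + b) / 2 for a, b in product(population, repeat=2)]

# Mean and standard deviation of the sampling distribution (equations (5.2), (5.4)).
mean_of_means = sum(means) / len(means)
sd_of_means = sqrt(sum(m ** 2 for m in means) / len(means) - mean_of_means ** 2)

print(mu, sigma)                    # 3.5  1.7078...
print(mean_of_means, sd_of_means)   # 3.5  1.2076...
print(sigma / sqrt(2))              # 1.2076..., confirming equation (5.5)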

5.2.5 Sampling from a normal population


If we select a random variable X from a population that is normally distributed with population mean μ and standard deviation σ then we can state this relationship using the notation X ~ N(μ, σ²).
Figure 5.5 illustrates the relationship between the variable and the distribution.
Figure 5.5 Normal curve, f(x), for X ~ N(μ, σ²)

If we choose a sample from a normal population then we can show that the sample
means are also normally distributed with a mean of μ and a standard deviation of the
sampling mean given by equation (5.5), where n is the sample size on which the sampling
distribution was based. Figure 5.6 illustrates the relationship between the sampling mean
and the normal distribution:

Figure 5.6 Normal curve for the sample mean, X̄ ~ N(μ, σ²/n)

Example 5.4
Consider the problem of selecting 40,000 random samples from a population that is assumed
to be normally distributed with mean £45,000 and standard deviation of £10,000.
The population values are based on 40,000 data points and the sampling distribution is
illustrated in Figure 5.7.
We observe from Figure 5.7 that the population data is approximately normal.
Figure 5.7 Histogram for the population data, N = 40,000 (frequency against bins from 6000 to 96,000)


From Figures 5.8 to 5.11 we observe that the sampling distribution of the mean is approxi-
mately normal for sample distributions of size n = 2, 5, 10, and 40. From the histograms we
observe that the sample means are less spread out about the mean as the sample sizes increase.
Figure 5.8 illustrates the sampling distribution for the sample means for sample size n = 2.
Figure 5.8 Histogram of the sample means, n = 2

Figure 5.9 illustrates the sampling distribution for the sample means for sample size n = 5.
Figure 5.9 Histogram of the sample means, n = 5

Figure 5.10 illustrates the sampling distribution for the sample means for sample size n = 10.
Figure 5.10 Histogram of the sample means, n = 10

Figure 5.11 illustrates the sampling distribution for the sample means for sample size n = 40.
Figure 5.11 Histogram of the sample means, n = 40



Note From these observations we conclude that if we sample from a population that is normally distributed with mean μ and standard deviation σ (X ~ N(μ, σ²)), then the sampling mean is normally distributed with mean μ and standard deviation of the sample means of $\sigma_{\bar{X}} = \sigma/\sqrt{n}$.

This relationship is represented by equation (5.6):

$$\bar{X} \sim N\!\left(\mu, \frac{\sigma^2}{n}\right) \qquad (5.6)$$

Given that we now know that the sample mean is normally distributed then we can
solve a range of problems using the methods described in Chapters 6, 8, and 9. The stand-
ardized sample mean Z value is given by equation (5.7):

$$Z = \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} \qquad (5.7)$$

Example 5.5
Diet X runs a number of weight reduction centres within a large town in the north east of
England. From the historical data it was found that the weight of participants is normally dis-
tributed with a mean of 150 lb and a standard deviation of 25 lb. This can be written in math-
ematical notation as X ~ N(150, 25²). Calculate the probability that the average sample weight
is greater than 160 lb when 25 participants are randomly selected for the sample.

Excel solution—Example 5.5


Figure 5.12 illustrates the Excel solution.

Figure 5.12

➜ Excel solution
Population mean = Cell D7 Value
Population standard deviation = Cell D8 Value
Sample size n = Cell D11 Value
Sample mean = Cell D12 Value
Standard error of mean = Cell D13 Formula: =D8/D11^0.5
Z = Cell D15 Formula: =(D12−D7)/D13
Z = Cell D16 Formula: =STANDARDIZE(D12,D7,D13)
P = Cell D18 Formula: =1−NORM.DIST(D12,D7,D13,TRUE)
P = Cell D19 Formula: =1−NORM.S.DIST(D16,TRUE)

The problem requires the solution to P(X̄ > 160).
Figure 5.13 illustrates the region to be found that represents this probability. Excel
can be used to solve this problem by either using the NORM.DIST () or NORM.S.DIST ()
functions.

Figure 5.13 Normal curve showing P(X̄ > 160) = P(Z > 2)

Note We have already described both Excel functions NORM.DIST () and NORM.S.DIST (). Ensure you do not confuse them.

Given the population mean (μ = 150), population standard deviation (σ = 25), sample size (n = 25), and standard error of the sample mean $\sigma_{\bar{X}} = \sigma/\sqrt{n} = 25/\sqrt{25} = 5$.

Method 1—NORM.DIST () function: =NORM.DIST(X̄, μ, σX̄, TRUE)

From Excel, P(X̄ > 160) = 1 − NORM.DIST(X̄, μ, σX̄, TRUE) = 0.022750132.

Method 2—NORM.S.DIST () function: =NORM.S.DIST (Z, TRUE)


From equation (5.7) we have:

$$Z = \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} = \frac{160 - 150}{25/\sqrt{25}} = \frac{10}{5} = 2$$

From Excel, P(X̄ > 160) = P(Z > 2) = 1 − NORM.S.DIST(Z, TRUE) = 0.022750132.
As expected, both methods provided the same answer to the problem of calculating the
required probability.

❉ Interpretation Based upon a random sample the probability that the sample mean
is greater than 160 pounds is 0.0228 or 2.28%.
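For readers who want to cross-check the Excel output, the same probability can be computed in Python. This is a sketch assuming the scipy library, whose norm.cdf plays the role of NORM.DIST and NORM.S.DIST:

➜ Python sketch
# Example 5.5 cross-checked in Python.
from math import sqrt
from scipy.stats import norm

mu, sigma, n = 150, 25, 25
se = sigma / sqrt(n)                # standard error of the mean = 5

# Method 1: use the sampling distribution N(mu, se^2) directly.
p1 = 1 - norm.cdf(160, loc=mu, scale=se)

# Method 2: standardize first (equation (5.7)), then use the standard normal.
z = (160 - mu) / se                 # = 2
p2 = 1 - norm.cdf(z)

print(p1, p2)                       # both 0.02275..., matching Excel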

Example 5.6 Calculate the probability that the sample mean lies between 140 and 158 pounds.

Excel solution—Example 5.6


Figure 5.14 illustrates the Excel solution.

Figure 5.14

➜ Excel solution
Population mean = Cell D5 Value
Population standard deviation = Cell D6 Value
Sample size n = Cell D8 Value
Standard error = Cell D9 Formula: =D6/D8^0.5
Sample 1 mean = Cell D10 Value
Sample 2 mean = Cell D11 Value
Z1 = Cell D12 Formula: =(D10−D5)/D9
Z2 = Cell D13 Formula: =(D11−D5)/D9
P = Cell D15 Formula: =NORM.DIST(D11,D5,D9,TRUE)−
NORM.DIST(D10,D5,D9,TRUE)
P = Cell D16 Formula: = NORM.S.DIST(D13,TRUE)-
NORM.S.DIST(D12,TRUE)
204 Business statistics using Excel

The problem requires the solution to P(140 < X̄ < 158).
Figure 5.15 illustrates the region to be found that represents this probability. Again, Excel
can be used to solve this problem by using either the NORM.DIST () or NORM.S.DIST ()
functions.

Figure 5.15 Normal curve showing P(140 < X̄ < 158) = P(−2 < Z < 1.6)

Given the population mean (μ = 150), population standard deviation (σ = 25), sample size (n = 25), and standard error of the sample mean $\sigma_{\bar{X}} = \sigma/\sqrt{n} = 25/\sqrt{25} = 5$.

Method 1—NORM.DIST () function: =NORM.DIST(X̄, μ, σX̄, TRUE)

From Excel, P(140 < X̄ < 158) = NORM.DIST(X̄2, μ, σX̄, TRUE) − NORM.DIST(X̄1, μ, σX̄, TRUE) = 0.922450576.

Method 2 – NORM.S.DIST () function: =NORM.S.DIST (Z, TRUE)

From equation (5.7) we have:

$$Z_1 = \frac{\bar{X}_1 - \mu}{\sigma/\sqrt{n}} = \frac{140 - 150}{25/\sqrt{25}} = \frac{-10}{5} = -2$$

$$Z_2 = \frac{\bar{X}_2 - \mu}{\sigma/\sqrt{n}} = \frac{158 - 150}{25/\sqrt{25}} = \frac{8}{5} = 1.6$$

From Excel, P(140 < X̄ < 158) = P(−2 < Z < 1.6) = NORM.S.DIST(Z2, TRUE) − NORM.S.DIST(Z1, TRUE) = 0.922450576.
Both methods provided the same answer to the problem of calculating the required
probability.

❉ Interpretation Based upon a random sample the probability that the sample mean
is between 140 and 158 lb is 0.9224 or 92.24%.
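The same interval probability can be cross-checked in Python (again a sketch assuming scipy):

➜ Python sketch
# Example 5.6: P(140 < Xbar < 158) as a difference of two normal CDFs.
from math import sqrt
from scipy.stats import norm

mu, sigma, n = 150, 25, 25
se = sigma / sqrt(n)

p = norm.cdf(158, mu, se) - norm.cdf(140, mu, se)
print(p)                            # 0.92245..., matching both Excel methods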

5.2.6 Sampling from a non-normal population


In Section 5.2.5 we sampled from a population which is normally distributed and we stated that the sample means will be normally distributed with mean μ and standard deviation $\sigma_{\bar{X}}$. What if the data does not come from the normal distribution? It can be shown that if we select a random sample from a non-normal distribution then the sampling mean will be approximately normal with mean μ and standard deviation $\sigma_{\bar{X}}$ if the sample size is sufficiently large. In most cases, the value of n should be at least 30 for non-symmetric distributions and at least 20 for symmetric distributions before we apply this approximation. This relationship is already represented by equation (5.6).
This leads to an important concept in statistics known as the Central Limit Theorem.
The Central Limit Theorem provides us with a shortcut to the information required for
constructing a sampling distribution. By applying the Theorem we can obtain the descrip-
tive values for a sampling distribution (usually the mean and the standard error, which is
computed from the sampling variance) and we can also obtain probabilities associated
with any of the sample means in the sampling distribution.

Note The Central Limit Theorem states that no matter what the shape of the population distribution, the sampling distribution of the means will be approximately normal, with increasing sample sizes providing better approximations to the normal distribution.

If the mean is approximately normally distributed then we can solve a range of prob-
lems using the methods described in Chapters 6, 8, and 9.

Example 5.7
Consider the sampling of 50 electrical components from a production run where, historically,
the component’s average lifetime was found to be 950 hours with a standard deviation of 25
hours. The population data is right-skewed and therefore cannot be considered to be normally
distributed. Calculate the probability that the sample mean is less than 958 hours.

Excel solution—Example 5.7


Figure 5.16 illustrates the Excel solution.

Figure 5.16

x Central Limit Theorem The Central Limit Theorem states that whenever a random sample is taken from any distribution (μ, σ²), then the sample mean will be approximately normally distributed with mean μ and variance σ²/n.

➜ Excel solution
Population mean = Cell D3 Value
Population standard deviation = Cell D4 Value
Sample size n = Cell D6 Value
Standard error = Cell D7 Formula: =D4/D6^0.5

Sample mean = Cell D8 Value


Z = Cell D9 Formula: =(D8−D3)/D7
P = Cell D11 Formula: =NORM.DIST(D8,D3,D7,TRUE)
P = Cell D12 Formula: =NORM.S.DIST(D9,TRUE)

As the sample size is reasonably large (>30), we will apply the Central Limit Theorem to the problem and assume that the sampling mean distribution is approximately normally distributed. From equation (5.6) we have $\bar{X} \sim N(\mu, \sigma^2/n) = N(950, 25^2/50)$.
The problem requires the solution to P(X̄ < 958).
Figure 5.17 illustrates the region to be found that represents this probability.

Figure 5.17 Normal curve showing P(X̄ < 958) = P(Z < 2.2627417)

Excel can be used to solve this problem by either using the NORM.DIST () or
NORM.S.DIST () functions.
Given the population mean (μ = 950), population standard deviation (σ = 25), sample size (n = 50), and standard error ($\sigma_{\bar{X}} = \sigma/\sqrt{n} = 25/\sqrt{50} = 3.535533906$).

Method 1—NORM.DIST () function: =NORM.DIST(X̄, μ, σX̄, TRUE)

From Excel, P(X̄ < 958) = NORM.DIST(X̄, μ, σX̄, TRUE) = 0.988174192.

Method 2—NORM.S.DIST () function: =NORM.S.DIST(Z, TRUE)

From equation (5.7) we have:

$$Z = \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} = \frac{958 - 950}{25/\sqrt{50}} = \frac{8}{3.535533906} = 2.2627417$$

From Excel, P(X̄ < 958) = P(Z < 2.2627417) = NORM.S.DIST(Z, TRUE) = 0.988174192.
Both methods provide the same answer to the problem of calculating the required
probability.

❉ Interpretation Based upon a random sample the probability that the sample mean
is less than 958 hours is 0.988174192 or 98.82%.
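Both the calculation and the Central Limit Theorem reasoning behind it can be checked in Python. This sketch assumes scipy and numpy, and the gamma shape chosen for the right-skewed population is purely an illustrative assumption:

➜ Python sketch
# Example 5.7: CLT answer, plus a simulation from a right-skewed population.
from math import sqrt
import numpy as np
from scipy.stats import norm

mu, sigma, n = 950, 25, 50
se = sigma / sqrt(n)                          # 3.5355...
print(norm.cdf(958, mu, se))                  # CLT answer: 0.98817...

# A gamma population with the same mean and sd (assumed shape) still gives
# roughly the same probability for the mean of 50 values, as the CLT predicts.
rng = np.random.default_rng(1)
shape, scale = (mu / sigma) ** 2, sigma ** 2 / mu
means = rng.gamma(shape, scale, size=(100_000, n)).mean(axis=1)
print((means < 958).mean())                   # approximately 0.988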

In the previous cases we assumed that sampling will have taken place with replacement
(very large or infinite population). If no replacement is undertaken then equation (5.5) is
modified by a correction factor to give equation (5.8):

$$\sigma_{\bar{X}} = \frac{\sigma}{\sqrt{n}} \times \sqrt{\frac{N - n}{N - 1}} \qquad (5.8)$$

Where N = size of population and n = size of sample.

Example 5.8
A random sample of 30 part-time employees is chosen without replacement from a firm employing 200 part-time workers. If the mean hours worked per month is 60 hours with a standard deviation of 5 hours, determine the probability that the sample mean: (a) will lie between 60 and 62 hours, and (b) will be over 63 hours. In this example we have a finite population of size N (= 200) and a sample size of 30 (n = 30).
From equation (5.8) we can calculate the standard error of the sampling mean and then use
Excel to calculate the two probability values.

Excel solution—Example 5.8


Figure 5.18 illustrates the Excel solution.

Figure 5.18

➜ Excel solution
Population mean = Cell D3 Value
Population standard deviation = Cell D4 Value
Population size N = Cell D6 Value
Sample size n = Cell D7 Value
Standard error = Cell D9 Formula: =(D4/D7^0.5)*SQRT((D6−D7)/(D6−1))

(a)
Sample 1 mean = Cell D12 Value
Sample 2 mean = Cell D13 Value
P = Cell D14 Formula: = NORM.DIST(D13,D3,D9,TRUE)-
NORM.DIST(D12,D3,D9,TRUE)
Z1 = Cell D15 Formula: =(D12−D3)/D9
Z2 = Cell D16 Formula: =(D13−D3)/D9
P = Cell D17 Formula: =NORM.S.DIST(D16,TRUE)−
NORM.S.DIST(D15,TRUE)
(b)
Sample mean = Cell D20 Value
Z = Cell D21 Formula: =(D20−D3)/D9
P = Cell D22 Formula: =1−NORM.DIST(D20,D3,D9,TRUE)

As the sample size is relatively large for the population, we will apply the Central Limit
Theorem to the problem and assume that the sampling mean distribution is approxi-
mately normally distributed. From equation (5.6) we have $\bar{X} \sim N(\mu, \sigma^2/n)$.

(a) The problem requires the solution to P(60 < X̄ < 62).

Figure 5.19 illustrates the region to be found that represents this probability.

Figure 5.19 Normal curve showing P(60 < X̄ < 62) = P(0 < Z < 2.37)

Excel can be used to solve this problem by either using the NORM.DIST () or
NORM.S.DIST () functions.
Given the population mean (μ = 60), population standard deviation (σ = 5), sample size (n = 30), and standard error ($\sigma_{\bar{X}} = 0.84373\ldots$), calculate P(60 < X̄ < 62).

Method 1—NORM.DIST () function: = NORM.DIST (X , μ, σ X , TRUE)

From Excel, P(60 < X̄ < 62) = NORM.DIST(D13, D3, D9, TRUE) − NORM.DIST(D12, D3, D9, TRUE) = 0.491115714.

Method 2—NORM.S.DIST () function: = NORM.S.DIST (Z, TRUE)



From equations (5.7) and (5.8) we have:

$$Z_1 = \frac{\bar{X}_1 - \mu}{\sigma_{\bar{X}}} = \frac{60 - 60}{0.84373} = 0$$

$$Z_2 = \frac{\bar{X}_2 - \mu}{\sigma_{\bar{X}}} = \frac{62 - 60}{0.84373} = 2.3704$$

From Excel, P(60 < X̄ < 62) = P(0 < Z < 2.3704) = NORM.S.DIST(Z2, TRUE) − NORM.S.DIST(Z1, TRUE) = 0.491115714.
Both methods provide the same answer to the problem of calculating the required
probability.

❉ Interpretation Based upon a random sample the probability that the sample mean
lies between 60 and 62 is 0.491115714 or 49.11%.

(b) The problem requires the solution to P(X̄ > 63).

Figure 5.20 illustrates the region to be found that represents this probability.
Excel can be used to solve this problem by either using the NORM.DIST () or
NORM.S.DIST () functions.

Figure 5.20 Normal curve showing P(X̄ > 63) = P(Z > 3.56)

Given the population mean (μ = 60), population standard deviation (σ = 5), sample size (n = 30), and standard error from equation (5.8) ($\sigma_{\bar{X}} = 0.84373\ldots$), calculate P(X̄ > 63).

Method 1—NORM.DIST () function: = NORM.DIST (X , μ, σ X , TRUE)

From Excel, P(X̄ > 63) = 1 − NORM.DIST(X̄, μ, σX̄, TRUE) = 0.000188553.

Method 2—NORM.S.DIST () function: = NORM.S.DIST (Z, TRUE)

From equation (5.7) we have:

$$Z = \frac{\bar{X} - \mu}{\sigma_{\bar{X}}} = \frac{63 - 60}{0.84373} = 3.55560866$$

From Excel, P(X̄ > 63) = P(Z > 3.55560866) = 1 − NORM.S.DIST(Z, TRUE) = 0.000188553.
Both methods provide the same answer to the problem of calculating the required
probability.

❉ Interpretation Based upon a random sample the probability that the sample mean
is greater than 63 is 0.000188553 or 0.02%.
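Both parts can be checked with a few lines of Python (a sketch assuming scipy), applying the finite population correction of equation (5.8):

➜ Python sketch
# Example 5.8: sampling without replacement from a finite population.
from math import sqrt
from scipy.stats import norm

mu, sigma, N, n = 60, 5, 200, 30
se = (sigma / sqrt(n)) * sqrt((N - n) / (N - 1))    # equation (5.8): 0.84373...

print(norm.cdf(62, mu, se) - norm.cdf(60, mu, se))  # (a) 0.49111...
print(1 - norm.cdf(63, mu, se))                     # (b) 0.000188...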

5.2.7 Sampling distribution of the proportion


Consider the case where a variable has two possible values, ‘yes’ or ‘no’, and we are
interested in the proportion that choose ‘yes’ or ‘no’ in some survey that measured the
response of shoppers in deciding whether to purchase a particular product A. From the
historical data it is found that 40% of people surveyed preferred product A and we would
define this as the estimated population proportion, π, who prefers product A. If we then
took a random sample from this population, it would be unlikely that exactly 40% would
choose product A, but, given sampling error, it is likely that this proportion could be
slightly less or slightly more than 40%. If we continued to sample proportions from this
population, then each sample would have an individual sample proportion value, which,
when placed together, would form the sampling distribution of the sample proportion
that choose product A.
The sampling distribution for the proportion is approximated using the binomial dis-
tribution given that the binomial distribution represents the distribution of ‘r’ successes
(choosing product A) from ‘n’ trials (or selections). The binomial distribution is the distri-
bution of the total number of successes, whereas the distribution of the population pro-
portion is the distribution of the mean number of successes.
Given that the mean is the total divided by the sample size, n, then the sampling dis-
tribution of the proportions and the binomial distribution differ in that the sample pro-
portion is the mean of the scores and the binomial distribution is dealing with the total
number of successes. We know from equation (4.8) that the mean of a binomial distribu-
tion is given by the equation μ = nπ, where π represents the population proportion. If we
divide through by ‘n’, then this equation gives equation (5.9), which represents the unbi-
ased estimator of the mean of the sampling distribution for the proportions.

$$\mu_\rho = \pi \qquad (5.9)$$

Equation (4.9) represents the variance of the binomial distribution which when divided
by ‘n’ gives equation (5.10), the standard deviation (or standard error) of the sampling
proportion, σρ, where π represents the population proportion.

$$\sigma_\rho = \sqrt{\frac{\pi(1 - \pi)}{n}} \qquad (5.10)$$

From equations (5.9) and (5.10) the sampling distribution of the proportion is
approximated by a binomial distribution with mean (μρ) and standard deviation (σρ).
Sampling distributions and estimating 211

Furthermore, the sampling distribution of the sample proportion (ρ) can be approxi-
mated with a normal distribution when the probability of success is approximately 0.5,
and nπ and n(1–π) are at least 5.

$$\rho \sim N\!\left(\pi, \frac{\pi(1 - \pi)}{n}\right) \qquad (5.11)$$

The standardized sample mean Z value is given by modifying equation (5.7) to give
equation (5.12).

$$Z = \frac{\rho - \pi}{\sqrt{\pi(1 - \pi)/n}} \qquad (5.12)$$

Example 5.9
It is known that 25% of workers in a factory own a personal computer. Find the probability that
at least 26% of a random sample of 80 workers will own a personal computer. In this example,
we have the population proportion π = 0.25 and sample size n = 80. The problem requires the
calculation of P(ρ ≥ 0.26).

Excel solution—Example 5.9


Figure 5.21 illustrates the Excel solution.

Figure 5.21

➜ Excel solution
Population proportion = Cell D3 Value
Sample proportion = Cell D5 Value
Sample size n = Cell D6 Value
Standard error = Cell D8 Formula: =SQRT(D3*(1−D3)/D6)
Z = Cell D10 Formula: =(D5−D3)/D8
P = Cell D12 Formula: =1−NORM.DIST(D5,D3,D8,TRUE)
P = Cell D14 Formula: =1−NORM.S.DIST(D10,TRUE)

From equation (5.10) the standard error for the sampling distribution of the proportion
is:

$$\sigma_\rho = \sqrt{\frac{\pi(1 - \pi)}{n}} = \sqrt{\frac{0.25(1 - 0.25)}{80}} = 0.04841$$

Substituting this value into equation (5.12) gives the standardized Z value:

$$Z = \frac{\rho - \pi}{\sqrt{\pi(1 - \pi)/n}} = \frac{0.26 - 0.25}{0.04841} = 0.206559$$

From Excel, P(ρ ≥ 0.26) = P(Z ≥ 0.206559) = 0.418177.

❉ Interpretation The probability that at least 26% of the workers own a computer is
41.82%.
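A Python sketch of the same normal approximation (assuming scipy):

➜ Python sketch
# Example 5.9: normal approximation for the sampling distribution of a proportion.
from math import sqrt
from scipy.stats import norm

pi, n, rho = 0.25, 80, 0.26
se = sqrt(pi * (1 - pi) / n)        # equation (5.10): 0.04841...
z = (rho - pi) / se                 # equation (5.12): 0.20656...

print(1 - norm.cdf(z))              # P(rho >= 0.26) = 0.41818...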

5.2.8 Using Excel to generate a sample from a probability distribution
Excel can be used to generate random samples from a range of probability distributions,
including uniform, normal, binomial, and Poisson distributions. To generate a random
sample, select Data > Data Analysis, as illustrated in Figures 5.22 and 5.23.

Figure 5.22
Excel Data Analysis add-in

Select Data > Data Analysis > Random Number Generation and click OK.
Enter the following parameters into Figure 5.23:

• Input number of variables (or samples).


• Input number of data values in each sample.
• Select the distribution.
• Input distribution parameters, e.g. for normal: μ, σ.
• Decide on where the results should appear (Output range).

Figure 5.23
Excel Random Number Generation

Click OK.

Example 5.10
Consider the problem of sampling from a population which consists of the salaries for pub-
lic sector employees employed by a national government. The historical data suggests that
the population data is normally distributed with mean of €45,000 and standard deviation of
€10,000. We can use Excel to generate ‘N’ random samples with each sample containing ‘n’
data values.
(a) Create 10 random samples each with 1000 data points.
(b) Calculate the mean for each random sample.
(c) Plot the histogram representing the sampling distribution for the sample mean.

(a) Generate ‘n’ samples with ‘N’ data values (n = 10, N = 1000), as
illustrated in Figure 5.24
From Excel, Select Data > Data Analysis > Random Number Generation.
Input:
n = 10
N = 1000
Normal distribution
Mean = 45000
SD = 1000
Output range: Cell B5. Click OK.
Figure 5.24 illustrates the completed menu.
The ‘n’ samples are located in the rows of the table of values, for example sample 1:
B5:K5, sample 2: B6:K6, and sample 1000: B1006:K1006.

Figure 5.24

(b) Calculate ‘n’ sample means


Calculate the sample mean using Excel function AVERAGE (), for example from sam-
ple 1: mean = average (B5:K5), sample 2: mean = average (B6:K6), and sample 1000:
mean = average (B1006:K1006). Figure 5.25 illustrates the first four samples and sample
means.

Figure 5.25

(c) Create histogram bins and plot histogram of sample means


We note from the spreadsheet that the smallest and largest sample means are 44014.28 and 46197.74 respectively. Based upon these two values we set the first histogram bin at 44000 with a step size of 500 (44000, 44500, . . ., 46500) (Figure 5.26).

Figure 5.26

To create the histogram select Data > Data Analysis > Histogram and select values as
illustrated in Figure 5.27.
Input Range: L5:L1004
Bin Range: N10:N15
Output Range: P9
Click OK

Figure 5.27

Figures 5.28 and 5.29 illustrate the frequency distribution and corresponding histogram.

Figure 5.28 Frequency distribution

Figure 5.29 Histogram of the sample means (bins from 44000 to 46500)

From the histogram we note that the histogram values are centred about the population
mean value of €45,000. If we repeated this exercise for different values of sample size ‘n’
we would find that the range would reduce as the sample sizes increase.
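A Python analogue of the whole exercise, assuming numpy and matplotlib in place of Excel's Data Analysis add-in, and using the dialog entries shown in Figure 5.24:

➜ Python sketch
# Example 5.10: generate samples, compute row means, and plot the histogram.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)

# 1000 rows by 10 columns of normal values, as in Figure 5.24,
# then one mean per row, as in Figure 5.25.
data = rng.normal(loc=45_000, scale=1_000, size=(1000, 10))
means = data.mean(axis=1)

# The same bins as Figure 5.26 (44000 to 46500 in steps of 500).
plt.hist(means, bins=np.arange(44_000, 47_000, 500), edgecolor="black")
plt.xlabel("Bin")
plt.ylabel("Frequency")
plt.show()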

Student exercises
X5.1 Five people have all made claims for the amounts shown in Table 5.2.

Person 1 2 3 4 5
Insurance claim, € 500 400 900 1000 1200

Table 5.2

A sample of two people is to be taken at random, with replacement, from the five. Derive the sampling distribution of the mean and prove: (a) $\bar{\bar{X}} = \mu$, and (b) $\sigma_{\bar{X}} = \sigma/\sqrt{n}$.
X5.2 If X is a normal random variable with mean 10 and standard deviation 2, i.e. X ~ N(10, 4), define and compare the sampling distribution for samples of size: (a) 2, (b) 4, and (c) 16.
X5.3 If X is any random variable with mean = 63 and standard deviation = 10, define and compare the sampling distribution for samples of size: (a) 40, (b) 60, and (c) 100.
X5.4 Use the Excel spreadsheet to generate a random sample of 100 observations from
a normal distribution with a mean of 10 and a standard deviation of 4. Calculate
the sample mean and standard deviation. Why are these values different from the
population values?
X5.5 Assuming that the weights of 10,000 items are normally distributed and that the
distribution has a mean of 115 kg and a standard deviation of 3 kg: (a) estimate how
many items have weights between 115 and 118 kg; (b) if you have to pick one item at
random from the whole 10,000 items, how confident would you be in predicting that
its value would lie between 112 and 115 kg?; (c) if a sample of 10 items was drawn
from the 10,000 items what would be the standard error of the sample mean? What
would be the standard error if the sample consisted of 40 items?
X5.6 By treating the following as finite and infinite samples comment on the standard
errors: (a) find the sample mean and standard error for random samples of 1000
accounts if bank A has 5024 saving accounts with an average in each account of £512
and a standard deviation of £150; and (b) find the sample mean and standard error
for random samples of 1000 accounts if bank A has 10,244 saving accounts with an
average in each account of £564 and a standard deviation of £150.
X5.7 A sample of 100 was taken from a population with π = 0.5. Find the probability that
the sample proportion will lie between: (a) 0.4 and 0.6, (b) 0.35 and 0.65, and (c) 0.5
and 0.65.
X5.8 From a parliamentary constituency a sample of 100 people were asked whether they
would vote Labour or Conservative. It is thought that 40% of the constituency will
favour Labour. Find the approximate probability that in an election Labour will win
(assume only a two-party vote).
X5.9 The annual income of doctors constitutes a highly positive-skewed distribution.
Suppose the population has an unknown mean and a standard deviation of £10,000.
An estimate of the population mean is to be made using the sample mean. This
estimate must be within £1000 either side of the true mean: (a) if n = 100, find the
probability that the estimate will meet the desired accuracy and (b) if n = 625, find the
probability that the estimate will meet the desired accuracy.
X5.10 The average number of Xerox copies made in a working day in a certain office is
356 with a standard deviation of 55. It costs the firm three pence per copy. During
a working period of 121 days what is the probability that the average cost per day is
more than £11.10?

5.3 Population point estimates

5.3.1 Introduction
In the previous section we explored the sampling distribution of the mean and propor-
tion, and stated that these distributions can be considered to be normal with particular
population parameters (μ, σ²). For many populations, it is likely that we do not know the
value of the population mean (or proportion). Fortunately, we can use the sample mean
(or proportion) to provide an estimate of the population value. The objective of estimation
is to determine the approximate value of a population parameter on the basis of a sample
statistic. The method described in this section is dependent upon the sampling distribu-
tion being normally or approximately normally distributed. We can provide two estimates
of the population value: point and interval estimate.
Figure 5.30 illustrates the relationship between population mean, point, and interval
estimates.

Figure 5.30 Point estimate and interval estimate of the population mean

Suppose that you want to find the mean weight of all football players who play in a local
football league. Owing to practical constraints you are unable to measure all the players,
but you are able to select a sample of 25 players at random and weigh them to provide a
sample mean. From Section 5.2 we know that the sampling distribution of the mean is
approximately normally distributed for large sample sizes and that the sample mean can
be considered to be an unbiased estimator of the population mean. After the sampling we
establish that the mean weight of the sample of players is 188 kg. This number becomes
the point estimate of the population mean. If we know, or can estimate, the population
standard deviation (σ), then we can apply equation (5.7) to provide an interval estimate
for the population mean based upon some degree of error between the sample and popu-
lation means. This interval estimate is called the confidence interval for the population
mean (or confidence interval for the population proportion if we are measuring propor-
tions). In this section we shall consider the following topics:

• types of estimates;
• criteria of a good estimator;
• point estimate of the population mean, μ;
• point estimate of the population proportion, π;
• point estimate of the population variance, σ².

x Point estimate A point estimate (or estimator) is any quantity calculated from the sample data which is used to provide information about the population.

x Confidence interval A confidence interval gives an estimated range of values which is likely to include an unknown population parameter.

In Section 5.4 we shall consider the following topics:

• confidence interval estimate of the population mean (μ) and proportion (π), σ known;
• confidence interval estimate of the population mean (μ) and proportion (π), σ unknown, n ≥ 30;
• confidence interval estimate of the population mean (μ) and proportion (π), σ
unknown, n < 30.

5.3.2 Types of estimate


To recap: a point estimate is a sample statistic used to estimate an unknown population
parameter. An interval estimate is a range of values used to estimate a population param-
eter. It indicates error by the extent of its range and by the probability of the true popula-
tion parameter lying within that range.

5.3.3 Criteria of a good estimator


Qualities desirable in a good estimator include being unbiased, consistent, and efficient.

1. An unbiased estimator of a population parameter is an estimator whose expected value is equal to that parameter. As we already know, the sample mean $\bar{X}$ is an unbiased estimator of the population mean, μ. This can be written as the expected value of the sample mean equals the population mean, as given by equation (5.13):

$$E(\bar{X}) = \mu \qquad (5.13)$$

2. An unbiased estimator is said to be consistent if the difference between the estimator and the parameter grows smaller as the sample size grows larger. The sample mean $\bar{X}$ is a consistent estimator of the population mean, μ, with the variance given by equation (5.14):

$$VAR(\bar{X}) = \frac{\sigma^2}{n} \qquad (5.14)$$

If n grows larger, then the value of the variance of the sample mean grows smaller.

3. If there are two unbiased estimators of a parameter, the one whose variance is smaller
is said to be efficient, for example both the sample mean and median are unbiased
estimators of the population mean. Which one should we use? The sample median
has a greater variance than the sample mean, so we choose the sample mean as it is
relatively efficient when compared with the sample median.

5.3.4 Point estimate of the population mean and variance


A point estimator draws inferences about a population by estimating the value of an
unknown parameter using a single point or data value. The sample mean is the best esti-
mator of the population mean. It is unbiased, consistent, and the most efficient estimator,
as long as the sample was either:

(a) drawn from a normal population


or
(b) the Central Limit theorem applies for large samples where the population data is not
normally distributed.

Thus, a point estimate of the population mean, µ̂, is given by equation (5.15):

$$\hat{\mu} = \bar{X} \qquad (5.15)$$

In Chapter 4 we noted that the point probabilities in continuous distributions were zero,
and here, in Chapter 5, we are expecting the point estimator to get closer and closer to the
true population value as the sample size increases. The degree of error is not reflected by
the point estimator, but we can employ the concept of the interval estimator to put a prob-
ability to the value of the population parameter lying between two values, with the middle
value being represented by the point estimator. Section 5.4 will discuss the concept of an
interval estimate or confidence interval.
In statistics, the standard deviation is often estimated from a random sample drawn
from a population. In Section 5.2.4 we showed, via a simple example, that the sampling
distribution of the means gives the following rules:

1. The mean of the sample means is an unbiased estimator of the population mean ($\bar{\bar{X}} = \mu$). In other words, the expected value of the sample means equals the population mean ($E(\bar{X}) = \mu$).
2. The sample variances are a biased estimator of the population variance ($s_b^2 \neq \sigma^2$). In other words, the expected value of the sample variances is not equal to the population variance ($E(s_b^2) \neq \sigma^2$).

The sample variance bias can be corrected using Bessel’s correction, which corrects the
bias in the estimation of the population variance and some, but not all, of the bias in the
estimation of the population standard deviation. The Bessel correction factor is given by
equation (5.16).

$$\frac{n}{n - 1} \qquad (5.16)$$

The corrected sample variance is given by equation (5.17).

$$s^2 = \sum_{i=1}^{n} \frac{(x_i - \bar{x})^2}{n - 1} \qquad (5.17)$$

x Point estimate of the population mean Point estimate for the mean involves the use of the sample mean to provide a ‘best estimate’ of the unknown population mean.

x Point estimate of the population variance Point estimate for the variance involves the use of the sample variance to provide a ‘best estimate’ of the unknown population variance.

If you use n rather than n – 1 in equation (5.17) then you are biasing the statistic as an estimator, giving an underestimate of the true population variance. It can be shown mathematically that the sample variance given by equation (5.17) is a point estimate of the population variance. The Excel function to calculate an unbiased estimate of the population variance (s²) is VAR.S().
The corrected sample standard deviation is given by equation (5.18).

$$s = \sqrt{\sum_{i=1}^{n} \frac{(x_i - \bar{x})^2}{n - 1}} \qquad (5.18)$$

Unfortunately, it can be shown mathematically that not all the bias is removed when using n – 1 in the equation rather than n but, fortunately, the amount of bias is negligible and we assume that equation (5.18) is an unbiased estimator of the population standard
deviation. The Excel function to calculate an unbiased estimate of the population stand-
ard deviation (s) is STDEV.S(). Finally, the standard error of the sample means with the
estimate of the population standard deviation given by the sample standard deviation is
given by equation (5.19).

$$\sigma_{\bar{X}} = \frac{s}{\sqrt{n}} \qquad (5.19)$$

The relationship between the biased sample variance ($s_b^2$) and the unbiased sample variance ($s^2$) is given by equation (5.20).

$$s^2 = \left(\frac{n}{n - 1}\right) s_b^2 \qquad (5.20)$$

Why n – 1 and not n?


This is because the population variance (σ²) is estimated from the sample mean (x̄) and
from the deviation of each measurement from the sample mean. But if we lacked any one
of these measurements (the sample mean or a single deviation value), we could calculate
it from the rest of the data. So with n measurements (data points) only n – 1 of them are
free variables in the calculation of the sample variance. For example, a missing observa-
tion can be found from the other n – 1 observations and the sample mean. Therefore, we
have n – 1 degrees of freedom.
The larger the sample size, n, the smaller the correction involved in using the degrees of
freedom (n – 1). For example, Table 5.3 compares the value of 1/n and 1/(n – 1) for differ-
ent sample sizes n = 15, 25, 30, 40, 50, and 100. From Table 5.3 we conclude that very little
difference exists between 1/n and 1/(n – 1) for large n compared with small n.

n 1/n 1/(n – 1) % difference


15 0.066666667 0.071428571 0.06667
25 0.04 0.041666667 0.04000
30 0.033333333 0.034482759 0.03333
40 0.025 0.025641026 0.02500
50 0.02 0.020408163 0.02000
100 0.01 0.01010101 0.01000

Table 5.3

Similarly, the bias in the sample standard deviation is very small when n – 1 is used
instead of n in the denominator. The sample standard deviation is still biased, but the bias
is negligible. For example, for a normally distributed variable the approximate unbiased
estimator of the population standard deviation (σ̂) can be shown to be given by equation
(5.21).
$$\hat{\sigma} = s \times \left(1 + \frac{1}{4(n - 1)}\right) \qquad (5.21)$$

x Degrees of freedom Refers to the number of independent observations in a sample minus the number of population parameters that must be estimated from sample data.

Table 5.4 explores the degree of error between the unbiased estimate of the population standard deviation and the sample standard deviation. The table shows that when the
sample size is 4 the underestimate is 8.33% and when the sample size is 30 the underes-
timate is 0.86%. Furthermore, the difference between the two values quickly reduces in
size.

n= 4 10 20 30 40 50 100
Error = 0.0833 0.0278 0.0132 0.0086 0.0064 0.0051 0.0025
% error = 8.3333 2.7778 1.3158 0.8621 0.6410 0.5102 0.2525

Table 5.4

From a practical perspective we assume that equation (5.18) gives an unbiased estima-
tor of the population standard deviation.

Example 5.11
An experiment on the measurement of the length of rods was performed five times, with the
following results: 1.010, 1.012, 1.008, 1.013, and 1.011. Calculate the unbiased estimates of the
mean and variance of possible measurements, and give an estimate for the standard error of
your estimate of the mean.

Excel solution—Example 5.11


Figure 5.31 illustrates the Excel solution.

Figure 5.31

➜ Excel solution
X Cells B5:B9 Value
(X-Xbar)2 Cell C5 Formula: =(B5-$G$9)^2
Copy formula down C5:C9
n = Cell G4 Formula: =COUNT(B5:B9)
ΣX = Cell G5 Formula: =SUM(B5:B9)
Σ(X-Xbar)2 = Cell G6 Formula: =SUM(C5:C9)

Formula solution
Sample mean = Cell G9 Formula: =G5/G4
Sample variance = Cell G10 Formula: =G6/(G4−1)
Sample standard deviation = Cell G11 Formula: =G10^0.5
Estimate of population mean = Cell G12 Formula: =G9
Estimate of population standard deviation = Cell G13 Formula: =G11
Estimate of the standard error of the mean = Cell G14 Formula: =G13/G4^0.5

Function solution
Sample mean = Cell G17 Formula: =AVERAGE(B5:B9)
Sample variance = Cell G18 Formula: =VAR.S(B5:B9)
Sample standard deviation = Cell G19 Formula: =STDEV.S(B5:B9)
Estimate of population mean = Cell G20 Formula: =G17
Estimate of population standard deviation = Cell G21 Formula: =G19
Estimate of the standard error of the mean = Cell G22 Formula: =G21/G4^0.5

The values of the unbiased estimates of the population mean, variance, and standard error of the mean are provided by solving equations (5.15), (5.17), and (5.19).

(a) Sample values

Sample size n = 5

$$\text{Sample mean } \bar{X} = \frac{1.010 + 1.012 + 1.008 + 1.013 + 1.011}{5} = 1.0108$$

$$\text{Sample standard deviation } s = \sqrt{\sum_{i=1}^{n} \frac{(X_i - \bar{X})^2}{n - 1}} = 0.0019235$$

(b) Population estimates

Estimate of population mean $\hat{\mu} = \bar{X} = 1.0108$

Estimate of population standard deviation $\hat{\sigma} = s = 0.0019235$

Estimate of population standard error $\hat{\sigma}_{\bar{X}} = \frac{\hat{\sigma}}{\sqrt{n}} = \frac{s}{\sqrt{n}} = 0.0008602$

❉ Interpretation The values of the unbiased estimates of the mean, standard deviation, and standard error are 1.011, 0.0019, and 0.0009 respectively.

x Standard error of the mean The standard error of the mean (SEM) is the standard deviation of the sample mean’s estimate of a population mean.
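These estimates are easy to reproduce outside Excel. A minimal sketch using Python's standard library (statistics.stdev applies the same n − 1 correction as STDEV.S):

➜ Python sketch
# Example 5.11: unbiased point estimates of the mean, SD, and standard error.
import statistics

rods = [1.010, 1.012, 1.008, 1.013, 1.011]
n = len(rods)

xbar = statistics.mean(rods)        # 1.0108, point estimate of mu
s = statistics.stdev(rods)          # 0.0019235, uses n - 1 like STDEV.S
se = s / n ** 0.5                   # 0.0008602, standard error of the mean

print(xbar, s, se)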
5.3.5 Point estimate for the population proportion and variance

x Point estimate of the population proportion Point estimate for the proportion involves the use of the sample proportion to provide a ‘best estimate’ of the unknown population proportion.

In the previous section we provided the equations to calculate the point estimate for the population mean based upon the sample data. Instead of solving problems involving the mean we can use the sample proportion to provide point estimates of the population proportion. Equations (5.22) and (5.23) provide point estimates of the population proportion and standard error:
Estimate of population proportion: $\hat{\pi} = \rho$ (5.22)

Estimate of standard error: $\hat{\sigma}_\rho = \sqrt{\dfrac{\hat{\pi}(1 - \hat{\pi})}{n}}$ (5.23)

Example 5.12
In a sample of 400 textile workers, 184 expressed dissatisfaction regarding a prospective plan
to modify working conditions. Provide a point estimate of the population proportion of total
workers who would be dissatisfied and give an estimate for the standard error of your estimate.

Excel solution—Example 5.12


Figure 5.32 illustrates the Excel solution.

Figure 5.32

➜ Excel solution
Total in sample n = Cell C5 Value
X Cell C6 Value
Sample proportion = Cell C8 Formula: =C6/C5
Estimate population proportion = Cell C12 Formula: =C8
Estimate population standard error = Cell C13 Formula: =SQRT(C12*(1−C12)/C5)

Point estimate of the population proportion: $\hat{\pi} = \rho = \frac{184}{400} = 0.46$

Standard error of the proportion: $\hat{\sigma}_\rho = \sqrt{\frac{\hat{\pi}(1 - \hat{\pi})}{n}} = \sqrt{\frac{0.46 \times (1 - 0.46)}{400}} = 0.025$

The values of the unbiased estimates of the population proportion and its standard error are provided by solving equations (5.22) and (5.23).

x Standard error of the proportion The standard error of the proportion is the standard deviation of the sample proportion’s estimate of a population proportion.

(a) Sample values

Sample size n = 400
Number of successes X = 184
Sample proportion ρ = X/n = 184/400 = 0.46
(b) Population estimates

Estimate of population proportion π̂ = ρ = 0.46

Estimate of population standard error $\hat{\sigma}_\rho = \sqrt{\dfrac{\hat{\pi}(1 - \hat{\pi})}{n}} = \sqrt{\dfrac{0.46(1 - 0.46)}{400}} = 0.0249$

❉ Interpretation The value of the unbiased estimates of the proportion and standard
error are 0.46 and 0.0249 respectively.
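A short Python sketch of the same calculation using equations (5.22) and (5.23):

➜ Python sketch
# Example 5.12: point estimate and standard error of a proportion.
from math import sqrt

n, x = 400, 184
rho = x / n                         # 0.46, point estimate of pi
se = sqrt(rho * (1 - rho) / n)      # 0.0249, standard error of the proportion

print(rho, se)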

5.3.6 Pooled estimates


If more than one sample is taken from a population then the resulting sample statistics
can be combined to provide pooled estimates for the population mean, variance, and
proportion.
Estimate of the population mean is provided by the pooled sample mean:

$$\bar{X} = \frac{n_1\bar{X}_1 + n_2\bar{X}_2}{n_1 + n_2} \qquad (5.24)$$

Estimate of the population variance is provided by the pooled sample variance:

$$\hat{\sigma}^2 = \frac{n_1 s_1^2 + n_2 s_2^2}{n_1 + n_2 - 2} \qquad (5.25)$$

Estimate of the population proportion is provided by the pooled sample proportion:

$$\hat{\pi} = \frac{n_1\hat{\pi}_1 + n_2\hat{\pi}_2}{n_1 + n_2} = \frac{n_1\rho_1 + n_2\rho_2}{n_1 + n_2} \qquad (5.26)$$
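The pooled formulas translate directly into code, as the sketch below shows; the figures passed in the example calls are made up purely to illustrate usage:

➜ Python sketch
# Pooled estimates from two samples (equations (5.24) to (5.26)).
def pooled_mean(n1, xbar1, n2, xbar2):
    return (n1 * xbar1 + n2 * xbar2) / (n1 + n2)

def pooled_variance(n1, s1_sq, n2, s2_sq):
    return (n1 * s1_sq + n2 * s2_sq) / (n1 + n2 - 2)

def pooled_proportion(n1, rho1, n2, rho2):
    return (n1 * rho1 + n2 * rho2) / (n1 + n2)

# Illustrative values only (not from a worked example in the text).
print(pooled_mean(10, 8.7, 5, 8.7))
print(pooled_variance(10, 0.09, 5, 0.04))
print(pooled_proportion(100, 0.46, 50, 0.50))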

Student exercises
X5.11 A random sample of 5 values was taken from a population: 8.1, 6.5, 4.9, 7.3, and 5.9.
Estimate the population mean and standard deviation, and the standard error of the
estimate for the population mean.
X5.12 The mean of 10 readings of a variable was 8.7 with standard deviation 0.3. A further
5 readings were taken: 8.6, 8.5, 8.8, 8.7, and 8.9. Estimate the mean and standard
deviation of the set of possible readings using all the data available.
X5.13 Two samples are drawn from the same population as follows: sample 1 (0.4, 0.2, 0.2,
0.4, 0.3, and 0.3) and sample 2 (0.2, 0.2, 0.1, 0.4, 0.2, 0.3, and 0.1). Determine the best
unbiased estimates of the population mean and variance.

X5.14 A random sample of 100 rods from a production line were measured and found to have a mean length of 12.132 with standard deviation 0.11. A further sample of 50 is taken. Find the probability that the mean of this sample will be between 12.12 and 12.14.
X5.15 A random sample of 20 children in a large school were asked a question and 12
answered correctly. Estimate the proportion of children in the school who answered
correctly and the standard error of this estimate.
X5.16 A random sample of 500 fish is taken from a lake and marked. After a suitable interval a
second sample of 500 is taken and 25 of these are found to be marked. By considering
the second sample estimate the number of fish in the lake.

5.4 Population confidence intervals

5.4.1 Introduction
If we take just one sample from a population we can estimate from the sample a popu-
lation parameter. Our knowledge of sampling error would indicate that the standard
error provides an evaluation of the likely error associated with a particular estimate. If
we assume that the sampling distribution of the sample means are normally distributed
then we can provide a measure of this error in terms of a probability value that the value of
the population mean will lie within a specified interval. This interval is called an interval
estimate (or confidence interval), where the interval is centred at the point estimate for
the population mean. Assuming that the sampling distribution of the mean follows a nor-
mal distribution then we can allocate probability values to these interval estimates. From
equation (5.7) we can restructure the equation to give equation (5.27):

$$\mu = \bar{X} - Z \times \frac{\sigma}{\sqrt{n}} \qquad (5.27)$$

From our knowledge of the normal distribution we know that 95% of the distribution
lies within ± 1.96 standard deviations of the mean. Thus, for the distribution of sample
means, 95% of these sample means will lie in the interval defined by equation (5.27).

$$\mu = \bar{X} \pm 1.96 \times \frac{\sigma}{\sqrt{n}}$$

Therefore, this equation tells us that an interval estimate (or confidence interval) is centred at $\bar{X}$, with a lower value of $\mu_1 = \bar{X} - 1.96 \times \sigma/\sqrt{n}$ and upper value of $\mu_2 = \bar{X} + 1.96 \times \sigma/\sqrt{n}$, as illustrated in Figure 5.33.
We will now look at how interval estimates and associated levels of confidence can be
calculated.

Figure 5.33 Normal curve, f(x), showing the 95% confidence interval for μ (from μ1 to μ2, centred at X̄)

5.4.2 Confidence interval estimate of the population mean, μ (σ known)

If a random sample of size ‘n’ is taken from a normal population N(μ, σ²) then the sampling distribution of the sample means will be normal, $\bar{X} \sim N(\mu, \sigma^2/n)$, and the confidence interval of the population mean is given by equation (5.28):

$$\bar{X} - Z \times \frac{\sigma}{\sqrt{n}} \leq \mu \leq \bar{X} + Z \times \frac{\sigma}{\sqrt{n}} \qquad (5.28)$$

Example 5.13
Eight samples measuring the length of cloth are sampled from a population where the length
is normally distributed with population standard deviation 0.2. Calculate a 95% confidence
interval for the population mean based on a sample of 8 observations: 4.9, 4.7, 5.1, 5.4, 4.7,
5.2, 4.8, and 5.1.

Excel solution—Example 5.13

Figure 5.34 illustrates the Excel solution.

Figure 5.34

➜ Excel solution
X: Cell B6:B13 Values
X2: Cell C6 Formula: =B6^2
Copy formula down C6:C13
n = Cell C17 Formula: =COUNT(B6:B13)
ΣX = Cell C18 Formula: =SUM(B6:B13)
ΣX2 = Cell C19 Formula: =SUM(C6:C13)
Population standard deviation σ = Cell F4 Value
2 tails, 95% confidence interval = Cell F5 Value
CDF = Cell F6 Formula: =1−F5/2
Zcri = Cell F7 Formula: =NORM.S.INV(F6)

Formula Solution
Sample mean = Cell F9 Formula: =C18/C17
Estimate of population mean = Cell F10 Formula: =F9
Standard error of the mean = Cell F11 Formula: =F4/C17^0.5
μ1 = Cell F12 Formula: =F9−F7*F11
μ2 = Cell F13 Formula: =F9+F7*F11

Function Solution
Sample mean = Cell F16 Formula: =AVERAGE(B6:B13)
Estimate of population mean = Cell F17 Formula: =F16
Standard error of the mean = Cell F18 Formula: =F4/C17^0.5
μ1 = Cell F19 Formula: =F16−CONFIDENCE.NORM(F5,F4,C17)
μ2 = Cell F20 Formula: =F16+CONFIDENCE.NORM(F5,F4,C17)

The values of the lower and upper confidence interval are given by equation (5.28). From Excel: population standard deviation σ = 0.2 (known), sample mean $\bar{X} = 4.9875$, sample size = 8, and value of Z for 95% confidence = ±1.96. Substituting the values into equation (5.28) gives:

$$\text{Standard error } \sigma_{\bar{X}} = \frac{\sigma}{\sqrt{n}} = \frac{0.2}{\sqrt{8}} = 0.0707$$

$$\mu_1 = \bar{X} - Z \times \frac{\sigma}{\sqrt{n}} = 4.9875 - 1.96 \times 0.0707 = 4.8489$$

$$\mu_2 = \bar{X} + Z \times \frac{\sigma}{\sqrt{n}} = 4.9875 + 1.96 \times 0.0707 = 5.1261$$

Figure 5.35 illustrates the 95% confidence interval for the population mean.
Thus, the 95% confidence interval for μ is 4.9875 ± 1.96 × 0.0707 = 4.9875 ± 0.1386 = 4.8489 → 5.1261.

Figure 5.35 Normal curve showing the 95% confidence interval for μ: μ1 = 4.8489, X̄ = 4.9875, μ2 = 5.1261

❉ Interpretation The 95% confidence interval for the population mean is 4.8489 to
5.1261.
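The same interval can be cross-checked in Python (a sketch assuming scipy; norm.ppf plays the role of NORM.S.INV):

➜ Python sketch
# Example 5.13: 95% confidence interval for the mean, sigma known.
from math import sqrt
from scipy.stats import norm

cloth = [4.9, 4.7, 5.1, 5.4, 4.7, 5.2, 4.8, 5.1]
sigma, n = 0.2, len(cloth)

xbar = sum(cloth) / n               # 4.9875
se = sigma / sqrt(n)                # 0.0707...
z = norm.ppf(0.975)                 # 1.95996..., like NORM.S.INV(0.975)

print(xbar - z * se, xbar + z * se) # 4.8489 ... 5.1261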

The value of the critical z statistic at a particular significance level can be found from the normal distribution tables provided online. Table 5.5 illustrates an example of this, with the critical z value identified for the probability P(Z ≥ z) = 2.5% = 0.025 (right-hand tail in Figure 5.35).

Z 0.00 0.01 0.02 0.03 0.04 0.05 0.06


0.0 0.500 0.496 0.492 0.488 0.484 0.480 0.476
0.1 0.460 0.456 0.452 0.448 0.444 0.440 0.436
0.2 0.421 0.417 0.413 0.409 0.405 0.401 0.397
1.8 0.036 0.035 0.034 0.034 0.033 0.032 0.031
1.9 0.029 0.028 0.027 0.027 0.026 0.026 0.025
2.0 0.023 0.022 0.022 0.021 0.021 0.020 0.020
2.1 0.018 0.017 0.017 0.017 0.016 0.016 0.015

Table 5.5 Calculation of z when P(Z ≥ z) = 0.025

From Table 5.5, critical z value = 1.96 when P(Z ≥ z) = 0.025. Given that we have two
tails then the critical z value = ±1.96.

5.4.3 Confidence interval estimate of the population mean, μ (σ unknown, n < 30)

x Critical value The critical value(s) for a hypothesis test is a threshold to which the value of the test statistic in a sample is compared to determine whether or not the null hypothesis is rejected.

In the previous example we calculated the point and interval estimates when the population was normally distributed but the population standard deviation was known. In most cases the population standard deviation would be an unknown value and we would have to use the sample value to estimate the population value with associated errors. The population mean estimate is still given by the value of the sample mean, but what about the confidence interval? In the previous example the sample mean and size were used to provide this interval, but, in the new case, we have an extra unknown that has to be estimated from the sample data to find this confidence interval.

Note This is often the case in many student research projects: they handle small sample sizes and the population standard deviation is unknown.

If we have more information about the population then we would expect the probabil-
ity of the population mean lying within 1.96 standard errors of the mean to be smaller
when the population standard deviation is known compared with being unknown.
The question then becomes: Can we measure how much smaller this probability will
be? This question was answered by W. S. Gossett, who determined the distribution of
the mean when divided by an estimate of the standard error. The resultant distribution is
called the Student’s t distribution.
If the random variable X is normally distributed, then the test statistic has a t distribu-
tion with n – 1 degrees of freedom and with the test statistic defined by equation (5.29).

$$t_{df} = \frac{\bar{X} - \mu}{s/\sqrt{n}} \qquad (5.29)$$

Note The t distribution is very similar to the normal distribution when the estimate
of variance is based on many degrees of freedom (df = n – 1), but has relatively more scores
in its tails when there are fewer degrees of freedom. The t distribution is symmetric, like the
normal distribution, but flatter.

Figure 5.36 shows the t distribution with five degrees of freedom and the standard nor-
mal distribution. The t distribution is flatter than the normal distribution (leptokurtic).

Figure 5.36 Normal versus t distributions (t with 5 degrees of freedom plotted against the standard normal Z, from −6 to 6)

As the t distribution is leptokurtic, the percentage of the distribution within 1.96 stand-
ard deviations of the mean is less than the 95% for the normal distribution.
However, if the number of degrees of freedom (df ) is large (df = n – 1 ≥ 30) then there is
very little difference between the two probability distributions. The sampling error for the
t distribution is given by the sample standard deviation (s) and sample size (n), as defined
by equation (5.30).

$$\sigma_{\bar{X}} = \frac{\hat{\sigma}}{\sqrt{n}} = \frac{s}{\sqrt{n}} \qquad (5.30)$$

The degrees of freedom and confidence interval are given by equations (5.31) and
(5.32).

$$df = n - 1 \qquad (5.31)$$

$$\bar{X} - t_{df} \times \frac{s}{\sqrt{n}} \leq \mu \leq \bar{X} + t_{df} \times \frac{s}{\sqrt{n}} \qquad (5.32)$$

Example 5.14
For the following sample of 8 observations from an infinite normal population find the sample
mean and standard deviation, and hence determine the standard error, the population stand-
ard deviation, and a 95% confidence interval for the mean: 10.3, 12.4, 11.6, 11.8, 12.6, 10.9,
11.2, and 10.3.

Excel solution—Example 5.14

Figure 5.37 illustrates the Excel solution.

Figure 5.37

➜ Excel solution
X Cell B6:B13
(X – Xbar)2 Cell C6 Formula: =(B6−$G$9)^2
Copy formula down C6:C13
n = Cell C18 Formula: =COUNT(B6:B13)
ΣX = Cell C19 Formula: =SUM(B6:B13)
Σ(X – Xbar)2 = Cell C20 Formula: =SUM(C6:C13)
Two tails, 95% confidence interval = Cell G4 Value
df = n − 1 = Cell G5 Formula: =C18−1
tcri = Cell G6 Formula: =T.INV.2T(G4,G5)

Formula solution
Sample mean = Cell G9 Formula: =C19/C18
Sample variance = Cell G10 Formula: =C20/(C18−1)
Sample standard deviation = Cell G11 Formula: =G10^0.5
Estimate of population mean = Cell G12 Formula: =G9
Standard error of the mean = Cell G13 Formula: =G11/C18^0.5
μ1 = Cell G14 Formula: =G9−G6*G13
μ2 = Cell G15 Formula: =G9+G6*G13

Function solution
Sample mean = Cell G19 Formula: =AVERAGE(B6:B13)
Sample variance = Cell G20 Formula: =VAR.S(B6:B13)
Sample standard deviation = Cell G21 Formula: =STDEV.S(B6:B13)
Estimate of population mean = Cell G22 Formula: =G19
Standard error of the mean = Cell G23 Formula: =G21/C18^0.5
μ1 = Cell G24 Formula: =G19−G6*G23
μ2 = Cell G25 Formula: =G19+G6*G23
or use
μ1 = Cell G29 Formula: =G19−CONFIDENCE.T(G4,G21,C18)
μ2 = Cell G30 Formula: =G19+CONFIDENCE.T(G4,G21,C18)

The values of the lower and upper confidence interval are given by equation (5.32). From Excel: sample mean $\bar{X} = 11.3875$, sample size = 8, sample variance = 0.7641071, sample standard deviation = 0.8741322, and the value of $t_7$ for 95% confidence = ±2.3646243. Substituting values into equation (5.32) gives:

$$\text{Standard error } \sigma_{\bar{X}} = \frac{s}{\sqrt{n}} = \frac{0.8741322}{\sqrt{8}} = 0.3090524$$

$$\mu_1 = \bar{X} - t_7 \times \frac{s}{\sqrt{n}} = 11.3875 - 2.3646243 \times 0.3090524 = 10.656707$$

$$\mu_2 = \bar{X} + t_7 \times \frac{s}{\sqrt{n}} = 11.3875 + 2.3646243 \times 0.3090524 = 12.118293$$

Figure 5.38 illustrates the 95% confidence interval for the population mean. Thus, the 95% confidence interval for μ is 11.3875 ± 2.3646243 × 0.3090524 = 10.6567 → 12.1183.

❉ Interpretation We are 95% confident that, on the basis of the sample, the true
population mean is between 10.6567 and 12.1183.

Figure 5.38 t distribution with 7 df showing the 95% confidence interval for μ: μ1 = 10.66, X̄ = 11.39, μ2 = 12.12

The value of the critical t statistic at a particular significance level and degrees of free-
dom can be found from the Student’s t distribution tables provided online.
Table 5.6 illustrates an example of this with the critical t value identified for a particular value of the probability P(T ≥ t) = 2.5% = 0.025 (right-hand tail in Figure 5.38) (ALPHA = 2 × 0.025 = 0.05) and degrees of freedom = n − 1 = 7.

ALPHA (two tails)   50% (0.50)   20% (0.20)   10% (0.10)   5% (0.05)   2.5% (0.025)   1% (0.01)
df = 1              1.00         3.08         6.31         12.71       25.45          63.66
df = 2              0.82         1.89         2.92         4.30        6.21           9.92
df = 3              0.76         1.64         2.35         3.18        4.18           5.84
df = 4              0.74         1.53         2.13         2.78        3.50           4.60
df = 5              0.73         1.48         2.02         2.57        3.16           4.03
df = 6              0.72         1.44         1.94         2.45        2.97           3.71
df = 7              0.71         1.41         1.89         2.36        2.84           3.50
df = 8              0.71         1.40         1.86         2.31        2.75           3.36

Table 5.6 Calculation of t for P(T ≥ t) = 0.025 with 7 degrees of freedom (df)

From Table 5.6, the critical t value = 2.36 when P(T ≥ t) = 0.025 and 7 degrees of free-
dom. Given that we have two tails then the critical t value = ±2.36.
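The same lookup can be reproduced programmatically. The following minimal Python sketch (an illustration only, assuming the SciPy library is available) uses the inverse CDF of the t distribution in place of the printed table:

```python
# A minimal check of Table 5.6; stats.t.ppf is the inverse CDF of the t distribution.
from scipy import stats

alpha = 0.05   # two tail ALPHA, as in the column headings of Table 5.6
df = 7         # degrees of freedom, n - 1

t_upper = stats.t.ppf(1 - alpha / 2, df)   # put alpha/2 = 0.025 in the right tail
print(round(t_upper, 2))                   # 2.36, matching Table 5.6 and T.INV.2T(0.05, 7)
```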

5.4.4 Confidence interval estimate of the population mean, μ (σ unknown, n ≥ 30)
In this section we relax the assumption that the population variance, σ2, is known.
Certainly, it is important that we drop this assumption, as it is rare in practice for it to hold.
For large samples (n ≥ 30), we find that the sampling distribution of the mean is approxi-
mately normal, with the population variance being estimated from the sample variance
(σ2 ≈ s2). Substituting this approximation into equation (5.7) gives equations (5.33) and
(5.34):

Z = (X̄ − μ) / (s/√n) (5.33)

X̄ − Z × s/√n ≤ μ ≤ X̄ + Z × s/√n (5.34)

Example 5.15
Eight samples measuring the length of cloth are sampled from a population where the length is
normally distributed with population standard deviation unknown. Calculate a 95% confidence
interval for the population mean based on a sample of 8 observations: 4.9, 4.7, 5.1, 5.4, 4.7,
5.2, 4.8, and 5.1.

Note We are using a small sample to illustrate the application of the method. When n < 30 (σ unknown), we would use the Student's t distribution to fit the confidence interval.

Excel solution—Example 5.15


Figure 5.39 illustrates the Excel solution.

Figure 5.39

➜ Excel solution
X: Cell B6:B13
(X – Xbar)²: Cell C6 Formula: =(B6-$F$9)^2
Copy formula down C6:C13
n = Cell C17 Formula: =COUNT(B6:B13)
ΣX = Cell C18 Formula: =SUM(B6:B13)
Σ(X – Xbar)2 = Cell C19 Formula: =SUM(C6:C13)
2 tails, 95% confidence interval = Cell F5 Value
CDF = Cell F6 Formula: =1−F5/2
Zcri = Cell F7 Formula: =NORM.S.INV(F6)

Formula solution
Sample mean = Cell F9 Formula: =C18/C17
Estimate of population mean = Cell F10 Formula: =F9
Sample variance Cell F11 Formula: =C19/(C17−1)
Sample standard deviation = Cell F12 Formula: =F11^0.5
Standard error of the mean = Cell F13 Formula: =F12/C17^0.5
μ1 = Cell F14 Formula: =F9−F7*F13
μ2 = Cell F15 Formula: =F9 + F7*F13
Function solution
Sample mean = Cell F18 Formula: =AVERAGE(B6:B13)
Estimate of population mean = Cell F19 Formula: =F18
Sample variance Cell F20 Formula: =VAR.S(B6:B13)
Sample standard deviation = Cell F21 Formula: =STDEV.S(B6:B13)
Standard error of the mean = Cell F22 Formula: =F21/C17∧0.5
μ1 = Cell F23 Formula: =F18−CONFIDENCE.NORM(F5,F21,C17)
μ2 = Cell F24 Formula: =F18+CONFIDENCE.NORM(F5,F21,C17)

The value of the lower and upper confidence interval is given by equation (5.34). From
Excel: sample mean, X = 4.9875 , sample size = 8, sample variance = 0.064107143, sample
standard deviation = 0.253193884, and value of Z for 95% confidence = ±1.96.
Substituting values into equation (5.34) gives:

Standard error σX̄ = s/√n = 0.253193884/√8 = 0.08951755

μ1 = X̄ − Z × s/√n = 4.9875 − 1.96 × 0.08951755 = 4.8120488

μ2 = X̄ + Z × s/√n = 4.9875 + 1.96 × 0.08951755 = 5.1629512

Figure 5.40 illustrates the 95% confidence interval for the population mean.
Thus, the 95% confidence interval for μ is 4.9875 ± 1.96 × 0.089517516 = 4.81 → 5.16.

Figure 5.40 Normal curve showing the 95% confidence interval for μ (μ1 = 4.81, X̄ = 4.9875, μ2 = 5.16).

❉ Interpretation The 95% confidence interval for the population mean is 4.81–5.16.
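As a cross-check, a short Python sketch (illustrative only; the SciPy library is assumed to be installed) reproduces the interval of Example 5.15:

```python
# A sketch reproducing the large-sample interval of Example 5.15.
import statistics
from scipy import stats

x = [4.9, 4.7, 5.1, 5.4, 4.7, 5.2, 4.8, 5.1]
n = len(x)

mean = statistics.mean(x)                  # 4.9875, the sample mean
se = statistics.stdev(x) / n ** 0.5        # 0.0895..., s / sqrt(n), the standard error
z = stats.norm.ppf(0.975)                  # 1.96, the two tail 95% z value

print(mean - z * se, mean + z * se)        # approximately 4.812 and 5.163
```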

5.4.5 Confidence interval estimate of a population proportion
If the population is normally distributed or the sample size is large (Central Limit
Theorem, n ≥ 30) then the confidence interval for a proportion is given by transforming
equation (5.12) to give equation (5.35), where the population proportion, π, is estimated
from the sample proportion, ρ:

ρ − Z × √(ρ(1 − ρ)/n) ≤ π ≤ ρ + Z × √(ρ(1 − ρ)/n) (5.35)

Example 5.16
In Example 5.9 we stated that 25% of workers in a factory own a personal computer. If this was
not known we could use the idea of a confidence interval to put a level of confidence on the
population proportion based upon the sample data collected. The sample data resulted in a
sample proportion = 0.26 with a sample size = 80.

Excel solution—Example 5.16


Figure 5.41 illustrates the Excel solution.

Figure 5.41

➜ Excel solution
Sample proportion = Cell C5 Value
Sample size n = Cell C6 Value
Point estimate of population proportion = Cell C8 Formula: =C5
Two tails, 95% confidence interval = Cell C10 Value
Proportion in right and left tails = Cell C11 Formula: =C10/2
Upper Zcri = Cell C12 Formula: =NORM.S.INV(1−C11)

Level of confidence: the confidence level is the probability value (1 − α) associated with a confidence interval.

Estimate of standard error = Cell C14 Formula: =SQRT(C5*(1−C5)/C6)


Lower population proportion estimate = Cell C16 Formula: =C5−C12*C14
Upper population proportion estimate = Cell C17 Formula: =C5+C12*C14

The value of the lower and upper confidence interval is given by equation (5.35). From Excel: sample proportion ρ = 0.26, sample size n = 80, and the value of Z for 95% confidence = ±1.96. Substituting values into equation (5.35) gives:

Standard error σρ = √(ρ(1 − ρ)/n) = √(0.26 × (1 − 0.26)/80) = 0.0490408

π1 = ρ − Z × √(ρ(1 − ρ)/n) = 0.26 − 1.96 × 0.0490408 = 0.1638818

π2 = ρ + Z × √(ρ(1 − ρ)/n) = 0.26 + 1.96 × 0.0490408 = 0.3561182

Figure 5.42 illustrates the 95% confidence interval for the population proportion.
Thus, the 95% confidence interval for π is 0.26 ± 1.96 × 0.0490408 = 0.16 → 0.36.

Figure 5.42 Normal curve showing the 95% confidence interval for the population proportion (0.1639 → 0.3561, centred on ρ = 0.26, with Z = ±1.96).

❉ Interpretation The 95% confidence interval for the proportion of people who own a personal computer in the whole population is between 16.4% and 35.6%.
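The proportion interval can be verified in the same way; the following Python sketch (illustrative only, assuming SciPy is installed) reproduces equation (5.35) for Example 5.16:

```python
# A sketch of the proportion interval of Example 5.16 (equation 5.35).
from scipy import stats

rho = 0.26                                  # sample proportion
n = 80                                      # sample size

se = (rho * (1 - rho) / n) ** 0.5           # 0.04904..., standard error of the proportion
z = stats.norm.ppf(0.975)                   # 1.96 for a 95% interval

print(rho - z * se, rho + z * se)           # approximately 0.1639 and 0.3561
```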

Student exercises
X5.17 The standard deviation for a method of measuring the concentration of nitrate ions
in water is known to be 0.05 ppm. If 100 measurements give a mean of 1.13 ppm,
calculate the 90% confidence limits for the true mean.

X5.18 In trying to determine the sphere of influence of a sports centre a random sample
of 100 visitors was taken. This indicated a mean travel distance (d) of 10 miles with
a standard deviation of 3 miles: (a) What are the 90% confidence limits for the
population mean travel distance (D), and (b) What sample size would be required to
ensure that the confidence interval for D was 0.5 miles at the 95% level?
X5.19 The masses, in grams, of 13 ball bearings taken at random from a batch are: 21.4,
23.1, 25.9, 24.7, 23.4, 24.5, 25.0, 22.5, 26.9, 26.4, 25.8, 23.2, and 21.9. Calculate a 95%
confidence interval for the mean mass of the population, supposed normal, from
which these masses were drawn.

5.5 Calculating sample size


We can control the width of the confidence interval by determining the sample size neces-
sary to produce narrow intervals. For example, if we assume that we are sampling a mean
from a population that is normally distributed then we can use equation (5.7) to calculate
an appropriate sample size for a stated interval from equation (5.34).

Interval = 2 × Z × σ/√n (5.36)

Rearranging equation (5.36) will enable the calculation of the sample size via equation (5.37).

n = (2 × Z × σ / Interval)² (5.37)

where Interval = 2 × margin of error (e) = μ2 − μ1.

Example 5.17
A researcher determines that a margin of error (or sampling error, e) of no more than ±0.05 units is desired, along with a 98% confidence interval. If we assume a normal population standard deviation of 0.2, calculate the sample size, n.

Excel solution—Example 5.17


Figure 5.43 illustrates the Excel solution.

Figure 5.43

➜ Excel solution
Specified interval = Cell C4 Value
Population standard deviation = Cell C5 Value
Two tails, 98% confidence interval = Cell C7 Value
Proportion in right and left hand tails = Cell C8 Formula: =C7/2
Upper Zcri = Cell C9 Formula: =NORM.S.INV(1−C8)
Sample size n = Cell C11 Formula: =(2*C9*C5/C4)^2

From Excel: interval = 0.1, population standard deviation = 0.2, Zcri for 98% =
±2.326347874, and the sample size is calculated from equation (5.37).

n = (2 × Z × σ / Interval)² = (2 × 2.326347874 × 0.2 / 0.1)² = 86.5903109

Figure 5.44 illustrates the relationship between interval, confidence interval, and size
of sample.

Figure 5.44 Normal curve showing the relationship between the interval (0.1), the 98% confidence interval for μ, and the required sample size (n = 87).

❉ Interpretation Thus, to produce a 98% confidence interval estimate of the mean, we need a sample of 87.

Note To see what impact the selection of the margin of error and confidence interval has on the sample size, we'll run a small simulation. We'll keep all the data from the previous example (Table 5.7).

Margin of error       10%    10%    10%    10%
Confidence interval   90%    95%    98%    99%
Sample size           43     61     87     106

Table 5.7

By keeping the same margin of error, but changing the confidence interval, we can see
how the sample size changes. Effectively, in this example, we need to increase the sample
size almost three times if we wanted our confidence interval to increase from 90% to 99%.
Let’s now keep the confidence interval constant, at 90%, but let’s change the margin of
error (Table 5.8).

Margin of error       10%    5%     3%     1%
Confidence interval   90%    90%    90%    90%
Sample size           43     173    481    4329

Table 5.8

As we can see, the margin of error has a tremendous impact on the sample size. This
explains why political polls are often conducted with a 3% error margin. To increase the
accuracy in this case we would have to increase the sample size tenfold, which is clearly
too expensive. It is particularly important to emphasize here that the margin of error
depends very little on the size of the population from which we are sampling, as long as
the sampling fraction is less than 5% of the total population. For very large populations,
the impact is almost negligible.
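The simulation behind Tables 5.7 and 5.8 is easy to reproduce. The following Python sketch (illustrative only; SciPy assumed, with results rounded to the nearest whole number to match the tables) applies equation (5.37):

```python
# A sketch reproducing Tables 5.7 and 5.8 from equation (5.37).
from scipy import stats

def sample_size(interval, sigma, confidence):
    z = stats.norm.ppf(1 - (1 - confidence) / 2)    # two tail critical z value
    return round((2 * z * sigma / interval) ** 2)   # rounded as in the tables

sigma = 0.2   # the population standard deviation from Example 5.17

# Table 5.7: interval fixed at 0.1, confidence level varied
print([sample_size(0.1, sigma, c) for c in (0.90, 0.95, 0.98, 0.99)])   # [43, 61, 87, 106]

# Table 5.8: confidence fixed at 90%, interval varied (the 10%, 5%, 3%, 1% columns)
print([sample_size(i, sigma, 0.90) for i in (0.1, 0.05, 0.03, 0.01)])   # [43, 173, 481, 4329]
```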

Student exercise
X5.20 A business analyst has been requested by the managing director of a national
supermarket chain to undertake a business review of the company. One of the key
objectives is to assess the level of spending of shoppers who, historically, have weekly
mean levels of spending of €168.00 with a standard deviation of €15.65. Calculate
the size of a random sample required to produce a 98% confidence interval for the
population mean spend, given that the interval is €30. Is the sample size appropriate
given the practical factors?

■ Techniques in practice
TP1 Concerned at the time taken to react to customer complaints, CoCo S.A. has implemented
a new set of procedures for its support centre staff. The customer service director has directed
that a suitable test is applied to a new sample to assess whether the new target mean time for
responding to customer complaints is 28 days (Table 5.9).

20 33 33 29 24 30
40 33 20 39 32 37
32 50 36 31 38 29
15 33 27 29 43 33
31 35 19 39 22 21
28 22 26 42 30 17
32 34 39 39 32 38

Table 5.9

(a) Construct a point estimate for the mean time to respond.


(b) What are the model assumptions for part (a)?
(c) Construct a 95% confidence interval for the mean time.
(d) Is there any evidence to suggest that the mean time to respond to complaints is greater
than 28 days?

TP2 Bakers Ltd is currently undertaking a review of the delivery vans used to deliver prod-
ucts to customers. The company runs two types of delivery van (type A, recently purchased,
and type B, at least 3 years old) which are supposed to be capable of achieving 20 km per litre
of petrol. A new sample has now been collected (Table 5.10).

A B A B
17.68 15.8 26.42 34.8
18.72 36.1 25.22 16.8
26.49 6.3 13.52 15.0
26.64 12.3 14.01 28.9
9.31 15.5 33.9
22.38 40.1 27.1
20.23 20.4 16.8
28.80 3.7 23.6
17.57 13.6 29.7
9.13 35.1 28.2
20.98 33.3

Table 5.10

(a) Construct a point estimate for the mean times.


(b) What are the model assumptions for part (a)?
(c) Construct a 95% confidence interval for the mean times.
(d) Assuming that the population distance travelled varies as a normal distribution, do we
have any evidence to suggest that the two types of delivery vans differ in the mean dis-
tances travelled?
(e) Based upon your analysis, is there any evidence that the new delivery vans meet the
mean average of 20 km per litre?

TP3 Skodel Ltd is developing a low calorie lager for the European market with a mean
designed calorie count of 43 calories per 100 ml. The new product development team are
having problems with the production process and have collected two independent random
samples to assess whether the target calorie count is being met (assume the population vari-
ables are normally distributed) (Table 5.11).

(a) Construct a point estimate for the calorie count.


(b) What are the model assumptions for part (a)?
(c) Construct a 95% confidence interval for the calorie count.
(d) Is it likely that the target average number of calories is being achieved?

A B A B
49.7 39.4 45.2 34.5
45.9 46.5 40.5 43.5
37.7 36.2 31.9 37.8
40.6 46.7 41.9 39.7
34.8 36.5 39.8 41.1
51.4 45.4 54.0 33.6
34.3 38.2 47.8 35.8
63.1 44.1 26.3 44.6
41.2 58.7 31.7 38.4
41.4 47.1 45.1 26.1
41.1 59.7 47.9 30.7

Table 5.11

■ Summary
In this chapter we have provided an introduction to the important statistical concept of sam-
pling and have explored methods that can be used to provide point and confidence intervals.
We have shown that the Central Limit Theorem is a very important theorem that allows the
application of a range of statistical tests to be performed.

1. We have shown how the Central Limit Theorem can eliminate the need to construct
a sampling distribution by examining all possible samples that might be drawn from a
population. The Central Limit Theorem allows us to determine the sampling distribution
by using the population mean and variance values or estimates of these obtained from a
sample.
2. Furthermore, the sample mean provides an unbiased estimate of the population mean, and the sample variance calculated with the n − 1 divisor provides an unbiased estimate of the population variance; the sample standard deviation, however, is a slightly biased estimate of the population standard deviation.
3. From the Central Limit Theorem we know that the sampling distribution can be
approximated by the normal distribution.

We have shown that as the sample size increases the standard error decreases, but please be
aware that any advantage quickly vanishes as any improvements in standard error tend to be
smaller as the sample size gets larger and larger. The next chapter will now use these results to
introduce the concept of statistical hypothesis testing. In Chapter 6 we shall explore testing a state-
ment about the value of a population parameter given information about one or two samples.

■ Key terms
Central limit theorem; Confidence interval; Critical value; Degrees of freedom; Estimate; Level of confidence; Point estimate; Point estimate of the population mean; Point estimate of the population proportion; Point estimate of the population variance; Random sample; Sampling distribution; Sampling error; Sampling frame; Sampling with replacement; Standard error of the mean; Standard error of the proportion; Unbiased

■ Further reading

Textbook resources
1. Whigham, D. (2007) Business Data Analysis using Excel. Oxford: Oxford University Press.
2. Lindsey, J. K. (2003) Introduction to Applied Statistics: A Modelling Approach (2nd edn).
Oxford: Oxford University Press.

Web resources
1. StatSoft Electronic Textbook http://www.statsoft.com/textbook/stathome.html (accessed
25 May 2012).
2. HyperStat Online Statistics Textbook http://davidmlane.com/hyperstat/index.html
(accessed 25 May 2012).
3. Eurostat—website is updated daily and provides direct access to the latest and most com-
plete statistical information available on the European Union (EU), the EU Member States, the
Euro-zone and other countries http://epp.eurostat.ec.europa.eu (accessed 25 May 2012).
4. Economagic—contains international economic data sets http://www.economagic.com
(accessed 25 May 2012).
5. The International Statistical Institute (ISI) glossary of statistical terms provides definitions
in a number of different languages http://isi.cbs.nl/glossary/index.htm (accessed 25 May 2012).
6 Introduction to parametric hypothesis testing

» Overview «
Experiments, surveys, and pilot projects are often carried out with the objective of testing a
theory, or hypothesis, about the nature of the process under investigation. Consider a UK
company attempting to enter the German market. They appoint a distributor in Bavaria who
reports that, on average, 2.7 litres of their product are consumed per week per family. Is this
number representative and indicative of the whole country? Should they decide to expand the
network of distributors? What confidence can be assigned to these numbers? Are these figures
from Germany comparable with the UK market? Just how much confidence can be placed
on the inference that there is no difference between the two populations? In order to provide
answers to these questions we set up a statement (a hypothesis) and test its validity by the
application of probability theory.
In this chapter we shall explore a range of hypothesis tests for one and two samples where
the population is normally distributed. The type of test employed, z or t, depends mainly on the
sample size. Even if the population is not normal, the tests will still give an approximate solution
if the sample size is sufficiently large (Central Limit Theorem). Chapter 7 will extend the range
of hypothesis tests to include the so-called non-parametric tests, which may be used in place
of parametric tests when the modelling assumptions are doubtful.

» Learning objectives «
On completing this unit you will be able to:
» understand the concept of the null and alternative hypothesis;
» understand the difference between one and two samples;
» understand the difference between the terms parametric and non-parametric;
» identify appropriate one and two sample tests;
» explain what is meant by a significance level;
» choose an appropriate sampling distribution;

» understand the difference between one and two tail tests;


» distinguish between type I and II errors;
» understand the concept of a p-value and its role in making decisions;
» understand the concept of the critical test statistic and its role in decision making;
» understand the use of the p-value and critical test statistic in making decisions;
» identify and apply a step procedure in solving hypothesis related problems;
» conduct one sample hypothesis test for the sample mean and proportion;
» conduct two sample hypothesis tests for the sample mean and proportion;
» conduct an F test for two population variances;
» solve hypothesis test problems using Microsoft Excel.

6.1 Hypothesis testing rationale


A hypothesis is a statement of the perceived value of a variable or perceived relationship
between two or more variables that can be measured. For example, the ‘average salary of
accountants is €31,000’ can be classed as a hypothesis statement which can be measured
and assessed. It contains one variable that can be classed as salary. In another example,
that Teesside and Leeds University undergraduate business degree students have similar
entry qualifications can be written as a hypothesis statement.
When dealing with a hypothesis test we have to formulate our initial research hypoth-
esis into two statements which can then be evaluated: null hypothesis and alternative
hypothesis.

6.1.1 Hypothesis statements H0 and H1


The null hypothesis (H0) is known as the hypothesis of no difference and is formulated in anticipation of being rejected as false. The alternative hypothesis (H1) is a positive proposition which states that a significant difference exists. In our first example, the statement that the average salary of accountants is €31,000 can be written as H0: μ = €31,000, with the alternative hypothesis H1: μ ≠ €31,000. In our second example, we state that there is no difference between the mean Teesside and Leeds entry scores, which can be written as H0: Teesside mean = Leeds mean, with the alternative hypothesis H1: Teesside mean ≠ Leeds mean.

Hypothesis test procedure: a series of steps to determine whether to accept or reject a null hypothesis, based on sample data.
Null hypothesis (H0): the null hypothesis, H0, represents a theory that has been put forward but has not been proved.
Alternative hypothesis (H1): the alternative hypothesis, H1, is a statement of what a statistical hypothesis test is set up to establish.

Note The rejection of the null hypothesis in favour of the alternative hypothesis cannot be taken as conclusive proof that the alternative hypothesis is true, but rather as a piece of evidence that increases one's belief in the truth of the alternative hypothesis.

Example 6.1
The historical output by employees is a mean rate of 100 units per hour with a standard devia-
tion of 20 units per hour. A new employee is tested on 36 separate random occasions and
is found to have an output of 90 units per hour. Does this indicate that the new employee’s
output is significantly different from the population mean output?

Figure 6.1

Figure 6.1 illustrates the Excel solution to solve the problem outlined in Example 6.1.
As we can see, we have used several built-in Excel functions, which we will explain
shortly, to help us make a decision. What is our decision? In this example we would reject
the null hypothesis H0 in favour of the alternative hypothesis H1 and conclude that there
is a significant difference between the new employee’s output and the firm’s existing
employee output. In fact, this test allows us to state that we are 95% confident in our decision. How did we do this? Hypothesis testing requires only a few strict steps, and they are as follows:

1 State hypothesis

2 Select the test

3 Set the level of significance

4 Extract relevant statistic

5 Make a decision
As we have already introduced how to state the hypothesis, let us explain the remaining four steps.

Level of significance: the level of significance is the criterion used for rejecting the null hypothesis.

6.1.2 Parametric versus non-parametric tests of difference

Tests of hypothesis are usually classified into two methods: parametric and non-parametric. The major distinction between them lies in the underlying assumptions about the data to be analysed. Parametric statistics involve numbers with known, continuous distributions. When the data are interval- or ratio-scaled and the sample size is large, parametric statistical procedures are appropriate. Non-parametric statistics are appropriate when the numbers do not conform to a known distribution (also called distribution free).

Parametric methods make assumptions about the underlying distribution from which sample populations are selected. Non-parametric methods make no assumptions about the underlying sample population's distribution. Parametric statistical tests assume that your data are approximately normally distributed (follow a classic bell-shaped curve) and that the data are at the interval/ratio level of measurement. This chapter is concerned with the type of hypothesis test where the population is at the interval/ratio level of measurement and is either normally distributed or can be considered to be approximately normally distributed. The chapter will look at two types of parametric test: the z-test and the Student's t-test for one and two samples. If you have more than two samples then you can use a parametric technique called Analysis of Variance (ANOVA) to undertake the hypothesis test (see online workbook 'Factorial experiments').

Non-parametric methods do not make any assumptions about the sample population distribution and are often based upon data that have been ranked (ordinal) or are nominal, rather than actual measurement data. In many cases it is possible to replace a parametric test with a corresponding non-parametric test. Chapter 7 will explore a range of non-parametric tests, such as the Mann–Whitney U test for independent samples, the Wilcoxon matched pairs test for dependent samples, the chi-square test for association, and the chi-square test for goodness-of-fit. The online workbook 'Factorial experiments' will describe the Kruskal–Wallis and Friedman's tests, which are non-parametric tests for dealing with more than two samples.

Parametric: any statistic computed by procedures that assume the data were drawn from a particular distribution.
Non-parametric: non-parametric tests are often used in place of their parametric counterparts when certain assumptions about the underlying population are questionable.
Mann–Whitney U test: used to test the null hypothesis that two populations have identical distribution functions against the alternative hypothesis that the two distribution functions differ only with respect to location (median), if at all.

6.1.3 One and two sample tests

In this chapter we will explore hypothesis tests involving both one and two samples. A one sample test involves testing a sample parameter (e.g. mean value) against a perceived population value (e.g. accountant salary €31,000) to ascertain whether or not there is a significant difference between a sample statistic and a population parameter (e.g. H0: μ = €31,000). For two sample tests we test a sample against another sample to ascertain whether or not there is a significant difference between two samples and, consequently, whether or not the two samples represent different populations. In both cases we shall use tests that utilize the normal probability distribution and will be testing for differences between the means and proportions.

The hypothesis tests that we will explore include: one sample z-test for the population mean, one sample t-test for the population mean, two sample z-test for the population mean, two sample z-test for the population proportion, two sample t-test for population mean (independent samples), two sample t-test for population mean (dependent samples), and F test for two population variances (variance ratio test).

One sample test: a hypothesis test for answering questions about the mean (or median) where the data are a random sample of independent observations from an underlying distribution.
Two sample tests: a hypothesis test for answering questions about the mean where the data are collected from two random samples of independent observations, each from an underlying distribution.
One sample z-test for the population mean: used to test whether a population parameter is significantly different from some hypothesized value.
Two sample z-test for the population mean: used to evaluate the difference between two group means.
Two sample z-test for the population proportion: used to evaluate the difference between two group proportions.

6.1.4 Choosing an appropriate statistical test


Figure 6.2 provides a diagrammatic representation of the decisions required to decide on
which test to use to undertake the correct hypothesis test.

Figure 6.2 Decision tree for choosing a test. One sample: numerical data lead to the one sample z-test (σ known) or one sample t-test (σ unknown); categorical data lead to the z-test for a proportion. Two samples: categorical data lead to the z-test for two proportions; numerical data lead, for dependent samples, to the paired t-test and, for independent samples, to the two sample t-test (equal or unequal variances) when comparing means or the F test when comparing variances.

The key questions are:

1. What are you testing: difference or association? For parametric tests we are measuring the difference between data values.

2. What is the type of data being measured? For parametric tests we are dealing with interval/ratio data.

3. Can we assume that the population is normally distributed? For parametric tests we expect the variable(s) being measured to be normally distributed or approximately normally distributed.

4. How many samples? We are dealing with one and two sample parametric tests. If we have more than two samples then we would be dealing with an advanced statistical hypothesis concept called ANOVA. This topic is described in the online workbook 'Factorial experiments'.

5. From Figure 6.2 we can then choose the appropriate test by answering extra questions regarding whether we are dealing with means or proportions, or whether two samples are related (or dependent) or independent of one another.

It is important to note that we have a range of other hypothesis tests to measure association (see Chapter 7) and for dealing with distribution free tests (see online workbook 'Factorial experiments').

Two sample t-test for population mean (independent samples, unequal variances): used when two separate sets of independent but differently distributed samples are obtained, one from each of the two populations being compared.
F test for two population variances (variance ratio test): used to test if the variances of two populations are equal.

6.1.5 Significance level


The level of significance represents the amount of risk that an analyst will accept when
making a decision. Whenever research is undertaken we will always have the possibility
that the data values are subject to chance. The use of the significance level is to seek to
put beyond reasonable doubt the notion that the findings are due to chance. The level of
significance, denoted by Greek letter alpha (α), represents the amount of error associated
with rejecting the null hypothesis when it is true. The value of α is normally 5% (0.05) or 1%
(0.01), but the value of α depends upon how sure you want to be that your decisions are an
accurate reflection of the true population relationship.

❉ Interpretation If an analyst states that the results are significant at the 5% level then
what they are saying is that there is a 5% probability that the sample data values collected have
occurred by chance. An alternative view is to use the concept of a confidence interval. In this
case we can observe that we are 95% confident that the results have not occurred by chance.

Note Most of the examples in this chapter use 0.05 for the level of significance. In practice
you will notice that sometimes certain hypotheses can be accepted at that level of significance,
but would have to be rejected if we used 0.01 as the level of significance. What do we do in such
situations? Read on further and section 6.1.9 on the types of errors might offer some resolution.

6.1.6 Sampling distributions


In Chapter 5 we explored the concept of sampling data from a normally-distributed population and described the application of the Central Limit Theorem in sampling data from populations which are not normally distributed.

1. If we sample from a population data set that is normally distributed then the sampling distribution of the sample mean X̄ will be normally distributed with sample mean μX̄ = population mean μ, and sampling error σX̄ = σ/√n. The standard normal distribution has a mean of zero and standard deviation equal to one.

2. For populations that are not normally distributed we can make use of the Central Limit Theorem. The Theorem states that as the sample size increases the sampling distribution of the mean approximates to a normal distribution. In general, this is the case if the sample size is larger than 30 (n ≥ 30).

3. What do we do if the sample size is less than 30 and we do not know the population standard deviation? If the population is normally, or approximately normally, distributed then we can replace the standard normal distribution with the Student's t distribution with n − 1 degrees of freedom. The t-test statistic is defined by equation (5.29). The Student's t distribution has a mean of zero, but the standard deviation varies with the number of degrees of freedom (df = n − 1).

Significance level, α: the significance level of a statistical hypothesis test is a fixed probability of wrongly rejecting the null hypothesis, H0, if it is in fact true.
Alpha, α: alpha refers to the probability that the true population parameter lies outside the confidence interval. Not to be confused with the symbol alpha in a time series context, i.e. exponential smoothing, where alpha is the smoothing constant.
Central Limit Theorem: the Central Limit Theorem states that whenever a random sample is taken from any distribution (x̄, S²), the sample mean will be approximately normally distributed with mean μ and variance S²/n.

Figure 6.3 illustrates a comparison between the normal and t distributions where the number of degrees of freedom increases from 2 to 30.
Figure 6.3 Comparison between the normal and t distributions (Normal, T2, T4, T10, and T30 curves plotted against Z or T values).

We observe that the error between the normal and t distributions decreases as the number of degrees of freedom increases and that very little numerical difference exists between the normal and t distributions when we have sample sizes ≥ 30. From this concept we can calculate the corresponding test statistic and the critical test statistic value given a significance level.

Critical test statistic: the critical value for a hypothesis test is a limit at which the value of the sample test statistic is judged to be such that the null hypothesis may be rejected.

6.1.7 One and two tail tests

In Section 6.1.1 we stated that the alternative hypotheses can be written as H1: μ ≠ €31,000 or H1: μT ≠ μL. The ≠ sign tells us that we are not sure what the direction of the difference will be (< or >) but that a difference exists. In this case we have a two tailed test. It is possible that we are assessing that the average accountant's salary is greater than €31,000 (implying H1: μ > €31,000) or is smaller than €31,000 (implying H1: μ < €31,000). In both cases the direction is known and these are known as one tail tests.

The hypothesis test set up (H0 and H1) will tell you automatically whether you have a one or two tailed test. The region of rejection is located in the tail(s) of the distribution. The exact location is determined by the way H1 is expressed. If H1 simply states that there is a difference, for example H1: μ ≠ 100, then the region of rejection is located in both tails of the sampling distribution with areas equal to α/2. For example, if α is set at 0.05 then the area in both tails will be 0.025 (see Figure 6.4). This is known as a two tail test. If H1 states that there is a direction of difference, for example μ < 100 or μ > 100, then the region of rejection is located in one tail of the sampling distribution, the tail being defined by the direction of the difference.

Hence, for a less than direction (H1: μ < 100) the left-hand tail would be used (see Figure 6.5). This is known as a lower one tail test. For a greater than direction (H1: μ > 100) the right-hand tail would be used (see Figure 6.6). This is known as an upper one tail test. The actual location of this critical region will be determined by whether the variable being measured varies as a normal or Student's t distribution.

One tail tests: a one tail test is a statistical hypothesis test in which the values for which we can reject the null hypothesis, H0, are located entirely in one tail of the probability distribution.
Region of rejection: the range of values that leads to rejection of the null hypothesis.
Two tail test: a two tail test is a statistical hypothesis test in which the values for which we can reject the null hypothesis, H0, are located in both tails of the probability distribution.
Lower one tail test: a lower one tail test is a statistical hypothesis test in which the values for which we can reject the null hypothesis, H0, are located entirely in the left tail of the probability distribution.
Upper one tail test: an upper one tail test is a statistical hypothesis test in which the values for which we can reject the null hypothesis, H0, are located entirely in the right tail of the probability distribution.

Figure 6.4 Two tail test (μ ≠ 100): H0 is rejected in both tails of the normal curve (2.5% in each) and accepted in the centre.

Figure 6.5 Lower one tail test (μ < 100): H0 is rejected in the left tail (5%).

Figure 6.6 Upper one tail test (μ > 100): H0 is rejected in the right tail (5%).
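The rejection regions sketched in Figures 6.4-6.6 correspond to critical z values that can be computed directly. The following short Python sketch (illustrative only; SciPy assumed) mirrors the NORM.S.INV formulas used later in this chapter:

```python
# Critical z values for the rejection regions of Figures 6.4-6.6 at alpha = 0.05.
from scipy import stats

alpha = 0.05

# Two tail test (H1: mu != 100): alpha/2 in each tail
print(stats.norm.ppf(alpha / 2), stats.norm.ppf(1 - alpha / 2))   # -1.96 and +1.96

# Lower one tail test (H1: mu < 100): all of alpha in the left tail
print(stats.norm.ppf(alpha))                                      # -1.645

# Upper one tail test (H1: mu > 100): all of alpha in the right tail
print(stats.norm.ppf(1 - alpha))                                  # +1.645
```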

6.1.8 Check t-test model assumptions


To undertake a Student's t-test we assume that the population being measured varies as a normal distribution. To test this assumption we can use the exploratory data analysis techniques described in sections 2.3 and 4.1.4. What is important is that the mean and median are approximately equivalent, and that the distribution is approximately symmetrical. The t-test is called a robust test in the sense that the accuracy of the technique is not unduly influenced by moving away from symmetry for increasing values of the sample size (see Central Limit Theorem).

Care should be taken for small samples if the population distribution is unknown. In this case you should consider using non-parametric tests (see Chapter 7 or the online workbook 'Factorial experiments') that do not make the assumption that the population of the variable being measured varies as a normal distribution.

Robust test: if a test is robust, the validity of the test result will not be affected by poorly structured data. In other words, it is resistant against violations of parametric assumptions.

6.1.9 Types of error


When making a decision in hypothesis testing we can distinguish between two types of
possible error: type I and type II. A type I error is committed when we reject a null hypoth-
esis when it is true, while a type II error occurs when we accept a null hypothesis when it
is not true.

                          Truth
Decision                  H0 true                          H1 true
Reject H0                 Type I error                     Correct
                          (size of test = α)               (power = 1 − β)
Do not reject H0          Correct                          Type II error
                          (confidence interval = 1 − α)    (β)

Table 6.1 Types of error

Type I error
From Table 6.1 we observe that it is the rejection of a true null hypothesis that is a type I
error. This probability is represented by the level of significance (Greek letter Alpha, α) and
the significance value chosen represents the maximum probability of making a type I error.

Type II error
A type II error (denoted by Greek letter Beta, β) is only an error in the sense that an oppor-
tunity to reject the null hypothesis correctly was lost. It is not an error in the sense that an
incorrect conclusion was drawn, as no conclusion is drawn when the null hypothesis is
not rejected. Which of the errors is more serious? The answer to this question depends on the damage that is related to it. Type I and type II errors are related to each other; increasing the type I error will decrease the type II error and vice versa.

Type I error, α: a type I error occurs when the null hypothesis is rejected when it is in fact true.
Type II error, β: a type II error occurs when the null hypothesis, H0, is not rejected when it is in fact false.

Statistical power
The statistical power of the test is the probability of accepting the true alternative hypothesis or the probability of rejecting a false null hypothesis. The relationship between statistical power and the type II error is given by the equation power = 1 − β. A simple example will be employed in Section 6.10 to illustrate the calculation of the type II error (β) and the statistical power for a one sample t-test.

Beta, β: beta refers to the probability that a false population parameter lies inside the confidence interval.
Statistical power: the power of a statistical test is the probability that it will correctly lead to the rejection of a false null hypothesis.

6.1.10 P-values

Unlike the classical approach using the critical test statistic, we can use the p-value to decide on accepting or rejecting H0. The p-value represents the probability of the calculated random sample test statistic being this extreme if the null hypothesis is true. This p-value can then be compared to the chosen significance level (α) to make a decision between accepting or rejecting the null hypothesis H0.

P-value: the p-value is the probability of getting a value of the test statistic as extreme as or more extreme than that observed by chance alone, if the null hypothesis is true.

❉ Interpretation If p < α, then we would reject the null hypothesis H0 and accept the alternative hypothesis H1.

Note The Excel screenshots will identify each of these stages in the solution process.

Microsoft Excel can be used to calculate a p-value depending upon whether the vari-
able being measured varies as a normal or Student’s t distribution.
The p-value will be generated automatically by Excel when using the Analysis ToolPak
solution method (Select Data > Data Analysis).

6.1.11 Critical test statistic


A different approach to using the p-value is to calculate the test statistic and compare
the value with a critical test statistic estimate from an appropriate table or via Excel.
The value of the critical test statistic will depend upon the following factors: (i) signifi-
cance level for z-test problems, and (ii) the significance level and number of degrees
of freedom for t-test problems. This critical test statistic can then be compared with
the calculated test statistic to make a decision between accepting or rejecting the null
hypothesis H0.

❉ Interpretation If test statistic > critical test statistic then we would reject the null
hypothesis H0 and accept the alternative hypothesis H1.

Microsoft Excel can be used to calculate the critical test statistic values depending
upon whether the variable being measured varies as a normal or Student’s t distribution.

Note The Excel screenshots will identify each of these stages in the solution process.

These values will be generated automatically by Excel when using the Data Analysis
solution method (Select Data > Data Analysis).
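Both decision rules can be demonstrated with a few lines of code. The following Python sketch (illustrative only; SciPy assumed, using an illustrative two tail z value of −3) shows that the p-value rule and the critical value rule lead to the same decision:

```python
# The p-value rule (section 6.1.10) and the critical value rule (section 6.1.11).
from scipy import stats

z_cal = -3.0    # an illustrative calculated test statistic
alpha = 0.05

p_value = 2 * stats.norm.sf(abs(z_cal))   # two tail p-value = 0.0026998
z_cri = stats.norm.ppf(1 - alpha / 2)     # critical value = 1.96

print(p_value < alpha)        # True -> reject H0 (p-value rule)
print(abs(z_cal) > z_cri)     # True -> reject H0 (critical value rule, same decision)
```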

Student Exercises
X6.1 A supermarket is supplied by a consortium of milk producers. Recently, a quality
assurance check suggests that the amount of milk supplied is significantly different
from the quantity stated within the contract: (i) define what we mean by significantly
different; (ii) state the null and alternative hypothesis statements; and (iii) for the
alternative hypothesis do we have a two tail, lower one tail, or upper one tail test?
X6.2 A business analyst is attempting to understand visually the meaning of the critical test
statistic and the p-value. For a z value of 2.5 and significance level of 5% provide a
sketch of the normal probability distribution and use the sketch to illustrate the location
of the following statistics: test statistic, critical test statistic, significance value, and
p-value (you do not need to calculate the values of zcri or the p-value).

6.2 One sample z-test for the population mean


The first test we will explore is the one sample z-test for the population mean that assumes
that the sample data is collected randomly from a population that is normally distributed.
In this particular case we know the value of the population standard deviation.

Example 6.2
Employees of a firm produce units at a rate of 100 per hour with a standard deviation of
20 units per hour. A new employee is tested on 36 separate random occasions and is found
to have an output of 90 units per hour. Does this indicate that the new employee’s output is
significantly different from the average output?

Figure 6.7 illustrates the Excel solution.

Figure 6.7

➜ Excel Solution
Significance level Cell E10 Value = 0.05
Population mean Cell E13 Value = 100
Population standard deviation Cell E14 Value = 20
Sample size n Cell E16 Value = 36
Sample mean Xavg Cell E17 Value = 90
Sample standard error Cell E18 Formula = E14/E16^0.5
Zcal Cell E19 Formula = STANDARDIZE(E17,E13,E18)
Two tail p-value Cell E22 Formula = 2*(1−NORM.S.DIST(ABS(E19),TRUE))
Lower Zcri = Cell E23 Formula = NORM.S.INV(E10/2)
Upper Zcri = Cell E24 Formula = NORM.S.INV(1-E10/2)

Excel solution using the p-value for a one sample z-test


To use the p-value statistic method to make a decision:

1 State hypothesis
Null hypothesis H0: μ = 100 (population mean is equal to 100 units per hour).
Alternative hypothesis H1: μ ≠ 100 (population mean is not 100 units per hour).
The ≠ sign implies a two tail test.

2 Select test
We now need to choose an appropriate statistical test for testing H0. From the
information provided we note:
• number of samples—one sample;
• the statistic we are testing—testing for a difference between a sample mean ( x = 90)
and population mean (μ = 100). Population standard deviation is known (σ = 20);
• size of the sample—large (n = 36);
• nature of population from which sample drawn—population distribution is not
known, but sample size is large. For large n, the Central Limit Theorem states that
the sample mean is distributed approximately as a normal distribution.

One sample z-test of the mean is therefore selected.

3 Set the level of significance (α) = 0.05 (see Cell E10)

4 Extract relevant statistic


When dealing with a normal sampling distribution we calculate the Z statistic using
equation (6.1):

Zcal = (X̄ − μ) / (σ/√n) (6.1)

From Excel, population mean = 100 (see Cell E13), population standard deviation = 20
(see Cell E14), sample size n = 36 (see Cell E16), sample mean x = 90 (see Cell E17),
and standard error of the mean σX̄ = 3.3333 (see Cell E18):

Zcal = (X̄ − μ) / (σ/√n) = (90 − 100) / (20/√36) = −3 (see Cell E19)

In order to identify region of rejection in this case, we need to find the p-value.
The p-value can be found from Excel by using the NORM.S.DIST() function. In the
example H1: μ ≠ 100 units/hour. From Excel, the two tail p-value = 0.0026998 (see
Cell E22).

Note For two tail tests the p-value would be given by the Excel formula:
=2*(1 − NORM.S.DIST(ABS(z value or cell reference), TRUE)).
For one tail tests the p-value would be given by the Excel formula:
=NORM.S.DIST(z value, TRUE) for the lower tail p-value, where z is a negative value
=1 − NORM.S.DIST(z value, TRUE) for the upper tail p-value, where z is a positive value.
5 Make a decision
Does the test statistic lie within the region of rejection? Compare the chosen
significance level (α) of 5% (or 0.05) with the calculated two tail p-value of 0.0026998. Since the two tail p-value (0.0026998) < α (0.05), we reject H0 and accept H1.

❉ Interpretation It can be concluded that there is a significant difference, at the 0.05


level, between the new employee’s output and the firm’s existing employee output. In other
words, the sample mean value (90 units per hour) is not close enough to the population mean
value (100 units per hour) to allow us to assume that the sample comes from that population.
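The five-step calculation above can be cross-checked outside Excel; the following Python sketch (illustrative only, assuming the SciPy library is installed) reproduces the values in cells E18, E19, and E22:

```python
# A sketch reproducing the one sample z-test of Example 6.2.
from math import sqrt
from scipy import stats

mu, sigma = 100, 20     # population mean and standard deviation
n, x_bar = 36, 90       # sample size and sample mean
alpha = 0.05

se = sigma / sqrt(n)                      # 3.3333, standard error (cell E18)
z_cal = (x_bar - mu) / se                 # -3.0 (cell E19)
p_value = 2 * stats.norm.sf(abs(z_cal))   # 0.0026998, two tail p-value (cell E22)

print(p_value < alpha)                    # True -> reject H0 and accept H1
```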

Excel solution using the critical z-test statistic for a one sample z-test
The solution procedure is exactly the same as for the p-value except that we use the critical
test statistic value to make a decision.

1 State hypothesis

2 Select test

3 Set the level of significance (α = 0.05)

4 Extract relevant statistic


The calculated test statistic Zcal = −3.0 (see Cell E19). We need to compare it with the
critical test statistic, Zcri. In the example H1: μ ≠ 100 units/hour. The critical Z values
can be found from Excel by using the NORM.S.INV() function, two tail Zcri = ±1.96
(see Cells E23 and E24).

Note We can calculate the critical two tail value of z as follows:


=NORM.S.INV(significance level/2) for lower critical z value
=NORM.S.INV(1 − significance level/2) for upper critical z value.
The corresponding one tail critical z values are given as follows:
=NORM.S.INV(significance level) for lower tail
=NORM.S.INV(1 − significance level) for upper tail.

5 Make decision
Does the test statistic lie within the region of rejection? Compare the calculated and
critical z values to determine which hypothesis statement (H0 or H1) to accept. In
Figure 6.8 we observe that zcal lies in the lower rejection zone (−3 < −1.96). Given zcal
(−3) < lower two tail zcri (−1.96), we will reject H0 and accept H1.

❉ Interpretation It can be concluded that there is a significant difference, at the 0.05


level, between the new employee’s output and the firm’s existing employee output. In other
words, the sample mean value (90 units per hour) is not close enough to the population mean
value (100 units per hour) to allow us to assume that the sample comes from that population.

Figure 6.8 Normal curve for Example 6.2: zcal = −3 lies in the lower rejection region (below −1.96); the two tail p-value = 2*(1−NORM.S.DIST(ABS(−3),TRUE)) = 0.0026998 < 0.05.

Figure 6.8 illustrates the relationship between the p-value, test statistic, and critical test
statistic.

Note As emphasized, the use of the p-value or comparison of z-calculated versus


z-critical value is a matter of preference. Both methods yield identical results.

Student Exercises
X6.3 What are the critical z values for a significance level of 2%: (i) two tail, (ii) lower one tail,
and (iii) upper one tail?
X6.4 A marketing manager has undertaken a hypothesis test to test for the difference
between accessories purchased for two different products. The initial analysis has been
performed and an upper one tail z-test has been chosen. Given that the z value was
calculated to be 3.45 find the corresponding p-value. From this result what would you
conclude?
X6.5 A mobile phone company is concerned at the lifetime of phone batteries supplied
by a new supplier. Based upon historical data this type of battery should last for 900
days with a standard deviation of 150 days. A recent, randomly selected sample of
40 batteries was selected and the sample battery life was found to be 942 days. Is the
sample battery life significantly different from 900 days (significance level 5%)?
X6.6 A local Indian restaurant advertises home delivery times of 30 minutes. To monitor
the effectiveness of this promise the restaurant manager monitors the time that the
order was received and the time of delivery. Based upon historical data the average
time for delivery is 30 minutes with a standard deviation of 5 minutes. After a series of
complaints from customers regarding this promise the manager decided to analyse the
data of the last 50 orders which resulted in an average time of 32 minutes. Conduct an
appropriate test at a significance level of 5%. Should the manager be concerned?

6.3 One sample t-test for the population mean


If the population standard deviation is not known then we may use a one sample t-test if
the population distribution is normal. The t-test uses the sample standard deviation, s, as
an estimate of the population standard deviation, σ. In this section we will describe the
one sample t-test for the population mean.

Example 6.3
A local car dealer wants to know if the purchasing habits of a buyer buying extras have changed.
He is particularly interested in male buyers. Based upon collected data he has estimated that
the distribution of extras purchased is approximately normally distributed with an average of
£2000 per customer. To test this hypothesis he has collected the extras purchased by the last
seven male customers (£): 2300, 2386, 1920, 1578, 3065, 2312, and 1790. Test whether the
extras purchased on average has changed.

Figure 6.9 illustrates the Excel solution.

Figure 6.9

One sample t-test for the population mean: a one sample t-test is a hypothesis test for answering questions about the mean where the data are a random sample of independent observations from an underlying normal distribution where the population variance is unknown.

➜ Excel solution
Significance level = Cell E13 Value = 0.05
Population mean = Cell E16 Value = 2000
Sample data: Cells E18:E24 Values
Sample size = Cell E26 Formula = COUNT(E18:E24)
Sample mean Xavg = Cell E27 Formula = AVERAGE(E18:E24)
Sample standard deviation s = Cell E28 Formula = STDEV.S(E18:E24)

Standard error = Cell E29 Formula = E28/E26^0.5


t = Cell E30 Formula = (E27−E16)/E29
No of degrees of freedom = Cell E33 Formula = E26 − 1
Two tail p-value = Cell E34 Formula = T.DIST.2T(ABS(E30),E33)
Upper tcri = Cell E35 Formula = T.INV.2T(E13,E33)
Lower tcri = Cell E36 Formula = −T.INV.2T(E13,E33)

Excel solution using the p-value for a one sample t-test


To use the p-value statistic method to make a decision:

1 State hypothesis
Null hypothesis H0: μ = 2000 (population mean spend on extras is equal to £2000).
Alternative hypothesis H1: μ ≠ 2000 (population mean is not equal to £2000).
The ≠ sign implies a two tail test.

2 Select test
We now need to choose an appropriate statistical test for testing H0. From the
information provided we note:
• number of samples—one sample;
• the statistic we are testing—testing for a difference between a sample mean and
population mean (μ = 2000). Two tail test. Population standard deviation is not
known;
• size of the sample—small (n = 7);
• nature of population from which sample drawn—population distribution is
normal, sample size is small, and population standard deviation is unknown. The
sample standard deviation will be used as an estimate of the population standard
deviation and the sampling distribution of the mean is a t distribution with n – 1
degrees of freedom.

We conclude that a one sample t-test of the mean is appropriate.

3 Set the level of significance (α) = 0.05 (see Cell E13)

4 Extract relevant statistic


The required distribution is a t distribution given by equation (6.2):

tcal = (X̄ − μ) / (s/√n) (6.2)

From Excel, population mean = 2000 (see Cell E16), sample size n = 7 (see Cell E26),
sample mean X̄ = 2193 (see Cell E27), sample standard deviation s = 489.62673 (see Cell E28), and standard error of the mean σX̄ = 185.0615084 (see Cell E29):

tcal = (X̄ − μ) / (s/√n) = (2193 − 2000) / (489.62673/√7) = 1.0429 (see Cell E30)
Identify the region of rejection using the p-value method—the p-value can be found
from Excel by using the T.DIST.2T() function. In the example H1: μ ≠ £2000. From
Excel, the two tail p-value = 0.3371825 (see Cell E34).

Note We can calculate the two tail p-value using the Excel function T.DIST.2T:
=T.DIST.2T(ABS(t value), degrees of freedom).
We can calculate the one tail lower p-value using the Excel function T.DIST:
=T.DIST(t value, degrees of freedom, TRUE).
We can calculate the one tail upper p-value using the Excel function T.DIST.RT:
=T.DIST.RT(t value, degrees of freedom).

5 Make a decision
Does the test statistic lie in the region of rejection? Compare the chosen significance level (α) of 5% (or 0.05) with the calculated two tail p-value of 0.3371825. Since the two tail p-value (0.3371825) > α (0.05), we accept H0 and reject H1.

❉ Interpretation It can be concluded that there is no significant difference, at the 0.05


level, between the extras purchased by the sample and the historical extras of £2000 purchased.

Excel solution using the critical t-test statistic for a one sample t-test
The solution procedure is exactly the same as for the p-value except that we use the critical
test statistic value to make a decision.

1. State hypothesis.
2. Select test.
3. Set level of significance (α = 0.05).
4. Extract relevant statistic.
The calculated test statistic tcal = 1.0429 (see Cell E30). Calculate the critical test statis-
tic, tcri. In the example H1: μ ≠ £2000. The critical t values can be found from Excel by
using the T.INV.2T() function, two tail tcri = ± 2.447 (see Cells E35–E36).

Note We can calculate the critical two tail value of t as follows:


=T.INV.2T(significance level, degrees of freedom) for upper critical t value
=−T.INV.2T(significance level, degrees of freedom) for lower critical t value.
The corresponding one tail critical t values are given as follows:
=T.INV(significance level, degrees of freedom) for upper tail
=−T.INV(significance level, degrees of freedom) for lower tail.

5. Make decision
Does the test statistic lie within the region of rejection? Compare the calculated and
critical t values to determine which hypothesis statement (H0 or H1) to accept. Given
that tcal (1.0429) lies between the lower and upper critical t values (−2.447 and +2.447),
we will accept H0 and reject H1.

❉ Interpretation It can be concluded that there is no significant difference, at the 0.05


level, between the extras purchased by the sample and the historical extras of £2000 purchased.

Figure 6.10 illustrates the relationship between the p-value, test statistic, and critical
test statistic.

Figure 6.10 Two tail test for the one sample t-test: p = T.DIST.2T(ABS(1.04), 6) = 0.3371825 > 0.05; the 2.5% rejection regions for H0 lie beyond the critical t values −2.447 and +2.447, and tcal = 1.04 falls between them.

Student exercises
X6.7 Calculate the critical t values for a significance level of 1% and 12 degrees of freedom:
(1) two tail, (ii) lower one tail, and (iii) upper one tail.
X6.8 After further data collection the marketing manager (Exercise X6.4) decides to
revisit the data analysis and changes the type of test to a t-test. (i) Explain under
what conditions a t-test could be used rather than the z-test, and (ii) calculate the
corresponding p-value if the sample size was 13 and the test statistic equal to 2.03.
From this result what would you conclude?
X6.9 A tyre manufacturer conducts quality assurance checks on the tyres that it
manufactures. One of the tests consists of undertaking a test on their medium-quality
tyres with an independent random sample of 12 tyres providing a sample mean and
standard deviation of 14,500 km and 800 km respectively. Given that the historical
average is 15,000 km and that the population is normally distributed, test whether the
sample would raise a cause for concern.
X6.10 A new low-fat fudge bar is advertised as having 120 calories. The manufacturing
company conducts regular checks by selecting independent random samples and
testing the sample average against the advertised average. Historically, the population
varies as a normal distribution and the most recent sample consists of the numbers:
99, 132, 125, 92, 108, 127, 105, 112, 102, 112, 129, 112, 111, 102, and 122. Is the
population value significantly different from 120 calories (significance level 5%)?

6.4 Two sample z-test for the population mean

Example 6.4
A large organization produces electric light bulbs in each of its two factories (A and B). It is
suspected that the quality of production from factory A is better than from factory B. To test this
assertion the organization collects samples from factory A and B, and measures how long each
light bulb works for (in hours) before it fails. Both population variances are known
(σA² = 52783 and σB² = 61560). Conduct a two sample z-test for the population mean to test
this hypothesis.

Figure 6.11 illustrates the Excel solution.

Figure 6.11

➜ Excel solution
A: Cell B4:B33 Values
B: Cell C4:C35 Values
Significance level = Cell G13 Value = 0.05
nA = Cell G17 Formula: = COUNT (B4:B33)
Sample average = Cell G18 Formula: = AVERAGE(B4:B33)
Population variance known σA² = Cell G19 Value
nB = Cell G22 Formula = COUNT (C4:C35)

Sample average = Cell G23 Formula = AVERAGE(C4:C35)


Population variance known σB² = Cell G24 Value
Zcal = Cell G26 Formula = (G18−G23)/SQRT((G19/G17)
+(G24/G22))
One tail upper p-value = Cell G30 Formula = 1−NORM.S.DIST(G26,TRUE)
Upper Zcri = Cell G31 Formula = NORM.S.INV(1−G13)

Excel solution using the p-value for a two sample z-test


To use the p-value statistic method to make a decision:

1 State hypothesis
Null hypothesis H0: μA ≤ μB.
Alternative hypothesis H1: μA > μB.
The > sign implies an upper one tail test.

2 Select test
We now need to choose an appropriate statistical test for testing H0. From the
information provided we note:
• number of samples—two samples.
• the statistic we are testing—testing that the lifetime of light bulbs from factory A
last longer than for factory B. Both population variances are known (σA² = 52783
and σB² = 61560);
• size of both samples—large (nA = 30 and nB = 32);
• nature of population from which sample drawn—population distribution is not
known, but sample size large. For large n, the Central Limit Theorem states that
the sample means are approximately normally distributed (nA and nB ≥ 30).

We conclude that a two sample z-test of the population mean is appropriate.

3 Specify a significance level (α) = 0.05 (see Cell G13)

4 Extract relevant statistic


When dealing with a normal sampling distribution we calculate the z-test statistic
using equation (6.3):

z_{cal} = \frac{(\bar{X}_A - \bar{X}_B) - (\mu_A - \mu_B)}{\sqrt{\sigma_A^2/n_A + \sigma_B^2/n_B}}    (6.3)

From Excel: nA = 30 (see Cell G17), X̄A = 1135.33 (see Cell G18), σA² = 52783 (see Cell
G19), nB = 32 (see Cell G22), X̄B = 894.21575 (see Cell G23), and σB² = 61560 (see Cell
G24). If H0 is true (μA − μB = 0) then equation (6.3) simplifies to:

z_{cal} = \frac{\bar{X}_A - \bar{X}_B}{\sqrt{\sigma_A^2/n_A + \sigma_B^2/n_B}} = 3.9729    (see Cell G26)
Identify the region of rejection using the p-value method. The p-value can be found
from Excel using the NORM.S.DIST(z-value, true) function. In the example H1: μA > μB.
From Excel, the upper one tail p-value = 0.0000354957 (see Cell G30).

5 Make a decision
Does the test statistic lie within the region of rejection? Compare the chosen
significance level (α) of 5% (or 0.05) with the calculated upper one tail p-value of
0.0000354957. We can observe that the p-value < α, and we conclude that we reject
H0 and accept H1.

❉ Interpretation It can be concluded that, at the 5% level of significance, the light


bulbs from factory A have significantly longer lifetimes than the light bulbs from factory B.
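
As a cross-check outside Excel, the z statistic and one tail p-value can be reproduced from the summary statistics quoted above. This is a minimal Python sketch assuming the SciPy library is available; the means and variances are those shown in Cells G17–G24.

# Cross-check of Example 6.4: two sample z-test with known population variances
from math import sqrt
from scipy import stats

n_a, xbar_a, var_a = 30, 1135.33, 52783
n_b, xbar_b, var_b = 32, 894.21575, 61560

z_cal = (xbar_a - xbar_b) / sqrt(var_a / n_a + var_b / n_b)  # approx. 3.9729
p_upper = stats.norm.sf(z_cal)        # upper one tail p, approx. 0.0000355
z_cri = stats.norm.ppf(1 - 0.05)      # upper critical value, approx. 1.645
print(z_cal, p_upper, z_cri)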

Excel solution using the critical test statistic, zcri


The solution procedure is exactly the same as for the p-value except that we use the criti-
cal test statistic value to make a decision (see sections 6.1.10 and 6.1.11 for details). For
Example 6.4 we find z = 3.9729 and zcri = + 1.64 (see Cell G31). Does the test statistic lie
within the region of rejection? The calculation of z yields a value of 3.9729 and therefore
lies in the region of rejection for H0. Given zcal (3.9729) > upper zcri (1.64) we will reject H0
and accept H1.

❉ Interpretation It can be concluded that, at the 0.05 level, light bulbs from factory A
have significantly longer lifetimes than the light bulbs from factory B.

Figure 6.12 illustrates the relationship between the p-value and test statistic.

Figure 6.12 Upper one tail test for Example 6.4: p = 1 − NORM.S.DIST(3.9729, TRUE) = 0.0000354957 < 0.05, with zcal = 3.9729 lying in the upper rejection region.

Excel Data Analysis solution for two sample z-test


As an alternative to either of the two previous methods, we can use a method embedded
in Excel called Data Analysis.

Example 6.5
Reconsider Example 6.4, but use the Data Analysis tool to undertake the analysis.
Figure 6.13 illustrates the application of the Data Analysis: z-test: Two Sample for Means.

A: Cells B4:B33.
B: Cells C4:C35.
Hypothesized mean difference = 0.
Variable 1 variance = 52783.
Variable 2 variance = 61560.
Alpha = 0.05.
Output Range: Cell E4.

Figure 6.13

We observe from Figure 6.14 that the relevant results agree with the previous results.

Figure 6.14

❉ Interpretation It can be concluded that, at the 0.05 level, light bulbs from factory A
have significantly longer lifetimes than the light bulbs from factory B.

Student exercises
X6.11 A battery manufacturer supplies a range of car batteries to car manufacturers. The
40 Amp-hour battery is manufactured at two manufacturing plants with a stated mean
time between charges of 8.3 days and a variance of 1.25 days. The company regularly
selects an independent random sample from the two plants with results as shown in
Table 6.2.

Plant A Plant B
6.72 10.13 9.31 7.83 9.93 8.10 6.27 8.54
9.83 7.38 9.36 9.23 10.36 7.81 9.69 8.51
7.15 6.93 7.23 8.70 9.06 7.58 8.01 9.54
7.72 9.32 8.32 10.65 8.08 8.35 7.78 9.08
9.20 8.70 9.32 8.09 9.82 6.51 8.33 7.01
11.36 8.50 8.86 10.06 9.56 7.98 8.94 7.06
6.38 7.99 9.34 6.62 7.81 6.62 9.82 9.26
9.57 7.23 8.91 10.74 7.27 8.14 9.45 10.26

Table 6.2

(a) For the given samples conduct an appropriate hypothesis test to test that the
sample mean values are not different at the 5% level of significance.
(b) If the sample means are not significantly different test whether the population
mean is 8.3 days (choose sample A to undertake the test).
X6.12 The Indian restaurant manager has employed two new delivery drivers and wishes to
assess their performance. The data in Table 6.3 represent the delivery times for person
A and B undertaken on the same day.

Person A Person B
32.9 25.6 36.2 34.6 30.3 31.6 25.5 36.5 36.0 36.3
29.4 33.5 32.5 40.7 32.7 25.5 28.1 38.8 32.4 32.8
41.2 35.6 40.8 32.4 35.3 34.2 37.5 33.3 25.9 37.7
40.3 34.6 30.2 37.1 31.0 33.4 32.3 33.2
39.3 36.5 35.0 32.7 35.5 32.6 31.9 36.8
30.3 35.7 40.2 34.2 36.5 34.0 35.9 25.1
37.5 38.0 33.4 33.2 36.1 41.4 29.0 37.6
45.0 30.7 37.8 37.7 28.9 29.8 34.3 34.4

Table 6.3

Based upon your analysis of the two samples is there any evidence that the delivery
times are different (test at 5%).

6.5 Two sample z-test for the population proportion

Example 6.6
Concerned by the number of passengers not wearing rear seat belts in cars, a local police
authority decided to undertake a series of surveys based upon two large cities. The sur-
vey consisted of two independent random samples collected from city A and B. The police
authority would like to know if the proportions of passengers wearing seat belts between city
A and B are different. Conduct a two sample z-test for the population proportion to test this
hypothesis.

Figure 6.15 illustrates the Excel solution.

Figure 6.15

➜ Excel solution
NA = Cell C4 Value
NB = Cell D4 Value
nA = Cell C5 Value
nB = Cell D5 Value

Significance level = Cell D15 Value = 0.05


NA = Cell D17 Formula: = C4
NB = Cell D18 Formula: = D4
nA = Cell D19 Formula: = C5
nB = Cell D20 Formula: = D5
ρA = Cell D21 Formula: = D19/D17
ρB = Cell D22 Formula: = D20/D18
Z = Cell D25 Formula: = (D21−D22)/SQRT((D21*(1−D21)/
D17+D22*(1−D22)/D18))
Two tail p-value = Cell D29 Formula: = 2*(1−NORM.S.DIST(ABS(D25),TRUE))
Lower Zcri = Cell D30 Formula: = NORM.S.INV(D15/2)
Upper Zcri = Cell D31 Formula: = NORM.S.INV(1−D15/2)

Excel solution using the p-value


To use the p-value statistic method to make a decision:

1 State hypothesis
Null hypothesis H0: πA = πB.
Alternative hypothesis H1: πA ≠ πB.
The ≠ sign implies a two tail test.

2 Select test
We now need to choose an appropriate statistical test for testing H0. From the
information provided we note:
• number of samples—two samples;
• the statistic we are testing—testing that the proportion wearing seatbelts is different
between the two cities. Both population standard deviations are unknown;
• size of both samples—large (nA = 250 and nB = 190);
• nature of population from which sample drawn—population distribution is not
known, but sample size large. For large n, the Central Limit Theorem states that
the sample proportions are approximately normal distributed.

From this information we will undertake a two sample z-test for proportions.

3 Set the level of significance (α) = 0.05 (see Cell D15)

4 Extract the relevant statistic


When dealing with a normal sampling distribution we calculate the Z statistic using
equation (6.4).

z_{cal} = \frac{(\rho_A - \rho_B) - (\pi_A - \pi_B)}{\sqrt{\pi_A(1 - \pi_A)/N_A + \pi_B(1 - \pi_B)/N_B}}    (6.4)

where ρA and ρB are the proportions for samples A and B, and πA and πB are the
population proportions (πA ~ ρA, πB ~ ρB). From Excel: NA = 250 (see Cell D17), NB = 190

(see Cell D18), nA = 135 (see Cell D19), nB = 80 (see Cell D20), ρA = 0.54 (see Cell D21),
and ρB = 0.42 (see Cell D22). If H0 is true (πA − πB = 0) then equation (6.4) simplifies to:

z_{cal} = \frac{\rho_A - \rho_B}{\sqrt{\rho_A(1 - \rho_A)/N_A + \rho_B(1 - \rho_B)/N_B}} = 2.49    (see Cell D25)

Identify the region of rejection using the p-value method. The p-value can be found
from Excel using the NORM.S.DIST(z value, true) function. In the example H1: πA ≠ πB.
From Excel, the two tail p-value = 0.013 (see Cell D29).

5 Make a decision
Does the test statistic lie within the region of rejection? Compare the chosen
significance level (α) of 5% (or 0.05) with the calculated two tail p-value of 0.013. We
can observe that the p-value < α, and we conclude that we reject H0 and accept H1.

❉ Interpretation It can be concluded that a significant difference exists between the


proportions of rear passengers wearing seatbelts between city A and B. Furthermore, the
evidence suggests that the proportion wearing seatbelts is higher for city A. To test this we
could undertake an upper one tail test to test whether the proportion for city A is significantly
larger than for city B. It should be noted that the decision will change if you choose a 1% level
of significance.
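
The same z statistic and two tail p-value can be reproduced outside Excel. The short Python sketch below is a supplementary cross-check, not part of the worked Excel solution; it assumes SciPy is available and applies the unpooled formula (6.4) to the counts from Cells D17–D20.

# Cross-check of Example 6.6: two sample z-test for proportions (unpooled, as in equation 6.4)
from math import sqrt
from scipy import stats

N_a, n_a = 250, 135        # city A: sample size, number wearing seat belts
N_b, n_b = 190, 80         # city B: sample size, number wearing seat belts
p_a, p_b = n_a / N_a, n_b / N_b

se = sqrt(p_a * (1 - p_a) / N_a + p_b * (1 - p_b) / N_b)
z_cal = (p_a - p_b) / se                     # approx. 2.49
p_two_tail = 2 * stats.norm.sf(abs(z_cal))   # approx. 0.013
print(z_cal, p_two_tail)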

Excel solution using the critical test statistic, zcri


The solution procedure is exactly the same as for the p-value except that we use the criti-
cal test statistic value to make a decision (see sections 6.1.10 and 6.1.11 for details). For
Example 6.6 we find z = 2.49 and zcri = ± 1.96 (see Cells D30 and D31). Does the test statis-
tic lie within the region of rejection? The calculation of z yields a value of 2.49 and there-
fore lies in the region of rejection for H0. Given zcal (2.49) > upper two tail zcri (+ 1.96) we
will reject H0 and accept H1.

❉ Interpretation It can be concluded that a significant difference exists between the


proportions of rear passengers wearing seatbelts between cities A and B.

Figure 6.16 illustrates the relationship between the p-value and test statistic.

Figure 6.16 Two tail test for Example 6.6: p = 2*(1 − NORM.S.DIST(ABS(2.49), TRUE)) = 0.013 < 0.05, with zcal = 2.49 lying in the upper rejection region.



Student exercises
X6.13 During a national election a national newspaper wanted to assess whether there was
a similar voting pattern for a particular party between two towns in the north-east of
England. The sample results are illustrated in Table 6.4.

Town A Town B
Number interviewed, N 456 345
Intention to vote for party, n 243 212

Table 6.4

Assess whether there is a significant difference in voting intentions between town A


and town B (test at 5%).
X6.14 A national airline keeps a record of luggage misplaced at two European airports during
one week in the summer of 2006. The sample results are illustrated in Table 6.5.

Airport A Airport B
Total number of items processed, N 15596 25789
Number of items of luggage misplaced, n 123 167

Table 6.5

Assess whether there is a significant difference in misplaced luggage between the two
airports (test at 5%).

6.6 Two sample t-test for population mean (independent samples, equal variances)

Example 6.7
A certain product of organic beans is packed in tins and sold by two local shops. The local
authority have received complaints from customers that the amount of beans within the tins
sold by the two shops is different. To test this statistically two small random samples were
collected from both shops. Conduct a two sample t-test for the population mean (independent
samples, equal variance) to test this hypothesis.

Two sample t-test for the population mean (independent samples, equal variance) A two sample t-test for the population mean (independent samples, equal variance) is used when two separate sets of independent and identically distributed samples are obtained, one from each of the two populations being compared.

Figure 6.17 illustrates the Excel solution.

➜ Excel solution
A: Cells B4:B21 Values
B: Cells C4:C28 Values

Significance level = Cell G14 Value = 0.05


nA = Cell G16 Formula: = COUNT(B4:B21)
averageA = Cell G17 Formula: = AVERAGE(B4:B21)
sA = Cell G18 Formula: = STDEV.S(B4:B21)
nB = Cell G20 Formula: = COUNT(C4:C28)
averageB = Cell G21 Formula: = AVERAGE(C4:C28)
sB = Cell G22 Formula: = STDEV.S(C4:C28)
Pooled variance = Cell G24 Formula: = ((G16−1)*G18^2+(G20−1)*G22^2)/(G16+G20−2)
tcal = Cell G25 Formula: = (G17−G21)/SQRT(G24*(1/G16+1/G20))
df = Cell G28 Formula: = G16+G20−2
Two tail p-value = Cell G29 Formula: = T.DIST.2T(ABS(G25),G28)
Upper t critical = Cell G30 Formula: = T.INV.2T(G14,G28)
Lower t critical = Cell G31 Formula: = −T.INV.2T(G14,G28)


Figure 6.17

Excel solution using the p-value


To use the p-value statistic method to make a decision:

1 State hypothesis
Null hypothesis H0: μA = μB.
Alternative hypothesis H1: μA ≠ μB.
The ≠ sign implies a two tail test.

2 Select test
We now need to choose an appropriate statistical test for testing H0. From the
information provided we note:
• number of samples—two samples;
• the statistic we are testing—testing that the amount of beans in a tin sold by both
shops is the same. Both population standard deviations are unknown;
• size of both samples—small (nA = 18 and nB = 25);
• nature of population from which sample drawn—population distribution is not
known, but we will assume that the population is normally distributed.

We will assume that the population variances are equal and conduct a Two Sample
t-test: Assuming Equal Variances (also called pooled-variance t-test).

3 Set level of significance (α) = 0.05 (see Cell G14)

4 Extract relevant statistic


When dealing with a normal sampling distribution we calculate the t-test statistic
using equation (6.5), with the pooled estimate of the population variance from
equation (6.6) and the number of degrees of freedom from equation (6.7).

t_{cal} = \frac{(\bar{X}_A - \bar{X}_B) - (\mu_A - \mu_B)}{\sqrt{\hat{\sigma}^2_{A+B}\,(1/n_A + 1/n_B)}}    (6.5)

\hat{\sigma}^2_{A+B} = \frac{(n_A - 1)s_A^2 + (n_B - 1)s_B^2}{n_A + n_B - 2}    (6.6)

df = n_A + n_B - 2    (6.7)

From Excel: nA = 18 (see Cell G16), X̄A = 527.055 (see Cell G17), sA = 51.02 (see
Cell G18), nB = 25 (see Cell G20), X̄B = 496.64 (see Cell G21), and sB = 41.38 (see
Cell G22):

\hat{\sigma}^2_{A+B} = \frac{(n_A - 1)s_A^2 + (n_B - 1)s_B^2}{n_A + n_B - 2} = 2082.0171816    (see Cell G24)

If H0 is true (μA − μB = 0), then equation (6.5) simplifies to:

t_{cal} = \frac{\bar{X}_A - \bar{X}_B}{\sqrt{\hat{\sigma}^2_{A+B}\,(1/n_A + 1/n_B)}} = 2.156    (see Cell G25)

df = n_A + n_B - 2 = 41    (see Cell G28)

Identify region of rejection using the p-value method. The p-value can be found from
Excel by using the T.DIST.2T() function. In this example H1: μA ≠ μB . From Excel, two
tail p-value = 0.036970 (see Cell G29).

5 Make a decision
Does the test statistic lie within the region of rejection? Compare the chosen
significance level (α) of 5% (or 0.05) with the calculated two tail p-value of
0.036970. We can observe that the p-value < α and we conclude that we reject H 0
and accept H1.

❉ Interpretation It can be concluded that, based upon the sample data collected, we
have evidence that the quantity of beans sold by shops A and B is significantly different at the
5% level of significance. It should be noted that the decision will change if you choose a
1% level of significance.

Excel solution using the critical test statistic, tcri


The solution procedure is exactly the same as for the p-value except that we use the criti-
cal test statistic value to make a decision (see sections 6.1.10 and 6.1.11 for details). For
Example 6.7 we find tcal = 2.156 and tcri = ± 2.019541 (see Cells G30 and G31). Does the
test statistic lie within the region of rejection? The calculation of t yields a value of 2.156
and therefore lies in the region of rejection for H0. Given the tcal (2.156) > upper two tail
tcri (2.019541) we will reject H0 and accept H1.

❉ Interpretation It can be concluded that, based upon the sample data collected, we
have evidence that the quantity of beans sold by shops A and B is significantly different at the
5% level of significance.
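
For readers working outside Excel, SciPy can reproduce the pooled-variance result directly from the summary statistics in Cells G16–G22. A minimal sketch, assuming SciPy is installed; small rounding differences arise because the means and standard deviations below are quoted to a few decimal places.

# Cross-check of Example 6.7: pooled two sample t-test from summary statistics
from scipy import stats

res = stats.ttest_ind_from_stats(
    mean1=527.055, std1=51.02, nobs1=18,   # shop A
    mean2=496.64, std2=41.38, nobs2=25,    # shop B
    equal_var=True)                        # pooled-variance t-test

print(res.statistic, res.pvalue)           # approx. 2.156 and 0.0370 (two tail)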

Figure 6.18 illustrates the relationship between the p-value and test statistic.

Figure 6.18 Two tail test for Example 6.7: p = T.DIST.2T(ABS(2.156), 41) = 0.036970 < 0.05; the 2.5% rejection regions for H0 lie beyond the critical t values, and tcal = 2.156 falls in the upper region.

Excel Data Analysis solution for two sample pooled t-test


As an alternative to either of the two previous methods, we can use a method embedded
in Excel Data Analysis.

Example 6.8
Reconsider Example 6.7, but use the Data Analysis tool to undertake the analysis.
Figure 6.19 illustrates the application of Data Analysis: t-test Two-Sample Assuming Equal
Variances (Select Data > Data Analysis > t-test: Two Sample Assuming Equal Variances).

Figure 6.19

We observe from Figure 6.20 that the relevant results agree with the previous results.

Figure 6.20

❉ Interpretation It can be concluded that, based upon the sample data collected, we
have evidence that the quantity of beans sold by shops A and B is significantly different at
the 5% level of significance.

Student exercises
X6.15 During an examination board concerns were raised concerning the marks obtained by
students sitting the final year advanced economics (AE) and e-Marketing (EM) papers
(Table 6.6).

AE AE EM EM EM
51 63 71 68 61
66 35 69 53 59
50 9 63 65 55
48 39 66 48 66
54 35 43 63 61
83 44 34 48 58
68 68 57 47 77
48 36 58 53 73
45 68 64 54

Table 6.6

Historically, the sample data varies as a normal distribution and the population
standard deviations are approximately equal. Assess whether there is a significant
difference between the two sets of results (test at 5%).
X6.16 A university finance department would like to compare the travel expenses claimed by
staff attending conferences. After initial data analysis the finance director has identified
two departments who seem to have very different levels of claims. Based upon the
data provided (Table 6.7), undertake a suitable test to assess whether the level of
claims from department A is significantly greater than that from department B. You
can assume that the population expenses data are normally distributed and that the
population standard deviations are approximately equal.

Department A Department B
156.67 146.81 147.28 140.67 108.21 109.10 127.16
169.81 143.69 157.58 154.78 142.68 110.93 101.85
130.74 155.38 179.89 154.86 135.92 132.91 124.94
158.86 170.74

Table 6.7

6.7 Two sample tests for population mean (independent samples, unequal variances)

6.7.1 Two sample tests for independent samples (unequal variances)
In Examples 6.7 and 6.8 we assumed that the variances were equal for the two samples
and conducted a two sample pooled t-test. In this test we assume that the population

variances are equal and the sample variances are combined to give a pooled estimate of
σ̂ A + B given by equation (6.6).
If we are concerned that the assumption of equal variances is unsound then we can
conduct a two sample t-test with equations (6.8) and (6.9).

Example 6.9
A certain product of organic beans is packed in tins and sold by two local shops. The local
authority have received complaints from customers that the amount of beans within the tins
sold by the shop is different. To test this statistically two small, random samples were collected
from both shops.

Figure 6.21 illustrates the Excel solution.

Figure 6.21

➜ Excel solution
A: Cells B4:B21 Values
B: Cells C4:C28 Values
Significance level = Cell G13 Value = 0.05
nA = Cell G15 Formula: = COUNT(B4:B21)
averageA = Cell G16 Formula: = AVERAGE(B4:B21)
sA = Cell G17 Formula: = STDEV.S(B4:B21)
nB = Cell G19 Formula: = COUNT(C4:C28)

averageB = Cell G20 Formula: = AVERAGE(C4:C28)


sB = Cell G21 Formula: = STDEV.S(C4:C28)
tcal = Cell G23 Formula: = (G16−G20)/(G17^2/G15+G21^2/G19)^0.5
df num = Cell G26 Formula: = (G17^2/G15+G21^2/G19)^2
df denom = Cell G27 Formula: = ((G17^2/G15)^2/(G15−1)+(G21^2/G19)^2/(G19−1))
df = Cell G28 Formula: = IF((G26/G27−INT(G26/G27)< 0.5),INT(G26/G27),
INT(G26/G27)+1)
Two tail p-value = Cell G30 Formula: = T.DIST.2T(ABS(G23),G28)
Upper t critical = Cell G31 Formula: = T.INV.2T(G13,G28)
Lower t critical = Cell G32 Formula: = −T.INV.2T(G13,G28)

Excel solution using the p-value


To use the p-value statistic method to make a decision:

1 State hypothesis
Null hypothesis H0: μA = μB.
Alternative hypothesis H1: μA ≠ μB.
The ≠ sign implies a two tail test.

2 Select test
We now need to choose an appropriate statistical test for testing H0. From the
information provided we note:
• number of samples—two samples;
• the statistic we are testing—testing that the amount of beans in a tin sold by both
shops is the same. Both population standard deviations are unknown;
• size of both samples—small (nA = 18 and nB = 25);
• nature of population from which sample drawn—population distribution is not
known, but we will assume that the population is approximately normal given
sample size is close to 30.

We will assume that the population variances are not equal and conduct a Two
Sample t-test: Assuming Unequal Variances (also called separate-variance t-test).

3 Set the level of significance (α) = 0.05 (see Cell G13)

4 Extract relevant statistic


When dealing with a normal sampling distribution we calculate the Satterthwaite’s
approximate t-test statistic (equation (6.8)) with the number of degrees of freedom
given by equation (6.9).

t_{cal} = \frac{(\bar{X}_A - \bar{X}_B) - (\mu_A - \mu_B)}{\sqrt{s_A^2/n_A + s_B^2/n_B}}    (6.8)
df = \frac{(s_A^2/n_A + s_B^2/n_B)^2}{\frac{(s_A^2/n_A)^2}{n_A - 1} + \frac{(s_B^2/n_B)^2}{n_B - 1}}    (6.9)

From Excel: nA = 18 (see Cell G15), X̄A = 527.055 (see Cell G16), sA = 51.02 (see Cell
G17), nB = 25 (see Cell G19), X̄B = 496.64 (see Cell G20), and sB = 41.38 (see Cell G21).
If H0 is true (μA − μB = 0) then equation (6.8) simplifies to:

t_{cal} = \frac{\bar{X}_A - \bar{X}_B}{\sqrt{s_A^2/n_A + s_B^2/n_B}} = 2.083    (see Cell G23)

df = \frac{(s_A^2/n_A + s_B^2/n_B)^2}{\frac{(s_A^2/n_A)^2}{n_A - 1} + \frac{(s_B^2/n_B)^2}{n_B - 1}} = 32    (see Cell G28)

Please note that the number of degrees of freedom (df ) is rounded to the nearest
whole number. Identify the region of rejection using the p-value method. The p-value
can be found from Excel by using the T.DIST.2T() function. In the example H1: μA ≠ μB.
From Excel, the two tail p-value = 0.045288 (see Cell G30).

5 Make a decision
Does the test statistic lie within the region of rejection? Compare the chosen
significance level (α) of 5% (or 0.05) with the calculated two tail p-value of 0.045288.
We can observe that the p-value < α, and we conclude that we reject H0 and accept H1.

❉ Interpretation It can be concluded that, based upon the sample data collected, we
have evidence that the quantity of beans sold by shops A and B is significantly different at the
5% level of significance.
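
SciPy's separate-variance (Welch) test reproduces the same statistic from the summary data. A minimal sketch, assuming SciPy is installed; note that SciPy retains the fractional Welch degrees of freedom rather than rounding to 32, so its p-value differs very slightly from the Excel value of 0.045288.

# Cross-check of Example 6.9: separate-variance (Welch) two sample t-test
from scipy import stats

res = stats.ttest_ind_from_stats(
    mean1=527.055, std1=51.02, nobs1=18,   # shop A
    mean2=496.64, std2=41.38, nobs2=25,    # shop B
    equal_var=False)                       # do not pool the variances

print(res.statistic, res.pvalue)           # approx. 2.083 and 0.045 (two tail)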

Excel solution using the critical test statistic, tcri


The solution procedure is exactly the same as for the p-value except that we use the
critical test statistic value to make a decision (see sections 6.1.10 and 6.1.11 for details).
For Example 6.9 we find tcal = 2.083 and tcri = ± 2.036933 (see Cells G31 and G32). Does
the test statistic lie within the region of rejection? The calculation of t yields a value of
2.083 and lies within the region of rejection for H0. Given that tcal (2.083) > upper two tail
tcri (2.036933) we will reject H0 and accept H1.

❉ Interpretation We conclude that, based upon the sample data collected, we have
evidence that the quantity of beans sold by shops A and B is significantly different at the
5% level of significance.

Figure 6.22 illustrates the relationship between the p-value and the test statistic.

Figure 6.22 Two tail test for Example 6.9: p = T.DIST.2T(ABS(2.083), 32) = 0.045288 < 0.05, with tcal = 2.083 lying in the upper rejection region.

Excel Data Analysis solution


As an alternative to either of the two previous methods, we can use a method embedded
in Excel Data Analysis.

Example 6.10
Reconsider Example 6.9, but use the Data Analysis tool to undertake the analysis

Figure 6.23 illustrates the application of Data Analysis: t-test: Two Sample Assuming
Unequal Variances (Select Data > Data Analysis > t-test: Two Sample Assuming Unequal
Variances).
We observe from Figure 6.24 that the relevant results agree with the previous results.

Figure 6.23

❉ Interpretation It can be concluded that, based upon the sample data collected, we
have evidence that the quantity of beans sold by shops A and B is significantly different at the
5% level of significance.

Figure 6.24

6.7.2 Equivalent non-parametric test: Mann–Whitney U test


For two independent samples the non-parametric equivalent test is the Mann–Whitney
U test, which is described in Chapter 7. Please note that this is also called the Wilcoxon
rank sum test.

Student exercises
X6.17 Repeat Exercise X6.15, but do not assume equal variances. Are the two sets of results
significantly different (test at 5%)?
X6.18 Repeat Exercise X6.16, but do not assume equal variances. Are the expenses claimed
by department A significantly different to department B?

6.8 Two sample tests for population mean (dependent or paired samples)

Two sample t-test for population mean (dependent or paired samples) A two sample t-test for population mean (dependent or paired samples) is used to compare two dependent population means inferred from two samples (dependent indicates that the values from both samples are numerically dependent upon each other; there is a correlation between corresponding values).

6.8.1 Two sample tests for dependent samples

Example 6.11
Suppose that Super Slim is advertising a weight reduction programme that they say will result
in more than 10 lb weight loss in the first 30 days. Twenty-six subjects were independently and
randomly selected for a study, and their weights before and after the weight-loss programme
were recorded. Super Slim have stated that the historical data show that the populations are
normally distributed. Figure 6.25 shows the raw data and Excel solution. Conduct a two sample
t-test for the population mean (dependent or paired samples) to test this hypothesis.

Figure 6.25 illustrates the Excel solution.


Figure 6.25

➜ Excel solution
Person Cells B5:B30 Values
Before weight, B Cells C5:C30 Values
After weight, A Cells D5:D30 Values
d = B − A Cell E5 Formula: = C5−D5 Copy formula down E5:E30
d² Cell G5 Formula: = E5^2 Copy formula down G5:G30
Significance level Cell L13 Value = 0.05
n = Cell L15 Formula: = COUNT(B5:B30)
Σd = Cell L16 Formula: = SUM(E5:E30)
Σd2 = Cell L17 Formula: = SUM(G5:G30)
Mean d = Cell L18 Formula: = AVERAGE(E5:E30)
sd = Cell L19 Formula: = SQRT((L17−L16^2/L15)/(L15−1))
tcal = Cell L20 Formula: = (L18−10)/(L19/SQRT(L15))
df = Cell L23 Formula: = L15−1
Upper p-value = Cell L24 Formula: = T.DIST.RT(L20,L23)
Upper tcri = Cell L25 Formula: = T.INV(1−L13,L23)

Excel solution using the p-value

1 State hypothesis
The hypothesis statement implies that the population mean weight loss (weight before
minus weight after) should be more than 10 lb. If D denotes the population mean
difference in weight, then the null and alternative hypotheses would be stated as follows.

Null hypothesis H0: D ≤ 10.


Alternative hypothesis H1: D > 10 (or D – 10 > 0).
The > sign implies an upper one tail test.

2 Select test
We now need to choose an appropriate statistical test for testing H0. From the
information provided we note:
• number of samples—two samples;
• the statistic we are testing—testing that the weight reduction programme results
in a weight loss. Both population standard deviations are unknown;
• size of both samples—small (nA and nB = 26);
• nature of population from which sample drawn—population distribution is
normally distributed.

In this case we have two variables that are related to each other (weight before vs
weight after treatment) and we will conduct a two sample t-test: paired sample for
means.

3 Set level of significance (α) = 0.05 (see Cell L13).

4 Extract relevant statistic


The t-test statistic is given by equation (6.10), with sd given by equation (6.11) and
the number of degrees of freedom given by equation (6.12):

t_{cal} = \frac{\bar{d} - D}{s_d/\sqrt{n}}    (6.10)

s_d = \sqrt{\frac{\sum d^2 - (\sum d)^2/n}{n - 1}}    (6.11)

df = n - 1    (6.12)

From Excel: n = 26 (see Cell L15), Σd = 447 (see Cell L16), Σd² = 12989 (see Cell L17),
d̄ = 17.19231 (see Cell L18), and D = 10:

s_d = \sqrt{\frac{\sum d^2 - (\sum d)^2/n}{n - 1}} = 14.56577    (see Cell L19)

t_{cal} = \frac{\bar{d} - D}{s_d/\sqrt{n}} = 2.517802    (see Cell L20)

df = n − 1 = 25 (see Cell L23)

Identify the region of rejection using the p-value method. The p-value can be found
from Excel by using the T.DIST.RT() function. In the example H1: D > 10. From Excel,
the upper one tail p-value = 0.0093 (see Cell L24).

5 Make a decision
Does the test statistic lie within the region of rejection? Compare the chosen
significance level (α) of 5% (or 0.05) with the calculated upper one tail p-value

of 0.0093. We can observe that the p-value < α, and we conclude that we reject H0
and accept H1.

❉ Interpretation It can be concluded that the average weight loss is more than 10 lb at
a 5% level of significance.
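
The paired test can also be checked outside Excel from the summary statistics in Cells L15–L19. A minimal Python sketch, assuming SciPy is installed; with the raw before/after columns you could instead call scipy.stats.ttest_rel on the two series after allowing for the hypothesized difference of 10.

# Cross-check of Example 6.11: paired t-test against a hypothesized difference D = 10
from math import sqrt
from scipy import stats

n, d_bar, s_d, D = 26, 17.19231, 14.56577, 10
t_cal = (d_bar - D) / (s_d / sqrt(n))     # approx. 2.5178, as in Cell L20
df = n - 1
p_upper = stats.t.sf(t_cal, df)           # upper one tail p, approx. 0.0093
t_cri = stats.t.ppf(1 - 0.05, df)         # approx. 1.7081, as in Cell L25
print(t_cal, p_upper, t_cri)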

Excel solution using the critical test statistic, tcri


The solution procedure is exactly the same as for the p-value except that we use the criti-
cal test statistic value to make a decision (see sections 6.1.10 and 6.1.11 for details). For
Example 6.11 we find tcal = 2.5178 and tcri = + 1.708141 (see Cell L25). Does the test statistic
lie within the region of rejection? The calculation of t yields a value of 2.5178 and therefore
lies in the region of rejection for H0. Given that tcal > upper one tail tcri (1.708141) we will
reject H0 and accept H1.

❉ Interpretation It can be concluded that the average weight loss is more than 10 lb at
a 5% level of significance.

Figure 6.26 illustrates the relationship between the p-value and test statistic.

Figure 6.26 Upper one tail test for Example 6.11: p = T.DIST.RT(2.517, 25) = 0.0093 < 0.05, with tcal = 2.517 lying in the upper rejection region.

Excel Data Analysis solution

Example 6.12
Reconsider Example 6.11, but use the Data Analysis tool to undertake the analysis.

Figure 6.27 illustrates the application of Data Analysis: t-test: Paired Two Sample for
Means (Select Data > Data Analysis > t-test: Paired Two Sample for Means).
We observe from Figure 6.28 that the relevant results agree with the previous results.

❉ Interpretation It can be concluded that the average weight loss is more than 10 lb at
a 5% level of significance.

Figure 6.27

Figure 6.28

6.8.2 Equivalent non-parametric test: Wilcoxon matched pairs test
For two dependent (or paired, repeated measures) samples the non-parametric equiva-
lent test is the Wilcoxon matched pairs test, which will be described in Chapter 7.

Student exercises
X6.19 Choko Ltd provides training to its salespeople to aid the ability of each salesperson
to increase the value of their sales. During the last training session 15 salespeople
attended, and their weekly sales before and sales after are provided in Table 6.8.

Person Before After


1 2911.48 2287.22
2 1465.44 3430.54
3 2315.36 2439.93
4 1343.16 3071.55
5 2144.22 3002.40
6 2499.84 2271.37
7 2125.74 2964.65
8 2843.05 3510.43
9 2049.34 2727.41
10 2451.25 2969.99
11 2213.75 2597.71
12 2295.94 2890.20
13 2594.84 2194.37
14 2642.91 2800.56
15 3153.21 2365.75

Table 6.8

Assuming that the populations are normally distributed, assess whether there is any
evidence that the training improves sales (test at 5% and 1%).
X6.20 Concern has been raised at the standard achieved by students completing final
year project reports within a university department. One of the factors identified as
important is the research methods (RM) module mark achieved, which is studied
before the students start their project. The department has now collected data for 15
students, as given in Table 6.9.

Student RM Project
1 38 71
2 50 46
3 51 56
4 75 44
5 58 62
6 42 65
7 54 50
8 39 51
9 48 43
10 14 62
11 38 66
12 47 75
13 58 60
14 53 75
15 66 63

Table 6.9

Assuming that the populations are normally distributed, is there any evidence to
suggest that the marks are different (test at 5%).

6.9 F test for two population variances (variance ratio test)
In the previous sections we introduced the concept of hypothesis testing to test the differ-
ence between interval level variables using both z- and t-tests. In Example 6.7 we assumed
that the population variances were equal for both populations (shop A and B) and con-
ducted a pooled two sample t-test. In Example 6.9 we described the corresponding t-test
for population variances which are considered to be not equal.

Example 6.13
In this example we will use the F test to check if the two population variances in Example 6.7
can be considered equal at the 95% level of confidence.

Figure 6.29 illustrates the Excel solution.


Figure 6.29

➜ Excel solution
A: Cells B4:B21 Values
B: Cells C4:C28 Values
Significance level Cell G10 Value = 0.05
nA = Cell G12 Formula = COUNT(B4:B21)
nB = Cell G13 Formula = COUNT(C4:C28)
sA = Cell G14 Formula = STDEV.S(B4:B21)

sB = Cell G15 Formula = STDEV.S(C4:C28)


Fcal = Cell G16 Formula = G14^2/G15^2
dfA = Cell G19 Formula = G12−1
dfB = Cell G20 Formula = G13−1
Two tail p-value = Cell G21 Formula = F.TEST(B4:B21,C4:C28)
FU = upper two tail Fcri = Cell G22 Formula = F.INV.RT(G10/2,G19,G20)
FL = lower two tail Fcri = Cell G23 Formula = F.INV(G10/2,G19,G20)
One tail test Upper p-value = Cell G27 Formula = F.DIST.RT(G16,G19,G20)
Upper Fcri = Cell G28 Formula = F.INV.RT(G10,G19,G20)
Lower p-value = Cell G29 Formula = F.DIST(G16,G19,G20,TRUE)
Lower Fcri = Cell G30 Formula = F.INV(G10,G19,G20)

F test Tests whether two population variances are the same based upon sample values.

Excel solution using the p-value

1 State hypothesis
The alternative hypothesis statement implies that the population variances are not
equal. The null and alternative hypotheses would be stated as follows:

Null hypothesis H0: σA² = σB²
Alternative hypothesis H1: σA² ≠ σB²
The ≠ sign implies a two tail test (or non-directional test).

2 State test
We now need to choose an appropriate statistical test for testing H0. From the
information provided we note:
• number of samples—two samples;
• the statistic we are testing—testing that the variances are different from each other;
• size of both samples—small (nA = 18 and nB = 25);
• nature of population from which sample drawn—population distribution is not
known, but we will assume that the population is approximately normal given
sample sizes close to 30.

In this case conduct an F test for variance.

3 Set the level of significance at α = 0.05 (see Cell G10). For two-tail (or non-
directional) tests use α = significance level/2 = 0.025.

4 Extract relevant statistic


The difference between two variances can be studied using another sampling
distribution called the F distribution, defined by equation (6.13):

F_{cal} = \frac{s^2_{largest}}{s^2_{smallest}}    (6.13)

F distribution The F distribution (also known as the Fisher–Snedecor distribution) is a continuous probability distribution that arises frequently as the null distribution of a test statistic, most notably in the analysis of variance.

Just as with the previous hypothesis tests we calculate the F statistic (Fcal) and
compare this with the critical test value (Fcri), or calculate the corresponding p-value
for the F distribution. If the two populations are independent and vary as normal
distributions then this ratio will vary as an F distribution with two sets of degrees of
freedom (dfn and dfd) given by equations (6.14) and (6.15):

dfnumerator = sample size of the largest sample variance − 1 (6.14)

dfdenominator = sample size of the smallest sample variance − 1 (6.15)

With the hypothesis tests considered so far we have been able to write the hypothesis
statement as either a one or two tail test. With the F test we have a similar situation but
we are dealing with variances rather than mean values.
From Excel: nA = 18 (see Cell G12), nB = 25 (see Cell G13), sA = 51.02 (see Cell G14),
sB = 41.38 (see Cell G15). Given that sA > sB then the numerator variance in equation
(6.13) will be the sample A variance and the denominator variance will be the sample
B variance.

F_{cal} = \frac{s_A^2}{s_B^2} = 1.5197059    (see Cell G16)

Identify the region of rejection using the p-value method. The p-value can be found
from Excel by using the F.TEST() function. In the example H1: σA² ≠ σB². From Excel,
the two tail p-value = 0.3393282 (see Cell G21).

5 Make a decision
Does the test statistic lie within the region of rejection? Compare the chosen
significance level (α) of 5% (or 0.05) with the calculated two tail p-value of 0.3393282.
Given that the two tail p-value (0.3393282) > α (0.05), we will accept H0 and reject H1.

❉ Interpretation It can be concluded that the two population variances are not
significantly different at the 95% level of confidence.
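
The variance ratio test can be reproduced outside Excel from the two sample standard deviations. A minimal Python sketch, assuming SciPy is installed; because sA and sB are quoted to two decimal places, the results differ from the worksheet values in the later digits.

# Cross-check of Example 6.13: F test for two population variances
from scipy import stats

s_a, n_a = 51.02, 18     # shop A sample standard deviation and size
s_b, n_b = 41.38, 25     # shop B sample standard deviation and size
df_a, df_b = n_a - 1, n_b - 1

F_cal = s_a**2 / s_b**2                      # approx. 1.52
p_upper = stats.f.sf(F_cal, df_a, df_b)      # upper tail area
p_two_tail = 2 * min(p_upper, 1 - p_upper)   # approx. 0.34
F_u = stats.f.ppf(1 - 0.05 / 2, df_a, df_b)  # approx. 2.386, as in Cell G22
F_l = stats.f.ppf(0.05 / 2, df_a, df_b)      # approx. 0.391, as in Cell G23
print(F_cal, p_two_tail, F_l, F_u)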

Excel solution using the critical test statistic, Fcri


The solution procedure is exactly the same as for the p-value except that we use the critical
test statistic value to make a decision:

1 State hypothesis

2 State test

3 Set level of significance (α = 0.05)

4 Extract test statistic


The calculated test statistic Fcal = 1.5197. Calculate the critical test statistic, Fcri. In the
example H1: σA² ≠ σB². The critical F values can be found from Excel by using the
F.INV.RT() function for the upper critical F statistic. Direction is not implied, therefore
we have a two tail test and our region of rejection is as shown in Figure 6.30.
From Excel the critical F values are the upper critical value FU = +2.3864801 (see Cell
G22) and the lower critical value FL = 0.3906518 (see Cell G23).

Note The upper (FU) and lower (FL) critical values for a two tail test can be calculated
using Excel as follows:

1. Upper critical F value = FU = F.INV.RT(significance level/2, df for largest variance, df for


smallest variance) or = F.INV(1-significance level/2, df for largest variance, df for smallest
variance)
2. Lower critical F value = FL = F.INV(significance level/2, df for largest variance, df for smallest
variance) or = 1/F.INV.RT(significance level/2, df for smallest variance, df for largest variance).

5 Make a decision
Does the test statistic lie within the region of rejection? Compare the calculated and
critical F values to determine which hypothesis statement (H0 or H1) to accept. The
calculation of Fcal yields a value of 1.5197, which lies outside the region of rejection
for H0. Given that Fcal (1.5197) lies between the lower critical value FL (0.3906518)
and the upper critical value FU (2.3864801), we will accept H0 and reject H1.

❉ Interpretation It can be concluded that, based upon the sample data collected,
we have evidence that the population variances are not significantly different at the 95%
level of confidence. In this case we would be reasonably happy to conduct the two sample
pooled t-test.

Figure 6.30 illustrates the relationship between the p-value, the F test statistic, and the
critical F statistic.

Figure 6.30 F distribution with dfA = 17 and dfB = 24: the two tail p-value = 0.34 > α (0.05), with Fcal = 1.51 lying between the lower and upper critical values FL = 0.39 and FU = 2.39.

Table 6.10 illustrates the alternative one tail hypothesis tests.

Hypothesis—upper one tail test: H0: σA² ≤ σB²; H1: σA² > σB²; with α = significance level
Hypothesis—lower one tail test: H0: σA² ≥ σB²; H1: σA² < σB²; with α = significance level

Table 6.10

Note The upper (FU) and lower (FL) critical values for a one tail test can be calculated
using Excel as follows:

1. Upper one tail F value = FU = F.INV.RT(significance level, df for largest variance, df for small-
est variance) or = F.INV(1-significance level, df for largest variance, df for smallest variance)
2. Lower one tail F value = FL = F.INV(significance level, df for largest variance, df for small-
est variance) or = 1/F.INV.RT(significance level, df for smallest variance, df for largest variance.

Excel Data Analysis solution

Example 6.14
Reconsider Example 6.13, but use the Data Analysis tool to undertake the analysis.

Figure 6.31 illustrates the application of the Excel Analysis ToolPak: F Test: Two Sample
for Variances (Select Data > Data Analysis> F Test: Two Sample for Variances).

Figure 6.31

We observe from Figure 6.32 that the relevant results agree with the previous results.

Figure 6.32

❉ Interpretation It can be concluded that, based upon the sample data collected, we
have evidence that the population variances are not significantly different at the 5% level of
significance. In this case we would be reasonably happy to conduct the two sample pooled t-test.

Note In the Data Analysis solution for the F test the Excel solution is for one tail only.
If you have a two tail hypothesis F test then compare the one tail p-value from Figure 6.32
(0.16966411) with the significance level divided by 2 (= α/2).

Student exercises
X6.21 In Exercise X6.15 we assumed that the two population variances are equal. Conduct an
appropriate test to check if the variances are equal (test at 5% and 1%).
X6.22 In Exercise X6.16 we assumed that the two population variances are equal. Conduct an
appropriate test to check if the variances are equal (test at 5%).

6.10 Calculating the size of the type II error and the statistical power
In section 6.1.9 we stated that the term alpha (α) is the probability that you will reject a
true null hypothesis and is called a type I error. The value is usually taken as either 5% or
1%. Beta (β) is also an error rate that represents the probability that you will reject a true
alternative hypothesis and is called a type II error. If the value of β was found to be 23%
then you would incorrectly reject a true alternative hypothesis 23% of the time. The term
statistical power represents the probability that you will reject a false null hypothesis and
therefore accept a true alternative hypothesis. Based on these definitions we can write the
following equation.

Statistical power = 1 − β (6.16)

If β was equal to 23% then the statistical power = 1 − 0.23 = 77% and we would con-
clude that we would accept a true alternative hypothesis 77% of the time.

Example 6.15
Let us illustrate the calculation of the type II error (β) and the statistical power via a simple
example. Consider the problem of estimating the spend on a particular type of sweet per day
where, historically, the average spend is £19.44 per day with a standard deviation of £6.23. The
shop would like to check whether or not the current spending per day is still £19.44 and they
have decided to collect a sample on a particular day of size 32, which results in average sample
spend of £23.40.

The shop, after consultation with an analyst, decides to conduct a one sample t-test, but they
would like to know how confident they can be in the outcome of applying this test to the data.
In other words, what is the probability of accepting a true alternative hypothesis or rejecting
a false null hypothesis?
Figure 6.33 illustrates a pictorial representation of the solution process.

Figure 6.33 Distribution A, centred on the hypothesized mean of 19.44, with the critical value Xcri cutting off α/2 = 0.025 in each tail; distribution B, centred on the sample mean of 23.40, showing the type II error β to the left of Xcri and the statistical power to its right.

Excel solution—Example 6.15


Figure 6.34 illustrates the three steps involved in calculating the type II error (β) and the
size of the test’s statistical power for the data given in Example 6.15.

Figure 6.34

➜ Excel solution
μ = Cell D6 Value
σ = Cell D7 Value
Significance = Cell D8 Value
Xbar = Cell D10 Value
n = Cell D11 Value
df = Cell D13 Formula: = D11−1
t = Cell D14 Formula: = (D10−D6)/(D7/SQRT(D11))
Step 1: calculate Xcri in distribution A with H0: μ = 19.44
tcri = Cell D18 Formula: = T.INV(1−D8/2,D13)
Xcri = Cell D19 Formula: = D6+D18*(D7/SQRT(D11))
Step 2: calculate the value of β where XB = Xcri under distribution B, centred on the sample mean of 23.40
tβ = Cell D23 Formula: = (D19−D10)/(D7/SQRT(D11))
β = Cell D24 Formula: = T.DIST(D23,D13,TRUE)
Step 3: calculate statistical power
Power = Cell D28 Formula: = 1−D24

❉ Interpretation From Excel, the value of the statistical power = 0.935091117 or 94%.
This high value of the statistical power (or just power) of 94% indicates that the one sample
t-test is highly likely to detect the effect or reject the null hypothesis that the population mean
is £19.44.
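
The three steps can be reproduced outside Excel. A minimal Python sketch, assuming SciPy is installed, with the values from Example 6.15 (μ = 19.44, standard deviation 6.23, n = 32, sample mean 23.40, α = 0.05); it reproduces the power of approximately 0.935.

# Cross-check of Example 6.15: type II error (beta) and statistical power
from math import sqrt
from scipy import stats

mu0, s, alpha = 19.44, 6.23, 0.05    # hypothesized mean, standard deviation, alpha
xbar, n = 23.40, 32                  # observed sample mean and sample size
df, se = n - 1, s / sqrt(n)

t_cri = stats.t.ppf(1 - alpha / 2, df)   # Step 1: critical t under H0
x_cri = mu0 + t_cri * se                 # critical sample mean, approx. 21.69

t_beta = (x_cri - xbar) / se             # Step 2: position of Xcri under distribution B
beta = stats.t.cdf(t_beta, df)           # type II error, approx. 0.065

power = 1 - beta                         # Step 3: statistical power, approx. 0.935
print(beta, power)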

Note It is important to note that the value of β and the statistical power is not given by
an Excel function.

■ Techniques in practice
1. Concerned at the time taken to react to customer complaints, CoCo S. A. has imple-
mented a new set of procedures for its support centre staff. The customer service director has
directed that a suitable test is applied to a new sample to assess whether the new target mean
time for responding to customer complaints is 28 days. Table 6.11 illustrates the data collected
by the customer service director.

20 33 33 29 24 30
40 33 20 39 32 37
32 50 36 31 38 29
15 33 27 29 43 33
31 35 19 39 22 21
28 22 26 42 30 17
32 34 39 39 32 38

Table 6.11
(a) Describe the test to be applied with stated assumptions.
(b) Conduct the required test to assess whether evidence exists for the mean time to respond
to complaints to be greater than 28 days.
(c) What would happen to your results if the population mean time to react to customer
complaints changes to 30 days?

2. Bakers Ltd are currently undertaking a review of the delivery vans used to deliver products
to customers. The company runs two types of delivery van (type A, recently purchased, and
type B, at least three years old) which are supposed to be capable of achieving 20 km per litre
of petrol. A new sample has now been collected, as given in Table 6.12.

A B A B
17.68 15.8 26.42 34.8
18.72 36.1 25.22 16.8
26.49 6.3 13.52 15.0
26.64 12.3 14.01 28.9
9.31 15.5 33.9
22.38 40.1 27.1
20.23 20.4 16.8
28.80 3.7 23.6
17.57 13.6 29.7
9.13 35.1 28.2
20.98 33.3

Table 6.12

(a) Assuming that the population distance travelled varies as a normal distribution, is there any
evidence to suggest that the two types of delivery vans differ in the mean distances travelled?
(b) Based upon your analysis, is there any evidence that the new delivery vans meet the
mean average of 20 km per litre?

3. Skodel Ltd is developing a low calorie lager for the European market with a mean
designed calorie count of 43 calories per 100 ml. The new product development team are
having problems with the production process and have collected two independent random
samples to assess whether the target calorie count is being met (assume the population vari-
ables are normally distributed), as presented in Table 6.13.

A B A B
49.7 39.4 45.2 34.5
45.9 46.5 40.5 43.5
37.7 36.2 31.9 37.8
40.6 46.7 41.9 39.7
34.8 36.5 39.8 41.1
51.4 45.4 54.0 33.6
34.3 38.2 47.8 35.8
63.1 44.1 26.3 44.6
41.2 58.7 31.7 38.4
41.4 47.1 45.1 26.1
41.1 59.7 47.9 30.7

Table 6.13

(a) Describe the test to be applied with stated assumptions.


(b) Is the production process achieving a similar mean number of calories?
(c) Is it likely that the target average number of calories is being achieved?

■ Summary
In this chapter we have provided an introduction to the important statistical concept of para-
metric hypothesis testing for one and two samples. What is important in hypothesis testing is
that you are able to recognize the nature of the problem and should be able to convert this into
two appropriate hypothesis statements (H0 and H1) that can be measured.
If you are comparing more than two samples then you would need to employ advanced
statistical parametric hypothesis tests. These tests are called analysis of variance (ANOVA),
which are described in the online workbook ‘Factorial experiments’.
In this chapter we have described a simple five-step procedure to aid the solution process
and have focused on the application of Excel to solve the data problems. The main emphasis
is placed on the use of the p-value, which quantifies how likely the sample result would be if
the null hypothesis (H0) were true. Thus, if the measured p-value > α (alpha) then we would
accept H0 and conclude that the result is not statistically significant. Remember that the value
of the p-value will depend on whether we are dealing with a two or one tail test, so take extra
care with this concept as this is where most students slip up.
The second part of the decision-making described the use of the critical test statistic in mak-
ing decisions. This is the traditional textbook method which uses published tables to provide
estimates of critical values for various test parameter values. The final method, and perhaps the
main Excel method you will use, is to employ the Data Analysis method available within Excel.
The focus of parametric tests is that the underlying variables are at the interval/ratio level of
measurement and the population being measured is distributed as a normal or approximately
normal distribution. In the next chapter we shall explore how we undertake hypothesis testing
for variables that are at the nominal or ordinal level of measurement by exploring the concept
of the chi-square and non-parametric tests.

■ Key Terms
Alpha
Alternative hypothesis
Beta, β
Central Limit Theorem
Critical test statistic
F distribution
F test
F test for two population variances (variance ratio test)
Hypothesis test procedure
Level of significance
Lower one tail test
Mann–Whitney U test
Non-parametric
Null hypothesis
One sample t-test
One sample test
One sample z-test for the population mean
One tail tests
Parametric
P-value
Region of rejection
Robust test
Significance level, α
Statistical power
Two sample t-test for population mean (dependent or paired samples)
Two sample t-test for population mean (independent samples, unequal variances)
Two sample t-test for the population mean (independent samples, equal variance)
Two sample tests
Two sample z-test for the population mean
Two sample z-test for the population proportion
Two tail test
Type I error
Type II error
Upper one tail test

■ Further Reading
Textbook Resources
1. Whigham, D. (2007) Business Data Analysis using Excel. Oxford: Oxford University Press.
2. Lindsey, J. K. (2003) Introduction to Applied Statistics: A Modelling Approach (2nd edn).
Oxford: Oxford University Press.

Web Resources
1. StatSoft Electronic Textbook http://www.statsoft.com/textbook/stathome.html (accessed 25
May 2012).
2. HyperStat Online Statistics Textbook http://davidmlane.com/hyperstat/index.html (accessed
25 May 2012).
3. Eurostat—website is updated daily and provides direct access to the latest and most com-
plete statistical information available on the European Union (EU), the EU Member States,
the Euro-zone and other countries http://epp.eurostat.ec.europa.eu (accessed 25 May
2012).
4. Economagic—contains international economic data sets http://www.economagic.com
(accessed 25 May 2012).
5. The International Statistical Institute (ISI) glossary of statistical terms provides definitions
in a number of different languages http://isi.cbs.nl/glossary/index.htm (accessed 25 May
2012).
7 Chi-square and non-parametric hypothesis testing

» Overview «
In Chapter 6 we explored a series of parametric tests to assess whether the differences between
means (or variances) are statistically significant. Within parametric tests we sample from a
distribution with a known parameter value, for example the population mean (μ), variance (σ2), or
proportion (π). The techniques described were defined by three assumptions: (i) the underlying
population being measured varies as a normal distribution; (ii) the level of measurement is of
equal interval or ratio scaling; and (iii) the population variances are equal. Unfortunately, we will
come across data that does not fit these assumptions.
How do we measure the difference between the attitudes of people surveyed in assessing
their favourite car, when the response by each person is of the form: 1, 2, 3 . . . n? In this
situation we have ordinal data in which taking differences between the numbers (or ranks) is
meaningless. Furthermore, if we are asking for opinions where the opinion is of a categorical
form (e.g. strongly agree, agree, do not agree) then the concept of difference is, again,
meaningless. The responses are words not numbers, but you can, if you so wish, solve this
problem by allocating a number to each response, with 1 for strongly agree, 2 for agree, and
so on. This gives you a rating scale of responses, but remember that the opinions of people are
not quite the same as measuring time or measuring the difference between two times. Can we
say that the difference between strongly agree and agree is the same as the difference between
agree and disagree? Another way of looking at this problem is to ask the question: can we say
that a rank of 5 is five times stronger than a rank of 1?
This chapter will provide an overview of the chi-square (χ2, where χ is the Greek letter chi)
and non-parametric tests that can be used when parametric methods are not appropriate.

» Learning objectives «
On completing this unit you will be able to:

» apply the chi-square test to test for association between categorical variables;

» apply the chi-square test to measure the difference between two proportions from two
independent samples;

Key term: Rank. List data in order of size.

» apply the chi-square test to measure the difference between two proportions from two
dependent samples;

» apply the chi-square goodness-of-fit test;

» apply the sign test to one sample;

» apply the Wilcoxon signed rank test to two paired samples;

» apply the Mann–Whitney U test to two independent samples;

» solve problems using Microsoft Excel.

7.1 Chi-square tests


The chi-square test is a versatile and widely used test that applies to categorical (or
nominal, or qualitative) data, where responses cannot be stated as numbers, for
example 'Yes', 'No', 'Red', and 'No agreement'. In Chapter 1 we explored tabulating
such data and charting it via bar and pie charts; there we were dealing with the pro-
portions that fall into each of the categories being measured, which form a sample from the
population of all possible responses. In Chapter 6 we explored the situation of comparing
two proportions where we assumed that the underlying population distribution is normal
or approximately normal. In many cases the variable is descriptive, and in these cases the
chi-square test can be used.
To illustrate how this test can be used imagine you are undertaking a small survey for
your dissertation. Part of your overall research project is to establish if students' attitudes
towards work ethics change as they progress through their studies. To establish this, you
interview a group of first year students and a group of final year students and ask them
certain questions to illuminate this problem. The results are tabulated in a simple table,
where the rows represent first year and final year students, and the columns represent their
attitudes (a scale such as: strongly agree, agree, disagree, etc.). Once you have such a table
constructed, you can establish if the maturity of students is in some way linked with their
views on work ethics. The chi-square test can be used to test this claim of an association
between student maturity and their views on work ethics.

Key term: Chi-square test of association. The chi-square test of association provides a
method for testing the association between the row and column variables in a two-way
table, where the null hypothesis H0 assumes that there is no association between the
variables.
Within this chapter we will explore a series of methods that make use of the chi-square
distribution to make inferences about two or more proportions. For the chi-square test
we require that the independent observations are capable of being classified into a num-
ber of separate categories and that the probability of lying in each category can be calcu-
lated when the appropriate null hypothesis is assumed to be true. This section will explore
four applications of the chi-square test:

1. Undertake a chi-square test of association that is popular with students in analysing
survey data
2. Undertake a chi-square test for independent samples
3. Undertake a chi-square test for dependent samples
4. Undertake a chi-square test for goodness-of-fit.

Key term: Chi-square test for independent samples. The Pearson chi-square test is a
non-parametric test for a difference in proportions between two or more independent
samples.

Key term: Chi-square test for goodness-of-fit. The chi-square goodness-of-fit test describes
how well a statistical model fits a set of observations.

7.1.1 Chi-square test of association


The chi-square test of association is a hypothesis test used to determine whether the fre-
quency occurrence for two category variables (or more) is significantly related (or asso-
ciated) to each other. The null hypothesis states that the row and column variables (the
student maturity and their attitudes, as described in the previous section, for example)
are not associated. It can be shown that if the null hypothesis is true then the expected
frequencies (E) can be calculated using equation (7.1).

E = (Row Total × Column Total) / Total Sample Size    (7.1)

To test the null hypothesis we would compare the expected cell frequencies with the
observed cell frequencies and calculate the chi-square test statistic given by equation
(7.2):

χ²cal = Σ (O − E)² / E    (7.2)

The chi-square test statistic in equation (7.2) enables a comparison to be made between the
observed frequencies (O) and the expected frequencies (E) calculated via equation (7.1).
Equation (7.1) tells us what the expected frequencies would be if there were no association
between the two categorical variables, for example, gender and course. If the values are close to one
two categorical variables, for example, gender and course. If the values are close to one
another then this provides evidence that there is no association, and, conversely, if we find
large differences between the observed and expected frequencies then we have evidence
to suggest an association does exist between the two categorical variables: gender and
course. Statistical hypothesis testing allows us to confirm whether the differences are likely
to be statistically significant. The chi-square distribution varies in shape with the number
of degrees of freedom and thus we need to find this value before we can look up the appro-
priate critical values. The number of degrees of freedom (df) is given by equation (7.3).

df = (r − 1) * (c − 1) (7.3)

Where r = number of rows and c = number of columns. The region of rejection is then
identified using either the p-value method or the critical test statistic method.
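As a complement to the Excel functions used in this section, equations (7.1)–(7.3) can be
checked with a few lines of code. The following minimal Python sketch (the 3*2 table of
counts is invented purely for illustration, and the scipy library is assumed to be available)
computes the expected frequencies, the chi-square statistic, the degrees of freedom, and the
upper tail p-value.

import numpy as np
from scipy.stats import chi2

# Hypothetical 3*2 contingency table of observed frequencies (rows * columns)
O = np.array([[30.0, 20.0],
              [25.0, 35.0],
              [15.0, 25.0]])

row_totals = O.sum(axis=1, keepdims=True)
col_totals = O.sum(axis=0, keepdims=True)
n = O.sum()

# Equation (7.1): E = row total * column total / total sample size
E = row_totals @ col_totals / n

# Equation (7.2): chi-square test statistic
chi2_cal = ((O - E) ** 2 / E).sum()

# Equation (7.3): degrees of freedom
df = (O.shape[0] - 1) * (O.shape[1] - 1)

# Upper tail p-value, the equivalent of Excel's CHISQ.DIST.RT()
p_value = chi2.sf(chi2_cal, df)
print(chi2_cal, df, p_value)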

Example 7.1
Suppose a university sampled 485 of its students to determine whether males and females dif-
fered in preference for five courses offered. The question we would like to answer is to confirm
whether or not we have an association between the courses chosen and the person's gender.
In this case we have two attributes, gender and course, both of which have been divided into
categories: two for gender and five for course. The resulting table is called a 5*2 contingency
table because it consists of 5 rows and 2 columns. To determine whether gender and course
preference are associated (or independent) we conduct a chi-square test of association on the
contingency table.

Key term: Test statistic. A test statistic is a quantity calculated from our sample of data.

Key term: Contingency table. A contingency table is a table of frequencies classified
according to the values of the variables in question.

Figures 7.1 and 7.2 illustrate the Excel solution.

Step 1: Calculate the expected frequencies and the chi-square ratio (O − E)²/E

Figure 7.1

➜ Excel solution
Data series Cells D5:E9 Values
Number of males Cell D10 Formula: =SUM(D5:D9)
Number of Females Cell E10 Formula: =SUM(E5:E9)
Number of A101 Cell F5 Formula: =SUM(D5:E5)
Number of D102 Cell F6 Formula: =SUM(D6:E6)
Number of M101 Cell F7 Formula: =SUM(D7:E7)
Number of S101 Cell F8 Formula: =SUM(D8:E8)
Number of T101 Cell F9 Formula: =SUM(D9:E9)
Grand Total Cell F10 Formula: =SUM(D10:E10)
Expected frequencies Cell D15 Formula: =$F5*D$10/$F$10
Copy formula down and across D15:E19
(O−E)2/E Cell F15 Formula: =(D5−D15)^2/D15
Copy formula down and across F15:G19

Step 2: Calculate the chi-square statistic

Figure 7.2

➜ Excel solution
Significance level = Cell K12 Value =0.05
χ 2cal = Cell K15 Formula: =SUM(F15:G19)
r = Cell K18 Formula: =COUNT(D5:D9)
c = Cell K19 Formula: =COUNT(D5:E5)
df = Cell K20 Formula: =(K18−1)*(K19−1)
Critical χ 2 = Cell K21 Formula: =CHISQ.INV.RT(K12,K20)
P-value = Cell K22 Formula: =CHISQ.DIST.RT(K15,K20)
or p-value using CHISQ.DIST = Cell K24 Formula: =1−CHISQ.DIST(K15,K20,TRUE)
or p-value using CHISQ.TEST = Cell K26 Formula: =CHISQ.TEST(D5:E9,D15:E19)

Excel solution using the p-value


1 State hypothesis
H0: gender and course preference are not associated (or independent).
H1: there is an association between gender and course preference (or dependent).

2 Select test
• Number of samples—two category data variables (gender and course). The
sample data are randomly selected and represented as frequency counts
within the contingency table.
• The statistic we are testing—testing for an association between the two category
data variables.

Apply a chi-square test of association to the sample data.

3 Set the level of significance (α) = 0.05 (Cell K12)

4 Extract relevant statistic


Expected frequencies
From equation (7.1) we can calculate the expected frequencies (see Cells D15:E19).
Calculate test statistic
From Excel, the ratio in equation (7.2) is calculated and displayed in Cells F15:G19,
with the sum χ²cal = 63.3562 (see Cell K15).
Calculate the number of degrees of freedom (df )
The number of df is given by equation (7.3): df = (r − 1)*(c − 1) = 4 * 1 = 4 (see Cell
K20).
Identify rejection region
Identify a region of rejection using the p-value method. The p-value can be found
from Excel by using the CHISQ.DIST.RT(), or CHISQ.DIST(), or CHISQ.TEST()
functions. From Excel, the p-value = 5.7E-13 (see Cell K22). Does the test statistic
lie within the region of rejection? Compare the chosen significance level (α) of 5%
(or 0.05) with the calculated p-value of 5.7E-13.

5 Make a decision
In Figure 7.2 we observe that the p-value <α (5.7E-13 <0.05) and conclude that we
reject H0 and accept H1.

❉ Interpretation It can be concluded that there is a significant relationship or


association between the category variables gender and course preference. The chi-square
table indicates that the main contributors to the chi-square values are A101 and T101,
whereas S101 has very little contribution. This table would indicate that (i) fewer men opt for
A101, whereas fewer women opt for T101, and that (ii) men tend towards M101 and T101,
whereas women prefer A101 and D102.

Figure 7.3 illustrates the relationship between the p-value and test statistic.

Figure 7.3 Chi-square distribution (test of association) with df = 4: the test statistic χ² = 63.36
lies in the upper tail, and the p-value is the tail area to its right, given in Excel by
CHISQ.DIST.RT(63.36, 4).

Note
For the chi-square test to give meaningful results:

(i) Cochran suggests that at least 80% of the expected frequencies should be >5
(ii) if we have one degree of freedom then the expected frequencies should be >10
(iii) the test requires the assumption that the discrete probability of observed frequencies can
be approximated by the continuous chi-square distribution.

To reduce the associated errors, we can apply Yates' correction for continuity given by
equation (7.4); a short code sketch follows.

χ²cal = Σ (|O − E| − 0.5)² / E    (7.4)
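The following minimal Python sketch applies equation (7.4) to a hypothetical 2*2 table (the
counts are invented for illustration and are not from the textbook examples); the only change
from equation (7.2) is the 0.5 adjustment applied to each |O − E| before squaring.

import numpy as np
from scipy.stats import chi2

# Hypothetical 2*2 table of observed frequencies (invented for illustration)
O = np.array([[18.0, 12.0],
              [10.0, 20.0]])

# Expected frequencies from equation (7.1)
E = O.sum(axis=1, keepdims=True) @ O.sum(axis=0, keepdims=True) / O.sum()

# Equation (7.4): Yates' continuity-corrected chi-square statistic
chi2_yates = ((np.abs(O - E) - 0.5) ** 2 / E).sum()

# df = (2 - 1)*(2 - 1) = 1 for a 2*2 table
p_value = chi2.sf(chi2_yates, df=1)
print(chi2_yates, p_value)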
Key term: Expected frequency. In a contingency table the expected frequencies are the
frequencies that you would predict in each cell of the table, if you knew only the row and
column totals, and if you assumed that the variables under comparison were independent.

Excel solution using the critical test statistic

The solution procedure is exactly the same as for the p-value except that we use the critical
test statistic value to make a decision. The calculated test statistic χ²cal = 63.3562 (see Cell
K15). Calculate the critical test statistic, χ²cri. The critical value can be found from Excel by
using the CHISQ.INV.RT() function, χ²cri = 9.4877 (see Cell K21). Does the test statistic lie
within the region of rejection? Compare the calculated and critical χ² values to determine
which hypothesis statement (H0 or H1) to accept. In Figure 7.3 we observe that χ²cal lies in
the upper rejection zone (63.3562 > 9.49).

❉ Interpretation It can be concluded that there is a significant relationship (or


association) between the category variables gender and course preference.
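If you want to reproduce the two decision numbers outside Excel, the short Python sketch
below (an illustration, not part of the textbook's Excel workflow) recomputes the critical value
and the p-value from the reported test statistic χ²cal = 63.3562 with df = 4; here chi2.ppf and
chi2.sf play the roles of CHISQ.INV.RT() and CHISQ.DIST.RT().

from scipy.stats import chi2

chi2_cal, df, alpha = 63.3562, 4, 0.05

# Equivalent of Excel's CHISQ.INV.RT(0.05, 4)
critical_value = chi2.ppf(1 - alpha, df)   # approx. 9.4877

# Equivalent of Excel's CHISQ.DIST.RT(63.3562, 4)
p_value = chi2.sf(chi2_cal, df)            # approx. 5.7E-13

print(critical_value, p_value, chi2_cal > critical_value)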

Student exercises
X7.1 A business consultant requests that you perform some preliminary calculations before
analysing a data set using Excel.
(a) Calculate the number of degrees of freedom for a contingency table with three
rows and four columns.
(b) Find the upper tail critical χ2 value with a significance level of 5% and 1%. What
Excel function would you use to find this value?
(c) Describe how you would use Excel to calculate the test p-value. What does the
p-value represent if the calculated chi-square test statistic equals 8.92?
X7.2 A trainee risk manager for an investment bank has been told that the level of risk is
related directly to the industry type (manufacturing, retail, and financial). For the data
presented in the contingency table (Table 7.1), analyse whether or not perceived risk is
dependent upon the type of industry identified (assess at 5%). If the two variables are
associated then what is the form of the association?

Level of risk Industrial class


Manufacturing Retail Financial
Low 81 38 16
Moderate 46 42 33
High 22 26 29

Table 7.1

X7.3 A manufacturing company is concerned at the number of defects produced by


the manufacture of office furniture. The firm operates three shifts and has classified
the number of defects as low, moderate, high, or very high (Table 7.2). Is there any
evidence to suggest a relationship between types of defect and shifts (assess at 5%)? If
the two variables are associated then what is the form of the association?

Shift Defect type


Low Moderate High Very high
1 29 40 91 25
2 54 65 63 8
3 70 33 96 38

Table 7.2

X7.4 A local trade association is concerned at the level of business activity within the
local region. As part of a research project a random sample of business owners
were surveyed on how optimistic they were for the coming year. Based upon the
contingency Table 7.3 do we have any evidence to suggest different levels of optimism
for business activity (assess at 5%)? If the two variables are associated then what is the
form of the association?

Optimism level Type of business


Bankers Manufacturers Retailers Farmers
High 38 61 59 96
No change 16 32 27 29
Low 11 26 35 41

Table 7.3

X7.5 A group of students at a language school volunteered to sit a test that is to be undertaken
to assess the effectiveness of a new method to teach German to English-speaking
students. To assess effectiveness students sit two different tests with one test in English
and the other test in German. Is there any evidence to suggest that the student test
performances in English are replicated by their test performances in German (Table 7.4;
assess at 5%)? If the two variables are associated then what is the form of the association?

German English
≥60% 40–59% < 40%
≥60% 90 81 8
40–59% 61 90 8
<40% 29 39 6

Table 7.4

7.1.2 Chi-square test for independent samples


In Chapter 6 we explored the application of the z test to solve problems involving two pro-
portions. If we are concerned that the parametric assumptions are not valid then we can
use the chi-square test to test two independent proportions or apply the McNemar chi-
square test (or z test), which uses a normal approximation for two dependent samples.
With the chi-square test for independent samples we have two samples that involve
counting the number of times a categorical choice is chosen. In this situation we can
develop a cross tab (or contingency) table to display the frequency that each possible
value was chosen.

Example 7.2
To illustrate the concept consider the example of a firm that surveys whether or not employees
use the train to travel to work. The firm collects the data and has created a 2*2 contingency
table (see Table 7.5) to summarize the responses for the people at work on two particular days.

Monday Wednesday
Take train to work 89 76
Do not take train to work 64 88

Table 7.5 Employee travel intentions

The question is now whether or not we have a significant difference between the proportions
of Monday and Wednesday employees who travel to work by train.
Figures 7.4 and 7.5 illustrate the Excel solution.

Figure 7.4

➜ Excel solution
Data series: Cells C6:D7
Sum row 1 Cell E6 Formula: =SUM(C6:D6)
Sum row 2 Cell E7 Formula: =SUM(C7:D7)
Sum column 1 Cell C8 Formula: =SUM(C6:C7)
Sum column 2 Cell D8 Formula: =SUM(D6:D7)
Grand total = Cell E8 Formula: =SUM(E6:E7)
Expected frequencies Cell C13 Formula: =$E6*C$8/$E$8
Copy formula down and across C13:D14
(O−E)^2/E Cell E13 Formula: =(C6−C13)^2/C13
Copy formula down and across E13:F14

Figure 7.5

➜ Excel solution
Significance level Cell K13 Value =0.05
p Cell K16 Formula: =E6/E8
χ2cal = Cell K17 Formula: =SUM(E13:F14)
r = Cell K18 Formula: =COUNTA(B6:B7)
c = Cell K19 Formula: =COUNTA(C5:D5)
df = Cell K20 Formula: =(K18−1)*(K19−1)
χ2cri = Cell K21 Formula: =CHISQ.INV.RT(K13,K20)
P-value = Cell K22 Formula: =CHISQ.DIST.RT(K17,K20)
or p-value using CHISQ.DIST = Cell K24 Formula: =1−CHISQ.DIST(K17,K20,TRUE)
or p-value using CHISQ.TEST = Cell K25 Formula: =CHISQ.TEST(C6:D7,C13:D14)

In general, the 2*2 contingency table can be structured as illustrated in Table 7.6.

                            Column variable
                            1            2            Totals
Row variable    1           n1           n2           N
                2           t1 − n1      t2 − n2      T − N
Totals                      t1           t2           T

Table 7.6

From this table we can estimate the proportion (or probability) that employees will use
the train by calculating the overall proportion (ρ) using equation (7.5).

ρ = (n1 + n2) / (t1 + t2) = N / T    (7.5)

We can now use this estimate to calculate the expected frequency (E) for each cell
within the contingency table by multiplying the column total by ρ for the cells linked to
travelled by train and (1−ρ) for those cells who did not travel by train using equation (7.6).

E = ρ * column total (7.6)

Calculate the chi-square test statistic to compare the observed and expected frequen-
cies using equation (7.2).

χ²cal = Σ (O − E)² / E

Where O = observed frequency, E is the expected frequency calculated if the null
hypothesis is true, and the number of df = (r − 1)(c − 1) = 1. In this case we would expect
the proportion of employees taking the train to be exactly the same on both days; this fact
can then be used to calculate the expected frequency. Given that the expected frequencies
can be calculated, we can calculate the chi-square test statistic and compare this value with
a critical value to make a decision.

Key term: Observed frequency. In a contingency table the observed frequencies are the
frequencies actually obtained in each cell of the table, from our random sample.

Excel solution using the p-value


1 State hypothesis
Given that the population proportions are π1 and π2 then the null and alternative
hypothesis are as follows:
H0: π1 = π2 (proportions travelling by train on the two days is the same).
H1: π1 ≠ π2 (proportions different).
Two tail test.

2 Select the test


Two independent samples.
Categorical data.
Chi-square test for the difference between two proportions.

3 Set the level of significance (α = 0.05) (see Cell K13)

4 Extract relevant statistic


Proportion travelling by train (ρ)
From Excel, ρ = 0.52 (see Cell K16).
Calculate expected frequencies
Calculate the expected frequencies for each cell using equation (7.6), for example,
for train on Monday the expected frequency would be 153*0.520505 = 79.6372 (see
Cell C13). Repeat this calculation for the other cells within the contingency table (see
Cells C13:D14).
Calculate the chi-square test statistic
For each cell we now need to calculate the ratio (O−E)^2/E located in Cells E13:F14
given in equation (7.2) and sum to give the χ2cal test statistic:

χ²cal = Σ (O − E)² / E = 4.437356 (Cell K17)

Calculate the p-value


Identify region of rejection using the p-value method. The p-value can be found
from Excel by using the CHISQ.DIST.RT() function. From Excel, the two tail
p-value = 0.035161 (see Cell K22). Does the test statistic lie within the region of
rejection? Compare the chosen significance level (α) of 5% (or 0.05) with the
calculated p-value of 0.035161.

5 Make a decision
We observe that the p-value < α (0.035161 < 0.05), and we conclude that we reject H0
and accept H1.

❉ Interpretation It can be concluded that there is a significant difference in the


proportions travelling by train on Monday and Wednesday.

Note If you decided that the significance level is 1% (0.01), then we would have a
reverse decision given that the two tail p-value > α (0.035161 > 0.01). In this case we would
accept H0 and reject H1. This is an example of modifying your decision based upon how
confident you would like to be with your overall decision.

Excel solution using the critical test statistic


The solution procedure is exactly the same as for the p-value except that we use the criti-
cal test statistic value to make a decision. The calculated test statistic χ²cal = 4.437356 (see
Cell K17). Calculate the critical test statistic, χ²cri. The critical value can be found from Excel
by using the CHISQ.INV.RT() function, χ²cri = 3.841459 (see Cell K21). Does the test sta-
tistic lie within the region of rejection? Compare the calculated and critical χ² values to
determine which hypothesis statement (H0 or H1) to accept. We observe that χ²cal lies in the
region of rejection (4.437356 > 3.841459), and we reject H0 and accept H1.

❉ Interpretation It can be concluded that there is a significant difference in the


proportions travelling by train on Monday and Wednesday.

Note
1. For the chi-square test to give meaningful results the expected frequency for each cell in the
2*2 contingency table is required to be at least 5. If this is not the case then the chi-square
distribution is not a good approximation to the ratio (O−E)2/E. In this situation, we can use
Fisher’s test, which provides an exact p-value.
2. In the example you may have noticed that the frequency counts are discrete variables which
are mapped onto the continuous chi-square distribution. In this case we need to apply the
Yates’ correction for continuity given by equation (7.4).
3. In Section 6.5 we compared two sample proportions using a normal approximation. When
we have one degree of freedom we can show that there is a simple relationship between the
value of χ²cal and the corresponding value of Zcal, given by χ²cal = (Zcal)².
4. If we are interested in testing for direction in the alternative hypothesis (e.g. H1: π1 > π2) then
you cannot use a chi-square test but will have to undertake a normal distribution Z test to
test for direction.

The two proportion solution can be extended to more than two proportions, but this is
beyond the scope of this text.
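As a cross-check on Example 7.2 (an illustrative Python sketch, not the textbook's Excel
method), the code below applies equations (7.5), (7.6), and (7.2) to the Table 7.5 counts and
reproduces χ²cal ≈ 4.437356 and the two tail p-value ≈ 0.035161.

import numpy as np
from scipy.stats import chi2

# Observed frequencies from Table 7.5 (rows: train / no train; columns: Monday, Wednesday)
O = np.array([[89.0, 76.0],
              [64.0, 88.0]])

col_totals = O.sum(axis=0)
T = O.sum()

# Equation (7.5): overall proportion travelling by train
rho = O[0].sum() / T                    # approx. 0.520505

# Equation (7.6): expected frequencies, row by row
E = np.vstack([rho * col_totals, (1 - rho) * col_totals])

# Equation (7.2): chi-square test statistic; df = 1 for a 2*2 table
chi2_cal = ((O - E) ** 2 / E).sum()     # approx. 4.437356
p_value = chi2.sf(chi2_cal, df=1)       # approx. 0.035161
print(chi2_cal, p_value)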

7.1.3 McNemar’s test for matched (or dependent) pairs


The previous test explored the application of the chi-square test to compare two propor-
tions taken from random independent samples. If you have paired samples (or dependent
samples) then we can use McNemar’s test to compare two proportions.

Key term: McNemar test. McNemar's test is a non-parametric method used on nominal
data to determine whether the row and column marginal frequencies are equal.

Example 7.3
Consider the problem of estimating the effectiveness of a political campaign on the voting
patterns of a group of voters. Two groups of voters are selected at random and their voting
intentions (drop carbon dioxide (CO2), tax) for a local election recorded. Both groups are then
subjected to the same campaign and their voting intentions recorded. The question that arises
is whether or not the campaign was effective on the voting intentions of the voters. In this case
we have two groups who are recorded before and after, and we recognize that we are dealing
with paired samples. To solve this problem we can use McNemar’s test for two sets of nominal
data that are randomly selected. Table 7.7 contains the outcome of the voting intentions before
and after the campaign.

                  After
Before            Drop CO2     Tax
Drop CO2          287          89
Tax               45           200

Table 7.7

The question is whether the political campaign has been successful on ‘drop CO2’ vot-
ers and ‘tax’ voters, who both received the same marketing campaign. To simplify the
problem we shall look at whether or not the proportion voting ‘drop CO2’ has changed
significantly.
H0: proportion voting for ‘drop CO2’ not changed.
H1: proportion voting for ‘drop CO2’ changed.
In terms of notation this can be written as: H0: π1 = π2, H1: π1 ≠ π2, where π1 = popula-
tion proportion voting ‘drop CO2’ before campaign and π2 = population proportion voting
‘drop CO2’ after campaign.

Note Remember that the other hypothesis is whether or not the proportions voting
for the ‘tax’ party are the same before and after the campaign.

In general, the 2*2 contingency table can be structured as illustrated in Table 7.8.

                              Column variable (after)
                              Drop CO2     Tax          Totals
Row variable      Drop CO2    a            b            a + b
(before)          Tax         c            d            c + d
Totals                        a + c        b + d        N

Table 7.8

From this table we observe that the sample proportions are given by equations (7.7)
and (7.8):

ρ1 = (a + b) / N    (7.7)

ρ2 = (a + c) / N    (7.8)

This problem can be solved using either a z or chi-square test to test the difference
between the two proportions. It is important to note that the z test and chi-square test are
both applicable when dealing with two tail tests, but if your problem is directional (lower
or upper) then you can only use the z test.

1. McNemar Z test
To test the null hypothesis we can use the McNemar z test statistic defined by
equation (7.9), which is normally approximated:

Zcal = (b − c) / √(b + c)    (7.9)

2. McNemar chi-square test


To test the null hypothesis we can use the McNemar chi-square test statistic defined
by equation (7.10):

χ²cal = (b − c)² / (b + c)    (7.10)

For one degree of freedom the relationship between chi-square and Z is given by
χ²cal = (Zcal)². Figure 7.6 illustrates the Excel solution for the McNemar Z and
chi-square tests.

Figure 7.6

➜ Excel solution
Data series: Cells D6:E7
Sum row 1 Cell F6 Formula: =SUM(D6:E6)
Sum row 2 Cell F7 Formula: =SUM(D7:E7)
Sum column 1 Cell D8 Formula: =SUM(D6:D7)

Sum column 2 Cell E8 Formula: =SUM(E6:E7)


Grand total = Cell F8 Formula: =SUM(F6:F7)
Significance level Cell J13 Value =0.05
ρ1 = Cell J16 Formula: =(D6+E6)/F8
ρ2 = Cell J17 Formula: =(D6+D7)/F8
Zcal = Cell J18 Formula: =(E6−D7)/SQRT(E6+D7)
Lower Zcri = Cell J20 Formula: =NORM.S.INV( J13/2)
Upper Zcri = Cell J21 Formula: =NORM.S.INV(1−J13/2)
Two tail p-value = Cell J22 Formula: =2*(1−NORM.S.DIST(ABS( J18),TRUE))
χ2cal = Cell J24 Formula: =(E6−D7)^2/(E6+D7)
(Zcal)2 = Cell J25 Formula: =J18^2
χ2 p-value = Cell J26 Formula: =CHISQ.DIST.RT( J24,1)

Excel solution using the p-value

1 State hypothesis
Given that the population proportions are π1 and π2 then the null and alternative
hypothesis are as follows:
H0: π1 = π2.
H1: π1 ≠ π2.
Two tail test.
Where π1 represents the proportion voting ‘drop CO2’ before the campaign and π2
represents the proportion voting ‘drop CO2’ after the campaign.

2 Select the test


Two dependent samples.
Categorical data.
McNemar test.

3 Set the level of significance (α = 0.05) (see Cell J13)

4 Extract relevant statistic


Calculate proportions
From Excel, the proportion voting yes for ‘drop CO2’ before and after the campaign:
ρ1 = 0.605475 and ρ2 = 0.534622 (Cells J16 and J17).
Calculate zcal from equation (7.9).

zcal = (b − c) / √(b + c) = (89 − 45) / √(89 + 45) = 3.801021 (Cell J18)

Calculate the p-value.


Identify region of rejection using the p-value method. The p-value can be found from
Excel by using the NORM.S.DIST() function. From Excel, the two tail p-value = 0.00014
(see Cell J22). Does the test statistic lie within the region of rejection? Compare the

chosen significance level (α) of 5% (or 0.05) with the calculated two tail p-value of
0.00014.

5 Make a decision
We observe that the p-value < α (0.00014 < 0.05), and we conclude that we reject H0
and accept H1.

❉ Interpretation There is a significant difference in the voting intentions for ‘drop CO2’
after the campaign compared with before the campaign.

Excel solution using the critical test statistic


The solution procedure is exactly the same as for the p-value except that we use the critical
test statistic value to make a decision. From Excel, the proportion voting yes for ‘drop CO2′
before and after campaign: ρ1 = 0.605475 and ρ = 0.534622 (Cells J16 and J17). Calculate
zcal from equation (7.9).

b−c 89 − 45
Zcal = = = 3.801021 (Cell J18)
b+c 89 + 45

Identify the region of rejection using the critical test statistic. The critical value can be
found from Excel by using the NORM.S.INV() function. From Excel, the two tail zcri = ±1.96
(see Cells J20 and J21). Does the test statistic lie within the region of rejection? Compare
the calculated and critical value to determine which hypothesis statement (H0 or H1) to
accept. We observe that Zcal lies in the region of rejection (3.801021 > 1.96) and accept H1.

❉ Interpretation There is a significant difference in the voting intentions for ‘drop CO2′
after the campaign compared with before the campaign.

Note
1. This problem can be solved using a chi-square method defined by equation (7.10). For one
degree of freedom the relationship between the value of χ²cal and the corresponding
value of Zcal is χ²cal = (Zcal)² (see Cells J24 and J25).
2. If we are interested in testing for direction in the alternative hypothesis, for example H1:
π1 > π2, then you cannot use a chi-square test but will have to undertake a normal distribu-
tion z test to test for direction.

The two proportion solution can be extended easily to more than two proportions, but
this is beyond the scope of this textbook.
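The McNemar statistics in Example 7.3 depend only on the discordant cell counts b = 89 and
c = 45, so they are easy to reproduce. The sketch below (illustrative Python, not the textbook's
Excel route) applies equations (7.9) and (7.10) and recovers z ≈ 3.8010 and the two tail
p-value ≈ 0.00014.

from math import sqrt
from scipy.stats import norm, chi2

b, c = 89.0, 45.0                        # discordant pairs from Table 7.7

# Equation (7.9): McNemar z test statistic
z_cal = (b - c) / sqrt(b + c)            # approx. 3.801021

# Two tail p-value, the equivalent of 2*(1 - NORM.S.DIST(ABS(z), TRUE))
p_two_tail = 2 * norm.sf(abs(z_cal))     # approx. 0.00014

# Equation (7.10): chi-square form, equal to z_cal squared, with df = 1
chi2_cal = (b - c) ** 2 / (b + c)
p_chi2 = chi2.sf(chi2_cal, df=1)         # matches the z-based p-value
print(z_cal, p_two_tail, chi2_cal, p_chi2)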

Student exercises
X7.6 A business analyst requests answers to the following questions:
(a) What is the p-value when the chi-square test statistic = 2.89 and we have one
degree of freedom?
(b) If you have one degree of freedom, what is the value of the Z test statistic?
(c) Find the critical chi-square value for significance levels of 1% and 5%.
X7.7 The petrol prices during the summer of 2008 raised concerns with new car sellers
that potential customers were taking prices into account when choosing a new car. To
provide evidence to test this possibility a group of five local car showrooms agreed to
ask fleet managers and individual customers during August 2008 whether they were or
were not influenced by petrol prices. The results were as shown in Table 7.9.

Did petrol prices influence you Fleet customers Individual customers


in purchasing?
Yes 56 66
No 23 36

Table 7.9

At a 5% level of significance is there any evidence for the concerns raised by the car
showroom owners? Answer this question using both the critical test statistic and
p-value.
X7.8 A business analyst has been asked to confirm the effectiveness of a marketing
campaign on people’s attitudes to global warming. To confirm that the campaign was
effective a group of 500 people were randomly selected from the population and
asked the simple question about whether they agree that national governments should
be concerned with an answer of ‘Yes’ or ‘No’. The results are as shown in Table 7.10.

Before campaign After campaign


Yes No
Yes 202 115
No 89 75

Table 7.10

At a 5% level of significance is there any evidence that the campaign has increased the
number of people requesting that national governments should be concerned that global
warming is an issue? Answer this question using both the critical test statistic and p-value.

7.1.4 Chi-square goodness-of-fit test


In this section we will explore the concept of measuring how well a data set can be mod-
elled by a particular probability distribution. The method is called the goodness-of-fit
test and makes use of the chi-square distribution to compare the observed frequencies

(O) with the expected frequencies (E) predicted by fitting a particular probability distri-
bution to the data set of observed frequencies or to compare whether observed sample
frequencies differ significantly from expected frequencies. The chi-square test is an alter-
native to the Anderson–Darling and Kolmogorov–Smirnov goodness-of-fit tests. The chi-
square goodness-of-fit test can be applied to discrete distributions, such as the binomial
and the Poisson. The Kolmogorov–Smirnov and Anderson–Darling tests are restricted to
continuous distributions.
For a chi-square goodness-of-fit test, the hypotheses take the following form:

H0: the data are consistent with a specified distribution;


H1: the data are not consistent with a specified distribution.

The goodness-of-fit can then be assessed by conducting a chi-square test on the


observed and expected frequencies as defined by equation (7.2).

χ²cal = Σ (O − E)² / E

Where the degrees of freedom (df ) are given by equation (7.11)

df = n − k − 1 (7.11)

Where n = number of observed cells, k = number of population parameters of a distribu-


tion that must be estimated to calculate the expected frequencies. Examples of values of k
for different probability distributions are described in Tables 7.11–7.13.

(a) For a normal distribution, X ~ N (μ, σ2)

Value of k Parameter conditions


0 μ and σ are known
1 μ known and σ estimated
1 μ estimated and σ known
2 μ and σ are estimated

Table 7.11

(b) For a binomial distribution, X ~ B (n, p)

Value of k Parameter conditions


0 p known
1 p estimated

Table 7.12

(c) For a Poisson distribution, X ~ P(λ)

Value of k    Parameter conditions
0             λ known
1             λ estimated

Table 7.13

Key term: Chi-square test. Apply the chi-square distribution to test for homogeneity,
independence, or goodness-of-fit.

Note When comparing observed sample frequencies against known expected


frequencies then df = n − 1.
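Before the worked example, the whole goodness-of-fit workflow for a Poisson model can be
summarized in a short Python sketch (the observed counts below are invented purely for
illustration; Example 7.4 performs the same steps in Excel): estimate λ from the frequency
table, build the expected frequencies, then apply equations (7.2) and (7.11).

import numpy as np
from scipy.stats import poisson, chi2

# Hypothetical observed frequencies for X = 0, 1, ..., 6 accidents per week
x = np.arange(7)
observed = np.array([14.0, 25.0, 28.0, 17.0, 9.0, 5.0, 2.0])
n_total = observed.sum()

# Estimate lambda from the grouped data: sum(f*x) / sum(f)
lam = (observed * x).sum() / n_total

# Expected frequencies E = n * P(X = r); the tail probability P(X > 6)
# is folded into the last class so that the E values sum to n_total
p = poisson.pmf(x, lam)
p[-1] += poisson.sf(x[-1], lam)
expected = n_total * p

# Equation (7.2), and equation (7.11) with k = 1 estimated parameter
chi2_cal = ((observed - expected) ** 2 / expected).sum()
df = len(x) - 1 - 1
print(chi2_cal, df, chi2.sf(chi2_cal, df))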

Example 7.4
To illustrate the method consider the example of a motorway safety officer who believes that
the number of accidents per week occurring on a stretch of motorway can be modelled using
a Poisson distribution. If X denotes the number of accidents per week then the sample data
can be modelled by fitting a Poisson distribution to the sample data. Figure 7.7 provides the
tabulated data and the chi-square goodness-of-fit test. The Poisson probability distribution is
given by equation (4.10).

P(X = r) = λ^r e^(−λ) / r!

Where r = 0, 1, 2, 3 . . . ∞.

Figures 7.7 and 7.8 illustrate the Excel solution.

Figure 7.7

➜ Excel solution
X Cells B5:B11 Values
O Cells C5:C11 Values
xO Cells D5 Formula: =B5*C5
Copy formula down D5:D11
Estimated mean Cell D13 Formula: =SUM(D5:D11)/SUM(C5:C11)
P(X) Cells F5 Formula: =POISSON.DIST(B5,$D$13,FALSE)
Copy formula down F5:F11
E Cells G5 Formula: =SUM($C$5:$C$11)*F5
Copy formula down G5:G11
(O − E)^2/E Cells H5 Formula: =(C5−G5)^2/G5
Copy formula down H5:H11

Figure 7.8

➜ Excel solution
Significance level Cell L12 Value =0.05
Chi-square (χ2) Cell L15 Formula: =SUM(H5:H11)
n Cell L18 Formula: =COUNT(B5:B11)
k Cell L19 Value =1
df Cell L20 Formula: =L18−L19−1
Critical value χ2 Cell L21 Formula: =CHISQ.INV.RT(L12,L20)
P-value Cell L22 Formula: =CHISQ.DIST.RT(L15,L20)
or p-value using CHISQ.DIST Cell L24 Formula: =1−CHISQ.DIST(L15,L20,TRUE)

Excel solution using the p-value


1 State hypothesis
H0: the data are consistent with a Poisson distribution.
H1: the data are not consistent with a Poisson distribution.

2 Select test
Comparing observed frequency with an expected frequency predicted by the Poisson
distribution. Chi-square goodness-of-fit test.

3 Set the level of significance (α = 0.05) (see Cell L12)

4 Extract relevant statistic


The initial calculation process involves using the sample data to estimate the average
number of accidents per week, λ. Given that we have a frequency distribution then we
can calculate the average value as follows:

λ = ∑fx / ∑f = 2 (see Cell D13)

The individual Poisson probabilities for X = 0, 1, 2. . . and 6 can now be calculated


using equation (4.10). The individual probability values can be calculated by the
Excel function POISSON.DIST() and the calculation is undertaken for all values of X
(see cells F5:F11).
If a Poisson distribution fits the observed sample frequency data (O) then we can
use E(X = r) = (∑f ) × P(X = r) to calculate what the expected frequencies (E) would
be if the Poisson distribution is appropriate. These expected frequencies (E) can
be observed in cells G5:G11. At this stage we have a set of observed and expected
frequency values from the Poisson distribution.
Now we wish to find out whether the observed and expected frequencies are close
enough for us to accept the null hypothesis. This is achieved by undertaking a chi-
square test to compare these two frequency values using equation (7.2). From Excel,
the value of the chi-square test statistic = 2.7914 (see cell L15).
For the Poisson distribution we estimated λ using the sample data. Given this
estimation we have a value for k of 1 and the number of degrees of freedom df = n – k
– 1 = 7 – 1 – 1 = 5 (see cells L18:L20).
Identify a region of rejection using the p-value method. The p-value can be found
from Excel by using the CHISQ.DIST.RT() OR CHISQ.DIST functions. From Excel, the
p-value = 0.73 (see cell L22). Does the test statistic lie within the region of rejection?
Compare the chosen significance level (α) of 5% (or 0.05) with the calculated p-value
of 0.73.

5 Make decision
From Excel, p-value > α (0.73 > 0.05), and we would accept H0 and reject H1. Figure 7.9
illustrates the relationship between the p-value, test statistic, and the critical test statistic.

❉ Interpretation It can be concluded that there is no significant difference between the
observed and expected frequencies. This implies that the data can be modelled
by a Poisson probability distribution with an estimated average value of 2.

Excel solution using the critical chi-square test statistic


The solution procedure is exactly the same as for the p-value except that we use the critical
test statistic value to make a decision.

1 State hypothesis

2 Select test

3 Set the level of significance (α = 0.05) (see Cell L12)

4 Extract relevant statistic


The calculated chi-square test statistic = 2.7914 (see Cell L15). The critical chi-square
value depends upon both the level of significance (α = 0.05, Cell L12) and the number of
degrees of freedom (df = 5, cell L20). The critical value can be found from Excel by using
the CHISQ.INV.RT() function. The critical chi-square value equals 11.0705 (see Cell
L21). Does the test statistic lie within the region of rejection? Compare the calculated
and critical values to determine which hypothesis statement (H0 or H1) to accept.

5 Make decision
From Excel, calculated chi-square value < critical chi-square value (2.7914 < 11.0705),
and we would accept H0 and reject H1. Figure 7.9 illustrates the relationship between
the p-value, test statistic, and the critical test statistic.

❉ Interpretation It can be concluded that there is no significant difference between
the observed and expected frequencies. This implies that the data can be modelled by a
Poisson probability distribution with an estimated average value of 2.

Figure 7.9 illustrates the relationship between the p-value, test statistic, and the critical
test statistic.

Figure 7.9 Chi-square goodness-of-fit test with df = 5: the test statistic 2.7914 lies well below
the critical value 11.07, with a p-value of 0.732 in the upper tail.

Note The expected number for each random variable must be at least 5. If necessary
combine classes in the table to satisfy this requirement. For example, in Figure 7.7 some of
the expected frequencies are <5 and those classes should be combined. This will result in the
value of the number of classes n being reduced from 7 to 5 and the number of degrees
of freedom (df) from 5 to 3. This results in a new p-value = 0.4249 and critical test
statistic = 7.8. In this case, the overall conclusion would not change.

Student exercises
X7.9 An employment agency has recently implemented a new training programme to
develop the interview skills of potential job applicants. Based upon the collected data
(Table 7.14) can we say confidently that the data can be modelled using a binomial
distribution (assess at 5%)?

Number of interview successes 0 1 2 3


Frequency 78 143 43 13

Table 7.14

X7.10 A university has recently set up a satellite department within a local college of
higher education. The university claims that 35% of the undergraduate students are
in department A, 26% are in department B, 25% are in department C, and 14% are
in department D. A random sample of 320 students finds the following number of
students in departments A–D: 132, 89, 64, and 35. Perform a hypothesis test at 5% to
test this claim.
X7.11 A new airport terminal has been assessing waiting times for passengers to be processed
at the airport check-in counters. The airport owners would like to be able to attach
levels of risk to different aspects of the business. To undertake this we are required to fit
an appropriate probability distribution to the observed frequencies provided in Table
7.15. (a) Use the data in Table 7.15 to provide an estimate of the population mean and
standard deviation; (b) construct a z distribution table with upper class boundaries of
14, 17, 22, 26, and infinity; (c) use this table to calculate the cumulative distribution
function values at these class boundaries based on your answers to parts (a)–(b); (d)
estimate the class probabilities and resultant expected frequencies; (e) calculate the
observed frequencies based upon your upper class boundaries; and (f) undertake
a chi-square goodness-of-fit test to assess at a 95% confidence that the normal
distribution would be a good fit to the sample data.

6 7 7 8 10 12 13 13 14
14 15 15 16 16 16 16 16 17
17 18 13 18 19 19 19 20 20
22 23 23 12 24 25 25 26 27
27 27 28 28 29 30 30 31 33

Table 7.15

7.2 Non-parametric (or distribution-free) tests


Many statistical tests require that your data follow a normal distribution. Sometimes this
is not the case. In some instances it is possible to transform the data to make them follow a
normal distribution; in others this is not possible, or the sample size might be so small that
it is difficult to ascertain whether or not the data are normally distributed. In such cases it
is necessary to use a statistical test that does not require the data to follow a particular dis-
tribution. In this section we shall explore three non-parametric tests: the sign test, the Wilcoxon
signed rank test, and the Mann–Whitney U test. Table 7.16 compares the non-parametric
tests with the parametric tests for one and two samples (discussed in Chapter 6).

Key term: Sign test. The sign test is designed to test a hypothesis about the location of a
population distribution.

Key term: Mann–Whitney U test. The Mann–Whitney U test is used to test the null
hypothesis that two populations have identical distribution functions against the
alternative hypothesis that the two distribution functions differ only with respect to
location (median), if at all.

7.2.1 Sign test

The sign test is used to test a set of data values against a perceived hypothesis statement,
including:

Test                     Parametric test                  Non-parametric test
One sample               One sample z test                Sign test
                         One sample t-test                Wilcoxon signed rank test
Paired samples           Two paired sample z-test         Sign test
                         Two paired sample t-test         Wilcoxon signed rank test
Independent samples      Two independent sample t-test    Mann–Whitney U test
                                                          (Wilcoxon rank sum test)

Table 7.16

(a) Assessing the validity of a population median value assessed from collected sample
data—replaces the one-sample t-test, which assumes a normal population and that
a mean value has meaning
(b) Assessing the validity that the difference between two population medians is zero
based upon sample data—replaces the paired t-test, which assumes a normal
population and that a mean value has meaning
(c) Assessing the validity of proportions where the proportions are estimated from
ordered nominal (or categorical) data where a numerical scale is inappropriate,
but where we can rank the data observations—replaces the sample Z test for
proportions, which assumes a normal population.

If we rank the data then, under the null hypothesis, half the ranks would be less than
the median (r1) and half the ranks would be greater than the median (r2).
the null hypothesis can be modelled by a binomial distribution with the probability of a
data value being less than or greater than the median being equal to p = 0.5, with sample
size n. The sign test assumptions are: (1) randomly selected samples and (2) continuous
distribution. The sign test measures the number of counts that fall above and below the
median value. Given that 50% of all values lie below and 50% of all values lie above then
the population proportion (or probability) at the median value is 50% or 0.5. Under the
null hypothesis, we would expect the number of counts distribution to be approximately
symmetric around the median and the distribution of values below and above to be dis-
tributed at random among the ranks. The corresponding hypothesis statements for two
tail and one tail tests are as presented in Table 7.17.

Two tail test H0: sample median = population median


H1: sample median ≠ population median
Upper one tail test H0: sample median ≤ population median
H1: sample median > population median
Lower one tail test H0: sample median ≥ population median
H1: sample median < population median

Table 7.17

In this case the probability distribution is a binomial distribution with the probability
(or proportion) of success = 0.5 and the number of trials represented by the number of
paired observations (n). In this case we can model the situation using a binomial distri-
bution X ~ Bin (n, p). In this situation the value of the probability (P(X = r)), mean (μ),
and standard deviation are given by equations (4.5), (4.8), and (4.9): P(X = r) = nCr p^r q^(n−r),
μ = np, and σ = √(np(1 − p)).

Example 7.5
To illustrate the concept consider the situation where 16 randomly selected people were cho-
sen to measure the effectiveness of a new training programme on the value of sales. For the
training programme to be effective we would expect the alternative hypothesis to be H1: the
training programme results in an increase in the average value of sales. Given that we are told only
that we have a random selection and no information is given about the distribution, we
will use the sign test to answer the question.

Figures 7.10 and 7.11 illustrate the Excel solution.

Figure 7.10

➜ Excel solution
Person: Cells A4:A19 Values
A: Cells B4:B19 Values
B: Cells C4:C19 Values
d = Cell D4 Formula: =B4−C4
Copy formula down D4:D19
Sign Cell F4 Formula: =IF(D4<0,"−",IF(D4>0,"+","0"))
Copy formula down F4:F19

➜ Excel solution
Level = Cell K8 Value =0.05
Median d = Cell K10 Formula =MEDIAN(D4:D19)
p = Cell K11 Value =0.5
N = Cell K12 Formula =COUNT(A4:A19)
r− = Cell K13 Formula =COUNTIF(F4:F19,"−")

r+ = Cell K14 Formula =COUNTIF(F4:F19,"+")


r0 = Cell K15 Formula =COUNTIF(F4:F19,"=0")
x = Cell K16 Formula =MAX(K13,K14)
n′ = Cell K17 Formula =K12−K15
P(X = 12) = Cell K18 Formula =BINOM.DIST(12,$K$17,$K$11,FALSE)
P(X = 13) = Cell K19 Formula =BINOM.DIST(13,$K$17,$K$11,FALSE)
P(X = 14) = Cell K20 Formula =BINOM.DIST(14,$K$17,$K$11,FALSE)
P(X = 15) = Cell K21 Formula =BINOM.DIST(15,$K$17,$K$11,FALSE)
P(X = 16) = Cell K22 Formula =BINOM.DIST(16,$K$17,$K$11,FALSE)
One sided p = Cell K23 Formula =SUM(K18:K22)
mu = Cell K25 Formula =K17*K11
sigma = Cell K26 Formula =SQRT(K25*(1−K11))
mu − 2sigma = Cell K27 Formula =K25−2*K26
mu + 2sigma = Cell K28 Formula =K25+2*K26
Xc = Cell K29 Formula =K16−0.5
Zcal = Cell K30 Formula =(K29−K25)/K26
Upper one tail Zcri = Cell K31 Formula =NORM.S.INV(1−K8)
Upper one tail p-value = Cell K32 Formula =1−NORM.S.DIST(K30,TRUE)

Figure 7.11

Excel solution using the p-value


1 State hypothesis
H0: the median sales difference is zero.
H1: median sales after training > median sales before training.
Upper one tail test.

2 Select test
Two dependent samples, both samples consist of ratio data, and no information on
the form of the distribution. Conduct the sign test.

3 Set the level of significance (α = 0.05) (see Cell K8)

4 Extract relevant statistic


The solution process can be broken down into a series of steps:
Enter data
Sales values after training A (see Cells B4:B19) and before training B (see Cells
C4:C19).
Calculate the differences (d = A – B) (see cells D4:D19)
From Excel, we observe that the median difference d = 1.5 (Cell K10), which
suggests that the sales are moving in the correct direction (H1: d > 0). The question
now becomes: is this positive difference significant?
Allocate ‘ + ‘ and ‘−’ depending on whether d > 0 or d < 0
Calculate number of paired values
Total number of trials, N = 16 (see Cell K12)
Number of –ve ranks, r- = 4 (see Cell K13)
Number of +ve ranks, r+ = 12 (see Cell K14)
Number of shared ranks with d = 0, r0 = 0 (see Cell K15)
Calculate x = MAX(r-, r+) = MAX (4, 12) = 12 (see Cell K16)
Adjust n to remove shared ranks with d = 0, n′ = 16 (see Cell K17)
Calculate test statistic
Calculate binomial probabilities, P(X ≥ x)
Given x = 12 (see Cell K16). In this case we wish to solve the problem
p = P(X ≥ 12) = P(X = 12, 13, 14, 15, 16) = P(X = 12) + P(X = 13) + P(X = 14) +
P(X = 15) + P(X = 16). From the binomial distribution (P(X = r) = nCr p^r q^(n−r)) we
can calculate the probabilities using the BINOM.DIST() function. From Excel, the
values are as follows: P(X = 12) (Cell K18), P(X = 13) (Cell K19), P(X = 14) (Cell
K20), P(X = 15) (Cell K21), and P(X = 16) (Cell K22).

Note This value given by the binomial equation represents an exact p-value.

If n is sufficiently large (n >25), we can use a normal approximation with the value
of the mean and standard deviation given by equations (4.8) and (4.9). From Excel,
μ = 8 (see Cell K25) and σ = 2 (see Cell K26). If μ ± 2σ is contained within the range
of the binomial 0 – n′, then the normal approximation should be an accurate
approximation. The normal approximation z equation is given by equation (7.12):

Zcal = (Xc − μ) / σ    (7.12)

Calculate critical values


Identify region of rejection using the p-value method:
(a) For n <25, the p-value can be found from Excel by summing the individual
binomial probabilities, p = P(X ≥ 12) = 0.0384 (Cell K23). This represents
the probability of a value being at this value or more extreme and is an exact
p-value. Does the test statistic lie in the region of rejection? Compare the chosen
significance level (α) of 5% (or 0.05) with the calculated p-value of 0.0384. We
can observe that the p-value < α and we accept H1.
(b) For n ≥ 25, the p-value can be found from Excel by using the NORM.S.DIST()
function. From Excel, the upper one tail p-value = 0.0401 (see Cell K32). Does
the test statistic lie in the region of rejection? Compare the chosen significance
level (α) of 5% (or 0.05) with the calculated p-value of 0.0401. We can observe
that the p-value <α and we accept H1.

5 Make decision
We will reject H0 and accept H1 given that the binomial p-value (0.0384) < α (0.05) and
normal approximation p-value = 0.0401 < α (0.05).

❉ Interpretation From the sample data we have sufficient statistical evidence that the
after sales are significantly larger than the before sales.

Excel solution using the critical test statistic


The solution procedure is exactly the same as for the p-value except that we use the criti-
cal test statistic value to make a decision. From Excel, zcal = 1.75 (see Cell K30). Calculate
the critical test statistic, Zcri. The critical Z values can be found from Excel by using the
NORM.S.INV() function, upper one tail Zcri = 1.6449 (see Cell K31). Does the test statistic
lie within the region of rejection? Compare the calculated and critical Z values to deter-
mine which hypothesis statement (H0 or H1) to accept. We will reject H0 and accept H1,
given that zcal lies in the upper rejection zone (1.75 > 1.6449).

❉ Interpretation From the sample data we have sufficient statistical evidence that the
after sales are significantly larger than the before sales.

Note The decision to accept the alternative hypothesis is a borderline decision and will
change if the significance level changes to 1% from 5%.
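The sign test numbers in Example 7.5 can be checked with a few lines of Python (an
illustrative sketch; the book's route is Excel's BINOM.DIST and NORM.S functions). It
computes the exact binomial tail for x = 12 positive signs out of n′ = 16 usable pairs, and the
continuity-corrected normal approximation of equation (7.12).

from scipy.stats import binom, norm

n_prime, x, p = 16, 12, 0.5        # usable pairs and '+' count from Example 7.5

# Exact upper tail: P(X >= 12), the sum of the probabilities in Cells K18:K22
p_exact = binom.sf(x - 1, n_prime, p)       # approx. 0.0384

# Normal approximation (strictly for n > 25; shown here for comparison)
mu = n_prime * p                            # 8
sigma = (n_prime * p * (1 - p)) ** 0.5      # 2
z_cal = (x - 0.5 - mu) / sigma              # equation (7.12): Xc = 11.5, z = 1.75
p_approx = norm.sf(z_cal)                   # approx. 0.0401
print(p_exact, z_cal, p_approx)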

Student exercises
X7.12 A researcher has undertaken a sign test with the following results: sum of positive and
negative signs are 15 and 4, respectively, with 3 ties. Given that binomial p = 0.5, assess
whether there is evidence that the median value is > 0.5 (assess at 5%).

X7.13 A teacher of 40 university students studying the application of Excel within a business
context is concerned that students are not taking a group work assignment seriously.
This is deemed to be important given that the group work element is contributing to
the development of personal development skills. To assess whether or not this is a
problem the module tutor devises a simple experiment which judges the individual
level of cooperation by each individual student within their own group. In the
experiment a rating scale is employed to measure the level of cooperation: 1 = limited
cooperation, 5 = moderate cooperation, and 10 = complete cooperation. The form of
the testing consists of an initial observation, a lecture on working in groups, and a final
observation. Given the raw data in Table 7.18 conduct a relevant test to assess whether
or not we can observe that cooperation has changed significantly (assess at 5%).

5, 8 4, 6 3, 3 6, 5 8, 9 10, 9 8, 8 4, 8 5, 5 8, 9
3, 5 5, 4 6, 5 4, 4 7, 8 7, 9 9, 9 8, 7 5, 8 5, 6
8, 7 8, 8 3, 4 5, 6 6, 7 4, 8 7, 8 9, 10 10, 10 8, 9
8, 8 4, 6 4, 5 7, 8 5, 7 7, 9 8, 10 3, 6 5, 6 7,8

Table 7.18

X7.14 A leading business-training firm advertises in its promotional material that its class sizes
at its Paris branch are no greater than 25. Recently, the firm has received a number of
complaints from disgruntled students who have complained that class sizes are >25
for a majority of its courses in Paris. To assess this claim the company selects 15 classes
at random and measures the class sizes as follows: 32, 19, 26, 25, 28, 21, 29, 22, 27,
28, 26, 23, 26, 28, and 29. Undertake an appropriate test to assess whether there is
any justification to the complaints (assess at 5%). What would your decision be if you
assessed at 1%?

7.2.2 Wilcoxon signed rank sum test for dependent samples (or matched pairs)
The t-test is the standard test for testing that the difference between population means for two paired samples is equal. If the populations are non-normal, particularly for small samples, then the t-test may not be valid. The Wilcoxon signed rank sum test is another example of a non-parametric (or distribution-free) test.
As for the sign test, the Wilcoxon signed rank sum test is used to test the null hypothesis that the median of a distribution is equal to some value. It can be used: (a) in place of a one-sample t-test, (b) in place of a paired t-test, or (c) for ordered categorical data where a numerical scale is inappropriate, but where it is possible to rank the observations.
The method considers the differences between 'n' matched pairs as one sample. If the two population distributions are identical, then we can show that the sample statistic has a symmetric null distribution. As with the Mann–Whitney test (Section 7.2.3), where the number of paired observations is small (n ≤ 20) we need to consult tables, but where the number of paired observations is large (n > 20) we can use a test based on the normal distribution.

Wilcoxon signed rank sum test The Wilcoxon signed rank test is designed to test a hypothesis about the location of the population median (one or two matched pairs).
The Wilcoxon signed rank sum test assumptions are:

1. Each matched data pair is randomly distributed


2. The matched pair differences should be distributed symmetrically.

Although the Wilcoxon test assumes neither normality nor homogeneity of variance, it does assume that the two samples are from populations with the same distribution shape. It is also vulnerable to outliers, although not to nearly the same extent as the t-test. If we cannot make this assumption about the distribution then we should use the sign test for ordinal data. The McNemar test is available for nominal paired data relating to dichotomous qualitative variables and is described in Section 7.1.3. In this section we shall solve the Wilcoxon signed rank sum test where we have a large and a small number of paired observations. In the case of a large number of paired observations (n > 20) we shall use a normal approximation to provide an answer to the hypothesis statement. Furthermore, for a large number of paired observations we shall use Excel to calculate both the p-value and critical z value to make a decision. The situation of a small number of paired observations (n ≤ 20) will be described together with an outline of the solution process.

Example 7.6
Suppose that Slim-Gym is offering a weight reduction programme that they advertise will result
in more than a 10-lb weight loss in the first 30 days. Twenty subjects were selected for a study
and their weights before and after the weight loss programme were recorded.

Figures 7.12 and 7.13 illustrate the Excel solution, where X and Y represent the weight
before and after the weight loss programme. For this problem we should be able to write
the null and alternative hypotheses as H0: X – Y – 10 ≤ 0, H1: X – Y – 10 > 0.

Figure 7.12

➜ Excel solution
X: Cells A4:A27 Values
Y: Cells B4:B27 Values
d = Cell C4 Formula: =A4−B4−$K$7
Copy formula down C4:C27
ABS(d) = Cell E4 Formula: =ABS(C4)
Copy formula down E4:E27
Rank Cell G4 Formula: =RANK.AVG(E4,$E$4:$E$27,1)
Copy formula down G4:G27

Figure 7.13

➜ Excel solution
D0 = Cell K7 Value
Median difference = Cell K8 Formula =MEDIAN(C4:C27)
Significance level = Cell K13 Value
n = Cell K16 Formula =COUNT(A4:A27)
n0 = Cell K17 Formula =COUNTIF(E4:E27,“0”)
n′ = Cell K18 Formula =K16−K17
T− = Cell K19 Formula =SUMIF(C4:C27,"<0",G4:G27)
T+ = Cell K20 Formula =SUMIF(C4:C27,">0",G4:G27)
n′(n′ + 1)/2 = Cell K21 Formula =K18*(K18+1)/2
T− + T+ = Cell K22 Formula =K19+K20
Two tail test, T = Cell K23 Formula =MIN(K19,K20)
Upper one tail test, T = Cell K24 Formula =K20
Lower one tail test, T = Cell K25 Formula =K19
mu = Cell K26 Formula =K18*(K18+1)/4

sigma = Cell K27 Formula =SQRT((K18*(K18+1)*(2*K18+1)/24))


Z = Cell K28 Formula =(K24−K26-0.5)/K27
Upper one tail Zcri = Cell K30 Formula = NORM.S.INV(1-K13)
Upper one tail p-value = Cell K31 Formula = 1-NORM.S.DIST(K28,TRUE)

Note The value of z has been corrected for continuity by subtracting 0.5 (H1: >)

Excel solution using the p-value


1 State hypothesis
Under the null hypothesis, we would expect the distribution of the differences to be
approximately symmetric around zero and the distribution of positives and negatives
to be distributed at random among the ranks.
H0: the population median weight loss is no more than 10 lbs (X − Y − 10 ≤ 0).
H1: the population median weight loss is greater than 10 lbs (X − Y − 10 > 0).
Upper one tail test.

2 Select test
Two dependent samples.
Both samples consist of ratio data.
No information on the form of the distribution.
Wilcoxon signed rank test.
Median value centred at D0 = 10 (see Cell K7).
The median difference is +4.1 (Cell K8), which supports the alternative hypothesis
that d > 0. If this was negative, or zero, then you would not conduct the test as there is
no evidence from the sample that d > 0.

3 Set the level of significance (α = 0.05) (see Cell K13)

4 Extract relevant statistic


The solution process can be broken down into a series of steps:
Calculate the differences (d = X – Y – D0)
From Excel, we observe that the median difference = +4.1 (Cell K8), which suggests
that the weight loss is moving in the correct direction (H1: d > 0). The question is
now whether this positive difference is significant.
Rank data
The convention is to assign rank 1 to the smallest value and rank n to the largest
value. If you have any shared ranks then the policy is to assign the average rank to
each of the shared values. The Excel function RANK.AVG() is used to rank the data.
Calculate number of paired values
Number of paired ranks, n = 24 (see Cell K16).
Number of shared ranks with d = 0, n0 = 0 (see Cell K17).
Adjust n to remove shared ranks with d = 0, n′ = 24 (see Cell K18).
Calculate the sum of the ranks, T- and T+:

T− = sum of −ve ranks = 35 (see Cell K19).
T+ = sum of +ve ranks = 265 (see Cell K20).
Check rankings using equation (7.13):

T+ + T− = n′(n′ + 1)/2   (7.13)

From Excel, T+ + T− = n′(n′ + 1)/2 = 300 (see Cells K21 and K22).

Find Tcal
The value of Tcal is determined from the criteria outlined in Table 7.19.

Test | Hypothesis | Tcal | Cell
Two sided test | H1: population locations not centred at 0 | Tcal = minimum of T− and T+ | K23
One sided test | H1: population differences are centred at a value > 0 | Tcal = T+ | K24
One sided test | H1: population differences are centred at a value < 0 | Tcal = T− | K25

Table 7.19

Given that we have an upper one tail test, then Tcal = 265.
Find Zcal
If the number of pairs is such that n is large enough (>20) a normal approximation
can be used with Zcal given by equation (7.14), and the mean and standard deviation
given by equations (7.15) and (7.16) respectively.

Zcal = (Tcal − μT ± 0.5)/σT   (7.14)

μT = n′(n′ + 1)/4   (7.15)

σT = √(n′(n′ + 1)(2n′ + 1)/24)   (7.16)

The value of Zcal is corrected for continuity by subtracting 0.5 if H1: > 0 or add 0.5 if
H1: < 0. From Excel: μT = 150 (see Cell K26), σT = 35.0 (see Cell K27), and Zcal is given
by equation (7.14).

Zcal = (Tcal − μT − 0.5)/σT = (265 − 150 − 0.5)/35.0 = 3.2714 (see Cell K28)

Calculate critical values


Identify region of rejection using the p-value method. The p-value can be found
from Excel by using the NORM.S.DIST() function. From Excel, the upper one tail
p-value = 0.0005 (see Cell K31). Does the test statistic lie in the region of rejection?

Compare the chosen significance level (α) of 5% (or 0.05) with the calculated upper
one tail p-value of 0.0005.

5 Make decision
We will reject H0 and accept H1 given that the upper one tail p-value (0.0005) < α (0.05).

❉ Interpretation From the sample data we have sufficient statistical evidence that the
weight loss is greater than 10 lbs.

Excel solution using the critical test statistic

The solution procedure is exactly the same as for the p-value except that we use the critical test statistic value to make a decision. The calculated test statistic zcal = 3.2714 (see Cell K28). Calculate the critical test statistic, Zcri. The critical Z values can be found from Excel by using the NORM.S.INV() function, upper one tail Zcri = +1.6449 (see Cell K30). Does the test statistic lie within the region of rejection? Compare the calculated and critical Z values to determine which hypothesis statement (H0 or H1) to accept. We observe that zcal lies in the upper rejection zone (3.2714 > 1.6449) and we accept H1.

❉ Interpretation From the sample data we have sufficient statistical evidence that the
weight loss is greater than 10 lbs.
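
The large-sample calculation can also be checked from the summary statistics alone. The sketch below is a cross-check in Python (an assumption: the scipy library is available; the raw before and after weights sit in the spreadsheet of Figure 7.12) and reproduces equations (7.14)–(7.16) using n′ = 24 and T+ = 265 from Cells K18 and K20.

from math import sqrt
from scipy.stats import norm

t_plus, n = 265.0, 24                         # sum of +ve ranks and non-zero pairs (Cells K20, K18)
mu = n * (n + 1) / 4                          # equation (7.15): 150
sigma = sqrt(n * (n + 1) * (2 * n + 1) / 24)  # equation (7.16): 35.0
z = (t_plus - mu - 0.5) / sigma               # equation (7.14), continuity corrected for H1: > (3.2714)
p = norm.sf(z)                                # upper one tail p-value, approximately 0.0005

With the raw paired differences, scipy.stats.wilcoxon(d, alternative='greater') would return a test statistic and p-value directly, although its internal choices (exact versus normal approximation, tie handling) may make the figures differ slightly from the hand calculation above.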

Figure 7.14 illustrates the relationship between the p-value and the test statistic: the upper one tail p-value = 1 − NORM.S.DIST(3.2714, TRUE) is the area under the standard normal curve lying to the right of z = +3.2714.

Figure 7.14

Small number of paired observations (n ≤ 10)


If the number of pairs is such that n is ≤10 then we calculate Tcal and use tables to look up
an exact value of the critical test statistic, Tcri. The value of Tcal is chosen to be Tcal = MIN (T− ,
T+) = MIN (35, 265) = 35. The decision rule is to reject H0 if Tcal ≤ Tcri with the exact value
of the critical test statistic Tcri available from a table representing the critical values of the
T statistic for the Wilcoxon signed ranks test. If we were to look up such a table we would
find Tcri = 92 for α = 0.05 with no tied ranks. We can see that Tcal ≤ Tcri (35 < 92); therefore,
we would reject H0 and accept H1.

Figure 7.15 illustrates the relationship between the T and Z values: the acceptance region for H0 runs from the lower critical value Tcri = 92 to the upper critical value Tcri = 208, centred on μT = 150, with T− = 35 and T+ = 265 lying in the two rejection (accept H1) zones; on the Z scale this corresponds to Zcal = 3.27 > Zcri = +1.65.

Figure 7.15

The small sample method uses tables to look up the lower critical value (e.g. 92) and you have to use the smallest T value as Tcal. If you want the upper critical value then you can calculate it if you remember that the distribution is symmetric about the median (remember median = mean for symmetric distributions). From Figure 7.15: μT − lower Tcri = upper Tcri − μT.

Dealing with ties


There are two types of tied observations that may arise when using the Wilcoxon signed
rank test:

1. Observations in the sample may be exactly equal to zero in the case of paired
differences. Ignore such observations and adjust n accordingly. For the previous
example we removed any values and used n′ instead of n.
2. Two or more observations/differences may be equal. If so, average the ranks across
the tied observations and reduce the variance by equation (7.17) for each group of t
tied ranks.

(t³ − t)/48   (7.17)

Note In the example and exercises we have not modified the solution for tied ranks.
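
As a sketch of how equation (7.17) would enter the calculation, assume two hypothetical groups of tied ranks (one pair and one triple) in a sample of n′ = 24 non-zero differences; the tie groups and their sizes below are illustrative only (Python).

from math import sqrt

n = 24                                         # non-zero paired differences, as in Example 7.6
ties = [2, 3]                                  # hypothetical: groups of 2 and 3 tied ranks
var_t = n * (n + 1) * (2 * n + 1) / 24         # unadjusted variance of T (35.0 squared)
var_t -= sum((t ** 3 - t) / 48 for t in ties)  # reduce the variance by (t^3 - t)/48 per tie group
sigma_t = sqrt(var_t)                          # slightly smaller than the unadjusted 35.0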

Student exercises
X7.15 The Wilcoxon paired ranks test is considered to be more powerful than the sign test.
Explain why.
X7.16 A company is planning to introduce new packaging for a product that has used the
same packing for over 20 years. Before it makes a decision on the new packaging it
decides to ask a panel of 20 participants to rate the current and proposed packaging (using a rating scale from 0 = do not change to 100 = change) (Table 7.20). Is there any evidence that the new packaging is more favourably received compared with the older packaging (assess at 5%)?

Participant Before After Participant Before After


1 80 89 11 37 40
2 75 82 12 55 68
3 84 96 13 80 88
4 65 68 14 85 95
5 40 45 15 17 21
6 72 79 16 12 18
7 41 30 17 15 21
8 10 22 18 23 25
9 16 12 19 34 45
10 17 24 20 61 80

Table 7.20

X7.17 A local manufacturer is concerned at the number of errors made by machinists in the
production of kites for a multinational retail company. To reduce the number of errors
being made the company decides to retrain all staff in a new set of procedures to
minimize the problem. To assess whether the training worked a random sample of 10
machinists were selected, and the number of errors made before and after the training
recorded as shown in Table 7.21.

Machinist
1 2 3 4 5 6 7 8 9 10
Before 49 34 30 46 37 28 48 40 42 45
After 22 23 32 24 23 21 24 29 27 27
11 12 13 14 15 16 17 18 19 20
Before 29 45 32 44 49 28 44 39 47 41
After 23 29 37 22 33 27 35 32 35 24
21 22 23 24 25 26 27 28 29 30
Before 33 38 35 35 47 47 48 35 41 35
After 37 37 24 23 23 37 38 30 29 31

Table 7.21

Is there any evidence that the training has reduced the number of errors (assess at 5%)?

7.2.3 Mann–Whitney U test for two independent samples


The Mann–Whitney U test is a non-parametric test that can be used in place of an unpaired
t-test. It is used to test the null hypothesis that two samples come from the same popula-
tion (i.e. have the same median) or, alternatively, whether observations in one sample
tend to be larger than observations in the other. Although it is a non-parametric test it
does assume that the two distributions are similar in shape. Where the samples are small
we need to use tables of critical values to find whether or not to reject the null hypothesis,
but where the sample is large we can use a test based on the normal distribution.

The basic premise of the test is that once all of the values in the two samples are put into a single ordered list, if they come from the same parent population, then the ranks at which values from sample 1 and sample 2 appear will be determined by chance. If the two samples come from different populations, then the ranks at which the sample values appear will not be random and there will be a tendency for values from one of the samples to have lower ranks than values from the other sample. We are thus testing for different locations of the two samples. Whenever n1 and n2 are greater than 20, a large sample approximation can be used for the distribution of the Mann–Whitney U statistic. The Mann–Whitney assumptions are as follows: (i) independent random samples are obtained from each population, and (ii) the two populations are continuous and have the same shape.

Example 7.7
A local training firm has developed an innovative programme to improve the performance of
students on the courses it offers. To assess whether the new programme improves student per-
formance the firm have collected two random samples from the population of students sitting
an accountancy examination, where sample 1 students have studied via the traditional method
and sample 2 students via the new programme. The firm has analysed previous data and the
outcome of the results provides evidence that the distribution is not normally distributed, but
is skewed to the left. This information results in concern at the suitability of using a two sam-
ple independent t-test to undertake the analysis and, instead, they decide to use a suitable
distribution-free test. In this case the appropriate test is the Mann–Whitney U test.

Figures 7.16 and 7.17 illustrate the Excel Mann–Whitney U test solution.

Figure 7.16

➜ Excel solution
Training type: Cells A4:A18 Values
Combined samples: Cells B4:B18 Values
Rank Cell C4 Formula: =RANK.AVG(B4,$B$4:$B$18,1)
Copy formula down C4:C18


Figure 7.17

➜ Excel solution
Significance level = Cell H10 Value
Median sample 1 = Cell H12 Formula: =MEDIAN(B4:B10)
Median sample 2 = Cell H13 Formula: =MEDIAN(B11:B18)
n1 = Cell H14 Formula: =COUNTIF(A4:A18,"=1")
n2 = Cell H15 Formula: =COUNTIF(A4:A18,"=2")
T1 = Cell H16 Formula: =SUMIF(A4:A18,"=1",C4:C18)
T2 = Cell H17 Formula: =SUMIF(A4:A18,"=2",C4:C18)
T1max = Cell H18 Formula: =H14*H15+H14*(H14+1)/2
T2max = Cell H19 Formula: =H14*H15+H15*(H15+1)/2
U1 = Cell H20 Formula: =H18−H16
U2 = Cell H21 Formula: =H19−H17
U1 + U2 = Cell H22 Formula: =H20+H21
n1n2 = Cell H23 Formula: =H14*H15
Ucal = Cell H24 Formula: =MIN(H20,H21)
mu = Cell H25 Formula: =H14*H15/2
sigma = Cell H26 Formula: =SQRT(H14*H15*(H14+H15+1)/12)
Z = Cell H27 Formula: =(H24−H25+0.5)/H26
Lower one tail Zcri = Cell H28 Formula: =NORM.S.INV(H10)
Lower p-value = Cell H29 Formula: =NORM.S.DIST(H27,TRUE)

Note The value of z has been corrected for continuity by adding 0.5 (H1: < 0)

Excel solution using the p-value


The solution procedure uses the p-value to make the decision.

1 State hypothesis
H0: no difference in examination performance between the two groups.
H1: new programme improved performance (M1 < M2).
Lower one tailed test.

2 Select test
Comparing two independent samples.
Both samples consist of ratio data.
Unknown population distribution.
Mann–Whitney U test.

3 Set the level of significance (α = 0.05) (see Cell H10)

4 Extract relevant statistic


The solution process can be broken down into a series of steps:
Input samples into two columns
Combined sample (Cells A4:A18): sample 1 = 1, and sample 2 = 2.
Sample data in cells B4:B18.
Median sample 1 M1 = 62 (Cell H12).
Median sample 2 M2 = 75 (Cell H13).
We can observe that the median for sample 1 is smaller than for sample 2 (62 < 75). The question now is whether or not this difference is significant.
Rank data
The convention is to assign rank 1 to the smallest value and rank n to the largest
value. If you have any shared ranks then the policy is to assign the average rank to
each of the shared values. The Excel function RANK.AVG() is used to rank the data.
Calculate number of data points in each sample
Number in sample 1, n1 = 7 (see Cell H14).
Number in sample 2, n2 = 8 (see Cell H15).
Calculate the sum of the ranks, T1 and T2
Input formulae to calculate T1, T2.
T1 = Sum of sample 1 ranks = 37 (see Cell H16).
T2 = Sum of sample 2 ranks = 83 (see Cell H17).
T1MAX = Maximum sum value of sample 1 ranks = 84 (see Cell H18).
T2MAX = Maximum sum value of sample 2 ranks = 92 (see Cell H19).
Calculate U1, U2, and the test statistic Ucal
The value of U is equal to the difference between the maximum possible values
of T for the sample versus the actual observed values of T: U1 = T1[max] − T1 and
U2 = T2[max] − T2. U1 and U2 can be calculated from equations (7.18) and (7.19)
respectively as follows.

U1 = n1n2 + n1(n1 + 1)/2 − T1   (7.18)

U2 = n1n2 + n2(n2 + 1)/2 − T2   (7.19)

Substituting the computed values into equations (7.18) and (7.19) gives U1 = 47 (see Cell H20) and U2 = 9 (see Cell H21). Check using equation (7.20):

U1 + U2 = n1n2   (7.20)

From Excel, U1 + U2 = 56 (Cell H22) and n1n2 = 56 (Cell H23). The value of Ucal can
be either U1 or U2, and, for this example, we will choose Ucal = Minimum of U1 and
U2 = MIN (47, 9) = 9 (see Cell H24).

Note The value of Ucal can be either U1 or U2.

If the null hypothesis is true then we would expect U1 and U2 both to be centred at the
mean value μU, given by equation (7.21).

μU = n1n2/2   (7.21)

From Excel, μU = 28 (see Cell H25). Therefore, if there is no difference in performance


between the old and new training methods, how likely is it that we could end up, by
mere chance, with an observed value of U1 as large as 47? This is equivalent to asking
how likely it is for U2 to be as small as 9? This problem can be solved exactly, but we
will use a normal approximation to provide a solution to this problem.
Find Zcal
If the total number of pairwise comparisons is large enough (n1n2 = 7 × 8 = 56 > 20) we can approximate the Mann–Whitney U distribution with a normal distribution given by equation (7.22):

Zcal = (Ucal − μU ± 0.5)/σU   (7.22)

where the standard deviation is given by equation (7.23):

σU = √(n1n2(n1 + n2 + 1)/12)   (7.23)

The value of Zcal is corrected for continuity by subtracting 0.5 if H1: > 0 or adding 0.5 if H1: < 0. From Excel, μU = 28 (see Cell H25), σU = 8.6410 (see Cell H26), and Zcal is given by equation (7.22).

Zcal = (Ucal − μU + 0.5)/σU = (9 − 28 + 0.5)/8.6410 = −2.1410 (see Cell H27)

Calculate critical values


Identify the region of rejection using the p-value method. The p-value can be found from Excel by using the NORM.S.DIST() function. From Excel, the lower one tail p-value = 0.0161 (see Cell H29). Does the test statistic lie in the region of rejection? Compare the chosen significance level (α) of 5% (or 0.05) with the calculated lower one tail p-value of 0.0161.

5 Make decision
We will reject H0 and accept H1 given the lower one tail p-value (0.0161) < α (0.05).

❉ Interpretation Based on the data, there is sufficient evidence to indicate, at a 5%


significance level, that the performance has improved. Note that if we modify the level of
significance to 1% then the decision would be a borderline decision.

Excel solution using the critical test statistic


The solution procedure is exactly the same as for the p-value except that we use the critical test statistic value to make a decision. The calculated test statistic Zcal = −2.1410 (see Cell H27). Calculate the critical test statistic, Zcri. The critical Z values can be found from Excel by using the NORM.S.INV() function, lower one tail Zcri = −1.65 (see Cell H28). Does the test statistic lie within the region of rejection? Compare the calculated and critical Z values to determine which hypothesis statement (H0 or H1) to accept. We observe that Zcal lies in the lower rejection zone (−2.1410 < −1.65) and we accept H1.

❉ Interpretation Based on the data, there is sufficient evidence to indicate, at a 5%


significance level, that the performance has improved.
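
The normal approximation can be verified from the summary statistics alone. The sketch below is a cross-check in Python (an assumption: the scipy library is available; the raw examination scores sit in the spreadsheet of Figure 7.16) and reproduces equations (7.21)–(7.23) from n1 = 7, n2 = 8, and Ucal = 9.

from math import sqrt
from scipy.stats import norm

n1, n2, u = 7, 8, 9                         # sample sizes and Ucal (Cells H14, H15, H24)
mu = n1 * n2 / 2                            # equation (7.21): 28
sigma = sqrt(n1 * n2 * (n1 + n2 + 1) / 12)  # equation (7.23): 8.6410
z = (u - mu + 0.5) / sigma                  # equation (7.22), continuity corrected for H1: < (z = -2.1410)
p = norm.cdf(z)                             # lower one tail p-value = 0.0161

Given the raw scores, scipy.stats.mannwhitneyu(sample1, sample2, alternative='less') returns U and a p-value directly.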

Figure 7.18 illustrates the relationship between the p-value and the test statistic: the lower one tail p-value = NORM.S.DIST(−2.1410, TRUE) is the area under the standard normal curve lying to the left of Zcal = −2.1410, inside the 5% rejection region bounded by Zcri = −1.65.

Figure 7.18

Small number of pairwise comparisons (n ≤ 10)

For a small number of paired comparisons (n = n1n2 ≤ 10) we use tables to calculate an exact value of the critical test statistic (Ucri) or an exact p-value based upon P(U ≤ 9).

The theory suggests that if the null hypothesis is true then the U test statistic will be centred at μU = 28, with the critical regions identified in Figure 7.19: U2 = 9 falls below Ucri = 13 in a rejection (accept H1) zone, while U1 = 47 lies in the opposite rejection zone, matching Zcal = −2.1410 on the Z scale.

Figure 7.19

The evidence suggests that the performance has improved.

Dealing with data ties


If we find data with the same number value then we can deal with this problem by allocat-
ing the average tie value to each shared data value. In this situation we would then have to
use the normal approximation with the standard deviation σU adjustment given by equa-
tion (7.24).

σU = √[ (n1n2/((n1 + n2)(n1 + n2 − 1))) × ( ((n1 + n2)³ − (n1 + n2))/12 − Σj=1..g (tj³ − tj)/12 ) ]   (7.24)

Where g = number of ties and tj = the number of tied ranks in group j.

Note The Mann–Whitney U test is equivalent statistically to the Wilcoxon rank sum test.

Student exercises
X7.18 What assumptions need to be made about the type and distribution of the data when
the Mann–Whitney test is used?
X7.19 Two groups of randomly-selected students are tested on a regular basis as part of
professional appraisals that are conducted on a two-year cycle by a leading financial
services company based in London. The first group has 8 students, with their sum of
the ranks equal to 65, and the second group has 9 students. Is there sufficient evidence
to suggest that the performance of the second group is better than the performance of
the first group (assess at 5%)?
X7.20 The sale of new homes is tied closely to the level of confidence within the financial
markets. A developer builds new homes in two European countries (A and B) and is

concerned that there is a direct relationship between the country and the interest rates
obtainable to build properties. To provide answers the developer decides to undertake
market research to see what interest rates would be obtainable if he decided to borrow
€300,000 over 20 years from 5 financial institutions in country A and 8 financial
institutions in country B. Based upon the data in Table 7.22 do we have any evidence
to suggest that the interest rates are significantly different?

A: 10.20 10.97 10.63 10.70 10.50 10.30 10.65 10.25 10.75 11.00
B: 10.60 10.80 11.40 10.90 11.10 11.20 10.89 10.78 11.05 11.15 10.85 11.16 11.18

Table 7.22

■ Techniques in practice
TP1 CoCo S. A. is concerned about the time taken to react to customer complaints and has
implemented a new set of procedures for its support centre staff. The customer service director
has decided that there is no evidence for the population distribution to be normally distributed
and has directed that a suitable test is applied to the sample to assess whether the new target
mean time for responding to customer complaints is 28 days (Table 7.23).

20 33 33 29 24 30
40 33 20 39 32 37
32 50 36 31 38 29
15 33 27 29 43 33
31 35 19 39 22 21
28 22 26 42 30 17
32 34 39 39 32 38

Table 7.23

(a) Describe the test to be applied with stated assumptions.


(b) Conduct the required test to assess whether evidence exists for the mean time to respond
to complaints to be greater than 28 days.
(c) What would happen to your results if the population mean time to react to customer
complaints changes to 30 days?

TP2 Bakers Ltd are currently undertaking a review of the delivery vans used to deliver prod-
ucts to customers. The company runs two types of delivery van (type A, recently purchased,
and type B, at least three years old), which are supposed to be capable of achieving 20 km per
litre of petrol. A new sample has now been collected as shown in Table 7.24.

(a) Assuming that the population distance travelled does not vary as a normal distribution, is
there any evidence to suggest that the two types of delivery van differ in mean distance
travelled?
(b) Based upon your analysis, is there any evidence that the new delivery vans meet the
mean average of 20 km per litre?

A B A B
17.68 15.8 26.42 34.8
18.72 36.1 25.22 16.8
26.49 6.3 13.52 15.0
26.64 12.3 14.01 28.9
9.31 15.5 33.9
22.38 40.1 27.1
20.23 20.4 16.8
28.80 3.7 23.6
17.57 13.6 29.7
9.13 35.1 28.2
20.98 33.3

Table 7.24

TP3 Skodel Ltd is developing a low calorie lager for the European market with a mean
designed calorie count of 43 calories per 100 ml. The new product development team are
having problems with the production process and have collected two independent random
samples to assess whether the target calorie count is being met (do not assume that the popu-
lation variables are normally distributed) (Table 7.25).

A B A B
49.7 39.4 45.2 34.5
45.9 46.5 40.5 43.5
37.7 36.2 31.9 37.8
40.6 46.7 41.9 39.7
34.8 36.5 39.8 41.1
51.4 45.4 54.0 33.6
34.3 38.2 47.8 35.8
63.1 44.1 26.3 44.6
41.2 58.7 31.7 38.4
41.4 47.1 45.1 26.1
41.1 59.7 47.9 30.7

Table 7.25

(a) Describe the test to be applied with stated assumptions.


(b) Is the production process achieving a similar mean number of calories?
(c) Is it likely that the target average number of calories is being achieved?

■ Summary
In this chapter we have explored the concept of hypothesis testing for category data using the chi-square distribution and extended the parametric tests to the case of non-parametric tests (or so-called distribution-free tests), which do not require the assumption of the population (or sample) distributions being normal. This chapter adopted the simple five-step procedure described in Chapter 6 to aid the solution process and focused on the application of Excel to solve the data problems.
The main emphasis is placed on the use of the p-value, which measures the probability of obtaining a result at least as extreme as the sample result if the null hypothesis (H0) were true. Thus, if the measured p-value > α (alpha) then we would accept H0 and the result is not statistically significant; if the p-value < α we would reject H0 in favour of H1. Remember, the value of the p-value will depend on whether we are dealing with a two or one tail test. So take extra care with this concept as this is where most students slip up.
The second part of the decision-making described the use of the critical test statistic in
making decisions. This is the traditional textbook method which uses published tables to
provide estimates of critical values for various test parameter values.
In the case of the chi-square test we looked at a range of applications, including: testing
for differences in proportions, testing for association, and testing how well a theoretical
probability distribution fits collected sample data.
In the case of non-parametric tests we looked at a range of tests, including: sign test for
one sample, two paired sample Wilcoxon signed rank test, and two independent samples
Mann–Whitney test. In the case where we have more than two samples then we would
have to use techniques, such as the Kruskal–Wallis test or Friedman test depending upon
whether we are dealing with independent or dependent samples respectively. These tests
are described in the online workbook ‘Factorial experiments’.
Figure 7.20 provides a diagrammatic representation of the decisions required to decide
on which test to use to undertake the correct hypothesis test.
The key questions are:
1. What are you testing: difference or association? For non-parametric tests we are
dealing with ordinal and/or non-normal distributions, while the chi-square test will
test for association.
2. What is the type of data being measured? For non-parametric tests we are dealing
with ordinal data and categorical data for the chi-square test of association.
3. Can we assume that the population is normally distributed? For both types of tests we
are not assuming that the population distribution is normal.
4. How many samples? In Figure 7.20 we are dealing with one and two sample tests.

Figure 7.20 summarizes these decisions as a tree:

• One sample; data ordinal and/or non-normal → sign test.
• Two samples; data ordinal and/or non-normal → independent samples: Mann–Whitney U test; dependent samples: Wilcoxon signed rank sum test.
• Two samples; categorical data or proportions → testing association: chi-square test of association; testing proportions: independent samples: chi-square test of proportions; dependent samples: McNemar's test.

Figure 7.20

■ Key terms
Chi-square test; Chi-square test for independent samples; Chi-square test of association; Contingency table; Expected frequency; Goodness-of-fit test; Mann–Whitney U test; McNemar's test; Observed frequency; Rank; Sign test; Test statistic; Wilcoxon signed rank sum test

■ Further reading
Textbook resources
1. Whigham, D. (2007) Business Data Analysis using Excel. Oxford: Oxford University Press.
2. Lindsey, J. K. (2003) Introduction to Applied Statistics: A Modelling Approach (2nd edn).
Oxford: Oxford University Press.

Web resources
1. StatSoft Electronic Textbook http://www.statsoft.com/textbook/stathome.html (accessed
25 May 2012).

2. HyperStat Online Statistics Textbook http://davidmlane.com/hyperstat/index.html


(accessed 25 May 2012).
3. Eurostat—website is updated daily and provides direct access to the latest and most com-
plete statistical information available on the European Union (EU), the EU Member States, the
Euro-zone and other countries http://epp.eurostat.ec.europa.eu (accessed 25 May 2012).
4. Economagic—contains international economic data sets http://www.economagic.com
(accessed 25 May 2012).
5. The International Statistical Institute (ISI) glossary of statistical terms provides definitions
in a number of different languages http://isi.cbs.nl/glossary/index.htm (accessed 25 May 2012).
8 Linear correlation and regression analysis

» Overview «
In this chapter we will explore methods that define possible relationships, or associations,
between two interval, or ordinal, data variables. The issue of measuring the association
between two nominal data variables was explored under cross tabulation and the chi-square
distribution. When dealing with two data variables we can explore visually the possibility of an
association by plotting a scatter plot of one variable against another variable. Visually, this will
help to decide whether or not an association exists and the possible form of the association,
for example linear or non-linear. The strength of this association can then be assessed either
by calculating Pearson’s correlation coefficient for interval data or Spearman’s rank order
correlation coefficient for ordinal data. If the scatter plot suggests a possible association then
we can use least squares regression to fit this model to the data set. In this text we will focus on
linear relationships, but we have included sections introducing non-linear and multiple linear
regression analysis. Excel can be used to calculate most of the terms using specific functions
and we can access a data analysis macro called regression to calculate all the terms we would
need to undertake the regression analysis described within this overview.

» Learning objectives «
On successful completion of the module you will be able to:
» understand the meaning of simple linear correlation and regression analysis;
» apply a scatter plot to represent visually a possible relationship between two data variables;
» calculate Pearson’s correlation coefficient for interval data and provide meaning to this
value;
» calculate Spearman's rank correlation coefficient for ordinal ranked data and provide
meaning to this value;
» fit a simple linear regression model to the two data variables to be able to predict a
dependent variable using an independent variable;

» fit this simple linear model to the scatter plot;


» estimate the reliability of this model fit to the dependent variable using the coefficient of
determination and provide meaning to this value;
» apply suitable inference tests to the simple linear model fit (t-test and F test);
» construct a confidence interval to the population parameter estimate;
» assess whether the model assumptions have been violated in ways that would undermine confidence in the model's application to the data set;
» extend the linear case to fitting a non-linear relationship and linear multiple regression
models for interval data;
» assess reliability and conduct inference tests to test suitability of the predictor variable(s);
» solve problems using Microsoft Excel.

8.1 Linear correlation analysis


It seems obvious that sometimes there will be relationships between various sets of data
and our aim is to discover if a relationship exists, and, if so, how strong it is. In this section
we shall:

• apply a scatter plot to represent visually a possible relationship between two data
variables;
• understand the meaning of simple linear correlation analysis;
• calculate Pearson’s correlation coefficient for interval data and provide meaning to
this value;
• calculate Spearman’s rank correlation coefficient for ordinal ranked data and
provide meaning to this value;
• undertake an inference test on the value of the correlation coefficients (r and rs) being
significant.

8.1.1 Scatter plots


Scatter plots are similar to line graphs in that they use horizontal and vertical axes to plot data points. However, they have a very specific purpose: scatter plots show how much one variable is affected by another. The relationship between two variables is called their correlation. Scatter plots usually consist of a large body of data. The closer the data points come to forming a straight line when plotted, the higher the correlation between the two variables, or the stronger the relationship.

Scatter plot A scatter plot is a plot of one variable against another variable.
Spearman's rank correlation coefficient Spearman's rank correlation coefficient is applied to data sets when it is not convenient to give actual values to variables, but one can assign a rank order to instances of each variable.

Example 8.1
A large manufacturing firm with some 8000 employees has designed a training programme that is supposed to increase the production of employees. The personnel manager decides

to examine this claim by analysing the data results from the first group of 20 employees that
attended the course.
Table 8.1 provides the data set for the % change in production (y) measured against a range
of production values (x).

Employee number | Production, x | % change in production, y
1 47 4.2
2 71 8.1
3 64 6.8
4 35 4.3
5 43 5.0
6 60 7.5
7 38 4.7
8 59 5.9
9 67 6.9
10 56 5.7
11 67 5.7
12 57 5.4
13 69 7.5
14 38 3.8
15 54 5.9
16 76 6.3
17 53 5.7
18 40 4.0
19 47 5.2
20 23 2.2

Table 8.1

At this stage it is important to define what we mean by a dependent and an independent variable:

(a) Dependent variable—the variable that we wish to predict, in this case % change in production (variable y)
(b) Independent variable—in general, labelled as variable x or, in this case, the production variable. The independent variable provides the basis for calculating the value of the dependent variable.

Dependent variable A dependent variable is what you measure in the experiment and what is affected during the experiment.
Independent variable An independent variable is the variable you have control over, what you can choose and manipulate.

As a first stage to the analysis, the scatterplot would be plotted out, which, as indicated in Figure 8.1, involves plotting each pair of values as a point on a graph. As can be seen from the scatterplot there would seem to be some form of relationship; as productivity increases then there is a tendency for % change in production to increase. The data, in fact, would indicate a positive relationship. As we will see in the next section, it is possible to describe this relationship by fitting a line or curve to the data set.

Figure 8.1 Scatter plot of % change in production (y) against production (x)

In Figure 8.2 we modified the y axis to run from 0 to 25 instead of 0 to 9. We also changed one point to illustrate the case of outliers. However, before we do this, just note how much impact the change in resolution (i.e. the change in scale for y) has on the perceived pattern of the data.

Note When scatterplots are used, like any other visualization method, make sure that
the right resolution (the y-axis range) is used.
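
For readers working outside Excel, the sketch below shows one way Figure 8.1 could be reproduced in Python (an assumption: the matplotlib library is available) from the Table 8.1 data; changing the y-axis limits demonstrates the resolution effect just described.

import matplotlib.pyplot as plt

x = [47, 71, 64, 35, 43, 60, 38, 59, 67, 56, 67, 57, 69, 38, 54, 76, 53, 40, 47, 23]
y = [4.2, 8.1, 6.8, 4.3, 5.0, 7.5, 4.7, 5.9, 6.9, 5.7, 5.7, 5.4, 7.5, 3.8, 5.9, 6.3, 5.7, 4.0, 5.2, 2.2]

plt.scatter(x, y)                        # each (x, y) pair becomes one point
plt.xlabel('Production, x')
plt.ylabel('% change in production, y')
plt.ylim(0, 9)                           # try (0, 25) to see the flattening effect described above
plt.show()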

Figure 8.2 Identifying outliers: the same scatter plot with the y axis running from 0 to 25 and one point standing far above the rest

What are outliers?

Outliers An outlier is an observation in a data set which is far removed in value from the others in the data set.
Regression coefficient A regression coefficient is a measure of the relationship between a dependent variable and an independent variable.

The scatter plot can be used to identify possible outliers within the data set. We can see in Figure 8.2 the same data set as in Figure 8.1, but with one data value of y (= 22.4) which is far greater than the other data point y values.
The outlier could have undue influence on the value of the correlation and regression coefficients estimated in fitting a model to the data set.
One of the solutions to this problem is to delete the value from the data set. Sometimes it might be that the outlier is not a true outlier but an extreme value of whatever you are measuring, or it could be that the data set consists of a small set of data values from the

population being measured and the problem is due to having a small sample. There is no
widely accepted method on how to deal with outliers. Some researchers use quantitative
methods to exclude outliers that lie beyond ±1.5 standard deviations around the mean
value. To decide on what type of relationship exists between the two variables (x, y) then
we need to provide a numerical method to assess the strength of this potential relation-
ship, rather than rely on just the scatter plot.
The next three sections will explore three methods that can be used to measure the
relationship between data values: covariance, Pearson’s coefficient of correlation, and
Spearman’s rank correlation coefficient.

8.1.2 Covariance
A measure that tells us if the variables are jointly related is called covariance. Usually this
implies that we measure if the two variables move together. The covariance takes either
positive or negative values, depending on if the two variables move in the same or oppo-
site direction. If the covariance value is zero, or close to zero, then the two variables do not
move closely together at all. Equation (8.1) defines the sample covariance.

cov(x, y) = Σ(x − x̄)(y − ȳ)/(n − 1)   (8.1)

As will be explained shortly, the covariance is an important building block for calculat-
ing the coefficient of correlation between two variables.

Note Alternative notation for sample covariance cov(x, y) is sxy.

Example 8.2
Reconsider the data set in Example 8.1.

Excel solution—Example 8.2

Figure 8.3 shows the Excel solution to calculate covariance.

Pearson's coefficient of correlation Pearson's correlation coefficient measures the linear association between two variables that have been measured on interval or ratio scales.
Covariance Covariance is a measure of how much two variables change together.

➜ Excel solution
X: Cells C4:C23 Values
Y: Cells D4:D23 Values
Covariance = Cell D25 Formula: =COVARIANCE.P(C4:C23, D4:D23)
Sample covariance = Cell D25 Formula: =COVARIANCE.S(C4:C23, D4:D23)

Figure 8.3

Note From Excel, the sample covariance is 18.05, implying that both variables are moving in the same direction (indicated by the positive value). A major flaw with the covariance is that it can take any value, so you are unable to measure the relative strength of the relationship. For this value of 18.05 we do not know whether it represents a strong or weak relationship between x and y. To measure this strength we would use the correlation coefficient or the coefficient of determination (COD).
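
As a cross-check on the Excel covariance functions, a short sketch in Python (an assumption: the numpy library is available) applied to the Table 8.1 data gives the same figure.

import numpy as np

x = np.array([47, 71, 64, 35, 43, 60, 38, 59, 67, 56, 67, 57, 69, 38, 54, 76, 53, 40, 47, 23])
y = np.array([4.2, 8.1, 6.8, 4.3, 5.0, 7.5, 4.7, 5.9, 6.9, 5.7, 5.7, 5.4, 7.5, 3.8, 5.9, 6.3, 5.7, 4.0, 5.2, 2.2])

sample_cov = np.cov(x, y, ddof=1)[0, 1]  # matches =COVARIANCE.S(): 18.05
pop_cov = np.cov(x, y, ddof=0)[0, 1]     # matches =COVARIANCE.P(): divides by n rather than n - 1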

8.1.3 Pearson’s correlation coefficient, r


The sample correlation coefficient that can be used to measure the strength of a linear
relationship is Pearson’s product moment correlation coefficient, r, which is defined by
equation (8.2).

r = sxy/(sxsy)   (8.2)

where sx is the sample standard deviation of the sample variable x, sy is the sample standard deviation of the sample variable y, and sxy is the sample covariance between variables x and y. An alternative equation representing Pearson's correlation coefficient is given by equation (8.3):

r = (1/n) Σ [(x − x̄)(y − ȳ)/(sxsy)]   (8.3)

where sx is the sample standard deviation of the sample variable x, sy is the sample standard deviation of the sample variable y, and n is the number of paired (x, y) data values.

Coefficient of determination (COD) The proportion of the variance in the dependent variable that is predicted from the independent variable.
Linear relationship Simple linear regression aims to find a linear relationship between a response variable and a possible predictor variable by the method of least squares.

Note
(a) If r lies between –1 ≤ r ≤ −0.5 or 0.5 ≤ r ≤ 1 (large association).
(b) If r lies between –0.5 ≤ r ≤ −0.3 or 0.3 ≤ r ≤ 0.5 (medium association).
(c) If r lies between –0.3 ≤ r ≤ −0.1 or 0.1 ≤ r ≤ 0.3 (small association).

Example 8.3
Reconsider the data set in Example 8.1.

Excel solution—Example 8.3


Figure 8.4 illustrates the Excel solution for calculating Pearson’s correlation coefficient, r.

Figure 8.4

➜ Excel solution
X: Cells C4:C23 Values
Y: Cells D4:D23 Values
Pearson r = Cell C25 Formula: =PEARSON(C4:C23,D4:D23)
Pearson r = Cell C26 Formula: =CORREL(C4:C23, D4:D23)
Pearson r = Cell C27 Formula: =COVARIANCE.S(C4:C23,D4:D23)/(STDEV.S(C4:C23)*
STDEV.S(D4:D23))

❉ Interpretation From Excel, the sample correlation coefficient is equal to +0.89. This
would indicate a fairly strong positive linear association (or relationship) between the value of
the % change in production (y) and the value of the original production values (x), confirming
the impression from the scatter plot in Figure 8.1.

It should be noted that if you include the outlier illustrated in Figure 8.2 then the value
of the correlation coefficient (r) would reduce to 0.3 and would suggest very little correla-
tion between the two variables (x, y).
What does the value of ‘r’ not indicate?

1. Correlation only measures the strength of a relationship between two variables but
does not prove a cause and effect relationship.
(a) Medical research suggests a strong correlation between the consumption of
alcohol and alcohol-induced liver disease. In this situation we have a cause
and effect situation where increased alcohol consumption increases the risk of
developing liver disease.
(b) But do we have a cause and effect between the amount of petrol sold and the
consumption of ice cream during the summer months? In this case the increase
in petrol consumption and ice cream sales is owing to the fact that it is summer
and (i) the holiday season has started and (ii) the temperature is increasing.
(c) Even though we do not have a cause and effect between the variables it is
possible that the association found might lead to what the true cause might be.
For example, a new survey found that the more time people spent watching
television the fatter they became. It could be that unemployed people spend
more time watching television and, at the same time, they cannot afford to eat a
healthy diet. In this case employment status would be the real cause. Remember,
it is usually more complicated than this simple example and the value of a
dependent variable may depend on more than just one independent variable.
2. A value of r ≈ 0 would indicate no linear relationship between x and y, but this may
indicate that the true form of the relationship is non-linear.

In Figure 8.5 we observe that, in general, as x increases, y increases. The correlation between x and y would be positive in this case.

Figure 8.5 Scatter plot—example of positive correlation

In Figure 8.6 the data point pattern goes from a high value on the y-axis down to a high
value on the x-axis—the variables have a negative correlation.
A perfect positive correlation is given the value of 1 and a perfect negative correlation
is given the value of −1. In reality the value of the correlation will lie between −1 and +1.
Figure 8.6 Scatter plot—example of negative correlation

A similar straight line pointing downwards is an example of perfect negative correlation, r = −1.

Figure 8.7 Scatter plot—example of perfect positive correlation, r = +1

Figure 8.8 illustrates what the scatterplot would look like for a correlation value of −0.47.

Figure 8.8 Scatter plot for a correlation of −0.47

For example, Figure 8.1 is the scatter plot for % change in production against production which, as we already know, suggests that as x increases, y increases, and the values are increasing in the same direction. We'll now show how to calculate Pearson's correlation coefficient, r, using a formula approach in Excel.

Example 8.4
Reconsider the data set in Example 8.1.

Excel solution—Example 8.4


Many textbooks show different formulae for the correlation coefficient. One of them is:

r = (Σxy − ΣxΣy/n) / √[(Σx² − (Σx)²/n) × (Σy² − (Σy)²/n)]   (8.4)

Equation 8.4 is a modified version of equation (8.3). We will use this formula to demon-
strate how to calculate the correlation coefficient in Excel using this equation. Figure 8.9
illustrates the Excel solution.

Figure 8.9

➜ Excel solution
x: Cells C4:C23 Values
y: Cells D4:D23 Values
xy Cell E4 Formula: =C4*D4
Copy formula down E4:E23
x^2 Cell G4 Formula: =C4^2
Copy formula down G4:G23
y^2 Cell I4 Formula: =D4^2
Copy formula down I4:I23
n = Cell D26 Formula: =COUNT(C4:C23)
ΣX = Cell D27 Formula: =SUM(C4:C23)
ΣY = Cell D28 Formula: =SUM(D4:D23)

ΣXY = Cell D29 Formula: =SUM(E4:E23)


ΣX^2 = Cell D30 Formula: =SUM(G4:G23)
Σy^2 = Cell D31 Formula: =SUM(I4:I23)
r = Cell D33
Formula: =(D29−D27*D28/D26)/SQRT((D30−D27^2/D26)*
(D31−D28^2/D26))

From Excel: n = 20, Σx = 1064, Σy = 110.80, Σxy = 6237.50, Σx² = 60352, and Σy² = 653.44.
Substituting these values into equation (8.4) gives r = 0.89 (see cell D33).
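
Equation (8.4) is also simple enough to evaluate directly in any language. A minimal check in Python using the sums above:

from math import sqrt

n, sum_x, sum_y = 20, 1064, 110.8
sum_xy, sum_x2, sum_y2 = 6237.5, 60352, 653.44
r = (sum_xy - sum_x * sum_y / n) / sqrt((sum_x2 - sum_x ** 2 / n) * (sum_y2 - sum_y ** 2 / n))
# r = 0.89, agreeing with Cell D33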
As expected, we get the same value of 0.89 as calculated by Excel functions =PEARSON ()
or =CORREL (). We still have not examined how significant this linear correlation is, i.e. do
the conclusions we made about the sample data apply to the whole population? In order
to do this we need to conduct a hypothesis test. The end result will confirm if the same
conclusion applies to the whole company (population) and, more specifically, at what
level of significance.

8.1.4 Testing the significance of linear correlation between the two variables
In Example 8.3 we found that the correlation coefficient was 0.89, which indicates a
strong correlation between the two variables. Unfortunately, the size of the sample to
provide this value of 0.89 is quite small (sample size n = 20 out of a population size of
8000) and we would now like to check whether or not this provides evidence of a sig-
nificant association between the two variables in the overall population, i.e. among all
8,000 employees. Is this value of 0.89 due to sampling error from a population where no
real association exists?
To answer this question we need to conduct an appropriate hypothesis test to check
if the population value of association is zero. In this hypothesis test we are assessing the
possibility that the true population value of association ρ = 0. If the sample size is n ≥10,
then r is distributed as a t distribution with the number of degrees of freedom df = n − 2. It
can be shown that the relationship between r, ρ, and n, is given by equation (8.5):

tcal = (r − ρ)/√((1 − r²)/(n − 2))   (8.5)

As per previous chapters on hypothesis testing, testing of the significance is done in five
short steps.

1 State hypothesis

2 Select the test



3 Define the significance level

4 Extract the relevant statistic, which will consist of three simple calculations:
(a) Calculate the value of r (correlation coefficient);
(b) Calculate the test statistic tcal;
(c) Determine the critical value tcri;

5 Make a decision

Example 8.5
Reconsider the data set in Example 8.1.

Figures 8.10 and 8.11 illustrate the Excel solution

Figure 8.10

➜ Excel solution
x: Cells C4:C23 Values
y: Cells D4:D23 Values

Figure 8.11

➜ Excel solution
Significance level = Cell I10 Value =0.05
Pearson coefficient = Cell I13 Formula: =PEARSON(C4:C23, D4:D23)
n = Cell I15 Formula: =COUNT(B4:B23)
df = Cell I16 Formula: =I15-2
t = Cell I17 Formula: =I13/SQRT((1−I13^2)/(I15−2))
Upper two tail t-critical = Cell I18 Formula: =T.INV.2T(I10, I16)
Lower two tail t-critical = Cell I19 Formula: =−I18

1 State hypothesis
Null hypothesis H0: ρ = 0 no population correlation exists
Alternative hypothesis H1: ρ ≠ 0 correlation exists

2 Select test—in this case we already know that we are testing the significance of linear
correlation and we use a t-test to test for significance.

3 Significance level. Set the significance level of 5% = 0.05 (see cell I10)

4 Extract relevant statistics


(a) Calculate the value of r. From Excel, r = 0.89 (see cell I13).
(b) Calculate test statistic, t.
If H0 true, then ρ = 0 and equation (8.5) simplifies to equation (8.6):

tcal = r/√((1 − r²)/(n − 2))   (8.6)

with n − 2 degrees of freedom.

Note We note that the alternative hypothesis is ≠ and therefore we have not implied a
direction for the value of ρ. All we know is that it could be a significant correlation and that
ρ > 0 or ρ < 0. In this case we have two directions where ρ would be deemed significant and
this is called a two-tailed test.

From Excel: tcal = r/√((1 − r²)/(n − 2)) = 8.29 (see Cell I17)

(c) Using a significance level of 0.05 with 18 degrees of freedom, the critical t value = T.INV.2T(I10, I16) = ±2.1 (see Cells I18 and I19).

5 Make a decision
The calculated value of the t-test statistic (8.29) is greater than the critical t statistic
value (2.1). We conclude that we should reject H0 and accept H1.

❉ Interpretation There is evidence to suggest a significant linear correlation between


the two variables at the level of significance of 0.05.
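
The same five steps can be cross-checked in a few lines of Python (an assumption: the scipy library is available); note that scipy.stats.pearsonr also returns r together with a two tail p-value computed from this t distribution.

from math import sqrt
from scipy.stats import t

r, n = 0.89, 20
df = n - 2                           # 18 degrees of freedom
t_cal = r / sqrt((1 - r ** 2) / df)  # equation (8.6): approximately 8.3
t_cri = t.ppf(1 - 0.05 / 2, df)      # two tail critical value: 2.1
p_value = 2 * t.sf(t_cal, df)        # two tail p-value, far below 0.05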

Note The preceding example illustrates a two tailed test, but one tail tests can exist and
will denote confidence in a specific relationship between X and Y.
For example, in the previous example we are quite certain that we would expect the %
change in production and the original production value of the tested employees to be related
and the association to be positive (as X increases Y increases). In this case we would conduct
H0: ρ = 0 and H1: ρ > 0. If we then tested at 5% then all this 5% would be allocated to the right-
hand tail of the decision graph and tcri would be positive. In this example the Excel solution
would give tcri =T.INV.2T (0.05*2, 18) = +1.73.
If we reversed the test and assumed that the association was negative (as X increases Y
decreases) then the alternative hypothesis would read H0: ρ = 0 and H1: ρ < 0, with a critical t
value of tcri = −1.73.

Test for negative correlation:

H0: ρ = 0
H1: ρ < 0

Left-tailed test.

Test for positive correlation:

H0: ρ = 0
H1: ρ > 0

Right-tailed test.

8.1.5 Spearman’s rank correlation coefficient


When data has been collected which is in ranked form then a ranked correlation coefficient
can be determined. Equation (8.4) provides the value of Pearson’s correlation coefficient
between two data variables, x and y, which are both at an interval level of measurement.
The question then arises what do we do if the data variables are both ranked? In this case
we can show algebraically that equation (8.4) is equivalent to equation (8.7).

rs = 1 − 6Σ(Xr − Yr)²/(n(n² − 1))   (8.7)

Where Xr = rank order value of X, Yr = rank order value of Y, and n = number of paired
observations.
Equation (8.7) is known as Spearman’s rank correlation coefficient. The use of ranks
allows us to measure correlation using characteristics that cannot be expressed quanti-
tatively, but that lend themselves to being ranked. This equivalence between equations
(8.4) and (8.7) will only be true for situations where no tied ranks exist. When tied ranks
exist then you will find discrepancies between the value of r and rs. As with the other

non-parametric tests introduced in this text, ties are handled by giving each tied value the
mean of the rank positions for which it is tied. The interpretation of rs is similar to that for
r, namely: (a) a value of rs near 1.0 indicates a strong positive relationship and (b) a value
of rs near −1.0 indicates a strong negative relationship.

Note
(a) If −1 ≤ rs ≤ −0.5 or 0.5 ≤ rs ≤ 1: large association.
(b) If −0.5 ≤ rs ≤ −0.3 or 0.3 ≤ rs ≤ 0.5: medium association.
(c) If −0.3 ≤ rs ≤ −0.1 or 0.1 ≤ rs ≤ 0.3: small association.

For pairs of data considered to have a strong relationship, just as in the case of
Pearson’s correlation coefficient, you will need to confirm that the value is significant (see
section 8.1.6).

Example 8.6
You are asked to decide whether the statistics rank correlates with the mathematics rank for
seven students provided in Table 8.2. As the information is ranked we use Spearman’s correla-
tion coefficient to measure the correlation between statistics and mathematics ranks.

Student Statistics rank Mathematics rank


1 2 1
2 1 3
3 4 7
4 6 5
5 5 6
6 3 2
7 7 4

Table 8.2

Excel solution—Example 8.6


Figure 8.12 illustrates the calculation of Spearman’s rank correlation coefficient using
Excel.

Figure 8.12

➜ Excel solution
Statistics rank, Xr Cells C5:C11 Values
Mathematics rank, Yr Cells D5:D11 Values
Xr − Yr = Cell F5 Formula: =C5−D5
Copy formula down F5:F11
(Xr − Yr)^2 = Cell H5 Formula: =F5^2
Copy formula down H5:H11
n = Cell F14 Formula: =COUNT (B5:B11)
Squared rank differences = Cell F15 Formula: =SUM(H5:H11)
Spearman’s rank correlation = Cell F17 Formula: =1−6*F15/(F14*(F14^2−1))

❉ Interpretation From Figure 8.12 the Spearman rank correlation is positive, rs = 0.54,
indicating that there is a mild positive rank correlation in this case. If this number was closer
to +1, we would be able to claim much stronger positive rank correlation.

Note Excel does not have a procedure for computing Spearman’s ranked correlation
coefficient directly. However, as the formula for Spearman’s is the same as for Pearson’s
correlation coefficient, we can use it providing that we have first converted the x and y
variables to rankings (Data > Data Analysis > Rank and Percentile).
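Outside Excel, the same calculation can be scripted directly. Below is a minimal Python sketch using the Table 8.2 ranks; we assume the NumPy and SciPy libraries are available.

import numpy as np
from scipy import stats

# Ranks from Table 8.2 (Example 8.6)
statistics_rank = np.array([2, 1, 4, 6, 5, 3, 7])
mathematics_rank = np.array([1, 3, 7, 5, 6, 2, 4])

n = len(statistics_rank)
d = statistics_rank - mathematics_rank
rs = 1 - 6 * np.sum(d**2) / (n * (n**2 - 1))   # equation (8.7), no tied ranks here
print(rs)                                      # 0.5357..., i.e. 0.54

# Cross-check with SciPy's built-in version (it also handles tied ranks)
rho, p_value = stats.spearmanr(statistics_rank, mathematics_rank)
print(rho, p_value)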

8.1.6 Testing the significance of Spearman’s rank correlation coefficient, rs
Having found a sample correlation using Spearman’s coefficient, we need to undertake a
hypothesis test to decide whether the true correlation between the population Y and X
values is significant, based upon the sample data values y, x.
The null hypothesis takes the default form that there is no correlation, and the alterna-
tive hypothesis that there is a positive or negative correlation. Not only does the significance
level affect the critical value, but so does the number of values in the sample: the smaller the
sample, the higher the correlation must be for it to be significant. In this hypothesis test
we are assessing the possibility that ρs = 0. If the sample size is n ≥ 10, then rs follows
approximately a t distribution with degrees of freedom (df) = n − 2. It can be shown that
the relationship between rs, ρs, and n is given by equation (8.8):

tcal = (rs − ρs) / √((1 − rs²)/(n − 2))    (8.8)

Example 8.7
Reconsider the data set in Example 8.6 and assess the significance of rs.

Figure 8.13 illustrates the critical value of Spearman’s correlation coefficient.

Figure 8.13

➜ Excel solution
Sig = Cell C5 Value
n = Cell C6 Value
df = Cell C7 Formula: =C6−2
tcri = Cell C8 Formula: =T.INV.2T(C5,C7)
Critical rs = Cell C9 Formula: =C8/SQRT(C8^2+C6−2)

1 State hypothesis
Null hypothesis H0: ρs = 0 no population correlation
Alternative hypothesis H1: ρs ≠ 0 population correlation exists
Two tail test

2 Select test—we already know that this is testing the significance of Spearman’s rank
correlation coefficient

3 Set the significance level of 5% = 0.05 (see cell C5)

4 Extract the relevant statistic

(a) Calculate the value of rs. From Excel, rs = 0.54.


(b) Calculate test statistic from equation (8.8).

If H0 true, then ρs = 0 and equation (8.8) simplifies to equation (8.9).

tcal = rs / √((1 − rs²)/(n − 2))    (8.9)

Note We note that the alternative hypothesis is ≠ and therefore we have no implied
direction for the value of ρs. All we know is that it could be a significant correlation and that
ρs > 0 or ρs < 0. In this case we have two directions where ρs would be deemed significant and
this is called a two-tailed test.

(c) Critical value

The critical value of rs may be found either from a table of values or by calculation,
depending upon the size of the sample, n.

(d) Table of rs values for n ≤ 10:

n                              6      7      8      9      10
Critical rs (5% level)         0.829  0.759  0.738  0.666  0.632

Table 8.3 Critical values of Spearman’s correlation coefficient

(e) If the sample size n ≥ 10, the test statistic is approximated by a t statistic with
n − 2 degrees of freedom. The critical rs value can be found by rearranging
equation (8.9) to make rs the subject of the equation:

rs = t / √(t² + n − 2)    (8.10)

To find the critical rs value: (i) find tcri and (ii) substitute this value for tcri into
equation (8.10) to find the critical rs value. For example, if n = 10 and the significance
level is 5% two tail, then tcri = ±2.31 and the critical value of rs = ±0.63.

Note For n > 20, rs may be treated as normal (0, 1), where

z = rs √(n − 1)    (8.11)

For example, if the significance level is 5% two tail and n = 40, then Zcri = ±1.96 and the criti-
cal value of rs = ±0.314. In the comparison of marks example we have n = 7, significance level
5% two tail, and the table critical rs value is ±0.759.

5 Make a decision
Given that 0.54 < 0.759, the test statistic does not fall in the critical region. Therefore,
we accept H0 and reject H1.

❉ Interpretation At the level of significance of 5%, there is insufficient evidence to


suggest a significant correlation between the two variables.
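The critical value calculation in equation (8.10), and the large-sample normal approximation in equation (8.11), are straightforward to script. The following is a Python sketch; we assume NumPy and SciPy are available.

import numpy as np
from scipy import stats

def critical_rs(n, alpha=0.05):
    """Two-tailed critical value of rs via equation (8.10); use for n >= 10."""
    t_cri = stats.t.ppf(1 - alpha / 2, n - 2)
    return t_cri / np.sqrt(t_cri**2 + n - 2)

print(critical_rs(10))        # ~0.632, matching Table 8.3 for n = 10

# Normal approximation for n > 20, from equation (8.11): z = rs * sqrt(n - 1)
n = 40
print(1.96 / np.sqrt(n - 1))  # ~0.314, the critical rs quoted in the Note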

Student exercises
X8.1 In the course of a survey relating to examination success, you have discovered a high
negative correlation between students’ hours of study and their examination marks.
This is so at variance with common sense that it has been suggested an error has been
made. Do you agree?
X8.2 Construct a scatter plot for the data in Table 8.4 and calculate Pearson’s correlation
coefficient, r. Comment on the strength of the correlation between x and y.

x: 40 41 40 42 40 40 42 41 41 42
y: 32 43 28 45 31 34 48 42 36 38

Table 8.4

X8.3 Display the data given in Table 8.5 in an appropriate form and state how the variables
are correlated.

x: 0 15 30 45 60 75 90 105 120
y: 806 630 643 625 575 592 408 469 376

Table 8.5

X8.4 Table 8.6 indicates the number of vehicles and number of road deaths in ten countries.

Countries Vehicles per 100 population Road deaths per 100,000 population
UK 31 14
Belgium 32 30
Denmark 30 23
France 46 32
Germany 30 26
Irish Republic 19 20
Italy 35 21
Netherlands 40 23
Canada 46 30
USA 57 35

Table 8.6

(a) Construct a scatter plot and comment upon the possible relationship between the
two variables.
(b) Calculate the product moment correlation coefficient between vehicle numbers
and road deaths.
(c) Use your answers to (a) and (b) to comment upon your results.
X8.5 Samples of students’ essays were marked by two tutors independently. The resulting
ranks are shown in Table 8.7.

Tutor A    5 8 1 6 2 7 3 4
Tutor B    7 4 3 1 6 8 5 2

Table 8.7

(a) Calculate the rank correlation coefficient.


(b) State any conclusions that you can draw.
X8.6 The mathematics and statistics examination marks for a group of ten students are
shown in Table 8.8.

Mathematics 89 73 57 53 51 49 47 44 42 38
Statistics 51 53 49 50 48 21 46 19 43 43

Table 8.8

(a) Find the correlation coefficient for the two sets of marks.
(b) Place the marks in rank order and calculate the rank correlation coefficient.
(c) The following is a quotation from a statistics text ‘Rank correlation can be used
to give a quick approximation to the product moment correlation coefficient’.
Comment on this in the light of your results.
X8.7 Three people, P, Q, and R, were asked to place in preference nine features of a house
(A, B, C ... I). Calculate Spearman’s rank order correlation coefficients between the pairs
of preferences, as shown Table 8.9.

A B C D E F G H I
P 1 2 4 8 9 7 6 3 5
Q 1 4 5 8 7 9 2 3 6
R 1 9 6 8 7 4 2 3 5

Table 8.9

How far does this help to decide which pair from the three would be most likely to be
able to compromise on a suitable house?

8.2 Linear regression analysis


In this section we shall extend the concept of measuring association to include a method
for fitting a line equation to a data set. This will allow predictions to be provided for a
dependent variable (y) given an independent (or predictor) variable (x).
Linear regression analysis is one of the most widely used techniques for modelling a
linear relationship between variables and is employed by a wide range of subjects (busi-
ness, economics, psychology, social sciences in general) to enable models to be developed
and levels of confidence to be provided in the model parameters. In later sections we will
extend the linear case to briefly mention non-linear and multiple regression modelling.
Regression models can be fitted to ordinal/categorical data, but this is beyond the scope
of this textbook. Linear regression analysis attempts to model the relationship between
two variables in the form of a straight line relationship, as given by equation (8.12).

ŷ = b0 + b1x + error    (8.12)

Where ŷ is the estimated value of the dependent variable (y) at given values of the inde-
pendent variable (x). As we are here taking a snapshot of the data set, we are effectively
dealing with a sample of a large data set, or the whole population, as we called it in previous
chapters. This implies that the true population relationship is defined by equation (8.13).

Ŷ = β0 + β1X    (8.13)

Regression analysis: regression analysis is used to model the relationship between a dependent variable and one or more independent variables.
Multiple regression model: multiple linear regression aims to find a linear relationship between a dependent variable and several possible independent variables.
Linear regression analysis: simple linear regression aims to find a linear relationship between a response variable and a possible predictor variable by the method of least squares.

The values of constants b0 and b1 are effectively estimates of some true values of β0 and
β1, and we’ll also have to test to see how well they represent these true population values.
In order to determine this relationship the constants b0 and b1 have to be estimated from
the observed values of x and y. To do this, regression analysis utilizes the method of least
squares regression to provide a relationship between b0, b1, and the sample data values
(x, y). The method assumes that the line will pass through the point of intersection of the
mean values of x and y, (x̄, ȳ).
The method then pivots the line about this point until:

(i) The sum of the vertical squared distance of the data points is a minimum
(ii) The sum of the vertical distances of the data points above the line equals those
below the line.

This is described algebraically as:

Σ(y − ŷ)² = minimum

and

Σ(y − ŷ) = 0.

From this concept two ‘normal equations’ are defined:

Σy = nb0 + b1Σx

and

Σxy = b0Σx + b1Σx²

By solving the above equations simultaneously, estimates of the constants b0 and b1 are
determined to give the equation of the line of regression of y on x, where y is the depend-
ent variable and x is the independent variable. The two ‘normal equations’ can be rear-
ranged so that a solution can be obtained as given by equations (8.14) and (8.15).

b1 = (nΣxy − ΣxΣy) / (nΣx² − (Σx)²)    (8.14)

b0 = (Σy − b1Σx) / n    (8.15)

Excel can be used in a number of different ways to undertake regression analysis and
calculate the required coefficients b0 and b1.

1. Excel statistical functions—Excel contains a range of functions that allow a range of
regression coefficient calculations to be undertaken.
2. Excel worksheet functions—standard Excel functions can be used to reproduce the
manual solution, e.g. SUM, SQRT functions.
3. Excel Data Analysis > Regression—this method provides a complete set of solutions.

Least squares: the method of least squares is a criterion for fitting a specified model to observed data. It refers to finding the smallest (least) sum of squared differences between fitted and actual values.

The solution process can be split into a series of steps:

• construct scatter plot to identify model;


• fit model to sample data;
• test model reliability using the coefficient of determination;
• test whether the predictor variables are significant contributors, t-test;
• test whether the overall model is a significant contributor, F test;
• construct a confidence interval for the population slope, β1;
• check model assumptions.

8.2.1 Construct scatter plot to identify model


The first stage in undertaking regression analysis of a data set is to construct a scatter plot.
We can see from the Example 8.1 scatter plot that as x increases, y increases too, and, in
general, the values are increasing in the same direction. If you identify outliers decide
whether you plan to keep or remove them from the data set. Whatever you decide, make
sure you mention your outlier policy in your report.

Example 8.8
Reconsider Example 8.1 and fit the scatter plot as illustrated in Figure 8.14.

Figure 8.14 Scatter plot of % change in production versus production.

From Figure 8.14 we conclude that the % change in production (y-variable) is increas-
ing as the production increases.

8.2.2 Fit line to sample data


Excel contains a number of functions that allow you to calculate directly the values of b0
and b1 in equation (8.12).

Slope: gradient of the fitted regression line.
Assumptions: an assumption is a proposition that is taken for granted.

Example 8.9
Figure 8.15 represents the Excel solution to fitting a regression line to the Example 8.8 data set.

Figure 8.15

The Excel function to calculate the slope, b1, and intercept, b0, is as described next.

➜ Excel solution
x: Cells B5:B24 Values
y: Cells C5:C24 Values
b0 = Cell C27 Formula: =INTERCEPT(C5:C24,B5:B24)
b1 = Cell C28 Formula: =SLOPE(C5:C24,B5:B24)

From Excel: b0 = 0.6712 and b1 = 0.0915. The equation of the sample regression line is
ŷ = 0.6712 + 0.0915x .

❉ Interpretation The regression equation for the example used here is % change in
production = 0.6712 + 0.0915 * production.
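The same coefficients can be obtained outside Excel. The Python sketch below (we assume NumPy) implements equations (8.14) and (8.15) directly and cross-checks them with a built-in least-squares fit; the data arrays are placeholders standing in for the Example 8.1 values in cells B5:C24.

import numpy as np

# Placeholder sample - substitute the Example 8.1 values (cells B5:C24)
x = np.array([20.0, 30.0, 45.0, 55.0, 70.0, 90.0])
y = np.array([2.1, 3.5, 4.6, 5.4, 7.2, 8.9])

n = len(x)
# Equations (8.14) and (8.15), i.e. Excel's SLOPE() and INTERCEPT()
b1 = (n * np.sum(x * y) - np.sum(x) * np.sum(y)) / (n * np.sum(x**2) - np.sum(x)**2)
b0 = (np.sum(y) - b1 * np.sum(x)) / n

# Cross-check with NumPy's degree-1 least-squares polynomial fit
b1_np, b0_np = np.polyfit(x, y, 1)
print(b0, b1)        # manual normal-equation solution
print(b0_np, b1_np)  # should agree to rounding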

For every value of x (production) we can now estimate a value of the % change in pro-
duction. If we plotted these estimated values, they would represent a trend line, or a line
of regression. The calculated trend line has been fitted to the scatter plot as shown in
Figure 8.16.
Observe that not all data points lie on the fitted line. In this case we can also observe an
error (sometimes called a residual or variation) between the data y value and the value of
the line y value at each data point.

Intercept: value of the regression equation (y) when the x value = 0.
Residual: the residual represents the unexplained variation (or error) after fitting a regression model.

Figure 8.16 Scatter plot of % change in production versus production, with the fitted regression line.

This concept of error can be measured using a variety of methods, including: coefficient
of determination (COD), standard error of estimate (SEE), and a range of inference meas-
ures to assess the suitability of the regression model fit to the data set.
An alternative approach to calculating the regression line is to right-click on one of the
data points in the graph and select Add Trendline option from the box (Figure 8.17).

Figure 8.17

Select the following options:

• Trend/Regression Type – Linear.


• Display Equation on chart.
• Display R-squared on chart.

Click Close (see Figure 8.18).


We now get not just the regression line, but the equation, which is identical to the one
we calculated using Excel functions SLOPE () and INTERCEPT ().

Figure 8.18

In order to demonstrate the above points, we’ll show here yet another method of calcu-
lating linear regression using the Excel TREND() function (Figure 8.19).

Figure 8.19 Scatter plot with fitted trend line: y = 0.0915x + 0.6712, R² = 0.7924.

Example 8.10
Use the Excel TREND() function to fit a trend line to the Example 8.1 data set, as illustrated in
Figure 8.20.

Figure 8.20

➜ Excel solution
x: Cells B5:B24 Values
y: Cells C5:C24 Values
Estimated y = Cell D5 Formula: =TREND($C$5:$C$24,$B$5:$B$24,B5)
Copy formula down D5:D24
Error = Cell F5 Formula: =C5−D5
Copy formula down F5:F24

Note The Excel function is =TREND (known_y’s, [known_x’s], [new_x’s], [const]). We


have ignored the constant element here, as it is not relevant. Known_x’s is the set of all known
production values and, as it does not change from cell to cell, we had to put the whole
range as absolute references ($B$5:$B$24). Known_y’s is the set of all known % change in
production values, and, again, as it does not change from cell to cell, we had to put the whole
range as absolute references ($C$5:$C$24). The value of new_x’s changes from cell to cell and
it is therefore left as a relative reference in this formula.

There is obviously a degree of error between observed values of y and those estimated
by the regression line (ŷ). This error, or difference, is known as the residual and is defined
by equation (8.16).

Residual = y − ŷ    (8.16)
These errors, as we will discover shortly, are a very important part of regression analysis.
An alternative is to use some general Excel functions to get even richer data. Excel also
offers an even more comprehensive way to achieve the same task through Excel’s Data
Analysis tool, but we’ll come back to it at the very end of this chapter. Let us return to
the concept of error that we mentioned above, which can be measured using a variety of
methods, including: coefficient of determination (COD), standard error of the estimate
(SEE), and a range of inference measures to assess the suitability of the regression model
fit to the data set.

8.2.3 Sum of squares defined


When conducting regression analysis it is important to be able to identify three important
measures of variation: sum of squares for regression (SSR), sum of squares for error
(SSE), and total sum of squares (SST). Figure 8.21 illustrates the relationship between
these different measures, which shows that the total variation can be split into two parts:
explained and unexplained variation.

Figure 8.21 Understanding the relationship between SST, SSR, and SSE.

Regression sum of squares (SSR) is sometimes called explained variations.

SSR = Σ(ŷᵢ − ȳ)²    (8.17)

Regression error sum of squares (SSE) is sometimes called unexplained variations.

SSE = Σ(yᵢ − ŷᵢ)²    (8.18)

Regression total sum of squares (SST) is sometimes called the total variation:

SST = Σ(yᵢ − ȳ)²    (8.19)

The total sum of squares is equal to the regression sum of squares plus the error sum of
squares.

SST = SSR + SSE    (8.20)

Sum of squares for regression (SSR): the SSR measures how much variation there is in the modelled values.
Sum of squares for error (SSE): the SSE measures the variation in the modelling errors.
Total sum of squares (SST): the SST measures how much variation there is in the observed data (SST = SSR + SSE).
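As a quick illustration of equations (8.17)–(8.20), the following Python sketch (we assume NumPy; the data arrays are placeholders) computes the three sums of squares and confirms that SST = SSR + SSE.

import numpy as np

# Placeholder sample - substitute your own x, y values
x = np.array([20.0, 30.0, 45.0, 55.0, 70.0, 90.0])
y = np.array([2.1, 3.5, 4.6, 5.4, 7.2, 8.9])

b1, b0 = np.polyfit(x, y, 1)
y_hat = b0 + b1 * x

SST = np.sum((y - y.mean())**2)      # total variation, equation (8.19)
SSR = np.sum((y_hat - y.mean())**2)  # explained variation, equation (8.17)
SSE = np.sum((y - y_hat)**2)         # unexplained variation, equation (8.18)
print(SST, SSR + SSE)                # equal up to rounding, equation (8.20)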

8.2.4 Regression assumptions


The four assumptions of regression are as follows: (1) linearity, (2) independence of
errors, (3) normality of errors, and (4) variance constant.

1. Linearity
Linearity assumes that the relationship between the two variables is linear. To assess
linearity, the residuals (or errors) are plotted against the independent variable, x. Excel
Data > Data Analysis > Regression will create this plot automatically if requested (see sec-
tion 8.2.10). From Figure 8.22 we observe that there is no apparent pattern between the
residuals and x. Furthermore, the residuals are evenly spread out about error equal to zero.

Figure 8.22 Residuals versus x.

Independence of errors: independence of errors means that the distribution of errors is random and not influenced by or correlated to the errors in prior observations. The opposite of independence is called autocorrelation.
For this example a line fit to the data set would appear appropriate. If the scatter plot
suggests that the relationship is non-linear then you would have to identify and fit this
relationship to your data set (see section 8.3.1).

2. Independence of errors
The independence of errors assumption requires that there is no correlation between the
residuals of the regression analysis. This effect is called serial correlation and can be
measured using the Durbin–Watson statistic. Another expression for serial correlation,
though usually used in a different context, is autocorrelation. For Example 8.8, the data
has been collected at the same time period and we do not need to consider serial
correlation (independence of errors) as a problem. This topic is beyond the scope of this
textbook.

Durbin–Watson: the Durbin–Watson statistic is a test statistic used to detect the presence of autocorrelation (a relationship between values separated from each other by a given time lag) in the residuals (prediction errors) from a regression analysis.
Autocorrelation: autocorrelation is the correlation between members of a time series of observations and the same values shifted at a fixed time interval.

3. Normality of errors
The normality of errors assumption requires that the measured errors (or residuals) are
normally distributed for each value of the independent variable, X. If this assumption is
violated then the result can produce unrealistic estimations for the regression coefficients
b0, b1, and the measures of correlation. Furthermore, any inference tests or confidence
intervals calculated are dependent upon the errors being normally distributed.

Normality of errors: the normality of errors assumption states that the errors should be normally distributed - technically, normality is necessary only for the t-tests to be valid; estimation of the coefficients only requires that the errors be identically and independently distributed.

This assumption can be evaluated using two graphical methods: (i) construct a histogram
for the errors against x and check whether the shape looks normal, or (ii) create a normal
probability plot of the residuals (available from the Excel Data > Data Analysis >
Regression). Figure 8.23 illustrates a normal probability plot based upon the Example 8.8
data set.

Figure 8.23 Normal probability plot of the residuals.

We observe that the relationship is fairly linear and we conclude that the normal
assumption is not violated.
This problem can occur if the dependent and/or independent variables are not nor-
mally distributed or the linearity assumption is violated. Like the ANOVA F test and t-test,
regression analysis is robust against departures from this assumption. As long as the dis-
tribution of error against X is not very different from a normal distribution then the infer-
ences on β0 and β1 will not be seriously affected (see sections 8.2.6 and 8.2.7).

4. Variance constant
The final assumption of equal variance (or homoscedasticity) requires that the variance
of the errors is constant for all values of X.
This implies that the variability of the Y values is the same for all values of X and this
assumption is important when making inferences about β0 and β1 (see sections 8.2.6 and
8.2.7). If there are violations of this assumption then we can use data transformations or
weighted least-squares to attempt to improve model accuracy. In Figure 8.24 we observe
that the error is not growing in size as the value of X changes. This plot provides evidence
that the variance assumption is not violated. If the value of error changes greatly as the
value of X changes then we would assume that the variance assumption is violated.

Figure 8.24 Residual plot of residuals versus x.

Equal variance (homoscedasticity): the homogeneity of variance (homoscedasticity) assumption states that the error variance should be constant.

If any of the four assumptions are violated, we can only conclude that linear regression
is not the best method for fitting to the data set, and we will need to find an alternative
method or model.

Note See section 8.2.10 for the Data > Data Analysis > Regression menu method to
check regression assumptions.

8.2.5 Test model reliability


Of the many methods used to assess the reliability of a regression line we shall discuss: (a)
residuals and the standard error of the estimate (SEE) and (b) the coefficient of deter-
mination (COD).

Example 8.11
Reconsider the data set in Example 8.1 and test the linear regression model reliability.

Figure 8.25 illustrates the Excel solution to calculate the coefficient of determination
and standard error of the estimate.

Figure 8.25

Standard error of the estimate (SEE): the standard error of the estimate (SEE) is an estimate of the average squared error in prediction.

➜ Excel solution
x: Cells B5:B24 Values
y: Cells C5:C24 Values
b0 = Cell C27 Formula: =INTERCEPT(C5:C24,B5:B24)
b1 = Cell C28 Formula: =SLOPE(C5:C24,B5:B24)

ŷ = Cell D5 Formula: =$C$27+$C$28*B5
Copy formula down D5:D24
Error = Cell F5 Formula: =C5−D5
Copy formula down F5:F24
Sum = Cell F25 Formula: =SUM(F5:F24)
SEE = Cell C31 Formula: =STEYX (C5:C24,B5:B24)
COD = Cell C32 Formula: =RSQ (C5:C24,B5:B24)

(a) Residuals and the standard error of the estimate (SEE)


As before, we calculated errors, or residuals, and put them this time in column F. By plot-
ting the regression line onto the scattergram, as shown in Figure 8.16, it reveals that many
of the observed data points do not lie on the line. The values in Figure 8.25 column F show
by how much the fitted data (estimated) are adrift from the observed data. Plotting the
residuals against the x values provides information about possible modifications of, or
areas of caution in applying the regression line and the data from which it was formed. In
plotting the residuals we would look for a random, even scatter about the zero residual line.
This would indicate that the derived line was relatively free from error (see Figure 8.22).
If we were to determine all the residual values for the data then an interpretation of the
error in predicting y from the regression equation could be obtained. This would be the
standard deviation of actual y values from the predicted y values and is known as the SEE:

SEE = √(SSE/(n − 2))    (8.21)

Equation (8.21) can be rewritten to give equation (8.22) by using equation (8.18).

SEE = √(Σ(yᵢ − ŷᵢ)²/(n − 2))    (8.22)

This provides a measure of the scatter of observed values around the corresponding
estimated y values on the regression line and is measured in the same units as y. From
Excel (Figure 8.25), the Excel function STEYX() returns the value of the standard error
of the estimate (SEE) of y on x as 0.675.

❉ Interpretation We are approximately 68% confident that the true value will lie within
±0.675 of the estimated value of ŷ. To be approximately 95% certain, we would need to take
2 SEE, i.e. 2 × 0.675, which gives the interval of ±1.35 around ŷ.

(b) Coefficient of determination (COD)


Given that the regression line effectively summarizes the relationship between x and y,
then the line will only partially explain the variability of the observed values, and this has
been seen when we examined the residuals. In fact, as we already explained, the total

variability of Y can be split into two components: (i) variability explained or accounted
for by the regression line, and (ii) unexplained variability, as indicated by the residuals.
It should be noted that the correlation coefficient provides a measure of the strength of
the association between two variables but the issue of interpreting the value is a problem.
After all what do we mean by strong, weak, or moderately associated? Fortunately, we do
have a method that is easier to interpret: the COD.
The COD is defined as the proportion of the total variation in y that is explained by the
variation in the independent variable x. This definition is represented by equations (8.23)
and (8.24):

COD = Regression sum of squares / Total sum of squares = SSR/SST    (8.23)

COD = Σ(ŷᵢ − ȳ)² / Σ(yᵢ − ȳ)²    (8.24)

By further manipulation of equation (8.24) it can be shown that the coefficient of deter-
mination (COD) is given by equation (8.25):

COD = (correlation coefficient )2 = r 2 (8.25)

In Excel, the coefficient of determination (COD) is labelled R-squared. From Excel


(Figure 8.25), the Excel function RSQ() calculates the coefficient of determination return-
ing the value as 0.792 (see cell C32). The value of COD ranges between 0 and 1.

❉ Interpretation From Excel the coefficient of determination is 0.79 or 79%. This value
tells us that 79% of the variation in the % change in production is explained by the variation
in the production variable. Conversely, this implies that 21% of the sample variability in the
% change in production is due to factors other than production and is not explained by the
regression line.

Note The coefficient of determination equation (8.23) can be rewritten in terms of SSE
and SST by making use of the relationship SSR = SST − SSE:

r² = (SST − SSE)/SST = 1 − SSE/SST
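The two reliability measures are easy to reproduce in code. This Python sketch (we assume NumPy; the data arrays are placeholders) mirrors Excel's STEYX() and RSQ() functions.

import numpy as np

# Placeholder sample - substitute your own x, y values
x = np.array([20.0, 30.0, 45.0, 55.0, 70.0, 90.0])
y = np.array([2.1, 3.5, 4.6, 5.4, 7.2, 8.9])

n = len(x)
b1, b0 = np.polyfit(x, y, 1)
residuals = y - (b0 + b1 * x)

SSE = np.sum(residuals**2)
SST = np.sum((y - y.mean())**2)
SEE = np.sqrt(SSE / (n - 2))   # equation (8.22), Excel STEYX()
COD = 1 - SSE / SST            # r-squared, Excel RSQ()
print(SEE, COD)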

8.2.6 The use of the t-test to test whether the predictor variable is a significant contributor
The true relationship between X and Y (Y = β0 + β1X) is estimated from the sample rela-
tionship ( ŷ = b0 + b1x ). To determine the existence of a significant relationship between

X and Y variables we will require the application of a t-test to check whether β1 is equal to
zero. This is essentially a test to determine if the regression model is usable.
If the slope is significantly different from zero then we can use the regression equation
to predict the dependent variable for any value of the independent variable. If the slope
is zero then the independent variable has no predictive value, since for every value of the
independent variable the predicted value of the dependent variable would be the same.
Therefore, when this is the situation we would not use the equation to make predictions.
In order to test the significance of the relationship between y and x, we test the null
hypothesis:

H0: β1 = 0 no linear relationship

This implies that there is no change in the value of the variable y as the variable x
increases in size.
The alternative hypothesis states that the value of the y variable changes as the value of
the x variable increases in size.

H1: β1 ≠ 0 linear relationship exists and the relationship is not zero (two tail
test)

For simple linear regression which has one independent variable, the F test is equiva-
lent to the t-test (see section 8.2.7). In this hypothesis test we are assessing the possibility
that β1 = 0. In order to test this hypothesis we will calculate a measure of the difference
between the value of the population slope (β1) and the sample slope (b1). The value of b1
will change as we collect different samples and this would create a sampling distribution
for the b1 term. It can be shown that if the regression assumptions hold, then the popula-
tion of all possible values of the term b1 will be normally distributed with mean of β1 and
with a standard deviation given by equation (8.26).

σb1 = σ / √SSX    (8.26)

Equation (8.26) can be rewritten as equation (8.27) if we note that the standard error of
the estimate sxy is a point estimate of σ and sb1 is a point estimate of σb1.

sb1 = sxy / √SSX = SEE / √(Σ(x − x̄)²)    (8.27)

Where SEE is the standard error of the estimate given by Excel function STEYX().
It can be shown that the relationship between b1, β1, and tcal, is given by equation (8.28),
which follows a t distribution with the number of degrees of freedom df = n − 2.

tcal = (b1 − β1) / sb1    (8.28)

Example 8.12
Reconsider the Example 8.1 data set and test the significance of the predictor variable (x).

Figures 8.26 and 8.27 illustrate the Excel solution to undertake the required hypothesis
Student’s t-test.

Figure 8.26

➜ Excel solution
x: Cells B5:B24 Values
y: Cells C5:C24 Values
b0 = Cell C27 Formula: =INTERCEPT (C5:C24,B5:B24)
b1 = Cell C28 Formula: =SLOPE (C5:C24,B5:B24)
ŷ = Cell D5 Formula: =$C$27+$C$28*B5
Copy formula down D5:D24
(x − xbar)^2 = Cell F5 Formula: =(B5−$K$15)^2
Copy formula down F5:F24

Figure 8.27

➜ Excel solution
Level = Cell K12 Value =0.05
SEE = Cell K14 Formula: =STEYX (C5:C24,B5:B24)
Average x = Cell K15 Formula: =AVERAGE (B5:B24)
SSX = Cell K16 Formula: =SUM (F5:F24)
Sb1 = Cell K17 Formula: =K14/SQRT(K16)
t = Cell K18 Formula: =C28/K17
n = Cell K20 Formula: =COUNT (A5:A24)
k = Cell K21 Value =1
df = Cell K22 Formula: =K20−(K21+1)
Upper tcri = Cell K23 Formula: =T.INV.2T(K12,K22)
Lower tcri = Cell K24 Formula: =−K23
Two tail p-value = Cell K25 Formula: =T.DIST.2T(K18,K22)

1 State hypothesis
H0: β1 = 0 no linear relationship.
H1: β1 ≠ 0 linear relationship exists and since we believe that the relationship is
not zero (two tail test).

2 Select the test—we know that this is the t-test for testing if the predictor variable is a
significant contributor.

3 Set significance level of 5% = 0.05 (cell K12)

4 Extract relevant statistics


1. Calculate the test statistic, tcalc
If H0 is true then β1 = 0 and equation (8.28) simplifies to equation (8.29).

tcal = b1 / sb1    (8.29)

The test statistic follows a t distribution with n − 2 degrees of freedom. From Excel,
t = 8.29 with 18 degrees of freedom (see cells K18 and K22).
2. Critical t value, tcri
We can now test to see if this sample t value would result in accepting or rejecting
H0. From Excel we see that the critical t value = ± 2.1 at a 5% significance level (see
cells K23, K24). At this stage we need to remember that the hypothesis test implies
no perceived direction for H1 to be accepted. The Excel function used to calculate the
critical t value is T.INV.2T, as shown in the Excel solution above.

5 Make a decision
As tcal > tcri (8.3 > 2.1), then the test statistic lies in the rejection zone for H0. Therefore,
reject H0 and accept H1. Alternatively, as the p-value < α (1.47E-7 < 0.05), reject H0
and accept H1.

❉ Interpretation  The sample data provides evidence that a significant relationship


may exist between the two variables (% change in production and production) at a 5%
significance level.

Note A similar approach can be used to test if the constant term b0 is a significant
contributor to the value of y. This requires the t-test statistic tcal = b0/sb0 to be
calculated and compared with the critical t value.
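Using only the summary figures quoted for this data set (n = 20, b1 = 0.0915, SEE = 0.675872845, and Σx = 1064, Σx² = 60352, which are quoted in section 8.2.9 for the same data), the test statistic can be reproduced with a few lines of Python; we assume NumPy and SciPy are available.

import numpy as np
from scipy import stats

n, b1, SEE = 20, 0.0915, 0.675872845
sum_x, sum_x2 = 1064.0, 60352.0

SSX = sum_x2 - sum_x**2 / n               # sum of squared deviations of x
s_b1 = SEE / np.sqrt(SSX)                 # equation (8.27)
t_cal = b1 / s_b1                         # equation (8.29), ~8.29

df = n - 2
t_cri = stats.t.ppf(1 - 0.05 / 2, df)     # Excel T.INV.2T(0.05, 18), ~2.1
p_value = 2 * stats.t.sf(abs(t_cal), df)  # Excel T.DIST.2T, ~1.5E-07
print(t_cal, t_cri, p_value)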

8.2.7 The use of the F test to test whether the predictor variable is a significant contributor
An alternative to the t-test is the F test, which can be used to determine whether the pre-
dictor variable is a significant contributor to the dependent variable (y). Recall in section
8.2.3 that the total deviation in y, SST, can be partitioned between the deviation explained
by the regression, SSR and the unexplained deviation, SSE. If the regression model fits the
sample data then we would find that the value of the deviation explained by the regression
(SSR) to be larger than the value of the unexplained deviation (SSE).
If we take the mean squares and divide by their degrees of freedom, then the ratio mean
square due to regression (MSR)/mean square due to error (MSE) follows an F distribu-
tion with k degrees of freedom in the numerator and n − (k + 1) degrees of freedom in the
denominator as defined by equations (8.30)–(8.33).

Fcal = Mean square for model / Mean square for actual error    (8.30)

Fcal = (SSR/k) / (SSE/(n − (k + 1)))    (8.31)

Fcal = MSR / MSE    (8.32)

Fcal = (COD/k) / ((1 − COD)/(n − (k + 1)))    (8.33)

Where n is the total number of paired values and k is the number of predictor variables.
If the regression line fits the sample data (little scatter about line) then the value of F will
be quite large. Conversely, if the regression line does not fit the sample data (increased
scatter about line) then the value of F will approach zero.

Example 8.13
Reconsider the data in Example 8.1 and conduct an F test to test whether or not the
independent (predictor) variable is a significant contributor to the dependent variable.

Figures 8.28 and 8.29 illustrate the Excel solution.

Figure 8.28

➜ Excel solution
x: Cells B5:B24 Values
y: Cells C5:C24 Values
b0 = Cell C27 Formula: =INTERCEPT (C5:C24,B5:B24)
b1 = Cell C28 Formula: =SLOPE (C5:C24,B5:B24)
ŷ = Cell D5 Formula: =$C$27+$C$28*B5
Copy formula down D5:D24

Figure 8.29

➜ Excel solution
Level = Cell H12 Value =0.05
n = Cell H14 Formula: =COUNT (A5:A24)
k = Cell H15 Value =1
COD = Cell H16 Formula: =RSQ (C5:C24,B5:B24)
F-cal = Cell H17 Formula: =(H16/H15)/((1−H16)/(H14−(H15+1)))
df num = Cell H19 Formula: =H15
df denom = Cell H20 Formula: =H14−(H15+1)
F-critical = Cell H21 Formula: =F.INV.RT(H12,H19,H20)
p-value = Cell H22 Formula: =F.DIST.RT(H17,H19,H20)

1 Hypothesis test
H0: β1 = 0 no linear relationship.
H1: β1 ≠ 0 linear relationship exists and we believe that the relationship is not
zero (two tail test).

2 Select the test—which we know is F test, testing whether the predictor variable is a
significant contributor

3 Significance level. We choose 5% = 0.05 (see cell H12)

4 Extract relevant statistic


Calculate the test statistic, F. From Excel, Fcal = 68.71 (see cell H17).
Critical F value, Fcri.
Significance = 5% = 0.05.
Number of degrees of freedom in the numerator, dfn = k = 1 (see cell H19).
Number of degrees of freedom in the denominator, dfd = n − (k + 1) = 20 − (1 + 1) = 18 (see cell H20).
From Excel, Fcri = 4.41 (see cell H21) and p-value = 1.47E-7 (see cell H22).

5 Make a decision
Figure 8.30 illustrates the shape of the F distribution and the relationship between the
critical F value and H0 and H1 being true.

Figure 8.30 F distribution with df1 = 1, df2 = 18, showing the critical value Fcri = 4.4 and the test statistic Fcal = 68.7.

As Fcal > Fcri (68.7 > 4.4), we reject H0 and accept H1. Alternatively, use the p-value
(1.47E-7) < 0.05 and conclude that the alternative hypothesis is accepted.

❉ Interpretation It can be concluded that the model is useful in predicting the %


change in production at a 5% significance level.

Note For a one predictor model the t-test and F test are essentially the same test. In
fact, for a one predictor regression model the relationship between F and t is t = √F.
Check: F = 68.7, so t = √68.7 = 8.29.
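The F statistic and its p-value can also be checked in a few lines of Python (we assume SciPy), using the COD = 0.7924 reported for this data set:

from scipy import stats

COD, n, k = 0.7924, 20, 1

F_cal = (COD / k) / ((1 - COD) / (n - (k + 1)))  # equation (8.33), ~68.7
F_cri = stats.f.ppf(1 - 0.05, k, n - (k + 1))    # Excel F.INV.RT(0.05, 1, 18), ~4.41
p_value = stats.f.sf(F_cal, k, n - (k + 1))      # Excel F.DIST.RT, ~1.47E-07
print(F_cal, F_cri, p_value)
print(F_cal**0.5)                                # ~8.29, confirming t = sqrt(F)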

The format for the analysis of variance (ANOVA) table is as shown in Table 8.10.

Source       df           Sum of squares   Mean square (variance)      F
Regression   k            SSR              MSR = SSR/k                 F = MSR/MSE
Error        n − k − 1    SSE              MSE = SSE/(n − k − 1)
Total        n − 1        SST

Table 8.10 ANOVA table

Where degrees of freedom (df ), sum of squares for regression (SSR), sum of squares for
error (SSE), total sum of squares (SST), mean square due to regression (MSR), mean
square due to error (MSE), and F is the statistic.
The completed ANOVA table is part of the Excel Data > Data Analysis > Regression
solution described in Section 8.2.10.
For Example 8.13, k = 1, n = 20, and the ANOVA table would be as presented in
Table 8.11.

Source       df    Sum of squares   Mean square (variance)   F
Regression   1     31.3855          31.3855                  68.71 (p = 1.47E−07)
Error        18    8.2225           0.4568
Total        19    39.608

Table 8.11 ANOVA table



8.2.8 Confidence interval estimate for slope β1


An alternative method to test whether or not a significant relationship exists is to find a
confidence interval and see if the H0: β1 = 0 is included within the confidence interval. If
we rearrange equation (8.28) for β1 we obtain equation (8.34).

β1 = b1 ± t × sb1    (8.34)

This equation implies two border values for β1 with the confidence interval lying
between these two values.

Example 8.14
Reconsider Example 8.1 and calculate a 95% confidence interval for the slope coefficient of the
predictor variable.

Figure 8.31 illustrates the Excel solution.

Figure 8.31

➜ Excel solution
x: Cells B5:B24 Values
y: Cells C5:C24 Values
b0 = Cell C27 Formula: =INTERCEPT(C5:C24,B5:B24)
b1 = Cell C28 Formula: =SLOPE (C5:C24,B5:B24)
ŷ = Cell D5 Formula: =$C$27+$C$28*B5
Copy formula down D5:D24

(x − xbar)^2 = Cell F5 Formula: =(B5−AVERAGE($B$5:$B$24))^2


Copy formula down F5:F24
n = Cell C31 Formula: =COUNT (A5:A24)
level = Cell C32 Value =0.05
df = Cell C33 Formula: =C31−2
tcri = Cell C34 Formula: =T.INV.2T(C32, C33)
SYX = Cell C35 Formula: =STEYX(C5:C24,B5:B24)
SSX = Cell C36 Formula: =SUM(F5:F24)
Sb1 = Cell C37 Formula: =C35/SQRT(C36)
Lower CI = Cell C38 Formula: =C28−C34*C37
Upper CI = Cell C39 Formula: =C28+C34*C37

❉ Interpretation From Excel, the 95% confidence interval for Example 8.14 is between
0.068% and 0.115%. Because these values are above zero we conclude that there is a
significant linear relationship between the two variables (% change in production (y) and
production (x)). If the interval had included zero then you would conclude no significant
linear relationship exists. If we rescale the numbers, we can say that the confidence interval
states that for a production increase of 100, the % change in production is estimated to
increase by at least 6.8 but no more than 11.5.
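Equation (8.34) can be evaluated directly from the summary figures quoted for this data set; a short Python sketch (we assume NumPy and SciPy) reproduces the interval:

import numpy as np
from scipy import stats

# Summary figures quoted in the text for this data set
n, b1, SEE = 20, 0.09162, 0.675872845
sum_x, sum_x2 = 1064.0, 60352.0

SSX = sum_x2 - sum_x**2 / n
s_b1 = SEE / np.sqrt(SSX)
t_cri = stats.t.ppf(1 - 0.05 / 2, n - 2)

lower = b1 - t_cri * s_b1   # equation (8.34), lower limit, ~0.068
upper = b1 + t_cri * s_b1   # upper limit, ~0.115
print(lower, upper)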

8.2.9 Prediction interval for an estimate of Y


The regression equation ( ŷ = b0 + b1x ) provides a relationship that can then be used to
provide an estimate of y based upon an x value. For example, we may want to know what
the % change in production would be if the production value was set at 30. The prediction
interval for y at a particular value of x = xp is given by equation (8.35).

ŷ − e < y < ŷ + e    (8.35)

Where the error term is calculated using equation (8.36)

e = tcri × SEE × √(1 + 1/n + n(xp − x̄)² / (nΣx² − (Σx)²))    (8.36)

Example 8.15
Fit a prediction interval at xp = 30 to the data set from Example 8.1.

Figures 8.32 and 8.33 illustrate the Excel solution to calculate the predictor interval.

Figure 8.32

➜ Excel solution
x Cells B5:B24 Values
y Cells C5:C24 Values
x^2 Cells D5 Formula: = B5^2
Copy formula down D5:D24

Figure 8.33

➜ Excel solution
b0 = Cell G4 Formula: =INTERCEPT (C5:C24,B5:B24)
b1 = Cell G5 Formula: =SLOPE (C5:C24,B5:B24)
n = Cell G7 Formula: =COUNT (A5:A24)
level = Cell G8 Value =0.05
df = Cell G9 Formula: =G7−2
tcri = Cell G10 Formula: =T.INV.2T(G8,G9)
x = Cell G11 Value =30
Xbar = Cell G12 Formula: =AVERAGE (B5:B24)
Y^ = Cell G13 Formula: =G4+G5*G11
Σx = Cell G14 Formula: =SUM (B5:B24)

Σx^2 = Cell G15 Formula: =SUM (D5:D24)


SEE = Cell G16 Formula: =STEYX (C5:C24,B5:B24)
E = Cell G17 Formula: =G10*G16*SQRT(1+(1/G7)+(G7*(G11−G12)^2)/
(G7*G15−G14^2))
Lower PI = Cell G18 Formula: =G13−G17
Upper PI = Cell G19 Formula: =G13+G17

From Excel: xp = 30, n = 20, significance level = 0.05, tcri = ±2.10092, SEE = 0.675872845,
x̄ = 53.2, Σx = 1064, and Σx² = 60352. Substituting these values into equation (8.36) gives:

e = 2.10092 × 0.675872845 × √(1 + 1/20 + 20 × (30 − 53.2)²/(20 × 60352 − 1064²)) = 1.551353666

Equation (8.35) then gives the 95% prediction interval for xp = 30 to lie between 1.87
and 4.97.

❉ Interpretation Therefore, if the production level was at 30 units then we predict a %
change in production of 3.4%. In fact, we can state a 95% prediction interval of 1.87–4.97%.
This shows that the actual value can vary greatly from the predicted value of 3.4%.
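The same prediction interval can be reproduced in Python (we assume NumPy and SciPy) from the figures quoted above:

import numpy as np
from scipy import stats

n, b0, b1 = 20, 0.6712, 0.0915
SEE, x_bar = 0.675872845, 53.2
sum_x, sum_x2 = 1064.0, 60352.0
x_p = 30.0

y_hat = b0 + b1 * x_p                      # point estimate, ~3.42
t_cri = stats.t.ppf(1 - 0.05 / 2, n - 2)
e = t_cri * SEE * np.sqrt(1 + 1 / n +
        n * (x_p - x_bar)**2 / (n * sum_x2 - sum_x**2))   # equation (8.36)
print(y_hat - e, y_hat + e)                # ~1.87 to ~4.97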

8.2.10 Excel data analysis regression solution


Most of the previous calculations (though not all of them) could be done in Excel auto-
matically by pressing one single button. The Excel ToolPak Regression solution provides a
complete set of solutions, including:

• calculating equation of line;


• calculating measures of reliability;
• checking that the predictor is a significant contributor (t and F tests);
• calculating a confidence interval for β0 and β1.

Example 8.16
Reconsider the data set from Example 8.1 and use the Excel Data Analysis tool to fit the linear
regression model, and calculate the required reliability and significance test statistics.

Select Data > Data Analysis > Select Regression.

• Y Range: C5:C24.
• X Range: B5:B24.
• Confidence interval: 95%.

• Output Range: E3.


• Click on residuals, residual plots, and normal probability plot.

Figure 8.34

Click OK.
Excel will now calculate and output the required regression statistics and charts, as
illustrated in Figure 8.35.

Figure 8.35

Note We can also equate the printout in Figure 8.35 with the terms from Section 8.2.3.
They are as follows:
Cell F7 = R-Square
Cell F9 = standard error of estimate (SEE)
Cell F14 = dfR
Cell F15 = dfE
Cell F16 = dfT
Cell G14 = SSR
Cell G15 = SSE
Cell G16 = SST
Cell H14 = MSR (this is the result of G14/F14)
Cell H15 = MSE (this is the result of G15/F15). If you take a square root of this value, you get
the standard error of the estimate, as per Cell F9.
Cell I14 = F-statistic (this is the result of H14/H15)
F19 = b0
F20 = b1
G19 = sb0 (standard error for b0)
G20 = sb1 (standard error for b1)
H19 = t-stat, or t-calc, for b0 (this is the result of F19/G19)
H20 = t-stat, or t-calc, for b1 (this is the result of F20/G20)

From Figure 8.35 we can identify the required regression statistics (Table 8.12).

Calculation Regression statistic Excel cell


Fit model to sample data b0 = 0.67119 Cell F19
b1 = 0.09162 Cell F20
Test model reliability using the COD = 0.79 Cell F7
coefficient of determination SEE = 0.68 Cell F9
Test whether the predictor variables
are significant contributors—t-test:
H0 : β0 = 0 vs. H1 : β0 ≠ 0 t = 1.11, p = 0.28 Cells H19 and I19
H0 : β1 = 0 vs. H1 : β1 ≠ 0 t = 8.29, p = 1.4732E-07 Cells H20 and I20
Calculate the test statistics and
p-values using Excel—F test:
H0 : β1 = 0 vs. H1 : β1 ≠ 0 F = 68.7, p = 1.473E-07 Cells I14 and J14
Confidence interval (CI) for β0 and β1:
95% CI for β0 − 0.63 to 1.95 Cells J19 and K19
95% CI for β1 0.07 to 0.11 Cells J20 and K20

Table 8.12
COD, coefficient of determination; SEE, standard error of estimate.

What is the p-value and how is it used and interpreted in Excel? This is the same statistic
we have already used extensively in previous chapters on hypothesis testing. The Excel
solution provides the t-test values for each contributor (b0 and b1) and includes a statistic
called the p-value. The p-value measures the chance (or probability) of achieving a test
statistic equal to or more extreme than the sample value obtained, assuming H0 is true. As
we already know, to make a decision we compare the calculated p-value with the level of
significance (say 0.05 or 5%) and if p < 0.05 then we would reject H0.
The application of the t-test tells us that the predictor variable (production, x) is a sig-
nificant contributor to the value of y (% change in production) given that p (= 1.473E-
7) < 0.05. Furthermore, it is observed that the constant is not a significant contributor
to the value of y (p = 0.28 > 0.05) and this would suggest that the model should be of
the form ŷ = b1x. This can be achieved easily in Excel by using constant = 0 in the Data
Analysis > Regression solution.
The F test confirms that the predictor variable is a significant contributor to the value
of the dependent variable (p = 1.473E-7 < 0.05). This confirms the t-test solution and we
conclude that there is a significant relationship between the % change in production
and production. Remember that for a one predictor model, t = √F = √68.7 = 8.29.
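If you work in Python rather than Excel, the statsmodels library produces an output very similar to the Data Analysis > Regression printout: coefficients, standard errors, t statistics, p-values, the F test, R-squared, and confidence intervals in one summary. A minimal sketch follows (the data arrays are placeholders; NumPy and statsmodels are assumed to be installed):

import numpy as np
import statsmodels.api as sm

# Placeholder sample - substitute the Example 8.1 values (cells B5:C24)
x = np.array([20.0, 30.0, 45.0, 55.0, 70.0, 90.0])
y = np.array([2.1, 3.5, 4.6, 5.4, 7.2, 8.9])

X = sm.add_constant(x)       # adds the intercept column (b0)
model = sm.OLS(y, X).fit()   # ordinary least squares
print(model.summary())       # the analogue of the Figure 8.35 printout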

The Regression Data Analysis also helps with the checking of some of the assumptions,
namely: linearity, constant variance, and normality, as illustrated in Figures 8.36–8.38.

Figure 8.36
Residual output

Figure 8.37 Plot of residuals against x.
We can see from Figure 8.37 that we have no observed pattern within the residual plot
and we can assume that the linearity assumption is not violated. Furthermore, the resid-
ual, and hence the variance, are not growing in size and are bounded between a high and
low point. From this we conclude that the variance assumption is not violated.

Figure 8.38 Assumption check for normality (normal probability plot).

From the normal probability plot we have a fairly linear relationship and we conclude
that the normality assumption is not violated.

Student exercises
X8.8 In the regression equation ŷ = b0 + b1x, the value of b0 is given by the equation:
A. b0 = (ΣY − b1²ΣX)/n
B. b0 = (ΣY − b1ΣX)/2n
C. b0 = (ΣY − b1ΣX)/n
D. b0 = (ΣY − nΣX)/n
X8.9 In the regression equation ŷ = b0 + b1x, the value of b1 is given by the equation:
A. b1 = (nΣXY² − ΣXΣY)/(nΣX² − (ΣX)²)
B. b1 = (nΣXY − ΣXΣY)/(nΣX² − (ΣX)²)
C. b1 = (nΣXY − ΣXΣY)/(nΣX − (ΣX)²)
D. b1 = (nΣXY − ΣXΣY)/(nΣX² − (ΣX))
Use the ANOVA table (Table 8.13) to answer exercise questions X8.10–X8.12.

ANOVA df SS MS F Significance F
Regression 1 3.76127E + 11 3.76127E + 11 162.7172745 7.34827E-16
Residual 41 94773006578 2311536746
Total 42 4.709E + 11

Table 8.13
df, degrees of freedom; SS, sum of squares; MS, mean sum of squares.

X8.10 Calculate the coefficient of determination (COD):


A. 0.78 B. 1.80 C. 0.80 D. 1.80
X8.11 Calculate the value of Pearson’s correlation coefficient (r):
A. 0.99 B. 1.89 C. 0.11 D. 0.89
X8.12 Calculate the value of the standard estimate of the error (SEE):
A. 84078 B. 84778 C. 48078 D. 48178
X8.13 In 2007 Pronto Ltd ascertained the amount spent on advertising and the
corresponding sales revenue by seven marketing clients (Table 8.14).

Advertising (£000s), x Sales (£000s), y


2 60
5 100
4 70
6 90
3 80
7 105
8 115

Table 8.14

(a) Plot a scatter plot and comment on a possible relationship between sales and
advertising.
(b) Use Excel regression functions to undertake the following tasks:
(i) Fit linear model
(ii) Check model reliability (r and COD)
(iii) Undertake appropriate inference tests (t and F test)
(iv) Check model assumptions (residual and normality checks)
(v) Provide a 95% confidence interval for the predictor variable.
X8.14 Fit an appropriate equation to the data set (Table 8.15) to predict the examination
mark given the assignment mark for 14 undergraduate students.

Assignment 69 42 43 40 100 80 100 90 77 47 68 50 45 41


Examination 77 66 65 65 80 71 78 75 70 60 67 61 59 58

Table 8.15

(a) Plot a scatter plot and comment on a possible relationship between examination
and assignment marks.
(b) Use Excel regression functions to undertake the following tasks:
(i) Fit linear model
(ii) Check model reliability (r and COD)
(iii) Undertake appropriate inference tests (t and F test)
(iv) Check model assumptions (residual and normality checks)
(v) Provide a 95% confidence interval for the predictor variable.

8.3 Some advanced topics in regression analysis


In the previous two sections we have explored methods of measuring and fitting rela-
tionships between one variable and another variable. These one predictor models have
assumed that the relationship between y and x is linear; this simple situation is known as
simple linear regression modelling. In most cases the situation is more complicated and
can include relationships that are non-linear between the dependent and independent
variable, and with a possibility that the dependent variable may also depend upon more
than one independent variable.

8.3.1 Introduction to non-linear regression


In many situations we will find that the relationship between the dependent and inde-
pendent variable is not necessarily linear, but a non-linear relationship. In this section

we shall introduce the concept of non-linear regression via a simple example. Before we
start we need to introduce a series of non-linear relationships between the variable y and
x and their governing equations. Some of the most popular curves that describe the shape
of these relationships are presented in Figures 8.39–8.45.

Figure 8.39 Line y = b0 + b1x (example: y = 2x + 4).

Figure 8.40 Parabola curve y = b2x² + b1x + b0 (example: y = x² + 4x + 2).

Figure 8.41 Hyperbola curve y = b0/x (example: y = 4/x).

Figure 8.42 Exponential curve y = b0·b1^x (example: y = 2.5(2.5)^x).

Figure 8.43 Modified exponential curve y = b2 + b0·b1^x (example: y = 2 − 2.5(0.5)^x).

Figure 8.44 Logistic curve 1/y = b2 + b0·b1^x (example: y = 1/(2 + 3(0.4^x))).
Figure 8.45 Gompertz curve y = b2·b0^(b1^x) (example: y = 2.5(0.3)^(0.5^x)).

Let’s look at just one of these non-linear relationships. Equation (8.37) represents the
equation of a parabola (or polynomial of degree 2):

ŷ = b0 + b1x + b2x²    (8.37)

The values of the parameters b0, b1, and b2 can be determined using least squares
regression by solving equations (8.38)–(8.41).

b0 = (Σx̂⁴Σy − Σx̂²Σx̂²y) / (nΣx̂⁴ − (Σx̂²)²)    (8.38)

b1 = Σx̂y / Σx̂²    (8.39)

b2 = (nΣx̂²y − Σx̂²Σy) / (nΣx̂⁴ − (Σx̂²)²)    (8.40)

where

x̂ = x − x̄    (8.41)

We will use Excel to show how to fit this curve to a data set, calculate the equation of the
line, and calculate the coefficient of determination (though the data set is not shown here,
just the principle of how to use Excel for this purpose).

Example 8.17
Table 8.16 provides the sales and price data collected from a range of discount stores selling
a particular product but using their own discount policy to price the product. The question is,
can we fit an appropriate relationship to predict sales given price?
The solution to this problem consists of identifying the type of relationship between the two
variables. Figure 8.46 illustrates graphically the relationship between sales and price.

Price, x Sales (£000s), y


0.30 100.00
0.40 95.00
0.50 93.00
0.58 90.20
0.60 90.00
0.65 88.00
0.70 85.00
1.10 86.00
1.15 83.00
1.40 82.00
1.80 80.00
2.60 81.00

Table 8.16

Figure 8.46 illustrates a scatter plot for demand (y) against price (x), illustrating a pos-
sible non-linear relationship between the variables y and x.

Figure 8.46 Scatter plot of sales versus price.

From Figure 8.46 we may suggest that the relationship between the two variables is
given by model 1 or model 2.

Model 1 – line fit: y = b0 + b1x.
Model 2 – curve fit: y = b0 + b1/x.

If the relationship was non-linear we may still use linear regression, as long as we are
able to transform the non-linear data to a linear form. The parameters b0 and b1 in model
1 and model 2 can be estimated using the methods described in previous sections, and we
will use the Data Analysis > Regression method to calculate these values (include residual
plot and normal probability plot). Figures 8.47 and 8.48 present the ANOVA results for
both models.
Table 8.17 shows the results of applying least squares regression for model 1 and 2. The
results show that model 2 represents a better fit to the data set than model 1.

Figure 8.47
Regression ANOVA table for model 1: y = b0 + b1x.

Figure 8.48
Regression ANOVA table for model 2: y = b0 + b1/x.

Model Equation COD

1 ŷ = 94.96 − 7.33x 0.66
2 ŷ = 77.56 + 6.95/x 0.96

Table 8.17
COD, coefficient of determination.

❉ Interpretation From Table 8.17, we can see that for the non-linear model 96% of the variations in one variable are explained by variations in the other, while for the linear model only 66% of variations are explained by the model. Clearly, we are better off using the non-linear model: ŷ = 77.56 + 6.95/x.

Note
1. In model 1 the regression is fitted to variable y and variable x.
2. In model 2 the x variable has been transformed to 1/x and the regression is fitted to variable
y and variable 1/x.
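
To make the transformation in point 2 concrete, here is a minimal Python sketch (our addition; the book itself works in Excel) that regresses y on the transformed variable 1/x using the Table 8.16 data. The coefficients should come out close to those reported for model 2 in Table 8.17.

import numpy as np

price = np.array([0.30, 0.40, 0.50, 0.58, 0.60, 0.65,
                  0.70, 1.10, 1.15, 1.40, 1.80, 2.60])
sales = np.array([100.0, 95.0, 93.0, 90.2, 90.0, 88.0,
                  85.0, 86.0, 83.0, 82.0, 80.0, 81.0])

z = 1.0 / price                     # transform so the model is linear in z
b1, b0 = np.polyfit(z, sales, 1)    # fit sales-hat = b0 + b1*z

fitted = b0 + b1 * z
r2 = 1 - np.sum((sales - fitted) ** 2) / np.sum((sales - sales.mean()) ** 2)
print(round(b0, 2), round(b1, 2), round(r2, 2))   # roughly 77.56, 6.95, 0.96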

To complete the solution you would then need to analyse the model 2 ANOVA table results to check whether or not the model 2 parameter terms (b0, b1) are significant contributors to the value of the dependent variable (y) using the Student's t-test (or F test). From Figure 8.48, the two parameter values b0 and b1 are significant contributors to the value of the y variable (p = 3 × 10⁻¹⁶ < 0.05 for b0 and p = 4 × 10⁻⁸ < 0.05 for b1). The final step in the analysis process is to check the model assumptions. From the Data > Data Analysis > Regression results we requested the residual and normal probability plots. Figures 8.49–8.52 compare the results for models 1 and 2.

Model 1 ŷ = 94.96 − 7.33x

Figure 8.49
Model 1: X variable 1 residual plot.

Figure 8.50
Model 1: normal probability plot.

Model 2 ŷ = 77.56 + 6.95/x
Figure 8.51
Model 2: X variable 1 residual plot.

Figure 8.52
Model 2: normal probability plot.

Residual plot comparison—Figures 8.49 and 8.51


(a) Linearity assumption—comparing the residual plots we can see a pattern between
the residual and x variable for model 1, but a more random pattern for model 2. This
suggests the linearity assumption is likely to be violated for model 1 but not model 2.
(b) Variance of errors assumption—the variance seems to be growing in model 1, but
is bounded in model 2. This suggests that model 1 violates the constant variance of
errors assumption but not model 2.

Normal probability plot comparison—Figures 8.50 and 8.52


From Figures 8.50 and 8.52 we can observe that the normal probability plots for models 1 and 2 are linear. This suggests that neither model 1 nor model 2 violates the normality of errors assumption.

Note To fit a curve to a scatter plot using Excel is quite straightforward.


Right click on a data point in the scatter plot and choose the ‘Add Trendline’ option and
select the curve you would like to fit.

For example, if we wanted to fit a polynomial of order 2 to the scatterplot then we would
choose the Polynomial option and select order 2, as illustrated in Figure 8.53.

Figure 8.53

The general equation of a polynomial of order 2 would be: Y = b0 + b1x + b2x². Finally, you can use the Format Trendline > Trendline Options menu to include this equation on the scatter plot together with the value of the coefficient of determination (R²).

8.3.2 Introduction to multiple regression analysis


In many situations it is unlikely that a dependent variable (y) would depend upon only one predictor variable (x), but rather on a number of independent variables (x1, x2, x3, . . ., xn). To solve problems with more than one independent variable we would be required to undertake multiple regression analysis. For example, house prices may depend not only upon the land value but also upon the value of home improvements made to a property. The form of the population regression equation with 'n' independent variables can be written as:

$\hat{y} = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_n X_n + \text{error}$ (8.42)

The multiple regression models can be found using the Excel Data Regression tool to
provide the coefficients, assumption, and reliability checks, and to conduct appropriate
inference tests.
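
As an aside, the same coefficients Excel reports can be reproduced with ordinary least squares on a design matrix. The Python sketch below is our own illustration, not the book's Excel output; for brevity it uses only the first five properties of the Example 8.18 data, so its coefficients will differ from those reported later in Table 8.20.

import numpy as np

y  = np.array([68900.0, 48500.0, 55500.0, 62000.0, 140000.0])   # selling price
x1 = np.array([5960.0, 9000.0, 9500.0, 10000.0, 18000.0])       # land value
x2 = np.array([44967.0, 27860.0, 31439.0, 39592.0, 72827.0])    # home improvements

X = np.column_stack([np.ones_like(x1), x1, x2])   # column of 1s gives the intercept
(b0, b1, b2), *_ = np.linalg.lstsq(X, y, rcond=None)
print(b0, b1, b2)    # y-hat = b0 + b1*x1 + b2*x2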

Example 8.18
Table 8.18 consists of data that has been collected by an estate agent who wishes to model the
relationship between house sales price (£) and the independent variables: land value, LV (£) and
the value of home improvements, IV (£).
In order to fit the model the estate agent selected a random sample of size 20 properties
from the 2000 properties sold in that year (Table 8.18).

Selling price (£), Y Land value (£), X1 Home improvements (£), X2


68900 5960 44967
48500 9000 27860
55500 9500 31439
62000 10000 39592
140000 18000 72827
45000 8500 27317
115000 15000 60000
144000 23000 65000
59000 8100 39117
47500 9000 29349
40500 7300 40166
40000 8000 31679
135800 20000 75000
45500 8000 23454
40900 8000 20897
80000 10500 56248
56000 4000 20859
37000 4500 22610
50000 3400 35948
22400 1500 5779

Table 8.18

The solution process includes:

• constructing a scatter plot to identify relationships between the variables;


• fitting multiple regression models to the sample data;
• checking model assumptions;
• testing model reliability using the multiple coefficient of determination (COD) or
adjusted r2;
• testing whether the predictor variables are a significant contributor to the overall
model, F test;
• testing whether the predictor variables are significant contributors, t-tests;
• providing a 95% confidence interval for the population slopes.

Construct scatter plot to identify possible model


Figure 8.54
Scatter plot of sales price versus land value.


Figure 8.55
Scatter plot of sales price versus the value of home improvements.

x Adjusted r2 Adjusted R squared measures the proportion of the variation in the dependent variable accounted for by the explanatory variables and adjusted for the number of degrees of freedom.

The two scatter plots (Figures 8.54 and 8.55) suggest that a linear model would be appropriate for y vs x1 and y vs x2. It should be noted that in both scatter plots we do have some evidence that possible non-linear models may be more appropriate, given the observation that the data points are starting to decrease in y value at the top range for x. It should

also be noted that the sample sizes are quite small and we will assume that both relation-
ships are linear within the multiple regression model. From this analysis we can identify
three possible models, identified in Table 8.19.

Population Sample
Model 1 Y = β0 + β1X1 y = b0 + b1x1

Model 2 Y = β0 + β2 X 2 y = b0 + b2 x 2

Model 3 Y = β0 + β1X1 + β2 X 2 y = b0 + b1x1 + b2 x 2

Table 8.19

Table 8.20 shows the results of applying least squares regression for models 1, 2, and 3.
The results show that model 3 represents a better fit to the data than models 1 and 2.

Model Equation COD


1 ŷ = 8263 + 6.11x1 84%

2 yˆ = −3703 + 1.83x 2 86%

3 yˆ = −3187 + 3.07x1 + 1.05x 2 92%

Table 8.20
COD, coefficient of determination.

❉ Interpretation From the summary in Table 8.20 we can see that the third model is
the best fit as 92% of variations in selling price are explained by the combined effect of both
the land value and home improvements. Clearly, this is the superior model.

To complete the solution you would then need to check the model assumptions and
undertake appropriate t-tests (or an F test) to test whether the independent variables are significant contributors to the dependent variable. The examples given here serve only as an
illustration to indicate that there is much more depth to the regression analysis technique.

Student exercise
X8.15 An estate agent is interested in developing a model to predict the house sales price
based upon two other variables: size of property and age. His initial analysis suggests
a multiple model regression would be appropriate, with the relationship between the
dependent and independent variables being linear. Table 8.21 presents the data set.
Use Excel Data > Data Analysis > Regression to undertake the following tasks:
(i) Fit the multiple regression model
(ii) Check model reliability

(iii) Undertake appropriate inference tests (F and t-tests)


(iv) Check model assumptions.

Price (£) Square footage Age


205000 2650 13
215000 2664 6
215000 2921 3
199900 2580 4
190000 2580 4
180000 2774 2
156000 1920 1
144900 1710 1
137500 1837 4
127000 1880 8
125000 2150 15
123500 1894 14
117000 1928 18
115500 1767 16
111000 1630 15

Table 8.21

■ Techniques in practice
TP1 Coco S. A. has requested that a local property company undertake an analysis of prop-
erty prices. The initial data collection has been undertaken and independent variables identi-
fied: square feet, age, and local property tax. The Excel regression analysis has been performed
with the results presented in Figures 8.56–8.60.

Figure 8.56

Figure 8.57
SF residual plot.

Figure 8.58
Age residual plot.

Figure 8.59
PT residual plot.

Figure 8.60
Normal probability plot.

(a) State the least squares linear regression equation ŷ = b0 + b1x1 + b2x2 + b3x3.


(b) Comment on model reliability (r and COD).
(c) Is the overall model significant (F test)?
(d) Are the independent variables significant (t-tests)?
(e) Check model assumptions (residual and normality checks).
TP2 Bakers Ltd is concerned at the possible relationship between the amount of fat (grams)
and the number of calories in a popular pie (Table 8.22).

Pie ID Amount of fat (grams) Calories Pie ID Amount of fat (grams) Calories
1 19 410 16 33 597
2 31 580 17 31 583
3 34 590 18 37 589
4 35 570 19 39 640
5 39 640 20 23 456
6 39 680 21 43 660
7 43 660 22 22 448
8 22 465 23 30 577
9 28 567 24 34 594
10 38 610 25 35 590
11 35 576 26 41 638
12 22 434 27 34 560
13 40 690 28 43 660
14 43 660 29 45 680
15 21 435 30 29 587

Table 8.22

(a) Plot a scatter plot and comment on a possible relationship between calories and the
amount of fat in the pies.
(b) Use the Excel data analysis regression tool to undertake the following tasks:
(i) State the least squares regression model equation
(ii) Comment on model reliability (r and COD)
(iii) Is the independent variable significant (F or t-test)?
(iv) Check model assumptions (residual and normality checks).
TP3 Skodel Ltd employs a local transport company to deliver beers to local supermarkets.
To develop better work schedules, the managers want to estimate the total daily travel time
for their drivers’ journeys. Initially, the managers believed that the total daily travel time would
be related closely to the number of miles travelled in making the daily deliveries (Table 8.23).

Journey Miles travelled, x Travel time (hours), y Journey Miles travelled, x Travel time (hours), y
1 100 9.3 11 85 7.4
2 50 4.8 12 62 6.4
3 100 8.9 13 98 8.4
4 100 6.5 14 58 4.9
5 50 4.2 15 73 6.8
6 80 6.2 16 81 7.8
7 75 7.4 17 66 6.2
8 65 6.0 18 72 7.3
9 90 7.6 19 53 4.4
10 90 6.1 20 56 4.6

Table 8.23

(a) Plot a scatter plot and comment on a possible relationship between travel time and
miles travelled.
(b) Use the Excel data analysis regression tool to undertake the following tasks:
(i) State the least squares regression model equation
(ii) Comment on model reliability (r and COD)
(iii) Is the independent variable significant (F or t-test)?
(iv) Check model assumptions (residual and normality checks).

■ Summary
In this chapter we have explored techniques that can be used to explore possible relationships
between two variables using scatter plots and calculating appropriate numerical measures
of association: Pearson and Spearman. The method used will depend upon the type of data
within the data set as described in Table 8.24.

Data type Statistic to measure association


Nominal (or category) Chi-square test (see Chapter 7)
Ordinal (or ranked) Spearman’s rank correlation coefficient
Interval (or ratio) Pearson’s correlation coefficient

Table 8.24 Using an appropriate measure to calculate association

If the initial data exploration shows that we have a possible relationship between y and x
then we can attempt to fit an appropriate model to the data set using least squares regression.
Within the chapter we have explored three methods: (i) fitting a line, (ii) fitting a curve, and
(iii) fitting linear multiple regression model to the data set. Excel will allow you to calculate
the required statistics via its built-in statistical functions or by making use of the data analysis
regression tool, which includes the necessary statistics and appropriate assumption-checking
charts. The solution process consists of the following steps:

1. Construct scatter plot to visually assess the nature of a possible relationship between the
variables
2. Fit line or curve to data set using the identified relationship
3. Calculate reliability statistics (COD and adjusted r2 for multiple regression models)
4. For multiple regression models calculate the F test statistic to see if the combined model
predictor coefficients are a significant contributor to the value of y
5. Conduct appropriate t-tests to check whether each predictor variable is a significant con-
tributor to the value of y
6. Conduct appropriate confidence intervals for the population slope
7. Assess assumption violation.

■ Key terms
Adjusted r2; Assumptions; Autocorrelation; Coefficient of determination (COD); Covariance; Dependent variable; Durbin–Watson; Equal variance; Homoscedasticity; Independence of errors; Independent variable; Intercept; Least squares; Linear regression analysis; Linear relationship; Multiple regression model; Normality; Outliers; Pearson's coefficient of correlation; Regression analysis; Regression coefficient; Residual; Scatter plot; Slope; Spearman's coefficient of correlation; Standard error of the estimate (SEE); Sum of squares for error (SSE); Sum of squares for regression (SSR); Total sum of squares (SST)

■ Further reading
Textbook resources
1. Whigham, D. (2007) Business Data Analysis using Excel. Oxford: Oxford University Press.
2. Lindsey, J. K. (2003) Introduction to Applied Statistics: A Modelling Approach (2nd edn).
Oxford: Oxford University Press.

Web resources
1. StatSoft Electronic Textbook http://www.statsoft.com/textbook/stathome.html (accessed
25 May 2012).
2. HyperStat Online Statistics Textbook http://davidmlane.com/hyperstat/index.html
(accessed 25 May 2012).
3. Eurostat—website is updated daily and provides direct access to the latest and most com-
plete statistical information available on the European Union (EU), the EU Member States, the
Euro-zone and other countries http://epp.eurostat.ec.europa.eu (accessed 25 May 2012).
4. Economagic—contains international economic data sets http://www.economagic.com
(accessed 25 May 2012).
5. The International Statistical Institute (ISI) glossary of statistical terms provides definitions
in a number of different languages http://isi.cbs.nl/glossary/index.htm (accessed 25 May 2012).
9 Time series data and analysis

The aim of this chapter is to provide the reader with a set of tools which can be used in the
context of time series analysis and extrapolation. This chapter will allow you to apply a
range of time series tools that can be used to tackle a number of business and other objectives (in economics, social sciences, and so on). These objectives range from calculating index changes, deflating prices and bringing values to a constant basis, through extrapolating data and business forecasting, to reducing the uncertainty related to future events.

» Overview «
In this chapter we shall look at a range of methods that will be useful in helping us to solve
problems using Excel, including:
» calculating and converting index numbers from one base to another;
» deflating prices and bringing them to a constant value;
» fitting a line to a time series;
» extrapolating the line in the future;
» using moving averages and exponential smoothing as forecasting methods;
» producing forecasts when dealing with the seasonal time series;
» learning how to calculate and interpret forecasting errors;
» learning how to assess the quality of forecasts by inspecting forecasting error;
» calculating the confidence interval for forecasts.

» Learning objectives «
On successful completion of the module, you will be able to:
» understand how to use and recalculate index numbers;
» know how to use indices to deflate prices;

» understand the time series fundamentals;


» inspect and prepare data for forecasting;
» graph the data and visually identify patterns;
» fit an appropriate model using the time series approach;
» understand the concept of smoothing;
» know how to handle seasonal time series;
» use the identified model to provide an extrapolation;
» calculate a measure of error for the model fit to the data set;
» know how to calculate forecasting confidence interval;
» solve time series related problems using the Microsoft Excel spreadsheet.

9.1 Introduction to time series data


Time series data are somewhat different from the majority of data sets that we have cov-
ered in this book. The chapter that deals with linear regression is the closest one to this
one. So, what is a time series? A time series is a variable that is measured and recorded in
equidistant units of time. A good example is inflation. We can record monthly, quarterly,
or annual inflation. All three data sets represent a time series. In other words, it does not
matter what units of time we use; as long as they are consistent and sequential, we will
have time series data. By consistency we mean that we are not allowed to mix the units of
time (daily with monthly data or minute with hourly data, for example). And by sequential
we mean that we are not allowed to skip any data points and have zeros or empty values
for this particular moment in time. Should this happen, we can somehow try to estimate
the missing value by calculating the average of the two neighbouring values or any other
appropriate method. What is the purpose of time series analysis? Well, the main purpose
of the majority of time series analysis methods is to predict the future movements of a var-
iable. In other words, forecasting the future values is the main concern. In order to assess
if the correct forecasting method has been used, a number of other auxiliary methods
have been invented. They all fall into the category of time series analysis. Nevertheless, forecasting remains the main purpose.

x Time series A variable measured and represented per units of time.
x Forecasting A method of predicting the future values of a variable, usually represented as the time series values.
x Time period A unit of time by which the variable is defined (an hour, a day, a month, a year, etc.).

9.1.1 Stationary and non-stationary time series

Example 9.1 consists of two different time series data sets. First of all, from column B in Figure 9.1 we can see that, in this case, we have not specified the time units to which these two time series are referring. In the text and the descriptor of the time series we would certainly do this, but for technical and calculation purposes this is not necessary. We can just use the word time period and mark every observation with the sequential numbers

starting from one onwards. This column will, in fact, become a variable, as we will see in
the pages to follow, though a special kind of variable that contains sequential numbers.
The second point to make here is that by just looking at the data, we can ‘see’ very little.

Example 9.1
The most important lesson here is: when dealing with the time series data, it is mandatory to
visualize the data. Well, let’s just do this. Figure 9.2 illustrates the two time series.

Figure 9.1

Figure 9.2 illustrates a time series plot for the Example 9.1 data set. What jumps out at us
immediately is that one of the time series seems to be moving upwards and the other one
is following some horizontal line. The first is called a non-stationary time series, while
the second one, following a horizontal line, is called a stationary time series.

Figure 9.2
Line plot of the non-stationary and the stationary time series.

Figure 9.2 illustrates a graph of the two time series data sets.
x Non-stationary time series A time series that does not have a constant mean and oscillates around this moving mean.
x Stationary time series A time series that does have a constant mean and oscillates around this mean.

In general, all time series will fall into the first or the second category. A variety of methods have been invented to handle either the stationary or non-stationary time series.

Note Visualization and charting of a time series is not an optional extra, but one of the most essential steps in time series analysis. You can learn a lot about a variable just by looking at the time series graph.

9.1.2 Seasonal time series


In addition to the division mentioned in the previous section, every stationary and non-
stationary time series can be either seasonal or non-seasonal. Again, a variety of methods
exist to treat specifically seasonal time series.

Example 9.2
Here is an example of one seasonal stationary and one seasonal non-stationary time series.
Figure 9.3 illustrates two seasonal time series data sets.

Figure 9.3
Line plot of a seasonal non-stationary and a seasonal stationary time series.
x Seasonal Seasonal is the component of variation in a time series which is dependent on the time of year.
x Non-seasonal Non-seasonal is the component of variation in a time series which is not dependent on the time of year.
x Seasonal time series A time series, represented in units of time smaller than a year, that shows a regular pattern in repeating itself over a number of these units of time.
x Multivariate methods Methods that use more than one variable and try to predict the future values of one of the variables by using the values of other variables.
x Univariate methods Methods that use only one variable and try to predict its future value on the basis of the past values of the same variable.

9.1.3 Univariate and multivariate methods

Besides the division into stationary and non-stationary time series and methods, the methods for handling time series can also be divided into univariate and multivariate methods. Univariate methods take just one single time series and try to produce a forecast for this time series, independently of any other variable. The logic is that the influences of all other variables are already embedded in this single time series, so by extrapolating it into the future, we extrapolate all the implicit influences of the numerous other variables that have influenced this one. A good example would be taking the time series of the level of inventory for a particular product. We know that this inventory depends on many factors, such as the volume of sales (which depends on various market factors), speed of replenishment, and so on. Rather than worrying about all these factors, we can say that they are embedded implicitly in our inventory time series. In other words, the history will tell us which way the future will unfold. This is the major assumption behind univariate time series methods, i.e. the history holds the clues for the future.

The opposite example is if we are trying to predict one variable by relating it to a number of other variables. We can take the example of inflation and try to predict this variable by anticipating how the interest rates will go, what will be the level of individual consumption, institutional investment, and the volume of money on the market, etc. If we have one variable that is dependent on a number of other variables that are treated as independent (often called the predictors), then the use of so-called multivariate methods is appropriate.

Note Sometimes the methods that deal with time series are also divided into causal
or regression methods, and time series methods. This is a bit of an old-fashioned division as
most of the methods have evolved to such a degree of complexity that it is difficult to say
which one belongs where. Nevertheless, this chapter is dedicated only to a set of methods
that belong to the family of time series methods.

9.1.4 Scaling the time series


We already emphasized that when dealing with time series, it is very important to chart
the data and inspect the data set visually.

Example 9.3
Figures 9.4 and 9.5 represent the same time series represented at two different scales.
Figure 9.4
Time series data (y-axis scaled from 8000 to 11000).

Figure 9.5
The same time series data (y-axis scaled from 9200 to 10000).

The two time series (Figures 9.4 and 9.5) are just one and the same time series, but visual-
ized in two different ways. Both charts show the Dow Jones Industrial Average index taken
arbitrarily between 25 September and 5 December 2003. However, the y-axis on the first
chart is scaled to a smaller level of resolution and the second chart has a much larger scale.
It is obvious that, depending on the scale, we can ‘see’ almost two different time series. The
way we visualize our time series and what our ultimate objectives are will determine what
method to apply. The first one could be approximated by some straight line—which is what
we did—and the second one can be fitted with an nth-degree polynomial line.

Note The visual representation of the time series will often determine what method
to use, although this is not the primary criterion. The choice of the method should be
determined by the type of time series and the forecasting objectives.

Student exercises
X9.1 Chart the time series given in Table 9.1 and decide if it is stationary and or seasonal.

x 1 2 3 4 5 6 7 8 9 10
y 2 5 6 6 4 5 7 5 8 9

Table 9.1

X9.2 The time series given in Table 9.2 is seasonal. What would you say is the periodicity of
the seasonal component?

x 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
y 1 3 5 3 1 3 5 7 5 3 5 7 9 7 5

Table 9.2

X9.3 Is it possible to have a time series that is non-seasonal and non-stationary? If so, how
would you call it and can you draw a graph showing how such a series might look?
X9.4 Go to one of the websites that allow you to download financial time series (e.g. http://
finance.yahoo.com/) and plot the series of your choice in several identical line graphs.
Change the scale of the y-axis on every graph and make sure that they are radically
different scales. What can you say about the appearance of every graph?
X9.5 Take the first differences between the observations in the time series from X9.1. Would
you say that the differenced time series is stationary? If it is not what would you do to
make it stationary?

x Polynomial line A polynomial line is a curved line whose curvature depends on the degree of the polynomial variable.

9.2 Index numbers

The simplest way to analyse time series data is to compare a value from one point in time with some other value at a different point in time.

Example 9.4
Example 9.4 represents average annual domestic crude oil prices in the USA from 1980-2011
(see Table 9.3).

Year Average price of oil Year Average price of oil


in $/bbl in $/bbl
1980 $37.42 1996 $20.46
1981 $35.75 1997 $18.64
1982 $31.83 1998 $11.91
1983 $29.08 1999 $16.56
1984 $28.75 2000 $27.39
1985 $26.92 2001 $23.00
1986 $14.44 2002 $22.81
1987 $17.75 2003 $27.69
1988 $14.87 2004 $37.66
1989 $18.33 2005 $50.04
1990 $23.19 2006 $58.30
1991 $20.20 2007 $64.20
1992 $19.25 2008 $91.48
1993 $16.75 2009 $53.48
1994 $15.66 2010 $71.21
1995 $16.75 2011 $87.04

Table 9.3 Average annual domestic crude oil price in $/barrel (bbl)

The price is given in $/barrel (bbl). In 1985, for example, the average price of oil was
$26.92. In 2007, the same oil was priced at $64.20. The question we are interested in is: by
how much has the 2007 nominal price changed when compared with the one from 1985?
To answer this question we need to use index numbers. Index numbers measure the
change, typically expressed in percentages. To answer the question we introduced, all
we have to do is to divide the price of oil from 2007 with the one from 1985 and multiply
it by 100:

$\text{Index change} = \frac{\text{Price from 2007}}{\text{Price from 1985}} \times 100 = \frac{64.20}{26.92} \times 100 \approx 238$
x Index numbers A value of a variable relative to its previous value at some base.
x Base index period A value of a variable relative to its previous value at some fixed base.
x Simple index A simple index is designed to measure changes in some measure over time.

In other words, if the price in 1985 is treated as the base index period (which is equal to 100), then the price in 2007 is 138% higher than the one in 1985, i.e. 238 − 100 = 138.

9.2.1 Simple indices

A general formula for calculating a simple index It at any point in time is:

$I_t = \frac{y_t}{y_0} \times 100$ (9.1)

Where yt is the value for the year for which the index is calculated and y0 is the value for
the base year. In Example 9.4, this is:

$I_{2007} = \frac{y_{2007}}{y_{1985}} \times 100 = \frac{64.20}{26.92} \times 100 \approx 238$

Clearly, it is easy to calculate indices in Excel. We show two examples using the same
time series: one calculating indices for 1980 as the base year and the other one for 1992
as the base year.

Example 9.5
Figure 9.6 illustrates the Excel calculation procedure to calculate the required indices for the
average annual oil price.

Figure 9.6

➜ Excel solution
Year Cells B4:B35 Values
Average price Cell C4:C35 Values
Index base 1980 = Cell D4 Value (=100)
Cell D5 Formula: =C5/$C$4*100
Copy formula D5:D35
Index base 1992 = Cell F4 Formula: =C4/$C$16*100
Copy formula F4:F15
Cell F16 Value (=100)
Cell F17 Formula: =C17/$C$16*100
Copy formula F17:F35

❉ Interpretation
1. For the first index series: the price of oil in the year 2005, for example, was 33.73%
(133.73 − 100) higher than the price of oil in 1980.
2. For the second index series: the price of oil in the year 1999, for example, was 13.97% (100 − 86.03) lower than the price of oil in 1992.

To convert indices from one year to another is very easy. Let’s say that we want to know
by how much was the price of oil higher in the year 2000 when compared with 1990. Using
the first series of indices, the one where 1980 is the base year, it is calculated as:

$\text{Change} = \frac{I_{2000} - I_{1990}}{I_{1990}} \times 100 = \frac{73.20 - 61.97}{61.97} \times 100 = 18.11$

If we tried to do the same for the second series of indices, the one where 1992 is the base
year:

$\text{Change} = \frac{I_{2000} - I_{1990}}{I_{1990}} \times 100 = \frac{142.29 - 120.47}{120.47} \times 100 = 18.11$

❉ Interpretation Indices can be converted easily from one base to another. Regardless
which series of indices we use, the price of oil in year 2000, for example, was 18% higher than
the price of oil in 1990.

Rather than having a time series of indices on a fixed basis, i.e. starting from one partic-
ular year that is equal to 100, we can have indices on a year-to-year basis. This effectively
means that every previous year is equivalent to 100. These are called chain indices.
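
The same arithmetic is easy to reproduce outside the spreadsheet. The Python sketch below is our own addition (values taken from Table 9.3) and computes both a fixed-base index (1980 = 100) and a year-on-year chain index for the first few years of the oil price series.

prices = {1980: 37.42, 1981: 35.75, 1982: 31.83, 1983: 29.08}

# Fixed-base indices: I_t = y_t / y_0 * 100, with 1980 as the base year
fixed_base = {yr: p / prices[1980] * 100 for yr, p in prices.items()}

# Chain indices: every previous year is equivalent to 100
years = sorted(prices)
chain = {yr: prices[yr] / prices[prev] * 100
         for prev, yr in zip(years, years[1:])}

print(fixed_base)   # e.g. 1981 -> 95.54, i.e. 4.46% below the 1980 price
print(chain)        # e.g. 1982 -> 89.03, i.e. 10.97% below the 1981 price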

Example 9.6
Figure 9.7 illustrates the calculation procedure to calculate the average oil price and index
values for oil prices.

Figure 9.7

➜ Excel solution
Year Cells B4:B35 Values
Average price Cell C4:C35 Values
Index base 1980 = Cell D4 Value (=100)
Cell D5 Formula: = C5/C4*100
Copy formula D5:D35

❉ Interpretation The average oil price in 1985, for example, has dropped when
compared with the previous year by 6.37% and the price in 2007, for example, has grown by
10.12% in comparison with the previous year.

The series of numbers in Example 9.6 is very interesting as it actually shows us a per-
centage of change that takes place on a year-by-year basis. Using index numbers, we can
calculate a number of other more complicated indices. This takes us to an example of
aggregate price indices.

9.2.2 Aggregate indices


One of the best known aggregate price indices, that we will use as an example, is the US
Consumer Price Index (CPI). Similar aggregate indices exist in every country. This index
is calculated every year (in fact, it is calculated every month) by the US Bureau of Labor
Statistics and is a primary measure of changes in cost of living in the USA. In fact, CPI meas-
ures changes in the cost of a typical market basket of goods and services. It is composed of
housing prices, transportation, food, energy, and medical care, etc. What is really impor-
tant to understand is that CPI is a measure of inflation. Inflation is effectively calculated
as the percentage change in the CPI from one year to the next (or one month to the next).

Example 9.7
Table 9.4 shows the value of CPI from 1980 to 2011; they are calculated on the basis of year
2000.

Year CPI 2000 = 100 Year CPI 2000 = 100

1980 47.86 1996 91.09
1981 52.80 1997 93.22
1982 56.04 1998 94.66
1983 57.84 1999 96.73
1984 60.33 2000 100
1985 62.47 2001 102.83
1986 63.65 2002 104.46
1987 65.98 2003 106.83
1988 68.67 2004 109.69
1989 71.99 2005 113.41
1990 75.88 2006 117.07
1991 79.09 2007 120.41
1992 81.48 2008 125.03
1993 83.89 2009 124.59
1994 86.08 2010 126.63
1995 88.49 2011 130.63

Table 9.4

x Aggregate price indices A measure of the value of money based on a collection (a basket) of items and compared to the same collection of items at some base date or a period of time.

To calculate the value of CPI for 2007, for example, when compared with the previous
year, we can use the formula we have already introduced:

$\text{CPI}_{2007} = \frac{I_{2007} - I_{2006}}{I_{2006}} \times 100 = \frac{120.41 - 117.07}{117.07} \times 100 = 2.85$

❉ Interpretation The annual inflation rate in the USA, measured as CPI, in 2007 was
2.85%.

A more generic expression of the above formula is:

$\text{CPI}_{\text{YEAR A}} = \frac{I_{\text{YEAR A}} - I_{\text{YEAR B}}}{I_{\text{YEAR B}}} \times 100$ (9.2)

CPI has one very important quality and that is: it can be used as a price deflator. We can
use CPI to convert (or deflate, hence the word deflator) prices from any year into the so
called constant prices. This is sometimes called converting actual dollars into real dollars,
i.e. dollars free from the inflation.

9.2.3 Deflating values

Example 9.8
Let’s take the example of oil prices as before. Column B repeats the average annual price of
domestic crude oil in $/bbl. These values are given in current dollars, i.e. the value of the dollar
in every given year. The second column shows us the values of the CPI index for every year,
given on the basis of year 2000 = 100 (Figure 9.8).

Figure 9.8
Oil prices deflated with CPI

➜ Excel solution
Year Cells B4:B35 Values
Oil price Cells C4:C35 Values
CPI Cells D4:D35 Values
Deflated value = Cell E4 Formula: = C4*($D$24/D4)
Copy formula down E4:E35

To convert the prices of oil into a constant value, we need to deflate them. In our example we deflate each annual price by multiplying it by the ratio of the base-year CPI (year 2000) to the CPI for that year: Price at time A = Price at time B × (CPI at time A/CPI at time B). In a more general sense, this formula is:

$P_A = P_B \times \frac{\text{CPI}_A}{\text{CPI}_B}$ (9.3)

❉ Interpretation The price of oil in 2007, when expressed in a constant dollar value for
year 2000, was $53.32 (cell E31 in Figure 9.8). The price of oil in 1980, on the same basis, i.e.
in constant year 2000 dollars, was $78.19 (cell E4 in Figure 9.8). This means that, in real terms,
the price of oil in 1980 was much higher than the price in 2007.
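
For readers who want to replicate the deflation outside Excel, here is a small Python sketch of equation (9.3); it is our own illustration, using the CPI values of Table 9.4, and the two printed results match cells E4 and E31 of Figure 9.8.

cpi = {1980: 47.86, 2000: 100.0, 2007: 120.41}   # CPI, year 2000 = 100
oil = {1980: 37.42, 2007: 64.20}                 # average oil price, $/bbl

def deflate(price, year, base_year=2000):
    # P_A = P_B * (CPI_A / CPI_B): express a price in constant base-year dollars
    return price * cpi[base_year] / cpi[year]

print(round(deflate(oil[1980], 1980), 2))   # 78.19, as in cell E4
print(round(deflate(oil[2007], 2007), 2))   # 53.32, as in cell E31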

Example 9.9
Using the previously-described technique for converting indices from one base to another, if
we wanted to calculate the price of oil on the basis of a constant value of US dollars for the
year 2007, the calculation is as shown in Figure 9.9.

Figure 9.9
Deflated oil prices with CPI

➜ Excel solution
Year Cells B4:B35 Values
Oil price Cells C4:C35 Values
CPI Cells D4:D35 Values
Deflated value = Cell E4 Formula: = C4*($D$31/D4)
Copy formula down E4:E35

The calculation shown in the Excel solution helps us with simple questions, such as
is the price of oil in 2007 of $64.20 higher in real terms than the price of oil of $37.42 in
1980? We can translate this into a question: how much is $37.42 from 1980 worth in 2007
terms? This is calculated as: Adjusted price = Old price*(CPI for 2007/CPI for 1980). In
more general terms:

$y_t^{\text{adj}} = y_t \left( \frac{\text{CPI}_{\text{Fixed}}}{\text{CPI}_t} \right)$ (9.4)

❉ Interpretation Given that the 2007 price of oil is $64.20, this means that in 1980 the
price of oil was equivalent to $94.14. Using the constant value of dollars in 2007, the price of
oil in 1980 was $29.94 dollars more than the 2007 price of oil of $64.20.

Student exercises
X9.6 Calculate indices based on year 2000 for the series shown in Table 9.5. Could you
convert them into indices based on year 2003?

Year 2000 2001 2002 2003 2004 2005 2006 2007 2008
Sales 230 300 290 320 350 400 350 400 420

Table 9.5

X9.7 Use the CPI values from Figure 9.8 to convert the sales values from student exercise X9.6 to a constant dollar value based on the 2004 value of the dollar.
X9.8 What is the real value of the sales value in 2007 if you put it on the constant year 2000 basis?

x Classical time series analysis Approach to forecasting that decomposes a time series into certain constituent components (trend, cyclical, seasonal, and random component), makes estimates of each component and then re-composes the time series and extrapolates into the future.
x Trend (T) The trend is the long-run shift or movement in the time series observable over several periods of time.
x Cyclical variations (C) The cyclical variations of the time series model that result in periodic above-trend and below-trend behaviour of the time series lasting more than one year.
x Seasonal variations (S) The seasonal variations of the time series model that show a periodic pattern over one year or less.
x Irregular variations (I) The irregular variations of the time series model that reflect the random variation of the time series values beyond what can be explained by the trend, cyclical, and seasonal components.
x Additive model The additive model time series model is a model whereby the separate components of the time series are added together to identify the actual time series value.
x Multiplicative model The multiplicative time series model is a model whereby the separate components of the time series are multiplied together to identify the actual time series value.
x Mixed model The mixed time series model blends together both additive and multiplicative components to identify the actual time series value.

9.3 Trend extrapolation

At the beginning of this chapter we classified not only the time series into various types, but also various methods that deal with time series. It is our objective in this text to deal with univariate time series only and to describe just several basic time series analysis methods. Classical time series analysis starts with an assumption that every time series can be decomposed into four elementary components: (i) underlying trend (T), (ii) cyclical variations (C), (iii) seasonal variations (S), and (iv) irregular variations (I).

Depending on the model, these components can be put together in different ways to represent the time series. The simplest of all is the so-called additive model. It states that time series Y implicitly consists of the four components that are all added together:

Y = T + C + S + I (9.5)

In addition to an additive model, a multiplicative model can also be used. Sometimes, the most appropriate model is a mixed model. Here are two examples of these models:

Multiplicative model: Y = T × C × S × I
Mixed model: Y = (T × C × S) + I

The character of the data in the time series will determine which model is the most appropriate.

Underlying trend is almost self-explanatory, but we'll describe it further along. The cyclical component consists of the long-term variations that happen over a period of several years. If the time series is not long enough, sometimes we might not even be able to observe this component because the cycle is either longer than our time series or it is just not obvious. However, the seasonal component applies to seasonal effects happening within one year. Therefore, if the time series consists of annual data, there is no need to worry about the seasonal component. At the same time, if we have monthly data and our time series is several years long, then it will (potentially) consist of the seasonal, as well as the cyclical, component. And, finally, the irregular component is everything else that

does not fit into any of the previous three components. A method of isolating different
components in a time series, or decomposing the time series, is called the classical time
series decomposition method. This is one of the oldest approaches to forecasting. The
whole area of classical time series analysis is concerned with the theory and practice of
how to decompose a time series into these components, estimate them, and then recom-
pose to produce forecasts. We will not go into this method in any depth, but we’ll look into
the trend component.

9.3.1 A trend component


Let’s say that, for practical purposes, we are only interested in estimating the trend and
that all the remaining components can be grouped into something that we will call the
residuals (R). In other words, time series Y in this simplified model now consists of only
two components, as defined by equation (9.6).

Y = T + R (9.6)

If a trend represents an underlying pattern that the time series follows, then the residu-
als are something that should oscillate randomly around the trend. In other words, if we
can estimate the underlying trend of a time series, we will not worry about these ran-
dom residuals fluctuating around the trend line. We can then extrapolate this trend. The
trend becomes our forecast of the time series. Admittedly, this forecast will not be 100% accurate as some residual value will be oscillating around the trend, but for all practical purposes, this might be exactly what we want. We are interested in just isolating the trend and extrapolating it into the future, which produces the forecast value for our time series.

x Seasonal component A component in the classical time series analysis approach to forecasting that covers seasonal movements of the time series, usually taking place inside one year's horizon.
x Classical time series decomposition Classical time series decomposition is a statistical method that deconstructs a time series into notional components.
x Trend component A component in the classical time series analysis approach to forecasting that covers underlying directional movements of the time series.
x Residuals (R) The differences between the actual and predicted values. Sometimes called forecasting errors. Their behaviour and pattern has to be random.
x Types of trends The type of trend can include line and curve fits to the data set.

Note Fitting a trend to a time series and extrapolating it into the future is the most elementary form of forecasting.

9.3.2 Fitting a trend to a time series

If trend is the underlying pattern that indicates the general movements and the direction of the time series, then this implies that a trend can be described by any regular curve. This usually means a smooth curve: a straight line, a parabola, a sinusoid, or any other well-defined curve. Fortunately, Excel is very well equipped to help us define the trend, fit it to the time series, and extrapolate it into the future. Let's see what elementary types of trends are embedded in Excel and how to invoke them.

Example 9.10
We'll use an artificially-created time series that consists of only 30 observations, as illustrated in Table 9.6.

Period Series 1 Period Series 1


1 8 16 38
2 25 17 43
3 15 18 55
4 22 19 54
5 15 20 56
6 30 21 49
7 27 22 46
8 20 23 58
9 27 24 60
10 32 25 59
11 30 26 62
12 35 27 65
13 39 28 60
14 35 29 58
15 55 30 62

Table 9.6

When charted as a line graph, the time series looks as illustrated in Figure 9.10.

Figure 9.10
Line graph of the time series in Table 9.6.

To fit a trend line to the time series is a very easy graphical process in Excel, as we have
already demonstrated in Chapter 8.
To fit a trend line to the time series right-click on any data point in the Excel graph, as
illustrated in Figure 9.11, and select Add Trendline.
After selecting Add Trendline, choose Linear Option, as well as Display Equation on
chart, and Display R-squared on chart (see Figure 9.12). Click on close.
422 Business statistics using Excel

Figure 9.11

Figure 9.12

The final graph with the trend line added automatically is illustrated in Figure 9.13 with
the line equation and coefficient of determination included.
Figure 9.13
Time series with fitted linear trend line: y = 1.8082x + 13.306, R² = 0.89.

What we are getting here instantly is a straight line that describes the underlying move-
ment and the direction of our time series.

9.3.3 Types of trends


Fitting a line to a time series is identical to establishing a regression line between two vari-
ables. The only difference is that, in the case of time series analysis, one of the variables
has to be time. In Chapter 8 we reviewed briefly different types of curves that can be used
to describe relationships between two variables. Before we go any further, let’s look at all
the options that Excel gives us. We can fit the following types of trends (or curves) to any
data set: (i) linear trend, (ii) logarithmic trend, (iii) polynomial trend, (iv) power trend,
(v) exponential trend, and (vi) moving average trend.
We’ll go into a greater depth regarding the linear trend, but it suffices here to say that
linear trend is defined by equation (9.7).

y = mx + b (9.7)

Logarithmic curve is defined by equation (9.8).

y = c ln x + b (9.8)

Here, c and b are constants and ln is the natural logarithm function. The picture in the Excel dialogue box indicates that this trend has the form of an inverse exponential curve: one that quickly reaches some high value and then continues to grow much more slowly.

Polynomial curve comes in several degrees; for example, a polynomial equation of degree 6 would be written as defined by equation (9.9).

$y = b + c_1x + c_2x^2 + c_3x^3 + c_4x^4 + c_5x^5 + c_6x^6$ (9.9)

In this case also b and c1 to c6 are constants. If you experiment with these curves, you will see that some of them translate into very dynamic curves, making multiple turns and ups and downs.

Power function has a very simple equation, with c and b as constants, as defined by equation (9.10).

$y = cx^b$ (9.10)

This trend is a parabolic trend that will continue to grow forever.

Exponential trend also has two constants, c and b, as defined by equation (9.11).

$y = ce^{bx}$ (9.11)

The symbol e is used for the base of natural logarithms. Unlike the power trend, which continues to grow at a constant rate, the exponential trend moves slowly at the beginning and then resumes the very fast change typified by exponential growth.

x Linear trend Linear trend is a straight line fit to a data set.
x Logarithmic trend A model that uses the logarithmic equation to approximate the time series.
x Polynomial trend A model that uses an equation of any polynomial curve (parabola, cubic curve, etc.) to approximate the time series.
x Power trend A model that uses an equation of a power curve (a parabola) to approximate the time series.
x Exponential trend An underlying time series trend that follows the movements of an exponential curve.
x Moving average trend The moving average trend is a method of forecasting or smoothing a time series by averaging each successive group of data points.

Moving averages trend is a special type of trend that we will cover further along in the
chapter as a separate heading, owing to its special way of deployment.

9.3.4 Using a trend chart function to forecast time series


It is quite a simple task to use the fitted Excel trend line to forecast five time periods into
the future. Right-click on the trend line on the graph and choose Format Trendline, as
illustrated in Figure 9.14.

Figure 9.14

In Figure 9.15 we will opt for an automatic trend line that will move five periods in the
future.

Figure 9.15

Figure 9.16 illustrates the modified time series chart with the trend line extended by 5
time periods to provide a forecast for time points 31, 32, 33, 34, and 35.
Figure 9.16
Time series with the linear trend line extended five periods into the future: y = 1.8082x + 13.306, R² = 0.89.

We can see that the actual time series is not a smooth straight line, but that it oscillates
around one and we have identified it. By extrapolating our straight line or linear trend in
to the future, we are stating that the actual line might be a bit adrift, but we believe that it
will be inside some confidence factor, as we will describe later on. Excel does not just give
us a pictorial representation of this trend line, but the actual equation of this line. From
Figure 9.16, we can see that this trend line is moving in accordance with the equation:
y = 1.8082x + 13.306. We’ll explain this in a minute. The R-squared (or R²) value is 0.89. Let’s
refresh what we know about this statistic.
When fitting a line to a data set, as we described in Chapter 8, we measure how closely
the trend line fits the actual data. Every deviation is squared and all these values are
summed to create the total sum of squares (SST). The theory suggests that the SST consists
of the regression sum of squares (SSR) and residual sum of squares (SSE).
R-squared is a coefficient that measures how closely is the actual time series approxi-
mated (or fitted) by a trend line as given by equation (9.12).

$R^2 = 1 - \frac{SSE}{SST}$ (9.12)

R-squared is actually the coefficient of determination (COD). As we know, the square


root of this value would give us the coefficient of correlation. Clearly, this coefficient is
checking how closely the trend line and the actual time series are related. In effect, it tells
us how much of the time series variations are ‘left out’ after we fitted the trend line.

Note
The closer R2 is to the value of 1, the better the fit of the trend to time series. In
our case R-squared is 0.89, which is very good. This confirms that our trend is approximating,
or fitting, the data very well. Only 11% (1 – 0.89 = 0.11) of data variations are not captured, or
explained, by the trend line. This is more than reasonable.

We said earlier that the trend line equation in this particular case was y = 1.8082x + 13.306.
Excel extrapolated our trend line five periods into the future, but we do not know either the
past values or the future values of this trend line. All we have is the chart that does this for us.
We need to learn how to calculate these values manually or using the built-in Excel functions.

9.3.5 Trend parameters and calculations


The equation y = 1.802x + 13.306 is a specific case, fitted to our data set, of a generic Excel
linear trend equation that we have used before: y = mx + b. In most textbooks this equa-
tion is written as y = ax + b or y = a + bx. Whatever the case, the letter that stands alone
(without x) is called an intercept and the other letter associated with x is called the slope.
To avoid further confusion, let’s standardize to a formula: y = a + bx. Clearly, ‘a’ is an
intercept and ‘b’ is a slope. In our case, the value of the intercept is 13.306 and the value
of the slope is 1.802. Chapter 8 explains in greater depth the meaning of these two param-
eters, so let’s go and just use them. To calculate our past and future trend values we just
need these two parameters. The values of x are represented by the sequential numbers
that represent time periods.
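
As a cross-check on the Excel output, the two parameters and the extrapolated trend can also be computed with a one-line least squares fit. The Python sketch below is our own addition and uses the 30 observations of Table 9.6.

import numpy as np

y = np.array([8, 25, 15, 22, 15, 30, 27, 20, 27, 32, 30, 35, 39, 35, 55,
              38, 43, 55, 54, 56, 49, 46, 58, 60, 59, 62, 65, 60, 58, 62],
             dtype=float)
x = np.arange(1, 31, dtype=float)

b, a = np.polyfit(x, y, 1)           # slope b and intercept a of y = a + b*x
print(round(a, 3), round(b, 4))      # about 13.306 and 1.8082

future_x = np.arange(31, 36)         # periods 31..35 continue the sequence
print((a + b * future_x).round(2))   # extrapolated trend values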

Example 9.11
In this example we will fit a trend line to a time series data set given the value of the slope of
the trend line and its intercept. Figure 9.17 illustrates the manual calculation of the trend line
using basic Excel formulae.

Figure 9.17

Figure 9.17 indicates that we have put the value of the intercept in cell H3 and the value of the slope in cell H4.

➜ Excel solution
Period Cells B4:B33 Values
Series 1 Cells C4:C33 Values
a = Cell H3 Value
b = Cell H4 Value
Trend Cell D4 Formula: = $H$3+$H$4*B4
Copy formula down D4:D33
Time series data and analysis 427

The forecasts at time points 31, 32, 33, 34, and 35 are produced in the same way as illus-
trated in Figure 9.18.

Figure 9.18

Note The future values of x should always be a sequential continuation of the period
numbers used in the past. In our case, the last observation is for period 30, which means that
the future values of x are 31, 32 . . . 35.

It is not necessary to use the Excel graph function and chart the trend first to get the
values of the intercept and the slope. Excel has a built-in function for both parameters. In
cell H3 we could have invoked the Excel INTERCEPT() function or in cell H4 the SLOPE()
function.

Example 9.12
Repeat Example 9.11, but this time use the Excel function INTERCEPT() and SLOPE() to calculate
these values in Cells H3 and H4.
Figure 9.19 illustrates the change to cells H3 and H4 in the Excel solution given in Figure 9.17.

Figure 9.19

➜ Excel solution
Period Cells B4:B33 Values
Series 1 Cells C4:C33 Values
a = Cell H3 Formula: =INTERCEPT(C4:C33,B4:B33)
b = Cell H4 Formula: =SLOPE(C4:C33,B4:B33)
Trend Cell D4 Formula: = $H$3+$H$4*B4
Copy formula down D4:D33

Example 9.13
Remember, if you cannot remember the names of the Excel functions then you can select
Formulas > Insert Function and choose the function you require to undertake your data analy-
sis. For example, Figures 9.20 and 9.21 illustrate the Excel solution for the INTERCEPT() and
SLOPE() functions.

Figure 9.20

Figure 9.21

Another approach is to use already ‘pre-packaged’ Excel functions dedicated to trend


estimation, as already used in Chapter 8. In cell D4, where we want the first value of the
trend to be calculated, we invoke the TREND function by clicking on cell D4 and selecting
Formula > Insert function > choose TREND, as illustrated in Figure 9.22.

Figure 9.22

Enter data ranges for the problem, as illustrated in Figure 9.23.



Figure 9.23

Click OK.
The formula: = TREND(C4:C33,B4:B33,B4,TRUE) will then appear in cell D4.
Remember, before you copy this formula down to calculate the trend line values for
time points 1, 2, 3 . . . 30 (cells D4:D33) you will need to fix the cell reference for the terms
known_y’s and known_x’s, as given by the following change to formula in cell D4: =
TREND($C$4:$C$33,$B$4:$B$33,B4,TRUE). Now copy this formula down from cell
D4:D33. If you wanted to calculate the forecast values at time periods 31, 32, 33, 34, and 35
then continue the copy from D34:D38.

Note
1. As before, the values of x have to be the sequential numbers that continue from the last historical period number.
2. The principles of calculating a linear trend, as described here, can be applied to other types of curves. The Manual and the Function methods work with any curve. In addition to the TREND function, the Function method can also be applied with the GROWTH function. GROWTH is an Excel function that fits exponential trends, and it is invoked and used in exactly the same way as the TREND function is for linear time series.
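For readers who want to check these calculations outside Excel, the sketch below reproduces the same logic in Python. It is a minimal sketch under stated assumptions: the short series held in the list is hypothetical (not the chapter's data), and numpy's polyfit() plays the role of INTERCEPT(), SLOPE(), and TREND().

import numpy as np

# Hypothetical series; in the chapter the observations sit in cells C4:C33.
series = np.array([17.0, 19.5, 22.1, 24.0, 25.8, 28.2, 29.9, 31.7])
x = np.arange(1, len(series) + 1)      # time points 1, 2, 3, ...

# Least squares fit of y = a + b*x; polyfit returns the slope first.
b, a = np.polyfit(x, series, 1)

# Back-fitted trend for the historical periods plus five future periods,
# which is what copying the TREND() formula down achieves.
future_x = np.arange(1, len(series) + 6)
trend = a + b * future_x
print(round(a, 3), round(b, 3))
print(trend)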

Student exercises
X9.9 If the time series components were extracted as in Table 9.7, how would you
reconstruct the time series (ŷ) using: (a) an additive model and (b) a mixed model?

T 90 95 100 105 110 115 120 125 130
C 2 4 6 4 2 4 6 4 2
I 5 3 4 6 5 6 4 5 4

Table 9.7

X9.10 If a time series can be best fitted with the trend whose equation is y = a + bx + cx2,
would you say that this is a linear model?
X9.11 R-squared (R2) is a measure of how closely a trend fits the time series. What is another
expression for this statistic and in what context have we used it when discussing linear
regression?
X9.12 Does R2 = 0.90 indicate a good fit? Why?
X9.13 Extrapolate the time series in Table 9.8 three time periods in the future. Use the TREND
function. Why do you think it would not make sense to extrapolate this time series 10
time periods in the future?

X 1 2 3 4 5 6 7 8 9 10 11 12
Y 230 300 290 320 350 400 350 400 420

Table 9.8

9.4 Moving averages and time series smoothing


Let us use a simple Excel AVERAGE function to calculate the average value of a time series.
We’ll use a very short and artificial time series just for the purposes of illustration.

Example 9.14
A very short time series (shown in Figure 9.24) has an average value of 208. The average value
represents the series fairly well because the series flows very much horizontally. Figure 9.25
illustrates this graphically. The average of 208 is shown as a horizontal line that runs across the
time series.

Figure 9.24

Figure 9.25 Time series and the mean value (chart of series value against time point, with the average of 208 drawn as a horizontal line across the series)

As we know, this particular sample time series is called a stationary time series.
However, if the series was moving upwards, or downwards, this average value would not be
the best representation of the series.
In this case a much more realistic representation would be some kind of moving average.
We are effectively saying that, in general, a series of moving averages is a much more realistic
representation for non-stationary time series.

9.4.1 Forecasting with moving averages


We’ll show how to create moving averages and how to use this generic statistical tool for
forecasting purposes.

Example 9.15
We created another short time series and calculated moving averages in Figure 9.26.

Figure 9.26

➜ Excel solution
Period Cells B4:B8 Values
Series Cells C4:C8 Values
3MA Cell D5 Formula: =SUM(C4:C6)/3
Copy formula down D5:D7
5MA Cell F6 Formula: =SUM(C4:C8)/5

Moving averages are dynamic averages that change in accordance with the number of periods over which they are calculated. A general formula for moving averages is given by equation (9.13).

$M_t = \frac{\sum_{i=t-N+1}^{t} x_i}{N}$ (9.13)

Moving averages Averages calculated for a limited number of periods in a time series. Every subsequent period excludes the first observation from the previous period and includes the one following the previous period. This becomes a series of moving averages.

In equation (9.13), t is the time period and N is the number of observations taken into the calculation. It is clear that if we are using three observations as a basis for calculating moving averages then the first possible observation for which we can calculate the moving average is observation 3. Equation (9.13) can be simplified and expressed as equation (9.14).

$M_t = \frac{x_t + x_{t-1} + \cdots + x_{t-N+1}}{N}$ (9.14)

The advantage of using an odd number for N and taking an odd number of elements into the equation is that we can centre the moving average value in the middle of the interval, as per Figure 9.26. This implies that we will use an odd number of periods in the interval most of the time, as it is easier to centre the values. In our case, the moving average for period two is calculated as:

$M_2 = \frac{x_3 + x_2 + x_1}{3} = \frac{150 + 250 + 200}{3} = 200$

Note Equation (9.14) can also be rewritten as: $M_t = M_{t-1} + \frac{x_t - x_{t-N}}{N}$.
In other words, if we do not know the value of the first observation in the moving average
interval, we can still estimate the current moving average from the previous value of the
moving average, plus the other value from the interval. Although this might appear to be a
useless fact here, you will see why we mentioned it when we discuss exponential smoothing.

What happens if we extend the number of observations in the moving average interval? Let's look at what happens if we take all the values from the series to constitute the interval. As Figure 9.27 shows, the moving average then simply becomes the overall average. This implies that the larger the number of observations used for calculating the moving average, the smoother and the more horizontal the line representing it will be.

Figure 9.27 Time series plot with the actual series, 3MA, and 5MA plotted

Note It is a general principle that the larger the number of moving averages in the formula, the ‘smoother’, or less dynamic, the time series of moving averages will be. Various business reports will tend to use moving averages. The most frequent choice is to use 3-month, 6-month, and 12-month moving averages. If you look at the 3-month moving average time series, you will see that it is closely tracking the actual time series. However, a 12-month moving average line will be much ‘flatter’, or more horizontal, as it is averaging a much larger interval and it is, therefore, not so subject to most recent events.

Exponential smoothing One of the methods of forecasting that uses a constant (or several constants) to predict future values by ‘smoothing’ the past values in the series. The effect of this constant decreases exponentially as the older observations are taken into calculation.

Example 9.16
Let us now use a slightly longer time series and see how to use moving averages for forecasting purposes. If a series is horizontal (stationary) and we just want to predict a single future value of this series, we already said that using a simple average value of the series is almost as good as any other method. Figure 9.28 shows such a stationary series with 30 observations and its mean value, which was used to predict the 31st observation.

Figure 9.28 Stationary time series with its mean value extended to predict the 31st observation

The advantage of this simple method is that it can be extended further in the future. If
we need to forecast for the next five observations, we just extend the mean line. By defini-
tion, if a series is stationary it fluctuates around its mean. Therefore, the mean is its best
predictor. This method does not produce very precise forecasts, but the results will be
accurate enough. To add more sophistication to our forecasting and to try to emulate the
movements of the original series, we need to see how to use the principle of moving aver-
ages. We’ll use the same Excel method as in section 9.3.2. To remind you how we added a
trend line to a time series, we right click on the time series, which will invoke a dialogue
box with several options included. We then click on the option called ‘Add Trendline. . .’.
This invokes the next dialogue box, and we select the moving averages option and change
the number of periods to three, as illustrated in Figure 9.29.
Excel will start automatically charting the moving average from the last observation in
the period specified (in this case three). If we selected a five-period moving average, then
the moving average function would start from observation five. This is somewhat different
to the advice we gave earlier when we recommended that the moving average should be
centred in the middle of the interval for which it is calculated. A simple reason for this is
that, here, we are trying to predict the series, and this is going to help us to achieve this.
So, how is the moving average approach used to produce forecasts? All we need to do
is to shift the moving average plot, as produced by Excel, by one observation. In other
words, the moving average value for the first three observations (assuming we are using
moving averages for three periods) becomes the forecast for the fourth observation. The
fifth observation is predicted by using the second three period moving average (obser-
vations two to four) and so on. Figure 9.30 illustrates the point for three-period moving
averages.

Figure 9.29

Figure 9.30 Stationary time series with the mean and the three-period moving average shifted by one observation to act as forecasts

However, there are two difficulties associated with this approach. First of all, we cannot
extend our forecast beyond just one future period, which means that this method can only
be used as a short-term forecasting method that predicts only one future observation. To
forecast using moving averages we can modify equation (9.13) to create equation (9.15) to
enable calculation of the required moving average forecasts.

$F_t = \frac{\sum_{i=t-N}^{t-1} x_i}{N}$ (9.15)

For example, for MA(4) the forecast at time point five is:

$F_5 = \frac{x_4 + x_3 + x_2 + x_1}{4}$

The other issue is that Excel will not shift the moving average plot if we are using the Add Trendline wizard. We need to calculate the moving averages manually.

Example 9.17
Figure 9.31 shows how the moving averages were calculated for a three-point moving average (3MA), and how the model is used to provide a forecast at time point 31.

Figure 9.31

➜ Excel solution
Period Cells B4:B34 Values
Series Cells C4:C34 Values
3MA Cell D7 Formula: =AVERAGE(C4:C6)
Copy formula down D7:D34
Forecast
Period Cell B35 Value
3MA Cell D35 Formula: =AVERAGE(C32:C34)

As stated earlier, we need to remember that the more we extend the number of periods
used for calculating the moving average, the smoother and more horizontal the curve will
be. If we take into account all the observations in the series, needless to say we will have
only one moving average value and it will be identical to the mean value of the overall
time series.

Note Moving averages are an acceptable forecasting technique, provided we are interested in forecasting only one future period.
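The one-step-ahead logic of equation (9.15) is easy to reproduce outside Excel. The following minimal Python sketch assumes the series is held in a plain list (the numbers are hypothetical): it averages the N most recent observations to forecast the next period.

# One-step-ahead forecast from an N-period moving average (equation (9.15)).
def ma_forecast(series, n):
    # the forecast for the next period is the average of the n latest values
    return sum(series[-n:]) / n

data = [150, 250, 200, 220, 230]    # hypothetical short series
print(ma_forecast(data, 3))         # 3MA forecast for period 6

One design note: if we tried to extend the forecast further, each new forecast would have to reuse earlier forecasts as inputs, which is why the chapter treats moving averages as a short-term, one-period-ahead technique.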

9.4.2 Exponential smoothing concept


In order to introduce the exponential smoothing method, we need to assume that one of
the ways to think about observations in a time series is to say that the previous value in the
series (yt−1), plus some error element (et), is the best predictor of the current value ( ŷ t ) if
we are dealing with a stationary time series as given by equation (9.16).

$\hat{y}_t = y_{t-1} + e_t$ (9.16)

When yt−1 remains stationary over time it is reasonable to forecast future values of ŷ t by
using regression analysis, as described in Chapter 8. In such situations the least squares
estimate of yt−1 would be the average value of all the observed data values, where in the
calculation of the point estimate of the constant term in linear regression (b0) we are
equally weighting each of the previously-observed terms in the time series data set.
When the value of yt−1 changes over time (non-stationary) then this equal weighting
may not be appropriate and it may be more desirable to weight recent observations more
heavily than older observations. Simple exponential smoothing is a forecasting method
that applies unequal weights to the time series data. Let's explain how we arrive at this formula.
With a bit of imagination, we can say that every new forecast is the old one plus an
adjustment for the error that occurred in the last forecast, i.e. et−1 = yt−1 − Ft−1, as presented
in equation (9.17).

$F_t = F_{t-1} + (y_{t-1} - F_{t-1})$ (9.17)

where $y_{t-1}$ is the actual result from period t − 1 and $F_{t-1}$ is the forecast result for period t − 1.
Let us now assume that the error element, i.e. (yt−1 − Ft−1), is zero. In this case the current
forecast is the same as the previous forecast. However, if it is not zero, then, under certain
circumstances, we might be interested in taking just a fraction of this error using equation
(9.18).

$F_t = F_{t-1} + \alpha(y_{t-1} - F_{t-1})$ (9.18)

Note Why a fraction of an error? If every current forecast/observation depends on the


previous one and this one depends on the one before, etc., then all the previous errors are, in
fact, embedded in every current observation/forecast. By taking a fraction of error, we are, in
fact, discounting the influence that every previous observation and its associated error has on
current observations/forecasts.
Simple exponential smoothing Simple exponential smoothing is a forecasting technique that uses a weighted average of past time series values to arrive at smoothed time series values that can be used as forecasts.

We use the letter α to describe the fraction, and the word ‘fraction’ implies that α takes values between zero and one. If α = 0, then the current forecast is the same as the previous one. If α = 1, then the current forecast is also the same as the previous one, plus the full amount of the deviation between the previous actual and forecasted value. In order to take just a fraction of that deviation, α has to be greater than zero and smaller than one, i.e.

0 < α < 1. The forecasts calculated in such a way are, in fact, smoothing the actual observa-
tions. If we plot both the original observations and these newly calculated ‘back-forecasts’
of the series, we’ll see that the back-forecast curve is eliminating some of the dynamics
that the original observations exhibit. It is a smoother time series.
Equation (9.18) can be rewritten as equation (9.19).

$F_t = \alpha y_{t-1} + (1 - \alpha)F_{t-1}$ (9.19)

Equations (9.18) and (9.19) are identical, and it is a matter of preference which one
to use. They both provide identical forecasts based on smoothed approximations of the
original time series.

Note The origins of equations (9.18) and (9.19) can be found in Brown's single exponential smoothing method. However, the original Brown's formula states that:

$S'_t = \alpha y_t + (1 - \alpha)S'_{t-1}$ (9.20)

Note that Brown uses $y_t$ rather than $y_{t-1}$. Effectively, this means every current smoothed value in Brown's formula is the future forecast value in our formula. If we use the original smoothing equation by Brown, then we have to remember that $F_t = S'_{t-1}$ (see equation (9.22)).
There is a connection between Brown's formula and the moving averages concept. You will recall from the section on moving averages that we said that: $M_t = M_{t-1} + \frac{y_t - y_{t-N}}{N}$. If $y_{t-N}$ was unknown and we used $M_{t-1}$ as its best estimate instead, then this can be rewritten as $M_t = \frac{1}{N} y_t + \left(1 - \frac{1}{N}\right) M_{t-1}$. If we say that $\alpha = \frac{1}{N}$, then $M_t$ is another expression for $S'_t$. We can see the similarities between the exponential smoothing concept and the moving averages.

This approach to forecasting is based on the single exponential smoothing method; we


will explain shortly why the word ‘exponential’ is used.

Note We implied that the smaller the α (i.e. the closer α is to zero), the smoother and more horizontal the series of newly calculated values is. Conversely, the larger the α (i.e. the closer α is to one), the more impact the deviations have and potentially the more dynamic the fitted series is. When α = 1, the smoothed values are identical to the original values, i.e. no smoothing is taking place.

Brown’s single exponential smoothing method Brown’s single exponential smoothing method is a basis for the forecasting method called Simple Exponential Smoothing.

Smoothing constant Smoothing constant is a parameter of the exponential smoothing model that provides the weight given to the most recent time series value in the calculation of the forecast value.

The smoothing constant (α) and the number of elements in the interval for calculating moving averages are, in fact, related. The equation that defines this relationship is given by equation (9.21).

$\alpha = \frac{2}{M + 1}$ (9.21)

In equation (9.21), M is the number of observations used to calculate the moving average. The formula indicates that the moving average for three observations that we used earlier is equivalent to α = 0.5. Equally, α = 0.2 is equivalent to M = 9. So, the smaller the value of the smoothing constant, the more horizontal the series will be, just as in the case when a larger number of moving averages is used.

Note If we substituted in the formula for exponential smoothing all the previous values
from the series we would see that, effectively, we are multiplying the newer observations
with higher values of α and the older data in the series with the smaller values of α. By
doing this we are, in effect, assigning a higher importance to the more recent observations.
As we move further in the past, the value of α falls exponentially. This is the reason why we
call it exponential smoothing. In essence, every value in the series is affected by all those
that precede it, but the relative weight (importance) of these preceding values declines
exponentially the further we go in the past.
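To see the exponential decline numerically, expand equation (9.19) backwards: the weight attached to the observation k periods old is α(1 − α)^k. A one-line Python sketch (assuming α = 0.3):

weights = [0.3 * (1 - 0.3) ** k for k in range(5)]
# [0.3, 0.21, 0.147, 0.1029, 0.07203] -- each older weight is 70% of the last

This geometric decay of the weights is exactly the ‘exponential’ in exponential smoothing.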

9.4.3 Forecasting with exponential smoothing


To avoid the confusion between equations (9.18) and (9.19) on the one hand and equa-
tion (9.20) on the other, we must remember that:

$F_t = S'_{t-1}$ (9.22)

Example 9.18
As an example, we can use the same short time series we used to demonstrate how to use
moving averages (see Example 9.15), to create forecasts using Brown’s exponential smoothing
method, as illustrated in Figure 9.32. To start the smoothing process the data analyst must make a choice for the smoothing constant α and the initial estimate of $S'_0$. The value of $S'_0$ is needed to determine the first smoothed statistic:

$S'_1 = \alpha y_1 + (1 - \alpha)S'_0$

Figure 9.32
Applying simple exponential smoothing

In this example, we have chosen α = 0.3 and $S'_0 = y_1 = 150$.



➜ Excel solution
α = Cell C3 Value
Period Cells B6:B10 Values
Yi Cells C6:C10 Values
Si’ Cell D6 Formula: =C6
Cell D7 Formula: =$C$3*C7+(1−$C$3)*D6
Copy formula down D7:D10
Forecast Cell F7 Formula: =D6
Copy formula down F7:F11

Note Alternative methods for calculating the starting value $S'_0$ are employed by analysts, given that simple exponential smoothing is concerned with tracking changes over time in the true average level of the data series. In simple exponential smoothing, setting $S'_0$ to the average of the first six observations in your data set will provide a reasonable starting point.

As was the case with moving averages, in order to forecast one value in the future, we
need to shift the exponential smoothing calculations by one period ahead. The last expo-
nentially-smoothed value will, in effect, become a forecast for the following period.

Note Simple exponential smoothing, just like the moving averages method, is an acceptable forecasting technique, provided we are interested in forecasting only one future period.
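Before turning to the Data Analysis add-in, it may help to see the whole recursion in one place. The sketch below is a minimal Python rendering of Brown's equation (9.20) with the initialization used in Example 9.18 ($S'_0 = y_1$); the five data values are the same hypothetical numbers used in the moving average sketch earlier.

# Brown's single exponential smoothing: S'_t = alpha*y_t + (1 - alpha)*S'_{t-1}.
def brown_smooth(series, alpha):
    smoothed = [series[0]]                 # initialize with the first observation
    for y in series[1:]:
        smoothed.append(alpha * y + (1 - alpha) * smoothed[-1])
    return smoothed

data = [150, 250, 200, 220, 230]           # hypothetical series
s = brown_smooth(data, 0.3)
print(s[-1])   # F_{t+1} = S'_t: the last smoothed value is the one-step forecast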

As an alternative to this formula method, Excel provides the exponential smoothing


method from the Data Analysis add-in pack (Select Data > Select Data Analysis > Select
Exponential Smoothing).

Example 9.19
Apply the Data Analysis method to repeat Example 9.18.

Select Cell D4 (Figure 9.33).

Figure 9.33

Select Data > Select Data Analysis > Select Exponential Smoothing (Figure 9.34).

Figure 9.34

Click OK to access the Exponential Smoothing menu.


WARNING: Excel uses the expression ‘Damping factor’, rather than smoothing con-
stant, or α. Damping factor is defined as (1 – α). In other words, if you want α to be 0.3,
you must specify in Excel the value of damping factor as 0.7. Input the menu inputs as
illustrated in Figure 9.35 and click OK.

Figure 9.35

Figure 9.36

If we compare the formula method solution illustrated in Figure 9.32 and the Data
Analysis > Exponential Smoothing solution illustrated in Figure 9.36 we note the Data
Analysis > Exponential Smoothing method always ignores the first observation and pro-
duces exponential smoothing from the second observation. It also cuts short with the
exponential smoothing values, as the last exponentially-smoothed value corresponds
with the last observation in the series. You can easily extend the last cell one period in the
future to get a short-term forecast by inserting a time point 6 in cell B9 and then dragging
the Excel formula in cell D8 down to cell D9, as illustrated in Figure 9.37.

Figure 9.37

The second thing that becomes obvious is that you cannot change the values of α and
see automatically what effect this has on your forecasts. This means that you would be bet-
ter off producing your own set of formulae, as shown in Figure 9.32.

Example 9.20
Consider the data set in Table 9.9 and smooth the data using Brown’s exponential smoothing
method with smoothing factors 0.1 and 0.9.

Time point Series value Time point Series value Time point Series value
1 5.38 11 5.3 21 5.49
2 5.36 12 5.51 22 5.38
3 5.38 13 5.49 23 5.46
4 5.65 14 5.38 24 5.43
5 5.59 15 5.57 25 5.35
6 5.43 16 5.91 26 5.28
7 5.53 17 5.91 27 5.54
8 5.43 18 5.86 28 5.38
9 5.4 19 5.62 29 5.35
10 5.35 20 5.49 30 5.45

Table 9.9

Figure 9.38 presents the Excel solution.

Figure 9.38

➜ Excel solution
Period Cells A6:A35 Values
Series Cells B6:B35 Values
α = 0.1
α = Cell C3 Value
Smoothed data Cell C6 Formula: =B6
Cell C7 Formula: =$C$3*B7+(1−$C$3)*C6
Copy formula down C7:C35
α = 0.9
α = Cell F3 Value
Smoothed data Cell F6 Formula: =B6
Cell F7 Formula: =$F$3*B7+(1−$F$3)*F6
Copy formula down F7:F35

Figure 9.39 Exponential smoothing with α = 0.1 and α = 0.9: the actual values plotted with the two forecast series

Figure 9.39 illustrates the original data and exponential smoothing forecasts with two
smoothing constants α = 0.1 and 0.9.
Figure 9.39 shows the impact two different values of α have on forecasts. As expected,
smaller α makes forecasts smoother and larger α makes them more dynamic, mimicking
more closely the original time series.
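As a side note for readers following along in Python, the two curves in Figure 9.39 could be reproduced with the brown_smooth() sketch given earlier, called once with each constant on the 30 observations of Table 9.9 (this assumes that sketch and a list named series holding those values):

smoothed_01 = brown_smooth(series, 0.1)   # flat, heavily smoothed fit
smoothed_09 = brown_smooth(series, 0.9)   # dynamic fit tracking the data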

Example 9.21
In this example we will show how to use all the formulae listed in this chapter and we will
demonstrate how to use the Data Analysis > Exponential Smoothing to undertake the analysis.
Figures 9.40 and 9.41 illustrate the Excel solution with a smoothing constant of α = 0.1 using
equations (9.18), (9.19), (9.20), and (9.21), and the Excel Data Analysis > Exponential Smoothing
solution.

Figure 9.40

➜ Excel solution
α = Cell B4 Value
Period Cells A6:A36 Values
Series Cells B6:B35 Values
Forecast data Cell C6 Formula: =B6
Cell C7 Formula: =C6+$B$4*(B6−C6)
Copy formula down C7:C36
Forecast data Cell D6 Formula: =B6
Cell D7 Formula: =$B$4*B6+(1−$B$4)*D6
Copy formula down D7:D36

Figure 9.41

➜ Excel solution
Smoothed data Cell E6 Formula: =B6
Cell E7 Formula: =$B$4*B7+(1−$B$4)*E6
Copy formula down E7:E35
Forecast data Cell F6 Formula: =B6
Cell F7 Formula: =F6+$B$4*(B6−F6)
Copy formula down F7:F36
Forecast data with Data Analysis
Cell G6 N/A
Cell G7 Formula: =B6
Cell G8 Formula: =0.1*B7+0.9*G7
This formula is copied down by Excel between G8:G35
Cell G36 Copy formula down G35:G36

Excel Data Analysis solution


As the application of the equations (9.18), (9.19), (9.20), and (9.21) is self-explanatory,
we will focus on the automated Excel solution. To access the Excel data analysis solution,
select Data > Data Analysis > Exponential Smoothing and input the required inputs, as
illustrated in Figure 9.42.

Figure 9.42

In Figure 9.42, the Input Range is B6:B35, Damping factor = 1 – smoothing constant = 1
– 0.1 = 0.9, Output Range G6. Clicking OK will then produce the solution presented in cells
G6:G35 in Figure 9.41.

Note WARNING: As already indicated, Excel uses the expression ‘Damping factor’,
rather than smoothing constant (α). The ‘Damping factor’ is defined as (1 – α). For example, a
smoothing constant α = 0.1 defines the damping factor as equal to 0.9 (damping factor = 1 –
α = 1 – 0.1 = 0.9).

The Data Analysis > Exponential Smoothing method produces solutions that start in cell G6, but it returns an #N/A value there. Cell G7 is a copy of B6, followed by the formula =0.1*B7+0.9*G7 in cell G8. The formula is essentially the same as the one in column D.

Why can't we use Brown's exponential smoothing values directly for forecasting? As we said, every current smoothed value has to be treated as the future forecast. Pay attention to row 36. This is where we produced forecasts. You will notice that the only column in which we were not able to do this is column E. However, if these cells are shifted down by one cell, as in column F, then Brown's exponential smoothing values become the forecasts, as illustrated in Figure 9.43.

Figure 9.43

The advantage of using built-in Data Analysis > Exponential Smoothing is that we get
a chart and standard errors included automatically with the results. We have not shown
them here, but we are encouraging readers to experiment with various output options.

9.5 Forecasting seasonal series with exponential smoothing
We mentioned briefly that time series can also be seasonal in nature. Seasonality is meas-
ured by the number of periods after a particular pattern repeats itself. It does not mean
that the values will be identical, just that the pattern will have similar features. Using clas-
sical methods, such as time series decomposition, we could break the time series down
into its constituent components, such as trend, seasonal component, and the irregular
component. According to this classical method, we can then recompose the time series
using one of the models (additive, multiplicative, or mixed). In this chapter we will use a
different approach. We will combine exponential smoothing with some of the model re-
composition methods.
Earlier in this chapter we fitted a time series using linear trend. The formula for the
straight line used as a trend was: Y = a + bx. You will also recall that we said that various
time series components (trend, cyclical, seasonal, and irregular) can be added together to
form a time series decomposition model. As we said, this additive relationship is only one
of several possible options. In fact, sometimes the components form not an additive, but a
multiplicative model. Which one is the correct one is beyond the scope of this book, but it
suffices to say that it will depend primarily on the type of time series used. In order to intro-
duce one of the possible approaches to handling the seasonal time series, we need to accept
that we can use either an additive or a multiplicative model. They are defined as follows:

$F_{t+m} = a_t + S_{t-s+m}$ (9.23)

$F_{t+m} = a_t S_{t-s+m}$ (9.24)

Note As a general guidance, multiplicative models are better suited for time series that
show dramatic growth or decline (non-stationary time series), while additive models are more
suited for less dynamic time series (stationary time series).

Just as with the linear trend, the coefficient at is an intercept of the series, but, in this
case, a dynamic one. St−s+m represents the slope and is called a seasonal component. The
meaning of the symbols s and m in the subscript t + m and t−s + m is: s = number of peri-
ods in a seasonal cycle, and m = number of forecasting periods (forecasting horizon).
The main feature of this approach is that we can use exponential smoothing to estimate
dynamical values of not just the seasonal component, but of the intercept too. For an
additive model, these two factors are calculated as follows.

$a_t = \alpha(y_t - S_{t-s}) + (1 - \alpha)a_{t-1}$ (9.25)

$S_t = \delta(y_t - a_t) + (1 - \delta)S_{t-s}$ (9.26)

For a multiplicative model, these two factors are calculated as follows:

$a_t = \alpha\left(\frac{y_t}{S_{t-s}}\right) + (1 - \alpha)a_{t-1}$ (9.27)

$S_t = \delta\left(\frac{y_t}{a_t}\right) + (1 - \delta)S_{t-s}$ (9.28)

As we can see, unlike the simple exponential smoothing which required only one
smoothing constant, here we are using two smoothing constants, alpha (α) and delta (δ).
In both cases we need to initialize the values of $a_t$ and $S_t$. This is achieved, for additive models, by calculating $a_s$ from equation (9.29).

$a_s = \sum_{t=1}^{s} \frac{y_t}{s}$ (9.29)

where t = 1, 2, . . ., s and $a_t = a_s$. In other words, each of the first s values of $a_t$ is calculated as an average of the corresponding actual observations. The initial values of $S_t$ are calculated from equation (9.30).

$S_t = y_t - a_t$ (9.30)

For the multiplicative model $a_s$ is calculated in the same way as in (9.29), and $S_t$ is calculated as:

$S_t = \frac{y_t}{a_t}$ (9.31)

The forecasts are produced as follows. To produce back-forecasts (i.e. m = 0) for a current period, for example at t = 15, and assuming that s = 4, we would use the following equation:

$F_{15+0} = a_{15} + S_{15-4+0} = a_{15} + S_{11}$

If the series has only 20 observations, for example, and we want to produce the forecast for the 23rd period, the forecast is calculated as:

$F_{20+3} = a_{20} + S_{20-4+3} = a_{20} + S_{19}$

Forecasting horizon A number of the future time units until which the forecasts will be extended.

For a multiplicative model we use the same principle, except that the components are
not added, but multiplied.
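The recursions (9.25) and (9.26) translate directly into code. The sketch below is a minimal Python rendering of the additive model, assuming the first seasonal cycle is initialized the way the Excel solution in Figure 9.45 does it (each of the first s intercepts is the average of the same quarter across the years, and the initial seasonals follow equation (9.30)); seasonal_additive is a hypothetical name.

# Additive seasonal exponential smoothing (equations (9.25) and (9.26)).
def seasonal_additive(series, s, alpha, delta):
    n = len(series)
    a = [0.0] * n                  # dynamic intercept a_t
    S = [0.0] * n                  # seasonal component S_t
    years = n // s
    for t in range(s):
        # initialize: a_t = same-quarter average across years, S_t = y_t - a_t
        a[t] = sum(series[t + k * s] for k in range(years)) / years
        S[t] = series[t] - a[t]
    fitted = [None] * n
    for t in range(s, n):
        a[t] = alpha * (series[t] - S[t - s]) + (1 - alpha) * a[t - 1]
        S[t] = delta * (series[t] - a[t]) + (1 - delta) * S[t - s]
        fitted[t] = a[t] + S[t - s]         # back-forecast F_t = a_t + S_{t-s}
    # forecasts for the next s periods reuse the last intercept and seasonals
    future = [a[-1] + S[n - s + m] for m in range(s)]
    return fitted, future

For the multiplicative model the same skeleton applies, with the subtractions and additions of the seasonal component replaced by divisions and multiplications, as in equations (9.27) and (9.28).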

Example 9.22
For the time series data presented in Table 9.10 calculate seasonal forecasts using the simple
seasonal additive exponential smoothing model.

Year
Quarter 1 2 3 4 5 6
1 17.15 16.80 13.85 16.99 21.12 17.35
2 19.87 15.87 19.67 18.96 21.03 17.57
3 20.53 17.13 20.29 24.84 24.55 18.19
4 20.78 18.11 20.94 23.11 25.90 21.66

Table 9.10

If we plot the time series, as illustrated in Figure 9.44, we note a definite pattern in the shape of the curve, which repeats over the same quarters. This suggests a seasonal model would be appropriate.
Figure 9.44 Time series plot of the quarterly series

Figure 9.45 illustrates the Excel solution.

➜ Excel solution
Period Cell B5:B32 Values
Quarter Cell C5:C32 Values
Series Cell D5:D28 Values
Alpha α Cell H4 Value =0.5
Delta δ Cell H5 Value =0.5
MSE Cell H6 Formula: =SUMXMY2(D9:D28,G9:G28)/COUNT(D9:D28)
at Cell E5 Formula: =(D5+D9+D13+D17+D21+D25)/6
Copy formula down E5:E8

Cell E9 Formula: =$H$4*(D9−F5)+(1−$H$4)*E8
Copy formula down E9:E28
St Cell F5 Formula: =D5−E5
Copy formula down F5:F8
Cell F9 Formula: =$H$5*(D9−E9)+(1−$H$5)*F5
Copy formula down F9:F28
Forecast Ft Cell G9 Formula: =E9+F5
Copy formula down G9:G28
Cell G29 Formula: =$E$28+F5
Copy formula down G29:G32

Figure 9.45

Cells H4 and H5 contain the values of constants α and δ, while cell H6 contains mean
square error (MSE).
We assigned the initial values to both α and δ as 0.5 each.
Figure 9.46 illustrates the fit of the forecasted values onto the initial time series plot.

Figure 9.46 Initial forecast chart: time series plot with the fitted seasonal model

Mean square error The mean value of all the differences between the actual and forecasted values in the time series. The differences between these values are squared to avoid positive and negative differences cancelling each other.

By inputting manually the value of 0.5 in cells H4 and H5 as the values of constants α
and δ, we automatically get in cell H6 the MSE value of 2.036. Cell H6 contains the MSE
formula: =SUMXMY2(D9:D28,G9:G28)/COUNT(D9:D28).
This formula will be explained fully in the next section and we’ll use it here just as a
method for estimating the values of α and δ. We used Excel’s solver function to find the
optimum values of α and δ. Let us explain how this is done. We put manually any value
to cells H4 and H5—in our example 0.5 in each cell. After that, we put together all the for-
mulae and calculate forecasts. Once this is done, we click on cell H6 where the formula for
MSE resides. To access the Excel Solver menu select Data > Solver.
In Figure 9.47, we specify that we want cell H6 to take the minimum value, by changing cells H4 and H5, under the condition that both cells H4 and H5 should never be less than 0 or greater than 1.

Figure 9.47

This changes all the calculated cells automatically and produces the forecast as per
Figure 9.48. As we can see, the Solver has changed the values of alpha and delta, which in
turn had an effect on all our formulae and forecasts.
Figure 9.49 illustrates the forecast using the seasonal exponential smoothing method.
We observe that the forecast values are a good fit to the actual data values. As we can see,
this method, although fairly simple, produces impressive results.
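Solver's job here, minimizing the MSE by varying α and δ between 0 and 1, can be reproduced with a numerical optimizer. The following is a hedged sketch that reuses the hypothetical seasonal_additive() function from the earlier sketch and the 24 quarterly values from Table 9.10; scipy's bounded minimizer stands in for Solver.

import numpy as np
from scipy.optimize import minimize

series = [17.15, 19.87, 20.53, 20.78, 16.80, 15.87, 17.13, 18.11,
          13.85, 19.67, 20.29, 20.94, 16.99, 18.96, 24.84, 23.11,
          21.12, 21.03, 24.55, 25.90, 17.35, 17.57, 18.19, 21.66]

# Objective: the MSE between actual values and back-forecasts (cell H6).
def mse_for(params, series, s):
    alpha, delta = params
    fitted, _ = seasonal_additive(series, s, alpha, delta)
    errors = [y - f for y, f in zip(series[s:], fitted[s:])]
    return float(np.mean(np.square(errors)))

# Start from 0.5 for both constants, as in the chapter, and keep them in [0, 1].
result = minimize(mse_for, x0=[0.5, 0.5], args=(series, 4),
                  bounds=[(0.0, 1.0), (0.0, 1.0)])
best_alpha, best_delta = result.x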

Figure 9.48

Figure 9.49 Time series plot with the fitted seasonal model and its forecasts

Student exercises
X9.14 For what kind of time series would you use a multiplicative versus additive seasonal
exponential smoothing model?
X9.15 Why are seasonal parameters at and St in the seasonal exponential smoothing method
called dynamic parameters?
X9.16 What is the role of mean square error (MSE) in seasonal exponential smoothing
method?

9.6 Forecasting errors

9.6.1 Error measurement


One of the primary reasons for using forecasting as a tool is to try to reduce uncertainty.
The better the forecasts, the lower the uncertainty that surrounds the variable we forecast.

We can never eliminate uncertainty, but good forecasts can reduce it to an acceptable
level. What would we consider to be a good forecast? An intuitive answer is that it has to be the one that shows the smallest error when compared with the actual event. The problem with this statement is that we cannot measure the error until the event has happened, by which time it is too late to say that our forecast was, or was not, good. In a way, we would
like to measure the error before the future unfolds. How do we do this? As we demon-
strated in this chapter, when forecasting, we always used the model to back-fit the existing
time series. This is sometimes called back-casting or, more appropriately, ex-post fore-
casting. Once we have produced ex-post forecasts, it is easy to measure deviations from
the actual data. These deviations are forecasting errors, and they will tell us how good our
method or model is.

❉ Interpretation The main assumption we make here is: whichever model shows the
smallest errors in the past, it will probably make the smallest errors when extrapolated in the
future. In other words, the model with smallest historical errors will reduce the uncertainty
that the future brings. This is the key assumption.

Calculating errors is one of the easiest tasks. We can define an error as a difference
between what actually happened and what we thought would happen. In the context of
forecasting time series and models, error is the difference between the actual data and the
data produced by a model, or ex-post forecasts. This can be expressed as a formula:

$e_t = A_t - F_t$, or $e_t = y_t - F_t$ (9.32)

where $e_t$ is the error for period t, $A_t$ is the actual value in period t, and $F_t$ is the forecasted value for the same period t.

Example 9.23
Figure 9.50 shows an example of how to calculate forecasting errors.

In Example 9.23, using some simple method, we produced back-forecasts that deviate
clearly from the actual historical values. Figure 9.50 shows the results.

Figure 9.50

➜ Excel solution
Period Cells B4:B8 Values
Actual Cells C4:C8 Values
Forecast Cells D4:D8 Values
Error Cell E4 Formula: =C4−D4
Copy formula down E4:E8

Forecasting errors A difference between the actual and the forecasted value in the time series.

Sum Cell C9 Formula: =SUM(C4:C8)
Copy formula across C9:E9
Average Cell C10 Formula: =AVERAGE(C4:C8)
Copy formula across C10:E10

Figure 9.51 is a chart showing actual and forecasted values.

Figure 9.51 Actual versus forecasted values

For period 1 (t = 1) our method exceeded the actual value, which is presented as −30 because errors are calculated as actual minus forecasted. For period t = 2, our method undershot by 100. Wherever the forecast matched the actual value exactly, the method generated no error. What can we conclude from this? If these were the first 5 weeks of our new business venture, and if we add all these numbers together, then our cumulative forecast for these 5 weeks would have been 1060. In reality the business generated 1040. This implies that the method we used made a cumulative error of −20, or it overestimated reality by 20 units. If we divide this cumulative value by the number of weeks to which it applies, i.e. 5, we get the average value of our error:

$\bar{e} = \frac{\sum(A_t - F_t)}{n} = \frac{-20}{5} = -4$

❉ Interpretation The average error that our method generates per period is −4
and because errors are defined as differences between the actual and forecast values, this
means that on average the actual values are 4 units higher than our forecast. Given earlier
assumptions that the method will probably continue to perform in the future as in the past
(assuming there are no dramatic or step changes), our method will probably generate similar
errors in the future.

Assuming that we decided to experiment with some other method, and assuming that
the average error that this other method generated was 2, which method would you rather

use to forecast your business venture? The answer, hopefully, is very straightforward. The
second method is somewhat pessimistic (the actual values are 2 units per period below
the forecasted values), but in absolute terms 2 is less than 4. Therefore, we would recom-
mend the second method as a much better model for forecasting this particular business
venture. In this example, we have not only decided which forecasting method reduces
uncertainty more, but we have also learned how to use two different ways of measuring
this uncertainty. Using errors as measures of uncertainty, we learned how to calculate an
average, or mean error, and we implied that an absolute average error also makes sense to
be estimated. In practice, other error measurements are also used.

9.6.2 Types of errors


In fact, many error measurements are used to assess how good the forecasts are. The four more commonly used error measurements are the mean error (ME), the mean absolute error (sometimes called the mean absolute deviation and abbreviated as MAD), the mean square error (MSE), and the mean percentage error (MPE). These errors are calculated as follows:

$ME = \frac{\sum(A_t - F_t)}{n}$ (9.33)

$MAD = \frac{\sum|A_t - F_t|}{n}$ (9.34)

$MSE = \frac{\sum(A_t - F_t)^2}{n}$ (9.35)

$MPE = \frac{\sum\left(\frac{A_t - F_t}{A_t}\right)}{n} = \frac{\sum\left(\frac{e_t}{A_t}\right)}{n}$ (9.36)

Sometimes MPE causes problems in Excel (owing to negative values), in which case it is better to estimate the mean absolute percentage error (MAPE):

$MAPE = \frac{\sum\left|\frac{A_t - F_t}{A_t}\right|}{n} = \frac{\sum\left|\frac{e_t}{A_t}\right|}{n}$ (9.37)

Error measurements A method of validating the quality of forecasts. Involves calculating the mean error, the mean squared error, and the percentage error, etc.

Mean absolute deviation (MAD) The mean value of all the differences between the actual and forecasted values in the time series. The differences between these values are represented as absolute values, i.e. the effects of the sign are ignored.

Mean percentage error (MPE) The mean value of all the differences between the actual and forecasted values in the time series. The differences between these values are represented as percentage values.

Mean absolute percentage error (MAPE) The mean value of all the differences between the actual and forecasted values in the time series. The differences between these values are represented as absolute percentage values, i.e. the effects of the sign are ignored.

Example 9.24
It is very easy to calculate these errors using Excel. Using our previous example, these errors are calculated as shown in Figure 9.52.

Figure 9.52
Calculating various errors

➜ Excel solution
Period Cells B4:B8 Values
Actual Cells C4:C8 Values
Forecast Cells D4:D8 Values
Error Cell E4 Formula: =C4−D4
Copy down E4:E8
MAD Cell F4 Formula: =ABS(E4)
Copy down F4:F8
MSE Cell G4 Formula: =E4^2
Copy down G4:G8
MPE Cell H4 Formula: =E4/C4
Copy down H4:H8
MPE % Cell I4 Formula: =H4*100
Copy down I4:I8
MAPE Cell J4 Formula: =F4/C4
Copy down J4:J8
Sum Cell C9 Formula: =SUM(C4:C8)
Copy formula across C9:J9
Average Cell C10 Formula: =AVERAGE(C4:C8)
Copy formula across C10:J10

Column I in Figure 9.52 is identical to column H. The only difference is that we used Excel percentage formatting to present the numbers as percentages, rather than decimal values. Rather than calculating individual errors (as in columns E–J) and adding all the individual error values (as in row 9) or calculating the average (as in row 10), we could have calculated all these errors with a single formula line for each type of error.
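Outside Excel, all five measures collapse into a few lines. A minimal Python sketch, assuming the actual and forecast values are held in equal-length sequences:

import numpy as np

def error_measures(actual, forecast):
    a = np.asarray(actual, dtype=float)
    f = np.asarray(forecast, dtype=float)
    e = a - f                                  # equation (9.32)
    return {'ME':   e.mean(),                  # mean error (9.33)
            'MAD':  np.abs(e).mean(),          # mean absolute deviation (9.34)
            'MSE':  (e ** 2).mean(),           # mean square error (9.35)
            'MPE':  (e / a).mean(),            # mean percentage error (9.36)
            'MAPE': np.abs(e / a).mean()}      # mean absolute % error (9.37)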

Example 9.25
Using some of the built-in Excel functions, these errors can be calculated as illustrated in
Figures 9.53 and 9.54.

Figure 9.53

Figure 9.54
Single cell formulae for calculating the mean error (ME), mean absolute deviation (MAD), mean square
error(MSE), mean percentage error (MPE), and mean absolute percentage error (MAPE)

➜ Excel solution—alternative
ME Cell D12 Formula: =(SUM(B2:B6)−SUM(C2:C6))/COUNT(B2:B6)
MAD Cell D13 Formula: {=SUM(ABS(D2:D6))/COUNT(D2:D6)}
MSE Cell D14 Formula: =SUMXMY2(B2:B6,C2:C6)/COUNT(B2:B6)
MPE Cell D15 Formula: {=SUM(((B2:B6)−(C2:C6))/(B2:B6))/COUNT(B2:B6)}
MAPE Cell D16 Formula: {=SUM(ABS((B2:B6)−(C2:C6))/(B2:B6))/COUNT(B2:B6)}

Note that the MAD, MPE, and MAPE formulae have curly brackets on both sides of the formulae. Do not enter these brackets manually. Excel enters the brackets automatically if, after typing the formula, you press not just the Enter key but CTRL + SHIFT + ENTER (i.e. all three keys at the same time). This means that the range is treated as an array. Just for the sake of clarity, Figures 9.53 and 9.54 reproduce the spreadsheet as it should look if the single cell formulae for the error calculations were used. Again, note that the curly brackets for MAD, MPE, and MAPE are not visible by observing the formulae in cells D13, D15, and D16. However, they are visible in the formula bar, as illustrated in Figure 9.54.

9.6.3 Interpreting errors


Mean error (ME) The mean value of all the differences between the actual and forecasted values in the time series.

How do we interpret the five different error measurements? We have already said that ME indicates that the actual data are, on average, four units per period above the forecasted values. This is a good indication, but the problem is that positive and negative deviations eliminate each other, so we might end up with a forecast that jumps up and down around

the actual values, never providing exact forecasts, yet the ME could be zero. To eliminate
the problem with ME, we can calculate MAD. MAD indicates that if we eliminate over-
and underestimates of our forecasts, a typical bias that our method shows (regardless of
whether it is positive or negative) is 44 units per period. This is typical error, regardless of
the direction in which our forecasts went when estimating the actual values.
The meaning of the MSE is more difficult to interpret, for the simple reason that we have taken the square values of our errors. Why square the errors? The rationale is as follows: if there are some big deviations of our forecast from the actual values, then in order to magnify these deviations we need to square them. Let's take an example of two hypothetical errors for a period. Let one error reading show 2 and the other one 10. The second error is five times larger than the first one. However, when we square these two numbers, 100 (10 × 10) is 25 times larger than 4 (2 × 2). This is what we mean by magnifying large errors. So, the higher the MSE, the more extreme deviations from the actual values are contained in our forecast. This is particularly useful when comparing two forecasts. If the MSE obtained from the first forecast is larger than the MSE from the second, then the first forecast contains more extreme deviations than the second one.
The interpretation of the MPE is very intuitive. It tells us that, on average, an error con-
stitutes x% of the actual value or, as in our case, MPE = –8.73%. This means that on average
our forecasting errors overshot the actual values by 8.73% (remember that negative error
means forecasts overshooting the actual values and positive error means undershooting).
However, this implies that, just like with the ME, we could have a series of overshoots and
undershoots (as in our example), yet gaining an average value of almost zero. The mean
absolute percentage error (MAPE) addresses this problem. It shows us the value of 0.2473.
In other words, if we disregard positive and negative variations of our forecasts from the
actual values, we are, on average, making an absolute error of 24.73%.

9.6.4 Error inspection


At the beginning of this chapter, we mentioned that forecasting errors should be treated as
the residual element; in other words, something that is moving completely randomly when
observed visually. Regardless of the forecasting method, errors should always be calculated
and inspected. Should these errors follow any kind of pattern, the forecasting method must
be treated as suspect. In fact, forecasting errors are often required to adhere to some formal
assumptions, such as independence, normality, and homoscedasticity (see section 8.2.4
for full explanation). The meaning of some of these terms is not just that we do not want to
see any pattern among the errors (residuals), but also that they should be independent of
each other, normally distributed, and have a constant variance (homoscedasticity). Some
rigorous tests exist to help us determine if the residuals violate any of these assumptions,
but they are beyond the scope of this book. One of the most elementary steps to take after
we produce our forecasts is to check visually the residuals/errors. Figure 9.55 shows an
example of plotting forecasting errors. It appears that the residuals are flowing in a random
fashion and do not show any pattern, which is exactly what we wanted.
In addition to checking visually that forecasting errors are behaving randomly, we also
need to ensure that they are not dependent on one another. In other words, errors must
not be correlated. There are other properties that errors need to comply with, but this is
beyond the scope of the chapter on basic forecasting.
Figure 9.55 Error/residual plot of the residuals against time

Note One of the methods of verifying whether residuals are correlated is the
autocorrelation plot. Autocorrelations are the coefficients that we calculate and they form an
autocorrelation function. They can be used for various purposes, but here we are referring to
the series of autocorrelations of residuals. Essentially, we lag the residuals by one time period,
then by another and another, etc., and then measure correlations between all these lagged
series of residuals. This is called a residual autocorrelation function.
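As a hedged illustration of the idea, the sketch below computes lag-k autocorrelation coefficients for a residual series in Python; the residuals here are randomly generated stand-ins, so coefficients close to zero at every lag are exactly what we would hope to see for a well-behaved forecast.

import numpy as np

def autocorrelation(residuals, lag):
    r = np.asarray(residuals, dtype=float)
    r = r - r.mean()
    # correlation between the residuals and themselves shifted by 'lag' periods
    return float(np.sum(r[lag:] * r[:-lag]) / np.sum(r * r))

res = np.random.default_rng(1).normal(size=60)    # stand-in residual series
acf = [autocorrelation(res, k) for k in range(1, 13)]
print(np.round(acf, 2))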

Student exercises
X9.17 Could you explain the difference between accuracy and precision? What are the
consequences if your forecasts are accurate, but not precise? Could you have precise
forecasts that are not accurate?
X9.18 Why is the MAD type of error measurement preferred over the ME type of error?
X9.19 Two forecasts were produced, as shown in Table 9.11. The ME for the second forecast
is some seven times larger than the ME for the first forecast. However, the MSE is some
20 times larger. Can you explain?

X Y ŷ1 ŷ2 e1 e2
1 230 230 230 0 0
2 300 305 305 −5 −5
3 290 295 295 −5 −5
4 320 320 320 0 0
5 350 345 345 5 5
6 400 402 350 −2 50
7 350 355 355 −5 −5
8 400 395 395 5 5

9 420 420 420 0 0


ME = −0.78 5.78
MSE = 14.33 291.67

Table 9.11

X9.20 Is it acceptable to see some regularity in the pattern when examining the series of residuals or forecasting errors?
X9.21 The closer the actual observations, when compared with forecasted values on a scatter
diagram, are to the diagonal line, the better the forecasts. Is this correct?

9.7 Confidence intervals

9.7.1 Population and sample standard errors


We need to remind ourselves from the sampling chapter that the standard error (SE) of
the mean is calculated as:

$\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}}$ (9.38)

We also said that when dealing with a normal distribution, we expect 68.3% of all the values to be within $\bar{x} \pm 1\sigma$, 95.4% of all the values to be within $\bar{x} \pm 2\sigma$, and 99.7% to be within $\bar{x} \pm 3\sigma$. We also said that to change any distribution into a standard distribution, standardized z units need to be calculated using equation (9.39).

$z = \frac{x - \mu}{\sigma}$ (9.39)

Z-values are used for estimating the confidence interval (CI) of the estimate of the mean using equation (9.40).

$CI = \bar{X} \pm z \times SE$ (9.40)

where SE is the standard error. Depending on the value of z, we get different CIs. For: (a) z = 1.64 for 90% CI, (b) z = 1.96 for 95% CI, and (c) z = 2.58 for 99% CI. It is important to also remind ourselves that most of the time we cannot calculate the SE for the simple reason that we do not know the population standard deviation (σ). In this case, the sample standard deviation is calculated using equation (9.41).

$s = \sqrt{\frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n - 1}}$ (9.41)

Confidence interval A confidence interval gives an estimated range of values which is likely to include an unknown population parameter.

Population standard deviation The population standard deviation is the standard deviation of all possible values.

Sample standard deviation A sample standard deviation is an estimate, based on a sample, of a population standard deviation.

Now that we have the standard deviation of the sample (the data set, or the time series) we can modify the equation for the standard error (SE), as given by equation (9.42).

$SE = \frac{s}{\sqrt{n}}$ (9.42)

How do we estimate the confidence interval of the sample mean? First of all, if the time
series is relatively short and represents just a small sample of the true population data val-
ues, then the t distribution is used for the computation of the confidence interval, rather
than the z-value.

$\bar{X} \pm t_{value} \, SE$, or $\bar{X} \pm t_{value}\left(s/\sqrt{n}\right)$ (9.43)

The only difference between equations (9.43) and (9.40) is that the t-value in equation
(9.43) will be determined not just by the level of significance (as was the case with the
z-values), but also by the number of degrees of freedom.

Note A general rule is that for larger samples, the z-values and the t-value produce
similar results, so it is discretionary which one to use. A large sample in time series analysis is a
series with more than 100 observations.

9.7.2 Standard errors in time series


In equation (9.41) we measured the differences between every observation and the mean.
This was the basis for calculating the standard deviation. When dealing with time series,
it would seem logical to use the same principle, but, rather than calculating deviations
from the mean, we calculate deviations between the actual and predicted values. We can
modify equation (9.41) and instead of the mean value use the predicted values, as defined
by equation (9.44).

$SE_{y,\hat{y}} = \sqrt{\frac{\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}{n - 2}}$ (9.44)

Here, $y_i$ are the actual observations and $\hat{y}_i$ are the predicted values. The Excel version of this formula is =SQRT(SUMXMY2(array_x, array_y)/(n−2)). Actually, Excel offers an even more elegant function as a substitute for this formula. The function is called =STEYX(known_y's, known_x's). This function returns the standard error of forecast. If you look into Excel's Help file, you will see that this function is a very elegant representation of a monstrous-looking equation given by (9.45).
$SE_{y,x} = \sqrt{\frac{1}{n(n-2)}\left[n\sum y^2 - \left(\sum y\right)^2 - \frac{\left[n\sum xy - \left(\sum x\right)\left(\sum y\right)\right]^2}{n\sum x^2 - \left(\sum x\right)^2}\right]}$ (9.45)

Standard error of forecast The square root of the variance of all forecasting errors adjusted for the sample size.

Note Remember that these two formulae are identical:

=SQRT(SUMXMY2(array_x, array_y)/(n−2))

and

=STEYX(known_y's, known_x's)

They both return the standard error for the predicted values.

If the standard error SEy,x is a measure of the amount of error in the prediction of y for an
individual x, this means we can modify equation (9.43) into equation (9.46).

$\hat{y} \pm t_{value} \times SE_{y,\hat{y}}$ (9.46)

Here $\hat{y}$ are the predicted values, $SE_{y,\hat{y}}$ is the standard error of prediction, and $t_{value}$ is the t-value from the Student's t critical table. We can recap that, depending on the desired confidence interval (CI), the values for z and t are as follows: (a) CI = 90% for z = 1.64 and t = 1.73, (b) CI = 95% for z = 1.96 and t = 2.09, and (c) CI = 99% for z = 2.58 and t = 2.86. The values of t are not fixed, as they depend on the number of degrees of freedom and the size of the sample.

Note The t-values are not universal for these given levels of confidence. The calculation
of the t-values depends on a number of degrees of freedom. The above t-values are only
valid for 8 degrees of freedom, which is the length of our time series minus 2.

Example 9.26
Consider fitting a confidence interval to the data set represented in Table 9.12 and use the Excel
trend function to provide forecasts for the next five time periods.
We have a very short time series with only ten observations; we will use the Excel TREND
function to produce forecasts and fit a confidence interval to the forecasts.
Figure 9.56 illustrates the technique using a very short time series.

Figure 9.56

X Y
1 2
2 1
3 1
4 4
5 13
6 3
7 8
8 6
9 9
10 10

Table 9.12

➜ Excel solution
X Cell B4:B18 Values
Y Cell C4:C13 Values
Trend Cell D4 Formula: =TREND($C$4:$C$13, $B$4:$B$13, B4)
Copy formula down D4:D18
SE = Cell H3 Formula: =STEYX(C4:C13, B4:B13)
Alpha = Cell H4 Value
df = Cell H5 Formula: =COUNT(C4:C13)−2
t-value = Cell H6 Formula: =T.INV.2T(H4, H5)
– Interval E4 Formula: =D4−$H$3*$H$6
Copy formula down E4:E18
+ Interval F4 Formula: =D4+$H$3*$H$6
Copy formula down F4:F18

This trend function was extrapolated five periods in the future. Figure 9.57 illustrates
the graph for the prediction and the corresponding confidence interval.
The calculations, as well as the graph, indicate that we are on the right track as far as
the confidence interval is concerned, except that it does not comply with one intuitive
assumption. It is intuitive to think that the confidence interval is not constant and that
it should change with time. In other words, the further we go in the future, the wider the
interval should be as the uncertainty increases. As we can see from the above example,
the level of confidence here is a constant value. In order to make the confidence level
change with time, in addition to equation (9.46), we will need to replace equation (9.44)
with equation (9.47).

$SE_{y,x} = SE_{y,\hat{y}} \sqrt{1 + \frac{1}{n} + \frac{(x_i - \bar{x})^2}{\sum(x_i - \bar{x})^2}}$ (9.47)

Figure 9.57 Prediction values with constant confidence intervals (series, trend, and the − and + interval lines)

This is exactly the same equation as equation (8.36) from Chapter 8. The only exception is that in Chapter 8 we included the critical t-value in the equation, while here it is included in the procedure.

Example 9.27
We’ll use exactly the same example to demonstrate the effects of this additional formula.

Figure 9.58 illustrates the Excel solution to calculate the interval estimate.

Figure 9.58

The only difference from Figure 9.56 is that we had to introduce two additional columns, one for SE_{y,x} (column F) and one for the mean value of x (column E). The mean column, as we will see, helps with the implementation of equation (9.47) in the Excel solution.

➜ Excel solution
X Cell B4:B18 Values
Y Cell C4:C13 Values
Trend Cell D4 Formula: =TREND($C$4:$C$13,$B$4:$B$13,B4)
Copy formula down D4:D18
Mean X Cell E4 Formula: =AVERAGE($B$4:$B$13)
Copy formula down E4:E13
SE (Ŷ) = Cell F4 Formula: =SQRT(1+(1/COUNT($B$4:$B$13))+(B4−AVERAGE($B$4:$B$13))^2/SUMXMY2($B$4:$B$13,$E$4:$E$13))
Copy formula down F4:F18
Alpha = Cell I3 Value
df = Cell I4 Formula: =COUNT(C4:C13)−2
t-value = Cell I5 Formula: =T.INV.2T(I3,I4)
SE = Cell I6 Formula: =STEYX(C4:C13,B4:B13)
+ Interval G4 Formula: =D4+($I$5*$I$6*F4)
Copy formula down G4:G18
– Interval H4 Formula: =D4−($I$5*$I$6*F4)
Copy formula down H4:H18
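Again as a cross-check in code rather than a prescribed method, the sketch below (assuming Python with numpy and scipy, as before) adds the widening factor of equation (9.47), so the interval fans out the further a forecast sits from the mean of the observed x-values.

```python
import numpy as np
from scipy import stats

y = np.array([2, 1, 1, 4, 13, 3, 8, 6, 9, 10], dtype=float)  # Table 9.12
x = np.arange(1, 11, dtype=float)
x_all = np.arange(1, 16, dtype=float)

slope, intercept = np.polyfit(x, y, 1)
trend = intercept + slope * x_all

n = len(y)
resid = y - (intercept + slope * x)
se = np.sqrt(np.sum(resid ** 2) / (n - 2))     # constant SE, as before
t_value = stats.t.ppf(1 - 0.05 / 2, df=n - 2)

# Equation (9.47): the factor grows with the distance of each time point
# from the mean of x (column F of Figure 9.58 computes the same quantity)
widen = np.sqrt(1 + 1/n + (x_all - x.mean())**2 / np.sum((x - x.mean())**2))

lower = trend - t_value * se * widen           # "- Interval" column
upper = trend + t_value * se * widen           # "+ Interval" column
```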

Figure 9.59 illustrates the Excel graphical solution.

Figure 9.59 Prediction values with confidence intervals that widen with time (series, trend, and − / + interval plotted as series value, y, against time point, x)

Student exercises
X9.22 How would you describe the concept of confidence interval in the context of precision
in forecasting?
X9.23 Would you use the z-values or t-values to calculate the confidence interval for a time
series that has 200 observations?
X9.24 Why is it logical to expect that the confidence interval should get wider and wider the further into the future we go with our forecasts?

■ Techniques in practice
TP1 Coco S. A. is considering diversifying and entering the housing market in the USA. They
are only interested in a short-term investment. To help them with the decision and assess the
market, their US analyst extracted a time series that covers US Months’ Supply of Houses For
Sale at Current Sales Rate. The time series is not adjusted for seasonality and the data set reflects
true market movements. Table 9.13 covers data from January 2004 to June 2008.

Month   2004   2005   2006   2007   2008
01      4.2    4.8    5.9    8.2    11
02      3.6    4      6.1    8      10
03      3      3.5    5.1    6.8    10
04      3.5    3.8    5.6    6.5    9
05      3.3    3.7    5.5    6.9    9
06      3.7    4      5.8    7.5    9
07      4.2    3.9    6.9    8
08      4      4.3    6.5    8.9
09      4.4    5      7      9.9
10      4.1    4.7    7.5    9
11      5      5.9    7.8    11
12      5.2    5.9    7.6    11

Table 9.13

Analyse the data and produce a forecast. Pay specific attention to:

(a) Graphing the time series


(b) Deciding on the type of time series
(c) Using the best-suited method and producing forecasts until the end of 2008 (six time
periods in the future)
(d) Measuring the quality of your forecast
(e) Deciding what would be your recommendations, from the data analysis point of view,
to the company.

TP2 Baker Ltd is concerned about the influence of petrol prices on its profit margins. The owner of the company looked at weekly petrol prices (pence per gallon) for London and compiled a time series. The series starts on 14 November 2005 and runs until 4 August 2008. The data, in pence per gallon, are shown in Table 9.14.

250.1 238.6 291.5 269.1 249.7 296.2 296.1 330.9 334.2 423.8
241.6 244.3 287.9 261.5 245 302.6 293.2 331 335.7 423.5
235.8 245.6 283.7 255.9 240.2 304.4 290.7 328.9 341.1 424.7
232.2 261.6 290.5 251.7 237.6 307.1 288.6 328.5 338.5 424.8
234.3 261.7 295.8 247.4 236.2 313.5 287.3 327.8 340.1 424.2
236.8 273.7 299.3 243 236.2 315.1 287.5 327.6 342.9 421.9
238.1 279.7 302 239.9 239.6 313.1 288.9 329.4 349.8 412.7
244.2 292 305.2 239.1 246.6 313.5 293.8 334 362.6 405.1
254 306.4 307.6 238.5 268 310.4 293.2 332.4 375.3
254.8 304.9 309.1 242.8 271.3 307.6 292.7 329.1 376.9
257.7 303.6 305.3 248.9 273.4 307.5 292.5 328.7 385.9
255.6 301.6 300 247.6 275.9 306.6 298.6 326 393.1
253.1 298.9 294.4 252.2 281.4 307 304 326.1 408.3
248.8 296.4 287.5 253.9 287.3 303.6 319.6 327.4 412.1
242.4 292.5 279.3 253.5 295.7 300.4 328.6 333.2 418.3

Table 9.14
The owner is not too familiar with forecasting, but knows how to use the trend function. Put yourself in his shoes and do the following:

(a) Chart the time series


(b) Pick the most suitable curve to fit to the data set
(c) Extrapolate the data another 20 time periods in the future
(d) Calculate the confidence interval
(e) What do you think you need to do to preserve your profit margins?

TP3 Skodel Ltd is considering investing in technology stocks. As a test case, it looked at the adjusted monthly closing values of Microsoft stock between 1 March 2001 and 1 August 2008. The time series is given in Table 9.15.

26.21 30.25 26.02 25.57 22.5 21.52
25.72 29.42 27.17 23.93 22.4 22.09
27.51 27.38 25.24 23.66 21.75 25.5
28.32 27.68 26.72 23.36 20.88 24.67
28.42 30.22 24.73 24.3 21.69 26.94
28.28 29.24 24.76 24.36 20.54 28.01
27.1 28.75 26.35 22.38 20.1 27.15
32.35 28.02 24.57 22.29 20.07 24.59
35.33 26.69 23.83 21.27 21.86 21.63
33.35 25.08 24.76 22.63 24.39 24.12
36.41 23.39 24.2 23.59 22.61 27.98
29.14 22.65 23.12 23.35 18.49 30.86
28.42 22.02 24.07 21.93 20.75 29.25
28.58 23.39 25.06 22.3 20.29 28.64
29.05 26.35 25.48 23.58 23.13 23.12

Table 9.15

Use the exponential smoothing method and experiment with various levels of the smooth-
ing constant. See what impact it has on your forecasts and how it changes the forecasting
errors. Make a recommendation as to what approach to forecasting you would use and why.
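As a starting point for this exercise, a minimal sketch of simple exponential smoothing is given below (plain Python; the function names and the choice of MSE as the comparison measure are our assumptions, while the method itself is the one described in this chapter).

```python
def exp_smooth(series, alpha):
    """Simple exponential smoothing: each forecast blends the latest
    observation with the previous forecast, weighted by alpha."""
    forecasts = [series[0]]                  # seed with the first observation
    for value in series[:-1]:
        forecasts.append(alpha * value + (1 - alpha) * forecasts[-1])
    return forecasts                         # forecasts[t] predicts series[t]

def mse(series, forecasts):
    """Mean square error between actual values and their forecasts."""
    return sum((a - f) ** 2 for a, f in zip(series, forecasts)) / len(series)

# Example: compare several smoothing constants on the Table 9.15 data
# for alpha in (0.1, 0.3, 0.5, 0.9):
#     print(alpha, mse(data, exp_smooth(data, alpha)))
```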

■ Summary
In this chapter we focused on univariate time series analysis as a primary tool for extrapolating time series and forecasting. We described the prerequisites to check before selecting a forecasting method: that all the observations are recorded in the same units of time, that no observation is missing, that there are no unexpected outliers, and that the time series is graphed before proceeding. We explained the concept of indices and how to convert them from one base to another. This was linked with aggregate indices, and we introduced the Consumer Price Index (CPI) as a major method of deflating value-related time series. We also showed how to convert the values into constant dollars.

Various trend models were introduced, as well as how to fit them to time series to produce ex-post forecasts. Alternative methods to trend fitting and extrapolation were introduced, such as the moving average method and exponential smoothing, and the relevance of the smoothing constant α was explained. This was followed by an introduction to applying exponential smoothing to seasonal time series. Once the various forecasting methods and techniques had been mastered, we focused on forecasting errors and how to measure them. The relevance of various error indicators (ME, MSE, MAD, etc.) was introduced, as well as how to interpret them to select the best forecast.
The final element introduced was the confidence interval (CI), which brings together extrapolation and error measurement. We explained how to apply confidence measurement to our forecasts and what its limitations are.

■ Key terms
Additive model; Aggregate price indices; Base index period; Brown's single exponential smoothing method; Classical time series analysis; Classical time series decomposition; Confidence interval; Cyclical variations (C); Error measurements; Exponential smoothing; Exponential trend; Forecasting; Forecasting errors; Forecasting horizon; Index numbers; Irregular variations (I); Linear trend; Logarithmic trend; Mean absolute deviation (MAD); Mean absolute percentage error (MAPE); Mean error (ME); Mean percentage error (MPE); Mean square error (MSE); Mixed model; Moving average trend; Moving averages; Multiplicative model; Multivariate methods; Non-seasonal; Non-stationary; Polynomial line; Polynomial trend; Population standard deviation; Power trend; Residuals (R); Sample standard deviation; Seasonal; Seasonal component; Seasonal time series; Seasonal variations (S); Simple exponential smoothing; Simple index; Smoothing constant; Standard error of forecast; Stationary time series; Time period; Time series; Trend (T); Trend component; Types of trends; Univariate methods

■ Further reading
Textbook resources
1. Brown, R. G. (2004) Smoothing, Forecasting and Prediction. Mineola, NY: Dover Publications.
2. Chatfield, C. (2004) The Analysis of Time Series: An Introduction. Boca Raton, FL; London: Chapman & Hall/CRC.
3. Hanke, J. E. and D. W. Wichern (2005) Business Forecasting. Upper Saddle River, NJ: Pearson/Prentice Hall.
4. Newbold, P. and T. Bos (1994) Introductory Business & Economic Forecasting. Cincinnati, OH: South-Western Pub.
5. Evans, M. K. (2003) Practical Business Forecasting. Malden, MA; Oxford: Blackwell Publishing.

Web resources
1. Engineering Statistics Handbook http://www.itl.nist.gov/div898/handbook/pmc/section4/pmc4.htm (accessed 25 May 2012).
2. Economagic, which contains international economic data sets http://www.economagic.com (accessed 25 May 2012).
3. Wikipedia articles on time series http://en.wikipedia.org/wiki/Time_series (accessed 25 May 2012).
4. Statsoft Electronic Textbook http://www.statsoft.com/textbook/sttimser.html (accessed 25 May 2012).
5. A private collection by Rob Hyndman http://www.robjhyndman.com/TSDL/ (accessed 25 May 2012).
Glossary

The ISI glossary of statistical terms provides definitions in a number of different languages: http://isi.cbs.nl/glossary/index.htm

Addition law for mutually exclusive events: A result used to determine the probability that event A or event B occurs, where both events cannot occur at the same time.
Additive model: The additive time series model is a model whereby the separate components of the time series are added together to identify the actual time series value.
Adjusted r²: Adjusted R squared measures the proportion of the variation in the dependent variable accounted for by the explanatory variables, adjusted for the number of degrees of freedom.
Aggregate price index: A measure of the value of money based on a collection (a basket) of items, compared to the same collection of items at some base date or period of time.
Alpha, α: Alpha refers to the probability that the true population parameter lies outside the confidence interval. Not to be confused with the symbol alpha in a time series context, i.e. exponential smoothing, where alpha is the smoothing constant.
Alternative hypothesis (H1): The alternative hypothesis, H1, is a statement of what a statistical hypothesis test is set up to establish.
Arithmetic mean: The sum of a list of numbers divided by the number of numbers.
Assumptions: An assumption is a proposition that is taken for granted.
Autocorrelation: The correlation between members of a time series of observations and the same values shifted at a fixed time interval.
Bar chart: A way of summarizing a set of categorical data.
Base index period: A value of a variable relative to its previous value at some fixed base.
Beta, β: Beta refers to the probability that a false population parameter lies inside the confidence interval.
Binomial distribution: A binomial distribution can be used to model a range of discrete random data variables.
Binomial experiment: An experiment with a fixed number of independent trials. Each trial has exactly two outcomes, and the probability of each outcome remains the same for each trial.
Box plot: A way of summarizing a set of data measured on an interval scale.
Box-and-whisker plot: A way of summarizing a set of data measured on an interval scale.
Brown's single exponential smoothing method: The basis for a forecasting method called simple exponential smoothing.
Categorical variable: A set of data is said to be categorical if the values or observations belonging to it can be sorted according to category.
Central limit theorem: States that whenever a random sample is taken from any distribution (μ, σ²), the sample mean will be approximately normally distributed with mean μ and variance σ²/n.
Central tendency: Measures the location of the middle or the centre of a distribution.
Chance: The unknown and unpredictable element in happenings that seems to have no assignable cause.
Chi square distribution: A mathematical distribution that is used directly or indirectly in many tests of significance.
Chi square test: Apply the chi square distribution to test for homogeneity, independence, or goodness-of-fit.
Chi square test for goodness-of-fit: The chi-square goodness-of-fit test of a statistical model describes how well the statistical model fits a set of observations.
Chi square test of association: Provides a method for testing the association between the row and column variables in a two-way table, where the null hypothesis H0 assumes that there is no association between the variables.
Chi square test of independent samples: Pearson's chi-square test is a non-parametric test for a difference in proportions between two or more independent samples.
Class boundaries: Class boundaries separate one class in a grouped frequency distribution from another.
Class limit: Class limits separate one class in a grouped frequency distribution from another.
Class mid-point: The midpoint of each class interval.
Classes: Classes provide several convenient intervals into which the values of the variable of a frequency distribution may be grouped.
Classical time series analysis: An approach to forecasting that decomposes a time series into certain constituent components (trend, cyclical, seasonal, and random component), makes estimates of each component, and then re-composes the time series and extrapolates into the future.
Classical time series decomposition: A statistical method that deconstructs a time series into notional components.
Coefficient of determination (COD): The proportion of the variance in the dependent variable that is predicted from the independent variable.
Coefficient of variation: Measures the spread of a set of data as a proportion of its mean.
Conditional probability: The probability of an event occurring given that another event has already occurred.
Confidence interval (1 − α): A confidence interval gives an estimated range of values which is likely to include an unknown population parameter.
Contingency table: A table of frequencies classified according to the values of the variables in question.
Continuous probability distribution: If a random variable is a continuous variable, its probability distribution is called a continuous probability distribution.
Continuous random variable: A continuous random variable is one which takes an infinite number of possible values.
Continuous variable: A set of data is said to be continuous if the values belong to a continuous interval of real values.
Covariance: A measure of how much two variables change together.
Critical test statistic: The critical value for a hypothesis test is a limit at which the value of the sample test statistic is judged to be such that the null hypothesis may be rejected.
Critical value: The critical value(s) for a hypothesis test is a threshold to which the value of the test statistic in a sample is compared, to determine whether or not the null hypothesis is rejected.
Cross tabulation: The process of tabulating the results of two or more data sources (variables), one against the other.
Cumulative distribution function: The cumulative distribution function (CDF), or just distribution function, describes the probability that a real-valued random variable X with a given probability distribution will be found at a value less than or equal to x.
Cumulative frequency distribution: The cumulative frequency for a value x is the total number of scores that are less than or equal to x.
Cyclical variations (C): The cyclical variations of the time series model result in periodic above-trend and below-trend behaviour of the time series lasting more than one year.
Degrees of freedom: Refers to the number of independent observations in a sample minus the number of population parameters that must be estimated from sample data.
Dependent variable: A dependent variable is what you measure in the experiment and what is affected during the experiment.
Discrete: Discrete data are a set of data where the values/observations belonging to it are distinct and separate, i.e. they can be counted (1, 2, 3 . . .).
Discrete probability distribution: If a random variable is a discrete variable, its probability distribution is called a discrete probability distribution.
Discrete random variable: A discrete random variable is one which may take on only a countable number of distinct values such as 0, 1, 2, 3, 4 . . .
Discrete variable: A set of data is said to be discrete if the values belonging to it can be counted as 1, 2, 3 . . .
Dispersion: The variation between data values is called dispersion.
Durbin–Watson: The Durbin–Watson statistic is a test statistic used to detect the presence of autocorrelation (a relationship between values separated from each other by a given time lag) in the residuals (prediction errors) from a regression analysis.
Empirical approach: Empirical probability, also known as relative frequency or experimental probability, is the ratio of the number of outcomes in which a specified event occurs to the total number of trials.
Equal variance (homoscedasticity): Homogeneity of variance (homoscedasticity) assumptions state that the error variance should be constant.
Error measurement: A method of validating the quality of forecasts. Involves calculating the mean error, the mean squared error, the percentage error, etc.
Estimate: An indication of the value of an unknown quantity based on observed data.
Event: Any collection of outcomes of an experiment.
Expected frequency: In a contingency table, the expected frequencies are the frequencies that you would predict in each cell of the table if you knew only the row and column totals, and if you assumed that the variables under comparison were independent.
Experimental probability approach: See Empirical approach.
Exponential smoothing: One of the methods of forecasting that uses a constant (or several constants) to predict future values by 'smoothing' the past values in the series. The effect of this constant decreases exponentially as the older observations are taken into calculation.
Exponential trend: An underlying time series trend that follows the movements of an exponential curve.
Extreme value: An unusually large or an unusually small value compared with the others in the data set.
F distribution: The F distribution (also known as the Fisher–Snedecor distribution) is a continuous probability distribution that arises frequently as the null distribution of a test statistic, most notably in the analysis of variance.
F test: Tests whether two population variances are the same, based upon sample values.
F test for two population variances (variance ratio test): Used to test if the variances of two populations are equal.
Five-number summary: A five-number summary is especially useful when we have so many data that it is sufficient to present a summary of the data rather than the whole data set.
Forecasting: A method of predicting the future values of a variable, usually represented as the time series values.
Forecasting errors: A difference between the actual and the forecasted value in the time series.
Forecasting horizon: The number of future time units until which the forecasts will be extended.
Frequency definition of probability: Defines an event's probability as the limit of its relative frequency in a large number of trials.
Frequency distributions: A systematic method of showing the number of occurrences of observational data in order from least to greatest.
Frequency polygon: A graph made by joining the middle-top points of the columns of a frequency histogram.
General addition probability law: A result used to determine the probability that event A or event B occurs, or both occur.
Graph: A picture designed to express words, particularly the connection between two or more quantities.
Grouped frequency distributions: Data arranged in intervals to show the frequency with which the possible values of a variable occur.
Histogram: A way of summarizing data that are measured on an interval scale (either discrete or continuous).
Histogram with unequal class intervals: A graphical representation showing a visual impression of the distribution of data where class widths are of different sizes.
Hypothesis test procedure: A series of steps to determine whether to accept or reject a null hypothesis, based on sample data.
Independence of errors: Means that the distribution of errors is random and not influenced by or correlated to the errors in prior observations. The opposite of independence is called autocorrelation.
Independent events: Two events are independent if the occurrence of one of the events has no influence on the occurrence of the other event.
Independent variable: An independent variable is the variable you have control over, what you can choose and manipulate.
Index number: A value of a variable relative to its previous value at some base.
Intercept: Value of the regression equation (y) when the x value = 0.
Interquartile range: A measure of the spread of or dispersion within a data set.
Interval scale: A scale of measurement where the distance between any two adjacent units of measurement (or 'intervals') is the same but the zero point is arbitrary.
Irregular variations: The irregular variations of the time series model reflect the random variation of the time series values beyond what can be explained by the trend, cyclical, and seasonal components.
Kurtosis: A measure of the 'peakedness' of the distribution.
Least squares: The method of least squares is a criterion for fitting a specified model to observed data. It refers to finding the smallest (least) sum of squared differences between fitted and actual values.
Left-skewed: Left-skewed (or negative skew) indicates that the tail on the left side of the probability density function is longer than the right side, and the bulk of the values (possibly including the median) lie to the right of the mean.
Level of confidence: The confidence level is the probability value (1 − α) associated with a confidence interval.
Level of significance: The criterion used for rejecting the null hypothesis.
Linear relationship: A linear relationship exists between variables if, when you plot their values, you get a straight line.
Linear regression analysis: Simple linear regression aims to find a linear relationship between a response variable and a possible predictor variable by the method of least squares.
Linear trend: A straight line fit to a data set.
Logarithmic trend: A model that uses the logarithmic equation to approximate the time series.
Lower one tail test: A statistical hypothesis test in which the values for which we can reject the null hypothesis, H0, are located entirely in the left tail of the probability distribution.
Mann–Whitney U test: Used to test the null hypothesis that two populations have identical distribution functions, against the alternative hypothesis that the two distribution functions differ only with respect to location (median), if at all.
McNemar's test: A non-parametric method used on nominal data to determine whether the row and column marginal frequencies are equal.
Mean: A measure of the average data value for a data set.
Mean absolute deviation (MAD): The mean value of all the differences between the actual and forecasted values in the time series. The differences between these values are represented as absolute values, i.e. the effects of the sign are ignored.
Mean absolute percentage error (MAPE): The mean value of all the differences between the actual and forecasted values in the time series. The differences between these values are represented as absolute percentage values, i.e. the effects of the sign are ignored.
Mean error (ME): The mean value of all the differences between the actual and forecasted values in the time series.
Mean percentage error (MPE): The mean value of all the differences between the actual and forecasted values in the time series. The differences between these values are represented as percentage values.
Mean square error (MSE): The mean value of all the differences between the actual and forecasted values in the time series. The differences between these values are squared to avoid positive and negative differences cancelling each other.
Median: The value halfway through the ordered data set.
Mixed model: The mixed time series model blends both additive and multiplicative components together to identify the actual time series value.
Mode: The most frequently occurring value in a set of discrete data.
Moving average: Averages calculated for a limited number of periods in a time series. Every subsequent period excludes the first observation from the previous period and includes the one following the previous period. This becomes a series of moving averages.
Moving average trend: A method of forecasting or smoothing a time series by averaging each successive group of data points.
Multiple regression model: Multiple linear regression aims to find a linear relationship between a dependent variable and several possible independent variables.
Multiplication law: A result used to determine the probability that two events, A and B, both occur.
Multiplication law for independent events: States that the chance that two independent events both happen simultaneously is the product of the chances that each occurs individually, e.g. P(A and B) = P(A)*P(B).
Multiplication law for joint events: See Multiplication law.
Multiplicative model: The multiplicative time series model is a model whereby the separate components of the time series are multiplied together to identify the actual time series value.
Multivariate methods: Methods that use more than one variable and try to predict the future values of one of the variables by using the values of other variables.
Mutually exclusive: Mutually exclusive events are ones that cannot occur at the same time.
Nominal scale: A set of data is said to be nominal if the values belonging to it can be assigned a label rather than a number.
Non-parametric: Non-parametric tests are often used in place of their parametric counterparts when certain assumptions about the underlying population are questionable.
Non-seasonal: The component of variation in a time series which is not dependent on the time of year.
Non-stationary time series: A time series that does not have a constant mean and oscillates around this moving mean.
Normal approximation to the binomial: If the number of trials, n, is large, the binomial distribution is approximately equal to the normal distribution.
Normal distribution: A symmetrical, bell-shaped curve, centred at its expected value.
Normal probability plot: A graphical technique to assess whether the data are normally distributed.
Normality of errors: The normality of errors assumption states that the errors should be normally distributed; technically, normality is necessary only for the t-tests to be valid, and estimation of the coefficients only requires that the errors be identically and independently distributed.
Null hypothesis (H0): The null hypothesis, H0, represents a theory that has been put forward but has not been proved.
Observed frequency: In a contingency table, the observed frequencies are the frequencies actually obtained in each cell of the table, from our random sample.
One sample test: A hypothesis test for answering questions about the mean (or median) where the data are a random sample of independent observations from an underlying distribution.
One sample t-test for the population mean: A hypothesis test for answering questions about the mean where the data are a random sample of independent observations from an underlying normal distribution where the population variance is unknown.
One sample z-test for the population mean: Used to test whether a population parameter is significantly different from some hypothesized value.
One tail test: A statistical hypothesis test in which the values for which we can reject the null hypothesis, H0, are located entirely in one tail of the probability distribution.
Ordinal scale: A scale where the values/observations belonging to it can be ranked (put in order) or have a rating scale attached. You can count and order, but not measure, ordinal data.
Ordinal variable: A set of data is said to be ordinal if the values belonging to it can be ranked.
Outcome: The result of an experiment or other situation involving uncertainty.
Outlier: An observation in a data set which is far removed in value from the others in the data set.
Parametric: Any statistic computed by procedures that assume the data were drawn from a particular distribution.
Pearson's coefficient of correlation: Pearson's correlation coefficient measures the linear association between two variables that have been measured on interval or ratio scales.
Pie chart: A way of summarizing a set of categorical data.
Point estimate: A point estimate (or estimator) is any quantity calculated from the sample data which is used to provide information about the population.
Point estimate of the population mean: Involves the use of the sample mean to provide a 'best estimate' of the unknown population mean.
Point estimate of the population proportion: Involves the use of the sample proportion to provide a 'best estimate' of the unknown population proportion.
Point estimate of the population variance: Involves the use of the sample variance to provide a 'best estimate' of the unknown population variance.
Poisson distribution: Poisson distributions model a range of discrete random data variables.
Poisson probability distribution: A discrete probability distribution that expresses the probability of a given number of events occurring in a fixed interval of time and/or space, if these events occur with a known average rate and independently of the time since the last event.
Polynomial line: A curved line whose curvature depends on the degree of the polynomial variable.
Polynomial trend: A model that uses an equation of any polynomial curve (parabola, cubic curve, etc.) to approximate the time series.
Population mean: The mean value of all possible values.
Population standard deviation: The standard deviation of all possible values.
Population variance: The variance of all possible values.
Power trend: A model that uses an equation of a power curve (a parabola) to approximate the time series.
Probability: Provides a quantitative description of the likely occurrence of a particular event.
Probability of event A given that event B has occurred: See Conditional probability.
Probable: Represents that an event or events is likely to happen or to be true.
P-value: The probability of getting a value of the test statistic as extreme as or more extreme than that observed by chance alone, if the null hypothesis is true.
Q1: The lower quartile; the data value a quarter way up through the ordered data set.
Q3: The upper quartile; the data value a quarter way down through the ordered data set.
Qualitative variable: Variables can be classified as descriptive or categorical.
Quantitative variable: Variables can be classified using numbers.
Quartiles: Values that divide a sample of data into four groups containing an equal number of observations.
Random experiment: An experiment, trial, or observation that can be repeated numerous times under the same conditions.
Random sample: A sampling technique where we select a sample from a population of values.
Random variable: A function that associates a unique numerical value with every outcome of an experiment.
Ranks: List data in order of size.
Range: The range of a data set is a measure of the dispersion of the observations.
Ratio scale: Consists not only of equidistant points but also has a meaningful zero point.
Raw data: Data collected in original form.
Region of rejection: The range of values that leads to rejection of the null hypothesis.
Regression analysis: Used to model the relationship between a dependent variable and one or more independent variables.
Regression coefficient: A measure of the relationship between a dependent variable and an independent variable.
Relative frequency: Another term for proportion; it is the value calculated by dividing the number of times an event occurs by the total number of times an experiment is carried out.
Residual: The residual represents the unexplained variation (or error) after fitting a regression model.
Residuals (R): The differences between the actual and predicted values. Sometimes called forecasting errors. Their behaviour and pattern has to be random.
Right-skewed: Right-skewed (or positive skew) indicates that the tail on the right side is longer than the left side, and the bulk of the values lie to the left of the mean.
Robust test: If a test is robust, the validity of the test result will not be affected by poorly structured data. In other words, it is resistant against violations of parametric assumptions.
Sample space: An exhaustive list of all the possible outcomes of an experiment.
Sample standard deviation: An estimate, based on a sample, of a population standard deviation.
Sampling distribution: Describes probabilities associated with a statistic when a random sample is drawn from a population.
Sampling error: Refers to the error that results from taking one sample rather than taking a census of the entire population.
Sampling frame: The source material or device from which a sample is drawn.
Scatter plot: A plot of one variable against another variable.
Seasonal: The component of variation in a time series which is dependent on the time of year.
Seasonal component: A component in the classical time series analysis approach to forecasting that covers seasonal movements of the time series, usually taking place inside one year's horizon.
Seasonal time series: A time series, represented in units of time smaller than a year, that shows a regular pattern in repeating itself over a number of these units of time.
Seasonal variations (S): The seasonal variations of the time series model show a periodic pattern over one year or less.
Shape: The shape of the distribution refers to the shape of a probability distribution and involves the calculation of skewness and kurtosis.
Significance level, α: The significance level of a statistical hypothesis test is a fixed probability of wrongly rejecting the null hypothesis, H0, if it is in fact true.
Sign test: Designed to test a hypothesis about the location of a population distribution.
Simple exponential smoothing: A forecasting technique that uses a weighted average of past time series values to arrive at smoothed time series values that can be used as forecasts.
Simple index: Designed to measure changes in some measure over time.
Skewness: Defined as asymmetry in the distribution of the data values.
Slope: Gradient of the fitted regression line.
Smoothing constant: A parameter of the exponential smoothing model that provides the weight given to the most recent time series value in the calculation of the forecast value.
Spearman's rank coefficient of correlation: Spearman's rank correlation coefficient is applied to data sets when it is not convenient to give actual values to variables, but one can assign a rank order to instances of each variable.
Standard deviation: Measure of the dispersion of the observations (the square root of the variance).
Standard error of forecast: The square root of the variance of all forecasting errors, adjusted for the sample size.
Standard error of the estimate (SEE): An estimate of the average squared error in prediction.
Standard error of the mean: The standard error of the mean (SEM) is the standard deviation of the sample mean's estimate of a population mean.
Standard error of the proportion: The standard deviation of the sample proportion's estimate of a population proportion.
Standard normal distribution: A normal distribution with zero mean (μ = 0) and unit variance (σ² = 1).
Stated limits: The lower and upper limits of a class interval.
Statistic: A quantity that is calculated from a sample of data.
Statistical independence: Two events are independent if the occurrence of one of the events gives us no information about whether or not the other event will occur.
Statistical power: The power of a statistical test is the probability that it will correctly lead to the rejection of a false null hypothesis.
Stationary time series: A time series that has a constant mean and oscillates around this mean.
Student's t distribution: The t distribution is the sampling distribution of the t statistic.
Sum of squares for error (SSE): The SSE measures the variation in the modelling errors.
Sum of squares for regression (SSR): The SSR measures how much variation there is in the modelled values.
Symmetrical: A data set is symmetrical when the data values are distributed in the same way above and below the middle value.
Table: A table shows the number of times that items occur.
Tally chart: A method of counting frequencies, according to some classification, in a set of data.
Test statistic: A quantity calculated from our sample of data.
Tied ranks: Two or more data values share a rank value.
Time period: A unit of time by which the variable is defined (an hour, a day, a month, a year, etc.).
Time series: A variable measured and represented per units of time.
Time series plot: A chart of a change in variable against time.
Total sum of squares (SST): The SST measures how much variation there is in the observed data (SST = SSR + SSE).
Trend (T): The long-run shift or movement in the time series observable over several periods of time.
Trend component: A component in the classical time series analysis approach to forecasting that covers underlying directional movements of the time series.
True or mathematical limits: True or mathematical limits separate one class in a grouped frequency distribution from another.
Two sample tests: A two sample test is a hypothesis test for answering questions about the mean where the data are collected from two random samples of independent observations, each from an underlying distribution.
Two sample t-test for population mean (dependent or paired samples): Used to compare two dependent population means inferred from two samples (dependent indicates that the values from both samples are numerically dependent upon each other; there is a correlation between corresponding values).
Two sample t-test for the population mean (independent samples, equal variance): Used when two separate sets of independent and identically distributed samples are obtained, one from each of the two populations being compared.
Two sample t-test for population mean (independent samples, unequal variances): Used when two separate sets of independent but differently distributed samples are obtained, one from each of the two populations being compared.
Two sample z-test for the population mean: Used to evaluate the difference between two group means.
Two sample z-test for the population proportion: Used to evaluate the difference between two group proportions.
Two tail test: A statistical hypothesis test in which the values for which we can reject the null hypothesis, H0, are located in both tails of the probability distribution.
Type I error, α: A type I error occurs when the null hypothesis is rejected when it is in fact true.
Type II error, β: A type II error occurs when the null hypothesis, H0, is not rejected when it is in fact false.
Types of trends: The type of trend can include line and curve fits to the data set.
Unbiased: When the mean of the sampling distribution of a statistic is equal to a population parameter, that statistic is said to be an unbiased estimator of the parameter.
Uncertainty: A state of having limited knowledge where it is impossible to describe exactly the existing state or future outcome of a particular event occurring.
Univariate methods: Methods that use only one variable and try to predict its future value on the basis of the past values of the same variable.
Upper one tail test: A statistical hypothesis test in which the values for which we can reject the null hypothesis, H0, are located entirely in the right tail of the probability distribution.
Variable: A symbol that can take on any of a specified set of values.
Variance: Measure of the dispersion of the observations.
Variation: A measure that describes how spread out or scattered a set of data is.
Wilcoxon signed rank sum test: The Wilcoxon signed ranks test is designed to test a hypothesis about the location of the population median (one or two matched pairs).
Index

Note: Excel functions are shown in capitals.

average values 58, 80, 315, 320,


A 430–1, 436, 452 C
A estimated 316–17
ABS 253–4, 256, 258–60, 267–8, averages see also mean; median; calculated test statistics 255,
270, 326, 454–5 mode 259–60, 270, 272, 276–7,
absolute error 453, 456 moving see moving averages 282, 328–30
absolute references 368 samples 261–2 calculated two-tail p-values 255,
actual time series 425, 432 weighted 77–8, 436 259
Add Trendline 51, 366, 397, 421, axes categorical/category variables
433, 435 horizontal 32, 44, 74 2, 296, 298, 301–2
addition law 115, 118, 133 titles 24, 26, 37, 50, 52, 99 categorical data 19, 21–2, 56,
addition law for mutually vertical 44, 344 306, 310, 324, 340
exclusive events 115, 133 CDF see cumulative distribution
additive model 419, 445–6, 466 function
adjusted r squared 399, 404–5 B central limit theorem 183,
aggregate indices 415–16, 185–6, 205–6, 208, 235,
465–6 bar charts 21–7, 30–2, 57 241, 248
alpha 232, 248, 251, 264, 294, bars 22, 26, 32, 37, 39–40, 52, 61 central tendency 33
340, 442 error 52, 96 measures of 83, 90, 104
alternative hypotheses 243–6, base year 413–14, 417 chance 107, 122, 133, 136, 188
251–2, 280, 286, 288, Beta 251, 290 Chart Tools 26, 30
355–6, 358–9 bias 187, 192–3, 198, 219–20, charts 21–2, 24, 26–7, 44, 50–2,
true 251, 290–1 241 56–7, 96–9
Analysis menu 98–9 Bin Range 8–10, 34–5, 214 appearance 24, 30, 50, 98
analysis of variance (ANOVA) BINOM.DIST 160–4, 173, 177, bar 21–7, 30–2, 57
153, 246–7, 286, 294, 371, 179, 321–2 column 37, 96
381, 394–5 binomial 158, 166, 173–9, 182–3, line 46, 97–8, 344, 421
Analysis ToolPak see data 212, 313, 322 titles 24, 26, 30, 37, 50, 52, 99
analysis, tool coefficients 159–60 chi-square 294, 296–7, 299, 301,
ANOVA see analysis of variance experiment 156, 173 309, 315–17, 341
approximation 173, 176, 180, probabilities 159–60, 175, 177, critical values 316–17
205, 232, 322 179, 322 distribution 136, 153, 183,
normal 175–6, 179–80, 183, binomial probability 297–8, 301, 307, 312
303, 307, 322, 325 distributions 175–7 goodness-of-fit test 168, 297,
arithmetic mean 59, 78, 105 binomial probability 313–14
arrays 455, 459–60 function 173 and non-parametric
association 246–7, 296–8, Bins 10, 42, 199–200, 215, 319 hypothesis testing 296–
300–2, 340–1, 343, box-and-whisker plots 96, 99, 341
349–50, 356–7 149 test 246, 296, 303, 306–7, 309,
chi-square test of 297–8, box plots 94, 96–9, 105 313, 340–1
300–1, 340 Brown’s exponential smoothing of association 297, 300–1,
linear 347, 349 method 438, 441, 445 340
autocorrelations 370, 457 see also exponential and independent
AVERAGE 65–6, 214, 261–2, 270, smoothing sampes 303–7
275–6, 383–4, 462–3 samples 303–7

CHISQ.DIST 300–1, 305, 315–16 COMBIN 160 covariance 347–9, 405


CHISQ.DIST.RT 300, 305–6, 310, conditional probability 119 coverage error 185, 193
315–16 confidence 225, 227, 231, 234–5, CPI (Consumer Price Index)
CHISQ.INV.RT 300–1, 305, 307, 241, 243, 285 415–18, 465
315–16 level of 235, 241, 287–8, 290, critical f 287–8, 380
CHISQ.TEST 300, 305 461 critical tables 143, 460
CI see confidence interval confidence intervals (CIs) 217– critical test statistic 251–2,
class boundaries 8, 11, 32, 39, 19, 225–8, 230–6, 238–9, 258–60, 272, 277, 282,
57 382–3, 387, 458–61 316–17, 329–30
lower see lower class constant 462–3 critical values 91–2, 228, 259–60,
boundaries estimates 217–18, 226–35, 238, 288–9, 311, 315–16,
upper 8, 11, 32–5, 42–3, 71, 82, 382–3 358–60
86–7 lower 227, 231, 234, 236 estimates of 294, 340
class frequency 41–2, 74 population 225–42 lower 287–8
class intervals 10, 42–3, 58, 73 upper 227, 231, 234, 236 upper 286–9
unequal 40, 42 confidence measurement see critical z 228, 255, 323, 325, 329,
class limits 10, 39, 57 confidence intervals 336
class mid-points 43–4, 71, 73–4, constant values 406, 417–18, 461 cross tabulation 22, 57, 343
128 constant variance 370–1, 388, tables 22 see also contingency
class widths 8, 11, 32–3, 35, 39, 397, 456 see also equal tables
41–2, 126 variances cumulative distribution
equal 8, 34, 37 constants 363, 371, 387, 408, 423, function (CDF) 154–5
standard 41–2 432, 448–9 cumulative frequency 69–70,
unequal 2, 42, 73 Consumer Price Index see CPI 74–7, 133
classical time series contingency tables 22, 153, 298, curve 22, 70, 75–6
analysis 419–20, 466 300–1, 303, 305–8 distributions 56, 69–70, 74
classifications 6, 110–11, 155 continuous data 2, 10, 60 tables 69
cluster sampling 190 continuous distributions 176, curly brackets 455
clusters 185, 188, 190 219, 246, 313, 319 cyclical variations 419, 466
COD see coefficients, of continuous random
determination variables 136, 183
coefficients 370, 374, 398, 425, continuous variables 3, 136 D
446, 457 convenience 156, 185, 187, 191
correlation see correlation sampling 191 damping factor 440, 444 see also
coefficients CORREL 349, 353 smoothing, constant
of determination (COD) 348, correlation coefficients 344, data analysis 212–14, 252, 263–4,
366, 372–4, 387, 395, 347–8, 350, 352–4, 374, 273, 278, 370–2, 439–40
399–400, 404–5 405, 425 application of 273, 278, 282
Pearson’s see Pearson’s Pearson’s 90, 105, 343–4, exploratory 59, 94, 250
correlation coefficient; 347–9, 351, 355–8, 404–5 tool 252, 263–4, 272–3, 278,
Pearson’s coefficient of Spearman’s rank 343–4, 347, 282, 289–90, 444
skewness 356–9, 404 regression solution 385–8
regression 346, 370 correlations 279, 344, 346–7, data descriptors 58–105
of skewness 350, 355, 357–8, 370 data distribution 60, 95, 149
Fisher’s 90 linear see linear correlations data labels 30, 43, 50, 52
Pearson’s 90, 105 negative 350–1, 356, 358 data sets 58–63, 84–5, 94–7, 104–
Spearman’s rank 343–4, 347, positive 350–1, 356, 358 5, 343–7, 364–8, 370–2
356–60, 404–5 serial 370 ordered 59, 62, 64
of variation 81, 88–9 strong 350, 353 data types 22, 56, 105, 247, 341
column charts 37, 96 COUNT 261, 270, 275, 285, 300, data variables 32, 39, 47, 56, 60,
columns 11–12, 19–21, 24, 29, 383–4, 455 343–4, 356
297–8, 444–5, 462 COUNTA 305 discrete random 155, 165
totals 17, 298, 301 COUNTIF 320–1, 326, 333 types 2–3
variables 297–8, 305, 308 COVAR 347 deflating of values 416–19
delta 446–7, 449 F 133, 153, 183, 286–8, 294, measurement error 185, 193
denominators 220, 287, 378, 378, 380 measurement of 51, 450–3,
380 frequency see frequency, 455, 466
dependent samples 246–7, 279, distributions non-response 185, 193
297, 303, 307, 310, 322 left-skewed 95, 149, 152 normality of 370, 397
dependent variables 343–6, 348, leptokurtic 91, 93, 229 prediction 370
350, 362–3, 375, 378–9, Mann-Whitney 335 sampling 185–7, 193, 210, 225,
399–400 mesokurtic 91, 93 229, 248, 353
descriptive statistics 66, 85, 94, non-normal 204, 340 sum of squares for error 369,
100 normal 83, 92, 136–40 381, 405
descriptors, data 58–105 sampling 254, 262, 267, term 52, 383
deviations 220, 378, 425, 436–7, 271, 276 type I 251, 290, 295
451, 455–6, 459 standard 141, 229, 248 type II 251, 290–1, 295
mean absolute 453, 455–6, 466 null 153, 286 types of 192–3, 251, 453–5
standard see standard Poisson 133, 135–6, 155, variance 371
deviation 165–70, 173–5, 180–1, ex-post forecasts 451, 466
unexplained 378 313–17 expected frequencies 166, 169,
die rolling 108–9, 112–13, 120–1, population 149, 183, 205, 250, 298–301, 305–7, 313,
136, 195 254, 257–8, 262 315–17
differences probability see probability, expected values 107, 127–8, 131,
median 322, 326–7 distributions 136, 218–19
negative 448 right-skewed 95, 149, 152 experimental probability see
non-parametric tests of 246 skewed 82–3, 89–90, 92, 95, empirical probability
paired 325, 330 149, 152 experiments 107–10, 112–14, 136,
positive 322, 327 left-skewed 95, 149, 152 156, 158–9, 161–2, 164–6
significant 244–6, 255, 259–60, right-skewed 95, 149, 152 die 109, 197
268, 304, 306–7, 311 standard normal 141, 229, 248 factorial 246–7, 250, 294, 340
squared 358, 363 symmetric 90, 92, 94, 149, exploratory data analysis 59
discrete data 10–11, 41, 59, 68 205, 330 exponential curves 392, 423
discrete probability uniform 154 modified 392
distributions 135–6, Durbin-Watson statistic 370 exponential smoothing 248,
155–83 406, 432, 436–8, 466
discrete random variables 136, forecasting with 438–45
155, 165–6, 183 E seasonal 449
discrete variables 32, 155–6, simple 436, 438–9, 446
183, 307 empirical probability 110 single 437
dispersion 33, 40, 56, 58–9, 80–1, equal chance 136, 188 exponential trends 423, 429
104–5, 127 equal variances 275, 295, 371, 456 extrapolation 406–7, 419, 466
measures of 58–9, 81, 96, 104 error(s) extreme values 11, 60, 62, 83–4,
display 1, 11, 16–17, 303 absolute 453, 456 90, 95–6, 149–50
distribution-free tests see non- coverage 185, 193
parametric tests degree of 217, 219–20, 368
distribution functions 155, 246, forecasting 406, 420, 456, 459, F
318 466
cumulative 154–5 independence of 370, 405 F distribution 133, 153, 183,
distribution shape 59, 325 interpretation 455–6 286–8, 294, 378, 380
distributions 80–2, 89–91, 152–3, margin error 238–9 F statistic 286–8
182–3, 229, 246–50, 286–8 margin of 185, 193, 239 F test 285–90
binomial probability 155, mean absolute factorial experiments 246–7,
175–7 percentage 453, 455–6, 250, 294, 340
chi-square 136, 153, 183, 466 factorials 159, 161, 166
297–8, 301, 307, 312 mean percentage 453–6, 466 fair die 121, 136, 195
continuous 176, 219, 246, 313, mean square 378, 381, 386, false null hypotheses 251, 290–1
319 448–9, 453, 455–6, 466 F.DIST 286

finite populations 156, 207
F.INV 286, 288–9
F.INV.RT 286–9, 380
first quartile 64–5, 76, 82–3, 94–7, 105, 149
Fisher-Snedecor distribution see F distribution
Fisher’s kurtosis 92, 105
Fisher’s skewness coefficient 90
fitted lines 364–5
fitting 313–14, 344–6, 362–3, 365, 404, 420, 425
five-number summary 94–5, 105, 149
fixed costs 130–1
forecast values 420, 429, 436–7, 448–9, 451–3, 455
forecasting 407, 419–20, 423, 431–3, 435, 437–9, 450–1
  errors 406, 420, 456, 459, 466
  with exponential smoothing 438–45
  methods 406, 423, 432, 435–6, 439, 456, 465–6
  with moving averages 431–5
freedom, degrees of 229–30, 232, 258–9, 276–7, 287–9, 300–1, 377–8
frequency 3–7, 9–10, 26–7, 32–4, 37–44, 69–76, 199–200
  class 41–2, 74
  cumulative see cumulative frequency
  distributions 2–4, 6–8, 67–9, 73–4, 126–7, 167–8, 215
    creation 6–10
    grouped 6, 8, 10, 35, 57, 71–2, 82
  expected 166, 169, 298–301, 305–7, 313, 315–17
  high 89
  highest 69, 73
  histograms 21
  low 89
  polygons 2, 21–2, 42–6, 74, 96
  relative 27, 32, 107, 110, 124–7, 133, 185
  tables 9, 31, 36, 69
    cumulative 69
    grouped 9, 36
  total 42, 75–6, 126
Friedman test 246, 340
F.TEST 286

G
general addition law 115–16, 133
Gompertz curves 393
goodness-of-fit 153, 246, 297, 313
graphical methods 56, 73–6, 96, 105, 370
graphical representations 42 see also tables
graphs 3, 21–2, 30–2, 49, 51–2, 57–8, 150–1
  line 46, 97–8, 344, 421
grouped frequency distributions 6, 8, 10, 35, 57, 71–2, 82
  tables 9, 36
GROWTH 429

H
H0 see null hypotheses
H1 see alternative hypotheses
histograms 8, 21–2, 56–7, 73–5, 126, 199–200, 213–15
  frequency 21
  with unequal class intervals 40–2
homoscedasticity see constant variance; equal variances
horizontal axis 32, 44, 74
horizontal grid lines 27, 52
hyperbola curves 391
hypotheses 243, 245–6, 248, 251, 257, 288, 294
  alternative see alternative hypotheses
  initial research 244
  non-parametric 296–7, 299, 301, 305, 307, 309, 311
  null see null hypotheses
  statistical 186, 241, 298
hypothesis statements 244, 255, 259, 280, 287–8, 302, 307
  alternative hypothesis (H1) 244–5
  null hypothesis (H0) 244–5
hypothesis tests 228, 243–4, 246–7, 249, 257, 353, 358 see also non-parametric tests; parametric tests
  statistical 185, 244, 248–9

I
independence 122, 313, 370, 456
  of errors 370, 405
  statistical 117, 120, 133
independent events 120–1
independent observations 220, 246, 257, 297
independent samples 246–7, 295–7, 303–7, 319, 331, 334
  and chi-square test 303–7
  random 266, 332
  and unequal variances 274–9
independent variables 343, 345–6, 362–3, 370–1, 374–5, 390, 396–8
index numbers 406, 411–19, 466
indices 406, 413–14, 465
  aggregate 415–16, 465–6
  simple 412–15
inference measures 366, 368
inference(s) 187, 218, 243, 297, 371
  statistical 194
infinite populations 156, 207
inflation 407, 409, 415–16
input data series 24, 29, 37, 50, 52
input ranges 35, 214, 444
INT 276
integers 10, 115–16
  odd 115–16
INTERCEPT 365–6, 372, 376, 379, 382, 384, 427
intercepts 365, 405, 426–7, 446
interest rates 4, 409
interpolation, linear 64
interquartile range 58, 81–3, 96, 105
interval measurement scales 3
interval scales 3, 8, 94, 96
intervals 3, 20–2, 35, 225, 237–9, 432–3, 461–3
  confidence see confidence intervals
  estimates 183, 217–19, 225, 228, 462
  prediction 383, 385

K
Kruskal-Wallis test 246, 340
kurtosis 59, 81, 100, 105
  Fisher’s 92, 105
  measures of 89–93

L
labels 4, 8, 24, 29, 35, 37, 52
Layout tool menu 26, 98–9
least squares 348, 362–3, 394, 400
  regression 343, 363, 393, 404
left-skewed distributions 95, 149, 152
legends 24, 26–7, 50
leptokurtic distribution 91, 93, 229
level of confidence 235, 241, 287–8, 290, 461
line charts/graphs 46, 97–8, 344, 421
linear associations 347, 349
linear correlations
  and regression analysis 343–405
  significance 353–6
linear interpolation 64
linear models 395, 399
linear regression 372, 394, 407, 436
  analysis 405
  fitting of line to sample data 364–8
  multiple 344, 362, 404
  simple 343, 348, 362
linear relationships 343, 348, 362, 375, 377, 380, 383
linear trends 423, 425, 429, 445–6, 466
linearity 370, 388
  assumption 371, 388, 397
location 17, 58–9, 89, 131, 246, 318, 324
logarithmic trends 423, 466
logarithms 166, 423
logistic curves 392
lower class boundaries 8, 11, 32–3, 35, 42–3, 82, 86–7
  true 75–6
lower confidence intervals 227, 231, 234, 236
lower critical values 287–8
lower one tail tests 249, 288, 294, 319, 326
lower quartile see first quartile
lower tail p-values 254, 288
LQ see first quartile

M
McNemar test 303, 308–10, 325, 341 see also z tests
MAD see mean absolute deviation
Mann-Whitney U test 246, 279, 297, 318, 324, 332, 340–1
MAPE see mean absolute percentage error
margin errors 238–9
margin of error 185, 193, 239
matched pairs test, Wilcoxon see Wilcoxon signed rank sum test
mathematical limits 10–11, 70
MAX 81, 321–2
mean 58–63, 80–92, 194–211, 217–35, 241–8, 253–5
  arithmetic 59, 78, 105
  overall 197
  population see population(s), mean
  samples 194–211, 213–14, 217–20, 222, 224–8, 230–1, 253–5
  standard error of the 222, 229, 254, 258
  weighted 78
mean absolute deviation (MAD) 453, 455–6, 466
mean absolute percentage error (MAPE) 453, 455–6, 466
mean percentage error (MPE) 453–6, 466
mean square due to regression 378, 381
mean square error (MSE) 378, 381, 386, 448–9, 453, 455–6, 466
measurement 2–3, 31, 194, 220–1, 294, 296, 356
  error 185, 193
  interval/ratio level of 246, 294
  scales 3–4
  units of 3, 89
median 58–63, 69–70, 74–6, 89–90, 94–100, 149, 318–20
  classes 72, 75, 87
  differences 322, 326–7
  population 319, 324, 327
  position 72, 87–8
mesokurtic distributions 91, 93
midhinge 95, 149
midrange 95, 149
mixed models 419, 466
modal class 69, 72–4
modal values 60, 63, 74
mode 3, 59–63, 67–9, 72–4, 80, 89–90, 105
model reliability 399
  testing 364, 372–4, 387
modelling errors 369
moving averages 406, 430–45
  plots 433, 435
  trend 423
MPE see mean percentage error
MSE see mean square error
multiple linear regression 344, 362, 404
multiplication law 117–20, 133
multiplicative model 419, 445–7, 452–3, 466
multistage sampling 190
multivariate methods 409–10
mutually exclusive events 114–15, 119–21, 133

N
negative correlations 350–1, 356, 358
negative differences 448
negative values 83, 92, 347, 453
nominal data 4, 307–8
nominal scales 3
non-linear relationships 344, 390–1, 393
non-normal distributions 204, 340
non-parametric hypothesis 296–7, 299, 301, 305, 307, 309, 311
non-parametric methods 246, 307
non-parametric tests 243, 246, 250, 296–7, 318–19, 331, 340
non-probability sampling methods 190–1
non-proportional quota sampling 192
non-response error 185, 193
non-stationary time series 407–9, 431, 445
non-symmetry 95, 149
normal approximations 175–6, 179–80, 183, 303, 307, 322, 325
  probability 177, 179, 181
  solution 179, 181
normal curves 137–9, 141–2, 146–8, 154–5, 199, 208–9, 250
normal distributions 92, 135–7, 140–1, 143, 148, 175–6, 229
  approximations 136, 185
    to binomial distribution 175–9
normal equations 363
normal populations 186, 198–9, 218, 226, 319
normal probability
  curves 151–2
  plots 149–53, 183, 371, 386, 388, 394, 396–7
normal sampling distributions 254, 262, 267, 271, 276
normality 149–53, 325, 370, 388, 456
  of errors 370, 397
NORM.DIST 138–40, 142–3, 145–7, 177, 202–4, 206, 208–9
NORM.INV 147–8, 155
NORM.S.DIST 141–7, 202–4, 206, 208–11, 253–4, 262–3, 267–8
NORM.S.INV 147–8, 151, 227, 253, 255, 267, 310–11
null distributions 153, 286
null hypotheses 244–6, 248–9, 251–2, 288, 297–8, 309, 318–19
  false 251, 290–1
  testing 254, 258, 262, 267, 271, 276, 281
  true 251, 290
numerators 153, 287, 378, 380

O
observations 2–3, 81, 108, 330–1, 431–6, 438–9, 459–60
  first 431–2, 440
  independent 220, 246, 257, 297
  last 427, 433, 440
  paired 319, 324–5, 329, 356
  tied 330
observed frequencies 169, 305, 312–13, 315
observed values 141, 155, 334–5, 363, 368, 373
ogive 22, 70, 74–6
one sample t-tests 246–7, 251, 291–2, 294, 319, 324
one sample z-tests 246, 294
one tail p-values 259, 263, 281, 290, 321, 327–9, 336
one tail tests
  lower 249, 288, 294, 319, 326
  upper 249, 262, 268, 280, 288, 295, 326–8
order of size 60, 62–3, 69, 296
ordinal data 3–4, 21–2, 60, 105, 246, 340–1, 343–4
ordinal scales 3
ordinal variables 21, 57
outcomes 107–8, 112–13, 116, 120, 124–5, 136, 155–6
  possible 108–9, 112, 165
outliers 59–60, 62, 84, 95–6, 104–5, 149–50, 346–7
  suspected 95–6, 150
overall mean 197

P
p-values 251–2, 254–6, 286–8, 300–1, 305–7, 310–11, 315–17
  calculated 300, 306, 316, 323, 387
    two tail 255, 259
  exact 307, 322–3, 336
  lower 286, 333
  lower tail 254, 288
  measured 294, 340
  method 254, 258–9, 262–3, 267–8, 270–1, 276–7, 281
  one tail 259, 263, 281, 290, 321, 327–9, 336
  two tail 253–5, 258–9, 267–8, 270–2, 276–7, 286–7, 310–11
  upper 262, 280, 286
  upper tail 254, 288
paired differences 325, 330
paired observations 319, 324–5, 329, 356
paired ranks 327
paired samples 279, 281, 294, 297, 307–8, 319, 324
pairs, matched 324
parabolas 391, 393, 420, 423
parameter conditions 313
parameters 135, 180, 194–5, 218, 393–4, 426–7, 437
  population 183, 189, 193–5, 217–20, 225, 241, 246
    unknown 217–18, 458
  sample 183, 194, 246
parametric tests 243–97, 318–19, 331, 340
patterns 3, 48–50, 58, 157, 159, 370, 456
PDF see probability, density function
peakedness 59, 81, 105
PEARSON 349, 353, 355
Pearson’s coefficient of skewness 90, 105
Pearson’s correlation coefficient 343–4, 347, 348–53, 355–8, 404–5
percentiles 62, 64, 75–6, 358
  classes 76
perfect correlations 350–1
pie charts 19, 21–2, 27–30, 297
PivotCharts 11, 17, 19–20
PivotTables 11–20
plots 21, 126, 149, 186, 213, 344, 370–1
  box 94, 96–9, 105
  box-and-whisker 96, 99, 149
  moving average 433, 435
  normal probability 149–53, 183, 371, 386, 388, 394, 396–7
  residual 370–1, 386, 388, 394, 396–7
  scatter 21–2, 47–51, 344–7, 349–51, 364–5, 397, 399
  time series 21–2, 47–51, 57, 408, 432, 447–8, 450
point estimates 185–6, 217–19, 222–3, 225, 241–2, 375, 436
Poisson distributions 133, 135–6, 155, 165–70, 173–5, 180–1, 313–17
  approximation to binomial distribution 173–5
POISSON.DIST 168, 171–3, 181, 314, 316
polygons, frequency 2, 21–2, 42–6, 74, 96
polynomial curves 423
polynomial lines 411, 466
polynomial trends 423
pooled estimates 275
population(s)
  confidence intervals 225–42
  distributions 149, 183, 205, 250, 254, 257–8, 262
  estimates 185, 222, 224
  finite 156, 207
  infinite 156, 207
  median 319, 324, 327
  non-normal 186, 204
  normal 186, 198–9, 218, 226, 319
  parameters 183, 189, 193–5, 217–20, 225, 241, 246
    unknown 217–18, 458
  point estimates
    mean and variance 218–22
    proportion and variance 222–4
    type of 218
  proportion 210–11, 217, 222–4, 236, 242, 246, 295
  slope 364, 375, 399, 404
  v samples 194
  values 153, 185, 199, 217, 228, 353
    true 219, 353, 363
  variables 153
  variances 85–6, 217, 219–20, 241–2, 246–7, 261–2, 286–8
positive correlations 350–1, 356, 358
positive relationships 47, 345, 357
positive values 92, 254, 348
power 245, 251, 292
  curves 423
  function 423
  statistical 251, 290–2, 294
  trends 423
precision 189–90
prediction 362, 372, 375, 460–1
  errors 370
  intervals 383–5
  values 375, 462–3
predictor models 381, 387, 390
predictor variables 344, 348, 362, 364, 374–82, 387, 399
presentation 1–57
probability 107–33, 137–46, 154–61, 176–7, 179–81, 201–12, 251
  conditional 119
  density function (PDF) 95, 138–9, 141–2, 154–5, 301
  distributions 107, 124–7, 129–31, 133, 135–83, 185, 249
    binomial 175–7
    continuous 135–6, 153–5, 183, 286
    discrete 135–6, 155–83
    Poisson see Poisson distributions
  empirical 110
  frequency definition of 124
  laws 107, 114–15, 133
    general addition law 115–16, 133
  normal approximation 177, 179, 181
  samples 188, 190
  theoretical 113
  theory 135, 243
properties 195, 398, 456
proportional quota sampling 191
purposive sampling 191

Q
Q1 see first quartile
Q2 see second quartile
Q3 see third quartile
qualitative variables 2–3, 325
quantitative variables 2–3
QUARTILE.INC 64, 81
quartiles 64–5, 75, 88, 96
  first 64–5, 76, 82–3, 94–7, 105, 149
  ranges 59
  second 63–4
  third 64–5, 76, 83, 94–7, 149
quota sampling 191
  non-proportional 192
  proportional 191

R
random experiments 108
random number generation 154, 212–13
random samples 110, 137, 163, 188–90, 193–4, 203–7, 209–13
  independent 266, 307, 332
  simple 188, 190
random sampling 186, 188
  simple 188–90
  stratified 189–90, 192
  systematic 189
random variables 136, 154–6, 161, 163, 173, 179, 183
  continuous 136, 183
  discrete 136, 155, 165–6, 183
rank correlation coefficient, Spearman’s see Spearman’s rank correlation coefficient
RANK.AVG 326–7, 332, 334
ranks 62, 296, 319, 322, 327–8, 332, 334
  paired 327
  shared 322, 327, 334
  tied 329–30, 337, 356
ratios 3–4, 21–2, 57, 88, 105, 109–10, 306–7
raw data 1–2, 4, 8, 12, 57, 105, 279
rectangles 32, 40–2
region of rejection 249, 254–5, 263, 268, 287–8, 310–11, 322–3
regression 343, 362–5, 369–71, 378, 381, 395, 405
  analysis 343, 345, 347, 349, 363, 367–71, 387–91
    advanced topics 390–405
    linear 405
    and linear correlation 343–405
  assumptions 370–2, 375
  coefficients 346, 370
  equations 365, 373, 375, 383
  least squares 343, 363, 393, 404
  linear see linear regression
  lines 365–6, 368, 372–4, 378, 423
  mean square due to 378, 381
  models 362, 365, 375, 378
    linear multiple 344, 404
    multiple 362, 398–400, 404
    non-linear 390–7
  sum of squares 369, 381, 405, 425
Regression tool 381, 385, 387, 398, 404
rejection 244–5, 249, 251–2, 272, 290, 292, 294
  regions/zones of 249, 254–5, 263, 268, 287–8, 310–11, 322–3
relative frequency 27, 32, 107, 110, 124–7, 133, 185
reliability 344, 372, 385, 398
  models 399
residual plots 370–1, 386, 388, 394, 396–7
residual values 373, 420
residuals 365, 368, 370–4, 388, 396, 420, 456–7
response variables 348, 362
right-skewed distributions 95, 149, 152
risk 248, 350
rows 11–12, 20–1, 213, 297–8, 301, 307, 454
  variables 305, 308
RSQ 373–4, 380

S
sample parameters 183, 194, 246
sample size 189–90, 199–200, 204–11, 218–23, 234–6, 238–9, 248–50
  calculating 237–9
sample space 107–8, 118–20, 133, 163
sample statistics 194, 217–18, 224, 246, 324
samples 185–210, 212–14, 217–20, 222–31, 246–8, 253–5, 257–62, 330–2 see also sampling
  averages 261–2
  dependent 246–7, 279, 297, 303, 307, 310, 322
  independent see independent samples
  large 218, 232, 459
  mean 194–211, 213–14, 217–20, 222, 224–8, 230–1, 253–5
  percentiles 371, 388, 396
  proportion 183, 194, 210–11, 222–3, 235–6, 267, 307–8
  random see random samples
  simple random 188, 190
  small 233, 250, 324, 347, 459
  standardized 201, 211
  types of 188–92
  v populations 194
  variance 85, 91, 219–20, 222, 231–2, 234, 287
sampling 85, 156, 182, 185–7, 191–3, 198–9, 204–8
  cluster 190
  concept 186–93
  convenience 191
  distributions 185, 187, 189, 191, 193–5, 197, 248–9
    and estimation 185–242
    and mean 194–8
    normal 254, 262, 267, 271, 276
    and proportion 210–12
  error 185–7, 193, 210, 225, 229, 248, 353
  frame 187, 190, 242
  multistage 190
  from non-normal population 204–10
  non-probability 187, 190
  from normal population 198–204
  purposive 191
  quota see quota sampling
  snowball 191–2
  stratified 189–90
  terminology 187
scales 3–4, 62, 92, 297, 346
  interval 3, 8, 94, 96
  ordinal 3
  y-axis 49–50
scatter plots 21–2, 47–51, 344–7, 349–51, 364–5, 397, 399
scores 3, 6, 32–3, 92, 109, 113, 122
SD see standard deviation
SE see standard error
seasonal components 48, 419–20, 445–6
seasonal exponential smoothing 449
seasonal forecasts 447, 450
seasonal time series 406–7, 409, 445, 466
seasonal variations 419, 466
second quartile 48–9, 63–4
SEE see standard error, of estimate
semi-interquartile range 58, 82–3
sequential numbers 407–8, 426, 429
serial correlation 370
shared ranks 322, 327, 334
signed rank sum test 246, 279, 297, 318–20, 324–5, 329–30, 340–1
significance 248, 254–5, 271–3, 276–8, 281–2, 286–8, 358–60
  level 248–9, 255, 259, 261–3, 287–90, 354–5, 358–60
simple exponential smoothing 436, 438–9, 446
simple linear regression 343, 348, 362
simple random sampling 188–90
single exponential smoothing 437
SIQR 58–9, 81–5, 87–8, 105
skewed distributions 82–3, 89–90, 92, 95, 149, 152
skewness 58–9, 62, 75, 89–92, 96, 100, 105
  coefficient of 90, 105
  measures of 90
  right 95–6, 149
SLOPE 365–6, 372, 376, 379, 382, 384, 427
slopes 365, 375, 382, 405, 426–7, 446
  population 364, 375, 399, 404
smoothing 407, 423, 432, 437, 440, 466
  constant 248, 437–8, 442, 444, 446, 466 see also damping factor
  exponential see exponential smoothing
  time series 430–45
snowball sampling 191–2
Solver 449
Spearman’s rank correlation coefficient 343–4, 347, 356–8, 404–5
  critical values 360
spread 33, 40, 58–9, 80–3, 89, 94, 104–5
  measures of 82–3
SQRT 86, 92, 145, 147–8, 177–8, 181, 459–60
square roots 35, 81, 83, 386, 425, 459
squared differences 358, 363
squared error 372, 453
squares
  least see least squares
  regression sum of 369, 381, 405, 425
  sum of 369, 374, 381, 405
SSE see sum of squares, for error
SSR see sum of squares, for regression
SST see total sum of squares
standard class widths (CWs) 41–2
standard deviation 83–9, 139–43, 196–9, 201–10, 219–23, 225–31, 458–9
standard error 197–8, 202–12, 220–5, 227, 229–31, 241–2, 372–3
  of estimate 366, 372–3, 386–7
  of forecast 459
  of the mean 222, 229, 254, 258
  population and sample 458–9
  of the proportion 223
standard normal distribution 140–1, 183, 229, 248
STANDARDIZE 143
stated limits 10–11, 57
stationary time series 407–9, 431, 433–4, 436, 445
statistical independence 117, 120, 122, 133
statistical power 251, 290–2, 294
statistical tests 185, 192, 241, 244, 248–9, 251, 318 see also parametric tests; non-parametric tests
  choice of 247
STDEV.P 82, 195
STDEV.S 220, 222, 231, 270, 275–6, 285–6, 349
STEYX 373, 375, 377, 383, 385, 459–61, 463
straight lines 42, 51, 151, 344, 362, 411, 420
strata 189–90
stratification 190
stratified random sampling 189–90, 192
strength of correlations/associations/relationships 343, 347–8, 350, 374
Student’s t distribution 183
Student’s t-test 153, 246, 250, 376, 396
SUM 86–7, 161–4, 167–8, 195–7, 299–300, 304–5, 454–5
sum of squares 369, 374, 381, 405
  for error 369, 374, 378, 381, 386, 405, 425
  for regression 369, 378, 381, 386, 405, 425
  total 369, 374, 378, 381, 386, 405, 425
SUMIF 326, 333
summary, five-number 94–5, 149
SUMPRODUCT 68, 72, 78, 86
SUMXMY2 447, 449, 455, 459–60, 463
suspected outliers 95–6, 150
symmetric distributions 90, 92, 94, 149, 205, 330
symmetry 59, 75, 92, 96, 105, 152, 250

T
t-tests 278, 281–2, 324–5, 355, 370–1, 374–5, 377–8
  model assumptions 250
  one-sample see one-sample t-tests
  paired 247, 319, 324
  Student’s 153, 246, 250, 376, 396
  two sample see two sample t-tests
tables 1, 4–6, 11, 56, 175, 297–8, 329
  construction 21
  contingency 22, 153, 298, 300–1, 303, 305–8
  creation using PivotTable 11–20
  critical 143, 460
  cross tabulation 22
  cumulative frequency 69
  data types 10–11
  grouped frequency 9, 36
tally charts 6, 57
T.DIST 292
T.DIST.2T 258–9, 270–2, 276–8, 377
T.DIST.RT 259, 280–2
test statistics 228–9, 251–2, 286–8, 300–1, 316–17, 322–3, 336–7
  calculated 255, 259–60, 270, 272, 276–7, 282, 328–30
  critical see critical test statistic
third quartile 64–5, 76, 83, 94–7, 149
tied observations 330
tied ranks 329–30, 337, 356
time, units of 407, 409, 446, 465
time periods 48, 188, 370, 407, 415, 419, 424
time points 49–50, 408–10, 421–2, 432–5, 440–2, 447–8, 462–3
time series 406–11, 419–21, 423, 430–1, 433–6, 445, 459
  actual 425, 432
  analysis 48, 406–66
    classical 419–20, 466
  data 406, 417, 423, 436, 443, 447
  forecast 424–5
  graphs 48–50, 408, 465
  model 419
  non-stationary 407–9, 431, 445
  plots 21–2, 47–51, 57, 408, 432, 447–8, 450
  seasonal 406–7, 409, 445, 466
  short 430–1, 438, 460
  smoothing 430–45
  stationary 408, 431, 433–4, 436, 445
  trend 420
  univariate 409, 419, 465
  values 407, 419, 436–7
T.INV 259, 280, 292
T.INV.2T 231, 258–9, 270, 276, 355–6, 359, 383–4
total probability 126, 137, 158–9, 161–2, 164
total sample size 111, 118, 191, 298
total sum of squares 369, 374, 378, 381, 386, 405, 425
tree diagrams 107, 123, 129, 136, 156–7
TREND 367, 460
trend chart functions 424–5
trend-fitting see fitting
trend lines 51, 365, 367, 420–2, 424–6, 433
Trendline, Add 51, 366, 397, 421, 433, 435
trends 48, 368, 419–20, 423–5, 427–9, 445, 461–3
  components 420, 466
  fitting to time series 420–3
  types of 423–4
trials 108, 110, 124, 156, 161, 163, 173
true limits see mathematical limits
two sample t-tests 246–7, 269, 271, 274–6, 279, 281, 294–5
  dependent samples 279–82
two sample z-tests 246, 295
two tail p-values 253–5, 258–9, 267–8, 270–2, 276–7, 286–7, 310–11
two tail tests 244, 249, 254, 258, 267, 286–8, 309–10
type I errors 251, 290, 295
type II errors 251, 290–1, 295

U
UCB see upper class boundaries
unbiased estimates 219–24, 241
unbiased estimators 195, 197, 210, 217–19, 221
uncertainty 107–8, 133, 191, 406, 450–1, 453, 461
underlying trends 419–20
unequal class intervals 40–2
unequal class widths 2, 42, 73
unequal variances 247, 295
unexplained deviation 378
unexplained variation 365, 369
uniform distribution 154
univariate methods 409–10, 466
univariate time series 409, 419, 465
upper class boundaries 8, 11, 32–5, 42–3, 71, 82, 86–7
upper confidence intervals 227, 231, 234, 236
upper critical values 286–9
upper one tail tests 249, 262, 268, 280, 288, 295, 326–8
upper p-values 262, 280, 286
upper quartile (UQ) see third quartile
upper tail 255, 259
  p-values 254, 288
UQ see third quartile

V
validity 243, 250, 319
VAR 82–3, 85, 127–8, 159, 164, 168, 170
variability 89, 198, 371, 373–4
variables 2–3, 21–2, 135–7, 343–5, 347–8, 393–7, 407–9
  categorical 2, 296, 298, 301–2
  column 297–8, 305, 308
  dependent 343–6, 348, 350, 362–3, 375, 378–9, 399–400
  discrete 32, 155–6, 183, 307
  discrete random 136, 155, 165–6, 183
  independent 343, 345–6, 362–3, 370–1, 374–5, 390, 396–8
  qualitative 2–3, 325
  quantitative 2–3
  response 348, 362
  row 305, 308
variance of errors assumption 397
variance ratio test 246–7, 294
variance(s) 83–6, 163–4, 167–8, 173–6, 218–19, 221–4, 286–7
  analysis of see analysis of variance
  constant 370–1, 388, 397, 456
  equal 275, 295, 371
  error 371
  population 85–6, 217, 219–20, 241–2, 246–7, 261–2, 286–8
  samples 85, 91, 219–20, 222, 231–2, 234, 287
  unequal 247, 295
variation 80–1, 83, 88–9, 369, 374, 395, 399–400
  coefficient of 81, 88–9
  cyclical 419, 466
  irregular 419, 466
  total 369, 374
  unexplained 365, 369
VAR.P 82, 85
VAR.S 85, 219, 222, 231, 234
vertical axes 44, 344
visualization 1–57

W
weighted averages 77–8, 436
weighted mean 78
weightings 41, 77, 436
Wilcoxon signed rank sum test 246, 279, 297, 318–20, 324–5, 329–30, 340–1

X
x-axis 32–3, 39–40, 42, 49, 350

Y
y-axis 32, 39, 350, 411
  scales 49–50

Z
z distribution see standard normal distribution
z tests 246–7, 264, 282, 303, 307, 309, 319 see also McNemar test
z-tests
  one-sample 246
  two sample see two sample z-tests
z-values 263, 458–9