Business Statistics Using Excel
Great Clarendon Street, Oxford, OX2 6DP,
United Kingdom
Oxford University Press is a department of the University of Oxford.
It furthers the University’s objective of excellence in research, scholarship,
and education by publishing worldwide. Oxford is a registered trade mark of
Oxford University Press in the UK and in certain other countries
© Glyn Davis and Branko Pecar 2013
The moral rights of the authors have been asserted
First Edition copyright 2010
Impression: 1
All rights reserved. No part of this publication may be reproduced, stored in
a retrieval system, or transmitted, in any form or by any means, without the
prior permission in writing of Oxford University Press, or as expressly permitted
by law, by licence or under terms agreed with the appropriate reprographics
rights organization. Enquiries concerning reproduction outside the scope of the
above should be sent to the Rights Department, Oxford University Press, at the
address above
You must not circulate this work in any other form
and you must impose this same condition on any acquirer
British Library Cataloguing in Publication Data
Data available
ISBN 978–0–19–965951–7
Printed in Italy by
L.E.G.O. S.p.A.—Lavis TN
Links to third party websites are provided by Oxford in good faith and
for information only. Oxford disclaims any responsibility for the materials
contained in any third party website referenced in this work.
Preface
(a) To help students ‘bridge the gap’ between school and university
(b) To enable a student to be confident in handling numerical data
(c) To enable students to appreciate the role of statistics as a business decision-making
tool
(d) To provide a student with the knowledge to use Excel 2010 to solve a range of
statistical problems.
This book is aimed at students who require a general introduction to business statistics that would normally form a foundation-level business school module. The learning material in this book requires minimal input from a lecturer and can be used as a self-instruction guide. Furthermore, three online workbooks are available: two to help students with Excel and practise numerical skills, and an advanced workbook to help undertake factorial experiment analysis using Excel 2010.
The growing importance of spreadsheets in business is emphasized throughout the text
by the use of the Excel spreadsheet. The use of software in statistics modules is more or
less mandatory at both diploma and degree level, and the emphasis within the text is on
the use of Excel 2010 to undertake the required calculations.
2 Data descriptors
Glossary
Index
Detailed contents
2 Data descriptors
Overview
Learning objectives
2.1 Measures of central tendency
2.1.1 Mean, median, and mode
2.1.2 Percentiles and quartiles
2.1.3 Averages from frequency distributions
2.1.4 Weighted averages
2.2 Measures of dispersion
2.2.1 The range
2.2.2 The interquartile range and semi-interquartile range (SIQR)
2.2.3 The standard deviation and variance
2.2.4 The coefficient of variation
2.2.5 Measures of skewness and kurtosis
2.3 Exploratory data analysis
2.3.1 Five-number summary
2.3.2 Box plots
2.3.3 Using the Excel ToolPak add-in
Techniques in practice
Summary
Key terms
Further reading
Glossary
Index
How to use this book
Learning objectives
Each chapter opens with a series of learning objectives outlining what you can expect to learn as you progress through the chapter. These also serve as helpful recaps of important concepts when revising.
» Learning objectives «
On successful completion of the module, you will be able to:
» understand the concept of an average;
» recognize that three possible averages exist (mean, mode, and median) and calculate them using a variety of graphical and formula methods in number and frequency distribution form;
» recognize when to use different measures of average;
» understand the concept of dispersion;
» recognize that different measures of dispersion exist (range, quartile range, SIQR, standard deviation, and variance), and calculate them using a variety of graphical and formula methods in number and frequency distribution form;
» recognize when to use different measures of dispersion;
» understand the idea of distribution shape, and calculate a value for symmetry and peakedness;
Examples
Detailed worked examples run throughout each chapter to show you how the theory relates to practice. The authors break concepts down into clear step-by-step phases, which are often accompanied by a series of Excel screenshots, enabling you to assess your progress.
… calculating the mean value in Example 2.1. In Figures 2.1 and 2.2 the mean value is located in cell E12. To insert the correct Excel function into cell E12 we would click on cell E12 and then Select Formulas > Select Insert Function as illustrated in Figures 2.3 and 2.4.
Note boxes
Note boxes draw your attention to key points and areas where extra care should be taken.
Note According to Table 2.3, a number of claims corresponding to ‘one’ occurs three times, which will contribute three to the total, ‘two’ claims occur four times contributing eight to the sum, and so on. This can be written as follows:
$$\text{Mean}(X) = \frac{(3 \times 1) + (4 \times 2) + \cdots + (1 \times 10)}{3+4+4+5+5+7+5+3+3+1} = \frac{206}{40} = 5.15$$
As already pointed out, as we are dealing with discrete data we would indicate a mean as approximately five claims. Equation (2.3) can now be used to calculate the mean for a frequency distribution data set:
Interpretation boxes
❉ Interpretation Twenty-five percent of all the values in the data set are equal to or below 430 miles, while 75% are equal to or below 470 miles.
Student exercises
Throughout each chapter you are regularly given the chance to test your knowledge and understanding of the topics covered through student exercises at the end of each section. You can then monitor your progress by checking the answers at the back of the textbook and online.
X2.14 … credit card payments at the counter by counter staff. The manager has collected the following processing time data (time in minutes/seconds) (Table 2.21) and requested that summary statistics are calculated.
(a) Calculate a five-number summary for this data set.
(b) Do we have any evidence for a symmetric distribution?
(c) Use the Excel Analysis-ToolPak to calculate descriptive statistics.
(d) Which measures would you use to provide a measure of average and spread?
Summary
Each chapter ends with an overview of the techniques covered and serves as an ideal tool for you to check your understanding of the skills you should have acquired in that chapter.
■ Summary
… that you are able to recognize the nature of the problem and should be able to convert this into two appropriate hypothesis statements (H0 and H1) that can be measured.
If you are comparing more than two samples then you would need to employ advanced statistical parametric hypothesis tests. These tests are called analysis of variance (ANOVA), which are described in the online workbook ‘Factorial experiments’.
In this chapter we have described a simple five-step procedure to aid the solution process and have focused on the application of Excel to solve the data problems. The main emphasis is placed on the use of the p-value, which is the probability, assuming the null hypothesis (H0) is true, of obtaining a result at least as extreme as the one observed. Thus, if the measured p-value > α (alpha) then we would fail to reject H0. Remember the value of the p-value will depend on whether we are dealing with a two or one tail test. So take extra care with this concept as this is where most students slip up.
How to use the Online Resource Centre
www.oxfordtextbooks.co.uk/orc/davis_pecar2e/
For students
Numerical skills workbook
The authors have provided you with a numerical skills refresher, packed with examples and exercises, to equip you with the skills needed to confidently approach every topic in the textbook.
Online glossary
The glossary of terms, along with their definitions from the book, can now be found online for ease of reference.
Revision tips
The authors have provided you with revision tips to help
consolidate your learning and to assist you when preparing for
your exams.
Visual walkthroughs
Test bank
Each chapter of the book is accompanied by a bank of assorted
questions, covering a variety of techniques for the topics
covered.
The display of various types of data or information in the form of tables, graphs, and dia-
grams is quite a common spectacle these days. Newspapers, magazines, and television
all use these types of displays to try and convey information in an easy-to-assimilate way.
In a nutshell what these forms of display aim to do is to summarize large sets of raw data
such that we can see, at a glance, the ‘behaviour’ of the data. Figures 1.1 and 1.2 provide
examples of tables published in an English newspaper.
Figure 1.1
‘No better off after rate cuts’. Elizabeth Colman, The Sunday Times—Money, 12 April 2009, p. 6
This chapter and the next will use a variety of techniques that can be used to present the
data in a form that will make sense to people. In this chapter we will look at using tables
and graphical forms to represent the raw data, and in Chapter 2 we will explore methods
that can put a summary number to the raw data.
» Overview «
In this chapter we shall look at methods to summarize data using tables and charts:
» tabulating data;
» graphing data.
Rising attacks
Increase in robberies over past three months compared to previous year
Staffordshire 56%
North Yorkshire 47%
Lincolnshire 46%
Cambridgeshire 33%
Nottinghamshire 26%
Merseyside 14%
Greater Manchester 10%
South Wales No increase
Metropolitan police −14%
Source: Police figures
Figure 1.2
‘Muggings soar as recession bites’. David Leppard, The Sunday Times, 12 April 2009, p. 11
» Learning objectives «
On successful completion of the module you will be able to:
» understand the different types of data variables that can be used to represent a specific measurement;
» distinguish between discrete and continuous data;
» construct histograms for equal and unequal class widths;
» understand what we mean by a frequency polygon;
» solve problems using Microsoft Excel.

1.1 The different types of data variable
A variable is any measured characteristic or attribute that differs for different subjects. For example, if the height of 1000 subjects was measured, then height would be a variable. Variables can be quantitative or qualitative (sometimes called categorical variables).

Variable: A variable is a symbol that can take on any of a specified set of values.
Quantitative: Variables can be classified using numbers.
Qualitative: Variables can be classified as descriptive or categorical.
Categorical variables: A set of data is said to be categorical if the values or observations belonging to it can be sorted according to category.
Visualizing and presenting data 3
Quantitative variables (or numerical variables) are measured on one of three different scales: interval, ratio, or ordinal.
Qualitative variables are measured on a nominal scale. If a group of business students was asked to name their favourite browser to browse the Web, then the variable would be qualitative. If the time spent on the computer to research a topic was measured, then the variable would be quantitative. Nominal measurement consists of assigning items to groups or categories. No quantitative information is conveyed and no ordering of the items is implied. Nominal scales are therefore qualitative rather than quantitative. Football club allegiance, sex or gender, degree type, and courses studied are all examples of nominal scales.

Interval scale: An interval scale is a scale of measurement where the distance between any two adjacent units of measurement (or ‘intervals’) is the same, but the zero point is arbitrary.
Frequency distributions, described in Chapter 2, are used to analyse data measured on a nominal scale. The main statistic computed is the mode. Variables measured on a nominal scale are often referred to as categorical or qualitative variables. It is very important that you understand the type of data variable that you have as the type of graph or summary statistic calculated will be dependent upon the type of data variable that you are handling.
Measurements with ordinal scales are ordered in the sense that higher numbers represent higher values. However, the intervals between the numbers are not necessarily equal. For example, on a five-point rating scale measuring student satisfaction, the difference between a rating of 1 (‘very poor’) and a rating of 2 (‘poor’) may not represent the same difference as the difference between a rating of 4 (‘good’) and a rating of 5 (‘very good’). The lowest point on the rating scale in the example was arbitrarily chosen to be 1 and this scale does not have a ‘true’ zero point. The only conclusion you can make is that one is better than the other (or even worse), but you cannot say that one is twice as good as the other.

Ratio scale: Ratio scale consists not only of equidistant points but also has a meaningful zero point.
Ordinal scale: Ordinal scale is a scale where the values/observations belonging to it can be ranked (put in order) or have a rating scale attached. You can count and order, but not measure, ordinal data.
Nominal scale: A set of data is said to be categorical if the values or observations belonging to it can be sorted according to category.
On interval measurement scales, one unit on the scale represents the same magnitude of the characteristic being measured across the whole range of the scale. For example, if student stress was being measured on an interval scale, then a difference between a score of 5 and a score of 6 would represent the same difference in anxiety as would a difference between a score of 9 and a score of 10. Interval scales do not have a ‘true’ zero point, however; therefore it is not possible to make statements about how many times higher one score is than another. For the stress measurement, it would not be valid to say that a person with a score of 6 was twice as anxious as a person with a score of 3.
Ratio scales are like interval scales except they have true zero points. For example, a weight of 100 g is twice as much as 50 g. Interval and ratio measurements are also called continuous variables. Table 1.1 summarizes the different measurement scales with examples provided of these different scales.

Frequency distributions: Systematic method of showing the number of occurrences of observational data in order from least to greatest.
Statistic: A statistic is a quantity that is calculated from a sample of data.
Graph: A graph is a picture designed to express words, particularly the connection between two or more quantities.
Continuous variable: A set of data is said to be continuous if the values belong to a continuous interval of real values.
Table: A table shows the number of times that items occur.
Classes: Classes provide several convenient intervals into which the values of the variable of a frequency distribution may be grouped.

1.2 Tables
Presenting data in tabular form can make even the most comprehensive descriptive narrative of data more readily intelligible. Apart from taking up less room, a table enables figures to be located more quickly, easy comparisons between different classes to be made, and may reveal patterns that cannot otherwise be deduced. The simplest form of table
Table 1.1
Example 1.1
When asked the question ‘If there was a general election tomorrow, which party would you
vote for’, 1110 students responded as follows: 400 said Conservative, 510 Labour, 78 Liberal
Democrats, 55 Green, and the rest some other party. We can put this information in table form
indicating the frequency within each category, either as a raw score or as a percentage of the
total number of responses (Table 1.2).
Note
• When a secondary data source is used it is acknowledged.
• The title of the table is given.
• The total of the frequencies is given.
• When percentages are used for frequencies this is indicated together with the sample size, N.
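The tabulation in Example 1.1 can also be sketched outside Excel. The short Python sketch below is only an illustration, not part of the book's Excel workflow: it derives the 'other party' count from the total of 1110 responses and reports each frequency as a percentage of N. The function name `percentage_table` and the rounding to one decimal place are assumptions made for this sketch.

```python
# Frequency table for the Example 1.1 survey (counts taken from the text).
total_responses = 1110
counts = {"Conservative": 400, "Labour": 510, "Liberal Democrat": 78, "Green": 55}
# "The rest" chose some other party, so derive that count from the total.
counts["Other"] = total_responses - sum(counts.values())


def percentage_table(frequencies):
    """Return {category: (frequency, percentage of total, rounded to 1 d.p.)}."""
    n = sum(frequencies.values())
    return {party: (f, round(100 * f / n, 1)) for party, f in frequencies.items()}


table = percentage_table(counts)
print(f"{'Party':<18}{'Frequency':>10}{'Percent':>9}")
for party, (f, pct) in table.items():
    print(f"{party:<18}{f:>10}{pct:>9.1f}")
print(f"{'Total (N)':<18}{total_responses:>10}")
```

Because each percentage is rounded separately, the percentage column need not sum to exactly 100, which is one reason for always quoting the sample size N alongside percentages.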
Sometimes categories can be subdivided and tables can be constructed to convey this
information together with the frequency of occurrence within the subcategories. For
example Table 1.3 indicates the frequency of half-yearly sales of two cars produced by a
large company with the sales split by month.
Example 1.2
Example 1.3
Tabulated results from a survey undertaken to measure the television viewing habits of adult
males by marital status and age.
                              Single                       Married
                              Under 30 years   30+ years   Under 30 years   30+ years
Less than 15 hours per week   330              358         1162             484
15 hours or more per week     1719             241         643              1521
Total                         2049             599         1805             2005
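As a quick check on a cross-tabulation such as the one in Example 1.3, the column totals can be recomputed from the cell counts. The Python sketch below is an illustration only; the variable names are chosen for this sketch, and the row totals it computes are not shown in the original table.

```python
# Cell counts from Example 1.3, one list per row of the table.
columns = [("Single", "Under 30"), ("Single", "30+"),
           ("Married", "Under 30"), ("Married", "30+")]
table = {
    "Less than 15 hours per week": [330, 358, 1162, 484],
    "15 hours or more per week": [1719, 241, 643, 1521],
}

# Column totals: add the cells down each column.
column_totals = [sum(column) for column in zip(*table.values())]
# Row totals: add the cells across each row (not printed in the book's table).
row_totals = {label: sum(cells) for label, cells in table.items()}

for (status, age), total in zip(columns, column_totals):
    print(f"{status}, {age}: {total}")
print("Grand total:", sum(column_totals))
```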
Example 1.4
Consider the set of data that represents the number of insurance claims processed each day by
an insurance firm over a period of 40 days: 3, 5, 9, 6, 4, 7, 8, 6, 2, 5, 10, 1, 6, 3, 6, 5, 4, 7, 8, 4, 5,
9, 4, 2, 7, 6, 1, 3, 5, 6, 2, 6, 4, 8, 3, 1, 7, 9, 7 and 2.
The frequency distribution can be used to show on how many days one claim was processed, on how many days two claims were processed, and so on. The simplest way of doing this is by creating a tally chart.
Write down the range of values from the lowest (1) to the highest (10) then go through the
data set recording each score in the table with a tally mark. It’s a good idea to cross out figures
in the data set as you go through it to prevent double counting. Table 1.5 illustrates the fre-
quency distribution for the data set given in Example 1.4.
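The tally procedure described above can be sketched in a few lines of Python, shown here as an illustration alongside the manual method. The data are the 40 daily claim counts from Example 1.4; the use of `collections.Counter` and the printed layout are choices made for this sketch.

```python
from collections import Counter

# Number of insurance claims processed on each of the 40 days (Example 1.4).
claims = [3, 5, 9, 6, 4, 7, 8, 6, 2, 5, 10, 1, 6, 3, 6, 5, 4, 7, 8, 4,
          5, 9, 4, 2, 7, 6, 1, 3, 5, 6, 2, 6, 4, 8, 3, 1, 7, 9, 7, 2]

# Counter records how many times each value occurs, like a tally chart.
frequency = Counter(claims)

# List every value from the lowest to the highest, with tally marks and count.
for value in range(min(claims), max(claims) + 1):
    f = frequency[value]
    print(f"{value:>2}  {'|' * f:<8} {f}")
```

Checking that the frequencies sum to the number of observations (40 here) is a simple guard against double counting.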
Table 1.5
In this example there were relatively few cases. However, we may have increased our survey period to one year, and the range of claims may have been between 0 and 30. As our aim is to summarize information we may find it better to group ‘likes’ into classes to form a grouped frequency distribution. The next example illustrates this point.

Tally chart: A tally chart is a method of counting frequencies, according to some classification, in a set of data.
Grouped frequency distribution: Data arranged in intervals to show the frequency with which the possible values of a variable occur.

Example 1.5
Consider the following data set of miles travelled by 120 salesmen in one week (Table 1.6).
Table 1.6
This mass of data conveys little in terms of information. Because there would be too many
value scores, putting the data into an ungrouped frequency distribution would not portray an
adequate summary. Grouping the data, however, provides the following (Table 1.7).
Table 1.7 Grouped frequency distribution data for Example 1.5 data
Figure 1.3
We can see that the class widths are all equal and the corresponding Bin Range is 399.5, 419.5, …, 519.5. We can now use Excel to calculate the grouped frequency distribution.
Class boundaries: Class boundaries separate one class in a grouped frequency distribution from another.
Histogram: A histogram is a way of summarizing data that are measured on an interval scale (either discrete or continuous).

Figure 1.4
Select Data.
Select Data Analysis menu.
Click on Histogram.
See Figure 1.5.
Figure 1.5
Click OK.
Input Data Range: Cells A6:H20.
Input Bin Range: Cells B24:B30.
Choose location of Output range: Cell D23.
See Figure 1.6.
Figure 1.6
Click OK.
Excel will now print out the grouped frequency table (Bin Range and frequency of
occurrence) as presented in cells D23–E31.
See Figure 1.7.
Figure 1.7
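The grouping that the Histogram tool performs can be sketched in Python. This is an illustration under stated assumptions rather than a reproduction of Excel's output: the Table 1.6 data are not reproduced here, so `sample_miles` is a small hypothetical sample; the bin values assume class boundaries 20 miles apart between 399.5 and 519.5; and each bin is treated as an upper limit, with any value above the last bin counted under 'More'.

```python
from bisect import bisect_left


def grouped_frequency(values, bins):
    """Count values per bin, treating each bin value as an upper limit.

    A value v is counted in the first bin b with v <= b; values above the
    last bin are counted under 'More'.
    """
    counts = [0] * (len(bins) + 1)
    for v in values:
        # bisect_left returns the index of the first bin that is >= v.
        counts[bisect_left(bins, v)] += 1
    labels = [str(b) for b in bins] + ["More"]
    return list(zip(labels, counts))


# Assumed bin range: boundaries 20 miles apart.
bins = [399.5, 419.5, 439.5, 459.5, 479.5, 499.5, 519.5]
# Hypothetical mileage values, for illustration only (not Table 1.6).
sample_miles = [403, 418, 420, 431, 447, 452, 455, 468, 471, 483, 502, 515]

result = grouped_frequency(sample_miles, bins)
for label, count in result:
    print(f"{label:>6}  {count}")
```

The first bin (399.5) acts as the lower boundary of the first class, so a count of zero there indicates that no value falls below the first class.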
From Table 1.9 we can now create the grouped frequency distribution (Table 1.10)
Style   Stated limit    Mathematical limit
                        Discrete    Continuous
A       5–under 10      5–9         5–9.999999’
        10–under 15     10–14       10–14.999999’
B       5–9             5–9         4.5–9.5
        10–15           10–15       9.5–15.5
Placing of discrete data into an appropriate class usually provides few problems. If the
data is continuous and stated limits are as style A then a value of 9.9 would be placed in
the 5–under 10 stated class, conversely if style B were used then it would be placed in the
10–15 stated class. Using the true mathematical limits the width of a class can be found.
If CW = class width, UCB = upper class boundary, and LCB = lower class boundary, then the class width is calculated using equation (1.1):
$$CW = UCB - LCB \qquad (1.1)$$
In Example 1.4, the true limits would be 0.5–1.5, 1.5–2.5, and the class width = 1.5 –
0.5 = 1.0. In Example 1.5, the true limits would be 399.5–419.5, 419.5–439.5, and the class
width = 419.5 – 399.5 = 20. Open ended classes are sometimes used at the two ends of a
distribution as a catch-all for extreme values and stated as, for example, up to 40, 40–50 . . .,
100 and over. There are no hard and fast rules for the number of classes to use, although
the following should be taken into consideration:
(a) Use between 5 and 12 classes. The actual number will depend on the size of the
sample and minimizing the loss of information.
(b) Class widths are easier to handle if in multiples of 2, 5, or 10 units.
(c) Although not always possible, try and keep classes at the same widths within a
distribution.
(d) As a guide, the following formula can be used to calculate the number of classes
given the class boundaries and the class width. Based upon this calculation we
would construct with six classes.
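The class-width and number-of-classes calculations can be checked with a short Python sketch. The formula referred to in guideline (d) is not reproduced in the text above, so the sketch assumes it is the distance between the lowest and highest class boundaries divided by the class width, rounded up; the function names are hypothetical.

```python
import math


def class_width(lower_boundary, upper_boundary):
    """Class width as upper class boundary minus lower class boundary."""
    return upper_boundary - lower_boundary


def number_of_classes(lowest_boundary, highest_boundary, width):
    """Assumed guideline: span of the boundaries divided by the class width,
    rounded up so that every value is covered."""
    return math.ceil((highest_boundary - lowest_boundary) / width)


# Example 1.4: true limits 0.5-1.5, 1.5-2.5, ...
print(class_width(0.5, 1.5))
# Example 1.5: true limits 399.5-419.5, 419.5-439.5, ...
width = class_width(399.5, 419.5)
print(width)
print(number_of_classes(399.5, 519.5, width))
```

Under this assumption, boundaries running from 399.5 to 519.5 with a class width of 20 give six classes, which agrees with the figure quoted in the text.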
• an Excel worksheet database/list or any range that has labelled columns—we will use
Excel worksheets as examples in this chapter;
• a collection of ranges to be consolidated—the ranges must contain both labelled rows
and columns;
• a database file created in an external application such as Access or Dbase.
The data in a PivotTable cannot be changed as they are the summary of other data. The
data itself can be changed and the PivotTable recalculated thereafter. However, formatting
changes, such as bold, number formats, etc., can be made directly to the PivotTable data.
To rearrange the worksheet simply drag and drop column headings to a new location on
the worksheet, and Microsoft Excel rearranges the data accordingly. To begin, you need
raw data to work with. The general rule is you need more than two criteria of data to work
with, otherwise you have nothing to pivot. Figure 1.8 depicts a typical PivotTable where
we have tabulated department spends against month. Notice the black down-pointing
arrows in the PivotTable. On Row 1 we have Department.
Figure 1.8
If the black arrow was clicked, a drop-down box would appear showing a list of the
departments.
We could click on a department and view the departmental spend for the three months
measured, or we could select which departments to view, or choose only one month. But
Excel does most of the work for you and puts in those drop-down boxes as part of the wiz-
ard. In the example we can see the advertising budget spend in June was €12,422.
Example 1.6
This example consists of a set of data that has been collected to measure the departmental
spend of individuals within three departments of Coco S.A.
The budget spends (in Euros) have been measured for April, May, and June 2007 (see Figure 1.9).
Figure 1.9
Figure 1.10
Input in the Create PivotTable menu the cell range for the data table and where you
want the PivotTable to appear.
Select a table: Cells B2:E32.
Choose to insert PivotTable in Existing Worksheet: Cell G2.
Figure 1.11 illustrates the Create PivotTable menu.
Figure 1.11
Click OK.
Excel creates a blank PivotTable and the user must then drag and drop the various fields
from the items; the resulting report is displayed ‘on the fly’, as illustrated in Figure 1.12.
Figure 1.12
The PivotTable (Cells G2:I19) will be populated with data from the data table in Cells
B3:E32 with the completion of the PivotTable Field List, which is located at the right-hand side of the worksheet. Presented in Figures 1.13 and 1.14 are but a few examples of hundreds of possible reports that could be viewed with this data through the PivotTable format.
Figure 1.13
Figure 1.14
For Example 1.6 above choose:
Figure 1.15
Modifying reports
The PivotTable field dialog box allows changes to be made to the PivotTable. For example,
we may decide to modify the PivotTable by including the individual staff spends in indi-
vidual departments. This can be achieved by selecting Name in the PivotTable Field List
with the outcome presented in Figure 1.16.
Figure 1.16
From Figure 1.16 we can observe that the individual staff contributions under each
department are presented. If you look at your Excel solution you will observe that the
Name variable is located in the Row Label dialog box. If we move the Name variable into
the Column dialog box then the solution will be as presented in Figure 1.17, where only
part of the solution is illustrated.
Figure 1.17
PivotTable options
A range of PivotTable options are available, as illustrated in Figure 1.18.
Figure 1.18
By default, Excel will use a Sum function on numeric data and Count on non-numeric to
summarize or aggregate the data. To change this:
1. Click on the field you want to change (on the PivotTable itself or in the areas below
the Field list). For example, click inside the numbers within the PivotTable and right-
click on the mouse to bring up the menu illustrated in Figure 1.19.
Figure 1.19
2. Click on Value Field Settings and select the appropriate calculation. For example,
change the calculation from Sum to Average, as illustrated in Figure 1.20.
Figure 1.20
Figure 1.21
Note To display more than one calculation in the Values area add the same Field twice.
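The idea of switching the summary calculation from Sum to Average can be sketched in Python. This is an illustration of the concept, not of Excel's implementation: the records below are hypothetical (the Coco S.A. data in Figure 1.9 are not reproduced here), and the `pivot` function and its `aggregate` parameter are names invented for this sketch.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical records: (name, department, month, spend in euros).
records = [
    ("Ana", "Advertising", "April", 1200.0),
    ("Ben", "Advertising", "April", 800.0),
    ("Ana", "Advertising", "May", 500.0),
    ("Cara", "Sales", "April", 300.0),
    ("Dev", "Sales", "May", 700.0),
    ("Cara", "Sales", "May", 900.0),
]


def pivot(rows, aggregate=sum):
    """Group spend by (department, month) and summarize each group.

    The default aggregate is sum; pass mean or len for average or count.
    """
    cells = defaultdict(list)
    for _name, department, month, spend in rows:
        cells[(department, month)].append(spend)
    return {key: aggregate(values) for key, values in cells.items()}


totals = pivot(records)                     # Sum of spend per cell
averages = pivot(records, aggregate=mean)   # Average spend per cell
counts = pivot(records, aggregate=len)      # Number of records per cell

for key in totals:
    print(key, totals[key], averages[key], counts[key])
```

Computing several summaries from the same grouped data mirrors adding the same field to the Values area more than once.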
Formatting values
1. Display the Field Settings dialog box as shown in Figure 1.20.
2. Click on the Number Format button.
3. Select the Category you want and set any options. For example, select Number and
enter the number of decimal places to display the data to.
4. Click OK and OK again, and your cells will be reformatted.
Figure 1.22
3. Choose the data set and location of the PivotTable and PivotChart as you would to
create a new PivotTable (see Figure 1.23). A new blank PivotTable and PivotChart will
be created.
Figure 1.23
Figure 1.24
Figure 1.25
5. The PivotChart and PivotTable will both be created simultaneously, as illustrated in
Figure 1.26.
Figure 1.26
Note Some Chart types (for example pie charts) are not suitable for PivotTables because
they can only show two variables.
Figure 1.27
3. Select type of chart, e.g. Column, and click OK (see Figure 1.28).
Figure 1.28

Pie chart: A pie chart is a way of summarizing a set of categorical data.
Figure 1.29
Grouping data
Data can be summarized into higher level categories by grouping items within PivotTable
fields. Depending on the data in the field there are three ways to group items:
Refreshing a PivotTable
When data is changed in the PivotTable source list the PivotTable does not automatically
recalculate. To refresh the table:
Figure 1.30
PivotTable Options can be set to refresh data every time a spreadsheet is opened.
1. Select Pivot Table Tools Options and click on Change Data Source.
2. Edit the range in the Table/Range box to include your entire dataset and click OK.
Student Exercises
X1.1 Criticize Table 1.13.
Table 1.13
X1.2 Table 1.14 represents the number of customers visited by a salesman over an 80-week
period
68 64 75 82 68 60 62 88 76 93 73 79 88 73 60 93
71 59 85 75 61 65 75 87 74 62 95 78 63 72 66 78
82 75 94 77 69 74 68 60 96 78 89 61 75 95 60 79
83 71 79 62 67 97 78 85 76 65 71 75 65 80 73 57
88 78 62 76 53 74 86 67 73 81 72 63 76 75 85 77
Table 1.14
Use Excel to construct a grouped frequency distribution from the data set in Table 1.14 and indicate both stated and mathematical limits (start at 50–54 with class width of 5).

1.3 Graphical representation of data
The next stage of analysis after the data has been tabulated is to graph it using a variety of methods to provide a suitable graph. In this section we will explore: bar charts, pie charts, histograms, frequency polygons, scatter plots, and time series plots. The type of graph you will use to graph the data depends upon the type of variable you are dealing with within your data set, for example category (or nominal), ordinal, or interval (or ratio) data (Table 1.15).

Bar chart: A bar chart is a way of summarizing a set of categorical data.
Frequency polygon: A graph made by joining the middle-top points of the columns of a frequency histogram.
Scatter plot: A scatter plot is a plot of one variable against another variable.
Time series plot: A chart of a change in variable against time.
Ordinal variable: A set of data is said to be ordinal if the values belonging to it can be ranked.
Example 1.7
Consider the categorical data in Example 1.1, which represents the proposed voting behaviour by a sample of university students. Excel can be used to create a bar chart to represent this data set. For each category a vertical bar is drawn with the vertical height representing the number of students in that category (or frequency) with the horizontal distance for each bar, and distance between each bar, equal.
Each bar represents the number of students who would vote for a particular UK political party. From the bar chart you can easily detect the differences of frequency between the five categories (Conservative, Labour, Liberal Democrat, Green, and Other). Figure 1.32 represents a bar chart for the proposed voting behaviour.

Cross tabulation: Cross tabulation is the process made with two or more data sources (variables) that are tabulating the results of one against the other.
Figure 1.32
Example 1.8
If you are interested in comparing totals then a component (or stacked) bar chart is constructed.
Figure 1.33 represents a component bar chart for the half-yearly car sales.
In this component bar chart you can see the variation in total sales from month to month
and the split between car type category per month.
Figure 1.33 (component bar chart; axis: Month, January to June; series: Pink and Blue)
Example 1.9
A multiple column chart is used when you want to compare each component over time, but
the totals are of little importance.
Figure 1.34 represents a multiple bar chart for the half-yearly car sales.
Figure 1.34 (axes: Month, January to June; Number of cars, 0 to 5000)
Example 1.10
Excel solution—bar chart using Example 1.1 data
Figure 1.35
2 Select Insert > Chart type (choose Column) > select first option, as illustrated in
Figure 1.36
Figure 1.36
Figure 1.37
In Figure 1.37 the current chart title is ‘Frequency’. To change the title, click on the
current chart title ‘Frequency’, type in the new chart title ‘Proposed voting behaviour’,
and press the enter key. Figure 1.41 illustrates the final chart.
• Axes titles
In this case the axes titles are not currently available in Figure 1.37. To add the axes
titles click on the current chart and note that the Chart Tools menu on the Excel
menu will appear as illustrated in Figure 1.38.
Figure 1.38
Select Layout on the Chart Tools menu to access the Layout tool menu as illustrated
in Figure 1.39.
Figure 1.39
Select Axis title dialog box and choose either ‘Primary Horizontal Axis Title’ or ‘Primary
Vertical Axis Title’, and modify to add the axes titles, as illustrated in Figure 1.40.
Figure 1.40
Figure 1.41 Bar chart of proposed voting behaviour (y-axis: Frequency; x-axis: Party; legend: Conservative, Labour, Democrat, Green, Other)
each bar has a unique colour then the chart legend will list each of the bar titles, for
example Conservative for the first bar. Figure 1.41 illustrates the final chart.
• Remove horizontal grid lines
To remove the horizontal grid lines click on the horizontal gridlines to select and
press the computer delete key. Further modifications to grid lines can be achieved
by choosing Gridlines on the Chart Tools Gridlines menu.
The final bar chart is illustrated in Figure 1.41.
Student exercise
X1.3 Draw a suitable bar chart for the data in Table 1.16.
Table 1.16
Example 1.11
Example 1.1 proposed voting behaviour data is illustrated in Table 1.17.
Table 1.17
Figure 1.42 Pie chart of proposed voting behaviour (Conservative, 400; Labour, 510; Democrat, 78; Green, 55; Other, 67)
Table 1.18
The size of each slice (sector) depends on the angle at the centre of the circle which, in turn,
depends upon the number in the category the sector represents. Before drawing the pie chart
you should always check that the angles you have calculated sum to 360°. A pie chart may be
constructed on a percentage basis or the actual figures may be used.
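The angle calculation described above can be sketched in Python. This is a minimal illustration rather than part of the Excel solution: the frequencies are those shown on the pie chart labels for the voting example, and the function names are chosen here purely for illustration.

```python
# Frequencies read from the pie chart labels for the voting example.
frequencies = {
    "Conservative": 400,
    "Labour": 510,
    "Democrat": 78,
    "Green": 55,
    "Other": 67,
}


def sector_angles(freqs):
    """Angle in degrees of each slice: frequency / total * 360."""
    total = sum(freqs.values())
    return {name: f / total * 360 for name, f in freqs.items()}


def percentages(freqs):
    """Share of each category as a percentage of the total."""
    total = sum(freqs.values())
    return {name: f / total * 100 for name, f in freqs.items()}


angles = sector_angles(frequencies)
shares = percentages(frequencies)

for name in frequencies:
    print(f"{name:<12} {angles[name]:7.2f} degrees  {shares[name]:5.1f}%")

# The check recommended in the text: the angles should sum to 360 degrees.
print("Total angle:", round(sum(angles.values()), 6))
```

Working from percentages or from the actual figures gives the same slices, because both are the same proportions of the total.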
Figure 1.43
2 Select Insert > Chart type (choose Pie) > select first option, as illustrated in
Figure 1.44
Figure 1.44
Figure 1.45 Pie chart titled ‘Frequency’ (legend: Conservative, Labour, Democrat, Green, Other)
Figure 1.46
Figure 1.47 Pie chart with data labels (Conservative, 400, 36%; Democrat, 78, 7%; Green, 55, 5%; Other, 67, 6%)
Student exercises
X1.4 Three thousand six hundred people who work in Bradford were asked about the
means of transport that they used for daily commuting. The data collected is shown in
Table 1.19.
Table 1.19
Mr P 2045 votes
Mr Q 4238 votes
Mrs R 8605 votes
Ms S 12,012 votes
Table 1.20
1.3.3 Histograms
We have already mentioned the idea of a frequency distribution via the displaying of cat-
egory level data with tables and bar charts. This concept can now be extended to higher
levels of measurement. A point to remember when displaying any form of data is the aim
of summarizing information clearly and in such a form that information is not distorted
or lost. The method used to graph a group frequency table (or distribution) is to construct
a histogram. A histogram looks like a bar chart, but they are different and should not be
confused with each other.
Histograms are constructed on the following principles: (a) the horizontal axis (x–axis)
is a continuous scale; (b) each class is represented by a vertical rectangle, the base of
which extends from one true limit to the next; and (c) the area of the rectangle is pro-
portional to the frequency of the class. This is very important as it means that the area
of the bar represents the frequency of each category. In the bar chart the frequency is
represented by the height of each bar. This implies that if we double the class width for
one bar compared with all the other classes then we would have to halve the height of that
particular bar compared with all other bars.
In the special case where all class widths are the same then the height of the bar can be
taken to be representative of the frequency of occurrence for that category. It is important
to note that either frequencies or relative frequencies can be used to construct a histo-
gram, but the shape of the histogram would be exactly the same no matter which variable
you chose to graph.
Example 1.12
Example 1.4 represents the number of insurance claims processed each day by an insurance
firm over a period of 40 days (see Table 1.21).
Score Frequency, f
1 3
2 4
3 4
4 5
5 5
6 7
7 5
8 3
9 3
10 1
Σf = 40
Table 1.21
The data variable ‘score’ is a discrete variable and the histogram is constructed as illustrated
in Table 1.22.
We can see from Table 1.22 that all the class widths have the same value 1 (constant, class
width = UCB – LCB). In this case the histogram can be constructed with the height of the bar
representing the frequency of occurrence.
To construct the histogram we would plot frequency (y–axis, vertical) against score (x-axis)
with the boundary between the bars determined by the upper and lower class boundaries.
Figure 1.48 illustrates the class boundary positions for each bar.
Table 1.22
Figure 1.48 Class boundary positions for each bar (x-axis: Number of claims, X, 1 to 10, with class boundaries at x = 0.5, 1.5, 2.5, ..., 9.5, 10.5)
Figure 1.49 Histogram of the number of claims (x-axis: Number of claims, X, 1 to 10)
We can use the histogram to see how the number of claims varies in frequency from the
lowest claims value of 1 to the highest claims value of 10. If we look at the histogram we note:
• looking along the x-axis we can see that the claims are evenly spread out (1–10);
• the frequency rises as the number of claims increases from 1 to 6, peaking at 6 claims per day, which occurred on 7 days;
• the frequency then falls as the number of claims increases from 6 to 10, reaching its lowest value at 10 claims per day, which occurred on 1 day.
These ideas will lead to the idea of average (central tendency) and data spread (dispersion)
which will be explored in Chapter 2.
Example 1.13
Example 1.5 represents the miles recorded by 120 salesmen in one week, as illustrated in
Table 1.23.
Mileage Frequency, f
400–419 12
420–439 27
440–459 34
460–479 24
480–499 15
500–519 8
Σf = 120
Table 1.23
Figure 1.50 represents the histogram for miles recorded by 120 salesmen.
Figure 1.50 Histogram of miles recorded by 120 salesmen (y-axis: Frequency; x-axis: Mileage, classes 400–419 to 500–519)
Excel spreadsheet solution—histogram with equal class widths using Example 1.5
data
1 Data series
Input data into cells A6:H20.
See Figure 1.51.
Figure 1.51
Table 1.24
We can see from Table 1.24 that the class widths are all equal and the corresponding
Bin Range is 399.5, 419.5 . . . 519.5. We can now use Excel to create the grouped
frequency distribution and corresponding histogram for equal classes. If you want,
you can leave the Bin Range box blank and the Excel Histogram tool will automatically
create evenly distributed bin intervals using the minimum and maximum values in
the input range as beginning and end points. The number of intervals is equal to the
square root of the number of input values (rounded down).
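The class boundaries and Bin Range described above can be reproduced with a short Python sketch. It assumes mileage is recorded to the nearest mile (so adjacent stated limits are 1 apart), and the function names are illustrative only. The default bin count simply applies the square-root rule as stated in the text; it is not a call into Excel.

```python
import math

# Stated class limits for the mileage data (Table 1.23).
stated_limits = [(400, 419), (420, 439), (440, 459),
                 (460, 479), (480, 499), (500, 519)]


def class_boundaries(limits, gap=1):
    """Convert stated limits to (LCB, UCB) pairs.

    gap is the distance between one class's upper limit and the next
    class's lower limit (assumed to be 1 for whole-mile data).
    """
    half = gap / 2
    return [(lower - half, upper + half) for lower, upper in limits]


boundaries = class_boundaries(stated_limits)

# Class width = UCB - LCB; here every class has the same width.
widths = [ucb - lcb for lcb, ucb in boundaries]

# Bin Range: the first lower boundary followed by every upper boundary.
bin_range = [boundaries[0][0]] + [ucb for _, ucb in boundaries]


def default_bin_count(n_values):
    """Square root of the number of input values, rounded down."""
    return math.isqrt(n_values)


print("Class widths:", widths)
print("Bin range:", bin_range)
print("Default bins for 120 values:", default_bin_count(120))
```

The bin range has seven values, which is consistent with the seven cells B24:B30 used for the Input Bin Range.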
Figure 1.52
Figure 1.53
Click OK.
Input Data Range: Cells A6:H20.
Input Bin Range: Cells B24:B30.
Choose location of Output range: Cell D23.
See Figure 1.54.
Figure 1.54
Press OK.
Excel will now print out the grouped frequency table (Bin Range and Frequency of
occurrence), as presented in cells D23–E31.
See Figure 1.55.
Figure 1.55
We can now use Excel to generate the histogram for equal class widths.
Figure 1.56
4 Create column chart (Insert > Column > choose first option)
This will create the chart illustrated in Figure 1.57 with chart title and axes titles
updated.
Figure 1.57 Column chart of the grouped frequency distribution (y-axis: Frequency; x-axis: Mileage, classes 400–419 to 500–519)
Figure 1.58
Click on any one of the bars and right click on the computer mouse (Figure 1.59).
Figure 1.59
Select Format Data Series and reduce Gap Width to zero, as illustrated in Figure 1.60.
Figure 1.60
Figure 1.61 (y-axis: Frequency; x-axis: Mileage, classes 400–419 to 500–519)
Table 1.25 Calculation procedure to identify class limits for the histogram
We can see from the table that all the class widths are of the same value 20 (constant, class
width = UCB – LCB). In this case the histogram can be constructed with the height of the bar
representing the frequency of occurrence.
To construct the histogram we would plot frequency (y-axis, vertical) against score (x-axis)
with the boundary between the bars determined by the upper and lower class boundaries.
Figure 1.62 illustrates the class boundary positions for each bar.
Figure 1.62 Class boundary positions for each bar (x-axis classes: 400–419 to 500–519)
Figure 1.63 illustrates the completed histogram for miles recorded by 120 salesmen.
Figure 1.63 Histogram of miles recorded by 120 salesmen (x-axis: Miles travelled, X, classes 400–419 to 500–519)
We can use the histogram to see how the frequency changes as the miles travelled changes
from the lowest group (400–419) to the highest group (500–519). If we look at the histogram
we can note:
• looking along the x-axis we can see that the miles recorded are evenly spread out;
• the frequency rises from the 400–419 class to the 440–459 class, which has the highest frequency (34);
• the frequency then falls from the 440–459 class to the 500–519 class, which has the lowest frequency (8).
These ideas will lead to the idea of average (central tendency) and data spread (disper-
sion), which will be explored in Chapter 2.
Student exercises
X1.6 Create a suitable histogram to represent the number of customers visited by a
salesman over an 80-week period (Table 1.26).
68 64 75 82 68 60 62 88 76 93 73 79 88 73 60 93
71 59 85 75 61 65 75 87 74 62 95 78 63 72 66 78
82 75 94 77 69 74 68 60 96 78 89 61 75 95 60 79
83 71 79 62 67 97 78 85 76 65 71 75 65 80 73 57
88 78 62 76 53 74 86 67 73 81 72 63 76 75 85 77
Table 1.26
16.91 9.65 22.68 12.45 18.24 11.79 6.48 12.93 7.25 13.02
8.10 3.25 9.00 9.90 12.87 17.50 10.05 27.43 16.01 6.63
14.73 8.59 6.50 20.35 8.84 13.45 18.75 24.10 13.57 9.18
9.50 7.14 10.41 12.80 32.09 6.74 11.38 17.95 7.25 4.32
8.31 6.50 13.80 9.87 6.29 14.59 19.25 5.74 4.95 15.90
Table 1.27
Table 1.28
Figure 1.64 Histograms A and B (x-axis: Class, with classes 0–2, 2–4, 4–6, 6–8, and 8–12)
Although class (8–12) is twice the width of the other classes, histogram A gives equal
weighting to the frequency for all classes.
It is therefore incorrect. Keep in mind that the area of a rectangle is proportional to
frequency and thus:
\[ \text{Height} = \frac{\text{Class frequency}}{\text{Class width}} \qquad (1.2) \]
Histogram B indicates the correct weighting to the class (8–12). As the class width is
twice the width of the other classes, the height of the rectangle is halved. In general, if we
choose a standard class width, a class having twice the width will have a height of 1/2 of its
frequency; three times the width a height of 1/3 of its frequency, and so on.
Example 1.14
Construct a histogram for the following distribution of discrete data (Table 1.29).
Table 1.29
Taking the class (129–138) as our standard class width (class width = 10) then we can use the
following formula to calculate the heights of each individual bar (or rectangle).
\[ h = f \times \frac{CW_s}{CW} \qquad (1.3) \]
Where CWs = standard class width = 10, CW = class width, f = class frequency, and h = class
height (height of rectangle). Table 1.30 illustrates the calculation of the class heights when
classes are unequal.
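Equation (1.3) can be sketched in Python as follows. The class widths and frequencies below are hypothetical values chosen for illustration (they are not the Table 1.29 data), and the function name is an assumption of this sketch.

```python
def class_height(frequency, class_width, standard_width):
    """Bar height for a histogram with unequal classes: h = f * CWs / CW."""
    return frequency * standard_width / class_width


STANDARD_WIDTH = 10  # CWs, the standard class width

# Hypothetical (class width, frequency) pairs, for illustration only.
classes = [(10, 30), (10, 45), (20, 40), (5, 12)]

heights = [class_height(f, w, STANDARD_WIDTH) for w, f in classes]

for (w, f), h in zip(classes, heights):
    # Area = height * width, which stays proportional to the frequency.
    print(f"width={w:>2}  frequency={f:>2}  height={h:5.1f}  area={h * w:6.1f}")
```

A class twice the standard width is drawn at half its frequency, and a class half the standard width at twice its frequency, so the area of every bar remains proportional to the class frequency.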
Figure 1.65 illustrates the completed histogram for the Example 1.14 data set.
Figure 1.65 Histogram for the Example 1.14 data set (x-axis: class intervals, including 129–138, 139–148, 149–158, and 159–178)
(a) As the height of the rectangle is proportional to class frequency and class width we can use
the term frequency density rather than frequency.
(b) Total area is proportional to total frequency.
Unfortunately, you cannot create a histogram with unequal class widths (or intervals) using
Excel, but you can create the frequency distribution by inputting the upper and lower class
intervals. These are called Bins in Excel.
Where UCB = Upper Class Boundary and LCB = Lower Class Boundary.
Example 1.15
Table 1.31 illustrates the frequency polygon for the data set in Example 1.5.
Table 1.31
Figure 1.66 illustrates the frequency polygon for the travelling salesmen problem.
Figure 1.66 Frequency polygon (x-axis: Class mid-point miles, 409.5, 429.5, 449.5, 469.5, 489.5, 509.5)
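The class mid-points plotted on the frequency polygon can be checked with a small Python sketch. It uses the mileage classes and frequencies from Table 1.23; the mid-point is taken halfway between the stated limits, which gives the same value as halfway between the class boundaries. The function name is illustrative only.

```python
# Mileage classes (stated limits) and frequencies from Table 1.23.
classes = [((400, 419), 12), ((420, 439), 27), ((440, 459), 34),
           ((460, 479), 24), ((480, 499), 15), ((500, 519), 8)]


def midpoint(lower, upper):
    """Class mid-point: halfway between the lower and upper limits."""
    return (lower + upper) / 2


# Each point on the frequency polygon is (class mid-point, frequency).
points = [(midpoint(lower, upper), f) for (lower, upper), f in classes]

for x, f in points:
    print(f"mid-point {x:6.1f}  frequency {f:>2}")

print("Total frequency:", sum(f for _, f in points))
```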
1 Data series
Class Mid-Point: cells D3:D9 (includes data label).
Frequency: cells E3:E9 (includes data label).
Highlight cells D3:E9.
Figure 1.67 illustrates the Excel solution.
Figure 1.67
Class mid-point The class mid-point is the midpoint of each class interval.
2 Select Insert > Line > select option 4, as illustrated in Figure 1.68
Figure 1.68
Figure 1.69 Initial line chart (series: Class mid-point, Frequency; horizontal axis labelled 1 to 6)
The chart is correct on the vertical axis (frequency), but we would like the horizontal
axis to use the class labels rather than the numbers 1, 2, 3, 4, 5, and 6. To modify the
horizontal axis label from these numbers to the class mid-point labels we need to edit
the data series. Right-click on the data line, as illustrated in Figure 1.71.
Figure 1.70
Figure 1.71
Figure 1.72
Click on Edit in the Horizontal (Category) Axis Labels and browse to the class mid-
point cell reference (D4:D9)(see Figure 1.73).
Figure 1.73
Click OK.
See Figure 1.74.
Figure 1.74
Figure 1.75 Frequency polygon after reformatting (x-axis: Class mid-point miles, 409.5 to 509.5)
Figure 1.75 illustrates the frequency polygon after a degree of reformatting (removed
border, horizontal gridlines).
Student exercise
X1.8 Create a frequency polygon (line graph) for the data in Table 1.32
Table 1.32
Example 1.16
A manufacturing firm has designed a training programme that is supposed to increase the
productivity of employees.
The personnel manager decides to examine this claim by analysing the data results from the
first group of 20 employees that attended the course.
The results are provided in Table 1.33.
Table 1.33
Figure 1.76 illustrates the scatter plot. There would seem to be some form of
relationship: as the old production level increases, there is a tendency for
the percentage increase in production to rise. The data, in fact, would indicate a positive
relationship.
We will explore this concept in more detail in Chapter 8 when discussing measuring
correlation between two data variables.
Figure 1.76 Scatter plot (y-axis: % increase in production, Y; x-axis: Old production, X)
Time series analysis is concerned with data collected over a period of time. It attempts
to isolate and evaluate various factors which contribute to changes over time in such vari-
able series as imports and exports, sales, unemployment, and prices. If we can evaluate
the main components that determine the value of, say, sales for a particular month then
we can project the series into the future to obtain a forecast.
Example 1.17
Consider the following time series data presented in Table 1.34 and the resulting time series
graph.
The first step in analysing the data in Table 1.34 is to create the time series plot using the
technique discussed in the previous section.
Figure 1.77 illustrates the up and down pattern with the overall sales increasing between
the beginning of 2001 and the end of 2004.
This pattern consists of an upward trend and a seasonal component that repeats
between individual quarters (Q1–Q2–Q3–Q4).
We shall explore these ideas of trend and seasonal components in Chapter 9. In the
previous two sections we looked at creating scatter plots and time series plots that may
visually provide information about the possible relationship between one measured vari-
able and another.
Figure 1.77 Time series graph for sales data (y-axis: Sales value, Y; x-axis: Time point, X)
Care needs to be taken when using graphs to infer what this relationship may be. For
example, if we modify the y-axis scale then we have a very different picture of this poten-
tial relationship.
We noted in Example 1.16 that the percentage change in production increases as
the value of the old production increases.
We can change the vertical y-axis so that the minimum value of the y-axis is 0, but the
maximum value is increased to 60.
Figure 1.78 illustrates the effect on the graph of modifying the vertical scale.
Figure 1.78 Scatter plot with modified vertical scale (x-axis: Old production, X)
We can see that the data points are now hovering above the x-axis with the increase in
the vertical direction not as pronounced as in the first graph in Figure 1.76. If we further
increased the y-axis scale then this pattern would be diminished even further.
Furthermore, in Figure 1.77 we note that the time series plot indicated that sales are
increasing with time (upward trend) and that we have a pattern within the data that
repeats between individual quarters (Q1–Q2–Q3–Q4).
We can change the y-axis so that the minimum value is 0, but the maximum
value is now increased to 3000. We can see that the pattern in the data still shows
an upward trend, but the distinct pattern is not as pronounced as in the first graph. If we
further increased the y-axis scale then this pattern would be diminished even further, as
illustrated in Figure 1.79.
Figure 1.79 Time series graph with rescaled vertical axis (y-axis: Sales value; x-axis: Time point, X)
Figure 1.80
2 Select Insert > Scatter > choose first option (Figure 1.81).
Figure 1.82 Scatter plot (x-axis: Old production, X)
We can now ask Excel to fit a straight line to this data chart by clicking on a data point
on the chart: right-click on a data point and choose Add Trendline. We will look at
fitting a trend line and curves to scatter and time series charts in Chapters 8 and 9.
Example 1.18
A market researcher has collected a set of data that measures the distance travelled by eight
salesmen. The market researcher has calculated the average and standard deviation, and
requires a graph of mileage travelled against identification (ID) and the two error measure-
ments provided in Table 1.35.
Table 1.35
Excel solution—superimposing two data sets onto one graph for the Example 1.18
data
Figure 1.83
4 Further modify the chart by changing the bars representing the error term (average –
error, average + error) to be horizontal dashed lines rather than vertical bars. Select
average – error bar (this will select all the average – error bars for each ID). Right-
click on average – error bar > Select Change Series Chart Type > Select Line and
click OK. Repeat for average + error bars. The final chart is illustrated in Figure 1.86.
Figure 1.84 Chart with three series (Mileage, Average + error, Average – error) for categories 1 to 8
Figure 1.85 (y-axis: Mileage; x-axis: ID, 1 to 8)
Figure 1.86 (y-axis: Mileage; x-axis: ID, 1 to 8)
Student exercises
X1.9 Obtain a scatter plot for the data in Table 1.36 and comment on whether there is a link
between road deaths and the number of vehicles on the road. Would you expect this
to be true? Provide reasons for your answer.
Table 1.36
X1.10 Obtain a scatter plot for the data in Table 1.37 that represents the passenger miles
flown by a UK-based airline (millions of passenger miles) during the period 2003–2004.
Comment on the relationship between miles flown and quarter.
Table 1.37
■ Techniques in practice
TP1 Coco S.A. supplies a range of computer hardware and software to 2000 schools within
a large municipal region of Germany. When Coco S.A. won the contract the issue of customer
service was considered to be central to the company being successful at the final bidding stage.
The company has now requested that its customer service director creates a series of graphi-
cal representations of the data to illustrate customer satisfaction with the service. The data in
Table 1.38 has been collected over the last six months and measures the time to respond to the
received complaint (days).
5 24 34 6 61 56 38 32
87 78 34 9 67 4 54 23
56 32 86 12 81 32 52 53
34 45 21 31 42 12 53 21
43 76 62 12 73 3 67 12
78 89 26 10 74 78 23 32
26 21 56 78 91 85 15 12
15 56 45 21 45 26 21 34
28 12 67 23 24 43 25 65
23 8 87 21 78 54 76 79
Table 1.38
(c) Do the results suggest that there is a great deal of variation in the time taken to respond
to customer complaints?
(d) What conclusions can you draw from these results?
TP2 Bakers Ltd runs a chain of bakery shops and is famous for the quality of its pies. The
management of the company is concerned at the number of complaints from customers who
say it takes too long to serve customers at a particular branch. The motto of the company is
‘Have your pie in two minutes’. The manager of the branch concerned has been told to provide
data on the time it takes for customers to enter the shop and be served by the shop staff (see
Table 1.39).
0.70 0.70 0.70 0.70 0.70 0.70 0.70 0.70 0.70 0.70
0.99 0.99 0.99 0.99 0.99 0.99 0.99 0.99 0.99 0.99
0.12 0.12 0.12 0.12 0.12 0.12 0.12 0.12 0.12 0.12
0.70 0.70 0.70 0.70 0.70 0.70 0.70 0.70 0.70 0.70
1.88 1.88 1.88 1.88 1.88 1.88 1.88 1.88 1.88 1.88
0.55 0.55 0.55 0.55 0.55 0.55 0.55 0.55 0.55 0.55
1.38 1.38 1.38 1.38 1.38 1.38 1.38 1.38 1.38 1.38
0.80 0.80 0.80 0.80 0.80 0.80 0.80 0.80 0.80 0.80
1.25 1.25 1.25 1.25 1.25 1.25 1.25 1.25 1.25 1.25
1.48 1.48 1.48 1.48 1.48 1.48 1.48 1.48 1.48 1.48
Table 1.39
TP3 Skodel Ltd is a small brewery that is undergoing a major expansion after a takeover
by a large European brewery chain. Skodel Ltd produces a range of beers and lagers, and is
renowned for the quality of its beers, winning a number of prizes at trade fairs throughout
the European Union. The new parent company is reviewing the quality control mechanisms
being operated by Skodel Ltd and is concerned at the quantity of lager in its premium lager
brand, which should contain a mean of 330 ml and a standard deviation of 15 ml. The bottling
plant manager provided the parent company with quantity measurements from 100 bottles for
analysis (see Table 1.40).
326 326 326 326 326 326 326 326 326 326
344 344 344 344 344 344 344 344 344 344
333 333 333 333 333 333 333 333 333 333
346 346 346 346 346 346 346 346 346 346
339 339 339 339 339 339 339 339 339 339
353 353 353 353 353 353 353 353 353 353
310 310 310 310 310 310 310 310 310 310
351 351 351 351 351 351 351 351 351 351
350 350 350 350 350 350 350 350 350 350
348 348 348 348 348 348 348 348 348 348
Table 1.40
■ Summary
The methods described in this chapter are very useful for describing data using a variety of
tabulated and graphical methods. These methods allow one to make sense of data by con-
structing visual representations of numbers within the data set. Table 1.41 provides a summary
of which table/graph to construct given the data type.
In the next chapter we will look at summarizing data using measures of average and
dispersion.
■ Key terms
Bar chart
Categorical
Class boundaries
Class limits
Class midpoint
Classes
Continuous
Cross tabulation
Discrete
Frequency distributions
Frequency polygon
Graph or chart
Grouped frequency distributions
Histogram
Histogram with unequal class intervals
Interval
Nominal
Ordinal
Ordinal variable
Pie chart
Qualitative
Quantitative
Ratio
Raw data
Scatter plot or scattergrams
Stated limits
Statistic
Table
Tally chart
Time series plot
True or mathematical limits
Variable
■ Further reading
Textbook resources
1. Whigham, D. (2007) Business Data Analysis using Excel. Oxford: Oxford University Press.
2. Lindsey, J. K. (2003) Introduction to Applied Statistics: A Modelling Approach (2nd edn).
Oxford: Oxford University Press.
Web resources
1. StatSoft Electronic Textbook http://www.statsoft.com/textbook/stathome.html (accessed
25 May 2012).
2. HyperStat Online Statistics Textbook http://davidmlane.com/hyperstat/index.html
(accessed 25 May 2012).
3. Eurostat: website is updated daily and provides direct access to the latest and most complete
statistical information available on the European Union (EU), the EU Member States, the
Euro-zone and other countries http://epp.eurostat.ec.europa.eu (accessed 25 May 2012).
4. Economagic: contains international economic data sets http://www.economagic.com
(accessed 25 May 2012).
5. The International Statistical Institute (ISI) glossary of statistical terms provides definitions
in a number of different languages http://isi.cbs.nl/glossary/index.htm (accessed 25 May 2012).
2 Data descriptors
Although tables, diagrams, and graphs provide easy-to-assimilate summaries of data they
only go part way in describing data. Often a concise numerical description is preferable
which enables us to interpret the significance of the data.
Measures of central tendency (or location) attempt to quantify what we mean when
we think of the ‘typical’ or ‘average’ value for a particular data set. The concept of central
tendency is extremely important and is encountered in daily life. For example:
• What is the average carbon dioxide emission for a particular car compared with other,
similar cars?
• What is the average starting salary for new graduates starting employment with a
large city bank?
A further measure that is employed to describe a data set is the concept of dispersion
(or spread) about this middle value.
» Overview «
In this chapter we shall look at three key statistical measures that enable us to describe a data
set:
» the central tendency is the amount by which all the data values coexist about a defined
typical value. A number of different measures of central tendency exist, including mean (or
average), median, and mode, that can be calculated for both ungrouped (individual data values
known) and grouped (data values within class intervals) data sets;
» the dispersion (or spread) is the amount that all the data values are dispersed about this
typical central tendency value. A number of different measures of dispersion exist, including
range, interquartile range, semi-interquartile range (SIQR), standard deviation, and variance
that can be calculated for both ungrouped and grouped data sets;
» the shape of the distribution is the pattern that can be observed within the data set. This
shape can be classified into whether the distribution is symmetric (or skewed) and whether or
not there is evidence that the shape is peaked. Skewness is defined as a measure of the lack of
» Learning objectives «
On successful completion of the module, you will be able to:
» recognize that three possible averages exist (mean, mode, and median) and calculate them
using a variety of graphical and formula methods in number and frequency distribution form;
» recognize that different measures of dispersion exist (range, quartile range, SIQR, standard
deviation, and variance), and calculate them using a variety of graphical and formula methods
in number and frequency distribution form;
» understand the idea of distribution shape, and calculate a value for symmetry and
peakedness;
of the data set then the mean is called the population mean. If we sample from this popu-
lation and calculate the mean then the mean is called the sample mean.
The population and sample mean would be calculated using the same formula:
mean = sum of data values/total number of data values. For example, if KC GmbH were
interested in the mean time taken for a consultant to travel by train from Munich to
Hamburg and if we assume that KC GmbH has gathered the time (rounded to the nearest
minute) for the last five trips (645, 638, 649, 630, and 647), then the mean time would be
(645 + 638 + 649 + 630 + 647)/5 = 3209/5 = 641.8 minutes.
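The mean of the five journey times can be checked with a short Python sketch. It applies the formula mean = sum of data values/total number of data values directly, and compares the result with the standard library function statistics.mean; the variable names are illustrative only.

```python
from statistics import mean

# Journey times in minutes for the last five trips.
trip_minutes = [645, 638, 649, 630, 647]

total = sum(trip_minutes)        # sum of data values
n = len(trip_minutes)            # total number of data values
mean_by_formula = total / n

# Cross-check with the standard library.
mean_by_library = mean(trip_minutes)

print("Sum:", total)
print("Number of trips:", n)
print("Mean (formula):", mean_by_formula)
print("Mean (statistics.mean):", mean_by_library)
```

The same formula applies whether the five trips are treated as a sample or as the whole population; only the name of the resulting mean changes.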
Example 2.1
Suppose the marks obtained in the statistics examination are as illustrated in Table 2.1.
24 27 36 48 52 52 53 53 59 60 85 90 95
Table 2.1
We can describe the overall performance of these 13 students by calculating an ‘average’
score using the mean, median, and mode.
Excel solution for Example 2.1—mean, median, and mode
Data Series input into Cells B4:B16.
Figure 2.1 illustrates the Excel solution.
Population mean The population mean is the mean value of all possible values.
Extreme value An extreme value is an unusually large or an unusually small value compared with the others in the data set.
Outlier An outlier is an observation in a data set which is far removed in value from the others in the data set.
Figure 2.1
➜ Excel solution
Statistics marks Cell B4:B16 Values
n = Cell E4 Formula: =COUNT(B4:B16)
Σx = Cell E5 Formula: =SUM(B4:B16)
Mean = Cell E7 Formula: =E5/E4
Mean = Cell E12 Formula: =AVERAGE(B4:B16)
Median = Cell E13 Formula: =MEDIAN(B4:B16)
Mode = Cell E14 Formula: =MODE(B4:B16)
❉ Interpretation The above values imply that, depending on what measure we use,
the average mark for this group can be 56.5% (the mean), 53% (median), or 52% (mode). The
choice of the measure will depend on the type of numbers within the data set.
The mean
In general the mean can be calculated using the formula:
\[ \text{Mean}(\bar{X}) = \frac{\text{Sum of data values}}{\text{Total number of data values}} = \frac{\sum X}{\sum f} \qquad (2.1) \]
Where X̄ (X bar) represents the mean value for the sample data, ∑X represents the sum
of all the data values, and ∑f represents the number of data values.
\[ \text{Mean}(\bar{X}) = \frac{\sum X}{\sum f} = \frac{24 + 27 + \dots + 90 + 95}{13} = \frac{734}{13} = 56.4615 \]
We can see from the formula method (cell E7) and Excel function method (cell E12) that the
mean examination mark is 56.5%.
The median
The median is defined as the middle number when the data is arranged in order of size.
The position of the median can be calculated as follows:
\[ \text{Position of percentile} = \frac{P}{100}(N + 1) \qquad (2.2) \]
Where P represents the percentile value and N represents the number of numbers in
the data set.
Consider the data from Example 2.1 (note that the data is written in order of size—rank.
If it wasn’t ranked, the data would have to be put in the correct order before this method
was used manually) as presented in Table 2.2.
24 27 36 48 52 52 53 53 59 60 85 90 95
Table 2.2
❉ Interpretation From Excel, the value of the median is 53%. We can see from this
example that the mean and median are reasonably close (56% compared with 53%), and the
distance between the lowest mark and the median (53 – 24 = 29) is less than the distance
between the largest value and the median (95 – 53 = 42). It should be noted that the median
is not influenced by the presence of very small or very large data values in the data set
(extreme values or outliers). If we have a small number of these extreme values (or outliers)
we would use the median instead of the mean to represent the measure of central tendency.
This issue will be explored in greater detail when discussing measuring skewness.
Note In this example the median was calculated to be the seventh number in the
ordered list of data values. If we created an extra value then the calculation would be a little
more complex. For example, if we had 14 numbers (N = 14) then the position of the median
would now be
\[ \text{Position of median} = \frac{50}{100}(14 + 1) = 7.5\text{th number} \]
The position of the median would now be the 7.5th number in the data set. To help us
understand what this means we can rewrite this into a slightly different form:
\[ \text{Median} = \frac{\text{8th number} + \text{7th number}}{2} = \frac{53 + 53}{2} = 53 \]
The median statistics examination mark would then be 53%.
Median The median is the value halfway through the ordered data set.
Skewness Skewness is defined as asymmetry in the distribution of the data values.
The mode
The mode is defined as the number which occurs most frequently (the most ‘popular’
number).
Note We can see that Excel provides only one solution for the mode (52) even though
we have two modal values in the data set (two numbers 52 and 53 occurred twice).
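The three averages for the Example 2.1 marks can be cross-checked outside Excel with Python's standard statistics module. This is a sketch for comparison, not part of the Excel solution; it assumes Python 3.8 or later, where mode returns the first modal value encountered and multimode lists every modal value.

```python
from statistics import mean, median, mode, multimode

# Statistics examination marks from Table 2.1.
marks = [24, 27, 36, 48, 52, 52, 53, 53, 59, 60, 85, 90, 95]

mean_mark = mean(marks)          # sum of values / number of values
median_mark = median(marks)      # middle value of the ordered data
single_mode = mode(marks)        # one modal value only
all_modes = multimode(marks)     # every value sharing the highest count

print("Mean:", round(mean_mark, 4))
print("Median:", median_mark)
print("Mode (single):", single_mode)
print("Modes (all):", all_modes)
```

Reporting all modal values makes the bimodal nature of this data set visible, which a single-valued mode function hides.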
Example 2.2
Example 1.1 consists of category data that provide a measure of proposed student voting
behaviour at a university. We can see from the frequency count that Labour was the most
popular party for the students. The Labour party would represent the mode for this data set.
Example 2.3
Reconsider Example 2.1 to calculate the twenty-fifth percentile and quartile values.
Figure 2.2
➜ Excel solution
25th percentile = Cell E15 Formula: =PERCENTILE.INC(B4:B16,0.25)
First quartile = Cell E16 Formula: =QUARTILE.INC(B4:B16,1)
Second quartile = Cell E17 Formula: =QUARTILE.INC(B4:B16,2)
Third quartile = Cell E18 Formula: =QUARTILE.INC(B4:B16,3)
Note The calculation process to calculate percentiles and quartiles is as follows for the
first, second, and third quartiles.
First Quartile, Q1
The first quartile corresponds to the twenty-fifth percentile and the position of this value
within the ordered data set is given by equation 2.2.
\[ \text{Position of Twenty-Fifth Percentile} = \frac{25}{100}(13 + 1) = \frac{1}{4}(14) = 3.5\text{th number} \]
We therefore take the twenty-fifth percentile to be the number half the distance between the
third and fourth numbers. To solve this problem we use linear interpolation between the two
nearest ranks: Twenty-Fifth Percentile = third number + 0.5*(fourth number − third
number). From the list of ordered data values the third number = 36 and the fourth number = 48.
Q1 = 36 + 0.5∗(48 − 36) = 36 + 0.5∗(12) = 36 + 6 = 42
The first quartile statistics examination mark is 42.
Third Quartile, Q3
The third quartile corresponds to the seventy-fifth percentile and the position of this value
within the ordered data set is given by:
\[ \text{Position of Seventy-Fifth Percentile} = \frac{75}{100}(13 + 1) = \frac{3}{4}(14) = 10.5\text{th number} \]
We therefore take the seventy-fifth percentile to be the number half the distance between
the tenth and eleventh numbers. To solve this problem we use linear interpolation between the
two nearest ranks: Seventy-Fifth Percentile = tenth number + 0.5*(eleventh number
− tenth number). From the list of ordered data values the tenth number = 60 and the eleventh
number = 85.
Q3 = 60 + 0.5∗(85 − 60) = 60 + 0.5∗(25) = 60 + 12.5 = 72.5
The third quartile statistics examination mark is 73.
Quartiles Quartiles are values that divide a sample of data into four groups containing an equal number of observations.
Q1 Q1 is the lower quartile and is the data value a quarter way up through the ordered data set.
Q3 Q3 is the upper quartile and is the data value a quarter way down through the ordered data set.
Note Using Excel the first quartile value is 48 and the third quartile value is 60. The
manual method provides a first quartile value of 42 and third quartile value of 73. Unlike the
median that has a standard calculation method, there is no one standard for the calculation
of the quartiles. A number of definitions for the quartile exist, which results in a number of
different calculation procedures to calculate the value of the quartiles. The method used by
Excel is method 1 in Freund, J. and Perles, B. (1987) ‘A New Look at Quartiles of Ungrouped
Data’, The American Statistician, 41 (3), 200–3.
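The difference between the two sets of quartile values comes from the position rule used. The Python sketch below compares the manual rule, position = P/100 × (N + 1), with the rule position = 1 + P/100 × (N − 1). We assume here that the second rule is the one behind QUARTILE.INC, since it reproduces the Excel values quoted above. The function names are our own.

```python
def value_at_position(ordered, position):
    """Value at a 1-based, possibly fractional, position (linear interpolation)."""
    lower = int(position)
    if lower < 1:
        return float(ordered[0])
    if lower >= len(ordered):
        return float(ordered[-1])
    fraction = position - lower
    below, above = ordered[lower - 1], ordered[lower]
    return below + fraction * (above - below)


def percentile_manual(values, p):
    """Position rule used in the manual method: p/100 * (N + 1)."""
    ordered = sorted(values)
    return value_at_position(ordered, p / 100 * (len(ordered) + 1))


def percentile_inclusive(values, p):
    """Assumed QUARTILE.INC-style rule: 1 + p/100 * (N - 1)."""
    ordered = sorted(values)
    return value_at_position(ordered, 1 + p / 100 * (len(ordered) - 1))


marks = [24, 27, 36, 48, 52, 52, 53, 53, 59, 60, 85, 90, 95]
manual = (percentile_manual(marks, 25), percentile_manual(marks, 75))
inclusive = (percentile_inclusive(marks, 25), percentile_inclusive(marks, 75))
print(manual, inclusive)
```

Both rules give the same median for this data set; they differ only in how far into the tails the quartile positions fall.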
Example 2.4
To illustrate the use of the Select Formulas > Select Insert Function method consider the prob-
lem of calculating the mean value in Example 2.1. In Figures 2.1 and 2.2 the mean value is
located in cell E12. To insert the correct Excel function into cell E12 we would click on cell E12
and then Select Formulas > Select Insert Function as illustrated in Figures 2.3 and 2.4.
Figure 2.3
Figure 2.4
Figure 2.5
At this final stage we can scroll down the list of the functions until we find the function we
require. To help the user Excel provides appropriate information on each function. We shall
choose the AVERAGE function and click OK. The Excel function average would then be placed
in cell E12. To familiarize yourself with the functions browse down the list to check on what
functions are available and their function names—see online for a complete list of all Excel
functions. To help data analysts, Excel provides the Data Analysis tool to calculate descriptive
statistics for a set of numbers. For example, to calculate a series of statistical values at once,
use the Descriptive Statistics menu to calculate the descriptive statistics described in this chapter.
Student exercises
X2.1 In 12 consecutive innings a batsman’s scores were: 6, 13, 16, 45, 93, 0, 62, 87, 136, 25,
14, and 31. Find his mean score and the median.
X2.2 The following are the IQs of 12 people: 115, 89, 94, 107, 98, 87, 99, 120, 100, 94, 100,
99. It is claimed that ‘the average person in the group has an IQ of over 100’. Is this a
reasonable assertion?
X2.3 A sample of six components was tested to destruction to establish how long they
would last. The times to failure (in hours) during testing were 40, 44, 55, 55, 64,
and 69. Which would be the most appropriate average to describe the life of these
components? What are the consequences of your choice?
X2.4 Find the mean, median, and mode of the following set of data: 1, 1, 1, 1, 1, 2, 2, 2, 2, 2,
2, 3, 3, 3, 3, 3, 4, 4, 4, 4, 5, 5, 5, 5, 5.
X2.5 The average salary paid to graduates in three companies is: £7000, £6000, and £9000
per annum respectively. If the respective number of graduates in these companies is 5,
12, and 3, find the mean salary paid to the 20 graduates.
Example 2.5
The distribution of insurance claims processed each day is presented in Table 2.3.
Claims (X) 1 2 3 4 5 6 7 8 9 10
Frequency (f) 3 4 4 5 5 7 5 3 3 1
Table 2.3
Figure 2.6
➜ Excel solution
X Cells B4:B13 Values
f Cells C4:C13 Values
fX Cell D4 Formula: =C4*B4
Copy formula down D4:D13
Σf =Cell C17 Formula: =SUM(C4:C13)
ΣfX =Cell C18 Formula: =SUM(D4:D13)
Mean =Cell C19 Formula: =C18/C17
Mean =Cell C23 Formula: =SUMPRODUCT (B4:B13,C4:C13)/SUM(C4:C13)
Mode = Cell C24 Value
❉ Interpretation From Excel, the mean is five claims per day and the mode is six claims
per day.
Note According to Table 2.3, a number of claims corresponding to ‘one’ occurs three
times, which will contribute three to the total, ‘two’ claims occur four times contributing eight
to the sum, and so on. This can be written as follows:
\[ \text{Mean}(X) = \frac{(3 \times 1) + (4 \times 2) + \dots + (1 \times 10)}{3 + 4 + 4 + 5 + 5 + 7 + 5 + 3 + 3 + 1} = \frac{206}{40} = 5.15 \]
As already pointed out, as we are dealing with discrete data we would indicate a mean as
approximately five claims. Equation (2.3) can now be used to calculate the mean for a fre-
quency distribution data set:
\[ \bar{X} = \frac{\sum fX}{\sum f} \tag{2.3} \]
The following indicates the setting out of the calculation for finding the mean using the data
set in Table 2.3.
Table 2.4 presents the frequency distribution for Example 2.5.
\[ \bar{X} = \frac{\sum fX}{\sum f} = \frac{206}{40} = 5.15 \text{ claims per day} \]
Clearly, the number corresponds to the one obtained using the Excel Function method,
as expected.
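The frequency-distribution mean of equation (2.3) can also be checked with a few lines of Python. This is a sketch using the Table 2.3 values; the variable names are our own.

```python
claims = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]   # X, from Table 2.3
freq = [3, 4, 4, 5, 5, 7, 5, 3, 3, 1]      # f, from Table 2.3

total_f = sum(freq)                                   # Σf
total_fx = sum(f * x for f, x in zip(freq, claims))   # ΣfX
mean_claims = total_fx / total_f                      # equation (2.3)

# The mode is the X value with the highest frequency.
modal_claims = claims[freq.index(max(freq))]

print(total_f, total_fx, mean_claims, modal_claims)
```

This mirrors the SUMPRODUCT approach: each value is multiplied by its frequency, and the total is divided by the total frequency.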
Table 2.4
The mode
As the mode is the most frequently occurring score it can be determined directly from a
frequency distribution or a histogram. If we consider the distribution given in Example
2.5, the most frequently occurring score is six; it has the highest frequency of seven.
Example 2.6
Reconsider the Example 2.5 data set.
Figure 2.7 illustrates the Excel solution to calculate the cumulative frequency table.
Figure 2.7
Cumulative frequency distribution The cumulative frequency for a value x is the total number of scores that are less than or equal to x.
➜ Excel solution
X Cells B4:B13 Values
f Cells C4:C13 Values
≤Xcf Cells D4:D14 Values
CF Cell E4 Value
Cell E5 Formula: =C4
Cell E6 Formula: =E5+C5
Copy formula down E6:E14
Figure 2.8 illustrates the cumulative frequency curve and its use to find the median.
Figure 2.8 Cumulative frequency plotted against the class boundary, Xcf; the 20.5th number gives Median = 5.
❉ Interpretation From the cumulative frequency curve, the median is five claims per day.
Table 2.5 represents the cumulative frequency distribution for the data presented in
Example 2.5.
The median value of the above distribution lies at the 20.5th item which lies at 5 claims.
We know this because 21 items are below 5.5.
Table 2.5
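The cumulative frequency argument can be sketched in Python as follows. The sketch builds the running total of the Table 2.3 frequencies and picks the first value whose cumulative frequency reaches the median position; the variable names are our own.

```python
from itertools import accumulate

claims = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]   # X, from Table 2.3
freq = [3, 4, 4, 5, 5, 7, 5, 3, 3, 1]      # f, from Table 2.3

# Running total of the frequencies: number of days with X or fewer claims.
cum_freq = list(accumulate(freq))

# Position of the median in the ordered data: (N + 1) / 2.
position = (cum_freq[-1] + 1) / 2

# First X whose cumulative frequency reaches that position.
median_claims = next(x for x, cf in zip(claims, cum_freq) if cf >= position)

print(cum_freq, position, median_claims)
```

The 20.5th item is covered once the cumulative frequency reaches 21, which happens at five claims.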
Example 2.7
Consider the distribution of miles travelled by salesmen. The layout is very similar to Example
2.5, except the mid-point value is shown for ‘X’.
Table 2.6 represents the grouped frequency distribution for Example 2.7.
Table 2.6
Figure 2.9
LCB, lower class boundary; UCB, upper class boundary.
➜ Excel solution
Mileage Cells A4:A9 Values
LCB Cells B4:B9 Values
UCB Cells C4:C9 Values
Mid-point x Cell D4 Formula: =(B4+C4)/2
Copy down formula D4:D9
❉ Interpretation From Excel, the averages for the number of miles travelled are mean = 454,
mode = 448, and median = 452.
The mean
The explanation of the Excel solution is as follows.
Note
Table 2.7 represents the grouped frequency distribution for Example 2.7.
The Mean is calculated using equation (2.3):
\[ \text{Mean } (\bar{X}) = \frac{\sum fX}{\sum f} = \frac{54480}{120} = 454 \text{ miles} \]
Table 2.7
1. The value of X is the class mid-point which is computed using the true limits of each class.
This assumes that the data values within each class vary uniformly between the lowest and
highest data values within the class.
2. Even if a distribution has unequal class widths the same procedure is followed.
3. This form of layout transforms very easily to a spreadsheet.
The mode
As the mode is the most frequently occurring score it can be determined directly from a
frequency distribution or a histogram. If we consider the distribution given in Example
2.7, we can see that the most frequently occurring class is the class 440–459 miles. This
class is known as the modal class. If we look at the histogram associated with this example
the mode is very apparent: it is the class with the highest rectangle. We can estimate the
mode using a formula or graphical method:
\[ \text{Mode} = L + \frac{(f_1 - f_0)\,c}{2f_1 - f_0 - f_2} \tag{2.4} \]
Where: L = lower class boundary of the modal class, f0 = frequency of the class below
the modal class, f1 = frequency of the modal class, f2 = frequency of the class above the
modal class, and c = modal class width.
Note Please note that this formula only works if the modal class and the two adjacent
classes are of equal width. Therefore, using Example 2.7 we have L = 439.5, f0 = 27, f1 = 34,
f2 = 24 and c = 20.
\[ \text{Mode} = 439.5 + \frac{(34 - 27)}{2(34) - 27 - 24} \times 20 = 448 \text{ miles to the nearest mile} \]
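Equation (2.4) translates directly into a short function. The Python sketch below uses the Example 2.7 values; the function name grouped_mode is our own, and the formula is only valid when the modal class and its two neighbours have equal widths.

```python
def grouped_mode(L, f0, f1, f2, c):
    """Estimate the mode of grouped data using equation (2.4).

    L  = lower class boundary of the modal class
    f0 = frequency of the class below the modal class
    f1 = frequency of the modal class
    f2 = frequency of the class above the modal class
    c  = modal class width
    """
    return L + (f1 - f0) * c / (2 * f1 - f0 - f2)


mode_miles = grouped_mode(L=439.5, f0=27, f1=34, f2=24, c=20)
print(round(mode_miles))
```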
Figure 2.10 Histogram of frequency against mileage class (400–419 to 500–519); the highest class frequency marks the modal class 440–459, giving an estimate of mode = 448.
The modal value is estimated from the histogram to be 448 miles travelled.
The median
Just as with the cases where X is known, finding the median from a frequency distribu-
tion where X is the class mid-point involves some further calculations. The cumulative
frequency distribution and cumulative frequency polygon (or ogive) are used. If we con-
sider the distribution given in Example 2.7, the median of the 120 values is given by the
(120 + 1)/2th value (or 60.5th value) and this value lies in the class (440–459) miles.
An estimate of the value of the median within that class can be determined either by
calculation or by using a graphical method:
(i) Formula method
Equation (2.5) can be used to estimate the median:
\[ \text{Median} = L + C\,\frac{(N + 1)/2 - F}{f} \tag{2.5} \]
Where: L = true lower class boundary of the median class, C = median class width,
F = cumulative frequency before the median class, f = frequency within the median class,
and N = total frequency.
From Table 2.8: L = 439.5, C = 459.5 − 439.5 = 20, F = 39, f = 34, and N = 120.
\[ \text{Median} = 439.5 + 20 \times \frac{(120 + 1)/2 - 39}{34} \approx 452 \text{ miles} \]
The median number of miles travelled is 452 miles. This is quite close to the value obtained
for the mean (454 miles) and we would expect this given that the histogram for miles trav-
elled looks quite symmetrical. The concept of symmetry and a measure of how symmetri-
cal a distribution is will be explored when discussing the concept of skewness (see Section
2.2.5).
Figure 2.11 Cumulative frequency plotted against Xcf; the 60th number gives the median value.
Symmetrical A data set is symmetrical when the data values are distributed in the same way above and below the middle value.
Percentiles and Quartiles
Individual percentiles can be estimated using the following two methods: (i) the formula
method and (ii) the graphical method.
\[ \text{Percentile Value } P = L + C\,\frac{(N + 1)P/100 - F}{f} \tag{2.6} \]
Where: L = true lower class boundary of the percentile class, C = percentile class width,
F = cumulative frequency before the percentile class, f = frequency within the percentile
class, and N = total frequency.
For example, if we want to calculate the tenth percentile, then P = 10.
\[ \text{Position of Tenth Percentile} = \frac{10}{100}(120 + 1) \approx 12\text{th number} \]
Therefore, the tenth percentile is in the (400–419) class. From Table 2.6: L = 399.5,
C = 419.5 − 399.5 = 20, F = 0, f = 12, and N = 120.
\[ \text{Tenth Percentile} = 399.5 + 20 \times \frac{(120 + 1)\frac{10}{100} - 0}{12} \approx 420 \]
The estimated value can, like
the median (the median is the fiftieth percentile), be read off the cumulative frequency
curve (or ogive).
❉ Interpretation Ten per cent of all the data in this data set have the value equal to or
below 420 miles.
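Because the median is simply the fiftieth percentile, one function can cover both equation (2.5) and equation (2.6). The Python sketch below takes the class values as arguments exactly as they are read from the table; the function name grouped_percentile is our own, and the class containing the percentile is assumed to have been identified beforehand.

```python
def grouped_percentile(L, C, F, f, N, P):
    """Estimate the P-th percentile of grouped data using equation (2.6).

    L = true lower class boundary of the percentile class
    C = percentile class width
    F = cumulative frequency before the percentile class
    f = frequency within the percentile class
    N = total frequency
    """
    position = (N + 1) * P / 100
    return L + C * (position - F) / f


# Tenth percentile, class 400-419 (values as quoted in the text).
tenth = grouped_percentile(L=399.5, C=20, F=0, f=12, N=120, P=10)

# Median (fiftieth percentile), class 440-459.
median_miles = grouped_percentile(L=439.5, C=20, F=39, f=34, N=120, P=50)

print(round(tenth), round(median_miles))
```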
Two important percentiles are the twenty-fifth and seventy-fifth percentiles. These are
known as the lower quartile (LQ) and the upper quartile (UQ) respectively.
\[ \text{Position of First Quartile} = \frac{25}{100}(120 + 1) \approx 30\text{th number} \]
\[ \text{Position of Third Quartile} = \frac{75}{100}(120 + 1) \approx 90\text{th number} \]
Figure 2.12 illustrates the first and third quartile positions on the cumulative frequency
curve. We observe that Q1 and Q3 are approximately 432 and 462 respectively.
Figure 2.12 Miles travelled by salesmen: cumulative frequency, cf, plotted against Xcf; the 30th and 90th numbers give the Q1 and Q3 values.
❉ Interpretation Twenty-five per cent of all the values in the data set are equal to or
below 430 miles, while 75% are equal to or below 470 miles.
\[ \bar{X} = \frac{w_1 X_1 + w_2 X_2 + w_3 X_3}{w_1 + w_2 + w_3} = \frac{\sum wX}{\sum w} \tag{2.7} \]
Where w is the level of importance placed on each assessment element and X is the
actual mark associated with this weight. This can be laid out in a table format to aid the
calculation process.
Example 2.8
Suppose that Karen’s statistics module is assessed via a series of assessments (multiple choice
questions—mcq, in-course assignment—ica, end assignment—ea) with a weighting of 20%, 30%,
and 50% respectively. The actual marks awarded were 74, 66, and 88. Calculate the weighted
average. Figure 2.13 illustrates the Excel solution.
Figure 2.13
➜ Excel solution
mcq, w Cell C6 Value
mcq, X Cell D6 Value
ica, w Cell C7 Value
ica, X Cell D7 Value
ea, w Cell C8 Value
ea, X Cell D8 Value
wX Cell E6 Formula: =C6*D6
Copy formula down E6:E8
Total Cell E9 Formula: =SUM(E6:E8)
weighted average = Cell E11 Formula: =SUMPRODUCT (C6:C8, D6:D8)
❉ Interpretation Karen’s weighted average and therefore module grade would be 78.6%.
Note
1. If all the weights are equal, then the weighted mean is the same as the arithmetic mean.
2. As emphasized before, Excel does not contain a built-in function to calculate a weighted
average. Again, the SUMPRODUCT () function is used.
3. If the weights are given in percentages then the formula would be modified to SUMPRODUCT
(C6:C8, D6:D8)/SUM (C6:C8).
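The weighted average of equation (2.7) can be sketched in Python as below. The function name weighted_mean is our own. Dividing by the sum of the weights means the same code works whether the weights are entered as proportions or as percentages, which is the point made in note 3.

```python
def weighted_mean(weights, values):
    """Weighted mean, equation (2.7): sum of w*X divided by sum of w."""
    return sum(w * x for w, x in zip(weights, values)) / sum(weights)


weights = [20, 30, 50]   # mcq, ica, ea weightings in per cent
marks = [74, 66, 88]     # Karen's marks for each assessment

karen_average = weighted_mean(weights, marks)

# With equal weights the weighted mean reduces to the arithmetic mean (note 1).
equal_weight_average = weighted_mean([1, 1, 1], marks)

print(karen_average, equal_weight_average)
```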
Student exercises
X2.6 Cameos Ltd is employed by a leading market research organization based in Berlin.
The company is discussing with the firm whether to expand the catering facilities
provided to its employees to include a greater range of products. The initial research by
Cameos has identified the following set of weekly spend (€) by individual employees
(Table 2.9).
22 16 26 33 33 37 9 23 32 17
20 13 12 18 19 10 21 22 25 22
22 22 34 24 23 21 38 31 41 20
Table 2.9
(a) Plot the histogram and visually comment on the shape of the weekly expenditure.
Hint: use class width of 5.
(b) Calculate the values of the mean and median.
(c) Use descriptive statistics in conjunction with the histogram to comment on weekly
expenditure.
X2.7 Form a frequency distribution of the following data given in Table 2.10 with intervals
centred at 10, 15, 20, 25, 30, 35, and 40, and estimate the mean value.
9 26 33 24 41 24 37 39 30 28
24 42 17 26 18 33 40 28 31 20
32 21 39 25 16 17 26 11 30 28
34 24 19 23 27 18 32 21 40
Table 2.10
X2.8 The frequency distribution of the length of a sample of 98 nails is presented in Table
2.11 (measured to the nearest 0.1 mm).
(a) Find the mean length of this sample by hand and by using a spreadsheet.
(b) Construct the cumulative frequency graph and use this to estimate the median.
(c) Check the value of the median using the formula method.
Length Frequency
4.0–4.2 4
4.3–4.5 9
4.6–4.8 13
4.9–5.1 20
5.2–5.4 34
5.5–5.7 18
Table 2.11
Marks Frequency, f
0–10 6
11–20 15
21–30 31
31–40 80
41–50 93
51–60 69
61–70 54
71 –80 33
81–90 12
91 –100 7
Table 2.12
Dispersion The variation between data values is called dispersion.
Figure 2.14 (frequency, f, plotted against X, with XA and XB marked)
Example 2.9
Reconsider Example 2.1 and calculate measures of dispersion (or spread) of the statistics
examination marks presented in Table 2.13.
24 27 36 48 52 52 53 53 59 60 85 90 95
Table 2.13
Figure 2.15
➜ Excel solution
Statistics marks Cell B4:B16 Values
X^2 Cell C4 Formula: =B4^2
Copy formula down C4:C16
n = Cell F4 Formula: =COUNT(B4:B16)
Σx = Cell F5 Formula: =SUM(B4:B16)
ΣX^2 = Cell F6 Formula: =SUM(C4:C16)
mean = Cell F7 Formula: =F5/F4
variance = Cell F8 Formula: =F6/F4−F7^2
standard deviation = Cell F9 Formula: =F8^0.5
Range = Cell F13 Formula: =MAX(B4:B16)−MIN(B4:B16)
Q1 = Cell F14 Formula: =QUARTILE.INC(B4:B16, 1)
Median = Cell F15 Formula: =MEDIAN(B4:B16)
Q3 = Cell F16 Formula: =QUARTILE.INC(B4:B16, 3)
QR = Cell F17 Formula: =F16−F14
SIQR = Cell F18 Formula: =(F16−F14)/2
Range The range of a data set is a measure of the dispersion of the observations.
Interquartile range The interquartile range is a measure of the spread of or dispersion within a data set.
Variance Measure of the dispersion of the observations.
Standard deviation Measure of the dispersion of the observations (a square root value of the variance).
Coefficient of variation The coefficient of variation measures the spread of a set of data as a proportion of its mean.
Kurtosis Kurtosis is a measure of the ‘peakedness’ of the distribution.
RANGE (ungrouped data) = Highest Extreme Value − Lowest Extreme Value (2.8)
RANGE (grouped data) = UCB Highest Class − LCB Lowest Class (2.9)
Where UCB represents the upper class boundary and LCB represents the lower class
boundary. Thus, for the statistics examination example (Example 2.1): ungrouped data,
lowest value = 24 and highest value = 95, and range = 95 – 24 = 71 marks.
❉ Interpretation From Excel, the range for the statistics examination marks implies that
the achieved marks are scattered over 71 marks between the highest and the lowest mark.
If we have data that is in the form of a grouped frequency distribution then we would
use the upper and lower class boundaries of the largest and smallest class values to calcu-
late the range. Thus, for the miles travelled by salesmen example (Example 2.7): grouped
data, LCB = 399.5, UCB = 519.5. Thus, range = 519.5 − 399.5 = 120 miles.
\[ SIQR = \frac{Q_3 - Q_1}{2} \tag{2.11} \]
The SIQR is another measure of spread and is computed as one half of the interquartile
range which contains half of the data values. For Example 2.9, the first quartile value is
48% and the third quartile value is 60%. The manual method provides a first quartile value
of 42% and third quartile value of 73%.
The interquartile range and SIQR are more stable than the range because they focus on the
middle half of the data values and, therefore, can’t be influenced by extreme values. The
SIQR is used in conjunction with the median to describe a highly skewed distribution or
an ordinal data set. The interquartile range (and SIQR) are more influenced by sampling
fluctuations in normal distributions than is the standard deviation, and therefore are not
often used for data that are approximately normally distributed. Furthermore, they do not
use all of the actual data values, so we will now look at a method that provides a measure of
spread but uses all the data values within the calculation.
\[ \text{Variance, } VAR(X) = \frac{\sum (X - \bar{X})^2}{\sum f} \tag{2.12} \]
To provide us with an average difference we take the square root of the variance to give
the standard deviation (SD(X)), as defined by equation (2.13).
\[ \text{Standard Deviation, } SD(X) = \sqrt{\frac{\sum (X - \bar{X})^2}{\sum f}} \tag{2.13} \]
\[ \text{Standard Deviation, } SD(X) = \sqrt{\frac{\sum X^2}{\sum f} - (\bar{X})^2} \tag{2.15} \]
Variance describes how much the data values are scattered around the mean value or,
to put it differently, how tightly are the data values grouped around the mean. In a way,
the smaller the variance, the more representative the mean value is. Unfortunately, the
variance does not have the same dimension as the data set or the mean. In other words,
if the values are percentages, inches, degrees C, or any other measure, the variance is not
expressed in the same values because it is expressed in squared units.
As such, it is very useful as a comparison measure between the two data sets, as we will
discover later. To bring the variance into the same units of measure as the data set, the
standard deviation needs to be calculated. Although the standard deviation is less sus-
ceptible to extreme values than the range, standard deviation is still more sensitive than
the SIQR. If the possibility of outliers presents itself, then the standard deviation should
be supplemented by the SIQR.
❉ Interpretation From Excel, the variance equals 455.3 (marks²) with a standard
deviation equal to 21.3 marks.
Note Use the Excel Formulas > Insert Function method if you are not sure of the name of the
function.
The mean and variance can be calculated for Example 2.9 data set using equations (2.1) and
(2.14) respectively. From Table 2.14 we can show that ΣX = 734 and ΣX² = 47362.
X X²
24 576
27 729
36 1296
48 2304
52 2704
52 2704
53 2809
53 2809
59 3481
60 3600
85 7225
90 8100
95 9025
ΣX = 734 ΣX² = 47,362
Table 2.14
\[ \text{Mean } (\bar{X}) = \frac{\sum X}{\sum f} = \frac{24 + 27 + \dots + 90 + 95}{13} = \frac{734}{13} = 56.4615 \]
\[ \text{Variance, } VAR(X) = \frac{\sum X^2}{\sum f} - (\bar{X})^2 = 455.3254438 \]
From the calculations we can now summarize the results: mean = 56.5 marks, standard
deviation = 21.3 marks, median = 53 marks, Q3 = 73 marks, Q1 = 42 marks, SIQR = 15.5 marks.
❉ Interpretation
A large proportion of the marks obtained in the statistical examination, as per Example 2.9, are
clustered within 21.3 marks around the mean mark of 56.4. We will explain later how large this
proportion is. Most of the marks are between 35.1 (56.4−21.3) and 77.7 (56.4 + 21.3). Eight out
of 13 marks are in this interval, which is 61% of all the marks.
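The figures quoted in this interpretation can be reproduced with the short Python sketch below. It follows the same computational route as the Excel solution (ΣX, ΣX², then variance = ΣX²/n − mean²) and then counts the marks lying within one standard deviation of the mean. The variable names are our own.

```python
from math import sqrt

marks = [24, 27, 36, 48, 52, 52, 53, 53, 59, 60, 85, 90, 95]

n = len(marks)
sum_x = sum(marks)                    # ΣX
sum_x2 = sum(x ** 2 for x in marks)   # ΣX²

mean = sum_x / n
variance = sum_x2 / n - mean ** 2     # computational form of the variance
std_dev = sqrt(variance)

# Marks that lie within one standard deviation either side of the mean.
within_one_sd = [x for x in marks if mean - std_dev <= x <= mean + std_dev]

print(round(mean, 4), round(variance, 1), round(std_dev, 1), len(within_one_sd))
```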
Note It is very important to note that Excel contains two different functions (VAR.S (),
VAR.P ()) to calculate the value of the variance. The function that you use is dependent upon
whether the data set represents the complete population or is a sample from the population
being measured.
1. If the data set is the complete population then the population variance (σ²) is given by the
Excel function VAR.P ().
2. If the data set is a sample from the population then the sample variance (S²) is given by the
Excel function VAR.S ().
These issues will be explored in greater detail in Chapter 5 when discussing the issue of sam-
pling from populations and estimating population values from the sample data.
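The same distinction exists in other tools. As an illustration, Python's standard statistics module has pvariance (dividing by n) and variance (dividing by n − 1); we assume these correspond to VAR.P and VAR.S respectively, which is consistent with the definitions given above.

```python
from statistics import pvariance, variance

marks = [24, 27, 36, 48, 52, 52, 53, 53, 59, 60, 85, 90, 95]
n = len(marks)

population_variance = pvariance(marks)   # divides by n
sample_variance = variance(marks)        # divides by n - 1

# The two are linked by the factor n / (n - 1).
rescaled = population_variance * n / (n - 1)

print(round(population_variance, 1), round(sample_variance, 1))
```

The sample variance is always the larger of the two, and the gap narrows as the number of data values grows.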
Example 2.10
Reconsider Example 2.7 grouped frequency data set and calculate the following descriptive
statistics: mean, variance, and standard deviation.
Table 2.15 illustrates the Example 2.7 data set.
Mileage Frequency, f
400–419 12
420–439 27
440–459 34
460–479 24
480–499 15
500–519 8
Table 2.15
Figure 2.16
LCB and UCB are the lower and upper class boundaries.
➜ Excel solution
Mileage Cells A4:A9 Values
LCB Cells B4:B9 Values
UCB Cells C4:C9 Values
Mid-point x Cell D4 Formula: =(B4+C4)/2
Copy down formula D4:D9
Frequency, f Cell E4:E9 Values
fx Cell F4 Formula: =E4*D4
Copy down formula F4:F9
fx^2 Cell G4 Formula: =E4*D4^2
Copy down formula G4:G9
Σ f = Cell B11 Formula: =SUM(E4:E9)
Σ fx = Cell B12 Formula: =SUM(F4:F9)
Σ fx^2 = Cell B13 Formula: =SUM(G4:G9)
mean = Cell B15 Formula: =B12/B11
mean = Cell B16 Formula: =SUMPRODUCT(D4:D9,E4:E9)/SUM(E4:E9)
variance = Cell B17 Formula: =B13/B11−B15^2
standard deviation = Cell B18 Formula: =SQRT(B17)
Population variance The population variance is the variance of all possible values.
❉ Interpretation From Excel, the mean number of miles travelled is 454 miles with a
standard deviation of 27.4 miles.
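The grouped-data calculation can be sketched in Python using the class boundaries and frequencies of Table 2.15. As in the Excel solution, each class is represented by its mid-point, so the results are estimates; the variable names are our own.

```python
from math import sqrt

lower_bounds = [399.5, 419.5, 439.5, 459.5, 479.5, 499.5]   # LCB
upper_bounds = [419.5, 439.5, 459.5, 479.5, 499.5, 519.5]   # UCB
freq = [12, 27, 34, 24, 15, 8]                               # f, Table 2.15

# Class mid-points stand in for the unknown individual values.
midpoints = [(lo + hi) / 2 for lo, hi in zip(lower_bounds, upper_bounds)]

sum_f = sum(freq)                                            # Σf
sum_fx = sum(f * x for f, x in zip(freq, midpoints))         # Σfx
sum_fx2 = sum(f * x ** 2 for f, x in zip(freq, midpoints))   # Σfx²

mean_miles = sum_fx / sum_f
variance_miles = sum_fx2 / sum_f - mean_miles ** 2
sd_miles = sqrt(variance_miles)

print(mean_miles, round(variance_miles, 2), round(sd_miles, 1))
```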
Example 2.11
Reconsider Example 2.10 and calculate the median and SIQR.
Figure 2.17
LCB, lower class boundary; UCB, upper class boundary.
➜ Excel solution
Mileage Cells A4:A9 Values
LCB Cells B4:B9 Values
UCB Cells C4:C9 Values
Mid-point x Cell D4 Formula: =(B4+C4)/2
Copy down formula D4:D9
Frequency, f Cell E4:E9 Values
CF Cell F4 Formula: =E4
Cell F5 Formula: =F4+E5
Copy formula down F5:F9
Median
N = Cell B12 Formula: =SUM(E4:E9)
Position median = Cell B13 Formula: =(B12+1)/2
Median class is 440-459
L = Cell B15 Formula: =B6
C = Cell B16 Formula: =C6−C5
F = Cell B17 Formula: =F5
f = Cell B18 Formula: =E6
Median = Cell B19 Formula: =B15+B16*(B13−B17)/B18
Figure 2.18
➜ Excel solution
Quartile 1
N = Cell B22 Formula: =B12
Position Q1 = Cell B23 Formula: =(25/100)*(B22+1)
Q1 class is 420-439
L = Cell B25 Formula: =B5
C = Cell B26 Formula: =C5−C4
F = Cell B27 Formula: =F4
f = Cell B28 Formula: =E5
Q1 = Cell B29 Formula: =B25+B26*(B23−B27)/B28
Quartile 3
N = Cell B32 Formula: =B12
Position Q3 = Cell B33 Formula: =(75/100)*(B32+1)
Q3 class is 460−479
L = Cell B35 Formula: =B7
C = Cell B36 Formula: =C7−C6
F = Cell B37 Formula: =F6
f = Cell B38 Formula: =E7
Q3 = Cell B39 Formula: =B35+B36*(B33−B37)/B38
QR = Cell B41 Formula: =B39−B29
SIQR = Cell B42 Formula: =B41/2
❉ Interpretation From Excel, the median number of miles travelled is 452 miles. With L taken
as the lower class boundary of the Q3 class (459.5), Q1 ≈ 433.0 and Q3 ≈ 474.3, giving a
SIQR of 20.6 miles.
another. Standard deviations vary according to the size of values in the distribution and
may not even be in the same unit of measurement. For example, the value of the standard
deviation of a set of weights will be different, depending on whether they are measured in
pounds or kilograms. One way of overcoming this problem is to use the coefficient of
variation, V, as defined by equation (2.16). The coefficient of variation will be the same in
both cases as it does not depend on the unit of measurement.
\[ V = \frac{\text{Standard Deviation}}{\text{Mean}} \times 100\% \tag{2.16} \]
For example, if the coefficient of variation is 10% then this means that the standard
deviation is equal to 10% of the average. For some measures, the standard deviation
changes as the average changes. In this case, the coefficient of variation is the best way to
summarize the variation. In other cases the standard deviation does not change with the
average. In this case, the standard deviation is the best way to summarize the variation.
Example 2.12
Consider the following problem that compares the average earnings in the UK and USA:
• mean earnings in the UK are £125 per week with a standard deviation of £10;
• mean earnings in the USA are $1005 per week with a standard deviation of $170.
\[ \text{For UK: } V = \frac{10}{125} \times 100\% = 8\% \]
\[ \text{For USA: } V = \frac{170}{1005} \times 100\% = 16.9\% \]
❉ Interpretation The spread of earnings in the USA is greater than the spread in
earnings in the UK.
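The comparison in Example 2.12 can be sketched in Python. The function name coefficient_of_variation is our own; the figures are those quoted in the example.

```python
def coefficient_of_variation(std_dev, mean):
    """Coefficient of variation, equation (2.16), as a percentage of the mean."""
    return std_dev / mean * 100


cv_uk = coefficient_of_variation(std_dev=10, mean=125)      # earnings in pounds
cv_usa = coefficient_of_variation(std_dev=170, mean=1005)   # earnings in dollars

print(round(cv_uk, 1), round(cv_usa, 1))
```

Because the units cancel in the ratio, the two percentages can be compared directly even though the earnings are quoted in different currencies.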
Figure 2.19 Frequency curves (f) marking the positions of the mean, median, and mode; panel A shows a symmetrical distribution.
mean is ‘dragged’ toward the left (the low values) of the distribution. It is known as a left
or negatively skewed distribution. The skewness of a frequency distribution can be an
important consideration. For example, if your data set is salary, you would prefer a situa-
tion that led to a positively skewed distribution of salary to one that is negatively skewed.
Positive skewness is more common than negative, for example the salaries of lecturers.
One measure of skewness is Pearson’s coefficient of skewness, as defined by equation
(2.17).
\[ PCS = \frac{3(\text{Mean} - \text{Median})}{\text{Standard Deviation}} \tag{2.17} \]
Excel uses an alternative measure of skewness based upon Fisher’s skewness coeffi-
cient as defined by equation (2.18).
\[ \text{Fisher's skewness} = \frac{n}{(n - 1)(n - 2)} \sum \left( \frac{X - \bar{X}}{s} \right)^3 \tag{2.18} \]
Note
1. With skewed data, the mean is not a good measure of central tendency because it is sensi-
tive to extreme values. In this case the median would be used to provide the measure of central
tendency.
2. The value for skewness is zero for symmetric distributions (mean = median).
3. If mean < median, then the measure of skewness is negative and the distribution is said to
be negatively skewed.
4. If mean > median, then the measure of skewness is positive and the distribution is said to
be positively skewed.
5. The measure of skewness is independent of the units being measured.
\[ \text{Skewness critical value} = \pm 2 \times \sqrt{\frac{6}{n}} \tag{2.19} \]
Figure 2.20 Distribution A and Distribution B plotted against X.
We can see from the two distributions that distribution A is more peaked than distribu-
tion B, but the means and standard deviations are approximately the same.
This is a measure of kurtosis and Excel provides Fisher’s measure of kurtosis as defined
by equation (2.20).
Where s represents the sample standard deviation (sample variance = [n/(n − 1)] * population
variance).
The critical value of kurtosis is defined by equation (2.21).
\[ \text{Kurtosis critical value} = \pm 2 \times \sqrt{\frac{24}{n}} \tag{2.21} \]
Example 2.13
Reconsider the marks obtained in the statistics examination (Example 2.1) as presented in Table
2.16. Calculate a measure of skewness and kurtosis.
24 27 36 48 52 52 53 53 59 60 85 90 95
Table 2.16
Figure 2.21
➜ Excel solution
X: Cells B4:B16 Values
n = Cell E5 Formula: =COUNT(B4:B16)
Fisher’s skew = Cell E7 Formula: =SKEW (B4:B16)
Critical skewness = Cell E8 Formula: =2*SQRT(6/E5)
Fisher’s kurtosis = Cell E10 Formula: =KURT(B4:B16)
Critical kurtosis = Cell E11 Formula: =2*SQRT(24/E5)
Note Reporting the median along with the mean in skewed distributions is generally a
good idea.
Skewness:
1. Skewness = zero. A zero skew value indicates symmetry. Normal distributions produce a
skewness statistic of zero.
2. A positive value indicates a positively skewed distribution (that is, with scores
bunched up on the low end of the score scale). In this example we have a positively skewed
distribution.
3. A negative value indicates a negatively skewed distribution (that is, with scores bunched up
on the high end of the scale).
4. A skewness value outside ±1.36 would suggest severe skewness. In this case we conclude that
a skewness value of 0.4410 is not significantly skewed (−1.36 < 0.4410 < +1.36).
Kurtosis:
1. A kurtosis value of zero indicates a mesokurtic distribution. Normal distributions produce a
kurtosis statistic of about zero (‘about’ because small variations can occur by chance alone).
A kurtosis value close to zero therefore indicates a mesokurtic (that is, normally peaked)
distribution.
2. A distribution with positive kurtosis is called leptokurtic. In terms of shape, the distribution
has a more acute 'peak' around the mean.
3. A distribution with negative kurtosis is called platykurtic. In terms of shape, the distribution
has a smaller 'peak' around the mean.
4. Kurtosis > ± 2.72 would suggest severe kurtosis. In this case we conclude a kurtosis value of
−0.4253 is not a significant value of kurtosis (−2.72 < −0.4253 < +2.72).
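Equation (2.18) and the two critical values can be checked with the Python sketch below. The function name fisher_skewness is our own, and s is taken to be the sample standard deviation, as stated in the text. The kurtosis statistic itself is not reproduced here; only its critical value from equation (2.21) is computed.

```python
from math import sqrt
from statistics import mean, stdev


def fisher_skewness(values):
    """Fisher's skewness, equation (2.18), using the sample standard deviation."""
    n = len(values)
    x_bar = mean(values)
    s = stdev(values)   # sample standard deviation (n - 1 divisor)
    return n / ((n - 1) * (n - 2)) * sum(((x - x_bar) / s) ** 3 for x in values)


marks = [24, 27, 36, 48, 52, 52, 53, 53, 59, 60, 85, 90, 95]
n = len(marks)

skewness = fisher_skewness(marks)
skew_critical = 2 * sqrt(6 / n)    # equation (2.19)
kurt_critical = 2 * sqrt(24 / n)   # equation (2.21)

# Skewness is judged significant only if it falls outside the critical band.
significantly_skewed = abs(skewness) > skew_critical

print(round(skewness, 2), round(skew_critical, 2), round(kurt_critical, 2))
```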
Student exercises
X2.10 Over a one-month period the number of vacant beds in a West Yorkshire hospital was
surveyed. The frequency distribution in Table 2.17 resulted.
Beds vacant 0 2 3 5 6 8
Frequency 4 8 12 4 2 1
Table 2.17
Table 2.18
Table 2.19
Compare the two distributions by plotting out their frequency polygons, and
determine the means and standard deviations.
X2.13 Greendelivery.com has recently decided to review the weekly mileage of the delivery
vehicles used to deliver shopping purchased online to customer homes from a central
parcel depot. The sample data collected (Table 2.20) is part of the first stage in analysing
the economic benefit of potentially moving all vehicles to biofuels from diesel.
Table 2.20
(a) Use Excel to construct a frequency distribution and plot the histogram with class
intervals of 10 and classes 75–84, 85–94 . . . 175–184. Comment on the pattern in
mileage travelled by the company vehicles.
(b) Use the raw data to determine the mean, median, standard deviation, and SIQR.
(c) Comment on which measure you would use to describe the average and measure
of dispersion. Explain using your answers to (a) and (b).
(d) Calculate the measure of skewness and kurtosis, and comment on the distribution
shape.
• Q3 − Median = Median − Q1
• Largest value − Q3 = Q1 − smallest value
• Median = Midhinge = Midrange
The midrange is the average of the largest and smallest data values, and the midhinge
is the average of the first and third quartiles. For non-symmetry the following rule would
hold:
Example 2.14
In this particular case we will assume that the values are as follows: first quartile Q1 = 15,
minimum = 8, median = 33, maximum = 88, and third quartile Q3 = 62.
Input your data into Excel as illustrated in Figure 2.22.
Figure 2.22
We can see from the summary statistics that the data distribution is not symmetrical:
• the distance from Q3 to the median (62 − 33 = 29) is not the same as between Q1 and
the median (33 − 15 = 18);
• the distance from Q3 to the largest value (88 − 62 = 26) is not the same as the
distance between Q1 and the smallest value (15 − 8 = 7);
• the median (33), the midhinge ((62 + 15)/2 = 38.5), and the midrange ((88 + 8)/2 = 48)
are not equal.
The summary numbers indicate right skewness because the distance between Q3 and
the largest number (88 − 62 = 26) is longer than the distance between Q1 and the smallest
value (15 − 8 = 7). The minimum and maximum points are identified and enable
identification of any extreme values (or outliers).
Note A simple rule to identify an outlier (or suspected outlier) is that the largest value −
smallest value (88 − 8 = 80) should be no longer than three times the length of the box
(Q3 − Q1 = 62 − 15 = 47).
❉ Interpretation In this case the value of maximum − minimum is 80 and Q3 − Q1 is
47. As 80 is less than 3 × 47 = 141, no extreme values are present in the data set.
Right-skewed: right-skewed (or positive skew) indicates that the tail on the right side is
longer than the left side and the bulk of the values lie to the left of the mean.
Left-skewed: left-skewed (or negative skew) indicates that the tail on the left side of the
probability density function is longer than the right side and the bulk of the values
(possibly including the median) lie to the right of the mean.
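The symmetry checks and the outlier rule for Example 2.14 can be sketched as a short Python script. The variable names are illustrative only; the five values are those given in the example.

```python
# Five-number summary from Example 2.14.
minimum, q1, median, q3, maximum = 8, 15, 33, 62, 88

upper_box = q3 - median              # 29
lower_box = median - q1              # 18
upper_whisker = maximum - q3         # 26
lower_whisker = q1 - minimum         # 7
midhinge = (q1 + q3) / 2             # 38.5
midrange = (minimum + maximum) / 2   # 48.0

# A symmetrical distribution would satisfy all three conditions.
symmetric = (
    upper_box == lower_box
    and upper_whisker == lower_whisker
    and median == midhinge == midrange
)

# A longer upper whisker points to right (positive) skewness.
right_skewed = upper_whisker > lower_whisker

# Simple outlier rule: the range should not exceed three box lengths.
data_range = maximum - minimum       # 80
box_length = q3 - q1                 # 47
suspected_outlier = data_range > 3 * box_length

print(symmetric, right_skewed, suspected_outlier)
```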
Example 2.15
In this particular case we will assume that the values are the same as in Example 2.14: first
quartile Q1 = 15, minimum = 8, median = 33, maximum = 88, and third quartile Q3 = 62. The box
plot is then constructed, as illustrated in Figure 2.23.
Figure 2.23 (box plot; vertical axis: Value)
• If the median within the box is not equidistant from the whiskers (or hinges), then the
data is skewed. The box plot indicates right skewness because the distance between
the median and the highest value is greater than the distance between the median
and the lowest value. Furthermore, the top whisker (distance between Q3 and
maximum) is longer than the lower whisker (distance between Q1 and minimum).
• The minimum and maximum points (the ends of the whiskers) are identified and enable
identification of any extreme values (or outliers). A simple rule to identify an outlier
(or suspected outlier) is that the overall range (maximum value − minimum value) should
be no longer than three times the length of the box (Q3 − Q1). In this case the value of
maximum − minimum is 80 and Q3 − Q1 is 47, and no extreme values are present in
the data set.
Excel spreadsheet solution for Example 2.15
Unfortunately, Microsoft Excel 2010 does not have a built-in box plot chart type. You can create
your own charts using stacked bar or column charts and error bars in combination with
line or XY scatter chart series to show additional data. For your data set calculate: first
quartile, minimum, median, maximum, and third quartile.
Box-and-whisker plot: a box-and-whisker plot is a way of summarizing a set of data measured
on an interval scale.
Figure 2.24
2 Plot chart
Select Insert > choose Line > choose the ‘Line with Markers’ as illustrated in
Figure 2.25.
Figure 2.25 (line chart with markers; categories: Q1, Minimum, Median, Maximum, Q3; vertical axis: Value, 0 to 100)
We note that this does not look like a box plot so we will now modify the line chart
in Figure 2.25 so that it looks like a box plot.
Figure 2.26
Click OK
Figure 2.27 (series: Q1, Minimum, Median, Maximum, Q3; vertical axis: Value, 0 to 100)
Figure 2.27 illustrates the transformation of the Figure 2.25 line chart.
Figure 2.27 looks more like a box plot and we can modify the chart to improve its
appearance, for example, by removing the lines through the data points.
Figure 2.28
Figure 2.29 illustrates the transformation of the Figure 2.27 line chart to remove the
lines through the data points.
Figure 2.29 (series: Q1, Minimum, Median, Maximum, Q3; vertical axis: Value, 0 to 100)
Figure 2.30 (series: Q1, Minimum, Median, Maximum, Q3; vertical axis: Value, 0 to 100)
6 Add box
Select Layout > Analysis menu choose Up/Down Bars > Select Up/Down Bars
button.
Figure 2.31 illustrates the new chart.
Figure 2.31 (series: Q1, Minimum, Median, Maximum, Q3; vertical axis: Value, 0 to 100)
Figure 2.32
Example 2.16
If we consider Example 2.1 data then the Descriptive Statistics procedure in the Excel Analysis
ToolPak add-in would give the required descriptive statistics.
Figure 2.33
Click OK.
Input Range: B3:B16.
Grouped By: Columns.
Select Labels in first row.
Output Range: D3.
Select Summary statistics.
See Figure 2.34.
Click OK.
Figure 2.34
The Excel results would then be calculated and written to the Excel worksheet (see
Figure 2.35).
Figure 2.35
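For readers who want to cross-check the main Descriptive Statistics figures outside Excel, the following is a minimal Python sketch. The data values are hypothetical (the Example 2.1 data is not reproduced here), and the skewness and kurtosis functions assume the sample formulas documented for Excel's SKEW and KURT functions.

```python
import statistics

def sample_skewness(data):
    """Sample skewness, assumed to match the formula behind Excel's SKEW."""
    n = len(data)
    mean = statistics.mean(data)
    s = statistics.stdev(data)  # sample standard deviation (n - 1 divisor)
    return n / ((n - 1) * (n - 2)) * sum(((x - mean) / s) ** 3 for x in data)

def sample_kurtosis(data):
    """Sample excess kurtosis, assumed to match the formula behind Excel's KURT."""
    n = len(data)
    mean = statistics.mean(data)
    s = statistics.stdev(data)
    fourth = sum(((x - mean) / s) ** 4 for x in data)
    scale = n * (n + 1) / ((n - 1) * (n - 2) * (n - 3))
    correction = 3 * (n - 1) ** 2 / ((n - 2) * (n - 3))
    return scale * fourth - correction

values = [1, 2, 3, 4, 5]  # hypothetical data, for illustration only

summary = {
    "mean": statistics.mean(values),
    "median": statistics.median(values),
    "standard deviation": statistics.stdev(values),
    "sample variance": statistics.variance(values),
    "skewness": sample_skewness(values),
    "kurtosis": sample_kurtosis(values),
    "range": max(values) - min(values),
    "count": len(values),
}

for name, value in summary.items():
    print(f"{name}: {value}")
```

The kurtosis function needs at least four observations and the skewness function at least three, because of the divisors in the formulas.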
Student exercises
X2.14 The manager at BIG JIMS restaurant is concerned about the time it takes to process
credit card payments at the counter by counter staff. The manager has collected the
following processing time data (time in minutes/seconds) (Table 2.21) and requested
that summary statistics are calculated.
(a) Calculate a five-number summary for this data set.
(b) Do we have any evidence for a symmetric distribution?
(c) Use the Excel Analysis-ToolPak to calculate descriptive statistics.
(d) Which measures would you use to provide a measure of average and spread?
Table 2.21
X2.15 The local regional development agency is conducting a major review of the economic
development of a local community. One economic measure to be collected is the
local house prices that reflect on the economic well-being of this community. The
development agency have collected the following house price data (£) as presented in
Table 2.22.
(a) Calculate a five-number summary
(b) Do we have any evidence for a symmetric distribution?
(c) Use the Excel Analysis-ToolPak to calculate descriptive statistics.
(d) Which measures would you use to provide a measure of average and spread?
Table 2.22
■ Techniques in practice
TP1 Coco S.A. supplies a range of computer hardware and software to 2000 schools within
a large municipal region of Spain. When Coco S.A. won the contract the issue of customer
service was considered to be central to the company being successful at the final bidding stage.
The company has now requested that its customer service director creates a series of graphical
representations of the data to illustrate customer satisfaction with the service. The following
data has been collected over the last six months and measures the time to respond to the
received complaint (days), as presented in Table 2.23.
The customer service director has analysed this data to create a grouped frequency table
and has plotted the histogram. From this he made a series of observations regarding the time
to respond to customer complaints. He now wishes to extend the analysis to use numerical
methods to describe this data.
(a) From the data set calculate the mean and median.
(b) Repeat the analysis to calculate the standard deviation, quartiles (Q1, Q2, and Q3),
quartile range, and SIQR.
(c) Describe the shape of the distribution. Do the results suggest that there is a great deal of
variation in the time taken to respond to customer complaints?
5 24 34 6 61 56 38 32
87 78 34 9 67 4 54 23
56 32 86 12 81 32 52 53
34 45 21 31 42 12 53 21
43 76 62 12 73 3 67 12
78 89 26 10 74 78 23 32
26 21 56 78 91 85 15 12
15 56 45 21 45 26 21 34
28 12 67 23 24 43 25 65
23 8 87 21 78 54 76 79
Table 2.23
(d) Which measures would you recommend the customer service manager uses to describe
the variation in time to respond to customer complaints?
(e) What conclusions can you draw from these results?
TP2 Bakers Ltd run a chain of bakery shops and is famous for the quality of its pies. The
management of the company is concerned at the number of complaints from customers who
say it takes too long to serve customers at a particular branch. The motto of the company is
‘Have your pie in two minutes’. The manager of the branch concerned has been told to provide
data on the time it takes for customers to enter the shop and be served by the shop staff, and
has presented the data in Table 2.24.
0.70 0.70 0.70 0.70 0.70 0.70 0.70 0.70 0.70 0.70
0.99 0.99 0.99 0.99 0.99 0.99 0.99 0.99 0.99 0.99
0.12 0.12 0.12 0.12 0.12 0.12 0.12 0.12 0.12 0.12
0.70 0.70 0.70 0.70 0.70 0.70 0.70 0.70 0.70 0.70
1.88 1.88 1.88 1.88 1.88 1.88 1.88 1.88 1.88 1.88
0.55 0.55 0.55 0.55 0.55 0.55 0.55 0.55 0.55 0.55
1.38 1.38 1.38 1.38 1.38 1.38 1.38 1.38 1.38 1.38
0.80 0.80 0.80 0.80 0.80 0.80 0.80 0.80 0.80 0.80
1.25 1.25 1.25 1.25 1.25 1.25 1.25 1.25 1.25 1.25
1.48 1.48 1.48 1.48 1.48 1.48 1.48 1.48 1.48 1.48
Table 2.24
(a) From the data set calculate the mean and median.
(b) Repeat the analysis to calculate the standard deviation, quartiles (Q1, Q2, and Q3),
quartile range, and SIQR.
(c) Describe the shape of the distribution. Do the results suggest that there is a great deal of
variation in the time taken to serve customers?
(d) Which measures would you recommend the shop manager uses to describe the
variation in the time taken to serve customers?
(e) What conclusions can you draw from these results?
TP3 Skodel Ltd is a small brewery that is undergoing a major expansion after a takeover
by a large European brewery chain. Skodel Ltd produces a range of beers and lagers, and is
renowned for the quality of its beers, winning a number of prizes at trade fairs throughout
the European Union. The new parent company are reviewing the quality control mechanisms
being operated by Skodel Ltd and are concerned at the quantity of lager in its premium lager
brand, which should contain a mean of 330 ml and a standard deviation of 15 ml. The bottling
plant manager provided the parent company with quantity measurements from 100 bottles for
analysis (Table 2.25).
326 326 326 326 326 326 326 326 326 326
344 344 344 344 344 344 344 344 344 344
333 333 333 333 333 333 333 333 333 333
346 346 346 346 346 346 346 346 346 346
339 339 339 339 339 339 339 339 339 339
353 353 353 353 353 353 353 353 353 353
310 310 310 310 310 310 310 310 310 310
351 351 351 351 351 351 351 351 351 351
350 350 350 350 350 350 350 350 350 350
348 348 348 348 348 348 348 348 348 348
Table 2.25
(a) From the data set calculate the mean and median.
(b) Repeat the analysis to calculate the standard deviation, quartiles (Q1, Q2, and Q3),
quartile range, and SIQR.
(c) Describe the shape of the distribution. Do the results suggest that there is a great deal
of variation in quantity within the bottle measurements? Compare the assumed bottle
average and spread with the measured average and spread.
(d) What conclusions can you draw from these results?
■ Summary
This chapter extends your knowledge from using tables and charts to summarizing data using
measures of average and dispersion. The mean is the most commonly calculated average to
represent the measure of central tendency, but this measurement uses all the data within the
calculation and therefore outliers will affect the value of the mean. This can imply that the value
of the mean may not be representative of the underlying data set. If outliers are present in the
data set then you can either eliminate these values or use the median to represent the average.
The average provides a measure of the central tendency (or middle value) and the next calcula-
tion to perform is to provide a measure of the spread of the data within the distribution. The
standard deviation is the most common type of measure of dispersion (or spread), but, like the
mean, the standard deviation is influenced by the presence of outliers within the data set. If
outliers are present in the data set then you can either eliminate these values or use the SIQR to
represent the degree of dispersion. You can estimate the degree of skewness in the data set by
calculating Pearson’s coefficient of skewness (or use Fisher’s skewness equation) and the degree
of ‘peakedness’ by calculating Fisher’s kurtosis coefficient statistic. Box plots are graph plots that
allow you to visualize the degree of symmetry or skewness in the data set.
The chapter explored the calculation process for raw data and frequency distributions, and
it is very important to note that the graphical method will not be as accurate as the raw data
method when calculating the summary statistics. Table 2.26 provides a summary of which
statistical measures to use for different types of data.
■ Key terms
Arithmetic mean Left-skewed Range
Box plot Mean Right-skewed
Box-and-whisker plot Median Shape
Central tendency Mode Skewness
Coefficient of variation Outlier Standard deviation
Dispersion Population mean Symmetrical
Extreme value Population variance Variance
Five-number summary Q1: first quartile Variation
Interquartile range Q3: third quartile
Kurtosis Quartiles
■ Further reading
Textbook resources
1. Whigham, D. (2007) Business Data Analysis using Excel. Oxford: Oxford University Press.
2. Lindsey, J. K. (2003) Introduction to Applied Statistics: A Modelling Approach (2nd edn).
Oxford: Oxford University Press.
Web resources
1. StatSoft Electronic Textbook http://www.statsoft.com/textbook/stathome.html (accessed
25 May 2012).
2. HyperStat Online Statistics Textbook http://davidmlane.com/hyperstat/index.html
(accessed 25 May 2012).
3. Eurostat: website is updated daily and provides direct access to the latest and most
complete statistical information available on the European Union (EU), the EU Member States, the
Euro-zone, and other countries http://epp.eurostat.ec.europa.eu (accessed 25 May 2012).
4. Economagic: contains international economic data sets http://www.economagic.com
(accessed 25 May 2012).
5. The International Statistical Institute (ISI) glossary of statistical terms provides definitions
in a number of different languages http://isi.cbs.nl/glossary/index.htm (accessed 25 May 2012).
Introduction to probability 3
» Overview «
The concept of probability is an important aspect of the study of statistics and, within this
chapter, we shall introduce the reader to some of the concepts that are relevant to probability.
» Learning objectives «
On completing this chapter you will be able to:
» understand the concept of the following terms: experiment, outcome, sample space,
relative frequency, and sample probability;
$$P(A) = \frac{m}{n} \qquad (3.1)$$
Example 3.1
Consider the result of running the die experiment where the die has been thrown 1000 times
and the number of times each possible outcome (1, 2, 3, 4, 5, and 6) recorded. The result of the
die experiment is illustrated in Table 3.1.
Score 1 2 3 4 5 6
Frequency 173 168 167 161 172 159
Relative frequency 0.173 0.168 0.167 0.161 0.172 0.159
Table 3.1
This notion of relative frequency provides an approach to determine the probability of
an event. As the number of experiments increases, the relative frequency stabilizes
and approaches the probability of the event. Thus, if we had performed the above
experiment 2000 times we might expect 'in the long run' the relative frequencies of all the
scores to approach 0.167. This implies that P(1) = P(2) = P(3) = P(4) = P(5) = P(6) = 0.167.
Actually, for this experiment, the theoretical values for each event would be P(1) = P(2) =
P(3) = P(4) = P(5) = P(6) = 1/6.
Relative frequency: relative frequency is another term for proportion; it is the value
calculated by dividing the number of times an event occurs by the total number of times an
experiment is carried out.
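The relative frequencies in Table 3.1 can be reproduced with a few lines of Python. This is only a sketch: the dictionary and variable names are illustrative, and the frequencies are those recorded in the table.

```python
# Observed frequencies from Table 3.1 (1000 throws of a die).
frequencies = {1: 173, 2: 168, 3: 167, 4: 161, 5: 172, 6: 159}

total_throws = sum(frequencies.values())

# Relative frequency = number of times the score occurred / total number of throws.
relative = {score: count / total_throws for score, count in frequencies.items()}

# Theoretical probability of each score for a fair die.
theoretical = 1 / 6

# Largest difference between an observed relative frequency and 1/6.
largest_gap = max(abs(p - theoretical) for p in relative.values())

for score, p in relative.items():
    print(score, round(p, 3))
print("largest gap from 1/6:", round(largest_gap, 4))
```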
There are many situations where probabilities are derived through this relative
frequency approach (also called empirical approach or experimental probability
approach). If a manufacturer indicates that it is 99% certain (P = 99%) that an electric
light bulb will last 200 hours, this figure will have been arrived at from experiments which
have tested numerous samples of light bulbs. If we are told that the probability of rain on a
June day is 0.42, this will have been determined through studying rainfall records for June
over, say, the past 20 years.
A number of important issues are assumed when approaching probability problems:
• the probability of each event within the probability experiment lies between zero and
one;
• the sum of probabilities of all events in this experiment equals one;
• if we know the probability of an event occurring in the experiment, then the
probability of it not occurring is P(event not occurring) = 1 – P(event occurring).
Example 3.2
Suppose that a particular production process has been in operation for 200 days with a
recorded accident on 150 days. Let A = the event that an accident occurs in future, then the
probability of an accident occurring in future, P (A) = 150/200 = 0.75. This provides an estimate
or probability of 75% that an accident will occur in the future on each separate day.
Example 3.3
Over the last 3 years a random sample of 1000 students was selected and classified according
to degree classification and gender. The results were as recorded in Table 3.2.
Table 3.2
Calculate: (a) the probability that a student achieves a 2i and is female; (b) the probability that
a student achieves a 2i and is male; (c) the probability that a student achieves a 2i and is female
or male; and (d) the probability that a 2i classification is not achieved.
(a) Probability that a student achieves a 2i and is female:
$$P(\text{2i and female}) = \frac{\text{Number of female students with a 2i}}{\text{Total sample size}} = \frac{150}{1000} = 0.15$$
Probability that a student achieves a 2i and is female is 0.15 or 15%.
(b) Probability that a student achieves a 2i and is male:
Empirical approach: empirical probability, also known as relative frequency, or
experimental probability, is the ratio of the number of outcomes in which a specified event
occurs to the total number of trials.
Experimental probability approach: see Empirical approach.
Student exercises
X3.5 How would you give an estimate of the probability of a 25-year-old passing a driving
test at a first attempt?
X3.6 In an experiment we toss two unbiased coins 100 times and note the frequency of the
two possible outcomes (heads, tails). We are interested in calculating the probability
(or chance) that at least 1 head will occur from the 100 tosses of the 2 coins. Calculate:
(a) the theoretical probability that at least one head occurs, and (b) the value of this
probability from your experiment. What would you expect to occur between the
theoretical and experimental probability values if the overall number of attempts
increases?
X3.7 Table 3.3 provides information about 200 school leavers and their destination after
leaving school.
Table 3.3
(c) Either went into full-time education or went into a full-time job
(d) Left school at 16 years of age
(e) Left school at 16 years of age and went into full-time education.
X3.8 Consider Table 3.3 in X3.7.
(a) Are the events E and J mutually exclusive?
(b) Determine P(E and J).
(c) Using the values of P(E), P(J), and P(E and J) you have already determined in X3.7,
evaluate P(E) + P(J). What do you notice when you compare your answer with
P(E or J)?
A number of examples will be used to illustrate this notion via the construction of the
sample space.
Example 3.4
If an experiment consists of rolling a die then the possible outcomes are 1, 2, 3, 4, 5, and 6. The
theoretical probability of obtaining a 3 can then be calculated using equation (3.2):
$$P(\text{obtaining a 3}) = \frac{\text{Number of outcomes producing a 3}}{\text{Total number of outcomes}} = \frac{1}{6} \approx 0.1667$$
Example 3.5
If an experiment consists of tossing two unbiased coins then the possible outcomes are: HH,
HT, TH, and TT. We could illustrate the sample space with individual sample points (*), as
illustrated in Table 3.4.
                    First coin
                    H     T
Second coin    H    *     *
               T    *     *
Table 3.4
From this sample space we can calculate individual probabilities. For example, the
theoretical probability of achieving at least one head would be calculated as follows:
$$P(\text{at least one head}) = \frac{\text{Number of outcomes with at least one head}}{\text{Total number of outcomes}} = \frac{3}{4} = 0.75$$
Therefore, the theoretical probability of achieving at least 1 head would be 0.75 or 75%.
Example 3.6
An experiment consists of throwing two dice and noting their two scores. The sample space
could be shown as illustrated in Table 3.5.
Table 3.5
From this sample space calculate the following theoretical probabilities to three decimal
places: (a) P(X = Y); (b) P(X + Y = 5); (c) P(X * Y = 36); (d) P(X < 3 and Y > 2).
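One way to work through Example 3.6 is to list the 36 equally likely outcomes and count those that satisfy each condition. The Python sketch below does this by enumeration; the helper name `probability` is illustrative, and X and Y are taken to be the scores on the first and second die.

```python
from itertools import product

# Sample space: all 36 ordered pairs (x, y) of scores on two dice.
sample_space = list(product(range(1, 7), repeat=2))

def probability(event):
    """Theoretical probability: favourable outcomes / total outcomes."""
    favourable = sum(1 for x, y in sample_space if event(x, y))
    return favourable / len(sample_space)

p_equal = probability(lambda x, y: x == y)                # (a) P(X = Y)
p_sum_5 = probability(lambda x, y: x + y == 5)            # (b) P(X + Y = 5)
p_product_36 = probability(lambda x, y: x * y == 36)      # (c) P(X * Y = 36)
p_region = probability(lambda x, y: x < 3 and y > 2)      # (d) P(X < 3 and Y > 2)

for label, p in [("a", p_equal), ("b", p_sum_5), ("c", p_product_36), ("d", p_region)]:
    print(label, round(p, 3))
```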
Student exercises
X3.9 An unbiased coin and a fair die are tossed together. What is the probability of
obtaining a head and a 6?
X3.10 Calculate the following probabilities if two unbiased dice are tossed: (a) the probability
of a 3 on the first die and 5 on the second die = P(3, 5); (b) the probability of a 3 on the
first die and 5 on the second die = P(3 and 5); and (c) the probability of a 3 on the first
die or 5 on the second die = P(3 or 5).
X3.11 The following coins are placed in a bag: 1p, 2p, 5p, and 10p. A coin is taken at random
and then replaced. A second coin is taken at random and then replaced. Calculate the
following probabilities: (a) P(1p chosen first and 2p chosen second); (b) P(sum is 3p);
(c) P(at least one 10p).
X3.12 Ten discs with a different number (0, 1, 2, . . ., 9) printed on them are placed in a bag.
Two discs are taken out of the bag one at a time at random to form a two-digit number
(where 08 is counted as the number 8). Assuming the first disc is replaced before the
second is chosen, find the probability that: (a) the number is even; (b) the
number is less than 30; (c) the number is 67; and (d) the two digits forming the number
are equal. What would happen to your answers to (a)–(d) if the first disc is not replaced
before the second is chosen?
X3.13 Five cards are labelled A1, B2, C3, D3, and E3 respectively. A card is selected at random
and then a second is selected before the first is replaced. (a) Show by listing
the sample space that there are just 20 possible outcomes. (b) Find the following
probabilities: (i) the first card chosen is A1; (ii) the second card chosen is A1; (iii) the
card A1 is chosen; (iv) the letters on the cards are adjacent in the alphabet; (v) the sum
of the numbers on the cards is odd; and (vi) the sum of the numbers on the cards is 5.
X3.14 A sample of 50 married women was asked how many children they had in their family.
The results are presented in Table 3.6.
Number of children 0 1 2 3 4 5+
Number of families 6 14 13 9 5 3
Table 3.6
Estimate the probability that if any married woman is asked the same question, she will
answer: (a) none; (b) between 1 and 3 inclusive; (c) more than 3; (d) neither 3 nor 4; and (e) less
than 2 or more than 4.
Example 3.7
An experiment consists of tossing three coins. Let events A, B, C, and D represent
obtaining three heads, three tails, exactly two heads, and exactly two tails respectively.
Figure 3.2 illustrates the sample space for this experiment and the four mutually
exclusive events.
Figure 3.2 (sample space: A = {HHH}; B = {TTT}; C = {HHT, HTH, THH}; D = {THT, TTH, HTT})
From Figure 3.2 we have: P(A) = 0.125, P(B) = 0.125, P(C) = 0.375, and P(D) = 0.375. As the
four mutually exclusive events exhaust the sample space then P(A) + P(B) + P(C) + P(D) = 1.0.
As A and B are mutually exclusive, then P(A or B) = P(A) + P(B) = 0.25. Similarly, P(A or B or
C) = P(A) + P(B) + P(C) = 0.625. As P(D) = 0.375, then P(D') = 1 − P(D) = 1 − 0.375 = 0.625.
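The results of Example 3.7 can be checked by building the sample space as a set and treating each event as a subset. This is a minimal sketch; the names `prob`, `mutually_exclusive`, and `exhaustive` are illustrative.

```python
from itertools import product

# Sample space for three coin tosses: 'HHH', 'HHT', ..., 'TTT'.
sample_space = {"".join(toss) for toss in product("HT", repeat=3)}

A = {o for o in sample_space if o.count("H") == 3}   # three heads
B = {o for o in sample_space if o.count("T") == 3}   # three tails
C = {o for o in sample_space if o.count("H") == 2}   # exactly two heads
D = {o for o in sample_space if o.count("T") == 2}   # exactly two tails

def prob(event):
    """Probability of an event made up of equally likely outcomes."""
    return len(event) / len(sample_space)

events = [A, B, C, D]

# Mutually exclusive: no pair of events shares an outcome.
mutually_exclusive = all(
    not (first & second)
    for i, first in enumerate(events)
    for second in events[i + 1:]
)

# Exhaustive: together the events cover the whole sample space.
exhaustive = set().union(*events) == sample_space

p_a_or_b = prob(A | B)      # equals P(A) + P(B) because A and B do not overlap
p_not_d = 1 - prob(D)       # complement rule

print(prob(A), prob(B), prob(C), prob(D), p_a_or_b, p_not_d)
```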
Example 3.8
To illustrate this case consider a sample space consisting of the positive integers from 1 through
10. Let event A represent all odd integers and event B represent all integers less than or equal
to 5. These two events within the sample space are displayed in Figure 3.3.
Figure 3.3 (Venn diagram: A only = {7, 9}; A and B = {1, 3, 5}; B only = {2, 4}; outside A and B = {6, 8, 10})
From Figure 3.3 we note that events A and B overlap with common sample points present.
This would be represented by the event {odd integers and integers ≤ 5}.
Addition law for mutually exclusive events: addition law for mutually exclusive events is a
result used to determine the probability that event A or event B occurs, but both events
cannot occur at the same time.
Note It is important to note that when we ask for the probability of events A and B
occurring then this is written as P(A and B). Furthermore, you may see in certain information
sources that the mathematical operator (or symbol) ∩ may be used instead of ‘and’. This
implies that P(A and B) means the same as P(A ∩ B).
The event {A or B} contains the outcomes that are either odd integers or integers ≤ 5. A little
thought would indicate that the number of sample points in event A or B is given by the equation
n{A or B} = n{A} + n{B} − n{A and B}, where the subtraction stops the common sample points
being counted twice. Consequently, by transforming the events into probabilities the general
addition probability law is given by equation (3.3).
$$P(A \text{ or } B) = P(A) + P(B) - P(A \text{ and } B) \qquad (3.3)$$
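The counting argument for Example 3.8 can be verified with Python sets. This sketch assumes event B is 'integers less than or equal to 5', as defined at the start of the example.

```python
# Sample space: the positive integers 1 to 10.
sample_space = set(range(1, 11))

A = {x for x in sample_space if x % 2 == 1}   # odd integers
B = {x for x in sample_space if x <= 5}       # integers less than or equal to 5

# Counting rule: subtract the overlap so shared points are counted once.
n_union = len(A) + len(B) - len(A & B)

# General addition law expressed as probabilities.
p_a = len(A) / len(sample_space)
p_b = len(B) / len(sample_space)
p_a_and_b = len(A & B) / len(sample_space)
p_a_or_b = p_a + p_b - p_a_and_b

print(sorted(A & B), n_union, round(p_a_or_b, 2))
```

Adding n{A} and n{B} without the subtraction would give 10, which double-counts the three shared points 1, 3, and 5.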
Example 3.9
A card is chosen from an ordinary pack of cards. Write down the probabilities that the card is:
(a) black and an ace, (b) black or an ace, and (c) neither black nor an ace. Let events A and B
represent obtaining an ace and obtaining a black card respectively. The sample space
is represented by Figure 3.4.
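Figure 3.4 (Venn diagram: A only = 2 cards; A and B = 2 cards; B only = 24 cards; outside A and B = 24 cards)
(a) $$P(B \text{ and } A) = \frac{\text{Number of outcomes in A and B}}{\text{Total number of outcomes}} = \frac{2}{52} = 0.0385$$
(b) $$P(B \text{ or } A) = P(B) + P(A) - P(B \text{ and } A) = \frac{26}{52} + \frac{4}{52} - \frac{2}{52} = \frac{28}{52} = 0.5385$$
(c) $$P(\text{neither } B \text{ nor } A) = 1 - P(B \text{ or } A) = 1 - \frac{28}{52} = 0.4615$$
General addition probability law: general addition probability law is a result used to
determine the probability that event A or event B occurs or both occur.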
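The three answers to Example 3.9 can be confirmed by building a 52-card pack and counting. The rank and suit labels below are illustrative choices, not taken from the text.

```python
from itertools import product

ranks = ["A", "2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K"]
suit_colour = {"spades": "black", "clubs": "black", "hearts": "red", "diamonds": "red"}

# Each card is a (rank, suit) pair; 13 ranks x 4 suits = 52 cards.
deck = set(product(ranks, suit_colour))

aces = {card for card in deck if card[0] == "A"}                      # event A
black = {card for card in deck if suit_colour[card[1]] == "black"}   # event B

p_black_and_ace = len(aces & black) / len(deck)

# General addition law: P(B or A) = P(B) + P(A) - P(B and A).
p_black_or_ace = len(black) / len(deck) + len(aces) / len(deck) - p_black_and_ace

# Complement rule for 'neither black nor an ace'.
p_neither = 1 - p_black_or_ace

print(round(p_black_and_ace, 4), round(p_black_or_ace, 4), round(p_neither, 4))
```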
Student exercises
X3.15 For each question indicate whether the events are mutually exclusive: (a) thermometers
are inspected and rejected if any of the following are found: poor calibration; inability
to withstand extreme temperatures without breaking; and not within specified size
tolerances; and (b) a manager will reject a job applicant for any of the following
reasons: lack of relevant experience, slovenly appearance, too old.
X3.16 Consider two events, A and B, of an experiment which is not empty. Display this
information in a Venn diagram and shade the area representing the event {A or B’}.
X3.17 Consider two events, A and B, where the associated probabilities are as follows:
P(A or B) = 3/4, P(B) = 3/8 and n(A) = 4. Calculate P(A and B) if the total sample size is
eight.
X3.18 A survey shows that 80% of all households have a colour television and 30% have
a microwave oven. If 20% have both a colour television and a microwave, what
percentage has neither?
X3.19 In a group of 50 students, 30 study French or German. If 20 study French and 15 study
German find the probability that a student studies French and German.
Example 3.10
Of a group of 30 students, 15 are blue-eyed {B}, 5 are left-handed {L}, and 2 are both blue-eyed
and left-handed {B and L}.
The sample space is represented in Figure 3.5.
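Figure 3.5 (Venn diagram: B only = 13 students; B and L = 2 students; L only = 3 students; the remaining 12 students are in neither event)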
Picking one student at random the probabilities would be as follows: P(L) = 5/30; P(B) = 15/30
and P(L and B) = 2/30. If we know that a student is blue-eyed then our sample space will be
reduced to 15 students, of which 2 are left-handed. Thus, P(L/B) = number in {L and B}/number
in {B}. Dividing top and bottom by the total sample space gives:
$$P(L/B) = \frac{P(L \text{ and } B)}{P(B)} = \frac{2/30}{15/30} = \frac{2}{15} \approx 0.1333$$
In general, if we have two events, A and B, then the probability of event A given that event
B has occurred is given by equation (3.4).
$$P(A/B) = \frac{P(A \text{ and } B)}{P(B)} \qquad (3.4)$$
This general result can be converted to give the multiplication law for joint events and is
given by equation (3.5).
$$P(A \text{ and } B) = P(A/B) \times P(B) \qquad (3.5)$$
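The conditional probability in Example 3.10 can be computed in two equivalent ways: from the reduced sample space, or from the definition in equation (3.4). The sketch below uses exact fractions so the two routes can be compared directly; the variable names are illustrative.

```python
from fractions import Fraction

total_students = 30
blue_eyed = 15          # n{B}
left_handed = 5         # n{L}
both = 2                # n{L and B}

p_b = Fraction(blue_eyed, total_students)
p_l = Fraction(left_handed, total_students)
p_l_and_b = Fraction(both, total_students)

# Route 1: reduced sample space (only the blue-eyed students).
p_l_given_b_reduced = Fraction(both, blue_eyed)

# Route 2: definition of conditional probability, P(L/B) = P(L and B) / P(B).
p_l_given_b = p_l_and_b / p_b

# Multiplication law: P(L and B) = P(L/B) * P(B).
p_joint_from_law = p_l_given_b * p_b

print(p_l_given_b, float(p_l_given_b))
```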
Example 3.11
Consider two events A and B which contain all sample points with P(A and B) = 1/4 and
P(A/B) = 1/3. Calculate (a) P(B), (b) P(A), and (c) P(B/A).
(a) P(B)? From equation (3.4) we have P(B) = P(A and B)/P(A/B) = (1/4)/(1/3) = 3/4. Therefore,
the probability of event B occurring is 0.75 or 75%.
(b) P(A)? Because A and B exhaust the sample space, P(A or B) = 1.0. From the addition law, P(A
or B) = P(A) + P(B) − P(A and B). Thus, P(A) = P(A or B) + P(A and B) − P(B) = 1.0 + 0.25 −
0.75 = 0.5. Thus, the probability of event A occurring is 0.5 or 50%.
Probability of event A given that event B has occurred: see Conditional probability.
Multiplication law for joint events: see Multiplication law.
(c) P(B/A)? P(A and B) is the same as P(B and A). Thus, from the multiplication law,
P(B and A) = P(B/A) * P(A). Re-arranging this equation gives P(B/A) = P(B and
A)/P(A) = 0.25/0.5 = 0.5. Thus, the probability that event B occurs given that event A has
already occurred is 0.5 or 50%.
Example 3.12
An office is due to be modernized with new office equipment. To aid the office manager a
survey has been undertaken to identify the following information: (a) the number of laptops,
(b) the number of desktop computers, and (c) whether the computers are old or new. The data
collected is provided in Table 3.7.
Table 3.7
If a person picks one computer at random, calculate the following probabilities: (a) the
computer is new; (b) the computer is a laptop; and (c) the computer is new given that it is a
laptop. Parts (a) and (b) deal with single events within the full sample space.
Hence, P(N) = 70/100 = 0.70 and P(L) = 60/100 = 0.60. In part (c) we are dealing with the
conditional probability P(N/L). By considering the reduced sample space L (60 laptops),
P(N/L) = 40/60 = 0.667, or by considering the definition of conditional probability, P(N/L) = P(N
and L)/P(L) = (40/100)/(60/100) = 0.667. Both methods give the same answer, 66 2/3%,
for the probability that the computer is new given that it is a laptop.
Example 3.13
A box contains 6 red and 10 black balls. What is the probability that if three balls are cho-
sen one at a time without replacement that they are all black? Let B1 = Event first draw black,
B2 = Event second draw black, and B3 = Event third draw black. In this example we are deter-
mining the probability that all three balls chosen are black (P(B1 and B2 and B3)). On the first
draw P(B1) = 10/16. On the second draw the sample space has been reduced to 15 balls and
given the condition that the first ball is black then P(B2/B1) = 9/15. On the third draw the sam-
ple space has been reduced to 14 balls and, given the condition that the first and second balls
are black, then P(B3/(B2 and B1)) = 8/14. Thus, P(B1 ∩ B2 ∩ B3) = P(B1) * P(B2/B1) * P(B3/(B2 ∩
B1)) = (10/16) * (9/15) * (8/14) = 0.2143. Therefore, the probability that all three balls are black
when no replacement occurs is 0.2143 or 21.4%.
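The chain of conditional probabilities in Example 3.13 generalizes to any number of draws. The helper below is a minimal sketch (the function name and arguments are illustrative, not from the text) that multiplies the successive conditional probabilities.

```python
from fractions import Fraction

def prob_all_black(black, other, draws):
    """P(every one of `draws` balls is black) when drawing without replacement."""
    prob = Fraction(1)
    for k in range(draws):
        # After k black balls have been removed, (black - k) black balls
        # remain out of (black + other - k) balls in total.
        prob *= Fraction(black - k, black + other - k)
    return prob

p = prob_all_black(black=10, other=6, draws=3)
print(p, round(float(p), 4))
```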
Student exercises
X3.20 A bowl contains three red chips and five blue chips. Two chips are drawn successively,
at random and without replacement. Calculate the probability that the first chip drawn
is red and the second blue.
X3.21 Two events, D and E, are found to have the following probability relationships:
P(D) = 1/3, P(E) = 1/4, and P(D or E) = 1/2. Calculate the following probabilities: (a) P(D
and E), (b) P(D/E), and (c) P(E/D).
X3.22 Two events A and B are found to have the following probability relationships:
P(A) = 1/3, P(B) = 1/2, and P(A or B) = 3/4. Calculate the following probabilities: (a)
P(A/B), (b) P(B/A), (c) P(B’/A’), and (d) P(A’/B’).
X3.23 A bag contains four red counters and six black counters. A counter is picked at
random from the bag and not replaced. A second counter is then picked. Calculate the
following probabilities: (a) the second counter is red, given that the first is red; (b) both
the counters are red; and (c) the counters are of different colours.
X3.24 The Gompertz Oil Company drills for oil in old oil fields that large companies have
stated are uneconomic. The decision to drill will depend upon a number of factors,
including the geology of the proposed sites. Drilling experience shows that there is
a 0.40 probability of a type A structure present at the site given a productive well. It
is also known that 50% of all wells are drilled in locations with a type A structure and
30% of all wells drilled are productive. Use the information provided to answer the
following questions: (a) What is the probability of a well drilled in a type A structure
and being productive? (b) What is the probability of having a productive well at the
location if the drilling process begins in a location with a type A structure? and (c) Is
finding a productive well independent of the type A structure?
From which we can deduce that for independent events P(A/B) = P(A) and similarly
P(B/A) = P(B).
Note The terms independent and mutually exclusive are different concepts. If A and B
are events with non-zero probabilities, then we can show that for P(A and B):
• if events A and B are mutually exclusive, then P(A and B) = 0. Mutually exclusive events can-
not occur at the same time. For example, the two events ‘my favourite football team lost a
match’ and ‘my favourite football team won the same match’ are mutually exclusive events;
• if two events A and B are independent, then P(A and B) = P(A) * P(B) ≠ 0. The outcome of event A has no effect on the outcome of event B. For example, the two events ‘it rained in Paris’ and ‘my car broke down in London’ are independent events. When calculating the probability that independent events both occur you multiply their individual probabilities, because the occurrence of one tells you nothing about the other.
So, if events A and B are mutually exclusive, they cannot be independent. If events A and B
are independent, they cannot be mutually exclusive.
Example 3.14
Suppose a fair die is tossed twice. Let event A represent the event first die shows an even num-
ber and event B represent the event second die shows a five or six. Events A and B are intui-
tively unrelated and are, therefore, independent events. Thus, the probability of A occurring
is P(A) = 3/6 = 1/2 and the probability of event B occurring is P(B) = 2/6 = 1/3. Thus, P(A and
B) = P(A) * P(B) = (1/2) *(1/3) = 1/6. Thus, the probability of events A and B occurring together
is 1/6.
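Independence in Example 3.14 can also be confirmed by brute force rather than intuition. The sketch below enumerates the 36 equally likely outcomes of two tosses and compares P(A and B) with P(A) * P(B); the helper names are illustrative.

```python
from fractions import Fraction
from itertools import product

# All 36 equally likely (first toss, second toss) outcomes.
outcomes = list(product(range(1, 7), repeat=2))

def prob(event):
    """Probability of an event given as a true/false test on an outcome."""
    favourable = sum(1 for o in outcomes if event(o))
    return Fraction(favourable, len(outcomes))

def event_a(o):
    return o[0] % 2 == 0      # first toss shows an even number

def event_b(o):
    return o[1] in (5, 6)     # second toss shows a five or six

p_a = prob(event_a)
p_b = prob(event_b)
p_a_and_b = prob(lambda o: event_a(o) and event_b(o))

print(p_a, p_b, p_a_and_b, p_a_and_b == p_a * p_b)
```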
Example 3.15
Three marksmen take part in a shooting contest. Their chances of hitting the ‘bull’ are 1/2, 1/3,
and 1/4 respectively. If they fire simultaneously what are the chances that only one bullet will
hit the bull? Let event A, B, C represent the event that the first man hits the bull, the second
man hits the bull, and the third man hits the bull, respectively, with the following probabilities:
P(A) = 1/2; P(B) = 1/3; P(C) = 1/4. The probability problem can be written as follows:
P(only one bull hit) = P(A and B’ and C’ OR A’ and B and C’ OR A’ and B’ and C)
P(only one bull hit) = P(A and B’ and C’) + P(A’ and B and C’) + P(A’ and B’ and C)
P(only one bull hit) = 1/2 * 2/3 * 3/4 + 1/2 * 1/3 * 3/4 + 1/2 * 2/3 * 1/4 = 1/4 + 1/8 + 1/12
P(only one bull hit) = 11/24.
Thus, the probability that exactly one of the three marksmen hits the bull is 11/24 or 45.83%.
Note In the solution we have used the notation A’, B’, and C’. A prime denotes the complement of an event; for example, A’ represents the event that event A does not occur, so P(A’) = 1 − P(A).
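The calculation in Example 3.15 can be sketched in code by listing every hit/miss pattern for the three independent marksmen and adding the probabilities of the patterns with exactly one hit. The function name is illustrative.

```python
from fractions import Fraction
from itertools import product

hit_probs = [Fraction(1, 2), Fraction(1, 3), Fraction(1, 4)]  # P(A), P(B), P(C)

def prob_exactly(k, probs):
    """P(exactly k hits) for independent shooters with the given hit probabilities."""
    total = Fraction(0)
    for pattern in product([True, False], repeat=len(probs)):
        if sum(pattern) != k:
            continue
        term = Fraction(1)
        for did_hit, p in zip(pattern, probs):
            # Use p for a hit and the complement 1 - p for a miss.
            term *= p if did_hit else 1 - p
        total += term
    return total

p_one = prob_exactly(1, hit_probs)
print(p_one, round(float(p_one), 4))
```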
Example 3.16
Two football teams A and B are disputing, on the basis of historical data, which of them is more likely to win. To settle the dispute the probability data presented in Table 3.8 have been collected, giving the probability of each team scoring 0, 1, 2, or 3 goals. Calculate the probability: (a) that team A wins, (b) that the teams draw, and (c) that team B wins.
Table 3.8
To solve this problem we need to find the total sample space. There are 16 possible results (events) given the scores in Table 3.8, and these are mutually exclusive. We will set them out in a joint probability table, assuming independence, i.e. that the number of goals team A scores does not influence the number team B scores (Table 3.9). Each entry is the product of the corresponding team A and team B probabilities.
                     Team A scores
Team B scores     0       1       2       3
0                0.06    0.06    0.06    0.02
1                0.12    0.12    0.12    0.04
2                0.09    0.09    0.09    0.03
3                0.03    0.03    0.03    0.01
Table 3.9
As the events are mutually exclusive then the probabilities are as follows:
P(A wins) = 0.06 + 0.06 + 0.02 + 0.12 + 0.04 + 0.03 = 0.33
P(Draw) = 0.06 + 0.12 + 0.09 + 0.01 = 0.28
P(B wins) = 1 − {P(A wins) + P(Draw)} = 1 − {0.33 + 0.28} = 0.39
From these results we can see that team B has the greater chance of winning a game.
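The joint table and the three results of Example 3.16 can be reproduced as follows. This is a sketch under one assumption: Table 3.8 is not reproduced here, so the individual scoring probabilities below are taken as the row and column totals of Table 3.9.

```python
from fractions import Fraction as F

# ASSUMED marginal probabilities (row and column totals of Table 3.9).
p_a = {0: F(3, 10), 1: F(3, 10), 2: F(3, 10), 3: F(1, 10)}  # team A scores
p_b = {0: F(2, 10), 1: F(4, 10), 2: F(3, 10), 3: F(1, 10)}  # team B scores

# Independence: each joint probability is the product of the two marginals.
joint = {(a, b): p_a[a] * p_b[b] for a in p_a for b in p_b}

p_a_wins = sum(p for (a, b), p in joint.items() if a > b)
p_draw = sum(p for (a, b), p in joint.items() if a == b)
p_b_wins = sum(p for (a, b), p in joint.items() if a < b)

print(float(p_a_wins), float(p_draw), float(p_b_wins))
```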
Student exercise
X3.25 A dart is thrown at a board and is equally likely to land in any one of eight squares
numbered 1–8 inclusive. Let A = Event dart lands in square 5 or 8; B = Event dart
Example 3.17
A bag contains three red and four white balls.
If one ball is taken at random and then replaced, and another ball is taken calculate the fol-
lowing probabilities:
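Figure 3.6 Tree diagram for two draws with replacement. First draw: R1 with probability 3/7, W1 with probability 4/7. Second draw, from either first-draw branch: R2 with probability 3/7, W2 with probability 4/7.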
Student exercises
X3.26 Each month DINGO Ltd receives a shipment of 100 parts from its supplier, which
will be checked on delivery for defective parts. Historically, the average number
of defective parts was 5. The new quality assurance procedure involves randomly
selecting a sample of three items (without replacement) for inspection. If more than
one of the sample is defective the order is returned. What proportion of shipments
might be expected to be returned?
X3.27 Susan takes examinations in Mathematics, French, and History. The probability that
she passes Mathematics is 0.7; the corresponding probabilities for French and History
are 0.8 and 0.6. Given that her performances in each subject are independent, draw
a tree diagram to show the possible outcomes. Use this tree diagram to calculate the
following probabilities: (a) fails all three examinations and (b) fails just one examination.
Example 3.18
Consider the situation where a financial analyst collects data pertaining to the sales of a particular type of fridge freezer. He is interested in the probability that this type of fridge freezer will be sold in a particular region, which he will then use to produce sales estimates for the next 12 months. From the data he finds that this region sold 230 out of a national total of 1670. From equation (3.8) the relative frequency, or proportion, or probability of this type of fridge freezer being sold in the region is P(X) = 230/1670 = 0.137725 or 13.8%. This can then be used within the sales forecast plan, as will be outlined when discussing expectation in Section 3.10.
Example 3.19
To illustrate the idea of a probability distribution, consider the following frequency distribution
representing the mileage travelled by 120 salesmen described in Chapters 1 and 2, as presented
in Table 3.10.
Table 3.10
Figure 3.7
We observe from Figure 3.7 that the relative frequency for 440–459 miles travelled is
0.283333. This implies that we have a chance, or probability, of 34/120 that the miles trav-
elled lies within this class.
➜ Excel solution
Mileage data: Cells B4:B9 Values
Frequency, f Cells C4:C9 Values
Relative frequency Cell D4 Formula: =C4/$C$11
Copy formula from D4:D9
Total f Cell C11 Formula: =SUM(C4:C9)
Total RF Cell D11 Formula: =SUM(D4:D9)
❉ Interpretation Thus, relative frequencies provide estimates of the probability for that class, or value, to occur. If we were to plot the histogram of relative frequencies we would, in fact, be plotting out the probabilities for each event, for example P(400–419 miles) = 0.10 and P(420–439 miles) = 0.225.
Note If in Figure 3.8 we decreased the class width towards zero and increased the number of associated bars then the histogram would approximate to a curve: the probability distribution curve.
Figure 3.8 Histogram of relative frequency (vertical axis, 0.0 to 0.3) against miles travelled, X (classes 400–419, 420–439, 440–459, 460–479, 480–499, 500–519).
Further thought along the lines used in developing the notion of expectation would
reveal that the variance of the probability distribution, VAR(X), can be determined from
equation (3.10).
Example 3.20
Returning to the miles travelled by salesmen we can easily calculate the mean number of miles
travelled and the corresponding measure of dispersion, as illustrated in Figures 3.9–3.11.
Figure 3.9
LCB, lower class boundary; UCB, upper class boundary.
Figure 3.10
Figure 3.11
➜ Excel solution
Mileage travelled Cells A5:A10 Values
Frequency, f Cells B5:B10 Values
LCB Cells C5:C10 Values
UCB Cells D5:D10 Values
Class mid-point Cells E5 Formula: =(C5+D5)/2
Copy formula from E5:E10
Relative frequency Cell G5 Formula: =B5/$C$14
Copy formula from G5:G10
X*P(X) Cell I5 Formula: =E5*G5
Copy formula from I5:I10
X²*P(X) Cell K5 Formula: =E5^2*G5
Copy formula from K5:K10
N = Σf = Cell C14 Formula: =SUM(B5:B10)
ΣXP = Cell C15 Formula: =SUM(I5:I10)
ΣX²P = Cell C16 Formula: =SUM(K5:K10)
Mean = Cell C17 Formula: =C15
Variance = Cell C18 Formula: =C16−C17^2
Standard Deviation = Cell C19 Formula: =C18^0.5
❉ Interpretation From Excel, the expected value is 454 miles travelled with a standard
deviation of 27.38 miles travelled.
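The spreadsheet steps for the expected value and variance can be mirrored in Python. This is a sketch under a stated assumption: Table 3.10 is not reproduced here, so only the first three frequencies (12, 27, 34) follow from the relative frequencies quoted in the text; the last three (24, 15, 8) are assumed values chosen so that the total is 120 and the quoted mean and standard deviation are reproduced.

```python
from fractions import Fraction
import math

# Mid-points of the classes 400-419, 420-439, ..., 500-519.
midpoints = [Fraction(819, 2) + 20 * i for i in range(6)]   # 409.5, 429.5, ..., 509.5

# First three values follow from the text; the last three are ASSUMED.
freqs = [12, 27, 34, 24, 15, 8]

n = sum(freqs)
probs = [Fraction(f, n) for f in freqs]                 # relative frequency, P(X)

mean = sum(x * p for x, p in zip(midpoints, probs))     # E(X) = sum of X * P(X)
mean_sq = sum(x**2 * p for x, p in zip(midpoints, probs))
variance = mean_sq - mean**2                            # VAR(X) = sum X^2 P(X) - E(X)^2
std_dev = math.sqrt(float(variance))

print(float(mean), float(variance), round(std_dev, 2))
```

Each line corresponds to one column or summary cell of the Excel solution: `probs` to column G, the two sums to cells C15 and C16, and the last three values to cells C17 to C19.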
Note Rearranging equation (2.3) to give the expected value (mean) represented by equation (3.9):
$$ E(X) = \frac{\sum fX}{\sum f} = \sum X \times \frac{f}{\sum f} = \sum X \times P(X) $$
Rearranging equation (2.14) to give the variance value represented by equation (3.10):
$$ VAR(X) = \frac{\sum f (X - \bar{X})^{2}}{\sum f} = \sum [X - E(X)]^{2} \times P(X) = \sum X^{2} \times P(X) - \left[ \sum X \times P(X) \right]^{2} $$
Example 3.21
Consider the problem of a stall at a fete running a game of chance. The game consists of a
customer taking turns to choose three balls from a bag that contains 3 white and 17 red balls
without replacement. For a customer to win he/she would have to choose 3 white, 2 white, or
1 white with winnings of €5, €2, and €0.50 respectively. On the day of the fete 2000 customers
tried the game.
How much money might be expected to have been paid out to each customer?
To solve this problem we first need to calculate the associated probabilities of choosing
3, 2, 1, and 0 white balls; a tree diagram (see Figure 3.12) visually enables identification of
these probabilities.
The final stage consists of calculating the associated expected value given we know
what the winnings are for 3, 2, 1, and 0 white balls.
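Figure 3.12 Tree diagram for three draws without replacement from a bag of 3 white (W) and 17 red (R) balls.
1st ball: W1 3/20; R1 17/20.
2nd ball: after W1, W2 2/19 or R2 17/19; after R1, W2 3/19 or R2 16/19.
3rd ball: after W1 W2, W3 1/18 (3 white) or R3 17/18 (2 white); after W1 R2, W3 2/18 (2 white) or R3 16/18 (1 white); after R1 W2, W3 2/18 (2 white) or R3 16/18 (1 white); after R1 R2, W3 3/18 (1 white) or R3 15/18 (0 white).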
From the tree diagram illustrated in Figure 3.12 we can identify the different routes by which we can obtain 3, 2, 1, and 0 white balls. The probability of 3 whites is P(3 White) = P(1st White and 2nd White and 3rd White) = 3/20 * 2/19 * 1/18 = 0.0009. By a similar process (adding the probabilities of the three routes that give 2 whites, and likewise for 1 white): P(only 2 White) = 0.0447, P(only 1 White) = 0.3579, and P(no white) = 0.5965. The probability distribution for the winnings can now be constructed and is illustrated in Figure 3.13.
Figure 3.13
➜ Excel solution
Number of white balls Cells B4:B7 Values
Amount won, X Cells C4:C7 Values
Probability, P(X) Cells D4:D7 Values
X*P(X) Cells E4 Formula: =C4*D4
Copy formula from E4:E7
E(X) Cell E9 Formula: =SUM(E4:E7)
Total Cell E10 Formula: =2000*E9
❉ Interpretation From Excel, we observe that the expected winnings for each game played are E(X) = ΣX * P(X) = 0.27285 (or €0.27 to the nearest cent). Given that we have 2000 players (or games played), the expected total paid out is N * E(X) = 2000 * 0.27285 = €545.70.
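As a cross-check on Example 3.21, the probabilities can be obtained by counting combinations instead of tracing the tree (a different technique from the one used in the text, giving the same distribution). The sketch below uses exact fractions; the names are illustrative.

```python
from fractions import Fraction
from math import comb

WHITE, RED, DRAWS = 3, 17, 3
winnings = {3: Fraction(5), 2: Fraction(2), 1: Fraction(1, 2), 0: Fraction(0)}  # euros

def p_white(k):
    """P(exactly k white balls in DRAWS draws without replacement)."""
    return Fraction(comb(WHITE, k) * comb(RED, DRAWS - k), comb(WHITE + RED, DRAWS))

dist = {k: p_white(k) for k in range(DRAWS + 1)}
expected = sum(winnings[k] * p for k, p in dist.items())   # E(X) per game
total_payout = 2000 * expected                             # 2000 games played

for k in sorted(dist, reverse=True):
    print(k, round(float(dist[k]), 4))
print(round(float(expected), 5), round(float(total_payout), 2))
```

The exact expected value is 311/1140, slightly below the 0.27285 quoted above, because the worked solution multiplies probabilities that have already been rounded to four decimal places.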
Example 3.22
A company manufactures and sells product Xbar. The sales price of the product will be €6 per
unit, and estimates of sales demand and variable costs of sales are as presented in Tables 3.11
and 3.12.
Table 3.11
Table 3.12
The unit variable costs are not conditional on the volume of sales demand and fixed
costs are estimated to be €10,000. What is the expected profit? The expected profit can be
calculated if we realize that profit is determined from equation (3.12).
Table 3.13 illustrates the calculation of the expected sales demand using equation (3.9),
with the probability distribution employed to calculate the column statistics.
Table 3.13
Table 3.14 illustrates the calculation of the expected value of the variable cost per unit
using equation (3.9), with the probability distribution employed to calculate the column
statistics.
Table 3.14
Student exercises
X3.28 A bag contains six white and four red counters, three of which are drawn at random
and without replacement. If X can take on the values of 0, 1, 2, 3 red counters,
construct the probability distribution of X. If the experiment was repeated 60 times,
how many times would we expect to draw more than one red counter?
X3.29 In a game you are offered the chance to toss a fair coin until a ‘tail’ appears. If a tail
appears on the first toss you win £2. If the first tail appears on the second toss you
win £4. If the first tail appears on the third toss you win £8. How much should you be
willing to pay to participate in the game if you intend to quit after the third toss, win or
lose?
■ Techniques in practice
TP1 CoCo S.A. are considering putting money into one of two investments, A and B. The
net profits for identical periods and probabilities of success for investments A and B are given
in Table 3.15.
                    Probability of return
Net profits, £      A       B
8000               0.0     0.1
9000               0.3     0.2
10,000             0.4     0.4
11,000             0.3     0.2
12,000             0.0     0.1
Table 3.15
TP2 Bakers Ltd currently has 303 shops across the UK. Table 3.16 describes the location of each shop within one of four regions (SW, SE, NE, and ML and NW) and the level of shop profit. As part of Bakers Ltd’s financial quality control initiative the company selects a shop at random and undertakes a visit.
                        Region
Profit, £             SW     SE     NE     ML and NW
Under 8000            12      8     15     22
8000– < 80,000        54     34     43     41
Over 80,000           34     12     23      5
Table 3.16
(a) Calculate the probability that the shop chosen to be visited will be in the SW region.
(b) Owing to a careless administrative leak of information the next set of visits are known to
be located in the SW region. What is the probability that a shop in the SW region will
be chosen?
(c) Compare the two probabilities of P(SW) and P(SW/over 80,000). Are these two events
independent or dependent?
TP3 The Skodel Ltd credit manager knows from past experience that if the company accepts
a ‘good risk’ applicant for a £60,000 loan the profit will be £15,000. If it accepts a ‘bad risk’
applicant it will lose £6000. If it rejects a ‘bad risk’ applicant nothing is gained or lost. If it rejects
a ‘good risk’ applicant it will lose £3000 in good will.
(a) Complete the profit and loss table for this situation.
                    Decision
Type of risk        Accept    Reject
Good
Bad
Table 3.17
(b) The credit manager assesses the probability that a particular applicant is a ‘good risk’ is
1/3 and a ‘bad risk’ is 2/3. What would be the expected profits for each of the two deci-
sions? Consequently, what decision should be taken for the applicant?
(c) Another manager independently assesses the same applicant to be four times as likely to
be a bad risk as a good one. What should this manager decide?
(d) Let the probability of being a good risk be x. What value of x would make the company
indifferent between accepting or rejecting an applicant for a mortgage?
■ Summary
In this chapter we have defined the concept of probability using the idea of relative fre-
quency. Furthermore, the key terms have been defined, such as experiment, sample
space, laws of probability, and the relationship between a relative frequency distribution
and probability distribution. In the next chapter we will explore probability distributions,
such as the normal distribution, Student’s t distribution, F distribution, binomial distribu-
tion, and Poisson distribution.
■ Key terms
Addition law for mutually exclusive events
Chance
Conditional probability
Cumulative frequency distribution
Empirical approach
Event
Experimental probability approach
Frequency definition of probability
General addition probability law
Independent events
Multiplication law
Multiplication law for independent events
Multiplication law for joint events
Mutually exclusive
Outcome
Probability
Probability of event A given that event B has occurred
Probable
Random experiment
Relative frequency
Sample space
Statistical independence
Uncertainty
■ Further reading
Textbook resources
1. Whigham, D. (2007) Business Data Analysis using Excel. Oxford: Oxford University Press.
2. Lindsey, J. K. (2003) Introduction to Applied Statistics: A Modelling Approach (2nd edn).
Oxford: Oxford University Press.
Web resources
1. StatSoft Electronic Textbook http://www.statsoft.com/textbook/stathome.html (accessed 25 May 2012).
2. HyperStat Online Statistics Textbook http://davidmlane.com/hyperstat/index.html (accessed 25 May 2012).
3. Eurostat: website is updated daily and provides direct access to the latest and most complete statistical information available on the European Union (EU), the EU Member States, the Euro-zone and other countries http://epp.eurostat.ec.europa.eu (accessed 25 May 2012).
4. Economagic: contains international economic data sets http://www.economagic.com (accessed 25 May 2012).
5. The International Statistical Institute (ISI) glossary of statistical terms provides definitions in a number of different languages http://isi.cbs.nl/glossary/index.htm (accessed 25 May 2012).
4 Probability distributions
» Overview «
The concept of probability is an important aspect of the study of statistics and within Chapter 3
we introduced the reader to some of the concepts that are relevant to probability. However, the
main emphasis of Chapter 4 is to focus on the concepts of discrete and continuous probability
distributions and not on the fundamentals of probability theory. Initially, we will explore the
issue of continuous probability distributions (normal) and then introduce the concept of
discrete probability distributions (binomial, Poisson). Sections 4.1 and 4.2 will explore the
concept of a probability distribution and introduce two distinct types: (a) continuous and (b)
discrete. Table 4.1 summarizes the probability distributions that are applicable to whether the
data variables are discrete/continuous and whether the distributions are symmetric/skewed.
Measured characteristic    Variable type
                           Discrete                   Continuous
Shape                      Symmetric    Skewed        Symmetric    Skewed
Distribution               Binomial     Poisson       Normal       Exponential
Table 4.1
» Learning objectives «
On completing this unit you will be able to:
» use the normal distribution to calculate the values of a variable that correspond to a par-
ticular probability;
» calculate one parameter of the normal distribution if the other parameters are known;
» use the normal distribution to calculate the probability that a variable has a value between
specific limits;
» solve simple problems using both tree diagrams and the binomial formula;
4.1.1 Introduction
A random variable is a variable that provides a measure of the possible values obtainable from an experiment. For example, we may wish to count the number of times that the number three appears on the tossing of a fair die or we may wish to measure the weight of people involved in measuring the success of a new diet programme.
In the first example, the random variable will consist of the numbers: 1, 2, 3, 4, 5, or 6. If the die was fair then on each toss of the die each possible number (or outcome) will have an equal chance of occurring. The numbers 1, 2, 3, 4, 5, or 6 represent the random variable for this experiment. In the second example, the possible number values will represent the weights of the people participating in the experiment. The random variable in this case would be the values of all possible weights. It is important to note that in the first example the values take whole number answers (1, 2, 3, 4, 5, 6); this is an example of a discrete random variable.
The second example consists of numbers that can take any value with respect to measured accuracy (160.4 lb, 160.41 lb, 160.414 lb, etc.) and is an example of a continuous random variable. In this section we shall explore the concept of a continuous probability distribution with the focus on introducing the reader to the concept of a normal probability distribution.
Random variable: A random variable is a function that associates a unique numerical value with every outcome of an experiment.
Discrete random variable: A discrete random variable is one which may take on only a countable number of distinct values such as 0, 1, 2, 3, 4 . . .
Continuous random variable: A continuous random variable is one which takes an infinite number of possible values.
Continuous probability distribution: If a random variable is a continuous variable, its probability distribution is called a continuous probability distribution.
4.1.2 The normal distribution
When a variable is continuous, and its value is affected by a large number of chance factors, none of which predominates, then it will frequently appear as a normal distribution. This distribution does occur frequently and is probably the most widely used statistical distribution. Some of the real-life variables having a normal distribution can be found, for example, in manufacturing (weights of tin cans) or can be associated with the human population (people’s heights).
Normal distribution: The normal distribution is a symmetrical, bell-shaped curve, centred at its expected value.
The normal distribution is defined by equation (4.1):
$$ f(X) = \frac{1}{\sigma\sqrt{2\pi}} \, e^{-\frac{1}{2}\left(\frac{X-\mu}{\sigma}\right)^{2}} \qquad (4.1) $$
This equation can be represented graphically by Figure 4.1 and illustrates the symmet-
rical characteristics of the normal distribution.
Figure 4.1 Normal curve, f(x) plotted against X, centred at µ.
For the normal distribution the mean, median, and mode all have the same numerical
value.
Note
1. The population mean and standard deviation are represented by the notations μ and σ
respectively.
2. If a variable X varies as a normal distribution then we would state that X ~ N(μ, σ²).
3. The total area under the curve represents the total probability of all events occurring which
equals 1.0.
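To make equation (4.1) concrete, the sketch below codes the density directly and compares it with the `NormalDist` class from Python's standard `statistics` module. The values µ = 100 and σ = 5 are illustrative only.

```python
import math
from statistics import NormalDist

def normal_pdf(x, mu, sigma):
    """Height of the normal curve at x, coded directly from equation (4.1)."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))

mu, sigma = 100.0, 5.0            # illustrative values
library = NormalDist(mu, sigma)

for x in (90.0, 100.0, 103.0):
    print(x, normal_pdf(x, mu, sigma), library.pdf(x))

# Symmetry about the mean: equal heights at mu - 7 and mu + 7.
left, right = normal_pdf(mu - 7, mu, sigma), normal_pdf(mu + 7, mu, sigma)
print(math.isclose(left, right))
```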
Example 4.1
A manufacturing firm quality assures components manufactured and historically the length of
a tube is found to be normally distributed with a population mean of 100 cm and a population
standard deviation of 5 cm.
Calculate the probability that a random sample of one tube will have a length of at least
110 cm.
From the information provided we define X as the tube length in centimetres, with population mean µ = 100 and standard deviation σ = 5. This can be represented using the notation X ~ N(100, 5²).
The problem we have to solve is to calculate the probability that 1 tube will have a length
of at least 110 cm.
This can be written as P(X ≥ 110) and is represented by the shaded area illustrated in
Figure 4.2.
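This problem can be solved by using the Excel function NORM.DIST(X, μ, σ, TRUE). Note that the third argument is the standard deviation σ, not the variance σ². With the final argument set to TRUE the function returns the cumulative probability P(X ≤ 110), which is the area illustrated in Figure 4.3; the required probability is then P(X ≥ 110) = 1 − P(X ≤ 110).
Figure 4.3 Normal curve f(x); the shaded area is the cumulative probability P(X ≤ 110) returned by NORM.DIST().
Figure 4.4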
➜ Excel solution
Mean = Cell C5 Value
Standard deviation = Cell C6 Value
X = Cell C8 Value
P(X ≤ 110) = Cell C10 Formula: =NORM.DIST(C8,C5,C6,TRUE)
P(X ≥ 110) = Cell C12 Formula: =1−C10
❉ Interpretation From Excel, we observe that the probability that an individual tube
length is at least 110 cm is 0.02275 or 2.3% (P(X ≥ 110) = 0.02275).
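For readers who want to check the Excel result outside a spreadsheet, the same calculation can be sketched with Python's standard `statistics.NormalDist`, whose `cdf` method plays the role of NORM.DIST with the cumulative argument set to TRUE.

```python
from statistics import NormalDist

tube = NormalDist(mu=100, sigma=5)   # X ~ N(100, 5 squared); sigma is the standard deviation

p_at_most_110 = tube.cdf(110)        # P(X <= 110), like cell C10
p_at_least_110 = 1 - p_at_most_110   # P(X >= 110), like cell C12

print(round(p_at_most_110, 5), round(p_at_least_110, 5))
```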
Example 4.2
Calculate the probability that X lies between 85 and 105 cm for the problem outlined in
Example 4.1.
In this example we are required to calculate P (85 ≤ X ≤ 105) which represents the area
shaded in Figure 4.5.
The value of P (85 ≤ X ≤ 105) can be calculated using Excel’s NORM.DIST () function.
Figure 4.5 Normal curve with the area between X = 85 and X = 105 shaded (µ = 100).
Figure 4.6
➜ Excel solution
Mean = Cell C5 Value
Standard deviation = Cell C6 Value
X1 = Cell C8 Value
X2 = Cell C9 Value
P(85 ≤ X ≤ 105) = P(X ≤ 105) − P(X ≤ 85)
P(X ≤ 85) = Cell C13 Formula: =NORM.DIST(C8,C5,C6,TRUE)
P(X ≤ 105) = Cell C14 Formula: =NORM.DIST(C9,C5,C6,TRUE)
P(85 ≤ X ≤ 105) = Cell C16 Formula: =C14−C13
❉ Interpretation We observe that the probability that an individual tube length lies
between 85 and 105 cm is 0.839995 or 84.0%.
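The subtraction of two cumulative probabilities used in Example 4.2 can be wrapped in a small helper. This is a sketch using Python's standard library; the function name is illustrative.

```python
from statistics import NormalDist

def prob_between(dist, lower, upper):
    """P(lower <= X <= upper) as the difference of two cumulative probabilities."""
    return dist.cdf(upper) - dist.cdf(lower)

p = prob_between(NormalDist(mu=100, sigma=5), 85, 105)
print(round(p, 6))
```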
Student exercise
X4.1 Use the NORM.DIST function to calculate the following probabilities, X ~ N(100, 25):
(a) P(X ≤ 95); (b) P(95 ≤ X ≤ 105); (c) P(105 ≤ X ≤ 115); and (d) P(93 ≤ X ≤ 99). For
each probability identify the region to be found by shading the area on the normal
probability distribution graph.
$$ Z = \frac{X - \mu}{\sigma} \qquad (4.2) $$
where X, μ, and σ are the variable score value, population mean, and population standard deviation, respectively, taken from the original normal distribution. Any distribution can be converted to a standardized distribution using equation (4.2) and the shape of the standardized version will be the same as the original distribution. If the original was symmetric then the Z transformed version would still be symmetric and if the original was skewed then the Z transformed version would still be skewed.
Standard normal distribution: A standard normal distribution is a normal distribution with zero mean (µ = 0) and unit variance (σ² = 1).
The corresponding probability density function is given by equation (4.3).
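$$ f(Z) = \frac{1}{\sqrt{2\pi}} \, e^{-\frac{1}{2} Z^{2}} \qquad (4.3) $$
Because σ = 1 for the standard normal distribution, the factor 1/(σ√(2π)) in equation (4.1) reduces to 1/√(2π).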
The advantage of this method is that the Z values are not dependent on the original data
units and this allows tables of Z values to be produced with corresponding areas under the
curve. This allows for probabilities to be calculated if the Z value is known, and vice versa,
which allows a range of problems to be solved.
Figure 4.7 illustrates the standard normal distribution (or Z distribution) with Z scores
between –4 and +4.
Figure 4.7 Standard normal curve, f(z) plotted against Z for Z from –4 to +4.
The Excel function NORM.S.DIST(z, TRUE) returns the probability that the observed value of a standard normal random variable will be less than or equal to z, as illustrated in Figure 4.8.
Note From calculation we can show that the proportion of values between ± 1, ± 2,
and ± 3 population standard deviations from the population mean of zero is 68%, 95%, and
99.7% respectively.
Figure 4.8 Standard normal curve f(z) with the area P(Z ≤ z) shaded.
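The 68%, 95%, and 99.7% figures quoted in the note can be verified numerically. The sketch below uses Python's standard `statistics.NormalDist`, whose default is the standard normal distribution (mean 0, standard deviation 1).

```python
from statistics import NormalDist

std_normal = NormalDist()   # mean 0, standard deviation 1

# P(-k <= Z <= k) for k = 1, 2, 3 standard deviations either side of the mean.
within = {k: std_normal.cdf(k) - std_normal.cdf(-k) for k in (1, 2, 3)}

for k, p in within.items():
    print(k, round(p, 4))
```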
Example 4.3
Reconsider Example 4.1. If a variable X varies as a normal distribution with a mean of 100 and
a standard deviation of 5, then the value of Z when X = 110 would be given by equation (4.2).
Z = (110−100)/5 = +2
Figure 4.9
➜ Excel solution
Mean = Cell C5 Value
Standard deviation = Cell C6 Value
X = Cell C8 Value
P(X ≤ 110) = Cell C10 Formula: =NORM.DIST(C8,C5,C6,TRUE)
P(X ≥ 110) = Cell C12 Formula: =1−C10
Z = Cell C14 Formula: =(C8−C5)/C6
P(Z ≤ +2) = Cell C15 Formula: =NORM.S.DIST(C14, TRUE)
P(Z ≥ +2) = Cell C16 Formula: =1−C15
Figure 4.10 Standard normal curve f(z); the shaded area to the left of Z = 2 is the cumulative probability P(Z ≤ 2) returned by NORM.S.DIST().
From Excel, the NORM.S.DIST () function can be used to calculate P(Z ≥ +2) = 0.02275.
❉ Interpretation
We observe that the probability that an individual tube length is at least 110 cm is 0.02275 or
2.3% (P (X ≥ 110) = P (Z ≥ 2) = 0.02275).
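The equivalence P(X ≥ 110) = P(Z ≥ 2) can be demonstrated by standardizing first and then using the standard normal distribution. This is a sketch with Python's standard library; `z_score` is an illustrative helper mirroring equation (4.2).

```python
from statistics import NormalDist

def z_score(x, mu, sigma):
    """Standardize x using Z = (X - mu) / sigma, as in equation (4.2)."""
    return (x - mu) / sigma

z = z_score(110, 100, 5)

p_via_z = 1 - NormalDist().cdf(z)               # P(Z >= 2) on the standard normal
p_direct = 1 - NormalDist(100, 5).cdf(110)      # P(X >= 110) on the original scale

print(z, round(p_via_z, 5), round(p_direct, 5))
```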
Note
1. This method is used to solve problems using tables of Z values and associated
probabilities.
2. The value of the Z score can be calculated using the Excel function STANDARDIZE ().
3. The Excel function NORM.DIST () takes a value of X, the mean, the standard deviation, and a final TRUE/FALSE argument; TRUE returns the cumulative probability P(X ≤ x) and FALSE returns the height of the normal curve at x.
4. The Excel function NORM.S.DIST () does the same for the standard normal distribution, taking a Z score and the same TRUE/FALSE argument.
The value of this probability P(X ≥ 110) can be found from critical tables if we convert
P(X ≥ 110) to P(Z ≥ 2) and use the critical tables for the normal distribution provided
in Appendix 2. Table 4.2 illustrates an example of this critical table with the probability
P(Z ≥ 2) identified for a particular value of z.
Example 4.4
If we reconsider Example 4.2 and transform the values of X to Z then the problem can be solved using the NORM.S.DIST () function.
➜ Excel solution
Mean = Cell C5 Value
Standard deviation = Cell C6 Value
X1 = Cell C8 Value
X2 = Cell C9 Value
P(85 ≤ X ≤ 105) = P(X ≤ 105) − P(X ≤ 85)
P(X ≤ 85) = Cell C13 Formula: =NORM.DIST(C8,C5,C6,TRUE)
P(X ≤ 105) = Cell C14 Formula: =NORM.DIST(C9,C5,C6,TRUE)
Figure 4.11
From Excel, the NORM.S.DIST () function can be used to calculate P (85 ≤ X ≤ 105) = P
(− 3 ≤ Z ≤ +1) = 0.839995.
Figure 4.12 illustrates the solution, shaded on the normal distribution.
Figure 4.12 Standard normal curve f(z) with the area between Z = –3 and Z = +1 shaded.
❉ Interpretation We observe that the probability that an individual tube length lies
between 85 and 105 cm is 0.839995 or 84.0%.
Student exercise
X4.2 Use the NORM.S.DIST () function to calculate the following probabilities, X ~ N(100,
25): (a) P(X ≤ 95); (b) P(95 ≤ X ≤ 105); (c) P(105 ≤ X ≤ 115); and (d) P(93 ≤ X ≤ 99). In
each case convert X to Z. Compare with your answers from Exercise X4.1.
Example 4.5
A local authority installs 2000 electric lamps. The life of lamps in hours (X) follows a normal
distribution, where X ~ N (1000, 40,000). Calculate: (a) the number of lamps that might be
expected to fail within the first 700 hours; (b) the number of lamps that may be expected to fail
between 900 and 1300 hours; and (c) after how many hours would we expect 10% of the lamps
to fail? From this information we have a population mean, µ, of 1000 hours with a variance, σ², of 40,000 hours², so the standard deviation is σ = 200 hours. This problem can be solved using either the NORM.DIST () or NORM.S.DIST () Excel functions as illustrated in Figure 4.13.
Figure 4.13
➜ Excel solution
Mean = Cell C5 Value
Variance = Cell C6 Value
Standard deviation = Cell C7 Formula: =SQRT(C6)
X = Cell C9 Value
P(X ≤ 700) = Cell C11 Formula: =NORM.DIST(C9,C5,C7,TRUE)
Z = Cell C13 Formula: =(C9-C5)/C7
P(Z ≤ −1.5) = Cell C14 Formula: =NORM.S.DIST(C13, TRUE)
E(X) = N*P(X ≤ 700) = Cell C15 Formula: =2000*C11
(b) The second part of the problem requires the calculation of the probability that X lies
between 900 and 1300 hours, and the estimation of the number of lamps from 2000
which will fail between these limits.
The Excel solution is illustrated in Figure 4.15
Figure 4.15
This problem consists of solving P(900 ≤ X ≤ 1300). Using the NORM.DIST () function
we find that 1249 lamps are expected to fail between 900 and 1300 hours out of the
2000 lamps.
This solution is represented graphically by Figure 4.16.
Figure 4.16 Normal curve f(x) showing P(900 ≤ X ≤ 1300) = 0.624655, with X1 = 900, 1000, and X2 = 1300 marked on the x axis
➜ Excel solution
Mean μ = Cell C5 Value
Variance σ² = Cell C6 Value
Standard deviation σ = Cell C7 Formula: =SQRT(C6)
X1 = Cell C9 Value
X2 = Cell C10 Value
P(X ≤ 900) = Cell C12 Formula: =NORM.DIST(C9,C5,C7,TRUE)
P(X ≤ 1300) = Cell C13 Formula: =NORM.DIST(C10,C5,C7,TRUE)
P(900 ≤ X ≤ 1300) = Cell C14 Formula: =C13−C12
E(X) = N*P(900 ≤ X ≤ 1300) = Cell C15 Formula: =2000*C14
Z1 = Cell C18 Formula: =(C9−C5)/C7
Z2 = Cell C19 Formula: =(C10−C5)/C7
P(Z ≤ −0.5) = Cell C21 Formula: =NORM.S.DIST(C18, TRUE)
P(Z ≤ 1.5) = Cell C22 Formula: =NORM.S.DIST(C19, TRUE)
P(−0.5 ≤ Z ≤ 1.5) = Cell C23 Formula: =C22−C21
E(X) = N*P(900 ≤ X ≤ 1300) = Cell C24 Formula: =2000*C23
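Parts (a) and (b) of Example 4.5 can also be reproduced outside Excel. The following is a minimal sketch, assuming Python's statistics.NormalDist as a stand-in for NORM.DIST(x, mean, sd, TRUE); the variable names are illustrative only.

```python
from statistics import NormalDist

n_lamps = 2000
mu = 1000
sigma = 40_000 ** 0.5                 # standard deviation = sqrt(variance) = 200

life = NormalDist(mu, sigma)

# (a) P(X <= 700) and the expected number of lamps failing in the first 700 hours
p_a = life.cdf(700)
expected_a = n_lamps * p_a

# (b) P(900 <= X <= 1300) = P(X <= 1300) - P(X <= 900)
p_b = life.cdf(1300) - life.cdf(900)
expected_b = n_lamps * p_b

print(round(p_a, 4), round(expected_a))   # 0.0668 134
print(round(p_b, 6), round(expected_b))   # 0.624655 1249
```

The value for part (b) matches the 1249 lamps and the probability of 0.624655 reported in the text.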
(c) The final part of this problem consists of calculating the number of hours for the first
10% to fail. This corresponds to calculating the value of x where P(X ≤ x) = 0.1.
This problem can be solved using the NORM.INV () or NORM.S.INV () functions, as
illustrated in Figure 4.17.
From Excel, the NORM.INV () or NORM.S.INV () functions can be used to calculate
the expected number of hours for 10% to fail.
This solution is represented graphically by Figure 4.18.
Figure 4.17
➜ Excel solution
Mean μ = Cell C5 Value
Variance σ² = Cell C6 Value
Standard deviation σ = Cell C7 Formula: =SQRT(C6)
P(X ≤ x) = Cell C9 Value
X = Cell C11 Formula: =NORM.INV(C9,C5,C7)
Z = Cell C13 Formula: =NORM.S.INV(C9)
X = μ + Z*σ = Cell C14 Formula: =C5+C13*C7
❉ Interpretation From Excel, the expected number of hours for 10% to fail is 744
hours.
Note
1. This problem corresponds to finding the value of x such that P(X ≤ x) = 10% (or 0.1).
From Excel, we find that P(X ≤ x) = 0.1 corresponds to Z = −1.28. To find x we would
then solve the equation: −1.28 = (X − 1000)/200. Re-arranging this equation gives
X = 1000 + (−1.28)*(200) = 744.
2. The Excel function NORM.INV () calculates the value of X from a normal distribution for
the specified probability, mean and standard deviation.
3. The Excel function NORM.S.INV () calculates the value of Z from the standard normal
distribution for the specified probability value.
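The two routes described in the Note (working directly with X, or finding Z first and then converting) can be sketched in Python. We assume here that statistics.NormalDist.inv_cdf plays the role of NORM.INV and NORM.S.INV; this is an illustration rather than part of the Excel solution.

```python
from statistics import NormalDist

mu, sigma = 1000, 200
p = 0.10                                      # proportion of lamps that have failed

# Route 1: invert the distribution of X directly (analogue of NORM.INV)
x_direct = NormalDist(mu, sigma).inv_cdf(p)

# Route 2: find Z first (analogue of NORM.S.INV), then use X = mu + Z * sigma
z = NormalDist().inv_cdf(p)
x_from_z = mu + z * sigma

print(round(z, 2), round(x_direct), round(x_from_z))   # -1.28 744 744
```

Both routes give approximately 744 hours, in line with the interpretation above.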
Student exercises
X4.3 Given that a normal variable has a mean of 10 and a variance of 25, calculate the
probability that a member chosen at random is: (a) ≥ 11, (b) ≤ 11, (c) ≤ 5, (d) ≥ 5, (e)
between 5 and 11.
X4.4 The lifetimes of certain types of car battery are normally distributed with a mean of
1248 days and standard deviation of 185 days. If the supplier guarantees them for 1080
days, what proportion of batteries will be replaced under guarantee?
X4.5 Electrical resistors have a design resistance of 500 ohms. The resistors are produced
by a machine with an output that is normally distributed N(501,9) ohms. Resistances
below 498 ohms and above 508 ohms are rejected. Find: (a) the proportion that will
be rejected; (b) the proportion which would be rejected if the mean was adjusted so
as to minimize the proportion of rejects; (c) how much the standard deviation would
need to be reduced (leaving the mean at 501 ohms) so that the proportion of rejects
below 498 ohms would be halved.
• for symmetrical distributions the following rule would hold: Q3 − Median = Median −
Q1, Largest value − Q3 = Q1 − smallest value, and Median = Midhinge = Midrange. The
midrange is the average of the largest and smallest data values and the midhinge is
the average of the first and third quartiles;
• for non-symmetry the following rule would hold: right-skewed distributions: Largest
value − Q3 greatly exceeds Q1 − Smallest value, and left-skewed distributions: Q1 −
Smallest value greatly exceeds Largest value – Q3.
In Example 2.15 we were given the first quartile Q1 = 15, minimum = 8, median = 33,
maximum = 88 and third quartile Q3 = 62. From this data we concluded that the data
distribution is not symmetrical (distance from Q3 to the median (62 − 33 = 29) is not the same
as between Q1 and the median (33 − 15 = 18), distance from Q3 and the largest value (88 −
62 = 26) is not the same as the distance between Q1 and the smallest value (15 − 8 = 7), and
the median (33), the midhinge ((62 + 15)/2 = 38.5) and the midrange ((88 + 8)/2 = 48) are
not equal). The summary numbers indicate right skewness because the distance between
Q3 and the largest number (88 − 62 = 26) is longer than the distance between Q1 and the
smallest value (15 − 8 = 7). The minimum and maximum points are identified and enable
identification of any extreme values (or outliers).
Normal probability plot: Graphical technique to assess whether the data is normally distributed.
Note A simple rule to identify an outlier (or suspected outlier) is that the largest value –
smallest value (88 – 8 = 80) should be no longer than three times the length of the box (Q3 –
Q1 = 62 – 15 = 47). In this case the value of maximum – minimum is 80, which is less than
3 × 47 = 141, and therefore no extreme values are present in the data set.
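The symmetry checks and the outlier rule of thumb can be collected into a few lines of Python. This is a sketch under one stated simplification: 'greatly exceeds' is reduced to a plain greater-than comparison, which is our assumption and not a rule from the text.

```python
# Five-number summary from Example 2.15
minimum, q1, median, q3, maximum = 8, 15, 33, 62, 88

midhinge = (q1 + q3) / 2             # average of the quartiles: 38.5
midrange = (minimum + maximum) / 2   # average of the extremes: 48.0
upper_tail = maximum - q3            # 26
lower_tail = q1 - minimum            # 7

# A symmetrical distribution would satisfy all three conditions
symmetric = (
    (q3 - median == median - q1)
    and (upper_tail == lower_tail)
    and (median == midhinge == midrange)
)

# Simplification (assumption): treat any excess of the upper tail as right skew
right_skewed = upper_tail > lower_tail

# Rule of thumb from the Note: the range should not exceed 3 x box length
box_length = q3 - q1                 # 47
possible_outliers = (maximum - minimum) > 3 * box_length

print(symmetric, right_skewed, possible_outliers)   # False True False
```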
A normal probability plot consists of constructing a graph of data values against a cor-
responding Z value where Z is based upon the ordered value.
Example 4.6
The manager at BIG JIMS restaurant is concerned about the time it takes to process credit card
payments at the counter by counter staff.
The manager has collected the following processing time data (time in minutes/seconds)
and requested that the data be checked to see if it is normally distributed (Table 4.3).
Table 4.3
Figure 4.19
➜ Excel solution
n = Cell C3 Value
Ordered value Cells E4:E22 Values
Area Cell F4 Formula: =1/(C3 + 1)
Cell F5 Formula: =F4+$F$4
Copy formula from F5:F22
• order the data values (1, 2, 3 . . . n) with 1 referring to the smallest data value and n
representing the largest data value;
• for the first data value (smallest) calculate the cumulative area using the formula: = 1/
(n + 1).
• calculate the value of Z for this cumulative area using the Excel function:
=NORM.S.INV (cumulative area);
• repeat for the other values where the cumulative area is given by the formula: =old
area + 1/(n + 1);
• input data values with smallest to largest value;
• plot data value y against Z value for each data point.
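The steps above can be sketched in Python as follows. Note the assumptions: the data values below are hypothetical (they are not the values in Table 4.3), and statistics.NormalDist.inv_cdf is used as the analogue of NORM.S.INV.

```python
from statistics import NormalDist

# Hypothetical processing times (NOT the data from Table 4.3)
times = [0.73, 0.58, 0.91, 1.05, 0.66, 0.84, 0.79, 0.97, 0.61]

# Order the data from smallest to largest
ordered = sorted(times)
n = len(ordered)

# Cumulative area for the i-th ordered value: i / (n + 1)
areas = [i / (n + 1) for i in range(1, n + 1)]

# Z value corresponding to each cumulative area
z_values = [NormalDist().inv_cdf(area) for area in areas]

# Each (Z, data value) pair is one point on the normal probability plot
for z, value in zip(z_values, ordered):
    print(f"{z:6.3f}  {value:.2f}")
```

Plotting the second column against the first gives the normal probability plot; an approximately straight line suggests approximately normal data.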
Figure 4.20 illustrates the normal probability plot for Example 4.6. We observe
from the graph that the relationship between the data values and Z is approximately a
straight line.
Figure 4.20 Normal probability plot (horizontal axis: Z value)
For data that is normally distributed we would expect the relationship to be linear. In
this situation we would accept the statement that the data values are approximately nor-
mally distributed.
❉ Interpretation Owing to the fact that the normal probability plot shows more or less
a straight line, we conclude that the data is approximately normally distributed.
(a) Figure 4.21 illustrates a normal distribution where Largest value – Q3 equals Q1 –
Smallest value.
Figure 4.21 Normal probability plot: normal curve (horizontal axis: Z value)
(c) Figure 4.23 illustrates a right-skewed distribution where Largest value – Q3 greatly
exceeds Q1 – Smallest value.
Figure 4.23 (horizontal axis: Z value)
1. Student’s t distribution
The Student’s t distribution is a distribution that is used to estimate a mean value
when the population variable is normally distributed but the sample chosen to
measure the population value is small and the population standard deviation is
unknown. It is the basis of the popular Student’s t-tests for the statistical significance
of the difference between two sample means and for confidence intervals for the
difference between two population means.
2. Chi-square distribution
The chi-square distribution (χ2 distribution) is a popular distribution that is used
to solve statistical inference problems involving contingency tables and assessing
goodness-of-fit tests between sample data and distributions.
3. F distribution
The F distribution is a distribution that can be used to test whether the ratios of two
variances from normally distributed statistics are statistically different. The test
statistic is defined as $F = s_1^2 / s_2^2$, where $s_1^2$ and $s_2^2$ are the sample 1 and sample 2
variances respectively. The shape of the distribution depends upon the numerator
and denominator degrees of freedom (df1 = n1 − 1, df2 = n2 − 1); the F distribution is
written as a function of n1, n2 as F (n1, n2).
Note The normal, Student’s t, and chi-square distributions are special cases of the F
distribution, as follows:
• normal distribution = F(n1 = 1, n2 = infinite) distribution;
• Student’s t distribution = F(n1 = 1, n2) distribution;
• chi-square distribution = F(n1, n2 = infinite) distribution.
Student’s t distribution: The t distribution is the sampling distribution of the t statistic.
Chi-square distribution: The chi square distribution is a mathematical distribution that is used directly or indirectly in many tests of significance.
F distribution: The F distribution (also known as the Fisher–Snedecor distribution) is a continuous probability distribution that arises frequently as the null distribution of a test statistic, most notably in the analysis of variance.
Two other continuous probability distributions are the uniform and exponential distri-
butions. The uniform distribution is used in the generation of random numbers for differ-
ent probability distributions and the exponential probability distribution is important in
the area of queuing theory.
$$P(a \le X \le b) = \int_a^b f(x)\,dx$$
where f(x) is the probability density function (PDF) and f(x) ≥ 0 for all x.
If we assume that the probability distribution is normal, then Figure 4.24 represents
graphically what area the PDF represents.
Thus, the probability that X takes on a value in the interval (a, b) is the area under the
density function from a to b (see shaded region in Figure 4.24).
Figure 4.24 Normal curve f(x) against X, with a, b, and µ marked
Example 4.7
In Example 4.1, we calculated the probability that a tube length will be at least 110 cm and it
is known that the tube length is normally distributed with a population mean and standard
deviation equal to 100 and 5 respectively.
This can be written as X ~ N(100, 5²) with the probability problem written as P(X ≥ 110) =
1 − P(X ≤ 110) = 1 − NORM.DIST(110, 100, 5, TRUE) = 1 − 0.97725 = 0.02275.
The term P(X ≤ 110) is the cumulative probability, that is, the area under the PDF to the left
of 110, which is calculated by the Excel NORM.DIST () function. Figure 4.25 illustrates this
graphically.
Figure 4.25 Normal curve f(x) showing P(X ≤ 110) = 0.97725
Example 4.8
In Example 4.1, we found that when X = 110 the value of P(X ≥ 110) = 0.02275. The inverse of
the cumulative distribution function (CDF) calculates the value of x when you know the
cumulative probability, as illustrated in Figure 4.26. Given P(X ≤ 110) = 0.97725, the inverse
CDF returns the value of x for which P(X ≤ x) = 0.97725. Therefore, x = NORM.INV(0.97725,
100, 5) = 110, which is calculated by the Excel NORM.INV () function.
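The relationship between Examples 4.7 and 4.8 (a cumulative probability in one direction, its inverse in the other) can be checked with a short Python sketch. We assume statistics.NormalDist as the analogue of NORM.DIST (cumulative) and NORM.INV; the names are illustrative.

```python
from statistics import NormalDist

tube = NormalDist(mu=100, sigma=5)

# Example 4.7: cumulative probability P(X <= 110), then the upper tail
p_below = tube.cdf(110)
p_at_least = 1 - p_below

# Example 4.8: recover x from the cumulative probability
x_back = tube.inv_cdf(p_below)

print(round(p_below, 5), round(p_at_least, 5), round(x_back, 4))   # 0.97725 0.02275 110.0
```

Applying the inverse to the cumulative probability returns the original value of 110, which is the round trip described in the two examples.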
Figure 4.26 Normal curve with 100 and 110 marked on the X axis, showing P(X ≥ 110) = 0.02275
Cumulative distribution function: The cumulative distribution function (CDF), or just distribution function, describes the probability that a real-valued random variable X with a given probability distribution will be found at a value less than or equal to x.
4.2 Discrete probability distributions
4.2.1 Introduction
In this section we shall explore discrete probability distributions when dealing with
discrete random variables. Two specific distributions included are: binomial and Poisson
probability distributions. We will also explore how to approximate one distribution with
another, if appropriate.
Discrete probability distribution: If a random variable is a discrete variable, its probability distribution is called a discrete probability distribution.
Binomial distribution: A binomial distribution can be used to model a range of discrete random data variables.
Poisson probability distribution: The Poisson distribution is a discrete probability distribution that expresses the probability of a given number of events occurring in a fixed interval of time and/or space if these events occur with a known average rate and independently of the time since the last event.
4.2.2 Binomial probability distribution
One of the most elementary discrete random variables, the binomial, is associated with
questions that only allow ‘Yes’ or ‘No’ type answers, or a classification such as male or
female, or recording a component as defective or not defective. If the outcomes are also
independent, for example, the possibility of a defective component does not influence the
possibility of finding another defective component then the variable is considered to be
a binomial variable.
These five characteristics define the binomial experiment and are applicable for situ-
ations of sampling from finite populations with replacement or for infinite populations
with or without replacement.
Example 4.9
A marksman shoots three rounds at a target. The probability of getting a ‘bull’ is 0.3. Develop
the probability distribution for getting 0, 1, 2, and 3 bulls. This experiment can be modelled by
a binomial distribution as:
• three identical trials (n = 3);
• each trial can result in either a bull (success) or not a bull (failure);
• the outcome of each trial is independent;
• the probability of a success (P(a bull) = p = 0.3) is the same for each trial;
• the random variable is discrete.
Figure 4.27 illustrates the tree diagram that represents the described experiment.
Let B represent the event that the marksman hits the bull and B’ represent the event that
the bull is missed.
The corresponding individual event probabilities are: P(B) = 0.3 and P(B’) = 1 − P(B) =
1 − 0.3 = 0.7.
Binomial experiment: A binomial experiment is an experiment with a fixed number of independent trials. Each trial has exactly two outcomes and the probability of each outcome in a binomial experiment remains the same for each trial.
Discrete variable: A set of data is said to be discrete if the values belonging to it can be counted as 1, 2, 3 . . .
Figure 4.27 Tree diagram for the first, second, and third attempts, each branching into B and B′
From this tree diagram we can identify the possible routes for 0, 1, 2, and 3 bull hits as
follows: P(no bull hit) = P(X = 0 success) = P(B’B’B’) = 0.7 * 0.7 * 0.7 = (0.7)³ = 0.343.
The important lesson is to note how we can use the tree diagram to undertake a calcula-
tion of an individual probability, but also note the pattern identified in the relationship
between the probability, P, and the individual event probability of success, p, or failure, q.
From Figure 4.27 we observe:
From these calculations we can now note the probability distribution for this experi-
ment (see Table 4.4).
X      Formula    P(X)
0      q³         0.343
1      3pq²       0.441
2      3p²q       0.189
3      p³         0.027
Total             1.000
Table 4.4
Figure 4.28 (horizontal axis: Number of bulls hit, X)
Note From the probability distribution we observe that the total probability equals one.
This is expected as the total probability would represent the total experiment.
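The tree diagram calculation can be mimicked by enumerating every route through the tree and multiplying the branch probabilities along it. The Python sketch below is illustrative only; the names are ours, and it simply walks all 2³ = 8 routes.

```python
from itertools import product

p_hit, p_miss = 0.3, 0.7
n_shots = 3

# Probability of each possible number of bulls, built up route by route
distribution = {bulls: 0.0 for bulls in range(n_shots + 1)}

# Each route is a sequence of hits (True) and misses (False)
for route in product([True, False], repeat=n_shots):
    route_probability = 1.0
    for hit in route:
        route_probability *= p_hit if hit else p_miss
    distribution[sum(route)] += route_probability   # sum(route) = number of hits

for bulls, probability in distribution.items():
    print(bulls, round(probability, 3))
# 0 0.343
# 1 0.441
# 2 0.189
# 3 0.027
```

The four totals reproduce Table 4.4 and sum to one.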
If we increase the size of the experiment then it becomes quite difficult to calculate the
event probabilities. We really need to develop a formula for calculating binomial prob-
abilities. Using the ideas generated earlier, we have:
Repeating this experiment for increasing values of ‘n’ would enable the identification
of a pattern that can be used to develop equation (4.5) to calculate the probability of ‘r’
successes given ‘n’ attempts of the experiment.
$$P(X = r) = \binom{n}{r} p^r q^{n-r} \tag{4.5}$$
The term $\binom{n}{r}$ calculates the binomial coefficients which are the numbers in front of the
letter terms in the binomial expansion. For example, in the previous example we found
that the total probability = p³ + 3p²q + 3pq² + q³ with the numbers in front of the letters of
1, 3, 3, and 1. These numbers are called the binomial coefficients and are calculated using
equation (4.6).
$$\binom{n}{r} = \frac{n!}{r!\,(n - r)!} \tag{4.6}$$
Note
1. The term $\binom{n}{r}$ calculates the number of combinations of obtaining ‘r’ successes from ‘n’
attempts of the experiment. In certain information sources the term $\binom{n}{r}$ is replaced with
alternative notation nCr.
2. It is important to note that 3! = 3*2*1 = 6, 2! = 2*1 = 2, 1! = 1, 0! = 1.
It can be shown that the mean and variance for a binomial distribution are given by
equations (4.8) and (4.9):
$$E(X) = np \tag{4.8}$$
$$VAR(X) = npq \tag{4.9}$$
$$P(\text{no bulls hit}) = P(X = 0) = \binom{3}{0} (0.3)^0 (0.7)^3$$
Inspecting this equation we note that the problem consists of three terms that are
multiplied together to provide the probability of no bulls hit. The terms are: (a) $\binom{3}{0}$, (b) $(0.3)^0$,
and (c) $(0.7)^3$. Parts (b) and (c) are straightforward to calculate and part (a) can be
calculated from equation (4.6) as follows:
$$\binom{3}{0} = \frac{3!}{0!\,(3 - 0)!} = \frac{3!}{0!\,3!} = \frac{3 \times 2 \times 1}{1 \times 3 \times 2 \times 1} = 1$$
$$P(\text{no bulls hit}) = P(X = 0) = \binom{3}{0} (0.3)^0 (0.7)^3 = 1 \times 1 \times (0.7)^3 = 0.343$$
• Binomial probability of ‘r’ successes from ‘n’ attempts using Excel function BINOM.DIST ().
• Binomial coefficients using Excel function COMBIN ().
• Factorial values using Excel function FACT ().
Figure 4.29
➜ Excel solution
Number of trials, n Cell D3 Value
Probability of hitting bull, p Cell D4 Value
Probability of missing bull, q Cell D5 Formula: =1−D4
Probability distribution
r Cells D10:D13 Values
P(X = r) Cell E10 Formula: =BINOM.DIST(D10,$D$3, $D$4,FALSE)
Copy formula down E10:E13
Total Cell E14 Formula: =SUM(E10:E13)
Number of combinations
r Cells D17:D20 Values
nCr Cell E17 Formula: =COMBIN($D$3,D17)
Copy formula down E17:E20
Factorials
r Cells D23:D26 Values
r! Cell E23 Formula: =FACT(D23)
Copy formula down E23:E26
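For readers who wish to check the three Excel functions against equation (4.5), the following Python sketch computes the same quantities. It assumes math.comb and math.factorial as analogues of COMBIN and FACT; binom_pmf is a helper name of our own, not an Excel or library function.

```python
from math import comb, factorial

def binom_pmf(r, n, p):
    """P(X = r) for a binomial variable: nCr * p^r * q^(n - r)."""
    return comb(n, r) * p**r * (1 - p)**(n - r)

n, p = 3, 0.3    # three shots, probability of a bull 0.3

probabilities = [binom_pmf(r, n, p) for r in range(n + 1)]
coefficients = [comb(n, r) for r in range(n + 1)]     # binomial coefficients
factorials = [factorial(r) for r in range(n + 1)]     # 0!, 1!, 2!, 3!

print([round(pr, 3) for pr in probabilities])   # [0.343, 0.441, 0.189, 0.027]
print(coefficients)                             # [1, 3, 3, 1]
print(factorials)                               # [1, 1, 2, 6]
```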
Example 4.10
A local authority surveyed the travel preferences of people who travelled to work by train
or bus. The initial analysis suggested that one in five people travelled by train to work. If five
people are interviewed, what is the probability that: (a) exactly three prefer travelling by train,
P(X = 3); (b) three or more prefer travelling by train, P(X ≥ 3); and (c) fewer than three prefer
travelling by train, P(X < 3).
This experiment can be modelled by a binomial distribution as:
The random variable, X, represents the number of people travelling by train out of the
five people interviewed. From the information provided we note that P(success) = P(prefer
train) = p = 1/5 = 0.2, P(failure) = 1 − p = q = 0.8, and number of identical trials n = 5.
Figure 4.30
➜ Excel solution
Number of trials n = Cell D3 Value
Probability travels by train, p = Cell D4 Value
Probability does not travel by train, q = Cell D5 Formula: =1−D4
r Cells D10: D15 Values
P(X = r) Cell E10 Formula: =BINOM.DIST(D10,$D$3,$D$4,FALSE)
Copy formula down E10:E15
Total Cell E17 Formula: =SUM(E10:E15)
r Cells D20:D22 Values
P(X = 3) = Cell E20 Formula: =BINOM.DIST(D20,$D$3,$D$4,FALSE)
P(X ≥ 3) = Cell E21 Formula: =1−BINOM.DIST(D21,$D$3,$D$4,TRUE)
P(X < 3) = Cell E22 Formula: =BINOM.DIST(D22,$D$3,$D$4,TRUE)
❉ Interpretation
(a) P(exactly three prefer train) = P(X = 3) = 0.0512
(b) P(three or more prefer train) = P(X ≥ 3) = P(X = 3) + P(X = 4) + P(X = 5) = 0.05792
(c) P(fewer than three prefer train) = P(X < 3) = P(X ≤ 2) = 0.9421
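The three probabilities in Example 4.10 can be verified with a short Python sketch that separates the 'exactly' case from the cumulative cases. The helper names binom_pmf and binom_cdf are our own; they are assumed to mirror BINOM.DIST with the cumulative argument set to FALSE and TRUE respectively.

```python
from math import comb

def binom_pmf(r, n, p):
    """P(X = r), as BINOM.DIST(r, n, p, FALSE) would return."""
    return comb(n, r) * p**r * (1 - p)**(n - r)

def binom_cdf(r, n, p):
    """P(X <= r), as BINOM.DIST(r, n, p, TRUE) would return."""
    return sum(binom_pmf(k, n, p) for k in range(r + 1))

n, p = 5, 0.2    # five people interviewed, one in five travels by train

p_exactly_3 = binom_pmf(3, n, p)          # (a) P(X = 3)
p_at_least_3 = 1 - binom_cdf(2, n, p)     # (b) P(X >= 3) = 1 - P(X <= 2)
p_fewer_than_3 = binom_cdf(2, n, p)       # (c) P(X < 3) = P(X <= 2)

print(round(p_exactly_3, 4), round(p_at_least_3, 5), round(p_fewer_than_3, 4))
# 0.0512 0.05792 0.9421
```

Note that parts (b) and (c) are complementary: both rest on the cumulative probability P(X ≤ 2).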
Example 4.11
A manufacturing company regularly conducts quality control checks at specified periods on all
products manufactured. A new order for 2000 light bulbs is due to be delivered to a national
do-it-yourself store. Historically, the manufacturing process has a failure rate of 15% and the
sample to be tested consists of four randomly selected light bulbs. From this information
estimate the following probabilities: (a) find the probability distribution for 0, 1, 2, 3, and 4
defective light bulbs; (b) calculate the probability that at least three will be defective; and (c)
determine the mean and variance of the distribution.
This example highlights the case of selecting without replacement from a large population.
The effect on the sample space can be considered negligible and therefore we can consider
the events as independent. Let the random variable, X, represent the number of defective
light bulbs from the random sample. This value of X can take the values: 0 defective from 4
bulbs, or 1 defective from 4 bulbs, or 2 defective from 4 bulbs, or 3 defective from 4 bulbs,
or all 4 bulbs defective. This can be written as X = 0, 1, 2, 3, 4. For this example we have
p = P(success) = P(defective bulb) = 0.15, q = P(not defective) = 1 – p = 0.85, and n = 4.
Figure 4.31
➜ Excel solution
Number of trials n = Cell D3 Value
Probability light bulb fails, p = Cell D4 Value
Probability light bulb does not fail, q = Cell D5 Formula: =1−D4
❉ Interpretation
(a) Table 4.5 represents the probability distribution for Example 4.11.
r P(X = r)
0 0.52201
1 0.36848
2 0.09754
3 0.01148
4 0.00051
Table 4.5
(b) The probability of at least three defective bulbs from the sample of four is 0.0119813.
(c) The mean and variance of the probability distribution are mean = np = 0.6 and variance = npq = 0.51.
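The mean and variance in part (c) can be obtained in two ways: from the shortcut formulas np and npq, or directly from the probability distribution in Table 4.5. The sketch below, with illustrative variable names of our own, computes both and also checks part (b).

```python
from math import comb

n, p = 4, 0.15       # four bulbs sampled, failure rate 15%
q = 1 - p

# (a) Probability distribution for X = 0, 1, 2, 3, 4 defective bulbs
probs = {r: comb(n, r) * p**r * q**(n - r) for r in range(n + 1)}

# (b) P(at least three defective) = P(X = 3) + P(X = 4)
p_at_least_3 = probs[3] + probs[4]

# (c) Mean and variance computed directly from the distribution
mean = sum(r * pr for r, pr in probs.items())
variance = sum(r**2 * pr for r, pr in probs.items()) - mean**2

print({r: round(pr, 5) for r, pr in probs.items()})
print(round(p_at_least_3, 5))                  # 0.01198
print(round(mean, 2), round(variance, 2))      # 0.6 0.51
print(round(n * p, 2), round(n * p * q, 2))    # 0.6 0.51 (shortcut formulas)
```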
Student exercises
X4.6 Evaluate the following: (a) 3C1, (b) 10C3, (c) 2C0.
X4.7 A binomial model has n = 4 and p = 0.6.
(a) Find the probabilities of each of the five possible outcomes (i.e. P(0), P(1) . . . P(4)).
(b) Construct a histogram of this data.
X4.8 Attendance at a cinema has been analysed and shows that audiences consist of 60%
men and 40% women for a particular film. If a random sample of six people was
selected from the audience during a performance, find the following probabilities:
(a) All women are selected
(b) Three men are selected
(c) Fewer than three women are selected.
X4.9 A quality control system selects a sample of three items from a production line. If
one or more is defective, a second sample is taken (also of size three), and if one or
more of these are defective then the whole production line is stopped. Given that the
probability of a defective item is 0.05, what is the probability that the second sample is
taken? What is the probability that the production line is stopped?
X4.10 Five people in seven voted in an election. If four of those on the roll are interviewed
what is the probability that at least three voted?
X4.11 A small tourist resort has a weekend traffic problem and is considering whether or not
to provide emergency services to help mitigate the congestion that results from an
accident or breakdown. Past records show that the probability of a breakdown or an
accident on any given day of a four-day weekend is 0.25. The cost to the community
caused by congestion resulting from an accident or breakdown is as follows:
• a weekend with 1 accident day costs £20,000;
• a weekend with 2 accident days costs £30,000;
• a weekend with 3 accident days costs £60,000;
• a weekend with 4 accident days costs £125,000.
As part of its contingency planning, the resort needs to know:
(a) The probability that a weekend will have no accidents
(b) The probability that a weekend will have at least two accidents
(c) The expected cost that the community will have to bear for an average weekend
period
(d) Whether or not to accept a tender from a private firm for emergency services of
£20,000 for each weekend during the season.
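$$P(X = r) = \frac{\lambda^r e^{-\lambda}}{r!} \tag{4.10}$$
Where λ is the mean number of events in the given interval and r is the number of events (r = 0, 1, 2, . . .).
Example 4.12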
The data in Table 4.6 is a record of the number of times a river has flooded in a wet season over
the past 100 years.
Check if the distribution may be modelled using the Poisson distribution and determine
the expected frequencies for a 100-year period. The Excel solution is provided in Figures 4.32
and 4.33
Table 4.6
Figure 4.32
➜ Excel solution
(a) Calculate frequency distribution mean and variance
Number of floods X Cells B7:B12 Values
Number of years with X floods, f Cells C7:C12 Values
xf Cell D7 Formula: =B7*C7
Copy formula down D7:D12
Σf = Cell C13 Formula: =SUM(C7:C12)
ΣXf = Cell D13 Formula: =SUM(D7:D12)
X² Cell F7 Formula: =B7^2
Copy formula down F7:F12
fX² Cell H7 Formula: =C7*F7
Copy formula down H7:H12
ΣfX² = Cell H13 Formula: =SUM(H7:H12)
Mean = Cell C16 Formula: =D13/C13
Variance = Cell C17 Formula: =H13/C13−C16^2
❉ Interpretation The average number of floods is 1.4 per year with a variance of 1.32.
They seem to be in close agreement (only 5.7% difference), which is one of the characteristics
of the Poisson distribution. The mean and variance of the Poisson distribution have the same
numerical value and, given the closeness of the two values in this numerical example, we
would conclude that the Poisson distribution should be a good model for the sample data.
Note The average number of floods per year (λ) and variance are calculated from the
frequency distribution.
$$\text{Mean } \lambda = \frac{\sum fX}{\sum f} = \frac{140}{100} = 1.4$$
$$\text{Variance VAR}(X) = \frac{\sum fX^2}{\sum f} - (\text{mean})^2 = 1.32$$
The chi-square goodness-of-fit test is used to see if the Poisson model is a significant fit
to the sample data (see section 7.1.4).
Figure 4.33
➜ Excel solution
(b) Calculate the Poisson probabilities and fit probability model to data set
r Cells J7:J12 Values
P(X = r) Cell K7 Formula: =POISSON.DIST(J7,$C$16,FALSE)
Copy formula down K7:K12
Total = Cell K14 Formula: =SUM(K7:K12)
Expected frequencies Cell M7 Formula: =$C$13*K7
Copy formula from M7:M12
Total = Cell M14 Formula: =SUM(M7:M12)
❉ Interpretation
(a) The probability distribution is given in Table 4.7.
r P(X = r)
0 0.2466
1 0.3452
2 0.2417
3 0.1128
4 0.0395
5 0.0111
Table 4.7
To check how well the Poisson probability distribution fits the data set we note that the
observed frequencies are given in the original table and that the expected frequencies can be
calculated from the Poisson probability fit using the equation EF = (∑f) × P(X = r). The manual
solution is now presented in Table 4.8.
Table 4.8
We note that the expected frequencies are approximately equal to the observed frequency
values.
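The Poisson probabilities and expected frequencies can be reproduced with a few lines of Python. This is a sketch only: poisson_pmf is a helper of our own that applies equation (4.10) directly, and the observed frequencies from Table 4.6 are not repeated here.

```python
from math import exp, factorial

def poisson_pmf(r, lam):
    """P(X = r) for a Poisson variable with mean lam, as in equation (4.10)."""
    return lam**r * exp(-lam) / factorial(r)

lam = 1.4            # average number of floods per year
total_years = 100    # sum of the observed frequencies

probabilities = [poisson_pmf(r, lam) for r in range(6)]

# Expected frequency = (sum of f) x P(X = r)
expected = [total_years * pr for pr in probabilities]

for r, (pr, ef) in enumerate(zip(probabilities, expected)):
    print(r, round(pr, 4), round(ef, 1))
```

The probabilities agree with Table 4.7 (0.2466, 0.3452, 0.2417, 0.1128, 0.0395, 0.0111); the expected frequencies can then be set beside the observed frequencies for comparison.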
Table 4.9 illustrates the calculation of the Poisson probability values for λ = 1.4 by
applying equation (4.10).
Table 4.9
Figure 4.34 illustrates a Poisson probability plot for the number of floods example.
Figure 4.34 Poisson probability plot of Probability, P(X = r), against Number of floods, X
The skewed nature of the distribution can be clearly seen (positive skew).
If we determine the mean and the variance, either using the frequency distribution or
the probability distribution, we would find that the relationship is as given in equation
(4.11).
λ = VAR(X) (4.11)
• mean = variance
• events discrete and randomly distributed in time and space;
• mean number of events in a given interval is constant;
• events are independent;
• two or more events cannot occur simultaneously.
Note Once it has been identified that the mean and variance have the same numerical
value, ensure that the other conditions above are satisfied, indicating that the sample data
most likely follow the Poisson distribution.
Example 4.13
A company is reviewing the number of telephone lines available for customer support. On a
typical day, an average of three calls is received during a five-minute period. Estimate
the proportion of phone calls that cannot be answered during a five-minute period: (a) if the
company installs four lines and (b) if the company installs five lines.
Figure 4.35
➜ Excel solution
λ = Cell C3 Value
r Cells B6:B10 Values
P(X = r) Cell C6 Formula: =POISSON.DIST(B6,$C$3,FALSE)
Copy formula down C6:C10
P(X ≤ 4) = Cell C12 Formula: =POISSON.DIST(B10,C3,TRUE)
P(X ≤ 4) = Cell C13 Formula: =SUM(C6:C10)
P(X > 4) = Cell C15 Formula: =1−C12
P(X > 5) = Cell C17 Formula: =1−POISSON.DIST(5,C3,TRUE)
❉ Interpretation
(a) If the company has four lines then the probability that a call cannot be answered is P(call
not answered) = 1 − P(X ≤ 4) = 1 − P(X = 0 or X = 1 or X = 2 or X = 3 or X = 4). From
Excel, P(X ≤ 4) = 0.815263245, so P(call not answered) = 0.184736755 or 18.5%. Callers
cannot connect 18.5% of the time.
(b) Should another line be installed? The corresponding calculation shows that if n = 5 then
the P(call not answered) = 1 − P(X ≤ 5) = 1 − P(X = 0 or X = 1 or X = 2 or X = 3 or X = 4
or X = 5). From Excel, P(call not answered) = 0.083917942 or 8.4%. The probability that
the switchboard could not handle all calls has been reduced to 8.4%. Whether or not this
was worthwhile depends upon the likely profits that this would create against the cost of
installation and running an extra telephone line.
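The comparison between four and five lines can be checked with the Python sketch below. The helper functions are our own and apply equation (4.10) directly; they are assumed to correspond to POISSON.DIST with the cumulative argument set to FALSE and TRUE.

```python
from math import exp, factorial

def poisson_pmf(r, lam):
    """P(X = r) for a Poisson variable with mean lam."""
    return lam**r * exp(-lam) / factorial(r)

def poisson_cdf(r, lam):
    """P(X <= r): sum of the individual probabilities from 0 to r."""
    return sum(poisson_pmf(k, lam) for k in range(r + 1))

lam = 3    # average number of calls in a five-minute period

p_unanswered_4_lines = 1 - poisson_cdf(4, lam)    # P(X > 4)
p_unanswered_5_lines = 1 - poisson_cdf(5, lam)    # P(X > 5)

print(round(p_unanswered_4_lines, 4), round(p_unanswered_5_lines, 4))
# 0.1847 0.0839
```

Adding the fifth line reduces the probability from about 18.5% to about 8.4%.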
The probability that a call cannot be answered P (call not answered) = 1 − P(X ≤ 4). Table
4.10 illustrates the calculation of the Poisson probability values for λ = 3 by applying
equation (4.10).
r    P(X = r)                           Excel formula
0    P(X = 0) = 3⁰e⁻³/0! = 0.0498       =POISSON.DIST(B6,$C$3,FALSE)
1    P(X = 1) = 3¹e⁻³/1! = 0.1494
2    P(X = 2) = 3²e⁻³/2! = 0.2240
3    P(X = 3) = 3³e⁻³/3! = 0.2240
4    P(X = 4) = 3⁴e⁻³/4! = 0.1680       =POISSON.DIST(B10,$C$3,FALSE)
Table 4.10
Student exercises
X4.12 Calculate P(0), P(1), P(2), P(3), P(4), P(5), P(6), and P(>6) for a Poisson variable with a
mean of 1.2. Using this probability distribution determine the mean and variance.
X4.13 In a machine shop the average number of machines out of operation is two. Assuming
a Poisson distribution for machines out of operation, calculate the probability that at
any one time there will be:
(a) Exactly one machine out of operation
(b) More than one machine out of operation.
X4.14 A factory estimates that 0.25% of its production of small components is defective.
These are sold in packets of 200. Calculate the percentage of the packets containing
one or more defectives.
X4.15 The average number of faults in a metre of cloth produced by a particular machine is
0.1. (a) What is the probability that a length of four metres is free from faults? (b) How
long would a piece have to be before the probability that it contains no flaws is less
than 0.95?
X4.16 A garage has three cars available for daily hire. Calculate the following probabilities if
the variable is a Poisson variable with a mean of 2: (a) find the probability that on a
given day that exactly none, one, two, and three cars will be hired, and determine the
mean number of cars hired per day; (b) the charge of hire of a car is £25 per day and
the total outgoings per car, irrespective of whether or not it is hired, are £5 per day.
Determine the expected daily profit from hiring these three cars.
X4.17 Accidents occur in a factory randomly and, on average, at the rate of 2.6 per month.
What is the probability that in a given month: (a) no accidents will occur and (b) more
than one accident will occur?
$$P(X = r) \cong \frac{(np)^r e^{-np}}{r!} \tag{4.12}$$
The Poisson random variable theoretically ranges from 0 → ∞. However, when used
as an approximation to the binomial distribution, the Poisson random variable—the
number of successes out of n observations—cannot be greater than the sample size n.
With large n and small p, equation 4.12 implies that the probability of observing a large
number of successes becomes small and approaches zero quite rapidly. For small values
of p (<0.1) and large values of n, the Poisson distribution will approximate the binomial
distribution with λ = np. For the binomial distribution with p small (<0.1) the mean (or
expected) value = np and the variance = npq = np(1−p) ≈ np. This implies that for small p
the expected value and variance of the binomial distribution are approximated by the mean
and variance of the Poisson distribution (λ = np, VAR(X) = np).
Example 4.14
In a large consignment of apples 3% are rotten. What is the probability that a carton of 60
apples will contain fewer than 2 rotten apples? We have here a binomial experiment and there-
fore could easily apply the binomial distribution with p = 0.03, q = 0.97 and n = 60.
Figure 4.36
➜ Excel solution
n = Cell C3 Value
p = Cell C4 Value
np = Cell C5 Formula: =C3*C4
Binomial: P( X < 2) = Cell D7 Formula: =BINOM.DIST(1,C3,C4,TRUE)
Poisson: P( X < 2) = Cell D8 Formula: =POISSON.DIST(1,C5,TRUE)
❉ Interpretation We can see from Excel that the binomial and Poisson distributions
provide approximately equal results, 45.92% and 46.28% respectively.
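The comparison in Example 4.14 can be reproduced with a short Python sketch. It assumes only the standard binomial and Poisson formulas; the helper names `binom_cdf` and `poisson_cdf` are illustrative and stand in for BINOM.DIST and POISSON.DIST with the cumulative argument set to TRUE.

```python
import math

def binom_cdf(k, n, p):
    """P(X <= k) for a binomial variable with n trials and success probability p."""
    return sum(math.comb(n, r) * p ** r * (1 - p) ** (n - r) for r in range(k + 1))

def poisson_cdf(k, lam):
    """P(X <= k) for a Poisson variable with mean lam."""
    return sum(lam ** r * math.exp(-lam) / math.factorial(r) for r in range(k + 1))

n, p = 60, 0.03   # carton size and proportion of rotten apples
lam = n * p       # Poisson mean used for the approximation, np = 1.8

# "Fewer than 2 rotten apples" means X <= 1.
binomial_prob = binom_cdf(1, n, p)
poisson_prob = poisson_cdf(1, lam)

print(f"Binomial P(X < 2) = {binomial_prob:.4f}")
print(f"Poisson  P(X < 2) = {poisson_prob:.4f}")
```

Changing `n` and `p` shows how the two probabilities drift apart as p grows or n shrinks.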
The degree of agreement between the binomial and Poisson probability distributions
for this problem can be observed in Figure 4.37.
Figure 4.37 Binomial and Poisson probabilities, P(X = r), plotted against X for X = 0 to 11.
Note
1. Binomial solution
$$P(X < 2) = {}^{60}C_0\, p^0 q^{60-0} + {}^{60}C_1\, p^1 q^{60-1}$$
2. Poisson solution
As n is large and p is small we can use the Poisson distribution. To check if the Poisson dis-
tribution is appropriate calculate the mean and variance: mean = np = 60 * 0.03 = 1.8, and
variance = npq = 60 * 0.03 * 0.97 = 1.746. Comparing the two values we see that they are
approximately equal and the binomial distribution can be approximated using the Poisson
distribution:
Table 4.11
Student exercise
X4.18 A new telephone directory is to be published. Before publication entries are proofread
for errors and any corrections made. Experience suggests that, on average, 0.1% of
the entries require correction and that entries requiring correction are randomly
distributed. The directory contains 800 pages with 300 entries per page. Two methods
for making corrections are proposed: Method A (costs 50p per page containing one
correction and £1.50 per page containing two or more corrections), and Method B
(costs £1 per page containing one or more corrections). Which method, based on cost,
should be used?
$$Z = \frac{X - np}{\sqrt{npq}} \tag{4.13}$$
The normal distribution can be used to approximate the binomial probabilities (normal
approximation to the binomial) when n is large and p is close to 0.5 and np > 5 (and
nq > 5), with mean ($\mu_{Normal} \approx \mu_{Binomial} = np$) and variance
($\sigma^2_{Normal} \approx \sigma^2_{Binomial} = npq$).

Normal approximation to the binomial: If the number of trials, n, is large, the binomial
distribution is approximately equal to the normal distribution.
Example 4.15
Assume you have a fair coin and wish to know the probability that you would get eight heads
out of ten flips. The binomial distribution has a mean of µ = np = 10 * 0.5 = 5 and a variance
of σ2 = npq = 10 * 0.5 * 0.5 = 2.5. The standard deviation is therefore 1.5811. A total of 8 heads
is 1.8973 standard deviations above the mean of the distribution [(8–5)/1.5811]. The question
then is ‘What is the probability of getting a value exactly 1.8973 standard deviations above the
mean?’. For a normal distribution the probability of any single value of X is zero, because a
single point has no area under the normal curve. The difficulty is that the binomial distribution
is a discrete probability distribution whereas the normal distribution is a continuous
distribution. The solution is to treat any value from 7.5 to 8.5 as representing an outcome of
8 heads (a continuity correction).
Using this approach, we can solve discrete binomial problems with a normal approximation if
we transform X = 8 for the binomial to the region 7.5–8.5 for the normal distribution.
The area shaded in Figure 4.38 is an approximation of the probability of obtaining eight
heads.
Figure 4.38 Normal curve, f(x) against X (axis marked 5 to 9), with the shaded area labelled P(X = 8) = 0.043495.
The binomial probability is therefore approximated by the normal area,
$P(X = 8)_{Binomial} \approx P(7.5 \le X \le 8.5)_{Normal}$.
Figure 4.39
➜ Excel solution
Binomial
n = Cell D5 Value
p = Cell D6 Value
mean μ = Cell D7 Formula: =D5*D6
Variance σ2 = Cell D8 Formula: =D5*D6*(1−D6)
SD σ = Cell D9 Formula: =SQRT(D8)
Number of heads X = Cell D10 Value
P(X = 8) = Cell D11 Formula: =BINOM.DIST(D10,D5,D6,FALSE)
Normal
Lower X1 = Cell D13 Formula: =D10−0.5
Upper X2 = Cell D14 Formula: =D10+0.5
P(X1 ≤ 7.5) = Cell D15 Formula: =NORM.DIST(D13,D7,D9,TRUE)
P(X2 ≤ 8.5) = Cell D16 Formula: =NORM.DIST(D14,D7,D9,TRUE)
P(7.5 ≤ X ≤ 8.5) = Cell D18 Formula: =D16−D15
We can see from Excel that the two probabilities agree closely with one another. The binomial
probability of obtaining 8 heads from 10 flips is 0.043945 and the normal approximation
probability of obtaining 8 heads is 0.043495.
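The same calculation can be sketched in Python as a cross-check on the worksheet. The helper `normal_cdf` is our own name; it plays the role of NORM.DIST with the cumulative argument set to TRUE and is built from the error function in the standard library.

```python
import math

def normal_cdf(x, mean, sd):
    """Cumulative normal probability P(X <= x), computed via the error function."""
    return 0.5 * (1 + math.erf((x - mean) / (sd * math.sqrt(2))))

n, p, x = 10, 0.5, 8             # ten flips of a fair coin, eight heads

mean = n * p                     # 5
sd = math.sqrt(n * p * (1 - p))  # square root of 2.5

# Exact binomial probability P(X = 8).
exact = math.comb(n, x) * p ** x * (1 - p) ** (n - x)

# Normal approximation: area between 7.5 and 8.5 (continuity correction).
approx = normal_cdf(x + 0.5, mean, sd) - normal_cdf(x - 0.5, mean, sd)

print(f"Binomial P(X = 8)           = {exact:.6f}")
print(f"Normal   P(7.5 <= X <= 8.5) = {approx:.6f}")
```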
Example 4.16
Enquiries at a travel agent lead to a holiday booking being made only sometimes. The agent
needs to make 35 bookings per week to break even. If during a week there are 100 enquiries
and the probability of a booking in each case is 0.4, find the probability that the agent will at
least break even in this particular week. To solve this problem let X represent the number of
bookings per week, p represent the probability that a booking will be made p = 0.4, and n
represent the number of possible bookings over the week, n = 100.
The area shaded in Figure 4.40 is a normal approximation of the binomial probability of
obtaining at least 35 bookings.
We can see that the binomial probability distribution solution, P(X ≥ 35) Binomial ≈ 1 − P
(X ≤ 34.5) Normal.
Figure 4.40 Normal curve, f(x) against x, centred on 40, with the area to the right of 34.5 shaded: P(X ≥ 34.5) = 0.86921388.
Figure 4.41
➜ Excel solution
n = Cell D3 Value
p = Cell D4 Value
mean μ = Cell D5 Formula: =D3*D4
Variance σ2 = Cell D6 Formula: =D3*D4*(1−D4)
SD σ = Cell D7 Formula: =SQRT(D6)
Binomial
P(X ≥ 35) = 1 − P(X ≤ 34)?
Binomial X = Cell D12 Value
We can see from Excel that the two probabilities agree with one another. The binomial
probability of obtaining at least 35 bookings is 0.86966347 and the normal approximation
probability of obtaining at least 35 bookings is 0.86921388.
Note
(a) Binomial solution:
This would be quite difficult to solve without the aid of calculator or some other compu-
tational device, for example a spreadsheet. From Excel we find that this probability value
is P(X ≥ 35) = 0.8697.
(b) Normal approximation solution (n = 100, p = 0.4):
$$P(X \ge 35 \text{ for binomial}) \approx P\left(Z \ge \frac{34.5 - 40}{4.899}\right) = P(Z \ge -1.12) = 0.8692$$
Comparing the two answers we can see that good agreement has been reached.
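For readers who want to confirm both answers without a spreadsheet, the following Python sketch repeats the two calculations for Example 4.16. The variable names and the `normal_cdf` helper are illustrative choices, not part of the Excel solution.

```python
import math

def normal_cdf(x, mean, sd):
    """Cumulative normal probability P(X <= x), computed via the error function."""
    return 0.5 * (1 + math.erf((x - mean) / (sd * math.sqrt(2))))

n, p = 100, 0.4      # enquiries per week and probability of a booking
break_even = 35      # bookings needed to break even

mean = n * p                     # 40
sd = math.sqrt(n * p * (1 - p))  # about 4.899

# Exact binomial: P(X >= 35) is the sum of P(X = r) for r = 35, ..., 100.
exact = sum(
    math.comb(n, r) * p ** r * (1 - p) ** (n - r)
    for r in range(break_even, n + 1)
)

# Normal approximation with continuity correction: 1 - P(X <= 34.5).
approx = 1 - normal_cdf(break_even - 0.5, mean, sd)

print(f"Binomial P(X >= 35)   = {exact:.4f}")
print(f"Normal   P(X >= 34.5) = {approx:.4f}")
```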
Student exercise
X4.19 Given X is a discrete binomial random variable with p = 0.3 and n = 20: (a) Can we
use the normal approximation to estimate the binomial probability? (b) What if n is
changed to 15? and (c) if n = 40 and p = 0.1 is the normal approximation appropriate?
$$Z = \frac{X - \lambda}{\sqrt{\lambda}} \tag{4.14}$$
The approximation improves as the value of the mean (λ) grows larger and at a particu-
lar value we can assume that the Z variable is normally distributed.
Example 4.17
The average number of broken eggs per lorry is known to be 50. What is the probability that
there will be more than 70 broken eggs on a particular lorry load?
We may use the normal approximation to the Poisson distribution, where the mean
and variance are calculated as follows: mean ($\mu_{Normal} \approx \mu_{Poisson} = \lambda = 50$)
and variance ($\sigma^2_{Normal} \approx \sigma^2_{Poisson} = \lambda = 50$).
We require P(X > 70 for Poisson) ≈ P(X ≥ 70.5 for normal).
The area shaded in Figure 4.42 is an approximation of the probability of obtaining more than
70 broken eggs.
Figure 4.42 Normal curve, f(x) against x, centred on 50, with the area to the right of 70.5 shaded.
The Poisson probability is therefore approximated by the normal area,
$P(X > 70)_{Poisson} \approx P(X \ge 70.5)_{Normal}$.
Figure 4.43
➜ Excel solution
Mean λ = Cell D3 Value
Variance σ2 = Cell D4 Value
SD σ = Cell D5 Formula: =SQRT(D4)
Poisson
P(X >70) = 1 − P(X ≤ 70)?
Poisson X = Cell D10 Value
P(X >70) = 1 − P(X ≤ 70) = Cell D11 Formula: =1−POISSON.DIST(D10,D3,TRUE)
Normal P(X ≥ 70.5)?
Normal X = Cell D16 Value
P(X ≥ 70.5) = Cell D17 Formula: =1−NORM.DIST(D16,D3,D5,TRUE)
Z = Cell D19 Formula: =(D16−D3)/D5
P(X ≥ 70.5) = Cell D20 Formula: =1−NORM.S.DIST(D19, TRUE)
We can see from Excel that the two probabilities closely agree with one another. The
Poisson probability of obtaining more than 70 broken eggs is 0.002971 and the normal
approximation probability of obtaining more than 70 broken eggs is 0.001871.
Note
(a) Poisson solution:
This would be quite difficult to solve without the aid of a calculator or some other compu-
tational device, for example a spreadsheet. From Excel we find that this Poisson probability
value is P(X > 70) = 0.002971.
(b) Normal approximation solution:
μ = λ = 50 and σ = √λ = 7.071068.
P(X > 70 for Poisson) ≈ P(X ≥ 70.5 for normal)
$$P(X > 70 \text{ for Poisson}) \approx P\left(Z \ge \frac{70.5 - 50}{7.071068}\right) = P(Z \ge 2.899138) = 0.001871$$
Comparing the two answers we can see that good agreement has been reached.
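A short Python sketch of Example 4.17 is given below as a cross-check. It builds the Poisson terms one at a time (each term is the previous term multiplied by λ/r), which is a computational convenience we have assumed here rather than anything taken from the Excel solution; `normal_cdf` is again an illustrative helper.

```python
import math

def normal_cdf(x, mean, sd):
    """Cumulative normal probability P(X <= x), computed via the error function."""
    return 0.5 * (1 + math.erf((x - mean) / (sd * math.sqrt(2))))

lam = 50         # mean number of broken eggs per lorry
threshold = 70   # we want P(X > 70)

# Exact Poisson: accumulate P(X = 0) + ... + P(X = 70), then take the complement.
term = math.exp(-lam)        # P(X = 0)
cumulative = term
for r in range(1, threshold + 1):
    term *= lam / r          # P(X = r) from P(X = r - 1)
    cumulative += term
exact = 1 - cumulative

# Normal approximation with continuity correction: P(X >= 70.5).
z = (threshold + 0.5 - lam) / math.sqrt(lam)
approx = 1 - normal_cdf(threshold + 0.5, lam, math.sqrt(lam))

print(f"Poisson P(X > 70)    = {exact:.6f}")
print(f"Z                    = {z:.6f}")
print(f"Normal  P(X >= 70.5) = {approx:.6f}")
```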
Student exercise
X4.20 A local maternity hospital has an average of 36 births per week. Use this information
to calculate the following probabilities: (a) the probability that there are fewer than
30 births in a given week; (b) the probability that there will be more than 40 births in
a given week; and (c) the probability that there will be between 30 and 40 births in a
given week.
■ Techniques in practice
TP1 CoCo S. A. is concerned about the time taken to react to customer complaints and has
implemented a new set of procedures for its support centre staff. The customer service
director plans to reduce the mean time for responding to customer complaints to 28 days and has
collected the sample data given in Table 4.12 after implementation of the new procedures to
assess the time to react to complaints (days).
20 33 33 29 24 30
40 33 20 39 32 37
32 50 36 31 38 29
15 33 27 29 43 33
31 35 19 39 22 21
28 22 26 42 30 17
32 34 39 39 32 38
Table 4.12
Table 4.13
(a) Estimate the mean and variance based upon the sample data.
(b) State the value of calorie count if the production manager would like this value to be
43 ± 5%.
(c) Estimate the probability that the calorie count lies between 43 ± 5% (assume that your
answers to question (a) represent the population values).
■ Summary
The notion of a discrete and continuous probability distribution was introduced and examples
provided to illustrate the different types of discrete (binomial, Poisson) and continuous (nor-
mal) distributions.
In Chapter 5 we shall explore the concept of data sampling from normal and non-normal
population distributions and introduce the reader to the central limit theorem. Furthermore,
we will introduce a range of continuous probability distributions (Student’s t distribution, F
distribution, and chi-square distribution), which will be used in later chapters to solve a range
of problems that require statistical inference tests to be applied.
In Chapter 6 we will apply the central limit theorem to provide point and interval estimates
to certain population parameters (mean, variance, proportion) based upon sample parameters
(sample mean, sample variance, sample proportion).
■ Key terms
Binomial
Binomial experiment
Chi-square distribution
Continuous probability distribution
Continuous random variable
Cumulative distribution function
Discrete probability distributions
Discrete random variable
Discrete variable
F distribution
Normal approximation to the binomial
Normal distribution
Normal probability plot
Poisson distribution
Poisson probability distribution
Random variable
Standard normal distribution
Student’s t distribution
■ Further reading
Textbook resources
1. Whigham, D. (2007) Business Data Analysis using Excel. Oxford: Oxford University Press.
2. Lindsey, J. K. (2003) Introduction to Applied Statistics: A Modelling Approach (2nd edn).
Oxford: Oxford University Press.
Web resources
1. StatSoft Electronic Textbook http://www.statsoft.com/textbook/stathome.html (accessed
25 May 2012).
2. HyperStat Online Statistics Textbook http://davidmlane.com/hyperstat/index.html
(accessed 25 May 2012).
3. Eurostat: website is updated daily and provides direct access to the latest and most
complete statistical information available on the European Union (EU), the EU Member States, the
Euro-zone and other countries http://epp.eurostat.ec.europa.eu (accessed 25 May 2012).
4. Economagic: contains international economic data sets http://www.economagic.com
(accessed 25 May 2012).
5. The International Statistical Institute (ISI) glossary of statistical terms provides definitions
in a number of different languages http://isi.cbs.nl/glossary/index.htm (accessed 25 May 2012).
5 Sampling distributions and estimating
» Overview «
In Chapters 3 and 4 we introduced the concept of a probability distribution via the idea of
relative frequency and introduced two distinct types: discrete and continuous. In this chapter
we will explore the concept of taking a sample from a population and use this sample to
provide population estimates for the mean, standard deviation, and proportion. The types
of statistics that we explored within earlier chapters are statistics that provide an answer
to a particular question, where we assume that the data collected is from the complete
population. In many situations this is not the case and the data collected represents a sample
from a population being measured. In this case, the statistics calculated from the sample
(mean, standard deviation, and proportion) represent estimates of the true value that could
be calculated if you had access to the complete population of data values. These estimates
provide point estimates of the population values with the disagreement between the sample
and population value representing the margin of error. The margin of error can be represented
by the concept of a confidence interval for the population parameter value estimated from
the sample. This interval can be estimated if we assume that the sampling distribution of the
mean is normally distributed. We will show that the Central Limit Theorem allows a normal
distribution approximation for the sampling distribution of the mean to be assumed, even if
the population is not normally distributed. This result will allow the methods described in this
chapter to be employed to solve a range of statistical hypothesis tests where we test whether
the population mean has a particular value based upon the collected sample data.
» Learning objectives «
On completing this unit you will be able to:
» recognize reasons for sampling error—coverage error, non-response error, sampling error,
measurement error;
» calculate sampling errors and confidence intervals when the population standard devia-
tion is known and unknown (z- and t-tests);
doing so. For example, we browse through the television channels to find a programme
we may wish to watch and then make a decision based upon this sampling process. From
the sampling undertaken we may also draw conclusions about the overall quality of television
programmes based upon the sample of programmes observed. This is called
making an inference about the population based upon the sample observations.
The primary aim of sampling is to select a sample from the population that shares the
same characteristics as the population. For example, if the population average height of
grown men between the age of 20 and 50 years is 176 cm, then the sample average height
would also be expected to be 176 cm unless we have the problem of sampling error. This
concept of sampling error can be measured and will be discussed within this chapter.
This concept of sample and population values being in agreement allows us to state that
we expect the sample to be representative of the population values being measured.
Questions we should answer are:
• How well does the sample represent the larger population from which it was drawn?
• How closely do the features of the sample resemble those of the larger population?
Before we describe the main sampling methods we need to define the terminology we
will use in this and later chapters.
Probability sampling
The idea behind this type of probability sampling is random selection. More specifically,
each sample from the population of interest has a known probability of selection under a
given sampling scheme.
There are four categories of probability samples, as illustrated in Figure 5.1.
Figure 5.1 Categories of probability sampling: simple random, systematic, stratified, and cluster.
Example 5.1
Consider the situation that a marketing researcher will experience when selecting a random
sample of 200 shoppers who shop at a supermarket during a particular time period. The
researcher notes that the supermarket would like to seek the views of its customers on a pro-
posed re-development of the store and the total footfall (the number of people visiting a shop
or a chain of shops in a period of time is called its footfall) within this time period is 10,000.
With a footfall (or population) of this size we could employ a number of ways to select an
appropriate sample of 200 from the potential 10,000. For example, we could place 10,000 con-
secutively numbered pieces of paper (1–10000) in a box, draw a number at random from the
box, shake, and select another number to maximize the chances of the second pick being ran-
dom, shake, and continue the process until all 200 numbers are selected. These would then be
used to select a customer entering the store with the customer chosen based upon the number
selected from the random process. To maximize the chances that customers selected would
agree to complete the survey we could enter them into a prize draw. These 200 customers
will form our sample with each number in the selection having the same probability of being
chosen. When undertaking the collection of data via random sampling we generally find it
difficult to devise a selection scheme that guarantees a random sample. For example,
the population from which we select might not be the total population that we wish to measure or,
during the time period when the survey is conducted, we may find that the customers sampled
may be unrepresentative of the population as a result of unforeseen circumstances.

Random sample: A random sample is a sampling technique where we select a sample from a
population of values.
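The paper-in-a-box procedure of Example 5.1 can be imitated on a computer. The sketch below is a minimal Python illustration, assuming that customers are identified by the numbers 1 to 10,000; the seed value 42 and the variable names are arbitrary choices made here so that the run is repeatable.

```python
import random

population_size = 10_000   # footfall in the time period
sample_size = 200          # number of shoppers to survey

rng = random.Random(42)    # fixed seed (an assumption) so the draw can be repeated

# Numbered "pieces of paper" 1..10000; sample() draws without replacement,
# so each number has the same chance of selection and none is picked twice.
ticket_numbers = range(1, population_size + 1)
sample = rng.sample(ticket_numbers, sample_size)

print(sorted(sample)[:10])  # first few selected customer numbers
```

Drawing without replacement mirrors taking slips out of the box one at a time without returning them.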
1. Step 1—divide the number of cases in the population by the desired sample size.
2. Step 2—select a random number between one and the value attained in step 1. For
example, we could pick the number 28.
3. Step 3—starting with case number chosen in step 2, take every twenty-eighth record,
as per this example.
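The three steps above can be sketched in Python. This is a hedged illustration rather than a definitive procedure: we assume the value obtained in step 1 is used as the sampling interval, that cases are numbered from 1, and that the population size divides evenly by the sample size. The function name `systematic_sample` and the figures 10,000 and 200 are our own choices.

```python
import random

def systematic_sample(population_size, sample_size, rng):
    """Return the case numbers chosen by a systematic sample (illustrative helper)."""
    # Step 1: divide the number of cases by the desired sample size.
    interval = population_size // sample_size
    # Step 2: pick a random starting case between 1 and the interval.
    start = rng.randint(1, interval)
    # Step 3: starting at that case, take every interval-th record.
    return list(range(start, population_size + 1, interval))[:sample_size]

rng = random.Random(1)   # seed is an assumption, for repeatability
cases = systematic_sample(10_000, 200, rng)

print(cases[:5])
```

Only the starting point is random; every later case is fixed once the start and the interval are known.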
Example 5.2
To illustrate, consider the situation where we wish to sample the views of graduate job applicants
to a major financial institution. The nature of this survey is to collect data on the application
process from the applicants’ perspective. The survey will therefore have to collect the views from
the different specified groups within the identified population. For example, this could be based
on gender, race, type of employment requested (full- or part-time), or whether an applicant is
classified as disabled. If we use simple random sampling it is possible that we may miss a repre-
sentative sample from one of these groups as a result, for example, of the relative size of the group
relative to the population. In this case, we would employ stratified random sampling to ensure
that appropriate numbers of sample values are drawn from each group in proportion to the per-
centage of the population as a whole. Stratified sampling offers several advantages over simple
random sampling: (a) it guards against an unrepresentative sample (e.g. all male from a predomi-
nately female group); (b) it provides sufficient group data for separate group analysis; (c) it requires
a smaller sample; and (d) greater precision is achievable compared with simple random sampling
for a sample of the same size. Stratified random sampling nearly always results in a smaller vari-
ance for the estimated mean or other population parameters of interest. The main disadvantage
of a stratified sample is that it may be more costly to collect and process the data compared with
a simple random sample. Two different categories of stratified random sampling are available
Non-probability sampling
In many situations it is not possible to select the kinds of probability samples used in
large-scale surveys. For example, we may be required to seek the views of local, family-run
businesses that have experienced financial difficulties during the bank credit crunch of
2007–2012. In this situation there are no easily accessible lists of businesses experiencing
difficulties, or there may never be a list created or available. The question of obtaining a
sample in this situation is achievable by using non-probability sampling methods to col-
lect the required sample data.
Figure 5.2 illustrates the four primary types of non-probability sampling methods.
Figure 5.2 Non-probability sampling methods: convenience, purposive, quota, and snowball.
We can divide non-probability sampling methods into two broad types: convenience
or purposive.
• Quota sampling
Quota sampling is designed to overcome the most obvious flaw of convenience (or
availability) sampling. Rather than taking just anyone, quotas are set to ensure that
the sample you get represents certain characteristics in proportion to their prevalence
in the population. Note that for this method you have to know something about the
characteristics of the population ahead of time. There are two types of quota sampling:
proportional and non-proportional.
300 men, you will continue to sample men, even if legitimate women respondents
come along—you will not sample them because you have already ‘met your quota’.
The primary problem with this form of sampling is that even when we know that
a quota sample is representative of the particular characteristics for which quotas
have been set, we have no way of knowing if the sample is representative in terms
of any other characteristics. If we set quotas for age, we are likely to attain a sample
with good representativeness on age, but one that may not be very representative
in terms of gender, education, or other pertinent factors.
• In non-proportional quota sampling you specify the minimum number of
sampled data points you want in each category. In this case you are not concerned
with having the correct proportions, but with achieving the numbers in each
category. This method is the non-probabilistic analogue of stratified random
sampling in that it is typically used to assure that smaller groups are adequately
represented in your sample.
• Snowball sampling
In snowball sampling, you begin by identifying someone who meets the criteria for
inclusion in your study. You then ask them to recommend others who they may know
who also meet the criteria. Thus, the sample group appears to grow like a rolling snow-
ball. This sampling technique is often used in hidden populations, which are dif-
ficult for researchers to access, including firms with financial difficulties or students
struggling with their studies. The method creates a sample with questionable
representativeness and it can be difficult to judge how a sample compares with a larger
population. Furthermore, an issue arises in who the respondents refer you to: for example,
friends will refer you to friends but are less likely to refer you to people they do not
consider friends, for whatever reason. This creates a further bias within the sample that
makes it difficult to say anything about the population.
Note The primary difference between probability methods of sampling and non-
probability methods is that in the latter you do not know the likelihood that any element of a
population will be selected for study.
going to make some errors. The concept of sampling implies that we’ll also have to deal
with a number of types of errors, including sampling error, coverage error, measurement
error, and non-response error.
5.2.1 Introduction
When we wish to know something about a particular population it is usually impractical,
especially when considering large populations, to collect data from every unit of that
population. It is more efficient to collect data from a sample of the population under study
and from the sample make estimates of the population parameters. Essentially, based on
a sample, we make generalizations about a population.

Estimate: An estimate is an indication of the value of an unknown quantity based on
observed data.
Standard deviation σ s
Proportion π ρ
Table 5.1
Note The symbols μ, σ, and ρ are the Greek symbols mu = μ, sigma = σ, and rho = ρ.
be equal and they can be plotted as a frequency distribution of the means. What is really
important here is that the mean of all the sample means has some interesting properties.
It is identical to the overall population mean.
Note A sample mean is unbiased as the mean of all sample means of size n selected
from the population is equal to the population mean, μ.
Example 5.3
To illustrate this property consider the problem of tossing a fair die. The die has 6 numbers (1,
2, 3, 4, 5, and 6), with each number likely to have the same frequency of occurrence. If we then
take all possible samples of size 2 from this population then we will be able to illustrate two
important results of the sampling distribution of the sample means.
Figure 5.3
➜ Excel solution
X Cells B8:B13 Values
X² Cell C8 Formula: =B8^2
Copy formula down C8:C13
N = Cell C15 Formula: =COUNT(B8:B13)
ΣX = Cell C16 Formula: =SUM(B8:B13)
ΣX² = Cell C17 Formula: =SUM(C8:C13)
Mean = Cell C18 Formula: =C16/C15
Mean = Cell C19 Formula: =AVERAGE(B8:B13)
Pop SDev = Cell C20 Formula: =SQRT(C17/C15−C18^2)
Pop SDev = Cell C21 Formula: =STDEV.P(B8:B13)

Unbiased: When the mean of the sampling distribution of a statistic is equal to a
population parameter, that statistic is said to be an unbiased estimator of the parameter.
From the population data values (1, 2, 3, 4, 5, and 6) we can calculate the population
mean and standard deviation using equations (2.1) and (2.3):
$$\text{Population mean, } \mu = \frac{\sum X}{N} = \frac{21}{6} = 3.5$$
$$\text{Population standard deviation, } \sigma = \sqrt{\frac{\sum X^2}{N} - (\mu)^2} = \sqrt{\frac{91}{6} - (3.5)^2} = 1.7078$$
If we now sample all possible samples of size 2 (n = 2) from the population then we
would have the following sampling distribution of size 2. We can calculate the mean of
these sample means and corresponding standard deviation of the sample means, as illus-
trated in Figure 5.4.
Figure 5.4
➜ Excel solution
Sample pairs
Value 1 Cells F8:F28 Values
Value 2 Cells G8:G28 Values
Value mean Cell H8 Formula: =(F8+G8)/2
Copy formula down from H8:H28
f Cells J8:J28 Values
f * Xbar Cell K8 Formula: =J8*H8
Copy formula down from K8:K28
f * Xbar² Cell M8 Formula: =K8*H8
Copy formula down from M8:M28
Σf = Cell G30 Formula: =SUM( J8:J28)
$$\bar{X} = \frac{\sum X}{n} \tag{5.1}$$
For sample pair (2, 6) the sample mean is equal to 4 (cell H18). For each sample pair we
would have a different sample mean, as can be observed in Figure 5.4 (column H). From
this list of sample means we can calculate the overall mean of the sample means using
equation (5.2).
$$\bar{\bar{X}} = \frac{\sum f\bar{X}}{\sum f} \tag{5.2}$$
From Excel, the mean of the sample means, $\bar{\bar{X}}$, is equal to 3.5. From the die experiment
we observe $\bar{\bar{X}} = \mu = 3.5$. Furthermore, the mean of the sample means is an unbiased
estimator of the population mean.
$$\bar{\bar{X}} = \mu \tag{5.3}$$
The standard error of the sample means (or standard deviation of the sample means)
measures the standard deviation of all sample means from the overall mean. We know
that the population data range from 1 to 6 with a population standard deviation of
1.7078. We can repeat this exercise to calculate the standard deviation of the sample
means using equation (5.4).
$$\sigma_{\bar{X}} = \sqrt{\frac{\sum f\bar{X}^2}{\sum f} - \left(\bar{\bar{X}}\right)^2} \tag{5.4}$$
From Excel, the standard deviation of the sample means ($\sigma_{\bar{X}}$) is equal to 1.2076,
which differs from the population standard deviation of 1.7078. Why? Observe that in the
sampling example we calculated a series of sample means of size 2 and then calculated the
overall mean of the sample means. When averaging, you replace the data set with a single
number that measures the middle value of the data set. The mean will be influenced by
any extreme data points in the sample, but by repeating the experiment to calculate a
series of means we should find that the range between the largest and smallest means
is less than the range within the original data sets. In other words, averages have smaller
variability than single observations. The standard error of the sampling mean distribution
is not equal to the population standard deviation ($\sigma_{\bar{X}} < \sigma$). In fact, the
standard deviation of the sample means is a biased estimate of the population standard deviation.
Note The standard deviation of the sample means is a biased estimate of the
population standard deviation because it is not necessarily the same as the population
standard deviation.
It can be shown that the relationship between sample and population is represented
by equation (5.5):
$$\sigma_{\bar{X}} = \frac{\sigma}{\sqrt{n}} \tag{5.5}$$
Equation (5.5) is called the standard error of the sample means or just standard error.
From equation (5.5) we observe that as n increases, the value of the standard error of the
sampling mean approaches zero ($\sigma_{\bar{X}} \to 0$). In other words, as n increases the spread of
the sample mean decreases to zero. In this situation the measured random variable would
have to be constant to produce this result.
Note The law of large numbers implies that the sample mean $\bar{X}$ will approach the
population mean (μ) as n increases in value.
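Using the numbers from our example, the values of the mean and standard deviation of
the sampling means are calculated as follows:
Check:
$$\bar{\bar{X}} = \frac{\sum f\bar{X}}{\sum f} = \frac{126}{36} = 3.5$$
$$\sigma_{\bar{X}} = \sqrt{\frac{\sum f\bar{X}^2}{\sum f} - \left(\bar{\bar{X}}\right)^2} = \sqrt{\frac{493.5}{36} - (3.5)^2} = 1.2076$$
$$\frac{\sigma}{\sqrt{n}} = \frac{1.7078}{\sqrt{2}} = 1.2076 = \sigma_{\bar{X}}$$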
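The die example can be checked by brute force. The Python sketch below is a minimal illustration that lists all 36 ordered pairs of faces directly, rather than the frequency table used in the worksheet; the variable names are our own. It confirms that the mean of the sample means equals μ and that the standard deviation of the sample means equals σ/√n.

```python
import itertools
import math

faces = [1, 2, 3, 4, 5, 6]

# Population mean and population standard deviation of a fair die.
mu = sum(faces) / len(faces)
sigma = math.sqrt(sum(x ** 2 for x in faces) / len(faces) - mu ** 2)

# Every possible sample of size n = 2 (36 ordered pairs) and its mean.
n = 2
sample_means = [sum(pair) / n for pair in itertools.product(faces, repeat=n)]

# Mean and standard deviation of the sample means.
mean_of_means = sum(sample_means) / len(sample_means)
standard_error = math.sqrt(
    sum(m ** 2 for m in sample_means) / len(sample_means) - mean_of_means ** 2
)

print(f"mu = {mu}, sigma = {sigma:.4f}")
print(f"mean of sample means = {mean_of_means}")
print(f"sd of sample means   = {standard_error:.4f}")
print(f"sigma / sqrt(n)      = {sigma / math.sqrt(n):.4f}")
```

Setting `n = 3` enumerates the 216 possible samples of size 3 and shows the spread of the sample means shrinking further.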
Figure 5.5 Normal population distribution, $X \sim N(\mu, \sigma^2)$, centred on μ.
If we choose a sample from a normal population then we can show that the sample
means are also normally distributed with a mean of μ and a standard deviation of the
sampling mean given by equation (5.5), where n is the sample size on which the sampling
distribution was based. Figure 5.6 illustrates the relationship between the sampling mean
and the normal distribution:
Figure 5.6 Normal curve for the sampling distribution of the mean, $\bar{X} \sim N\left(\mu, \frac{\sigma^2}{n}\right)$, centred on μ.
Example 5.4
Consider the problem of selecting 40,000 random samples from a population that is assumed
to be normally distributed with mean £45,000 and standard deviation of £10,000.
The population values are based on 40,000 data points and the sampling distribution is
illustrated in Figure 5.7.
We observe from Figure 5.7 that the population data is approximately normal.
Figure 5.7 Histogram for the population data, N = 40000 (frequency by bin)
From Figures 5.8 to 5.11 we observe that the sampling distribution of the mean is approxi-
mately normal for sample distributions of size n = 2, 5, 10, and 40. From the histograms we
observe that the sample means are less spread out about the mean as the sample sizes increase.
Figure 5.8 illustrates the sampling distribution for the sample means for sample size n = 2.
Figure 5.8 Histogram of the sample means, n = 2 (frequency by bin)
Figure 5.9 illustrates the sampling distribution for the sample means for sample size n = 5.
Figure 5.9 Histogram of the sample means, n = 5 (frequency by bin)
Figure 5.10 illustrates the sampling distribution for the sample means for sample size n = 10.
Figure 5.10 Histogram of the sample means, n = 10 (frequency by bin)
Figure 5.11 illustrates the sampling distribution for the sample means for sample size n = 40.
Figure 5.11 Histogram of the sample means, n = 40 (frequency by bin)
Note From these observations we conclude that if we sample from a population that
is normally distributed with mean μ and standard deviation σ ($X \sim N(\mu, \sigma^2)$), then the sampling
mean is normally distributed with mean μ and standard deviation of the sample means of
$\sigma_{\bar{X}} = \sigma/\sqrt{n}$.
$$\bar{X} \sim N\left(\mu, \frac{\sigma^2}{n}\right) \tag{5.6}$$
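The pattern seen in the histograms can be reproduced with a short simulation. The sketch below is written in Python rather than Excel and is illustrative only: the function name `simulate_sample_means`, the seed, and the choice of 4,000 samples per sample size are assumptions, while the population values (mean 45,000, standard deviation 10,000) and the sample sizes 2, 5, 10, and 40 come from Example 5.4.

```python
import math
import random
import statistics


def simulate_sample_means(mu, sigma, n, num_samples, seed=0):
    """Draw num_samples samples of size n from N(mu, sigma^2); return their means."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    return [
        statistics.fmean([rng.gauss(mu, sigma) for _ in range(n)])
        for _ in range(num_samples)
    ]


mu, sigma = 45_000, 10_000  # population values from Example 5.4
for n in (2, 5, 10, 40):
    means = simulate_sample_means(mu, sigma, n, num_samples=4_000)
    observed = statistics.pstdev(means)   # spread of the simulated sample means
    expected = sigma / math.sqrt(n)       # equation (5.5)
    print(f"n={n:>2}  mean of means={statistics.fmean(means):,.0f}  "
          f"sd of means={observed:,.0f}  sigma/sqrt(n)={expected:,.0f}")
```

The simulated standard deviation of the means should sit close to σ/√n for each n, with the small differences reflecting simulation noise.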
Given that we now know that the sample mean is normally distributed then we can
solve a range of problems using the methods described in Chapters 6, 8, and 9. The
standardized sample mean Z value is given by equation (5.7):
$$Z = \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} \tag{5.7}$$
Example 5.5
Diet X runs a number of weight reduction centres within a large town in the north east of
England. From the historical data it was found that the weight of participants is normally
distributed with a mean of 150 lb and a standard deviation of 25 lb. This can be written in
mathematical notation as $X \sim N(150, 25^2)$. Calculate the probability that the average sample weight
is greater than 160 lb when 25 participants are randomly selected for the sample.
Figure 5.12
➜ Excel solution
Population mean = Cell D7 Value
Population standard deviation = Cell D8 Value
Sample size n = Cell D11 Value
Sample mean = Cell D12 Value
Standard error of mean = Cell D13 Formula: =D8/D11^0.5
Z = Cell D15 Formula: =(D12-D7)/D13
Z = Cell D16 Formula: =STANDARDIZE(D12,D7,D13)
P = Cell D18 Formula: =1-NORM.DIST(D12,D7,D13,TRUE)
P = Cell D19 Formula: =1-NORM.S.DIST(D16,TRUE)
The problem requires us to find $P(\bar{X} > 160)$.
Figure 5.13 illustrates the region to be found that represents this probability. Excel
can be used to solve this problem by either using the NORM.DIST () or NORM.S.DIST ()
functions.
Figure 5.13 Normal curve showing μ and $\bar{X}$ = 160 (Z = 0 and Z = 2)
Given the population mean (μ = 150), population standard deviation (σ = 25), and sample
size (n = 25), the standard error of the sample mean is $\sigma_{\bar{X}} = \sigma/\sqrt{n} = 25/\sqrt{25} = 5$.
$$Z = \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} = \frac{160 - 150}{25/\sqrt{25}} = \frac{10}{5} = 2$$
From Excel, $P(\bar{X} > 160) = P(Z > 2)$ = 1 − NORM.S.DIST (Z, TRUE) = 0.022750132.
As expected, both methods provided the same answer to the problem of calculating the
required probability.
❉ Interpretation Based upon a random sample the probability that the sample mean
is greater than 160 pounds is 0.0228 or 2.28%.
Example 5.6 Calculate the probability that the sample mean lies between 140
and 158 pounds.
Figure 5.14
➜ Excel solution
Population mean = Cell D5 Value
Population standard deviation = Cell D6 Value
Sample size n = Cell D8 Value
Standard error = Cell D9 Formula: =D6/D8^0.5
Sample 1 mean = Cell D10 Value
Sample 2 mean = Cell D11 Value
Z1 = Cell D12 Formula: =(D10-D5)/D9
Z2 = Cell D13 Formula: =(D11-D5)/D9
P = Cell D15 Formula: =NORM.DIST(D11,D5,D9,TRUE)-NORM.DIST(D10,D5,D9,TRUE)
P = Cell D16 Formula: =NORM.S.DIST(D13,TRUE)-NORM.S.DIST(D12,TRUE)
The problem requires us to find $P(140 < \bar{X} < 158)$.
Figure 5.15 illustrates the region to be found that represents this probability. Again, Excel
can be used to solve this problem by using either the NORM.DIST () or NORM.S.DIST ()
functions.
Figure 5.15 Normal curve showing $P(140 < \bar{X} < 158) = P(-2 < Z < 1.6)$
Given the population mean (μ = 150), population standard deviation (σ = 25), and sample
size (n = 25), the standard error of the sample mean is $\sigma_{\bar{X}} = \sigma/\sqrt{n} = 25/\sqrt{25} = 5$.
From Excel, $P(140 < \bar{X} < 158)$ = NORM.DIST ($\bar{X}_2$, μ, $\sigma_{\bar{X}}$, TRUE) − NORM.DIST ($\bar{X}_1$, μ,
$\sigma_{\bar{X}}$, TRUE) = 0.922450576.
$$Z_1 = \frac{\bar{X}_1 - \mu}{\sigma/\sqrt{n}} = \frac{140 - 150}{25/\sqrt{25}} = \frac{-10}{5} = -2$$
$$Z_2 = \frac{\bar{X}_2 - \mu}{\sigma/\sqrt{n}} = \frac{158 - 150}{25/\sqrt{25}} = \frac{8}{5} = 1.6$$
From Excel, $P(140 < \bar{X} < 158) = P(-2 < Z < 1.6)$ = NORM.S.DIST (Z2, TRUE) − NORM.S.DIST (Z1, TRUE) = 0.922450576.
Both methods provided the same answer to the problem of calculating the required
probability.
❉ Interpretation Based upon a random sample the probability that the sample mean
is between 140 and 158 lb is 0.9224 or 92.24%.
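For readers who want to confirm these probabilities without a spreadsheet, the following Python sketch mirrors the NORM.DIST calculations for Examples 5.5 and 5.6. It is an illustration rather than the book's method; it relies on `statistics.NormalDist` from the Python standard library (version 3.8 or later), and the variable names are assumptions.

```python
from math import sqrt
from statistics import NormalDist

mu, sigma, n = 150, 25, 25                 # values from Examples 5.5 and 5.6
standard_error = sigma / sqrt(n)           # equation (5.5): 25 / 5 = 5
sampling_dist = NormalDist(mu, standard_error)   # distribution of the sample mean

# Example 5.5: probability that the sample mean exceeds 160 lb
p_above_160 = 1 - sampling_dist.cdf(160)

# Example 5.6: probability that the sample mean lies between 140 and 158 lb
p_140_to_158 = sampling_dist.cdf(158) - sampling_dist.cdf(140)

print(f"P(mean > 160)       = {p_above_160:.9f}")
print(f"P(140 < mean < 158) = {p_140_to_158:.9f}")
```

Working with the sampling distribution directly corresponds to the NORM.DIST route; standardizing first and using `NormalDist()` with its default mean 0 and standard deviation 1 would correspond to NORM.S.DIST.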
will be approximately normal with mean μ and standard deviation $\sigma_{\bar{X}}$ if the sample size
is sufficiently large. In most cases, the value of n should be at least 30 for non-symmetric
distributions and at least 20 for symmetric distributions before we apply this approxima-
tion. This relationship is already represented by equation (5.6).
This leads to an important concept in statistics known as the Central Limit Theorem.
The Central Limit Theorem provides us with a shortcut to the information required for
constructing a sampling distribution. By applying the Theorem we can obtain the descrip-
tive values for a sampling distribution (usually the mean and the standard error, which is
computed from the sampling variance) and we can also obtain probabilities associated
with any of the sample means in the sampling distribution.
Note The Central Limit Theorem states that no matter what the shape of the
population distribution, the sampling distribution of the means will be approximately normal
with increasing sample sizes providing better approximations to the normal distribution.
If the mean is approximately normally distributed then we can solve a range of prob-
lems using the methods described in Chapters 6, 8, and 9.
Example 5.7
Consider the sampling of 50 electrical components from a production run where, historically,
the component’s average lifetime was found to be 950 hours with a standard deviation of 25
hours. The population data is right-skewed and therefore cannot be considered to be normally
distributed. Calculate the probability that the sample mean is less than 958 hours.
Figure 5.16
Central Limit Theorem The Central Limit Theorem states that whenever a random
sample is taken from any distribution (μ, σ²), then the sample mean will be approximately
normally distributed with mean μ and variance σ²/n.
➜ Excel solution
Population mean = Cell D3 Value
Population standard deviation = Cell D4 Value
Sample size n = Cell D6 Value
Standard error = Cell D7 Formula: =D4/D6^0.5
As the sample size is reasonably large (>30), we will apply the Central Limit Theorem to
the problem and assume that the sampling mean distribution is approximately normally
distributed. From equation (5.6) we have $\bar{X} \sim N(\mu, \sigma^2/n) = N(950, 25^2/50)$.
The problem requires us to find $P(\bar{X} < 958)$.
Figure 5.17 illustrates the region to be found that represents this probability.
Figure 5.17 Normal curve showing μ = 950 and $\bar{X}$ = 958
Excel can be used to solve this problem by either using the NORM.DIST () or
NORM.S.DIST () functions.
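Given the population mean (μ = 950), population standard deviation (σ = 25), and sample
size (n = 50), the standard error is $\sigma_{\bar{X}} = \sigma/\sqrt{n} = 25/\sqrt{50} = 3.535533906$, so that
Z = (958 − 950)/3.535533906 = 2.2627 and, from Excel, $P(\bar{X} < 958)$ = NORM.S.DIST (Z, TRUE) = 0.988174192.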
❉ Interpretation Based upon a random sample the probability that the sample mean
is less than 958 hours is 0.988174192 or 98.82%.
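The role of the Central Limit Theorem in this example can be illustrated by simulation. The sketch below, in Python, first computes the normal approximation and then compares it with sample means drawn from a right-skewed population. The population used here (900 plus a gamma variable with shape 4 and scale 12.5) is a hypothetical choice made only because it has mean 950 and standard deviation 25; the book does not specify the shape of the lifetime distribution. The seed and the number of simulated samples are also assumptions.

```python
import random
from math import sqrt
from statistics import NormalDist, fmean

MU, SIGMA, N = 950, 25, 50      # values from Example 5.7

# Normal approximation justified by the Central Limit Theorem, equation (5.6)
z = (958 - MU) / (SIGMA / sqrt(N))
p_normal = NormalDist(MU, SIGMA / sqrt(N)).cdf(958)


def skewed_lifetime(rng):
    """Hypothetical right-skewed lifetime: 900 + Gamma(shape=4, scale=12.5).

    Chosen only so that the mean is 950 hours and the standard deviation is 25.
    """
    return 900 + rng.gammavariate(4, 12.5)


rng = random.Random(42)          # fixed seed so the run is repeatable
num_samples = 5_000
below = sum(
    fmean([skewed_lifetime(rng) for _ in range(N)]) < 958
    for _ in range(num_samples)
)
p_simulated = below / num_samples

print(f"Z = {z:.4f}, normal approximation P = {p_normal:.6f}")
print(f"simulated P from the skewed population = {p_simulated:.4f}")
```

With n = 50 the simulated proportion should be close to, though not identical with, the normal approximation; the remaining gap reflects the skewness of the assumed population and simulation noise.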
In the previous cases we assumed that sampling will have taken place with replacement
(very large or infinite population). If no replacement is undertaken then equation (5.5) is
modified by a correction factor to give equation (5.8):
$$\sigma_{\bar{X}} = \frac{\sigma}{\sqrt{n}} \times \sqrt{\frac{N - n}{N - 1}} \tag{5.8}$$
Example 5.8
A random sample of 30 part-time employees is chosen without replacement from a firm
employing 200 part-time workers. If the mean hours worked per month is 60 hours with a
standard deviation of 5 hours determine the probability that the sample mean: (a) will lie
between 60 and 62 hours, and (b) be over 63 hours. In this example we have a finite popula-
tion of size N (= 200) and a sample size of 30 (n = 30).
From equation (5.8) we can calculate the standard error of the sampling mean and then use
Excel to calculate the two probability values.
Figure 5.18
➜ Excel solution
Population mean = Cell D3 Value
Population standard deviation = Cell D4 Value
Population size N = Cell D6 Value
Sample size n = Cell D7 Value
Standard error = Cell D9 Formula: =(D4/D7^0.5)*SQRT((D6-D7)/(D6-1))
(a)
Sample 1 mean = Cell D12 Value
Sample 2 mean = Cell D13 Value
P = Cell D14 Formula: =NORM.DIST(D13,D3,D9,TRUE)-NORM.DIST(D12,D3,D9,TRUE)
Z1 = Cell D15 Formula: =(D12-D3)/D9
Z2 = Cell D16 Formula: =(D13-D3)/D9
P = Cell D17 Formula: =NORM.S.DIST(D16,TRUE)-NORM.S.DIST(D15,TRUE)
(b)
Sample mean = Cell D20 Value
Z = Cell D21 Formula: =(D20-D3)/D9
P = Cell D22 Formula: =1-NORM.DIST(D20,D3,D9,TRUE)
As the sample size is relatively large for the population, we will apply the Central Limit
Theorem to the problem and assume that the sampling mean distribution is approximately
normally distributed. From equations (5.6) and (5.8) we have $\bar{X} \sim N(\mu, \sigma_{\bar{X}}^2)$,
where $\sigma_{\bar{X}}$ is the corrected standard error.
(a) The problem requires us to find $P(60 < \bar{X} < 62)$.
Figure 5.19 illustrates the region to be found that represents this probability.
Figure 5.19 Normal curve showing μ = 60 and $\bar{X}$ = 62 (Z = 0 and Z = 2.37)
Excel can be used to solve this problem by either using the NORM.DIST () or
NORM.S.DIST () functions.
Given the population mean (μ = 60), population standard deviation (σ = 5), sample
size (n = 30), and standard error ($\sigma_{\bar{X}}$ = 0.84373....) calculate $P(60 < \bar{X} < 62)$.
From Excel, $P(60 < \bar{X} < 62)$ = NORM.DIST (D13, D3, D9, TRUE) – NORM.DIST (D12,
D3, D9, TRUE) = 0.491115714.
$$Z_1 = \frac{\bar{X}_1 - \mu}{\sigma_{\bar{X}}} = \frac{60 - 60}{0.84373} = 0$$
$$Z_2 = \frac{\bar{X}_2 - \mu}{\sigma_{\bar{X}}} = \frac{62 - 60}{0.84373} = 2.3704$$
From Excel, $P(60 < \bar{X} < 62) = P(0 < Z < 2.3704)$ = NORM.S.DIST (Z2, TRUE) −
NORM.S.DIST (Z1, TRUE) = 0.491115714.
Both methods provide the same answer to the problem of calculating the required
probability.
❉ Interpretation Based upon a random sample the probability that the sample mean
lies between 60 and 62 is 0.491115714 or 49.11%.
(b) The problem requires us to find $P(\bar{X} > 63)$.
Figure 5.20 illustrates the region to be found that represents this probability.
Excel can be used to solve this problem by either using the NORM.DIST () or
NORM.S.DIST () functions.
Figure 5.20 Normal curve showing μ = 60 and $\bar{X}$ = 63
Given the population mean (μ = 60), population standard deviation (σ = 5), sample
size (n = 30), and standard error from equation (5.8)
($\sigma_{\bar{X}} = (5/\sqrt{30}) \times \sqrt{(200 - 30)/(200 - 1)}$ = 0.84373....) calculate $P(\bar{X} > 63)$.
$$Z = \frac{\bar{X} - \mu}{\sigma_{\bar{X}}} = \frac{63 - 60}{0.84373} = 3.55560866$$
From Excel, $P(\bar{X} > 63) = P(Z > 3.55560866)$ = 1 − NORM.S.DIST (Z, TRUE) = 0.000188553.
Both methods provide the same answer to the problem of calculating the required
probability.
❉ Interpretation Based upon a random sample the probability that the sample mean
is greater than 63 is 0.000188553 or 0.02%.
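The finite population correction in equation (5.8) is easy to get wrong by hand, so a short cross-check may help. The Python sketch below reproduces the standard error and both probabilities for Example 5.8; the function name `standard_error_fpc` is an illustrative assumption and the code is not part of the book's Excel solution.

```python
from math import sqrt
from statistics import NormalDist


def standard_error_fpc(sigma, n, population_size):
    """Equation (5.8): sigma / sqrt(n) multiplied by the finite population correction."""
    correction = sqrt((population_size - n) / (population_size - 1))
    return (sigma / sqrt(n)) * correction


mu, sigma, n, N = 60, 5, 30, 200           # values from Example 5.8
se = standard_error_fpc(sigma, n, N)
sampling_dist = NormalDist(mu, se)

p_a = sampling_dist.cdf(62) - sampling_dist.cdf(60)   # (a) P(60 < mean < 62)
p_b = 1 - sampling_dist.cdf(63)                       # (b) P(mean > 63)

print(f"uncorrected sigma/sqrt(n) = {sigma / sqrt(n):.5f}")
print(f"corrected standard error  = {se:.5f}")
print(f"(a) {p_a:.9f}   (b) {p_b:.9f}")
```

Printing the uncorrected value alongside the corrected one shows how much the correction matters when the sample is 15% of the population.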
$$\mu_{\rho} = \pi \tag{5.9}$$
Equation (4.9) represents the variance of the binomial distribution, nπ(1 − π). Dividing this
variance by n² and taking the square root gives equation (5.10), the standard deviation (or
standard error) of the sampling proportion, σρ, where π represents the population proportion.
$$\sigma_{\rho} = \sqrt{\frac{\pi(1 - \pi)}{n}} \tag{5.10}$$
From equations (5.9) and (5.10) the sampling distribution of the proportion is
approximated by a binomial distribution with mean (μρ) and standard deviation (σρ).
Furthermore, the sampling distribution of the sample proportion (ρ) can be approximated
with a normal distribution when the probability of success is approximately 0.5,
and nπ and n(1 − π) are at least 5.
$$\rho \sim N\left(\pi, \frac{\pi(1 - \pi)}{n}\right) \tag{5.11}$$
The standardized sample proportion Z value is given by modifying equation (5.7) to give
equation (5.12).
$$Z = \frac{\rho - \pi}{\sqrt{\dfrac{\pi(1 - \pi)}{n}}} \tag{5.12}$$
Example 5.9
It is known that 25% of workers in a factory own a personal computer. Find the probability that
at least 26% of a random sample of 80 workers will own a personal computer. In this example,
we have the population proportion π = 0.25 and sample size n = 80. The problem requires the
calculation of P(ρ ≥ 0.26).
Figure 5.21
➜ Excel solution
Population proportion = Cell D3 Value
Sample proportion = Cell D5 Value
Sample size n = Cell D6 Value
Standard error = Cell D8 Formula: =SQRT(D3*(1-D3)/D6)
Z = Cell D10 Formula: =(D5-D3)/D8
P = Cell D12 Formula: =1-NORM.DIST(D5,D3,D8,TRUE)
P = Cell D14 Formula: =1-NORM.S.DIST(D10,TRUE)
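From equation (5.10) the standard error for the sampling distribution of the proportion
is:
$$\sigma_{\rho} = \sqrt{\frac{\pi(1 - \pi)}{n}} = \sqrt{\frac{0.25(1 - 0.25)}{80}} = 0.04841$$
Substituting this value into equation (5.12) gives the standardized Z value:
$$Z = \frac{\rho - \pi}{\sigma_{\rho}} = \frac{0.26 - 0.25}{0.04841} = 0.2066$$
From Excel, P(ρ ≥ 0.26) = P(Z ≥ 0.2066) = 1 − NORM.S.DIST (Z, TRUE) = 0.4182.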
❉ Interpretation The probability that at least 26% of the workers own a computer is
41.82%.
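The same calculation can be sketched in Python as a cross-check on the Excel result. The helper name `proportion_standard_error` and the variable names are assumptions; the inputs (π = 0.25, n = 80, sample proportion 0.26) come from Example 5.9, and the final check applies the rule of thumb that nπ and n(1 − π) should both be at least 5.

```python
from math import sqrt
from statistics import NormalDist


def proportion_standard_error(pi, n):
    """Equation (5.10): standard error of the sample proportion."""
    return sqrt(pi * (1 - pi) / n)


pi, n, sample_proportion = 0.25, 80, 0.26     # values from Example 5.9
se = proportion_standard_error(pi, n)
z = (sample_proportion - pi) / se             # equation (5.12)
p_at_least = 1 - NormalDist().cdf(z)          # upper-tail probability

# Rule-of-thumb check that the normal approximation is reasonable
normal_ok = n * pi >= 5 and n * (1 - pi) >= 5

print(f"standard error = {se:.5f}, Z = {z:.4f}, P = {p_at_least:.4f}")
print("normal approximation conditions met:", normal_ok)
```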
Figure 5.22
Excel Data Analysis add-in
Select Data > Data Analysis > Random Number Generation and click OK.
Enter the following parameters into Figure 5.23:
Figure 5.23
Excel Random Number Generation
Click OK.
Example 5.10
Consider the problem of sampling from a population which consists of the salaries for pub-
lic sector employees employed by a national government. The historical data suggests that
the population data is normally distributed with mean of €45,000 and standard deviation of
€10,000. We can use Excel to generate ‘N’ random samples with each sample containing ‘n’
data values.
(a) Create 1000 random samples each with 10 data points.
(b) Calculate the mean for each random sample.
(c) Plot the histogram representing the sampling distribution for the sample mean.
(a) Generate ‘N’ samples, each with ‘n’ data values (n = 10, N = 1000), as
illustrated in Figure 5.24
From Excel, Select Data > Data Analysis > Random Number Generation.
Input:
n = 10
N = 1000
Normal distribution
Mean = 45000
SD = 1000
Output range: Cell B5. Click OK.
Figure 5.24 illustrates the completed menu.
The ‘N’ samples are located in the rows of the table of values, for example sample 1:
B5:K5, sample 2: B6:K6, and sample 1000: B1004:K1004.
Figure 5.24
Figure 5.25
Figure 5.26
To create the histogram select Data > Data Analysis > Histogram and select values as
illustrated in Figure 5.27.
Input Range: L5:L1004
Bin Range: N10:N15
Output Range: P9
Click OK
Figure 5.27
Figures 5.28 and 5.29 illustrate the frequency distribution and corresponding histogram.
Figure 5.29 Histogram of the sample means (frequency by bin, bins 44000 to 46500)
From the histogram we note that the histogram values are centred about the population
mean value of €45,000. If we repeated this exercise for different values of sample size ‘n’
we would find that the range would reduce as the sample sizes increase.
Student exercises
X5.1 Five people have all made claims for the amounts shown in Table 5.2.
Person 1 2 3 4 5
Insurance claim, € 500 400 900 1000 1200
Table 5.2
5.3.1 Introduction
In the previous section we explored the sampling distribution of the mean and propor-
tion, and stated that these distributions can be considered to be normal with particular
population parameters (μ, σ2). For many populations, it is likely that we do not know the
value of the population mean (or proportion). Fortunately, we can use the sample mean
(or proportion) to provide an estimate of the population value. The objective of estimation
is to determine the approximate value of a population parameter on the basis of a sample
statistic. The method described in this section is dependent upon the sampling distribu-
tion being normally or approximately normally distributed. We can provide two estimates
of the population value: point and interval estimate.
Figure 5.30 illustrates the relationship between population mean, point, and interval
estimates.
Point estimate
Suppose that you want to find the mean weight of all football players who play in a local
football league. Owing to practical constraints you are unable to measure all the players,
but you are able to select a sample of 25 players at random and weigh them to provide a
sample mean. From Section 5.2 we know that the sampling distribution of the mean is
approximately normally distributed for large sample sizes and that the sample mean can
be considered to be an unbiased estimator of the population mean. After the sampling we
establish that the mean weight of the sample of players is 188 kg. This number becomes
the point estimate of the population mean. If we know, or can estimate, the population
standard deviation (σ), then we can apply equation (5.7) to provide an interval estimate
for the population mean based upon some degree of error between the sample and popu-
lation means. This interval estimate is called the confidence interval for the population
mean (or confidence interval for the population proportion if we are measuring propor-
tions). In this section we shall consider the following topics:
• types of estimates;
• criteria of a good estimator;
• point estimate of the population mean, μ;
• point estimate of the population proportion, π;
• point estimate of the population variance, σ².
In Section 5.4 we shall consider the following topics:
• confidence interval estimate of the population mean (μ) and proportion (π), σ
known;
• confidence interval estimate of the population mean (μ) and proportion (π), σ
unknown, n ≥ 30;
• confidence interval estimate of the population mean (μ) and proportion (π), σ
unknown, n < 30.
Point estimate A point estimate (or estimator) is any quantity calculated from the
sample data which is used to provide information about the population.
Confidence interval A confidence interval gives an estimated range of values which is
likely to include an unknown population parameter.
$$E(\bar{X}) = \mu \tag{5.13}$$
$$VAR(\bar{X}) = \frac{\sigma^2}{n} \tag{5.14}$$
If n grows larger, then the value of the variance of the sample mean grows smaller.
3. If there are two unbiased estimators of a parameter, the one whose variance is smaller
is said to be efficient, for example both the sample mean and median are unbiased
estimators of the population mean. Which one should we use? The sample median
has a greater variance than the sample mean, so we choose the sample mean as it is
relatively efficient when compared with the sample median.
Thus, a point estimate of the population mean, $\hat{\mu}$, is given by equation (5.15):
$$\hat{\mu} = \bar{X} \tag{5.15}$$
In Chapter 4 we noted that the point probabilities in continuous distributions were zero,
and here, in Chapter 5, we are expecting the point estimator to get closer and closer to the
true population value as the sample size increases. The degree of error is not reflected by
the point estimator, but we can employ the concept of the interval estimator to put a prob-
ability to the value of the population parameter lying between two values, with the middle
value being represented by the point estimator. Section 5.4 will discuss the concept of an
interval estimate or confidence interval.
In statistics, the standard deviation is often estimated from a random sample drawn
from a population. In Section 5.2.4 we showed, via a simple example, that the sampling
distribution of the means gives the following rules:
1. The mean of the sample means is an unbiased estimator of the population mean
($\bar{\bar{x}} = \mu$). In other words, the expected value of the sample means equals the
population mean ($E(\bar{x}) = \mu$).
2. The sample variances are a biased estimator of the population variance ($\sigma_{\bar{x}}^2 \neq \sigma^2$).
In other words, the expected value of the sample variances is not equal to the
population variance (E(s) ≠ σ).
The sample variance bias can be corrected using Bessel’s correction, which corrects the
bias in the estimation of the population variance and some, but not all, of the bias in the
estimation of the population standard deviation. The Bessel correction factor is given by
equation (5.16).
$$\frac{n}{n - 1} \tag{5.16}$$
$$s^2 = \sum_{i=1}^{n} \frac{(x_i - \bar{x})^2}{n - 1} \tag{5.17}$$
If you use n rather than n – 1 in equation (5.17) then you are biasing the statistic as an
estimator with the equation, giving an underestimate of the true population variance. It
can be shown mathematically that the sample variance given by equation (5.17) is a point
estimate of the population variance. The Excel function to calculate an unbiased estimate
of the population variance (s²) is VAR.S().
The corrected sample standard deviation is given by equation (5.18).
$$s = \sqrt{\sum_{i=1}^{n} \frac{(x_i - \bar{x})^2}{n - 1}} \tag{5.18}$$
Unfortunately, it can be shown mathematically that not all the bias is removed when
using n – 1 in the equation rather than n, but, fortunately, the amount of bias is negligible
and we assume that equation (5.18) is an unbiased estimator of the population standard
deviation. The Excel function to calculate an unbiased estimate of the population standard
deviation (s) is STDEV.S(). Finally, the standard error of the sample means with the
estimate of the population standard deviation given by the sample standard deviation is
given by equation (5.19).
$$\sigma_{\bar{x}} = \frac{s}{\sqrt{n}} \tag{5.19}$$
Point estimate of the population mean Point estimate for the mean involves the use of
the sample mean to provide a ‘best estimate’ of the unknown population mean.
Point estimate of the population variance Point estimate for the variance involves the
use of the sample variance to provide a ‘best estimate’ of the unknown population variance.
The relationship between the biased sample variance ($s_b^2$) and the unbiased sample
variance ($s^2$) is given by equation (5.20).
$$s^2 = \left(\frac{n}{n - 1}\right) s_b^2 \tag{5.20}$$
Table 5.3
Similarly, the bias in the sample standard deviation is very small when n – 1 is used
instead of n in the denominator. The sample standard deviation is still biased, but the bias
is negligible. For example, for a normally distributed variable the approximate unbiased
estimator of the population standard deviation (σ̂) can be shown to be given by equation
(5.21).
$$\hat{\sigma} = s \times \left(1 + \frac{1}{4(n - 1)}\right) \tag{5.21}$$
Degrees of freedom Refers to the number of independent observations in a sample
minus the number of population parameters that must be estimated from sample data.
Table 5.4 explores the degree of error between the unbiased estimate of the population
standard deviation and the sample standard deviation. The table shows that when the
sample size is 4 the underestimate is 8.33% and when the sample size is 30 the
underestimate is 0.86%. Furthermore, the difference between the two values quickly reduces in
size.
n         4       10      20      30      40      50      100
Error     0.0833  0.0278  0.0132  0.0086  0.0064  0.0051  0.0025
% error   8.3333  2.7778  1.3158  0.8621  0.6410  0.5102  0.2525
Table 5.4
From a practical perspective we assume that equation (5.18) gives an unbiased estima-
tor of the population standard deviation.
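The entries in Table 5.4 follow directly from the correction term in equation (5.21). The Python sketch below recomputes them; the function names `relative_bias` and `approx_unbiased_sd` are illustrative assumptions and do not appear in the book.

```python
def relative_bias(n: int) -> float:
    """Correction term 1 / (4(n - 1)) from equation (5.21)."""
    if n < 2:
        raise ValueError("n must be at least 2")
    return 1 / (4 * (n - 1))


def approx_unbiased_sd(s: float, n: int) -> float:
    """Approximate unbiased estimate of sigma for a normal variable, equation (5.21)."""
    return s * (1 + relative_bias(n))


# Recompute the 'Error' and '% error' rows of Table 5.4
for n in (4, 10, 20, 30, 40, 50, 100):
    error = relative_bias(n)
    print(f"n={n:>3}  error={error:.4f}  % error={100 * error:.4f}")
```

Because the term falls away in proportion to 1/(n − 1), the adjustment is already under 1% by n = 30, which is why the text treats equation (5.18) as unbiased in practice.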
Example 5.11
An experiment on the measurement of the length of rods was performed five times, with the
following results: 1.010, 1.012, 1.008, 1.013, and 1.011. Calculate the unbiased estimates of the
mean and variance of possible measurements, and give an estimate for the standard error of
your estimate of the mean.
Figure 5.31
➜ Excel solution
X Cells B5:B9 Value
(X-Xbar)² Cell C5 Formula: =(B5-$G$9)^2
Copy formula down C5:C9
n = Cell G4 Formula: =COUNT(B5:B9)
ΣX = Cell G5 Formula: =SUM(B5:B9)
Σ(X-Xbar)² = Cell G6 Formula: =SUM(C5:C9)
Formula solution
Sample mean = Cell G9 Formula: =G5/G4
Sample variance = Cell G10 Formula: =G6/(G4-1)
Sample standard deviation = Cell G11 Formula: =G10^0.5
Estimate of population mean = Cell G12 Formula: =G9
Estimate of population standard deviation = Cell G13 Formula: =G11
Estimate of the standard error of the mean = Cell G14 Formula: =G13/G4^0.5
Function solution
Sample mean = Cell G17 Formula: =AVERAGE(B5:B9)
Sample variance = Cell G18 Formula: =VAR.S(B5:B9)
Sample standard deviation = Cell G19 Formula: =STDEV.S(B5:B9)
Estimate of population mean = Cell G20 Formula: =G17
Estimate of population standard deviation = Cell G21 Formula: =G19
Estimate of the standard error of the mean = Cell G22 Formula: =G21/G4^0.5
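The value of the unbiased estimates of the population mean, variance, and standard
error of the mean are provided by solving equations (5.15), (5.17), and (5.19).
Sample size n = 5
$$\text{Sample mean } \bar{X} = \frac{1.010 + 1.012 + 1.008 + 1.013 + 1.011}{5} = 1.0108$$
$$\text{Sample variance } s^2 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1} = \frac{0.0000148}{4} = 0.0000037$$
$$\text{Sample standard deviation } s = \sqrt{s^2} = 0.0019235$$
$$\text{Standard error of the mean } = \frac{s}{\sqrt{n}} = \frac{0.0019235}{\sqrt{5}} = 0.00086$$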
❉ Interpretation The value of the unbiased estimates of the mean, variance, and
standard error are 1.011, 0.0000037, and 0.0009 respectively.
Standard error of the mean The standard error of the mean (SEM) is the standard
deviation of the sample mean’s estimate of a population mean.
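The point estimates for the rod measurements in Example 5.11 can be cross-checked with Python's standard library. This is a sketch for comparison with the Excel functions, not part of the book's solution: `statistics.variance` and `statistics.stdev` use the n − 1 divisor, as VAR.S and STDEV.S do, while `statistics.pvariance` uses n and therefore plays the role of the biased variance in equation (5.20). The variable names are assumptions.

```python
import statistics
from math import sqrt

lengths = [1.010, 1.012, 1.008, 1.013, 1.011]    # data from Example 5.11
n = len(lengths)

mean = statistics.mean(lengths)                  # point estimate of the population mean
variance = statistics.variance(lengths)          # n - 1 divisor (Bessel's correction)
sd = statistics.stdev(lengths)                   # square root of the corrected variance
biased_variance = statistics.pvariance(lengths)  # n divisor, underestimates on average
standard_error = sd / sqrt(n)                    # equation (5.19)

print(f"mean            = {mean:.4f}")
print(f"variance (n-1)  = {variance:.7f}")
print(f"variance (n)    = {biased_variance:.7f}")
print(f"std deviation   = {sd:.7f}")
print(f"standard error  = {standard_error:.5f}")
```

Multiplying the n-divisor variance by n/(n − 1) recovers the corrected variance, which is the relationship stated in equation (5.20).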
5.3.5 Point estimate for the population proportion and variance
In the previous section we provided the equations to calculate the point estimate for the
population mean based upon the sample data. Instead of solving problems involving the
mean we can use the sample proportion to provide point estimates of the population
proportion. Equations (5.22) and (5.23) provide point estimates of the population
proportion and standard error:
$$\text{Estimate of population proportion, } \hat{\pi} = \rho \tag{5.22}$$
$$\text{Estimate of standard error, } \hat{\sigma}_{\rho} = \sqrt{\frac{\hat{\pi}(1 - \hat{\pi})}{n}} \tag{5.23}$$
Point estimate of the population proportion Point estimate for the proportion involves
the use of the sample proportion to provide a ‘best estimate’ of the unknown population
proportion.
Example 5.12
In a sample of 400 textile workers, 184 expressed dissatisfaction regarding a prospective plan
to modify working conditions. Provide a point estimate of the population proportion of total
workers who would be dissatisfied and give an estimate for the standard error of your estimate.
Figure 5.32
➜ Excel solution
Total in sample n = Cell C5 Value
X Cell C6 Value
Sample proportion = Cell C8 Formula: =C6/C5
Estimate population proportion = Cell C12 Formula: =C8
Estimate population standard error = Cell C13 Formula: =SQRT(C12*(1-C12)/C5)
The value of the unbiased estimates of the population proportion and standard
error of the proportion are provided by solving equations (5.22) and (5.23).
(a) Sample values
Sample size n = 400
Number of successes X = 184
Sample proportion ρ = X/n = 184/400 = 0.46
$$\text{Estimate of population standard error } \hat{\sigma}_{\rho} = \sqrt{\frac{\hat{\pi}(1 - \hat{\pi})}{n}} = \sqrt{\frac{0.46(1 - 0.46)}{400}} = 0.0249$$
Standard error of the proportion The standard error of the proportion is the standard
deviation of the sample proportion’s estimate of a population proportion.
❉ Interpretation The value of the unbiased estimates of the proportion and standard
error are 0.46 and 0.0249 respectively.
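$$\bar{X} = \frac{n_1 \bar{X}_1 + n_2 \bar{X}_2}{n_1 + n_2} \tag{5.24}$$
$$\hat{\sigma}^2 = \frac{n_1 s_1^2 + n_2 s_2^2}{n_1 + n_2 - 2} \tag{5.25}$$
$$\hat{\pi} = \frac{n_1 \hat{\pi}_1 + n_2 \hat{\pi}_2}{n_1 + n_2} = \frac{n_1 \rho_1 + n_2 \rho_2}{n_1 + n_2} \tag{5.26}$$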
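Equations (5.24) to (5.26) combine the results of two samples, and a small Python sketch may make the weighting explicit. The function names and the input numbers are hypothetical and chosen only for illustration. Equation (5.25) is implemented exactly as written; this passage does not state whether s₁² and s₂² are computed with divisor n or n − 1, so the function simply takes them as inputs.

```python
def pooled_mean(n1, mean1, n2, mean2):
    """Equation (5.24): sample-size-weighted mean of two sample means."""
    return (n1 * mean1 + n2 * mean2) / (n1 + n2)


def pooled_variance(n1, var1, n2, var2):
    """Equation (5.25), as written: (n1*s1^2 + n2*s2^2) / (n1 + n2 - 2)."""
    return (n1 * var1 + n2 * var2) / (n1 + n2 - 2)


def pooled_proportion(n1, prop1, n2, prop2):
    """Equation (5.26): sample-size-weighted proportion from two samples."""
    return (n1 * prop1 + n2 * prop2) / (n1 + n2)


# Hypothetical inputs, used only to show the arithmetic
print(pooled_mean(10, 2.0, 30, 4.0))            # (20 + 120) / 40
print(pooled_variance(10, 1.0, 30, 2.0))        # (10 + 60) / 38
print(pooled_proportion(100, 0.4, 300, 0.6))    # (40 + 180) / 400
```

In each case the larger sample carries more weight, so the pooled value lies closer to the estimate from the larger sample.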
Student exercises
X5.11 A random sample of 5 values was taken from a population: 8.1, 6.5, 4.9, 7.3, and 5.9.
Estimate the population mean and standard deviation, and the standard error of the
estimate for the population mean.
X5.12 The mean of 10 readings of a variable was 8.7 with standard deviation 0.3. A further
5 readings were taken: 8.6, 8.5, 8.8, 8.7, and 8.9. Estimate the mean and standard
deviation of the set of possible readings using all the data available.
X5.13 Two samples are drawn from the same population as follows: sample 1 (0.4, 0.2, 0.2,
0.4, 0.3, and 0.3) and sample 2 (0.2, 0.2, 0.1, 0.4, 0.2, 0.3, and 0.1). Determine the best
unbiased estimates of the population mean and variance.
X5.14 A random sample of 100 rods from a production line were measured and found to
have a mean length of 12.132 with standard deviation 0.11. A further sample of 50 is
taken. Find the probability that the mean of this sample will be between 12.12 and
12.14.
X5.15 A random sample of 20 children in a large school were asked a question and 12
answered correctly. Estimate the proportion of children in the school who answered
correctly and the standard error of this estimate.
X5.16 A random sample of 500 fish is taken from a lake and marked. After a suitable interval a
second sample of 500 is taken and 25 of these are found to be marked. By considering
the second sample estimate the number of fish in the lake.
5.4.1 Introduction
If we take just one sample from a population we can estimate from the sample a popu-
lation parameter. Our knowledge of sampling error would indicate that the standard
error provides an evaluation of the likely error associated with a particular estimate. If
we assume that the sampling distribution of the sample means is normally distributed
then we can provide a measure of this error in terms of a probability value that the value of
the population mean will lie within a specified interval. This interval is called an interval
estimate (or confidence interval), where the interval is centred at the point estimate for
the population mean. Assuming that the sampling distribution of the mean follows a nor-
mal distribution then we can allocate probability values to these interval estimates. From
equation (5.7) we can restructure the equation to give equation (5.27):
$$\mu = \bar{X} - Z \times \frac{\sigma}{\sqrt{n}} \tag{5.27}$$
From our knowledge of the normal distribution we know that 95% of the distribution
lies within ± 1.96 standard deviations of the mean. Thus, for the distribution of sample
means, 95% of these sample means will lie in the interval defined by equation (5.27).
$$\mu = \bar{X} \pm 1.96 \times \frac{\sigma}{\sqrt{n}}$$
Therefore, this equation tells us that an interval estimate (or confidence interval)
is centred at $\bar{X}$, with a lower value of $\mu_1 = \bar{X} - 1.96 \times \sigma/\sqrt{n}$ and upper value of
$\mu_2 = \bar{X} + 1.96 \times \sigma/\sqrt{n}$, as illustrated in Figure 5.33.
We will now look at how interval estimates and associated levels of confidence can be
calculated.
Figure 5.33 Normal curve f(x) showing the interval from μ1 to μ2 centred at $\bar{X}$
$$\bar{X} - Z \times \frac{\sigma}{\sqrt{n}} \leq \mu \leq \bar{X} + Z \times \frac{\sigma}{\sqrt{n}} \tag{5.28}$$
Example 5.13
Eight samples measuring the length of cloth are sampled from a population where the length
is normally distributed with population standard deviation 0.2. Calculate a 95% confidence
interval for the population mean based on a sample of 8 observations: 4.9, 4.7, 5.1, 5.4, 4.7,
5.2, 4.8, and 5.1.
Figure 5.34
➜ Excel solution
X: Cell B6:B13 Values
X²: Cell C6 Formula: =B6^2
Copy formula down C6:C13
n = Cell C17 Formula: =COUNT(B6:B13)
ΣX = Cell C18 Formula: =SUM(B6:B13)
ΣX² = Cell C19 Formula: =SUM(C6:C13)
Population standard deviation σ = Cell F4 Value
2 tails, 95% confidence interval = Cell F5 Value
CDF = Cell F6 Formula: =1-F5/2
Zcri = Cell F7 Formula: =NORM.S.INV(F6)
Formula Solution
Sample mean = Cell F9 Formula: =C18/C17
Estimate of population mean = Cell F10 Formula: =F9
Standard error of the mean = Cell F11 Formula: =F4/C17^0.5
μ1 = Cell F12 Formula: =F9-F7*F11
μ2 = Cell F13 Formula: =F9+F7*F11
Function Solution
Sample mean = Cell F16 Formula: =AVERAGE(B6:B13)
Estimate of population mean = Cell F17 Formula: =F16
Standard error of the mean = Cell F18 Formula: =F4/C17^0.5
μ1 = Cell F19 Formula: =F16-CONFIDENCE.NORM(F5,F4,C17)
μ2 = Cell F20 Formula: =F16+CONFIDENCE.NORM(F5,F4,C17)
The lower and upper limits of the confidence interval are given by equation (5.28). From Excel: population standard deviation σ = 0.2 (known), sample mean $\bar{X} = 4.9875$, sample size n = 8, and value of Z for 95% confidence = ±1.96. Substituting the values into equation (5.28) gives:
$$\text{Standard error } \sigma_{\bar{X}} = \frac{\sigma}{\sqrt{n}} = \frac{0.2}{\sqrt{8}} = 0.0707$$
$$\mu_1 = \bar{X} - Z \times \frac{\sigma}{\sqrt{n}} = 4.9875 - 1.96 \times 0.0707 = 4.8489$$
$$\mu_2 = \bar{X} + Z \times \frac{\sigma}{\sqrt{n}} = 4.9875 + 1.96 \times 0.0707 = 5.1261$$
Figure 5.35 illustrates the 95% confidence interval for the population mean.
Thus, the 95% confidence interval for μ is 4.9875 ± 1.96 × 0.0707 = 4.9875 ± 0.1386, that is, 4.8489 → 5.1261.
[Figure 5.35: Normal curve showing the 95% confidence interval for µ, with µ1 = 4.8489, X̄ = 4.9875, and µ2 = 5.1261.]
❉ Interpretation The 95% confidence interval for the population mean is 4.8489 to
5.1261.
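For readers who want to check the Excel result outside a spreadsheet, the calculation in Example 5.13 can be sketched in Python using only the standard library. The function name `z_interval` is an illustrative choice; `NormalDist().inv_cdf` plays the role of NORM.S.INV.

```python
from statistics import NormalDist, mean

def z_interval(sample, sigma, confidence=0.95):
    """Confidence interval for the mean when the population sigma is known."""
    n = len(sample)
    x_bar = mean(sample)
    z_crit = NormalDist().inv_cdf(1 - (1 - confidence) / 2)  # like NORM.S.INV
    std_error = sigma / n ** 0.5
    margin = z_crit * std_error
    return x_bar - margin, x_bar + margin

# Data and population standard deviation from Example 5.13
lengths = [4.9, 4.7, 5.1, 5.4, 4.7, 5.2, 4.8, 5.1]
lower, upper = z_interval(lengths, sigma=0.2)
print(f"95% confidence interval: {lower:.4f} to {upper:.4f}")
```

The limits should agree with the values of 4.8489 and 5.1261 obtained from Excel above.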
The value of the critical z statistic at a particular significance level can be found from the normal distribution tables provided online. Table 5.5 illustrates an example of this, with the critical z value identified for the probability P(Z ≥ z) = 2.5% = 0.025 (right-hand tail in Figure 5.35).
From Table 5.5, the critical z value = 1.96 when P(Z ≥ z) = 0.025. Given that we have two tails, the critical z value = ±1.96.
Note This is often the case in many student research projects: they handle small sample sizes and the population standard deviation is unknown.
If we have less information about the population then we would expect the probability of the population mean lying within 1.96 standard errors of the sample mean to be smaller when the population standard deviation is unknown than when it is known.
The question then becomes: Can we measure how much smaller this probability will be? This question was answered by W. S. Gosset, who determined the distribution of the mean when divided by an estimate of the standard error. The resultant distribution is called the Student’s t distribution.
If the random variable X is normally distributed, then the test statistic has a t distribution with n – 1 degrees of freedom and with the test statistic defined by equation (5.29).
$$t_{df} = \frac{\bar{X} - \mu}{s/\sqrt{n}} \qquad (5.29)$$
Note The t distribution is very similar to the normal distribution when the estimate
of variance is based on many degrees of freedom (df = n – 1), but has relatively more scores
in its tails when there are fewer degrees of freedom. The t distribution is symmetric, like the
normal distribution, but flatter.
Figure 5.36 shows the t distribution with five degrees of freedom and the standard nor-
mal distribution. The t distribution is flatter than the normal distribution (leptokurtic).
[Figure 5.36: Standard normal distribution (Z) and t distribution with five degrees of freedom (T), plotted against Z or t from –6 to 6.]
As the t distribution is leptokurtic, the percentage of the distribution within 1.96 stand-
ard deviations of the mean is less than the 95% for the normal distribution.
However, if the number of degrees of freedom (df) is large (df = n – 1 ≥ 30) then there is very little difference between the two probability distributions. The standard error for the t distribution is given by the sample standard deviation (s) and sample size (n), as defined by equation (5.30).
$$\sigma_{\bar{X}} = \frac{\hat{\sigma}}{\sqrt{n}} = \frac{s}{\sqrt{n}} \qquad (5.30)$$
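The statement that less than 95% of a t distribution lies within ±1.96 of its centre, and that the gap closes as the degrees of freedom grow, can be checked numerically. The sketch below is one possible approach, not the method used in the text: it integrates the t density with Simpson's rule, and the function names and the choice of 1000 integration steps are assumptions made for illustration.

```python
import math
from statistics import NormalDist

def t_density(t, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_coeff = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                 - 0.5 * math.log(df * math.pi))
    return math.exp(log_coeff - (df + 1) / 2 * math.log1p(t * t / df))

def central_area(limit, df, steps=1000):
    """P(-limit <= T <= limit): Simpson's rule on [0, limit], doubled by symmetry."""
    h = limit / steps  # steps must be even for Simpson's rule
    total = t_density(0.0, df) + t_density(limit, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_density(i * h, df)
    return 2 * total * h / 3

normal_area = 2 * NormalDist().cdf(1.96) - 1
print(f"normal:      {normal_area:.4f}")
for df in (5, 7, 30, 100):
    print(f"t, df = {df:>3}: {central_area(1.96, df):.4f}")
```

The areas for the t distribution should all fall below the normal value and move towards it as df increases, which is why a critical value larger than 1.96 is needed for small samples.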
The degrees of freedom and confidence interval are given by equations (5.31) and
(5.32).
$$df = n - 1 \qquad (5.31)$$
$$\bar{X} - t_{df} \times \frac{s}{\sqrt{n}} \le \mu \le \bar{X} + t_{df} \times \frac{s}{\sqrt{n}} \qquad (5.32)$$
Example 5.14
For the following sample of 8 observations from an infinite normal population find the sample
mean and standard deviation, and hence determine the standard error, the population stand-
ard deviation, and a 95% confidence interval for the mean: 10.3, 12.4, 11.6, 11.8, 12.6, 10.9,
11.2, and 10.3.
Figure 5.37
➜ Excel solution
X Cell B6:B13
(X – Xbar)² Cell C6 Formula: =(B6-$G$9)^2
Copy formula down C6:C13
n = Cell C18 Formula: =COUNT(B6:B13)
ΣX = Cell C19 Formula: =SUM(B6:B13)
Σ(X – Xbar)² = Cell C20 Formula: =SUM(C6:C13)
The lower and upper limits of the confidence interval are given by equation (5.32). From Excel: sample mean $\bar{X} = 11.3875$, sample size n = 8, sample variance = 0.7641071, sample standard deviation s = 0.8741322, and the value of t with 7 degrees of freedom for 95% confidence, $t_7$ = ±2.3646243. Substituting values into equation (5.32) gives:
$$\text{Standard error } \sigma_{\bar{X}} = \frac{s}{\sqrt{n}} = \frac{0.8741322}{\sqrt{8}} = 0.3090524$$
$$\mu_1 = \bar{X} - t_7 \times \frac{s}{\sqrt{n}} = 11.3875 - 2.3646243 \times 0.3090524 = 10.656707$$
$$\mu_2 = \bar{X} + t_7 \times \frac{s}{\sqrt{n}} = 11.3875 + 2.3646243 \times 0.3090524 = 12.118293$$
Figure 5.38 illustrates the 95% confidence interval for the population mean.
Thus, the 95% confidence interval for μ is 11.3875 ± 2.3646243 × 0.3090524 = 10.6567 → 12.1183.
❉ Interpretation We are 95% confident that, on the basis of the sample, the true
population mean is between 10.6567 and 12.1183.
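The same interval can be reproduced with a short Python sketch. Python's standard library has no t distribution, so the critical value 2.3646243 is simply copied from the worked solution above rather than computed; the function name `t_interval` is an illustrative choice.

```python
from statistics import mean, stdev

# Two tail 5% critical value of t for 7 degrees of freedom, as quoted in the text
T_CRIT_7DF_95 = 2.3646243

def t_interval(sample, t_crit):
    """Confidence interval for the mean when sigma is estimated from the sample."""
    n = len(sample)
    x_bar = mean(sample)
    s = stdev(sample)            # n - 1 divisor, like STDEV.S
    std_error = s / n ** 0.5
    return x_bar - t_crit * std_error, x_bar + t_crit * std_error

# Data from Example 5.14
data = [10.3, 12.4, 11.6, 11.8, 12.6, 10.9, 11.2, 10.3]
lower, upper = t_interval(data, T_CRIT_7DF_95)
print(f"mean = {mean(data):.4f}, s = {stdev(data):.7f}")
print(f"95% confidence interval: {lower:.4f} to {upper:.4f}")
```

The limits should agree with the 10.6567 and 12.1183 found above.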
[Figure 5.38: t distribution with 7 df, showing the 95% confidence interval for µ on the t7 axis.]
The value of the critical t statistic at a particular significance level and degrees of freedom can be found from the Student’s t distribution tables provided online.
Table 5.6 illustrates an example of this with the critical t value identified for a particular value of the probability P(T ≥ t) = 2.5% = 0.025 (right-hand tail in Figure 5.38) (ALPHA = 2 × 0.025 = 0.05) and degrees of freedom = n – 1 = 7.
ALPHA    50%     20%     10%     5%      2.50%   1%
         0.5     0.20    0.1     0.05    0.025   0.01
df
1        1.00    3.08    6.31    12.71   25.45   63.66
2        0.82    1.89    2.92    4.30    6.21    9.92
3        0.76    1.64    2.35    3.18    4.18    5.84
4        0.74    1.53    2.13    2.78    3.50    4.60
5        0.73    1.48    2.02    2.57    3.16    4.03
6        0.72    1.44    1.94    2.45    2.97    3.71
7        0.71    1.41    1.89    2.36    2.84    3.50
8        0.71    1.40    1.86    2.31    2.75    3.36
Table 5.6 Calculation of t for P(T ≥ t) = 0.025 with 7 degrees of freedom (df)
From Table 5.6, the critical t value = 2.36 when P(T ≥ t) = 0.025 and 7 degrees of free-
dom. Given that we have two tails then the critical t value = ±2.36.
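To see where the entries in a t table come from, the sketch below searches for the value of t that leaves a given probability in the right-hand tail. It is an illustration under stated assumptions rather than the method used to build Table 5.6: the tail area is found by Simpson's rule and the critical value by bisection, and the function names, step count, and search range of 0 to 100 are all choices made here.

```python
import math

def t_density(t, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_coeff = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                 - 0.5 * math.log(df * math.pi))
    return math.exp(log_coeff - (df + 1) / 2 * math.log1p(t * t / df))

def upper_tail(t, df, steps=1000):
    """P(T >= t) for t >= 0: one half minus the Simpson's rule area on [0, t]."""
    if t == 0:
        return 0.5
    h = t / steps
    total = t_density(0.0, df) + t_density(t, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_density(i * h, df)
    return 0.5 - total * h / 3

def t_critical(tail_prob, df, lo=0.0, hi=100.0):
    """Find t with P(T >= t) = tail_prob by bisection."""
    for _ in range(40):
        mid = (lo + hi) / 2
        if upper_tail(mid, df) > tail_prob:
            lo = mid   # tail area still too large, so move to the right
        else:
            hi = mid
    return (lo + hi) / 2

# Right-hand tail of 0.025, i.e. ALPHA = 0.05 for two tails
for df in range(1, 9):
    print(f"df = {df}: t = {t_critical(0.025, df):.2f}")
```

The printed values should match the ALPHA = 0.05 column of Table 5.6, including 2.36 for 7 degrees of freedom.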
$$Z = \frac{\bar{X} - \mu}{s/\sqrt{n}} \qquad (5.33)$$
$$\bar{X} - Z \times \frac{s}{\sqrt{n}} \le \mu \le \bar{X} + Z \times \frac{s}{\sqrt{n}} \qquad (5.34)$$
Example 5.15
Eight samples measuring the length of cloth are sampled from a population where the length is
normally distributed with population standard deviation unknown. Calculate a 95% confidence
interval for the population mean based on a sample of 8 observations: 4.9, 4.7, 5.1, 5.4, 4.7,
5.2, 4.8, and 5.1.
Note We are using a small sample to illustrate the application of the method. When n
<30 (σ unknown), we would use the Student’s t distribution to fit the confidence interval.
Figure 5.39
➜ Excel solution
X: Cell B6:B13
(X – Xbar)²: Cell C6 Formula: =(B6-$F$9)^2
Copy Formula from C6:C13
n = Cell C17 Formula: =COUNT(B6:B13)
ΣX = Cell C18 Formula: =SUM(B6:B13)
Σ(X – Xbar)² = Cell C19 Formula: =SUM(C6:C13)
2 tails, 95% confidence interval = Cell F5 Value
CDF = Cell F6 Formula: =1-F5/2
Zcri = Cell F7 Formula: =NORM.S.INV(F6)
Formula solution
Sample mean = Cell F9 Formula: =C18/C17
Estimate of population mean = Cell F10 Formula: =F9
Sample variance = Cell F11 Formula: =C19/(C17-1)
Sample standard deviation = Cell F12 Formula: =F11^0.5
Standard error of the mean = Cell F13 Formula: =F12/C17^0.5
μ1 = Cell F14 Formula: =F9-F7*F13
μ2 = Cell F15 Formula: =F9+F7*F13
Function solution
Sample mean = Cell F18 Formula: =AVERAGE(B6:B13)
Estimate of population mean = Cell F19 Formula: =F18
Sample variance = Cell F20 Formula: =VAR.S(B6:B13)
Sample standard deviation = Cell F21 Formula: =STDEV.S(B6:B13)
Standard error of the mean = Cell F22 Formula: =F21/C17^0.5
μ1 = Cell F23 Formula: =F18-CONFIDENCE.NORM(F5,F21,C17)
μ2 = Cell F24 Formula: =F18+CONFIDENCE.NORM(F5,F21,C17)
The lower and upper limits of the confidence interval are given by equation (5.34). From Excel: sample mean $\bar{X} = 4.9875$, sample size n = 8, sample variance = 0.064107143, sample standard deviation s = 0.253193884, and value of Z for 95% confidence = ±1.96.
Substituting values into equation (5.34) gives:
$$\text{Standard error } \sigma_{\bar{X}} = \frac{s}{\sqrt{n}} = \frac{0.253193884}{\sqrt{8}} = 0.08951755$$
$$\mu_1 = \bar{X} - Z \times \frac{s}{\sqrt{n}} = 4.9875 - 1.96 \times 0.08951755 = 4.8120488$$
$$\mu_2 = \bar{X} + Z \times \frac{s}{\sqrt{n}} = 4.9875 + 1.96 \times 0.08951755 = 5.1629512$$
Figure 5.40 illustrates the 95% confidence interval for the population mean.
Thus, the 95% confidence interval for μ is 4.9875 ± 1.96 × 0.08951755 = 4.81 → 5.16.
[Figure 5.40: Normal curve showing µ1 = 4.81, X̄ = 4.9875, and µ2 = 5.16 on the µ axis.]
❉ Interpretation The 95% confidence interval for the population mean is 4.81–5.16.
$$\rho - Z \times \sqrt{\frac{\rho(1 - \rho)}{n}} \le \mu \le \rho + Z \times \sqrt{\frac{\rho(1 - \rho)}{n}} \qquad (5.35)$$
Example 5.16
In Example 5.9 we stated that 25% of workers in a factory own a personal computer. If this was
not known we could use the idea of a confidence interval to put a level of confidence on the
population proportion based upon the sample data collected. The sample data resulted in a
sample proportion = 0.26 with a sample size = 80.
Figure 5.41
➜ Excel solution
Sample proportion = Cell C5 Value
Sample size n = Cell C6 Value
Point estimate of population mean = Cell C8 Formula: =C5
Two tails, 95% confidence interval = Cell C10 Value
Proportion in right and left tails = Cell C11 Formula: =C10/2
Upper Zcri = Cell C12 Formula: =NORM.S.INV(1-C11)
Level of confidence: The confidence level is the probability value (1 – α) associated with a confidence interval.
The value of the lower and upper confidence interval is given by equation (5.35). From Excel: sample proportion ρ = 0.26, sample size n = 80, and value of Z for 95% confidence = ±1.96. Substituting values into equation (5.35) gives:
$$\text{Standard error } \sigma_{\rho} = \sqrt{\frac{\rho(1 - \rho)}{n}} = \sqrt{\frac{0.26(1 - 0.26)}{80}} = 0.0490408$$
$$\mu_1 = \rho - Z \times \sqrt{\frac{\rho(1 - \rho)}{n}} = 0.26 - 1.96 \times 0.0490408 = 0.1638818$$
$$\mu_2 = \rho + Z \times \sqrt{\frac{\rho(1 - \rho)}{n}} = 0.26 + 1.96 \times 0.0490408 = 0.3561182$$
Figure 5.42 illustrates the 95% confidence interval for the population proportion.
Thus, the 95% confidence interval for ρ is 0.26 ± 1.96 × 0.0490408 = 0.16 → 0.36.
[Figure 5.42: Normal curve with –1.96, 0, and 1.96 marked on the Z axis.]
❉ Interpretation The 95% confidence interval for the proportion of people who own a personal computer in the whole population is between 16.4% and 35.6%.
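The interval for a proportion in Example 5.16 can be checked with the following Python sketch. The function name `proportion_interval` is illustrative, and the calculation uses the same normal approximation as equation (5.35).

```python
from statistics import NormalDist

def proportion_interval(p_hat, n, confidence=0.95):
    """Normal-approximation confidence interval for a population proportion."""
    z_crit = NormalDist().inv_cdf(1 - (1 - confidence) / 2)  # like NORM.S.INV
    std_error = (p_hat * (1 - p_hat) / n) ** 0.5
    margin = z_crit * std_error
    return p_hat - margin, p_hat + margin

# Sample proportion and sample size from Example 5.16
lower, upper = proportion_interval(0.26, 80)
print(f"95% confidence interval: {lower:.4f} to {upper:.4f}")
```

Expressed as percentages, the limits should correspond to roughly 16.4% and 35.6%.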
Student exercises
X5.17 The standard deviation for a method of measuring the concentration of nitrate ions
in water is known to be 0.05 ppm. If 100 measurements give a mean of 1.13 ppm,
calculate the 90% confidence limits for the true mean.
X5.18 In trying to determine the sphere of influence of a sports centre a random sample
of 100 visitors was taken. This indicated a mean travel distance (d) of 10 miles with
a standard deviation of 3 miles: (a) What are the 90% confidence limits for the
population mean travel distance (D), and (b) What sample size would be required to
ensure that the confidence interval for D was 0.5 miles at the 95% level?
X5.19 The masses, in grams, of 13 ball bearings taken at random from a batch are: 21.4,
23.1, 25.9, 24.7, 23.4, 24.5, 25.0, 22.5, 26.9, 26.4, 25.8, 23.2, and 21.9. Calculate a 95%
confidence interval for the mean mass of the population, supposed normal, from
which these masses were drawn.
$$\text{Interval} = 2 \times Z \times \frac{\sigma}{\sqrt{n}} \qquad (5.36)$$
Rearranging equation (5.36) enables the calculation of the sample size via equation (5.37).
$$n = \left(\frac{2 \times Z \times \sigma}{\text{Interval}}\right)^2 \qquad (5.37)$$
Example 5.17
A researcher determines that a margin of error (or sampling error, e) of no more than ±0.05 units (a total interval width of 0.1) is desired, along with a 98% level of confidence. If we assume a normally distributed population with standard deviation 0.2, calculate the sample size, n.
Figure 5.43
➜ Excel solution
Specified interval = Cell C4 Value
Population standard deviation = Cell C5 Value
Two tails, 98% confidence interval = Cell C7 Value
Proportion in right and left hand tails = Cell C8 Formula: =C7/2
Upper Zcri = Cell C9 Formula: =NORM.S.INV(1-C8)
Sample size n = Cell C11 Formula: =(2*C9*C5/C4)^2
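From Excel: interval = 0.1, population standard deviation = 0.2, Zcri for 98% = ±2.326347874, and the sample size is calculated from equation (5.37).
$$n = \left(\frac{2 \times Z \times \sigma}{\text{Interval}}\right)^2 = \left(\frac{2 \times 2.326347874 \times 0.2}{0.1}\right)^2 = 86.5903109$$
As the sample size must be a whole number, we round up to n = 87.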
Figure 5.44 illustrates the relationship between interval, confidence interval, and size
of sample.
[Figure 5.44: Normal curve with µ1, X̄, and µ2 marked on the µ axis.]
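Equation (5.37) translates directly into a few lines of Python. In this sketch the function name `sample_size` is an illustrative choice, and rounding up with `math.ceil` reflects the usual convention that a fractional sample size is rounded to the next whole observation.

```python
import math
from statistics import NormalDist

def sample_size(interval_width, sigma, confidence):
    """Sample size from equation (5.37): exact value and value rounded up."""
    z_crit = NormalDist().inv_cdf(1 - (1 - confidence) / 2)  # like NORM.S.INV
    n_exact = (2 * z_crit * sigma / interval_width) ** 2
    return n_exact, math.ceil(n_exact)

# Values from Example 5.17: interval = 0.1, sigma = 0.2, 98% confidence
n_exact, n_required = sample_size(interval_width=0.1, sigma=0.2, confidence=0.98)
print(f"n from equation (5.37) = {n_exact:.4f}, rounded up to {n_required}")
```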
Note To see what impact the selection of the margin of error and level of confidence has on the sample size, we’ll run a small simulation. We’ll keep all the data from the previous example (Table 5.7).
Table 5.7
By keeping the same margin of error, but changing the level of confidence, we can see how the sample size changes. Effectively, in this example, we need to increase the sample size roughly two and a half times if we want our level of confidence to increase from 90% to 99%.
Let’s now keep the level of confidence constant, at 90%, but let’s change the margin of
Table 5.8
As we can see, the margin of error has a tremendous impact on the sample size. This
explains why political polls are often conducted with a 3% error margin. To increase the
accuracy in this case we would have to increase the sample size tenfold, which is clearly
too expensive. It is particularly important to emphasize here that the margin of error
depends very little on the size of the population from which we are sampling, as long as
the sampling fraction is less than 5% of the total population. For very large populations,
the impact is almost negligible.
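A similar small simulation can be run in Python. The sketch below keeps σ = 0.2 from Example 5.17; the particular confidence levels and interval widths tried are assumptions chosen for illustration and are not necessarily those shown in Tables 5.7 and 5.8.

```python
import math
from statistics import NormalDist

def sample_size(interval_width, sigma, confidence):
    """Sample size from equation (5.37), rounded up to a whole observation."""
    z_crit = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return math.ceil((2 * z_crit * sigma / interval_width) ** 2)

SIGMA = 0.2  # population standard deviation from Example 5.17

print("Interval width fixed at 0.1, level of confidence varied")
for confidence in (0.90, 0.95, 0.98, 0.99):
    print(f"  {confidence:.0%}: n = {sample_size(0.1, SIGMA, confidence)}")

print("Level of confidence fixed at 90%, interval width varied")
for width in (0.2, 0.1, 0.05, 0.02):
    print(f"  width {width}: n = {sample_size(width, SIGMA, 0.90)}")
```

Because the interval width is squared in the denominator of equation (5.37), halving the width roughly quadruples the required sample size, whereas raising the level of confidence has a more modest effect.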
Student exercise
X5.20 A business analyst has been requested by the managing director of a national
supermarket chain to undertake a business review of the company. One of the key
objectives is to assess the level of spending of shoppers who, historically, have weekly
mean levels of spending of €168.00 with a standard deviation of €15.65. Calculate
the size of a random sample to produce a 98% confidence interval for the population
mean spend, given that the interval is €30? Is the sample size appropriate given the
practical factors?
■ Techniques in practice
TP1 Concerned at the time to react to customer complaints CoCo S.A. has implemented
a new set of procedures for its support centre staff. The customer service director has directed
that a suitable test is applied to a new sample to assess whether the new target mean time for
responding to customer complaints is 28 days (Table 5.9).
20 33 33 29 24 30
40 33 20 39 32 37
32 50 36 31 38 29
15 33 27 29 43 33
31 35 19 39 22 21
28 22 26 42 30 17
32 34 39 39 32 38
Table 5.9
TP2 Bakers Ltd is currently undertaking a review of the delivery vans used to deliver prod-
ucts to customers. The company runs two types of delivery van (type A, recently purchased,
and type B, at least 3 years old) which are supposed to be capable of achieving 20 km per litre
of petrol. A new sample has now been collected (Table 5.10).
A B A B
17.68 15.8 26.42 34.8
18.72 36.1 25.22 16.8
26.49 6.3 13.52 15.0
26.64 12.3 14.01 28.9
9.31 15.5 33.9
22.38 40.1 27.1
20.23 20.4 16.8
28.80 3.7 23.6
17.57 13.6 29.7
9.13 35.1 28.2
20.98 33.3
Table 5.10
TP3 Skodel Ltd is developing a low calorie lager for the European market with a mean
designed calorie count of 43 calories per 100 ml. The new product development team are
having problems with the production process and have collected two independent random
samples to assess whether the target calorie count is being met (assume the population vari-
ables are normally distributed) (Table 5.11).
A B A B
49.7 39.4 45.2 34.5
45.9 46.5 40.5 43.5
37.7 36.2 31.9 37.8
40.6 46.7 41.9 39.7
34.8 36.5 39.8 41.1
51.4 45.4 54.0 33.6
34.3 38.2 47.8 35.8
63.1 44.1 26.3 44.6
41.2 58.7 31.7 38.4
41.4 47.1 45.1 26.1
41.1 59.7 47.9 30.7
Table 5.11
■ Summary
In this chapter we have provided an introduction to the important statistical concept of sampling and have explored methods that can be used to provide point estimates and confidence intervals.
We have shown that the Central Limit Theorem is a very important theorem that allows the
application of a range of statistical tests to be performed.
1. We have shown how the Central Limit Theorem can eliminate the need to construct
a sampling distribution by examining all possible samples that might be drawn from a
population. The Central Limit Theorem allows us to determine the sampling distribution
by using the population mean and variance values or estimates of these obtained from a
sample.
2. Furthermore, an unbiased estimate of the population mean is provided by the sample
mean and the sample variance (or standard deviation) is a biased estimate of the
population variance (or standard deviation).
3. From the Central Limit Theorem we know that the sampling distribution can be
approximated by the normal distribution.
We have shown that as the sample size increases the standard error decreases, but please be
aware that any advantage quickly vanishes as any improvements in standard error tend to be
smaller as the sample size gets larger and larger. The next chapter will now use these results to
introduce the concept of statistical hypothesis testing. In Chapter 6 we shall explore testing a state-
ment about the value of a population parameter given information about one or two samples.
■ Key terms
Central limit theorem
Confidence interval
Critical value
Degrees of freedom
Estimate
Level of confidence
Point estimate
Point estimate of the population mean
■ Further reading
Textbook resources
1. Whigham, D. (2007) Business Data Analysis using Excel. Oxford: Oxford University Press.
2. Lindsey, J. K. (2003) Introduction to Applied Statistics: A Modelling Approach (2nd edn).
Oxford: Oxford University Press.
Web resources
1. StatSoft Electronic Textbook http://www.statsoft.com/textbook/stathome.html (accessed 25 May 2012).
2. HyperStat Online Statistics Textbook http://davidmlane.com/hyperstat/index.html (accessed 25 May 2012).
3. Eurostat: website is updated daily and provides direct access to the latest and most complete statistical information available on the European Union (EU), the EU Member States, the Euro-zone and other countries http://epp.eurostat.ec.europa.eu (accessed 25 May 2012).
4. Economagic: contains international economic data sets http://www.economagic.com (accessed 25 May 2012).
5. The International Statistical Institute (ISI) glossary of statistical terms provides definitions in a number of different languages http://isi.cbs.nl/glossary/index.htm (accessed 25 May 2012).
6 Introduction to parametric hypothesis testing
» Overview «
Experiments, surveys, and pilot projects are often carried out with the objective of testing a
theory, or hypothesis, about the nature of the process under investigation. Consider a UK
company attempting to enter the German market. They appoint a distributor in Bavaria who
reports that, on average, 2.7 litres of their product are consumed per week per family. Is this
number representative and indicative of the whole country? Should they decide to expand the
network of distributors? What confidence can be assigned to these numbers? Are these figures
from Germany comparable with the UK market? Just how much confidence can be placed
on the inference that there is no difference between the two populations? In order to provide
answers to these questions we set up a statement (a hypothesis) and test its validity by the
application of probability theory.
In this chapter we shall explore a range of hypothesis tests for one and two samples where
the population is normally distributed. The type of test employed, z or t, depends mainly on the
sample size. Even if the population is not normal, the tests will still give an approximate solution
if the sample size is sufficiently large (Central Limit Theorem). Chapter 7 will extend the range
of hypothesis tests to include the so-called non-parametric tests, which may be used in place
of parametric tests when the modelling assumptions are doubtful.
» Learning objectives «
On completing this unit you will be able to:
» understand the concept of the null and alternative hypothesis;
» understand the difference between one and two samples;
» understand the difference between the terms parametric and non-parametric;
» identify appropriate one and two sample tests;
» explain what is meant by a significance level;
» choose an appropriate sampling distribution;
Example 6.1
The historical output by employees is a mean rate of 100 units per hour with a standard devia-
tion of 20 units per hour. A new employee is tested on 36 separate random occasions and
is found to have an output of 90 units per hour. Does this indicate that the new employee’s
output is significantly different from the population mean output?
Figure 6.1
Figure 6.1 illustrates the Excel solution to solve the problem outlined in Example 6.1.
As we can see, we have used several built-in Excel functions, which we will explain
shortly, to help us make a decision. What is our decision? In this example we would reject
the null hypothesis H0 in favour of the alternative hypothesis H1 and conclude that there
is a significant difference between the new employee’s output and the firm’s existing
employee output. In fact, this test gives us power to state that we are 95% certain of our
decision. How did we do this? Hypothesis testing requires only a few strict steps and they
are as follows:
1 State hypothesis
5 Make a decision
As we have already introduced how to state the hypothesis, let’s explain the remaining four steps.
Level of significance: The level of significance is the criterion used for rejecting the null hypothesis.
[Figure 6.2: Flowchart for choosing a parametric test. Recoverable labels: Number of samples?; One sample Z test (σ known); Z test for proportion; Z test for two proportions; Independent or dependent samples?; Compare means?; Compare variance?; Paired t-test.]
1. What are you testing: difference or association? For parametric tests we are measuring the difference between data values.
2. What is the type of data being measured? For parametric tests we are dealing with interval/ratio data.
3. Can we assume that the population is normally distributed? For parametric tests we expect the variable(s) being measured to be normally distributed or approximately normally distributed.
4. How many samples? We are dealing with one and two sample parametric tests. If we have more than two samples then we would be dealing with an advanced statistical hypothesis concept called ANOVA. This topic is described in the online workbook ‘Factorial experiments’.
5. From Figure 6.2 we can then choose the appropriate test by answering extra questions regarding whether we are dealing with means or proportions, or whether two samples are related (or dependent) or independent of one another.
It is important to note that we have a range of other hypothesis tests to measure association (see Chapter 7) and in dealing with distribution free tests (see online workbook ‘Factorial experiments’).
Two sample t-test for population mean (independent samples, unequal variances): A two sample t-test for population mean (independent samples, unequal variances) is used when two separate sets of independent but differently distributed samples are obtained, one from each of the two populations being compared.
F test for two population variances (variance ratio test): F test for two population variances (variance ratio test) is used to test if the variances of two populations are equal.
❉ Interpretation If an analyst states that the results are significant at the 5% level then what they are saying is that, if the null hypothesis were true, there would be no more than a 5% probability of sample data values this extreme occurring by chance. An alternative view is to use the concept of a confidence interval. In this case we can observe that we are 95% confident that the results have not occurred by chance.
Note Most of the examples in this chapter use 0.05 for the level of significance. In practice
you will notice that sometimes certain hypotheses can be accepted at that level of significance,
but would have to be rejected if we used 0.01 as the level of significance. What do we do in such
situations? Read on further and section 6.1.9 on the types of errors might offer some resolution.
[Figure 6.3: Probability density curves plotted against Z or T value (horizontal axis from –4.0 to 4.0, vertical axis from 0.0 to 0.4).]
normal and t distributions decreases as the number of degrees of freedom increases and that very little numerical difference exists between the normal and t distributions when we have sample sizes ≥ 30.
From this concept we can calculate the corresponding test statistic and calculate the critical test statistic value given a significance level.
Critical test statistic: The critical value for a hypothesis test is a limit at which the value of the sample test statistic is judged to be such that the null hypothesis may be rejected.
6.1.7 One and two tail tests
In Section 6.1.1 we stated that the alternative hypotheses can be written as H1: μ ≠ €31,000 or H1: μT ≠ μL. The ≠ sign tells us that we are not sure what the direction of the difference will be (< or >) but that a difference exists. In this case we have a two tailed test. It is possible that we are assessing that the average accountant’s salary is greater than €31,000 (implying H1: μ > €31,000) or is smaller than €31,000 (implying H1: μ < €31,000). In both cases the direction is known and these are known as one tail tests.
The hypothesis test set up (H0 and H1) will tell you automatically whether you have a one or two tailed test. The region of rejection is located in the tail(s) of the distribution. The exact location is determined by the way H1 is expressed. If H1 simply states that there is a difference, for example H1: μ ≠ 100, then the region of rejection is located in both tails of the sampling distribution with areas equal to α/2.
For example, if α is set at 0.05 then the area in each tail will be 0.025 (see Figure 6.4). This is known as a two tail test. If H1 states that there is a direction of difference, for example μ < 100 or μ > 100, then the region of rejection is located in one tail of the sampling distribution, the tail being defined by the direction of the difference.
Hence, for a less than direction (H1: μ < 100) the left-hand tail would be used (see Figure 6.5). This is known as a lower one tail test.
Hence, for a greater than direction (H1: μ > 100) the right-hand tail would be used (see Figure 6.6). This is known as an upper one tail test.
The actual location of this critical region will be determined by whether the variable being measured varies as a normal or Student’s t distribution.
One tail tests: A one tail test is a statistical hypothesis test in which the values for which we can reject the null hypothesis, H0, are located entirely in one tail of the probability distribution.
Region of rejection: The range of values that leads to rejection of the null hypothesis.
Two tail test: A two tail test is a statistical hypothesis test in which the values for which we can reject the null hypothesis, H0, are located in both tails of the probability distribution.
Lower one tail test: A lower one tail test is a statistical hypothesis test in which the values for which we can reject the null hypothesis, H0, are located entirely in the left tail of the probability distribution.
Upper one tail test: An upper one tail test is a statistical hypothesis test in which the values for which we can reject the null hypothesis, H0, are located entirely in the right tail of the probability distribution.
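[Figure 6.4: Normal curve for H1: µ ≠ 100, showing the 'Accept H0' region around µ on the X̄ axis.]
[Figure 6.5: Normal curve for H1: µ < 100, showing a 5% 'Reject H0' region and the 'Accept H0' region.]
[Figure 6.6: Normal curve for H1: µ > 100, showing a 5% 'Reject H0' region and the 'Accept H0' region.]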
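The three kinds of rejection region illustrated in Figures 6.4 to 6.6 can be expressed as critical z values. The sketch below assumes the test statistic follows the standard normal distribution; the function names and the 'two', 'lower', and 'upper' labels are illustrative choices rather than notation from the text.

```python
from statistics import NormalDist

def rejection_region(alpha, tail):
    """Return (lower critical z, upper critical z); None means no limit on that side."""
    std_normal = NormalDist()
    if tail == "two":
        z = std_normal.inv_cdf(1 - alpha / 2)   # alpha/2 in each tail
        return (-z, z)
    if tail == "lower":
        return (std_normal.inv_cdf(alpha), None)      # all of alpha in the left tail
    if tail == "upper":
        return (None, std_normal.inv_cdf(1 - alpha))  # all of alpha in the right tail
    raise ValueError("tail must be 'two', 'lower' or 'upper'")

def reject_h0(z_cal, alpha, tail):
    """True if the calculated z falls in the region of rejection."""
    lower, upper = rejection_region(alpha, tail)
    in_lower = lower is not None and z_cal < lower
    in_upper = upper is not None and z_cal > upper
    return in_lower or in_upper

for tail in ("two", "lower", "upper"):
    print(tail, rejection_region(0.05, tail))
```

With α = 0.05, the two tail limits are about ±1.96, while the one tail limits are about –1.645 (lower) and +1.645 (upper), because the whole 5% sits in a single tail.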
                                   Truth
Decision            H0 true                        H1 true
Reject H0           Type I error                   Correct
                    Size of test α                 Power = 1 – β
Do not reject H0    Correct                        Type II error β
                    Confidence interval = 1 – α
Table 6.1
Type I error
From Table 6.1 we observe that it is the rejection of a true null hypothesis that is a type I
error. This probability is represented by the level of significance (Greek letter Alpha, α) and
the significance value chosen represents the maximum probability of making a type I error.
Type II error
A type II error (denoted by Greek letter Beta, β) is only an error in the sense that an opportunity to reject the null hypothesis correctly was lost. It is not an error in the sense that an incorrect conclusion was drawn, as no conclusion is drawn when the null hypothesis is not rejected. Which of the errors is more serious? The answer to this question depends on the damage that is related to it. Type I and type II errors are related to each other; increasing the type I error will decrease the type II error and vice versa.
Type I error, α: A type I error occurs when the null hypothesis is rejected when it is in fact true.
Type II error, β: A type II error occurs when the null hypothesis, H0, is not rejected when it is in fact false.
Beta, β: Beta refers to the probability that a false population parameter lies inside the confidence interval.
Statistical power
The statistical power of the test is the probability of accepting the true alternative hypothesis or the probability of rejecting a false null hypothesis. The relationship between statistical power and the type II error is given by the equation power = 1 – β.
A simple example will be employed in Section 6.10 to illustrate the calculation of the type II error (β) and the statistical power for a one sample t-test.
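The relationship power = 1 – β can be illustrated with a short calculation for a two tail z test. This is a sketch under stated assumptions, not the t-test example referred to above: it borrows µ = 100, σ = 20, and n = 36 from Example 6.1 and supposes, purely for illustration, that the true population mean is 90. The function name is hypothetical.

```python
from statistics import NormalDist

def power_two_tail_z(mu0, mu_true, sigma, n, alpha=0.05):
    """Return (power, beta) for a two tail z test of H0: mu = mu0."""
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1 - alpha / 2)
    # Under the true mean, the test statistic is normal with this shifted centre
    shift = (mu_true - mu0) / (sigma / n ** 0.5)
    # beta = probability that the statistic lands in the 'do not reject' region
    beta = std_normal.cdf(z_crit - shift) - std_normal.cdf(-z_crit - shift)
    return 1 - beta, beta

power, beta = power_two_tail_z(mu0=100, mu_true=90, sigma=20, n=36)
print(f"beta = {beta:.4f}, power = {power:.4f}")
```

If the true mean equals the hypothesized mean, the same function returns a 'power' equal to α, which is simply the probability of a type I error.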
6.1.10 P-values
Unlike the classical approach using the critical test statistic we can use the p-value to decide on accepting or rejecting H0. The p-value represents the probability of the calculated random sample test statistic being this extreme if the null hypothesis is true. This p-value can then be compared to the chosen significance level (α) to make a decision between accepting or rejecting the null hypothesis H0.
Statistical power: The power of a statistical test is the probability that it will correctly lead to the rejection of a false null hypothesis.
P-value: The p-value is the probability of getting a value of the test statistic as extreme as or more extreme than that observed by chance alone, if the null hypothesis is true.
❉ Interpretation If p < α, then we would reject the null hypothesis H0 and accept the alternative hypothesis H1.
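The decision rule 'reject H0 if p < α' can be sketched for a test statistic that follows the standard normal distribution. The function name `p_value`, the tail labels, and the illustrative value z = 2.5 are assumptions made for this example.

```python
from statistics import NormalDist

def p_value(z_cal, tail):
    """p-value for a z test statistic under a two, lower, or upper tail alternative."""
    std_normal = NormalDist()
    if tail == "two":
        return 2 * (1 - std_normal.cdf(abs(z_cal)))
    if tail == "lower":
        return std_normal.cdf(z_cal)
    if tail == "upper":
        return 1 - std_normal.cdf(z_cal)
    raise ValueError("tail must be 'two', 'lower' or 'upper'")

alpha = 0.05
z_cal = 2.5  # illustrative test statistic
for tail in ("two", "lower", "upper"):
    p = p_value(z_cal, tail)
    decision = "reject H0" if p < alpha else "do not reject H0"
    print(f"{tail:>5} tail: p = {p:.4f} -> {decision}")
```

The same test statistic can lead to different decisions depending on how H1 is expressed, which is why the direction of the alternative hypothesis has to be fixed before the p-value is calculated.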
Note The Excel screenshots will identify each of these stages in the solution process.
Microsoft Excel can be used to calculate a p-value depending upon whether the vari-
able being measured varies as a normal or Student’s t distribution.
The p-value will be generated automatically by Excel when using the Analysis ToolPak
solution method (Select Data > Data Analysis).
❉ Interpretation If test statistic > critical test statistic then we would reject the null
hypothesis H0 and accept the alternative hypothesis H1.
Microsoft Excel can be used to calculate the critical test statistic values depending
upon whether the variable being measured varies as a normal or Student’s t distribution.
Note The Excel screenshots will identify each of these stages in the solution process.
These values will be generated automatically by Excel when using the Data Analysis
solution method (Select Data > Data Analysis).
Student Exercises
X6.1 A supermarket is supplied by a consortium of milk producers. Recently, a quality
assurance check suggests that the amount of milk supplied is significantly different
from the quantity stated within the contract: (i) define what we mean by significantly
different; (ii) state the null and alternative hypothesis statements; and (iii) for the
alternative hypothesis do we have a two tail, lower one tail, or upper one tail test?
X6.2 A business analyst is attempting to understand visually the meaning of the critical test
statistic and the p-value. For a z value of 2.5 and significance level of 5% provide a
sketch of the normal probability distribution and use the sketch to illustrate the location
of the following statistics: test statistic, critical test statistic, significance value, and
p-value (you do not need to calculate the values of zcri or the p-value).
Example 6.2
Employees of a firm produce units at a rate of 100 per hour with a standard deviation of
20 units per hour. A new employee is tested on 36 separate random occasions and is found
to have an output of 90 units per hour. Does this indicate that the new employee’s output is
significantly different from the average output?
Figure 6.7
➜ Excel Solution
Significance level Cell E10 Value = 0.05
Population mean Cell E13 Value = 100
Population standard deviation Cell E14 Value = 20
Sample size n Cell E16 Value = 36
Sample mean Xavg Cell E17 Value = 90
Sample standard error Cell E18 Formula = E14/E16^0.5
Zcal Cell E19 Formula = STANDARDIZE(E17,E13,E18)
Two tail p-value Cell E22 Formula = 2*(1-NORM.S.DIST(ABS(E19),TRUE))
Lower Zcri = Cell E23 Formula = NORM.S.INV(E10/2)
Upper Zcri = Cell E24 Formula = NORM.S.INV(1-E10/2)
1 State hypothesis
Null hypothesis H0: μ = 100 (population mean is equal to 100 units per hour).
Alternative hypothesis H1: μ ≠ 100 (population mean is not 100 units per hour).
The ≠ sign implies a two tail test.
2 Select test
We now need to choose an appropriate statistical test for testing H0. From the
information provided we note:
• number of samples—one sample;
• the statistic we are testing—testing for a difference between a sample mean ( x = 90)
and population mean (μ = 100). Population standard deviation is known (σ = 20);
• size of the sample—large (n = 36);
• nature of population from which sample drawn—population distribution is not
known, but sample size is large. For large n, the Central Limit Theorem states that
the sample mean is distributed approximately as a normal distribution.
$$Z_{cal} = \frac{\bar{x} - \mu}{\sigma / \sqrt{n}} \qquad (6.1)$$
From Excel, population mean = 100 (see Cell E13), population standard deviation = 20
(see Cell E14), sample size n = 36 (see Cell E16), sample mean x̄ = 90 (see Cell E17),
and standard error of the mean σx̄ = 3.3333 (see Cell E18):
$$Z_{cal} = \frac{\bar{x} - \mu}{\sigma / \sqrt{n}} = \frac{90 - 100}{20 / \sqrt{36}} = -3 \quad \text{(see Cell E19)}$$
In order to identify region of rejection in this case, we need to find the p-value.
The p-value can be found from Excel by using the NORM.S.DIST() function. In the
example H1: μ ≠ 100 units/hour. From Excel, the two tail p-value = 0.0026998 (see
Cell E22).
Note For two tail tests the p-value would be given by the Excel formula:
=2*(1-NORM.S.DIST(ABS(z value or cell reference), TRUE)).
For one tail tests the p-value would be given by the Excel formula:
=NORM.S.DIST(z value, TRUE) for lower tail p-value, where z is a negative value
=1-NORM.S.DIST(z value, TRUE) for upper tail p-value, where z is a positive value.
5 Make a decision
Does the test statistic lie within the region of rejection? Compare the chosen
significance level (α) of 5% (or 0.05) with the calculated two-tail p-value of 0.0026998.
Since the two tail p-value (0.0026998) < α (0.05), the test statistic lies within the
region of rejection, and we reject H0 and accept H1.
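The worksheet logic of Example 6.2 can also be reproduced outside Excel as a cross-check. The following is a minimal sketch in Python, which is not part of the book's Excel solution; the function name `one_sample_z_test` is hypothetical and only the standard library is assumed.

```python
from math import sqrt
from statistics import NormalDist


def one_sample_z_test(sample_mean, pop_mean, pop_sd, n):
    """Return (z_cal, two_tail_p) for a one sample z-test of the mean."""
    standard_error = pop_sd / sqrt(n)                  # mirrors Cell E18
    z_cal = (sample_mean - pop_mean) / standard_error  # mirrors Cell E19
    # Two tail p-value: twice the area beyond |z_cal| (mirrors Cell E22).
    two_tail_p = 2 * (1 - NormalDist().cdf(abs(z_cal)))
    return z_cal, two_tail_p


# Values from Example 6.2.
alpha = 0.05
z_cal, p_value = one_sample_z_test(sample_mean=90, pop_mean=100, pop_sd=20, n=36)
reject_h0 = p_value < alpha

print(f"z_cal = {z_cal:.4f}, two tail p-value = {p_value:.7f}, reject H0: {reject_h0}")
```

`NormalDist().cdf` plays the role of NORM.S.DIST(z, TRUE), so the result should agree with the Excel values up to rounding.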
Excel solution using the critical z-test statistic for a one sample z-test
The solution procedure is exactly the same as for the p-value except that we use the critical
test statistic value to make a decision.
1 State hypothesis
2 Select test
5 Make decision
Does the test statistic lie within the region of rejection? Compare the calculated and
critical z values to determine which hypothesis statement (H0 or H1) to accept. In
Figure 6.8 we observe that zcal lies in the lower rejection zone (−3 < −1.96). Given zcal
(−3) < lower two tail zcri (−1.96), we will reject H0 and accept H1.
Figure 6.8 Normal curve with the Z axis marked at −3, −1.96, 0, and +1.96; labels: ‘Accept H0’ and ‘0.5*p-value’.
Figure 6.8 illustrates the relationship between the p-value, test statistic, and critical test
statistic.
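The critical value comparison can be sketched in the same way. This is an illustrative Python fragment under the assumption that `NormalDist().inv_cdf` stands in for NORM.S.INV; the helper name `critical_z` is hypothetical and not taken from the book.

```python
from statistics import NormalDist


def critical_z(alpha, tail="two"):
    """Return (lower, upper) critical z values; None marks an unused tail."""
    std_normal = NormalDist()
    if tail == "two":
        upper = std_normal.inv_cdf(1 - alpha / 2)   # like NORM.S.INV(1-alpha/2)
        return -upper, upper
    if tail == "lower":
        return std_normal.inv_cdf(alpha), None
    if tail == "upper":
        return None, std_normal.inv_cdf(1 - alpha)
    raise ValueError("tail must be 'two', 'lower', or 'upper'")


# Example 6.2: z_cal = -3 tested at the 5% level, two tail.
z_cal = -3.0
lower_cri, upper_cri = critical_z(0.05, tail="two")
reject_h0 = z_cal < lower_cri or z_cal > upper_cri

print(f"critical values: {lower_cri:.4f}, {upper_cri:.4f}; reject H0: {reject_h0}")
```

The same helper can be called with `tail="lower"` or `tail="upper"` for one tail tests, where all of α sits in a single tail.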
Student Exercises
X6.3 What are the critical z values for a significance level of 2%: (i) two tail, (ii) lower one tail,
and (iii) upper one tail?
X6.4 A marketing manager has undertaken a hypothesis test to test for the difference
between accessories purchased for two different products. The initial analysis has been
performed and an upper one tail z-test has been chosen. Given that the z value was
calculated to be 3.45 find the corresponding p-value. From this result what would you
conclude?
X6.5 A mobile phone company is concerned at the lifetime of phone batteries supplied
by a new supplier. Based upon historical data this type of battery should last for 900
days with a standard deviation of 150 days. A recent, randomly selected sample of
40 batteries had a sample mean battery life of 942 days. Is the
sample battery life significantly different from 900 days (significance level 5%)?
X6.6 A local Indian restaurant advertises home delivery times of 30 minutes. To monitor
the effectiveness of this promise the restaurant manager monitors the time that the
order was received and the time of delivery. Based upon historical data the average
time for delivery is 30 minutes with a standard deviation of 5 minutes. After a series of
complaints from customers regarding this promise the manager decided to analyse the
data of the last 50 orders which resulted in an average time of 32 minutes. Conduct an
appropriate test at a significance level of 5%. Should the manager be concerned?
Example 6.3
A local car dealer wants to know if the purchasing habits of a buyer buying extras have changed.
He is particularly interested in male buyers. Based upon collected data he has estimated that
the distribution of extras purchased is approximately normally distributed with an average of
£2000 per customer. To test this hypothesis he has collected the extras purchased by the last
seven male customers (£): 2300, 2386, 1920, 1578, 3065, 2312, and 1790. Test whether the
extras purchased on average has changed.
Figure 6.9
One sample t-test for the population mean A one sample t-test is a hypothesis test for answering questions about the mean where the data are a random sample of independent observations from an underlying normal distribution where population variance is unknown.
➜ Excel solution
Significance level = Cell E13 Value = 0.05
Population mean = Cell E16 Value = 2000
Sample data: Cells E18:E24 Values
Sample size = Cell E26 Formula = COUNT(E18:E24)
Sample mean Xavg = Cell E27 Formula = AVERAGE(E18:E24)
Sample standard deviation s = Cell E28 Formula = STDEV.S(E18:E24)
1 State hypothesis
Null hypothesis H0: μ = 2000 (population mean spend on extras is equal to £2000).
Alternative hypothesis H1: μ ≠ 2000 (population mean is not equal to £2000).
The ≠ sign implies a two tail test.
2 Select test
We now need to choose an appropriate statistical test for testing H0. From the
information provided we note:
• number of samples—one sample;
• the statistic we are testing—testing for a difference between a sample mean and
population mean (μ = 2000). Two tail test. Population standard deviation is not
known;
• size of the sample—small (n = 7);
• nature of population from which sample drawn—population distribution is
normal, sample size is small, and population standard deviation is unknown. The
sample standard deviation will be used as an estimate of the population standard
deviation and the sampling distribution of the mean is a t distribution with n – 1
degrees of freedom.
$$t_{cal} = \frac{\bar{x} - \mu}{s / \sqrt{n}} \qquad (6.2)$$
From Excel, population mean = 2000 (see Cell E16), sample size n = 7 (see Cell E26),
sample mean x̄ = 2193 (see Cell E27), sample standard deviation s = 489.62673 (see
Cell E28), and standard error of the mean σx̄ = 185.0615084 (see Cell E29):
$$t_{cal} = \frac{\bar{x} - \mu}{s / \sqrt{n}} = \frac{2193 - 2000}{489.62673 / \sqrt{7}} = 1.0429 \quad \text{(see Cell E30)}$$
Identify the region of rejection using the p-value method—the p-value can be found
from Excel by using the T.DIST.2T() function. In the example H1: μ ≠ £2000. From
Excel, the two tail p-value = 0.3371825 (see Cell E34).
Note We can calculate the two tail p-value using the Excel function T.DIST.2T:
=T.DIST.2T(ABS(t value), degrees of freedom).
We can calculate the one tail p-value using the Excel function T.DIST:
=T.DIST(t value, degrees of freedom, true) for 1 tail lower.
We can calculate the one tail p-value using the Excel function T.DIST.RT:
=T.DIST.RT(t value, degrees of freedom) for one tail upper.
5 Make a decision
Does the test statistic lie in the region of rejection? Compare the chosen significance
level (α) of 5% (or 0.05) with the calculated two tail p-value of 0.3371825. We can
observe that the p-value > α. Given the two tail p-value
(0.3371825) > α (0.05), we will accept H0 and reject H1.
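As a cross-check on Example 6.3, the calculation can be sketched in Python. The standard library has no Student's t distribution, so this sketch assumes a numerical approximation: the two tail p-value is obtained by integrating the t density with Simpson's rule. The helper names `t_pdf` and `t_two_tail_p` are hypothetical, and the approach is a stand-in for T.DIST.2T rather than Excel's own algorithm.

```python
from math import gamma, pi, sqrt
from statistics import mean, stdev


def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    coef = gamma((df + 1) / 2) / (sqrt(df * pi) * gamma(df / 2))
    return coef * (1 + x * x / df) ** (-(df + 1) / 2)


def t_two_tail_p(t_value, df, steps=2000):
    """Two tail p-value via Simpson's rule on [0, |t|]; steps must be even."""
    upper = abs(t_value)
    if upper == 0:
        return 1.0
    h = upper / steps
    total = t_pdf(0, df) + t_pdf(upper, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    central_area = 2 * (h / 3) * total   # P(-|t| < T < |t|) by symmetry
    return 1 - central_area


# Sample data from Example 6.3 (extras purchased, in pounds).
extras = [2300, 2386, 1920, 1578, 3065, 2312, 1790]
mu_0 = 2000
n = len(extras)
x_bar = mean(extras)
s = stdev(extras)                        # sample standard deviation, like STDEV.S
t_cal = (x_bar - mu_0) / (s / sqrt(n))
p_value = t_two_tail_p(t_cal, df=n - 1)
reject_h0 = p_value < 0.05

print(f"t_cal = {t_cal:.4f}, two tail p-value = {p_value:.5f}, reject H0: {reject_h0}")
```

Because the p-value exceeds 0.05, the sketch reaches the same decision as the Excel solution.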
Excel solution using the critical t-test statistic for a one sample t-test
The solution procedure is exactly the same as for the p-value except that we use the critical
test statistic value to make a decision.
1. State hypothesis.
2. Select test.
3. Set level of significance (α = 0.05).
4. Extract relevant statistic.
The calculated test statistic tcal = 1.0429 (see Cell E30). Calculate the critical test statis-
tic, tcri. In the example H1: μ ≠ £2000. The critical t values can be found from Excel by
using the T.INV.2T() function, two tail tcri = ± 2.447 (see Cells E35–E36).
5. Make decision
Does the test statistic lie within the region of rejection? Compare the calculated and
critical t values to determine which hypothesis statement (H0 or H1) to accept. As
tcal (1.0429) lies between the lower and upper critical t values (−2.447 and +2.447),
we will accept H0 and reject H1.
Figure 6.10 illustrates the relationship between the p-value, test statistic, and critical
test statistic.
Figure 6.10 t distribution annotated: two tail p = T.DIST.2T(ABS(1.04), 6) = 0.3371825 > 0.05; 0.5*p-value; Reject H0 2.5% in each tail.
Student exercises
X6.7 Calculate the critical t values for a significance level of 1% and 12 degrees of freedom:
(i) two tail, (ii) lower one tail, and (iii) upper one tail.
X6.8 After further data collection the marketing manager (Exercise X6.4) decides to
revisit the data analysis and changes the type of test to a t-test. (i) Explain under
what conditions a t-test could be used rather than the z-test, and (ii) calculate the
corresponding p-value if the sample size was 13 and the test statistic equal to 2.03.
From this result what would you conclude?
X6.9 A tyre manufacturer conducts quality assurance checks on the tyres that it
manufactures. One of the tests consists of undertaking a test on their medium-quality
tyres with an independent random sample of 12 tyres providing a sample mean and
standard deviation of 14,500 km and 800 km respectively. Given that the historical
average is 15,000 km and that the population is normally distributed, test whether the
sample would raise a cause for concern.
X6.10 A new low-fat fudge bar is advertised as having 120 calories. The manufacturing
company conducts regular checks by selecting independent random samples and
testing the sample average against the advertised average. Historically, the population
varies as a normal distribution and the most recent sample consists of the numbers:
99, 132, 125, 92, 108, 127, 105, 112, 102, 112, 129, 112, 111, 102, and 122. Is the
population value significantly different from 120 calories (significance level 5%)?
Example 6.4
A large organization produces electric light bulbs in each of its two factories (A and B). It is
suspected that the quality of production from factory A is better than from factory B. To test this
assertion the organization collects samples from factory A and B, and measures how long each
light bulb works for (in hours) before it fails. Both population variances are known
(σ²A = 52783 and σ²B = 61560). Conduct a two sample z-test for the population mean to test
this hypothesis.
Figure 6.11
➜ Excel solution
A: Cell B4:B33 Values
B: Cell C4:C35 Values
Significance level = Cell G13 Value = 0.05
nA = Cell G17 Formula: = COUNT (B4:B33)
Sample average = Cell G18 Formula: = AVERAGE(B4:B33)
Population variance known σ2A = Cell G19 Value
nB = Cell G22 Formula = COUNT (C4:C35)
1 State hypothesis
Null hypothesis H0: μA ≤ μB.
Alternative hypothesis H1: μA > μB.
The ‘ > ‘ sign implies an upper one tail test.
2 Select test
We now need to choose an appropriate statistical test for testing H0. From the
information provided we note:
• number of samples—two samples.
• the statistic we are testing—testing that the lifetime of light bulbs from factory A
last longer than for factory B. Both population variances are known (σ2A = 52783
and σ2B = 61560);
• size of both samples—large (nA = 30 and nb = 32);
• nature of population from which sample drawn—population distribution is not
known, but sample size large. For large n, the Central Limit Theorem states that
the sample means are approximately normally distributed (nA and nB ≥ 30).
$$z_{cal} = \frac{(\bar{X}_A - \bar{X}_B) - (\mu_A - \mu_B)}{\sqrt{\dfrac{\sigma_A^2}{n_A} + \dfrac{\sigma_B^2}{n_B}}} \qquad (6.3)$$
From Excel: nA = 30 (see Cell G17), X̄A = 1135.33 (see Cell G18), σ²A = 52783 (see Cell
G19), nB = 32 (see Cell G22), X̄B = 894.21575 (see Cell G23), and σ²B = 61560 (see Cell
G24). If H0 is true (μA − μB = 0) then equation (6.3) simplifies to:
$$z_{cal} = \frac{\bar{X}_A - \bar{X}_B}{\sqrt{\dfrac{\sigma_A^2}{n_A} + \dfrac{\sigma_B^2}{n_B}}} = 3.9729 \quad \text{(see Cell G26)}$$
Identify the region of rejection using the p-value method. The p-value can be found
from Excel using the NORM.S.DIST(z-value, true) function. In the example H1: μA > μB.
From Excel, the upper one tail p-value = 0.0000354957 (see Cell G30).
5 Make a decision
Does the test statistic lie within the region of rejection? Compare the chosen
significance level (α) of 5% (or 0.05) with the calculated upper one tail p-value of
0.0000354957. We can observe that the p-value < α, and we conclude that we reject
H0 and accept H1.
❉ Interpretation It can be concluded that, at the 0.05 level, light bulbs from factory A
have significantly longer lifetimes than the light bulbs from factory B.
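The two sample z-test of Example 6.4 can be checked from the summary statistics alone. The sketch below is in Python rather than Excel; the function name `two_sample_z` is hypothetical, and the sample mean for factory A is entered as 1135.3333 on the assumption that the worksheet value is the recurring decimal 1135.33.

```python
from math import sqrt
from statistics import NormalDist


def two_sample_z(mean_a, mean_b, var_a, var_b, n_a, n_b, hypothesized_diff=0.0):
    """z statistic of equation (6.3) with known population variances."""
    standard_error = sqrt(var_a / n_a + var_b / n_b)
    return (mean_a - mean_b - hypothesized_diff) / standard_error


# Summary values from Example 6.4 (mean_a is an assumed rounding of 1135.33...).
z_cal = two_sample_z(mean_a=1135.3333, mean_b=894.21575,
                     var_a=52783, var_b=61560, n_a=30, n_b=32)

# Upper one tail p-value: area to the right of z_cal.
# cdf(-z) equals 1 - cdf(z) by symmetry and is numerically safer for large z.
upper_tail_p = NormalDist().cdf(-z_cal)
reject_h0 = upper_tail_p < 0.05

print(f"z_cal = {z_cal:.4f}, upper tail p-value = {upper_tail_p:.10f}")
```

Small differences from the Excel output in the later decimal places are to be expected, because the sketch starts from rounded summary values rather than the raw sample data.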
Figure 6.12 illustrates the relationship between the p-value and test statistic.
Example 6.5
Reconsider Example 6.4, but use the Data Analysis tool to undertake the analysis.
Figure 6.13 illustrates the application of the Data Analysis: z-test: Two Sample for Means.
A: Cells B4:B33.
B: Cells C4:C35.
Hypothesized mean difference = 0.
Variable 1 variance = 52783.
Variable 2 variance = 61560.
Alpha = 0.05.
Output Range: Cell E4.
Figure 6.13
We observe from Figure 6.14 that the relevant results agree with the previous results.
Figure 6.14
❉ Interpretation It can be concluded that, at the 0.05 level, light bulbs from factory A
have significantly longer lifetimes than the light bulbs from factory B.
Student exercises
X6.11 A battery manufacturer supplies a range of car batteries to car manufacturers. The
40 Amp-hour battery is manufactured at two manufacturing plants with a stated mean
time between charges of 8.3 days and a variance of 1.25 days. The company regularly
selects an independent random sample from the two plants with results as shown in
Table 6.2.
Plant A Plant B
6.72 10.13 9.31 7.83 9.93 8.10 6.27 8.54
9.83 7.38 9.36 9.23 10.36 7.81 9.69 8.51
7.15 6.93 7.23 8.70 9.06 7.58 8.01 9.54
7.72 9.32 8.32 10.65 8.08 8.35 7.78 9.08
9.20 8.70 9.32 8.09 9.82 6.51 8.33 7.01
11.36 8.50 8.86 10.06 9.56 7.98 8.94 7.06
6.38 7.99 9.34 6.62 7.81 6.62 9.82 9.26
9.57 7.23 8.91 10.74 7.27 8.14 9.45 10.26
Table 6.2
(a) For the given samples conduct an appropriate hypothesis test to test that the
sample mean values are not different at the 5% level of significance.
(b) If the sample means are not significantly different test whether the population
mean is 8.3 days (choose sample A to undertake the test).
X6.12 The Indian restaurant manager has employed two new delivery drivers and wishes to
assess their performance. The data in Table 6.3 represent the delivery times for person
A and B undertaken on the same day.
Person A Person B
32.9 25.6 36.2 34.6 30.3 31.6 25.5 36.5 36.0 36.3
29.4 33.5 32.5 40.7 32.7 25.5 28.1 38.8 32.4 32.8
41.2 35.6 40.8 32.4 35.3 34.2 37.5 33.3 25.9 37.7
40.3 34.6 30.2 37.1 31.0 33.4 32.3 33.2
39.3 36.5 35.0 32.7 35.5 32.6 31.9 36.8
30.3 35.7 40.2 34.2 36.5 34.0 35.9 25.1
37.5 38.0 33.4 33.2 36.1 41.4 29.0 37.6
45.0 30.7 37.8 37.7 28.9 29.8 34.3 34.4
Table 6.3
Based upon your analysis of the two samples is there any evidence that the delivery
times are different (test at 5%).
Example 6.6
Concerned by the number of passengers not wearing rear seat belts in cars, a local police
authority decided to undertake a series of surveys based upon two large cities. The sur-
vey consisted of two independent random samples collected from city A and B. The police
authority would like to know if the proportions of passengers wearing seat belts between city
A and B are different. Conduct a two sample z-test for the population proportion to test this
hypothesis.
Figure 6.15
➜ Excel solution
NA = Cell C4 Value
NB = Cell D4 Value
nA = Cell C5 Value
nB = Cell D5 Value
1 State hypothesis
Null hypothesis H0: πA = πB.
Alternative hypothesis H1: πA ≠ πB.
The ≠ sign implies a two tail test.
2 Select test
We now need to choose an appropriate statistical test for testing H0. From the
information provided we note:
• number of samples—two samples;
• the statistic we are testing—testing that the proportion wearing seatbelts is different
between the two cities. Both population standard deviations are unknown;
• size of both samples—large (nA = 250 and nB = 190);
• nature of population from which sample drawn—population distribution is not
known, but sample size large. For large n, the Central Limit Theorem states that
the sample proportions are approximately normally distributed.
From this information we will undertake a two sample z-test for proportions.
$$z_{cal} = \frac{(\rho_A - \rho_B) - (\pi_A - \pi_B)}{\sqrt{\dfrac{\pi_A(1 - \pi_A)}{N_A} + \dfrac{\pi_B(1 - \pi_B)}{N_B}}} \qquad (6.4)$$
Where ρA and ρB are the proportions for samples A and B, and πA and πB are the
population proportions (πA ~ ρA, πB ~ ρB). From Excel: NA = 250 (see Cell D17), NB = 190
(see Cell D18), nA = 135 (see Cell D19), nB = 80 (see Cell D20), ρA = 0.54 (see Cell D21),
and ρB = 0.42 (see Cell D22). If H0 is true (πA – πB = 0) then equation (6.4) simplifies to:
$$z_{cal} = \frac{\rho_A - \rho_B}{\sqrt{\dfrac{\rho_A(1 - \rho_A)}{N_A} + \dfrac{\rho_B(1 - \rho_B)}{N_B}}} = 2.49 \quad \text{(see Cell D25)}$$
Identify the region of rejection using the p-value method. The p-value can be found
from Excel using the NORM.S.DIST(z value, true) function. In the example H1: πA ≠ πB.
From Excel, the two tail p-value = 0.013 (see Cell D29).
5 Make a decision
Does the test statistic lie within the region of rejection? Compare the chosen
significance level (α) of 5% (or 0.05) with the calculated two tail p-value of 0.013. We
can observe that the p-value < α, and we conclude that we reject H0 and accept H1.
Figure 6.16 illustrates the relationship between the p-value and test statistic.
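The proportion test of Example 6.6 can be sketched in Python as follows. The function name `two_proportion_z` is hypothetical. Following the simplified form of equation (6.4), the standard error uses the two sample proportions separately; some texts pool the proportions under H0 instead, which would give a slightly different z value.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_z(successes_a, size_a, successes_b, size_b):
    """z statistic for the difference between two sample proportions."""
    p_a = successes_a / size_a
    p_b = successes_b / size_b
    # Unpooled standard error, as in the simplified equation (6.4).
    standard_error = sqrt(p_a * (1 - p_a) / size_a + p_b * (1 - p_b) / size_b)
    return (p_a - p_b) / standard_error


# Example 6.6: 135 of 250 in city A and 80 of 190 in city B wear seat belts.
z_cal = two_proportion_z(135, 250, 80, 190)
two_tail_p = 2 * NormalDist().cdf(-abs(z_cal))
reject_h0 = two_tail_p < 0.05

print(f"z_cal = {z_cal:.3f}, two tail p-value = {two_tail_p:.4f}, reject H0: {reject_h0}")
```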
Student exercises
X6.13 During a national election a national newspaper wanted to assess whether there was
a similar voting pattern for a particular party between two towns in the north-east of
England. The sample results are illustrated in Table 6.4.
Town A Town B
Number interviewed, N 456 345
Intention to vote for party, n 243 212
Table 6.4
Airport A Airport B
Total number of items processed, N 15596 25789
Number of items of luggage misplaced, n 123 167
Table 6.5
Assess whether there is a significant difference in misplaced luggage between the two
airports (test at 5%).
Example 6.7
A certain product of organic beans is packed in tins and sold by two local shops. The local
authority have received complaints from customers that the amount of beans within the tins
sold by the shops is different. To test this statistically two small random samples were collected
from both shops. Conduct a two sample t-test for the population mean (independent samples,
equal variance) to test this hypothesis.
Two sample t-test for the population mean (independent samples, equal variance) A two sample t-test for the population mean (independent samples, equal variance) is used when two separate sets of independent and identically distributed samples are obtained, one from each of the two populations being compared.
Figure 6.17 illustrates the Excel solution.
➜ Excel solution
A: Cells B4:B21 Values
B: Cells C4:C28 Values
Figure 6.17
1 State hypothesis
Null hypothesis H0: μA = μB.
Alternative hypothesis H1: μA ≠ μB.
The ≠ sign implies a two tail test.
2 Select test
We now need to choose an appropriate statistical test for testing H0. From the
information provided we note:
• number of samples—two samples;
• the statistic we are testing—testing that the amount of beans in a tin sold by both
shops is the same. Both population standard deviations are unknown;
• size of both samples—small (nA = 18 and nb = 25);
• nature of population from which sample drawn—population distribution is not
known, but we will assume that the population is normally distributed.
We will assume that the population variances are equal and conduct a Two Sample
t-test: Assuming Equal Variances (also called pooled-variance t-test).
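$$t_{cal} = \frac{(\bar{X}_A - \bar{X}_B) - (\mu_A - \mu_B)}{\sqrt{\hat{\sigma}_{A+B}^2 \left[\dfrac{1}{n_A} + \dfrac{1}{n_B}\right]}} \qquad (6.5)$$
where the pooled variance estimate is
$$\hat{\sigma}_{A+B}^2 = \frac{(n_A - 1)s_A^2 + (n_B - 1)s_B^2}{n_A + n_B - 2} \qquad (6.6)$$
$$df = n_A + n_B - 2 \qquad (6.7)$$
From Excel: nA = 18 (see Cell G16), X̄A = 527.0556 (see Cell G17), sA = 51.02 (see
Cell G18), nB = 25 (see Cell G20), X̄B = 496.64 (see Cell G21), and sB = 41.38 (see
Cell G22):
$$\hat{\sigma}_{A+B}^2 = \frac{(n_A - 1)s_A^2 + (n_B - 1)s_B^2}{n_A + n_B - 2} = 2082.0171816 \quad \text{(see Cell G24)}$$
$$t_{cal} = \frac{\bar{X}_A - \bar{X}_B}{\sqrt{\hat{\sigma}_{A+B}^2 \left[\dfrac{1}{n_A} + \dfrac{1}{n_B}\right]}} = 2.156 \quad \text{(see Cell G25)}$$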
Identify region of rejection using the p-value method. The p-value can be found from
Excel by using the T.DIST.2T() function. In this example H1: μA ≠ μB . From Excel, two
tail p-value = 0.036970 (see Cell G29).
5 Make a decision
Does the test statistic lie within the region of rejection? Compare the chosen
significance level (α) of 5% (or 0.05) with the calculated two tail p-value of
0.036970. We can observe that the p-value < α and we conclude that we reject H 0
and accept H1.
❉ Interpretation It can be concluded that, based upon the sample data collected, we
have evidence that the quantity of beans sold by shops A and B is significantly different at the
5% level of significance. It should be noted that the decision will change if you choose a
1% level of significance.
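The pooled-variance calculation of Example 6.7 can be checked from the summary statistics. This is a minimal Python sketch with a hypothetical function name `pooled_t`. Because it starts from the rounded standard deviations quoted in the text (51.02 and 41.38) rather than the raw data, the pooled variance differs slightly from the worksheet value in Cell G24.

```python
from math import sqrt


def pooled_t(mean_a, s_a, n_a, mean_b, s_b, n_b):
    """Return (t_cal, pooled_variance, df) for the equal variance t-test."""
    df = n_a + n_b - 2
    # Pooled variance: weighted average of the two sample variances.
    pooled_var = ((n_a - 1) * s_a**2 + (n_b - 1) * s_b**2) / df
    t_cal = (mean_a - mean_b) / sqrt(pooled_var * (1 / n_a + 1 / n_b))
    return t_cal, pooled_var, df


# Rounded summary values from Example 6.7.
t_cal, pooled_var, df = pooled_t(mean_a=527.0556, s_a=51.02, n_a=18,
                                 mean_b=496.64, s_b=41.38, n_b=25)

print(f"pooled variance = {pooled_var:.2f}, t_cal = {t_cal:.3f}, df = {df}")
```

The resulting t value and degrees of freedom would then be passed to T.DIST.2T, as in the Excel solution, to obtain the two tail p-value.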
Figure 6.18 illustrates the relationship between the p-value and test statistic.
Figure 6.18 t distribution annotated: two tail p = T.DIST.2T(ABS(2.156), 41) = 0.036970 < 0.05; Reject H0 2.5% in each tail.
Example 6.8
Reconsider Example 6.7, but use the Data Analysis tool to undertake the analysis.
Figure 6.19 illustrates the application of Data Analysis: t-test Two-Sample Assuming Equal
Variances (Select Data > Data Analysis > t-test: Two Sample Assuming Equal Variances).
Figure 6.19
We observe from Figure 6.20 that the relevant results agree with the previous results.
Figure 6.20
❉ Interpretation It can be concluded that, based upon the sample data collected, we
have evidence that the quantity of beans sold by shops A and B is significantly different at
the 5% level of significance.
Student exercises
X6.15 During an examination board concerns were raised concerning the marks obtained by
students sitting the final year advanced economics (AE) and e-Marketing (EM) papers
(Table 6.6).
AE AE EM EM EM
51 63 71 68 61
66 35 69 53 59
50 9 63 65 55
48 39 66 48 66
54 35 43 63 61
83 44 34 48 58
68 68 57 47 77
48 36 58 53 73
45 68 64 54
Table 6.6
Historically, the sample data varies as a normal distribution and the population
standard deviations are approximately equal. Assess whether there is a significant
difference between the two sets of results (test at 5%).
X6.16 A university finance department would like to compare the travel expenses claimed by
staff attending conferences. After initial data analysis the finance director has identified
two departments who seem to have very different levels of claims. Based upon the
data provided (Table 6.7), undertake a suitable test to assess whether the level of
claims from department A is significantly greater than that from department B. You
can assume that the population expenses data are normally distributed and that the
population standard deviations are approximately equal.
Department A Department B
156.67 146.81 147.28 140.67 108.21 109.10 127.16
169.81 143.69 157.58 154.78 142.68 110.93 101.85
130.74 155.38 179.89 154.86 135.92 132.91 124.94
158.86 170.74
Table 6.7
variances are equal and the sample variances are combined to give a pooled estimate of
σ̂ A + B given by equation (6.6).
If we are concerned that the assumption of equal variances is unsound then we can
conduct a two sample t-test with equations (6.8) and (6.9).
Example 6.9
A certain product of organic beans is packed in tins and sold by two local shops. The local
authority have received complaints from customers that the amount of beans within the tins
sold by the shop is different. To test this statistically two small, random samples were collected
from both shops.
Figure 6.21
➜ Excel solution
A: Cells B4:B21 Values
B: Cells C4:C28 Values
Significance level = Cell G13 Value = 0.05
nA = Cell G15 Formula: = COUNT(B4:B21)
averageA = Cell G16 Formula: = AVERAGE(B4:B21)
sA = Cell G17 Formula: = STDEV.S(B4:B21)
nB = Cell G19 Formula: = COUNT(C4:C28)
1 State hypothesis
Null hypothesis H0: μA = μB.
Alternative hypothesis H1: μA ≠ μB.
The ≠ sign implies a two tail test.
2 Select test
We now need to choose an appropriate statistical test for testing H0. From the
information provided we note:
• number of samples—two samples;
• the statistic we are testing—testing that the amount of beans in a tin sold by both
shops is the same. Both population standard deviations are unknown;
• size of both samples—small (nA = 18 and nb = 25);
• nature of population from which sample drawn—population distribution is not
known, but we will assume that the population is approximately normal given
sample size is close to 30.
We will assume that the population variances are not equal and conduct a Two
Sample t-test: Assuming Unequal Variances (also called separate-variance t-test).
$$t_{cal} = \frac{(\bar{X}_A - \bar{X}_B) - (\mu_A - \mu_B)}{\sqrt{\dfrac{s_A^2}{n_A} + \dfrac{s_B^2}{n_B}}} \qquad (6.8)$$
$$df = \frac{\left(\dfrac{s_A^2}{n_A} + \dfrac{s_B^2}{n_B}\right)^2}{\dfrac{\left(s_A^2 / n_A\right)^2}{n_A - 1} + \dfrac{\left(s_B^2 / n_B\right)^2}{n_B - 1}} \qquad (6.9)$$
From Excel: nA = 18 (see Cell G15), X̄A = 527.0556 (see Cell G16), sA = 51.02 (see Cell
G17), nB = 25 (see Cell G19), X̄B = 496.64 (see Cell G20), and sB = 41.38 (see Cell G21).
If H0 is true (μA – μB = 0) then equation (6.8) simplifies to:
$$t_{cal} = \frac{\bar{X}_A - \bar{X}_B}{\sqrt{\dfrac{s_A^2}{n_A} + \dfrac{s_B^2}{n_B}}} = 2.083 \quad \text{(see Cell G23)}$$
$$df = \frac{\left(\dfrac{s_A^2}{n_A} + \dfrac{s_B^2}{n_B}\right)^2}{\dfrac{\left(s_A^2 / n_A\right)^2}{n_A - 1} + \dfrac{\left(s_B^2 / n_B\right)^2}{n_B - 1}} = 32 \quad \text{(see Cell G28)}$$
Please note that the number of degrees of freedom (df ) is rounded to the nearest
whole number. Identify the region of rejection using the p-value method. The p-value
can be found from Excel by using the T.DIST.2T() function. In the example H1: μA ≠ μB.
From Excel, the two tail p-value = 0.045288 (see Cell G30).
5 Make a decision
Does the test statistic lie within the region of rejection? Compare the chosen
significance level (α) of 5% (or 0.05) with the calculated two tail p-value of 0.045288.
We can observe that the p-value < α, and we conclude that we reject H0 and accept H1.
❉ Interpretation It can be concluded that, based upon the sample data collected, we
have evidence that the quantity of beans sold by shops A and B is significantly different at the
5% level of significance.
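Equations (6.8) and (6.9) can be evaluated directly from the summary statistics of Example 6.9. The following Python sketch uses a hypothetical function name `welch_t` and the rounded values quoted in the text, so the last decimal places may differ from the worksheet.

```python
from math import sqrt


def welch_t(mean_a, s_a, n_a, mean_b, s_b, n_b):
    """Return (t_cal, df) for the unequal variance two sample t-test."""
    var_term_a = s_a**2 / n_a
    var_term_b = s_b**2 / n_b
    t_cal = (mean_a - mean_b) / sqrt(var_term_a + var_term_b)      # equation (6.8)
    df = (var_term_a + var_term_b) ** 2 / (
        var_term_a**2 / (n_a - 1) + var_term_b**2 / (n_b - 1)
    )                                                              # equation (6.9)
    return t_cal, df


# Rounded summary values from Example 6.9.
t_cal, df = welch_t(mean_a=527.0556, s_a=51.02, n_a=18,
                    mean_b=496.64, s_b=41.38, n_b=25)
df_rounded = round(df)   # the text rounds df to the nearest whole number

print(f"t_cal = {t_cal:.3f}, df = {df:.2f} (rounded to {df_rounded})")
```

The unrounded degrees of freedom come out just below 32, which is why the text reports df = 32 after rounding.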
Figure 6.22 illustrates the relationship between the p-value and the test statistic.
Figure 6.22 t distribution annotated: two tail p = T.DIST.2T(ABS(2.083), 32) = 0.045288 < 0.05.
Example 6.10
Reconsider Example 6.9, but use the Data Analysis tool to undertake the analysis.
Figure 6.23 illustrates the application of Data Analysis: t-test: Two Sample Assuming
Unequal Variances (Select Data > Data Analysis > t-test: Two Sample Assuming Unequal
Variances).
We observe from Figure 6.24 that the relevant results agree with the previous results.
Figure 6.23
❉ Interpretation It can be concluded that, based upon the sample data collected, we
have evidence that the quantity of beans sold by shops A and B is significantly different at the
5% level of significance.
Figure 6.24
Student exercises
X6.17 Repeat Exercise X6.15, but do not assume equal variances. Are the two sets of results
significantly different (test at 5%)?
X6.18 Repeat Exercise X6.16, but do not assume equal variances. Are the expenses claimed
by department A significantly different to department B?
Figure 6.25
➜ Excel solution
Person Cells B5:B30 Values
Before weight, B Cells C5:C30 Values
After weight, A Cells D5:D30 Values
d = B – A Cell E5 Formula: = C5-D5 Copy formula down E5:E30
d² Cell G5 Formula: = E5^2 Copy formula down G5:G30
Significance level Cell L13 Value = 0.05
n = Cell L15 Formula: = COUNT(B5:B30)
Σd = Cell L16 Formula: = SUM(E5:E30)
Σd2 = Cell L17 Formula: = SUM(G5:G30)
Mean d = Cell L18 Formula: = AVERAGE(E5:E30)
sd = Cell L19 Formula: = SQRT((L17-L16^2/L15)/(L15-1))
tcal = Cell L20 Formula: = (L18-10)/(L19/SQRT(L15))
df = Cell L23 Formula: = L15-1
Upper p-value = Cell L24 Formula: = T.DIST.RT(L20,L23)
Upper tcri = Cell L25 Formula: = T.INV(1-L13,L23)
1 State hypothesis
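The hypothesis statement implies that the population mean weight loss between
B (before) and A (after) should be at least 10 lb. If D = μB − μA = 10 lb, then the null and alternative
hypotheses would be stated as follows.
Null hypothesis H0: D ≤ 10.
Alternative hypothesis H1: D > 10.
The ‘>’ sign implies an upper one tail test.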
2 Select test
We now need to choose an appropriate statistical test for testing H0. From the
information provided we note:
• number of samples—two samples;
• the statistic we are testing—testing that the weight reduction programme results
in a weight loss. Both population standard deviations are unknown;
• size of both samples—small (nA and nB = 26);
• nature of population from which sample drawn—population distribution is
normally distributed.
In this case we have two variables that are related to each other (weight before vs
weight after treatment) and we will conduct a two sample t-test: paired sample for
means.
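$$t_{cal} = \frac{\bar{d} - D}{s_d / \sqrt{n}} \qquad (6.10)$$
$$s_d = \sqrt{\frac{\sum d^2 - (\sum d)^2 / n}{n - 1}} \qquad (6.11)$$
$$df = n - 1 \qquad (6.12)$$
From Excel: n = 26 (see Cell L15), Σd = 447 (see Cell L16), Σd² = 12989 (see Cell L17),
d̄ = 17.19231 (see Cell L18), and D = 10:
$$s_d = \sqrt{\frac{\sum d^2 - (\sum d)^2 / n}{n - 1}} = 14.56577 \quad \text{(see Cell L19)}$$
$$t_{cal} = \frac{\bar{d} - D}{s_d / \sqrt{n}} = 2.517802 \quad \text{(see Cell L20)}$$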
Identify the region of rejection using the p-value method. The p-value can be found
from Excel by using the T.DIST.RT() function. In the example H1: D > 10. From Excel,
the upper one tail p-value = 0.0093 (see Cell L24).
5 Make a decision
Does the test statistic lie within the region of rejection? Compare the chosen
significance level (α) of 5% (or 0.05) with the calculated upper one tail p-value
of 0.0093. We can observe that the p-value < α, and we conclude that we reject H0
and accept H1.
❉ Interpretation It can be concluded that the average weight loss is more than 10 lb at
a 5% level of significance.
Figure 6.26 illustrates the relationship between the p-value and test statistic.
Figure 6.26
Example 6.12
Reconsider Example 6.11, but use the Data Analysis tool to undertake the analysis.
Figure 6.27 illustrates the application of the Data Analysis tool t-Test: Paired Two Sample
for Means (Select Data > Data Analysis > t-Test: Paired Two Sample for Means).
We observe from Figure 6.28 that the relevant results agree with the previous results.
❉ Interpretation It can be concluded that the average weight loss is more than 10 lb at
a 5% level of significance.
Figure 6.27
Figure 6.28
Student exercises
X6.19 Choko Ltd provides training to its salespeople to aid the ability of each salesperson
to increase the value of their sales. During the last training session 15 salespeople
attended, and their weekly sales before and sales after are provided in Table 6.8.
Table 6.8
Assuming that the populations are normally distributed, assess whether there is any
evidence that the training improves sales (test at 5% and 1%).
X6.20 Concern has been raised at the standard achieved by students completing final
year project reports within a university department. One of the factors identified as
important is the research methods (RM) module mark achieved, which is studied
before the students start their project. The department has now collected data for 15
students, as given in Table 6.9.
Student RM Project
1 38 71
2 50 46
3 51 56
4 75 44
5 58 62
6 42 65
7 54 50
8 39 51
9 48 43
10 14 62
11 38 66
12 47 75
13 58 60
14 53 75
15 66 63
Table 6.9
Assuming that the populations are normally distributed, is there any evidence to
suggest that the marks are different (test at 5%)?
Example 6.13
In this example we will use the F test to check if the two population variances in Example 6.7
can be considered equal with a 95% confidence.
Figure 6.29
➜ Excel solution
A: Cells B4:B21 Values
B: Cells C4:C28 Values
Significance level Cell G10 Value = 0.05
nA = Cell G12 Formula = COUNT(B4:B21)
nB = Cell G13 Formula = COUNT(C4:C28)
sA = Cell G14 Formula = STDEV.S(B4:B21)
F test: Tests whether two population variances are the same based upon sample values.
1 State hypothesis
The alternative hypothesis statement implies that the population variances are not
equal. The null and alternative hypotheses would be stated as follows:
2 State test
We now need to choose an appropriate statistical test for testing H0. From the
information provided we note:
• number of samples—two samples;
• the statistic we are testing—testing that the variances are different from each other;
• size of both samples—small (nA = 18 and nb = 25).
• nature of population from which sample drawn—population distribution is not
known, but we will assume that the population is approximately normal given
sample sizes close to 30.
3 Set the level of significance at α = 0.05 (see Cell G10). For two-tail (or non-
directional) tests use α = significance level/2 = 0.025.
With the hypothesis tests considered so far we have been able to write the hypothesis
statement as either a one or two tail test. With the F test we have a similar situation but
we are dealing with variances rather than mean values.
From Excel: nA = 18 (see Cell G12), nB = 25 (see Cell G13), sA = 51.02 (see Cell G14),
sB = 41.38 (see Cell G15). Given that sA > sB then the numerator variance in equation
(6.13) will be the sample A variance and the denominator variance will be the sample
B variance.
$$F_{cal} = \frac{s_A^2}{s_B^2} = 1.5197059 \quad \text{(see Cell G16)}$$
Identify the region of rejection using the p-value method. The p-value can be found
from Excel by using the FTEST() function. In the example H1: $\sigma_A^2 \neq \sigma_B^2$. From Excel,
the two tail p-value = 0.3393282 (see Cell G21).
5 Make a decision
Does the test statistic lie within the region of rejection? Compare the chosen
significance level (α) of 5% (or 0.05) with the calculated two tail p-value of 0.3393282.
Given that the two tail p-value (0.3393282) > α (0.05), we accept H0 and reject H1.
❉ Interpretation It can be concluded that the two population variances are not
significantly different at the 95% level of confidence.
1 State hypothesis
2 State test
Note The upper (FU) and lower (FL) critical values for a two tail test can be calculated
using Excel as follows:
5 Make a decision
Does the test statistic lie within the region of rejection? Compare the calculated and
critical F values to determine which hypothesis statement (H0 or H1) to accept. The
calculation of Fcal yields a value of 1.5197, which does not lie in the region of rejection.
Given that Fcal (1.5197) lies between the lower critical value FL (0.3906518)
and upper critical value FU (2.3864801) we will accept H0 and reject H1.
❉ Interpretation It can be concluded that, based upon the sample data collected,
we have evidence that the population variances are not significantly different at the 95%
level of confidence. In this case we would be reasonably happy to conduct the two sample
pooled t-test.
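The critical value decision for the F test can be sketched in a few lines of Python. This is an illustration only: it assumes the rounded sample standard deviations quoted in the text (51.02 and 41.38) and takes the critical values FL and FU as given rather than computing them, and the function name `variance_ratio` is hypothetical. Because the inputs are rounded, the ratio differs slightly from the value in Cell G16.

```python
def variance_ratio(s_larger, s_smaller):
    """F statistic with the larger sample variance in the numerator."""
    return (s_larger / s_smaller) ** 2

s_a, s_b = 51.02, 41.38                      # rounded values from Cells G14 and G15
f_cal = variance_ratio(max(s_a, s_b), min(s_a, s_b))

f_lower, f_upper = 0.3906518, 2.3864801      # two tail critical values quoted in the text
reject_h0 = f_cal < f_lower or f_cal > f_upper

print(round(f_cal, 4), "reject H0" if reject_h0 else "accept H0")
```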
Figure 6.30 illustrates the relationship between the p-value, the F test statistic, and the
critical F statistic.
Alternative hypothesis H1: $\sigma_A^2 > \sigma_B^2$    Alternative hypothesis H1: $\sigma_A^2 < \sigma_B^2$
With α = significance level    With α = significance level
Table 6.10
Note The upper (FU) and lower (FL) critical values for a one tail test can be calculated
using Excel as follows:
1. Upper one tail F value = FU = F.INV.RT(significance level, df for largest variance, df for
smallest variance) or = F.INV(1-significance level, df for largest variance, df for smallest variance).
2. Lower one tail F value = FL = F.INV(significance level, df for largest variance, df for
smallest variance) or = 1/F.INV.RT(significance level, df for smallest variance, df for largest variance).
Example 6.14
Reconsider Example 6.13, but use the Data Analysis tool to undertake the analysis.
Figure 6.31 illustrates the application of the Excel Analysis ToolPak: F Test: Two Sample
for Variances (Select Data > Data Analysis> F Test: Two Sample for Variances).
Figure 6.31
We observe from Figure 6.32 that the relevant results agree with the previous results.
Figure 6.32
❉ Interpretation It can be concluded that, based upon the sample data collected, we
have evidence that the population variances are not significantly different at the 5% level of
significance. In this case we would be reasonably happy to conduct the two sample pooled t-test.
Note In the Data Analysis solution for the F test the Excel solution is for one tail only.
If you have a two tail hypothesis F test then compare the one tail p-value from Figure 6.32
(0.16966411) with the significance level divided by 2 (= α/2).
Student exercises
X6.21 In Exercise X6.15 we assumed that the two population variances are equal. Conduct an
appropriate test to check if the variances are equal (test at 5% and 1%).
X6.22 In Exercise X6.16 we assumed that the two population variances are equal. Conduct an
appropriate test to check if the variances are equal (test at 5%).
If β was equal to 23% then the statistical power = 1 − 0.23 = 77% and we would
conclude that we would accept a true alternative hypothesis 77% of the time.
Example 6.15
Let us illustrate the calculation of the type II error (β) and the statistical power via a simple
example. Consider the problem of estimating the spend on a particular type of sweet per day
where, historically, the average spend is £19.44 per day with a standard deviation of £6.23. The
shop would like to check whether or not the current spending per day is still £19.44 and they
have decided to collect a sample on a particular day of size 32, which results in average sample
spend of £23.40.
The shop, after consultation with an analyst, decides to conduct a one sample t-test, but they
would like to know how confident they can be in the outcome of applying this test to the data.
In other words, what is the probability of accepting a true alternative hypothesis or rejecting
a false null hypothesis?
Figure 6.33 illustrates a pictorial representation of the solution process.
Figure 6.33 (Distribution A, centred on 19.44, with α/2 = 0.025 in each tail and the critical values tcri and Xcri; Distribution B, centred on 23.40, showing β and the statistical power)
Figure 6.34
➜ Excel solution
μ = Cell D6 Value
σ = Cell D7 Value
Significance = Cell D8 Value
Xbar = Cell D10 Value
n = Cell D11 Value
df = Cell D13 Formula: = D11−1
t = Cell D14 Formula: = (D10−D6)/(D7/SQRT(D11))
Step 1: calculate Xcri in distribution A with H0: μ = 19.44
tcri = Cell D18 Formula: = T.INV(1−D8/2,D13)
Xcri = Cell D19 Formula: = D6+D18*(D7/SQRT(D11))
Step 2: calculate the value of β, the area of distribution B (centred on XB = 23.40) that lies below Xcri from distribution A
tβ = Cell D23 Formula: = (D19−D10)/(D7/SQRT(D11))
β = Cell D24 Formula: = T.DIST(D23,D13,TRUE)
Step 3: calculate statistical power
Power = Cell D28 Formula: = 1−D24
❉ Interpretation From Excel, the value of the statistical power = 0.935091117 or 94%.
This high value of the statistical power (or just power) of 94% indicates that the one sample
t-test is highly likely to detect the effect or reject the null hypothesis that the population mean
is £19.44.
Note It is important to note that the value of β and the statistical power are not given
directly by an Excel function.
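The three steps of the Excel solution can also be traced in code. The following is a minimal Python sketch using only the standard library; the helper functions `t_pdf`, `t_cdf`, and `t_inv` are illustrative stand-ins for Excel's T.DIST and T.INV, built here by numerical integration and bisection, and are assumptions rather than part of the original solution. The inputs are the values stated in Example 6.15.

```python
import math

def t_pdf(x, df):
    """Density of Student's t distribution."""
    c = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return c * (1 + x * x / df) ** (-(df + 1) / 2)

def t_cdf(x, df, steps=2000):
    """Cumulative probability via Simpson's rule on [0, |x|] plus symmetry."""
    a = abs(x)
    h = a / steps
    total = t_pdf(0.0, df) + t_pdf(a, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area = total * h / 3
    return 0.5 + area if x >= 0 else 0.5 - area

def t_inv(p, df):
    """Inverse of t_cdf by bisection (a stand-in for T.INV)."""
    lo, hi = -50.0, 50.0
    for _ in range(60):
        mid = (lo + hi) / 2
        if t_cdf(mid, df) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

mu0, sigma, alpha = 19.44, 6.23, 0.05   # Cells D6, D7, D8
x_bar, n = 23.40, 32                    # Cells D10, D11
df = n - 1
se = sigma / math.sqrt(n)

t_cri = t_inv(1 - alpha / 2, df)        # Step 1: critical t for distribution A
x_cri = mu0 + t_cri * se                #         and the matching critical mean
t_beta = (x_cri - x_bar) / se           # Step 2: position of Xcri in distribution B
beta = t_cdf(t_beta, df)                #         type II error probability
power = 1 - beta                        # Step 3: statistical power

print(round(t_cri, 4), round(x_cri, 4), round(beta, 4), round(power, 4))
```

The power printed here should agree with Cell D28 to about three decimal places; small differences come from the numerical integration.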
■ Techniques in practice
1. Concerned at the time taken to react to customer complaints, CoCo S. A. has implemented
a new set of procedures for its support centre staff. The customer service director has
directed that a suitable test is applied to a new sample to assess whether the new target mean
time for responding to customer complaints is 28 days. Table 6.11 illustrates the data collected
by the customer service director.
20 33 33 29 24 30
40 33 20 39 32 37
32 50 36 31 38 29
15 33 27 29 43 33
31 35 19 39 22 21
28 22 26 42 30 17
32 34 39 39 32 38
Table 6.11
(a) Describe the test to be applied with stated assumptions.
(b) Conduct the required test to assess whether evidence exists for the mean time to respond
to complaints to be greater than 28 days.
(c) What would happen to your results if the population mean time to react to customer
complaints changes to 30 days?
2. Bakers Ltd are currently undertaking a review of the delivery vans used to deliver products
to customers. The company runs two types of delivery van (type A, recently purchased, and
type B, at least three years old) which are supposed to be capable of achieving 20 km per litre
of petrol. A new sample has now been collected, as given in Table 6.12.
A B A B
17.68 15.8 26.42 34.8
18.72 36.1 25.22 16.8
26.49 6.3 13.52 15.0
26.64 12.3 14.01 28.9
9.31 15.5 33.9
22.38 40.1 27.1
20.23 20.4 16.8
28.80 3.7 23.6
17.57 13.6 29.7
9.13 35.1 28.2
20.98 33.3
Table 6.12
(a) Assuming that the population distance travelled varies as a normal distribution, is there any
evidence to suggest that the two types of delivery vans differ in the mean distances travelled?
(b) Based upon your analysis, is there any evidence that the new delivery vans meet the
mean average of 20 km per litre?
3. Skodel Ltd is developing a low calorie lager for the European market with a mean
designed calorie count of 43 calories per 100 ml. The new product development team are
having problems with the production process and have collected two independent random
samples to assess whether the target calorie count is being met (assume the population
variables are normally distributed), as presented in Table 6.13.
A B A B
49.7 39.4 45.2 34.5
45.9 46.5 40.5 43.5
37.7 36.2 31.9 37.8
40.6 46.7 41.9 39.7
34.8 36.5 39.8 41.1
51.4 45.4 54.0 33.6
34.3 38.2 47.8 35.8
63.1 44.1 26.3 44.6
41.2 58.7 31.7 38.4
41.4 47.1 45.1 26.1
41.1 59.7 47.9 30.7
Table 6.13
■ Summary
In this chapter we have provided an introduction to the important statistical concept of
parametric hypothesis testing for one and two samples. What is important in hypothesis testing is
that you are able to recognize the nature of the problem and should be able to convert this into
two appropriate hypothesis statements (H0 and H1) that can be measured.
If you are comparing more than two samples then you would need to employ advanced
statistical parametric hypothesis tests. These tests are called analysis of variance (ANOVA),
which are described in the online workbook ‘Factorial experiments’.
In this chapter we have described a simple five-step procedure to aid the solution process
and have focused on the application of Excel to solve the data problems. The main emphasis
is placed on the use of the p-value, which is the probability of obtaining a result at least as
extreme as the one observed if the null hypothesis (H0) is true. Thus, if the measured
p-value > α (alpha) then we would accept H0, and if the p-value < α then we would reject H0
and regard the result as statistically significant. Remember the value of the p-value will depend on
whether we are dealing with a two or one tail test. So take extra care with this concept as this
is where most students slip up.
The second part of the decision-making described the use of the critical test statistic in making
decisions. This is the traditional textbook method which uses published tables to provide
estimates of critical values for various test parameter values. The final method, and perhaps the
main Excel method you will use, is to employ the Data Analysis method available within Excel.
The focus of parametric tests is that the underlying variables are at the interval/ratio level of
measurement and the population being measured is distributed as a normal or approximately
normal distribution. In the next chapter we shall explore how we undertake hypothesis testing
for variables that are at the nominal or ordinal level of measurement by exploring the concept
of the chi-square and non-parametric tests.
■ Key Terms
Alpha
Alternative hypothesis
Beta, β
Central Limit Theorem
Critical test statistic
F distribution
F test
F test for two population variances (variance ratio test)
Hypothesis test procedure
Level of significance
Lower one tail test
Mann–Whitney U test
Non-parametric
Null hypothesis
One sample t-test for the population mean
One sample test
One sample z-test for the population mean
One tail tests
Parametric
P-value
Region of rejection
Robust test
Significance level, α
Statistical power
Two sample t-test for population mean (dependent or paired samples)
Two sample t-test for population mean (independent samples, unequal variances)
Two sample t-test for the population mean (independent samples, equal variance)
Two sample tests
Two sample z-test for the population mean
Two sample z-test for the population proportion
Two tail test
Type I error
Type II error
Upper one tail test
■ Further Reading
Textbook Resources
1. Whigham, D. (2007) Business Data Analysis using Excel. Oxford: Oxford University Press.
2. Lindsey, J. K. (2003) Introduction to Applied Statistics: A Modelling Approach (2nd edn).
Oxford: Oxford University Press.
Web Resources
1. StatSoft Electronic Textbook http://www.statsoft.com/textbook/stathome.html (accessed 25
May 2012).
2. HyperStat Online Statistics Textbook http://davidmlane.com/hyperstat/index.html (accessed
25 May 2012).
3. Eurostat: website is updated daily and provides direct access to the latest and most
complete statistical information available on the European Union (EU), the EU Member States,
the Euro-zone and other countries http://epp.eurostat.ec.europa.eu (accessed 25 May
2012).
4. Economagic: contains international economic data sets http://www.economagic.com
(accessed 25 May 2012).
5. The International Statistical Institute (ISI) glossary of statistical terms provides definitions
in a number of different languages http://isi.cbs.nl/glossary/index.htm (accessed 25 May
2012).
7 Chi-square and non-parametric hypothesis testing
» Overview «
In Chapter 6 we explored a series of parametric tests to assess whether the differences between
means (or variances) are statistically significant. Within parametric tests we sample from a
distribution with a known parameter value, for example population mean (μ), variance (σ²), or
proportion (π). The techniques described were defined by three assumptions: (i) the underlying
population being measured varies as a normal distribution; (ii) the level of measurement is of
equal interval or ratio scaling; and (iii) the population variances are equal. Unfortunately, we will
come across data that does not fit these assumptions.
How do we measure the difference between the attitudes of people surveyed in assessing
their favourite car, when the response by each person is of the form: 1, 2, 3 . . . n? In this
situation we have ordinal data in which taking differences between the numbers (or ranks) is
meaningless. Furthermore, if we are asking for opinions where the opinion is of a categorical
form (e.g. strongly agree, agree, do not agree) then the concept of difference is, again,
meaningless. The responses are words not numbers, but you can, if you so wish, solve this
problem by allocating a number to each response, with 1 for strongly agree, 2 for agree, and
so on. This gives you a rating scale of responses, but remember that the opinions of people are
not quite the same as measuring time or measuring the difference between two times. Can we
say that the difference between strongly agree and agree is the same as the difference between
agree and disagree? Another way of looking at this problem is to ask the question: can we say
that a rank of 5 is five times stronger than a rank of 1?
This chapter will provide an overview to the chi-square (χ², where χ is the Greek letter chi)
and non-parametric tests that can be used when parametric methods are not appropriate.
» Learning objectives «
On completing this unit you will be able to:
» apply the chi-square test to test for association between categorical variables;
» apply the chi-square test to measure the difference between two proportions from two
independent samples;
» apply the chi-square test to measure the difference between two proportions from two
dependent samples;
Rank: List data in order of size.
1. Undertake a chi-square test of association that is popular with students in analysing
survey data
2. Undertake a chi-square test for independent samples
3. Undertake a chi-square test for dependent samples
4. Undertake a chi-square test for goodness-of-fit.
Chi square test for goodness-of-fit: The chi-square goodness-of-fit test of a statistical model
describes how well the statistical model fits a set of observations.
To test the null hypothesis we would compare the expected cell frequencies with the
observed cell frequencies and calculate the chi-square test statistic given by equation
(7.2):
$$\chi^2_{cal} = \sum \frac{(O - E)^2}{E} \qquad (7.2)$$
The chi-square test statistic in equation (7.2) enables a comparison to be made between the observed
frequencies (O) and the expected frequencies (E) calculated via equation (7.1). Equation (7.1)
tells us what the expected frequencies would be if there was no association between the
two categorical variables, for example, gender and course. If the values are close to one
another then this provides evidence that there is no association, and, conversely, if we find
large differences between the observed and expected frequencies then we have evidence
to suggest an association does exist between the two categorical variables: gender and
course. Statistical hypothesis testing allows us to confirm whether the differences are likely
to be statistically significant. The chi-square distribution varies in shape with the number
of degrees of freedom and thus we need to find this value before we can look up the
appropriate critical values. The number of degrees of freedom (df) is given by equation (7.3).
df = (r − 1) * (c − 1) (7.3)
Where r = number of rows and c = number of columns. Identify a region of rejection using
either the p-value method or calculate the critical test statistic.
Example 7.1
Suppose a university sampled 485 of its students to determine whether males and females
differed in preference for five courses offered. The question we would like to answer is to confirm
whether or not we have an association between the courses chosen and the person’s gender.
In this case we have two attributes, gender and course, both of which have been divided into
categories: two for gender and five for course. The resulting table is called a 5*2 contingency
table because it consists of 5 rows and 2 columns. To determine whether gender and course
preference are associated (or independent) we conduct a chi-square test of association on the
contingency table.
Test statistic: A test statistic is a quantity calculated from our sample of data.
Contingency table: A contingency table is a table of frequencies classified according to the
values of the variables in question.
Figure 7.1
➜ Excel solution
Data series Cells D5:E9 Values
Number of males Cell D10 Formula: =SUM(D5:D9)
Number of Females Cell E10 Formula: =SUM(E5:E9)
Number of A101 Cell F5 Formula: =SUM(D5:E5)
Number of D102 Cell F6 Formula: =SUM(D6:E6)
Number of M101 Cell F7 Formula: =SUM(D7:E7)
Number of S101 Cell F8 Formula: =SUM(D8:E8)
Number of T101 Cell F9 Formula: =SUM(D9:E9)
Grand Total Cell F10 Formula: =SUM(D10:E10)
Expected frequencies Cell D15 Formula: =$F5*D$10/$F$10
Copy formula down and across D15:E19
(O−E)2/E Cell F15 Formula: =(D5−D15)^2/D15
Copy formula down and across F15:G19
Figure 7.2
➜ Excel solution
Significance level = Cell K12 Value =0.05
χ 2cal = Cell K15 Formula: =SUM(F15:G19)
r = Cell K18 Formula: =COUNT(D5:D9)
c = Cell K19 Formula: =COUNT(D5:E5)
df = Cell K20 Formula: =(K18−1)*(K19−1)
Critical χ 2 = Cell K21 Formula: =CHISQ.INV.RT(K12,K20)
P-value = Cell K22 Formula: =CHISQ.DIST.RT(K15,K20)
or p-value using CHISQ.DIST = Cell K24 Formula: =1−CHISQ.DIST(K15,K20,TRUE)
or p-value using CHISQ.TEST = Cell K26 Formula: =CHISQ.TEST(D5:E9,D15:E19)
2 Select test
• Number of samples—two category data variables (gender and course). The
sample data is randomly selected and are represented as frequency counts
within the contingency table.
• The statistic we are testing—testing for an association between the two category
data variables.
5 Make a decision
In Figure 7.2 we observe that the p-value <α (5.7E-13 <0.05) and conclude that we
reject H0 and accept H1.
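The p-value and critical value reported for this example can be cross-checked without Excel. The following is a minimal Python sketch; it relies on the closed form of the chi-square upper tail probability that holds when the degrees of freedom are even (here df = 4), and the function name `chi2_upper_tail_even_df` is illustrative rather than taken from the text.

```python
import math

def chi2_upper_tail_even_df(x, df):
    """Upper tail probability P(chi-square > x); valid only for even df."""
    if df <= 0 or df % 2:
        raise ValueError("this closed form needs a positive, even df")
    half = x / 2
    term, total = 1.0, 1.0
    # Sum of (x/2)^k / k! for k = 0 .. df/2 - 1
    for k in range(1, df // 2):
        term *= half / k
        total += term
    return math.exp(-half) * total

p_value = chi2_upper_tail_even_df(63.3562, 4)          # compare with Cell K22
tail_at_critical = chi2_upper_tail_even_df(9.4877, 4)  # should be close to alpha = 0.05

print(p_value, round(tail_at_critical, 4))
```

The second call confirms that the critical value from Cell K21 leaves about 5% in the upper tail, which is what CHISQ.INV.RT is asked to deliver.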
Figure 7.3 illustrates the relationship between the p-value and test statistic.
P-value = CHISQ.DIST.RT(63.36, 4)
Note
For the chi-square test to give meaningful results:
To reduce the associated errors, we can apply Yates' correction for continuity given by
equation (7.4).
$$\chi^2_{cal} = \sum \frac{(|O - E| - 0.5)^2}{E} \qquad (7.4)$$
Excel solution using the critical test statistic
The solution procedure is exactly the same as for the p-value except that we use the critical
test statistic value to make a decision. The calculated test statistic $\chi^2_{cal}$ = 63.3562 (see Cell
K15). Calculate the critical test statistic, $\chi^2_{cri}$. The critical value can be found from Excel by
using the CHISQ.INV.RT() function, $\chi^2_{cri}$ = 9.4877 (see Cell K21). Does the test statistic lie
within the region of rejection? Compare the calculated and critical $\chi^2$ values to determine
which hypothesis statement (H0 or H1) to accept. In Figure 7.3 we observe that $\chi^2_{cal}$ lies in
the upper rejection zone (63.3562 > 9.49).
Expected frequency: In a contingency table the expected frequencies are the frequencies that
you would predict in each cell of the table, if you knew only the row and column totals, and if
you assumed that the variables under comparison were independent.
Student exercises
X7.1 A business consultant requests that you perform some preliminary calculations before
analysing a data set using Excel.
(a) Calculate the number of degrees of freedom for a contingency table with three
rows and four columns.
(b) Find the upper tail critical χ2 value with a significance level of 5% and 1%. What
Excel function would you use to find this value?
(c) Describe how you would use Excel to calculate the test p-value. What does the
p-value represent if the calculated chi-square test statistic equals 8.92?
X7.2 A trainee risk manager for an investment bank has been told that the level of risk is
related directly to the industry type (manufacturing, retail, and financial). For the data
presented in the contingency table (Table 7.1) analyse whether or not perceived risk is
dependent upon the type of industry identified (assess at 5%)? If the two variables are
associated then what is the form of the association?
Table 7.1
Table 7.2
X7.4 A local trade association is concerned at the level of business activity within the
local region. As part of a research project a random sample of business owners
were surveyed on how optimistic they were for the coming year. Based upon the
contingency Table 7.3 do we have any evidence to suggest different levels of optimism
for business activity (assess at 5%)? If the two variables are associated then what is the
form of the association?
Table 7.3
X7.5 A group of students at a language school volunteered to sit a test that is to be undertaken
to assess the effectiveness of a new method to teach German to English-speaking
students. To assess effectiveness students sit two different tests with one test in English
and the other test in German. Is there any evidence to suggest that the student test
performances in English are replicated by their test performances in German (Table 7.4;
assess at 5%)? If the two variables are associated then what is the form of the association?
German English
≥60% 40–59% < 40%
≥60% 90 81 8
40–59% 61 90 8
<40% 29 39 6
Table 7.4
Example 7.2
To illustrate the concept consider the example of a firm who surveys whether or not employees
use the train to travel to work. The firm collects the data and has created a 2*2 contingency
table (see Table 7.5) to summarize the responses for only the people who work on two days.
Monday Wednesday
Take train to work 89 76
Do not take train to work 64 88
The question is now whether or not we have a significant difference between the
Monday and Wednesday employees who travel to work by train.
Figures 7.4 and 7.5 illustrate the Excel solution.
Figure 7.4
➜ Excel solution
Data series: Cells C6:D7
Sum row 1 Cell E6 Formula: =SUM(C6:D6)
Sum row 2 Cell E7 Formula: =SUM(C7:D7)
Sum column 1 Cell C8 Formula: =SUM(C6:C7)
Sum column 2 Cell D8 Formula: =SUM(D6:D7)
Grand total = Cell E8 Formula: =SUM(E6:E7)
Expected frequencies Cell C13 Formula: =$E6*C$8/$E$8
Copy formula down and across C13:D14
(O−E)^2/E Cell E13 Formula: =(C6−C13)^2/C13
Copy formula down and across E13:F14
Figure 7.5
➜ Excel solution
Significance level Cell K13 Value
p Cell K16 Formula: =E6/E8
χ2cal = Cell K17 Formula: =SUM(E13:F14)
r = Cell K18 Formula: =COUNTA(B6:B7)
c = Cell K19 Formula: =COUNTA(C5:D5)
df = Cell K20 Formula: =(K18−1)*(K19−1)
χ2cri = Cell K21 Formula: =CHISQ.INV.RT(K13,K20)
P-value = Cell K22 Formula: =CHISQ.DIST.RT(K17,K20)
or p-value using CHISQ.DIST = Cell K24 Formula: =1−CHISQ.DIST(K17,K20,TRUE)
or p-value using CHISQ.TEST = Cell K25 Formula: =CHISQ.TEST(C6:D7,C13:D14)
In general, the 2*2 contingency table can be structured as illustrated in Table 7.6.
Column variable
1 2 Totals
Row variable 1 n1 n2 N
2 t1 –n1 t2 – n2 T–N
Totals t1 t2 T
Table 7.6
From this table we can estimate the proportion (or probability) that employees will use
the train by calculating the overall proportion (ρ) using equation (7.5).
$$\rho = \frac{n_1 + n_2}{t_1 + t_2} = \frac{N}{T} \qquad (7.5)$$
We can now use this estimate to calculate the expected frequency (E) for each cell
within the contingency table by multiplying the column total by ρ for the cells linked to
travelled by train and (1−ρ) for those cells who did not travel by train using equation (7.6).
Calculate the chi-square test statistic to compare the observed and expected frequen-
cies using equation (7.2).
$$\chi^2_{cal} = \sum \frac{(O - E)^2}{E} = 4.437356 \quad \text{(Cell K17)}$$
5 Make a decision
We observe that the p-value < α (0.035161 < 0.05), and we conclude that we reject H0
and accept H1.
Note If you decided that the significance level is 1% (0.01), then we would have a
reverse decision given that the two tail p-value > α (0.035161 > 0.01). In this case we would
accept H0 and reject H1. This is an example of modifying your decision based upon how
confident you would like to be with your overall decision.
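The calculation for the train travel example can be reproduced from the four observed counts alone. The following is a minimal Python sketch, assuming the Monday and Wednesday counts shown in the table for this example; the function name `chi_square_statistic` is illustrative, and the p-value line uses a shortcut through the complementary error function that is valid only when df = 1.

```python
import math

def chi_square_statistic(observed):
    """Return (chi-square statistic, df) for a table of observed counts."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(observed):
        for j, o in enumerate(row):
            # Expected frequency = row total * column total / grand total
            e = row_totals[i] * col_totals[j] / grand_total
            chi2 += (o - e) ** 2 / e
    df = (len(observed) - 1) * (len(observed[0]) - 1)
    return chi2, df

# Rows: take train, do not take train; columns: Monday, Wednesday
observed = [[89, 76], [64, 88]]
chi2_cal, df = chi_square_statistic(observed)

# For df = 1 only: P(chi-square > x) = erfc(sqrt(x / 2))
p_value = math.erfc(math.sqrt(chi2_cal / 2))

for alpha in (0.05, 0.01):
    decision = "reject H0" if p_value < alpha else "accept H0"
    print(alpha, round(chi2_cal, 6), round(p_value, 6), decision)
```

Running the loop at both significance levels shows the reversal of the decision described in the note.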
Does the test statistic lie within the region of rejection? Compare the calculated and critical $\chi^2$ values to
determine which hypothesis statement (H0 or H1) to accept. We observe that $\chi^2_{cal}$ lies in the
region of rejection (4.437356 > 3.841459), and we reject H0 and accept H1.
Note
1. For the chi-square test to give meaningful results the expected frequency for each cell in the
2*2 contingency table is required to be at least 5. If this is not the case then the chi-square
distribution is not a good approximation to the ratio (O−E)2/E. In this situation, we can use
Fisher’s test, which provides an exact p-value.
2. In the example you may have noticed that the frequency counts are discrete variables which
are mapped onto the continuous chi-square distribution. In this case we need to apply the
Yates’ correction for continuity given by equation (7.4).
3. In Section 6.5 we compared two sample proportions using a normal approximation. When
we have one degree of freedom we can show that there is a simple relationship between the
value of $\chi^2_{cal}$ and the corresponding value of Z, given by $\chi^2_{cal} = (Z_{cal})^2$.
4. If we are interested in testing for direction in the alternative hypothesis (e.g. H1: π1 > π2) then
you cannot use a chi-square test but will have to undertake a normal distribution Z test to
test for direction.
The two proportion solution can be extended to more than two proportions, but this is
beyond the scope of this text.
McNemar test: McNemar’s test is a non-parametric method used on nominal data to
determine whether the row and column marginal frequencies are equal.
Example 7.3
Consider the problem of estimating the effectiveness of a political campaign on the voting
patterns of a group of voters. Two groups of voters are selected at random and their voting
intentions (drop carbon dioxide (CO2), tax) for a local election recorded. Both groups are then
subjected to the same campaign and their voting intentions recorded. The question that arises
is whether or not the campaign was effective on the voting intentions of the voters. In this case
we have two groups who are recorded before and after, and we recognize that we are dealing
with paired samples. To solve this problem we can use McNemar’s test for two sets of nominal
data that are randomly selected. Table 7.7 contains the outcome of the voting intentions before
and after the campaign.
                      After
Before        Drop CO2      Tax
Drop CO2         287         89
Tax               45        200
Table 7.7
The question is whether the political campaign has been successful on ‘drop CO2’ vot-
ers and ‘tax’ voters, who both received the same marketing campaign. To simplify the
problem we shall look at whether or not the proportion voting ‘drop CO2’ has changed
significantly.
H0: proportion voting for ‘drop CO2’ not changed.
H1: proportion voting for ‘drop CO2’ changed.
In terms of notation this can be written as: H0: π1 = π2, H1: π1 ≠ π2, where π1 = popula-
tion proportion voting ‘drop CO2’ before campaign and π2 = population proportion voting
‘drop CO2’ after campaign.
Note Remember that the other hypothesis is whether or not the proportions voting
for the ‘tax’ party are the same before and after the campaign.
In general, the 2*2 contingency table can be structured as illustrated in Table 7.8.
                              Column variable
                           Drop CO2    Tax      Totals
Row variable   Drop CO2    a           b        a + b
               Tax         c           d        c + d
               Totals      a + c       b + d    N
Table 7.8
From this table we observe that the sample proportions are given by equations (7.7)
and (7.8):
$$\rho_1 = \frac{a + b}{N} \tag{7.7}$$
$$\rho_2 = \frac{a + c}{N} \tag{7.8}$$
This problem can be solved using either a z or chi-square test to test the difference
between the two proportions. It is important to note that the z test and chi-square test are
both applicable when dealing with two tail tests, but if your problem is directional (lower
or upper) then you can only use the z test.
1. McNemar Z test
To test the null hypothesis we can use the McNemar z test statistic defined by
equation (7.9), which is approximately normally distributed:
$$Z_{cal} = \frac{b - c}{\sqrt{b + c}} \tag{7.9}$$
$$\chi^2_{cal} = \frac{(b - c)^2}{b + c} \tag{7.10}$$
For one degree of freedom the relationship between chi-square and Z is given by
χ²cal = (Zcal)². Figure 7.6 illustrates the Excel solution for the McNemar Z and
chi-square tests.
Figure 7.6
➜ Excel solution
Data series: Cells D6:E7
Sum row 1 Cell F6 Formula: =SUM(D6:E6)
Sum row 2 Cell F7 Formula: =SUM(D7:E7)
Sum column 1 Cell D8 Formula: =SUM(D6:D7)
1 State hypothesis
Given that the population proportions are π1 and π2 then the null and alternative
hypotheses are as follows:
H0: π1 = π2.
H1: π1 ≠ π2.
Two tail test.
Where π1 represents the proportion voting ‘drop CO2’ before the campaign and π2
represents the proportion voting ‘drop CO2’ after the campaign.
$$z_{cal} = \frac{b - c}{\sqrt{b + c}} = \frac{89 - 45}{\sqrt{89 + 45}} = 3.801021$$ (Cell J18)
Compare the chosen significance level (α) of 5% (or 0.05) with the calculated two tail
p-value of 0.00014.
5 Make a decision
We observe that the p-value < α (0.00014 < 0.05), and we conclude that we reject H0
and accept H1.
❉ Interpretation There is a significant difference in the voting intentions for ‘drop CO2’
after the campaign compared with before the campaign.
$$Z_{cal} = \frac{b - c}{\sqrt{b + c}} = \frac{89 - 45}{\sqrt{89 + 45}} = 3.801021$$ (Cell J18)
Identify the region of rejection using the critical test statistic. The critical value can be
found from Excel by using the NORM.S.INV() function. From Excel, the two tail zcri = ±1.96
(see Cells J20 and J21). Does the test statistic lie within the region of rejection? Compare
the calculated and critical value to determine which hypothesis statement (H0 or H1) to
accept. We observe that Zcal lies in the region of rejection (3.801021 > 1.96), so we reject H0
and accept H1.
❉ Interpretation There is a significant difference in the voting intentions for ‘drop CO2’
after the campaign compared with before the campaign.
Note
1. This problem can be solved using a chi-square method defined by equation (7.10). For one
degree of freedom the value of χ²cal and the corresponding value of Z are related by
χ²cal = (Zcal)² (see Cells J24 and J25).
2. If we are interested in testing for direction in the alternative hypothesis, for example H1:
π1 > π2, then you cannot use a chi-square test but will have to undertake a normal distribu-
tion z test to test for direction.
The two proportion solution can be extended easily to more than two proportions, but
this is beyond the scope of this text book.
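The McNemar calculation above can also be cross-checked outside Excel. The following is a minimal sketch in Python (standard library only), assuming the discordant counts b = 89 and c = 45 from Table 7.7; the function name `mcnemar` is hypothetical and is introduced purely for illustration.

```python
import math

def mcnemar(b, c):
    """Return (z, chi_square, two_tail_p) for the discordant counts b and c."""
    z = (b - c) / math.sqrt(b + c)          # equation (7.9)
    chi_square = (b - c) ** 2 / (b + c)     # equation (7.10), equal to z squared
    # Two tail p-value from the standard normal distribution:
    # 2 * (1 - Phi(|z|)) = erfc(|z| / sqrt(2))
    p_two_tail = math.erfc(abs(z) / math.sqrt(2))
    return z, chi_square, p_two_tail

# Discordant cells of Table 7.7
z_cal, chi_cal, p_value = mcnemar(b=89, c=45)
print(round(z_cal, 6), round(chi_cal, 4), p_value)
```

Only the two off-diagonal cells enter the statistic, which is why the 287 and 200 voters who did not change their intention play no part in the test.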
Student exercises
X7.6 A business analyst requests answers to the following questions:
(a) What is the p-value when the chi-square test statistic = 2.89 and we have one
degree of freedom?
(b) If you have one degree of freedom, what is the value of the Z test statistic?
(c) Find the critical chi-square value for significance levels of 1% and 5%.
X7.7 The petrol prices during the summer of 2008 raised concerns with new car sellers
that potential customers were taking prices into account when choosing a new car. To
provide evidence to test this possibility a group of five local car showrooms agreed to
ask fleet managers and individual customers during August 2008 whether they were or
were not influenced by petrol prices. The results were as shown in Table 7.9.
Table 7.9
At a 5% level of significance is there any evidence for the concerns raised by the car
showroom owners? Answer this question using both the critical test statistic and
p-value.
X7.8 A business analyst has been asked to confirm the effectiveness of a marketing
campaign on people’s attitudes to global warming. To confirm that the campaign was
effective a group of 500 people were randomly selected from the population and
asked the simple question about whether they agree that national governments should
be concerned with an answer of ‘Yes’ or ‘No’. The results are as shown in Table 7.10.
Table 7.10
At a 5% level of significance is there any evidence that the campaign has increased the
number of people requesting that national governments should be concerned that global
warming is an issue? Answer this question using both the critical test statistic and p-value.
(O) with the expected frequencies (E) predicted by fitting a particular probability distri-
bution to the data set of observed frequencies or to compare whether observed sample
frequencies differ significantly from expected frequencies. The chi-square test is an alter-
native to the Anderson–Darling and Kolmogorov–Smirnov goodness-of-fit tests. The chi-
square goodness-of-fit test can be applied to discrete distributions, such as the binomial
and the Poisson. The Kolmogorov–Smirnov and Anderson–Darling tests are restricted to
continuous distributions.
For a chi-square goodness-of-fit test, the hypotheses take the following form:
$$\chi^2_{cal} = \sum \frac{(O - E)^2}{E}, \qquad df = n - k - 1 \tag{7.11}$$
Table 7.11
Table 7.12
Example 7.4
To illustrate the method consider the example of a motorway safety officer who believes that
the number of accidents per week occurring on a stretch of motorway can be modelled using
a Poisson distribution. If X denotes the number of accidents per week then the sample data
can be modelled by fitting a Poisson distribution to the sample data. Figure 7.7 provides the
tabulated data and the chi-square goodness-of-fit test. The Poisson probability distribution is
given by equation (4.10).
$$P(X = r) = \frac{\lambda^{r} e^{-\lambda}}{r!}$$
where r = 0, 1, 2, 3, . . .
Figure 7.7
➜ Excel solution
X Cells B5:B11 Values
O Cells C5:C11 Values
xO Cells D5 Formula: =B5*C5
Copy formula down D5:D11
Estimated mean Cell D13 Formula: =SUM(D5:D11)/SUM(C5:C11)
P(X) Cells F5 Formula: =POISSON.DIST(B5,$D$13,FALSE)
Copy formula down F5:F11
E Cells G5 Formula: =SUM($C$5:$C$11)*F5
Copy formula down G5:G11
(O − E)^2/E Cells H5 Formula: =(C5−G5)^2/G5
Copy formula down H5:H11
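The worksheet steps above (estimate the mean, compute Poisson probabilities, scale to expected frequencies, then sum (O − E)²/E) can be sketched in Python as follows. The observed counts are hypothetical values invented for illustration, since the data behind Figure 7.7 are not reproduced here; they were chosen only so that the estimated mean equals 2.

```python
import math

def poisson_pmf(r, lam):
    """P(X = r) for a Poisson distribution with mean lam."""
    return lam ** r * math.exp(-lam) / math.factorial(r)

# Hypothetical weekly accident counts (NOT the data behind Figure 7.7)
x_values = [0, 1, 2, 3, 4, 5, 6]
observed = [7, 13, 14, 9, 4, 2, 1]

total = sum(observed)
# Estimated mean, as in Cell D13: sum(x * O) / sum(O)
lam = sum(x * o for x, o in zip(x_values, observed)) / total
# Expected frequencies, as in column G: N * P(X = x)
expected = [total * poisson_pmf(x, lam) for x in x_values]
# Chi-square statistic, as in Cell L15
chi_square = sum((o - e) ** 2 / e for o, e in zip(observed, expected))

n_classes = len(x_values)
k_estimated = 1                       # one parameter (the mean) estimated from the data
df = n_classes - k_estimated - 1      # equation (7.11)
print(lam, df, chi_square)
```

As in the worksheet, the sketch does not pool classes with small expected frequencies; that adjustment would have to be made before computing the statistic.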
Figure 7.8
➜ Excel solution
Significance level Cell L12 Value =0.05
Chi-square (χ2) Cell L15 Formula: =SUM(H5:H11)
n Cell L18 Formula: =COUNT(B5:B11)
k Cell L19 Value =1
df Cell L20 Formula: =L18−L19−1
Critical value χ2 Cell L21 Formula: =CHISQ.INV.RT(L12,L20)
P-value Cell L22 Formula: =CHISQ.DIST.RT(L15,L20)
or p-value using CHISQ.DIST Cell L24 Formula: =1−CHISQ.DIST(L15,L20,TRUE)
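For readers who want to check the CHISQ.DIST.RT result without Excel, the upper tail probability can be computed with the Python standard library. This is a sketch under the assumption that the degrees of freedom are a positive integer; the function name `chi_square_upper_tail` is hypothetical. It starts from the closed forms for 1 or 2 degrees of freedom and steps up two degrees of freedom at a time using the standard recurrence for the incomplete gamma function.

```python
import math

def chi_square_upper_tail(x, df):
    """Upper tail probability P(chi-square > x) for a positive integer df."""
    half_x = x / 2.0
    if df % 2 == 1:
        tail = math.erfc(math.sqrt(half_x))   # exact result for df = 1
        a = 0.5
    else:
        tail = math.exp(-half_x)              # exact result for df = 2
        a = 1.0
    # Recurrence: Q(a + 1, x/2) = Q(a, x/2) + (x/2)^a * exp(-x/2) / Gamma(a + 1)
    while 2 * a < df:
        tail += half_x ** a * math.exp(-half_x) / math.gamma(a + 1)
        a += 1.0
    return tail

# Test statistic and degrees of freedom quoted in the text
p_value = chi_square_upper_tail(2.7914, 5)
print(p_value)
```

Evaluating the same function at the critical value 11.0705 with 5 degrees of freedom should return approximately the 5% significance level, which is a useful consistency check on CHISQ.INV.RT.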
2 Select test
Comparing observed frequency with an expected frequency predicted by the Poisson
distribution. Chi-square goodness-of-fit test.
$$\lambda = \frac{\sum fx}{\sum f} = 2$$ (see Cell D13)
5 Make decision
From Excel, p-value > α (0.73 > 0.05), and we would accept H0 and reject H1. Figure 7.9
illustrates the relationship between the p-value, test statistic, and the critical test statistic.
1 State hypothesis
2 Select test
5 Make decision
From Excel, calculated chi-square value < critical chi-square value (2.7914 < 11.0705),
and we would accept H0 and reject H1. Figure 7.9 illustrates the relationship between
the p-value, test statistic, and the critical test statistic.
Figure 7.9 (p-value = 0.732 when chi-square = 2.7914)
Note The expected frequency for each class must be at least 5. If necessary
combine classes in the table to satisfy this requirement. For example, in Figure 7.7 the
expected frequencies are not all at least 5 and the affected classes should be combined. This
will result in the number of classes n being reduced from 7 to 5 and the number of degrees
of freedom (df) from 5 to 3. This results in a new p-value = 0.4249 and critical test
statistic = 7.8. In this case, the overall conclusion would not change.
Student exercises
X7.9 An employment agency has recently implemented a new training programme to
develop the interview skills of potential job applicants. Based upon the collected data
(Table 7.14) can we say confidently that the data can be modelled using a binomial
distribution (assess at 5%)?
Table 7.14
X7.10 A university has recently set up a satellite department within a local college of
higher education. The university claims that 35% of the undergraduate students are
in department A, 26% are in department B, 25% are in department C, and 14% are
in department D. A random sample of 320 students finds the following number of
students in departments A–D: 132, 89, 64, and 35. Perform a hypothesis test at 5% to
test this claim.
X7.11 A new airport terminal has been assessing waiting times for passengers to be processed
at the airport check-in counters. The airport owners would like to be able to attach
levels of risk to different aspects of the business. To undertake this we are required to fit
an appropriate probability distribution to the observed frequencies provided in Table
7.15. (a) Use the data in Table 7.15 to provide an estimate of the population mean and
standard deviation; (b) construct a z distribution table with upper class boundaries of
14, 17, 22, 26, and infinity; (c) use this table to calculate the cumulative distribution
function values at these class boundaries based on your answers to parts (a)–(b); (d)
estimate the class probabilities and resultant expected frequencies; (e) calculate the
observed frequencies based upon your upper class boundaries; and (f) undertake
a chi-square goodness-of-fit test to assess at a 95% confidence that the normal
distribution would be a good fit to the sample data.
6 7 7 8 10 12 13 13 14
14 15 15 16 16 16 16 16 17
17 18 13 18 19 19 19 20 20
22 23 23 12 24 25 25 26 27
27 27 28 28 29 30 30 31 33
Table 7.15
Table 7.16
(a) Assessing the validity of a population median value assessed from collected sample
data—replaces the one-sample t-test, which assumes a normal population and that
a mean value has meaning
(b) Assessing the validity that the difference between two population medians is zero
based upon sample data—replaces the paired t-test, which assumes a normal
population and that a mean value has meaning
(c) Assessing the validity of proportions where the proportions are estimated from
ordered nominal (or categorical) data where a numerical scale is inappropriate,
but where we can rank the data observations—replaces the sample Z test for
proportions, which assumes a normal population.
If we rank the data then the null hypothesis would result in half the ranks to be less than
the median (r1) and half the ranks would be greater than the median (r2). In this situation
the null hypothesis can be modelled by a binomial distribution with the probability of a
data value being less than or greater than the median being equal to p = 0.5, with sample
size n. The sign test assumptions are: (1) randomly selected samples and (2) continuous
distribution. The sign test measures the number of counts that fall above and below the
median value. Given that 50% of all values lie below and 50% of all values lie above then
the population proportion (or probability) at the median value is 50% or 0.5. Under the
null hypothesis, we would expect the number of counts distribution to be approximately
symmetric around the median and the distribution of values below and above to be dis-
tributed at random among the ranks. The corresponding hypothesis statements for two
tail and one tail tests are as presented in Table 7.17.
Table 7.17
In this case the probability distribution is a binomial distribution with the probability
(or proportion) of success = 0.5 and the number of trials represented by the number of
paired observations (n), so we can model the situation using a binomial distribution
X ~ Bin(n, p). The probability P(X = r), mean (μ), and standard deviation (σ) are given
by equations (4.5), (4.8), and (4.9):
$$P(X = r) = {}^{n}C_{r}\, p^{r} q^{n-r}, \qquad \mu = np, \qquad \sigma = \sqrt{np(1 - p)}$$
Example 7.5
To illustrate the concept consider the situation where 16 randomly-selected people were cho-
sen to measure the effectiveness of a new training programme on the value of sales. For the
training programme to be effective we would expect the hypothesis statement to be H1: the
training programme results in the average value in sales to increase. Given that we are told only
that we have a random selection and no information is given about the distribution, then we
will use the sign test to answer the question.
Figure 7.10
➜ Excel solution
Person: Cells A4:A19 Values
A: Cells B4:B19 Values
B: Cells C4:C19 Values
d = Cell D4 Formula: =B4−C4
Copy formula down D4:D19
Sign Cell F4 Formula: =IF(D4<0,"−",IF(D4>0,"+","0"))
Copy formula down F4:F19
➜ Excel solution
Level = Cell K8 Value =0.05
Median d = Cell K10 Formula =MEDIAN(D4:D19)
p = Cell K11 Value =0.5
N = Cell K12 Formula =COUNT(A4:A19)
r− = Cell K13 Formula =COUNTIF(F4:F19,"−")
Figure 7.11
2 Select test
Two dependent samples, both samples consist of ratio data, and no information on
the form of the distribution. Conduct sign test.
Note This value given by the binomial equation represents an exact p-value.
If n is sufficiently large (n >25), we can use a normal approximation with the value
of the mean and standard deviation given by equations (4.8) and (4.9). From Excel,
μ = 8 (see Cell K25) and σ = 2 (see Cell K26). If μ ± 2σ is contained within the range
of the binomial 0 – n′, then the normal approximation should be an accurate
approximation. The normal approximation z equation is given by equation (7.12):
$$Z_{cal} = \frac{X_c - \mu}{\sigma} \tag{7.12}$$
5 Make decision
We will reject H0 and accept H1 given that the binomial p-value (0.0384) < α (0.05) and
normal approximation p-value = 0.0401 < α (0.05).
❉ Interpretation From the sample data we have sufficient statistical evidence that the
after sales are significantly larger than the before sales.
Note The decision to accept the alternative hypothesis is a borderline decision and will
change if the significance level changes to 1% from 5%.
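The exact binomial and normal approximation p-values for the sign test can be sketched in Python as shown below. The count x = 12 is an assumption: the text does not state the number of favourable signs, but 12 out of n′ = 16 reproduces both quoted p-values (0.0384 and 0.0401). The helper names are hypothetical.

```python
import math

def binomial_upper_tail(x, n, p=0.5):
    """Exact P(X >= x) for X ~ Bin(n, p)."""
    return sum(math.comb(n, r) * p ** r * (1 - p) ** (n - r)
               for r in range(x, n + 1))

def normal_upper_tail(z):
    """P(Z > z) for a standard normal variable."""
    return 0.5 * math.erfc(z / math.sqrt(2))

n = 16    # number of non-zero paired differences (mu = 8 and sigma = 2 in the text)
x = 12    # ASSUMED count of the favoured sign; not stated in the text

p_exact = binomial_upper_tail(x, n)

mu = n * 0.5                          # equation (4.8)
sigma = math.sqrt(n * 0.5 * 0.5)      # equation (4.9)
z_cal = (x - 0.5 - mu) / sigma        # equation (7.12) with continuity correction
p_normal = normal_upper_tail(z_cal)
print(p_exact, p_normal)
```

Both values fall between 1% and 5%, which illustrates why the decision is described as borderline.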
Student exercises
X7.12 A researcher has undertaken a sign test with the following results: sum of positive and
negative signs are 15 and 4, respectively, with 3 ties. Given that binomial p = 0.5, assess
whether there is evidence that the median value is > 0.5 (assess at 5%).
X7.13 A teacher of 40 university students studying the application of Excel within a business
context is concerned that students are not taking a group work assignment seriously.
This is deemed to be important given that the group work element is contributing to
the development of personal development skills. To assess whether or not this is a
problem the module tutor devises a simple experiment which judges the individual
level of cooperation by each individual student within their own group. In the
experiment a rating scale is employed to measure the level of cooperation: 1 = limited
cooperation, 5 = moderate cooperation, and 10 = complete cooperation. The form of
the testing consists of an initial observation, a lecture on working in groups, and a final
observation. Given the raw data in Table 7.18 conduct a relevant test to assess whether
or not we can observe that cooperation has changed significantly (assess at 5%).
5, 8 4, 6 3, 3 6, 5 8, 9 10, 9 8, 8 4, 8 5, 5 8, 9
3, 5 5, 4 6, 5 4, 4 7, 8 7, 9 9, 9 8, 7 5, 8 5, 6
8, 7 8, 8 3, 4 5, 6 6, 7 4, 8 7, 8 9, 10 10, 10 8, 9
8, 8 4, 6 4, 5 7, 8 5, 7 7, 9 8, 10 3, 6 5, 6 7,8
Table 7.18
X7.14 A leading business-training firm advertises in its promotional material that its class sizes
at its Paris branch are no greater than 25. Recently, the firm has received a number of
complaints from disgruntled students who have complained that class sizes are >25
for a majority of its courses in Paris. To assess this claim the company selects 15 classes
at random and measures the class sizes as follows: 32, 19, 26, 25, 28, 21, 29, 22, 27,
28, 26, 23, 26, 28, and 29. Undertake an appropriate test to assess whether there is
any justification to the complaints (assess at 5%). What would your decision be if you
assessed at 1%?
number of paired observations is large (n > 20) we can use a test based on the normal
distribution.
The Wilcoxon signed rank sum test assumptions are:
Although the Wilcoxon test assumes neither normality nor homogeneity of variance, it
does assume that the two samples are from populations with the same distribution shape.
It is also vulnerable to outliers, although not to nearly the same extent as the t-test. If we
cannot make this assumption about the distribution then we should use a test called the
sign test for ordinal data. The McNemar test is available for nominal paired data relat-
ing to dichotomous qualitative variables and is described in Section 7.1.3. In this section
we shall solve the Wilcoxon signed rank sum test where we have a large and small num-
ber of paired observations. In the case of a large number of paired observations (n > 20)
we shall use a normal approximation to provide an answer to the hypothesis statement.
Furthermore, for a large number of paired observations we shall use Excel to calculate
both the p-value and critical z value to make a decision. The situation of a small number
of paired observations (n ≤ 20) will be described together with an outline of the solution
process.
Example 7.6
Suppose that Slim-Gym is offering a weight reduction programme that they advertise will result
in more than a 10-lb weight loss in the first 30 days. Twenty-four subjects were selected for a study
and their weights before and after the weight loss programme were recorded.
Figures 7.12 and 7.13 illustrate the Excel solution, where X and Y represent the weight
before and after the weight loss programme. For this problem we should be able to write
the null and alternative hypotheses as H0: X – Y – 10 ≤ 0, H1: X – Y – 10 > 0.
Figure 7.12
➜ Excel solution
X: Cells A4:A27 Values
Y: Cells B4:B27 Values
d = Cell C4 Formula: =A4−B4−$K$7
Copy formula down C4:C27
ABS(d) = Cell E4 Formula: =ABS(C4)
Copy formula down E4:E27
Rank Cell G4 Formula: =RANK.AVG(E4,$E$4:$E$27,1)
Copy formula down G4:G27
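The ranking steps in the worksheet can be sketched in Python as follows. The weights are hypothetical values invented for illustration (they are not the data in Figure 7.12), and the function names are assumptions. Note one difference from the worksheet: this sketch drops zero differences before ranking, whereas the RANK.AVG formula above ranks every row.

```python
def average_ranks(values):
    """Ascending ranks with ties sharing their mean rank, like RANK.AVG(..., 1)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of the current run of tied values
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1          # mean of positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks

def signed_rank_sums(x, y, d0=0):
    """Return (T_minus, T_plus, n_prime) for paired samples."""
    d = [a - b - d0 for a, b in zip(x, y)]
    d = [v for v in d if v != 0]          # ignore zero differences
    ranks = average_ranks([abs(v) for v in d])
    t_minus = sum(r for v, r in zip(d, ranks) if v < 0)
    t_plus = sum(r for v, r in zip(d, ranks) if v > 0)
    return t_minus, t_plus, len(d)

# Hypothetical before/after weights in lb (NOT the data in Figure 7.12)
before = [200, 185, 210, 178, 195, 220]
after = [185, 176, 195, 170, 180, 205]
t_minus, t_plus, n_prime = signed_rank_sums(before, after, d0=10)
print(t_minus, t_plus, n_prime)
```

A quick check on any such calculation is that the two rank sums add up to n′(n′ + 1)/2.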
Figure 7.13
➜ Excel solution
D0 = Cell K7 Value
Median difference = Cell K8 Formula =MEDIAN(C4:C27)
Significance level = Cell K13 Value
n = Cell K16 Formula =COUNT(A4:A27)
n0 = Cell K17 Formula =COUNTIF(E4:E27,"0")
n′ = Cell K18 Formula =K16−K17
T− = Cell K19 Formula =SUMIF(C4:C27,"<0",G4:G27)
T+ = Cell K20 Formula =SUMIF(C4:C27,">0",G4:G27)
n′(n′ + 1)/2 = Cell K21 Formula =K18*(K18+1)/2
T− + T+ = Cell K22 Formula =K19+K20
Two tail test, T = Cell K23 Formula =MIN(K19,K20)
Upper one tail test, T = Cell K24 Formula =K20
Lower one tail test, T = Cell K25 Formula =K19
mu = Cell K26 Formula =K18*(K18+1)/4
Note The value of z has been corrected for continuity by subtracting 0.5 (H1: >)
2 Select test
Two dependent samples.
Both samples consist of ratio data.
No information on the form of the distribution.
Wilcoxon signed rank test.
Median value centred at D0 = 10 (see Cell K7).
The median difference is +4.1 (Cell K8), which supports the alternative hypothesis
that d > 0. If this was negative, or zero, then you would not conduct the test as there is
no evidence from the sample that d > 0.
$$T_+ + T_- = \frac{n'(n' + 1)}{2} \tag{7.13}$$
$$T_+ + T_- = \frac{n'(n' + 1)}{2} = 300$$ (see Cells K21 and K22)
Find Tcal
The value of Tcal is determined from the criteria outlined in Table 7.19.
Table 7.19
Given that we have an upper one tail test, then Tcal = 265.
Find Zcal
If the number of pairs is such that n is large enough (>20) a normal approximation
can be used with Zcal given by equation (7.14), and the mean and standard deviation
given by equations (7.15) and (7.16) respectively.
$$Z_{cal} = \frac{T_{cal} - \mu_T \pm 0.5}{\sigma_T} \tag{7.14}$$
$$\mu_T = \frac{n'(n' + 1)}{4} \tag{7.15}$$
$$\sigma_T = \sqrt{\frac{n'(n' + 1)(2n' + 1)}{24}} \tag{7.16}$$
The value of Zcal is corrected for continuity by subtracting 0.5 if H1: > 0 or adding 0.5 if
H1: < 0. From Excel: μT = 150 (see Cell K26), σT = 35.0 (see Cell K27), and Zcal is given
by equation (7.14).
Compare the chosen significance level (α) of 5% (or 0.05) with the calculated upper
one tail p-value of 0.0005.
5 Make decision.
We will reject H0 and accept H1 given the upper one tail p-value (0.0005) < α (0.05).
❉ Interpretation From the sample data we have sufficient statistical evidence that the
weight loss is greater than 10 lbs.
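The normal approximation in equations (7.14) to (7.16) can be verified with a short Python sketch. It assumes Tcal = 265 as stated above and n′ = 24, the value implied by T+ + T− = 300 and μT = 150; the function name is hypothetical.

```python
import math

def wilcoxon_upper_tail(t_cal, n_prime):
    """Normal approximation for an upper one tail Wilcoxon signed rank test."""
    mu_t = n_prime * (n_prime + 1) / 4                                       # (7.15)
    sigma_t = math.sqrt(n_prime * (n_prime + 1) * (2 * n_prime + 1) / 24)    # (7.16)
    z_cal = (t_cal - mu_t - 0.5) / sigma_t     # (7.14), subtracting 0.5 for H1: > 0
    p_upper = 0.5 * math.erfc(z_cal / math.sqrt(2))   # P(Z > z_cal)
    return mu_t, sigma_t, z_cal, p_upper

mu_t, sigma_t, z_cal, p_upper = wilcoxon_upper_tail(t_cal=265, n_prime=24)
print(mu_t, sigma_t, z_cal, p_upper)
```

The sketch makes no adjustment for tied ranks, in line with the treatment used in the example.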
The solution procedure is exactly the same as for the p-value except that we use the criti-
cal test statistic value to make a decision. The calculated test statistic zcal = 3.2714 (see Cell
K26). Calculate the critical test statistic, Zcri. The critical Z values can be found from Excel
by using the NORM.S.INV () function, upper one tail Zcri = +1.6449 (see Cell K30). Does
the test statistic lie within the region of rejection? Compare the calculated and critical Z
values to determine which hypothesis statement (H0 or H1) to accept. We observe that zcal
lies in the upper rejection zone (3.2714 > 1.6449) and we accept H1.
❉ Interpretation From the sample data we have sufficient statistical evidence that the
weight loss is greater than 10 lbs.
Figure 7.14 illustrates the relationship between the p-value and test statistic.
[Figure: normal curve f(x) plotted against Z, marking Zcri = +1.65 and Zcal = 3.2]
Figure 7.15
The small sample method uses tables to look up the lower critical value (e.g. 92) and you
have to use the smallest T value as Tcal. If you want the upper critical value then you can
calculate the value if you remember that the distribution is symmetric about the median
(remember median = mean for symmetric distributions). From Figure 7.15: μT – lower
Tcri = upper Tcri – μT.
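As a worked illustration of this symmetry relationship, take the illustrative lower critical value of 92 quoted above together with μT = 150 from Example 7.6 (pairing these two numbers is an assumption made only for the arithmetic):

```latex
\text{upper } T_{cri} = 2\mu_T - \text{lower } T_{cri} = 2(150) - 92 = 208
```

Both critical values then lie 58 units from μT, one on each side.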
1. Observations in the sample may be exactly equal to zero in the case of paired
differences. Ignore such observations and adjust n accordingly. For the previous
example we removed any values and used n′ instead of n.
2. Two or more observations/differences may be equal. If so, average the ranks across
the tied observations and reduce the variance by equation (7.17) for each group of t
tied ranks.
$$(t^3 - t)/48 \tag{7.17}$$
Note In the example and exercises we have not modified the solution for tied ranks.
Student exercises
X7.15 The Wilcoxon paired ranks test is considered to be more powerful than the sign test.
Explain why.
X7.16 A company is planning to introduce new packaging for a product that has used the
same packing for over 20 years. Before it makes a decision on the new packaging it
decides to ask a panel of 20 participants to rate the current and proposed packaging
(using a rating scale of do not change 0–change 100) (Table 7.20). Is there any
evidence that the new packaging is more favourably received compared with the older
packaging (assess at 5%)?
Table 7.20
X7.17 A local manufacturer is concerned at the number of errors made by machinists in the
production of kites for a multinational retail company. To reduce the number of errors
being made the company decides to retrain all staff in a new set of procedures to
minimize the problem. To assess whether the training worked a random sample of 30
machinists was selected, and the number of errors made before and after the training
recorded as shown in Table 7.21.
Machinist
1 2 3 4 5 6 7 8 9 10
Before 49 34 30 46 37 28 48 40 42 45
After 22 23 32 24 23 21 24 29 27 27
11 12 13 14 15 16 17 18 19 20
Before 29 45 32 44 49 28 44 39 47 41
After 23 29 37 22 33 27 35 32 35 24
21 22 23 24 25 26 27 28 29 30
Before 33 38 35 35 47 47 48 35 41 35
After 37 37 24 23 23 37 38 30 29 31
Table 7.21
Is there any evidence that the training has reduced the number of errors (assess at 5%)?
The basic premise of the test is that once all of the values in the two samples are put into
a single ordered list, if they come from the same parent population, then the rank at which
values from sample 1 and sample 2 appear will be by chance. If the two samples come
from different populations, then the rank at which the sample values will appear will not
be random and there will be a tendency for values from one of the samples to have lower
ranks than values from the other sample. We are thus testing for different locations of the
two samples. Whenever n1 and n2 are both greater than 20, a large sample approximation can be
used for the distribution of the Mann–Whitney U statistic. The Mann–Whitney assumptions
are as follows: (i) independent random samples are obtained from each population,
and (ii) the two populations are continuous and have the same shape.
Example 7.7
A local training firm has developed an innovative programme to improve the performance of
students on the courses it offers. To assess whether the new programme improves student per-
formance the firm have collected two random samples from the population of students sitting
an accountancy examination, where sample 1 students have studied via the traditional method
and sample 2 students via the new programme. The firm has analysed previous data and the
outcome of the results provides evidence that the distribution is not normally distributed, but
is skewed to the left. This information results in concern at the suitability of using a two sam-
ple independent t-test to undertake the analysis and, instead, they decide to use a suitable
distribution-free test. In this case the appropriate test is the Mann–Whitney U test.
Figures 7.16 and 7.17 illustrate the Excel Mann–Whitney U test solution.
Figure 7.16
➜ Excel solution
Training type: Cells A4:A18 Values
Combined samples: Cells B4:B18 Values
Rank Cell C4 Formula: =RANK.AVG(B4,$B$4:$B$18,1)
Copy formula down C4:C18
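The pooled ranking step can be sketched in Python as follows. The examination marks are hypothetical values invented for illustration (they are not the data in Figure 7.16), and the helper `rank_avg` is an assumed name that mimics the ascending RANK.AVG formula above.

```python
def rank_avg(value, pool):
    """Ascending average rank of value within pool (ties share the mean rank)."""
    smaller = sum(1 for v in pool if v < value)
    equal = sum(1 for v in pool if v == value)
    return smaller + (equal + 1) / 2

# Hypothetical examination marks (NOT the data in Figure 7.16)
sample_1 = [52, 61, 47, 58, 61]         # traditional method
sample_2 = [66, 70, 61, 74, 59, 68]     # new programme
pool = sample_1 + sample_2

# Rank sums T1 and T2, as in Cells H16 and H17
t1 = sum(rank_avg(v, pool) for v in sample_1)
t2 = sum(rank_avg(v, pool) for v in sample_2)
n_total = len(pool)
print(t1, t2)
```

Because every value is ranked exactly once in the pooled list, T1 + T2 must equal N(N + 1)/2, where N is the combined sample size.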
Figure 7.17
➜ Excel solution
Significance level = Cell H10 Value
Median sample 1 = Cell H12 Formula: =MEDIAN(B4:B10)
Median sample 2 = Cell H13 Formula: =MEDIAN(B11:B18)
n1 = Cell H14 Formula: =COUNTIF(A4:A18,"=1")
n2 = Cell H15 Formula: =COUNTIF(A4:A18,"=2")
T1 = Cell H16 Formula: =SUMIF(A4:A18,"=1",C4:C18)
T2 = Cell H17 Formula: =SUMIF(A4:A18,"=2",C4:C18)
T1max = Cell H18 Formula: =H14*H15+H14*(H14+1)/2
T2max = Cell H19 Formula: =H14*H15+H15*(H15+1)/2
U1 = Cell H20 Formula: =H18−H16
U2 = Cell H21 Formula: =H19−H17
U1 + U2 = Cell H22 Formula: =H20+H21
n1n2 = Cell H23 Formula: =H14*H15
Ucal = Cell H24 Formula: =MIN(H20,H21)
mu = Cell H25 Formula: =H14*H15/2
sigma = Cell H26 Formula: =SQRT(H14*H15*(H14+H15+1)/12)
Z = Cell H27 Formula: =(H24−H25+0.5)/H26
Lower one tail Zcri = Cell H26 Formula: =NORM.S.INV(H10)
Lower p-value = Cell H26 Formula: =NORM.S.DIST(H27,TRUE)
Note The value of z has been corrected for continuity by adding 0.5 (H1: < 0)
1 State hypothesis
H0: no difference in examination performance between the two groups.
H1: new programme improved performance (M1 < M2).
Lower one tailed test.
2 Select test
Comparing two independent samples.
Both samples consist of ratio data.
Unknown population distribution.
Mann–Whitney U test.
$$U_1 = n_1 n_2 + \frac{n_1(n_1 + 1)}{2} - T_1 \tag{7.18}$$
$$U_2 = n_1 n_2 + \frac{n_2(n_2 + 1)}{2} - T_2 \tag{7.19}$$
Substituting the computed values into equations (7.18) and (7.19) gives U1 = 47 (see
Cell H20), U2 = 9 (see Cell H21). Check using equation (7.20):
$$U_1 + U_2 = n_1 n_2 \tag{7.20}$$
From Excel, U1 + U2 = 56 (Cell H22) and n1n2 = 56 (Cell H23). The value of Ucal can
be either U1 or U2, and, for this example, we will choose Ucal = Minimum of U1 and
U2 = MIN (47, 9) = 9 (see Cell H24).
If the null hypothesis is true then we would expect U1 and U2 both to be centred at the
mean value μU, given by equation (7.21).
$$\mu_U = \frac{n_1 n_2}{2} \tag{7.21}$$
$$Z = \frac{U_{cal} - \mu_U + 0.5}{\sigma_U} \tag{7.22}$$
$$\sigma_U = \sqrt{\frac{n_1 n_2 (n_1 + n_2 + 1)}{12}} \tag{7.23}$$
The value of Zcal is corrected for continuity by subtracting 0.5 if H1: > 0 or adding 0.5 if
H1: < 0. From Excel, μU = 28 (see Cell H25), σU = 8.6410 (see Cell H26), and Zcal is given
by equation (7.22).
5 Make decision
We will reject H0 and accept H1 given the lower one tail p-value (0.0161) < α (0.05).
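Equations (7.18) to (7.23) can be checked with the Python sketch below. It assumes n1 = 7 and n2 = 8 (the group sizes implied by n1n2 = 56 and the worksheet ranges) and rank sums T1 = 37 and T2 = 83, which are not quoted in the text but are back-calculated from U1 = 47 and U2 = 9. The function name is hypothetical.

```python
import math

def mann_whitney_lower_tail(n1, n2, t1, t2):
    """Normal approximation for a lower one tail Mann-Whitney U test."""
    u1 = n1 * n2 + n1 * (n1 + 1) / 2 - t1                 # (7.18)
    u2 = n1 * n2 + n2 * (n2 + 1) / 2 - t2                 # (7.19)
    u_cal = min(u1, u2)
    mu_u = n1 * n2 / 2                                    # (7.21)
    sigma_u = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)     # (7.23)
    z_cal = (u_cal - mu_u + 0.5) / sigma_u                # (7.22), adding 0.5 for H1: < 0
    p_lower = 0.5 * math.erfc(-z_cal / math.sqrt(2))      # P(Z < z_cal)
    return u1, u2, z_cal, p_lower

# T1 = 37 and T2 = 83 are back-calculated assumptions, see the note above
u1, u2, z_cal, p_lower = mann_whitney_lower_tail(7, 8, 37, 83)
print(u1, u2, z_cal, p_lower)
```

The check U1 + U2 = n1n2 from equation (7.20) holds for these inputs, and the resulting p-value agrees with the lower one tail value used in the decision above.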
Figure 7.18 illustrates the relationship between the p-value and test statistic.
Figure 7.18 (normal curve f(x) against Z, showing the 5% lower tail rejection region below Zcri = −1.65 and Zcal = −2.1410)
The theory suggests that if the null hypothesis is true then the U test statistic will be cen-
tred at μU = 28 with critical regions identified in Figure 7.19.
Figure 7.19 (U2 = 9, Ucri = 13, μU = 28, U1 = 47)
Note The Mann–Whitney U test is equivalent statistically to the Wilcoxon rank sum test.
Student exercises
X7.18 What assumptions need to be made about the type and distribution of the data when
the Mann–Whitney test is used?
X7.19 Two groups of randomly-selected students are tested on a regular basis as part of
professional appraisals that are conducted on a two-year cycle by a leading financial
services company based in London. The first group has 8 students, with their sum of
the ranks equal to 65, and the second group has 9 students. Is there sufficient evidence
to suggest that the performance of the second group is better than the performance of
the first group (assess at 5%)?
X7.20 The sale of new homes is tied closely to the level of confidence within the financial
markets. A developer builds new homes in two European countries (A and B) and is
concerned that there is a direct relationship between the country and the interest rates
obtainable to build properties. To provide answers the developer decides to undertake
market research to see what interest rates would be obtainable if he decided to borrow
€300,000 over 20 years from 5 financial institutions in country A and 8 financial
institutions in country B. Based upon the data in Table 7.22 do we have any evidence
to suggest that the interest rates are significantly different?
Table 7.22
■ Techniques in practice
TP1 CoCo S. A. is concerned about the time taken to react to customer complaints and has
implemented a new set of procedures for its support centre staff. The customer service director
has decided that there is no evidence for the population distribution to be normally distributed
and has directed that a suitable test is applied to the sample to assess whether the new target
mean time for responding to customer complaints is 28 days (Table 7.23).
20 33 33 29 24 30
40 33 20 39 32 37
32 50 36 31 38 29
15 33 27 29 43 33
31 35 19 39 22 21
28 22 26 42 30 17
32 34 39 39 32 38
Table 7.23
TP2 Bakers Ltd are currently undertaking a review of the delivery vans used to deliver prod-
ucts to customers. The company runs two types of delivery van (type A, recently purchased,
and type B, at least three years old), which are supposed to be capable of achieving 20 km per
litre of petrol. A new sample has now been collected as shown in Table 7.24.
(a) Assuming that the population distance travelled does not vary as a normal distribution, is
there any evidence to suggest that the two types of delivery van differ in mean distance
travelled?
(b) Based upon your analysis, is there any evidence that the new delivery vans meet the
mean average of 20 km per litre?
A B A B
17.68 15.8 26.42 34.8
18.72 36.1 25.22 16.8
26.49 6.3 13.52 15.0
26.64 12.3 14.01 28.9
9.31 15.5 33.9
22.38 40.1 27.1
20.23 20.4 16.8
28.80 3.7 23.6
17.57 13.6 29.7
9.13 35.1 28.2
20.98 33.3
Table 7.24
TP3 Skodel Ltd is developing a low calorie lager for the European market with a mean
designed calorie count of 43 calories per 100 ml. The new product development team are
having problems with the production process and have collected two independent random
samples to assess whether the target calorie count is being met (do not assume that the popu-
lation variables are normally distributed) (Table 7.25).
A B A B
49.7 39.4 45.2 34.5
45.9 46.5 40.5 43.5
37.7 36.2 31.9 37.8
40.6 46.7 41.9 39.7
34.8 36.5 39.8 41.1
51.4 45.4 54.0 33.6
34.3 38.2 47.8 35.8
63.1 44.1 26.3 44.6
41.2 58.7 31.7 38.4
41.4 47.1 45.1 26.1
41.1 59.7 47.9 30.7
Table 7.25
■ Summary
In this chapter we have explored hypothesis testing for category data using the chi-square
distribution and extended the parametric tests to the case of non-parametric tests (or
so-called distribution-free tests), which do not require the assumption that the population
(or sample) distributions are normal. This chapter adopted the simple five-step procedure
described in Chapter 6 to aid the solution process and focused on the application of Excel
to solve the data problems.
The main emphasis is placed on the use of the p-value, which is the probability of obtaining
a test statistic at least as extreme as the one observed if the null hypothesis (H0) is true.
Thus, if the measured p-value > α (alpha) then we would not reject H0, and if the p-value < α
then we would reject H0 and describe the result as statistically significant. Remember that
the p-value will depend on whether we are dealing with a two or one tail test. So take extra
care with this concept as this is where most students slip up.
The second part of the decision-making described the use of the critical test statistic in
making decisions. This is the traditional textbook method which uses published tables to
provide estimates of critical values for various test parameter values.
In the case of the chi-square test we looked at a range of applications, including: testing
for differences in proportions, testing for association, and testing how well a theoretical
probability distribution fits collected sample data.
In the case of non-parametric tests we looked at a range of tests, including: sign test for
one sample, two paired sample Wilcoxon signed rank test, and two independent samples
Mann–Whitney test. In the case where we have more than two samples then we would
have to use techniques, such as the Kruskal–Wallis test or Friedman test depending upon
whether we are dealing with independent or dependent samples respectively. These tests
are described in the online workbook ‘Factorial experiments’.
Figure 7.20 provides a diagrammatic representation of the decisions required to decide
on which test to use to undertake the correct hypothesis test.
The key questions are:
1. What are you testing: difference or association? For non-parametric tests we are
dealing with ordinal and/or non-normal distributions, while the chi-square test will
test for association.
2. What is the type of data being measured? For non-parametric tests we are dealing
with ordinal data and categorical data for the chi-square test of association.
3. Can we assume that the population is normally distributed? For both types of tests we
are not assuming that the population distribution is normal.
4. How many samples? In Figure 7.20 we are dealing with one and two sample tests.
[Figure 7.20: decision diagram for choosing a test. Decision nodes: number of samples; independent or dependent; association or proportion. Tests shown: sign test, chi-square test of proportions, McNemar’s test.]
■ Key terms
Chi-square test
Chi-square test for independent samples
Chi-square test of association
Contingency table
Expected frequency
Goodness-of-fit test
Mann–Whitney U test
McNemar’s test
Observed frequency
Rank
Sign test
Test statistic
Wilcoxon signed rank sum test
■ Further reading
Textbook resources
1. Whigham, D. (2007) Business Data Analysis using Excel. Oxford: Oxford University Press.
2. Lindsey, J. K. (2003) Introduction to Applied Statistics: A Modelling Approach (2nd edn).
Oxford: Oxford University Press.
Web resources
1. StatSoft Electronic Textbook http://www.statsoft.com/textbook/stathome.html (accessed
25 May 2012).
» Overview «
In this chapter we will explore methods that define possible relationships, or associations,
between two interval, or ordinal, data variables. The issue of measuring the association
between two nominal data variables was explored under cross tabulation and the chi-square
distribution. When dealing with two data variables we can explore visually the possibility of an
association by plotting a scatter plot of one variable against another variable. Visually, this will
help to decide whether or not an association exists and the possible form of the association,
for example linear or non-linear. The strength of this association can then be assessed either
by calculating Pearson’s correlation coefficient for interval data or Spearman’s rank order
correlation coefficient for ordinal data. If the scatter plot suggests a possible association then
we can use least squares regression to fit this model to the data set. In this text we will focus on
linear relationships, but we have included sections introducing non-linear and multiple linear
regression analysis. Excel can be used to calculate most of the terms using specific functions
and we can access a data analysis macro called regression to calculate all the terms we would
need to undertake the regression analysis described within this overview.
» Learning objectives «
On successful completion of the module you will be able to:
» understand the meaning of simple linear correlation and regression analysis;
» apply a scatter plot to represent visually a possible relationship between two data variables;
» calculate Pearson’s correlation coefficient for interval data and provide meaning to this
value;
» calculate Spearman’s rank correlation coefficient for ordinal ranked data and provide
meaning to this value;
» fit a simple linear regression model to the two data variables to be able to predict a
dependent variable using an independent variable;
• apply a scatter plot to represent visually a possible relationship between two data
variables;
• understand the meaning of simple linear correlation analysis;
• calculate Pearson’s correlation coefficient for interval data and provide meaning to
this value;
• calculate Spearman’s rank correlation coefficient for ordinal ranked data and
provide meaning to this value;
• undertake an inference test on the value of the correlation coefficients (r and rs) being
significant.
to examine this claim by analysing the data results from the first group of 20 employees that
attended the course.
Table 8.1 provides the data set for the % change in production (y) measured against a range
of production values (x).
Table 8.1
(a) Dependent variable: the variable that we wish to predict, in this case % change in
production (variable y).
(b) Independent variable: in general, labelled as variable x or, in this case, the production
variable. The independent variable provides the basis for calculating the value of the
dependent variable.
Dependent variable A dependent variable is what you measure in the experiment and what
is affected during the experiment.
Independent variable An independent variable is the variable you have control over, what
you can choose and manipulate.
As a first stage to the analysis, the scatterplot would be plotted out, which, as indicated
in Figure 8.1, involves plotting each pair of values as a point on a graph.
[Figure 8.1: scatter plot of % change in production, y (axis 0.0 to 8.0) against production, x (axis 0 to 80).]
As can be seen from the scatterplot there would seem to be some form of relationship;
as production increases then there is a tendency for % change in production to increase.
The data, in fact, would indicate a positive relationship. As we will see in the next section,
it is possible to describe this relationship by fitting a line or curve to the data set. In
Figure 8.2 we modified the y axis to run from 0 to 25 instead of 0 to 9. We also changed one
point to illustrate the case of outliers. However, before we do this, just note how much
impact the change in resolution (i.e. change in scale for y) has on the perceived pattern of
the data.
Note When scatterplots are used, like any other visualization method, make sure that
the right resolution (the y-axis range) is used.
Identifying outliers
[Figure 8.2: scatter plot of % change in production (axis 0.0 to 25.0) against production (axis 0 to 80), with one point labelled ‘Outlier?’.]
population being measured and the problem is due to having a small sample. There is no
widely accepted method on how to deal with outliers. Some researchers use quantitative
methods to exclude outliers that lie beyond ±1.5 standard deviations around the mean
value. To decide what type of relationship exists between the two variables (x, y), we
need a numerical method to assess the strength of this potential relationship, rather than
relying on the scatter plot alone.
The next three sections will explore three methods that can be used to measure the
relationship between data values: covariance, Pearson’s coefficient of correlation, and
Spearman’s rank correlation coefficient.
8.1.2 Covariance
A measure that tells us if the variables are jointly related is called covariance. Usually this
implies that we measure if the two variables move together. The covariance takes either
positive or negative values, depending on if the two variables move in the same or oppo-
site direction. If the covariance value is zero, or close to zero, then the two variables do not
move closely together at all. Equation (8.1) defines the sample covariance.
$$\operatorname{cov}(x, y) = \frac{\sum (x - \bar{x})(y - \bar{y})}{n - 1} \qquad (8.1)$$
As will be explained shortly, the covariance is an important building block for calculat-
ing the coefficient of correlation between two variables.
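For readers who want to check equation (8.1) outside Excel, the following is a minimal Python sketch. The function name and the five data pairs are hypothetical illustrations (they are not the Table 8.1 values), and the calculation mirrors what COVARIANCE.S is described as doing in this chapter.

```python
def sample_covariance(xs, ys):
    """Sample covariance, equation (8.1): sum((x - x_bar)(y - y_bar)) / (n - 1)."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length samples with at least two points")
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    return sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / (n - 1)


# Hypothetical data, not the Table 8.1 values.
production = [10, 20, 30, 40, 50]
pct_change = [1.5, 2.0, 3.5, 4.0, 5.5]

cov_xy = sample_covariance(production, pct_change)
print(round(cov_xy, 4))  # positive, so x and y move in the same direction
```

Reversing one of the lists gives a negative covariance, which corresponds to the two variables moving in opposite directions.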
Example 8.2
Reconsider the data set in Example 8.1.
Figure 8.3
Note From Excel, the sample covariance is 18.05, implying that both variables are
moving in the same direction (indicated by the positive value). A major flaw with the
covariance is that it can take any value, so you are unable to measure the relative
strength of the relationship. For this value of 18.05 we do not know if this represents a strong
or weak relationship between x and y. To measure this strength we would use the correlation
coefficient or the coefficient of determination (COD).
$$r = \frac{s_{xy}}{s_x s_y} \qquad (8.2)$$
Note
(a) If r lies between –1 ≤ r ≤ −0.5 or 0.5 ≤ r ≤ 1 (large association).
(b) If r lies between –0.5 ≤ r ≤ −0.3 or 0.3 ≤ r ≤ 0.5 (medium association).
(c) If r lies between –0.3 ≤ r ≤ −0.1 or 0.1 ≤ r ≤ 0.3 (small association).
Example 8.3
Reconsider the data set in Example 8.1.
Figure 8.4
➜ Excel solution
X: Cells C4:C23 Values
Y: Cells D4:D23 Values
Pearson r = Cell C25 Formula: =PEARSON(C4:C23,D4:D23)
Pearson r = Cell C26 Formula: =CORREL(C4:C23, D4:D23)
Pearson r = Cell C27 Formula: =COVARIANCE.S(C4:C23,D4:D23)/(STDEV.S(C4:C23)*STDEV.S(D4:D23))
❉ Interpretation From Excel, the sample correlation coefficient is equal to +0.89. This
would indicate a fairly strong positive linear association (or relationship) between the value of
the % change in production (y) and the value of the original production values (x), confirming
the impression from the scatter plot in Figure 8.1.
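As a cross-check on equation (8.2) and the size guidelines in the Note, here is a small Python sketch. The data are hypothetical (not Table 8.1), and the function names and the label "negligible" for values below 0.1 are assumptions made for illustration, since the Note does not name that range.

```python
import math


def pearson_r(xs, ys):
    """Pearson's r as s_xy / (s_x * s_y), equation (8.2)."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / (n - 1)
    s_x = math.sqrt(sum((x - x_bar) ** 2 for x in xs) / (n - 1))
    s_y = math.sqrt(sum((y - y_bar) ** 2 for y in ys) / (n - 1))
    return s_xy / (s_x * s_y)


def association_strength(r):
    """Label |r| using the large / medium / small bands quoted in the Note."""
    size = abs(r)
    if size >= 0.5:
        return "large"
    if size >= 0.3:
        return "medium"
    if size >= 0.1:
        return "small"
    return "negligible"  # assumed label; the Note does not cover |r| < 0.1


# Hypothetical data, not the Table 8.1 values.
production = [10, 20, 30, 40, 50]
pct_change = [1.5, 2.0, 3.5, 4.0, 5.5]

r = pearson_r(production, pct_change)
print(round(r, 3), association_strength(r))
```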
It should be noted that if you include the outlier illustrated in Figure 8.2 then the value
of the correlation coefficient (r) would reduce to 0.3 and would suggest very little correla-
tion between the two variables (x, y).
What does the value of ‘r’ not indicate?
1. Correlation only measures the strength of a relationship between two variables but
does not prove a cause and effect relationship.
(a) Medical research suggests a strong correlation between the consumption of
alcohol and alcohol-induced liver disease. In this situation we have a cause
and effect situation where increased alcohol consumption increases the risk of
developing liver disease.
(b) But do we have a cause and effect between the amount of petrol sold and the
consumption of ice cream during the summer months? In this case the increase
in petrol consumption and ice cream sales is owing to the fact that it is summer
and (i) the holiday season has started and (ii) the temperature is increasing.
(c) Even though we do not have a cause and effect between the variables it is
possible that the association found might lead to what the true cause might be.
For example, a new survey found that the more time people spent watching
television the fatter they became. It could be that unemployed people spend
more time watching television and, at the same time, they cannot afford to eat a
healthy diet. In this case employment status would be the real cause. Remember,
it is usually more complicated than this simple example and the value of a
dependent variable may depend on more than just one independent variable.
2. A value of r ≈ 0 would indicate no linear relationship between x and y, but this may
indicate that the true form of the relationship is non-linear.
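Point 2 can be illustrated with a short Python sketch. The data below are constructed for illustration only: y is an exact function of x (y = x squared), yet r is zero because the relationship is not linear.

```python
import math


def pearson_r(xs, ys):
    """Pearson's correlation coefficient from sums of deviations."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    return s_xy / math.sqrt(s_xx * s_yy)


# Constructed example: a perfect but non-linear (quadratic) relationship.
xs = [-3, -2, -1, 0, 1, 2, 3]
ys = [x ** 2 for x in xs]

r_quadratic = pearson_r(xs, ys)
print(r_quadratic)  # no linear association, despite y depending entirely on x
```

This is why the scatter plot should always be inspected alongside the value of r.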
[Figure 8.5: scatter plot of predicted variable, y, against known value, x.]
In Figure 8.6 the data point pattern goes from a high value on the y-axis down to a high
value on the x-axis—the variables have a negative correlation.
Figure 8.7 Example of perfect positive correlation, r = +1
A perfect positive correlation is given the value of 1 and a perfect negative correlation
is given the value of −1. In reality the value of the correlation will lie between −1 and +1.
Linear correlation and regression analysis 351
[Figure 8.6: Scatter plot, example of negative correlation. Axes: predicted value, y, against known value, x.]
[Figure 8.7: scatter plot for the perfect positive correlation example. Horizontal axis: known value, x.]
Figure 8.8 illustrates what the scatterplot would look like for correlation value of −0.47.
[Figure 8.8: scatter plot for a correlation value of −0.47. Horizontal axis: known value, x.]
For example, Figure 8.1 is the scatter plot for % change in production against production
which, as we already know, suggests that as x increases, y increases: the two variables
move in the same direction. We’ll now show how to calculate Pearson’s correlation
coefficient, r, using a formula approach in Excel.
Example 8.4
Reconsider the data set in Example 8.1.
$$r = \frac{\sum xy - \dfrac{\sum x \sum y}{n}}{\sqrt{\left(\sum x^2 - \dfrac{(\sum x)^2}{n}\right)\left(\sum y^2 - \dfrac{(\sum y)^2}{n}\right)}} \qquad (8.4)$$
Equation 8.4 is a modified version of equation (8.3). We will use this formula to demon-
strate how to calculate the correlation coefficient in Excel using this equation. Figure 8.9
illustrates the Excel solution.
Figure 8.9
➜ Excel solution
x: Cells C4:C23 Values
y: Cells D4:D23 Values
xy Cell E4 Formula: =C4*D4
Copy formula down E4:E23
x^2 Cell G4 Formula: =C4^2
Copy formula down G4:G23
y^2 Cell I4 Formula: =D4^2
Copy formula down I4:I23
n = Cell D26 Formula: =COUNT(C4:C23)
ΣX = Cell D27 Formula: =SUM(C4:C23)
ΣY = Cell D28 Formula: =SUM(D4:D23)
From Excel: n = 20, Σx = 1064, Σy = 110.80, Σxy = 6237.50, Σx² = 60352, and
Σy² = 653.44.
Substituting these values into equation (8.4) gives r = 0.89 (see cell D33).
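The substitution can be checked with a few lines of Python. This sketch uses only the summary totals quoted above; the variable names are illustrative choices, not cell references from the worksheet.

```python
import math

# Summary totals quoted in the text for the Example 8.1 data set.
n = 20
sum_x, sum_y = 1064, 110.80
sum_xy, sum_x2, sum_y2 = 6237.50, 60352, 653.44

# Numerator and the two bracketed terms of equation (8.4).
s_xy = sum_xy - sum_x * sum_y / n
s_xx = sum_x2 - sum_x ** 2 / n
s_yy = sum_y2 - sum_y ** 2 / n

r = s_xy / math.sqrt(s_xx * s_yy)
cov_xy = s_xy / (n - 1)  # the same numerator gives the sample covariance

print(round(r, 2))       # 0.89
print(round(cov_xy, 2))  # 18.05
```

Both printed values agree with the figures reported earlier for the correlation coefficient and the sample covariance.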
As expected, we get the same value of 0.89 as calculated by Excel functions =PEARSON ()
or =CORREL (). We still have not examined how significant this linear correlation is, i.e. do
the conclusions we made about the sample data apply to the whole population? In order
to do this we need to conduct a hypothesis test. The end result will confirm if the same
conclusion applies to the whole company (population) and, more specifically, at what
level of significance.
$$t_{cal} = \frac{r - \rho}{\sqrt{\dfrac{1 - r^2}{n - 2}}} \qquad (8.5)$$
As per previous chapters on hypothesis testing, testing of the significance is done in five
short steps.
1 State hypothesis
4 Extract the relevant statistic, which will consist of three simple calculations:
(a) Calculate the value of r (correlation coefficient);
(b) Calculate test statistic tcalc;
(c) Determine the critical value tcrit;
5 Make a decision
Example 8.5
Reconsider the data set in Example 8.1.
Figure 8.10
➜ Excel solution
x: Cells C4:C23 Values
y: Cells D4:D23 Values
Figure 8.11
➜ Excel solution
Significance level = Cell I10 Value =0.05
Pearson coefficient = Cell I13 Formula: =PEARSON(C4:C23, D4:D23)
n = Cell I15 Formula: =COUNT(B4:B23)
df = Cell I16 Formula: =I15-2
t = Cell I17 Formula: =I13/SQRT((1−I13^2)/(I15−2))
Upper two tail t-critical = Cell I18 Formula: =T.INV.2T(I10, I16)
Lower two tail t-critical = Cell I19 Formula: =−I18
1 State hypothesis
Null hypothesis H0: ρ = 0 no population correlation exists
Alternative hypothesis H1: ρ ≠ 0 correlation exists
2 Select test—in this case we already know that we are testing the significance of linear
correlation and we use a t-test to test for significance.
3 Significance level. Set the significance level of 5% = 0.05 (see cell I10)
$$t_{cal} = \frac{r}{\sqrt{\dfrac{1 - r^2}{n - 2}}} \qquad (8.6)$$
Note We note that the alternative hypothesis is ≠ and therefore we have not implied a
direction for the value of ρ. All we know is that it could be a significant correlation and that
ρ > 0 or ρ < 0. In this case we have two directions where ρ would be deemed significant and
this is called a two-tailed test.
From Excel:
$$t_{cal} = \frac{r}{\sqrt{\dfrac{1 - r^2}{n - 2}}} = 8.29 \quad \text{(see cell I17)}$$
(c) Using a significance level of 0.05 with n − 2 = 18 degrees of freedom (cell I16), the
critical t value = T.INV.2T(I10, I16) = ±2.10 (see cells I18 and I19).
5 Make a decision
The calculated value of the t-test statistic (8.29) is greater than the critical t statistic
value (2.1). We conclude that we should reject H0 and accept H1.
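The test statistic and decision can be reproduced in Python as a rough check. In this sketch r is recomputed from the summary totals quoted for Example 8.4 so that no rounding is carried forward, and the critical value 2.10 is simply copied from the Excel result above rather than computed; the function name is an illustrative choice.

```python
import math


def t_statistic(r, n):
    """Equation (8.6): t = r / sqrt((1 - r^2) / (n - 2)), testing H0: rho = 0."""
    return r / math.sqrt((1 - r ** 2) / (n - 2))


# Summary totals quoted in the text for the Example 8.1 data set.
n = 20
sum_x, sum_y = 1064, 110.80
sum_xy, sum_x2, sum_y2 = 6237.50, 60352, 653.44

r = (sum_xy - sum_x * sum_y / n) / math.sqrt(
    (sum_x2 - sum_x ** 2 / n) * (sum_y2 - sum_y ** 2 / n)
)

t_cal = t_statistic(r, n)
t_crit = 2.10  # two-tail 5% critical value for 18 degrees of freedom, taken from the text

reject_h0 = abs(t_cal) > t_crit
print(round(t_cal, 2), reject_h0)  # 8.29 True
```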
Note The preceding example illustrates a two tailed test, but one tail tests can exist and
will denote confidence in a specific relationship between X and Y.
For example, in the previous example we are quite certain that we would expect the %
change in production and the original production value of the tested employees to be related
and the association to be positive (as X increases Y increases). In this case we would conduct
H0: ρ = 0 and H1: ρ > 0. If we then tested at 5% then all this 5% would be allocated to the right-
hand tail of the decision graph and tcri would be positive. In this example the Excel solution
would give tcri =T.INV.2T (0.05*2, 18) = +1.73.
If we reversed the test and assumed that the association was negative (as X increases Y
decreases) then the alternative hypothesis would read H0: ρ = 0 and H1: ρ < 0, with a critical t
value of tcri = −1.73.
H0: ρ = 0
H1: ρ < 0
Left-tailed test.
H0: ρ = 0
H1: ρ > 0
Right-tailed test.
$$r_s = 1 - \frac{6 \sum (X_r - Y_r)^2}{n(n^2 - 1)} \qquad (8.7)$$
Where Xr = rank order value of X, Yr = rank order value of Y, and n = number of paired
observations.
Equation (8.7) is known as Spearman’s rank correlation coefficient. The use of ranks
allows us to measure correlation using characteristics that cannot be expressed quanti-
tatively, but that lend themselves to being ranked. This equivalence between equations
(8.4) and (8.7) will only be true for situations where no tied ranks exist. When tied ranks
exist then you will find discrepancies between the value of r and rs. As with the other
non-parametric tests introduced in this text, ties are handled by giving each tied value the
mean of the rank positions for which it is tied. The interpretation of rs is similar to that for
r, namely: (a) a value of rs near 1.0 indicates a strong positive relationship and (b) a value
of rs near −1.0 indicates a strong negative relationship.
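The treatment of ties described above can be sketched in Python as follows. The function name and the marks are hypothetical, and ranking from the smallest value upwards is an assumption; ranking from the largest value downwards works the same way.

```python
def average_ranks(values):
    """Rank values from 1 (smallest); tied values share the mean of their rank positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        # Extend 'end' to cover every value tied with the one at 'pos'.
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        mean_rank = ((pos + 1) + (end + 1)) / 2  # mean of the 1-based positions
        for k in range(pos, end + 1):
            ranks[order[k]] = mean_rank
        pos = end + 1
    return ranks


# Hypothetical marks with one tie (the two values of 43).
marks = [43, 51, 43, 60, 38]
print(average_ranks(marks))  # [2.5, 4.0, 2.5, 5.0, 1.0]
```

The two tied marks would have occupied rank positions 2 and 3, so each receives the mean rank of 2.5.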
Note
(a) If rs lies between –1 ≤ rs ≤ −0.5 or 0.5 ≤ rs ≤ 1 (large association).
(b) If rs lies between –0.5 ≤ rs ≤ −0.3 or 0.3 ≤ rs ≤ 0.5 (medium association).
(c) If rs lies between –0.3 ≤ rs ≤ −0.1 or 0.1 ≤ rs ≤ 0.3 (small association).
For pairs of data considered to have a strong relationship, just as in the case of
Pearson’s correlation coefficient, you will need to confirm that the value is significant (see
section 8.1.6).
Example 8.6
You are asked to decide whether the statistics rank correlates with the mathematics rank for
seven students provided in Table 8.2. As the information is ranked we use Spearman’s correla-
tion coefficient to measure the correlation between statistics and mathematics ranks.
Table 8.2
Figure 8.12
➜ Excel solution
Statistics rank, Xr Cells C5:C11 Values
Mathematics rank, Yr Cells D5:D11 Values
Xr − Yr = Cell F5 Formula: =C5−D5
Copy formula down F5:F11
(Xr − Yr)^2 = Cell H5 Formula: =F5^2
Copy formula down H5:H11
n = Cell F14 Formula: =COUNT (B5:B11)
Squared rank differences = Cell F15 Formula: =SUM(H5:H11)
Spearman’s rank correlation = Cell F17 Formula: =1−6*F15/(F14*(F14^2−1))
❉ Interpretation From Figure 8.12 the Spearman rank correlation is positive, rs = 0.54,
indicating that there is a mild positive rank correlation in this case. If this number was closer
to +1, we would be able to claim much stronger positive rank correlation.
Note Excel does not have a procedure for computing Spearman’s ranked correlation
coefficient directly. However, as the formula for Spearman’s is the same as for Pearson’s
correlation coefficient, we can use it providing that we have first converted the x and y
variables to rankings (Data > Data Analysis > Rank and Percentile).
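The point made in the Note can be checked with a short Python sketch: applying Pearson's formula to the ranks gives the same answer as equation (8.7) when there are no ties. The rankings below are hypothetical (they are not the Table 8.2 values) and the function names are illustrative.

```python
import math


def spearman_rs(x_ranks, y_ranks):
    """Equation (8.7): r_s = 1 - 6 * sum(d^2) / (n * (n^2 - 1)), valid when no ranks are tied."""
    n = len(x_ranks)
    sum_d2 = sum((xr - yr) ** 2 for xr, yr in zip(x_ranks, y_ranks))
    return 1 - 6 * sum_d2 / (n * (n ** 2 - 1))


def pearson_r(xs, ys):
    """Pearson's correlation coefficient from sums of deviations."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    return s_xy / math.sqrt(s_xx * s_yy)


# Hypothetical rankings for seven students, with no tied ranks.
statistics_rank = [1, 2, 3, 4, 5, 6, 7]
mathematics_rank = [2, 1, 4, 3, 7, 5, 6]

rs = spearman_rs(statistics_rank, mathematics_rank)
r_on_ranks = pearson_r(statistics_rank, mathematics_rank)
print(round(rs, 4), round(r_on_ranks, 4))  # the two values agree
```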
$$t_{cal} = \frac{r_s - \rho_s}{\sqrt{\dfrac{1 - r_s^2}{n - 2}}} \qquad (8.8)$$
Example 8.7
Reconsider the data set in Example 8.6 and assess the significance of rs.
Figure 8.13
➜ Excel solution
Sig = Cell C5 Value
n = Cell C6 Value
df = Cell C7 Formula: =C6−2
tcri = Cell C8 Formula: =T.INV.2T(C5,C7)
Critical rs = Cell C9 Formula: =C8/SQRT(C8^2+C6−2)
1 State hypothesis
Null hypothesis H0: ρs = 0 no population correlation
Alternative hypothesis H1: ρs ≠ 0 population correlation exists
Two tail test
2 Select test—we already know that this is testing the significance of Spearman’s rank
correlation coefficient
$$t_{cal} = \frac{r_s}{\sqrt{\dfrac{1 - r_s^2}{n - 2}}} \qquad (8.9)$$
Note We note that the alternative hypothesis is ≠ and therefore we have no implied
direction for the value of ρs. All we know is that it could be a significant correlation and that
ρs > 0 or ρs < 0. In this case we have two directions where ρ would be deemed significant and
this is called a two tailed test.
The critical value of rs may be found either from a table of values or by calculation,
depending upon the size of the sample, n.
N 6 7 8 9 10
Significance level 5% 0.829 0.759 0.738 0.666 0.632
(e) If the sample size n ≥ 10, the test statistic is approximated by a t statistic with
n − 2 degrees of freedom, as shown in equation (8.10). The critical rs value can
be found by rearranging equation (8.9) to make rs the subject of the equation:
$$r_s = \frac{t}{\sqrt{t^2 + n - 2}} \qquad (8.10)$$
To find the critical rs value: (i) find tcri and (ii) substitute this value for tcri into
equation (8.10) to find the critical rs value. For example, if the significance level is
5% two tail, then tcri =± 2.31 and the critical value of rs = ±0.63.
Note For n > 20, the statistic z may be treated as approximately normal (0, 1), where
$$z = r_s \sqrt{n - 1} \qquad (8.11)$$
For example, if the significance level is 5% two tail and n = 40, then Zcri = ±1.96 and the criti-
cal value of rs = ±0.314. In the comparison of marks example we have n = 7, significance level
5% two tail, and the table critical rs value is ± 0.759.
5 Make a decision
Given that 0.54 < 0.759, the test statistic does not fall in the critical region. Therefore,
we accept H0 and reject H1.
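The critical values quoted in this example can be reproduced with a short Python sketch of equations (8.10) and (8.11). The critical t and z values are copied from the text rather than computed, and the function names are illustrative.

```python
import math


def critical_rs_from_t(t_crit, n):
    """Equation (8.10): critical r_s = t / sqrt(t^2 + n - 2)."""
    return t_crit / math.sqrt(t_crit ** 2 + n - 2)


def critical_rs_from_z(z_crit, n):
    """Equation (8.11) rearranged: critical r_s = z / sqrt(n - 1)."""
    return z_crit / math.sqrt(n - 1)


print(round(critical_rs_from_t(2.31, 10), 2))  # 0.63
print(round(critical_rs_from_z(1.96, 40), 3))  # 0.314

# Decision for the comparison of marks example: n = 7, table critical value 0.759.
rs_observed, rs_critical = 0.54, 0.759
reject_h0 = abs(rs_observed) > rs_critical
print(reject_h0)  # False, so H0 is not rejected
```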
Student exercises
X8.1 In the course of a survey relating to examination success, you have discovered a high
negative correlation between students’ hours of study and their examination marks.
This is so at variance with common sense that it has been suggested an error has been
made. Do you agree?
X8.2 Construct a scatter plot for the data in Table 8.4 and calculate Pearson’s correlation
coefficient, r. Comment on the strength of the correlation between x and y.
x: 40 41 40 42 40 40 42 41 41 42
y: 32 43 28 45 31 34 48 42 36 38
Table 8.4
X8.3 Display the data given in Table 8.5 in an appropriate form and state how the variables
are correlated.
x: 0 15 30 45 60 75 90 105 120
y: 806 630 643 625 575 592 408 469 376
Table 8.5
X8.4 Table 8.6 indicates the number of vehicles and number of road deaths in ten countries.
Countries Vehicles per 100 population Road deaths per 100,000 population
UK 31 14
Belgium 32 30
Denmark 30 23
France 46 32
Germany 30 26
Irish Republic 19 20
Italy 35 21
Netherlands 40 23
Canada 46 30
USA 57 35
Table 8.6
(a) Construct a scatter plot and comment upon the possible relationship between the
two variables.
(b) Calculate the product moment correlation coefficient between vehicle numbers
and road deaths.
(c) Use your answers to (a) and (b) to comment upon your results.
X8.5 Samples of students’ essays were marked by two tutors independently. The resulting
ranks are shown in Table 8.7.
Tutor A: 5 8 1 6 2 7 3 4
Tutor B: 7 4 3 1 6 8 5 2
Table 8.7
Mathematics 89 73 57 53 51 49 47 44 42 38
Statistics 51 53 49 50 48 21 46 19 43 43
Table 8.8
(a) Find the correlation coefficient for the two sets of marks.
(b) Place the marks in rank order and calculate the rank correlation coefficient.
(c) The following is a quotation from a statistics text ‘Rank correlation can be used
to give a quick approximation to the product moment correlation coefficient’.
Comment on this in the light of your results.
X8.7 Three people, P, Q, and R, were asked to place in preference nine features of a house
(A, B, C ... I). Calculate Spearman’s rank order correlation coefficients between the pairs
of preferences, as shown Table 8.9.
A B C D E F G H I
P 1 2 4 8 9 7 6 3 5
Q 1 4 5 8 7 9 2 3 6
R 1 9 6 8 7 4 2 3 5
Table 8.9
How far does this help to decide which pair from the three would be most likely to be
able to compromise on a suitable house?
The values of constants b0 and b1 are effectively estimates of some true values of β0 and
β1, and we’ll also have to test to see how well they represent these true population values.
In order to determine this relationship the constants b0 and b1 have to be estimated from
the observed values of x and y. To do this, regression analysis utilizes the method of least
squares regression to provide a relationship between b0, b1, and the sample data values
(x, y). The method assumes that the line will pass through the point of intersection of the
mean values of x and y, (x̄, ȳ).
The method then pivots the line about this point until:
(i) The sum of the vertical squared distance of the data points is a minimum
(ii) The sum of the vertical distances of the data points above the line equals those
below the line.
$$\sum (y - \hat{y})^2 = \text{minimum}$$
and
$$\sum (y - \hat{y}) = 0.$$
$$\sum y = n b_0 + b_1 \sum x$$
and
$$\sum xy = b_0 \sum x + b_1 \sum x^2$$
By solving the above equations simultaneously, estimates of the constants b0 and b1 are
determined to give the equation of the line of regression of y on x, where y is the depend-
ent variable and x is the independent variable. The two ‘normal equations’ can be rear-
ranged so that a solution can be obtained as given by equations (8.14) and (8.15).
$$b_1 = \frac{n \sum xy - \sum x \sum y}{n \sum x^2 - (\sum x)^2} \qquad (8.14)$$
$$b_0 = \frac{\sum y - b_1 \sum x}{n} \qquad (8.15)$$
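Equations (8.14) and (8.15) can be checked in Python using the summary totals quoted in Example 8.4 for the Example 8.1 data set. This is a sketch with illustrative variable names, not a reproduction of the worksheet.

```python
# Summary totals quoted in the text for the Example 8.1 data set.
n = 20
sum_x, sum_y = 1064, 110.80
sum_xy, sum_x2 = 6237.50, 60352

# Equation (8.14): slope of the least squares line.
b1 = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)

# Equation (8.15): intercept, using the slope just calculated.
b0 = (sum_y - b1 * sum_x) / n

print(round(b1, 4), round(b0, 4))  # 0.0915 0.6712
```

These are the same slope and intercept that the Excel functions SLOPE and INTERCEPT report for this data set.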
Excel can be used in a number of different ways to undertake regression analysis and
calculate the required coefficients b0 and b1.
1. Excel statistical functions: Excel contains a range of functions that allow a range of
regression coefficient calculations to be undertaken.
2. Excel worksheet functions: standard Excel functions can be used to reproduce the
manual solution, e.g. SUM, SQRT functions.
3. Excel Data Analysis > Regression: this method provides a complete set of solutions.
Least squares The method of least squares is a criterion for fitting a specified model to
observed data. It refers to finding the smallest (least) sum of squared differences between
fitted and actual values.
Example 8.8
Reconsider Example 8.1 and fit the scatter plot as illustrated in Figure 8.14.
[Figure 8.14: scatter plot of the Example 8.1 data. Horizontal axis: Production, X (0 to 100).]
From Figure 8.14 we conclude that the % change in production (y-variable) is increas-
ing as the production increases.
Example 8.9
Figure 8.15 represents the Excel solution to fitting a regression line to the Example 8.8 data set.
Figure 8.15
The Excel function to calculate the slope, b1, and intercept, b0, is as described next.
➜ Excel solution
x: Cells B5:B24 Values
y: Cells C5:C24 Values
b0 = Cell C27 Formula: =INTERCEPT(C5:C24,B5:B24)
b1 = Cell C28 Formula: =SLOPE(C5:C24,B5:B24)
From Excel: b0 = 0.6712 and b1 = 0.0915. The equation of the sample regression line is
ŷ = 0.6712 + 0.0915x .
❉ Interpretation The regression equation for the example used here is % change in
production = 0.6712 + 0.0915 * production.
For every value of x (production) we can now estimate a value of the % change in
production. If we plotted these estimated values, they would represent a trend line, or a
line of regression. The calculated trend line has been fitted to the scatter plot as shown in
Figure 8.16.
Observe that not all data points lie on the fitted line. In this case we can also observe an
error (sometimes called a residual or variation) between the observed y value and the y
value on the fitted line at each data point.
Intercept Value of the regression equation (y) when the x value = 0.
Residual The residual represents the unexplained variation (or error) after fitting a
regression model.
[Figure 8.16: scatter plot of % change in production, Y, against production, X (0 to 100), with the fitted trend line.]
This concept of error can be measured using a variety of methods, including: coefficient
of determination (COD), standard error of estimate (SEE), and a range of inference meas-
ures to assess the suitability of the regression model fit to the data set.
An alternative approach to calculating the regression line is to right-click on one of the
data points in the graph and select Add Trendline option from the box (Figure 8.17).
Figure 8.17
Figure 8.18
In order to demonstrate the above points, we’ll show here yet another method of calcu-
lating linear regression using the Excel TREND() function (Figure 8.19).
[Figure 8.19: scatter plot against production, X (0 to 100), with fitted trend line y = 0.0915x + 0.6712 and R² = 0.7924.]
Example 8.10
Use the Excel TREND() function to fit a trend line to the Example 8.1 data set, as illustrated in
Figure 8.20.
Figure 8.20
➜ Excel solution
x: Cells B5:B24 Values
y: Cells C5:C24 Values
Estimated y = Cell D5 Formula: =TREND($C$5:$C$24,$B$5:$B$24,B5)
Copy formula down D5:D24
Error = Cell F5 Formula: =C5-D5
Copy formula down F5:F24
There is obviously a degree of error between observed values of y and those estimated
by the regression line (ŷ). This error, or difference, is known as the residual and is defined
by equation (8.16).
Residual = y − ŷ (8.16)
These errors, as we will discover shortly, are a very important part of regression analysis.
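The TREND() calculation and the residual in equation (8.16) can be sketched in Python as follows. This is a minimal sketch: the helper fit_line and the five data points are hypothetical and are not taken from Example 8.1.

```python
def fit_line(x, y):
    """Return (b0, b1) for the least squares line y-hat = b0 + b1*x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sxy / sxx
    return y_bar - b1 * x_bar, b1


# Hypothetical observations
x = [10, 20, 30, 40, 50]
y = [1.2, 2.9, 3.1, 4.8, 5.0]

b0, b1 = fit_line(x, y)
fitted = [b0 + b1 * xi for xi in x]                  # estimated y for each x
residuals = [yi - fi for yi, fi in zip(y, fitted)]   # Residual = y - y-hat

for xi, yi, fi, ei in zip(x, y, fitted, residuals):
    print(xi, yi, round(fi, 3), round(ei, 3))
```

For a least squares line that includes an intercept, the residuals sum to zero, which is a quick check that the fitted column has been calculated correctly.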
An alternative is to use some general Excel functions to get even richer data. Excel also
offers an even more comprehensive way to achieve the same task through Excel’s Data
Analysis tool, but we’ll come back to it at the very end of this chapter. Let us return to
the concept of error that we mentioned above, which can be measured using a variety of
methods, including: coefficient of determination (COD), standard error of the estimate
(SEE), and a range of inference measures to assess the suitability of the regression model
fit to the data set.
Figure 8.21
Understanding the relationship between SST, SSR, and SSE.
$$\mathrm{SSR} = \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2 \tag{8.17}$$
$$\mathrm{SSE} = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 \tag{8.18}$$
The total sum of squares (SST) is sometimes called the total variation:
$$\mathrm{SST} = \sum_{i=1}^{n} (y_i - \bar{y})^2 \tag{8.19}$$
The total sum of squares is equal to the regression sum of squares plus the error sum of
squares.
$$\mathrm{SST} = \mathrm{SSR} + \mathrm{SSE} \tag{8.20}$$
Sum of squares for regression (SSR): The SSR measures how much variation there is in the
modelled values.
Sum of squares for error (SSE): The SSE measures the variation in the modelling errors.
Total sum of squares (SST): The SST measures how much variation there is in the observed
data (SST = SSR + SSE).
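The relationship between the three sums of squares can be checked numerically. The sketch below assumes a small hypothetical data set and a helper called sums_of_squares, neither of which comes from the text.

```python
def sums_of_squares(x, y):
    """Return (SST, SSR, SSE) for the least squares line fitted to x and y."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    b1 = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
          / sum((xi - x_bar) ** 2 for xi in x))
    b0 = y_bar - b1 * x_bar
    y_hat = [b0 + b1 * xi for xi in x]
    ssr = sum((fi - y_bar) ** 2 for fi in y_hat)            # modelled values about the mean
    sse = sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))   # observed minus modelled
    sst = sum((yi - y_bar) ** 2 for yi in y)                # observed values about the mean
    return sst, ssr, sse


# Hypothetical data
x = [1, 2, 3, 4]
y = [1, 3, 2, 4]
sst, ssr, sse = sums_of_squares(x, y)
print(sst, ssr, sse, ssr + sse)
```

The last two printed values agree with SST up to rounding, illustrating SST = SSR + SSE.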
1. Linearity
Linearity assumes that the relationship between the two variables is linear. To assess
linearity, the residuals (or errors) are plotted against the independent variable, x. Excel
Data > Data Analysis > Regression will create this plot automatically if requested (see
section 8.2.10). From Figure 8.22 we observe that there is no apparent pattern between the
residuals and x. Furthermore, the residuals are evenly spread out about error equal to zero.
Figure 8.22
Residuals versus x
For this example a line fit to the data set would appear appropriate. If the scatter plot
suggests that the relationship is non-linear then you would have to identify and fit this
relationship to your data set (see section 8.3.1).
2. Independence of errors
The independence of errors assumption requires that there is no correlation between the
residuals of the regression analysis.
This effect is called serial correlation and can be measured using the Durbin–Watson
statistic. Another expression for serial correlation, though usually used in a different
context, is autocorrelation. For Example 8.8, the data has been collected at the same time
period and we do not need to consider serial correlation (independence of errors) as a
problem. This topic is beyond the scope of this textbook.
Independence of errors: Independence of errors means that the distribution of errors is
random and not influenced by or correlated to the errors in prior observations. The opposite
of independence is called autocorrelation.
Durbin–Watson: The Durbin–Watson statistic is a test statistic used to detect the presence
of autocorrelation (a relationship between values separated from each other by a given time
lag) in the residuals (prediction errors) from a regression analysis.
Autocorrelation: Autocorrelation is the correlation between members of a time series of
observations and the same values shifted at a fixed time interval.
3. Normality of errors
The normality of errors assumption requires that the measured errors (or residuals) are
normally distributed for each value of the independent variable, X. If this assumption is
violated then the result can produce unrealistic estimations for the regression coefficients
b0, b1, and the measures of correlation. Furthermore, any inference tests or confidence
intervals calculated are dependent upon the errors being normally distributed.
Normality of errors: Normality of errors assumption states that the errors should be
normally distributed. Technically, normality is necessary only for the t-tests to be valid;
estimation of the coefficients only requires that the errors be identically and
independently distributed.
This assumption can be evaluated using two graphical methods: (i) construct
a histogram for the errors against x and check whether the shape looks normal or
(ii) create a normal probability plot of the residuals (available from the Excel Data > Data
Analysis > Regression). Figure 8.23 illustrates a normal probability plot based upon the
Example 8.8 data set.
Figure 8.23
Normal probability plot of the residuals
We observe that the relationship is fairly linear and we conclude that the normality
assumption is not violated.
This problem can occur if the dependent and/or independent variables are not nor-
mally distributed or the linearity assumption is violated. Like the ANOVA F test and t-test,
regression analysis is robust against departures from this assumption. As long as the dis-
tribution of error against X is not very different from a normal distribution then the infer-
ences on β0 and β1 will not be seriously affected (see sections 8.2.6 and 8.2.7).
4. Variance constant
The final assumption of equal variance (or homoscedasticity) requires that the variance
of the errors is constant for all values of X.
This implies that the variability of the Y values is the same for all values of X and this
assumption is important when making inferences about β0 and β1 (see sections 8.2.6 and
8.2.7). If there are violations of this assumption then we can use data transformations or
weighted least-squares to attempt to improve model accuracy. In Figure 8.24 we observe
that the error is not growing in size as the value of X changes. This plot provides evidence
that the variance assumption is not violated. If the value of error changes greatly as the
value of X changes then we would assume that the variance assumption is violated.
Figure 8.24 Residual plot (residuals or error against x)
Equal variance (homoscedasticity): Homogeneity of variance (homoscedasticity) assumption
states that the error variance should be constant.
If any of the four assumptions are violated, we can only conclude that linear regression
is not the best method for fitting to the data set, and we will need to find an alternative
method or model.
Note See section 8.2.10 for the Data > Data Analysis > Regression menu method to
check regression assumptions.
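Although the Durbin–Watson statistic is not developed in this text, its standard definition is short enough to sketch: the sum of squared differences between successive residuals divided by the sum of squared residuals. The function name and the two residual series below are hypothetical illustrations, not output from any example in this chapter.

```python
def durbin_watson(residuals):
    """Durbin-Watson statistic for residuals listed in observation (time) order."""
    # Squared change from one residual to the next
    numerator = sum((residuals[t] - residuals[t - 1]) ** 2
                    for t in range(1, len(residuals)))
    denominator = sum(e ** 2 for e in residuals)
    return numerator / denominator


# Hypothetical residual series
alternating = durbin_watson([1, -1, 1, -1])   # sign flips every step
constant = durbin_watson([1, 1, 1, 1])        # successive residuals identical
print(alternating, constant)
```

Values near 2 are usually read as little evidence of first-order autocorrelation, values towards 0 suggest positive autocorrelation, and values towards 4 suggest negative autocorrelation.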
Example 8.11
Reconsider the data set in Example 8.1 and test the linear regression model reliability.
Figure 8.25 illustrates the Excel solution to calculate the coefficient of determination
and standard error of the estimate.
Figure 8.25
➜ Excel solution
x: Cells B5:B24 Values
y: Cells C5:C24 Values
b0 = Cell C27 Formula: =INTERCEPT(C5:C24,B5:B24)
b1 = Cell C28 Formula: =SLOPE(C5:C24,B5:B24)
Standard error of the estimate (SEE): The standard error of the estimate (SEE) is an
estimate of the average squared error in prediction.
$$\mathrm{SEE} = \sqrt{\frac{\mathrm{SSE}}{n - 2}} \tag{8.21}$$
Equation (8.21) can be rewritten to give equation (8.22) by using equation (8.18).
$$\mathrm{SEE} = \sqrt{\frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{n - 2}} \tag{8.22}$$
This provides a measure of the scatter of observed values around the corresponding
estimated y values on the regression line and is measured in the same units as y. From
Excel (Figure 8.25), the function STEYX() gives the standard error of the estimate (SEE)
of y on x as 0.675.
❉ Interpretation We are approximately 68% confident that the true value will be in the
interval of ±0.675 of the estimated value of ŷ. To be approximately 95% confident, we would
need to take 2SEE, i.e. 2 × 0.675, which is the interval of ±1.35 around ŷ.
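The standard error of the estimate can be computed directly from observed and fitted values. In the sketch below the function name and the data are hypothetical; the fitted values are those of the line ŷ = 0.5 + 0.8x at x = 1, 2, 3, 4.

```python
import math


def standard_error_of_estimate(y, y_hat):
    """SEE = sqrt(SSE / (n - 2)) for a simple linear regression."""
    n = len(y)
    sse = sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))  # error sum of squares
    return math.sqrt(sse / (n - 2))


# Hypothetical observed values and their fitted values
y = [1, 3, 2, 4]
y_hat = [1.3, 2.1, 2.9, 3.7]
see = standard_error_of_estimate(y, y_hat)
print(round(see, 4))
```

The divisor n − 2 reflects the two coefficients (b0 and b1) estimated from the sample.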
variability of Y can be split into two components: (i) variability explained or accounted
for by the regression line, and (ii) unexplained variability, as indicated by the residuals.
It should be noted that the correlation coefficient provides a measure of the strength of
the association between two variables but the issue of interpreting the value is a problem.
After all what do we mean by strong, weak, or moderately associated? Fortunately, we do
have a method that is easier to interpret: the COD.
The COD is defined as the proportion of the total variation in y that is explained by the
variation in the independent variable x. This definition is represented by equations (8.23)
and (8.24):
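$$\mathrm{COD} = \frac{\mathrm{SSR}}{\mathrm{SST}} \tag{8.23}$$
$$\mathrm{COD} = \frac{\sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2} \tag{8.24}$$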
By further manipulation of equation (8.24) it can be shown that the coefficient of deter-
mination (COD) is given by equation (8.25):
❉ Interpretation From Excel the coefficient of determination is 0.79 or 79%. This value
tells us that 79% of the variation in the % change in production is explained by the variation
in the production variable. Conversely, this implies that 21% of the sample variability in the
% change in production is due to factors other than production and is not explained by the
regression line.
Note The coefficient of determination equation (8.23) can be rewritten in terms of SSE
and SST by making use of the relationship SSR = SST − SSE.
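The rewriting mentioned in the note can be sketched in one line, assuming equation (8.23) has the form COD = SSR/SST, which is consistent with equation (8.24):

```latex
\mathrm{COD} = \frac{\mathrm{SSR}}{\mathrm{SST}}
             = \frac{\mathrm{SST} - \mathrm{SSE}}{\mathrm{SST}}
             = 1 - \frac{\mathrm{SSE}}{\mathrm{SST}}
```

Read this way, the COD is 1 when there is no residual variation (SSE = 0) and 0 when the regression line explains none of the variation (SSE = SST).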
X and Y variables we will require the application of a t-test to check whether β1 is equal to
zero. This is essentially a test to determine if the regression model is usable.
If the slope is significantly different from zero then we can use the regression equation
to predict the dependent variable for any value of the independent variable. If the slope
is zero then the independent variable has no prediction value, as for every value of the
independent variable the predicted value of the dependent variable would be the same
(equal to the intercept). Therefore, when this is the
situation we would not use the equation to make predictions.
In order to test the significance of the relationship between y and x, we test the null
hypothesis:
H0: β1 = 0 no linear relationship
This implies that there is no change in the value of the variable y as the variable x
increases in size.
The alternative hypothesis states that the value of the y variable changes as the value of
the x variable increases in size.
H1: β1 ≠ 0 linear relationship exists and the relationship is not zero (two tail
test)
For simple linear regression which has one independent variable, the F test is equiva-
lent to the t-test (see section 8.2.7). In this hypothesis test we are assessing the possibility
that β1 = 0. In order to test this hypothesis we will calculate a measure of the difference
between the value of the population slope (β1) and the sample slope (b1). The value of b1
will change as we collect different samples and this would create a sampling distribution
for the b1 term. It can be shown that if the regression assumptions hold, then the popula-
tion of all possible values of the term b1 will be normally distributed with mean of β1 and
with a standard deviation given by equation (8.26).
$$\sigma_{b_1} = \frac{\sigma}{\sqrt{SSX}} \tag{8.26}$$
Equation (8.26) can be rewritten as equation (8.27) if we note that the standard error of
the estimate s_xy is a point estimate of σ and s_b1 is a point estimate of σ_b1.
$$s_{b_1} = \frac{s_{xy}}{\sqrt{SSX}} = \frac{\mathrm{SEE}}{\sqrt{\sum (x - \bar{x})^2}} \tag{8.27}$$
Where SEE is the standard error of the estimate given by Excel function STEYX().
It can be shown that the relationship between b1, β1, and tcal, is given by equation (8.28),
which follows a t distribution with the number of degrees of freedom df = n − 2.
$$t_{cal} = \frac{b_1 - \beta_1}{s_{b_1}} \tag{8.28}$$
Example 8.12
Reconsider the Example 8.1 data set and test the significance of the predictor variable (x).
Figures 8.26 and 8.27 illustrate the Excel solution to undertake the required hypothesis
Student’s t-test.
Figure 8.26
➜ Excel solution
x: Cells B5:B24 Values
y: Cells C5:C24 Values
b0 = Cell C27 Formula: =INTERCEPT (C5:C24,B5:B24)
b1 = Cell C28 Formula: =SLOPE (C5:C24,B5:B24)
ŷ = Cell D5 Formula: =$C$27+$C$28*B5
Copy formula down D5:D24
(x − xbar)^2 = Cell F5 Formula: =(B5-$K$15)^2
Copy formula down F5:F24
Figure 8.27
➜ Excel solution
Level = Cell K12 Value =0.05
SEE = Cell K14 Formula: =STEYX (C5:C24,B5:B24)
Average x = Cell K15 Formula: =AVERAGE (B5:B24)
SSX = Cell K16 Formula: =SUM (F5:F24)
Sb1 = Cell K17 Formula: =K14/SQRT(K16)
t = Cell K18 Formula: =C28/K17
n = Cell K20 Formula: =COUNT (A5:A24)
k = Cell K21 Value =1
df = Cell K22 Formula: =K20-(K21+1)
Upper tcri = Cell K23 Formula: =T.INV.2T(K12,K22)
Lower tcri = Cell K24 Formula: =-K23
Two tail p-value = Cell K25 Formula: =T.DIST.2T(K18,K22)
1 State hypothesis
H0: β1 = 0 no linear relationship.
H1: β1 ≠ 0 linear relationship exists and the relationship is
not zero (two tail test).
2 Select the test—we know that this is the t-test for testing if the predictor variable is a
significant contributor.
$$t_{cal} = \frac{b_1}{s_{b_1}} \tag{8.29}$$
The test statistic follows a t distribution with n − 2 degrees of freedom. From Excel,
t = 8.29 with 18 degrees of freedom (see cells K18 and K22).
2. Critical t value, tcri
We can now test to see if this sample t value would result in accepting or rejecting
H0. From Excel we see that the critical t value = ±2.1 at a 5% significance level (see
cells K23, K24). At this stage we need to remember that the hypothesis test implies
no perceived direction for H1 to be accepted. The Excel function used to calculate the
critical t value is =T.INV.2T(K12,K22).
5 Make a decision
As tcal > tcri (8.3 > 2.1), then the test statistic lies in the rejection zone for H0. Therefore,
reject H0 and accept H1. Alternatively, as the p-value < α (1.47E-7 < 0.05), reject H0
and accept H1.
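The arithmetic behind this test can be reproduced from summary values quoted in this chapter for the Example 8.1 data (n = 20, ∑x = 1064, ∑x² = 60352, SEE = 0.675872845, b1 = 0.0915, tcri = 2.10092). The Python sketch below is illustrative only; the variable names are assumptions, and the critical value is taken from the text rather than computed.

```python
import math

# Summary values quoted in the chapter for the Example 8.1 data
n = 20
sum_x = 1064
sum_x2 = 60352
see = 0.675872845     # standard error of the estimate, STEYX()
b1 = 0.0915           # sample slope (rounded)
t_cri = 2.10092       # two-tail critical t, 5% level, 18 degrees of freedom

ssx = sum_x2 - sum_x ** 2 / n     # sum of (x - xbar)^2
s_b1 = see / math.sqrt(ssx)       # standard error of the slope
t_cal = b1 / s_b1                 # test statistic under H0: beta1 = 0
reject_h0 = abs(t_cal) > t_cri

print(round(ssx, 1), round(s_b1, 5), round(t_cal, 2), reject_h0)
```

Because the slope is entered to four decimal places, the test statistic agrees with the Excel value only to about two decimal places.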
Note A similar approach can be used to test if the constant term b0 is a significant
contributor to the value of y. This requires the t-test statistic t_cal = b0/s_b0 to be
calculated and compared with the critical t value.
$$F_{cal} = \frac{\mathrm{SSR}/k}{\mathrm{SSE}/(n - (k + 1))} \tag{8.31}$$
$$F_{cal} = \frac{\mathrm{MSR}}{\mathrm{MSE}} \tag{8.32}$$
$$F_{cal} = \frac{\mathrm{COD}/k}{(1 - \mathrm{COD})/(n - (k + 1))} \tag{8.33}$$
Where n is the total number of paired values and k is the number of predictor variables.
If the regression line fits the sample data (little scatter about line) then the value of F will
be quite large. Conversely, if the regression line does not fit the sample data (increased
scatter about line) then the value of F will approach zero.
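Equation (8.33) is easy to evaluate once the COD is known. The sketch below uses the values quoted in this chapter for the Example 8.1 data (COD = 0.7924, n = 20, k = 1); the function name f_from_cod is a hypothetical helper.

```python
import math


def f_from_cod(cod, n, k):
    """F statistic from the coefficient of determination, equation (8.33)."""
    return (cod / k) / ((1 - cod) / (n - (k + 1)))


f_cal = f_from_cod(0.7924, n=20, k=1)
t_equivalent = math.sqrt(f_cal)   # for one predictor, t is the square root of F

print(round(f_cal, 1), round(t_equivalent, 2))
```

The square root is included because, with a single predictor, the F test and the two-tail t-test on the slope give the same decision.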
Example 8.13
Reconsider the data in Example 8.1 and conduct an F test to test whether or not the independent
(predictor) variable is a significant contributor to the dependent variable.
Figure 8.28
➜ Excel solution
x: Cells B5:B24 Values
y: Cells C5:C24 Values
b0 = Cell C27 Formula: =INTERCEPT (C5:C24,B5:B24)
b1 = Cell C28 Formula: =SLOPE (C5:C24,B5:B24)
ŷ = Cell D5 Formula: =$C$27+$C$28*B5
Copy formula down D5:D24
Figure 8.29
➜ Excel solution
Level = Cell H12 Value =0.05
n = Cell H14 Formula: =COUNT (A5:A24)
k = Cell H15 Value =1
COD = Cell H16 Formula: =RSQ(C5:C24,B5:B24)
F-cal = Cell H17 Formula: =(H16/H15)/((1-H16)/(H14-(H15+1)))
df num = Cell H19 Formula: =H15
df denom = Cell H20 Formula: =H14-(H15+1)
F-critical = Cell H21 Formula: =F.INV.RT(H12,H19,H20)
p-value = Cell H22 Formula: =F.DIST.RT(H17,H19,H20)
1 Hypothesis test
H0: β1 = 0 no linear relationship.
H1: β1 ≠ 0 linear relationship exists and we believe that the relationship is not
zero (two tail test).
2 Select the test—which we know is F test, testing whether the predictor variable is a
significant contributor
5 Make a decision
Figure 8.30 illustrates the shape of the F distribution and the relationship between the
critical F value and H0 and H1 being true.
Figure 8.30 (F distribution showing the H0 true and H1 true regions; Fcri = 4.4, Fcal = 68.7)
As Fcal > Fcri (68.7 > 4.4), we reject H0 and accept H1. Alternatively, use the p-value
(1.47E-7) < 0.05 and conclude that the alternative hypothesis is accepted.
Note For a one predictor model the t-test and F test are essentially the same test. In
fact, for a one predictor regression model the relationship between F and t is t = √F. Check:
t = 8.29 and F = 68.7, so t = √F = √68.7 = 8.29.
The format for the analysis of variance (ANOVA) table is as shown in Table 8.10.
Where degrees of freedom (df ), sum of squares for regression (SSR), sum of squares for
error (SSE), total sum of squares (SST), mean square due to regression (MSR), mean
square due to error (MSE), and F is the statistic.
The completed ANOVA table is part of the Excel Data > Data Analysis > Regression
solution described in Section 8.2.10.
For Example 8.13, k = 1, n = 20, and the ANOVA table would be as presented in
Table 8.11.
$$\beta_1 = b_1 \pm t_{cri} \times s_{b_1} \tag{8.34}$$
This equation implies two border values for β1 with the confidence interval lying
between these two values.
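As a minimal sketch of equation (8.34), the two border values can be computed from summary values quoted in this chapter for the Example 8.1 data. The variable names are illustrative, and the critical t value is taken from the text rather than computed.

```python
import math

# Summary values quoted in the chapter for the Example 8.1 data
n = 20
sum_x = 1064
sum_x2 = 60352
see = 0.675872845   # standard error of the estimate
b1 = 0.0915         # sample slope (rounded)
t_cri = 2.10092     # two-tail critical t for 95% confidence, df = 18

ssx = sum_x2 - sum_x ** 2 / n      # sum of (x - xbar)^2
s_b1 = see / math.sqrt(ssx)        # standard error of the slope
margin = t_cri * s_b1

lower = b1 - margin
upper = b1 + margin
print(round(lower, 3), round(upper, 3))
```

An interval that excludes zero leads to the same conclusion as rejecting H0: β1 = 0 in the two-tail t-test at the matching significance level.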
Example 8.14
Reconsider Example 8.1 and calculate a 95% confidence interval for the slope coefficient of the
predictor variable.
Figure 8.31
➜ Excel solution
x: Cells B5:B24 Values
y: Cells C5:C24 Values
b0 = Cell C27 Formula: =INTERCEPT(C5:C24,B5:B24)
b1 = Cell C28 Formula: =SLOPE (C5:C24,B5:B24)
ŷ = Cell D5 Formula: =$C$27+$C$28*B5
Copy formula down D5:D24
❉ Interpretation From Excel, the 95% confidence interval for Example 8.14 is between
0.068% and 0.115%. Because these values are above zero we conclude that there is a
significant linear relationship between the two variables (% change in production (y) and
production (x)). If the interval had included zero then you would conclude no significant
linear relationship exists. If we rescale the numbers, we can say that the confidence interval
states that for a production increase of 100, the % change in production is estimated to
increase by at least 6.8 but no more than 11.5.
$$e = t_{cri} \times \mathrm{SEE} \times \sqrt{1 + \frac{1}{n} + \frac{n (x_p - \bar{x})^2}{n \sum x^2 - (\sum x)^2}} \tag{8.36}$$
Example 8.15
Fit a prediction interval at xp = 30 to the data set from Example 8.1.
Figures 8.32 and 8.33 illustrate the Excel solution to calculate the prediction interval.
Figure 8.32
➜ Excel solution
x Cells B5:B24 Values
y Cells C5:C24 Values
x^2 Cell D5 Formula: =B5^2
Copy formula down D5:D24
Figure 8.33
➜ Excel solution
b0 = Cell G4 Formula: =INTERCEPT (C5:C24,B5:B24)
b1 = Cell G5 Formula: =SLOPE (C5:C24,B5:B24)
n = Cell G7 Formula: =COUNT (A5:A24)
level = Cell G8 Value =0.05
df = Cell G9 Formula: =G7-2
tcri = Cell G10 Formula: =T.INV.2T(G8,G9)
x = Cell G11 Value =30
Xbar = Cell G12 Formula: =AVERAGE(B5:B24)
Y^ = Cell G13 Formula: =G4+G5*G11
Σx = Cell G14 Formula: =SUM (B5:B24)
From Excel: xp = 30, n = 20, significance level = 0.05, tcri = ±2.10092, SEE = 0.675872845,
x̄ = 53.2, ∑x = 1064, and ∑x² = 60352. Substituting values into equation (8.36) gives:
$$e = 2.10092 \times 0.675872845 \times \sqrt{1 + \frac{1}{20} + \frac{20 \times (30 - 53.2)^2}{20 \times 60352 - 1064^2}} = 1.551353666$$
Equation (8.35) then gives the 95% prediction interval for xp = 30 to lie between 1.87
and 4.97.
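The substitution above can be checked with a short Python sketch. The function name prediction_margin is hypothetical, and the interval is taken to be ŷ ± e, an assumption that reproduces the bounds quoted above to within rounding of b0 and b1.

```python
import math


def prediction_margin(t_cri, see, n, x_p, sum_x, sum_x2):
    """Margin e of equation (8.36) for a prediction at x_p."""
    x_bar = sum_x / n
    inside = 1 + 1 / n + n * (x_p - x_bar) ** 2 / (n * sum_x2 - sum_x ** 2)
    return t_cri * see * math.sqrt(inside)


# Values quoted in the chapter for the Example 8.1 data
e = prediction_margin(t_cri=2.10092, see=0.675872845, n=20,
                      x_p=30, sum_x=1064, sum_x2=60352)
y_hat = 0.6712 + 0.0915 * 30          # predicted value at x_p = 30
lower, upper = y_hat - e, y_hat + e   # assumed form of the interval

print(round(e, 4), round(lower, 2), round(upper, 2))
```

The margin grows as xp moves away from the mean of x, so predictions near the centre of the data are the most precise.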
❉ Interpretation Therefore, if the production level was at 30 units then we predict the
value of the % change in production to be 3.4%. In fact, we can state a 95% prediction interval
of 1.87–4.97%. This shows that the actual value can vary greatly from the predicted value
of 3.4%.
Example 8.16
Reconsider the data set from Example 8.1 and use the Excel Data Analysis tool to fit the linear
regression model, and calculate the required reliability and significance test statistics.
• Y Range: C5:C24.
• X Range: B5:B24.
• Confidence interval: 95%.
Figure 8.34
Click OK.
Excel will now calculate and output the required regression statistics and charts, as
illustrated in Figure 8.35.
Figure 8.35
Note We can also equate the printout in Figure 8.35 with the terms from Section 8.2.3.
They are as follows:
Cell F7 = R-Square
Cell F9 = standard error of estimate (SEE)
Cell F14 = dfR
Cell F15 = dfE
Cell F16 = dfT
Cell G14 = SSR
Cell G15 = SSE
Cell G16 = SST
Cell H14 = MSR (this is the result of G14/F14)
Cell H15 = MSE (this is the result of G15/F15). If you take a square root of this value, you get
standard error of the estimate, as per Cell F9.
From Figure 8.35 we can identify the required regression statistics (Table 8.12).
Table 8.12
COD, coefficient of determination; SEE, standard error of estimate.
What is the p-value and how is it used and interpreted in Excel? This is the same statistic
we have already used extensively in previous chapters on hypothesis testing. The Excel
solution provides the t-test values for each contributor (b0 and b1) and includes a statistic
called the p-value. The p-value measures the chance (or probability) of achieving a test
statistic equal to or more extreme than the sample value obtained, assuming H0 is true. As
we already know, to make a decision we compare the calculated p-value with the level of
significance (say 0.05 or 5%) and if p < 0.05 then we would reject H0.
The application of the t-test tells us that the predictor variable (production, x) is a sig-
nificant contributor to the value of y (% change in production, y) given that p (=1.473E-
7) < 0.05. Furthermore, it is observed that the constant is not a significant contributor
to the value of y (p = 0.28 > 0.05) and this would suggest that the model should be of
the form ŷ = b1x. This can be achieved easily in Excel by ticking the Constant is Zero box in the Data
Analysis > Regression dialog.
The F test confirms that the predictor variable is a significant contributor to the value
of the dependent variable (p = 1.473E-7 < 0.05). This confirms the t-test solution and we
conclude that there is a significant relationship between the % change in production
and old production. Remember that for a one predictor model, t = √F = √68.7 = 8.29.
The Regression Data Analysis also helps with the checking of some of the assumptions,
namely: linearity, constant variance, and normality, as illustrated in Figures 8.36–8.38.
Figure 8.36
Figure 8.37
Plot of residuals against x
We can see from Figure 8.37 that we have no observed pattern within the residual plot
and we can assume that the linearity assumption is not violated. Furthermore, the resid-
ual, and hence the variance, are not growing in size and are bounded between a high and
low point. From this we conclude that the variance assumption is not violated.
Figure 8.38
Assumption check for normality
From the normal probability plot we have a fairly linear relationship and we conclude
that the normality assumption is not violated.
Student exercises
X8.8 In the regression equation ŷ = b0 + b1x, the value of b0 is given by the equation:
A. $b_0 = \dfrac{\sum Y - b_1^2 \sum X}{n}$
B. $b_0 = \dfrac{\sum Y - b_1 \sum X}{2n}$
C. $b_0 = \dfrac{\sum Y - b_1 \sum X}{n}$
D. $b_0 = \dfrac{\sum Y - n \sum X}{n}$
X8.9 In the regression equation ŷ = b0 + b1x, the value of b1 is given by the equation:
A. $b_1 = \dfrac{n \sum XY^2 - \sum X \sum Y}{n \sum X^2 - (\sum X)^2}$
B. $b_1 = \dfrac{n \sum XY - \sum X \sum Y}{n \sum X^2 - (\sum X)^2}$
C. $b_1 = \dfrac{n \sum XY - \sum X \sum Y}{n \sum X - (\sum X)^2}$
D. $b_1 = \dfrac{n \sum XY - \sum X \sum Y}{n \sum X^2 - (\sum X)}$
Use the ANOVA table (Table 8.13) to answer exercise questions X10.10–X10.12.
ANOVA        df   SS            MS            F             Significance F
Regression   1    3.76127E+11   3.76127E+11   162.7172745   7.34827E-16
Residual     41   94773006578   2311536746
Total        42   4.709E+11
Table 8.13
df, degrees of freedom; SS, sum of squares; MS, mean sum of squares.
Table 8.14
(a) Plot a scatter plot and comment on a possible relationship between sales and
advertising.
(b) Use Excel regression functions to undertake the following tasks:
(i) Fit linear model
(ii) Check model reliability (r and COD)
(iii) Undertake appropriate inference tests (t and F test)
(iv) Check model assumptions (residual and normality checks)
(v) Provide a 95% confidence interval for the predictor variable.
X8.14 Fit an appropriate equation to the data set (Table 8.15) to predict the examination
mark given the assignment mark for 14 undergraduate students.
Table 8.15
(a) Plot a scatter plot and comment on a possible relationship between examination mark
and assignment mark.
(b) Use Excel regression functions to undertake the following tasks:
(i) Fit linear model
(ii) Check model reliability (r and COD)
(iii) Undertake appropriate inference tests (t and F test)
(iv) Check model assumptions (residual and normality checks)
(v) Provide a 95% confidence interval for the predictor variable.
we shall introduce the concept of non-linear regression via a simple example. Before we
start we need to introduce a series of non-linear relationships between the variable y and
x and their governing equations. Some of the most popular curves that describe the shape
of these relationships are presented in Figures 8.39–8.45.
Figure 8.39 Line y = b0 + b1x (plotted example: y = 2x + 4)
Figure 8.40 Parabola curve y = b2x² + b1x + b0
Figure 8.41 Hyperbola curve y = b0/x
Figure 8.42 Exponential curve y = b0·b1^x
Figure 8.43 Modified exponential curve y = b2 + b0·b1^x
Figure 8.44 Logistic curve 1/y = b2 + b0·b1^x
Figure 8.45 Gompertz curve y = b2·b0^(b1^x) (plotted example: y = 2.5(0.3)^(0.5^x))
Let’s look at just one of these non-linear relationships. Equation (8.37) represents the
equation of a parabola (or polynomial of degree 2):
ŷ = b0 + b1x + b2x² (8.37)
The values of the parameters b0, b1, and b2 can be determined using least squares
regression by solving equations (8.38)–(8.41).
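$$b_0 = \frac{\sum \hat{x}^4 \sum y - \sum \hat{x}^2 \sum \hat{x}^2 y}{n \sum \hat{x}^4 - \left(\sum \hat{x}^2\right)^2} \tag{8.38}$$
$$b_1 = \frac{\sum \hat{x} y}{\sum \hat{x}^2} \tag{8.39}$$
$$b_2 = \frac{n \sum \hat{x}^2 y - \sum \hat{x}^2 \sum y}{n \sum \hat{x}^4 - \left(\sum \hat{x}^2\right)^2} \tag{8.40}$$
where
$$\hat{x} = x - \bar{x} \tag{8.41}$$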
We will use Excel to show how to fit this curve to a data set, calculate the equation of the
line, and calculate the coefficient of determination (though the data set is not shown here,
just the principle of how to use Excel for this purpose).
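Equations (8.38)–(8.41) can be tried out on a small data set. The sketch below is an illustration under stated assumptions: the data are hypothetical, the x values are equally spaced so that the centred values are symmetric about zero (∑x̂ = ∑x̂³ = 0, which is when these closed forms coincide with the least squares solution), and the coefficients apply to the centred variable x̂ rather than to x itself.

```python
def fit_parabola_centred(x, y):
    """Return (b0, b1, b2) from equations (8.38)-(8.40), using x-hat = x - x-bar."""
    n = len(x)
    x_bar = sum(x) / n
    xc = [xi - x_bar for xi in x]                 # equation (8.41)
    s2 = sum(c ** 2 for c in xc)                  # sum of x-hat squared
    s4 = sum(c ** 4 for c in xc)                  # sum of x-hat to the fourth
    sy = sum(y)
    sxy = sum(c * yi for c, yi in zip(xc, y))
    s2y = sum(c ** 2 * yi for c, yi in zip(xc, y))
    denom = n * s4 - s2 ** 2
    b0 = (s4 * sy - s2 * s2y) / denom             # equation (8.38)
    b1 = sxy / s2                                 # equation (8.39)
    b2 = (n * s2y - s2 * sy) / denom              # equation (8.40)
    return b0, b1, b2


# Hypothetical data generated from y = 2 + 3*(x - 3) + 0.5*(x - 3)^2
x = [1, 2, 3, 4, 5]
y = [2 + 3 * (xi - 3) + 0.5 * (xi - 3) ** 2 for xi in x]
b0, b1, b2 = fit_parabola_centred(x, y)
print(b0, b1, b2)
```

Because the hypothetical points lie exactly on a parabola, the three coefficients used to generate them are recovered.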
Example 8.17
Table 8.16 provides the sales and price data collected from a range of discount stores selling
a particular product but using their own discount policy to price the product. The question is,
can we fit an appropriate relationship to predict sales given price?
The solution to this problem consists of identifying the type of relationship between the two
variables. Figure 8.46 illustrates graphically the relationship between sales and price.
Table 8.16
Figure 8.46 illustrates a scatter plot for demand (y) against price (x), illustrating a pos-
sible non-linear relationship between the variables y and x.
Figure 8.46 (x-axis: Price, x)
From Figure 8.46 we may suggest that the relationship between the two variables is
given by model 1 or model 2.
If the relationship was non-linear we may still use linear regression, as long as we are
able to transform the non-linear data to a linear form. The parameters b0 and b1 in model
1 and model 2 can be estimated using the methods described in previous sections, and we
will use the Data Analysis > Regression method to calculate these values (include residual
plot and normal probability plot). Figures 8.47 and 8.48 present the ANOVA results for
both models.
Table 8.17 shows the results of applying least squares regression for model 1 and 2. The
results show that model 2 represents a better fit to the data set than model 1.
Figure 8.47
Regression ANOVA table for model 1: y = b0 + b1x.
Figure 8.48
Regression ANOVA table for model 2: y = b0 + b1/x.
Model 2: ŷ = 77.56 + 6.95/x, COD = 0.96
Table 8.17
COD, coefficient of determination.
❉ Interpretation From Table 8.17, we can see that for the non-linear model 96% of the
variation in sales is explained by variation in price, while for the linear model
only 66% of the variation is explained by the model. Clearly, we are better off using the
non-linear model: ŷ = 77.56 + 6.95/x.
Note
1. In model 1 the regression is fitted to variable y and variable x.
2. In model 2 the x variable has been transformed to 1/x and the regression is fitted to variable
y and variable 1/x.
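The transformation described in this note can be sketched in Python: replace x by z = 1/x and fit an ordinary least squares line to y and z. The helper fit_line and the four price values are hypothetical, and the y values are generated from the fitted model 2 equation purely for illustration; they are not the Table 8.16 data.

```python
def fit_line(x, y):
    """Return (b0, b1) for the least squares line y-hat = b0 + b1*x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sxy / sxx
    return y_bar - b1 * x_bar, b1


# Hypothetical prices, with sales generated from y = 77.56 + 6.95/x
price = [0.5, 1.0, 2.0, 2.5]
sales = [77.56 + 6.95 / p for p in price]

z = [1 / p for p in price]      # transformed predictor 1/x
b0, b1 = fit_line(z, sales)     # linear regression of y on 1/x
print(round(b0, 2), round(b1, 2))
```

The model is non-linear in x but linear in 1/x, which is why the linear regression tools from the earlier sections still apply.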
To complete the solution you would then need to analyse the model 2 ANOVA table
results to check whether or not the model 2 parameter terms (b0, b1) are significant
contributors to the value of the dependent variable (y) using the Student’s t-test (or F
test). From Figure 8.48, the two parameter values b0 and b1 are significant contributors
to the value of the y variable (p = 3 × 10^−16 < 0.05 for b0 and p = 4 × 10^−8 < 0.05 for b1). The
final step in the analysis process is to check the model assumptions. From the Data > Data
Analysis > Regression results we requested the residual and normal probability plots.
Figures 8.49–8.52 compare the results for model 1 and 2.
Figure 8.49 Residual plot (residuals against X variable 1)
Figure 8.50 Normal probability plot (sample percentile)
Figure 8.51 Residual plot (residuals against X variable 1)
Figure 8.52 Normal probability plot (sample percentile)
For example, if we wanted to fit a polynomial of order 2 to the scatterplot then we would
choose the Polynomial option and select order 2, as illustrated in Figure 8.53.
Figure 8.53
The general equation of a polynomial of order 2 would be: Y = b0 + b1x + b2x². Finally,
you can ask the Format Trendline > Trendline Options menu to include this equation on
the scatterplot together with the value of the coefficient of determination (R²).
upon the land value but also upon the value of home improvements made to a property.
The form of the population regression equation with ‘n’ independent variables can be
written as:
The multiple regression models can be found using the Excel Data Regression tool to
provide the coefficients, assumption, and reliability checks, and to conduct appropriate
inference tests.
Example 8.18
Table 8.18 consists of data that has been collected by an estate agent who wishes to model the
relationship between house sales price (£) and the independent variables: land value, LV (£) and
the value of home improvements, IV (£).
In order to fit the model the estate agent selected a random sample of size 20 properties
from the 2000 properties sold in that year (Table 8.18).
Table 8.18
Figure 8.54
Scatter plot of sales price versus land value.
Figure 8.55
Scatter plot of sales price versus the value of home improvements.
Adjusted r2: Adjusted R squared measures the proportion of the variation in the dependent
variable accounted for by the explanatory variables and adjusted for the number of
degrees of freedom.
The two scatter plots (Figures 8.54 and 8.55) suggest that a linear model would be
appropriate for y vs x1 and y vs x2. It should be noted that in both scatter plots we do have some
evidence that possible non-linear models may be more appropriate, given the
observation that the data points are starting to decrease in y value at the top range for x. It should
also be noted that the sample sizes are quite small and we will assume that both relation-
ships are linear within the multiple regression model. From this analysis we can identify
three possible models, identified in Table 8.19.
          Population                      Sample
Model 1   Y = β0 + β1X1                   y = b0 + b1x1
Model 2   Y = β0 + β2X2                   y = b0 + b2x2
Model 3   Y = β0 + β1X1 + β2X2            y = b0 + b1x1 + b2x2
Table 8.19
Table 8.20 shows the results of applying least squares regression for models 1, 2, and 3.
The results show that model 3 represents a better fit to the data than models 1 and 2.
Table 8.20
COD, coefficient of determination.
❉ Interpretation From the summary in Table 8.20 we can see that the third model is
the best fit, as 92% of the variation in selling price is explained by the combined effect of both
the land value and home improvements. Of the three, this is clearly the best model.
To complete the solution you would then need to check the model assumptions and
undertake an appropriate t-test (or F test) to test whether the independent variable is a sig-
nificant contributor to the dependent variable. The examples given here serve only as an
illustration to indicate that there is much more depth to the regression analysis technique.
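The comparison of the three models can also be sketched in code. The following is a minimal illustration, assuming NumPy; the land value, home improvement, and price figures are invented (they are not the Table 8.18 data), and the helper name fit_and_score is hypothetical.

```python
import numpy as np

def fit_and_score(X, y):
    """Least squares fit of y = b0 + b1*x1 + ...; returns (coefficients, R^2, adjusted R^2)."""
    n, k = X.shape
    A = np.column_stack([np.ones(n), X])            # prepend the intercept column
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    sse = np.sum((y - A @ coef) ** 2)
    sst = np.sum((y - y.mean()) ** 2)
    r2 = 1 - sse / sst
    adj_r2 = 1 - (1 - r2) * (n - 1) / (n - k - 1)   # penalises extra predictors
    return coef, r2, adj_r2

# Hypothetical sample, invented for illustration only.
land = np.array([5000, 8000, 10000, 12000, 15000, 18000, 20000, 22000], dtype=float)
improv = np.array([20000, 15000, 40000, 30000, 55000, 35000, 60000, 70000], dtype=float)
price = np.array([30000, 31000, 52000, 47000, 71000, 60000, 82000, 93000], dtype=float)

_, r2_model1, _ = fit_and_score(land.reshape(-1, 1), price)               # y = b0 + b1x1
_, r2_model2, _ = fit_and_score(improv.reshape(-1, 1), price)             # y = b0 + b2x2
coef3, r2_model3, adj3 = fit_and_score(np.column_stack([land, improv]), price)

print(f"COD: model 1 = {r2_model1:.3f}, model 2 = {r2_model2:.3f}, model 3 = {r2_model3:.3f}")
print(f"Model 3 adjusted R^2 = {adj3:.3f}")
```

Because model 3 contains both predictors, its COD can never be lower than that of models 1 or 2 on the same data, which is why the adjusted value is the fairer basis for comparison.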
Student exercise
X8.15 An estate agent is interested in developing a model to predict the house sales price
based upon two other variables: size of property and age. His initial analysis suggests
a multiple model regression would be appropriate, with the relationship between the
dependent and independent variables being linear. Table 8.21 presents the data set.
Use Excel Data > Data Analysis > Regression to undertake the following tasks:
(i) Fit the multiple regression model
(ii) Check model reliability
Table 8.21
■ Techniques in practice
TP1 Coco S. A. has requested that a local property company undertake an analysis of prop-
erty prices. The initial data collection has been undertaken and independent variables identi-
fied: square feet, age, and local property tax. The Excel regression analysis has been performed
with the results presented in Figures 8.56–8.60.
Figure 8.56
Figure 8.57 SF residual plot (residuals versus SF).
Figure 8.58 Residual plot (horizontal axis: Age).
Figure 8.59 PT residual plot (residuals versus PT).
Figure 8.60 (horizontal axis: sample percentile).
Table 8.22
(a) Plot a scatter plot and comment on a possible relationship between calories and the
amount of fat in the pies.
(b) Use the Excel data analysis regression tool to undertake the following tasks:
(i) State the least squares regression model equation
(ii) Comment on model reliability (r and COD)
(iii) Is the independent variable significant (F or t-test)?
(iv) Check model assumptions (residual and normality checks).
TP3 Skodel Ltd employs a local transport company to deliver beers to local supermarkets.
To develop better work schedules, the managers want to estimate the total daily travel time
for their drivers’ journeys. Initially, the managers believed that the total daily travel time would
be related closely to the number of miles travelled in making the daily deliveries (Table 8.23).
Table 8.23
(a) Plot a scatter plot and comment on a possible relationship between travel time and
miles travelled.
(b) Use the Excel data analysis regression tool to undertake the following tasks:
(i) State the least squares regression model equation
(ii) Comment on model reliability (r and COD)
(iii) Is the independent variable significant (F or t-test)?
(iv) Check model assumptions (residual and normality checks).
■ Summary
In this chapter we have explored techniques that can be used to explore possible relationships
between two variables using scatter plots and calculating appropriate numerical measures
of association: Pearson and Spearman. The method used will depend upon the type of data
within the data set as described in Table 8.24.
If the initial data exploration shows that we have a possible relationship between y and x
then we can attempt to fit an appropriate model to the data set using least squares regression.
Within the chapter we have explored three methods: (i) fitting a line, (ii) fitting a curve, and
(iii) fitting a linear multiple regression model to the data set. Excel will allow you to calculate
the required statistics via its built-in statistical functions or by making use of the data analysis
regression tool, which includes the necessary statistics and appropriate assumption-checking
charts. The solution process consists of the following steps:
1. Construct scatter plot to visually assess the nature of a possible relationship between the
variables
2. Fit line or curve to data set using the identified relationship
3. Calculate reliability statistics (COD and adjusted r2 for multiple regression models)
4. For multiple regression models calculate the F test statistic to see if the combined model
predictor coefficients are a significant contributor to the value of y
5. Conduct appropriate t-tests to check whether each predictor variable is a significant con-
tributor to the value of y
6. Conduct appropriate confidence intervals for the population slope
7. Assess assumption violation.
■ Key terms
Adjusted r2
Assumptions
Autocorrelation
Coefficient of determination (COD)
Covariance
Dependent variable
Durbin–Watson
Equal variance
Homoscedasticity
Independence of errors
Independent variable
Intercept
Least squares
Linear regression analysis
Linear relationship
Multiple regression model
Normality
Outliers
Pearson’s coefficient of correlation
Regression analysis
Regression coefficient
Residual
Scatter plot
Slope
Spearman’s coefficient of correlation
Standard error of the estimate (SEE)
Sum of squares for error (SSE)
Sum of squares for regression (SSR)
Total sum of squares (SST)
■ Further reading
Textbook resources
1. Whigham, D. (2007) Business Data Analysis using Excel. Oxford: Oxford University Press.
2. Lindsey, J. K. (2003) Introduction to Applied Statistics: A Modelling Approach (2nd edn).
Oxford: Oxford University Press.
Web resources
1. StatSoft Electronic Textbook http://www.statsoft.com/textbook/stathome.html (accessed
25 May 2012).
2. HyperStat Online Statistics Textbook http://davidmlane.com/hyperstat/index.html
(accessed 25 May 2012).
3. Eurostat: website is updated daily and provides direct access to the latest and most complete
statistical information available on the European Union (EU), the EU Member States, the
Euro-zone and other countries http://epp.eurostat.ec.europa.eu (accessed 25 May 2012).
4. Economagic: contains international economic data sets http://www.economagic.com
(accessed 25 May 2012).
5. The International Statistical Institute (ISI) glossary of statistical terms provides definitions
in a number of different languages http://isi.cbs.nl/glossary/index.htm (accessed 25 May 2012).
9 Time series data and analysis
The aim of this chapter is to provide the reader with a set of tools which can be used in the
context of time series analysis and extrapolation. This chapter will allow you to apply a
range of time series tools that can be used to tackle a number of business and other types
of unrelated objectives (in economics, social sciences, and so on). These objectives range
from calculating index changes, deflating prices, and bringing the values to a constant
value, extrapolating data, business forecasting, and reducing the uncertainty related to
future events.
» Overview «
In this chapter we shall look at a range of methods that will be useful in helping us to solve
problems using Excel, including:
» calculating and converting index numbers from one base to another;
» deflating prices and bringing them to a constant value;
» fitting a line to a time series;
» extrapolating the line in the future;
» using moving averages and exponential smoothing as forecasting methods;
» producing forecasts when dealing with the seasonal time series;
» learning how to calculate and interpret forecasting errors;
» learning how to assess the quality of forecasts by inspecting forecasting error;
» calculating the confidence interval for forecasts.
» Learning objectives «
On successful completion of the module, you will be able to:
» understand how to use and recalculate index numbers;
» know how to use indices to deflate prices;
starting from one onwards. This column will, in fact, become a variable, as we will see in
the pages to follow, though a special kind of variable that contains sequential numbers.
The second point to make here is that by just looking at the data, we can ‘see’ very little.
Example 9.1
The most important lesson here is: when dealing with the time series data, it is mandatory to
visualize the data. Well, let’s just do this. Figure 9.2 illustrates the two time series.
Figure 9.1
Figure 9.2 illustrates a time series plot for the Example 9.1 data set. What jumps out at us
immediately is that one of the time series seems to be moving upwards and the other one
is following some horizontal line. The first is called a non-stationary time series, while
the second one, following a horizontal line, is called a stationary time series.
Figure 9.2 illustrates a graph of the two time series data sets (series value versus time point, x; one non-stationary and one stationary series).
In general, all time series will fall into the first or the second category. A variety of methods have been invented to handle either the stationary or non-stationary time series.
Non-stationary time series A time series that does not have a constant mean and oscillates around this moving mean.
Stationary time series A time series that does have a constant mean and oscillates around this mean.
Note Visualization and charting of a time series is not an optional extra, but one of the most essential steps in time series analysis. You can learn a lot about a variable just by looking at the time series graph.
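Alongside the chart, a rough numerical check can support what the eye sees. The sketch below, assuming NumPy, compares the mean of the first half of a series with the mean of the second half. This is only an informal heuristic chosen for illustration, not a formal stationarity test, and both series are invented.

```python
import numpy as np

def half_means(series):
    """Return the mean of the first half and of the second half of a series."""
    s = np.asarray(series, dtype=float)
    mid = len(s) // 2
    return s[:mid].mean(), s[mid:].mean()

# Hypothetical series, invented for illustration only.
trending = [5, 8, 10, 14, 15, 19, 22, 24, 28, 30]     # mean keeps moving upwards
level = [12, 10, 13, 11, 12, 10, 13, 11, 12, 11]      # oscillates around a constant mean

for name, series in [("trending", trending), ("level", level)]:
    first, second = half_means(series)
    print(f"{name}: first-half mean = {first:.1f}, second-half mean = {second:.1f}")
```

A large gap between the two means points towards a non-stationary series, while similar means are consistent with a stationary one. The chart remains the primary tool.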
Example 9.2
Here is an example of one seasonal stationary and one seasonal non-stationary time series.
Figure 9.3 illustrates two seasonal time series data sets.
Figure 9.3 Seasonal non-stationary and seasonal stationary time series (series value versus time point, x).
9.1.3 Univariate and multivariate methods
Besides the division into stationary and non-stationary time series and methods, the methods for handling time series can also be divided into univariate and multivariate methods. Univariate methods take just one single time series and try to produce a forecast for this time series, independently of any other variable. The logic is that the influences of all other variables are already embedded in this single time series, so by just extrapolating it into the future, we extrapolate all the implicit influences of numerous other variables that have influenced this one. A good example would be taking the time series of the level of inventory for a particular product. We know that this inventory depends on many factors, such as the volume of sales (which depends on various market factors), speed of replenishment, and so on. Rather than worrying about all these factors, we can say that they are embedded implicitly in our inventory time series. In other words, the history will tell us which way the future will unfold. This is the major assumption behind univariate time series methods, i.e. the history holds the clues for the future.
The opposite example is if we are trying to predict one variable by relating it to a number of other variables. We can take an example of inflation and try to predict this variable by anticipating how the interest rates will go, what will be the level of individual consumption, institutional investment, and volume of money on the market, etc. If we have one variable that is dependent on a number of other variables that are treated as independent (often called the predictors), then the use of so-called multivariate methods is appropriate.
Seasonal Seasonal is the component of variation in a time series which is dependent on the time of year.
Non-seasonal Non-seasonal is the component of variation in a time series which is not dependent on the time of year.
Seasonal time series A time series, represented in units of time smaller than a year, that shows a regular pattern in repeating itself over a number of these units of time.
Multivariate methods Methods that use more than one variable and try to predict the future values of one of the variables by using the values of other variables.
Univariate methods Methods that use only one variable and try to predict its future value on the basis of the past values of the same variable.
Note Sometimes the methods that deal with time series are also divided into causal
or regression methods, and time series methods. This is a bit of an old-fashioned division as
most of the methods have evolved to such a degree of complexity that it is difficult to say
which one belongs where. Nevertheless, this chapter is dedicated only to a set of methods
that belong to the family of time series methods.
Example 9.3
Figures 9.4 and 9.5 represent the same time series represented at two different scales.
Figure 9.4 Time series data (series value versus time point, x; y-axis from 8000 to 11000).
Figure 9.5 The same time series (series value versus time point, x; y-axis from 9200 to 9900).
The two time series (Figures 9.4 and 9.5) are just one and the same time series, but visualized
in two different ways. Both charts show the Dow Jones Industrial Average index taken
arbitrarily between 25 September and 5 December 2003. However, the y-axis on the first
chart covers a wide range (8000 to 11000), so the series looks compressed, whereas the y-axis
on the second chart covers a much narrower range (9200 to 9900), so the same movements
look magnified. It is obvious that, depending on the scale, we can ‘see’ almost two different
time series. The way we visualize our time series and what our ultimate objectives are will
determine what method to apply. The first one could be approximated by some straight line,
and the second one can be fitted with an nth-order polynomial curve.
Note The visual representation of the time series will often determine what method
to use, although this is not the primary criterion. The choice of the method should be
determined by the type of time series and the forecasting objectives.
Student exercises
X9.1 Chart the time series given in Table 9.1 and decide if it is stationary and/or seasonal.
x 1 2 3 4 5 6 7 8 9 10
y 2 5 6 6 4 5 7 5 8 9
Table 9.1
X9.2 The time series given in Table 9.2 is seasonal. What would you say is the periodicity of
the seasonal component?
x 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
y 1 3 5 3 1 3 5 7 5 3 5 7 9 7 5
Table 9.2
X9.3 Is it possible to have a time series that is non-seasonal and non-stationary? If so, what
would you call it and can you draw a graph showing how such a series might look?
X9.4 Go to one of the websites that allow you to download financial time series (e.g.
http://finance.yahoo.com/) and plot the series of your choice in several identical line graphs.
Change the scale of the y-axis on every graph and make sure that they are radically
different scales. What can you say about the appearance of every graph?
X9.5 Take the first differences between the observations in the time series from X9.1. Would
you say that the differenced time series is stationary? If it is not what would you do to
make it stationary?
Polynomial line A polynomial line is a curved line whose curvature depends on the degree of the polynomial variable.
9.2 Index numbers
The simplest way to analyse time series data is to compare a value from one point in time
with some other value at a different point in time.
Example 9.4
Example 9.4 represents average annual domestic crude oil prices in the USA from 1980 to 2011
(see Table 9.3).
Table 9.3 Average annual domestic crude oil price in $/barrel (bbl)
The price is given in $/barrel (bbl). In 1985, for example, the average price of oil was
$26.92. In 2007, the same oil was priced at $64.20. The question we are interested in is: by
how much has the 2007 nominal price changed when compared with the one from 1985?
To answer this question we need to use index numbers. Index numbers measure the
change, typically expressed in percentages. To answer the question we introduced, all
we have to do is to divide the price of oil from 2007 by the one from 1985 and multiply
it by 100:
\[ I_t = \frac{y_t}{y_0} \times 100 \quad (9.1) \]
Where y_t is the value for the year for which the index is calculated and y_0 is the value for
the base year. In Example 9.4, this is:
\[ I_{2007} = \frac{y_{2007}}{y_{1985}} \times 100 = \frac{64.20}{26.92} \times 100 = 238.48 \]
Clearly, it is easy to calculate indices in Excel. We show two examples using the same
time series: one calculating indices for 1980 as the base year and the other one for 1992
as the base year.
Example 9.5
Figure 9.6 illustrates the Excel calculation procedure to calculate the required indices for the
average annual oil price.
Figure 9.6
➜ Excel solution
Year Cells B4:B35 Values
Average price Cell C4:C35 Values
Index base 1980 = Cell D4 Value (=100)
Cell D5 Formula: =C5/$C$4*100
Copy formula D5:D35
Index base 1992 = Cell F4 Formula: =C4/$C$16*100
Copy formula F4:F15
Cell F16 Value (=100)
Cell F17 Formula: =C17/$C$16*100
Copy formula F17:F35
❉ Interpretation
1. For the first index series: the price of oil in the year 2005, for example, was 33.73%
(133.73 − 100) higher than the price of oil in 1980.
2. For the second index series: the price of oil in the year 1999, for example, was 13.97%
(100 − 86.03) lower than the price of oil in 1992.
To convert indices from one year to another is very easy. Let’s say that we want to know
by how much was the price of oil higher in the year 2000 when compared with 1990. Using
the first series of indices, the one where 1980 is the base year, it is calculated as:
If we tried to do the same for the second series of indices, the one where 1992 is the base
year:
❉ Interpretation Indices can be converted easily from one base to another. Regardless
of which series of indices we use, the price of oil in year 2000, for example, was 18% higher than
the price of oil in 1990.
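The fixed-base calculation and the conversion between bases can be sketched as follows. This is an illustration only: the three prices are the ones quoted in the text (1980: $37.42, 1985: $26.92, 2007: $64.20), while the function names fixed_base_index and rebase are hypothetical.

```python
def fixed_base_index(values, base_year):
    """Index for every year relative to base_year: I_t = y_t / y_0 * 100."""
    base = values[base_year]
    return {year: value / base * 100 for year, value in values.items()}

def rebase(indices, new_base_year):
    """Convert an existing index series so that new_base_year equals 100."""
    new_base = indices[new_base_year]
    return {year: index / new_base * 100 for year, index in indices.items()}

# Prices in $/bbl quoted in the text for three of the years.
prices = {1980: 37.42, 1985: 26.92, 2007: 64.20}

index_1980 = fixed_base_index(prices, 1980)     # 1980 = 100
index_1985 = rebase(index_1980, 1985)           # converted so that 1985 = 100
direct_1985 = fixed_base_index(prices, 1985)    # calculated directly from prices

print(round(index_1985[2007], 2), round(direct_1985[2007], 2))
```

Both routes give the same index for 2007 on a 1985 base, which is the point made in the interpretation: the conversion does not depend on which base the original series used.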
Rather than having a time series of indices on a fixed basis, i.e. starting from one partic-
ular year that is equal to 100, we can have indices on a year-to-year basis. This effectively
means that every previous year is equivalent to 100. These are called chain indices.
Example 9.6
Figure 9.7 illustrates the calculation procedure to calculate the average oil price and index
values for oil prices.
Figure 9.7
➜ Excel solution
Year Cells B4:B35 Values
Average price Cell C4:C35 Values
Index base 1980 = Cell D4 Value (=100)
Cell D5 Formula: = C5/C4*100
Copy formula D5:D35
❉ Interpretation The average oil price in 1985, for example, has dropped when
compared with the previous year by 6.37% and the price in 2007, for example, has grown by
10.12% in comparison with the previous year.
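The year-to-year calculation in cell D5 (=C5/C4*100) can be mirrored in a short sketch. The prices below are invented for illustration, and the function name chain_indices is hypothetical.

```python
def chain_indices(values):
    """Chain index for each period: current value / previous value * 100.

    The first period has no predecessor, so its entry is None.
    """
    return [None] + [curr / prev * 100 for prev, curr in zip(values, values[1:])]

# Hypothetical prices for four consecutive years, invented for illustration only.
prices = [20.0, 25.0, 22.5, 27.0]
chain = chain_indices(prices)

for year_offset, index in enumerate(chain):
    if index is None:
        print(f"year {year_offset}: no previous year")
    else:
        print(f"year {year_offset}: index = {index:.2f}, change = {index - 100:+.2f}%")
```

Subtracting 100 from each chain index gives the percentage change on the previous year, which is how the interpretation above reads the values.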
The series of numbers in Example 9.6 is very interesting as it actually shows us a per-
centage of change that takes place on a year-by-year basis. Using index numbers, we can
calculate a number of other more complicated indices. This takes us to an example of
aggregate price indices.
Example 9.7
Table 9.4 shows the value of CPI from 1980 to 2011; they are calculated on the basis of year
2000.
Table 9.4
To calculate the value of CPI for 2007, for example, when compared with the previous
year, we can use the formula we have already introduced:
❉ Interpretation The annual inflation rate in the USA, measured as CPI, in 2007 was
2.85%.
\[ \text{CPI}_{\text{YEAR A}} = \frac{I_{\text{YEAR A}} - I_{\text{YEAR B}}}{I_{\text{YEAR B}}} \times 100 \quad (9.2) \]
CPI has one very important quality: it can be used as a price deflator. We can
use CPI to convert (or deflate, hence the word deflator) prices from any year into
so-called constant prices. This is sometimes called converting actual dollars into real dollars,
i.e. dollars free from inflation.
Example 9.8
Let’s take the example of oil prices as before. Column C repeats the average annual price of
domestic crude oil in $/bbl. These values are given in current dollars, i.e. the value of the dollar
in every given year. Column D shows us the values of the CPI index for every year,
given on the basis of year 2000 = 100 (Figure 9.8).
Figure 9.8
Oil prices deflated with CPI
➜ Excel solution
Year Cells B4:B35 Values
Oil price Cells C4:C35 Values
CPI Cells D4:D35 Values
Deflated value = Cell E4 Formula: = C4*($D$24/D4)
Copy formula down E4:E35
To convert the prices of oil into a constant value, we need to deflate them. In our example
we can deflate them by multiplying each annual price by the ratio of the base-year CPI
(year 2000 in our example) to the CPI for that year: Price at time A = Price at
time B × (CPI at time A / CPI at time B). In a more general sense, this formula is:
\[ P_A = P_B \times \frac{\text{CPI}_A}{\text{CPI}_B} \quad (9.3) \]
❉ Interpretation The price of oil in 2007, when expressed in a constant dollar value for
year 2000, was $53.32 (cell E31 in Figure 9.8). The price of oil in 1980, on the same basis, i.e.
in constant year 2000 dollars, was $78.19 (cell E4 in Figure 9.8). This means that, in real terms,
the price of oil in 1980 was much higher than the price in 2007.
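Equation (9.3) and the Excel formula =C4*($D$24/D4) can be sketched as a small function. The two oil prices are the ones quoted in the text; the CPI values are round placeholders chosen for illustration and are not the Table 9.4 values, so the results will not match Figure 9.8. The function name to_constant_prices is hypothetical.

```python
def to_constant_prices(price, cpi_year, cpi_base):
    """Deflate a current-dollar price: P_A = P_B * (CPI_A / CPI_B)."""
    return price * (cpi_base / cpi_year)

# Placeholder CPI values (base year 2000 = 100), invented for illustration only.
cpi = {1980: 50.0, 2000: 100.0, 2007: 120.0}

# Nominal oil prices in $/bbl quoted in the text.
nominal = {1980: 37.42, 2007: 64.20}

real_2000 = {
    year: to_constant_prices(price, cpi[year], cpi[2000])
    for year, price in nominal.items()
}

for year, value in real_2000.items():
    print(f"{year}: nominal ${nominal[year]:.2f} -> ${value:.2f} in constant 2000 dollars")
```

Changing the third argument to the CPI of another year re-expresses the same prices in that year's dollars, which is all that is needed to move to a different constant-dollar basis.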
Example 9.9
Using the previously-described technique for converting indices from one base to another, if
we wanted to calculate the price of oil on the basis of a constant value of US dollars for the
year 2007, the calculation is as shown in Figure 9.9.
Figure 9.9
Deflated oil prices with CPI
➜ Excel solution
Year Cells B4:B35 Values
Oil price Cells C4:C35 Values
CPI Cells D4:D35 Values
Deflated value = Cell E4 Formula: = C4*($D$31/D4)
Copy formula down E4:E35
The calculation shown in the Excel solution helps us with simple questions, such as
is the price of oil in 2007 of $64.20 higher in real terms than the price of oil of $37.42 in
1980? We can translate this into a question: how much is $37.42 from 1980 worth in 2007
terms? This is calculated as: Adjusted price = Old price*(CPI for 2007/CPI for 1980). In
more general terms:
\[ y_t^{adj} = y_t \left( \frac{\text{CPI}_{Fixed}}{\text{CPI}_t} \right) \quad (9.4) \]
❉ Interpretation In constant 2007 dollars, the 1980 price of oil of $37.42 is equivalent
to $94.14. This is $29.94 more than the 2007 price of oil of $64.20, so in real terms oil
was more expensive in 1980 than in 2007.
Student exercises
X9.6 Calculate indices based on year 2000 for the series shown in Table 9.5. Could you
convert them into indices based on year 2003?
Year   2000  2001  2002  2003  2004  2005  2006  2007  2008
Sales   230   300   290   320   350   400   350   400   420
Table 9.5
X9.7 Use the CPI values from Figure 9.8 to convert the sales values from student exercise
X9.6 to a constant dollar value based on the 2004 value of the dollar.
X9.8 What is the real value of the sales value in 2007 if you put it on the constant year 2000
basis?
Classical time series analysis Approach to forecasting that decomposes a time series into certain constituent components (trend, cyclical, seasonal, and random component), makes estimates of each component, and then recomposes the time series and extrapolates into the future.
Trend (T) The trend is the long-run shift or movement in the time series observable over several periods of time.
9.3 Trend extrapolation
At the beginning of this chapter we classified not only the time series into various types, but also various methods that deal with time series. It is our objective in this text to deal with univariate time series only and to describe just several basic time series analysis methods. Classical time series analysis starts with an assumption that every time series can be decomposed into four elementary components: (i) underlying trend (T), (ii) cyclical variations (C), (iii) seasonal variations (S), and (iv) irregular variations (I).
Depending on the model, these components can be put together in different ways to represent the time series. The simplest of all is the so-called additive model. It states that time series Y implicitly consists of the four components that are all added together:
Y = T + C + S + I (9.5)
In addition to an additive model, a multiplicative model can also be used. Sometimes, the most appropriate model is a mixed model. Here are two examples of these models:
Multiplicative model: Y = T × C × S × I
Mixed model: Y = (T × C × S) + I
The character of the data in time series will determine which model is the most appropriate.
Cyclical variations (C) The cyclical variations of the time series model that result in periodic above-trend and below-trend behaviour of the time series lasting more than one year.
Seasonal variations (S) The seasonal variations of the time series model that show a periodic pattern over one year or less.
Irregular variations (I) The irregular variations of the time series model that reflect the random variation of the time series values beyond what can be explained by the trend, cyclical, and seasonal components.
Additive model The additive time series model is a model whereby the separate components of the time series are added together to identify the actual time series value.
Multiplicative model The multiplicative time series model is a model whereby the separate components of the time series are multiplied together to identify the actual time series value.
Mixed model The mixed time series model blends both additive and multiplicative components to identify the actual time series value.
Underlying trend is almost self-explanatory, but we’ll describe it further along. The cyclical component consists of the long-term variations that happen over a period of several years. If the time series is not long enough, sometimes we might not even be able to observe this component because the cycle is either longer than our time series or it is just not obvious. However, the seasonal component applies to seasonal effects happening within one year. Therefore, if the time series consists of annual data, there is no need to worry about the seasonal component. At the same time, if we have monthly data and our time series is several years long, then it will (potentially) consist of the seasonal, as well as of the cyclical, component. And, finally, the irregular component is everything else that does not fit into any of the previous three components. A method of isolating different components in a time series, or decomposing the time series, is called the classical time series decomposition method. This is one of the oldest approaches to forecasting. The whole area of classical time series analysis is concerned with the theory and practice of how to decompose a time series into these components, estimate them, and then recompose to produce forecasts. We will not go into this method in any depth, but we’ll look into the trend component.
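To make the additive, multiplicative, and mixed models concrete, the sketch below builds a synthetic monthly series from invented components, assuming NumPy. For brevity the cyclical component is left out (C = 0 in the additive model and C = 1 in the others), and every number in it is made up for illustration.

```python
import numpy as np

t = np.arange(1, 25)                                    # 24 monthly time points
trend = 100 + 2.0 * t                                   # T: steady upward movement
seasonal_add = 10 * np.sin(2 * np.pi * t / 12)          # S in the units of the series
seasonal_mult = 1 + 0.1 * np.sin(2 * np.pi * t / 12)    # S as a factor around 1

rng = np.random.default_rng(0)
irregular = rng.normal(0, 1, size=t.size)               # I: random variation

# Additive model: Y = T + S + I (cyclical component omitted).
y_additive = trend + seasonal_add + irregular

# Multiplicative model: Y = T x S x I, with I expressed as a factor around 1.
y_multiplicative = trend * seasonal_mult * (1 + irregular / 100)

# Mixed model: Y = (T x S) + I.
y_mixed = trend * seasonal_mult + irregular

print(y_additive[:3].round(2), y_multiplicative[:3].round(2), y_mixed[:3].round(2))
```

In the additive series the seasonal swing stays the same size as the trend rises, while in the multiplicative and mixed series it grows with the level of the trend. That difference is usually what decides which model suits a given data set.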
Y = T + R (9.6)
If a trend represents an underlying pattern that the time series follows, then the residuals are something that should oscillate randomly around the trend. In other words, if we can estimate the underlying trend of a time series, we will not worry about these random residuals fluctuating around the trend line. We can then extrapolate this trend. The trend becomes our forecast of the time series. Admittedly, this forecast will not be 100% accurate as some residual value will be oscillating around the trend, but for all practical purposes, this might be exactly what we want. We are interested in just isolating the trend and extrapolating it into the future, which produces the forecast value for our time series.
Note Fitting a trend to a time series and extrapolating it into the future is the most elementary form of forecasting.
Seasonal component A component in the classical time series analysis approach to forecasting that covers seasonal movements of the time series, usually taking place inside one year’s horizon.
Classical time series decomposition Classical time series decomposition is a statistical method that deconstructs a time series into notional components.
Trend component A component in the classical time series analysis approach to forecasting that covers underlying directional movements of the time series.
Residuals (R) The differences between the actual and predicted values. Sometimes called forecasting errors. Their behaviour and pattern has to be random.
9.3.2 Fitting a trend to a time series
If trend is the underlying pattern that indicates the general movements and the direction of time series, then this implies that a trend can be described by any regular curve. This usually means a smooth curve, either a straight line, a parabola, a sinusoid, or any other well-defined curve. Fortunately, Excel is very well equipped to help us define the trend, fit it to time series, and extrapolate it into the future. Let’s see what elementary types of trends are embedded in Excel and how to invoke them.
Types of trends The type of trend can include line and curve fits to the data set.
Example 9.10
We’ll use an artificially-created time series that consists of only 30 observations, as illustrated in Table 9.6.
Table 9.6
When charted as a line graph, the time series looks as illustrated in Figure 9.10.
Figure 9.10 Line graph of the time series (series value versus time point, x).
To fit a trend line to the time series is a very easy graphical process in Excel, as we have
already demonstrated in Chapter 8.
To fit a trend line to the time series right-click on any data point in the Excel graph, as
illustrated in Figure 9.11, and select Add Trendline.
After selecting Add Trendline, choose the Linear option, as well as Display Equation on
chart and Display R-squared on chart (see Figure 9.12). Click Close.
Figure 9.11
Figure 9.12
The final graph with the trend line added automatically is illustrated in Figure 9.13 with
the line equation and coefficient of determination included.
Figure 9.13 Time series with the fitted linear trend line (series value versus time point, x): y = 1.8082x + 13.306, R2 = 0.89.
What we are getting here instantly is a straight line that describes the underlying move-
ment and the direction of our time series.
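The same kind of straight-line fit can be sketched outside Excel. The example below assumes NumPy and uses a synthetic 30-point series generated for illustration. It is not the Table 9.6 data, so the fitted coefficients will differ from those shown in Figure 9.13.

```python
import numpy as np

# Synthetic series of 30 observations, generated for illustration only.
x = np.arange(1, 31)
rng = np.random.default_rng(1)
y = 13.0 + 1.8 * x + rng.normal(0, 5, size=x.size)

# Least squares straight line: polyfit returns [slope, intercept] for deg=1.
m, b = np.polyfit(x, y, deg=1)
trend = m * x + b

# Residuals are what is left once the trend is removed.
residuals = y - trend
r_squared = 1 - np.sum(residuals**2) / np.sum((y - y.mean()) ** 2)

print(f"y = {m:.4f}x + {b:.3f}, R^2 = {r_squared:.2f}")
```

For a least squares line with an intercept the residuals sum to zero, which matches the idea that the series oscillates around its trend.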
y = mx + b (9.7)
y = c ln x + b (9.8)
Here, c and b are constants and ln is the natural logarithm function. The picture in the Excel dialogue box indicates that this trend has the form of an inverse exponential curve, one that quickly reaches some high value and then continues to grow much more slowly.
Polynomial curve comes in several degrees, for example a polynomial equation of degree 6 would be written as defined by equation (9.9).
\[ y = b + c_1 x + c_2 x^2 + c_3 x^3 + c_4 x^4 + c_5 x^5 + c_6 x^6 \quad (9.9) \]
In this case also b and c_1 to c_6 are constants. If you experiment with these curves, you will see that some of them translate into very dynamic curves, making multiple turns and ups and downs.
Power function has a very simple equation, with c and b as constants, as defined by equation (9.10).
\[ y = c x^b \quad (9.10) \]
For positive c and b this trend will continue to grow forever; when b = 2 it is a parabola.
Exponential trend also has two constants, c and b, as defined by equation (9.11).
\[ y = c e^{bx} \quad (9.11) \]
The symbol e is used for the base of natural logarithms. Unlike the power trend, whose percentage growth slows as x increases, the exponential trend (with b > 0) moves slowly at the beginning and then shows the very fast change typified by exponential growth.
Linear trend Linear trend is a straight line fit to a data set.
Logarithmic trend A model that uses the logarithmic equation to approximate the time series.
Polynomial trend A model that uses an equation of any polynomial curve (parabola, cubic curve, etc.) to approximate the time series.
Power trend A model that uses an equation of a power curve (a parabola when b = 2) to approximate the time series.
Exponential trend An underlying time series trend that follows the movements of an exponential curve.
Moving average trend The moving average trend is a method of forecasting or smoothing a time series by averaging each successive group of data points.
Moving averages trend is a special type of trend that we will cover further along in the
chapter as a separate heading, owing to its special way of deployment.
Figure 9.14
In Figure 9.15 we opt for an automatic trend line that extends five periods into the future.
Figure 9.15
Figure 9.16 illustrates the modified time series chart with the trend line extended by 5
time periods to provide a forecast for time points 31, 32, 33, 34, and 35.
Figure 9.16 (line chart with the linear trend line extended to time point 35: series value against time point, x; y = 1.8082x + 13.306, R² = 0.89)
We can see that the actual time series is not a smooth straight line, but that it oscillates around one, and we have identified that line. By extrapolating our straight line, or linear trend, into the future, we are stating that the actual values might be a bit adrift, but we believe that they will be inside some confidence factor, as we will describe later on. Excel does not just give us a pictorial representation of this trend line, but also the actual equation of the line. From Figure 9.16, we can see that this trend line is moving in accordance with the equation y = 1.8082x + 13.306. We’ll explain this in a minute. The R-squared (or R²) value is 0.89. Let’s refresh what we know about this statistic.
When fitting a line to a data set, as we described in Chapter 8, we measure how closely the trend line fits the actual data. Every deviation of an observation from the mean of the series is squared and all these values are summed to create the total sum of squares (SST). The theory suggests that the SST consists of the regression sum of squares (SSR), the part explained by the trend line, and the residual sum of squares (SSE), the part left unexplained.
R-squared is a coefficient that measures how closely the actual time series is approximated (or fitted) by a trend line, as given by equation (9.12).
$R^2 = 1 - \frac{SSE}{SST}$ (9.12)
Note
The closer R2 is to the value of 1, the better the fit of the trend to time series. In
our case R-squared is 0.89, which is very good. This confirms that our trend is approximating,
or fitting, the data very well. Only 11% (1 – 0.89 = 0.11) of data variations are not captured, or
explained, by the trend line. This is more than reasonable.
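The arithmetic behind equation (9.12) can also be sketched outside Excel. The Python fragment below is a minimal illustration only: the function name r_squared and the five data points are assumptions made for the example (they are not the data of Table 9.6), and the trend values are the least squares line for those five points.

```python
def r_squared(actual, fitted):
    """Return R-squared = 1 - SSE/SST, as in equation (9.12)."""
    mean_actual = sum(actual) / len(actual)
    # SST: squared deviations of the observations from their mean.
    sst = sum((y - mean_actual) ** 2 for y in actual)
    # SSE: squared deviations of the observations from the trend line.
    sse = sum((y - f) ** 2 for y, f in zip(actual, fitted))
    return 1 - sse / sst


# Hypothetical series and its least squares trend y = 3.1x + 8.9 for x = 1..5.
series = [12.0, 15.0, 19.0, 20.0, 25.0]
trend = [3.1 * x + 8.9 for x in range(1, 6)]

print(f"R-squared = {r_squared(series, trend):.3f}")
```

A value close to 1 means that the squared deviations from the trend line (SSE) are small relative to the squared deviations from the mean (SST).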
We said earlier that the trend line equation in this particular case was y = 1.8082x + 13.306. Excel extrapolated our trend line five periods into the future, but we do not know either the past values or the future values of this trend line. All we have is the chart that does this for us. We need to learn how to calculate these values manually or using the built-in Excel functions.
Example 9.11
In this example we will fit a trend line to a time series data set given the value of the slope of
the trend line and its intercept. Figure 9.17 illustrates the manual calculation of the trend line
using basic Excel formulae.
Figure 9.17
Figure 9.17 indicates that we have put the value of the intercept in cell H3 and the value of the slope in cell H4.
➜ Excel solution
Period Cells B4:B33 Values
Series 1 Cells C4:C33 Values
a = Cell H3 Value
b = Cell H4 Value
Trend Cell D4 Formula: = $H$3+$H$4*B4
Copy formula down D4:D33
The forecasts at time points 31, 32, 33, 34, and 35 are produced in the same way as illus-
trated in Figure 9.18.
Figure 9.18
Note The future values of x should always be a sequential continuation of the period
numbers used in the past. In our case, the last observation is for period 30, which means that
the future values of x are 31, 32 . . . 35.
It is not necessary to use the Excel graph function and chart the trend first to get the values of the intercept and the slope. Excel has a built-in function for each parameter. In cell H3 we could have invoked the Excel INTERCEPT() function and in cell H4 the SLOPE() function.
Example 9.12
Repeat Example 9.11, but this time use the Excel functions INTERCEPT() and SLOPE() to calculate these values in Cells H3 and H4.
Figure 9.19 illustrates the change to cells H3 and H4 in the Excel solution given in Figure 9.17.
Figure 9.19
➜ Excel solution
Period Cells B4:B33 Values
Series 1 Cells C4:C33 Values
a = Cell H3 Formula:
b = Cell H4 Formula:
Trend Cell D4 Formula: = $H$3+$H$4*B4
Copy formula down D4:D33
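For readers who want to see what the intercept and slope calculation involves, the following Python sketch computes least squares estimates and then builds the trend and forecast values in the same way as the worksheet formula = $H$3+$H$4*B4. It is an illustration under stated assumptions: the function name slope_and_intercept and the ten-period series are hypothetical and are not taken from Table 9.6.

```python
def slope_and_intercept(x, y):
    """Least squares slope and intercept of the line y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical ten-period series (not the data of Table 9.6).
periods = list(range(1, 11))
series = [15, 18, 16, 21, 24, 22, 27, 29, 28, 33]

slope, intercept = slope_and_intercept(periods, series)

# Trend values for the historical periods (the role of the Trend column).
trend = [intercept + slope * t for t in periods]

# Forecasts: the period numbers simply continue, here 11 to 15.
future_periods = list(range(11, 16))
forecasts = [intercept + slope * t for t in future_periods]

print(f"slope = {slope:.4f}, intercept = {intercept:.4f}")
for t, f in zip(future_periods, forecasts):
    print(t, round(f, 2))
```

Note that the future period numbers continue the historical sequence, exactly as required for the Excel calculation.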
Example 9.13
Remember, if you cannot remember the names of the Excel functions then you can select
Formulas > Insert Function and choose the function you require to undertake your data analy-
sis. For example, Figures 9.20 and 9.21 illustrate the Excel solution for the INTERCEPT() and
SLOPE() functions.
Figure 9.20
Figure 9.21
Figure 9.22
Figure 9.23
Click OK.
The formula: = TREND(C4:C33,B4:B33,B4,TRUE) will then appear in cell D4.
Remember, before you copy this formula down to calculate the trend line values for
time points 1, 2, 3 . . . 30 (cells D4:D33) you will need to fix the cell reference for the terms
known_y’s and known_x’s, as given by the following change to formula in cell D4: =
TREND($C$4:$C$33,$B$4:$B$33,B4,TRUE). Now copy this formula down from cell
D4:D33. If you wanted to calculate the forecast values at time periods 31, 32, 33, 34, and 35
then continue the copy from D34:D38.
Note
1. As before, the values of x have to be the sequential numbers that continue from the last historical period number.
2. The principles of calculating linear trend, as described here, can be applied to other types of curves. The Manual and the Function methods work with any curve. The Function method can be applied with the GROWTH function as well as with the TREND function. GROWTH is an Excel function that describes exponential trends. It is invoked and used in exactly the same way as the TREND function used for linear time series.
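To make the idea of an exponential trend function concrete, here is a short Python sketch that fits the curve of equation (9.11), y = c e^(bx), by fitting a straight line to the natural logarithms of the series. This is an assumption about how such a fit can be done, offered as an illustration rather than as a description of Excel's internals; the function name fit_exponential and the data are hypothetical.

```python
import math


def fit_exponential(x, y):
    """Fit y = c * exp(b * x) by least squares on ln(y); returns (c, b)."""
    log_y = [math.log(v) for v in y]   # every y value must be positive
    n = len(x)
    mean_x = sum(x) / n
    mean_log_y = sum(log_y) / n
    b = (sum((xi - mean_x) * (li - mean_log_y) for xi, li in zip(x, log_y))
         / sum((xi - mean_x) ** 2 for xi in x))
    c = math.exp(mean_log_y - b * mean_x)
    return c, b


# Hypothetical series that grows by roughly a third each period.
periods = [1, 2, 3, 4, 5, 6]
series = [2.7, 3.6, 4.9, 6.7, 9.0, 12.2]

c, b = fit_exponential(periods, series)
forecast_7 = c * math.exp(b * 7)   # extrapolate one period ahead
print(f"c = {c:.3f}, b = {b:.3f}, forecast for period 7 = {forecast_7:.2f}")
```

As with the linear case, the forecast is obtained by continuing the sequence of period numbers.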
Student exercises
X9.9 If the time series components were extracted as in Table 9.7, how would you
reconstruct the time series (ŷ) using: (a) an additive model and (b) a mixed model?
Table 9.7
X9.10 If a time series can be best fitted with the trend whose equation is $y = a + bx + cx^2$, would you say that this is a linear model?
X9.11 R-squared (R2) is a measure of how closely a trend fits the time series. What is another
expression for this statistic and in what context have we used it when discussing linear
regression?
X9.12 Does R2 = 0.90 indicate a good fit? Why?
X9.13 Extrapolate the time series in Table 9.8 three time periods in the future. Use the TREND
function. Why do you think it would not make sense to extrapolate this time series 10
time periods in the future?
X 1 2 3 4 5 6 7 8 9 10 11 12
Y 230 300 290 320 350 400 350 400 420
Table 9.8
Example 9.14
A very short time series (shown in Figure 9.24) has an average value of 208. The average value
represents the series fairly well because the series flows very much horizontally. Figure 9.25
illustrates this graphically. The average of 208 is shown as a horizontal line that runs across the
time series.
Figure 9.24
Figure 9.25 (line chart: series value against time point, x, with the average shown as a horizontal line)
As we know, this particular sample time series is called a stationary time series.
However, if the series was moving upwards, or downwards, this average value would not be
the best representation of the series.
In this case a much more realistic representation would be some kind of moving average.
We are effectively saying that, in general, a series of moving averages is a much more realistic
representation for non-stationary time series.
Example 9.15
We created another short time series and calculated moving averages in Figure 9.26.
Figure 9.26
➜ Excel solution
Period Cells B4:B8 Values
Series Cells C4:C8 Values
3MA Cell D5 Formula: =SUM(C4:C6)/3
Copy formula down D5:D7
5MA Cell F6 Formula: =SUM(C4:C8)/5
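The centred moving averages of this Excel solution can be mirrored in a few lines of Python. This is a sketch only: the function name centred_moving_average is an assumption, and the five values are hypothetical because the data of Figure 9.26 are not reproduced in the text.

```python
def centred_moving_average(series, n):
    """Centred n-period moving average; n must be odd.

    Positions without a full window on both sides are returned as None,
    like the empty cells at the top and bottom of the worksheet column.
    """
    if n % 2 == 0:
        raise ValueError("n must be odd so the average can be centred")
    half = n // 2
    result = [None] * len(series)
    for i in range(half, len(series) - half):
        result[i] = sum(series[i - half:i + half + 1]) / n
    return result


series = [150, 210, 180, 270, 240]   # hypothetical five-period series

ma3 = centred_moving_average(series, 3)
ma5 = centred_moving_average(series, 5)
print(ma3)  # [None, 180.0, 220.0, 230.0, None]
print(ma5)  # [None, None, 210.0, None, None]
```

With five observations the three-period average can be placed against periods two to four, while the five-period average exists only for period three and equals the overall mean.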
Moving averages are dynamic averages that change in accordance with the number of periods for which they are calculated. A general formula for moving averages is given by equation (9.13).
$M_t = \frac{\sum_{i=t-N+1}^{t} x_i}{N}$ (9.13)
In equation (9.13), t is the time period and N is the number of observations taken into the calculation. It is clear that if we are using three observations as a basis for calculating moving averages then the first possible observation for which we can calculate the moving average is observation 3. Equation (9.13) can be simplified and expressed as equation (9.14).
$M_t = \frac{x_t + x_{t-1} + \dots + x_{t-N+1}}{N}$ (9.14)
Moving averages: Averages calculated for a limited number of periods in a time series. Every subsequent period excludes the first observation from the previous period and includes the one following the previous period. This becomes a series of moving averages.
The advantage of using an odd number for N, and so taking an odd number of elements into the equation, is that we can centre the moving average value in the middle of the interval, as per Figure 9.26. This implies that we’ll use an odd-numbered interval most of the time, as it is easier to centre the values. In our case, the three-period moving average for period two is calculated as $(x_1 + x_2 + x_3)/3$, which is the formula =SUM(C4:C6)/3 in cell D5.
Note Equation (9.14) can also be rewritten as: $M_t = M_{t-1} + \frac{x_t - x_{t-N}}{N}$.
In other words, if we do not know the value of the first observation in the moving average
interval, we can still estimate the current moving average from the previous value of the
moving average, plus the other value from the interval. Although this might appear to be a
useless fact here, you will see why we mentioned it when we discuss exponential smoothing.
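The rewritten form of the moving average in this Note can be checked numerically. The sketch below, with hypothetical data and assumed function names, computes the trailing moving averages of equation (9.14) directly and then again by updating the previous average.

```python
def trailing_moving_averages(series, n):
    """M_t from equation (9.14) for every t that has n observations behind it."""
    return [sum(series[t - n + 1:t + 1]) / n for t in range(n - 1, len(series))]


def recursive_moving_averages(series, n):
    """The same values, built with M_t = M_(t-1) + (x_t - x_(t-N)) / N."""
    averages = [sum(series[:n]) / n]   # the first average is computed directly
    for t in range(n, len(series)):
        averages.append(averages[-1] + (series[t] - series[t - n]) / n)
    return averages


series = [150, 210, 180, 270, 240, 300]   # hypothetical series

direct = trailing_moving_averages(series, 3)
recursive = recursive_moving_averages(series, 3)
print(direct)
print(recursive)
```

Both lists agree (up to floating-point rounding): each new average adds the newest observation and drops the oldest one, each divided by N.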
What happens if we extend the number of observations in the moving average interval? Let’s look at what happens if we take all the values from the series to constitute the interval. As Figure 9.27 shows, the moving average simply becomes the overall average. This implies that the larger the number of observations used for calculating the moving average, the smoother and the more horizontal the line representing it will be.
Figure 9.27 (line chart: actual series with 3 MA and 5 MA lines; series value against time point, x)
Note It is a general principle that the larger the number of observations in the moving average formula, the ‘smoother’, or less dynamic, the time series of moving averages will be. Various business reports will tend to use moving averages. The most frequent choice is to use 3-month, 6-month, and 12-month moving averages. If you look at the 3-month moving average time series, you will see that it is closely tracking the actual time series. However, a 12-month moving average line will be much ‘flatter’, or more horizontal, as it is averaging a much larger interval and it is, therefore, not so subject to the most recent events.
Exponential smoothing: One of the methods of forecasting that uses a constant (or several constants) to predict future values by ‘smoothing’ the past values in the series. The effect of this constant decreases exponentially as the older observations are taken into calculation.
Example 9.16
Let us now use a slightly longer time series and see how to use moving averages for forecasting purposes. If a series is horizontal (stationary) and we just want to predict a single future value of this series, we already said that using a simple average value of the series is almost as good as any other method. Figure 9.28 shows such a stationary series with 30 observations and its mean value, which was used to predict the 31st observation.
Figure 9.28 (line chart: series value against time point, x)
The advantage of this simple method is that it can be extended further in the future. If
we need to forecast for the next five observations, we just extend the mean line. By defini-
tion, if a series is stationary it fluctuates around its mean. Therefore, the mean is its best
predictor. This method does not produce very precise forecasts, but the results will be
accurate enough. To add more sophistication to our forecasting and to try to emulate the
movements of the original series, we need to see how to use the principle of moving aver-
ages. We’ll use the same Excel method as in section 9.3.2. To remind you how we added a
trend line to a time series, we right click on the time series, which will invoke a dialogue
box with several options included. We then click on the option called ‘Add Trendline. . .’.
This invokes the next dialogue box, and we select the moving averages option and change
the number of periods to three, as illustrated in Figure 9.29.
Excel will start automatically charting the moving average from the last observation in
the period specified (in this case three). If we selected a five-period moving average, then
the moving average function would start from observation five. This is somewhat different
to the advice we gave earlier when we recommended that the moving average should be
centred in the middle of the interval for which it is calculated. A simple reason for this is
that, here, we are trying to predict the series, and this is going to help us to achieve this.
So, how is the moving average approach used to produce forecasts? All we need to do
is to shift the moving average plot, as produced by Excel, by one observation. In other
words, the moving average value for the first three observations (assuming we are using
moving averages for three periods) becomes the forecast for the fourth observation. The
fifth observation is predicted by using the second three period moving average (obser-
vations two to four) and so on. Figure 9.30 illustrates the point for three-period moving
averages.
Figure 9.29
Figure 9.30 (line chart: series value with three-period moving average forecasts)
However, there are two difficulties associated with this approach. First of all, we cannot
extend our forecast beyond just one future period, which means that this method can only
be used as a short-term forecasting method that predicts only one future observation. To
forecast using moving averages we can modify equation (9.13) to create equation (9.15) to
enable calculation of the required moving average forecasts.
$F_t = \frac{\sum_{i=t-N}^{t-1} x_i}{N}$ (9.15)
For example, for MA(4) the forecast at time point five is:
$F_5 = \frac{x_4 + x_3 + x_2 + x_1}{4}$
The other issue is that Excel will not shift the moving average plot if we are using Add
Trendline wizard. We need to calculate moving averages manually.
Example 9.17
Figure 9.31 shows how the moving averages were calculated for a three point moving average
(3MA) and the use of the model to provide a forecast at time point 31.
Figure 9.31
➜ Excel solution
Period Cells B4:B34 Values
Series Cells C4:C34 Values
3MA Cell D7 Formula: =AVERAGE(C4:C6)
Copy formula down D7:D34
Forecast
Period Cell B35 Value
3MA Cell D35 Formula: =AVERAGE(C32:C34)
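The shifted moving average of equation (9.15) can be sketched in Python as follows. The function name moving_average_forecasts and the six data values are assumptions made for illustration; they are not the series of Figure 9.31.

```python
def moving_average_forecasts(series, n):
    """Return {period: forecast}, with periods numbered from 1.

    The forecast for period t is the mean of the n observations before it,
    as in equation (9.15), so the last entry is one period beyond the data.
    """
    forecasts = {}
    for t in range(n + 1, len(series) + 2):
        window = series[t - 1 - n:t - 1]   # observations x_(t-n) .. x_(t-1)
        forecasts[t] = sum(window) / n
    return forecasts


series = [5.2, 5.4, 5.3, 5.6, 5.5, 5.4]   # hypothetical six-period series

forecasts = moving_average_forecasts(series, 3)
for period, value in forecasts.items():
    print(period, round(value, 3))
```

The first forecast appears at period four and the last at period seven, one period after the final observation, which is the only genuine future forecast this method provides.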
As stated earlier, we need to remember that the more we extend the number of periods
used for calculating the moving average, the smoother and more horizontal the curve will
be. If we take into account all the observations in the series, needless to say we will have
only one moving average value and it will be identical to the mean value of the overall
time series.
ŷ t = y t −1 + et (9.16)
When $y_{t-1}$ remains stationary over time it is reasonable to forecast future values of $\hat{y}_t$ by using regression analysis, as described in Chapter 8. In such situations the least squares estimate of $y_{t-1}$ would be the average value of all the observed data values, where in the calculation of the point estimate of the constant term in linear regression ($b_0$) we are equally weighting each of the previously observed terms in the time series data set.
When the value of $y_{t-1}$ changes over time (non-stationary) then this equal weighting may not be appropriate and it may be more desirable to weight recent observations more heavily than older observations. Simple exponential smoothing is a forecasting method that applies unequal weights to the time series data. Let’s explain how we arrive at its formula.
With a bit of imagination, we can say that every new forecast is the old one plus an adjustment for the error that occurred in the last forecast, i.e. $e_{t-1} = y_{t-1} - F_{t-1}$, as presented in equation (9.17).
$F_t = F_{t-1} + (y_{t-1} - F_{t-1})$ (9.17)
where $y_{t-1}$ is the actual result from period t − 1 and $F_{t-1}$ is the forecast result for period t − 1.
Let us now assume that the error element, i.e. $(y_{t-1} - F_{t-1})$, is zero. In this case the current forecast is the same as the previous forecast. However, if it is not zero, then, under certain circumstances, we might be interested in taking just a fraction of this error using equation (9.18).
$F_t = F_{t-1} + \alpha (y_{t-1} - F_{t-1})$ (9.18)
where 0 < α < 1. The forecasts calculated in such a way are, in fact, smoothing the actual observations. If we plot both the original observations and these newly calculated ‘back-forecasts’ of the series, we’ll see that the back-forecast curve is eliminating some of the dynamics that the original observations exhibit. It is a smoother time series.
Equation (9.18) can be rewritten as equation (9.19).
$F_t = \alpha y_{t-1} + (1 - \alpha) F_{t-1}$ (9.19)
Equations (9.18) and (9.19) are identical, and it is a matter of preference which one to use. They both provide identical forecasts based on smoothed approximations of the original time series.
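That equations (9.18) and (9.19) give the same forecasts is easy to confirm numerically. In the Python sketch below the function names, the five data values, and the choice α = 0.3 are hypothetical, and the first forecast is assumed to equal the first observation.

```python
def forecasts_error_form(y, alpha):
    """Equation (9.18): F_t = F_(t-1) + alpha * (y_(t-1) - F_(t-1))."""
    forecasts = [y[0]]   # assumption: the first forecast equals the first observation
    for actual in y:
        previous = forecasts[-1]
        forecasts.append(previous + alpha * (actual - previous))
    return forecasts


def forecasts_weighted_form(y, alpha):
    """Equation (9.19): F_t = alpha * y_(t-1) + (1 - alpha) * F_(t-1)."""
    forecasts = [y[0]]   # same starting assumption as above
    for actual in y:
        forecasts.append(alpha * actual + (1 - alpha) * forecasts[-1])
    return forecasts


series = [210, 195, 220, 205, 210]   # hypothetical series
alpha = 0.3

a = forecasts_error_form(series, alpha)
b = forecasts_weighted_form(series, alpha)
largest_gap = max(abs(p - q) for p, q in zip(a, b))
print(largest_gap < 1e-9)  # True
```

Each list holds one more value than the series, because the last entry is the forecast for the period after the final observation.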
Note The origins of equations (9.18) and (9.19) can be found in Brown’s single exponential smoothing method. However, Brown’s original formula states that:
$S'_t = \alpha y_t + (1 - \alpha) S'_{t-1}$ (9.20)
Note that Brown uses $y_t$ rather than $y_{t-1}$. Effectively, this means every current smoothed value in Brown’s formula is the future forecast value in our formula. If we use the original smoothing equation by Brown, then we have to remember that $F_t = S'_{t-1}$ (see equation (9.22)).
There is a connection between Brown’s formula and the moving averages concept. You will recall from the section on moving averages that we’ve said that: $M_t = M_{t-1} + \frac{y_t - y_{t-N}}{N}$. If $y_{t-N}$ was unknown and we used $M_{t-1}$ as its best estimate instead, then this can be rewritten as $M_t = \frac{1}{N} y_t + (1 - \frac{1}{N}) M_{t-1}$. If we say that $\alpha = \frac{1}{N}$, then $M_t$ is another expression for $S'_t$. We can see the similarities between the exponential smoothing concept and the moving averages.
Note We implied that the smaller the α (i.e. the closer α is to zero), the smoother and more horizontal the series of newly calculated values is. Conversely, the larger the α (i.e. the closer α is to one), the more impact the deviations have and potentially the more dynamic the fitted series is. When α = 1, the smoothed values are identical to the original values, i.e. no smoothing is taking place.
The smoothing constant (α) and the number of elements in the interval for calculating moving averages are, in fact, related. The equation that defines this relationship is given by equation (9.21), where M is the number of elements in the moving average interval.
$\alpha = \frac{2}{M + 1}$ (9.21)
Brown’s single exponential smoothing method: Brown’s single exponential smoothing method is a basis for the forecasting method called Simple Exponential Smoothing.
Smoothing constant: Smoothing constant is a parameter of the exponential smoothing model that provides the weight given to the most recent time series value in the calculation of the forecast value.
Note If we substituted into the formula for exponential smoothing all the previous values from the series we would see that, effectively, we are multiplying the newer observations by larger weights and the older data in the series by smaller weights: the observation k periods back is multiplied by α(1 − α)^k. By doing this we are, in effect, assigning a higher importance to the more recent observations. As we move further into the past, the weight falls exponentially. This is the reason why we call it exponential smoothing. In essence, every value in the series is affected by all those that precede it, but the relative weight (importance) of these preceding values declines exponentially the further we go into the past.
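The exponentially declining importance of older observations can be made visible by substituting equation (9.19) into itself repeatedly, which attaches the weight α(1 − α)^k to the observation k periods before the most recent one. The Python sketch below prints these weights for three values of α; the function names and the starting assumption (first forecast equal to the first observation) are ours, for illustration only.

```python
def smoothing_weights(alpha, count):
    """Weights alpha * (1 - alpha)**k for the observations k = 0, 1, ... periods back."""
    return [alpha * (1 - alpha) ** k for k in range(count)]


def forecast_by_weights(y, alpha):
    """Forecast for the period after the last observation, written as a weighted sum."""
    t = len(y)
    weights = smoothing_weights(alpha, t)
    weighted = sum(w * obs for w, obs in zip(weights, reversed(y)))
    # The remaining weight (1 - alpha)**t falls on the starting forecast,
    # assumed here to equal the first observation.
    return weighted + (1 - alpha) ** t * y[0]


for alpha in (0.1, 0.5, 0.9):
    print(alpha, [round(w, 4) for w in smoothing_weights(alpha, 5)])
```

With a small α the weights decline slowly, so many past observations matter; with a large α almost all of the weight sits on the most recent observation.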
$F_t = S'_{t-1}$ (9.22)
Example 9.18
As an example, we can use the same short time series we used to demonstrate how to use moving averages (see Example 9.15), to create forecasts using Brown’s exponential smoothing method, as illustrated in Figure 9.32. To start the smoothing process the data analyst must make a choice for the smoothing constant α and the initial estimate $S'_0$. The value of $S'_0$ is needed to determine the first smoothed statistic, $S'_1$.
$S'_1 = \alpha y_1 + (1 - \alpha) S'_0$
Figure 9.32
Applying simple exponential smoothing
➜ Excel solution
α = Cell C3 Value
Period Cells B6:B10 Values
Yi Cells C6:C10 Values
Si’ Cell D6 Formula: =C6
Cell D7 Formula: =$C$3*C7+(1−$C$3)*D6
Copy formula down D7:D10
Forecast Cell F7 Formula: =D6
Copy formula down F7:F11
Note Analysts employ alternative methods for calculating the starting value $S'_0$, given that simple exponential smoothing is concerned with tracking changes over time in the true average level of the data series. It can be shown that setting $S'_0$ to the average of the first six observations in your data set will provide a reasonable starting point for simple exponential smoothing.
As was the case with moving averages, in order to forecast one value in the future, we
need to shift the exponential smoothing calculations by one period ahead. The last expo-
nentially-smoothed value will, in effect, become a forecast for the following period.
Note Simple exponential smoothing, just like the moving averages method, is an acceptable forecasting technique, provided we are interested in forecasting only one future period.
Example 9.19
Apply the Data Analysis method to repeat Example 9.18.
Figure 9.33
Select Data > Select Data Analysis > Select Exponential Smoothing (Figure 9.34).
Figure 9.34
Figure 9.35
Figure 9.36
If we compare the formula method solution illustrated in Figure 9.32 and the Data
Analysis > Exponential Smoothing solution illustrated in Figure 9.36 we note the Data
Analysis > Exponential Smoothing method always ignores the first observation and pro-
duces exponential smoothing from the second observation. It also cuts short with the
exponential smoothing values, as the last exponentially-smoothed value corresponds
with the last observation in the series. You can easily extend the last cell one period in the
future to get a short-term forecast by inserting a time point 6 in cell B9 and then dragging
the Excel formula in cell D8 down to cell D9, as illustrated in Figure 9.37.
Figure 9.37
The second thing that becomes obvious is that you cannot change the values of α and
see automatically what effect this has on your forecasts. This means that you would be bet-
ter off producing your own set of formulae, as shown in Figure 9.32.
Example 9.20
Consider the data set in Table 9.9 and smooth the data using Brown’s exponential smoothing
method with smoothing factors 0.1 and 0.9.
Time point Series value Time point Series value Time point Series value
1 5.38 11 5.3 21 5.49
2 5.36 12 5.51 22 5.38
3 5.38 13 5.49 23 5.46
4 5.65 14 5.38 24 5.43
5 5.59 15 5.57 25 5.35
6 5.43 16 5.91 26 5.28
7 5.53 17 5.91 27 5.54
8 5.43 18 5.86 28 5.38
9 5.4 19 5.62 29 5.35
10 5.35 20 5.49 30 5.45
Table 9.9
Figure 9.38
➜ Excel solution
Period Cells A6:A35 Values
Series Cells B6:B35 Values
α = 0.1
α = Cell C3 Value
Smoothed data Cell C6 Formula: =B6
Cell C7 Formula: =$C$3*B7+(1−$C$3)*C6
Copy formula down C7:C35
α = 0.9
α = Cell F3 Value
Smoothed data Cell F6 Formula: =B6
Cell F7 Formula: =$F$3*B7+(1−$F$3)*F6
Copy formula down F7:F35
Figure 9.39 (line chart: data value against time point, x; original series with smoothed series for α = 0.1 and α = 0.9)
Figure 9.39 illustrates the original data and exponential smoothing forecasts with two
smoothing constants α = 0.1 and 0.9.
Figure 9.39 shows the impact two different values of α have on forecasts. As expected,
smaller α makes forecasts smoother and larger α makes them more dynamic, mimicking
more closely the original time series.
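The same comparison can be sketched in Python using the data of Table 9.9 and the worksheet logic of the Excel solution (first smoothed value equal to the first observation, then α times the current observation plus (1 − α) times the previous smoothed value). The function names and the 'total movement' measure are assumptions introduced here to put a number on how dynamic each smoothed series is.

```python
# Series values from Table 9.9, time points 1 to 30.
series = [5.38, 5.36, 5.38, 5.65, 5.59, 5.43, 5.53, 5.43, 5.40, 5.35,
          5.30, 5.51, 5.49, 5.38, 5.57, 5.91, 5.91, 5.86, 5.62, 5.49,
          5.49, 5.38, 5.46, 5.43, 5.35, 5.28, 5.54, 5.38, 5.35, 5.45]


def brown_smooth(y, alpha):
    """Smoothed series: first value = first observation, then
    alpha * current observation + (1 - alpha) * previous smoothed value."""
    smoothed = [y[0]]
    for obs in y[1:]:
        smoothed.append(alpha * obs + (1 - alpha) * smoothed[-1])
    return smoothed


def total_movement(values):
    """Sum of absolute period-to-period changes (an illustrative roughness measure)."""
    return sum(abs(b - a) for a, b in zip(values, values[1:]))


smooth_01 = brown_smooth(series, 0.1)
smooth_09 = brown_smooth(series, 0.9)

print("movement, original:   ", round(total_movement(series), 3))
print("movement, alpha = 0.1:", round(total_movement(smooth_01), 3))
print("movement, alpha = 0.9:", round(total_movement(smooth_09), 3))
```

The series smoothed with α = 0.1 moves far less from period to period than the one smoothed with α = 0.9, which stays close to the original observations.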
Example 9.21
In this example we will show how to use all the formulae listed in this chapter and we will
demonstrate how to use the Data Analysis > Exponential Smoothing to undertake the analysis.
Figures 9.40 and 9.41 illustrate the Excel solution with a smoothing constant of α = 0.1 using
equations (9.18), (9.19), (9.20), and (9.21), and the Excel Data Analysis > Exponential Smoothing
solution.
Figure 9.40
➜ Excel solution
α = Cell B4 Value
Period Cells A6:A36 Values
Series Cells B6:B35 Values
Forecast data Cell C6 Formula: =B6
Cell C7 Formula: =C6+$B$4*(B6−C6)
Copy formula down C7:C36
Forecast data Cell D6 Formula: =B6
Cell D7 Formula: =$B$4*B6+(1−$B$4)*D6
Copy formula down D7:D36
Figure 9.41
➜ Excel solution
Smoothed data Cell E6 Formula: =B6
Cell E7 Formula: =$B$4*B7+(1−$B$4)*E6
Copy formula down E7:E35
Forecast data Cell F6 Formula: =B6
Cell F7 Formula: =F6+$B$4*(B6−F6)
Copy formula down F7:F36
Forecast data with Data Analysis
Cell G6 N/A
Cell G7 Formula: =B6
Cell G8 Formula: =0.1*B7+0.9*G7
This formula is copied down by Excel between G8:G35
Cell G36 Copy formula down G35:G36
Figure 9.42
In Figure 9.42, the Input Range is B6:B35, Damping factor = 1 – smoothing constant = 1
– 0.1 = 0.9, Output Range G6. Clicking OK will then produce the solution presented in cells
G6:G35 in Figure 9.41.
Note WARNING: As already indicated, Excel uses the expression ‘Damping factor’,
rather than smoothing constant (α). The ‘Damping factor’ is defined as (1 – α). For example, a
smoothing constant α = 0.1 defines the damping factor as equal to 0.9 (damping factor = 1 –
α = 1 – 0.1 = 0.9).
The Data Analysis > Exponential Smoothing method produces solutions that start in
cell G6, but it returns #N/A value. The cell G7 is the copy of B6, followed by the formula
=0.1*B7+0.9*G7 in cell G8. The formula is essentially the same as the one in column D.
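The layout of the Data Analysis output described here (an empty first cell, a copy of the first observation, then the damped formula) can be imitated with the short Python sketch below. The function name and the four data values are hypothetical; only the pattern of the cells follows the description in the text.

```python
def data_analysis_style_output(y, damping):
    """Imitate the output column described in the text.

    First cell: empty (shown as #N/A in Excel). Second cell: the first
    observation. Each later cell: (1 - damping) * observation in the row
    above + damping * output cell in the row above.
    """
    alpha = 1 - damping               # smoothing constant
    output = [None, y[0]]
    for obs in y[1:-1]:
        output.append(alpha * obs + damping * output[-1])
    return output


series = [10, 20, 30, 40]             # hypothetical series
column = data_analysis_style_output(series, damping=0.9)
print(column)
```

Because every cell uses the observation from the row above, each value from the second cell onward is already the forecast for its own period, which is why the column has to be extended by one row to obtain the forecast for the first future period.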
Why can’t we use Brown’s exponential smoothing values directly for forecasting? As we said, every current smoothed value has to be treated as the future forecast. Pay attention to row 36. This is where we produced forecasts. You will notice that the only column in which we were not able to do this is column E. However, if these cells are shifted down by one cell, as in column F, then Brown’s exponential smoothing values become the forecasts, as illustrated in Figure 9.43.
Figure 9.43
The advantage of using built-in Data Analysis > Exponential Smoothing is that we get
a chart and standard errors included automatically with the results. We have not shown
them here, but we are encouraging readers to experiment with various output options.
$F_{t+m} = a_t + S_{t-s+m}$ (9.23)
$F_{t+m} = a_t S_{t-s+m}$ (9.24)
Note As a general guidance, multiplicative models are better suited for time series that
show dramatic growth or decline (non-stationary time series), while additive models are more
suited for less dynamic time series (stationary time series).
Just as with the linear trend, the coefficient $a_t$ is an intercept of the series, but, in this case, a dynamic one. $S_{t-s+m}$ represents the slope and is called a seasonal component. The meaning of the symbols s and m in the subscripts t + m and t − s + m is: s = number of periods in a seasonal cycle, and m = number of forecasting periods (forecasting horizon).
The main feature of this approach is that we can use exponential smoothing to estimate dynamic values of not just the seasonal component, but of the intercept too. For an additive model, these two factors are calculated as follows.
$a_t = \alpha (y_t - S_{t-s}) + (1 - \alpha) a_{t-1}$ (9.25)
$S_t = \delta (y_t - a_t) + (1 - \delta) S_{t-s}$ (9.26)
For a multiplicative model, the differences are replaced by ratios.
$a_t = \alpha \left( \frac{y_t}{S_{t-s}} \right) + (1 - \alpha) a_{t-1}$ (9.27)
$S_t = \delta \left( \frac{y_t}{a_t} \right) + (1 - \delta) S_{t-s}$ (9.28)
As we can see, unlike simple exponential smoothing, which required only one smoothing constant, here we are using two smoothing constants, alpha (α) and delta (δ). In both cases we need to initialize the values of $a_t$ and $S_t$. This is achieved, for additive models, by calculating $a_s$ from equation (9.29).
$a_s = \sum_{t=1}^{s} \frac{y_t}{s}$ (9.29)
where t = 1, 2, . . ., s and $a_t = a_s$. In other words, the first s values of $a_t$ are calculated as an average of all the corresponding actual observations. The initial values of $S_t$ are calculated from equation (9.30).
$S_t = y_t - a_t$ (9.30)
For the multiplicative model $a_s$ is calculated in the same way as in (9.29) and $S_t$ is calculated as:
$S_t = \frac{y_t}{a_t}$ (9.31)
For a multiplicative model we use the same principle, except that the components are not added, but multiplied.
Example 9.22
For the time series data presented in Table 9.10 calculate seasonal forecasts using the simple
seasonal additive exponential smoothing model.
Year
Quarter 1 2 3 4 5 6
1 17.15 16.80 13.85 16.99 21.12 17.35
2 19.87 15.87 19.67 18.96 21.03 17.57
3 20.53 17.13 20.29 24.84 24.55 18.19
4 20.78 18.11 20.94 23.11 25.90 21.66
Table 9.10
If we plot the time series as illustrated in Figure 9.44 we note that we have a definite pattern in the shape of the curve, which repeats over the same quarters of each year.
This suggests a seasonal model would be appropriate.
Figure 9.44 Time series plot (series value against time point, x)
➜ Excel solution
Period Cell B5:B32 Values
Quarter Cell C5:C32 Values
Series Cell D5:D28 Values
Alpha α Cell H4 Value =0.5
Delta δ Cell H5 Value =0.5
MSE Cell H6 Formula: =SUMXMY2(D9:D28,G9:G28)/COUNT(D9:D28)
at Cell E5 Formula: =(D5+D9+D13+D17+D21+D25)/6
Copy formula down E5:E8
Figure 9.45
Cells H4 and H5 contain the values of constants α and δ, while cell H6 contains mean
square error (MSE).
We assigned the initial values to both α and δ as 0.5 each.
Figure 9.46 illustrates the fit of the forecasted values onto the initial time series plot.
Figure 9.46 Initial forecast chart (series and seasonal forecast; series value against time point, x)
Mean square error: The mean value of all the differences between the actual and forecasted values in the time series. The differences between these values are squared to avoid positive and negative differences cancelling each other.
By manually entering the value 0.5 in cells H4 and H5 as the values of the constants α
and δ, we automatically get an MSE value of 2.036 in cell H6. Cell H6 contains the MSE
formula: =SUMXMY2(D9:D28,G9:G28)/COUNT(D9:D28).
This formula will be explained fully in the next section; we use it here just as a
method for estimating the values of α and δ. We used Excel’s Solver to find the
optimum values of α and δ. Let us explain how this is done. We manually enter any starting
value in cells H4 and H5, in our example 0.5 in each cell. After that, we put together all the
formulae and calculate the forecasts. Once this is done, we click on cell H6, where the MSE
formula resides. To access the Excel Solver menu select Data > Solver.
In Figure 9.47, we specify that we want cell H6 to take the minimum value by changing
cells H4 and H5, under the condition that neither H4 nor H5 is ever less than 0
or greater than 1.
Figure 9.47
This changes all the calculated cells automatically and produces the forecast as per
Figure 9.48. As we can see, the Solver has changed the values of alpha and delta, which in
turn had an effect on all our formulae and forecasts.
Figure 9.49 illustrates the forecast using the seasonal exponential smoothing method.
We observe that the forecast values are a good fit to the actual data values. As we can see,
this method, although fairly simple, produces impressive results.
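What Solver does here can be sketched outside Excel as a bounded search: try values of α and δ between 0 and 1 and keep the pair that gives the smallest MSE. The Python sketch below is illustrative only. The additive seasonal recursion in `seasonal_forecasts`, the function names, and the quarterly data are all assumptions made for this example, not the worksheet formulae behind Figure 9.45, and a coarse grid search stands in for Solver's optimiser.

```python
from itertools import product


def seasonal_forecasts(series, alpha, delta, period=4):
    """One-step-ahead forecasts from a simple additive seasonal smoother.

    Assumed recursion (for illustration only):
      forecast_t = level_{t-1} + season_{t-period}
      level_t    = alpha * (y_t - season_{t-period}) + (1 - alpha) * level_{t-1}
      season_t   = delta * (y_t - level_t) + (1 - delta) * season_{t-period}
    """
    first_cycle = series[:period]
    level = sum(first_cycle) / period
    seasons = [y - level for y in first_cycle]  # initial seasonal offsets
    forecasts = []
    for t in range(period, len(series)):
        old_season = seasons[t - period]
        forecasts.append(level + old_season)
        level = alpha * (series[t] - old_season) + (1 - alpha) * level
        seasons.append(delta * (series[t] - level) + (1 - delta) * old_season)
    return forecasts


def mse(actual, forecast):
    """Mean square error, the quantity held in the MSE cell."""
    return sum((a - f) ** 2 for a, f in zip(actual, forecast)) / len(actual)


def best_constants(series, period=4, steps=20):
    """Grid search over 0 <= alpha, delta <= 1 for the smallest MSE."""
    grid = [i / steps for i in range(steps + 1)]

    def score(pair):
        return mse(series[period:], seasonal_forecasts(series, pair[0], pair[1], period))

    return min(product(grid, grid), key=score)


# Hypothetical quarterly series, invented for this sketch.
quarterly = [12.0, 18.0, 22.0, 15.0, 13.0, 19.5, 23.0, 16.0, 14.5, 21.0, 25.0, 17.0]

alpha, delta = best_constants(quarterly)
start_mse = mse(quarterly[4:], seasonal_forecasts(quarterly, 0.5, 0.5))
best_mse = mse(quarterly[4:], seasonal_forecasts(quarterly, alpha, delta))
print(alpha, delta, start_mse, best_mse)
```

Because the grid includes the starting point α = δ = 0.5, the MSE found by the search can never be worse than the MSE at the manually entered values, which mirrors the improvement Solver produces in the worksheet.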
Figure 9.48
Figure 9.49 Series and seasonal forecast (series value against time point, x)
Student exercises
X9.14 For what kind of time series would you use a multiplicative versus additive seasonal
exponential smoothing model?
X9.15 Why are seasonal parameters at and St in the seasonal exponential smoothing method
called dynamic parameters?
X9.16 What is the role of mean square error (MSE) in the seasonal exponential smoothing
method?
We can never eliminate uncertainty, but good forecasts can reduce it to an acceptable
level. What would we consider to be a good forecast? An intuitive answer is that it has to be
the one that shows the smallest error when compared with the actual event. The problem
with this statement is that we cannot measure the error until the event has happened, by
which time it is too late to say whether our forecast was good. In a way, we would
like to measure the error before the future unfolds. How do we do this? As we demonstrated
in this chapter, when forecasting, we always used the model to back-fit the existing
time series. This is sometimes called back-casting or, more appropriately, ex-post forecasting.
Once we have produced ex-post forecasts, it is easy to measure deviations from
the actual data. These deviations are forecasting errors, and they tell us how good our
method or model is.
❉ Interpretation The main assumption we make here is: whichever model shows the
smallest errors in the past, it will probably make the smallest errors when extrapolated in the
future. In other words, the model with smallest historical errors will reduce the uncertainty
that the future brings. This is the key assumption.
Calculating errors is one of the easiest tasks. We can define an error as a difference
between what actually happened and what we thought would happen. In the context of
forecasting time series and models, error is the difference between the actual data and the
data produced by a model, or ex-post forecasts. This can be expressed as a formula:
$$e_t = A_t - F_t, \quad \text{or} \quad e_t = y_t - F_t \qquad (9.32)$$
where $e_t$ is the error for period $t$, $A_t$ (or $y_t$) is the actual value in period $t$, and $F_t$ is the forecasted
value for the same period $t$.
Example 9.23
Figure 9.50 shows an example of how to calculate forecasting errors.
In Example 9.23, using some simple method, we produced back-forecasts that deviate
clearly from the actual historical values. Figure 9.50 shows the results.
Figure 9.50
➜ Excel solution
Period Cells B4:B8 Values
Actual Cells C4:C8 Values
Forecast Cells D4:D8 Values
Error Cell E4 Formula: =C4−D4
Copy formula down E4:E8
Forecasting errors: A difference between the actual and the forecasted value in the time series.
Figure 9.51 Actual vs forecast (value against period)
For period 1 (t = 1) our method overshot the actual value, which is presented as –30
because errors are calculated as actual minus forecasted. For period t = 2, our method
undershot by 100. For period 6 (t = 6), for example, our method was perfect and it had
not generated any errors. What can we conclude from this? If these were the first 5 weeks
of our new business venture, and if we add all these numbers together, then our cumulative
forecast for these 5 weeks would have been 1060. In reality the business generated
1040. This implies that the method we used made a cumulative error of –20, that is, it
overestimated reality by 20 units. If we divide this cumulative value by the number of weeks to
which it applies, i.e. 5, we get the average value of our error:
$$\bar{e} = \frac{\sum (A_t - F_t)}{n} = \frac{-20}{5} = -4$$
❉ Interpretation The average error that our method generates per period is −4
and, because errors are defined as differences between the actual and forecast values, this
means that on average the actual values are 4 units lower than our forecast. Given the earlier
assumption that the method will probably continue to perform in the future as in the past
(assuming there are no dramatic or step changes), our method will probably generate similar
errors in the future.
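The arithmetic behind the mean error can be sketched in a few lines of Python. The five actual and forecast values below are invented for illustration (the worksheet values of Figure 9.50 are not reproduced here); they were chosen only so that the totals match the 1040 and 1060 quoted in the text.

```python
# Invented values: actual sums to 1040, forecast sums to 1060.
actual = [170, 300, 200, 180, 190]
forecast = [200, 200, 200, 230, 230]

# Error for each period: actual minus forecast, as in equation (9.32).
errors = [a - f for a, f in zip(actual, forecast)]

# Mean error: the cumulative error divided by the number of periods.
mean_error = sum(errors) / len(errors)

print(errors)      # [-30, 100, 0, -50, -40]
print(mean_error)  # -4.0
```

A negative mean error means that, on average, the forecasts sit above the actual values.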
Assuming that we decided to experiment with some other method, and assuming that
the average error that this other method generated was 2, which method would you rather
use to forecast your business venture? The answer, hopefully, is very straightforward. The
second method is somewhat pessimistic (the actual values are 2 units per period above
the forecasted values), but in absolute terms 2 is less than 4. Therefore, we would recommend
the second method as a better model for forecasting this particular business
venture. In this example, we have not only decided which forecasting method reduces
uncertainty more, but we have also learned two different ways of measuring
this uncertainty. Using errors as measures of uncertainty, we learned how to calculate an
average, or mean error, and we implied that an absolute average error is also worth
estimating. In practice, other error measurements are also used.
Figure 9.52
Calculating various errors
➜ Excel solution
Period Cells B4:B8 Values
Actual Cells C4:C8 Values
Forecast Cells D4:D8 Values
Error Cell E4 Formula: =C4−D4
Copy down E4:E8
MAD Cell F4 Formula: =ABS(E4)
Copy down F4:F8
MSE Cell G4 Formula: =E4^2
Copy down G4:G8
MPE Cell H4 Formula: =E4/C4
Copy down H4:H8
MPE % Cell I4 Formula: =H4*100
Copy down I4:I8
MAPE Cell J4 Formula: =F4/C4
Copy down J4:J8
Sum Cell C9 Formula: =SUM(C4:C8)
Copy formula across C9:J9
Average Cell C10 Formula: =AVERAGE(C4:C8)
Copy formula across C10:J10
Column I in Figure 9.52 is identical to column H. The only difference is that we used
Excel percentage formatting to present the numbers as percentages, rather than decimal
values. Rather than calculating individual errors (as in columns E–J) and adding all the
individual error values (as in row 9) or calculating the average (as in row 10), we could
have calculated each type of error with a single formula.
Example 9.25
Using some of the built-in Excel functions, these errors can be calculated as illustrated in
Figures 9.53 and 9.54.
Figure 9.53
Figure 9.54
Single cell formulae for calculating the mean error (ME), mean absolute deviation (MAD), mean square
error (MSE), mean percentage error (MPE), and mean absolute percentage error (MAPE)
➜ Excel solution—alternative
ME Cell D12 Formula: =(SUM(B2:B6)−SUM(C2:C6))/COUNT(B2:B6)
MAD Cell D13 Formula: {=SUM(ABS(D2:D6))/COUNT(D2:D6)}
MSE Cell D14 Formula: =SUMXMY2(B2:B6,C2:C6)/COUNT(B2:B6)
MPE Cell D15 Formula: {=SUM(((B2:B6)−(C2:C6))/(B2:B6))/COUNT(B2:B6)}
MAPE Cell D16 Formula: {=SUM(ABS((B2:B6)−(C2:C6))/(B2:B6))/COUNT(B2:B6)}
Note that the MAD, MPE, and MAPE formulae have curly brackets on both sides.
Do not enter these brackets manually. Excel enters the brackets automatically
if, after typing the formula, you press CTRL + SHIFT + ENTER (i.e. all three at the
same time) rather than just Enter. This means that the range is
treated as an array. Just for the sake of clarity, Figures 9.52 and 9.53 reproduce the spreadsheet
as it should look if the single cell formulae for the error calculations were used.
Again, note that the curly brackets for MAD, MPE, and MAPE are not visible when looking
at cells D13, D15, and D16 themselves. However, they are visible in the formula bar, as
illustrated in Figure 9.54.
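For readers who want to check the single cell formulae outside Excel, the same five measures can be sketched as one Python function. The function name and the sample data are assumptions made for this illustration; the data are invented and are not the values shown in Figures 9.52 to 9.54. The percentage measures divide by the actual values, so the sketch assumes that no actual value is zero.

```python
def error_measures(actual, forecast):
    """Return ME, MAD, MSE, MPE and MAPE for two equal-length lists.

    Errors are defined as actual minus forecast. MPE and MAPE are
    returned as fractions (multiply by 100 for percentages).
    """
    n = len(actual)
    errors = [a - f for a, f in zip(actual, forecast)]
    return {
        "ME": sum(errors) / n,
        "MAD": sum(abs(e) for e in errors) / n,
        "MSE": sum(e ** 2 for e in errors) / n,
        "MPE": sum(e / a for e, a in zip(errors, actual)) / n,
        "MAPE": sum(abs(e) / a for e, a in zip(errors, actual)) / n,
    }


# Invented data for illustration only.
actual = [170, 300, 200, 180, 190]
forecast = [200, 200, 200, 230, 230]

measures = error_measures(actual, forecast)
for name, value in measures.items():
    print(name, round(value, 4))
```

Each dictionary entry corresponds to one of the single cell formulae: the list comprehensions play the role that the array entry (CTRL + SHIFT + ENTER) plays in the worksheet.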
the actual values, never providing exact forecasts, yet the ME could be zero. To eliminate
the problem with ME, we can calculate MAD. MAD indicates that if we ignore the direction of the over-
and underestimates of our forecasts, a typical bias that our method shows (regardless of
whether it is positive or negative) is 44 units per period. This is the typical error, regardless of
the direction in which our forecasts went when estimating the actual values.
The meaning of the MSE is more difficult to interpret, for the simple reason that we have
taken the square values of our errors. What is a square value of something? The rationale
is as follows: if there are some big deviations of our forecast from the actual values,
then in order to magnify these deviations we need to square them. Let’s take an example
of two hypothetical errors for a period. Let one error reading show 2 and the other one 10.
The second error is five times larger than the first one. However, when we square these
two numbers, number 100 (10 × 10) is 25 times larger than number 4 (2 × 2). This is what
we mean by magnifying large errors. So, the higher the MSE, the more extreme deviations
from the actual values are contained in our forecast. This is particularly useful when comparing
two forecasts. If the MSE obtained from the first forecast is larger than the MSE from
the second, then the first forecast contains more extreme deviations than the second one.
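The magnifying effect of squaring can be seen by comparing two sets of errors that have the same MAD but different MSE. The two error lists below are hypothetical and were chosen only to make the contrast visible.

```python
# Hypothetical forecasting errors for two methods over four periods.
steady = [6, -6, 6, -6]    # every miss is moderate
spiky = [2, -2, 10, -10]   # two small misses and two large ones


def mad(errors):
    """Mean absolute deviation of a list of errors."""
    return sum(abs(e) for e in errors) / len(errors)


def mse(errors):
    """Mean square error of a list of errors."""
    return sum(e ** 2 for e in errors) / len(errors)


print(mad(steady), mad(spiky))  # 6.0 6.0
print(mse(steady), mse(spiky))  # 36.0 52.0
```

Both methods miss by 6 units per period on average, yet the second has the larger MSE because its two large errors dominate once they are squared.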
The interpretation of the MPE is very intuitive. It tells us that, on average, an error constitutes
x% of the actual value or, as in our case, MPE = –8.73%. This means that on average
our forecasts overshot the actual values by 8.73% (remember that a negative error
means forecasts overshooting the actual values and a positive error means undershooting).
However, this implies that, just like with the ME, we could have a series of overshoots and
undershoots (as in our example), yet obtain an average value of almost zero. The mean
absolute percentage error (MAPE) addresses this problem. It shows us the value of 0.2473.
In other words, if we disregard positive and negative variations of our forecasts from the
actual values, we are, on average, making an absolute error of 24.73%.
Figure 9.55 Error/residual plot (value, y, against time point, x)
Note One of the methods of verifying whether residuals are correlated is the
autocorrelation plot. Autocorrelations are the coefficients that we calculate and they form an
autocorrelation function. They can be used for various purposes, but here we are referring to
the series of autocorrelations of residuals. Essentially, we lag the residuals by one time period,
then by another and another, etc., and then measure correlations between all these lagged
series of residuals. This is called a residual autocorrelation function.
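The lagging procedure described in the note can be sketched as follows. This is a minimal illustration, assuming the usual sample autocorrelation coefficient (lagged cross-products divided by the total sum of squares); the function name and the residual values are invented for the example.

```python
def autocorrelation(residuals, lag):
    """Sample autocorrelation of a residual series at the given lag."""
    n = len(residuals)
    mean = sum(residuals) / n
    total = sum((r - mean) ** 2 for r in residuals)
    lagged = sum(
        (residuals[t] - mean) * (residuals[t - lag] - mean)
        for t in range(lag, n)
    )
    return lagged / total


# Hypothetical residuals that alternate in sign, i.e. clearly not random.
residuals = [1, -1, 1, -1, 1, -1]

# The residual autocorrelation function: one coefficient per lag.
acf = [autocorrelation(residuals, lag) for lag in range(1, 4)]
print(acf)
```

Coefficients close to zero at every lag suggest that the residuals carry no remaining pattern; large positive or negative values, as in this alternating example, suggest that the forecasting model has missed some regularity.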
Student exercises
X9.17 Could you explain the difference between accuracy and precision? What are the
consequences if your forecasts are accurate, but not precise? Could you have precise
forecasts that are not accurate?
X9.18 Why is the MAD type of error measurement preferred over the ME type of error?
X9.19 Two forecasts were produced, as shown in Table 9.11. The ME for the second forecast
is some seven times larger than the ME for the first forecast. However, the MSE is some
20 times larger. Can you explain?
X    Y     ŷ1    ŷ2    e1    e2
1    230   230   230    0     0
2    300   305   305   −5    −5
3    290   295   295   −5    −5
4    320   320   320    0     0
5    350   345   345    5     5
6    400   402   350   −2    50
7    350   355   355   −5    −5
8    400   395   395    5     5
Table 9.11
X9.20 Is it acceptable to see some regularity in pattern when examining the series of residuals
or forecasting errors?
X9.21 The closer the actual observations, when compared with forecasted values on a scatter
diagram, are to the diagonal line, the better the forecasts. Is this correct?
$$\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}} \qquad (9.38)$$
We also said that when dealing with a normal distribution, we expect 68.3% of all the
values to be within $\bar{x} \pm 1\sigma$, 95.4% of all the values to be within $\bar{x} \pm 2\sigma$, and 99.7% to be
within $\bar{x} \pm 3\sigma$. We also said that to change any distribution into a standard distribution,
standardized z units need to be calculated using equation (9.39).
$$z = \frac{x - \mu}{\sigma} \qquad (9.39)$$
Z-values are used for estimating the confidence interval (CI) of the estimate of the
mean using equation (9.40).
$$CI = \bar{X} \pm z \, SE \qquad (9.40)$$
where SE is the standard error. Depending on the value of z, we get different CIs:
(a) z = 1.64 for 90% CI, (b) z = 1.96 for 95% CI, and (c) z = 2.58 for 99% CI. It is important to
also remind ourselves that most of the time we cannot calculate the SE for the simple reason
that we do not know the population standard deviation (σ). In this case, the sample
standard deviation is calculated using equation (9.41).
$$s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}} \qquad (9.41)$$
Confidence interval: A confidence interval gives an estimated range of values which is likely to include an unknown population parameter.
Population standard deviation: The population standard deviation is the standard deviation of all possible values.
Sample standard deviation: A sample standard deviation is an estimate, based on a sample, of a population standard deviation.
Now that we have the standard deviation of the sample (the data set, or the time series), we
can modify the equation for the standard error (SE), as given by equation (9.42).
$$SE = \frac{s}{\sqrt{n}} \qquad (9.42)$$
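Equations (9.40) to (9.42) can be combined into a short Python sketch. The function name and the sample values are assumptions made for illustration. The z-value is taken from the standard normal distribution in Python's standard library rather than from a table, so it is slightly more precise than the rounded 1.64, 1.96, and 2.58 quoted above.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev


def z_confidence_interval(sample, confidence=0.95):
    """Confidence interval for the mean: mean ± z * SE, with SE = s / sqrt(n)."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)  # two-tailed z-value
    se = stdev(sample) / sqrt(len(sample))          # stdev divides by n - 1
    centre = mean(sample)
    return centre - z * se, centre + z * se


# Hypothetical sample, invented for this sketch.
sample = [23, 25, 21, 27, 24, 26, 22, 25, 24, 23]

low, high = z_confidence_interval(sample, 0.95)
print(round(low, 3), round(high, 3))
```

Raising the confidence argument widens the interval, because a larger z-value multiplies the same standard error.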
How do we estimate the confidence interval of the sample mean? First of all, if the time
series is relatively short and represents just a small sample of the true population data values,
then the t distribution is used for the computation of the confidence interval, rather
than the z-value.
$$CI = \bar{X} \pm t_{value} \, SE \qquad (9.43)$$
The only difference between equations (9.43) and (9.40) is that the t-value in equation
(9.43) will be determined not just by the level of significance (as was the case with the
z-values), but also by the number of degrees of freedom.
Note A general rule is that for larger samples, the z-values and the t-value produce
similar results, so it is discretionary which one to use. A large sample in time series analysis is a
series with more than 100 observations.
$$SE_{y,\hat{y}} = \sqrt{\frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{n - 2}} \qquad (9.44)$$
Here, $y_i$ are the actual observations and $\hat{y}_i$ are the predicted values. The Excel version of
this formula is =SQRT(SUMXMY2(array_x, array_y)/(n−2)); note the brackets around n−2,
without which Excel would divide by n first and then subtract 2. Actually, Excel offers an even
more elegant function as a substitute for this formula. The function is called =STEYX(known_y’s,
known_x’s). This function returns the standard error of forecast. If you look
into Excel’s Help file, you will see that this function is a very elegant representation of a
monstrous-looking equation given by (9.45).
In Excel, the two equivalent forms are =SQRT(SUMXMY2(array_x, array_y)/(n−2))
and =STEYX(known_y’s, known_x’s).
They both return the standard error for the predicted values.
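The equivalence of the two forms can be checked numerically. The Python sketch below uses the ten observations of Table 9.12, fits a straight line by least squares (assumed here to be what the TREND function does for a single x variable), and computes the standard error both directly from equation (9.44) and from the closed form built on sums of squares and cross-products. The variable names are chosen for this illustration only.

```python
from math import sqrt

# Data from Table 9.12.
x = list(range(1, 11))
y = [2, 1, 1, 4, 13, 3, 8, 6, 9, 10]

n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n
sxx = sum((xi - x_bar) ** 2 for xi in x)
syy = sum((yi - y_bar) ** 2 for yi in y)
sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

# Least-squares line and its fitted (ex-post) values.
slope = sxy / sxx
intercept = y_bar - slope * x_bar
fitted = [intercept + slope * xi for xi in x]

# Direct form, equation (9.44): squared residuals divided by n - 2.
se_direct = sqrt(sum((yi - fi) ** 2 for yi, fi in zip(y, fitted)) / (n - 2))

# Closed form using only sums of squares and cross-products.
se_closed = sqrt((syy - sxy ** 2 / sxx) / (n - 2))

print(round(se_direct, 4), round(se_closed, 4))
```

The two results agree to within rounding error, which is why either worksheet formula can be used.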
If the standard error $SE_{y,\hat{y}}$ is a measure of the amount of error in the prediction of y for an
individual x, this means we can modify equation (9.43) into equation (9.46).
$$CI = \hat{Y} \pm t_{value} \, SE_{y,\hat{y}} \qquad (9.46)$$
$\hat{Y}$ are the predicted values, $SE_{y,\hat{y}}$ is the standard error of prediction and $t_{value}$ is the
t-value from the Student’s t critical table. We can recap that, depending on the
desired confidence interval (CI), the values for z and t are as follows: (a) CI = 90% for
z = 1.64 and t = 1.86, (b) CI = 95% for z = 1.96 and t = 2.31, and (c) CI = 99% for z = 2.58
and t = 3.36. The values of t are not fixed as they depend on the number of degrees of freedom
and the size of the sample.
Note The t-values are not universal for these given levels of confidence. The calculation
of the t-values depends on the number of degrees of freedom. The above t-values are only
valid for 8 degrees of freedom, which is the length of our time series minus 2.
Example 9.26
Consider fitting a confidence interval to the data set represented in Table 9.12 and use the Excel
trend function to provide forecasts for the next five time periods.
We have a very short time series with only ten observations; we will use the Excel TREND
function to produce forecasts and fit a confidence interval to the forecasts.
Figure 9.56 illustrates the technique using a very short time series.
Figure 9.56
X Y
1 2
2 1
3 1
4 4
5 13
6 3
7 8
8 6
9 9
10 10
Table 9.12
➜ Excel solution
X Cell B4:B18 Values
Y Cell C4:C13 Values
Trend Cell D4 Formula: =TREND($C$4:$C$13, $B$4:$B$13, B4)
Copy formula down D4:D18
SE = Cell H3 Formula: =STEYX(C4:C13, B4:B13)
Alpha = Cell H4 Value
df = Cell H5 Formula: =COUNT(C4:C13)−2
t-value = Cell H6 Formula: =T.INV.2T(H4, H5)
– Interval E4 Formula: =D4−$H$3*$H$6
Copy formula down E4:E18
+ Interval F4 Formula: =D4+$H$3*$H$6
Copy formula down F4:F18
This trend function was extrapolated five periods in the future. Figure 9.57 illustrates
the graph for the prediction and the corresponding confidence interval.
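The worksheet logic of Example 9.26 can be sketched in Python as follows. The sketch assumes a least-squares straight line in place of the TREND function and a significance level of 0.05 for the Alpha cell (the worksheet value is not stated in the text); the t-value of 2.306 is the two-tailed table value for 8 degrees of freedom. All names are chosen for this illustration.

```python
from math import sqrt

# Data from Table 9.12.
x = list(range(1, 11))
y = [2, 1, 1, 4, 13, 3, 8, 6, 9, 10]

n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n
sxx = sum((xi - x_bar) ** 2 for xi in x)
slope = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
intercept = y_bar - slope * x_bar


def trend(x0):
    """Value of the fitted straight line at time point x0."""
    return intercept + slope * x0


# Standard error of the estimate, as in equation (9.44).
se = sqrt(sum((yi - trend(xi)) ** 2 for xi, yi in zip(x, y)) / (n - 2))

# Assumed: alpha = 0.05, two-tailed, n - 2 = 8 degrees of freedom (t table).
t_value = 2.306

# Periods 1 to 15: ten fitted values plus five forecasts, with the interval.
rows = [
    (x0, trend(x0), trend(x0) - t_value * se, trend(x0) + t_value * se)
    for x0 in range(1, 16)
]
for x0, centre, lower, upper in rows:
    print(x0, round(centre, 2), round(lower, 2), round(upper, 2))
```

Every row has the same distance between the lower and upper limits, because the same standard error and t-value are applied at each time point.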
The calculations, as well as the graph, indicate that we are on the right track as far as
the confidence interval is concerned, except that it does not comply with one intuitive
assumption. It is intuitive to think that the width of the confidence interval is not constant and that
it should change with time. In other words, the further we go into the future, the wider the
interval should be as the uncertainty increases. As we can see from the above example,
the width of the interval here is a constant value. In order to make the interval width
change with time, in addition to equation (9.46), we will need to replace equation (9.44)
with equation (9.47).
$$SE_{y,x} = SE_{y,\hat{y}} \sqrt{1 + \frac{1}{n} + \frac{(x_i - \bar{x})^2}{\sum (x_i - \bar{x})^2}} \qquad (9.47)$$
Figure 9.57 Prediction and confidence interval (series value, y, against time point, x)
This is exactly the same equation as equation (8.36) from Chapter 8. The only exception
is that in Chapter 8 we included the tcri-value in the equation, while here it is included in
the procedure.
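The effect of equation (9.47) can be sketched in Python with the same Table 9.12 data. As before, a least-squares line stands in for the TREND function, and the function and variable names are assumptions made for this illustration.

```python
from math import sqrt

# Data from Table 9.12.
x = list(range(1, 11))
y = [2, 1, 1, 4, 13, 3, 8, 6, 9, 10]

n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n
sxx = sum((xi - x_bar) ** 2 for xi in x)
slope = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
intercept = y_bar - slope * x_bar

# Standard error of the estimate, equation (9.44).
se = sqrt(
    sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y)) / (n - 2)
)


def se_at(x0):
    """Equation (9.47): the standard error grows as x0 moves away from the mean of x."""
    return se * sqrt(1 + 1 / n + (x0 - x_bar) ** 2 / sxx)


for x0 in range(1, 16):
    print(x0, round(se_at(x0), 3))
```

The standard error is smallest at the mean of x and increases on both sides of it, so the interval widens as the forecasts move further into the future.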
Example 9.27
We’ll use exactly the same example to demonstrate the effects of this additional formula.
Figure 9.58 illustrates the Excel solution to calculate the interval estimate.
Figure 9.58
The only difference to Figure 9.56 is that we had to introduce two additional columns,
one for SEy,x (column F) and the other one for the mean value (column E). This column, as
we will see, will help us with the implementation of equation (9.47) into the Excel solution.
➜ Excel solution
X Cell B4:B18 Values
Y Cell C4:C13 Values
Trend Cell D4 Formula: =TREND($C$4:$C$13,$B$4:$B$13,B4)
Copy formula down D4:D18
Mean X Cell E4 Formula: =AVERAGE($B$4:$B$13)
Copy formula down E4:E13
[Chart: series value, y, against time point, x]
Student exercises
X9.22 How would you describe the concept of confidence interval in the context of precision
in forecasting?
X9.23 Would you use the z-values or t-values to calculate the confidence interval for a time
series that has 200 observations?
X9.24 Why is it logical to expect that the confidence interval should get wider and wider the
further we go in the future with our forecasts?
■ Techniques in practice
TP1 Coco S. A. is considering diversifying and entering the housing market in the USA. They
are only interested in a short-term investment. To help them with the decision and assess the
market, their US analyst extracted a time series that covers US Months’ Supply of Houses For
Sale at Current Sales Rate. The time series is not adjusted for seasonality and the data set reflects
true market movements. Table 9.13 covers data from January 2004 to June 2008.
Table 9.13
Analyse the data and produce a forecast. Pay specific attention to:
TP2 Baker Ltd is concerned about the influence of petrol prices on its profit margins. The
owner of the company looked at weekly petrol prices (pence per gallon) for London and com-
piled a time series. The series starts on 14 November 2005 and goes until 4 August 2008. The
data in pence per gallon are shown in Table 9.14.
250.1 238.6 291.5 269.1 249.7 296.2 296.1 330.9 334.2 423.8
241.6 244.3 287.9 261.5 245 302.6 293.2 331 335.7 423.5
235.8 245.6 283.7 255.9 240.2 304.4 290.7 328.9 341.1 424.7
232.2 261.6 290.5 251.7 237.6 307.1 288.6 328.5 338.5 424.8
234.3 261.7 295.8 247.4 236.2 313.5 287.3 327.8 340.1 424.2
236.8 273.7 299.3 243 236.2 315.1 287.5 327.6 342.9 421.9
238.1 279.7 302 239.9 239.6 313.1 288.9 329.4 349.8 412.7
244.2 292 305.2 239.1 246.6 313.5 293.8 334 362.6 405.1
254 306.4 307.6 238.5 268 310.4 293.2 332.4 375.3
254.8 304.9 309.1 242.8 271.3 307.6 292.7 329.1 376.9
257.7 303.6 305.3 248.9 273.4 307.5 292.5 328.7 385.9
255.6 301.6 300 247.6 275.9 306.6 298.6 326 393.1
253.1 298.9 294.4 252.2 281.4 307 304 326.1 408.3
248.8 296.4 287.5 253.9 287.3 303.6 319.6 327.4 412.1
242.4 292.5 279.3 253.5 295.7 300.4 328.6 333.2 418.3
Table 9.14
The owner is not too familiar with forecasting, but knows how to use the trend function. Put
yourself in his shoes and do the following:
TP3 Skodel Ltd is considering investing in technology stocks. As a test case, it looked at
Microsoft’s adjusted monthly closing stock values between 1 March 2001 and 1 August
2008. The time series is given in Table 9.15.
Table 9.15
Use the exponential smoothing method and experiment with various levels of the smooth-
ing constant. See what impact it has on your forecasts and how it changes the forecasting
errors. Make a recommendation as to what approach to forecasting you would use and why.
■ Summary
In this chapter we focused on univariate time series analysis as a primary tool for extrapolating
time series and forecasting. We described what the prerequisites are before we start selecting
a forecasting method; namely, ensuring that all the observations were recorded in the same
units of time, that no observation was missing, that we do not have unexpected outliers, and
that we produce the time series graph before proceeding. We explained the concept of indices
and how to convert them from one base to another. This was linked with aggregate indices and
we introduced Consumer Price Index (CPI) as a major method of deflating the value related
time series. We also showed how to convert the values into constant dollars.
Various trend models were introduced, as well as how to fit them to time series, produc-
ing the ex-post forecasts. Other alternative methods to trend-fitting and extrapolation were
introduced, such as the moving average method and exponential smoothing. The relevance
of the smoothing constant α was explained. This was followed with the introduction of how
to apply exponential smoothing to seasonal time series. Once we mastered various forecasting
methods and techniques, we focused on forecasting errors and how to measure them. The
relevance of various error indicators (ME, MSE, MAD, etc.) was introduced, as well as how to
interpret them to select the best forecast.
The final element introduced was the confidence interval (CI), which brings together extrap-
olation and error measurement. We explained how to apply confidence measurement to our
forecasts and what the limitations are.
■ Key terms
Additive model
Aggregate price indices
Base index period
Brown’s single exponential smoothing method
Classical time series analysis
Classical time series decomposition
Confidence interval
Cyclical variations (C)
Error measurements
Exponential smoothing
Exponential trend
Forecasting
Forecasting errors
Forecasting horizon
Index numbers
Irregular variations (I)
Linear trend
Logarithmic trend
Mean absolute deviation (MAD)
Mean absolute percentage error (MAPE)
Mean error (ME)
Mean percentage error (MPE)
Mean square error (MSE)
Mixed model
Moving average trend
Moving averages
Multiplicative model
Multivariate methods
Non-seasonal
Non-stationary
Polynomial line
Polynomial trend
Population standard deviation
Power trend
Residuals (R)
Sample standard deviation
Seasonal
Seasonal component
Seasonal time series
Seasonal variations (S)
Simple exponential smoothing
Simple index
Smoothing constant
Standard error of forecast
Stationary time series
Time period
Time series
Trend (T)
Trend component
Types of trends
Univariate methods
■ Further reading
Textbook resources
1. Brown, R. G. (2004) Smoothing, Forecasting and Prediction. Mineola, NY: Dover Publications.
2. Chatfield, C. (2004) The Analysis of Time Series: An Introduction. Boca Raton, FL; London: Chapman & Hall/CRC.
3. Hanke, J. E. and D. W. Wichern (2005) Business Forecasting. Upper Saddle River: Pearson/Prentice Hall.
4. Newbold, P. and T. Bos (1994) Introductory Business & Economic Forecasting. Cincinnati: South-Western Pub.
5. Evans, M. K. (2003) Practical Business Forecasting. Malden, MA; Oxford: Blackwell Publishing.
Web resources
1. Engineering Statistics Handbook http://www.itl.nist.gov/div898/handbook/pmc/section4/pmc4.htm (accessed 25 May 2012).
2. Economagic, which contains international economic data sets http://www.economagic.com (accessed 25 May 2012).
3. Wikipedia articles on time series http://en.wikipedia.org/wiki/Time_series (accessed 25 May 2012).
4. Statsoft Electronic Textbook http://www.statsoft.com/textbook/sttimser.html (accessed 25 May 2012).
5. A private collection by Rob Hyndman http://www.robjhyndman.com/TSDL/ (accessed 25 May 2012).
Glossary
The ISI glossary of statistical terms provides definitions in a number of different languages:
http://isi.cbs.nl/glossary/index.htm
Addition law for mutually exclusive events: Addition law for mutually exclusive events is a result used to determine the probability that event A or event B occurs, but both events cannot occur at the same time.
Additive model: The additive model time series model is a model whereby the separate components of the time series are added together to identify the actual time series value.
Adjusted r²: Adjusted R squared measures the proportion of the variation in the dependent variable accounted for by the explanatory variables and adjusted for the number of degrees of freedom.
Aggregate price index: A measure of the value of money based on a collection (a basket) of items and compared to the same collection of items at some base date or a period of time.
Alpha, α: Alpha refers to the probability that the true population parameter lies outside the confidence interval. Not to be confused with the symbol alpha in a time series context i.e. exponential smoothing, where alpha is the smoothing constant.
Alternative hypothesis (H1): The alternative hypothesis, H1, is a statement of what a statistical hypothesis test is set up to establish.
Arithmetic mean: The sum of a list of numbers divided by the number of numbers.
Assumptions: An assumption is a proposition that is taken for granted.
Autocorrelation: Autocorrelation is the correlation between members of a time series of observations and the same values shifted at a fixed time interval.
Bar chart: A bar chart is a way of summarizing a set of categorical data.
Base index period: A value of a variable relative to its previous value at some fixed base.
Beta, β: Beta refers to the probability that a false population parameter lies inside the confidence interval.
Binomial distribution: A Binomial distribution can be used to model a range of discrete random data variables.
Binomial experiment: A binomial experiment is an experiment with a fixed number of independent trials. Each trial has exactly two outcomes and the probability of each outcome in a binomial experiment remains the same for each trial.
Box plot: A box plot is a way of summarizing a set of data measured on an interval scale.
Box-and-whisker plot: A box-and-whisker plot is a way of summarizing a set of data measured on an interval scale.
Brown’s single exponential smoothing method: Brown’s single exponential smoothing method is the basis for a forecasting method called Simple Exponential Smoothing.
Categorical variable: A set of data is said to be categorical if the values or observations belonging to it can be sorted according to category.
Central Limit Theorem: The Central Limit Theorem states that whenever a random sample is taken from any distribution (μ, σ²), then the sample mean will be approximately normally distributed with mean μ and variance σ²/n.
Central tendency: Measures the location of the middle or the centre of a distribution.
Chance: Chance is the unknown and unpredictable element in happenings that seems to have no assignable cause.
Chi square distribution: The chi square distribution is a mathematical distribution that is used directly or indirectly in many tests of significance.
Chi square test: Apply the chi square distribution to test for homogeneity, independence, or goodness-of-fit.
Glossary 469
Chi square test for goodness-of-fit The Contingency table A contingency table
chi-square goodness-of-fit test of a statisti- is a table of frequencies classified according
cal model describes how well the statistical to the values of the variables in question.
model fits a set of observations. Continuous probability distribution
Chi square test of association The chi- If a random variable is a continuous vari-
square test of association provides a meth- able, its probability distribution is called a
od for testing the association between the row and column variables in a two-way table where the null hypothesis H0 assumes that there is no association between the variables.
Chi square test of independent samples The Pearson chi-square test is a non-parametric test for a difference in proportions between two or more independent samples.
Class boundaries Class boundaries separate one class in a grouped frequency distribution from another.
Class limit Class limits separate one class in a grouped frequency distribution from another.
Class mid-point The class mid-point is the midpoint of each class interval.
Classes Classes provide several convenient intervals into which the values of the variable of a frequency distribution may be grouped.
Classical time series analysis Approach to forecasting that decomposes a time series into certain constituent components (trend, cyclical, seasonal, and random component), makes estimates of each component, and then re-composes the time series and extrapolates into the future.
Classical time series decomposition Classical time series decomposition is a statistical method that deconstructs a time series into notional components.
Coefficient of determination (COD) The proportion of the variance in the dependent variable that is predicted from the independent variable.
Coefficient of variation The coefficient of variation measures the spread of a set of data as a proportion of its mean.
Conditional probability Conditional probability is the probability of an event occurring given that another event has already occurred.
Confidence interval (1 − α) A confidence interval gives an estimated range of values which is likely to include an unknown population parameter.
continuous probability distribution.
Continuous random variable A continuous random variable is one which takes an infinite number of possible values.
Continuous variable A set of data is said to be continuous if the values belong to a continuous interval of real values.
Covariance Covariance is a measure of how much two variables change together.
Critical test statistic The critical value for a hypothesis test is a limit at which the value of the sample test statistic is judged to be such that the null hypothesis may be rejected.
Critical value The critical value(s) for a hypothesis test is a threshold to which the value of the test statistic in a sample is compared to determine whether or not the null hypothesis is rejected.
Cross tabulation Cross tabulation is the process of tabulating the results of two or more data sources (variables) against one another.
Cumulative distribution function The cumulative distribution function (CDF), or just distribution function, describes the probability that a real-valued random variable X with a given probability distribution will be found at a value less than or equal to x.
Cumulative frequency distribution The cumulative frequency for a value x is the total number of scores that are less than or equal to x.
Cyclical variations (C) The component of the time series model that results in periodic above-trend and below-trend behaviour of the time series lasting more than one year.
Degrees of freedom Refers to the number of independent observations in a sample minus the number of population parameters that must be estimated from sample data.
Dependent variable A dependent variable is what you measure in the experiment and what is affected during the experiment.
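As a minimal sketch of two of the measures defined above, the following Python snippet computes the coefficient of variation and the covariance. It assumes the population forms (dividing by n rather than n − 1); the function names and the sample figures are illustrative only and are not taken from the book.

```python
from statistics import mean, pstdev


def coefficient_of_variation(values):
    """Spread as a proportion of the mean (population standard deviation / mean)."""
    return pstdev(values) / mean(values)


def covariance(xs, ys):
    """Population covariance: the mean product of deviations from the two means."""
    mean_x, mean_y = mean(xs), mean(ys)
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / len(xs)


# Hypothetical data: five periods of sales and costs.
sales = [10, 12, 14, 16, 18]
costs = [5, 6, 7, 8, 9]

print(coefficient_of_variation(sales))  # about 0.202
print(covariance(sales, costs))         # 4.0
```

A positive covariance here reflects that sales and costs rise together; dividing by n − 1 instead would give the sample versions of both measures.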
470 Glossary
Discrete Discrete data are a set of data where the values/observations belonging to it are distinct and separate, i.e. they can be counted (1, 2, 3 . . .).
Discrete probability distribution If a random variable is a discrete variable, its probability distribution is called a discrete probability distribution.
Discrete random variable A discrete random variable is one which may take on only a countable number of distinct values such as 0, 1, 2, 3, 4 . . .
Discrete variable A set of data is said to be discrete if the values belonging to it can be counted as 1, 2, 3 . . .
Dispersion The variation between data values is called dispersion.
Durbin–Watson The Durbin–Watson statistic is a test statistic used to detect the presence of autocorrelation (a relationship between values separated from each other by a given time lag) in the residuals (prediction errors) from a regression analysis.
Empirical approach Empirical probability, also known as relative frequency, or experimental probability, is the ratio of the number of outcomes in which a specified event occurs to the total number of trials.
Equal variance (homoscedasticity) Homogeneity of variance (homoscedasticity) assumptions state that the error variance should be constant.
Error measurement A method of validating the quality of forecasts. It involves calculating the mean error, the mean squared error, the percentage error, etc.
Estimate An estimate is an indication of the value of an unknown quantity based on observed data.
Event An event is any collection of outcomes of an experiment.
Expected frequency In a contingency table the expected frequencies are the frequencies that you would predict in each cell of the table, if you knew only the row and column totals, and if you assumed that the variables under comparison were independent.
Experimental probability approach See Empirical approach.
Exponential smoothing One of the methods of forecasting that uses a constant (or several constants) to predict future values by ‘smoothing’ the past values in the series. The effect of this constant decreases exponentially as the older observations are taken into calculation.
Exponential trend An underlying time series trend that follows the movements of an exponential curve.
Extreme value An extreme value is an unusually large or an unusually small value compared with the others in the data set.
F distribution The F distribution (also known as the Fisher–Snedecor distribution) is a continuous probability distribution that arises frequently as the null distribution of a test statistic, most notably in the analysis of variance.
F test Tests whether two population variances are the same based upon sample values.
F test for two population variances (variance ratio test) The F test for two population variances (variance ratio test) is used to test if the variances of two populations are equal.
Five-number summary A five-number summary consists of the minimum, lower quartile, median, upper quartile, and maximum. It is especially useful when we have so many data that it is sufficient to present a summary of the data rather than the whole data set.
Forecasting A method of predicting the future values of a variable, usually represented as the time series values.
Forecasting errors A difference between the actual and the forecasted value in the time series.
Forecasting horizon The number of future time units over which the forecasts will be extended.
Frequency definition of probability The frequency definition of probability defines an event’s probability as the limit of its relative frequency in a large number of trials.
Frequency distributions Systematic method of showing the number of occurrences of observational data in order from least to greatest.
Frequency polygon A graph made by joining the middle-top points of the columns of a frequency histogram.
General addition probability law The general addition probability law is a result used to determine the probability that event A or event B occurs or both occur.
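The exponential smoothing entry above can be illustrated with a short Python sketch of the single-constant case. It assumes the common recursion in which each forecast is alpha times the latest actual value plus (1 − alpha) times the previous forecast, seeded with the first observation; the seeding rule, the function name, and the demand figures are assumptions made for illustration, not the book's own worked example.

```python
def exponential_smoothing(series, alpha):
    """Return one forecast per period using a single smoothing constant.

    Assumes a non-empty series and seeds the first forecast with the
    first observation (one common convention among several).
    """
    forecasts = [series[0]]
    for actual in series[:-1]:
        # Blend the latest actual value with the previous forecast.
        forecasts.append(alpha * actual + (1 - alpha) * forecasts[-1])
    return forecasts


# Hypothetical demand figures for five periods.
demand = [120, 132, 128, 141, 150]
forecasts = exponential_smoothing(demand, alpha=0.3)
for period, (actual, forecast) in enumerate(zip(demand, forecasts), start=1):
    print(period, actual, round(forecast, 2))
```

Because each forecast reuses the previous one, the weight given to an observation shrinks by a factor of (1 − alpha) with every further period, which is the exponential decay the definition refers to.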
Graph A graph is a picture designed to express words, particularly the connection between two or more quantities.
Grouped frequency distributions Data arranged in intervals to show the frequency with which the possible values of a variable occur.
Histogram A histogram is a way of summarizing data that are measured on an interval scale (either discrete or continuous).
Histogram with unequal class intervals A histogram with unequal class intervals is a graphical representation showing a visual impression of the distribution of data where class widths are of different sizes.
Hypothesis test procedure A series of steps to determine whether to accept or reject a null hypothesis, based on sample data.
Independence of errors Independence of errors means that the distribution of errors is random and not influenced by or correlated to the errors in prior observations. The opposite of independence is called autocorrelation.
Independent events Two events are independent if the occurrence of one of the events has no influence on the occurrence of the other event.
Independent variable An independent variable is the variable you have control over, what you can choose and manipulate.
Index number A value of a variable relative to its previous value at some base.
Intercept Value of the regression equation (y) when the x value = 0.
Interquartile range The interquartile range is a measure of the spread of or dispersion within a data set.
Interval scale An interval scale is a scale of measurement where the distance between any two adjacent units of measurement (or ‘intervals’) is the same but the zero point is arbitrary.
Irregular variations The component of the time series model that reflects the random variation of the time series values beyond what can be explained by the trend, cyclical, and seasonal components.
Kurtosis Kurtosis is a measure of the ‘peakedness’ of the distribution.
Least squares The method of least squares is a criterion for fitting a specified model to observed data. It refers to finding the smallest (least) sum of squared differences between fitted and actual values.
Left-skewed Left-skewed (or negative skew) indicates that the tail on the left side of the probability density function is longer than the right side and the bulk of the values (possibly including the median) lie to the right of the mean.
Level of confidence The confidence level is the probability value (1 − α) associated with a confidence interval.
Level of significance The level of significance is the criterion used for rejecting the null hypothesis.
Linear relationship A linear relationship exists between variables if, when you plot their values, you get a straight line.
Linear regression analysis Simple linear regression aims to find a linear relationship between a response variable and a possible predictor variable by the method of least squares.
Linear trend Linear trend is a straight line fit to a data set.
Logarithmic trend A model that uses the logarithmic equation to approximate the time series.
Lower one tail test A lower one tail test is a statistical hypothesis test in which the values for which we can reject the null hypothesis, H0, are located entirely in the left tail of the probability distribution.
Mann–Whitney U test The Mann–Whitney U test is used to test the null hypothesis that two populations have identical distribution functions against the alternative hypothesis that the two distribution functions differ only with respect to location (median), if at all.
McNemar’s test McNemar’s test is a non-parametric method used on nominal data to determine whether the row and column marginal frequencies are equal.
Mean The mean is a measure of the average data value for a data set.
Mean absolute deviation (MAD) The mean value of all the differences between the actual and forecasted values in the time series. The differences between these values are represented as absolute values, i.e. the effects of the sign are ignored.
Mean absolute percentage error (MAPE) The mean value of all the differences between the actual and forecasted
values in the time series. The differences between these values are represented as absolute percentage values, i.e. the effects of the sign are ignored.
Mean error (ME) The mean value of all the differences between the actual and forecasted values in the time series.
Mean percentage error (MPE) The mean value of all the differences between the actual and forecasted values in the time series. The differences between these values are represented as percentage values.
Mean square error (MSE) The mean value of all the differences between the actual and forecasted values in the time series. The differences between these values are squared to avoid positive and negative differences cancelling each other.
Median The median is the value halfway through the ordered data set.
Mixed model The mixed time series model blends both additive and multiplicative components together to identify the actual time series value.
Mode The mode is the most frequently occurring value in a set of discrete data.
Moving average Averages calculated for a limited number of periods in a time series. Every subsequent period excludes the first observation from the previous period and includes the one following the previous period. This becomes a series of moving averages.
Moving average trend The moving average trend is a method of forecasting or smoothing a time series by averaging each successive group of data points.
Multiple regression model Multiple linear regression aims to find a linear relationship between a dependent variable and several possible independent variables.
Multiplication law Multiplication law is a result used to determine the probability that two events, A and B, both occur.
Multiplication law for independent events The multiplication law for independent events states that the chance that they both happen simultaneously is the product of the chances that each occurs individually, e.g. P(A and B) = P(A)*P(B).
Multiplication law for joint events See Multiplication law.
Multiplicative model The multiplicative time series model is a model whereby the separate components of the time series are multiplied together to identify the actual time series value.
Multivariate methods Methods that use more than one variable and try to predict the future values of one of the variables by using the values of other variables.
Mutually exclusive Mutually exclusive events are ones that cannot occur at the same time.
Nominal scale A set of data is said to be nominal if the values belonging to it can be assigned a label rather than a number.
Non-parametric Non-parametric tests are often used in place of their parametric counterparts when certain assumptions about the underlying population are questionable.
Non-seasonal Non-seasonal is the component of variation in a time series which is not dependent on the time of year.
Non-stationary time series A time series that does not have a constant mean and oscillates around this moving mean.
Normal approximation to the binomial If the number of trials, n, is large, the binomial distribution is approximately equal to the normal distribution.
Normal distribution The normal distribution is a symmetrical, bell-shaped curve, centred at its expected value.
Normal probability plot Graphical technique to assess whether the data is normally distributed.
Normality of errors The normality of errors assumption states that the errors should be normally distributed. Technically, normality is necessary only for the t-tests to be valid; estimation of the coefficients only requires that the errors be identically and independently distributed.
Null hypothesis (H0) The null hypothesis, H0, represents a theory that has been put forward but has not been proved.
Observed frequency In a contingency table the observed frequencies are the frequencies actually obtained in each cell of the table, from our random sample.
One sample test A one sample test is a hypothesis test for answering questions about the mean (or median) where the data are a random sample of independent observations from an underlying distribution.
One sample t-test for the population mean A one sample t-test is a hypothesis
test for answering questions about the mean where the data are a random sample of independent observations from an underlying normal distribution where population variance is unknown.
One sample z-test for the population mean A one-sample z-test is used to test whether a population parameter is significantly different from some hypothesized value.
One tail test A one tail test is a statistical hypothesis test in which the values for which we can reject the null hypothesis, H0, are located entirely in one tail of the probability distribution.
Ordinal scale Ordinal scale is a scale where the values/observations belonging to it can be ranked (put in order) or have a rating scale attached. You can count and order, but not measure, ordinal data.
Ordinal variable A set of data is said to be ordinal if the values belonging to it can be ranked.
Outcome An outcome is the result of an experiment or other situation involving uncertainty.
Outlier An outlier is an observation in a data set which is far removed in value from the others in the data set.
Parametric Any statistic computed by procedures that assume the data were drawn from a particular distribution.
Pearson’s coefficient of correlation Pearson’s correlation coefficient measures the linear association between two variables that have been measured on interval or ratio scales.
Pie chart A pie chart is a way of summarizing a set of categorical data.
Point estimate A point estimate (or estimator) is any quantity calculated from the sample data which is used to provide information about the population.
Point estimate of the population mean Point estimate for the mean involves the use of the sample mean to provide a ‘best estimate’ of the unknown population mean.
Point estimate of the population proportion Point estimate for the proportion involves the use of the sample proportion to provide a ‘best estimate’ of the unknown population proportion.
Point estimate of the population variance Point estimate for the variance involves the use of the sample variance to provide a ‘best estimate’ of the unknown population variance.
Poisson distribution Poisson distributions model a range of discrete random data variables.
Poisson probability distribution The Poisson distribution is a discrete probability distribution that expresses the probability of a given number of events occurring in a fixed interval of time and/or space if these events occur with a known average rate and independently of the time since the last event.
Polynomial line A polynomial line is a curved line whose curvature depends on the degree of the polynomial variable.
Polynomial trend A model that uses an equation of any polynomial curve (parabola, cubic curve, etc.) to approximate the time series.
Population mean The population mean is the mean value of all possible values.
Population standard deviation The population standard deviation is the standard deviation of all possible values.
Population variance The population variance is the variance of all possible values.
Power trend A model that uses an equation of a power curve (a parabola) to approximate the time series.
Probability Probability provides a quantitative description of the likely occurrence of a particular event.
Probability of event A given that event B has occurred See Conditional probability.
Probable Probable means that an event or events are likely to happen or to be true.
P-value The p-value is the probability of getting a value of the test statistic as extreme as or more extreme than that observed by chance alone, if the null hypothesis is true.
Q1 Q1 is the lower quartile and is the data value a quarter way up through the ordered data set.
Q3 Q3 is the upper quartile and is the data value a quarter way down through the ordered data set.
Qualitative variable Variables can be classified as descriptive or categorical.
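The Poisson probability distribution entry above can be made concrete with a small Python sketch of the standard formula P(X = k) = e^(−rate) × rate^k / k!. The function name and the call-centre figures are hypothetical and chosen only for illustration; in Excel the same probability is available from POISSON.DIST with the cumulative argument set to FALSE.

```python
from math import exp, factorial


def poisson_pmf(k, rate):
    """Probability of exactly k events when events occur at the given average rate."""
    return exp(-rate) * rate ** k / factorial(k)


# Hypothetical example: calls arrive at an average of 2 per minute.
p_none = poisson_pmf(0, 2)                               # no calls in a minute
p_at_most_3 = sum(poisson_pmf(k, 2) for k in range(4))   # 0, 1, 2 or 3 calls

print(round(p_none, 4))
print(round(p_at_most_3, 4))
```

Summing the individual probabilities, as in the second calculation, gives the cumulative probability of seeing no more than a stated number of events.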
non-symmetry 95, 149
normal approximations 175–6, 179–80, 183, 303, 307, 322, 325
  probability 177, 179, 181
  solution 179, 181
normal curves 137–9, 141–2, 146–8, 154–5, 199, 208–9, 250
normal distributions 92, 135–7, 140–1, 143, 148, 175–6, 229
  approximations 136, 185
    to binomial distribution 175–9
normal equations 363
normal populations 186, 198–9, 218, 226, 319
normal probability
  curves 151–2
  plots 149–53, 183, 371, 386, 388, 394, 396–7
normal sampling distributions 254, 262, 267, 271, 276
normality 149–53, 325, 370, 388, 456
  of errors 370, 397
NORM.DIST 138–40, 142–3, 145–7, 177, 202–4, 206, 208–9
NORM.INV 147–8, 155
NORM.S.DIST 141–7, 202–4, 206, 208–11, 253–4, 262–3, 267–8
NORM.S.INV 147–8, 151, 227, 253, 255, 267, 310–11
null distributions 153, 286
null hypotheses 244–6, 248–9, 251–2, 288, 297–8, 309, 318–19
  false 251, 290–1
  testing 254, 258, 262, 267, 271, 276, 281
  true 251, 290
numerators 153, 287, 378, 380

O

observations 2–3, 81, 108, 330–1, 431–6, 438–9, 459–60
  first 431–2, 440
  independent 220, 246, 257, 297
  last 427, 433, 440
  paired 319, 324–5, 329, 356
  tied 330
observed frequencies 169, 305, 312–13, 315
observed values 141, 155, 334–5, 363, 368, 373
ogive 22, 70, 74–6
one sample t-tests 246–7, 251, 291–2, 294, 319, 324
one sample z-tests 246, 294
one tail p-values 259, 263, 281, 290, 321, 327–9, 336
one tail tests
  lower 249, 288, 294, 319, 326
  upper 249, 262, 268, 280, 288, 295, 326–8
order of size 60, 62–3, 69, 296
ordinal data 3–4, 21–2, 60, 105, 246, 340–1, 343–4
ordinal scales 3
ordinal variables 21, 57
outcomes 107–8, 112–13, 116, 120, 124–5, 136, 155–6
  possible 108–9, 112, 165
outliers 59–60, 62, 84, 95–6, 104–5, 149–50, 346–7
  suspected 95–6, 150
overall mean 197

P

p-values 251–2, 254–6, 286–8, 300–1, 305–7, 310–11, 315–17
  calculated 300, 306, 316, 323, 387
    two tail 255, 259
  exact 307, 322–3, 336
  lower 286, 333
  lower tail 254, 288
  measured 294, 340
  method 254, 258–9, 262–3, 267–8, 270–1, 276–7, 281
  one tail 259, 263, 281, 290, 321, 327–9, 336
  two tail 253–5, 258–9, 267–8, 270–2, 276–7, 286–7, 310–11
  upper 262, 280, 286
  upper tail 254, 288
paired differences 325, 330
paired observations 319, 324–5, 329, 356
paired ranks 327
paired samples 279, 281, 294, 297, 307–8, 319, 324
pairs, matched 324
parabolas 391, 393, 420, 423
parameter conditions 313
parameters 135, 180, 194–5, 218, 393–4, 426–7, 437
  population 183, 189, 193–5, 217–20, 225, 241, 246
    unknown 217–18, 458
  sample 183, 194, 246
parametric tests 243–97, 318–19, 331, 340
patterns 3, 48–50, 58, 157, 159, 370, 456
PDF see probability, density function
peakedness 59, 81, 105
PEARSON 349, 353, 355
Pearson’s coefficient of skewness 90, 105
Pearson’s correlation coefficient 343–4, 347, 348–53, 355–8, 404–5
percentiles 62, 64, 75–6, 358
  classes 76
perfect correlations 350–1
pie charts 19, 21–2, 27–30, 297
PivotCharts 11, 17, 19–20
PivotTables 11–20
plots 21, 126, 149, 186, 213, 344, 370–1
  box 94, 96–9, 105
  box-and-whisker 96, 99, 149
  moving average 433, 435
  normal probability 149–53, 183, 371, 386, 388, 394, 396–7
  residual 370–1, 386, 388, 394, 396–7
  scatter 21–2, 47–51, 344–7, 349–51, 364–5, 397, 399
  time series 21–2, 47–51, 57, 408, 432, 447–8, 450
point estimates 185–6, 217–19, 222–3, 225, 241–2, 375, 436
Poisson distributions 133, 135–6, 155, 165–70, 173–5, 180–1, 313–17
  approximation to binomial distribution 173–5
POISSON.DIST 168, 171–3, 181, 314, 316
Index 483
polygons, frequency 2, 21–2, 42–6, 74, 96
polynomial curves 423
polynomial lines 411, 466
polynomial trends 423
pooled estimates 275
population(s)
  confidence intervals 225–42
  distributions 149, 183, 205, 250, 254, 257–8, 262
  estimates 185, 222, 224
  finite 156, 207
  infinite 156, 207
  median 319, 324, 327
  non-normal 186, 204
  normal 186, 198–9, 218, 226, 319
  parameters 183, 189, 193–5, 217–20, 225, 241, 246
    unknown 217–18, 458
  point estimates
    mean and variance 218–22
    proportion and variance 222–4
    type of 218
  proportion 210–11, 217, 222–4, 236, 242, 246, 295
  slope 364, 375, 399, 404
  v samples 194
  values 153, 185, 199, 217, 228, 353
    true 219, 353, 363
  variables 153
  variances 85–6, 217, 219–20, 241–2, 246–7, 261–2, 286–8
positive correlations 350–1, 356, 358
positive relationships 47, 345, 357
positive values 92, 254, 348
power 245, 251, 292
  curves 423
  function 423
  statistical 251, 290–2, 294
  trends 423
precision 189–90
prediction 362, 372, 375, 460–1
  errors 370
  intervals 383–5
  values 375, 462–3
predictor models 381, 387, 390
predictor variables 344, 348, 362, 364, 374–82, 387, 399
presentation 1–57
probability 107–33, 137–46, 154–61, 176–7, 179–81, 201–12, 251
  conditional 119
  density function (PDF) 95, 138–9, 141–2, 154–5, 301
  distributions 107, 124–7, 129–31, 133, 135–83, 185, 249
    binomial 175–7
    continuous 135–6, 153–5, 183, 286
    discrete 135–6, 155–83
    Poisson see Poisson distributions
  empirical 110
  frequency definition of 124
  laws 107, 114–15, 133
    general addition law 115–16, 133
  normal approximation 177, 179, 181
  samples 188, 190
  theoretical 113
  theory 135, 243
properties 195, 398, 456
proportional quota sampling 191
purposive sampling 191

Q

Q1 see first quartile
Q2 see second quartile
Q3 see third quartile
qualitative variables 2–3, 325
quantitative variables 2–3
QUARTILE.INC 64, 81
quartiles 64–5, 75, 88, 96
  first 64–5, 76, 82–3, 94–7, 105, 149
  ranges 59
  second 63–4
  third 64–5, 76, 83, 94–7, 149
quota sampling 191
  non-proportional 192
  proportional 191

R

random experiments 108
random number generation 154, 212–13
random samples 110, 137, 163, 188–90, 193–4, 203–7, 209–13
  independent 266, 307, 332
  simple 188, 190
random sampling 186, 188
  simple 188–90
  stratified 189–90, 192
  systematic 189
random variables 136, 154–6, 161, 163, 173, 179, 183
  continuous 136, 183
  discrete 136, 155, 165–6, 183
rank correlation coefficient, Spearman’s see Spearman’s rank correlation coefficient
RANK.AVG 326–7, 332, 334
ranks 62, 296, 319, 322, 327–8, 332, 334
  paired 327
  shared 322, 327, 334
  tied 329–30, 337, 356
ratios 3–4, 21–2, 57, 88, 105, 109–10, 306–7
raw data 1–2, 4, 8, 12, 57, 105, 279
rectangles 32, 40–2
region of rejection 249, 254–5, 263, 268, 287–8, 310–11, 322–3
regression 343, 362–5, 369–71, 378, 381, 395, 405
  analysis 343, 345, 347, 349, 363, 367–71, 387–91
    advanced topics 390–405
    linear 405
    and linear correlation 343–405
  assumptions 370–2, 375
  coefficients 346, 370
  equations 365, 373, 375, 383
  least squares 343, 363, 393, 404
  linear see linear regression
  lines 365–6, 368, 372–4, 378, 423
  mean square due to 378, 381
  models 362, 365, 375, 378
    linear multiple 344, 404
    multiple 362, 398–400, 404
    non-linear 390–7
  sum of squares 369, 381, 405, 425
Regression tool 381, 385, 387, 398, 404
rejection 244–5, 249, 251–2, 272, 290, 292, 294
  regions/zones of 249, 254–5, 263, 268, 287–8, 310–11, 322–3
relative frequency 27, 32, 107, 110, 124–7, 133, 185
reliability 344, 372, 385, 398
  models 399
residual plots 370–1, 386, 388, 394, 396–7
residual values 373, 420
residuals 365, 368, 370–4, 388, 396, 420, 456–7
response variables 348, 362
right-skewed distributions 95, 149, 152
risk 248, 350
rows 11–12, 20–1, 213, 297–8, 301, 307, 454
  variables 305, 308
RSQ 373–4, 380

S

sample parameters 183, 194, 246
sample size 189–90, 199–200, 204–11, 218–23, 234–6, 238–9, 248–50
  calculating 237–9
sample space 107–8, 118–20, 133, 163
sample statistics 194, 217–18, 224, 246, 324
samples 185–210, 212–14, 217–20, 222–31, 246–8, 253–5, 257–62, 330–2 see also sampling
  averages 261–2
  dependent 246–7, 279, 297, 303, 307, 310, 322
  independent see independent samples
  large 218, 232, 459
  mean 194–211, 213–14, 217–20, 222, 224–8, 230–1, 253–5
  percentiles 371, 388, 396
  proportion 183, 194, 210–11, 222–3, 235–6, 267, 307–8
  random see random samples
  simple random 188, 190
  small 233, 250, 324, 347, 459
  standardized 201, 211
  types of 188–92
  v populations 194
  variance 85, 91, 219–20, 222, 231–2, 234, 287
sampling 85, 156, 182, 185–7, 191–3, 198–9, 204–8
  cluster 190
  concept 186–93
  convenience 191
  distributions 185, 187, 189, 191, 193–5, 197, 248–9
    and estimation 185–242
    and mean 194–8
    normal 254, 262, 267, 271, 276
    and proportion 210–12
  error 185–7, 193, 210, 225, 229, 248, 353
  frame 187, 190, 242
  multistage 190
  from non-normal population 204–10
  non-probability 187, 190
  from normal population 198–204
  purposive 191
  quota see quota sampling
  snowball 191–2
  stratified 189–90
  terminology 187
scales 3–4, 62, 92, 297, 346
  interval 3, 8, 94, 96
  ordinal 3
  y-axis 49–50
scatter plots 21–2, 47–51, 344–7, 349–51, 364–5, 397, 399
scores 3, 6, 32–3, 92, 109, 113, 122
SD see standard deviation
SE see standard error
seasonal components 48, 419–20, 445–6
seasonal exponential smoothing 449
seasonal forecasts 447, 450
seasonal time series 406–7, 409, 445, 466
seasonal variations 419, 466
second quartile 48–9, 63–4
SEE, see standard error, of estimate
semi-interquartile range 58, 82–3
sequential numbers 407–8, 426, 429
serial correlation 370
shared ranks 322, 327, 334
signed rank sum test 246, 279, 297, 318–20, 324–5, 329–30, 340–1
significance 248, 254–5, 271–3, 276–8, 281–2, 286–8, 358–60
  level 248–9, 255, 259, 261–3, 287–90, 354–5, 358–60
simple exponential smoothing 436, 438–9, 446
simple linear regression 343, 348, 362
simple random sampling 188–90
single exponential smoothing 437
SIQR 58–9, 81–5, 87–8, 105
skewed distributions 82–3, 89–90, 92, 95, 149, 152
skewness 58–9, 62, 75, 89–92, 96, 100, 105
  coefficient of 90, 105
  measures of 90
  right 95–6, 149
SLOPE 365–6, 372, 376, 379, 382, 384, 427
slopes 365, 375, 382, 405, 426–7, 446
  population 364, 375, 399, 404
smoothing 407, 423, 432, 437, 440, 466
  constant 248, 437–8, 442, 444, 446, 466 see also damping factor
  exponential see exponential smoothing
  time series 430–45
snowball sampling 191–2
Solver 449
Spearman’s rank correlation coefficient 343–4, 347, 356–8, 404–5
  critical values 360
spread 33, 40, 58–9, 80–3, 89, 94, 104–5
  measures of 82–3
SQRT 86, 92, 145, 147–8, 177–8, 181, 459–60
square roots 35, 81, 83, 386, 425, 459
squared differences 358, 363
squared error 372, 453
squares
  least see least squares
  regression sum of 369, 381, 405, 425
  sum of 369, 374, 381, 405
SSE see sum of squares, for error
SSR see sum of squares, for regression
SST see total sum of squares
standard class widths (CWs) 41–2
standard deviation 83–9, 139–43, 196–9, 201–10, 219–23, 225–31, 458–9
standard error 197–8, 202–12, 220–5, 227, 229–31, 241–2, 372–3
  of estimate 366, 372–3, 386–7
  of forecast 459
  of the mean 222, 229, 254, 258
  population and sample 458–9
  of the proportion 223
standard normal distribution 140–1, 183, 229, 248
STANDARDIZE 143
stated limits 10–11, 57
stationary time series 407–9, 431, 433–4, 436, 445
statistical independence 117, 120, 122, 133
statistical power 251, 290–2, 294
statistical tests 185, 192, 241, 244, 248–9, 251, 318 see also parametric tests; non-parametric tests
  choice of 247
STDEV.P 82, 195
STDEV.S 220, 222, 231, 270, 275–6, 285–6, 349
STEYX 373, 375, 377, 383, 385, 459–61, 463
straight lines 42, 51, 151, 344, 362, 411, 420
strata 189–90
stratification 190
stratified random sampling 189–90, 192
strength of correlations/associations/relationships 343, 347–8, 350, 374
Student’s t distribution 183
Student’s t-test 153, 246, 250, 376, 396
SUM 86–7, 161–4, 167–8, 195–7, 299–300, 304–5, 454–5
sum of squares 369, 374, 381, 405
  for error 369, 374, 378, 381, 386, 405, 425
  for regression 369, 378, 381, 386, 405, 425
  total 369, 374, 378, 381, 386, 405, 425
SUMIF 326, 333
summary, five-number 94–5, 149
SUMPRODUCT 68, 72, 78, 86
SUMXMY2 447, 449, 455, 459–60, 463
suspected outliers 95–6, 150
symmetric distributions 90, 92, 94, 149, 205, 330
symmetry 59, 75, 92, 96, 105, 152, 250

T

t-tests 278, 281–2, 324–5, 355, 370–1, 374–5, 377–8
  model assumptions 250
  one-sample see one-sample t-tests
  paired 247, 319, 324
  Student’s 153, 246, 250, 376, 396
  two sample see two sample t-tests
tables 1, 4–6, 11, 56, 175, 297–8, 329
  construction 21
  contingency 22, 153, 298, 300–1, 303, 305–8
  creation using PivotTable 11–20
  critical 143, 460
  cross tabulation 22
  cumulative frequency 69
  data types 10–11
  grouped frequency 9, 36
tally charts 6, 57
T.DIST 292
T.DIST.2T 258–9, 270–2, 276–8, 377
T.DIST.RT 259, 280–2
test statistics 228–9, 251–2, 286–8, 300–1, 316–17, 322–3, 336–7
  calculated 255, 259–60, 270, 272, 276–7, 282, 328–30
  critical see critical test statistic
third quartile 64–5, 76, 83, 94–7, 149
tied observations 330
tied ranks 329–30, 337, 356
time, units of 407, 409, 446, 465
time periods 48, 188, 370, 407, 415, 419, 424
time points 49–50, 408–10, 421–2, 432–5, 440–2, 447–8, 462–3
time series 406–11, 419–21, 423, 430–1, 433–6, 445, 459
  actual 425, 432
  analysis 48, 406–66
  classical 419–20, 466
  data 406, 417, 423, 436, 443, 447
  forecast 424–5
  graphs 48–50, 408, 465
  model 419
  non-stationary 407–9, 431, 445
  plots 21–2, 47–51, 57, 408, 432, 447–8, 450
  seasonal 406–7, 409, 445, 466
  short 430–1, 438, 460
  smoothing 430–45
  stationary 408, 431, 433–4, 436, 445
  trend 420
  univariate 409, 419, 465
  values 407, 419, 436–7
T.INV 259, 280, 292
T.INV.2T 231, 258–9, 270, 276, 355–6, 359, 383–4
total probability 126, 137, 158–9, 161–2, 164
total sample size 111, 118, 191, 298
total sum of squares 369, 374, 378, 381, 386, 405, 425
tree diagrams 107, 123, 129, 136, 156–7
TREND 367, 460
trend chart functions 424–5
trend-fitting see fitting
trend lines 51, 365, 367, 420–2, 424–6, 433
Trendline, Add 51, 366, 397, 421, 433, 435
trends 48, 368, 419–20, 423–5, 427–9, 445, 461–3
  components 420, 466
  fitting to time series 420–3
  types of 423–4
trials 108, 110, 124, 156, 161, 163, 173
true limits see mathematical limits
two sample t-tests 246–7, 269, 271, 274–6, 279, 281, 294–5
  dependent samples 279–82
two sample z-tests 246, 295
two tail p-values 253–5, 258–9, 267–8, 270–2, 276–7, 286–7, 310–11
two tail tests 244, 249, 254, 258, 267, 286–8, 309–10
type I errors 251, 290, 295
type II errors 251, 290–1, 295

U

UCB see upper class boundaries
unbiased estimates 219–24, 241
unbiased estimators 195, 197, 210, 217–19, 221
uncertainty 107–8, 133, 191, 406, 450–1, 453, 461
underlying trends 419–20
unequal class intervals 40–2
unequal class widths 2, 42, 73
unequal variances 247, 295
unexplained deviation 378
unexplained variation 365, 369
uniform distribution 154
univariate methods 409–10, 466
univariate time series 409, 419, 465
upper class boundaries 8, 11, 32–5, 42–3, 71, 82, 86–7
upper confidence intervals 227, 231, 234, 236
upper critical values 286–9
upper one tail tests 249, 262, 268, 280, 288, 295, 326–8
upper p-values 262, 280, 286
upper quartile (UQ) see third quartile
upper tail 255, 259
  p-values 254, 288
UQ see third quartile

V

validity 243, 250, 319
VAR 82–3, 85, 127–8, 159, 164, 168, 170
variability 89, 198, 371, 373–4
variables 2–3, 21–2, 135–7, 343–5, 347–8, 393–7, 407–9
  categorical 2, 296, 298, 301–2
  column 297–8, 305, 308
  dependent 343–6, 348, 350, 362–3, 375, 378–9, 399–400
  discrete 32, 155–6, 183, 307
  discrete random 136, 155, 165–6, 183
  independent 343, 345–6, 362–3, 370–1, 374–5, 390, 396–8
  qualitative 2–3, 325
  quantitative 2–3
  response 348, 362
  row 305, 308
variance of errors assumption 397
variance ratio test 246–7, 294
variance(s) 83–6, 163–4, 167–8, 173–6, 218–19, 221–4, 286–7
  analysis of see analysis of variance
  constant 370–1, 388, 397, 456
  equal 275, 295, 371
  error 371
  population 85–6, 217, 219–20, 241–2, 246–7, 261–2, 286–8
  samples 85, 91, 219–20, 222, 231–2, 234, 287
  unequal 247, 295
variation 80–1, 83, 88–9, 369, 374, 395, 399–400
  coefficient of 81, 88–9
  cyclical 419, 466
  irregular 419, 466
  total 369, 374
  unexplained 365, 369
VAR.P 82, 85
VAR.S 85, 219, 222, 231, 234
vertical axes 44, 344
visualization 1–57

W

weighted averages 77–8, 436
weighted mean 78
weightings 41, 77, 436
Wilcoxon signed rank sum test 246, 279, 297, 318–20, 324–5, 329–30, 340–1

X

x-axis 32–3, 39–40, 42, 49, 350

Y

y-axis 32, 39, 350, 411
  scales 49–50

Z

z distribution see standard normal distribution
z tests 246–7, 264, 282, 303, 307, 309, 319 see also McNemar test
z-tests
  one-sample 246
  two sample see two sample z-tests
z-values 263, 458–9