All MAT9004 Content (Derivative)

Uploaded by

buffettmorandini
All MAT9004 content (derivative)

Daniel Horsley

December 6, 2021 at 12:50 pm

COPYRIGHT WARNING: This ebook is protected by copyright. For use within Monash University
only. NOT FOR RESALE. Do not remove this notice.
Disclaimer: https://www.monash.edu/disclaimer-copyright
Contents

1 Mathematical Notation and Functions
1.1 Important notation
1.1.1 Set notation
1.1.2 Interval notation
1.1.3 Sigma notation
1.1.4 Product notation
1.2 Functions: the basics
1.2.1 What is a function?
1.2.2 Zeroes and stationary points
1.2.3 One-to-one and inverse functions
1.2.4 Convex and concave functions
1.3 Functions: Graphing activity
1.4 Functions: Activity 1
1.5 Some useful functions
1.5.1 Linear functions
1.5.2 Polynomial functions
1.5.3 Exponential functions
1.5.4 Logarithmic functions
1.6 Functions: Activity 2
1.7 Functions: Activity 3
1.8 Answers

2 Calculus
2.1 Differentiation
2.1.1 Tangents
2.1.2 Derivatives
2.1.3 Finding tangents
2.1.4 Higher derivatives
2.1.5 Warning
2.1.6 Notation
2.2 Differentiation: Activity 1
2.3 Optimisation
2.3.1 Stationary points revisited
2.3.2 Second derivative test
2.3.3 Optimisation
2.3.4 Convexity and concavity revisited
2.4 Optimisation: Activity
2.5 Integration
2.5.1 Integration
2.5.2 Finding antiderivatives
2.6 Integration: Activity
2.7 Answers

3 Linear Algebra
3.1 Vectors and matrices
3.1.1 Matrices
3.1.2 Addition and Scalar Multiplication of Matrices
3.1.3 Matrix Multiplication
3.1.4 Vectors
3.1.5 The Dot Product
3.1.6 Geometric Interpretation of Vectors
3.2 Vectors and matrices: Activity
3.3 Solving linear systems
3.3.1 Linear Systems
3.3.2 Gaussian Elimination
3.3.3 Matrix Inverses
3.4 Solving linear systems: Activity
3.5 Eigenvalues and eigenvectors
3.5.1 Eigenvalues and Eigenvectors
3.5.2 Calculating Eigenvalues and Eigenvectors
3.6 Eigenvalues and eigenvectors: Activity
3.7 Answers

4 Multivariate Functions
4.1 Relations
4.1.1 Relations
4.1.2 Circles
4.1.3 Ellipses
4.2 Relations: Activity
4.3 Functions of several variables
4.3.1 Functions of several variables
4.3.2 Linear functions of two variables
4.3.3 Zeroes and stationary points
4.3.4 Level sets and contour plots
4.4 Functions of several variables: Graphing activity
4.4.1 Plots of relations
4.4.2 Plots of functions of two variables
4.5 Functions of several variables: Activity
4.6 Partial derivatives
4.6.1 Partial derivatives
4.6.2 The gradient vector
4.6.3 Locating stationary points
4.6.4 Classifying stationary points
4.6.5 Notation
4.7 Partial derivatives: Activity
4.8 Answers

5 Enumeration and Discrete Probability
5.1 Enumeration
5.1.1 Counting via sums and products
5.1.2 Counting selections
5.2 Enumeration: Activity
5.3 Probability
5.3.1 Probability spaces
5.3.2 Events
5.3.3 Independent events
5.3.4 Warning
5.4 Probability: Activity
5.5 Conditional probability
5.5.1 Conditional probability
5.5.2 Independence again
5.5.3 Independent repeated trials
5.5.4 Bayes' theorem
5.6 Conditional probability: Activity
5.7 Answers

6 Probability Distributions
6.1 Random variables
6.1.1 Random variables
6.1.2 Independence
6.1.3 Operations on random variables
6.1.4 Expected value
6.1.5 Linearity of expectation
6.1.6 Variance
6.2 Random variables: Activity
6.3 Important probability distributions
6.3.1 Useful discrete distributions
6.3.2 Continuous probability
6.3.3 Useful continuous distributions
6.4 Important probability distributions: Activity
6.5 Graphs and trees
6.5.1 Graphs
6.5.2 Degree
6.5.3 Paths and cycles
6.5.4 Trees
6.5.5 Adjacency matrices
6.5.6 Famous problems
6.6 Answers

Chapter 1

Mathematical Notation and Functions

1.1 Important notation

1.1.1 Set notation


To understand written mathematics, it helps a lot to have a grasp of set notation. A set in maths
is just an unordered collection of distinct things. We write $\{a_1, a_2, \ldots, a_t\}$ for the set that contains
$a_1, a_2, \ldots, a_t$. For example, {1, 2} is a set that contains two things: the numbers 1 and 2. We might
also write {1, 2, …} for the set that contains all of the positive integers.
Some sets of numbers are used often enough to have their own notation:
• ℕ is the set of natural numbers: {0, 1, 2, …}.
• ℤ is the set of integers: {… , −2, −1, 0, 1, 2, …}.
• ℝ is the set of real numbers, which includes all the numbers on a standard number line (e.g. −6, $\frac{1}{7}$
and 𝜋).
We write 𝑥 ∈ 𝑆 to mean “𝑥 is in the set 𝑆”. For example, 𝑥 ∈ ℝ means “𝑥 is in the set of real
numbers” or, in other words, “𝑥 is a real number”.
The notation {𝑥 ∈ 𝑆 ∶ 𝑃(𝑥)} is used to mean “all the elements 𝑥 of the set 𝑆 for which the property
𝑃(𝑥) is true”. For example, {𝑥 ∈ ℕ ∶ 𝑥 is odd} is the set {1, 3, 5, …} of odd natural numbers.
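Set-builder notation has a close analogue in many programming languages. The following is a minimal Python sketch rather than part of the unit material. Because ℕ is infinite, it assumes a finite stand-in (the natural numbers below 10), chosen purely for illustration.

```python
# Finite stand-in for the natural numbers (an assumption: N itself is infinite).
naturals_below_10 = set(range(10))

# {x in S : x is odd} written as a set comprehension.
odd_naturals = {x for x in naturals_below_10 if x % 2 == 1}

# "x in S" is written with the `in` operator.
print(3 in odd_naturals)
# Sets are unordered, so sort before printing.
print(sorted(odd_naturals))
```

As with mathematical sets, repeated elements are ignored: `{1, 2, 2}` and `{1, 2}` are the same Python set.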

1.1.2 Interval notation


Intervals are certain sets containing infinitely many real numbers. For real numbers 𝑎 and 𝑏 with
𝑎 < 𝑏:
• The set [𝑎, 𝑏] contains all real numbers between 𝑎 and 𝑏 (including 𝑎, 𝑏). That is, [𝑎, 𝑏] = {𝑥 ∶
𝑎 ≤ 𝑥 ≤ 𝑏}; we call [𝑎, 𝑏] the closed interval from 𝑎 to 𝑏.
• The set (𝑎, 𝑏) contains all elements of [𝑎, 𝑏] except for 𝑎 and 𝑏; that is, (𝑎, 𝑏) = {𝑥 ∶ 𝑎 < 𝑥 < 𝑏};
we call (𝑎, 𝑏) the open interval from 𝑎 to 𝑏.
Similarly, we define:
• [𝑎, 𝑏) = {𝑥 ∈ ℝ ∶ 𝑎 ≤ 𝑥 < 𝑏}
• (𝑎, 𝑏] = {𝑥 ∈ ℝ ∶ 𝑎 < 𝑥 ≤ 𝑏}
• [𝑎, ∞) = {𝑥 ∈ ℝ ∶ 𝑥 ≥ 𝑎} and (𝑎, ∞) = {𝑥 ∈ ℝ ∶ 𝑥 > 𝑎}
• (−∞, 𝑏) = {𝑥 ∈ ℝ ∶ 𝑥 < 𝑏} and (−∞, 𝑏] = {𝑥 ∈ ℝ ∶ 𝑥 ≤ 𝑏}.

1.1.3 Sigma notation
The symbol ∑ is often used to denote a sum.
If 𝑎 and 𝑏 are integers with 𝑎 ≤ 𝑏 and 𝑓 is a function, then
$$\sum_{x=a}^{b} f(x) = f(a) + f(a+1) + \cdots + f(b).$$

(See the next section for a precise definition of a function; for the moment, the following examples
should give you the idea.)
Example. $\sum_{x=3}^{6} \frac{1}{x} = \frac{1}{3} + \frac{1}{4} + \frac{1}{5} + \frac{1}{6}$.

For more complicated sums, the notation $\sum_{x \in S} f(x)$ is used to mean the sum of $f(x)$ over all the $x$
in the set $S$.
Example. $\sum_{x \in \{2,6,7\}} \frac{1}{x} = \frac{1}{2} + \frac{1}{6} + \frac{1}{7}$.
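Sigma notation corresponds directly to a loop that adds up terms. The sketch below is an illustration only: the helper names `sigma` and `reciprocal` are hypothetical, and exact fractions are used so that the results can be compared with the two examples above.

```python
from fractions import Fraction

def sigma(f, a, b):
    """Sum f(x) for every integer x from a to b inclusive."""
    return sum(f(x) for x in range(a, b + 1))

def reciprocal(x):
    return Fraction(1, x)

# Sum over a range of integers: 1/3 + 1/4 + 1/5 + 1/6.
range_sum = sigma(reciprocal, 3, 6)

# Sum over an arbitrary set: 1/2 + 1/6 + 1/7.
set_sum = sum(reciprocal(x) for x in {2, 6, 7})

print(range_sum, set_sum)
```

Note that Python's `range(a, b + 1)` excludes its upper end, which is why `b + 1` is needed to include the final term.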

Questions.
1. How would you write $\sum_{x=2}^{4}$ 𝑥2𝑥+1 without using sigma notation? A1.1
2. Write $\frac{1}{4} + \frac{1}{6} + \frac{1}{8} + \frac{1}{10}$ using sigma notation. A1.2

1.1.4 Product notation


Similarly the symbol ∏ is used for products.
If 𝑎 and 𝑏 are integers with 𝑎 ≤ 𝑏 and 𝑓 is a function, then
$$\prod_{x=a}^{b} f(x) = f(a) \times f(a+1) \times \cdots \times f(b).$$

Again, the notation $\prod_{x \in S} f(x)$ is used to mean the product of $f(x)$ over all the $x$ in the set $S$.
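Product notation can be sketched in the same way as sigma notation. The helper name `product` is hypothetical, and the inputs are illustrative values rather than ones taken from the questions.

```python
import math

def product(f, a, b):
    """Multiply f(x) for every integer x from a to b inclusive."""
    return math.prod(f(x) for x in range(a, b + 1))

# 1 * 2 * 3 * 4, which is the factorial of 4.
print(product(lambda x: x, 1, 4))

# Product over a set: (2 + 1) * (5 + 1).
print(math.prod(x + 1 for x in {2, 5}))
```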
Questions.
1. How would you write $\prod_{x=10}^{12} (3x + 1)$ without using product notation? A1.3

1.2 Functions: the basics

1.2.1 What is a function?


We can think of a function as a black box that accepts inputs from some specified set (called its
domain) and gives outputs. The key property is that
for each input 𝑥 a function 𝑓 gives exactly one fixed output which we call 𝑓(𝑥).
The set of all the outputs of a function is called its range or image.
We can represent functions in several ways as the following example shows.
Example. The function which accepts any real number as input and outputs the square
of that number can be represented:
• in words, as above;
• in mathematical symbols as $f(x) = x^2$ with domain ℝ;
• as a plot with the possible inputs on the horizontal axis and the possible outputs on
the vertical axis. Values can be read off the plot, e.g. 𝑓(0) = 0 and 𝑓(−2) = 4.

The functions we'll be concerned with initially are real functions of a single real variable. This just
means functions that accept a real number as an input and output a real number (or, more formally,
whose domain and range are both subsets of ℝ). More specifically:
• almost all the functions we'll consider will be continuous (meaning that their plots can be drawn
without lifting pen from paper);
• most of the functions we consider will be smooth (meaning that not only is the function
continuous but also its plot is a nice smooth curve).
We'll be able to give a more precise definition of smooth later.
Questions. Answer the following for the plot of a function 𝑓(𝑥) shown below.

1. Estimate 𝑓(−1), 𝑓(1) and 𝑓(3). A1.4
2. When $f(x) = \frac{1}{2}$, what (roughly) might 𝑥 be? A1.5
3. What do you think the most important points on the curve are? A1.6
4. For what values of 𝑥 is the function increasing and for what values is it decreasing? A1.7
So that we can be lazy sometimes, when we don't give a domain for a function we'll assume that the
domain is the set of all real numbers for which the mathematical expression is a well-defined real
number. For example, if we just say a function is given by $f(x) = \sqrt{x}$, then we'll assume the domain
is the set of all real numbers greater than or equal to 0.

1.2.2 Zeroes and stationary points


The zeroes or roots of a function 𝑓 are the values 𝑥 in its domain where 𝑓(𝑥) = 0. These are the
values of 𝑥 at which the plot of the function crosses the 𝑥-axis. The function in the plot above has
three zeroes: 0.2, 2.1 and 5.2 (roughly). In general a function might have many zeroes or none at all.
The stationary points of a function are the values 𝑥 in its domain where the function is “flat”: neither
increasing nor decreasing. (Stationary points are examples of critical points, which we'll define later.)
• If a stationary point is at the bottom of a “dip” it is called a local minimum. The function in
the plot above has a local minimum at 1.

• If it is at the top of a “rise” it is called a local maximum. The function in the plot above has a
local maximum at 4.

• It's also possible for a stationary point to be neither of these things. For example, the function
$f(x) = x^3 + \frac{1}{2}$ shown below has a stationary point at 0 which is neither a local minimum nor a
local maximum. Stationary points of this kind are inflection points.

There's a nice visualisation of zeroes and stationary points of a function here. (First download the
Wolfram CDF player - we'll be linking to lots of these.)
Questions. Can you come up with a smooth real-valued function with domain ℝ such that:
1. the function has a zero but no stationary points? A1.8
2. the function has a stationary point but no zeroes? A1.9
3. the function has no zeroes and no critical points? A1.10

1.2.3 One-to-one and inverse functions


If we have a function $f$, it can be useful to find another function $f^{-1}$ such that $f^{-1}(f(x)) = x$ for all
$x$ in the domain of $f$. The function $f^{-1}$ is called the inverse function of $f$. We can think of it as the
“undo” function for $f$: $f^{-1}(y)$ tells us what value of $x$ makes $f(x) = y$. Note that $f^{-1}(x)$ does not
mean the same thing as $\frac{1}{f(x)}$.

Example. The function $f(x) = 3x + 5$ with domain ℝ has inverse function $f^{-1}(y) = \frac{y-5}{3}$
with domain ℝ. We can check this is true as follows:
$$f^{-1}(f(x)) = f^{-1}(3x + 5) = \frac{(3x+5)-5}{3} = \frac{3x}{3} = x \quad \text{for all } x \text{ in } \mathbb{R}.$$
To find an inverse function, we can often solve the equation $y = f(x)$ for $x$. For example,
to find the inverse function we gave above, we would calculate:

$$\begin{aligned} y &= 3x + 5 \\ y - 5 &= 3x \\ \frac{y-5}{3} &= x. \end{aligned}$$
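The check that the inverse really undoes the original function can also be carried out numerically. The following is a small Python sketch for the linear example above. The sample inputs are arbitrary, and checking a handful of values illustrates the identity but does not prove it for every real number.

```python
from fractions import Fraction

def f(x):
    return 3 * x + 5

def f_inverse(y):
    # Obtained by solving y = 3x + 5 for x.
    return (y - 5) / 3

# Exact rational sample inputs, so no rounding error is involved.
samples = [Fraction(-2), Fraction(0), Fraction(3, 2), Fraction(10)]

# Applying f and then f_inverse should return each input unchanged.
round_trips = [f_inverse(f(x)) for x in samples]
print(round_trips == samples)
```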

We can use inverse functions to undo the action of the original function as in the following.
Example. Suppose that $f$ is a function with domain [2, 6] and it is known that $f^{-1}(x) = x^2 + 2$
for $x \in [0, 2]$.
Question: find a number $x$ such that $f(x) = 1.5$.
Solution: Applying $f^{-1}$ to both sides of the equation, we have $f^{-1}(f(x)) = f^{-1}(1.5) = 1.5^2 + 2 =
4.25$. But $f^{-1}(f(x)) = x$. So $x = 4.25$.

Not all functions have inverse functions, however.
Example. The function $f(x) = x^2$ with domain ℝ does not have an inverse function.
Because $f(2) = 4$ and $f(-2) = 4$, if an inverse function existed $f^{-1}(4)$ would have to
equal both 2 and −2.
To avoid this problem we need a function with the following property.
A function $f$ is one-to-one if, for all distinct $x_1$ and $x_2$ in its domain, $f(x_1) \neq f(x_2)$.
This property tells us exactly which functions have inverses.
A function has an inverse if and only if it is one-to-one.
Sometimes if a function is not one-to-one, we can restrict its domain to make a new function that is
one-to-one and has an inverse.
Example. We saw that the function $f(x) = x^2$ with domain ℝ does not have an inverse
function. But the function $g(x) = x^2$ with domain [0, ∞) is one-to-one and does have an
inverse function: $g^{-1}(y) = \sqrt{y}$.
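The effect of restricting the domain can be explored with a quick experiment. The sketch below uses a hypothetical helper, `is_one_to_one_on`, that tests the one-to-one property on a finite list of inputs. A finite sample can show that a function is not one-to-one, but it can never prove that a function is one-to-one on an infinite domain.

```python
def is_one_to_one_on(f, inputs):
    """Check that f takes distinct values on a finite list of distinct inputs."""
    outputs = [f(x) for x in inputs]
    return len(set(outputs)) == len(outputs)

def square(x):
    return x * x

# A sample from all of R: -2 and 2 share the output 4.
print(is_one_to_one_on(square, [-2, -1, 0, 1, 2]))

# A sample from [0, infinity): no two inputs share an output.
print(is_one_to_one_on(square, [0, 1, 2, 3]))
```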
Questions. Do the following functions have inverse functions? If so, what are they?
1. The function 𝑓(𝑥) = 2𝑥 − 4 with domain ℝ. A1.11
2. The function $f(x) = x^4$ with domain ℝ. A1.12
3. The function $f(x) = x^3 + 2$ with domain ℝ. A1.13

1.2.4 Convex and concave functions


A function is convex if, for any two points on its plot, the plot of the function between those two
points is entirely below (or touching) the straight line joining those points.
Concavity is the opposite: a function is concave if, for any two points on its plot, the plot of the
function between those two points is entirely above (or touching) the straight line joining those points.
We'll see more formal definitions of these concepts later. Note that many functions are neither convex
nor concave. Note also that functions with straight line graphs are both convex and concave (they
are touching all the way)!
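The chord description of convexity can be tested numerically on a particular pair of points. The following Python sketch uses a hypothetical helper, `lies_on_or_below_chord`, that samples the plot between two points and compares it with the straight line joining them. Passing the test for one pair of points is only evidence of convexity, whereas failing it for any pair shows that the function is not convex.

```python
def lies_on_or_below_chord(f, x1, x2, steps=50, tol=1e-12):
    """True if f stays on or below the line joining (x1, f(x1)) and (x2, f(x2))."""
    for i in range(steps + 1):
        t = i / steps
        x = (1 - t) * x1 + t * x2                # a point between x1 and x2
        chord = (1 - t) * f(x1) + t * f(x2)      # height of the straight line at x
        if f(x) > chord + tol:
            return False
    return True

def square(x):
    return x * x

def cube(x):
    return x ** 3

# Consistent with x^2 being convex.
print(lies_on_or_below_chord(square, -3, 2))

# Fails between -2 and 0, so x^3 is not convex on all of R.
print(lies_on_or_below_chord(cube, -2, 0))
```

Swapping the inequality gives the corresponding test for concavity.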
Question. Look at the plot below of the function $f(x) = 2^x$. Do you think $f$ is convex, concave or
neither? A1.14

1.3 Functions: Graphing activity
Throughout this unit, it will be helpful to be able to see the plots of various functions. This will help
you to identify important features of functions such as their zeroes and stationary points. There are
many places online which allow you to quickly produce plots of functions. In this activity, we will
explain how to use one particular website, Wolfram Alpha, to produce these plots. However, feel free
to use other resources to produce plots if you wish.
First, follow this link to the Wolfram Alpha home page. You should see a screen like this.

From here, type in the rule for the function you would like to plot and click on the equals sign on the
right hand side.

• To write a power, use a circumflex (^). For example, to write $x^4$, you would type in x^4.
• To write a product, use an asterisk (*). For example, to write $x e^x$, you would type in x*e^x.
You don't always need to use an asterisk, but it may be useful sometimes.
• The forward slash (/) is used for division. For example, to write $\frac{x}{4}$, you would type in x/4.
• You may also need to use brackets when entering some functions, even when you don't usually
write brackets. For example, to write $e^{3x}$ you would type e^(3x).
Type in the rule for a common function, such as a polynomial. Below we have entered in the rule
$x^3 + 5x^2$.

After you have clicked on the equals sign on the right hand side, you will see a plot of your chosen
function.

If you scroll down, you will see further details about this function. For now, the details you will be
most interested in are the zeroes (these are called ”roots” on Wolfram Alpha) and the stationary
points.

Under the local maximum section, we are told that $\max\{x^3 + 5x^2\} = \frac{500}{27}$ at $x = -\frac{10}{3}$. This means that
the value of the function at this local maximum is $\frac{500}{27}$. So if the function $f$ is given by $f(x) = x^3 + 5x^2$,
then $f(-\frac{10}{3}) = \frac{500}{27}$.
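Values reported by a plotting tool can be double-checked by direct evaluation. The short Python sketch below, which is an illustration and not part of the Wolfram Alpha workflow, evaluates the cubic exactly at the reported point and compares it with nearby points. The step size of 1/100 is an arbitrary choice.

```python
from fractions import Fraction

def f(x):
    return x**3 + 5 * x**2

x_star = Fraction(-10, 3)
value = f(x_star)
print(value)

# Nearby points give smaller values, as expected at a local maximum.
step = Fraction(1, 100)
print(f(x_star - step) < value and f(x_star + step) < value)
```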

You can ask Wolfram Alpha to produce a plot over just one part of the domain. For example, we may
like to plot the same function just on the interval [2, 5]. Below is one way of producing such a plot.

You can also view the plots of several functions on the same pair of axes. For example, the plots of
two functions, $f(x) = x^3 + 5x^2$ and $g(x) = e^{-2x}$, are shown below on the interval [−1, 1].

1.4 Functions: Activity 1
Question 1: Consider the plot below of the function $f(x) = x^3 - 3x^2$.

(a) By looking at the plot, find the zeroes of $f$. Check that these values are actually zeroes by setting
$f(x) = 0$ and solving $0 = x^3 - 3x^2$. A1.15
(b) Evaluate $f(2)$, then evaluate both $f(1.9)$ and $f(2.1)$. What type of point is $x = 2$? What type
of point is $x = 0$? A1.16
(c) Find two numbers $x_1$ and $x_2$ such that $f(x_1) = f(x_2)$. What does this tell you about the
function? A1.17
(d) Is $f$ convex, concave, or neither? A1.18
(e) Restrict the domain of $f$ to find a new function $g$ with the same rule ($g(x) = x^3 - 3x^2$) so that $g$
is one-to-one. You do not need to find the inverse of $g$. A1.19
Question 2: Consider the plot below of the function $f(x) = 2e^{\frac{1}{3}x}$.

(a) Is $f$ convex, concave, or neither? A1.20
(b) Find the inverse function $f^{-1}$ and give its domain. Why is it possible to find the inverse
function? A1.21
(c) Is $f^{-1}$ convex, concave, or neither? A1.22
(d) Do $f$ or $f^{-1}$ have any zeroes? If so, find them. A1.23
(e) Do $f$ or $f^{-1}$ have any stationary points? If so, what are they? A1.24
Question 3: Suppose that $f$ is a polynomial function such that $f(x)$ is positive only when $1 < x < 3$.
(a) What can you say about the number of zeroes that $f$ has? A1.25
(b) What can you say about the number of stationary points that $f$ has between $x = 1$ and
$x = 3$? A1.26

1.5 Some useful functions

1.5.1 Linear functions


A linear function is a function that can be written as
𝑓(𝑥) = 𝑚𝑥 + 𝑏
where 𝑚 and 𝑏 are real numbers.

Obviously, linear functions are so-called because their plots form straight lines. The line corresponding
to 𝑓(𝑥) = 𝑚𝑥 + 𝑏 intercepts the vertical axis at 𝑏, because 𝑓(0) = 𝑏. Also, the line has slope 𝑚,
meaning that an increase of 𝑑 in the horizontal co-ordinate corresponds to an increase of 𝑑𝑚 in the
vertical co-ordinate.
Example. The function $f(x) = \frac{1}{2}x - 1$ is a linear function. Its plot is given below. Note
that the line intercepts the vertical axis at −1 and has slope $\frac{1}{2}$.
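The slope and intercept of a linear function can be recovered from any two points on its line. The sketch below uses a hypothetical helper, `line_through`, and two illustrative points chosen to lie on a line with slope 1/2 and vertical intercept −1.

```python
from fractions import Fraction

def line_through(p, q):
    """Return (m, b) for the line f(x) = m*x + b through the points p and q."""
    (x1, y1), (x2, y2) = p, q
    if x1 == x2:
        raise ValueError("a vertical line is not the plot of a function")
    m = Fraction(y2 - y1, x2 - x1)   # rise over run
    b = y1 - m * x1                  # value of the function at x = 0
    return m, b

# Two integer points chosen for illustration.
m, b = line_through((0, -1), (4, 1))
print(m, b)
```

This helper assumes integer coordinates so that `Fraction` can represent the slope exactly.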

Linear functions (that are not constant functions with slope 𝑚 = 0) have one zero and no stationary
points.
Questions.
1. $f$ is a linear function such that $f(0) = 2$ and $f(\frac{5}{2}) = 3$. Give a formula for $f(x)$. A1.27
2. A linear function has a negative 𝑦 intercept and a positive zero. What can you say about its
slope? A1.28

1.5.2 Polynomial functions


A polynomial function of degree 𝑛 is a function that can be written as
$$f(x) = a_n x^n + a_{n-1} x^{n-1} + \cdots + a_1 x + a_0$$
where $a_n, a_{n-1}, \ldots, a_0$ are real numbers with $a_n \neq 0$.
Constant functions are polynomial functions of degree 0 and linear functions are polynomial functions
of degree 1. Polynomial functions of degree 2 and 3 are often called quadratic functions and cubic
functions.
Example. The functions $f(x) = x^2 - 2x + \frac{1}{2}$ and $g(x) = \frac{1}{4}x^3 - \frac{1}{4}x^2 - x$ are polynomial
functions of degree 2 and 3 respectively. Their plots are given below.

It can be proven that a polynomial function of degree 𝑛 has at most 𝑛 zeroes and at most 𝑛 − 1
stationary points. So, given their degrees, the functions in the example above have the greatest
number of zeroes and stationary points possible. For many functions this is not the case, however.
For example, the cubic function $f(x) = \frac{1}{4}x^3 + \frac{1}{2}x^2 + \frac{1}{2}x$, plotted below, has one zero and no stationary
points.
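The claim about this cubic can be checked by hand or with a few lines of code. The sketch below reads the cubic as x³/4 + x²/2 + x/2, which factors as x(x²/4 + x/2 + 1/2), and counts real zeroes of quadratics using the discriminant. It borrows one idea from the Calculus chapter, stated here as an assumption: stationary points are the zeroes of the derivative, which for this cubic is 3x²/4 + x + 1/2. The helper name `real_root_count` is hypothetical.

```python
from fractions import Fraction

def real_root_count(a, b, c):
    """Number of distinct real zeroes of a*x^2 + b*x + c (a != 0), via the discriminant."""
    disc = b * b - 4 * a * c
    if disc > 0:
        return 2
    if disc == 0:
        return 1
    return 0

# One zero at x = 0, plus the zeroes of the quadratic factor x^2/4 + x/2 + 1/2.
zeroes = 1 + real_root_count(Fraction(1, 4), Fraction(1, 2), Fraction(1, 2))

# Stationary points: zeroes of the derivative 3x^2/4 + x + 1/2 (assumed from Chapter 2).
stationary = real_root_count(Fraction(3, 4), 1, Fraction(1, 2))

print(zeroes, stationary)
```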

Demonstration: The number of distinct real roots of a real polynomial
Demonstration: Where are my roots?
Questions.
1. What is the degree of the polynomial function 𝑓(𝑥) = (𝑥 − 2)(𝑥 − 1)(𝑥 + 1)? A1.29
2. Using your answer to 1, what can you say about the zeroes of 𝑓(𝑥) = (𝑥 − 2)(𝑥 − 1)(𝑥 + 1)? A1.30

1.5.3 Exponential functions


An exponential function is a function that can be written as
$$f(x) = a^x$$
where $a$ is a positive real number, sometimes called the base.
This definition means that when $x$ is increased by 1, the value of $f$ is multiplied by $a$. Note that
allowing $a$ to be negative would cause problems: for example, $(-1)^{1/2} = \sqrt{-1}$ is not a real number.

Example. The function $f(x) = 2^x$ is an exponential function. Its plot is given below.

This is typical of exponential functions with bases greater than 1. For bases between 0 and 1, exponential
functions take on a different appearance.
Example. The function $f(x) = (\frac{1}{2})^x$ is an exponential function. Its plot is given below.

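There are a number of useful laws associated with exponential functions. For positive real numbers $a$
and $b$ and any real numbers $x$ and $y$:

$$a^0 = 1 \qquad a^{-x} = \frac{1}{a^x} \qquad a^{1/x} = \sqrt[x]{a}$$

$$a^x a^y = a^{x+y} \qquad \frac{a^x}{a^y} = a^{x-y} \qquad (a^x)^y = a^{xy}$$

$$(ab)^x = a^x b^x \qquad \left(\frac{a}{b}\right)^x = \frac{a^x}{b^x}$$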
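A quick numerical spot check can help when trying to remember which exponent law applies. The Python sketch below tests several of the laws for one arbitrary choice of values. Agreement at a single point is a sanity check only, not a proof, and `math.isclose` is used because floating-point arithmetic is not exact.

```python
import math

# Arbitrary sample values with a, b > 0 (chosen for illustration only).
a, b, x, y = 2.0, 3.0, 1.5, -0.5

checks = {
    "a^x * a^y = a^(x+y)": math.isclose(a**x * a**y, a ** (x + y)),
    "a^x / a^y = a^(x-y)": math.isclose(a**x / a**y, a ** (x - y)),
    "(a^x)^y = a^(xy)": math.isclose((a**x) ** y, a ** (x * y)),
    "(ab)^x = a^x * b^x": math.isclose((a * b) ** x, a**x * b**x),
    "a^(-x) = 1 / a^x": math.isclose(a ** (-x), 1 / a**x),
}

for law, holds in checks.items():
    print(law, holds)
```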
One very important base for exponential functions is the mathematical constant 𝑒 which is an
irrational real number approximately equal to 2.71828. We'll see some reasons why 𝑒 is special later.
The function $f(x) = e^x$ is sometimes written as exp(𝑥).

Exponential functions with base $a \neq 1$ have no stationary points. They also have no zeroes, because $a^x$ is always
positive. Every exponential function intersects the vertical axis at 1, because $a^0 = 1$ for any positive real
number $a$. Notice that exponential functions with base $a \neq 1$ are one-to-one. This means that they have inverse
functions, a fact that will be important in the next section.
Demonstration: Graphs of exponential functions
Questions.
1. What can you say about $a^x$ when $0 < a < 1$ and $x < 0$? A1.31
2. The amount of radiation a sample emits on a given day is always three quarters of the amount
it emitted the day before. What fraction of the radiation that it emitted on day 0 does it emit
on day 2? On day 20? A1.32

1.5.4 Logarithmic functions


We often solve equations by applying inverse functions:
• We solve 𝑥 + 𝑎 = 𝑏 for 𝑥 by subtracting 𝑎 from both sides.
• We solve $ax = b$ for $x$ by dividing both sides by $a$.
• We solve $x^a = b$ for $x$ by taking the $a$th root of both sides.
But how can we solve equations like $a^x = b$? To do this we need an inverse function for exponentiation.
The logarithm to base $a$, $\log_a(x)$, is defined as the inverse $f^{-1}$ of the function $f(x) = a^x$. This means
that
$$\log_a(a^x) = x \quad \text{and} \quad a^{\log_a(x)} = x.$$
Because $a^x$ is always positive, the domain of the function $\log_a$ is just the set of positive real numbers.
The logarithm of a negative number, or of 0, is not defined. The notation $\ln(x)$ is used for the
logarithm to base $e$ of $x$.
Because $\log_a(x)$ is the inverse function of $a^x$, we can plot it by reflecting the plot of $a^x$ about the line
$y = x$.
Example. The plot of $\log_2(x)$ is given below.

You can see that this is the reflection of the plot of $2^x$ above.

There are also useful laws associated with logarithms. These can be obtained from the exponential
laws we saw above. For positive real numbers 𝑎, 𝑥, 𝑦 and 𝑏:

$$\log_a(1) = 0 \qquad \log_a(x^y) = y \log_a(x)$$

$$\log_a(xy) = \log_a(x) + \log_a(y) \qquad \log_a\left(\frac{x}{y}\right) = \log_a(x) - \log_a(y)$$

$$\log_a(x) = \frac{\log_b(x)}{\log_b(a)}$$
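The change of base law is what makes it possible to solve an exponential equation with a calculator or a program that only offers natural logarithms. The following Python sketch uses a hypothetical helper, `solve_exponential`, and illustrative values (base 2, target 10) that are not taken from the questions.

```python
import math

def solve_exponential(a, b):
    """Solve a**x == b for x, assuming a > 0, a != 1 and b > 0."""
    # Change of base: log_a(b) = ln(b) / ln(a).
    return math.log(b) / math.log(a)

x = solve_exponential(2, 10)

# Substituting x back in should recover 10, up to rounding error.
print(x, 2 ** x)

# math.log also accepts the base as an optional second argument.
print(math.isclose(x, math.log(10, 2)))
```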

Logarithmic functions have no stationary points. Every logarithmic function has exactly one zero (at
1).
Questions.
1. Solve $3^x = 400$ for $x$. A1.33
2. If inflation means that prices increase by 2% every year, how many years does it take them to
double? A1.34

1.6 Functions: Activity 2


Question 1: Below is a table of values and a plot of the population of Australia for each year from
2004 to 2014. The vertical axis is measured in millions of people.

Year                       2004   2005   2006   2007   2008   2009   2010   2011   2012   2013   2014
Population (in millions)   20.1   20.4   20.7   21.1   21.5   21.9   22.3   22.6   22.9   23.2   23.5

We can construct a function 𝑓 which ”fills in the gaps” and gives the population of Australia at any
point in between the points listed above. Come up with a linear function (𝑓(𝑥) = 𝑚𝑥 + 𝑏) whose
plot closely follows the plot above. You can do this by substituting in suitable values from the table
above to find 𝑚 and 𝑏. A1.35
Does this function give sensible values outside of the interval [2004, 2014]? A1.36
Now consider the plot below of the population of Australia over an extended time period.

A linear function is no longer suitable for this plot. What type of function might be suitable? A1.37
Question 2: Sometimes data will fit a logarithmic or inverse function much better than a linear
one. If you suspect that your data will not fit a linear function very well, a good way to test for
that is to apply a function (logarithm, square, reciprocal etc.) to the x-axis data, then plot this
against the y-axis data and hope that the resulting graph looks linear. Then we can go through a
similar procedure to the one in the previous question to find a function of the form $y = m\ln(x) + b$,
$y = mx^2 + b$ or $y = \frac{m}{x} + b$, depending on what function was used to transform the data.

Consider the following table containing the GDP per capita (measured in units of $1000 per person)
and child mortality rate (deaths per 1000 births) for six countries.

GDP per capita 6.5 14.8 23.1 44.1 51.2 73.2


Child mortality rate 1.6 2.5 3.1 3.9 3.7 4.1

Plot this data with the GDP per capita on the horizontal axis and the child mortality rate on the
vertical axis. You may prefer to use software such as Microsoft Excel to do this. A1.38
This data does not look very linear (we would need a concave function to fit it), so we will
apply some transformations to the data. One common transformation is the squared transformation.
Extend the above table to include a row of the squares of the GDP per capita values. (That is, apply
the function $f(x) = x^2$ to each value in the first row.) A1.39
Plot the transformed data with the square of the GDP per capita on the horizontal axis and the child
mortality rate on the vertical axis. Does this look more linear than the original plot? A1.40

Now apply the same process, but with the reciprocal function this time. (That is, apply the function
$f(x) = \frac{1}{x}$ to the GDP per capita data.) Plot your data to see if it is more linear than the original
plot. A1.41
Do this once more using the function $f(x) = \ln(x)$. A1.42
Using your previous answers, what type of function would you use to model the relationship between
GDP per capita and child mortality rate? A1.43
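If you would rather back up the "does it look linear?" judgement with a number, one option is to compute a correlation coefficient for each transformed data set. This is a swapped-in technique, not the plotting method the activity asks for: the sketch below assumes the Pearson correlation coefficient (values near $+1$ or $-1$ suggest a nearly straight-line relationship), and the function name `pearson_r` is ours.

```python
import math

gdp = [6.5, 14.8, 23.1, 44.1, 51.2, 73.2]
mortality = [1.6, 2.5, 3.1, 3.9, 3.7, 4.1]

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Each transformation is applied to the horizontal-axis (GDP) data only.
transforms = {
    "identity": lambda x: x,
    "square": lambda x: x ** 2,
    "reciprocal": lambda x: 1 / x,
    "ln": math.log,
}

scores = {name: pearson_r([t(x) for x in gdp], mortality)
          for name, t in transforms.items()}
for name, r in scores.items():
    print(f"{name:10s} r = {r:+.3f}")
```

A coefficient is only a summary, so it is still worth looking at the plots: a curved pattern can have a fairly large coefficient.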

1.7 Functions: Activity 3


Question 1:
Consider the function $f(x) = x^2 - 2x$.
(a) Evaluate $\sum_{x=3}^{5} f(x)$. A1.44
(b) Evaluate $\prod_{x=3}^{5} f(x)$. A1.45
(c) Evaluate $\prod_{x=2}^{50} f(x)$. A1.46

Question 2 (Factorial):
There is an important function in mathematics called the factorial function. Instead of being written
in the usual way as $f(x)$, it is usually written as $n!$. Its domain is $\mathbb{N} = \{0, 1, 2, \ldots\}$ and it is defined
by $n! = \prod_{x=1}^{n} x$. By convention, $0! = 1$. But otherwise, $n!$ is found by multiplying together every
positive integer between 1 and $n$. For example,

$$4! = 1 \times 2 \times 3 \times 4 = 24.$$
(a) Evaluate the sum $\sum_{n=0}^{3} \frac{1}{n!}$. A1.47
(b) Now evaluate $\sum_{n=0}^{7} \frac{1}{n!}$ (using a calculator if you wish). A1.48
(c) Do you recognise what number your answer is close to? A1.49
(d) Write the function $f(x) = \sum_{n=0}^{3} \frac{1}{n!} x^n$ without using sigma notation. Your answer will be a
degree 3 polynomial. A1.50
(e) Look at the plots of $f(x) = \sum_{n=0}^{3} \frac{1}{n!} x^n$ and $g(x) = e^x$ using Wolfram Alpha or some other
graphing software. Put both plots on the same graph if you can. What do you notice? A1.51
(f) See what happens when you replace the plot of $\sum_{n=0}^{3} \frac{1}{n!} x^n$ with $\sum_{n=0}^{4} \frac{1}{n!} x^n$ and $\sum_{n=0}^{5} \frac{1}{n!} x^n$ and
so on.
Question 3:
Consider the function $f(x) = \log_e(x)$.
(a) Write $\sum_{x=1}^{4} f(x)$ without using sigma notation. A1.52

(b) Use the logarithm laws to write your answer as a single logarithm. Use the factorial function
from Question 2 in your answer. A1.53

1.8 Answers
1.1 $\frac{2}{5} + \frac{3}{10} + \frac{4}{17}$.

1.2 There are different possibilities here. For example, $\sum_{x=2}^{5} \frac{1}{2x}$ or $\sum_{x \in \{4,6,8,10\}} \frac{1}{x}$.

1.3 31 × 34 × 37.
1.4 Roughly 𝑓(−1) = 2.4, 𝑓(1) = −0.3 and 𝑓(3) = 0.5.
1.5 It might be (roughly) −0.3, 3 or 4.6.
1.6 No right or wrong here. But the points where it touches the 𝑥-axis, the point where it crosses
the 𝑦-axis, the bottom of the ”dip” at 𝑥 = 1 and the top of the ”rise” at 𝑥 = 4 are definitely all
important.
1.7 It's decreasing when 𝑥 < 1 and when 𝑥 > 4, and increasing when 1 < 𝑥 < 4.
1.8 One example is 𝑓(𝑥) = 2𝑥.
1.9 One example is 𝑓(𝑥) = 𝑥2 + 1.
1.10 One example is $f(x) = 2^x$.
1.11 Yes. The inverse function is $f^{-1}(y) = \frac{y}{2} + 2$.
1.12 No. The function is not one-to-one because, for example, 𝑓(1) = 𝑓(−1).

1.13 Yes. The inverse function is 𝑓 −1 (𝑦) = 3 𝑦 − 2.
1.14 It's convex. Notice if you place two fingers anywhere on the function's plot, then the plot is
entirely below the straight line between your fingers.
1.15 Looking at the plot suggests that $x = 0$ and $x = 3$ are the zeroes. Taking out $x^2$ as a
common factor gives $0 = x^2(x - 3)$, so any zero must satisfy either $0 = x^2$ or $0 = x - 3$.
This confirms that $x = 0$ and $x = 3$ are the zeroes.
1.16 $f(2) = -4$, $f(1.9) = -3.971$ and $f(2.1) = -3.969$. Since evaluating $f$ at the points close to
$x = 2$ gives values greater than $f(2)$, this suggests that $x = 2$ is a local minimum. By similar
reasoning, $x = 0$ is a local maximum.
1.17 There are many correct answers here, one being $x_1 = 0$ and $x_2 = 3$, since $f(0) = 0 = f(3)$.
This means that $f$ is not one-to-one.
1.18 The function is neither concave nor convex. To see this, draw a straight line between the points
on the plot where $x = -1$ and $x = 3$. The plot is not entirely below the straight line, nor is it
entirely above the straight line.
1.19 There are many correct answers here. One possible answer is to let 𝑔 have the domain
(−∞, 0].
1.20 The function is convex. To see this, draw a straight line between any two points on the plot
and notice that the straight line lies entirely above the plot.
1.21 Solve the equation 𝑦 = 𝑓(𝑥) for 𝑥.

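$$\begin{aligned}
y &= 2e^{\frac{1}{3}x} \\
\tfrac{1}{2}y &= e^{\frac{1}{3}x} \\
\ln(\tfrac{1}{2}y) &= \tfrac{1}{3}x \\
3\ln(\tfrac{1}{2}y) &= x
\end{aligned}$$

The inverse function is $f^{-1}(y) = 3\ln(\tfrac{1}{2}y)$ and its domain is $(0, \infty)$. It is possible to find the
inverse function because $f$ is a one-to-one function.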
1.22 The function is concave. The plot of 𝑓 is given below, and after drawing a straight line
between any two points on the plot, the plot is entirely above the straight line.

1.23 Since the plot of $f$ lies entirely above the horizontal axis, $f$ has no zeroes. However, $f^{-1}$
appears to have a zero at $y = 2$. This can be verified by solving $f^{-1}(y) = 0$ for $y$.

$$\begin{aligned}
3\ln(\tfrac{1}{2}y) &= 0 \\
\ln(\tfrac{1}{2}y) &= 0 \\
\tfrac{1}{2}y &= e^0 \\
\tfrac{1}{2}y &= 1 \\
y &= 2
\end{aligned}$$

1.24 Since both 𝑓 and 𝑓 −1 are always increasing (the plots of both have no ”flat” points), neither
of them have any stationary points.
1.25 Below are the plots of a few example polynomial functions which satisfy the above condition,
but have a different number of zeroes each.

So there is not enough information to say exactly how many zeroes 𝑓 will have. However, 𝑓
must have at least two zeroes, namely 𝑥 = 1 and 𝑥 = 3.

1.26 Again, there is not enough information to say exactly how many stationary points there
are between 𝑥 = 1 and 𝑥 = 3. However, there must be at least one. There must also be more
local maxima than local minima. Below are the plots of some example polynomial functions to
illustrate this.

1.27 $f(x) = \frac{2}{5}x + 2$.
1.28 The slope must be positive.
1.29 3 because $f(x) = x^3 - 2x^2 - x + 2$.
1.30 The zeroes are exactly 2, 1 and −1. Obviously $f(2) = f(1) = f(-1) = 0$ and because the
degree is 3 we know that there are no other zeroes.
1.31 $a^x > 1$. What about in the other three cases for $a$ and $x$?
1.32 $(\frac{3}{4})^2 \approx 56\%$ on day 2. $(\frac{3}{4})^{20} \approx 0.3\%$ on day 20.

1.33 $x = \log_3(400) = \frac{\ln(400)}{\ln(3)} \approx 5.45$.
1.34 Solving $(1.02)^x = 2$ for $x$, we see that it takes $\log_{1.02}(2) \approx 35$ years.
1.35 Substituting in the values for 2004 and 2014 gives

20.1 = 2004𝑚 + 𝑏
23.5 = 2014𝑚 + 𝑏.

Solving these simultaneously gives 𝑚 = 0.34 and 𝑏 = −661.26.
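The same two-point calculation can be sketched in Python. Subtracting the first equation from the second eliminates $b$ and gives $m$; substituting back gives $b$. The helper name `line_through` is ours, and the year 2009 is just an arbitrary point at which to compare the model with the table.

```python
def line_through(p, q):
    """Return (m, b) for the line y = m*x + b through the points p and q."""
    (x1, y1), (x2, y2) = p, q
    m = (y2 - y1) / (x2 - x1)   # subtracting the two equations eliminates b
    b = y1 - m * x1             # substitute m back into the first equation
    return m, b

m, b = line_through((2004, 20.1), (2014, 23.5))
print(round(m, 2), round(b, 2))

def f(x):
    return m * x + b

# Compare the model with a year inside the interval; the table lists 21.9 for 2009.
print(round(f(2009), 2))
```

Choosing a different pair of years from the table gives a slightly different line, which is why the question asks only for a function whose plot closely follows the data.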


1.36 No. For example, 𝑓(0) = −661.26. This says that during the year 0 there was a negative
population.
1.37 This looks similar to the plot of an exponential function.

1.38
1.39

GDP per capita 6.5 14.8 23.1 44.1 51.2 73.2
(GDP per capita)$^2$ 42.25 219.04 533.61 1944.81 2621.44 5358.24
Child mortality rate 1.6 2.5 3.1 3.9 3.7 4.1

1.40
This plot is certainly not linear. It looks even less linear than the original plot.
1.41

GDP per capita 6.5 14.8 23.1 44.1 51.2 73.2


1/(GDP per capita) 0.154 0.068 0.043 0.023 0.020 0.014
Child mortality rate 1.6 2.5 3.1 3.9 3.7 4.1

This plot looks a little better, but is still not quite linear. This time a convex function would
be needed to fit the data.
1.42

GDP per capita 6.5 14.8 23.1 44.1 51.2 73.2


ln(GDP per capita) 1.87 2.69 3.14 3.79 3.94 4.29
Child mortality rate 1.6 2.5 3.1 3.9 3.7 4.1

This plot resembles a straight line very closely.
1.43 Since the logarithm transformation resulted in a plot closely resembling a straight line, we
would assume that 𝑦 = 𝑚 ln(𝑥) + 𝑏. Here, 𝑥 is the GDP per capita, 𝑦 is the child mortality
rate, and 𝑚 and 𝑏 are constants chosen to fit the data.
1.44 First, evaluate the terms individually.

𝑓(3) = 3
𝑓(4) = 8
𝑓(5) = 15

Adding them up gives $\sum_{x=3}^{5} f(x) = f(3) + f(4) + f(5) = 3 + 8 + 15 = 26$.

1.45 $\prod_{x=3}^{5} f(x) = f(3) \times f(4) \times f(5) = 3 \times 8 \times 15 = 360$.
1.46 Notice that $f(2) = 0$. So when you multiply all of these numbers together, the result will be
0 regardless of what the other terms are. $\prod_{x=2}^{50} f(x) = 0$.

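1.47

$$\begin{aligned}
\sum_{n=0}^{3} \frac{1}{n!} &= \frac{1}{0!} + \frac{1}{1!} + \frac{1}{2!} + \frac{1}{3!} \\
&= \frac{1}{1} + \frac{1}{1} + \frac{1}{2} + \frac{1}{6} \\
&= \frac{6}{6} + \frac{6}{6} + \frac{3}{6} + \frac{1}{6} \\
&= \frac{16}{6} \\
&= \frac{8}{3}
\end{aligned}$$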

1.48 $\sum_{n=0}^{7} \frac{1}{n!} \approx 2.71825$
1.49 2.71825 is close to the mathematical constant $e$, which was introduced earlier. In fact, the
more terms you add up in the above sum, the closer your answer will be to $e$.
1.50 Remember that $x^0 = 1$. Therefore, $\sum_{n=0}^{3} \frac{1}{n!} x^n = 1 + x + \frac{1}{2}x^2 + \frac{1}{6}x^3$.
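These partial sums are easy to experiment with on a computer. The sketch below assumes Python's standard `math.factorial`; the function name `exp_partial_sum` is ours. It evaluates $\sum_{n=0}^{N} \frac{1}{n!} x^n$ and prints it next to $e^x$ for a few values of $N$.

```python
import math

def exp_partial_sum(x, N):
    """Sum of x**n / n! for n = 0, 1, ..., N."""
    return sum(x ** n / math.factorial(n) for n in range(N + 1))

# Compare the partial sums at x = 1 with the constant e.
for N in (3, 7, 12):
    print(N, exp_partial_sum(1, N), math.e)

# Away from x = 0 more terms are needed for a good approximation.
for N in (3, 7, 12):
    print(N, exp_partial_sum(3, N), math.exp(3))
```

Trying larger values of `x` shows why the degree 3 polynomial only tracks $e^x$ closely near $x = 0$.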

1.51 You should notice that the plots are quite close around 𝑥 = 0. In fact they intersect at 𝑥 = 0.
Both plots are shown below. The orange curve is the plot of 𝑓 and the blue curve is the plot of
𝑔.

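1.52 $\sum_{x=1}^{4} f(x) = \log_e(1) + \log_e(2) + \log_e(3) + \log_e(4)$.
1.53

$$\begin{aligned}
\sum_{x=1}^{4} f(x) &= \log_e(1) + \log_e(2) + \log_e(3) + \log_e(4) \\
&= \log_e(1 \times 2) + \log_e(3 \times 4) \\
&= \log_e(1 \times 2 \times 3 \times 4) \\
&= \log_e(4!)
\end{aligned}$$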

Chapter 2

Calculus

2.1 Differentiation
Here, we're going to introduce derivatives and higher derivatives of functions. These are really useful,
but we'll hold off most of our discussion of why they are until the next chapter.

2.1.1 Tangents
It can be very useful to know whether a function 𝑓 is increasing or decreasing at a point and whether
it is doing so steeply or gently. In order to do this, it helps to define tangents to a function.
Let 𝑓 be a function. The tangent to 𝑓 at a point 𝑎 (if it exists) is the straight line that best
approximates 𝑓(𝑥) when 𝑥 is near 𝑎.
This definition might not seem very intuitive at first, but tangents are easy to visualise:
Example. A plot of the function $f(x) = x^2 - 2x + \frac{1}{2}$ (blue) and its tangents at $x = \frac{3}{4}$
(orange) and $x = 2$ (green).

Notice that the slopes of these tangents give us information about the behavior of $f$ at these points:
$f$ is decreasing gently at $x = \frac{3}{4}$ and increasing more steeply at $x = 2$. There is a mathematical method
that allows us to get this information exactly and without having to resort to plotting the function.
The demonstration here is a nice visualisation of tangents to a variety of functions.
Questions.
1. Estimate the slope of tangents to the function in the above example at $x = -\frac{1}{4}$, $x = 1$, and
$x = \frac{3}{2}$. A2.1
2. A tangent to a circle is sometimes defined as a line that touches the circle in exactly one point.
Is there a problem with defining a tangent to a function similarly? A2.2

2.1.2 Derivatives
Let 𝑓 be a function. The derivative of 𝑓 at a point 𝑎 is the slope of the tangent to 𝑓 at 𝑥 = 𝑎 (the
derivative exists exactly if the tangent does).
For the nice smooth functions we are concentrating on here, tangents and derivatives always exist
(at the end of this section, we'll give an example of a non-smooth situation where a tangent and
derivative do not exist). Functions with this property are called differentiable. For such a function
𝑓 we can do even better than finding the derivative of 𝑓 for a few different values 𝑎 - we can find
another function denoted 𝑓’ (with the same domain as 𝑓) such that 𝑓 ′ (𝑎) gives the derivative of 𝑓 at
𝑎 for any 𝑎.
We call the function 𝑓’ the derivative of the function 𝑓.
Example. The function $f(x) = x^2 - 2x + \frac{1}{2}$ that we plotted above has a derivative $f'(x) = 2x - 2$.
Using $f'$, we can calculate that the slope of the orange tangent (at $x = \frac{3}{4}$) is $f'(\frac{3}{4}) = -\frac{1}{2}$ and the
slope of the green tangent (at $x = 2$) is $f'(2) = 2$. What would the slopes of tangents at $x = 1$ and
$x = \frac{3}{2}$ be?
Sometimes 𝑓’ is significant in its own right (for example, if 𝑓(𝑥) gives the number of customers for

a company at time 𝑥, then 𝑓’ gives the rate at which the company is gaining or losing customers).
Most importantly, however, knowing 𝑓’ can give us information about 𝑓 itself (for example, it can
help us to find maxima and minima of 𝑓).
In general, 𝑓’ can be thought of as the rate of change of 𝑓. When 𝑓’ is positive 𝑓 is increasing and
when 𝑓’ is negative 𝑓 is decreasing. When the magnitude of 𝑓’ is large 𝑓 is increasing or decreasing
rapidly and when the magnitude of 𝑓’ is small 𝑓 is increasing or decreasing slowly.
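One way to make "rate of change" concrete is to estimate the slope at a point from two nearby values of the function. The sketch below is a numerical approximation rather than the exact method developed in this chapter; the helper name `slope_estimate` and the step size `h` are our own choices.

```python
def f(x):
    return x ** 2 - 2 * x + 0.5

def slope_estimate(func, a, h=1e-6):
    """Slope of the straight line through the points at a - h and a + h."""
    return (func(a + h) - func(a - h)) / (2 * h)

# Compare with the slopes of the orange and green tangents in the example above.
for a in (0.75, 2):
    print(a, slope_estimate(f, a))
```

The estimates agree with the tangent slopes $-\frac{1}{2}$ and $2$ quoted in the example, up to rounding error.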
The derivatives of many of the basic functions we have seen are well known. For $c, n, a, d \in \mathbb{R}$ with
$n \neq 0$ and $a > 0$:
• If $f(x) = c$, then $f'(x) = 0$.
• If $f(x) = x^n$, then $f'(x) = nx^{n-1}$.
• If $f(x) = a^{cx+d}$, then $f'(x) = c\ln(a)a^{cx+d}$. (If $f(x) = e^{cx+d}$, then $f'(x) = ce^{cx+d}$.)
• If $f(x) = \log_a(cx + d)$, then $f'(x) = \frac{c}{\ln(a)(cx+d)}$. (If $f(x) = \ln(cx + d)$, then $f'(x) = \frac{c}{cx+d}$.)

We can combine these with two important facts to help us to find derivatives of more functions. For
differentiable functions 𝑓 and 𝑔 and 𝑐 ∈ ℝ:
• The derivative of 𝑐𝑓(𝑥) is 𝑐𝑓 ′ (𝑥).
• The derivative of 𝑓(𝑥) + 𝑔(𝑥) is 𝑓 ′ (𝑥) + 𝑔′ (𝑥).
(Also the derivative of 𝑓(𝑥) − 𝑔(𝑥) is 𝑓 ′ (𝑥) − 𝑔′ (𝑥)).
Example. Let $f(x) = -\frac{1}{3}x^3 + 2x^2 - 3x + 1$. To find the derivative of $f$, we can think as
follows:
• The derivative of $x^3$ is $3x^{3-1} = 3x^2$. So the derivative of $-\frac{1}{3}x^3$ is $-x^2$.
• The derivative of $x^2$ is $2x^{2-1} = 2x$. So the derivative of $2x^2$ is $4x$.
• The derivative of $x$ is $x^{1-1} = x^0 = 1$. So the derivative of $-3x$ is $-3$.
• The derivative of 1 is 0.
Putting the above together, we have $f'(x) = -x^2 + 4x - 3$.
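The term-by-term reasoning above is mechanical enough to hand to a computer. As a small sketch (the list representation and the name `differentiate` are our own assumptions), store a polynomial as a list whose entry `k` is the coefficient of $x^k$ and apply the power rule to each term.

```python
from fractions import Fraction

def differentiate(coeffs):
    """coeffs[k] is the coefficient of x**k; return the derivative's coefficients.

    Each term c*x**k becomes k*c*x**(k-1); the constant term disappears.
    An empty list stands for the zero polynomial.
    """
    return [k * c for k, c in enumerate(coeffs)][1:]

# f(x) = 1 - 3x + 2x^2 - (1/3)x^3, listed from the constant term upwards.
f = [Fraction(1), Fraction(-3), Fraction(2), Fraction(-1, 3)]
df = differentiate(f)
print(df)   # coefficients of -3 + 4x - x^2
```

Using `Fraction` keeps the coefficient $-\frac{1}{3}$ exact, so the derivative's coefficients come out as whole numbers rather than rounded decimals.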
Here's a visualisation that shows plots of a polynomial and its derivative on the same set of axes.
Questions.
1. Find the derivative of $f(x) = x^2 - 2x + \frac{1}{2}$. How close were your guesses for the tangent slopes
in question 1 in the last section? A2.3
2. Find the derivative of $g(x) = 4x^3 - 2x^2 + e^{2x}$. What is the slope of a tangent to $g$ at $x = 2$? A2.4
3. Find the derivative of $h(x) = \ln(3x) - 7$. What happens to the slope of $h$ as $x$ gets very large? A2.5

2.1.3 Finding tangents


Sometimes we want to find the full equation of a tangent line and not just its slope. This video gives
a nice run-down of how this can be done.
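In outline, the tangent to $f$ at $x = a$ is the line through the point $(a, f(a))$ with slope $f'(a)$, so it can be written as $y = mx + c$ with $m = f'(a)$ and $c = f(a) - ma$. A minimal Python sketch of that calculation follows; the helper name `tangent_line` is ours, and the derivative is supplied by hand rather than computed.

```python
def tangent_line(func, dfunc, a):
    """Return (m, c) such that y = m*x + c is the tangent to func at x = a."""
    m = dfunc(a)            # slope of the tangent
    c = func(a) - m * a     # chosen so the line passes through (a, func(a))
    return m, c

f = lambda x: x ** 2 - 2 * x + 0.5
df = lambda x: 2 * x - 2    # derivative of f, found with the rules above

m, c = tangent_line(f, df, 2)
print(m, c)   # the green tangent from the earlier example: y = m*x + c
```

For the green tangent at $x = 2$ this gives slope $2$ and intercept $-\frac{7}{2}$, that is, $y = 2x - \frac{7}{2}$.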

2.1.4 Higher derivatives


We've seen that if a function $f$ is differentiable, then we can find its derivative $f'$. Of course $f'$ is
itself just a function, so if it is differentiable then we can find its derivative which is usually written
$f''$ or $f^{(2)}$. If $f^{(2)}$ is differentiable then we can differentiate it to find $f^{(3)}$ and so on.
The $n$th derivative $f^{(n)}$ of a function $f$ (if it exists) is the function obtained from $f$ by differentiating
$n$ times.

Example. In the last example we saw that the derivative of $f(x) = -\frac{1}{3}x^3 + 2x^2 - 3x + 1$
is $f'(x) = -x^2 + 4x - 3$. So to find the second derivative of $f$, we differentiate $f'(x) =
-x^2 + 4x - 3$ and obtain $f''(x) = -2x + 4$.
We can go on and discover that $f^{(3)}(x) = -2$ and that $f^{(n)}(x) = 0$ for every $n \geq 4$.
Questions.
1. What are the first and second derivatives of $f(x) = e^x$? In general what is the $n$th derivative? A2.6
2. What are the first and second derivatives of $g(x) = e^{2x}$? In general what is the $n$th derivative? A2.7
3. If $h$ is a polynomial function of degree 7, what is $h^{(8)}(x)$? A2.8
Using higher derivatives, we're now able to give a formal definition of what it means for a function to
be smooth.
A function $f$ is smooth if $f^{(n)}$ exists for every $n \geq 1$.
So smooth functions are those that can be differentiated any number of times.
Here's a visualisation that shows plots of a polynomial and its higher derivatives on the same set of
axes.

2.1.5 Warning
Here, we're concentrating on ”nice” functions that are smooth or at least differentiable. Not all
functions are, though - not even all continuous functions.
Example. Think of the function 𝑓(𝑥) = |𝑥|, where |𝑥| is the absolute value of 𝑥. A plot
of 𝑓 is shown below. Notice that 𝑓 is continuous because this plot can be drawn without
lifting pen from paper.

The tangent to 𝑓 at 𝑥 = 𝑎 has slope −1 if 𝑎 < 0 and slope 1 if 𝑎 > 0. So 𝑓 ′ (𝑎) = −1 if


𝑎 < 0 and 𝑓 ′ (𝑎) = 1 if 𝑎 > 0. But there is no well-defined tangent to 𝑓 at 𝑥 = 0, so 𝑓 ′ (0)
does not exist. This means that 𝑓 is not differentiable (because there is a point 𝑎 in its
domain such that 𝑓 ′ (𝑎) does not exist).
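You can see the problem at $x = 0$ numerically by estimating the slope from the left and from the right separately. This is only an illustrative sketch (the helper name `one_sided_slopes` and the step size `h` are our own choices), but it shows the two estimates disagreeing at $0$ and agreeing elsewhere.

```python
def one_sided_slopes(func, a, h=1e-6):
    """Slopes of the lines from a - h to a (left) and from a to a + h (right)."""
    left = (func(a) - func(a - h)) / h
    right = (func(a + h) - func(a)) / h
    return left, right

print(one_sided_slopes(abs, 0))    # the two slopes disagree, so there is no single tangent
print(one_sided_slopes(abs, 1))    # both slopes are close to 1
print(one_sided_slopes(abs, -2))   # both slopes are close to -1
```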

2.1.6 Notation
Here we're using $f'(x)$ to represent the derivative of a function $f(x)$ and $f^{(n)}$ to represent its $n$th
derivative. This is one of two common notations. The other uses $\frac{df}{dx}$ or $\frac{d}{dx}f(x)$ to represent the
derivative and $\frac{d^n f}{dx^n}$ to represent the $n$th derivative.

2.2 Differentiation: Activity 1


Question 1:
Consider the function $f$ which has domain $\mathbb{R}$ and is defined by
$f(x) = 2x^3 - 15x^2 + 36x - 26$.
The plot of $f$ is given below.

(a) Find the rule for the function $f'$. A2.9
(b) Produce a plot of the function $f'$ and use it to find the zeroes of $f'$. A2.10
(c) What do the zeroes of $f'$ tell you about $f$? A2.11
Question 2:
Consider the function $f$ where $f(x) = e^{3x} + 2x^2$. The plot is given below.

(a) Find the equation of the tangent at the point $x = 0$. This equation will be of the form $y = mx + c$
where $m$ is the gradient of the tangent. A2.12
(b) Produce a plot of the function $f$ and the tangent at $x = 0$. A2.13
Question 3:
Find the derivative of $f$ where $f(x) = \log_2(x^3 + 3x^2)$. (Hint: Use logarithm laws.) A2.14
Question 4:
Suppose that you did not know how to differentiate $f(x) = a^{cx+d}$, but instead you only knew the special
case where $d = 0$. That is, pretend you only know that if $f(x) = a^{cx}$, then $f'(x) = c\ln(a)a^{cx}$. Use this
and the laws for exponential functions to show that if $f(x) = a^{cx+d}$, then $f'(x) = c\ln(a)a^{cx+d}$. A2.15
Question 5:
(a) Find $f^{(17)}(x)$ if $f(x) = e^{6x}$. Write your answer in terms of a power. A2.16
(b) Find $f^{(12)}(x)$ if $f(x) = x^{12}$. Write your answer using the factorial function (see Question 2 in
Functions: Activity 3). A2.17

2.3 Optimisation

2.3.1 Stationary points revisited


We can now give a more precise definition of a stationary point:
A stationary point of a differentiable function is a point at which its derivative is equal to 0.
This means that the tangent to a function at a stationary point is a horizontal line. This fits with our
intuitive idea that a function is ”flat” at a stationary point. So we can find the stationary points of a
function 𝑓 by finding zeroes of its derivative, that is, the solutions to 𝑓 ′ (𝑥) = 0.
We can also classify stationary points using the derivative:

• A local minimum is a stationary point where the derivative of the function changes from negative
to positive.
• A local maximum is a stationary point where the derivative of the function changes from positive
to negative.
• When the derivative of the function is positive on both sides of the stationary point or negative
on both sides of the stationary point, it is neither a local minimum nor a local maximum (it is
an inflection point).
Exercise. Think about why these definitions correspond to our intuitive notions of local minima as
”dips” and local maxima as ”crests”.

Example. The derivative of $f(x) = -\frac{1}{3}x^3 + 2x^2 - 3x + 1$ is $f'(x) = -x^2 + 4x - 3$. Solving
$f'(x) = 0$ we get $x = 1$ and $x = 3$. It's not too hard to work out that $f'(x) < 0$ when
$x < 1$, $f'(x) > 0$ when $1 < x < 3$, and $f'(x) < 0$ when $x > 3$. This means we expect a
local minimum at $x = 1$ (where the derivative changes from negative to positive) and a
local maximum at $x = 3$ (where the derivative changes from positive to negative). Finally
we can calculate that $f(1) = -\frac{1}{3}$ and $f(3) = 1$.
Exercise. Given the information we deduced in the example above, sketch what you
expect a plot of $f$ to look like. Now compare it to the plot of $f$ given in A2.18. Check that your sketch and
the plot agree with the information we found.
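The sign-change classification can also be sketched in code: evaluate the derivative a little to the left and a little to the right of the stationary point and compare signs. The function name `classify` and the offset `h` are our own choices, and the check is only trustworthy when `h` is small enough that no other stationary point lies within `h` of the one being tested.

```python
def classify(dfunc, a, h=1e-3):
    """Classify the stationary point a using the sign of the derivative either side."""
    left, right = dfunc(a - h), dfunc(a + h)
    if left < 0 < right:
        return "local minimum"
    if left > 0 > right:
        return "local maximum"
    return "neither"

df = lambda x: -x ** 2 + 4 * x - 3    # derivative from the example above
print(classify(df, 1))
print(classify(df, 3))
```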
Questions.
1. Find any stationary points of $f(x) = 2x - e^x$ and whether they are local minima, local maxima,
or neither. A2.19
2. Let 𝑔 be a function and let 𝑐, 𝑑 ∈ ℝ with 𝑐 ≠ 0. How do the locations of the stationary points
of ℎ(𝑥) = 𝑐𝑔(𝑥) + 𝑑 correspond to the locations of the stationary points of 𝑔(𝑥)? A2.20

2.3.2 Second derivative test


The second derivative of a function gives another way of classifying stationary points:
Let $f$ be a function, and let $a$ be a stationary point of $f$ (so $f'(a) = 0$) such that $f''(a)$ exists.
• If $f''(a) > 0$, then $f$ has a local minimum at $x = a$.
• If $f''(a) < 0$, then $f$ has a local maximum at $x = a$.
When $f''(a) = 0$, the second derivative test doesn't say anything. It might be that $f$ has a local
minimum at $a$, a local maximum at $a$, or neither.
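As a sketch of how the test reads in code, the snippet below applies it to $f(x) = x^3 - 3x$, an illustrative function that is not from these notes. Its derivative $f'(x) = 3x^2 - 3$ is zero at $x = -1$ and $x = 1$, and $f''(x) = 6x$. The name `second_derivative_test` is ours.

```python
def second_derivative_test(d2func, a):
    """Apply the second derivative test at a stationary point a."""
    value = d2func(a)
    if value > 0:
        return "local minimum"
    if value < 0:
        return "local maximum"
    return "inconclusive"

d2f = lambda x: 6 * x                      # second derivative of x^3 - 3x
print(second_derivative_test(d2f, -1))     # local maximum
print(second_derivative_test(d2f, 1))      # local minimum

# For g(x) = x^4 we have g'(0) = 0 and g''(x) = 12x^2, so g''(0) = 0.
print(second_derivative_test(lambda x: 12 * x ** 2, 0))   # inconclusive
```

The last case shows why "inconclusive" is the honest answer: $x^4$ does have a local minimum at $0$, but the test alone cannot tell us so.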
Questions.
1. Find the second derivative of $f(x) = -\frac{1}{3}x^3 + 2x^2 - 3x + 1$. A2.21
2. Use your answer to question 1 to confirm that the second derivative test correctly classifies the
stationary points of the example in the last section. A2.22

2.3.3 Optimisation
Often we're interested in finding where a function takes its least or greatest value over its whole domain
(not just a local minimum or maximum). This is often referred to as an optimisation problem.
Let 𝑓 be a function.
• 𝑓 has a global minimum at 𝑥 = 𝑎 if 𝑓(𝑎) ≤ 𝑓(𝑥) for all 𝑥 in the domain of 𝑓.
• 𝑓 has a global maximum at 𝑥 = 𝑎 if 𝑓(𝑎) ≥ 𝑓(𝑥) for all 𝑥 in the domain of 𝑓.

Not all functions have global minima or maxima. For example the function $f(x) = x$ with domain
$\mathbb{R}$ does not have a global maximum (because $f(x)$ keeps increasing as $x$ does). However if we limit
ourselves to continuous functions whose domain is a closed real interval (i.e. $\{x \in \mathbb{R} : c \leq x \leq d\}$ for
some $c, d \in \mathbb{R}$), then we are guaranteed to have them.
Let $f$ be a continuous function whose domain is a closed real interval $\{x \in \mathbb{R} : c \leq x \leq d\}$. Then $f$ has a
global maximum at at least one point and a global minimum at at least one point.
Furthermore, for functions whose domain is a closed real interval, global maxima and minima can
only occur at certain points:
Let 𝑓 be a function whose domain is a real interval {𝑥 ∈ ℝ ∶ 𝑐 ≤ 𝑥 ≤ 𝑑}. If 𝑎 is a global minimum
point or global maximum point of 𝑓, then one of the following holds:
• 𝑎 = 𝑐 or 𝑎 = 𝑑;
• 𝑓 ′ (𝑎) = 0 (that is, 𝑎 is a stationary point of 𝑓); or
• 𝑓 ′ (𝑎) does not exist.
Points 𝑎 where 𝑓 ′ (𝑎) = 0 or 𝑓 ′ (𝑎) does not exist are called critical points (so every stationary point
is a critical point but not vice-versa).
Example. To find the global minima and maxima of $f(x) = -\frac{1}{3}x^3 + 2x^2 - 3x + 1$ on the
domain $\{x \in \mathbb{R} : 0 \leq x \leq \frac{7}{2}\}$ we act as follows. We first calculate $f'(x) = -x^2 + 4x - 3$.
Solving $f'(x) = 0$ we get $x = 1$ and $x = 3$. So $f$ has $x = 1$ and $x = 3$ as stationary points,
$x = 0$ and $x = \frac{7}{2}$ as boundary points, and there are no points where $f'(x)$ does not exist.
So we only need to calculate $f(0) = 1$, $f(1) = -\frac{1}{3}$, $f(3) = 1$, and $f(\frac{7}{2}) = \frac{17}{24}$. From this
we see that $f$ has a global minimum of $-\frac{1}{3}$ at $x = 1$ and a global maximum of 1 at $x = 0$
and $x = 3$. This fits with the plot of $f$ below.

This video discusses critical points and extrema.
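The procedure in the example (list the boundary points and critical points, evaluate $f$ at each, then compare) can be sketched in a few lines of Python. The candidate points are entered by hand after solving $f'(x) = 0$, and `Fraction` is used only so that values such as $-\frac{1}{3}$ and $\frac{17}{24}$ are exact; the variable names are our own.

```python
from fractions import Fraction

def f(x):
    return Fraction(-1, 3) * x ** 3 + 2 * x ** 2 - 3 * x + 1

# Boundary points 0 and 7/2, stationary points 1 and 3 (found by hand).
candidates = [Fraction(0), Fraction(1), Fraction(3), Fraction(7, 2)]
values = {x: f(x) for x in candidates}

lowest = min(values.values())
highest = max(values.values())
print("global minimum", lowest, "at", [str(x) for x, v in values.items() if v == lowest])
print("global maximum", highest, "at", [str(x) for x, v in values.items() if v == highest])
```

Comparing all candidates, rather than only the stationary points, matters here: the global maximum is attained at the boundary point $x = 0$ as well as at the stationary point $x = 3$.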


Questions.
1. Find the global maxima and minima of $f(x) = \frac{1}{2}x^2 - 10x + 15 + 16\ln(x)$ on the domain
$\{x \in \mathbb{R} : 1 \leq x \leq 14\}$. A2.23

2.3.4 Convexity and concavity revisited


Second derivatives also give a good test for convexity and concavity.
Let 𝑓 be a twice differentiable function.
• 𝑓 is convex if and only if 𝑓 ′′ (𝑥) ≥ 0 for all 𝑥 in the domain of 𝑓.
• 𝑓 is concave if and only if 𝑓 ′′ (𝑥) ≤ 0 for all 𝑥 in the domain of 𝑓.
Convex and concave functions have special properties related to their maxima and minima:

• Any local minimum of a convex function is also a global minimum.
• Any local maximum of a concave function is also a global maximum.

2.4 Optimisation: Activity


Question 1:
A business proposal has been developed, part of which predicts the monthly profit of a new company
in terms of the number of employees. The function 𝑓 which predicts this profit is given by
$$f(x) = -\frac{1}{4800}x^4 + 10x^2$$
where 𝑥 is the number of employees and 𝑓(𝑥) is measured in dollars. The plot of 𝑓 is given below.

(a) Initially, the company plans to hire 60 employees. How much profit is expected in the first
month? A2.24
(b) The company hires one new employee at the start of each month for the next five months. What
total profit is expected over the first six months? Write your answer using sigma notation. A2.25
Note: In reality, ℝ would not be a suitable domain for this function. The number of employees must
be a non-negative integer. So ℕ would be a more appropriate domain. However, we are about to do
some calculations that involve differentiation, so for now we will just treat 𝑓 as if its domain were ℝ.
(c) Find $f'(x)$. A2.26
(d) Use $f'$ to find all stationary points of $f$. A2.27
(e) Classify these stationary points using the second derivative test. Check your answers with the
above plot. A2.28
(f) What is the largest monthly profit that the company can expect to make? How many employees
do they need in order to make this profit? A2.29
Question 2:
(a) Consider the function $f$ given by $f(x) = x^3 - 6x^2 + 12x - 8$. Find the stationary point of $f$. A2.30
(b) What happens when you use the second derivative test to classify this stationary point? A2.31
(c) Find the derivative of $f$ at $x = 1$ and $x = 3$ and hence classify the stationary point. Produce a
plot of $f$ to verify your answer. A2.32

2.5 Integration
We often need to find the area under the plotted curve of a function. We will see later that this is
particularly useful in probability theory.
We can find areas under curves with a technique called integration that is closely related to differenti-
ation.

2.5.1 Integration
Let $f$ be a continuous function. The definite integral of $f$ between $x = a$ and $x = b$, written as
$$\int_a^b f(x) \, dx,$$
is the signed area bounded by the plot of $f$, the $x$-axis, $x = a$ and $x = b$.
We say ”signed area” because any areas below the 𝑥-axis count negatively.
Example. Finding the red area in the plot below might approximate the probability that
a randomly selected person was between 180 and 190cm tall.

The area in red would be written as $\int_{180}^{190} f(x) \, dx$ where $f$ is the function plotted in blue.

There is a surprisingly simple way of calculating definite integrals:


Let $f$ be a continuous function. Then
$$\int_a^b f(x) \, dx = F(b) - F(a),$$
where $F$ is a function such that $F' = f$.

This works because of a deep result sometimes called the fundamental theorem of calculus. A function
like 𝐹 is called an antiderivative of 𝑓. Any continuous function has an antiderivative. In fact any
continuous function has lots of antiderivatives, but they are all essentially the same: they differ only
by constants.
Example. Suppose we want to calculate $\int_0^1 x^3 \, dx$. The corresponding area is pictured
below.

Notice that the function $F(x) = \frac{1}{4}x^4$ has derivative $F'(x) = x^3$. So $F$ is an antiderivative
of $f(x) = x^3$ and

$$\begin{aligned}
\int_0^1 x^3 \, dx &= F(1) - F(0) \\
&= \tfrac{1}{4} - 0 \\
&= \tfrac{1}{4}.
\end{aligned}$$

(We could have also picked $G(x) = \frac{1}{4}x^4 + 10$ as an antiderivative of $f$, but we would have
gotten exactly the same answer because the 10s would have cancelled out.)
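To see that the antiderivative really does give an area, you can compare it with a direct approximation: slice the region into many thin rectangles and add up their areas. The sketch below uses midpoint rectangles, which is our own choice of approximation and not a method from these notes; the names `midpoint_sum`, `f` and `F` are ours.

```python
def midpoint_sum(func, a, b, n=100_000):
    """Approximate the signed area under func on [a, b] using n midpoint rectangles."""
    width = (b - a) / n
    return sum(func(a + (i + 0.5) * width) for i in range(n)) * width

f = lambda x: x ** 3
F = lambda x: x ** 4 / 4     # an antiderivative of f

exact = F(1) - F(0)
approx = midpoint_sum(f, 0, 1)
print(exact, approx)         # the two values should agree to many decimal places
```

Adding a constant to `F` leaves `F(1) - F(0)` unchanged, which is the cancellation described in the example.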
Questions.
1. Guided by the example above, calculate $\int_{-1}^{1} x^3 \, dx$. A2.33
2. Intuitively speaking, why does your answer to 1 make sense? A2.34
3. Calculate $\int_0^5 e^{-x} \, dx$. (Note that $F(x) = -e^{-x}$ has derivative $F'(x) = e^{-x}$.) A2.35

2.5.2 Finding antiderivatives


So the main challenge in calculating a definite integral of a function 𝑓 is finding an antiderivative of 𝑓.
This process is sometimes called indefinite integration. Just like with differentiation, there are lots of
rules that can help us find antiderivatives. We're not going to go into those here, or require you to
find antiderivatives of more complex functions (like with differentiation, computers are pretty good at
this). What's important is that you understand the concept.

2.6 Integration: Activity


Question 1:
A2.36
(a) Find an antiderivative of $f$, where $f(x) = 12 - 3x^2$. A2.36
(b) Find the area bounded by the plot of 𝑓 that lies above the 𝑥-axis. Produce a plot to visualise
this area. A2.37
Question 2:
Consider the function $f$ given by $f(x) = e^{-2x} - x - 1$. Find the area bounded by the plot of $f$, the
𝑥-axis, and the vertical line at 𝑥 = 2. Produce a plot of 𝑓 to visualise this area. A2.38
Question 3:
Below is a plot of a function given by 𝑓(𝑥) = 12𝑥2 + 1. The top-right corner of the shaded region is
at 𝑥 = 1.

Find the area of the shaded region. A2.39

2.7 Answers
2.1 The slope at $x = -\frac{1}{4}$ is negative (about $-2.5$), at $x = 1$ the tangent is flat with a slope of
zero, and at $x = \frac{3}{2}$ the slope is positive (about 1).

2.2
2.3 $f'(x) = 2x - 2$
So the slopes of tangents to $f$ at $x = -\frac{1}{4}$, $x = 1$, and $x = \frac{3}{2}$ are $f'(-\frac{1}{4}) = -\frac{5}{2}$, $f'(1) = 0$ and
$f'(\frac{3}{2}) = 1$.
2.4 $g'(x) = 12x^2 - 4x + 2e^{2x}$
A tangent to $g$ at $x = 2$ has slope $g'(2) = 40 + 2e^4 \approx 149$.
So the function is sloping very steeply upward here.
2.5 $h'(x) = \frac{1}{x}$
As $x$ gets very large $h'(x)$ gets very close to 0 (but stays just positive).
So ℎ keeps increasing, but very gradually.
2.6 $f'(x) = e^x$
$f''(x) = e^x$
$f^{(n)}(x) = e^x$
2.7 $g'(x) = 2e^{2x}$
$g''(x) = 4e^{2x}$
$g^{(n)}(x) = 2^n e^{2x}$
2.8 0
2.9 $f'(x) = 6x^2 - 30x + 36$

2.10 By observation, the zeroes are 𝑥 = 2 and 𝑥 = 3.


This can be verified by factorising, which gives 𝑓 ′ (𝑥) = 6(𝑥 − 2)(𝑥 − 3).
2.11 𝑓 ′ (2) = 0 and 𝑓 ′ (3) = 0. So the derivative of 𝑓 is zero at both 𝑥 = 2 and 𝑥 = 3. That is,
the slope of the tangent to 𝑓 at both 𝑥 = 2 and 𝑥 = 3 is zero. Therefore, the plot of 𝑓 will be
”flat” at 𝑥 = 2 and 𝑥 = 3. In other words, 𝑓 has stationary points at 𝑥 = 2 and 𝑥 = 3. This
agrees with the above plot.
2.12 Differentiating gives
$f'(x) = 3e^{3x} + 4x$.
Then 𝑓 ′ (0) = 3, which is the gradient of the tangent at 𝑥 = 0. So 𝑚 = 3. To find 𝑐, note that
𝑓(0) = 1, so the point (0, 1) lies on the tangent line. Substituting this in gives
1 = 3 × 0 + 𝑐 = 𝑐,
so the equation of the tangent is 𝑦 = 3𝑥 + 1.

2.13
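2.14 Using logarithm laws,

$$\begin{aligned}
f(x) &= \log_2(x^2(x + 3)) \\
&= \log_2(x^2) + \log_2(x + 3) \\
&= 2\log_2(x) + \log_2(x + 3).
\end{aligned}$$

So,

$$\begin{aligned}
f'(x) &= 2 \cdot \frac{1}{\ln(2)x} + \frac{1}{\ln(2)(x + 3)} \\
&= \frac{2(x + 3)}{\ln(2)x(x + 3)} + \frac{x}{\ln(2)x(x + 3)} \\
&= \frac{3x + 6}{\ln(2)x(x + 3)}.
\end{aligned}$$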

2.15 Use the laws for exponential functions to write
$f(x) = a^{cx}a^d = a^d a^{cx}$.
Now $a^d$ is just a constant, so the $a^{cx}$ part may be differentiated as per usual. Differentiating
and using the laws for exponential functions gives
$f'(x) = a^d c\ln(a)a^{cx} = c\ln(a)a^{cx}a^d = c\ln(a)a^{cx+d}$.
2.16 Note that $f^{(1)}(x) = f'(x) = 6e^{6x}$, $f^{(2)}(x) = 6 \times 6e^{6x} = 6^2 e^{6x}$, and similarly $f^{(3)}(x) = 6^3 e^{6x}$.
So each time $f$ is differentiated, the power of 6 out the front increases by 1. So, $f^{(17)}(x) = 6^{17} e^{6x}$.
2.17 We have $f^{(1)}(x) = 12x^{11}$, $f^{(2)}(x) = 11 \times 12x^{10}$, and $f^{(3)}(x) = 10 \times 11 \times 12x^9$. So once $f$
has been differentiated 12 times, we will have $f^{(12)}(x) = 1 \times 2 \times \ldots \times 11 \times 12 = 12!$.

2.18
2.19 $f'(x) = 2 - e^x$
Solving $f'(x) = 0$ we see that $f$ has one stationary point: at $x = \ln(2)$.
$f'(x) = 2 - e^x$ changes from positive to negative at $x = \ln(2)$, so this point is a local maximum.

2.20 They are exactly the same. This is because ℎ′ (𝑥) = 𝑐𝑔′ (𝑥) and 𝑐𝑔′ (𝑥) = 0 has exactly the
same solutions as 𝑔′ (𝑥) = 0.
What can you say about the classifications of the stationary points of ℎ(𝑥) compared to those
of 𝑔(𝑥)?
2.21 𝑓 ′′ (𝑥) = −2𝑥 + 4
2.22 𝑓 ′′ (1) = 2 so there is a local minimum at 𝑥 = 1. 𝑓 ′′ (3) = −2 so there is a local maximum at
𝑥 = 3.
2.23 $f'(x) = x - 10 + \frac{16}{x}$.
Solving 𝑓 ′ (𝑥) = 0 we get 𝑥 = 2 and 𝑥 = 8.
So 𝑓 has 𝑥 = 2 and 𝑥 = 8 as stationary points, 𝑥 = 1 and 𝑥 = 14 as boundary points, and
there are no points where 𝑓 ′ (𝑥) does not exist.
So we only need to calculate 𝑓(1) = 5.5, 𝑓(2) ≈ 8.09, 𝑓(8) ≈ 0.27, and 𝑓(14) ≈ 15.22.
From this we see that 𝑓 has a global minimum of approximately 0.27 at 𝑥 = 8 and a global
maximum of 15.22 at 𝑥 = 14.
2.24 \( f(60) = -\frac{1}{4800} \times 60^4 + 10 \times 60^2 = 33300 \). The expected profit is $33,300.
2.25 The total profit is found by adding up the profit from the first six months. This is 𝑓(60) + 𝑓(61) + 𝑓(62) + 𝑓(63) + 𝑓(64) + 𝑓(65), which can be rewritten as
\[ \sum_{x=60}^{65}\left(-\frac{1}{4800}x^4 + 10x^2\right). \]

2.26 \( f'(x) = -\frac{1}{1200}x^3 + 20x \)
2.27 Set \( f'(x) = 0 \) and solve for 𝑥. Factorising gives \( x\left(-\frac{1}{1200}x^2 + 20\right) = 0 \), which yields two solutions with 𝑥 ≥ 0, namely \( x = 0 \) and \( x = \sqrt{24000} = 40\sqrt{15} \approx 154.9 \).
2.28 The second derivative is \( f''(x) = -\frac{1}{400}x^2 + 20 \). Evaluating \( f'' \) at the two stationary points gives \( f''(0) = 20 \) and \( f''(40\sqrt{15}) = -40 \). Since \( f''(0) > 0 \), there is a local minimum at \( x = 0 \). Since \( f''(40\sqrt{15}) < 0 \), there is a local maximum at \( x = 40\sqrt{15} \).

2.29 Looking at the plot, the stationary point \( x = 40\sqrt{15} \) is a local maximum. However, the number of employees must be an integer. Evaluating 𝑓 at the integers near \( 40\sqrt{15} \approx 154.9 \) gives 𝑓(155) = $120,000 and 𝑓(154) = $119,983 (to the nearest dollar). So the company needs 155 employees to make a maximum monthly profit of $120,000.
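The integer comparison in 2.29 can be checked numerically. The sketch below assumes the profit function used in 2.24, namely 𝑓(𝑥) = −𝑥⁴/4800 + 10𝑥²; the names `profit` and `candidates` are only illustrative.

```python
def profit(x):
    """Monthly profit f(x) = -x^4/4800 + 10x^2 for x employees."""
    return -x**4 / 4800 + 10 * x**2

# The stationary point 40*sqrt(15) is about 154.9, so compare the
# two integers on either side of it.
candidates = [154, 155]
best = max(candidates, key=profit)

for x in candidates:
    print(x, round(profit(x)))
print("best:", best)
```

Rounding to the nearest dollar reproduces the two values quoted above.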
2.30 Differentiating gives \( f'(x) = 3x^2 - 12x + 12 = 3(x^2 - 4x + 4) = 3(x-2)^2 \). Solving \( f'(x) = 0 \) shows that there is a stationary point at 𝑥 = 2.
2.31 The second derivative is 𝑓 ′′ (𝑥) = 6𝑥 − 12 and so 𝑓 ′′ (2) = 0. Therefore, the second derivative
test does not tell us anything.
2.32 Since \( f'(1) = 3 \) and \( f'(3) = 3 \), the function 𝑓 is increasing on either side of 𝑥 = 2. Therefore, there is an inflection point at 𝑥 = 2. The plot below verifies this.
2.33
\[ \int_{-1}^{1} x^3\,dx = F(1) - F(-1) = \frac14 - \frac14 = 0. \]

2.34 The area below the 𝑥-axis between 𝑥 = −1 and 𝑥 = 0 counts negatively and cancels with
the area above the 𝑥-axis between 𝑥 = 0 and 𝑥 = 1.
2.35
\[ \int_{0}^{5} e^{-x}\,dx = F(5) - F(0) = -e^{-5} - (-1) = 1 - e^{-5} \approx 0.993. \]

2.36 A function 𝐹 where 𝐹′ = 𝑓 is needed. Differentiating 12𝑥 gives 12, and differentiating \( -x^3 \) gives \( -3x^2 \). So \( F(x) = 12x - x^3 \).
2.37 This is the area in question.

The area can be calculated as follows:

\[ \int_{-2}^{2} (12 - 3x^2)\,dx = F(2) - F(-2) = 16 - (-16) = 32. \]
2.38 This is the area in question.

An antiderivative of 𝑓 is given by \( F(x) = -\frac12 e^{-2x} - \frac12 x^2 - x \). This can be checked by differentiating 𝐹. The signed area is then
\[ \int_{0}^{2} (e^{-2x} - x - 1)\,dx = F(2) - F(0) = -\frac12 e^{-4} - \frac72 \approx -3.51. \]
The negative answer makes sense since the area is below the 𝑥-axis. So the area is in fact \( \frac12 e^{-4} + \frac72 \approx 3.51 \).
2.39 This can be done by finding the area of this region;

and then subtracting the result from the area of this region:

An antiderivative of 𝑓 is given by \( F(x) = 4x^3 + x \). So the area of the first region is
\[ \int_{0}^{1} (12x^2 + 1)\,dx = F(1) - F(0) = 5. \]

The area of the second region is just the length multiplied by the height of that rectangle. The
height is 𝑓(1) = 13, so this area is 1 × 13 = 13. Therefore, the area of the original shaded
region is 13 − 5 = 8.
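As a sanity check on 2.39, the integral can also be approximated numerically. The sketch below uses a simple midpoint rule, which is not the method used in these notes; the helper name `midpoint_rule` and the number of rectangles are assumptions made for illustration.

```python
def f(x):
    return 12 * x**2 + 1

def midpoint_rule(func, a, b, n=10_000):
    """Approximate the integral of func over [a, b] using n midpoint rectangles."""
    width = (b - a) / n
    return width * sum(func(a + (i + 0.5) * width) for i in range(n))

area_under_curve = midpoint_rule(f, 0, 1)  # close to F(1) - F(0) = 5
rectangle = 1 * f(1)                       # length 1 times height f(1) = 13
shaded = rectangle - area_under_curve      # close to 8

print(round(area_under_curve, 4), round(shaded, 4))
```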

Chapter 3

Linear Algebra

3.1 Vectors and matrices


In this section, we introduce some mathematical objects which allow us to represent lots of information
in a more compact way.

3.1.1 Matrices
An 𝑚 × 𝑛 matrix is a rectangular array of numbers with 𝑚 rows and 𝑛 columns.
We will use capital letters to represent matrices. If 𝐴 is an 𝑚 × 𝑛 matrix, then we say that the size
of 𝐴 is 𝑚 × 𝑛.
Example: The matrix 𝐴 given below is a 4 × 3 matrix, and 𝐵 is a 2 × 2 matrix.
3 4 0
⎡1 −3 −2⎤
𝐴=⎢ ⎥
⎢0 2 0⎥
⎣6 −13 24 ⎦

− 14 2
𝐵=[ ]
𝜋 0.32
For a matrix 𝐴, we will use the notation (𝐴)𝑖𝑗 to denote the entry in row 𝑖 and column 𝑗 of 𝐴.
Question: Consider the following matrix:
\[ A = \begin{bmatrix} 3 & -4 & 5 \\ 0 & 1 & 8 \end{bmatrix} \]
What is the size of 𝐴? A3.1
What is \( (A)_{12} \)? A3.2

3.1.2 Addition and Scalar Multiplication of Matrices


If two matrices have the same size, we can add them together to make a new matrix by adding the
corresponding entries. The resulting matrix will have the same size as the original matrices.
If 𝐴 and 𝐵 are both 𝑚 × 𝑛 matrices, then 𝐴 + 𝐵 is the 𝑚 × 𝑛 matrix where (𝐴 + 𝐵)𝑖𝑗 = (𝐴)𝑖𝑗 + (𝐵)𝑖𝑗
for all 𝑖 and 𝑗.
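Example: If \( A = \begin{bmatrix} 1 & -3 \\ 4 & 4 \end{bmatrix} \) and \( B = \begin{bmatrix} 2 & 9 \\ -4 & -2 \end{bmatrix} \), then
\[ A + B = \begin{bmatrix} 1+2 & -3+9 \\ 4-4 & 4-2 \end{bmatrix} = \begin{bmatrix} 3 & 6 \\ 0 & 2 \end{bmatrix}. \]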
Matrices of the same size can also be subtracted by subtracting the corresponding entries.
We can also multiply matrices by a single number, where the resulting matrix is found by multiplying
each entry by that number. The resulting matrix will have the same size as the original matrix.
If 𝐴 is an 𝑚 × 𝑛 matrix and 𝑘 ∈ ℝ, then 𝑘𝐴 is the 𝑚 × 𝑛 matrix where (𝑘𝐴)𝑖𝑗 = 𝑘(𝐴)𝑖𝑗 for all 𝑖 and
𝑗.
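Example: If \( A = \begin{bmatrix} 2 & 1 & 0 \\ -2 & 3 & -1 \end{bmatrix} \), then
\[ 4A = \begin{bmatrix} 4 \times 2 & 4 \times 1 & 4 \times 0 \\ 4 \times (-2) & 4 \times 3 & 4 \times (-1) \end{bmatrix} = \begin{bmatrix} 8 & 4 & 0 \\ -8 & 12 & -4 \end{bmatrix}. \]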
This second operation is called scalar multiplication. The reason for this is that we often describe real
numbers as scalars.
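The two entry-by-entry rules above can be sketched in a few lines of Python, with a matrix stored as a list of rows. The function names `mat_add` and `scalar_mul` are chosen here for illustration only; the data are the two examples from this section.

```python
def mat_add(A, B):
    """Add two matrices of the same size entry by entry."""
    if len(A) != len(B) or any(len(ra) != len(rb) for ra, rb in zip(A, B)):
        raise ValueError("matrices must have the same size")
    return [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]

def scalar_mul(k, A):
    """Multiply every entry of the matrix A by the scalar k."""
    return [[k * a for a in row] for row in A]

A = [[1, -3], [4, 4]]
B = [[2, 9], [-4, -2]]
print(mat_add(A, B))                            # [[3, 6], [0, 2]]
print(scalar_mul(4, [[2, 1, 0], [-2, 3, -1]]))  # [[8, 4, 0], [-8, 12, -4]]
```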
Question: Let \( A = \begin{bmatrix} 3 & 1 \\ -1 & 4 \end{bmatrix} \) and \( B = \begin{bmatrix} 0 & -5 \\ 7 & 2 \end{bmatrix} \). Evaluate 2𝐴 + 𝐵. A3.3

3.1.3 Matrix Multiplication


We can also multiply two matrices together. However, we don't require them to be the same size like
we did for matrix addition. The requirement is that the number of columns of the matrix on the left
must be equal to the number of rows of the matrix on the right. This might not seem intuitive at
first, but it will make sense after seeing an example.
Rather than giving a formal definition first, we will work through an example and then give the
formal definition for the interested student.
Example: Let \( A = \begin{bmatrix} 5 & 3 \\ 0 & -4 \\ 3 & 4 \end{bmatrix} \) and \( B = \begin{bmatrix} 1 & 2 \\ -2 & 0 \end{bmatrix} \). Evaluate 𝐴𝐵.
We first need to determine the size of 𝐴𝐵. The number of rows in 𝐴𝐵 will be equal to
the number of rows in 𝐴, and the number of columns in 𝐴𝐵 will be equal to the number
of columns in 𝐵.
In the illustration below, the outer two numbers on the left hand side (3 and 2) give the
size of 𝐴𝐵, whereas the inner two numbers (2 and 2) just need to be equal for the matrix
multiplication to work.
\[ \underbrace{A}_{(3 \times 2)}\ \underbrace{B}_{(2 \times 2)} = \underbrace{AB}_{3 \times 2} \]

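To determine \( (AB)_{11} \), we take row 1 of 𝐴 and column 1 of 𝐵, multiply the corresponding numbers together, then add up the result. The required row and column are shown below in red.
\[ A = \begin{bmatrix} {\color{red}5} & {\color{red}3} \\ 0 & -4 \\ 3 & 4 \end{bmatrix} \qquad B = \begin{bmatrix} {\color{red}1} & 2 \\ {\color{red}-2} & 0 \end{bmatrix} \qquad (AB)_{11} = (5 \times 1) + (3 \times (-2)) = -1 \]
So far, we have determined one entry in 𝐴𝐵.
\[ AB = \begin{bmatrix} -1 & {} \\ {} & {} \\ {} & {} \end{bmatrix} \]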
The remaining entries are found in a similar way by choosing an appropriate row and
column. For example, to find (𝐴𝐵)21 we need row 2 from 𝐴 and column 1 from 𝐵.
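\[ A = \begin{bmatrix} 5 & 3 \\ 0 & -4 \\ 3 & 4 \end{bmatrix} \qquad B = \begin{bmatrix} 1 & 2 \\ -2 & 0 \end{bmatrix} \qquad (AB)_{21} = (0 \times 1) + ((-4) \times (-2)) = 8 \]
To find \( (AB)_{32} \), we need row 3 of 𝐴 and column 2 of 𝐵.
\[ A = \begin{bmatrix} 5 & 3 \\ 0 & -4 \\ 3 & 4 \end{bmatrix} \qquad B = \begin{bmatrix} 1 & 2 \\ -2 & 0 \end{bmatrix} \qquad (AB)_{32} = (3 \times 2) + (4 \times 0) = 6 \]
Follow this process to find the remaining three entries. The final result is given below.
\[ AB = \begin{bmatrix} -1 & 10 \\ 8 & 0 \\ -5 & 6 \end{bmatrix} \]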
You may realise now why we need the number of columns in 𝐴 to be equal to the number of rows in 𝐵. This is so that when we find each entry in 𝐴𝐵, we can pair each number in a row from 𝐴 with a number in a column from 𝐵.
If 𝐴 is an 𝑚 × 𝑝 matrix and 𝐵 is a 𝑝 × 𝑛 matrix, then 𝐴𝐵 is the 𝑚 × 𝑛 matrix where
\[ (AB)_{ij} = \sum_{k=1}^{p} (A)_{ik}(B)_{kj} \]
for all 𝑖 and 𝑗.
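The formal definition translates almost directly into code. The following is a minimal sketch, assuming matrices are stored as lists of rows; the name `mat_mul` is illustrative. It is applied to the matrices 𝐴 and 𝐵 from the worked example.

```python
def mat_mul(A, B):
    """Return AB, where (AB)_ij is the sum over k of (A)_ik * (B)_kj."""
    m, p, n = len(A), len(B), len(B[0])
    if any(len(row) != p for row in A):
        raise ValueError("number of columns of A must equal number of rows of B")
    return [[sum(A[i][k] * B[k][j] for k in range(p)) for j in range(n)]
            for i in range(m)]

A = [[5, 3], [0, -4], [3, 4]]   # 3 x 2
B = [[1, 2], [-2, 0]]           # 2 x 2
print(mat_mul(A, B))            # [[-1, 10], [8, 0], [-5, 6]]
```

Calling `mat_mul(B, A)` raises an error here, because 𝐵 has 2 columns while 𝐴 has 3 rows.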

Below is a summary of the rules of algebra for matrices, where 𝐴, 𝐵 and 𝐶 are matrices and 𝑘, 𝑙 are
real numbers. These rules work as long as the matrices are of the right size for the matrix addition
and multiplication to work.
• 𝐴(𝐵𝐶) = (𝐴𝐵)𝐶
• 𝐴(𝐵 + 𝐶) = 𝐴𝐵 + 𝐴𝐶 and (𝐴 + 𝐵)𝐶 = 𝐴𝐶 + 𝐵𝐶
• 𝑘(𝐴 + 𝐵) = 𝑘𝐴 + 𝑘𝐵
• (𝑘 + 𝑙)𝐴 = 𝑘𝐴 + 𝑙𝐴
• (𝑘𝐴)𝐵 = 𝑘(𝐴𝐵) = 𝐴(𝑘𝐵)
Note that we have not included 𝐴𝐵 = 𝐵𝐴 in the above list. In general, you cannot swap the order of
matrix multiplication. In fact, in the previous example, the product 𝐵𝐴 does not even exist because
the sizes of these matrices are incompatible.

3.1.4 Vectors
A vector is a matrix with only one column.
This is actually the definition of a column vector. A row vector is a matrix with only one row, but we
will only need to deal with column vectors. We will use lower case letters with arrows above them to
represent vectors.
3
0
Example: 𝑢⃗ = ⎡ ⎤
⎢ ⎥ and 𝑣 ⃗ = [12] are vectors.
−2
⎣5⎦
Just as we did with matrices, we can multiply vectors by scalars, we can add two vectors of the same
size, and we can multiply a vector and a matrix, provided that their sizes are compatible. Since
vectors are matrices, the rules of algebra we saw for matrices also apply to vectors.
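Question: Let \( A = \begin{bmatrix} 3 & 2 \\ -4 & 5 \end{bmatrix} \), \( \vec{u} = \begin{bmatrix} 3 \\ -8 \end{bmatrix} \) and \( \vec{v} = \begin{bmatrix} -1 \\ -8 \end{bmatrix} \). Evaluate \( A\vec{u} - 4\vec{v} \). A3.4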

3.1.5 The Dot Product
The dot product is an operation which can be performed on a pair of vectors of the same size. It
has several important uses, one of which is to provide a more compact representation for certain
expressions.
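Given a pair of vectors,
\[ \vec{u} = \begin{bmatrix} u_1 \\ u_2 \\ \vdots \\ u_n \end{bmatrix}, \qquad \vec{v} = \begin{bmatrix} v_1 \\ v_2 \\ \vdots \\ v_n \end{bmatrix} \]
their dot product is
\[ \vec{u} \cdot \vec{v} = u_1v_1 + u_2v_2 + \dots + u_nv_n. \]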
Question: Compute the dot product of \( \vec{u} = \begin{bmatrix} 1 \\ 3 \\ 2 \end{bmatrix} \) and \( \vec{v} = \begin{bmatrix} 2 \\ -4 \\ 6 \end{bmatrix} \). A3.5

3.1.6 Geometric Interpretation of Vectors


For vectors with two entries (2 × 1 matrices), we can represent them on the 𝑥𝑦-plane as arrows. For each point (𝑥, 𝑦), we can construct a vector, which will be represented as an arrow from the origin to that point.
The position vector of the point (𝑥, 𝑦) is the vector
\[ \begin{bmatrix} x \\ y \end{bmatrix} \]
and it is represented geometrically as an arrow from (0, 0) to (𝑥, 𝑦).


Example: Below is a plot of the position vectors for (1, 3) and (−4, 1).

The operations of scalar multiplication and addition of vectors can be performed geometrically.
Scalar multiplication simply ”stretches” or ”shrinks” the vector, depending on the value of the scalar.
If the scalar is negative, then the vector will be reflected so that it points in the opposite direction.
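Example: Let \( \vec{v} = \begin{bmatrix} 4 \\ -2 \end{bmatrix} \). This vector is plotted below.

We can algebraically compute \( \frac12\vec{v} = \begin{bmatrix} 2 \\ -1 \end{bmatrix} \) and \( -2\vec{v} = \begin{bmatrix} -8 \\ 4 \end{bmatrix} \).
These vectors are plotted below. The blue vector is \( \vec{v} \), the orange vector is \( \frac12\vec{v} \), and the green vector is \( -2\vec{v} \).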

Since \( \frac12 \) is between 0 and 1, the vector \( \frac12\vec{v} \) is just \( \vec{v} \) after being "shrunk".
Since −2 is negative, the vector \( -2\vec{v} \) is just \( \vec{v} \) after being reflected and "stretched".
Addition of vectors can be done by plotting the two vectors, moving one of them so that its ”tail” is
at the same position as the ”tip” of the other vector, then drawing an arrow from the origin to the
”tip” of the vector which has just been moved.
Example: Let \( \vec{u} = \begin{bmatrix} 3 \\ 1 \end{bmatrix} \) and \( \vec{v} = \begin{bmatrix} -1 \\ -4 \end{bmatrix} \). These vectors are plotted below.

In order to add these vectors geometrically, we will move 𝑣,⃗ so that its ”tail” is at the
same position as the ”tip” of 𝑢.⃗

Now we draw an arrow from the origin to the tip of 𝑣 ⃗ (after it has been moved) to represent
𝑢⃗ + 𝑣.⃗

So \( \vec{u} + \vec{v} = \begin{bmatrix} 2 \\ -3 \end{bmatrix} \). You can check that you get the same result by finding \( \vec{u} + \vec{v} \) algebraically.
Two vectors 𝑢⃗ and 𝑣⃗ are perpendicular if and only if their dot product is equal to zero.
Example: Consider the following two vectors:
\[ \vec{u} = \begin{bmatrix} 2 \\ 3 \end{bmatrix} \qquad \vec{v} = \begin{bmatrix} -6 \\ 4 \end{bmatrix} \]
Their dot product is 𝑢⃗ ⋅ 𝑣 ⃗ = 2 × (−6) + 3 × 4 = 0. A plot of these vectors is given below.
Note that they are perpendicular.
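The dot product, and the perpendicularity test it provides, can be sketched in Python as follows. Vectors are stored as plain lists, and the names `dot` and `is_perpendicular` are illustrative rather than standard.

```python
def dot(u, v):
    """Dot product u1*v1 + u2*v2 + ... + un*vn of two vectors of the same size."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same size")
    return sum(a * b for a, b in zip(u, v))

def is_perpendicular(u, v):
    """Two vectors are perpendicular exactly when their dot product is zero."""
    return dot(u, v) == 0

u = [2, 3]
v = [-6, 4]
print(dot(u, v))               # 0
print(is_perpendicular(u, v))  # True
```

With non-integer entries, an exact comparison with zero may fail because of rounding, so a small tolerance would be a sensible adjustment.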

This geometric interpretation of vectors can be extended to three dimensions (vectors with three
entries rather than two), although it becomes slightly more difficult to visualise.

3.2 Vectors and matrices: Activity


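Question 1:
Consider the following matrices.
\[ A = \begin{bmatrix} 2 & 3 \\ -1 & 0 \end{bmatrix} \qquad B = \begin{bmatrix} 7 & -2 \\ 3 & 4 \end{bmatrix} \qquad C = \begin{bmatrix} -4 & -6 \\ 2 & 0 \end{bmatrix} \]
(a) Evaluate 𝐴(𝐵 + 𝐶) and 𝐴𝐵 + 𝐴𝐶. A3.6
(b) Evaluate 𝐴𝐵 and 𝐵𝐴. Did you expect this? A3.7
(c) Show that 𝐴𝐶 = 𝐶𝐴. Can you explain why? A3.8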
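Question 2:
Consider the following matrices and vectors.
\[ A = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix} \qquad B = \begin{bmatrix} 1 & 2 & 3 \\ 4 & 5 & 6 \end{bmatrix} \qquad C = \begin{bmatrix} 7 & -2 \\ 5 & -1 \\ -9 & 0 \end{bmatrix} \qquad D = \begin{bmatrix} 2 & 3 & 1 \\ 6 & -3 & 0 \\ 7 & 19 & 4 \end{bmatrix} \]
\[ \vec{u} = \begin{bmatrix} 2 \\ 8 \end{bmatrix} \qquad \vec{v} = \begin{bmatrix} 0 \\ 0 \\ 0 \end{bmatrix} \]
Which of the following expressions make sense? For those that do, state the size of the resulting matrix.
1. 𝐴 + 𝐵𝐶 A3.9
2. 𝐴𝐵𝐶 A3.10
3. 𝐴𝐶𝐵 A3.11
4. 𝐵𝐷 + 𝐵 A3.12
5. 𝐴𝐵𝐷𝐶𝐴 A3.13
6. 𝐴𝑢⃗ + 𝐵𝑣⃗ A3.14
7. 𝐷𝑣⃗ + 𝐶𝑣⃗ A3.15
8. 𝑢⃗ − 𝐶𝑢⃗ A3.16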
Question 3:
The blue vector below is 𝑢⃗ and the orange vector is 𝑣.⃗

Write expressions for the green vectors below in terms of 𝑢⃗ and 𝑣.⃗
(a)

A3.17

(b)

A3.18

Question 4:
The blue vector below is 𝑢⃗ and the orange vector is 𝑣.⃗

A3.19
Which of the following green vectors represent 𝑣 ⃗ − 2𝑢?⃗

A B

C D

3.3 Solving linear systems

3.3.1 Linear Systems


In the following problems, we will use 𝑥, 𝑦 and 𝑧 to represent variables. A linear equation is an equation where each term is either a constant or a constant multiple of one of those variables. A typical linear equation may look like 𝑥 + 3𝑦 = 2 or 3𝑥 − 0.4𝑦 + √2𝑧 = 0. We will only deal with problems involving two or three variables, although the ideas here will work for equations with any number of variables.
A linear system is just a collection of linear equations. We will be interested in writing these linear
systems as matrix equations.
Example: Consider a small business that pays its regular staff $60, 000 each year and
pays its managers $100, 000 each year. The business employs eight times as many regular
staff as it does managers and spends $1.74 million on wages each year.
Let 𝑥 and 𝑦 represent the number of regular staff and managers respectively. Since there

are eight times as many regular staff as managers, one equation would be 𝑥 = 8𝑦, which
can be rearranged to give
𝑥 − 8𝑦 = 0.
The total amount spent on wages is $1, 740, 000, which is the sum of the amount spent
on managers and the amount spent on regular staff. Writing this as an equation, we have
60, 000𝑥 + 100, 000𝑦 = 1, 740, 000. Dividing both sides of this equation by 20, 000 gives
3𝑥 + 5𝑦 = 87.
This linear system can be written in matrix form where each row represents one of the
above equations.

\[ \begin{bmatrix} 1 & -8 \\ 3 & 5 \end{bmatrix} \begin{bmatrix} x \\ y \end{bmatrix} = \begin{bmatrix} 0 \\ 87 \end{bmatrix} \]

To verify this, we can carry out the matrix multiplication on the left hand side. First of
all, the matrix has size 2 × 2 and the vector has size 2 × 1, so the resulting vector will be
a vector of size 2 × 1. The first entry will be 1 × 𝑥 + (−8) × 𝑦 = 𝑥 − 8𝑦. Similarly, the
second entry will be 3𝑥 + 5𝑦. So this equation says

\[ \begin{bmatrix} x - 8y \\ 3x + 5y \end{bmatrix} = \begin{bmatrix} 0 \\ 87 \end{bmatrix} \]

so the corresponding entries must be equal. This gives us back the original two equations.
We will often write linear systems more compactly as 𝐴𝑥⃗ = 𝑏⃗ where 𝑥⃗ is a vector containing all of the
variables. In the above example, the components of this equation would be

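\[ A = \begin{bmatrix} 1 & -8 \\ 3 & 5 \end{bmatrix}, \qquad \vec{x} = \begin{bmatrix} x \\ y \end{bmatrix}, \qquad \text{and} \qquad \vec{b} = \begin{bmatrix} 0 \\ 87 \end{bmatrix}. \]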

See this demonstration for a visual interpretation of linear systems and their solutions.

3.3.2 Gaussian Elimination


One systematic method for solving linear equations is called Gaussian elimination. It is useful for
solving systems involving three or more variables. It involves adding and subtracting equations
together in such a way that some variables disappear, making the equations easier to solve.
To understand the main idea of Gaussian elimination, consider the linear system from the previous
example. This system is not too difficult to solve. Since 𝑥 = 8𝑦 from the first equation, we can
substitute this into the second equation to find 87 = 3(8𝑦) + 5𝑦 = 29𝑦. Dividing by 29 gives 𝑦 = 3,
which is substituted into the first equation, giving 𝑥 = 24.
Another method would be to multiply the first equation by 3 and subtract the result from the second equation. This gives
\[ (3x + 5y) - 3(x - 8y) = 87 - 3 \times 0 \]
\[ 29y = 87 \]
and we can continue as before to find 𝑥 and 𝑦.
Note that after this process, we have a modified linear system, which can be written as a matrix equation
\[ \begin{bmatrix} 1 & -8 \\ 0 & 29 \end{bmatrix} \begin{bmatrix} x \\ y \end{bmatrix} = \begin{bmatrix} 0 \\ 87 \end{bmatrix}, \]
which contains a zero in the bottom-left corner of the matrix. Gaussian elimination is a procedure for introducing zeroes into the bottom-left corner.
Also note that adding and subtracting equations can be done by adding and subtracting the entries
from the corresponding rows in the matrix equation.
Example: Consider the following linear system, represented as a matrix equation:
\[ \begin{bmatrix} 1 & 2 & 2 \\ -2 & 0 & 2 \\ 6 & 2 & 1 \end{bmatrix} \begin{bmatrix} x \\ y \\ z \end{bmatrix} = \begin{bmatrix} 3 \\ -6 \\ 10 \end{bmatrix} \]
The first step is to multiply the second row by 3 and add the result to the third row. We only change the third row.
\[ \begin{bmatrix} 1 & 2 & 2 \\ -2 & 0 & 2 \\ 6 + 3 \times (-2) & 2 + 3 \times 0 & 1 + 3 \times 2 \end{bmatrix} \begin{bmatrix} x \\ y \\ z \end{bmatrix} = \begin{bmatrix} 3 \\ -6 \\ 10 + 3 \times (-6) \end{bmatrix} \]
\[ \begin{bmatrix} 1 & 2 & 2 \\ -2 & 0 & 2 \\ 0 & 2 & 7 \end{bmatrix} \begin{bmatrix} x \\ y \\ z \end{bmatrix} = \begin{bmatrix} 3 \\ -6 \\ -8 \end{bmatrix} \]
You should see now why choosing a multiple of 3 was required. We wanted to introduce a zero into the bottom-left entry. We would also like the entry above that to be a zero. So the next step is to multiply the first row by 2 and add the result to the second row.
\[ \begin{bmatrix} 1 & 2 & 2 \\ 0 & 4 & 6 \\ 0 & 2 & 7 \end{bmatrix} \begin{bmatrix} x \\ y \\ z \end{bmatrix} = \begin{bmatrix} 3 \\ 0 \\ -8 \end{bmatrix} \]
We can also change a row by just multiplying it by a factor. We will multiply the second row by \( \frac12 \).
\[ \begin{bmatrix} 1 & 2 & 2 \\ 0 & 2 & 3 \\ 0 & 2 & 7 \end{bmatrix} \begin{bmatrix} x \\ y \\ z \end{bmatrix} = \begin{bmatrix} 3 \\ 0 \\ -8 \end{bmatrix} \]
Now we can subtract the second row from the third row.
\[ \begin{bmatrix} 1 & 2 & 2 \\ 0 & 2 & 3 \\ 0 & 0 & 4 \end{bmatrix} \begin{bmatrix} x \\ y \\ z \end{bmatrix} = \begin{bmatrix} 3 \\ 0 \\ -8 \end{bmatrix} \]
This system is now in a position to be solved. The third row corresponds to the equation 4𝑧 = −8, so 𝑧 = −2.
The second row corresponds to 2𝑦 + 3𝑧 = 0, which combined with 𝑧 = −2 gives 𝑦 = 3.
Finally, the first row gives 𝑥 + 2𝑦 + 2𝑧 = 3, which combined with 𝑧 = −2 and 𝑦 = 3 gives 𝑥 = 1.
The manipulations above are called row operations. There are three row operations that may be used
to solve a linear system.
• Any row may be multiplied by a non-zero number.
• Any multiple of one row may be added to another row.
• Any two rows may be interchanged.
Note that in the above examples, we aimed to fill the bottom-left corner with zeroes. The entire
process will still work in the same way if we fill any other corner of the matrix with zeroes.
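The procedure can be automated. The sketch below is one possible implementation, not the exact sequence of row operations used in the example: it clears each column from the top down, swapping rows when a zero pivot is met, and then back-substitutes. It uses exact fractions to avoid rounding, and the function name `solve` is an assumption made for illustration.

```python
from fractions import Fraction

def solve(A, b):
    """Solve A x = b by Gaussian elimination followed by back substitution."""
    n = len(A)
    # Work on an augmented matrix [A | b] with exact rational entries.
    M = [[Fraction(v) for v in row] + [Fraction(rhs)] for row, rhs in zip(A, b)]
    for col in range(n):
        # Row operation: interchange rows so the pivot entry is non-zero.
        pivot = next((r for r in range(col, n) if M[r][col] != 0), None)
        if pivot is None:
            raise ValueError("system does not have a unique solution")
        M[col], M[pivot] = M[pivot], M[col]
        # Row operation: add a multiple of the pivot row to each row below it.
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            M[r] = [x - factor * y for x, y in zip(M[r], M[col])]
    # Back substitution, starting from the last row.
    x = [Fraction(0)] * n
    for i in reversed(range(n)):
        known = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - known) / M[i][i]
    return x

A = [[1, 2, 2], [-2, 0, 2], [6, 2, 1]]
b = [3, -6, 10]
print([int(v) for v in solve(A, b)])  # [1, 3, -2]
```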
See the demonstration here for a visual interpretation of linear systems of three variables and the
Gaussian elimination method.

3.3.3 Matrix Inverses


Another method for solving linear systems involves finding what is called the inverse of a matrix. We
will first introduce a special type of matrix.
An identity matrix is an 𝑛 × 𝑛 matrix, denoted by 𝐼, such that (𝐼)𝑖𝑖 = 1 for all 𝑖 and (𝐼)𝑖𝑗 = 0
whenever 𝑖 ≠ 𝑗.
Identity matrices can be square matrices of any size. They have 1 in each diagonal entry and 0 in
every other entry. The 2 × 2 and 3 × 3 identity matrices are shown below.

\[ \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix} \qquad \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix} \]

These matrices serve a similar purpose to the number 1. When you multiply any matrix by an identity
matrix of appropriate size, the result will be the same matrix you started with.
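Example: Carry out the following matrix multiplication to verify that
\[ \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix} \begin{bmatrix} -2 & 0 & 3 \\ 0 & 1 & -4 \end{bmatrix} = \begin{bmatrix} -2 & 0 & 3 \\ 0 & 1 & -4 \end{bmatrix} \]
and
\[ \begin{bmatrix} -2 & 0 & 3 \\ 0 & 1 & -4 \end{bmatrix} \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix} = \begin{bmatrix} -2 & 0 & 3 \\ 0 & 1 & -4 \end{bmatrix}. \]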
Let 𝐴 be a square matrix. 𝐴 is called invertible if there is a square matrix \( A^{-1} \) such that \( A^{-1}A = I \). This matrix \( A^{-1} \) is called the inverse matrix of 𝐴.
Example: Carry out the matrix multiplication to verify that
\[ A^{-1} = \begin{bmatrix} 1 & 1 & 1 \\ 0 & 0 & 1 \\ 1 & 0 & 2 \end{bmatrix} \]
is the inverse of
\[ A = \begin{bmatrix} 0 & -2 & 1 \\ 1 & 1 & -1 \\ 0 & 1 & 0 \end{bmatrix}. \]
Not all matrices have an inverse. A quick way to check if a matrix has an inverse is to calculate a
quantity of the matrix called its determinant. We will only look at determinants of 2 × 2 matrices,
but there are methods to calculate determinants for larger square matrices. The interested student
may wish to see the demonstration here which details the process for calculating determinants of
3 × 3 matrices.
Let 𝐴 be a 2 × 2 matrix where
\[ A = \begin{bmatrix} a & b \\ c & d \end{bmatrix}. \]
The determinant of 𝐴 is det(𝐴) = 𝑎𝑑 − 𝑏𝑐. A square matrix 𝐴 is invertible if and only if det(𝐴) ≠ 0.
Question: Are the following matrices invertible? A3.20
\[ A = \begin{bmatrix} 2 & 2 \\ 1 & -3 \end{bmatrix} \qquad B = \begin{bmatrix} 3 & 6 \\ 2 & 4 \end{bmatrix} \]

If a 2 × 2 matrix is invertible, then there is a simple formula to calculate its inverse matrix.
If a 2 × 2 matrix \( A = \begin{bmatrix} a & b \\ c & d \end{bmatrix} \) is invertible, then
\[ A^{-1} = \frac{1}{\det(A)} \begin{bmatrix} d & -b \\ -c & a \end{bmatrix}. \]
Question: Find \( A^{-1} \) for the matrix 𝐴 in the above question. Then check your answer by calculating \( A^{-1}A \). A3.21
We now show how calculating inverse matrices provides another method for solving linear systems.
If \( A\vec{x} = \vec{b} \) is a linear system and 𝐴 is invertible, then \( \vec{x} = A^{-1}\vec{b} \).
To see why this works, we take the linear system and multiply on the left by \( A^{-1} \). This gives
\[ A^{-1}A\vec{x} = A^{-1}\vec{b}. \]
But \( A^{-1}A = I \) by definition, so
\[ I\vec{x} = A^{-1}\vec{b}. \]
Recall that multiplying by the identity matrix does not change the matrix/vector. So we have
\[ \vec{x} = A^{-1}\vec{b}. \]
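Example: We can solve the following linear system using the above method.
\[ x + y = 3 \qquad (3.1) \]
\[ x + 2y = 4 \qquad (3.2) \]
First, we will write
\[ A = \begin{bmatrix} 1 & 1 \\ 1 & 2 \end{bmatrix}, \qquad \vec{x} = \begin{bmatrix} x \\ y \end{bmatrix}, \qquad \text{and} \qquad \vec{b} = \begin{bmatrix} 3 \\ 4 \end{bmatrix} \]
so that \( A\vec{x} = \vec{b} \). Now det(𝐴) = 1 × 2 − 1 × 1 = 1, so 𝐴 is invertible. Its inverse matrix is
\[ A^{-1} = \frac{1}{1} \begin{bmatrix} 2 & -1 \\ -1 & 1 \end{bmatrix} = \begin{bmatrix} 2 & -1 \\ -1 & 1 \end{bmatrix}. \]
So the solution is
\[ \vec{x} = \begin{bmatrix} 2 & -1 \\ -1 & 1 \end{bmatrix} \begin{bmatrix} 3 \\ 4 \end{bmatrix} = \begin{bmatrix} 2 \\ 1 \end{bmatrix}. \]
By equating the components of these vectors, we have 𝑥 = 2 and 𝑦 = 1.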
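The same calculation can be sketched in Python. This assumes integer entries, so that exact fractions can be used, and the names `inverse_2x2` and `mat_vec` are illustrative only.

```python
from fractions import Fraction

def inverse_2x2(A):
    """Inverse of [[a, b], [c, d]] using the formula (1/det) [[d, -b], [-c, a]]."""
    (a, b), (c, d) = A
    det = a * d - b * c
    if det == 0:
        raise ValueError("matrix is not invertible (determinant is zero)")
    return [[Fraction(d, det), Fraction(-b, det)],
            [Fraction(-c, det), Fraction(a, det)]]

def mat_vec(A, v):
    """Multiply the matrix A by the column vector v."""
    return [sum(a * x for a, x in zip(row, v)) for row in A]

A = [[1, 1], [1, 2]]
b = [3, 4]
x = mat_vec(inverse_2x2(A), b)
print([str(value) for value in x])  # ['2', '1']
```

A matrix with determinant zero is rejected with an error rather than producing a division by zero.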

3.4 Solving linear systems: Activity


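Question 1:
Describe the sequence of row operations that would take you from each of the following systems to the next. A3.22
\[ \begin{bmatrix} 8 & 11 & 13 \\ -4 & 2 & 3 \\ 2 & 1 & 2 \end{bmatrix} \begin{bmatrix} x \\ y \\ z \end{bmatrix} = \begin{bmatrix} 23 \\ 1 \\ 4 \end{bmatrix} \]
\[ \begin{bmatrix} 2 & 1 & 2 \\ -4 & 2 & 3 \\ 8 & 11 & 13 \end{bmatrix} \begin{bmatrix} x \\ y \\ z \end{bmatrix} = \begin{bmatrix} 4 \\ 1 \\ 23 \end{bmatrix} \]
\[ \begin{bmatrix} 2 & 1 & 2 \\ -4 & 2 & 3 \\ 0 & 7 & 5 \end{bmatrix} \begin{bmatrix} x \\ y \\ z \end{bmatrix} = \begin{bmatrix} 4 \\ 1 \\ 7 \end{bmatrix} \]
\[ \begin{bmatrix} 2 & 1 & 2 \\ 0 & 4 & 7 \\ 0 & 7 & 5 \end{bmatrix} \begin{bmatrix} x \\ y \\ z \end{bmatrix} = \begin{bmatrix} 4 \\ 9 \\ 7 \end{bmatrix} \]
\[ \begin{bmatrix} 2 & 1 & 2 \\ 0 & 4 & 7 \\ 0 & 28 & 20 \end{bmatrix} \begin{bmatrix} x \\ y \\ z \end{bmatrix} = \begin{bmatrix} 4 \\ 9 \\ 28 \end{bmatrix} \]
\[ \begin{bmatrix} 2 & 1 & 2 \\ 0 & 4 & 7 \\ 0 & 0 & -29 \end{bmatrix} \begin{bmatrix} x \\ y \\ z \end{bmatrix} = \begin{bmatrix} 4 \\ 9 \\ -35 \end{bmatrix} \]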

Question 2:
Consider the matrices 𝐴 and 𝐵 given below.
\[ A = \begin{bmatrix} 2 & 1 \\ 1 & 3 \end{bmatrix} \qquad B = \begin{bmatrix} 3 & -1 \\ -1 & 2 \end{bmatrix} \]
Evaluate 𝐴𝐵. A3.23
Use this to find \( B^{-1} \). A3.24
Check your answer by using det(𝐵) to find \( B^{-1} \). A3.25
Question 3:

Researchers have devised a formula to predict an adult's height, ℎ, from their femur length, 𝑙. The
formula is
ℎ = 𝑚𝑙 + 𝑏
where 𝑚 and 𝑏 are constants. All lengths are measured in cm. The researchers have used this formula
to predict that an adult whose femur length is 50 cm is 195 cm tall, and that another adult whose
femur length is 45 cm is 182 cm tall. Find the constants 𝑚 and 𝑏 by writing a matrix equation and
finding the inverse of a matrix. A3.26
A3.27
Use your answer to predict the height of an adult whose femur length is 35 cm.
A3.28
What feature of this linear system allowed you to use this method to solve it?

3.5 Eigenvalues and eigenvectors

3.5.1 Eigenvalues and Eigenvectors


In this section, we will introduce a powerful tool of linear algebra, namely eigenvalues and eigenvectors.
Some applications of this method will be shown in the following activity.
Given an 𝑛 × 𝑛 matrix 𝐴, an eigenvalue of 𝐴 is a scalar 𝜆 such that 𝐴𝑥⃗ = 𝜆𝑥⃗ for some non-zero 𝑛 × 1 vector 𝑥⃗.
A non-zero vector 𝑥⃗ which satisfies 𝐴𝑥⃗ = 𝜆𝑥⃗ for some scalar 𝜆 is called an eigenvector of 𝐴.
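Example: Consider the following matrix 𝐴 and vectors 𝑥⃗ and 𝑦⃗.
\[ A = \begin{bmatrix} 1 & 2 \\ 3 & 2 \end{bmatrix} \qquad \vec{x} = \begin{bmatrix} 2 \\ 3 \end{bmatrix} \qquad \vec{y} = \begin{bmatrix} -1 \\ 1 \end{bmatrix} \]
Using matrix multiplication, we can make the following calculations. You should check these calculations yourself.
\[ A\vec{x} = \begin{bmatrix} 8 \\ 12 \end{bmatrix} = 4\vec{x} \]
\[ A\vec{y} = \begin{bmatrix} 1 \\ -1 \end{bmatrix} = -\vec{y} \]
So 4 and −1 are eigenvalues of 𝐴, and 𝑥⃗ and 𝑦⃗ are eigenvectors of 𝐴.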

3.5.2 Calculating Eigenvalues and Eigenvectors


The characteristic equation of an 𝑛 × 𝑛 matrix 𝐴 is
det(𝜆𝐼 − 𝐴) = 0.
The eigenvalues of 𝐴 are the solutions of the characteristic equation.
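Example: Consider the matrix 𝐴 from the above example. We will first evaluate 𝜆𝐼 − 𝐴.
\[ \begin{aligned} \lambda I - A &= \lambda \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix} - \begin{bmatrix} 1 & 2 \\ 3 & 2 \end{bmatrix} \\ &= \begin{bmatrix} \lambda & 0 \\ 0 & \lambda \end{bmatrix} - \begin{bmatrix} 1 & 2 \\ 3 & 2 \end{bmatrix} \\ &= \begin{bmatrix} \lambda - 1 & -2 \\ -3 & \lambda - 2 \end{bmatrix} \end{aligned} \]
Now we can find det(𝜆𝐼 − 𝐴).
\[ \begin{aligned} \det(\lambda I - A) &= (\lambda - 1)(\lambda - 2) - (-2) \times (-3) \\ &= \lambda^2 - \lambda - 2\lambda + 2 - 6 \\ &= \lambda^2 - 3\lambda - 4 \\ &= (\lambda - 4)(\lambda + 1) \end{aligned} \]
So the characteristic equation of 𝐴 is (𝜆 − 4)(𝜆 + 1) = 0. The solutions to this characteristic equation are 𝜆 = 4 and 𝜆 = −1. These correspond to the eigenvalues found in the above example.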
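For a 2 × 2 matrix with rows (𝑎, 𝑏) and (𝑐, 𝑑), expanding det(𝜆𝐼 − 𝐴) as above always gives the quadratic 𝜆² − (𝑎 + 𝑑)𝜆 + (𝑎𝑑 − 𝑏𝑐). The sketch below solves that quadratic with the quadratic formula. It assumes the eigenvalues are real, and the function name `eigenvalues_2x2` is illustrative.

```python
import math

def eigenvalues_2x2(A):
    """Real eigenvalues of [[a, b], [c, d]] from its characteristic equation."""
    (a, b), (c, d) = A
    trace = a + d            # coefficient of -lambda
    det = a * d - b * c      # constant term
    disc = trace**2 - 4 * det
    if disc < 0:
        raise ValueError("the characteristic equation has no real solutions")
    root = math.sqrt(disc)
    return ((trace + root) / 2, (trace - root) / 2)

print(eigenvalues_2x2([[1, 2], [3, 2]]))  # (4.0, -1.0)
```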
Once an eigenvalue has been found, an eigenvector corresponding to that eigenvalue can be found by solving the equation \( (\lambda I - A)\vec{x} = \vec{0} \), which is equivalent to the equation \( A\vec{x} = \lambda\vec{x} \) given in the original definition. Here, \( \vec{0} \) is the vector with 0 in each entry. There will in fact be infinitely many eigenvectors corresponding to each eigenvalue.
Example: We can find an eigenvector corresponding to the eigenvalue 𝜆 = 4 from the previous example. To do this, we solve \( (4I - A)\vec{x} = \vec{0} \). The first step will be to apply a row operation, namely to add the first row to the second.
\[ (4I - A)\vec{x} = \vec{0} \]
\[ \begin{bmatrix} 3 & -2 \\ -3 & 2 \end{bmatrix} \begin{bmatrix} x \\ y \end{bmatrix} = \begin{bmatrix} 0 \\ 0 \end{bmatrix} \]
\[ \begin{bmatrix} 3 & -2 \\ 0 & 0 \end{bmatrix} \begin{bmatrix} x \\ y \end{bmatrix} = \begin{bmatrix} 0 \\ 0 \end{bmatrix} \]
This can be rewritten as two linear equations.
\[ 3x - 2y = 0 \]
\[ 0 = 0 \]
The second equation is now useless, so we ignore it. We have only one useful equation to find two variables. There will be infinitely many solutions to this equation, so we can assign a non-zero value to one variable in order to find just one of those solutions. For instance, we could set 𝑥 = 2, then solving 3𝑥 − 2𝑦 = 0 would give 𝑦 = 3. So one eigenvector of 𝐴 corresponding to 𝜆 = 4 would be
\[ \vec{x} = \begin{bmatrix} 2 \\ 3 \end{bmatrix}. \]
However, there are infinitely many eigenvectors corresponding to the eigenvalue 𝜆 = 4. By choosing a different value of 𝑥, you will find a different eigenvector.
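The claim that different choices of 𝑥 give different eigenvectors can be checked directly from the definition 𝐴𝑥⃗ = 𝜆𝑥⃗. In this sketch the helper names `mat_vec` and `is_eigenvector` are assumptions made for illustration, and the test is exact only because all entries are integers.

```python
def mat_vec(A, v):
    """Multiply the matrix A by the column vector v."""
    return [sum(a * x for a, x in zip(row, v)) for row in A]

def is_eigenvector(A, lam, x):
    """Check that x is non-zero and satisfies A x = lam x."""
    if all(xi == 0 for xi in x):
        return False
    return mat_vec(A, x) == [lam * xi for xi in x]

A = [[1, 2], [3, 2]]
# Each non-zero multiple t of the vector (2, 3) solves 3x - 2y = 0.
for t in (1, -2, 5):
    x = [2 * t, 3 * t]
    print(x, is_eigenvector(A, 4, x))
```

Each of the three vectors printed passes the check, while the zero vector is excluded by definition.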

Question: Apply the method above to find an eigenvector of 𝐴 corresponding to the eigenvalue
𝜆 = −1. A3.29

3.6 Eigenvalues and eigenvectors: Activity


Question 1:
Find the eigenvalues of \( A = \begin{bmatrix} -3 & 2 \\ 4 & -1 \end{bmatrix} \). A3.30
Find one eigenvector corresponding to each eigenvalue. A3.31
Question 2 (Diagonalization):
In this question, we will provide an efficient technique using eigenvalues to calculate large powers of matrices. Consider the following matrix:
\[ A = \begin{bmatrix} 1 & 3 \\ 1 & -1 \end{bmatrix} \]
Find the eigenvalues of 𝐴. A3.32
Verify that
\[ \begin{bmatrix} -1 \\ 1 \end{bmatrix} \qquad \text{and} \qquad \begin{bmatrix} 3 \\ 1 \end{bmatrix} \]
are eigenvectors of 𝐴. Which eigenvalues do they correspond to? A3.33
Now consider the matrices
\[ D = \begin{bmatrix} -2 & 0 \\ 0 & 2 \end{bmatrix} \qquad P = \begin{bmatrix} -1 & 3 \\ 1 & 1 \end{bmatrix} \]
which are made up of the eigenvalues and eigenvectors. (Note that the eigenvalue in the first column of 𝐷 corresponds to the eigenvector in the first column of 𝑃. This procedure will not work unless the columns of the two matrices correspond.) The matrix 𝐷 is a diagonal matrix because all entries apart from those on the main diagonal are 0. Evaluate \( P^{-1} \). A3.34
Verify that \( PDP^{-1} = A \). A3.35
Write \( A^2 = AA \) and \( A^3 = AAA \) in terms of 𝐷 and 𝑃. Do the same for \( A^n = \underbrace{AA \cdots A}_{n \text{ times}} \). A3.36
Evaluate \( D^2 \) and \( D^3 \). What is \( D^n \)? A3.37
Using your previous answers, evaluate \( A^5 \). A3.38
Question 3 (Markov Chains):
Consider a town with a population of 4000 with only two grocery stores. Assume that every person
in this town visits exactly one grocery store each week. Suppose that in the first week, half of the
population visit store A and half visit store B. We will represent this information in a vector.

\[ \vec{x}_0 = \begin{bmatrix} 2000 \\ 2000 \end{bmatrix} \]
It is known that half of the people who visit store A one week will return to store A the next week. It is also known that three quarters of the people who visit store B one week will return to store B the next week.
How many people will visit each store in the second week? A3.39
The information about which stores people visit based on the previous week can be stored in a matrix.
\[ A = \begin{bmatrix} 1/2 & 1/4 \\ 1/2 & 3/4 \end{bmatrix} \]
Evaluate the vectors \( \vec{x}_1 = A\vec{x}_0 \) and \( \vec{x}_2 = A\vec{x}_1 \). A3.40
We can similarly define \( \vec{x}_n = A\vec{x}_{n-1} \) for each positive integer 𝑛. Based on your previous answers, what do the vectors \( \vec{x}_0, \vec{x}_1, \vec{x}_2, \dots \) represent? A3.41
Write \( \vec{x}_2 \) and \( \vec{x}_3 \) in terms of 𝐴 and \( \vec{x}_0 \). Do the same for \( \vec{x}_n \). A3.42
Use the procedure from Question 2 to write \( A = PDP^{-1} \), where 𝐷 is a diagonal matrix. A3.43
What happens to \( D^n \) when 𝑛 is large? A3.44
Use your previous answers to deduce how many people each week will visit each store in the long run. A3.45

3.7 Answers
3.1 2×3
3.2 −4
3.3 Just as with the order of operations for numbers, we perform multiplication first and then
addition.

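\[ \begin{aligned} 2A + B &= 2\begin{bmatrix} 3 & 1 \\ -1 & 4 \end{bmatrix} + \begin{bmatrix} 0 & -5 \\ 7 & 2 \end{bmatrix} \\ &= \begin{bmatrix} 6 & 2 \\ -2 & 8 \end{bmatrix} + \begin{bmatrix} 0 & -5 \\ 7 & 2 \end{bmatrix} \\ &= \begin{bmatrix} 6 & -3 \\ 5 & 10 \end{bmatrix} \end{aligned} \]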

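3.4 We can use matrix multiplication to find \( A\vec{u} \) first.
\[ A = \begin{bmatrix} 3 & 2 \\ -4 & 5 \end{bmatrix} \qquad \vec{u} = \begin{bmatrix} 3 \\ -8 \end{bmatrix} \]
\[ (A\vec{u})_{11} = (3 \times 3) + (2 \times (-8)) = -7 \]
\[ (A\vec{u})_{21} = ((-4) \times 3) + (5 \times (-8)) = -52 \]
\[ A\vec{u} = \begin{bmatrix} -7 \\ -52 \end{bmatrix} \]
Then we use matrix addition and scalar multiplication to find \( A\vec{u} - 4\vec{v} \).
\[ \begin{aligned} A\vec{u} - 4\vec{v} &= \begin{bmatrix} -7 \\ -52 \end{bmatrix} - 4\begin{bmatrix} -1 \\ -8 \end{bmatrix} \\ &= \begin{bmatrix} -7 \\ -52 \end{bmatrix} - \begin{bmatrix} -4 \\ -32 \end{bmatrix} \\ &= \begin{bmatrix} -3 \\ -20 \end{bmatrix} \end{aligned} \]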

3.5
𝑢⃗ ⋅ 𝑣 ⃗ = 1 × 2 + 3 × (−4) + 2 × 6 = 2.
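3.6
\[ \begin{aligned} A(B + C) &= \begin{bmatrix} 2 & 3 \\ -1 & 0 \end{bmatrix} \left( \begin{bmatrix} 7 & -2 \\ 3 & 4 \end{bmatrix} + \begin{bmatrix} -4 & -6 \\ 2 & 0 \end{bmatrix} \right) \\ &= \begin{bmatrix} 2 & 3 \\ -1 & 0 \end{bmatrix} \begin{bmatrix} 3 & -8 \\ 5 & 4 \end{bmatrix} \\ &= \begin{bmatrix} 2 \times 3 + 3 \times 5 & 2 \times (-8) + 3 \times 4 \\ -1 \times 3 + 0 \times 5 & -1 \times (-8) + 0 \times 4 \end{bmatrix} \\ &= \begin{bmatrix} 21 & -4 \\ -3 & 8 \end{bmatrix} \end{aligned} \]
\[ \begin{aligned} AB + AC &= \begin{bmatrix} 2 & 3 \\ -1 & 0 \end{bmatrix} \begin{bmatrix} 7 & -2 \\ 3 & 4 \end{bmatrix} + \begin{bmatrix} 2 & 3 \\ -1 & 0 \end{bmatrix} \begin{bmatrix} -4 & -6 \\ 2 & 0 \end{bmatrix} \\ &= \begin{bmatrix} 2 \times 7 + 3 \times 3 & 2 \times (-2) + 3 \times 4 \\ -1 \times 7 + 0 \times 3 & -1 \times (-2) + 0 \times 4 \end{bmatrix} + \begin{bmatrix} 2 \times (-4) + 3 \times 2 & 2 \times (-6) + 3 \times 0 \\ -1 \times (-4) + 0 \times 2 & -1 \times (-6) + 0 \times 0 \end{bmatrix} \\ &= \begin{bmatrix} 23 & 8 \\ -7 & 2 \end{bmatrix} + \begin{bmatrix} -2 & -12 \\ 4 & 6 \end{bmatrix} \\ &= \begin{bmatrix} 21 & -4 \\ -3 & 8 \end{bmatrix} \end{aligned} \]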

3.7
\[ AB = \begin{bmatrix} 23 & 8 \\ -7 & 2 \end{bmatrix} \qquad BA = \begin{bmatrix} 16 & 21 \\ 2 & 9 \end{bmatrix} \]
It may seem unexpected that 𝐴𝐵 ≠ 𝐵𝐴. But if you look at the way you performed the matrix multiplication, you were multiplying different pairs of numbers together each time.
3.8
\[ AC = \begin{bmatrix} -2 & -12 \\ 4 & 6 \end{bmatrix} \qquad CA = \begin{bmatrix} -2 & -12 \\ 4 & 6 \end{bmatrix} \]
Notice that 𝐶 = −2𝐴. Therefore, 𝐴𝐶 = 𝐴(−2𝐴) = −2𝐴𝐴 = (−2𝐴)𝐴 = 𝐶𝐴.
3.9 The size of 𝐵𝐶 is 2 × 2, which is the same as that of 𝐴. Therefore, 𝐴 + 𝐵𝐶 has size 2 × 2.
3.10 𝐴 has 2 columns and 𝐵 has 2 rows, so 𝐴𝐵 makes sense. In fact the size of 𝐴𝐵 is 2 × 3. Since
𝐴𝐵 has 3 columns and 𝐶 has 3 rows, 𝐴𝐵𝐶 makes sense, and its size is 2 × 2.
3.11 𝐴 has 2 columns and 𝐶 has 3 rows, so 𝐴𝐶 does not make sense. So 𝐴𝐶𝐵 will not make sense.
3.12 Note that 𝐵𝐷 and 𝐵 both have size 2 × 3. So 𝐵𝐷 + 𝐵 will have size 2 × 3.
3.13 𝐴𝐵 has size 2 × 3. 𝐴𝐵𝐷 has size 2 × 3. 𝐴𝐵𝐷𝐶 has size 2 × 2. Therefore, 𝐴𝐵𝐷𝐶𝐴 has size
2 × 2.
3.14 Both 𝐴𝑢⃗ and 𝐵𝑣 ⃗ have size 2 × 1, so the size of 𝐴𝑢⃗ + 𝐵𝑣 ⃗ is 2 × 1.
3.15 Since 𝐶 has 2 columns and 𝑣 ⃗ has 3 rows, 𝐶𝑣 ⃗ does not make sense. So 𝐷𝑣 ⃗ + 𝐶𝑣 ⃗ does not
make sense.
3.16 Note that 𝑢⃗ has size 2 × 1 but 𝐶𝑢⃗ has size 3 × 1. We cannot add or subtract vectors or matrices of different sizes, so this expression does not make sense.
3.17 The blue vector is the same length as above, but the orange vector has been reversed and stretched by a factor of three. So the green vector is the result of adding 𝑢⃗ and the appropriate scalar multiple of 𝑣⃗. The green vector is 𝑢⃗ − 3𝑣⃗.
3.18 The orange vector is unchanged, but the blue vector is now half of its original length. This corresponds to it being multiplied by \( \frac12 \). So the second green vector is \( \vec{v} + \frac12\vec{u} \).

3.19 Option B is correct. The blue vector is multiplied by −2, which stretches it out and reflects
it. It is then moved so that its tail is at the tip of the orange vector.
3.20 𝐴 is invertible because det(𝐴) = 2 × (−3) − 2 × 1 = −8, which is not equal to 0. But 𝐵 is
not invertible because det(𝐵) = 3 × 4 − 6 × 2 = 0.
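3.21
\[ A^{-1} = \frac{1}{-8} \begin{bmatrix} -3 & -2 \\ -1 & 2 \end{bmatrix} = \begin{bmatrix} 3/8 & 1/4 \\ 1/8 & -1/4 \end{bmatrix} \]
\[ A^{-1}A = \begin{bmatrix} 3/8 & 1/4 \\ 1/8 & -1/4 \end{bmatrix} \begin{bmatrix} 2 & 2 \\ 1 & -3 \end{bmatrix} = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix} = I \]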
3.22
• In the first row operation, the first and third rows are swapped.
• In the second row operation, the first row is multiplied by 4 and the result is subtracted
from the third row.
• In the third row operation, the first row is multiplied by 2 and the result is added to the
second row.
• In the fourth row operation, the third row is multiplied by 4.
• In the final row operation, the second row is multiplied by 7 and the result is subtracted
from the third row.
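3.23 \( AB = \begin{bmatrix} 5 & 0 \\ 0 & 5 \end{bmatrix} \)
3.24 The matrix \( B^{-1} \) needs to satisfy \( B^{-1}B = I \). Notice that 𝐴𝐵 = 5𝐼. So \( \frac15 AB = I \). Therefore,
\[ B^{-1} = \frac15 A = \begin{bmatrix} 2/5 & 1/5 \\ 1/5 & 3/5 \end{bmatrix}. \]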
3.25
\[ \det(B) = 3 \times 2 - (-1) \times (-1) = 5 \]
\[ B^{-1} = \frac15 \begin{bmatrix} 2 & 1 \\ 1 & 3 \end{bmatrix} = \begin{bmatrix} 2/5 & 1/5 \\ 1/5 & 3/5 \end{bmatrix} \]

3.26 Substitute the lengths from each adult into the formula to create a system of two equations.

50𝑚 + 𝑏 = 195
45𝑚 + 𝑏 = 182

Write this as a matrix equation.

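\[ \begin{bmatrix} 50 & 1 \\ 45 & 1 \end{bmatrix} \begin{bmatrix} m \\ b \end{bmatrix} = \begin{bmatrix} 195 \\ 182 \end{bmatrix} \]
The determinant of the matrix above is
\[ 50 \times 1 - 1 \times 45 = 5. \]
So the inverse of this matrix is
\[ \frac15 \begin{bmatrix} 1 & -1 \\ -45 & 50 \end{bmatrix}. \]
The solution to this system is
\[ \begin{bmatrix} m \\ b \end{bmatrix} = \frac15 \begin{bmatrix} 1 & -1 \\ -45 & 50 \end{bmatrix} \begin{bmatrix} 195 \\ 182 \end{bmatrix} = \begin{bmatrix} 2.6 \\ 65 \end{bmatrix} \]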

3.27 ℎ = 2.6 × 35 + 65 = 156 cm.


3.28 The matrix was invertible (because its determinant was not zero). This allowed you to find
the inverse of that matrix.
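3.29
\[ -I - A = \begin{bmatrix} -2 & -2 \\ -3 & -3 \end{bmatrix}. \]
We need to solve the following linear system.
\[ \begin{bmatrix} -2 & -2 \\ -3 & -3 \end{bmatrix} \begin{bmatrix} x \\ y \end{bmatrix} = \begin{bmatrix} 0 \\ 0 \end{bmatrix}. \]
The first row operation you could apply would be to multiply the first row by \( -\frac32 \).
\[ \begin{bmatrix} 3 & 3 \\ -3 & -3 \end{bmatrix} \begin{bmatrix} x \\ y \end{bmatrix} = \begin{bmatrix} 0 \\ 0 \end{bmatrix}. \]
Now add the first row to the second.
\[ \begin{bmatrix} 3 & 3 \\ 0 & 0 \end{bmatrix} \begin{bmatrix} x \\ y \end{bmatrix} = \begin{bmatrix} 0 \\ 0 \end{bmatrix}. \]
So our one useful equation is 3𝑥 + 3𝑦 = 0. One eigenvector is found by choosing 𝑦 = 1, so that 𝑥 = −1. So one eigenvector is
\[ \vec{x} = \begin{bmatrix} -1 \\ 1 \end{bmatrix}. \]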
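3.30
\[
\lambda I - A = \lambda\begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix} - \begin{bmatrix} -3 & 2 \\ 4 & -1 \end{bmatrix} = \begin{bmatrix} \lambda + 3 & -2 \\ -4 & \lambda + 1 \end{bmatrix}
\]
\[
\begin{aligned}
\det(\lambda I - A) &= (\lambda + 3)(\lambda + 1) - (-2) \times (-4) \\
&= \lambda^2 + 3\lambda + \lambda + 3 - 8 \\
&= \lambda^2 + 4\lambda - 5 \\
&= (\lambda + 5)(\lambda - 1)
\end{aligned}
\]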
So the solutions to the characteristic equation are 𝜆 = −5 and 𝜆 = 1. These are the eigenvalues
of 𝐴.
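3.31
For 𝜆 = 1, we have
\[
\lambda I - A = I - A = \begin{bmatrix} 4 & -2 \\ -4 & 2 \end{bmatrix}.
\]
Adding the first row to the second in the corresponding linear system gives
\[
\begin{bmatrix} 4 & -2 \\ 0 & 0 \end{bmatrix}\begin{bmatrix} x \\ y \end{bmatrix} = \begin{bmatrix} 0 \\ 0 \end{bmatrix}
\]
So our useful equation is 4𝑥 − 2𝑦 = 0. Setting 𝑥 = 1 gives the eigenvector
\[
\vec{x} = \begin{bmatrix} 1 \\ 2 \end{bmatrix}.
\]
For 𝜆 = −5, we have
\[
-5I - A = \begin{bmatrix} -2 & -2 \\ -4 & -4 \end{bmatrix}.
\]
Adding −2 times the first row to the second in the corresponding linear system gives
\[
\begin{bmatrix} -2 & -2 \\ 0 & 0 \end{bmatrix}\begin{bmatrix} x \\ y \end{bmatrix} = \begin{bmatrix} 0 \\ 0 \end{bmatrix}
\]
So our useful equation is −2𝑥 − 2𝑦 = 0. Setting 𝑥 = 1 gives the eigenvector
\[
\vec{x} = \begin{bmatrix} 1 \\ -1 \end{bmatrix}.
\]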
Note that any (non-zero) scalar multiple of these eigenvectors would also be an eigenvector.
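A quick way to check answers like those in 3.30 and 3.31 is to multiply the matrix by the proposed eigenvector and compare with the eigenvalue times that vector. The sketch below does this for the matrix 𝐴 used in 3.30; the helper names `matvec` and `is_eigenpair` are invented for this illustration.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a column vector (list of numbers)."""
    return [sum(a * x for a, x in zip(row, vector)) for row in matrix]


def is_eigenpair(matrix, eigenvalue, vector):
    """True if vector is non-zero and matrix * vector equals eigenvalue * vector."""
    if all(x == 0 for x in vector):
        return False  # the zero vector is never an eigenvector
    return matvec(matrix, vector) == [eigenvalue * x for x in vector]


A = [[-3, 2], [4, -1]]  # the matrix from 3.30

checks = [
    is_eigenpair(A, 1, [1, 2]),     # eigenvector found for lambda = 1
    is_eigenpair(A, -5, [1, -1]),   # eigenvector found for lambda = -5
    is_eigenpair(A, -5, [3, -3]),   # a non-zero scalar multiple also works
]
```

All three entries of `checks` are `True`, which also illustrates the remark that scalar multiples of an eigenvector are again eigenvectors.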
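3.32
\[
\lambda I - A = \lambda\begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix} - \begin{bmatrix} 1 & 3 \\ 1 & -1 \end{bmatrix} = \begin{bmatrix} \lambda - 1 & -3 \\ -1 & \lambda + 1 \end{bmatrix}
\]
\[
\begin{aligned}
\det(\lambda I - A) &= (\lambda - 1)(\lambda + 1) - (-3) \times (-1) \\
&= \lambda^2 - \lambda + \lambda - 1 - 3 \\
&= \lambda^2 - 4 \\
&= (\lambda + 2)(\lambda - 2)
\end{aligned}
\]

So the eigenvalues of 𝐴 are 𝜆 = 2 and 𝜆 = −2.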
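3.33 Multiply 𝐴 by these vectors to find

\[
\begin{bmatrix} 1 & 3 \\ 1 & -1 \end{bmatrix}\begin{bmatrix} -1 \\ 1 \end{bmatrix} = \begin{bmatrix} 2 \\ -2 \end{bmatrix} = -2\begin{bmatrix} -1 \\ 1 \end{bmatrix}
\]

and

\[
\begin{bmatrix} 1 & 3 \\ 1 & -1 \end{bmatrix}\begin{bmatrix} 3 \\ 1 \end{bmatrix} = \begin{bmatrix} 6 \\ 2 \end{bmatrix} = 2\begin{bmatrix} 3 \\ 1 \end{bmatrix}.
\]

So
\[
\begin{bmatrix} -1 \\ 1 \end{bmatrix} \quad\text{and}\quad \begin{bmatrix} 3 \\ 1 \end{bmatrix}
\]
correspond to 𝜆 = −2 and 𝜆 = 2 respectively.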


3.34 First, the determinant of 𝑃 is det(𝑃) = (−1) × 1 − 3 × 1 = −4.
Therefore,
\[
P^{-1} = \frac{1}{-4}\begin{bmatrix} 1 & -3 \\ -1 & -1 \end{bmatrix} = \begin{bmatrix} -1/4 & 3/4 \\ 1/4 & 1/4 \end{bmatrix}.
\]
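3.35 This is just routine matrix multiplication. Note that 𝑃 is written first in the product 𝑃𝐷.
\[
PD = \begin{bmatrix} -1 & 3 \\ 1 & 1 \end{bmatrix}\begin{bmatrix} -2 & 0 \\ 0 & 2 \end{bmatrix} = \begin{bmatrix} 2 & 6 \\ -2 & 2 \end{bmatrix}
\]
\[
PDP^{-1} = \begin{bmatrix} 2 & 6 \\ -2 & 2 \end{bmatrix}\begin{bmatrix} -1/4 & 3/4 \\ 1/4 & 1/4 \end{bmatrix} = \begin{bmatrix} 1 & 3 \\ 1 & -1 \end{bmatrix} = A
\]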
3.36 Recall that 𝑃⁻¹𝑃 = 𝐼, where 𝐼 is the identity matrix.

𝐴² = 𝐴𝐴
= 𝑃𝐷𝑃⁻¹𝑃𝐷𝑃⁻¹
= 𝑃𝐷𝐼𝐷𝑃⁻¹
= 𝑃𝐷²𝑃⁻¹

𝐴³ = 𝐴²𝐴
= 𝑃𝐷²𝑃⁻¹𝑃𝐷𝑃⁻¹
= 𝑃𝐷²𝐼𝐷𝑃⁻¹
= 𝑃𝐷³𝑃⁻¹

The above process can be repeated any number of times. So 𝐴ⁿ = 𝑃𝐷ⁿ𝑃⁻¹.
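3.37
\[
D^2 = \begin{bmatrix} -2 & 0 \\ 0 & 2 \end{bmatrix}\begin{bmatrix} -2 & 0 \\ 0 & 2 \end{bmatrix} = \begin{bmatrix} 4 & 0 \\ 0 & 4 \end{bmatrix}
\]
\[
D^3 = D^2 D = \begin{bmatrix} 4 & 0 \\ 0 & 4 \end{bmatrix}\begin{bmatrix} -2 & 0 \\ 0 & 2 \end{bmatrix} = \begin{bmatrix} -8 & 0 \\ 0 & 8 \end{bmatrix}
\]
As you can see, the main diagonal entries are just powers of −2 and 2, whereas the other entries
are 0. This pattern continues, so
\[
D^n = \begin{bmatrix} (-2)^n & 0 \\ 0 & 2^n \end{bmatrix}.
\]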
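3.38
\[
\begin{aligned}
A^5 &= PD^5P^{-1} \\
&= \begin{bmatrix} -1 & 3 \\ 1 & 1 \end{bmatrix}\begin{bmatrix} (-2)^5 & 0 \\ 0 & 2^5 \end{bmatrix}\begin{bmatrix} -1/4 & 3/4 \\ 1/4 & 1/4 \end{bmatrix} \\
&= \begin{bmatrix} -1 & 3 \\ 1 & 1 \end{bmatrix}\begin{bmatrix} -32 & 0 \\ 0 & 32 \end{bmatrix}\begin{bmatrix} -1/4 & 3/4 \\ 1/4 & 1/4 \end{bmatrix} \\
&= \begin{bmatrix} 32 & 96 \\ -32 & 32 \end{bmatrix}\begin{bmatrix} -1/4 & 3/4 \\ 1/4 & 1/4 \end{bmatrix} \\
&= \begin{bmatrix} 16 & 48 \\ 16 & -16 \end{bmatrix}
\end{aligned}
\]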

The efficiency of this method comes from only ever needing to perform two matrix multiplications,
regardless of the power of the matrix.
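To see the two approaches side by side, the sketch below computes 𝐴⁵ both by diagonalisation and by repeated multiplication, using the matrices from 3.34 to 3.38. It is an illustration only; the function names are hypothetical, and exact fractions are used for the entries of 𝑃⁻¹.

```python
from fractions import Fraction


def matmul(p, q):
    """Product of two matrices stored as lists of rows."""
    return [[sum(p[i][k] * q[k][j] for k in range(len(q)))
             for j in range(len(q[0]))] for i in range(len(p))]


def power_by_diagonalisation(P, eigenvalues, P_inv, n):
    """Compute P D^n P^(-1), where D is diagonal with the given eigenvalues."""
    D_n = [[eigenvalues[0] ** n, 0], [0, eigenvalues[1] ** n]]
    return matmul(matmul(P, D_n), P_inv)  # always two matrix multiplications


def power_by_repeated_multiplication(A, n):
    """Compute A^n by multiplying by A a total of n times."""
    result = [[1, 0], [0, 1]]
    for _ in range(n):
        result = matmul(result, A)
    return result


A = [[1, 3], [1, -1]]
P = [[-1, 3], [1, 1]]
P_inv = [[Fraction(-1, 4), Fraction(3, 4)], [Fraction(1, 4), Fraction(1, 4)]]

A5_diag = power_by_diagonalisation(P, [-2, 2], P_inv, 5)
A5_direct = power_by_repeated_multiplication(A, 5)
```

Both results have the entries 16, 48, 16 and −16 found in 3.38. The diagonalisation route uses two matrix products whatever the value of 𝑛, whereas the loop uses 𝑛 of them.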
3.39 Half of the 2000 people who visited store A in the first week will return to store A in the
second week. Also, one quarter of the 2000 people who visited store B in week one will visit
store A in the second week. So the total number of people visiting store A in the second week is
(1/2) × 2000 + (1/4) × 2000 = 1500.
By similar reasoning, 2500 people will visit store B in the second week.
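3.40
\[
\vec{x}_1 = \begin{bmatrix} 1/2 & 1/4 \\ 1/2 & 3/4 \end{bmatrix}\begin{bmatrix} 2000 \\ 2000 \end{bmatrix} = \begin{bmatrix} 1500 \\ 2500 \end{bmatrix}
\]
\[
\vec{x}_2 = \begin{bmatrix} 1/2 & 1/4 \\ 1/2 & 3/4 \end{bmatrix}\begin{bmatrix} 1500 \\ 2500 \end{bmatrix} = \begin{bmatrix} 1375 \\ 2625 \end{bmatrix}
\]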
3.41 These vectors represent the number of people who visit each store each week.
3.42

𝑥⃗₂ = 𝐴𝑥⃗₁
= 𝐴𝐴𝑥⃗₀
= 𝐴²𝑥⃗₀

𝑥⃗₃ = 𝐴𝑥⃗₂
= 𝐴𝐴²𝑥⃗₀
= 𝐴³𝑥⃗₀

This process can be repeated indefinitely. So the pattern holds for all 𝑛.
𝑥⃗ₙ = 𝐴ⁿ𝑥⃗₀
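3.43 The eigenvalues are 1 and 1/4, and they have corresponding eigenvectors

\[
\begin{bmatrix} 1/2 \\ 1 \end{bmatrix} \quad\text{and}\quad \begin{bmatrix} -1 \\ 1 \end{bmatrix}
\]

respectively. So the necessary matrices are

\[
D = \begin{bmatrix} 1 & 0 \\ 0 & 1/4 \end{bmatrix}, \qquad P = \begin{bmatrix} 1/2 & -1 \\ 1 & 1 \end{bmatrix}, \qquad P^{-1} = \begin{bmatrix} 2/3 & 2/3 \\ -2/3 & 1/3 \end{bmatrix}.
\]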
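3.44 As in Question 2,
\[
D^n = \begin{bmatrix} 1^n & 0 \\ 0 & (1/4)^n \end{bmatrix} = \begin{bmatrix} 1 & 0 \\ 0 & (1/4)^n \end{bmatrix}.
\]
When 𝑛 is large, (1/4)ⁿ becomes very small, virtually 0. So 𝐷ⁿ will be very close to the matrix

\[
\begin{bmatrix} 1 & 0 \\ 0 & 0 \end{bmatrix}.
\]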
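3.45 We want to see what happens to 𝑥⃗ₙ when 𝑛 is large. Since 𝑥⃗ₙ = 𝐴ⁿ𝑥⃗₀, we need to find 𝐴ⁿ.
As in Question 2, we have

\[
A^n = PD^nP^{-1} = \begin{bmatrix} 1/2 & -1 \\ 1 & 1 \end{bmatrix}\begin{bmatrix} 1 & 0 \\ 0 & (1/4)^n \end{bmatrix}\begin{bmatrix} 2/3 & 2/3 \\ -2/3 & 1/3 \end{bmatrix}.
\]

When 𝑛 is large, we can replace (1/4)ⁿ with 0 because it is so small.

\[
\begin{aligned}
A^n &\approx \begin{bmatrix} 1/2 & -1 \\ 1 & 1 \end{bmatrix}\begin{bmatrix} 1 & 0 \\ 0 & 0 \end{bmatrix}\begin{bmatrix} 2/3 & 2/3 \\ -2/3 & 1/3 \end{bmatrix} \\
&= \begin{bmatrix} 1/2 & 0 \\ 1 & 0 \end{bmatrix}\begin{bmatrix} 2/3 & 2/3 \\ -2/3 & 1/3 \end{bmatrix} \\
&= \begin{bmatrix} 1/3 & 1/3 \\ 2/3 & 2/3 \end{bmatrix}.
\end{aligned}
\]

Therefore, when 𝑛 is large,

\[
\vec{x}_n = A^n\vec{x}_0 \approx \begin{bmatrix} 1/3 & 1/3 \\ 2/3 & 2/3 \end{bmatrix}\begin{bmatrix} 2000 \\ 2000 \end{bmatrix} \approx \begin{bmatrix} 1333 \\ 2667 \end{bmatrix}.
\]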

So in the long run roughly 1333 people will visit store A each week, and roughly 2667 will visit
store B each week.
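The long-run behaviour can also be observed by simply repeating the weekly update 𝑥⃗ₙ₊₁ = 𝐴𝑥⃗ₙ many times. The sketch below does this for 20 weeks; the choice of 20 weeks and the names `step` and `history` are assumptions made for this illustration.

```python
def step(transition, state):
    """One week of the model: multiply the transition matrix by the state vector."""
    return [sum(t * x for t, x in zip(row, state)) for row in transition]


A = [[0.5, 0.25], [0.5, 0.75]]   # transition matrix from 3.40
state = [2000.0, 2000.0]         # week-0 visitors to store A and store B
history = [state]

for week in range(20):
    state = step(A, state)
    history.append(state)

long_run = [round(value) for value in state]
```

The first two entries after the starting vector match 𝑥⃗₁ and 𝑥⃗₂ from 3.40, and `long_run` agrees with the values 1333 and 2667 obtained from the diagonalisation argument.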

Chapter 4

Multivariate Functions

4.1 Relations

4.1.1 Relations
An ordered pair of real numbers will be written as (𝑥, 𝑦), where 𝑥 and 𝑦 are real numbers. The set of
all ordered pairs of real numbers will be denoted by ℝ².
We can produce plots of ordered pairs as shown in the following example.
Example. The ordered pairs (1, 2) and (−2, 1) are plotted below.

The first number in the ordered pair is represented on the horizontal 𝑥-axis and the second
is represented on the vertical 𝑦-axis.
A binary relation on ℝ is a set of ordered pairs of real numbers.
Although there are many other types of relations, in this section we will just use the term relation to
describe a binary relation on ℝ.
The relations we will consider will be sets of ordered pairs with a rule that relates the two numbers in
each ordered pair. The notation used is explained in the following example.
Example. The relation {(𝑥, 𝑦) ∈ ℝ² ∶ 𝑦 = 𝑥²} is the set of all ordered pairs (𝑥, 𝑦) such
that 𝑦 = 𝑥². The plot of a relation like this is the same as the plot of the function 𝑓 given
by 𝑓(𝑥) = 𝑥².

Although the above relation had the same plot as that of a function, it should be understood that
not all relations are functions. Relations are more general. Below you will find some examples of
relations that are not functions.

4.1.2 Circles
A circle centred at (ℎ, 𝑘) with radius 𝑟 can be described by the following relation:
{(𝑥, 𝑦) ∈ ℝ² ∶ (𝑥 − ℎ)² + (𝑦 − 𝑘)² = 𝑟²}
Example. The plot of the relation {(𝑥, 𝑦) ∈ ℝ² ∶ 𝑥² + (𝑦 + 2)² = 4} is given below.

This is the circle centred at (0, −2) with radius 2.
A4.1
Question. Is the plot above the plot of a function?

4.1.3 Ellipses
An ellipse centred at (ℎ, 𝑘) can be described by the following relation:
{(𝑥, 𝑦) ∈ ℝ² ∶ (𝑥 − ℎ)²/𝑎² + (𝑦 − 𝑘)²/𝑏² = 1}

The constants 𝑎 and 𝑏 give a measure of how “wide” and how “tall” the ellipse is. The constant 𝑎 is
the horizontal distance between the leftmost (or rightmost) point on the ellipse and its centre, and 𝑏
is the vertical distance between the lowest (or highest) point on the ellipse and its centre.
The smaller of 𝑎 and 𝑏 is sometimes called the semi-minor axis length and the larger is called the
semi-major axis length. However, we won't be referring to these names, so you may just want to think
about their interpretations given above.
Example. The plot of the relation {(𝑥, 𝑦) ∈ ℝ² ∶ (𝑥 − 1)²/4 + (𝑦 − 1)² = 1} is given below.

This is an ellipse centred at (1, 1). Comparing with the general form above, we have 𝑎 = 2
and 𝑏 = 1. Compare these values with their interpretations in terms of distances from the
centre of the ellipse.

4.2 Relations: Activity


Question 1: Write the relations for the following two circles as sets.
(a)

A4.2

(b)

A4.3

Question 2: Write the relations for the following two ellipses as sets.
(a)

A4.4

(b)

A4.5

Question 3: Sketch the plots of the following relations:


(a) {(𝑥, 𝑦) ∈ ℝ² ∶ (𝑥 + 3)² + (𝑦 − 1)² = 9} A4.6
(b) {(𝑥, 𝑦) ∈ ℝ² ∶ 𝑥² + (𝑦 + 1)²/4 = 1} A4.7

4.3 Functions of several variables

4.3.1 Functions of several variables


The functions considered in this section will be similar to the real functions of a single real variable
considered earlier in that they must generate exactly one output for each input. The key difference
here is that to generate an output, we will need to input several numbers at once.
We will be mostly considering real functions of two real variables, often just called functions of two
variables. A function of two variables will take an ordered pair of real numbers as an input and
produce a single real number as an output. The domain of these functions will be ℝ² (or some subset
of ℝ²) and their range will be some subset of ℝ.
We can describe functions of two variables in several ways, as shown in the following example.
Example. The function which accepts any ordered pair of real numbers as inputs and
outputs one less than the sum of the squares of those numbers can be represented:
• in words, as above;

• with mathematical symbols as 𝑓(𝑥, 𝑦) = 𝑥² + 𝑦² − 1 with domain ℝ²;
• as a plot, with the possible inputs on the two horizontal axes and the possible outputs
on the vertical axis.

Unlike the plots of functions of one variable, we need to specify which axis represents the
𝑥 values and which represents the 𝑦 values.
This function can be evaluated at any point in ℝ². For example,
𝑓(2, 3) = 2² + 3² − 1 = 12
and
𝑓(0, −5) = 0² + (−5)² − 1 = 24.
We can also consider real functions of three or more real variables. The domain of a real function of
three real variables will be some subset of ℝ³ (the set of all ordered triples (𝑥, 𝑦, 𝑧) of real numbers).
More generally, a real function of 𝑛 real variables will have as its domain some subset of ℝⁿ (the set of
all ordered 𝑛-tuples (𝑥₁, 𝑥₂, …, 𝑥ₙ) of real numbers). However, we will mostly restrict our attention
to real functions of two real variables.

4.3.2 Linear functions of two variables


A linear function of two real variables is a function that can be written as
𝑓(𝑥, 𝑦) = 𝑎𝑥 + 𝑏𝑦 + 𝑐
where 𝑎, 𝑏 and 𝑐 are real numbers.
The plot of a linear function is a plane.
Example. Below is the plot of the function 𝑓 given by 𝑓(𝑥, 𝑦) = 2𝑥 − 3𝑦 + 1 with domain
ℝ².

The demonstration here allows you to change the orientation and position of a plane. The plane
is written in a slightly different form in this demonstration. However, if you rearrange the given
equation of the plane to write 𝑧 in terms of 𝑥 and 𝑦, then you will have your plane as a function in
the form given above. For example, the plane given by 4(𝑥 − 2) − 2𝑦 + 2𝑧 = 0 can be rearranged to
give 𝑓(𝑥, 𝑦) = 𝑧 = −2𝑥 + 𝑦 + 4.

4.3.3 Zeroes and stationary points


The zeroes or roots of a function 𝑓 of two variables are the ordered pairs (𝑥, 𝑦) in the domain such
that 𝑓(𝑥, 𝑦) = 0.
Example. The linear function given by 𝑓(𝑥, 𝑦) = 3𝑥 + 𝑦 − 2 has infinitely many zeroes.
By setting 𝑓(𝑥, 𝑦) = 0 and rearranging, we see that any ordered pair (𝑥, 𝑦) that satisfies
𝑦 = −3𝑥 + 2 is a zero of 𝑓.
It is not uncommon for functions of two variables to have infinitely many zeroes.
The stationary points of a function are the ordered pairs (𝑥, 𝑦) in the domain where the plot is “flat”.
In the next section we will demonstrate a method used to locate these stationary points. However,
this can be done by eye to a certain extent.
Example. Below is a plot of the function 𝑓 given by 𝑓(𝑥, 𝑦) = 𝑥² − 4𝑥 − 𝑦² with domain
ℝ².

The plot appears to be “flat” when 𝑥 = 2 and 𝑦 = 0. So (2, 0) is a stationary point.

4.3.4 Level sets and contour plots


A level set of a function 𝑓 of two variables is a set {(𝑥, 𝑦) ∈ ℝ² ∶ 𝑓(𝑥, 𝑦) = 𝑧} where 𝑧 is some real
number.
A level set is just a binary relation on ℝ. So for a given function of two variables, we can produce a
plot of its level sets. Such plots are called contour plots.
Example. Consider the function 𝑓 given by 𝑓(𝑥, 𝑦) = 𝑥² + 4(𝑦 − 1)² − 3 with domain
ℝ². The level set for 𝑧 = 1 is given by
{(𝑥, 𝑦) ∈ ℝ² ∶ 𝑥² + 4(𝑦 − 1)² − 3 = 1}.
This can be rearranged (add 3 to both sides, then divide by 4) into something more familiar.
{(𝑥, 𝑦) ∈ ℝ² ∶ 𝑥²/4 + (𝑦 − 1)² = 1}
This level set is a relation which describes an ellipse centred at (0, 1). The contour plot
for 𝑧 = 1 is simply the plot of this relation and is given below.

To gain some intuition about what the three-dimensional plot of 𝑓 looks like, we can
produce several contour plots on the same set of axes. Each contour plot below is labelled
with the value of 𝑧 that was used.

This contour plot tells us a lot about what the three-dimensional plot of the function
looks like. The value of the function becomes larger as we move away from the point
(0, 1). We can deduce that there is a local minimum at (0, 1). The value of the function
at this local minimum is 𝑓(0, 1) = −3.
We can also say something about how ”steep” the plot will be at certain points. Compare
the spacing between the 𝑧 = −2 and 𝑧 = 1 contour plots with the spacing between the
𝑧 = 10 and 𝑧 = 13 contour plots. The spacing between the first two is larger, which
means that in order for the value of 𝑓 to increase by 3, a greater distance needs to be
covered. Essentially, larger spacing between contour plots corresponds to slower growth
(and smaller spacing corresponds to more rapid growth).
So 𝑓 grows more quickly the further we move from (0, 1). This is because the spacing
between contours becomes smaller as we move away from this point.
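The spacing claim in this example can be checked numerically. Rearranging 𝑓(𝑥, 𝑦) = 𝑧 as above shows that each level set with 𝑧 > −3 is an ellipse centred at (0, 1) whose horizontal half-width is √(𝑧 + 3) and whose vertical half-height is half of that. The sketch below uses this to measure the horizontal gap between neighbouring contours; the function name `semi_axes` is hypothetical.

```python
import math


def semi_axes(z):
    """Half-width and half-height of the level set x^2 + 4(y - 1)^2 - 3 = z."""
    if z < -3:
        raise ValueError("the level set is empty when z < -3")
    half_width = math.sqrt(z + 3)
    return half_width, half_width / 2


# Horizontal distance between the z = -2 and z = 1 contours
inner_gap = semi_axes(1)[0] - semi_axes(-2)[0]

# Horizontal distance between the z = 10 and z = 13 contours
outer_gap = semi_axes(13)[0] - semi_axes(10)[0]
```

The inner gap is 1 while the outer gap is roughly 0.39, so the same increase of 3 in the value of 𝑓 takes less distance further from (0, 1), as described above.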
Question. Does the function 𝑓 grow more rapidly as you move away from the point (0, 1) to the
right along the line 𝑦 = 1, or as you move down the line 𝑥 = 0? A4.8
The demonstration here allows you to modify a few functions and compare their three-dimensional
plots with their contour plots. Don't worry if some of those functions look complicated or unfamiliar.
Just focus on identifying features on the plots, in particular the stationary points.

4.4 Functions of several variables: Graphing activity
In this section, we will show you how to use WolframAlpha to produce plots of relations, functions of
two variables, and contour plots.

4.4.1 Plots of relations


When generating the plot of a function of one variable, say 𝑓 given by 𝑓(𝑥) = 2𝑥² + 3, we only needed
to type in the expression “2x^2+3”. However, for a relation, we will need to enter the equation
relating the 𝑥- and 𝑦-coordinates.
In order to plot the ellipse
{(𝑥, 𝑦) ∈ ℝ² ∶ 𝑥²/9 + (𝑦 − 1)²/4 = 1},

we must enter the defining equation.

As usual, you can see more information about the relation by scrolling down the page. However, the
most important thing on this page is the plot of the relation.

4.4.2 Plots of functions of two variables


To plot a function of two variables, as was the case with functions of one variable, we need only enter
the expression used as the rule for the function.
Below we have generated a plot of the function 𝑓 given by 𝑓(𝑥, 𝑦) = 𝑥² − 𝑥𝑦 + 2𝑦² with domain ℝ².

If you wish to see the plot of this function on a larger part of the domain, you can specify this. For
example, you may wish to see the above plot for 𝑥-values between −3 and 2, and 𝑦-values between
−2 and 2.

By scrolling down, you will see a contour plot of this function.

Although it does not have the values of the function labelled for each contour, the colours can give
you some indication. Darker colours correspond to smaller values and lighter colours correspond to
larger values. Compare the contour plot with the three-dimensional plot. In particular, look at what
is happening around (0, 0) on each.
Spend some time coming up with your own functions and plot them on WolframAlpha. Compare the
contour plots with the three-dimensional plots and locate the stationary points of these functions.

4.5 Functions of several variables: Activity


Question 1: Consider the function 𝑓 given by 𝑓(𝑥, 𝑦) = 8𝑥² + 4𝑦 with domain ℝ².
(a) Sketch (by hand) a contour plot of 𝑓 at 𝑧-values of −8, −4, 0, 4, 8 and 12. In your plot, let the
𝑥-axis span between −2 and 2, and let the 𝑦-axis span between −2 and 3. A4.9
(b) From the point (1, −1), does the function grow more rapidly as you move in the vertical or
horizontal direction? A4.10
(c) Using this contour plot, deduce if 𝑓 has any local maxima or minima. A4.11
(d) Where are the zeroes of 𝑓? A4.12
Question 2: Consider the function 𝑓 given by 𝑓(𝑥, 𝑦) = 𝑥² − 6𝑥 + 4𝑦² + 8𝑦 + 5 with domain ℝ².
(a) Expand (𝑥 − 3)² + 4(𝑦 + 1)² − 8 to show that it is equal to 𝑓(𝑥, 𝑦). A4.13
(b) Plot the level sets for 𝑧 = −4 and 𝑧 = 8. A4.14
(c) Does 𝑓 have a stationary point? If so, state where it is and describe what type of stationary point
it is. Produce a three-dimensional plot of 𝑓 to confirm your answer. A4.15
Question 3: Consider the following contour plot.

(a) Does this function have any local maxima or local minima in the region shown? If so, at what
points in the domain do they occur? A4.16
(b) How does the value of the function change as you move along a diagonal line through the origin
(i.e. the point (0, 0))? A4.17
(c) Would you consider the point (0, 0) to be a local maximum, a local minimum, neither or both? A4.18
Question 4: Consider the following vectors:
\[
\vec{u} = \begin{bmatrix} 2 \\ 3 \\ 1 \end{bmatrix}, \qquad \vec{v} = \begin{bmatrix} x \\ y \\ 1 \end{bmatrix}.
\]
(a) Let 𝑓 be the function given by 𝑓(𝑥, 𝑦) = 𝑢⃗ ⋅ 𝑣⃗. What type of function is 𝑓? A4.19
(b) Describe in words the three-dimensional plot and the contour plot of 𝑓. A4.20
Question 5 (Quadratic Forms): Consider the following matrices:
\[
A = \begin{bmatrix} 1 & 8 \\ 2 & 1 \end{bmatrix}, \qquad B = \begin{bmatrix} 1 & 1 \\ 1 & 1 \end{bmatrix}, \qquad C = \begin{bmatrix} 1 & 0 \\ 0 & 4 \end{bmatrix}
\]
We will use these matrices to define the functions 𝑓, 𝑔 and ℎ. Each function has domain ℝ².

\[
\begin{aligned}
f(x, y) &= \begin{bmatrix} x & y \end{bmatrix}\begin{bmatrix} 1 & 8 \\ 2 & 1 \end{bmatrix}\begin{bmatrix} x \\ y \end{bmatrix} \\
g(x, y) &= \begin{bmatrix} x & y \end{bmatrix}\begin{bmatrix} 1 & 1 \\ 1 & 1 \end{bmatrix}\begin{bmatrix} x \\ y \end{bmatrix} \\
h(x, y) &= \begin{bmatrix} x & y \end{bmatrix}\begin{bmatrix} 1 & 0 \\ 0 & 4 \end{bmatrix}\begin{bmatrix} x \\ y \end{bmatrix}
\end{aligned}
\]

(a) Carry out the above matrix multiplication to find the rules for 𝑓, 𝑔 and ℎ. (Your answer will be a
1 × 1 matrix, which you can just treat as a real number). A4.21
(b) Calculate the eigenvalues of 𝐴, 𝐵 and 𝐶. A4.22
(c) Produce contour plots of 𝑓, 𝑔 and ℎ. Locate their stationary points and identify the type of
each stationary point. (You may also wish to produce a three-dimensional plot to help you find the
stationary points). A4.23
(d) Write down some different 2 × 2 matrices and repeat the above steps. Do you think the eigenvalues
reveal anything about the nature of the stationary points? A4.24

4.6 Partial derivatives


4.6.1 Partial derivatives
Earlier, we looked at derivatives of functions of one variable. Our approach was to find the gradient of
the tangent line at a point on the plot of the function. When dealing with a function of two variables,
rather than constructing a tangent line at a point on the plot, we need to find a tangent plane instead.
Example. Below is a plot of the function 𝑓 given by 𝑓(𝑥, 𝑦) = 1 − 𝑥² − 𝑦² with a plane
tangent to the function at (1, 1).

Note that the blue tangent plane just touches the orange plot of 𝑓 at (1, 1), which is
marked in red.

The demonstration here allows you to view a tangent plane at various points on the plot of a function
of two variables.
In order to specify the orientation of this tangent plane, we need two numbers. One will describe the
slope in the direction of the 𝑥-axis and the other will describe the slope in the direction of the 𝑦-axis.
These quantities can be found using functions called partial derivatives.
The partial derivative in the 𝑥-direction is denoted by 𝑓𝑥 and is found by differentiating 𝑓 with respect
to 𝑥 as if it were a function of one variable, treating 𝑦 as a constant.
Similarly, the partial derivative in the 𝑦-direction is denoted by 𝑓𝑦 and is found by differentiating 𝑓
with respect to 𝑦, treating 𝑥 as a constant.
Example. Let 𝑓 be the function given by 𝑓(𝑥, 𝑦) = 𝑥²𝑦 + 𝑒^(3𝑦) with domain ℝ².
To compute 𝑓𝑥, we differentiate 𝑓 with respect to 𝑥, treating 𝑦 as a constant. So
𝑓𝑥(𝑥, 𝑦) = 2𝑥𝑦. Similarly, 𝑓𝑦(𝑥, 𝑦) = 𝑥² + 3𝑒^(3𝑦).
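Partial derivatives found by hand can be sanity-checked numerically: nudge one variable by a small amount while holding the other fixed, and compare the resulting slope with the formula. The sketch below does this for the function in the example above. The step size `h` and the test point (1.5, 0.5) are arbitrary choices made for this illustration.

```python
import math


def f(x, y):
    """The function from the example: x^2 * y + e^(3y)."""
    return x**2 * y + math.exp(3 * y)


def partial_x(func, x, y, h=1e-6):
    """Approximate slope in the x-direction, keeping y constant."""
    return (func(x + h, y) - func(x - h, y)) / (2 * h)


def partial_y(func, x, y, h=1e-6):
    """Approximate slope in the y-direction, keeping x constant."""
    return (func(x, y + h) - func(x, y - h)) / (2 * h)


x0, y0 = 1.5, 0.5
numeric = (partial_x(f, x0, y0), partial_y(f, x0, y0))

# Values of the formulas 2xy and x^2 + 3e^(3y) at the same point
exact = (2 * x0 * y0, x0**2 + 3 * math.exp(3 * y0))
```

The two pairs agree to several decimal places. This is only a numerical check, not a proof, but it is a handy way to catch algebra slips.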
Partial derivatives can be thought of as a rate of change in a particular direction. When 𝑓𝑥 is positive,
𝑓 increases as you move along the 𝑥-axis (keeping 𝑦 constant). Similarly, when 𝑓𝑦 is positive, 𝑓
increases as you move along the 𝑦-axis. When the magnitude of 𝑓𝑥 or 𝑓𝑦 is large, 𝑓 is increasing or
decreasing more rapidly than when 𝑓𝑥 or 𝑓𝑦 is small.
Example. Consider the plot below of the function 𝑓 given by 𝑓(𝑥, 𝑦) = 1 − 𝑥³.

The partial derivatives are 𝑓𝑥(𝑥, 𝑦) = −3𝑥² and 𝑓𝑦(𝑥, 𝑦) = 0. We can evaluate these at
the point (0.5, 1).

𝑓𝑥(0.5, 1) = −3/4
𝑓𝑦(0.5, 1) = 0

Around (0.5, 1), 𝑓 decreases as you move along the 𝑥-axis in the positive direction, which
agrees with 𝑓𝑥(0.5, 1) being negative. As you move along the 𝑦-axis, 𝑓 does not change,
agreeing with our calculation 𝑓𝑦(0.5, 1) = 0.

The notion of a partial derivative can be easily extended to functions of any number of variables. The
following example illustrates this with a function of three variables.
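Example. Consider the function 𝑓 of three variables given by 𝑓(𝑥, 𝑦, 𝑧) = 𝑥𝑦𝑧² + ln(𝑥𝑦).
The partial derivatives are given below. The logarithm terms come from the chain rule.

\[
\begin{aligned}
f_x(x, y, z) &= yz^2 + \frac{y}{xy} = yz^2 + \frac{1}{x} \\
f_y(x, y, z) &= xz^2 + \frac{x}{xy} = xz^2 + \frac{1}{y} \\
f_z(x, y, z) &= 2xyz
\end{aligned}
\]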

The concept of differentiability is slightly more complicated for a function of two (or more) variables.
Essentially, differentiability ensures that at each point, there exists a tangent plane which is ”close” to
the function near that point. For this to occur, we need more than just the partial derivatives to exist.
This would only ensure that the tangent plane is close to the function as you travel towards that
point along the direction of the 𝑥 and 𝑦-axes. This needs to be true regardless of the direction you
travel. However, this is a somewhat technical point and we will not pursue it further. The functions
we will encounter will be differentiable on their entire domain.

4.6.2 The gradient vector


We will first give the definition of the gradient vector and show through an example why it is useful.
Given a function 𝑓 of two variables, its gradient vector is the vector
\[
\nabla f(x, y) = \begin{bmatrix} f_x(x, y) \\ f_y(x, y) \end{bmatrix}
\]
where the two components are the partial derivatives of 𝑓.
Example. Recall the first example with 𝑓(𝑥, 𝑦) = 1 − (1/4)𝑥² − (1/8)𝑦². Its partial derivatives
are given by 𝑓𝑥(𝑥, 𝑦) = −(1/2)𝑥 and 𝑓𝑦(𝑥, 𝑦) = −(1/4)𝑦. Its gradient vector is
\[
\nabla f(x, y) = \begin{bmatrix} -\tfrac{1}{2}x \\ -\tfrac{1}{4}y \end{bmatrix}.
\]
We can compute the gradient vector at various points. For example,
\[
\nabla f(1, 1) = \begin{bmatrix} -\tfrac{1}{2} \\ -\tfrac{1}{4} \end{bmatrix}
\]
and
\[
\nabla f(-1, 2) = \begin{bmatrix} \tfrac{1}{2} \\ -\tfrac{1}{2} \end{bmatrix}.
\]
Below is a contour plot of 𝑓 with these two vectors superimposed. The vectors have been
shifted to the points where ∇𝑓(𝑥, 𝑦) was evaluated. For example, ∇𝑓(1, 1) starts at the
point (1, 1).

The blue vector is the gradient vector at (1, 1). Note that it is perpendicular to the
contour lines near (1, 1). In particular, it points in the direction that the function grows
most rapidly from (1, 1).
Similarly, the orange vector points in the direction of most rapid growth away from (−1, 2).
This vector is also perpendicular to the contour lines near (−1, 2).
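The claim that the gradient vector points in the direction of most rapid growth can be tested by brute force. The sketch below takes a small step from (1, 1) in each of 360 directions (one per degree), records which direction increases 𝑓 the most, and compares it with the direction of ∇𝑓(1, 1). The step length and the one-degree resolution are assumptions made for this illustration.

```python
import math


def f(x, y):
    """The function from the example: 1 - x^2/4 - y^2/8."""
    return 1 - x**2 / 4 - y**2 / 8


def gradient(x, y):
    """Gradient vector (f_x, f_y) = (-x/2, -y/4)."""
    return (-x / 2, -y / 4)


def directional_change(x, y, angle, step=1e-4):
    """Change in f after a small step from (x, y) in the given direction."""
    return f(x + step * math.cos(angle), y + step * math.sin(angle)) - f(x, y)


gx, gy = gradient(1, 1)
gradient_angle = math.atan2(gy, gx)          # direction of the gradient vector

angles = [math.radians(k) for k in range(360)]
best_angle = max(angles, key=lambda a: directional_change(1, 1, a))

# Smallest signed difference between the two directions, in radians
difference = (best_angle - gradient_angle + math.pi) % (2 * math.pi) - math.pi
```

The best of the sampled directions lies within one degree of the gradient direction, which is as close as a one-degree grid allows.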
Questions.
1. Does the function in the previous example have a local maximum or a local minimum? If so,
where is it? A4.25
2. What do you notice about the gradient vectors in relation to your answer to Question 1? A4.26
See the demonstration here for several contour plots with gradient vectors superimposed.
As with partial derivatives, we can define the gradient vector for functions of 𝑛 variables. It is simply
the 𝑛 × 1 vector whose entries are all 𝑛 partial derivatives.

4.6.3 Locating stationary points


A stationary point of a function of two variables is a point such that both partial derivatives are 0.
In order to locate stationary points of a function of two variables, we will be solving two equations
simultaneously. Namely, 𝑓𝑥 (𝑥, 𝑦) = 0 and 𝑓𝑦 (𝑥, 𝑦) = 0.

Example. To locate the stationary point(s) of 𝑓, which is given by 𝑓(𝑥, 𝑦) = 𝑦² + 4𝑥𝑦 − 8𝑥,
we will first find its partial derivatives. They are
𝑓𝑥(𝑥, 𝑦) = 4𝑦 − 8 and 𝑓𝑦(𝑥, 𝑦) = 2𝑦 + 4𝑥.
By setting both of these equal to zero, we arrive at the following system of equations:

4𝑦 = 8
2𝑦 + 4𝑥 = 0

We have several methods for solving systems of equations like this. Using any of these
methods will yield the solution 𝑥 = −1, 𝑦 = 2. So (−1, 2) is a stationary point of 𝑓.
This method works for functions of 𝑛 variables. However, the system of equations becomes harder to
solve when there are more variables.

4.6.4 Classifying stationary points


As with functions of one variable, we can repeatedly take partial derivatives of functions of two
(or more) variables, provided that the function is sufficiently differentiable. We will only concern
ourselves with second partial derivatives in this section. A second subscript will be used to denote
second partial derivatives.
Given a function 𝑓 of two variables,
• 𝑓𝑥𝑥 is the second partial derivative of 𝑓 with respect to 𝑥,
• 𝑓𝑥𝑦 is a mixed partial derivative of 𝑓, found by differentiating first with respect to 𝑥 and then
with respect to 𝑦,
• 𝑓𝑦𝑥 is also a mixed partial derivative of 𝑓, found by differentiating first with respect to 𝑦 and
then with respect to 𝑥, and
• 𝑓𝑦𝑦 is the second partial derivative of 𝑓 with respect to 𝑦.
For all functions that we will encounter in this course, the order in which partial derivatives are taken
does not matter. That is to say that 𝑓𝑥𝑦 = 𝑓𝑦𝑥 . However, there are certain functions where these
mixed partial derivatives are not equal.
The Hessian matrix of a function 𝑓 of two variables is
\[
H(x, y) = \begin{bmatrix} f_{xx}(x, y) & f_{xy}(x, y) \\ f_{yx}(x, y) & f_{yy}(x, y) \end{bmatrix}.
\]
The following test can be used to determine the nature of the stationary points of a function of two
variables.
• If det(𝐻(𝑥, 𝑦)) > 0 at the stationary point, then it is a local maximum or minimum. If in
addition 𝑓𝑥𝑥 (𝑥, 𝑦) > 0, then it is a local minimum, whereas if 𝑓𝑥𝑥 (𝑥, 𝑦) < 0, it is a local
maximum.
• If det(𝐻(𝑥, 𝑦)) < 0 at the stationary point, then it is a saddle point (see the following examples).
• If det(𝐻(𝑥, 𝑦)) = 0, then the test is inconclusive.
Example. Consider the function 𝑓 given by 𝑓(𝑥, 𝑦) = 𝑥² − 6𝑥 + 𝑦³ − 12𝑦. Its partial
derivatives are 𝑓𝑥(𝑥, 𝑦) = 2𝑥 − 6 and 𝑓𝑦(𝑥, 𝑦) = 3𝑦² − 12. The system of equations to be
solved is

2𝑥 − 6 = 0
3𝑦² − 12 = 0.

Although this is not a linear system, we can still find its solutions. The first equation
gives 𝑥 = 3, and the second gives 𝑦² = 4, which has two solutions, 𝑦 = 2 and 𝑦 = −2. So
the two stationary points are (3, 2) and (3, −2).
We can find the second derivatives and hence the Hessian matrix.
\[
H(x, y) = \begin{bmatrix} 2 & 0 \\ 0 & 6y \end{bmatrix}
\]
The determinant of the Hessian matrix is det(𝐻(𝑥, 𝑦)) = 12𝑦.
At the point (3, 2), we have det(𝐻(3, 2)) = 24 > 0 and 𝑓𝑥𝑥 (3, 2) = 2 > 0, so (3, 2) is a
local minimum.
We also have det(𝐻(3, −2)) = −24 < 0, so (3, −2) is a saddle point.
The plot below shows these two stationary points. Note that the saddle point at (3, −2)
is neither a local maximum nor a local minimum.
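The test described above is mechanical enough to write as a short function. The sketch below hard-codes the Hessian matrix from this example and applies the three cases of the test; the names `hessian` and `classify` are invented for this illustration, and a different function would need its own `hessian`.

```python
def hessian(x, y):
    """Hessian of f(x, y) = x^2 - 6x + y^3 - 12y, from the example above."""
    return [[2, 0], [0, 6 * y]]


def classify(x, y):
    """Apply the second-derivative test at a stationary point (x, y)."""
    (fxx, fxy), (fyx, fyy) = hessian(x, y)
    det = fxx * fyy - fxy * fyx
    if det > 0:
        # det > 0 forces fxx to be non-zero, so its sign decides the type
        return "local minimum" if fxx > 0 else "local maximum"
    if det < 0:
        return "saddle point"
    return "inconclusive"


results = {point: classify(*point) for point in [(3, 2), (3, -2)]}
```

`results` labels (3, 2) as a local minimum and (3, −2) as a saddle point, matching the working above. The function should only be called at stationary points, since the test says nothing about other points.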

4.6.5 Notation
Here we're using 𝑓𝑥(𝑥, 𝑦) to represent the partial derivative of a function 𝑓(𝑥, 𝑦) with respect to 𝑥
and so on. In the other common notation the partial derivative of a function 𝑓(𝑥, 𝑦) with respect to
𝑥 is denoted by ∂𝑓/∂𝑥.

4.7 Partial derivatives: Activity


Question 1: Recall Question 1 from the previous activity with 𝑓 given by 𝑓(𝑥, 𝑦) = 8𝑥² + 4𝑦 with
domain ℝ².
(a) Find the partial derivatives 𝑓𝑥 and 𝑓𝑦. A4.27
(b) Recall that in the previous activity, we found that 𝑓 had no stationary points. Explain why this is
the case using the partial derivatives. A4.28
Question 2: Recall Question 2 from the previous activity with 𝑓 given by 𝑓(𝑥, 𝑦) = 𝑥² − 6𝑥 + 4𝑦² + 8𝑦 + 5
with domain ℝ².
(a) Find the partial derivatives 𝑓𝑥 and 𝑓𝑦. A4.29
(b) Use the partial derivatives to find all stationary points. A4.30
(c) Find the Hessian matrix of 𝑓. A4.31
(d) Use the Hessian matrix to classify the stationary point found earlier. Compare this with your
answer from Question 2 in the previous activity. A4.32
Question 3: Consider the function 𝑓 given by 𝑓(𝑥, 𝑦) = 𝑥𝑦 − 𝑥 with domain ℝ².
(a) Find and classify all stationary points of 𝑓. A4.33
(b) Produce a three-dimensional plot and a contour plot of 𝑓 (using software) to see the behaviour of
the function around this stationary point. A4.34
(c) Evaluate the gradient vector at the points (0.5, 0.5), (−0.5, 0.5) and (0, 1.25). Is this what you
expected based on your contour plot? A4.35
Question 4 (Least Squares): Below is a table containing the number of children of three people
along with the number of bedrooms in their house.

Number of children  | 0 | 1 | 2
Number of bedrooms  | 2 | 3 | 5

A plot of the data is also given with the number of children on the horizontal axis and the number of
bedrooms on the vertical axis.

We would like to model the relationship between these two quantities using a linear function (of one
variable). The function will be of the form 𝑓(𝑥) = 𝑚𝑥 + 𝑏 where 𝑥 is the number of children and
𝑓(𝑥) is the number of bedrooms. The aim of this question is to find suitable values for 𝑚 and 𝑏.

(a) For each potential choice of 𝑚 and 𝑏, define the residual at each data point to be the difference
between the actual number of bedrooms and the number of bedrooms predicted by the function 𝑓 for
that data point. Define the squared error of each data point to be the square of the residual. Find
the residual and the squared error for the third data point in the table. A4.36
(b) Now define a function 𝑔 of two variables to be the sum of the squared errors of all three data
points. The two variables are 𝑚 and 𝑏. Write down the rule for 𝑔(𝑚, 𝑏). A4.37
(c) Since 𝑔 represents in some sense the overall error of prediction, a suitable choice for 𝑚 and 𝑏
would be the values which make 𝑔 as small as possible. This suggests that we should look for a local
minimum. Find the partial derivatives (𝑔𝑚 and 𝑔𝑏 ) of 𝑔. A4.38
(d) Find and classify all stationary points of 𝑔. A4.39
(e) Using the values of 𝑚 and 𝑏 found in your previous answer, produce a plot of 𝑓. Compare this
with the plot of the data above to see how well the function 𝑓 agrees with the data. If possible, plot
the data and the function together. A4.40
(f) Using the function 𝑓, predict the number of bedrooms in a house owned by someone with three
children. A4.41

4.8 Answers
4.1 No. A function can only output one value for each input. The inputs of a function are
represented on the horizontal axis, and the outputs on the vertical axis. For example, we cannot
consider 0 to be an input because the plot above would give both −4 and 0 as outputs.
4.2 This circle is centred at (1, −2) and has a radius of 2. As a set, the relation which this
represents is
{(𝑥, 𝑦) ∈ ℝ² ∶ (𝑥 − 1)² + (𝑦 + 2)² = 4}.
4.3 This circle is centred at (0, −1) and has a radius of 1/2. As a set, the relation which this
represents is
{(𝑥, 𝑦) ∈ ℝ² ∶ 𝑥² + (𝑦 + 1)² = 1/4}.
4.4 This ellipse is centred at (1, 1). The horizontal distance from the centre of the ellipse to the
rightmost point is 4, and the vertical distance to the lowest point is 2. As a set, the relation
which this ellipse represents is
{(𝑥, 𝑦) ∈ ℝ² ∶ (𝑥 − 1)²/16 + (𝑦 − 1)²/4 = 1}.

4.5 This ellipse is centred at (−3, 0). The horizontal distance from the centre of the ellipse to the
rightmost point is 1, and the vertical distance to the lowest point is 3. As a set, the relation
which this ellipse represents is
{(𝑥, 𝑦) ∈ ℝ² ∶ (𝑥 + 3)² + 𝑦²/9 = 1}.

4.6 This is a circle centred at (−3, 1) with a radius of 3.

4.7 This is an ellipse centred at (0, −1). The horizontal distance from the centre to the
rightmost point is 1 and the vertical distance to the lowest point is 2.
4.8 The vertical spacing between contours is smaller than the horizontal spacing. So the function
grows more rapidly as we move vertically away from (0, 1) than as we move horizontally away from it.
4.9 For 𝑧 = −8, we need to plot the relation described by 𝑓(𝑥, 𝑦) = −8. Simplifying this gives
𝑦 = −2𝑥² − 2. So we need to plot the relation
{(𝑥, 𝑦) ∈ ℝ² ∶ 𝑦 = −2𝑥² − 2}.
The other level sets can be found similarly. For example, the 𝑧 = −4 level set is
{(𝑥, 𝑦) ∈ ℝ² ∶ 𝑦 = −2𝑥² − 1}.
The plot of each of these level sets is given below.

4.10 Around the point (1, −1), the horizontal distance between contours is smaller than the
vertical distance. So the function grows more rapidly when moving in a horizontal direction.
4.11 There are no points where the function has a greater value at all surrounding points. So
there is no local minimum. By the same reasoning, there is no local maximum.
4.12 The zeroes all lie on the 𝑧 = 0 contour line. These are the points where 𝑓(𝑥, 𝑦) = 0.
4.13 Expanding (𝑥 − 3)² gives 𝑥² − 6𝑥 + 9 and expanding (𝑦 + 1)² gives 𝑦² + 2𝑦 + 1.

(𝑥 − 3)² + 4(𝑦 + 1)² − 8 = 𝑥² − 6𝑥 + 9 + 4(𝑦² + 2𝑦 + 1) − 8
= 𝑥² − 6𝑥 + 4𝑦² + 8𝑦 + 9 + 4 − 8
= 𝑓(𝑥, 𝑦)

4.14 Setting 𝑓(𝑥, 𝑦) = −4, we have
(𝑥 − 3)² + 4(𝑦 + 1)² − 8 = −4,
which can be rearranged to give
(𝑥 − 3)²/4 + (𝑦 + 1)² = 1.
Similarly, when 𝑓(𝑥, 𝑦) = 8, we can rearrange to give
(𝑥 − 3)²/16 + (𝑦 + 1)²/4 = 1.
Both contours are ellipses. The contour plot is given below.

4.15 The contour plot shows that the value of the function decreases as you approach the centre
of the ellipses, at (3, −1). So (3, −1) will be a stationary point, in fact a local minimum. Below
is a three-dimensional plot of 𝑓 confirming this.

4.16 The point (1, 1) is a local maximum. This is because as you approach (1, 1), the value of the
function increases. Similarly, (−1, −1) is also a local maximum. Both (1, −1) and (−1, 1) are
local minima because the value of the function decreases as you approach those points.
4.17 If the diagonal line has a positive gradient (i.e. it runs from the bottom-left to the
top-right), then the value of the function will decrease to 0 and then increase again. If the
diagonal line has a negative gradient, then the value of the function will increase to 0 and then
decrease.
4.18 The point (0, 0) is neither a local maximum nor a local minimum. This is because there are
some points nearby where the value of the function is larger, and there are other points nearby
where the value of the function is smaller. A point like this is often called a saddle point.
4.19 Since 𝑓(𝑥, 𝑦) = 2𝑥 + 3𝑦 + 1, 𝑓 is a linear function.
4.20 The three-dimensional plot is a plane. The contour plot is a collection of parallel lines.
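4.21 The calculations for 𝑓 are given below.
\[
\begin{aligned}
f(x, y) &= \begin{bmatrix} x & y \end{bmatrix} \begin{bmatrix} x + 8y \\ 2x + y \end{bmatrix} \\
&= \begin{bmatrix} x(x + 8y) + y(2x + y) \end{bmatrix} \\
&= x^2 + 10xy + y^2
\end{aligned}
\]
The others can be found similarly.
\[
\begin{aligned}
g(x, y) &= x^2 + 2xy + y^2 \\
h(x, y) &= x^2 + 4y^2
\end{aligned}
\]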

4.22 As per the last section, we calculate eigenvalues by solving the characteristic equation
det(𝜆𝐼 − 𝐴) = 0.

• The eigenvalues of 𝐴 are 5 and −3.
• The eigenvalues of 𝐵 are 2 and 0.
• The eigenvalues of 𝐶 are 4 and 1.
4.23 The contour plot for 𝑓 given below reveals a stationary point at (0, 0). This is a saddle point
because the function may increase or decrease as you move towards (0, 0), depending on the
direction you approach it.

The contour plot for 𝑔 given below is a little harder to decipher. There are in fact infinitely many
stationary points along the line with a slope of −1 passing through (0, 0). A three-dimensional
plot of 𝑔 may help here. It somewhat resembles a half-pipe.

The contour plot of ℎ is given below. There is a stationary point at (0, 0). Notice that this is a
local minimum since no matter what direction you approach (0, 0), the value of the function
decreases.

4.24 The signs of the eigenvalues reveal the nature of the stationary point(s). For example, if both
eigenvalues are positive then there will be a local minimum.
4.25 There is a local maximum at (0, 0).
4.26 They point in the general direction of this local maximum. This makes sense because the
gradient vectors point in the direction of the most rapid growth of the function. This would
lead towards a local maximum, provided that the function actually has a local maximum.
4.27

𝑓𝑥 (𝑥, 𝑦) = 16𝑥
𝑓𝑦 (𝑥, 𝑦) = 4

4.28 A stationary point is a point (𝑥, 𝑦) where both 𝑓𝑥 (𝑥, 𝑦) = 0 and 𝑓𝑦 (𝑥, 𝑦) = 0. But 𝑓𝑦 (𝑥, 𝑦) is
always equal to 4. So even though 𝑓𝑥 (𝑥, 𝑦) is equal to 0 when 𝑥 = 0, there are no points (𝑥, 𝑦)
where both 𝑓𝑥 (𝑥, 𝑦) = 0 and 𝑓𝑦 (𝑥, 𝑦) = 0.
4.29

𝑓𝑥 (𝑥, 𝑦) = 2𝑥 − 6
𝑓𝑦 (𝑥, 𝑦) = 8𝑦 + 8

4.30 We need to solve 𝑓𝑥 (𝑥, 𝑦) = 0 and 𝑓𝑦 (𝑥, 𝑦) = 0. This is a linear system of equations.

2𝑥 − 6 = 0
8𝑦 + 8 = 0

Solving this, we find 𝑥 = 3 and 𝑦 = −1. So (3, −1) is the only stationary point.
4.31 We first need the second partial derivatives.

𝑓𝑥𝑥 (𝑥, 𝑦) = 2
𝑓𝑥𝑦 (𝑥, 𝑦) = 0
𝑓𝑦𝑥 (𝑥, 𝑦) = 0
𝑓𝑦𝑦 (𝑥, 𝑦) = 8

The Hessian matrix of 𝑓 is
\[ H(x, y) = \begin{bmatrix} 2 & 0 \\ 0 & 8 \end{bmatrix}. \]
4.32 The determinant of the Hessian matrix is det(𝐻(𝑥, 𝑦)) = 16. This determinant is always
positive. In particular, det(𝐻(3, −1)) > 0, so (3, −1) is either a local maximum or a local
minimum. We note that 𝑓𝑥𝑥 (3, −1) > 0, so (3, −1) is a local minimum.
4.33 First of all, we need to find the partial derivatives.

𝑓𝑥 (𝑥, 𝑦) = 𝑦 − 1
𝑓𝑦 (𝑥, 𝑦) = 𝑥

The stationary points are found by solving 𝑓𝑥 (𝑥, 𝑦) = 0 and 𝑓𝑦 (𝑥, 𝑦) = 0. This gives 𝑥 = 0 and
𝑦 = 1. So (0, 1) is a stationary point.
The second partial derivatives are

𝑓𝑥𝑥 (𝑥, 𝑦) = 0
𝑓𝑥𝑦 (𝑥, 𝑦) = 1
𝑓𝑦𝑥 (𝑥, 𝑦) = 1
𝑓𝑦𝑦 (𝑥, 𝑦) = 0

So the Hessian matrix is
\[ H(x, y) = \begin{bmatrix} 0 & 1 \\ 1 & 0 \end{bmatrix}. \]
Its determinant is det(𝐻(𝑥, 𝑦)) = −1. Since det(𝐻(0, 1)) < 0, the point (0, 1) is a saddle point.
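The second-derivative test used in the two answers above can be sketched in a few lines of Python. This is only an illustration: the function name `classify` is invented here, and the sketch assumes the mixed partial derivatives are equal, as they are in both examples.

```python
def classify(fxx, fxy, fyy):
    """Second-derivative test at a stationary point, given the Hessian
    entries evaluated there (assumes fxy == fyx)."""
    det = fxx * fyy - fxy * fxy
    if det < 0:
        return "saddle point"
    if det > 0:
        return "local minimum" if fxx > 0 else "local maximum"
    return "inconclusive"  # det == 0: the test gives no information


print(classify(2, 0, 8))  # Hessian entries from answer 4.31: local minimum
print(classify(0, 1, 0))  # Hessian entries from answer 4.33: saddle point
```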
4.34

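4.35 The partial derivatives have already been found, which gives
\[ \nabla f(x, y) = \begin{bmatrix} y - 1 \\ x \end{bmatrix}. \]
In particular,
\[
\nabla f(0.5, 0.5) = \begin{bmatrix} -0.5 \\ 0.5 \end{bmatrix}, \quad
\nabla f(-0.5, 0.5) = \begin{bmatrix} -0.5 \\ -0.5 \end{bmatrix} \quad \text{and} \quad
\nabla f(0, 1.25) = \begin{bmatrix} 0.25 \\ 0 \end{bmatrix}.
\]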

Below is the previous contour plot with the gradient vectors superimposed.

This is to be expected because the gradient vector will point in the direction in which the
function increases most rapidly. So you should expect to see the gradient vector pointing
towards contour curves with higher values.
4.36 The residual is
\[ 5 - (2m + b) \]
and the squared error is
\[ (5 - (2m + b))^2. \]
4.37 Some routine algebraic manipulation gives
\[
\begin{aligned}
g(m, b) &= (2 - b)^2 + (3 - (m + b))^2 + (5 - (2m + b))^2 \\
&= 38 - 20b + 3b^2 - 26m + 6bm + 5m^2.
\end{aligned}
\]

4.38
\[
\begin{aligned}
g_m(m, b) &= -26 + 6b + 10m \\
g_b(m, b) &= -20 + 6b + 6m
\end{aligned}
\]

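4.39 The stationary points are found by solving 𝑔𝑚 (𝑚, 𝑏) = 0 and 𝑔𝑏 (𝑚, 𝑏) = 0. This gives a
linear system.
\[ \begin{bmatrix} 10 & 6 \\ 6 & 6 \end{bmatrix} \begin{bmatrix} m \\ b \end{bmatrix} = \begin{bmatrix} 26 \\ 20 \end{bmatrix} \]
Solving this system (using Gaussian elimination or inverse matrices) gives $m = \tfrac{3}{2}$ and $b = \tfrac{11}{6}$.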

The second partial derivatives are

𝑔𝑚𝑚 (𝑚, 𝑏) = 10
𝑔𝑚𝑏 (𝑚, 𝑏) = 6
𝑔𝑏𝑚 (𝑚, 𝑏) = 6
𝑔𝑏𝑏 (𝑚, 𝑏) = 6.

So the Hessian matrix is
\[ H(m, b) = \begin{bmatrix} 10 & 6 \\ 6 & 6 \end{bmatrix}. \]
Its determinant is det(𝐻(𝑚, 𝑏)) = 24. Since det(𝐻(3/2, 11/6)) > 0 and 𝑔𝑚𝑚 (3/2, 11/6) = 10 > 0,
the point (3/2, 11/6) is a local minimum.
4.40

4.41 The function 𝑓 can be written as
\[ f(x) = \tfrac{3}{2}x + \tfrac{11}{6}. \]
This function predicts the number of bedrooms to be
\[ f(3) = \tfrac{3}{2} \cdot 3 + \tfrac{11}{6} \approx 6.33. \]
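The least-squares calculation in answers 4.37 to 4.41 can be checked with a short Python sketch. The three data points below are an assumption: they are read off the three squared-error terms of 𝑔(𝑚, 𝑏) rather than quoted from the original question, and the variable names are chosen for illustration.

```python
from fractions import Fraction

# Data points inferred from the squared-error terms of g(m, b) (an assumption).
points = [(0, 2), (1, 3), (2, 5)]

# Setting g_m = 0 and g_b = 0 gives the 2x2 linear system
#   [sum x^2  sum x] [m]   [sum x*y]
#   [sum x    n    ] [b] = [sum y  ]
n = len(points)
sx = sum(x for x, _ in points)
sy = sum(y for _, y in points)
sxx = sum(x * x for x, _ in points)
sxy = sum(x * y for x, y in points)

# Solve by Cramer's rule, using exact fractions.
det = sxx * n - sx * sx
m = Fraction(sxy * n - sx * sy, det)
b = Fraction(sxx * sy - sx * sxy, det)

print(m, b)              # 3/2 11/6
print(float(m * 3 + b))  # prediction at x = 3
```

Doubling both equations of this system recovers the system with coefficient matrix rows (10, 6) and (6, 6) used in answer 4.39.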

Chapter 5

Enumeration and Discrete Probability

5.1 Enumeration
It is often very useful to be able to count (or enumerate) how many ways something can occur. For
example, it may be important to know how many different possible barcodes there are under some
barcoding system or how many possible keys there are for some encryption algorithm.

5.1.1 Counting via sums and products


Two basic rules are used very often in enumeration problems. The first is sometimes called the "rule
of sum":
If there are 𝑎 options in one class, 𝑏 options in another class, and the two classes do not overlap, then
there are 𝑎 + 𝑏 ways to choose an option from one of the two classes.
Example. Tommy's mum has promised him his choice of a reward out of 2 different toys
and 4 different lollies. Tommy has 2 + 4 = 6 choices for his reward.
This rule can be generalised to more than two classes. If there are 𝑡 non-overlapping classes with
𝑎1 , 𝑎2 , … , 𝑎𝑡 options in them respectively, then there are 𝑎1 + 𝑎2 + ⋯ + 𝑎𝑡 ways to choose an option
from one of the classes.
Notice that our example with Tommy only works if the two classes do not overlap. If there were
one (weird) edible toy in both classes Tommy would only have 5 choices. In fact, this idea can be
extended to a more general rule. If there are 𝑎 options in one class, 𝑏 options in another class, and
the two classes overlap in 𝑐 options, then there are 𝑎 + 𝑏 − 𝑐 ways to choose an option from one of
the two classes.
The other basic rule is sometimes called the "rule of product":
If there are 𝑎 options in one class and 𝑏 options in another class, then there are 𝑎𝑏 ways to choose
one option from each of the two classes.
Example. Tommy's mum has promised him a reward of a toy out of 2 different toys and
a lolly out of 4 different lollies. Tommy has 2 × 4 = 8 choices for his reward.
Again this rule can be generalised to more than two classes. If there are 𝑡 classes with 𝑎1 , 𝑎2 , … , 𝑎𝑡
options in them respectively, then there are 𝑎1 × 𝑎2 × ⋯ × 𝑎𝑡 ways to choose one option from each of
the classes.
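As a quick check of the two rules, the following Python sketch enumerates Tommy's choices directly. The toy and lolly names are placeholders invented for illustration; only the counts (2 toys, 4 lollies) come from the examples.

```python
from itertools import product

# Placeholder names; only the counts (2 toys, 4 lollies) come from the examples.
toys = {"robot", "kite"}
lollies = {"sherbet", "toffee", "jelly", "mint"}

# Rule of sum: one reward from either class (the classes do not overlap).
one_reward = toys | lollies
print(len(one_reward))  # 2 + 4 = 6

# Rule of product: one toy together with one lolly.
toy_and_lolly = list(product(toys, lollies))
print(len(toy_and_lolly))  # 2 * 4 = 8

# Overlapping classes: suppose one item ("mint") is both a toy and a lolly.
toys_with_overlap = {"robot", "mint"}
print(len(toys_with_overlap | lollies))  # 2 + 4 - 1 = 5
```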
Questions.
A restaurant offers four entrees, six main courses and three desserts.

1. How many ways are there for you to choose a three course meal? A5.1
2. You're not very hungry and decide to just choose one item from the menu. How many ways are
there to do this? A5.2

5.1.2 Counting selections


When counting selections of objects from a set it's important to know two things. The first is whether
repeated objects are allowed or not (for example, repeated characters are allowed in a username but
repeated players are not possible in a sports team). The second is whether our selections are ordered
or not (for example, PINs are ordered selections of digits but a standard lottery pick is unordered).
The factorial of a positive integer 𝑛 is
𝑛! = 𝑛 × (𝑛 − 1) × ⋯ × 2 × 1.
It's also useful to define 0! = 1.

Selections without repetition


A reviewer is going to compare ten phones and list, in order, a top three. In how many ways can she
do this? More generally, how many ways are there to arrange 𝑟 objects chosen from a set of 𝑛 objects?
In our example, the reviewer has 10 options for her favourite, but then only 9 for her second-favourite,
and 8 for third-favourite. So there are 10 × 9 × 8 ways she could make her list.
When making an ordered selection of 𝑟 objects from a set of 𝑛 objects there are

𝑛 options for the 1st element
𝑛 − 1 options for the 2nd element
𝑛 − 2 options for the 3rd element
⋮
𝑛 − 𝑟 + 1 options for the 𝑟th element.

So we have the following formula.
The number of ordered selections without repetition of 𝑟 objects from a set of 𝑛 objects (0 ≤ 𝑟 ≤ 𝑛) is
\[ n(n-1)\cdots(n-r+1) = \frac{n!}{(n-r)!}. \]
What if our reviewer instead chose an unordered top three? In how many ways could she do that?
More generally, how many ways are there to choose (without order) 𝑟 objects from a set of 𝑛 objects?
For every unordered list our reviewer could make there are 3! = 6 corresponding possible ordered
lists. And we've seen that she could make 10 × 9 × 8 ordered lists. So the number of unordered lists
she could make is $\frac{10 \times 9 \times 8}{6} = 120$.
For every combination (unordered selection) of 𝑟 elements from a set of 𝑛 elements there are 𝑟!
corresponding permutations (ordered selections). So, using our formula for the number of permutations
we have the following.
The number of unordered selections without repetition of 𝑟 objects from a set of 𝑛 objects (0 ≤ 𝑟 ≤ 𝑛)
is
\[ \frac{n(n-1)\cdots(n-r+1)}{r!} = \frac{n!}{r!(n-r)!} = \binom{n}{r}. \]
Notice that the notation $\binom{n}{r}$ is used for $\frac{n!}{r!(n-r)!}$.
Expressions like this are called binomial coefficients.
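The two formulas for selections without repetition can be checked against a brute-force count. The sketch below uses the phone reviewer example; labelling the ten phones 0 to 9 is an arbitrary choice made here. It relies on `math.perm` and `math.comb`, which need Python 3.8 or later.

```python
from itertools import combinations, permutations
from math import comb, factorial, perm

phones = range(10)  # ten phones, labelled 0..9 for illustration

# Ordered top three: 10 * 9 * 8 = 10!/7!
ordered = sum(1 for _ in permutations(phones, 3))
assert ordered == perm(10, 3) == factorial(10) // factorial(7) == 720

# Unordered top three: each unordered choice corresponds to 3! ordered lists.
unordered = sum(1 for _ in combinations(phones, 3))
assert unordered == comb(10, 3) == ordered // factorial(3) == 120
```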
Questions.

1. Eight runners are in a race. How many possibilities are there for the first, second and third
place getters? A5.3
2. How many ways are there to choose a team of 11 from a squad of 18? A5.4
3. How many ways are there to choose a team of 11 and a captain and vice-captain for that team
from a squad of 18? A5.5

Selections with repetition


An ordered selection of 𝑟 objects from a set of 𝑛 objects is really just a sequence of length 𝑟 with
each term chosen from the set. There are 𝑛 possibilities for each term and so:
The number of ordered selections with repetition of 𝑟 objects from a set of 𝑛 objects is
\[ \underbrace{n \times n \times \cdots \times n}_{r} = n^r. \]

Example. The number of PINs made of four decimal digits is $10 \times 10 \times 10 \times 10 = 10^4$.


Unordered selections with repetition are arguably the trickiest to understand. We'll give the formula
here and leave the justification to an activity.
The number of unordered selections with repetition of 𝑟 objects from a set of 𝑛 objects is
\[ \binom{n-1+r}{r} = \frac{(n-1+r)!}{r!(n-1)!}. \]
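Both formulas for selections with repetition can be compared with a direct enumeration. In this sketch the values 𝑛 = 4 and 𝑟 = 3 are illustrative choices, not taken from the text.

```python
from itertools import combinations_with_replacement, product
from math import comb

n, r = 4, 3  # illustrative values, not taken from the text
objects = range(n)

# Ordered selections with repetition: sequences of length r, n choices per term.
ordered = sum(1 for _ in product(objects, repeat=r))
assert ordered == n ** r == 64

# Unordered selections with repetition, compared with the stated formula.
unordered = sum(1 for _ in combinations_with_replacement(objects, r))
assert unordered == comb(n - 1 + r, r) == 20
```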
Questions.
1. How many binary strings of length 5 are there? How many of these contain exactly two 1s?
A5.6

2. Confirm that the formula for unordered selections with repetition works in the case 𝑟 = 2 and
𝑛 = 3. A5.7

5.2 Enumeration: Activity


Question 1: Calculating factorials of even relatively small numbers can be difficult. For example,
there are 33 digits in 30!. This poses a problem when evaluating binomial coefficients. The following
method somewhat simplifies those calculations. Try to complete this question without using a
calculator.
(a) Find the number 𝑥 that satisfies 10! = 7!𝑥. A5.8
(b) Use your previous answer to evaluate $\binom{10}{7}$. A5.9
(c) Evaluate $\binom{16}{14}$ using the above method. A5.10

Question 2: For each of the following scenarios, decide whether the selections are ordered or
unordered, and whether or not they include repetitions. Then evaluate the number of selections.
1. A family of 7 are out to lunch and each of them orders one dish from a menu of 12 dishes. From
the chef's perspective, how many orders are possible? A5.11
2. A company is setting up a new department and they have already chosen 20 employees to work
in this department. They need to assign 4 of these people to the leadership team. How many
different leadership teams are possible? A5.12
3. The company from the previous question now decides that the leadership team will consist of a
department head, an assistant department head, a training supervisor and an assistant training
supervisor. How many different leadership teams are possible now? A5.13

4. You are opening an email account and you must select a password. Your password is required
to be at least 8 characters long and you are only allowed to use lower case letters, upper case
letters, and numbers. However, you aren't very good at remembering long passwords, so you
decide that your password will be at most 11 characters long. How many different passwords
can you choose from? (Hint: Leave your answer as an expression in terms of the formulae in
the previous lesson.) A5.14
Question 3: Suppose that you would like to travel from the red point to the blue point in the grid
below according to the following three rules:
1. You must travel along the horizontal and vertical grid lines, one step at a time.
2. You may only travel to the right or upwards.
3. You must not take two consecutive vertical steps.

How many different paths can you take? A5.15 A5.16

5.3 Probability
Probability gives us a way to model random processes mathematically. These processes could be
anything from the rolling of dice, to radioactive decay of atoms, to the performance of a stock
market index. The mathematical environment we work in when dealing with probabilities is called a
probability space.

5.3.1 Probability spaces


We'll start with a formal definition and then look at some examples of how the definition is used.
A probability space consists of:
• a sample space 𝑆 which contains all the possible outcomes of the random process; and
• a function Pr(𝑠) that assigns a positive probability to each outcome such that the sum of the
probabilities of the outcomes in 𝑆 is 1.
Each time the process occurs it should produce exactly one outcome (never zero or more than one).
The probability of an outcome is a measure of the likeliness that it will occur. It is given as a real
number between 0 and 1 inclusive, where 0 indicates that the outcome cannot occur and 1 indicates
that the outcome must occur.
Example:

The spinner above might be modelled by a probability space with sample space 𝑆 =
{1, 2, 3, 4} and probability function given as follows.
\[
\Pr(s) = \begin{cases}
\tfrac{1}{2} & \text{for } s = 1 \\
\tfrac{1}{4} & \text{for } s = 2 \\
\tfrac{1}{8} & \text{for } s = 3 \\
\tfrac{1}{8} & \text{for } s = 4.
\end{cases}
\]
It can be convenient to give this as a table:

𝑠       1    2    3    4
Pr(𝑠)   1/2  1/4  1/8  1/8

Example: Rolling a fair six-sided die could be modelled by a probability space with
sample space 𝑆 = {1, 2, 3, 4, 5, 6} and probability function Pr given as follows.

𝑠       1    2    3    4    5    6
Pr(𝑠)   1/6  1/6  1/6  1/6  1/6  1/6

A sample space like this one where every outcome has an equal probability is sometimes called a
uniform sample space. Outcomes from a uniform sample space are said to have been taken uniformly
at random.
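One way to make these definitions concrete is to store a probability space as a Python dictionary from outcomes to probabilities. This is a minimal sketch: the helper names `is_probability_space` and `is_uniform` are invented here, and exact fractions are used so that the sum can be compared with 1 without rounding error.

```python
from fractions import Fraction

# Spinner example: outcome -> probability.
spinner = {1: Fraction(1, 2), 2: Fraction(1, 4), 3: Fraction(1, 8), 4: Fraction(1, 8)}

# Fair die: a uniform sample space.
die = {s: Fraction(1, 6) for s in range(1, 7)}


def is_probability_space(pr):
    """Check that every probability is positive and that they sum to 1."""
    return all(p > 0 for p in pr.values()) and sum(pr.values()) == 1


def is_uniform(pr):
    """True when every outcome has the same probability."""
    return len(set(pr.values())) == 1


print(is_probability_space(spinner), is_uniform(spinner))  # True False
print(is_probability_space(die), is_uniform(die))          # True True
```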
Questions.
1. A game is played with a 6-sided die that has three faces numbered 1, two faces numbered 3,
and one face numbered 5. Create a probability space that models a single roll of this die. A5.17
2. A dial in a poker machine is modelled by a probability space with sample space
{lemon, orange, apple, cherry} and $\Pr(\text{lemon}) = \tfrac{1}{3}$, $\Pr(\text{orange}) = \tfrac{1}{3}$, and $\Pr(\text{apple}) = \tfrac{1}{4}$.
What is Pr(cherry)? A5.18

5.3.2 Events
An event is just a collection of outcomes we are interested in for some reason. Formally it is a subset
of the sample space.
Example: In the die rolling example with 𝑆 = {1, 2, 3, 4, 5, 6}, we could define the event
of rolling at least a 3. Formally, this would be the set {3, 4, 5, 6}. We could also define
the event of rolling an odd number as the set {1, 3, 5}.
The probability of an event 𝐴 is the sum of the probabilities of the outcomes in 𝐴:
\[ \Pr(A) = \sum_{x \in A} \Pr(x). \]
Example: In the spinner example, for the event 𝐴 = {1, 2, 4}, we have
\[
\begin{aligned}
\Pr(A) &= \Pr(1) + \Pr(2) + \Pr(4) \\
&= \tfrac{1}{2} + \tfrac{1}{4} + \tfrac{1}{8} \\
&= \tfrac{7}{8}.
\end{aligned}
\]

In a uniform sample space (where all outcomes are equally likely) the probability of an event 𝐴 can
be calculated as:
\[ \Pr(A) = \frac{\text{number of outcomes in } A}{\text{total number of outcomes}} = \frac{|A|}{|S|}. \]

If 𝐴 is an event, we use $\overline{A}$ to mean the event "not 𝐴".

For any event 𝐴,
\[ \Pr(\overline{A}) = 1 - \Pr(A). \]
For any two events 𝐴 and 𝐵,
Pr(𝐴 or 𝐵) = Pr(𝐴) + Pr(𝐵) − Pr(𝐴 and 𝐵).
Example: In our die rolling example, let 𝐴 = {1, 2} and 𝐵 = {2, 3, 4} be events. Then
\[ \Pr(A \text{ or } B) = \tfrac{2}{6} + \tfrac{3}{6} - \tfrac{1}{6} = \tfrac{2}{3}. \]
Two events 𝐴 and 𝐵 are mutually exclusive if Pr(𝐴 and 𝐵) = 0, that is, if 𝐴 and 𝐵 cannot occur
together. For mutually exclusive events, we have
Pr(𝐴 or 𝐵) = Pr(𝐴) + Pr(𝐵).
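Since events are subsets of the sample space, they map naturally onto Python sets, with "or" as union, "and" as intersection and "not" as set difference from 𝑆. The sketch below checks the two rules above on the die example; the helper name `prob` is an illustrative choice.

```python
from fractions import Fraction

S = {1, 2, 3, 4, 5, 6}
pr = {s: Fraction(1, 6) for s in S}  # fair die


def prob(event):
    """Probability of an event: sum the probabilities of its outcomes."""
    return sum((pr[s] for s in event), Fraction(0))


A = {1, 2}
B = {2, 3, 4}

# Pr(A or B) = Pr(A) + Pr(B) - Pr(A and B)
assert prob(A | B) == prob(A) + prob(B) - prob(A & B) == Fraction(2, 3)

# Complement rule: Pr(not A) = 1 - Pr(A)
assert prob(S - A) == 1 - prob(A)
```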
Questions.
1. A number is chosen uniformly at random from {11, 12, … , 32}. Let 𝐶 be the event that the
second digit of the number is a 1. What set represents 𝐶? A5.19
2. What is the probability of 𝐶? A5.20
3. Let 𝐷 be the event that the first digit of the number is not a 1. What set represents the event
𝐷? A5.21
4. What is Pr(𝐷)? A5.22
5. What set represents the event "𝐶 and 𝐷"? A5.23
6. What is Pr(𝐶 and 𝐷)? A5.24

5.3.3 Independent events


We say that two events are independent when the occurrence or non-occurrence of one event does not
affect the likelihood of the other occurring.
Two events 𝐴 and 𝐵 are independent if
Pr(𝐴 and 𝐵) = Pr(𝐴) Pr(𝐵).
Example: A binary string of length 3 is generated uniformly at random. The event 𝐴
that the first bit is a 1 is independent of the event 𝐵 that the second bit is a 1. But 𝐴 is
not independent of the event 𝐶 that the string contains exactly two 1s.
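Formally, the sample space is 𝑆 = {111, 110, 101, 100, 011, 010, 001, 000}
and $\Pr(s) = \tfrac{1}{8}$ for any 𝑠 ∈ 𝑆. So,
\[
\begin{aligned}
A &= \{111, 110, 101, 100\} & \Pr(A) &= \tfrac{1}{2} \\
B &= \{111, 110, 011, 010\} & \Pr(B) &= \tfrac{1}{2} \\
C &= \{110, 101, 011\} & \Pr(C) &= \tfrac{3}{8} \\
A \text{ and } B &= \{111, 110\} & \Pr(A \text{ and } B) &= \tfrac{1}{4} \\
A \text{ and } C &= \{110, 101\} & \Pr(A \text{ and } C) &= \tfrac{1}{4}.
\end{aligned}
\]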

So Pr(𝐴 and 𝐵) = Pr(𝐴) Pr(𝐵) but Pr(𝐴 and 𝐶) ≠ Pr(𝐴) Pr(𝐶).
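The same check can be done by enumerating the sample space in Python. This is a sketch under the uniform assumption of the example; the helper names `prob` and `independent` are invented for illustration.

```python
from fractions import Fraction
from itertools import product

# All binary strings of length 3, each with probability 1/8.
S = ["".join(bits) for bits in product("01", repeat=3)]


def prob(event):
    """Uniform sample space: Pr(event) = |event| / |S|."""
    return Fraction(len(event), len(S))


A = {s for s in S if s[0] == "1"}        # first bit is 1
B = {s for s in S if s[1] == "1"}        # second bit is 1
C = {s for s in S if s.count("1") == 2}  # exactly two 1s


def independent(X, Y):
    """Test the definition Pr(X and Y) = Pr(X) Pr(Y)."""
    return prob(X & Y) == prob(X) * prob(Y)


print(independent(A, B))  # True:  1/4 == 1/2 * 1/2
print(independent(A, C))  # False: 1/4 != 1/2 * 3/8
```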


Questions.
1. Are the events 𝐶 and 𝐷 defined in the last set of questions independent? A5.25

5.3.4 Warning
Here, we've only been discussing discrete probability where we have a finite number of different possible
outcomes. Some of the definitions and results we state apply only in this case. Our definition of a
probability space, for example, is actually the definition of a discrete probability space, and so on.
The discrete setting provides a good environment to learn most of the vital concepts and intuitions of
probability theory.
We can also consider probabilities where there is an infinite continuum of possible outcomes (for
example, a randomly selected person might be 170cm tall, or 170.1cm tall or 170.001cm tall and so
on). We'll discuss this briefly later on.

5.4 Probability: Activity


Question 1: Suppose you toss a coin three times and you are interested in the sequence of heads
and tails that appears after tossing the coin.
(a) Write a suitable sample space for this experiment. A5.26
(b) Write the event of
1. tossing three heads, A5.27
2. tossing exactly two tails, A5.28
3. tossing at least two tails, A5.29
4. tossing a tail on the second throw, A5.30
5. tossing a tail only on the second throw. A5.31
(c) Suppose that this is a uniform sample space. Let 𝐴 be the event of tossing exactly two tails
and let 𝐵 be the event of tossing a tail on only the second throw. Find the probabilities of these
events. A5.32
(d) Are 𝐴 and 𝐵 mutually exclusive? A5.33
(e) Are $\overline{A}$ and 𝐵 mutually exclusive, where $\overline{A}$ is the event "not 𝐴"? A5.34
(f) Are 𝐴 and 𝐵 independent? A5.35
Question 2: You decide to roll a fair six-sided die and toss a fair coin.
(a) Write down a sample space to represent the set of outcomes. A5.36
(b) How would you assign probabilities to each outcome? A5.37
(c) Let 𝐴 be the event that you roll a number less than 5 and let 𝐵 be the event that you toss tails.
Which event is more likely? A5.38
(d) Are 𝐴 and 𝐵 mutually exclusive? A5.39
(e) Without doing any calculations, do you expect 𝐴 and 𝐵 to be independent? A5.40
(f) Use calculations to determine if 𝐴 and 𝐵 are independent. A5.41
(g) Notice that Pr(𝐴) + Pr(𝐵) is greater than 1. Should this concern you? Does this mean that
Pr(𝐴 or 𝐵) is greater than 1? A5.42

5.5 Conditional probability


Your friend believes that Python coding has become more popular than AFL in Melbourne. She
bets you $10 that the next person to pass you on the street will be a Python programmer. You feel
confident about this bet. However, when you see a man in a "Hello, world!" t-shirt approaching, you
don't feel so confident any more. Why is this?
We can think about this with a diagram. The rectangle represents the set of people in Melbourne,
the circle 𝑃 is the set of Python coders, and the circle 𝑇 is the set of "Hello, world!" t-shirt owners.

Initially, you feel confident because the circle 𝑃 takes up a small proportion of the rectangle. But
when you learn that your randomly selected person is in the circle 𝑇, you feel bad because the circle
𝑃 covers almost all of 𝑇. In mathematical language, the probability that a random Melbournian is a
Python coder is low, but the probability that a random Melbournian is a Python coder given that
they own a "Hello, world!" t-shirt is high.

5.5.1 Conditional probability


Conditional probabilities measure the likelihood of an event, given that some other event occurs.
For events 𝐴 and 𝐵, the conditional probability of 𝐴 given 𝐵 is
\[ \Pr(A \mid B) = \frac{\Pr(A \text{ and } B)}{\Pr(B)}. \]
This definition also implies that
\[ \Pr(A \text{ and } B) = \Pr(A \mid B) \Pr(B). \]
Example: The spinner from the last chapter is spun.

Let 𝐴 be the event that the result was at least 3 and 𝐵 be the
event that the result was even. What is Pr(𝐴|𝐵)?

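\[
\begin{aligned}
\Pr(A \text{ and } B) &= \Pr(4) = \tfrac{1}{8} \\
\Pr(B) &= \Pr(2) + \Pr(4) = \tfrac{1}{4} + \tfrac{1}{8} = \tfrac{3}{8}
\end{aligned}
\]
Thus,
\[ \Pr(A \mid B) = \frac{\Pr(A \text{ and } B)}{\Pr(B)} = \frac{1/8}{3/8} = \tfrac{1}{3}. \]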
Example: A binary string of length 6 is generated uniformly at random. Let 𝐴 be the
event that the first bit is a 1 and 𝐵 be the event that the string contains exactly two 1s. What is
Pr(𝐴|𝐵)?
There are $2^6$ strings in our sample space. Now "𝐴 and 𝐵" occurs when the first bit is a 1
and the rest of the string contains exactly one 1. There are $\binom{5}{1}$ such strings and so
$\Pr(A \text{ and } B) = \binom{5}{1}/2^6$. Also, there are $\binom{6}{2}$ strings containing exactly two 1s
and so $\Pr(B) = \binom{6}{2}/2^6$. Thus,
\[ \Pr(A \mid B) = \frac{\Pr(A \text{ and } B)}{\Pr(B)} = \frac{\binom{5}{1}}{\binom{6}{2}} = \frac{5}{15} = \tfrac{1}{3}. \]
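The binary string example is small enough to verify by listing all 64 strings. The following Python sketch does this; the helper names `prob` and `cond_prob` are illustrative, and 𝐵 is taken to mean that the string contains exactly two 1s.

```python
from fractions import Fraction
from itertools import product

S = ["".join(bits) for bits in product("01", repeat=6)]  # 2**6 = 64 strings


def prob(event):
    """Uniform sample space: Pr(event) = |event| / |S|."""
    return Fraction(len(event), len(S))


def cond_prob(X, Y):
    """Pr(X | Y) = Pr(X and Y) / Pr(Y); requires Pr(Y) > 0."""
    return prob(X & Y) / prob(Y)


A = {s for s in S if s[0] == "1"}        # first bit is 1
B = {s for s in S if s.count("1") == 2}  # exactly two 1s

print(cond_prob(A, B))  # 1/3
```

Here Pr(𝐴) = 1/2, so learning that 𝐵 occurred lowers the probability of 𝐴.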
Questions.
For the following events 𝐴 and 𝐵, do you think Pr(𝐴|𝐵) > Pr(𝐴), Pr(𝐴|𝐵) < Pr(𝐴) or Pr(𝐴|𝐵) =
Pr(𝐴)?
1. A standard die is rolled. 𝐴 is the event that the roll is a 6, and 𝐵 is the event that the roll is
even. A5.43
2. A random American is chosen. 𝐴 is the event that they are less than 6ft tall and 𝐵 is the
event that they are a professional basketball player. A5.44
3. For a Manchester United vs Manchester City soccer game, 𝐴 is the event that City win the
game and 𝐵 is the event that United lead 3-0 at half time. A5.45

5.5.2 Independence again


Our definition of conditional probability gives us another way of defining independence. We can say
that events 𝐴 and 𝐵 are independent if
Pr(𝐴) = Pr(𝐴|𝐵).
This makes sense intuitively: it is a formal way of saying that the likelihood of 𝐴 does not depend on
whether or not 𝐵 occurs.

5.5.3 Independent repeated trials
Generally if we perform exactly the same action multiple times, we assume that the results for each
trial will be independent of the others. For example, if we roll a die twice, then the result of the first
roll will be independent of the result of the second.
For two independent repeated trials, each from a sample space 𝑆, our overall sample space is 𝑆 × 𝑆 and
our probability function will be given by Pr((𝑠1 , 𝑠2 )) = Pr(𝑠1 ) Pr(𝑠2 ). For three independent repeated
trials the sample space is 𝑆 ×𝑆 ×𝑆 and the probability function Pr((𝑠1 , 𝑠2 , 𝑠3 )) = Pr(𝑠1 ) Pr(𝑠2 ) Pr(𝑠3 ),
and so on.
Example: The spinner from the previous example is spun twice. What is the probability
that the results add to 5?
A total of 5 can be obtained as (1, 4), (4, 1), (2, 3) or (3, 2). Because the spins are
independent:

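\[
\begin{aligned}
\Pr((1, 4)) = \Pr((4, 1)) &= \tfrac{1}{2} \times \tfrac{1}{8} = \tfrac{1}{16} \\
\Pr((2, 3)) = \Pr((3, 2)) &= \tfrac{1}{4} \times \tfrac{1}{8} = \tfrac{1}{32}
\end{aligned}
\]
So, because (1, 4), (4, 1), (2, 3) and (3, 2) are mutually exclusive, the probability of the
total being 5 is $\tfrac{1}{16} + \tfrac{1}{16} + \tfrac{1}{32} + \tfrac{1}{32} = \tfrac{3}{16}$.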
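The product construction for repeated trials can be written out directly. This sketch builds the sample space 𝑆 × 𝑆 for two spins of the spinner and sums the probabilities of the pairs that total 5; the variable names are illustrative.

```python
from fractions import Fraction
from itertools import product

spinner = {1: Fraction(1, 2), 2: Fraction(1, 4), 3: Fraction(1, 8), 4: Fraction(1, 8)}

# Sample space S x S with Pr((s1, s2)) = Pr(s1) * Pr(s2).
two_spins = {
    (s1, s2): spinner[s1] * spinner[s2]
    for s1, s2 in product(spinner, repeat=2)
}

assert sum(two_spins.values()) == 1  # still a probability space

total_is_5 = sum(p for (s1, s2), p in two_spins.items() if s1 + s2 == 5)
print(total_is_5)  # 3/16
```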
Questions.
1. If the spinner from the previous example is spun three times, what is the probability that the
results add to 4? A5.46

5.5.4 Bayes' theorem


Bayes' theorem gives a way of calculating the conditional probability of an event 𝐴 given an event 𝐵
when we already know the probabilities of 𝐴, of 𝐵 given 𝐴, and of 𝐵 given $\overline{A}$ (where $\overline{A}$ is the event "not 𝐴").
Bayes' theorem: For events 𝐴 and 𝐵,
\[ \Pr(A \mid B) = \frac{\Pr(B \mid A) \Pr(A)}{\Pr(B \mid A) \Pr(A) + \Pr(B \mid \overline{A}) \Pr(\overline{A})}. \]

Note that the denominator above is simply an expression for Pr(𝐵). The fact that
\[ \Pr(B) = \Pr(B \mid A) \Pr(A) + \Pr(B \mid \overline{A}) \Pr(\overline{A}) \]
is due to the law of total probability.
Example: Luke Skywalker discovers that some porgs have an extremely rare genetic
mutation that makes them powerful force users. He develops a test for this mutation that
is right 99% of the time and decides to test all the porgs on Ahch-To. Suppose there are
100 mutant porgs in the population of 24 million. We would guess that the test would
come up positive for 99 of the 100 mutants, but also for 239 999 non-mutants.
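We are assuming that the conditional probability of a porg testing positive given it's a
mutant is 0.99. But what is the conditional probability of it being a mutant given that it
tested positive? From our guesses, we would expect this to be $\frac{99}{99 + 239999} \approx 0.0004$.
Bayes' theorem gives us a way to formalise this. Writing 𝑀 for the event that a porg is a mutant,
$\overline{M}$ for the event that it is not, and 𝑃 for the event that it tests positive,
\[
\begin{aligned}
\Pr(M \mid P) &= \frac{\Pr(M) \Pr(P \mid M)}{\Pr(M) \Pr(P \mid M) + \Pr(\overline{M}) \Pr(P \mid \overline{M})} \\
&= \frac{\frac{100}{24000000} \times 0.99}{\frac{100}{24000000} \times 0.99 + \left(1 - \frac{100}{24000000}\right) \times 0.01} \\
&= \frac{99}{99 + 239999} \\
&\approx 0.0004.
\end{aligned}
\]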

Example: A binary string is created so that the first bit is a 0 with probability $\tfrac{1}{3}$ and
then each subsequent bit is the same as the preceding one with probability $\tfrac{3}{4}$. What is
the probability that the first bit is 0, given that the second bit is 0?
Let 𝐹 be the event that the first bit is 0 and let 𝑆 be the event that the second bit is
0. So $\Pr(F) = \tfrac{1}{3}$. If 𝐹 occurs then the second bit will be 0 with probability $\tfrac{3}{4}$ and so
$\Pr(S \mid F) = \tfrac{3}{4}$. If 𝐹 does not occur (the event $\overline{F}$) then the second bit will be 0 with
probability $\tfrac{1}{4}$ and so $\Pr(S \mid \overline{F}) = \tfrac{1}{4}$. So, by Bayes' theorem,
\[
\begin{aligned}
\Pr(F \mid S) &= \frac{\Pr(F) \Pr(S \mid F)}{\Pr(F) \Pr(S \mid F) + \Pr(\overline{F}) \Pr(S \mid \overline{F})} \\
&= \frac{\tfrac{1}{3} \times \tfrac{3}{4}}{\tfrac{1}{3} \times \tfrac{3}{4} + \tfrac{2}{3} \times \tfrac{1}{4}} \\
&= \frac{1/4}{5/12} \\
&= \tfrac{3}{5}.
\end{aligned}
\]
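Both worked examples follow the same pattern, which can be captured in a small Python function. This is a sketch only: the function name `bayes` and its parameter names are invented here, and exact fractions are used so that the results can be compared with the hand calculations.

```python
from fractions import Fraction


def bayes(pr_a, pr_b_given_a, pr_b_given_not_a):
    """Pr(A | B) via Bayes' theorem, with the law of total probability
    supplying Pr(B) in the denominator."""
    pr_not_a = 1 - pr_a
    pr_b = pr_b_given_a * pr_a + pr_b_given_not_a * pr_not_a
    return pr_b_given_a * pr_a / pr_b


# Porg example: Pr(M) = 100/24000000, Pr(P | M) = 0.99, Pr(P | not M) = 0.01.
porg = bayes(Fraction(100, 24_000_000), Fraction(99, 100), Fraction(1, 100))
print(porg, float(porg))  # 99/240098, which is about 0.0004

# Binary string example: Pr(F) = 1/3, Pr(S | F) = 3/4, Pr(S | not F) = 1/4.
print(bayes(Fraction(1, 3), Fraction(3, 4), Fraction(1, 4)))  # 3/5
```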

Questions.
1. In the porg example above a "99% accurate" test produced many more false positive results
than it did true positive results. Why was this? A5.47

5.6 Conditional probability: Activity


Question 1: Consider an urn that contains 1 red ball, 3 green balls, and 2 yellow balls.
(a) Suppose that we draw one ball from the urn (each ball is equally likely to be drawn) and we are
interested only in the colour of that ball. Write down a suitable sample space for this experiment. A5.48
(b) What is the probability of each outcome in this sample space? A5.49
(c) Now suppose that we draw two balls from the urn but the first ball is not placed back in the urn.
Calculate the probability of drawing a red ball and then a green ball. A5.50
For the remainder of this question, we will use the sample space
𝑆 = {𝑅𝐺, 𝑅𝑌 , 𝐺𝑅, 𝐺𝐺, 𝐺𝑌 , 𝑌 𝑅, 𝑌 𝐺, 𝑌 𝑌 }
where each outcome describes the colour of the first and second ball drawn from the urn. For example,
the outcome 𝑅𝐺 represents drawing a red ball and then a green ball.
(d) Why is there no 𝑅𝑅 outcome in the sample space? A5.51
(e) Let 𝐴 be the event that exactly one of the two balls was red and let 𝐵 be the event that exactly
one of the two balls was green. Find Pr(𝐴) and Pr(𝐵). A5.52
(f) Are 𝐴 and 𝐵 independent? A5.53
(g) Evaluate Pr(𝐴|𝐵). A5.54
(h) Find the probability that the first ball was green given that the second ball was yellow. A5.55
Question 2: A survey was conducted in which 1000 people were asked to state their age and whether or
not they enjoyed mathematics. It was found that 90% of people over the age of 30 enjoyed mathematics
but only 40% of people aged 30 or under did. Half of the survey's participants were over the age of 30.
Given that a randomly selected person from the survey enjoyed mathematics, what is the probability
that they were over the age of 30? A5.56

5.7 Answers
5.1 4 × 6 × 3 = 72
5.2 4 + 6 + 3 = 13
5.3 $8 \times 7 \times 6 = \frac{8!}{5!}$, assuming order is important.
5.4 $\binom{18}{11}$

5.5 $\binom{18}{1} \times \binom{17}{1} \times \binom{16}{9}$ (first choose a captain, then a vice-captain, then the remaining 9 players).
There are other ways to get the same answer here.


5.6 $2^5 = 32$ and $\binom{5}{2} = 10$. The latter is because the binary string is completely determined by
choosing two positions for the 1s.
5.7 Say our objects are 1, 2 and 3.
There are 6 possible selections: [1, 1], [2, 2], [3, 3], [1, 2], [1, 3] and [2, 3].
Our formula also gives $\binom{3-1+2}{2} = \binom{4}{2} = 6$.

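5.8
\[ 10! = (1 \times 2 \times \cdots \times 7) \times 8 \times 9 \times 10 = 7! \times 8 \times 9 \times 10 \]
\[ x = 8 \times 9 \times 10 = 720 \]
5.9
\[ \binom{10}{7} = \frac{10!}{7!\,3!} = \frac{7! \times 720}{7!\,3!} = \frac{720}{6} = 120 \]
5.10
\[ 16! = 14! \times 15 \times 16 = 14! \times 240 \]
\[ \binom{16}{14} = \frac{16!}{14!\,2!} = \frac{14! \times 240}{14! \times 2} = 120 \]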
5.11 The selections are unordered because the chef makes no distinction between who has ordered
the meals. The selections include repetitions because several family members may order the
same main meal. The number of selections is
\[ \frac{(12-1+7)!}{(12-1)!\,7!} = \frac{18!}{11!\,7!} = 31\,824. \]
5.12 The selections are unordered because there is no distinction between the positions available.
The selections are without repetition because we require four different people. The number of
possible leadership teams is
\[ \frac{20!}{4!\,16!} = 4845. \]
5.13 This time the selections are ordered because we can distinguish between the available positions.
The selections are still without repetition. The number of possible leadership teams is
\[ \frac{20!}{16!} = 116\,280. \]
5.14 The selections are ordered because different passwords may have the same characters appearing
in a different order. The selections allow repetition because you may reuse characters in a
password. Including upper case letters, lower case letters, and numbers, there are 62 characters
to choose from. So there are $62^8$ passwords of length 8, there are $62^9$ passwords of length 9,
and so on. The total number of passwords is
\[ 62^8 + 62^9 + 62^{10} + 62^{11}. \]
This is an enormous number, roughly $5 \times 10^{19}$.
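The size of this total can be checked exactly, since Python integers have arbitrary precision. The variable names in this short sketch are illustrative.

```python
alphabet_size = 26 + 26 + 10  # lower case letters, upper case letters, digits

# Passwords of length 8, 9, 10 and 11, with repetition allowed and order significant.
total = sum(alphabet_size ** length for length in range(8, 12))

print(total)
print(f"{total:.2e}")  # scientific notation, for comparison with the estimate
```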

5.15 Think of each path as a sequence of horizontal and vertical steps. For example,
𝐻, 𝐻, 𝑉 , 𝐻, 𝐻, 𝐻, 𝑉 , 𝐻, 𝑉 , 𝐻, 𝐻, 𝐻, 𝑉 .
How can you place these vertical steps relative to the horizontal steps?
5.16 You must take 9 horizontal steps and 4 vertical steps. Consider each path as a sequence of
horizontal and vertical steps (as in the hint). Think of the horizontal steps as being fixed
𝐻, 𝐻, 𝐻, 𝐻, 𝐻, 𝐻, 𝐻, 𝐻, 𝐻
with the vertical steps being placed in the gaps between them or at either end. There are 10 potential
locations for the vertical steps and we must select 4 of these locations. There are no repetitions
because we cannot have two consecutive vertical steps. Furthermore, these selections will be unordered.
This reduces the problem to finding the number of unordered selections without repetitions of 4
objects from a set of 10 objects. The number of paths is
\[ \frac{10!}{4!\,6!} = 210. \]
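This count can be confirmed by brute force. The sketch below takes the 9 horizontal and 4 vertical steps from the answer above, tries every way of choosing which of the 13 steps are vertical, and keeps only the choices with no two vertical steps in a row.

```python
from itertools import combinations
from math import comb

horizontal, vertical = 9, 4
steps = horizontal + vertical

# Choose which of the 13 steps are vertical, then discard any choice that
# puts two vertical steps next to each other.
valid_paths = sum(
    1
    for v_positions in combinations(range(steps), vertical)
    if all(b - a > 1 for a, b in zip(v_positions, v_positions[1:]))
)

assert valid_paths == comb(horizontal + 1, vertical) == 210
```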
5.17 This could be modelled by a probability space with sample space {1, 3, 5} and $\Pr(1) = \tfrac{1}{2}$,
$\Pr(3) = \tfrac{1}{3}$, and $\Pr(5) = \tfrac{1}{6}$.
5.18 The sum of the probabilities of the outcomes in a probability space must be 1, so we must
have $\Pr(\text{cherry}) = 1 - \tfrac{1}{3} - \tfrac{1}{3} - \tfrac{1}{4} = \tfrac{1}{12}$.
5.19 {11, 21, 31}.
5.20 There are 22 numbers in the sample space, so each has probability $\tfrac{1}{22}$ of being chosen. Thus,
\[ \Pr(C) = \Pr(11) + \Pr(21) + \Pr(31) = \tfrac{1}{22} + \tfrac{1}{22} + \tfrac{1}{22} = \tfrac{3}{22}. \]
5.21 {20, 21, … , 32}.
5.22 Pr(𝐷) = Pr(20) + Pr(21) + ⋯ + Pr(32) = 13 × 1/22 = 13/22.

5.23 {21, 31}.


5.24 Pr(𝐶 and 𝐷) = Pr(21) + Pr(31) = 1/22 + 1/22 = 1/11.

5.25 In the last set of questions we saw that Pr(𝐶 and 𝐷) = 1/11.
We also saw that Pr(𝐶) = 3/22 and Pr(𝐷) = 13/22, so Pr(𝐶) Pr(𝐷) = 3/22 × 13/22 = 39/484.

So Pr(𝐶 and 𝐷) ≠ Pr(𝐶) Pr(𝐷) and 𝐶 and 𝐷 are not independent.


5.26 𝑆 = {𝐻𝐻𝐻, 𝐻𝐻𝑇, 𝐻𝑇𝐻, 𝐻𝑇𝑇, 𝑇𝐻𝐻, 𝑇𝐻𝑇, 𝑇𝑇𝐻, 𝑇𝑇𝑇}
5.27 {𝐻𝐻𝐻}
5.28 {𝐻𝑇 𝑇 , 𝑇 𝐻𝑇 , 𝑇 𝑇 𝐻}
5.29 {𝐻𝑇 𝑇 , 𝑇 𝐻𝑇 , 𝑇 𝑇 𝐻, 𝑇 𝑇 𝑇 }
5.30 {𝐻𝑇 𝐻, 𝐻𝑇 𝑇 , 𝑇 𝑇 𝐻, 𝑇 𝑇 𝑇 }
5.31 {𝐻𝑇 𝐻}
5.32 From the previous answer, 𝐴 = {𝐻𝑇𝑇, 𝑇𝐻𝑇, 𝑇𝑇𝐻} and 𝐵 = {𝐻𝑇𝐻}. Since each outcome is
equally likely, the probabilities are all 1/8. So
Pr(𝐴) = Pr(𝐻𝑇𝑇) + Pr(𝑇𝐻𝑇) + Pr(𝑇𝑇𝐻) = 1/8 + 1/8 + 1/8 = 3/8
and
Pr(𝐵) = Pr(𝐻𝑇𝐻) = 1/8.
5.33 Yes. The events 𝐴 and 𝐵 have no outcomes in common, so Pr(𝐴 and 𝐵) = 0.
5.34 No. Note that 𝐴 = {𝐻𝐻𝐻, 𝐻𝐻𝑇 , 𝐻𝑇 𝐻, 𝑇 𝐻𝐻, 𝑇 𝑇 𝑇 } so the outcome 𝐻𝑇 𝐻 is in both 𝐴
and 𝐵. Therefore, Pr(𝐴 and 𝐵) = Pr(𝐻𝑇𝐻) = 1/8 ≠ 0.
5.35 They are not independent because Pr(𝐴 and 𝐵) = 0 but Pr(𝐴) Pr(𝐵) = 3/8 × 1/8 = 3/64.

5.36 The sample space will contain twelve outcomes. It may be written as
𝑆 = {1𝐻, 1𝑇 , 2𝐻, 2𝑇 , 3𝐻, 3𝑇 , 4𝐻, 4𝑇 , 5𝐻, 5𝑇 , 6𝐻, 6𝑇 }
where the outcome 2𝑇 represents having rolled a 2 and tossed tails.
5.37 Since the die and coin are fair, each of the twelve outcomes is equally likely. So the
probability of each outcome will be 1/12.
5.38 We have
𝐴 = {1𝐻, 1𝑇 , 2𝐻, 2𝑇 , 3𝐻, 3𝑇 , 4𝐻, 4𝑇 }
and
𝐵 = {1𝑇 , 2𝑇 , 3𝑇 , 4𝑇 , 5𝑇 , 6𝑇 }.
So
Pr(𝐴) = 2/3 and Pr(𝐵) = 1/2.
𝐴 is more likely to occur.
5.39 No. There are outcomes in both 𝐴 and 𝐵, for example, 4𝑇.
5.40 The event 𝐴 has nothing to do with the result of the coin toss and 𝐵 has nothing to do with
the die roll. So it seems reasonable to expect that the events would be independent.
5.41 They are independent because
Pr(𝐴 and 𝐵) = Pr(1𝑇) + Pr(2𝑇) + Pr(3𝑇) + Pr(4𝑇) = 1/3
and
Pr(𝐴) Pr(𝐵) = 2/3 × 1/2 = 1/3.
5.42 There is no problem here because Pr(𝐴) + Pr(𝐵) is not equal to Pr(𝐴 or 𝐵) due to the
events not being mutually exclusive. In fact, since
Pr(𝐴 or 𝐵) = Pr(𝐴) + Pr(𝐵) − Pr(𝐴 and 𝐵),
we have
Pr(𝐴 or 𝐵) = 2/3 + 1/2 − 1/3.
Therefore, Pr(𝐴 or 𝐵) = 5/6, which is not greater than 1.
5.43 Pr(𝐴|𝐵) > Pr(𝐴).
5.44 Pr(𝐴|𝐵) < Pr(𝐴).
5.45 Pr(𝐴|𝐵) < Pr(𝐴). (This seems backwards, but imagine that you had bet that United would
lead 3-0 at half time, and then a friend told you that City had won. You'd feel worse about
your bet. We can talk about conditional probabilities regardless of the chronological order of
the events or of the direction of causation.)

5.46 A total of 4 can be obtained as (1, 1, 2), (1, 2, 1) or (2, 1, 1). Because the spins are independent:
Pr((1, 1, 2)) = Pr((1, 2, 1)) = Pr((2, 1, 1)) = 1/2 × 1/2 × 1/4 = 1/16.
So, because (1, 1, 2), (1, 2, 1) and (2, 1, 1) are mutually exclusive, the probability of the total
being 4 is 1/16 + 1/16 + 1/16 = 3/16.
5.47 Basically, because mutants are so rare. Even though testing positive increases the chance a
porg is a mutant, the base chance of this is so low that it's still quite unlikely.
5.48 {red, green, yellow}
5.49 Pr(red) = 1/6, Pr(green) = 1/2 and Pr(yellow) = 1/3.
5.50 Initially, the probability of drawing a red ball is 1/6. Since the total number of remaining balls
is now 5, the probability of drawing a green ball is 3/5. Multiplying these gives a probability of
1/10.

5.51 This outcome would correspond to drawing two different red balls, but there is only one red
ball in the urn.
5.52 We have
𝐴 = {𝑅𝐺, 𝑅𝑌 , 𝐺𝑅, 𝑌 𝑅}
and
𝐵 = {𝑅𝐺, 𝐺𝑅, 𝐺𝑌 , 𝑌 𝐺}.
We calculated Pr(𝑅𝐺) = 1/10 earlier. The probabilities of the other outcomes are calculated
below:

Pr(𝑅𝑌) = 1/6 × 2/5 = 1/15
Pr(𝐺𝑅) = 1/2 × 1/5 = 1/10
Pr(𝑌𝑅) = 1/3 × 1/5 = 1/15
Pr(𝐺𝑌) = 1/2 × 2/5 = 1/5
Pr(𝑌𝐺) = 1/3 × 3/5 = 1/5

Pr(𝐴) = Pr(𝑅𝐺) + Pr(𝑅𝑌) + Pr(𝐺𝑅) + Pr(𝑌𝑅) = 1/3
Pr(𝐵) = Pr(𝑅𝐺) + Pr(𝐺𝑅) + Pr(𝐺𝑌) + Pr(𝑌𝐺) = 3/5

5.53 There are two outcomes common to both 𝐴 and 𝐵, namely 𝑅𝐺 and 𝐺𝑅. Using the previous
calculations,

Pr(𝐴 and 𝐵) = Pr(𝑅𝐺) + Pr(𝐺𝑅) = 1/5
Pr(𝐴) Pr(𝐵) = 1/3 × 3/5 = 1/5

so 𝐴 and 𝐵 are independent.

5.54 Since 𝐴 and 𝐵 are independent, Pr(𝐴|𝐵) = Pr(𝐴) = 1/3.
5.55 We can define events
𝐶 = {𝐺𝑅, 𝐺𝐺, 𝐺𝑌 } and 𝐷 = {𝑅𝑌 , 𝐺𝑌 , 𝑌 𝑌 }
which correspond to the first ball being green and second ball being yellow respectively. There
is only one outcome common to both 𝐶 and 𝐷, namely 𝐺𝑌. The probabilities of most of these
outcomes have already been found. The remaining ones are calculated below.

Pr(𝐺𝐺) = 1/2 × 2/5 = 1/5
Pr(𝑌𝑌) = 1/3 × 1/5 = 1/15

Pr(𝐶|𝐷) = Pr(𝐶 and 𝐷) / Pr(𝐷)
        = Pr(𝐺𝑌) / (Pr(𝑅𝑌) + Pr(𝐺𝑌) + Pr(𝑌𝑌))
        = (1/5) / (1/15 + 1/5 + 1/15)
        = 3/5

5.56 Let 𝐴 be the event that the randomly selected person is over the age of 30, and let 𝑀 be the
event that they enjoy mathematics. We know the following:

Pr(𝑀 | 𝐴) = 0.9
Pr(𝑀 | Ā) = 0.4
Pr(𝐴) = 0.5
Pr(Ā) = 0.5

Here Ā is the event that the person is not over the age of 30. Using Bayes' theorem, we can find Pr(𝐴|𝑀).

Pr(𝐴|𝑀) = Pr(𝑀|𝐴) Pr(𝐴) / (Pr(𝑀|𝐴) Pr(𝐴) + Pr(𝑀|Ā) Pr(Ā))
         = 9/13

Chapter 6

Probability Distributions

6.1 Random variables

6.1.1 Random variables


In a game, three standard dice will be rolled and the number of sixes will be recorded. We could let
𝑋 stand for the number of sixes rolled. Then 𝑋 is a special kind of variable whose value is based on
a random process. These are called random variables.
Because the value of 𝑋 is random, it doesn’t make sense to ask whether 𝑋 = 0, for example. But we
can ask what the probability is that 𝑋 = 0 or that 𝑋 ≥ 2. This is because “𝑋 = 0” and “𝑋 ≥ 2”
correspond to events from our sample space.
We can describe the behaviour of a random variable 𝑋 by listing, for each value 𝑥 that 𝑋 can take,
the probability that 𝑋 = 𝑥. This listing gives the probability distribution of the random variable.
Again, formally this listing is a function from the values of 𝑋 to their probabilities. (Note: this only
works for discrete random variables; we'll discuss continuous random variables later.)
Example: Let 𝑋 be the number of 1s in a binary string of length 2 chosen uniformly at
random. The probability distribution of 𝑋 is given by
Pr(𝑋 = 𝑥) = 1/4 if 𝑥 = 0,
            1/2 if 𝑥 = 1,
            1/4 if 𝑥 = 2.
It can be convenient to give this as a table:

𝑥            0    1    2
Pr(𝑋 = 𝑥)   1/4  1/2  1/4
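
The distribution in the example above can be checked by brute force. The following is a minimal sketch in Python; the helper name `ones_distribution` is ours and not part of these notes. It lists every binary string of the given length, counts the 1s in each, and divides by the number of strings.

```python
from collections import Counter
from fractions import Fraction
from itertools import product


def ones_distribution(length):
    """Return {x: Pr(X = x)}, where X counts the 1s in a uniformly random binary string."""
    strings = list(product("01", repeat=length))  # all 2**length strings, equally likely
    counts = Counter(s.count("1") for s in strings)
    return {x: Fraction(c, len(strings)) for x, c in sorted(counts.items())}


dist = ones_distribution(2)
for x, p in dist.items():
    print(x, p)
```

Using `Fraction` keeps the probabilities exact, so the output can be compared directly with the table.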

Questions.
1. An elevator is malfunctioning. Every minute it is equally likely to ascend one floor, descend one
floor, or stay where it is. When it begins malfunctioning it is on level 5. Let 𝑋 be the level it is
on two minutes later. Find the probability distribution for 𝑋. A6.1

6.1.2 Independence
We have seen that two events are independent when the occurrence or non-occurrence of one event
does not affect the likelihood of the other occurring. Similarly two random variables are independent
if the value of one does not affect the likelihood that the other will take a certain value.

Random variables 𝑋 and 𝑌 are independent if, for all 𝑥 and 𝑦,
Pr(𝑋 = 𝑥 and 𝑌 = 𝑦) = Pr(𝑋 = 𝑥) Pr(𝑌 = 𝑦).
It is a consequence of this definition that, for any sets of values 𝐴 and 𝐵, we have
Pr(𝑋 ∈ 𝐴 and 𝑌 ∈ 𝐵) = Pr(𝑋 ∈ 𝐴) Pr(𝑌 ∈ 𝐵).
Example: A standard die is rolled three times. Let 𝑍 be the number of sixes rolled.
What is the probability distribution of 𝑍?
Obviously 𝑍 can only take values in {0, 1, 2, 3}. On each roll there is a six with probability
1/6 and not a six with probability 5/6. As usual, we assume the rolls are independent.

Pr(𝑍 = 0) = 5/6 × 5/6 × 5/6
Pr(𝑍 = 1) = (1/6)(5/6)(5/6) + (5/6)(1/6)(5/6) + (5/6)(5/6)(1/6)
Pr(𝑍 = 2) = (1/6)(1/6)(5/6) + (1/6)(5/6)(1/6) + (5/6)(1/6)(1/6)
Pr(𝑍 = 3) = 1/6 × 1/6 × 1/6.

So the probability distribution of 𝑍 is

𝑥            0        1       2       3
Pr(𝑍 = 𝑥)   125/216  75/216  15/216  1/216
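
As a cross-check on the table, the sketch below enumerates all 216 equally likely outcomes of three rolls and counts the sixes in each. It is an illustration only, and the variable names are ours.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

rolls = list(product(range(1, 7), repeat=3))       # 216 equally likely outcomes
counts = Counter(roll.count(6) for roll in rolls)  # number of sixes in each outcome
z_dist = {z: Fraction(counts[z], len(rolls)) for z in range(4)}

for z, p in z_dist.items():
    # Fraction prints in lowest terms, e.g. 75/216 appears as 25/72
    print(z, p)
```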

Example: An integer is generated uniformly at random from the set {10, 11, … , 29}. Let
𝑋 and 𝑌 be its first and second (decimal) digit. Then 𝑋 and 𝑌 are independent random
variables. To prove this notice that, for 𝑥 ∈ {1, 2} and 𝑦 ∈ {0, 1, … , 9},

Pr(𝑋 = 𝑥 and 𝑌 = 𝑦) = 1/20

whereas
Pr(𝑋 = 𝑥) = 10/20 = 1/2

and
Pr(𝑌 = 𝑦) = 2/20 = 1/10.

Thus
Pr(𝑋 = 𝑥) Pr(𝑌 = 𝑦) = 1/2 × 1/10 = 1/20 = Pr(𝑋 = 𝑥 and 𝑌 = 𝑦).
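
The definition of independence can also be checked mechanically by testing the product rule for every pair of values. Here is a rough sketch, assuming two-digit integers; the function name `digits_independent` is hypothetical.

```python
from fractions import Fraction


def digits_independent(lo, hi):
    """Check Pr(X=x and Y=y) == Pr(X=x) Pr(Y=y) for the first digit X and second
    digit Y of a two-digit integer chosen uniformly at random from {lo, ..., hi}."""
    numbers = range(lo, hi + 1)
    total = len(numbers)
    pairs = [(n // 10, n % 10) for n in numbers]
    xs = {x for x, _ in pairs}
    ys = {y for _, y in pairs}
    for x in xs:
        for y in ys:
            p_xy = Fraction(pairs.count((x, y)), total)
            p_x = Fraction(sum(1 for a, _ in pairs if a == x), total)
            p_y = Fraction(sum(1 for _, b in pairs if b == y), total)
            if p_xy != p_x * p_y:
                return False
    return True


print(digits_independent(10, 29))  # True
```

The same function can be called with other ranges to experiment with when independence holds.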

Questions.
1. If, in the example above, the integer was generated uniformly at random from the set
{10, 11, … , 31}, would the random variables 𝑋 and 𝑌 still be independent? A6.2 A6.3

6.1.3 Operations on random variables


From a random variable 𝑋, we can create new random variables such as 𝑋 + 1, 2𝑋 and 𝑋². These
variables work as you would expect them to. From random variables 𝑋 and 𝑌 we can define
new random variables such as 𝑋 + 𝑌, 𝑋𝑌 and so on. Working out the distribution of these can be
complicated, however. This is especially true when 𝑋 and 𝑌 are not independent. In this case we
really need to know the joint probability distribution of 𝑋 and 𝑌 which tells us, for all possible values
of 𝑥 and 𝑦, the probability that both 𝑋 = 𝑥 and 𝑌 = 𝑦.

6.1.4 Expected value
A standard die is rolled some number of times and the average of the rolls is calculated. If the die is
rolled only once, this average is just the value rolled and is equally likely to be 1, 2, 3, 4, 5 or 6. If
the die is rolled ten times, then the average might be between 1 and 2 but this is pretty unlikely -
it's much more likely to be between 3 and 4. If the die is rolled ten thousand times, then we can be
almost certain that the average will be very close to 3.5. We will see that 3.5 is the expected value of
a random variable representing the die roll.
When we said “average” above, we really meant “mean”. Remember that the mean of a collection of
numbers is the sum of the numbers divided by how many of them there are. So the mean of 𝑥₁, … , 𝑥ₜ
is (𝑥₁ + ⋯ + 𝑥ₜ)/𝑡. The mean of 2, 2, 3 and 11 is (2 + 2 + 3 + 11)/4 = 4.5, for example.
The expected value of a random variable is calculated as a weighted average of its possible values.
If 𝑋 is a random variable with distribution

𝑥            𝑥₁  𝑥₂  ⋯  𝑥ₜ
Pr(𝑋 = 𝑥)   𝑝₁  𝑝₂  ⋯  𝑝ₜ

then the expected value of 𝑋 is

E[𝑋] = 𝑝₁𝑥₁ + 𝑝₂𝑥₂ + ⋯ + 𝑝ₜ𝑥ₜ.
Example. If 𝑋 is a random variable representing a die roll, then
E[𝑋] = 1/6 × 1 + 1/6 × 2 + ⋯ + 1/6 × 6 = 3.5.
Example. Someone estimates that each year the share price of Acme Corporation has a
10% chance of increasing by $10, a 50% chance of increasing by $4, and a 40% chance of
falling by $10. Assuming that this estimate is good, are Acme shares likely to increase in
value over the long term?
We can represent the change in the Acme share price by a random variable 𝑋 with
distribution

𝑥            −10  4    10
Pr(𝑋 = 𝑥)   2/5  1/2  1/10

Then
E[𝑋] = 2/5 × (−10) + 1/2 × 4 + 1/10 × 10 = −1.
Because this value is negative, Acme shares will almost certainly decrease in value over
the long term.
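
The weighted-average formula for E[𝑋] translates directly into a few lines of code. The sketch below is one possible way to write it (the function name `expected_value` and the dictionary representation of a distribution are our own choices), applied to the die roll and to the Acme share price.

```python
from fractions import Fraction


def expected_value(dist):
    """dist maps each value x to Pr(X = x); returns the probability-weighted sum."""
    assert sum(dist.values()) == 1, "probabilities must sum to 1"
    return sum(p * x for x, p in dist.items())


die = {x: Fraction(1, 6) for x in range(1, 7)}
acme = {-10: Fraction(2, 5), 4: Fraction(1, 2), 10: Fraction(1, 10)}

print(expected_value(die))   # 7/2
print(expected_value(acme))  # -1
```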
Notice that it was important that we weighted our average using the probabilities here. If we had
just taken the average of −10, 4 and 10 we would have gotten the wrong answer by ignoring the fact
that some values were more likely than others.
Our initial die-rolling example hinted that when we average over a large number of independent trials
we will get very close to the expected value. A formal version of this (slightly vague) statement is
given by a famous theorem called the law of large numbers.
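
A quick simulation gives a feel for this behaviour. The following sketch averages simulated die rolls for increasing numbers of rolls; the seed and the sample sizes are arbitrary choices made for illustration, and the printed averages will differ if the seed is changed.

```python
import random

random.seed(0)  # arbitrary seed, only so that repeated runs give the same numbers


def average_of_rolls(n):
    """Mean of n simulated rolls of a standard die."""
    return sum(random.randint(1, 6) for _ in range(n)) / n


for n in (1, 10, 100, 10_000, 100_000):
    print(n, average_of_rolls(n))
```

For small 𝑛 the average moves around a lot; for large 𝑛 it should settle close to 3.5.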
Questions.
1. Do you agree or disagree with the following statement? “The expected value of a random
variable is the value it is most likely to take.” A6.4
2. Let 𝑋 be the number of heads occurring when a fair coin is flipped three times. Find E[𝑋].
A6.5

6.1.5 Linearity of expectation
We saw in the last lecture that adding random variables can be difficult. Finding the expected value
of a sum of random variables is easy if we know the expected values of the variables.
If 𝑋 and 𝑌 are random variables, then
E[𝑋 + 𝑌 ] = E[𝑋] + E[𝑌 ].
This works even if 𝑋 and 𝑌 are not independent.
Similarly, finding the expected value of a scalar multiple of a random variable is easy if we know the
expected value of the variable.
If 𝑋 is a random variable and 𝑠 ∈ ℝ, then
E[𝑠𝑋] = 𝑠E[𝑋].
Example. Two standard dice are rolled. What is the expected total?
Let 𝑋1 and 𝑋2 be random variables representing the first and second die rolls. From the
earlier example E[𝑋1 ] = E[𝑋2 ] = 3.5 and so
E[𝑋1 + 𝑋2 ] = E[𝑋1 ] + E[𝑋2 ] = 3.5 + 3.5 = 7.
Example. What is the expected number of “11” substrings in a binary string of length 5
chosen uniformly at random?
For 𝑖 = 1, … , 4, let 𝑋ᵢ be a random variable that is equal to 1 if the 𝑖th and (𝑖 + 1)th
bits of the string are both 1, and is equal to 0 otherwise. Then 𝑋₁ + ⋯ + 𝑋₄ is the
number of “11” substrings in the string. The probability that the 𝑖th bit is a 1 is 1/2 and
the probability that the (𝑖 + 1)th bit is a 1 is 1/2. So, because the bits are independent,
Pr(𝑋ᵢ = 1) = 1/2 × 1/2 = 1/4 and E[𝑋ᵢ] = 1/4 for 𝑖 = 1, … , 4. So,
E[𝑋₁ + ⋯ + 𝑋₄] = E[𝑋₁] + ⋯ + E[𝑋₄] = 4/4 = 1.
Note that the variables 𝑋1 , … , 𝑋4 in the above example were not independent, but we were still
allowed to use linearity of expectation.
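
The answer obtained by linearity can be confirmed by direct enumeration. This sketch (with a helper name, `count_11`, of our own choosing) averages the number of “11” substrings over all 32 binary strings of length 5, counting overlapping occurrences exactly as the indicator variables do.

```python
from fractions import Fraction
from itertools import product


def count_11(bits):
    """Number of positions i where bits i and i+1 are both '1' (overlaps counted)."""
    return sum(1 for i in range(len(bits) - 1) if bits[i] == "1" and bits[i + 1] == "1")


strings = ["".join(b) for b in product("01", repeat=5)]
expected = Fraction(sum(count_11(s) for s in strings), len(strings))
print(expected)  # 1
```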
Questions.
1. A black box produces a random number according to some probability distribution that you do
not know. You use the box to create two random numbers and subtract the second from the
first. Can you find the expected value of the number you produce? A6.6
2. Let 𝑋 be the number of heads occurring when a fair coin is flipped three times. In the last set
of questions you calculated E[𝑋]. Can you now do this more easily? A6.7

6.1.6 Variance
Think of the random variables 𝑋, 𝑌 and 𝑍 whose distributions are given below.

𝑥            −1      99
Pr(𝑋 = 𝑥)   99/100  1/100

𝑦            −1   1
Pr(𝑌 = 𝑦)   1/2  1/2

𝑧            −50  50
Pr(𝑍 = 𝑧)   1/2  1/2

These variables are very different. Perhaps 𝑋 corresponds to buying a raffle ticket, 𝑌 to making a
small bet on a coin flip, and 𝑍 to making a large bet on a coin flip. However, if you only consider
expected value, all of these variables look the same - they each have expected value 0.
To give a bit more information about a random variable we can define its variance, which measures
how ”spread out” its distribution is. People often also refer to the standard deviation which is simply
the square root of the variance.

If 𝑋 is a random variable with E[𝑋] = 𝜇, the variance of 𝑋 is given by
Var[𝑋] = E[(𝑋 − 𝜇)²].
The standard deviation of 𝑋 is given by
𝜎 = √Var[𝑋].
So the variance is a measure of how much we expect the variable to differ from its expected value.
Example. The variable 𝑋 above will be 1 smaller than its expected value with probability
99/100 and will be 99 larger than its expected value with probability 1/100. So

Var[𝑋] = 99/100 × (−1)² + 1/100 × 99² = 99.
Similarly,

Var[𝑌] = 1/2 × (−1)² + 1/2 × 1² = 1
Var[𝑍] = 1/2 × (−50)² + 1/2 × 50² = 2500.

Notice that the variance of 𝑋 is much smaller than the variance of 𝑍 because 𝑋 is very likely to be
close to its expected value, whereas 𝑍 will certainly be far from its expected value.
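
The three variances above can be reproduced straight from the definition Var[𝑋] = E[(𝑋 − 𝜇)²]. The sketch below is illustrative only; the function names and the dictionary representation of each distribution are assumptions of ours.

```python
from fractions import Fraction


def expected_value(dist):
    return sum(p * x for x, p in dist.items())


def variance(dist):
    """Var[X] = E[(X - mu)^2], computed directly from the definition."""
    mu = expected_value(dist)
    return sum(p * (x - mu) ** 2 for x, p in dist.items())


X = {-1: Fraction(99, 100), 99: Fraction(1, 100)}
Y = {-1: Fraction(1, 2), 1: Fraction(1, 2)}
Z = {-50: Fraction(1, 2), 50: Fraction(1, 2)}

for name, dist in (("X", X), ("Y", Y), ("Z", Z)):
    # expected value, variance, standard deviation
    print(name, expected_value(dist), variance(dist), float(variance(dist)) ** 0.5)
```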
Questions.
1. Let 𝑋 be a random variable with distribution given by

𝑥            0    2    6
Pr(𝑋 = 𝑥)   1/6  1/2  1/3

Find the variance of 𝑋. A6.8
2. Let 𝑋 be the sum of 1000 spins of our spinner, and let 𝑌 be 1000 times the result of a single
spin. Using linearity of expectation we can see that E[𝑋] = E[𝑌]. Which of 𝑋 and 𝑌 do you
think would have greater variance? A6.9

6.2 Random variables: Activity


Question 1: Consider the random variables 𝑋 and 𝑌 whose distributions are given below. Suppose
that 𝑋 and 𝑌 are independent.

𝑥            0    1
Pr(𝑋 = 𝑥)   1/4  3/4

𝑦            0    1
Pr(𝑌 = 𝑦)   1/2  1/2

(a) Evaluate Pr(𝑋 = 0 and 𝑌 = 1). A6.10
(b) Find E[𝑋] and E[𝑌]. A6.11
(c) What is Pr(𝑋 = E[𝑋])? A6.12
(d) Evaluate Pr(𝑋 + 𝑌 = 1) and Pr(𝑋 = 𝑌). A6.13
(e) Consider the random variables 𝑈 = 𝑋 + 𝑌 and 𝑉 = 𝑋 − 𝑌. Find the distributions of 𝑈 and 𝑉
and write your answer as a table (as above). A6.14
(f) Without doing any calculations, do you expect 𝑈 and 𝑉 to be independent? A6.15
(g) Use calculations to see if 𝑈 and 𝑉 are independent. A6.16
Question 2 (Markov's inequality): Consider the random variable 𝑋 with the following distribution.

𝑥            0    3    5    7    8
Pr(𝑋 = 𝑥)   1/6  1/3  1/8  1/8  1/4

(a) Find Pr(𝑋 ≥ 6). A6.17
(b) Find E[𝑋]. A6.18
(c) An important result in probability theory is that for any random variable 𝑋 which only takes
non-negative values, and any positive real number 𝑎, we have
Pr(𝑋 ≥ 𝑎) ≤ E[𝑋]/𝑎.
Verify this for the above random variable with 𝑎 = 6. A6.19

6.3 Important probability distributions


In this chapter we'll introduce some of the most common and useful probability distributions. These
arise in various different real-world situations.

6.3.1 Useful discrete distributions


Discrete uniform distribution
This type of distribution arises when we choose one of a set of consecutive integers so that all choices
are equally likely.
The discrete uniform distribution with parameters 𝑎, 𝑏 ∈ ℤ (𝑎 ≤ 𝑏) is given by
Pr(𝑋 = 𝑘) = 1/(𝑏 − 𝑎 + 1) for 𝑘 ∈ {𝑎, 𝑎 + 1, … , 𝑏}.
It has E[𝑋] = (𝑎 + 𝑏)/2 and Var[𝑋] = ((𝑏 − 𝑎 + 1)² − 1)/12.
Example. Discrete uniform distribution with 𝑎 = 3, 𝑏 = 8.
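
The closed-form mean and variance can be compared with values computed directly from the probabilities. A small sketch for 𝑎 = 3, 𝑏 = 8 follows; the variable names are ours.

```python
from fractions import Fraction

a, b = 3, 8
pmf = {k: Fraction(1, b - a + 1) for k in range(a, b + 1)}

mean = sum(p * k for k, p in pmf.items())
var = sum(p * (k - mean) ** 2 for k, p in pmf.items())

print(mean, Fraction(a + b, 2))                 # both 11/2
print(var, Fraction((b - a + 1) ** 2 - 1, 12))  # both 35/12
```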

Bernoulli distribution
This type of distribution arises when we have a single process that succeeds with probability 𝑝 and
fails otherwise. Such a process is called a Bernoulli trial.
The Bernoulli distribution with parameter 𝑝 ∈ [0, 1] is given by

Pr(𝑋 = 𝑘) = 𝑝       for 𝑘 = 1,
            1 − 𝑝   for 𝑘 = 0.
It has E[𝑋] = 𝑝 and Var[𝑋] = 𝑝(1 − 𝑝).

Geometric distribution
This distribution gives the probability that, in a sequence of independent Bernoulli trials, we see
exactly 𝑘 failures before the first success.
The geometric distribution with parameter 𝑝 ∈ (0, 1] is given by
Pr(𝑋 = 𝑘) = 𝑝(1 − 𝑝)^𝑘 for 𝑘 ∈ ℕ.
We have E[𝑋] = (1 − 𝑝)/𝑝 and Var[𝑋] = (1 − 𝑝)/𝑝².

Example. Geometric distribution with 𝑝 = 0.5

Example. If every minute there is a 1% chance that your internet connection fails then
the probability of staying online for exactly 𝑥 consecutive minutes is approximated by a
geometric distribution with 𝑝 = 0.01. It follows that the expected value is (1 − 0.01)/0.01 = 99
minutes and the variance is (1 − 0.01)/(0.01)² = 9900.
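
The figures in this example can be checked numerically by summing over the geometric probabilities. Because 𝑋 can take infinitely many values, the sketch below truncates the sums at an arbitrary cut-off `N` (our choice); with 𝑝 = 0.01 the probability beyond the cut-off is roughly 10⁻²², so the truncation has no visible effect.

```python
p = 0.01
N = 5000  # truncation point (an assumption); the omitted tail is negligible here

pmf = [p * (1 - p) ** k for k in range(N)]  # Pr(X = k) for k = 0, ..., N - 1
mean = sum(k * q for k, q in enumerate(pmf))
var = sum((k - mean) ** 2 * q for k, q in enumerate(pmf))

print(round(mean, 4), round(var, 2))
```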

Binomial distribution
This distribution gives the probability that, in a sequence of 𝑛 independent Bernoulli trials, we see
exactly 𝑘 successes.
The binomial distribution with parameters 𝑛 ∈ ℤ+ and 𝑝 ∈ [0, 1] is given by

Pr(𝑋 = 𝑘) = (𝑛 choose 𝑘) 𝑝^𝑘 (1 − 𝑝)^(𝑛−𝑘) for 𝑘 ∈ {0, … , 𝑛},
where (𝑛 choose 𝑘) = 𝑛!/(𝑘!(𝑛 − 𝑘)!).

It has E[𝑋] = 𝑛𝑝 and Var[𝑋] = 𝑛𝑝(1 − 𝑝).


Figure. Binomial distribution with 𝑛 = 20, 𝑝 = 0.5

This demonstration displays binomial distributions with a variety of parameter values.
Example. If 1000 people search a term on a certain day and each of them has a 10%
chance of clicking a sponsored link, then the number of clicks on that link is approximated
by a binomial distribution with 𝑛 = 1000 and 𝑝 = 0.1. It follows that the expected value
is 1000 × 0.1 = 100 clicks and the variance is 1000 × 0.1 × 0.9 = 90.
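
The binomial formula is easy to evaluate with the standard library. This sketch (the function name `binomial_pmf` is ours) uses `math.comb` for the binomial coefficient and recomputes the mean and variance for the sponsored-link example by summing over all possible values of 𝑘.

```python
from math import comb


def binomial_pmf(k, n, p):
    """Pr(X = k) for a binomial distribution with parameters n and p."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)


n, p = 1000, 0.1
mean = sum(k * binomial_pmf(k, n, p) for k in range(n + 1))
var = sum((k - mean) ** 2 * binomial_pmf(k, n, p) for k in range(n + 1))
print(mean, var)  # close to 100 and 90, up to floating-point rounding
```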

Poisson distribution
In a Poisson process, events occur over time so that an average of 𝜆 events occur per time period and
the probability that an event will occur in one moment is the same as the probability that an event
will occur in any other moment (where our ”moments” are very short time intervals of equal length).
This kind of process is an excellent model for many real-world phenomena such as machine failures,
calls to a help centre, or goals in a soccer match.
In this kind of process, the number of events in a specified time period forms a Poisson distribution.
The Poisson distribution with parameter 𝜆 ∈ ℝ (𝜆 > 0) is given by
Pr(𝑋 = 𝑘) = 𝜆^𝑘 𝑒^(−𝜆)/𝑘! for 𝑘 ∈ ℕ.
We have E[𝑋] = 𝜆 and Var[𝑋] = 𝜆.
Figure. Poisson distribution with 𝜆 = 4

This demonstration displays Poisson distributions with a variety of parameter values.
Example. If a call centre usually receives 6 calls per minute, then a Poisson distribution
with 𝜆 = 6 approximates the probability that it receives 𝑘 calls in a certain minute. It follows that
the expected value is 6 calls and the variance is 6.
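
For the call-centre example, individual Poisson probabilities can be computed as follows. This is a sketch only: the function name `poisson_pmf` and the cut-off of 100 terms used for the mean and variance are our own choices, not part of the notes.

```python
from math import exp, factorial


def poisson_pmf(k, lam):
    """Pr(X = k) for a Poisson distribution with parameter lam."""
    return lam ** k * exp(-lam) / factorial(k)


lam = 6
for k in (0, 3, 6, 9, 12):
    print(k, round(poisson_pmf(k, lam), 4))

# Truncated sums; terms beyond k = 100 are negligibly small for lam = 6.
mean = sum(k * poisson_pmf(k, lam) for k in range(100))
var = sum((k - mean) ** 2 * poisson_pmf(k, lam) for k in range(100))
print(mean, var)  # both close to 6
```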
Questions.
1. There is a 95% chance of a packet being received after being sent down a noisy line, and the
packet is resent until it is received. What is the probability that the packet is received within
the first three attempts? A6.20
2. A factory aims to have at most 2% of the components it makes be faulty. What is the probability
of a quality control test of 20 random components finding that 2 or more are faulty, if the
factory is exactly meeting its 2% target? A6.21
3. The number of times a machine needs adjusting during a day approximates a Poisson distribution,
and on average the machine needs to be adjusted three times per day. What is the probability
it does not need adjusting on a particular day? A6.22

6.3.2 Continuous probability


Look at the binomial distributions pictured below. Each has 𝑝 = 0.5 and from left to right we have
𝑛 = 8, 𝑛 = 32 and 𝑛 = 128. (Note that we have restricted the 𝑥-axis and scaled the 𝑦-axis to
emphasise the similarities.) They seem to be approaching a continuous curve.

It would be useful to be able to discuss probability in continuous settings. For example, this would
enable us to talk about the distributions of heights of people, times between a query and response,
lengths of bananas and so on.
In these cases we can't describe a probability distribution by listing the probability of all the possible
values because there will be infinitely many of them. Also, the chance of someone having any particular
height, for example, is minuscule: it's almost impossible that someone will be exactly 172.34256183cm
tall. It does make sense, however, to talk about the chance that someone will be between 171cm and
173cm tall.
We can describe a continuous probability distribution using a probability density function. For
example, the probability density function for the height in cm of a female Australian might look like
the following.

The probability of a height lying in a particular range is the definite integral of the probability density
function in that range. So the probability of a height between 171cm and 173cm is given by the area
marked below.

In the same way that the probabilities of the outcomes in a discrete probability space must add to 1,
the total area under the curve of a probability density function must be 1.

6.3.3 Useful continuous distributions


Continuous uniform distribution
In a continuous uniform distribution, all valid intervals with the same length are equally likely.
The continuous uniform distribution with parameters 𝑎, 𝑏 ∈ ℝ (𝑎 < 𝑏) has probability density function
given by

1/(𝑏 − 𝑎)   for 𝑥 ∈ [𝑎, 𝑏],
0           otherwise.

It has E[𝑋] = (𝑎 + 𝑏)/2 and Var[𝑋] = (𝑏 − 𝑎)²/12.

Figure. Probability density function for a continuous uniform distribution with 𝑎 = −1,
𝑏 = 2.

Example. If a real number 𝑋 between −1 and 2 is selected uniformly at random, then
the probability density function for 𝑋 is the one pictured in the figure above. The
probability that this number will be between 0 and 1 is 1/3. Intuitively this is because the
length of the interval (0, 1) is one third of the length of the interval (−1, 2). It can be
calculated formally as the definite integral ∫₀¹ (1/3) 𝑑𝑥.
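
To illustrate that probabilities for continuous distributions are areas under the density, the sketch below approximates the integral numerically with a simple midpoint rule. The function names and the number of steps are our own choices; any numerical integration routine would do.

```python
def uniform_pdf(x, a=-1.0, b=2.0):
    """Density of the continuous uniform distribution on [a, b]."""
    return 1.0 / (b - a) if a <= x <= b else 0.0


def integrate(f, lo, hi, steps=10_000):
    """Midpoint-rule approximation to the definite integral of f over [lo, hi]."""
    width = (hi - lo) / steps
    return sum(f(lo + (i + 0.5) * width) for i in range(steps)) * width


print(integrate(uniform_pdf, 0, 1))   # close to 1/3
print(integrate(uniform_pdf, -5, 5))  # total area under the density, close to 1
```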

Exponential distribution
The exponential distribution gives the time between events in a Poisson process. It can be thought of
as a continuous version of the geometric distribution.
The exponential distribution with parameter 𝜆 ∈ ℝ (𝜆 > 0) has probability density function given by
𝜆𝑒^(−𝜆𝑥) for 𝑥 ≥ 0.
It has E[𝑋] = 1/𝜆 and Var[𝑋] = 1/𝜆².

Figure. Probability density function for an exponential distribution with 𝜆 = 1/2.

Example. Our example for the Poisson distribution above was a call centre that usually
receives 6 calls per minute. An exponential distribution with 𝜆 = 6 will approximate the
time between calls into this call centre. It follows that the expected time between calls is
1/6 of a minute (10 seconds) and the variance is 1/36.
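
The call-centre figures can be checked by integrating the exponential density numerically. In the sketch below the helper names, the midpoint rule, and the upper limit of 10 minutes standing in for infinity are all our assumptions. The closed form 1 − 𝑒⁻¹ used for comparison comes from integrating 𝜆𝑒^(−𝜆𝑥) from 0 to 1/6 with 𝜆 = 6.

```python
from math import exp


def exponential_pdf(x, lam):
    """Density of the exponential distribution with parameter lam."""
    return lam * exp(-lam * x) if x >= 0 else 0.0


def integrate(f, lo, hi, steps=100_000):
    """Midpoint-rule approximation to the definite integral of f over [lo, hi]."""
    width = (hi - lo) / steps
    return sum(f(lo + (i + 0.5) * width) for i in range(steps)) * width


lam = 6  # calls per minute
# Mean waiting time: integral of x * pdf(x); the upper limit 10 stands in for infinity.
mean = integrate(lambda x: x * exponential_pdf(x, lam), 0, 10)
# Probability that the next call arrives within 10 seconds (1/6 of a minute).
p_within = integrate(lambda x: exponential_pdf(x, lam), 0, 1 / 6)

print(mean, 1 / lam)
print(p_within, 1 - exp(-1))
```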

Normal distribution
This distribution comes up in many situations. Heights, test scores and a host of other data are often
normally distributed. As we saw above, a binomial distribution with large 𝑛 is well approximated by
a normal distribution.
The normal distribution with parameters 𝜇 ∈ ℝ and 𝜎 ∈ ℝ (𝜎 > 0) has probability density function
given by
(1/√(2𝜋𝜎²)) exp(−(𝑥 − 𝜇)²/(2𝜎²)).

It has E[𝑋] = 𝜇 and Var[𝑋] = 𝜎2 .


Figure. Probability density function for a normal distribution with 𝜇 = 0, 𝜎 = 1.

A normal distribution with 𝜇 = 0, 𝜎 = 1, like the one above, is referred to as a standard normal
distribution.
There are many tools available to calculate areas under the curve of a normal probability density
function. See, for example, this online calculator.
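
If no calculator is to hand, areas under a normal density can also be computed with Python's standard library. The sketch below relies on the standard identity that the area to the left of 𝑥 equals (1 + erf((𝑥 − 𝜇)/(𝜎√2)))/2; the function names `normal_cdf` and `normal_prob` are ours.

```python
from math import erf, sqrt


def normal_cdf(x, mu=0.0, sigma=1.0):
    """Area under the normal density to the left of x."""
    return 0.5 * (1 + erf((x - mu) / (sigma * sqrt(2))))


def normal_prob(lo, hi, mu=0.0, sigma=1.0):
    """Probability that a normal random variable lies between lo and hi."""
    return normal_cdf(hi, mu, sigma) - normal_cdf(lo, mu, sigma)


print(round(normal_prob(-1, 1), 4))  # about 0.6827
print(round(normal_prob(-2, 2), 4))  # about 0.9545
```

So a standard normal variable falls within one standard deviation of its mean a little over two thirds of the time.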

𝑡-distribution
Lightweight Corporation claim their LED bulbs have a mean lifetime of 30000 hours. You believe
that these lifetimes will be normally distributed. You select 15 random bulbs and find that they have
a mean lifetime of 28000 hours with a standard deviation of 5000 hours. Obviously your best guess
for the lifetime of their bulbs is 28000 hours, but it's possible that you were unlucky and picked a
sample of 15 poor bulbs. How likely is it, though?
To deal with situations like this, statisticians use a measure called the 𝑡-statistic. First we need to
define the sample variance of a sample 𝑥1 , … , 𝑥𝑛 of 𝑛 numbers to be
𝑠² = (1/(𝑛 − 1)) ∑_{𝑖=1}^{𝑛} (𝑥ᵢ − 𝑥̄)².
(Don't confuse this with the usual variance, which would have 𝑛 instead of 𝑛 − 1 in its definition.
The 𝑛 − 1 is important to make the sample variance have, on average, the same value as the variance
of the distribution being sampled.)
The 𝑡-statistic for a sample 𝑥1 , … , 𝑥𝑛 of size 𝑛 is given by
𝑡 = (𝑥̄ − 𝜇)/(𝑠/√𝑛)

where 𝑥̄ and 𝜇 are the mean of the sample and the mean of the population, respectively, whilst 𝑠 is
the square root of the sample variance defined just above.
We can rearrange this as 𝜇 = 𝑥̄ − 𝑡𝑠/√𝑛, so the 𝑡-statistic is measuring how far from the actual population
mean our sample mean is (scaled by 𝑠/√𝑛). When the values in the population are normally distributed
and the sample has size 𝑛, the distribution of the 𝑡-statistic forms a 𝑡-distribution with parameter
𝜈 = 𝑛 − 1.
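
The two formulas above are short enough to code directly. In this sketch the function names are ours, and we assume that the "standard deviation of 5000 hours" quoted in the bulb example is the square root 𝑠 of the sample variance. The small list of numbers at the end is made up purely to show `sample_variance` in use.

```python
from math import sqrt


def sample_variance(xs):
    """s^2 with the n - 1 denominator, as defined above."""
    n = len(xs)
    mean = sum(xs) / n
    return sum((x - mean) ** 2 for x in xs) / (n - 1)


def t_statistic(sample_mean, mu, s, n):
    """t = (sample mean - population mean) / (s / sqrt(n))."""
    return (sample_mean - mu) / (s / sqrt(n))


# Summary figures from the bulb example: n = 15, sample mean 28000, s = 5000, claimed mean 30000.
t = t_statistic(28000, 30000, 5000, 15)
print(round(t, 3))  # -1.549

# A small made-up sample, only to show sample_variance in use.
print(sample_variance([2, 2, 3, 11]))
```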
The 𝑡-distribution is a probability distribution with parameter 𝜈 ∈ ℤ+ . Its probability density function
is complicated to express and needn't concern us. It has E[𝑋] = 0 and, for 𝜈 ≥ 3, Var[𝑋] = 𝜈/(𝜈 − 2).

The parameter 𝜈 is sometimes referred to as the number of degrees of freedom. As 𝜈 becomes large
the 𝑡-distribution approaches a normal distribution with 𝜇 = 0 and 𝜎 = 1.
Figure. In red, green, and orange are the probability density functions for 𝑡-distributions
with 𝜈 = 1, 𝜈 = 2 and 𝜈 = 6. In blue is the probability density function for a normal
distribution with 𝜇 = 0 and 𝜎 = 1.

Again, there are many tools available to calculate areas under the curve of a 𝑡-distribution probability
density function. See, for example, this online calculator.

𝜒²-distribution
The 𝜒²-distribution is a probability distribution with parameter 𝑘 ∈ ℤ+. It is the probability
distribution of (𝑋₁)² + ⋯ + (𝑋ₖ)² where 𝑋₁, … , 𝑋ₖ are independent random variables that each have a standard
normal distribution. Its probability density function is complicated to express and needn't concern
us. It has E[𝑋] = 𝑘 and Var[𝑋] = 2𝑘.
The parameter 𝑘 is sometimes referred to as the number of degrees of freedom. The 𝜒²-distribution is
mostly used in hypothesis testing, rather than for directly modelling situations.
Figure. In blue, orange, and green are the probability density functions for 𝜒²-distributions
with 𝑘 = 1, 𝑘 = 3 and 𝑘 = 5.
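
The description of this distribution as a sum of squared standard normal variables suggests a simple simulation. The sketch below draws many such sums and compares their mean and variance with 𝑘 and 2𝑘. The seed, the value of 𝑘 and the number of trials are arbitrary choices of ours, and the results are only approximate.

```python
import random

random.seed(1)  # arbitrary seed for reproducibility


def chi_squared_sample(k):
    """One draw: the sum of squares of k independent standard normal values."""
    return sum(random.gauss(0, 1) ** 2 for _ in range(k))


k, trials = 3, 100_000
draws = [chi_squared_sample(k) for _ in range(trials)]
mean = sum(draws) / trials
var = sum((d - mean) ** 2 for d in draws) / trials
print(mean, var)  # should be near k = 3 and 2k = 6
```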

Questions.
1. The number of metres a cat walks down a 4 metre long hallway before taking a nap is given
by a continuous uniform distribution with 𝑎 = 0 and 𝑏 = 4. The sun shines through windows
onto the first metre and last 50cm of the hallway. What is the probability that the cat falls
asleep in the sun? A6.23
2. The actual length in cm of a “30cm” ruler made by a factory is given by a normal distribution
with 𝜇 = 30 and 𝜎 = 0.03. What is the probability that the length of the ruler is off by 0.1cm
or more? (Consider using the online calculator linked above.) A6.24
3. In the example above, we wanted to gauge whether to believe Lightweight Corporation's claim
of 30000 hours average lifetime. Assuming the claim is correct, what is the probability that the
𝑡-statistic derived from our sample was as small as it turned out (or smaller)? What can you
deduce from this? A6.25

6.4 Important probability distributions: Activity


Question 1: Suppose that the number of emails you receive each day follows a Poisson distribution
with 𝜆 = 8.
(a) What is the probability that you receive no emails on a given day? A6.26
(b) What is the probability that you receive at least one email on a given day? A6.27
(c) What is the probability that you receive exactly 8 emails in a given day? A6.28
(d) Suppose that the number of emails you receive on a given day is independent of the number you
receive on any other day. What is the probability that you receive exactly 8 emails two days in a
row? A6.29
Question 2: A team of hackers are trying to take down your server. The team makes an attempt
each day and on each attempt, there is a 6% chance that they succeed. You want to know the
distribution for the number of days it will take them to succeed.
(a) Which common probability distribution would best model this situation? A6.30
(b) What is the probability that the team of hackers succeeds on the third day? A6.31

6.5 Graphs and trees

6.5.1 Graphs
A graph is basically a network consisting of nodes (called vertices) and links (called edges) between
those nodes.
A graph is a collection of objects called vertices in which some pairs of vertices are designated as
adjacent. We imagine adjacent vertices as having an edge between them. We usually picture graphs
by drawing a dot for each vertex and a line between two dots for each edge.
We often name edges by concatenating the names of the two vertices they join: so 𝑢𝑣 is an edge
between vertices 𝑢 and 𝑣.
Example. Think of the graph with four vertices, 𝑢, 𝑣, 𝑤, 𝑥, and four edges 𝑢𝑣, 𝑣𝑤, 𝑤𝑥, 𝑥𝑢.
We could draw this graph as follows:

Note that we could also draw this graph as:

Graphs are the same when they have the same vertices and edges. How we choose to position the
vertices and edges when we draw a graph is not important.
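
One way to make this concrete is to store a graph as a set of vertices together with a set of edges, where each edge is an unordered pair. The sketch below is just one possible representation, using Python sets, and is not a convention from these notes. It shows that the order in which vertices and edges are written down makes no difference.

```python
vertices = {"u", "v", "w", "x"}
# An edge is an unordered pair, so uv and vu are the same edge.
edges = {frozenset(e) for e in ("uv", "vw", "wx", "xu")}

# The same graph written down in a different order.
vertices2 = {"x", "w", "v", "u"}
edges2 = {frozenset(e) for e in ("ux", "xw", "wv", "vu")}

print(vertices == vertices2 and edges == edges2)  # True


def adjacent(a, b):
    """True if there is an edge between vertices a and b."""
    return frozenset((a, b)) in edges


print(adjacent("u", "v"), adjacent("u", "w"))  # True False
```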
Here we're discussing the most commonly used definition of a graph. In these graphs:
• edges do not have directions;
• there are never two or more edges between a pair of vertices;
• there are no edges from a vertex to itself (loops).
However, in some applications it is natural to allow directed edges, multiple edges or loops and all of
these variants are used and studied. In addition, graphs where vertices or edges are assigned weights
or colours are sometimes employed.
Questions.
1. Draw a picture of a graph with four vertices and an edge between every pair of vertices. A6.32
2. What is the maximum number of edges that a graph with 𝑛 vertices can have? A6.33

6.5.2 Degree
We say that an edge of a graph that joins two vertices is incident with those vertices.
The degree of a vertex in a graph is the number of edges in the graph that are incident with it.
Because each edge in a graph is incident with exactly two vertices, each edge contributes 2 to the
total degree of the graph. This gives us the following:
The sum of the degrees of the vertices in a graph is twice the number of edges in the graph. In
particular, this sum is always even.
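
The degree-sum fact is easy to observe on a small example. The graph below is hypothetical: it is the four-vertex graph with edges 𝑢𝑣, 𝑣𝑤, 𝑤𝑥, 𝑥𝑢 with one extra edge 𝑢𝑤 added for illustration.

```python
edges = [("u", "v"), ("v", "w"), ("w", "x"), ("x", "u"), ("u", "w")]

degree = {}
for a, b in edges:
    # Each edge is incident with exactly two vertices.
    degree[a] = degree.get(a, 0) + 1
    degree[b] = degree.get(b, 0) + 1

print(degree)
print(sum(degree.values()), 2 * len(edges))  # 10 10
```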
Questions.
1. Is there a graph with five vertices with degrees 3, 3, 2, 2, 2? A6.34
2. Is there a graph with five vertices with degrees 4, 3, 3, 2, 1? A6.35
3. Is there a graph with five vertices with degrees 5, 4, 3, 2, 2? A6.36

6.5.3 Paths and cycles


A path of length 𝑡 is a graph whose vertices can be labelled 𝑣₀, 𝑣₁, … , 𝑣ₜ so that the edges of the
graph are exactly 𝑣₀𝑣₁, 𝑣₁𝑣₂, … , 𝑣ₜ₋₁𝑣ₜ. We say this is a path between 𝑣₀ and 𝑣ₜ.
Example. We picture below paths of length 1, 2, 3, 4, and 5.

We say that a graph is connected if there is a path between any two vertices.
Example. The graph on the left below is not connected because there is not a path
between 𝑣₂ and 𝑣₃ (for example). The graph on the right below is connected.
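
Connectivity can be tested by starting at any vertex and repeatedly moving along edges to see which vertices can be reached. The sketch below uses a breadth-first search, which is one standard way to do this. The function name and the two small example graphs are hypothetical (they are not the graphs pictured), and the vertex list is assumed to be non-empty.

```python
from collections import deque


def is_connected(vertices, edges):
    """True if every vertex can be reached from an arbitrary starting vertex."""
    neighbours = {v: set() for v in vertices}
    for a, b in edges:
        neighbours[a].add(b)
        neighbours[b].add(a)
    start = next(iter(vertices))
    seen = {start}
    queue = deque([start])
    while queue:
        v = queue.popleft()
        for w in neighbours[v] - seen:
            seen.add(w)
            queue.append(w)
    return seen == set(vertices)


vs = ["v1", "v2", "v3", "v4"]
print(is_connected(vs, [("v1", "v2"), ("v3", "v4")]))                # False
print(is_connected(vs, [("v1", "v2"), ("v2", "v3"), ("v3", "v4")]))  # True
```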

A cycle of length 𝑡 is a graph whose vertices can be labelled 𝑣1 , 𝑣2 , … , 𝑣𝑡 so that the edges of the
graph are exactly 𝑣1 𝑣2 , 𝑣2 𝑣3 , … , 𝑣𝑡−1 𝑣𝑡 , 𝑣𝑡 𝑣1 .
Example. We picture below cycles of length 3, 4, 5, and 6.

Questions.
1. How many edges must be removed from a path in order to make a graph that is not connected?
How about for a cycle? A6.37

6.5.4 Trees
A tree is a graph that is connected but contains no cycles.
Example. The graph on the left below is not a tree because it contains a cycle on vertices
𝑣1 , 𝑣2 , 𝑣4 , 𝑣6 . The graph in the centre below is not a tree because it is not connected.
The graph on the right below is a tree.

The following result can be proved by induction on the number of vertices in the graph.
A tree with 𝑛 vertices has exactly 𝑛 − 1 edges.
It turns out that any graph with 𝑛 vertices that has 𝑛 − 2 or fewer edges cannot be connected. This
means that trees are "minimal" connected graphs (with respect to their number of edges).
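The definition of a tree can be checked directly. The sketch below is one possible approach, not a method taken from the notes: it adds the edges one at a time and keeps track of which vertices are already joined (a union-find structure). An edge whose ends are already joined would close a cycle. The name `is_tree` is hypothetical, and the graph is assumed to have at least one vertex.

```python
def is_tree(vertices, edges):
    """Return True if the graph is connected and contains no cycles."""
    parent = {v: v for v in vertices}

    def find(v):
        # Follow parent pointers to the representative of v's component.
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    components = len(parent)
    for u, v in edges:
        ru, rv = find(u), find(v)
        if ru == rv:
            return False  # u and v were already joined, so this edge closes a cycle
        parent[ru] = rv
        components -= 1
    return components == 1  # exactly one component means connected

path = [(1, 2), (2, 3), (3, 4)]
cycle = [(1, 2), (2, 3), (3, 4), (4, 1)]
print(is_tree(range(1, 5), path), len(path))  # a tree on 4 vertices has 3 edges
print(is_tree(range(1, 5), cycle))
print(is_tree(range(1, 5), [(1, 2), (3, 4)]))
```

Each accepted edge reduces the number of components by one, which is another way to see why a tree with 𝑛 vertices ends up with exactly 𝑛 − 1 edges.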
A spanning tree of a graph 𝐺 is a tree in 𝐺 that contains every vertex of 𝐺.
Example. A graph (below left) and a spanning tree of that graph (below right). There
are many other spanning trees of the graph. Find some.

Obviously a graph that is not connected cannot have a spanning tree. On the other hand, every
connected graph has at least one spanning tree. Given a connected graph we can produce a spanning
tree by repeatedly finding a cycle in the graph and deleting one of its edges until no cycles remain.
The resulting graph will be connected and will not have any cycles, so it will be a tree.
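The deletion procedure just described can be sketched in Python. This version relies on one observation, stated here as an assumption of the sketch: in a connected graph, an edge lies on a cycle exactly when the graph is still connected after that edge is removed. The function names and the example graph are illustrative only.

```python
def connected(vertices, edges):
    """Check that every vertex can be reached from the first one."""
    vertices = list(vertices)
    neighbours = {v: set() for v in vertices}
    for u, v in edges:
        neighbours[u].add(v)
        neighbours[v].add(u)
    seen, stack = {vertices[0]}, [vertices[0]]
    while stack:
        for w in neighbours[stack.pop()]:
            if w not in seen:
                seen.add(w)
                stack.append(w)
    return len(seen) == len(vertices)

def spanning_tree(vertices, edges):
    """Delete edges that lie on cycles until no cycles remain."""
    vertices = list(vertices)
    if not connected(vertices, edges):
        raise ValueError("a graph that is not connected has no spanning tree")
    tree = list(edges)
    for e in list(edges):
        remaining = [f for f in tree if f != e]
        # e is on a cycle exactly when the graph stays connected without it.
        if connected(vertices, remaining):
            tree = remaining
    return tree

# Hypothetical example: a cycle of length 4 with one extra edge across it.
edges = [(1, 2), (2, 3), (3, 4), (4, 1), (1, 3)]
tree = spanning_tree(range(1, 5), edges)
print(tree, len(tree))
```

Which spanning tree comes out depends on the order in which the edges are considered, which matches the remark that a graph usually has many spanning trees.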
Questions.
1. Are some paths trees? Are all paths trees? A6.38
2. When we begin with a connected graph and repeatedly find a cycle in the graph and delete
one of its edges until no cycles remain, how do we know that the resulting graph is connected? A6.39
3. How many spanning trees does a cycle of length 6 have? A6.40

6.5.5 Adjacency matrices


One way of recording the information in a graph is with a matrix called an adjacency matrix.
If 𝐺 is a graph with vertices 𝑣1 , 𝑣2 , … , 𝑣𝑛 , then the adjacency matrix of 𝐺 is the 𝑛 × 𝑛 matrix 𝑀
such that the entry of 𝑀 in row 𝑖 and column 𝑗 is 1 if 𝑣𝑖 𝑣𝑗 is an edge of 𝐺 and is 0 if 𝑣𝑖 𝑣𝑗 is not an
edge of 𝐺.
Example. A graph and its adjacency matrix are given below.

\[
\begin{pmatrix}
0 & 1 & 1 & 0 & 0 \\
1 & 0 & 1 & 1 & 0 \\
1 & 1 & 0 & 0 & 0 \\
0 & 1 & 0 & 0 & 1 \\
0 & 0 & 0 & 1 & 0
\end{pmatrix}
\]
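The matrix in the example above can be rebuilt from a list of edges. In the sketch below the edge list is read off the 1 entries of that matrix, while the function name `adjacency_matrix` and the use of nested Python lists are assumptions made for illustration.

```python
def adjacency_matrix(n, edges):
    """Build the n x n adjacency matrix for vertices v1, ..., vn."""
    M = [[0] * n for _ in range(n)]
    for i, j in edges:
        # Edges have no direction, so the matrix is symmetric.
        M[i - 1][j - 1] = 1
        M[j - 1][i - 1] = 1
    return M

# Edges v1v2, v1v3, v2v3, v2v4, v4v5, read off the example matrix.
edges = [(1, 2), (1, 3), (2, 3), (2, 4), (4, 5)]
M = adjacency_matrix(5, edges)
for row in M:
    print(row)

# The sum of row i is the degree of vertex vi.
print([sum(row) for row in M])
```

The diagonal entries are all 0 because these graphs have no loops, and each row sum gives the degree of the corresponding vertex.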

This demonstration displays a variety of graphs and their adjacency matrices.


Questions.
1. Draw the graph whose adjacency matrix is given below. A6.41

\[
\begin{pmatrix}
0 & 1 & 1 & 0 & 0 \\
1 & 0 & 0 & 1 & 0 \\
1 & 0 & 0 & 1 & 1 \\
0 & 1 & 1 & 0 & 1 \\
0 & 0 & 1 & 1 & 0
\end{pmatrix}
\]

2. Find the adjacency matrix of a cycle with vertices 𝑣1 , 𝑣2 , … , 𝑣6 and edges 𝑣1 𝑣2 , 𝑣2 𝑣3 , 𝑣3 𝑣4 , 𝑣4 𝑣5 , 𝑣5 𝑣6 , 𝑣6 𝑣1 . A6.42

6.5.6 Famous problems


There are many standard problems about graphs that arise over and over again in various applications.
These include:
• Given two vertices in a graph, find the shortest path between them.
• Given a graph, find the smallest number of vertices that must be deleted to disconnect the
graph.
• Given a graph, find a subset of its edges such that every vertex of the graph is incident with
exactly one edge in the subset.
• Given a connected graph with weighted edges, find a spanning tree of the graph with the
minimum total weight.
• Given a graph, find the largest subset of its vertices that are all adjacent to each other.
Very fast, efficient algorithms are known for many of these problems.
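As an illustration of the first problem in the list, the sketch below finds a shortest path (fewest edges) between two vertices using breadth-first search, which is one standard approach for graphs without edge weights. The function name `shortest_path` and the example graph are assumptions made for this sketch.

```python
from collections import deque

def shortest_path(edges, start, goal):
    """Return a path with the fewest edges from start to goal, or None."""
    neighbours = {}
    for u, v in edges:
        neighbours.setdefault(u, set()).add(v)
        neighbours.setdefault(v, set()).add(u)
    previous = {start: None}  # how each reached vertex was first reached
    queue = deque([start])
    while queue:
        u = queue.popleft()
        if u == goal:
            # Walk back from the goal to the start.
            path = []
            while u is not None:
                path.append(u)
                u = previous[u]
            return path[::-1]
        for w in neighbours.get(u, ()):
            if w not in previous:
                previous[w] = u
                queue.append(w)
    return None  # no path exists

edges = [(1, 2), (1, 3), (2, 3), (2, 4), (4, 5)]
print(shortest_path(edges, 1, 5))
print(shortest_path(edges, 1, 9))  # vertex 9 is not joined to anything
```

Breadth-first search visits vertices in order of their distance from the start, so the first time the goal is reached the path found has the fewest possible edges.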

6.6 Answers
6.1
𝑥           3    4    5    6    7
Pr(𝑋 = 𝑥)   1/9  2/9  1/3  2/9  1/9

6.2 No. If we know that 𝑋 = 2 then 𝑌 is equally likely to be any digit, but if we know that
𝑋 = 3, then 𝑌 is certain to be a 0 or 1. So the value of 𝑋 has an effect on the likelihood that 𝑌
will take certain values and the variables are not independent.
6.3 No. Consider, for example, the probability that 𝑋 = 3 and 𝑌 = 0.
Pr(𝑋 = 3 and 𝑌 = 0) = 1/22 (because the set contains 22 numbers).
Pr(𝑋 = 3) = 2/22 = 1/11 (this happens when the number is 30 or 31).
Pr(𝑌 = 0) = 3/22 (this happens when the number is 10, 20 or 30).
It's now easy to check that Pr(𝑋 = 3 and 𝑌 = 0) = 1/22 ≠ 3/242 = Pr(𝑋 = 3) Pr(𝑌 = 0).
6.4 You should disagree. It's very possible for a random variable to have no chance of taking its
expected value. For example, you will never roll 3.5 on a standard die.
6.5 By writing down all the possible sequences of heads and tails (or by other methods) we can
see that the probability distribution of 𝑋 is

𝑥           0    1    2    3
Pr(𝑋 = 𝑥)   1/8  3/8  3/8  1/8

So E[𝑋] = 1/8 × 0 + 3/8 × 1 + 3/8 × 2 + 1/8 × 3 = 3/2.
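The "write down all the sequences" method can be carried out by machine. The sketch below assumes, as the distribution in this answer suggests, that 𝑋 counts the heads in three flips of a fair coin; the variable names are illustrative.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# All 8 equally likely sequences of three flips (assumed fair coin).
outcomes = list(product("HT", repeat=3))

# Count how many sequences give each number of heads.
counts = Counter(seq.count("H") for seq in outcomes)
distribution = {x: Fraction(c, len(outcomes)) for x, c in sorted(counts.items())}

# Expected value: sum of value times probability.
expected = sum(x * p for x, p in distribution.items())

print(distribution)
print(expected)
```

Using `Fraction` keeps the probabilities exact, so the result can be compared directly with the fractions in the table.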
6.6 Yes. Let the expected value of a number from the box be 𝑧. Then, using linearity of
expectation, the expected value of the number you produce is E[𝑋] = 𝑧 − 𝑧 = 0.
6.7 Yes. It's not hard to see that the expected value of the number of heads occurring from a
single coin flip is 1/2 × 0 + 1/2 × 1 = 1/2. So, using linearity of expectation, E[𝑋] = 3 × 1/2 = 3/2.
6.8 The expected value of 𝑋 is
E[𝑋] = 1/6 × 0 + 1/2 × 2 + 1/3 × 6 = 3.
So, the variance of 𝑋 is
Var[𝑋] = 1/6 × (0 − 3)² + 1/2 × (2 − 3)² + 1/3 × (6 − 3)² = 5.
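The same two calculations can be written as short functions. This is a minimal sketch using the distribution from this answer; the function names `expectation` and `variance` are assumptions.

```python
from fractions import Fraction

# Values of X and their probabilities, taken from the answer above.
distribution = {0: Fraction(1, 6), 2: Fraction(1, 2), 6: Fraction(1, 3)}

def expectation(dist):
    """E[X]: sum of value times probability."""
    return sum(x * p for x, p in dist.items())

def variance(dist):
    """Var[X]: expected squared distance from E[X]."""
    mu = expectation(dist)
    return sum(p * (x - mu) ** 2 for x, p in dist.items())

print(expectation(distribution))
print(variance(distribution))
```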
6.9 𝑌 by a long way. By the law of large numbers, 𝑋 will very likely be quite close to E[𝑋]. On
the other hand 𝑌 could very well be far from E[𝑌 ]. This will mean that 𝑌 has greater variance.
6.10 Because 𝑋 and 𝑌 are independent, the probability of this event can be written as a product.
Pr(𝑋 = 0 and 𝑌 = 1) = Pr(𝑋 = 0) Pr(𝑌 = 1) = 1/4 × 1/2 = 1/8

6.11

E[𝑋] = 1/4 × 0 + 3/4 × 1 = 3/4

E[𝑌 ] = 1/2 × 0 + 1/2 × 1 = 1/2

6.12
Pr(𝑋 = E[𝑋]) = Pr(𝑋 = 3/4) = 0

6.13 There are two ways the random variable 𝑋 + 𝑌 can take a value of 1. Namely, when 𝑋 = 0
and 𝑌 = 1, or when 𝑋 = 1 and 𝑌 = 0. These two possibilities are mutually exclusive events.
So the probability of one or the other happening is the sum of their individual probabilities.
Pr(𝑋 + 𝑌 = 1) = Pr(𝑋 = 0 and 𝑌 = 1) + Pr(𝑋 = 1 and 𝑌 = 0)
Because 𝑋 and 𝑌 are independent, the above probabilities can be written as products.

Pr(𝑋 + 𝑌 = 1) = Pr(𝑋 = 0) Pr(𝑌 = 1) + Pr(𝑋 = 1) Pr(𝑌 = 0)
= 1/4 × 1/2 + 3/4 × 1/2
= 1/2

In the same way, there are two ways in which 𝑋 can be equal to 𝑌.

Pr(𝑋 = 𝑌 ) = Pr(𝑋 = 0 and 𝑌 = 0) + Pr(𝑋 = 1 and 𝑌 = 1)
= Pr(𝑋 = 0) Pr(𝑌 = 0) + Pr(𝑋 = 1) Pr(𝑌 = 1)
= 1/4 × 1/2 + 3/4 × 1/2
= 1/2

6.14 We have already calculated Pr(𝑈 = 1) = 1/2. The other probabilities are calculated in the
same way.

𝑢           0    1    2
Pr(𝑈 = 𝑢)   1/8  1/2  3/8

𝑣           −1   0    1
Pr(𝑉 = 𝑣)   1/8  1/2  3/8

6.15 Both 𝑈 and 𝑉 depend on 𝑋 and 𝑌. So a change in the value of 𝑈 can affect the value of 𝑉
and vice-versa. We would not expect them to be independent.
6.16 Consider the event that both 𝑈 = 0 and 𝑉 = −1. This is the same as the event that
𝑋 + 𝑌 = 0 and 𝑋 − 𝑌 = −1. But these events cannot happen simultaneously. That is, if
𝑋 + 𝑌 = 0, then 𝑋 = 0 and 𝑌 = 0, so 𝑋 − 𝑌 = 0 ≠ −1. These events are mutually exclusive.
Pr(𝑈 = 0 and 𝑉 = −1) = 0
However,
Pr(𝑈 = 0) Pr(𝑉 = −1) = 1/8 × 1/8 = 1/64.

So 𝑈 and 𝑉 are not independent.
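Answers 6.13 to 6.16 can be cross-checked by listing every combination of values of 𝑋 and 𝑌. The sketch below assumes what those answers use: 𝑋 and 𝑌 are independent, Pr(𝑋 = 1) = 3/4, Pr(𝑌 = 1) = 1/2, 𝑈 = 𝑋 + 𝑌 and 𝑉 = 𝑋 − 𝑌. The variable names are illustrative.

```python
from collections import defaultdict
from fractions import Fraction

pX = {0: Fraction(1, 4), 1: Fraction(3, 4)}
pY = {0: Fraction(1, 2), 1: Fraction(1, 2)}

# Joint distribution of (U, V), built from the independent X and Y.
joint = defaultdict(Fraction)
for x, px in pX.items():
    for y, py in pY.items():
        joint[(x + y, x - y)] += px * py

# Distributions of U and of V on their own.
pU, pV = defaultdict(Fraction), defaultdict(Fraction)
for (u, v), p in joint.items():
    pU[u] += p
    pV[v] += p
pU, pV = dict(pU), dict(pV)

# Independence would need every joint probability to equal the product.
independent = all(
    joint.get((u, v), 0) == pU[u] * pV[v] for u in pU for v in pV
)

print(pU)
print(pV)
print(joint.get((0, -1), 0), pU[0] * pV[-1], independent)
```

The last line prints the two quantities compared in answer 6.16, followed by the overall verdict on independence.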


6.17

Pr(𝑋 ≥ 6) = Pr(𝑋 = 7) + Pr(𝑋 = 8)
= 1/8 + 1/4
= 3/8

6.18

E[𝑋] = 1/6 × 0 + 1/3 × 3 + 1/8 × 5 + 1/8 × 7 + 1/4 × 8
= 9/2

6.19
E[𝑋]/6 = 9/2 × 1/6 = 3/4
Pr(𝑋 ≥ 6) = 3/8 ≤ 3/4 = E[𝑋]/6
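The comparison in 6.17 to 6.19 is an instance of the bound Pr(𝑋 ≥ 𝑎) ≤ E[𝑋]/𝑎 for a random variable that is never negative, usually called Markov's inequality. The sketch below recomputes both sides from the distribution used in 6.18; the variable names are assumptions.

```python
from fractions import Fraction

# Values of X and their probabilities, as used in answer 6.18.
distribution = {
    0: Fraction(1, 6),
    3: Fraction(1, 3),
    5: Fraction(1, 8),
    7: Fraction(1, 8),
    8: Fraction(1, 4),
}

a = 6
expected = sum(x * p for x, p in distribution.items())
tail = sum(p for x, p in distribution.items() if x >= a)  # Pr(X >= a)
bound = expected / a                                      # E[X] / a

print(expected, tail, bound, tail <= bound)
```

The bound is far from tight here: the true probability is half of what the inequality allows.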

6.20 Let 𝑋 be the number of failures before the packet is successfully received. Then 𝑋 has a
geometric distribution with 𝑝 = 0.95. So the probability that the packet is received within the
first three attempts is

Pr(𝑋 = 0) + Pr(𝑋 = 1) + Pr(𝑋 = 2) = (0.95) + (0.95)(0.05) + (0.95)(0.05)²
= 0.999875.
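The sum above can be checked numerically, and also against a shortcut: the packet fails to arrive within three attempts only if all three attempts fail. A minimal sketch, with variable names chosen for illustration:

```python
p = 0.95  # probability that a single attempt succeeds

# Pr(X = k) = p (1 - p)^k, where X is the number of failures before the first success.
within_three = sum(p * (1 - p) ** k for k in range(3))

# Complement: all three attempts fail with probability (1 - p)^3.
closed_form = 1 - (1 - p) ** 3

print(within_three, closed_form)
```

Both values agree with 0.999875 up to floating-point rounding.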

6.21 Let 𝑋 be the number of faulty components. Then 𝑋 has a binomial distribution with
𝑝 = 0.02 and 𝑛 = 20. The probability that zero or one components are faulty is

\[
\Pr(X = 0) + \Pr(X = 1) = \binom{20}{0}(0.02)^{0}(0.98)^{20} + \binom{20}{1}(0.02)^{1}(0.98)^{19} \approx 0.9401.
\]

So the probability that at least two components are faulty is approximately 1 − 0.9401 = 0.0599.
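The binomial probabilities in this answer can be evaluated with Python's standard library (`math.comb` needs Python 3.8 or later). The function name `binomial_pmf` is an assumption of this sketch.

```python
from math import comb

def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

n, p = 20, 0.02  # 20 components, each faulty with probability 0.02

at_most_one = binomial_pmf(0, n, p) + binomial_pmf(1, n, p)
at_least_two = 1 - at_most_one

print(round(at_most_one, 4), round(at_least_two, 4))
```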
6.22 Let 𝑋 be the number of times the machine needs adjustment on the day. Then 𝑋 has a
Poisson distribution with 𝜆 = 3. The probability that the machine does not need adjusting on
the day is
Pr(𝑋 = 0) = 3⁰𝑒⁻³/0! = 𝑒⁻³.

6.23 Informally, 1.5/4 = 3/8 of the hallway is in the sun and so the probability is 3/8. The same answer
can be found formally as
\[
\int_{0}^{1} \tfrac{1}{4}\,dx + \int_{3.5}^{4} \tfrac{1}{4}\,dx = \tfrac{1}{4} + \tfrac{1}{8} = \tfrac{3}{8}.
\]

6.24 Using the calculator we see that the probability that the ruler's length is between 29.9cm
and 30.1cm is approximately 0.9991. So the probability that the length of the ruler is off by
0.1cm or more is approximately 1 − 0.9991 = 0.0009.
6.25 The 𝑡-statistic of the sample is
\[
\frac{28000 - 30000}{5000/\sqrt{15}} \approx -1.549.
\]
Using the calculator with 𝜈 = 15 − 1 = 14 we see that the probability of the 𝑡-statistic being
less than this value is approximately 0.0718 = 7.18%. Nothing is certain, but the claim that the
mean lifetime is 30000 or more looks a little bit unlikely.
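For readers without the calculator, the numbers in 6.25 can be approximated with the standard library alone. The sketch below is an assumption-laden illustration rather than the method used in the notes: it reads 28000 as the sample mean, 30000 as the claimed mean, 5000 as the sample standard deviation and 15 as the sample size, and it integrates the 𝑡 density numerically with Simpson's rule. The names `t_pdf` and `t_cdf` are hypothetical helpers.

```python
from math import gamma, pi, sqrt

def t_pdf(x, nu):
    """Density of the t distribution with nu degrees of freedom."""
    c = gamma((nu + 1) / 2) / (sqrt(nu * pi) * gamma(nu / 2))
    return c * (1 + x * x / nu) ** (-(nu + 1) / 2)

def t_cdf(t, nu, steps=2000):
    """Approximate Pr(T <= t) using symmetry about 0 and Simpson's rule on [0, |t|]."""
    a = abs(t)
    h = a / steps  # steps must be even for Simpson's rule
    total = t_pdf(0, nu) + t_pdf(a, nu)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, nu)
    area = total * h / 3
    return 0.5 + area if t >= 0 else 0.5 - area

sample_mean, claimed_mean, sample_sd, n = 28000, 30000, 5000, 15
t_stat = (sample_mean - claimed_mean) / (sample_sd / sqrt(n))
p_value = t_cdf(t_stat, n - 1)

print(round(t_stat, 3), round(p_value, 4))
```

A statistics library would normally be used for the last step; the hand-written integration is only there to keep the example self-contained.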
6.26 We can let 𝑋 denote a Poisson random variable with 𝜆 = 8.
Pr(𝑋 = 0) = 8⁰𝑒⁻⁸/0! = 𝑒⁻⁸ ≈ 0.0003

6.27 Note that Pr(𝑋 = 0) + Pr(𝑋 ≥ 1) = 1 since these two events are mutually exclusive and
cover the whole sample space.
Pr(𝑋 ≥ 1) = 1 − Pr(𝑋 = 0) = 1 − 𝑒⁻⁸ ≈ 0.9997
6.28
Pr(𝑋 = 8) = 8⁸𝑒⁻⁸/8! ≈ 0.14
6.29 We can answer this question by considering two independent random variables 𝑋 and 𝑌,
both of which have a Poisson distribution with 𝜆 = 8.
Pr(𝑋 = 8 and 𝑌 = 8) = Pr(𝑋 = 8) Pr(𝑌 = 8) = 8⁸𝑒⁻⁸/8! × 8⁸𝑒⁻⁸/8! ≈ 0.019
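The four Poisson calculations in 6.26 to 6.29 all use the same formula, Pr(𝑋 = 𝑘) = 𝜆ᵏ𝑒^(−𝜆)/𝑘!, so they can be checked together. A minimal sketch, with the function name `poisson_pmf` and the variable names chosen for illustration:

```python
from math import exp, factorial

def poisson_pmf(k, lam):
    """Pr(X = k) for a Poisson random variable with parameter lam."""
    return lam ** k * exp(-lam) / factorial(k)

lam = 8

p_none = poisson_pmf(0, lam)     # 6.26
p_at_least_one = 1 - p_none      # 6.27
p_eight = poisson_pmf(8, lam)    # 6.28
p_both_eight = p_eight ** 2      # 6.29, two independent copies

print(round(p_none, 4), round(p_at_least_one, 4))
print(round(p_eight, 2), round(p_both_eight, 3))
```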
6.30 This is best described by a geometric distribution with parameter 𝑝 = 0.06.
6.31 Let 𝑋 be a random variable with a geometric distribution with parameter 𝑝 = 0.06. We are
interested in finding the probability of exactly 2 failures before the first success.
Pr(𝑋 = 2) = 0.06(1 − 0.06)² ≈ 0.053.

6.32 Here are two possible answers.


6.33 $\binom{n}{2} = n(n-1)/2$. The graph can have at most one edge between each pair of vertices, and there are $\binom{n}{2}$
such pairs.

6.34 Yes. Here is one example.


6.35 No. These degrees sum to an odd number, but we know the degrees in any graph sum to an
even number.
6.36 No. A vertex in a graph on five vertices can have degree at most 4 because there are only 4
other vertices it can have an edge to.
6.37 One for a path. Two for a cycle.
6.38 All paths are trees. They are connected and contain no cycles.
6.39 It's important that the deleted edges are in cycles. How does this help to ensure that the
graph remains connected?
6.40 6. Deleting any one of the 6 edges results in a spanning tree.

6.41
6.42

\[
\begin{pmatrix}
0 & 1 & 0 & 0 & 0 & 1 \\
1 & 0 & 1 & 0 & 0 & 0 \\
0 & 1 & 0 & 1 & 0 & 0 \\
0 & 0 & 1 & 0 & 1 & 0 \\
0 & 0 & 0 & 1 & 0 & 1 \\
1 & 0 & 0 & 0 & 1 & 0
\end{pmatrix}
\]
