All MAT9004 Content (Derivative)
Daniel Horsley
1 COPYRIGHT WARNING This ebook is protected by copyright. For use within Monash University
only. NOT FOR RESALE Do not remove this notice.
Disclaimer: https://www.monash.edu/disclaimer-copyright
Contents
2 Calculus
2.1 Differentiation
2.1.1 Tangents
2.1.2 Derivatives
2.1.3 Finding tangents
2.1.4 Higher derivatives
2.1.5 Warning
2.1.6 Notation
2.2 Differentiation: Activity 1
2.3 Optimisation
2.3.1 Stationary points revisited
2.3.2 Second derivative test
2.3.3 Optimisation
2.3.4 Convexity and concavity revisited
2.4 Optimisation: Activity
2.5 Integration
2.5.1 Integration
2.5.2 Finding antiderivatives
2.6 Integration: Activity
2.7 Answers
3 Linear Algebra
3.1 Vectors and matrices
3.1.1 Matrices
3.1.2 Addition and Scalar Multiplication of Matrices
3.1.3 Matrix Multiplication
3.1.4 Vectors
3.1.5 The Dot Product
3.1.6 Geometric Interpretation of Vectors
3.2 Vectors and matrices: Activity
3.3 Solving linear systems
3.3.1 Linear Systems
3.3.2 Gaussian Elimination
3.3.3 Matrix Inverses
3.4 Solving linear systems: Activity
3.5 Eigenvalues and eigenvectors
3.5.1 Eigenvalues and Eigenvectors
3.5.2 Calculating Eigenvalues and Eigenvectors
3.6 Eigenvalues and eigenvectors: Activity
3.7 Answers
4 Multivariate Functions
4.1 Relations
4.1.1 Relations
4.1.2 Circles
4.1.3 Ellipses
4.2 Relations: Activity
4.3 Functions of several variables
4.3.1 Functions of several variables
4.3.2 Linear functions of two variables
4.3.3 Zeroes and stationary points
4.3.4 Level sets and contour plots
4.4 Functions of several variables: Graphing activity
4.4.1 Plots of relations
4.4.2 Plots of functions of two variables
4.5 Functions of several variables: Activity
4.6 Partial derivatives
4.6.1 Partial derivatives
4.6.2 The gradient vector
4.6.3 Locating stationary points
4.6.4 Classifying stationary points
4.6.5 Notation
4.7 Partial derivatives: Activity
4.8 Answers
5.3.2 Events
5.3.3 Independent events
5.3.4 Warning
5.4 Probability: Activity
5.5 Conditional probability
5.5.1 Conditional probability
5.5.2 Independence again
5.5.3 Independent repeated trials
5.5.4 Bayes' theorem
5.6 Conditional probability: Activity
5.7 Answers
Chapter 1
1.1.3 Sigma notation
The symbol ∑ is often used to denote a sum.
If 𝑎 and 𝑏 are integers with 𝑎 ≤ 𝑏 and 𝑓 is a function, then
$$\sum_{x=a}^{b} f(x) = f(a) + f(a+1) + \cdots + f(b).$$
(See the next section for a precise definition of a function; for the moment the following examples
should give you the idea.)
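Example. $\sum_{x=3}^{6} \frac{1}{x} = \frac{1}{3} + \frac{1}{4} + \frac{1}{5} + \frac{1}{6}$.
For more complicated sums, the notation $\sum_{x \in S} f(x)$ is used to mean the sum of $f(x)$ over all the $x$
in the set $S$.
Example. $\sum_{x \in \{2,6,7\}} \frac{1}{x} = \frac{1}{2} + \frac{1}{6} + \frac{1}{7}$.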
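As a rough illustration only (this unit does not assume any programming), the two forms of sigma notation above can be mirrored in a few lines of Python. The helper names `sigma` and `sigma_over_set` are made up for this sketch, and exact fractions are used so that the results match the hand calculations.

```python
from fractions import Fraction


def sigma(f, a, b):
    """Sum f(x) for every integer x with a <= x <= b."""
    return sum(f(x) for x in range(a, b + 1))


def sigma_over_set(f, s):
    """Sum f(x) over all the x in the set s."""
    return sum(f(x) for x in s)


def reciprocal(x):
    # Exact fraction 1/x rather than a rounded decimal.
    return Fraction(1, x)


total_range = sigma(reciprocal, 3, 6)                # 1/3 + 1/4 + 1/5 + 1/6
total_set = sigma_over_set(reciprocal, {2, 6, 7})    # 1/2 + 1/6 + 1/7
print(total_range, total_set)
```

Note that `range(a, b + 1)` is needed because Python ranges stop one short of their upper end, whereas the sum in sigma notation includes the term for $x = b$.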
Questions.
1. How would you write $\sum_{x=2}^{4} \frac{x}{x^2+1}$ without using sigma notation? A1.1
2. Write $\frac{1}{4} + \frac{1}{6} + \frac{1}{8} + \frac{1}{10}$ using sigma notation. A1.2
Again, the notation $\prod_{x \in S} f(x)$ is used to mean the product of $f(x)$ over all the $x$ in the set $S$.
Questions.
1. How would you write $\prod_{x=10}^{12} (3x + 1)$ without using product notation? A1.3
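Product notation can be sketched in the same way as sigma notation. The following Python fragment is only an illustration: the helper name `product_over` and the example function $f(x) = x + 1$ are assumptions chosen for this sketch and do not come from the unit.

```python
import math


def product_over(f, xs):
    """Multiply f(x) over all the x in the collection xs."""
    return math.prod(f(x) for x in xs)


# Product of (x + 1) over the set {2, 3, 5}, that is 3 * 4 * 6.
example = product_over(lambda x: x + 1, {2, 3, 5})
print(example)
```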
The functions we'll be concerned with initially are real functions of a single real variable. This just
means functions that accept a real number as an input and output a real number (or, more formally,
whose domain and range are both subsets of ℝ). More specifically:
• almost all the functions we'll consider will be continuous (meaning that their plots can be drawn
without lifting pen from paper);
• most of the functions we consider will be smooth (meaning that not only is the function
continuous but also its plot is a nice smooth curve).
We'll be able to give a more precise definition of smooth later.
Questions. Answer the following for the plot of a function $f(x)$ shown below.
1. Estimate $f(-1)$, $f(1)$ and $f(3)$. A1.4
2. When $f(x) = \frac{1}{2}$, what (roughly) might $x$ be? A1.5
3. What do you think the most important points on the curve are? A1.6
4. For what values of $x$ is the function increasing and for what values is it decreasing? A1.7
So that we can be lazy sometimes, when we don't give a domain for a function we'll assume that the
domain is the set of all real numbers for which the mathematical expression is a well-defined real
number. For example, if we just say a function is given by $f(x) = \sqrt{x}$, then we'll assume the domain
is the set of all real numbers greater than or equal to 0.
• If it is at the top of a "rise" it is called a local maximum. The function in the plot above has a
local maximum at 4.
• It's also possible for a stationary point to be neither of these things. For example the function
$f(x) = x^3 + \frac{1}{2}$ shown below has a stationary point at 0 which is neither a local minimum nor a
local maximum. These kinds of stationary points are inflection points.
There's a nice visualisation of zeroes and stationary points of a function here. (First download the
Wolfram CDF player - we'll be linking to lots of these.)
Questions. Can you come up with a smooth real-valued function with domain $\mathbb{R}$ such that:
1. the function has a zero but no stationary points? A1.8
2. the function has a stationary point but no zeroes? A1.9
3. the function has no zeroes and no stationary points? A1.10
Example. The function $f(x) = 3x + 5$ with domain $\mathbb{R}$ has inverse function $f^{-1}(y) = \frac{y-5}{3}$
with domain $\mathbb{R}$. We can check this is true as follows:
$$f^{-1}(f(x)) = f^{-1}(3x + 5) = \frac{(3x+5)-5}{3} = \frac{3x}{3} = x \quad \text{for all } x \text{ in } \mathbb{R}.$$
To find an inverse function, we can often solve the equation $y = f(x)$ for $x$. For example,
to find the inverse function we gave above, we would calculate:
$$y = 3x + 5$$
$$y - 5 = 3x$$
$$\frac{y-5}{3} = x.$$
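If you like, a check of this kind can also be carried out by machine. The sketch below (in Python, which is not part of this unit) applies $f$ and then the candidate inverse to a handful of sample inputs and confirms that we get back what we started with. The function names and the sample values are assumptions made for the illustration, and passing a finite set of checks is evidence rather than a proof.

```python
from fractions import Fraction


def f(x):
    return 3 * x + 5


def f_inverse(y):
    # Obtained by solving y = 3x + 5 for x.
    return (Fraction(y) - 5) / 3


# A few arbitrary sample inputs, kept as exact fractions.
samples = [Fraction(-2), Fraction(0), Fraction(1, 2), Fraction(7)]

# Undoing f with f_inverse (and the other way round) should return the input.
round_trips = [f_inverse(f(x)) == x and f(f_inverse(x)) == x for x in samples]
print(all(round_trips))
```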
We can use inverse functions to undo the action of the original function as in the following.
Example. Suppose that $f$ is a function with domain $[2, 6]$ and it is known that $f^{-1}(x) = x^2 + 2$ for $x \in [0, 2]$.
Question: find a number $x$ such that $f(x) = 1.5$.
Solution: Applying $f^{-1}$ to both sides of the equation, we have $f^{-1}(f(x)) = f^{-1}(1.5) = 4.25$.
But $f^{-1}(f(x)) = x$. So $x = 4.25$.
Not all functions have inverse functions, however.
Example. The function $f(x) = x^2$ with domain $\mathbb{R}$ does not have an inverse function.
Because $f(2) = 4$ and $f(-2) = 4$, if an inverse function existed $f^{-1}(4)$ would have to
equal both 2 and −2.
To avoid this problem we need a function with the following property.
A function $f$ is one-to-one if, for all distinct $x_1$ and $x_2$ in its domain, $f(x_1) \neq f(x_2)$.
This property tells us exactly which functions have inverses.
A function has an inverse if and only if it is one-to-one.
Sometimes if a function is not one-to-one, we can restrict its domain to make a new function that is
one-to-one and has an inverse.
Example. We saw that the function $f(x) = x^2$ with domain $\mathbb{R}$ does not have an inverse
function. But the function $g(x) = x^2$ with domain $[0, \infty)$ is one-to-one and does have an
inverse function: $g^{-1}(y) = \sqrt{y}$.
Questions. Do the following functions have inverse functions? If so, what are they?
1.3 Functions: Graphing activity
Throughout this unit, it will be helpful to be able to see the plots of various functions. This will help
you to identify important features of functions such as their zeroes and stationary points. There are
many places online which allow you to quickly produce plots of functions. In this activity, we will
explain how to use one particular website, Wolfram Alpha, to produce these plots. However, feel free
to use other resources to produce plots if you wish.
First, follow this link to the Wolfram Alpha home page. You should see a screen like this.
From here, type in the rule for the function you would like to plot and click on the equals sign on the
right hand side.
• To write a power, use a circumflex (^). For example, to write $x^4$, you would type in x^4.
• To write a product, use an asterisk (*). For example, to write $xe^x$, you would type in x*e^x.
You don't always need to use an asterisk, but it may be useful sometimes.
• The forward slash (/) is used for division. For example, to write $\frac{x}{4}$, you would type in x/4.
• You may also need to use brackets when entering some functions, even when you don't usually
write brackets. For example, to write $e^{3x}$ you would type e^(3x).
Type in the rule for a common function, such as a polynomial. Below we have entered in the rule
$x^3 + 5x^2$.
After you have clicked on the equals sign on the right hand side, you will see a plot of your chosen
function.
If you scroll down, you will see further details about this function. For now, the details you will be
most interested in are the zeroes (these are called "roots" on Wolfram Alpha) and the stationary
points.
Under the local maximum section, we are told that $\max\{x^3 + 5x^2\} = \frac{500}{27}$ at $x = -\frac{10}{3}$. This means that
You can ask Wolfram Alpha to produce a plot over just one part of the domain. For example, we may
like to plot the same function just on the interval [2, 5]. Below is one way of producing such a plot.
You can also view the plots of several functions on the same pair of axes. For example, the plots of
two functions, $f(x) = x^3 + 5x^2$ and $g(x) = e^{-2x}$, are shown below on the interval $[-1, 1]$.
1.4 Functions: Activity 1
Question 1: Consider the plot below of the function $f(x) = x^3 - 3x^2$.
(a) By looking at the plot, find the zeroes of $f$. Check that these values are actually zeroes by setting
$f(x) = 0$ and solving $0 = x^3 - 3x^2$. A1.15
(b) Evaluate $f(2)$, then evaluate both $f(1.9)$ and $f(2.1)$. What type of point is $x = 2$? What type
of point is $x = 0$? A1.16
(c) Find two distinct numbers $x_1$ and $x_2$ such that $f(x_1) = f(x_2)$. What does this tell you about the
function? A1.17
(d) Is $f$ convex, concave, or neither? A1.18
(e) Restrict the domain of $f$ to find a new function $g$ with the same rule ($g(x) = x^3 - 3x^2$) so that $g$
is one-to-one. You do not need to find the inverse of $g$. A1.19
Question 2: Consider the plot below of the function $f(x) = 2e^{\frac{1}{3}x}$.
(a) Is $f$ convex, concave, or neither? A1.20
(b) Find the inverse function $f^{-1}$ and give its domain. Why is it possible to find the inverse
function? A1.21
(c) Is $f^{-1}$ convex, concave, or neither? A1.22
(d) Do $f$ or $f^{-1}$ have any zeroes? If so, find them. A1.23
(e) Do $f$ or $f^{-1}$ have any stationary points? If so, what are they? A1.24
Question 3: Suppose that $f$ is a polynomial function such that $f(x)$ is positive only when $1 < x < 3$.
(a) What can you say about the number of zeroes that $f$ has? A1.25
(b) What can you say about the number of stationary points that $f$ has between $x = 1$ and
$x = 3$? A1.26
Obviously, linear functions are so-called because their plots form straight lines. The line corresponding
to 𝑓(𝑥) = 𝑚𝑥 + 𝑏 intercepts the vertical axis at 𝑏, because 𝑓(0) = 𝑏. Also, the line has slope 𝑚,
meaning that an increase of 𝑑 in the horizontal co-ordinate corresponds to an increase of 𝑑𝑚 in the
vertical co-ordinate.
Example. The function $f(x) = \frac{1}{2}x - 1$ is a linear function. Its plot is given below. Note
that the line intercepts the vertical axis at −1 and has slope $\frac{1}{2}$.
Linear functions (that are not constant functions with slope 𝑚 = 0) have one zero and no stationary
points.
Questions.
1. $f$ is a linear function such that $f(0) = 2$ and $f(\frac{5}{2}) = 3$. Give a formula for $f(x)$. A1.27
2. A linear function has a negative $y$ intercept and a positive zero. What can you say about its
slope? A1.28
It can be proven that a polynomial function of degree 𝑛 has at most 𝑛 zeroes and at most 𝑛 − 1
stationary points. So, given their degrees, the functions in the example above have the greatest
number of zeroes and stationary points possible. For many functions this is not the case, however.
For example the cubic function $f(x) = \frac{1}{4}x^3 + \frac{1}{2}x^2 + \frac{1}{2}x$, plotted below, has one zero and no stationary
points.
Demonstration: The number of distinct real roots of a real polynomial
Demonstration: Where are my roots?
Questions.
1. What is the degree of the polynomial function $f(x) = (x - 2)(x - 1)(x + 1)$? A1.29
2. Using your answer to 1, what can you say about the zeroes of $f(x) = (x - 2)(x - 1)(x + 1)$? A1.30
Example. The function $f(x) = 2^x$ is an exponential function. Its plot is given below.
This is typical of exponential functions with bases greater than 1. For bases less than 1, exponential
functions take on a different appearance.
Example. The function $f(x) = (\frac{1}{2})^x$ is an exponential function. Its plot is given below.
There are a number of useful laws associated with exponential functions. For positive real numbers 𝑎
and 𝑏 and any real numbers 𝑥 and 𝑦:
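$$a^0 = 1 \qquad a^{-x} = \frac{1}{a^x} \qquad a^{\frac{1}{x}} = \sqrt[x]{a}$$
$$a^x a^y = a^{x+y} \qquad \frac{a^x}{a^y} = a^{x-y} \qquad (a^x)^y = a^{xy}$$
$$(ab)^x = a^x b^x \qquad \left(\frac{a}{b}\right)^x = \frac{a^x}{b^x}$$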
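One way to build some confidence in these laws is to test them numerically. The Python sketch below does this for one arbitrary choice of $a$, $b$, $x$ and $y$ (the particular values are assumptions picked for illustration). Because computers store real numbers only approximately, the two sides of each law are compared with `math.isclose` rather than with exact equality, and of course a single numerical check is not a proof.

```python
import math

a, b = 2.0, 5.0     # arbitrary positive bases
x, y = 1.5, -0.7    # arbitrary real exponents (x is non-zero)

checks = {
    "a^0 = 1": math.isclose(a ** 0, 1.0),
    "a^(-x) = 1/a^x": math.isclose(a ** -x, 1 / a ** x),
    # The x-th root of a, raised to the power x, should give back a.
    "a^(1/x) is the x-th root of a": math.isclose((a ** (1 / x)) ** x, a),
    "a^x a^y = a^(x+y)": math.isclose(a ** x * a ** y, a ** (x + y)),
    "a^x / a^y = a^(x-y)": math.isclose(a ** x / a ** y, a ** (x - y)),
    "(a^x)^y = a^(xy)": math.isclose((a ** x) ** y, a ** (x * y)),
    "(ab)^x = a^x b^x": math.isclose((a * b) ** x, a ** x * b ** x),
    "(a/b)^x = a^x / b^x": math.isclose((a / b) ** x, a ** x / b ** x),
}

for law, holds in checks.items():
    print(law, holds)

all_hold = all(checks.values())
```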
One very important base for exponential functions is the mathematical constant 𝑒 which is an
irrational real number approximately equal to 2.71828. We'll see some reasons why 𝑒 is special later.
The function $f(x) = e^x$ is sometimes written as $\exp(x)$.
Exponential functions have no stationary points. They also have no zeroes, because $a^x$ is always
positive. Every exponential function intersects the vertical axis at 1, because $a^0 = 1$ for any positive real
number $a$. Notice that exponential functions are one-to-one. This means that they have inverse
functions - this fact will be important in the next section.
Demonstration: Graphs of exponential functions
Questions.
1. What can you say about $a^x$ when $0 < a < 1$ and $x < 0$? A1.31
2. The amount of radiation a sample emits on a given day is always three quarters of the amount
it emitted the day before. What fraction of the radiation that it emitted on day 0 does it emit
on day 2? On day 20? A1.32
You can see that this is the reflection of the plot of $2^x$ above.
There are also useful laws associated with logarithms. These can be obtained from the exponential
laws we saw above. For positive real numbers 𝑎, 𝑥, 𝑦 and 𝑏:
$$\log_a(xy) = \log_a(x) + \log_a(y) \qquad \log_a\left(\frac{x}{y}\right) = \log_a(x) - \log_a(y)$$
$$\log_a(x) = \frac{\log_b(x)}{\log_b(a)}$$
Logarithmic functions have no stationary points. Every logarithmic function has exactly one zero (at
1).
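The last of the logarithm laws above (the change of base law) is what lets a calculator or computer find a logarithm to any base using only the natural logarithm. As a hedged sketch, the Python below uses it to solve an equation of the form $a^x = c$. The helper names and the example equation $2^x = 10$ are assumptions chosen for this illustration, and the base $a$ must not be 1 (otherwise we would divide by $\ln(1) = 0$).

```python
import math


def log_base(a, x):
    """log_a(x), computed with the change of base law and natural logarithms."""
    return math.log(x) / math.log(a)


def solve_exponential(a, c):
    """Solve a**x == c for x, assuming a > 0, a != 1 and c > 0."""
    return log_base(a, c)


x = solve_exponential(2, 10)
print(x, 2 ** x)  # the second number should be (very nearly) 10
```

Substituting the answer back into the original equation, as the `print` line does, is a quick way to catch mistakes.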
Questions.
1. Solve $3^x = 400$ for $x$. A1.33
2. If inflation means that prices increase by 2% every year, how many years does it take them to
double? A1.34
| Year | 2004 | 2005 | 2006 | 2007 | 2008 | 2009 | 2010 | 2011 | 2012 | 2013 | 2014 |
| Population (in millions) | 20.1 | 20.4 | 20.7 | 21.1 | 21.5 | 21.9 | 22.3 | 22.6 | 22.9 | 23.2 | 23.5 |
We can construct a function $f$ which "fills in the gaps" and gives the population of Australia at any
point in between the points listed above. Come up with a linear function ($f(x) = mx + b$) whose
plot closely follows the plot above. You can do this by substituting in suitable values from the table
above to find $m$ and $b$. A1.35
Does this function give sensible values outside of the interval $[2004, 2014]$? A1.36
Now consider the plot below of the population of Australia over an extended time period.
A linear function is no longer suitable for this plot. What type of function might be suitable? A1.37
Question 2: Sometimes data will fit a logarithmic or inverse function much better than a linear
one. If you suspect that your data will not fit a linear function very well, a good way to test for
that is to apply a function (logarithm, square, reciprocal etc.) to the x-axis data, then plot this
against the y-axis data and hope that the resulting graph looks linear. Then we can go through a
similar procedure to the one in the previous question to find a function of the form $y = m\ln(x) + b$,
$y = mx^2 + b$ or $y = \frac{m}{x} + b$, depending on what function was used to transform the data.
Consider the following table containing the GDP per capita (measured in units of $1000 per person)
and child mortality rate (deaths per 1000 births) for six countries.
Plot this data with the GDP per capita on the horizontal axis and the child mortality rate on the
vertical axis. You may prefer to use software such as Microsoft Excel to do this. A1.38
This data does not look very linear (we would need a concave function to fit this data), so we will
apply some transformations to the data. One common transformation is the squared transformation.
Extend the above table to include a row of the squares of the GDP per capita values. (That is, apply
the function $f(x) = x^2$ to each value in the first row.) A1.39
Plot the transformed data with the square of the GDP per capita on the horizontal axis and the child
mortality rate on the vertical axis. Does this look more linear than the original plot? A1.40
Now apply the same process, but with the reciprocal function this time. (That is, apply the function
$f(x) = \frac{1}{x}$ to the GDP per capita data.) Plot your data to see if it is more linear than the original
plot. A1.41
Do this once more using the function $f(x) = \ln(x)$. A1.42
Using your previous answers, what type of function would you use to model the relationship between
GDP per capita and child mortality rate? A1.43
Question 2 (Factorial):
There is an important function in mathematics called the factorial function. Instead of being written
in the usual way as $f(x)$, it is usually written as $n!$. Its domain is $\mathbb{N} = \{0, 1, 2, \ldots\}$ and it is defined
by $n! = \prod_{x=1}^{n} x$. By convention, $0! = 1$. But otherwise, $n!$ is found by multiplying together every
(b) Now evaluate $\sum_{n=0}^{7} \frac{1}{n!}$ (using a calculator if you wish). A1.48
(c) Do you recognise what number your answer is close to? A1.49
(d) Write the function $f(x) = \sum_{n=0}^{3} \frac{1}{n!} x^n$ without using sigma notation. Your answer will be a
degree 3 polynomial. A1.50
(e) Look at the plots of $f(x) = \sum_{n=0}^{3} \frac{1}{n!} x^n$ and $g(x) = e^x$ using Wolfram Alpha or some other
graphing software. Put both plots on the same graph if you can. What do you notice? A1.51
(f) See what happens when you replace the plot of $\sum_{n=0}^{3} \frac{1}{n!} x^n$ with $\sum_{n=0}^{4} \frac{1}{n!} x^n$ and $\sum_{n=0}^{5} \frac{1}{n!} x^n$ and
so on.
Question 3:
Consider the function $f(x) = \log_e(x)$.
(a) Write $\sum_{x=1}^{4} f(x)$ without using sigma notation. A1.52
(b) Use the logarithm laws to write your answer as a single logarithm. Use the factorial function
from Question 2 in your answer. A1.53
1.8 Answers
1.1 $\frac{2}{5} + \frac{3}{10} + \frac{4}{17}$.
1.2 There are different possibilities here. For example, $\sum_{x=2}^{5} \frac{1}{2x}$ or $\sum_{x \in \{4,6,8,10\}} \frac{1}{x}$.
1.3 $31 \times 34 \times 37$.
1.4 Roughly 𝑓(−1) = 2.4, 𝑓(1) = −0.3 and 𝑓(3) = 0.5.
1.5 It might be (roughly) −0.3, 3 or 4.6.
1.6 No right or wrong here. But the points where it touches the $x$-axis, the point where it crosses
the $y$-axis, the bottom of the "dip" at $x = 1$ and the top of the "rise" at $x = 4$ are definitely all
important.
1.7 It's decreasing when 𝑥 < 1 and when 𝑥 > 4, and increasing when 1 < 𝑥 < 4.
1.8 One example is 𝑓(𝑥) = 2𝑥.
1.9 One example is $f(x) = x^2 + 1$.
1.10 One example is $f(x) = 2^x$.
1.11 Yes. The inverse function is $f^{-1}(y) = \frac{y}{2} + 2$.
1.12 No. The function is not one-to-one because, for example, 𝑓(1) = 𝑓(−1).
√
1.13 Yes. The inverse function is 𝑓 −1 (𝑦) = 3 𝑦 − 2.
1.14 It's convex. Notice if you place two fingers anywhere on the function's plot, then the plot is
entirely below the straight line between your fingers.
1.15 Looking at the plot suggests that $x = 0$ and $x = 3$ are the zeroes. Taking out $x^2$ as a
common factor gives $0 = x^2(x - 3)$, which means both $0 = x^2$ and $0 = x - 3$ need to be solved.
This confirms that $x = 0$ and $x = 3$ are the zeroes.
1.16 𝑓(2) = −4, 𝑓(1.9) = −3.971 and 𝑓(2.1) = −3.969. Since evaluating 𝑓 at the points close to
𝑥 = 2 gives values greater than 𝑓(2), this suggests that 𝑥 = 2 is a local minimum. By similar
reasoning, 𝑥 = 0 is a local maximum.
1.17 There are many correct answers here, one being 𝑥1 = 0 and 𝑥2 = 3, since 𝑓(0) = 0 = 𝑓(3).
This means that 𝑓 is not one-to-one.
1.18 The function is neither concave nor convex. To see this, draw a straight line between the points
on the plot where $x = -1$ and $x = 3$. The plot is not entirely below the straight line, nor is it
entirely above the straight line.
1.19 There are many correct answers here. One possible answer is to let 𝑔 have the domain
(−∞, 0].
1.20 The function is convex. To see this, draw a straight line between any two points on the plot
and notice that the straight line lies entirely above the plot.
1.21 Solve the equation $y = f(x)$ for $x$.
$$y = 2e^{\frac{1}{3}x}$$
$$\frac{1}{2}y = e^{\frac{1}{3}x}$$
$$\ln(\frac{1}{2}y) = \frac{1}{3}x$$
$$3\ln(\frac{1}{2}y) = x$$
The inverse function is $f^{-1}(y) = 3\ln(\frac{1}{2}y)$ and its domain is $(0, \infty)$. It is possible to find the
inverse function because $f$ is a one-to-one function.
1.22 The function is concave. The plot of 𝑓 is given below, and after drawing a straight line
between any two points on the plot, the plot is entirely above the straight line.
1.23 Since the plot of $f$ lies entirely above the horizontal axis, $f$ has no zeroes. However, $f^{-1}$
appears to have a zero at $y = 2$. This can be verified by solving $f^{-1}(y) = 0$ for $y$.
$$3\ln(\frac{1}{2}y) = 0$$
$$\ln(\frac{1}{2}y) = 0$$
$$\frac{1}{2}y = e^0$$
$$\frac{1}{2}y = 1$$
$$y = 2$$
1.24 Since both $f$ and $f^{-1}$ are always increasing (the plots of both have no "flat" points), neither
of them has any stationary points.
1.25 Below are the plots of a few example polynomial functions which satisfy the above condition,
but have a different number of zeroes each.
So there is not enough information to say exactly how many zeroes 𝑓 will have. However, 𝑓
must have at least two zeroes, namely 𝑥 = 1 and 𝑥 = 3.
1.26 Again, there is not enough information to say exactly how many stationary points there
are between 𝑥 = 1 and 𝑥 = 3. However, there must be at least one. There must also be more
local maxima than local minima. Below are the plots of some example polynomial functions to
illustrate this.
1.27 $f(x) = \frac{2}{5}x + 2$.
1.28 The slope must be positive.
1.29 3 because $f(x) = x^3 - 2x^2 - x + 2$.
1.30 The zeroes are exactly 2, 1 and −1. Obviously $f(2) = f(1) = f(-1) = 0$ and because the
degree is 3 we know that there are no other zeroes.
1.31 $a^x > 1$. What about in the other three cases for $a$ and $x$?
1.32 $(\frac{3}{4})^2 \approx 56\%$ on day 2. $(\frac{3}{4})^{20} \approx 0.3\%$ on day 20.
20.1 = 2004𝑚 + 𝑏
23.5 = 2014𝑚 + 𝑏.
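A pair of equations like this can be solved by subtracting one from the other to find $m$ and then substituting back to find $b$. The Python sketch below carries out exactly that calculation. The helper name `line_through` is made up for this illustration, and using the first and last data points is just one reasonable choice; other pairs of points would give slightly different lines.

```python
def line_through(p, q):
    """Return (m, b) for the line y = m*x + b through the points p and q."""
    (x1, y1), (x2, y2) = p, q
    m = (y2 - y1) / (x2 - x1)   # slope: change in y divided by change in x
    b = y1 - m * x1             # rearranged from y1 = m*x1 + b
    return m, b


m, b = line_through((2004, 20.1), (2014, 23.5))


def f(x):
    return m * x + b


print(m, b, f(2009))
```

With these two points the slope comes out at about 0.34 (million people per year), and the line gives roughly 21.8 for 2009, compared with the tabulated value of 21.9.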
1.38
1.39
| GDP per capita | 6.5 | 14.8 | 23.1 | 44.1 | 51.2 | 73.2 |
| (GDP per capita)² | 42.25 | 219.04 | 533.61 | 1944.81 | 2621.44 | 5358.24 |
| Child mortality rate | 1.6 | 2.5 | 3.1 | 3.9 | 3.7 | 4.1 |
1.40
This plot is certainly not linear. It looks even less linear than the original plot.
1.41
This plot looks a little better, but is still not quite linear. This time a convex function would
be needed to fit the data.
1.42
This plot resembles a straight line very closely.
1.43 Since the logarithm transformation resulted in a plot closely resembling a straight line, we
would assume that 𝑦 = 𝑚 ln(𝑥) + 𝑏. Here, 𝑥 is the GDP per capita, 𝑦 is the child mortality
rate, and 𝑚 and 𝑏 are constants chosen to fit the data.
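Judging "how linear" a plot looks by eye is what this activity asks for, but the comparison can also be made numerically. The sketch below is an addition that goes beyond this unit: it assumes the Pearson correlation coefficient as a measure of linearity (values with magnitude close to 1 indicate points lying close to a straight line) and applies it to the data from the table in answer 1.39 under each transformation. The names `pearson_r`, `transforms` and `scores` are made up for this illustration.

```python
import math

gdp = [6.5, 14.8, 23.1, 44.1, 51.2, 73.2]
mortality = [1.6, 2.5, 3.1, 3.9, 3.7, 4.1]


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length lists of numbers."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# The transformations tried in the activity, applied to the x-axis data.
transforms = {
    "identity": lambda x: x,
    "square": lambda x: x ** 2,
    "reciprocal": lambda x: 1 / x,
    "ln": math.log,
}

scores = {
    name: abs(pearson_r([t(x) for x in gdp], mortality))
    for name, t in transforms.items()
}
best = max(scores, key=scores.get)

for name, score in scores.items():
    print(name, round(score, 3))
print("most linear:", best)
```

The ranking this produces agrees with the impressions described in the answers above: squaring makes things worse than the original data, the reciprocal is somewhat better, and the logarithm is best.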
1.44 First, evaluate the terms individually.
𝑓(3) = 3
𝑓(4) = 8
𝑓(5) = 15
1.45 $\prod_{x=3}^{5} f(x) = f(3) \times f(4) \times f(5) = 3 \times 8 \times 15 = 360$.
1.46 Notice that $f(2) = 0$. So when you multiply all of these numbers together, the result will be
0 regardless of what the other terms are. $\prod_{x=2}^{50} f(x) = 0$.
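1.47
$$\begin{aligned}
\sum_{n=0}^{3} \frac{1}{n!} &= \frac{1}{0!} + \frac{1}{1!} + \frac{1}{2!} + \frac{1}{3!} \\
&= \frac{1}{1} + \frac{1}{1} + \frac{1}{2} + \frac{1}{6} \\
&= \frac{6}{6} + \frac{6}{6} + \frac{3}{6} + \frac{1}{6} \\
&= \frac{16}{6} \\
&= \frac{8}{3}
\end{aligned}$$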
1.48 $\sum_{n=0}^{7} \frac{1}{n!} \approx 2.71825$
1.49 2.71825 is close to the mathematical constant 𝑒, which was introduced earlier. In fact, the
more terms you add up in the above sum, the closer your answer will be to 𝑒.
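To see this convergence in action, the following Python sketch computes the partial sums $\sum_{n=0}^{k} \frac{1}{n!}$ for increasing $k$ and records how far each one is from $e$. The helper name `partial_sum` and the choice to stop at $k = 10$ are assumptions made for this illustration.

```python
import math


def partial_sum(k):
    """Sum of 1/n! for n = 0, 1, ..., k."""
    return sum(1 / math.factorial(n) for n in range(k + 1))


# Distance between each partial sum and e, for k = 0, 1, ..., 10.
errors = [abs(math.e - partial_sum(k)) for k in range(11)]

for k, err in enumerate(errors):
    print(k, partial_sum(k), err)
```

Each extra term shrinks the gap, and it does so quickly because the factorials in the denominators grow very fast.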
1.50 Remember that $x^0 = 1$. Therefore, $\sum_{n=0}^{3} \frac{1}{n!} x^n = 1 + x + \frac{1}{2}x^2 + \frac{1}{6}x^3$.
1.51 You should notice that the plots are quite close around 𝑥 = 0. In fact they intersect at 𝑥 = 0.
Both plots are shown below. The orange curve is the plot of 𝑓 and the blue curve is the plot of
𝑔.
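1.52 $\sum_{x=1}^{4} f(x) = \log_e(1) + \log_e(2) + \log_e(3) + \log_e(4)$.
1.53
$$\begin{aligned}
\sum_{x=1}^{4} f(x) &= \log_e(1) + \log_e(2) + \log_e(3) + \log_e(4) \\
&= \log_e(1 \times 2) + \log_e(3 \times 4) \\
&= \log_e(1 \times 2 \times 3 \times 4) \\
&= \log_e(4!)
\end{aligned}$$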
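The same identity can be checked numerically, and the argument above suggests it should hold with 4 replaced by any positive whole number. A short Python sketch, under that assumption (the helper name `sum_of_logs` is made up for this illustration):

```python
import math


def sum_of_logs(n):
    """log_e(1) + log_e(2) + ... + log_e(n)."""
    return sum(math.log(x) for x in range(1, n + 1))


lhs = sum_of_logs(4)
rhs = math.log(math.factorial(4))
print(lhs, rhs)
```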
Chapter 2
Calculus
2.1 Differentiation
Here, we're going to introduce derivatives and higher derivatives of functions. These are really useful,
but we'll hold off most of our discussion of why they are until the next chapter.
2.1.1 Tangents
It can be very useful to know whether a function 𝑓 is increasing or decreasing at a point and whether
it is doing so steeply or gently. In order to do this, it helps to define tangents to a function.
Let 𝑓 be a function. The tangent to 𝑓 at a point 𝑎 (if it exists) is the straight line that best
approximates 𝑓(𝑥) when 𝑥 is near 𝑎.
This definition might not seem very intuitive at first, but tangents are easy to visualise:
Example. A plot of the function $f(x) = x^2 - 2x + \frac{1}{2}$ (blue) and its tangents at $x = \frac{3}{4}$
(orange) and $x = 2$ (green).
Notice that the slopes of these tangents give us information about the behaviour of $f$ at these points:
$f$ is decreasing gently at $x = \frac{3}{4}$ and increasing more steeply at $x = 2$. There is a mathematical method
that allows us to get this information exactly and without having to resort to plotting the function.
The demonstration here is a nice visualisation of tangents to a variety of functions.
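Before we meet that exact method, here is one way to estimate the slope of a tangent numerically. The idea, which is an assumption of this sketch rather than something the unit relies on, is to take two points on the plot very close to $x = a$, one on either side, and compute the slope of the straight line through them. The helper name `secant_slope` and the step size `h` are made up for the illustration.

```python
def f(x):
    return x ** 2 - 2 * x + 0.5


def secant_slope(f, a, h):
    """Slope of the line through (a - h, f(a - h)) and (a + h, f(a + h))."""
    return (f(a + h) - f(a - h)) / (2 * h)


# As h shrinks, the secant line hugs the curve near x = a more and more closely.
for h in (0.5, 0.1, 0.001):
    print(h, secant_slope(f, 0.75, h), secant_slope(f, 2.0, h))

slope_at_three_quarters = secant_slope(f, 0.75, 1e-3)
slope_at_two = secant_slope(f, 2.0, 1e-3)
```

The estimates are negative and small in size at $x = \frac{3}{4}$ and positive and larger at $x = 2$, matching the "decreasing gently" and "increasing more steeply" behaviour seen in the plot.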
Questions.
1. Estimate the slope of tangents to the function in the above example at $x = -\frac{1}{4}$, $x = 1$, and
$x = \frac{3}{2}$. A2.1
2. A tangent to a circle is sometimes defined as a line that touches the circle in exactly one point.
Is there a problem with defining a tangent to a function similarly? A2.2
2.1.2 Derivatives
Let 𝑓 be a function. The derivative of 𝑓 at a point 𝑎 is the slope of the tangent to 𝑓 at 𝑥 = 𝑎 (the
derivative exists exactly if the tangent does).
For the nice smooth functions we are concentrating on here, tangents and derivatives always exist
(at the end of this section, we'll give an example of a non-smooth situation where a tangent and
derivative do not exist). Functions with this property are called differentiable. For such a function
$f$ we can do even better than finding the derivative of $f$ for a few different values $a$ - we can find
another function denoted $f'$ (with the same domain as $f$) such that $f'(a)$ gives the derivative of $f$ at
$a$ for any $a$.
We call the function $f'$ the derivative of the function $f$.
Example. The function $f(x) = x^2 - 2x + \frac{1}{2}$ that we plotted above has a derivative $f'(x) = 2x - 2$.
Using $f'$, we can calculate that the slope of the orange tangent (at $x = \frac{3}{4}$) is $f'(\frac{3}{4}) = -\frac{1}{2}$ and the
slope of the green tangent (at $x = 2$) is $f'(2) = 2$. What would the slopes of tangents at $x = 1$ and
$x = \frac{3}{2}$ be?
Sometimes $f'$ is significant in its own right (for example, if $f(x)$ gives the number of customers for
a company at time $x$, then $f'$ gives the rate at which the company is gaining or losing customers).
Most importantly, however, knowing $f'$ can give us information about $f$ itself (for example, it can
help us to find maxima and minima of $f$).
In general, $f'$ can be thought of as the rate of change of $f$. When $f'$ is positive $f$ is increasing and
when $f'$ is negative $f$ is decreasing. When the magnitude of $f'$ is large $f$ is increasing or decreasing
rapidly and when the magnitude of $f'$ is small $f$ is increasing or decreasing slowly.
The derivatives of many of the basic functions we have seen are well known. For $c, n, a, d \in \mathbb{R}$ with
$n \neq 0$ and $a > 0$:
• If $f(x) = c$, then $f'(x) = 0$.
• If $f(x) = x^n$, then $f'(x) = nx^{n-1}$.
• If $f(x) = a^{cx+d}$, then $f'(x) = c\ln(a)a^{cx+d}$. (If $f(x) = e^{cx+d}$, then $f'(x) = ce^{cx+d}$.)
• If $f(x) = \log_a(cx + d)$, then $f'(x) = \frac{c}{\ln(a)(cx+d)}$. (If $f(x) = \ln(cx + d)$, then $f'(x) = \frac{c}{cx+d}$.)
We can combine these with two important facts to help us to find derivatives of more functions. For
differentiable functions 𝑓 and 𝑔 and 𝑐 ∈ ℝ:
• The derivative of 𝑐𝑓(𝑥) is 𝑐𝑓 ′ (𝑥).
• The derivative of 𝑓(𝑥) + 𝑔(𝑥) is 𝑓 ′ (𝑥) + 𝑔′ (𝑥).
(Also the derivative of 𝑓(𝑥) − 𝑔(𝑥) is 𝑓 ′ (𝑥) − 𝑔′ (𝑥)).
Example. Let \( f(x) = -\frac{1}{3}x^3 + 2x^2 - 3x + 1 \). To find the derivative of \( f \), we can think as
follows:
• The derivative of \( x^3 \) is \( 3x^{3-1} = 3x^2 \). So the derivative of \( -\frac{1}{3}x^3 \) is \( -x^2 \).
• The derivative of \( x^2 \) is \( 2x^{2-1} = 2x \). So the derivative of \( 2x^2 \) is \( 4x \).
• The derivative of \( x \) is \( x^{1-1} = x^0 = 1 \). So the derivative of \( -3x \) is \( -3 \).
• The derivative of 1 is 0.
Putting the above together, we have \( f'(x) = -x^2 + 4x - 3 \).
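The term-by-term reasoning in this example can be sketched in a few lines of Python. This is only an illustration: the coefficient-list representation and the function names `differentiate` and `evaluate` are our own assumptions, not part of the unit materials. Exact fractions are used so that the coefficient of the cubic term is not rounded.

```python
from fractions import Fraction

def differentiate(coeffs):
    """Differentiate a polynomial stored as [a0, a1, a2, ...], where a_k multiplies x**k."""
    # Power rule term by term: the derivative of a_k * x**k is k * a_k * x**(k - 1).
    # The k = 0 term (the constant) differentiates to 0, so it is dropped.
    return [k * a for k, a in enumerate(coeffs)][1:]

def evaluate(coeffs, x):
    """Evaluate the polynomial with the given coefficients at x."""
    return sum(a * x**k for k, a in enumerate(coeffs))

# f(x) = -1/3 x^3 + 2x^2 - 3x + 1, written lowest power first
f = [Fraction(1), Fraction(-3), Fraction(2), Fraction(-1, 3)]
f_prime = differentiate(f)

print([str(c) for c in f_prime])  # ['-3', '4', '-1'], i.e. f'(x) = -3 + 4x - x^2
print(evaluate(f_prime, 2))       # slope of the tangent to f at x = 2
```

Storing the lowest power first means that the list index of a coefficient is also its exponent, which keeps the power rule to a single line.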
Here's a visualisation that shows plots of a polynomial and its derivative on the same set of axes.
Questions.
1. Find the derivative of \( f(x) = x^2 - 2x + \frac{1}{2} \). How close were your guesses for the tangent slopes
in question 1 in the last section? A2.3
2. Find the derivative of \( g(x) = 4x^3 - 2x^2 + e^{2x} \). What is the slope of a tangent to \( g \) at \( x = 2 \)? A2.4
3. Find the derivative of \( h(x) = \ln(3x) - 7 \). What happens to the slope of \( h \) as \( x \) gets very large? A2.5
Example. In the last example we saw that the derivative of \( f(x) = -\frac{1}{3}x^3 + 2x^2 - 3x + 1 \)
is \( f'(x) = -x^2 + 4x - 3 \). So to find the second derivative of \( f \), we differentiate \( f'(x) = -x^2 + 4x - 3 \)
and obtain \( f''(x) = -2x + 4 \).
We can go on and discover that \( f^{(3)}(x) = -2 \) and that \( f^{(n)}(x) = 0 \) for every \( n \geq 4 \).
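Repeated differentiation is easy to mimic in code. The sketch below assumes a polynomial is stored as a list of coefficients with the lowest power first (our own convention), and the helper name `higher_derivatives` is hypothetical. An empty list stands for the zero polynomial.

```python
from fractions import Fraction

def differentiate(coeffs):
    """Power rule on [a0, a1, a2, ...]: the term a_k * x**k becomes k * a_k * x**(k - 1)."""
    return [k * a for k, a in enumerate(coeffs)][1:]

def higher_derivatives(coeffs):
    """Return [f', f'', f''', ...], stopping at the first derivative that is identically zero."""
    results = []
    current = list(coeffs)
    while current:  # an empty list represents the zero polynomial
        current = differentiate(current)
        results.append(current)
    return results

# f(x) = -1/3 x^3 + 2x^2 - 3x + 1
f = [Fraction(1), Fraction(-3), Fraction(2), Fraction(-1, 3)]
for order, d in enumerate(higher_derivatives(f), start=1):
    print(order, [str(c) for c in d])
```

For this cubic the loop stops at the fourth derivative, which matches the observation that every derivative of order 4 or more is zero.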
Questions.
1. What are the first and second derivatives of \( f(x) = e^x \)? In general what is the \( n \)th derivative? A2.6
2. What are the first and second derivatives of \( g(x) = e^{2x} \)? In general what is the \( n \)th derivative? A2.7
3. If \( h \) is a polynomial function of degree 7, what is \( h^{(8)}(x) \)? A2.8
Using higher derivatives, we're now able to give a formal definition of what it means for a function to
be smooth.
A function 𝑓 is smooth if 𝑓 (𝑛) exists for every 𝑛 ≥ 1.
So smooth functions are those that can be differentiated any number of times.
Here's a visualisation that shows plots of a polynomial and its higher derivatives on the same set of
axes.
2.1.5 Warning
Here, we're concentrating on "nice" functions that are smooth or at least differentiable. Not all
functions are, though; not even all continuous functions are.
Example. Think of the function 𝑓(𝑥) = |𝑥|, where |𝑥| is the absolute value of 𝑥. A plot
of 𝑓 is shown below. Notice that 𝑓 is continuous because this plot can be drawn without
lifting pen from paper.
2.1.6 Notation
Here we're using \( f'(x) \) to represent the derivative of a function \( f(x) \) and \( f^{(n)} \) to represent its \( n \)th
derivative. This is one of two common notations. The other uses \( \frac{df}{dx} \) or \( \frac{d}{dx} f(x) \) to represent the
derivative and \( \frac{d^n f}{dx^n} \) to represent the \( n \)th derivative.
(a) Find the rule for the function 𝑓 ′. A2.9
(b) Produce a plot of the function 𝑓 ′ and use it to find the zeroes of 𝑓 ′. A2.10
(c) What do the zeroes of 𝑓 ′ tell you about 𝑓? A2.11
Question 2:
Consider the function \( f \) where \( f(x) = e^{3x} + 2x^2 \). The plot is given below.
(a) Find the equation of the tangent at the point \( x = 0 \). This equation will be of the form \( y = mx + c \)
where \( m \) is the gradient of the tangent. A2.12
(b) Produce a plot of the function \( f \) and the tangent at \( x = 0 \). A2.13
Question 3:
Find the derivative of \( f \) where \( f(x) = \log_2(x^3 + 3x^2) \). (Hint: Use logarithm laws.) A2.14
Question 4:
Suppose that you did not know how to differentiate \( f(x) = a^{cx+d} \), but instead you only knew the special
case where \( d = 0 \). That is, pretend you only know that if \( f(x) = a^{cx} \), then \( f'(x) = c \ln(a)\, a^{cx} \). Use this
and the laws for exponential functions to show that if \( f(x) = a^{cx+d} \), then \( f'(x) = c \ln(a)\, a^{cx+d} \). A2.15
Question 5:
(a) Find \( f^{(17)}(x) \) if \( f(x) = e^{6x} \). Write your answer in terms of a power. A2.16
(b) Find \( f^{(12)}(x) \) if \( f(x) = x^{12} \). Write your answer using the factorial function (see Question 2 in
Functions: Activity 3). A2.17
2.3 Optimisation
• A local minimum is a stationary point where the derivative of the function changes from negative
to positive.
• A local maximum is a stationary point where the derivative of the function changes from positive
to negative.
• When the derivative of the function is positive on both sides of the stationary point or negative
on both sides of the stationary point, it is neither a local minimum nor a local maximum (it is
an inflection point).
Exercise. Think about why these definitions correspond to our intuitive notions of local minima as
"dips" and local maxima as "crests".
2.3.3 Optimisation
Often we're interested in finding where a function takes its least or greatest value over its whole domain
(not just a local minimum or maximum). This is often referred to as an optimisation problem.
Let 𝑓 be a function.
• 𝑓 has a global minimum at 𝑥 = 𝑎 if 𝑓(𝑎) ≤ 𝑓(𝑥) for all 𝑥 in the domain of 𝑓.
• 𝑓 has a global maximum at 𝑥 = 𝑎 if 𝑓(𝑎) ≥ 𝑓(𝑥) for all 𝑥 in the domain of 𝑓.
Not all functions have global minima or maxima. For example, the function 𝑓(𝑥) = 𝑥 with domain
ℝ does not have a global maximum (because 𝑓(𝑥) keeps increasing as 𝑥 does). However, if we limit
ourselves to continuous functions whose domain is a closed real interval (i.e. {𝑥 ∈ ℝ ∶ 𝑐 ≤ 𝑥 ≤ 𝑑} for
some 𝑐, 𝑑 ∈ ℝ), then we are guaranteed to have them.
Let 𝑓 be a continuous function whose domain is a closed real interval {𝑥 ∈ ℝ ∶ 𝑐 ≤ 𝑥 ≤ 𝑑}. Then 𝑓 has a
global maximum at one or more points and a global minimum at one or more points.
Furthermore, for functions whose domain is a closed real interval, global maxima and minima can
only occur at certain points:
Let 𝑓 be a function whose domain is a real interval {𝑥 ∈ ℝ ∶ 𝑐 ≤ 𝑥 ≤ 𝑑}. If 𝑎 is a global minimum
point or global maximum point of 𝑓, then one of the following holds:
• 𝑎 = 𝑐 or 𝑎 = 𝑑;
• 𝑓 ′ (𝑎) = 0 (that is, 𝑎 is a stationary point of 𝑓); or
• 𝑓 ′ (𝑎) does not exist.
Points 𝑎 where 𝑓 ′ (𝑎) = 0 or 𝑓 ′ (𝑎) does not exist are called critical points (so every stationary point
is a critical point but not vice-versa).
Example. To find the global minima and maxima of \( f(x) = -\frac{1}{3}x^3 + 2x^2 - 3x + 1 \) on the
domain \( \{x \in \mathbb{R} : 0 \leq x \leq \frac{7}{2}\} \) we act as follows. We first calculate \( f'(x) = -x^2 + 4x - 3 \).
Solving \( f'(x) = 0 \) we get \( x = 1 \) and \( x = 3 \). So \( f \) has \( x = 1 \) and \( x = 3 \) as stationary points,
\( x = 0 \) and \( x = \frac{7}{2} \) as boundary points, and there are no points where \( f'(x) \) does not exist.
So we only need to calculate \( f(0) = 1 \), \( f(1) = -\frac{1}{3} \), \( f(3) = 1 \), and \( f(\frac{7}{2}) = \frac{17}{24} \). From this
we see that \( f \) has a global minimum of \( -\frac{1}{3} \) at \( x = 1 \) and a global maximum of 1 at both \( x = 0 \) and \( x = 3 \).
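The procedure in this example (list the boundary points and stationary points, evaluate the function at each, then compare) can be sketched as follows. The candidate points are typed in by hand from the working above rather than solved for by the program, and the variable names are our own.

```python
from fractions import Fraction

def f(x):
    # f(x) = -1/3 x^3 + 2x^2 - 3x + 1
    return Fraction(-1, 3) * x**3 + 2 * x**2 - 3 * x + 1

# Candidates on the closed interval [0, 7/2]:
# the two boundary points and the stationary points x = 1 and x = 3.
candidates = [Fraction(0), Fraction(1), Fraction(3), Fraction(7, 2)]
values = {x: f(x) for x in candidates}

lowest = min(values.values())
highest = max(values.values())
minimisers = [x for x, v in values.items() if v == lowest]
maximisers = [x for x, v in values.items() if v == highest]

print("global minimum", lowest, "at x =", [str(x) for x in minimisers])
print("global maximum", highest, "at x =", [str(x) for x in maximisers])
```

Collecting every point that attains the extreme value, rather than only the first one found, matters here because the largest value is attained at more than one candidate.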
• Any local minimum of a convex function is also a global minimum.
• Any local maximum of a concave function is also a global maximum.
(a) Initially, the company plans to hire 60 employees. How much profit is expected in the first
month? A2.24
(b) The company hires one new employee at the start of each month for the next five months. What
total profit is expected over the first six months? Write your answer using sigma notation. A2.25
Note: In reality, ℝ would not be a suitable domain for this function. The number of employees must
be a non-negative integer. So ℕ would be a more appropriate domain. However, we are about to do
some calculations that involve differentiation, so for now we will just treat 𝑓 as if its domain were ℝ.
(c) Find 𝑓 ′ (𝑥). A2.26
(d) Use 𝑓 ′ to find all stationary points of 𝑓. A2.27
(e) Classify these stationary points using the second derivative test. Check your answers with the
above plot. A2.28
(f) What is the largest monthly profit that the company can expect to make? How many employees
do they need in order to make this profit? A2.29
Question 2:
(a) Consider the function \( f \) given by \( f(x) = x^3 - 6x^2 + 12x - 8 \). Find the stationary point of \( f \). A2.30
(b) What happens when you use the second derivative test to classify this stationary point? A2.31
(c) Find the derivative of \( f \) at \( x = 1 \) and \( x = 3 \) and hence classify the stationary point. Produce a
plot of \( f \) to verify your answer. A2.32
2.5 Integration
We often need to find the area under the plotted curve of a function. We will see later that this is
particularly useful in probability theory.
We can find areas under curves with a technique called integration that is closely related to
differentiation.
2.5.1 Integration
Let \( f \) be a continuous function. The definite integral of \( f \) between \( x = a \) and \( x = b \), written as
\[ \int_a^b f(x)\,dx, \]
is the signed area bounded by the plot of \( f \), the \( x \)-axis, \( x = a \) and \( x = b \).
We say "signed area" because any areas below the \( x \)-axis count negatively.
Example. Finding the red area in the plot below might approximate the probability that
a randomly selected person was between 180 and 190 cm tall.
The area in red would be written as \( \int_{180}^{190} f(x)\,dx \) where \( f \) is the function plotted in blue.
This works because of a deep result sometimes called the fundamental theorem of calculus. A function
like 𝐹 is called an antiderivative of 𝑓. Any continuous function has an antiderivative. In fact any
continuous function has lots of antiderivatives, but they are all essentially the same: they differ only
by constants.
Example. Suppose we want to calculate \( \int_0^1 x^3\,dx \). The corresponding area is pictured
below.
Taking \( F(x) = \frac{1}{4}x^4 \) as an antiderivative of \( f(x) = x^3 \), we have
\[ \int_0^1 x^3\,dx = F(1) - F(0) = \frac{1}{4} - 0 = \frac{1}{4}. \]
(We could have also picked \( G(x) = \frac{1}{4}x^4 + 10 \) as an antiderivative of \( f \), but we would have
gotten exactly the same answer because the 10s would have cancelled out.)
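As a sanity check, the value obtained from an antiderivative can be compared with a direct numerical estimate of the area. The sketch below assumes the antiderivative \( F(x) = \frac{1}{4}x^4 \) for \( x^3 \) and uses a midpoint sum with 1000 thin rectangles; the helper name `midpoint_sum` and the number of rectangles are our own choices.

```python
def F(x):
    # An antiderivative of x^3 (assumed here: x^4 / 4)
    return x**4 / 4

def midpoint_sum(f, a, b, n=1000):
    """Approximate the signed area under f on [a, b] using n rectangles of equal width."""
    width = (b - a) / n
    # Each rectangle's height is the value of f at the midpoint of its base.
    return sum(f(a + (i + 0.5) * width) for i in range(n)) * width

exact = F(1) - F(0)
approx = midpoint_sum(lambda x: x**3, 0, 1)

print(exact)   # 0.25
print(approx)  # close to 0.25, but not exact
```

Adding a constant to `F` leaves `exact` unchanged, since the constant appears in both `F(1)` and `F(0)` and cancels.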
Questions.
1. Guided by the example above, calculate \( \int_{-1}^{1} x^3\,dx \). A2.33
2. Intuitively speaking, why does your answer to 1 make sense? A2.34
3. Calculate \( \int_0^5 e^{-x}\,dx \). (Note that \( F(x) = -e^{-x} \) has derivative \( F'(x) = e^{-x} \).) A2.35
A2.39
Find the area of the shaded region.
2.7 Answers
2.1 The slope at 𝑥 = −0.25 is negative (−2.5) and at 𝑥 = 1 the tangent is flat with a slope of
zero
2.2
2.3 \( f'(x) = 2x - 2 \)
So the slopes of tangents to \( f \) at \( x = -\frac{1}{4} \), \( x = 1 \), and \( x = \frac{3}{2} \) are \( f'(-\frac{1}{4}) = -\frac{5}{2} \), \( f'(1) = 0 \) and
\( f'(\frac{3}{2}) = 1 \).
2.4 \( g'(x) = 12x^2 - 4x + 2e^{2x} \)
A tangent to \( g \) at \( x = 2 \) has slope \( g'(2) = 40 + 2e^4 \approx 149 \).
So the function is sloping very steeply upward here.
2.5 \( h'(x) = \frac{1}{x} \)
As \( x \) gets very large \( h'(x) \) gets very close to 0 (but stays just positive).
So \( h \) keeps increasing, but very gradually.
2.6 \( f'(x) = e^x \)
\( f''(x) = e^x \)
\( f^{(n)}(x) = e^x \)
2.7 \( g'(x) = 2e^{2x} \)
\( g''(x) = 4e^{2x} \)
\( g^{(n)}(x) = 2^n e^{2x} \)
2.8 0
2.9 \( f'(x) = 6x^2 - 30x + 36 \)
2.13
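2.14 Using logarithm laws,
\[ f(x) = \log_2(x^3 + 3x^2) = \log_2\big(x^2(x + 3)\big) = 2\log_2(x) + \log_2(x + 3). \]
So,
\[ f'(x) = 2 \cdot \frac{1}{\ln(2)\,x} + \frac{1}{\ln(2)(x + 3)} = \frac{2(x + 3)}{\ln(2)\,x(x + 3)} + \frac{x}{\ln(2)\,x(x + 3)} = \frac{3x + 6}{\ln(2)\,x(x + 3)}. \]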
2.18
2.19 \( f'(x) = 2 - e^x \)
Solving \( f'(x) = 0 \) we see that \( f \) has one stationary point: at \( x = \ln(2) \).
\( f'(x) = 2 - e^x \) changes from positive to negative at \( x = \ln(2) \), so this point is a local maximum.
2.20 They are exactly the same. This is because ℎ′ (𝑥) = 𝑐𝑔′ (𝑥) and 𝑐𝑔′ (𝑥) = 0 has exactly the
same solutions as 𝑔′ (𝑥) = 0.
What can you say about the classifications of the stationary points of ℎ(𝑥) compared to those
of 𝑔(𝑥)?
2.21 𝑓 ′′ (𝑥) = −2𝑥 + 4
2.22 𝑓 ′′ (1) = 2 so there is a local minimum at 𝑥 = 1. 𝑓 ′′ (3) = −2 so there is a local maximum at
𝑥 = 3.
2.23 \( f'(x) = x - 10 + \frac{16}{x} \)
Solving \( f'(x) = 0 \) we get \( x = 2 \) and \( x = 8 \).
So \( f \) has \( x = 2 \) and \( x = 8 \) as stationary points, \( x = 1 \) and \( x = 14 \) as boundary points, and
there are no points where \( f'(x) \) does not exist.
So we only need to calculate \( f(1) = 5.5 \), \( f(2) \approx 8.09 \), \( f(8) \approx 0.27 \), and \( f(14) \approx 15.22 \).
From this we see that \( f \) has a global minimum of approximately 0.27 at \( x = 8 \) and a global
maximum of approximately 15.22 at \( x = 14 \).
2.24 \( f(60) = -\frac{1}{4800} \times 60^4 + 10 \times 60^2 = 33300 \). The expected profit is $33,300.
2.25 The total profit is found by adding up the profit from the first six months. This is
\( f(60) + f(61) + f(62) + f(63) + f(64) + f(65) \), which can be rewritten as \( \sum_{x=60}^{65} \left( -\frac{1}{4800}x^4 + 10x^2 \right) \).
2.26 \( f'(x) = -\frac{1}{1200}x^3 + 20x \)
2.27 Set \( f'(x) = 0 \) and solve for \( x \). Factorising gives \( x\left( -\frac{1}{1200}x^2 + 20 \right) = 0 \), which yields two
solutions, \( x = 0 \) and \( x = \sqrt{24000} = 40\sqrt{15} \approx 154.9 \).
2.28 The second derivative is \( f''(x) = -\frac{1}{400}x^2 + 20 \). Evaluating \( f'' \) at the two stationary points
gives \( f''(0) = 20 \) and \( f''(40\sqrt{15}) = -40 \). Since \( f''(0) > 0 \), there is a local minimum at \( x = 0 \).
Since \( f''(40\sqrt{15}) < 0 \), there is a local maximum at \( x = 40\sqrt{15} \).
2.29 Looking at the plot, the stationary point \( x = 40\sqrt{15} \) is a local maximum. However, the
number of employees must be an integer. Evaluating \( f \) at the integers near \( 40\sqrt{15} \approx 154.9 \)
gives \( f(155) \) = $120,000 and \( f(154) \) = $119,983 (to the nearest dollar). So the company needs
155 employees to make a maximum monthly profit of $120,000.
2.30 Differentiating gives \( f'(x) = 3x^2 - 12x + 12 = 3(x^2 - 4x + 4) = 3(x - 2)^2 \). Solving \( f'(x) = 0 \)
shows that there is a stationary point at \( x = 2 \).
2.31 The second derivative is 𝑓 ′′ (𝑥) = 6𝑥 − 12 and so 𝑓 ′′ (2) = 0. Therefore, the second derivative
test does not tell us anything.
2.32 Since 𝑓 ′ (1) = 3 and 𝑓 ′ (3) = 3, the function 𝑓 is increasing on either side of
𝑥 = 2. Therefore, there is an inflection point at 𝑥 = 2. The plot below verifies
this.
2.33
\[ \int_{-1}^{1} x^3\,dx = F(1) - F(-1) = \frac{1}{4} - \frac{1}{4} = 0. \]
2.34 The area below the 𝑥-axis between 𝑥 = −1 and 𝑥 = 0 counts negatively and cancels with
the area above the 𝑥-axis between 𝑥 = 0 and 𝑥 = 1.
2.35
\[ \int_0^5 e^{-x}\,dx = F(5) - F(0) = -e^{-5} - (-1) = 1 - e^{-5} \approx 0.993. \]
2.36 A function \( F \) where \( F' = f \) is needed. Differentiating \( 12x \) gives 12, and differentiating \( -x^3 \)
gives \( -3x^2 \). So \( F(x) = 12x - x^3 \).
2.37 This is the area in question.
\[ \int_{-2}^{2} (12 - 3x^2)\,dx = F(2) - F(-2) = 16 - (-16) = 32. \]
2.38 This is the area in question.
An antiderivative of \( f \) is given by \( F(x) = -\frac{1}{2}e^{-2x} - \frac{1}{2}x^2 - x \). This can be checked by
differentiating \( F \). The signed area is then
\[ \int_0^2 (e^{-2x} - x - 1)\,dx = F(2) - F(0) = -\frac{1}{2}e^{-4} - \frac{7}{2} \approx -3.51. \]
The negative answer makes sense since the area is below the \( x \)-axis. So the area is in fact
\( \frac{1}{2}e^{-4} + \frac{7}{2} \approx 3.51 \).
2.39 This can be done by finding the area of this region:
and then subtracting the result from the area of this region:
The area of the first region is
\[ \int_0^1 (12x^2 + 1)\,dx = F(1) - F(0) = 5. \]
The area of the second region is just the length multiplied by the height of that rectangle. The
height is \( f(1) = 13 \), so this area is \( 1 \times 13 = 13 \). Therefore, the area of the original shaded
region is \( 13 - 5 = 8 \).
Chapter 3
Linear Algebra
3.1.1 Matrices
An 𝑚 × 𝑛 matrix is a rectangular array of numbers with 𝑚 rows and 𝑛 columns.
We will use capital letters to represent matrices. If 𝐴 is an 𝑚 × 𝑛 matrix, then we say that the size
of 𝐴 is 𝑚 × 𝑛.
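Example: The matrix \( A \) given below is a \( 4 \times 3 \) matrix, and \( B \) is a \( 2 \times 2 \) matrix.
\[ A = \begin{bmatrix} 3 & 4 & 0 \\ 1 & -3 & -2 \\ 0 & 2 & 0 \\ 6 & -13 & 24 \end{bmatrix} \qquad B = \begin{bmatrix} -\frac{1}{4} & \sqrt{2} \\ \pi & 0.32 \end{bmatrix} \]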
For a matrix 𝐴, we will use the notation (𝐴)𝑖𝑗 to denote the entry in row 𝑖 and column 𝑗 of 𝐴.
Question: Consider the following matrix:
\[ A = \begin{bmatrix} 3 & -4 & 5 \\ 0 & 1 & 8 \end{bmatrix} \]
What is the size of \( A \)? A3.1
What is \( (A)_{12} \)? A3.2
\[ A + B = \begin{bmatrix} 1 + 2 & -3 + 9 \\ 4 - 4 & 4 - 2 \end{bmatrix} = \begin{bmatrix} 3 & 6 \\ 0 & 2 \end{bmatrix}. \]
Matrices of the same size can also be subtracted by subtracting the corresponding entries.
We can also multiply matrices by a single number, where the resulting matrix is found by multiplying
each entry by that number. The resulting matrix will have the same size as the original matrix.
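If \( A \) is an \( m \times n \) matrix and \( k \in \mathbb{R} \), then \( kA \) is the \( m \times n \) matrix where \( (kA)_{ij} = k(A)_{ij} \) for all \( i \) and \( j \).
Example: If \( A = \begin{bmatrix} 2 & 1 & 0 \\ -2 & 3 & -1 \end{bmatrix} \), then
\[ 4A = \begin{bmatrix} 4 \times 2 & 4 \times 1 & 4 \times 0 \\ 4 \times (-2) & 4 \times 3 & 4 \times (-1) \end{bmatrix} = \begin{bmatrix} 8 & 4 & 0 \\ -8 & 12 & -4 \end{bmatrix}. \]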
This second operation is called scalar multiplication. The reason for this is that we often describe real
numbers as scalars.
Question: Let \( A = \begin{bmatrix} 3 & 1 \\ -1 & 4 \end{bmatrix} \) and \( B = \begin{bmatrix} 0 & -5 \\ 7 & 2 \end{bmatrix} \). Evaluate \( 2A + B \). A3.3
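The remaining entries are found in a similar way by choosing an appropriate row and
column. For example, to find \( (AB)_{21} \) we need row 2 from \( A \) and column 1 from \( B \).
\[ A = \begin{bmatrix} 5 & 3 \\ 0 & -4 \\ 3 & 4 \end{bmatrix} \qquad B = \begin{bmatrix} 1 & 2 \\ -2 & 0 \end{bmatrix} \qquad (AB)_{21} = (0 \times 1) + (-4 \times -2) = 8 \]
To find \( (AB)_{32} \), we need row 3 of \( A \) and column 2 of \( B \).
\[ A = \begin{bmatrix} 5 & 3 \\ 0 & -4 \\ 3 & 4 \end{bmatrix} \qquad B = \begin{bmatrix} 1 & 2 \\ -2 & 0 \end{bmatrix} \qquad (AB)_{32} = (3 \times 2) + (4 \times 0) = 6 \]
Follow this process to find the remaining three entries. The final result is given below.
\[ AB = \begin{bmatrix} -1 & 10 \\ 8 & 0 \\ -5 & 6 \end{bmatrix} \]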
You may realise now why we need the number of columns in \( A \) to be equal to the number of rows in
\( B \). This is so that when we find each entry in \( AB \), we can pair each number in a row from \( A \) with a
number in a column from \( B \).
If \( A \) is an \( m \times p \) matrix and \( B \) is a \( p \times n \) matrix, then \( AB \) is the \( m \times n \) matrix where
\[ (AB)_{ij} = \sum_{k=1}^{p} (A)_{ik} (B)_{kj} \]
for all \( i \) and \( j \).
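The summation formula translates almost directly into code. The following is a minimal sketch in which a matrix is assumed to be stored as a list of rows; the function name `matmul` is our own. Note that Python counts from 0, so the entry written as row 2, column 1 in the text is `AB[1][0]` in the code.

```python
def matmul(A, B):
    """Multiply an m x p matrix A by a p x n matrix B, both given as lists of rows."""
    m, p, n = len(A), len(B), len(B[0])
    # The number of columns of A must equal the number of rows of B.
    if any(len(row) != p for row in A):
        raise ValueError("number of columns of A must equal number of rows of B")
    # Entry (i, j) pairs row i of A with column j of B and adds up the products.
    return [[sum(A[i][k] * B[k][j] for k in range(p)) for j in range(n)]
            for i in range(m)]

# The 3 x 2 and 2 x 2 matrices from the worked example above
A = [[5, 3], [0, -4], [3, 4]]
B = [[1, 2], [-2, 0]]
AB = matmul(A, B)
print(AB)  # [[-1, 10], [8, 0], [-5, 6]]
```

Swapping the arguments, as in `matmul(B, A)`, raises the `ValueError`, because `B` has 2 columns while `A` has 3 rows.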
Below is a summary of the rules of algebra for matrices, where 𝐴, 𝐵 and 𝐶 are matrices and 𝑘, 𝑙 are
real numbers. These rules work as long as the matrices are of the right size for the matrix addition
and multiplication to work.
• 𝐴(𝐵𝐶) = (𝐴𝐵)𝐶
• 𝐴(𝐵 + 𝐶) = 𝐴𝐵 + 𝐴𝐶 and (𝐴 + 𝐵)𝐶 = 𝐴𝐶 + 𝐵𝐶
• 𝑘(𝐴 + 𝐵) = 𝑘𝐴 + 𝑘𝐵
• (𝑘 + 𝑙)𝐴 = 𝑘𝐴 + 𝑙𝐴
• (𝑘𝐴)𝐵 = 𝑘(𝐴𝐵) = 𝐴(𝑘𝐵)
Note that we have not included 𝐴𝐵 = 𝐵𝐴 in the above list. In general, you cannot swap the order of
matrix multiplication. In fact, in the previous example, the product 𝐵𝐴 does not even exist because
the sizes of these matrices are incompatible.
3.1.4 Vectors
A vector is a matrix with only one column.
This is actually the definition of a column vector. A row vector is a matrix with only one row, but we
will only need to deal with column vectors. We will use lower case letters with arrows above them to
represent vectors.
Example: \( \vec{u} = \begin{bmatrix} 3 \\ 0 \\ -2 \\ 5 \end{bmatrix} \) and \( \vec{v} = \begin{bmatrix} 12 \end{bmatrix} \) are vectors.
Just as we did with matrices, we can multiply vectors by scalars, we can add two vectors of the same
size, and we can multiply a vector and a matrix, provided that their sizes are compatible. Since
vectors are matrices, the rules of algebra we saw for matrices also apply to vectors.
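Question: Let \( A = \begin{bmatrix} 3 & 2 \\ -4 & 5 \end{bmatrix} \), \( \vec{u} = \begin{bmatrix} 3 \\ -8 \end{bmatrix} \) and \( \vec{v} = \begin{bmatrix} -1 \\ -8 \end{bmatrix} \). Evaluate \( A\vec{u} - 4\vec{v} \). A3.4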
3.1.5 The Dot Product
The dot product is an operation which can be performed on a pair of vectors of the same size. It
has several important uses, one of which is to provide a more compact representation for certain
expressions.
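Given a pair of vectors,
\[ \vec{u} = \begin{bmatrix} u_1 \\ u_2 \\ \vdots \\ u_n \end{bmatrix}, \qquad \vec{v} = \begin{bmatrix} v_1 \\ v_2 \\ \vdots \\ v_n \end{bmatrix} \]
their dot product is
\[ \vec{u} \cdot \vec{v} = u_1 v_1 + u_2 v_2 + \dots + u_n v_n. \]
Question: Compute the dot product of \( \vec{u} = \begin{bmatrix} 1 \\ 3 \\ 2 \end{bmatrix} \) and \( \vec{v} = \begin{bmatrix} 2 \\ -4 \\ 6 \end{bmatrix} \). A3.5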
\[ \begin{bmatrix} x \\ y \end{bmatrix} \]
The operations of scalar multiplication and addition of vectors can be performed geometrically.
Scalar multiplication simply "stretches" or "shrinks" the vector, depending on the value of the scalar.
If the scalar is negative, then the vector will be reflected so that it points in the opposite direction.
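Example: Let \( \vec{v} = \begin{bmatrix} 4 \\ -2 \end{bmatrix} \). This vector is plotted below.
We can algebraically compute \( \frac{1}{2}\vec{v} = \begin{bmatrix} 2 \\ -1 \end{bmatrix} \) and \( -2\vec{v} = \begin{bmatrix} -8 \\ 4 \end{bmatrix} \).
These vectors are plotted below. The blue vector is \( \vec{v} \), the orange vector is \( \frac{1}{2}\vec{v} \), and the
green vector is \( -2\vec{v} \).
Since \( \frac{1}{2} \) is between 0 and 1, the vector \( \frac{1}{2}\vec{v} \) is just \( \vec{v} \) after being "shrunk".
Since \( -2 \) is negative, the vector \( -2\vec{v} \) is just \( \vec{v} \) after being reflected and "stretched".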
Addition of vectors can be done by plotting the two vectors, moving one of them so that its "tail" is
at the same position as the "tip" of the other vector, then drawing an arrow from the origin to the
"tip" of the vector which has just been moved.
Example: Let \( \vec{u} = \begin{bmatrix} 3 \\ 1 \end{bmatrix} \) and \( \vec{v} = \begin{bmatrix} -1 \\ -4 \end{bmatrix} \). These vectors are plotted below.
In order to add these vectors geometrically, we will move \( \vec{v} \) so that its "tail" is at the
same position as the "tip" of \( \vec{u} \).
Now we draw an arrow from the origin to the tip of \( \vec{v} \) (after it has been moved) to represent
\( \vec{u} + \vec{v} \).
So \( \vec{u} + \vec{v} = \begin{bmatrix} 2 \\ -3 \end{bmatrix} \). You can check that you get the same result by finding \( \vec{u} + \vec{v} \) algebraically.
Two vectors 𝑢⃗ and 𝑣⃗ are perpendicular if and only if their dot product is equal to zero.
Example: Consider the following two vectors:
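\[ \vec{u} = \begin{bmatrix} 2 \\ 3 \end{bmatrix} \qquad \vec{v} = \begin{bmatrix} -6 \\ 4 \end{bmatrix} \]
Their dot product is \( \vec{u} \cdot \vec{v} = 2 \times (-6) + 3 \times 4 = 0 \). A plot of these vectors is given below.
Note that they are perpendicular.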
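The dot product and the perpendicularity test are short enough to sketch in code. Here vectors are assumed to be stored as plain Python lists, and the function names `dot` and `are_perpendicular` are our own. The exact comparison with zero is only appropriate for integer or fraction entries; with decimal (floating-point) entries a small tolerance would be needed.

```python
def dot(u, v):
    """Dot product of two vectors of the same size, given as lists of numbers."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same size")
    # Multiply corresponding entries and add up the results.
    return sum(a * b for a, b in zip(u, v))

def are_perpendicular(u, v):
    # Two vectors are perpendicular exactly when their dot product is zero.
    return dot(u, v) == 0

print(dot([2, 3], [-6, 4]))                 # 0
print(are_perpendicular([2, 3], [-6, 4]))   # True
print(are_perpendicular([3, 1], [-1, -4]))  # False, since the dot product is -7
```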
This geometric interpretation of vectors can be extended to three dimensions (vectors with three
entries rather than two), although it becomes slightly more difficult to visualise.
4. 𝐵𝐷 + 𝐵 A3.12
5. 𝐴𝐵𝐷𝐶𝐴 A3.13
6. 𝐴𝑢⃗ + 𝐵𝑣 ⃗ A3.14
7. 𝐷𝑣 ⃗ + 𝐶𝑣 ⃗ A3.15
8. 𝑢⃗ − 𝐶𝑢⃗ A3.16
Question 3:
The blue vector below is 𝑢⃗ and the orange vector is 𝑣⃗.
Write expressions for the green vectors below in terms of 𝑢⃗ and 𝑣⃗.
(a) A3.17
(b) A3.18
Question 4:
The blue vector below is 𝑢⃗ and the orange vector is 𝑣⃗.
Which of the following green vectors represents 𝑣⃗ − 2𝑢⃗? A3.19
[Plots: four candidate green vectors, labelled A, B, C and D]
are eight times as many regular staff as managers, one equation would be 𝑥 = 8𝑦, which
can be rearranged to give
𝑥 − 8𝑦 = 0.
The total amount spent on wages is $1,740,000, which is the sum of the amount spent
on managers and the amount spent on regular staff. Writing this as an equation, we have
60,000𝑥 + 100,000𝑦 = 1,740,000. Dividing both sides of this equation by 20,000 gives
3𝑥 + 5𝑦 = 87.
This linear system can be written in matrix form where each row represents one of the
above equations.
\[ \begin{bmatrix} 1 & -8 \\ 3 & 5 \end{bmatrix} \begin{bmatrix} x \\ y \end{bmatrix} = \begin{bmatrix} 0 \\ 87 \end{bmatrix} \]
To verify this, we can carry out the matrix multiplication on the left hand side. First of
all, the matrix has size 2 × 2 and the vector has size 2 × 1, so the resulting vector will be
a vector of size 2 × 1. The first entry will be 1 × 𝑥 + (−8) × 𝑦 = 𝑥 − 8𝑦. Similarly, the
second entry will be 3𝑥 + 5𝑦. So this equation says
\[ \begin{bmatrix} x - 8y \\ 3x + 5y \end{bmatrix} = \begin{bmatrix} 0 \\ 87 \end{bmatrix} \]
so the corresponding entries must be equal. This gives us back the original two equations.
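We will often write linear systems more compactly as \( A\vec{x} = \vec{b} \) where \( \vec{x} \) is a vector containing all of the
variables. In the above example, the components of this equation would be
\[ A = \begin{bmatrix} 1 & -8 \\ 3 & 5 \end{bmatrix}, \quad \vec{x} = \begin{bmatrix} x \\ y \end{bmatrix}, \quad \text{and} \quad \vec{b} = \begin{bmatrix} 0 \\ 87 \end{bmatrix}. \]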
See this demonstration for a visual interpretation of linear systems and their solutions.
Note that after this process, we have a modified linear system, which can be written as a matrix
equation
\[ \begin{bmatrix} 1 & -8 \\ 0 & 29 \end{bmatrix} \begin{bmatrix} x \\ y \end{bmatrix} = \begin{bmatrix} 0 \\ 87 \end{bmatrix}, \]
which contains a zero in the bottom-left corner of the matrix. Gaussian elimination is a procedure
for introducing zeroes into the bottom-left corner.
Also note that adding and subtracting equations can be done by adding and subtracting the entries
from the corresponding rows in the matrix equation.
Example: Consider the following linear system, represented as a matrix equation:
\[ \begin{bmatrix} 1 & 2 & 2 \\ -2 & 0 & 2 \\ 6 & 2 & 1 \end{bmatrix} \begin{bmatrix} x \\ y \\ z \end{bmatrix} = \begin{bmatrix} 3 \\ -6 \\ 10 \end{bmatrix} \]
The first step is to multiply the second row by 3 and add the result to the third row. We
only change the third row.
\[ \begin{bmatrix} 1 & 2 & 2 \\ -2 & 0 & 2 \\ 6 + 3 \times (-2) & 2 + 3 \times 0 & 1 + 3 \times 2 \end{bmatrix} \begin{bmatrix} x \\ y \\ z \end{bmatrix} = \begin{bmatrix} 3 \\ -6 \\ 10 + 3 \times (-6) \end{bmatrix} \]
\[ \begin{bmatrix} 1 & 2 & 2 \\ -2 & 0 & 2 \\ 0 & 2 & 7 \end{bmatrix} \begin{bmatrix} x \\ y \\ z \end{bmatrix} = \begin{bmatrix} 3 \\ -6 \\ -8 \end{bmatrix} \]
You should see now why choosing a multiple of 3 was required. We wanted to introduce
a zero into the bottom-left entry. We would also like the entry above that to be a zero.
So the next step is to multiply the first row by 2 and add the result to the second row.
\[ \begin{bmatrix} 1 & 2 & 2 \\ 0 & 4 & 6 \\ 0 & 2 & 7 \end{bmatrix} \begin{bmatrix} x \\ y \\ z \end{bmatrix} = \begin{bmatrix} 3 \\ 0 \\ -8 \end{bmatrix} \]
We can also change a row by just multiplying it by a factor. We will multiply the second
row by \( \frac{1}{2} \).
\[ \begin{bmatrix} 1 & 2 & 2 \\ 0 & 2 & 3 \\ 0 & 2 & 7 \end{bmatrix} \begin{bmatrix} x \\ y \\ z \end{bmatrix} = \begin{bmatrix} 3 \\ 0 \\ -8 \end{bmatrix} \]
Now we can subtract the second row from the third row.
\[ \begin{bmatrix} 1 & 2 & 2 \\ 0 & 2 & 3 \\ 0 & 0 & 4 \end{bmatrix} \begin{bmatrix} x \\ y \\ z \end{bmatrix} = \begin{bmatrix} 3 \\ 0 \\ -8 \end{bmatrix} \]
This system is now in a position to be solved. The third row corresponds to the equation
\( 4z = -8 \), so \( z = -2 \).
The second row corresponds to \( 2y + 3z = 0 \), which combined with \( z = -2 \) gives \( y = 3 \).
Finally, the first row gives \( x + 2y + 2z = 3 \), which combined with \( z = -2 \) and \( y = 3 \) gives
\( x = 1 \).
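For readers who would like to experiment, here is one possible sketch of the same idea in Python. It is not a transcription of the steps above: the program clears the first column before the second, so its intermediate matrices differ from the ones shown, although the final solution is the same. The function name `solve`, the augmented-matrix layout, and the use of exact fractions are our own assumptions.

```python
from fractions import Fraction

def solve(A, b):
    """Solve A x = b for a square matrix A by Gaussian elimination and back substitution."""
    n = len(A)
    # Augmented matrix: each row of A followed by the matching entry of b.
    M = [[Fraction(v) for v in row] + [Fraction(rhs)] for row, rhs in zip(A, b)]
    for col in range(n):
        # Find a row at or below the diagonal with a non-zero entry in this column.
        pivot = next((r for r in range(col, n) if M[r][col] != 0), None)
        if pivot is None:
            raise ValueError("the system does not have a unique solution")
        M[col], M[pivot] = M[pivot], M[col]  # row operation: interchange two rows
        for r in range(col + 1, n):
            # Row operation: add a multiple of the pivot row to a lower row,
            # chosen so that the entry below the diagonal becomes zero.
            factor = M[r][col] / M[col][col]
            M[r] = [x - factor * y for x, y in zip(M[r], M[col])]
    # Back substitution, starting from the last row.
    x = [Fraction(0)] * n
    for i in reversed(range(n)):
        known = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - known) / M[i][i]
    return x

solution = solve([[1, 2, 2], [-2, 0, 2], [6, 2, 1]], [3, -6, 10])
print([str(v) for v in solution])  # ['1', '3', '-2'], i.e. x = 1, y = 3, z = -2
```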
The manipulations above are called row operations. There are three row operations that may be used
to solve a linear system.
• Any row may be multiplied by a non-zero number.
• Any multiple of one row may be added to another row.
• Any two rows may be interchanged.
Note that in the above examples, we aimed to fill the bottom-left corner with zeroes. The entire
process will still work in the same way if we fill any other corner of the matrix with zeroes.
See the demonstration here for a visual interpretation of linear systems of three variables and the
Gaussian elimination method.
\[ \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix} \qquad \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix} \]
These matrices serve a similar purpose to the number 1. When you multiply any matrix by an identity
matrix of appropriate size, the result will be the same matrix you started with.
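Example: Carry out the following matrix multiplication to verify that
\[ \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix} \begin{bmatrix} -2 & 0 & 3 \\ 0 & 1 & -4 \end{bmatrix} = \begin{bmatrix} -2 & 0 & 3 \\ 0 & 1 & -4 \end{bmatrix} \]
and
\[ \begin{bmatrix} -2 & 0 & 3 \\ 0 & 1 & -4 \end{bmatrix} \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix} = \begin{bmatrix} -2 & 0 & 3 \\ 0 & 1 & -4 \end{bmatrix}. \]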
Let \( A \) be a square matrix. \( A \) is called invertible if there is a square matrix \( A^{-1} \) such that \( A^{-1}A = I \).
This matrix \( A^{-1} \) is called the inverse matrix of \( A \).
Example: Carry out the matrix multiplication to verify that
\[ A^{-1} = \begin{bmatrix} 1 & 1 & 1 \\ 0 & 0 & 1 \\ 1 & 0 & 2 \end{bmatrix} \]
is the inverse of
\[ A = \begin{bmatrix} 0 & -2 & 1 \\ 1 & 1 & -1 \\ 0 & 1 & 0 \end{bmatrix}. \]
Not all matrices have an inverse. A quick way to check if a matrix has an inverse is to calculate a
quantity of the matrix called its determinant. We will only look at determinants of 2 × 2 matrices,
but there are methods to calculate determinants for larger square matrices. The interested student
may wish to see the demonstration here which details the process for calculating determinants of
3 × 3 matrices.
Let \( A \) be a \( 2 \times 2 \) matrix where
\[ A = \begin{bmatrix} a & b \\ c & d \end{bmatrix}. \]
The determinant of \( A \) is \( \det(A) = ad - bc \). A square matrix \( A \) is invertible if and only if \( \det(A) \neq 0 \).
Question: Are the following matrices invertible? A3.20
\[ A = \begin{bmatrix} 2 & 2 \\ 1 & -3 \end{bmatrix} \qquad B = \begin{bmatrix} 3 & 6 \\ 2 & 4 \end{bmatrix} \]
If a 2 × 2 matrix is invertible, then there is a simple formula to calculate its inverse matrix.
If a \( 2 \times 2 \) matrix \( A = \begin{bmatrix} a & b \\ c & d \end{bmatrix} \) is invertible, then
\[ A^{-1} = \frac{1}{\det(A)} \begin{bmatrix} d & -b \\ -c & a \end{bmatrix}. \]
Question: Find \( A^{-1} \) for the matrix \( A \) in the above question. Then check your answer by calculating
\( A^{-1}A \). A3.21
We now show how calculating inverse matrices provides another method for solving linear systems.
If \( A\vec{x} = \vec{b} \) is a linear system and \( A \) is invertible, then \( \vec{x} = A^{-1}\vec{b} \).
To see why this works, we take the linear system and multiply on the left by \( A^{-1} \). This gives
\[ A^{-1}A\vec{x} = A^{-1}\vec{b}. \]
But \( A^{-1}A = I \) by definition, so
\[ I\vec{x} = A^{-1}\vec{b}. \]
Recall that multiplying by the identity matrix does not change the matrix/vector. So we have
\[ \vec{x} = A^{-1}\vec{b}. \]
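Example: We can solve the following linear system using the above method.
\[ x + y = 3 \tag{3.1} \]
\[ x + 2y = 4 \tag{3.2} \]
We write
\[ A = \begin{bmatrix} 1 & 1 \\ 1 & 2 \end{bmatrix}, \quad \vec{x} = \begin{bmatrix} x \\ y \end{bmatrix}, \quad \text{and} \quad \vec{b} = \begin{bmatrix} 3 \\ 4 \end{bmatrix} \]
so that \( A\vec{x} = \vec{b} \). Now \( \det(A) = 1 \times 2 - 1 \times 1 = 1 \), so \( A \) is invertible. Its inverse matrix is
\[ A^{-1} = \frac{1}{1} \begin{bmatrix} 2 & -1 \\ -1 & 1 \end{bmatrix} = \begin{bmatrix} 2 & -1 \\ -1 & 1 \end{bmatrix}. \]
So the solution is
\[ \vec{x} = \begin{bmatrix} 2 & -1 \\ -1 & 1 \end{bmatrix} \begin{bmatrix} 3 \\ 4 \end{bmatrix} = \begin{bmatrix} 2 \\ 1 \end{bmatrix}. \]
By equating the components of these vectors, we have \( x = 2 \) and \( y = 1 \).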
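The determinant test, the 2 × 2 inverse formula and the final multiplication can be combined in a short sketch. The function names `inverse_2x2` and `apply` are our own, and the code assumes integer matrix entries so that exact fractions can be formed.

```python
from fractions import Fraction

def inverse_2x2(A):
    """Inverse of a 2 x 2 integer matrix [[a, b], [c, d]] via the determinant formula."""
    (a, b), (c, d) = A
    det = a * d - b * c
    if det == 0:
        raise ValueError("matrix is not invertible because its determinant is 0")
    # Swap a and d, negate b and c, and divide every entry by the determinant.
    return [[Fraction(d, det), Fraction(-b, det)],
            [Fraction(-c, det), Fraction(a, det)]]

def apply(M, v):
    """Multiply the matrix M by the column vector v (given as a list)."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

A = [[1, 1], [1, 2]]
b = [3, 4]
A_inv = inverse_2x2(A)
solution = apply(A_inv, b)
print([str(v) for v in solution])  # ['2', '1'], i.e. x = 2 and y = 1
```

A matrix with determinant 0 makes `inverse_2x2` raise an error instead of dividing by zero, which mirrors the invertibility condition stated earlier.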
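\[ \begin{bmatrix} 8 & 11 & 13 \\ -4 & 2 & 3 \\ 2 & 1 & 2 \end{bmatrix} \begin{bmatrix} x \\ y \\ z \end{bmatrix} = \begin{bmatrix} 23 \\ 1 \\ 4 \end{bmatrix} \]
\[ \begin{bmatrix} 2 & 1 & 2 \\ -4 & 2 & 3 \\ 8 & 11 & 13 \end{bmatrix} \begin{bmatrix} x \\ y \\ z \end{bmatrix} = \begin{bmatrix} 4 \\ 1 \\ 23 \end{bmatrix} \]
\[ \begin{bmatrix} 2 & 1 & 2 \\ -4 & 2 & 3 \\ 0 & 7 & 5 \end{bmatrix} \begin{bmatrix} x \\ y \\ z \end{bmatrix} = \begin{bmatrix} 4 \\ 1 \\ 7 \end{bmatrix} \]
\[ \begin{bmatrix} 2 & 1 & 2 \\ 0 & 4 & 7 \\ 0 & 7 & 5 \end{bmatrix} \begin{bmatrix} x \\ y \\ z \end{bmatrix} = \begin{bmatrix} 4 \\ 9 \\ 7 \end{bmatrix} \]
\[ \begin{bmatrix} 2 & 1 & 2 \\ 0 & 4 & 7 \\ 0 & 28 & 20 \end{bmatrix} \begin{bmatrix} x \\ y \\ z \end{bmatrix} = \begin{bmatrix} 4 \\ 9 \\ 28 \end{bmatrix} \]
\[ \begin{bmatrix} 2 & 1 & 2 \\ 0 & 4 & 7 \\ 0 & 0 & -29 \end{bmatrix} \begin{bmatrix} x \\ y \\ z \end{bmatrix} = \begin{bmatrix} 4 \\ 9 \\ -35 \end{bmatrix} \]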
Question 2:
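Consider the matrices \( A \) and \( B \) given below.
\[ A = \begin{bmatrix} 2 & 1 \\ 1 & 3 \end{bmatrix} \qquad B = \begin{bmatrix} 3 & -1 \\ -1 & 2 \end{bmatrix} \]
Evaluate \( AB \). A3.23
Use this to find \( B^{-1} \). A3.24
Check your answer by using \( \det(B) \) to find \( B^{-1} \). A3.25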
Question 3:
Researchers have devised a formula to predict an adult's height, ℎ, from their femur length, 𝑙. The
formula is
ℎ = 𝑚𝑙 + 𝑏
where 𝑚 and 𝑏 are constants. All lengths are measured in cm. The researchers have used this formula
to predict that an adult whose femur length is 50 cm is 195 cm tall, and that another adult whose
femur length is 45 cm is 182 cm tall. Find the constants 𝑚 and 𝑏 by writing a matrix equation and
finding the inverse of a matrix. A3.26
Use your answer to predict the height of an adult whose femur length is 35 cm. A3.27
What feature of this linear system allowed you to use this method to solve it? A3.28
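\[ A = \begin{bmatrix} 1 & 2 \\ 3 & 2 \end{bmatrix} \qquad \vec{x} = \begin{bmatrix} 2 \\ 3 \end{bmatrix} \qquad \vec{y} = \begin{bmatrix} -1 \\ 1 \end{bmatrix} \]
Using matrix multiplication, we can make the following calculations. You should check
these calculations yourself.
\[ A\vec{x} = \begin{bmatrix} 8 \\ 12 \end{bmatrix} = 4\vec{x} \]
\[ A\vec{y} = \begin{bmatrix} 1 \\ -1 \end{bmatrix} = -\vec{y} \]
So 4 and \( -1 \) are eigenvalues of \( A \), and \( \vec{x} \) and \( \vec{y} \) are eigenvectors of \( A \).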
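A check of this kind is easy to automate: multiply the matrix by the vector and compare the result with the scaled vector. In the sketch below the function names `apply` and `is_eigenpair` are our own, and the exact comparison assumes integer or fraction entries.

```python
def apply(M, v):
    """Multiply the matrix M (a list of rows) by the column vector v (a list)."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def is_eigenpair(M, v, lam):
    """Check whether M v equals lam * v for a non-zero vector v."""
    if all(x == 0 for x in v):
        return False  # the zero vector is not counted as an eigenvector
    return apply(M, v) == [lam * x for x in v]

A = [[1, 2], [3, 2]]
print(apply(A, [2, 3]))              # [8, 12], which is 4 times [2, 3]
print(is_eigenpair(A, [2, 3], 4))    # True
print(is_eigenpair(A, [-1, 1], -1))  # True
print(is_eigenpair(A, [1, 0], 4))    # False: A maps [1, 0] to [1, 3]
```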
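\[ \lambda I - A = \lambda \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix} - \begin{bmatrix} 1 & 2 \\ 3 & 2 \end{bmatrix} = \begin{bmatrix} \lambda & 0 \\ 0 & \lambda \end{bmatrix} - \begin{bmatrix} 1 & 2 \\ 3 & 2 \end{bmatrix} = \begin{bmatrix} \lambda - 1 & -2 \\ -3 & \lambda - 2 \end{bmatrix} \]
\[ (4I - A)\vec{x} = \vec{0} \]
\[ \begin{bmatrix} 3 & -2 \\ -3 & 2 \end{bmatrix} \begin{bmatrix} x \\ y \end{bmatrix} = \begin{bmatrix} 0 \\ 0 \end{bmatrix} \]
\[ \begin{bmatrix} 3 & -2 \\ 0 & 0 \end{bmatrix} \begin{bmatrix} x \\ y \end{bmatrix} = \begin{bmatrix} 0 \\ 0 \end{bmatrix} \]
\[ 3x - 2y = 0 \]
\[ 0 = 0 \]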
The second equation is now useless, so we ignore it. We have only one useful equation
to find two variables. There will be infinitely many solutions to this equation, so we
can assign a non-zero value to one variable in order to find just one of those solutions.
For instance, we could set \( x = 2 \), then solving \( 3x - 2y = 0 \) would give \( y = 3 \). So one
eigenvector of \( A \) corresponding to the eigenvalue \( \lambda = 4 \) would be
\[ \vec{x} = \begin{bmatrix} 2 \\ 3 \end{bmatrix}. \]
However, there are infinitely many eigenvectors corresponding to the eigenvalue \( \lambda = 4 \).
By choosing a different non-zero value of \( x \), you will find a different eigenvector.
Question: Apply the method above to find an eigenvector of 𝐴 corresponding to the eigenvalue
𝜆 = −1. A3.29
\( \begin{bmatrix} -1 \\ 1 \end{bmatrix} \) and \( \begin{bmatrix} 3 \\ 1 \end{bmatrix} \)
are eigenvectors of \( A \). Which eigenvalues do they correspond to? A3.33
Now consider the matrices
\[ D = \begin{bmatrix} -2 & 0 \\ 0 & 2 \end{bmatrix} \qquad P = \begin{bmatrix} -1 & 3 \\ 1 & 1 \end{bmatrix} \]
which are made up of the eigenvalues and eigenvectors. (Note that the eigenvalue in the first column
of \( D \) corresponds to the eigenvector in the first column of \( P \). This procedure will not work unless
the columns of the two matrices correspond.) The matrix \( D \) is a diagonal matrix because all entries
apart from those on the main diagonal are 0. Evaluate \( P^{-1} \). A3.34
Verify that \( PDP^{-1} = A \). A3.35
Write \( A^2 = AA \) and \( A^3 = AAA \) in terms of \( D \) and \( P \). Do the same for \( A^n = \underbrace{AA \cdots A}_{n \text{ times}} \). A3.36
Evaluate \( D^2 \) and \( D^3 \). What is \( D^n \)? A3.37
Using your previous answers, evaluate \( A^5 \). A3.38
Question 3 (Markov Chains):
Consider a town with a population of 4000 with only two grocery stores. Assume that every person
in this town visits exactly one grocery store each week. Suppose that in the first week, half of the
population visit store A and half visit store B. We will represent this information in a vector.
\[ \vec{x}_0 = \begin{bmatrix} 2000 \\ 2000 \end{bmatrix} \]
It is known that half of the people who visit store A one week will return to store A the next week. It
is also known that three quarters of the people who visit store B one week will return to store B the
next week.
How many people will visit each store in the second week? A3.39
The information about which stores people visit based on the previous week can be stored in a matrix.
\[ A = \begin{bmatrix} 1/2 & 1/4 \\ 1/2 & 3/4 \end{bmatrix} \]
Evaluate the vectors \( \vec{x}_1 = A\vec{x}_0 \) and \( \vec{x}_2 = A\vec{x}_1 \). A3.40
We can similarly define \( \vec{x}_n = A\vec{x}_{n-1} \) for each positive integer \( n \). Based on your previous answers,
what do the vectors \( \vec{x}_0, \vec{x}_1, \vec{x}_2, \dots \) represent? A3.41
Write \( \vec{x}_2 \) and \( \vec{x}_3 \) in terms of \( A \) and \( \vec{x}_0 \). Do the same for \( \vec{x}_n \). A3.42
Use the procedure from Question 2 to write \( A = PDP^{-1} \), where \( D \) is a diagonal matrix. A3.43
What happens to \( D^n \) when \( n \) is large? A3.44
Use your previous answers to deduce how many people each week will visit each store in the long
run. A3.45
3.7 Answers
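3.1 \( 2 \times 3 \)
3.2 \( -4 \)
3.3 Just as with the order of operations for numbers, we perform multiplication first and then
addition.
\[ \begin{aligned} 2A + B &= 2 \begin{bmatrix} 3 & 1 \\ -1 & 4 \end{bmatrix} + \begin{bmatrix} 0 & -5 \\ 7 & 2 \end{bmatrix} \\ &= \begin{bmatrix} 6 & 2 \\ -2 & 8 \end{bmatrix} + \begin{bmatrix} 0 & -5 \\ 7 & 2 \end{bmatrix} \\ &= \begin{bmatrix} 6 & -3 \\ 5 & 10 \end{bmatrix} \end{aligned} \]
3.4
\[ \begin{aligned} A\vec{u} - 4\vec{v} &= \begin{bmatrix} -7 \\ -52 \end{bmatrix} - 4 \begin{bmatrix} -1 \\ -8 \end{bmatrix} \\ &= \begin{bmatrix} -7 \\ -52 \end{bmatrix} - \begin{bmatrix} -4 \\ -32 \end{bmatrix} \\ &= \begin{bmatrix} -3 \\ -20 \end{bmatrix} \end{aligned} \]
3.5
\[ \vec{u} \cdot \vec{v} = 1 \times 2 + 3 \times (-4) + 2 \times 6 = 2. \]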
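3.6
\[ \begin{aligned} A(B + C) &= \begin{bmatrix} 2 & 3 \\ -1 & 0 \end{bmatrix} \left( \begin{bmatrix} 7 & -2 \\ 3 & 4 \end{bmatrix} + \begin{bmatrix} -4 & -6 \\ 2 & 0 \end{bmatrix} \right) \\ &= \begin{bmatrix} 2 & 3 \\ -1 & 0 \end{bmatrix} \begin{bmatrix} 3 & -8 \\ 5 & 4 \end{bmatrix} \\ &= \begin{bmatrix} 2 \times 3 + 3 \times 5 & 2 \times -8 + 3 \times 4 \\ -1 \times 3 + 0 \times 5 & -1 \times -8 + 0 \times 4 \end{bmatrix} \\ &= \begin{bmatrix} 21 & -4 \\ -3 & 8 \end{bmatrix} \end{aligned} \]
\[ \begin{aligned} AB + AC &= \begin{bmatrix} 2 & 3 \\ -1 & 0 \end{bmatrix} \begin{bmatrix} 7 & -2 \\ 3 & 4 \end{bmatrix} + \begin{bmatrix} 2 & 3 \\ -1 & 0 \end{bmatrix} \begin{bmatrix} -4 & -6 \\ 2 & 0 \end{bmatrix} \\ &= \begin{bmatrix} 2 \times 7 + 3 \times 3 & 2 \times -2 + 3 \times 4 \\ -1 \times 7 + 0 \times 3 & -1 \times -2 + 0 \times 4 \end{bmatrix} + \begin{bmatrix} 2 \times -4 + 3 \times 2 & 2 \times -6 + 3 \times 0 \\ -1 \times -4 + 0 \times 2 & -1 \times -6 + 0 \times 0 \end{bmatrix} \\ &= \begin{bmatrix} 23 & 8 \\ -7 & 2 \end{bmatrix} + \begin{bmatrix} -2 & -12 \\ 4 & 6 \end{bmatrix} \\ &= \begin{bmatrix} 21 & -4 \\ -3 & 8 \end{bmatrix} \end{aligned} \]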
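3.7
\[ AB = \begin{bmatrix} 23 & 8 \\ -7 & 2 \end{bmatrix} \qquad BA = \begin{bmatrix} 16 & 21 \\ 2 & 9 \end{bmatrix} \]
It may seem unexpected that \( AB \neq BA \). But if you look at the way you performed the matrix
multiplication, you were multiplying different pairs of numbers together each time.
3.8
\[ AC = \begin{bmatrix} -2 & -12 \\ 4 & 6 \end{bmatrix} \qquad CA = \begin{bmatrix} -2 & -12 \\ 4 & 6 \end{bmatrix} \]
Notice that \( C = -2A \). Therefore, \( AC = A(-2A) = -2AA = CA \).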
3.9 The size of 𝐵𝐶 is 2 × 2, which is the same as that of 𝐴. Therefore, 𝐴 + 𝐵𝐶 has size 2 × 2.
3.10 𝐴 has 2 columns and 𝐵 has 2 rows, so 𝐴𝐵 makes sense. In fact the size of 𝐴𝐵 is 2 × 3. Since
𝐴𝐵 has 3 columns and 𝐶 has 3 rows, 𝐴𝐵𝐶 makes sense, and its size is 2 × 2.
3.11 𝐴 has 2 columns and 𝐶 has 3 rows, so 𝐴𝐶 does not make sense. So 𝐴𝐶𝐵 will not make
sense.
3.12 Note that 𝐵𝐷 and 𝐵 both have size 2 × 3. So 𝐵𝐷 + 𝐵 will have size 2 × 3.
3.13 𝐴𝐵 has size 2 × 3. 𝐴𝐵𝐷 has size 2 × 3. 𝐴𝐵𝐷𝐶 has size 2 × 2. Therefore, 𝐴𝐵𝐷𝐶𝐴 has size
2 × 2.
3.14 Both 𝐴𝑢⃗ and 𝐵𝑣 ⃗ have size 2 × 1, so the size of 𝐴𝑢⃗ + 𝐵𝑣 ⃗ is 2 × 1.
3.15 Since 𝐶 has 2 columns and 𝑣 ⃗ has 3 rows, 𝐶𝑣 ⃗ does not make sense. So 𝐷𝑣 ⃗ + 𝐶𝑣 ⃗ does not
make sense.
3.16 Note that 𝑢⃗ has size 2 × 1 but 𝐶𝑢⃗ has size 3 × 1. We cannot add or subtract vectors or matrices of different
sizes, so this expression does not make sense.
3.17 The blue vector is the same length as above, but the orange vector has been reversed and stretched
by a factor of three. So the green vector is the result of adding 𝑢⃗ and the appropriate scalar
multiple of 𝑣⃗. The green vector is 𝑢⃗ − 3𝑣⃗.
3.18 The orange vector is unchanged, but the blue vector is now half of its original length. This
corresponds to it being multiplied by 1/2. So the second green vector is 𝑣⃗ + (1/2)𝑢⃗.
3.19 Option B is correct. The blue vector is multiplied by −2, which stretches it out and reflects
it. It is then moved so that its tail is at the tip of the orange vector.
3.20 𝐴 is invertible because det(𝐴) = 2 × (−3) − 2 × 1 = −8, which is not equal to 0. But 𝐵 is
not invertible because det(𝐵) = 3 × 4 − 6 × 2 = 0.
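3.21
\[ A^{-1} = \frac{1}{-8}\begin{bmatrix} -3 & -2 \\ -1 & 2 \end{bmatrix} = \begin{bmatrix} 3/8 & 1/4 \\ 1/8 & -1/4 \end{bmatrix} \]
\[ A^{-1}A = \begin{bmatrix} 3/8 & 1/4 \\ 1/8 & -1/4 \end{bmatrix}\begin{bmatrix} 2 & 2 \\ 1 & -3 \end{bmatrix} = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix} = I \]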
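The determinant test and the 2 × 2 inverse formula used in answers 3.20 and 3.21 can be sketched in a few lines of Python. This is only an illustration: the function name inverse_2x2 is our own, exact fractions are used to avoid rounding, and the singular matrix in the usage lines is an assumed example with the same determinant calculation as 𝐵 (the entries of 𝐵 are not shown in this section).

```python
from fractions import Fraction

def inverse_2x2(M):
    """Return the inverse of a 2 x 2 matrix [[a, b], [c, d]] using
    (1 / (ad - bc)) * [[d, -b], [-c, a]], with exact fractions."""
    (a, b), (c, d) = M
    det = Fraction(a) * d - Fraction(b) * c
    if det == 0:
        raise ValueError("matrix is not invertible (determinant is 0)")
    return [[d / det, -b / det], [-c / det, a / det]]

A = [[2, 2], [1, -3]]        # the matrix A from answers 3.20 and 3.21
A_inv = inverse_2x2(A)
print(A_inv)

# Assumed example of a matrix with determinant 3*4 - 6*2 = 0.
singular = [[3, 6], [2, 4]]
try:
    inverse_2x2(singular)
except ValueError as error:
    print(error)
```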
3.22
• In the first row operation, the first and third rows are swapped.
• In the second row operation, the first row is multiplied by 4 and the result is subtracted
from the third row.
• In the third row operation, the first row is multiplied by 2 and the result is added to the
second row.
• In the fourth row operation, the third row is multiplied by 4.
• In the final row operation, the second row is multiplied by 7 and the result is subtracted
from the third row.
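3.23
\[ AB = \begin{bmatrix} 5 & 0 \\ 0 & 5 \end{bmatrix} \]
3.24 The matrix 𝐵⁻¹ needs to satisfy 𝐵⁻¹𝐵 = 𝐼. Notice that 𝐴𝐵 = 5𝐼. So (1/5)𝐴𝐵 = 𝐼. Therefore,
\[ B^{-1} = \tfrac{1}{5}A = \begin{bmatrix} 2/5 & 1/5 \\ 1/5 & 3/5 \end{bmatrix}. \]
3.25
\begin{align*}
B^{-1} &= \frac{1}{5}\begin{bmatrix} 2 & 1 \\ 1 & 3 \end{bmatrix} \\
&= \begin{bmatrix} 2/5 & 1/5 \\ 1/5 & 3/5 \end{bmatrix}
\end{align*}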
3.26 Substitute the lengths from each adult into the formula to create a system of two equations.
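\begin{align*}
50m + b &= 195 \\
45m + b &= 182
\end{align*}
\[ \begin{bmatrix} 50 & 1 \\ 45 & 1 \end{bmatrix}\begin{bmatrix} m \\ b \end{bmatrix} = \begin{bmatrix} 195 \\ 182 \end{bmatrix} \]
So the inverse of this matrix is
\[ \frac{1}{5}\begin{bmatrix} 1 & -1 \\ -45 & 50 \end{bmatrix}. \]
The solution to this system is
\[ \begin{bmatrix} m \\ b \end{bmatrix} = \frac{1}{5}\begin{bmatrix} 1 & -1 \\ -45 & 50 \end{bmatrix}\begin{bmatrix} 195 \\ 182 \end{bmatrix} = \begin{bmatrix} 2.6 \\ 65 \end{bmatrix} \]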
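\[ \begin{bmatrix} -2 & -2 \\ -3 & -3 \end{bmatrix}\begin{bmatrix} x \\ y \end{bmatrix} = \begin{bmatrix} 0 \\ 0 \end{bmatrix}. \]
The first row operation you could apply would be to multiply the first row by −3/2.
\[ \begin{bmatrix} 3 & 3 \\ -3 & -3 \end{bmatrix}\begin{bmatrix} x \\ y \end{bmatrix} = \begin{bmatrix} 0 \\ 0 \end{bmatrix}. \]
Now add the first row to the second.
\[ \begin{bmatrix} 3 & 3 \\ 0 & 0 \end{bmatrix}\begin{bmatrix} x \\ y \end{bmatrix} = \begin{bmatrix} 0 \\ 0 \end{bmatrix}. \]
So our one useful equation is 3𝑥 + 3𝑦 = 0. One eigenvector is found by choosing 𝑦 = 1, so that
𝑥 = −1. So one eigenvector is
\[ \vec{x} = \begin{bmatrix} -1 \\ 1 \end{bmatrix}. \]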
3.30
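\begin{align*}
\lambda I - A &= \lambda\begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix} - \begin{bmatrix} -3 & 2 \\ 4 & -1 \end{bmatrix} \\
&= \begin{bmatrix} \lambda + 3 & -2 \\ -4 & \lambda + 1 \end{bmatrix}
\end{align*}
The characteristic equation is det(𝜆𝐼 − 𝐴) = (𝜆 + 3)(𝜆 + 1) − 8 = 𝜆² + 4𝜆 − 5 = (𝜆 + 5)(𝜆 − 1) = 0.
So the solutions to the characteristic equation are 𝜆 = −5 and 𝜆 = 1. These are the eigenvalues
of 𝐴.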
3.31
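For 𝜆 = 1, we have
\[ \lambda I - A = \begin{bmatrix} 4 & -2 \\ -4 & 2 \end{bmatrix}. \]
Adding the first row to the second in the corresponding linear system gives
\[ \begin{bmatrix} 4 & -2 \\ 0 & 0 \end{bmatrix}\begin{bmatrix} x \\ y \end{bmatrix} = \begin{bmatrix} 0 \\ 0 \end{bmatrix}. \]
So our useful equation is 4𝑥 − 2𝑦 = 0. Setting 𝑥 = 1 gives the eigenvector
\[ \vec{x} = \begin{bmatrix} 1 \\ 2 \end{bmatrix}. \]
For 𝜆 = −5, we have
\[ -5I - A = \begin{bmatrix} -2 & -2 \\ -4 & -4 \end{bmatrix}. \]
Subtracting twice the first row from the second in the corresponding linear system gives
\[ \begin{bmatrix} -2 & -2 \\ 0 & 0 \end{bmatrix}\begin{bmatrix} x \\ y \end{bmatrix} = \begin{bmatrix} 0 \\ 0 \end{bmatrix}. \]
So our useful equation is −2𝑥 − 2𝑦 = 0. Setting 𝑥 = 1 gives the eigenvector
\[ \vec{x} = \begin{bmatrix} 1 \\ -1 \end{bmatrix}. \]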
Note that any (non-zero) scalar multiple of these eigenvectors would also be an eigenvector.
3.32
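\begin{align*}
\lambda I - A &= \lambda\begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix} - \begin{bmatrix} 1 & 3 \\ 1 & -1 \end{bmatrix} \\
&= \begin{bmatrix} \lambda - 1 & -3 \\ -1 & \lambda + 1 \end{bmatrix}
\end{align*}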
3.33 Multiply 𝐴 by these vectors to find
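\[ \begin{bmatrix} 1 & 3 \\ 1 & -1 \end{bmatrix}\begin{bmatrix} -1 \\ 1 \end{bmatrix} = \begin{bmatrix} 2 \\ -2 \end{bmatrix} = -2\begin{bmatrix} -1 \\ 1 \end{bmatrix} \]
and
\[ \begin{bmatrix} 1 & 3 \\ 1 & -1 \end{bmatrix}\begin{bmatrix} 3 \\ 1 \end{bmatrix} = \begin{bmatrix} 6 \\ 2 \end{bmatrix} = 2\begin{bmatrix} 3 \\ 1 \end{bmatrix}. \]
So
\[ \begin{bmatrix} -1 \\ 1 \end{bmatrix} \quad\text{and}\quad \begin{bmatrix} 3 \\ 1 \end{bmatrix} \]
are eigenvectors of 𝐴, with eigenvalues −2 and 2 respectively.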
\begin{align*}
A^2 &= AA \\
&= PDP^{-1}PDP^{-1} \\
&= PDIDP^{-1} \\
&= PD^2P^{-1}
\end{align*}
\begin{align*}
A^3 &= A^2A \\
&= PD^2P^{-1}PDP^{-1} \\
&= PD^2IDP^{-1} \\
&= PD^3P^{-1}
\end{align*}
3.37
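\[ D^2 = \begin{bmatrix} -2 & 0 \\ 0 & 2 \end{bmatrix}\begin{bmatrix} -2 & 0 \\ 0 & 2 \end{bmatrix} = \begin{bmatrix} 4 & 0 \\ 0 & 4 \end{bmatrix} \]
\[ D^3 = D^2D = \begin{bmatrix} 4 & 0 \\ 0 & 4 \end{bmatrix}\begin{bmatrix} -2 & 0 \\ 0 & 2 \end{bmatrix} = \begin{bmatrix} -8 & 0 \\ 0 & 8 \end{bmatrix} \]
As you can see, the main diagonal entries are just powers of −2 and 2, whereas the other entries
are 0. This pattern continues, so
\[ D^n = \begin{bmatrix} (-2)^n & 0 \\ 0 & 2^n \end{bmatrix}. \]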
3.38
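\begin{align*}
A^5 &= PD^5P^{-1} \\
&= \begin{bmatrix} -1 & 3 \\ 1 & 1 \end{bmatrix}\begin{bmatrix} (-2)^5 & 0 \\ 0 & 2^5 \end{bmatrix}\begin{bmatrix} -1/4 & 3/4 \\ 1/4 & 1/4 \end{bmatrix} \\
&= \begin{bmatrix} -1 & 3 \\ 1 & 1 \end{bmatrix}\begin{bmatrix} -32 & 0 \\ 0 & 32 \end{bmatrix}\begin{bmatrix} -1/4 & 3/4 \\ 1/4 & 1/4 \end{bmatrix} \\
&= \begin{bmatrix} 32 & 96 \\ -32 & 32 \end{bmatrix}\begin{bmatrix} -1/4 & 3/4 \\ 1/4 & 1/4 \end{bmatrix} \\
&= \begin{bmatrix} 16 & 48 \\ 16 & -16 \end{bmatrix}
\end{align*}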
The efficiency of this method comes from only ever needing to perform two matrix multiplications,
regardless of the power of the matrix.
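To see the saving in practice, the following Python sketch computes a power of 𝐴 in two ways: through 𝑃𝐷ⁿ𝑃⁻¹ and by repeated multiplication. It is an illustration only. The matrices 𝑃, 𝑃⁻¹ and the eigenvalues are copied from answer 3.38, the matrix 𝐴 from answer 3.33, and the function names are our own.

```python
from fractions import Fraction as F

def mat_mul(X, Y):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))]
            for i in range(len(X))]

A = [[1, 3], [1, -1]]
P = [[-1, 3], [1, 1]]                              # columns are eigenvectors of A
P_inv = [[F(-1, 4), F(3, 4)], [F(1, 4), F(1, 4)]]  # inverse of P
eigenvalues = [-2, 2]                              # diagonal entries of D

def power_via_diagonalisation(n):
    """Compute A^n as P D^n P^(-1); D^n only needs powers of numbers."""
    D_n = [[eigenvalues[0] ** n, 0], [0, eigenvalues[1] ** n]]
    return mat_mul(mat_mul(P, D_n), P_inv)

def power_by_repeated_multiplication(M, n):
    """Compute M^n with n matrix multiplications, for comparison."""
    result = [[1, 0], [0, 1]]
    for _ in range(n):
        result = mat_mul(result, M)
    return result

print(power_via_diagonalisation(5))
print(power_by_repeated_multiplication(A, 5))
```

The first function always performs two matrix multiplications, whereas the second performs one per power.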
3.39 Half of the 2000 people who visited store A in the first week will return to store A in the
second week. Also, one quarter of the 2000 people who visited store B in week one will visit
store A in the second week. So the total number of people visiting store A in the second week is
\[ \tfrac{1}{2} \times 2000 + \tfrac{1}{4} \times 2000 = 1500. \]
By similar reasoning, 2500 people will visit store B in the second week.
3.40
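\[ \vec{x}_1 = \begin{bmatrix} 1/2 & 1/4 \\ 1/2 & 3/4 \end{bmatrix}\begin{bmatrix} 2000 \\ 2000 \end{bmatrix} = \begin{bmatrix} 1500 \\ 2500 \end{bmatrix} \]
\[ \vec{x}_2 = \begin{bmatrix} 1/2 & 1/4 \\ 1/2 & 3/4 \end{bmatrix}\begin{bmatrix} 1500 \\ 2500 \end{bmatrix} = \begin{bmatrix} 1375 \\ 2625 \end{bmatrix} \]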
3.41 These vectors represent the number of people who visit each store each week.
3.42
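\begin{align*}
\vec{x}_2 &= A\vec{x}_1 \\
&= AA\vec{x}_0 \\
&= A^2\vec{x}_0
\end{align*}
\begin{align*}
\vec{x}_3 &= A\vec{x}_2 \\
&= AA^2\vec{x}_0 \\
&= A^3\vec{x}_0
\end{align*}
This process can be repeated indefinitely. So the pattern holds for all 𝑛.
\[ \vec{x}_n = A^n\vec{x}_0 \]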
3.43 The eigenvalues are 1 and 1/4, and they have corresponding eigenvectors
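\[ \begin{bmatrix} 1/2 \\ 1 \end{bmatrix} \quad\text{and}\quad \begin{bmatrix} -1 \\ 1 \end{bmatrix} \]
\[ \begin{bmatrix} 1 & 0 \\ 0 & 0 \end{bmatrix}. \]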
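3.45 We want to see what happens to 𝑥⃗ₙ when 𝑛 is large. Since 𝑥⃗ₙ = 𝐴ⁿ𝑥⃗₀, we need to find 𝐴ⁿ.
As in Question 2, we have
\begin{align*}
A^n &= PD^nP^{-1} \\
&= \begin{bmatrix} 1/2 & -1 \\ 1 & 1 \end{bmatrix}\begin{bmatrix} 1 & 0 \\ 0 & (1/4)^n \end{bmatrix}\begin{bmatrix} 2/3 & 2/3 \\ -2/3 & 1/3 \end{bmatrix}.
\end{align*}
When 𝑛 is large, (1/4)ⁿ is very close to 0, so
\begin{align*}
\vec{x}_n &= A^n\vec{x}_0 \\
&\approx \begin{bmatrix} 1/3 & 1/3 \\ 2/3 & 2/3 \end{bmatrix}\begin{bmatrix} 2000 \\ 2000 \end{bmatrix} \\
&\approx \begin{bmatrix} 1333 \\ 2667 \end{bmatrix}.
\end{align*}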
So in the long run roughly 1333 people will visit store A each week, and roughly 2667 will visit
store B each week.
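The long-run behaviour can also be observed by simply applying 𝐴 week after week. The sketch below is an illustration, not part of the original activity: the function name step and the choice of 20 weeks are our own, and the starting vector is the 2000 visitors per store from answer 3.39.

```python
# Minimal sketch: iterate x_n = A x_(n-1) for the two-store model.
A = [[0.5, 0.25],
     [0.5, 0.75]]

def step(A, x):
    """Apply the 2 x 2 matrix A to the vector x = [store A, store B]."""
    return [A[0][0] * x[0] + A[0][1] * x[1],
            A[1][0] * x[0] + A[1][1] * x[1]]

x = [2000.0, 2000.0]     # visitors to each store in the first week
history = [x]
for week in range(20):   # 20 weeks is an arbitrary choice for illustration
    x = step(A, x)
    history.append(x)

for week, visitors in enumerate(history[:4]):
    print(week, visitors)
print("after 20 weeks:", history[-1])
```

The first few vectors match 𝑥⃗₁ and 𝑥⃗₂ from answer 3.40, and the later ones settle close to 4000/3 and 8000/3, in agreement with the values found using 𝑃𝐷ⁿ𝑃⁻¹.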
Chapter 4
Multivariate Functions
4.1 Relations
4.1.1 Relations
An ordered pair of real numbers will be written as (𝑥, 𝑦), where 𝑥 and 𝑦 are real numbers. The set of
all ordered pairs of real numbers will be denoted by ℝ².
We can produce plots of ordered pairs as shown in the following example.
Example. The ordered pairs (1, 2) and (−2, 1) are plotted below.
The first number in the ordered pair is represented on the horizontal 𝑥-axis and the second
is represented on the vertical 𝑦-axis.
A binary relation on ℝ is a set of ordered pairs of real numbers.
Although there are many other types of relations, in this section we will just use the term relation to
describe a binary relation on ℝ.
The relations we will consider will be sets of ordered pairs with a rule that relates the two numbers in
each ordered pair. The notation used is explained in the following example.
Example. The relation {(𝑥, 𝑦) ∈ ℝ² ∶ 𝑦 = 𝑥²} is the set of all ordered pairs (𝑥, 𝑦) such
that 𝑦 = 𝑥². The plot of a relation like this is the same as the plot of the function 𝑓 given
by 𝑓(𝑥) = 𝑥².
Although the above relation had the same plot as that of a function, it should be understood that
not all relations are functions. Relations are more general. Below you will find some examples of
relations that are not functions.
4.1.2 Circles
A circle centred at (ℎ, 𝑘) with radius 𝑟 can be described by the following relation:
{(𝑥, 𝑦) ∈ ℝ² ∶ (𝑥 − ℎ)² + (𝑦 − 𝑘)² = 𝑟²}
Example. The plot of the relation {(𝑥, 𝑦) ∶ 𝑥² + (𝑦 + 2)² = 4} is given below.
This is the circle centred at (0, −2) with radius 2.
A4.1
Question. Is the plot above the plot of a function?
4.1.3 Ellipses
An ellipse centred at (ℎ, 𝑘) can be described by the following relation:
{(𝑥, 𝑦) ∈ ℝ² ∶ (𝑥 − ℎ)²/𝑎² + (𝑦 − 𝑘)²/𝑏² = 1}
The constants 𝑎 and 𝑏 give a measure of how "wide" and how "tall" the ellipse is. The constant 𝑎 is
the horizontal distance between the leftmost (or rightmost) point on the ellipse and its centre, and 𝑏
is the vertical distance between the lowest (or highest) point on the ellipse and its centre.
The smaller of 𝑎 and 𝑏 is sometimes called the semi-minor axis length and the larger is called the
semi-major axis length. However, we won't be referring to these names, so you may just want to think
about their interpretations given above.
Example. The plot of the relation {(𝑥, 𝑦) ∈ ℝ² ∶ (𝑥 − 1)²/4 + (𝑦 − 1)² = 1} is given below.
This is an ellipse centred at (1, 1). Comparing with the general form above, we have 𝑎 = 2
and 𝑏 = 1. Compare these values with their interpretations in terms of distances from the
centre of the ellipse.
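The interpretation of 𝑎 and 𝑏 can be checked directly: the points at horizontal distance 𝑎 and vertical distance 𝑏 from the centre should satisfy the relation. Below is a minimal Python sketch of this check for the example above. The function name on_ellipse and the small tolerance are our own choices.

```python
def on_ellipse(x, y, h, k, a, b, tol=1e-9):
    """Return True if (x, y) satisfies (x - h)^2 / a^2 + (y - k)^2 / b^2 = 1,
    up to a small tolerance for floating point rounding."""
    return abs((x - h) ** 2 / a ** 2 + (y - k) ** 2 / b ** 2 - 1) < tol

# The ellipse from the example: centre (1, 1), a = 2, b = 1.
h, k, a, b = 1, 1, 2, 1

# Leftmost, rightmost, lowest and highest points predicted by a and b.
extreme_points = [(h - a, k), (h + a, k), (h, k - b), (h, k + b)]

for point in extreme_points:
    print(point, on_ellipse(*point, h, k, a, b))

# The centre itself is not on the ellipse.
print((h, k), on_ellipse(h, k, h, k, a, b))
```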
A4.2
(b)
A4.3
Question 2: Write the relations for the following two ellipses as sets.
(a)
A4.4
(b)
A4.5
• with mathematical symbols as 𝑓(𝑥, 𝑦) = 𝑥² + 𝑦² − 1 with domain ℝ²;
• as a plot, with the possible inputs on the two horizontal axes and the possible outputs
on the vertical axis.
Unlike the plots of functions of one variable, we need to specify which axis represents the
𝑥 values and which represents the 𝑦 values.
This function can be evaluated at any point in ℝ². For example,
𝑓(2, 3) = 2² + 3² − 1 = 12
and
𝑓(0, −5) = 0² + (−5)² − 1 = 24.
We can also consider real functions of three or more real variables. The domain of a real function of
three real variables will be some subset of ℝ³ (the set of all ordered triples (𝑥, 𝑦, 𝑧) of real numbers).
More generally, a real function of 𝑛 real variables will have as its domain some subset of ℝⁿ (the set of
all ordered 𝑛-tuples (𝑥₁, 𝑥₂, …, 𝑥ₙ) of real numbers). However, we will mostly restrict our attention
to real functions of two real variables.
The demonstration here allows you to change the orientation and position of a plane. The plane
is written in a slightly different form in this demonstration. However, if you rearrange the given
equation of the plane to write 𝑧 in terms of 𝑥 and 𝑦, then you will have your plane as a function in
the form given above. For example, the plane given by 4(𝑥 − 2) − 2𝑦 + 2𝑧 = 0 can be rearranged to
give 𝑓(𝑥, 𝑦) = 𝑧 = −2𝑥 + 𝑦 + 4.
The plot appears to be "flat" when 𝑥 = 2 and 𝑦 = 0. So (2, 0) is a stationary point.
To gain some intuition about what the three-dimensional plot of 𝑓 looks like, we can
produce several contour plots on the same set of axes. Each contour plot below is labelled
with the value of 𝑧 that was used.
This contour plot tells us a lot about what the three-dimensional plot of the function
looks like. The value of the function becomes larger as we move away from the point
(0, 1). We can deduce that there is a local minimum at (0, 1). The value of the function
at this local minimum is 𝑓(0, 1) = −3.
We can also say something about how "steep" the plot will be at certain points. Compare
the spacing between the 𝑧 = −2 and 𝑧 = 1 contour plots with the spacing between the
𝑧 = 10 and 𝑧 = 13 contour plots. The spacing between the first two is larger, which
means that in order for the value of 𝑓 to increase by 3, a greater distance needs to be
covered. Essentially, larger spacing between contour plots corresponds to slower growth
(and smaller spacing corresponds to more rapid growth).
So 𝑓 grows more quickly the further we move from (0, 1). This is because the spacing
between contours becomes smaller as we move away from this point.
Question. Does the function 𝑓 grow more rapidly as you move away from the point (1, 0) to the
right along the 𝑥-axis, or as you move down the line 𝑦 = 1? A4.8
The demonstration here allows you to modify a few functions and compare their three-dimensional
plots with their contour plots. Don't worry if some of those functions look complicated or unfamiliar.
Just focus on identifying features on the plots, in particular the stationary points.
4.4 Functions of several variables: Graphing activity
In this section, we will show you how to use WolframAlpha to produce plots of relations, functions of
two variables, and contour plots.
As usual, you can see more information about the relation by scrolling down the page. However, the
most important thing on this page is the plot of the relation.
If you wish to see the plot of this function on a larger part of the domain, you can specify this. For
example, you may wish to see the above plot for 𝑥-values between −3 and 2, and 𝑦 values between
−2 and 2.
By scrolling down, you will see a contour plot of this function.
Although it does not have the values of the function labelled for each contour, the colours can give
you some indication. Darker colours correspond to smaller values and lighter colours correspond to
larger values. Compare the contour plot with the three-dimensional plot. In particular, look at what
is happening around (0, 0) on each.
Spend some time coming up with your own functions and plot them on WolframAlpha. Compare the
contour plots with the three-dimensional plots and locate the stationary points of these functions.
(a) Does this function have any local maxima or local minima in the region shown? If so, at what
points in the domain do they occur? A4.16
(b) How does the value of the function change as you move along a diagonal line through the origin
(i.e. the point (0, 0))? A4.17
A4.18
(c) Would you consider the point (0, 0) to be a local maximum, a local minimum, neither or both?
Question 4: Consider the following vectors:
\[ \vec{u} = \begin{bmatrix} 2 \\ 3 \\ 1 \end{bmatrix} \qquad \vec{v} = \begin{bmatrix} x \\ y \\ 1 \end{bmatrix}. \]
A4.19
(a) Let 𝑓 be the function given by 𝑓(𝑥, 𝑦) = 𝑢⃗ ⋅ 𝑣.⃗ What type of function is 𝑓?
A4.20
(b) Describe in words the three-dimensional plot and the contour plot of 𝑓.
Question 5 (Quadratic Forms): Consider the following matrices:
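\[ A = \begin{bmatrix} 1 & 8 \\ 2 & 1 \end{bmatrix} \qquad B = \begin{bmatrix} 1 & 1 \\ 1 & 1 \end{bmatrix} \qquad C = \begin{bmatrix} 1 & 0 \\ 0 & 4 \end{bmatrix} \]
We will use these matrices to define the functions 𝑓, 𝑔 and ℎ. Each function has domain ℝ².
\[ f(x, y) = \begin{bmatrix} x & y \end{bmatrix}\begin{bmatrix} 1 & 8 \\ 2 & 1 \end{bmatrix}\begin{bmatrix} x \\ y \end{bmatrix} \]
\[ g(x, y) = \begin{bmatrix} x & y \end{bmatrix}\begin{bmatrix} 1 & 1 \\ 1 & 1 \end{bmatrix}\begin{bmatrix} x \\ y \end{bmatrix} \]
\[ h(x, y) = \begin{bmatrix} x & y \end{bmatrix}\begin{bmatrix} 1 & 0 \\ 0 & 4 \end{bmatrix}\begin{bmatrix} x \\ y \end{bmatrix} \]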
(a) Carry out the above matrix multiplication to find the rules for 𝑓, 𝑔 and ℎ. (Your answer will be a
1 × 1 matrix, which you can just treat as a real number). A4.21
A4.22
(b) Calculate the eigenvalues of 𝐴, 𝐵 and 𝐶.
(c) Produce contour plots of 𝑓, 𝑔 and ℎ. Locate their stationary points and identify the type of
each stationary point. (You may also wish to produce a three-dimensional plot to help you find the
stationary points). A4.23
(d) Write down some different 2 × 2 matrices and repeat the above steps. Do you think the eigenvalues
reveal anything about the nature of the stationary points? A4.24
Note that the blue tangent plane just touches the orange plot of 𝑓 at (1, 1), which is
marked in red.
The demonstration here allows you to view a tangent plane at various points on the plot of a function
of two variables.
In order to specify the orientation of this tangent plane, we need two numbers. One will describe the
slope in the direction of the 𝑥-axis and the other will describe the slope in the direction of the 𝑦-axis.
These quantities can be found using functions called partial derivatives.
The partial derivative in the 𝑥-direction is denoted by 𝑓𝑥 and is found by differentiating 𝑓 with respect
to 𝑥 as if it were a function of one variable, treating 𝑦 as a constant.
Similarly, the partial derivative in the 𝑦-direction is denoted by 𝑓𝑦 and is found by differentiating 𝑓
with respect to 𝑦, treating 𝑥 as a constant.
Example. Let 𝑓 be the function given by 𝑓(𝑥, 𝑦) = 𝑥²𝑦 + 𝑒^(3𝑦) with domain ℝ².
To compute 𝑓𝑥 , we differentiate 𝑓 with respect to 𝑥, treating 𝑦 as a constant. So
𝑓𝑥 (𝑥, 𝑦) = 2𝑥𝑦. Similarly, 𝑓𝑦 (𝑥, 𝑦) = 𝑥² + 3𝑒^(3𝑦).
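One way to gain confidence in a partial derivative is to compare it with a numerical estimate of the slope, changing one variable slightly while holding the other fixed. The Python sketch below does this for the example above. The helper name central_difference, the step size and the test point (1.5, 0.2) are our own choices for illustration.

```python
import math

def f(x, y):
    return x ** 2 * y + math.exp(3 * y)

def f_x(x, y):
    # Partial derivative with respect to x (y treated as a constant).
    return 2 * x * y

def f_y(x, y):
    # Partial derivative with respect to y (x treated as a constant).
    return x ** 2 + 3 * math.exp(3 * y)

def central_difference(g, x, y, wrt, h=1e-6):
    """Estimate a partial derivative of g at (x, y) by nudging one variable."""
    if wrt == "x":
        return (g(x + h, y) - g(x - h, y)) / (2 * h)
    return (g(x, y + h) - g(x, y - h)) / (2 * h)

point = (1.5, 0.2)
print(f_x(*point), central_difference(f, *point, wrt="x"))
print(f_y(*point), central_difference(f, *point, wrt="y"))
```

Each printed pair should agree to several decimal places.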
Partial derivatives can be thought of as a rate of change in a particular direction. When 𝑓𝑥 is positive
𝑓 increases as you move along the 𝑥-axis (keeping 𝑦 constant). Similarly, when 𝑓𝑦 is positive 𝑓
increases as you move along the 𝑦-axis. When the magnitude of 𝑓𝑥 or 𝑓𝑦 is large, 𝑓 is increasing or
decreasing more rapidly than when 𝑓𝑥 or 𝑓𝑦 is small.
Example. Consider the plot below of the function 𝑓 given by 𝑓(𝑥, 𝑦) = 1 − 𝑥³.
The partial derivatives are 𝑓𝑥 (𝑥, 𝑦) = −3𝑥² and 𝑓𝑦 (𝑥, 𝑦) = 0. We can evaluate these at
the point (0.5, 1).
𝑓𝑥 (0.5, 1) = −3/4
𝑓𝑦 (0.5, 1) = 0
Around (0.5, 1), 𝑓 decreases as you move along the 𝑥-axis in the positive direction, which
agrees with 𝑓𝑥 (0.5, 1) being negative. As you move along the 𝑦-axis, 𝑓 does not change,
agreeing with our calculation 𝑓𝑦 (0.5, 1) = 0.
The notion of a partial derivative can be easily extended to functions of any number of variables. The
following example illustrates this with a function of three variables.
Example. Consider the function 𝑓 of three variables given by 𝑓(𝑥, 𝑦, 𝑧) = 𝑥𝑦𝑧² + ln(𝑥𝑦).
The partial derivatives are given below.
\begin{align*}
f_x(x, y, z) &= yz^2 + \frac{1}{x} \\
f_y(x, y, z) &= xz^2 + \frac{1}{y} \\
f_z(x, y, z) &= 2xyz
\end{align*}
The concept of differentiability is slightly more complicated for a function of two (or more) variables.
Essentially, differentiability ensures that at each point, there exists a tangent plane which is "close" to
the function near that point. For this to occur, we need more than just the partial derivatives to exist.
This would only ensure that the tangent plane is close to the function as you travel towards that
point along the direction of the 𝑥 and 𝑦-axes. This needs to be true regardless of the direction you
travel. However, this is a somewhat technical point and we will not pursue it further. The functions
we will encounter will be differentiable on their entire domain.
The blue vector is the gradient vector at (1, 1). Note that it is perpendicular to the
contour lines near (1, 1). In particular, it points in the direction that the function grows
most rapidly from (1, 1).
Similarly, the orange vector points in the direction of most rapid growth away from (−1, 2).
This vector is also perpendicular to the contour lines near (−1, 2).
Questions.
1. Does the function in the previous example have a local maximum or a local minimum? If so,
where is it? A4.25
2. What do you notice about the gradient vectors in relation to your answer to Question 1? A4.26
See the demonstration here for several contour plots with gradient vectors superimposed.
As with partial derivatives, we can define the gradient vector for functions of 𝑛 variables. It is simply
the 𝑛 × 1 vector whose entries are all 𝑛 partial derivatives.
Example. To locate the stationary point(s) of 𝑓, which is given by 𝑓(𝑥, 𝑦) = 𝑦² + 4𝑥𝑦 − 8𝑥,
we will first find its partial derivatives. They are
𝑓𝑥 (𝑥, 𝑦) = 4𝑦 − 8 and 𝑓𝑦 (𝑥, 𝑦) = 2𝑦 + 4𝑥.
By setting both of these equal to zero, we arrive at the following system of equations:
4𝑦 = 8
2𝑦 + 4𝑥 = 0
We have several methods for solving systems of equations like this. Using any of these
methods will yield the solution 𝑥 = −1, 𝑦 = 2. So (−1, 2) is a stationary point of 𝑓.
This method works for functions of 𝑛 variables. However, the system of equations becomes harder to
solve when there are more variables.
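For the two-variable example above, the system is linear, so it can be solved with any of the earlier matrix methods. As an illustration, the Python sketch below uses the determinant-based formula for a 2 × 2 system (Cramer's rule), which is one possible choice rather than the method the text prescribes. The function name solve_2x2 is our own.

```python
from fractions import Fraction

def solve_2x2(a11, a12, a21, a22, r1, r2):
    """Solve  a11*x + a12*y = r1,  a21*x + a22*y = r2  exactly."""
    det = a11 * a22 - a12 * a21
    if det == 0:
        raise ValueError("the system has no unique solution")
    x = Fraction(r1 * a22 - a12 * r2, det)
    y = Fraction(a11 * r2 - a21 * r1, det)
    return x, y

# Setting the partial derivatives of f(x, y) = y^2 + 4xy - 8x to zero:
#   f_x = 0:  0*x + 4*y = 8
#   f_y = 0:  4*x + 2*y = 0
x, y = solve_2x2(0, 4, 4, 2, 8, 0)
print(x, y)

# Both partial derivatives vanish at the solution.
print(4 * y - 8, 2 * y + 4 * x)
```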
2𝑥 − 6 = 0
3𝑦² − 12 = 0.
Although this is not a linear system, we can still find its solutions. The first equation
gives 𝑥 = 3, and the second gives 𝑦² = 4, which has two solutions, 𝑦 = 2 and 𝑦 = −2. So
the two stationary points are (3, 2) and (3, −2).
We can find the second derivatives and hence the Hessian matrix.
\[ H(x, y) = \begin{bmatrix} 2 & 0 \\ 0 & 6y \end{bmatrix} \]
The determinant of the Hessian matrix is det(𝐻(𝑥, 𝑦)) = 12𝑦.
At the point (3, 2), we have det(𝐻(3, 2)) = 24 > 0 and 𝑓𝑥𝑥 (3, 2) = 2 > 0, so (3, 2) is a
local minimum.
We also have det(𝐻(3, −2)) = −24 < 0, so (3, −2) is a saddle point.
The plot below shows these two stationary points. Note that the saddle point at (3, −2)
is neither a local maximum nor a local minimum.
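The classification steps used in this example can be sketched in Python. This is an illustration under two assumptions: the mixed partial derivatives are equal (as they are here), and the test is reported as inconclusive when the determinant is zero. The function names are our own.

```python
def classify(f_xx, f_xy, f_yy):
    """Second derivative test at a stationary point, given the entries of the
    Hessian there (assuming f_xy = f_yx)."""
    det = f_xx * f_yy - f_xy ** 2
    if det < 0:
        return "saddle point"
    if det > 0:
        return "local minimum" if f_xx > 0 else "local maximum"
    return "inconclusive"

def hessian_entries(x, y):
    """Entries (f_xx, f_xy, f_yy) of the Hessian from the example: [[2, 0], [0, 6y]]."""
    return 2, 0, 6 * y

for point in [(3, 2), (3, -2)]:
    print(point, classify(*hessian_entries(*point)))
```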
4.6.5 Notation
Here we're using 𝑓𝑥 (𝑥, 𝑦) to represent the partial derivative of a function 𝑓(𝑥, 𝑦) with respect to 𝑥
and so on. In the other common notation the partial derivative of a function 𝑓(𝑥, 𝑦) with respect to
𝑥 is denoted by ∂𝑓/∂𝑥.
(b) Recall that in the previous activity, we found that 𝑓 had no stationary points. Explain why this is
the case using the partial derivatives. A4.28
Question 2: Recall Question 2 from the previous activity with 𝑓 given by 𝑓(𝑥, 𝑦) = 𝑥² − 6𝑥 + 4𝑦² + 8𝑦 + 5
with domain ℝ².
A4.29
(a) Find the partial derivatives 𝑓𝑥 and 𝑓𝑦 .
A4.30
(b) Use the partial derivatives to find all stationary points.
A4.31
(c) Find the Hessian matrix of 𝑓.
(d) Use the Hessian matrix to classify the stationary point found earlier. Compare this with your
answer from Question 2 in the previous activity. A4.32
Question 3: Consider the function 𝑓 given by 𝑓(𝑥, 𝑦) = 𝑥𝑦 − 𝑥 with domain ℝ2 .
A4.33
(a) Find and classify all stationary points of 𝑓.
(b) Produce a three-dimensional plot and a contour plot of 𝑓 (using software) to see the behaviour of
the function around this stationary point. A4.34
(c) Evaluate the gradient vector at the points (0.5, 0.5), (−0.5, 0.5) and (0, 1.25). Is this what you
expected based on your contour plot? A4.35
Question 4 (Least Squares): Below is a table containing the number of children of three people
along with the number of bedrooms in their house.
Number of children 0 1 2
Number of bedrooms 2 3 5
A plot of the data is also given with the number of children on the horizontal axis and the number of
bedrooms on the vertical axis.
We would like to model the relationship between these two quantities using a linear function (of one
variable). The function will be of the form 𝑓(𝑥) = 𝑚𝑥 + 𝑏 where 𝑥 is the number of children and
𝑓(𝑥) is the number of bedrooms. The aim of this question is to find suitable values for 𝑚 and 𝑏.
(a) For each potential choice of 𝑚 and 𝑏, define the residual at each data point to be the difference
between the actual number of bedrooms and the number of bedrooms predicted by the function 𝑓 for
that data point. Define the squared error of each data point to be the square of the residual. Find
the residual and the squared error for the third data point in the table. A4.36
(b) Now define a function 𝑔 of two variables to be the sum of the squared errors of all three data
points. The two variables are 𝑚 and 𝑏. Write down the rule for 𝑔(𝑚, 𝑏). A4.37
(c) Since 𝑔 represents in some sense the overall error of prediction, a suitable choice for 𝑚 and 𝑏
would be the values which make 𝑔 as small as possible. This suggests that we should look for a local
minimum. Find the partial derivatives (𝑔𝑚 and 𝑔𝑏 ) of 𝑔. A4.38
A4.39
(d) Find and classify all stationary points of 𝑔.
(e) Using the values of 𝑚 and 𝑏 found in your previous answer, produce a plot of 𝑓. Compare this
with the plot of the data above to see how well the function 𝑓 agrees with the data. If possible, plot
the data and the function together. A4.40
(f) Using the function 𝑓, predict the number of bedrooms in a house owned by someone with three
children. A4.41
4.8 Answers
4.1 No. A function can only output one value for each input. The inputs of a function are
represented on the horizontal axis, and the outputs on the vertical axis. For example, we cannot
consider 0 to be an input because the plot above would give both −4 and 0 as outputs.
4.2 This circle is centred at (1, −2) and has a radius of 2. As a set, the relation which this
represents is
{(𝑥, 𝑦) ∈ ℝ² ∶ (𝑥 − 1)² + (𝑦 + 2)² = 4}.
4.3 This circle is centred at (0, −1) and has a radius of 1/2. As a set, the relation which this
represents is
{(𝑥, 𝑦) ∈ ℝ² ∶ 𝑥² + (𝑦 + 1)² = 1/4}.
4.4 This ellipse is centred at (1, 1). The horizontal distance from the centre of the ellipse to the
rightmost point is 4, and the vertical distance to the lowest point is 2. As a set, the relation
which this ellipse represents is
{(𝑥, 𝑦) ∈ ℝ² ∶ (𝑥 − 1)²/16 + (𝑦 − 1)²/4 = 1}.
4.5 This ellipse is centred at (−3, 0). The horizontal distance from the centre of the ellipse to the
rightmost point is 1, and the vertical distance to the lowest point is 3. As a set, the relation
which this ellipse represents is
{(𝑥, 𝑦) ∈ ℝ² ∶ (𝑥 + 3)² + 𝑦²/9 = 1}.
2.
4.8 The vertical spacing between contour plots is smaller than the horizontal spacing. So the
growth as we move along 𝑦 = 1 is more rapid than when we move along the 𝑥-axis.
4.9 For 𝑧 = −8, we need to plot the relation described by 𝑓(𝑥, 𝑦) = −8. Simplifying this gives
𝑦 = −2𝑥² − 2. So we need to plot the relation
{(𝑥, 𝑦) ∈ ℝ² ∶ 𝑦 = −2𝑥² − 2}.
The other level sets can be found similarly. For example, the 𝑧 = −4 level set is
{(𝑥, 𝑦) ∈ ℝ² ∶ 𝑦 = −2𝑥² − 1}.
The plots of these level sets are given below.
4.10 Around the point (1, −1), the horizontal distance between contours is smaller than the
vertical distance. So the function grows more rapidly when moving in a horizontal direction.
4.11 There are no points where the function has a greater value at all surrounding points. So
there is no local minimum. By the same reasoning, there is no local maximum.
4.12 The zeroes all lie on the 𝑧 = 0 contour line. These are the points where 𝑓(𝑥, 𝑦) = 0.
4.13 Expanding (𝑥 − 3)² gives 𝑥² − 6𝑥 + 9 and expanding (𝑦 + 1)² gives 𝑦² + 2𝑦 + 1.
(𝑥 − 3)²/16 + (𝑦 + 1)²/4 = 1.
Both contours are ellipses. The contour plot is given below.
4.15 The contour plot shows that the value of the function decreases as you approach the centre
of the ellipses, at (3, −1). So (3, −1) will be a stationary point, in fact a local minimum. Below
is a three-dimensional plot of 𝑓 confirming this.
4.16 The point (1, 1) is a local maximum. This is because as you approach (1, 1), the value of the
function increases. Similarly, (−1, −1) is also a local maximum. Both (1, −1) and (−1, 1) are
local minima because the value of the function decreases as you approach those points.
4.17 If the diagonal line has a positive gradient (i.e. is moving from the bottom-left to the
top-right), then the value of the function will decrease to 0 and then increase again. If the
diagonal line has a negative gradient, then the value of the function will increase to 0 and then
decrease.
4.18 The point (0, 0) is neither a local maximum nor a local minimum. This is because there are
some points nearby where the value of the function is larger, and there are other points nearby
where the value of the function is smaller. A point like this is often called a saddle point.
4.19 Since 𝑓(𝑥, 𝑦) = 2𝑥 + 3𝑦 + 1, 𝑓 is a linear function.
4.20 The three-dimensional plot is a plane. The contour plot is a collection of parallel lines.
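4.21 The calculations for 𝑓 are given below.
\begin{align*}
f(x, y) &= \begin{bmatrix} x & y \end{bmatrix}\begin{bmatrix} x + 8y \\ 2x + y \end{bmatrix} \\
&= \begin{bmatrix} x(x + 8y) + y(2x + y) \end{bmatrix} \\
&= x^2 + 10xy + y^2
\end{align*}
𝑔(𝑥, 𝑦) = 𝑥² + 2𝑥𝑦 + 𝑦²
ℎ(𝑥, 𝑦) = 𝑥² + 4𝑦²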
4.22 As per the last section, we calculate eigenvalues by solving the characteristic equation
det(𝜆𝐼 − 𝐴) = 0.
• The eigenvalues of 𝐴 are 5 and −3.
• The eigenvalues of 𝐵 are 2 and 0.
• The eigenvalues of 𝐶 are 4 and 1.
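For a 2 × 2 matrix with entries 𝑎, 𝑏 (top row) and 𝑐, 𝑑 (bottom row), the characteristic equation det(𝜆𝐼 − 𝐴) = 0 expands to 𝜆² − (𝑎 + 𝑑)𝜆 + (𝑎𝑑 − 𝑏𝑐) = 0, which can be solved with the quadratic formula. The Python sketch below applies this to the three matrices from Question 5. The function name eigenvalues_2x2 is our own, and the sketch assumes the eigenvalues are real.

```python
import math

def eigenvalues_2x2(M):
    """Solve lambda^2 - (a + d)*lambda + (ad - bc) = 0 for M = [[a, b], [c, d]].
    Assumes the eigenvalues are real (non-negative discriminant)."""
    (a, b), (c, d) = M
    trace = a + d
    det = a * d - b * c
    discriminant = trace ** 2 - 4 * det
    if discriminant < 0:
        raise ValueError("the eigenvalues are not real")
    root = math.sqrt(discriminant)
    return (trace + root) / 2, (trace - root) / 2

A = [[1, 8], [2, 1]]
B = [[1, 1], [1, 1]]
C = [[1, 0], [0, 4]]

for name, M in [("A", A), ("B", B), ("C", C)]:
    print(name, eigenvalues_2x2(M))
```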
4.23 The contour plot for 𝑓 given below reveals a stationary point at (0, 0). This is a saddle point
because the function may increase or decrease as you move towards (0, 0), depending on the
direction you approach it.
The contour plot for 𝑔 given below is a little harder to decipher. There are in fact infinitely many
stationary points along the line with a slope of −1 passing through (0, 0). A three-dimensional
plot of 𝑔 may help here. It somewhat resembles a half-pipe.
The contour plot of ℎ is given below. There is a stationary point at (0, 0). Notice that this is a
local minimum since no matter what direction you approach (0, 0), the value of the function
decreases.
4.24 The signs of the eigenvalues reveal the nature of the stationary point(s). For example, if both
eigenvalues are positive then there will be a local minimum.
4.25 There is a local maximum at (0, 0).
4.26 They point in the general direction of this local maximum. This makes sense because the
gradient vectors point in the direction of the most rapid growth of the function. This would
lead towards a local maximum, provided that the function actually has a local maximum.
4.27
𝑓𝑥 (𝑥, 𝑦) = 16𝑥
𝑓𝑦 (𝑥, 𝑦) = 4
4.28 A stationary point is a point (𝑥, 𝑦) where both 𝑓𝑥 (𝑥, 𝑦) = 0 and 𝑓𝑦 (𝑥, 𝑦) = 0. But 𝑓𝑦 (𝑥, 𝑦) is
always equal to 4. So even though 𝑓𝑥 (𝑥, 𝑦) is equal to 0 when 𝑥 = 0, there are no points (𝑥, 𝑦)
where both 𝑓𝑥 (𝑥, 𝑦) = 0 and 𝑓𝑦 (𝑥, 𝑦) = 0.
4.29
𝑓𝑥 (𝑥, 𝑦) = 2𝑥 − 6
𝑓𝑦 (𝑥, 𝑦) = 8𝑦 + 8
4.30 We need to solve 𝑓𝑥 (𝑥, 𝑦) = 0 and 𝑓𝑦 (𝑥, 𝑦) = 0. This is a linear system of equations.
2𝑥 − 6 = 0
8𝑦 + 8 = 0
Solving this, we find 𝑥 = 3 and 𝑦 = −1. So (3, −1) is the only stationary point.
4.31 We first need the second partial derivatives.
𝑓𝑥𝑥 (𝑥, 𝑦) = 2
𝑓𝑥𝑦 (𝑥, 𝑦) = 0
𝑓𝑦𝑥 = 0
𝑓𝑦𝑦 (𝑥, 𝑦) = 8
𝑓𝑥 (𝑥, 𝑦) = 𝑦 − 1
𝑓𝑦 (𝑥, 𝑦) = 𝑥
The stationary points are found by solving 𝑓𝑥 (𝑥, 𝑦) = 0 and 𝑓𝑦 (𝑥, 𝑦) = 0. This gives 𝑥 = 0 and
𝑦 = 1. So (0, 1) is a stationary point.
The second partial derivatives are
𝑓𝑥𝑥 (𝑥, 𝑦) = 0
𝑓𝑥𝑦 (𝑥, 𝑦) = 1
𝑓𝑦𝑥 (𝑥, 𝑦) = 1
𝑓𝑦𝑦 (𝑥, 𝑦) = 0
\[ \begin{bmatrix} 0 & 1 \\ 1 & 0 \end{bmatrix}. \]
Its determinant is det(𝐻(𝑥, 𝑦)) = −1. Since det(𝐻(0, 1)) < 0, the point (0, 1) is a saddle point.
4.34
4.35 The partial derivatives have already been found, which gives
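\[ \nabla f(x, y) = \begin{bmatrix} y - 1 \\ x \end{bmatrix}. \]
In particular,
\[ \nabla f(0.5, 0.5) = \begin{bmatrix} -0.5 \\ 0.5 \end{bmatrix}, \quad \nabla f(-0.5, 0.5) = \begin{bmatrix} -0.5 \\ -0.5 \end{bmatrix} \quad\text{and}\quad \nabla f(0, 1.25) = \begin{bmatrix} 0.25 \\ 0 \end{bmatrix}. \]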
Below is the previous contour plot with the gradient vectors superimposed.
This is to be expected because the gradient vector will point in the direction in which the
function increases most rapidly. So you should expect to see the gradient vector pointing
towards contour curves with higher values.
4.36 The residual is
5 − (2𝑚 + 𝑏)
and the squared error is
(5 − (2𝑚 + 𝑏))².
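4.37 Some routine algebraic manipulation gives
𝑔(𝑚, 𝑏) = (2 − 𝑏)² + (3 − (𝑚 + 𝑏))² + (5 − (2𝑚 + 𝑏))² = 5𝑚² + 6𝑚𝑏 + 3𝑏² − 26𝑚 − 20𝑏 + 38.
4.38
𝑔𝑚 (𝑚, 𝑏) = 10𝑚 + 6𝑏 − 26
𝑔𝑏 (𝑚, 𝑏) = 6𝑚 + 6𝑏 − 20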
4.39 The stationary points are found by solving 𝑔𝑚 (𝑚, 𝑏) = 0 and 𝑔𝑏 (𝑚, 𝑏) = 0. This gives a
linear system.
\[ \begin{bmatrix} 10 & 6 \\ 6 & 6 \end{bmatrix}\begin{bmatrix} m \\ b \end{bmatrix} = \begin{bmatrix} 26 \\ 20 \end{bmatrix} \]
𝑔𝑚𝑚 (𝑚, 𝑏) = 10
𝑔𝑚𝑏 (𝑚, 𝑏) = 6
𝑔𝑏𝑚 (𝑚, 𝑏) = 6
𝑔𝑏𝑏 (𝑚, 𝑏) = 6.
\[ \begin{bmatrix} 10 & 6 \\ 6 & 6 \end{bmatrix}. \]
Its determinant is det(𝐻(𝑚, 𝑏)) = 24. Since det(𝐻(3/2, 11/6)) > 0 and 𝑔𝑚𝑚 (3/2, 11/6) = 10 > 0, the point (3/2, 11/6) is a
local minimum.
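The whole least squares calculation can be reproduced in a few lines of Python. The sketch below is an illustration using the three data points from Question 4. It builds the linear system obtained by setting both partial derivatives of 𝑔 to zero and solves it exactly with fractions. The variable names are our own.

```python
from fractions import Fraction

# (number of children, number of bedrooms) from the table in Question 4.
data = [(0, 2), (1, 3), (2, 5)]

def g(m, b):
    """Sum of squared errors of the line f(x) = m*x + b over the data."""
    return sum((y - (m * x + b)) ** 2 for x, y in data)

n = len(data)
sum_x = sum(x for x, _ in data)
sum_y = sum(y for _, y in data)
sum_xx = sum(x * x for x, _ in data)
sum_xy = sum(x * y for x, y in data)

# g_m = 0 and g_b = 0 give the linear system
#   (2*sum_xx)*m + (2*sum_x)*b = 2*sum_xy
#   (2*sum_x)*m  + (2*n)*b     = 2*sum_y
a11, a12, r1 = 2 * sum_xx, 2 * sum_x, 2 * sum_xy
a21, a22, r2 = 2 * sum_x, 2 * n, 2 * sum_y

det = a11 * a22 - a12 * a21
m = Fraction(r1 * a22 - a12 * r2, det)
b = Fraction(a11 * r2 - a21 * r1, det)

print("system:", (a11, a12, r1), (a21, a22, r2))
print("m =", m, " b =", b, " g(m, b) =", g(m, b))
```

The coefficients of the system match the matrix equation in answer 4.39, and nudging 𝑚 or 𝑏 away from the solution increases 𝑔, as expected at a minimum.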
4.40
4.41 The function 𝑓 can be written as
𝑓(𝑥) = (3/2)𝑥 + 11/6.
Chapter 5
5.1 Enumeration
It is often very useful to be able to count (or enumerate) how many ways something can occur. For
example, it may be important to know how many different possible barcodes there are under some
barcoding system or how many possible keys there are for some encryption algorithm.
1. How many ways are there for you to choose a three course meal? A5.1
2. You're not very hungry and decide to just choose one item from the menu. How many ways are
there to do this? A5.2
\[ n(n-1)\cdots(n-r+1) = \frac{n!}{(n-r)!}. \]
What if our reviewer instead chose an unordered top three? In how many ways could she do that?
More generally, how many ways are there to choose (without order) 𝑟 objects from a set of 𝑛 objects?
For every unordered list our reviewer could make there are 3! = 6 corresponding possible ordered
lists. And we've seen that she could make 10 × 9 × 8 ordered lists. So the number of unordered lists
she could make is (10 × 9 × 8)/6 = 120.
For every combination of 𝑟 elements from a set of 𝑛 elements there are 𝑟! corresponding permutations.
So, using our formula for the number of permutations we have the following.
The number of unordered selections without repetition of 𝑟 objects from a set of 𝑛 objects (0 ≤ 𝑟 ≤ 𝑛)
is
\[ \frac{n(n-1)\cdots(n-r+1)}{r!} = \frac{n!}{r!\,(n-r)!} = \binom{n}{r}. \]
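Both counting formulas are available in Python's standard library as math.perm (ordered selections without repetition) and math.comb (unordered selections without repetition), from Python 3.8 onwards. The short sketch below checks them against the reviewer example, where 𝑛 = 10 and 𝑟 = 3.

```python
import math

n, r = 10, 3

# Ordered top three: n(n - 1)(n - 2) = n! / (n - r)!
ordered = math.perm(n, r)

# Unordered top three: divide by the r! orderings of each selection.
unordered = math.comb(n, r)

print(ordered, unordered)

# The two formulas from the text, written out with factorials.
print(math.factorial(n) // math.factorial(n - r))
print(math.factorial(n) // (math.factorial(r) * math.factorial(n - r)))
```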
1. Eight runners are in a race. How many possibilities are there for the first, second and third
place getters? A5.3
2. How many ways are there to choose a team of 11 from a squad of 18? A5.4
3. How many ways are there to choose a team of 11 and a captain and vice-captain for that team
from a squad of 18? A5.5
2. Confirm that the formula for unordered selections with repetition works in the case 𝑟 = 2 and
𝑛 = 3. A5.7
Question 2: For each of the following scenarios, decide whether the selections are ordered or
unordered, and whether or not they include repetitions. Then evaluate the number of selections.
1. A family of 7 are out to lunch and each of them orders one dish from a menu of 12 dishes. From
the chef's perspective, how many orders are possible? A5.11
2. A company is setting up a new department and they have already chosen 20 employees to work
in this department. They need to assign 4 of these people to the leadership team. How many
different leadership teams are possible? A5.12
3. The company from the previous question now decides that the leadership team will consist of a
department head, an assistant department head, a training supervisor and an assistant training
supervisor. How many different leadership teams are possible now? A5.13
4. You are opening an email account and you must select a password. Your password is required
to be at least 8 characters long and you are only allowed to use lower case letters, upper case
letters, and numbers. However, you aren't very good at remembering long passwords, so you
decide that your password will be at most 11 characters long. How many different passwords
can you choose from? (Hint: Leave your answer as an expression in terms of the formulae in
the previous lesson.) A5.14
Question 3: Suppose that you would like to travel from the red point to the blue point in the grid
below according to the following three rules:
1. You must travel along the horizontal and vertical grid lines, one step at a time.
2. You may only travel to the right or upwards.
3. You must not take two consecutive vertical steps.
How many different paths can you take? A5.15 A5.16
5.3 Probability
Probability gives us a way to model random processes mathematically. These processes could be
anything from the rolling of dice, to radioactive decay of atoms, to the performance of a stock
market index. The mathematical environment we work in when dealing with probabilities is called a
probability space.
The spinner above might be modeled by a probability space with sample space 𝑆 =
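{1, 2, 3, 4} and probability function given as follows.
$$\Pr(s) = \begin{cases} \frac{1}{2} & \text{for } s = 1 \\ \frac{1}{4} & \text{for } s = 2 \\ \frac{1}{8} & \text{for } s = 3 \\ \frac{1}{8} & \text{for } s = 4. \end{cases}$$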
It can be convenient to give this as a table:
𝑠       1    2    3    4
Pr(𝑠)   1/2  1/4  1/8  1/8
Example: Rolling a fair six-sided die could be modelled by a probability space with
sample space 𝑆 = {1, 2, 3, 4, 5, 6} and probability function Pr given as follows.
𝑠       1    2    3    4    5    6
Pr(𝑠)   1/6  1/6  1/6  1/6  1/6  1/6
A sample space like this one where every outcome has an equal probability is sometimes called a
uniform sample space. Outcomes from a uniform sample space are said to have been taken uniformly
at random.
Questions.
1. A game is played with a 6-sided die that has three faces numbered 1, two faces numbered 3,
and one face numbered 5. Create a probability space that models a single roll of this die. A5.17
2. A dial in a poker machine is modelled by a probability space with sample space
{lemon, orange, apple, cherry} and Pr(lemon) = 1/3, Pr(orange) = 1/3, and Pr(apple) = 1/4.
What is Pr(cherry)? A5.18
5.3.2 Events
An event is just a collection of outcomes we are interested in for some reason. Formally it is a subset
of the sample space.
Example: In the die rolling example with 𝑆 = {1, 2, 3, 4, 5, 6}, we could define the event
of rolling at least a 3. Formally, this would be the set {3, 4, 5, 6}. We could also define
the event of rolling an odd number as the set {1, 3, 5}.
The probability of an event 𝐴 is the sum of the probabilities of the outcomes in 𝐴.
Pr(𝐴) = ∑𝑥∈𝐴 Pr(𝑥).
Example: In the spinner example, for the event 𝐴 = {1, 2, 4}, we have
$$\Pr(A) = \Pr(1) + \Pr(2) + \Pr(4) = \frac{1}{2} + \frac{1}{4} + \frac{1}{8} = \frac{7}{8}.$$
In a uniform sample space (where all outcomes are equally likely) the probability of an event 𝐴 can
be calculated as:
$$\Pr(A) = \frac{\text{number of outcomes in } A}{\text{total number of outcomes}} = \frac{|A|}{|S|}.$$
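As an illustration of these definitions, a discrete probability space can be sketched in Python as a dictionary from outcomes to probabilities. The names `spinner`, `die`, `is_probability_space` and `event_probability` below are our own, and exact fractions are used to avoid rounding.

```python
from fractions import Fraction

# The spinner and the fair die from the examples above.
spinner = {1: Fraction(1, 2), 2: Fraction(1, 4), 3: Fraction(1, 8), 4: Fraction(1, 8)}
die = {face: Fraction(1, 6) for face in range(1, 7)}


def is_probability_space(space):
    """Probabilities must be non-negative and sum to 1."""
    return all(p >= 0 for p in space.values()) and sum(space.values()) == 1


def event_probability(space, event):
    """Pr(A) is the sum of Pr(x) over the outcomes x in the event A."""
    return sum((space[outcome] for outcome in event), Fraction(0))


print(event_probability(spinner, {1, 2, 4}))   # the event A = {1, 2, 4}
print(event_probability(die, {3, 4, 5, 6}))    # rolling at least a 3
```

For the uniform die, the second result coincides with |𝐴|/|𝑆|, as the formula above predicts.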
𝐴 = {111, 110, 101, 100}      Pr(𝐴) = 1/2
𝐵 = {111, 110, 011, 010}      Pr(𝐵) = 1/2
𝐶 = {110, 101, 011}           Pr(𝐶) = 3/8
𝐴 and 𝐵 = {111, 110}          Pr(𝐴 and 𝐵) = 1/4
𝐴 and 𝐶 = {110, 101}          Pr(𝐴 and 𝐶) = 1/4.
5.3.4 Warning
Here, we've only been discussing discrete probability where we have a finite number of different possible
outcomes. Some of the definitions and results we state apply only in this case. Our definition of a
probability space, for example, is actually the definition of a discrete probability space, and so on.
The discrete setting provides a good environment to learn most of the vital concepts and intuitions of
probability theory.
We can also consider probabilities where there is an infinite continuum of possible outcomes (for
example, a randomly selected person might be 170cm tall, or 170.1cm tall or 170.001cm tall and so
on). We'll discuss this briefly later on.
(b) How would you assign probabilities to each outcome? A5.37
(c) Let 𝐴 be the event that you roll a number less than 5 and let 𝐵 be the event that you toss tails.
Which event is more likely? A5.38
(d) Are 𝐴 and 𝐵 mutually exclusive? A5.39
(e) Without doing any calculations, do you expect 𝐴 and 𝐵 to be independent? A5.40
(f) Use calculations to determine whether 𝐴 and 𝐵 are independent. A5.41
(g) Notice that Pr(𝐴) + Pr(𝐵) is greater than 1. Should this concern you? Does this mean that
Pr(𝐴 or 𝐵) is greater than 1? A5.42
Initially, you feel confident because the circle 𝑃 takes up a small proportion of the rectangle. But
when you learn that your randomly selected person is in the circle 𝑇, you feel bad because the circle
𝑃 covers almost all of 𝑇. In mathematical language, the probability that a random Melbournian is a
Python coder is low, but the probability that a random Melbournian is a Python coder given that
they own a "Hello, world!" t-shirt is high.
This definition also implies that
Pr(𝐴 and 𝐵) = Pr(𝐴|𝐵) Pr(𝐵).
Example: The spinner from the last chapter is spun.
Let 𝐴 be the event that the result was at least 3 and 𝐵 be the
event that the result was even. What is Pr(𝐴|𝐵)?
Thus,
$$\Pr(A \mid B) = \frac{\Pr(A \text{ and } B)}{\Pr(B)} = \frac{1/8}{3/8} = \frac{1}{3}.$$
Example: A binary string of length 6 is generated uniformly at random. Let 𝐴 be the
event that the first bit is a 1 and 𝐵 be the event that the string contains two 1s. What is
Pr(𝐴|𝐵)?
There are $2^6$ strings in our sample space. Now 𝐴 and 𝐵 occurs when the first bit is a 1
and the rest of the string contains one 1. There are $\binom{5}{1}$ such strings and so
$\Pr(A \text{ and } B) = \binom{5}{1}/2^6$. Also, there are $\binom{6}{2}$ strings containing two 1s
and so $\Pr(B) = \binom{6}{2}/2^6$. Thus,
$$\Pr(A \mid B) = \frac{\Pr(A \text{ and } B)}{\Pr(B)} = \frac{\binom{5}{1}}{\binom{6}{2}} = \frac{1}{3}.$$
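Because the sample space here is small and uniform, the binary string example can be confirmed by listing all 64 strings. This is only a sketch; the helper name `pr` is our own.

```python
from fractions import Fraction
from itertools import product

# All binary strings of length 6, each equally likely.
strings = ["".join(bits) for bits in product("01", repeat=6)]

A = {s for s in strings if s[0] == "1"}          # first bit is a 1
B = {s for s in strings if s.count("1") == 2}    # exactly two 1s


def pr(event):
    """Probability of an event in the uniform space: |event| / |S|."""
    return Fraction(len(event), len(strings))


pr_a_given_b = pr(A & B) / pr(B)   # Pr(A|B) = Pr(A and B) / Pr(B)
print(len(A & B), len(B), pr_a_given_b)
```

The set sizes printed correspond to the two binomial coefficients in the worked solution.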
Questions.
For the following events 𝐴 and 𝐵, do you think Pr(𝐴|𝐵) > Pr(𝐴), Pr(𝐴|𝐵) < Pr(𝐴) or Pr(𝐴|𝐵) =
Pr(𝐴)?
1. A standard die is rolled. 𝐴 is the event that the roll is a 6, and 𝐵 is the event that the roll is
even. A5.43
2. A random American is chosen. 𝐴 is the event that they are less than 6 ft tall and 𝐵 is the
event that they are a professional basketball player. A5.44
3. For a Manchester United vs Manchester City soccer game, 𝐴 is the event that City win the
game and 𝐵 is the event that United lead 3-0 at half time. A5.45
5.5.3 Independent repeated trials
Generally if we perform exactly the same action multiple times, we assume that the results for each
trial will be independent of the others. For example, if we roll a die twice, then the result of the first
roll will be independent of the result of the second.
For two independent repeated trials, each from a sample space 𝑆, our overall sample space is 𝑆 × 𝑆 and
our probability function will be given by Pr((𝑠1 , 𝑠2 )) = Pr(𝑠1 ) Pr(𝑠2 ). For three independent repeated
trials the sample space is 𝑆 ×𝑆 ×𝑆 and the probability function Pr((𝑠1 , 𝑠2 , 𝑠3 )) = Pr(𝑠1 ) Pr(𝑠2 ) Pr(𝑠3 ),
and so on.
Example: The spinner from the previous example is spun twice. What is the probability
that the results add to 5?
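A total of 5 can be obtained as (1, 4), (4, 1), (2, 3) or (3, 2). Because the spins are
independent:
$$\Pr((1,4)) = \Pr((4,1)) = \frac{1}{2} \times \frac{1}{8} = \frac{1}{16}, \qquad \Pr((2,3)) = \Pr((3,2)) = \frac{1}{4} \times \frac{1}{8} = \frac{1}{32}.$$
So, because (1, 4), (4, 1), (2, 3) and (3, 2) are mutually exclusive, the probability of the
total being 5 is
$$\frac{1}{16} + \frac{1}{16} + \frac{1}{32} + \frac{1}{32} = \frac{3}{16}.$$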
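The product rule for independent repeated trials can be sketched as follows, assuming the spinner has the probabilities 1/2, 1/4, 1/8, 1/8 given earlier. The function name `pr_total` is ours.

```python
from fractions import Fraction
from itertools import product
from math import prod

spinner = {1: Fraction(1, 2), 2: Fraction(1, 4), 3: Fraction(1, 8), 4: Fraction(1, 8)}


def pr_total(total, spins):
    """Probability that `spins` independent spins add up to `total`."""
    return sum(
        (
            prod(spinner[s] for s in outcome)      # Pr((s1, ..., sk)) = Pr(s1) ... Pr(sk)
            for outcome in product(spinner, repeat=spins)
            if sum(outcome) == total               # keep only the outcomes in the event
        ),
        Fraction(0),
    )


print(pr_total(5, 2))
```

The same function works for any number of spins, since the sample space is built as a repeated Cartesian product of 𝑆 with itself.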
Questions.
1. If the spinner from the previous example is spun three times, what is the probability that the
results add to 4? A5.46
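Note that the denominator above is simply an expression for Pr(𝐵). The fact that
$$\Pr(B) = \Pr(B \mid A)\Pr(A) + \Pr(B \mid \overline{A})\Pr(\overline{A})$$
is due to the law of total probability (here $\overline{A}$ denotes the event that 𝐴 does not occur).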
Example: Luke Skywalker discovers that some porgs have an extremely rare genetic
mutation that makes them powerful force users. He develops a test for this mutation that
is right 99% of the time and decides to test all the porgs on Ahch-To. Suppose there are
100 mutant porgs in the population of 24 million. We would guess that the test would
come up positive for 99 of the 100 mutants, but also for 239 999 non-mutants.
We are assuming that the conditional probability of a porg testing positive given it's a
mutant is 0.99. But what is the conditional probability of it being a mutant given that it
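tested positive? From our guesses, we would expect this to be 99/(99 + 239 999) ≈ 0.0004. Bayes'
theorem gives us a way to formalise this:
$$\begin{aligned}
\Pr(M \mid P) &= \frac{\Pr(M)\Pr(P \mid M)}{\Pr(M)\Pr(P \mid M) + \Pr(\overline{M})\Pr(P \mid \overline{M})} \\
&= \frac{\frac{100}{24000000} \times 0.99}{\frac{100}{24000000} \times 0.99 + \left(1 - \frac{100}{24000000}\right) \times 0.01} \\
&= \frac{99}{99 + 239999} \\
&\approx 0.0004.
\end{aligned}$$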
Example: A binary string is created so that the first bit is a 0 with probability 1/3 and
then each subsequent bit is the same as the preceding one with probability 3/4. What is
the probability that the first bit is 0, given that the second bit is 0?
Let 𝐹 be the event that the first bit is 0 and let 𝑆 be the event that the second bit is
0. So Pr(𝐹) = 1/3. If 𝐹 occurs then the second bit will be 0 with probability 3/4 and so
Pr(𝑆|𝐹) = 3/4. If 𝐹 does not occur then the second bit will be 0 with probability 1/4 and so
$\Pr(S \mid \overline{F}) = 1/4$. So, by Bayes' theorem,
$$\begin{aligned}
\Pr(F \mid S) &= \frac{\Pr(F)\Pr(S \mid F)}{\Pr(F)\Pr(S \mid F) + \Pr(\overline{F})\Pr(S \mid \overline{F})} \\
&= \frac{\frac{1}{3} \times \frac{3}{4}}{\frac{1}{3} \times \frac{3}{4} + \frac{2}{3} \times \frac{1}{4}} \\
&= \frac{1/4}{5/12} \\
&= \frac{3}{5}.
\end{aligned}$$
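Both Bayes' theorem examples above follow the same pattern, which can be sketched as a small Python function. The function name `bayes` and its parameter names are our own; the numbers are those from the porg and binary string examples.

```python
from fractions import Fraction


def bayes(prior, likelihood, likelihood_given_not):
    """Pr(H|E) from Pr(H), Pr(E|H) and Pr(E|not H), via Bayes' theorem."""
    numerator = prior * likelihood
    # The denominator is Pr(E), expanded with the law of total probability.
    return numerator / (numerator + (1 - prior) * likelihood_given_not)


# Porg example: Pr(mutant) = 100 / 24 000 000, test right 99% of the time.
porg = bayes(Fraction(100, 24_000_000), Fraction(99, 100), Fraction(1, 100))

# Binary string example: Pr(F) = 1/3, Pr(S|F) = 3/4, Pr(S|not F) = 1/4.
bits = bayes(Fraction(1, 3), Fraction(3, 4), Fraction(1, 4))

print(porg, float(porg))
print(bits)
```

Using exact fractions makes it easy to see that the porg result is the ratio 99/(99 + 239 999) before rounding.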
Questions.
1. In the porg example above a "99% accurate" test produced many more false positive results
than it did true positive results. Why was this? A5.47
(e) Let 𝐴 be the event that exactly one of the two balls was red and let 𝐵 be the event that exactly
one of the two balls was green. Find Pr(𝐴) and Pr(𝐵). A5.52
(f) Are 𝐴 and 𝐵 independent? A5.53
(g) Evaluate Pr(𝐴|𝐵). A5.54
(h) Find the probability that the first ball was green given that the second ball was yellow. A5.55
Question 2: A survey was conducted in which 1000 people were asked to state their age and whether or
not they enjoyed mathematics. It was found that 90% of people over the age of 30 enjoyed mathematics
but only 40% of people aged 30 or under did. Half of the survey's participants were over the age of 30.
Given that a randomly selected person from the survey enjoyed mathematics, what is the probability
that they were over the age of 30? A5.56
5.7 Answers
5.1 4 × 6 × 3 = 72
5.2 4 + 6 + 3 = 13
5.3 8 × 7 × 6 = 8!/5!, assuming order is important.
5.4 $\binom{18}{11}$
5.5 $\binom{18}{1} \times \binom{17}{1} \times \binom{16}{9}$ (first choose a captain, then a vice-captain, then the remaining 9 players).
5.8
10! = (1 × 2 × … × 7) × 8 × 9 × 10 = 7! × 8 × 9 × 10
𝑥 = 8 × 9 × 10 = 720
5.9
$$\binom{10}{7} = \frac{10!}{7!\,3!} = \frac{7! \times 720}{7!\,3!} = \frac{720}{6} = 120$$
5.10
16! = 14! × 15 × 16 = 14! × 240
$$\binom{16}{14} = \frac{16!}{14!\,2!} = \frac{14! \times 240}{14! \times 2} = 120$$
5.11 The selections are unordered because the chef makes no distinction between who has ordered
the meals. The selections include repetitions because several family members may order the
same main meal. The number of selections is
$$\frac{(12-1+7)!}{(12-1)!\,7!} = \frac{18!}{11!\,7!} = 31\,824.$$
5.12 The selections are unordered because there is no distinction between the positions available.
The selections are without repetition because we require four different people. The number of
possible leadership teams is
$$\frac{20!}{4!\,16!} = 4845.$$
5.13 This time the selections are ordered because we can distinguish between the available positions.
The selections are still without repetition. The number of possible leadership teams is
$$\frac{20!}{16!} = 116\,280.$$
5.14 The selections are ordered because different passwords may have the same characters appearing
in a different order. The selections allow repetition because you may reuse characters in a
password. Including upper case letters, lower case letters, and numbers, there are 62 characters
to choose from. So there are $62^8$ passwords of length 8, there are $62^9$ passwords of length 9,
and so on. The total number of passwords is
$$62^8 + 62^9 + 62^{10} + 62^{11}.$$
This is an enormous number, roughly $5 \times 10^{19}$.
5.15 Think of each path as a sequence of horizontal and vertical steps. For example,
𝐻, 𝐻, 𝑉 , 𝐻, 𝐻, 𝐻, 𝑉 , 𝐻, 𝑉 , 𝐻, 𝐻, 𝐻, 𝑉 .
How can you place these vertical steps relative to the horizontal steps?
5.16 You must take 9 horizontal steps and 4 vertical steps. Consider each path as a sequence of
horizontal and vertical steps (as in the hint). Think of the horizontal steps as being fixed
𝐻, 𝐻, 𝐻, 𝐻, 𝐻, 𝐻, 𝐻, 𝐻, 𝐻
with the vertical steps being placed in between. There are 10 potential locations for the vertical
steps and we must select 4 of these locations. There are no repetitions because we cannot have
two consecutive vertical steps. Furthermore, these selections will be unordered.
This reduces the problem to finding the number of unordered selections without repetitions of 4
objects from a set of 10 objects. The number of paths is
$$\frac{10!}{4!\,6!} = 210.$$
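The reduction in answer 5.16 can be cross-checked by brute force: choose which 4 of the 13 steps are vertical and discard any choice with two consecutive vertical steps. This is a sketch only, and the function name `count_paths` is ours.

```python
from itertools import combinations
from math import comb


def count_paths(horizontal: int, vertical: int) -> int:
    """Count step sequences with no two vertical steps in a row."""
    total_steps = horizontal + vertical
    count = 0
    # Each tuple lists, in increasing order, the positions of the vertical steps.
    for v_positions in combinations(range(total_steps), vertical):
        gaps_ok = all(b - a > 1 for a, b in zip(v_positions, v_positions[1:]))
        if gaps_ok:
            count += 1
    return count


print(count_paths(9, 4), comb(10, 4))
```

Both numbers should match the value given in the answer.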
5.17 This could be modelled by a probability space with sample space {1, 3, 5} and Pr(1) = 1/2,
Pr(3) = 1/3, and Pr(5) = 1/6.
5.18 The sum of the probabilities of the outcomes in a probability space must be 1, so we must
have Pr(cherry) = 1 − 1/3 − 1/3 − 1/4 = 1/12.
5.19 {11, 21, 31}.
5.20 There are 22 numbers in the sample space, so each has probability 1/22 of being chosen. Thus,
Pr(𝐶) = Pr(11) + Pr(21) + Pr(31) = 1/22 + 1/22 + 1/22 = 3/22.
5.21 {20, 21, … , 32}.
5.22 Pr(𝐷) = Pr(20) + Pr(21) + ⋯ + Pr(32) = 13 × 1/22 = 13/22.
and
Pr(𝐵) = Pr(𝐻𝑇𝐻) = 1/8.
5.33 Yes. The events 𝐴 and 𝐵 have no outcomes in common, so Pr(𝐴 and 𝐵) = 0.
5.34 No. Note that 𝐴 = {𝐻𝐻𝐻, 𝐻𝐻𝑇 , 𝐻𝑇 𝐻, 𝑇 𝐻𝐻, 𝑇 𝑇 𝑇 } so the outcome 𝐻𝑇 𝐻 is in both 𝐴
and 𝐵. Therefore, Pr(𝐴 and 𝐵) = Pr(𝐻𝑇𝐻) = 1/8 ≠ 0.
5.35 They are not independent because Pr(𝐴 and 𝐵) = 0 but Pr(𝐴) Pr(𝐵) = 3/8 × 1/8 = 3/64.
5.36 The sample space will contain twelve outcomes. It may be written as
𝑆 = {1𝐻, 1𝑇 , 2𝐻, 2𝑇 , 3𝐻, 3𝑇 , 4𝐻, 4𝑇 , 5𝐻, 5𝑇 , 6𝐻, 6𝑇 }
where the outcome 2𝑇 represents having rolled a 2 and tossed tails.
5.37 Since the die and coin are fair, each of the twelve outcomes is equally likely. So the
probability of each outcome will be 1/12.
5.38 We have
𝐴 = {1𝐻, 1𝑇 , 2𝐻, 2𝑇 , 3𝐻, 3𝑇 , 4𝐻, 4𝑇 }
and
𝐵 = {1𝑇 , 2𝑇 , 3𝑇 , 4𝑇 , 5𝑇 , 6𝑇 }.
So
Pr(𝐴) = 2/3 and Pr(𝐵) = 1/2.
𝐴 is more likely to occur.
5.39 No. There are outcomes in both 𝐴 and 𝐵, for example, 4𝑇.
5.40 The event 𝐴 has nothing to do with the result of the coin toss and 𝐵 has nothing to do with
the die roll. So it seems reasonable to expect that the events would be independent.
5.41 They are independent because
Pr(𝐴 and 𝐵) = Pr(1𝑇) + Pr(2𝑇) + Pr(3𝑇) + Pr(4𝑇) = 1/3
and
Pr(𝐴) Pr(𝐵) = 2/3 × 1/2 = 1/3.
5.42 There is no problem here because Pr(𝐴) + Pr(𝐵) is not equal to Pr(𝐴 or 𝐵) due to the
events not being mutually exclusive. In fact, since
Pr(𝐴 or 𝐵) = Pr(𝐴) + Pr(𝐵) − Pr(𝐴 and 𝐵),
we have
Pr(𝐴 or 𝐵) = 2/3 + 1/2 − 1/3.
Therefore, Pr(𝐴 or 𝐵) = 5/6, which is not greater than 1.
5.43 Pr(𝐴|𝐵) > Pr(𝐴).
5.44 Pr(𝐴|𝐵) < Pr(𝐴).
5.45 Pr(𝐴|𝐵) < Pr(𝐴). (This seems backwards, but imagine that you had bet that United would
lead 3-0 at half time, and then a friend told you that City had won. You'd feel worse about
your bet. We can talk about conditional probabilities regardless of the chronological order of
the events or of the direction of causation.)
5.46 A total of 4 can be obtained as (1, 1, 2), (1, 2, 1) or (2, 1, 1). Because the spins are independent:
$$\Pr((1,1,2)) = \Pr((1,2,1)) = \Pr((2,1,1)) = \frac{1}{2} \times \frac{1}{2} \times \frac{1}{4} = \frac{1}{16}.$$
So, because (1, 1, 2), (1, 2, 1) and (2, 1, 1) are mutually exclusive, the probability of the total
being 4 is
$$\frac{1}{16} + \frac{1}{16} + \frac{1}{16} = \frac{3}{16}.$$
5.47 Basically, because mutants are so rare. Even though testing positive increases the chance a
porg is a mutant, the base chance of this is so low that it's still quite unlikely.
5.48 {red, green, yellow}
5.49 Pr(red) = 1/6, Pr(green) = 1/2 and Pr(yellow) = 1/3.
5.50 Initially, the probability of drawing a red ball is 1/6. Since the total number of remaining balls
is now 5, the probability of drawing a green ball is 3/5. Multiplying these gives a probability of
1/10.
5.51 This outcome would correspond to drawing two different red balls, but there is only one red
ball in the urn.
5.52 We have
𝐴 = {𝑅𝐺, 𝑅𝑌 , 𝐺𝑅, 𝑌 𝑅}
and
𝐵 = {𝑅𝐺, 𝐺𝑅, 𝐺𝑌 , 𝑌 𝐺}.
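We calculated Pr(𝑅𝐺) = 1/10 earlier. The probabilities of the other outcomes are calculated
below:
$$\begin{aligned}
\Pr(RY) &= \frac{1}{6} \times \frac{2}{5} = \frac{1}{15} \\
\Pr(GR) &= \frac{1}{2} \times \frac{1}{5} = \frac{1}{10} \\
\Pr(YR) &= \frac{1}{3} \times \frac{1}{5} = \frac{1}{15} \\
\Pr(GY) &= \frac{1}{2} \times \frac{2}{5} = \frac{1}{5} \\
\Pr(YG) &= \frac{1}{3} \times \frac{3}{5} = \frac{1}{5} \\
\Pr(A) &= \Pr(RG) + \Pr(RY) + \Pr(GR) + \Pr(YR) = \frac{1}{3} \\
\Pr(B) &= \Pr(RG) + \Pr(GR) + \Pr(GY) + \Pr(YG) = \frac{3}{5}
\end{aligned}$$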
5.53 There are two outcomes common to both 𝐴 and 𝐵, namely 𝑅𝐺 and 𝐺𝑅. Using the previous
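calculations,
$$\Pr(A \text{ and } B) = \Pr(RG) + \Pr(GR) = \frac{1}{5}, \qquad \Pr(A)\Pr(B) = \frac{1}{3} \times \frac{3}{5} = \frac{1}{5}.$$
Since these are equal, 𝐴 and 𝐵 are independent.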
5.54 Since 𝐴 and 𝐵 are independent, Pr(𝐴|𝐵) = Pr(𝐴) = 1/3.
5.55 We can define events
𝐶 = {𝐺𝑅, 𝐺𝐺, 𝐺𝑌 } and 𝐷 = {𝑅𝑌 , 𝐺𝑌 , 𝑌 𝑌 }
which correspond to the first ball being green and second ball being yellow respectively. There
is only one outcome common to both 𝐶 and 𝐷, namely 𝐺𝑌. The probabilities of most of these
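outcomes have already been found. The remaining ones are calculated below.
$$\begin{aligned}
\Pr(GG) &= \frac{1}{2} \times \frac{2}{5} = \frac{1}{5} \\
\Pr(YY) &= \frac{1}{3} \times \frac{1}{5} = \frac{1}{15} \\
\Pr(C \mid D) &= \frac{\Pr(C \text{ and } D)}{\Pr(D)} = \frac{\Pr(GY)}{\Pr(RY) + \Pr(GY) + \Pr(YY)} = \frac{3}{5}
\end{aligned}$$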
5.56 Let 𝐴 be the event that the randomly selected person is over the age of 30, and let 𝑀 be the
event that they enjoy mathematics. We know the following:
Chapter 6
Probability Distributions
𝑥            0    1    2
Pr(𝑋 = 𝑥)   1/4  1/2  1/4
Questions.
1. An elevator is malfunctioning. Every minute it is equally likely to ascend one floor, descend one
floor, or stay where it is. When it begins malfunctioning it is on level 5. Let 𝑋 be the level it is
on two minutes later. Find the probability distribution for 𝑋. A6.1
6.1.2 Independence
We have seen that two events are independent when the occurrence or non-occurrence of one event
does not affect the likelihood of the other occurring. Similarly two random variables are independent
if the value of one does not affect the likelihood that the other will take a certain value.
Random variables 𝑋 and 𝑌 are independent if, for all 𝑥 and 𝑦,
Pr(𝑋 = 𝑥 and 𝑌 = 𝑦) = Pr(𝑋 = 𝑥) Pr(𝑌 = 𝑦).
It is a consequence of this definition that, for any sets of values 𝐴 and 𝐵, we have
Pr(𝑋 ∈ 𝐴 and 𝑌 ∈ 𝐵) = Pr(𝑋 ∈ 𝐴) Pr(𝑌 ∈ 𝐵).
Example: A standard die is rolled three times. Let 𝑍 be the number of sixes rolled.
What is the probability distribution of 𝑍?
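Obviously 𝑍 can only take values in {0, 1, 2, 3}. Each roll there is a six with probability
1/6 and not a six with probability 5/6. As usual, we assume the rolls are independent.
$$\begin{aligned}
\Pr(Z = 0) &= \tfrac{5}{6} \times \tfrac{5}{6} \times \tfrac{5}{6} \\
\Pr(Z = 1) &= (\tfrac{1}{6})(\tfrac{5}{6})(\tfrac{5}{6}) + (\tfrac{5}{6})(\tfrac{1}{6})(\tfrac{5}{6}) + (\tfrac{5}{6})(\tfrac{5}{6})(\tfrac{1}{6}) \\
\Pr(Z = 2) &= (\tfrac{1}{6})(\tfrac{1}{6})(\tfrac{5}{6}) + (\tfrac{1}{6})(\tfrac{5}{6})(\tfrac{1}{6}) + (\tfrac{5}{6})(\tfrac{1}{6})(\tfrac{1}{6}) \\
\Pr(Z = 3) &= \tfrac{1}{6} \times \tfrac{1}{6} \times \tfrac{1}{6}.
\end{aligned}$$
𝑥            0        1       2       3
Pr(𝑍 = 𝑥)   125/216  75/216  15/216  1/216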
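The distribution of 𝑍 can also be obtained by listing all 6 × 6 × 6 equally likely outcomes of the three rolls and counting sixes. A minimal sketch, with variable names of our own choosing:

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# Every outcome of three rolls of a standard die (216 in total).
rolls = list(product(range(1, 7), repeat=3))

# Tally how many outcomes contain 0, 1, 2 or 3 sixes.
counts = Counter(roll.count(6) for roll in rolls)

# Each outcome has probability 1/216, so Pr(Z = z) = count / 216.
distribution = {z: Fraction(counts[z], len(rolls)) for z in range(4)}

for z, p in distribution.items():
    print(z, counts[z], p)
```

The raw counts printed in the middle column are the numerators over 216 in the table above (the fractions themselves are shown in lowest terms).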
Example: An integer is generated uniformly at random from the set {10, 11, … , 29}. Let
𝑋 and 𝑌 be its first and second (decimal) digit. Then 𝑋 and 𝑌 are independent random
variables. To prove this notice that, for 𝑥 ∈ {1, 2} and 𝑦 ∈ {0, 1, … , 9},
Pr(𝑋 = 𝑥 and 𝑌 = 𝑦) = 1/20
whereas
Pr(𝑋 = 𝑥) = 10/20 = 1/2
and
Pr(𝑌 = 𝑦) = 2/20 = 1/10.
Thus
Pr(𝑋 = 𝑥) Pr(𝑌 = 𝑦) = 1/2 × 1/10 = 1/20 = Pr(𝑋 = 𝑥 and 𝑌 = 𝑦).
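The definition of independence can be tested mechanically by comparing the joint probability with the product of the two individual probabilities for every pair of values. The sketch below does this for the digits example; the function name `digits_independent` is our own, and it assumes two-digit integers chosen uniformly at random.

```python
from fractions import Fraction


def digits_independent(numbers) -> bool:
    """Check Pr(X = x and Y = y) == Pr(X = x) Pr(Y = y) for all digit pairs."""
    pairs = [divmod(m, 10) for m in numbers]       # (first digit, second digit)
    n = len(pairs)
    first_digits = {x for x, _ in pairs}
    second_digits = {y for _, y in pairs}
    for x in first_digits:
        for y in second_digits:
            joint = Fraction(sum(1 for p in pairs if p == (x, y)), n)
            pr_x = Fraction(sum(1 for a, _ in pairs if a == x), n)
            pr_y = Fraction(sum(1 for _, b in pairs if b == y), n)
            if joint != pr_x * pr_y:
                return False
    return True


print(digits_independent(range(10, 30)))   # the set {10, 11, ..., 29}
```

The same function can be applied to other sets of two-digit integers to explore when independence holds.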
Questions.
1. If, in the example above, the integer was generated uniformly at random from the set
{10, 11, … , 31}, would the random variables 𝑋 and 𝑌 still be independent? A6.2 A6.3
6.1.4 Expected value
A standard die is rolled some number of times and the average of the rolls is calculated. If the die is
rolled only once, this average is just the value rolled and is equally likely to be 1, 2, 3, 4, 5 or 6. If
the die is rolled ten times, then the average might be between 1 and 2 but this is pretty unlikely -
it's much more likely to be between 3 and 4. If the die is rolled ten thousand times, then we can be
almost certain that the average will be very close to 3.5. We will see that 3.5 is the expected value of
a random variable representing the die roll.
When we said "average" above, we really meant "mean". Remember that the mean of a collection of
numbers is the sum of the numbers divided by how many of them there are. So the mean of $x_1, \dots, x_t$
is $(x_1 + \cdots + x_t)/t$. The mean of 2, 2, 3 and 11 is (2 + 2 + 3 + 11)/4 = 4.5, for example.
The expected value of a random variable is calculated as a weighted average of its possible values.
If 𝑋 is a random variable with distribution
𝑥            𝑥1  𝑥2  ⋯  𝑥𝑡
Pr(𝑋 = 𝑥)   𝑝1  𝑝2  ⋯  𝑝𝑡
then its expected value is
$$\mathrm{E}[X] = p_1 x_1 + p_2 x_2 + \cdots + p_t x_t.$$
𝑥            −10  4    10
Pr(𝑋 = 𝑥)   2/5  1/2  1/10
Then
E[𝑋] = 2/5 × (−10) + 1/2 × 4 + 1/10 × 10 = −1.
Because this value is negative, Acme shares will almost certainly decrease in value over
the long term.
Notice that it was important that we weighted our average using the probabilities here. If we had
just taken the average of −10, 4 and 10 we would have gotten the wrong answer by ignoring the fact
that some values were more likely than others.
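The difference between the weighted and unweighted average can be sketched in Python, using the distribution from the Acme example and a fair die. The function name `expected_value` is ours.

```python
from fractions import Fraction


def expected_value(distribution):
    """Weighted average: sum of value * probability over all values."""
    return sum(x * p for x, p in distribution.items())


acme = {-10: Fraction(2, 5), 4: Fraction(1, 2), 10: Fraction(1, 10)}
die = {face: Fraction(1, 6) for face in range(1, 7)}

weighted = expected_value(acme)                 # uses the probabilities
unweighted = Fraction(sum(acme), len(acme))     # ignores them: (-10 + 4 + 10) / 3

print(weighted, unweighted, expected_value(die))
```

The unweighted figure is positive even though the expected value is negative, which is exactly the mistake the paragraph above warns against.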
Our initial die-rolling example hinted that when we average over a large number of independent trials
we will get very close to the expected value. A formal version of this (slightly vague) statement is
given by a famous theorem called the law of large numbers.
Questions.
1. Do you agree or disagree with the following statement? "The expected value of a random
variable is the value it is most likely to take." A6.4
2. Let 𝑋 be the number of heads occurring when a fair coin is flipped three times. Find E[𝑋].
A6.5
6.1.5 Linearity of expectation
We saw in the last lecture that adding random variables can be difficult. Finding the expected value
of a sum of random variables is easy if we know the expected values of the variables.
If 𝑋 and 𝑌 are random variables, then
E[𝑋 + 𝑌 ] = E[𝑋] + E[𝑌 ].
This works even if 𝑋 and 𝑌 are not independent.
Similarly, finding the expected value of a scalar multiple of a random variable is easy if we know the
expected value of the variable.
If 𝑋 is a random variable and 𝑠 ∈ ℝ, then
E[𝑠𝑋] = 𝑠E[𝑋].
Example. Two standard dice are rolled. What is the expected total?
Let 𝑋1 and 𝑋2 be random variables representing the first and second die rolls. From the
earlier example E[𝑋1 ] = E[𝑋2 ] = 3.5 and so
E[𝑋1 + 𝑋2 ] = E[𝑋1 ] + E[𝑋2 ] = 3.5 + 3.5 = 7.
Example. What is the expected number of "11" substrings in a binary string of length 5
chosen uniformly at random?
For 𝑖 = 1, … , 4, let 𝑋𝑖 be a random variable that is equal to 1 if the 𝑖th and (𝑖 + 1)th
bits of the string are both 1, and is equal to 0 otherwise. Then 𝑋1 + ⋯ + 𝑋4 is the
number of "11" substrings in the string. The probability that the 𝑖th bit is a 1 is 1/2 and
the probability that the (𝑖 + 1)th bit is a 1 is 1/2. So, because the bits are independent,
Pr(𝑋𝑖 = 1) = 1/2 × 1/2 = 1/4 and E[𝑋𝑖] = 1/4 for 𝑖 = 1, … , 4. So,
E[𝑋1 + ⋯ + 𝑋4] = E[𝑋1] + ⋯ + E[𝑋4] = 4/4 = 1.
Note that the variables 𝑋1 , … , 𝑋4 in the above example were not independent, but we were still
allowed to use linearity of expectation.
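The linearity argument can be compared against a direct calculation over all 32 binary strings of length 5. In this sketch overlapping occurrences are counted separately (so "111" contains two "11" substrings), matching the indicator variables in the example; the function name `count_11` is ours.

```python
from fractions import Fraction
from itertools import product


def count_11(bits: str) -> int:
    """Number of positions i where bits i and i + 1 are both 1."""
    return sum(1 for i in range(len(bits) - 1) if bits[i:i + 2] == "11")


strings = ["".join(b) for b in product("01", repeat=5)]

# Direct expected value: average the count over the uniform sample space.
direct = Fraction(sum(count_11(s) for s in strings), len(strings))

# Linearity of expectation: four indicator variables, each with mean 1/4.
by_linearity = 4 * Fraction(1, 4)

print(direct, by_linearity)
```

The two values agree even though the indicator variables are not independent.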
Questions.
1. A black box produces a random number according to some probability distribution that you do
not know. You use the box to create two random numbers and subtract the second from the
first. Can you find the expected value of the number you produce? A6.6
2. Let 𝑋 be the number of heads occurring when a fair coin is flipped three times. In the last set
of questions you calculated E[𝑋]. Can you now do this more easily? A6.7
6.1.6 Variance
Think of the random variables 𝑋, 𝑌 and 𝑍 whose distributions are given below.
𝑥            −1      99
Pr(𝑋 = 𝑥)   99/100  1/100

𝑦            −1   1
Pr(𝑌 = 𝑦)   1/2  1/2

𝑧            −50  50
Pr(𝑍 = 𝑧)   1/2  1/2
These variables are very different. Perhaps 𝑋 corresponds to buying a raffle ticket, 𝑌 to making a
small bet on a coin flip, and 𝑍 to making a large bet on a coin flip. However, if you only consider
expected value, all of these variables look the same - they each have expected value 0.
To give a bit more information about a random variable we can define its variance, which measures
how "spread out" its distribution is. People often also refer to the standard deviation which is simply
the square root of the variance.
If 𝑋 is a random variable with E[𝑋] = 𝜇, the variance of 𝑋 is given by
Var[𝑋] = E[(𝑋 − 𝜇)²].
The standard deviation of 𝑋 is given by
𝜎 = √Var[𝑋].
So the variance is a measure of how much we expect the variable to differ from its expected value.
Example. The variable 𝑋 above will be 1 smaller than its expected value with probability
99/100 and will be 99 larger than its expected value with probability 1/100. So
Var[𝑋] = 99/100 × (−1)² + 1/100 × 99² = 99.
Similarly,
Var[𝑌] = 1/2 × (−1)² + 1/2 × 1² = 1
Var[𝑍] = 1/2 × (−50)² + 1/2 × 50² = 2500.
Notice that the variance of 𝑋 is much smaller than the variance of 𝑍 because 𝑋 is very likely to be
close to its expected value, whereas 𝑍 will certainly be far from its expected value.
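The variance and standard deviation of the three variables above can be computed directly from the definition. A minimal sketch; the function names are our own.

```python
from fractions import Fraction
from math import sqrt


def expected_value(distribution):
    return sum(x * p for x, p in distribution.items())


def variance(distribution):
    """Var[X] = E[(X - mu)^2], where mu = E[X]."""
    mu = expected_value(distribution)
    return sum(p * (x - mu) ** 2 for x, p in distribution.items())


X = {-1: Fraction(99, 100), 99: Fraction(1, 100)}   # raffle ticket
Y = {-1: Fraction(1, 2), 1: Fraction(1, 2)}         # small coin-flip bet
Z = {-50: Fraction(1, 2), 50: Fraction(1, 2)}       # large coin-flip bet

for name, dist in (("X", X), ("Y", Y), ("Z", Z)):
    var = variance(dist)
    print(name, expected_value(dist), var, sqrt(var))
```

All three expected values are 0, while the variances differ widely.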
Questions.
1. Let 𝑋 be a random variable with distribution given by
𝑥            0    2    6
Pr(𝑋 = 𝑥)   1/6  1/2  1/3
Find the variance of 𝑋. A6.8
2. Let 𝑋 be the sum of 1000 spins of our spinner, and let 𝑌 be 1000 times the result of a single
spin. Using linearity of expectation we can see that E[𝑋] = E[𝑌]. Which of 𝑋 and 𝑌 do you
think would have greater variance? A6.9
𝑥            0    1
Pr(𝑋 = 𝑥)   1/4  3/4

𝑦            0    1
Pr(𝑌 = 𝑦)   1/2  1/2
(a) Evaluate Pr(𝑋 = 0 and 𝑌 = 1). A6.10
(b) Find E[𝑋] and E[𝑌]. A6.11
(c) What is Pr(𝑋 = E[𝑋])? A6.12
(d) Evaluate Pr(𝑋 + 𝑌 = 1) and Pr(𝑋 = 𝑌). A6.13
(e) Consider the random variables 𝑈 = 𝑋 + 𝑌 and 𝑉 = 𝑋 − 𝑌. Find the distributions of 𝑈 and 𝑉
and write your answer as a table (as above). A6.14
(f) Without doing any calculations, do you expect 𝑈 and 𝑉 to be independent? A6.15
(g) Use calculations to see if 𝑈 and 𝑉 are independent. A6.16
Question 2 (Markov's inequality): Consider the random variable 𝑋 with the following distribution.
𝑥            0    3    5    7    8
Pr(𝑋 = 𝑥)   1/6  1/3  1/8  1/8  1/4
(a) Find Pr(𝑋 ≥ 6). A6.17
(b) Find E[𝑋]. A6.18
(c) An important result in probability theory is that for any random variable 𝑋 which only takes
non-negative values, and any positive real number 𝑎, we have
$$\Pr(X \ge a) \le \frac{\mathrm{E}[X]}{a}.$$
Verify this for the above random variable with 𝑎 = 6. A6.19
Bernoulli distribution
This type of distribution arises when we have a single process that succeeds with probability 𝑝 and
fails otherwise. Such a process is called a Bernoulli trial.
The Bernoulli distribution with parameter 𝑝 ∈ [0, 1] is given by
$$\Pr(X = k) = \begin{cases} p & \text{for } k = 1 \\ 1 - p & \text{for } k = 0. \end{cases}$$
It has E[𝑋] = 𝑝 and Var[𝑋] = 𝑝(1 − 𝑝).
Geometric distribution
This distribution gives the probability that, in a sequence of independent Bernoulli trials, we see
exactly 𝑘 failures before the first success.
The geometric distribution with parameter 𝑝 ∈ [0, 1] is given by
$$\Pr(X = k) = p(1-p)^k \quad \text{for } k \in \mathbb{N}.$$
We have E[𝑋] = (1 − 𝑝)/𝑝 and Var[𝑋] = (1 − 𝑝)/𝑝².
Example. If every minute there is a 1% chance that your internet connection fails then
the probability of staying online for exactly 𝑥 consecutive minutes is approximated by a
geometric distribution with 𝑝 = 0.01. It follows that the expected value is (1 − 0.01)/0.01 = 99
minutes and the variance is (1 − 0.01)/(0.01)² = 9900.
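The stated mean and variance can be checked numerically by summing the geometric probabilities directly. This sketch truncates the infinite sum at 10 000 terms, which is an assumption that is harmless here because the remaining tail is negligibly small; the function name `geometric_pmf` is ours.

```python
def geometric_pmf(k: int, p: float) -> float:
    """Pr(X = k) = p (1 - p)^k: exactly k failures before the first success."""
    return p * (1 - p) ** k


p = 0.01
ks = range(10_000)   # truncation point chosen so the omitted tail is negligible

total = sum(geometric_pmf(k, p) for k in ks)
mean = sum(k * geometric_pmf(k, p) for k in ks)
variance = sum((k - mean) ** 2 * geometric_pmf(k, p) for k in ks)

print(total, mean, variance)
```

Up to floating-point rounding, the three printed values should be close to 1, (1 − 𝑝)/𝑝 and (1 − 𝑝)/𝑝² respectively.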
Binomial distribution
This distribution gives the probability that, in a sequence of 𝑛 independent Bernoulli trials, we see
exactly 𝑘 successes.
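The binomial distribution with parameters 𝑛 ∈ ℤ+ and 𝑝 ∈ [0, 1] is given by
$$\Pr(X = k) = \binom{n}{k} p^k (1-p)^{n-k} \quad \text{for } k \in \{0, 1, \dots, n\}.$$
It has E[𝑋] = 𝑛𝑝 and Var[𝑋] = 𝑛𝑝(1 − 𝑝).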
This demonstration displays binomial distributions with a variety of parameter values.
Example. If 1000 people search a term on a certain day and each of them has a 10%
chance of clicking a sponsored link, then the number of clicks on that link is approximated
by a binomial distribution with 𝑛 = 1000 and 𝑝 = 0.1. It follows that the expected value
is 1000 × 0.1 = 100 clicks and the variance is 1000 × 0.1 × 0.9 = 90.
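The expected value and variance in this example can be verified exactly. The sketch below assumes the standard binomial formula Pr(𝑋 = 𝑘) = C(𝑛, 𝑘) 𝑝^𝑘 (1 − 𝑝)^(𝑛−𝑘) and uses exact fractions; the function name `binomial_pmf` is ours.

```python
from fractions import Fraction
from math import comb


def binomial_pmf(k: int, n: int, p: Fraction) -> Fraction:
    """Probability of exactly k successes in n independent Bernoulli trials."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)


n, p = 1000, Fraction(1, 10)   # 1000 searches, 10% click chance each
pmf = [binomial_pmf(k, n, p) for k in range(n + 1)]

mean = sum(k * q for k, q in enumerate(pmf))
variance = sum((k - mean) ** 2 * q for k, q in enumerate(pmf))

print(sum(pmf), mean, variance)
```

Summing over all 1001 possible values reproduces 𝑛𝑝 and 𝑛𝑝(1 − 𝑝) without needing the closed-form expressions.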
Poisson distribution
In a Poisson process, events occur over time so that an average of 𝜆 events occur per time period and
the probability that an event will occur in one moment is the same as the probability that an event
will occur in any other moment (where our "moments" are very short time intervals of equal length).
This kind of process is an excellent model for many real-world phenomena such as machine failures,
calls to a help centre, or goals in a soccer match.
In this kind of process, the number of events in a specified time period forms a Poisson distribution.
The Poisson distribution with parameter 𝜆 ∈ ℝ (𝜆 > 0) is given by
$$\Pr(X = k) = \frac{\lambda^k e^{-\lambda}}{k!} \quad \text{for } k \in \mathbb{N}.$$
We have E[𝑋] = 𝜆 and Var[𝑋] = 𝜆.
Figure. Poisson distribution with 𝜆 = 4
This demonstration displays Poisson distributions with a variety of parameter values.
Example. If a call centre usually receives 6 calls per minute, then a Poisson distribution
with 𝜆 = 6 approximates the probability that it receives 𝑘 calls in a certain minute. It follows that
the expected value is 6 calls and the variance is 6.
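For the call centre example, the Poisson probabilities can be tabulated with a few lines of Python. This is a sketch: the function name `poisson_pmf` is ours, and the infinite sum is truncated at 𝑘 = 99, where the remaining probability is negligible for 𝜆 = 6.

```python
from math import exp, factorial


def poisson_pmf(k: int, lam: float) -> float:
    """Pr(X = k) = lam^k e^(-lam) / k!."""
    return lam ** k * exp(-lam) / factorial(k)


lam = 6.0
ks = range(100)   # truncation: the tail beyond k = 99 is negligible for lam = 6

total = sum(poisson_pmf(k, lam) for k in ks)
mean = sum(k * poisson_pmf(k, lam) for k in ks)
variance = sum((k - mean) ** 2 * poisson_pmf(k, lam) for k in ks)

for k in range(4):
    print(k, round(poisson_pmf(k, lam), 4))   # chance of k calls in one minute
print(total, mean, variance)
```

The computed mean and variance should both be close to 𝜆, in line with the statement above.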
Questions.
1. There is a 95% chance of a packet being received after being sent down a noisy line, and the
packet is resent until it is received. What is the probability that the packet is received within
the first three attempts? A6.20
2. A factory aims to have at most 2% of the components it makes be faulty. What is the probability
of a quality control test of 20 random components finding that 2 or more are faulty, if the
factory is exactly meeting its 2% target? A6.21
3. The number of times a machine needs adjusting during a day approximates a Poisson distribution,
and on average the machine needs to be adjusted three times per day. What is the probability
it does not need adjusting on a particular day? A6.22
It would be useful to be able to discuss probability in continuous settings. For example, this would
enable us to talk about the distributions of heights of people, times between a query and response,
lengths of bananas and so on.
In these cases we can't describe a probability distribution by listing the probability of all the possible
values because there will be infinitely many of them. Also, the chance of someone having any particular
height, for example, is minuscule: it's almost impossible that someone will be exactly 172.34256183cm
tall. It does make sense, however, to talk about the chance that someone will be between 171cm and
173cm tall.
We can describe a continuous probability distribution using a probability density function. For
example, the probability density function for the height in cm of a female Australian might look like
the following.
The probability of a height lying in a particular range is the definite integral of the probability density
function in that range. So the probability of a height between 171cm and 173cm is given by the area
marked below.
In the same way that the probabilities of the outcomes in a discrete probability space must add to 1,
the total area under the curve of a probability density function must be 1.
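The continuous uniform distribution with parameters 𝑎, 𝑏 ∈ ℝ (𝑎 < 𝑏) has probability density function given by
$$\begin{cases} \frac{1}{b-a} & \text{for } x \in [a, b] \\ 0 & \text{otherwise.} \end{cases}$$
It has E[𝑋] = (𝑎 + 𝑏)/2 and Var[𝑋] = (𝑏 − 𝑎)²/12.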
Figure. Probability density function for a continuous uniform distribution with 𝑎 = −1,
𝑏 = 2.
Exponential distribution
The exponential distribution gives the time between events in a Poisson process. It can be thought of
as a continuous version of the geometric distribution.
The exponential distribution with parameter 𝜆 ∈ ℝ (𝜆 > 0) has probability density function given by
$$\lambda e^{-\lambda x} \quad \text{for } x \ge 0.$$
It has E[𝑋] = 1/𝜆 and Var[𝑋] = 1/𝜆².
Example. Our example for the Poisson distribution above was a call centre that usually
receives 6 calls per minute. An exponential distribution with 𝜆 = 6 will approximate the
time between calls into this call centre. It follows that the expected time between calls is
1/6 of a minute (10 seconds) and the variance is 1/36.
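Since probabilities for continuous distributions are areas under the probability density function, they can be approximated by numerical integration. The sketch below uses a simple midpoint rule (our choice, not a method from the text) on the exponential density with 𝜆 = 6, cutting the integral off at 𝑥 = 10 where the density is negligible.

```python
from math import exp


def integrate(f, a: float, b: float, steps: int = 100_000) -> float:
    """Midpoint-rule approximation of the definite integral of f over [a, b]."""
    width = (b - a) / steps
    return width * sum(f(a + (i + 0.5) * width) for i in range(steps))


lam = 6.0


def pdf(x: float) -> float:
    """Exponential probability density: lam * e^(-lam * x) for x >= 0."""
    return lam * exp(-lam * x)


UPPER = 10.0   # stand-in for infinity; the density beyond this is negligible

total_area = integrate(pdf, 0.0, UPPER)
mean = integrate(lambda x: x * pdf(x), 0.0, UPPER)
variance = integrate(lambda x: (x - mean) ** 2 * pdf(x), 0.0, UPPER)
within_10s = integrate(pdf, 0.0, 1 / 6)   # next call arrives within 10 seconds

print(total_area, mean, variance, within_10s)
```

The first three values should be close to 1, 1/6 and 1/36; the last is the area under the curve between 0 and 1/6 of a minute.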
Normal distribution
This distribution comes up in many situations. Heights, test scores and a host of other data are often
normally distributed. As we saw above, a binomial distribution with large 𝑛 is well approximated by
a normal distribution.
The normal distribution with parameters 𝜇 ∈ ℝ and 𝜎 ∈ ℝ (𝜎 > 0) has probability density function
given by
$$\frac{1}{\sqrt{2\sigma^2\pi}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$$
A normal distribution with 𝜇 = 0, 𝜎 = 1, like the one above, is referred to as a standard normal
distribution.
There are many tools available to calculate areas under the curve of a normal probability density
function. See, for example, this online calculator.
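One such tool is Python's standard library, whose `statistics.NormalDist` class provides the cumulative area under a normal curve. The sketch below is illustrative only: `normal_pdf` is our own helper implementing the density formula given earlier, and the interval from −1 to 1 is an arbitrary choice.

```python
from math import exp, pi, sqrt
from statistics import NormalDist


def normal_pdf(x: float, mu: float, sigma: float) -> float:
    """Normal density: exp(-(x - mu)^2 / (2 sigma^2)) / sqrt(2 sigma^2 pi)."""
    return exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / sqrt(2 * sigma ** 2 * pi)


standard = NormalDist(mu=0.0, sigma=1.0)

# Area under the standard normal curve between -1 and 1.
area = standard.cdf(1.0) - standard.cdf(-1.0)

print(round(area, 4))
print(normal_pdf(0.5, 0.0, 1.0), standard.pdf(0.5))   # formula vs library
```

The area between two points is obtained by subtracting cumulative areas, mirroring the definite-integral description of continuous probabilities.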
𝑡-distribution
Lightweight Corporation claim their LED bulbs have a mean lifetime of 30000 hours. You believe
that these lifetimes will be normally distributed. You select 15 random bulbs and find that they have
a mean lifetime of 28000 hours with a standard deviation of 5000 hours. Obviously your best guess
for the lifetime of their bulbs is 28000 hours, but it's possible that you were unlucky and picked a
sample of 15 poor bulbs. How likely is it, though?
To deal with situations like this, statisticians use a measure called the 𝑡-statistic. First we need to
define the sample variance of a sample $x_1, \ldots, x_n$ of $n$ numbers to be
$$s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2.$$
(Don't confuse this with the usual variance, which would have 𝑛 instead of 𝑛 − 1 in its definition.
The 𝑛 − 1 is important to make the sample variance have, on average, the same value as the variance
of the distribution being sampled.)
The $t$-statistic for a sample $x_1, \ldots, x_n$ of size $n$ is given by
$$t = \frac{\bar{x} - \mu}{s/\sqrt{n}}$$
where 𝑥̄ and 𝜇 are the mean of the sample and the mean of the population, respectively, whilst 𝑠 is
the square root of the sample variance defined just above.
We can rearrange this as $\mu = \bar{x} - \frac{ts}{\sqrt{n}}$, so the 𝑡-statistic is measuring how far from the actual population
mean our sample mean is (scaled by $\frac{s}{\sqrt{n}}$). When the values in the population are normally distributed
and the sample has size 𝑛, the distribution of the 𝑡-statistic forms a 𝑡-distribution with parameter
𝜈 = 𝑛 − 1.
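The two definitions above translate directly into code. This is a minimal sketch with illustrative function names; it applies the 𝑡-statistic formula to the summary figures from the bulb example (sample mean 28000, claimed mean 30000, 𝑠 = 5000, 𝑛 = 15).

```python
import math

def sample_variance(xs):
    """Sample variance s^2, dividing by n - 1 rather than n."""
    n = len(xs)
    mean = sum(xs) / n
    return sum((x - mean) ** 2 for x in xs) / (n - 1)

def t_statistic(sample_mean, population_mean, s, n):
    """t = (sample mean - population mean) / (s / sqrt(n))."""
    return (sample_mean - population_mean) / (s / math.sqrt(n))

print(sample_variance([1, 2, 3, 4]))                    # a small made-up sample
print(round(t_statistic(28000, 30000, 5000, 15), 3))    # -1.549
```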
The 𝑡-distribution is a probability distribution with parameter $\nu \in \mathbb{Z}^+$. Its probability density function
is complicated to express and needn't concern us. It has $E[X] = 0$ for $\nu \geq 2$ and $\mathrm{Var}[X] = \frac{\nu}{\nu-2}$ for $\nu \geq 3$.
The parameter 𝜈 is sometimes referred to as the number of degrees of freedom. As 𝜈 becomes large
the 𝑡-distribution approaches a normal distribution with 𝜇 = 0 and 𝜎 = 1.
Figure. In red, green, and orange are the probability density functions for 𝑡-distributions
with 𝜈 = 1, 𝜈 = 2 and 𝜈 = 6. In blue is the probability density function for a normal
distribution with 𝜇 = 0 and 𝜎 = 1.
Again, there are many tools available to calculate areas under the curve of a 𝑡-distribution probability
density function. See, for example, this online calculator.
$\chi^2$-distribution
The $\chi^2$-distribution is a probability distribution with parameter $k \in \mathbb{Z}^+$. It is the probability
distribution of $X_1^2 + \cdots + X_k^2$ where $X_1, \ldots, X_k$ are independent random variables that each have a standard
normal distribution. Its probability density function is complicated to express and needn't concern
us. It has $E[X] = k$ and $\mathrm{Var}[X] = 2k$.
The parameter $k$ is sometimes referred to as the number of degrees of freedom. The $\chi^2$-distribution is
mostly used in hypothesis testing, rather than for directly modelling situations.
Figure. In blue, orange, and green are the probability density functions for $\chi^2$-distributions
with $k = 1$, $k = 3$ and $k = 5$.
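The description of this distribution as a sum of squared standard normal variables can be explored by simulation. The sketch below is illustrative only: the seed, the choice 𝑘 = 3, and the number of draws are arbitrary assumptions, and the draws are generated independently of one another.

```python
import random

def chi_squared_sample(k, rng):
    """One draw: the sum of squares of k independent standard normal values."""
    return sum(rng.gauss(0, 1) ** 2 for _ in range(k))

rng = random.Random(0)  # fixed seed so the run is repeatable
k = 3
draws = [chi_squared_sample(k, rng) for _ in range(100_000)]

mean = sum(draws) / len(draws)
variance = sum((d - mean) ** 2 for d in draws) / len(draws)
print(mean, variance)  # should land close to k and 2k respectively
```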
Questions.
1. The number of metres a cat walks down a 4 metre long hallway before taking a nap is given
by a continuous uniform distribution with 𝑎 = 0 and 𝑏 = 4. The sun shines through windows
onto the first metre and last 50cm of the hallway. What is the probability that the cat falls
asleep in the sun? A6.23
2. The actual length in cm of a "30cm" ruler made by a factory is given by a normal distribution
with 𝜇 = 30 and 𝜎 = 0.03. What is the probability that the length of the ruler is off by 0.1cm
or more? (Consider using the online calculator linked above.) A6.24
3. In the example above, we wanted to gauge whether to believe Lightweight Corporation's claim
of 30000 hours average lifetime. Assuming the claim is correct, what is the probability that the
𝑡-statistic derived from our sample was as small as it turned out (or smaller)? What can you
deduce from this? A6.25
6.5 Graphs and trees
6.5.1 Graphs
A graph is basically a network consisting of nodes (called vertices) and links (called edges) between
those nodes.
A graph is a collection of objects called vertices in which some pairs of vertices are designated as
adjacent. We imagine adjacent vertices as having an edge between them. We usually picture graphs
by drawing a dot for each vertex and a line between two dots for each edge.
We often name edges by concatenating the names of the two vertices they join: so 𝑢𝑣 is an edge
between vertices 𝑢 and 𝑣.
Example. Think of the graph with four vertices, 𝑢, 𝑣, 𝑤, 𝑥, and four edges 𝑢𝑣, 𝑣𝑤, 𝑤𝑥, 𝑥𝑢.
We could draw this graph as follows:
Graphs are the same when they have the same vertices and edges. How we choose to position the
vertices and edges when we draw a graph is not important.
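One way to make this notion of sameness concrete is to store a graph as a set of vertices together with a set of unordered pairs. The sketch below is an illustration, not a standard library structure; `make_graph` is a hypothetical helper that assumes single-character vertex names, so that an edge name such as "uv" can be split into its two ends.

```python
def make_graph(vertices, edge_names):
    """Return (vertex set, edge set); each edge is an unordered pair of vertices."""
    edges = {frozenset(name) for name in edge_names}
    return set(vertices), edges

# The graph from the example, described in two different orders.
g1 = make_graph("uvwx", ["uv", "vw", "wx", "xu"])
g2 = make_graph("xwvu", ["ux", "xw", "wv", "vu"])

print(g1 == g2)  # True: same vertices and same edges
```

Because sets ignore order, and each edge ignores which end is written first, the comparison depends only on which vertices and edges are present.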
Here we're discussing the most commonly used definition of a graph. In these graphs:
• edges do not have directions;
• there are never two or more edges between a pair of vertices;
• there are no edges from a vertex to itself (loops).
However, in some applications it is natural to allow directed edges, multiple edges or loops and all of
these variants are used and studied. In addition, graphs where vertices or edges are assigned weights
or colours are sometimes employed.
Questions.
1. Draw a picture of a graph with four vertices and an edge between every pair of vertices. A6.32
2. What is the maximum number of edges that a graph with 𝑛 vertices can have? A6.33
6.5.2 Degree
We say that an edge of a graph that joins two vertices is incident with those vertices.
The degree of a vertex in a graph is the number of edges in the graph that are incident with it.
Because each edge in a graph is incident with exactly two vertices, each edge contributes 2 to the
total degree of the graph. This gives us the following:
The sum of the degrees of the vertices in a graph is twice the number of edges in the graph. In
particular, this sum is always even.
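The degree-sum result is easy to check on a small case. In the sketch below the graph (a cycle on 𝑢, 𝑣, 𝑤, 𝑥 plus the extra edge 𝑢𝑤) is a made-up example and the function name is illustrative.

```python
def degrees(vertices, edges):
    """Map each vertex to the number of edges incident with it."""
    deg = {v: 0 for v in vertices}
    for u, v in edges:
        deg[u] += 1  # each edge adds 1 to the degree of each of its two ends
        deg[v] += 1
    return deg

vertices = ["u", "v", "w", "x"]
edges = [("u", "v"), ("v", "w"), ("w", "x"), ("x", "u"), ("u", "w")]

deg = degrees(vertices, edges)
print(deg)                                   # {'u': 3, 'v': 2, 'w': 3, 'x': 2}
print(sum(deg.values()) == 2 * len(edges))   # True
```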
Questions.
1. Is there a graph with five vertices with degrees 3, 3, 2, 2, 2? A6.34
2. Is there a graph with five vertices with degrees 4, 3, 3, 2, 1? A6.35
3. Is there a graph with five vertices with degrees 5, 4, 3, 2, 2? A6.36
We say that a graph is connected if there is a path between any two vertices.
Example. The graph on the left below is not connected because there is not a path
between 𝑣2 and 𝑣3 (for example). The graph on the right below is connected.
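Connectedness can be tested by starting at one vertex and repeatedly stepping along edges to see which vertices can be reached. The sketch below uses a breadth-first search; the function name and the two small test graphs are illustrative assumptions, not the graphs pictured in the example, and treating a graph with no vertices as connected is simply a convention chosen here.

```python
from collections import deque

def is_connected(vertices, edges):
    """Return True if every vertex can be reached from the first one."""
    vertices = list(vertices)
    if not vertices:
        return True  # convention chosen for this sketch
    neighbours = {v: set() for v in vertices}
    for u, v in edges:
        neighbours[u].add(v)
        neighbours[v].add(u)
    seen = {vertices[0]}
    queue = deque([vertices[0]])
    while queue:
        current = queue.popleft()
        for nxt in neighbours[current]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return len(seen) == len(vertices)

print(is_connected(["v1", "v2", "v3"], [("v1", "v2"), ("v2", "v3")]))  # True
print(is_connected(["v1", "v2", "v3"], [("v1", "v2")]))                # False
```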
A cycle of length $t$ is a graph whose vertices can be labelled $v_1, v_2, \ldots, v_t$ so that the edges of the
graph are exactly $v_1v_2, v_2v_3, \ldots, v_{t-1}v_t, v_tv_1$.
Example. We picture below cycles of length 3, 4, 5, and 6.
Questions.
1. How many edges must be removed from a path in order to make a graph that is not connected?
How about for a cycle? A6.37
6.5.4 Trees
A tree is a graph that is connected but contains no cycles.
Example. The graph on the left below is not a tree because it contains a cycle on vertices
𝑣1 , 𝑣2 , 𝑣4 , 𝑣6 . The graph in the centre below is not a tree because it is not connected.
The graph on the right below is a tree.
The following result can be proved by induction on the number of vertices in the graph.
A tree with 𝑛 vertices has exactly 𝑛 − 1 edges.
It turns out that any graph with 𝑛 vertices that has 𝑛 − 2 or fewer edges cannot be connected. This
means that trees are "minimal" connected graphs (with respect to their number of edges).
A spanning tree of a graph 𝐺 is a tree in 𝐺 that contains every vertex of 𝐺.
Example. A graph (below left) and a spanning tree of that graph (below right). There
are many other spanning trees of the graph. Find some.
Obviously a graph that is not connected cannot have a spanning tree. On the other hand, every
connected graph has at least one spanning tree. Given a connected graph we can produce a spanning
tree by repeatedly finding a cycle in the graph and deleting one of its edges until no cycles remain.
The resulting graph will be connected and will not have any cycles, so it will be a tree.
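The edge-deletion procedure just described can be sketched in code. Rather than searching for cycles explicitly, this version relies on the fact that, in a connected graph, an edge lies on a cycle exactly when deleting it leaves the graph connected; that substitution, the function names, and the example graph are all assumptions made for illustration. The sketch also assumes the graph has at least one vertex.

```python
def is_connected(vertices, edges):
    """Return True if every vertex can be reached from the first one."""
    vertices = list(vertices)
    neighbours = {v: set() for v in vertices}
    for u, v in edges:
        neighbours[u].add(v)
        neighbours[v].add(u)
    seen, stack = {vertices[0]}, [vertices[0]]
    while stack:
        for nxt in neighbours[stack.pop()]:
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return len(seen) == len(vertices)

def spanning_tree(vertices, edges):
    """Delete edges that lie on cycles until none remain."""
    if not is_connected(vertices, edges):
        raise ValueError("a graph that is not connected has no spanning tree")
    kept = list(edges)
    for edge in list(edges):
        remaining = [e for e in kept if e != edge]
        # If the graph stays connected, `edge` was on a cycle and can go.
        if is_connected(vertices, remaining):
            kept = remaining
    return kept

vertices = ["u", "v", "w", "x"]
edges = [("u", "v"), ("v", "w"), ("w", "x"), ("x", "u"), ("u", "w")]
tree = spanning_tree(vertices, edges)
print(tree, len(tree) == len(vertices) - 1)
```

A single pass over the edges is enough here: an edge that is kept was needed for connectivity at the time it was examined, and deleting further edges later cannot make it less necessary.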
Questions.
1. Are some paths trees? Are all paths trees? A6.38
2. When we begin with a connected graph and repeatedly find a cycle in the graph and delete
one of its edges until no cycles remain, how do we know that the resulting graph is connected? A6.39
3. How many spanning trees does a cycle of length 6 have? A6.40
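$$\begin{pmatrix}
0 & 1 & 1 & 0 & 0 \\
1 & 0 & 1 & 1 & 0 \\
1 & 1 & 0 & 0 & 0 \\
0 & 1 & 0 & 0 & 1 \\
0 & 0 & 0 & 1 & 0
\end{pmatrix}$$
$$\begin{pmatrix}
0 & 1 & 1 & 0 & 0 \\
1 & 0 & 0 & 1 & 0 \\
1 & 0 & 0 & 1 & 1 \\
0 & 1 & 1 & 0 & 1 \\
0 & 0 & 1 & 1 & 0
\end{pmatrix}$$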
2. Find the adjacency matrix of a cycle with vertices $v_1, v_2, \ldots, v_6$ and edges $v_1v_2, v_2v_3, v_3v_4, v_4v_5, v_5v_6, v_6v_1$. A6.42
6.6 Answers
6.1
𝑥            3     4     5     6     7
Pr(𝑋 = 𝑥)   1/9   2/9   1/3   2/9   1/9
6.2 No. If we know that 𝑋 = 2 then 𝑌 is equally likely to be any digit, but if we know that
𝑋 = 3, then 𝑌 is certain to be a 0 or 1. So the value of 𝑋 has an effect on the likelihood that 𝑌
will take certain values and the variables are not independent.
6.3 No. Consider, for example, the probability that 𝑋 = 3 and 𝑌 = 0.
$\Pr(X = 3 \text{ and } Y = 0) = \frac{1}{22}$ (because the set contains 22 numbers and this happens only when the number is 30).
$\Pr(X = 3) = \frac{1}{11}$ (this happens when the number is 30 or 31).
$\Pr(Y = 0) = \frac{3}{22}$ (this happens when the number is 10, 20 or 30).
It's now easy to check that $\Pr(X = 3 \text{ and } Y = 0) \neq \Pr(X = 3)\Pr(Y = 0)$, since $\frac{1}{11} \times \frac{3}{22} = \frac{3}{242}$.
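A check like this can also be done by brute force. The sketch below assumes, based on answers 6.2 and 6.3, that a number is chosen uniformly from the 22 integers 10, 11, …, 31, with 𝑋 its first digit and 𝑌 its second digit; the question itself is not reproduced here, so treat that setup as an assumption. The helper name `prob` is illustrative.

```python
from fractions import Fraction

numbers = range(10, 32)  # assumed set: the 22 integers 10, 11, ..., 31

def prob(event):
    """Probability of an event when a number is chosen uniformly from `numbers`."""
    favourable = sum(1 for n in numbers if event(n))
    return Fraction(favourable, len(numbers))

p_joint = prob(lambda n: n // 10 == 3 and n % 10 == 0)  # X = 3 and Y = 0
p_x = prob(lambda n: n // 10 == 3)                       # X = 3
p_y = prob(lambda n: n % 10 == 0)                        # Y = 0

print(p_joint, p_x, p_y, p_x * p_y)  # 1/22 1/11 3/22 3/242
print(p_joint == p_x * p_y)          # False, so X and Y are not independent
```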
6.4 You should disagree. It's very possible for a random variable to have no chance of taking its
expected value. For example, you will never roll 3.5 on a standard die.
6.5 By writing down all the possible sequences of heads and tails (or by other methods) we can
see that the probability distribution of 𝑋 is
𝑥            0     1     2     3
Pr(𝑋 = 𝑥)   1/8   3/8   3/8   1/8
So $E[X] = \frac{1}{8} \times 0 + \frac{3}{8} \times 1 + \frac{3}{8} \times 2 + \frac{1}{8} \times 3 = \frac{3}{2}$.
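Writing down all the sequences is a task a computer does well. The sketch below assumes that 𝑋 counts the heads in three fair coin flips, which is consistent with the table above and with answer 6.7 but is not stated in this part of the text.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# All 8 equally likely sequences of three flips, e.g. ('H', 'T', 'H').
outcomes = list(product("HT", repeat=3))

# Count how many sequences give each number of heads.
counts = Counter(seq.count("H") for seq in outcomes)
distribution = {x: Fraction(c, len(outcomes)) for x, c in counts.items()}

expected = sum(x * p for x, p in distribution.items())
print(sorted(distribution.items()))
print(expected)  # 3/2
```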
6.6 Yes. Let the expected value of a number from the box be 𝑧. Then, using linearity of
expectation, the expected value of the number you produce is E[𝑋] = 𝑧 − 𝑧 = 0.
6.7 Yes. It's not hard to see that the expected value of the number of heads occurring from a
single coin flip is $\frac{1}{2} \times 0 + \frac{1}{2} \times 1 = \frac{1}{2}$. So, using linearity of expectation, $E[X] = 3 \times \frac{1}{2} = \frac{3}{2}$.
6.8 The expected value of 𝑋 is
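$$E[X] = \frac{1}{6} \times 0 + \frac{1}{2} \times 2 + \frac{1}{3} \times 6 = 3.$$
So, the variance of 𝑋 is
$$\mathrm{Var}[X] = \frac{1}{6} \times (0 - 3)^2 + \frac{1}{2} \times (2 - 3)^2 + \frac{1}{3} \times (6 - 3)^2 = 5.$$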
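The same two calculations can be written as short functions that work for any finite distribution. This is a sketch with illustrative names; the distribution is the one used in this answer (values 0, 2, 6 with probabilities 1/6, 1/2, 1/3), stored as a dictionary from values to probabilities.

```python
from fractions import Fraction

def expectation(dist):
    """Sum of value times probability."""
    return sum(p * x for x, p in dist.items())

def variance(dist):
    """Expected squared distance from the expected value."""
    mu = expectation(dist)
    return sum(p * (x - mu) ** 2 for x, p in dist.items())

dist = {0: Fraction(1, 6), 2: Fraction(1, 2), 6: Fraction(1, 3)}
print(expectation(dist), variance(dist))  # 3 5
```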
6.9 Y by a long way. By the law of large numbers, 𝑋 will very likely be quite close to E[𝑋]. On
the other hand 𝑌 could very well be far from E[𝑌 ]. This will mean that 𝑌 has greater variance.
6.10 Because 𝑋 and 𝑌 are independent, the probability of this event can be written as a product.
$$\Pr(X = 0 \text{ and } Y = 1) = \Pr(X = 0)\Pr(Y = 1) = \frac{1}{4} \times \frac{1}{2} = \frac{1}{8}$$
6.11
$$E[X] = \frac{1}{4} \times 0 + \frac{3}{4} \times 1 = \frac{3}{4}$$
$$E[Y] = \frac{1}{2} \times 0 + \frac{1}{2} \times 1 = \frac{1}{2}$$
6.12
$$\Pr(X = E[X]) = \Pr\left(X = \tfrac{3}{4}\right) = 0$$
6.13 There are two ways the random variable 𝑋 + 𝑌 can take a value of 1. Namely, when 𝑋 = 0
and 𝑌 = 1, or when 𝑋 = 1 and 𝑌 = 0. These two possibilities are mutually exclusive events.
So the probability of one or the other happening is the sum of their individual probabilities.
Pr(𝑋 + 𝑌 = 1) = Pr(𝑋 = 0 and 𝑌 = 1) + Pr(𝑋 = 1 and 𝑌 = 0)
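Because 𝑋 and 𝑌 are independent, the above probabilities can be written as products.
$$\Pr(X + Y = 1) = \Pr(X = 0)\Pr(Y = 1) + \Pr(X = 1)\Pr(Y = 0) = \frac{1}{4} \times \frac{1}{2} + \frac{3}{4} \times \frac{1}{2} = \frac{1}{2}$$
In the same way, there are two ways in which 𝑋 can be equal to 𝑌.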
6.14 We have already calculated $\Pr(U = 1) = \frac{1}{2}$. The other probabilities are calculated in the
same way.
𝑢            0     1     2
Pr(𝑈 = 𝑢)   1/8   1/2   3/8

𝑣            −1    0     1
Pr(𝑉 = 𝑣)   1/8   1/2   3/8
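Both tables can be reproduced by looping over every pair of values of 𝑋 and 𝑌. The sketch below takes 𝑈 = 𝑋 + 𝑌 and 𝑉 = 𝑋 − 𝑌 with Pr(𝑋 = 0) = 1/4, Pr(𝑋 = 1) = 3/4 and Pr(𝑌 = 0) = Pr(𝑌 = 1) = 1/2, as read off from answers 6.10 to 6.16; the function name is illustrative.

```python
from collections import defaultdict
from fractions import Fraction

p_x = {0: Fraction(1, 4), 1: Fraction(3, 4)}
p_y = {0: Fraction(1, 2), 1: Fraction(1, 2)}

def distribution_of(f):
    """Distribution of f(X, Y) when X and Y are independent."""
    dist = defaultdict(Fraction)
    for x, px in p_x.items():
        for y, py in p_y.items():
            dist[f(x, y)] += px * py  # independence: multiply the probabilities
    return dict(dist)

u = distribution_of(lambda x, y: x + y)
v = distribution_of(lambda x, y: x - y)
print(sorted(u.items()))
print(sorted(v.items()))
```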
6.15 Both 𝑈 and 𝑉 depend on 𝑋 and 𝑌. So a change in the value of 𝑈 can affect the value of 𝑉
and vice-versa. We would not expect them to be independent.
6.16 Consider the event that both 𝑈 = 0 and 𝑉 = −1. This is the same as the event that
𝑋 + 𝑌 = 0 and 𝑋 − 𝑌 = −1. But these events cannot happen simultaneously. That is, if
𝑋 + 𝑌 = 0, then 𝑋 = 0 and 𝑌 = 0, so 𝑋 − 𝑌 = 0 ≠ −1. These events are mutually exclusive, so
$$\Pr(U = 0 \text{ and } V = -1) = 0.$$
However,
$$\Pr(U = 0)\Pr(V = -1) = \frac{1}{8} \times \frac{1}{8} = \frac{1}{64}.$$
Since these two values differ, 𝑈 and 𝑉 are not independent.
6.18
$$E[X] = \frac{1}{6} \times 0 + \frac{1}{3} \times 3 + \frac{1}{8} \times 5 + \frac{1}{8} \times 7 + \frac{1}{4} \times 8 = \frac{9}{2}$$
6.19
$$\frac{E[X]}{6} = \frac{9}{2} \times \frac{1}{6} = \frac{3}{4}$$
$$\Pr(X \geq 6) = \frac{3}{8} \leq \frac{3}{4} = \frac{E[X]}{6}$$
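The comparison above is an instance of the bound Pr(𝑋 ≥ 𝑎) ≤ E[𝑋]/𝑎 for a non-negative random variable (Markov's inequality). The sketch below checks it for the distribution read off from answer 6.18; the function name `tail` is illustrative.

```python
from fractions import Fraction

# Values and probabilities taken from the calculation of E[X] in answer 6.18.
dist = {
    0: Fraction(1, 6),
    3: Fraction(1, 3),
    5: Fraction(1, 8),
    7: Fraction(1, 8),
    8: Fraction(1, 4),
}

expected = sum(x * p for x, p in dist.items())

def tail(a):
    """Pr(X >= a)."""
    return sum(p for x, p in dist.items() if x >= a)

print(expected, tail(6), expected / 6)  # 9/2 3/8 3/4
print(all(tail(a) <= expected / a for a in range(1, 9)))  # True
```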
6.20 Let 𝑋 be the number of failures before the packet is successfully received. Then 𝑋 has a
geometric distribution with 𝑝 = 0.95. So the probability that the packet is received within the
first three attempts is
$$\Pr(X \leq 2) = 0.95 + 0.05 \times 0.95 + 0.05^2 \times 0.95 = 1 - 0.05^3 = 0.999875.$$
6.21 Let 𝑋 be the number of faulty components. Then 𝑋 has a binomial distribution with
𝑝 = 0.02 and 𝑛 = 20. The probability that zero or one components are faulty is
$$\Pr(X = 0) + \Pr(X = 1) = 0.98^{20} + 20 \times 0.02 \times 0.98^{19} \approx 0.9401.$$
So the probability that at least two components are faulty is approximately 1 − 0.9401 = 0.0599.
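These binomial probabilities are quick to confirm in code. The sketch below assumes Python 3.8 or later (for `math.comb`); the function name is illustrative.

```python
import math

def binomial_pmf(n, p, k):
    """Probability of exactly k successes in n independent trials."""
    return math.comb(n, k) * p ** k * (1 - p) ** (n - k)

at_most_one = binomial_pmf(20, 0.02, 0) + binomial_pmf(20, 0.02, 1)
print(round(at_most_one, 4), round(1 - at_most_one, 4))  # 0.9401 0.0599
```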
6.22 Let 𝑋 be the number of times the machine needs adjustment on the day. Then 𝑋 has a
Poisson distribution with 𝜆 = 3. The probability that the machine does not need adjusting on
the day is
$$\Pr(X = 0) = \frac{3^0 e^{-3}}{0!} = e^{-3}.$$
6.24 Using the calculator we see that the probability that the ruler's length is between 29.9cm
and 30.1cm is approximately 0.9991. So the probability that the length of the ruler is off by
0.1cm or more is approximately 1 − 0.9991 = 0.0009.
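The same figures can be obtained without an online calculator. The sketch below assumes Python 3.8 or later (for `statistics.NormalDist`) and uses the parameters from the question.

```python
from statistics import NormalDist

ruler = NormalDist(mu=30, sigma=0.03)

# Probability that the length is within 0.1cm of 30cm.
within = ruler.cdf(30.1) - ruler.cdf(29.9)

print(round(within, 4), round(1 - within, 4))  # approximately 0.9991 and 0.0009
```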
6.25 The 𝑡-statistic of the sample is $\frac{28000 - 30000}{5000/\sqrt{15}} \approx -1.549$. Using the calculator with 𝜈 = 15 − 1 =
14 we see that the probability of the 𝑡-statistic being less than this value is approximately
0.0718 = 7.18%. Nothing is certain, but the claim that the mean lifetime is 30000 hours or more
looks a little unlikely.
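For readers who would rather not depend on an online calculator, the area can be approximated numerically. The sketch below is an assumption-laden illustration: it uses the standard formula for the 𝑡-distribution density (which the text deliberately omits) and Simpson's rule with an arbitrary number of steps; the function names are hypothetical.

```python
import math

def t_pdf(x, nu):
    """Standard density of the t-distribution with nu degrees of freedom."""
    coeff = math.gamma((nu + 1) / 2) / (math.sqrt(nu * math.pi) * math.gamma(nu / 2))
    return coeff * (1 + x * x / nu) ** (-(nu + 1) / 2)

def t_cdf(x, nu, steps=2000):
    """Pr(T <= x), via Simpson's rule on [0, |x|] and symmetry about 0."""
    a = abs(x)
    h = a / steps  # steps must be even for Simpson's rule
    total = t_pdf(0, nu) + t_pdf(a, nu)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, nu)
    area = total * h / 3
    return 0.5 + area if x >= 0 else 0.5 - area

t = (28000 - 30000) / (5000 / math.sqrt(15))
print(round(t, 3), t_cdf(t, 14))
```

The second printed value should be close to the 0.0718 reported above.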
6.26 We can let 𝑋 denote a Poisson random variable with 𝜆 = 8.
$$\Pr(X = 0) = \frac{8^0 e^{-8}}{0!} = e^{-8} \approx 0.0003$$
6.27 Note that Pr(𝑋 = 0) + Pr(𝑋 ≥ 1) = 1 since these two events are mutually exclusive and
cover the whole sample space.
$$\Pr(X \geq 1) = 1 - \Pr(X = 0) = 1 - e^{-8} \approx 0.9997$$
6.28
$$\Pr(X = 8) = \frac{8^8 e^{-8}}{8!} \approx 0.14$$
6.29 We can answer this question by considering two independent random variables 𝑋 and 𝑌,
both of which have a Poisson distribution with 𝜆 = 8.
$$\Pr(X = 8 \text{ and } Y = 8) = \Pr(X = 8)\Pr(Y = 8) = \frac{8^8 e^{-8}}{8!} \times \frac{8^8 e^{-8}}{8!} \approx 0.019$$
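The Poisson values in answers 6.26 to 6.29 can all be reproduced with one small function. This is a minimal sketch with an illustrative function name.

```python
import math

def poisson_pmf(lam, k):
    """Pr(X = k) for a Poisson distribution with parameter lam."""
    return lam ** k * math.exp(-lam) / math.factorial(k)

p0 = poisson_pmf(8, 0)
p8 = poisson_pmf(8, 8)

# Answers 6.26, 6.27, 6.28 and 6.29 in turn.
print(round(p0, 4), round(1 - p0, 4), round(p8, 2), round(p8 ** 2, 3))
# 0.0003 0.9997 0.14 0.019
```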
6.30 This is best described by a geometric distribution with parameter 𝑝 = 0.06.
6.31 Let 𝑋 be a random variable with a geometric distribution with parameter 𝑝 = 0.06. We are
interested in finding the probability of exactly 2 failures before the first success.
$$\Pr(X = 2) = 0.06(1 - 0.06)^2 \approx 0.053.$$
6.41
6.42
$$\begin{pmatrix}
0 & 1 & 0 & 0 & 0 & 1 \\
1 & 0 & 1 & 0 & 0 & 0 \\
0 & 1 & 0 & 1 & 0 & 0 \\
0 & 0 & 1 & 0 & 1 & 0 \\
0 & 0 & 0 & 1 & 0 & 1 \\
1 & 0 & 0 & 0 & 1 & 0
\end{pmatrix}$$
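A matrix like this can be generated for a cycle of any length. The sketch below assumes the usual convention that the entry in row 𝑖 and column 𝑗 is 1 when the 𝑖th and 𝑗th vertices are adjacent and 0 otherwise, and that the cycle has length at least 3; the function name is illustrative.

```python
def cycle_adjacency_matrix(t):
    """Adjacency matrix of a cycle on t vertices (t >= 3 assumed)."""
    matrix = [[0] * t for _ in range(t)]
    for i in range(t):
        j = (i + 1) % t   # the next vertex around the cycle, wrapping at the end
        matrix[i][j] = 1
        matrix[j][i] = 1  # edges have no direction, so the matrix is symmetric
    return matrix

for row in cycle_adjacency_matrix(6):
    print(*row)
```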