Text Book
Published by Authors
Licensed under the Creative Commons Attribution-NonCommercial 3.0 Unported License (the
“License”). You may not use this file except in compliance with the License. You may obtain a
copy of the License at http://creativecommons.org/licenses/by-nc/3.0. Unless required
by applicable law or agreed to in writing, software distributed under the License is distributed on an
“AS IS” BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and limitations under the License.
I Part One
1 Fundamentals . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
1.1 Introduction 9
1.2 Review List 10
1.3 Useful Calculus Theorems 10
3 Computational Details . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
3.1 Number Systems 19
3.2 Fixed and Floating point numbers 20
3.3 Loss of Significance 24
4 Root-Finding Techniques . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
4.1 Bisection Method 27
4.1.1 Evaluating the Bisection method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
4.2 Fixed Point Iteration 33
4.2.1 Fixed Point Iteration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
4.3 Newton’s Method (Newton-Raphson) 39
4.3.1 Secant Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
4.3.2 Zeros of Polynomials . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
5 Polynomial Interpolation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
5.1 Monomial Interpolation 52
5.2 Lagrange Interpolation 54
5.3 Divided Differences 56
5.4 Error in Polynomial Interpolation 58
5.5 Interpolating Derivatives 59
5.6 Error in Hermite Interpolation 61
7 Numerical Differentiation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
7.0.1 Forward Difference Formula . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
7.0.2 Backward Difference Formula . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
7.0.3 Centered Difference Formula . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
7.1 Richardson Extrapolation 77
7.2 Formulas Using Lagrange Interpolating Polynomials 79
7.3 Roundoff Error and Data Errors in Numerical Differentiation 80
8 Numerical Integration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
8.1 Basic Quadrature 83
8.2 Composite Numerical Integration 87
8.3 Gaussian Quadrature 89
8.4 Adaptive Quadrature 91
8.5 Romberg Integration 93
8.6 Multidimensional Integration 94
9 Linear Algebra . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
9.1 Linear Algebra Basics 97
9.2 Vector and Matrix Norms 99
II Part Two
Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 129
I
Part One
1. Fundamentals
1.1 Introduction
Numerical analysis is the mathematical study of numerical algorithms (methods) used to solve
mathematical problems. The goal of such a study is to clarify the relative benefits and drawbacks
associated with a given algorithm. The need for such an understanding is multifaceted. People facing
“real world” problems that must be solved often discover the task is impossible to do by hand. In such
a situation, an appropriate algorithm for approximating the solution can be chosen by balancing the
theoretical properties of the method with the setting (problem statement, computational resources,
etc.) in which a solution is sought. Understanding the theoretical properties of existing methods can
also assist the numerical analyst in developing and analyzing new algorithms which may be more
appropriate for a given setting than those which are already available.
The goal of this course is to prepare you to choose an appropriate method to solve a given
problem, implement the method appropriately, and evaluate the results to determine effectiveness
of your method. A deep feeling for the algorithms and the theory associated with them is best
gained by direct calculation and experimentation with them. For this reason, this course emphasizes
derivation of methods, proof of associated theoretical properties, and computation of results using
the algorithms. Computations using a method can be tedious, making it impractical to implement a
method by hand beyond the most basic of cases. For this reason, programming is an integral part of
applying these methods, and a necessary part of this course. All of this makes this course one of
the most applied of applied mathematics courses.
The utility of the programming portion of this course extends beyond merely assisting with
understanding the algorithms more deeply. Programming is one of the most marketable skills you
can have as an applied mathematician, with MATLAB and Python currently (2017) being the two
most popular programming languages. Alongside the discussion of numerical methods in this
text you will find MATLAB or Python code or pseudo-code (a set of instructions emphasizing
the steps to program without the syntax and detail required for a particular language) to help you
develop your programming skills.
Enjoy!
Theorem 1.3.2 — Mean Value Theorem. If f ∈ C[a, b] and f is differentiable on (a, b), then ∃c ∈ (a, b) for which

f'(c) = (f(b) − f(a))/(b − a).
Theorem 1.3.3 — Rolle’s Theorem. If f ∈ C[a, b], f is differentiable on (a, b), and f(a) = f(b), then ∃c ∈ (a, b) s.t. f'(c) = 0.
Theorem 1.3.4 — Extreme Value Theorem. If f ∈ C[a, b], it must have a maximum and a
minimum on [a, b].
Theorem 1.3.5 — Taylor Series for a Function. Suppose the function f has derivatives of all
orders on an interval containing the point a. The Taylor series for f centered at a is
f(a) + f'(a)(x − a) + (f''(a)/2!)(x − a)² + (f^(3)(a)/3!)(x − a)³ + ... = ∑_{k=0}^∞ (f^(k)(a)/k!)(x − a)^k.
2. Introduction to Numerical Analysis
When dealing with a mathematical object such as a function, derivative, equation, definite integral,
etc., we are often confronted with conflicting realities:
• the need to derive information from the object, and
• practical limitations on our ability to perform the task.
In studying Numerical Analysis, we will come to understand, from various viewpoints, numerical
methods for approximating the information being sought. Simply understanding the steps involved
in using a numerical method is one part of this. The analysis portion of the study will provide
insight about conditions under which a method is expected to provide a “reasonable” approximation
as well as a way to control the algorithm in such a way that the subjective standard of “reasonable”
may be achieved. In the process, we find it necessary to return to the conflicting realities above:
perhaps the “reasonable” approximation cannot be found without “unreasonable” expectations.
Such reflection leads us to develop and analyze alternative methods.
Accuracy refers to your error: how close is your result to the actual value?
We often evaluate the accuracy of the algorithm and its relation to the method. A big, common,
question is: Does it converge?
Efficiency is evaluated by speed (in implementation), simplicity (is it easy for someone else
to follow?), and clarity (does it make sense to someone else? Commentary in your code is key to
clarity). At a secondary level we may also ask: is it concise? (Are you using 4 lines for an operation
that can be done in 1?) Is it simple, fast, concise, and clear?
Stability is required for a method to converge. This portion of the broader requirement of
well-posedness refers to the effect of small changes in the input data on the output of an algorithm.
If small changes to the input produce “small” changes in the output, the method is stable (or
well-conditioned). Conditional stability can occur if the stability of a method is dependent on the
initial data.
In application, you will find that accuracy and efficiency fight against each other. In any application, it is important to decide what matters most in your problem; often, this is determined by an employer or project leader. When accuracy takes precedence, as it would if you were sending a probe to Pluto, for example, efficiency means making your code run as quickly as possible without sacrificing accuracy.
In computer graphics it is often necessary to normalize a vector, meaning you multiply by the reciprocal of the square root of the sum of squares of its components. This task is completed millions of times a second in modern computer games, and exponentiation is an expensive operation. The fast inverse square root algorithm alleviates this expense by using only bit shifts and multiplications: it first computes an approximation with an error of about 3.4% and then refines this using Newton's Method to achieve an error of about 0.17%. The computational savings per application is very small, but doing this millions of times a second can provide a drastic improvement in performance.
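To make the idea concrete, here is a minimal Python sketch of the classic trick (the constant 0x5f3759df is the well-known "magic number"; the function name and the struct round-trip through single precision are illustrative, since Python floats are doubles):

import struct

def fast_inverse_sqrt(x):
    # Reinterpret the single-precision bits of x as a 32-bit integer.
    i = struct.unpack('<I', struct.pack('<f', x))[0]
    # Bit-level first guess: roughly 3.4% error.
    i = 0x5f3759df - (i >> 1)
    y = struct.unpack('<f', struct.pack('<I', i))[0]
    # One Newton step on f(y) = 1/y**2 - x sharpens the estimate.
    return y * (1.5 - 0.5 * x * y * y)

print(fast_inverse_sqrt(4.0))   # close to 0.5
print(4.0 ** -0.5)              # exact value for comparison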
Generally, we seek the best in all three: accuracy, efficiency, and stability. However, most of
the time one will have to be sacrificed over the others when working with larger problems. In the
next sections, we will dig into accuracy and efficiency. For now, consider the following example
emphasizing the importance of stability.
Example 2.1 — Testing Stability. Show that the function f(x) = 1/(x + 1) is stable about x = 0.99, but g(x) = 1/(1 − x) is not.
Approximate f(x) and g(x) using a Taylor Series approximation about x = 0.99.

f(0.99) = 1/1.99 ≈ 0.5025

f'(x) = −1/(1 + x)²

f''(x) = 2/(1 + x)³

g(0.99) = 1/0.01 = 100

g'(x) = 1/(1 − x)²

g''(x) = 2/(1 − x)³

Taylor Series Approximation:

f(x) = 0.5025 − (x − 0.99)/(1.99)² + (2/(1.99)³)((x − 0.99)²/2!) + ... (equality holds with infinitely many terms)

g(x) = 100 + (x − 0.99)/(0.01)² + (2/(0.01)³)((x − 0.99)²/2!) + ...
We can evaluate the stability by looking at the first few terms. Recall that, generally, the first few terms are the largest, and so we can use them to estimate the behavior of the function. In this case the terms for g(x) will grow because it is unstable, and we can see that in just the first two terms of the Taylor Series expansion.
f(x) ≈ 0.5025 − (x − 0.99)/3.9601

g(x) ≈ 100 + 10,000(x − 0.99)
Recall: to determine stability, we make a small change to the value of x, and evaluate the change
in the function value. We use the Taylor Series because it is a polynomial and easy to evaluate - we
can easily ‘see’ the stability analysis in the first few terms of the series.
Compare f(x) for two similar values of x: x = 0.99 and a nearby value x = 0.98. For the function to be stable, the change in output should be similar to the change in input, 0.01. If the change in the output is greater than the change in the input, you know it is unstable. Using the approximation, f(0.98) ≈ 0.5025 + 0.01/3.9601 ≈ 0.5025 + 0.002525, so the difference in the outputs is 0.002525 < 0.01, and we can see that for f(x) near x = 0.99, the function is stable, or well-conditioned.
If we complete the same analysis using g(x): x̄1 = 0.99, ȳ1 = g(x̄1) = 100 and x̄2 = 0.98, ȳ2 = g(x̄2) ≈ 100 + 10,000(−0.01) = 0.
Now, the difference in outputs is 100! This is clearly unstable because the output changed by orders of magnitude more than the input. A small change caused a huge change - this is unstable, or ill-conditioned.
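A quick numerical check of the same comparison (evaluating the functions exactly rather than through the truncated series; the exact change in g is 50 rather than the linearized 100, but the conclusion is the same):

def f(x):
    return 1.0 / (x + 1.0)

def g(x):
    return 1.0 / (1.0 - x)

x1, x2 = 0.99, 0.98
print(abs(f(x2) - f(x1)))   # about 0.0025, the same order as the 0.01 change in x
print(abs(g(x2) - g(x1)))   # 50.0, orders of magnitude larger than the change in x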
Sources of error:
1. Mathematical model: we must make assumptions about the system in order to form a
mathematical model. There can be errors made in those assumptions which will propagate
through the whole problem. However, those assumptions are often necessary simplifications
to the problem to make it solvable (trade-offs).
2. Data: whenever data measurements are taken, there is error. There is always a limit to their
accuracy, and there is also noise. However, these are inherent and unavoidable errors - we
can only make attempts to quantify them and manage them to minimize their effects.
3. Numerical Method: discretization creates error, but it is defined and manageable error
(usually). It will also relate to the convergence properties of the method.
4. Round-off error: this arises due to finite precision of a machine - computers do not store
numbers to infinite precision. This error is unavoidable, but can be approximated if you know
your system. More on this soon.
5. Mistakes: this is by far the largest source of error, and the most common. Can arise from
data collection, transcription, malfunction, etc.
We want to avoid error, and minimize it.
Relative error provides an indication of how good a value is relative to the size of the exact
value. Relative error is effectively a percentage of error, and related to the absolute error by
relative error = absolute error / size of exact value = |u − v| / |u|.
Relative error is more commonly used because it contains information about the problem.
Example 2.2 If someone tells you, “I had only 0.1 error!" (absolute measurement), that might
sound great if the value of the actual was 100. However, if the actual was 0.2? That’s awful. The
relative error in these two cases is more informative: 0.1% vs. 50%. With a relative measurement,
it is clear that the first measurement was good, and the second was not - in fact, it likely has errors
in implementation to be so far off.
However, we will not always want a relative measurement. The relative error can break down
when the value of your actual is small (round-off error occurs when dividing by small numbers). In
these cases, the absolute error is more meaningful.
Recall your Taylor Series! Go practice it - NOW!!! (I can’t stress this enough.)
Recall that you can use your Taylor Series to approximate a function, so if we want to approx-
imate f (x − h), and we know values of the function and its derivatives at x, we can approximate
f(x − h) = f(x) − h f'(x) + (h²/2) f''(x) − (h³/6) f'''(x) + ...     (2.1)
Sidenote: you learned Taylor Series as approximating f(x) centered about an x-value a through f(x) = f(a) + f'(a)(x − a) + (f''(a)/2!)(x − a)² + .... In these approximations, we are replacing x with x − h, and a with x.
The problem asks us to determine the derivative of f (x), which is inside of this Taylor Series.
So, we can manipulate it algebraically to see that
h f'(x) = f(x) − f(x − h) + (h²/2) f''(x) − (h³/6) f'''(x) + ...     (2.2)
and
f'(x) = (f(x) − f(x − h))/h + (h/2) f''(x) − (h²/6) f'''(x) + ...     (2.3)
If h is small, we can approximate the derivative through
f'(x) ≈ (f(x) − f(x − h))/h     (2.4)
The discretization error is the error defined by this approximation. So, if we determine the
absolute error in this approximation - it is defined by the remaining terms in the series.
Absolute error = |exact − approximate| = |f'(x) − (f(x) − f(x − h))/h| = |(h/2) f''(x) − (h²/6) f'''(x) + ...|
Again, if h is small we know that h/2 dominates h²/6, and as such we can define the order of our error in terms of the largest term in our discretization error. We define our discretization error to be proportional to (h/2)|f''(x)| = O(h), using Big O notation.
Our discretization error is defined by the method and our choice of the value h (will often be
called a step size later).
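As a sanity check, a short sketch of the one-sided difference (2.4) applied to a function whose derivative we know; the test function sin(x) and the step sizes are illustrative choices:

import math

def backward_difference(f, x, h):
    # Approximate f'(x) with the one-sided quotient (f(x) - f(x - h)) / h.
    return (f(x) - f(x - h)) / h

x = 1.0
for h in [0.1, 0.05, 0.025]:
    err = abs(backward_difference(math.sin, x, h) - math.cos(x))
    print(h, err)   # the error roughly halves with h, consistent with O(h)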
We also use Big O notation to discuss computational expense. Computational expense is gen-
erally measured through the processing required by a computer to solve the problem. Specifically,
we like to refer to this as the number of operations needed to complete that process. So, we use
Big O notation to define how many operations will be required by the computer to complete the
algorithm with respect to n. This is particularly useful for large-scale problems that require a lot of
processing power (we will not do any this semester, but it is important to be aware of).
Example 2.5 A common choice is to halve your step size (because it is easy to modify this way),
but that doubles the number of points. If the computational expense of your method is only O(n),
then you are only doubling the number of operations, and it will only take about twice as long.
However, if your method uses O(n3 ) operations, then doubling the number of points will require an
order of eight times as many operations. If the program took only 20 minutes and you halve the
step size - then it will likely take approximately 160 minutes (this is dependent on more variables,
but operation count can give you a rough estimate). You want to make sure that it is worth it before
requiring that kind of processing power.
Big Theta or Θ notation defines bounding of a function, φ (h) above and below by another
function ψ(h). This is useful for analyzing behavior of an unknown function, if it can be related
through a known function.
e.g. φ (h) = Θ(ψ(h)) if cψ(h) ≤ φ (h) ≤ dψ(h) ∀h > 0 (small h) where c, d are constants.
The base in which we work is not critical. There are some who believe that we should work in
base 12 since it has more divisors and would make calculations of thirds and fourths easier. In this
system, we would need to create twelve symbols (two extra) so that our number system is
0 = 0, 1 = 1, 2 = 2, 3 = 3, 4 = 4, 5 = 5, 6 = 6, 7 = 7, 8 = 8, 9 = 9, 10 = χ, 11 = ξ .
Computers store numbers in base 2 so there are only two symbols used, either 0 or 1. So to store 6523.75 on a computer, we would first need to convert it to a base 2 number: 6523.75 = (1100101111011.11)₂.
In a computer, arithmetic is performed via switches which can only be on or off and is
represented by either a 0 or a 1. The vocabulary used for this is a bit, which is a 0 or a 1. A byte is composed of 8 bits, and a word is a machine-dependent unit, since it is the number of bits processed at once by a computer's processor. So if you have a 32-bit machine, a word holds 32 bits (or 4 bytes).
Example 3.1 Given 101101 in binary, we can convert it to a base 10 number through
1 × 2⁵ + 0 × 2⁴ + 1 × 2³ + 1 × 2² + 0 × 2¹ + 1 × 2⁰ = 32 + 0 + 8 + 4 + 0 + 1 = 45
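For quick experiments, Python's built-ins can do these conversions directly (a convenience, not part of the hand computation above):

print(int("101101", 2))   # 45
print(bin(45))            # 0b101101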
x = ±1.b1b2···bN × 2^p

where 1.b1b2···bN is the normalized binary representation of our number (its leading digit, before the point, is nonzero), called the mantissa, p is called the exponent (also stored in binary representation), and N is the size of the mantissa. A number being normalized means that there is only one digit before the point and, in base 2, it is 1. Although a computer can also deal with non-normalized numbers, we will not consider such in this text.
The size of the mantissa and the exponent are machine dependent, as is the base being used in
the computer. In this text, we will only concern ourselves with base 2. Table 3.1 lists the sizes of the sign, mantissa, and exponent for single, double, long double, and quadruple precision in the IEEE 754 Floating Point Standard.
Table 3.1: IEEE 754 Floating Point Standard precision storage and exponent bias
Using the IEEE 754 Floating Point Standard, in single precision the first bit determines the sign
of the number, the next 8 bits represent the exponent, and the last 23 bits represent the fraction. The
sign of a number is represented by a 0 for positive and a 1 for negative. Note that there is no sign bit
for the exponent. Therefore, we must make allowances for both positive and negative exponents by
use of an exponent bias. Consider single precision which allows for 8 bits in the exponent so that
we can store any exponent value between (00000000)₂ = (0)₁₀ through (11111111)₂ = (255)₁₀. However, these are all positive values, so to store the negative values we “shift” this number line so that about half of the numbers will be considered negative, the other half positive, and zero lies near the center. If we shift this number line by (01111111)₂ = (127)₁₀ = 2⁷ − 1, then we can represent any integer exponent between −126 and 127. We do reserve (00000000)₂ and (11111111)₂ for the special numbers of denormalized and overflow, respectively. A summary of
this information can be found in Table 3.2.
Table 3.2: Exponent representation in IEEE 754 Floating Point Standard for single precision

Storing the number 6523.75 on the computer in single precision is accomplished by first converting it to binary, 6523.75 = (1100101111011.11)₂ = (1.10010111101111)₂ × 2¹², so that the sign bit is 0, the biased exponent is 12 + 127 = 139 = (10001011)₂, and the fraction is the mantissa bits padded with zeros, giving

0|10001011|10010111101111000000000

in the computer.
Example 3.3 To store the number 49.4 in single precision, we first convert to binary,

49.4 = (110001.01100110011...)₂ = (1.10001011001100110011001100...)₂ × 2⁵,

and to store this on the computer we would need to round up since the 24th bit of the fraction is 1. Therefore, 49.4 is stored in single precision as

0|10000100|10001011001100110011010
How much of a rounding error did we incur when converting 49.4 to single precision? In order
to answer this question, we must note that in single precision we round by 2−23 . Depending upon
which precision and machine hardware you are working with this value can change. In double
precision, this value would be 2⁻⁵². This number is called machine epsilon, denoted εmach, and it is the gap between 1 and the next floating point number greater than 1. This does not mean that εmach is indistinguishable from zero.
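In Python (which uses double precision), this gap is available directly; a quick check:

import sys

print(sys.float_info.epsilon)   # 2.220446049250313e-16
print(2.0 ** -52)               # the same value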
In order to store 49.4 on the machine, we rounded up and chopped off the repeating portion so
our resulting error is
(2⁻²³ − (0.10011001100...)₂ × 2⁻²³) × 2⁵ = (2⁻²³ − (10.011001100...)₂ × 2⁻²⁵) × 2⁵
= (4 − 2.4) × 2⁻²⁵ × 2⁵
= 1.6 × 2⁻²⁰
This is the absolute error between the value stored on the computer and the exact value of 49.4.
The IEEE 754 Floating Point Standard provides a model for computations such that the relative rounding error is no more than one-half machine epsilon. In our case of 49.4 we have a relative error of
(1.6 × 2⁻²⁰)/49.4 = (8/247) × 2⁻²⁰ = (1/247) × 2⁻¹⁷ < 2⁻²⁴ = (1/2) εmach
Arithmetic on floating point numbers is carried out within the registers of the processor and can be accomplished at a higher precision. For example, adding 6523.75 and 49.4 is computed as
(6523.75 + 49.4)₁₀ = (1.10010111101111)₂ × 2¹² + (1.100010110...)₂ × 2⁵
= (1.10010111101111)₂ × 2¹² + (1.1000101100110011001101)₂ × 2⁵
= (1.10010111101111 + 0.00000011000101100110011)₂ × 2¹²
= (1.10011010110100100110011)₂ × 2¹²
Example 3.5 Find the decimal (base 10) number for the following single precision number:

0|10000010|01100110000010000000000
The first bit is a 0, so we know the number is positive. The exponent is represented as

1 × 2⁷ + 0 × 2⁶ + 0 × 2⁵ + 0 × 2⁴ + 0 × 2³ + 0 × 2² + 1 × 2¹ + 0 × 2⁰ = 130.

The integer value of the exponent is this value minus the machine bias, 130 − 127 = 3, and the fraction is determined through negative powers of 2:

2⁻² + 2⁻³ + 2⁻⁶ + 2⁻⁷ + 2⁻¹³ = 0.3985595703125,

so the value stored is (1.3985595703125) × 2³ = 11.1884765625.

We find that we have 12 significant digits in the decimal format of the value actually stored, but any decimal number between 11.18847608566284 and 11.18847703933716 has the same single precision representation as 11.188477, indicating that we only have about 8 significant digits.
The point here is that machine precision can mess things up. If we look at the next smallest number in floating point, we have

0|10000010|01100110000001111111111

which has a decimal representation of about 11.1884756. If we look at the next largest number in floating point, we have

0|10000010|01100110000010000000001

which has a decimal representation of about 11.1884775.
There are infinitely many numbers in between these! We must realize that a computer can only
store a finite number of digits and this format will produce errors. This is important to realize
because machine precision is somewhat predictable, but it is not controllable - so you need to be
aware of its existence. However, you cannot always account for it exactly. Generally we bound
round off errors, using information about the machine precision of our system.
Example 3.6 — Seeing the Computational Error. We can illustrate the error in Python using the following simple code, which stores the value 6.4, subtracts 6.0, subtracts 0.4, and then reports the relative error.

x = 6.4
y = x - 6.0
z = y - 0.4
rel_err = abs(z) / x
print("\nx = %e\ny = %e\nz = %e\n" % (x, y, z))
print("rel_err = %e" % rel_err)

which outputs

x = 6.400000e+00
y = 4.000000e-01
z = 3.330669e-16
rel_err = 5.204170e-17
This example is performed in IEEE 754 Floating Point Standard of Double precision so that
εmach = 2−53 ≈ 1.1102 × 10−16 .
3.3 Loss of Significance
When two nearly equal numbers are subtracted, their common leading digits cancel. For example, if x and y agree in their leading digits so that
x − y = 0.00000261,
the difference retains only three significant digits. This occurrence is called a loss of significance. In this case it is
unavoidable, but in general you should look at other ways of performing the operation so that this
is avoided. The following examples will help to illustrate this further.
Example 3.7 When computing the roots of a quadratic equation, say ax2 + bx + c = 0, we can
use the closed formulas of
x₁ = (−b + √(b² − 4ac))/(2a)   and   x₂ = (−b − √(b² − 4ac))/(2a)
but might incur a loss of significance for one of these roots. For example, let us only use two
significant digits so that if we compute the roots for
x² + 200x + 1 = 0,
then
x₁ = (−200 + √39996)/2   and   x₂ = (−200 − √39996)/2

resulting in x₁ = 0 and x₂ = −200. Note that with the use of only two significant digits √39996 ≈ 200, and the difference of these two numbers will provide a loss of significance. Interestingly enough, there is a way to avoid the loss of significance in this situation. Multiplying by the conjugate, we can compute x₁ via

x₁ = ((−200 + √39996)/2) × ((−200 − √39996)/(−200 − √39996)) = 4/(2(−200 − √39996)) ≈ −1/200 = −5.0 × 10⁻³.
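A short sketch comparing the two formulas in Python (in full double precision the difference is small for this particular equation, but the conjugate form is the safe choice whenever −b and the square root nearly cancel):

import math

# Roots of x^2 + 200x + 1 = 0 computed two ways.
a, b, c = 1.0, 200.0, 1.0
disc = math.sqrt(b * b - 4.0 * a * c)

# Naive formula: -b + sqrt(b^2 - 4ac) subtracts two nearly equal numbers.
x1_naive = (-b + disc) / (2.0 * a)

# Conjugate form: 2c / (-b - sqrt(b^2 - 4ac)) avoids the cancellation.
x1_conj = (2.0 * c) / (-b - disc)

print(x1_naive, x1_conj)   # both near -5.000125e-03 here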
Example 3.8 Another situation in which a loss of significance can result is when we add numbers which are relatively far apart from one another. For example, the finite sum

∑_{n=1}^{10⁸} 1/n²

can provide the wrong answer when one is not careful. The following code performs this calculation
in two ways: one where n is increasing and the other where n is decreasing.
mySum_1 = 0.0
for i in range(1, 100000001):
    n = i * 1.0
    n = n * n
    n = 1 / n
    mySum_1 = mySum_1 + n
print(mySum_1)

mySum_2 = 0.0
for i in range(100000000, 0, -1):
    n = i * 1.0
    n = n * n
    n = 1 / n
    mySum_2 = mySum_2 + n
print(mySum_2)
The output is
1.64493405783
1.64493405685
Remark: When computing this sum with an increasing n, we are adding the largest values first, meaning that the smaller values will be dwarfed. This dwarfing causes the loss of significance. Note that when n = 1 we have 1/n² = 1, and when n = 100000000 we have 1/n² = 1.0 × 10⁻¹⁶.
Whenever we use a computer to perform our calculations we are most often not using exact
arithmetic and must be wary of pitfalls that can crop up by the simple nature of the finite precision
machine. The floating point standards allow us to control these errors but we should also try and
avoid introducing errors whenever possible. However, there are situations in which we may find
ourselves comfortable with the error in exchange for efficiency.
4. Root-Finding Techniques
Many numerical methods of interest to the numerical analyst and practitioner are based on deter-
mining the zeros of a given function. We begin the journey of understanding such methods by
focusing on techniques for solving nonlinear equations in one variable. Although this may seem a
modest undertaking, the methods here are important and their analysis provides a foundation for
understanding many of the points that one must bear in mind when implementing a method:
Accuracy: are we getting results that make sense? Check your result.
Efficiency: small number of function evaluations (these can be very computationally expensive)
Stability/Robustness: fails rarely, if at all, and announces failure if it happens (e.g. if you enter an
a and a b that both give positive values for f (x) - it should say so.) Predict possible errors,
whenever possible, and test them to define the error message.
To optimize your method you also want to
• Minimize requirements for applying the method
• Minimize smoothness requirements for applying the method
• Use a method that generalizes well to many functions and unknowns
Generally, only some of these will be satisfied. The important distinction is to recognize what is
needed for the problem you are solving.
Mathematically, our goal is to address the problem of determining x such that f (x) = 0 for a
given function f (x). The techniques will build upon some mathematical result to drive the ultimate
form of the method.
If f (x) ∈ C[a, b], and f (a) < 0 and f (b) > 0, then ∃c ∈ (a, b) such that f (c) = 0.
This idea leads us to the Bisection method. This is the method you would inherently try if
you had no other way to determine the intercept. If we have a function f that is continuous on the
interval [a, b] where the function is negative at one endpoint and positive at the other, we know
there is a point in the interval where the function is zero. We start with the endpoints, confirm that
one is negative and the other is positive. Then, evaluate the midpoint.
If the midpoint is negative, then it replaces the endpoint which was negative. If the midpoint
is positive, then it replaces the endpoint which was positive. We repeat this process until we are
satisfied with the result meaning we are within a desired tolerance.
We have to define an error tolerance; how close to zero does the value need to be for us to stop
implementing this method? We refer to this as our stopping criteria. When the error is less than our
stopping criteria (error tolerance), then we can stop.
Example 4.1 Given the graph of f (x) = x5 − 2x4 + 2x3 − 2x2 + x − 1 on [0, 2], there exists an
x-intercept. Apply the Bisection method to find it.
f (0) = −1 < 0
f (2) = 32 − 32 + 16 − 8 + 2 − 1 = 9 > 0
The midpoint is given by (a + b)/2 = (0 + 2)/2 = 1
f (1) = 1 − 2 + 2 − 2 + 1 − 1 = −1 < 0
So, x = 1 will replace x = 0 as our left endpoint. Our new interval is now [1, 2]
Since f (1) < 0, and f (2) > 0, we repeat the process on our new interval.
The midpoint is given by (1 + 2)/2 = 1.5
So, x = 1.5 will replace x = 2 as our right endpoint. Our new interval is now [1, 1.5].
Since f (1) < 0, and f (1.5) > 0, we repeat the process again, on our new interval.
The midpoint is given by (1 + 1.5)/2 = 1.25
So, x = 1.25 will replace x = 1 as our new left endpoint. Our new interval is now [1.25, 1.5].
Currently our best approximation of the root is given by x = (1.25 + 1.5)/2 = 1.375. Note that f(1.375) ≈ −0.4411.
We would continue this process until the value we receive for f (mid point) is ‘close enough’ to
zero. This is typically defined by the problem, or in application - by you.
For these types of problems, −0.4411 is still not close enough to zero to warrant stopping. You probably want an error of 0.01 or less, especially if you plan to use this number for something else.
1. Input f , a, and b, such that f (a) and f (b) are opposite signs.
2. Determine the midpoint x = (a + b)/2.
3. Evaluate f (x).
4. Check to see if f (x) is within error tolerance.
5. If not, determine if f (x) > 0 or < 0.
6. Define the new interval (replace a or b with x).
7. Repeat until error tolerance is met, then return value of x.
This is an iterative method, which means we repeat the same process until the specified stopping
criteria is met. Stopping criteria here was defined in terms of the error tolerance, but that is not
always defined because we don’t always know what ‘actual’ is.
Stopping criteria can be defined by:
• Fixed number of iterations allowed
• Difference between two successive approximations (the current estimate and the last estimate): generally, |xn − xn−1| < tol
• Error tolerance | f (x)| < tol
For these methods, we can also say: if f (x) == 0, we can stop.
Note: Since we are seeking a function value near zero ( f (x) = 0), we will use absolute error
rather than relative error.
Limitations of Bisection: Due to the requirements to apply the method, we can only locate odd
roots. The function must cross the x-axis in order to use the Bisection method.
Note: This graph has an odd root at x = −2, and an even root at x = 1. Bisection cannot be
used to find the root at x = 1 because it does not cross the x-axis.
Convergence: Since the error is decreased roughly by half at each iteration, E = O((b − a)/2ⁿ), where b − a is the width of the initial interval, and n is the number of iterations.

If En = (b − a)/2ⁿ, using En = tol as the error tolerance, solve for n:

2ⁿ = (b − a)/En

n ln 2 = ln(b − a) − ln(En)

n = (ln(b − a) − ln(En))/ln 2
Note that this provides a very slow convergence.
Example 4.2 Analyze the convergence of the Bisection method for f(x) = x⁵ − 2x⁴ + 2x³ − 2x² + x − 1 on [0, 2] (our previous example), with En = (2 − 0)/2ⁿ, for a tolerance of tol = 0.01.

We desire our absolute error to be less than our tolerance so that

0.01 > 2/2ⁿ

2ⁿ > 200

n ln 2 > ln 200

n > ln(200)/ln(2)

n > 7.64386
Note: The value of f (x) is not within our error tolerance, but the value of x must be within
tolerance from the actual root. At the 8th iteration, our approximation would be the midpoint
x = 1.4609375, with f (x) ≈ −0.027115. So, it is important to know that this approximation does
not ensure that f (x) is within tolerance, only that x is within tolerance.
The following implementation helped to produce the results above. Note that this code is not
robust and may not always produce the desired results. For instance, there is no checking for
whether we have exceeded a maximum number of iterations.
# Bisection Method: computes approximate solution of f(x)=0
# Input: function f; a,b such that f(a)*f(b)<0,
#        and tolerance tol
# Output: Approximate solution x such that f(x)=0

def sign(x):
    if x < 0:
        return -1
    elif x > 0:
        return 1
    else:
        return 0

def mybisect(f, a, b, tol):
    fa = f(a)
    fb = f(b)
    # Loop while half the current interval width exceeds the tolerance.
    while (b - a) / 2.0 > tol:
        x = (a + b) / 2.0
        fx = f(x)              # the only function evaluation in the loop
        if fx == 0:
            return x
        if sign(fx) == sign(fa):
            a = x              # the root lies in the right half
            fa = fx
        else:
            b = x              # the root lies in the left half
            fb = fx
    # The final approximation is computed on the return.
    return (a + b) / 2.0

# bisect_test.py
from mybisection import mybisect

def f(x):
    return x**5 - 2.0*x**4 + 2.0*x**3 - 2.0*x**2 + x - 1.0

tol = 1.0e-2
x = mybisect(f, 0, 2, tol)
print(x)
print(f(x))
In our testing script, the code defines our function f(x) and then makes a call to the bisection
function with the initial interval endpoints and stopping tolerance. The actual work is completed
within mybisection.py. Note that the eighth iteration is not performed within the while-loop but
is calculated on the return for the function. Also note that our stopping criteria is determined by
half the width of the interval which ensures that we are within our tolerance of the actual root. It is
worth noting that within the while-loop, we only perform one function evaluation, namely at the
midpoint of the current interval.
We could define different stopping criteria. Depending on the problem, it may be more important
that f (x) is close to zero, or it might be more important to have a close approximation for x. In our
example, we would need to perform one more iteration in order for f (x) to be within the tolerance
of 0.01.
Real-world concept: if you’re trying to locate what city a projectile was fired from, you care
more about the accuracy of x. However, if you are firing a projectile and you want it to release a
capsule at a specified height - you need f (x) to be as accurate as possible.
Accuracy: The Bisection method can achieve the desired accuracy, but it might take many iterations.
x = 0: g(0) = 1. Set x = 1
x = 1: g(1) = 2. Set x = 2
x = 2: g(2) = −7. We have a problem. It's clearly not converging. Why? The function is too steep near the fixed point. Note: the fixed point happens where g(x) (cyan) crosses the graph of y = x (blue).
Without the graph, we can determine this by checking the derivative: g'(x) = −5x⁴ + 8x³ − 6x² + 4x.
At x = 0, g'(0) = 0 - this may not raise a red flag because 0 ≤ ρ < 1, but since it is zero, look at the next value.
At x = 1, g'(1) = 1, and since 1 is greater than any ρ < 1... this will diverge from here.
Based on this, and the visual of the graph - we can see it is too steep to converge.
Given f(x) = x³ + 4x² − 10, find the root using Fixed Point Iteration.
First, convert the root-finding problem to a fixed point problem: use g(x) = x − f (x) for sim-
plicity.
g(x) = x − x³ − 4x² + 10
Option: Solve f(x) = 0 for x, and check to ensure the derivative will be small enough to converge.

x³ + 4x² − 10 = 0

x³ + 4x² = 10

x²(x + 4) = 10

x² = 10/(x + 4)

x = √(10/(x + 4))

Plot the new g(x) = √(10/(x + 4)).
(Plots of the first and second iterations of the fixed point method for this g(x).)
Fixed Point requires the function to be slowly varying, |g'(x)| ≤ ρ < 1, in order to converge. In
this example, the function is slowly varying, and we can see that the values spiral in to the fixed
point.
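A minimal sketch of the iteration in Python (the function name fixed_point and the tolerance and iteration-cap parameters are illustrative choices, not from the text):

import math

def fixed_point(g, p0, tol=1e-6, max_iter=100):
    # Iterate p_n = g(p_{n-1}) until successive iterates agree to within tol.
    p = p0
    for _ in range(max_iter):
        p_new = g(p)
        if abs(p_new - p) < tol:
            return p_new
        p = p_new
    return p

# The slowly varying choice from the example above: g(x) = sqrt(10/(x+4)).
print(fixed_point(lambda x: math.sqrt(10.0 / (x + 4.0)), 1.5))   # about 1.36523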
Proofs of Convergence
Uniqueness: This is a proof that if there are two fixed points in an interval satisfying the require-
ment that g(x) is slowly varying, then those two points are actually the same point - so the fixed
point is unique. Proof by contradiction.
Suppose there exist two points p and q which are fixed points of g(x) in [a, b], where p ≠ q.
By the Mean Value Theorem, ∃ξ ∈ [a, b] s.t. g'(ξ) = (g(p) − g(q))/(p − q).
Since p and q are fixed points, g(p) = p and g(q) = q. So |p − q| = |g(p) − g(q)|, and thus

|g'(ξ)| = |g(p) − g(q)|/|p − q| = 1.

This is a contradiction of the requirement that g(x) is slowly varying, |g'(x)| ≤ ρ < 1.
Therefore, p = q, implying that on a slowly varying interval of g(x), there will only exist one
fixed point.
Convergence: This is a proof to show that if the convergence requirements are met, Fixed Point
method does, in fact, converge to the fixed point of the function.
If g(x) ∈ C[a, b], g(x) ∈ [a, b] ∀x ∈ [a, b], g'(x) exists on (a, b), and ∃ρ, 0 < ρ < 1, s.t. |g'(x)| ≤ ρ ∀x ∈ (a, b), then ∀p0 ∈ [a, b] the sequence pn = g(pn−1), n ≥ 1, converges to a unique fixed point p ∈ [a, b].
In other words, these are sufficient conditions for convergence, not necessary conditions.
Assumptions:
1) ∃p s.t. g(p) = p,
2) g(x) ∈ [a, b] ∀x ∈ [a, b],
3) ∃pn ∈ [a, b].
Given those, we can evaluate our absolute error: |pn − p| = |g(pn−1) − g(p)| = |g'(ξ)||pn−1 − p| by the MVT.
We can use this to expand upon our error with each iteration:

|pn − p| ≤ ρ|pn−1 − p| ≤ ρ²|pn−2 − p| ≤ ... ≤ ρⁿ|p0 − p|.

Since we define 0 < ρ < 1, limn→∞ ρⁿ = 0, and this ensures that if our conditions are all met, this method will converge!
Or, pn → p as n → ∞.
So, we can say that the method converges at a rate (ρ n ). This is nice because it says that the
smaller your slope is, the faster it will converge to the fixed point.
So, to prescribe g(x) as efficiently as possible, ensure ρ is as small as possible (not zero!) for
faster convergence.
Example 4.3 Consider the function f(x) = x³ − x − 1, and solve for the root within 10⁻² (i.e., two digits of precision, so that we will have a tolerance of 0.5 × 10⁻²) accuracy on the interval [1, 2] with initial guess p0 = 1.
First, rewrite the function as a fixed point problem: first attempting the easiest, we choose g(x) = x + f(x) = x³ − 1. In order to determine if this is a convergent sequence, we need to determine whether the derivative is less than one. Note that we specifically need |g'(p)| < 1, but we do not know p. Without knowing the fixed point, we will need to check the derivative for the entire interval. Starting with the left endpoint we see that

g'(x) = 3x²  and at x = 1, g'(1) = 3 > 1,

and we note that g'(x) = 3x² is an increasing function over the interval [1, 2] so that |g'(x)| > 1 for all x ∈ [1, 2]. This choice of g(x) will diverge, meaning that we need to choose a different g(x).
For our next attempt we will solve for x given the equation x³ − x − 1 = 0. Performing a little algebraic manipulation,

x³ = x + 1

x² = 1 + 1/x

x = ±√(1 + 1/x)

resulting in g(x) = √(1 + 1/x), since we know our fixed point is positive. It is important to point out that our interval of [1, 2] precludes zero, so this function is valid. Additionally, the interval contains only positive values, so we run no risk of negative values within the radical.
As before, we need to check for convergence of the sequence and begin by considering the derivative,

g'(x) = (1/2)(1 + 1/x)^(−1/2) (−1/x²) = −1/(2x² √(1 + 1/x)).

On the interval [1, 2], |g'(x)| is a decreasing function with its largest value obtained at x = 1. So at x = 1, |g'(1)| = 1/(2√2) ≈ 0.3536 < 1. Since this is the largest value for |g'| on the interval [1, 2], we can use it to bound ρ ≤ 0.3536.
Recall that we seek

|pn − p| ≤ ρⁿ max_{p0 ∈ [1,2]} {|1 − p0|, |2 − p0|} < 0.5 × 10⁻².
The largest possible value for the maximum is 1, because that is the size of our interval. So, if we assume that value is 1, we define n through ρⁿ < 0.5 × 10⁻², which we can solve for n. Since ρ < 1, ln ρ < 0, so n > ln(0.5 × 10⁻²)/ln ρ.
In this example, ρ = 0.3536, and the largest value of n will be realized when ρ is near its upper bound, so n > ln(0.5 × 10⁻²)/ln(0.3536) ≈ 5.096, so it should take no more than 6 iterations to ensure two digits of precision.
Run it!
n pn g(pn )
1 1 1.414214
2 1.414214 1.306563
3 1.306563 1.328671
4 1.328671 1.323870
5 1.323870 1.324900
Graphically:
As you can see, we have 10⁻² accuracy at the 4th iteration. By using the maximum value,
we have a conservative estimate for the number of iterations (it usually converges faster than our
estimate, but our estimate ensures we reach tolerance).
Efficiency: Determining g(x) is the most time-consuming process, once you have it the algo-
rithm is efficient.
Minimal requirements to apply the method? To apply it, yes - but it will not converge if g(x)
does not vary slowly.
Generalizes well? No. Fixed point is not going to extend well, although the family of methods
is diverse in and of itself.
If f (p0 ) > 0 and f 0 (p0 ) > 0, we want to move left to find a zero.
If f (p0 ) > 0 and f 0 (p0 ) < 0, we want to move right to find a zero.
Similarly, if f (p0 ) < 0 and f 0 (p0 ) > 0, we want to move right to find a zero.
And, if f (p0 ) < 0 and f 0 (p0 ) < 0, we want to move left to find a zero.
What we can conclude from this, is that the slope and the value at our initial point determine
where to go next.
Graphically (Draw): We can use the point and slope calculation to determine the tangent line,
and then find the intercept of the tangent line easily. - Use that as the next guess.
Using the earlier example: f(x) = x⁵ − 2x⁴ + 2x³ − 2x² + x − 1 with an initial guess at x = 2.
f(2) = 9, f'(x) = 5x⁴ − 8x³ + 6x² − 4x + 1, and f'(2) = 33. So, the tangent line at x = 2 is
y = 33(x − 2) + 9
The root of the tangent line is easily located by setting y = 0 and solving for x. 0 = 33(x − 2) +
9 → x ≈ 1.72727273.
Repeat with x = 1.72727273. f(1.72727273) ≈ 2.6392944, f'(1.72727273) ≈ 15.271088, and the tangent line is y = 15.271088(x − 1.72727273) + 2.6392944.
Solving for the root of the tangent line yields the approximation to the root: 0 = 15.271088(x −
1.72727273) + 2.6392944 → x ≈ 1.554443. Given that f (1.554443) ≈ 0.632466, we see that we
are converging on the root of f (x).
Derivation:
Given:
1. f ∈ C²[a, b] (f, f', and f'' continuous on [a, b])
2. ∃p ∈ [a, b] such that f (p) = 0
3. p∗ is an approximation to p, and |p − p∗ | is small
We can approximate f(x) through its Taylor Series centered about x = p*:

f(x) = f(p*) + f'(p*)(x − p*) + (f''(p*)/2!)(x − p*)² + ...
This can be truncated by using another term, η(p*) ∈ (x, p*), such that evaluating f''(η(p*)) above satisfies the equality without the additional terms:

f(x) = f(p*) + f'(p*)(x − p*) + (f''(η(p*))/2!)(x − p*)²
If we then evaluate f(x) at x = p:

f(p) = 0 = f(p*) + f'(p*)(p − p*) + (f''(η(p*))/2!)(p − p*)²
Since we defined p* to be an approximation of p such that |p − p*| is small, we know that |p − p*|² ≪ |p − p*|. Thus, we can neglect the last term in the approximation and solve for p, yielding

p = p* − f(p*)/f'(p*).

This yields the fixed point iteration

pn = pn−1 − f(pn−1)/f'(pn−1)

for Newton's Method.
Pros: fast! Works with multiple roots.
Cons: Can diverge, and you need to evaluate both the function and its derivative.
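A minimal sketch of the iteration in Python (the function name newton and the stopping parameters are illustrative):

def newton(f, fprime, p0, tol=1e-8, max_iter=50):
    # Newton's method: p_n = p_{n-1} - f(p_{n-1}) / f'(p_{n-1}).
    p = p0
    for _ in range(max_iter):
        p_new = p - f(p) / fprime(p)
        if abs(p_new - p) < tol:
            return p_new
        p = p_new
    return p

f = lambda x: x**5 - 2*x**4 + 2*x**3 - 2*x**2 + x - 1
fp = lambda x: 5*x**4 - 8*x**3 + 6*x**2 - 4*x + 1
print(newton(f, fp, 2.0))   # the root near x ≈ 1.4656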
Problem cases:
1. If the initial point is too far from the root, it will diverge. (draw)
3. If there are multiple roots nearby, it can converge to the “wrong” one.
Convergence: Suppose f ∈ C²[a, b], p ∈ [a, b], and f(p) = 0, where f'(p) ≠ 0. Then ∃δ > 0 such that Newton's Method generates a sequence {pn}, n ≥ 0, converging to p ∀p0 ∈ [p − δ, p + δ].
This states that Newton’s Method will converge if the initial guess is ‘close enough’ to p.
Stability: Dependent upon how close the initial guess is, and the derivative of f (x).
Minimal requirements to apply the method? Certainly not, we need derivative information!
Minimal smoothness requirements? No, this has far more than previous methods.
Generalizes well? Yes and no. It generalizes, but becomes complex very quickly.
Speed of convergence:
A method is linearly convergent if |pn+1 − p| ≤ ρ|pn − p| as n → ∞ for some ρ < 1
Since a quadratically convergent method satisfies |pn+1 − p| ≤ M|pn − p|² = ρn|pn − p| with ρn = M|pn − p| → 0, quadratic convergence implies superlinear convergence.
More accurate methods typically require much more work for small changes in error → im-
practical.
Additionally, roundoff errors can overwhelm your results quickly with higher-order methods.
Generally, if limn→∞ |pn+1 − p|/|pn − p|^α = λ for α, λ > 0 ∈ ℝ, then {pn} converges to p at a rate of order α with error constant λ. For reference, a larger α results in faster convergence.
Ex: If pn = 1/2ⁿ, then pn → 0 as n → ∞.

Evaluate limn→∞ (1/2ⁿ⁺¹ − 0)/(1/2ⁿ − 0) = 1/2, so λ = 1/2 and α = 1, and the method is linearly convergent.
2n
1
2n+1 − 0
Whereas, if we evaluated limn→∞ 2 = 2n−1 , which diverges to infinity as n → ∞. So,
1
− 0
2n
we know it is not quadratically convergent.
Also, if λ = 0 (as is the case if α = 0.5), then λ ≯ 0, so the α value will not be the correct
speed of convergence, either.
If p0 = 1 and limn→∞ |pn+1 − p|/|pn − p| = 1/2 (since p is zero), we can extrapolate that for large n,

|pn+1| ∼ (1/2)|pn| ∼ (1/4)|pn−1| ∼ (1/2ⁿ⁺¹)|p0|.

Since we seek pn < 10⁻², this implies that (1/2ⁿ)p0 < 10⁻²; since p0 = 1, we solve

1/2ⁿ < 10⁻²

2ⁿ > 10²

n ln 2 > 2 ln 10

n > 2 ln 10/ln 2 ≈ 6.64
1. g ∈ C[a, b]
2. g(x) ∈ [a, b]∀x ∈ [a, b]
3. g'(x) ∈ C[a, b]
4. ∃ constant 0 < ρ < 1 s.t. |g'(x)| ≤ ρ ∀x ∈ (a, b)
5. g'(p) ≠ 0
Then, ∀p0 ∈ [a, b] the sequence pn = g(pn−1 ) will converge linearly to a unique fixed point
p ∈ [a, b]
So,
1. Suppose p is a fixed point of g(x)
2. g'(p) = 0
3. g''(x) ∈ C(a, b) and is bounded by M: |g''(x)| ≤ M on (a, b)
Then ∃δ > 0 s.t. ∀p0 ∈ [p − δ , p + δ ], the sequence pn = g(pn−1 ) will converge at least quadrat-
ically.
From our Taylor Series expansion, we had the error term (f''(p*)/2!)(x − p*)². That was for f(x), but the same is true of g(x). Given that the second derivative of g(x) is bounded by M, we can use that to define our convergence through

|pn+1 − p| < (M/2)|pn − p|²  for large n.
Constructing the fixed point method:
We know f(x) = 0 at the root p, thus for any differentiable function ϕ(x) we also know that −ϕ(x)f(x) = 0 there, so we may take g(x) = x − ϕ(x)f(x), which has p as a fixed point.
Using this form of g(x), we can determine the derivative g'(x) using the product rule: g'(x) = 1 − ϕ'(x)f(x) − ϕ(x)f'(x). Evaluating it at x = p, where f(p) = 0, and requiring g'(p) = 0 for quadratic convergence yields:
1 − ϕ(p) f'(p) = 0

1 = ϕ(p) f'(p)

Thus, ϕ(p) = 1/f'(p).
For this to hold, we must require that f'(p) ≠ 0.

Suppose/choose to set ϕ(x) = 1/f'(x), where f'(x) ≠ 0. Then

g(x) = x − f(x)/f'(x):   Newton's Method!

Since we derived it requiring that the method be quadratically convergent - it holds.
Do note that for an even root (rather than a simple root), it will only converge linearly.
It is, essentially, Newton’s method implemented with an approximate derivative instead of the
actual derivative.
f'(pn−1) ≈ (f(pn−1) − f(pn−2))/(pn−1 − pn−2)

Substituting this approximation into Newton's method yields

pn = pn−1 − f(pn−1)(pn−1 − pn−2)/(f(pn−1) − f(pn−2)).
Now we need two initial guesses, because we need two points in order to approximate the
derivative.
Secant Method works by determining a secant line with the initial two guesses, and using the
root of that secant line as the next guess. (Draw it out!)
Using the same example, f(x) = x⁵ − 2x⁴ + 2x³ − 2x² + x − 1, with the two initial guesses at x = 1 and x = 2: f(1) = −1, f(2) = 9. The secant line has slope (9 − (−1))/(2 − 1) = 10 because it passes through the points (1, −1) and (2, 9), so the secant line is y = 10(x − 2) + 9 = 10x − 11.
The root of the secant line is then used to build the next one: 0 = 10x − 11 → x = 1.1, f(1.1) = −0.97569, and the new secant line has slope (9 − (−0.97569))/(2 − 1.1) = 11.0841, yielding the secant line y = 11.0841(x − 2) + 9.
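A minimal sketch of the Secant method in Python (the function name secant and the stopping parameters are illustrative):

def secant(f, p0, p1, tol=1e-8, max_iter=50):
    # Newton's method with f'(p_{n-1}) replaced by a difference quotient.
    f0, f1 = f(p0), f(p1)
    for _ in range(max_iter):
        p2 = p1 - f1 * (p1 - p0) / (f1 - f0)
        if abs(p2 - p1) < tol:
            return p2
        p0, f0 = p1, f1
        p1, f1 = p2, f(p2)
    return p1

f = lambda x: x**5 - 2*x**4 + 2*x**3 - 2*x**2 + x - 1
print(secant(f, 1.0, 2.0))   # converges to the root near 1.4656; the first iterate is 1.1, as in the example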
Efficiency: Two function evaluations needed initially, but only one after that. Relatively efficient
(better than Newton’s to implement, but converges slower than Newton’s - tradeoff)
Minimal requirements to apply the method? Similar to Bisection, but not as minimal as fixed
point.
Minimal smoothness requirements? They aren't specified, but the implication is that the requirements are similar to Newton's.
Generalizes well? Yes and no. It generalizes, but becomes complex very quickly (Broyden’s
method, etc).
Recall that Secant uses two initial points to construct a secant line.
Müller’s Method uses 3 initial points to construct a quadratic approximation to find the intercept.
Input p0 , p1 , and p2 . Use f (p0 ), f (p1 ), and f (p2 ) to construct the system and solve for a, b,
and c.
(Figure: error versus iteration number for the Bisection, Secant, and Newton's methods.)
Example 4.4 Using the same function f(x) = x⁵ − 2x⁴ + 2x³ − 2x² + x − 1, and the three points x = 1.2, 1.6, and 2, use Müller's method to approximate the root.
Given p0 = 1.2, p1 = 1.6, and p2 = 2, with f(1.2) = −0.88288, f(1.6) = 1.05056, and f(2) = 9, determine the quadratic which passes through all three points.
Solve for the x-intercepts of P(x) = 18.8x² − 47.8064x + 29.4128 at x = 1.49963 and 1.04327.
Evaluate f (x) for both roots, and use the one with the value of f (x) closer to zero.
f (1.49963) ≈ 0.21623
f (1.04327) ≈ −0.99592
Replace p0 with p1 , p1 with p2 , and p2 with 1.49963. Repeat until tolerance is met.
Müller's method will converge quickly without the need for a close initial guess, but it is less efficient, with an order of convergence of approximately α = 1.84 (for reference, Secant's is about 1.62). Additionally, it can compute complex roots.
5. Polynomial Interpolation
Recall: polynomial approximations of functions from Calc 2, using your Taylor Series.
An interpolant is built as a linear combination ∑_{k=0}^n ck ϕk(x). All ϕk(x)'s are predefined, simple functions: the “basis functions.” We solve for our ck coefficients to determine the interpolant: the “parameters.”
All ϕk(x)'s must be linearly independent (i.e., you cannot write any ϕk(x) as a linear combination of the other ϕk(x)'s).
Upon substituting in the data points we have, we will get a system of n + 1 equations and n + 1
unknowns. Then, we solve the system of equations to determine the coefficients for our interpolant.
Standard polynomial interpolation uses monomials for ϕk (x)’s and is also known as Monomial
Interpolation. This is the simplest and most familiar form for implementation. Each basis function has the form ϕk(x) = x^k, so that

f(x) = ∑_{k=0}^n ck x^k = c0 + c1 x + c2 x² + ... + cn xⁿ.
Note: we could use a Taylor Polynomial here, but we really don’t want to: Taylor Polynomials
are centered about a single point, and we want a polynomial that fits a set of data - which means we
want accuracy across an interval rather than about a single point.
Theorem 5.1.1 For n + 1 data points (x0, y0), (x1, y1), ..., (xn, yn), there exists a unique polynomial p(x) = pn(x) = ∑_{k=0}^n ck x^k of degree at most n which satisfies p(xi) = yi for i = 0, ..., n. This polynomial interpolates the n + 1 data points.
The proof of the theorem is straightforward. You assume there are two such polynomials and consider the polynomial which is the difference of the two, then show by the Fundamental Theorem of Algebra that the difference must be the zero polynomial, since it has too many zeros to be anything else.
Now that we have laid the theoretical groundwork for the method, let's work through an example to find the ck's that satisfy this.
Example 5.1 Given 3 points (1, 2), (2, 5), (3, 7), determine the monomial interpolant which satisfies all three points.

p2(1) = c0 + c1 + c2 = 2
p2(2) = c0 + 2c1 + 4c2 = 5
p2(3) = c0 + 3c1 + 9c2 = 7

Solving the first equation for c0 = 2 − c1 − c2 and substituting into the second gives c1 + 3c2 = 3, so c1 = 3 − 3c2. Substituting into the third,

2 − c1 − c2 + 3c1 + 9c2 = 7 → 2c1 + 8c2 = 5 → 6 − 6c2 + 8c2 = 5 → 2c2 = −1 → c2 = −1/2

c1 = 3 + 3/2 = 9/2

c0 = 2 − 9/2 + 1/2 = −2

Putting this together yields p2(x) = −2 + (9/2)x − (1/2)x²
(Plot of p2(x) passing through the three data points.)
In this picture, you can see that the interpolant passes through all three points exactly, as it is
designed to.
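A quick way to reproduce this computation is to solve the Vandermonde system numerically; a sketch assuming NumPy is available (NumPy is not used elsewhere in this text):

import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 5.0, 7.0])
V = np.vander(x, increasing=True)   # columns 1, x, x^2
c = np.linalg.solve(V, y)
print(c)                            # [-2.   4.5 -0.5], i.e. p2(x) = -2 + (9/2)x - (1/2)x^2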
If this approach is extended to many data points, the system becomes ill-conditioned. This means that it becomes prone to round-off error, and our solutions become incorrect as a result. Monomial interpolation is only worthwhile when used for a small number of data points.
Accuracy: Monomial interpolation is good for a small number of data points, but for a large number of data points it becomes inaccurate due to roundoff error in the linear system that must be solved to determine the interpolant.
Efficiency: Monomial interpolation is the least efficient technique computationally. The computational cost of solving the linear system for n data points is (2/3)n³ + 2n, which is much more expensive than the other techniques we will discuss. In terms of simplicity, it is the simplest technique to follow and understand. However, the computational cost outweighs the simplicity, which makes this method inefficient overall.
Stability: Monomial interpolation is not stable. When used for a ‘simple’ problem with few
data points, it is effective. However, as the data set grows, the instability grows.
Lk(xj) = { 0 if j ≠ k;  1 if j = k }     (5.1)
Example 5.2 If we use the same points as the previous problem, but construct a Lagrange interpolating polynomial instead:
First, we define our basis functions for this data set. We must ensure that Lk is zero at the other points, and one at the point corresponding to its construction: e.g. L0(x0) = 1, and L0(x1) = L0(x2) = 0.
We define these through:
L0(x) = ((x − x1)(x − x2))/((x0 − x1)(x0 − x2)) = ((x − 2)(x − 3))/((1 − 2)(1 − 3)) = (1/2)(x − 2)(x − 3)

L1(x) = ((x − x0)(x − x2))/((x1 − x0)(x1 − x2)) = ((x − 1)(x − 3))/((2 − 1)(2 − 3)) = −(x − 1)(x − 3)

L2(x) = ((x − x0)(x − x1))/((x2 − x0)(x2 − x1)) = ((x − 1)(x − 2))/((3 − 1)(3 − 2)) = (1/2)(x − 1)(x − 2)
Note: by forming the numerator as a product of (x − xi )’s we ensure that at those xi ’s our basis
function is zero. By forming the denominator as a product of (x j − xi )’s, we normalize the basis
function so that it is one when evaluated at x j .
Then,

p2(x) = 2L0(x) + 5L1(x) + 7L2(x) = (x − 2)(x − 3) − 5(x − 1)(x − 3) + (7/2)(x − 1)(x − 2),

which expands to the same polynomial found before.
Lagrange basis functions are pre-defined to be linearly independent, since they are zero at all
data points other than their respective x j .
General form:

Lk(x) = ((x − x0)(x − x1)···(x − xk−1)(x − xk+1)···(x − xn))/((xk − x0)(xk − x1)···(xk − xk−1)(xk − xk+1)···(xk − xn)) = ∏_{i=0, i≠k}^n (x − xi)/(xk − xi)
The numerator in this is the part that builds the polynomial of degree n. The denominator is the part that normalizes the polynomial so it is one at the position corresponding to xk; these are called the barycentric weights.
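A direct evaluation of the Lagrange form in Python (the function name lagrange_eval is illustrative):

def lagrange_eval(xs, ys, x):
    # Evaluate the Lagrange interpolating polynomial through (xs[k], ys[k]) at x.
    total = 0.0
    n = len(xs)
    for k in range(n):
        Lk = 1.0
        for i in range(n):
            if i != k:
                Lk *= (x - xs[i]) / (xs[k] - xs[i])
        total += ys[k] * Lk
    return total

print(lagrange_eval([1.0, 2.0, 3.0], [2.0, 5.0, 7.0], 2.5))   # 6.125, the same as p2(2.5)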
f(x) − pn(x) = (f^(n+1)(η(x))/(n + 1)!)(x − x0)(x − x1)···(x − xn)
The error is defined to be zero at the points used to build the approximation. Since the interpolant is exact for a polynomial of degree at most n, the (n + 1)st derivative of such a polynomial will be zero also. This form is consistent with the accuracy of the interpolant.
Note: this resembles your Taylor Polynomial error, but is not the same.
Bad cases: if the spacing is large between data points, it can cause your error to become large.
Additionally, large derivatives are a problem, because you have a rapid change between data points
which can also cause your error to become large.
Efficiency: Lagrange interpolation is relatively simple to construct because the basis functions
are consistently defined, and the coefficients are the y-values from the data set (no additional compu-
tations). Computationally, it is much more efficient than Monomial, and is the most computationally
efficient of the three similar methods. It is more difficult to follow than Monomial, but not much
more difficult. So, Lagrange is considered much more efficient than Monomial. One downside to
its efficiency is that it requires you do not change the data set. Sometimes in application, you may
make an approximation with a set number of data points, and then decide to add or remove points
from the data set. With Lagrange interpolation, you will need to reevaluate the entire approximation,
which makes it slightly less efficient than adaptive methods, like Newton’s Divided Differences
(next).
Stability: Lagrange interpolation is the most stable of the three similar methods. Since it uses
all the data points to define the basis functions and weights directly, it does not risk instability
unless the data points are exceedingly close together (division by small numbers in construction of
the basis functions).
Lagrange interpolation is great if you have a known data set to begin with, and you know it
isn’t going to change. Since every basis function is built by knowing all your data points up front, it
can become extremely cumbersome to add data points later. Lagrange interpolation is very popular when the data set is fixed in advance.
5.3 Divided Differences
Newton's divided differences build the interpolant in the nested form pn(x) = c0 + c1(x − x0) + c2(x − x0)(x − x1) + ..., so the coefficients can be determined one data point at a time. Matching the data gives
f(x0) = c0 → c0 = f(x0)
f(x1) = c0 + c1(x1 − x0).
Using the constant from the first data point, we can then define the second constant through
c1 = [f(x1) − f(x0)] / (x1 − x0).
With three data points, we can construct a quadratic approximation. Matching at x2:
f(x2) = f(x0) + [ (f(x1) − f(x0)) / (x1 − x0) ](x2 − x0) + c2(x2 − x0)(x2 − x1)
Manipulating the expression yields
f(x2) − f(x0) − [ (f(x1) − f(x0)) / (x1 − x0) ](x2 − x0) = c2(x2 − x0)(x2 − x1)
Solving for the constant yields
c2 = { [f(x2) − f(x0)]/(x2 − x0) − [f(x1) − f(x0)]/(x1 − x0) } / (x2 − x1) = f[x0, x1, x2],
the second divided difference. In general, the coefficients are divided differences, where each
f[xk, ..., xj] = ( f[xk+1, ..., xj] − f[xk, ..., xj−1] ) / (xj − xk).
These ‘divided differences’ yield the coefficients for the terms in pn (x). This is easier to see and do
in a table representation:
Example 5.3 Data points (0, 1), (1, 5), (2, 3), and (3, 7):
x-values   f(x)   1st DD                 2nd DD                    3rd DD
0          1
1          5      (5 − 1)/(1 − 0) = 4
2          3      (3 − 5)/(2 − 1) = −2   (−2 − 4)/(2 − 0) = −3
3          7      (7 − 3)/(3 − 2) = 4    (4 − (−2))/(3 − 1) = 3    (3 − (−3))/(3 − 0) = 2
Using the top diagonal entries, we will construct the interpolating polynomial:
p3(x) = 1 + 4(x − 0) + (−3)(x − 0)(x − 1) + 2(x − 0)(x − 1)(x − 2)
= 1 + 4x − 3x(x − 1) + 2x(x − 1)(x − 2)
= 1 + 4x − 3x² + 3x + 2x³ − 6x² + 4x
= 2x³ − 9x² + 11x + 1
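A short sketch (my own, not from the text) that reproduces this divided-difference table and evaluates the Newton form; the function names are illustrative.

def divided_differences(x, y):
    # Return the top diagonal f[x0], f[x0,x1], ..., f[x0,...,xn] of the DD table
    n = len(x)
    coef = list(y)
    for j in range(1, n):                       # build one column at a time
        for i in range(n - 1, j - 1, -1):
            coef[i] = (coef[i] - coef[i - 1]) / (x[i] - x[i - j])
    return coef

def newton_eval(x, coef, t):
    # Evaluate p_n(t) = sum_j coef[j] * prod_{i<j} (t - x_i) in nested (Horner-like) form
    p = coef[-1]
    for j in range(len(coef) - 2, -1, -1):
        p = p * (t - x[j]) + coef[j]
    return p

x = [0, 1, 2, 3]
y = [1, 5, 3, 7]
c = divided_differences(x, y)     # [1, 4, -3, 2], the top diagonal of the table above
print(c, newton_eval(x, c, 2))    # p3(2) reproduces the data value 3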
Notice here that we had equispaced points, so when we evaluated the divided differences, the
denominators were consistent for each column.
If each point is equispaced, then there is a specific difference between each set of adjacent
points, xi+1 − xi = h, ∀xi , i = 0, 1, ..., n.
Then each 1st DD: f[xi, xi+1] = (f(xi+1) − f(xi))/(xi+1 − xi) = Δf(xi)/h, which resembles the first derivative as h → 0 (it looks like (f(x0 + h) − f(x0))/h!).
2nd DD: f[xi, xi+1, xi+2] = (f[xi+1, xi+2] − f[xi, xi+1])/(xi+2 − xi) = (Δf(xi+1) − Δf(xi))/(2h²) = Δ²f(xi)/(2h²), which resembles (a multiple of) the second derivative, and so forth.
Generally, f[x0, ..., xk] = Δᵏf(x0)/(k! hᵏ), which relates to the kth derivative of f about x0.
Forward Difference Formula:
pn(x) = f(x0) + Σ_{k=1}^{n} (s choose k) Δᵏf(x0)
Recall: (s choose k) = s(s − 1)···(s − k + 1)/k!, so (s choose 1) = s and (s choose 0) = 1.
Here s is defined by x = x0 + sh, so pn(x0 + sh) approximates f(x0 + sh).
Backward differences go in the opposite direction i → i − 1 instead of forward i → i + 1.
Accuracy: Divided differences also construct polynomials of degree n using n + 1 data points.
Stability: Divided differences are more stable than Monomial, but slightly less stable than Lagrange. The primary cause of instability comes from the divided differences themselves, if a denominator is very small.
If pn(x) interpolates f(x) at n + 1 data points x0, x1, ..., xn, and f ∈ C^{n+1}[a, b] with bounded derivatives, then ∀x ∈ [a, b], ∃ξ = ξ(x) ∈ [a, b] such that
f(x) − pn(x) = [f^(n+1)(ξ(x)) / (n + 1)!] ∏_{i=0}^{n} (x − xi),
which is bounded by
max_{a≤x≤b} |f(x) − pn(x)| ≤ [1/(n + 1)!] max_{a≤t≤b} |f^(n+1)(t)| max_{a≤s≤b} ∏_{i=0}^{n} |s − xi|
The divided differences relate to the derivatives of f(x), so the error is bounded by the next derivative of f(x).
5.5 Interpolating Derivatives
Sometimes we also want the interpolant to match derivative data. To do this, we must discuss the osculating polynomial, a polynomial which satisfies both f and f′ at each node.
If we are given n + 1 data points, we will have 2n + 2 pieces of data (n + 1 function values and n + 1 derivative values). So, the degree of our polynomial interpolant is one less than the number of conditions: 2n + 1.
We can construct this similarly to Lagrange interpolants: we want to match p(xi) = f(xi), and with the derivative information we also want to match p′(xi) = f′(xi).
Our basis functions Hn,j(x) will match the function data and Ĥn,j(x) will match the derivative data, so that
p2n+1(x) = f(x0)Hn,0(x) + f(x1)Hn,1(x) + ... + f(xn)Hn,n(x) + f′(x0)Ĥn,0(x) + ... + f′(xn)Ĥn,n(x)
Given the conditions, we can define our basis functions in terms of Lagrange polynomials:
Hn,j(x) = [1 − 2(x − xj)L′n,j(xj)] L²n,j(x)
Ĥn,j(x) = (x − xj) L²n,j(x)
Checking the first: Hn,j(xi) = [1 − 2(xi − xj)L′n,j(xj)] L²n,j(xi). If i ≠ j, this is zero because Ln,j(xi) = 0; if i = j, it reduces to L²n,j(xj) = 1 because Ln,j(xj) = 1.
Similarly, Ĥn,j(xi) = (xi − xj) L²n,j(xi) is zero for every i, since either the first factor or the Lagrange factor vanishes.
These are consistent with the requirements listed above for these functions.
We construct the same type of polynomial as before - but now we have derivative data!
pn(x) = Σ_{j=0}^{n} f[x0, x1, ..., xj] ∏_{i=0}^{j−1} (x − xi)
Given data for x = −1, 0, 1: f(−1) = 0, f(0) = 1, f(1) = 2, along with f′(0) = 0, f″(0) = 0, and f′(1) = 5.
For each derivative evaluation, we add another row to the table. When we are given derivative
data, that derivative data replaces the divided difference for that column in the new row (1st deriva-
tive → 1st DD, etc.).
p5 (x) = 0 + 1(x + 1) + (−1)(x + 1)x + 1(x + 1)x2 + 0(x + 1)x3 + 1(x + 1)x3 (x − 1)
Which simplifies to p5 (x) = 1 + x5 , which is the function I used to build the data set!
Since we had 6 data points, we can construct a polynomial of degree 5, so this will be exact for
a polynomial of degree 5 (or less)!
Accuracy: Hermite interpolation is significantly more accurate because it uses not just function
values at each data point, but also derivative values at data points. Each derivative value adds to the
accuracy of the approximation. If you have n + 1 x-values, all with function values and derivative
values, the approximation will be exact for a polynomial of degree 2n + 1.
Efficiency: Hermite interpolation requires more operations for the same number of x-values,
but otherwise the efficiency is exactly the same as Newton’s Divided Differences. The operations
are much the same, and implementation is similar.
Stability: Slightly more stable due to the derivative information, but otherwise the same as
Newton’s Divided differences.
Let f(x) ∈ C[a, b] be 2n + 2 times differentiable on (a, b) and consider the distinct nodes x0 < x1 < · · · < xn in [a, b]. If p2n+1(x) is the Hermite polynomial such that p2n+1(xi) = f(xi) and p′2n+1(xi) = f′(xi) for each i, then for every x ∈ [a, b] there exists ξ = ξ(x) ∈ (a, b) such that
f(x) − p2n+1(x) = [f^(2n+2)(ξ(x)) / (2n + 2)!] ∏_{i=0}^{n} (x − xi)²,
which is bounded by
max_{a≤x≤b} |f(x) − p2n+1(x)| ≤ [1/(2n + 2)!] max_{a≤t≤b} |f^(2n+2)(t)| max_{a≤s≤b} ∏_{i=0}^{n} (s − xi)²
6. Piecewise Polynomial Interpolation
General polynomial interpolation can be wildly varying (see results in homework, etc).
Piecewise polynomial interpolation is a way to approximate between specified points and avoid
oscillations between points (keeping it smooth and consistent with the overall data).
We will discuss three approaches to piecewise polynomial interpolation:
• Linear piecewise polynomial interpolation (generalizes to higher order polynomials)
• Piecewise Hermite interpolation
• Cubic Spline interpolation
So, if your data set is defined on the interval [a, b], and you have data points xi in that interval,
then each subinterval is defined between each xi and xi+1 for each interpolant. i.e. we define a
polynomial interpolant, as we did last chapter, but only between two data points at a time.
For piecewise linear, we form a line with the function values between those points.
We can define this generally by matching si (xi ) = f (xi ) and si (xi+1 ) = f (xi+1 )
From this, we construct the line si(x) = f(xi) + [ (f(xi+1) − f(xi)) / (xi+1 − xi) ](x − xi)
We can see that if we substitute in x = xi , this reduces to si (xi ) = f (xi ).
[Figure: a piecewise linear interpolant connecting the data points]
You can also think of this form in terms of your divided differences. Recall: f[xi, xi+1] = (f(xi+1) − f(xi))/(xi+1 − xi).
Using this form, we can write si (x) = f (xi ) + f [xi , xi+1 ](x − xi ).
However, this is a really bad fit. Unless our function is a line, or we have data super-close
together, it’s likely to be an awful interpolant. Error is O(h2 ) if we have each set of points separated
by h.
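A quick sketch (my own, not from the text) of evaluating the piecewise linear interpolant in the divided-difference form above; the data set here is hypothetical.

import bisect

def piecewise_linear(xdata, ydata, t):
    # Evaluate s_i(t) = f(x_i) + f[x_i, x_{i+1}] (t - x_i) on the subinterval containing t
    i = min(max(bisect.bisect_right(xdata, t) - 1, 0), len(xdata) - 2)
    slope = (ydata[i + 1] - ydata[i]) / (xdata[i + 1] - xdata[i])   # f[x_i, x_{i+1}]
    return ydata[i] + slope * (t - xdata[i])

xdata = [0.0, 1.0, 2.0, 3.0]
ydata = [0.0, 1.0, 4.0, 9.0]                  # hypothetical samples of f(x) = x^2
print(piecewise_linear(xdata, ydata, 1.5))    # 2.5, versus the true value 2.25

NumPy's np.interp performs the same piecewise linear evaluation if it is available.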
We can extend this same idea to more points and use low order polynomial interpolants to con-
nect the data in a more smooth fashion, while ensuring consistency at the shared points. However,
it will not vastly improve our approximation because it will not ensure smoothness at each shared
point.
Accuracy: Piecewise linear interpolants are only exact for linear functions. The error is O(h2 )
for h = distance between equispaced data points. So, a piecewise linear interpolant may be “good
enough" when the data points are close together, but it is generally a poor approximation.
Efficiency: Piecewise linear interpolants are straightforward and simple to compute, making
them efficient computationally and in terms of their accessibility.
Stability: Piecewise linear interpolants are stable because they follow the general trend of the
data (no highly oscillatory behavior). However, they can be wholly inaccurate as a result.
The primary reason we do not use piecewise linear interpolants is their inaccuracy. They are ef-
ficient and stable, but most functions are not represented well with piecewise linear approximations.
6.2 Piecewise Hermite Interpolation
This requires derivative data at all the given points. This adds some smoothness to the approxima-
tion, but it also requires that we have derivative data at all of our points to apply it. Derivative data
is not generally known, which often makes this method prohibitive.
We will still require si (xi ) = f (xi ) and si (xi+1 ) = f (xi+1 ). For Hermite, we also require
s0i (xi ) = f 0 (xi ) and s0i (xi+1 ) = f 0 (xi+1 ).
This adds two conditions to our problem, and allows us to fit not only derivative data, but also
construct a polynomial of degree 3, so our error term reduces to O(h4 ).
Downside: it requires derivative information for f (x) that we don’t generally have.
Example 6.1 Use piecewise Hermite interpolation to approximate the given data:
x      f(x)   f′(x)
0      1      0
π/2    0      1
π      −1     0
Construct a Hermite interpolant over each subinterval.
[Figure: the piecewise Hermite interpolant plotted against the underlying function]
As you can see in the plot, they are nearly indistinguishable from each other.
Accuracy: Piecewise Hermite interpolants are more accurate because they use derivative
information in addition to function values.
Efficiency: Piecewise Hermite interpolants require derivative information, and are compu-
tationally expensive because they require three divided differences for every subinterval. This
also makes them complex to solve for. Additionally, the requirement that we know derivative
information at every point is prohibitive. So, piecewise Hermite interpolants are accurate, but not
efficient.
Stability: Piecewise Hermite interpolants are very stable due to the use of derivative informa-
tion. This forces the behavior of the interpolant to closely match that of the function.
We seek an alternative that yields such accuracy, without requiring additional data: Cubic
Splines!
6.3 Cubic Splines
We construct a cubic polynomial between each set of points xi and xi+1. We will still ensure
that si (xi ) = f (xi ), and si (xi+1 ) = f (xi+1 ).
We will also ensure that s0i (xi+1 ) = s0i+1 (xi+1 ), and s00i (xi+1 ) = s00i+1 (xi+1 ).
So, each spline has 4 unknowns, and with n + 1 data points we have n interpolants and 4n unknowns. To solve for these interpolants we need 4n conditions to solve the system on the whole domain.
Example 6.2 Construct a cubic spline with free boundary conditions through the data (0, 0), (1, 1), (2, 2). With three data points we need two splines, s0(x) = a0 + b0 x + c0 x² + d0 x³ on [0, 1] and s1(x) = a1 + b1(x − 1) + c1(x − 1)² + d1(x − 1)³ on [1, 2].
Then we use the forms to define equations which will satisfy the above conditions.
s0(0) = a0 = f(0) = 0 → a0 = 0
s1(1) = a1 = f(1) = 1 → a1 = 1
s0(1) = a0 + b0 + c0 + d0 = f(1) = 1 → a0 + b0 + c0 + d0 = 1
s1(2) = a1 + b1 + c1 + d1 = f(2) = 2 → a1 + b1 + c1 + d1 = 2
s0′(1) = b0 + 2c0 + 3d0 = s1′(1) = b1 → b0 + 2c0 + 3d0 − b1 = 0
s0″(1) = 2c0 + 6d0 = s1″(1) = 2c1 → 2c0 + 6d0 − 2c1 = 0
Since we do not have derivative data for the boundary, and there is only one interior point, the only boundary condition we can apply is the free boundary: s0″(x0) = s_{n−1}″(xn) = 0, i.e. 2c0 = 0 and 2c1 + 6d1 = 0.
These 8 equations allow us to set up the linear system to solve for a0, b0, c0, d0, a1, b1, c1, and d1.
1 0 0 0 0 0 0 0 a0 0
0 0 0 0 1 0 0 0 b0 1
1 1 1 1 0 0 0 0 c0 1
0 0 0 0 1 1 1 1 d0 2
0 1 2 3 0 −1 0 0 a1 = 0
0 0 2 6 0 0 −2 0 b1 0
0 0 2 0 0 0 0 0 c1 0
0 0 0 0 0 0 2 6 d1 0
This is a linear system A~x = ~b, we can solve it in MATLAB with A\~b yielding
a0 = 0, b0 = 1, c0 = 0, d0 = 0, a1 = 1, b1 = 1, c1 = 0, d1 = 0
Which, when substituted into our splines, yields s0(x) = x and s1(x) = x.
Naturally, the data I gave above was for the simple function y = x!
Since the cubic spline can solve for a linear function exactly (with free boundary conditions) it
gave us the exact function.
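As a sketch (my own, not from the text), the same 8 × 8 system can be assembled and solved with NumPy; np.linalg.solve plays the role of MATLAB's backslash.

import numpy as np

# Unknowns: [a0, b0, c0, d0, a1, b1, c1, d1] for
#   s0(x) = a0 + b0 x + c0 x^2 + d0 x^3            on [0, 1]
#   s1(x) = a1 + b1 (x-1) + c1 (x-1)^2 + d1 (x-1)^3 on [1, 2]
A = np.array([
    [1, 0, 0, 0,  0, 0, 0, 0],   # s0(0) = 0
    [0, 0, 0, 0,  1, 0, 0, 0],   # s1(1) = 1
    [1, 1, 1, 1,  0, 0, 0, 0],   # s0(1) = 1
    [0, 0, 0, 0,  1, 1, 1, 1],   # s1(2) = 2
    [0, 1, 2, 3,  0,-1, 0, 0],   # s0'(1) = s1'(1)
    [0, 0, 2, 6,  0, 0,-2, 0],   # s0''(1) = s1''(1)
    [0, 0, 2, 0,  0, 0, 0, 0],   # free boundary: s0''(0) = 0
    [0, 0, 0, 0,  0, 0, 2, 6],   # free boundary: s1''(2) = 0
], dtype=float)
b = np.array([0, 1, 1, 2, 0, 0, 0, 0], dtype=float)

coeffs = np.linalg.solve(A, b)
print(coeffs)    # [0, 1, 0, 0, 1, 1, 0, 0]  ->  s0(x) = x, s1(x) = x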
Example 6.3 Solve the same problem we completed using piecewise Hermite, with Cubic
Splines. We will add two more points to allow for use of Not-a-Knot conditions in this example.
Given data:
x      0    π/2   π    3π/2   2π
y      1    0     −1   0      1
We have 5 points, so we will need to construct 4 cubic splines.
Then we use the forms to define equations which will satisfy the above conditions.
s0 (0) = a0 = f (0) = 1 → a0 = 1
s1 (π/2) = a1 = f (π/2) = 0 → a1 = 0
s2 (π) = a2 = f (π) = −1 → a2 = −1
s3 (3π/2) = a3 = f (3π/2) = 0 → a3 = 0
s000 (π/2) = 2c0 + 3d0 π = s001 (π/2) = 2c1 → 2c0 + 3d0 π − 2c1 = 0
s001 (π) = 2c1 + 3d1 π = s002 (π) = 2c2 → 2c1 + 3d1 π − 2c2 = 0
s002 (3π/2) = 2c2 + 3d2 π = s003 (3π/2) = 2c3 → 2c2 + 3d2 π − 2c3 = 0
We apply the Not-a-Knot conditions to our system: s0′′′(x1) = s1′′′(x1) and s_{n−2}′′′(x_{n−1}) = s_{n−1}′′′(x_{n−1}).
s0′′′(x) = 6d0, s1′′′(x) = 6d1, s2′′′(x) = 6d2, s3′′′(x) = 6d3
s0′′′(x1) = 6d0 = s1′′′(x1) = 6d1 → 6d0 − 6d1 = 0
s2′′′(x3) = 6d2 = s3′′′(x3) = 6d3 → 6d2 − 6d3 = 0
Assembling all 16 conditions (function values at the left and right end of each subinterval, first- and second-derivative continuity at the three interior points, and the two Not-a-Knot conditions) gives a 16 × 16 linear system A~x = ~b for the coefficients (a0, b0, c0, d0, ..., a3, b3, c3, d3), with right-hand side
~b = (1, 0, −1, 0, 0, −1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0)ᵀ.
This is a linear system A~x = ~b, we can solve it in MATLAB with A\~b yielding
a0 = 1,  b0 = 0,         c0 = −0.607927,  d0 = 0.129006
a1 = 0,  b1 = −0.954930, c1 = 0,          d1 = 0.129006
a2 = −1, b2 = 0,         c2 = 0.607927,   d2 = −0.129006
a3 = 0,  b3 = 0.954930,  c3 = 0,          d3 = −0.129006
Which, when substituted into our splines yields:
[Figure: the cubic spline interpolant plotted against the underlying function]
Again, the two curves are nearly indistinguishable from each other, however - this time we did
not have derivative information.
Accuracy: The accuracy of Cubic Spline with free boundary conditions is O(h2 ), which solves
exactly for a linear function. If you have derivative data for your endpoints (clamped boundary), it
can solve for a cubic exactly, O(h4 ). If you apply the Not-a-Knot condition, and the function is a
cubic polynomial, it will also solve for a cubic exactly, O(h⁴).
Efficiency: The computational cost of cubic splines is high; they require the construction and solution of a linear system. They are also more difficult to follow and construct. However, they do not require additional information as Hermite does, so they provide a way to gain accuracy without requiring derivative information.
Stability: Cubic splines are not as stable as Hermite because they do not have derivative
information for the function. However, by ensuring continuity between splines, the solution is
much more stable than a standard polynomial interpolant. Complete splines (clamped boundary)
are the most stable, because they use derivative information on the boundary. Not-a-knot follows
because it ensures continuity, but does not use additional function data. Free boundary conditions
are the least stable because they enforce a condition that is not necessarily true. Enforcing the
second derivative at the boundary to be zero is intended to prevent rapidly changing functions at
the ends, however - most functions do not have a zero for their second derivative, so the condition
is inconsistent with the function behavior.
Cubic splines are a preferred technique when we have many data points to satisfy because they
are adaptable and well-behaved. Often, the accuracy and stability make them preferred.
7. Numerical Differentiation
We will approximate the derivatives of a given function f(x) (f′(x), f″(x), etc.) at and/or near a particular x-value, xi.
These are particularly useful in solving differential equations numerically, because we approx-
imate the derivatives numerically and then the problem is algebraic. (We do this in Numerical
Analysis II!)
Recall from Calculus I: f′(x) = lim_{h→0} (f(x + h) − f(x))/h.
Thus, we can approximate this value numerically through f′(xi) ≈ (f(xi + h) − f(xi))/h.
However, this induces additional error (roundoff error) as h → 0. It is also important to note
that as h → 0, the number of computations will increase (the number of data points on an interval
will increase dramatically if the space between all points is decreased - this is an important concept
for efficiency).
Recall: we derived this method earlier in the semester using the Taylor Series’ of a function,
and we talked about truncation error O(h p )
This is error induced by “truncating" our Taylor Series to a finite number of terms.
In order to apply numerical differentiation effectively, we need to ensure that our function is
sufficiently smooth on the interval: f ∈ Cn+1 [a, b]. This helps us ensure our error term is bounded,
since it is defined by the (n + 1)th derivative of f (x).
Note: Taylor Series’ expansions are localized, and provide a reasonably good approximation
very close to the point they are centered about. However, they do not typically define a global
approximation, and if h is large the approximation will not be realistic, and can even give results
that are nonsensical (or nonphysical) if used in this way.
We have three commonly used difference formulas to approximate the first derivative of f(x): the forward difference (f(x + h) − f(x))/h and backward difference (f(x) − f(x − h))/h, each with error O(h), and the centered difference (f(x + h) − f(x − h))/(2h), with error O(h²).
Second Derivatives
We can derive second derivatives in the same manner we derived first derivatives, we will want to
isolate the second derivative in our Taylor expansion.
Note: we will end up dividing by h2 , which is significantly smaller than h, so the higher
derivatives are much more vulnerable to roundoff errors.
Centered Difference for second derivative: we add the two instead of subtracting, so we cancel
the first derivative rather than the second.
f(x0 + h) = f(x0) + h f′(x0) + (h²/2) f″(x0) + (h³/6) f‴(x0) + (h⁴/24) f′′′′(x0) + ...
f(x0 − h) = f(x0) − h f′(x0) + (h²/2) f″(x0) − (h³/6) f‴(x0) + (h⁴/24) f′′′′(x0) + ...
f(x0 + h) + f(x0 − h) = 2 f(x0) + h² f″(x0) + (h⁴/12) f′′′′(x0) + ...
Solve for f″(x0):
f″(x0) = [f(x0 + h) − 2 f(x0) + f(x0 − h)] / h² − (h²/12) f′′′′(x0) + ...
Note: All of these methods are derived using equispaced data. Other forms can be derived for
other cases of data sets. Also note that you do not need to use the same method on your whole data
set.
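A minimal sketch (my own, not from the text) of the difference formulas above applied to a hypothetical test function; the function names are illustrative.

import math

def forward_diff(f, x, h):
    # O(h) approximation of f'(x)
    return (f(x + h) - f(x)) / h

def centered_diff(f, x, h):
    # O(h^2) approximation of f'(x)
    return (f(x + h) - f(x - h)) / (2 * h)

def centered_second_diff(f, x, h):
    # O(h^2) approximation of f''(x), as derived above
    return (f(x + h) - 2 * f(x) + f(x - h)) / h**2

f, x0, h = math.sin, 1.0, 0.1
print(forward_diff(f, x0, h), centered_diff(f, x0, h), math.cos(x0))
print(centered_second_diff(f, x0, h), -math.sin(x0))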
Given an approximation N(h) to the derivative (f′(x0), f″(x0), etc.) and the actual value of the derivative M, we can always write the error in the form
M − N(h) = k1 h + k2 h² + k3 h³ + ...
For the one-sided approximation to f′(x0), the error is O(h), so if we can combine multiple O(h) approximations appropriately, we can eliminate the O(h) term and yield an O(h²) error term.
So, if we have two approximations to the derivative using the same low-order method, one at h and one at h/2, we can write their error expansions:
M = N(h) + k1 h + k2 h² + k3 h³ + ...
M = N(h/2) + k1 (h/2) + k2 (h²/4) + k3 (h³/8) + ...
In order to cancel the O(h) terms, we combine the two approximations (twice the second minus the first):
2M − M = 2N(h/2) − N(h) − k2 (h²/2) − k3 (3h³/4) − ...
M = 2N(h/2) − N(h) + O(h²)
This process can be repeated until desired accuracy is obtained. Be wary of roundoff errors,
as they can creep in and make your results no longer accurate as you compute further steps of
Richardson Extrapolation.
N2(h) = N1(0.2) + [N1(0.2) − N1(0.4)] / (2 − 1) = 2N1(0.2) − N1(0.4) ≈ 0.98204
N2(h/2) = 2N1(0.1) − N1(0.2) ≈ 0.99549
N3(h) = N2(0.2) + [N2(0.2) − N2(0.4)] / (4 − 1) ≈ 0.99877
This is useful because the computational expense is quite small while the error decreases
significantly.
Note: Sometimes Richardson Extrapolation is presented as I’ve done here, with factors of h,
and other times it is done using multiples of h, as I will do below. They are effectively equivalent,
even though they seem different - the concept driving the computations is the same.
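A sketch (my own, not from the text) of a Richardson extrapolation table for a first-order approximation N(h); the forward difference of f(x) = eˣ at x = 0 (exact derivative 1) is used as an illustrative example.

import math

def richardson(N, h, levels):
    # Build the extrapolation table assuming the error of N behaves like k1*h + k2*h^2 + ...
    R = [[N(h / 2**i)] for i in range(levels)]
    for i in range(1, levels):
        for k in range(1, i + 1):
            R[i].append(R[i][k - 1] + (R[i][k - 1] - R[i - 1][k - 1]) / (2**k - 1))
    return R

N = lambda h: (math.exp(h) - 1.0) / h          # first-order forward difference at x = 0
for row in richardson(N, 0.4, 4):
    print(["%.6f" % v for v in row])           # last entries approach 1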
Second Derivatives
We can also use this same concept to improve upon our second derivative approximations.
If we use N1(h/2) = [f(x0 + h) − 2f(x0) + f(x0 − h)] / h² + O(h²)
and N1(h) = [f(x0 + 2h) − 2f(x0) + f(x0 − 2h)] / (4h²) + O(h²),
we can combine them to cancel the first error term:
4N1(h/2) − N1(h) = 3M + O(h⁴)
M = (1/3)[4N1(h/2) − N1(h)]
  = (1/(3h²))[4f(x0 + h) − 8f(x0) + 4f(x0 − h) − f(x0 + 2h)/4 + f(x0)/2 − f(x0 − 2h)/4] + O(h⁴)
Which simplifies to:
M = (1/(12h²))[16 f(x0 + h) − 30 f(x0) + 16 f(x0 − h) − f(x0 + 2h) − f(x0 − 2h)] + O(h⁴)
These can become unstable the more you extrapolate, and for widely spaced data. Requires
bounded derivatives of f (x) to work. Be careful about roundoff errors as well, as h → 0.
Polynomials are easy to evaluate, and differentiate. Since they are locally good approximations,
they can approximate the derivatives effectively.
Ex: Given data (1/2, −ln 2), (1, 0), (2, ln 2), and (3, ln 3), construct an interpolating polynomial for the data set and use it to approximate the derivative of f(x) at x = 1.
Since we know we will arrive at the same polynomial either way, we construct it using Newton's Divided Differences:
f[x0] = −ln 2,  f[x0, x1] = 2 ln 2,  f[x0, x1, x2] = −(2/3) ln 2,  f[x0, x1, x2, x3] = (1/5) ln 3 − (2/15) ln 2.
So, L(x) = −ln 2 + 2 ln 2 (x − 1/2) − (2/3) ln 2 (x − 1/2)(x − 1) + [(1/5) ln 3 − (2/15) ln 2](x − 1/2)(x − 1)(x − 2)
We can then approximate the derivative using L′(x):
L′(x) = 2 ln 2 − (2/3) ln 2 [(x − 1) + (x − 1/2)] + [(1/5) ln 3 − (2/15) ln 2][(x − 1)(x − 2) + (x − 1/2)(x − 2) + (x − 1/2)(x − 1)]
At x = 1, L′(1) = 2 ln 2 − (2/3) ln 2 (1/2) + [(1/5) ln 3 − (2/15) ln 2](−1/2) ≈ 1.09159
This is a pretty decent approximation to the derivative of the actual function, f(x) = ln x, for which f′(x) = 1/x. Since f′(1) = 1, the error is less than 10%.
However, if f′(x) → 0, the approximation will not be good (smaller derivative, larger relative error).
One aspect of using Lagrange interpolants is that we can continue to approximate higher
derivatives, although they become less stable with higher derivatives.
Examine the roundoff error in the computation of the derivative for f (x) = ex near x = 0:
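A small numerical experiment (my own sketch, not from the text) using the one-sided difference above: as h shrinks, the truncation error (roughly h/2) decreases, but cancellation in f(x + h) − f(x) eventually makes the total error grow again.

import math

f = math.exp          # f'(0) = 1 exactly
for k in range(1, 17):
    h = 10.0**(-k)
    approx = (f(0.0 + h) - f(0.0)) / h
    print(f"h = 1e-{k:02d}   error = {abs(approx - 1.0):.3e}")

The error is smallest near h ≈ 1e-8 and then increases as h → 0, illustrating the roundoff effect described above.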
8. Numerical Integration
When we have a function f(x) and need to determine ∫_a^b f(x) dx but do not know the antiderivative F(x) analytically, we must find another way to determine the integral.
Recall: Riemann sums (Calculus I), and Numerical Integration (Calculus II).
Numerical integration approximates the integral using a summation:
∫_a^b f(x) dx ≈ Σ_{j=0}^{n} aj f(xj)
where each xj ∈ [a, b] is called an abscissa, and each aj is the weight given to the jth term f(xj).
These techniques are called quadrature methods because quadrature means area, and we are
computing a single integral, which relates directly to the area under the curve. Similar concept to
how we evaluated integrals using Riemann sums in Calculus.
Note that the weight, a j , in Riemann sums was the width of the box built under the curve,
denoted ∆x, and was the same for all the boxes because we used equispaced data points. We
generalize this term to weights because the value of these coefficients will vary by the method used.
This also relates nicely to our polynomial interpolation in the previous chapters.
Recall: for a Lagrange interpolant, pn(x) = Σ_{j=0}^{n} f(xj) Lj(x) = Σ_{j=0}^{n} f(xj) ∏_{k=0,k≠j}^{n} (x − xk)/(xj − xk).
Thus, we can approximate
∫_a^b f(x) dx ≈ ∫_a^b Σ_{j=0}^{n} f(xj) ∏_{k=0,k≠j}^{n} (x − xk)/(xj − xk) dx = Σ_{j=0}^{n} f(xj) ∫_a^b Lj(x) dx.
So, our weights can be determined through aj = ∫_a^b Lj(x) dx.
Example 8.1 Seek an approximation to the integral of f(x) = eˣ on [0, 1] using a linear approximation, p1(x).
f(x0) = f(0) = 1
f(x1) = f(1) = e
L0(x) = (x − 1)/(0 − 1) = 1 − x
L1(x) = (x − 0)/(1 − 0) = x
p1(x) = 1(1 − x) + e(x)
So, a0 = ∫_0^1 (1 − x) dx = [x − x²/2]_0^1 = 1 − 1/2 = 1/2
a1 = ∫_0^1 x dx = [x²/2]_0^1 = 1/2
So, the area approximation is given by (1/2)(1 + e).
[Figure: f(x) = eˣ and the linear interpolant p1(x) = 1 + (e − 1)x on [0, 1]]
This linear approximation to the area under a curve can be generalized, because our Lagrange
polynomials take the same form each time.
L0(x) = (x − b)/(a − b)
a0 = ∫_a^b (x − b)/(a − b) dx = [x²/(2(a − b)) − bx/(a − b)]_a^b
   = b²/(2(a − b)) − b²/(a − b) − a²/(2(a − b)) + ab/(a − b)
   = (b² − 2ab + a²)/(2(b − a)) = (b − a)²/(2(b − a)) = (1/2)(b − a)
Similarly, a1 = ∫_a^b (x − a)/(b − a) dx = [x²/(2(b − a)) − ax/(b − a)]_a^b = (b² − 2ab + a²)/(2(b − a)) = (1/2)(b − a)
This is the trapezoidal rule in numerical integration: ∫_a^b f(x) dx ≈ [(b − a)/2][f(a) + f(b)]
This method approximates the area using trapezoids, where the top is formed by a linear fit to
the curve using the endpoint values. (see picture)
What if we want to approximate the curve with a quadratic?
We will need three points, so we will have to cut the interval in half, and we will need information about that midpoint, x = (a + b)/2.
Now, L0(x) = [(x − b)(x − (a + b)/2)] / [(a − b)(a − (a + b)/2)] = [2/(a − b)²][x² − ((a + 3b)/2)x + b(a + b)/2]
So, we determine the coefficient a0 = ∫_a^b L0(x) dx:
a0 = [2/(a − b)²] ∫_a^b [x² − ((a + 3b)/2)x + b(a + b)/2] dx
   = [2/(a − b)²] [x³/3 − ((a + 3b)/4)x² + (b(a + b)/2)x]_a^b
   = [2/(a − b)²] [(b³ − a³)/3 − ((a + 3b)/4)(b² − a²) + (b(a + b)/2)(b − a)]
   = [−2/(a − b)] [(b² + ab + a²)/3 − (a² + 4ab + 3b²)/4 + (ab + b²)/2]
Combine similar terms:
b²/3 + b²/2 − 3b²/4 = b²/12
ab/3 − ab + ab/2 = −ab/6
a²/3 − a²/4 = a²/12
Substituting this back into the original form:
a0 = [−2/(a − b)] [b²/12 − ab/6 + a²/12] = −(b² − 2ab + a²)/(6(a − b)) = (b − a)²/(6(b − a)) = (b − a)/6
So, a0 = (b − a)/6. Similarly, a1 = 4(b − a)/6 and a2 = (b − a)/6.
Yielding Simpson's rule: ∫_a^b f(x) dx ≈ [(b − a)/6][f(x0) + 4 f(x1) + f(x2)]
These methods are all derived by forming a polynomial interpolant between each xi and xi+1 ,
and are generalized as Newton-Cotes formulas.
In this context, they can be defined by the same h used in differentiation, xi+1 − xi = h.
Both of these should seem mildly familiar from Calculus II numerical integration.
The techniques you learned there are these, but in their composite form. (Next!)
For composite numerical integration, we complete these ‘basic’ techniques on subintervals and
add them all together.
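A sketch (my own, not from the text) of the composite versions of these rules; the function names and the test integrand are illustrative.

import math

def composite_trapezoid(f, a, b, n):
    # Composite trapezoidal rule with n equal subintervals
    h = (b - a) / n
    total = 0.5 * (f(a) + f(b))
    for j in range(1, n):
        total += f(a + j * h)
    return h * total

def composite_simpson(f, a, b, n):
    # Composite Simpson's rule; n must be even
    if n % 2:
        raise ValueError("n must be even")
    h = (b - a) / n
    total = f(a) + f(b)
    for j in range(1, n):
        total += (4 if j % 2 else 2) * f(a + j * h)
    return h * total / 3

print(composite_trapezoid(math.exp, 0, 1, 8),
      composite_simpson(math.exp, 0, 1, 8),
      math.e - 1)    # exact value for comparison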
When numerical integration techniques include the endpoints, they are called closed (much like
interval notation).
When numerical integration techniques do not include the endpoints, they are called open.
An example you would have seen in Calculus II of an open technique is Midpoint Rule.
Midpoint is slightly better than trapezoidal, but it is also impractical for computations with data
(because we cannot then evaluate a midpoint).
The basic Midpoint rule uses the function value at the midpoint as the height: ∫_a^b f(x) dx ≈ (b − a) f((a + b)/2)
Quadrature error
Error is always determined by the difference between the ‘exact’ and the ‘approximate’.
For numerical integration
∫_a^b f(x) dx − Σ_{j=0}^{n} aj f(xj) = ∫_a^b f[x0, x1, ..., xn, x] ∏_{j=0}^{n} (x − xj) dx
Error terms for each of our basic techniques:
Trapezoidal: −f″(ξ)(b − a)³/12. Conceptually, you can think about how this method was derived. We used a p1(x) approximation, which was linear, so the error in that approximation is in terms of f″(ξ). Since we integrate a product of two linear factors (degree 2 in x), the result scales like (b − a)³. Each basic trapezoidal integral has O(h³) error.
Simpson's: −[f′′′′(ξ)/90][(b − a)/2]⁵. Similarly, this technique has O(h⁵) error for each basic integral.
Midpoint: f″(ξ)(b − a)³/24. This technique also has O(h³) error for each basic integral.
Precision: also called degree of accuracy, these integrals will be exact for polynomials of
degree: 1 (Trapezoidal and Midpoint), and 3 (Simpson’s).
Example 8.2 Given f (x) = x3 on [0, 0.5]
If we apply the basic Trapezoidal rule:
If we apply the basic Trapezoidal rule:
I_Trap = [(0.5 − 0)/2][0 + (1/2)³] = 1/32
If we apply the basic Simpson's rule:
I_Simp = [(0.5 − 0)/6][0 + 4(1/4)³ + (1/2)³] = 1/64
If we apply the basic Midpoint rule:
I_Mid = (0.5 − 0)(1/4)³ = 1/128
The actual integral, using the Fundamental Theorem of Calculus:
∫_0^0.5 x³ dx = [x⁴/4]_0^0.5 = (1/16)/4 − 0 = 1/64
Our b − a = h = 0.5, and f′(x) = 3x², f″(x) = 6x, f‴(x) = 6, and f′′′′(x) = 0.
If we substitute these into our error bounds, we maximize the derivative f^(p)(ξ) using the value of ξ ∈ [a, b] that maximizes the derivative in our error bound.
So, for Trapezoidal: −f″(ξ)(b − a)³/12 = −(6 · 0.5)(0.5)³/12 = −1/32. Our actual error was 1/64 − 1/32 = −1/64. This is consistent with our error bound because the actual error was less (in magnitude) than the error bound value.
Simpson's: −[f′′′′(ξ)/90][(b − a)/2]⁵ = −(0/90)(0.25)⁵ = 0. Simpson's has no error because it solves for the integral of a cubic polynomial exactly.
Midpoint: f″(ξ)(b − a)³/24 = (6 · 0.5)(0.5)³/24 = 1/64. This is consistent with the error bound because the actual error was 1/128.
Accuracy: Accuracy is directly defined by the degree of precision, or order of accuracy of the
method. Trapezoidal: 1, Simpson’s: 3, Midpoint: 1.
Efficiency: The efficiency of these methods is defined by the function evaluations, and number
of operations involving addition, subtraction, multiplication, and/or division. The efficiency of the
methods is reasonably similar, but directly related to the number of points used to build the method.
Here greater accuracy of the method saves efficiency.
Stability: Numerical integration is fairly stable; it is more stable than numerical differentiation because, as you may have noticed, as h → 0 we are multiplying by small
numbers instead of dividing by them. With numerical differentiation we had to be careful because
if h was too small we would incur roundoff error. In integration, this is no longer a concern
computationally.
8.3 Gaussian Quadrature
Gaussian quadrature chooses both the points and the weights so the rule is exact for polynomials of as high a degree as possible. Start with a single point: we seek a weight a1 and a point x1 such that ∫_a^b f(x) dx ≈ a1 f(x1) is exact for every linear function f(x) = c0 + c1 x. Then
∫_a^b f(x) dx = ∫_a^b c0 dx + ∫_a^b c1 x dx = c0(b − a) + c1 (b² − a²)/2 = a1(c0 + c1 x1)
Separating the pieces by c0 and c1 terms:
c0(b − a) = a1 c0 → a1 = b − a, so the weight is given by the width of the interval.
c1 (b² − a²)/2 = a1 c1 x1 → x1 = (b² − a²)/(2(b − a)) = (b + a)/2. (Note: c1 appears on both sides of the equation, so we can solve for x1 directly.) The point is given by the midpoint of the interval.
These are the definitions of our Midpoint rule.
Using a monomial basis, we can derive our quadrature points and weights in this manner.
However, there is a better way.
Legendre Polynomials!
Legendre polynomials form an orthogonal basis of polynomials on [−1, 1]; this means that
∫_{−1}^{1} φj(x) φk(x) dx = 0 if j ≠ k, and = 2/(2j + 1) if j = k.
The functions yield zero when integrated against each other, but when integrated with themselves they yield a positive constant.
In order to apply Gaussian quadrature using Legendre polynomials, we need to map [a, b] onto
[−1, 1]. If we assume t ∈[a, b] and x ∈ [−1, 1], then we can define a mapping through:
x = [2/(b − a)][t − (b + a)/2]
t = [(b − a)/2] x + (b + a)/2
We can verify these transformations:
if t = a, then x = [2/(b − a)][(a − b)/2] = −1 ✓
if t = b, then x = [2/(b − a)][(b − a)/2] = 1 ✓
Similarly, recall u-substitution from Calculus I or II (or the Jacobian for these transformations in higher dimensions, in Calculus III): dt/dx = (b − a)/2 modifies the integral:
∫_a^b f(t) dt = ∫_{−1}^{1} f(t(x)) (dt/dx) dx = ∫_{−1}^{1} f(t(x)) [(b − a)/2] dx
Our Legendre polynomials are given by
φ0(x) = 1
φ1(x) = x
φ2(x) = (1/2)(3x² − 1)
φ3(x) = (1/2)(5x³ − 3x)
etc. In general they can be defined through the recursion relation
φ_{j+1}(x) = [(2j + 1)/(j + 1)] x φj(x) − [j/(j + 1)] φ_{j−1}(x)
If f(x) is a polynomial of degree less than 2n, then the approximation will be exact (no longer an approximation, really): ∫_{−1}^{1} f(x) dx = Σ_{i=1}^{n} ai f(xi).
We can define the xi's more generally through the zeros of the Legendre polynomials. So, for n points, we use the zeros of the φn(x) Legendre polynomial.
The ai's can be defined through the values of the Legendre polynomials: ai = 2/[(1 − xi²)(φ′n(xi))²]
When n = 1: φ1(x) = x equals zero only when x = 0. So, x1 = 0, and then
a1 = 2/[(1 − 0²)(1)²] = 2.
∫_{−1}^{1} f(x) dx ≈ 2 f(0). This is the Midpoint rule on [−1, 1].
For n = 2: φ2(x) = (1/2)(3x² − 1) equals zero at x = ±√(1/3), so x1 = −√(1/3), x2 = √(1/3).
Using these, we determine their weights:
a1 = 2/[(1 − 1/3)(3√(1/3))²] = 2/[(2/3)(3)] = 1
a2 = a1 = 1; since the roots are symmetric and the values are squared, both weights are the same.
When n = 3: φ3(x) = (1/2)(5x³ − 3x) equals zero when x = 0 or x = ±√(3/5). We order them from negative to positive: x1 = −√(3/5), x2 = 0, x3 = √(3/5).
a1 = 2/[(1 − 3/5)((15/2)(3/5) − 3/2)²] = 2/[(2/5)(9)] = 5/9
a2 = 2/[(1 − 0)((15/2)(0) − 3/2)²] = 2/(9/4) = 8/9
a3 = a1 = 5/9
Note: since Gauss-Legendre quadrature uses the same functions and weights every time, you do
not need to recalculate them. They only depend on the value of n, and will be the same each time.
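As a sketch (my own, not from the text), the three-point rule with the interval mapping above can be packaged as follows; the function name gauss3 is illustrative, and the test integral matches the worked example that follows.

import math

nodes   = [-math.sqrt(3/5), 0.0, math.sqrt(3/5)]   # zeros of the degree-3 Legendre polynomial
weights = [5/9, 8/9, 5/9]

def gauss3(f, a, b):
    # Map x in [-1, 1] to t = half*x + mid in [a, b]; dt = half * dx
    half = (b - a) / 2
    mid = (b + a) / 2
    return half * sum(w * f(half * x + mid) for x, w in zip(nodes, weights))

print(gauss3(lambda t: t**4, 1, 2))   # 6.2, exact for polynomials up to degree 5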
Example 8.5 Compute ∫_1^2 t⁴ dt with n = 3 Gaussian quadrature points.
First, map [1, 2] onto [−1, 1]:
x = [2/(2 − 1)][t − (1 + 2)/2] = 2t − 3,  t = (x + 3)/2
∫_1^2 t⁴ dt = ∫_{−1}^{1} [(x + 3)/2]⁴ (1/2) dx. Exact value: 31/5 = 6.2.
Using Gaussian quadrature with three points: x1 = −√(3/5), x2 = 0, x3 = √(3/5), and a1 = a3 = 5/9, a2 = 8/9.
∫_{−1}^{1} (1/2)[(x + 3)/2]⁴ dx ≈ (1/2){ (5/9)[(3 − √(3/5))/2]⁴ + (8/9)(3/2)⁴ + (5/9)[(3 + √(3/5))/2]⁴ } ≈ 6.1999999955
(rounding error from calculator; the result should be exact)
Gaussian quadrature error is governed by the 2nth derivative. So, when n = 3 it is bounded by the 6th derivative, which makes it exact for a polynomial of degree 5, or more generally degree 2n − 1.
The error term is generally defined through:
∫_{−1}^{1} f(x) dx − Σ_{j=1}^{n} aj f(xj) = [2^{2n+1} (n!)⁴ / ((2n + 1)((2n)!)³)] f^(2n)(ξ)
So, the error for our example is given by f^(6)(ξ)/15750 = 0.
Accuracy: Gaussian quadrature has precision, or degree of accuracy 2n − 1.
Efficiency: Gaussian quadrature has similar efficiency in implementation. However, defining
the quadrature points and weights is not efficient. If the points and weights are already known and
easily accessible for implementation, the method is quite efficient.
Stability: Gaussian quadrature is similarly stable to the other techniques.
Gaussian quadrature ensures better accuracy, while minimizing the number of points required
to gain that accuracy. Neither Trapezoidal, Simpson’s, or Midpoint can gain an exact solution for a
polynomial of degree 5.
8.4 Adaptive Quadrature
In practice, a function may vary slowly over part of the interval and rapidly in another.
When the function is slowly varying, we don’t need small h. In fact, it seems quite the waste
of computational resources to make h small there. However, when it is steep (or changing more
rapidly), we will want smaller h.
Adaptive quadrature is a technique that allows us to approximate the integral within a specified
tolerance, and divide up the interval accordingly to reach that tolerance.
We will focus on adaptive quadrature using composite Simpson’s rule (although, this concept
can be extended using any integration technique).
The basic idea is to build approximations of the same type on the interval [a, b] and then the sum of approximations on [a, (a + b)/2] and [(a + b)/2, b]. Compare the two approximations, and if they are within tolerance of each other, use the better approximation because it is closer to the actual value anyway.
The subdivision proceeds as:
[a, b]
[a, (a + b)/2], [(a + b)/2, b]
[a, (3a + b)/4], [(3a + b)/4, (a + b)/2], [(a + b)/2, (a + 3b)/4], [(a + 3b)/4, b]
We have generally measured our error term functionally (through a term based on the function,
h, and the method used).
We cannot calculate the error in this manner using adaptive quadrature because there is not a
fixed h value.
We need to be able to define our error independent of the function and a fixed h.
Simpson's rule on [a, b]:
S1 = [(b − a)/6][f(a) + 4 f((a + b)/2) + f(b)]
S2 = [(b − a)/12][f(a) + 4 f((3a + b)/4) + 2 f((a + b)/2) + 4 f((a + 3b)/4) + f(b)]
Given the error bound of Simpson's rule, we know that S2 has 1/16th the error of S1.
So, if we know that the exact integral, I, is equivalent to each approximation plus its respective error, we can define an error measurement:
I = S1 + E1 = S2 + E2
E2 = (1/16) E1
So, S2 − S1 = E1 − E2 = (15/16) E1
E1 ≈ (16/15)(S2 − S1) = 16 E2
E2 ≈ (1/15)(S2 − S1)
We use this to approximate our actual error E2 in each subinterval. If E2 < tol, we can stop
dividing that subinterval.
We sum these approximations to efficiently approximate the integral within the accuracy needed,
and the minimum number of subintervals.
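A recursive sketch (my own, not from the text) of adaptive Simpson's rule using the (S2 − S1)/15 error estimate; the function names and test integral are illustrative.

import math

def simpson(f, a, b):
    return (b - a) / 6 * (f(a) + 4 * f((a + b) / 2) + f(b))

def adaptive_simpson(f, a, b, tol):
    m = (a + b) / 2
    S1 = simpson(f, a, b)
    S2 = simpson(f, a, m) + simpson(f, m, b)
    if abs(S2 - S1) / 15 < tol:           # E2 estimate; a safer factor of 1/10 is noted below
        return S2                         # use the better approximation
    # otherwise split the interval and divide the tolerance between the halves
    return (adaptive_simpson(f, a, m, tol / 2) +
            adaptive_simpson(f, m, b, tol / 2))

print(adaptive_simpson(math.sin, 0, math.pi, 1e-8), 2.0)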
Notes:
This technique can perform poorly if f′′′′(x) varies rapidly. We compensate by generally using (1/10)(S2 − S1) rather than (1/15)(S2 − S1) to ensure we are within tolerance.
This method is not guaranteed, and as such, any questionable results should be checked, or
investigated further.
Accuracy: Adaptive quadrature is built to ensure required accuracy, as efficiently as possible.
Efficiency: Adaptive quadrature is constructed to optimize the efficiency of the approximation
over the whole interval. Since it is designed to use less points on regions with less change
(shallower), and more points on regions with more change (steeper), it will use less points than a
method that uses a fixed number of points over the whole interval.
Stability: As mentioned above, if the 4th derivative is varying rapidly, this method can perform
poorly. Also, it is not guaranteed to converge - the stability is not guaranteed.
Adaptive quadrature is useful when we need to optimize efficiency over accuracy and stability.
One nice feature is that it can be done with any standard method.
8.5 Romberg Integration
Romberg integration applies Richardson extrapolation to the composite trapezoidal rule. Let R1 denote the trapezoidal approximation with step size h and R2 the approximation with step size h/2. If we define h = b − a,
I − R1 = k1 h² + k2 h⁴ + k3 h⁶ + ...
I − R2 = k1 (h/2)² + k2 (h/2)⁴ + k3 (h/2)⁶ + ...
If we combine the terms to cancel the k1 term, we increase the order of our accuracy:
(I − R1) − 4(I − R2) = (3/4) k2 h⁴ + (15/16) k3 h⁶ + ...
leading to a result with O(h⁴) error. Solving for I in this equation yields
I = (4R2 − R1)/3 − (1/4) k2 h⁴ − (5/16) k3 h⁶ − ...
If we repeat this process, we can continue to improve on the accuracy of our approximation.
The notation for this extrapolation is given by R2,2 = R2,1 + (R2,1 − R1,1)/3.
The first subscript denotes the step size chosen: 1 → h, 2 → h/2, 3 → h/4, 4 → h/8, etc.
The second subscript denotes the steps of extrapolation: 1 → initial approximations, 2 → after one step of extrapolation, 3 → after two steps of extrapolation, etc.
These are easier to work with in a tabular format, with the extrapolation generally defined through
R_{j+1,k} = R_{j+1,k−1} + (R_{j+1,k−1} − R_{j,k−1}) / (4^{k−1} − 1) for k = 2, 3, ..., j + 1.
Ex: Apply Romberg integration to the following integral: ∫_1^2 (1/x) dx
R1,1 = (1/2)[1 + 1/2] = 3/4 = 0.75
R2,1 = (1/4)[1 + 2(2/3) + 1/2] ≈ 0.708333
R3,1 = (1/8)[1 + 2(4/5) + 2(2/3) + 2(4/7) + 1/2] ≈ 0.697023
R4,1 = (1/16)[1 + 2(8/9) + 2(4/5) + 2(8/11) + 2(2/3) + 2(8/13) + 2(4/7) + 2(8/15) + 1/2] ≈ 0.694122
i Ri,1 Ri,2 Ri,3 Ri,4
1 0.75
2 0.708333 0.694444
3 0.697023 0.693253 0.693174
4 0.694122 0.693155 0.693149 0.693148
Actual value: 0.693147
Significant improvement in the approximation. However, if h is large this can be very bad.
Since we applied trapezoidal rule, the error will carry through. So, more extrapolation does not
always equal better results.
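A sketch (my own, not from the text) that reproduces the table above; the function name romberg is illustrative.

def romberg(f, a, b, levels):
    # Build the Romberg table R[j][k] from composite trapezoid approximations
    R = [[0.0] * levels for _ in range(levels)]
    h = b - a
    R[0][0] = h / 2 * (f(a) + f(b))
    for j in range(1, levels):
        h /= 2
        # the finer trapezoid rule reuses the previous row plus the new midpoints
        new_points = sum(f(a + (2 * i - 1) * h) for i in range(1, 2**(j - 1) + 1))
        R[j][0] = 0.5 * R[j - 1][0] + h * new_points
        for k in range(1, j + 1):
            R[j][k] = R[j][k - 1] + (R[j][k - 1] - R[j - 1][k - 1]) / (4**k - 1)
    return R

R = romberg(lambda x: 1 / x, 1, 2, 4)
for j, row in enumerate(R):
    print(["%.6f" % v for v in row[:j + 1]])   # last entry approaches ln 2 = 0.693147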
Accuracy: Romberg integration is specifically used to improve accuracy.
Efficiency: Romberg integration can be more efficient than using more data points, but is not
particularly efficient itself.
Stability: Romberg integration is not particularly stable, more iterations can gain accuracy, but
it is not guaranteed.
Romberg integration is best used if you cannot take more data, but need to gain additional
accuracy. Otherwise, it is likely not worth the added effort.
8.6 Multidimensional Integration
Consider a double integral over a rectangle, ∫_a^b ∫_c^d f(x, y) dy dx. Analytically, we integrate from the inside out (integrate with respect to y first, then with respect to x in the above integral).
We can do this operation numerically in the same order, but we don't have to.
Using Simpson's rule in both directions, the double integral looks like
(hx hy / 9) Σ_{j=0}^{n} Σ_{k=0}^{m} wjk f(xj, yk) + Es
Example: Use composite Simpson’s rule along x and y to define the numerical integration of a
double integral.
Each x j = a + jhx , yk = c + khy .
The weights from composite Simpson’s rule in 1D were 1, 4, 2, 4, 2, 4, ..., 2, 4, 1
In 2D, we’ll take multiplicative combinations of these weights.
       y0   y1   y2   y3   ...   ym
x0      1    4    2    4   ...    1
x1      4   16    8   16   ...    4
x2      2    8    4    8   ...    2
x3      4   16    8   16   ...    4
...
xn      1    4    2    4   ...    1
This matrix corresponds to the w in the summation above, and each entry w jk corresponds to
the weight for the associated point (x j , yk ).
We can extend any 1D numerical integration technique in this manner: Trapezoidal, Simpson’s,
Gaussian (requires a 2D transform as in Calculus III to map the domain appropriately in both
dimensions), Adaptive, etc.
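As a sketch of such an extension (my own illustration, not from the text), the 2D composite Simpson's rule with the weight table above can be written as follows; the test integrand is hypothetical.

import math

def composite_simpson_2d(f, a, b, c, d, n, m):
    # Composite Simpson's rule on [a, b] x [c, d]; n and m must be even
    if n % 2 or m % 2:
        raise ValueError("n and m must be even")
    hx, hy = (b - a) / n, (d - c) / m
    def weight(idx, last):              # 1D Simpson weight pattern 1, 4, 2, 4, ..., 4, 1
        return 1 if idx in (0, last) else (4 if idx % 2 else 2)
    total = 0.0
    for j in range(n + 1):
        for k in range(m + 1):
            total += weight(j, n) * weight(k, m) * f(a + j * hx, c + k * hy)
    return hx * hy / 9 * total

# hypothetical check: integral of sin(x)sin(y) over [0, pi] x [0, pi] is 4
print(composite_simpson_2d(lambda x, y: math.sin(x) * math.sin(y),
                           0, math.pi, 0, math.pi, 8, 8))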
9. Linear Algebra
If you feel weak in your Linear Algebra skills, go back to your textbook and notes. Then, bring
questions!
This chapter overviews key concepts from linear algebra that we use in the remainder of the
course.
Matrix-vector operations
Range: range(A) is defined to be all the vectors that can be written as a linear combination of
the columns of A, ~y = A~x.
Null Space: null(A) is defined to be all vectors~z ∈ Rm such that A~z = ~0.
If A is singular, there will be infinite solutions (the system is underdetermined, and the null
space contains more than just ~0).
Equivalent statements:
• A is non singular
• det(A) ≠ 0
• Columns of A are linearly independent
• Rows of A are linearly independent
• A−1 exists such that A−1 A = I
• range(A) = Rn
• null(A) = {~0}
Eigenvalues: A~x = λ~x
Eigenvalues are the scalars λ for which multiplication by A simply scales the corresponding (nonzero) vector ~x.
In linear algebra, this was solved by determining the values λ when det(λ I − A) = 0.
9.2 Vector and Matrix Norms
Norms will allow us to define error in a linear system. Now that we're working with vectors instead
of scalars, we need a new measurement.
[Figure: unit balls of the ℓ2, ℓ∞, and ℓ1 norms in R²]
Example 9.3 Determine the ℓ2, ℓ∞, and ℓ1 norms of ~x = (1, −4, 5, −17)ᵀ.
||~x||2 = √(1² + (−4)² + 5² + (−17)²) = √331 ≈ 18.1934
||~x||∞ = 17
||~x||1 = |1| + |−4| + |5| + |−17| = 27
Cauchy-Schwartz Inequality:
For any two vectors ~x,~y ∈ Rn
|~xT~y| ≤ ||~x||2 ||~y||2
When we used iterative methods before, we defined a tolerance as our stopping criteria. For
vectors, we will have to use a norm to determine when we reach tolerance, using the distance
between two vectors ||~x −~y|| p ≤ tol.
Matrix norms (not as intuitive)
A matrix norm specifies the size of A, and it is induced by the vector norm - it is only defined
through its relationship with a vector.
If || · ||a is a vector norm on Rn , then ||A||a = max||~x||a =1 ||A~x||a is a matrix norm.
Properties of a matrix norm: Given A, B, and a scalar α
• ||A|| ≥ 0
• ||A|| = 0 if and only if A = 0
• ||αA|| = |α|||A||
• ||A + B|| ≤ ||A|| + ||B||
• ||AB|| ≤ ||A||||B||
Our common norms:
||A||1 = max_{1≤j≤n} Σ_{i=1}^{n} |aij| (max column sum)
||A||∞ = max_{1≤i≤n} Σ_{j=1}^{n} |aij| (max row sum)
||A||2 = √(ρ(AᵀA)), where ρ is the spectral radius, or largest eigenvalue in magnitude, of the given matrix.
For any natural norm, the matrix norm is induced by the vector norm, as above.
Example 9.4 A =
[ 3   2   5 ]
[ 1   4   7 ]
[ 2  −1   3 ]
||A||1 = max(6, 7, 15) = 15
||A||∞ = max(10, 12, 6) = 12
||A||2 = √107.6389 ≈ 10.3749
10. Direct Methods for Solving Linear Systems
This chapter is all about solving the system A~x = ~b. We assume A is real, non singular, and square
(n × n).
To solve this system, we will discuss two types of solutions.
Direct methods: solve a system in a finite number of steps (often the same way you would by
hand). Total error = roundoff error only.
Iterative methods: Repeated process until a specified accuracy is met. May give a satisfactory
approximation in a finite number of steps, but the process can be repeated indefinitely. Total error
= roundoff error + truncation error (from stopping the method).
We will start with direct methods, first.
10.1 Gaussian Elimination and Backward Substitution
Gaussian Elimination: Form the augmented matrix and apply row reduction systematically to reduce the system to upper-triangular form.
Consider the augmented matrix
1 1 −3 −1
2 4 −1 5
4 1 2 7
R∗2 = R2 − 2R1
R∗3 = R3 − 4R1
1 1 −3 −1
0 2 5 7
0 −3 14 11
R∗2 = (1/2) R2
R∗3 = R3 + 3R∗2
1 1 −3 −1
0 1 5/2 7/2
0 0 43/2 43/2
R∗3 = (2/43) R3
1 1 −3 −1
0 1 5/2 7/2
0 0 1 1
We will only go to REF (Row Echelon Form) at most in our code; it is computationally unnecessary to complete Gaussian elimination all the way to RREF (Reduced Row Echelon Form).
Backward Substitution:
We use backward substitution to solve the system after it is reduced to upper-triangular form.
In this example, we can solve for x3 = 1 immediately.
x2 + (5/2)(1) = 7/2 → x2 = 1
x1 + 1 − 3(1) = −1 → x1 = 1
~x = (1, 1, 1)ᵀ
Since we will be programming this, we will keep a fixed form for the matrix, and follow a
specific manner to perform Gaussian elimination. There will not be conditionals in the basic code.
So, what can go wrong?
• If aii = 0, we will have a zero on the diagonal (pivot element). In this case, we need to swap
rows to ensure there are no zeros along the diagonal. (More on this later)
• If a_pi = 0 for all i (a row of zeros in row p), then there is no unique solution. The system has infinitely many solutions if a_{p(n+1)} = 0, and no solution if a_{p(n+1)} ≠ 0.
We use three row operations to apply Gaussian elimination:
• Multiplication by a constant
• Linear combination of rows
• Row swaps
We eliminate systematically column-by-column the lower off-diagonal terms.
For a system
a11 a12 a13 . . . a1n b1
a21 a22 a23 . . . a2n b2
a31 a32 a33 . . . a3n b3
. . . . . . . .
. . . . . . . .
. . . . . . . .
an1 an2 an3 . . . ann bn
We eliminate the lower entries in the first column through:
R2^(1) = R2 − (a21/a11) R1,  R3^(1) = R3 − (a31/a11) R1,  etc.
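A sketch (my own, not from the text) of basic elimination plus back substitution, applied to the worked example above; it assumes nonzero pivots (pivoting is discussed below), and the function name is illustrative.

import numpy as np

def gaussian_elimination(A, b):
    # Basic Gaussian elimination (no pivoting) followed by back substitution
    A = A.astype(float).copy()
    b = b.astype(float).copy()
    n = len(b)
    for k in range(n - 1):                      # eliminate column k below the diagonal
        for i in range(k + 1, n):
            m = A[i, k] / A[k, k]               # multiplier a_ik / a_kk
            A[i, k:] -= m * A[k, k:]
            b[i] -= m * b[k]
    x = np.zeros(n)
    for i in range(n - 1, -1, -1):              # back substitution
        x[i] = (b[i] - A[i, i + 1:] @ x[i + 1:]) / A[i, i]
    return x

A = np.array([[1, 1, -3], [2, 4, -1], [4, 1, 2]])
b = np.array([-1, 5, 7])
print(gaussian_elimination(A, b))               # [1. 1. 1.]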
10.2 LU Decomposition
Gaussian elimination is a computationally expensive operation. LU decomposition is introduced
because if you need to solve the same system A~x = ~b for different ~b vectors, it offers a less
computationally expensive option. However, if you are only solving one linear system, it is slightly
more expensive than using Gaussian elimination directly.
Theorem 10.2.1 if Gaussian elimination can be performed on a system A~x = ~b without a row
interchange, then A can be factored into the product of a lower triangular matrix L, and an upper
triangular matrix U.
We use Gaussian elimination to “split" A = LU, where L = lower triangular matrix, and U =
upper triangular matrix. This requires the same O(n3 ) operations as applying Gaussian elimination
to solve the system. However, by splitting the matrix we reduce the operations to solve the system
if we have more than one system with the same A matrix.
This allows us to solve the system A~x = ~b → LU~x = ~b by letting U~x =~y so that L~y = ~b.
We can then solve for ~y using a forward substitution, and then ~x with a backward substitution.
Each of these is O(n2 ) operations, which is going to be more computationally efficient than using
Gaussian elimination (O(n3 ) operations) for every new ~b.
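One way to exploit this in practice (a sketch, assuming SciPy is available; not from the text) is to factor once and reuse the factorization for each right-hand side.

import numpy as np
from scipy.linalg import lu_factor, lu_solve

A = np.array([[1., 1., -3.], [2., 4., -1.], [4., 1., 2.]])
lu, piv = lu_factor(A)                   # O(n^3) factorization, done once

# each new right-hand side now costs only O(n^2): one forward and one backward substitution
for b in (np.array([-1., 5., 7.]), np.array([0., 1., 0.])):
    print(lu_solve((lu, piv), b))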
How do we get L and U? U is the upper-triangular matrix produced by the elimination, and L is the unit lower-triangular matrix whose subdiagonal entries are the multipliers (aij/ajj) used at each elimination step.
To apply Gaussian elimination in the manner we have so far assumes there are no zeros at pivot points (diagonal terms). What do we do if there is a zero at a pivot?
If you solved the problem analytically in this scenario, you would swap rows. We can do that
numerically also, using permutation matrices. These are matrices where I has row swaps for the
rows that need to be swapped in A.
Example 10.5 P = [1 0 0; 0 0 1; 0 1 0] swaps rows 2 and 3:
PA = [1 0 0; 0 0 1; 0 1 0][1 2 3; 4 5 6; 7 8 12] = [1 2 3; 7 8 12; 4 5 6]
This gives us a way to solve these problems that would otherwise break down by applying a
permutation matrix to both sides of the equation. A~x = ~b → PA~x = P~b.
Note: P−1 = PT
To do this, we look down each column and find the first, largest entry in the column, then swap those rows.
For each kth column, find the smallest integer p, k ≤ p ≤ n, with |a_pk| = max_{k≤i≤n} |a_ik|. If p ≠ k, swap rows p and k, then continue the elimination with the next column.
We can avoid this kind of error by applying partial pivoting. We would see that the second row has the largest entry in the first column, and then swap the first and second rows before applying Gaussian elimination:
1.000          −1.000   −1.000
3.333 × 10⁻⁴    1.000    2.000
R∗2 = R2 − 3.333 × 10⁻⁴ R1
Now we're multiplying by small numbers instead of dividing by them, so roundoff error is less risky:
1.000   −1.000   −1.000
0        1.000    2.000
x2 = 2, x1 − 2 = −1 → x1 = 1
This is consistent to three-digit accuracy.
However, partial pivoting does not guarantee stability in your solution.
Note: For pivoting, we start with the first row and only look below the current row. Do not
attempt to re-pivot rows that have already been locked.
Loops are rather inefficient, so using matrix-vector multiplication is preferable for efficiency
(when possible). However, that does not mean that matrix-vector multiplication is cheap - it isn’t.
Memory access: BLAS (Basic Linear Algebra Subprograms) organizes common operations into levels according to how much memory traffic they require.
Level 1: “Simple" operations are arithmetic and logical operations. They do not require
additional memory access, and are the most cost-efficient.
Level 2: Matrix updates, systems, matrix-vector multiplication. These require moderate
memory access, and are preferable to Level 3.
Level 3: Matrix-matrix multiplications require the most memory access. Especially when
matrices are large, it may be beneficial to convert a level 3 operation to level 2 for efficiency.
Iterative methods approximate a solution by repeating a process (recall: fixed point iteration, etc.).
We will discuss iterative methods for solving a linear system.
We introduced norms to present a way to approximate our error, and determine appropriate
stopping criteria for systems.
Direct methods are great for small systems, but can be extremely computationally expensive
for large systems. They are more accurate (because the technique is exact), but can be less efficient.
For large, sparse systems, iterative methods are preferred. They also come into use when solving
some partial differential equations (Numerical Analysis II!), i.e. Poisson or Laplace’s equation
∇2 u = g(x, y).
Recall: matrix storage and operations are expensive.
Iterative methods of solving linear systems are constructed with a very similar format to fixed point iteration, but with vectors. We seek a solution to A~x = ~b through ~f(~x) = ~b − A~x = ~0, defining a ~g(~x) such that ~x_{k+1} = ~g(~x_k).
Note: Convergence for these methods is linear.
We split A = M − N; thus, our iteration will take the form ~x_{k+1} = M⁻¹N~x_k + M⁻¹~b.
We can simplify this expression with some additional algebraic manipulation:
~x_{k+1} = M⁻¹(M − A)~x_k + M⁻¹~b = ~x_k − M⁻¹A~x_k + M⁻¹~b
        = ~x_k + M⁻¹(~b − A~x_k) = ~x_k + M⁻¹~r_k, where ~r_k = ~b − A~x_k is the residual.
What is M?
We choose M so that M −1 is as close to A−1 as possible while still being an efficient computa-
tion.
We will discuss two specific methods that are variations on this idea: Jacobi and Gauss-Seidel.
We also want a convergent matrix. An n × n matrix A is convergent if limk→∞ Ak = 0
Equivalent statements:
• A is a convergent matrix
• limk→∞ ||Ak || = 0 for a natural norm
• limk→∞ ||Ak || = 0 for all natural norms
• ρ(A) < 1
• limk→∞ Ak~x = ~0 ∀~x ∈ Rn
We refer to these as relaxation methods because of the smoothing property a convergent matrix
has.
Iterative methods will always start with an initial guess, ~x0 , and the iteration will converge
linearly (if it converges) to some ~x∗ solution to A~x = ~b.
11.1.1 Jacobi
Simultaneous relaxation: updates all terms in the ‘next’ guess using values from the previous guess.
M = D, the diagonal of A.
We build the structure using the summation notation for A~x = ~b:
Σ_{j=1}^{n} aij xj = bi
Since aii xi is a term of Σ_{j=1}^{n} aij xj, we can solve for xi directly:
aii xi = bi − Σ_{j=1, j≠i}^{n} aij xj
Defining an iteration:
xi^(k+1) = [bi − Σ_{j=1, j≠i}^{n} aij xj^(k)] / aii
It can also be written as ~x^(k+1) = ~x^(k) + D⁻¹~r^(k).
Example 11.1 Solve A~x = ~b for
A = [12 2 3; 5 15 1; 6 2 10],  ~b = (25, 38, 40)ᵀ
using Jacobi iteration with initial guess ~x^(0) = (1, 1, 1)ᵀ.
~x^(1) = [ (25 − 2 − 3)/12, (38 − 5 − 1)/15, (40 − 6 − 2)/10 ]ᵀ = [ 5/3, 32/15, 32/10 ]ᵀ
~x^(2) = [ (25 − 2(32/15) − 3(32/10))/12, (38 − 5(5/3) − 32/10)/15, (40 − 6(5/3) − 2(32/15))/10 ]ᵀ
This is cumbersome by hand and converges slowly, but it is also generally effective.
These techniques are particularly useful for large, sparse, A matrices. For a 3 × 3 matrix, this is
not generally necessary.
11.1.2 Gauss-Seidel
Uses the updated values, as they are updated. This involves more work, but is also faster than
Jacobi.
M = E formed by the lower-triangular terms of A
If Jacobi will converge for a given system, then so will Gauss-Seidel. It also converges linearly,
but approximately twice as fast as Jacobi.
Since Gauss-Seidel uses updated values, it is much harder to implement in parallel.
It is built from the same original summation Σ_{j=1}^{n} aij xj = bi; if we expand the summation terms,
Σ_{j=1}^{i−1} aij xj + aii xi + Σ_{j=i+1}^{n} aij xj = bi
“updated” + “current” + “future” = bi
We solve for the iteration:
xi^(k+1) = [bi − Σ_{j=1}^{i−1} aij xj^(k+1) − Σ_{j=i+1}^{n} aij xj^(k)] / aii
It can also be written as ~x^(k+1) = ~x^(k) + E⁻¹~r^(k).
See how it behaves for the same example.
Example 11.2 Solve the same system using Gauss-Seidel, with
E = [12 0 0; 5 15 0; 6 2 10]
~x^(1) = [ (25 − 2 − 3)/12, (38 − 25/3 − 1)/15, (40 − 30/3 − 172/45)/10 ]ᵀ = [ 5/3, 86/45, 1178/450 ]ᵀ ≈ (1.6667, 1.9111, 2.6178)ᵀ
Compare with Jacobi: (1.6667, 2.1333, 3.2)ᵀ
and exact: (1, 2, 3)ᵀ
Notice that Gauss-Seidel is slightly closer - but not exclusively after only one iteration. This
increases over time.
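A sketch (my own, not from the text) of both iterations applied to the example system; function names are illustrative.

import numpy as np

def jacobi(A, b, x0, iters):
    D = np.diag(A)
    x = x0.astype(float).copy()
    for _ in range(iters):
        r = b - A @ x                       # residual
        x = x + r / D                       # x^(k+1) = x^(k) + D^{-1} r^(k)
    return x

def gauss_seidel(A, b, x0, iters):
    n = len(b)
    x = x0.astype(float).copy()
    for _ in range(iters):
        for i in range(n):                  # use updated values as soon as they are available
            s = A[i, :i] @ x[:i] + A[i, i + 1:] @ x[i + 1:]
            x[i] = (b[i] - s) / A[i, i]
    return x

A = np.array([[12., 2., 3.], [5., 15., 1.], [6., 2., 10.]])
b = np.array([25., 38., 40.])
x0 = np.ones(3)
print(jacobi(A, b, x0, 1))                  # [1.6667, 2.1333, 3.2]
print(gauss_seidel(A, b, x0, 1))            # [1.6667, 1.9111, 2.6178]
print(jacobi(A, b, x0, 25), gauss_seidel(A, b, x0, 25))   # both approach [1, 2, 3]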
12. Eigenvalues and Singular Values
Recall that we discussed the determinant as a process that is too computationally expensive to do
numerically, O(n!) operations. This chapter outlines methods of solving for eigenvalues through
the relationship A~x = λ~x iteratively. This allows us to solve for an approximation to our eigenvalues
more efficiently.
12.3 QR Algorithm
The QR algorithm is a matrix reduction technique that computes all the eigenvalues of a symmetric
matrix simultaneously.
It must be applied to a symmetric matrix in tridiagonal form. So, Householder transformations
can be used to reduce a symmetric matrix A to tridiagonal form prior to applying the QR algorithm.
Given a symmetric tridiagonal matrix
A =
[ a11  a12   0    0   ... ]
[ a21  a22  a23   0   ... ]
[  0   a32  a33  a34  ... ]
[  ...            ...     ]
Factor A = QR using QR factorization, [Q, R] = qr(A)
Then use the factors to reduce the symmetric, tridiagonal matrix to a diagonal matrix of the
eigenvalues through multiplication RQ.
Since Q is orthogonal and R is upper-triangular, then R = QT A = QT QR, and RQ = QT AQ,
which reduces the off-diagonal terms of A because Q is an orthogonal matrix.
This will converge to a diagonal matrix of the eigenvalues of A with a rate of convergence O(|λ_{i+1}/λ_i|).
This method will converge if all eigenvalues are distinct and real.
There are more general applications of this method as well. It can be applied to an upper-Hessenberg matrix (Householder transformations can reduce a non-symmetric matrix to upper-Hessenberg form), reducing it to an upper-triangular matrix whose diagonal entries approach the approximate eigenvalues.
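The factor-and-multiply loop can be sketched as follows (MATLAB/Octave style again; the subdiagonal stopping test and the names tol and maxit are illustrative):

function evals = qr_eig(A, tol, maxit)
  % Unshifted QR iteration: A <- R*Q = Q'*A*Q is an orthogonal
  % similarity transformation, so the eigenvalues are preserved
  % while the off-diagonal entries shrink.
  for k = 1:maxit
    [Q, R] = qr(A);
    A = R*Q;
    if norm(tril(A, -1), 'fro') < tol   % subdiagonal entries nearly zero
      break
    end
  end
  evals = diag(A);            % approximate eigenvalues on the diagonal
end

For a symmetric tridiagonal input the iterates approach a diagonal matrix; for an upper-Hessenberg input they approach an upper-triangular matrix, as described above.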
12.3.1 Shifting
A more practical implementation applies shifting to accelerate convergence to the eigenvalues.
Since $\left|\lambda_{i+1}/\lambda_i\right|$ may be close to 1, which slows the convergence of the method, shifting is used to accelerate it. Choose an $\alpha_k$ close to an eigenvalue; generally we do this by selecting one of the diagonal entries.
Then shift $A^{(k)}$ by subtracting this value from the diagonal terms: $\tilde{A}^{(k)} = A^{(k)} - \alpha_k I$.
Factor $\tilde{A}^{(k)} = Q^{(k)} R^{(k)}$.
Combine to reduce the off-diagonal terms, restoring the shift: $A^{(k+1)} = R^{(k)} Q^{(k)} + \alpha_k I$.
Repeat.
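One pass of this shifted step looks roughly like the following sketch (illustrative MATLAB/Octave, using the last diagonal entry as the shift, as in the example below):

alpha  = A(end, end);                 % shift chosen from the diagonal
At     = A - alpha*eye(size(A));      % shift: A~ = A - alpha*I
[Q, R] = qr(At);                      % factor the shifted matrix
A      = R*Q + alpha*eye(size(A));    % combine and restore the shift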
Example 12.3 Compare standard QR with QR+shift:
Given the matrix $A^{(0)} = \begin{bmatrix} 3 & 4 & 0 \\ 4 & 4 & 5 \\ 0 & 5 & 1 \end{bmatrix}$
Standard QR: [Q, R] = qr(A)
$$Q^{(0)} = \begin{bmatrix} -0.6 & 0.1264 & -0.79 \\ -0.8 & -0.0948 & 0.5925 \\ 0 & 0.9874 & 0.158 \end{bmatrix}, \qquad R^{(0)} = \begin{bmatrix} -5 & -5.6 & -4 \\ 0 & 5.0636 & 0.5135 \\ 0 & 0 & 3.1203 \end{bmatrix}$$
$$A^{(1)} = R^{(0)} Q^{(0)} = \begin{bmatrix} 7.48 & -4.0509 & 0 \\ -4.0509 & 0.027 & 3.0811 \\ 0 & 3.0811 & 0.493 \end{bmatrix}$$
Repeat: [Q, R] = qr(A)
$$Q^{(1)} = \begin{bmatrix} -0.8793 & -0.2505 & 0.405 \\ 0.4762 & -0.4625 & 0.7479 \\ 0 & 0.8505 & 0.5259 \end{bmatrix}, \qquad R^{(1)} = \begin{bmatrix} -8.5065 & 3.5749 & 1.4673 \\ 0 & 3.6226 & -1.0057 \\ 0 & 0 & 2.5636 \end{bmatrix}$$
$$A^{(2)} = R^{(1)} Q^{(1)} = \begin{bmatrix} 9.1824 & 1.7251 & 0 \\ 1.7251 & -2.5307 & 2.1804 \\ 0 & 2.1804 & 1.3483 \end{bmatrix}$$
We can see the off-diagonal terms are decreasing, but they are not nearly zero yet. If we continue to repeat this process, they will approach zero. Computationally, we can use the size of the off-diagonal terms to determine when we have reached a tolerance.
Compare with QR+ shift:
We will select the last diagonal entry, $\alpha_0 = 1$, as our shift.
Shift A: $\tilde{A}^{(0)} = A^{(0)} - I = \begin{bmatrix} 2 & 4 & 0 \\ 4 & 3 & 5 \\ 0 & 5 & 0 \end{bmatrix}$
Factor $\tilde{A}^{(0)}$: [Q, R] = qr(Ã(0))
$$Q^{(0)} = \begin{bmatrix} -0.4472 & 0.3651 & -0.8165 \\ -0.8944 & -0.1826 & 0.4082 \\ 0 & 0.9129 & 0.4082 \end{bmatrix}, \qquad R^{(0)} = \begin{bmatrix} -4.4721 & -4.4721 & -4.4721 \\ 0 & 5.4772 & -0.9129 \\ 0 & 0 & 2.0412 \end{bmatrix}$$
Combine, with shift:
$$A^{(1)} = R^{(0)} Q^{(0)} + I = \begin{bmatrix} 7 & -4.899 & 0 \\ -4.899 & -0.8333 & 1.8634 \\ 0 & 1.8634 & 1.8333 \end{bmatrix}$$
Re-shift with $\alpha_1 = 1.8333$: $\tilde{A}^{(1)} = A^{(1)} - 1.8333 I = \begin{bmatrix} 5.1667 & -4.899 & 0 \\ -4.899 & -2.6667 & 1.8634 \\ 0 & 1.8634 & 0 \end{bmatrix}$
Factor $\tilde{A}^{(1)}$: [Q, R] = qr(Ã(1))
$$Q^{(1)} = \begin{bmatrix} -0.7257 & -0.6492 & 0.228 \\ 0.6881 & -0.6847 & 0.2404 \\ 0 & 0.3314 & 0.9435 \end{bmatrix}, \qquad R^{(1)} = \begin{bmatrix} -7.12 & 1.7201 & 1.2821 \\ 0 & 5.6236 & -1.2758 \\ 0 & 0 & 0.448 \end{bmatrix}$$
Combine, with shift:
$$A^{(2)} = R^{(1)} Q^{(1)} + 1.8333 I = \begin{bmatrix} 8.1836 & 3.8693 & 0 \\ 3.8693 & -2.4396 & 0.1485 \\ 0 & 0.1485 & 2.2561 \end{bmatrix}$$
After two iterations, the eigenvalue in the last row appears to be converging, while the first and
second rows have large off-diagonal terms.
Looking at the actual eigenvalues: −3.7029, 2.2591, and 9.4438.
We see that, with shifting, the eigenvalue corresponding to the shift converges much faster than the others. (Note: this would perform better if we used the largest diagonal entry.) With QR alone, however, all the eigenvalues converge at a similar rate (with the largest eigenvalue perhaps converging slightly faster than the others).
If we look at later iterations, we can see this even more clearly.
QR, iteration 9:
$$A^{(9)} = \begin{bmatrix} 9.4438 & -0.0023 & 0 \\ -0.0023 & -3.7018 & 0.0803 \\ 0 & 0.0803 & 2.258 \end{bmatrix}$$
Error:
$9.4438$: $-4.072 \times 10^{-7}$
$-3.7029$: $1.08 \times 10^{-3}$
$2.2591$: $-1.08 \times 10^{-3}$
All approximations are pretty good.
QR + shift, iteration 9:
$$A^{(9)} = \begin{bmatrix} 9.3425 & -0.0023 & 0 \\ -0.0023 & -3.7018 & 0.0002 \\ 0 & 0.0002 & 2.2591 \end{bmatrix}$$
Error:
$9.4438$: $1.01 \times 10^{-1}$
$-3.7029$: $1.01 \times 10^{-1}$
$2.2591$: $1 \times 10^{-15}$
The first two are decent approximations, and the third, which was used for the shift, is essentially exact.
Whichever value we select for the shift is the one QR + shift converges to fastest; in this case, convergence to the third eigenvalue listed was roughly cubic.
One benefit of using the last diagonal entry is that we can deflate the matrix once that eigenvalue reaches tolerance. So, in the case of our example, we can take the shifted result, reduce it to the leading 2 × 2 block, and continue the process with the next eigenvalue.
For a 3 × 3 matrix this is not as useful as it is for a larger matrix. For a large matrix, we can reduce the operations significantly by using shifting to converge to one eigenvalue, then systematically shrinking the matrix to find the remaining eigenvalues, with fewer and fewer operations at each reduction. Compared with QR alone, this can be a much more computationally efficient way to compute all the eigenvalues of a large matrix.
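A sketch of this shift-and-deflate strategy (illustrative MATLAB/Octave; the function name, tolerance, and iteration cap are assumptions, and no safeguards against a stalled shift are included):

function evals = qr_shift_deflate(A, tol, maxit)
  % Shifted QR with deflation: converge the trailing diagonal entry,
  % record it as an eigenvalue, then drop the last row and column
  % and repeat on the smaller matrix.
  n = length(A);
  evals = zeros(n, 1);
  while n > 1
    for k = 1:maxit
      alpha  = A(n, n);                  % shift from the last diagonal entry
      [Q, R] = qr(A - alpha*eye(n));
      A      = R*Q + alpha*eye(n);
      if abs(A(n, n-1)) < tol            % trailing eigenvalue has converged
        break
      end
    end
    evals(n) = A(n, n);                  % record the converged eigenvalue
    A = A(1:n-1, 1:n-1);                 % deflate to the leading block
    n = n - 1;
  end
  evals(1) = A(1, 1);                    % the last remaining 1x1 block
end

Each deflation shrinks the working matrix, so each subsequent eigenvalue costs fewer operations, which is the efficiency gain described above.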
II
Part Two
Index
accuracy, 10
Big O, 13
Big Theta, 14
Bisection method, 25, 28
bit, 17
byte, 17
computational expense, 14
efficiency, 10
error, 12
    absolute, 12
    relative, 12
fixed point, 18
Fixed Point Iteration, 31
floating point, 18
Introduction, 7
little o, 15
machine epsilon, 20
piecewise: linear, 61
Review List, 8
rounding to nearest, 20
stability, 10
word, 17