
Numerical Analysis

A course to prepare you for a career in Applied Mathematics

Drs. Shelley B. Rohde, Henricus Bouwmeester, and Christopher Harder

Copyright © 2017 Shelley B. Rohde, Henricus Bouwmeester, and Christopher Harder

PUBLISHED BY AUTHORS

Licensed under the Creative Commons Attribution-NonCommercial 3.0 Unported License (the
“License”). You may not use this file except in compliance with the License. You may obtain a
copy of the License at http://creativecommons.org/licenses/by-nc/3.0. Unless required
by applicable law or agreed to in writing, software distributed under the License is distributed on an
“AS IS” BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and limitations under the License.

First printing, Summer 2017


Contents

I Part One
1 Fundamentals . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
1.1 Introduction 9
1.2 Review List 10
1.3 Useful Calculus Theorems 10

2 Introduction to Numerical Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11


2.1 Scientific Computing 11
2.2 Assessing Algorithms 12
2.3 Numerical Algorithms and Errors 14
2.3.1 Absolute and Relative Errors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.3.2 Common Notations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15

3 Computational Details . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
3.1 Number Systems 19
3.2 Fixed and Floating point numbers 20
3.3 Loss of Significance 24

4 Root-Finding Techniques . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
4.1 Bisection Method 27
4.1.1 Evaluating the Bisection method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
4.2 Fixed Point Iteration 33
4.2.1 Fixed Point Iteration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
4.3 Newton’s Method (Newton-Raphson) 39
4.3.1 Secant Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
4.3.2 Zeros of Polynomials . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47

5 Polynomial Interpolation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
5.1 Monomial Interpolation 52
5.2 Lagrange Interpolation 54
5.3 Divided Differences 56
5.4 Error in Polynomial Interpolation 58
5.5 Interpolating Derivatives 59
5.6 Error in Hermite Interpolation 61

6 Piecewise Polynomial Interpolation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63


6.1 Linear Piecewise Polynomial Interpolation 63
6.2 Piecewise Hermite Interpolation 65
6.3 Cubic Splines 66

7 Numerical Differentiation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
7.0.1 Forward Difference Formula . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
7.0.2 Backward Difference Formula . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
7.0.3 Centered Difference Formula . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
7.1 Richardson Extrapolation 77
7.2 Formulas Using Lagrange Interpolating Polynomials 79
7.3 Roundoff Error and Data Errors in Numerical Differentiation 80

8 Numerical Integration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
8.1 Basic Quadrature 83
8.2 Composite Numerical Integration 87
8.3 Gaussian Quadrature 89
8.4 Adaptive Quadrature 91
8.5 Romberg Integration 93
8.6 Multidimensional Integration 94

9 Linear Algebra . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
9.1 Linear Algebra Basics 97
9.2 Vector and Matrix Norms 99

10 Direct Methods for Solving Linear Systems . . . . . . . . . . . . . . . . . . . . . 103


10.1 Gaussian Elimination and Backward Substitution 103
10.1.1 Implementation Concerns . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105
10.1.2 Backward Substitution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105
10.1.3 Matrix Inverse . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105
10.2 LU Decomposition 106
10.3 Pivoting Strategies 108
10.3.1 Scaled Partial Pivoting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109
10.3.2 Complete Pivoting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109
10.4 Efficient Implementation 109
10.5 Estimating Errors and the Condition Number 110
10.5.1 Error in Direct Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111

11 Iterative Methods for Solving Linear Systems . . . . . . . . . . . . . . . . . . . 113


11.1 Stationary Iteration and Relaxation Methods 113
11.1.1 Jacobi . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 114
11.1.2 Gauss-Seidel . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115
11.2 Convergence of Stationary Methods 116

12 Eigenvalues and Singular Values . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 119


12.1 The Power Method 119
12.2 Singular Value Decomposition 121
12.2.1 Householder Transformations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 121
12.3 QR Algorithm 122
12.3.1 Shifting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 123

II Part Two
Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 129
I
Part One
1. Fundamentals

1.1 Introduction

Numerical analysis is the mathematical study of numerical algorithms (methods) used to solve
mathematical problems. The goal of such a study is to clarify the relative benefits and drawbacks
associated with a given algorithm. The need for such an understanding is multifaceted. People facing
“real world" problems that must be solved often discover the task is impossible to do by hand. In such
a situation, an appropriate algorithm for approximating the solution can be chosen by balancing the
theoretical properties of the method with the setting (problem statement, computational resources,
etc) in which a solution is sought. Understanding the theoretical properties of existing methods can
also assist the numerical analyst in developing and analyzing new algorithms which may be more
appropriate for a given setting than those which are already available.
The goal of this course is to prepare you to choose an appropriate method to solve a given
problem, implement the method appropriately, and evaluate the results to determine effectiveness
of your method. A deep feeling for the algorithms and the theory associated with them is best
gained by direct calculation and experimentation with them. For this reason, this course emphasizes
derivation of methods, proof of associated theoretical properties, and computation of results using
the algorithms. Computations using a method can be tedious, making it impractical to implement a
method by hand beyond the most basic of cases. For this reason, programming is an integral part of
applying these methods, and a necessary part of this course. All of this makes this course one of
the most applied of applied mathematics courses.
The utility of the programming portion of this course extends beyond merely assisting with
understanding the algorithms more deeply. Programming is one of the most marketable skills you
can have as an applied mathematician, with MATLAB and Python currently (2017) being the two
most popular programming languages. Alongside the discussion of numerical methods in this
text you will find MATLAB or Python code or pseudo-code (a set of instructions emphasizing
the steps to program without the syntax and detail required for a particular language) to help you
develop your programming skills.
Enjoy!

1.2 Review List


This course is a senior experience course. It builds upon many prior courses. Here is a list of
concepts you should be comfortable with upon entering this course. If you need help with any of
these, please see me soon. Neither of us wants you to fall behind.
1. Find roots (x-intercepts, and intersections)
2. From calculus: Taylor series, Riemann sums, differential limits, continuity, differentiability,
multiple integrals, and vectors
3. Proofs (the form of a general proof)
4. Derivations
5. Solve linear systems
6. From linear algebra: Gaussian elimination, singularity, matrix-vector operations, transpose,
determinant, and eigenvalues
7. Basic programming skills: for loops, while loops, if statements, definitions

1.3 Useful Calculus Theorems


We recap some useful theorems from calculus:

Theorem 1.3.1 — Intermediate Value Theorem. If f ∈ C[a, b] and f(x₁) ≤ s ≤ f(x₂) for some x₁, x₂ ∈ [a, b], then ∃c ∈ [a, b] s.t. f(c) = s.

Theorem 1.3.2 — Mean Value Theorem. If f ∈ C[a, b] and f is differentiable on (a, b), then ∃c ∈ (a, b) for which f′(c) = (f(b) − f(a))/(b − a).

Theorem 1.3.3 — Rolle’s Theorem. If f ∈ C[a, b], f is differentiable on (a, b), and f(a) = f(b), then ∃c ∈ (a, b) s.t. f′(c) = 0.

Theorem 1.3.4 — Extreme Value Theorem. If f ∈ C[a, b], it must have a maximum and a
minimum on [a, b].

Theorem 1.3.5 — Taylor Series for a Function. Suppose the function f has derivatives of all
orders on an interval containing the point a. The Taylor series for f centered at a is
f(a) + f′(a)(x − a) + (f″(a)/2!)(x − a)² + (f⁽³⁾(a)/3!)(x − a)³ + ... = Σ_{k=0}^{∞} (f⁽ᵏ⁾(a)/k!)(x − a)ᵏ.
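As a quick illustration, the partial sums of this series can be computed directly. The sketch below (in Python, one of the two languages used in this text) approximates eˣ, for which every derivative at a equals eᵃ; the helper name taylor_exp is our own, not from any library.

```python
import math

def taylor_exp(x, a=0.0, terms=10):
    """Partial sum of the Taylor series of e^x centered at a.

    Every derivative of e^x is e^x, so f^(k)(a) = e^a for all k.
    """
    return sum(math.exp(a) * (x - a) ** k / math.factorial(k)
               for k in range(terms))

# With enough terms the partial sum approaches the exact value e^1 = e.
print(abs(taylor_exp(1.0, terms=15) - math.e))  # a very small number
```

Adding terms shrinks the error rapidly here; convergence behavior like this is exactly what the later error analysis makes precise.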
2. Introduction to Numerical Analysis

When dealing with a mathematical object such as a function, derivative, equation, definite integral,
etc., we are often confronted with conflicting realities:
• the need to derive information from the object, and
• practical limitations on our ability to perform the task.
In studying Numerical Analysis, we will come to understand, from various viewpoints, numerical
methods for approximating the information being sought. Simply understanding the steps involved
in using a numerical method is one part of this. The analysis portion of the study will provide
insight about conditions under which a method is expected to provide a “reasonable” approximation
as well as a way to control the algorithm in such a way that the subjective standard of “reasonable”
may be achieved. In the process, we find it necessary to return to the conflicting realities above:
perhaps the “reasonable” approximation cannot be found without “unreasonable” expectations.
Such reflection leads us to develop and analyze alternative methods.

2.1 Scientific Computing


The need to form an approximation arises in situations ranging from the mundane to the very
interesting. For instance, a college algebra student may be faced with solving an equation f (x) = 0
for which there is no algebraic method (e.g., completing the square for quadratic equations) to be
called upon. Instead, the student would ask their calculator to find an x-intercept, and the calculator
would make use of a numerical method to display a decimal approximation to the solution. A
broader and more interesting setting requiring numerical analysis is scientific computing. Here,
numerical methods are called upon to solve mathematical problems arising in the context of some
scientific pursuit.
To fix ideas, one may be faced with a physical problem such as predicting the motion of an
object under the influence of various forces. Such problems are generally broken down into the
following components:
1. Observed Phenomenon Something happens, we are intrigued, and want to examine the
behavior or recreate it.

2. Mathematical model or data We define a mathematical model for the phenomenon in


an attempt to describe it, or we take data about what is happening in order to model it.
(Mathematical Modeling course)
3. Discretization We ‘break up’ the problem into pieces to make it solvable. This is the step
where some numerical method is identified - we take a large, unsolvable, analytical problem
and convert it to a discrete, solvable problem.
4. Algorithm The specific set of steps, in order, needed to solve the discrete problem. Often,
these steps are defined in the form of pseudo-code or a specific programming language. It is important that we evaluate the chosen algorithm for accuracy, efficiency, and stability
(details to follow).
5. Implementation We implement the algorithm on a computer. Programming step: language,
structure, comparing architecture.
6. Results We use the results from our implementation to analyze accuracy, efficiency, and
stability of our method. We can also then relate these results to the observed phenomenon to
make predictions, or modifications to our model and/or its implementation.
This course focuses on Discretization, Algorithm, Implementation, and Results (items 3-6
above), which are at the heart of numerical analysis. This context makes it clear that numerical anal-
ysis is an applied mathematics, which broadly speaking deals with the application and development
of mathematical results in the context of real-world problems.

2.2 Assessing Algorithms


Accuracy, efficiency, and stability (robustness) are key to a good algorithm. (i.e. these are the things
to consider if someone asks you if your algorithm is ‘good’.)

Accuracy refers to your error: how close is your result to the actual value?
We often evaluate the accuracy of the algorithm and its relation to the method. A big, common question is: does it converge?

Efficiency is evaluated by speed (in implementation), simplicity (is it easy for someone else
to follow?), and clarity (does it make sense to someone else? Commentary in your code is key to
clarity). At a secondary level we may also ask: is it concise? (Are you using 4 lines for an operation
that can be done in 1?) Is it simple, fast, concise, and clear?

Stability is required for a method to converge. This portion of the broader requirement of well-posedness refers to the effect of small changes in the input data on the output of an algorithm.
If small changes to the input produce “small” changes in the output, the method is stable (or
well-conditioned). Conditional stability can occur if the stability of a method is dependent on the
initial data.

In application, you will find that accuracy and efficiency will be fighting against each other. In
any of these applications, it is important to decide what is most important in your problem. Often,
this may be determined by an employer or project leader. Accuracy takes precedence when, for example, you are sending a probe to Pluto; then efficiency means making your code run as quickly as possible without sacrificing accuracy.
In computer graphics it is often necessary to normalize a vector, meaning you multiply by the reciprocal of the square root of the sum of squares of its components. This task is completed millions of times a second in modern computer games, and exponentiation is an expensive operation. The fast inverse square root algorithm alleviates this expense using only bit shifts and multiplications: it first computes an approximation with an error of about 3.4% and then refines this using Newton’s Method to achieve an error of 0.17%. The computational savings per application are minimal, but doing this millions of times a second can provide a drastic improvement in performance.
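A sketch of this trick in Python; production versions operate directly on 32-bit registers in C, so here we simulate the bit reinterpretation with the standard struct module:

```python
import struct

def fast_inverse_sqrt(x):
    """Approximate 1/sqrt(x) with the famous bit-shift trick.

    The float's bits are reinterpreted as an integer, shifted, and
    subtracted from the magic constant 0x5f3759df; one Newton's Method
    step then refines the rough initial guess.
    """
    i = struct.unpack('>I', struct.pack('>f', x))[0]   # bits of x as an int
    i = 0x5f3759df - (i >> 1)                          # rough approximation
    y = struct.unpack('>f', struct.pack('>I', i))[0]   # bits back to a float
    return y * (1.5 - 0.5 * x * y * y)                 # one Newton refinement

print(fast_inverse_sqrt(4.0))  # close to 0.5
```

After the single refinement step the result is within the roughly 0.17% error quoted above, which is accurate enough for graphics work.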
Generally, we seek the best in all three: accuracy, efficiency, and stability. However, most of
the time one will have to be sacrificed over the others when working with larger problems. In the
next sections, we will dig into accuracy and efficiency. For now, consider the following example
emphasizing the importance of stability.
 Example 2.1 — Testing Stability. Show that the function f(x) = 1/(x + 1) is stable about x = 0.99, but g(x) = 1/(1 − x) is not.
Approximate f(x) and g(x) using a Taylor Series approximation about x = 0.99:

f(0.99) = 1/1.99 ≈ 0.5025
f′(x) = −1/(1 + x)²
f″(x) = 2/(1 + x)³

g(0.99) = 1/0.01 = 100
g′(x) = 1/(1 − x)²
g″(x) = 2/(1 − x)³

Taylor Series Approximation:

f(x) = 0.5025 − (1/(1.99)²)(x − 0.99) + (2/(1.99)³)((x − 0.99)²/2!) + ... (= for infinite terms)
g(x) = 100 + (1/(0.01)²)(x − 0.99) + (2/(0.01)³)((x − 0.99)²/2!) + ...
We can evaluate the stability by looking at the first few terms. Recall that, generally, the first few terms are the largest - so we can use them to estimate the behavior of the function. In this case the terms for g(x) grow because it is unstable, and we can see that in just the first two terms of the Taylor Series expansion:

f(x) ≈ 0.5025 − (x − 0.99)/3.9601
g(x) ≈ 100 + 10,000(x − 0.99)

Recall: to determine stability, we make a small change to the value of x, and evaluate the change
in the function value. We use the Taylor Series because it is a polynomial and easy to evaluate - we
can easily ‘see’ the stability analysis in the first few terms of the series.

Compare f(x) for two similar values of x: one at x = 0.99, and a second nearby at x = 0.98. For the function to be stable, the change in output should be similar to the change in input, 0.01. If the change in the output is much greater than the change in the input, the function is unstable.

For f(x): x̄₁ = 0.99, ȳ₁ = f(x̄₁) = 1/1.99 ≈ 0.5025

x̄₂ = 0.98, ȳ₂ = f(x̄₂) = 1/1.98 ≈ 0.5025 + 0.0025 = 0.5051

So, the difference in the outputs is about 0.0025 < 0.01, and we can see that for f(x) near x = 0.99, the function is stable, or well-conditioned.

If we complete the same analysis using g(x): x̄₁ = 0.99, ȳ₁ = g(x̄₁) = 100

x̄₂ = 0.98, ȳ₂ = g(x̄₂) = 1/0.02 = 50

Now, the difference in outputs is 50! This is clearly unstable because the output changed by orders of magnitude more than the input. A small change caused a huge change - this is unstable, or ill-conditioned.
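The comparison above can be checked numerically; a minimal sketch:

```python
def f(x):
    return 1 / (x + 1)   # stable near x = 0.99

def g(x):
    return 1 / (1 - x)   # unstable near x = 0.99

dx = 0.01                          # small change in the input
change_f = abs(f(0.98) - f(0.99))  # comparable to dx: well-conditioned
change_g = abs(g(0.98) - g(0.99))  # thousands of times dx: ill-conditioned

print(change_f)  # about 0.0025
print(change_g)  # about 50
```

The output change for f is smaller than the input change of 0.01, while for g it is thousands of times larger, matching the analysis above.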

2.3 Numerical Algorithms and Errors


Identifying errors in numerical analysis is key. Why? Errors are unavoidable in numerical analysis.
Merely by implementing a numerical method, we are inducing error into our system. It is ex-
tremely important to understand what error(s) exist in the problem, and how they affect our solution.

Sources of error:

1. Mathematical model: we must make assumptions about the system in order to form a
mathematical model. There can be errors made in those assumptions which will propagate
through the whole problem. However, those assumptions are often necessary simplifications
to the problem to make it solvable (trade-offs).
2. Data: whenever data measurements are taken, there is error. There is always a limit to their
accuracy, and there is also noise. However, these are inherent and unavoidable errors - we
can only make attempts to quantify them and manage them to minimize their effects.
3. Numerical Method: discretization creates error, but it is defined and manageable error
(usually). It will also relate to the convergence properties of the method.
4. Round-off error: this arises due to finite precision of a machine - computers do not store
numbers to infinite precision. This error is unavoidable, but can be approximated if you know
your system. More on this soon.
5. Mistakes: this is by far the largest source of error, and the most common. Can arise from
data collection, transcription, malfunction, etc.
We want to avoid error, and minimize it.

In order to do that, we need to understand the effects of our error.

2.3.1 Absolute and Relative Errors


Absolute error is the difference between exact and approximate values; it is also referred to as measured error.

Given u as exact, and v as approximation of u, we define the absolute error by |u − v|.

Relative error provides an indication of how good a value is relative to the size of the exact
value. Relative error is effectively a percentage of error, and related to the absolute error by

relative error = (absolute error)/(size of exact value) = |u − v| / |u|.

Relative error is more commonly used because it contains information about the problem.

 Example 2.2 If someone tells you, “I had only 0.1 error!" (absolute measurement), that might
sound great if the value of the actual was 100. However, if the actual was 0.2? That’s awful. The
relative error in these two cases is more informative: 0.1% vs. 50%. With a relative measurement,
it is clear that the first measurement was good, and the second was not - in fact, it likely has errors
in implementation to be so far off. 

However, we will not always want a relative measurement. The relative error can break down
when the value of your actual is small (round-off error occurs when dividing by small numbers). In
these cases, the absolute error is more meaningful.
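These two measures are trivial to compute; a small sketch, using the numbers from Example 2.2:

```python
def absolute_error(u, v):
    """u is the exact value, v the approximation."""
    return abs(u - v)

def relative_error(u, v):
    return abs(u - v) / abs(u)

# Same absolute error of 0.1, very different relative errors:
print(relative_error(100, 99.9))  # about 0.001, i.e. 0.1%: a good approximation
print(relative_error(0.2, 0.1))   # 0.5, i.e. 50%: a poor one
```

Note the division by |u|: this is exactly where relative error breaks down when the exact value is near zero.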

2.3.2 Common Notations


We will often talk about error in terms of characteristic parameters that arise naturally as a part of
discretization. Examples of such parameters include:
• the size of some object (such as an interval), denoted by h;
• the number of times some calculation is repeated, denoted by n;
• the number of points present in discretization, denoted by n.
An important part of the analysis of numerical methods is the derivation of results establishing
|error| < F(parameter),
i.e., the error is bounded above in terms of some function of a characteristic parameter which we
can control. If, for instance, the parameter is h, we would like F(h) → 0 as h → 0. In the case
of a parameter n, we would like F(n) → 0 as n → ∞. In such cases, we see |error| → 0 and we
say the method converges in the sense that these properties force the approximation obtained by a
numerical method to approach the exact value.
When assessing convergence in terms of a parameter, we often choose to emphasize only a
portion of the effect the function has on the parameter. Introducing: Big O notation. We use Big O
notation to define a convergence rate for our discretization in terms of the step size used (h), or the
number of points used (n). Clearly, as h decreases, or as n increases, the method should converge.
 Example 2.3 If F(h) ≈ c h^q, where c is a constant, we could write |error| ≤ c h^q. Therefore, as h → 0, the error → 0 ‘like’ h^q → 0. When this behavior is our focus, the error is written as O(h^q). (Note: this does require h to be small; in application 0 < h ≪ 1 usually.) 
The following example provides a practical setting in which the characteristic parameter h
appears and how the associated error behaves in terms of this parameter.
 Example 2.4 — Error analysis. Approximate the derivative of f (x) using The Taylor Series,
and determine the discretization error (what is the error term in this approximation?)

Recall your Taylor Series! Go practice it - NOW!!! (I can’t stress this enough.)

Recall that you can use your Taylor Series to approximate a function, so if we want to approx-
imate f (x − h), and we know values of the function and its derivatives at x, we can approximate

f(x − h) = f(x) − h f′(x) + (h²/2) f″(x) − (h³/6) f‴(x) + ... (2.1)
Sidenote: you learned Taylor Series as approximating f(x) centered about an x-value a through f(x) = f(a) + f′(a)(x − a) + (f″(a)/2!)(x − a)² + .... In these approximations, we are replacing x with x − h, and a with x.

The problem asks us to determine the derivative of f (x), which is inside of this Taylor Series.
So, we can manipulate it algebraically to see that

h f′(x) = f(x) − f(x − h) + (h²/2) f″(x) − (h³/6) f‴(x) + ... (2.2)
and

f′(x) = (f(x) − f(x − h))/h + (h/2) f″(x) − (h²/6) f‴(x) + ... (2.3)
If h is small, we can approximate the derivative through

f′(x) ≈ (f(x) − f(x − h))/h (2.4)
The discretization error is the error defined by this approximation. So, if we determine the
absolute error in this approximation - it is defined by the remaining terms in the series.

Absolute error = | exact − approximate | = |f′(x) − (f(x) − f(x − h))/h| = |(h/2) f″(x) − (h²/6) f‴(x) + ...|
Again, if h is small we know that h/2 ≫ h²/6, and as such we can define the order of our error in terms of the largest term in our discretization error. We define our discretization error to be proportional to (h/2)|f″(x)| = O(h) using Big O notation.
Our discretization error is defined by the method and our choice of the value h (will often be
called a step size later). 
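The O(h) behavior derived above is easy to observe: halving h should roughly halve the error of the backward difference (2.4). A sketch, using f(x) = eˣ (our choice for illustration, since the exact derivative is known):

```python
import math

def backward_difference(f, x, h):
    # f'(x) ≈ (f(x) - f(x - h)) / h, with discretization error O(h)
    return (f(x) - f(x - h)) / h

x, exact = 1.0, math.exp(1.0)
err1 = abs(backward_difference(math.exp, x, 1e-3) - exact)
err2 = abs(backward_difference(math.exp, x, 5e-4) - exact)
print(err1 / err2)  # roughly 2: halving h halves the error
```

An error ratio of about 2 when h is halved is the numerical signature of a first-order, O(h), method.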

We also use Big O notation to discuss computational expense. Computational expense is gen-
erally measured through the processing required by a computer to solve the problem. Specifically,
we like to refer to this as the number of operations needed to complete that process. So, we use
Big O notation to define how many operations will be required by the computer to complete the
algorithm with respect to n. This is particularly useful for large-scale problems that require a lot of
processing power (we will not do any this semester, but it is important to be aware of).

 Example 2.5 A common choice is to halve your step size (because it is easy to modify this way),
but that doubles the number of points. If the computational expense of your method is only O(n),
then you are only doubling the number of operations, and it will only take about twice as long.
However, if your method uses O(n3 ) operations, then doubling the number of points will require an
order of eight times as many operations. If the program took only 20 minutes and you halve the
step size - then it will likely take approximately 160 minutes (this is dependent on more variables,
but operation count can give you a rough estimate). You want to make sure that it is worth it before
requiring that kind of processing power. 
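To see why the exponent matters, one can simply count loop iterations. The sketch below contrasts a hypothetical O(n) task (summing n values) with an O(n³) one (naive n×n matrix multiplication), counting one ‘operation’ per innermost iteration:

```python
def sum_ops(n):
    # summing n numbers: one addition per element, O(n)
    return sum(1 for _ in range(n))

def matmul_ops(n):
    # naive n-by-n matrix multiply: n multiplications per entry, O(n^3)
    return sum(1 for i in range(n) for j in range(n) for k in range(n))

# Doubling n doubles the O(n) work but multiplies the O(n^3) work by 8:
print(sum_ops(40) // sum_ops(20))        # 2
print(matmul_ops(40) // matmul_ops(20))  # 8
```

The factor of 8 is exactly the "order of eight times as many operations" described above for halving the step size with an O(n³) method.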

Big Theta or Θ notation defines bounding of a function φ(h) above and below by another function ψ(h). This is useful for analyzing the behavior of an unknown function if it can be related to a known function:

e.g. φ(h) = Θ(ψ(h)) if cψ(h) ≤ φ(h) ≤ dψ(h) ∀h > 0 (small h), where c, d are constants.

Little o notation says that one function becomes negligible relative to another: f(n) = o(g(n)) if f(n)/g(n) → 0. This is a stronger statement than Big O, which only requires |f(n)| ≤ k|g(n)| for some fixed constant k.
3. Computational Details

3.1 Number Systems


To fully comprehend roundoff errors, we must discuss how numbers are stored on a computer, or a
finite precision machine.
The way we normally work with numbers is in base 10. So when we write a number like
6523.75 we are really computing the following sum

6523.75 = 6 × 103 + 5 × 102 + 2 × 101 + 3 × 100 + 7 × 10−1 + 5 × 10−2

The base in which we work is not critical. There are some who believe that we should work in
base 12 since it has more divisors and would make calculations of thirds and fourths easier. In this
system, we would need to create twelve symbols (two extra) so that our number system is

0 = 0, 1 = 1, 2 = 2, 3 = 3, 4 = 4, 5 = 5, 6 = 6, 7 = 7, 8 = 8, 9 = 9, 10 = χ, 11 = ξ .

Considering our previous number of 6523.75, we can rewrite this as

(6523.75)10 = 3 × 123 + 9 × 122 + 3 × 121 + 7 × 120 + 9 × 12−1 = (3937.9)12

Computers store numbers in base 2 so there are only two symbols used, either 0 or 1. So to
store 6523.75 on a computer, we would first need to convert this to a base 2 number

(6523.75)10 = 1 × 212 + 1 × 211 + 0 × 210 + 0 × 29 + 1 × 28 + 0 × 27 + 1 × 26


+ 1 × 25 + 1 × 24 + 1 × 23 + 0 × 22 + 1 × 21 + 1 × 20 + 1 × 2−1 + 1 × 2−2
= (1100101111011.11)2
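The conversion above can be automated. The helper below is our own illustration (not a standard routine): the integer part is handled by Python's built-in bin, and the fractional part by repeated multiplication by 2, reading off the digit before the point each time:

```python
def to_binary(x, frac_bits=8):
    """Convert a nonnegative number to a binary string (illustrative helper)."""
    ipart = int(x)
    fpart = x - ipart
    bits = bin(ipart)[2:] + '.'
    for _ in range(frac_bits):
        fpart *= 2          # shift the next binary digit before the point
        digit = int(fpart)
        bits += str(digit)
        fpart -= digit
    return bits

print(to_binary(6523.75, frac_bits=2))  # 1100101111011.11
```

For a fraction like 0.7 with a repeating expansion, the loop would simply cycle forever; frac_bits caps how many digits we keep, which is precisely the truncation a finite machine performs.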

In a computer, arithmetic is performed via switches which can only be on or off, represented by a 0 or a 1. The vocabulary used for this is a bit, which is a single 0 or 1. A byte is composed of 8 bits, and a word is a machine-dependent unit: the number of bits processed at once by a computer’s processor. So a 32-bit machine works with words of 32 bits (or 4 bytes).

 Example 3.1 Given 101101 in binary, we can convert it to a base 10 number through

1 × 25 + 0 × 24 + 1 × 23 + 1 × 22 + 0 × 21 + 1 × 20 = 32 + 0 + 8 + 4 + 0 + 1 = 45

and this would be composed of 6 bits. 
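The expansion in Example 3.1 is exactly what Python's built-in int(..., 2) computes; a hand-rolled version for comparison:

```python
def binary_to_decimal(bits):
    # Horner-style evaluation of the base-2 expansion:
    # each step multiplies the running value by 2 and adds the next digit.
    value = 0
    for b in bits:
        value = 2 * value + int(b)
    return value

print(binary_to_decimal('101101'))  # 45
print(int('101101', 2))             # 45, the built-in equivalent
```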

3.2 Fixed and Floating point numbers


A fixed point representation of a number corresponds to having a fixed number of digits before and
after the decimal point. Since these are fixed to a certain number of digits, we typically consider
this as working with integers. Another representation is floating point representation where the
decimal can float relative to the significant digits of the number. Our calculations on a computer
will be completed using the floating point representation.
A nonzero number x is stored on the computer in the form

x = ±1.b1 b2 · · · bN × 2 p

where 1.b1 b2 · · · bN is the normalized binary representation of our number, called the mantissa, p is called the exponent (also in binary representation) and N is the size of the mantissa. A number being normalized means that there is only one digit before the point and it is nonzero, which in binary forces it to be 1. Although a computer can also deal with non-normalized numbers, we will not consider such in this text.
The size of the mantissa and the exponent are machine dependent, as is the base being used in
the computer. In this text, we will only concern ourselves with base 2. Table 3.1 lists the sizes of the sign, exponent and mantissa for single, double, long double and quadruple precision in the IEEE 754 Floating Point Standard.

precision sign exponent mantissa total bits exponent bias


single 1 8 23 32 127
double 1 11 52 64 1023
long double 1 15 64 80 16,383
quadruple 1 15 112 128 16,383

Table 3.1: IEEE 754 Floating Point Standard precision storage and exponent bias

Using the IEEE 754 Floating Point Standard, in single precision the first bit determines the sign
of the number, the next 8 bits represent the exponent, and the last 23 bits represent the fraction. The
sign of a number is represented by a 0 for positive and a 1 for negative. Note that there is no sign bit
for the exponent. Therefore, we must make allowances for both positive and negative exponents by
use of an exponent bias. Consider single precision which allows for 8 bits in the exponent so that
we can store any exponent value between (00000000)2 = (0)10 through (11111111)2 = (255)10 .
However, these are all positive values so to store the negative values we “shift” this number line
so that about half of the number will be considered negative and the other half as positive and
zero lies near the center. If we shift this number line by (01111111)2 = (127)10 = 27 − 1 then
we can represent any integer exponent between −126 and 127. We do reserve (00000000)2 and
(11111111)2 for the special numbers of denormalized and overflow, respectively. A summary of
this information can be found in Table 3.2.
Exponent     Exponent      Numerical value
bitstring    (base 10)     represented
00000000     Denorm        ±(0.b1 b2 . . . b23)2 · 2^−126
00000001     −126          ±(1.b1 b2 . . . b23)2 · 2^−126
00000010     −125          ±(1.b1 b2 . . . b23)2 · 2^−125
00000011     −124          ±(1.b1 b2 . . . b23)2 · 2^−124
..           ..            ..
01111010     −5            ±(1.b1 b2 . . . b23)2 · 2^−5
01111011     −4            ±(1.b1 b2 . . . b23)2 · 2^−4
01111100     −3            ±(1.b1 b2 . . . b23)2 · 2^−3
01111101     −2            ±(1.b1 b2 . . . b23)2 · 2^−2
01111110     −1            ±(1.b1 b2 . . . b23)2 · 2^−1
01111111     0             ±(1.b1 b2 . . . b23)2 · 2^0
10000000     +1            ±(1.b1 b2 . . . b23)2 · 2^+1
10000001     +2            ±(1.b1 b2 . . . b23)2 · 2^+2
10000010     +3            ±(1.b1 b2 . . . b23)2 · 2^+3
10000011     +4            ±(1.b1 b2 . . . b23)2 · 2^+4
..           ..            ..
11111101     +126          ±(1.b1 b2 . . . b23)2 · 2^+126
11111110     +127          ±(1.b1 b2 . . . b23)2 · 2^+127
11111111     ± Inf or NaN  ± Inf if b1, b2, . . . , b23 = 0, otherwise NaN

Table 3.2: Exponent representation in IEEE 754 Floating Point Standard for single precision

Storing the number 6523.75 on the computer in single precision is accomplished by first converting it to binary and then into the floating point standard.


(6523.75)10 = (1100101111011.11)2
            = (1.10010111101111)2 × 2^(12)10
            = (1.10010111101111)2 × 2^(00001100)2

and the stored exponent field is the true exponent plus the bias:

(00001100)2 + (01111111)2 = (10001011)2

The number 6523.75 is then stored on the computer in single precision as


0|10001011|10010111101111000000000
Note the first digit of the mantissa is not stored: since the number is normalized, we know that digit
is a 1.
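The bit pattern above can be checked in Python (a sketch; the `struct` module gives access to the raw IEEE 754 single-precision encoding, and the helper name `float_to_bits` is our own):

```python
import struct

def float_to_bits(x):
    # Pack x as big-endian IEEE 754 single precision, then read back the raw bits
    [n] = struct.unpack('>I', struct.pack('>f', x))
    bits = format(n, '032b')
    return bits[0] + '|' + bits[1:9] + '|' + bits[9:]

print(float_to_bits(6523.75))  # 0|10001011|10010111101111000000000
```

Note that `struct.pack('>f', ...)` itself performs the round-to-nearest conversion, so this also shows the stored pattern for values that are not exactly representable.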
As with decimal format, there are many numbers which cannot be represented with only a
certain number of digits. For example, the number 1/3 in decimal is repeating and any
attempt to write it with finitely many digits will incur some error.
Example 3.2 The number 6.7 cannot be represented in binary without a repeating expansion.
Converting this to binary gives us
(6.7)10 = (110.10110011001100 . . .)2 , with the pattern 0110 repeating,
and in single precision this would be stored as
0|10000001|10101100110011001100110

where we can only store 23 digits in the mantissa.


The IEEE 754 Floating Point Standard provides a method for rounding called Rounding to
Nearest. If the 24th bit to the right of the binary point is 0, then we round down (truncate to 23
bits). If the 24th bit is 1, then we round up, unless all known bits to the right of that 1 are 0's (a
tie), in which case we round to even: 1 is added to the 23rd bit if and only if the 23rd bit is 1.
Thus for our number 6.7, if we apply the Rounding to Nearest rule (the 24th bit is 0), we store

0|10000001|10101100110011001100110

in the computer. 

 Example 3.3 To store the number 49.4 in single precision, we first convert to binary

(49.4)10 = (110001.011001100110 . . .)2 = (1.1000101100110011 . . .)2 × 2^5

with the pattern 0110 repeating. Writing out the mantissa,

(1.10001011001100110011001100110 . . .)2 × 2^5 ,

and to store this on the computer we must round up since the 24th bit is 1. Therefore, 49.4
is stored in single precision as

0|10000100|10001011001100110011010

which is rounded, so that 49.4 is not truly stored on the computer. 

How much of a rounding error did we incur when converting 49.4 to single precision? In order
to answer this question, we must note that in single precision the rounding happens at the 2^−23
place. Depending upon which precision and machine hardware you are working with, this value can
change; in double precision, it would be 2^−52 . This number is called machine epsilon, denoted
εmach , and is the distance between 1 and the next floating point number greater than 1. This does
not mean that εmach is the smallest positive number the machine can represent.
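The double-precision value εmach = 2^−52 can be found experimentally (a sketch; this is the standard halving loop, run here in Python, whose floats are IEEE 754 doubles):

```python
# Halve eps until 1 + eps/2 is no longer distinguishable from 1;
# the final eps is the gap between 1 and the next larger double
eps = 1.0
while 1.0 + eps/2 != 1.0:
    eps = eps/2
print(eps)  # 2**-52, about 2.22e-16
```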
In order to store 49.4 on the machine, we rounded up and chopped off the repeating portion, so
our resulting error is

[2^−23 − (0.100110011 . . .)2 × 2^−23 ] × 2^5 = [2^−23 − (10.0110011 . . .)2 × 2^−25 ] × 2^5
                                             = [2^−23 − 2.4 × 2^−25 ] × 2^5
                                             = [(4 − 2.4) × 2^−25 ] × 2^5
                                             = 1.6 × 2^−20

This is the absolute error between the value stored on the computer and the exact value of 49.4.
The IEEE 754 Floating Point Standard provides a model for computations such that the relative
rounding error is no more than one-half machine epsilon. In our case of 49.4 we have a relative
error of

1.6 × 2^−20 / 49.4 = (8/247) × 2^−20 = (1/247) × 2^−17 < 2^−24 = (1/2) εmach
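Both error values can be confirmed numerically (a sketch; round-tripping through `struct.pack('>f', ...)` yields exactly the single-precision value stored for 49.4):

```python
import struct

# The single-precision value actually stored for 49.4
stored = struct.unpack('>f', struct.pack('>f', 49.4))[0]
abs_err = stored - 49.4            # we rounded up, so the stored value is larger
rel_err = abs_err / 49.4

print(abs_err)            # about 1.6 * 2**-20 ~ 1.526e-06
print(rel_err < 2**-24)   # True: within half machine epsilon
```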

 Example 3.4 — Addition in floating point representation. Performing addition in floating
point representation requires each number to be converted to floating point and then aligning the
binary points so that the exponents are the same. The actual addition is performed within a register
of the processor and can be accomplished at a higher precision. For example, adding 6523.75 and
49.4 is computed as

(6523.75 + 49.4)10 = (1.10010111101111)2 × 2^12 + (1.1000101100110011 . . .)2 × 2^5
                   = (1.10010111101111)2 × 2^12 + (1.10001011001100110011010)2 × 2^5
                   = (1.10010111101111 + 0.00000011000101100110011)2 × 2^12
                   = (1.10011010110100100110011)2 × 2^12
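We can mimic this single-precision addition in Python (a sketch: the helper `f32` below is an assumed name; it rounds a double to its nearest single-precision value via `struct`):

```python
import struct

def f32(x):
    # Round a Python float (double) to the nearest IEEE 754 single-precision value
    return struct.unpack('>f', struct.pack('>f', x))[0]

a = f32(6523.75)     # exactly representable
b = f32(49.4)        # the rounded value 49.400001525878906
total = f32(a + b)   # round the double-precision sum back to single
print(total)         # 6573.14990234375, the binary sum derived in the example
```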

 Example 3.5 Find the decimal (base 10) number for the following single precision number:

0|10000010|01100110000010000000000

The first bit is a 0, so we know the number is positive. The exponent field is

1 × 2^7 + 0 × 2^6 + 0 × 2^5 + 0 × 2^4 + 0 × 2^3 + 0 × 2^2 + 1 × 2^1 + 0 × 2^0 = 130.

The true exponent is this value minus the machine bias, 130 − 127 = 3, and the
fraction is determined through negative powers of 2:

M = 0 × 2^−1 + 1 × 2^−2 + 1 × 2^−3 + 0 × 2^−4 + 0 × 2^−5 + 1 × 2^−6
  + 1 × 2^−7 + 0 × 2^−8 + 0 × 2^−9 + 0 × 2^−10 + 0 × 2^−11 + 0 × 2^−12
  + 1 × 2^−13 + 0 × 2^−14 + 0 × 2^−15 + 0 × 2^−16 + 0 × 2^−17 + 0 × 2^−18
  + 0 × 2^−19 + 0 × 2^−20 + 0 × 2^−21 + 0 × 2^−22 + 0 × 2^−23
  = 0.3985595703125

Note that this is a normalized number, so we know the digit to the left of the binary point is 1.
We determine the value actually stored by

(1 + M) × 2^3 = 11.1884765625

which has decimal representation of 11.188477. 

We find that we have 12 significant digits in the decimal form of the value actually stored, but
any decimal number between 11.18847608566284 and 11.18847703933716 has the
same single precision representation as 11.188477, indicating that we really only have 8 significant digits.
The point here is that machine precision limits which values can be distinguished. If we look at the next smallest number
in floating point, we have

0|10000010|01100110000001111111111

which has decimal representation 11.1884756. If we look at the next largest number in floating
point, we have

0|10000010|01100110000010000000001

which has decimal representation 11.1884775.
There are infinitely many real numbers in between these! We must realize that a computer can only
store a finite number of digits and this format will produce errors. This is important to realize
because machine precision is somewhat predictable, but it is not controllable - so you need to be
aware of its existence. However, you cannot always account for it exactly. Generally we bound
round off errors using information about the machine precision of our system.
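The neighboring machine numbers can be generated directly (a sketch; incrementing the 32-bit pattern by one moves to the adjacent single-precision value, and `f32_neighbors` is our own helper name):

```python
import struct

def f32_neighbors(x):
    # Reinterpret the single-precision bit pattern as an integer and step one ulp
    [n] = struct.unpack('>I', struct.pack('>f', x))
    lo = struct.unpack('>f', struct.pack('>I', n - 1))[0]
    hi = struct.unpack('>f', struct.pack('>I', n + 1))[0]
    return lo, hi

lo, hi = f32_neighbors(11.1884765625)
print(lo, hi)   # about 11.1884756 and 11.1884775
```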

 Example 3.6 — Seeing the Computational Error. We can illustrate the error in Python using
the following simple code, which stores the value 6.4, subtracts 6.0, and then subtracts 0.4.

x = 6.4
y = x - 6.0
z = y - 0.4
print("\nx = %e\ny = %e\nz = %e\n" % (x, y, z))

print("rel_err = %e" % (z/x))

which outputs

x = 6.400000e+00
y = 4.000000e-01
z = 3.330669e-16

rel_err = 5.204170e-17

This example is performed in IEEE 754 double precision, for which (1/2)εmach = 2^−53 ≈ 1.1102 × 10^−16 ,
so the relative error is within the expected bound. 

3.3 Loss of Significance


An arithmetic operation should try to preserve the same number of significant digits as its
operands; however, this may not always be possible. As an example, consider subtracting the two
floating point numbers

x = 0.13459873 and y = 0.13459612,

which both have eight significant digits. The difference

x − y = 0.00000261

leaves only three significant digits. This occurrence is called a loss of significance. In this case it is
unavoidable, but in general you should look for other ways of performing the operation so that this
is avoided. The following examples will help to illustrate this further.
 Example 3.7 When computing the roots of a quadratic equation, say ax^2 + bx + c = 0, we can
use the closed formulas

x1 = (−b + √(b^2 − 4ac)) / (2a)   and   x2 = (−b − √(b^2 − 4ac)) / (2a)

but might incur a loss of significance for one of these roots. For example, let us only use two
significant digits and compute the roots of

x^2 + 200x + 1 = 0.

Then

x1 = (−200 + √39996) / 2   and   x2 = (−200 − √39996) / 2,

resulting in x1 = 0 and x2 = −200. Note that with the use of only two significant digits −200 ≈
−√39996, and the difference of these two nearly equal numbers produces the loss of significance. Interestingly
enough, there is a way to avoid the loss of significance in this situation. Multiplying by the
conjugate, we can compute x1 via

x1 = (−200 + √39996) / 2 · (−200 − √39996) / (−200 − √39996)
   = (200^2 − 39996) / (2(−200 − √39996))
   = 4 / (2(−200 − √39996))
   ≈ −1/200 = −5.0 × 10^−3 .
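The same cancellation appears in double precision once b^2 dominates 4ac. The sketch below uses hypothetical coefficients a = 1, b = 10^8, c = 1 (not from the text), whose small root is approximately −10^−8:

```python
import math

a, b, c = 1.0, 1.0e8, 1.0
disc = math.sqrt(b*b - 4.0*a*c)

x1_naive = (-b + disc) / (2.0*a)   # subtracts two nearly equal numbers
x1_stable = (2.0*c) / (-b - disc)  # conjugate form avoids the cancellation

print(x1_naive, x1_stable)  # the naive root loses most of its significant digits
```

The stable form rewrites x1 = 2c / (−b − √(b^2 − 4ac)), in which the two large quantities are added rather than subtracted.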

 Example 3.8 Another situation in which a loss of significance can result is when we add numbers
which are relatively far apart from one another. For example, the finite sum

∑_{n=1}^{10^8} 1/n^2

can provide the wrong answer when one is not careful. The following code performs this calculation
in two ways: one where n is increasing and the other where n is decreasing.

mySum_1 = 0.0
for i in range(1, 100000001):
    n = i*1.0
    n = n*n
    n = 1.0/n
    mySum_1 = mySum_1 + n
print(mySum_1)

mySum_2 = 0.0
for i in range(100000000, 0, -1):
    n = i*1.0
    n = n*n
    n = 1.0/n
    mySum_2 = mySum_2 + n
print(mySum_2)

The output is

1.64493405783
1.64493405685

but which one is correct?

R When computing this sum with an increasing n, we are adding the largest values first, meaning
that the smaller values will be dwarfed. This dwarfing causes the loss of significance. Note
that when n = 1 we have 1/n^2 = 1, while when n = 100000000 we have 1/n^2 = 1.0 × 10^−16 ,
which is at the level of εmach relative to the running sum. Summing with decreasing n
(smallest terms first) therefore gives the more accurate answer.
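As a sanity check (a sketch, not part of the text's experiment; N is reduced to 10^6 so it runs quickly), Python's math.fsum computes a compensated, correctly rounded sum to compare both orders against:

```python
import math

N = 10**6
terms = [1.0/(n*n) for n in range(1, N + 1)]

forward = 0.0
for t in terms:               # largest terms first
    forward += t
backward = 0.0
for t in reversed(terms):     # smallest terms first
    backward += t
reference = math.fsum(terms)  # compensated summation, used as the reference

print(forward, backward, reference)
```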

Whenever we use a computer to perform our calculations, we are most often not using exact
arithmetic and must be wary of pitfalls that can crop up from the finite precision of the
machine. The floating point standards allow us to control these errors, but we should also try to
avoid introducing errors whenever possible. However, there are situations in which we may find
ourselves comfortable with the error in exchange for efficiency.
4. Root-Finding Techniques

Many numerical methods of interest to the numerical analyst and practitioner are based on deter-
mining the zeros of a given function. We begin the journey of understanding such methods by
focusing on techniques for solving nonlinear equations in one variable. Although this may seem a
modest undertaking, the methods here are important and their analysis provides a foundation for
understanding many of the points that one must bear in mind when implementing a method:

Accuracy: are we getting results that make sense? Check your result.
Efficiency: a small number of function evaluations (these can be very computationally expensive).
Stability/Robustness: fails rarely, if at all, and announces failure when it happens (e.g., if you enter an
a and a b that both give positive values for f (x), it should say so). Predict possible errors
whenever possible, and test for them to define the error message.
To optimize your method you also want to
• Minimize requirements for applying the method
• Minimize smoothness requirements for applying the method
• Use a method that generalizes well to many functions and unknowns
Generally, only some of these will be satisfied. The important distinction is to recognize what is
needed for the problem you are solving.
Mathematically, our goal is to address the problem of determining x such that f (x) = 0 for a
given function f (x). The techniques will build upon some mathematical result to drive the ultimate
form of the method.

4.1 Bisection Method


The fundamental result driving the bisection method is the Intermediate Value Theorem, which
holds that when f (x) is a continuous function which intersects the x-axis, there is a point at which
the function vanishes:

If f (x) ∈ C[a, b], and f (a) < 0 and f (b) > 0, then ∃c ∈ (a, b) such that f (c) = 0.

This idea leads us to the Bisection method. This is the method you would inherently try if
you had no other way to determine the intercept. If we have a function f that is continuous on the
interval [a, b] where the function is negative at one endpoint and positive at the other, we know
there is a point in the interval where the function is zero. We start with the endpoints, confirm that
one is negative and the other is positive. Then, evaluate the midpoint.

If the midpoint is negative, then it replaces the endpoint which was negative. If the midpoint
is positive, then it replaces the endpoint which was positive. We repeat this process until we are
satisfied with the result meaning we are within a desired tolerance.
We have to define an error tolerance; how close to zero does the value need to be for us to stop
implementing this method? We refer to this as our stopping criteria. When the error is less than our
stopping criteria (error tolerance), then we can stop.

 Example 4.1 Given the graph of f (x) = x5 − 2x4 + 2x3 − 2x2 + x − 1 on [0, 2], there exists an
x-intercept. Apply the Bisection method to find it.

We plot the function for reference.


Evaluating at the end points we have

f (0) = −1 < 0
f (2) = 32 − 32 + 16 − 8 + 2 − 1 = 9 > 0

so that the conditions are satisfied to apply the Bisection method.

The midpoint is given by (a + b)/2 = (0 + 2)/2 = 1.

f (1) = 1 − 2 + 2 − 2 + 1 − 1 = −1 < 0

So, x = 1 will replace x = 0 as our left endpoint. Our new interval is now [1, 2]


Since f (1) < 0, and f (2) > 0, we repeat the process on our new interval.

The midpoint is given by (1 + 2)/2 = 1.5.

f (1.5) = (1.5)5 − 2(1.5)4 + 2(1.5)3 − 2(1.5)2 + 1.5 − 1 = 0.21875 > 0

So, x = 1.5 will replace x = 2 as our right endpoint. Our new interval is now [1, 1.5].


Since f (1) < 0, and f (1.5) > 0, we repeat the process again, on our new interval.

The midpoint is given by (1 + 1.5)/2 = 1.25.

f (1.25) = (1.25)5 − 2(1.25)4 + 2(1.25)3 − 2(1.25)2 + 1.25 − 1 = −0.799805 < 0.

So, x = 1.25 will replace x = 1 as our new left endpoint. Our new interval is now [1.25, 1.5].
Currently our best approximation of the root is given by the midpoint x = (1.25 + 1.5)/2 = 1.375.
Note that f (1.375) ≈ −0.4411.
We would continue this process until the value we receive for f (mid point) is ‘close enough’ to
zero. This is typically defined by the problem, or in application - by you.

For these types of problems, −0.4411 is still not close enough to zero to warrant stopping. You
probably want an error of 0.01 or less, especially if you plan to use this number for something
else.


Algorithm 4.1.1 — The Bisection Method.

1. Input f , a, and b, such that f (a) and f (b) have opposite signs.
2. Determine the midpoint x = (a + b)/2.
3. Evaluate f (x).
4. Check to see if f (x) is within the error tolerance.
5. If not, determine whether f (x) > 0 or f (x) < 0.
6. Define the new interval (replace a or b with x).
7. Repeat until the error tolerance is met, then return the value of x.

This is an iterative method, which means we repeat the same process until the specified stopping
criteria are met. Stopping criteria here were defined in terms of the error tolerance, but that is not
always possible because we don't always know what 'actual' is.
Stopping criteria can be defined by:
• A fixed number of iterations allowed
• The difference between two approximations, current estimate and last estimate: generally, |xn − xn−1 | < tol
• Error tolerance: | f (x)| < tol
For these methods, we can also say: if f (x) == 0, we can stop.
Note: Since we are seeking a function value near zero ( f (x) = 0), we will use absolute error
rather than relative error.

4.1.1 Evaluating the Bisection method


The Bisection method is not efficient, but it does work consistently.

Limitations of Bisection: Due to the requirements to apply the method, we can only locate odd
roots. The function must cross the x-axis in order to use the Bisection method.


Note: This graph has an odd root at x = −2, and an even root at x = 1. Bisection cannot be
used to find the root at x = 1 because it does not cross the x-axis.


Convergence: Since the error is decreased roughly by half at each iteration,

E = O((b − a)/2^n),

where b − a is the width of the initial interval, and n is the number of iterations.
If En = (b − a)/2^n , using En = tol as the error tolerance, solve for n:

2^n = (b − a)/En
n ln 2 = ln(b − a) − ln(En )
n = (ln(b − a) − ln(En )) / ln 2

Note that this provides a very slow convergence.
 Example 4.2 Analyze the convergence of the Bisection method for f (x) = x^5 − 2x^4 + 2x^3 −
2x^2 + x − 1 on [0, 2] (our previous example), with En = (2 − 0)/2^n and a tolerance of tol = 0.01.
We desire our absolute error to be less than our tolerance, so that

0.01 > 2/2^n
2^n > 200
n ln(2) > ln(200)
n > ln(200)/ln(2)
n > 7.64386

It would take 8 iterations to ensure accuracy within 0.01.
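This iteration count can be computed directly (a sketch of the bound just derived):

```python
import math

# n > (ln(b - a) - ln(tol)) / ln 2 iterations guarantee the bisection error < tol
a, b, tol = 0.0, 2.0, 0.01
n = math.ceil((math.log(b - a) - math.log(tol)) / math.log(2))
print(n)  # 8
```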

iteration    x           f (x)       | f (x)| (absolute error)    change in x
1            1           −1          1
2            1.5         0.21875     0.21875                      0.5
3            1.25        −0.79981    0.79981                      0.25
4            1.375       −0.44107    0.44107                      0.125
5            1.4375      −0.15629    0.15629                      0.0625
6            1.46875     0.01891     0.01891                      0.03125
7            1.453125    −0.07163    0.07163                      0.015625
8            1.4609375   −0.02711    0.02711                      0.0078125

Note: The value of f (x) is not within our error tolerance, but the value of x must be within
tolerance of the actual root. At the 8th iteration, our approximation is the midpoint
x = 1.4609375, with f (x) ≈ −0.02711. So it is important to know that this approximation does
not ensure that f (x) is within tolerance, only that x is within tolerance.
The following implementation helped to produce the table above. Note that this code is not
robust and may not always produce the desired results. For instance, there is no checking for
whether we have exceeded a maximum number of iterations.
# Bisection Method: computes approximate solution of f(x)=0
# Input: function f; a,b such that f(a)*f(b)<0,
#        and tolerance tol
# Output: Approximate solution x such that f(x)=0

def sign(x):
    if x < 0:
        return -1
    elif x > 0:
        return 1
    else:
        return 0

def mybisect(f, a, b, tol):
    fa = f(a)
    fb = f(b)
    # note that we do not compute f(a)*f(b) since that could cause an overflow
    if sign(fa)*sign(fb) >= 0:
        print('f(a)f(b)<0 not satisfied!')
        quit()
    while (b - a)/2.0 > tol:  # this is half the width of the interval
        c = (a + b)/2.0
        fc = f(c)             # this is our only function evaluation
        if fc == 0:           # c is a solution, done
            return c
        if sign(fc)*sign(fa) < 0:  # a and c make the new interval
            b = c
            fb = fc
        else:                      # c and b make the new interval
            a = c
            fa = fc
    return (a + b)/2.0  # the new midpoint is our best estimate

# bisect_test.py
from mybisection import mybisect

def f(x):
    return x**5 - 2.0*x**4 + 2.0*x**3 - 2.0*x**2 + x - 1.0

tol = 1.0e-2
x = mybisect(f, 0, 2, tol)
print(x)
print(f(x))

In our testing script, the code defines our function f (x) and then makes a call to the bisection
function with the initial interval endpoints and stopping tolerance. The actual work is completed
within mybisection.py. Note that the eighth iteration is not performed within the while-loop but
is calculated on the return from the function. Also note that our stopping criterion is
half the width of the interval, which ensures that we are within our tolerance of the actual root. It is
worth noting that within the while-loop, we only perform one function evaluation, namely at the
midpoint of the current interval.
We could define different stopping criteria. Depending on the problem, it may be more important
that f (x) is close to zero, or it might be more important to have a close approximation for x. In our
example, we would need to perform one more iteration in order for f (x) to be within the tolerance
of 0.01.
Real-world concept: if you’re trying to locate what city a projectile was fired from, you care
more about the accuracy of x. However, if you are firing a projectile and you want it to release a
capsule at a specified height - you need f (x) to be as accurate as possible. 

Accuracy: The Bisection method can achieve any desired accuracy, but it might take many iterations
in order to achieve that accuracy.


Efficiency: The Bisection method is not efficient in terms of function evaluations, nor in terms
of the speed of convergence. However, it is simple to implement and the operations are easy to
follow (relatively speaking).
Stability: Very stable. As long as the requirements to apply the method are satisfied, it will
converge. Changing the starting interval will affect the number of iterations required for the method
to converge, but it will still converge to the root.
Problem cases: multiple roots can cause it to converge to the "wrong" root if the interval does
not have a unique root. However, if there is only one root in the interval [a, b], Bisection will find it
- eventually. Even roots cannot be located at all with the Bisection method.

4.2 Fixed Point Iteration


The Bisection Method discussed in the previous section is an iterative process for solving f (x) = 0.
In this section, we will link the goal of solving equations to another iterative process: fixed-point
iteration. Fixed-point iterations are simple enough to define: given an initial x0 and a function g(x),
set xi = g(xi−1 ) for i = 1, 2, 3, . . . . Surprisingly, this process can yield a convergent sequence of
numbers.
We first motivate with an example: consider g(x) = (1/3)x^2 − 1 with x0 = 2.0. Then x1 = g(2.0) =
1/3 ≈ 0.333, and after several iterations, we see the iterative process is converging to −0.791287847 . . . .
Observe, however, that given x0 = 10, the values in the sequence tend to infinity.
A fixed point of a function g(x) occurs when g(a) = a. We can also view this as any
point where the graph of g(x) intersects the line y = x.
So, in order to solve these problems, we convert the root-finding problem f (x) = 0 to a fixed
point problem g(x) = x. We will construct the function g(x) to be consistent with the original
problem, and solve iteratively for the fixed point.
So, if f (x) = 0, naturally the simplest form is to set g(x) = f (x) + x. This is not generally
efficient, but it is simple.
Additionally, we could use g(x) = x + 2 f (x), or g(x) = x − f (x)/ f ′(x) if f ′(x) ≠ 0, etc. There are
infinitely many possibilities. So, how do we choose?
These possibilities build a set of possible fixed point methods. This is why it is generally referred
to as a family of methods. They are all defined by the same framework, but perform differently.
Key questions to consider when defining a method:
• Existence: Is there a fixed point in [a, b]?
If g(x) ∈ C[a, b] and g(x) ∈ [a, b] for all x ∈ [a, b], then by the Intermediate Value Theorem
g(x) has a fixed point in [a, b].
• Uniqueness: Is there only one fixed point in [a, b]?
If g(x) ∈ C[a, b], g′(x) exists in (a, b), and ∃ a positive constant ρ < 1 with |g′(x)| ≤
ρ ∀x ∈ (a, b), then the fixed point is unique. (The derivative is bounded by a small number;
proof later in the notes.)
• Stability and Convergence: will the iteration converge? If it doesn't, what will that mean?
• Efficiency: How fast will it converge?

4.2.1 Fixed Point Iteration


Given f (x) = x^5 − 2x^4 + 2x^3 − 2x^2 + x − 1, find the root using Fixed Point Iteration.

Convert the root problem to a fixed point problem:

g(x) = x − f (x) = x − x^5 + 2x^4 − 2x^3 + 2x^2 − x + 1 = −x^5 + 2x^4 − 2x^3 + 2x^2 + 1

Pick a starting point: x = 0 is 'easy'.

x = 0: g(0) = 1. Set x = 1.
x = 1: g(1) = 2. Set x = 2.
x = 2: g(2) = −7. We have a problem: the iteration is clearly not converging. Why?

Check the function graph:


It's too steep near the fixed point. Note: The fixed point occurs where g(x) (cyan) crosses
the graph of y = x (blue).

Without the graph, we can determine this by checking the derivative: g′(x) = −5x^4 + 8x^3 −
6x^2 + 4x.

At x = 0, g′(0) = 0 - this may not raise a red flag because 0 ≤ ρ < 1, but since it is zero, look
at the next value.

At x = 1, g′(1) = 1, and since 1 is not less than any ρ < 1, the iteration will diverge from here.

Based on this, and the visual of the graph - we can see it is too steep to converge.
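A few lines of Python reproduce this divergence (a sketch using the g(x) defined above):

```python
def g(x):
    # g(x) = x - f(x) for f(x) = x^5 - 2x^4 + 2x^3 - 2x^2 + x - 1
    return -x**5 + 2*x**4 - 2*x**3 + 2*x**2 + 1

x = 0.0
for _ in range(3):
    x = g(x)
    print(x)   # 1.0, then 2.0, then -7.0: the iterates are running away
```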

Let’s try a simpler example:

Given f (x) = x3 + 4x2 − 10, find the root using Fixed Point Iteration.

First, convert the root-finding problem to a fixed point problem: use g(x) = x − f (x) for
simplicity.

g(x) = x − x^3 − 4x^2 + 10


Evaluate g′(x) = 1 − 3x^2 − 8x, and look at the graph.

The slope is still too steep... what do we do?

Option: Solve f (x) = 0 for x, and check to ensure the derivative will be small enough to
converge.

x^3 + 4x^2 − 10 = 0
x^3 + 4x^2 = 10
x^2 (x + 4) = 10
x^2 = 10/(x + 4)
x = √(10/(x + 4))

Plot the new g(x) = √(10/(x + 4)).

The square root ensures that the derivative will be small.



Now, if we start at x = 1: g(1) = √2 ≈ 1.4142.

x = 1.4142: g(1.4142) = 1.359.

x = 1.359: g(1.359) = 1.366.

Close agreement! This seems to be converging to a fixed point!

See the iterations on the graph with y = x (first iteration, then second iteration).

Fixed Point requires the function to be slowly varying, |g0 (x)| ≤ ρ < 1, in order to converge. In
this example, the function is slowly varying, and we can see that the values spiral in to the fixed
point.
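The spiral convergence can be reproduced in code (a sketch of the iteration for g(x) = √(10/(x + 4)) from x0 = 1):

```python
import math

def g(x):
    return math.sqrt(10.0/(x + 4.0))

x = 1.0
for _ in range(20):
    x = g(x)
print(x)  # about 1.36523, the positive root of x^3 + 4x^2 - 10
```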

Proofs of Convergence
Uniqueness: This is a proof that if there are two fixed points in an interval satisfying the require-
ment that g(x) is slowly varying, then those two points are actually the same point - so the fixed
point is unique. Proof by contradiction.

Suppose there exist two points p and q which are fixed points of g(x) in [a, b], where p ≠ q.

By the Mean Value Theorem, ∃ξ between p and q such that

(g(p) − g(q)) / (p − q) = g′(ξ ).

Since p and q are fixed points, g(p) = p and g(q) = q. So |p − q| = |g(p) − g(q)|, and thus

|g′(ξ )| = |g(p) − g(q)| / |p − q| = 1.

This is a contradiction of the requirement that g(x) is slowly varying, |g′(x)| ≤ ρ < 1.
Therefore p = q, implying that on a slowly varying interval of g(x), there will only exist one
fixed point.

Convergence: This is a proof to show that if the convergence requirements are met, the Fixed Point
method does, in fact, converge to the fixed point of the function.

If g(x) ∈ C[a, b], g(x) ∈ [a, b] ∀x ∈ [a, b], g′(x) exists on (a, b), and ∃ρ, 0 < ρ < 1, s.t.
|g′(x)| ≤ ρ ∀x ∈ (a, b), then ∀p0 ∈ [a, b] the sequence pn = g(pn−1 ), n ≥ 1, converges to the
unique fixed point p ∈ [a, b]. In other words, these are sufficient conditions for convergence, not
necessary conditions.

Assumptions:
1) ∃p s.t. g(p) = p,
2) g(x) ∈ [a, b] ∀x ∈ [a, b],
3) pn ∈ [a, b] for all n.

Given those, we can evaluate our absolute error: by the MVT,

|pn − p| = |g(pn−1 ) − g(p)| = |g′(ξ )||pn−1 − p|.

We can use this to bound the error at each iteration:

|g′(ξ )||pn−1 − p| ≤ ρ|pn−1 − p|.

This gives us the relation |pn − p| ≤ ρ|pn−1 − p|, n ≥ 1. Applying it again, |pn−1 − p| ≤
ρ|pn−2 − p|, which gives |pn − p| ≤ ρ^2 |pn−2 − p|. Continuing, we see that, in general,

|pn − p| ≤ ρ^n |p0 − p|.

Since we define 0 < ρ < 1, limn→∞ ρ^n = 0, which ensures that if our conditions are all met, this
method will converge!

The rate at which it converges is governed by ρ: limn→∞ |pn − p| ≤ limn→∞ ρ^n |p0 − p| = 0,
or pn → p as n → ∞.

So, we can say that the method converges at a rate O(ρ^n ). This is nice because it says that the
smaller your slope bound is, the faster the iteration will converge to the fixed point.

So, to prescribe g(x) as efficiently as possible, make ρ as small as possible (though it will
generally not be zero) for faster convergence.

 Example 4.3 Consider the function f (x) = x^3 − x − 1, and solve for the root to within 10^−2 (i.e.,
two digits of precision, so that we will have a tolerance of 0.5 × 10^−2 ) on the interval [1, 2]
with initial guess p0 = 1.
First, rewrite the function as a fixed point problem. Attempting the easiest choice first, take
g(x) = x + f (x) = x^3 − 1. In order to determine if this produces a convergent sequence, we need to
determine whether the derivative is less than one in magnitude. Note that we specifically need |g′(p)| < 1, but we
do not know p. Without knowing the fixed point, we will need to check the derivative for the entire
interval. Starting with the left endpoint, we see that

g′(x) = 3x^2 and at x = 1, g′(1) = 3 > 1,

and we note that g′(x) = 3x^2 is an increasing function over the interval [1, 2], so that |g′(x)| > 1 for
all x ∈ [1, 2]. This choice of g(x) will diverge, meaning that we need to choose a different g(x).
For our next attempt we will solve for x given the equation x^3 − x − 1 = 0. Performing a little
algebraic manipulation,

x^3 = x + 1
x^2 = 1 + 1/x
x = ±√(1 + 1/x),

resulting in g(x) = √(1 + 1/x), since we know our fixed point is positive. It is important to point
out that our interval of [1, 2] precludes zero, so this function is valid. Additionally, the interval
contains only positive values, so we run no risk of negative values within the radical.
As before, we need to check for convergence of the sequence and begin by considering the
derivative,

g′(x) = (1/2)(1 + 1/x)^(−1/2) · (−1/x^2) = −1 / (2x^2 √(1 + 1/x)).

On the interval [1, 2], |g′(x)| is a decreasing function with its largest value obtained at x = 1. So at
x = 1, |g′(1)| = 1/(2√2) ≈ 0.3536 < 1. Since this is the largest value for ρ on the interval [1, 2], we
can use this to bound ρ ≤ 0.3536.
Recall that we seek

|pn − p| ≤ ρ^n max{|1 − p0 |, |2 − p0 |} < 0.5 × 10^−2 for p0 ∈ [1, 2].

The largest possible value for the maximum is 1, because that is the size of our interval. So, if
we assume that value is 1, we define n through ρ^n < 0.5 × 10^−2 , which we can solve for n. Since
ρ < 1, ln ρ < 0, so

n > ln(0.5 × 10^−2 ) / ln ρ.

In this example ρ ≤ 0.3536, and the bound is largest when ρ is near its upper bound, so

n > ln(0.5 × 10^−2 ) / ln(0.3536) ≈ 5.096,

so it should take no more than 6 iterations to ensure two digits of precision.
Run it!

n    pn          g(pn )
1    1           1.414214
2    1.414214    1.306563
3    1.306563    1.328671
4    1.328671    1.323870
5    1.323870    1.324900

Graphically:


As you can see, we reach 10^−2 accuracy by the 4th iteration. By using the maximum value,
we have a conservative estimate for the number of iterations (the iteration usually converges faster than our
estimate, but our estimate ensures we reach tolerance).
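Running the iteration in code confirms the table (a sketch; six iterations of g(x) = √(1 + 1/x) from p0 = 1):

```python
import math

def g(x):
    return math.sqrt(1.0 + 1.0/x)

p = 1.0
for _ in range(6):
    p = g(p)
print(p)  # about 1.32468, within 10^-2 of the root of x^3 - x - 1
```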


Accuracy: Easy to check.

Efficiency: Determining g(x) is the most time-consuming step; once you have it, the
algorithm is efficient.

Stability: clearly defined in terms of the derivative of g(x).

Minimal requirements to apply the method? To apply it, yes - but it will not converge if g(x)
does not vary slowly.

Minimal smoothness requirements? To apply it, yes - for it to converge, no.

Generalizes well? No. Fixed point is not going to extend well, although the family of methods
is diverse in and of itself.

4.3 Newton’s Method (Newton-Raphson)


Perhaps the most popular method, and among the fastest 1D root-finding methods.

Conceptual motivation: Find p such that f (p) = 0 from an initial guess p0

If f (p0 ) > 0 and f 0 (p0 ) > 0, we want to move left to find a zero.

If f (p0 ) > 0 and f 0 (p0 ) < 0, we want to move right to find a zero.

Similarly, if f (p0 ) < 0 and f 0 (p0 ) > 0, we want to move right to find a zero.

And, if f (p0 ) < 0 and f 0 (p0 ) < 0, we want to move left to find a zero.

What we can conclude from this, is that the slope and the value at our initial point determine
where to go next.

Graphically: We can use the point and the slope to determine the tangent line, and then easily
find the x-intercept of the tangent line. Use that intercept as the next guess.

Using the earlier example: f (x) = x5 − 2x4 + 2x3 − 2x2 + x − 1 with an initial guess at x = 2.
f (2) = 9, f 0 (x) = 5x4 − 8x3 + 6x2 − 4x + 1, and f 0 (2) = 33. So, the tangent line at x = 2 is
y = 33(x − 2) + 9

(figure: f(x) on [1.2, 2] with its tangent line at x = 2)

The root of the tangent line is easily located by setting y = 0 and solving for x. 0 = 33(x − 2) +
9 → x ≈ 1.72727273.
Repeat with x = 1.72727273. f (1.72727273) ≈ 2.6392944, f 0 (1.72727273) ≈ 15.271088, and
the tangent line is y = 15.271088(x − 1.72727273) + 2.6392944.

(figure: f(x) on [1.2, 2] with its tangent line at x = 1.72727273)

Solving for the root of the tangent line yields the approximation to the root: 0 = 15.271088(x −
1.72727273) + 2.6392944 → x ≈ 1.554443. Given that f (1.554443) ≈ 0.632466, we see that we
are converging on the root of f (x).

Derivation:
Given:
1. f ∈ C²[a, b] (f, f′, and f″ continuous on [a, b])
2. ∃p ∈ [a, b] such that f(p) = 0
3. p* is an approximation to p, and |p − p*| is small
We can approximate f(x) through its Taylor Series centered about x = p*:

f(x) = f(p*) + f′(p*)(x − p*) + (f″(p*)/2!)(x − p*)² + ...

This can be truncated by introducing a point η(p*) ∈ (x, p*) such that evaluating f″(η(p*))
in the second-order term satisfies the equality, without the additional terms:

f(x) = f(p*) + f′(p*)(x − p*) + (f″(η(p*))/2!)(x − p*)²
If we then evaluate f(x) at x = p:

f(p) = 0 = f(p*) + f′(p*)(p − p*) + (f″(η(p*))/2!)(p − p*)²

Since we defined p* to be an approximation of p with |p − p*| small, we know that
|p − p*|² ≪ |p − p*|. Thus, we can neglect the last term, yielding an equation whose solution
defines an approximation to p:

0 = f(p*) + f′(p*)(p − p*)

Solve for p!

p = p* − f(p*)/f′(p*)

This yields the fixed point iteration pn = pn−1 − f(pn−1)/f′(pn−1) for Newton's Method.
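A minimal sketch of this iteration in Python, run on the earlier example f(x) = x⁵ − 2x⁴ + 2x³ − 2x² + x − 1 from p0 = 2; the stopping rule on |pn − pn−1| and the tolerance are illustrative choices, not prescribed by the text:

```python
def newton(f, df, p0, tol=1e-10, max_iter=50):
    """Newton's method: p_n = p_{n-1} - f(p_{n-1}) / f'(p_{n-1})."""
    p = p0
    for n in range(1, max_iter + 1):
        p_next = p - f(p) / df(p)
        if abs(p_next - p) < tol:
            return p_next, n
        p = p_next
    return p, max_iter

f  = lambda x: x**5 - 2*x**4 + 2*x**3 - 2*x**2 + x - 1
df = lambda x: 5*x**4 - 8*x**3 + 6*x**2 - 4*x + 1
root, n = newton(f, df, 2.0)
print(root, n)  # first iterates are 1.72727273, 1.554443, ... as above
```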
Pros: fast! It still converges at multiple roots, though more slowly there.

Cons: Can diverge, and you need to evaluate both the function and its derivative.

Problem cases:
1. If the initial point is too far from the root, it will diverge. (draw)

(figure: a nearly flat function on [2, 10]; a distant initial guess sends the tangent-line root far away)

2. It converges slower for even roots (but it does converge!)

(figure: a function with an even-multiplicity root, approached slowly by the iterates)

3. If there are multiple roots nearby, it can converge to the “wrong" one

(figure: a polynomial with several nearby roots on [−2, 4])

Convergence: Suppose f ∈ C²[a, b], p ∈ [a, b] with f(p) = 0, and f′(p) ≠ 0. Then, ∃δ > 0
such that Newton's Method generates a sequence {pn} converging to p for every p0 ∈ [p − δ, p + δ].

This states that Newton’s Method will converge if the initial guess is ‘close enough’ to p.

Accuracy: easy to check

Efficiency: Two function evaluations, more expensive computationally.

Stability: Dependent upon how close the initial guess is, and the derivative of f (x).

Minimal requirements to apply the method? Certainly not, we need derivative information!

Minimal smoothness requirements? No, this has far more than previous methods.

Generalizes well? Yes and no. It generalizes, but becomes complex very quickly.

Speed of convergence:
A method is linearly convergent if |pn+1 − p| ≤ ρ|pn − p| as n → ∞ for some ρ < 1.

A method is quadratically convergent if |pn+1 − p| ≤ M|pn − p|² as n → ∞ for some constant M.

A method is superlinearly convergent if |pn+1 − p| ≤ ρn|pn − p| with ρn → 0 as n → ∞.

Since we can take ρn = M|pn − p| → 0, a method that is quadratically convergent is also super-
linearly convergent.

More accurate methods typically require much more work for small changes in error → im-
practical.

Additionally, roundoff errors can overwhelm your results quickly with higher-order methods.

Generally, if lim_{n→∞} |pn+1 − p| / |pn − p|^α = λ, for α, λ > 0 ∈ R, then {pn} converges to p at a rate
of order α with error constant λ. For reference, a larger α results in faster convergence.

Ex: If pn = 1/2ⁿ, then pn → 0 as n → ∞.

Evaluate lim_{n→∞} (1/2^{n+1} − 0)/(1/2ⁿ − 0) = 1/2, so λ = 1/2 and α = 1, so the method is linearly convergent.

Whereas, if we evaluated lim_{n→∞} (1/2^{n+1} − 0)/(1/2ⁿ − 0)² = 2^{n−1}, which diverges to infinity as n → ∞. So,
we know it is not quadratically convergent.

Also, for α = 0.5 the limit is λ = 0; since we require λ > 0, that α is not the correct
speed of convergence, either.

How large must n be to ensure |pn − p| < 10^−2?

If p0 = 1 and lim_{n→∞} |pn+1 − p|/|pn − p| = 1/2 (since p is zero),
we can extrapolate that for large n, |pn+1| ∼ (1/2)|pn| ∼ (1/4)|pn−1| ∼ (1/2^{n+1})|p0|.

Since we seek |pn| < 10^−2, this implies (1/2ⁿ)p0 < 10^−2; since p0 = 1 we solve (1/2ⁿ) < 10^−2:

2ⁿ > 10²

n ln 2 > 2 ln 10

n > 2 ln 10 / ln 2 ≈ 6.64

n ≥ 7 to ensure 10−2 error
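The bound can be checked numerically; both the closed-form count and a direct loop over the errors give n = 7:

```python
import math

# Closed form: smallest integer n with n > 2 ln 10 / ln 2
n_bound = math.ceil(2 * math.log(10) / math.log(2))

# Direct check: smallest n with (1/2)^n * |p0| < 1e-2, taking p0 = 1
n = 0
while 0.5**n >= 1e-2:
    n += 1
print(n_bound, n)
```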

Derive a quadratically convergent fixed point method:


For fixed point to converge we need:

1. g ∈ C[a, b]
2. g(x) ∈ [a, b] ∀x ∈ [a, b]
3. g′(x) ∈ C[a, b]
4. ∃ constant 0 < ρ < 1 s.t. |g′(x)| ≤ ρ ∀x ∈ (a, b)
5. g′(p) ≠ 0
Then, ∀p0 ∈ [a, b] the sequence pn = g(pn−1) will converge linearly to a unique fixed point
p ∈ [a, b].

Recall: |pn − p| ≤ ρ|pn−1 − p| which implies that α = 1, linearly convergent.

We want quadratic convergence!

So,
1. Suppose p is a fixed point of g(x)
2. g′(p) = 0
3. g″(x) ∈ C(a, b), and is bounded on (a, b): |g″(x)| ≤ M
Then ∃δ > 0 s.t. ∀p0 ∈ [p − δ, p + δ], the sequence pn = g(pn−1) will converge at least quadrat-
ically.

From our Taylor Series expansion, we had the error term (f″(p*)/2!)(x − p*)².

That was for f(x), but the same is true of g(x). Given that the second derivative of g(x) is
bounded by M, we can use that to define our convergence through

|pn+1 − p| < (M/2)|pn − p|²  for large n.
Constructing the fixed point method:

We know f(x) = 0 at the root, thus for any differentiable function ϕ(x), we also know that
−ϕ(x)f(x) = 0.

We seek the function ϕ(x) s.t.

g(x) = x − ϕ(x)f(x) = x  (defining the fixed point method in general)

Based on our assumptions, we require that g′(p) = 0, and we know f(p) = 0.

Using the form of g(x) defined above, we can determine the derivative g′(x) using the product rule:

g′(x) = 1 − ϕ′(x)f(x) − ϕ(x)f′(x)

Evaluating it at x = p yields:

g′(p) = 1 − ϕ′(p)f(p) − ϕ(p)f′(p) = 0

Since f(p) = 0, that term disappears, leaving:

1 − ϕ(p)f′(p) = 0

1 = ϕ(p)f′(p)

Thus, ϕ(p) = 1/f′(p).

For this to hold, we must require that f′(p) ≠ 0.

Suppose/choose to set ϕ(x) = 1/f′(x) wherever f′(x) ≠ 0. Then,

g(x) = x − f(x)/f′(x), which is Newton's Method!
Since we derived it requiring that the method be quadratically convergent - it holds.

Do note that for an even root (rather than a simple root), it will only converge linearly.

4.3.1 Secant Method


Secant method is used when you want something similar to Newton’s method, but you cannot get
derivative information.

It is, essentially, Newton’s method implemented with an approximate derivative instead of the
actual derivative.

We use the slope between two values to approximate the derivative:

f′(pn−1) ≈ (f(pn−1) − f(pn−2)) / (pn−1 − pn−2)

Substituting this approximation into Newton's method yields:

pn = pn−1 − f(pn−1)(pn−1 − pn−2) / (f(pn−1) − f(pn−2))
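A sketch of the substitution in Python; the `secant` helper, its stopping rule, and the divide-by-zero guard are illustrative choices, not from the text:

```python
def secant(f, p0, p1, tol=1e-10, max_iter=50):
    """Newton's iteration with f' replaced by the slope through the
    last two iterates."""
    for n in range(2, max_iter + 1):
        denom = f(p1) - f(p0)
        if denom == 0:          # guard: the two iterates share an f-value
            return p1, n
        p2 = p1 - f(p1) * (p1 - p0) / denom
        if abs(p2 - p1) < tol:
            return p2, n
        p0, p1 = p1, p2
    return p1, max_iter

f = lambda x: x**5 - 2*x**4 + 2*x**3 - 2*x**2 + x - 1
root, n = secant(f, 1.0, 2.0)
print(root)
```

Starting from the same guesses x = 1 and x = 2, the first iterates are 1.1 and 1.18803, matching the worked example.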
How does this affect implementation?

Now we need two initial guesses, because we need two points in order to approximate the
derivative.

Upside: we don’t need the derivative function anymore!

Secant Method works by determining a secant line with the initial two guesses, and using the
root of that secant line as the next guess. (Draw it out!)

Using the same example, f(x) = x⁵ − 2x⁴ + 2x³ − 2x² + x − 1 with the two initial guesses at
x = 1 and x = 2. f(1) = −1, f(2) = 9, and the secant line passes through both points, so it has slope
(9 − (−1))/(2 − 1) = 10, forming the secant line y = 10(x − 1) − 1.

(figure: f(x) on [1.2, 2] with the secant line through (1, −1) and (2, 9))

The root of the secant line is then used to build the next: 0 = 10x − 11 → x = 1.1, f(1.1) =
−0.97569. The new secant line through (1.1, −0.97569) and (2, 9) has slope
(9 − (−0.97569))/(2 − 1.1) = 11.0841, yielding the secant line y = 11.0841(x − 2) + 9.

(figure: f(x) on [1.2, 2] with the second secant line)

The approximation to the root is then 0 = 11.0841x − 13.1682 → x ≈ 1.18803, with f(1.18803) ≈
−0.898718: approaching the root, but more slowly than Newton's method.

Secant is slower than Newton’s method (usually), it typically converges superlinearly.

Accuracy: easy to check, again.

Efficiency: Two function evaluations needed initially, but only one after that. Relatively efficient
(better than Newton’s to implement, but converges slower than Newton’s - tradeoff)

Stability: Similar to Newton’s

Minimal requirements to apply the method? Similar to Bisection, but not as minimal as fixed
point.

Minimal smoothness requirements? They aren’t specified, but the implication being that they
have requirements similar to Newton’s.

Generalizes well? Yes and no. It generalizes, but becomes complex very quickly (Broyden’s
method, etc).

Example: Comparison of methods


Use Bisection, Newton’s, and Secant to locate the first positive root of cos x.

Bisection: Use interval [1, 3]

Newton’s: Use initial guess p0 = 1

Secant: Use initial guesses p0 = 0.8 and p1 = 1

n   Bisect        Aerror       Newton's   Aerror       Secant         Aerror
0   2             0.4292       1          0.5708       0.8            0.7708
1   1.5           0.0708       1.642      0.071296     1              0.5708
2   1.75          0.1792       1.57068    1.21×10^−4   1.690904       0.1201
3   1.625         0.0542       1.570796   5.91×10^−13  1.565498       5.3×10^−3
4   1.5625        0.0083       1.570796   10^−16       1.571098       3.02×10^−4
5   1.59375       0.02295      1.570796   10^−16       1.570779       1.7×10^−5
6   1.578125      0.00733      1.570796   10^−16       1.570797       9.6×10^−7
7   1.5703125     4.838×10^−4  1.570796   10^−16       1.5707963      5.43×10^−8
8   1.57421875    3.422×10^−3  1.570796   10^−16       1.57079633     3.06×10^−9
9   1.572265625   1.469×10^−3  1.570796   10^−16       1.570796327    1.73×10^−10
10  1.5712890625  4.927×10^−4  1.570796   10^−16       1.5707963268   9.76×10^−12

It is quite clear that Newton’s method is the fastest by far.


Evaluating the rate of convergence of these methods: plot the error on a log scale. We will get
into more detail on this later, but for now, notice the general shape of these curves. Bisection
is relatively flat, the slowest to converge. Secant is significantly improved, with a steeper slope.
Newton's completely outdoes the others, converging in 4 iterations.
Note: all of these methods are local approximations. That means they are only good if you
are already close to the root.
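The comparison table can be regenerated with a short script; the update rules follow the three methods as described above, and the early `break` in the secant loop is an added guard against dividing by zero once the iterates agree to machine precision:

```python
import math

r = math.pi / 2                      # first positive root of cos x
f, df = math.cos, lambda x: -math.sin(x)

# Bisection on [1, 3]
a, b, bis = 1.0, 3.0, []
for _ in range(10):
    m = (a + b) / 2
    bis.append(abs(r - m))
    if f(a) * f(m) < 0:
        b = m
    else:
        a = m

# Newton's from p0 = 1
p, newt = 1.0, []
for _ in range(10):
    p = p - f(p) / df(p)
    newt.append(abs(r - p))

# Secant from p0 = 0.8, p1 = 1
p0, p1, sec = 0.8, 1.0, []
for _ in range(10):
    if f(p1) == f(p0):               # guard once the iterates coincide
        break
    p0, p1 = p1, p1 - f(p1) * (p1 - p0) / (f(p1) - f(p0))
    sec.append(abs(r - p1))

for i in range(5):
    print(f"{i + 1:2d}  {bis[i]:.3e}  {newt[i]:.3e}  {sec[i]:.3e}")
```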

4.3.2 Zeros of Polynomials


Given a polynomial P(x) of degree n ≥ 1 with real or complex coefficients, P(x) = 0 has at least
one (possibly complex) root.

Recall that Secant uses two initial points to construct a secant line.

Müller’s Method uses 3 initial points to construct a quadratic approximation to find the intercept.

Input p0 , p1 , and p2 . Use f (p0 ), f (p1 ), and f (p2 ) to construct the system and solve for a, b,
and c.
(figure: absolute error |r − xn| versus iteration number on a log scale for the bisection, secant, and Newton's methods)

Assuming P(x) is of the form P(x) = a(x − p2 )2 + b(x − p2 ) + c

P(p0 ) = a(p0 − p2 )2 + b(p0 − p2 ) + c = f (p0 )

P(p1 ) = a(p1 − p2 )2 + b(p1 − p2 ) + c = f (p1 )

P(p2 ) = a(0) + b(0) + c = f (p2 )

 Example 4.4 Using the same function f (x) = x5 − 2x4 + 2x3 − 2x2 + x − 1, and the three points
x = 1.2, 1.6, and 2. Use Müller’s method to approximate the root.

(figure: f(x) on [1.2, 2] with the three starting points x = 1.2, 1.6, 2)

Given p0 = 1.2, p1 = 1.6, and p2 = 2. f (1.2) = −0.88288, f (1.6) = 1.05056, and f (2) = 9,
determine the quadratic which passes through all three points.

P(1.2) = a(1.2 − 2)² + b(1.2 − 2) + c = −0.88288

P(1.6) = a(1.6 − 2)² + b(1.6 − 2) + c = 1.05056

P(2) = a(2 − 2)² + b(2 − 2) + c = 9

We can solve this system by writing it as a matrix-vector equation:

[ 0.64  −0.8  1 ] [ a ]   [ −0.88288 ]
[ 0.16  −0.4  1 ] [ b ] = [  1.05056 ]
[ 0      0    1 ] [ c ]   [  9       ]
The solution to this system yields, a = 18.8, b = 27.3936, and c = 9.

P(x) = 18.8(x − 2)2 + 27.3936(x − 2) + 9

Expanding gives P(x) = 18.8x² − 47.8064x + 29.4128, whose x-intercepts are x = 1.49963 and x = 1.04327.
Evaluate f(x) at both roots, and use the one with the value of f(x) closer to zero.

f (1.49963) ≈ 0.21623

f (1.04327) ≈ −0.99592

Replace p0 with p1 , p1 with p2 , and p2 with 1.49963. Repeat until tolerance is met.

Müller's method converges quickly without needing a starting point close to the root, though it is
less efficient, with an order of convergence of approximately α = 1.84. For reference, Secant is about
α = 1.62. Additionally, Müller's method can compute complex roots.
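One step of the construction can be sketched as follows. The 2×2 solve for a and b (after substituting c = f(p2)), and the rule of taking the denominator of larger magnitude so the root nearest p2 is returned, follow the standard Müller convention; here that agrees with the text's choice of 1.49963. The `muller_step` helper name is ours:

```python
import cmath

def muller_step(f, p0, p1, p2):
    """Fit P(x) = a(x - p2)^2 + b(x - p2) + c through the three points
    and return the root of P nearest p2."""
    h0, h1 = p0 - p2, p1 - p2
    c = f(p2)
    # Solve a*h^2 + b*h = f(p) - c for a and b by Cramer's rule
    det = h0**2 * h1 - h1**2 * h0
    a = ((f(p0) - c) * h1 - (f(p1) - c) * h0) / det
    b = (h0**2 * (f(p1) - c) - h1**2 * (f(p0) - c)) / det
    disc = cmath.sqrt(b**2 - 4 * a * c)
    # Larger-magnitude denominator -> root closest to p2, and less
    # cancellation; cmath keeps complex roots available.
    denom = b + disc if abs(b + disc) > abs(b - disc) else b - disc
    return p2 - 2 * c / denom

f = lambda x: x**5 - 2*x**4 + 2*x**3 - 2*x**2 + x - 1
p3 = muller_step(f, 1.2, 1.6, 2.0)
print(p3.real)
```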
5. Polynomial Interpolation

Recall: Polynomial approximations of functions from Calc 2, using your Taylor Series.

When to use Polynomial Interpolation:


1. Data Sets: approximate a function using a set of points. We construct an interpolant because
we don’t know the function.
2. Approximate a complex function: we would use an interpolant because polynomials are
smooth and easy to evaluate. Thus, making a more complex function more efficient.
Why would we construct an interpolant? To approximate values we cannot measure or evaluate
already, i.e. prediction.
Interpolation: defines values between points we already know.
Extrapolation: defines values outside of the domain we already know.
Manipulation: approximate other values based on the approximate to our function (derivatives,
integrals, etc.)
We use polynomials because if you are constructing an approximation, you want to use a
function that is simple and easy to evaluate. We know that polynomials are smooth, easy to evaluate,
and have good properties for manipulation. Specifically, we seek a polynomial that matches all the
known values exactly.
Given a data set (x0 , f (x0 )), . . . , (xn , f (xn )), find a polynomial p(x) such that p(xk ) = f (xk ) for
all k = 0, . . . , n.
General form for interpolation:

f(x) = ∑_{k=0}^{n} ck ϕk(x) = c0 ϕ0(x) + c1 ϕ1(x) + ... + cn ϕn(x).

All ϕk (x)’s are defined, simple functions: “basis functions” We solve for our ck coefficients to
determine the interpolant: “parameters”
All ϕk (x)’s must be linearly independent. (i.e. you cannot write any ϕk (x) as a linear combina-
tion of other ϕk (x)’s)
52 Chapter 5. Polynomial Interpolation

Upon substituting in the data points we have, we will get a system of n + 1 equations and n + 1
unknowns. Then, we solve the system of equations to determine the coefficients for our interpolant.
Standard polynomial interpolation uses monomials for the ϕk(x)'s and is also known as Monomial
Interpolation. This is the simplest and most familiar form for implementation. Each basis function
has the form ϕk(x) = xᵏ so that

f(x) = ∑_{k=0}^{n} ck xᵏ = c0 + c1 x + c2 x² + ... + cn xⁿ.

Note: we could use a Taylor Polynomial here, but we really don’t want to: Taylor Polynomials
are centered about a single point, and we want a polynomial that fits a set of data - which means we
want accuracy across an interval rather than about a single point.

5.1 Monomial Interpolation


Interestingly enough, constructing a polynomial of degree at most n on a data set of n + 1 points is
unique.

Theorem 5.1.1 For n + 1 data points (x0 , y0 ), (x1 , y1 ), . . . , (xn , yn ), there exists a unique polyno-
mial, p(x) = pn (x) = ∑nk=0 ck xk , of degree at most n which satisfies p(xi ) = yi for i = 0, ..., n.
This polynomial interpolates the n + 1 data points.

The proof of the theorem is straightforward: assume there are two such polynomials, consider
the polynomial given by their difference, and show by the Fundamental Theorem of Algebra that
the difference must be the zero polynomial, since it has too many zeros to be anything else.
Now that we have laid the theoretical groundwork for the method, let's work through an
example to find the ck's that satisfy this.
 Example 5.1 Given 3 points (1, 2), (2, 5), (3, 7), determine the monomial interpolant which
satisfies all three points.

Seek p2 (x) = c0 + c1 x + c2 x2 such that:

p2 (1) = c0 + c1 + c2 = 2

p2 (2) = c0 + 2c1 + 4c2 = 5

p2 (3) = c0 + 3c1 + 9c2 = 7


    
1 1 1 c0 2
1 2 4 c1  = 5
1 3 9 c2 7
3 equations, 3 unknowns - solve! (yes, it’s easier to solve as a linear system, but you can solve
it algebraically)

Solving the first equation for c0 in terms of c1 and c2 : c0 = 2 − c1 − c2

Substituting this into the other equations yields:


5.1 Monomial Interpolation 53

2 − c1 − c2 + 2c1 + 4c2 = 5 → c1 + 3c2 = 3 → c1 = 3 − 3c2

2 − c1 − c2 + 3c1 + 9c2 = 7 → 2c1 + 8c2 = 5 → 6 − 6c2 + 8c2 = 5 → 2c2 = −1 → c2 = −1/2

c1 = 3 + 3/2 = 9/2

c0 = 2 − 9/2 + 1/2 = −2

Putting this together yields p2(x) = −2 + (9/2)x − (1/2)x²

(figure: p2(x) and the three data points on [1, 4])

In this picture, you can see that the interpolant passes through all three points exactly, as it is
designed to. 
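A sketch of monomial interpolation for this example in Python; the `monomial_interp` helper and its Gaussian-elimination solve are an illustrative implementation, not the text's:

```python
def monomial_interp(xs, ys):
    """Build the Vandermonde system V c = y and solve it by Gaussian
    elimination with partial pivoting; c[k] multiplies x**k."""
    n = len(xs)
    A = [[x**k for k in range(n)] + [y] for x, y in zip(xs, ys)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        for r in range(col + 1, n):
            m = A[r][col] / A[col][col]
            for k in range(col, n + 1):
                A[r][k] -= m * A[col][k]
    c = [0.0] * n
    for r in range(n - 1, -1, -1):
        c[r] = (A[r][n] - sum(A[r][k] * c[k] for k in range(r + 1, n))) / A[r][r]
    return c

c = monomial_interp([1, 2, 3], [2, 5, 7])
print(c)  # [c0, c1, c2] = [-2.0, 4.5, -0.5]
```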

A few things to note:


• n + 1 data points yields a polynomial of degree at most n. This will result in high-order
polynomials if we have a lot of data points to interpolate. The consequences of this result in
highly oscillatory solutions, or solutions that behave inconsistently with our expectations for
a smooth solution.
• Monomial interpolation becomes unstable for several data points. Consider the matrix formed
in our linear system:
 
  [ 1  1  1 ]
  [ 1  2  4 ]
  [ 1  3  9 ]

If this is extrapolated, the system becomes ill-conditioned when we use many data points.
This means that it becomes prone to round-off error, and our solutions become incorrect as a
result. Monomial is only worthwhile when used for a small number of data points.
Accuracy: Monomial interpolation is good for a small number of data points, but for a large
number of data points it becomes inaccurate due to roundoff error in the linear system that must be
solved to determine the interpolant.

Efficiency: Monomial interpolation is the least efficient technique computationally. The com-
putational cost of solving the linear system for n data points is (2/3)n³ + 2n, which is much more
expensive than the other techniques we will discuss. In terms of simplicity, it is the simplest
technique to follow and understand. However, the computational cost outweighs the simplicity,
which makes this method inefficient overall.

Stability: Monomial interpolation is not stable. When used for a ‘simple’ problem with few
data points, it is effective. However, as the data set grows, the instability grows.

Conclusions: Monomial is a great introduction to how polynomial interpolation works. How-


ever, it is generally going to be the worst possible choice when applying an interpolation technique.

5.2 Lagrange Interpolation


Lagrange Polynomials are constructed to define each ck as simply as possible. Specifically, each
ck = yk. This is done by constructing the basis functions so that ϕk(xk) = 1 and ∀j ≠ k, ϕk(xj) = 0.
In consistency with the name, these specific basis functions are denoted through:

Lk(xj) = 0 if j ≠ k, and Lk(xj) = 1 if j = k.    (5.1)

We construct the polynomial as before, p(x) = pn (x) = ∑nk=0 yk Lk (x)

 Example 5.2 If we use the same points as the previous problem, but construct a Lagrange

interpolant: (1, 2), (2, 5), (3, 7)

First, we define our basis functions for this data set. We must ensure that Lk is zero for other
points, and one at the point corresponding to its construction: e.g. L0 (x0 ) = 1, and L0 (x1 ) =
L0 (x2 ) = 0
We define these through:

L0(x) = (x − x1)(x − x2) / ((x0 − x1)(x0 − x2)) = (x − 2)(x − 3) / ((1 − 2)(1 − 3)) = (1/2)(x − 2)(x − 3)

L1(x) = (x − x0)(x − x2) / ((x1 − x0)(x1 − x2)) = (x − 1)(x − 3) / ((2 − 1)(2 − 3)) = −(x − 1)(x − 3)

L2(x) = (x − x0)(x − x1) / ((x2 − x0)(x2 − x1)) = (x − 1)(x − 2) / ((3 − 1)(3 − 2)) = (1/2)(x − 1)(x − 2)
Note: by forming the numerator as a product of (x − xi )’s we ensure that at those xi ’s our basis
function is zero. By forming the denominator as a product of (x j − xi )’s, we normalize the basis
function so that it is one when evaluated at x j .

Then,

p2(x) = 2L0(x) + 5L1(x) + 7L2(x)
      = (x − 2)(x − 3) − 5(x − 1)(x − 3) + (7/2)(x − 1)(x − 2)
      = −(1/2)x² + (9/2)x − 2
Note that this is the same polynomial from our previous example, just a different construction. 
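The construction can also be evaluated directly, without ever expanding the polynomial; a sketch in Python (the `lagrange_eval` helper name is ours):

```python
def lagrange_eval(xs, ys, x):
    """Evaluate p(x) = sum_k y_k L_k(x) straight from the data."""
    total = 0.0
    for k, (xk, yk) in enumerate(zip(xs, ys)):
        Lk = 1.0
        for i, xi in enumerate(xs):
            if i != k:
                Lk *= (x - xi) / (xk - xi)
        total += yk * Lk
    return total

xs, ys = [1, 2, 3], [2, 5, 7]
interp = lagrange_eval(xs, ys, 1.5)
print(interp)  # agrees with -2 + (9/2)(1.5) - (1/2)(1.5)^2 = 3.625
```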
5.2 Lagrange Interpolation 55

Lagrange basis functions are pre-defined to be linearly independent, since they are zero at all
data points other than their respective x j .
General form:

Lk(x) = [(x − x0)(x − x1)...(x − xk−1)(x − xk+1)...(x − xn)] / [(xk − x0)(xk − x1)...(xk − xk−1)(xk − xk+1)...(xk − xn)]
      = ∏_{i=0, i≠k}^{n} (x − xi)/(xk − xi)

The numerator in this is the part that builds the polynomial of degree n. The denominator is the
part that normalizes the polynomial so it is one when evaluated at xk; these normalizing factors are called
Barycentric weights.

Theorem 5.2.1 Suppose we have

• x0, x1, x2, ..., xn distinct numbers in [a, b]
• f ∈ C^{n+1}[a, b]
• p(xk) = f(xk) ∀k = 0, ..., n
Then, for all x ∈ [a, b], there exists η(x) ∈ [a, b] such that

f(x) − pn(x) = (f^{(n+1)}(η(x)) / (n + 1)!) (x − x0)(x − x1)...(x − xn)

The error is defined to be zero at the points used to build the approximation. Since the interpolant
is exact for any polynomial of degree at most n, whose (n + 1)th derivative is zero, this form is
consistent with the accuracy of the interpolant.

Note: this resembles your Taylor Polynomial error, but is not the same.

Bad cases: if the spacing is large between data points, it can cause your error to become large.
Additionally, large derivatives are a problem, because you have a rapid change between data points
which can also cause your error to become large.

Accuracy: Lagrange interpolation of n + 1 data points is exact for polynomials of degree at most n.

Efficiency: Lagrange interpolation is relatively simple to construct because the basis functions
are consistently defined, and the coefficients are the y-values from the data set (no additional compu-
tations). Computationally, it is much more efficient than Monomial, and is the most computationally
efficient of the three similar methods. It is more difficult to follow than Monomial, but not much
more difficult. So, Lagrange is considered much more efficient than Monomial. One downside to
its efficiency is that it requires you do not change the data set. Sometimes in application, you may
make an approximation with a set number of data points, and then decide to add or remove points
from the data set. With Lagrange interpolation, you will need to reevaluate the entire approximation,
which makes it slightly less efficient than adaptive methods, like Newton’s Divided Differences
(next).

Stability: Lagrange interpolation is the most stable of the three similar methods. Since it uses
all the data points to define the basis functions and weights directly, it does not risk instability
unless the data points are exceedingly close together (division by small numbers in construction of
the basis functions).

Lagrange interpolation is great if you have a known data set to begin with, and you know it
isn’t going to change. Since every basis function is built by knowing all your data points up front, it
can become extremely cumbersome to add data points later. Lagrange interpolation is very popular

because it is computationally efficient and stable.

5.3 Divided Differences


Another, more direct approach to polynomial interpolation is called divided differences.
Given each yi = f(xi), we construct our constants, ck, through divided differences, which in the
general format look very much like a slope formula. To obtain the general formula, we will build
up from one point and proceed to the next.
With one data point, we can only make a constant approximation:

f (x0 ) = c0 → c0 = f (x0 )

With two data points, we can construct a linear approximation:

f (x1 ) = c0 + c1 (x1 − x0 ).

Using the constant from the first data point, we can then define the second constant through

c1 = (f(x1) − f(x0)) / (x1 − x0).
With three data points, we can construct a quadratic approximation:

f (x2 ) = c0 + c1 (x2 − x0 ) + c2 (x2 − x0 )(x2 − x1 ).

We can then use the previous constants to determine the third:

f(x2) = f(x0) + [(f(x1) − f(x0))/(x1 − x0)](x2 − x0) + c2(x2 − x0)(x2 − x1)

Manipulating the expression yields:

f(x2) − f(x0) − [(f(x1) − f(x0))/(x1 − x0)](x2 − x0) = c2(x2 − x0)(x2 − x1)

Solving for the constant yields

c2 = [ (f(x2) − f(x0))/(x2 − x0) − (f(x1) − f(x0))/(x1 − x0) ] / (x2 − x1).
We can continue this process indefinitely. However, there is an easier way to do this by hand:
divided differences. Generally denoted through

pn(x) = f[x0] + ∑_{k=1}^{n} f[x0, ..., xk](x − x0)...(x − xk−1) = f[x0] + ∑_{k=1}^{n} f[x0, ..., xk] ∏_{i=0}^{k−1} (x − xi)

where each

f[xk, ..., xj] = (f[xk+1, ..., xj] − f[xk, ..., xj−1]) / (xj − xk).

These ‘divided differences’ yield the coefficients for the terms in pn (x). This is easier to see and do
in a table representation:
5.3 Divided Differences 57

x-values f (x) 1st DD 2nd DD 3rd DD


x0 f [x0 ]
x1 f [x1 ] f [x0 , x1 ]
x2 f [x2 ] f [x1 , x2 ] f [x0 , x1 , x2 ]
x3 f [x3 ] f [x2 , x3 ] f [x1 , x2 , x3 ] f [x0 , x1 , x2 , x3 ]
The construction is similar to Lagrange interpolating polynomials, but this method is adaptive: we
do not need all of our data points immediately to construct the polynomial easily - we can add
points in later without reconstructing the full solution.

 Example 5.3 Data points (0, 1), (1, 5), (2, 3), and (3, 7):

x-values  f(x)  1st DD             2nd DD                3rd DD
0         1
1         5     (5−1)/(1−0) = 4
2         3     (3−5)/(2−1) = −2   (−2−4)/(2−0) = −3
3         7     (7−3)/(3−2) = 4    (4−(−2))/(3−1) = 3    (3−(−3))/(3−0) = 2
Using the top diagonal entries, we will construct the interpolating polynomial:
p3(x) = 1 + 4(x − 0) + (−3)(x − 0)(x − 1) + 2(x − 0)(x − 1)(x − 2)
= 1 + 4x − 3x(x − 1) + 2x(x − 1)(x − 2)
= 1 + 4x − 3x² + 3x + 2x³ − 6x² + 4x
= 2x³ − 9x² + 11x + 1

Notice here that we had equispaced points, so when we evaluated the divided differences, the
denominators were consistent for each column. 
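The table can be computed in place; a sketch in Python, where `newton_dd` returns the diagonal entries (the Newton-form coefficients) and `newton_eval` uses nested multiplication. Both helper names are ours:

```python
def newton_dd(xs, ys):
    """Compute the divided-difference table in place; the returned list
    holds the diagonal f[x0], f[x0,x1], ..., f[x0,...,xn]."""
    coeffs = list(ys)
    n = len(xs)
    for j in range(1, n):
        for i in range(n - 1, j - 1, -1):
            coeffs[i] = (coeffs[i] - coeffs[i - 1]) / (xs[i] - xs[i - j])
    return coeffs

def newton_eval(xs, coeffs, x):
    """Evaluate the Newton-form interpolant by nested multiplication."""
    p = coeffs[-1]
    for k in range(len(coeffs) - 2, -1, -1):
        p = p * (x - xs[k]) + coeffs[k]
    return p

xs, ys = [0, 1, 2, 3], [1, 5, 3, 7]
c = newton_dd(xs, ys)
print(c)  # diagonal entries: 1, 4, -3, 2, matching the table above
```

Adding a data point only appends one new row of differences, which is what makes the method adaptive.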

There is a specific case of Divided Differences for equispaced points:

Newton's Forward Difference Formula (NFDF) moves forward in x (positive)
Newton's Backward Difference Formula (NBDF) moves backward in x (negative)

If each point is equispaced, then there is a fixed difference between each set of adjacent
points: xi+1 − xi = h, ∀i = 0, 1, ..., n − 1.

Then each 1st DD: f[xi, xi+1] = (f(xi+1) − f(xi))/(xi+1 − xi) = ∆f(xi)/h, a “first derivative” as h → 0
(it looks like (f(x0 + h) − f(x0))/h!)

2nd DD: f[xi, xi+1, xi+2] = (f[xi+1, xi+2] − f[xi, xi+1])/(xi+2 − xi) = (∆f(xi+1) − ∆f(xi))/(2h²) = ∆²f(xi)/(2h²),
which resembles a “second derivative”, and so forth.

Generally, f[x0, ..., xk] = ∆ᵏf(x0)/(k! hᵏ), a “kth derivative” about x0.

Forward Difference Formula:

pn(x) = f(x0) + ∑_{k=1}^{n} C(s, k) ∆ᵏf(x0)

Recall the binomial coefficient: C(s, k) = s(s − 1)...(s − k + 1)/k!, so C(s, 1) = s and C(s, 0) = 1.

In this, s corresponds to the approximation of f(x0 + sh), i.e. x = x0 + sh.

Backward differences go in the opposite direction, i → i − 1 instead of forward i → i + 1.

Accuracy: Divided differences also construct a polynomial of degree n from n + 1 data
points.

Efficiency: Divided differences is slightly more computationally expensive than Lagrange,


but is also an adaptive method. This makes it a more efficient technique if you can (or want
to) take additional data to construct multiple approximations. The tabular format makes divided
differences easier to follow and implement, but it is less straightforward than Lagrange or Monomial.

Stability: Divided differences is more stable than Monomial, but slightly less stable than
Lagrange. The primary cause of instability comes from the divided differences, if the denominator
is very small.

Summary of each of these polynomials:

Method    ϕj(x)                                         Cost           Pros/Cons
Monomial  x^j                                           (2/3)n³ + 2n   Simple, straightforward, but unstable
Lagrange  Lj(x) = ∏_{i=0, i≠j}^{n} (x − xi)/(xj − xi)   n² + 5n        each cj = yj, most stable construction, but not adaptive
NDD       ∏_{i=0}^{j−1} (x − xi)                        (3/2)n² + 2n   adaptive, somewhere between Monomial and Lagrange stability- and efficiency-wise

5.4 Error in Polynomial Interpolation


We can define our error as the difference between our actual function and the polynomial interpolant:

en(x) = f(x) − pn(x)

Theorem 5.4.1 Polynomial Interpolation Error

If pn(x) interpolates f(x) at n + 1 data points x0, x1, ..., xn, and f ∈ C^{n+1}[a, b] with bounded
derivatives, then ∀x ∈ [a, b], ∃ξ = ξ(x) ∈ [a, b] such that

f(x) − pn(x) = (f^{(n+1)}(ξ(x)) / (n + 1)!) ∏_{i=0}^{n} (x − xi)

which is bounded by

max_{a≤x≤b} |f(x) − pn(x)| ≤ (1/(n + 1)!) max_{a≤t≤b} |f^{(n+1)}(t)| max_{a≤s≤b} ∏_{i=0}^{n} |s − xi|
5.5 Interpolating Derivatives 59

The divided differences relate to the derivatives of f (x), thus the error is bounded by the next
derivative of f (x).

5.5 Interpolating Derivatives


It is desirable to approximate f (x) well, but it is also desirable to approximate the derivative at the
given points.

To do this, we must discuss the Osculating Polynomial, a polynomial which satisfies f and f 0
at each node.

The most popular of these techniques is Hermite Interpolation.

If we are given n + 1 nodes, we will have 2n + 2 pieces of data (n + 1 function evaluations
and n + 1 derivative evaluations). So, the degree of our polynomial interpolant is one less than the
size of our data set: 2n + 1.

We can construct this similar to Lagrange Interpolants, where we know we want to match
pn (xi ) = f (xi ), and with the derivative information we also want to match p0n (xi ) = f 0 (xi ).

So, our basis functions Hn,j(x) will need to ensure:

Hn,j(xi) = 1 if i = j, Hn,j(xi) = 0 if i ≠ j, and H′n,j(xi) = 0 ∀i.

Our basis functions Ĥn,j(x) will match the derivative data to ensure:

Ĥn,j(xi) = 0 ∀i, and Ĥ′n,j(xi) = 1 if i = j, Ĥ′n,j(xi) = 0 if i ≠ j.
Then, we construct H2n+1(x) = ∑_{j=0}^{n} f(xj)Hn,j(x) + ∑_{j=0}^{n} f′(xj)Ĥn,j(x)

= f(x0)Hn,0(x) + f(x1)Hn,1(x) + ... + f(xn)Hn,n(x) + f′(x0)Ĥn,0(x) + ... + f′(xn)Ĥn,n(x)

Given the conditions, we can define our basis functions in terms of Lagrange polynomials:

Hn,j(x) = [1 − 2(x − xj)L′n,j(xj)] L²n,j(x)

Ĥn,j(x) = (x − xj) L²n,j(x)

We can verify this is consistent by checking it:

Hn,j(xi) = [1 − 2(xi − xj)L′n,j(xj)] L²n,j(xi)

Since Ln,j(xi) = 0 when i ≠ j, the whole thing will be zero.

If i = j, then this reduces to L²n,j(xj) = 1 because Ln,j(xj) = 1.

H′n,j(xi) = −2L′n,j(xj)L²n,j(xi) + [1 − 2(xi − xj)L′n,j(xj)] 2Ln,j(xi)L′n,j(xi)

For i ≠ j, this reduces to zero because, once again, Ln,j(xi) = 0.

For i = j, this reduces to −2L′n,j(xj) + 2L′n,j(xj) = 0, so it is 0 for all values of i.

Ĥn,j(xi) = (xi − xj)L²n,j(xi)

If i ≠ j, Ln,j(xi) = 0. If i = j, then (xi − xj) = 0, so this is zero for all values of i.

Ĥ′n,j(xi) = L²n,j(xi) + 2(xi − xj)Ln,j(xi)L′n,j(xi).

If i ≠ j, then Ln,j(xi) = 0, and the whole thing is zero.

If i = j, then Ln,j(xi) = 1, and Ĥ′n,j(xi) = 1.

Yes, these are consistent with the requirements listed above for these functions.

Implementing this becomes easier with divided differences:

We construct the same type of polynomial as before - but now we have derivative data!

pn(x) = ∑_{j=0}^{n} f[x0, x1, ..., xj] ∏_{i=0}^{j−1} (x − xi)

where, as before, for any 0 ≤ k ≤ j ≤ n,

f[xk, ..., xj] = (f[xk+1, ..., xj] − f[xk, ..., xj−1]) / (xj − xk)   if xj ≠ xk,
f[xk, ..., xj] = f^{(j−k)}(xk) / (j − k)!                           if xj = xk.

 Example 5.4 Newton’s DD using Hermite Interpolation

Given data for x = −1, 0, 1: f(−1) = 0, f(0) = 1, f(1) = 2, and f′(0) = 0, f″(0) = 0, and
f′(1) = 5.

For each derivative evaluation, we add another row to the table. When we are given derivative
data, that derivative data replaces the divided difference for that column in the new row (1st deriva-
tive → 1st DD, etc.).

Construct the DD table for this example:


5.6 Error in Hermite Interpolation 61

x    y   1st DD             2nd DD               3rd DD                 4th DD              5th DD
−1   0
0    1   (1−0)/(0−(−1)) = 1
0    1   0                  (0−1)/(0−(−1)) = −1
0    1   0                  0                    (0−(−1))/(0−(−1)) = 1
1    2   (2−1)/(1−0) = 1    (1−0)/(1−0) = 1      (1−0)/(1−0) = 1        (1−1)/(1−(−1)) = 0
1    2   5                  (5−1)/(1−0) = 4      (4−1)/(1−0) = 3        (3−1)/(1−0) = 2     (2−0)/(1−(−1)) = 1
Note: there are three rows for x = 0 because we have function data, first derivative data, and second derivative data at x = 0. There are two rows for x = 1 because we have function data and first derivative data, and only one row for x = −1 because we only have function data at that point. Also, the derivative data is only used on the repeated rows. We still compute normal divided differences between x = −1 and x = 0, and between x = 0 and x = 1.

Then we use the diagonal coefficients again to construct the polynomial:

p_5(x) = 0 + 1(x + 1) + (−1)(x + 1)x + 1(x + 1)x² + 0(x + 1)x³ + 1(x + 1)x³(x − 1)

which simplifies to p_5(x) = 1 + x⁵, the function used to build the data set!

Since we had 6 pieces of data, we can construct a polynomial of degree 5, so this will be exact for a polynomial of degree 5 (or less)!
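The table-building process is mechanical enough to sketch in code. The following is our own helper (not the authors' code; the names hermite_dd and newton_eval are ours), reproducing Example 5.4 with the nodes listed with repetition.

```python
# Newton's divided differences with repeated (sorted) nodes, as in Example 5.4:
# a difference f[z_k,...,z_{k+j}] whose nodes are all equal is replaced by
# f^(j)(z_k)/j! from the given derivative data.
from math import factorial

def hermite_dd(z, f, deriv):
    """z: sorted nodes with repetition; f: values; deriv[(x, m)] = f^(m)(x)."""
    n = len(z)
    table = [list(f)]                    # table[j][k] = f[z_k, ..., z_{k+j}]
    for j in range(1, n):
        row = []
        for k in range(n - j):
            if z[k + j] == z[k]:         # all nodes equal: use derivative data
                row.append(deriv[(z[k], j)] / factorial(j))
            else:
                row.append((table[j - 1][k + 1] - table[j - 1][k])
                           / (z[k + j] - z[k]))
        table.append(row)
    return [table[j][0] for j in range(n)]   # diagonal coefficients

def newton_eval(z, coeffs, x):
    p, prod = 0.0, 1.0
    for j, c in enumerate(coeffs):
        p += c * prod
        prod *= (x - z[j])
    return p

z = [-1, 0, 0, 0, 1, 1]
f = [0, 1, 1, 1, 2, 2]
deriv = {(0, 1): 0, (0, 2): 0, (1, 1): 5}
coeffs = hermite_dd(z, f, deriv)
print(coeffs)                        # diagonal: [0, 1, -1, 1, 0, 1]
print(newton_eval(z, coeffs, 0.5))   # 1 + 0.5**5 = 1.03125
```

The diagonal coefficients match the table above, and the evaluated polynomial agrees with 1 + x⁵.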


Accuracy: Hermite interpolation is significantly more accurate because it uses not just function
values at each data point, but also derivative values at data points. Each derivative value adds to the
accuracy of the approximation. If you have n + 1 x-values, all with function values and derivative
values, the approximation will be exact for a polynomial of degree 2n + 1.

Efficiency: Hermite interpolation requires more operations for the same number of x-values,
but otherwise the efficiency is exactly the same as Newton’s Divided Differences. The operations
are much the same, and implementation is similar.

Stability: Slightly more stable due to the derivative information, but otherwise the same as
Newton’s Divided differences.

5.6 Error in Hermite Interpolation


In the case that we are only considering interpolating the n + 1 data points and only the first
derivative at these points, we can quantify the error of the Hermite interpolating polynomial.

Theorem 5.6.1 Hermite Interpolation Error

Let f(x) ∈ C[a, b] be 2n + 2 times differentiable on (a, b) and consider the distinct nodes x_0 < x_1 < · · · < x_n in [a, b]. If p_{2n+1}(x) is the Hermite polynomial such that p_{2n+1}(x_i) = f(x_i) and p′_{2n+1}(x_i) = f′(x_i) for i = 0, 1, . . . , n, then

f(x) − p_{2n+1}(x) = [ f^{(2n+2)}(ξ(x)) / (2n + 2)! ] ∏_{i=0}^{n} (x − x_i)²

which is bounded by

max_{a≤x≤b} | f(x) − p_{2n+1}(x)| ≤ [ 1 / (2n + 2)! ] max_{a≤t≤b} | f^{(2n+2)}(t)| · max_{a≤s≤b} ∏_{i=0}^{n} (s − x_i)²
6. Piecewise Polynomial Interpolation

General polynomial interpolation can be wildly varying (see results in homework, etc).
Piecewise polynomial interpolation is a way to approximate between specified points and avoid
oscillations between points (keeping it smooth and consistent with the overall data).
We will discuss three approaches to piecewise polynomial interpolation:
• Linear piecewise polynomial interpolation (generalizes to higher order polynomials)
• Piecewise Hermite interpolation
• Cubic Spline interpolation

6.1 Linear Piecewise Polynomial Interpolation


Connecting each pair of consecutive points of a data set produces a linear piecewise polynomial. This is the simplest of these methods, but it also has the potential for the largest error. It is also referred to as “broken line” interpolation.
Divide up the data set into subintervals, and form a line between the endpoints of each of these
subintervals.

So, if your data set is defined on the interval [a, b], and you have data points xi in that interval,
then each subinterval is defined between each xi and xi+1 for each interpolant. i.e. we define a
polynomial interpolant, as we did last chapter, but only between two data points at a time.

For piecewise linear, we form a line with the function values between those points.

We can define this generally by matching si (xi ) = f (xi ) and si (xi+1 ) = f (xi+1 )

From this, we construct the line s_i(x) = f(x_i) + [( f(x_{i+1}) − f(x_i) ) / ( x_{i+1} − x_i )] (x − x_i).

We can see that if we substitute in x = x_i, this reduces to s_i(x_i) = f(x_i).

Figure 6.1: This is an example of a linear piecewise polynomial approximating sin(x) on 0 ≤ x ≤ π.

Also, if we substitute in x = xi+1 , this reduces to si (xi+1 ) = f (xi+1 ).

You can also think of this form in terms of your divided differences. Recall: f[x_i, x_{i+1}] = ( f(x_{i+1}) − f(x_i) ) / ( x_{i+1} − x_i ).

Using this form, we can write s_i(x) = f(x_i) + f[x_i, x_{i+1}](x − x_i).

However, this is a really bad fit. Unless our function is a line, or we have data super-close together, it’s likely to be an awful interpolant. The error is O(h²) if we have each set of points separated by h.

We can extend this same idea to more points and use low order polynomial interpolants to con-
nect the data in a more smooth fashion, while ensuring consistency at the shared points. However,
it will not vastly improve our approximation because it will not ensure smoothness at each shared
point.

Accuracy: Piecewise linear interpolants are only exact for linear functions. The error is O(h²) for h = distance between equispaced data points. So, a piecewise linear interpolant may be “good enough” when the data points are close together, but it is generally a poor approximation.
Efficiency: Piecewise linear interpolants are straightforward and simple to compute, making
them efficient computationally and in terms of their accessibility.

Stability: Piecewise linear interpolants are stable because they follow the general trend of the
data (no highly oscillatory behavior). However, they can be wholly inaccurate as a result.

The primary reason we do not use piecewise linear interpolants is their inaccuracy. They are ef-
ficient and stable, but most functions are not represented well with piecewise linear approximations.
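The broken-line construction above takes only a few lines of code. The sketch below is our own illustration (not from the text), using sin(x) on [0, π] as in Figure 6.1.

```python
# "Broken line" interpolation: on each subinterval [x_i, x_{i+1}] evaluate
# s_i(x) = f(x_i) + f[x_i, x_{i+1}] (x - x_i).
import math

def piecewise_linear(xs, ys, x):
    """Evaluate the piecewise linear interpolant of (xs, ys) at x; xs sorted."""
    for i in range(len(xs) - 1):
        if xs[i] <= x <= xs[i + 1]:
            slope = (ys[i + 1] - ys[i]) / (xs[i + 1] - xs[i])  # f[x_i, x_{i+1}]
            return ys[i] + slope * (x - xs[i])
    raise ValueError("x outside the data range")

xs = [k * math.pi / 6 for k in range(7)]      # 7 equispaced points on [0, pi]
ys = [math.sin(x) for x in xs]
print(piecewise_linear(xs, ys, 1.0))          # ~0.833, vs sin(1) ~ 0.8415
```

With h = π/6 the error at x = 1 is about 0.009, consistent with the O(h²) estimate.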

6.2 Piecewise Hermite Interpolation

This requires derivative data at all the given points. This adds some smoothness to the approxima-
tion, but it also requires that we have derivative data at all of our points to apply it. Derivative data
is not generally known, which often makes this method prohibitive.

We will still require s_i(x_i) = f(x_i) and s_i(x_{i+1}) = f(x_{i+1}). For Hermite, we also require s′_i(x_i) = f′(x_i) and s′_i(x_{i+1}) = f′(x_{i+1}).

This adds two conditions to our problem, and allows us to fit not only derivative data, but also construct a polynomial of degree 3, so our error term reduces to O(h⁴).

Downside: it requires derivative information for f (x) that we don’t generally have.

 Example 6.1 Use Piecewise Hermite Interpolation to approximate the given data:

 x      f(x)   f′(x)
 0       1      0
 π/2     0     −1
 π      −1      0

Construct a Hermite Interpolant over each subinterval.

First on [0, π/2]:

 x     f(x)   1st DD                      2nd DD                                3rd DD
 0      1
 0      1     0
 π/2    0     (0 − 1)/(π/2 − 0) = −2/π    (−2/π − 0)/(π/2 − 0) = −4/π²
 π/2    0     −1                          (−1 + 2/π)/(π/2 − 0) = −2/π + 4/π²    (−2/π + 4/π² − (−4/π²))/(π/2 − 0) = −4/π² + 16/π³

P(x) = 1 + 0(x − 0) − (4/π²)(x − 0)² + (−4/π² + 16/π³)(x − 0)²(x − π/2) = 1 + (2/π − 8/π² − 4/π²)x² + (−4/π² + 16/π³)x³ ≈ 1 − 0.5792x² + 0.1107x³

Then, on [π/2, π]:

 x     f(x)   1st DD                      2nd DD                                3rd DD
 π/2    0
 π/2    0     −1
 π     −1     (−1 − 0)/(π − π/2) = −2/π   (−2/π + 1)/(π − π/2) = −4/π² + 2/π
 π     −1     0                           (0 − (−2/π))/(π − π/2) = 4/π²         (4/π² − (−4/π² + 2/π))/(π − π/2) = 16/π³ − 4/π²

P(x) = 0 − 1(x − π/2) + (−4/π² + 2/π)(x − π/2)² + (16/π³ − 4/π²)(x − π/2)²(x − π) ≈ 1.2832 − 0.3606x − 0.4645x² + 0.1107x³
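The two tables above follow one pattern: divided differences on a doubled pair of nodes give the Newton-form coefficients of each cubic piece. The sketch below (our own helper names, not the text's) rebuilds the [0, π/2] piece and evaluates it at π/4.

```python
# Two-point cubic Hermite via divided differences with doubled nodes,
# rebuilt for the [0, pi/2] subinterval of Example 6.1.
import math

def hermite_cubic(x0, x1, f0, f1, df0, df1):
    """Newton-form coefficients of the cubic matching f and f' at x0 and x1."""
    d01 = (f1 - f0) / (x1 - x0)               # ordinary first divided difference
    c2 = (d01 - df0) / (x1 - x0)              # second DD with node x0 doubled
    c3 = ((df1 - d01) / (x1 - x0) - c2) / (x1 - x0)   # third DD, x1 doubled
    return [f0, df0, c2, c3]

def eval_piece(c, x0, x1, x):
    # basis: 1, (x - x0), (x - x0)^2, (x - x0)^2 (x - x1)
    return c[0] + c[1]*(x - x0) + c[2]*(x - x0)**2 + c[3]*(x - x0)**2*(x - x1)

x0, x1 = 0.0, math.pi / 2
coeffs = hermite_cubic(x0, x1, 1.0, 0.0, 0.0, -1.0)   # data from Example 6.1
print(eval_piece(coeffs, x0, x1, math.pi / 4))  # 0.5 + pi/16 ~ 0.6963
```

The value at π/4 works out to exactly 1/2 + π/16 ≈ 0.6963, about 0.011 away from cos(π/4) ≈ 0.7071, consistent with the O(h⁴) error for h = π/2.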

Compare them graphically:



As you can see in the plot, they are nearly indistinguishable from each other.

Accuracy: Piecewise Hermite interpolants are more accurate because they use derivative
information in addition to function values.

Efficiency: Piecewise Hermite interpolants require derivative information, and are compu-
tationally expensive because they require three divided differences for every subinterval. This
also makes them complex to solve for. Additionally, the requirement that we know derivative
information at every point is prohibitive. So, piecewise Hermite interpolants are accurate, but not
efficient.

Stability: Piecewise Hermite interpolants are very stable due to the use of derivative informa-
tion. This forces the behavior of the interpolant to closely match that of the function.

We seek an alternative that yields such accuracy, without requiring additional data: Cubic
Splines!

6.3 Cubic Splines


Cubic splines match the polynomials and their derivatives at the interior points to ensure smoothness of the overall interpolant.

We construct a cubic polynomial between each set of points x_i and x_{i+1}. We will still ensure that s_i(x_i) = f(x_i), and s_i(x_{i+1}) = f(x_{i+1}).

We will also ensure that s′_i(x_{i+1}) = s′_{i+1}(x_{i+1}), and s″_i(x_{i+1}) = s″_{i+1}(x_{i+1}).

This ensures the interpolant is C²[a, b].

Each spline takes the form s_i(x) = a_i + b_i(x − x_i) + c_i(x − x_i)² + d_i(x − x_i)³

So, each spline has 4 unknowns, and with n + 1 data points we have n interpolants and 4n unknowns. To solve for these interpolants we need 4n conditions to solve the system on the whole domain.

We get n conditions from s_i(x_i) = f(x_i).
We get n conditions from s_i(x_{i+1}) = f(x_{i+1}).
We get n − 1 conditions from s′_i(x_{i+1}) = s′_{i+1}(x_{i+1}) (the two endpoints are excluded here: (n + 1) − 2 = n − 1).
We get n − 1 conditions from s″_i(x_{i+1}) = s″_{i+1}(x_{i+1}) (again, the two endpoints are excluded here).

So, the sum of these gives us 4n − 2 conditions, and we complete our set of 4n conditions by using one of the three options for our boundary (below). Then, we have 4n equations and 4n unknowns to solve for.

We require additional boundary constraints at x_0 = a and x_n = b. We have three options for these boundary conditions.
• Clamped boundary: this is the ideal case if you actually know your derivative values at the boundary. We define the derivatives of our splines to equal those of our function at the boundary: s′_0(x_0) = f′(x_0) and s′_{n−1}(x_n) = f′(x_n). This is also referred to as the complete spline.
• Free boundary: this is an easy-to-implement case because it does not require any additional function knowledge. However, if you have a function that varies rapidly near the boundary, it will be inconsistent, and cause the interpolant to not match well near the boundary. This technique assumes smoothness of the function through applying s″_0(x_0) = s″_{n−1}(x_n) = 0. It is commonly implemented because of its ease of use.
• Not-a-knot boundary: this is another condition that we can implement to ensure some level of smoothness, but it does not use any enforced values for the boundary. In fact, this is really a condition on the immediate interior points to the boundary. We set s‴_0(x_1) = s‴_1(x_1) and s‴_{n−2}(x_{n−1}) = s‴_{n−1}(x_{n−1}). This allows for more freedom in the behavior of the function, while simultaneously completing our condition requirements and ensuring smoothness of our splines.
 Example 6.2 Given data:

 x   y
 0   0
 1   1
 2   2

We have 3 points, so we will need to construct two cubic splines.

Each spline takes the form s_i(x) = a_i + b_i(x − x_i) + c_i(x − x_i)² + d_i(x − x_i)³

s_0(x) = a_0 + b_0(x − 0) + c_0(x − 0)² + d_0(x − 0)³ = a_0 + b_0x + c_0x² + d_0x³

s_1(x) = a_1 + b_1(x − 1) + c_1(x − 1)² + d_1(x − 1)³

Then we use the forms to define equations which will satisfy the above conditions.

Each s_i(x_i) = f(x_i):

s_0(0) = a_0 = f(0) = 0 → a_0 = 0

s_1(1) = a_1 = f(1) = 1 → a_1 = 1

Each s_i(x_{i+1}) = f(x_{i+1}):

s_0(1) = a_0 + b_0 + c_0 + d_0 = f(1) = 1 → a_0 + b_0 + c_0 + d_0 = 1

s_1(2) = a_1 + b_1 + c_1 + d_1 = f(2) = 2 → a_1 + b_1 + c_1 + d_1 = 2

Each s′_i(x_{i+1}) = s′_{i+1}(x_{i+1}):

s′_0(x) = b_0 + 2c_0x + 3d_0x²

s′_1(x) = b_1 + 2c_1(x − 1) + 3d_1(x − 1)²

s′_0(1) = b_0 + 2c_0 + 3d_0 = s′_1(1) = b_1 → b_0 + 2c_0 + 3d_0 − b_1 = 0

Each s″_i(x_{i+1}) = s″_{i+1}(x_{i+1}):

s″_0(x) = 2c_0 + 6d_0x

s″_1(x) = 2c_1 + 6d_1(x − 1)

s″_0(1) = 2c_0 + 6d_0 = s″_1(1) = 2c_1 → 2c_0 + 6d_0 − 2c_1 = 0

Since we do not have derivative data for the boundary, and there is only one interior point, the only boundary condition we can apply is the free boundary: s″_0(x_0) = s″_{n−1}(x_n) = 0

s″_0(0) = 2c_0 = 0

s″_1(2) = 2c_1 + 6d_1 = 0

These 8 equations allow us to set up the linear system to solve for a_0, b_0, c_0, d_0, a_1, b_1, c_1, and d_1.

[ 1 0 0 0  0  0  0 0 ] [a_0]   [0]
[ 0 0 0 0  1  0  0 0 ] [b_0]   [1]
[ 1 1 1 1  0  0  0 0 ] [c_0]   [1]
[ 0 0 0 0  1  1  1 1 ] [d_0] = [2]
[ 0 1 2 3  0 −1  0 0 ] [a_1]   [0]
[ 0 0 2 6  0  0 −2 0 ] [b_1]   [0]
[ 0 0 2 0  0  0  0 0 ] [c_1]   [0]
[ 0 0 0 0  0  0  2 6 ] [d_1]   [0]

This is a linear system Ax = b; we can solve it in MATLAB with A\b, yielding

(a_0, b_0, c_0, d_0, a_1, b_1, c_1, d_1) = (0, 1, 0, 0, 1, 1, 0, 0)
Which, when substituted into our splines, yields:

s_0(x) = 0 + x + 0x² + 0x³ = x

s_1(x) = 1 + (x − 1) + 0(x − 1)² + 0(x − 1)³ = x

Naturally, the data given above was for the simple function y = x!

Since the cubic spline can solve for a linear function exactly (with free boundary conditions), it gave us the exact function.
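The text solves the system with MATLAB's A\b. An equivalent sketch in Python (our own solve helper, plain Gaussian elimination with partial pivoting, not a library call) recovers the same coefficients:

```python
# Solving the 8x8 spline system of Example 6.2 by Gaussian elimination.
def solve(A, b):
    """Gaussian elimination with partial pivoting on an augmented copy."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for k in range(n):
        p = max(range(k, n), key=lambda r: abs(M[r][k]))   # pivot row
        M[k], M[p] = M[p], M[k]
        for r in range(k + 1, n):
            m = M[r][k] / M[k][k]
            for c in range(k, n + 1):
                M[r][c] -= m * M[k][c]
    x = [0.0] * n
    for k in range(n - 1, -1, -1):
        x[k] = (M[k][n] - sum(M[k][c] * x[c] for c in range(k + 1, n))) / M[k][k]
    return x

A = [[1,0,0,0, 0, 0, 0,0],
     [0,0,0,0, 1, 0, 0,0],
     [1,1,1,1, 0, 0, 0,0],
     [0,0,0,0, 1, 1, 1,1],
     [0,1,2,3, 0,-1, 0,0],
     [0,0,2,6, 0, 0,-2,0],
     [0,0,2,0, 0, 0, 0,0],
     [0,0,0,0, 0, 0, 2,6]]
b = [0, 1, 1, 2, 0, 0, 0, 0]
print(solve(A, b))   # ~ [0, 1, 0, 0, 1, 1, 0, 0]
```

The solution matches the hand computation: s_0(x) = s_1(x) = x.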
Example 6.3 Solve the same problem we completed using piecewise Hermite, with Cubic Splines. We will add two more points to allow for use of Not-a-Knot conditions in this example.

Given data:

 x      y
 0      1
 π/2    0
 π     −1
 3π/2   0
 2π     1

We have 5 points, so we will need to construct 4 cubic splines.

Each spline takes the form s_i(x) = a_i + b_i(x − x_i) + c_i(x − x_i)² + d_i(x − x_i)³

s_0(x) = a_0 + b_0(x − 0) + c_0(x − 0)² + d_0(x − 0)³ = a_0 + b_0x + c_0x² + d_0x³

s_1(x) = a_1 + b_1(x − π/2) + c_1(x − π/2)² + d_1(x − π/2)³

s_2(x) = a_2 + b_2(x − π) + c_2(x − π)² + d_2(x − π)³

s_3(x) = a_3 + b_3(x − 3π/2) + c_3(x − 3π/2)² + d_3(x − 3π/2)³

Then we use the forms to define equations which will satisfy the above conditions.

Each s_i(x_i) = f(x_i):

s_0(0) = a_0 = f(0) = 1 → a_0 = 1

s_1(π/2) = a_1 = f(π/2) = 0 → a_1 = 0

s_2(π) = a_2 = f(π) = −1 → a_2 = −1

s_3(3π/2) = a_3 = f(3π/2) = 0 → a_3 = 0

Each s_i(x_{i+1}) = f(x_{i+1}):

s_0(π/2) = a_0 + b_0(π/2) + c_0(π²/4) + d_0(π³/8) = f(π/2) = 0 → a_0 + b_0(π/2) + c_0(π²/4) + d_0(π³/8) = 0

s_1(π) = a_1 + b_1(π/2) + c_1(π²/4) + d_1(π³/8) = f(π) = −1 → a_1 + b_1(π/2) + c_1(π²/4) + d_1(π³/8) = −1

s_2(3π/2) = a_2 + b_2(π/2) + c_2(π²/4) + d_2(π³/8) = f(3π/2) = 0 → a_2 + b_2(π/2) + c_2(π²/4) + d_2(π³/8) = 0

s_3(2π) = a_3 + b_3(π/2) + c_3(π²/4) + d_3(π³/8) = f(2π) = 1 → a_3 + b_3(π/2) + c_3(π²/4) + d_3(π³/8) = 1

Each s′_i(x_{i+1}) = s′_{i+1}(x_{i+1}):

s′_0(x) = b_0 + 2c_0x + 3d_0x²

s′_1(x) = b_1 + 2c_1(x − π/2) + 3d_1(x − π/2)²

s′_2(x) = b_2 + 2c_2(x − π) + 3d_2(x − π)²

s′_3(x) = b_3 + 2c_3(x − 3π/2) + 3d_3(x − 3π/2)²

s′_0(π/2) = b_0 + c_0π + 3d_0π²/4 = s′_1(π/2) = b_1 → b_0 + c_0π + 3d_0π²/4 − b_1 = 0

s′_1(π) = b_1 + c_1π + 3d_1π²/4 = s′_2(π) = b_2 → b_1 + c_1π + 3d_1π²/4 − b_2 = 0

s′_2(3π/2) = b_2 + c_2π + 3d_2π²/4 = s′_3(3π/2) = b_3 → b_2 + c_2π + 3d_2π²/4 − b_3 = 0

Each s″_i(x_{i+1}) = s″_{i+1}(x_{i+1}):

s″_0(x) = 2c_0 + 6d_0x

s″_1(x) = 2c_1 + 6d_1(x − π/2)

s″_2(x) = 2c_2 + 6d_2(x − π)

s″_3(x) = 2c_3 + 6d_3(x − 3π/2)

s″_0(π/2) = 2c_0 + 3d_0π = s″_1(π/2) = 2c_1 → 2c_0 + 3d_0π − 2c_1 = 0

s″_1(π) = 2c_1 + 3d_1π = s″_2(π) = 2c_2 → 2c_1 + 3d_1π − 2c_2 = 0

s″_2(3π/2) = 2c_2 + 3d_2π = s″_3(3π/2) = 2c_3 → 2c_2 + 3d_2π − 2c_3 = 0

We apply the Not-a-Knot conditions to our system, s‴_0(x_1) = s‴_1(x_1) and s‴_{n−2}(x_{n−1}) = s‴_{n−1}(x_{n−1}).

s‴_0(x) = 6d_0

s‴_1(x) = 6d_1

s‴_2(x) = 6d_2

s‴_3(x) = 6d_3

We require s‴_0(x) = 6d_0 = s‴_1(x) = 6d_1 → 6d_0 − 6d_1 = 0

s‴_2(x) = 6d_2 = s‴_3(x) = 6d_3 → 6d_2 − 6d_3 = 0

These 16 equations allow us to set up the linear system to solve for a_0, b_0, c_0, d_0, a_1, b_1, c_1, d_1, a_2, b_2, c_2, d_2, a_3, b_3, c_3, and d_3.

A =
[ 1  0    0     0      0  0    0     0      0  0    0     0      0  0    0     0    ]
[ 0  0    0     0      1  0    0     0      0  0    0     0      0  0    0     0    ]
[ 0  0    0     0      0  0    0     0      1  0    0     0      0  0    0     0    ]
[ 0  0    0     0      0  0    0     0      0  0    0     0      1  0    0     0    ]
[ 1  π/2  π²/4  π³/8   0  0    0     0      0  0    0     0      0  0    0     0    ]
[ 0  0    0     0      1  π/2  π²/4  π³/8   0  0    0     0      0  0    0     0    ]
[ 0  0    0     0      0  0    0     0      1  π/2  π²/4  π³/8   0  0    0     0    ]
[ 0  0    0     0      0  0    0     0      0  0    0     0      1  π/2  π²/4  π³/8 ]
[ 0  1    π     3π²/4  0 −1    0     0      0  0    0     0      0  0    0     0    ]
[ 0  0    0     0      0  1    π     3π²/4  0 −1    0     0      0  0    0     0    ]
[ 0  0    0     0      0  0    0     0      0  1    π     3π²/4  0 −1    0     0    ]
[ 0  0    2     3π     0  0   −2     0      0  0    0     0      0  0    0     0    ]
[ 0  0    0     0      0  0    2     3π     0  0   −2     0      0  0    0     0    ]
[ 0  0    0     0      0  0    0     0      0  0    2     3π     0  0   −2     0    ]
[ 0  0    0     6      0  0    0    −6      0  0    0     0      0  0    0     0    ]
[ 0  0    0     0      0  0    0     0      0  0    0     6      0  0    0    −6    ]

x = (a_0, b_0, c_0, d_0, a_1, b_1, c_1, d_1, a_2, b_2, c_2, d_2, a_3, b_3, c_3, d_3)ᵀ

b = (1, 0, −1, 0, 0, −1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0)ᵀ

This is a linear system Ax = b; we can solve it in MATLAB with A\b, yielding
(a_0, b_0, c_0, d_0) = (1, 0, −0.607927, 0.129006)
(a_1, b_1, c_1, d_1) = (0, −0.954930, 0, 0.129006)
(a_2, b_2, c_2, d_2) = (−1, 0, 0.607927, −0.129006)
(a_3, b_3, c_3, d_3) = (0, 0.954930, 0, −0.129006)
Which, when substituted into our splines, yields:

s_0(x) = 1 + 0x − 0.607927x² + 0.129006x³

s_1(x) = 0 − 0.95493(x − π/2) + 0(x − π/2)² + 0.129006(x − π/2)³

s_2(x) = −1 + 0(x − π) + 0.607927(x − π)² − 0.129006(x − π)³

s_3(x) = 0 + 0.95493(x − 3π/2) + 0(x − 3π/2)² − 0.129006(x − 3π/2)³

Compare them graphically:


Again, the two curves are nearly indistinguishable from each other; however, this time we did not have derivative information. 

Accuracy: The accuracy of the cubic spline with free boundary conditions is O(h²), and it solves exactly for a linear function. If you have derivative data for your endpoints (clamped boundary), it can solve for a cubic exactly, O(h⁴). If you apply the Not-a-Knot condition and the function is a cubic polynomial, it will also solve for a cubic exactly, O(h⁴).

Efficiency: The computational cost of the cubic spline is high; it requires the construction and solving of a linear system. It is also difficult to follow and construct. However, it does not require additional information as Hermite does. So, it provides an alternative to gain accuracy without requiring derivative information.

Stability: Cubic splines are not as stable as Hermite because they do not have derivative
information for the function. However, by ensuring continuity between splines, the solution is
much more stable than a standard polynomial interpolant. Complete splines (clamped boundary)
are the most stable, because they use derivative information on the boundary. Not-a-knot follows
because it ensures continuity, but does not use additional function data. Free boundary conditions
are the least stable because they enforce a condition that is not necessarily true. Enforcing the
second derivative at the boundary to be zero is intended to prevent rapidly changing functions at
the ends, however - most functions do not have a zero for their second derivative, so the condition
is inconsistent with the function behavior.

Cubic splines are a preferred technique when we have many data points to satisfy because they
are adaptable and well-behaved. Often, the accuracy and stability make them preferred.
7. Numerical Differentiation

We will approximate the derivatives of a given function f(x) (f′(x), f″(x), etc.) at and/or near a particular x-value, x_i.

These are particularly useful in solving differential equations numerically, because we approximate the derivatives numerically and then the problem is algebraic. (We do this in Numerical Analysis II!)

Recall from Calculus I: f′(x) = lim_{h→0} [ f(x + h) − f(x) ] / h

Thus, we can approximate this value numerically through f′(x_i) ≈ [ f(x_i + h) − f(x_i) ] / h

However, this induces additional error (roundoff error) as h → 0. It is also important to note that as h → 0, the number of computations will increase (the number of data points on an interval will increase dramatically if the space between all points is decreased; this is an important concept for efficiency).

Recall: we derived this method earlier in the semester using the Taylor series of a function, and we talked about truncation error O(h^p). This is error induced by “truncating” our Taylor series to a finite number of terms.

In order to apply numerical differentiation effectively, we need to ensure that our function is sufficiently smooth on the interval: f ∈ C^{n+1}[a, b]. This helps us ensure our error term is bounded, since it is defined by the (n + 1)st derivative of f(x).

Note: Taylor series expansions are localized, and provide a reasonably good approximation very close to the point they are centered about. However, they do not typically define a global approximation, and if h is large the approximation will not be realistic, and can even give results that are nonsensical (or nonphysical) if used in this way.
We have three commonly used forms for difference formulas to approximate the first derivative
of f (x).

7.0.1 Forward Difference Formula


This is a “one-sided" formula that can be used on a left endpoint, where there is only data on the
right side of the point.

Starting with the Taylor series centered about x_0:

f(x) = f(x_0) + f′(x_0)(x − x_0) + [ f″(x_0)/2 ](x − x_0)² + ...

If x = x_0 + h this becomes

f(x_0 + h) = f(x_0) + h f′(x_0) + (h²/2) f″(x_0) + ...

If we truncate this after the second term, Taylor's theorem lets us write the remainder exactly:

f(x_0 + h) = f(x_0) + h f′(x_0) + (h²/2) f″(η(x_0)), where η(x_0) is a point between x_0 and x_0 + h chosen so that this single remainder term accounts for the sum of all remaining terms of the series.

Then we can solve the previous equation for f′(x_0) algebraically by subtracting f(x_0) and dividing by h:

f(x_0 + h) − f(x_0) = h f′(x_0) + (h²/2) f″(η(x_0))

f(x_0 + h) − f(x_0) − (h²/2) f″(η(x_0)) = h f′(x_0)

[ f(x_0 + h) − f(x_0) − (h²/2) f″(η(x_0)) ] / h = f′(x_0)

Which simplifies to f′(x_0) = [ f(x_0 + h) − f(x_0) ] / h − (h/2) f″(η(x_0))

In this, the forward difference formula is f′(x_0) = [ f(x_0 + h) − f(x_0) ] / h, and the truncation error is −(h/2) f″(η(x_0)) = O(h).
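The O(h) behavior is easy to confirm numerically; in the sketch below (our own example, with f(x) = eˣ at x_0 = 0 as an arbitrary test case), halving h roughly halves the error.

```python
# Forward-difference error for f(x) = e^x at x0 = 0 (true derivative 1):
# the error tracks (h/2) f''(eta), so halving h roughly halves it.
import math

def forward_diff(f, x0, h):
    return (f(x0 + h) - f(x0)) / h

errors = [abs(forward_diff(math.exp, 0.0, h) - 1.0) for h in (0.1, 0.05, 0.025)]
print([f"{e:.6f}" for e in errors])      # ['0.051709', '0.025422', '0.012605']
print(errors[0] / errors[1])             # ~2, consistent with O(h)
```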
2

7.0.2 Backward Difference Formula


This is also a “one-sided” formula, but for the right endpoint, where there is only data on the left side of the data point.

So, instead of considering x = x_0 + h, we consider x = x_0 − h:

f(x_0 − h) = f(x_0) − h f′(x_0) + (h²/2) f″(η(x_0))

Again, we solve algebraically for f′(x_0):

f′(x_0) = [ f(x_0) − f(x_0 − h) ] / h + (h/2) f″(η(x_0))
These methods are relatively simple to implement and derive, but they are not very accurate.
O(h) means that if we halve our step size, it will only halve our error.
If we can do better, we definitely want to.

7.0.3 Centered Difference Formula


The Centered Difference Formula can only be used on interior points because it requires data from both sides of the data point. However, it also improves our truncation error term significantly.

If we start with both f(x_0 + h) and f(x_0 − h):

f(x_0 + h) = f(x_0) + h f′(x_0) + (h²/2) f″(x_0) + (h³/6) f‴(x_0) + ...

f(x_0 − h) = f(x_0) − h f′(x_0) + (h²/2) f″(x_0) − (h³/6) f‴(x_0) + ...

f(x_0 + h) − f(x_0 − h) = 2h f′(x_0) + (h³/3) f‴(x_0) + ...

Solve for f′(x_0):

f(x_0 + h) − f(x_0 − h) − (h³/3) f‴(x_0) − ... = 2h f′(x_0)

[ f(x_0 + h) − f(x_0 − h) ] / (2h) − (h²/6) f‴(x_0) + ... = f′(x_0)

So the Centered Difference Formula is f′(x_0) ≈ [ f(x_0 + h) − f(x_0 − h) ] / (2h)

It has an error term O(h²). This is significantly improved over the one-sided, O(h) methods, but also requires more data be available in order to approximate the derivative. Additionally, it cannot be used on endpoints.

In order to maintain consistent error, we can use one-sided 3 point formulas:

Given f(x_0 + h) = f(x_0) + h f′(x_0) + (h²/2) f″(x_0) + (h³/6) f‴(x_0) + ...

f(x_0 + 2h) = f(x_0) + 2h f′(x_0) + (4h²/2) f″(x_0) + (8h³/6) f‴(x_0) + ...

We can combine them to cancel out our O(h²) terms:

4 f(x_0 + h) − f(x_0 + 2h) = 3 f(x_0) + 2h f′(x_0) − (4h³/6) f‴(x_0) + ...

Solve for f′(x_0):

f′(x_0) = [ 4 f(x_0 + h) − f(x_0 + 2h) − 3 f(x_0) ] / (2h) + (h²/3) f‴(x_0) + ...

More accuracy requires more points, and more algebra. There is a point at which accuracy vs. efficiency becomes moot, and it is no longer worth pushing for better accuracy because it makes the computation so much less efficient. O(h²) isn’t there yet.
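A small comparison sketch (f(x) = sin x at x_0 = 1 is our own choice of test case): both second-order formulas beat the forward difference by orders of magnitude at the same h.

```python
# Forward O(h), centered O(h^2), and one-sided three-point O(h^2) formulas
# for f(x) = sin(x) at x0 = 1; the true derivative is cos(1).
import math

f, x0, true = math.sin, 1.0, math.cos(1.0)
h = 0.1
fwd = (f(x0 + h) - f(x0)) / h
ctr = (f(x0 + h) - f(x0 - h)) / (2 * h)
one = (4 * f(x0 + h) - f(x0 + 2 * h) - 3 * f(x0)) / (2 * h)
for name, v in [("forward", fwd), ("centered", ctr), ("3-point", one)]:
    print(f"{name:9s} error = {abs(v - true):.2e}")
```

With h = 0.1, the forward error is about 4e-2, while the centered and three-point errors are about 9e-4 and 1.6e-3.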

Second Derivatives
We can derive second derivatives in the same manner we derived first derivatives; we will want to isolate the second derivative in our Taylor expansion.

Note: we will end up dividing by h², which is significantly smaller than h, so the higher derivatives are much more vulnerable to roundoff errors.

Centered difference for the second derivative: we add the two expansions instead of subtracting, so we cancel the first derivative rather than the second.

f(x_0 + h) = f(x_0) + h f′(x_0) + (h²/2) f″(x_0) + (h³/6) f‴(x_0) + (h⁴/24) f⁽⁴⁾(x_0) + ...

f(x_0 − h) = f(x_0) − h f′(x_0) + (h²/2) f″(x_0) − (h³/6) f‴(x_0) + (h⁴/24) f⁽⁴⁾(x_0) − ...

f(x_0 + h) + f(x_0 − h) = 2 f(x_0) + h² f″(x_0) + (h⁴/12) f⁽⁴⁾(x_0) + ...

Solve for f″(x_0):

f″(x_0) = [ f(x_0 + h) − 2 f(x_0) + f(x_0 − h) ] / h² − (h²/12) f⁽⁴⁾(x_0) + ...
Note: All of these methods are derived using equispaced data. Other forms can be derived for
other cases of data sets. Also note that you do not need to use the same method on your whole data
set.
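The second-derivative formula can be sanity-checked the same way; in the sketch below (f(x) = cos x at x_0 = 0.5 is our own test choice), the error drops by a factor of about four when h halves, as O(h²) predicts.

```python
# Centered second-difference for f(x) = cos(x) at x0 = 0.5 (true value
# -cos(0.5)); the error term is -(h^2/12) f''''(x0), so it quarters with h/2.
import math

def second_diff(f, x0, h):
    return (f(x0 + h) - 2 * f(x0) + f(x0 - h)) / h ** 2

true = -math.cos(0.5)
e1 = abs(second_diff(math.cos, 0.5, 0.1) - true)
e2 = abs(second_diff(math.cos, 0.5, 0.05) - true)
print(e1, e2, e1 / e2)   # ratio near 4, consistent with O(h^2)
```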

7.1 Richardson Extrapolation


Richardson Extrapolation is used to create an approximation with higher order error when you have several data points on the domain that are already equispaced and the number is divisible by 2^m, where m is an integer. This allows you to use local, low order error approximations to construct a better approximation to the derivative at a specified value.

Given an approximation to the derivative ( f′(x_0), f″(x_0), etc.), which we will denote N(h), and the actual value of the derivative, represented by M, we can always define the error term in terms of powers of h (for equispaced data), h^p. Therefore,

M − N(h) = k_1h + k_2h² + k_3h³ + ...

For the one-sided approximation to f′(x_0), the error is O(h), so if we can combine multiple O(h) approximations appropriately, we can eliminate the O(h) term and yield an O(h²) error term.

So, if we have two approximations to the derivative using low-order approximations, one at h and one at h/2, using the same method, we can define our error terms:

M = N(h) + k_1h + k_2h² + k_3h³ + ...

M = N(h/2) + k_1(h/2) + k_2(h²/4) + k_3(h³/8) + ...

In order to cancel the O(h) terms, we can combine the two approximations (twice the second minus the first):

2M − M = 2N(h/2) − N(h) − k_2(h²/2) − k_3(3h³/4) + ...

M = 2N(h/2) − N(h) + O(h²)
This process can be repeated until desired accuracy is obtained. Be wary of roundoff errors,
as they can creep in and make your results no longer accurate as you compute further steps of
Richardson Extrapolation.

Generally, we can define the process of Richardson Extrapolation through:

N_j(h) = N_{j−1}(h/2) + [ N_{j−1}(h/2) − N_{j−1}(h) ] / ( 2^{j−1} − 1 ), which will have O(h^j) error.

We can use these in the same format as our divided differences table, to more easily track how the approximation is faring.

  h     O(h)      O(h²)     O(h³)     O(h⁴)
  h     N_1(h)
 h/2    N_1(h/2)  N_2(h)
 h/4    N_1(h/4)  N_2(h/2)  N_3(h)
 h/8    N_1(h/8)  N_2(h/4)  N_3(h/2)  N_4(h)
Ex: Approximate the derivative of f(x) = ln x at x_0 = 1 for h = 0.4.

N_1(h) = [ ln(1.4) − ln(1) ] / 0.4 = 0.84118

N_1(h/2) = [ ln(1.2) − ln(1) ] / 0.2 = 0.91161

N_1(h/4) = [ ln(1.1) − ln(1) ] / 0.1 = 0.95310

Since f′(x) = 1/x is equal to 1 when x = 1, it makes sense that our approximations get closer to one for decreasing values of h.

Applying Richardson Extrapolation:

N_2(h) = N_1(0.2) + [ N_1(0.2) − N_1(0.4) ] / (2 − 1) = 2N_1(0.2) − N_1(0.4) ≈ 0.98203

N_2(h/2) = 2N_1(0.1) − N_1(0.2) ≈ 0.99460

N_3(h) = N_2(0.2) + [ N_2(0.2) − N_2(0.4) ] / (4 − 1) ≈ 0.99878
This is useful because the computational expense is quite small while the error decreases
significantly.
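The whole triangle can be generated mechanically. The sketch below (our own function names, not the text's) applies the scheme to the forward difference of ln x at x_0 = 1 with h = 0.4:

```python
# Richardson extrapolation on the forward difference N1(h) of ln(x) at x0 = 1;
# T[j][0] is the O(h^(j+1)) estimate built from step sizes h, h/2, h/4, ...
import math

def N1(h):
    return (math.log(1 + h) - math.log(1)) / h

def richardson(N, h, levels):
    T = [[N(h / 2 ** i) for i in range(levels)]]
    for j in range(1, levels):
        T.append([T[j - 1][i + 1] + (T[j - 1][i + 1] - T[j - 1][i]) / (2 ** j - 1)
                  for i in range(levels - j)])
    return T

T = richardson(N1, 0.4, 3)
print(" ".join(f"{v:.5f}" for v in T[0]))   # 0.84118 0.91161 0.95310
print(f"{T[1][0]:.5f}  {T[2][0]:.5f}")      # 0.98203  0.99878
```

Each column is markedly closer to the exact value f′(1) = 1 than the last, for essentially no extra function evaluations.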

Note: Sometimes Richardson Extrapolation is presented as I’ve done here, with factors of h,
and other times it is done using multiples of h, as I will do below. They are effectively equivalent,
even though they seem different - the concept driving the computations is the same.

Second Derivatives
We can also use this same concept to improve upon our second derivative approximations.

If we use N_1(h/2) = [ f(x_0 + h) − 2 f(x_0) + f(x_0 − h) ] / h² + O(h²)

and N_1(h) = [ f(x_0 + 2h) − 2 f(x_0) + f(x_0 − 2h) ] / (4h²) + O(h²)

we can combine them to cancel the first error term:

4N_1(h/2) − N_1(h) = 3M + O(h⁴)

M = (1/3)[ 4N_1(h/2) − N_1(h) ]

  = (1/(3h²)) [ 4 f(x_0 + h) − 8 f(x_0) + 4 f(x_0 − h) − f(x_0 + 2h)/4 + f(x_0)/2 − f(x_0 − 2h)/4 ] + O(h⁴)

Which simplifies to:

M = (1/(12h²)) [ 16 f(x_0 + h) − 30 f(x_0) + 16 f(x_0 − h) − f(x_0 + 2h) − f(x_0 − 2h) ] + O(h⁴)

(The h² error terms cancel, and because this stencil is symmetric the remaining error is O(h⁴).)
These can become unstable the more you extrapolate, and for widely spaced data. The approach requires bounded derivatives of f(x) to work. Be careful about roundoff errors as well, as h → 0.

7.2 Formulas Using Lagrange Interpolating Polynomials


General approach: build a Lagrange interpolating polynomial and differentiate it.

Polynomials are easy to evaluate, and differentiate. Since they are locally good approximations,
they can approximate the derivatives effectively.

Ex: Given data (1/2, −ln(2)), (1, 0), (2, ln(2)), and (3, ln(3)). Construct an interpolating polynomial for the data set, and use it to approximate the derivative of f(x) at x = 1.

Format-wise, since we know we will arrive at the same polynomial using Newton’s Divided Differences:

 x     f(x)    1st DD                               2nd DD                                       3rd DD
 1/2   −ln 2
 1      0      (0 + ln 2)/(1 − 1/2) = 2 ln 2
 2      ln 2   (ln 2 − 0)/(2 − 1) = ln 2            (ln 2 − 2 ln 2)/(2 − 1/2) = −(2/3) ln 2
 3      ln 3   (ln 3 − ln 2)/(3 − 2) = ln 3 − ln 2  (ln 3 − 2 ln 2)/(3 − 1) = (1/2) ln 3 − ln 2  ((1/2) ln 3 − ln 2 + (2/3) ln 2)/(3 − 1/2) = (1/5) ln 3 − (2/15) ln 2

We use the top diagonal terms to form the polynomial:

f[x_0] = −ln 2, f[x_0, x_1] = 2 ln 2, f[x_0, x_1, x_2] = −(2/3) ln 2, f[x_0, x_1, x_2, x_3] = (1/5) ln 3 − (2/15) ln 2.

So, L(x) = −ln 2 + 2 ln 2 (x − 1/2) − (2/3) ln 2 (x − 1/2)(x − 1) + [ (1/5) ln 3 − (2/15) ln 2 ] (x − 1/2)(x − 1)(x − 2)

We can then approximate the derivative using L′(x):

L′(x) = 2 ln 2 − (2/3) ln 2 [ (x − 1) + (x − 1/2) ] + [ (1/5) ln 3 − (2/15) ln 2 ] [ (x − 1)(x − 2) + (x − 1/2)(x − 2) + (x − 1/2)(x − 1) ]

At x = 1, L′(1) = 2 ln 2 − (2/3) ln 2 (1/2) + [ (1/5) ln 3 − (2/15) ln 2 ] (−1/2) ≈ 1.09159
3 2 5 15 2
This is a pretty decent approximation to the derivative of the actual function, f(x) = ln x, so f′(x) = 1/x; since f′(1) = 1, the error is less than 10%.

However, if f′(x) → 0, the approximation will not be good (smaller derivative, larger relative error).

One aspect of using Lagrange interpolants is that we can continue to approximate higher
derivatives, although they become less stable with higher derivatives.

If we approximate f″(1), for example:

L″(x) = −(4/3) ln 2 + [ (1/5) ln 3 − (2/15) ln 2 ] [ 2(x − 2) + 2(x − 1) + 2(x − 1/2) ]

L″(1) ≈ −1.0515, compared with f″(1) = −1/1² = −1.
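We can also check L′(1) without doing the algebra by hand. The sketch below (our own code) builds L(x) in Newton form from the diagonal coefficients above and differentiates it with a centered difference, which is safe here since L is a smooth cubic.

```python
# Newton-form evaluation of L(x) from the example, then a centered-difference
# estimate of L'(1); the hand computation above gives about 1.09159.
import math

ln2, ln3 = math.log(2), math.log(3)
nodes = [0.5, 1.0, 2.0]
coeffs = [-ln2, 2 * ln2, -(2 / 3) * ln2, (1 / 5) * ln3 - (2 / 15) * ln2]

def L(x):
    p, prod = 0.0, 1.0
    for j, c in enumerate(coeffs):
        p += c * prod
        if j < len(nodes):
            prod *= (x - nodes[j])
    return p

h = 1e-6
dL = (L(1 + h) - L(1 - h)) / (2 * h)
print(round(dL, 5))     # 1.09159
```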

7.3 Roundoff Error and Data Errors in Numerical Differentiation


Recall: when we compute numerical derivatives, we divide by small numbers. This process can
induce roundoff errors due to machine precision (as we discussed in Chapter 3).

Examine the roundoff error in the computation of the derivative for f(x) = e^x near x0 = 0 (the exact value is f′(0) = 1):

Numerical derivative: Dh = f′(x0) ≈ (f(x0 + h) − f(x0 − h))/(2h) = (e^h − e^{−h})/(2h)
Actual derivative evaluated at x = h: D̄h = f′(h) = e^h

h     0.1       0.01      0.001     0.0001    0.00001
Dh    1.00167   1.00002   1.00002   1.00017   1.00136
D̄h    1.10517   1.01005   1.00100   1.00010   1.00001
In order to understand and minimize the roundoff error, we need to know the floating point precision of the system; h cannot be close to it if we want to ensure our roundoff error is smaller than our truncation error. If h becomes too small, roundoff error can increase our total error beyond the expectation for the method, and in extreme circumstances it can "blow up" and yield unphysical results.
This matters when we assess the effectiveness of our method: if we know truncation error dominates, we can appropriately estimate the error of our approximation. This becomes even more important when we implement numerical derivatives to approximate solutions to differential equations. It can also cause problems if we are using data that is particularly noisy, where derivative approximations can vary greatly between points.
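A quick experiment (our own sketch, not from the text) shows this trade-off in double precision: the error of the central difference first shrinks with h, then grows again once roundoff dominates.

```python
import math

def central_diff(f, x0, h):
    # central-difference approximation of f'(x0); truncation error is O(h^2)
    return (f(x0 + h) - f(x0 - h)) / (2.0 * h)

# f(x) = e^x at x0 = 0, where the exact derivative is 1
for h in [1e-1, 1e-5, 1e-9, 1e-13]:
    err = abs(central_diff(math.exp, 0.0, h) - 1.0)
    print(f"h = {h:.0e}   error = {err:.3e}")
```

For moderate h the error falls like h², but for very small h the subtraction of nearly equal numbers divided by a tiny 2h amplifies roundoff.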

Accuracy: Accuracy for derivative approximations is defined by the value of h (truncation


error). There will be additional roundoff error because as h → 0, we will be dividing by small
numbers. However, if you have a specific accuracy needed, you can ensure it by using an appropriate
method and accounting for roundoff error.
Efficiency: Efficiency decreases as more function evaluations are required. Because we assume
h is small, we often do not require beyond O(h2 ) accuracy. This allows us to make an efficient
approximation, within reasonable accuracy.
Stability: Numerical differentiation is not stable. As h → 0 roundoff error can overtake an
approximation and yield unphysical results. It is important to be wary of roundoff error when using
numerical derivative approximations.
8. Numerical Integration

When we have a function f(x) and need to determine ∫_a^b f(x) dx but do not know the antiderivative F(x) analytically, we must find another way to determine the integral.
Recall: Riemann sums (Calculus I) and numerical integration (Calculus II).
Numerical integration approximates the integral using a summation:
∫_a^b f(x) dx ≈ ∑_{j=0}^n a_j f(x_j)
where each x_j ∈ [a, b] is called an abscissa, and each a_j is the weight given to the jth term f(x_j).
These techniques are called quadrature methods because quadrature refers to computing area, and we are computing a single integral, which relates directly to the area under the curve. This is a similar concept to how we evaluated integrals using Riemann sums in Calculus.
Note that the weight a_j in Riemann sums was the width of the box built under the curve, denoted ∆x, and was the same for all the boxes because we used equispaced data points. We generalize this term to weights because the value of these coefficients will vary by the method used.

8.1 Basic Quadrature


The idea is to approximate the area under a curve on a specific interval, without breaking it up into pieces. We do this by approximating f(x) with a polynomial, p_n(x), and then integrating the polynomial directly, ∫_a^b p_n(x) dx. This is effective because we can integrate a polynomial exactly. It also relates nicely to our polynomial interpolation in the previous chapters.
Recall: for a Lagrange interpolant,
p_n(x) = ∑_{j=0}^n f(x_j) L_j(x) = ∑_{j=0}^n f(x_j) ∏_{k=0, k≠j}^n (x − x_k)/(x_j − x_k)
Thus, we can approximate
∫_a^b f(x) dx ≈ ∫_a^b ∑_{j=0}^n f(x_j) ∏_{k=0, k≠j}^n (x − x_k)/(x_j − x_k) dx = ∑_{j=0}^n f(x_j) ∫_a^b L_j(x) dx
So, our weights can be determined through a_j = ∫_a^b L_j(x) dx.

 Example 8.1 Seek an approximation to the integral of f(x) = e^x on [0, 1] using a linear approximation, p1(x).
f(x0) = f(0) = 1
f(x1) = f(1) = e
L0(x) = (x − 1)/(0 − 1) = 1 − x
L1(x) = (x − 0)/(1 − 0) = x
p1(x) = 1·(1 − x) + e·x
So, a0 = ∫_0^1 (1 − x) dx = [x − x²/2]_0^1 = 1 − 1/2 = 1/2
a1 = ∫_0^1 x dx = [x²/2]_0^1 = 1/2 − 0 = 1/2
So, the area approximation is given by (1 + e)/2.

[Figure: f(x) = e^x and the linear interpolant p1(x) = 1 + (e − 1)x on [0, 1].]

This linear approximation to the area under a curve can be generalized, because our Lagrange
polynomials take the same form each time.
L0(x) = (x − b)/(a − b)
a0 = ∫_a^b (x − b)/(a − b) dx = [x²/(2(a − b)) − bx/(a − b)]_a^b
   = b²/(2(a − b)) − b²/(a − b) − a²/(2(a − b)) + ab/(a − b)
   = (b² − 2ab + a²)/(2(b − a)) = (b − a)²/(2(b − a)) = (b − a)/2
Similarly, a1 = ∫_a^b (x − a)/(b − a) dx = [x²/(2(b − a)) − ax/(b − a)]_a^b = (b² − 2ab + a²)/(2(b − a)) = (b − a)/2
This is the trapezoidal rule in numerical integration:
∫_a^b f(x) dx ≈ ((b − a)/2)[f(a) + f(b)]
This method approximates the area using a trapezoid, where the top is formed by a linear fit to the curve using the endpoint values.
What if we want to approximate the curve with a quadratic? We will need three points, so we will have to cut the interval in half, and we will need information about the midpoint, x = (a + b)/2.
Now,
L0(x) = [(x − b)(x − (a + b)/2)] / [(a − b)(a − (a + b)/2)] = (2/(a − b)²)[x² − ((a + 3b)/2)x + b(a + b)/2]
So, we determine the coefficient a0 = ∫_a^b L0(x) dx:
a0 = (2/(a − b)²) ∫_a^b [x² − ((a + 3b)/2)x + b(a + b)/2] dx
   = (2/(a − b)²) [x³/3 − ((a + 3b)/4)x² + b((a + b)/2)x]_a^b
   = (2/(a − b)²) [(b³ − a³)/3 − ((a + 3b)/4)(b² − a²) + ((a + b)/2)(b² − ab)]
   = (−2/(a − b)) [(b² + ab + a²)/3 − (a² + 4ab + 3b²)/4 + (ab + b²)/2]
   = (−2/(a − b)) [b²/3 + ab/3 + a²/3 − a²/4 − ab − 3b²/4 + ab/2 + b²/2]
Combine similar terms:
b²/3 + b²/2 − 3b²/4 = b²/12
ab/3 − ab + ab/2 = −ab/6
a²/3 − a²/4 = a²/12
Substituting this back into the original form:
a0 = (−2/(a − b)) [b²/12 − ab/6 + a²/12] = −(b² − 2ab + a²)/(6(a − b)) = (b − a)²/(6(b − a)) = (b − a)/6
So, a0 = (b − a)/6. Similarly, a1 = 4(b − a)/6 = 2(b − a)/3, and a2 = (b − a)/6.
Yielding Simpson's rule:
∫_a^b f(x) dx ≈ ((b − a)/6)(f(x0) + 4 f(x1) + f(x2))
These methods are all derived by forming a polynomial interpolant between each xi and xi+1 ,
and are generalized as Newton-Cotes formulas.
In this context, they can be defined by the same h used in differentiation, xi+1 − xi = h.
Both of these should seem mildly familiar from Calculus II numerical integration.
The techniques you learned there are these, but in their composite form. (Next!)
For composite numerical integration, we complete these ‘basic’ techniques on subintervals and
add them all together.
When numerical integration techniques include the endpoints, they are called closed (much like
interval notation).
When numerical integration techniques do not include the endpoints, they are called open.
An example you would have seen in Calculus II of an open technique is Midpoint Rule.

Midpoint is slightly better than trapezoidal, but it is also impractical for computations with data
(because we cannot then evaluate a midpoint).
 
The basic Midpoint rule uses the function value at the midpoint as the height:
∫_a^b f(x) dx ≈ (b − a) f((a + b)/2)

Quadrature error
Error is always determined by the difference between the ‘exact’ and the ‘approximate’.
For numerical integration
Rb n Rb n
a f (x)dx − ∑ j=0 a j f (x j ) = a f [x0 , x1 , ..., xn , x] ∏ j=0 (x − x j )dx
Error terms for each of our basic techniques:
− f 00 (ξ )(b − a)3
Trapezoidal: . Conceptually, you can think about how this method was derived.
12
We used a p1 (x) approximation, which was linear. So, the error in that approximation was in
terms of f 00 (ξ ). Since we integrated the (x − x0 )2 term, our result has an O(x3 ). This results in the
(b − a)3 term. Each basic trapezoidal integral results in O(h3 ) error.
− f 0000 (ξ ) b − a 5
 
Simpson’s: . Similarly, this technique has O(h5 ) for each basic integral.
90 2
f 00 (ξ )(b − a)3
Midpoint: . This technique also has O(h3 ) error for each basic integral.
24
Precision: also called degree of accuracy, these integrals will be exact for polynomials of
degree: 1 (Trapezoidal and Midpoint), and 3 (Simpson’s).
 Example 8.2 Given f(x) = x³ on [0, 0.5].
If we apply the basic Trapezoidal rule:
ITrap = ((0.5 − 0)/2)[0 + (1/2)³] = 1/32
If we apply the basic Simpson's rule:
ISimp = ((0.5 − 0)/6)[0 + 4(1/4)³ + (1/2)³] = 1/64
If we apply the basic Midpoint rule:
IMid = (0.5 − 0)(1/4)³ = 1/128
The actual integral, using the Fundamental Theorem of Calculus:
∫_0^{0.5} x³ dx = [x⁴/4]_0^{0.5} = (1/16)/4 − 0 = 1/64
Here b − a = h = 0.5, and f′(x) = 3x², f′′(x) = 6x, f′′′(x) = 6, and f′′′′(x) = 0.
When we substitute these into our error bounds, we maximize the derivative f^{(p)}(ξ) over the values ξ ∈ [a, b].
So, for Trapezoidal: −f′′(ξ)(b − a)³/12 = −(6 · 0.5)(0.5)³/12 = −1/32. Our actual error was 1/64 − 1/32 = −1/64. This is consistent with our error bound because the actual error was less (in magnitude) than the error bound value.
Simpson's: −(f′′′′(ξ)/90)((b − a)/2)⁵ = −(0/90)(0.25)⁵ = 0. Simpson's has no error because it integrates a cubic polynomial exactly.
Midpoint: f′′(ξ)(b − a)³/24 = (6 · 0.5)(0.5)³/24 = 1/64. This is consistent with the error bound because the actual error was 1/64 − 1/128 = 1/128. 
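The three basic rules above are easy to check in code; this is a minimal sketch with our own function names, applied to the example f(x) = x³ on [0, 0.5]:

```python
def trap(f, a, b):
    # basic trapezoidal rule: (b - a)/2 [f(a) + f(b)]
    return (b - a) / 2 * (f(a) + f(b))

def simpson(f, a, b):
    # basic Simpson's rule: (b - a)/6 [f(a) + 4 f(m) + f(b)]
    return (b - a) / 6 * (f(a) + 4 * f((a + b) / 2) + f(b))

def midpoint(f, a, b):
    # basic Midpoint rule: (b - a) f(m)
    return (b - a) * f((a + b) / 2)

f = lambda x: x**3
print(trap(f, 0, 0.5), simpson(f, 0, 0.5), midpoint(f, 0, 0.5))
# exact integral is 1/64 = 0.015625; Simpson matches it (cubics are exact)
```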

8.2 Composite Numerical Integration


Basic quadrature methods are ineffective over long intervals; note that the error bounds become large when (b − a) > 1.
We can improve upon our approximation by using higher order Lagrange polynomials, but that comes with higher oscillations, and is inefficient due to the complex computations involved and the inaccuracy introduced by the oscillations.
So, in the same manner that we improved upon basic polynomial interpolation by using piecewise polynomial interpolants, we can improve upon the basic integration techniques by breaking up the interval into smaller pieces and adding them all together: composite quadrature methods.
These are composed as follows:
1. Divide the interval [a, b] into equispaced subintervals
2. Apply the basic quadrature to each subinterval
3. Sum the respective integrals to approximate ∫_a^b f(x) dx

Outline of the process:
1. Given an integral on [a, b], subdivide into r equispaced subintervals, so the width of each subinterval is h = (b − a)/r. Then each point can be defined by x_i = a + ih, i = 0, 1, ..., r.
2. Evaluate each integral ∫_{x_{i−1}}^{x_i} f(x) dx using any of the basic methods:
Trapezoidal: (h/2)(f(x_{i−1}) + f(x_i))
Simpson's: (h/6)(f(x_{i−1}) + 4 f((x_{i−1} + x_i)/2) + f(x_i)) (in this case, r must be an even number)
Midpoint: h f((x_{i−1} + x_i)/2)
3. Sum: ∑_{i=1}^r ∫_{x_{i−1}}^{x_i} f(x) dx
Trapezoidal: ∑_{i=1}^r (h/2)(f(x_{i−1}) + f(x_i))
= (h/2)[(f(x0) + f(x1)) + (f(x1) + f(x2)) + ... + (f(x_{r−2}) + f(x_{r−1})) + (f(x_{r−1}) + f(x_r))]
= (h/2)[f(x0) + 2 f(x1) + 2 f(x2) + ... + 2 f(x_{r−1}) + f(x_r)]
This is the Trapezoidal rule you would have learned in Calculus II.
Error: Recall the basic rule had error bound −f′′(ξ)(b − a)³/12; now b − a is replaced by h, but we also have r of these error terms summed together:
(−f′′(ξ)h³/12) · r = (−f′′(ξ)h³/12) · (b − a)/h = −f′′(ξ)(b − a)h²/12,
an O(h²) error. The composite rule can still integrate exactly only a linear function: degree of precision 1.
Simpson's: Since we require the midpoints, we define h as the distance between each point we evaluate (including the midpoints), which means each subinterval is actually 2h wide. This is the reason we require r to be even in order to apply Simpson's rule.
∑_{i=1}^{r/2} (2h/6)(f(x_{2i−2}) + 4 f(x_{2i−1}) + f(x_{2i}))
The subscripts here ensure that the endpoints always have even index (0, 2, 4, 6, etc.) and the midpoints always have odd index (1, 3, 5, etc.). If we expand this out, we see that the interior endpoints are each used twice, while the first and last points are used once.
= (h/3)[(f(x0) + 4 f(x1) + f(x2)) + (f(x2) + 4 f(x3) + f(x4)) + ...]
= (h/3)[f(x0) + 4 f(x1) + 2 f(x2) + 4 f(x3) + 2 f(x4) + ...]

We can generalize this with summation notation:
(h/3)[f(a) + 4 ∑_{i=1}^{r/2} f(x_{2i−1}) + 2 ∑_{i=1}^{r/2−1} f(x_{2i}) + f(b)]
This is the form you would have learned in Calculus II.
Error bound: The same relationship happens here as in Trapezoidal, resulting in an error bound of −f′′′′(ξ)(b − a)h⁴/180, an O(h⁴) error. Since each Simpson subinterval is 2h wide, there are (b − a)/(2h) of them, so when we sum the error terms we pick up an extra factor of 1/2 (turning the 90 into 180).
Still, Simpson's rule integrates a cubic exactly, so it has precision (degree of accuracy) 3.
Midpoint: ∑_{i=1}^r h f((x_{i−1} + x_i)/2) = h[f((x0 + x1)/2) + f((x1 + x2)/2) + ... + f((x_{r−1} + x_r)/2)]
This is the form you would have learned in Calculus II.
Error bound: Simplified in the same way as Trapezoidal: f′′(ξ)(b − a)h²/24, an O(h²) error. Still precision 1.
 Example 8.3 Determine ∫_1^2 x⁴ dx using each composite method, with r = 4 and h = 1/4.
Trapezoidal: ∑_{i=1}^4 (h/2)(f(x_{i−1}) + f(x_i)) = (1/8)(f(1) + 2 f(1.25) + 2 f(1.5) + 2 f(1.75) + f(2))
= (1/8)(1 + 2(5/4)⁴ + 2(3/2)⁴ + 2(7/4)⁴ + 2⁴) ≈ 6.34570
Simpson's: (h/3)(f(a) + 4 ∑_{k=1}^2 f(x_{2k−1}) + 2 ∑_{k=1}^1 f(x_{2k}) + f(b))
= (1/12)(f(1) + 4 f(1.25) + 2 f(1.5) + 4 f(1.75) + f(2))
= (1/12)(1 + 4(5/4)⁴ + 2(3/2)⁴ + 4(7/4)⁴ + 2⁴) ≈ 6.2005208
Midpoint: h ∑_{i=1}^4 f((x_{i−1} + x_i)/2) = (1/4)(f(1.125) + f(1.375) + f(1.625) + f(1.875)) ≈ 6.1272
Exact: ∫_1^2 x⁴ dx = [x⁵/5]_1^2 = (2⁵ − 1⁵)/5 = 31/5 = 6.2
Verify consistency with the error bounds:
f′(x) = 4x³, f′′(x) = 12x², f′′′(x) = 24x, f′′′′(x) = 24.
Trapezoidal: the maximum of f′′(ξ) on [1, 2] occurs at ξ = 2, f′′(2) = 48.
So, (−48/12)(1)(1/4)² = −0.25
Actual error: −0.1457, consistent.
Simpson's: the maximum of f′′′′(ξ) on [1, 2] is independent of ξ because it is a constant function.
So, (−24/180)(1)(1/4)⁴ = −0.0005208
Actual error: −0.0005208; the error bound equals the actual error exactly in this case because the derivative in our error bound is constant.
Midpoint: we use the same maximum as Trapezoidal, f′′(2) = 48.
So, (48/24)(1)(1/4)² = 0.125
Actual error: 0.0728, consistent. 
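The composite rules in this example can be sketched as follows (the helper names are ours; r is the number of subintervals, as in the text):

```python
def composite_trapezoid(f, a, b, r):
    h = (b - a) / r
    xs = [a + i * h for i in range(r + 1)]
    return h / 2 * (f(xs[0]) + 2 * sum(f(x) for x in xs[1:-1]) + f(xs[-1]))

def composite_simpson(f, a, b, r):
    # r must be even; h is the spacing between consecutive evaluation points
    h = (b - a) / r
    xs = [a + i * h for i in range(r + 1)]
    s = f(xs[0]) + f(xs[-1])
    s += 4 * sum(f(xs[i]) for i in range(1, r, 2))   # odd-indexed midpoints
    s += 2 * sum(f(xs[i]) for i in range(2, r, 2))   # even-indexed interior ends
    return h / 3 * s

def composite_midpoint(f, a, b, r):
    h = (b - a) / r
    return h * sum(f(a + (i + 0.5) * h) for i in range(r))

f = lambda x: x**4
print(composite_trapezoid(f, 1, 2, 4))  # ≈ 6.3457
print(composite_simpson(f, 1, 2, 4))    # ≈ 6.2005208
print(composite_midpoint(f, 1, 2, 4))   # ≈ 6.1272
```

The exact value is 31/5 = 6.2, so the signed errors match the example above.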

Accuracy: Accuracy is directly defined by the degree of precision, or order of accuracy of the
method. Trapezoidal: 1, Simpson’s: 3, Midpoint: 1.

Efficiency: The efficiency of these methods is defined by the number of function evaluations and the number of arithmetic operations (addition, subtraction, multiplication, and/or division). The efficiency of the methods is reasonably similar, but directly related to the number of points used to build the method. Here, greater accuracy per point translates into greater efficiency.
Stability: Numerical integration is fairly stable, and more stable than numerical differentiation because, as you may have noticed, as h → 0 we are multiplying by small numbers instead of dividing by them. With numerical differentiation we had to be careful because if h was too small we would incur roundoff error. In integration, this is no longer a computational concern.

8.3 Gaussian Quadrature


For those of you wondering "Where the heck does Midpoint come from?" and/or "Why is Midpoint better than Trapezoidal?", Gaussian quadrature is your answer.
Newton-Cotes quadrature methods were closed because they included the endpoints and used fixed subinterval widths, so they require equispaced data.
Gaussian quadrature is derived by seeking the 'best' points we can use to integrate a function as precisely as possible. Are there optimal points on the interval that will approximate our integral 'best'?
Gaussian quadrature points are not equispaced, and are derived to integrate exactly a polynomial of degree 2n − 1 (thus they have precision 2n − 1), where n is the number of points used in the interval. To gain precision 2n − 1, we need 2n conditions (n points and n weights).
 Example 8.4 n = 1 (1-point Gaussian quadrature)
We seek to integrate a linear function exactly (2(1) − 1 = 1): f(x) = c0 + c1 x.
We seek the approximation form ∑_{i=1}^n a_i f(x_i), so we seek a1 and x1 such that ∫_a^b f(x) dx = a1 f(x1).
∫_a^b f(x) dx = ∫_a^b c0 dx + ∫_a^b c1 x dx = c0 (b − a) + c1 (b² − a²)/2 = a1 (c0 + c1 x1)
Separating the pieces by c0 and c1 terms:
c0 (b − a) = a1 c0 → a1 = b − a, so the weight is given by the width of the interval.
c1 (b² − a²)/2 = a1 c1 x1 → x1 = (b² − a²)/(2(b − a)) = (b + a)/2. (Note: c1 appears on both sides of the equation, so we can solve for x1 directly.) The point is given by the midpoint of the interval.
These are the definitions of our Midpoint rule. 

Using a monomial basis, we can derive our quadrature points and weights in this manner. However, there is a better way: Legendre polynomials!
Legendre polynomials form an orthogonal basis of polynomials on [−1, 1], meaning
∫_{−1}^1 φ_j(x) φ_k(x) dx = 0 if j ≠ k, and 2/(2j + 1) if j = k.
The functions yield zero when integrated against each other, but when integrated against themselves they yield a positive constant.
In order to apply Gaussian quadrature using Legendre polynomials, we need to map [a, b] onto [−1, 1]. If we assume t ∈ [a, b] and x ∈ [−1, 1], then we can define a mapping through:
x = (2/(b − a))(t − (b + a)/2)
t = ((b − a)/2) x + (b + a)/2
We can verify these transformations:
if t = a, then x = (2/(b − a))((a − b)/2) = −1 ✓
if t = b, then x = (2/(b − a))((b − a)/2) = 1 ✓
Similarly, recall u-substitution from Calculus I or II (or the Jacobian for these transformations in higher dimensions, in Calculus III): dt/dx = (b − a)/2 modifies the integral:
∫_a^b f(t) dt = ∫_{−1}^1 f(t(x)) (dt/dx) dx = ∫_{−1}^1 f(t(x)) ((b − a)/2) dx
Our Legendre polynomials are given by
φ0(x) = 1
φ1(x) = x
φ2(x) = (3x² − 1)/2
φ3(x) = (5x³ − 3x)/2
etc.; in general they can be defined through the recursion relation
φ_{j+1}(x) = ((2j + 1)/(j + 1)) x φ_j(x) − (j/(j + 1)) φ_{j−1}(x)
If f(x) is a polynomial of degree less than 2n, then the approximation will be exact (no longer an approximation, really): ∫_{−1}^1 f(x) dx = ∑_{i=1}^n a_i f(x_i).
We can define the x_i's more generally through the zeros of the Legendre polynomials: for n points, we use the zeros of the Legendre polynomial φ_n(x).
The a_i's can be defined through the derivatives of the Legendre polynomials:
a_i = 2 / ((1 − x_i²)(φ′_n(x_i))²)
When n = 1: φ1(x) = x equals zero only when x = 0. So x1 = 0, and then
a1 = 2/((1 − 0²)(1)²) = 2
∫_{−1}^1 f(x) dx ≈ 2 f(0). This is the Midpoint rule on [−1, 1].
For n = 2: φ2(x) = (3x² − 1)/2 equals zero at x = ±√(1/3), so x1 = −√(1/3), x2 = √(1/3).
Using these, we determine their weights (φ′2(x) = 3x):
a1 = 2/((1 − 1/3)(3√(1/3))²) = 2/((2/3)(3)) = 1
a2 = a1 = 1; since the roots are symmetric and the values are squared, both weights are the same.
When n = 3: φ3(x) = (5x³ − 3x)/2 equals zero when x = 0 or x = ±√(3/5). We order them from negative to positive: x1 = −√(3/5), x2 = 0, x3 = √(3/5). With φ′3(x) = (15x² − 3)/2:
a1 = 2/((1 − 3/5)((15/2)(3/5) − 3/2)²) = 2/((2/5)(9)) = 5/9
a2 = 2/((1 − 0)((15/2)(0) − 3/2)²) = 2/(9/4) = 8/9
a3 = a1 = 5/9
Note: since Gauss-Legendre quadrature uses the same points and weights every time, you do not need to recalculate them. They depend only on the value of n, and will be the same each time.
 Example 8.5 ∫_1^2 t⁴ dt with n = 3 Gaussian quadrature points.
First, map [1, 2] onto [−1, 1]:
x = (2/(2 − 1))(t − (1 + 2)/2) = 2t − 3,  t = (x + 3)/2
∫_1^2 t⁴ dt = ∫_{−1}^1 ((x + 3)/2)⁴ (1/2) dx. Exact value: 31/5 = 6.2.
Using Gaussian quadrature with three points, x1 = −√(3/5), x2 = 0, x3 = √(3/5) and a1 = a3 = 5/9, a2 = 8/9:
∫_{−1}^1 (1/2)((x + 3)/2)⁴ dx ≈ (5/9)(1/2)((3 − √(3/5))/2)⁴ + (8/9)(1/2)(3/2)⁴ + (5/9)(1/2)((3 + √(3/5))/2)⁴
≈ 6.1999999955 (rounding error from the calculator; the result should be exact)


Gaussian quadrature error is bounded by the 2nth derivative. So, when n = 3 it is bounded by the 6th derivative, which makes it exact for a polynomial of degree 5, or more generally degree 2n − 1.
The error term is generally defined through:
∫_{−1}^1 f(x) dx − ∑_{j=1}^n a_j f(x_j) = [2^{2n+1} (n!)⁴ / ((2n + 1)((2n)!)³)] f^{(2n)}(ξ)
So, the error for our example is given by f^{(6)}(ξ)/15,750 = 0.
Accuracy: Gaussian quadrature has precision (degree of accuracy) 2n − 1.
Efficiency: Gaussian quadrature is similarly efficient in implementation. However, deriving the quadrature points and weights is not efficient. If the points and weights are already known and easily accessible for implementation, the method is quite efficient.
Stability: Gaussian quadrature is similarly stable to the other techniques.
Gaussian quadrature ensures better accuracy while minimizing the number of points required to gain that accuracy. None of Trapezoidal, Simpson's, or Midpoint can integrate a polynomial of degree 5 exactly.
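A minimal sketch of three-point Gauss-Legendre quadrature on a general [a, b], using the nodes and weights derived above (the function name gauss3 is ours):

```python
import math

# three-point Gauss-Legendre nodes and weights on [-1, 1] (zeros of phi_3)
nodes = [-math.sqrt(3 / 5), 0.0, math.sqrt(3 / 5)]
weights = [5 / 9, 8 / 9, 5 / 9]

def gauss3(f, a, b):
    # map [-1, 1] -> [a, b]: t = ((b - a)/2) x + (b + a)/2, dt = ((b - a)/2) dx
    half = (b - a) / 2
    mid = (b + a) / 2
    return half * sum(w * f(half * x + mid) for w, x in zip(weights, nodes))

print(gauss3(lambda t: t**4, 1, 2))  # ≈ 6.2, exact for degree <= 5
```

In double precision the result agrees with 31/5 = 6.2 to machine accuracy, confirming the hand computation in Example 8.5.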

8.4 Adaptive Quadrature


Suppose you have a function that is relatively flat in one region and changing much more rapidly in another.
When the function is slowly varying, we don't need small h. In fact, it seems quite a waste of computational resources to make h small there. However, where the function is steep (or changing more rapidly), we will want smaller h.
Adaptive quadrature is a technique that allows us to approximate the integral within a specified tolerance, dividing up the interval only as needed to reach that tolerance.
We will focus on adaptive quadrature using Simpson's rule (although this concept can be extended using any integration technique).
The basic idea is to build approximations of the same type on the interval [a, b] and then on the two halves [a, (a + b)/2] and [(a + b)/2, b]. Compare the two approximations; if they are within tolerance of each other, use the better approximation, because it is closer to the actual value anyway. If not, subdivide further:
[a, b]
→ [a, (a + b)/2], [(a + b)/2, b]
→ [a, (3a + b)/4], [(3a + b)/4, (a + b)/2], [(a + b)/2, (a + 3b)/4], [(a + 3b)/4, b]
We have generally measured our error functionally (through a term based on the function, h, and the method used). We cannot calculate the error in this manner with adaptive quadrature because there is no fixed h value; we need to define our error estimate independent of the function and a fixed h.
Simpson's rule on [a, b]:
S1 = ((b − a)/6)[f(a) + 4 f((a + b)/2) + f(b)]
Simpson's rule on the two halves:
S2 = ((b − a)/12)[f(a) + 4 f((3a + b)/4) + 2 f((a + b)/2) + 4 f((a + 3b)/4) + f(b)]
Given the error bound of Simpson's rule, we know that S2 has 1/16th the error of S1 (halving the width reduces the O(h⁵) basic error by 2⁵ = 32, but there are two subintervals, for a net factor of 16).
So, if the exact integral I is equivalent to each approximation plus its respective error, we can define an error measurement:
I = S1 + E1 = S2 + E2
E2 = (1/16) E1
So, S2 − S1 = E1 − E2 = (15/16) E1
E1 ≈ (16/15)(S2 − S1) = 16 E2
E2 ≈ (1/15)(S2 − S1)
We use this to approximate our actual error E2 in each subinterval. If E2 < tol, we can stop
dividing that subinterval.
We sum these approximations to efficiently approximate the integral within the accuracy needed, with the minimum number of subintervals.
Notes:
This technique can perform poorly if f′′′′(x) varies rapidly. We compensate by generally using (1/10)(S2 − S1) rather than (1/15)(S2 − S1) as the error estimate, to ensure we are within tolerance.

This method is not guaranteed, and as such, any questionable results should be checked, or
investigated further.
Accuracy: Adaptive quadrature is built to ensure required accuracy, as efficiently as possible.
Efficiency: Adaptive quadrature is constructed to optimize the efficiency of the approximation over the whole interval. Since it is designed to use fewer points on regions with less change (shallower) and more points on regions with more change (steeper), it will use fewer points than a method with a fixed number of points over the whole interval.
Stability: As mentioned above, if the 4th derivative varies rapidly, this method can perform poorly. It is also not guaranteed to converge: stability is not guaranteed.
Adaptive quadrature is useful when we need to prioritize efficiency over accuracy and stability. One nice feature is that it can be done with any standard method.
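A recursive sketch of adaptive Simpson's rule, using the (S2 − S1)/10 safety criterion mentioned above (the function names are ours, and the test integrand sin x is an assumed example, not from the text):

```python
import math

def simpson(f, a, b):
    return (b - a) / 6 * (f(a) + 4 * f((a + b) / 2) + f(b))

def adaptive_simpson(f, a, b, tol):
    m = (a + b) / 2
    s1 = simpson(f, a, b)
    s2 = simpson(f, a, m) + simpson(f, m, b)
    # (S2 - S1)/15 estimates the error in S2; /10 is the safety factor from the text
    if abs(s2 - s1) / 10 < tol:
        return s2 + (s2 - s1) / 15   # adding the estimate improves the result
    # otherwise subdivide, splitting the tolerance between the halves
    return adaptive_simpson(f, a, m, tol / 2) + adaptive_simpson(f, m, b, tol / 2)

print(adaptive_simpson(math.sin, 0, math.pi, 1e-8))  # ≈ 2.0
```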

8.5 Romberg Integration


Romberg integration applies a similar extrapolation (as in Richardson extrapolation), but to integrals. We will apply it using the Trapezoidal rule.
Consider the trapezoidal rule with 1, 2, and 4 subintervals:
R1 = ((b − a)/2)(f(a) + f(b))
R2 = ((b − a)/4)[f(a) + 2 f((a + b)/2) + f(b)] = (1/2) R1 + ((b − a)/2) f((a + b)/2)
R3 = ((b − a)/8)[f(a) + 2 f((3a + b)/4) + 2 f((a + b)/2) + 2 f((a + 3b)/4) + f(b)]
   = (1/2) R2 + ((b − a)/4)[f((3a + b)/4) + f((a + 3b)/4)]
The error term of the composite trapezoidal rule for a C∞[a, b] function f(x) has the form
I − R = k1 h² + ∑_{i=2}^∞ k_i h^{2i},
where each k_i is formed from f^{(2i−1)}(a) and f^{(2i−1)}(b). This allows us to combine approximations to cancel terms and form higher order errors O(h^p).
If we define h = b − a:
I − R1 = k1 h² + k2 h⁴ + k3 h⁶ + ...
I − R2 = k1 (h/2)² + k2 (h/2)⁴ + k3 (h/2)⁶ + ...
If we combine the two to cancel the k1 term, we increase the order of our accuracy:
(I − R1) − 4(I − R2) = (3/4) k2 h⁴ + (15/16) k3 h⁶ + ...
Solving for I in this equation yields
I = (4R2 − R1)/3 − (1/4) k2 h⁴ − (5/16) k3 h⁶ − ...,
a result with O(h⁴) error.
If we repeat this process, we can continue to improve the accuracy of our approximation.
The notation for this extrapolation is R2,2 = R2,1 + (R2,1 − R1,1)/3.
3
The first subscript denotes the step size chosen: 1 → h, 2 → h/2, 3 → h/4, 4 → h/8, etc.
The second subscript denotes the steps of extrapolation: 1 → initial approximation, 2 → after one step of extrapolation, 3 → after two steps of extrapolation, etc.
These are easier to work with in a tabular format, with the extrapolation generally defined through
R_{j+1,k} = R_{j+1,k−1} + (R_{j+1,k−1} − R_{j,k−1})/(4^{k−1} − 1)  for k = 2, 3, ..., j + 1.
Ex: Apply Romberg integration to the following integral: ∫_1^2 (1/x) dx
 
R1,1 = (1/2)(1 + 1/2) = 3/4 = 0.75
R2,1 = (1/4)(1 + 2(2/3) + 1/2) ≈ 0.708333
R3,1 = (1/8)(1 + 2(4/5) + 2(2/3) + 2(4/7) + 1/2) ≈ 0.697023
R4,1 = (1/16)(1 + 2(8/9) + 2(4/5) + 2(8/11) + 2(2/3) + 2(8/13) + 2(4/7) + 2(8/15) + 1/2) ≈ 0.694122

i   Ri,1       Ri,2       Ri,3       Ri,4
1   0.75
2   0.708333   0.694444
3   0.697023   0.693253   0.693174
4   0.694122   0.693155   0.693149   0.693148

Actual value: ln 2 ≈ 0.693147
Significant improvement in the approximation. However, if h is large this can be very bad.
Since we applied trapezoidal rule, the error will carry through. So, more extrapolation does not
always equal better results.
Accuracy: Romberg integration is specifically used to improve accuracy.
Efficiency: Romberg integration can be more efficient than using more data points, but is not
particularly efficient itself.
Stability: Romberg integration is not particularly stable, more iterations can gain accuracy, but
it is not guaranteed.
Romberg integration is best used if you cannot take more data, but need to gain additional
accuracy. Otherwise, it is likely not worth the added effort.
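The Romberg table can be sketched as follows (the function name is ours; the refinement step reuses the previous trapezoid value and adds only the new midpoints, as in the R2 and R3 identities above):

```python
import math

def romberg(f, a, b, levels):
    R = [[0.0] * levels for _ in range(levels)]
    h = b - a
    R[0][0] = h / 2 * (f(a) + f(b))
    for j in range(1, levels):
        h /= 2
        # trapezoid refinement: halve the old value and add the new midpoints
        R[j][0] = R[j - 1][0] / 2 + h * sum(
            f(a + (2 * i - 1) * h) for i in range(1, 2 ** (j - 1) + 1))
        for k in range(1, j + 1):
            # Richardson extrapolation across the row
            R[j][k] = R[j][k - 1] + (R[j][k - 1] - R[j - 1][k - 1]) / (4 ** k - 1)
    return R

R = romberg(lambda x: 1 / x, 1, 2, 4)
print(R[3][3])  # ≈ 0.693148, vs ln 2 ≈ 0.693147
```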

8.6 Multidimensional Integration


Recall double integrals from Calculus III: ∫_a^b ∫_c^d f(x, y) dy dx
We integrate from the inside out (integrate with respect to y first, then with respect to x in the above integral). We can do this operation numerically in the same order, but we don't have to.
Using composite Simpson's rule in both directions, a double integral takes the form
∫_a^b ∫_c^d f(x, y) dy dx ≈ (h_x h_y / 9) ∑_{j=0}^n ∑_{k=0}^m w_{jk} f(x_j, y_k) + E_s
where E_s is the error term.
Example: Use composite Simpson's rule along x and y to define the numerical integration of a double integral.
Each x_j = a + j h_x, y_k = c + k h_y.
The weights from composite Simpson's rule in 1D were 1, 4, 2, 4, 2, 4, ..., 2, 4, 1.
In 2D, we take multiplicative combinations of these weights:

      y0   y1   y2   y3   ...  ym
x0    1    4    2    4    ...  1
x1    4    16   8    16   ...  4
x2    2    8    4    8    ...  2
x3    4    16   8    16   ...  4
...
xn    1    4    2    4    ...  1
This matrix corresponds to the w in the summation above, and each entry w jk corresponds to
the weight for the associated point (x j , yk ).
8.6 Multidimensional Integration 95

We can extend any 1D numerical integration technique in this manner: Trapezoidal, Simpson’s,
Gaussian (requires a 2D transform as in Calculus III to map the domain appropriately in both
dimensions), Adaptive, etc.
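A sketch of 2D composite Simpson's rule using the product weights above (the function names are ours, and the integrand xy is an assumed example, with exact integral 1/4 on the unit square):

```python
def simpson_weights(n):
    # 1, 4, 2, 4, ..., 2, 4, 1 for n subintervals (n must be even)
    w = [4 if i % 2 == 1 else 2 for i in range(n + 1)]
    w[0] = w[-1] = 1
    return w

def simpson2d(f, a, b, c, d, n, m):
    hx, hy = (b - a) / n, (d - c) / m
    wx, wy = simpson_weights(n), simpson_weights(m)
    total = 0.0
    for j in range(n + 1):
        for k in range(m + 1):
            # the 2D weight w_jk is the product of the 1D Simpson weights
            total += wx[j] * wy[k] * f(a + j * hx, c + k * hy)
    return hx * hy / 9 * total

print(simpson2d(lambda x, y: x * y, 0, 1, 0, 1, 2, 2))  # 0.25, exact
```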
9. Linear Algebra

If you feel weak in your Linear Algebra skills, go back to your textbook and notes. Then, bring
questions!

This chapter overviews key concepts from linear algebra that we use in the remainder of the
course.

9.1 Linear Algebra Basics


Solve a linear system:
a11 x1 + a12 x2 = b1
a21 x1 + a22 x2 = b2
 Example 9.1
3x1 + 2x2 = 5
2x1 − 4x2 = −3
Convert to a linear system:
[3   2] [x1]   [ 5]
[2  −4] [x2] = [−3]
Reduce the matrix using Gaussian elimination, R2 ← R2 − (2/3) R1:
[3     2  ] [x1]   [   5 ]
[0  −16/3 ] [x2] = [−19/3]
x2 = 19/16
x1 = 7/8
~x = [7/8, 19/16]^T
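The elimination in Example 9.1 can be sketched with exact rational arithmetic (this is our own sketch, using Python's fractions module):

```python
from fractions import Fraction as F

# the 2x2 system from Example 9.1, stored exactly
A = [[F(3), F(2)], [F(2), F(-4)]]
b = [F(5), F(-3)]

m = A[1][0] / A[0][0]          # multiplier for R2 <- R2 - (2/3) R1
A[1][1] -= m * A[0][1]         # A[1][1] becomes -16/3
b[1] -= m * b[0]               # b[1] becomes -19/3

# back substitution
x2 = b[1] / A[1][1]
x1 = (b[0] - A[0][1] * x2) / A[0][0]
print(x1, x2)  # 7/8 19/16
```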
Matrix-vector operations

• Equality: A = B if and only if Aij = Bij for all indices i, j.
• Addition: A + B is possible if and only if A and B are the same size, in R^{n×m}. In that case, sum term-by-term: (A + B)ij = Aij + Bij.
• Scalar multiplication: λA multiplies each entry of the matrix by λ: (λA)ij = λAij for all i, j.
• Multiplication: A × B = C requires the inner matrix dimensions to match: A_{n×m} × B_{m×p} = C_{n×p}. Note: each entry takes m multiplications and m − 1 additions, and the result is n × p, so the computation takes (2m − 1)np operations, which is O(n³) for square matrices.
• Identity matrix: Iij = 1 if i = j, and Iij = 0 if i ≠ j: a diagonal matrix of ones.
• Transpose: If A = {Aij}, then A^T = {Aji}. The columns become rows, and the rows become columns. If A^T = A, then A is symmetric.
(A^T)^T = A, (AB)^T = B^T A^T, (A + B)^T = A^T + B^T, and if A^{−1} exists then (A^{−1})^T = (A^T)^{−1}.
Vocabulary
Recall the determinant det(A):
For a 2 × 2 matrix
[a  b]
[c  d]
the determinant is given by ad − bc.
For a 3 × 3 matrix
[a  b  c]
[d  e  f]
[g  h  k]
the determinant is given by a(ek − fh) − b(dk − fg) + c(dh − eg).
Note: the determinant (computed by cofactor expansion) is too computationally expensive, O(n!) operations, and will not be used numerically.
Properties of the determinant |A|:
A row swap changes the sign of det(A).
Scalar multiplication: det(λA) = λ^n det(A).
det(AB) = det(A) det(B)
det(A^T) = det(A)
When A^{−1} exists, det(A^{−1}) = 1/det(A).
Singularity: det(A) = 0; this happens, for example, if any row (or column) equals a constant multiple of another row (or column).
For a matrix to be non-singular, all columns (and rows) must be linearly independent. This means that no row (or column) can be written as a linear combination of the other rows (or columns).
Example 9.2 The system:
3x1 + 6x2 − x3 = 5
6x1 + 12x2 − 2x3 = 10
4x1 + 2x2 − 3x3 = 3
converts to
[3  6   −1] [x1]   [ 5]
[6  12  −2] [x2] = [10]
[4  2   −3] [x3]   [ 3]
det(A) = 3(−36 + 4) − 6(−18 + 8) + (−1)(12 − 48) = −96 + 60 + 36 = 0
The system is singular (the second row is twice the first). 

Range: range(A) is defined to be all the vectors that can be written as a linear combination of
the columns of A, ~y = A~x.

Null Space: null(A) is defined to be all vectors~z ∈ Rm such that A~z = ~0.

A~x = ~b will not have a solution unless ~b ∈ range(A).


9.2 Vector and Matrix Norms 99

If A is singular and ~b ∈ range(A), there will be infinitely many solutions (the system is underdetermined, and the null space contains more than just ~0); if ~b ∉ range(A), there is no solution.

Equivalent statements:

• A is nonsingular
• det(A) 6= 0
• Columns of A are linearly independent
• Rows of A are linearly independent
• A−1 exists such that A−1 A = I
• range(A) = Rn
• null(A) = {~0}
Eigenvalues: A~x = λ~x

Specific constants that yield the same vector as the matrix multiplication.
In linear algebra, this was solved by determining the values λ when det(λ I − A) = 0.

9.2 Vector and Matrix Norms

Norms will allow us to define error in a linear system. Now that we’re working with vectors instead
of scalars, we need a new measurement.

A vector norm, || · ||a is a function that maps Rn → R where


• ||~x||a ≥ 0 ∀~x ∈ Rn
• ||~x||a = 0 if and only if ~x = ~0
• ||α~x||a = |α|||~x||a ∀α ∈ R, ~x ∈ Rn
• ||~x +~y||a ≤ ||~x||a + ||~y||a ∀~x,~y ∈ Rn (triangle inequality)
Common norms (` p norms):
1/2
• `2 norm ||~x||2 = ∑ni=1 |xi |2
This is your distance formula, also known as the “Euclidean norm."
Larger values then have more weight, the graphical representation of this norm is a unit
circle, all vectors of unit length from the origin.
• `∞ norm ||~x||∞ = max1≤i<∞ |xi |
The largest of all values in the vector, also known as the “maximum norm."
The graphical representation of this norm is a unit square, all vectors from the origin with at
least one ‘1’ in the vector.
• `1 norm ||~x||1 = ∑ni=1 |xi |
Sum of the magnitude of each entry.
The graphical representation of this norm is a diamond: the sum of the component magnitudes
equals 1.
• General ` p norm ||~x|| p = (∑ni=1 |xi | p )1/p
[Figure: unit balls of the `2 (circle), `∞ (square), and `1 (diamond) norms in R2 .]

 9.3 Determine the `2 , `∞ , and `1 norms of
Example
1
 −4 
~x =   5 

−17
p
||~x||2 = 12 + (−4)2 + 52 + (−17)2 ≈ 18.1939
||~x||∞ = 17
||~x||1 = |1| + | − 4| + |5| + | − 17| = 27


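These norms can be checked in Python with numpy (a quick verification of Example 9.3; `numpy.linalg.norm` is the standard routine):

```python
import numpy as np

x = np.array([1.0, -4.0, 5.0, -17.0])

# ord=2 (the default) is the Euclidean norm, ord=np.inf the maximum
# norm, and ord=1 the sum of absolute values.
norm2 = np.linalg.norm(x)
norm_inf = np.linalg.norm(x, np.inf)
norm1 = np.linalg.norm(x, 1)

print(norm2, norm_inf, norm1)   # sqrt(331) ≈ 18.1934, 17.0, 27.0
```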
Cauchy-Schwarz Inequality:
For any two vectors ~x,~y ∈ Rn
|~xT~y| ≤ ||~x||2 ||~y||2
When we used iterative methods before, we defined a tolerance as our stopping criteria. For
vectors, we will have to use a norm to determine when we reach tolerance, using the distance
between two vectors ||~x −~y|| p ≤ tol.
Matrix norms (not as intuitive)
A matrix norm specifies the size of A, and it is induced by the vector norm - it is only defined
through its relationship with a vector.
If || · ||a is a vector norm on Rn , then ||A||a = max||~x||a =1 ||A~x||a is a matrix norm.
Properties of a matrix norm: Given A, B, and a scalar α
• ||A|| ≥ 0
• ||A|| = 0 if and only if A = 0
• ||αA|| = |α|||A||
• ||A + B|| ≤ ||A|| + ||B||
• ||AB|| ≤ ||A||||B||
Our common norms:
||A||1 = max1≤ j≤n ∑ni=1 |ai j | (max column sum)
||A||∞ = max1≤i≤n ∑nj=1 |ai j | (max row sum)
||A||2 = (ρ(AT A))1/2 , where ρ is the spectral radius (largest eigenvalue in magnitude) of the given matrix.

For any natural norm, the matrix norm is induced by the vector norm, as above.
 Example 9.4 For

    [ 3  2  5 ]
A = [ 1  4  7 ]
    [ 2 −1  3 ]

||A||1 = max(6, 7, 15) = 15
||A||∞ = max(10, 12, 6) = 12
||A||2 = √107.6389 = 10.3749 
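`numpy.linalg.norm` computes the induced matrix norms as well; a check of Example 9.4:

```python
import numpy as np

A = np.array([[3.0, 2.0, 5.0],
              [1.0, 4.0, 7.0],
              [2.0, -1.0, 3.0]])

n1 = np.linalg.norm(A, 1)         # max column sum
ninf = np.linalg.norm(A, np.inf)  # max row sum
n2 = np.linalg.norm(A, 2)         # sqrt of the spectral radius of A^T A

# The 2-norm agrees with the spectral-radius definition.
rho = max(abs(np.linalg.eigvals(A.T @ A)))
print(n1, ninf, n2)               # 15.0 12.0 ≈ 10.375
```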
10. Direct Methods for Solving Linear Systems

This chapter is all about solving the system A~x = ~b. We assume A is real, nonsingular, and square
(n × n).
To solve this system, we will discuss two classes of methods.
Direct methods: solve a system in a finite number of steps (often the same way you would by
hand). Total error = roundoff error only.
Iterative methods: Repeated process until a specified accuracy is met. May give a satisfactory
approximation in a finite number of steps, but the process can be repeated indefinitely. Total error
= roundoff error + truncation error (from stopping the method).
We will start with direct methods, first.

10.1 Gaussian Elimination and Backward Substitution


Gaussian Elimination: Reduces a matrix to its upper-triangular equivalent.

Backward substitution: solves the system given an upper-triangular matrix.

Gaussian Elimination: Form the augmented matrix and apply row reduction systematically to
reduce the system to upper-triangular form.

 Example 10.1 Apply Gaussian elimination to the following system:


x1 + x2 − 3x3 = −1
2x1 + 4x2 − x3 = 5
4x1 + x2 + 2x3 = 7
Convert to an augmented matrix
 
R1 1 1 −3 −1
R2  2 4 −1 5 
R3 4 1 2 7

R∗2 = R2 − 2R1
R∗3 = R3 − 4R1
 
1 1 −3 −1
 0 2 5 7 
0 −3 14 11
R∗2 = (1/2)R2
R∗3 = R3 + 3R∗2

[ 1  1  −3   | −1   ]
[ 0  1  5/2  | 7/2  ]
[ 0  0  43/2 | 43/2 ]

R∗3 = (2/43)R3
 
1 1 −3 −1
 0 1 5/2 7/2 
0 0 1 1
We will only go to REF (Row Echelon Form) at most in our code; it is computationally
unnecessary to carry Gaussian elimination all the way to RREF (Reduced Row Echelon Form).
Backward Substitution:
We use backward substitution to solve the system after it is reduced to upper-triangular form.
In this example, we can solve for x3 = 1 immediately.
x2 + (5/2)x3 = 7/2 → x2 + 5/2 = 7/2 → x2 = 1
x1 + x2 − 3x3 = −1 → x1 + 1 − 3 = −1 → x1 = 1

~x = [1, 1, 1]T


Since we will be programming this, we will keep a fixed form for the matrix, and follow a
specific manner to perform Gaussian elimination. There will not be conditionals in the basic code.
So, what can go wrong?
• If aii = 0, we will have a zero on the diagonal (pivot element). In this case, we need to swap
rows to ensure there are no zeros along the diagonal. (More on this later)
• If a pi = 0 for all 1 ≤ i ≤ n (row p of A is all zeros), then there is no unique solution. The system has
infinitely many solutions if a p(n+1) = 0, and no solution if a p(n+1) ≠ 0.
We use three elementary row operations to perform Gaussian elimination:
• Multiplication by a constant
• Linear combination of rows
• Row swaps
We systematically eliminate the lower off-diagonal terms, column by column.
For a system

[ a11  a12  a13  . . .  a1n | b1 ]
[ a21  a22  a23  . . .  a2n | b2 ]
[ a31  a32  a33  . . .  a3n | b3 ]
[  .    .    .   . . .   .  | .  ]
[ an1  an2  an3  . . .  ann | bn ]

we eliminate the lower entries in the first column through:
R2(1) = R2 − (a21 /a11 )R1 ,  R3(1) = R3 − (a31 /a11 )R1 ,  etc.

We then continue the process systematically to the right:
R3(2) = R3(1) − (a32(1) /a22(1) )R2(1) ,  R4(2) = R4(1) − (a42(1) /a22(1) )R2(1) ,  etc.
We repeat this process until we have an upper-triangular augmented matrix

[ a11  a12     a13     . . .  a1n      | b1      ]
[ 0    a22(1)  a23(1)  . . .  a2n(1)   | b2(1)   ]
[ 0    0       a33(2)  . . .  a3n(2)   | b3(2)   ]
[ .    .       .       . . .  .        | .       ]
[ 0    0       0       . . .  ann(n−1) | bn(n−1) ]
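A minimal Python sketch of this elimination loop (no pivoting yet, so it assumes every pivot akk is nonzero; the function name is ours):

```python
import numpy as np

def gaussian_eliminate(A, b):
    """Reduce the augmented matrix [A | b] to upper-triangular form."""
    Ab = np.column_stack([A, b]).astype(float)
    n = len(b)
    for k in range(n - 1):            # current pivot column
        for i in range(k + 1, n):     # rows below the pivot
            m = Ab[i, k] / Ab[k, k]   # multiplier a_ik / a_kk
            Ab[i, k:] -= m * Ab[k, k:]
    return Ab

# The system from Example 10.1.
A = np.array([[1, 1, -3], [2, 4, -1], [4, 1, 2]])
b = np.array([-1, 5, 7])
U = gaussian_eliminate(A, b)
print(U)
```

Without the row normalizations used in the worked example, the final row here is [0, 0, 43/2 | 43/2]; the solution is of course unchanged.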

10.1.1 Implementation Concerns


Speed and memory requirements.
Speed: Gaussian elimination requires (2/3)n3 + O(n2 ) operations, or overall O(n3 ) operations.
This is very expensive computationally for large n, which is why we don't reduce all the way to
the identity matrix when solving systems with Gaussian elimination. Backward substitution is much
cheaper computationally, O(n2 ).
Memory/storage: Gaussian elimination requires storing a n × (n + 1) matrix. This is less of an
issue today, especially with computational packages. It is best implemented by replacing entries as
you go, rather than storing more than one matrix.

10.1.2 Backward Substitution


As in the earlier example, an upper triangular matrix allows you to solve from the bottom row up.
xk = (bk − ∑nj=k+1 ak j x j ) / akk ; recall its implementation in our first example.
This requires all akk ’s to be nonzero.
Computational cost: O(n2 ) operations, which is vastly preferred over Gaussian Elimination
(O(n3 ) operations).
Forward Substitution is the flip of this for a lower triangular matrix, similar operations are
performed going down the rows (rather than up).
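Both substitutions are a few lines in Python (a sketch; the function names are ours, and nonzero diagonals are assumed):

```python
import numpy as np

def back_substitute(U, b):
    """Solve Ux = b for upper-triangular U, from the bottom row up."""
    n = len(b)
    x = np.zeros(n)
    for k in range(n - 1, -1, -1):
        x[k] = (b[k] - U[k, k + 1:] @ x[k + 1:]) / U[k, k]
    return x

def forward_substitute(L, b):
    """Solve Ly = b for lower-triangular L, from the top row down."""
    n = len(b)
    y = np.zeros(n)
    for k in range(n):
        y[k] = (b[k] - L[k, :k] @ y[:k]) / L[k, k]
    return y

# The reduced system from Example 10.1; the solution is [1, 1, 1].
U = np.array([[1.0, 1.0, -3.0], [0.0, 1.0, 2.5], [0.0, 0.0, 1.0]])
b = np.array([-1.0, 3.5, 1.0])
x = back_substitute(U, b)
print(x)
```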

10.1.3 Matrix Inverse


By hand, we compute the matrix inverse by applying Gaussian elimination to the matrix augmented
with an identity matrix:

[ a11  a12  a13 | 1 0 0 ]        [ 1 0 0 | (A−1 )11  (A−1 )12  (A−1 )13 ]
[ a21  a22  a23 | 0 1 0 ]   →    [ 0 1 0 | (A−1 )21  (A−1 )22  (A−1 )23 ]
[ a31  a32  a33 | 0 0 1 ]        [ 0 0 1 | (A−1 )31  (A−1 )32  (A−1 )33 ]
This is far too computationally expensive for us to do the same way numerically. However, we
can apply Gaussian elimination with backward substitution to this system in circumstances where
we need the inverse.
If we apply Gaussian elimination to form an upper-triangular matrix on the left, we can solve
for each column on the right:

[ a11  a12  a13 | 1 0 0 ]        [ a11  a12     a13    | b11  b12  b13 ]
[ a21  a22  a23 | 0 1 0 ]   →    [ 0    a22(1)  a23(1) | b21  b22  b23 ]
[ a31  a32  a33 | 0 0 1 ]        [ 0    0       a33(2) | b31  b32  b33 ]
We solve each column separately, to build A−1
 Example 10.2 Find the inverse of

    [ 1 1 −3 ]
A = [ 2 4 −1 ]
    [ 4 1  2 ]

[ 1 1 −3 | 1 0 0 ]
[ 2 4 −1 | 0 1 0 ]
[ 4 1  2 | 0 0 1 ]

Set R2(1) = R2 − 2R1 , R3(1) = R3 − 4R1

[ 1  1  −3 |  1 0 0 ]
[ 0  2   5 | −2 1 0 ]
[ 0 −3  14 | −4 0 1 ]

Set R3(2) = R3(1) + (3/2)R2(1)

[ 1 1 −3   |  1  0   0 ]
[ 0 2  5   | −2  1   0 ]
[ 0 0 43/2 | −7  3/2 1 ]

First column: back-substitute with right-hand side [1, −2, −7]T ; the first column of A−1 is [9/43, −8/43, −14/43]T .
Second column: right-hand side [0, 1, 3/2]T gives the second column of A−1 , [−5/43, 14/43, 3/43]T .
Third column: right-hand side [0, 0, 1]T gives the third column of A−1 , [11/43, −5/43, 2/43]T .
Check: AA−1 = A−1 A = I ✓


10.2 LU Decomposition
Gaussian elimination is a computationally expensive operation. LU decomposition is introduced
because if you need to solve the same system A~x = ~b for different ~b vectors, it offers a less
computationally expensive option. However, if you are only solving one linear system, it is slightly
more expensive than using Gaussian elimination directly.

Theorem 10.2.1 If Gaussian elimination can be performed on a system A~x = ~b without a row
interchange, then A can be factored into the product of a lower-triangular matrix L and an upper-triangular
matrix U.

We use Gaussian elimination to “split" A = LU, where L = lower triangular matrix, and U =
upper triangular matrix. This requires the same O(n3 ) operations as applying Gaussian elimination
to solve the system. However, by splitting the matrix we reduce the operations to solve the system
if we have more than one system with the same A matrix.
This allows us to solve the system A~x = ~b → LU~x = ~b by letting U~x =~y so that L~y = ~b.
We can then solve for ~y using a forward substitution, and then ~x with a backward substitution.
Each of these is O(n2 ) operations, which is going to be more computationally efficient than using
Gaussian elimination (O(n3 ) operations) for every new ~b.
How do we get L and U?

Build transformation matrices. We can perform Gaussian elimination through transformation


matrices.
 Example 10.3 For

    [ 1 2 3  ]
A = [ 4 5 6  ]
    [ 7 8 12 ]

to reduce the first column, R∗2 = R2 − 4R1 and R∗3 = R3 − 7R1 :

         [ 1  0 0 ] [ 1 2 3  ]   [ 1  2  3 ]
M(1) A = [ −4 1 0 ] [ 4 5 6  ] = [ 0 −3 −6 ]
         [ −7 0 1 ] [ 7 8 12 ]   [ 0 −6 −9 ]

Reducing the second column, R∗3 = R3 − 2R2 :

              [ 1  0 0 ] [ 1  2  3 ]   [ 1  2  3 ]
M(2) M(1) A = [ 0  1 0 ] [ 0 −3 −6 ] = [ 0 −3 −6 ] = U
              [ 0 −2 1 ] [ 0 −6 −9 ]   [ 0  0  3 ]

Since M(2) M(1) A = U, we have A = (M(1))−1 (M(2))−1 U, so

                          [ 1 0 0 ] [ 1 0 0 ]   [ 1 0 0 ]
L = (M(1))−1 (M(2))−1  =  [ 4 1 0 ] [ 0 1 0 ] = [ 4 1 0 ]
                          [ 7 0 1 ] [ 0 2 1 ]   [ 7 2 1 ]


Each of these transformation matrices is a modification to I to apply the appropriate combination


of rows.
Note: LU Decomposition is not unique.
 Example 10.4 Solve A~x = ~b for ~b = [6, 30, 81]T :

[ 1 2 3  ]      [ 6  ]
[ 4 5 6  ] ~x = [ 30 ]
[ 7 8 12 ]      [ 81 ]

Using L and U from above,

[ 1 0 0 ] [ 1  2  3 ]      [ 6  ]
[ 4 1 0 ] [ 0 −3 −6 ] ~x = [ 30 ]
[ 7 2 1 ] [ 0  0  3 ]      [ 81 ]

Solve L~y = ~b first:

[ 1 0 0 ]      [ 6  ]
[ 4 1 0 ] ~y = [ 30 ]
[ 7 2 1 ]      [ 81 ]

Forward substitution: y1 = 6; 4y1 + y2 = 30 → y2 = 6; 7y1 + 2y2 + y3 = 81 → y3 = 27.
Then solve U~x = ~y:

[ 1  2  3 ]      [ 6  ]
[ 0 −3 −6 ] ~x = [ 6  ]
[ 0  0  3 ]      [ 27 ]
Backward substitution: x3 = 27/3 = 9; −3x2 − 6x3 = 6 → −3x2 = 60 → x2 = −20; x1 + 2x2 + 3x3 = 6 → x1 = 19.

Check: A~x = [6, 30, 81]T = ~b ✓


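Using the L and U from Examples 10.3-10.4, the two-stage solve is just a forward pass followed by a backward pass (a sketch in plain numpy; `scipy.linalg.lu` would produce the factorization itself):

```python
import numpy as np

# L and U from Example 10.3; b from Example 10.4.
L = np.array([[1.0, 0.0, 0.0], [4.0, 1.0, 0.0], [7.0, 2.0, 1.0]])
U = np.array([[1.0, 2.0, 3.0], [0.0, -3.0, -6.0], [0.0, 0.0, 3.0]])
b = np.array([6.0, 30.0, 81.0])
n = len(b)

# Forward substitution: solve L y = b.
y = np.zeros(n)
for i in range(n):
    y[i] = (b[i] - L[i, :i] @ y[:i]) / L[i, i]

# Backward substitution: solve U x = y.
x = np.zeros(n)
for i in range(n - 1, -1, -1):
    x[i] = (y[i] - U[i, i + 1:] @ x[i + 1:]) / U[i, i]

print(y)   # [6, 6, 27]
print(x)   # [19, -20, 9]
```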
To apply Gaussian elimination in the manner we have so far assumes there are no zeros at pivot
points (diagonal terms). What do we do if there is a zero at a pivot?

If you solved the problem analytically in this scenario, you would swap rows. We can do that
numerically also, using permutation matrices. These are matrices where I has row swaps for the
rows that need to be swapped in A.
 Example 10.5 The permutation matrix

    [ 1 0 0 ]
P = [ 0 0 1 ]
    [ 0 1 0 ]

swaps rows 2 and 3:

     [ 1 0 0 ] [ 1 2 3  ]   [ 1 2 3  ]
PA = [ 0 0 1 ] [ 4 5 6  ] = [ 7 8 12 ]
     [ 0 1 0 ] [ 7 8 12 ]   [ 4 5 6  ]

This gives us a way to solve these problems that would otherwise break down by applying a
permutation matrix to both sides of the equation. A~x = ~b → PA~x = P~b.
Note: P−1 = PT

10.3 Pivoting Strategies


When a pivot has a zero, we will naturally want to swap rows, but what if the pivot is near zero?
How can we effectively and efficiently swap rows to avoid roundoff error?

First technique is partial pivoting, or maximal column pivoting.

To do this, we look down the current column and find the first entry of largest magnitude on or
below the diagonal; then we swap those rows.

For each kth column, find the smallest integer p, k ≤ p ≤ n, with |a pk | = maxk≤i≤n |aik |.
If p ≠ k, swap rows p and k, then move to the next column.

 Example 10.6 For a 2 × 2 system, this might look like

[ 3.333 × 10−4    1.000 |  2 ]
[ 1.000          −1.000 | −1 ]

If we simply apply Gaussian elimination as in the previous section (without seeing that the first
pivot is close to zero),

R∗2 = R2 − (1.000/(3.333 × 10−4 ))R1 = R2 − 3000R1

Applying 3-digit accuracy (truncating to simulate roundoff error), this simplifies to

[ 3.333 × 10−4   1.000 |  2    ]
[ 0             −3001  | −6001 ]
When we then apply backward substitution to solve, x2 = 6001/3001 = 1.9996 → 2 with 3-digit
accuracy.
If x2 = 2, then 3.333 × 10−4 x1 + 2 = 2, so x1 = 0!
Clearly, this cannot be the case for the original problem. 

We can avoid this kind of error by applying partial pivoting. We would see that the second row
has the largest entry in the first column, and then swap the first and second rows before applying
Gaussian
 elimination. 
1.000 −1.000 −1.000
3.333 × 10−4 1.000 2.000

R2 = R2 − 3.333 × 10 R1−4

Now, we’re multiplying by small numbers instead of dividing by them, so roundoff error is less
risky.
10.4 Efficient Implementation 109
 
1.000 −1.000 −1.000
0 1.000 2.000
x2 = 2, x1 − 2 = −1 → x1 = 1
This is consistent for three digit accuracy.
However, partial pivoting does not guarantee stability in your solution.
Note: For pivoting, we start with the first row and only look below the current row. Do not
attempt to re-pivot rows that have already been locked.

10.3.1 Scaled Partial Pivoting


Scaled pivoting compares the candidate pivots in each column with the largest value in their
respective rows.
1. Find a scale factor for each row: si = max1≤ j≤n |ai j | (largest magnitude in the ith row).
2. For the kth column, the pivot row p is the one with the largest scaled entry: |a pk |/s p = maxk≤i≤n |aik |/si .
3. If k ≠ p, we swap rows p and k.
4. Solve the new system.
Note: We don’t swap rows in the same way we can by hand, we have to simulate it. This is
often done by copying the rows, and then replacing them.
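The pivot-row selection in steps 1-2 is a short helper (a sketch; the function name is ours):

```python
import numpy as np

def scaled_pivot_row(A, k):
    """Row index p >= k maximizing |a_ik| / s_i in column k."""
    s = np.max(np.abs(A), axis=1)        # scale factor for each row
    ratios = np.abs(A[k:, k]) / s[k:]    # only rows k..n-1 compete
    return k + int(np.argmax(ratios))

# Plain partial pivoting would keep row 0 here (|2| > |1|), but the
# scale factors 100 and 1 give ratios 0.02 vs 1, so row 1 is chosen.
A = np.array([[2.0, 100.0], [1.0, 1.0]])
p = scaled_pivot_row(A, 0)
print(p)   # 1
```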

10.3.2 Complete Pivoting


Maximal pivoting → swap rows and columns to optimize the pivot values. This is often too
computationally expensive.
The whole point of the pivoting process is to move our matrix toward diagonal dominance. In a
diagonally dominant matrix, the magnitude of each diagonal entry is greater than the sum of the
magnitudes of the other entries in the same row.
When our matrix is diagonally dominant, we can perform Gaussian elimination without addi-
tional pivoting.
Later, we will discuss symmetric positive definite matrices, which also do not require additional
pivoting.

My example,

[ 1 2 3  ]
[ 4 5 6  ]
[ 7 8 12 ]

is NOT diagonally dominant: 1 < 2 + 3, 5 < 4 + 6, and 12 < 7 + 8!
An example of a diagonally dominant matrix is

[ 7 2  1 ]
[ 3 5 −1 ]
[ 0 5  6 ]

since 7 > 2 + 1, 5 > 3 + 1, and 6 > 5 + 0 ✓
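The diagonal-dominance test is easy to automate (a sketch; the function name is ours):

```python
import numpy as np

def is_diagonally_dominant(A):
    """True if |a_ii| > sum of |a_ij| over j != i, for every row i."""
    d = np.abs(np.diag(A))
    off = np.sum(np.abs(A), axis=1) - d
    return bool(np.all(d > off))

A1 = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 12]])   # the NOT-dominant example
A2 = np.array([[7, 2, 1], [3, 5, -1], [0, 5, 6]])   # the dominant example
print(is_diagonally_dominant(A1), is_diagonally_dominant(A2))  # False True
```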

10.4 Efficient Implementation


The main takeaways about implementation are that when you implement matrix and vector opera-
tions on a computer, it is important to know something about how these are stored and used.

Loops are rather inefficient, so using matrix-vector multiplication is preferable for efficiency
(when possible). However, that does not mean that matrix-vector multiplication is cheap - it isn’t.
Memory access: BLAS (Basic Linear Algebra Subprograms) organizes these operations into three
levels.

Level 1: “Simple" operations are arithmetic and logical operations. They do not require
additional memory access, and are the most cost-efficient.
Level 2: Matrix updates, systems, matrix-vector multiplication. These require moderate
memory access, and are preferable to Level 3.
Level 3: Matrix-matrix multiplications require the most memory access. Especially when
matrices are large, it may be beneficial to convert a level 3 operation to level 2 for efficiency.

10.5 Estimating Errors and the Condition Number


We seek an effective way to measure our error. In an ideal situation, we could compare our
numerical solution to the exact solution via ||~x − x̃||/||~x||, but if we knew the solution, why would we
bother? We typically do not know the exact solution, so how do we know if our solution is any
good?
We must evaluate the residual measurement, r̃ = ~b − Ax̃. This is a measurement we can easily
evaluate, and due to roundoff errors we know r̃ ≠ ~0.
Note: small r̃ does not necessarily mean small error.
However, we can approximate our error using the residual.
r̃ = ~b − Ax̃ = A~x − Ax̃ = A(~x − x̃)
So, we could determine our error ~x − x̃ = A−1 r̃.
Recall: norms, which allow us to measure the size of a vector.
If we apply a norm to both sides ||~x − x̃|| = ||A−1 r̃|| ≤ ||A−1 ||||r̃||
This bounds our absolute error.
Through a similar analysis, we can bound the relative error through ||~x − x̃||/||~x|| ≤ ||A|| ||A−1 || ||r̃|| / ||~b||.
The condition number is defined through κ(A) = ||A||||A−1 ||. The norms here can be any
chosen norm, e.g. κ1 (A) = ||A||1 ||A−1 ||1 , etc. A large condition number indicates that A is close
to singular. In other words, your residual will be a bad measurement for the accuracy of your
approximation.
If we rewrite our relative error in terms of the condition number, ||~x − x̃||/||~x|| ≤ κ(A) ||r̃||/||~b||, we have
an expression multiplying our condition number by the relative error in the residual. So, even if the
value of our residual itself is small, having a large condition number will have a huge impact on the
bound for our actual error in the approximation effectively rendering the residual measurement to
be useless.
This is important to understand when we implement iterative solvers, because we have to use
the residual in some capacity to approximate when we reach our tolerance.
 Example 10.7 Consider the system

[ 1       2 ] [ x1 ]   [ 3      ]
[ 1.0001  2 ] [ x2 ] = [ 3.0001 ]

We can naturally see that the solution to this system is ~x = [1, 1]T .
However, for this system the matrix is close to singular. If we take the approximation x̃ = [3, 0]T ,
the residual is r̃ = [0, −0.0002]T , which is very small!
However, the error in the approximation, ~x − x̃ = ~e = [−2, 1]T , is very large!
For this A, the inverse is

A−1 = [ −10000   10000 ]
      [  5000.5  −5000 ]

If we evaluate the condition numbers,

κ1 (A) = 4 ∗ 15000.5 = 60002 (max column sum)


κ∞ = 3.0001 ∗ 20000 = 60002 (max row sum)
These are huge condition numbers, which is why the example can have a small residual for
such a poor approximation. 
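Example 10.7 is easy to reproduce numerically; `numpy.linalg.cond` evaluates κ for a chosen norm:

```python
import numpy as np

A = np.array([[1.0, 2.0], [1.0001, 2.0]])
b = np.array([3.0, 3.0001])
x = np.array([1.0, 1.0])            # exact solution
x_tilde = np.array([3.0, 0.0])      # poor approximation from the example

r = b - A @ x_tilde                 # tiny residual...
e = x - x_tilde                     # ...despite a large error

kappa1 = np.linalg.cond(A, 1)
print(np.linalg.norm(r, 1), np.linalg.norm(e, 1), kappa1)
```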

10.5.1 Error in Direct Methods


Gaussian elimination does not introduce errors on its own; it isn’t an approximation. So, any error
that is induced through Gaussian elimination comes from roundoff error.
This means that any system we solve numerically is actually a slightly perturbed problem. If
we are trying to solve A~x = ~b, we actually end up solving (A + δ A)x̃ = ~b + δ~b.
So, r̃ = ~b − Ax̃ = δ A x̃ − δ~b.
A common misconception is to try and use the eigenvalues of a matrix to determine how close
to singular a matrix is - take a moment to look up the condition number, and evaluate why it is a
better measurement of the matrix.
11. Iterative Methods for Solving Linear Systems

Iterative methods approximate a solution by repeating a process (recall: fixed point iteration, etc.).
We will discuss iterative methods for solving a linear system.
We introduced norms to present a way to approximate our error, and determine appropriate
stopping criteria for systems.
Direct methods are great for small systems, but can be extremely computationally expensive
for large systems. They are more accurate (because the technique is exact), but can be less efficient.
For large, sparse systems, iterative methods are preferred. They also come into use when solving
some partial differential equations (Numerical Analysis II!), i.e. Poisson or Laplace’s equation
∇2 u = g(x, y).
Recall: matrix storage and operations are expensive. (In Python, numpy.reshape plays the role of
MATLAB's reshape for restructuring stored arrays.)
Iterative methods for solving linear systems are constructed with a format very similar to fixed
point iteration, but with vectors. We seek a solution to A~x = ~b through ~f (~x) = ~b − A~x = ~0, and define a
~g(~x) such that ~xk+1 = ~g(~xk ).
Note: Convergence for these methods is linear.

11.1 Stationary Iteration and Relaxation Methods


Recall: fixed point iteration. We used a function to locate a root by defining a function g(x) = x.
We will use a similar format to solve the linear system A~x = ~b.
We are seeking the ‘root’ when A~x −~b = ~0.
To define a fixed point, we will have to split the matrix A.
The general format of this is to define A = M − N, (which means that M = N + A, and N =
M − A).
We then solve for a fixed point, or stationary iteration. We call it a stationary iteration because
we do not change our matrices M or A, nor our vector ~b, so the format of the iteration remains
constant.
A~x = (M − N)~x = M~x − N~x = ~b
M~x = N~x +~b

Thus, our iteration will take the form ~xk+1 = M −1 N~xk + M −1~b.
We can simplify this expression with some additional algebraic manipulation:
~xk+1 = M −1 (M − A)~xk + M −1~b = ~xk − M −1 A~xk + M −1~b = ~xk + M −1 (~b − A~xk ) = ~xk + M −1~rk
What is M?
We choose M so that M −1 is as close to A−1 as possible while still being an efficient computa-
tion.
We will discuss two specific methods that are variations on this idea: Jacobi and Gauss-Seidel.
We also want a convergent matrix. An n × n matrix A is convergent if limk→∞ Ak = 0
Equivalent statements:
• A is a convergent matrix
• limk→∞ ||Ak || = 0 for a natural norm
• limk→∞ ||Ak || = 0 for all natural norms
• ρ(A) < 1
• limk→∞ Ak~x = ~0 ∀~x ∈ Rn
We refer to these as relaxation methods because of the smoothing property a convergent matrix
has.
Iterative methods will always start with an initial guess, ~x0 , and the iteration will converge
linearly (if it converges) to some ~x∗ solution to A~x = ~b.

11.1.1 Jacobi
Simultaneous relaxation: updates all terms in the ‘next’ guess using values from the previous guess.
M = D the diagonal terms of A
We build the structure using the summation notation for A~x = ~b
∑nj=1 ai j x j = bi
Since aii xi is a term of ∑nj=1 ai j x j , we can solve for xi directly:
aii xi = bi − ∑nj=1, j≠i ai j x j
Defining an iteration: xi(k+1) = (bi − ∑nj=1, j≠i ai j x j(k) ) / aii
It can also be written as ~x(k+1) = ~x(k) + D−1~r(k) .

Notes for implementation


• Assume each aii ≠ 0, reorder as necessary
• Scaled partial pivoting if needed, to make each aii as large as possible (we are dividing by
the diagonal terms, and so we want to ensure they are not small!)
• We only need to store the last iteration and build the current one. We do not need to store all
iterations (saves space).
• Since each iteration only relies on the values from the previous iteration, terms can be
determined in parallel. (This is usually done by dividing the work between processors to save
time.)
 Example 11.1 Solve the system A~x = ~b with Jacobi and initial guess ~x(0) = [1, 1, 1]T , with

    [ 12  2   3 ]        [ 25 ]       [ 12  0   0 ]
A = [ 5   15  1 ] , ~b = [ 38 ] , D = [ 0   15  0 ]
    [ 6   2  10 ]        [ 40 ]       [ 0   0  10 ]

~x(1) = [ (25 − 2 − 3)/12, (38 − 5 − 1)/15, (40 − 6 − 2)/10 ]T = [ 5/3, 32/15, 32/10 ]T

~x(2) = [ (25 − 64/15 − 96/10)/12, (38 − 25/3 − 32/10)/15, (40 − 10 − 64/15)/10 ]T
This is cumbersome by hand and converges slowly, but it is also generally effective. 

These techniques are particularly useful for large, sparse, A matrices. For a 3 × 3 matrix, this is
not generally necessary.
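A sketch of the Jacobi iteration in Python, in the residual form ~x(k+1) = ~x(k) + D−1~r(k) (`tol` and `max_iter` are illustrative choices of ours):

```python
import numpy as np

def jacobi(A, b, x0, tol=1e-10, max_iter=500):
    """Jacobi iteration x_{k+1} = x_k + D^{-1} r_k with D = diag(A)."""
    d = np.diag(A)
    x = x0.astype(float)
    for _ in range(max_iter):
        r = b - A @ x
        if np.linalg.norm(r) <= tol * np.linalg.norm(b):
            break
        x = x + r / d                # every entry updated from the old x only
    return x

# The system from Example 11.1; the exact solution is [1, 2, 3].
A = np.array([[12.0, 2.0, 3.0], [5.0, 15.0, 1.0], [6.0, 2.0, 10.0]])
b = np.array([25.0, 38.0, 40.0])
x = jacobi(A, b, np.ones(3))
print(x)
```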

11.1.2 Gauss-Seidel
Uses the updated values, as they are updated. This involves more work, but is also faster than
Jacobi.
M = E formed by the lower-triangular terms of A
For many important classes of matrices (e.g. diagonally dominant ones), if Jacobi converges for a
given system, then so will Gauss-Seidel. It also converges linearly, but approximately twice as fast as Jacobi.
Since Gauss-Seidel uses updated values, it is much harder to implement in parallel.
It is built from the same original summation ∑nj=1 ai j x j = bi ; if we expand the summation terms,
∑i−1j=1 ai j x j + aii xi + ∑nj=i+1 ai j x j = bi
“updated” + “current” + “future” = bi
We solve for the iteration: xi(k+1) = (bi − ∑i−1j=1 ai j x j(k+1) − ∑nj=i+1 ai j x j(k) ) / aii
It can also be written as ~x(k+1) = ~x(k) + E −1~r(k)
See how it behaves for the same example.

 Example 11.2 Solve the same system using Gauss-Seidel, with

    [ 12  0   0 ]
E = [ 5   15  0 ]
    [ 6   2  10 ]

~x(1) = [ (25 − 2 − 3)/12, (38 − 25/3 − 1)/15, (40 − 10 − 172/45)/10 ]T
      = [ 5/3, 86/45, 1178/450 ]T ≈ [ 1.6667, 1.9111, 2.6178 ]T

Compare with Jacobi: [ 1.6667, 2.1333, 3.2 ]T
and the exact solution: [ 1, 2, 3 ]T
Notice that Gauss-Seidel is already slightly closer, though not necessarily in every component after
only one iteration; the advantage grows as the iterations continue. 
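A matching Gauss-Seidel sketch; the only change from Jacobi is that each component update uses entries already updated in the current sweep (names are ours):

```python
import numpy as np

def gauss_seidel(A, b, x0, tol=1e-10, max_iter=500):
    """Gauss-Seidel sweeps: new values are used as soon as available."""
    n = len(b)
    x = x0.astype(float)
    for _ in range(max_iter):
        for i in range(n):
            x[i] = (b[i] - A[i, :i] @ x[:i] - A[i, i + 1:] @ x[i + 1:]) / A[i, i]
        if np.linalg.norm(b - A @ x) <= tol * np.linalg.norm(b):
            break
    return x

# Same system as Example 11.2; converges in fewer sweeps than Jacobi.
A = np.array([[12.0, 2.0, 3.0], [5.0, 15.0, 1.0], [6.0, 2.0, 10.0]])
b = np.array([25.0, 38.0, 40.0])
x = gauss_seidel(A, b, np.ones(3))
print(x)
```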

Sidenote: check out SOR, if this is a topic of interest to you.



11.2 Convergence of Stationary Methods


Recall: condition number, and its relation to the differences between residual,~r(k) = ~b − A~x(k) , and
the actual error ~e(k) =~x −~x(k) = A−1~r(k) .
To ensure convergence of the method, we need ~e(k) → 0 as k → ∞.
The matrix notation for the stationary iteration takes the form of a fixed point iteration, ~x(k+1) =
g(~x(k) ) = T~x(k) +~c.

In this form, we call T the iteration matrix.


When we repeat the operation, we see why T needs to be a convergent matrix.
~x(1) = T~x(0) +~c
~x(2) = T~x(1) +~c = T (T~x(0) +~c) +~c = T 2~x(0) + (T + I)~c
~x(3) = T~x(2) +~c = T 3~x(0) + (T 2 + T + I)~c
Generalizing for any k
~x(k) = T k~x(0) + (T (k−1) + T (k−2) + ... + T + I)~c
Recall: a convergent matrix A is such that limk→∞ Ak = 0
Thus, a convergent matrix T will ensure convergence of this fixed point iteration. If T is a
convergent matrix, then ρ(T ) < 1.
To determine convergence of our method for a problem, we evaluate ρ(T ). The smaller it is,
the faster it will converge (rate). However, it is still only going to converge linearly (order).
What is T ?
Both Jacobi and Gauss-Seidel use the matrix splitting defined in the previous section:
~x(k+1) = ~x(k) + M −1~r(k) = ~x(k) + M −1 (~b − A~x(k) ) = (I − M −1 A)~x(k) + M −1~b
So, T = I − M −1 A, and ~c = M −1~b
For Jacobi: TJ = I − D−1 A
For Gauss-Seidel: TGS = I − E −1 A
Recall that D = diagonal entries of A, and E = lower-triangular entries of A.
Evaluating the convergence of Jacobi and Gauss-Seidel through ρ(T ) using the earlier example:
 Example 11.3 For

    [ 12  2   3 ]       [ 12  0   0 ]            [ 1/12  0     0    ]
A = [ 5   15  1 ] , D = [ 0   15  0 ] , so D−1 = [ 0     1/15  0    ]
    [ 6   2  10 ]       [ 0   0  10 ]            [ 0     0     1/10 ]

Then,

                  [ 0     −1/6  −1/4  ]
TJ = I − D−1 A =  [ −1/3   0    −1/15 ]
                  [ −3/5  −1/5   0    ]
Recall: ρ(TJ ) = maximum absolute eigenvalue (max1≤i≤n |λi |)
In the next section, we will discuss ways to determine the eigenvalues of your matrix without
the determinant - but since we only have a 3 × 3 matrix, we’ll use the determinant by hand to
determine the eigenvalues of TJ .

                     [ λ    1/6  1/4  ]
det(λ I − TJ ) = det [ 1/3  λ    1/15 ] = λ (λ 2 − 1/75) − (1/6)(λ /3 − 3/75) + (1/4)(1/15 − 3λ /5) = 0
                     [ 3/5  1/5  λ    ]
λ ≈ −0.51408, 0.11323, and 0.40085.
The largest in magnitude is the first one listed, with a magnitude of 0.51408 < 1 so it is a
convergent matrix, and the method will converge to a solution. 

 Example 11.4 For the same system, let's evaluate TGS = I − E −1 A.

       [ 1 0 0 ]   [ 1/12    0     0    ] [ 12  2   3 ]   [ 0  −1/6   −1/4  ]
TGS  = [ 0 1 0 ] − [ −1/36   1/15  0    ] [ 5   15  1 ] = [ 0   1/18   1/60 ]
       [ 0 0 1 ]   [ −2/45  −1/75  1/10 ] [ 6   2  10 ]   [ 0   4/45  11/75 ]

The eigenvalues of TGS are 0, 0.1608, and 0.0415 (to four decimal places), so ρ(TGS ) = 0.1608 < 1 and the method will
converge. It is also less than ρ(TJ ), so it will converge faster than Jacobi (though still linearly). 
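Both iteration matrices and their spectral radii can be checked directly (a sketch; `numpy.linalg.eigvals` may return complex eigenvalues, so we take magnitudes):

```python
import numpy as np

A = np.array([[12.0, 2.0, 3.0], [5.0, 15.0, 1.0], [6.0, 2.0, 10.0]])
D = np.diag(np.diag(A))   # Jacobi splitting matrix M = D
E = np.tril(A)            # Gauss-Seidel splitting matrix M = E

T_J = np.eye(3) - np.linalg.inv(D) @ A
T_GS = np.eye(3) - np.linalg.inv(E) @ A

rho_J = max(abs(np.linalg.eigvals(T_J)))
rho_GS = max(abs(np.linalg.eigvals(T_GS)))
print(rho_J, rho_GS)      # both below 1, so both methods converge
```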
Theorem: The stationary iteration ~x(k+1) = ~x(k) + M −1~r(k) for the problem A~x = ~b, with iteration
matrix T = I − M −1 A, will converge (for any initial guess) if and only if ρ(T ) < 1.
Implementation: Stopping criteria
1. Relative residual: stop the loop when ||~r(k) ||/||~b|| ≤ tol.
2. Successive iterates: stop the loop when ||~x(k) −~x(k−1) || ≤ tol, but keep in mind that this will
vary greatly with respect to the iteration matrix.
Examine both sets of stopping criteria in the lab.
12. Eigenvalues and Singular Values

Recall that we discussed the determinant as a process that is too computationally expensive to do
numerically, O(n!) operations. This chapter outlines methods of solving for eigenvalues through
the relationship A~x = λ~x iteratively. This allows us to solve for an approximation to our eigenvalues
more efficiently.

12.1 The Power Method


The Power method is an iterative method used to find the largest eigenvalue of a matrix A.
Assumptions required for the power method:
1. A is a square n × n matrix
2. A has n eigenvalues, and the largest is unique |λ1 | > |λi |, for i = 2, ..., n
sidenote: If all the λ 's are distinct, then we will have n linearly independent eigenvectors.
3. We have n linearly independent eigenvectors, ~v → {~v(1) ,~v(2) , ...,~v(n) }
note: if we do not, it may still converge, but it is not guaranteed.
4. Any vector in Rn can be written as a linear combination of these eigenvectors, ~x = ∑nj=1 β j~v( j)
(the ~v( j) 's form a basis of Rn ).
Applying A to this expansion, A~x = ∑nj=1 β j A~v( j) = ∑nj=1 β j λ j~v( j) .
We can use this last assumption to derive the Power method.
If we repeatedly multiply by A:
Ak~x = ∑nj=1 β j λ jk~v( j)
Factoring out the dominant (largest) eigenvalue λ1 ,
Ak~x = λ1k ∑nj=1 β j (λ j /λ1 )k~v( j) , where the j = 1 term will dominate the sum, since
limk→∞ (λ j /λ1 )k = 0 for j ≥ 2. Thus limk→∞ Ak~x = limk→∞ λ1k β1~v(1) .
This will → 0 if |λ1 | < 1, and will diverge if |λ1 | > 1.
We can use this to approximate the associated eigenvector:
~x(k) = Ak~x(0) = λ1k ∑nj=1 β j (λ j /λ1 )k~v( j)

We will normalize ~x(k) at each iteration by γk , the norm of our approximation:
~x(k) = γk λ1k ∑nj=1 β j (λ j /λ1 )k~v( j)
We can then approximate λ1 through the Rayleigh quotient µ(~x) = ~xT A~x / ~xT~x.
With the normalization, ~xT~x → 1, so at each iteration we are approximating λ1 ≈ µ(~x(k) ) = ~x(k)T A~x(k) .
To apply the power method, we first approximate the vector
x̃ = A~x(k−1)
Then, we normalize that approximation:
~x(k) = x̃/||x̃||
We then use the normalized vector to approximate the eigenvalue:
λ1(k) = ~x(k)T A~x(k)
The Power method is linearly convergent.
We define an error constant, |λ2 /λ1 |. The smaller this value is, the faster it will converge (though
still only linearly in rate).
 Example 12.1 Determine the largest eigenvalue of

    [ 1  −1   0 ]
A = [ −2  4  −2 ]
    [ 0  −1   2 ]

with initial vector guess x̃(0) = [−1, 2, 1]T .
We start by normalizing the vector, always:
~x(0) = x̃(0) /||x̃(0) || = [−1/√6, 2/√6, 1/√6]T
Approximate: x̃(1) = A~x(0) = [−3/√6, 8/√6, 0]T
Normalize: ~x(1) = [−3/√73, 8/√73, 0]T ; this is the approximation to our eigenvector.
Then, we use it to approximate the associated (largest) eigenvalue:
µ(1) = ~x(0)T A~x(0) = ~x(0)T x̃(1) = 19/6 ≈ 3.167
Repeat the process until tolerance is met.
x̃(2) = A~x(1) = [−11/√73, 38/√73, −8/√73]T
Normalize: ~x(2) = [−11/√1629, 38/√1629, −8/√1629]T
Approximate the eigenvalue: µ(2) = ~x(1)T x̃(2) = 337/73 ≈ 4.616
One more iteration...
x̃(3) = A~x(2) = [−49/√1629, 190/√1629, −54/√1629]T
Normalize: ~x(3) = [−49/√41417, 190/√41417, −54/√41417]T
Approximate the eigenvalue: µ(3) = ~x(2)T x̃(3) = 8191/1629 ≈ 5.028
Actual eigenvalues: 5.1249, 1.6367, and 0.2384.
The actual eigenvector is ~v = [−0.225, 0.928, −0.297]T , compared with the approximation after
three iterations, ~x(3) = [−0.2408, 0.9336, −0.2653]T .
Not bad! In three iterations, we're already converging to the largest eigenvalue and its associated
eigenvector. 
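A compact power-method sketch (a fixed iteration count stands in for a tolerance check; names are ours):

```python
import numpy as np

def power_method(A, x0, iters=50):
    """Repeatedly apply A, normalize, and take the Rayleigh quotient."""
    x = x0 / np.linalg.norm(x0)
    mu = x @ A @ x
    for _ in range(iters):
        y = A @ x
        x = y / np.linalg.norm(y)   # normalized eigenvector estimate
        mu = x @ A @ x              # eigenvalue estimate x^T A x
    return mu, x

A = np.array([[1.0, -1.0, 0.0], [-2.0, 4.0, -2.0], [0.0, -1.0, 2.0]])
mu, v = power_method(A, np.array([-1.0, 2.0, 1.0]))
print(mu)   # ≈ 5.1249, the dominant eigenvalue from Example 12.1
```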

12.2 Singular Value Decomposition


This is a technique for splitting a matrix that is not square. Singular values are the analogue of eigenvalues for a non-square matrix.
The general form for the Singular Value Decomposition (SVD) of an m × n matrix is A = UΣV^T, where U is an m × m orthogonal matrix, Σ is a diagonal m × n matrix of singular values, and V^T is an n × n orthogonal matrix.
If you're wondering what an orthogonal matrix is: an orthogonal matrix Q has orthonormal columns, so that Q^T Q = I.
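In Python, `numpy.linalg.svd` plays the role of MATLAB's `svd` (a minimal sketch assuming NumPy; the 3 × 2 matrix is an illustrative assumption):

```python
import numpy as np

A = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])       # a 3 x 2 (non-square) matrix

# full_matrices=True returns U (m x m), the singular values, and V^T (n x n)
U, s, Vt = np.linalg.svd(A, full_matrices=True)

# Rebuild the diagonal m x n Sigma and confirm A = U Sigma V^T
Sigma = np.zeros(A.shape)
Sigma[:len(s), :len(s)] = np.diag(s)
reconstructed = U @ Sigma @ Vt
```

Note that NumPy returns the singular values as a 1-D array `s`, sorted in decreasing order, rather than as the full rectangular Σ.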
We can use this to solve a linear system Ax = b with a non-square matrix A (note: all previous methods assumed A was square):

Ax = UΣV^T x = b

Since U is orthogonal, we can left-multiply by U^T:

U^T UΣV^T x = U^T b
ΣV^T x = U^T b

Left-multiply by Σ^−1 (for a rectangular Σ this means the pseudoinverse, which inverts only the nonzero diagonal entries):

Σ^−1 ΣV^T x = Σ^−1 U^T b
V^T x = Σ^−1 U^T b

Left-multiply by V, since V^T is orthogonal also:

VV^T x = VΣ^−1 U^T b
x = VΣ^−1 U^T b
For a non-square matrix, or a singular square matrix, we can only solve it by solving a similar
problem and approximating the eigenvalues (or singular values).
To do this, we will use the singular value decomposition. Note that the singular values in Σ are
given through AT A = V ΣT U T UΣV T = V ΣT ΣV T = V Σ2V T . Since AT A is square, the eigenvalues
of AT A will equal the singular values, squared.
Recall, the condition number was defined by the ratio κ2(A) = σ1/σn; this yields a way to determine your condition number.
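Both ideas above — solving a rectangular system through x = VΣ^−1 U^T b and reading off κ2 = σ1/σn — can be sketched with NumPy (the overdetermined system here is an illustrative assumption):

```python
import numpy as np

A = np.array([[1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0]])       # 3 x 2: more equations than unknowns
b = np.array([6.0, 0.0, 0.0])

# The reduced SVD suffices for the solve: U is 3 x 2, s holds sigma_1, sigma_2
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# x = V Sigma^{-1} U^T b, inverting only the (nonzero) singular values
x = Vt.T @ ((U.T @ b) / s)

kappa2 = s[0] / s[-1]            # condition number kappa_2(A) = sigma_1 / sigma_n
```

For a full-rank A this reproduces the least-squares solution that `np.linalg.lstsq` would return.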
12.2.1 Householder Transformations
We use Householder transformations to turn a matrix A into upper-Hessenberg (non-symmetric A),
or tri-diagonal form (symmetric A).
Given any orthogonal matrix, P, ||~b − A~w|| = ||P~b − PA~w||
P = I − 2uu^T is a symmetric and orthogonal transformation (it preserves the 2-norm). It converts a non-symmetric matrix to upper-Hessenberg form (lower-triangular terms zeroed out), and it will convert a symmetric matrix to tridiagonal form.
How it works: we seek u which will eliminate the lower-triangular terms in the first column, with u^T u = 1. Generally, we do this through ũ = (1st column) + ||1st column|| [1, zeros]^T, and set u = ũ/||ũ||. Then, we define P = I − 2uu^T.
 Example 12.2 Use Householder transformations to reduce

A = [ 3 4 3
      4 4 5
      3 5 1 ]

to upper-Hessenberg form.

Use ũ = [3, 4, 3]^T + √34 [1, 0, 0]^T, so that

u = [0.8702, 0.3942, 0.2956]^T

uu^T = [ 0.7572 0.343  0.2572
         0.343  0.1554 0.1165
         0.2572 0.1165 0.0874 ]

P^(1) = I − 2uu^T = [ −0.5145 −0.686  −0.5145
                      −0.686   0.6893 −0.233
                      −0.5145 −0.233   0.8252 ]

P^(1) A = [ −5.8307 −7.3744 −5.488
             0      −1.1521  1.1554
             0       1.1359 −1.8835 ]

This eliminated the lower-triangular terms in the first column; we can repeat with the second column using the same form, ũ = (2nd column) + ||2nd column|| [0, 1, zeros]^T. 
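The first reflection from Example 12.2 can be reproduced directly (a sketch assuming NumPy; it builds ũ, u, and P exactly as above):

```python
import numpy as np

A = np.array([[3.0, 4.0, 3.0],
              [4.0, 4.0, 5.0],
              [3.0, 5.0, 1.0]])

col = A[:, 0]
u_tilde = col.copy()
u_tilde[0] += np.linalg.norm(col)       # u~ = 1st column + ||1st column|| [1, 0, 0]^T
u = u_tilde / np.linalg.norm(u_tilde)   # normalize so u^T u = 1

P = np.eye(3) - 2.0 * np.outer(u, u)    # P = I - 2 u u^T: symmetric and orthogonal
PA = P @ A                              # zeros the first column below the diagonal
```

Since P is both symmetric and orthogonal, it is its own inverse: P² = I.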
These transformations also compose the form used to complete QR decomposition. QR decomposition factors an m × n matrix A into an orthogonal m × m matrix Q and an upper-triangular n × n matrix R:

A = Q [ R
        0 ]

The residual measurement is ||b − Ax|| = ||b − Q [R; 0] x||. Since Q is orthogonal, we can solve for x as r → 0 through ||Q^T b − [R; 0] x|| → 0. So, if we solve Q^T b = Rx (using the first n rows of Q^T b), we solve Ax = b in the least-squares sense.
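In Python, `numpy.linalg.qr` is the counterpart of MATLAB's `qr`; solving Rx = Q^T b as above yields the least-squares solution (a sketch; the small fitting system is an illustrative assumption):

```python
import numpy as np

# Overdetermined system: fit y = c0 + c1 * t through three points
A = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0]])
b = np.array([1.0, 2.0, 2.0])

# Reduced QR: Q is 3 x 2 with orthonormal columns, R is 2 x 2 upper-triangular
Q, R = np.linalg.qr(A)

# Solving R x = Q^T b gives the least-squares solution of A x = b
x = np.linalg.solve(R, Q.T @ b)
```

For this data the solution is c0 = 7/6, c1 = 1/2, matching the normal equations.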
12.3 QR Algorithm
The QR algorithm is a matrix reduction technique that computes all the eigenvalues of a symmetric
matrix simultaneously.
It must be applied to a symmetric matrix in tridiagonal form. So, Householder transformations
can be used to reduce a symmetric matrix A to tridiagonal form prior to applying the QR algorithm.
 
Given a matrix

A = [ a11 a12  0   0  ...
      a21 a22 a23  0  ...
       0  a32 a33 a34 ...
               ...      ]

factor A = QR using QR factorization, [Q, R] = qr(A).
Then use the factors to reduce the symmetric, tridiagonal matrix to a diagonal matrix of the eigenvalues through the multiplication RQ.
Since Q is orthogonal and R is upper-triangular, R = Q^T A = Q^T QR, and RQ = Q^T AQ, which reduces the off-diagonal terms of A because Q is an orthogonal matrix.
This will converge to a diagonal matrix of the eigenvalues of A with a rate of convergence O(|λ_{i+1}/λ_i|).
This method will converge if all eigenvalues are distinct and real.
There are more general applications of this method as well. It can be applied to an upper-Hessenberg matrix (we can use Householder transformations to reduce a non-symmetric matrix to upper-Hessenberg form), reducing it to an upper-triangular matrix whose diagonal entries approach the eigenvalues.
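The basic iteration described above — factor A = QR, then form RQ = Q^T AQ — can be sketched as follows (assuming NumPy; the iteration count is an illustrative assumption, and the test matrix is the symmetric tridiagonal matrix from Example 12.3):

```python
import numpy as np

def qr_algorithm(A, iterations=100):
    """Repeatedly replace A by RQ = Q^T A Q; the diagonal tends to the eigenvalues."""
    Ak = A.copy()
    for _ in range(iterations):
        Q, R = np.linalg.qr(Ak)
        Ak = R @ Q               # similarity transform: same eigenvalues as A
    return np.diag(Ak)

# The symmetric tridiagonal matrix from Example 12.3
A = np.array([[3.0, 4.0, 0.0],
              [4.0, 4.0, 5.0],
              [0.0, 5.0, 1.0]])
eigs = qr_algorithm(A)
```

Because each step is a similarity transform, the spectrum never changes; only the off-diagonal mass decays.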
12.3.1 Shifting
A more ‘practical’ application applies shifting to accelerate convergence to the eigenvalues.
Since |λ_{i+1}/λ_i| may be close to 1, which slows convergence of the method, shifting is used to accelerate it. Choose an αk close to an eigenvalue; generally we do this by selecting one of the diagonal values.
Then, shift A^(k) by subtracting this value from the diagonal terms: Ã^(k) = A^(k) − αk I
Factor Ã^(k) = Q^(k) R^(k)
Combine to reduce the off-diagonal terms: A^(k+1) = R^(k) Q^(k) + αk I
Repeat.
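The shifted steps can be sketched as follows (assuming NumPy; using the last diagonal entry as the shift αk is the choice described above, and the matrix is the one from Example 12.3):

```python
import numpy as np

def qr_shift_step(Ak):
    """One shifted QR step: A~ = A - alpha I, factor A~ = QR, return RQ + alpha I."""
    alpha = Ak[-1, -1]                       # shift by the last diagonal entry
    n = Ak.shape[0]
    Q, R = np.linalg.qr(Ak - alpha * np.eye(n))
    return R @ Q + alpha * np.eye(n)

A = np.array([[3.0, 4.0, 0.0],
              [4.0, 4.0, 5.0],
              [0.0, 5.0, 1.0]])
for _ in range(9):
    A = qr_shift_step(A)
# A[-1, -1] now closely matches the eigenvalue 2.2591 targeted by the shift
```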
 Example 12.3 Compare standard QR with QR + shift.

Given the matrix

A^(0) = [ 3 4 0
          4 4 5
          0 5 1 ]

Standard QR: [Q, R] = qr(A^(0))

Q^(0) = [ −0.6  0.1264 −0.79
          −0.8 −0.0948  0.5925
           0    0.9874  0.158 ]

R^(0) = [ −5 −5.6    −4
           0  5.0636  0.5135
           0  0       3.1203 ]

A^(1) = R^(0) Q^(0) = [  7.48   −4.0509 0
                        −4.0509  0.027  3.0811
                         0       3.0811 0.493 ]

Repeat: [Q, R] = qr(A^(1))

Q^(1) = [ −0.8793 −0.2505 0.405
           0.4762 −0.4625 0.7479
           0       0.8505 0.5259 ]

R^(1) = [ −8.5065 3.5749  1.4673
           0      3.6226 −1.0057
           0      0       2.5636 ]

A^(2) = R^(1) Q^(1) = [ 9.1824  1.7251 0
                        1.7251 −2.5307 2.1804
                        0       2.1804 1.3483 ]
We can see the off-diagonal terms are decreasing, but they are not nearly zero yet. If we continue to repeat this process, they will approach zero. Computationally, we can use the off-diagonal terms to determine when we have reached tolerance.
Compare with QR + shift. We will select the last diagonal entry as our shift, α0 = 1.

Shift A: Ã^(0) = A^(0) − I = [ 2 4 0
                               4 3 5
                               0 5 0 ]

Factor Ã^(0): [Q, R] = qr(Ã^(0))

Q^(0) = [ −0.4472  0.3651 −0.8165
          −0.8944 −0.1826  0.4082
           0       0.9129  0.4082 ]

R^(0) = [ −4.4721 −4.4721 −4.4721
           0       5.4772 −0.9129
           0       0       2.0412 ]

Combine, with shift:

A^(1) = R^(0) Q^(0) + I = [  7     −4.899  0
                            −4.899 −0.8333 1.8634
                             0      1.8634 1.8333 ]

Re-shift, α1 = 1.8333:

Ã^(1) = A^(1) − 1.8333I = [  5.1667 −4.899  0
                            −4.899  −2.6667 1.8634
                             0       1.8634 0 ]

Factor Ã^(1): [Q, R] = qr(Ã^(1))

Q^(1) = [ −0.7257 −0.6492 0.228
           0.6881 −0.6847 0.2404
           0       0.3314 0.9435 ]

R^(1) = [ −7.12 1.7201  1.2821
           0    5.6236 −1.2758
           0    0       0.448 ]

Combine, with shift:

A^(2) = R^(1) Q^(1) + 1.8333I = [ 8.1836  3.8693 0
                                  3.8693 −2.4396 0.1485
                                  0       0.1485 2.2561 ]
After two iterations, the eigenvalue in the last row appears to be converging, while the first and
second rows have large off-diagonal terms.
Looking at the actual eigenvalues: −3.7029, 2.2591, and 9.4438.
We see that, with shifting, the value corresponding with the shift converges much faster than
the others. (Note: this will perform better if we use the largest diagonal.) However, when we look
at QR alone, all the eigenvalues are converging similarly (perhaps the largest eigenvalue converging
faster than the others).
If we look at later iterations, we can see this even more clearly.

QR iteration 9: A^(9) = [  9.4438 −0.0023 0
                          −0.0023 −3.7018 0.0803
                           0       0.0803 2.258 ]

Error:
9.4438: −4.072 × 10^−7
−3.7029: 1.08 × 10^−3
2.2591: −1.08 × 10^−3
All approximations are pretty good.

QR + shift, iteration 9: A^(9) = [  9.3425 −0.0023 0
                                   −0.0023 −3.7018 0.0002
                                    0       0.0002 2.2591 ]

Error:
9.4438: 1.01 × 10^−1
−3.7029: 1.1 × 10^−3
2.2591: 1 × 10^−15

A rough approximation of the first, a good approximation of the second, and an essentially exact approximation of the third, which was used for the shift. 

Whichever value we select for the shift is the value to which QR + shift converges fastest. In this case, it had roughly cubic convergence to the third eigenvalue listed.
One benefit of using the last diagonal entry is that we can deflate the matrix once that eigenvalue reaches tolerance. So, in the case of our example, we can take the shifted matrix, reduce it to the 2 × 2 leading block, and continue the process with the next eigenvalue.
For a 3 × 3 matrix, this is not as useful as it is for a larger matrix. For a large matrix, we can reduce the operations significantly by using shifting to converge to one eigenvalue, then systematically shrinking the matrix to find all the eigenvalues, with fewer and fewer operations at each reduction. Compared with QR alone, this can be a much more computationally efficient way to compute all the eigenvalues of a large matrix.
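The deflation strategy just described — converge the last eigenvalue with shifting, then drop to the leading submatrix — can be sketched as follows (assuming NumPy; the tolerance and iteration cap are illustrative assumptions):

```python
import numpy as np

def eigenvalues_by_deflation(A, tol=1e-10, max_iter=200):
    """Shifted QR on shrinking submatrices: deflate once the last row converges."""
    Ak = np.array(A, dtype=float)
    eigs = []
    n = Ak.shape[0]
    while n > 1:
        for _ in range(max_iter):
            alpha = Ak[-1, -1]                     # shift by the last diagonal entry
            Q, R = np.linalg.qr(Ak - alpha * np.eye(n))
            Ak = R @ Q + alpha * np.eye(n)
            if abs(Ak[-1, -2]) < tol:              # off-diagonal ~ 0: eigenvalue isolated
                break
        eigs.append(Ak[-1, -1])
        Ak = Ak[:-1, :-1]                          # deflate to the leading block
        n -= 1
    eigs.append(Ak[0, 0])
    return sorted(eigs)

A = np.array([[3.0, 4.0, 0.0],
              [4.0, 4.0, 5.0],
              [0.0, 5.0, 1.0]])
eigs = eigenvalues_by_deflation(A)
```

Each deflation reduces the working matrix by one row and column, so later eigenvalues cost fewer operations, as described above.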
II
Part Two
Index
accuracy, 10
Big O, 13
Big Theta, 14
Bisection method, 25, 28
bit, 17
byte, 17
computational expense, 14
efficiency, 10
error, 12
    absolute, 12
    relative, 12
fixed point, 18
Fixed Point Iteration, 31
floating point, 18
Introduction, 7
little o, 15
machine epsilon, 20
Newton's Method (Newton-Raphson), 37
normalized, 18
piecewise linear, 61
Review List, 8
rounding to nearest, 20
stability, 10
word, 17