Numerical Methods
Andreas Stahel
0 Introduction 1
0.1 Using these Lecture Notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
0.2 Content and Goals of this Class . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
0.3 Literature . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
0.3.1 A Selection of Literature on Numerical Methods . . . . . . . . . . . . . . . . . . . 2
0.3.2 A Selection of Literature on the Finite Element Method . . . . . . . . . . . . . . . . 3
Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2 Matrix Computations 23
2.1 Prerequisites and Goals . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
2.2 Floating Point Operations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
2.2.1 Floating Point Numbers and Rounding Errors in Arithmetic Operations . . . . . . . 23
2.2.2 Flops, Accessing Memory and Cache . . . . . . . . . . . . . . . . . . . . . . . . . 26
2.2.3 Multi Core Architectures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
5.9.1 From the Plane Stress Matrix to the Full Stress Matrix . . . . . . . . . . . . . . . . 394
5.9.2 From the Minimization Formulation to a System of PDE’s . . . . . . . . . . . . . . 395
5.9.3 Boundary Conditions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 396
5.9.4 Deriving the Differential Equations using the Euler–Lagrange Equation . . . . . . . 397
Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 399
Index 483
Chapter 0
Introduction
• The method of finite differences (FD) is presented and applied to a few typical problems.
• The necessary tensor calculus for the description of elasticity problems is presented.
• The principle of virtual work is applied to mechanical problems and the importance of the calculus of
variations is illustrated. This will serve as the basis for the Finite Element Method, i.e. FEM.
• The method of finite elements is introduced using 2D model problems. The convergence of triangular
elements of order 1 and 2 is examined carefully.
The topics to be examined in this class are shown in Figure 1.
¹ Lecture notes are different from books. A book should be complete, while the combination of lectures and lecture notes should be complete.
Figure 1: The topics of this class and their connections: Numerical Tools, Matrix Computations, Finite Differences, Model Problems, Calculus of Variations and Elasticity, and Finite Elements.
• Get to know a few basic algorithms and techniques to solve modeling problems.
Obviously there are many important topics in numerical methods that we do not consider in this class. The
methods presented should help you to read and understand the corresponding literature. An incomplete list
of topics not considered: interpolation and approximation, numerical integration, ordinary differential
equations, . . .
To realize that reliable numerical methods can be very important, examine the list of disasters caused
by numerical errors at http://ta.twi.tudelft.nl/nw/users/vuik/wi211/disasters.html. Among the spectacular
events are the Patriot missile failures, an Ariane rocket blowup and the sinking of the Sleipner offshore platform.
0.3 Literature
Below find a selection of books on numerical analysis and the Finite Element Method. The list is not an
attempt to generate a representative selection, but is strongly influenced by my personal preference and
bookshelf. The list of covered topics might help when you face problems not solvable with the methods and
ideas presented in this class.
[Schw09] H.R. Schwarz, Numerische Mathematik: this is a good introduction to numerical mathematics and
might serve well as a complement to the lecture notes.
[IsaaKell66] Isaacson and Keller, Analysis of Numerical Methods: this is an excellent (and very affordable) in-
troduction to numerical analysis. It is mathematically very solid and I strongly recommend it as a
supplement.
[DahmReus07] Dahmen and Reusken, Numerik für Ingenieure und Naturwissenschaftler: this is a comprehensive
presentation of most of the important numerical algorithms.
[GoluVanLoan96] Gene Golub, Charles Van Loan, Matrix Computations: this is the bible for matrix computations (see
Chapter 2) and an excellent book. Use this as reference for matrix computations. There is a new,
expanded edition of this marvelous book [GoluVanLoan13].
[Smit84] G. D. Smith, Numerical Solution of Partial Differential Equations: Finite Difference Methods: this is
a basic introduction to the method of finite differences.
[Thom95] J. W. Thomas, Numerical Partial Differential Equations: Finite Difference Methods: this is an excel-
lent up-to-date presentation of finite difference methods. Use this book if you want to go beyond the
presentation in class.
[Acto90] F. S. Acton, Numerical Methods that Work: a well written book on many basic aspects of numerical
methods. Common sense advice is given out freely. Well worth reading.
[Pres92] Press et al., Numerical Recipes in C: this is a collection of basic algorithms and some explanation of
the effects and aspects to consider. There are versions of this book for the programming languages C,
C++, Fortran, Pascal, Modula and Basic.
[Knor08] M. Knorrenschild, Numerische Mathematik: a very short and affordable collection of results and
examples. No proofs are given, just the facts stated. It might be useful when reviewing the topics and
preparing an exam.
[GhabWu16] J. Ghaboussi and X.S. Wu, Numerical Methods in Computational Mechanics: a selection of numerical
results and definitions, useful for computational mechanics.
[John87] Claes Johnson, Numerical Solution of Partial Differential Equations by the Finite Element Method:
this is an excellent introduction to FEM, readable and mathematically precise. It might serve well as
a supplement to the lecture notes.
[TongRoss08] P. Tong and J. Rossettos: Finite Element Method, Basic Technique and Implementation. The book
title says it all. Implementation details are carefully presented. It contains a good presentation of the
necessary elasticity concepts. Since this book is available as a Dover edition, the cost is minimal.
[Schw88] H. R. Schwarz, Finite Element Method: An easily understandable book, presenting the basic algo-
rithms and tools to use FEM.
[Hugh87] Thomas J. R. Hughes, The Finite Element Method: Linear Static and Dynamic Finite Element Analy-
sis: this classic book on FEM contains considerable information for elasticity problems. The presen-
tation discusses many implementation details, also for shells, beams, plates and curved elements. It is
very affordable.
Table 1: Literature on Numerical Methods. The table compares the references [GoluVanLoan96], [DahmReus07], [GhabWu16], [IsaaKell66], [Thom95], [Schw09], [Knor08], [Acto90], [Smit84] and [Stah08] with respect to the following topics: floating point arithmetic; CPU, memory and cache structure; the Gauss algorithm; LR factorization; the Cholesky algorithm; banded Cholesky; stability of linear algorithms; iterative methods for linear systems; the conjugate gradient method; preconditioners; solving a single nonlinear equation; Newton's algorithm for systems; FD approximation; consistency, stability, convergence and the Lax theorem; Neumann stability analysis; boundary value problems in 1D and 2D; stability of static problems; explicit, implicit and Crank-Nicolson schemes for the heat equation; explicit and implicit schemes for the wave equation; nonlinear FD problems; numerical integration in 1D; Gauss integration in 1D and 2D; initial value problems for ODEs; zeros of polynomials; interpolation; eigenvalues and eigenvectors; linear regression. An extensive coverage of a topic is marked by x, while a brief coverage is marked by o.
[Brae02] Dietrich Braess, Finite Elemente: this is a modern presentation of the mathematical theory for FEM
and their implementation. Mathematically it is more advanced and more precise than these lecture notes.
Here you also find the exact formulation of the Bramble-Hilbert Lemma. This book is recommended for further
studies.
[Gmur00] Thomas Gmür, Méthodes des éléments finis en mécanique des structures: an introduction (in French)
to FEM applied to elasticity.
[ZienMorg06] O. C. Zienkiewicz and K. Morgan, Finite Elements and Approximation: A solid presentation of the
FEM is given, preceded by a short review of the finite difference method.
[AtkiHan09] K. Atkinson and W. Han, Theoretical Numerical Analysis: a good functional analysis framework
for numerical analysis is presented in this book. This is a rather theoretical, well written book. It
also shows the connection of partial differential equations and numerical methods. An excellent
presentation of the FEM is given.
[Ciar02] Philippe Ciarlet, The Finite Element Method for Elliptic Problems: Here you find a mathematical
presentation of the error estimates for the FEM, including all details. The Bramble-Hilbert Lemma is
carefully examined. This is a very solid mathematical foundation.
[Prze68] J.S. Przemieniecki, Theory of Matrix Structural Analysis: this is a classical presentation of the me-
chanical approach to FEM. A good introduction to the keywords stress, strain and Hooke's law is
given.
[Koko15] Jonas Koko, Approximation numérique avec MATLAB, Programmation vectorisée, équations aux
dérivées partielles. A nice introduction (in French) to MATLAB and the coding of FEM algorithms.
The details are worked out. Some code for the linear elasticity problem is developed.
[ShamDym95] Shames and Dym, Energy and Finite Element Methods in Structural Mechanics. A solid introduction
to variational methods, applied to structural mechanics. Contains a good description of mechanics
and FEM.
[Froc16] Jörg Frochte, Finite–Elemente–Methode. Eine praxisbezogene Einführung mit GNU Octave/Matlab.
A very readable introduction with complete codes for MATLAB/Octave. Contains a presentation of
dynamic problems.
Find a list of references for the FEM and the covered topics in Table 2.
Bibliography
[Acto90] F. S. Acton. Numerical Methods that Work; 1990 corrected edition. Mathematical Association of
America, Washington, 1990.
[AtkiHan09] K. Atkinson and W. Han. Theoretical Numerical Analysis. Number 39 in Texts in Applied
Mathematics. Springer, 2009.
[Brae02] D. Braess. Finite Elemente. Theorie, schnelle Löser und Anwendungen in der Elastizitätstheorie.
Springer, second edition, 2002.
[Ciar02] P. G. Ciarlet. The Finite Element Method for Elliptic Problems. SIAM, 2002.
[DahmReus07] W. Dahmen and A. Reusken. Numerik für Ingenieure und Naturwissenschaftler. Springer,
2007.
Table 2: Literature on the Finite Element Method. The table compares the references [ShamDym95], [ZienMorg06], [TongRoss08], [AtkiHan09], [Gmur00], [Schw88], [Hugh87], [Koko15], [John87], [Brae02], [Zien13], [Froc16], [Stah08], [Prze68] and [Ciar02] with respect to the following topics: introduction to the calculus of variations; the finite element method, linear in 2D; generation of the stiffness matrix; error estimates and the Lemma of Cea; second order elements in 2D; Gauss integration; formulation of elasticity problems.
[GhabWu16] J. Ghaboussi and X. Wu. Numerical Methods in Computational Mechanics. CRC Press, 2016.
[Gmur00] T. Gmür. Méthode des éléments finis en mécanique des structures. Mécanique (Lausanne).
Presses polytechniques et universitaires romandes, 2000.
[GoluVanLoan96] G. Golub and C. Van Loan. Matrix Computations. Johns Hopkins University Press, third
edition, 1996.
[GoluVanLoan13] G. Golub and C. Van Loan. Matrix Computations. Johns Hopkins University Press,
fourth edition, 2013.
[Hugh87] T. J. R. Hughes. The Finite Element Method, Linear Static and Dynamic Finite Element Analysis.
Prentice–Hall, 1987. Reprinted by Dover.
[IsaaKell66] E. Isaacson and H. B. Keller. Analysis of Numerical Methods. John Wiley & Sons, 1966.
Republished by Dover in 1994.
[John87] C. Johnson. Numerical Solution of Partial Differential Equations by the Finite Element Method.
Cambridge University Press, 1987. Republished by Dover.
[Koko15] J. Koko. Approximation numérique avec Matlab, Programmation vectorisée, équations aux
dérivées partielles. Ellipses, Paris, 2015.
SHA 21-5-21
CHAPTER 0. INTRODUCTION 7
[ShamDym95] I. Shames and C. Dym. Energy and Finite Element Methods in Structural Mechanics. New
Age International Publishers Limited, 1995.
[Smit84] G. D. Smith. Numerical Solution of Partial Differential Equations: Finite Difference Methods.
Oxford University Press, Oxford, third edition, 1986.
[Thom95] J. W. Thomas. Numerical Partial Differential Equations: Finite Difference Methods, volume 22
of Texts in Applied Mathematics. Springer Verlag, New York, 1995.
[TongRoss08] P. Tong and J. Rossettos. Finite Element Method, Basic Technique and Implementation. MIT,
1977. Republished by Dover in 2008.
[Zien13] O. Zienkiewicz, R. Taylor, and J. Zhu. The Finite Element Method: Its Basis and Fundamentals.
Butterworth-Heinemann, seventh edition, 2013.
[ZienMorg06] O. C. Zienkiewicz and K. Morgan. Finite Elements and Approximation. John Wiley & Sons,
1983. Republished by Dover in 2006.
Chapter 1

The Model Problems
The main goal of the introductory chapter is to familiarize you with a small set of sample problems. These
problems will be used throughout the class and in the notes to illustrate the presented methods and algo-
rithms. Each of the selected model problems stands for a class of similar application problems.
The flux of thermal energy is a vector indicating the direction of the flow and the amount of thermal
energy flowing per second and square meter. Fourier’s law of heat conduction can be stated in the form
$$\vec q = -k\,\nabla T\,. \qquad (1.1)$$
This basic law of physics indicates that the thermal energy will flow from hot spots to areas with lower
temperature. For some simple situations the consequences of this equation will be examined. The only
other basic physical principle to be used is conservation of energy. Some of the variables and symbols
used in this section are shown in Table 1.2.
Table 1.2: Variables and symbols

    quantity                         symbol   unit
    density of energy                u        J/m^3
    temperature                      T        K
    heat capacity                    c        J/(K kg)
    density                          ρ        kg/m^3
    heat conductivity                k        J/(s m K)
    heat flux                        ~q       J/(s m^2)
    external energy source density   f        J/(s m^3)
    area of the cross section        A        m^2
Now compute the rate of change of energy in the same interval. The rate of change has to equal the total
flux of energy into this interval plus the input from external sources
At this point use the conservation of energy principle. Since the above equation has to be correct for all
possible values of a and b the expressions under the integrals have to be equal and we obtain the general
equation for heat conduction in one variable
$$A(x)\,\rho\,c\,\frac{\partial T(t,x)}{\partial t} = \frac{\partial}{\partial x}\Big(k\,A(x)\,\frac{\partial T(t,x)}{\partial x}\Big) + A(x)\,f(t,x)\,.$$
This second order differential equation has to be supplemented by boundary conditions, either prescribed
temperature or prescribed energy flux.
As a standard example consider a beam with constant cross section of length 1 with known temperature
T = 0 at the two ends and a known heating contribution f (x). This leads to
$$-\frac{d^2}{dx^2}\,T(x) = \frac{1}{k}\,f(x) \quad\text{for } 0\le x\le 1 \quad\text{and}\quad T(0) = T(1) = 0\,. \qquad (1.2)$$
For the special case f (x) = 0 use the method of separation of variables to find a solution to this
problem. With $\alpha = \frac{k}{\rho\,c}$ the above simplifies to
$$\frac{\partial}{\partial t}\,T(t,x) = \alpha\,\frac{\partial^2}{\partial x^2}\,T(t,x) \quad\text{for } 0\le x\le 1 \text{ and } t\ge 0$$
$$T(t,0) = T(t,1) = 0 \quad\text{for } t\ge 0$$
$$T(0,x) = T_0(x) \quad\text{for } 0\le x\le 1\,.$$
• Look for solutions of the form T (t, x) = h(t) · u(x), i.e. a product of functions of one variable each.
Plugging this into the partial differential equation leads to
$$\frac{\partial}{\partial t}\,T(t,x) = \alpha\,\frac{\partial^2}{\partial x^2}\,T(t,x) \implies \dot h(t)\cdot u(x) = \alpha\,h(t)\cdot u''(x) \implies \frac{1}{\alpha}\,\frac{\dot h(t)}{h(t)} = \frac{u''(x)}{u(x)}\,.$$
Since the left hand side depends on t only and the right hand side on x only, both have to be constant.
One can verify that the constant has to be negative, denoted by $-\lambda^2$.
• Using the right hand side leads to the boundary value problem
$$u''(x) = -\lambda^2\,u(x)\,.$$
Use the boundary conditions T (t, 0) = T (t, 1) = 0 to verify that this problem has nonzero solutions
only for the values $\lambda_n = n\,\pi$ and the solutions are given by
$$u_n(x) = \sin(n\,\pi\,x)\,.$$
• Using the left hand side find $\dot h_n(t) = -\alpha\,\lambda_n^2\,h_n(t)$, i.e. $h_n(t) = h_n(0)\,e^{-\alpha\,n^2\pi^2\,t}$, and thus the
solutions $T_n(t,x) = h_n(0)\,e^{-\alpha\,n^2\pi^2\,t}\,\sin(n\,\pi\,x)$. This solution satisfies the initial condition $T(0,x) = h_n(0)\,\sin(n\,\pi\,x)$. To satisfy arbitrary initial
conditions $T(0,x) = T_0(x)$ use the superposition principle and Fourier sine series
$$T_0(x) = \sum_{n=1}^{\infty} b_n\,\sin(n\,\pi\,x)\,.$$
Use this explicit formula to verify the accuracy of the numerical approximations to be examined, as sketched below.
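Such a verification might be sketched as follows (not part of the original notes; the initial temperature T0, the constant α and the number of terms N are chosen freely for illustration):

alpha = 1;  N = 50;  t = 0.01;      % parameters chosen freely for this illustration
T0 = @(x) x.*(1-x);                 % an arbitrary initial temperature
x = linspace(0,1,101);  T = zeros(size(x));
for n = 1:N
  bn = 2*integral(@(x) T0(x).*sin(n*pi*x), 0, 1);  % Fourier sine coefficient
  T = T + bn*exp(-alpha*n^2*pi^2*t)*sin(n*pi*x);   % add one term of the series
end%for
plot(x,T)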
Using the divergence theorem on the second integral and Fourier’s law find
$$\iint_G h\,c\,\rho\,\frac{\partial T}{\partial t}\;dA = -\iint_G \operatorname{div}(h\,\vec q)\;dA + \iint_G h\,f\;dA = \iint_G \operatorname{div}\big(k\,h\,\nabla T(t,x,y)\big)\;dA + \iint_G h\,f\;dA\,.$$
This equation has to be correct for all possible domains G and not only for the physical domain. Thus the
expressions under the integrals have to be equal, leading to the partial differential equation
$$h\,c\,\rho\,\frac{\partial T}{\partial t} = \operatorname{div}\big(k\,h\,\nabla T(t,x,y)\big) + h\,f\,. \qquad (1.4)$$
[Figure: A section of the string between x and x + ∆x, with the tension vectors −T(x) and T(x + ∆x) at the two ends and the external force F.]
Assume that the slope of the string is small. For a string the tension vector $\vec T$ is required to have the same slope as the string. Using the condition of
small slope arrive at
$$T_2(x) = u'(x)\,T_0\,,\qquad T_2(x+\Delta x) = u'(x+\Delta x)\,T_0$$
$$T_2(x+\Delta x) - T_2(x) = \big(u'(x+\Delta x) - u'(x)\big)\,T_0 \approx T_0\,u''(x)\,\Delta x\,.$$
Combining the above approximations and dividing by $\Delta x$ leads to the differential equation.
Supplement this with the boundary conditions of a fixed string (i.e. u(0) = u(L) = 0) to obtain a second
order boundary value problem (BVP). Using Calculus of Variations (see Section 5.2) observe that solving
this BVP is equivalent to finding the minimizer of the functional
$$F(u) = \int_0^L \frac{1}{2}\,T_0\,(u'(x))^2 - f(x)\cdot u(x)\;dx$$
$$\begin{array}{ll} \rho(x)\,\ddot u(t,x) = T_0\,u''(t,x) + f(t,x) & \text{for } 0 < x < L \text{ and } t > 0\\ u(t,0) = u(t,L) = 0 & \text{for } 0 \le t\\ u(0,x) = u_0(x) & \text{for } 0 < x < L\\ \dot u(0,x) = u_1(x) & \text{for } 0 < x < L\,. \end{array} \qquad (1.8)$$
the function $u(t,x) = A\,\cos(\sqrt{\lambda}\,t + \phi)\cdot y(x)$ is a solution of the equation of the vibrating string with no external force, i.e. f (t, x) = 0 . Thus the
eigenvalues λ lead to vibrations with the frequencies $\nu = \sqrt{\lambda}/(2\,\pi)$ .
where the interpretation of the terms is shown in Table 1.3. Thus we are led to the IBVP
Table 1.3: Variables and symbols for the membrane problem

    quantity                 symbol   unit
    vertical displacement    u        m
    external force density   f        N/m^2
    horizontal tension       τ        N/m
    mass density             ρ        kg/m^2
$$\begin{array}{ll} \rho(x,y)\,\ddot u(t,x,y) = \nabla\big(\tau(x,y)\,\nabla u(t,x,y)\big) + f(t,x,y) & \text{for } (x,y)\in\Omega \text{ and } t > 0\\ u(t,x,y) = 0 & \text{for } (x,y)\in\Gamma \text{ and } t > 0\\ u(0,x,y) = u_0(x,y) & \text{for } (x,y)\in\Omega\\ \dot u(0,x,y) = u_1(x,y) & \text{for } (x,y)\in\Omega\,. \end{array} \qquad (1.10)$$
Using Calculus of Variations (see Section 5.2) we will show that solving this BVP is equivalent to finding
the minimizer of the functional
$$F(u) = \iint_\Omega \frac{\tau}{2}\,(\nabla u)^2 - f\cdot u\;dA$$
amongst 'all' functions u with u = 0 on the boundary Γ . Figure 1.1 might represent a solution.
For 0 ≤ x ≤ L the function u(x) indicates by how much the part of the horizontal beam originally at
position x will be displaced horizontally, i.e. the new position is given by x + u(x) . In Table 1.4 find a list
of all relevant expressions.
The basic law of Hooke is given by
$$EA\,\frac{\Delta L}{L} = F$$
for a beam of length L with cross section A. A force of F will stretch the beam by ∆L. For a short section
of length l of the beam between x and x + l find
$$\frac{\Delta l}{l} = \frac{u(x+l) - u(x)}{l} \;\longrightarrow\; u'(x) \quad\text{as } l\to 0^+\,.$$
This expression is called strain and often denoted by $\varepsilon = \frac{\partial u}{\partial x}$. Then the force F (x) at a cross section at x is
given by Hooke's law
$$F(x) = EA(x)\,u'(x)\,,$$
where F (x) is the force the section on the right of the cut will apply to the left section. Now examine all
forces on the section between x and x + ∆x, whose sum has to be zero for a steady state situation (Newton’s
law).
$$EA(x+\Delta x)\,u'(x+\Delta x) - EA(x)\,u'(x) + \int_x^{x+\Delta x} f(s)\,ds = 0$$
$$-\,\frac{EA(x+\Delta x)\,u'(x+\Delta x) - EA(x)\,u'(x)}{\Delta x} = \frac{1}{\Delta x}\int_x^{x+\Delta x} f(s)\,ds\,.$$
Taking the limit ∆x → 0 in the above expression leads to the second order differential equation for the
unknown displacement function u(x).
$$-\frac{d}{dx}\Big(EA(x)\,\frac{d\,u(x)}{dx}\Big) = f(x) \quad\text{for } 0 < x < L\,. \qquad (1.13)$$
There are different boundary conditions to be considered:
• At the left end point x = 0 we assume no displacement, i.e. u(0) = 0 .
• At the right end point x = L we can examine a given displacement $u(L) = u_M$, i.e. we have a
Dirichlet condition.
• At the right end point x = L we can examine the situation of a known force F , leading to the
Neumann condition
$$EA(L)\,\frac{d\,u(L)}{dx} = F\,.$$
$$A = A\Big(\frac{du}{dx}\Big) = A_0\cdot\Big(1 - \nu\,\frac{du}{dx}\Big)^2 \approx A_0\cdot\Big(1 - 2\,\nu\,\frac{du}{dx}\Big)\,.$$
Since the area is smaller the stress will increase, leading to a further reduction of the area. The resulting
stress is the true stress, compared to the engineering stress, where the force is divided by the original
area $A_0$, not the true area $A_0\,\big(1 - \nu\,\frac{d\,u(x)}{dx}\big)^2$. The linear boundary value problem (1.13) is replaced by a
nonlinear problem:
$$-\frac{d}{dx}\Big(EA_0\,\Big(1 - \nu\,\frac{du}{dx}\Big)^2\,\frac{d\,u(x)}{dx}\Big) = f(x) \quad\text{for } 0 < x < L\,. \qquad (1.14)$$
For a given force F at the endpoint at x = L use the boundary condition
$$EA_0(L)\,\Big(1 - \nu\,\frac{d\,u(L)}{dx}\Big)^2\,\frac{d\,u(L)}{dx} = F\,.$$
If there is no force along the beam (f (x) = 0) the differential equation implies that
$$EA_0\,\Big(1 - \nu\,\frac{du}{dx}\Big)^2\,\frac{d\,u(x)}{dx} = \text{const}\,.$$
The above equation and boundary condition lead to a cubic equation for the unknown function $w(x) = u'(x)$:
$$EA_0(x)\,\Big(1 - \nu\,\frac{d\,u(x)}{dx}\Big)^2\,\frac{d\,u(x)}{dx} = F$$
$$EA_0(x)\,\big(1 - \nu\,w(x)\big)^2\,w(x) = F$$
$$\nu^2\,w^3(x) - 2\,\nu\,w^2(x) + w(x) = \frac{F}{EA_0(x)} \qquad (1.15)$$
$$\nu^3\,w^3(x) - 2\,\nu^2\,w^2(x) + \nu\,w(x) = \frac{\nu\,F}{EA_0(x)}\,.$$
This is a cubic equation for the unknown $\nu\,w(x)$. Thus this example can be solved without using differential
equations by solving nonlinear equations, see Example 3–13 on page 122, or by using the method of finite
differences, see Example 4–9 on page 297.
[Figure: A typical stress–strain curve: stress σ as function of the strain ε, with the linear elastic region of slope ∂σ/∂ε = E, followed by the plastic region, up to the point of fracture.]
[Figure: A bending beam described by the curve ~x(l) = (x(l), y(l)); ∆x and ∆y denote the horizontal and vertical distances from the point ~x(l) to the endpoint of the beam.]
$$\frac{d}{ds}\begin{pmatrix} x(s)\\ y(s)\end{pmatrix} = \begin{pmatrix}\cos(\alpha(s))\\ \sin(\alpha(s))\end{pmatrix}$$
and construct the curve from the angle function α(s) with an integral
$$\vec x(l) = \begin{pmatrix} x(l)\\ y(l)\end{pmatrix} = \begin{pmatrix} x(0)\\ y(0)\end{pmatrix} + \int_0^l \begin{pmatrix}\cos(\alpha(s))\\ \sin(\alpha(s))\end{pmatrix}\,ds \quad\text{for } 0\le l\le L\,.$$
The definition of the curvature κ is the rate of change of the angle α(s) as function of the arc length s .
Thus it is given by
$$\kappa(s) = \frac{d}{ds}\,\alpha(s)\,.$$
The theory of elasticity implies that the curvature at each point is proportional to the total moment generated
by all forces to the right of this point. If only one force $\vec F = (F_1, F_2)^T$ is applied at the endpoint this
moment M is given by
$$EI\,\kappa = M = F_2\,\Delta x - F_1\,\Delta y\,.$$
Since
$$\begin{pmatrix}\Delta x\\ \Delta y\end{pmatrix} = \begin{pmatrix} x(L) - x(l)\\ y(L) - y(l)\end{pmatrix} = \int_l^L \begin{pmatrix}\cos(\alpha(s))\\ \sin(\alpha(s))\end{pmatrix}\,ds$$
we can rewrite the above equation as a differential-integral equation. Then computing one derivative with
respect to the variable l transforms the problem into a second order differential equation.
$$EI\,\alpha'(l) = \int_l^L \begin{pmatrix} F_2\\ -F_1\end{pmatrix}\cdot\begin{pmatrix}\cos(\alpha(s))\\ \sin(\alpha(s))\end{pmatrix}\,ds$$
$$EI\,\alpha''(l) = -\begin{pmatrix} F_2\\ -F_1\end{pmatrix}\cdot\begin{pmatrix}\cos(\alpha(l))\\ \sin(\alpha(l))\end{pmatrix} = F_1\,\sin(\alpha(l)) - F_2\,\cos(\alpha(l))\,.$$
If the beam starts out horizontally we have α(0) = 0 and since the moment at the right end point vanishes
we find the second boundary condition as α'(L) = 0 . Thus we find a nonlinear, second order boundary
value problem
$$\big(EI\,\alpha'(s)\big)' = F_1\,\sin(\alpha(s)) - F_2\,\cos(\alpha(s))\,. \qquad (1.16)$$
In the above general form this problem can only be solved approximately by numerical methods, see Exam-
ple 4–11 on page 300.
the functions $u_n(s) = \cos(\sqrt{k_n}\,s)$ are nontrivial solutions of the linearized equation
These solutions can be used as starting points for Newton's method applied to the nonlinear problem (1.18) .
$$k_1 = \frac{\pi^2}{L^2} = \frac{F_c}{EI} \implies F_c = \frac{EI\,\pi^2}{L^2}\,.$$
If the actual force F₁ is smaller than F_c there will be no deflection of the beam. As soon as the critical value
is reached there will be a sudden, drastic deformation of the beam. The corresponding solution is
$$u_1(s) = \varepsilon\,\cos\Big(\frac{\pi}{L}\,s\Big)\,.$$
Thus for small values of the parameter ε we expect approximate solutions of the form
$$\alpha(s) = \varepsilon\,u_1(s) = \varepsilon\,\cos\Big(\frac{\pi}{L}\,s\Big)\,.$$
Using the integral formula
$$\vec x(l) = \begin{pmatrix} x(l)\\ y(l)\end{pmatrix} = \int_0^l \begin{pmatrix}\cos(\alpha(s))\\ \sin(\alpha(s))\end{pmatrix} ds = \int_0^l \begin{pmatrix}\cos\big(\varepsilon\,\cos(\frac{\pi}{L}\,s)\big)\\ \sin\big(\varepsilon\,\cos(\frac{\pi}{L}\,s)\big)\end{pmatrix} ds \approx \int_0^l \begin{pmatrix} 1\\ \varepsilon\,\cos(\frac{\pi}{L}\,s)\end{pmatrix} ds = \begin{pmatrix} l\\ \varepsilon\,\frac{L}{\pi}\,\sin(\frac{\pi}{L}\,l)\end{pmatrix}\,.$$
• If λ > 0 is very small the functional will be minimized if u ≈ f , i.e. a good approximation of f .
• If λ is very large the integral of (u0 )2 will be very small, i.e. we have a very smooth function.
The above problem can be solved with the help of an Euler–Lagrange equation, to be examined in Sec-
tion 5.2.1, starting on page 305. For smooth functions f (x) the minimizer u(x) of the Tikhonov functional
Tλ (u) is a solution of the boundary value problem
In Figure 1.5 find a function f (x) with a few jumps and some noise added. The above problem was solved
for three different values of the positive parameter λ.
• For small values of λ (e.g. λ = 10−5 ) the original curve f (x) is matched very closely, including the
jumps. Caused by the noise the curve is very wiggly. In Figure 1.5 points only are displayed.
• For intermediate values of λ (e.g. λ = 10−3 ) the original curve f (x) is matched reasonably well and
the jumps are not visible.
• For large values of λ (e.g. λ = 10−1 ) the jumps in f (x) completely disappear, but the result u(x) is
far off the original (noisy) function f (x).
Figure 1.5: A nonsmooth function f and three regularized approximations u, using a small ($\lambda = 10^{-5}$), an intermediate ($\lambda = 10^{-3}$) and a large ($\lambda = 0.1$) parameter
The above idea can be modified by using different smoothing functionals, depending on the specific
requirements of the application.
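To illustrate the idea (a minimal sketch, not from the original notes): after a finite difference discretization the minimization of Tλ(u) leads to a linear system (I + λ A) u = f with a tridiagonal matrix A; the noisy data f is generated here artificially.

n = 500;  x = linspace(0,1,n)';  h = x(2)-x(1);
f = (x > 0.3) + 0.05*randn(n,1);           % a jump plus some noise, for illustration
e = ones(n,1);
A = spdiags([-e 2*e -e], -1:1, n, n)/h^2;  % second difference matrix
A(1,1) = 1/h^2;  A(n,n) = 1/h^2;           % natural boundary conditions u'(0) = u'(1) = 0
lambda = 1e-3;
u = (speye(n) + lambda*A)\f;               % the regularized approximation
plot(x, f, '.', x, u)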
Chapter 2
Matrix Computations
• you should understand the basic concepts of floating point operations on computers.
• you should be familiar with the concept of the condition number of a matrix and its consequences.
• you should understand the algorithm for LR factorization and Cholesky’s algorithm.
• you should be familiar with sparse matrices and the conjugate gradient algorithm.
• you should be aware of the importance of a preconditioner for the conjugate gradient algorithm.
In this binary scientific representation the integer part always equals 1. Thus this number could be stored
in 14 bits, consisting of 1 sign bit, 3 exponent bits and 10 base bits. Obviously this type of representation is
asking for standardization, which was done in 1985 with the IEEE-754 standard. As an example examine
the float format, to be stored in 32 bits. The information is given by one sign bit s, 8 exponent bits bj and
23 bits aj for the base.
s b0 b1 b2 . . . b7 a1 a2 a3 . . . a23
This leads to the number
$$x = \pm a\cdot 2^{b-B_0} \quad\text{where}\quad a = 1 + \sum_{j=1}^{23} a_j\,2^{-j} \quad\text{and}\quad b = \sum_{k=0}^{7} b_k\,2^{k}\,.$$
The sign bit s indicates whether the number is positive or negative. The exponent bits bk ∈ {0, 1} represent
the exponent b in the binary basis where the bias B0 is chosen such that b ≥ 0. The size of B0 depends on
the exact type of real number, e.g. for the data type float the IEEE committee picked B0 = 127 . The base
bits $a_j\in\{0,1\}$ represent the base $1\le a < 2$. Thus numbers x in the range $1.2\cdot 10^{-38} < |x| < 3.4\cdot 10^{38}$
can be represented with approximately $\log_{10}(2^{23}) = 23\,\frac{\ln 2}{\ln 10} \approx 7$ significant decimal digits. It is important
to observe that not all numbers can be represented exactly. The absolute error is at least of the order $2^{-B_0-23}$
and the relative error is at most $2^{-23}$.
On the Intel 486 processor (see [Intel90, §15]) find the floating point data types in Table 2.1. The two
data types float and double exist in the memory only. As soon as a number is loaded into the CPU, it is
automatically converted to the extended format, the format used for all internal operations. When a num-
ber is moved from the CPU to memory it has to be converted to float or double format. The situation
on other hardware is very similar. As a consequence it is reasonable to assume that all computations will
be carried out with one of those two formats. The additional accuracy of the extended format is used as
guard digits. Find additional information in [DowdSeve98] or [HeroArnd01]. The reference [Gold91], also
available on the internet, gives more information on the IEEE-754 standard for floating point operations.
When analyzing the performance of algorithms for matrix operations a notation to indicate the accuracy
is necessary. When storing a real number x in memory some roundoff might occur and a number x (1 + ε)
is stored, where ε is bounded by a number u, depending on the CPU architecture used:
$$x \;\stackrel{\text{stored}}{\longrightarrow}\; x\,(1+\varepsilon) \quad\text{where}\quad |\varepsilon| \le u\,.$$
For the above standard formats we may work with the following values for the unit roundoff u.
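As an illustration (a sketch, not part of the original notes) the relative spacing of the two formats can be queried in Octave:

eps('single')   % 2^(-23) ≈ 1.19e-07, for the data type float
eps('double')   % 2^(-52) ≈ 2.22e-16, for the data type double
% the unit roundoff u equals this spacing, up to a factor 2 depending on the rounding mode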
1. Rescale the binary representation of the smaller number, such that it has the same exponent as
the larger number.
2. Add the two new base numbers using all available digits. Assuming that B binary digits are correct,
the error of the sum of the bases is of the size $2^{-B}$.
3. Convert the result into the correct binary representation.
$$x_1 + x_2 = a_1\cdot 2^{b_1} + a_2\cdot 2^{b_2} = a_1\cdot 2^{b_2-\Delta b} + a_2\cdot 2^{b_2} = \big(a_1\cdot 2^{-\Delta b} + a_2\big)\cdot 2^{b_2} = a\cdot 2^{b}$$
The absolute error is of the size err ≈ 2b2 −B . The relative error err/(x1 + x2 ) can not be estimated, since
the sum might be a small number if x1 and x2 are of similar size with opposite sign. If x1 and x2 are of
similar size and have the same sign, then the relative error may be estimated as 2−B .
The absolute error is of the size $\text{err} \approx 2^{b_1+b_2-B}$. The relative error $\text{err}/(x_1\cdot x_2)$ is estimated by $2^{-B}$.
Based on the above arguments conclude that any implementation of floating point operations will neces-
sarily lead to approximation errors in the results. For an algorithm to be useful we have to assure that those
errors can not falsify the results. More information on floating point operations may be found in [Wilk63]
or [YounGreg72, §2.7].
2–1 Example : To illustrate the above effects examine two numbers given by their decimal representation.
Use $x_1 = 7.65432\cdot 10^3$, $x_2 = 1.23456\cdot 10^5$ and assume that all operations are carried out with 6 significant
digits. There is no clever rounding scheme, all digits beyond the sixth are to be chopped off.
(a) Addition:
Using a computation with more digits leads to $x_1 + x_2 = 1.3111032\cdot 10^5$ and thus find an absolute
error of approximately 0.3 or a relative error of $\frac{0.3}{x_1+x_2} \approx 2\cdot 10^{-6}$.
(b) Subtraction:
With a computation with more digits find $x_1 - x_2 = -1.1580168\cdot 10^5$ and an absolute error of approx-
imately 0.3 or a relative error of $\frac{0.3}{|x_1-x_2|} \approx 3\cdot 10^{-6}$. Thus for this example the errors for addition and
subtraction are of similar size. If we were to redo the above computations with $x_1 = 1.23457\cdot 10^5$,
then the difference equals $0.00001\cdot 10^5 = 1$. The absolute error would not change drastically, but the
relative error would be huge, caused by the division by a very small number.
(c) Multiplication
The absolute error is approximately 8 · 102 and the relative error 10−6 , as has to be expected for a
multiplication with 6 significant digits.
♦
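The cancellation effect is easily reproduced (a sketch, not part of the original notes), here in the double format:

a = 1 + 1e-15;  b = 1;
a - b     % exact value 1e-15, computed result is 5*eps ≈ 1.1102e-15,
          % i.e. a relative error of roughly 10 percent after the cancellation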
2–2 Example : The effects of approximation errors on additions and subtractions can also be examined
assuming that $x_i$ is known with an error of $\Delta x_i$, i.e. $X_i = x_i \pm \Delta x_i$. Find
One typical operation involves one multiplication, one addition, a couple of address computations and access
to the data in memory or cache. Thus the above is selected as a standard and the effort to perform these
operations is called one flop², short for floating point operation. The abbreviation FLOPS stands for FLoating point
Operations Per Second.
The computational effort for the evaluation of transcendental functions is obviously higher than for an
addition or multiplication. Using a simple benchmark with MATLAB/Octave and C++ on a laptop with an
Intel I-7 processor Table 2.2 can be generated. The timings are normalized, such that the time to perform
one addition equals 1. No memory access is taken into account for this table. Observe that this is (at best)
a rough approximation and depends on many parameters, e.g. the exact hardware and the compiler used.
² Over time the common definition of a flop has evolved. In the old days a multiplication took considerably more time than
an addition or one memory access. Thus one flop used to equal one multiplication. With RISC architectures multiplications
became as fast as additions and thus one flop was either an addition or a multiplication. Suddenly most computers were twice as
fast. On current computers the memory access takes up most of the time and the memory structure is more important than the raw
multiplication/addition power. When reading performance data you have to verify which definition of flop is used. In almost all
cases the multiplications are counted.
2–3 Example : For a few basic operations the flop count is easy to determine.
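A few standard counts of this kind, listed here for illustration (not from the original notes): the scalar product of two vectors of length n requires n flops, the product of an m × n matrix with a vector requires m·n flops, and the product of two n × n matrices requires n³ flops.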
On modern computers the clock rate of the CPU is considerably higher than the clock rate for memory
access, i.e. it takes much longer to copy a floating point number from memory into the CPU, than to perform
a multiplication. To eliminate this bottleneck in the memory bandwidth sophisticated (and expensive) cache
schemes are used. Find further information in [DowdSeve98]. As a typical example examine the three level
cache structure of an Alpha processor shown in Figure 2.1. The data given for the access times in Figure 2.1
are partially estimated and take the cache overhead into account. The data is given by [DowdSeve98].
A typical Pentium IV system has 16 KB of level 1 cache (L1) and 512 or 1024 KB on-chip level 2 cache
(L2). The dual core Opteron processors available from AMD in 2005 have 64 KB of data cache and 64 KB
of instruction cache for each core. The level 2 cache of 1 MB for each core is on-chip too and runs at the
same clock rate as the CPU. These caches take up most of the area on the chip.
The importance of a fast and large cache is illustrated with the performance of a banded Cholesky
algorithm. For a given number n this algorithm requires fast access to n2 numbers. For the data type
double this leads to 8 n2 bytes of fast access memory. The table below shows some typical values.
SHA 21-5-21
CHAPTER 2. MATRIX COMPUTATIONS 28
• CPU registers at 500 MHz, resp. 2 ns
• Level 1 cache, 8+8 KB, 2 ns
• Level 2 cache, 96 KB, 5 ns
• Level 3 cache, 4 MB, off chip, 30 ns
Figure 2.1: Memory and cache access times on a typical 500 MHz Alpha 21164 microprocessor system
CPU FLOPS
NeXT (68040/25MHz) 1M
HP 735/100 10 M
SUN Sparc ULTRA 10 (440MHz) 50 M
Pentium III 800 (out of cache) 50 M
Pentium III 800 (in cache) 185 M
Pentium 4 2.6 GHz (out of cache) 370 M
Pentium 4 2.6 GHz (in cache) 450 M
Intel I7-920 (2.67 GHz) 700 M
Intel Haswell I7-5930 (3.5 GHz) 1’800 M
AMD Ryzen 3950X (3.5 GHz) 6’000 M
Table 2.3: FLOPS for a few CPU architectures, using one core only
Figure 2.2: FLOPS for a 21164 microprocessor system, four implementations of one algorithm
[Figure 2.3 consists of six panels, each showing the MFLOPS of four implementations of one algorithm as function of the problem size: (a) Pentium III with 512 KB cache, (b) 21264 system with 8 MB cache, (c) Intel Xeon 2.7 GHz with 2 MB cache, (d) Intel I7-920 2.7 GHz with 8 MB cache, (e) Intel Xeon 3.5 GHz with 15 MB cache, (f) AMD Ryzen 3950X 3.5 GHz with 64 MB cache.]
Figure 2.3: FLOPS for a few systems, four implementations of one algorithm
In Figure 2.2 find the performance result for this algorithm on an Alpha 21164 platform with the cache
structure from Figure 2.1. Four slightly different codes were tested with different compilers. All codes show
the typical drop-off in performance if the need for fast memory exceeds the available cache (4 MB).
In Figure 2.3 similar results for a few different systems are shown. The graphs clearly show that
Intel made considerable improvements with their memory access performance. All codes and platforms
show the typical drop-off in performance if the need for fast memory exceeds the available cache. All results
clearly indicate that the choice of compiler and very careful coding and testing are very important for good
performance of a given algorithm. In Table 2.3 the performance data for a few common CPU architectures
is shown.
For the most recent CPUs this data is rather difficult to determine, since the CPUs do not have a fixed
clock rate any more and might execute more than one instruction per cycle.
• each core has a separate L1 cache for data (32 KB) and code (32 KB)
• each core has a separate L2 cache for data and code (256 KB)
[Figure 2.4: A multi core CPU: each core with its own L1 caches (32 KB data, 32 KB code) and its own L2 cache; the 8 MB L3 cache is shared between the cores.]
• In 2014 Intel introduced the Haswell-E architecture, a clear step up from the Nehalem architecture.
The size of the third level cache is larger and a better memory interface is used. As an example
consider the CPU I7-5930:
– each core has a separate L1 cache for data (32 KB) and code (32 KB)
– each core has a separate L2 cache for data and code (256 KB)
– the large (15 MB) L3 cache is dynamically shared between the 6 cores
• In 2017 AMD introduced the Ryzen Threadripper 1950X with 16 cores, 96KB (32KB data, 64KB
code) of L1 cache per core, 8MB of L2 cache and 32MB of L3 cache.
• In the fall of 2019 AMD topped the above with the Ryzen 9 3950X with 16 cores. This processor has
64KB of L1 cache and 512KB of L2 cache dedicated for each of the 16 cores and 64MB of shared
L3 cache. The performance is outstanding, see Figure 2.3(f). The structure in Figure 2.4 is still valid,
just more cores and (a lot) more cache.
The above observations consider only single CPUs, possibly with multiple cores. For super-
computing clusters the effect of a huge number of computing cores is even more essential.
It is rather surprising to observe the development: in 1976 the Cray 1 cost 8.8 M$, weighed 5.5 tons
and could perform up to 160 MFLOPS. In 2014 a NVIDIA GeForce GTX Titan Black card cost 1'000 $ and
can deliver up to 1'800 GFLOPS (double precision, FP64). Thus the 2014 GPU is 11'000 times faster, very
light and at a very affordable price.
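The benchmark code referred to below is not shown in these notes; a minimal sketch consistent with the description (N = 2¹² and 10 repeated matrix multiplications) is:

N = 2^12;  A = rand(N);    % a full 4096 x 4096 matrix
tic;
for ii = 1:10
  B = A*A;                 % approximately N^3 flops per product
end%for
toc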
The above code was used on the host karman (Ryzen 3950X, 16 cores, 32 threads with hyper-threading)
and htop showed 32 active threads; the computation time was 3.16 seconds. If launching Octave after setting the environment variable export
OPENBLAS_NUM_THREADS=1, then only one thread is started and the computation time is 23.7 sec. For
each of the above 10 iterations more than N³ flops are required. With $N = 2^{12} = 4096$ this leads to $687\cdot 10^9$
flops. A computation time of 3.16 seconds corresponds to a computational power of 217 GFLOPS for the
host karman, which is considerably more than shown in Figure 2.3(f), where a single core is used with
a more complex algorithm. Using only a single thread (computation time 23.7 sec) leads to 27 GFLOPS.
Since the CPU has 16 cores another attempt using 16 threads shows a computation time of 2.72 seconds, i.e.
253 GFLOPS⁴. With $N = 2^{14} = 16'384$ using 16 threads the computational power is 324 GFLOPS. For
the same CPU a computational power between 6 GFLOPS and 324 GFLOPS showed up.
Resulting advice:
• Use as many threads as your CPU has true cores, i.e. ignore hyper-threading.
³ For a matrix A with $\|A\| < 1$ the inverse of I − A can be determined by the Neumann series $(I-A)^{-1} = I + A + A^2 + A^3 + \cdots$
• Do not trust the FLOPS data published by companies (e.g. AMD, Intel, . . . ); use your own application
as benchmark. The differences can be substantial.
[Figure: The approximate temperatures T₁, T₂, . . . , T₆ at the grid points x₁, x₂, . . . , x₆ on the interval.]
The value of $h = \frac{1}{n+1}$ represents the distance between two points $x_i$. Multiplying a vector by this matrix
corresponds to computing the second derivative. The differential equation is replaced by a system of linear
equations:
$$-\frac{d^2}{dx^2}\,u(x) = f(x) \quad\longrightarrow\quad A_n\,\vec u = \vec f\,.$$
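A minimal sketch, assuming the standard second difference form $A_n = \text{tridiag}(-1, 2, -1)/h^2$ (an assumption consistent with the eigenvalues stated below), together with a numerical check of the eigenvalue formula:

n = 5;  h = 1/(n+1);  e = ones(n,1);
An = spdiags([-e 2*e -e], -1:1, n, n)/h^2;       % A_n = tridiag(-1,2,-1)/h^2
[eig(full(An)), (4/h^2)*sin((1:n)'*pi*h/2).^2]   % numerical and exact eigenvalues agree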
• To analyze the performance of the matrix algorithms the eigenvalues $\lambda_j$ and eigenvectors $\vec v_j$ will be
useful, i.e. use $A_n\,\vec v_j = \lambda_j\,\vec v_j$.
– For $j \ll n$ observe
$$\lambda_j = \frac{4}{h^2}\,\sin^2\Big(j\,\frac{\pi h}{2}\Big) = 4\,(n+1)^2\,\sin^2\Big(\frac{j\,\pi}{2\,(n+1)}\Big) \approx 4\,(n+1)^2\,\Big(\frac{j\,\pi}{2\,(n+1)}\Big)^2 = \pi^2\,j^2$$
and in particular
$$\lambda_{min} = \lambda_1 \approx \pi^2\,.$$
– For the largest eigenvalue use
$$\lambda_{max} = \lambda_n = 4\,(n+1)^2\,\sin^2\Big(\frac{\pi}{2}\,\frac{n}{n+1}\Big) \approx 4\,n^2\,.$$
The above matrix will be useful for a number of the problems in the introductory chapter.
• An will be used to solve the static heat problem (1.2).
• For each time step in the dynamic heat problem (1.3) the matrix An will be used.
• To solve the steady state problem for the vertical displacements of a string, i.e. equation (1.7), the
above matrix is needed.
• For each time step of the dynamic string problem (1.8), we need the above matrix.
• The eigenvalues of the above matrix determine the solutions of the eigenvalue problem for the vibrat-
ing string problem, i.e. equation (1.9).
• The problems of the horizontal stretching of a beam (i.e. equations (1.13) and (1.14)) will lead to
matrices similar to the above.
• When using Newton’s method to solve the nonlinear problem of a bending beam ((1.16), (1.17)
or (1.18)) we will have to use matrices similar to the above.
⁵ To verify these eigenvalues and eigenvectors use the complex exponential function u(x) = exp(i α x). The values at the grid
points $x_k = k\,h$ are $u_k = \exp(i\,\alpha\,k\,h)$, leading to a vector $\vec u$. The matrix multiplication computes
$(A_n\,\vec u)_k = \frac{1}{h^2}\,(-u_{k-1} + 2\,u_k - u_{k+1}) = \frac{2 - 2\cos(\alpha h)}{h^2}\,u_k = \frac{4}{h^2}\,\sin^2\big(\frac{\alpha h}{2}\big)\,u_k\,.$
[Figure: The numbering of the grid points on the unit square for n = 4: the nodes are numbered row by row, 1 to 4 in the lowest row up to 13 to 16 in the top row.]
• The first upper and lower diagonals are almost filled with −1 . If the column/row index of the diagonal
entry is a multiple of n, then the numbers to the right and below are zero.
The missing numbers in the first off-diagonals can be explained by Figure 2.6. The gaps are caused by the
fact that if the node number is a multiple of n (i.e. k·n), then the node has no direct connection to the node
with number k·n + 1.
We replace the partial differential equation by a system of linear equations:
$$-\frac{\partial^2 u(x,y)}{\partial x^2} - \frac{\partial^2 u(x,y)}{\partial y^2} = \frac{1}{k}\,f(x,y) \quad\longrightarrow\quad A_{nn}\,\vec u = \vec f\,.$$
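A minimal sketch (not from the original notes): Ann can be generated from An with Kronecker products, which also makes the band structure visible.

n = 4;  h = 1/(n+1);  e = ones(n,1);
An  = spdiags([-e 2*e -e], -1:1, n, n)/h^2;      % the 1D model matrix
Ann = kron(speye(n), An) + kron(An, speye(n));   % the 2D model matrix
spy(Ann)    % shows the bands, with the gaps in the first off-diagonals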
Observe the following important facts about this matrix Ann :
• The matrix is sparse (very few nonzero entries) and has a band structure.
• To analyze the performance of the different algorithms it is convenient to know the eigenvalues.
$$\lambda_{i,j} = \frac{4}{h^2}\,\sin^2\Big(j\,\frac{\pi h}{2}\Big) + \frac{4}{h^2}\,\sin^2\Big(i\,\frac{\pi h}{2}\Big) \qquad\text{for } 1\le i,j\le n\,.$$
– For $i, j \ll n$ obtain
$$\lambda_{i,j} \approx \pi^2\,(i^2 + j^2)$$
and in particular
$$\lambda_{min} = \lambda_{1,1} \approx 2\,\pi^2\,.$$
For the largest eigenvalue find
$$\lambda_{max} = \lambda_{n,n} \approx 8\,n^2\,.$$
The above matrix will be useful for a number of the problems in the introductory chapter.
• Ann will be used to solve the static heat problem (1.5) with 2 space dimensions.
• For each time step in the dynamic heat problem (1.6) the matrix Ann will be used.
• To solve the steady state problem for the vertical displacements of a membrane, i.e. equation (1.11),
we need the above matrix.
• For each time step of the dynamic membrane problem (1.10), the above matrix is needed.
• The eigenvalues of the above matrix determine the solutions of the eigenvalue problem for the vibrat-
ing membrane problem, i.e. equation (1.12).
One can construct the model matrix Annn corresponding to the finite difference approximation of a
differential equation examined on a unit cube in space R3 . In Table 2.4 find the key properties of the model
matrices. Observe that all matrices are symmetric and positive definite.
• Computational cost: how many operations are required to solve the system?
Table 2.4: Properties of the model matrices

                                   An              Ann             Annn
  differential equation on        unit interval   unit square     unit cube
  size of grid                    n               n × n           n × n × n
  size of matrix                  n × n           n² × n²         n³ × n³
  semi bandwidth                  2               n               n²
  nonzero entries                 3 n             5 n²            7 n³
  smallest eigenvalue λmin ≈      π²              2 π²            3 π²
  largest eigenvalue λmax ≈       4 n²            8 n²            12 n²
  condition number κ ≈            (4/π²) n²       (4/π²) n²       (4/π²) n²
• Memory cost: how much memory of what type is required to solve the system?
• Accuracy: what type and size of errors are introduced by the algorithm? Is there an unnecessary loss
of accuracy? This is an essential criterion if you wish to obtain reliable solutions and not a heap of
random numbers.
• Special properties: does the algorithm use special properties of the matrix A? Some of those proper-
ties to be considered are: symmetry, positive definiteness, band structure and sparsity.
• Multi core architecture: can the algorithm take advantage of a multi core architecture?
There are three classes of solvers for linear systems of equations to be examined in these notes:
• Direct solvers: we will examine the standard LR factorization and the Cholesky factorization, Sec-
tions 2.4.1 and 2.6.1.
• Sparse direct solvers: we examine the banded Cholesky factorization, the simplest possible case, see
Section 2.6.4.
• Iterative solvers: we will examine the most important case, the conjugate gradient algorithm, see
Section 2.7.
One of the conclusions you should draw from the following sections is that it is almost never necessary to
compute the inverse of a given matrix.
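A small illustration of this advice (a sketch, not from the original notes): in Octave/MATLAB the backslash operator solves a system via a factorization, without ever forming the inverse.

A = rand(1000);  b = rand(1000,1);
x1 = A\b;          % solve using a factorization
x2 = inv(A)*b;     % explicit inverse: roughly four times the work, and less accurate
norm(x1 - x2)      % the two results agree up to roundoff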
2.4.1 LR Factorization
When solving linear systems of equations the concept of matrix factorization is essential. For a known
vector ~b ∈ Rn and a given n × n matrix A we search for solutions ~x ∈ Rn of the system A · ~x = ~b. Many
algorithms use a matrix factorization A = L · R where the matrices have special properties to simplify
solving the system of linear equations.
2–4 Definition :
• A matrix R is called an upper triangular matrix iff⁶ $r_{i,j} = 0$ for all i > j, i.e. all entries below the
main diagonal are equal to 0 .
• A matrix L is called a lower triangular matrix iff $l_{i,j} = 0$ for all i < j, i.e. all entries above the
main diagonal are equal to 0 .
• Introduce an auxiliary vector $\vec y$. First solve the system $L\,\vec y = \vec b$ from top to bottom. For a 3 × 3
matrix the equations represented by the rows of L read as
$$1\,y_1 = b_1\,,\qquad l_{2,1}\,y_1 + 1\,y_2 = b_2\,,\qquad l_{3,1}\,y_1 + l_{3,2}\,y_2 + 1\,y_3 = b_3\,.$$
⁶ iff: used by mathematicians to spell out "if and only if"
y(1) = b(1);
for i = 2:n
  % subtract the already known contributions; the diagonal entry of L is 1
  y(i) = b(i) - L(i,1:i-1)*y(1:i-1);
end%for
• Subsequently we consider the linear equations with the matrix R, e.g. for a 3 × 3 matrix
x(n) = y(n)/R(n,n);
for i = n-1:-1:1
  % subtract the contributions of the already known components, then divide
  x(i) = ( y(i) - R(i,i+1:n)*x(i+1:n) ) / R(i,i);
end%for
• Forward and backward substitution each require approximately
$$\sum_{k=1}^{n} k \approx \frac{1}{2}\,n^2$$
flops. Thus the total computational effort is given by $n^2$ flops.
The above observations show that systems of linear equations are easily solved, once the original matrix is
factorized as a product of a left and right triangular matrix. In the next section we show that this factorization
can be performed using the ideas of the algorithm of Gauss to solve linear systems of equations.
• Multiply the first row by 3/2 and add it to the second row.
• Multiply the first row by 2 and subtract it from the third row.
• Multiply the modified second row by 3 and add it to the third row.
The resulting system is represented by a triangular matrix and can easily be solved from bottom to top:
$$\begin{array}{rcrcrcr} 2\,x_1 &+& 6\,x_2 &+& 2\,x_3 &=& 2\\ && 1\,x_2 &+& 3\,x_3 &=& 7\\ &&&& 7\,x_3 &=& 23\,. \end{array}$$
The next goal is to verify that an LR factorization is a clever notation for the algorithm of Gauss. To
verify this we use the notation of block matrices and a recursive scheme, i.e. we start with a problem of
size n × n and reduce it to a problem of size (n − 1) × (n − 1). For this we divide a matrix in 4 blocks of
submatrices, i.e.
$$A = \begin{bmatrix} a_{1,1} & a_{1,2} & a_{1,3} & \ldots & a_{1,n}\\ a_{2,1} & a_{2,2} & a_{2,3} & \ldots & a_{2,n}\\ a_{3,1} & a_{3,2} & a_{3,3} & \ldots & a_{3,n}\\ \vdots & \vdots & \vdots & & \vdots\\ a_{n,1} & a_{n,2} & a_{n,3} & \ldots & a_{n,n} \end{bmatrix} = \left[\begin{array}{c|ccc} a_{1,1} & a_{1,2} & \ldots & a_{1,n}\\ \hline a_{2,1} & & &\\ \vdots & & A_{n-1} &\\ a_{n,1} & & & \end{array}\right]\,.$$
The submatrix $A_{n-1}$ is a (n − 1) × (n − 1) matrix. Using this notation we are searching n × n matrices L
and R such that A = L · R, i.e.
$$\left[\begin{array}{c|ccc} a_{1,1} & a_{1,2} & \ldots & a_{1,n}\\ \hline a_{2,1} & & &\\ \vdots & & A_{n-1} &\\ a_{n,1} & & & \end{array}\right] = \left[\begin{array}{c|ccc} 1 & 0 & \ldots & 0\\ \hline l_{2,1} & & &\\ \vdots & & L_{n-1} &\\ l_{n,1} & & & \end{array}\right] \cdot \left[\begin{array}{c|ccc} r_{1,1} & r_{1,2} & \ldots & r_{1,n}\\ \hline 0 & & &\\ \vdots & & R_{n-1} &\\ 0 & & & \end{array}\right]\,.$$
Using the standard matrix multiplication compute the entries in the 4 segments of A separately and then
examine A block by block.
• Examine the top left block (one single number) in A:
$$a_{1,1} = 1\cdot r_{1,1}\,.$$
• Examine the first row of A: $a_{1,j} = 1\cdot r_{1,j}$, i.e. the first row of R equals the first row of A.
• Examine the first column of A: $a_{i,1} = l_{i,1}\cdot r_{1,1}$ for $i\ge 2$, i.e. $l_{i,1} = a_{i,1}/a_{1,1}$.
For the standard algorithm of Gauss, this is the multiple to be used when adding a multiple of the first
row to row i. This step might fail if $a_{1,1} = 0$. This possible problem can be avoided with the help of
proper pivoting. This will be examined later in this course, see Section 2.5.3, page 51.
• Examine the lower right block: $A_{n-1}$ equals the product of the first column of L with the first row
of R, plus $L_{n-1}\cdot R_{n-1}$. Thus $L_{n-1}\cdot R_{n-1} = \tilde A_{n-1}$, where $\tilde A_{n-1}$ is $A_{n-1}$ modified by the above row operations.
Thus the lower triangular matrix L keeps track of which row operations have to be applied to transform
$A_{n-1}$ into $\tilde A_{n-1}$.
• Observe that the first row and column of A will not have to be used again in the next step. Thus for a
memory efficient implementation we may overwrite the first row and column of A with the first row
of R and the first column of L. The number 1 on the diagonal of L does not have to be stored.
After having performed the above steps we are left with a similar question, but the size of the matrices was
reduced by 1. By recursion we can restart the above process with the reduced matrices. Finally we will find
the LR factorization of the matrix A. The total operation count is given by
$$\text{Flop}_{LR} = \sum_{k=1}^{n} (k^2 - k) \approx \frac{1}{3}\,n^3\,.$$
It is possible to compute the inverse $A^{-1}$ of a given square matrix A with the help of the LR factoriza-
tion. It can be shown (e.g. [Schw86]) that the computational effort is approximately $\frac{4}{3}\,n^3$ and thus 4 times
as high as solving one system of linear equations directly.
2–5 Observation : Adding multiples of one row to another row in a large matrix can be implemented in
parallel on a multicore architecture, as shown in Section 2.2.3. For this to be efficient the number of columns
has to be considerably larger than the number of CPU cores to be used. On most computers this task is taken
over by a well chosen BLAS library (Basic Linear Algebra Subroutines). Excellent versions are provided by
OPENBLAS or by ATLAS (Automatically Tuned Linear Algebra Software). MATLAB uses the Intel Math
Kernel Library and Octave is built on OPENBLAS by default. ♦
2–6 Example : The above general calculations are illustrated with the numerical example used at the start
of this section. For the given 3 × 3 matrix A we first examine the first column of the left triangular matrix
L and the first row of the right triangular matrix R.
$$\begin{bmatrix} 2 & 6 & 2\\ -3 & -8 & 0\\ 4 & 9 & 2 \end{bmatrix} = \begin{bmatrix} 1 & 0 & 0\\ l_{2,1} & 1 & 0\\ l_{3,1} & l_{3,2} & 1 \end{bmatrix}\cdot\begin{bmatrix} r_{1,1} & r_{1,2} & r_{1,3}\\ 0 & r_{2,2} & r_{2,3}\\ 0 & 0 & r_{3,3} \end{bmatrix} = \begin{bmatrix} 1 & 0 & 0\\ -\frac{3}{2} & 1 & 0\\ 2 & l_{3,2} & 1 \end{bmatrix}\cdot\begin{bmatrix} 2 & 6 & 2\\ 0 & r_{2,2} & r_{2,3}\\ 0 & 0 & r_{3,3} \end{bmatrix}\,.$$
Then we restart the computation with the 2 × 2 blocks in the lower right corner of the above matrices. From
the above conclude
$$\begin{bmatrix} -8 & 0\\ 9 & 2 \end{bmatrix} = \begin{bmatrix} -\frac{3}{2}\\ 2 \end{bmatrix}\cdot\begin{bmatrix} 6 & 2 \end{bmatrix} + \begin{bmatrix} 1 & 0\\ l_{3,2} & 1 \end{bmatrix}\cdot\begin{bmatrix} r_{2,2} & r_{2,3}\\ 0 & r_{3,3} \end{bmatrix} = \begin{bmatrix} -9 & -3\\ 12 & 4 \end{bmatrix} + \begin{bmatrix} 1 & 0\\ l_{3,2} & 1 \end{bmatrix}\cdot\begin{bmatrix} r_{2,2} & r_{2,3}\\ 0 & r_{3,3} \end{bmatrix}\,.$$
The 2 × 2 block of A has to be modified first by adding the correct multiples of the first row of A, i.e.
$$\begin{bmatrix} -8 & 0\\ 9 & 2 \end{bmatrix} - \begin{bmatrix} -9 & -3\\ 12 & 4 \end{bmatrix} = \begin{bmatrix} 1 & 3\\ -3 & -2 \end{bmatrix}\,.$$
The only missing value $r_{3,3}$ can be determined by examining the lower right corner of the above matrix
product:
$$-2 = (-3)\cdot 3 + 1\cdot r_{3,3} \implies r_{3,3} = 7\,.$$
Thus conclude
$$A = \begin{bmatrix} 2 & 6 & 2\\ -3 & -8 & 0\\ 4 & 9 & 2 \end{bmatrix} = \begin{bmatrix} 1 & 0 & 0\\ -\frac{3}{2} & 1 & 0\\ 2 & -3 & 1 \end{bmatrix}\cdot\begin{bmatrix} 2 & 6 & 2\\ 0 & 1 & 3\\ 0 & 0 & 7 \end{bmatrix} = L\cdot R\,.$$
first solve
$$\begin{bmatrix} 1 & 0 & 0\\ -\frac{3}{2} & 1 & 0\\ 2 & -3 & 1 \end{bmatrix}\cdot\begin{bmatrix} y_1\\ y_2\\ y_3 \end{bmatrix} = \begin{bmatrix} 2\\ 4\\ 6 \end{bmatrix}\,.$$
Solving from top to bottom leads to $\vec y = (2, 7, 23)^T$. Then solve $R\,\vec x = \vec y$, i.e. $2\,x_1 + 6\,x_2 + 2\,x_3 = 2$, $x_2 + 3\,x_3 = 7$, $7\,x_3 = 23$, from bottom to top.
This is exactly the system we are left with after the matrix A was reduced to echelon form. This should
illustrate that the LR factorization is a clever way to formulate the algorithm of Gauss. ♦
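The two triangular solves of this example can be checked quickly (a sketch, not part of the original notes):

L = [1 0 0; -3/2 1 0; 2 -3 1];  R = [2 6 2; 0 1 3; 0 0 7];
b = [2; 4; 6];
y = L\b     % forward substitution, gives y = (2, 7, 23)'
x = R\y     % backward substitution, the solution of A*x = b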
L = eye(n);  R = zeros(n);   % initialize, with ones on the diagonal of L
for k = 1:n
  R(k,k:n) = A(k,k:n);
  if (R(k,k) == 0) error("LRtest: division by 0"); endif
  % divide the numbers in the k-th column, below the diagonal, by A(k,k)
  L(k+1:n,k) = A(k+1:n,k)/R(k,k);
  % apply the row operations to A
  for j = k+1:n
    A(j,k+1:n) = A(j,k+1:n) - A(k,k+1:n)*L(j,k);
  end%for
end%for
The only purpose of the above code is to help the reader to understand the algorithm. It should never be
used to solve a real problem.
• No pivoting is done and thus the code might fail on perfectly solvable problems. Running this code
will also lead to unnecessary rounding errors.
• The code is not memory efficient at all. It keeps copies of 3 full size matrices around.
There are considerably better implementations, based on the above ideas. The code can be tested using the
model matrix An from Section 2.3.1.
n = 5;  h = 1/(n+1);
% completion (assumed, the remainder of the original test is not shown):
% the model matrix A_n from Section 2.3.1
e = ones(n,1);  A = full(spdiags([-e 2*e -e], -1:1, n, n))/h^2;
Observe that the triangular matrices L and R have all nonzero entries on the diagonal and the first off-
diagonal.
Applying row and column operations to matrices can be described by multiplications with elementary
matrices, from the left or the right.
• Multiplying a matrix A by an elementary matrix from the left has the same effect as applying the row
operation to the matrix:
$$\begin{bmatrix} 1 & 0 & 0\\ \frac{3}{2} & 1 & 0\\ 0 & 0 & 1 \end{bmatrix}\cdot\begin{bmatrix} 2 & 6 & 2\\ -3 & -8 & 0\\ 4 & 9 & 2 \end{bmatrix} = \begin{bmatrix} 2 & 6 & 2\\ 0 & 1 & 3\\ 4 & 9 & 2 \end{bmatrix}$$
• Multiplying a matrix A by an elementary matrix from the right has the same effect as applying the
column operation to the matrix:
$$\begin{bmatrix} 2 & 6 & 2\\ -3 & -8 & 0\\ 4 & 9 & 2 \end{bmatrix}\cdot\begin{bmatrix} 1 & 0 & 0\\ 2 & 1 & 0\\ 0 & 0 & 1 \end{bmatrix} = \begin{bmatrix} 14 & 6 & 2\\ -19 & -8 & 0\\ 22 & 9 & 2 \end{bmatrix}$$
Using multiple row operations we can perform an LR factorization. We use again the same example as
before. The row operations to be applied are:
1. Add 3/2 times the first row to the second row.
2. Subtract 2 times the first row from the third row.
3. Add 3 times the modified second row to the third row.
These operations are visible on the left in Figure 2.7. On the right find the corresponding elementary
matrices.
$$\begin{bmatrix} 2 & 6 & 2\\ -3 & -8 & 0\\ 4 & 9 & 2 \end{bmatrix} \quad R_2 \leftarrow R_2 + \tfrac{3}{2}\,R_1\,,\qquad E_1 = \begin{bmatrix} 1 & 0 & 0\\ +\frac{3}{2} & 1 & 0\\ 0 & 0 & 1 \end{bmatrix},\quad E_1^{-1} = \begin{bmatrix} 1 & 0 & 0\\ -\frac{3}{2} & 1 & 0\\ 0 & 0 & 1 \end{bmatrix}$$
$$\begin{bmatrix} 2 & 6 & 2\\ 0 & 1 & 3\\ 4 & 9 & 2 \end{bmatrix} \quad R_3 \leftarrow R_3 - 2\,R_1\,,\qquad E_2 = \begin{bmatrix} 1 & 0 & 0\\ 0 & 1 & 0\\ -2 & 0 & 1 \end{bmatrix},\quad E_2^{-1} = \begin{bmatrix} 1 & 0 & 0\\ 0 & 1 & 0\\ +2 & 0 & 1 \end{bmatrix}$$
$$\begin{bmatrix} 2 & 6 & 2\\ 0 & 1 & 3\\ 0 & -3 & -2 \end{bmatrix} \quad R_3 \leftarrow R_3 + 3\,R_2\,,\qquad E_3 = \begin{bmatrix} 1 & 0 & 0\\ 0 & 1 & 0\\ 0 & +3 & 1 \end{bmatrix},\quad E_3^{-1} = \begin{bmatrix} 1 & 0 & 0\\ 0 & 1 & 0\\ 0 & -3 & 1 \end{bmatrix}$$
$$\begin{bmatrix} 2 & 6 & 2\\ 0 & 1 & 3\\ 0 & 0 & 7 \end{bmatrix}$$
Figure 2.7: The elimination steps and the corresponding elementary matrices
The row operations from Figure 2.7 are used to construct the LR factorization. Start with A = I · A and
use the elementary matrices for row and column operations.
Observe that we apply row operations to transform the matrix on the right to upper echelon form. The
matrix on the left keeps track of the operations to be applied.
$$\begin{bmatrix} 2 & 6 & 2\\ -3 & -8 & 0\\ 4 & 9 & 2 \end{bmatrix} = \begin{bmatrix} 1 & 0 & 0\\ 0 & 1 & 0\\ 0 & 0 & 1 \end{bmatrix}\cdot\begin{bmatrix} 2 & 6 & 2\\ -3 & -8 & 0\\ 4 & 9 & 2 \end{bmatrix} \qquad E_1^{-1}\cdot E_1$$
$$= \begin{bmatrix} 1 & 0 & 0\\ -\frac{3}{2} & 1 & 0\\ 0 & 0 & 1 \end{bmatrix}\cdot\begin{bmatrix} 2 & 6 & 2\\ 0 & 1 & 3\\ 4 & 9 & 2 \end{bmatrix} \qquad E_2^{-1}\cdot E_2$$
$$= \begin{bmatrix} 1 & 0 & 0\\ -\frac{3}{2} & 1 & 0\\ +2 & 0 & 1 \end{bmatrix}\cdot\begin{bmatrix} 2 & 6 & 2\\ 0 & 1 & 3\\ 0 & -3 & -2 \end{bmatrix} \qquad E_3^{-1}\cdot E_3$$
$$= \begin{bmatrix} 1 & 0 & 0\\ -\frac{3}{2} & 1 & 0\\ +2 & -3 & 1 \end{bmatrix}\cdot\begin{bmatrix} 2 & 6 & 2\\ 0 & 1 & 3\\ 0 & 0 & 7 \end{bmatrix}$$
Thus we constructed the LR factorization of the matrix A:
$$\begin{bmatrix} 2 & 6 & 2\\ -3 & -8 & 0\\ 4 & 9 & 2 \end{bmatrix} = \begin{bmatrix} 1 & 0 & 0\\ -\frac{3}{2} & 1 & 0\\ +2 & -3 & 1 \end{bmatrix}\cdot\begin{bmatrix} 2 & 6 & 2\\ 0 & 1 & 3\\ 0 & 0 & 7 \end{bmatrix} = L\cdot R\,.$$
2–8 Example : It is an exercise to verify that the following three norms satisfy the above properties:
$$\|\vec x\| = \|\vec x\|_2 = \sqrt{\sum_{i=1}^n |x_i|^2} = (x_1^2 + x_2^2 + \ldots + x_n^2)^{1/2} = \sqrt{\langle\vec x\,,\,\vec x\rangle}$$
$$\|\vec x\|_1 = \sum_{i=1}^n |x_i| = |x_1| + |x_2| + \ldots + |x_n|$$
$$\|\vec x\|_\infty = \max_{1\le i\le n} |x_i|$$
On a given vector space one can have multiple norms, e.g. $\|\vec x\|_A$ and $\|\vec x\|_B$. Those norms are said to be
equivalent if there exist positive constants $c_1$ and $c_2$ such that
$$c_1\,\|\vec x\|_A \le \|\vec x\|_B \le c_2\,\|\vec x\|_A \quad\text{for all } \vec x\,.$$
On the vector space $\mathbb R^n$ all norms are equivalent and one may verify the following inequalities. If we have
information on one of the possible norms of a vector $\vec x$ we have some information on the size of the other
norms too.
2–9 Result : For all vectors $\vec x\in\mathbb R^n$ we have
$$\|\vec x\|_2 \le \|\vec x\|_1 \le \sqrt{n}\,\|\vec x\|_2\,,\qquad
\|\vec x\|_\infty \le \|\vec x\|_2 \le \sqrt{n}\,\|\vec x\|_\infty\,,\qquad
\|\vec x\|_\infty \le \|\vec x\|_1 \le n\,\|\vec x\|_\infty\,.$$
Proof : For the interested reader some details of the computations are shown here.
$$\|\vec x\|_2^2 = \sum_{i=1}^n x_i^2 \le \Big(\sum_{i=1}^n |x_i|\Big)^2 = \|\vec x\|_1^2$$
$$\|\vec x\|_1 = \sum_{i=1}^n |x_i| = \langle\vec I\,,\,|\vec x|\rangle \le \|\vec I\|_2\cdot\|\vec x\|_2 = \sqrt{n}\,\|\vec x\|_2 \qquad\text{(with } \vec I = (1,1,\ldots,1)^T\text{)}$$
$$\|\vec x\|_\infty = \max\{|x_i|\} = \sqrt{\max\{x_i^2\}} \le \sqrt{\sum_{i=1}^n x_i^2} = \|\vec x\|_2 \le \sqrt{n\,\max\{x_i^2\}} = \sqrt{n}\,\|\vec x\|_\infty$$
$$\|\vec x\|_\infty = \max |x_i| \le \sum_{i=1}^n |x_i| = \|\vec x\|_1 \le n\,\max |x_i| = n\,\|\vec x\|_\infty$$
A matrix norm of a matrix A should give us some information on the length of the vector ~y = A · ~x,
based on the length of ~x. Thus we require the basic inequality
$$\|A\cdot\vec x\| \le \|A\|\cdot\|\vec x\|\,,$$
i.e. the norm ‖A‖ is the largest occurring amplification factor when multiplying a vector with this matrix A.
2–10 Definition : For each vector norm there is a resulting matrix norm defined by
$$\|A\| = \max_{\vec x\ne\vec 0} \frac{\|A\cdot\vec x\|}{\|\vec x\|} = \max_{\|\vec x\|=1} \|A\cdot\vec x\|$$
$$\|A\|_2 = \max_{\vec x\ne\vec 0} \frac{\|A\cdot\vec x\|_2}{\|\vec x\|_2} = \max_{\|\vec x\|_2=1} \|A\cdot\vec x\|_2\,, \qquad
\|A\|_1 = \max_{\vec x\ne\vec 0} \frac{\|A\cdot\vec x\|_1}{\|\vec x\|_1} = \max_{\|\vec x\|_1=1} \|A\cdot\vec x\|_1\,, \qquad
\|A\|_\infty = \max_{\vec x\ne\vec 0} \frac{\|A\cdot\vec x\|_\infty}{\|\vec x\|_\infty} = \max_{\|\vec x\|_\infty=1} \|A\cdot\vec x\|_\infty$$
where the maximum is taken over all ~x ∈ Rⁿ with ~x ≠ ~0, or equivalently over all vectors with ‖~x‖ = 1.
2–11 Example : For a given m × n matrix A the two norms ‖A‖₁ and ‖A‖∞ are rather easy to compute:
$$\|A\|_1 = \max_{1\le j\le n} \Big(\sum_{i=1}^m |a_{i,j}|\Big) = \text{maximal column sum}$$
$$\|A\|_\infty = \max_{1\le i\le m} \Big(\sum_{j=1}^n |a_{i,j}|\Big) = \text{maximal row sum}$$
♦
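These formulas are easily checked in Octave: max(sum(abs(A),1)) is the maximal column sum and max(sum(abs(A),2)) the maximal row sum. A sketch with an arbitrary test matrix:

Octave
A = [2 -6 2; -3 8 0; 4 9 -2];
[max(sum(abs(A),1)), norm(A,1)]   % both return 23, the maximal column sum
[max(sum(abs(A),2)), norm(A,Inf)] % both return 15, the maximal row sum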
Proof : We examine ‖A‖∞ first. Assume that the maximum value of ‖~y‖∞ = ‖A · ~x‖∞ is attained for
the vector ~x with ‖~x‖∞ = 1. Then all components have to be x_j = ±1, otherwise we could increase ‖~y‖∞
without changing ‖~x‖∞. If the maximal value of |y_i| is attained at the component with index p then the
matrix multiplication
$$\begin{pmatrix} y_1\\ y_2\\ y_3\\ \vdots\\ y_m \end{pmatrix} =
\begin{pmatrix} a_{1,1} & a_{1,2} & a_{1,3} & \ldots & a_{1,n}\\
a_{2,1} & a_{2,2} & a_{2,3} & \ldots & a_{2,n}\\
a_{3,1} & a_{3,2} & a_{3,3} & \ldots & a_{3,n}\\
\vdots & \vdots & \vdots & & \vdots\\
a_{m,1} & a_{m,2} & a_{m,3} & \ldots & a_{m,n} \end{pmatrix} \cdot
\begin{pmatrix} x_1\\ x_2\\ x_3\\ \vdots\\ x_n \end{pmatrix}$$
implies
$$y_p = \sum_{j=1}^n a_{p,j}\,x_j = \sum_{j=1}^n a_{p,j}\,(\pm 1) = \sum_{j=1}^n |a_{p,j}|\,.$$
This leads to the claimed result
$$\|A\|_\infty = \|\vec y\|_\infty = \max_{1\le i\le m} \sum_{j=1}^n |a_{i,j}|\,.$$
If the above column maximum is attained in column k choose x_k = 1 and all other components of ~x are set
to zero. For this special vector then find
$$\|A\cdot\vec x\|_1 = \sum_{i=1}^m |a_{i,k}\cdot 1| = \sum_{i=1}^m |a_{i,k}| = \|A\|_1\,.$$
2
Unfortunately the most important norm ‖A‖ = ‖A‖₂ is not easily computed. But for m × n matrices
A we have the following inequalities
$$\frac{1}{\sqrt n}\,\|A\|_\infty \le \|A\|_2 \le \sqrt m\,\|A\|_\infty\,, \qquad
\frac{1}{\sqrt m}\,\|A\|_1 \le \|A\|_2 \le \sqrt n\,\|A\|_1$$
and thus we might be able to estimate the size of ‖A‖₂ with the help of the other norms. The proofs of
the above statements are based on Result 2–9. A precise result on the 2-norm is given in [GoluVanLoan96,
Theorem 3.2.1]. The result is stated here for the sake of completeness.
2–12 Result : For any m × n matrix A there exists a vector ~z ∈ Rⁿ with ‖~z‖₂ = 1 such that AᵀA · ~z = µ² ~z
and ‖A‖₂ = µ. Since AᵀA is symmetric and positive semidefinite we know that all eigenvalues are real and
nonnegative. Thus ‖A‖₂ is the square root of the largest eigenvalue of the n × n matrix AᵀA. 3
We might attempt to compute all eigenvalues of the symmetric matrix AT A and then compute the
square root of the largest eigenvalue. Since it is computationally expensive to determine all eigenvalues the
task remains difficult. There are special algorithms (power method) to estimate the largest eigenvalue of
AT A, used in the MATLAB/Octave functions normest(), condest() and eigs(), see Section 3.2.4.
In Result 3–22 (page 134) find a few facts on symmetric matrices and the corresponding eigenvalues and
eigenvectors. Using this result one can verify that for real, symmetric matrices
$$A \text{ symmetric} \implies \|A\|_2 = \max_i |\lambda_i|\,.$$
This can also be confirmed directly. For a given symmetric matrix A let ~e_j be the basis of normalized
eigenvectors. Then we write the arbitrary vector ~x as a linear combination of the eigenvectors ~e_j.
$$\vec x = \sum_{j=1}^n c_j\,\vec e_j \implies A\cdot\vec x = \sum_{j=1}^n c_j\,\lambda_j\,\vec e_j$$
If ‖~x‖ = 1 then
$$1 = \|\vec x\|^2 = \sum_{j=1}^n |c_j|^2 \implies \|A\cdot\vec x\|^2 = \sum_{j=1}^n |\lambda_j|^2\,|c_j|^2$$
and the largest possible value will be attained if c_n = 1 and all other c_j = 0, i.e. the vector ~x points in the
direction of the eigenvector with the largest eigenvalue.
Similarly we can determine the norm of the inverse matrix A⁻¹. Since
$$\vec x = \sum_{j=1}^n c_j\,\vec e_j \implies A^{-1}\cdot\vec x = \sum_{j=1}^n \frac{c_j}{\lambda_j}\,\vec e_j$$
conclude that ‖A⁻¹‖₂ = max_i 1/|λ_i| = 1/min_i |λ_i| for a symmetric, invertible matrix A.
2–13 Example : For the matrix An in Section 2.3.1 (page 32) the matrix norms are easily computed. The
maximal column and row sums are given by (1/h²)(1 + 2 + 1) and thus
$$\|A_n\|_1 = \|A_n\|_\infty = \frac{4}{h^2} \approx 4\,n^2\,.$$
In this case the three matrix norms are approximately equal, at least for large values of n. ♦
2–14 Example : The norm of an orthogonal matrix Q equals 1, i.e. ‖Q‖₂ = 1. To verify this use
$$\|Q\,\vec x\|_2^2 = \langle Q\,\vec x\,,\,Q\,\vec x\rangle = \langle\vec x\,,\,Q^T Q\,\vec x\rangle = \langle\vec x\,,\,\vec x\rangle = \|\vec x\|_2^2\,.$$
Similarly find that ‖Q⁻¹‖₂ = 1. To verify this use (Q⁻¹)ᵀ = (Qᵀ)⁻¹ and
$$\|Q^{-1}\,\vec x\|_2^2 = \langle Q^{-1}\,\vec x\,,\,Q^{-1}\,\vec x\rangle = \langle\vec x\,,\,(Q^{-1})^T Q^{-1}\,\vec x\rangle = \langle\vec x\,,\,\vec x\rangle = \|\vec x\|_2^2\,.$$
♦
i.e. the relative error in the given vector ~b might at worst be multiplied by the condition number to obtain
the relative error in the solution ~x.
As an example reconsider the above symmetric matrix A and use the vectors ~b = ~e_n and ~b_p = ~e_n + ε ~e_1.
Thus we examine a relative error of 0 < ε ≪ 1 in ~b.
$$\frac{\|\vec x_p - \vec x\|}{\|\vec x\|} = \frac{\|A^{-1}\cdot(\vec e_n + \varepsilon\,\vec e_1) - A^{-1}\cdot\vec e_n\|}{\|A^{-1}\cdot\vec e_n\|} = \frac{\|\varepsilon\,A^{-1}\cdot\vec e_1\|}{\|A^{-1}\cdot\vec e_n\|} = \varepsilon\,\frac{1/|\lambda_1|}{1/|\lambda_n|} = \varepsilon\,\frac{|\lambda_n|}{|\lambda_1|}\,.$$
In the above example the correct vector is divided by the largest possible number (λₙ) but the error is divided
by the smallest possible number (λ₁). Thus we examined the worst case scenario. This leads to the condition
number
$$\kappa_2(A) = \frac{|\lambda_n|}{|\lambda_1|} = \|A\|_2\cdot\|A^{-1}\|_2$$
for symmetric matrices A, and if κ₂ = 10^d we might lose d decimal digits of accuracy when multiplying a
vector by A or when solving a system of linear equations.
Using the above idea we define the condition number for the matrix; the result applies to multiplication
by matrices and to solving linear systems.
2–15 Definition : The condition number κ(A) of a nonsingular square matrix is defined by
$$\kappa = \kappa(A) = \|A\|\cdot\|A^{-1}\|\,.$$
Obviously the condition number depends on the matrix norm used.
2–16 Example :
• Based on Example 2–14 the condition number of an orthogonal matrix Q equals 1, using the 2–norm.
• Using the singular value decomposition (SVD) (see equation (3.5) on page 149) and the above idea
one can verify that for a real n × n matrix A
$$\kappa_2 = \kappa_2(A) = \frac{\sigma_1}{\sigma_n} = \frac{\text{largest singular value}}{\text{smallest singular value}}$$
where the singular values are given by σi .
• MATLAB/Octave provide the command condest() to efficiently compute a good estimate of the
condition number.
♦
For any matrix and norm we have κ(A) ≥ 1. If the condition number κ is not too large we speak of a
well conditioned problem.
For n × n matrices we have the following relations between the different condition numbers.
$$\frac{1}{n}\,\kappa_2 \le \kappa_1 \le n\,\kappa_2\,, \qquad \frac{1}{n}\,\kappa_\infty \le \kappa_2 \le n\,\kappa_\infty$$
2–17 Example : For the model matrix An of size n × n in Section 2.3.1 (page 32) find
$$\kappa_2 = \frac{\lambda_{max}}{\lambda_{min}} \approx \frac{4\,n^2}{\pi^2}$$
and for the 2D model matrix Ann of size n² × n² in Section 2.3.2 (page 34) find the same result
$$\kappa_2 = \frac{\lambda_{max}}{\lambda_{min}} \approx \frac{4\,n^2}{\pi^2}\,.$$
♦
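The claim can be verified numerically for moderate values of n. A sketch, constructing the sparse model matrix An as in Section 2.3.1; the test converts to a full matrix and is thus only usable for small n:

Octave
n = 100; h = 1/(n+1);
An = spdiags(ones(n,1)*[-1 2 -1], -1:1, n, n)/h^2; % model matrix
cond(full(An)) % the 2-norm condition number
4*n^2/pi^2     % the approximation from the example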
2–18 Result : In an ideal situation absolutely no roundoff occurs during the solution process. Only when
A, ~b and ~x are stored some roundoff will occur. The stored solution x̂ satisfies
$$(A+E)\cdot\hat{\vec x} = \vec b + \vec e \quad\text{with}\quad \|E\|_\infty \le u\,\|A\|_\infty \quad\text{and}\quad \|\vec e\|_\infty \le u\,\|\vec b\|_\infty\,.$$
Thus x̂ solves a nearby system exactly. If now u κ∞(A) ≤ 1/2 then one can show that
$$\|\hat{\vec x} - \vec x\|_\infty \le 4\,u\,\kappa_\infty(A)\,\|\vec x\|_\infty\,.$$
As a consequence of the above result we can not expect relative errors smaller than κ u for
any kind of clever algorithm to solve linear systems of equations. The goal has to be to
achieve this accuracy.
2–19 Definition : For the following results we use some special, convenient notations: the absolute value
and the comparison operator are applied to each entry in the matrix, i.e.
$$(|A|)_{i,j} = |a_{i,j}| \qquad\text{and}\qquad A \le B \iff a_{i,j} \le b_{i,j} \text{ for all } i,j\,.$$
The following theorem ([GoluVanLoan96, Theorem 3.3.2]) keeps track of the rounding errors in the back
substitution process, i.e. when solving the triangular systems. The proof is considerably beyond the scope
of these notes.
2–20 Theorem : Let L̂ and R̂ be the computed LR factors of an n × n matrix A. Suppose ŷ is the computed
solution of L̂ · ~y = ~b and x̂ the computed solution of R̂ · ~x = ŷ. Then
$$(A+E)\cdot\hat{\vec x} = \vec b$$
with
$$|E| \le n\,u\,(3\,|A| + 5\,|L|\,|R|) + O(u^2)\,, \qquad (2.2)$$
where |L| |R| is a matrix multiplication. For all practical purposes we may ignore the term O(u²), since
with u ≈ 10⁻¹⁶ we have u² ≈ 10⁻³². 3
This result implies that we find exact solutions of a perturbed system A + E. This is called backward
stability.
The above result shows that large values in the triangular matrices L and R should be avoided whenever
possible. Unfortunately we can obtain large numbers during the factorization, even for well-conditioned
matrices, as shown by the following example.
2–21 Example : For a small, positive number ε the matrix
$$A = \begin{bmatrix} \varepsilon & 1\\ 1 & 0 \end{bmatrix}$$
has the LR factorization (without pivoting)
$$A = L\cdot R = \begin{bmatrix} 1 & 0\\ \frac{1}{\varepsilon} & 1 \end{bmatrix}\cdot\begin{bmatrix} \varepsilon & 1\\ 0 & -\frac{1}{\varepsilon} \end{bmatrix}\,.$$
If ε > 0 is very close to zero then the numbers in L and R will be large and lead to
$$|L|\cdot|R| = \begin{bmatrix} 1 & 0\\ \frac{1}{\varepsilon} & 1 \end{bmatrix}\cdot\begin{bmatrix} \varepsilon & 1\\ 0 & \frac{1}{\varepsilon} \end{bmatrix} = \begin{bmatrix} \varepsilon & 1\\ 1 & \frac{2}{\varepsilon} \end{bmatrix}$$
and thus one of the entries in the bound in Theorem 2–20 is large. Thus the error in the result might be
unnecessarily large. This elementary, but typical example illustrates that pivoting is necessary. ♦
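The effect can be observed numerically. A small sketch, comparing the solution based on the unpivoted factors with the result of the backslash operator, which uses partial pivoting:

Octave
epsilon = 1e-8;
A = [epsilon 1; 1 0]; b = [1; 2];
L = [1 0; 1/epsilon 1]; R = [epsilon 1; 0 -1/epsilon]; % factors without pivoting
x1 = R\(L\b); % solution based on the unpivoted factorization
x2 = A\b;     % Octave applies partial pivoting
[x1, x2]      % the results may differ in the last digits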
The correct method to avoid the above problem is pivoting. In the LR factorization on page 38 try to
factor the submatrices A_{n−k} = L_{n−k} · R_{n−k}. In the unmodified algorithm the top left number in A_{n−k}
is used as pivot element. Before performing the next step in the LR factorization it is better to exchange
rows (equations) and possibly columns (variables) to avoid divisions by small numbers. There are two possible
strategies:
strategies:
• partial pivoting:
Choose the largest absolute number in the first column of An−k . Exchange equations for this to
become the top left number.
• total pivoting:
Choose the largest absolute number in the submatrix An−k . Exchange equations and renumber vari-
ables for this to become the top left number.
The computational effort for total pivoting is considerably higher, since (n−k)2 numbers have to be searched
for the maximal value. The bookkeeping requires considerably more effort, since equations and unknowns
have to be rearranged. This additional effort is not compensated by considerably better (more reliable)
results. Thus for almost all problems partial pivoting will be used. As a consequence we will only examine
partial pivoting.
When using partial pivoting all entries in the left matrix L are at most 1 in absolute value and thus ‖L‖∞ ≤ n.
This leads to an improved error estimate (2.2) in Theorem 2–20. For details see [GoluVanLoan96, §3.4.6].
The formulation for factorization has to be modified slightly and supplemented with a permutation
matrix P.
2–22 Result : A square matrix P is a permutation matrix if each row and each column of P contains exactly
one number 1 and all other entries are zero. Multiplying a matrix or vector from the left by a permutation
matrix has the effect of row permutations:
pi,j = 1 ⇐⇒ the old row j will turn into the new row i
•
$$\begin{pmatrix} 0 & 1 & 0 & 0\\ 0 & 0 & 1 & 0\\ 0 & 0 & 0 & 1\\ 1 & 0 & 0 & 0 \end{pmatrix} \cdot
\begin{pmatrix} 1\\ 2\\ 3\\ 4 \end{pmatrix} =
\begin{pmatrix} 2\\ 3\\ 4\\ 1 \end{pmatrix}$$
For a given matrix A we seek triangular matrices L and R and a permutation matrix P, such that
P · A = L · R.
If we now want to solve the system A · ~x = ~b using the factorization replace the original system by two
linear systems with triangular matrices.
$$A\,\vec x = \vec b \iff PA\,\vec x = P\,\vec b \iff LR\,\vec x = P\,\vec b \iff
\begin{cases} L\,\vec y = P\cdot\vec b\\ R\,\vec x = \vec y \end{cases}$$
Example 2–6 has to be modified accordingly, i.e. the permutation given by P is applied to the right hand
side.
$$P\cdot A =
\begin{pmatrix} 0 & 1 & 0\\ 1 & 0 & 0\\ 0 & 0 & 1 \end{pmatrix}\cdot
\begin{pmatrix} 3/2 & 4 & -7/2\\ 3 & 2 & 1\\ 0 & -1 & 25/3 \end{pmatrix} =
\begin{pmatrix} 1 & 0 & 0\\ 1/2 & 1 & 0\\ 0 & -1/3 & 1 \end{pmatrix}\cdot
\begin{pmatrix} 3 & 2 & 1\\ 0 & 3 & -4\\ 0 & 0 & 7 \end{pmatrix} = L\cdot R\,.$$
Then the system can be solved using two triangular systems. First solve from top to bottom
$$L\,\vec y = P\,\vec b: \qquad
\begin{pmatrix} 1 & 0 & 0\\ 1/2 & 1 & 0\\ 0 & -1/3 & 1 \end{pmatrix}
\begin{pmatrix} y_1\\ y_2\\ y_3 \end{pmatrix} =
\begin{pmatrix} 0\\ +1\\ -1 \end{pmatrix}$$
and then, from bottom to top,
$$R\,\vec x = \vec y: \qquad
\begin{pmatrix} 3 & 2 & 1\\ 0 & 3 & -4\\ 0 & 0 & 7 \end{pmatrix}
\begin{pmatrix} x_1\\ x_2\\ x_3 \end{pmatrix} =
\begin{pmatrix} y_1\\ y_2\\ y_3 \end{pmatrix}$$
Any good numerical library has an implementation of the LR (or LU) factorization with partial pivoting
built in. As an example consider the help provided by Octave on the command lu().
Octave
help lu
--> lu is a built-in function
[l, u, p] = lu (A)
For the 2 × 2 matrix A = [1, 2; 3, 4] (the output below corresponds to this matrix) the call returns
l = 1.00000  0.00000
    0.33333  1.00000
u = 3.00000  4.00000
    0.00000  0.66667
p = 0  1
    1  0
Using this factorization one can solve systems of linear equations A~x = ~b for ~x.
Octave
A = randn(3,3); % generate a random matrix
b = rand(3,1);
x1 = A\b; % a first solution
[L,U,P] = lu(A); % compute the LU factorization with pivoting
x2 = U\(L\(P*b)) % the solution with the help of the factorization
DifferenceSol = norm(x1-x2) % display the differences, should be zero
The first solution is generated with the help of the backslash operator \ . Internally Octave/MATLAB use the
LU (same as LR) factorization. Computing and storing the L, U and P matrices is only useful when multiple
systems with the same matrix A have to be solved. The computational effort to apply the back substitution
is considerably smaller than the effort to determine the factorization, at least for general matrices.
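A sketch illustrating this point: the factorization is computed once and then reused for several right hand sides (the sizes below are chosen arbitrarily).

Octave
A = randn(500); [L,U,P] = lu(A); % factor once
B = randn(500,20);               % 20 right hand sides
X = U\(L\(P*B));                 % only cheap back substitutions are needed
norm(A*X - B)                    % small residual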
For a symmetric matrix A we seek a factorization of the form
$$A = R^T \cdot D \cdot R$$
with a diagonal matrix D and an upper triangular matrix R with all diagonal entries equal to 1.⁷
The approach is adapted from Section 2.4.1, using block matrices again. Using standard matrix multiplica-
tions we find
7
This modification is known as modified Cholesky factorization and MATLAB provides the command ldl() for this algorithm.
$$\begin{pmatrix}
a_{1,1} & a_{1,2} & a_{1,3} & \ldots & a_{1,n}\\
a_{1,2} & & & &\\
a_{1,3} & & & A_{n-1} &\\
\vdots & & & &\\
a_{1,n} & & & &
\end{pmatrix} = R^T\cdot D\cdot R =$$
$$= \begin{pmatrix}
1 & 0 & \ldots & 0\\
r_{1,2} & & &\\
r_{1,3} & & R_{n-1}^T &\\
\vdots & & &\\
r_{1,n} & & &
\end{pmatrix}\cdot
\begin{pmatrix}
d_1 & 0 & \ldots & 0\\
0 & & &\\
\vdots & & D_{n-1} &\\
0 & & &
\end{pmatrix}\cdot
\begin{pmatrix}
1 & r_{1,2} & r_{1,3} & \ldots & r_{1,n}\\
0 & & &\\
\vdots & & R_{n-1} &\\
0 & & &
\end{pmatrix}$$
$$= \begin{pmatrix}
1 & 0 & \ldots & 0\\
r_{1,2} & & &\\
\vdots & & R_{n-1}^T &\\
r_{1,n} & & &
\end{pmatrix}\cdot
\begin{pmatrix}
d_1 & d_1\,r_{1,2} & d_1\,r_{1,3} & \ldots & d_1\,r_{1,n}\\
0 & & &\\
\vdots & & D_{n-1}\cdot R_{n-1} &\\
0 & & &
\end{pmatrix}$$
Now we examine the effects of the last matrix multiplication on the four submatrices. This translates to 4
subsystems.
• Examine the top left block (one single number) in A. Obviously we find a1,1 = d1.
• The top right block (row) in A is then already taken care of. It is a transposed copy of the first column.
• The first column below the diagonal leads to a_{1,j} = d_1 r_{1,j}, i.e. r_{1,j} = a_{1,j}/d_1 for 2 ≤ j ≤ n.
• For the lower right block the multiplication requires
$$A_{n-1} = d_1\begin{pmatrix} r_{1,2}\\ \vdots\\ r_{1,n} \end{pmatrix}(r_{1,2},\ldots,r_{1,n}) + R_{n-1}^T\cdot D_{n-1}\cdot R_{n-1}\,,$$
thus subtract the rank one contribution and set
$$\tilde A_{n-1} = A_{n-1} - d_1\begin{pmatrix} r_{1,2}\\ \vdots\\ r_{1,n} \end{pmatrix}(r_{1,2},\ldots,r_{1,n})\,.$$
This operation requires 1/2 (n−1)² flops since the matrix is symmetric. Now we are left with the new
factorization
$$\tilde A_{n-1} = R_{n-1}^T\cdot D_{n-1}\cdot R_{n-1}$$
with the updated matrix Ã_{n−1}.
• Then restart the process with the reduced problem of size (n−1) × (n−1) in the lower right block.
Counting the operations of all these steps leads to
$$\text{Flop}_{Chol} \approx \sum_{k=1}^{n-1} \frac{k^2}{2} \approx \frac{1}{6}\,n^3\,.$$
Thus we were able to reduce the number of necessary operations by a factor of 2 compared to the standard
LR factorization (Flop_LR ≈ 1/3 n³).
2–25 Observation : Adding multiples of one row to another row in a large matrix can be implemented in
parallel on a multicore architecture, as shown in Section 2.2.3. The number of columns has to be consider-
ably larger than the number of CPU cores to be used. ♦
for k = 1:n-1
  R(k,k) = 1;
  for j = k+1:n
    R(k,j) = A(j,k)/A(k,k);
    A(j,:) = A(j,:) - R(k,j)*A(k,:); % row operations
    A(:,j) = A(:,j) - R(k,j)*A(:,k); % column operations
  end%for
end%for
R(n,n) = 1;
D = diag(diag(A));
• As we go through the algorithm the coefficients in R can replace the coefficients in A which will not
be used any more. This cuts the memory requirement in half.
• If we do all computations in the upper right part of A, we already know that the result in the lower left
part has to be the same. Thus we can do only half of the calculations.
• As we already know that the numbers in the diagonal of R have to be 1, we do not need to return
them. One can use the diagonal of R to return the coefficients of the diagonal matrix D.
If we implement most8 of the above points we obtain an improved algorithm, shown below.
choleskyM.m
function R = choleskyM(A)
% R = choleskyM(A) if A is a symmetric positive definite matrix
% returns an upper triangular matrix R such that A = R1'*D*R1
% R1 has all diagonal entries equal to 1
% the values of the diagonal matrix D are returned on the diagonal of R
TOL = 1e-10; % tolerance to detect a (numerically) singular pivot, value chosen here
[n,m] = size(A);
if (n ~= m) error('choleskyM: matrix has to be square') end%if
for k = 1:n-1
  if (abs(A(k,k)) <= TOL) error('choleskyM: might be a singular matrix') end%if
  for j = k+1:n
    A(j,k) = A(k,j)/A(k,k); % store the multiplier in the lower part
    % row operations only, using the symmetry
    A(j,j:n) = A(j,j:n) - A(j,k)*A(k,j:n);
  end%for
end%for
if (abs(A(n,n)) <= TOL) error('choleskyM: might be a singular matrix') end%if
R = tril(A)'; % upper triangular result, with the values of D on the diagonal
The above code finds the Cholesky factorization of the matrix, but does not solve a system of linear
equations. It has to be supplemented with the corresponding back-substitution algorithm.
8
The memory requirements can be made considerably smaller
function x = CholeskySolver(R,b)
% x = CholeskySolver(R,b) solves A x = b
% R has to be generated by R = choleskyM(A)
[n,m] = size(R);
if (n ~= length(b))
  error('CholeskySolver: matrix and vector do not have same dimension')
end%if
% forward substitution of R1' y = b, R1 with unit diagonal
y = zeros(size(b));
y(1) = b(1);
for k = 2:n
  y(k) = b(k);
  for j = 1:k-1 y(k) = y(k) - R(j,k)*y(j); end%for
end%for
y = y(:)./diag(R); % diagonal scaling, solves D z = y
% backward substitution of R1 x = y
x = zeros(size(b));
x(n) = y(n);
for k = 1:n-1
  x(n-k) = y(n-k);
  for j = n-k+1:n x(n-k) = x(n-k) - R(n-k,j)*x(j); end%for
end%for
A = [1 3 -4; 3 11 0; -4 0 10];
R = choleskyM(A)
b = [1; 2; 3];
x = CholeskySolver(R,b)’
-->
R = 1 3 -4
0 2 6
0 0 -78
i.e. the system A ~x = ~b is solved by
$$\begin{pmatrix} x_1\\ x_2\\ x_3 \end{pmatrix} \approx \begin{pmatrix} -1.16667\\ +0.50000\\ -0.16667 \end{pmatrix}\,.$$
2–26 Definition : A symmetric, real matrix A is called positive definite if and only if
$$\langle A\cdot\vec x\,,\,\vec x\rangle = \langle\vec x\,,\,A\cdot\vec x\rangle > 0 \quad\text{for all } \vec x \ne \vec 0\,.$$
The matrix is called positive semidefinite if and only if
$$\langle A\cdot\vec x\,,\,\vec x\rangle = \langle\vec x\,,\,A\cdot\vec x\rangle \ge 0 \quad\text{for all } \vec x\,.$$
2–27 Example : To verify that the matrix An (see page 32) is positive definite we have to show that⁹
$$\langle\vec u\,,\,A_n\,\vec u\rangle > 0 \quad\text{for all } \vec u \in \mathbb R^n\setminus\{\vec 0\}\,.$$
$$\langle\vec u\,,\,A_n\,\vec u\rangle = \frac{1}{h^2}\,\Big\langle
\begin{pmatrix} u_1\\ u_2\\ u_3\\ \vdots\\ u_{n-1}\\ u_n \end{pmatrix},
\begin{pmatrix} 2u_1 - u_2\\ -u_1 + 2u_2 - u_3\\ -u_2 + 2u_3 - u_4\\ \vdots\\ -u_{n-2} + 2u_{n-1} - u_n\\ -u_{n-1} + 2u_n \end{pmatrix}\Big\rangle$$
$$= \frac{1}{h^2}\,\Big\langle
\begin{pmatrix} u_1\\ u_2\\ u_3\\ \vdots\\ u_{n-1}\\ u_n \end{pmatrix},
\begin{pmatrix} u_1 - (u_2-u_1)\\ (u_2-u_1) - (u_3-u_2)\\ (u_3-u_2) - (u_4-u_3)\\ \vdots\\ (u_{n-1}-u_{n-2}) - (u_n-u_{n-1})\\ (u_n-u_{n-1}) + u_n \end{pmatrix}\Big\rangle$$
$$= \frac{1}{h^2}\,\Big\langle
\begin{pmatrix} u_1\\ u_2-u_1\\ u_3-u_2\\ \vdots\\ u_{n-1}-u_{n-2}\\ u_n-u_{n-1} \end{pmatrix},
\begin{pmatrix} u_1\\ u_2-u_1\\ u_3-u_2\\ \vdots\\ u_{n-1}-u_{n-2}\\ u_n-u_{n-1} \end{pmatrix}\Big\rangle + \frac{u_n^2}{h^2}$$
$$= \frac{1}{h^2}\,\Big( u_1^2 + u_n^2 + \sum_{i=2}^n (u_i - u_{i-1})^2 \Big)\,.$$
This sum of squares is obviously positive. Only if ~u = ~0 the expression will be zero. Thus the matrix An is
positive definite. ♦
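This can be confirmed numerically: chol() fails for matrices that are not positive definite, and for a symmetric matrix the smallest eigenvalue has to be positive. A sketch:

Octave
n = 50; h = 1/(n+1);
An = spdiags(ones(n,1)*[-1 2 -1], -1:1, n, n)/h^2;
[R, p] = chol(An); % p = 0 signals a positive definite matrix
min(eig(full(An))) % the smallest eigenvalue is positive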
A positive definite matrix A has a few properties that are easy to verify.
• ai,i > 0 for 1 ≤ i ≤ n, i.e. the numbers on the diagonal are positive.
• max_{i,j} |ai,j| = max_i ai,i, i.e. the maximal value has to be on the diagonal.
3
Proof :
• Choose ~x = ~e_i and compute ⟨~e_i, A · ~e_i⟩ = a_{i,i} > 0. For a 4 × 4 matrix and i = 3 this is illustrated by
$$\langle A\,\vec x\,,\,\vec x\rangle = \Big\langle
\begin{pmatrix} a_{11} & a_{12} & a_{13} & a_{14}\\ a_{12} & a_{22} & a_{23} & a_{24}\\ a_{13} & a_{23} & a_{33} & a_{34}\\ a_{14} & a_{24} & a_{34} & a_{44} \end{pmatrix}
\begin{pmatrix} 0\\ 0\\ 1\\ 0 \end{pmatrix},
\begin{pmatrix} 0\\ 0\\ 1\\ 0 \end{pmatrix}\Big\rangle = \Big\langle
\begin{pmatrix} a_{13}\\ a_{23}\\ a_{33}\\ a_{34} \end{pmatrix},
\begin{pmatrix} 0\\ 0\\ 1\\ 0 \end{pmatrix}\Big\rangle = a_{33} > 0\,.$$
• Assume max |ai,j| = ak,l with k ≠ l. Choose ~x = ~e_k − sign(a_{k,l}) ~e_l and compute ⟨~x, A · ~x⟩ =
a_{k,k} + a_{l,l} − 2 |a_{k,l}| ≤ 0, contradicting positive definiteness. To illustrate the argument we use a small
matrix again.
$$\langle A\,\vec x\,,\,\vec x\rangle = \Big\langle
\begin{pmatrix} a_{11} & a_{12} & a_{13} & a_{14}\\ a_{12} & a_{22} & a_{23} & a_{24}\\ a_{13} & a_{23} & a_{33} & a_{34}\\ a_{14} & a_{24} & a_{34} & a_{44} \end{pmatrix}
\begin{pmatrix} \pm 1\\ 0\\ 1\\ 0 \end{pmatrix},
\begin{pmatrix} \pm 1\\ 0\\ 1\\ 0 \end{pmatrix}\Big\rangle = \Big\langle
\begin{pmatrix} \pm a_{11} + a_{13}\\ \vdots\\ \pm a_{13} + a_{33}\\ \vdots \end{pmatrix},
\begin{pmatrix} \pm 1\\ 0\\ 1\\ 0 \end{pmatrix}\Big\rangle = a_{11} + a_{33} \pm 2\,a_{13} > 0$$
By choosing the correct sign we conclude |a13| ≤ 1/2 (a11 + a33) and thus |a13| can not be larger than
both of the other numbers.
2
The above allows to easily detect that a matrix is not positive definite, but it does not contain a criterion
to quickly decide that A is positive definite.
For a large number of applied problems the resulting matrix has to be positive definite, based on physical
or mechanical observations. In many applications the generalized energy of a system is given by
$$\text{energy} = \frac{1}{2}\,\langle A\cdot\vec x\,,\,\vec x\rangle\,.$$
If the object is deformed/modified the energy is often strictly increased, and based on this the matrix A has
to be positive definite.
The eigenvalues contain all information about the definiteness of a symmetric matrix, but computing all of
them is not an efficient way to detect positive definite matrices, since finding all eigenvalues is computa-
tionally expensive. For a symmetric matrix it suffices to verify that the smallest eigenvalue is positive. For
this the MATLAB/Octave function eigs() is very useful, see Section 3.2.4.
2–29 Result : The symmetric matrix A is positive definite iff all eigenvalues are strictly positive. The
symmetric matrix A is positive semidefinite iff all eigenvalues are positive or zero. 3
Proof : This is a direct consequence of the diagonalization result 3–22 (page 134) A = Q D Qᵀ, where
the diagonal matrix contains the eigenvalues λ_j along its diagonal. The computation is based on ~y = Qᵀ · ~x
and the equation
$$\langle A\cdot\vec x\,,\,\vec x\rangle = \langle Q\,D\,Q^T\cdot\vec x\,,\,\vec x\rangle = \langle D\,Q^T\cdot\vec x\,,\,Q^T\cdot\vec x\rangle = \langle D\cdot\vec y\,,\,\vec y\rangle = \sum_j \lambda_j\,y_j^2\,.$$
2
This result is of little help to decide whether a given, large matrix is positive definite or not. Finding
all eigenvalues is not an option, as it is computationally expensive. A positive answer can be given using
diagonal dominance and reducible matrices, see e.g. [Axel94, §4].
2–30 Definition : Consider a symmetric n × n matrix A.
• A is called strictly diagonally dominant iff |ai,i| > σi for all 1 ≤ i ≤ n, where
$$\sigma_i = \sum_{j\ne i,\ 1\le j\le n} |a_{i,j}|\,.$$
Along each column/row the sum of the off-diagonal elements is smaller than the diagonal element.
• A is called diagonally dominant iff |ai,i| ≥ σi for all 1 ≤ i ≤ n.
• A is called reducible if there exists a permutation matrix P and square matrices B1, B2 and a matrix
B3 such that
$$P\cdot A\cdot P^T = \begin{bmatrix} B_1 & B_3\\ 0 & B_2 \end{bmatrix}\,.$$
Since A is symmetric the matrix P · A · Pᵀ is also symmetric and the block B3 has to vanish, i.e. we
have the condition
$$P\cdot A\cdot P^T = \begin{bmatrix} B_1 & 0\\ 0 & B_2 \end{bmatrix}\,.$$
This leads to an easy interpretation of a reducible matrix A: the system of linear equations A ~u = ~b
can be decomposed into two smaller, independent systems B1 ~u1 = ~b1 and B2 ~u2 = ~b2. To arrive at
this situation all one has to do is renumber the equations and variables.
• A is called irreducible if it is not reducible.
• A is called irreducibly diagonally dominant if A is irreducible and
– |ai,i| ≥ σi for all 1 ≤ i ≤ n
– |ai,i| > σi for at least one 1 ≤ i ≤ n
For further explanations concerning reducible matrices see [Axel94] or [VarFEM]. For our purposes it is
sufficient to know that the model matrices An and Ann in Section 2.3 are positive definite, diagonally
dominant and irreducible.
A classical result states that a symmetric, irreducibly diagonally dominant matrix with positive diagonal
entries is positive definite. As a consequence of this result the model matrices An and Ann are positive definite.
2–32 Example : A positive definite matrix need not be diagonally dominant. As an example consider the
matrix
$$A = \begin{pmatrix}
5 & -4 & 1 & & & &\\
-4 & 6 & -4 & 1 & & &\\
1 & -4 & 6 & -4 & 1 & &\\
 & 1 & -4 & 6 & -4 & 1 &\\
 & & \ddots & \ddots & \ddots & \ddots & \ddots\\
 & & & 1 & -4 & 6 & -4\\
 & & & & 1 & -4 & 5
\end{pmatrix}$$
This matrix was generated with the help of the model matrix An, given on page 32, by A = h⁴ An · An.
This matrix A is positive definite, but it is clearly not diagonally dominant. ♦
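A sketch to generate this matrix and confirm both claims; use a small n to keep the matrices readable:

Octave
n = 10; h = 1/(n+1);
An = spdiags(ones(n,1)*[-1 2 -1], -1:1, n, n)/h^2;
A = h^4*An*An;                        % the pentadiagonal matrix from above
min(eig(full(A)))                     % positive, thus A is positive definite
sigma = sum(abs(A),2) - abs(diag(A)); % off-diagonal row sums
any(abs(diag(A)) < sigma)             % returns 1: A is not diagonally dominant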
The algorithm of Cholesky will not only determine the factorization, but also indicate whether the matrix A is
positive definite. It is an efficient tool to determine positive definiteness.
2–33 Result : Let A = Rᵀ · D · R be the Cholesky factorization of the previous section. Then A
is positive definite if and only if all entries in the diagonal matrix D are strictly positive. 3
Proof : Since the triangular matrix R has only numbers 1 along the diagonal, it is invertible. As ~x ranges
over all of Rⁿ, the vectors ~y = R ~x also range over all of Rⁿ. Now the identity
$$\langle\vec x\,,\,A\,\vec x\rangle = \langle\vec x\,,\,R^T\cdot D\cdot R\,\vec x\rangle = \langle R\,\vec x\,,\,D\cdot R\,\vec x\rangle = \langle\vec y\,,\,D\,\vec y\rangle = \sum_i d_i\,y_i^2$$
implies that ⟨~x, A ~x⟩ > 0 for all ~x ≠ ~0 if and only if d_i > 0 for all i. 2
• Keep track of rounding errors for the algebraic operations to be executed during the algorithm of
Cholesky.
$$A = R^T \cdot D \cdot R\,.$$
Since A is positive definite we know that ai,i > 0 and di > 0. Thus we find bounds on the coefficients in R
and D in terms of A:
$$d_i \le a_{i,i} \qquad\text{and}\qquad \sum_{k=1}^{i-1} d_k\,r_{k,i}^2 \le a_{i,i} \quad\text{or similarly}\quad \sum_{k=1}^{n} d_k\,r_{k,i}^2 \le a_{i,i}\,.$$
Using this and the Cauchy–Schwarz inequality¹⁰ we now obtain an estimate for the result of the matrix
multiplication below, where the entries in |R| are given by the absolute values of the entries in R. Estimates
of this type are needed to keep track of the ‘worst case’ situation for rounding errors in the algorithm.
$$\big(|R|^T\cdot D\cdot|R|\big)_{i,j} = \sum_{k=1}^n |r_{k,i}|\,d_k\,|r_{k,j}| = \sum_{k=1}^n \big(|r_{k,i}|\sqrt{d_k}\big)\big(\sqrt{d_k}\,|r_{k,j}|\big)
\le \sqrt{\sum_{k=1}^n d_k\,r_{k,i}^2}\cdot\sqrt{\sum_{k=1}^n d_k\,r_{k,j}^2} \le \sqrt{a_{i,i}}\cdot\sqrt{a_{j,j}} \le \max_{1\le j\le n} a_{j,j}\,. \qquad (2.3)$$
Example 2–36 shows that the above is false if A is not positive definite. 3

10
For vectors we know that ⟨~x, ~y⟩ = ‖~x‖ ‖~y‖ cos α, where α is the angle between the vectors. This implies |⟨~x, ~y⟩| ≤ ‖~x‖ ‖~y‖.

Combining the estimate (2.3) for a positive definite A with Theorem 2–20 now implies a bound of the type
$$|E|_{i,j} \le n\,u\,\big(3\,|a_{i,j}| + 5\,\max_{1\le k\le n} a_{k,k}\big) + O(u^2)\,,$$
i.e. the result of the numerical computations is the exact solution of slightly modified equations. The
modification is small compared to the maximal coefficient in the original problem.
As a consequence of the above we conclude that for positive definite, symmetric matrices there is
no need for pivoting when using the Cholesky algorithm.
2–36 Example : If the matrix is not positive definite the effect of roundoff errors may be large, even if the
matrix has an ideal condition number close to 1. Consider the system
$$\begin{bmatrix} 0.0001 & 1\\ 1 & 0.0001 \end{bmatrix}\begin{pmatrix} x_1\\ x_2 \end{pmatrix} = \begin{pmatrix} 1\\ 1 \end{pmatrix}\,.$$
Exact arithmetic leads to the factorization
$$\begin{bmatrix} 0.0001 & 1\\ 1 & 0.0001 \end{bmatrix} = \begin{bmatrix} 1 & 0\\ 10000 & 1 \end{bmatrix}\cdot\begin{bmatrix} 0.0001 & 0\\ 0 & -9999.9999 \end{bmatrix}\cdot\begin{bmatrix} 1 & 10000\\ 0 & 1 \end{bmatrix}\,.$$
The condition number is κ = 1.0002 and thus we expect almost no loss of precision¹¹. The exact solution
is ~x = (0.99990001, 0.99990001)ᵀ. Since all numbers in A and ~b are smaller than 1 one might hope
for an error of the order of machine precision. The bounds on the entries in R and D in (2.3) are clearly
violated, e.g.
$$\big(|R|^T\cdot|D|\cdot|R|\big)_{2,2} = |r_{1,2}|\,|d_1|\,|r_{1,2}| + |r_{2,2}|\,|d_2|\,|r_{2,2}| = 10^8\cdot 10^{-4} + 9999.9999 \approx 20000\,.$$
Using floating point arithmetic with u ≈ 10⁻⁸ (i.e. 8 decimal digits) we obtain a factorization
$$\begin{bmatrix} 1 & 0\\ 10000 & 1 \end{bmatrix}\cdot\begin{bmatrix} 0.0001 & 0\\ 0 & -10000 \end{bmatrix}\cdot\begin{bmatrix} 1 & 10000\\ 0 & 1 \end{bmatrix} = \begin{bmatrix} 0.0001 & 1\\ 1 & 0 \end{bmatrix}$$
and the solution is x̂ = (1.0, 0.9999)ᵀ. Thus the relative error of the solution is 10⁻⁴. This is by mag-
nitudes larger than the machine precision u ≈ 10⁻⁸. The effect is generated by the large numbers in the
factorization. This can not occur if the matrix A is positive definite since we have the bound (2.3). To over-
come this type of problem a good pivoting scheme has to be used when the matrix is not positive definite, see
e.g. [GoluVanLoan96, §4.4]. This will most often destroy the symmetry of the problem. It is possible to use
row and column permutations to preserve the symmetry. This approach will be examined in Section 2.6.6.
♦

11
Observe that the eigenvalues of the matrix are λ1 = 1.0001 and λ2 = −0.9999. Thus the matrix is not positive definite. But
the permuted matrix (row permutations)
$$\begin{bmatrix} 0 & 1\\ 1 & 0 \end{bmatrix}\begin{bmatrix} 0.0001 & 1\\ 1 & 0.0001 \end{bmatrix} = \begin{bmatrix} 1 & 0.0001\\ 0.0001 & 1 \end{bmatrix}$$
has eigenvalues λ1 = 1.0001 and λ2 = +0.9999 and thus is positive definite.
Figure 2.9: Cholesky steps for a banded matrix. The active area is marked.
When implementing the algorithm of Cholesky one works along the diagonal, top to bottom. For each
step only a block of size b × b of numbers is worked on, i.e. has to be quickly accessible. Three of these
situations are shown in Figure 2.9. Each of those steps requires approximately b²/2 flops and there are
approximately n of those, thus we find an approximate¹² operational count of
$$\text{Flop}_{CholBand} \approx \frac{1}{2}\,b^2\,n\,.$$
12
We ignored the effect that the first row in each diagonal step is left unchanged and we also do not take into account that in the
lower right corner fewer computations are needed. Both effects are of lower order.
This has to be compared to the computational effort for the full Cholesky factorization
$$\text{Flop}_{Chol} \approx \frac{1}{6}\,n^3 = \frac{1}{6}\,n^2\,n\,.$$
The additional cost to solve a system by back substitution is approximately
$$\text{Flop}_{SolveBand} \approx 2\,b\,n\,.$$
We need n · b numbers to store the complete matrix A. As the algorithm proceeds along the diagonal in A (or
its reduction R) in each step only the next b rows will be worked on. As we go to the next row the previous
top row will not be used any more, but a new row will be added at the bottom, see Figure 2.9. Thus
we have only b · b active entries at any time. If these numbers can be placed in fast memory (e.g. in cache)
then the implementation will run faster than in regular memory. Thus for good performance we like to store
b² numbers in fast memory. This has to be taken into consideration when setting up the data structure for
a banded matrix, together with the memory and cache architecture of the computer to be used. Table 2.5
shows the types of fast and regular memory relevant for most problems and some typical sizes of matrices.
The table assumes that 8 Bytes are used to store one number. If not enough fast memory is available the
algorithm will still generate the result, but not as fast. An efficient implementation of the above idea is
shown in [VarFEM].
Table 2.5: Memory requirements for the Cholesky algorithm for banded matrices
2–37 Observation : If the semibandwidth b is considerably larger than the number of used CPU cores,
then the algorithm can efficiently be implemented on a multi core architecture, as shown in Section 2.2.3.
♦
• Divide row n − 1 by r_{n−1,n−1}. Subtract multiples of row n − 1 from the rows n − 2 down to n − b, such
that the only nonzero entry in this column is the value 1 on the diagonal. This requires approximately 2 b flops.
Since (Rᵀ)⁻¹ = (R⁻¹)ᵀ and A⁻¹ = (Rᵀ R)⁻¹ = R⁻¹R⁻ᵀ we could now solve the system A ~x = ~b by
$$\vec x = R^{-1}\,(R^{-T}\,\vec b)\,,$$
i.e. by two matrix multiplications, each requiring approximately 1/2 n² flops. This coincides with the number
of operations required to multiply with the full matrix A⁻¹ = R⁻¹R⁻ᵀ.
If we avoid the inverse matrix but use twice a back-substitution, first Rᵀ ~y = ~b and then R ~x = ~y, we
require only 2 b n flops, i.e. considerably less than 1/2 n².
Examine the steady state heat equation (1.5) (page 12) on a unit square with nx interior grid points in x
direction and ny points in y direction.
$$-\frac{\partial^2 T(x,y)}{\partial x^2} - \frac{\partial^2 T(x,y)}{\partial y^2} = \frac{1}{k}\,f(x,y) \quad\text{for } 0 \le x,y \le 1$$
$$T(x,y) = 0 \quad\text{for } (x,y) \text{ on the boundary}\,.$$
The resulting matrix is of size nx·ny by nx·ny with a semi-bandwidth of nx. The matrix may be generated
as a Kronecker product of two tridiagonal matrices, representing the second derivatives in x and y direction.
A = kron(speye(ny),Dxx) + kron(Dyy,speye(nx));
b = ones(nx*ny,1);
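The matrices Dxx and Dyy are not constructed in the above excerpt. A possible setup (a sketch, assuming the standard three point stencil and the step sizes hx = 1/(nx+1) and hy = 1/(ny+1)):

Octave
nx = 200; ny = 200; % number of interior grid points
hx = 1/(nx+1); hy = 1/(ny+1);
Dxx = spdiags(ones(nx,1)*[-1 2 -1], -1:1, nx, nx)/hx^2; % -d^2/dx^2
Dyy = spdiags(ones(ny,1)*[-1 2 -1], -1:1, ny, ny)/hy^2; % -d^2/dy^2
A = kron(speye(ny), Dxx) + kron(Dyy, speye(nx));
b = ones(nx*ny, 1);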
Now solve the resulting system of linear equations A ~x = ~b with different algorithms.
• The above sparse matrix is converted to a full matrix, whose inverse is used to determine the solution.
Afull = full(A);
x1 = inv(Afull)*b;
This method consumes a lot of computation time and a full matrix with 200⁴ entries has to be stored,
i.e. we would need 8 · 200⁴ B = 12.8 GB of memory. This is a foolish method to use. The compu-
tation actually failed for n = 200. A test with n = 80 leads to a computation time of 150 seconds.
• The standard solver of Octave uses a good algorithm and the code
tic()
x2 = A\b;
SolveTime = toc()
takes 0.108 seconds to solve one system. Octave uses the selection tree displayed in Section 2.6.7.
• We may first compute the Cholesky factorization A = RT R and then determine the solution in two
steps: solve RT ~y = ~b, then R ~x = ~y .
R = chol(A);
x3 = R\(R'\b);
It takes 0.667 seconds to compute R and then 0.127 seconds to solve. The command nnz(R) shows
that R has approximately 8 · 10⁶ nonzero entries. This coincides with the n_x³ = 200³ entries required
by the banded Cholesky solver examined in the previous sections.
• One can modify the Cholesky algorithm with row and column permutations, seeking a factorization
$$P^T A\,P = R^T R$$
where P is a permutation matrix. Octave uses the Approximate Minimum Degree permutation
matrix generated by the command amd(). Systems of linear equations can then be solved by
$$A\,\vec x = \vec b \iff P^T A\,P\,P^T\vec x = P^T\vec b \iff R^T R\,P^T\vec x = P^T\vec b\,,$$
i.e. by solving in sequence
$$R^T\,\vec y = P^T\,\vec b\,, \qquad R\,\vec z = \vec y\,, \qquad P^T\,\vec x = \vec z\,.$$
[R, m, P] = chol(A);
x4 = P*(R\(R'\(P'*b)));
The result requires 0.122 seconds for the factorization and only 0.014 seconds to solve the system. The
matrix R has only 1081911 ≈ 10⁶ nonzero entries.
To illustrate the above we use our standard matrix Ann with n = 200. Using Octave (version 5.0.1) on a
Xeon E5-1650 system we found the numbers in Table 2.6.
The Cholesky algorithm with permutations is the most efficient, concerning computation time and memory
consumption. We illustrate this by a slightly modified example. We take the above standard example, but
add one nonzero number to destroy most of the band structure.
The matrix A generated with the above code is of size 400 × 400 and the semi-bandwidth is 20 (approx-
imately). We can compute the size of the sparse matrix required to store the Cholesky factorization R:
• Full Cholesky: half of a N × N = 400 × 400 matrix, leading to 80'000 entries.
• Band Cholesky: N × b = 400 × 20, leading to 8'000 nonzero entries. Ignoring the single nonzero we
still have a semi-bandwidth of 20 and thus the banded Cholesky factorization would require 400 · 20 =
8'000 nonzero entries.
Figure 2.10: The sparsity pattern of a band matrix and two Cholesky factorizations
• Sparse Cholesky with permutations: 3785 nonzero entries. As a consequence we need less storage
and the back substitution will be about twice as fast.
The sparsity pattern in Figure 2.10 shows where the non-zeros are. In 2.10(a) find the non-zeros in the
original matrix A. The band structure is clearly visible. By zooming in we would find only 5 diagonals
occupied by numbers, i.e. the band is far from being full. In 2.10(b) we recognize the result of the Cholesky
factorization, where the additional nonzero entry leads to an isolated spike in the matrix R. The band in
this matrix is full. In 2.10(c) find the results with the additional permutations allowed. The band structure is
replaced by an even more sparse pattern of non-zeros.
We observe:
• The chol() implementation in Octave is as efficient as a banded Cholesky and can deal with isolated
nonzeros outside of the band.
• The chol() command with the additional permutations can be considerably more efficient, i.e.
requires less memory and the back substitution is faster.
The documentation of Octave contains a selection tree for solving systems of linear equations using
sparse matrices. Find this information in the official Octave manual in the section Linear Algebra on Sparse
Matrices. When using the command A\b with a sparse matrix A to solve a linear system the following
decision tree is used to choose the algorithm to solve the system.
1. If the matrix is diagonal, solve directly. Goto 8
2. If the matrix is a permuted diagonal, solve directly taking into account the permutations. Goto 8
3. If the matrix is square, banded and if the band density is less than that given by spparms ("bandden")
continue, else goto 4.
(a) If the matrix is tridiagonal and the right-hand side is not sparse continue, else goto 3(b).
i. If the matrix is hermitian, with a positive real diagonal, attempt Cholesky factorization
using Lapack xPTSV.
SHA 21-5-21
CHAPTER 2. MATRIX COMPUTATIONS 72
ii. If the above failed or the matrix is not hermitian with a positive real diagonal use Gaussian
elimination with pivoting using Lapack xGTSV, and goto 8.
(b) If the matrix is hermitian with a positive real diagonal, attempt Cholesky factorization using
Lapack xPBTRF.
(c) if the above failed or the matrix is not hermitian with a positive real diagonal use Gaussian
elimination with pivoting using Lapack xGBTRF, and goto 8.
4. If the matrix is upper or lower triangular perform a sparse forward or backward substitution, and
goto 8.
5. If the matrix is an upper triangular matrix with column permutations or a lower triangular matrix with
row permutations, perform a sparse forward or backward substitution, and goto 8.
6. If the matrix is square, hermitian with a real positive diagonal, attempt sparse Cholesky factorization
using CHOLMOD.
7. If the sparse Cholesky factorization failed or the matrix is not hermitian with a real positive diagonal,
and the matrix is square, factorize using UMFPACK.
8. If the matrix is not square, or any of the previous solvers flags a singular or near singular matrix, find
a minimum norm solution using CXSPARSE.
The above clearly illustrates that a reliable and efficient algorithm to solve linear systems of equations
uses more than the most elementary ideas. In particular the keywords Cholesky and band structure appear
often.
Sparse matrices can very efficiently be multiplied with a vector. Thus we seek algorithms to solve
linear systems of equations, using multiplications only. The trade-off is that we might have to use
many multiplications of a matrix times a vector.
We will examine methods that allow to solve linear systems with 10⁶ unknowns within a few seconds
on a standard computer, see Table 2.15 on page 91.
2–38 Observation : It is possible to take advantage of a multi-core architecture for the multiplication of a
sparse matrix with a vector. ♦
For the 2-D model matrix Ann the condition number is
$$\kappa = \frac{\lambda_{max}}{\lambda_{min}} \approx \frac{4}{\pi^2}\,n^2\,.$$
An iterative method will have to do better than this to be considered useful. To multiply the matrix Ann
with a vector we need about 5 n² multiplications.
The above matrix Ann might appear when solving a two dimensional heat conduction problem. For the
similar three dimensional problem we find a matrix A of size N = n³ and each row has approximately
7 nonzero entries. The semi-bandwidth of the matrix is n². Thus the banded Cholesky solver requires
approximately 1/2 n³ · n⁴ = 1/2 n⁷ floating point operations. The condition number is identical to the 2-D
situation.
Assume that one iteration step reduces the error by a factor q < 1, i.e. after k steps the initial error is
multiplied by q^k. To improve the result by D digits we need q^k ≤ 10⁻ᴰ, i.e.
$$k\,\log_{10} q \le -D \qquad\text{or}\qquad k \ge \frac{-D}{\log_{10} q} = \frac{-D\,\ln 10}{\ln q} > 0\,.$$
For most applications the factor q < 1 will be very close to 1. Thus examine q = 1 − q₁ and use the Taylor
approximation ln q = ln(1 − q₁) ≈ −q₁. Then the above computation leads to an estimate for the number
of iterations necessary to decrease the error by D digits, i.e.
$$k \ge \frac{D\,\ln 10}{q_1}\,. \qquad (2.4)$$
This implies that the number of required iterations is proportional to the number of desired correct digits
and inversely proportional to the deviation q₁ of the factor q = 1 − q₁ from 1.
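As a quick illustration of estimate (2.4): to gain D = 6 digits with a factor q = 1 − q₁ = 0.999 roughly 14'000 iterations are required.

Octave
D = 6; q1 = 1e-3;      % 6 digits, q = 0.999
k = ceil(D*log(10)/q1) % approximately 13816 iterations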
To solve A ~x + ~b = ~0 for a symmetric, positive definite matrix A examine the function
$$f(\vec x) = \frac{1}{2}\,\langle\vec x\,,\,A\,\vec x\rangle + \langle\vec x\,,\,\vec b\rangle\,.$$
A possible graph of such a function and its level curves are shown in Figure 2.11. For symmetric matrices
the gradient of this function is given by¹³
$$\nabla f(\vec x) = A\,\vec x + \vec b\,,$$
thus the minimizer of f is the solution of A ~x + ~b = ~0.
A given point ~xk is assumed to be a good approximation of the exact solution ~x. The error is given by
the residual vector
$$\vec r_k = A\,\vec x_k + \vec b\,.$$
The direction of steepest descent is given by
$$\vec d_k = -\nabla f(\vec x_k) = -\vec r_k\,.$$
This is the reason for the name steepest descent or gradient method, illustrated in Figure 2.12. Thus
search for a better solution in the direction d~k, i.e. determine the coefficient α ∈ R, such that the value of
the function
$$h(\alpha) = f(\vec x_k + \alpha\,\vec d_k) = \frac{1}{2}\,\langle \vec x_k + \alpha\,\vec d_k\,,\,A\,(\vec x_k + \alpha\,\vec d_k)\rangle + \langle \vec x_k + \alpha\,\vec d_k\,,\,\vec b\rangle$$
$$= \frac{\alpha^2}{2}\,\langle\vec d_k\,,\,A\,\vec d_k\rangle + \frac{\alpha}{2}\,\big(\langle\vec d_k\,,\,A\,\vec x_k\rangle + \langle A\,\vec d_k\,,\,\vec x_k\rangle + 2\,\langle\vec d_k\,,\,\vec b\rangle\big) + \text{terms independent of } \alpha$$
$$= \frac{\alpha^2}{2}\,\langle\vec d_k\,,\,A\,\vec d_k\rangle + \alpha\,\big(\langle\vec d_k\,,\,A\,\vec x_k\rangle + \langle\vec d_k\,,\,\vec b\rangle\big) + \text{terms independent of } \alpha$$
is minimal. This leads to the condition
$$0 = \frac{d\,h(\alpha)}{d\alpha} = \alpha\,\langle\vec d_k\,,\,A\,\vec d_k\rangle + \langle\vec d_k\,,\,A\,\vec x_k\rangle + \langle\vec d_k\,,\,\vec b\rangle = \alpha\,\langle\vec d_k\,,\,A\,\vec d_k\rangle + \langle\vec d_k\,,\,A\,\vec x_k + \vec b\rangle$$
$$\alpha = -\frac{\langle\vec d_k\,,\,A\,\vec x_k + \vec b\rangle}{\langle A\,\vec d_k\,,\,\vec d_k\rangle} = -\frac{\langle\vec d_k\,,\,\vec r_k\rangle}{\langle A\,\vec d_k\,,\,\vec d_k\rangle} = +\frac{\langle\vec r_k\,,\,\vec r_k\rangle}{\langle A\,\vec d_k\,,\,\vec d_k\rangle}$$
and thus the next approximation ~x_{k+1} of the solution is given by
$$\vec x_{k+1} = \vec x_k + \alpha\,\vec d_k = \vec x_k + \frac{\|\vec r_k\|^2}{\langle A\,\vec d_k\,,\,\vec d_k\rangle}\,\vec d_k\,.$$
One step of this iteration is shown in Figure 2.12 and a pseudo code for the algorithm is shown on the left
in Table 2.7.
Table 2.7: Gradient algorithm to solve A ~x + ~b = ~0, a first attempt (left) and an efficient implementation (right)
The computational effort for one step in the algorithm seems to be: 2 matrix/vector multiplications,
2 scalar products and 2 vector additions. But the residual vector ~rk and the direction vector d~k differ only in
their sign. Since
$$\vec r_{k+1} = A\,\vec x_{k+1} + \vec b = A\,(\vec x_k + \alpha\,\vec d_k) + \vec b = \vec r_k + \alpha\,A\,\vec d_k\,,$$
the necessary computations for one step of the iteration can be reduced, leading to the algorithm on the right
in Table 2.7. To translate between the two implementations use a few ± changes and the correspondences
$$\vec d_k \longleftrightarrow \vec r\,, \qquad
A\,\vec d_k \longleftrightarrow \vec d\,, \qquad
\vec x_{k+1} = \vec x_k + \alpha\,\vec d_k \longleftrightarrow \vec x = \vec x + \alpha\,\vec r\,, \qquad
\vec r_{k+1} = \vec r_k + \alpha\,A\,\vec d_k \longleftrightarrow \vec r = \vec r + \alpha\,\vec d\,.$$
The improved algorithm in Table 2.7 requires
• one matrix–vector product and two scalar products
• two vector additions of the type ~x = ~x + α ~r
• storage for the sparse matrix and 3 vectors
If each row of the N × N -matrix A has on average nz nonzero entries, then each iteration requires approx-
imately (4 + nz) N flops (multiplication/addition pairs).
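A compact Octave sketch of this efficient variant; the function name, argument list and stopping test are chosen freely and are not part of the notes.

Octave
function x = gradientSolver(A, b, x, tol, maxIter)
% steepest descent iteration for A*x + b = 0, A symmetric positive definite
r = A*x + b; % residual, equals the gradient of f
for k = 1:maxIter
  d = A*r;               % the only matrix-vector product per step
  alpha = (r'*r)/(r'*d); % optimal step length
  x = x - alpha*r;       % step in the direction -r
  r = r - alpha*d;       % update the residual, no second product needed
  if (norm(r) < tol) break; end%if
end%for
end%function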
Since the matrix A is positive definite
$$\frac{d^2}{d\alpha^2}\,h(\alpha) = \langle A\,\vec d_k\,,\,\vec d_k\rangle > 0\,,$$
unless −d~k = A ~xk + ~b = ~0. This is the minimum of the function h(α) and consequently f(~x_{k+1}) < f(~xk),
unless ~xk equals the exact solution of A ~x + ~b = ~0. Since d~k = −~rk conclude that α ≥ 0, i.e. the algorithm
made a step of positive length in the direction of the negative gradient.
The algorithm does not perform well if we search the minimal value in a narrow valley, as illustrated
in Figure 2.13. Instead of going down the valley, the algorithm jumps across and it requires many steps to
get close to the lowest point. This is reflected by the error estimate for this algorithm. One can show that
(e.g. [Hack94, Theorem 9.2.3], [Hack16, Theorem 9.10], [LascTheo87, p. 496], [KnabAnge00, p. 212],
[AxelBark84, Theorem 1.8])¹⁴
$$\|\vec x_k - \vec x\|_A \le \Big(\frac{\kappa-1}{\kappa+1}\Big)^k\,\|\vec x_0 - \vec x\|_A \approx \Big(1 - \frac{2}{\kappa}\Big)^k\,\|\vec x_0 - \vec x\|_A \qquad (2.5)$$
using the energy norm ‖~x‖²_A = ⟨~x, A ~x⟩.
For most matrices based on finite element problems we know that ‖~x‖ ≤ α ‖~x‖_A and thus
$$\|\vec x_k - \vec x\| \le c\,\Big(\frac{\kappa-1}{\kappa+1}\Big)^k \approx c\,\Big(1 - \frac{2}{\kappa}\Big)^k$$
where
$$\kappa = \frac{\lambda_{max}}{\lambda_{min}} = \text{condition number of } A\,.$$
The resulting number of required iterations is given by
$$k \ge \frac{D\,\ln 10}{q_1} = \frac{D\,\ln 10}{2}\,\kappa\,.$$
Thus if the ratio of the largest and smallest eigenvalue of the matrix A is large, then the algorithm converges
slowly. Unfortunately this is most often the case, thus Figure 2.13 shows the typical situation and not the
exception.
14
In [GoluVanLoan13, §11.3.2] find a complete (rather short) proof of
$$\|\vec x_{k+1} - \vec x^*\|_A^2 \le \Big(1 - \frac{1}{\kappa_2(A)}\Big)\,\|\vec x_k - \vec x^*\|_A^2\,.$$
Using
$$\sqrt{1 - \frac{1}{\kappa}} \approx 1 - \frac{1}{2\,\kappa} > 1 - \frac{2}{\kappa}$$
observe that (2.5) is a slightly better estimate. I have a note of the proof in [GoluVanLoan13, p. 627ff], adapted to the notation of
these lecture notes.
For the model matrix Ann examine
$$q = 1 - q_1 = 1 - \frac{2}{\kappa} \approx 1 - \frac{\pi^2}{2\,n^2}\,.$$
Then equation (2.4) implies that we need
$$k \ge \frac{D\,\ln 10}{q_1} = \frac{2\,D\,\ln 10}{\pi^2}\,n^2$$
iterations to increase the accuracy by D digits. Based on the estimated operation counts
for the operations necessary for each step in the steepest descent iteration we arrive at the total number of
flops as
$$9\,n^2\,k \approx \frac{18\,D\,\ln 10}{\pi^2}\,n^4 \approx 4.2\,D\,n^4\,.$$
This is slightly worse than a banded Cholesky algorithm (Flop_Chol ≈ 1/2 n⁴). The gradient algorithm does
use less memory, but requires more flops.
Conjugate directions
On the left in Figure 2.14 find elliptical level curves of the function g(~x) = ⟨~x, A ~x⟩. A first vector ~a is
tangential to a given level curve at a point. A second vector ~b is connecting this point to the origin. The two
vectors represent two subsequent search directions. When applying the transformation
$$\vec u = \begin{pmatrix} u\\ v \end{pmatrix} = A^{1/2}\,\begin{pmatrix} x\\ y \end{pmatrix} = A^{1/2}\,\vec x$$
15
The conjugate gradient algorithm was developed by Magnus Hestenes and Eduard Stiefel (ETHZ) in 1952.
obtain
$$g(\vec x) = \langle\vec x\,,\,A\,\vec x\rangle = \langle A^{1/2}\,\vec x\,,\,A^{1/2}\,\vec x\rangle = \langle\vec u\,,\,\vec u\rangle = h(\vec u)$$
and the level curves of the function h in a (u, v) system will be circles, shown on the right in Figure 2.14.
The two vectors ~a and ~b shown in the left part will transform according to the same transformation rule.
The resulting images will be orthogonal and thus
$$0 = \langle A^{1/2}\,\vec a\,,\,A^{1/2}\,\vec b\rangle = \langle\vec a\,,\,A\,\vec b\rangle\,.$$
Two directions with this property are called conjugate¹⁶. Since the two directions d~k and d~_{k−1} have to be
conjugate conclude
$$\langle\vec d_k\,,\,A\,\vec d_{k-1}\rangle = 0\,.$$
Then the optimal value of αk to minimize h(α) = f(~xk + αk d~k) can be determined with a calculation
identical to the standard gradient method, i.e.
$$\alpha_k = -\frac{\langle\vec r_k\,,\,\vec d_k\rangle}{\langle A\,\vec d_k\,,\,\vec d_k\rangle}$$
16
Using the diagonalization of the matrix A (see page 134) we even have a formula for A^{1/2}. Since A = Q · diag(λᵢ) · Qᵀ we
use A^{1/2} = Q · diag(√λᵢ) · Qᵀ and conclude
$$A^{1/2}\cdot A^{1/2} = Q\cdot\text{diag}(\sqrt{\lambda_i})\cdot Q^T\cdot Q\cdot\text{diag}(\sqrt{\lambda_i})\cdot Q^T = Q\cdot\text{diag}(\lambda_i)\cdot Q^T = A\,.$$
Fortunately we do not need the explicit formula for A^{1/2}, since this would require all eigenvectors, which is computationally
expensive. For the algorithm it is sufficient to know how to multiply a vector by the matrix A.
and obtain a better approximation of the solution of the linear system by ~x_{k+1} = ~x_k + α_k d~k. This algorithm
is spelled out on the left in Table 2.8 and its result is illustrated in Figure 2.15. Just as in the standard
gradient algorithm find d²/dα² h(α) = ⟨A d~k, d~k⟩ > 0 and thus
• either the algorithm terminates, i.e. we found the optimal solution at this point
• or αk > 0.
An example in R2
The function
$$f(x,y) = \frac{1}{2}\,\Big\langle \begin{bmatrix} +1 & -0.5\\ -0.5 & +3 \end{bmatrix}\begin{pmatrix} x\\ y \end{pmatrix}, \begin{pmatrix} x\\ y \end{pmatrix}\Big\rangle + \Big\langle \begin{pmatrix} -1\\ -2 \end{pmatrix}, \begin{pmatrix} x\\ y \end{pmatrix}\Big\rangle
= \frac{1}{2}\,x^2 - \frac{1}{2}\,x\,y + \frac{3}{2}\,y^2 - x - 2\,y$$
is minimized at (x, y) ≈ (1.45455, 0.90909). With a starting vector at (x0, y0) = (1, 1) one can apply
two steps of the gradient algorithm, or two steps of the conjugate gradient algorithm. The first step of the
conjugate gradient algorithm coincides with the first step of the gradient algorithm, since there is no previous
direction to determine the conjugate direction yet. The result is shown in Figure 2.16. The two blue arrows
are the result of the gradient algorithm (steepest descent) and the green vector is the second step of the
conjugate gradient algorithm. In this example the conjugate gradient algorithm finds the exact solution with
two steps. This is not a coincidence, but generally correct and caused by orthogonality properties of the
conjugate gradient algorithm.
Orthogonality properties
We define the Krylov subspaces generated by the matrix A and the vector d~0 by
$$\mathcal K(k, \vec d_0) = \text{span}\{\vec d_0,\ A\,\vec d_0,\ A^2\,\vec d_0,\ \ldots,\ A^k\,\vec d_0\}\,.$$
Examining the algorithm in Table 2.8 observe that
$$\vec r_i \in \mathcal K(k, \vec d_0)\,, \quad \vec d_i \in \mathcal K(k, \vec d_0) \quad\text{and}\quad \vec x_i \in \vec x_0 + \mathcal K(k, \vec d_0) \quad\text{for } 0 \le i \le k\,.$$
The above is correct for any choice of the parameters βk. Now we examine the algorithm in Table 2.8 with
the optimal choice for αk, but the values of βk in d~k = −~rk + βk d~_{k−1} are to be determined by a new
criterion. The theorem below shows that we minimize the function f(~x) on the k + 1 dimensional affine
subspace ~x0 + K(k, d~0), and not only on the two dimensional plane spanned by the last two search directions.
Table 2.8: The conjugate gradient algorithm to solve A ~x + ~b = ~0 and an efficient implementation

Figure 2.16: Two steps of the gradient algorithm (blue) and the conjugate gradient algorithm (green)
2–39 Theorem : Consider fixed values of k ∈ N, ~x0 and ~r0 = A ~x0 + ~b. Choose the vector ~x ∈ ~x0 + K(k, d~0)
such that the energy function f(~x) = 1/2 ⟨~x, A~x⟩ + ⟨~x, ~b⟩ is minimized on the affine subspace ~x0 + K(k, d~0).
The subspace K(k, d~0) has dimension k + 1. The following orthogonality properties are correct:
$$\langle\vec r_k\,,\,\vec r_i\rangle = 0 \quad\text{and}\quad \langle\vec d_k\,,\,A\,\vec d_i\rangle = 0 \quad\text{for } 0 \le i < k\,.$$
The values
$$\beta_k = \frac{\langle\vec r_k\,,\,A\,\vec d_{k-1}\rangle}{\langle\vec d_{k-1}\,,\,A\,\vec d_{k-1}\rangle}$$
will generate the optimal solution, i.e. f(~x) is minimized, with the algorithm on the left in Table 2.8. 3
Proof : Since the vector ~x ∈ ~x0 + K(k, d~0) minimizes the function f(~x) on the affine subspace ~x0 +
K(k, d~0), its gradient has to be orthogonal on the subspace K(k, d~0), i.e. with ~r = A ~x + ~b = ∇f(~x) find
$$\langle\vec r\,,\,\vec v\rangle = 0 \quad\text{for all } \vec v \in \mathcal K(k, \vec d_0)$$
and K(k, d~0) is a strict subspace of K(k+1, d~0). This implies dim(K(k, d~0)) = k + 1.
Using ~r_{k+1} = ~r_k + α_k A d~k and d~i = −~r_i + β_i d~_{i−1} we conclude by recursion
$$\langle\vec d_i\,,\,A\,\vec d_k\rangle = \Big\langle -\vec r_i + \beta_i\,\vec d_{i-1}\,,\,\frac{\vec r_{k+1} - \vec r_k}{\alpha_k}\Big\rangle
= \frac{\beta_i}{\alpha_k}\,\langle\vec d_{i-1}\,,\,\vec r_{k+1} - \vec r_k\rangle
= \frac{1}{\alpha_k}\,\prod_{j=1}^i \beta_j\,\langle\vec d_0\,,\,\vec r_{k+1} - \vec r_k\rangle
= \frac{-1}{\alpha_k}\,\prod_{j=1}^i \beta_j\,\langle\vec r_0\,,\,\vec r_{k+1} - \vec r_k\rangle = 0\,.$$
The above is correct for all possible choices of βj and thus implies the claimed orthogonality relations. 2
2–40 Corollary :
• Since dim(K(k, d~0)) = k + 1 the conjugate gradient algorithm with exact arithmetic will terminate
after at most N steps. Due to rounding errors this will not be of practical relevance for large matri-
ces. In addition the number of steps might be prohibitively large. Thus use the conjugate gradient
algorithm as an iterative method.
The above properties allow a more efficient implementation of the conjugate gradient algorithm. The
algorithm on the right in Table 2.8 is taken from [GoluVanLoan96]. This improved implementation of the
algorithm requires for each iteration
• one matrix–vector product and two scalar products
• three vector additions of the type ~x = ~x + α ~r
• storage for the sparse matrix and 4 vectors
If each row of the matrix A has on average nz nonzero entries then we determine that each iteration requires
approximately (5 + nz) N flops (multiplication/addition pairs).
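A possible Octave translation of the efficient algorithm in Table 2.8 (a sketch; the function name, argument list and stopping test are chosen freely):

Octave
function x = cgSolver(A, b, x, tol, maxIter)
% conjugate gradient iteration for A*x + b = 0, A symmetric positive definite
r = A*x + b; d = -r;
rr = r'*r;
for k = 1:maxIter
  Ad = A*d;            % the only matrix-vector product per step
  alpha = rr/(d'*Ad);  % optimal step length
  x = x + alpha*d;
  r = r + alpha*Ad;    % update the residual
  rrNew = r'*r;
  if (sqrt(rrNew) < tol) break; end%if
  d = -r + (rrNew/rr)*d; % new conjugate search direction
  rr = rrNew;
end%for
end%function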
Convergence estimate
Assume that the exact solution is given by ~z, i.e. A ~z + ~b = ~0. Use the notation ~r = A ~y + ~b, resp.
~y = A⁻¹(~r − ~b) to conclude that ~y − ~z = A⁻¹ ~r. Then consider the following function:
$$f(\vec y) - f(\vec z) = \frac{1}{2}\,\langle\vec y - \vec z\,,\,A\,(\vec y - \vec z)\rangle = \frac{1}{2}\,\|\vec y - \vec z\|_A^2 = \frac{1}{2}\,\langle A^{-1}\,\vec r\,,\,\vec r\rangle\,.$$
norm. Find the result and proofs¹⁷ in [Hack94, Theorem 9.4.12], [Hack16, Theorem 10.14], [LascTheo87],
[KnabAnge00, p. 218] or [AxelBark84]. The relevant convergence estimate is
$$\|\vec x_k - \vec x\|_A \le 2\,\Big(\frac{\sqrt\kappa - 1}{\sqrt\kappa + 1}\Big)^k\,\|\vec x_0 - \vec x\|_A \approx 2\,\Big(1 - \frac{2}{\sqrt\kappa}\Big)^k\,\|\vec x_0 - \vec x\|_A\,.$$
This leads to
$$\|\vec x_k - \vec x\| \le c\,\Big(\frac{\sqrt\kappa - 1}{\sqrt\kappa + 1}\Big)^k \approx c\,\Big(1 - \frac{2}{\sqrt\kappa}\Big)^k\,. \qquad (2.6)$$
Compare this with the corresponding estimate (2.5) (page 76), where κ appears instead of √κ. The resulting
number of required iterations is thus given by
$$k \ge \frac{D\,\ln 10}{q_1} = \frac{D\,\ln 10}{2}\,\sqrt\kappa\,. \qquad (2.7)$$
This is considerably better than the estimate for the steepest descent method, since κ is replaced by √κ.
For the model matrix Ann examine
$$q = 1 - q_1 = 1 - \frac{2}{\sqrt\kappa} \approx 1 - \frac{\pi}{n}\,.$$
Then equation (2.4) implies that we need
$$k \ge \frac{D\,\ln 10}{q_1} = \frac{D\,\ln 10}{\pi}\,n$$
iterations to increase the precision by D digits. Based on the estimate for the operations necessary to
multiply the matrix with a vector we estimate the total number of flops as
$$(5+5)\,n^2\,k \approx \frac{10\,D\,\ln 10}{\pi}\,n^3 \approx 7.3\,D\,n^3\,.$$
This is considerably better than a banded Cholesky algorithm, since the number of operations is proportional
to n³ instead of n⁴. For large values of n the conjugate gradient method is clearly preferable.
Table 2.9 shows the required storage and the number of necessary flops to solve the 2–D and 3–D model
problems with n free grid points in each direction. The results are illustrated18 in Figure 2.17. Observe that
one operation for the gradient algorithm requires more time than one operation of the Cholesky algorithm,
due to the multiplication of the sparse matrix with a vector.
We may draw the following conclusions from Table 2.9 and the corresponding Figure 2.17.
• The iterative methods require less memory than direct solvers. For 3–D problem this difference is
accentuated.
• For 2–D problems with small resolution the banded Cholesky algorithm is more efficient than the
conjugate gradient method. For larger 2–D problems conjugate gradient will perform better.
17
The simpler proof in [GoluVanLoan13, Theorem 11.3.3] does not produce the best possible estimate. There find the estimate
$$\|\vec x_{k+1} - \vec x\| \le \Big(1 - \frac{1}{\kappa}\Big)^{1/2}\,\|\vec x_k - \vec x\| \le \Big(1 - \frac{1}{2\,\kappa}\Big)\,\|\vec x_k - \vec x\|\,.$$
• For 3–D problems one should always use conjugate gradient, even for small problems.
• For small 3–D problems banded Cholesky will compute results with reasonable computation time.
• The method of steepest descent is never competitive.
Table 2.10 lists approximate computation times for a CPU capable of performing 10⁸ FLOPS (= 100 MFLOPS)
or 10¹¹ FLOPS (= 100 GFLOPS).
Table 2.9: Storage requirements and number of flops for the model problems

                     |          2–D                |          3–D
                     | storage | flops             | storage | flops
Cholesky, banded     |   n³    | (1/2) n⁴          |   n⁵    | (1/2) n⁷
Steepest Descent     |  8 n²   | (18 D ln 10/π²) n⁴ | 10 n³   | (22 D ln 10/π²) n⁵
Conjugate Gradient   |  9 n²   | (10 D ln 10/π) n³  | 11 n³   | (12 D ln 10/π) n⁴
10 10
Cholesky 2D Cholesky 3D
1 day
Steepest Descent 2D 16 Steepest Descent 3D
12 Conjugate Gradient 2D 10 Conjugate Gradient 3D 1 year
10 1h
1 month
number of flops
number of flops
14
10
10
10 1 min 1 day
12
10 1h
08 1 sec
10
10
10 1 min
06
10 08 1 sec
10
04 06
10 10
1 2 3 1 2 3
10 10 10 10 10 10
number of grid points in each direction number of grid points in each direction
(a) 2D model problem (b) 3D model problem
Figure 2.17: Number of operations of banded Cholesky, steepest descent and conjugate gradient algorithm
on the model problem. The time estimates are for a CPU capable to perform 100 MFLOPS.
CPU \ flops | 10⁸       | 10⁹      | 10¹⁰    | 10¹¹   | 10¹²   | 10¹⁴      | 10¹⁶      | 10¹⁸
100 MFLOPS  | 1 sec     | 10 sec   | 1.7 min | 17 min | 2.8 h  | 11.6 days | 3.2 years | 320 years
100 GFLOPS  | 0.001 sec | 0.01 sec | 0.1 sec | 1 sec  | 10 sec | 16 min    | 28 h      | 116 days

Table 2.10: Time required to complete a given number of flops on a 100 MFLOPS or 100 GFLOPS CPU
2.7.5 Preconditioning
Based on equation (2.6) the convergence of the conjugate gradient method is heavily influenced by the
condition number of the matrix A. If the problem is modified such that the condition number decreases
there will be faster convergence. The idea is to replace the system A~x = −~b by an equivalent system with
a smaller condition number. There are different options on how to proceed:
• Left preconditioning:
$$M^{-1}\,A\,\vec x = -M^{-1}\,\vec b\,.$$
• Right preconditioning:
$$A\,M^{-1}\,\vec u = -\vec b \quad\text{with}\quad \vec x = M^{-1}\,\vec u\,.$$
• Split preconditioning:
$$M_L^{-1}\,A\,M_R^{-1}\,\vec u = -M_L^{-1}\,\vec b \quad\text{with}\quad \vec x = M_R^{-1}\,\vec u\,.$$
For a symmetric matrix A use a symmetric split preconditioning with
$$M = R^T\,R\,.$$
The matrices R and M have to be chosen such that it takes little effort to solve systems with those matrices
and also as little memory as possible. Two of the possible constructions of these matrices will be shown
below, both based on the incomplete Cholesky factorization. Then split the preconditioner between the left and right
side. Use ~x = R⁻¹ ~u to conclude
$$A\,\vec x + \vec b = \vec 0 \iff R^{-T}\,A\,R^{-1}\,\vec u + R^{-T}\,\vec b = \vec 0$$
and apply the conjugate gradient algorithm (see Table 2.8 on page 80) with the new matrix
$$\tilde A = R^{-T}\,A\,R^{-1}\,.$$
If the matrix M = Rᵀ R is relatively close to the matrix A = R_eᵀ R_e (with the exact Cholesky factor R_e) we
conclude that R ≈ R_e and thus
$$\tilde A = R^{-T}\,A\,R^{-1} \approx I\,.$$
As a consequence find a small condition number of the modified matrix Ã. This leads to the basic algorithm
on the left in Table 2.11.
In the algorithm in the center of Table 2.11 introduce the new vector
$$\vec z_k = R^{-1}\,\vec r_k = R^{-1}\,\big(\tilde A\,\vec x_k + R^{-T}\,\vec b\big) = R^{-1}\,R^{-T}\,A\,R^{-1}\,\vec x_k + R^{-1}\,R^{-T}\,\vec b = M^{-1}\,\big(A\,R^{-1}\,\vec x_k + \vec b\big)\,.$$
Then realize that the vectors d~k and ~xk appear in the form R⁻¹ d~k and R⁻¹ ~xk. This allows for a translation
of the algorithm with slight changes as shown on the right in Table 2.11. This can serve as starting point for
an efficient implementation. Observe that the update of ~zk involves the matrix M⁻¹ and thus we have to
solve the system
$$M\,\vec z_k = A\,R^{-1}\,\vec r_k$$
for ~zk. Thus it is important that the structure of the matrix M allows for fast solutions.
The first steps of the three versions of the algorithm read:

Left (original algorithm):
  choose initial point ~x0
  ~r0 = Ã ~x0 + R⁻ᵀ ~b
  d~0 = −~r0
  α0 = −⟨~r0, d~0⟩ / ⟨Ã d~0, d~0⟩
  ~x1 = ~x0 + α0 d~0

Center (Ã = R⁻ᵀ A R⁻¹ written out):
  choose initial point ~x0
  ~r0 = R⁻ᵀ A R⁻¹ ~x0 + R⁻ᵀ ~b
  R⁻¹ d~0 = −R⁻¹ ~r0
  α0 = −⟨~r0, d~0⟩ / ⟨R⁻ᵀ A R⁻¹ d~0, d~0⟩
  R⁻¹ ~x1 = R⁻¹ ~x0 + α0 R⁻¹ d~0

Right (starting point for an implementation):
  choose initial point ~x0, resp. ~x0 = R⁻¹ ~x0
  ~r0 = A ~x0 + ~b,  ~z0 = M⁻¹ ~r0
  d~0 = −~z0
  α0 = −⟨~r0, d~0⟩ / ⟨A d~0, d~0⟩
  ~x1 = ~x0 + α0 d~0

Table 2.11: Preconditioned conjugate gradient algorithms to solve A ~x + ~b = ~0. On the left the original
algorithm, in the center with ~zk = R⁻¹ ~rk and on the right using R⁻¹ d~k and R⁻¹ ~xk. The algorithm on the
right might serve as starting point for an efficient implementation.
There are two common strategies for dropping entries in the incomplete factorization:
• Keep the sparsity pattern of the original matrix A, i.e. drop the entry at a position in the matrix if the
entry in the original matrix is zero. This algorithm is called IC(0) and it is presented below.
• Drop the entry if its value is below a certain threshold, the drop-tolerance. This is often called the
ICT() algorithm. The results of ICT() and a similar LR factorization are examined in the following
section.
This construction of a preconditioner matrix R is based on the Cholesky factorization of the symmetric,
positive definite matrix A, i.e. A = RT R. But we require that the matrix R has the same sparsity pattern
as the matrix A. Those two wishes can not be satisfied simultaneously. Give up on the exact factorization
and require only
RT R = A + E
for some perturbation matrix E. This leads to the conditions
1. ri,j = 0 if ai,j = 0 ,
To develop the algorithm use the same idea as in Section 2.6.1: examine the approximate factorization
$A + E = R^T\cdot R$
with the first row and first column separated from the remaining $(n-1)\times(n-1)$ block. Now examine this matrix multiplication on the four submatrices and keep track of the sparsity pattern. This translates to 4 subsystems.
• Examine the top left block (one single number) in A. Obviously $a_{1,1} = r_{1,1}\cdot r_{1,1}$ and thus
  $r_{1,1} = \sqrt{a_{1,1}}\,.$
• The top right block (row) in A is then already taken care of, thanks to the symmetry of A .
• Then restart the process with the reduced problem of size (n − 1) × (n − 1) in the lower right block.
The above can be translated to Octave code without major problems. Be aware that this implementation is very far from being efficient; do not use it on large problems. This author has a faster version, but for some real speed coding in C++ is necessary. In real applications the matrices A or R are rarely computed explicitly. Most often a function to evaluate the matrix products has to be provided. This allows an optimal usage of the sparsity pattern.
function R = cholInc(A)
% R = cholInc(A) returns the incomplete Cholesky factorization
% of the positive definite matrix A
[n,m] = size(A);
if (n ~= m) error('cholInc: matrix has to be square'); end%if
R = zeros(size(A));
for k = 1:n
  if A(k,k) <= 0 error('cholInc: failed, might be singular'); end%if
  R(k,k) = sqrt(A(k,k));
  for i = k+1:n   % first row of R in the current block, keeping the sparsity pattern
    if A(k,i) ~= 0 R(k,i) = A(k,i)/R(k,k); end%if
  end%for
  for j = k+1:n   % update the remaining lower right block
    for i = j:n
      if A(j,i) ~= 0 A(j,i) = A(j,i) - A(k,j)*A(k,i)/A(k,k); end%if
    end%for
  end%for
end%for
end%function
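A quick test of the routine; the matrix below is a small, arbitrarily chosen symmetric positive definite example where IC(0) drops one fill-in entry, so the perturbation E does not vanish:

Octave
A = [4 -1 0 -1; -1 4 -1 0; 0 -1 4 -1; -1 0 -1 4];  % small SPD test matrix
R = cholInc(A);      % incomplete Cholesky factor, upper triangular
E = R'*R - A;        % perturbation matrix with R'*R = A + E
norm(E)              % nonzero, caused by the dropped fill-in entries
cond(A)              % condition number of the original matrix
cond(R'\A/R)         % condition of Atilde = R^(-T)*A*R^(-1), much closer to 1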
The efficiency of the above preconditioner is determined by the change in condition number for the
matrices A and à = R−T AR−1 . As an example consider the model matrix Ann from Section 2.3.2
for different values of n, leading to Table 2.12. Based on equation (2.6) these results might make a big
difference for the computation time. In [AxelBark84] a modified incomplete Cholesky factorization is
presented and you may find the statement that the condition number for the model matrices An and Ann
decreases to order n (instead of n2 ). This is considerably better than the result indicated by Table 2.12. It
will also change the results in Table 2.9 and Figure 2.17 in favor of the preconditioned conjugate gradient
method. In the following section find numerical results on the performance of the Cholesky preconditioner
with a drop-tolerance.
Table 2.12: The condition numbers κ using the incomplete Cholesky preconditioner IC(0)
2–41 Observation : The preconditioner will reduce the condition number for the conjugate gradient iteration, but it has no effect on the condition number of the original matrix A. Instead of solving $A\vec x = \vec b$ the modified equation $\tilde A\,\vec u = R^{-T}AR^{-1}\,\vec u = R^{-T}\,\vec b$ is examined, where $\vec u = R\,\vec x$. Thus the sequence of solvers is
$A\,\vec x = \vec b \iff$
  $R^T\,\vec b_{new} = \vec b$  (solve for $\vec b_{new}$, top to bottom)
  $(R^{-T}AR^{-1})\,\vec u = \vec b_{new}$  (solve for $\vec u$ by conjugate gradient)
  $R\,\vec x = \vec u$  (solve for $\vec x$, bottom up)
The first and last step of the algorithm will increase the condition number again, thus the conditioning of solving $A\,\vec x = \vec b$ is not improved.
As an example consider the exact Cholesky factorization R of A; then $\tilde A = R^{-T} A\,R^{-1} = I$ and thus $\kappa(\tilde A) = 1$. A test with Octave for the matrix $A_{nn}$ with n = 200 shows $\kappa(A_{nn}) \approx 2.4\cdot 10^4$ and $\kappa(R) \approx 211$, resp. $\kappa(R)^2 \approx 4.4\cdot 10^4$. Thus the overall condition number of the problem does not change. For an incomplete Cholesky preconditioner with a small drop tolerance the result will be similar. ♦
• For strictly positive drop tolerances droptol > 0 the ICT() algorithm is applied. Only entries satisfying the condition abs(L(i,j)) >= droptol * norm(A(j:end,j),1) are stored in the lower triangular matrix L. The command ichol() generates this result by
opts.type = ’ict’;
opts.droptol = droptol;
L = ichol(A,opts);
• The IC(0) algorithm with no fill-in is generated by
opts.type = 'nofill';
L = ichol(A,opts);
• For a system $A\,\vec x = \vec b$ the iteration terminates if $\|A\,\vec x - \vec b\| \le tol\cdot\|\vec b\|$. At first a relative error of $10^{-5}$ was asked for, then an improvement by another factor of $10^{-2}$.
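The pieces combine as in the following sketch; the model matrix, tolerances and drop tolerance are chosen for illustration only and are not the exact setup behind the tables below:

Octave
A = gallery('poisson',100);          % SPD model matrix on a 100x100 grid
b = ones(size(A,1),1);
opts.type = 'ict'; opts.droptol = 1e-3;
L = ichol(A,opts);                   % incomplete Cholesky factor, lower triangular
[x,flag,relres,iter] = pcg(A,b,1e-5,1000,L,L');   % preconditioned CG
iter                                 % number of iterations actually used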
Find the results (see footnote 19) in Tables 2.13 (MATLAB) and 2.14 (Octave). The following facts can be observed:
• The memory required for L increases for smaller drop-tolerances. The matrix for IC(0) requires
considerably less memory.
• The norm of the error matrix E = L · LT − A decreases for smaller drop-tolerances, i.e. the precon-
ditioned matrix is a better approximation of the original matrix.
• The condition number κ moves closer to the ideal value of 1 for smaller drop-tolerances. MATLAB
and Octave obtain identical condition numbers. This is no surprise, as the same algorithm is used.
• Using the estimated condition number we can estimate the number of iterations required to achieve the
desired relative tolerance. The actually observed number of iterations is rather close to the estimate.
The number of iterations for MATLAB and Octave are identical, and all considerably smaller than the
≈ 1900 iterations without preconditioner.
• MATLAB is faster than Octave for computing the factorization, since MATLAB uses a multithreaded
version of ichol(). Octave is faster than MATLAB to perform the conjugate gradient iterations. The
required computation time for one iteration increases for smaller drop-tolerances. This is caused by
the larger factorization matrix L.
• It takes very little time to determine the IC(0) factorization, but the condition number of the precon-
ditioned system is large and thus it takes many iterations to achieve the desired accuracy.
drop tolerance 0.001 0.0003 0.0001 3e-5 1e-5 3e-6 1e-6 IC(0)
time ichol [sec] 0.12 0.21 0.40 0.82 1.52 3.06 6.29 0.06
size ichol [MB] 57.9 87.4 136.9 205.6 313.4 471.5 696.2 15.1
norm kEk/kAk 2.8e-3 1.2e-3 4.6e-4 1.9e-4 7.3e-5 2.8e-5 1.0e-5 7.4e-2
cond for PCG 343.74 148.19 56.16 24.52 10.16 4.52 2.27 9700
estim. # of runs 107 70 44 29 19 13 9 570
first run, # of runs 70 46 29 19 12 9 6 349
first run, time [sec] 2.69 2.14 1.57 1.34 1.13 1.12 1.05 10.32
restart, # of runs 27 18 10 7 5 2 2 146
restart, time [sec] 1.08 0.85 0.55 0.50 0.47 0.27 0.36 4.28
Table 2.13: Performance for MATLAB’s pcg() with an incomplete Cholesky preconditioner
The above results show that large systems of linear equations can be solved quickly. For the above problem on a 1000 × 1000 grid we have to solve a system with 1 million unknowns, leading to the results in Table 2.15. If the same system has to be solved many times for different right hand sides, this table also shows that huge systems can be solved within a few seconds. This applies to dynamic problems, where a good initial guess is provided by the previous time step.
19: Most of the computational effort for these tables goes towards estimating the condition numbers! This is not shown in the tables. With new versions of Octave the command pcg() will estimate the extreme eigenvalues rather quickly by using the output variable EIGEST.
drop tolerance 0.001 0.0003 0.0001 3e-5 1e-5 3e-6 1e-6 IC(0)
time ichol [sec] 0.20 0.47 0.97 1.87 3.71 7.30 14.41 0.08
size ichol [MB] 57.9 87.4 136.9 205.6 313.4 471.5 696.2 15.1
norm kEk/kAk 2.7e-3 1.2e-3 4.5e-4 1.9e-4 7.3e-5 2.8e-5 1.0e-5 7.3e-2
cond for PCG 343.74 148.19 56.16 24.52 10.16 4.52 2.27 9700
first run, # of runs 70 46 29 19 12 9 6 349
first run, time [sec] 1.54 1.25 1.06 0.93 0.85 0.97 0.92 5.89
restart, # of runs 27 18 10 7 5 2 2 146
restart, time [sec] 0.60 0.50 0.36 0.35 0.36 0.22 0.31 2.52
Table 2.14: Performance for Octave’s pcg() with an incomplete Cholesky preconditioner
Table 2.15: Timing for solving a system with $10^6$ = 1'000'000 unknowns
The two tables below report on the performance of an incomplete LU factorization used as preconditioner in Octave and MATLAB. Observe:
• The time required to perform the incomplete LU factorization. Surprisingly the computation time seems not to depend on the drop tolerance used. Octave is faster than MATLAB.
• The memory required to perform the incomplete LU factorization. Smaller values for the drop toler-
ance lead to larger matrices L and U. Octave uses less memory than MATLAB.
• An estimate of the norm of the matrix E = L · U − A. Smaller values for the drop tolerance lead to
a smaller norm of the error matrix. The results for MATLAB and Octave are identical.
• An estimate of the condition number of the matrix $\tilde A = L^{-1} A\,U^{-1}$. Smaller values for the drop tolerance lead to a smaller condition number.
• An estimate of the required number of iterations to improve the result by D = 5 digits. The estimate is based on (2.7), i.e.
  $k \ge \frac{D\,\ln 10}{2}\,\sqrt{\kappa}\,.$
• The effective number of iterations required to improve the result by 5 digits. The results illustrate that
the theoretical bounds are rather close to the actual number of iterations. The results for MATLAB and
Octave are identical.
• The time required to run the iteration. At first the CPU time decreases, then it might increase again,
caused by the larger preconditioner matrices.
The computations shown were performed on an Intel Haswell I7-5930 system at 3.5 GHz.
drop tolerance 0.001 0.0003 0.0001 3e-5 1e-5 3e-6 1e-6 ILU(0)
time ILU [sec] 61.50 60.74 62.17 62.97 63.38 68.34 76.88 0.08
size ILU [MB] 124.3 191.5 298.5 458.3 699.1 1049.2 1527.8 30.3
norm kEk/kAk 2.5e-3 9.6e-4 3.8e-4 1.5e-4 5.9e-5 2.1e-5 7.7e-6 7.3e-2
cond for PCG 316 121.4 48.64 19.48 8.10 3.69 1.96
estim. # of runs 103 64 40 25 16 11 8
first run, iterations 67 42 26 17 11 8 6 349
first run, time [sec] 1.56 1.23 1.03 0.92 0.90 0.95 1.02 5.88
restart, iterations 26 15 10 6 4 2 1 146
restart, time [sec] 0.61 0.44 0.39 0.33 0.33 0.24 0.18 2.48
drop tolerance 0.001 0.0003 0.0001 3e-5 1e-5 3e-6 1e-6 ILU(0)
time ILU [sec] 140.41 136.47 139.33 137.10 136.16 139.99 147.88 0.08
size ILU [MB] 134.7 193.5 371.8 528.6 803.2 1131.5 1678.2 30.3
norm kEk/kAk 2.5e-3 9.7e-4 3.8e-4 1.5e-4 5.9e-5 2.2e-5 7.8e-6 7.4e-2
cond for PCG 316 121.4 48.64 19.48 8.10 3.69 1.96
first run, iterations 67 42 26 17 11 8 6 349
first run, time [sec] 2.90 2.08 1.58 1.32 1.14 1.10 1.12 10.79
restart, iterations 26 15 10 6 4 2 1 146
restart, time [sec] 1.12 0.74 0.61 0.46 0.43 0.28 0.20 4.36
2.8.1 Normal Equation, Conjugate Gradient Normal Residual (CGNR) and BiCGSTAB
The method of the normal equation is based on the fact that for an invertible matrix A
$A\,\vec x = \vec b \iff A^T A\,\vec x = A^T\,\vec b\,.$
The expression can also be generated by minimizing the norm of the residual $\vec r = A\,\vec x - \vec b$.
Since
$\langle A^T A\,\vec x,\,\vec x\rangle = \langle A\vec x,\,A\vec x\rangle = \|A\vec x\|^2 > 0 \quad\text{for } \vec x \neq \vec 0\,,$
the square matrix $A^T A$ is symmetric and positive definite. Thus the conjugate gradient algorithm can be applied to the modified problem
$A^T A\,\vec x = A^T\,\vec b\,.$
The computational effort for each iteration step is approximately doubled, as we have to multiply the given vector by the matrix A and then by its transpose $A^T$. A more severe disadvantage stems from the fact
$\kappa(A^T A) = (\kappa(A))^2$
and thus the convergence is usually slow. Using the normal equation is almost never a good idea. The above idea can be slightly modified, leading to the conjugate residual method or conjugate gradient normal residual method (CGNR), see e.g. [Saad00, §6.8], [LascTheo87, §8.6.2] or [GoluVanLoan13, §11.3.9]. The rate of convergence is related to $\kappa(A)$, and not to $\sqrt{\kappa(A)}$ as for the conjugate gradient method.
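A small numerical experiment confirms the squared condition number; the matrix is a random, well conditioned and (almost surely) invertible example:

Octave
n = 200;
A = eye(n) + 0.3*randn(n)/sqrt(n);   % nonsymmetric test matrix
cond(A)^2
cond(A'*A)                           % identical, up to rounding errors
b = randn(n,1);
x = pcg(A'*A, A'*b, 1e-8, 500);      % CG applied to the normal equation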
Possibly good choices are the Generalized Minimal Residual algorithm (GMRES) or the BiConjugate
Gradients Stabilized method (BiCGSTAB). A good description is given in [Saad00, §6.5, 7.4.2]. For both
algorithms preconditioning can and should be used. All of the above algorithms are implemented in Octave.
Since the GMRES algorithm uses a sizable number of tools from linear algebra, we try to bring some structure into the presentation.
• First present the basic idea of GMRES: minimize the norm of the residual over the Krylov subspace $K_n$,
$\min_{\vec x \in K_n} \|A(\vec x_0 + \vec x) - \vec b\| = \|\vec r_n\| = \|A\vec x_n + A\vec x_0 - \vec b\| = \|A\vec x_n - \vec b_0\|\,.$
Thus we minimize the residual on the Krylov subspace $K_n$, hereby justifying the name GMRES. This is different from the conjugate gradient algorithm, where the energy norm $\langle A(\vec u_n - \vec u_{exact}),\,(\vec u_n - \vec u_{exact})\rangle$ is minimized.
The matrix $Q_n$ is generated by the Arnoldi algorithm, see page 96. The vector $\vec b_0 = \vec b - A\vec x_0$ is the first residual. Using the matrix $Q_n$ the above can be rephrased (use $Q_n\,\vec y = \vec x$). If we use the initial vector $\vec q_1 = \frac{1}{\|\vec b_0\|}\,\vec b_0$ for the Arnoldi iteration we know that $\langle\vec q_k,\,\vec b_0\rangle = 0$ for all $k \ge 2$, and thus $\vec b_0$ has to be a multiple of $\vec q_1$, and $\vec q_1$ is the first column of $Q_{n+1}$:
$\vec b_0 = \|\vec b_0\|\,\vec q_1 = \|\vec b_0\|\,Q_{n+1}\,\vec e_1\,.$
With the Arnoldi iteration we have $A\,Q_n = Q_{n+1}\,H_n \in M_{N,n}$ and thus (see footnote 20) we have a smaller least square problem to be solved.
• The large least square problem with the matrix A of size $N\times N$ is replaced with an equivalent least square problem with the Hessenberg matrix $H_n$ of size $(n+1)\times n$.
• This least square problem can be solved by the algorithm of your choice, e.g. a QR factorization. Using Givens transformations the QR algorithm is very efficient for Hessenberg matrices ($\approx n^2$ operations), see page 97.
• The algorithm can be stopped if the desired accuracy is achieved, e.g. if $\|A\vec x_n - \vec b_0\|/\|\vec b\|$ is small enough.
• The size of the matrices $Q_n$ and $H_n$ increases, and with them the computational effort.
20: $\|Q_{n+1}\vec z\|^2 = \langle Q_{n+1}\vec z,\,Q_{n+1}\vec z\rangle = \langle\vec z,\,Q_{n+1}^T Q_{n+1}\vec z\rangle = \langle\vec z,\,I_{n+1}\vec z\rangle = \|\vec z\|^2$
• The orthogonality of the columns of $Q_n$ will not be perfectly maintained, caused by arithmetic errors.
To control both of these problems the GMRES algorithm is usually restarted every m steps, with typical values of $m \approx 10$–$20$. This leads to the GMRES(m) algorithm in Figure 2.19, with an outer and inner loop. Picking the optimal value for m is an art.
The above is (at best) a rough description of the basic ideas. For a good implementation the above steps of Arnoldi, Givens rotations and minimization have to be combined, and a proper termination criterion has to be used. Find a good description and pseudo code in [templates]. Find a good starting point for code at www.netlib.org/templates/matlab/ .
Convergence of GMRES
• The norm of the residuals $\vec r_n = A\vec x_n - \vec b$ is decreasing, $\|\vec r_{n+1}\| \le \|\vec r_n\|$. This is obvious since the Krylov subspace is enlarged.
• In principle GMRES will generate the exact solution after N steps, but this result is useless. The goal is to use $n \ll N$ iterations, and arithmetic errors will prevent us from using GMRES as a direct solver.
• One can construct cases where GMRES(m) does stagnate and not converge at all.
• For a fast convergence the eigenvalues of A should be clustered at a point away from the ori-
gin [LiesTich05]. This is different from the conjugate gradient algorithm, where the condition number
determines the rate of convergence.
• The convergence can be improved by preconditioners.
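A usage sketch for the command gmres() with an ILU preconditioner; the nonsymmetric test matrix is an arbitrary, diagonally dominant choice:

Octave
n = 500; e = ones(n,1);
A = spdiags([-e, 2.2*e, -1.2*e], -1:1, n, n);   % nonsymmetric test matrix
b = A*ones(n,1);                                % exact solution: all ones
setup.type = 'nofill';
[L,U] = ilu(A,setup);                           % incomplete LU preconditioner
m = 15;                                         % restart value for GMRES(m)
[x,flag,relres,iter] = gmres(A,b,m,1e-8,30,L,U);
norm(x - 1)                                     % small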
and the solution $\vec x$ is identical to the solution of the least square problem
$\min_{\vec x} \|R\,\vec x - Q^T\,\vec b\|\,.$
Arnoldi iteration
For a given matrix $A \in M_{N,N}$ with $N \ge n$ the Arnoldi algorithm generates a sequence of orthonormal vectors $\vec q_n$ such that the matrix is transformed to upper Hessenberg form. The key properties are
$Q_n = [\vec q_1, \vec q_2, \vec q_3, \ldots, \vec q_n] \in M_{N,n}$
$Q_n^T\,Q_n = I_n$
$A\cdot Q_n = Q_{n+1}\cdot H_n \in M_{N,n}\,.$
The column vectors $\vec q_j$ of the matrix Q have length 1 and are pairwise orthogonal. The matrix $H_n$ is of upper Hessenberg form if all entries below the first lower diagonal are zero, i.e.
$H_n = \begin{bmatrix}
h_{1,1} & h_{1,2} & \cdots & & h_{1,n}\\
h_{2,1} & h_{2,2} & h_{2,3} & \cdots & h_{2,n}\\
0 & h_{3,2} & h_{3,3} & h_{3,4} & h_{3,n}\\
0 & 0 & h_{4,3} & h_{4,4} & \ddots\\
\vdots & & \ddots & \ddots & \vdots\\
0 & \cdots & 0 & h_{n,n-1} & h_{n,n}\\
0 & \cdots & & 0 & h_{n+1,n}
\end{bmatrix} \in M_{n+1,n}\,.$
The algorithm to generate the vectors $\vec q_n$ and the matrix $H_n$ is not overly complicated. The last column of the matrix equation $A\cdot Q_n = Q_{n+1}\cdot H_n \in M_{N,n}$ reads as
$A\,\vec q_n = \sum_{j=1}^{n+1} h_{j,n}\,\vec q_j$
and thus
$h_{k,n} = \langle A\,\vec q_n,\,\vec q_k\rangle \quad\text{for } k = 1, 2, 3, \ldots, n\,.$
• The condition $\|\vec q_{n+1}\| = 1$ then determines the value of $h_{n+1,n}$, i.e. use
$h_{n+1,n} = \|A\,\vec q_n - \sum_{j=1}^{n} h_{j,n}\,\vec q_j\|\,.$
• If the above would lead to $h_{n+1,n} = 0$, then the Krylov subspace $K_n$ is invariant under A and the GMRES algorithm will have produced the optimal solution, see [Saad00].
The algorithm can be implemented, see Figure 2.20.
• The computational effort for one Arnoldi step is given by one matrix multiplication, n scalar products for vectors of length N and the summation of n vectors of length N to generate $\vec q_{n+1}$.
• To generate all of $Q_{n+1}$ we need n matrix multiplications and $n^2\,N$ additional operations.
• For small values of $n \ll N$ this is dominated by the matrix multiplications ($n\,N^2$ operations).
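A possible implementation of the Arnoldi iteration (a sketch only; Figure 2.20 is not reproduced here). The inner loop uses the modified Gram–Schmidt variant, which generates the same coefficients $h_{j,k}$ in exact arithmetic, but behaves better with rounding errors:

Octave
function [Q,H] = arnoldi(A,q1,n)
  % n steps of the Arnoldi iteration: A*Q(:,1:n) = Q*H
  % with orthonormal columns in Q and H of upper Hessenberg form
  N = length(q1);
  Q = zeros(N,n+1); H = zeros(n+1,n);
  Q(:,1) = q1/norm(q1);
  for k = 1:n
    v = A*Q(:,k);              % one matrix multiplication
    for j = 1:k
      H(j,k) = Q(:,j)'*v;      % h_{j,k} = <A q_k, q_j>
      v = v - H(j,k)*Q(:,j);   % remove the directions already covered
    end%for
    H(k+1,k) = norm(v);        % vanishes if the Krylov subspace is invariant
    Q(:,k+1) = v/H(k+1,k);
  end%for
end%function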
Each Givens rotation matrix G equals the identity matrix, except for one $2\times 2$ block of the form $\begin{bmatrix} c & -s\\ s & c\end{bmatrix}$ with $c = \cos(\alpha)$, $s = \sin(\alpha)$, where the angle $\alpha$ is chosen such that the entry $h_{4,3}$ (in the example below) will be rotated to zero. These rotation matrices are very easy to invert, since
$G^{-1} = G^T \quad\text{and}\quad G\,G^T = I\,.$
Now multiply the Hessenberg matrix $H_n$ from the left by $I_{n+1} = G_1\,G_1^T$, then by $G_2\,G_2^T$, then by $G_3\,G_3^T$, ... Repeat until you end up with a right triangular matrix R, resp. $H_n = Q\,R$.
$H_n = \begin{bmatrix}
*&*&*&*&*&*\\ *&*&*&*&*&*\\ 0&*&*&*&*&*\\ 0&0&*&*&*&*\\ 0&0&0&*&*&*\\ 0&0&0&0&*&*\\ 0&0&0&0&0&*
\end{bmatrix}
= G_1\begin{bmatrix}
*&*&*&*&*&*\\ 0&*&*&*&*&*\\ 0&*&*&*&*&*\\ 0&0&*&*&*&*\\ 0&0&0&*&*&*\\ 0&0&0&0&*&*\\ 0&0&0&0&0&*
\end{bmatrix}
= G_1 G_2\begin{bmatrix}
*&*&*&*&*&*\\ 0&*&*&*&*&*\\ 0&0&*&*&*&*\\ 0&0&*&*&*&*\\ 0&0&0&*&*&*\\ 0&0&0&0&*&*\\ 0&0&0&0&0&*
\end{bmatrix} = \cdots$

$\cdots = G_1 G_2 G_3 G_4 G_5 G_6\begin{bmatrix}
*&*&*&*&*&*\\ 0&*&*&*&*&*\\ 0&0&*&*&*&*\\ 0&0&0&*&*&*\\ 0&0&0&0&*&*\\ 0&0&0&0&0&*\\ 0&0&0&0&0&0
\end{bmatrix} = G_1 G_2 G_3 G_4 G_5 G_6\,R = Q\,R$

Each rotation moves one more subdiagonal entry to zero.
This leads to
$H_n = Q\,R = G_1\,G_2\,G_3\cdots G_n\,R$
with a right triangular matrix $R \in M_{(n+1)\times n}$ and $Q \in M_{(n+1)\times(n+1)}$ with $Q\cdot Q^T = I_{n+1}$. Now solve the minimization problem
$\min_{\vec y}\|H_n\,\vec y - \|\vec b_0\|\,\vec e_1\| = \min_{\vec y}\|Q\,R\,\vec y - \|\vec b_0\|\,\vec e_1\| = \min_{\vec y}\|R\,\vec y - \|\vec b_0\|\,Q^T\vec e_1\|\,.$
Since the vector $\vec y$ has no influence on the last component of the vector $R\,\vec y$, minimize by using only the first n rows of R and Q. This leads to a solution; for more details see Section 3.5.1.
An implementation should take advantage of a few facts.
• Since
$Q = G_1\,G_2\,G_3\cdots G_n \implies Q^T = G_n^T\cdots G_3^T\,G_2^T\,G_1^T\,,$
the vector $Q^T\vec p$ can be computed efficiently while the Givens rotations are determined.
• The upper Hessenberg matrix $H_n$ is transformed to the right triangular matrix R row by row. Overwrite the top rows of $H_n$ by the rows of R to save some memory.
• If the original matrix is not of upper Hessenberg form, then Givens rotations are not an efficient choice for a QR factorization. A better choice is to use Householder reflections.
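A sketch of the QR factorization of an upper Hessenberg matrix with Givens rotations; the built-in command planerot() generates the required 2×2 rotations:

Octave
function [Q,R] = hessQR(H)
  % QR factorization of an (n+1) x n upper Hessenberg matrix H, H = Q*R
  [n1,n] = size(H);
  R = H; Q = eye(n1);
  for k = 1:n
    G = planerot(R(k:k+1,k));        % 2x2 rotation zeroing the entry R(k+1,k)
    R(k:k+1,k:n) = G*R(k:k+1,k:n);   % apply the rotation to rows k and k+1
    R(k+1,k) = 0;                    % enforce an exact zero
    Q(:,k:k+1) = Q(:,k:k+1)*G';     % accumulate the orthogonal factor Q
  end%for
end%function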
[Figure: overview of solvers for systems of linear equations, from direct methods (LR with pivoting) to iterative methods (GMRES, BiCGSTAB) for more general, nonsymmetric matrices]
The FEM (Finite Element Method) software COMSOL has a rather large choice of algorithms to solve the resulting systems of linear equations, and some of them are capable of using multiple cores. Amongst the possible model problems we picked one: on a cylinder with height 10 and radius 1 a horizontal force was applied at the top and the bottom was fixed. A FEM system was set up, resulting in a system of 800'000 linear equations. Thus we have a sizable 3D problem. Then different algorithms were used to solve the system. The PC used is based on an I7-920 CPU with 4 cores and has 12 GB of RAM. Thus we have a system as presented in Section 2.2.3. Find the timing results in Table 2.19. This result is by no means representative and its main purpose is to show that you have to choose a good algorithm. Your conclusion will depend on the type of problems at hand, their size, and the software and hardware to be used.

Table 2.19: Benchmark of different algorithms for linear systems, used with COMSOL Multiphysics
• SVD Singular Value Decomposition: has many applications, see Section 3.2.6 starting on page 149.
• QR factorization: this factorization has many applications, including linear regression. Find a short
presentation of QR used for linear regression in Section 3.5.1, starting on page 222.
filename           function
speed              subdirectory with C code to determine the FLOPS for a CPU
AnnGenerate.m      code to generate the model matrix Ann
LRtest.m           code for the LR factorization
cholesky.m         code for the Cholesky factorization of a matrix
choleskySolver.m   code to solve a linear system with Cholesky
cholInc.m          code for the incomplete Cholesky factorization
Bibliography
[Axel94] O. Axelsson. Iterative Solution Methods. Cambridge University Press, 1994.
[AxelBark84] O. Axelsson and V. A. Barker. Finite Element Solution of Boundary Value Problems. Academic Press, 1984.
[TopTen] B. A. Cypra. The best of the 20th century: Editors name top 10 algorithms. SIAM News, 2000.
[www:LinAlgFree] J. Dongarra. Freely available software for linear algebra on the web.
http://www.netlib.org/utk/people/JackDongarra/la-sw.html.
[DowdSeve98] K. Dowd and C. Severance. High Performance Computing. O’Reilly, 2nd edition, 1998.
[Gold91] D. Goldberg. What every computer scientist should know about floating-point arithmetic. ACM
Computing Surveys, 23(1), March 1991.
[GoluVanLoan96] G. Golub and C. Van Loan. Matrix Computations. Johns Hopkins University Press, third
edition, 1996.
[GoluVanLoan13] G. Golub and C. Van Loan. Matrix Computations. Johns Hopkins University Press,
fourth edition, 2013.
[Hack94] W. Hackbusch. Iterative Solution of Large Sparse Systems of Equations, volume 95 of Applied
Mathematical Sciences. Springer, first edition, 1994.
[Hack16] W. Hackbusch. Iterative Solution of Large Sparse Systems of Equations, volume 95 of Applied
Mathematical Sciences. Springer, second edition, 2016.
[HeroArnd01] H. Herold and J. Arndt. C-Programmierung unter Linux. SuSE Press, 2001.
[Intel90] Intel Corporation. i486 Microprocessor Programmers Reference Manual. McGraw-Hill, 1990.
[LascTheo87] P. Lascaux and R. Théodor. Analyse numérique matricielle appliquée à l'art de l'ingénieur, Tome 2. Masson, Paris, 1987.
[LiesTich05] J. Liesen and P. Tichý. Convergence analysis of Krylov subspace methods. GAMM Mitt. Ges.
Angew. Math. Mech., 27(2):153–173 (2005), 2004.
[Saad00] Y. Saad. Iterative Methods for Sparse Linear Systems. PWS, second edition, 2000. available on
the internet.
[Shew94] J. R. Shewchuk. An introduction to the conjugate gradient method without the agonizing pain.
Technical report, Carnegie Mellon University, 1994.
[VarFEM] A. Stahel. Calculus of Variations and Finite Elements. Lecture Notes used at HTA Biel, 2000.
[YounGreg72] D. M. Young and R. T. Gregory. A Survey of Numerical Analysis, Volume 1. Dover Publica-
tions, New York, 1972.
Chapter 3
Numerical Tools
• show some aspects of SVD (singular value decomposition) and PCA (principal component analysis).
• apply linear and nonlinear regression. This includes fitting of curves to given data.
• Nonlinear equations
– you should be able to apply the methods of bisection and false position to solve one nonlinear equation.
– you should be able to apply Newton’s method reliably to solve one nonlinear equation.
– you should be familiar with possible problems when using Newton’s method.
– you should be able to apply Newton’s method reliably to solve systems of nonlinear equations.
– you should understand the importance of eigenvalues and eigenvectors for linear mappings.
– you should be able to use eigenvalues to describe the behavior of solutions of systems of linear
ODEs.
– you should understand the connection between eigenvalues and singular value decomposition.
– you should be able to use the covariance matrix and PCA to describe the distribution of data.
• Numerical integration
– you should be able to integrate functions given by data points using the trapezoidal rule.
– you should understand Simpson’s algorithm and Gauss integration.
– you should understand the basic idea of an adaptive integration of a given function.
– You should be able to use Octave/MATLAB to evaluate integrals reliably.
– you should be able to use MATLAB/Octave to solve ODEs numerically, with reliable results.
– you should understand the basic idea used for the numerical solvers and the importance of sta-
bility for the algorithms.
– you should be able to use the matrix notation to set up and solve linear regression problems, including the confidence intervals for the optimal parameters.
• the idea and computations for derivatives and linear approximations for a function of one variable.
• the idea and computations for derivatives and linear approximations for a function of multiple vari-
ables.
If equations of the form
$f(x) = 0 \qquad\text{or}\qquad \vec F(\vec x) = \vec 0$
have to be solved for nonlinear functions f or $\vec F$ and algebraic manipulations fail to give satisfactory results, then one has to resort to approximation methods. This will very often involve iterative methods: to a known value $x_n$ apply some carefully planned operations to obtain a new value $x_{n+1}$. As the same operation is applied repeatedly, hope for the sequence of values to converge, preferably to a solution of the original problem.
As an example pick an arbitrary value x0 , type it into your pocket calculator and then keep pushing the
cos button. After a few steps you will realize that the displayed numbers converge to x ≈ 0.73909. This
number x solves the equation cos(x) = x.
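The same experiment can be performed in Octave with a few lines of code:

Octave
x = 0.7;          % an arbitrary starting value
for k = 1:50
  x = cos(x);     % keep pushing the cos button
end%for
disp(x)           % displays 0.73909, the solution of cos(x) = x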
For all iterative methods the same questions have to be answered before launching a lengthy calculation:
• Will the generated sequence converge, and converge to the desired solution?
This section will present some of the basic iterative methods and the corresponding results, such that you
should be able to answer the above questions.
3–1 Definition : Let $x_n$ be a sequence converging to $x^\star$. This convergence is said to be of order p if there is a constant c such that
$\|x_{n+1} - x^\star\| \le c\,\|x_n - x^\star\|^p\,.$
A method which produces such a sequence is said to have an order of convergence p.
The expression $-\log\|x_n - x^\star\|$ corresponds to the number of correct (decimal) digits in the approximation $x_n$ of the exact value $x^\star$. Thus the order of convergence is an indication of how quickly the approximation sequence will converge to the exact solution. We examine two important cases:
• Linear convergence, convergence of order 1:
$\|x_{n+1} - x^\star\| \le c\,\|x_n - x^\star\| \implies \log\|x_{n+1} - x^\star\| \le \log\|x_n - x^\star\| + \log(c)$
Thus the number of accurate digits increases by a fixed number ($|\log c|$) for each step, as long as c < 1, i.e. $\log(c) < 0$. In real applications we do not have the exact solution to check for convergence, but we may observe the difference between subsequent values. Thus expect the number of stable digits to increase by a fixed amount for each step.
• Quadratic convergence, convergence of order 2:
$\|x_n - x^\star\| \le c\,\|x_{n-1} - x^\star\|^2 \implies \log(\|x_n - x^\star\|) \le 2\,\log\|x_{n-1} - x^\star\| + \log(c)$
Thus the number of accurate digits is doubled at each step, ignoring the expression $\log(c)$. Once we have enough digits this simplification is justified. When observing the number of stable digits we find similarly that the number of stable digits should double at each step, at least once we are close to the actual solution1.
The effects of different orders of convergence is illustrated in Example 3–3 (see page 109), leading to
Table 3.2 on page 110.
• Terminate if the absolute change in x is small enough, i.e. $\|\vec x_{n+1} - \vec x_n\|$ is small.
• Terminate if the relative change in x is small enough, i.e. use one of the expressions
$\frac{\|\vec x_{n+1} - \vec x_n\|}{\|\vec x_n\|}\,,\qquad \frac{\|\vec x_{n+1} - \vec x_n\|}{\|\vec x_{n+1}\|} \qquad\text{or}\qquad \frac{\|\vec x_{n+1} - \vec x_n\|}{\|\vec x_n\| + \|\vec x_{n+1}\|}$
as termination criterion.
• In black box solvers one should use a combination of absolute tolerance A and relative tolerance R, e.g. stop if
$\|\vec x_{n+1} - \vec x_n\| \le A + R\,\|\vec x_n\|\,.$
• Terminate if the absolute error in $\vec y = \vec F(\vec x)$ is small enough, i.e. $\|\vec F(\vec x_n)\|$ is small.
3.1.2 Bisection, Regula Falsi and Secant Method to Solve one Equation
In this section find the basic idea for three algorithms to solve a single equation of the form $y = f(x) = 0$. Throughout this section assume that the function f is at least continuous, or as often differentiable as necessary. Also assume that a solution exists, i.e. $f(x^\star) = 0$, and it is not a double zero, i.e. $f'(x^\star) \neq 0$. Find a brief description and an illustrative graphic for each of the algorithms. Since coding of these algorithms is not too difficult, no explicit code is provided.
Bisection
This basic algorithm will find zeros of a continuous function, once two values of x with opposite signs for y are known. The solution $x^\star$ of $f(x^\star) = 0$ will be bracketed by $x_n$ and $x_{n+1}$, i.e. $x^\star$ is between $x_n$ and $x_{n+1}$. Find the description of the algorithm below and an illustration in Figure 3.1.
• Start with two values x0 and x1 such that y0 = f (x0 ) and y1 = f (x1 ) have opposite signs, i.e.
f (x0 ) · f (x1 ) < 0. This leads to an initial interval.
– Compute the function y = f (x) at the midpoint xn+1 of the current interval and examine the
sign of y .
– Retain the mid point and one of the endpoints, such that the y-values have opposite signs. This
is the new interval to be examined in the next iteration.
[Figure 3.1: the bisection algorithm, with the solution bracketed by the points $x_0, x_1, x_2, \ldots$]
This algorithm will always converge, since the function f is assumed to be continuous and the solution is bracketed. Obviously the maximal error is halved at each step of the iteration and we have an elementary estimate for the error
$|x_{n+1} - x^\star| \le \frac{1}{2^n}\,|x_1 - x_0|\,.$
Thus we find linear convergence, i.e. the number of accurate decimal digits is increased by $\frac{\ln 2}{\ln 10} \approx 0.3$ by each step.
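Although no code for these algorithms is provided in these notes, a minimal bisection sketch may help to fix the ideas; the interface (function handle, bracketing values, tolerance) is chosen freely:

Octave
function x = bisection(f,x0,x1,tol)
  % solve f(x) = 0 by bisection, assuming f(x0)*f(x1) < 0
  y0 = f(x0);
  while abs(x1-x0) > tol
    xm = (x0+x1)/2; ym = f(xm);
    if y0*ym <= 0      % the solution is between x0 and xm
      x1 = xm;
    else               % the solution is between xm and x1
      x0 = xm; y0 = ym;
    end%if
  end%while
  x = (x0+x1)/2;
end%function

The call bisection(@(x)x^2-2,0,2,1e-10) returns an approximation of $\sqrt 2$.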
[Figure 3.2: the false position (Regula Falsi) algorithm]
Secant method
The two previous algorithms are guaranteed to give a solution, since the solution was bracketed. We can modify the false position method slightly and always retain the last two values, independent of the sign of the function. Find the illustration in Figure 3.3.
– Compute the zero of the secant connecting the two given points:
$x_{n+1} = x_n - f(x_n)\,\frac{x_n - x_{n-1}}{f(x_n) - f(x_{n-1})}\,.$
One can show that the order of convergence of the secant method equals the golden ratio $\frac{1+\sqrt 5}{2} \approx 1.618$. This implies that the number of correct digits is multiplied by 1.6, as soon as we are close enough. This is a huge advantage over the bisection and false position methods. As a clear disadvantage we have no guaranteed convergence, even if the solution was originally bracketed. The secant might intersect the horizontal axis at a far away point and thus we might end up with a different solution than expected, or none at all. One can show that the secant method will converge to a solution $x^\star$ if the starting values are close enough to $x^\star$ and $f'(x^\star) \neq 0$.
[Figure 3.3: the secant method]

Newton's method
Replace the function f by its tangent at $x_0$ and use the zero of the tangent as new approximation $x_1$:
$f(x_0 + \Delta x) = 0$
$f(x_0 + \Delta x) \approx f(x_0) + f'(x_0)\cdot\Delta x$
$f(x_0) + f'(x_0)\cdot\Delta x = 0 \implies \Delta x = -\frac{f(x_0)}{f'(x_0)}$
$x_1 = x_0 + \Delta x = x_0 - \frac{f(x_0)}{f'(x_0)}$
The above computations lead to the algorithm of Newton; some authors call it Newton–Raphson.
– Compute the value $f(x_n)$ and the derivative $f'(x_n)$ at the point $x_n$. Apply Newton's formula
$x_{n+1} = x_n - \frac{f(x_n)}{f'(x_n)}\,.$
[Figure 3.4: Newton's method]
Close to a solution one observes quadratic convergence, i.e.
$|x_{n+1} - x^\star| \approx c\,|x_n - x^\star|^2\,.$
This implies that the number of correct digits is multiplied by 2, as soon as we are close enough. This
is a huge advantage over the bisection and false position methods. As a clear disadvantage we have no
guaranteed convergence. The tangent might intersect the horizontal axis at a far away point and thus we
might end up with a different solution than expected, or none at all. One can show that Newton’s method
will converge to a solution x? if the starting values are close enough to x? and f 0 (x? ) 6= 0 .
3–2 Example : To compute the value of $x = \sqrt 2$ we may try to solve the equation $x^2 - 2 = 0$. For this example we find
$x_{n+1} = x_n - \frac{f(x_n)}{f'(x_n)} = x_n - \frac{x_n^2 - 2}{2\,x_n} = \frac{2 + x_n^2}{2\,x_n}\,.$
With a starting value of $x_0 = 1$ we find
$x_1 = \frac{2+1}{2} = \frac{3}{2}\,,\qquad x_2 = \frac{2 + 9/4}{3} = \frac{17}{12} \approx 1.417 \qquad\text{and}\qquad x_3 \approx 1.414216\,.$
Thus we are very close to the actual solution with very few iteration steps. ♦
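A few lines of Octave reproduce these numbers:

Octave
f = @(x) x^2-2; df = @(x) 2*x;
x = 1;                          % starting value x0 = 1
for k = 1:5
  x = x - f(x)/df(x);           % one Newton step, i.e. x = (2+x^2)/(2*x)
  fprintf('x%d = %.10f\n', k, x)
end%for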
The major problem of Newton's method is based on the fact that the initial guess has to be close enough to the exact solution. If this is not the case, then we might run into severe problems. Consider the three graphs in Figure 3.5.
• The graph on the left does not have a zero (or is it a double zero?) and thus Newton’s method will
happily iterate along, never getting close to a solution.
• The middle graph has one solution, but if we start a Newton iteration to the right of the maximum,
then the iteration will move further and further to the right. The clearly existing, unique zero will not
be found.
• In the graph on the right it is easy to find starting values x0 such that Newton’s method will converge,
but not to the solution closest to x0 .
Figure 3.5: Three functions that might cause problems for Newton’s methods
• Newton’s method is an excellent tool to compute a zero of a function accurately, and quickly.
• Newton’s method can fail miserably when no good initial guess is available.
Comparison
The above four algorithms all have their weak and strong points, thus a comparison is asked for:
• Bisection and False Position are guaranteed to converge to a solution, as long as the starting points $x_0$ and $x_1$ lead to different signs for $f(x_0)$ and $f(x_1)$.
• The secant method converges faster than bisection or Regula Falsi, but the performance of Newton is hard to beat.
• The secant and Newton's method might not give the desired/expected solution or might even fail completely.
• Only Newton's method requires values of the derivative.
Find the results in Table 3.1.
3–3 Example : Performance comparison of solvers (footnote 2)
Compute $\sqrt 2$ as solution of the equation $f(x) = x^2 - 2 = 0$. We use implementations in Octave of the above four algorithms to solve this elementary equation and keep track of the following quantities:
• The estimate of the solution at each step, $x_n$.
• The number of correct decimal digits: corr.
• The number of non-changing digits from the last iteration step: fix.
The results are shown in Table 3.2. There are some observations to be made:
• The number of accurate digits for the bisection method increases very slowly, but exactly in the predicted way. For 10 iterations the error has to be divided by $2^{10} \approx 1000$, thus we gain 3 digits only.
• The Regula Falsi method leads to linear convergence, the number of correct digits increases by a fixed number (≈ 0.7) for each step. This is clearly superior to the method of bisection.
• The secant method leads to superlinear convergence, the number of correct digits increases by a fixed factor (≈ 1.6) for each step. After a few steps (8) we reach machine accuracy and there is no change in the result any more.
• The Newton method converges very quickly (5 steps) up to machine precision to the exact solution.
The number of correct digits is doubled at each step. This is caused by the quadratic convergence of
the algorithm.
• The number of unchanged digits at each step (fix) is a safe estimate of the number of correct digits
(corr). This is an important observation, since for real world problems the only available information
is the value of fix. Computing corr requires the exact solution and thus there would be no need for
a solution algorithm.
♦
• The ideas of the Bisection method, the False Position method and the Secant method can not be carried
over to the situation of multiple equations.
• The method of Newton can be applied to systems of equations. This will be done in the next section.
It has to be pointed out that a number of problems might occur:
– Newton requires a good starting point to work reliably.
– We also need the derivatives of the functions. For a system of n equations this amounts to n2 par-
tial derivatives to be known. The computational (and programming) cost might be prohibitive.
– For each step of Newton’s method a system of n linear equations has to be solved and for very
large n this might be difficult.
• There exist derivative free algorithms to solve systems of equations, e.g. Broyden’s method. As a
possible starting point consider [Pres92].
• If the problem is a minimization problem, i.e. you are searching for $\vec x \in \mathbb{R}^n$ such that the function $f : \mathbb{R}^n \to \mathbb{R}$ attains its minimum at $\vec x$, this leads to a system of n equations
$\operatorname{grad} f(\vec x) = \vec 0\,.$
Since $-\operatorname{grad} f$ is pointing in the direction of steepest descent one has good knowledge where the minimum might be. In this situation reliable and efficient algorithms are known, e.g. the Levenberg–Marquardt algorithm.
In these notes we concentrate on Newton's method and its applications. Successive substitution and partial substitution are mentioned briefly.
With the translation F (x) = G(x) + x it is obvious that a zero of G is a fixed point (F (x) = x) of F .
Thus we may concentrate our efforts on efficient algorithms to locate fixed points of iterations.
3–4 Theorem (Banach's fixed point theorem) : Let M be a closed set and $F : M \to M$ a contraction with contraction constant $c < 1$, as illustrated in Figure 3.6.
Then there exists exactly one fixed point $z \in M$ of the mapping F, i.e. one solution of $F(z) = z$. For any initial point $x_0 \in M$ the sequence formed by $x_{n+1} = F(x_n)$ will converge to z and we have the estimate
$\|x_{n+1} - z\| \le c\,\|x_n - z\|\,,$
i.e. the order of convergence is at least 1. By applying the above estimate repeatedly we find the a priori estimate
$\|x_n - z\| \le c^n\,\|x_0 - z\|\,,$
i.e. we can estimate the number of necessary iterations before starting the algorithm. An a posteriori estimate is given by
$\|x_{n+1} - z\| \le \frac{c}{1-c}\,\|x_n - x_{n+1}\|\,,$
i.e. we can estimate the error during the computations by comparing subsequent values.
The proof below is given for the sake of completeness only. It is possible to work through the remainder of these notes without working through the proof, but it is advisable to understand the illustration in Figure 3.6 and the consequences of the estimates in the above theorem.
Proof : For an arbitrary initial point $x_0 \in M$ we examine the sequence $x_n = F^n(x_0)$. The contraction property implies that this is a Cauchy sequence, and since M is closed,
$x_n = F^n(x_0) \longrightarrow z \in M \quad\text{as } n \to \infty\,,$
and thus
$F(z) = \lim x_{n+1} = \lim x_n = z\,.$
If $\bar z$ is also a fixed point we use the contraction property
$\|z - \bar z\| = \|F(z) - F(\bar z)\| \le c\,\|z - \bar z\|$
to conclude $\bar z = z$. Thus we have a unique fixed point. To verify the linear convergence we use $F(z) = z$ and the contraction property to conclude
$\|x_{n+1} - z\| = \|F(x_n) - F(z)\| \le c\,\|x_n - z\|\,.$
[Figure 3.6: the contraction property: the function F maps the set M to M and it is a contraction, i.e. there is a constant c < 1 such that $\|F(x) - F(y)\| \le c\,\|x - y\|$ for all $x, y \in M$]
3–5 Example : The function $f(x) = \cos(x)$ on the interval $M = [0, 1]$ satisfies the assumptions of Banach's fixed point theorem. Obviously $0 \le \cos x \le 1$ for $0 \le x \le 1$ and thus f maps M into M. The contraction property is a consequence of an integral estimate:
$\cos(x) - \cos(y) = -\int_y^x \sin(t)\,dt \implies |\cos(x) - \cos(y)| = \Bigl|\int_y^x \sin(t)\,dt\Bigr| \le \sin(1)\,|x - y|\,.$
The contraction constant is given by $c = \sin(1) < 1$. As a consequence we find that the equation $\cos(x) = x$ has exactly one solution in M. We can obtain this solution by choosing an arbitrary initial value $x_0 \in M$ and then applying the iteration $x_{n+1} = \cos(x_n)$. This is illustrated in Figure 3.7. ♦
[Figure 3.7: the fixed point iteration $x_{n+1} = \cos(x_n)$ on the interval [0, 1]]
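The a posteriori estimate of the fixed point theorem can be observed numerically for this example; a short sketch with the contraction constant c = sin(1):

Octave
c = sin(1); x = 1;
for n = 1:10
  xNew = cos(x);
  bound = c/(1-c)*abs(x-xNew);    % a posteriori error estimate
  fprintf('n = %2i  x = %10.8f  bound = %8.2e\n', n, xNew, bound)
  x = xNew;
end%for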
– Use the known value of $x_n$ and solve the equation $F(x_n, x_{n+1}) = x_{n+1}$ for the unknown $x_{n+1}$.
As a trivial example try to solve the nonlinear equation $3 + 3x = e^x$. Given $x_n$ you can solve $3 + 3\,x_{n+1} = e^{x_n}$ for $x_{n+1}$ by
$x_{n+1} = \frac{1}{3}\,(e^{x_n} - 3)\,.$
A simple graph (Figure 3.8) will convince you that the equation has two solutions, one close to x ≈ −1 and
the other close to x ≈ 2.5. Choosing x0 = −1 will converge to the solution, but x0 = 2.5 will not converge
at all. This shows that even for simple examples the method can fail.
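A short verification in Octave; starting at $x_0 = -1$ the iteration settles at the solution, while starting at $x_0 = 2.5$ the values grow beyond all bounds:

Octave
g = @(x) (exp(x)-3)/3;
x = -1;                 % try x = 2.5 to observe the divergence
for k = 1:30
  x = g(x);
end%for
disp(x)                 % approximately -0.8588, solving 3 + 3x = exp(x)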
[Figure 3.8: graphs of $y = 3 + 3x$ and $y = e^x$, showing the two intersection points]
Another example is given by the stretching of a beam in equation (1.14) on page 17. To solve the nonlinear boundary value problem
$-\frac{d}{dx}\left(EA_0(x)\,\Bigl(1 - \nu\,\frac{d\,u(x)}{dx}\Bigr)^2\,\frac{d\,u(x)}{dx}\right) = f(x) \quad\text{for } 0 < x < L$
use the known approximation $u_n(x)$ to define the coefficient
$a(x) = E\,A_0(x)\,\Bigl(1 - \nu\,\frac{d\,u_n(x)}{dx}\Bigr)^2$
and then solve the linear boundary value problem
$-\frac{d}{dx}\left(a(x)\,\frac{d\,u_{n+1}(x)}{dx}\right) = f(x)$
for the next approximation $u_{n+1}(x)$. Using the finite difference method this will be used in Example 4–9 on page 297.
To simplify the problem replace the nonlinear functions f and g by linear approximations about the initial point $(x_0, y_0)$:
$f(x_0 + \Delta x,\, y_0 + \Delta y) \approx f(x_0, y_0) + \frac{\partial f}{\partial x}\,\Delta x + \frac{\partial f}{\partial y}\,\Delta y\,,$
$g(x_0 + \Delta x,\, y_0 + \Delta y) \approx g(x_0, y_0) + \frac{\partial g}{\partial x}\,\Delta x + \frac{\partial g}{\partial y}\,\Delta y\,.$
Thus replace the original equations by a set of approximate linear equations. This leads to equations for the unknowns $\Delta x$ and $\Delta y$:
$f(x_0, y_0) + \frac{\partial f}{\partial x}\,\Delta x + \frac{\partial f}{\partial y}\,\Delta y = 0\,,$
$g(x_0, y_0) + \frac{\partial g}{\partial x}\,\Delta x + \frac{\partial g}{\partial y}\,\Delta y = 0\,.$
Often a shortcut notation is used,
$f_x = \frac{\partial f}{\partial x}\,,\qquad f_y = \frac{\partial f}{\partial y}\,,$
and thus the approximate linear equations can be written in the form
$\begin{pmatrix} f(x_0, y_0)\\ g(x_0, y_0)\end{pmatrix} + A\cdot\begin{pmatrix}\Delta x\\ \Delta y\end{pmatrix} = \begin{pmatrix}0\\ 0\end{pmatrix}\,,\qquad\text{where}\quad A = \begin{bmatrix} f_x & f_y\\ g_x & g_y\end{bmatrix}\,.$
Just as in the situation of a single equation find a (hopefully) better approximation of the true zero:
$\begin{pmatrix} x_1\\ y_1\end{pmatrix} = \begin{pmatrix} x_0\\ y_0\end{pmatrix} + \begin{pmatrix}\Delta x\\ \Delta y\end{pmatrix}$
3: If the determinant is different from zero use the formula
$A^{-1} = \begin{bmatrix} f_x & f_y\\ g_x & g_y\end{bmatrix}^{-1} = \frac{1}{f_x\,g_y - g_x\,f_y}\begin{bmatrix} g_y & -f_y\\ -g_x & f_x\end{bmatrix}$
This leads to an iterative formula for Newton's method applied to a system of two equations:
$\begin{pmatrix} x_{n+1}\\ y_{n+1}\end{pmatrix} = \begin{pmatrix} x_n\\ y_n\end{pmatrix} + \begin{pmatrix}\Delta x\\ \Delta y\end{pmatrix} = \begin{pmatrix} x_n\\ y_n\end{pmatrix} - A^{-1}\cdot\begin{pmatrix} f(x_n, y_n)\\ g(x_n, y_n)\end{pmatrix}\,,$
where
$A = \begin{bmatrix} f_x(x_n, y_n) & f_y(x_n, y_n)\\ g_x(x_n, y_n) & g_y(x_n, y_n)\end{bmatrix}\,.$
This iteration formula is, not surprisingly, very similar to the formula for a single equation,
$x_{n+1} = x_n - \frac{1}{f'(x_n)}\,f(x_n)\,.$
For $(x_0, y_0) = (1, 1)$ we have the values $f_1(x_0, y_0) = 4$ and $f_2(x_0, y_0) = 4$ and we find a system of linear equations for $x_1$ and $y_1$:
$\begin{pmatrix}4\\ 4\end{pmatrix} + \begin{bmatrix} 2 & 8\\ 16 & 2\end{bmatrix}\begin{pmatrix} x_1 - 1\\ y_1 - 1\end{pmatrix} = \begin{pmatrix}0\\ 0\end{pmatrix}$
and thus
$\begin{pmatrix}x_1\\ y_1\end{pmatrix} = \begin{pmatrix}x_0\\ y_0\end{pmatrix} + \begin{pmatrix}\Delta x\\ \Delta y\end{pmatrix} = \begin{pmatrix}1\\ 1\end{pmatrix} - \begin{bmatrix}2 & 8\\ 16 & 2\end{bmatrix}^{-1}\begin{pmatrix}4\\ 4\end{pmatrix} \approx \begin{pmatrix}0.8064516\\ 0.5483870\end{pmatrix}\,.$
This is the result of the first Newton step. A visualization of this step can be generated with the code in Newton2D.m.
For the next step we use $f_1(x_1, y_1) \approx 0.853$ and $f_2(x_1, y_1) \approx 0.993$ and find the system for $x_2$ and $y_2$:
$\begin{pmatrix}0.853\\ 0.993\end{pmatrix} + \begin{bmatrix}1.6129 & 4.3871\\ 8.3918 & 1.0968\end{bmatrix}\begin{pmatrix} x_2 - 0.806\\ y_2 - 0.548\end{pmatrix} = \begin{pmatrix}0\\ 0\end{pmatrix}\,.$
x0 = 1           y0 = 1
x1 = 0.8064516   y1 = 0.5483870
x2 = 0.7088993   y2 = 0.3897547
x3 = 0.6837299   y3 = 0.3658653
x4 = 0.6821996   y4 = 0.3655839
...
x7 = 0.6821941   y7 = 0.3655855
The above algorithm can be implemented in Octave. Below find a code segment to be stored in a file NewtonSolve.m. The function NewtonSolve() takes the function f, the function Df for the partial derivatives and the initial value $\vec x_0$ as arguments and computes the solution of the system $f(\vec x) = \vec 0$. The default accuracy of $10^{-10}$ can be modified with a fourth argument. The code applies at most 20 iterations and will return the approximate solution and the number of iterations required.
NewtonSolve.m
function [x,counter] = NewtonSolve(f,Df,x0,atol)
  if nargin<4 atol = 1e-10; end%if
  maxit = 20; counter = 0; xOld = x0;
  x = xOld - feval(Df,xOld)\feval(f,xOld);
  while ((counter<=maxit) && (norm(xOld-x)>atol))
    xOld = x;
    x = xOld - feval(Df,xOld)\feval(f,xOld);
    counter = counter+1;
  end%while
end%function
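A usage sketch for the 2D example above. The two functions are not spelled out on this page; the choices $f_1(x,y) = x^2 + 4y^2 - 1$ and $f_2(x,y) = 4x^4 + y^2 - 1$ reproduce all function values, Jacobians and iterates printed above, so they are used here as a reconstructed assumption:

Octave
f  = @(x) [x(1)^2 + 4*x(2)^2 - 1; 4*x(1)^4 + x(2)^2 - 1];
Df = @(x) [2*x(1), 8*x(2); 16*x(1)^3, 2*x(2)];   % matrix of partial derivatives
[x,counter] = NewtonSolve(f,Df,[1;1])
% --> x = [0.6821941; 0.3655855], matching the values x7 and y7 above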
The linear approximation is represented with the help of the matrix DF of partial derivatives:
$DF = \begin{bmatrix}
\frac{\partial f_1}{\partial x_1} & \frac{\partial f_1}{\partial x_2} & \frac{\partial f_1}{\partial x_3} & \cdots & \frac{\partial f_1}{\partial x_n}\\
\frac{\partial f_2}{\partial x_1} & \frac{\partial f_2}{\partial x_2} & \frac{\partial f_2}{\partial x_3} & \cdots & \frac{\partial f_2}{\partial x_n}\\
\frac{\partial f_3}{\partial x_1} & \frac{\partial f_3}{\partial x_2} & \frac{\partial f_3}{\partial x_3} & \cdots & \frac{\partial f_3}{\partial x_n}\\
\vdots & \vdots & & & \vdots\\
\frac{\partial f_n}{\partial x_1} & \frac{\partial f_n}{\partial x_2} & \frac{\partial f_n}{\partial x_3} & \cdots & \frac{\partial f_n}{\partial x_n}
\end{bmatrix}$
Newton's method is again based on the idea of replacing the nonlinear system by its linear approximation and then using a good initial guess $\vec x_0 \in \mathbb{R}^n$:
$\vec F(\vec x_0 + \vec{\Delta x}) = \vec 0 \;\longrightarrow\; \vec F(\vec x_0) + DF(\vec x_0)\cdot\vec{\Delta x} = \vec 0 \;\longrightarrow\; \vec x_1 = \vec x_0 + \vec{\Delta x}$
It is important to understand this basic idea when applying the algorithm to a concrete problem. It will
enable the user to give the algorithm a helping hand when necessary. Quite often Newton is not used as
black box algorithm, but tuned to the concrete problem.
3–9 Theorem : Let $\vec F \in C^2(\mathbb{R}^n, \mathbb{R}^n)$ be twice continuously differentiable and for a $\vec x^\star \in \mathbb{R}^n$ we have $\vec F(\vec x^\star) = \vec 0$ and the $n\times n$ matrix $DF(\vec x^\star)$ of partial derivatives is invertible. Then the Newton iteration
$\vec x_{n+1} = \vec x_n - (DF(\vec x_n))^{-1}\cdot\vec F(\vec x_n)$
will converge quadratically to the solution $\vec x^\star$, if only the initial guess $\vec x_0$ is close enough to $\vec x^\star$.
The critical point is again the condition that the initial guess $\vec x_0$ has to be close enough to the solution for the algorithm to converge. Thus the remarks on Newton's method applied to a single equation (Section 3.1.2) remain valid.
The above result is not restricted to the space Rn . Using standard analysis on Banach spaces the corre-
sponding result remains valid.
The damped Newton algorithm is used by the Levenberg–Marquardt algorithm to solve nonlinear
regression problems. At first the parameter α is strictly smaller than 1. As progress is made α approaches 1
to achieve quadratic convergence. This approach is used for nonlinear regression problems in Section 3.5.7
by the command leasqr().
fzero(@(x)x^2-2,[0,2])
-->
ans = 1.4142
A sizable number of options can be used and more outputs are generated, see help fzero.
fzero(@(x)x^2-2,2)
-->
ans = 1.4142
Since Newton's method requires the Jacobian matrix, fsolve() uses a finite difference approach to determine the matrix of partial derivatives. One can also specify the Jacobian by creating a function (footnote 4) with two return arguments. In addition the option Jacobian has to be on.
fsolve('Fun_System',[1;1], optimset('Jacobian','on'))
-->
ans = 0.6822
      0.3656
A sizable number of options can be used and more outputs are generated, see help fsolve.
4: With MATLAB this function has to be in a separate file Fun_System.m.
f = @(x) x.^2 - 1 - cos(x);   % the function examined below
x = 0:0.1:3;
figure(1); plot(x,f(x))
grid on; xlabel('x'); ylabel('f(x) = x^2-1-cos(x)');
[Figure: graph of $f(x) = x^2 - 1 - \cos(x)$ on $0 \le x \le 3$]
• Example 4–9 examines the stretching of a beam by a given force and variable cross section. The method of successive substitutions is used.
• Example 4–11 examines the bending of a beam. Large deformations are allowed. Newton's method will be used.
• Example 4–12 examines a similar problem, using a parameterized version of Newton's method.
3–13 Example : Stretching of a beam by a given force and variable cross section
The differential equation describing the longitudinal deformation of a beam with cross section $A_0(x)$ by a force density $f(x) = 0$ and a force F at the right end at $x = L$ is given by (see Section 1.3)
$-\frac{d}{dx}\left(EA_0\,\Bigl(1 - \nu\,\frac{du}{dx}\Bigr)^2\,\frac{d\,u(x)}{dx}\right) = 0 \quad\text{for } 0 < x < L$
and the boundary conditions $u(0) = 0$ and $EA_0(L)\,(1 - \nu\,u'(L))^2\,u'(L) = F$. If a function $w(x) = u'(x)$ for $0 \le x \le L$ solves equation (1.15) (page 17)
$\nu^2\,w^3(x) - 2\,\nu\,w^2(x) + w(x) = \frac{F}{EA_0(x)}\,,$
• Determine for which domain the equation might be solved, requiring the solution to be realistic.
• Start with a force of F = 0 and increase it step by step. For each force compute the new length of the
beam.
• Plot the force F as a function of the change in length u(L) to confirm Hooke’s law.
The variable to be solved for is z, and x is considered as a parameter. With the help of solutions of the equation $G(z) = 0$ construct solutions of the beam problem. In general the solution z will depend on x. Before launching the computations examine possible failures of the method. Newton's iteration will fail if the derivative vanishes. The derivative of G(z) is given by
$\frac{d}{dz}\,G(z) = 3\,z^2 - 4\,z + 1 = (3\,z - 1)\,(z - 1)\,.$
Since $G'(0) = 1$ we know this derivative to be positive for small $z > 0$. The zeros of the derivative $G'$ are readily determined as
$z = \frac{+4 \pm \sqrt{16 - 12}}{6} = \begin{cases} 1\\[2pt] \frac{1}{3}\end{cases}$
Since the first zero of the derivative $G'$ is at $z = \frac{1}{3}$, expect problems if $z \approx \frac{1}{3}$. To confirm this examine the graph of the auxiliary function h shown in Figure 3.10.
$h(z) = z^3 - 2\,z^2 + z\,,\qquad G(z) = h(z) - \frac{\nu\,F}{EA_0}$
[Figure 3.10: graph of the auxiliary function h(z)]
We will start with the force F = 0 and thus w(x) = 0. Then increase the value of F slowly and w (resp. z) will increase too. We use Newton's method to determine the function w(x), using the initial value $w_0(x) = 0$, to find a solution of the above problem. If the expression
$\frac{\nu\,F}{EA_0(x)}$
is larger than $h(1/3) = 4/27$ there is no smooth solution any more. If the critical limit for F is exceeded find that $z = \nu\,u'(x)$ would have to be larger than 1. This would lead to a negative radius with cross sectional area $A_0\,(1 - \nu\,u'(x))^2$, which is obviously mechanical nonsense. This is confirmed by Figure 3.10. The beam will simply break if the critical limit is exceeded. The Octave code will happily generate numbers and graphs for larger values of F: GIGO.
• Choose a starting function w(x) = 0, resp. vector $\vec w = \vec 0$.
• Choose the forces for which the solution is to be computed. The force should be increased slowly from 0 to the maximal possible value.
function testBeam()
  clear EA
  nu = 0.3; L = 3;
  function y = G(z,T)
    nu = 0.3;
    y = nu^2*z.^3 - 2*nu*z.^2 + z - T;
  end%function
  function y = dG(z)
    y = 3*nu^2*z.^2 - 4*nu*z + 1;
  end%function

  N = 500;
  h = L/(N+1);                 % stepsize
  x = (0:h:L)';
  clf; relErrorTol = 1e-10;    % choose your relative error tolerance
  z = zeros(size(x)); zNew = z;
  u = cumtrapz(x,zNew/nu);     % integrate to obtain the displacement u
  figure(1);
  plot(x, u);
  grid on; xlabel('position x'); ylabel('displacement u');

  % ... the loop over the list of forces FList is not reproduced here:
  %     for each force the Newton iteration zNew = z - G(z,T)./dG(z) is
  %     applied and the maximal displacements maxAmp are collected ...

  figure(2);
  plot(maxAmp,FList);
  grid on; xlabel('maximal displacement u(L)'); ylabel('force F');
end%function
[Figure 3.11: (a) displacement u(x) along the beam; (b) force F as a function of the maximal displacement u(L)]
The graph in Figure 3.11(b) shows the force F as function of the displacement u(L) at the right endpoint. If the lateral contraction of the beam were not taken into account (i.e. ν = 0) the result would be a straight line, confirming Hooke's law. The Poisson effect weakens the beam, since the area of the cross sections is reduced and the stress thus increased, i.e. move from the engineering stress to the true stress. ♦
3–14 Example : The nonlinear boundary value problem examined here has the exact solution
$u(x) = \ln\frac{c^2}{1 + \cos(c\,x)}\,,$
where the value of the constant c is determined as solution of the equation $c^2 = 1 + \cos c$. In Example 3–10 Newton's method is used to find $c \approx 1.1765019$. Now use Newton's method again to solve the above nonlinear boundary value problem.
With an approximate solution $u_n(x)$ (start with $u_0(x) = 0$) search a new solution of the form $u_{n+1}(x) = u_n(x) + \phi(x)$ and examine a linear boundary value problem for the unknown function $\phi(x)$. Use the Taylor approximation $e^{u+\phi} \approx e^u + e^u\,\phi = e^u\,(1 + \phi)$ and solve the resulting linear boundary value problem for $\phi(x)$.
This boundary value problem for the function $\phi(x)$ can be solved with a finite difference approximation (see Chapter 4). Let $h = \frac{2}{N+1}$ and $x_i = -1 + i\,h$ for $i = 1, 2, 3, \ldots, N$. With $u_i = u(x_i)$ and $\phi_i = \phi(x_i)$ obtain a finite difference approximation of the second order derivative
$-\phi''(x_i) \approx \frac{-\phi_{i-1} + 2\,\phi_i - \phi_{i+1}}{h^2}\,.$
Solve this system of linear equations and then restart with $u_{n+1}(x) = u_n(x) + \phi(x)$. The matrix A in $A\,\vec\phi = \vec b$ has a tridiagonal structure. For this type of problem special algorithms exist (footnote 5). The above algorithm is implemented in Octave/MATLAB.
x = (-1:h:1)';
c = 1.176501940; uexact = log(c^2./(1+cos(c*x)));
The result, shown below, illustrates that the algorithm stabilizes after the fourth step. The error does not
decrease any more. The remaining error is dominated by the number of grid points N = 201. When setting
N = 20001 the error decreases to 2.3 · 10−10 . This effect can only be illustrated using the known exact
solution. In real world problems this is not the case and we would stop the iteration as soon as enough digits
do not change any more.
errorNewton =
1.611231387e-02
2.683166420e-05
1.913113033e-06
1.913054659e-06
1.913054659e-06
5: An implementation is given in Octave as command trisolve. For MATLAB a similar code is provided in Table 3.19 in the file tridiag.m. With newer versions one can use sparse matrices to solve tridiagonal systems efficiently.
The above code is listed in Table 3.19 as file Keller.m. In this file the method of successive substitutions is also applied to the problem. A graph with the errors for Newton's method and successive substitutions is generated. With this code you can verify that both methods converge, but Newton's rate of convergence is two, while successive substitution converges linearly. In Table 3.3 find a comparison of Newton's method and the partial substitution approach applied to problems similar to the above. ♦
                                          Substitution    Newton
convergence                               linear, slow    quadratic, fast
complexity of code                        very simple     intermediate
good starting values necessary            yes             yes
derivative of f(u) required               no              yes
solve a new linear system for each step   no              yes

Table 3.3: Comparison of successive substitution and Newton's method
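A compact sketch of the Newton iteration used in Example 3–14, assuming the BVP is $u''(x) = e^{u(x)}$ with $u(\pm 1) = 0$; this assumption is consistent with the exact solution $u(x) = \ln(c^2/(1+\cos(c\,x)))$ and $c^2 = 1 + \cos c$:

Octave
N = 201; h = 2/(N+1);
x = (-1+h:h:1-h)';                       % interior grid points
c = 1.176501940; uexact = log(c^2./(1+cos(c*x)));
e = ones(N,1);
D2 = spdiags([e,-2*e,e],-1:1,N,N)/h^2;   % second derivative, u(-1) = u(1) = 0
u = zeros(N,1);                          % starting function u0 = 0
for k = 1:5
  res = D2*u - exp(u);                   % residual of u'' = exp(u)
  J   = D2 - spdiags(exp(u),0,N,N);      % Jacobian, i.e. phi'' - exp(u)*phi
  u   = u - J\res;                       % Newton step, u -> u + phi
  fprintf('%e\n', norm(u-uexact,inf))    % error stabilizes after a few steps
end%for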
3–15 Example : In the previous example the nonlinear BVP (Boundary Value Problem) was solved by two steps:
1. Linearize the nonlinear BVP, using Newton's idea.
2. Transform the linearized BVP into a system of linear equations, using finite differences.
The order of the two operations can be exchanged:
1. Transform the nonlinear BVP into a system of nonlinear equations, using finite differences.
2. Apply Newton's method to the resulting system of nonlinear equations.
3–16 Example : In the previous example only the right hand side of the BVP contained a nonlinear function. The method applies also to problems with nonlinear coefficient functions. Consider
$-\bigl(a(u(x))\,u'(x)\bigr)' = f(u(x)) \qquad\text{with}\quad u(0) = u(1) = 0$
and use Newton's method, i.e. for a known starting function u search for $u + \phi$ and determine $\phi$ as a solution of a linear problem. Then restart with the new function $u_1 = u + \phi$. Use the linear approximations
$a(u + \phi) \approx a(u) + a'(u)\,\phi \qquad\text{and}\qquad f(u + \phi) \approx f(u) + f'(u)\,\phi$
to replace the original nonlinear differential equation for u with a linear equation for the unknown function $\phi$:
$-\bigl(a'(u)\,u'\,\phi + a(u)\,\phi'\bigr)' = (a(u)\,u')' + f(u) + f'(u)\,\phi$
$-\bigl(a(u)\,\phi\bigr)'' - f'(u)\,\phi = (a(u)\,u')' + f(u)$
The finite difference approximation of the expression $(a(u)\,u')'$ is given in Example 4–8 on page 268. There are two possible options to solve the above linear BVP for the unknown function $\phi$.
• Option 1: Finite difference approximation of the expression
$(b(x)\,u(x))'' \approx \frac{b(x-h)\,u(x-h) - 2\,b(x)\,u(x) + b(x+h)\,u(x+h)}{h^2}$
or with a matrix notation
$\frac{1}{h^2}\begin{bmatrix}
-2\,b_1 & b_2 & & & & \\
b_1 & -2\,b_2 & b_3 & & & \\
 & b_2 & -2\,b_3 & b_4 & & \\
 & & b_3 & -2\,b_4 & b_5 & \\
 & & & \ddots & \ddots & \ddots\\
 & & & b_{N-2} & -2\,b_{N-1} & b_N\\
 & & & & b_{N-1} & -2\,b_N
\end{bmatrix}\cdot\begin{bmatrix} u_1\\ u_2\\ u_3\\ u_4\\ \vdots\\ u_{N-1}\\ u_N\end{bmatrix}$
• Option 2: If the coefficient function a(u) is strictly positive introduce a new function $w(x) = a(u(x))\,\phi(x)$. Since $\phi = \frac{1}{a}\,a\,\phi = \frac{1}{a}\,w$ find the new differential equation
$-w'' - \frac{f'(u)}{a(u)}\,w = (a(u)\,u')' + f(u)$
for the unknown function w(x). Once w(x) is computed use $\phi(x) = \frac{w(x)}{a(u(x))}$.
Then restart with the new approximation un (x) = u(x) + φ(x). ♦
fminbnd(@(x)sin(x),2,8)/pi
-->
ans = 1.50000
The function fminbnd() will only search between x = 2 and x = 8. A sizable number of options can be
used and more outputs are generated, see help fminbnd.
To find the maximum of the function
$f(x, y) = -2\,x^2 - 3\,x\,y - 2\,y^2 + 5\,x + 2\,y$
search for a minimum of $-f(x, y)$. Examine the graph of f(x, y) in Figure 3.12.
Octave
f = @(x,y) -2*x.^2 - 3*x.*y - 2*y.^2 + 5*x + 2*y;   % the function from above
[xx,yy] = meshgrid([-1:0.1:4],[-2:0.1:2]);
mesh(xx, yy, f(xx,yy))
[Figure 3.12: graph of f(x, y)]
Using the graph conclude that there is a maximum not too far away from (x, y) ≈ (1.5 , 0). Now use
fminsearch() with the function −f and the above starting values.
Octave
xMin = fminsearch(@(x)-f(x(1),x(2)),[1.5,0])
-->
xMin = 2.0000 -1.0000
A sizable number of options can be used and more outputs are generated, see help fminsearch. ♦
To examine the accuracy of the extreme point consider having a closer look at the contour levels.
The columns of A contain the images of the standard basis vectors. Figure 3.13 visualizes this linear
mapping.
[Figure 3.13: the linear mapping $\vec x \mapsto A\,\vec x$; the standard basis vectors $\vec e_1, \vec e_2$ are mapped to the columns $A\,\vec e_1, A\,\vec e_2$ of A]
A matrix $U \in M_{n\times n}$ with real entries is called orthogonal (footnote 6) if
$U^T\,U = I_n\,.$
6: This author actually would prefer the notation of an orthonormal matrix.
For matrices with complex entries this is called a unitary matrix. The column vectors have length 1 and are pairwise orthogonal. To examine this property consider the columns of U as vectors $\vec u_i \in \mathbb{R}^n$, i.e. $U = [\vec u_1, \vec u_2, \vec u_3, \ldots, \vec u_n] \in M_{n\times n}$. Then examine the scalar products of these column vectors, as components of a matrix product:
$U^T\,U = \begin{bmatrix}\vec u_1^T\\ \vec u_2^T\\ \vec u_3^T\\ \vdots\\ \vec u_n^T\end{bmatrix}\,[\vec u_1, \vec u_2, \vec u_3, \ldots, \vec u_n]
= \begin{bmatrix}
\langle\vec u_1,\vec u_1\rangle & \langle\vec u_1,\vec u_2\rangle & \langle\vec u_1,\vec u_3\rangle & \cdots & \langle\vec u_1,\vec u_n\rangle\\
\langle\vec u_2,\vec u_1\rangle & \langle\vec u_2,\vec u_2\rangle & \langle\vec u_2,\vec u_3\rangle & \cdots & \langle\vec u_2,\vec u_n\rangle\\
\vdots & & \ddots & & \vdots\\
\langle\vec u_n,\vec u_1\rangle & \langle\vec u_n,\vec u_2\rangle & \langle\vec u_n,\vec u_3\rangle & \cdots & \langle\vec u_n,\vec u_n\rangle
\end{bmatrix} = I_n\,.$
For an orthogonal matrix use $U^{-1} = U^T$, i.e. the inverse matrix is easily given.
Multiplying a vector $\vec x$ by an orthogonal matrix U does not change the length of the vector and the angles between vectors. To verify this fact use the scalar product again:
$\|U\vec x\|^2 = \langle U\vec x,\,U\vec x\rangle = \langle\vec x,\,U^T U\vec x\rangle = \langle\vec x,\,I\,\vec x\rangle = \|\vec x\|^2$
$\cos(\angle(U\vec x, U\vec y)) = \frac{\langle U\vec x,\,U\vec y\rangle}{\|\vec x\|\,\|\vec y\|} = \frac{\langle\vec x,\,U^T U\vec y\rangle}{\|\vec x\|\,\|\vec y\|} = \frac{\langle\vec x,\,\vec y\rangle}{\|\vec x\|\,\|\vec y\|} = \cos(\angle(\vec x, \vec y))$
Thus multiplying by an orthogonal matrix U corresponds to a rotation in Rn and possibly one reflection. If
det(U) = −1, there is a reflection involved, with det(U) = +1 it is rotations only.
In the plane R2 the orthogonal matrices with angle of rotation α are the well known rotation matrices.
" # " #
+ cos(α) − sin(α) + cos(α) + sin(α)
U= or U = .
+ sin(α) + cos(α) + sin(α) − cos(α)
♦
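A quick numerical check of these properties (a small sketch; the angle is an arbitrary choice):

alpha = 0.7;
U = [cos(alpha), -sin(alpha); sin(alpha), cos(alpha)];
U'*U                            % returns the identity matrix I_2
x = [1; 2];
norm(U*x) - norm(x)             % lengths are preserved, result ~ 0
det(U)                          % +1, i.e. a pure rotation without reflection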
3–20 Observation :
• If λ is an eigenvalue then (A − λ I) ~u = ~0 with ~u 6= ~0. Thus the matrix A − λ I ∈ Mn×n is not
invertible and λ is a zero of the characteristic polynomial
p(λ) = det(A − λ I) = 0 .
To determine the characteristic polynomial subtract λ along the diagonal of the matrix A and compute the determinant. This is a polynomial of degree n and consequently any n × n matrix has exactly n eigenvalues λi for i = 1, 2, 3, . . . , n, counted with their multiplicity. These eigenvalues can be real or complex.
• Not all eigenvalues have their ”own” eigenvector. The matrix
" #
0 1
A=
0 0
has a double eigenvalue λ1 = λ2 = 0 but only one eigenvector ~u = (1, 0)^T. In this case λ = 0 has algebraic multiplicity 2, but geometric multiplicity 1. This matrix can not be diagonalized. The mathematical tool to be used in this special case are Jordan normal forms, see e.g. [HornJohn90, §3] or https://en.wikipedia.org/wiki/Jordan normal form. In these notes only diagonalizable matrices are examined and used.
• If λ = α + i β ∈ C with β ≠ 0 is a complex eigenvalue with eigenvector ~u + i ~v, then the complex conjugate α − i β is an eigenvalue too, with eigenvector ~u − i ~v. To verify this examine
A (~u + i ~v ) = (α + i β) (~u + i ~v ) = (α ~u − β ~v ) + i (β ~u + α ~v )
A ~u = α ~u − β ~v real part
A ~v = β ~u + α ~v imaginary part
A (~u − i ~v ) = (α − i β) (~u − i ~v ) = (α ~u − β ~v ) − i (β ~u + α ~v )
or with a matrix notation

A [~u, ~v] = [~u, ~v] · [ +α −β ; +β +α ] .
The two vectors ~u ∈ Rn \{~0} and ~v ∈ Rn \{~0} are not zero, verified by a contradiction argument.
~v = ~0 =⇒ A ~v = +β ~u = ~0 =⇒ ~u = ~0
~u = ~0 =⇒ A ~u = −β ~v = ~0 =⇒ ~v = ~0
The two vectors ~u, ~v ∈ Rn are (real) linearly independent, again verified by contradiction.
~v = c ~u =⇒ A ~u = α ~u − β ~v = (α − β c) ~u
and ~u would be an eigenvector with real eigenvalue (α − β c).
• If λ is a generalized eigenvalue and the invertible weight matrix B has a Cholesky factorization B = R^T R, then

  A ~u = λ B ~u = λ R^T R ~u =⇒ R^{−T} A R^{−1} (R ~u) = λ (R ~u)

  and thus λ is a regular eigenvalue of the matrix R^{−T} A R^{−1}. Often B = diag([w1, w2, w3, . . . , wn]) = diag(wi) is a diagonal matrix with positive entries wi and thus R = diag(√wi) and λ is an eigenvalue of diag(1/√wi) A diag(1/√wi). Rows and columns of A have to be divided by √wi; this preserves the symmetry of A.
♦
• The corresponding eigenvectors ~vi are linearly independent. Thus the matrix with the eigenvectors as
columns is invertible.
V = [~v1 , ~v2 , ~v3 , . . . , ~vn ] ∈ Mn×n
A V = V diag(λi )
and consequently
V−1 A V = diag(λi ) and A = V diag(λi ) V−1 .
This is a diagonalization of the matrix A.
If the eigenvalues are not isolated (i.e. λi = λj for some i ≠ j) then some of the above results may fail. The eigenvectors ~vi might not be linearly independent any more. As a consequence not all vectors ~x are generated as linear combinations of eigenvectors. The situation improves drastically for symmetric matrices.
(λi − λj ) h~ei , ~ej i = λi h~ei , ~ej i − λj h~ei , ~ej i = hλi ~ei , ~ej i − h~ei , λj ~ej i
= hA ~ei , ~ej i − h~ei , A ~ej i = hA ~ei , ~ej i − hA ~ei , ~ej i = 0
This leads to

A Q = Q diag(λj)
Q^T A Q = diag(λj) = diag(λ1, λ2, . . . , λn)
A = Q diag(λj) Q^T
This process is called diagonalization of the symmetric matrix A. This result is extremely useful,
e.g. to simplify general stress or strain situations to principal stress or strain situations, see Section 5.3
starting on page 326.
• Each vector ~x ∈ Rn can be written as a linear combination of the eigenvectors ~ei. The result for the general matrix leads to

  ~c = Q^{−1} ~x = Q^T ~x = ( ⟨~e1, ~x⟩, ⟨~e2, ~x⟩, ⟨~e3, ~x⟩, . . . , ⟨~en, ~x⟩ )^T

  and thus

  ~x = Σⁿᵢ₌₁ ci ~ei = Σⁿᵢ₌₁ ⟨~x , ~ei⟩ ~ei .
• The inverse matrix is easily computed if all eigenvalues and eigenvectors are already known. Then use

  A^{−1} = (Q diag(λj) Q^T)^{−1} = Q^{−T} diag(λj)^{−1} Q^{−1} = Q diag(1/λj) Q^T .

  This is not an efficient way to determine the inverse matrix, since the eigenvalues and eigenvectors are difficult to determine. The result is more useful for analytical purposes.
3
Another characterization of the eigenvalues of a symmetric, positive definite matrix A can be based on the Rayleigh quotient

ρ(~x) = ⟨~x , A ~x⟩ / ⟨~x , ~x⟩ .

Assume that the eigenvalues λ1 ≤ λ2 ≤ λ3 ≤ . . . ≤ λn are sorted. When looking for an extremum of the function ⟨~x , A ~x⟩, subject to the constraint k~xk = 1, use the Lagrange multiplier theorem and

∇⟨~x , A ~x⟩ = 2 A ~x and ∇⟨~x , ~x⟩ = 2 ~x

to conclude that A ~x = λ ~x for some factor λ. Using ⟨~x , A ~x⟩ = ⟨~x , λ ~x⟩ = λ k~xk² conclude

λ1 = min_{k~xk=1} ⟨~x , A ~x⟩ and λn = max_{k~xk=1} ⟨~x , A ~x⟩ .
For other eigenvalues use a slight modification of this result. If ~e1 is an eigenvector to the first eigenvalue
use the fact that the eigenvector to strictly larger eigenvalues are orthogonal to ~e1 . This leads to a method to
determine λ2 by
λ2 = min{h~x , A ~xi | k~x k = 1 and ~x ⊥ ~e1 } .
This result can be extended in the obvious way to obtain λ3 by
λ3 = min{h~x , A ~xi | k~x k = 1 , ~x ⊥ ~e1 and ~x ⊥ ~e2 } .
The other eigenvalues can also be characterized by looking at subspaces by the Courant-Fischer Minimax
Theorem, see [GoluVanLoan96, Theorem 8.1.2] or [Axel94, Lemma 3.13].
λk = max_{dim S = n−k+1} min_{~x ∈ S\{~0}} ⟨~x , A ~x⟩ / ⟨~x , ~x⟩
Examine the matrix An generated by the finite difference approximation of

− d²/dx² u(x) = f(x) −→ An ~u = f~ .

The exact eigenvalues are given by

λj = (4/h²) sin²( j π h / 2 ) for 1 ≤ j ≤ n

and the eigenvectors ~vj are generated by discretizing the function sin(j π x) over the interval [0, 1]; thus ~vj has j extrema within the interval:

~vj = ( sin( 1 j π/(n+1) ) , sin( 2 j π/(n+1) ) , sin( 3 j π/(n+1) ) , . . . , sin( (n−1) j π/(n+1) ) , sin( n j π/(n+1) ) )

♦
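These formulas are easy to confirm numerically; a small sketch comparing eig() with the exact eigenvalues:

n = 10; h = 1/(n+1);
A_n = spdiags(ones(n,1)*[-1 2 -1],[-1 0 1],n,n)/h^2;
lambda_numeric = sort(eig(full(A_n)));
lambda_exact = 4/h^2*sin((1:n)'*pi*h/2).^2;
norm(lambda_numeric - lambda_exact)   % ~ 0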
implies that the mapping generated by the matrix has a special structure:
This behavior is consistent with the mapping of the eigenvectors ~vi , i.e. the columns of Q, since A ~vi =
λi ~vi . Vectors in the eigenspace spanned by ~vi are stretched (or compressed) by the factor λi . The image
of the unit circle in R2 under a mapping by a symmetric matrix is thus an ellipse with the semi-axes in the directions of the eigenvectors and the lengths given by the eigenvalues. The analogous result for ellipsoids as images of the unit sphere in Rn is valid. ♦
These expressions have many applications, amongst them level sets for general Gaussian distributions (e.g.
Figures 3.19, 3.20 and 3.23) or the regions of confidence for the parameters determined by linear or nonlinear
regression (e.g. Figures 3.49, 3.58 and 3.59).
Use ~x = Σⁿᵢ₌₁ (1/√λi) ci ~ei with Σⁿᵢ₌₁ ci² = r² to examine the quadratic form

⟨~x, A ~x⟩ = ⟨ Σⁿᵢ₌₁ (1/√λi) ci ~ei , A Σⁿⱼ₌₁ (1/√λj) cj ~ej ⟩ = Σⁿᵢ,ⱼ₌₁ (1/√(λi λj)) ci cj ⟨~ei , A ~ej⟩
          = Σⁿᵢ,ⱼ₌₁ (1/√(λi λj)) ci cj ⟨~ei , λj ~ej⟩ = Σⁿᵢ₌₁ (1/λi) ci² λi = r² .
[Figure 3.14: Level curve or surface of quadratic forms in R2 or R3. The level curve and surface are shown in blue and in black the eigenvectors divided by √λi.]

Similar arguments generate the level surface in R3, visible in Figure 3.14(b). The computations use spherical coordinates

(x, y, z) = r ( cos(θ) cos(φ) , cos(θ) sin(φ) , sin(θ) ) for 0 ≤ φ ≤ 2π and −π/2 ≤ θ ≤ π/2 .
Any vector ~x ∈ R3 can be written as linear combination of the three vectors d~i , i.e. ~x = c1 d~1 + c2 d~2 + c3 d~3
with
Q ~c = ~x and k~ck = kQT ~xk = k~xk .
Then the matrix

A = Q diag(1/l1², 1/l2², 1/l3²) Q^T = (Q diag(1/li)) (diag(1/li) Q^T)

has the three eigenvectors d~i and the eigenvalues λi = 1/li². It satisfies

⟨±li d~i , A (±li d~i)⟩ = (±1)² li² λi ⟨d~i , d~i⟩ = (li²/li²) kd~ik² = 1 ,

i.e. the six endpoints of the axes of the ellipsoid are on the level set f(~x) = 1. Example 3–24 shows that
multiplying by the matrix A corresponds to a sequence of three mappings:
1. ~y = Q^T ~x rotates ~x into the coordinate system given by the eigenvectors d~i.

2. ~z = diag(1/li²) ~y will stretch or compress the i–th component yi by the factor 1/li².

3. A ~x = Q ~z rotates the result back into the original coordinate system.
The above examples allow to visualize level curves and surfaces in R2 and R3, but the task is more difficult when higher dimensions are necessary, i.e. for regression problems with more than three parameters. There are (at least) two methods to reduce the dimensions to visualize the results: examine the intersection of the level set with a coordinate plane, or examine the projection of the level set onto a coordinate plane.
As example examine a problem with four independent variables, given by the symmetric matrix A ∈ M4×4, i.e. ai,j = aj,i. Examine reductions to expressions involving two variables only, e.g. x1 and x2. The quadratic form to be examined is

f(~x) = ⟨A ~x, ~x⟩ = Σ⁴ᵢ,ⱼ₌₁ ai,j xi xj with A = (ai,j) ∈ M4×4 and ~x = (x1, x2, x3, x4)^T .
To determine the intersection with the x1x2–plane (i.e. x3 = x4 = 0) it is convenient to write the matrix as composition of four 2 × 2 block matrices

A = [ A11 A13 ; A13^T A33 ] with A11 = [a1,1 a1,2 ; a2,1 a2,2] , A13 = [a1,3 a1,4 ; a2,3 a2,4] , A33 = [a3,3 a3,4 ; a4,3 a4,4] .
Thus the intersections of the level sets with the plane x3 = x4 = 0 can be visualized with the tools from the
above example.
To examine the projection on this plane requires more effort. Examining Figure 3.14(b), observe that the contour of the projection along the x3 axis onto the horizontal plane is characterized by the vanishing of the x3 component of the gradient (see page 74) ∇f(~x) = 2 A ~x. For the four dimensional example this leads to the condition that the last two components of the gradient vanish, i.e.

(0, 0)^T = [ a3,1 a3,2 a3,3 a3,4 ; a4,1 a4,2 a4,3 a4,4 ] ~x = A13^T (x1, x2)^T + A33 (x3, x4)^T ,

which can be solved for (x3, x4)^T = −A33^{−1} A13^T (x1, x2)^T.
If A is positive definite, then A11 and A33 are positive definite too. With this notation the quadratic form is
expressed in terms of (x1 , x2 ) only.
f(~x) = ⟨A ~x, ~x⟩ = ⟨A11 (x1, x2)^T, (x1, x2)^T⟩ + 2 ⟨A13 (x3, x4)^T, (x1, x2)^T⟩ + ⟨A33 (x3, x4)^T, (x3, x4)^T⟩
Inserting (x3, x4)^T = −A33^{−1} A13^T (x1, x2)^T and writing ~u = (x1, x2)^T leads to

= ⟨A11 ~u, ~u⟩ − 2 ⟨A13 A33^{−1} A13^T ~u, ~u⟩ + ⟨A33 A33^{−1} A13^T ~u, A33^{−1} A13^T ~u⟩
= ⟨A11 ~u, ~u⟩ − 2 ⟨A13 A33^{−1} A13^T ~u, ~u⟩ + ⟨A13 A33^{−1} A13^T ~u, ~u⟩
= ⟨ (A11 − A13 A33^{−1} A13^T) ~u, ~u⟩
Thus examine the level curves in R2 of the quadratic form generated by the modified matrix Ã = A11 − A13 A33^{−1} A13^T ∈ M2×2.
In the case of a projection from R3 onto R2 along the x3 direction this simplifies slightly to

f(~x) = ⟨A (x1, x2, 0)^T, (x1, x2, 0)^T⟩ − (1/a3,3) ⟨ (a1,3, a2,3)^T (a1,3, a2,3) (x1, x2)^T, (x1, x2)^T ⟩
     = ⟨ ( [a1,1 a1,2 ; a1,2 a2,2] − (1/a3,3) (a1,3, a2,3)^T (a1,3, a2,3) ) (x1, x2)^T, (x1, x2)^T ⟩ .
The above approach can be generalized to more than four dimensions and the variables to be removed can be chosen arbitrarily. This is implemented in a MATLAB/Octave function.
ReduceQuadraticForm.m
function A_new = ReduceQuadraticForm(A,remove)
stay = 1:length(A);
for ii = 1:length(remove) %% remove the indicated components
stay = stay(stay~=remove(ii));
end%for
%% construct the sub matrices
Asr = A(stay,remove); Arr = A(remove,remove);
A_new = A(stay,stay) - Asr*(Arr\Asr’); %% the new matrix
end%function
For the above matrix A ∈ M4×4 the coordinates x3 and x4 are removed and the new matrix à ∈ M2×2
is generated by calling A new = ReduceQuadraticForm(A,[3 4]) .
3–27 Example : Intersection and Projection of Level Sets of a Quadratic Form onto Coordinate Plane
As example examine the ellipsoid in Figure 3.14(b) and determine the intersection and projection onto the
coordinate plane x3 = 0. Find the results of the code below in Figure 3.15. Observe that the axes of the two
ellipses in Figure 3.15(a) are in slightly different directions.
figure(1); clf
plot3(Points(1,:),Points(2,:),Points(3,:),’.b’)
hold on
plot3([0,Evec(1,1)]/sqrt(Eval(1,1)),...
[0,Evec(2,1)]/sqrt(Eval(1,1)),...
[0,Evec(3,1)]/sqrt(Eval(1,1)),’k’)
plot3([0,Evec(1,2)]/sqrt(Eval(2,2)),...
[0,Evec(2,2)]/sqrt(Eval(2,2)),...
[0,Evec(3,2)]/sqrt(Eval(2,2)),’k’)
plot3([0,Evec(1,3)]/sqrt(Eval(3,3)),...
[0,Evec(2,3)]/sqrt(Eval(3,3)),...
[0,Evec(3,3)]/sqrt(Eval(3,3)),’k’)
plot3(Points_i(1,:),Points_i(2,:),Points_i(3,:),’r’)
plot3(Points_p(1,:),Points_p(2,:),Points_p(3,:),’c’)
xlabel(’x_1’); ylabel(’x_2’); zlabel(’x_3’);
axis(1.1*[-1 1 -1 1 -1 1]); axis equal
view([-80 30]); hold off
figure(2); clf
plot(Points_i(1,:),Points_i(2,:),'r',...
Points_p(1,:),Points_p(2,:),’c’)
xlabel(’x_1’); ylabel(’x_2’);
legend(’intersection’,’projection’); axis(1.1*[-1 1 -1 1]); axis equal
(a) in the coordinate plane x3 = 0 (b) in R3
Figure 3.15: Level surface of a quadratic form in R3 with intersection and projection onto R2
To compute the eigenvalues of a matrix use
A = [1 2 3; 4 5 6; 7 8 9];
eval = eig(A)
-->
eval = 1.6117e+01
-1.1168e+00
-1.3037e-15
A = [1 2 3; 4 5 6; 7 8 9];
eval = eig(A)
[evec,eval] = eig(A)
-->
evec = -0.231971 -0.785830 0.408248
-0.525322 -0.086751 -0.816497
-0.818673 0.612328 0.408248
The eigenvectors are all normalized to have length 1 and the eigenvalues are returned on the diagonal of a
matrix. For symmetric matrices the eigenvectors are orthogonal.
The same command is used to compute generalized eigenvalues by supplying two input arguments. To examine

[1 2 3 ; 4 5 6 ; 7 8 9] ~v = λ [1 0 0 ; 0 10 0 ; 0 0 100] ~v

use [evec,eval] = eig(A,B) with the weight matrix B = diag([1,10,100]).
As example consider the model matrices An and Ann from Section 2.3 (see page 32). For these matrices
the exact eigenvalues are known, eigs() will return estimates of some of the eigenvalues. To estimate the
three largest eigenvalues of the 1000 × 1000 matrix An for n = 1000 use
n = 1000; h = 1/(n+1);
A_n = spdiags(ones(n,1)*[-1, 2, -1],[-1, 0, 1],n,n)/(h^2);
lambda_max = 4/h^2*sin(n*pi*h/2)^2
options.issym = true; options.tol = n^2*1e-5;
lambda = eigs(A_n,3,'lm',options)
-->
lambda_max = 4.0080e+06
lambda = 4.0019e+06
3.9569e+06
3.8748e+06
lambda_min = 4/h^2*sin(pi*h/2)^2
lambda = eigs(A_n,3,'sm',options)
-->
lambda_min = 9.8696
lambda = 88.8258
39.4783
9.8696
To estimate the three largest eigenvalues of the 10 000 × 10 000 matrix Ann for n = 100 use
n = 100; h = 1/(n+1);
A_n = spdiags(ones(n,1)*[-1, 2, -1],[-1, 0, 1],n,n)/(h^2);
A_nn = kron(speye(n),A_n) + kron(A_n,speye(n));
lambda_max = 8/h^2*sin(n*pi*h/2)^2
options.issym = true; options.tol = n^2*1e-5;
lambda = eigs(A_nn,3,'lm',options)
-->
lambda_max = 8.1588e+04
lambda = 8.1344e+04
8.0238e+04
7.8325e+04
This code takes less than 0.02 seconds to run, using eig() takes 46 seconds on a fast desktop computer.
Using estimates for the smallest and largest eigenvalue allows one to estimate the condition number of the matrix. The command condest() uses similar ideas to estimate the condition number of a matrix.
The general solution has the structure ~x(t) = ~xp(t) + ~xh(t), where

• ~xp(t) is one particular solution,

• ~xh(t) is a solution of the corresponding homogeneous system d/dt ~xh(t) = A(t) ~xh(t).
Because of this structure one may examine the homogeneous problem to study the stability of numerical
algorithms to solve the system of ODEs, see Section 3.4.
For constant matrices A(t) = A the eigenvalues and eigenvectors provide a lot of information on the solutions of the homogeneous problem d/dt ~xh(t) = A ~xh(t). This will lead to Result 3–29 below.
(c) If the above n eigenvectors ~vi are linearly independent, then any initial value ~x(0) = ~x0 can be written
as a linear combination of the eigenvectors, i.e.
n
X
~x(0) = ~x0 = ci ~vi .
i=1
Verify that solving for the coefficients ci leads to a system of linear equations. Using this solve (3.4)
with an initial condition ~x(0) = ~x0 . With the notation
~vi = (v1,i, v2,i, v3,i, . . . , vn,i)^T and ~x0 = (x1, x2, x3, . . . , xn)^T write

(x1, x2, x3, . . . , xn)^T = c1 ~v1 + c2 ~v2 + c3 ~v3 + · · · + cn ~vn = V ~c ,

where the matrix V = [~v1, ~v2, ~v3, . . . , ~vn] ∈ Mn×n contains the components vj,i of the eigenvectors and ~c = (c1, c2, c3, . . . , cn)^T.
If the eigenvectors ~vi are linearly independent, the system has a unique solution.
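A small sketch of this construction; the matrix and the initial value are illustrative assumptions:

A = [0 1; -2 -3]; x0 = [1; 0];
[V,D] = eig(A); lambda = diag(D);
c = V\x0;                             % solve V c = x0 for the coefficients
x = @(t) real(V*(c.*exp(lambda*t)));  % x(t) = sum_i c_i e^(lambda_i t) v_i
x(0.5)                                % the solution at t = 0.5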
Find the corresponding vector field, two solutions and the eigen–directions in Figure 3.16.
are linearly independent solutions of the system of ODEs d/dt ~x(t) = A ~x(t). These solutions move in the plane in Rn spanned by ~u and ~v. They grow (or shrink) exponentially like e^{αt} and rotate in the plane with angular velocity β.

The above observations can be collected and lead to a complete description of the solution of the homogeneous system d/dt ~x(t) = A ~x(t), assuming that all eigenvalues are isolated.
and

λ3 ≈ +0.095 with ~e3 ≈ (−0.13, −0.91, 0.40)^T .
In the plane generated by the vectors ~u and ~v obtain exponential spirals with decaying radius cr e−0.1 t . In
the direction given by ~e3 the solution is moving away from the origin, the distance described by a function
c e+0.095 t . This behavior is confirmed by Figure 3.17 by the graphs of three solutions.
StabilityExample.m
A = [0.39 0.51 1.26; -0.68 0 -0.44; -1.47 -0.05 -0.5]
[EV,EW] = eig(A)
t = linspace(0,20);
[t1,x1] = ode45(@(t,x)A*x,t,[1;1;1]);
[t2,x2] = ode45(@(t,x)A*x,t,[0.4,-0.2,0]);
[t3,x3] = ode45(@(t,x)A*x,t,[-1,-0.5,-1]);
[Figure 3.17: three solutions of the system of ODEs, (a) view onto the direction ~e3, (b) view onto the plane generated by ~u and ~v]
plot3([0,real(EV(1,3))],[0,real(EV(2,3))],[0,real(EV(3,3))],'g')
hold on % draw three solutions in different colors
plot3(x1(:,1),x1(:,2),x1(:,3),’r’,x2(:,1),x2(:,2),x2(:,3),’b’,...
x3(:,1),x3(:,2),x3(:,3),’c’)
xlabel(’x’);ylabel(’y’);zlabel(’z’); axis equal; hold off
-->
EV =
0.66972 + 0.00000i 0.66972 - 0.00000i -0.13030 + 0.00000i
-0.17882 + 0.27667i -0.17882 - 0.27667i -0.90807 + 0.00000i
-0.18947 + 0.63801i -0.18947 - 0.63801i 0.39803 + 0.00000i
EW = Diagonal Matrix
-0.10264 + 1.41104i 0 0
0 -0.10264 - 1.41104i 0
0 0 0.09529 + 0.00000i
The situation changes slightly if second order systems are examined. Wave equations lead to this type of ODEs, see e.g. Section 4.6. If A ~vi = λi ~vi with λi = ωi² > 0, then the system

d²/dt² ~x(t) = −A ~x(t)

is solved by

~x(t) = cos(ωi t) ~vi and ~x(t) = sin(ωi t) ~vi .
Using trigonometric identities these two solutions can be combined to have one amplitude ai and a phase
shift φi by
~x(t) = (c1 cos(ωi t) + c2 sin(ωi t)) ~vi = ai cos(ωi (t − φi )) ~vi .
As a consequence the frequencies of these oscillations are determined by the eigenvalues λi = ωi2 . 3
d²/dt² ~x(t) = d²/dt² ( cos(ωi t) ~vi ) = −ωi² cos(ωi t) ~vi
−A ~x(t) = −cos(ωi t) A ~vi = −cos(ωi t) λi ~vi = −ωi² cos(ωi t) ~vi
and thus the system of differential equations is solved. The computation for the second solution sin(ωi t) ~vi
is similar. 2
The above could also be examined by Result 3–29. Translate the system of n second order equations to a system of 2n ODEs of order one. For this introduce the new variable ~v(t) = d/dt ~x(t) ∈ Rn and examine the (2n) × (2n) system

d/dt ( ~x(t) ; ~v(t) ) = ( ~v(t) ; −A ~x(t) ) = [ 0 In ; −A 0 ] ( ~x(t) ; ~v(t) ) .
The generalized eigenvalues and eigenvectors satisfy A ~v = λ B ~v and with those ODEs with a weight matrix can be solved, i.e.

B d/dt ~x(t) = A ~x(t)

is solved by ~x(t) = e^{λt} ~v. To verify this use B d/dt ~x(t) = λ e^{λt} B ~v and A ~x(t) = e^{λt} A ~v = λ e^{λt} B ~v. A similar result is correct for second order ODEs

B d²/dt² ~x(t) = −A ~x(t) ,

which are solved by ~x(t) = cos(√λ t) ~v and ~x(t) = sin(√λ t) ~v.

For a matrix A ∈ Mm×n the singular value decomposition consists of orthogonal matrices U and V and the singular values σi, such that A = U diag(σ1, σ2, . . . , σp) V^T with p = min(m, n) and σ1 ≥ σ2 ≥ σ3 ≥ . . . ≥ σp ≥ 0. 3
As consequence of the above find the SVD (Singular Value Decomposition) of the n × n matrix A:

A = U diag(σ1, σ2, . . . , σn) V^T .
With MATLAB/Octave the singular value decomposition can be computed by [U,S,V] = svd(A). It is not difficult to see that for square matrices A and the usual 2–norm we have

kAk2 = σ1 , kA^{−1}k2 = 1/σn and cond(A) = σ1/σn .
If the matrix A is symmetric and positive definite, then U = V and
AU = UΣ
implies that the singular values are given by the eigenvalues λi = σi and in the columns of the orthogonal
matrix U find the normalized eigenvectors of A. Thus the SVD coincides with the diagonalization of the
matrix A, as examined in Result 3–22 on page 134. If the real matrix A is symmetric but not necessarily positive definite, then σi = |λi| and for the corresponding columns (i.e. eigenvectors) find ~vi = −~ui. Then V^T U is a diagonal matrix with numbers ±1 on the diagonal.
If the m × n matrix A has more rows than columns (m > n) then the factorization has the form

A = U D V^T with D = [ diag(σ1, σ2, . . . , σn) ; 0 ] ∈ Mm×n , i.e. A = Uecon diag(σ1, σ2, . . . , σn) V^T .
Since the matrix D has m − n rows with zeros at the bottom, the last m − n columns of U will be multiplied
by zeros. There is no need to compute these columns and the command [U,D,V] = svd(A,’econ’)
generates this more economic form.
The SVD has many more applications: image processing, data compression, regression, robotics, ...
Search on the internet for the keywords Professor SVD, Gene Golub and find an excellent article
about Gene Golub and SVD by Cleve Moler, the founder of MATLAB.
A = round(100*rand(8,4))
[U,D,V] = svd(A,’econ’);
The entries in the diagonal matrix D of the SVD are [279.982, 93.171, 75.224, 41.348] and thus the first
entry is considerably larger than the others and one may approximate the matrix by using the first columns
of U and V only, i.e.
A ≈ A1 = ~u1 · 279.982 · ~v1T .
This matrix A1 has a simple structure, as illustrated by the trivial example

(1, 2, 3, −1)^T · [1 2 10] = [ 1 2 10 ; 2 4 20 ; 3 6 30 ; −1 −2 −10 ] .
Each row in A1 is a multiple of the row vector ~v1T and each column in A1 is a multiple of the column vector
~u1 . Using the first two columns leads to
" # " #
h i 279.982 0 ~v1T
A ≈ A2 = ~u1 ~u2 · · .
0 93.171 ~v2T
Each row in A2 is a linear combination of the two row vectors ~v1T and ~v2T and each column in A2 is a linear
combination of the two column vectors ~u1 and ~u2. Comparing the norms of these matrices and of their differences shows that the values of A1 and A2 are not too far from the original matrix A.
A1 ≈
 67.00 71.43 34.06 38.14
 53.84 57.40 27.37 30.65
 50.08 53.39 25.46 28.51
 30.42 32.43 15.46 17.31
 48.98 52.21 24.90 27.88
 66.39 70.77 33.75 37.79
 76.25 81.29 38.76 43.40
 73.41 78.26 37.32 41.79

A2 ≈
 45.43 80.66 50.48 44.07
 35.92 34.68 65.61 41.96
 31.17 61.49 39.86 33.71
 65.22 17.53 −11.03  7.73
 17.58 65.65 48.78 36.52
 96.11 58.05 11.13 29.61
 96.95 72.43 23.01 37.70
 78.21 76.21 33.67 40.47
The PDF of a Gaussian distribution with mean µ and variance σ² is

p(x) = 1/(√(2π) σ) · e^{−(x−µ)²/(2σ²)} .
Examine a random variable ~x ∈ Rn with n independent components xi with means µi and variances σi². In this case the Gaussian distribution in Rn is given by the PDF

p(~x | ~µ, ~σ) = 1/( (√(2π))ⁿ Πⁿᵢ₌₁ σi ) · exp( −(1/2) Σⁿᵢ₌₁ (xi − µi)²/σi² )

ln(p(~x | ~µ, ~σ)) = −(n ln(2π))/2 − Σⁿᵢ₌₁ ln(σi) − Σⁿᵢ₌₁ (xi − µi)²/(2σi²) .

For the standardized Gaussian distribution with means 0 and variances 1 obtain

p(~x | ~0, ~1) = 1/(√(2π))ⁿ · exp( −(1/2) k~xk² ) and ln(p(~x | ~0, ~1)) = −(n ln(2π))/2 − (1/2) k~xk² .
3–35 Observation : To determine the probability that a point ~x ∈ Rn satisfies k~xk ≤ r one needs the surface of the unit sphere S = {~x ∈ Rn | k~xk = 1} ⊂ Rn, given by 2π^{n/2}/Γ(n/2), e.g. 2π for n = 2, 2π^{3/2}/Γ(3/2) = 4π for n = 3 and 2π²/Γ(2) = 2π² for n = 4. To determine the confidence region for an n–dimensional case with confidence level 100(1 − α)% solve the equation

1/(2π)^{n/2} ∫₀^{rmax} e^{−r²/2} (2π^{n/2}/Γ(n/2)) r^{n−1} dr = (2^{1−n/2}/Γ(n/2)) ∫₀^{rmax} e^{−r²/2} r^{n−1} dr = 1 − α   (3.6)
for the upper limit of integration rmax. The value can also be computed using the chi–squared distribution by rmax² = chi2inv(1 − α, n). Then the domain of confidence is determined by r = k~xk ≤ rmax. For n = 2 this leads to

1 − α = (1/Γ(1)) ∫₀^{rmax} e^{−r²/2} r dr = [ −e^{−r²/2} ]₀^{rmax} = 1 − e^{−rmax²/2}   (3.7)

ln(α) = −rmax²/2 =⇒ rmax = √(−2 ln(α))

and the domain of confidence (1/2) r² ≤ − ln(α) = ln(1/α).
In Table 3.4 find results for dimensions n from 1 to 10 for three values of α. Listed are the radius rmax
up to which the integral has to be performed and the values of the PDF at this radius. This table contains
useful information:
• For the 1-d Gauss distribution include all values within 2 (or 1.96) standard deviations from the mean
to cover 95% of the events.
• For the 4-d Gauss distribution include all values within 3 (or 3.08) standard deviations from the mean
to cover 95% of the events.
• For the 8-d Gauss distribution include all values within 4 (or 3.94) standard deviations from the mean
to cover 95% of the events.
This is information on the size of the region of confidence for different dimensions n.
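The radii in Table 3.4 can be regenerated with the chi–squared distribution; this sketch assumes that chi2inv() is available (statistics package in Octave, statistics toolbox in MATLAB):

alpha = 0.05;
for n = [1 2 4 8]
  r_max = sqrt(chi2inv(1-alpha,n));   % radius covering 95% of the events
  fprintf('n = %d : r_max = %5.3f\n', n, r_max)
end%for
% --> approximately 1.960, 2.448, 3.080, 3.938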
With px(~x) = (2π)^{−n/2} exp(−(1/2) k~xk²) equation (3.8) can be written in the form7

p(~y | A) = 1/( (√(2π))ⁿ |det(A)| ) · exp( −(1/2) ~y^T (A A^T)^{−1} ~y )

and the above result can be generalized to affine transformations

~x 7→ T(~x) = ~y = ~y0 + A ~x

with the resulting PDF

p(~y | A, ~y0) = 1/( (2π)^{n/2} |det(A)| ) · exp( −(1/2) (~y − ~y0)^T (A A^T)^{−1} (~y − ~y0) ) .   (3.9)

7 kA^{−1} ~yk² = ⟨A^{−1} ~y , A^{−1} ~y⟩ = ⟨~y , A^{−T} A^{−1} ~y⟩ = ⟨~y , (A A^T)^{−1} ~y⟩
The above is the answer to: given A and ~y0 , what is the PDF of the measurement points.
Now turn the approach around and seek the answer to: given the measured data points ~yj , recover the
affine transformation with A and the shift ~y0 . The maximum likelihood approach seeks the values of A
and ~y0 to render the function f (A, ~y0 ) maximal. Hoping for an unimodal function search for zeros of the
derivatives of f with respect to ~y0 and the components of Γ = (A AT )−1 .
• Compute the gradient of f with respect to ~y0 and solve for ~y0 :

  ~0 = ∂f/∂~y0 = Σᴺⱼ₌₁ (A A^T)^{−1} (~yj − ~y0) = (A A^T)^{−1} Σᴺⱼ₌₁ (~yj − ~y0) ∈ Rn
  ~0 = Σᴺⱼ₌₁ (~yj − ~y0) = −N ~y0 + Σᴺⱼ₌₁ ~yj
  ~y0 = (1/N) Σᴺⱼ₌₁ ~yj
Thus the average value of the data points is the most likely estimator of the offset.
• As a consequence of the above you can first determine the average values and then subtract them from
the data. This simplifies the function and now examine maxima of the modified function
f̂(Γ) := +N ln(det(Γ)) − Σᴺⱼ₌₁ ~yj^T Γ ~yj .
• To examine derivatives with respect to Γrs the adjugate matrix8 can be used with the computational rules

  Γ^{−1} = adj(Γ)/det(Γ) ∈ Mn×n and ∂ det(Γ)/∂Γrs = (−1)^{r+s} adj(Γ)sr .

  This implies

  ∇ det(Γ) = adj(Γ)^T = det(Γ) · Γ^{−T} .

  Take each data point ~yj (the average ~y0 already subtracted) as a column vector and stack all measurements in one matrix

  Y = [ ~y1 ~y2 · · · ~yN ] ∈ Mn×N

8 First compute the cofactor Cr,s. Start with the original matrix, eliminate row r and column s, compute the determinant, multiply by (−1)^{r+s}. The transpose of the matrix of cofactors is equal to adj(Γ).
appear as components of the covariance matrix. The above condition for the partial derivatives of f̂ to vanish is transformed to

adj(Γ)/det(Γ) = (1/N) Y Y^T .

Thus for the matrix A find the necessary condition

Γ^{−1} = A A^T = (1/N) Y Y^T ∈ Mn×n .

One can not recover the components of A, but only the product Γ^{−1} = A A^T. If you insist on one of the possible constructions of a matrix A you may use the Cholesky factorization Γ^{−1} = R^T R, i.e. A = R^T is one of the possible reconstructions.
The level curves of the exponent yield useful information. It simplifies notation to work with the quadratic function

h(~y) = (1/2) (~y − ~y0)^T Γ (~y − ~y0) .

The minimal value of h is 0, attained at ~y = ~y0. The level curves can be generated by the code presented in Example 3–25, using the eigenvectors and eigenvalues of Γ = (A A^T)^{−1}, i.e. the inverse of the covariance matrix.
Figure 3.19: Raw data and level curves for the likelihood function. Inside the level curves find 99%, 87%
or 39% of the data points.
This is used to determine the domains of confidence with a level of confidence (1 − α), i.e. a level of significance α. For n = 2 use (3.6) and (3.7) to solve

∫₀^{rmax} e^{−r²/2} r dr = 1 − e^{−rmax²/2} = 1 − α .
Then (1 − α) 100% of the points are in the domain kA^{−1} ~yk = k~xk ≤ rmax. Find the results for rmax = 1, 2, 3 with the corresponding confidence levels in Figure 3.19. Assuming ~y0 = ~0 this domain can be determined using the covariance matrix M by

rmax² ≥ kA^{−1} ~yk² = ⟨A^{−1} ~y, A^{−1} ~y⟩ = ⟨~y, A^{−T} A^{−1} ~y⟩ = ⟨~y, (A A^T)^{−1} ~y⟩ = ⟨~y, M^{−1} ~y⟩ .

To generate graphs use the eigenvalues and eigenvectors of the covariance matrix, see Example 3–25. The result has to be compared to the usual individual intervals of confidence for y1 and y2. Since two conditions have to be satisfied, use a level of confidence of √(1 − α) and read out the standard deviations σi as square roots of the diagonal entries in the covariance matrix. Then use the normal distribution by c = norminv(1 − (1 − √(1 − α))) ≈ norminv(1 − α/2) to construct the intervals of confidence for y1 and y2:

mean(yi) − c σi ≤ yi ≤ mean(yi) + c σi .
The ellipse provides better information on the location of the data points, since the correlation between y1
and y2 is taken into account.
Figure 3.20: Raw data and the domain of confidence at level of significance α = 0.05 and the PCA
• If λi is small the function h will grow slowly in that direction and the level curves will be far apart.
To visualize the above one may rescale the eigenvectors ~ei to length √λi and then draw them starting at the center point ~y0. Find an example in Figure 3.19(b).

Since M = Γ^{−1} = A A^T = (1/N) Y Y^T the eigenvalues λi of Γ lead to the eigenvalues µi = 1/λi of A A^T.
This leads to the Principal Component Analysis, or short PCA. The main application of PCA is dimension
reduction, i.e. instead of many dimensions reduce the data such that only the main (principal) components
have to be examined. Such a principal component is a linear combination of the available variables.
If λ1 ≤ λ2 ≤ λ3 ≤ . . . then µ1 ≥ µ2 ≥ µ3 ≥ . . ., i.e. large eigenvalues µi indicate a slow growth of the function h and thus the data is spread out in the direction of the corresponding eigenvector. If ~e1 is normalized then compute the principal component of the data, i.e. the projection of the data in the direction of ~e1, given by ⟨~e1, ~y − ~y0⟩. The data can then be displayed at ⟨~e1, ~y − ~y0⟩ ~e1 ∈ R2. The second principal component is given by ⟨~e2, ~y − ~y0⟩. Find the graphical representation in Figure 3.20(b).
• Determine the eigenvalues and eigenvectors using the command eig(), then sort them with sort(..., 'descend') such that the eigenvalues are in decreasing order (a minimal sketch follows after this list).
• Compute the length of the projection of the data into the directions of the eigenvectors, leading to the
scores.
• Multiply the scores with the eigenvectors to obtain the projection of the data onto the eigen directions.
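A minimal sketch of these three steps, with synthetic data as an assumption; it also generates the variables m, PCAdata1 and PCAdata2 used in the plot below:

data = randn(200,2)*[2 1; 0 0.5];   % synthetic data, an assumption
m = mean(data); Y = data - m;       % center the data
[evec,eval] = eig(Y'*Y/size(Y,1));  % eigenvalues/vectors of the covariance
[lambda,idx] = sort(diag(eval),'descend');
evec = evec(:,idx);                 % eigenvectors in decreasing order
score1 = Y*evec(:,1); score2 = Y*evec(:,2);  % lengths of the projections
PCAdata1 = score1*evec(:,1)';       % data projected onto the first direction
PCAdata2 = score2*evec(:,2)';       % ... and onto the second direction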
figure(4)
plot(data(:,1),data(:,2),’.b’,m(1) + PCAdata1(:,1), m(2) + PCAdata1(:,2),’.r’,...
m(1) + PCAdata2(:,1), m(2) + PCAdata2(:,2),’.g’)
legend(’data’,’PCA 1’,’PCA 2’,’location’,’southeast’)
3–37 Example : PCA is not restricted to problems in 2 dimensions, it is only the visualization that is more challenging. In Figure 3.21 find in blue a cloud of data points in R3. The data points seem to be rather close to a plane. A PCA for the first two components was performed and each data point was projected onto the plane spanned by the eigenvectors belonging to these two components. This leads to the red points in the figure. Observe that this approach is different from using linear regression to determine the best fitting plane.
• With a linear regression using z = c0 + c1 x + c2 y the sum of the squared vertical distances is
minimized.
• With PCA the sum of the squared orthogonal distances to the plane is minimized. Consult the next
section, where PCA is regarded as an optimization problem.
[Figure 3.21: data points in R3 (blue) and their projections (red) onto the plane spanned by the first two principal directions]
To find the extrema differentiate with respect to the components ek of the normalized vector ~e.
∂f(~e)/∂ek = Σᴺⱼ₌₁ ( Σⁿᵢ₌₁ yi,j ei ) yk,j = Σⁿᵢ₌₁ ( Σᴺⱼ₌₁ yi,j yk,j ) ei = Σⁿᵢ₌₁ (Y Y^T)k,i ei

∇f(~e) = Y Y^T ~e
Thus the Lagrange multiplier theorem for constrained optimization with the constraint k~ek² = 1 and ∇k~ek² = 2 ~e lead to

Y Y^T ~e = λ ~e ,

i.e. the direction ~e has to be an eigenvector of the matrix Y Y^T ∈ Mn×n. Use a directional derivative of f(~e) to conclude

d/dc f(c ~e) = ⟨~e, ∇f(c ~e)⟩ = c ⟨~e, Y Y^T ~e⟩ = c ⟨~e, λ ~e⟩ = c λ

f(1 ~e) = f(~0) + ∫₀¹ d/dc f(c ~e) dc = 0 + ∫₀¹ c λ dc = λ/2

and thus f(~e) = λ/2. The largest value of f(~e) is attained for the eigenvector belonging to the largest eigenvalue λmax.
eigenvalue λmax .
Figure 3.22: Projection of data in the direction of the first principal component. Original data in blue and
the first principal component in red. The original data points and their projections are connected by green
lines.
This is illustrated by Figure 3.22: the direction of the first principal component is such that the sum of the squared distances of the red points from the origin is largest, while the sum of the squared lengths of the green lines is smallest. This is based on the observation that the square of the principal component and the squared length of the corresponding green line add up to k~yjk², and the sum Σᴺⱼ₌₁ k~yjk² does not depend on the direction ~e.
[coeff,score,latent] = princomp(data);
dataPCA = score(:,1)*coeff(1,:);
figure(1); plot(data(:,1),data(:,2),’+b’,dataPCA(:,1),dataPCA(:,2),’+r’)
hold on
for ii = 1:length(data)
plot([data(ii,1),dataPCA(ii,1)],[data(ii,2),dataPCA(ii,2)],’g’)
end%for
hold off
Y = U D V^T
Y Y^T = U D V^T V D^T U^T = U D D^T U^T
D D^T = diag(σ1², σ2², . . . , σn²) ∈ Mn×n
Thus the SVD of the matrix Y ∈ Mn×N leads to the eigenvalues σi2 of the matrix Y YT with the n
eigenvectors in the columns of U ∈ Mn×n . This is the information required for the PCA. Examine the
source code of princomp.m in Octave to realize that SVD is used to determine the PCA. Observe that the
matrix YYT does not have to be computed. This might make a difference for very large data sets, e.g. for
machine learning.
help princomp
-->
-- Function File: [COEFF] = princomp(X)
-- Function File: [COEFF,SCORE] = princomp(X)
-- Function File: [COEFF,SCORE,LATENT] = princomp(X)
-- Function File: [COEFF,SCORE,LATENT,TSQUARE] = princomp(X)
-- Function File: [...] = princomp(X,’econ’)
Performs a principal component analysis on a NxP data matrix X
References
----------
1. Jolliffe, I. T., Principal Component Analysis, 2nd Edition, Springer, 2002
[coeff,score,latent] = princomp(data);
The statistics toolbox9 (i.e. $$$) in MATLAB contains the command pca(), which generates the iden-
tical result by [coeff,score,latent] = pca(data); .
MATLAB and Octave provide the command pcacov() to determine the PCA for a given covariance
matrix.
The resulting matrix M is called correlation matrix. It has numbers 1 on the diagonal and all other entries
are smaller than 1. This has the effect that a unit step in any of the coordinate axes will lead to the same
increase for the function value.
Since the matrix is rescaled the scales of the coordinates have to be adapted, leading to

h(~y) = (1/2) ⟨(~y − ~y0), Γ (~y − ~y0)⟩ = (1/2) ⟨(~y − ~y0), M^{−1} (~y − ~y0)⟩
     = (1/2) ⟨(~y − ~y0), S (S^{−1} M^{−1} S^{−1}) S (~y − ~y0)⟩
     = (1/2) ⟨(S(~y − ~y0)), (S^{−1} M^{−1} S^{−1}) (S(~y − ~y0))⟩ ,

i.e. the rescaled correlation matrix replaces the covariance matrix M.
Now rerun the eigenvalue/eigenvector algorithm for the correlation matrix M and display the result in a
rescaled graph, using identical scales in all coordinate directions. The eigenvectors appear orthogonal, find
an example in Figure 3.23.
9
On the Mathworks web site find a code princomp.m by B. Jones which works on MATLAB.
Figure 3.23: Scaled data and level curves for the likelihood function
• In subsection 3.3.1 the trapezoidal rule and Simpson’s rule are used to evaluate integrals for functions
given by data points.
• In subsection 3.3.2 possible algorithms to evaluate integrals of functions given by an expression (for-
mula) are examined. The brilliant idea of Gauss is presented and the method of adaptive integration
is introduced.
• In subsection 3.3.4 some ideas and Octave/MATLAB codes on how to integrate over domains in the
plane R2 are introduced.
The idea of a trapezoidal integration is to replace the function f(x) by a piecewise linear interpolation. Each segment is a trapezoid, whose area is easy to determine by

∫_{xi−1}^{xi} f(x) dx ≈ area of trapezoid = width · average height = (xi − xi−1) · ( f(xi−1) + f(xi) ) / 2 .
For n = 5 sub-intervals use h = (b − a)/5 and xi = a + i h for i = 0, . . . , 5 and

∫_a^b f(x) dx ≈ h ( (1/2) f(a) + f(x1) + f(x2) + f(x3) + f(x4) + (1/2) f(b) ) = (h/2) ( f(a) + f(b) + 2 Σ⁴ᵢ₌₁ f(xi) ) .
[Figure 3.24: trapezoidal approximation of y = f(x) on the interval [a, b] with nodes x0, x1, . . . , x5]
For general n this leads to the weights

(1 , 2 , 2 , 2 , . . . , 2 , 2 , 1) · (b − a)/(2n) .

The verification of this result is given as an exercise and the arguments to be used are very similar to Result 3–40.
With the commands trapz() and cumtrapz() the integral is evaluated, using the trapezoidal rule on each of the sub-intervals. In Figure 3.25 find the results for the elementary integral I = ∫₀^{π/2} cos(x) dx = 1.
• For n = 3 sub-intervals the trapezoidal rule leads to the answer I ≈ 0.9770. Since the graph of cos(x) is above the piecewise linear interpolation used by the trapezoidal rule, it is no surprise that the approximate answer is too low.

• Using cumtrapz() the integral I(x) = ∫₀^x cos(t) dt = sin(x) is approximated, again leading to values that are too small.

• For n = 100 sub-intervals the trapezoidal rule leads to the answer I ≈ 0.999979, i.e. considerably more accurate. Similarly for the indefinite integral ∫₀^x cos(t) dt, where the true integral sin(x) and the approximate curve are indistinguishable.
Figure 3.25: The integral ∫₀^{π/2} cos(x) dx by the trapezoidal rule
Proof : The goal is to find constants c−1, c0 and c+1 such that

∫_{−h}^{+h} u(x) dx ≈ h ( c−1 u(−h) + c0 u(0) + c+1 u(+h) )

with an error as small as possible. To arrive at this goal use a Taylor approximation at x = 0

u(x) = u(0) + u'(0) x + u''(0)/2 x² + u'''(0)/6 x³ + u⁽⁴⁾(0)/4! x⁴ + u⁽⁵⁾(0)/5! x⁵ + u⁽⁶⁾(0)/6! x⁶ + O(x⁷) .   (3.10)

For small values of x contributions of low order are of higher importance. Thus proceed contribution by contribution, starting with the low order terms, i.e. first 1, then x, then x², . . .

1 : The contribution u(x) = 1 is integrated exactly if
∫_{−h}^{+h} 1 dx = 2h = h ( c−1 · 1 + c0 · 1 + c+1 · 1 )

x : The contribution u(x) = x is integrated exactly if
∫_{−h}^{+h} x dx = 0 = h ( c−1 (−h) + c0 · 0 + c+1 h )

x² : The contribution u(x) = x² is integrated exactly if
∫_{−h}^{+h} x² dx = (2/3) h³ = h ( c−1 h² + c0 · 0 + c+1 h² )

These equations have a unique solution c−1 = c+1 = 1/3 and c0 = 4/3 and thus the approximation formula for the integral is

∫_{−h}^{+h} u(x) dx = (h/3) ( u(−h) + 4 u(0) + u(+h) ) + error_{2h} .
• To obtain the estimate for the integral from a to b use n = (b − a)/(2h) of those integrals, leading10 to

  error[a,b] ≤ (b − a)/180 · max_{a≤ξ≤b} |u⁽⁴⁾(ξ)| · h⁴ .
The above Simpson integration over an interval [a, b] can only be applied if an even number n of sub-intervals is used, i.e. an odd number of data points xi = a + i (b − a)/n for i = 0, 1, 2, . . . , n. If an odd number of intervals of equal length h is used, one can use the Simpson 3/8–rule to preserve the order of convergence, i.e. the error remains proportional to h⁴:

∫_{x0}^{x3} f(x) dx ≈ (3/8) h ( f(x0) + 3 f(x1) + 3 f(x2) + f(x3) )
3–41 Example : Examine the elementary integral ∫₀^{π/2} sin(x) dx = 1. The comparison of the trapezoidal rule, Simpson's formula and the Gauss method should convince you that Gauss is often extremely efficient.

• The trapezoidal rule with h = π/4 leads to

  ∫₀^{π/2} sin(x) dx ≈ (π/4)/2 ( sin(0) + 2 sin(π/4) + sin(π/2) ) = (π/8) ( 0 + 2 (√2/2) + 1 ) ≈ 0.9481 .

• Simpson's formula with h = π/4 leads to

  ∫₀^{π/2} sin(x) dx ≈ (π/4)/3 ( sin(0) + 4 sin(π/4) + sin(π/2) ) = (π/12) ( 0 + 4 (√2/2) + 1 ) ≈ 1.0023 .

  Observe that Simpson's approach generates a smaller error than the trapezoidal rule.

• Using the Gauss idea from Result 3–43 in the next section with one sub-interval with three points leads to an even more accurate result.

  ∫₀^{π/2} sin(x) dx ≈ (π/4)/9 ( 5 sin(π/4 (1 − √(3/5))) + 8 sin(π/4) + 5 sin(π/4 (1 + √(3/5))) ) ≈ 1.000008
10
The proof shown in these notes only leads to an estimate of the error, not a rigorous proof. The simplicity of the argument used
justifies the lack of mathematical rigor. The statement is correct though, find a proof in [RalsRabi78, §4].
x0 = pi/4; h = pi/4;
xm = x0 - sqrt(3/5)*h; xp = x0 + sqrt(3/5)*h;
Gauss = h/9*(5*sin(xm) + 8*sin(x0) + 5*sin(xp))
difference = Gauss - 1
-->
Gauss = 1.0000081216
difference = 8.1216e-06
3–42 Example : Simpson's algorithm is easy to implement in Octave/MATLAB and can compute integrals rather accurately. Find one possible implementation in simpson.m below and determine approximate values of ∫₀^{π/2} cos(x) dx = 1, using 10 and 100 sub-intervals.
format long
n = 10 ; x = linspace(0,pi/2,n+1); Integral10 = simpson(cos(x),0,pi/2,n)
n = 100 ; x = linspace(0,pi/2,n+1); Integral100 = simpson(cos(x),0,pi/2,n)
-->
Integral10 = 1.000003392220900
Integral100 = 1.000000000338236
♦
simpson.m
function res = simpson(f,a,b,n)
if isa(f,’function_handle’)
n = round(n/2+0.1)*2; %% assure even number of subintervals
h = (b-a)/n;
x = linspace(a,b,n+1);
f_x = x;
for k = 0:n
f_x(k+1) = feval(f,x(k+1));
end%for
else
n = length(f);
if (floor(n/2)-n/2==0)
error(’simpson: odd number of data points required’);
else
n = n-1;
h = (b-a)/n;
f_x = f(:)’;
end%if
end%if
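%% The remainder of simpson.m was lost in the source; a minimal completion
%% applying the composite Simpson weights 1,4,2,4,...,2,4,1 :
res = h/3*(f_x(1) + 4*sum(f_x(2:2:end-1)) + 2*sum(f_x(3:2:end-2)) + f_x(end));
end%function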
• Examine a function that is not smooth at a point inside the interval, e.g. f(x) = |cos(x)|, which is not differentiable at x = π/2. Then determine the integral

  ∫₀² |cos(x)| dx = ∫₀^{π/2} cos(x) dx + ∫_{π/2}² (−cos(x)) dx = 1 + 1 − sin(2) ≈ 1.09070 .

  Using Simpson's method with n = 100 sub-intervals leads to an error of ≈ 3·10⁻⁵, while integration of the smooth function cos(x) leads to a much smaller error of ≈ 8·10⁻¹⁰. The problem can be avoided by splitting up the integral ∫₀² = ∫₀^{π/2} + ∫_{π/2}². Many Octave/MATLAB integration routines use options to indicate this type of special points inside the interval of integration.
Integral100A = simpson(cos(x),0,2)
ErrorA = Integral100A - sin(2)
-->
Integral100 = 1.0907
Error = 2.7390e-05
Integral100A = 0.9093
ErrorA = 8.0830e-10
• For a function with an infinite integrand the Simpson and the trapezoidal rule fail. Find information
on how to handle functions with limx→a f (x) = ∞ in [IsaaKell66, §6.2, p.346].
CHAPTER 3. NUMERICAL TOOLS 172
The above is just one example of the Gauss integration idea. It is possible to use fewer or more integra-
tion points. The literature on the topic is vast, see e.g. https://en.wikipedia.org/wiki/Gaussian quadrature .
There are tables for the different Gauss integration schemes, e.g. the wikipedia page, [AbraSteg, Table 25.4]
or the online version [DLMF15, §3.5] at http://dlmf.nist.gov/, or [TongRoss08, Table 6.1, page 188]. The
two point approximation is

∫_{−h}^{+h} f(x) dx ≈ h ( f(−√(1/3) h) + f(+√(1/3) h) ) .
3–44 Example : The Gauss algorithm for a three or five point integration is implemented in Octave11. Find one possible implementation in IntegrateGauss.m with the built-in help.
help IntegrateGauss
-->
INTEGRAL = IntegrateGauss (X,F,N)
parameters:
* F the function, given as a function handle or string with the name
* X the vector of x values with the domain of integration on each
subinterval x_i to x_(i+1) a three or five point Gauss integration is used
* N if this optional parameter equals 5, a five point Gauss formula is used,
the default value is 3
n = 10 ; x = linspace(0,pi/2,n+1);
Integral3 = IntegrateGauss(x,@(x)sin(x))
Error3 = Integral3-1
Integral5 = IntegrateGauss(x,@(x)sin(x),5)
Error5 = Integral5-1
-->
Integral3 = 1.0000
Error3 = 7.4574e-12
Integral5 = 1.0000
Error5 = -1.1102e-16
The result shows that a Gauss integration with 10 sub-intervals and five Gauss points already generates a
result whose accuracy is limited by the CPU accuracy. ♦
IntegrateGauss.m
function Integral = IntegrateGauss (x,f,n)
if nargin <= 2, n = 3;   %% default is 3 Gauss points
elseif n!=5, n = 3;
endif
x = x(:)’; dx = diff(x);
if n==3
11
The source code has to be modified to run under MATLAB.
The method of Gauss to integrate is used extensively by the method of finite elements to integrate over
triangles or rectangles, see Sections 6.5.1 and 6.8.
• Trapezoidal rule: the error is expected to be proportional to h² and thus a straight line with slope 2.

• Simpson: the error is expected to be proportional to h⁴ and thus a straight line with slope 4.

• Gauss: with three points, the error is expected to be proportional to h⁶ and thus a straight line with slope 6.

All of the above is confirmed by Figure 3.26 and Table 3.5. The surprising horizontal segments for the trapezoidal and Simpson's rule between n = 2 and n = 4 are caused by the zeros of sin(2x), which coincide with the points of evaluation. Thus the algorithms estimate the value of the integral by 0. ♦
Table 3.5: Approximation errors of three integration algorithms. The examined integral is shown in (3.11) and the width of the sub-intervals is h = 2π/n.

12 Use the laws of logarithms and z = C · h^p to conclude log(z) = log(C) + p · log(h).
Figure 3.26: The approximation errors of three integration algorithms as function of the stepsize
For N = 6 the composite trapezoidal rule uses the weights (1 2 2 2 2 2 1) · (b − a)/12 at the seven equidistant points.
The summations for TN and T2N are not independent, since all points used by TN are also used by T2N, and some more.

TN = ( f(a) + 2 Σᴺ⁻¹ᵢ₌₁ f(a + i (b−a)/N) + f(b) ) · (b−a)/(2N)
T2N = ( f(a) + 2 Σ²ᴺ⁻¹ᵢ₌₁ f(a + i (b−a)/(2N)) + f(b) ) · (b−a)/(4N)

represented by the weights

N = 6 :  (1 2 2 2 2 2 1) · (b−a)/12
N = 12 : (1 2 2 2 2 2 2 2 2 2 2 2 1) · (b−a)/24

Thus conclude

T2N = (1/2) TN + (b−a)/(2N) Σᴺᵢ₌₁ f( a + (2i − 1) (b−a)/(2N) ) .

The function has only to be evaluated at the new points. Using the sequence of weights

T6 :  (2 4 4 4 4 4 2) · (b−a)/(2·12)
T12 : (1 2 2 2 2 2 2 2 2 2 2 2 1) · (b−a)/(2·12)
S12 : (1 4 2 4 2 4 2 4 2 4 2 4 1) · (b−a)/(3·12)

obtain Simpson's approximation S12 = (4 T12 − T6)/3.
Based on Gauss integration a similar saving of evaluations of the function f is not possible, since only
points inside the sub-intervals are used.
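A small sketch of this refinement, including the combination S2N = (4 T2N − TN)/3 from above; the integrand and the interval are arbitrary choices:

f = @(x) sin(2*x).*exp(-x); a = 0; b = 2*pi;
N = 6; h = (b-a)/N; x = a + (0:N)*h;
T_N = h*(f(a)/2 + sum(f(x(2:end-1))) + f(b)/2);
xnew = a + (2*(1:N)-1)*(b-a)/(2*N);       % only the N new midpoints
T_2N = T_N/2 + (b-a)/(2*N)*sum(f(xnew));  % reuse all old function values
S_2N = (4*T_2N - T_N)/3                   % Simpson value from T_N and T_2N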
For graphs similar to Figure 3.27(a) it works well to use finer meshes on all of the interval, but for
graphs similar to Figure 3.27(b) a local adaptation is asked for, i.e. use more points where the function
[Figure 3.27: (a) a smooth graph, (b) a graph where a local refinement is necessary]
values vary severely. As example examine Simpson's approach on a sub-interval of length 4h starting at xi and compare the two approximations of the integral:

∫_{xi}^{xi+4h} f(x) dx ≈ R1 = (2h/3) ( f(xi) + 4 f(xi+2) + f(xi+4) )
∫_{xi}^{xi+4h} f(x) dx ≈ R2 = (h/3) ( f(xi) + 4 f(xi+1) + 2 f(xi+2) + 4 f(xi+3) + f(xi+4) )

If the difference |R1 − R2| is too large the interval has to be divided, otherwise proceed with the given result. The integration algorithms provided by MATLAB/Octave use local adaptation.
The function to be integrated has to be vectorized, e.g. defined in a file ff.m as

function res = ff(x)
  res = x.*sin(1./x).*sqrt(abs(x-1));
end%function

ff([0.2:0.2:2])
-->
-0.1715 0.1854 0.3777 0.3395 0 0.3972 0.5800 0.7251 0.8491 0.9589

This is essential for the integration routines and required for speed reasons.
– With Octave the definition of this function can be before the function call in an Octave script or
in a separate file ff.m .
– With MATLAB the definition of this function has to be in a separate file ff.m or at the end of a
MATLAB script.
result = quad(’ff’,0,3)
-->
ans = 1.9819
• As a function handle
The above integral can also be computed by using a function handle.
ff = @(x) x.*sin(1./x).*sqrt(abs(x-1));
integral(ff,0,3)
-->
ans = 1.9819
• function handles are very convenient to determine integrals depending on parameters. To compute the
integrals

∫₀² sin(λ t) dt

for λ values between 1 and 2 use, for a given value lambda, a call like integral(@(t)sin(lambda*t),0,2).
integral(@(t)cos(t).*exp(-t),0,Inf)
-->
ans = 0.5000
The function integral() can be used with a few possible optional parameters, specified as pairs of the
string with the name of the option and the corresponding value. The most often used options are:
• AbsTol to specify the absolute tolerance. The default value is AbsTol = 10−10 .
• RelTol to specify the relative tolerance. The default value is RelTol = 10⁻⁶.
• The adaptive integration is refined to determine the value Q of the integral, until the condition

  |error| ≤ max( AbsTol , RelTol · |Q| )

  is satisfied, i.e. either the relative or the absolute error is small enough.
• Waypoints to specify a set of points at which the function to be integrated might not be continuous.
This can be used instead of multiple calls of integral() on sub-intervals.
Observe that this function is not differentiable at x = π/2. Thus the high order of approximation for Simpson and Gauss is not valid and problems are to be expected.
integral_exact = 2 - sin(2);
integral_1 = integral(@(x)abs(cos(x)),0,2)
integral_2 = integral(@(x)abs(cos(x)),0,2,’AbsTol’,1e-12,’RelTol’,1e-12)
integral_3 = integral(@(x)abs(cos(x)),0,2,’Waypoints’,pi/2)
Log10_Errors = log10(abs([integral_1, integral_2 integral_3] - integral_exact))
-->
integral_1 = 1.0907
integral_2 = 1.0907
integral_3 = 1.0907
Log10_Errors = -6.5642 -15.1764 -Inf
The results illustrate the usage of the tolerance parameters. Specifying the special point x = π/2 generates a more accurate result with less computational effort.
• To integrate a function f over the interval [a, b] use quad(f,a,b), where f is a string with the
name of the function or a function handle.
• With the optional third argument TOL a vector with the absolute and relative tolerances can be speci-
fied.
• With Octave the optional fourth argument SING is a vector of values where singularities are expected. In addition options can be read or set by calling quad_options.
a = 0; b = 2*pi;
[q, ier, nfun, err] = quad (@(x)sin(2*x).*exp(-x), a, b)
[q, ier, nfun, err] = quad (@(x)sin(x/2)-exp(-20*(x-2).ˆ2).*sin(10*x).ˆ2, a, b)
The results show that the first function required 21 evaluations of the function f1 (x) while f2 (x) had to be
evaluated 273 times. This is caused by the local adaptation around x ≈ 2, obvious in Figure 3.27(b).
[Q,num_eval] = quadv(@(t)[sin(t);cos(t)]*exp(-t),0,0.5*pi)
-->
Q = 0.3961
0.6039
num_eval = 17
The same can be obtained by calling integral() with the option ArrayValued .
Q = integral(@(t)[sin(t);cos(t)]*exp(-t),0,0.5*pi, ’ArrayValued’,true)
integral: a compatibility wrapper function that will choose between quadv and quadgk depending on the integrand and options chosen.
MATLAB/Octave provides commands to perform this double integration. The function f (x, y) depends on
two arguments, the corresponding MATLAB/Octave code has to accept matrices for x and y as arguments
and return a matrix of the same size.
3–46 Example : To integrate the function f(x, y) = x² + y over the rectangular domain 1 ≤ x ≤ 2 and 0 ≤ y ≤ 3 use
quad2d(@(x,y)x.ˆ2+y,1,2,0,3)
integral2(@(x,y)x.ˆ2+y,1,2,0,3)
-->
ans = 11.500
ans = 11.500
♦
[Figure 3.28: (a) a rectangular domain a ≤ x ≤ b, c ≤ y ≤ d, (b) a general domain with a ≤ x ≤ b and c(x) ≤ y ≤ d(x)]
Integrals over non rectangular domains as in Figure 3.28(b) can be given by a ≤ x ≤ b and x dependent limits for the y values c(x) ≤ y ≤ d(x). Integrals are computed by nested 1-D integrations

Q = ∬_Ω f(x, y) dA = ∫_a^b ( ∫_{c(x)}^{d(x)} f(x, y) dy ) dx .

MATLAB/Octave provides commands to perform this double integration. The only change is that the upper and lower limits are provided as functions of x.

Integrals over domains where the left and right limit are given as functions of y (i.e. a(y) ≤ x ≤ b(y)) are covered by swapping the order of integration,

Q = ∬_Ω f(x, y) dA = ∫_c^d ( ∫_{a(y)}^{b(y)} f(x, y) dx ) dy .
Ω
3–47 Example : Consider the triangular domain with corners (0, 0), (1, 0) and (0, 1). Thus the limits are 0 ≤ x ≤ 1 and 0 ≤ y ≤ 1 − x. To integrate the function

f(x, y) = (1 + x + y)² / √(x² + y²)

use

∬_Ω f(x, y) dA = ∫₀¹ ( ∫₀^{1−x} (1 + x + y)² / √(x² + y²) dy ) dx .
♦
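The same integral can be evaluated with integral2() and an x dependent upper limit for y; note that the integrand has an integrable singularity at the origin, which may require loosened tolerances:

f = @(x,y) (1+x+y).^2./sqrt(x.^2+y.^2);
Q = integral2(f,0,1,0,@(x)1-x)   % upper y limit given as a function of x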
3–48 Example : The previous example can also be solved using polar coordinates r and φ. Express the
area element dA in polar coordinates by
dA = dx · dy = r · dr dφ .
3–49 Observation :
• For triple integrals MATLAB/Octave provide the command integral3() with a syntax similar to
integral2() .
• To integrate over triangles or rectangles is a task required for the method of finite elements. Find
results in this direction in Sections 6.5.1 and 6.8.
Figure 3.29: Vector field and three solutions for a logistic equation
t_max = 5; p_max = 3;
[t,p] = meshgrid(linspace(0, t_max,20),linspace(0,p_max,20));
v1 = ones(size(t)); v2 = p.*(2-p);
figure(1); quiver(t,p,v1,v2)
xlabel(’time t’); ylabel(’population p’)
axis([0 t_max, 0 p_max])
[t1,p1] = ode45(@(t,p)p.*(2-p),linspace(0,t_max,50),0.4);
[t2,p2] = ode45(@(t,p)p.*(2-p),linspace(0,t_max,50),3.0);
[t3,p3] = ode45(@(t,p)p.*(2-p),linspace(0,t_max,50),0.01);
hold on
plot(t1,p1,’r’,t2,p2,’r’, t3,p3,’r’)
hold off
The predators y (e.g. sharks) are feeding on the prey x (e.g. small fish). The food supply for the prey is limited by the environment. The behavior of these two populations can be described by a system of two first order differential equations13

d/dt x(t) = (c1 − c2 y(t)) x(t)
d/dt y(t) = (c3 x(t) − c4) y(t)

13 Proposed by Alfred J. Lotka in 1910 and Vito Volterra in 1926.
where ci are positive constants. This function can be implemented in a function file VolterraLotka.m
in MATLAB/Octave.
VolterraLotka.m
function res = VolterraLotka(x)
c1 = 1; c2 = 2; c3 = 1; c4 = 1;
res = [(c1-c2*x(2))*x(1);
(c3*x(1)-c4)*x(2)];
end%function
With the help of the above function generate information about the solutions of this system of ODEs.
• Generate the data for the vector field and then use the command quiver to display the vector field,
shown in Figure 3.30(b).
• Use ode45() to generate numerical solutions, in this case with initial values (x(0), y(0)) = (2, 1)
and for 100 times 0 ≤ ti ≤ 15. Display the result in Figure 3.30.
(a) as function of time (b) vector field and solution
Figure 3.30: One solution and the vector field for the Volterra-Lotka problem
n = length(x); m = length(y); % x, y: grid vectors, e.g. generated by linspace()
Vx = zeros(n,m); Vy = Vx; % create zero matrices for the vector field
for i = 1:n
for j = 1:m
v = VolterraLotka([x(i),y(j)]); % compute the vector
Vx(i,j) = v(1); Vy(i,j) = v(2);
end%for
end%for
t = linspace(0,15,100);
[t,XY] = ode45(@(t,x)VolterraLotka(x),t,[2;1]);
figure(1); plot(t,XY)
xlabel('time'); legend('prey','predator'); axis([0,15,0,3]); grid on
• In Figure 3.31(a) find the graphs of x(t) and v(t) as functions of time t. The effect of the damping
term −α v(t) = −0.1 v(t) is clearly visible.
• In Figure 3.31(b) find the vector field and the computed solution. The horizontal axis represents the
displacement x and the vertical axis indicates the velocity v = ẋ. This is the phase portrait of the
second order ODE.
Figure 3.31: (a) displacement and velocity, (b) phase portrait

for i = 1:n
for j = 1:m
z = Spring([y(i),v(j)]); % compute the vector
Vx(i,j) = z(1); Vy(i,j) = z(2); % store the components
end%for
end%for
t = linspace(0,25,100);
[t,XY] = ode45(@(t,y)Spring(y),t,[0;1]);
figure(1); plot(t,XY)
xlabel('time'); legend('displacement','velocity')
grid on
Consider the initial value problem

   d/dt x(t) = x(t)^2 − 2 t with x(0) = 0.75 .
Thus search for a curve following the corresponding vector field in Figure 3.32.

Figure 3.32: Vector field and solution of the ODE d/dt x(t) = x(t)^2 − 2 t. The true solution is displayed in red, the solution generated by one Euler step in blue and the result using four Euler steps in green.

Use the definition of the derivative

   d/dt x(0) = lim_{h→0} (x(h) − x(0)) / h
at t = 0. Instead of the limit use a "small" value for h. The differential equation is transformed into an
algebraic equation

   (x(h) − x(0)) / h = f(0, x(0)) .
This can easily be solved for x(h):

   x(h) = x(0) + h f(0, x(0)) .
Thus the first step of this method leads to the straight line in Figure 3.32. More generally, to move from
time t to time t + h use
x(t + h) = x(t) + h f (t, x(t)) .
This is Euler's method, also called the explicit Euler method.
Above the step size h = 1 was used to determine x(1). To obtain a better approximation use smaller
values for h. With the step size h = 1/4 = 0.25 find x(1) ≈ 0.7975. This leads to the four straight line
segments in Figure 3.32 and this approximate solution is already closer to the exact solution.
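A minimal sketch of these four Euler steps in MATLAB/Octave:

f = @(t,x) x^2 - 2*t; % RHS of the ODE
x = 0.75; t = 0; h = 0.25; % initial value and step size
for k = 1:4 % four Euler steps from t = 0 to t = 1
  x = x + h*f(t,x); t = t + h;
end
x % approximation of x(1)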
Figure 3.33: One step of the Heun method for the ODE d/dt x(t) = x(t)^2 − 2 t with x(0) = 0.75 and step size h = 1 (slopes k_1, k_2 and averaged slope k)
For the initial value problem d/dt x(t) = x(t)^2 − 2 t with x(0) = 0.75 the calculations for one Heun step
of length h = 1 are given by

   k_1 = f(t_0, x_0) = f(0, 0.75) = 0.5625
   k_2 = f(t_0 + h, x_0 + h k_1) = f(1, 1.3125) = −0.27734375
   k = (k_1 + k_2)/2 = 0.142578125
   x(1) ≈ x(0) + h k = 0.75 + 1 · 0.142578125 = 0.892578125
This is a better approximation than the one generated by one Euler step. The above computations can be
performed by MATLAB/Octave:
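A minimal sketch of this computation:

f = @(t,x) x^2 - 2*t; % RHS of the ODE
t0 = 0; x0 = 0.75; h = 1;
k1 = f(t0,x0) % 0.5625
k2 = f(t0+h,x0+h*k1) % -0.27734375
k = (k1+k2)/2 % 0.142578125
x1 = x0 + h*k % 0.892578125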
To perform one Runge–Kutta step with step size h from t = t_i to t = t_{i+1} = t_i + h use the following computational scheme:

   k_1 = f(t_i, x_i)
   k_2 = f(t_i + h/2, x_i + k_1 h/2)
   k_3 = f(t_i + h/2, x_i + k_2 h/2)
   k_4 = f(t_i + h, x_i + k_3 h)
   x_{i+1} = x_i + (h/6) (k_1 + 2 k_2 + 2 k_3 + k_4)
   t_{i+1} = t_i + h
At four different positions the function f (t, x) is evaluated, leading to four slopes ki for the solution of
the ODE. Then one time step is performed with a weighted average of the four slopes. This is visualized in
Figure 3.34.
Figure 3.34: One step of the Runge–Kutta method of order 4 for the ODE d/dt x(t) = x(t)^2 − 2 t with x(0) = 0.75 and step size h = 1 (slopes k_1, k_2, k_3, k_4 and averaged slope k)
For the initial value problem d/dt x(t) = x(t)^2 − 2 t with x(0) = 0.75 one Runge–Kutta step of length h = 1 leads to x(1) ≈ 0.42368.
The advanced numerical solver ode23() generated the answer x(1) ≈ 0.32449. Thus the answer of the
RK algorithm is considerably closer than the answer of x(1) ≈ 0.7975 generated by four Euler steps. For
both approaches the right hand side of the ODE was evaluated four times, i.e. a comparable computational
effort. The above computations can be performed by MATLAB or Octave:
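A minimal sketch of the single Runge–Kutta step (values in the comments rounded):

f = @(t,x) x^2 - 2*t; % RHS of the ODE
t0 = 0; x0 = 0.75; h = 1;
k1 = f(t0,x0) % 0.5625
k2 = f(t0+h/2,x0+h/2*k1) % 0.063477
k3 = f(t0+h/2,x0+h/2*k2) % -0.388885
k4 = f(t0+h,x0+h*k3) % -1.869596
x1 = x0 + h/6*(k1+2*k2+2*k3+k4) % approx. 0.42368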
Thus for small values of h obtain the local error estimate for one Euler step

   |y_E(t_0 + h) − y(t_0 + h)| ≤ C_E h^2    (3.12)

where the "constant" C_E is related to second derivatives of the function f(t, y). Thus the local discretization
error of the Euler method is of order 2. To arrive at a final time t_0 + T one has to apply n = T/h steps and
each step might add some error. This leads to the global discretization error

   |y_E(t_0 + T) − y(t_0 + T)| ≤ C̃_E h .    (3.13)

Thus the global discretization error of the Euler method is of order 1.
Similar arguments (with more tedious computations) can be performed for the method of Heun and
Runge–Kutta, leading to Table 3.6.
Table 3.6: Discretization errors for the methods of Euler, Heun and Runge–Kutta
For most problems Runge–Kutta is the most efficient algorithm to generate approximate solutions to
ODEs. There are multiple aspects to be taken into account when comparing Euler's method to Runge–Kutta:
2. Differentiability of f(t, x)
For the above error estimates to be correct Euler requires that the function f be twice differentiable,
while Runge–Kutta requires that f is four times differentiable.
Advantage: Euler
3. Order of consistency
The global discretization error for Euler is proportional to h, while it is proportional to h^4 for Runge–Kutta.
Advantage: Runge–Kutta.
3–53 Example : The last of the above arguments is by far the most important. This is illustrated by an
example, see [MeybVach91]. The initial value problem d/dt y(t) = 1 + (y(t) − t)^2 with y(0) = 0.5 is
solved by y(t) = t + 1/(2 − t), e.g. y(1.8) = 6.8. Use n time steps of length h = 1.8/n to approximate y(1.8)
numerically. This leads to Table 3.7. With 72 calls of the RHS by Runge–Kutta the global error is of the
same size as with 7200 calls by Euler. ♦
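A sketch to reproduce the flavor of this comparison; the exact numbers in Table 3.7 may differ slightly:

f = @(t,y) 1 + (y-t)^2; y_exact = 6.8; % exact value y(1.8)
n = 7200; h = 1.8/n; y = 0.5; t = 0; % Euler with 7200 calls of the RHS
for k = 1:n, y = y + h*f(t,y); t = t + h; end
err_Euler = abs(y - y_exact)
n = 18; h = 1.8/n; y = 0.5; t = 0; % Runge-Kutta with 72 calls of the RHS
for k = 1:n
  k1 = f(t,y); k2 = f(t+h/2,y+h/2*k1);
  k3 = f(t+h/2,y+h/2*k2); k4 = f(t+h,y+h*k3);
  y = y + h/6*(k1+2*k2+2*k3+k4); t = t + h;
end
err_RK = abs(y - y_exact)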
Thus it is no surprise that there is a close connection between numerical integration and solving ODEs.
Compare the order of convergence of the different methods. It stands out that the order of the error for
Runge–Kutta is higher than expected.
rectangle rule                                     method of Euler
I ≈ h f(x)                                         k_1 = f(u(x)),  u(x+h) ≈ u(x) + h k_1
error = O(h^2)                                     local error = O(h^2)

trapezoidal rule                                   method of Heun
I ≈ (h/2) (f(x) + f(x+h))                          k_1 = f(u(x)),  k_2 = f(u(x) + k_1 h),
                                                   u(x+h) ≈ u(x) + (h/2) (k_1 + k_2)
error = O(h^3)                                     local error = O(h^3)

Simpson's rule                                     method of Runge–Kutta (RK4)
I ≈ (h/6) (f(x) + 4 f(x+h/2) + f(x+h))             k_1 = f(u(x)),  k_2 = f(u(x) + k_1 h/2),
                                                   k_3 = f(u(x) + k_2 h/2),  k_4 = f(u(x) + k_3 h),
                                                   u(x+h) ≈ u(x) + (h/6) (k_1 + 2 k_2 + 2 k_3 + k_4)
error = O(h^4)                                     local error = O(h^5)
Codes for Runge–Kutta, Heun and Euler with fixed step size
An ODE usually has to be solved on a given time interval. To apply the Runge–Kutta approach with a fixed
step size use the code ode_RungeKutta.m in Figure 3.35. The function takes several arguments:
• FunFcn: a string with the function name for the RHS of the ODE.
• t: a vector of time values at which the solution is returned. t(1) is the initial time.
• y0: the initial value, given as a column vector.
• steps: the number of Runge–Kutta steps to be taken between the output times.
The function will return the output times in Tout and the values of the solution in Yout.
The code has the following structure:
2. documentation
3. initialization
4. main loop
Very similar codes ode_Euler.m and ode_Heun.m implement the methods of Euler and Heun with
fixed step size. The usage is illustrated by the following example, solving the equation of a simple pendulum.
The differential equation

   d^2/dt^2 y(t) = −k sin(y(t))

describes the angle y(t) of a pendulum, possibly with large angles, since the approximation sin(y) ≈ y is
not used. This second order ODE is transformed to a system of order 1 by

   d/dt (y(t), v(t))^T = (v(t), −k sin(y(t)))^T .
This ODE leads to a function pend(), which is then used by ode_RungeKutta() to generate the solution
for times [0, 30] for different initial angles.
Pendulum.m
%% demo file to solve a pendulum equation
Tend = 30;
%% on Matlab put the definition of the function in a separate file pend.m
function y = pend(t,x)
k = 1;
y = [x(2);-k*sin(x(1))];
end%function
y0 = [pi/2; 0]; %% initial angle and angular velocity (illustrative values)
t = linspace(0,Tend,100);
[t,y] = ode_RungeKutta('pend',t,y0,10); % Runge-Kutta
% [t,y] = ode_Euler('pend',t,y0,10); % Euler
% [t,y] = ode_Heun('pend',t,y0,10); % Heun
plot(t,180/pi*y(:,1))
xlabel('time'); ylabel('angle [Deg]')
ode_RungeKutta.m
function [tout, yout] = ode_RungeKutta(Fun, t, y0, steps)
% [Tout, Yout] = ode_RungeKutta(fun, t, y0, steps)
%
% Integrate a system of ordinary differential equations using
% 4th order Runge-Kutta formula.
%
% INPUT:
% Fun - String containing name of user-supplied problem description.
% Call: yprime = Fun(t,y)
% T - Vector of times (scalar).
% y - Solution column-vector.
% yprime - Returned derivative column-vector; yprime(i) = dy(i)/dt.
% T(1) - Initial value of t.
% y0 - Initial value column-vector.
% steps - steps to take between given output times
%
% OUTPUT:
% Tout - Returned integration time points (column-vector).
% Yout - Returned solution, one solution column-vector per tout-value.
%
% The result can be displayed by: plot(tout, yout).
% Initialization
y = y0(:); yout = y'; tout = t(:);
3–55 Definition : A numerical method to solve ODEs is called stable iff for Re(λ) < 0 the numerical
solution of d/dt y(t) = λ y(t) satisfies

   lim_{t→∞} y(t) = 0 .
When using approximation methods to solve d/dt y(t) = λ y(t) one arrives in many cases at an iteration
formula of the type

   y_{i+1} = y(t_i + h) = g(λ h) y_i ,

i.e. at each time step the current value of the solution y(t_i) is multiplied by g(λ h) to arrive at y(t_i + h) =
y(t_{i+1}). In this case lim_{t→∞} y(t) = 0 is equivalent to lim_{n→∞} (g(λ h))^n = 0, which is equivalent to
|g(λ h)| < 1. The stability can be formulated using the stability function g(z) for z ∈ C.
• The computational scheme is stable in the domain in C where |g(λ h)| < 1.
• The computational scheme is called L–stable if in addition lim g(z) = 0 as z → −∞.
Based on Result 3–29 the stability carries over to linear systems of ODEs and by linearization to nonlinear
systems too.
Figure 3.36: Conditional stability of Euler's approximation to d/dt y(t) = λ y(t) with λ < 0
One can verify that the resulting difference equation is solved by

   y_i = y_0 (1 + h λ)^i .

For this expression to remain bounded independently of i the condition |1 + h λ| ≤ 1 is necessary. For
complex values of λ this condition is satisfied if z = λ h is inside a circle of radius 1 with center at −1 ∈ C.
This domain is visualized in Figure 3.38. For real, negative values of λ the condition simplifies to

   h |λ| < 2  ⟺  h < 2/|λ| .

This is an example of conditional stability, i.e. the scheme is only stable if the above condition on the
step size h is satisfied. To visualize the behavior examine the results in Figure 3.36 for solutions of the
differential equation d/dt y(t) = λ y(t).
• At the starting point the differential equation determines the slope of the straight line approximation
of the solution. The slope is independent of the step size h.
• If the step size is small enough then the numerical solution will not overshoot but converge to zero, as
expected.
• If the step size is too large then the numerical solution will overshoot and will move further and further
away from zero by each step.
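A small numerical check of this behavior, with illustrative values:

lambda = -10; f = @(t,y) lambda*y; % stability condition: h < 2/|lambda| = 0.2
for h = [0.15, 0.25]
  y = 1;
  for k = 1:20, y = y + h*f(0,y); end % 20 Euler steps
  disp([h, y]) % h = 0.15 decays, h = 0.25 blows up
end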
For the implicit (backward) Euler scheme approximate the differential equation at t = t_{i+1}:

   d/dt y(t_{i+1}) − λ y(t_{i+1}) ≈ (y_{i+1} − y_i)/h − λ y_{i+1} = 0 .

The differential equation is replaced by the difference equation

   (y_{i+1} − y_i)/h = λ y_{i+1}  ⟹  (1 − h λ) y_{i+1} = y_i .

One can verify that this difference equation is solved by

   y_i = y_0 / (1 − h λ)^i .

For this expression to remain bounded independently of i we need |1 − h λ| ≥ 1. For complex values of λ
this condition is satisfied if z = λ h is outside a circle of radius 1 with center at +1 ∈ C. This domain
is visualized in Figure 3.38. For real, negative values of λ the condition leads to 1 − h λ > +1, which is
automatically satisfied and we have unconditional stability. The method is A–stable and L–stable.
To visualize the behavior we examine the results in Figure 3.37 for solutions of the differential equation
d/dt y(t) = λ y(t).
• The slope of the straight line approximation is determined by the differential equation at the end point
of the straight line segment. Consequently the slope will depend on the step size h.
• If the step size is small enough then the numerical solution will not overshoot but converge to zero.
• Even if the step size is large the numerical solution will not overshoot zero, but converge to zero.
Figure 3.37: Unconditional stability of the implicit approximation to d/dt y(t) = λ y(t) with λ < 0
3–58 Result : Conditional stability for the methods of Heun and Runge–Kutta
Using one step of the method of Heun to solve d/dt y(t) = λ y(t) with y_i = y(t_i) and y_{i+1} = y(t_i + h) leads to

   k_1 = λ y_i
   k_2 = λ (y_i + h λ y_i) = (λ + h λ^2) y_i
   k = (k_1 + k_2)/2 = (2 λ + h λ^2) y_i / 2
   y_{i+1} = y_i + h k = (1 + h λ + h^2 λ^2/2) y_i
   y_i = (1 + h λ + h^2 λ^2/2)^i y_0
For this expression to remain bounded independently of i we need

   |1 + h λ + (h λ)^2/2| = |1 + z + z^2/2| ≤ 1 .

To examine this set in C use |exp(i α)| = 1 and solve

   e^{i α} = 1 + z + z^2/2
   2 e^{i α} = 2 + 2 z + z^2
   z^2 + 2 z + 2 − 2 e^{i α} = 0
   z_{1,2} = −1 ± sqrt(−1 + 2 e^{i α}) .

This generates an ellipse like curve between −2 and 0 on the real axis and ±i sqrt(3) along the imaginary
direction, visible in Figure 3.38. Heun's method is stable inside this domain, i.e. this is a conditional
stability.
For the Runge–Kutta method the corresponding inequality is¹⁵

   |1 + z + z^2/2 + z^3/6 + z^4/24| = |1 + (h λ) + (h λ)^2/2 + (h λ)^3/6 + (h λ)^4/24| ≤ 1

and the domain is displayed in Figure 3.38. The classical Runge–Kutta method is stable inside this domain,
i.e. this is a conditional stability. 3
¹⁵ Use elementary, tedious computations or software for symbolic calculations.
For the Crank–Nicolson scheme applied to d/dt y(t) = λ y(t) find

   (y_{i+1} − y_i)/h = λ (y_i + y_{i+1})/2
   y_{i+1} = ((2 + h λ)/(2 − h λ)) y_i
   y_i = ((2 + h λ)/(2 − h λ))^i y_0 .

The stability condition is

   |(2 + h λ)/(2 − h λ)| = |(2 + z)/(2 − z)| ≤ 1  ⟺  |2 + z| ≤ |2 − z| .
• For Re λ < 0 the two points 2 + h λ and 2 − h λ have imaginary parts of the same magnitude and |Re(2 + h λ)| is smaller than
|Re(2 − h λ)|. Thus the method is stable in the left half plane Re λ < 0.
• For Re λ > 0 the method is unstable, as it should be. The exact solution exp(λ t) grows exponentially
for t > 0.
The domain of stability is displayed in Figure 3.38. The method is unconditionally A–stable, but not L–stable. 3
Figure 3.38: Domains of stability for the methods of Euler, implicit Euler, Heun, RK4 and Crank–Nicolson (CN) in the complex plane of z = h λ
• Euler: stable in a circle with radius 1 and center at −1 + i 0 ∈ C. For Re(λ) < 0 and large time steps
h the method is unstable, i.e. conditional stability.
• implicit Euler: stable outside a circle with radius 1 and center at +1 + i 0 ∈ C. For Re(λ) < 0 the
method is stable for any time step h > 0, i.e. unconditional stability.
• Heun: stable in an ellipse like domain with real parts larger than −2. For Re(λ) < 0 and large time
steps h the method is unstable, i.e. conditional stability.
• RK4: stable in an odd shaped domain with real parts larger than −2.8. For Re(λ) < 0 and large time
steps h the method is unstable, i.e. conditional stability.
• CN: Crank–Nicolson, the method is stable in the complex half plane Re(z) < 0, i.e. unconditional
stability.
   k_1 = f(t_n, y_n)
   k_2 = f(t_n + c_2 h, y_n + h (a_21 k_1))
   k_3 = f(t_n + c_3 h, y_n + h (a_31 k_1 + a_32 k_2))
   k_4 = f(t_n + c_4 h, y_n + h (a_41 k_1 + a_42 k_2 + a_43 k_3))
   ...
   k_s = f(t_n + c_s h, y_n + h (a_s1 k_1 + a_s2 k_2 + · · · + a_s,s−1 k_{s−1}))
   y_{n+1} = y_n + h Σ_{i=1}^s b_i k_i

The coefficients are collected in the Butcher table

   0     |
   c_2   | a_21
   c_3   | a_31  a_32
   c_4   | a_41  a_42  a_43                        (3.14)
   ...   | ...
   c_s   | a_s1  a_s2  a_s3  · · ·  a_s,s−1
   ------+------------------------------------
   y_{n+1} | b_1  b_2  b_3  · · ·  b_{s−1}  b_s
Working from top to bottom for the general, explicit Runge–Kutta scheme one never has to solve an equation,
just evaluate the function f(t, y) and plug in. Thus this is an explicit scheme. One step of length h
requires s evaluations of f(t, y).
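As an illustration, a generic explicit Runge–Kutta step can be coded directly from the Butcher table; this is a sketch, the function name is hypothetical:

function ynew = rk_explicit_step(f,t,y,h,A,b,c)
  % one explicit RK step for the Butcher table given by A, b, c
  s = length(b); k = zeros(length(y),s);
  for i = 1:s % A is strictly lower triangular: no equation to solve
    k(:,i) = f(t+c(i)*h, y + h*k(:,1:i-1)*A(i,1:i-1)');
  end
  ynew = y + h*k*b(:);
end%function

With the coefficients of table (3.16) below this reproduces one step of the classical Runge–Kutta method.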
The classical Runge–Kutta method of order 4 is a 4 stage method with the Butcher table

   0    |
   1/2  | 1/2
   1/2  | 0    1/2                                 (3.16)
   1    | 0    0    1
   -----+----------------------
   y_{n+1} | 1/6  2/6  2/6  1/6
Observe that a general Butcher table leads to a system of equations for the slopes ~k ∈ R^s, thus it is an implicit scheme. For
nonlinear functions f(t, y) it is a nonlinear system of equations. For a system of n equations with ~y ∈ R^n it
leads to a nonlinear system of n · s equations for n · s unknowns. Since Newton's algorithm is used to solve
the system it is a good idea to provide the Jacobian matrix J to the algorithms. See Example 3–69 on how
to use the Jacobian.
   J = ∂f(t, ~y)/∂~y =
   [ ∂f_1/∂y_1  ∂f_1/∂y_2  · · ·  ∂f_1/∂y_n ]
   [ ∂f_2/∂y_1  ∂f_2/∂y_2  · · ·  ∂f_2/∂y_n ]
   [    ...        ...              ...     ]
   [ ∂f_n/∂y_1  ∂f_n/∂y_2  · · ·  ∂f_n/∂y_n ]
Applied to d/dt y(t) = λ y(t) the slopes of a Runge–Kutta scheme satisfy

   ~k = λ (y_n ~1 + h A ~k)  ⟺  (I − λ h A) ~k = λ y_n ~1 .

This leads to the stability function

   g(z) = 1 + z ⟨~b, (I − z A)^{−1} ~1⟩ = det(I − z A + z ~1 ~b^T) / det(I − z A)    (3.18)

where z = λ h. For a proof see [Butc03, Lemma 351A, p. 230]. The stability condition is |g(z)| < 1. This
stability function g(z) is a rational function where numerator and denominator are polynomials of degree s.
♦
For an explicit scheme the matrix of coefficients has a strictly lower triangular form

   A = [ 0                              ]
       [ a_21  0                        ]
       [ a_31  a_32  0                  ]
       [  ...        ...                ]
       [ a_s1  a_s2  · · ·  a_s,s−1  0  ]

and thus det(I − z A) = 1 and the stability function g(z) is a polynomial of degree s. An explicit scheme
of this type is conditionally stable, never unconditionally stable.
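The stability function can also be evaluated numerically from formula (3.18); a sketch that draws the boundary |g(z)| = 1 for the classical Runge–Kutta method (cf. Figure 3.38):

A = [0 0 0 0; 1/2 0 0 0; 0 1/2 0 0; 0 0 1 0]; % table (3.16)
b = [1/6; 2/6; 2/6; 1/6];
g = @(z) 1 + z*b'*((eye(4)-z*A)\ones(4,1)); % stability function (3.18)
[X,Y] = meshgrid(linspace(-4,2,201),linspace(-3,3,201));
G = arrayfun(@(z)abs(g(z)),X+1i*Y); % |g(z)| on a grid in C
contour(X,Y,G,[1,1]) % level curve |g(z)| = 1
xlabel('Re(h \lambda)'); ylabel('Im(h \lambda)')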
3–63 Example : Butcher table for the implicit Euler and Crank–Nicolson Method
The implicit Euler method is given by

   k_1 = f(t_n + h, y_n + h k_1)
   y_{n+1} = y_n + h k_1 = y_n + h f(t_n + h, y_n + h k_1)
   (y_{n+1} − y_n)/h = f(t_n + h, y_{n+1})

Thus the Butcher table is

   1 | 1
   --+---
     | 1
The Crank–Nicolson method has the Butcher table

   0 | 0    0
   1 | 1/2  1/2
   --+----------
     | 1/2  1/2

leading to
   k_1 = f(t_n, y_n)
   k_2 = f(t_n + h, y_n + h (k_1/2 + k_2/2))
   y_{n+1} = y_n + h (k_1/2 + k_2/2)
           = y_n + (h/2) (f(t_n, y_n) + f(t_n + h, y_n + h (k_1/2 + k_2/2)))
           = y_n + (h/2) (f(t_n, y_n) + f(t_n + h, y_{n+1}))

and therefore

   (y_{n+1} − y_n)/h = (1/2) (f(t_n, y_n) + f(t_n + h, y_{n+1})) .
This is caused by the identical rows in the Butcher table. If c_j = 1 and a_{j,i} = b_i for i = 1, 2, . . . , s then

   y_{n+1} = y_n + h Σ_{i=1}^s b_i k_i
   k_j = f(t_n + 1 · h, y_n + h Σ_{i=1}^s a_{j,i} k_i) = f(t_n + h, y_{n+1})
As a consequence the Crank–Nicolson method is stable for ODEs d/dt y(t) = λ y(t) with Re λ < 0, i.e. we find
unconditional stability. ♦
An embedded Runge–Kutta method combines two methods in one single table, one method of order p and one of order p − 1. The higher order approximation
is given by

   y_{n+1} = y_n + h Σ_{i=1}^s b_i k_i

and used to possibly adapt the step size h. The key point of an embedded method is to use the same slopes
k_i to generate the estimates of order p and p − 1. This uses fewer evaluations of the RHS f(t, y) than
performing one step of size h and two of size h/2 and then comparing.
One example is the Bogacki–Shampine pair of orders 2 and 3 used by ode23(), with the Butcher table

   0    |
   1/2  | 1/2
   3/4  | 0     3/4                                (3.20)
   1    | 2/9   1/3   4/9
   -----+-----------------------------
   y_{n+1}  | 2/9   1/3   4/9   0
   y*_{n+1} | 7/24  1/4   1/3   1/8
and one embedded step requires 4 evaluations of f(t, y). Observe that the last row of the coefficient matrix
A and the vector ~b^T coincide, thus

   k_4 = f(t_n + h, y_n + h Σ_{i=1}^3 a_{4,i} k_i) ,   y_{n+1} = y_n + h Σ_{i=1}^3 a_{4,i} k_i

and k_4 can be used during the next step as "new" k_1. Consequently only 3 evaluations of f(t, y) are required.
For Octave find the Butcher table for the command ode23() in the file runge_kutta_23.m and for
MATLAB in the file ode23.m . ♦
and one embedded step requires 7 evaluations of f(t, y). Observe that the last row of the coefficient matrix
A and the vector ~b^T coincide, thus

   k_7 = f(t_n + h, y_n + h Σ_{i=1}^6 a_{7,i} k_i) ,   y_{n+1} = y_n + h Σ_{i=1}^6 a_{7,i} k_i

and k_7 can be used during the next step as "new" k_1. Consequently only 6 evaluations of f(t, y) are required.
For Octave find the Butcher table for the command ode45() in the file runge_kutta_45_dorpri.m
and for MATLAB in the file ode45.m . ♦
• for h too small the computational effort might be too large, i.e. it takes too long or the rounding errors
caused by the arithmetic on the CPU cause problems.
A nice way out would be if the algorithms could determine a "good" step size. Fortunately this is possible
in most cases. Here one approach of automatic step size control is presented: Compute the solution at t + h
twice, once with one step of length h and then with two steps of length h/2. Use the difference of the two results
to estimate the discretization error and then adapt the step size h accordingly.
Use Table 3.6 to estimate the error when stepping from t with y(t) to t + h with solution y(t + h), once
with one step of size h leading to the first result r_1, then with two steps of size h/2 leading to r_2. Then use
the approximations

   y(t + h) − r_1 ≈ C h^{p+1}
   y(t + 2 · h/2) − r_2 ≈ 2 C (h/2)^{p+1}

as exact equations (not completely true, but good enough) and solve for y(t + h) and C. Subtracting the above
two expressions leads to

   r_2 − r_1 = C ( h^{p+1} − 2 (h/2)^{p+1} ) = C ((2^p − 1)/2^p) h^{p+1} ,
This implies

   C h^{p+1} = (2^p/(2^p − 1)) (r_2 − r_1)
   y(t + h) ≈ r_1 + (2^p/(2^p − 1)) (r_2 − r_1) = (2^p r_2 − r_1)/(2^p − 1)    (3.22)

Equation (3.22) leads to two useful results:
and equation (3.22) leads to two useful results:
1. control of step size h: (3.22) contains a (often very good) estimate of the local discretization error at
time t:
2p
y(t + h) − r1 ≈ p (r2 − r1 ).
2 −1
This deviation should not be larger than a given bound 0 < ε, typically very small. Consider three
outcomes:
The choice of the error bound ε and the way to adapt the step size have to be made very carefully.
2. local extrapolation: using r_1 and r_2 generate a new, better approximation for the solution at t + h by

   y(t + h) ≈ (2^p r_2 − r_1)/(2^p − 1) .

This "new" method will have a local discretization error proportional to h^{p+2}, i.e. the order of convergence
is improved by 1.
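A sketch of this error estimate and extrapolation for one classical Runge–Kutta step (p = 4), applied to the initial value problem of the previous sections; the helper rk4step() is a hypothetical name and, in MATLAB, has to be placed at the end of the script file:

f = @(t,y) y^2 - 2*t; t = 0; y = 0.75; h = 1; p = 4;
r1 = rk4step(f,t,y,h); % one step of size h
r2 = rk4step(f,t+h/2,rk4step(f,t,y,h/2),h/2); % two steps of size h/2
err_est = 2^p/(2^p-1)*(r2-r1) % estimate of the local error of r1
y_new = (2^p*r2-r1)/(2^p-1) % extrapolated value, order improved by 1

function xn = rk4step(f,t,x,h) % one classical Runge-Kutta step
  k1 = f(t,x); k2 = f(t+h/2,x+h/2*k1);
  k3 = f(t+h/2,x+h/2*k2); k4 = f(t+h,x+h*k3);
  xn = x + h/6*(k1+2*k2+2*k3+k4);
end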
The above error estimates and extrapolation method assume that the Taylor approximation of the ODE
is correct, which requires the RHS function f(t, y) to be many times differentiable. If the function f(t, y)
is not smooth (e.g. jumps of the values or of derivatives), then the above estimates are not correct. This will cause
the adaptive step size approaches in Section 3.4.5 to try smaller and smaller step sizes and the algorithm
might come to a screeching halt, usually with a warning message. Higher order methods (ode45()) are
more susceptible to this problem than lower order methods (ode23()). Another reason for the algorithms
to stop could be blowup of the solution. As example consider the ODE d/dt y(t) = 1 + y(t)^2, solved by
y(t) = tan(t), which blows up at t = π/2.
The above idea based on the Runge–Kutta method is implemented in a code rk45.m. The name
indicates the order of convergence is 4, improved to 5 by the extrapolation. In the function rk23.m the
idea is implemented based on the method of Heun, which is also called a Runge–Kutta method of order 2.
In both codes an absolute and relative tolerance for the error can be specified.
The algorithm in ode45.m does not use the above mentioned classical Runge–Kutta with half step
size, where one step requires 11 evaluations of the ODE function f(t, y). Instead the Dormand–Prince embedded
method in Example 3–65 is used. It requires only 7 evaluations of f(t, y) with the same error proportional
to h^5.
on the interval [0, 3π]. Asking for the same relative and absolute tolerance (10^−2 or 10^−6) with the two
adaptive methods based on Heun and Runge–Kutta leads to the results in Table 3.9 and Figure 3.39. The
Runge–Kutta based approach uses clearly fewer time steps. Figure 3.39 also shows that in sections of the
solution with a large curvature a smaller time step is used. The results for the tolerance of 10^−6 clearly
show that the difference is more significant for higher accuracy results.
Table 3.9: Comparison of a Heun based method (rk23.m) and Runge–Kutta method (rk45.m)
Observe that the number of function evaluations is considerably higher than the number of time steps.
• Each time step consists of one step of length h and two steps of length h/2. For Runge–Kutta this
leads to 11 function evaluations for each time step and for Heun to 5 evaluations. Observe that the
first evaluation is shared between the two computations.
• If a step size is rejected and the computation redone with a shorter time step, some of the evaluations
are "thrown away".
♦
To illustrate the results in this section a few MATLAB/Octave codes are used, see Table 3.19 on page 251.
• Matlab R2019a: ode113, ode15i, ode15s, ode23, ode23s, ode23t, ode23tb, ode45
The goal of this subsection is to illustrate the application of the commands ode?? to a single ODE or
systems of ODEs. The usage of the options is explained.
Figure 3.39: Graphical results for a Heun based method (a) and a Runge–Kutta method (b), with tolerances 10^−2
help ode45
-->
-- [T, Y] = ode45 (FUN, TRANGE, INIT)
-- [T, Y] = ode45 (FUN, TRANGE, INIT, ODE_OPT)
-- [T, Y, TE, YE, IE] = ode45 (...)
-- SOLUTION = ode45 (...)
-- ode45 (...)
The input arguments have to be given by the user. A call of the function ode45() will return results.
• input arguments
FUN is a function handle, inline function, or string containing the name of the function that defines
the ODE: d/dt y(t) = f(t, y(t)) (or d/dt ~y(t) = f(t, ~y(t))). The function must accept two inputs,
where the first is time t and the second is a column vector (or scalar) of unknowns y.
TRANGE specifies the time interval over which the ODE will be evaluated. Typically, it is a two-element
vector specifying the initial and final times ([t_init, t_final]). If there are more than two elements,
then the solution will be evaluated at these intermediate time instances.
Observe that the algorithms ode??() will always first choose the intermediate times, using
the adapted step sizes, see Section 3.4.5. If TRANGE is a two-element vector, these values are
returned. If more intermediate times are asked for the algorithm will use a special interpolation
algorithm to return the solution at the desired times. Asking for more (maybe many) interme-
diate times will not increase the accuracy. For increased accuracy use the options RelTol and
AbsTol, see page 212.
INIT contains the initial value for the unknowns. If it is a row vector then the solution Y will be a
matrix in which each column is the solution for the corresponding initial value in tinit .
ODE_OPT The optional fourth argument ODE_OPT specifies non-default options to the ODE solver. It is a
structure generated by odeset(), see page 212.
• return arguments
– If the function [T,Y] = ode??() is called with two return arguments, the first return argu-
ment is column vector T with the times at which the solution is returned. The output Y is a matrix
in which each column refers to a different unknown of the problem and each row corresponds to
a time in T.
– If the function SOL = ode??() is called with one return argument, it is a structure with three
fields: SOL.x are the times, SOL.y is the matrix of solution values and the string SOL.solver
indicates which solver was used.
– If the function ode??() is called with no return arguments, a graphic is generated. Try
ode45(@(t,y)y,[0,1],1) .
– If using the Events option, then three additional outputs may be returned. TE holds the time
when an Event function returned a zero. YE holds the value of the solution at time TE. IE
contains an index indicating which Event function was triggered in the case of multiple Event
functions.
Solving the ODE d/dt y(t) = (1 − t) y(t) for 0 ≤ t ≤ 5 with initial condition y(0) = 1 is a one-liner.
[t,y] = ode45(@(t,y)(1-t)*y,[0,5],1);
plot(t,y,'+-')
The plot on the left in Figure 3.40 shows that the time steps used by ode45() are rather large and thus
the solution seems to be inaccurate. The Dormand–Prince method in ode45() used large time steps to
achieve the desired accuracy and then returned the solution at those times only. It might be better to return
the solution at more intermediate times, uniformly spaced. This can be specified in trange when calling
ode45(), see the code below. Find the result on the right in Figure 3.40.
[t,y] = ode45(@(t,y)(1-t)*y,[0:0.1:5],1);
plot(t,y,'+-')
Figure 3.40: Solution of an ODE by ode45() at the computed times (left, trange = [0,5]) or at preselected times (right, trange = [0:0.1:5])
This (overly) simple model for the spreading of a virus uses three time dependent variables: the fractions
S(t) of susceptible, I(t) of infected and R(t) of recovered individuals in a population of size N.
• Every day an infected individual will make contact to b other individuals, possibly infecting them with
the virus. Only the fraction 0 < S(t) < 1 is susceptible, thus b S(t) will actually be infected by this
individual.
• During his sick period (on average 1/k days) an individual will thus infect (b/k) S(t) newly infected patients.
• As we have a total of N I(t) sick individuals we will observe b N I(t) S(t) new infections every day.
This leads to a system of ODEs for the three ratios S(t), I(t) and R(t).

   d/dt S(t) = −b S(t) I(t)
   d/dt I(t) = −d/dt S(t) − d/dt R(t) = +b S(t) I(t) − k I(t)
   d/dt R(t) = +k I(t)
Rewrite this as an ODE for I(t) and R(t), using S(t) = 1 − I(t) − R(t).

   d/dt I(t) = +b (1 − I(t) − R(t)) I(t) − k I(t) = (b (1 − I(t) − R(t)) − k) I(t)
             = (+b − k − b I(t) − b R(t)) I(t)
   d/dt R(t) = +k I(t)
This system of ODEs is solved numerically using e.g. ode45(). Find the results of the code below in
Figure 3.41. The additional code in the file SIR_Model.m generates the vector fields in Figure 3.42.
SIR_Model.m
I0 = 1e-4; S0 = 1 - I0; R0 = 0; %% the initial values
b = 1/3; k = 1/10; %% the model parameters
%% b = 1/8; %% use this for a smaller infection rate
[t,IR] = ode45(@(t,x)SIR(t,x(1),x(2),b,k),linspace(0,600,601),[I0,R0]);
figure(1); plot(t,IR)
xlabel('time [days]'); ylabel('fraction of Infected and Recovered')
ylim([-0.05 1.05])
legend('infected','recovered', 'location','east')
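The function SIR() with the right hand side of the system is not listed above; a minimal sketch consistent with the model equations, with the argument order taken from the call in SIR_Model.m:

function dx = SIR(t,I,R,b,k)
  dx = [(b-k-b*I-b*R)*I; % d/dt I(t)
        k*I]; % d/dt R(t)
end%function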
Figure 3.41: SIR model, with infection rate b and recovery rate k; (a) b = 1/3, k = 1/10, (b) b = 1/8, k = 1/10. The positive effect of a small infection rate b is obvious.
Figure 3.42: SIR model vector field with b = 1/3 and k = 1/10; (a) regular vector field, (b) normalized vector field
InitialSlope: vector, []
InitialStep: scalar, >0, []
Jacobian: matrix or function_handle, []
JConstant: binary, {["off"], "on"}
JPattern: sparse matrix, []
Mass: matrix or function_handle, []
MassSingular: switch, {["maybe"], "no", "yes"}
MaxOrder: switch, {[5], 1, 2, 3, 4, }
MaxStep: scalar, >0, []
MStateDependence: switch, {["weak"], "none", "strong"}
MvPattern: sparse matrix, []
NonNegative: vector of integers, []
NormControl: binary, {["off"], "on"}
OutputFcn: function_handle, []
OutputSel: scalar or vector, []
Refine: scalar, integer, >0, []
Stats: binary, {["off"], "on"}
Vectorized: binary, {["off"], "on"}
Matlab odeset()
odeset()
-->
AbsTol: [ positive scalar or vector {1e-6} ]
RelTol: [ positive scalar {1e-3} ]
NormControl: [ on | {off} ]
NonNegative: [ vector of integers ]
OutputFcn: [ function_handle ]
OutputSel: [ vector of integers ]
Refine: [ positive integer ]
Stats: [ on | {off} ]
InitialStep: [ positive scalar ]
MaxStep: [ positive scalar ]
BDF: [ on | {off} ]
MaxOrder: [ 1 | 2 | 3 | 4 | {5} ]
Jacobian: [ matrix | function_handle ]
JPattern: [ sparse matrix ]
Vectorized: [ on | {off} ]
The most frequently used options are AbsTol (default value 10^−6) and RelTol (default value 10^−3),
used to specify the absolute and relative tolerances for the solution. At each time step the algorithm
estimates the error(i) of the i-th component of the solution. Then the condition

   |error(i)| ≤ max(RelTol · |y(i)|, AbsTol)

has to be satisfied, i.e. at least one of the absolute or relative error bounds has to be satisfied. As a consequence
it is rather useless to ask for a very small relative error, but keep the absolute error large. Both have to be
made small.
The ODE d/dt x(t) = 1 + x(t)^2 with x(0) = 0 is solved by x(t) = tan(t). Thus we know the exact value
of x(π/4) = 1. With the default values obtain

[t,x] = ode23(@(t,x)1+x^2,[0,pi/4],0);
Error_Steps_default = [x(end)-1, length(t)-1]
-->
Error_Steps_default = [-3.6322e-05 25]
i.e. with 25 Heun steps the error is approximately 3.6 · 10^−5. With the options AbsTol and RelTol the
error can be made smaller. The price to pay is more steps, i.e. a higher computational effort.
ode_opt = odeset('AbsTol',1e-9,'RelTol',1e-9);
[t,x] = ode23(@(t,x)1+x^2,[0,pi/4],0,ode_opt);
Error_Steps_opt = [x(end)-1 ,length(t)-1]
-->
Error_Steps_opt = [-2.3420e-10 495]
With the command odeget() one can read out specific options for the ode solvers.
Below four algorithms available in Octave are applied to a few sample problems to illustrate the differences.
The results by MATLAB are very similar.
ode45 : an implementation of a Runge–Kutta (4,5) formula, the Dormand–Prince method of order 4 and 5.
This algorithm works well on most non–stiff problems and is a good choice as a starting algorithm.
ode23 : an implementation of an explicit Runge–Kutta (2,3) method, the explicit Bogacki–Shampine method
of order 3. It might be more efficient than ode45 at crude tolerances for moderately stiff problems.
ode15s : this command solves stiff ODEs. It uses a variable step, variable order BDF (Backward Differ-
entiation Formula) method that ranges from order 1 to 5. Use ode15s if ode45 fails or is very
inefficient.
ode23s : this command solves stiff ODEs with a Rosenbrock method of order (2,3). The ode23s solver
evaluates the Jacobian during each step of the integration, so supplying it with the Jacobian matrix is
critical to its reliability and efficiency.
To see the statistics of the different solvers use the option¹⁶ opt = odeset('Stats','on').
3–67 Example : The ODE

   d/dt y(t) = −y(t) + 3 with y(0) = 0

with the exact solution

   y(t) = 3 − 3 exp(−t)
is an example of a non stiff problem. The solution should not be a problem at all for either algorithm. This
is confirmed by the results in Table 3.10. Observe that the algorithm in ode45 generates the most accurate
results, see Figure 3.43. The example was solved by the code below.
Table 3.10: Data for a non–stiff ODE problem with different algorithms
• For ode45() the number of evaluations is barely more than 6 times the number of time steps. This
is consistent with the information in Example 3–65. Thus the algorithm never had to go to smaller
time steps h. The time step is increased, until it reaches the upper limit¹⁷ MaxStep of h = 0.5.

¹⁶ This author generated the numbers by using a counter within the function, and obtained slightly different numbers.
¹⁷ Examine the source file odedefaults.m to confirm this default value. The option MaxStep can be modified.
• For ode23() the number of evaluations is barely more than 3 times the number of time steps. This
is consistent with the information in Example 3–64. Thus the algorithm never had to go to smaller
time steps h. In Figure 3.43 observe that the step size is strictly increasing for ode23().
figure(1); plot(t45,y_exact(t45),t45,y45,t23,y23,t15s,y15s,t23s,y23s)
xlabel('time t'); ylabel('solution y(t)'); title('solution')
legend('exact','ode45','ode23','ode15s','ode23s','location','southeast')
figure(2); plot(t45,y45-y_exact(t45),t23,y23-y_exact(t23),...
t15s,y15s-y_exact(t15s),t23s,y23s-y_exact(t23s))
xlabel('time t'); title('difference to exact solution')
legend('ode45','ode23','ode15s','ode23s')
figure(3); plot(t45(2:end),diff(t45),'+',t23(2:end),diff(t23),'+',...
t15s(2:end),diff(t15s),'*', t23s(2:end),diff(t23s),'*')
xlabel('time t'); title('size of time step')
legend('ode45','ode23','ode15s','ode23s','location','northwest')
NumberOfStepsNonStiff = [length(t45),length(t23),length(t15s),length(t23s)]-1
♦
3–68 Example : The ODE

   d/dt y(t) = −1000 y(t) + 3000 − 2000 exp(−t) with y(0) = 0

with the exact solution

   y(t) = 3 − 0.998 exp(−1000 t) − 2.002 exp(−t)
is an example of a stiff problem. For 0 ≤ t ≪ 1 the term exp(−1000 t) (generated by −1000 y(t))
dominates, then exp(−t) takes over. Those effects occur on a different time scale. Find the graphical
results in Figure 3.44 and the number of steps and evaluations in Table 3.11. The stiffness of the ODE is
confirmed by the number of time steps and evaluations of the RHS in Table 3.11. Observe that the algorithm
in ode15s generates the most accurate results, see Figure 3.44. The Octave code is very similar to the
previous example.
Some of the results in Table 3.11 can be explained.
• For ode45() the number of evaluations equals 6.65 times the number of time steps. One step of
Dormand–Prince uses 6 evaluations, thus the step size had to be made shorter many times.
• The number of evaluations for the stiff solvers ode15s() and ode23s() is considerably smaller
than for the explicit solvers ode45() and ode23().
Table 3.11: Data for a stiff ODE problem with different algorithms
Figure 3.44: Solutions of the stiff ODE with different algorithms; (a) the solutions, (b) differences to exact solution, (c) stepsizes, (d) zoom in on differences to exact solution
is solved by

   (y_1(t), y_2(t))^T = (2, −1)^T exp(−t) + (−1, +1)^T exp(−100 t) + (1, 0)^T .
The result is based on the eigenvalues λ1 = −100 and λ2 = −1 of the matrix. Since the eigenvalues have
different magnitudes this is a (moderately) stiff system.
• For ode45() the number of evaluations equals 6.9 times the number of time steps. One step of
Dormand–Prince uses 6 evaluations, thus the step size had to be made shorter many times.
• The number of evaluations for the stiff solvers ode15s() and ode23s() is considerably smaller
than for the explicit solvers ode45() and ode23().
Table 3.12: Data for a stiff ODE system with different algorithms
figure(1); plot(t45,y_exact(t45),t45,y45,t23,y23,t15s,y15s,t23s,y23s)
xlabel('time t'); ylabel('solution y(t)');
legend('exact','ode45','ode23','ode15s','ode23s')
title('solution')
figure(2); plot(t45,y45-y_exact(t45),t23,y23-y_exact(t23),...
t15s,y15s-y_exact(t15s), t23s,y23s-y_exact(t23s))
xlabel('time t'); title('difference to exact solution')
legend('ode45','ode23','ode15s','ode23s')
t_lim = 5; t23_short = t23(t23<t_lim);
figure(3); plot(t45(2:end),diff(t45),'+',t23(2:end),diff(t23),'+',...
t15s(2:end),diff(t15s),'*',t23s(2:end),diff(t23s),'*')
xlabel('time t'); title('size of time step')
legend('ode45','ode23','ode15s','ode23s','location','northwest')
NumberOfSteps = [length(t45),length(t23),length(t15s),length(t23s)]-1
In the above code the Jacobian for the ODE was not used. In this example the Jacobian is given by

   J = ∂~f/∂~y = [ 98  198 ; −99  −199 ] .
Pass this information on to the implicit algorithms ode15s() and ode23s() by setting the option Jacobian with odeset().
Then obtain the results in Table 3.13. Observe that the number of steps does not change, but the number of
evaluations by ode15s() and ode23s() is substantially smaller. This is due to the algorithms not having
to determine the Jacobian numerically by evaluating the function f(t, ~y).
Table 3.13: Data for a stiff ODE system with different algorithms, using the Jacobian matrix
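A minimal sketch of how to pass the Jacobian, with f, t and y0 as in the (not shown) setup of this example:

J = [98, 198; -99, -199];
opt = odeset('Jacobian',@(t,y)J); % constant Jacobian as function handle
[t15s,y15s] = ode15s(f,t,y0,opt);
[t23s,y23s] = ode23s(f,t,y0,opt);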
A second system of ODEs is solved by

   (y_1(t), y_2(t))^T = (+9, −10)^T exp(−t) + (−8, +9)^T exp(−2 t) + (1, 0)^T .
The result is based on the eigenvalues λ1 = −2 and λ2 = −1 of the matrix. Since the eigenvalues have
similar magnitudes this is a non–stiff system. In Figure 3.46 and Table 3.14 verify that the algorithm ode45
generates the most accurate results with the fewest number of evaluations of the RHS of the ODE.
Some of the results in Table 3.14 can be explained.
• For ode45() the number of evaluations equals 6.26 times the number of time steps. One step of
Dormand–Prince uses 6 evaluations, thus the step size had to be made shorter only a few times. This
is visible in Figure 3.46.
Figure 3.46: (a) differences to exact solution, (b) stepsizes
• For ode23() the number of evaluations equals 3.01 times the number of time steps. One step of
Bogacki–Shampine uses 3 evaluations, thus the step size was increased most of the time. This is
visible in Figure 3.46. Considerably more time steps are used by ode23(), compared to ode45().
Table 3.14: Data for non–stiff ODE system with different algorithms
This is the reason for the often used name method of least squares.

Figure 3.47: Fitting a straight line to data points (axes: independent variable x, dependent variable y)
Consider ‖~r‖^2 as a function of p_1 and p_2. At the minimal point the two partial derivatives with respect to
p_1 and p_2 have to vanish. This leads to a system of linear equations for the vector ~p, the normal equation

   F^T F · ~p = F^T ~y .

This can easily be implemented in Octave, leading to the result in Figure 3.47 and a residual of ‖~r‖ ≈ 1.23.
Octave
x = [0; 1; 2; 3.5; 4]; y = [-0.5; 1; 2.4; 2.0; 3.1];
F = [ones(size(x)) x]
p = (F'*F)\(F'*y)
residual = norm(F*p-y)
The optimal values of the parameter vector ~p = (p_1, p_2, . . . , p_m)^T have to be determined. Thus minimize
the expression

   χ^2 = ‖~r‖^2 = Σ_{i=1}^n r_i^2 = Σ_{i=1}^n (f(x_i) − y_i)^2 = Σ_{i=1}^n ( Σ_{j=1}^m p_j · f_j(x_i) − y_i )^2

Using the matrix F with entries F_{ij} = f_j(x_i) this is

   ‖~r‖^2 = ‖F · ~p − ~y‖^2 = ⟨F · ~p − ~y , F · ~p − ~y⟩ .

Setting the partial derivatives with respect to the parameters to zero leads to a necessary condition

   F^T F · ~p = F^T ~y .

This system of m linear equations for the unknown vector of parameters ~p ∈ R^m is called a normal equation.
Once we have the optimal parameter vector ~p, compute the values of the regression curve with a matrix
multiplication

   (F ~p)_i = Σ_{j=1}^m p_j · f_j(x_i) .
The m × m matrix R_u is square and upper triangular. The left part Q_l of the square matrix Q is of size
n × m and satisfies Q_l^T Q_l = I_m. Use the zeros in the lower part of R to verify that

   F = Q · R = Q_l · R_u .

MATLAB/Octave can compute the QR factorization by [Q,R] = qr(F) and the reduced form by the
command [Ql,Ru] = qr(F,0). This factorization is very useful to implement linear regression.
Multiplying a vector ~r ∈ R^n with the orthogonal matrix Q or its inverse Q^T corresponds to a rotation of
the vector and thus will not change its length¹⁸. This observation can be used to rewrite the linear regression
problem from Section 3.5.1.

   F · ~p − ~y = ~r                      length to be minimized
   Q · R · ~p − ~y = ~r                  length to be minimized
   R · ~p − Q^T · ~y = Q^T · ~r

   [ R_u · ~p ]   [ Q_l^T · ~y ]   [ Q_l^T · ~r ]
   [    0     ] − [ Q_r^T · ~y ] = [ Q_r^T · ~r ]

Since the vector ~p does not change the lower part of the above system, the problem can be replaced by a
smaller system of m equations for m unknowns, namely the upper part only of the above system.
Obviously this length is minimized if Q_l^T · ~r = ~0 and thus find the reduced equations for the vector ~p.

   R_u · ~p = Q_l^T · ~y
   ~p = R_u^{−1} · Q_l^T · ~y
In Octave the above algorithm can be implemented with two commands only.
Octave
[Q,R] = qr(F,0);
p = R\(Q'*y);
It can be shown that the condition number for the QR based algorithm is much smaller than the condition number
for the algorithm based on (F^T F) · ~p = F^T ~y. Thus there are fewer accuracy problems to be expected and
the results are more reliable¹⁹.
As a simple example fit a function f(x) = p_1 · 1 + p_2 · x + p_3 · sin(x) to a given set of data points (x_i, y_i)
for 1 ≤ i ≤ n, as seen in Figure 3.48. In this example the n × 3 matrix F is given by

   F = [ 1  x_1  sin(x_1) ]
       [ 1  x_2  sin(x_2) ]
       [ 1  x_3  sin(x_3) ]
       [ ...              ]
       [ 1  x_n  sin(x_n) ] .
The code below first generates random data and then uses the reduced QR factorization to apply the linear
regression.
¹⁸ ‖Q ~x‖^2 = ⟨Q ~x, Q ~x⟩ = ⟨~x, Q^T Q ~x⟩ = ⟨~x, I ~x⟩ = ‖~x‖^2
¹⁹ A careful computation shows that using the QR factorization F = Q R in F^T F ~p = F^T ~y also leads to R_u ~p = Q_l^T ~y.
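The generating code is not reproduced here; a sketch under the assumption that the "true" parameters are approximately (6, −0.5, 0.4), consistent with the output further below:

N = 100; x = linspace(0,10,N)'; % support points
y = 6 - 0.5*x + 0.4*sin(x) + 0.5*randn(N,1); % data with random noise
F = [ones(N,1), x, sin(x)]; % values of the basis functions
[Q,R] = qr(F,0); % reduced QR factorization
p = R\(Q'*y) % optimal parameters
y_reg = F*p; % values of the regression curve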
Figure 3.48: Raw data and regression curve (axes: x values, y values)
figure(1)
plot(x,y,'+',x,y_reg)
legend('raw data','regression')
xlabel('x values'); ylabel('y values')
-->
p = 6.00653
-0.50358
0.43408
Now the linear regression problem can be examined. The computations are rather similar to the linear
regression approach using the QR factorization.

   F · ~p − ~y = ~r                                       length to be minimized
   U · diag(σ_1, σ_2, . . . , σ_m) · V^T · ~p − ~y = ~r   length to be minimized
   Σ · V^T · ~p − U^T ~y = U^T ~r                         length to be minimized

   [ Σ · V^T · ~p ]   [ U_l^T ~y ]   [ U_l^T ~r ]
   [      0       ] − [ U_r^T ~y ] = [ U_r^T ~r ]         length to be minimized

Optimize the upper part only, i.e. set U_l^T ~r = ~0, with smallest possible norm:

   Σ · V^T · ~p = U_l^T ~y

If σ_m > 0 then the above problem has a unique solution. The ratio σ_1/σ_m contains information about the
sensitivity of this least square problem. This allows one to detect ill conditioned linear regression problems. For
further information consult [GoluVanLoan13, §5.3].
MATLAB/Octave provide a command to generate the reduced SVD: [U,S,V] = svd(F,'econ')
or [U,S,V] = svd(F,0). The above algorithm is applied with just two commands.

[Ul,S,V] = svd(F,0);
p = (S*V')\(Ul'*y(:))
figure(1)
plot(x,y,'+',x,y_reg2)
legend('raw data','regression')
xlabel('x values'); ylabel('y values')

The result will be identical to the one generated by the QR factorization and also leads to Figure 3.48.
Using the above results (for the parabola fit) determine the residual vector

   ~r = F · ~p − ~y

and then mean and variance V = σ^2 of the y–errors can be estimated. The estimation is valid if the y–errors
are independent and assumed to be of equal size, i.e. assume a-priori that the errors are given by a
normal distribution. Then the variance σ^2 of the y values can be estimated by

   σ^2 = (1/(n − m)) Σ_{i=1}^n r_i^2
In most applications the values of the parameters p_j contain the essential information. It is often important
to know how reliable the obtained results are, i.e. we want to know the variance of the determined parameter
values p_j. To this end consider the normal equation

   F^T · F · ~p = F^T · ~y
Knowing this standard deviation and assuming a normal distribution one can (see [MontRung03, §12-3.1])
readily²¹ determine the 95% confidence intervals for the individual parameters²², i.e. with a probability of

²⁰ If z_k are independent random variables given by a normal distribution with variances var(z_k), then a linear combination of
the z_k also leads to a normal distribution. The variances are given by the computational rules:
   var(z_1 + z_2) = var(z_1) + var(z_2) ,  var(α_1 z_1) = α_1^2 var(z_1) ,  var(Σ_i α_i z_i) = Σ_i α_i^2 var(z_i)
²¹ Use ∫_{−1.96 σ}^{+1.96 σ} 1/(sqrt(2π) σ) exp(−x^2/(2 σ^2)) dx ≈ 0.95 .
²² A more careful approach is to determine the confidence region for ~p ∈ R^m, using a general m–dimensional F–distribution,
see https://en.wikipedia.org/wiki/Confidence_region . The domain of confidence will be an ellipsoid in R^m, best computed using a
PCA. The difference can be substantial if the off–diagonal entries in the correlation matrix for the parameters ~p are not small. Find
more information in [RawlPantuDick98, §4.6.3] or examine Example 3–71 below.
95% the actual value of the parameter is between p_i − 1.96 sqrt(var_i) and p_i + 1.96 sqrt(var_i)²³. The above assumes
that the distributions of the parameters are normal distributions. Actually the distribution to use is a Student's
t-distribution with n − m degrees of freedom. This can be computed in MATLAB/Octave with the function tinv().
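A sketch of this computation; tinv() requires the statistics functionality of Octave or MATLAB, and p and var_p are hypothetical names for the vector of computed parameters and their estimated variances:

alpha = 0.05;
t_lim = tinv(1-alpha/2,n-m); % quantile of the Student's t-distribution
CI = [p - t_lim*sqrt(var_p), p + t_lim*sqrt(var_p)] % confidence intervals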
• Fit f_2(x) = p_1 · 1 + p_2 · (x − 10) by minimizing the length of the residual vector ~r.

   F_2 ~p − ~y =
   [ 1  x_0 − 10  ]             [ sin(x_0)  ]   [ r_0  ]
   [ 1  x_1 − 10  ]   ( p_1 )   [ sin(x_1)  ]   [ r_1  ]
   [ 1  x_2 − 10  ] · ( p_2 ) − [ sin(x_2)  ] = [ r_2  ]
   [ ...          ]             [ ...       ]   [ ...  ]
   [ 1  x_30 − 10 ]             [ sin(x_30) ]   [ r_30 ]
Both attempts fit a straight line through the same data points, but represented by slightly different parametrizations.
The optimal values of the parameters and their estimated standard deviations can be computed.
Observe that 0.112357 + 0.661721 x ≈ 6.729570 + 0.661721 (x − 10). The two straight lines coincide,
but the standard deviation of the parameter p_1 is much larger for the second attempt. This is caused by the
strong correlation between the functions 1 and (x − 10) for the second approach. This can be made visible
by examining the correlation matrices

   corp1 = [ 1  −0.8589 ; −0.8589  1 ]   and   corp2 = [ 1  0.9987 ; 0.9987  1 ] .
²³ Observe that the CIs (Confidence Intervals) are the probabilistic event, i.e. they will change from one random drawing to the
next. Thus the statement is not "With probability 0.95 the true value of p_i is in the CI", but better "With probability 0.95 the
generated CI contains the true value of p_i". See en.wikipedia.org/wiki/Confidence_interval .
With MATLAB/Octave²⁴ use lscov() to obtain the covariance matrix covp and then the correlation matrix
by

   corp = diag(1/σ_1, 1/σ_2) · covp · diag(1/σ_1, 1/σ_2) .
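A sketch of these two steps, with covp as returned by lscov():

sigma = sqrt(diag(covp)); % standard deviations of the parameters
corp = diag(1./sigma)*covp*diag(1./sigma) % correlation matrix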
• To construct a better, ellipsoidal domain of confidence use the covariance matrix. To determine the
dimensions of the ellipses the F–distribution has to be used. For more information see [DrapSmit98,
§5.4,§5.5], [RawlPantuDick98, §4.6.3] or use Section 3.2.7 with results on Gaussian distributions.
Find the graphical results in Figure 3.49. Observe that the domain of confidence in Figure 3.49(a) contains
more precise information, mainly by the narrower range 0.06 ≤ p_1 ≤ 0.16 in horizontal direction. The domain
of confidence in Figure 3.49(a) clearly points out that 95% is in an ellipse, not spread out over all of the
rectangle. In Figure 3.49(b) for the regression f_2(x) = p_1 · 1 + p_2 · (x − 10) the ellipse is extremely narrow.
This is caused by the strong correlation between p_1 and p_2, as computed in the correlation matrix corp2. The
situation is better in Figure 3.49(a) for the regression f_1(x) = p_1 · 1 + p_2 · x. An even slightly better solution
would be to fit a function f_3(x) = p_1 · 1 + p_2 · (x − π/4), where the two parameters are not correlated at all,
i.e. the correlation matrix would be very close to the identity matrix I_2. The result in Figure 3.49 would be
an ellipse with horizontal axis. To generate the ellipses in Figure 3.49 use eigenvalues and eigenvectors, see
Example 3–25. ♦
Figure 3.49: Regions of confidence for two similar regressions with identical data; (a) f_1(x) = p_1 · 1 + p_2 · x, (b) f_2(x) = p_1 · 1 + p_2 · (x − 10)
3.5.3 The commands LinearRegression(), regress() and lscov() for Octave and
MATLAB
In MATLAB and/or Octave there are a few commands useful for linear regression, see Table 3.15. In these
notes three are presented: LinearRegression(), regress() and lscov().

²⁴ Based on [GoluVanLoan96, §5.6.3], which is also available in [GoluVanLoan13, §6.1.2].
Command               Properties
LinearRegression()    standard and weighted linear regression,
                      returns standard deviations for parameters
regress()             standard linear regression,
                      returns confidence intervals for parameters
lscov()               generalized least square estimation, with weights
ols()                 ordinary least square estimation
gls()                 generalized least square estimation
polyfit()             regression with polynomials only
lsqnonneg()           regression with positivity constraint
• MATLAB: download the file from my web site at LinearRegression.m for Matlab .
The builtin help LinearRegression leads to
LinearRegression (F, Y)
LinearRegression (F, Y, W)
[P, E_VAR, R, P_VAR, FIT_VAR] = LinearRegression(...)
parameters:
* F is an n*m matrix with the values of the basis functions at
the support points. In column j give the values of f_j at the
points x_i (i=1,2,...,n)
* Y is a column vector of length n with the given values
* W is a column vector of length n with the weights of the data points.
1/w_i is expected to be proportional to the estimated
uncertainty in the y values. Then the weighted expression
sum_(i=1,...,n)(w_iˆ2*(y_i-f(x_i))ˆ2) is minimized.
return values:
* P is the vector of length m with the estimated values of the parameters
* E_VAR is the vector of estimated variances of the provided y values.
If weights are provided, then the product e_var_i*wˆ2_i is assumed
to be constant.
* R is the weighted norm of the residual
* P_VAR is the vector of estimated variances of the parameters p_j
* FIT_VAR is the vector of the estimated variances of the fitted function values
The command LinearRegression() allows to use a weighted least square algorithm, i.e. instead
of minimizing the standard ‖~r‖^2 = Σ_{i=1}^n r_i^2 the weighted expression

   ‖~r‖_W^2 = Σ_{i=1}^n w_i^2 r_i^2

is minimized, with given weights w_i. This should be used if some data is known to be more reliable than
others, or to give outliers a lesser weight.
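A sketch of a weighted call, where outliers is a hypothetical index vector of suspected outliers:

w = ones(size(y)); w(outliers) = 0.1; % give suspected outliers a small weight
[p,e_var,r,p_var] = LinearRegression(F,y,w);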
Here,
* 'y' is a column vector of observed values
* 'X' is a matrix of regressors, with the first column filled
  with the constant value 1
* 'beta' is a column vector of regression parameters
* 'e' is a column vector of random errors
Arguments are
* Y is the 'y' in the model
* X is the 'X' in the model
* ALPHA is the significance level used to calculate the confidence
  intervals BINT and RINT (see 'Return values' below).
  If not specified, ALPHA defaults to 0.05
-- X = lscov (A, B)
-- X = lscov (A, B, V)
-- X = lscov (A, B, V, ALG)
-- [X, STDX, MSE, S] = lscov (...)
Reference: Golub and Van Loan (1996), ’Matrix Computations (3rd Ed.)’,
Johns Hopkins, Section 5.6.3
Finding the optimal parameters, their standard deviation and the confidence intervals
On the first few lines in the code below the data is generated, using a normally distributed noise contribution.
Then LinearRegression() is applied to determine the solutions.
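The data generation and regression code is not reproduced here; a sketch consistent with the plotting commands and the large noise variant further below:

N = 100; x_d = rand(N,1); % support points
noise_small = 0.05*randn(N,1); % small, normally distributed noise
y_d = x_d.*(1-x_d) + noise_small; % data on the parabola y = x - x^2
F = [ones(size(x_d)), x_d, x_d.^2]; % basis functions 1, x, x^2
[p,e_var,r,p_var] = LinearRegression(F,y_d);
parameters = [p, sqrt(p_var)] % values and standard deviations
x = linspace(0,1,101)'; y = x.*(1-x); % the true curve
y_fit = [ones(size(x)), x, x.^2]*p; % the fitted curve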
figure(1)
plot(x,y,'k',x_d,y_d,'b+',x,y_fit,'g')
xlabel('independent variable x')
ylabel('dependent variable y')
title('regression with small noise')
legend('true curve','data','best fit','location','south')
-->
parameters = -7.0696e-03 7.3803e-03
1.1126e+00 3.6093e-02
-1.1446e+00 3.7854e-02
Using the standard deviations of the parameters one can then determine the confidence intervals for a chosen
level of significance α = 5% = 0.05.
The numerical result for the Student-t distribution implies that with a level of confidence of 95% the confidence
intervals for the parameters p_i in y = p_1 · 1 + p_2 · x + p_3 · x^2 satisfy

   −0.023 ≤ p_1 ≤ +0.009
   +1.04 ≤ p_2 ≤ +1.19
   −1.22 ≤ p_3 ≤ −1.06 .
This confirms the “exact” values of p1 = 0, p2 = +1 and p3 = −1. The best fit parabola in Figure 3.50(a)
is rather close to the “true” parabola.
Figure 3.50: Linear regression for a parabola, with a small or a large noise contribution
For the large noise case the corresponding confidence intervals are

   −0.9 ≤ p_1 ≤ +0.7
   −3.7 ≤ p_2 ≤ +4.2
   −3.7 ≤ p_3 ≤ +4.7 .

The best fit parabola in Figure 3.50(b) is a poor approximation of the "true" parabola.
1.0365e+00 1.1888e+00
-1.2244e+00 -1.0647e+00
The command lscov can also determine the covariance matrix for the parameters.
covp_lscov
-->
covp_lscov = 5.4469e-05 -2.3421e-04 2.0710e-04
-2.3421e-04 1.3027e-03 -1.3021e-03
2.0710e-04 -1.3021e-03 1.4330e-03
The rather large values off the diagonal in the correlation matrix indicate that the three parameters pi are not
independent.
With the covariance matrix the ellipsoidal region of confidence for p~ ∈ R3 can be determined. To start
choose a level of significance, in this case α = 0.05.
• Use Example 3–71 to determine the ellipsoidal domain of confidence in R³ and Example 3–25 to visualize it, see Figure 3.51(a). This figure can be rotated on screen if generated locally by Octave/MATLAB.
• To obtain printable visualizations restrictions to planes are useful. Use the tools from Example 3–27
to display the intersection and projection of the 3D domain of confidence with the plane p3 = const.
Find the result in Figure 3.51(b).
• If working with individual intervals of confidence the level of significance has to be adapted, such that $(1-\alpha_3)^3 = 1-\alpha = 0.95$, i.e. $\alpha_3 = 1-(1-\alpha)^{1/3} \approx \frac{\alpha}{3}$. The projection of the “box” of confidence in R³ leads to the rectangle in Figure 3.51(b).
Figure 3.51: Domains of confidence, (a) for $\vec p \in$ R³ and (b) for $(p_1, p_2) \in$ R², showing the projection, the intersection and the square of individual confidence intervals
inv_cov = inv(covp_lscov);
%% the ellipsoid in space
[Evec,Eval] = eig(inv_cov);
phi = linspace(0,2*pi,81); theta = linspace(-pi,pi,81);
x = cos(phi')*cos(theta); y = sin(phi')*cos(theta); z = ones(size(phi'))*sin(theta);
Points = Evec*inv(sqrt(Eval))*[x(:),y(:),z(:)]';
figure(2)
plot3(p1,p2,p3,'or', p1+sqrt(f_limit)*Points(1,:),...
p2 + sqrt(f_limit)*Points(2,:),p3 + sqrt(f_limit)*Points(3,:),'.b',...
[p1-t_lim*sigma(1),p1+t_lim*sigma(1)],[p2,p2],[p3,p3],'g',...
[p1,p1],[p2-t_lim*sigma(2),p2+t_lim*sigma(2)],[p3,p3],'g',...
[p1,p1],[p2,p2],[p3-t_lim*sigma(3),p3+t_lim*sigma(3)],'g')
xlabel('p_1'); ylabel('p_2'); zlabel('p_3'); view([75,20]);
figure(3)
plot(p1+sqrt(f_limit)*Points_p(1,:),p2+sqrt(f_limit)*Points_p(2,:),'c',...
p1+sqrt(f_limit)*Points_i(1,:),p2+sqrt(f_limit)*Points_i(2,:),'r',...
p1+t_lim*sigma(1)*[-1 1 1 -1 -1],p2+t_lim*sigma(2)*[-1,-1,1,1,-1],'g')
xlabel('p_1'); ylabel('p_2'); legend('projection','intersection','square')
For the large noise case there is again no surprise, i.e. lscov() yields the same results as LinearRegression() and regress().
noise_big = 0.5*randn(100*N,1);
x_d = rand(100*N,1) ; y_d = x_d.*(1-x_d) + noise_big;
Using more data points will increase the reliability of the result, but only if the residuals are normally distributed. If you are fitting a straight line to data on a parabola the results can not improve. Thus one has to make a visual check of the fit (Figure 3.52(a)) to realize that the result cannot be correct, even if the standard deviations of the parameters $p_i$ are small. The problem is also visible in a histogram of the residuals (Figure 3.52(b)). Obviously the distribution of the residuals is far from a normal distribution, and thus the essential working assumption of normally distributed deviations is violated.
Figure 3.52: (a) fitting a straight line to data on a parabola, (b) histogram of the residuals
residuals = F*p-y_d;
figure(4)
hist(residuals)
xlabel('deviation'); ylabel('frequency')
-->
parameters = 0.1639191 0.0034668
0.0035860 0.0059815
linear regression leads to the obviously wrong result in Figure 3.53(b). The following computations were performed:
• The angle was given in degrees, i.e. 0° ≤ α ≤ 90°.
• The normal equation was solved with an explicit inverse, i.e. $\vec p = (\mathbf F^T\mathbf F)^{-1}\,(\mathbf F^T\vec y)$.
• The warning message about a singular matrix was ignored. Thus the computations were performed with a matrix with a very large condition number, e.g. κ ≈ 10¹⁷.
Figure 3.53: (a) setup of the LED, (b) intensity profile as function of the angle α
• Instead of the normal equation with the matrix $\mathbf F^T\mathbf F$ use the QR factorization.
• The angle α = 90° leads to numbers $90^5 \approx 6\cdot 10^9$ and thus to numbers $\approx 3\cdot 10^{19}$ in the matrix $\mathbf F^T\mathbf F$. In the same matrix there are also numbers close to 1. Thus the condition number of this matrix is extremely large. Using radians instead of degrees helps, i.e. use an appropriate rescaling of the data.
Using just one of these three improvements leads to reasonable answers. The most important aspect to
consider when using the linear regression method is to select the correct type of function for the fitting. This
decision has to be made based on insight into the problem at hand.
²⁵ In my (long) career I have seen no application requiring a regression with a polynomial of high degree. I have seen many problems when using a polynomial of high degree.
F = [ones(size(x)), x , y];
p = LinearRegression(F,z)
figure(1);
plot3(x,y,z,'*')
hold on
mesh(x_grid,y_grid,z_grid)
xlabel('x'); ylabel('y'); zlabel('z');
hold off
-->
p = 1.7689 2.0606 -1.4396
Since only very few (N = 100) points were used the exact parameter values $\vec p = (+2, +2, -1.5)$ are not reproduced very accurately. Increasing N will lead to more accurate results for this simulation, or decrease the size of the random noise in +0.5*randn(N,1).
Figure: the data points and the best fitting plane, with axes x, y and z
The command LinearRegression() does not determine the confidence intervals for the parameters, but it returns the estimated standard deviations, resp. the variances. With these the confidence intervals can be computed, using the Student-t distribution. To determine the CI modify the above code slightly, as sketched below. The results imply the 95% confidence intervals for the parameters $p_i$.
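A minimal sketch of this modification, assuming n data points and m parameters; the names follow the code above:

[p, e_var, r, p_var] = LinearRegression(F, y_d);
alpha = 0.05; n = length(y_d); m = length(p);
sigma = sqrt(p_var); % standard deviations of the parameters
t_lim = tinv(1-alpha/2, n-m); % quantile of the Student-t distribution
p_CI = [p - t_lim*sigma, p + t_lim*sigma] % individual 95% confidence intervals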
Command               Properties
leasqr()              standard nonlinear regression, Levenberg–Marquardt
fsolve()              can be used for nonlinear regression too
nlinfit()             nonlinear regression
lsqcurvefit()         nonlinear curve fitting
nonlin_curvefit()     frontend, Octave only
lsqnonlin()           nonlinear minimization of residue
nonlin_residmin()     frontend, Octave only
nlparci()             determine confidence intervals of parameters, MATLAB only
expfit()              regression with exponential functions
leasqr() uses the Levenberg–Marquardt algorithm (see page 119) to solve the corresponding system of nonlinear equations for the parameters. Thus leasqr() needs the matrix of partial derivatives with respect to the parameters. This matrix can be provided as a function or estimated by the auxiliary function dfdp.m, which uses a finite difference approximation. The Octave package also provides an example in leasqrdemo.m and you can examine its source code.
As a first example try to fit a function with the four parameters A, α, ω and φ through a number of measured points $(t_i, y_i)$. Then search the values for the parameters A, α, ω and φ to minimize
$$\sum_i |f(t_i) - y_i|^2\,.$$
Since the function is nonlinear with respect to the parameters α, ω and φ one cannot use linear regression. In Octave the command leasqr() will solve nonlinear regression problems. To set up an example one may:
1. Choose “exact” values for the parameters.
You have to provide the function leasqr() with good initial estimates for the parameters. The algorithm in leasqr() uses a damped Newton method (see Section 3.1.5) to find the optimal solution. Examining the selection of points in Figure 3.55 estimate
• ω ≈ 0.9: the period seems to be slightly larger than 2π, thus ω slightly smaller than 1.
The results of your simulation might vary slightly, caused by the random numbers involved.
Octave
A0 = 2; al0 = 0; omega0 = 1; phi0 = pi/2;
[fr,p] = leasqr(t,y,[A0,al0,omega0,phi0],'f',1e-10);
p'
yFit = f(t,p);
plot(t,y,'+', t,yFit)
legend('data','fit')
-->
p = 1.523957 0.098949 0.891675 1.545294
Figure 3.55: The data points and the fitted curve
The above result contains the estimates for the parameters. For many problems the deviations from the true curve are randomly distributed, with a normal distribution and small variance. In this case the parameters are also randomly distributed with a normal distribution. The diagonal of the covariance matrix contains the variances of the parameters; thus estimate the standard deviations by taking the square root of the variances.
Octave
pkg load optim % load the optimization package in Octave
[fr,p,kvg,iter,corp,covp,covr,stdresid,Z,r2] =...
leasqr(t,y,[A0,al0,omega0,phi0],'f',1e-10);
pDev = sqrt(diag(covp))’
-->
pDev = 0.0545981 0.0077622 0.0073468 0.0307322
With the above results obtain Table 3.18. Observe that the results are consistent, i.e. the estimated param-
eters are rather close to the “exact” values. To obtain even better estimates, rerun the simulation with less
noise or more points.
Assume the data is described by a model function
$$y = f(\vec p, x)\,.$$
A few (n) points are given, thus $\vec x \in$ Rⁿ, and the same number of values $\vec y_d \in$ Rⁿ are measured. For precise measurements we expect $\vec y_d \approx \vec y = f(\vec p, \vec x)$. Then we can search for the optimal parameters $\vec p \in$ Rᵐ such that
$$f(\vec p, \vec x) - \vec y_d = \vec 0\,.$$
If m < n this is an overdetermined system of n equations for the m unknowns $\vec p \in$ Rᵐ. In this case the command fsolve() will convert the system of equations to a minimization problem, i.e. it minimizes $\|f(\vec p, \vec x) - \vec y_d\|^2$.
It is also possible to estimate the variances of the optimal parameters, using the techniques from Sec-
tion 3.5.2. This will be done very carefully in the following Section 3.5.8.
As an illustrative example data points on the curve y = exp(−0.2 x) + 3 are generated and then some
random noise is added. As initial parameters use the naive guess y(x) = exp(0 · x) + 0. The best possible
fit is determined and displayed in Figure 3.56.
[p, fval, info, output] = fsolve (@(p) (exp(-p(1)*x) + p(2) - y), [0, 0]);
plot(x,y,’+’, x,exp(-p(1)*x)+p(2))
Figure 3.56: The data points and the best fit determined by fsolve()
Many growth phenomena can be described by rescaling and shifting the basic logistic²⁷ growth function $g(x) = \frac{\exp(x)}{1+\exp(x)} = \frac{1}{1+\exp(-x)}$. It is easy to see that this function is monotonically increasing and
$$\lim_{x\to-\infty} g(x) = 0\,,\quad g(0) = \frac12 \quad\text{and}\quad \lim_{x\to+\infty} g(x) = 1\,.$$
Rescaling and shifting leads to the function
$$f(x) = p_1 + p_2\, g(p_3\,(x - p_4)) \tag{3.24}$$
with the four parameters $p_i$, i = 1, 2, 3, 4. An example is shown in Figure 3.57. For the given data points (in red) the optimal values for the parameters $p_i$ have to be determined. This is a nonlinear regression problem.
Figure 3.57: Data points and the optimal fit by a logistic function
²⁷ Also called sigmoid function.
An essential point for nonlinear regression problems is to find good estimates for the values of the parameters. Thus examine the graph of the logistic function (3.24) carefully:
• At the midpoint x = p₄ find $f(p_4) = p_1 + \frac{p_2}{2}$.
• For the extreme values observe $\lim_{x\to-\infty} f(x) = p_1$ and $\lim_{x\to+\infty} f(x) = p_1 + p_2$.
• The maximal slope is at the midpoint and given by $f'(p_4) = \frac{p_2\,p_3}{4}$.
Assuming p₂, p₃ > 0 now find good estimates for the parameter values.
• p₁, the offset: minimal height of the data points
• p₂, the height: difference of the maximal and minimal height of the data points
• p₃, the slope: four times the maximal slope of the data, divided by p₂
• p₄, the midpoint: a central value of the x data
x_data = [0 1 2 3 4 5 6 7 8]’;
y_data = [46.8 47.2 48.8 51.8 55.7 58.6 61.8 63 63.8]’;
p1 = min(y_data);
p2 = max(y_data)-min(y_data);
p3 = 4*max(diff(y_data)./diff(x_data))/p2;
p4 = mean(x_data);
This result can now be used to apply a nonlinear regression, using the functions leasqr(), fsolve()
or lsqcurvefit().
• Call leasqr(), returning the values and the covariance matrix; a sketch of such a call is shown below. On the diagonal of the covariance matrix find the estimated variances of the parameters $p_i$.
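A sketch of such a call, using the logistic model (3.24); the function handle f and the variable names are illustrative and not the exact code used for Figure 3.57:

pkg load optim % provides leasqr()
f = @(x,p) p(1) + p(2)./(1 + exp(-p(3)*(x-p(4)))); % logistic model (3.24)
p0 = [p1; p2; p3; p4]; % initial estimates determined above
[y_fit, p, kvg, iter, corp, covp] = leasqr(x_data, y_data, p0, f);
sigma = sqrt(diag(covp)) % estimated standard deviations of the p_i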
Find the result in Figure 3.57. As numerical result the optimal values of $p_i$ and their standard deviations are shown. These can be used to determine confidence intervals for the parameters. In addition the number of required iterations and the resulting residual $\bigl(\sum_{i=1}^n (f(x_i)-y_i)^2\bigr)^{1/2}$ is displayed.
Using the estimated variances of the individual parameters $p_i$ and the t-distribution, the 95% confidence intervals for the individual parameters can be estimated.
But since the correlation matrix corp contains large off–diagonal entries the result might not be reliable.
The large off–diagonal entries indicate that the four parameters pi are not independent.
Domain of confidence
To examine the joint domain of confidence for the four parameters pi some more computations have to be
performed. There are two methods to visualize the joint domain of confidence:
1. as a rectangular box in R⁴
2. as an ellipsoid in R⁴
For the individual confidence intervals with level of significance α = 0.05 use the Student t-distribution by calling tinv(1-0.05/2,n-4). If the ”box” in R⁴ were constructed with these widths, then the confidence of the true parameter to be inside the box would be $(1-\alpha)^4 = (1-0.05)^4 \approx 0.81$. For a level of confidence 1 − α for all four parameters together one needs $p_4^4 = 1-\alpha$ or $p_4 = (1-\alpha)^{1/4}$. This leads to the level of significance
$$\alpha_4 = 1 - p_4 = 1 - (1-\alpha)^{1/4} \approx \alpha/4\,.$$
alpha4 = 1-(1-alpha)^0.25;
p_CI_95_joint = p + tinv(1-alpha4/2,length(x_data)-4)*[-sigma +sigma]
-->
p_CI_95_joint = 44.4879 47.3758
15.9825 20.8748
0.6023 1.0751
3.6257 4.2399
The result is a clearly larger domain of confidence for the joint parameters; it is a rectangular “box” in R⁴.
Caused by the correlation of the parameters $p_i$ one should examine the ellipsoidal domain of confidence in R⁴. The intersection of the domain in R⁴ with the plane p₃ = p₄ = const is an ellipse in the plane with p₁ and p₂. To determine the dimensions of the ellipses the F-distribution has to be used, see [DrapSmit98] or [RawlPantuDick98]. Find the result in Figure 3.58(a). The result of the intersection with p₁ = p₂ = const is shown in Figure 3.58(b). Also shown are the central axes for the rectangular domains of confidence in green. Use the results in Section 3.2.7 and Example 3–25 to draw the ellipses.
f_limit = 4*finv(1-alpha,4,length(x_data));
figure(3);
plot(p1+sqrt(f_limit)*Points(1,:),p2+sqrt(f_limit)*Points(2,:),'b','linewidth',2)
t_lim = tinv(1-alpha4/2,length(x_data)-4)
hold on
plot(p1,p2,'or')
plot([p1-t_lim*sigma(1),p1+t_lim*sigma(1)],[p2,p2],'g')
plot([p1,p1],[p2-t_lim*sigma(2),p2+t_lim*sigma(2)],'g')
xlabel('p_1'); ylabel('p_2'); hold off
(a) intersection with p3 = p4 = const (b) intersection with p1 = p2 = const
Figure 3.58: The intersection of the 95% confidence ellipsoid in R4 with 2D-planes
The intersection of the confidence domain in R⁴ with the plane p₄ = const leads to an ellipsoid in R³, shown in Figure 3.59. The intersection of this ellipsoid with the horizontal plane p₃ = const generated by the green markers leads to the ellipse in Figure 3.58(a). On occasion it might also be useful to use the projections of the general ellipsoids on coordinate planes, using the ideas leading to Example 3–27 on page 140.
f_limit = 4*finv(1-alpha,4,length(x_data));
figure(4)
plot3(p1+sqrt(f_limit)*Points(1,:),p2 + sqrt(f_limit)*Points(2,:),...
p3 + sqrt(f_limit)*Points(3,:),'.b')
hold on
plot3(p1,p2,p3,'or')
plot3([p1-t_lim*sigma(1),p1+t_lim*sigma(1)],[p2,p2],[p3,p3],'g')
plot3([p1,p1],[p2-t_lim*sigma(2),p2+t_lim*sigma(2)],[p3,p3],'g')
plot3([p1,p1],[p2,p2],[p3-t_lim*sigma(3),p3+t_lim*sigma(3)],'g')
Figure 3.59: The intersection of the 95% confidence ellipsoid in R4 with the p4 = const plane
The result for the optimal parameters $p_i$ is identical to the previous results. In addition the number of required iterations and the resulting residual are displayed. The command curvefit_stat() will determine the covariance matrix covp and the correlation matrix corp.
With this information the analysis of the domain of confidence can be performed, identical to the results
by leasqr().
It is no surprise that the same result is found. fsolve() does not estimate standard deviations for the
parameters. One might use nlparci() to determine confidence intervals.
It is no surprise that the same result is found. lsqcurvefit() does not estimate standard deviations for
the parameters. The command lsqcurvefit() can return more results, e.g. the residual or the Jacobian
matrix with the partial derivatives with respect to the parameters.
• polyconf: confidence and prediction intervals for polynomial fitting, uses wpolyfit
3.6 Resources
• Nonlinear Equations: based on my notes
• Numerical Integration
– Detailed information on ODE solvers can be found in the book [Butc03] by John Butcher ([Butc16] is a newer edition) or in [HairNorsWann08], [HairNorsWann96].
– Use the article ode_suite.pdf ([ShamReic97]) by Shampine and Reichelt for excellent information on the MATLAB ODE solvers. The document is available at the Mathworks web site at https://www.mathworks.com/help/pdf_doc/otherdocs/ode_suite.pdf .
– Information on different Runge–Kutta methods is available on Wikipedia:
https://en.wikipedia.org/wiki/List_of_Runge-Kutta_methods
https://en.wikipedia.org/wiki/Runge-Kutta_methods
– Marc Compere provided codes that (should) work with MATLAB and Octave. They are available in a repository at https://gitlab.com/comperem/ode_solvers .
* rk2fixed.m, rk4fixed.m and rk8fixed.m are explicit Runge–Kutta methods with fixed step size of order 2, 4 and 8.
* ode23.m, ode45.m and ode78.m are explicit Runge–Kutta methods with adaptive step sizes.
Bibliography
[AbraSteg] M. Abramowitz and I. A. Stegun. Handbook of Mathematical Functions. Dover, 1972.
[Agga20] C. Aggarwal. Linear Algebra and Optimization for Machine Learning. Springer, first edition,
2020.
[Butc03] J. Butcher. Numerical Methods for Ordinary Differential Equations. John Wiley & Sons, Ltd,
second edition, 2003.
filename                    description

nonlinear equations
NewtonSolve.m               function file to apply Newton's method
exampleSystem.m             first example of a system of equations, Example 3–8
Newton2D.m                  code to visualize Example 3–8
testBeam.m                  code to solve Example 3–13
Keller.m                    script file to solve Example 3–14
tridiag.m                   MATLAB function to solve tridiagonal systems

eigenvalues and eigenvectors
ReduceQuadraticForm.m       function used in Example 3–27

numerical integration
simpson.m                   integration using Simpson's rule
IntegrateGauss.m            integration using the Gauss approach

ordinary differential equations, ODEs
ode_RungeKutta              Runge–Kutta method with fixed step size
ode_Heun                    Heun method with fixed step size
ode_Euler                   Euler method with fixed step size
Pendulum.m                  pendulum demo for the algorithms with fixed step size
rk23.m                      an adaptive algorithm, based on the method of Heun
rk45.m                      an adaptive algorithm, based on the method of Runge–Kutta
Test_rk23_rk45.m            demo code for rk23 and rk45 in Example 3–66

linear and nonlinear regression
LinearRegression.m          linear regression, Octave version
LinearRegression.m.Matlab   linear regression, MATLAB version
ExampleLinReg.m             code to generate Section 3.5.4
leasqr.m                    MATLAB version of the command leasqr()
dfdp.m                      MATLAB version of the support file for leasqr()
[Butc16] J. Butcher. Numerical Methods for Ordinary Differential Equations. John Wiley & Sons, Ltd,
third edition, 2016.
[DrapSmit98] N. Draper and H. Smith. Applied Regression Analysis. Wiley, third edition, 1998.
[GoluVanLoan96] G. Golub and C. Van Loan. Matrix Computations. Johns Hopkins University Press, third
edition, 1996.
[GoluVanLoan13] G. Golub and C. Van Loan. Matrix Computations. Johns Hopkins University Press,
fourth edition, 2013.
[HairNorsWann08] E. Hairer, S. Nørsett, and G. Wanner. Solving Ordinary Differential Equations I: Non-
stiff Problems. Springer Series in Computational Mathematics. Springer Berlin Heidelberg, second
edition, 1993. third printing 2008.
[HairNorsWann96] E. Hairer and G. Wanner. Solving Ordinary Differential Equations II: Stiff and Differential-Algebraic Problems. Springer Series in Computational Mathematics. Springer, second edition, 1996.
[HornJohn90] R. Horn and C. Johnson. Matrix Analysis. Cambridge University Press, 1990.
[IsaaKell66] E. Isaacson and H. B. Keller. Analysis of Numerical Methods. John Wiley & Sons, 1966.
Republished by Dover in 1994.
[Kell92] H. B. Keller. Numerical Methods for Two–Point Boundary Value Problems. Dover, 1992.
[Linz79] P. Linz. Theoretical Numerical Analysis. John Wiley & Sons, 1979. Republished by Dover.
[MeybVach91] K. Meyberg and P. Vachenauer. Höhere Mathematik II. Springer, Berlin, 1991.
[MontRung03] D. Montgomery and G. Runger. Applied Statistics and Probability for Engineers. John
Wiley & Sons, third edition, 2003.
[RalsRabi78] A. Ralston and P. Rabinowitz. A First Course in Numerical Analysis. McGraw–Hill, second edition, 1978. Republished by Dover in 2001.
[RawlPantuDick98] J. Rawlings, S. Pantula, and D. Dickey. Applied Regression Analysis. Springer Texts in Statistics. Springer, New York, second edition, 1998.
[SchnWihl11] H. R. Schneebeli and T. Wihler. The Newton-Raphson Method and Adaptive ODE Solvers. Fractals, 19(1):87–99, 2011.
[ShamReic97] L. Shampine and M. W. Reichelt. The MATLAB ODE Suite. SIAM Journal on Scientific
Computing, 18:1–22, 1997.
[Stah00] A. Stahel. Calculus of Variations and Finite Elements. supporting notes, 2000.
[Octave07] A. Stahel. Octave and Matlab for Engineers. lecture notes, 2007.
[TongRoss08] P. Tong and J. Rossettos. Finite Element Method, Basic Technique and Implementation. MIT,
1977. Republished by Dover in 2008.
[YounGreg72] D. M. Young and R. T. Gregory. A Survey of Numerical Analysis, Volume 1. Dover Publica-
tions, New York, 1972.
Chapter 4

Finite Difference Methods
• you should understand the basic concept of a finite difference approximation and finite difference
stencils.
• should be familiar with the concepts of consistency, stability and convergence of a finite difference
approximation.
• should be able to set up and solve second order linear boundary value problems on intervals and
rectangles.
• should be able to set up and solve second order linear initial boundary value problems on intervals.
• should be able to set up and solve some nonlinear boundary value problems with the help of a finite
difference approximation.
The forward, backward and centered difference approximations of the first derivative use the values at the grid points t − h, t and t + h:
$$y'(t) \approx \frac{y(t+h)-y(t)}{h}\,,\qquad y'(t) \approx \frac{y(t)-y(t-h)}{h}\,,\qquad y'(t) \approx \frac{y(t+h)-y(t-h)}{2\,h}\,.$$
The corresponding stencils carry the weights $-\frac1h$ and $+\frac1h$, resp. $-\frac{1}{2h}$, 0 and $+\frac{1}{2h}$, at the involved grid points.
Figure 4.1: FD stencils for y 0 (t), forward, backward and centered approximations
In Figure 4.2 use the values of the function at the grid points t − h, t and t + h to find formulas for the first and second order derivatives. The second derivative is examined as derivative of the derivative. The above observations hint towards the following approximate formulas.
$$\frac{d}{dt}\,y(t) \approx \frac{y(t+h)-y(t)}{h} \approx \frac{y(t)-y(t-h)}{h} \approx \frac{y(t+h)-y(t-h)}{2\,h}$$
$$\frac{d}{dt}\,y(t) \approx \frac{y(t+h/2)-y(t-h/2)}{h}$$
$$\frac{d^2}{dt^2}\,y(t) \approx \frac{y'(t+h/2)-y'(t-h/2)}{h} \approx \frac1h\left(\frac{y(t+h)-y(t)}{h} - \frac{y(t)-y(t-h)}{h}\right) = \frac{y(t-h)-2\,y(t)+y(t+h)}{h^2}$$
The quality of the above approximations is determined by the error. For smaller values of h > 0 the error should be as small as possible. To determine the size of this error use the Taylor approximation
$$y(t+x) = y(t) + y'(t)\,x + \frac{y''(t)}{2}\,x^2 + \frac{y'''(t)}{3!}\,x^3 + \frac{y^{(4)}(t)}{4!}\,x^4 + O(x^5)$$
with different values for x (use x = ±h with |h| ≪ 1) and verify that
$$y(t+h) = y(t) + y'(t)\,h + \frac{y''(t)}{2}\,h^2 + \frac{y'''(t)}{3!}\,h^3 + O(h^4)$$
$$y(t-h) = y(t) - y'(t)\,h + \frac{y''(t)}{2}\,h^2 - \frac{y'''(t)}{3!}\,h^3 + O(h^4)$$
$$y(t+h) - y(t-h) = 2\,y'(t)\,h + 2\,\frac{y'''(t)}{3!}\,h^3 + O(h^4)$$
$$\frac{y(t+h)-y(t-h)}{2\,h} - y'(t) = \frac{y'''(t)}{3!}\,h^2 + O(h^3) = O(h^2)$$
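The predicted orders of these errors can be verified numerically; a small sketch, with the test function y(t) = exp(t) and the step sizes chosen for illustration:

y = @(t) exp(t); t0 = 0; % test function with y'(0) = 1
for h = [1e-1 1e-2 1e-3]
err_fwd = abs((y(t0+h)-y(t0))/h - 1); % forward difference, error O(h)
err_ctr = abs((y(t0+h)-y(t0-h))/(2*h) - 1); % centered difference, error O(h^2)
printf('h = %8.1e: forward %8.2e, centered %8.2e\n', h, err_fwd, err_ctr)
end%for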
Figure 4.2: The values of the function at the grid points t − h, t and t + h
With the above finite difference stencils replace derivatives by approximate finite differences, accepting a discretization error. As h converges to 0 we expect this error to approach 0. But in most cases small values of h will lead to larger arithmetic errors when performing the operations, and this contribution will get larger as h approaches 0. For the total error the two contributions have to be added. This basic rule
total error ≈ discretization error + arithmetic error
is illustrated in Figure 4.3. As a consequence do not expect to get arbitrarily close to an error of 0. In this chapter only the discretization errors are carefully examined, assuming that the arithmetic error is negligible. This does not imply that rounding errors can safely be ignored, as illustrated in an exercise.
¹ Use the notation f(h) = O(hⁿ) to indicate |f(h)| ≤ C hⁿ for some constant C. This indicates that the expression f(h) is of order hⁿ or less.
Figure 4.3: The total error, composed of the discretization error and the arithmetic error, as function of the stepsize h
The Laplace operator
$$-\Delta u = -\frac{\partial^2 u}{\partial x^2} - \frac{\partial^2 u}{\partial y^2}$$
appears in most 2-dimensional steady state problems. Based on Table 4.1 obtain a simple finite difference approximation
$$-\Delta u(x,y) = \frac{-u(x-h,y) + 2\,u(x,y) - u(x+h,y)}{h^2} + \frac{-u(x,y-h) + 2\,u(x,y) - u(x,y+h)}{h^2} + O(h^2)\,.$$
For the rectangular grid in Figure 4.4 set
$$\frac{d}{dt}\,u(t_i, x_j) = \frac{u_{i+1,j} - u_{i,j}}{h_t} + O(h_t)$$
$$u''(t_i, x_j) = \frac{u_{i,j-1} - 2\,u_{i,j} + u_{i,j+1}}{h_x^2} + O(h_x^2)$$
$$\left(\frac{d}{dt}u - u''\right)_{i,j} \approx -\frac{1}{h_x^2}\,u_{i,j-1} + \left(\frac{2}{h_x^2} - \frac{1}{h_t}\right) u_{i,j} - \frac{1}{h_x^2}\,u_{i,j+1} + \frac{1}{h_t}\,u_{i+1,j}$$
(stencil weights: 4/h² at the center, −1/h² at the four neighbours)
Figure 4.4: Finite difference stencil for −uxx − uyy if h = hx = hy
(stencil weights: 2/h_x² − 1/h_t at the center, −1/h_x² at its space neighbours, 1/h_t at the new time level)
Figure 4.5: Finite difference stencil for ut − uxx , explicit, forward
This leads to the stencil in Figure 4.5, the explicit finite difference stencil for the heat equation.
If the backward difference approximation is used for the time derivative $\frac{d}{dt}u$, find the implicit finite difference stencil in Figure 4.6.
(stencil weights: 2/h_x² + 1/h_t at the center on the new time level, −1/h_x² at its space neighbours, −1/h_t at the old time level)
Figure 4.6: Finite difference stencil for ut − uxx , implicit, backward
The exact solution is given by $y(t) = y_0\,e^{-\lambda t}$. Obviously the solution is bounded on any interval in R⁺ and we expect its numerical approximation to remain bounded too, independent of the final time T.
To visualize the context consider a similar problem $\frac{d}{dt}y(t) + \lambda\,y(t) = f(t)$ for a given function f(t). This problem can be discretized with stepsize h at the grid points $t_i = i\,h$ for 0 ≤ i ≤ N − 1. This will lead to an approximation on the interval [0, T] = [0, (N−1)h]. The unknown function y(t) in the interval t ∈ [0, T] is replaced by a vector $\vec y \in$ Rᴺ and the function f(t) is replaced by a vector $\vec f \in$ Rᴺ.
                        differential operation
y ∈ C¹([0,T], R)  ----------------------------->  f ∈ C⁰([0,T], R)
      | P                                               | P
      v                 finite difference operation     v
   y⃗ ∈ Rᴺ        ----------------------------->      f⃗ ∈ Rᴺ
$$-u''(x) = f(x) \;\text{ for }\; 0 < x < L \quad\text{with boundary conditions}\quad u(0) = u(L) = 0\,. \tag{4.2}$$
Assume that for a given function f(x) the exact solution is given by the function u(x). The differential equation is replaced by a difference equation. For n ∈ N discretize the interval by $x_k = k\cdot h = k\,\frac{L}{n+1}$ and then consider an approximate solution $u_k \approx u(k\cdot h)$ for k = 0, 1, 2, ..., n, n+1. The finite difference approximation of the second derivative in Table 4.1 leads for interior points to
$$-\frac{u_{k-1} - 2\,u_k + u_{k+1}}{h^2} = f_k = f(k\cdot h) \quad\text{for } k = 1, 2, 3, \ldots, n\,. \tag{4.3}$$
The boundary conditions are taken into account by $u_0 = u_{n+1} = 0$. These linear equations can be written in the form
$$\frac{1}{h^2}\begin{bmatrix} 2 & -1 & & & \\ -1 & 2 & -1 & & \\ & -1 & 2 & \ddots & \\ & & \ddots & \ddots & -1 \\ & & & -1 & 2 \end{bmatrix} \cdot \begin{pmatrix} u_1 \\ u_2 \\ u_3 \\ \vdots \\ u_n \end{pmatrix} = \begin{pmatrix} f_1 \\ f_2 \\ f_3 \\ \vdots \\ f_n \end{pmatrix}$$
The solution of this linear system will create the values of the approximate solution at the grid points. Exact and approximate solution are shown in Figure 4.8. As h → 0 we hope that $\vec u$ will converge to the exact solution u(x).
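A minimal sketch of setting up and solving this tridiagonal system in Octave; the right hand side f is an arbitrary choice with known exact solution:

L = 1; n = 99; h = L/(n+1);
x = (h:h:L-h)'; % interior grid points
f = @(x) pi^2*sin(pi*x); % RHS with exact solution u(x) = sin(pi*x)
A = spdiags(ones(n,1)*[-1 2 -1],[-1 0 1],n,n)/h^2;
u = A\f(x); % approximate solution at the grid points
max(abs(u - sin(pi*x))) % discretization error, of order h^2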
Figure 4.8: Exact and approximate solution u(x) as function of the position x
To examine the behavior of the approximate solution use a general framework for finite difference approximations to boundary value problems. Examine Figure 4.9 to observe how a differential equation is replaced by an approximate system of linear equations. The similar approach for general problems is shown in Figure 4.10.
Consider functions defined on a domain Ω ⊂ Rᴺ and for a fixed mesh size h cover the domain with a discrete set of points $x_k \in \Omega$. This leads to the following vector spaces:
                         −∂²/∂x²
u ∈ C²([0,L], R)  ------------------->  f ∈ C⁰([0,L], R)
      | P_h                                  | P_h
      v                  A_h ·               v
  u⃗_h ∈ Rᴺ        ------------------->   f⃗_h ∈ Rᴺ
Figure 4.9: The differential equation and its finite difference approximation
• E₁ is a space of functions defined on Ω. In the above example consider u ∈ C²([0,L], R).
• E₂ is a space of functions defined on Ω. In the above example consider f ∈ C⁰([0,L], R) with the norm $\|f\|_{E_2} = \max\{|f(x)| : 0 \le x \le L\}$.
• $E_1^h$ is a space of discretized functions. In the above example consider $\vec u \in$ Rⁿ $= E_1^h$, where $u_k = u(k\cdot h)$. The vector space $E_1^h$ is equipped with the norm $\|\vec u\|_{E_1^h} = \max\{|u_k| : 1 \le k \le n\}$.
• $E_2^h$ is also a space of discretized functions. In the above example consider $\vec f \in$ Rⁿ $= E_2^h$, where $f_k = f(k\cdot h)$. The vector space $E_2^h$ is equipped with the norm $\|\vec f\|_{E_2^h} = \max\{|f_k| : 1 \le k \le n\}$.
• For u ∈ E₁ let F : E₁ → E₂ be the linear differential operator. In the above example F(u) = −u″.
• For $\vec u \in E_1^h$ let $F_h : E_1^h \to E_2^h$ be the linear difference operator. In the above example
$$F_h(\vec u)_k = \frac{-u_{k-1} + 2\,u_k - u_{k+1}}{h^2}$$
• For u ∈ E₁ let $\vec u = P_1^h(u) \in E_1^h$ be the projection of the function u ∈ E₁ onto $E_1^h$. It is determined by evaluating the function at the points $x_k$.
• For f ∈ E₂ let $\vec f = P_2^h(f) \in E_2^h$ be the projection of the function f ∈ E₂ onto $E_2^h$. It is determined by evaluating the function at the points $x_k$.
The above operations are illustrated in Figure 4.10. There is a recent article [KhanKhan18] on the importance of this structure of the four fundamental spaces of numerical analysis.
                   F                              as h → 0:
 u ∈ E₁    ---------------->   f ∈ E₂             ‖P₁ʰ u‖_{E₁ʰ} → ‖u‖_{E₁}
    | P₁ʰ                         | P₂ʰ           ‖P₂ʰ f‖_{E₂ʰ} → ‖f‖_{E₂}
    v              F_h            v
 u_h ∈ E₁ʰ ---------------->   f_h ∈ E₂ʰ          P₂ʰ(F(u)) ≈ F_h(P₁ʰ(u))
Figure 4.10: The four fundamental spaces and the operators acting between them
4–1 Definition : For a given f ∈ E₂ let u ∈ E₁ be the solution of F(u) = f and $\vec u_h$ the solution of $F_h(\vec u_h) = P_2^h(f)$.
• The approximation scheme is called consistent of order p if
$$\|F_h(P_1^h(u)) - P_2^h(F(u))\|_{E_2^h} \le c_2\, h^p\,,$$
where the constant c₂ is independent of h, but it may depend on u. This implies that the diagram in Figure 4.10 is almost commutative as h approaches 0.
• The above approximation scheme is said to be stable if the linear operator $F_h \in L(E_1^h, E_2^h)$ is invertible and there exists a constant M, independent of h, such that
$$\|u_h\|_{E_1^h} \le M\,\|F_h(u_h)\|_{E_2^h} \quad\text{for all } u_h \in E_1^h\,.$$
For the above example the stability condition reads as $\|\vec u\| \le M\,\|A_h\,\vec u\|$, or equivalently $\|A_h^{-1}\,\vec f\| \le M\,\|\vec f\|$, independent of h. This is thus the condition on the matrix norm of the inverse matrix to be independent of h, i.e. $\|A_h^{-1}\| \le M$.
Now state a fundamental result for finite difference approximations to differential equations. The theorem is also known as Lax equivalence theorem². The result applies to a large variety of problems. We will examine only a few of them.
4–2 Theorem : If a finite difference scheme is consistent of order p and stable, then it is convergent of order p. A short formulation is:
consistency and stability imply convergence.
Proof : Let u be the solution of F(u) = f and $\vec u$ the solution of $F_h(\vec u) = P_2^h(f) = P_2^h(F(u))$. Since the scheme is stable and consistent of order p we find
$$\|P_1^h(u) - \vec u\|_{E_1^h} = \|F_h^{-1}\,F_h(P_1^h(u) - \vec u)\|_{E_1^h} \le \|F_h^{-1}\|\;\|F_h(P_1^h(u)) - F_h(\vec u)\|_{E_2^h} = \|F_h^{-1}\|\;\|F_h(P_1^h(u)) - P_2^h(F(u))\|_{E_2^h} \le M\,c_2\,h^p\,.$$
Table 4.2 illustrates the abstract concept using the example equation (4.2).
4–3 Result : To verify convergence of the solution of the finite difference approximation of equation (4.2) to the exact solution we have to assure that the scheme is consistent and stable. Use the finite difference approximation
$$-u''(x) = f(x) \quad\longrightarrow\quad \frac{-u_{k-1} + 2\,u_k - u_{k+1}}{h^2} = f_k\,.$$
• Consistency: According to equation (4.1) or Table 4.1 (page 256) the scheme is consistent of order 2.
• Stability: Let $\vec u$ be the solution of the equation (4.3) with right hand side $\vec f$. Then
$$\|\vec u\|_\infty = \max_{1\le k\le n} |u_k| \le \frac{L^2}{2}\,\max_{1\le k\le n} |f_k| = \frac{L^2}{2}\,\|\vec f\|_\infty \quad\text{independent of } h\,. \tag{4.4}$$
² We only use the result that a consistent and stable scheme has to be convergent. Lax also showed that a consistent and convergent scheme has to be stable. Find a proof in [AtkiHan09].
Proof : The proof of stability of this finite difference scheme is based on a discrete maximum principle³. Proceed in two stages.
For the continuous case this corresponds to functions with u″(x) ≥ 0 attaining the largest value on the boundary.
To verify the discrete statement assume that $\max_{1\le k\le n}\{u_k\} = u_i$ for some index 1 ≤ i ≤ n. Then
$$\max_{1\le k\le n}\{w_k^+\} = \max_{1\le k\le n}\left\{u_k + \tfrac{C}{2}\,v_k\right\} \le \tfrac{C}{2}\,\max\{v_0, v_{n+1}\} = \tfrac{C}{2}\,L^2\,.$$
Since $v_k \ge 0$, this implies $u_k \le \frac{C}{2}\,L^2$.
³ Readers familiar with partial differential equations will recognize the maximum principle and the construction of sub- and super-solutions to obtain à priori bounds.
one can apply the above procedures again to obtain a nonlinear system of equations.
$$\frac{1}{0.4^2}\begin{bmatrix} +2 & -1 & 0 & 0 \\ -1 & +2 & -1 & 0 \\ 0 & -1 & +2 & -1 \\ 0 & 0 & -1 & +2 \end{bmatrix} \begin{pmatrix} u_1 \\ u_2 \\ u_3 \\ u_4 \end{pmatrix} = \begin{pmatrix} 0.4 + \cos(u_1) \\ 0.8 + \cos(u_2) \\ 1.2 + \cos(u_3) \\ 1.6 + \cos(u_4) \end{pmatrix}$$
To solve this nonlinear system use methods from chapter 3.1, i.e. partial substitution or Newton’s method.
Using obvious notations denote the above system by
A ~u = ~x + cos(~u) .
• To use the method of partial substitution choose a starting vector $\vec u^0$, e.g. $\vec u^0 = (0,0,0,0)^T$. Then use the iteration scheme
$$A\,\vec u^{k+1} = \vec x + \cos(\vec u^k) \quad\Longleftrightarrow\quad \vec u^{k+1} = A^{-1}\bigl(\vec x + \cos(\vec u^k)\bigr)$$
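A minimal sketch of this iteration for the 4×4 system above; the number of iterations is an illustrative choice:

h = 0.4; n = 4;
A = (diag(2*ones(n,1)) - diag(ones(n-1,1),1) - diag(ones(n-1,1),-1))/h^2;
x = h*(1:n)'; % grid points 0.4, 0.8, 1.2, 1.6
u = zeros(n,1); % starting vector
for k = 1:20
u = A\(x + cos(u)); % one step of partial substitution
end%for
u % approximate solution of the nonlinear system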
corresponds to the stretching of a beam. We consider, at first, constant cross sections only and we will work with the constant EA. The interval [0, L] is divided into N + 1 subintervals of equal length h = L/(N+1). Using the notations $u_i \approx u(i\,h)$ and $f_i = f(i\,h)$ and the finite difference formula for u″ in Table 4.1 we replace the differential equation at all interior points by the difference equation
$$-\frac{EA}{h^2}\,(u_{i-1} - 2\,u_i + u_{i+1}) = f_i \quad\text{for } 1 \le i \le N$$
for the unknowns $u_i$. The boundary conditions lead to $u_0 = 0$ and $u_{N+1} = u_M$. Using a matrix notation we find a linear system of equations.
$$\begin{bmatrix} 2 & -1 & & & \\ -1 & 2 & -1 & & \\ & \ddots & \ddots & \ddots & \\ & & -1 & 2 & -1 \\ & & & -1 & 2 \end{bmatrix} \cdot \begin{pmatrix} u_1 \\ u_2 \\ \vdots \\ u_{N-1} \\ u_N \end{pmatrix} = \frac{h^2}{EA}\begin{pmatrix} f_1 \\ f_2 \\ \vdots \\ f_{N-1} \\ f_N \end{pmatrix} + \begin{pmatrix} 0 \\ 0 \\ \vdots \\ 0 \\ u_M \end{pmatrix}$$
and set the corresponding variables in Octave. Then set up the matrix A, solve the system and plot the solution, leading to Figure 4.11(a).
BeamStretch.m
EA = 1.0; L = 3; uM = 0.2; N = 20;
fRHS = @(x) sin(0.5*x); % define the function for the RHS
h = L/(N+1); % stepsize
x = (h:h:L-h)'; f = fRHS(x);
g = h^2/EA*f; g(N) = g(N)+uM;
The force F(x) is given by
$$F(x) = EA\,\frac{d\,u(x)}{dx}\,.$$
This can be approximated by a centered difference formula
$$F\left(x_i + \tfrac{h}{2}\right) \approx EA\,\frac{u_{i+1} - u_i}{h}\,.$$
Thus plot the force F, as seen in Figure 4.11(b). This graph shows that the left part of the beam is stretched (u′ > 0), while the right part is compressed (u′ < 0).
Figure 4.11: (a) displacement and (b) force, as functions of the distance
du = diff([0;u;uM])/h;
plot([0;x]+h/2,EA*du); grid on
• The above example also contains the solution to the steady state heat equation (1.2) on page 10.
• The above example also contains the solution to the steady state of the vertical deformation of a
horizontal string, equation (1.7) on page 13.
This additional equation can be added to the previous system of equations, leading to a system of N + 1 linear equations.
$$\begin{bmatrix} 2 & -1 & & & & \\ -1 & 2 & -1 & & & \\ & \ddots & \ddots & \ddots & & \\ & & -1 & 2 & -1 & \\ & & & -1 & 2 & -1 \\ & & & & -1 & +1 \end{bmatrix} \cdot \begin{pmatrix} u_1 \\ u_2 \\ \vdots \\ u_{N-1} \\ u_N \\ u_{N+1} \end{pmatrix} = \frac{h^2}{EA}\begin{pmatrix} f_1 \\ f_2 \\ \vdots \\ f_{N-1} \\ f_N \\ f_{N+1}/2 \end{pmatrix} + \begin{pmatrix} 0 \\ 0 \\ \vdots \\ 0 \\ 0 \\ \frac{h\,F}{EA} \end{pmatrix}$$
This matrix is again symmetric, positive definite and tridiagonal. For the simple case f(x) = 0 the exact solution $u(x) = \frac{F}{EA}\,x$ is known. This is confirmed by the Octave/MATLAB computations below and the resulting straight line in Figure 4.12.
h = L/(N+1); % stepsize
x = (h:h:L)';
f = zeros(size(x)); % f(x) = 0
g = h^2/EA*f; g(N+1) = g(N+1)/2+h*F/EA;
u = trisolve(di,up,g);
plot([0;x],[0;u])
♦
4–8 Example : Stretching of a beam by a given force and variable cross section
If the cross section A in the previous example 4–7 is not constant, the above algorithm has to be modified. The differential equation
$$-\frac{d}{dx}\left(EA(x)\,\frac{d\,u(x)}{dx}\right) = f(x) \quad\text{for } 0 < x < L$$
now uses a variable coefficient a(x) = EA(x). The boundary conditions remain u(0) = 0 and $EA(L)\,\frac{d\,u(L)}{dx} = F$. To determine the derivative of $g(x) = a(x)\,u'(x)$ use the centered difference formula $g'(x) = \frac1h\,\bigl(g(x+\frac h2) - g(x-\frac h2)\bigr) + O(h^2)$ and the approximations
$$u'(x - h/2) = \frac{u(x) - u(x-h)}{h} + O(h^2)$$
$$u'(x + h/2) = \frac{u(x+h) - u(x)}{h} + O(h^2)$$
$$\frac{d}{dx}\bigl(a(x)\,u'(x)\bigr) = \frac1h\Bigl(a(x+h/2)\,u'(x+h/2) - a(x-h/2)\,u'(x-h/2)\Bigr) + O(h^2)$$
$$\approx \frac1h\left(a(x+h/2)\,\frac{u(x+h)-u(x)}{h} - a(x-h/2)\,\frac{u(x)-u(x-h)}{h}\right)$$
$$= \frac{1}{h^2}\left(a(x-\tfrac h2)\,u(x-h) - \bigl(a(x-\tfrac h2) + a(x+\tfrac h2)\bigr)\,u(x) + a(x+\tfrac h2)\,u(x+h)\right).$$
Figure 4.12: Stretching of a beam with constant and variable cross section
One can verify that the error of this finite difference approximation is of the order h². Observe that the values of the coefficient function a(x) = EA(x) are used at the midpoints of the intervals of length h. For 0 ≤ i ≤ N set $a_i = a(i\,h + \frac h2)$ to find the difference scheme
$$-a_{i-1}\,u_{i-1} + (a_{i-1} + a_i)\,u_i - a_i\,u_{i+1} = h^2\,f_i \quad\text{for } 1 \le i \le N\,.$$
To take the boundary condition EA(L) u′(L) = a(L) u′(L) = F into account proceed just as in Example 4–7.
$$F = a(L)\,\frac{d\,u(L)}{dx} \approx \frac12\left(a(L-\tfrac h2)\,u'(L-\tfrac h2) + a(L+\tfrac h2)\,u'(L+\tfrac h2)\right) + O(h^2)$$
$$\approx \frac{a_N\,(-u_N + u_{N+1})}{2\,h} + \frac{a_{N+1}\,(-u_{N+1} + u_{N+2})}{2\,h} + O(h^2)$$
$$a_{N+1}\,u_{N+2} = +a_N\,u_N - (a_N - a_{N+1})\,u_{N+1} + 2\,h\,F + O(h^3)$$
Using this information for the finite difference approximation of the differential equation at x = L this leads to
$$-a_N\,u_N + a_N\,u_{N+1} = \frac{h^2}{2}\,f_{N+1} + h\,F\,. \tag{4.5}$$
The simpler one-sided approximation
$$a(L)\,u'(L) \approx a(L-\tfrac h2)\,\frac{u(L) - u(L-h)}{h} = a_N\,\frac{u_{N+1} - u_N}{h} = F$$
would generate a similar equation, without the contribution $f_{N+1}/2$, which is consistent of order h too. A more detailed analysis shows that the approach in (4.5) is consistent of order h² for constant functions a(x). Thus
equation (4.5) should be used. With this additional equation arrive at a system of N + 1 linear equations.
$$\begin{bmatrix} a_0+a_1 & -a_1 & & & & \\ -a_1 & a_1+a_2 & -a_2 & & & \\ & -a_2 & a_2+a_3 & -a_3 & & \\ & & \ddots & \ddots & \ddots & \\ & & & -a_{N-1} & a_{N-1}+a_N & -a_N \\ & & & & -a_N & a_N \end{bmatrix} \cdot \begin{pmatrix} u_1 \\ u_2 \\ u_3 \\ \vdots \\ u_N \\ u_{N+1} \end{pmatrix} = h^2 \begin{pmatrix} f_1 \\ f_2 \\ f_3 \\ \vdots \\ f_N \\ f_{N+1}/2 \end{pmatrix} + \begin{pmatrix} 0 \\ 0 \\ 0 \\ \vdots \\ 0 \\ h\,F \end{pmatrix}$$
The result in Figure 4.12 (page 269) confirms the fact that the thinner beam is weaker, i.e. it will stretch
more than the beam with constant, larger cross section. ♦
Figure 4.13: A finite difference grid for a steady state heat equation
The unknown values of ui,j have to be numbered, and there are different options. Here is one of them:
• First number the nodes in the lowest row with numbers 1 through n.
• Then number the nodes in the second row with numbers n + 1 through 2 n.
• Proceed through all the rows. The top right corner will obtain the number n2 .
The above finite difference approximation of the PDE then reads as
$$\frac{4\,u_{(i-1)\,n+j} - u_{(i-2)\,n+j} - u_{i\,n+j} - u_{(i-1)\,n+j-1} - u_{(i-1)\,n+j+1}}{h^2} = f_{(i-1)\,n+j}\,.$$
This approximation is consistent of order 2. Arguments very similar to Result 4–3 show that the scheme is
stable and thus it is convergent of order 2. Using the model matrix Ann from Section 2.3.2 (page 34) this
leads to a system of linear equations.
$$A_{nn}\,\vec u = \vec f$$
with a banded, symmetric, positive definite matrix $A_{nn}$. The above is implemented in Octave to solve the system of linear equations and generate the graphics. As an example solve the problem with the right hand side f(x, y) = x².
Plate.m
%%%%% script file to solve the heat equation on a unit square
n = 7;
f = @(x,y) x.^2; % describe the heating contribution
%%%%%%%%%%%%%%%%%%%%%%%%%% no modifications necessary beyond this line
h = 1/(n+1);
Dxx = spdiags(ones(n,1)*[-1 2 -1],[-1 0 1],n,n)/h^2;
A = kron(Dxx,eye(n)) + kron(eye(n),Dxx);
x = h:h:1-h; y = x;
[xx,yy] = meshgrid(x,y); % generate the mesh for the graphics
fvec = f(xx(:),yy(:)); % evaluate the heating contribution
tic(); % start the stop watch
u = A\fvec; % solve the system of equations
toc() % display the solution time
mesh(xx,yy,reshape(u,n,n)) % display the solution
xlabel('x'); ylabel('y');
The result of the above code is not completely satisfying, since the zero values of the function on the
boundary are not displayed. The code below adds these values and will generate Figure 4.14. The graph
clearly displays the higher temperature in the section with large values of x. This is caused by the heating
term f (x, y) = x2 .
Figure 4.14: Solution of the steady state heat equation on the unit square
The model matrix $A_{nn}$ on page 34 is symmetric of size n² × n² and it has a band structure with semi-bandwidth n. Thus it is a very sparse matrix. In the above code the system of linear equations is solved with the command u=A\fvec, and this will take advantage of the symmetry and sparseness. It is not a good idea to use u = inv(A)*fvec since this creates the full matrix A⁻¹. A better idea is to use Octave/MATLAB commands to create A as a sparse matrix, then the built-in algorithm will take advantage of this. Thus we can solve problems with much finer grids, see the modified code below. In addition we may allow a different number of grid points in the two directions.
• Use nx interior points in x direction and ny points in y direction for a matrix of size (nx · ny) × (nx ·
ny). Numbering in the x direction first will lead to a semi-bandwidth of nx, but numbering in the y
direction first will lead to a semi-bandwidth of ny.
• To construct the matrices representing the derivatives in x and y direction independently use the
command spdiags(). Then use the Kronecker product (command kron()) to construct the sparse
matrix A.
• The backslash operator \ in MATLAB or Octave will take full advantage of the sparsity structure of
this matrix, using algorithms presented in Chapter 2, in particular Section 2.6.7.
Heat2DStatic.m
nx = 55; ny = 50;
f = @(x,y) x.^2;
%%%%%%%%%%%%%%%%%%%%%%%%%%
hx = 1/(nx+1); hy = 1/(ny+1);
Dxx = spdiags(ones(nx,1)*[-1 2 -1],[-1 0 1],nx,nx)/(hx^2);
Dyy = spdiags(ones(ny,1)*[-1 2 -1],[-1 0 1],ny,ny)/(hy^2);
A = kron(Dxx,speye(ny))+kron(speye(nx),Dyy);
x = hx:hx:1-hx; y = hy:hy:1-hy;
[xx,yy] = meshgrid(x,y);
fvec = f(xx(:),yy(:));
tic()
u = A\fvec;
solutionTime = toc()
The approximate solutions generated by a finite difference scheme should satisfy this property too, leading to the stability condition.
The two dimensional domain (t, x) ∈ R⁺ × [0, 1] is discretized as illustrated in Figure 4.15. For step sizes $h = \Delta x = \frac{1}{n+1}$ and ∆t let
The boundary condition u(t, 0) = u(t, 1) = 0 implies ui,0 = ui,n+1 = 0 and the initial condition u(0, x) =
u0 (x) leads to u0,j = u0 (j · ∆x). The PDE (4.6) is replaced by a finite difference approximation on the grid
shown in Figure 4.15 and the result is examined.
Figure 4.15: The grid for the discretization of the domain (t, x) ∈ R⁺ × [0, 1], with the boundary conditions u = 0 on the left and right edges, the initial condition u(0, x) = u₀(x) at the bottom, and a typical grid value u_{i,j}
The solution of the finite difference equation will be computed with the help of time steps, i.e. use the values at one time level t = i·∆t and then compute the values at the next level t + ∆t = (i+1)∆t. Thus put all values at one time level t = i∆t into a vector $\vec u_i = (u_{i,1}, u_{i,2}, \ldots, u_{i,n})^T$.
A finite difference approximation to the second order space derivative is given by (see Table 4.1 on page 256) the tridiagonal matrix $A_n$ with the entries $(-1, 2, -1)/(\Delta x)^2$, i.e. $-u_{xx}$ is replaced by $A_n\,\vec u$. Now find the approximation of the PDE (Partial Differential Equation) by a linear system of ordinary differential equations
$$\frac{d}{dt}\,\vec u(t) = -\kappa\,A_n\,\vec u(t) \quad\text{with}\quad \vec u(0) = \vec u_0\,. \tag{4.7}$$
The eigenvalues and eigenvectors of the matrix $A_n$ are given by
$$\lambda_k = \frac{1}{\Delta x^2}\left(2 - 2\,\cos\frac{k\,\pi}{n+1}\right) = \frac{4}{\Delta x^2}\,\sin^2\frac{k\,\pi}{2\,(n+1)}$$
and
$$\vec v_k = \left(\sin\frac{1\,k\,\pi}{n+1},\; \sin\frac{2\,k\,\pi}{n+1},\; \sin\frac{3\,k\,\pi}{n+1},\; \ldots,\; \sin\frac{(n-1)\,k\,\pi}{n+1},\; \sin\frac{n\,k\,\pi}{n+1}\right)^T$$
where k = 1, 2, 3, ..., n. Thus the eigenvectors are discretizations of the functions sin(kπx) on the interval [0, 1]. These functions have exactly k local extrema in the interval. The higher the value of k the more the eigenfunction will oscillate. In these notes find more information on matrices of the above type in Section 2.3 (page 32) and Result 3–22 (page 134). For another proof of the above see [Smit84, p. 154].
Since the matrix $A_n$ is symmetric the eigenvectors are orthogonal and form a basis. Since the eigenvectors satisfy
$$\langle \vec v_k, \vec v_j\rangle = \begin{cases} 1 & \text{if } j = k \\ 0 & \text{if } j \ne k \end{cases}$$
(i.e. orthonormalized), any vector $\vec u$ can be written as linear combination of normalized eigenvectors $\vec v_k$ of the matrix $A_n$, i.e.
$$\vec u = \sum_{k=1}^n \alpha_k\,\vec v_k \quad\text{with}\quad \alpha_k = \langle \vec u, \vec v_k\rangle\,.$$
For arbitrary t ≥ 0 consider the vector $\vec u(t)$ of the discretized (in space) solution. The differential equation (4.7) reads as
$$\frac{d}{dt}\,\vec u(t) = -\kappa\,A_n \cdot \vec u(t)\,.$$
The solution $\vec u(t)$ of this system of ODEs (Ordinary Differential Equations) can be written as linear combination of eigenvectors, i.e.
$$\vec u(t) = \sum_{k=1}^n \alpha_k(t)\,\vec v_k$$
$$\frac{d}{dt}\,\vec u(t) = \sum_{k=1}^n \dot\alpha_k(t)\,\vec v_k$$
$$-\kappa\,A_n\,\vec u(t) = -\kappa \sum_{k=1}^n \alpha_k(t)\,A_n\,\vec v_k = -\kappa \sum_{k=1}^n \alpha_k(t)\,\lambda_k\,\vec v_k$$
$$\sum_{k=1}^n \dot\alpha_k(t)\,\vec v_k = -\sum_{k=1}^n \bigl(\kappa\,\alpha_k(t)\,\lambda_k\bigr)\,\vec v_k\,.$$
Examine the scalar product of the above with a vector $\vec v_j$ and use the orthogonality to conclude
$$\sum_{k=1}^n \dot\alpha_k(t)\,\langle\vec v_j, \vec v_k\rangle = -\sum_{k=1}^n \bigl(\kappa\,\alpha_k(t)\,\lambda_k\bigr)\,\langle\vec v_j, \vec v_k\rangle \quad\Longrightarrow\quad \dot\alpha_j(t) = -\kappa\,\lambda_j\,\alpha_j(t) \quad\text{for } j = 1, 2, 3, \ldots, n\,.$$
The above system of n linear differential equations is thus converted to n decoupled linear, first order differential equations. The initial values for the coefficient functions are given by $\alpha_j(0) = \langle\vec u_0, \vec v_j\rangle$. For these equations use the methods and results in Section 4.3.1. The approximation scheme to the system of differential equations $\frac{d}{dt}\,\vec x(t) = -\kappa\,A\,\vec x(t)$ is stable if and only if the scheme applied to all of the above ordinary differential equations is stable. Three different approaches will be examined: explicit, implicit and Crank–Nicolson.
Since the above ordinary differential equations can be solved analytically find a formula for the solution
$$\vec u(t) = \sum_{k=1}^n \alpha_k(t)\,\vec v_k = \sum_{k=1}^n \langle\vec u_0, \vec v_k\rangle\,\exp(-\kappa\,\lambda_k\,t)\,\vec v_k\,.$$
For numerical purposes this formula is not very useful, since the effort to determine the eigenvalues and eigenvectors is too large.
$$\frac{u_{i+1,j} - u_{i,j}}{\Delta t} = \kappa\,\frac{u_{i,j-1} - 2\,u_{i,j} + u_{i,j+1}}{(\Delta x)^2}$$
$$u_{i+1,j} = u_{i,j} + \frac{\kappa\,\Delta t}{(\Delta x)^2}\,\bigl(u_{i,j-1} - 2\,u_{i,j} + u_{i,j+1}\bigr)$$
If the vector $\vec u_i$ is known the values at the next time level $\vec u_{i+1}$ can be computed without solving a system of linear equations, thus this is called an explicit method. Starting with the discretization of the initial values $\vec u_0$ and applying the above formula repeatedly we find the solution
$$\vec u_i = \bigl(I_n - \kappa\,\Delta t\,A_n\bigr)^i \cdot \vec u_0\,.$$
The goal is to examine the stability of this finite difference scheme. For the eigenvalues $\lambda_k$ and eigenvectors $\vec v_k$ we have
$$\bigl(I_n - \kappa\,\Delta t\,A_n\bigr)^i \cdot \vec v_k = \bigl(1 - \kappa\,\Delta t\,\lambda_k\bigr)^i \cdot \vec v_k$$
and the solution will remain bounded as i → ∞ only if $\kappa\,\Delta t\,\lambda_k < 2$ for all k = 1, 2, 3, ..., n. This corresponds to the stability condition.
Since we want to use the results of Section 4.3.2 on solutions of the ordinary differential equation we translate to the coefficient functions $\alpha_k(t)$ and find
$$\frac{d}{dt}\,\alpha_k(t) = -\kappa\,\lambda_k\,\alpha_k(t)$$
$$\frac{\alpha_k(t+\Delta t) - \alpha_k(t)}{\Delta t} = -\kappa\,\lambda_k\,\alpha_k(t) \qquad\text{finite difference approximation}$$
$$\alpha_k(t+\Delta t) = (1 - \Delta t\,\kappa\,\lambda_k)\,\alpha_k(t)$$
$$\alpha_k(i\cdot\Delta t) = (1 - \Delta t\,\kappa\,\lambda_k)^i\,\alpha_k(0)\,.$$
The scheme is stable if the absolute value of the bracketed expression is smaller than 1, i.e.
$$\kappa\,\lambda_k\,\Delta t < 2\,.$$
Since the largest eigenvalue of $A_n$ is (see Section 2.3.1 starting on page 32) $\lambda_n = \frac{4}{\Delta x^2}\,\sin^2\frac{n\,\pi}{2\,(n+1)} \approx \frac{4\,\sin^2\frac\pi2}{\Delta x^2} = \frac{4}{\Delta x^2}$ find the stability condition
$$\kappa\,\frac{\Delta t}{\Delta x^2} < \frac12 \quad\Longleftrightarrow\quad \Delta t < \frac{1}{2\,\kappa}\,(\Delta x)^2\,.$$
This is a situation of conditional stability. The restriction on the size of the timestep ∆t is severe, since for small values of ∆x the ∆t will need to be much smaller.
$$\frac{\partial}{\partial t}\,u(t,x) = \kappa\,\frac{\partial^2}{\partial x^2}\,u(t,x) \quad\text{for } 0 < x < 1 \text{ and } t > 0$$
$$u(t,0) = u(t,1) = 0 \quad\text{for } t > 0$$
$$u(0,x) = f(x) \quad\text{for } 0 < x \le 1$$
In Figure 4.17 the solution is shown for values of $r = \kappa\,\frac{\Delta t}{\Delta x^2}$ slightly smaller or larger than the critical value of 0.5. The initial value used is
$$f(x) = \begin{cases} 2\,x & \text{for } 0 < x \le 0.5 \\ 2 - 2\,x & \text{for } 0.5 \le x < 1 \end{cases}$$
Since the largest eigenvalue of $A_n$ will be the first to exhibit instability examine the corresponding eigenvector
Figure 4.17: Solution of 1-d heat equation, stable and unstable algorithms with r ≈ 0.5
$$\vec v_n = \left(\sin\frac{1\,n\,\pi}{n+1},\; \sin\frac{2\,n\,\pi}{n+1},\; \sin\frac{3\,n\,\pi}{n+1},\; \ldots,\; \sin\frac{(n-1)\,n\,\pi}{n+1},\; \sin\frac{n\,n\,\pi}{n+1}\right)^T.$$
The corresponding eigenfunction has n extrema in the interval. Thus the instability should exhibit n extrema,
which is confirmed by Figure 4.17(b) where the calculation is done with n = 9, as shown in the Octave code
below. The deviation from the correct solution exhibits 9 local extrema in the interval. This is an example
of a consistent and non-stable finite difference approximation. Obviously the scheme is not convergent.
HeatDynamic.m
L = 1; % length of the space interval
n = 9; % number of interior grid points
%n = 29; % number of interior grid points
r = 0.45; % ratio to compute time step
%r = 0.52; % ratio to compute time step
T = 0.1; % final time
iv = @(x) min([2*x/L,2-2*x/L]')'; % initial value
dx = L/(n+1); dt = r*dx^2; x = linspace(0,L,n+2)';
y = iv(x);
ynew = y;
legend('off')
for t = 0:dt:T+dt
% for k = 2:n+1 % code with loops
% ynew(k) = (1-2*r)*y(k)+r*(y(k-1)+y(k+1));
% endfor
% y = ynew;
y(2:n+1) = (1-2*r)*y(2:n+1)+r*(y(1:n)+y(3:n+2)); % no loops
plot(x,y)
axis([0,1,0,1]); grid on
text(0.1,0.9,['t = ',num2str(t,3)]);
pause(0.02);
end%for
In the above code we verify that for each time step approximately 2 · n multiplications/additions are
necessary. Thus the computational cost of one time step is 2 n.
$$\frac{u_{i+1,j} - u_{i,j}}{\Delta t} = \kappa\,\frac{u_{i+1,j-1} - 2\,u_{i+1,j} + u_{i+1,j+1}}{(\Delta x)^2}$$
Since λk > 0 we find that this scheme is unconditionally stable, i.e. there are no restrictions on the ratio of
the step sizes ∆x and ∆t. This is confirmed by the results in Figure 4.19. It was generated by code similar
to the one below.
Figure 4.19: Solution of 1-d heat equation, implicit scheme with small and large step sizes
HeatDynamicImplicit.m
L = 1; % length of the space interval
n = 29; % number of interior grid points
r = 0.2; % ratio to compute time step
%r = 2.0;% ratio to compute time step
iv = @(x) min([2*x/L,2-2*x/L]')'; % initial value
dx = L/(n+1); dt = 2*r*dx^2; x = linspace(0,L,n+2)';
initval = iv(x(2:n+1));
yplot = zeros(plots,n+2);
plotc = 1; tplot = linspace(0,T,plots);
Adiag = ones(n,1)*(1+2*r);
Aoffdiag = -ones(n-1,1)*r;
y = initval;
for t = 0:dt:T+dt
if min(abs(tplot-t))<dt/2
yplot(plotc,2:n+1) = y'; plotc = plotc+1;
end%if
y = trisolve(Adiag,Aoffdiag,y);
end%for
plot(x,yplot)
grid on
xlabel('position x'); ylabel('Temperature')
To perform one time step one has to solve a system of n linear equations where the matrix is symmetric, tridiagonal and positive definite. There are efficient algorithms (trisolve()) for this type of problem (e.g. [GoluVanLoan96], [GoluVanLoan13]), requiring only 5n multiplications. If the matrix decomposition and the back-substitution are coded separately, then this can even be reduced to an operation count for one time step of only 3n multiplications. Thus the computational effort for one explicit step is similar to the cost for one implicit step, but we gain unconditional stability.
$$\frac{u_{i+1,j} - u_{i,j}}{\kappa\,\Delta t} = \frac{u_{i,j-1} - 2\,u_{i,j} + u_{i,j+1}}{2\,(\Delta x)^2} + \frac{u_{i+1,j-1} - 2\,u_{i+1,j} + u_{i+1,j+1}}{2\,(\Delta x)^2}$$
or
$$\left(I_n + \frac{\kappa\,\Delta t}{2}\,A_n\right)\cdot\vec u_{i+1} = \left(I_n - \frac{\kappa\,\Delta t}{2}\,A_n\right)\cdot\vec u_i\,.$$
With the values $\vec u_i$ at a given time multiply the vector with a matrix and then solve a system of linear equations to determine the values $\vec u_{i+1}$ at the next time level. This is an implicit method. As in the previous section we use the eigenvalues and eigenvectors of $A_n$ to examine stability of the scheme. Examine the inequality
$$\left|\frac{2 - \kappa\,\Delta t\,\lambda_k}{2 + \kappa\,\Delta t\,\lambda_k}\right|^i < 1\,.$$
Since λk > 0 this scheme is unconditionally stable.
In Table 4.3 find a comparison of the three different finite difference approximations to equation (4.6).
• For the explicit method one multiplication by a matrix I − α A is required. Thus we need approxi-
mately 3 n multiplications. If one would take advantage of the symmetry it could be reduced to 2 n
multiplications.
• For the implicit method one system of linear equations with a matrix I − α A is required. Using the
standard Cholesky factorization with band structure approximately 4 n multiplications are required.
Working with the modified Cholesky factorization one could reduce to 3 n multiplications.
• For the Crank–Nicolson method one matrix multiplication is paired with one system to be solved.
Thus we need approximately 7 n multiplication, or only 5 n with the optimized algorithms.
Using an inverse matrix is a bad idea in the above context, as this will lead to a full matrix and thus at
least n2 multiplications. Even for relatively large numbers n, the time required to do one time step will be
minimal for all of the above methods. This will be different for the 2D situation, as examined in Table 4.4.
As a consequence one should use either an implicit method or Crank–Nicolson for this type of problem.
If the differential equation to be solved contains an inhomogeneous term, i.e.
$$\frac{\partial}{\partial t}\,u(t,x) = \kappa\,\frac{\partial^2}{\partial x^2}\,u(t,x) + f(t,x)$$
then use the difference approximation⁷
$$\vec u_{i+1} - \vec u_i = -\frac{\kappa\,\Delta t}{2}\,A_n\,(\vec u_i + \vec u_{i+1}) + \frac{\Delta t}{2}\,(\vec f_i + \vec f_{i+1})\,.$$
This leads to
$$\left(I + \frac{\kappa\,\Delta t}{2}\,A_n\right)\vec u_{i+1} = \left(I - \frac{\kappa\,\Delta t}{2}\,A_n\right)\vec u_i + \frac{\Delta t}{2}\,(\vec f_i + \vec f_{i+1})\,.$$
This system can be solved similarly.
⁷ Instead of $\frac12(\vec f_i + \vec f_{i+1})$ one can (or even should) use $\vec f_{i+1/2}$, i.e. the discretization of $f(t_i + \frac{\Delta t}{2}, x)$.
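A minimal sketch of the Crank–Nicolson time stepping for the homogeneous 1D heat equation; the values of n, dt, κ and the initial temperature are illustrative choices:

L = 1; n = 29; kappa = 1; dx = L/(n+1); dt = 0.001;
An = spdiags(ones(n,1)*[-1 2 -1],[-1 0 1],n,n)/dx^2;
Bplus = speye(n) + kappa*dt/2*An; % matrix on the new time level
Bminus = speye(n) - kappa*dt/2*An; % matrix on the old time level
x = (dx:dx:L-dx)'; u = sin(pi*x); % initial temperature
for k = 1:100
u = Bplus\(Bminus*u); % one Crank-Nicolson step, homogeneous f
end%for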
Table 4.3: Comparison of finite difference schemes for the 1D heat equation
Thus the stability condition is again ∆t < 2/λₙ, where λₙ is the largest generalized eigenvalue.
• The fully implicit scheme will lead to
$$\frac{1}{\Delta t}\,\mathbf M\,(\vec u(t+\Delta t) - \vec u(t)) = -\mathbf A\cdot\vec u(t+\Delta t) + \vec f(t)$$
$$(\mathbf M + \Delta t\,\mathbf A)\cdot\vec u(t+\Delta t) = \mathbf M\cdot\vec u(t) + \Delta t\,\vec f(t)$$
and is unconditionally stable.
• The Crank–Nicolson scheme will lead to
$$\frac{1}{\Delta t}\,\mathbf M\,(\vec u(t+\Delta t) - \vec u(t)) = -\frac12\,\bigl(\mathbf A\cdot\vec u(t+\Delta t) + \mathbf A\cdot\vec u(t)\bigr) + \frac12\,\bigl(\vec f(t) + \vec f(t+\Delta t)\bigr)$$
$$\left(\mathbf M + \frac{\Delta t}{2}\,\mathbf A\right)\cdot\vec u(t+\Delta t) = \left(\mathbf M - \frac{\Delta t}{2}\,\mathbf A\right)\cdot\vec u(t) + \frac{\Delta t}{2}\,\bigl(\vec f(t) + \vec f(t+\Delta t)\bigr)$$
and is unconditionally stable.
Explicit approximation
The explicit (with respect to time) finite difference approximation is determined by
$$\frac{1}{\Delta t}\,(\vec u_{i+1} - \vec u_i) = -\kappa\,A_{nn}\,\vec u_i + \vec f_i$$
or
$$\vec u_{i+1} = \vec u_i - \Delta t\,(\kappa\,A_{nn}\,\vec u_i - \vec f_i)\,.$$
For each time step we have to multiply the matrix $A_{nn}$ with a vector. Due to the severe sparsity of the matrix this requires approximately 5n² multiplications. Since the largest eigenvalue is given by $\kappa\,\lambda_{n,n} \approx \kappa\,8\,n^2 \approx \frac{8\,\kappa}{(\Delta x)^2}$ we have the stability condition
$$\Delta t \le \frac{2}{\kappa\,\lambda_{n,n}} \approx \frac{1}{4\,\kappa}\,(\Delta x)^2\,.$$
The algorithm is conditionally stable only.
Implicit approximation
The implicit (with respect to time) finite difference approximation is determined by
$$\frac{1}{\Delta t}\,(\vec u_{i+1} - \vec u_i) = -\kappa\,A_{nn}\,\vec u_{i+1} + \vec f_{i+1}$$
or
$$(I + \Delta t\,\kappa\,A_{nn})\,\vec u_{i+1} = \vec u_i + \Delta t\,\vec f_{i+1}\,.$$
The algorithm is unconditionally stable. For each time step a system of linear equations has to be solved, but the matrix is constant. Thus we can factorize the matrix once (Cholesky) and then do the back substitution steps only. The symmetric, positive definite matrix $A_{nn}$ has size n² × n² and a semi-bandwidth of b = n. Using the results in Section 2.6.4 the computational effort for one banded Cholesky factorization is approximated by $\frac12\,n^2\,n^2$. Each subsequent solving of a system of equations requires 2n³ multiplications.
Crank–Nicolson approximation
The CN finite difference approximation is determined by
$$\frac{1}{\Delta t}\,(\vec u_{i+1} - \vec u_i) = -\frac{\kappa}{2}\,A_{nn}\,(\vec u_i + \vec u_{i+1}) + \frac12\,(\vec f_i + \vec f_{i+1})$$
or
$$\left(I + \frac{\kappa\,\Delta t}{2}\,A_{nn}\right)\vec u_{i+1} = \left(I - \frac{\kappa\,\Delta t}{2}\,A_{nn}\right)\vec u_i + \frac{\Delta t}{2}\,\bigl(\vec f_i + \vec f_{i+1}\bigr)\,.$$
The algorithm is unconditionally stable too and the computational effort is comparable to the implicit method.
Comparison

A comparison of the explicit, implicit and CN approximations is given in Table 4.4. For the implicit scheme
each time step requires more computations than an explicit time step, but the time steps for the implicit scheme
may be larger. The choice of the best algorithm thus depends on the time interval on which you want to compute
the solution: for very small times the explicit scheme is more efficient, for very large times the implicit
scheme is more efficient. This differs from the 1D situation in Table 4.3, where the computational effort for
each time step was small and of the same order for the three algorithms examined. The Crank–Nicolson
scheme can also be applied to the 2D heat equation, leading to a higher order of consistency.
Table 4.4: Comparison of finite difference schemes for 2D dynamic heat equations
nx = 34; ny = 35;
f = @(x,y) x.^2;
%%%%%%%%%%%%%%%%%%%%%%%%%%
t = 0:dt:T; utrace = zeros(size(t));
hx = 1/(nx+1); hy = 1/(ny+1);
for k = 2:length(t)
  u = P*(R\(R'\(P'*(u+fvec)))); % one time step
  % u = R\(R'\(u+fvec));        % one time step
  utrace(k) = u(55);
end%for
In the above code we used a sparse Cholesky factorization to solve the system of linear equations at
each time step. Since A is a sparse, symmetric, positive definite matrix we may also use an iterative solver,
e.g. the conjugate gradient algorithm. According to the results in Section 2.7 and in particular Figure 2.17
(page 84), this might be a faster solution for fine meshes. Since this is a time stepping algorithm we have
good initial guesses for the solution of the linear system, using the result at the previous time step.
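A sketch of this idea, assuming the matrices from the implicit step above are available; pcg() is the built-in conjugate gradient solver of Octave/MATLAB, and the tolerance is an arbitrary choice:

    B = speye(n^2) + dt*kappa*Ann;                   % constant system matrix
    uNew = pcg(B, u + dt*fvec, 1e-8, 200, [], [], u); % initial guess: old time step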
[Figure: surface plot of the temperature u(x, y) over the unit square and the trace of the temperature at one grid point as a function of the time t]
    d/dt u(t) = −λ u(t)   with   u(0) = 1     (4.10)

for values λ > 0. This was examined in Section 3.4.3 on page 196. The exact solution is obviously
u(t) = u(0) exp(−λ t). Of interest are two aspects of the solvers:

1. stability condition: |u_i| has to remain bounded. This is the case for all Re(λ) > 0 if the method is
   A–stable.

2. decay condition: for large values of λ the solution should converge to 0 very fast. This is the case if
   the method is L–stable.

In a typical situation we find after one step u(∆t) ≈ g(λ ∆t), where the function g() depends on the algorithm
to be examined. For the exact solver use g(λ ∆t) = exp(−λ ∆t). The stability requires |g(λ ∆t)| ≤ 1
and the decay condition translates to |g(λ ∆t)| ≪ 1 for λ ∆t ≫ 1.
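For the simplest solvers the function g(z) is known explicitly, and a few lines of Octave visualize the stability and decay behavior (a sketch; the formulas for the implicit Euler and Crank–Nicolson factors are standard and not taken from the sample codes of these notes):

    z = linspace(0,20,501);
    gIM = 1./(1+z);           % implicit Euler
    gCN = (1-z/2)./(1+z/2);   % Crank-Nicolson
    plot(z, gIM, z, gCN, z, exp(-z))
    legend('IM','CN','exact'); xlabel('z'); ylabel('amplification factor')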
For the explicit solver the discretization of (4.9) reads

    (~u_{i+1} − ~u_i)/∆t = −A ~u_i + ~f_i   ⟹   ~u_{i+1} = ~u_i + ∆t (−A ~u_i + ~f_i) .

For the implicit solver

    (~u_{i+1} − ~u_i)/∆t = −A ~u_{i+1} + ~f_{i+1}   ⟹   (I + ∆t A) ~u_{i+1} = ~u_i + ∆t ~f_{i+1} .
The method is consistent of order 2. Observe that 2 time levels are used to advance by one time step. Thus
one needs another algorithm to start BDF2, e.g. one RK step, which is also of order 2. To examine the
stability use the model equation (4.10), i.e. examine d/dt u(t) = −λ u(t).

    u_{i+1} − (4/3) u_i + (1/3) u_{i−1} = −λ (2 ∆t/3) u_{i+1}
    (3 + 2 λ ∆t) u_{i+1} = 4 u_i − u_{i−1}
    u_{i+1} = 4/(3 + 2 λ ∆t) u_i − 1/(3 + 2 λ ∆t) u_{i−1}

    [ u_i     ]   [ 0                1               ] [ u_{i−1} ]
    [ u_{i+1} ] = [ −1/(3 + 2 λ ∆t)  4/(3 + 2 λ ∆t)  ] [ u_i     ]
To examine the stability of this scheme use the eigenvalues of this matrix.

    0 = det [ −µ                1                   ] = µ² − 4/(3 + 2 λ ∆t) µ + 1/(3 + 2 λ ∆t)
            [ −1/(3 + 2 λ ∆t)   4/(3 + 2 λ ∆t) − µ  ]

    0 = (3 + 2 λ ∆t) µ² − 4 µ + 1

    µ_{1,2} = 1/(2 (3 + 2 λ ∆t)) (+4 ± √(16 − 4 (3 + 2 λ ∆t)))
            = 1/(3 + 2 λ ∆t) (+2 ± √(4 − (3 + 2 λ ∆t))) = (+2 ± √(1 − 2 λ ∆t))/(3 + 2 λ ∆t)

To examine this expression consider two different cases for z = λ ∆t:

• z ≤ 1/2 : Use g_{1,2}(z) = (2 ± √(1 − 2z))/(3 + 2z) and

      g_BDF2(z) = max{g_{1,2}(z)} = (2 + √(1 − 2z))/(3 + 2z) .

• z > 1/2 : Use g_{1,2}(z) = (2 ± i √(2z − 1))/(3 + 2z) ∈ ℂ and

      g_BDF2(z) = max{|g_{1,2}(z)|} = √(2² + 2z − 1)/(3 + 2z) = √(3 + 2z)/(3 + 2z) = 1/√(3 + 2z) .

Verify that for z > 0 the absolute values are smaller than 1, and thus we have unconditional stability. For
the decay condition we observe that lim_{z→∞} |g_{1,2}(z)| = 0. This is based on g_BDF2(z) ≈ 1/√(2z) for z ≫ 1.
The method is A–stable and L–stable.
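This analysis is easily checked numerically; the Octave sketch below evaluates the spectral radius of the above 2 × 2 matrix for a range of values z = λ ∆t (the names are chosen for this illustration only):

    gBDF2 = @(z) max(abs(eig([0, 1; -1/(3+2*z), 4/(3+2*z)])));
    z = 0.01:0.01:20; g = arrayfun(gBDF2, z);
    plot(z, g, z, 1./sqrt(2*z)) % stays below 1 and decays like 1/sqrt(2 z)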
An L-stable RK solver

This is a diagonally implicit Runge–Kutta or DIRK method; find a good presentation in [Butc03, §361].
The ODE (4.9) is discretized by

    (I + θ ∆t A) ~u_n = (I − (1 − θ) ∆t A) ~u_i
    (I + θ ∆t A) ~u_{i+1} = (I − (∆t/2) A) ~u_i − (1/2 − θ) ∆t A ~u_n

where θ = 1 − 1/√2. The method is consistent of order 2. Observe that two systems of linear equations
have to be solved for each step. The first equation for the intermediate stage ~u_n can be rewritten as

    ~u_n − ~u_i = −θ ∆t A (~u_n − ~u_i) − ∆t A ~u_i = −(1 − 1/√2) ∆t A (~u_n − ~u_i) − ∆t A ~u_i .
To examine the stability use (4.10).

    (1 + θ λ ∆t) u_n = (1 − (1 − θ) λ ∆t) u_i
    (1 + θ λ ∆t) u_{i+1} = (1 − (λ/2) ∆t) u_i − (1/2 − θ) λ ∆t u_n
                         = (1 − (λ/2) ∆t) u_i − (1/2 − θ) λ ∆t (1 − (1 − θ) λ ∆t)/(1 + θ λ ∆t) u_i
                         = ( (1 − (λ/2) ∆t) − λ ∆t (1/2 − θ)(1 − (1 − θ) λ ∆t)/(1 + θ λ ∆t) ) u_i

with θ = 1 − 1/√2 ≈ 0.29. Thus we have to examine the function (z = λ ∆t)

    g_RK(z) = 1/(1 + θ z) ( (1 − z/2) − (1/2 − θ)(1 − (1 − θ) z) z/(1 + θ z) ) .

Use a plot of this function for z > 0 to observe that |g_RK(z)| < 1 and thus we have unconditional stability. For
z ≫ 1 use 1 − (1 − θ) z ≈ −(1 − θ) z and Mathematica to verify

    g_RK(z) ≈ 1/(1 + θ z) ( (1 − z/2) + (1/2 − θ)(1 − θ) z²/(1 + θ z) ) = (4 + 2 (1 − √2) z)/(2 + (2 − √2) z)² .

For z ≫ 1 we find

    g_RK(z) ≈ 2 (1 − √2) z/((2 − √2)² z²) = 2 (2 + √2)² (1 − √2)/((2 + √2)² (2 − √2)² z)
            = 2 (6 + 4 √2)(1 − √2)/((4 − 2)² z) = (6 − 8 − 2 √2)/(2 z) = (−1 − √2)/z .
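The decay of order 1/z can be confirmed numerically with a short Octave sketch, evaluating the exact stability function derived above (names chosen for this illustration):

    theta = 1 - 1/sqrt(2);
    gRK = @(z) ((1-z/2) - (1/2-theta)*(1-(1-theta)*z).*z./(1+theta*z))./(1+theta*z);
    z = 10.^(2:2:8);
    z.*gRK(z) % approaches a negative constant, confirming gRK(z) = O(1/z)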
Table 4.5: For different ODE solvers find the order of consistency, the asymptotic approximation of the
stability function g(z), the number of time levels required and the number of linear systems to be solved to
perform one time step.
    ∂²/∂t² u(t, x) = c² ∂²/∂x² u(t, x)   for 0 < x < 1 and t > 0
    u(t, 0) = u(t, 1) = 0                for t > 0
    u(0, x) = u₀(x)                      for 0 < x < 1          (4.11)
    u̇(0, x) = u₁(x)                      for 0 < x < 1

The equation of a vibrating string (1.8) on page 14 is of this form. Examine an explicit and an implicit finite
difference approximation.
[Figure: amplification factors g(z) for 0 ≤ z ≤ 20 for the implicit method (IM), Crank–Nicolson (CN), BDF2 and the L-stable RK solver, compared to the exact factor exp(−z)]
    (u_{i+1,j} − 2 u_{i,j} + u_{i−1,j})/(∆t)² = c² (u_{i,j−1} − 2 u_{i,j} + u_{i,j+1})/(∆x)²

    (~u_{i+1} − 2 ~u_i + ~u_{i−1})/(∆t)² = −c² A_n ~u_i

    ~u_{i+1} = (2 I − (∆t)² c² A_n) ~u_i − ~u_{i−1}

[Stencil: the new value at time level i+1 uses the three values at positions j−1, j, j+1 on level i and one value on level i−1]

Figure 4.23: Explicit finite difference approximation for the wave equation
to remain bounded too. Solve the above difference equation for α(t + h) to find

    α(t + h) = 2 α(t) − α(t − h) − h² λ α(t) .

Using a matrix notation write this in the form

    [ α(t)     ]   [  0    1        ]   [ α(t − h) ]
    [ α(t + h) ] = [ −1    2 − λ h² ] · [ α(t)     ]

and with an iteration find

    [ α(i h)       ]   [  0    1        ]ⁱ   [ α(0) ]
    [ α((i + 1) h) ] = [ −1    2 − λ h² ]  · [ α(h) ] .

These solutions remain bounded as i → ∞ if the eigenvalues µ of the matrix have absolute values smaller
or equal to 1⁸. Thus we examine the solutions of the characteristic equation

    det [ 0 − µ   1            ] = µ² − µ (2 − λ h²) + 1 = 0 .
        [ −1      2 − λ h² − µ ]

Using a factorization of this polynomial find

    µ² − µ (2 − λ h²) + 1 = (µ − µ₁)(µ − µ₂) = µ² − (µ₁ + µ₂) µ + µ₁ µ₂ .

Since the constant term equals 1, conclude that µ₁ · µ₂ = 1. If both values µ_{1,2} were real and µ₁ · µ₂ = 1,
then µ₂ = 1/µ₁ and one of the absolute values would be larger than 1. If the values are conjugate complex,
use 1 = µ₁ · µ₂ = µ₁ · µ̄₁ = |µ₁|². Thus the condition for |µ_{1,2}| ≤ 1 to be correct is given by: conjugate
complex values on the unit circle with nonzero imaginary part. The solutions µ_{1,2} are given by

    µ_{1,2} = (1/2) (2 − λ h² ± √((2 − λ h²)² − 4)) = (1/2) (2 − λ h² ± √(λ² h⁴ − 4 λ h²))
            = (2 − λ h² ± √(λ h²) √(λ h² − 4))/2 .

Thus as a necessary and sufficient condition for stability use a negative discriminant:

    λ² h⁴ − 4 λ h² < 0   ⟺   λ h² < 4   ⟺   h² < 4/λ .
With a block matrix notation this can be transformed into a form similar to the ODE situation above:

    [ ~u_i     ]   [  0     I_n                    ]   [ ~u_{i−1} ]
    [ ~u_{i+1} ] = [ −I_n   2 I_n − c² (∆t)² A_n  ] · [ ~u_i     ] .

Then write the solution as a linear combination of the eigenvectors of the matrix A_n, i.e.

    ~u(t) = Σ_{k=1}^{n} α_k(t) ~v_k   where   A_n ~v_k = λ_k ~v_k .

Each coefficient function α_k(t) satisfies the scalar difference equation examined above, i.e. stability
requires c² (∆t)² λ_k < 4. Since the largest eigenvalue of A_n is λ_n ≈ 4/(∆x)², this leads to the condition

    c² (∆t)²/(∆x)² ≤ 1   ⟺   c² (∆t)² ≤ (∆x)²   ⟺   c ∆t ≤ ∆x .

The solution at the first two time levels has to be known to get the finite difference scheme started. We have
to use the initial conditions to construct the vectors ~u₀ and ~u₁. The first initial condition in equation (4.11)
obviously implies that ~u₀ should be the discretization of u(0, x) = u₀(x). The Octave code below is
an elementary implementation of the presented finite difference scheme. In the example below the initial
velocity u₁(x) = 0 is used and ~u₁ is then determined by the method in the previous section⁹. If the ratio
r = ∆t/∆x is increased beyond the critical value of 1/c then the algorithm is unstable and the solution will
be far away from the true solution. The instability is again (as in Section 4.5.3) in the direction of the
eigenvector belonging to the largest eigenvalue.

⁹ Currently this is not implemented yet in some of my sample codes.
Wave.m

L = 3;    % length of the space interval
n = 150;  % number of interior grid points
r = 0.99; % ratio to compute time step
T = 6;    % final time
iv = @(x)max([min([2*x';2-2*x']);0*x'])'; % initial value
%% grid, time step and initial values (reconstructed setup, assuming c = 1)
dx = L/(n+1); dt = r*dx; x = (0:dx:L)';
y0 = iv(x); y1 = y0; y2 = y1;  % zero initial velocity: start with u_1 = u_0
figure(1); clf;
for t = 0:dt:T+dt
  plot(x,y0); axis([0,L,-1,1]); drawnow(); %% graphics uses most of the time
  if 0 %% for loops, slow
    for k = 2:n+1
      y2(k) = (2-2*r^2)*y1(k) + r^2*(y1(k-1)+y1(k+1)) - y0(k);
    end%for
  else %% no loop, fast
    y2(2:n+1) = (2-2*r^2)*y1(2:n+1) + r^2*(y1(1:n)+y1(3:n+2)) - y0(2:n+1);
  end%if
  y0 = y1; y1 = y2;
end%for
figure(2)
plot(x,y0,x,iv(x))
    ∂²/∂t² u(t, x) = c² ∂²/∂x² u(t, x)   for −∞ < x < ∞ and t ∈ ℝ
    u(0, x) = u₀(x)                      for −∞ < x < ∞          (4.12)
    ∂/∂t u(0, x) = u₁(x)                 for −∞ < x < ∞

This implies that the solution of the wave equation at time t and position x is determined by the values in
the cone of dependence, i.e. all times τ < t and positions x̃ such that |x̃ − x| ≤ c (t − τ). This is visualized
in Figure 4.24.
[Figure 4.24: (a) cone of dependence, showing the interval of dependence [x − c t, x + c t] for the point (x, t); (b) cone of influence, showing the interval influenced at time t by the point (x, 0); axes: position x and time t]
On the left find the domain having an influence on the solution at time t and position x, and on the right
the domain influenced by the values at time t = 0 and position x. This formula and figure also
confirm that information is traveling at most with speed c.

A quick look at Figure 4.23 (page 290) will confirm that for the explicit finite difference scheme the
cone of dependence has a slope of ∆t/∆x, while the slope for the exact solution in Figure 4.24 is given by 1/c.
Since the numerical cone has to contain the exact cone we have the condition

    ∆t/∆x ≤ 1/c   ⟹   ∆t ≤ (1/c) ∆x .

This confirms the stability condition obtained using eigenvalues.
One can verify (tedious computations) that this difference scheme is consistent of order (∆x)² + (∆t)². Collecting
all terms at the new time level we obtain a linear system of equations for ~u_{i+1}, thus this is an implicit scheme:

    (I + c² (∆t)²/4 A_n) ~u_{i+1} = (2 I − 2 c² (∆t)²/4 A_n) ~u_i − (I + c² (∆t)²/4 A_n) ~u_{i−1} .

To examine the stability use again the eigenvalues λ of A_n and set

    γ = c² (∆t)²/4 λ > 0 .
    (~u_{i+1} − 2 ~u_i + ~u_{i−1})/(∆t)² = −(c²/4) A_n (~u_{i+1} + 2 ~u_i + ~u_{i−1})

[Stencil: all three values at positions j−1, j, j+1 on each of the time levels i−1, i and i+1 are used]

Figure 4.25: Implicit finite difference approximation for the wave equation
    α(t + ∆t) − 2 α(t) + α(t − ∆t) = −c² λ (∆t)²/4 (α(t + ∆t) + 2 α(t) + α(t − ∆t))

    (1 + γ) α(t + ∆t) = (2 − 2 γ) α(t) − (1 + γ) α(t − ∆t)

    [ α(t)      ]   [  0    1                   ]   [ α(t − ∆t) ]
    [ α(t + ∆t) ] = [ −1    (2 − 2 γ)/(1 + γ)  ] · [ α(t)      ]

Examine the eigenvalues µ_{1,2} of this matrix by solving the quadratic equation

    det [ −µ    1                       ] = µ² − (2 − 2 γ)/(1 + γ) µ + 1 = 0 .
        [ −1    (2 − 2 γ)/(1 + γ) − µ  ]

Since the discriminant is negative,

    4 ((1 − γ)/(1 + γ))² − 4 < 0 ,

conclude that the two values are complex conjugate. This implies |µ₁| = |µ₂| = 1 and the scheme is uncondi-
tionally stable.
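One time step of this implicit scheme can be sketched in Octave as follows (the matrix An, the wave speed c, the step sizes and the two starting vectors u0, u1 are assumed to be given; the names are illustrative):

    gam = c^2*dt^2/4;              % the value gamma from above, as matrix factor
    B = speye(n) + gam*An;         % constant matrix, factorize once
    R = chol(B);
    rhs = (2*speye(n) - 2*gam*An)*u1 - B*u0; % u0, u1: two previous time levels
    u2 = R\(R'\rhs);               % new time level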
A ~v = λ M ~v .
or equivalently

    E A₀(L) (1 − ν du(L)/dx)² du(L)/dx = F .
This is a nonlinear boundary value problem for the unknown displacement function u(x). We will use the
method of successive substitution from Section 3.1.4. Thus we proceed as follows:

• Pick a starting function u₀(x). If possible use a good guess of the solution. For this example u₀(x) = 0
  will do.

• With the current approximation u(x) compute the coefficient function

      a(x) = E A₀(x) (1 − ν du(x)/dx)²

  and solve the corresponding linear boundary value problem to obtain the next approximation. Repeat
  until the difference of consecutive approximations is small enough.
figure(1)
clf; hold on; grid on; axis([0 3 0 1.4]) % setup of graphics
xlabel('position x'); ylabel('displacement u');
figure(2)
semilogy(Differences)
xlabel('iterations'); ylabel('relative difference')
The above code required 12 iterations until the relative difference was smaller than 10⁻⁵. Find the graphical
results in Figure 4.26(a). The final result has to be compared with Example 4–8 to verify that the beam is
even weaker than before. The logarithm of the relative difference can be plotted as a function of the iteration
number. The result in Figure 4.26(b) shows a straight line and this is consistent with linear convergence¹⁰.
♦
Figure 4.26: Nonlinear beam stretching problem solved by successive substitution (a) and the logarithmic
differences as a function of the number of iterations (b)
¹⁰ |diff| ≈ α₀ qⁿ   ⟹   ln |diff| ≈ ln α₀ + n · ln q
To take the boundary condition α′(L) = 0 into account proceed as in Example 4–7 on page 267. For this
elementary problem use the known exact solution α(s) = F₂/(EI) (L s − s²/2), leading to a maximal angle
of α(L) = F₂ L²/(2 EI). The exact solution is a polynomial of degree 2 and for this problem the approximate
solution will coincide with the exact solution, i.e. there is no approximation error. According to Section 1.4.2 the
maximal vertical deflection is given by y(L) = F₂ L³/(3 EI). With the angle function α(s) the shape of the
beam is given by

    ( x(l), y(l) )ᵀ = ∫₀ˡ ( cos(α(s)), sin(α(s)) )ᵀ ds .
Due to the numerical integration with the trapezoidal rule (cumtrapz()) the maximal displacement will
not be reproduced exactly. One error contribution is the approximate integration by the trapezoidal rule and
another effect is using sin α for the integration, instead of only α. The code below implements the above
algorithm and verifies the result.
BeamLinear.m

EI = 1.0; L = 3; F = [0 0.1];
% F = [0 2] % large force
N = 200;
%%%%%%%% no modifications necessary beyond this line
h = L/N; s = (h:h:L)';
%% build and solve the tridiagonal system
A = spdiags(ones(N,1)*[-1 2 -1],[-1 0 1],N,N)/h^2; A(N,N) = 1/h^2;
g = F(2)/EI*ones(size(s)); g(N) = g(N)/2;
alpha = A\g;
%% display the solution
x = cumtrapz([0;s],[1; cos(alpha)]); y = cumtrapz([0;s],[0; sin(alpha)]);
plot(x,y); xlabel('x'); ylabel('y');
One may try to solve the above problem using partial substitution. Start with an initial angle α₀(s) and
then solve iteratively the linear problem

    −α″_{k+1}(s) = F₂/(EI) cos(α_k(s)) .

For small forces F₂ this will be successful, but for larger angles the answers are of no value. One has to use
Newton's method.
The contributions of the form b_i φ_i have to be integrated into the matrix and thus on the diagonal find the
expressions 2/h² + b_i. This matrix is symmetric, but not necessarily positive definite, since the values of b_i
might be negative. The new solution α_new(s) can then be computed by

    α_new(s) = α(s) + φ(s)   resp.   α_i → α_i + φ_i .

With this new approximation start the next iteration step of Newton's method. This has to be repeated
until a solution is found with the desired accuracy.

This algorithm is implemented in MATLAB/Octave and the result is shown in Figure 4.27. Use the previous
example as a reference problem: for very small forces F₂ the resulting angles for the two approaches
should be close.
¹¹
    0 = φ′(L) = (φ(L + h) − φ(L − h))/(2 h) + O(h²)   ⟹   φ_{n+1} = φ_{n−1}

    (−φ_{n−1} + 2 φ_n − φ_{n+1})/h² + b_n φ_n = f_n   ⟹   (−φ_{n−1} + φ_n)/h² + (b_n/2) φ_n = f_n/2
[Figure 4.27: shape of the bending beam computed with Newton's method]
BeamNewton.m

EI = 1.0; L = 3; F = [0 0.1]; % try values of 0.5, 1.5 and 2
N = 200;
%%%%%%%% no modifications necessary beyond this line
h = L/N; % stepsize
s = (h:h:L)'; alpha = zeros(size(s));
The values of the differences in the above iterative algorithm are given by

    4.5 · 10⁻¹ , 2.9 · 10⁻² , 1.1 · 10⁻⁴ , 1.4 · 10⁻⁹ and 3.1 · 10⁻¹⁵ ,

verifying that the number of stable digits is doubled at each step, after an initial search for the solution.
This is consistent with the quadratic convergence of Newton's method. ♦
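The quadratic convergence can be checked directly from these numbers: for Newton's method the quotient err_{k+1}/err_k² should remain roughly constant. A two-line check in Octave, using the values listed above:

    err = [4.5e-1 2.9e-2 1.1e-4 1.4e-9];
    err(2:end)./err(1:end-1).^2 % roughly constant quotients indicate quadratic convergence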
When the codes in Examples 4–10 and 4–11 are used again with a larger value of the vertical force
F₂ = 2.0, we obtain the (at first) surprising results in Figure 4.28. This problem is created by the
geometric nonlinearity in the differential equation, i.e. the nonlinear expression cos(α(s)).

• The computations in Example 4–10 are based on the assumption of small angles and use the approx-
  imation cos α ≈ 1. For this computation this assumption is certainly false and thus the results are invalid.
  Hence the result in Figure 4.28(a) can not be correct.

• The solution with Newton's method from Example 4–11 is folding down and pulled backward. This
  might well be a physical solution, but not the one we were looking for. Thus the result in Fig-
  ure 4.28(b) is correct, but useless. This is an illustration of the fact that nonlinear problems might
  have multiple solutions and we have to assure that we find the desired solution. When Newton's al-
  gorithm is applied to this problem the errors will at first get considerably larger and only after a few
  searching steps the iteration will start to converge towards one of the possible solutions. This illus-
  trates again that Newton's method is not a good algorithm to search for a solution, but a good method
  to determine a known solution with good accuracy.

Figure 4.28: Bending of a beam with large force, solved as linear problem (a) and by Newton's method (b),
using a zero initial displacement
Figure 4.29: Nonlinear beam problem for a large force, solved by a parameterized Newton's method
BeamParam.m

EI = 1.0; L = 3; N = 200; FList = 0.25:0.25:2;
%%%%%%%% no modifications necessary beyond this line
h = L/N; % stepsize
s = (h:h:L)';
%% tridiagonal matrix as in BeamLinear.m (restored here to make the code self-contained)
A = spdiags(ones(N,1)*[-1 2 -1],[-1 0 1],N,N)/h^2; A(N,N) = 1/h^2;
alpha1 = zeros(size(s));
errTol = 1e-10;
yPlot = zeros(length(s)+1,1); xPlot = [0;s];
for F2 = FList
  disp(sprintf('Use force F_2 = %4.2g',F2))
  errAbs = 2*errTol;
  while errAbs>errTol
    b = F2/EI*sin(alpha1); b(N) = b(N)/2;
    f = F2/EI*cos(alpha1); f(N) = f(N)/2;
    phi = -(A+spdiags(b,0,N,N))\(A*alpha1-f);
    alpha1 = alpha1 + phi;
    errAbs = max(abs(phi));
  end%while
  x = cumtrapz([0;s],[0; cos(alpha1)]); y = cumtrapz([0;s],[0; sin(alpha1)]);
  xPlot = [xPlot x]; yPlot = [yPlot y];
end%for
figure(1); clf
plot(xPlot,yPlot)
grid on; xlabel('x'); ylabel('y'); axis equal
figure(2);
x = cumtrapz([0;s],[0; cos(alpha1)]); y = cumtrapz([0;s],[0; sin(alpha1)]);
plot(x,y); grid on; xlabel('x'); ylabel('y');
filename                   function
BeamStretch.m              code to solve Example 4–6
BeamStretchVariable.m      code to solve Example 4–8
Plate.m                    code to solve the BVP in Section 4.4.2
Heat2DStatic.m             code to solve the BVP in Section 4.4.2
HeatDynamic.m              code to solve the IBVP in Section 4.5.3
HeatDynamicImplicit.m      code to solve the IBVP in Section 4.5.4
PlateDynamic.m             code to solve the IBVP in Section 4.5.7
Wave.m                     code to solve the IBVP in Section 4.6.1
BeamNL.m                   code to solve Example 4–9
BeamLinear.m               code to solve the bending beam problem, Example 4–10
BeamNewton.m               code to solve the bending beam problem, Example 4–11
BeamParam.m                code to solve the bending beam problem, Example 4–12
Bibliography

[AtkiHan09] K. Atkinson and W. Han. Theoretical Numerical Analysis. Number 39 in Texts in Applied
Mathematics. Springer, 2009.

[Butc03] J. Butcher. Numerical Methods for Ordinary Differential Equations. John Wiley & Sons, Ltd,
second edition, 2003.

[GoluVanLoan96] G. Golub and C. Van Loan. Matrix Computations. Johns Hopkins University Press, third
edition, 1996.

[GoluVanLoan13] G. Golub and C. Van Loan. Matrix Computations. Johns Hopkins University Press,
fourth edition, 2013.

[IsaaKell66] E. Isaacson and H. B. Keller. Analysis of Numerical Methods. John Wiley & Sons, 1966.
Republished by Dover in 1994.

[Smit84] G. D. Smith. Numerical Solution of Partial Differential Equations: Finite Difference Methods.
Oxford University Press, Oxford, third edition, 1986.

[Thom95] J. W. Thomas. Numerical Partial Differential Equations: Finite Difference Methods, volume 22
of Texts in Applied Mathematics. Springer Verlag, New York, 1995.
Chapter 5

Calculus of Variations, Elasticity and Tensors
• you should be familiar with the basic idea of the calculus of variations.
• you should be able to apply the Euler–Lagrange equations to problems in one or multiple variables.
• you should understand the notations of stress and strain and Hooke’s law.
In this chapter we assume that you are familiar with the following:
For suitable functions u(x) an integral of the form F(u) = ∫ₐᵇ f(x, u(x), u′(x)) dx
is well defined. The main idea is now to examine the behavior of F(u) for different functions u. Search
for functions u(x) which minimize the value of F(u). Technically speaking, try to minimize the
functional F.
The basic idea is quite simple: if a function F (u) has a minimum, then its derivative has to vanish. But
there is a major technical problem: the variable u is actually a function u(x), i.e. we have to minimize a
functional. The techniques of the calculus of variations1 deal with this type of problem.
5–1 Definition : If a mapping F is defined for a set of functions X and returns a number as a result then F
is called a functional on the function space X.
Thus a functional is nothing but a function with a set of functions as domain of definition. It might
help to compare typical functions and functionals.
The fundamental lemma below is related to Hilbert space methods. As a very simple example examine
vectors in ℝⁿ to visualize the basic idea. A vector ~u ∈ ℝⁿ equals ~0 if and only if the scalar product with all
vectors ~φ ∈ ℝⁿ vanishes, i.e.

    ⟨~u , ~φ⟩ = 0 for all ~φ ∈ ℝⁿ   ⟺   ~u = ~0 .

Similarly a continuous function vanishes on an interval [a , b] iff its product with all functions φ integrates
to 0, i.e. the role of the scalar product is taken over by an integration:

    ⟨~f , ~g⟩   ⟶   ∫ₐᵇ f(x) · g(x) dx
¹ The calculus of variations was initiated with the problem of the brachistochrone by Johann Bernoulli (1667–1748) in 1696,
see [HenrWann17]. Contributions by Jakob Bernoulli (1654–1705) and Leonhard Euler (1707–1783) followed. Joseph-Louis
Lagrange (1736–1813) contributed extensively to the method.
Proof : Proceed by contradiction. Assume that u(x₀) > 0 for some x₀ between a and b. Since the function
u(x) is continuous we know that u(x) > 0 on a (possibly small) interval x₁ < x < x₂. Now choose

    φ(x) = { 0                     for x ≤ x₁
           { (x − x₁)² (x − x₂)²   for x₁ ≤ x ≤ x₂
           { 0                     for x₂ ≤ x .

Then conclude u(x) φ(x) ≥ 0 for all a ≤ x ≤ b and u(x₀) φ(x₀) > 0 and thus

    ∫ₐᵇ u(x) · φ(x) dx = ∫_{x₁}^{x₂} u(x) · φ(x) dx > 0 .

This is a contradiction to the condition in the lemma. Thus u(x) = 0 for a < x < b. As the function u is
continuous conclude that u(a) = u(b) = 0.  □
With a few more mathematical ideas the above result can be improved to obtain an important result for
the calculus of variations, the Fundamental Lemma 5–3. One version of it: if a continuously differentiable
function u satisfies

    ∫ₐᵇ u(x) · φ′(x) dx = 0

for all infinitely often differentiable functions φ(x) with φ(a) = φ(b) = 0, then u′(x) = 0, i.e. u is constant.
Proof : Find the proof of the first statement in any good book on functional analysis or calculus of varia-
tions. For the second part use integration by parts, i.e.

    0 = ∫ₐᵇ u(x) · φ′(x) dx = u(b) · φ(b) − u(a) · φ(a) − ∫ₐᵇ u′(x) · φ(x) dx .

Considering all test functions φ(x) with φ(a) = φ(b) = 0 leads to the condition u′(x) = 0. We are free
to choose test functions with arbitrary values at the end points a and b, thus conclude that u(a) · φ(a) =
u(b) · φ(b) = 0 .  □
For a given function f(x, u, u′) search for a function u(x) such that the functional

    F(u) = ∫ₐᵇ f(x, u(x), u′(x)) dx

has a critical value at the function u. For the sake of readability use the notations²

    f_x(x, u, u′) = ∂/∂x f(x, u, u′) ,
    f_u(x, u, u′) = ∂/∂u f(x, u, u′) ,
    f_{u′}(x, u, u′) = ∂/∂u′ f(x, u, u′) .

If the functional F attains its minimal value at the function u(x), then for an arbitrary function φ the
scalar function g(ε) = F(u + ε φ) has a minimum at ε = 0 and thus the derivative should vanish, i.e.

    d g(0)/dε = d/dε F(u + ε φ) |_{ε=0} = 0   for all functions φ .

To find the equations to be satisfied by the solution u(x) use linear approximations. For small values of ∆u
and ∆u′ use a Taylor approximation to conclude

    f(x, u + ∆u, u′ + ∆u′) ≈ f(x, u, u′) + ∂f(x, u, u′)/∂u ∆u + ∂f(x, u, u′)/∂u′ ∆u′
                           = f(x, u, u′) + f_u(x, u, u′) ∆u + f_{u′}(x, u, u′) ∆u′

and thus

    f(x, u(x) + ε φ(x), u′(x) + ε φ′(x)) = f(x, u(x), u′(x)) + ε f_u(x, u(x), u′(x)) φ(x) +
                                           + ε f_{u′}(x, u(x), u′(x)) φ′(x) + O(ε²) .
or

    d/dε F(u + ε φ) |_{ε=0} = ∫ₐᵇ f_u(x, u(x), u′(x)) φ(x) + f_{u′}(x, u(x), u′(x)) φ′(x) dx .

This integral has to vanish for all functions φ(x); using the Fundamental Lemma 5–3 this leads to a
necessary condition. An integration by parts leads to

    0 = ∫ₐᵇ f_u(x, u(x), u′(x)) φ(x) + f_{u′}(x, u(x), u′(x)) φ′(x) dx
      = [ f_{u′}(x, u(x), u′(x)) φ(x) ]_{x=a}^{b}
        + ∫ₐᵇ ( f_u(x, u(x), u′(x)) − d/dx f_{u′}(x, u(x), u′(x)) ) φ(x) dx .

Since this expression has to vanish for all functions φ(x) the necessary conditions are

    ∫ₐᵇ f(x, u(x), u′(x)) dx extremal   ⟹   { d/dx f_{u′}(x, u(x), u′(x)) = f_u(x, u(x), u′(x))
                                             { f_{u′}(a, u(a), u′(a)) · φ(a) = 0
                                             { f_{u′}(b, u(b), u′(b)) · φ(b) = 0 .

The first condition is the Euler–Lagrange equation, the second and third condition are boundary condi-
tions. If the value u(a) is given and we are not free to choose it, then we need φ(a) = 0 and the first boundary
condition is automatically satisfied. If we are free to choose u(a), then φ(a) need not vanish and we have
the condition

    f_{u′}(a, u(a), u′(a)) = 0 .

This is a natural boundary condition. A similar argument applies at the other endpoint x = b .

Now we have the central result for the calculus of variations in one variable.
If the functional F attains a critical value at the function u, then the Euler–Lagrange equation

    d/dx f_{u′}(x, u(x), u′(x)) = f_u(x, u(x), u′(x))     (5.1)

has to be satisfied for a < x < b. This is usually a second order differential equation.

• If it is a critical value amongst all functions with prescribed boundary values u(a) and u(b),
  use these to solve the differential equation.

• If you are free to choose the values of u(a) and/or u(b), then the natural boundary condi-
  tions

      f_{u′}(a, u(a), u′(a)) = 0   and/or   f_{u′}(b, u(b), u′(b)) = 0

  can be used.
♦
the Euler–Lagrange equation is not modified, but the natural boundary condition at x = b is modified.
The verification follows exactly the above procedure and is left as an exercise.
    f(x, u, u′) = √(1 + (u′)²)
    f_x(x, u, u′) = f_u(x, u, u′) = 0
    f_{u′}(x, u, u′) = u′/√(1 + (u′)²)

Thus the Euler–Lagrange equation reads

    d/dx ( u′(x)/√(1 + (u′(x))²) ) = 0 .

The derivative of a function being zero everywhere implies that the function has to be constant and thus

    u′(x)/√(1 + (u′(x))²) = c

and conclude that u′(x) has to be constant. Thus the optimal solution is a straight line. This should not be a
surprise.
If we are free to choose the point of contact along the vertical line at x = b we may use φ(b) ≠ 0 and
thus find the natural boundary condition

    u′(b)/√(1 + (u′(b))²) = 0 .

This implies u′(b) = 0 and thus u′(x) = 0 for all x. This leads to a horizontal line, which is obviously the
shortest connection from the given height at x = a to the vertical line at x = b . ♦
and due to the constant horizontal force T this requires an energy of T ∆L. The applied external, vertical
force density f(x) can be modeled by a corresponding potential energy density of −f(x) u(x). Now the
total energy is given by

    E(u) = ∫₀ᴸ F(u(x), u′(x)) dx = ∫₀ᴸ T √(1 + (u′(x))²) − f(x) · u(x) dx .

With the partial derivatives

    F(u, u′) = T √(1 + (u′)²) − f · u
    F_u(u, u′) = −f
    F_{u′}(u, u′) = T u′/√(1 + (u′)²)

the Euler–Lagrange equation (5.1) applied to this example leads to

    −d/dx ( T u′(x)/√(1 + (u′(x))²) ) = f(x) .

Since the string is attached at both ends, supplement this differential equation with the boundary conditions

    u(0) = u(L) = 0 .
If we know a priori that the slope u′(x) along the string is small, use a linear approximation³

    √(1 + (u′(x))²) ≈ 1 + (1/2) (u′(x))² .

With this the change of length ∆L of the string is given by

    ∆L = ∫₀ᴸ √(1 + (u′(x))²) dx − L ≈ ∫₀ᴸ (1/2) (u′(x))² dx .

Now the total energy can be written in the form

    E(u) = T ∆L + E_pot = ∫₀ᴸ (T/2) (u′(x))² − f(x) · u(x) dx

and the resulting Euler–Lagrange equation is given by

    −T u″(x) = f(x)   for 0 < x < L .

³ Use the Taylor approximation √(1 + z) ≈ 1 + z/2 .
An external force ~F = (F₁, F₂) at the right end point ~x(L) is determined by

    ~F = −grad U_pot = −( ∂U_pot/∂x , ∂U_pot/∂y )

and is thus given by the potential energy U_pot(x, y)

    U_pot(x(L), y(L)) = −F₁ x(L) − F₂ y(L) = −F₁ ∫₀ᴸ cos(α(s)) ds − F₂ ∫₀ᴸ sin(α(s)) ds .

The total energy U_tot as a functional of the angle function α(s) is given by

    U_tot(α) = U_elast(α) + U_pot(~x(L)) = ∫₀ᴸ (1/2) EI (α′(s))² ds + U_pot(x(L), y(L))
             = ∫₀ᴸ (1/2) EI (α′(s))² − F₁ cos(α(s)) − F₂ sin(α(s)) ds .

The physical situation is characterized as a minimum of this functional, using Bernoulli's principle. For the
expression to be integrated find the partial derivatives

    F(α, α′) = (1/2) EI (α′)² − F₁ cos(α) − F₂ sin(α)
    F_α(α, α′) = F₁ sin(α) − F₂ cos(α)
    F_{α′}(α, α′) = EI α′

and thus the Euler–Lagrange equation

    EI α″(s) = F₁ sin(α(s)) − F₂ cos(α(s)) .

This is identical to equation (1.16) (page 19). For a beam clamped at the left end and no moments at the right
end point find the boundary conditions α(0) = α′(L) = 0. The second is a natural boundary condition, as
α(L) is not prescribed. This is a nonlinear, second order boundary value problem. ♦
5.2.2 Quadratic Functionals and Second Order Linear Boundary Value Problems

If for given functions a(x), b(x) and g(x) the functional

    F(u) = ∫_{x₀}^{x₁} (1/2) a(x) (u′(x))² + (1/2) b(x) u(x)² + g(x) · u(x) dx     (5.2)

has to be minimised, then obtain the Euler–Lagrange equation

    d/dx f_{u′} = f_u
    d/dx ( a(x) du(x)/dx ) = b(x) u(x) + g(x) .
This is a linear, second order differential equation which has to be supplemented with appropriate boundary
conditions. If the value at one of the endpoints is given, then this is called a Dirichlet boundary condition.
If we are free to choose the value at the boundary, then this is called a Neumann boundary condition.
Theorem 5–4 implies that the second situation leads to a natural boundary condition

    a(x) du/dx = 0   for x = x₀ or x = x₁ .

If we wish to consider a non-homogeneous boundary condition

    a(x) du/dx = r(x)   for x = x₀ or x = x₁ ,

then the functional has to be supplemented by a corresponding boundary contribution of the form −r · u at
that endpoint.

Thus the above approach shows that many second order differential equations correspond to extremal points
of a properly chosen functional. Many physical, mechanical and electrical problems lead to this type of
equation, as can be seen in Table 5.1 (Source: [OttoPete92, p. 63]).
can be extended to functions of multiple variables. If Ω ⊂ ℝⁿ is a "nice" domain with boundary ∂Ω and
outer unit normal vector ~n, then the corresponding result is called the divergence theorem. For domains
Ω ⊂ ℝ² use

    ∫∫_Ω div ~v dA = ∫∫_Ω ( ∂v₁/∂x + ∂v₂/∂y ) dA = ∮_{∂Ω} ⟨~v , ~n⟩ ds ,

where dV is the standard volume element and dA the surface element. The usual rule to differentiate
products of two functions leads to

    ∇ · (f ~v) = (∇f) · ~v + f (∇ · ~v)
    div(f ~v) = ⟨grad f , ~v⟩ + f (div ~v) .

Combining the two results leads to

    ∫∫_Ω f (div ~v) dA = ∮_{∂Ω} f ⟨~v , ~n⟩ ds − ∫∫_Ω ⟨grad f , ~v⟩ dA .

This formula is referred to as Green–Gauss theorem or Green's identity and is similar to integration by
parts for functions of one variable:

    ∫ₐᵇ f · g′ dx = −f(a) · g(a) + f(b) · g(b) − ∫ₐᵇ f′ · g dx
5.2.4 Quadratic Functionals and Second Order Boundary Value Problems in 2 Dimensions

We want to modify the functional in (5.2) to a 2 dimensional setting and examine the boundary value
problem resulting from the Euler–Lagrange equations.

Consider a domain Ω ⊂ ℝ² with a boundary ∂Ω = Γ₁ ∪ Γ₂ consisting of two disjoint parts Γ₁ and Γ₂.
For given functions a, b, f, g₁ and g₂ (all depending on x and y) we search a yet unknown function u, such
that the functional

    F(u) = ∫∫_Ω (1/2) a ⟨∇u, ∇u⟩ + (1/2) b u² + f · u dA − ∫_{Γ₂} g₂ u ds     (5.3)

is minimal amongst all functions u which satisfy

    u(x, y) = g₁(x, y)   for (x, y) ∈ Γ₁ .

To find the necessary equations assume that φ and ∇φ are small and use the approximations

    (u + φ)² = u² + 2 u φ + φ² ≈ u² + 2 u φ
    ⟨∇(u + φ), ∇(u + φ)⟩ = ⟨∇u, ∇u⟩ + 2 ⟨∇u, ∇φ⟩ + ⟨∇φ, ∇φ⟩ ≈ ⟨∇u, ∇u⟩ + 2 ⟨∇u, ∇φ⟩

and Green's identity to conclude

    F(u + φ) − F(u) ≈ ∫∫_Ω a ⟨∇u, ∇φ⟩ + b u φ + f · φ dA − ∫_{Γ₂} g₂ φ ds
                    = ∫∫_Ω (−∇ · (a ∇u) + b u + f) · φ dA + ∫_Γ a ⟨~n, ∇u⟩ φ ds − ∫_{Γ₂} g₂ φ ds
                    = ∫∫_Ω (−∇ · (a ∇u) + b u + f) · φ dA + ∫_{Γ₂} (a ⟨~n, ∇u⟩ − g₂) φ ds .

The test function φ is arbitrary, but has to vanish on Γ₁. If the functional F is minimal for the function u,
then the above integral has to vanish for all test functions φ. First consider only test functions that vanish
on Γ₂ too and use the fundamental lemma (a modification of Theorem 5–3) to conclude that the expression
in the parenthesis in the integral over the domain Ω has to be zero. Then use arbitrary test functions φ
to conclude that the expression in the integral over Γ₂ has to vanish too. Thus the resulting linear partial
differential equation with boundary conditions is given by

    −∇ · (a ∇u) + b u + f = 0   in Ω
    u = g₁                      on Γ₁     (5.4)
    a ⟨~n, ∇u⟩ = g₂             on Γ₂ .
The functions a, b, f and g_i are known, and we have to determine the solution u, all depending on
the independent variables (x, y) ∈ Ω. The vector ~n is the outer unit normal vector. The expression

    ⟨∇u, ~n⟩ = n₁ ∂u/∂x + n₂ ∂u/∂y = ∂u/∂~n

determines the directional derivative of the function u in the direction of the outer normal ~n.
A list of typical applications of elliptic equations of second order is shown in Table 5.2, see [Redd84].
The static heat conduction problem in Section 1.1.5 (page 11) is another example. A description of the
ground water flow problem is given in [OttoPete92]. This table clearly illustrates the importance of the
above type of problem.
where we assume that u = 0 on the boundary ∂Ω. Now apply a vertical force to the membrane, given by a
force density function f(x, y) (units: N/m²). To formulate this we introduce a potential energy

    E_pot = −∫∫_Ω f · u dA .

Minimizing the total energy leads to the Euler–Lagrange equation

    τ ∆u = ∇ · (τ ∇u) = −f .

This corresponds to the model problem with equation (1.11) on page 15. ♦
For the dynamic situation take the force of inertia

    f = −ρ ü

into account, where ρ is the mass density (units kg/m²). The resulting equation is then

    ρ ü − ∇ · (τ ∇u) = 0
Table 5.2: Typical applications of elliptic equations of second order (following [Redd84])

Field of application    | Primary variable         | Material constant       | Source variable           | Secondary variables
------------------------|--------------------------|-------------------------|---------------------------|------------------------------------------
General situation       | u                        | a                       | f                         | ∂u/∂x , ∂u/∂y
Heat transfer           | Temperature T            | Conductivity k          | Heat source Q             | Heat flow density ~q = −k ∇T
Diffusion               | Concentration c          | Diffusion coefficient D | External supply Q         | Flux ~q = −D ∇c
Electrostatics          | Scalar potential Φ       | Dielectric constant ε   | Charge density ρ          | Electric flux density D
Magnetostatics          | Magnetic potential Φ     | Permeability ν          |                           | Magnetic flux density B
Transverse deflection   | Transverse deflection u  | Tension of membrane T   | Transversely distributed  | Normal force q
of elastic membrane     |                          |                         | load                      |
Torsion of a bar        | Warping function φ       | 1                       | 0                         | Stress τ: τ_xz = Eα/(2(1+ν)) (−y + ∂φ/∂x) ,
                        |                          |                         |                           | τ_yz = Eα/(2(1+ν)) (x + ∂φ/∂y)
Irrotational flow       | Stream function Ψ        | Density ρ               | Mass production σ         | Velocity (u, v): ∂Ψ/∂x = −u , ∂Ψ/∂y = v
of an ideal fluid       | Velocity potential Φ     |                         | (usually zero)            | ∂Φ/∂x = u , ∂Φ/∂y = v
Ground-water flow       | Piezometric head Φ       | Permeability K          | Recharge Q                | Velocities, seepage q = K ∂Φ/∂n
apply a small perturbation ϕ to the argument u and use a linear approximation to find

    J(u + ϕ) = ∫∫_Ω F(u + ϕ, ∇u + ∇ϕ) dA
             = ∫∫_Ω F(u, ∇u) + F_u(u, ∇u) ϕ + F_{∇u}(u, ∇u) · ∇ϕ dA + O(|ϕ|², ‖∇ϕ‖²) .

If the minimum of the functional J(u) is attained at the function u, conclude that for all permissible test
functions ϕ the necessary condition

    0 = ∫∫_Ω F_{∇u}(u, ∇u) · ∇ϕ + F_u(u, ∇u) ϕ dA
      = ∮_{∂Ω} ϕ F_{∇u}(u, ∇u) · ~n ds + ∫∫_Ω (−∇ · (F_{∇u}(u, ∇u)) + F_u(u, ∇u)) ϕ dA

holds. Since this expression vanishes for all test functions ϕ, use the fundamental lemma to find the Euler–Lagrange
equation

    −∇ · (F_{∇u}(u, ∇u)) + F_u(u, ∇u) = 0   in Ω     (5.5)

and on the sections of the boundary where the test function ϕ does not vanish the natural boundary condition

    F_{∇u}(u, ∇u) · ~n = 0 .
If the functional J attains its minimum at the function u, conclude that for all test functions ϕ

    0 = ∫∫_Ω 1/√(1 + u_x² + u_y²) ∇u · ∇ϕ dA
      = ∮_{∂Ω} 1/√(1 + u_x² + u_y²) ⟨~n, ∇u⟩ ϕ ds − ∫∫_Ω ∇ · ( 1/√(1 + u_x² + u_y²) ∇u ) ϕ dA .

If the values z = u(x, y) are known on the boundary ∂Ω ⊂ ℝ², then the test functions ϕ vanish on the
boundary and the Euler–Lagrange equation is given by

    −∇ · ( 1/√(1 + u_x² + u_y²) ∇u ) = 0 .

This is a nonlinear second order differential equation. The identical result may be generated by

    F(u, ∇u) = √(1 + u_x² + u_y²)
    F_u(u, ∇u) = 0
    F_{∇u}(u, ∇u) = 1/√(1 + u_x² + u_y²) (u_x , u_y)
The above idea can also be applied to functionals depending on more than one dependent variable.
Examine a domain Ω ⊂ ℝ² with a boundary ∂Ω consisting of two parts: on Γ₁ the values of u₁ and u₂ are
given and on Γ₂ these values are free. Then minimize a functional of the form⁴

    J(u₁, u₂) = ∫∫_Ω F(u₁, u₂, ∇u₁, ∇u₂) dA − ∫_{Γ₂} u₁ · g₁ + u₂ · g₂ ds .     (5.6)

For perturbations ϕ₁ and ϕ₂ find

    J(u₁ + ϕ₁, u₂ + ϕ₂) = ∫∫_Ω F(u₁ + ϕ₁, ∇u₁ + ∇ϕ₁, u₂ + ϕ₂, ∇u₂ + ∇ϕ₂) dA −
                          − ∫_{Γ₂} (u₁ + ϕ₁) · g₁ + (u₂ + ϕ₂) · g₂ ds
                        = J(u₁, u₂) + ∫∫_Ω F_{u₁}(. . .) ϕ₁ + ⟨F_{∇u₁}(. . .), ∇ϕ₁⟩ + F_{u₂}(. . .) ϕ₂ + ⟨F_{∇u₂}(. . .), ∇ϕ₂⟩ dA −
                          − ∫_{Γ₂} ϕ₁ · g₁ + ϕ₂ · g₂ ds + O(|ϕ₁|², ‖∇ϕ₁‖², |ϕ₂|², ‖∇ϕ₂‖²) .

⁴ Use the notation F_{∇u} = ( ∂F/∂u_x , ∂F/∂u_y )ᵀ .
Find necessary conditions if the minimum of the functional J(u₁, u₂) is attained at (u₁, u₂). Conclude that
for all permissible test functions ϕ₁ and ϕ₂ vanishing on the boundary ∂Ω the following integral has to
equal zero:

    0 = ∫∫_Ω F_{u₁}(. . .) ϕ₁ + ⟨F_{∇u₁}(. . .), ∇ϕ₁⟩ + F_{u₂}(. . .) ϕ₂ + ⟨F_{∇u₂}(. . .), ∇ϕ₂⟩ dA
      = ∫∫_Ω (F_{u₁}(. . .) − div (F_{∇u₁}(. . .))) ϕ₁ + (F_{u₂}(. . .) − div (F_{∇u₂}(. . .))) ϕ₂ dA .

Since this expression has to vanish for all test functions ϕ₁ and ϕ₂, use the fundamental lemma to arrive at a
system of Euler–Lagrange equations.

5–11 Result : A minimizer u₁ and u₂ of the functional J(u₁, u₂) of the form (5.6) solves the system of
Euler–Lagrange equations

    div (F_{∇u₁}(u₁, u₂, ∇u₁, ∇u₂)) = F_{u₁}(u₁, u₂, ∇u₁, ∇u₂)     (5.7)
    div (F_{∇u₂}(u₁, u₂, ∇u₁, ∇u₂)) = F_{u₂}(u₁, u₂, ∇u₁, ∇u₂) .   (5.8)
♦

Using these equations with test functions ϕ₁ and ϕ₂ vanishing on Γ₁, but not necessarily on Γ₂, find

    0 = ∫_{Γ₂} ϕ₁ (F_{∇u₁}(. . .) · ~n − g₁) + ϕ₂ (F_{∇u₂}(. . .) · ~n − g₂) ds ,

which leads to the natural boundary conditions F_{∇u₁} · ~n = g₁ and F_{∇u₂} · ~n = g₂ on Γ₂.
The actual motion of a system with the above Lagrangian L is such as to render the (Hamilton's)
integral

    I = ∫_{t₁}^{t₂} (T − V) dt = ∫_{t₁}^{t₂} L(~q, ~q̇) dt

an extremum with respect to all twice differentiable functions ~q(t). Here t₁ and t₂ are arbitrary
times.

This is a situation where we (usually) have multiple dependent variables q_i and thus the Euler–Lagrange
equations imply

    d/dt ∂L/∂q̇_i = ∂L/∂q_i   for i = 1, 2, . . . , n .

These differential equations apply to many mechanical setups, as the following examples will illustrate.
To find the kinetic energy we need the velocity of the lower mass. The velocity vector is equal to the
vector sum of the velocity of the upper mass and the velocity of the lower particle relative to the upper mass:

    ~v₁ has length l φ̇ and angle φ ± π/2 ,
    ~v₂ has length l θ̇ and angle θ ± π/2 ,
    φ − θ is the angle between ~v₁ and ~v₂ .

Since the two vectors differ in direction by an angle of φ − θ we can use the law of cosines to find the absolute
velocity as⁵

    speed of second mass = l √(φ̇² + θ̇² + 2 φ̇ θ̇ cos(φ − θ)) .

Thus the total kinetic energy is

    T(φ, θ, φ̇, θ̇) = (m l²/2) (2 φ̇² + θ̇² + 2 φ̇ θ̇ cos(φ − θ))

and the Lagrange function is

    L = T − V = (m l²/2) (2 φ̇² + θ̇² + 2 φ̇ θ̇ cos(φ − θ)) + m l g (2 cos φ + cos θ) .

The Euler–Lagrange equation for the free variable φ is obtained by

    ∂L/∂φ̇ = m l² (2 φ̇ + θ̇ cos(φ − θ))
    d/dt ∂L/∂φ̇ = m l² (2 φ̈ + θ̈ cos(φ − θ) − θ̇ (φ̇ − θ̇) sin(φ − θ))
    ∂L/∂φ = −m l² φ̇ θ̇ sin(φ − θ) − 2 m l g sin φ

which, upon substitution into the Euler–Lagrange equation, yields

    m l² (2 φ̈ + θ̈ cos(φ − θ) + θ̇² sin(φ − θ)) = −2 m l g sin φ .

Similarly for the variable θ:

    ∂L/∂θ̇ = m l² (θ̇ + φ̇ cos(φ − θ))
    d/dt ∂L/∂θ̇ = m l² (θ̈ + φ̈ cos(φ − θ) − φ̇ (φ̇ − θ̇) sin(φ − θ))
    ∂L/∂θ = +m l² φ̇ θ̇ sin(φ − θ) − m l g sin θ

leading to

    m l² (θ̈ + φ̈ cos(φ − θ) − φ̇² sin(φ − θ)) = −m l g sin θ .

Those two equations can be divided by m l² and then lead to a system of ordinary differential equations of
order 2:

    2 φ̈ + θ̈ cos(φ − θ) + θ̇² sin(φ − θ) = −2 (g/l) sin φ
    θ̈ + φ̈ cos(φ − θ) − φ̇² sin(φ − θ) = −(g/l) sin θ

⁵ Another approach is to use cartesian coordinates x(φ, θ) = l (sin φ + sin θ), y(φ, θ) = −l (cos φ + cos θ) and a few
calculations.
The matrix on the left hand side is invertible and thus this differential equation can reliably be solved by
numerical procedures, see Section 3.4 starting on page 183; a sketch follows below.

Assuming that all angles and velocities are small, one can use the approximations cos(φ − θ) ≈ 1 and
sin x ≈ x to obtain the linearized system of differential equations

    [ 2  1 ] ( φ̈ )        g ( 2 φ )
    [ 1  1 ] ( θ̈ )  =  − ─ ( θ   ) .
                          l

This linear system of equations could be solved explicitly, using eigenvalues and eigenvectors. For the above
matrix find

    λ₁ = 2 − √2 ≈ 0.59 , ~v₁ = (1, √2)ᵀ   and   λ₂ = 2 + √2 ≈ 3.41 , ~v₂ = (1, −√2)ᵀ .

Thus there is one type of in-phase solution with a small frequency and a high frequency solution where the
two angles are out of phase. ♦
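The nonlinear system can be solved numerically in the same spirit as the code MovingPendulum.m shown later in this section. The following is only a sketch, with assumed parameter values (g = 9.81, l = 1) and assumed initial angles:

    function DoublePendulum()
      t = 0:0.01:10; Y0 = [pi/4;0;0;0]; % [phi; theta; dphi; dtheta]
      function dy = rhs(y)
        g = 9.81; l = 1;
        M = [2, cos(y(1)-y(2)); cos(y(1)-y(2)), 1];
        f = [-y(4)^2*sin(y(1)-y(2)) - 2*g/l*sin(y(1));
             +y(3)^2*sin(y(1)-y(2)) -   g/l*sin(y(2))];
        dy = [y(3); y(4); M\f];  % solve for the highest derivatives
      end%function
      [t,Y] = ode45(@(t,y)rhs(y),t,Y0);
      plot(t,Y(:,1),t,Y(:,2)); xlabel('time'); legend('\phi','\theta')
    end%function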
[Figure: a chariot of mass m₁ moving horizontally (position x), with a pendulum of length l and mass m₂ attached at angle θ]
First examine the case F = 0 and for the Lagrange function L = T − V derive two Euler–Lagrange
equations. The first equation deals with the dependence on the function x(t) and its derivative ẋ(t):

    d/dt L_ẋ(x, θ, ẋ, θ̇) = L_x(x, θ, ẋ, θ̇)
    d/dt ( (m₁ + m₂) ẋ + m₂ l cos θ θ̇ ) = 0 .

From this conclude that the momentum in x direction is conserved. The second equation deals with the
dependence on the function θ(t):

    d/dt L_θ̇(x, θ, ẋ, θ̇) = L_θ(x, θ, ẋ, θ̇)
    d/dt ( m₂ (l ẋ cos θ + l² θ̇) ) = −m₂ l ẋ θ̇ sin θ − m₂ l g sin θ
    d/dt ( ẋ cos θ + l θ̇ ) = −ẋ θ̇ sin θ − g sin θ
    ẍ cos θ − ẋ θ̇ sin θ + l θ̈ = −ẋ θ̇ sin θ − g sin θ
    ẍ cos θ + l θ̈ = −g sin θ .

This is a second order differential equation for the functions x(t) and θ(t). The two equations can be
combined, leading to the system

    (m₁ + m₂) ẍ + m₂ l cos θ θ̈ = m₂ l sin θ (θ̇)²
    cos θ ẍ + l θ̈ = −g sin θ .

With the help of a matrix the system can be solved for the highest occurring derivatives. A straightforward
computation shows that the determinant of the matrix does not vanish and thus we can always find the
inverse matrix:

    d²/dt² ( x )   [ m₁ + m₂   m₂ l cos θ ]⁻¹ ( m₂ l sin θ (θ̇)² )
           ( θ ) = [ cos θ     l          ]   ( −g sin θ          )

This is a convenient form to generate numerical solutions for the problem at hand.
The above model does not consider friction. Now we want to include some friction on the moving
chariot. This is not elementary, as the potential V can not depend on the velocity ẋ, but there is a trick to be
used:

1. Introduce a constant force F applied to the chariot. This is done by modifying the potential V accord-
   ingly.

2. In the resulting equations of motion replace the constant force F by the velocity dependent friction
   force −α ẋ (this is the form used in the code below).

To take the additional force F into account modify the potential energy to V − F x. The Euler–Lagrange
equation for the variable θ will not be affected by this change, but the equation for x turns out to be

    d/dt ( (m₁ + m₂) ẋ + m₂ l cos θ θ̇ ) = F
    (m₁ + m₂) ẍ + m₂ l cos θ θ̈ = m₂ l sin θ (θ̇)² + F
Below find the complete code to solve this example and the resulting Figure 5.3.
[Figure 5.3: position x(t) (left) and angle θ(t) (right) of the moving pendulum as functions of the time]
MovingPendulum.m

function MovingPendulum()
  t = 0:0.01:5; Y0 = [0;pi/6;0;0];
  function dy = MovPend(y)
    l = 1; m1 = 1; m2 = 8; g = 9.81; al = 0.5;
    ddy = [m1+m2, m2*l*cos(y(2));
           cos(y(2)), l]\[m2*l*sin(y(2))*y(4)^2-al*y(3);-g*sin(y(2))];
    dy = [y(3);y(4);ddy];
  end%function
  [t,Y] = ode45(@(t,y)MovPend(y),t,Y0);
  figure(2)
  plot(t,Y(:,1)); grid on; xlabel('time'); ylabel('position');
  figure(3)
  plot(t,Y(:,2)); grid on; xlabel('time'); ylabel('angle');
end%function
In Figure 5.4 the experimental laws of elasticity are illustrated. A beam of original length l with cross-
sectional area A = w · h is stretched by applying a force F. Many experiments lead to the following two
basic laws of elasticity.

• Hooke's law

      ∆l/l = (1/E) (F/A) ,

  where the material constant E is called modulus of elasticity (Young's modulus).

• Poisson's law

      ∆h/h = ∆w/w = −ν ∆l/l ,

  where the material constant ν is called Poisson's ratio.

Figure 5.4: Definition of the modulus of elasticity E and the Poisson number ν
In this section the above basic mechanical facts are formulated for general situations, i.e. we introduce the
basic equations of elasticity. The procedure is as follows:

• State the connection between deformations and forces, leading to Hooke's law.
An elastic solid can be fixed at its left edge and be pulled on at the right edge by a force. Figure 5.5
shows a simple situation. The original shape (dotted line) will change into a deformed state (full line). The
goal is to give a mathematical description of the deformation of the solid (strain) and the forces that will
occur in the solid (stress). A point originally at position ~x in the solid is moved to its new position ~x + ~u(~x),
i.e. it is displaced by ~u(~x):

    ~x ⟶ ~x + ~u

    ( x )      ( x )   ( u₁(x, y) )
    ( y ) ⟶  ( y ) + ( u₂(x, y) )

The notation will be used to give a formula for the elastic energy stored in the deformed solid. Based on
this information we will construct a finite element solution to the problem. For a given force we search the
displacement vector field ~u(~x).

In order to simplify the treatment enormously we assume that the displacements of the structure are
very small compared to the dimensions of the solid.
Since ∆x and ∆y are assumed to be very small, the deformation is very close to an affine deformation,
i.e. a linear deformation and a translation. Since the deformations are small we also know that the deformed
rectangle has to be almost horizontal, thus Figure 5.6 is correct. A straightforward Taylor approximation
leads to expressions for the positions of the four corners of the rectangle:

    A = (x, y)            ⟶  A′ = (x, y) + (u₁(x, y), u₂(x, y))
    B = (x + ∆x, y)       ⟶  B′ = (x + ∆x, y) + (u₁(x, y), u₂(x, y)) + ( ∂u₁(x,y)/∂x ∆x , ∂u₂(x,y)/∂x ∆x )
    C = (x, y + ∆y)       ⟶  C′ = (x, y + ∆y) + (u₁(x, y), u₂(x, y)) + ( ∂u₁(x,y)/∂y ∆y , ∂u₂(x,y)/∂y ∆y )
    D = (x + ∆x, y + ∆y)  ⟶  D′ = (x + ∆x, y + ∆y) + (u₁(x, y), u₂(x, y)) +
                                   ( ∂u₁/∂x ∆x + ∂u₁/∂y ∆y , ∂u₂/∂x ∆x + ∂u₂/∂y ∆y )
Observe that the matrix R does not lead to changes of distances in the body; it corresponds to a rotation.
Only the contributions by A lead to a stretching of the material.

Setting ∆y = 0 in the above formula compute the distance |A′B′| as

    |A′B′|² = (∆x)² + 2 ∂u₁/∂x (∆x)² ≈ (∆x)² (1 + ∂u₁/∂x)²
    |A′B′| = √(1 + 2 ∂u₁/∂x) ∆x ≈ ∆x + ∂u₁/∂x ∆x .

Now examine the ratio of the change of length over the original length to obtain the normal strains εxx and
εyy in the direction of the two axes:

    εxx = (change of length in x direction)/(length in x direction) = ( ∂u₁(x,y)/∂x ∆x )/∆x = ∂u₁(x, y)/∂x
    εyy = (change of length in y direction)/(length in y direction) = ( ∂u₂(x,y)/∂y ∆y )/∆y = ∂u₂(x, y)/∂y .

To examine the shear strain, assume that the rectangle ABCD is not rotated, as shown in Figure 5.6. Let γ₁ be
the angle formed by the line A′B′ with the x axis and γ₂ the angle between the line A′C′ and the y axis. The
sign convention is such that both angles in Figure 5.6 are positive. Since tan φ ≈ φ for small angles find

    tan γ₁ = ( ∂u₂(x,y)/∂x ∆x )/∆x = ∂u₂(x, y)/∂x
    tan γ₂ = ( ∂u₁(x,y)/∂y ∆y )/∆y = ∂u₁(x, y)/∂y
    2 εxy = tan γ₁ + tan γ₂ ≈ γ₁ + γ₂ .

Thus the number εxy indicates by how much a right angle between the x and y axis is diminished by
the given deformation.
• Along the x-axis observe ∆u₁ = 0.2 and ∆u₂ = 0.4. This leads to

      ∂u₁/∂x = ∆u₁/∆x = 0.2/2 = 0.1   and   ∂u₂/∂x = ∆u₂/∆x = 0.4/2 = 0.2 .

♦
5–17 Example : It is a good exercise to compute the strain components for a few simple deformations.

• Pure translation: if the displacement vector ~u is constant we have the situation of a pure translation,
  without deformation. Since all derivatives of u₁ and u₂ vanish we find εxx = εyy = εxy = 0, i.e. the
  strain components are all zero.

• Rotation: since the overall displacement has to be small, compute only with small angles φ⁸. This leads to

      [ εxx  εxy ]   [ ∂u₁/∂x                     (1/2)(∂u₁/∂y + ∂u₂/∂x) ]   [ cos φ − 1   0          ]   [ 0  0 ]
      [ εxy  εyy ] = [ (1/2)(∂u₂/∂x + ∂u₁/∂y)     ∂u₂/∂y                 ] = [ 0           cos φ − 1  ] ≈ [ 0  0 ] .

• Uniform stretching: the displacement field ~u(x, y) = λ (x, y)ᵀ corresponds to a stretching of the solid by
  the factor 1 + λ in both directions. The components of the strain are given by

      [ εxx  εxy ]   [ λ  0 ]
      [ εxy  εyy ] = [ 0  λ ] .
• Stretching in x direction: the displacement field ~u(x, y) = λ (x, 0)ᵀ corresponds to a stretching by the
  factor 1 + λ along the x axis. The components of the strain are given by

      [ εxx  εxy ]   [ λ  0 ]
      [ εxy  εyy ] = [ 0  0 ] .

• Stretching along the diagonal: the displacement field ~u(x, y) = (λ/2) (x + y, x + y)ᵀ corresponds to a
  stretching by the factor 1 + λ along the axis x = y. The straight line y = −x is left unchanged. To verify
  this observe

      ( u₁(x, x), u₂(x, x) )ᵀ = λ (x, x)ᵀ   and   ( u₁(x, −x), u₂(x, −x) )ᵀ = (0, 0)ᵀ .

• The two previous examples both stretch the solid in one direction by a factor λ and leave the orthogo-
  nal direction unchanged. Thus it is the same type of deformation, the difference being the coordinate
  system used to examine the result. Observe that the expressions εxx + εyy coincide for the two
  descriptions.
5–18 Observation : Consider two coordinate systems, where one is generated by rotating the first coordi-
nate axes by an angle φ. The situation is shown in Figure 5.7 with φ = π/6 = 30°. Now express a vector ~u
(components in the (x y)–system) also in the (x′ y′)–system. To achieve this rotate the vector ~u by −φ and read
out the components. In our example find ~u = (1, 1)ᵀ and thus

    ~u′ = Rᵀ · ~u = [ cos φ   sin φ ] · ( u₁ ) ≈ [ 0.866  0.5   ] · ( 1 ) = ( 1.366 )
                   [ −sin φ  cos φ ]   ( u₂ )   [ −0.5   0.866 ]   ( 1 )   ( 0.366 ) .
    ~x′ = Rᵀ · ~x = [ cos φ   sin φ ] · ( x )        ~x = R · ~x′ = [ cos φ  −sin φ ] · ( x′ )
                   [ −sin φ  cos φ ]   ( y )                       [ sin φ   cos φ ]   ( y′ )

[Figure 5.7: the (x′ y′)–coordinate system, generated by rotating the (x y)–system by the angle φ]
5–19 Result : A given strain situation is examined in two different coordinate systems, as shown in Figure 5.7.
Then the strain components transform according to

    [ ε′x′x′  ε′x′y′ ]        [ εxx  εxy ]
    [ ε′x′y′  ε′y′y′ ] = Rᵀ · [ εxy  εyy ] · R .
♦

Proof : Since the deformations at a given point are identical, express the displacement in the rotated
coordinates as ~u′(~x′) = Rᵀ · ~u(R ~x′) and use the chain rule. As an example compute
    ∂/∂y′ u′₂(~x′) = −sin φ ( −sin φ ∂u₁/∂x + cos φ ∂u₁/∂y ) + cos φ ( −sin φ ∂u₂/∂x + cos φ ∂u₂/∂y )
                  = sin²φ ∂u₁/∂x + cos²φ ∂u₂/∂y − cos φ sin φ ( ∂u₁/∂y + ∂u₂/∂x ) .

Now verify that

    ε′x′x′ + ε′y′y′ = ∂u′₁/∂x′ + ∂u′₂/∂y′ = ∂u₁/∂x + ∂u₂/∂y = εxx + εyy ,
    ∂u′₁/∂y′ − ∂u′₂/∂x′ = ∂u₁/∂y − ∂u₂/∂x .

These two expressions are thus independent of the orientation of the coordinate system.
If the matrix multiplication below is carried one step further, then the claimed transformation formula
will appear.

    Rᵀ · [ 2 εxx  2 εxy ] · R =
         [ 2 εxy  2 εyy ]

    = [ cos φ   sin φ ] · [ 2 ∂u₁/∂x             ∂u₂/∂x + ∂u₁/∂y ] · [ cos φ  −sin φ ]
      [ −sin φ  cos φ ]   [ ∂u₂/∂x + ∂u₁/∂y     2 ∂u₂/∂y         ]   [ sin φ   cos φ ]

    = [ cos φ   sin φ ] · [ 2 cos φ ∂u₁/∂x + sin φ (∂u₂/∂x + ∂u₁/∂y)      −2 sin φ ∂u₁/∂x + cos φ (∂u₂/∂x + ∂u₁/∂y) ]
      [ −sin φ  cos φ ]   [ cos φ (∂u₂/∂x + ∂u₁/∂y) + 2 sin φ ∂u₂/∂y      −sin φ (∂u₂/∂x + ∂u₁/∂y) + 2 cos φ ∂u₂/∂y ]
□
5–20 Example : To read out the strain in a direction given by the normalized directional vector ~d =
(d₁, d₂)ᵀ = (cos α, sin α)ᵀ, compute the normal strain ∆l/l in that direction⁹ by

    ∆l/l = ⟨ ( d₁ ) , [ εxx  εxy ] · ( d₁ ) ⟩ .
             ( d₂ )   [ εxy  εyy ]   ( d₂ )

To verify this, rotate the coordinate system such that the new x′ axis points in the direction of ~d.

• Then the top left entry ε′x′x′ shows the normal strain ∆l/l in that direction:

    ∆l/l = ε′x′x′ = ⟨ ( 1 ) , [ ε′x′x′  ε′x′y′ ] · ( 1 ) ⟩
                      ( 0 )   [ ε′x′y′  ε′y′y′ ]   ( 0 )
         = ⟨ ( 1 ) , Rᵀ · [ εxx  εxy ] · R · ( 1 ) ⟩ = ⟨ ( d₁ ) , [ εxx  εxy ] · ( d₁ ) ⟩
             ( 0 )        [ εxy  εyy ]       ( 0 )       ( d₂ )   [ εxy  εyy ]   ( d₂ )

⁹ This result might be useful when working with strain gauges to measure deformations.
Since the strain matrix is symmetric, there always exists¹⁰ an angle φ such that the strain matrix in the
new coordinate system is diagonal, i.e.

    [ ε′x′x′  0      ]   [ cos φ   sin φ ] [ εxx  εxy ] [ cos φ  −sin φ ]
    [ 0       ε′y′y′ ] = [ −sin φ  cos φ ] [ εxy  εyy ] [ sin φ   cos φ ] .

Thus at least close to the examined point the deformation consists of stretching (or compressing) the x′ axis
and stretching (or compressing) the y′ axis. No shear strain is observed in this new coordinate system. One
of the possible displacements is given by

    ( x′ )      ( x′ )   ( ε′x′x′ x′ )
    ( y′ ) ⟶  ( y′ ) + ( ε′y′y′ y′ ) .

The values of ε′x′x′ and ε′y′y′ can be found as eigenvalues of the original strain matrix, i.e. solutions of the
quadratic equation

    f(λ) = det [ εxx − λ   εxy     ] = 0 .
               [ εxy       εyy − λ ]

The eigenvectors indicate the directions of pure strain, i.e. in that coordinate system you find no shear strain.
The eigenvalues correspond to the principal strains.

As an example consider the strain matrix

    A = [ 0.04  0.01 ] .
        [ 0.01  0    ]

This corresponds to a solid stretched by 4% in the x direction, while the angle between the x and y axis is
diminished by 0.02. To diagonalize this matrix we determine the zeros of

    det(A − λ I) = det [ 0.04 − λ   0.01  ] = λ² − 0.04 λ − 0.01² = (λ − 0.02)² − 0.02² − 0.01² = 0
                       [ 0.01       0 − λ ]

¹⁰ The eigenvector ~e₁ to the first eigenvalue λ₁ can be normalized and thus written in the form

    ~e₁ = ( cos φ )   ⟹   A · ( cos φ ) = λ₁ ( cos φ ) .
          ( sin φ )          ( sin φ )       ( sin φ )
with the solutions λ₁ = 0.02 + √0.0005 = 0.01 (2 + √5) ≈ 0.04236 and λ₂ = 0.02 − √0.0005 =
0.01 (2 − √5) ≈ −0.00236. The first eigenvector is then found as solution of the linear system

    [ 0.04 − λ₁   0.01    ] ( x ) ≈ [ −0.00236   0.01     ] ( x ) = ( 0 )
    [ 0.01        0 − λ₁  ] ( y )   [ 0.01       −0.04236 ] ( y )   ( 0 ) .

The second of the above equations is a multiple of the first and thus use only the first equation

    −0.00236 x + 0.01 y = 0 .

Since only the direction matters generate an easy solution by

    ~e₁ = ( 1     )   with λ₁ = 0.04236 .
          ( 0.236 )

The second eigenvector ~e₂ is orthogonal to the first and thus

    ~e₂ = ( 0.236 )   with λ₂ = −0.00236 .
          ( −1    )

As a consequence the above strain corresponds to a pure stretching by 4.2% in the direction of ~e₁ and a
compression of 0.2% in the orthogonal direction. ♦
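The same principal strains and directions are obtained directly with the built-in eigenvalue routine of Octave (a quick check, not part of the original sample codes):

    A = [0.04 0.01; 0.01 0];
    [V,D] = eig(A) % columns of V: principal directions, diag(D): principal strains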
The above results about the transformation of strains in a rotated coordinate system also apply in three
dimensions¹¹. Thus for a given strain there is a rotation of the coordinate system, given by the orthonormal
matrix R, such that

    [ ε′x′x′  0       0      ]        [ εxx  εxy  εxz ]
    [ 0       ε′y′y′  0      ] = Rᵀ · [ εyx  εyy  εyz ] · R .
    [ 0       0       ε′z′z′ ]        [ εzx  εzy  εzz ]

¹¹ In part of the literature (e.g. [Prze68]) the shear strains are defined without the division by 2. All results can be adapted
accordingly.
SHA 21-5-21
CHAPTER 5. CALCULUS OF VARIATIONS, ELASTICITY AND TENSORS 336
S0 = RT SR
det(S0 − λ I) = det(RT SR − λ RT R) = det(RT (S − λ I) R)
= det(RT ) det(S − λ I) det(R) = det(S − λ I) .
The two characteristic polynomials are identical and the coefficients for λ3 , λ2 , λ1 and λ0 = 1 have to
coincide. This leads to three invariants. Compute the determinant to find
εxx − λ εxy εxz
det(S − λ I) = det εyx εyy − λ εyz
εzx εzy εzz − λ
= −λ3 + λ2 (εxx + εyy + εzz ) −
−λ (εyy εzz − ε2yz + εxx εzz − ε2xz + εxx εyy − ε2xy ) + det(S) .
• These invariant expressions can be expressed in terms of the eigenvalues of the strain matrix S: I1 =
λ1 + λ2 + λ3 , I2 = λ1 λ2 + λ2 λ3 + λ3 λ1 and I3 = λ1 · λ2 · λ3 . To verify this fact diagonalize the
matrix, i.e. compute with λ1 = εxx , λ2 = εyy , λ3 = εzz and vanishing shearing strains.
• Since det(S0 − λ I) = det(S − λ I) the three eigenvalues λi are invariant. But there is no easy, explicit
expression for the eigenvalues in terms of coefficients of the tensor, since a cubic equation has to be
solved to determine the values of λi .
We will see (page 352) that the elastic energy density can be expressed in terms of the invariants Ii .
SHA 21-5-21
CHAPTER 5. CALCULUS OF VARIATIONS, ELASTICITY AND TENSORS 337
σy3 σy
6 3
- τxy
6
- τxy
3 1
τyx τxy
σx2 2 1
6 - σ1
x σx 6 - σ
x
2 ?
τyx τxy ?
4
4
τxy
τxy
?4 ?
σy σy
Figure 5.8: Definition of stress in a plane, initial (left) and simplified (right) situation
The stress situation of a solid is described by all components of the stress, typically as functions of the
location.
for all positive values of ∆x and ∆y. Change the values of ∆x and ∆y independently to arrive at the desired conclusion.
SHA 21-5-21
CHAPTER 5. CALCULUS OF VARIATIONS, ELASTICITY AND TENSORS 338
The stress vector ~s may be decomposed in a normal component σ and a tangential component τ . Determine
the component of ~σ in the direction of ~n by
" # !
T σx τxy cos φ
σ = h~n , ~si = ~n · ~s = h~n , ~si = (cos φ, sin φ) ·
τxy σy sin φ
" # !
σx τxy cos φ
τ = (− sin φ, cos φ) · .
τxy σy sin φ
The value of σ is positive if ~σ is pointing out of the solid and τ is positive if ~τ is pointing upward in
Figure 5.9.
This allows to consider a new coordinate system, generated by rotation of the xy system by an angle φ
(see Figure 5.7, page 332). The result is
" # !
σx τxy cos φ
σx0 = (cos φ, sin φ) ·
τxy σy sin φ
" # !
σx τxy − sin φ
σy0 = (− sin φ, cos φ) ·
τxy σy cos φ
" # !
σx τxy cos φ
τx0 y0 = (− sin φ, cos φ) · .
τxy σy sin φ
This transformation formula should be compared with result 5–19 on page 332. It shows that the behavior
under coordinate rotations for the stress matrix and the strain matrix is identical.
SHA 21-5-21
CHAPTER 5. CALCULUS OF VARIATIONS, ELASTICITY AND TENSORS 339
z
6
σz
6 τ
- yz
τxz
τzy
6 σ
- y
τzx -y
τxy
6 τ
- yx
σx
symbol description
σx normal stress at a surface orthogonal to x = const
σy normal stress at a surface orthogonal to y = const
σz normal stress at a surface orthogonal to z = const
tangential stress in y direction at surface orthogonal to x = const
τxy = τyx
tangential stress in x direction at surface orthogonal to y = const
tangential stress in z direction at surface orthogonal to x = const
τxz = τzx
tangential stress in x direction at surface orthogonal to z = const
tangential stress in z direction at surface orthogonal to y = const
τyz = τzy
tangential stress in y direction at surface orthogonal to z = const
SHA 21-5-21
CHAPTER 5. CALCULUS OF VARIATIONS, ELASTICITY AND TENSORS 340
~s = S · ~n .
The behavior of S under rotation of the coordinate system ~x = RT · ~x0 or ~x0 = R · ~x is given by
σx0 τxy0 0
τxz σx τxy τxz
S0 = 0 = RT · τ
τ 0 σ 0 τ σ τ · R.
xy y yz xy y yz
0
τxz 0
τyz σz0 τxz τyz σz
for the three eigenvalues λ1,2,3 and the corresponding orthonormal eigenvectors ~e1 , ~e2 and ~e3 , we find a
coordinate system in which all tangential stress components vanish. Since there are only normal stresses the
stress matrix S0 has the form
σx0 0 0
0 σ0 0 .
y
0 0 σz0
The numbers on the diagonal are the principal stresses. This is very useful to extract results out of stress
computations. When asked to find the stress at a given point in a solid many different forms of answers are
possible:
• Find the three principal stresses and use those as a result. One might also give the corresponding
directions.
SHA 21-5-21
CHAPTER 5. CALCULUS OF VARIATIONS, ELASTICITY AND TENSORS 341
• There is an adaption of Mohr’s circle to a 3D stress setup, shown in Figure 5.11(b). The Wikipedia
page en.wikipedia.org/wiki/Mohr’s circle shows more information on Mohr’s circle. In [ChouPaga67,
§1.5, §1.8] find a derivation of Mohr’s circle in 2D and 3D.
• Since the transformation law for strains is identical one can generate a similar circle with the normal
strains on the horizontal axis and the shear strains on the vertical axis.
3
5.3.3 Invariant Stress Expressions, Von Mises Stress and Tresca Stress
The von Mises stress σM (also called octahedral shear stress) is a scalar expression and often used to
examine failure modes of solids, see also the following section.
2
σM = σx2 + σy2 + σz2 − σx σy − σy σz − σz σx + 3 τxy
2 2
+ 3 τyz 2
+ 3 τzx
1
(σx − σy )2 + (σy − σz )2 + (σz − σx )2 + 3 τxy 2 2 2
= + τyz + τzx .
2
SHA 21-5-21
CHAPTER 5. CALCULUS OF VARIATIONS, ELASTICITY AND TENSORS 342
( y
,+ xy
)
1 2 2
( ,- )
x xy
It is important that the above expression for the von Mises stress does not depend on the orientation of
the coordinate system. On page 336 find the invariants for the strain matrix. Using identical arguments
determine three invariants for the stress matrix.
I1 = σx + σy + σz
2 2 2
I2 = σy σz + σx σz + σx σy − τyz − τxz − τxy
σx τxy τxz
2 2 2
I3 = det τxy σy τyz
= +σx σy σz + 2 τxy τxz τyz − σx τyz − σy τxz − σz τxy
τxz τyz σz
Obviously any function of these invariants is an invariant too and consequently independent of the orienta-
tion of the coordinate system. Many physically important expressions have to be invariant, e.g. the energy
density.
and consequently the von Mises stress is invariant under rotations. If reduced to principal stresses (no shear
stress) find
2
2 σM = (σ1 − σ2 )2 + (σ2 − σ3 )2 + (σ3 − σ1 )2 .
Thus the von Mises stress is a measure for the differences among the three principal stresses. In the simplest
possible case of stress in one direction only, i.e. σ2 = σ3 = 0, find
2 1
(σ1 − 0)2 + (0 − 0)2 + (0 − σ1 )2 = σ12 .
σM =
2
and thus a measure of the differences amongst the principal stresses, similar to the von Mises stress.
SHA 21-5-21
CHAPTER 5. CALCULUS OF VARIATIONS, ELASTICITY AND TENSORS 343
5–23 Corollary : The von Mises stress is smaller than the Tresca stress, i.e.
0 ≤ σM ≤ σT .
3
Proof : Without loss of generality examine the principal stress situation and assume σ1 ≤ σ2 ≤ σ3 .
2
2 σM = (σ1 − σ2 )2 + (σ2 − σ3 )2 + (σ3 − σ1 )2
2 σT2 = 2 (σ3 − σ1 )2
2
2 (σM − σT2 ) = (σ1 − σ2 )2 + (σ2 − σ3 )2 − (σ3 − σ1 )2 = 2 σ22 − 2 σ1 σ2 − 2 σ3 σ2 + 2 σ1 σ3
= 2 (σ2 − σ3 ) (σ2 − σ1 ) ≤ 0
2 ≤ σ2 .
This implies σM T 2
To decide whether a material will fail you will need the maximal principal stress, the von Mises and
Tresca stress. You need the definition of the different stresses and strains, Hooke’s law (Section 5.6) and
possibly the plane stress and plane strain description (Sections 5.8 and 5.9). For a given situation there are
multiple paths to compute these:
• If the 3 × 3 stress matrix is known, then first determine the eigenvalues, i.e. the principal stresses.
Once you have these it is easy to read out the desired values.
• If the 3 × 3 strain matrix is known, the you have two options to determine the principal stresses.
1. First use Hooke’s law to determine the 3 × 3 stress matrix, then proceed as above.
2. Determine the eigenvalues of the strain matrix to determine the principal strains. Then use
Hooke’s law to determine the principal stresses.
• If the situation is a plane stress situation and you know the 2 × 2 stress matrix, then you may first
generate the full 3 × 3 stress matrix, and then proceed as above.
• If the situation is a plane strain situation and you know the 2 × 2 strain matrix, then you may first
generate the full 3 × 3 strain matrix, and then proceed as above.
SHA 21-5-21
CHAPTER 5. CALCULUS OF VARIATIONS, ELASTICITY AND TENSORS 344
εxx εxy εxz σx τxy τxz " #
Hooke’s law plane stress σx τxy
εxy εyy εyz - τ σy τyz
xy τxy σy
εxz εyz εzz τxz τyz σz
eigenvalues eigenvalues
Hooke’s law
? ?
ε1 , ε2 , ε3 - σ1 , σ2 , σ3 principal stresses
principal strains
compute
?
σM axP rinc = max{|σ1 | , |σ2 | , |σ3 |}
σT resca = max{|σ1 − σ2 | , |σ2 − σ3 | , |σ3 − σ1 |}
σvonM ises = √12 (σ1 − σ2 )2 + (σ2 − σ3 )2 + (σ3 − σ1 )2
p
Figure 5.12: How to determine the maximal principal stress, von Mises stress and Tresca stress
σT ≤ σY .
Most cracks in metals are caused by severe shearing forces, i.e. they are ductile. If a metal is submitted to a
traction test (pull until it breaks) the direction of the initial break and the force show an angle of 45◦ . This
is very different for brittle materials15 .
15
Twisting a piece of chalk until it breaks usually leads to an angle of 45◦ for the initial crack. Is is an exercise to verify that for
twisting the maximal normal stress occurs at this angle. Since chalk is a brittle material it breaks because of the maximal normal
stress is too large.
SHA 21-5-21
CHAPTER 5. CALCULUS OF VARIATIONS, ELASTICITY AND TENSORS 345
SHA 21-5-21
CHAPTER 5. CALCULUS OF VARIATIONS, ELASTICITY AND TENSORS 346
Examples of scalars include: temperature, hydrostatic pressure, density, concentration. Observe that not
all expressions leading to a number are invariant under transformations. As an example consider the partial
∂f
derivative of a scalar with respect to the first coordinate, i.e. ∂x1
. The transformation rule for this expression
will be examined below.
5–24 Example : Well known and often used examples of vectors are position vectors, velocity and force
vectors. ♦
The gradient is often used as a row vector and thus transpose the above identity to conclude
" #
∂ 0 ∂ 0 ∂ ∂ cos α − sin α ∂ ∂
( 0f , 0
f )=( f, f) · =( f, f) · R .
∂x ∂y ∂x ∂y sin α cos α ∂x ∂y
Observe that not all pairs of scalar expressions transform according to a first order tensor. As examples
consider stress and strain. The transformation rules for
! !
εxx σx
and
εyy σy
SHA 21-5-21
CHAPTER 5. CALCULUS OF VARIATIONS, ELASTICITY AND TENSORS 347
When a new coordinate system is introduced the required transformation rule, for A to be a tensor, is
" # " #" #" #
a01,1 a01,2 cos α sin α a1,1 a1,2 cos α − sin α
= · ·
a02,1 a02,2 − sin α cos α a2,1 a2,2 sin α cos α .
A0 = RT · A · R
To decide whether a 2 × 2 matrix is a tensor this transformation rule has to be verified. When all details are
carried out this leads to the following formula, where C = cos α and S = sin α .
" # " # " # " #
a01,1 a01,2 C S a1,1 a1,2 C −S
= · ·
a02,1 a02,2 −S C a2,1 a2,2 S C
" # " #
C S a1,1 C + a1,2 S −a1,1 S + a1,2 C
= ·
−S C a2,1 C + a2,2 S −a2,1 S + a2,2 C
" #
a1,1 C 2 + (a1,2 + a2,1 )CS + a2,2 S 2 (−a1,1 + a2,2 )SC + a1,2 C 2 − a2,1 S 2
=
(−a1,1 + a2,2 )SC − a1,2 S 2 + a2,1 C 2 a1,1 S 2 − (a1,2 + a2,1 )CS + a2,2 C 2
Thus stress and strain, as examined in the previous sections, are second order tensors. ♦
SHA 21-5-21
CHAPTER 5. CALCULUS OF VARIATIONS, ELASTICITY AND TENSORS 348
y
y 0
! " # ! B
x 2 0.5 x
B
−→ · B A~e2
B ~ e2 6
y 1 1 y *
B A~e1
x0
B
B
B
B
- x
~e1
Figure 5.13: Action of a linear mapping from R2 onto R2
DU is the displacement gradient tensor. Now examine this matrix if generated using a rotated coordi-
nate system. Since the above is a linear mapping we can use the previous results and conclude that the
transformation rule
DU0 = RT DU R (5.11)
is correct, i.e. the displacement gradient is a tensor of order 2 . ♦
and thus the symmetric Hesse matrix of second order partial derivatives satisfies
" 2 0
∂2 f 0
# " 2
∂2 f
# " # " #
∂ f ∂ f
∂x02 0
∂x ∂y 0 + cos α + sin α ∂x2 ∂x ∂y + cos α − sin α
∂2 f 0 ∂2 f 0
= · ∂2 f ∂2 f
· .
∂x0 ∂y 0 ∂y 02 − sin α + cos α ∂x ∂y ∂y 2 + sin α + cos α
This implies that the matrix of second order derivatives satisfies the transformation rule of a second order
tensor. ♦
SHA 21-5-21
CHAPTER 5. CALCULUS OF VARIATIONS, ELASTICITY AND TENSORS 349
εxx 1 −ν −ν σx
= 1
· σy ,
εyy −ν 1
−ν
E
ε −ν −ν 1 σz
zz (5.12)
εxy τxy
εxz
= 1+ν
τxz
E
εyz τyz
σx 1−ν ν ν εxx
σy = E
· εyy ,
ν 1−ν ν
(1 + ν) (1 − 2 ν)
σz ν ν 1−ν εzz
(5.14)
τxy εxy
τxz = E
εxz .
(1 + ν)
τyz εyz
16
One can verify that for homogeneous, isotropic materials a linear law must have this form, e.g. [Sege77]
17
The missing factors 2 are due to the different definition of the shear strains.
SHA 21-5-21
CHAPTER 5. CALCULUS OF VARIATIONS, ELASTICITY AND TENSORS 350
E
σx + σy + σz = (εxx + εyy + εzz )
1 − 2ν
to conclude that ν = 12 leads to εxx + εyy + εzz = 0. This is a constraint to be satisfied for incompressible
materials. Equation (5.12) can also be written in the form
εxx σx 1
1+ν ν
εyy = σ − (σx + σy + σz ) 1
E y E
εzz σz 1
and the determinant of this matrix equals zero. Thus one can not determine the inverse matrix.
To start out we want to find the formula for the energy density of a deformed body. For this we consider
a small block with width ∆x, height ∆y and depth ∆z, located at the fixed origin. For a fixed displacement
vector ~u of the corner P = (∆x, ∆y, ∆z) we deform the block by a sequence of affine deformations, such
that point P moves along straight lines. The displacement vector of P is given by the formula t ~u where the
parameter t varies from 0 to 1. If the final strain is denoted by ~ε then the strains during the deformation are
given by t ~ε. Accordingly the stresses are given by t ~σ where the final stress ~σ can be computed by Hooke’s
law (e.g. equation (5.14)). Now we compute the total work needed to deform this block, using the basic
formula work = force · distance. There are six contributions:
• The face x = ∆x moves from ∆x to ∆x (1 + εxx ). For a parameter step dt at 0 < t < 1 the traveled
distance is thus εxx ∆x dt. The force is determined by the area ∆y · ∆z and the normal stress t σx ,
where σx is the stress in the final position t = 1. The first energy contribution can now be integrated
by
Z 1 Z 1
1
(∆y · ∆z · σx ) · t εxx ∆x dt = ∆y · ∆z · ∆x · σx · εxx t dt = ∆V · σx · εxx .
0 0 2
SHA 21-5-21
CHAPTER 5. CALCULUS OF VARIATIONS, ELASTICITY AND TENSORS 351
z
6
Figure 5.14: Block to be deformed, used to determine the elastic energy density W
∂ u2
• To examine the situation of pure shearing strain observe that the face at x = ∆x is moved by ∂x ∆x
in y–direction. The shear stress in that direction is τxy and thus the resulting energy
1 ∂ u2
∆x · τxy ∆y ∆z
2 ∂x
∂ u1
The face at y = ∆y is moved by ∂y ∆y in x–direction. The shear stress in that direction is τxy ,
leading to an energy contribution
1 ∂ u1
∆y · τxy ∆x ∆z
2 ∂y
Adding these two leads to a contribution to the energy due to shearing in the xy plane.
1 ∂ u2 ∂ u 1
( + ) τxy ∆x ∆y ∆z = εxy τxy ∆V
2 ∂x ∂y
The similar contributions due to shearing in the xz and yz planes lead to energies εxz τxz ∆V and
εyz τyz ∆V .
Adding all six contributions and then dividing by the volume ∆V we obtain the energy density
energy 1
W = = (σx εxx + σy εyy + σz εzz + 2 τxy εxy + 2 τxz εxz + 2 τyz εyz ) .
∆V 2
This can be written as scalar product in the form
σx εxx τxy εxy
1
W = h
σy
,
εyy i + h τxz , εxz i
(5.15)
2
σz εzz τyz εyz
SHA 21-5-21
CHAPTER 5. CALCULUS OF VARIATIONS, ELASTICITY AND TENSORS 352
1
The above formula for the energy density W can be written in the form W = 2 h~ε , A ~εi with a suitably
chosen symmetric matrix A.
1−ν ν ν
ν 1−ν ν
0
E ν ν 1−ν
A=
(1 + ν) (1 − 2 ν)
2 (1 − 2 ν)
0 2 (1 − 2 ν)
2 (1 − 2 ν)
The gradient of this expression is given by ∇hA ~ε , ~εi = 2 A ~ε and thus conclude
∂W ∂W ∂W
σx = , σy = , σz = ,
∂εxx ∂εyy ∂εzz
and
1 ∂W 1 ∂W 1 ∂W E
τxy = , τxz = , τyz = = εyz .
2 ∂εxy 2 ∂εxz 2 ∂εyz 1+ν
Thus obtain normal and shear stresses as partial derivatives of the energy density W with respect to the
normal and shear strains.
use that the matrix S2 is a tensor too, and this leads to more invariant expressions.
εxx εxy εxz
S = εxy εyy εyz
εxz εyz εzz
I1 = trace(S) = εxx + εyy + εzz
I12 = trace(S)2 = ε2xx + ε2yy + ε2zz + 2 εxx εyy + 2 εxx εzz + 2 εyy εzz
ε2xx + ε2xy + ε2xz · ·
S2 =
· ε 2 + ε 2 + ε2 ·
xy yy yz
· · ε2xz + ε2yz + ε2zz
1
I2 = εyy εzz − ε2yz + εxx εzz − ε2xz + εxx εyy − ε2xy = trace(S)2 − trace(S2 )
2
I3 = det(S)
SHA 21-5-21
CHAPTER 5. CALCULUS OF VARIATIONS, ELASTICITY AND TENSORS 353
I2 = ε11 ε22 + ε11 ε33 + ε22 ε33 and I4 = ε211 + ε222 + ε233
find
E
W = ((1 − ν) I4 + 2 ν I2 )
2 (1 + ν) (1 − 2 ν)
E
(1 − ν) I12 − 2 (1 − 2 ν) I2
=
2 (1 + ν) (1 − 2 ν)
E
(1 − ν) (ε211 + ε222 + ε233 ) + 2 ν (ε11 ε22 + ε11 ε33 + ε22 ε33 ) .
=
2 (1 + ν) (1 − 2 ν)
Observe that different expressions are available for the energy density W as function of the invariants.
This invariant corresponds to the von Mises stress σM on page 341, but formulated with strains instead of
stresses. Expressed in principal strains find
For the energy density this leads to (using elementary, tedious algebra)
1+ν 2
I6 = I1 + (1 − 2 ν) I5
3
1+ν 1 − 2ν
(εxx + εyy + εzz )2 + (εxx − εyy )2 + (εxx − εzz )2 + (εyy − εzz )2
=
3 3
1 − 2ν
6 ε2xy + ε2xz + ε2yz
+
3
1+ν 1 − 2ν
(εxx + εyy + εzz )2 + (εxx − εyy )2 + (εxx − εzz )2 + (εyy − εzz )2
=
3 3
+2 (1 − 2 ν) (ε2xy + ε2xz + ε2yz )
1 + ν + 2 (1 − 2 ν) 2 2 (1 + ν) − 2 (1 − 2 ν)
= εxx + ε2yy + ε2zz + (εxx εyy + εxx εzz + εyy εzz )
3 3
+2 (1 − 2 ν) (ε2xy + ε2xz + ε2yz )
= (1 − ν) ε2xx + ε2yy + ε2zz + 2 ν (εxx εyy + εxx εzz + εyy εzz ) + 2 (1 − 2 ν) (ε2xy + ε2xz + ε2yz )
2 (1 + ν) (1 − 2 ν)
= W.
E
SHA 21-5-21
CHAPTER 5. CALCULUS OF VARIATIONS, ELASTICITY AND TENSORS 354
The above is a starting point for nonlinear material laws, e.g. hyperelastic materials as described in [Bowe10,
§3.3.3]. This approach allows to distinguish shear stress and hydrostatic stress. In principle the energy
density W can be any function of invariants, e.g. W = f (I1 , I5 ). The validity of the model has to be
justified by experiments or by other arguments. As examples find in the FEM software COMSOL the
nonlinear material models Neo–Hookean, Mooney–Rivlin and Murnaghan, amongst others.
SHA 21-5-21
CHAPTER 5. CALCULUS OF VARIATIONS, ELASTICITY AND TENSORS 355
Volume forces
A force applied to the volume of the solid can be introduced by means of a volume force density f~
N ). By adding the potential energy
(units: m3
ZZZ
UV ol = − f~ · ~u dV
Ω
to the elastic energy and then minimizing we are lead to the correct force term.
As an example consider the weight (given by the density ρ) of the solid and the resulting gravitational
force. This leads to a force density of
0
f~ =
0
−ρ g
and thus find the corresponding potential energy as
ZZZ
UV ol = + ρ g u3 dV .
Ω
Surface forces
N ),
By adding a surface potential energy, using the surface force density ~g (units: m2
ZZ
USurf = − ~g · ~u dA
∂Ω
SHA 21-5-21
CHAPTER 5. CALCULUS OF VARIATIONS, ELASTICITY AND TENSORS 356
1−ν ν ν εxx εxx
1 E
ZZZ
= h ν 1−ν ν · εyy , εyy i dV +
2 (1 + ν) (1 − 2 ν)
Ω ν ν 1−ν εzz εzz
εxy εxy
E
ZZZ ZZZ ZZ
~ i~
+ h ε , ε i dV − f · ~
u dV − g · ~u dA .
1 + ν xz xz
Ω εyz εyz Ω ∂Ω
As in many other situations we use again a principle of least energy to find the equilibrium state of a
deformed solid. This result is known under the name Bernoulli18 principle.
If the above solid is in equilibrium, then the displacement function ~u is a minimizer of the above
energy functional, subject to the given boundary conditions.
This is the basis for a finite element (FEM) solution to elasticity problems.
• The solid of length L has constant cross section perpendicular to the x axis, with area A = ∆y · ∆z.
• The left face is fixed in the x direction, but free to move in the other directions.
F
• The constant normal stress σx at the right face is given by σx = A.
z
6y
x
-
SHA 21-5-21
CHAPTER 5. CALCULUS OF VARIATIONS, ELASTICITY AND TENSORS 357
• The first component of the above equations leads to the classical, basic form of Hooke’s law
∆L 1 F
εxx = = .
L E A
This is the standard definition of Young’s modulus of elasticity. The solid is stretched by a factor
1 + E1 FA .
Poisson’s ratio ν.
εyy = −ν εxx
Multiply the relative change of length in the x direction by ν to obtain the relative change of length in
the y and z directions. One expects ν ≥ 0.
• Assuming that we have the principal strain situation the energy density is given by
1−ν ν ν εxx εxx
1 E
W = h ν 1−ν ν εyy , εyy
i
2 (1 + ν) (1 − 2 ν)
ν ν 1−ν εzz εzz
For the case of unixaial loading in x–direction use εyy = εzz and find
1 E
(1 − ν) (ε2xx + 2 ε2yy ) + ν (4 εxx εyy + 2 ε2yy )
W =
2 (1 + ν) (1 − 2 ν)
∂W 1 E
= ((1 − ν) 2 εxx + ν 4 εyy ))
∂εxx 2 (1 + ν) (1 − 2 ν)
E
= ((1 − ν) εxx + ν 2 εyy ))
(1 + ν) (1 − 2 ν)
∂W 1 E
= ((1 − ν) 4 εyy + ν 4 (εxx + εyy ))
∂εyy 2 (1 + ν) (1 − 2 ν)
4 E
= (ν εxx + εyy )
2 (1 + ν) (1 − 2 ν)
• Based on the Bernoulli principle the energy density W will be minimized with respect to εyy . Thus
conclude εyy = −ν εyy , i.e. obtain the Poisson contraction as consequence of minimizing the energy.
SHA 21-5-21
CHAPTER 5. CALCULUS OF VARIATIONS, ELASTICITY AND TENSORS 358
Submerging a solid deep into water will lead to this hydrostatic pressure situation. Hooke’s law now implies
and
εxx 1 −ν −ν p 1
εyy = − 1 −ν 1 −ν · p = − p (1 − 2 ν) 1
E E
εzz −ν −ν 1 p 1
SHA 21-5-21
CHAPTER 5. CALCULUS OF VARIATIONS, ELASTICITY AND TENSORS 359
One can also start with the general formula for the energy density based on Hooke’s law.
E
W = (εxx + εyy + εzz )2 +
6 (1 − 2 ν)
E
+ ((εxx − εyy )2 + (εxx − εzz )2 + (εyy − εzz )2 + 6 (ε2xy + ε2yz + ε2xz )) .
6 (1 + ν)
Use Bernoulli’s principle to determine all strains. Minimizing this energy density leads to
∂W E ∂W E
0 = = 2 εyz and 0 = = 2 εxz
∂εyz 1+ν ∂εxz 1+ν
∂W E E
0 = = (εxx + εyy + εzz ) + ((εxx − εyy ) + (εxx − εzz ))
∂εxx 3 (1 − 2 ν) 3 (1 + ν)
∂W E E
0 = = (εxx + εyy + εzz ) + (−(εxx − εyy ) + (εyy − εzz ))
∂εyy 3 (1 − 2 ν) 3 (1 + ν)
∂W E E
0 = = (εxx + εyy + εzz ) + (−(εxx − εzz ) − (εyy − εzz ))
∂εzz 3 (1 − 2 ν) 3 (1 + ν)
Then conlude
• The sum of the first two equations leads to εxx + εyy = 2 εzz = −2 (εxx + εyy ). This implies
εxx + εyy = 0 and thus εzz = 0.
• The sum of the first and third equation leads to εxx + εzz = 2 εyy = −2 (εxx + εzz ). This implies
εxx + εzz = 0 and thus εyy = 0.
SHA 21-5-21
CHAPTER 5. CALCULUS OF VARIATIONS, ELASTICITY AND TENSORS 360
i.e. vanishing shear strains. Based on Hooke’s law (5.14) conclude that the shear stresses vanish too, i.e.
σx τxy τxz σx 0 0
τxy σy τyz = 0 σy 0
τxz τyz σz 0 0 σz
SHA 21-5-21
CHAPTER 5. CALCULUS OF VARIATIONS, ELASTICITY AND TENSORS 361
or
∆V ≈ (εxx + εyy + εzz ) V .
Using
εxx 1 −ν −ν σx
εyy = 1 −ν 1 −ν · σy
E
εzz −ν −ν 1 σz
find
1 − 2ν
εxx + εyy + εzz = (σx + σy + σz )
E
and
∆V 1 − 2ν
≈ (σx + σy + σz ) .
V E
This implies that the sum of the three principal stresses determines the volume change. Based on this one
can consider the volume changing hydrostatic stress (pressure)
1
σh = (σx + σy + σz )
3
and the shape changing stresses
• Use Hooke’s law to find the elastic energy density and by an integration find the total elastic energy.
• Use Euler–Lagrange equations to determine the boundary value problems for the radial and angular
displacement functions and solve these.
• Use these solutions to determine the actual energy stored in the twisted tube and determine the re-
quired torque.
SHA 21-5-21
CHAPTER 5. CALCULUS OF VARIATIONS, ELASTICITY AND TENSORS 362
When twisting a circular tube on the top surface one can decompose the deformation in a radial compo-
nent ur and and angular component uϕ . The vertical component is assumed to vanish20 , i.e. u3 = 0.
u1 cos ϕ − sin ϕ
u2 = ur (r, z) sin ϕ + uϕ (r, z) + cos ϕ
u3 0 0
Based on the rotational symmetry21 one can examine the expression in the xz plane only, i.e. y = 0. Let
r = x and find the normal strains
∂ u1 ∂ ur ∂ u2 1 ∂ u3
ε11 = = , ε22 = = ur , ε33 = = 0
∂x ∂r ∂y r ∂z
and the shear strains
∂ u1 ∂ u2 1 ∂ uϕ ∂ u1 ∂ u 3 ∂ ur ∂ u2 ∂ u3 ∂ uϕ
2 εxy = + = − uϕ + , 2 εxz = + = , 2 εyz = + = .
∂y ∂x r ∂r ∂z ∂x ∂z ∂z ∂y ∂z
Observe that this is neither a plane strain nor a plane stress situation. The energy density W is given by
equation (5.16) (page 351)
1−ν ν ν εxx εxx
1 E
W = h ν 1−ν ν · εyy , i +
εyy
2 (1 + ν) (1 − 2 ν)
ν ν 1−ν εzz εzz
20
This is just for the sake of simplicity of the presentation. The vertical displacement u3 can be used as an unknown function
too, with zero displacement at the top and the bottom. The resulting three Euler–Lagrange equations will then lead to u3 = 0. This
author has the detailed computations.
21
Using the chain rule one can express the partial derivatives with respect to Cartesian coordinates x and y in terms of derivatives
with respect cylindrical coordinates r and ϕ.
∂f ∂F sin ϕ ∂F ∂f ∂F cos ϕ ∂F
f (x, y) = F (r, ϕ) =⇒ = cos ϕ − and = sin ϕ +
∂x ∂r r ∂ϕ ∂y ∂r r ∂ϕ
Along the x axis we have ϕ = 0 and thus cos ϕ = 1 and sin ϕ = 0.
SHA 21-5-21
CHAPTER 5. CALCULUS OF VARIATIONS, ELASTICITY AND TENSORS 363
εxy εxy
E
+ h εxz , i
εxz
(1 + ν)
εyz εyz
and leads to the energy density along the x axis with r = x.
E
(1 − ν) (ε2xx + ε2yy + ε2zz ) + 2 ν (εxx εyy + εxx εzz + εyy εzz ) +
W (r, z) =
2 (1 + ν) (1 − 2 ν)
E
+ (ε2 + ε2xz + ε2yz )
(1 + ν) xy
E (1 − ν) ∂ ur 2 1 2
= ( ) + 2 ur +
2 (1 + ν) (1 − 2 ν) ∂r r
E ∂ uϕ uϕ 2 ∂ ur 2 ∂ uϕ 2
+ ( − ) +( ) +( ) .
4 (1 + ν) ∂r r ∂z ∂z
With this the functional U for the total elastic energy is given by an integration over the tube R0 < r < R1
and 0 < z < H, using cylindrical coordinates.
Z H Z R1
U (~u) = W (r, z) 2 π r dr dz
0 R0
H Z R1
∂ ur 2 u2r
E (1 − ν) r
Z
= 2π ( ) + 2 +
0 R0 2 (1 + ν) (1 − 2 ν) ∂r r
Er ∂ uϕ uϕ 2 ∂ ur 2 ∂ uϕ 2
+ ( − ) +( ) +( ) dr dz
4 (1 + ν) ∂r r ∂z ∂z
Z H Z R1
∂ ur 2 u2r
πE 2 (1 − ν)
= r( ) + +
2 (1 + ν) 0 R0 (1 − 2 ν) ∂r r
r ∂ uϕ uϕ 2 ∂ ur 2 ∂ uϕ 2
+ ( − ) +( ) +( ) dr dz
2 ∂r r ∂z ∂z
Z H Z R1
πE ∂ ur ∂ ur ∂ uϕ ∂ uϕ
= F (r, z, ur , , , uϕ , , ) dr dz .
2 (1 + ν) 0 R0 ∂r ∂z ∂r ∂z
Thus the elastic energy is given as a quadratic expression in terms of the two displacement functions ur and
uϕ and their partial derivatives. The physical solution will minimize this energy based on the Bernoulli prin-
ciple. Use computations very similar to Section 5.2.4 (page 315) to generate the Euler–Lagrange equations
for the two unknown functions ur and uϕ .
• For the radial displacement function ur (r, z):
∂F 4 (1 − ν) ur
=
∂ur 1 − 2ν r
∂F 4 (1 − ν) ∂ ur
= r
∂ ∂∂r
ur 1 − 2ν ∂r
∂F ∂ ur
= r
∂ ∂∂z
ur ∂z
∂ ∂F ∂ ∂F ∂F
∂ u
+ =
∂r ∂ ∂rr ∂z ∂ ∂∂z
ur ∂ur
2
∂ 2 ur
4 (1 − ν) ∂ ur ∂ ur 4 (1 − ν) ur
r 2
+ +r =
1 − 2ν ∂r ∂r ∂z 2 1 − 2ν r
on the domain R0 < r < R1 and 0 < z < H. At the bottom z = 0 and the top z = H the boundary
conditions are
ur (r, 0) = 0 and ur (r, H) = 0 for R0 < r < R1
SHA 21-5-21
CHAPTER 5. CALCULUS OF VARIATIONS, ELASTICITY AND TENSORS 364
∂F
and on the sides r = Ri we use the natural boundary condition = 0, leading to
∂ ∂∂r
ur
∂ ur (R0 , z) ∂ ur (R1 , z)
= = 0 for 0 < z < H .
∂r ∂r
This boundary value problem is solved by ur (r, z) = 0, i.e. no radial displacement.
• For the angular displacement function uϕ (r, z):
∂F ∂ uϕ uϕ
= −( − )
∂uϕ ∂r r
∂F ∂ uϕ uϕ
∂u
= +r ( − )
∂ ∂rϕ ∂r r
∂F ∂ uϕ
∂u
= r
∂ ∂zϕ ∂z
∂ ∂F ∂ ∂F ∂F
∂ u
+ ∂ u
=
∂r ∂ ϕ ∂z ∂ ϕ ∂uϕ
∂r ∂z
∂ ∂ uϕ uϕ ∂ ∂ uϕ ∂ uϕ uϕ
r( − ) + r = −( − )
∂r ∂r r ∂z ∂z ∂r r
∂ 2 uϕ ∂ 2 uϕ ∂ uϕ uϕ
r 2
+ r 2
= − +
∂r ∂z ∂r r
on the domain R0 < r < R1 and 0 < z < H. At the bottom z = 0 and the top z = H the boundary
conditions are
uϕ (r, 0) = 0 and ur (r, H) = r α for R0 < r < R1
∂F ∂ uϕ uϕ
and on the sides r = Ri use the natural boundary condition ∂u
= r( − ) = 0, leading to
∂ ∂rϕ ∂r r
∂ uϕ (R0 , z) ∂ uϕ (R1 , z)
R0 = uϕ (R0 , z) and R1 = uϕ (R1 , z) for 0 < z < H
∂r ∂r
α
This boundary value problem is solved by uϕ (r, z) = r z H.
α
Using the above solutions ur (r, z) = 0 and uϕ (r, z) = r z H compute the energy density
E (1 − ν) ∂ ur 2 1 2 E ∂ uϕ uϕ 2 ∂ ur 2 ∂ uϕ 2
W (r, z) = ( ) + 2 ur + ( − ) +( ) +( )
2 (1 + ν) (1 − 2 ν) ∂r r 4 (1 + ν) ∂r r ∂z ∂z
E (1 − ν) E α E r2
= 0+ 02 + 02 + (r )2 = α2
2 (1 + ν) (1 − 2 ν) 4 (1 + ν) H 4 (1 + ν) H 2
and thus the elastic energy by an integration
Z H Z R1 H R1
2 π E α2
Z Z
U (~u) = W (r, z) 2 π r dr dz = r3 dr dz
0 R0 4 (1 + ν) H 2 0 R0
2 π E α2
= (R14 − R04 ) .
4 · 4 (1 + ν) H
The elastic energy is expressed as as function of the angle of rotation α. The torque T required to twist this
circular tube is given by
∂U πE α
T = = (R14 − R04 ) .
∂α 4 (1 + ν) H
This result can also be obtained by assuming that the cut at height z is rotated by an angle α Hz and then
determine the resulting shear stresses along those planes. The computations are rather easy. The above
approach verifies that this simplification is correct. ♦
SHA 21-5-21
CHAPTER 5. CALCULUS OF VARIATIONS, ELASTICITY AND TENSORS 365
5.7 More on Tensors and Energy Densities for Nonlinear Material Laws
5.7.1 A few More Tensors
In most parts of these lecture notes we only use infinitesimal strains. This restricts the applications to small
strain and displacement situations only. One important example that can not be examined using infinitesimal
strains only is the large bending of slender beams. There are nonlinear extensions allowing to describe more
general situations. In this section we provide a starting point for further investigations Find more information
in [Bowe10], [Redd13] and [Redd15].
Use the displacement gradient tensor DU (Example 5–28) and examine Figure 5.6 (page 327) to verify
that for a deformation ~u(x, y) the vector
! " # ! ! !
∂ u1 d u1
∆x ∆x ∆x ∆x −→
+ ∂∂x u2
∂y
∂ u2
= + DU = (I + DU) ∆x
∆y ∂x ∂y ∆y ∆y ∆y
connects the points A0 to D0 . I + DU is the deformation gradient tensor. For two vectors
! !
−→ ∆x1 −→ ∆x2
∆x1 = and ∆x2 =
∆y1 ∆y2
examine the scalar product of the image of the two vectors.
−→ −→ −→ −→
h(I + DU) ∆x1 , (I + DU) ∆x2 i = h ∆x1 , (I + DUT )(I + DU) ∆x2 i
−→ −→
= h ∆x1 , (I + DU + DUT + DUT DU) ∆x2 i
−→ −→
= h ∆x1 , C ∆x2 i
The matrix
C = I + DU + DUT + DUT DU
is the Cauchy–Green deformation tensor. In particular obtain in Figure 5.6 (page 327)
−→ −→ −→ −→
|A0 D0 |2 = h(I + DU) ∆x , (I + DU) ∆xi = h ∆x , C ∆xi .
For the 2D situation use
C = I + DU + DUT + DUT DU
" # " # " # " #
1 0 2 ∂u 1 ∂u1
+ ∂u2 ∂u1 ∂u2 ∂u1 ∂u1
= + ∂u ∂x∂u ∂y
∂u2
∂x
+ ∂u ∂x ∂x
∂u2
· ∂u ∂x ∂y
∂u2
0 1 ∂x
2
+ ∂y
1
2 ∂y ∂y
1
∂y ∂x
2
∂y
" # " # ∂u 2 ∂u 2 ∂u ∂u ∂u2 ∂u2
1 0 2 ∂u 1 ∂u1
+ ∂u2
∂x
1
+ ∂x
2 1
∂x ∂y
1
+ ∂x ∂y
= + ∂u ∂x∂u ∂y
∂u2
∂x
+ 2 2
0 1 ∂x
2
+ ∂y
1
2 ∂y
∂u 1 ∂u 1
+ ∂u 2 ∂u 2 ∂u 1
+ ∂u2
∂x ∂y ∂x ∂y ∂y ∂y
2 2
∂u1
+ ∂u ∂u1 ∂u1 ∂u2 ∂u2
" #
∂x ∂y + ∂x ∂y
2
εxx εxy ∂x ∂x
= I+2 + 2 2 .
εxy εyy ∂u1 ∂u1
+ ∂u2 ∂u2 ∂u1
+ ∂u2
∂x ∂y ∂x ∂y ∂y ∂y
Use the transformation rule (5.11) for the displacement gradient tensor DU to examine the Cauchy–Green
deformation tensor in a rotated coordinate system.
T T
C0 = I + DU0 + DU0 + DU0 DU0
= I + RT DU R + RT DUT R + RT DUT R RT DU R
= I + RT DU R + RT DUT R + RT DUT DU R
= RT I + DU + DUT + DUT DU R
= RT C R
SHA 21-5-21
CHAPTER 5. CALCULUS OF VARIATIONS, ELASTICITY AND TENSORS 366
When dropping the quadratic contributions obtain the previous (infinitesimal) strain tensor. In Table 5.6 find
the definitions of the tensors defined in these lecture notes.
2 ∂ u1 2 ∂ u2 2
l = (1 + ) +( ) (∆x)2
∂x ∂x
∂ u1 ∂ u1 2 ∂ u2 2
= 1+2 +( ) +( ) (∆x)2
∂x ∂x ∂x
r
∂ u1 ∂ u1 2 ∂ u2 2
l = 1+2 +( ) +( ) ∆x
∂x ∂x ∂x
∂ u1 1 ∂ u1 2 1 ∂ u2 2
≈ 1+ + ( ) + ( ) ∆x
∂x 2 ∂x 2 ∂x
∆l ∂ u 1 1 ∂ u1 2 1 ∂ u 2 2
≈ + ( ) + ( ) = Exx .
∆x ∂x 2 ∂x 2 ∂x
Thus Exx shows the relative change of length in x direction. Use Figure 5.6 (page 327) for a visual-
ization of the result. Observe that the displaced segment need not be vertical any more.
Exy : Use the two orthogonal vectors ~v1 = (1 , 0)T and ~v2 = (0 , 1)T , attach these at a point, deform the
solid and then determine the angle φ between the two deformed vectors. Assume that the entries in
DU are small. Using
! ! ! !
∂ u1 ∂ u1
1 ∂x 0 ∂y
(I + DU) ~v1 = + ∂ u2
, (I + DU) ~v2 = + ∂ u2
0 ∂x 1 ∂y
h~v1 , C ~v2 i Cxy
cos(φ) = =√ √ ≈ 2 Exy
k(I + DU)~v1 k k(I + DU)~v2 k 1 + small 1 + small
conclude
π π
− φ ≈ sin( − φ) = cos φ ≈ 2 Exy .
2 2
Thus 2 Exy indicates by how much the angle between the two coordinates axis is diminished by the
deformation. This interpretation is identical to the interpretation of the infinitesimal strain tensor on
page 329.
22
√ 1
For |z| 1 use the linear approximation 1+z ≈1+ 2
z.
SHA 21-5-21
CHAPTER 5. CALCULUS OF VARIATIONS, ELASTICITY AND TENSORS 367
SHA 21-5-21
CHAPTER 5. CALCULUS OF VARIATIONS, ELASTICITY AND TENSORS 368
Since the Green strain tensor E satisfies the usual transformation rule for a second order tensor it can be
diagonalized by rotating the coordinate system and we find in the rotated coordinate system
2 2
∂u1 ∂u2
" # " #
∂ u1 + 0
Exx 0 ∂x 0 1 ∂x ∂x
= + 2 2 .
0 Eyy 0 ∂ u2 2 0 ∂u1
+ ∂u2
∂y ∂y ∂y
This will be useful to determine energy formulas for selected deformation problems, or you may use the
invariant expressions of the Green strain tensor, comparable to the observations on page 352ff .
5–37 Example : More invariants of the Cauchy–Green deformation tensor, used to describe nonlinear
material laws
The Cauchy–Green deformation tensor C is often used to describe the elastic energy density W (C) for large
deformations. It is easiest to work with the tensor in principal form, i.e. the Cauchy-Green deformation
tensor in a principal system.
C = I + DU + DUT + DUT DU
1 + 2 ∂∂xu1
· ·
= ∂ u2 +
· 1 + 2 ∂y ·
· · 1 + 2 ∂∂z
u3
( ∂u1 )2 + ( ∂u 2 2 ∂u3 2
∂x ) + ( ∂x ) · ·
∂x
+ ∂u 2 ∂u 2 ∂u 2
· ( ∂y ) + ( ∂y ) + ( ∂y )
1 2 3
·
· · ( ∂u 1 2 ∂u2 2 ∂u3 2
∂z ) + ( ∂z ) + ( ∂z )
SHA 21-5-21
CHAPTER 5. CALCULUS OF VARIATIONS, ELASTICITY AND TENSORS 369
∂u1 2
(1 + ∂x ) + ( ∂u2 2 ∂u3 2
∂x ) + ( ∂x ) · ·
= · ( ∂u 1 2
∂y ) + (1 +
∂u2 2
∂y ) + ( ∂u3 2
∂y ) ·
· · ( ∂u1 2 ∂u2 2
∂z ) + ( ∂z ) + (1 +
∂u3 2
∂z )
λ21 0 0
=
0 λ22 0
0 0 2
λ3
The diagonal entries λ2i are squares of the the principal stretches λi .
r
∂ u1 2 ∂ u2 2 ∂ u3 2
λ1 = (1 + ) +( ) +( ) = factor by which x axis will be stretched
∂x ∂x ∂x
s
∂ u2 2 ∂ u1 2 ∂ u3 2
λ2 = (1 + ) +( ) +( ) = factor by which y axis will be stretched
∂y ∂y ∂y
r
∂ u3 2 ∂ u1 2 ∂ u2 2
λ3 = (1 + ) +( ) +( ) = factor by which z axis will be stretched
∂z ∂z ∂z
A geometrical reasoning for the above is shown in Figure 5.6 (on page 327).
The elastic energy density W is usually expressed in terms of invariants of the tensors. There are many
different expressions for the energy density, corresponding to different types of materials. A few of the
invariants of the Cauchy–Green deformation tensor are given by
Table 5.7 shows some material laws commonly used. The following examples analyze some of these
definitions. The energy density W for the nonlinear material is shown, and then a small strain analysis
is performed to examine the connection to the linear law of Hooke. For most cases uniaxial loading in
x–direction and hydrostatic loading is examined.
SHA 21-5-21
CHAPTER 5. CALCULUS OF VARIATIONS, ELASTICITY AND TENSORS 370
Table 5.7: Some nonlinear material laws, given by the energy density W . The connection of the small strain
approximation to Hooke’s linear law is shown.
Start by examining an incompressible material, i.e. J = λ1 λ2 λ3 = 1 and the energy density given by
where C10 is a material constant. This is called the Neo–Hookean energy density, for incompressible
materials. To examine uniaxial stretching use the symmetry λ2 = λ3 and the incompressibility λ1 ·λ2 ·λ3 = 1
to conclude
1 2
λ2 = λ3 = √ and W = C10 · (λ21 + − 3) .
λ1 λ1
∂ u2 ∂ u3
With the notations z2 = ∂x and z3 = ∂x minimize W (z2 , z3 ).
∂ 2 ∂ λ1 2 1 ∂ u2 ∂ u2
0 = W = C10 · (2 λ1 − 2 − 3) = C10 · (2 λ1 − 2 − 3) =⇒ =0
∂z2 λ1 ∂z2 λ1 λ1 ∂x ∂x
∂ 2 ∂ λ1 2 1 ∂ u3 ∂ u3
0 = W = C10 · (2 λ1 − − 3) = C10 · (2 λ1 − − 3) =⇒ =0
∂z3 λ21 ∂z3 λ21 λ1 ∂x ∂x
There is no shearing in this situation, as expected. Using
r
∂ u1 2 ∂ u1 ∂ u1
λ1 = (1 + ) = 1+ = 1+ = 1 + εxx
∂x ∂x ∂x
find the elastic energy density W = 12 E ε2xx (based on equation (5.15) on page 351, generated by small
deformations and Hooke’s law) and compare to the result based on the above formula.
2 2 2 2
W = C10 · (λ1 + − 3) = C10 · (1 + εxx ) + −3
λ1 1 + εxx
≈ C10 · 1 + 2 εxx + ε2xx + 2 (1 − εxx + ε2xx − ε3xx + ε4xx ) − 3 = C10 · (3 ε2xx − 2 ε3xx + 2 ε4xx )
SHA 21-5-21
CHAPTER 5. CALCULUS OF VARIATIONS, ELASTICITY AND TENSORS 371
1
For small εxx this should be similar to W = 2 E ε2xx , leading to
1
C10 = E.
6
1
Using this generate an approximating stress–strain curve based on W ≈ 6 E (3 ε2xx − 2 ε3xx ).
∂W
σx = ≈ E εxx − E ε2xx
∂εxx
εyy
Since the material is assumed to be incompressible, confirm the Poisson ratio of ν = εxx = − 12 by
1 1 1
λ2 = λ3 = √ = √ ≈ 1 − εxx
λ1 1 + ε xx 2
In Figure 5.17(a) find the stress–strain curve for the linear Hooke’s law and the Neo–Hookean material
under uniaxial load. This is only valid if |εxx | 1, but the stress–strain curve can be generated for larger
values of εxx too.
∂W ∂ 2 2 2
σx = = C10 (1 + εxx ) + − 3 = C10 · 2 (1 + εxx ) −
∂εxx ∂εxx 1 + εxx (1 + εxx )2
= C10 2 1 + εxx − 1 + 2 εxx − 3 ε2xx + 4 ε3xx − 5 ε4xx + O(ε5xx )
• for 0 < εxx the slope is smaller than E and for −1 < εxx < 0 the slope is larger than E.
1
0.8
0.4
true stress/E
0.6
stress/E
0.2 0.4
0 0.2
linear Hooke 0
-0.2 neo-Hookean linear Hooke
-0.2
neo-Hookean approx. neo-Hookean
-0.4 -0.4
-0.2 0 0.2 0.4 0.6 0.8 1 -0.2 0 0.2 0.4 0.6 0.8 1
strain strain
(a) engineering stress (b) true stress
Figure 5.17: Stress–strain curve for Hooke’s linear law and an incompressible neo–Hookean material under
uniaxial load
SHA 21-5-21
CHAPTER 5. CALCULUS OF VARIATIONS, ELASTICITY AND TENSORS 372
The above arguments use the engineering stress, i.e. the force is divided by the area of the undeformed
solid. Since λ2 = λ3 = √1λ = √1+ε 1
the cross sectional area is diminised by a factor of λ2 · λ3 = 1+ε1 xx ,
1 xx
leading to a true stress of
2
σx,true = σx (1 + εxx ) = C10 · 2 (1 + εxx ) − (1 + εxx )
(1 + εxx )2
2 2
= C10 · 2 (1 + εxx ) −
(1 + εxx )
The true stress based on Hooke’s law is given by E (εxx + ε2xx ), shown in Figure 5.17(b). ♦
5–39 Example : From large strain energy to Hooke for uniaxial loading for compressible material
In [Bowe10, §3.5.5] a Neo–Hookean energy density of the form
1 I1 1
W = C10 · (I¯1 − 3) + (J − 1)2 = C10 · ( 2/3 − 3) + (J − 1)2 (5.20)
D J D
is examined. The first expression should take shape changes into account, while the second term is related
to volume changes.
To examine the energy in equation (5.20) for an uniaxial stretching a good approximations of the invari-
ants as function of εxx = ∂∂x
u1
is required. Assume a small deformation and use Hooke’s law with a Poisson
1
ratio of 0 ≤ ν ≤ 2 . As consequence work with εyy = εzz = −ν εxx and compute
2 (1 − 2 ν) 5 − 8 ν + 14 ν 2 2
J −2/3 = 1 − εxx + εxx −
3 9
4 (−10 + 15 ν − 21 ν 2 + 35 ν 3 ) 3
− εxx + O(ε4xx )
81
I1 4 (1 + ν)2 2 4 (1 + ν)2 (7 − 11 ν) 3
I¯1 = 2/3 = 3+ εxx − εxx + O(ε4xx ) .
J 3 27
Observe that the invariant I1 contains a contribution proportional to εxx , while I¯1 does not. When minimiz-
ing the energy based on I1 this would lead to a nonzero stress σx 6= 0, even if no deformation is applied
with εxx = 0. Thus one should not use the energy C10 · (I3 − 3) for compressible materials. Now examine
I1 1
W = C10 · ( 3/2 − 3) + (J − 1)2
J D
4 (1 + ν)2 (1 − 2 ν)2
= C10 + ε2xx −
3 D
4 (1 + ν)2 (7 − 11 ν) 2 (1 − 2 ν) ν (2 − ν)
− C10 + ε3xx + O(ε4xx )
27 D
SHA 21-5-21
CHAPTER 5. CALCULUS OF VARIATIONS, ELASTICITY AND TENSORS 373
and use
E 1 1 E 1
C10 = = µ and = = K
4 (1 + ν) 2 D 6 (1 − 2 ν) 2
(with shear modulus µ and bulk modulus K, see Table 5.5 on page 360 in Example 5–33) and elementary
algebra to conclude23
4 (1 + ν)2 (1 − 2 ν)2
W = C10 + ε2xx + O(ε3xx )
3 D
4 (1 + ν)2
E E 2
= + (1 − 2 ν) ε2xx + O(ε3xx )
4 (1 + ν) 3 6 (1 − 2 ν)
E (1 + ν) E 1
= + (1 − 2 ν) ε2xx + O(ε3xx ) = E ε2xx + O(ε3xx ) .
3 6 2
For small, uniaxial deformations the energy densities generated by Hooke’s law and by (5.20) coincide.
To determine the stress–strain curve use
4 (1 + ν)2 (1 − 2 ν)2
∂W
σx = = 2 C10 + εxx +
∂εxx 3 D
4 (1 + ν)2 (7 − 11 ν) 2 (1 − 2 ν) ν (−2 + ν)
+3 −C10 + ε2xx + O(ε3xx )
27 D
1
and with the above values for C10 and D for ν = 2 this leads to
The stress–strain curve is again shown in Figure 5.17 and identical to the approximative curve for the
incompressible Neo–Hookean material. ♦
23
Another approach would be (using λ2 = λ3 = 1 + εyy .)
SHA 21-5-21
CHAPTER 5. CALCULUS OF VARIATIONS, ELASTICITY AND TENSORS 374
and this leads to the expressions used for the different models for the energy density.
I1 − 3 = 6 εxx + 3 ε2xx
I1 3 (1 + εxx )2
I¯1 − 3 = − 3 = −3=0
J 2/3 (1 + εxx )2
(J − 1)2 = (3 εxx + 3 ε2xx + ε3xx )2 = ε2xx (3 + 3 εxx + ε2xx )2
= ε2xx (9 + 18 εxx + 15 ε2xx + 6 ε3xx + ε4xx )
The result I¯1 − 3 = 0 shows that the invariant I¯1 does not take volume changes into account, while I1 − 3
and (J − 1)2 do. Now have a closer look at two models frequently used for the elastic energy density W .
• Neo–Hookean
where K is the bulk modulus, see page 360. This is consistent with the results generated using uniaxial
loading in Example 5–39. ♦
SHA 21-5-21
CHAPTER 5. CALCULUS OF VARIATIONS, ELASTICITY AND TENSORS 375
0 ∂u ∂y
1
0
DU = 0 0 0
0 0 0
Example 5–32 (page 359) shows that this deformation minimizes the Hookean elastic energy density W for
a given εxy , i.e. this deformation is a consequence of Bernoulli’s principle.
The principal stretches λi are the square roots of the eigenvalues of the tensor C and thus with ∂u
∂y =
1
This shows that the shearing deformation does not compress the material, i.e. the volume is preserved. The
Neo–Hookean energy density is
I1 1
W = C10 ( 2/3
− 3) + (J − 1)2 = C10 4 ε2xy .
J D
This leads to the shearing stress
1 ∂W
τxy = = 4 C10 εxy .
2 ∂εxy
E
Comparing with the similar result based on Hooke’s law in Example 5–32 (τxy = 1+ν εxy ) leads to 4 C10 =
E 1 E 1
1+ν . For incompressible materials with ν = 2 find C10 = 4 (1+ 1 ) = 6 E. This is consistent with the
2
previous examples. These results coincide with the results for an incompressible Neo–Hookean energy
density W = C10 (I1 − 3), since J = det(C) = 1.
In Example 5–39 a uniaxial loading leads to the stress–strain curve
4 (1 + ν)2 (1 − 2 ν)2
∂W
σx = ≈ 2 C10 + εxx
∂εxx 3 D
24
The invariants could be computed using the eigenvalues µ1,2 , but this is not as elegant. Use α := ( ∂u
∂y
1 2
) = 4 ε2xy .
∂u1 2 ∂u1 2 ∂u1 2 ∂u1 2
0 = (1 − µ) (1 + ( ) − µ) − ( ) = µ2 − µ (2 + ( ) )+1−( ) = µ2 − µ (2 + α) + 1 − α
∂y ∂y ∂y ∂y
1 p 1 p
µ1,2 = 2 + α ± (2 + α)2 − 4 (1 − α) = 2 + α ± 4 α + α2 + 4 α
2 2
α 1p
q q
= 1+ ± 8 α + α = 1 + 2 εxy ± 16 ε2xy + 16 ε4xy = 1 + 2 ε2xy ± 2 ε2xy + ε4xy
2 2
2 2
√
The principal stretches are λ1,2 = µ1,2 and λ3 = 1, leading to the invariants
q q
I1 = λ21 + λ22 + λ23 = 1 + 2 ε2xy + 2 ε2xy + ε4xy + 1 + 2 ε2xy − 2 ε2xy + ε4xy + 1 = 3 + 4 ε2xy
J = λ1 · λ2 · λ3 = (1 + 2 ε2xy )2 − 4 (ε2xy + ε4xy ) = 1 .
SHA 21-5-21
CHAPTER 5. CALCULUS OF VARIATIONS, ELASTICITY AND TENSORS 376
5–42 Example : Neo–Hookean energy density for compressible materials, uniaxial loading
In [BoneWood08, §6.4] (or [Shab08, p. 146]) find an energy density for Neo–Hookean, compressible mate-
rials, given by
µ λ
W = (I1 − 3) − µ ln J + (ln J)2 (5.21)
2 2
using two material constants (see Table 5.5 on page 360)
E νE
shear modulus µ = and Lamé parameter λ = .
2 (1 + ν) (1 + ν) (1 − 2 ν)
Using the FEM software COMSOL these two constants have to be given when using the Neo–Hookean
material law. Now verify that this is consistent with Hooke’s law for small deformations. For small, uniaxial
deformations compute with λ1 = 1 + εxx and λ2 = λ3 = 1 − ν εxx , leading to
I1 − 3 = λ21 + λ22 + λ23 − 3 = 2 (1 − 2 ν) εxx + (1 + 2 ν 2 ) ε2xx
J = λ1 · λ2 · λ3 = (1 + εxx ) (1 − ν εxx )2 = 1 + (1 − 2 ν) εxx + (−2 ν + ν 2 ) ε2xx + ν 2 ε3xx
1
ln x ≈ x − x2 for |x| 1 Taylor approximation
2
1 1
ln J ≈ +(1 − 2 ν) εxx + (−2 ν + ν 2 ) ε2xx − (1 − 2 ν)2 ε2xx = +(1 − 2 ν) εxx − 1 + 2 ν 2 ε2xx
2 2
(ln J)2 ≈ +(1 − 2 ν)2 ε2xx .
With these expressions use the energy density W in (5.21) and conclude
µ λ
W = (I1 − 3) − µ ln J + (ln J)2
2 2
µ µ 1 λ
≈ 2 (1 − 2 ν) εxx + (1 + 2 ν ) εxx − µ (1 − 2 ν) εxx − (1 + 2 ν ) εxx + (1 − 2 ν)2 ε2xx
2 2 2 2
2 2 2 2
λ E ν E (1 − 2 ν) 2
= +µ (1 + 2 ν 2 ) ε2xx + (1 − 2 ν)2 ε2xx = + (1 + 2 ν 2 ) ε2xx + εxx
2 2 (1 + ν) 2 (1 + ν)
(1 + 2 ν 2 ) + ν − 2 ν 2
E 2
= E ε2xx = ε .
2 (1 + ν) 2 xx
Thus the Neo–Hookean energy expression (5.21) is consistent with the usual energy density based on
Hooke’s law for linear materials.
∂W
To obtain the stress–strain curve in Figure 5.18 determine σx = ∂εxx
. Using
∂ (I1 − 3)
= 2 (1 − 2 ν) + 2 (1 + 2 ν 2 ) εxx
∂εxx
∂J
= (1 − 2 ν) + 2 (−2 ν + ν 2 ) εxx + 3 ν 2 ε2xx
∂εxx
∂ ln J 1 ∂J 1
(1 − 2 ν) + 2 (−2 ν + ν 2 ) εxx + 3 ν 2 ε2xx
= =
∂εxx J ∂εxx J
∂ (ln J)2 2 ln J ∂ J 2 ln J
(1 − 2 ν) + 2 (−2 ν + ν 2 ) εxx + 3 ν 2 ε2xx
= =
∂εxx J ∂εxx J
SHA 21-5-21
CHAPTER 5. CALCULUS OF VARIATIONS, ELASTICITY AND TENSORS 377
obtain
∂W ∂ µ λ 2
σx = = (I1 − 3 − 2 ln J) + (ln J)
∂εxx ∂εxx 2 2
2 1 2 2 2
= µ (1 − 2 ν) + (1 + 2 ν ) εxx − (1 − 2 ν) + 2 (−2 ν + ν ) εxx + 3 ν εxx +
J
λ ln J
(1 − 2 ν) + 2 (−2 ν + ν 2 ) εxx + 3 ν 2 ε2xx .
+
J
For the value ν = 0.3 Figure 5.18 shows the resulting stress–strain curves for the Neo–Hookean and Hooke’s
law. ♦
1
linear Hooke
0.8 neo-Hookean
0.6
0.4
stress/E
0.2
-0.2
-0.4
Figure 5.18: Stress–strain curve for Hooke’s linear law and a compressible Neo–Hookean material with
ν = 0.3 under uniaxial load
5–43 Example : Neo–Hookean energy density for compressible materials, hydrostatic loading
For hydrostatic loading use λ1 = λ2 = λ3 = 1 + εxx . This leads to
µ λ
W = (I1 − 3) − µ ln J + (ln J)2
2 2
µ 1 λ
≈ (6 εxx + 3 ε2xx ) − µ 3 (εxx − ε2xx ) + 9 ε2xx (1 − εxx )
2 2 2
9 2 3 E 9 νE
≈ (3 µ + λ) εxx = + ε2xx
2 2 (1 + ν) 2 (1 + ν) (1 − 2 ν)
3 E (1 − 2 ν) + 9 ν E 2 3E
= εxx = ε2 .
2 (1 + ν) (1 − 2 ν) 2 (1 − 2 ν) xx
This coincides with the energy density determined by Hooke’s linear law in Example 5–31 on page 358. ♦
SHA 21-5-21
CHAPTER 5. CALCULUS OF VARIATIONS, ELASTICITY AND TENSORS 378
5–44 Example : Neo–Hookean energy density for compressible materials, pure shearing
Based on Example 5–41 use
∂ u1 2
I1 = λ21 + λ22 + λ23 = trace(C) = 3 + ( ) = 3 + 4 ε2xy
∂y
J = λ21 · λ22 · λ23 = det(C) = 1 = J
µ λ µ
W = (I1 − 3) − µ ln J + (ln J)2 = 4 ε2xy .
2 2 2
This leads to the shearing stress
1 ∂W E
τxy = = µ 4 εxy = εxy .
2 ∂εxy (1 + ν)
For small deformations this is consistent with Hooke’s law. ♦
which is the energy density for a Mooney–Rivlin material law. For the incompressible case one might
use
W = C10 · (I¯1 − 3) + C01 · (I¯2 − 3) = C10 · (I1 − 3) + C01 · (I2 − 3) ,
where
I2 λ2 λ2 + λ21 λ23 + λ21 λ22 −4/3 2/3 2/3 2/3 −4/3 2/3 2/3 2/3 −4/3
I¯2 = 4/3 = 2 3 = λ1 λ2 λ3 + λ1 λ2 λ3 + λ1 λ2 λ3 .
J (λ1 λ2 λ3 )4/3
• With C01 = 0 the Mooney–Rivlin energy density is equal to the Neo–Hookean energy density in
Example 5–38.
SHA 21-5-21
CHAPTER 5. CALCULUS OF VARIATIONS, ELASTICITY AND TENSORS 379
• Observe that minimizing the total energy, using the integral of the energy density W , does not respect
the incompressibility constraint J = λ1 · λ2 · λ3 = 1. This has to be taken care of in the design of the
algorithm.
• Using λi = λi /J 1/3 also find
µ α α α
W =(λ1 + λ2 + λ3 − 3) (5.23)
α
as definition for Ogden energies. The FEM code ABAQUS is using this setup. Since λ1 · λ2 · λ3 = 1
it seems that the incompressibility constraint is built in. In fact (5.22) and (5.23) are equivalent in the
case of incompressible materials.
5–45 Example : Mooney–Rivlin energy density for incompressible materials, uniaxial loading
For the Mooney–Rivlin energy density work with
W = C10 · (I¯1 − 3) + C01 · (I¯2 − 3)
= C10 · (λ̄21 + λ̄22 + λ̄23 − 3) + C01 · (λ̄22 λ̄23 + λ̄21 λ̄23 + λ̄21 λ̄22 − 3)
4/3 −2/3 −2/3 −2/3 4/3 −2/3 −2/3 −2/3 4/3
= C10 · (λ1 λ2 λ3 + λ1 λ2 λ3 + λ1 λ2 λ3 − 3) +
−4/3 2/3 2/3 2/3 −4/3 2/3 2/3 2/3 −4/3
+C01 · (λ1 λ2 λ3 + λ1 λ2 λ3 + λ1 λ2 λ3 − 3) .
−1/2
In the case of uniaxial, incompressible loading we use λ1 = 1 + εxx and λ2 = λ3 = λ1 , leading to
4/3 2/3 −2/3 −1/3 −4/3 4/3 2/3 −2/3
W = C10 · (λ1 λ1 + 2 λ1 λ1 − 3) + C01 · (λ1 λ2 + 2 λ1 λ2 − 3)
= C10 · (λ21+ 2 λ−1 −2
1 − 3) + C01 · (λ1 + 2 λ1 − 3)
= C10 · ((1 + εxx )2 + 2 (1 + εxx )−1 − 3) + C01 · ((1 + εxx )−2 + 2 (1 + εxx ) − 3)
≈ C10 · (2 εxx + ε2xx − 2 εxx + 2 ε2xx ) + C01 · (−2 εxx + 3 ε2xx + 2 εxx )
1
= C10 3 ε2xx + C01 3 ε2xx = 3 (C10 + C01 ) ε2xx = E ε2xx .
2
As a consequence obtain for small strains
1
C10 + C01 = E.
6
To determine the stress in x–direction examine
∂W
σx = = C10 · (2 (1 + εxx ) − 2 (1 + εxx )−2 ) + C01 · (−2 (1 + εxx )−3 + 2)
∂εxx
≈ C10 · (2 εxx + 4 εxx − 6 ε2xx + 8 ε3xx − 10 ε4xx )) +
+C01 · (6 εxx − 12 ε2xx + 20 ε3xx − 30 ε4xx )
= 6 (C10 + C01 ) εxx − (6 C10 + 12 C01 ) ε2xx +
+(8 C10 + 20 C01 ) ε3xx − (10 C10 + 30 C01 ) ε4xx .
Find the resulting stress–strain curves in Figure 5.19. ♦
5–46 Example : Mooney–Rivlin energy density for compressible materials, hydrostatic and uniaxial
loading
For the compressible Mooney–Rivlin energy density use
1
W = C10 · (I¯1 − 3) + C01 (I¯2 − 3) + (J − 1)2
D
1
= C10 · (λ̄21 + λ̄22 + λ̄23 − 3) + C01 · (λ̄22 λ̄23 + λ̄21 λ̄23 + λ̄21 λ̄22 − 3) +
(λ1 λ2 λ3 − 1)2
D
4/3 −2/3 −2/3 −2/3 4/3 −2/3 −2/3 −2/3 4/3
= C10 · (λ1 λ2 λ3 + λ1 λ2 λ3 + λ1 λ2 λ3 − 3) +
−4/3 2/3 2/3 2/3 −4/3 2/3 2/3 2/3 −4/3 1
+C01 · (λ1 λ2 λ3 + λ1 λ2 λ3 + λ1 λ2 λ3 − 3) + (λ1 λ2 λ3 − 1)2 .
D
SHA 21-5-21
CHAPTER 5. CALCULUS OF VARIATIONS, ELASTICITY AND TENSORS 380
linear Hooke
C Neo-Hooke
0.4 10
C =C
10 01
C only
01
0.2
stress/E
-0.2
-0.4
-0.2 0 0.2 0.4 0.6 0.8 1
strain
Figure 5.19: Stress–strain curve for Hooke’s linear law and three cases of incompressible Mooney–Rivlin
with uniaxial loading: (1): C10 = E6 , C01 = 0, i.e. Neo–Hooke, (2): C10 = C01 = 12 E
, (3): C10 = 0,
E
C01 = 6 .
• For the case of hydrostatic loading use λ̄i = λi = 1 + εxx and thus I¯1 = I¯2 = 3 and minimize
1
W (εxx ) = D (J − 1)2 . The setup is identical to the compressible Neo–Hookean case in Example 5–
40. This leads to
1 1
W = (J − 1)2 = + ε2xx (9 + 18 εxx + 15 ε2xx + 6 ε3xx + ε4xx )
D D
1 ∂W 1
6 εxx + 18 ε2xx + O(ε3xx )
σx = =
3 ∂εxx D
and
6 E 1 E 1
= =⇒ = = K.
D 1 − 2ν D 6 (1 − 2 ν) 2
• In the case of uniaxial, compressible loading use λ1 = 1 + εxx and λ2 = λ3 = 1 + εyy , leading to
∂W −4 4
0= = C10 · ( (1 + εxx )4/3 (1 + εyy )−7/3 + (1 + εxx )−2/3 (1 + εyy )−1/3 ) +
∂εyy 3 3
4 4
+C01 · ( (1 + εxx )−4/3 (1 + εyy )1/3 − (1 + εxx )2/3 (1 + εyy )−5/3 ) +
3 3
4 2
+ ((1 + εxx ) (1 + εyy ) − 1) (1 + εyy )
D
4 4 7 2 1
≈ C10 · (−1 − εxx + εyy + 1 − εxx − εyy ) +
3 3 3 3 3
SHA 21-5-21
CHAPTER 5. CALCULUS OF VARIATIONS, ELASTICITY AND TENSORS 381
4 4 1 2 5 4
− C01 · (1 − εxx + εyy − 1 − εxx + εyy ) + (εxx + 2 εyy )
3 3 3 3 3 D
8 8 4
= C10 · (−εxx + εyy ) + C01 · (−εxx + εyy ) + (εxx + 2 εyy )
3
3 D
4 8 4 8
= − (C10 + C01 ) εxx + 2 + (C10 + C01 ) εyy
D 3 D 3
4 8
− (C10 + C01 ) 3 − 2 D (C10 + C01 )
εyy = − D4 38 εxx = − εxx
2 D + 3 (C10 + C01 ) 6 + 2 D (C10 + C01 )
1 3 D (C10 + C01 )
= − − εxx = −ν εxx
2 6 + 2 D (C10 + C01 )
1 D 2 1 E 2
= − − (C10 + C01 ) + O(D ) εxx = − −D + O(D ) εxx
2 2 2 12
1
For the incompressible case D = 0 we obtain ν = 2 again.
For small strains use the energy density W and with the help of Mathematica one can generate a
Taylor approximation for W with respect to εxx , using Poisson’s law in the form εyy = − 12 (1 −
D (C10 + C01 )) εxx .
1
W (εxx ) ≈ (C10 + C01 ) (9 − 3 (C10 + C01 ) D + (C10 + C01 )2 D2 )) ε2xx +
3
1
+ −2 (C10 + 2 C01 ) + (3 C10 + C01 ) (C10 + C01 ) D+
2
1 2 2 1 3 2
+ (C10 + C01 ) (5 C10 + 3 C01 ) D − (C10 + C01 ) (11 C10 + 7 C01 ) D ε3xx
3 54
This leads to the relation to Young’s modulus
2
E = (C10 + C01 ) (9 − 3 (C10 + C01 ) D + (C10 + C01 )2 D2 ))
3
2
= 6 (C10 + C01 ) − 2 (C10 + C01 )2 D + (C10 + C01 )3 D2 )) . (5.24)
3
The normal stress σx is given by
∂W 2
σx = ≈ (C10 + C01 ) (9 − 3 (C10 + C01 ) D + (C10 + C01 )2 D2 )) εxx .
∂εxx 3
For the incompressible case D = 0 this simplifies to
SHA 21-5-21
CHAPTER 5. CALCULUS OF VARIATIONS, ELASTICITY AND TENSORS 382
µ
= ((1 + εxx )α + 2 (1 + εxx )−α/2 − 3) use a Taylor aproximation
α
µ α α α
≈ (1 + α εxx + (α − 1) ε2xx + 2 − α εxx + ( + 1) ε2xx − 3)
α 2 2 2
µ α 2 3 µ α
= (α2 − α + + α) ε2xx = ε2xx .
2α 2 4
1
Comparing this the the energy density based on Hooke’s linear law W = 2 E ε2xx conclude
3 2E
E= µα or µ = .
2 3α
1 1
In the case of α = 2 (Neo–Hookean) find µ = 3 E, which is consistent with C10 = 6 E.
Find the stress-strain curve for α = 1.5, α = 2 and α = 3 in Figure 5.20, together with the linear law by
Hooke. ♦
linear Hooke
α=1.5
α=2 neo-Hookean
0.4
α=3
0.2
stress/E
-0.2
-0.4
-0.2 0 0.2 0.4 0.6 0.8 1
strain
Figure 5.20: Stress–strain curve for Hooke’s linear law and an incompressible Ogden material under uniaxial
load, for different values of α
An attempt to apply the same procedure to a hydrostatic situation has to fail for incompressible materials.
µ
W = 3 ((1 + εxx )α − 1) using 1 + εxx = λ1 = λ2 = λ3
α
1 ∂W µ
σx = = α (1 + εxx )α−1 = µ (1 + εxx )α−1 ≈ µ (1 + (α − 1) εxx ) .
3 ∂εxx α
Thus we would have a resulting force, without applying a stretch!
SHA 21-5-21
CHAPTER 5. CALCULUS OF VARIATIONS, ELASTICITY AND TENSORS 383
• Observe that minimizing the total energy, using the integral of the energy density W , takes the com-
pressibility into account by considering (J − 1)2 .
5–48 Example : Ogden energy density for compressible materials, hydrostatic loading
For hydrostatic loading use λ1 = λ2 = λ3 = 1 + εxx , leading to λi = 1 and thus the first contribution in
Odgen’s energy density equals zero. The values of the material parameters µ and α have no influence on the
result. This leads to
1 2 1 2 9 2
W = 0+ (1 + εxx )3 − 1 = 3 εxx + 3 ε2xx + ε3xx ≈ ε
D D D xx
∂W 2
(1 + εxx )3 − 1 3 + 6 εxx + 3 ε2xx
=
∂εxx D
6 18
(1 + εxx )3 − 1 (1 + εxx )2 ≈
= εxx .
D D
Comparing with the computations using Hooke’s law in Example 5–31 (page 358) conclude
1 E 3 6 (1 − 2 ν)
= or D = .
2 1 − 2ν D E
The stress–strain curve for Hooke’s and Ogdens material are given by
E
σHooke = εxx
1 − 2ν
1 ∂W 2
(1 + εxx )3 − 1 (1 + εxx )2
σOgden = =
3 ∂εxx D
E
3 εxx + 3 ε2xx + ε3xx (1 + εxx )2
=
3 (1 − 2 ν)
E 1
= εxx 1 + εxx + ε2xx (1 + εxx )2
1 − 2ν 3
E 7 2 3
= εxx 1 + 3 εxx + εxx + O(εxx ) .
1 − 2ν 3
Find the result in Figure 5.21, the computations performed with a Poisson ratio ν = 0.3. ♦
SHA 21-5-21
CHAPTER 5. CALCULUS OF VARIATIONS, ELASTICITY AND TENSORS 384
linear Hooke
0.6
Ogden
0.4
stress/E
0.2
-0.2
-0.4
-0.2 -0.1 0 0.1 0.2
strain
Figure 5.21: Stress–strain curve for Hooke’s linear law and a compressible Ogden material under hydrostatic
load with ν = 0.3.
5–49 Example : Ogden energy density for compressible materials, uniaxial loading
For uniaxial loading we λ1 = 1 + εxx and λ2 = λ3 = 1 + εyy , leading to
J = λ1 · λ2 · λ3 = (1 + εxx ) (1 + εyy )2
and thus
λ1 1 + εxx 1 + εyy
λ1 = = , λ2 = λ3 = .
J 1/3 (1 + εxx )1/3 (1 + εyy )2/3 (1 + εxx )1/3 (1 + εyy )2/3
∂ 2µ
0= W (εxx , εyy ) = (−(1 + εxx )2α/3 (1 + εyy )−2α/3−1 + (1 + εxx )−α/3 (1 + εyy )α/3−1 ) +
∂εyy 3
4
(1 + εxx ) · (1 + εyy )2 − 1 (1 + εxx ) (1 + εyy )
+
D
2µ 2α −2α − 3 α α−3
≈ (−(1 + εxx ) (1 + εyy ) + (1 − εxx ) (1 + εyy ) +
3 3 3 3 3
4
+ (εxx + 2 εyy ) (1 + εxx ) (1 + εyy )
D
2µ 2α 2α + 3 α α−3 4
≈ (− εxx + εyy − εxx + εyy ) + (εxx + 2 εyy )
3 3 3 3 3 D
SHA 21-5-21
CHAPTER 5. CALCULUS OF VARIATIONS, ELASTICITY AND TENSORS 385
2µα 4 2µα 8
= − + εxx + + εyy
3 D 3 D
−2 µ α D + 12 2 µ α D + 24
= εxx + εyy
3D 3D
12 − 2 µ α D 6 − µαD
εyy = − εxx = − εxx .
24 + 2 µ α D 12 + µ α D
For small, uniaxial deformation this leads to Poission ratio
−εyy 6 − µαD 1 3µαD
ν= = = − .
εxx 12 + µ α D 2 2 (12 + µ α D)
For 0 < D 1 find ν ≈ 21 , as expected for an almost incompressible material.
To determine the stress-strain curve examine the dependence of W on λ1 , resp. εxx . Since the energy
density W is minimized with respect to λ2 use
d W (λ1 , λ2 ) ∂ W (λ1 , λ2 ) ∂ W (λ1 , λ2 ) ∂ λ2 ∂ W (λ1 , λ2 )
= + = + 0.
dλ1 ∂λ1 ∂λ2 ∂λ1 ∂λ1
For small strains determine
∂ µ 2 α 2α/3−1 −2α/3 2 α −α/3−1 α/3 2
λ1 · λ22 − 1 λ22
W (λ1 , λ2 ) = + λ1 λ2 − λ1 λ2 +
∂λ1 α 3 3 D
∂ 2µ
W (εxx , εyy ) = (1 + εxx )2α/3−1 (1 + εyy )−2α/3 − (1 + εxx )−α/3−1 (1 + εyy )α/3 +
∂εxx 3
2
(1 + εxx ) · (1 + εyy )2 − 1 (1 + εyy )2
+
D
2µ 2α − 3 2α α+3 α
≈ (1 + εxx ) (1 − εyy ) − (1 − εxx ) (1 + εyy ) +
3 3 3 3 3
2
+ (εxx + 2 εyy ) (1 + 2 εyy )
D
2 µ 2α − 3 2α α+3 α 2
≈ εxx − εyy + εxx − εyy + (εxx + 2 εyy )
3 3 3 3 3 D
2µα 2 −2 µ α 4
= + εxx + + εyy
3 D 3 D
2µαD + 6 2 µ α D − 12
= εxx − εyy .
3D 3D
Using the above Poisson ratio this leads to
∂ 2µαD + 6 2 µ α D − 12 6 − µ α D
W (εxx , εyy ) ≈ εxx + εxx
∂εxx 3D 3D 12 + µ α D
2 (6 − µ α D)2
2 (3 + µ α D) (12 + µ α D)
= − εxx
3 D (12 + µ α D) 3 D (12 + µ α D)
2 (3 + µ α D) (12 + µ α D) − 2 (6 − µ α D)2
= εxx
3 D (12 + µ α D)
(30 + 24) µ α D 18 µ α
= εxx = εxx .
3 D (12 + µ α D) 12 + µ α D
As a consequence we find for small strains εxx

E ≈ 18 µα/(12 + µαD) .

In case of an incompressible material (D → 0) this simplifies to E = (3/2) µα, which is consistent with Example 5–47. ♦
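The small strain results can be checked numerically. The following Octave sketch assumes the compressible Ogden density W = (µ/α)(λ̄1^α + λ̄2^α + λ̄3^α − 3) + (J − 1)²/D used in the computations above and hypothetical material parameters; it minimizes W with respect to εyy and compares with the formulas for ν and E (since for a uniaxial stress state W ≈ (1/2) E εxx²):

Octave
mu = 1; alpha = 4; D = 0.2;            % hypothetical material parameters
J = @(ex,ey) (1+ex).*(1+ey).^2;        % volume change for uniaxial loading
W = @(ex,ey) mu/alpha*(((1+ex)./J(ex,ey).^(1/3)).^alpha ...
           + 2*((1+ey)./J(ex,ey).^(1/3)).^alpha - 3) + (J(ex,ey)-1).^2/D;
ex = 1e-3;                             % small uniaxial strain
opt = optimset('TolX',1e-12,'TolFun',1e-20);
ey = fminsearch(@(e) W(ex,e), -0.4*ex, opt);  % minimize W w.r.t. eps_yy
nu_num = -ey/ex                 % approx. (6-mu*alpha*D)/(12+mu*alpha*D)
E_num = 2*W(ex,ey)/ex^2         % approx. 18*mu*alpha/(12+mu*alpha*D)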
In [Ogde13, §4.4, p. 222] the two Neo–Hookean energies for compressible and incompressible materials are discussed:

W = C10 (I1 − 3) and W = (µ/2) (I1 − 3 − 2 ln J) + (λ/2) (ln J)² .
• [Hack15, (4.6), p. 20] Mooney–Rivlin, using the modified stretches and invariants

λ̄i = λi/J^(1/3) ,  Ī1 = I1/J^(2/3) = (λ1² + λ2² + λ3²)/J^(2/3) = λ̄1² + λ̄2² + λ̄3²
• Use [Oden71, p. 315] for graphs and p. 222ff for Neo–Hookean materials.
• The COMSOL manual StructuralMechanicsModuleUsersGuide.pdf describes different material models, using the above invariants.
• The ABAQUS theory manual contains useful information, e.g. Section 4.6.1 on hyperelastic material behavior.
fix: all external forces act horizontally, u3 is independent of x, y and z, and

u1 = u1(x, y) ,  u2 = u2(x, y) ,  independent of z .

This leads to vanishing strains in z direction
and thus this is called a plane strain situation. It can be realized by a long solid in the direction of the z axis with constant cross section and a force distribution parallel to the xy plane, independent of z. The two ends are to be fixed in z direction. Due to Saint-Venant's principle (see e.g. [Sout73, §5.6]) the boundary effects at the two far ends can safely be ignored. Another example is the expansion of a blood vessel, embedded in body tissue. The pulsating blood pressure will stretch the walls of the vessel, but there is no movement of the wall in the direction of the blood flow.
or equivalently

(σx, σy, τxy)ᵀ = E/((1 + ν)(1 − 2ν)) · [ 1−ν ν 0 ; ν 1−ν 0 ; 0 0 1−2ν ] · (εxx, εyy, εxy)ᵀ   (5.26)

σz = E ν (εxx + εyy)/((1 + ν)(1 − 2ν)) ,  τxz = τyz = 0 .
As unknown functions examine the two components of the displacement vector ~u = (u1, u2)ᵀ, as functions of x and y. The components of the strain can be computed as derivatives of ~u. Thus if ~u is known, all other expressions can be computed.
If the volume and surface forces are parallel to the xy plane and independent of z, then the corresponding energy contributions25 can be written as integrals over the domain Ω ⊂ R², resp. the boundary ∂Ω. Obtain the total energy as a functional of the yet unknown function ~u.
As in many other situations use again the Bernoulli principle to find the solution of the plane strain elasticity problem.
25 Observe that we quietly switch from the domain Ω × [0, H] ⊂ R³ to the planar domain Ω ⊂ R². The 'energy' U actually denotes the 'energy divided by height H'.
5–50 Example : Consider a horizontal, rectangular plate, trapped in z direction between two very hard
surfaces. Then compress the plate in x direction (εxx < 0) by a force F applied to its right edge. Assume
that the normal strain εxx in x-direction is known and then determine the other expressions. Since it is a
plane strain situation use u3 = εzz = εxz = εyz = 0. The similar plane stress problem will be examined in
Example 5–51. Assume that all strains are constant. Now εyy and εxy can be determined by minimizing the
energy density. From equation (5.27) obtain
W = (1/2) E/((1 + ν)(1 − 2ν)) ((1 − ν) εxx² + 2ν εxx εyy + (1 − ν) εyy² + 2 (1 − 2ν) εxy²) .
As a necessary condition for a minimum the partial derivatives with respect to εyy and εxy have to vanish.
This leads to
+2 ν εxx + 2 (1 − ν) εyy = 0 and εxy = 0 .
This leads to a modified Poisson's ratio ν* for the plane strain situation:

εyy = −(ν/(1 − ν)) εxx = −ν* εxx .
Since ν > 0 we conclude ν ∗ > ν, i.e. the plate will expand (εyy > 0) more in y direction than a free plate.
This is caused by the plate not being allowed to expand in z direction, i.e. εzz = 0 .
The energy density in this situation is given by

W = (1/2) E/((1 + ν)(1 − 2ν)) ((1 − ν) εxx² − 2 (ν²/(1 − ν)) εxx² + (ν² (1 − ν)/(1 − ν)²) εxx²)
  = (1/2) E/((1 + ν)(1 − 2ν)) ((1 − 2ν + ν²)/(1 − ν) − 2ν²/(1 − ν) + ν²/(1 − ν)) εxx²
  = (1/2) E/(1 − ν²) εxx² .
By comparing this with the situation of a simple stretched shaft (Example 5–30, page 356) find a modified modulus of elasticity

E* = E/(1 − ν²)
and the pressure required to compress the plate, using σx = ∂W/∂εxx,26 is given by

F/A = E* ΔL/L = E* εxx .
The fixation of the plate in z direction (plane strain) prevents the plate from showing the Poisson contraction (resp. expansion) when compressed in x direction. Thus more force is required to compress it in x direction. This information is given by E* = E/(1 − ν²) > E. ♦
Similarly modified constants ν ∗ and E ∗ are used in [Sout73, p. 87] to formulate the partial differential
equations governing this situation. It is a matter of elementary algebra to verify that
(σx, σy, τxy)ᵀ = E/((1 + ν)(1 − 2ν)) · [ 1−ν ν 0 ; ν 1−ν 0 ; 0 0 1−2ν ] · (εxx, εyy, εxy)ᵀ
              = E*/(1 − (ν*)²) · [ 1 ν* 0 ; ν* 1 0 ; 0 0 1−ν* ] · (εxx, εyy, εxy)ᵀ .

26 One may also use Hooke's law for a plane strain setup to conclude

σx = E/((1 + ν)(1 − 2ν)) ((1 − ν) εxx + ν εyy) = E/((1 + ν)(1 − 2ν)) ((1 − ν) − ν²/(1 − ν)) εxx
   = E/((1 + ν)(1 − 2ν)) · ((1 − ν)² − ν²)/(1 − ν) εxx = E/(1 − ν²) εxx .
This form of Hooke's law for a plane strain situation coincides with the plane stress situation in equation (5.30) on page 393, but using E* and ν* instead of the usual E and ν.
This can be used to derive a system of partial differential equations that are solved by the actual displacement function. Use the abbreviation

k = (1/2) E/((1 + ν)(1 − 2ν))

to find the main expression for the elastic energy given by
Uelast = ∫∫_Ω ⟨ (εxx, εyy, εxy)ᵀ , k [ 1−ν ν 0 ; ν 1−ν 0 ; 0 0 2(1−2ν) ] · (εxx, εyy, εxy)ᵀ ⟩ dx dy
       = ∫∫_Ω ⟨ (∂u1/∂x , ∂u2/∂y , (1/2)(∂u1/∂y + ∂u2/∂x))ᵀ , k [ 1−ν ν 0 ; ν 1−ν 0 ; 0 0 2(1−2ν) ] · (εxx, εyy, εxy)ᵀ ⟩ dx dy
       = ∫∫_Ω ⟨ (∂u1/∂x , ∂u1/∂y)ᵀ , k [ 1−ν ν 0 ; 0 0 1−2ν ] · (εxx, εyy, εxy)ᵀ ⟩ dx dy
         + ∫∫_Ω ⟨ (∂u2/∂x , ∂u2/∂y)ᵀ , k [ 0 0 1−2ν ; ν 1−ν 0 ] · (εxx, εyy, εxy)ᵀ ⟩ dx dy .
Using the divergence theorem (Section 5.2.3, page 313) on the two integrals find

Uelast = − ∫∫_Ω u1 div( k [ 1−ν ν 0 ; 0 0 1−2ν ] · (εxx, εyy, εxy)ᵀ ) dx dy
         + ∮_∂Ω u1 ⟨ ~n , k [ 1−ν ν 0 ; 0 0 1−2ν ] · (εxx, εyy, εxy)ᵀ ⟩ ds
         − ∫∫_Ω u2 div( k [ 0 0 1−2ν ; ν 1−ν 0 ] · (εxx, εyy, εxy)ᵀ ) dx dy
         + ∮_∂Ω u2 ⟨ ~n , k [ 0 0 1−2ν ; ν 1−ν 0 ] · (εxx, εyy, εxy)ᵀ ⟩ ds .
Using a calculus of variations argument with perturbations φ1 of u1 vanishing on the boundary conclude27

div( E/((1 + ν)(1 − 2ν)) [ 1−ν ν 0 ; 0 0 1−2ν ] · (εxx, εyy, εxy)ᵀ ) = −f1

div( E/((1 + ν)(1 − 2ν)) ( (1 − ν) ∂u1/∂x + ν ∂u2/∂y , ((1−2ν)/2) (∂u1/∂y + ∂u2/∂x) )ᵀ ) = −f1
This is a system of second order partial differential equations (PDE) for the unknown displacement vector function ~u. If the coefficients E and ν are constant, one can juggle with these equations and arrive at different formulations. The first equation may be rewritten in the form

E/(2(1+ν)) ( (2(1−ν)/(1−2ν)) ∂²u1/∂x² + (2ν/(1−2ν)) ∂²u2/∂x∂y + ∂²u1/∂y² + ∂²u2/∂x∂y ) = −f1
E/(2(1+ν)) ( ((1 + (1−2ν))/(1−2ν)) ∂²u1/∂x² + ∂²u1/∂y² + (1/(1−2ν)) ∂²u2/∂x∂y ) = −f1
E/(2(1+ν)) ( ∂²u1/∂x² + ∂²u1/∂y² + (1/(1−2ν)) ∂/∂x (∂u1/∂x + ∂u2/∂y) ) = −f1 .

By rewriting the second differential equation in a similar fashion we arrive at a formulation given in [Sout73, p. 87]:

E/(2(1+ν)) ( Δu1 + (1/(1−2ν)) ∂/∂x (∂u1/∂x + ∂u2/∂y) ) = −f1
E/(2(1+ν)) ( Δu2 + (1/(1−2ν)) ∂/∂y (∂u1/∂x + ∂u2/∂y) ) = −f2
27 There is a minor gap in the argument: we only take variations of u1 into account while the resulting variations of εxx, εyy and εxy are ignored. Thus use

⟨u + φ , A u − f⟩ minimal ⟹ A u − f = 0 .

The preceding calculation examines an expression ⟨u, A u⟩ for an accordingly defined scalar product. For a symmetric matrix A and a perturbation u + φ of u we should actually examine

⟨u + φ , A (u + φ) − f⟩ = ⟨u , A u − f⟩ + ⟨φ , A u − f⟩ + ⟨u , A φ⟩ + ⟨φ , A φ⟩ ≈ ⟨u , A u − f⟩ + ⟨φ , 2 A u − f⟩ .

If this expression is minimized at u then we conclude 2 A u − f = 0. The only difference to the first approach is a factor of 2, which is taken into account for the resulting differential equations in the main text.
Prescribed displacement
If on a section Γ1 of the boundary ∂Ω the displacement vector ~u is known, use this as a boundary condition
on the section Γ1 . Thus find Dirichlet conditions on this section of the boundary.
This allows a verification of the equations by comparing with (5.9) (page 338), i.e. find

[ σx τxy ; τxy σy ] · (cos φ , sin φ)ᵀ = [ σx τxy ; τxy σy ] · (nx , ny)ᵀ = (g1 , g2)ᵀ .
At the surface the stresses have to coincide with the externally applied stresses. The above boundary conditions can be written in terms of the unknown displacement vector ~u, leading to

E/((1 + ν)(1 − 2ν)) ( nx ((1 − ν) ∂u1/∂x + ν ∂u2/∂y) + ny ((1 − 2ν)/2) (∂u1/∂y + ∂u2/∂x) ) = g1
E/((1 + ν)(1 − 2ν)) ( ny ((1 − ν) ∂u2/∂y + ν ∂u1/∂x) + nx ((1 − 2ν)/2) (∂u1/∂y + ∂u2/∂x) ) = g2 .
Now express the total energy of a plane stress problem as function of the strains.
As in many other situations use again the Bernoulli principle to find the solution of the plane stress elasticity
problem.
5.9.1 From the Plane Stress Matrix to the Full Stress Matrix
For a plane stress problem the reduced stress matrix is a 2 × 2 matrix, while the full stress matrix has to be 3 × 3:

plane stress → [ σx τxy ; τxy σy ] ,  3D → [ σx τxy 0 ; τxy σy 0 ; 0 0 0 ]
To compute the principal stresses σi determine the eigenvalues of this matrix, i.e. solve

0 = det [ σx−λ τxy 0 ; τxy σy−λ 0 ; 0 0 −λ ] = −λ det [ σx−λ τxy ; τxy σy−λ ]
  = −λ ((σx − λ)(σy − λ) − τxy²) = −λ (λ² − λ (σx + σy) + σx σy − τxy²) .
This leads to

σ1,2 = (σx + σy)/2 ± (1/2) √((σx + σy)² − 4 (σx σy − τxy²)) = (σx + σy)/2 ± √(((σx − σy)/2)² + τxy²)
σ3 = 0 .
The above principal stresses may be used to determine the von Mises and the Tresca stress. The solution of
the quadratic equation can be visualized using Mohr’s circle, see Result 5–22 on page 341.
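As an illustration, the principal stresses for a sample stress state (values chosen here) can be computed with the above formula and confirmed with eig() applied to the full stress matrix:

Octave
sx = 2; sy = -1; txy = 0.5;                % a sample plane stress state
s12 = (sx+sy)/2 + [1;-1]*sqrt(((sx-sy)/2)^2+txy^2)  % principal stresses
S3 = [sx txy 0; txy sy 0; 0 0 0];          % full 3x3 stress matrix
eig(S3)                                    % eigenvalues: s12 and sigma_3 = 0
vonMises = sqrt(((s12(1)-s12(2))^2+s12(1)^2+s12(2)^2)/2)  % von Mises stress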
5–51 Example : Consider a horizontal, rectangular plate, compressed in x direction by a force F applied
to its right edge. Assume that the normal strain εxx < 0 in x-direction is known and then determine the
other expressions. Use a plane stress model, i.e. σz = τxz = τyz = 0. The similar plane strain problem was
examined in Example 5–50. Assume that all strains are constant. Now εyy and εxy can be determined by
minimizing the energy density. For the energy density obtain
W = E/(2 (1 − ν²)) (εxx² + εyy² + 2ν εxx εyy + 2 (1 − ν) εxy²) .
2 (1 − ν 2 )
As a necessary condition for a minimum the partial derivatives with respect to εyy and εxy have to vanish.
This leads to
+2 εyy + 2 ν εxx = 0 and εxy = 0 .
This leads to the standard Poisson's ratio ν for the plane stress situation, i.e. εyy = −ν εxx.
This equation is very similar to the expression for a plane strain situation in equation (5.28) (page 388).
The only difference is in the coefficients. As a starting point for a finite element solution of a plane stress
problem use the Bernoulli principle and minimize the energy functional
Uelast = E/(2 (1 − ν²)) ∫∫_Ω ⟨ (∂u1/∂x , ∂u1/∂y)ᵀ , [ 1 ν 0 ; 0 0 1−ν ] · (εxx, εyy, εxy)ᵀ ⟩ dx dy
         + E/(2 (1 − ν²)) ∫∫_Ω ⟨ (∂u2/∂x , ∂u2/∂y)ᵀ , [ 0 0 1−ν ; ν 1 0 ] · (εxx, εyy, εxy)ᵀ ⟩ dx dy
       = − E/(2 (1 − ν²)) ∫∫_Ω u1 div( [ 1 ν 0 ; 0 0 1−ν ] · (εxx, εyy, εxy)ᵀ ) dx dy
         + E/(2 (1 − ν²)) ∮_∂Ω u1 ⟨ ~n , [ 1 ν 0 ; 0 0 1−ν ] · (εxx, εyy, εxy)ᵀ ⟩ ds
         − E/(2 (1 − ν²)) ∫∫_Ω u2 div( [ 0 0 1−ν ; ν 1 0 ] · (εxx, εyy, εxy)ᵀ ) dx dy
         + E/(2 (1 − ν²)) ∮_∂Ω u2 ⟨ ~n , [ 0 0 1−ν ; ν 1 0 ] · (εxx, εyy, εxy)ᵀ ⟩ ds .
Reconsider the calculations for the plane strain situation and make a few minor changes to adapt the results to the above plane stress situation, to arrive at the system of partial differential equations

div( E/(1 − ν²) ( ∂u1/∂x + ν ∂u2/∂y , ((1−ν)/2) (∂u1/∂y + ∂u2/∂x) )ᵀ ) = −f1   (5.34)
div( E/(1 − ν²) ( ((1−ν)/2) (∂u1/∂y + ∂u2/∂x) , ν ∂u1/∂x + ∂u2/∂y )ᵀ ) = −f2   (5.35)
If E and ν are constant, i.e. a homogeneous material, we can use elementary, tedious operations to find

E/(2(1+ν)) ( ∂²u1/∂x² + ∂²u1/∂y² + ((1+ν)/(1−ν)) ∂/∂x (∂u1/∂x + ∂u2/∂y) ) = −f1
E/(2(1+ν)) ( ∂²u2/∂x² + ∂²u2/∂y² + ((1+ν)/(1−ν)) ∂/∂y (∂u1/∂x + ∂u2/∂y) ) = −f2

or in vector form

E/(2(1+ν)) ( Δ~u + ((1+ν)/(1−ν)) ∇ (∇ · ~u) ) = −f~ .
• On a section Γ1 of the boundary we assume that the displacement vector ~u is known and thus we find
Dirichlet boundary conditions.
• On the section Γ2 the displacement ~u is not submitted to constraints, but we apply an external force
~g . Again we use a calculus of variations argument to find the resulting boundary conditions.
The contributions of the integral over the boundary section Γ2 in the total energy Uelast + UVol + USurf are given by

∫_Γ2 . . . ds = E/(2(1−ν²)) ∫_Γ2 u1 ⟨ ~n , [ 1 ν 0 ; 0 0 1−ν ] · (εxx, εyy, εxy)ᵀ ⟩ ds
                + E/(2(1−ν²)) ∫_Γ2 u2 ⟨ ~n , [ 0 0 1−ν ; ν 1 0 ] · (εxx, εyy, εxy)ᵀ ⟩ ds − ∫_Γ2 ~g · ~u ds
              = E/(2(1−ν²)) ∫_Γ2 u1 ⟨ ~n , (εxx + ν εyy , (1−ν) εxy)ᵀ ⟩ ds − ∫_Γ2 g1 u1 ds
                + E/(2(1−ν²)) ∫_Γ2 u2 ⟨ ~n , ((1−ν) εxy , ν εxx + εyy)ᵀ ⟩ ds − ∫_Γ2 g2 u2 ds .
W = (1/2) E/(1 − ν²) (εxx² + εyy² + 2ν εxx εyy + 2 (1 − ν) εxy²)
  = (1/2) E/(1 − ν²) ( (∂u1/∂x)² + (∂u2/∂y)² + 2ν (∂u1/∂x)(∂u2/∂y) + ((1−ν)/2) (∂u1/∂y + ∂u2/∂x)² ) .

This leads to a functional of the form (5.6), see page 319. It is a quadratic functional. To state the boundary conditions on Γ2 the partial derivatives of the expression F(u1, u2, ∇u1, ∇u2) are required.28 The conditions
⟨F∇u1(u1, u2, ∇u1, ∇u2) , ~n⟩ = g1 and ⟨F∇u2(u1, u2, ∇u1, ∇u2) , ~n⟩ = g2
28 Use the notation F∇u = (∂F/∂ux , ∂F/∂uy)ᵀ.
lead to

⟨ ~n , E/(1 − ν²) ( ∂u1/∂x + ν ∂u2/∂y , ((1−ν)/2) (∂u1/∂y + ∂u2/∂x) )ᵀ ⟩ = g1
⟨ ~n , E/(1 − ν²) ( ((1−ν)/2) (∂u1/∂y + ∂u2/∂x) , ∂u2/∂y + ν ∂u1/∂x )ᵀ ⟩ = g2 .

Using Hooke's law for the plane stress situation (5.30) this can be simplified to

⟨ ~n , (σx , τxy)ᵀ ⟩ = g1 and ⟨ ~n , (τxy , σy)ᵀ ⟩ = g2
Bibliography
[Aris62] R. Aris. Vectors, Tensors and the Basic Equations of Fluid Mechanics. Prentice Hall, 1962.
Republished by Dover.
[BoneWood08] J. Bonet and R. Wood. Nonlinear Continuum Mechanics for Finite Element Analysis. Cam-
bridge University Press, 2008.
[BoriTara79] A. I. Borisenko and I. E. Tarapov. Vector and Tensor Analysis with Applications. Dover, 1979.
first published in 1966 by Prentice–Hall.
[Bowe10] A. F. Bower. Applied Mechanics of Solids. CRC Press, 2010. web site at solidmechanics.org.
[ChouPaga67] P. C. Chou and N. J. Pagano. Elasticity: Tensor, Dyadic, and Engineering Approaches. D. Van Nostrand Company, 1967. Republished by Dover 1992.
[GhabPeckWu17] J. Ghaboussi, D. Pecknold, and X. Wu. Nonlinear Computational Solid Mechanics. CRC
Press, 2017.
[Gree77] D. T. Greenwood. Classical Dynamics. Prentice Hall, 1977. Republished by Dover 1997.
[HenrWann17] P. Henry and G. Wanner. Johann Bernoulli and the cycloid: A theorem for posterity. Elemente der Mathematik, 72(4):137–163, 2017.
[Holz00] G. A. Holzapfel. Nonlinear Solid Mechanics, a Continuum Approach for Engineering. John Wiley & Sons, 2000.
[Oden71] J. Oden. Finite Elements of Nonlinear Continua. Advanced engineering series. McGraw-Hill,
1971. Republished by Dover, 2006.
[Ogde13] R. Ogden. Non-Linear Elastic Deformations. Dover Civil and Mechanical Engineering. Dover
Publications, 2013.
[OttoPete92] N. S. Ottosen and H. Petersson. Introduction to the Finite Element Method. Prentice Hall,
1992.
[Redd13] J. N. Reddy. An Introduction to Continuum Mechanics. Cambridge University Press, 2nd edition,
2013.
[Redd15] J. N. Reddy. An Introduction to Nonlinear Finite Element Analysis. Oxford University Press, 2nd
edition, 2015.
[VarFEM] A. Stahel. Calculus of Variations and Finite Elements. Lecture Notes used at HTA Biel, 2000.
Chapter 6

Finite Element Methods

In Chapter 5 the functional

F(u) = ∫∫_Ω (1/2) a (∇u)² + (1/2) b u² + f · u dA − ∫_Γ2 g2 u ds
was examined, with the boundary condition u(x, y) = g1(x, y) for (x, y) ∈ Γ1. At the minimal function u the derivative in the direction of the function φ has to vanish. Thus the minimal solution has to satisfy

0 = ∫∫_Ω (−∇ · (a ∇u) + b u + f) · φ dA + ∫_Γ2 (a ~n · ∇u − g2) φ ds   (6.1)

for all test functions φ vanishing on the boundary Γ1. Using Green's formula (integration by parts) this leads to

0 = ∫∫_Ω a ∇u · ∇φ + (b u + f) · φ dA − ∫_Γ2 g2 φ ds .   (6.2)

A function u satisfying this condition is called a weak solution. Using the fundamental lemma of the calculus of variations conclude that the function u is a solution of the boundary value problem
For weak solutions or the minimization formulation use a numerical integration to discretize the for-
mulation. This will lead directly to the Finite Element Method (FEM). The path to follow is illustrated
in Figure 6.1. The branch on the left can only be used for self-adjoint problems and the resulting matrix
A will be symmetric. The branch on the right is applicable to more general problems, leading to possibly
non-symmetric matrices.
[Figure 6.1: the two paths to the discrete problem: discretizing the minimization formulation leads to the vector ~u ∈ R^N minimizing F(~u) = (1/2)⟨A~u, ~u⟩ + ⟨W f~, ~u⟩; discretizing the weak formulation leads to ⟨A~u, φ~⟩ + ⟨W f~, φ~⟩ = 0 for all vectors φ~ ∈ R^N; both FEM formulations result in the linear system A~u + W f~ = ~0]
• Present the theoretical results for self-adjoint BVP as minimization problems. State the required
approximation results.
This simple problem is used to explain the basic algorithms and can be implemented with MATLAB or
Octave 1 . The purpose of the code is to illustrate the necessary degree of complexity.
∫∫_Ω . . . dA ≈ Σ_k ∫∫_{Tk} . . . dA .

If the above process is carried out correctly the functional F is replaced by a discrete approximation

F(u) ≈ (1/2) ⟨~u , A ~u⟩ + ⟨W f~ , ~u⟩

with a symmetric, positive definite matrix A. This expression is minimized by the solution of the linear system of equations

A ~u = −W f~ .
It is one of the most important advantages of the Finite Element Method that it can be applied on irregularly shaped domains. For rectangular domains Ω finite difference methods could be used to solve the BVP in this section. Applying finite differences to non-rectangular domains can be very challenging.
For one possible construction of finite elements the value of the unknown function at each of the nodes
is one degree of freedom. Thus for each triangle we have exactly 3 degrees of freedom and the total number
N of (interior) nodes corresponds to the number of unknowns. The element stiffness matrices AT will be of size 3 × 3 and the global stiffness matrix A is an N × N matrix.
• Create the N × N matrix A, originally filled with zeros, and the vector f~ ∈ R^N.
• For each triangle T in the mesh:
  – Compute the element stiffness matrix AT and the vector WT f~T. Use equation (6.3) and a numerical integration scheme.
  – Add matrix and vector to the global structure.
• Solve the global system A ~u + W f~ = ~0 for the vector of unknown values in ~u.
• Visualize the solution and make the correct conclusion for your application.
The actual computation of an element stiffness matrix will be examined carefully in the subsequent sections. It is the most important building block of any FEM approach; a schematic version of the assembly loop is sketched below.
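The sketch below is schematic Octave code only: the arrays nodes (coordinates) and elem (node numbers of the triangles) are assumed to be provided by a mesh generator, the vectors a, b and fV are assumed to contain the nodal values of the coefficient functions, ElementContribution is a placeholder name for the element routine developed in the following sections, and the bookkeeping for the Dirichlet nodes is omitted.

Octave
A = sparse(N,N); f = zeros(N,1);   % empty global stiffness matrix and vector
for k = 1:size(elem,1)             % loop over all triangles of the mesh
  ind = elem(k,:);                 % global node numbers of the three corners
  [AT,fT] = ElementContribution(nodes(ind,:),a(ind),b(ind),fV(ind));
  A(ind,ind) = A(ind,ind) + AT;    % add the element stiffness matrix
  f(ind) = f(ind) + fT;            % add the element vector
end%for
u = A\(-f);     % solve A*u + W*f = 0; the weights W are in the element vectors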
Observe that there is a systematic integration error due to replacing the true function by an approximate,
linear function.
This leads to the integrals

∫∫_T f · u dA ≈ (A/3) (f1 u1 + f2 u2 + f3 u3)
∫∫_T (1/2) b · u² dA ≈ (1/2) (A/3) (b1 u1² + b2 u2² + b3 u3²) .

The situation of one triangle in the xy plane and the corresponding triangle in the (x, y, u)–space is shown in Figure 6.2. A normal vector ~n is given by the vector product ~n = ~a × ~b.
~n = (x2−x1 , y2−y1 , u2−u1)ᵀ × (x3−x1 , y3−y1 , u3−u1)ᵀ
   = ( +(y2−y1)(u3−u1) − (u2−u1)(y3−y1) , −(x2−x1)(u3−u1) + (u2−u1)(x3−x1) , +(x2−x1)(y3−y1) − (y2−y1)(x3−x1) )ᵀ
   = λ (∂u/∂x , ∂u/∂y , −1)ᵀ with λ = −2 A .
2 Quietly extend the vector ~x = (x, y) ∈ R² to a vector ~x = (x, y, 0) ∈ R³.
Figure 6.2: One triangle in space (green) and projected to plane (blue)
The third component of this vector equals twice the oriented3 area A of the triangle. To obtain the gradient in the first two components, the vector has to be normalized such that the third component equals −1:

∇u = (∂u/∂x , ∂u/∂y)ᵀ = (−1/(2A)) ( +(y2−y1)(u3−u1) − (u2−u1)(y3−y1) , −(x2−x1)(u3−u1) + (u2−u1)(x3−x1) )ᵀ
It is an exercise to verify that the matrix Mᵀ M is symmetric and positive semidefinite. The expression vanishes if and only if u1 = u2 = u3. This corresponds to a horizontal plane in Figure 6.2.
where

AT = ((a1 + a2 + a3)/(12 A)) Mᵀ M + (A/3) diag(b1, b2, b3)   (6.5)

WT f~T = (A/3) (f1, f2, f3)ᵀ .   (6.6)
The generated mesh consists of 33 nodes, forming 48 triangles. On the boundary the values of the
function u are given by the known function g(x, y) and thus not all nodes are degrees of freedom for the
FEM problem. As a result find 17 interior points in this mesh and thus the resulting system of equations will
have 17 equations and unknowns.
• The variable elem contains a list of all the triangles and the node numbers of the corners.
For each triangle we find the element stiffness matrix AT, which will contribute to the global stiffness matrix A. As an example consider Figure 6.4 with 3 nodes for each triangle. The entries of AT have to be added to the previous entries in the global matrix A and accordingly the entries of ~bk have to be added to the global vector f~.

local ←→ global
triangle ←→ mesh
1 ←→ i
2 ←→ k
3 ←→ j
The above construction allows one to verify that the symmetry of the element stiffness matrices AT carries over to the global matrix A. If all element stiffness matrices are positive definite the global stiffness matrix will be positive definite. Once the matrix and the right hand side vector are generated, a linear system of equations has to be solved. Thus results from Chapter 2 have to be used. This assembling of the global stiffness matrix can be implemented in (almost) any programming language. As an example you might have a look at the Octave package FEMoctave on GitHub4.
4 Use github.com/AndreasStahel/FEMoctave or the documentation web.sha1.bfh.science/FEMoctave/FEMdoc.pdf.
There are different criteria on how to choose an optimal first node. Tests show that nodes with few neighbors are often good starting nodes. Thus one may choose nodes with the minimal number of neighbors. Also good candidates are nodes at extreme points of the discretized domain. A more detailed description of the Cuthill–McKee algorithm and how to choose starting points is given in [LascTheo87].
[Figure 6.5: a small mesh with 11 nodes, numbered by the Cuthill–McKee algorithm (left), and the pattern of the nonzero entries of the resulting 11 × 11 stiffness matrix (right)]
The algorithm is illustrated by numbering the simple mesh in Figure 6.5. On the right the structure of
the nonzero elements in the resulting stiffness matrix is shown. The band structure is clearly recognizable.
• The first node is chosen, since it has only two neighbors and is at one end of the domain.
• Node 1 has two neighbors; number 2 is given to the node above, since it has only one free neighbor. The node on the right of 1 (two free neighbors) will be number 3.
5 When using iterative solvers with sparse matrices, the reduction of bandwidth is irrelevant. Since many (newer) direct solvers internally renumber equations and variables, the importance of the Cuthill–McKee algorithm has clearly diminished.
6 Named after Elizabeth Cuthill and James McKee.
• Node 3 now also has only one free neighbor left, which receives number 5.
• Of the two free neighbors of node 4, the one above has fewer free neighbors and thus will receive number 6. The node on the right will be number 7.
As an example examine a BVP on the domain shown in Figure 6.6, where the mesh was generated by the program triangle (see [www:triangle]). The mesh has 518 nodes and the original numbering leads to a semi-bandwidth of 515, i.e. no band structure. Nonetheless we have a sparse matrix, since only 3368 entries are nonzero (i.e. 1.25%). The nonzero elements in the matrix A are shown in Figure 6.7, before and after applying the Cuthill–McKee algorithm. The new semi-bandwidth is 28. If finer meshes (more nodes) are used, then the improvements due to a good renumbering of the nodes will be even larger.
Within the band only 21% of the entries are not zero, i.e. we still have a certain sparsity within the band. The algorithm of Cholesky cannot take advantage of this sparsity within the band, but iterative methods can, as examined in Section 2.7.
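In Octave/MATLAB the reverse Cuthill–McKee renumbering is available as symrcm(). A minimal illustration with a random sparse symmetric matrix (the pattern, not the values, matters here):

Octave
A = sprandsym(500,0.01) + 50*speye(500);  % random sparse symmetric matrix
p = symrcm(A);                            % reverse Cuthill-McKee permutation
[i,j] = find(A);      bwOld = max(abs(i-j))   % semi-bandwidth before and ...
[i,j] = find(A(p,p)); bwNew = max(abs(i-j))   % ... after the renumbering
figure(1); spy(A); figure(2); spy(A(p,p))     % compare the two patterns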
[Figure 6.7: the nonzero entries of the matrix A, before (left) and after (right) applying the Cuthill–McKee algorithm]
u = BVP2D(Mesh,1,0,0,0,5,0,0,0);   % solve the BVP on the mesh Mesh (FEMoctave)
figure(2); FEMtrimesh(Mesh.elem,Mesh.nodes(:,1),Mesh.nodes(:,2),u)
xlabel('x'); ylabel('y'); zlabel('u'); view([160,35])
figure(3);
tricontour(Mesh.elem,Mesh.nodes(:,1),Mesh.nodes(:,2),u,linspace(0,1,11)+eps)
xlabel('x'); ylabel('y'); axis equal
[Figure 6.8: (a) surface of a solution, (b) contour levels of a solution]
• There is no control over the approximation error yet. The corresponding results will be examined in
Section 6.4, starting on page 417.
• It is not clear whether the approximation by piecewise linear functions was a good choice. This will be clarified in Section 6.4 and a more efficient approach, using second order elements, will be examined in Section 6.5.
• The numerical integration was done with a rather simplistic idea. A better approach will be presented
in Section 6.5.
• It is not obvious how to adapt the above approach to more general problems. This will be examined
in following sections.
∂²u(x, y)/∂x² + ∂²u(x, y)/∂y² = f(x, y) for (x, y) ∈ Ω = (0, a) × (0, b)
u(x, y) = 0 for (x, y) on the boundary ∂Ω .

Use the domain Ω in Figure 6.9 with a uniform, rectangular mesh, which has nx interior nodes in x direction and ny interior nodes in y direction. In the shown example nx = 18 and ny = 5. The step sizes are given by hx = a/(nx + 1) and hy = b/(ny + 1).
[Figure 6.9: the rectangular domain (0, a) × (0, b) with a uniform triangular mesh]
In the mesh in Figure 6.9 there are two types of triangles, shown in Figure 6.10, and thus one can compute all element contributions with the help of those two standard triangles.

[Figure 6.10: the two types of triangles, type A and type B, with legs hx and hy]
If the values ui of the function at the three corners of a type A triangle are known, then compute the gradient of the linearly interpolated function by

∇u = (−1/(2 area)) M · (u1, u2, u3)ᵀ = (−1/(2 area)) [ y3−y2 y1−y3 y2−y1 ; x2−x3 x3−x1 x1−x2 ] · (u1, u2, u3)ᵀ
   = (−1/(hx hy)) [ hy −hy 0 ; 0 hx −hx ] · (u1, u2, u3)ᵀ

and thus

Mᵀ · M = [ hy² −hy² 0 ; −hy² hy²+hx² −hx² ; 0 −hx² hx² ] .
Observe that the zeros in the off-diagonal corners are based on the facts y1 = y2 and x2 = x3. Equation (6.5) leads to the element stiffness matrix

AA = ((a1 + a2 + a3)/(12 area)) Mᵀ · M = (1/(2 hx hy)) [ hy² −hy² 0 ; −hy² hy²+hx² −hx² ; 0 −hx² hx² ]
Now construct the system of linear equations for the boundary value problem on the mesh in Figure 6.9. For this consider the mesh point in column i and row j (starting at the bottom) in the mesh and denote the value of the solution at this point by ui,j. In Figure 6.9 observe that ui,j is directly connected to 6 neighboring points and 6 triangles are used to build the connections. This is visualized in Figure 6.11; the left part of that figure represents the stencil at this point in the mesh.
Start by examining the contributions to ∫∫_Ω u · f dA involving the coefficient ui,j. As the coefficients in ~bA = ~bB are all constant, obtain six contributions of the size (hx hy/6) fi,j, leading to a total of

fi,j → 6 · (hx hy/6) = hx hy .
With similar arguments examine the contributions to (1/2) ∫∫_Ω ∇u · ∇u dA. Use the element matrices AA and AB to find

ui,j → (1/(2 hx hy)) (hy² + hx² + (hx² + hy²) + hy² + hx² + (hx² + hy²)) = 2 (hx² + hy²)/(hx hy)
ui+1,j → (1/(2 hx hy)) (−hx² + 0 + 0 + 0 + 0 − hx²) = −hx²/(hx hy)
ui−1,j → (1/(2 hx hy)) (0 + 0 − hx² − hx² + 0 + 0) = −hx²/(hx hy)
[Figure 6.11: the stencil at the node ui,j: the neighbors ui±1,j, ui,j±1, ui+1,j+1 and ui−1,j−1, and the 6 triangles connecting them]
ui,j+1 → (1/(2 hx hy)) (0 − hy² − hy² + 0 + 0 + 0) = −hy²/(hx hy)
ui,j−1 → (1/(2 hx hy)) (0 + 0 + 0 + 0 − hy² − hy²) = −hy²/(hx hy)
ui+1,j+1 → (1/(2 hx hy)) (0 + 0 + 0 + 0 + 0 + 0) = 0
ui−1,j−1 → (1/(2 hx hy)) (0 + 0 + 0 + 0 + 0 + 0) = 0
Observe that the two diagonal connections in the stencil lead to zero contributions. This is correct for the
rectangular mesh. Thus the resulting equation for the degree of freedom ui,j is given by
the stencil (shown for hx = hy = h)

         −1/h²
−1/h²     4/h²    −1/h²
         −1/h²

i.e. the standard five point stencil of the finite difference method.
In the following sections we will examine the first two contributions, and methods to minimize the errors.
• Using quadratic polynomials on each triangle will (usually) lead to smaller errors.
• Using the brilliant integration methods by Gauss, the influence of the integration error will be very
small.
~x = ~0 ⟺ ⟨~x , φ~⟩ = 0 for all φ~ ∈ R^N ⟺ ⟨~x , φ~i⟩ = 0 for i = 1, 2, . . . , N

This obvious fact also applies to Hilbert spaces, which will be the function spaces for the boundary value problems to be examined.
Let H be a Hilbert space with a finite dimensional subspace Hh. Let φ~i for i = 1, 2, 3, . . . , N be a basis of Hh, and thus any vector ~x ∈ Hh can be written in the form

~x = Σ_{j=1}^{N} xj φ~j .

With the above examine a weak solution of a system of linear equations and find the following.

A ~u = f~ ∈ Hh ⟺ A ~u − f~ = 0
            ⟺ ⟨A ~u − f~ , φ~⟩ = 0 for all φ~ ∈ Hh
            ⟺ ⟨A ~u − f~ , φ~j⟩ = 0 for all j = 1, 2, . . . , N
The functions a, b, f and gi are known functions and we have to determine the solution u, all depending on the independent variables (x, y) ∈ Ω ⊂ R². The vector ~n is the outer unit normal vector. The expression

~n · ∇u = n1 ∂u/∂x + n2 ∂u/∂y = ∂u/∂~n

equals the directional derivative of the function u in the direction of the outer normal ~n.
Consider a smooth test function φ vanishing on the Dirichlet boundary Γ1 and a classical solution u of the boundary value problem (6.8). Then multiply the differential equation in (6.8) by φ and integrate over the domain Ω.

0 = −∇ · (a ∇u + u ~b) + b0 u − f

0 = ∫∫_Ω φ (−∇ · (a ∇u + u ~b) + b0 u − f) dA
  = ∫∫_Ω ∇φ · (a ∇u + u ~b) + φ (b0 u − f) dA − ∫_Γ φ (a ∇u + u ~b) · ~n ds
  = ∫∫_Ω ∇φ · (a ∇u + u ~b) + φ (b0 u − f) dA − ∫_Γ2 φ (g2 + g3 u) ds
Ω
7
We knowingly ignore regularity problems, i.e. we assume that all expressions and solutions are smooth enough. These
problems are carefully examined in books and classes on PDEs.
6–2 Definition : A function u satisfying the above equation for all smooth test functions φ vanishing on Γ1 is said to be a weak solution of the BVP (6.8).
The above computation shows that classical solutions have to be weak solutions.
In particular find that for all functions φ vanishing on all of the boundary Γ we have

0 = ∫∫_Ω φ (−∇ · (a ∇u + u ~b) + b0 u − f) dA

and the fundamental lemma of the calculus of variations implies

0 = −∇ · (a ∇u + u ~b) + b0 u − f ,

i.e. we have a solution of the differential equation. This in turn now leads to

0 = ∫_Γ2 φ ((a ∇u + u ~b) · ~n − (g2 + g3 u)) ds

for all smooth functions φ on the boundary Γ2 and thus we also recover the boundary condition in (6.8)

0 = (a ∇u + u ~b) · ~n − (g2 + g3 u) .
Thus we have equivalence of weak and classical solutions (ignoring smoothness problems).
If there were an additional term ~c · ∇u in the PDE (6.8) we would have to consider an additional term in the above computations:

∫∫_Ω φ ~c · ∇u dA = ∫∫_Ω ∇ · (φ u ~c) − u ∇ · (φ ~c) dA
                 = ∮_Γ φ u ~c · ~n ds − ∫∫_Ω u ∇φ · ~c + u φ ∇ · ~c dA
In general the exact solution ue can not be found and we have to settle for an approximate solution uh, where the parameter h corresponds to the typical size (e.g. diameter) of the elements used for the approximation. Obviously we hope for the solution uh to converge to the exact solution ue as h approaches 0. It is the goal of this section to show under what circumstances this is in fact the case and also to determine the rate of convergence. The methods and ideas used can also be applied to partial differential equations with more independent variables.
• There is a positive number α0 such that 0 < α0 ≤ a ≤ α1 and 0 ≤ b0 ≤ β1 for all points in the
domain Ω ⊂ R2 .
• The quadratic functional F (u) is strictly positive definite. This condition is satisfied if
– either on a nonempty section Γ1 of the boundary the Dirichlet boundary condition is imposed
– or the function b0 is strictly positive on a subdomain.
There are other combinations of conditions to arrive at a strictly positive functional F , but the above
two are easiest to verify.
With the above assumptions we know that the BVP (6.9) has exactly one solution. The proof of this result
is left to mathematicians. As a rule of thumb we use that the solution u is (k + 2)-times differentiable if f
is k-times differentiable.
This mathematical result tells us that there is a unique solution ue of the boundary value problem, but it
does not give the solution. Now we use the finite element method to find numerical approximations uh to
this exact solution ue .
Basic properties of the integral imply that the bilinear form A is symmetric and linear with respect to each
argument, i.e. for λi ∈ R we have
A(u, v) = A(v, u)
A(λ1 u1 + λ2 u2 , v) = λ1 A(u1 , v) + λ2 A(u2 , v)
A(u, λ1 v1 + λ2 v2 ) = λ1 A(u, v1 ) + λ2 A(u, v2 ) .
L2 := {u : Ω → R | u is square integrable} ,  ‖u‖2² := ⟨u, u⟩ = ∫∫_Ω u² dA .

For the function u to be in the smaller subspace V we require the function u and its derivatives ∇u to be square integrable and u has to satisfy the Dirichlet boundary condition (if there are any imposed). The norm in this space is given by

‖u‖V² := ‖∇u‖2² + ‖u‖2² = ∫∫_Ω |∇u|² + u² dA = ∫∫_Ω (∂u/∂x)² + (∂u/∂y)² + u² dA .

L2 and V are vector spaces and ⟨ · , · ⟩ is a scalar product on L2. V is called a Sobolev space. Obviously we have

V ⊂ L2 and ‖u‖2 ≤ ‖u‖V .
|A(u, v)| = |∫∫_Ω a ∇u · ∇v + b0 u v dA| ≤ ∫∫_Ω |a| |∇u| |∇v| + |b0| |u| |v| dA
          ≤ α1 ∫∫_Ω |∇u| |∇v| dA + β1 ∫∫_Ω |u| |v| dA ≤ α1 ‖∇u‖2 ‖∇v‖2 + β1 ‖u‖2 ‖v‖2
          ≤ (α1 + β1) ‖u‖V ‖v‖V .
It can be shown9 that the above inequality is correct as long as the assumptions in Section 6.4.1 are satisfied. Thus find

γ0 ‖u‖V² ≤ A(u, u) ≤ (α1 + β1) ‖u‖V² for all u ∈ V .   (6.10)

This inequality is the starting point for most theoretical results on boundary value problems of the type examined in these notes. The bilinear form A(·, ·) is called coercive or elliptic. For the purposes of these notes it is sufficient to realize that the expression A(u, u) corresponds to the squared integral of the function u and its partial derivatives of order 1.
The functions in the finite dimensional subspace Vh have to be piecewise differentiable and everywhere
continuous. This condition is necessary, since we try to minimize a functional involving first order partial
derivatives. This property is called conforming elements. Instead of searching for a minimum on all of V
we now only consider functions in Vh ⊂ V to find the minimizer of the functional. This is illustrated in
Table 6.2. We hope that the minimum uh ∈ Vh will be close to the exact solution ue ∈ V . The main goal
of this section is to show that this is in fact the case. The ideas of proofs are adapted from [John87, p.54]
and [Davi80, §7] and can also be used in more general situations, e.g. for differential equations with more
independent variables. To simplify the proof of the abstract error estimate we use two lemmas.
9 The correct mathematical result to be used is Poincaré's inequality. There exists a constant C (depending on Ω only) such that for any smooth function u vanishing on the boundary we have

∫∫_Ω u² dA ≤ C ∫∫_Ω |∇u|² dA .

This inequality replaces the condition 0 < β0 ≤ b. Intuitively the inequality shows that the values of the function are controlled by the values of the derivatives. For elasticity problems Korn's inequality will play the same role.
6–5 Lemma : Let ue ∈ V be the minimizer of the functional F on all of V and let uh ∈ Vh be the minimizer of

F(ψh) = (1/2) A(ψh, ψh) − ⟨f, ψh⟩ − ⟨g2, ψh⟩_Γ2

amongst all ψh ∈ Vh. This implies that uh ∈ Vh is also the minimizer of

G(ψh) = A(ue − ψh , ue − ψh)

amongst all ψh ∈ Vh. 3
Proof : If uh ∈ Vh minimizes F in Vh and ue ∈ V minimizes F in V then the previous lemma implies A(ue − uh, φh) = 0 for all φh ∈ Vh. For an arbitrary φh ∈ Vh examine

G(uh + φh) = A(ue − uh − φh , ue − uh − φh)
           = A(ue − uh , ue − uh) − 2 A(ue − uh , φh) + A(φh , φh)
           = A(ue − uh , ue − uh) + A(φh , φh)
           ≥ A(ue − uh , ue − uh) .

Equality occurs only if φh = 0. Thus φh = 0 ∈ Vh is the unique minimizer of the above function and the result is established. 2
It assumes that the integrations are carried out without error. Since we will use Gauss integration this is not far from the truth.
As a consequence of Céa's Lemma we have to be able to approximate the exact solution ue ∈ V by functions ψh ∈ Vh, and the error of the finite element solution uh ∈ Vh is smaller than the approximation error, except for the factor k. Thus the Lemma of Céa reduces the question of estimating the error of the approximate solution to a question of estimating the approximation error for a given function (the exact solution) in the energy norm. Standard interpolation results allow to estimate the error of the approximation, assuming some regularity on the exact solution u.
Proof : Use the inequality (6.10) and the above lemma to conclude that

γ0 ‖ue − uh‖V² ≤ A(ue − uh , ue − uh) ≤ A(ue − uh − φh , ue − uh − φh) ≤ (α1 + β1) ‖ue − uh − φh‖V²

and thus

‖ue − uh‖V ≤ √((α1 + β1)/γ0) ‖ue − uh − φh‖V for all φh ∈ Vh .

As φh ∈ Vh is arbitrary find the claimed result. 2
• The concept of stability of a finite difference approach is replaced by the coercive bilinear form
A(u, u), e.g. inequality (6.10). These rather mathematical results might be difficult to prove. For
second order boundary value problems of the type (6.9) Poincaré’s inequality has to be used. For
elasticity problems the tool to be used is Korn’s inequality.
• The abstract error estimate in Theorem 6–6 then implies that the FEM approximation uh converges
to the true solution u0 , in some very specific sense. Thus Theorem 6–6 replaces the Lax Equivalence
Theorem 4–2.
Example 6–1 (page 412) illustrates that in special cases a finite element approximation is equivalent to a
finite difference approximation. ♦
Πh : V → Vh ,  u ↦ Πh u

For two neighboring triangles the interpolated functions will coincide along the common edge, since the linear functions coincide at the two corners. Thus the interpolated function is continuous on the domain and we have conforming elements. The interpolated function Πh u and the original function u coincide if u happens to be a linear function. By considering a Taylor expansion one can verify10 that the typical approximation error of the function is of the order c h², where the constant c depends on higher order derivatives of u. The error of the gradient is of order h.
6–8 Lemma : For the piecewise linear interpolation find

‖u − Πh u‖V ≤ c h |u|2

where

|u|2² = ∫∫_Ω (∂²u/∂x²)² + (∂²u/∂x∂y)² + (∂²u/∂y²)² dA .

The constant c does not depend on h and the function u, as long as a minimal angle condition is satisfied. 3
10 Use the fact that the quadratic terms in the Taylor expansion lead to an approximation error. For an error vanishing at the nodes at x = 0 and h we use the function f(x) = a · x · (h − x) with derivatives f′(x) = a (h − 2x) and f″(x) = −2a. Since the maximal value a · h²/4 is attained at h/2 we find |f(x)| ≤ a h²/4 = (h²/8) max |f″| and |f′(x)| ≤ (h/2) max |f″| for all 0 ≤ x ≤ h.
An exact statement and proof of this result is given in [Brae02, §II.6]. The result is based on the fundamental Bramble–Hilbert lemma.
Now we have all the ingredients to state and prove the basic convergence results for finite element solutions to boundary value problems in two variables. The exact solution ue ∈ V to be approximated is the minimizer of the functional

F(u) = ∫∫_Ω (1/2) a (∇u)² + (1/2) b0 u² − f u dA − ∫_Γ2 g2 u ds
On a smooth domain Ω ⊂ R2 the exact solution ue is smooth (often differentiable) if a, b0 , f and g2 are
smooth. Instead of searching on the space V we restrict the search on the finite dimensional subspace Vh
and arrive at the approximate minimizer uh . Thus the error function e = uh − ue has to be as small as
possible for the approximation to be of a good quality. In fact we hope for a convergence
uh −→ ue as h −→ 0
6–9 Theorem : Examine the boundary value problem (6.9) where the conditions are such that the unique solution u and all its partial derivatives up to order 2 are square integrable over the domain Ω. If the subspace Vh is generated by the piecewise linear interpolation operator Πh then we find

‖uh − ue‖V ≤ C h and ‖uh − ue‖2 ≤ C1 h²

for some constants C and C1 independent of h.
We may say that
• uh converges to ue with an error proportional to h² as h → 0.
Observe that the above estimates are not point-wise estimates. It is the integrals of the solution and its derivatives that are controlled.
Proof : The interpolation result 6–8 and the abstract error estimate 6–6 imply immediately

‖uh − ue‖V ≤ k ‖ue − Πh ue‖V ≤ C h |ue|2 .

This is the first of the desired estimates. It shows that the error of the function and its first order partial derivatives are approximately proportional to h. The second estimate states that the error of the function is proportional to h². This second estimate requires considerably more work. The method of proof is known as the Nitsche trick and is due to Joachim Nitsche and Jean-Pierre Aubin. A good presentation is given in [StraFix73, §3.4] or [KnabAnge00, Satz 3.37]. For the sake of completeness a similar presentation is shown below.
Use the notation e = uh − ue and examine the auxiliary problem: let w ∈ V solve A(w, ψ) = −⟨e, ψ⟩ for all ψ ∈ V.
Regularity theory now implies that the second order derivatives of w are bounded by the values of e (in the L2 sense), or more precisely

|w|2 ≤ c ‖e‖2 = c ‖uh − ue‖2 .

The interpolation result 6–8 leads to

‖w − Πh w‖V ≤ c h |w|2 ≤ ĉ h ‖e‖2 .

By choosing ψ = e arrive at

−A(w, e) = ⟨e, e⟩ = ‖e‖2² = ∫∫_Ω |uh − ue|² dA .

Since A(e, φh) = 0 for all φh ∈ Vh conclude

‖e‖2² = −A(w, e) = −A(w − Πh w , e) ≤ (α1 + β1) ‖w − Πh w‖V ‖e‖V ≤ ĉ h ‖e‖2 · C h |ue|2

and thus

‖uh − ue‖2 ≤ C1 h² .
f(x, y) = c1 + c2 x + c3 y + c4 x² + c5 x y + c6 y²

[Figure: a triangle with the values at the three corners and at the three midpoints of the edges as degrees of freedom]
By considering a Taylor expansion one can verify11 that the typical approximation error of the function is
of the order c h3 where the constant c depends on higher order derivatives of u. The error of the gradient is
of order h2 .
6–10 Lemma : For the piecewise quadratic interpolation find

‖u − Πh u‖V ≤ c h² |u|3 ,

where |u|3² is the sum of all squared and integrated partial derivatives of order 3. The constant c does not depend on h and the function u, as long as a minimal angle condition is satisfied. 3
An exact statement and proof of this result is given in [Brae02, §II.6]. The result is based on the fundamental Bramble–Hilbert lemma. Based on this interpolation estimate we can again formulate the basic convergence result for a finite element approximation using piecewise quadratic approximations.

6–11 Theorem : Examine the boundary value problem (6.9) where the conditions are such that the unique solution u and all its partial derivatives up to order 3 are square integrable over the domain Ω. If the subspace Vh is generated by the piecewise quadratic interpolation operator Πh then we find

‖uh − ue‖V ≤ C h² and ‖uh − ue‖2 ≤ C1 h³

for some constants C and C1 independent of h.
We may say that
• uh converges to ue with an error proportional to h³ as h → 0.
11 Use the fact that the cubic terms in the Taylor expansion lead to an approximation error. For an error vanishing at the nodes at x = 0 and ±h we use the function f(x) = a · x · (h² − x²) with derivatives f′(x) = a (h² − 3x²), f″(x) = −6 a x and f‴(x) = −6 a. The maximal value 2 a h³/(3√3) of the function is attained at ±h/√3; we find |f(x)| ≤ c h³ max |f‴| and |f′(x)| ≤ c h² max |f‴| for all −h ≤ x ≤ h.
Proof : The interpolation result 6–10 and the abstract error estimate 6–6 imply immediately

‖uh − ue‖V ≤ k ‖ue − Πh ue‖V ≤ C h² |ue|3 ,

which is already the first of the desired estimates. The second estimate has to be verified with the Aubin–Nitsche method, as in the proof of Theorem 6–9. 2
Observe that the convergence with quadratic interpolation (Theorem 6–11) is improved by a factor of
h compared to the linear interpolation, i.e. Theorem 6–9. Thus one might be tempted to increase the order
of the approximating polynomials further and further. But there are also reasons that speak against such a
process:
• Carrying out the interpolation will get more and more difficult. In particular the continuity across the
edges of the triangles is not easily obtained. It is more difficult to construct higher order conforming
elements.
• For higher order approximations to be effective we need bounds on higher order derivatives of the
exact solution ue . This might be difficult or impossible to achieve. If the domain is a polygon, there
will be corners and smoothness of the solution is far from obvious. Some of the coefficient functions
in the BVP (6.9) might not be smooth, e.g. by two different materials used in the domain. Thus we
might not benefit from a higher order convergence with higher order elements.
In the interior of the domain Ω smoothness of the exact solution ue is often true and with higher or-
der approximations we get a faster convergence. Thus piecewise approximations of orders 1, 2 and 3 are
regularly used. In the next section a detailed construction of second order elements is presented.
Presentations rather similar to the above can be found in many books on FEM. As example consult
[Brae02] for a proof of energy estimates and also for error estimates in the L∞ norm, i.e. point wise
estimates. In [AxelBark84] find Céa’s lemma and regularity results for non-symmetric problems.
The function u is a weak solution of the above BVP if it satisfies the boundary condition u = g1 on Γ1 and for all test functions φ (vanishing on Γ1) we have the integral condition

0 = ∫∫_Ω ∇φ · (a ∇u + u ~b) + φ (b0 u − f) dA − ∫_Γ2 φ (g2 + g3 u) ds .
The domain Ω ⊂ R2 is triangulated and the values of the function u at the corners and the midpoints of
the edges of the triangles are considered as degrees of freedom of the system. This leads to a vector ~u to be
determined. We have to find a discretized version of the integrals in the above functions and determine the
global stiffness matrix A such that the above integral condition translates to
0 = ⟨A ~u , φ~⟩ + ⟨W f~ , φ~⟩ .
Then the discretized solution is given as solution of the linear systems of equations
A ~u + W f~ = ~0 .
All computations should be formulated with matrices, such that an implementation with Octave/MATLAB
will be easy.
The order of presentation is as follows:
• Examine the basis functions for a second order element on the standard triangle.
• Integration of f φ .
• Integration of b0 u φ .
• Integration of u ~b · ∇φ.
• Integration of a ∇u · ∇φ.
(x, y)ᵀ = (x1, y1)ᵀ + ξ (x2 − x1 , y2 − y1)ᵀ + ν (x3 − x1 , y3 − y1)ᵀ = (x1, y1)ᵀ + T · (ξ, ν)ᵀ
where

T = [ x2−x1 x3−x1 ; y2−y1 y3−y1 ] = ∂(x, y)/∂(ξ, ν) .

If the coordinates (x, y) are given find the values of (ξ, ν) with the help of

(ξ, ν)ᵀ = T⁻¹ · (x − x1 , y − y1)ᵀ = (1/det T) [ y3−y1 −x3+x1 ; −y2+y1 x2−x1 ] · (x − x1 , y − y1)ᵀ .

Since the area of the standard triangle Ω is 1/2, the area of E is given by (1/2) |det T|. For a numerical integration over the standard triangle Ω choose some integration points ~gj ∈ Ω and the corresponding weights wj for j = 1, 2, . . . , m and then work with

∫∫_Ω f(ξ~) dA ≈ Σ_{j=1}^{m} wj f(~gj) .   (6.12)
The integration points and weights have to be chosen, such that the integration error is as small as possible.
There are many integration schemes available. One of the early papers is [Cowp73] and a list is shown
in [Hugh87, Table 3.1.1, p. 173] and a short list in [TongRoss08]12 , [Gmur00, Tableau D3, page 233]
or [Zien13, Table 6.3, p. 181].
As a concrete and useful example use the points ~g1 = (1/2)(λ1, λ1) and ~g4 = (1/2)(λ2, λ2) along the diagonal
ξ = ν. Similarly use two more points along each connecting straight line from a corner of the triangle to
the midpoint of the opposing edge. This leads to a total of 6 integration points where groups of 3 have the
same weight, i.e. w1 = w2 = w3 and w4 = w5 = w6 . Finally add the midpoint with weight w7 . This
is illustrated in Figure 6.16. The nodes are shown in red and the integration points in blue. This choice
satisfies two essential conditions:
• If a sample point is used in a Gauss integration, then all other points obtainable by permuting the three
corners of the triangle must appear and with identical weight.
• All sample points must be inside the triangle (or on the triangle boundary) and all weights must be
positive.
Then one arrives at a 7 × 2 matrix G (see equation (6.13)) containing in each row the coordinates of one
integration point ~gj and a vector w
~ with the corresponding integration weights.
12
The results are known as Hammer’s formula. There is a typing error in [TongRoss08, Table 6.2, page 190]. For the integration
scheme using four points the coefficient for the central point is negative, i.e. −0.5625. Thus the scheme should not be used for
FEM codes, since there is a danger that the stiffness matrix will not be positive definite. For integration purposes the scheme does
just fine.
Figure 6.16: Gauss integration of order 5 on the standard triangle, using 7 integration points
To determine the optimal values set up and solve a system of five nonlinear equations for the unknowns λ1, λ2, w1, w4 and w7. The integration is approximated by

∫∫_Ω f(~x) dA ≈ w1 (f(ξ~1) + f(ξ~2) + f(ξ~3)) + w4 (f(ξ~4) + f(ξ~5) + f(ξ~6)) + w7 f(ξ~7) .

We require that ξ^k for 0 ≤ k ≤ 5 be integrated exactly. This generates five equations to be solved13. Due to the symmetric arrangement of the integration points this implies that all polynomials up to degree 5 are integrated exactly. The equations to be solved are generated by elementary computations.
∫∫_Ω 1 dA = 1/2 = 3 w1 + 3 w4 + w7
∫∫_Ω ξ dA = 1/6 = w1 (λ1/2 + (1 − λ1) + λ1/2) + w4 (λ2/2 + (1 − λ2) + λ2/2) + (1/3) w7
            = w1 + w4 + (1/3) w7 ,  equivalent to the above condition
∫∫_Ω ξ² dA = 1/12 = w1 (1 − 2 λ1 + (3/2) λ1²) + w4 (1 − 2 λ2 + (3/2) λ2²) + (1/9) w7
∫∫_Ω ξ³ dA = 1/20 = w1 (1 − 3 λ1 + 3 λ1² − (3/4) λ1³) + w4 (1 − 3 λ2 + 3 λ2² − (3/4) λ2³) + (1/27) w7
∫∫_Ω ξ⁴ dA = 1/30 = w1 (1 − 4 λ1 + 6 λ1² − 4 λ1³ + (9/8) λ1⁴) + w4 (1 − 4 λ2 + 6 λ2² − 4 λ2³ + (9/8) λ2⁴) + (1/81) w7
∫∫_Ω ξ⁵ dA = 1/42 = w1 (1 − 5 λ1 + 10 λ1² − 10 λ1³ + 5 λ1⁴ − (15/16) λ1⁵)
                    + w4 (1 − 5 λ2 + 10 λ2² − 10 λ2³ + 5 λ2⁴ − (15/16) λ2⁵) + (1/243) w7
13 As example we consider the case f(ξ) = ξ² with some more details.

∫∫_Ω ξ² dA = ∫_0^1 ∫_0^{1−ξ} ξ² dν dξ = ∫_0^1 (1 − ξ) ξ² dξ = (ξ³/3 − ξ⁴/4)|_{ξ=0}^{1} = 1/12

Σ_{j=1}^{7} wj f(~gj) = w1 ((λ1/2)² + (1 − λ1)² + (λ1/2)²) + w4 ((λ2/2)² + (1 − λ2)² + (λ2/2)²) + w7 (1/3)²
                     = w1 (1 − 2 λ1 + (3/2) λ1²) + w4 (1 − 2 λ2 + (3/2) λ2²) + (1/9) w7
The above leads to a system of five nonlinear equations for the five unknowns λ1 , λ2 , w1 , w4 and w7 . This
can be solved by Octave/MATLAB or Mathematica 14 . There are multiple solutions possible, but you need a
solution satisfying the following key properties:
• Pick a solution with 0 < λ1 < λ2 < 1. This corresponds to the desirable result that all Gauss points
are inside the triangle.
• Pick a solution with positive weights w1 , w4 and w7 . This guarantees that the integral of a positive
function is positive.
The equation resulting from the integral of ξ over the triangle is identical to the equation generated by the integral of 1 and thus not taken into account. Due to the symmetry of the Gauss points and the weights one can verify that all polynomials up to degree 5 are integrated exactly, e.g. ν, ν⁵, ξ²ν³, . . .
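A possible setup for fsolve() is sketched below (variable names chosen here): the function s(λ) collects the sums over one group of three integration points for ξ⁰, ξ², ξ³, ξ⁴ and ξ⁵, the vector c contains the contributions of the midpoint and rhs the exact integrals.

Octave
s = @(l) [3; 1-2*l+3/2*l^2; 1-3*l+3*l^2-3/4*l^3; ...
          1-4*l+6*l^2-4*l^3+9/8*l^4; 1-5*l+10*l^2-10*l^3+5*l^4-15/16*l^5];
c = [1; 1/9; 1/27; 1/81; 1/243];        % contributions of the midpoint
rhs = [1/2; 1/12; 1/20; 1/30; 1/42];    % exact values of the integrals
F = @(p) p(3)*s(p(1)) + p(4)*s(p(2)) + p(5)*c - rhs;
p = fsolve(F,[0.2; 0.9; 0.06; 0.07; 0.11])  % p = [lambda1,lambda2,w1,w4,w7]

With the above starting values fsolve() should converge to the solution with 0 < λ1 < λ2 < 1 and positive weights.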
The solution is given by

G = [ λ1/2  λ1/2 ;      [ 0.101287 0.101287 ;
      1−λ1  λ1/2 ;        0.797427 0.101287 ;
      λ1/2  1−λ1 ;        0.101287 0.797427 ;
      λ2/2  λ2/2 ;   ≈    0.470142 0.470142 ;
      1−λ2  λ2/2 ;        0.059716 0.470142 ;
      λ2/2  1−λ2 ;        0.470142 0.059716 ;
      1/3   1/3  ]        0.333333 0.333333 ]

w~ = (w1, w2, w3, w4, w5, w6, w7)ᵀ ≈ (0.0629696, 0.0629696, 0.0629696, 0.0661971, 0.0661971, 0.0661971, 0.1125000)ᵀ .   (6.13)
Using the transformation results in this section compute the coordinates XG of the Gauss integration points in a general triangle by

XG = (x1, y1) + (T · Gᵀ)ᵀ = (x1, y1) + G · Tᵀ ,

where the point (x1, y1) is added to each row of G · Tᵀ.
This approximate integration yields exact results for polynomials up to degree 5. Thus for one triangle with diameter h and an area of the order h² the integration error for smooth functions is of the order h⁶ · h² = h⁸. When dividing a large domain into sub-triangles of size h this leads to a total integration error of the order h⁶. For most problems this error will be considerably smaller than the approximation error of the FEM method, so one can ignore this error contribution and we will from now on assume that the integrations yield exact results.
The above can be implemented in Octave or MATLAB.
IntegrationTriangle.m
function res = IntegrationTriangle(corners,func)
  % integrate func over the triangle given by the 3x2 matrix corners,
  % using the Gauss points G and weights w from equation (6.13)
  l1 = (12-2*sqrt(15))/21;  l2 = (12+2*sqrt(15))/21;
  G = [l1/2 l1/2; 1-l1 l1/2; l1/2 1-l1; l2/2 l2/2; 1-l2 l2/2; l2/2 1-l2; 1/3 1/3];
  w = [(155-sqrt(15))/2400*[1 1 1], (155+sqrt(15))/2400*[1 1 1], 9/80];
  T = [corners(2,:)-corners(1,:);corners(3,:)-corners(1,:)];
  if ischar(func)   % evaluate the function at the Gauss points in the triangle
    P = G*T; P(:,1) = P(:,1)+corners(1,1); P(:,2) = P(:,2)+corners(1,2);
    val = feval(func,P);
  else              % or use values already given at the Gauss points
    val = func(:);
  end%if
  res = w*val*abs(det(T));
end%function

14 The exact values are λ1 = (12 − 2√15)/21, λ2 = (12 + 2√15)/21, w1 = (155 − √15)/2400, w4 = (155 + √15)/2400 and w7 = 9/80.
and the above function can be tested by integrating the function x + y² over a triangle.

Octave
corners = [0 0; 1 0; 0 1];
IntegrationTriangle(corners,'intFunc')
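The call refers to a function intFunc, whose listing is not shown; a matching definition for the integrand x + y² is

intFunc.m
function val = intFunc(P)
  % evaluate the integrand x + y^2 at the points given by the rows of P
  val = P(:,1) + P(:,2).^2;
end%function

For the triangle with corners (0,0), (1,0) and (0,1) the exact value ∫∫ x + y² dA = 1/6 + 1/12 = 1/4 is reproduced, since the integrand is a polynomial of degree 2 and thus integrated exactly by the rule of order 5.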
Observations:
• To save computation time some FEM algorithms use a simplified numerical integration. If the func-
tions to be examined are close to constants over each triangle, then the error is acceptable.
• It is important to observe that the functions and the solutions are only evaluated at the integration
points. This may lead to surprising (and wrong) results. Keywords: hourglassing and shear locking.
• The material properties are usually given by coefficient functions, e.g. for Young’s modulus E and
Poisson’s ratio ν. Thus these properties are evaluated at the Gauss points, but not at the nodes. This
can lead to surprising extrapolation effects, e.g. a material constraint might not be satisfied at the
nodes, but only at the integration points.
Φ1 (ξ, ν) = (1 − ξ − ν) (1 − 2 ξ − 2 ν)
Φ2 (ξ, ν) = ξ (2 ξ − 1)
Φ3 (ξ, ν) = ν (2 ν − 1)
Φ4 (ξ, ν) = 4 ξ ν
Φ5 (ξ, ν) = 4 ν (1 − ξ − ν)
Φ6 (ξ, ν) = 4 ξ (1 − ξ − ν)
and find their graphs in Figure 6.17. Any quadratic polynomial f on the standard triangle Ω can be written as a linear combination of these basis functions.
[Figure 6.17: the graphs of the six basis functions Φ1 through Φ6 on the standard triangle]
If the numbers fi represent the values of a quadratic function f(x, y) at the corners and midpoints of a general triangle then use the above and the transformation rule to conclude

∫∫_E f(~x) dA = |det T| ⟨w~ , M · f~⟩ = |det T| ⟨Mᵀ · w~ , f~⟩ .   (6.14)
The interpolation matrix M

M ≈ [ +0.4743526 −0.0807686 −0.0807686 0.0410358 0.3230744 0.3230744 ;
      −0.0807686 +0.4743526 −0.0807686 0.3230744 0.0410358 0.3230744 ;
      −0.0807686 −0.0807686 +0.4743526 0.3230744 0.3230744 0.0410358 ;
      −0.0525839 −0.0280749 −0.0280749 0.8841342 0.1122998 0.1122998 ;
      −0.0280749 −0.0525839 −0.0280749 0.1122998 0.8841342 0.1122998 ;
      −0.0280749 −0.0280749 −0.0525839 0.1122998 0.1122998 0.8841342 ;
      −0.1111111 −0.1111111 −0.1111111 0.4444444 0.4444444 0.4444444 ]
and the weight vector w~ do not depend on the triangle E, but only on the standard elements and the choice
of integration method. Thus for a new triangle E only the determinant of T has to be computed. Since
$$M^T\cdot\vec w = \begin{pmatrix} 0\\ 0\\ 0\\ 1/6\\ 1/6\\ 1/6 \end{pmatrix}$$
the integration of quadratic functions by (6.14) is rather easy to do: add up the values of the function at the
three mid-points of the edges, then divide the result by 6 and multiply by | det T|.
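A minimal sketch of this midpoint rule; the triangle and the quadratic integrand are chosen for illustration only:
Octave
corners = [0 0; 1 0; 0 1];
T = [corners(2,:)-corners(1,:); corners(3,:)-corners(1,:)];
f = @(x,y) x + y.^2;           % any polynomial of degree at most 2
mid = [0.5 0; 0.5 0.5; 0 0.5]; % midpoints of the three edges
res = abs(det(T))*sum(f(mid(:,1),mid(:,2)))/6 % exact value 1/4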
where
$$M^T\cdot\operatorname{diag}(\vec w)\cdot M = \frac{1}{360}\begin{bmatrix}
6 & -1 & -1 & -4 & 0 & 0\\
-1 & 6 & -1 & 0 & -4 & 0\\
-1 & -1 & 6 & 0 & 0 & -4\\
-4 & 0 & 0 & 32 & 16 & 16\\
0 & -4 & 0 & 16 & 32 & 16\\
0 & 0 & -4 & 16 & 16 & 32
\end{bmatrix}\,.$$
This result may be confirmed with the help of a program capable of symbolic computations.
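Instead of symbolic computations the result may also be confirmed numerically, since the Gauss integration (6.13) is exact for the quartic products $\Phi_i\,\Phi_j$. A sketch:
Octave
l1 = (12-2*sqrt(15))/21; l2 = (12+2*sqrt(15))/21;
G = [l1/2 l1/2; 1-l1 l1/2; l1/2 1-l1; l2/2 l2/2; 1-l2 l2/2; l2/2 1-l2; 1/3 1/3];
w = [(155-sqrt(15))/2400*[1 1 1], (155+sqrt(15))/2400*[1 1 1], 9/80];
xi = G(:,1); nu = G(:,2); % the six basis functions at the Gauss points
M = [(1-xi-nu).*(1-2*xi-2*nu), xi.*(2*xi-1), nu.*(2*nu-1), ...
     4*xi.*nu, 4*nu.*(1-xi-nu), 4*xi.*(1-xi-nu)];
round(360*M'*diag(w)*M) % reproduces the integer matrix above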
The unknown function $u$ and the test function $\varphi$ will be given at the nodes. Now develop numerical integration formulas for the above expressions. We have to aim for expressions of the form $\langle A\cdot\vec u\,,\,\vec\varphi\rangle$ and $\langle \vec b\,,\,\vec\varphi\rangle$.
Simplification if f is constant
Now the contribution to the element vector is
$$f\,|\det T|\;M^T\cdot\vec w = f\,|\det T|\begin{pmatrix} 0\\ 0\\ 0\\ 1/6\\ 1/6\\ 1/6 \end{pmatrix}$$
and thus the effect of $\iint_E f\,\varphi\,dA$ is very easy to implement.
Simplification if b0 is constant
Now the contribution to the element stiffness matrix is
$$b_0\,|\det T|\;M^T\cdot\operatorname{diag}(\vec w)\cdot M = \frac{b_0\,|\det T|}{360}\begin{bmatrix}
6 & -1 & -1 & -4 & 0 & 0\\
-1 & 6 & -1 & 0 & -4 & 0\\
-1 & -1 & 6 & 0 & 0 & -4\\
-4 & 0 & 0 & 32 & 16 & 16\\
0 & -4 & 0 & 16 & 32 & 16\\
0 & 0 & -4 & 16 & 16 & 32
\end{bmatrix}\,.$$
According to Section 6.5.1 the coordinates $(\xi,\nu)$ of the standard triangle are connected to the global coordinates $(x,y)$ by
$$\begin{pmatrix} x\\ y \end{pmatrix} = \begin{pmatrix} x_1\\ y_1 \end{pmatrix} + \begin{bmatrix} x_2-x_1 & x_3-x_1\\ y_2-y_1 & y_3-y_1 \end{bmatrix}\cdot\begin{pmatrix} \xi\\ \nu \end{pmatrix} = \begin{pmatrix} x_1\\ y_1 \end{pmatrix} + T\cdot\begin{pmatrix} \xi\\ \nu \end{pmatrix}$$
or equivalently
$$\begin{pmatrix} \xi\\ \nu \end{pmatrix} = T^{-1}\cdot\begin{pmatrix} x-x_1\\ y-y_1 \end{pmatrix} = \frac{1}{\det(T)}\begin{bmatrix} y_3-y_1 & -x_3+x_1\\ -y_2+y_1 & x_2-x_1 \end{bmatrix}\cdot\begin{pmatrix} x-x_1\\ y-y_1 \end{pmatrix}\,.$$
If a function $f(x,y)$ is given on the general triangle E we can pull it back to the standard triangle by
$$g(\xi,\nu) = f(x(\xi,\nu)\,,\,y(\xi,\nu))$$
and then compute the gradient of $g$ with respect to its independent variables $\xi$ and $\nu$. The result will depend on the partial derivatives of $f$ with respect to $x$ and $y$. The standard chain rule implies
$$\frac{\partial}{\partial\xi}\,g(\xi,\nu) = \frac{\partial}{\partial\xi}\,f(x(\xi,\nu)\,,\,y(\xi,\nu)) = \frac{\partial f(x,y)}{\partial x}\,\frac{\partial x}{\partial\xi} + \frac{\partial f(x,y)}{\partial y}\,\frac{\partial y}{\partial\xi} = (x_2-x_1)\,\frac{\partial f(x,y)}{\partial x} + (y_2-y_1)\,\frac{\partial f(x,y)}{\partial y}$$
$$\frac{\partial}{\partial\nu}\,g(\xi,\nu) = \frac{\partial}{\partial\nu}\,f(x(\xi,\nu)\,,\,y(\xi,\nu)) = \frac{\partial f(x,y)}{\partial x}\,\frac{\partial x}{\partial\nu} + \frac{\partial f(x,y)}{\partial y}\,\frac{\partial y}{\partial\nu} = (x_3-x_1)\,\frac{\partial f(x,y)}{\partial x} + (y_3-y_1)\,\frac{\partial f(x,y)}{\partial y}\,.$$
This can be written with the help of matrices as
$$\begin{pmatrix} \frac{\partial g}{\partial\xi}\\[4pt] \frac{\partial g}{\partial\nu} \end{pmatrix} = \begin{bmatrix} (x_2-x_1) & (y_2-y_1)\\ (x_3-x_1) & (y_3-y_1) \end{bmatrix}\cdot\begin{pmatrix} \frac{\partial f}{\partial x}\\[4pt] \frac{\partial f}{\partial y} \end{pmatrix} = T^T\cdot\begin{pmatrix} \frac{\partial f}{\partial x}\\[4pt] \frac{\partial f}{\partial y} \end{pmatrix}$$
or equivalently
$$\left(\frac{\partial g}{\partial\xi}\,,\,\frac{\partial g}{\partial\nu}\right) = \left(\frac{\partial f}{\partial x}\,,\,\frac{\partial f}{\partial y}\right)\cdot T\,.$$
This implies
$$\left(\frac{\partial f}{\partial x}\,,\,\frac{\partial f}{\partial y}\right) = \left(\frac{\partial g}{\partial\xi}\,,\,\frac{\partial g}{\partial\nu}\right)\cdot T^{-1} = \frac{1}{\det T}\left(\frac{\partial g}{\partial\xi}\,,\,\frac{\partial g}{\partial\nu}\right)\cdot\begin{bmatrix} y_3-y_1 & -x_3+x_1\\ -y_2+y_1 & x_2-x_1 \end{bmatrix}\,.$$
Let ϕ be a function on the standard triangle Ω, given as a linear combination of the basis functions, i.e.
$$\varphi(\xi,\nu) = \sum_{i=1}^{6} \varphi_i\,\Phi_i(\xi,\nu)\,,$$
where
$$\vec\Phi(\xi,\nu) = \begin{pmatrix} \Phi_1(\xi,\nu)\\ \Phi_2(\xi,\nu)\\ \Phi_3(\xi,\nu)\\ \Phi_4(\xi,\nu)\\ \Phi_5(\xi,\nu)\\ \Phi_6(\xi,\nu) \end{pmatrix} = \begin{pmatrix} (1-\xi-\nu)\,(1-2\,\xi-2\,\nu)\\ \xi\,(2\,\xi-1)\\ \nu\,(2\,\nu-1)\\ 4\,\xi\,\nu\\ 4\,\nu\,(1-\xi-\nu)\\ 4\,\xi\,(1-\xi-\nu) \end{pmatrix}\,.$$
Then its gradient with respect to $\xi$ and $\nu$ can be determined with the help of
$$\operatorname{grad}\vec\Phi = \left[\vec\Phi_\xi(\xi,\nu)\;\;\vec\Phi_\nu(\xi,\nu)\right] = \begin{bmatrix}
-3+4\,\xi+4\,\nu & -3+4\,\xi+4\,\nu\\
4\,\xi-1 & 0\\
0 & 4\,\nu-1\\
4\,\nu & 4\,\xi\\
-4\,\nu & 4-4\,\xi-8\,\nu\\
4-8\,\xi-4\,\nu & -4\,\xi
\end{bmatrix}\,.$$
Thus conclude
$$\left(\frac{\partial\varphi}{\partial\xi}\,,\,\frac{\partial\varphi}{\partial\nu}\right) = (\varphi_1,\varphi_2,\varphi_3,\varphi_4,\varphi_5,\varphi_6)\cdot\left[\vec\Phi_\xi(\xi,\nu)\;\;\vec\Phi_\nu(\xi,\nu)\right] = \vec\varphi^{\,T}\cdot\left[\vec\Phi_\xi(\xi,\nu)\;\;\vec\Phi_\nu(\xi,\nu)\right]\,.$$
If the function $\varphi(x,y)$ is given on the general triangle as linear combination of the basis functions on E find
$$\varphi(x,y) = \sum_{i=1}^{6} \varphi_i\,\Phi_i(\xi(x,y)\,,\,\nu(x,y))\,.$$
or by transposition
$$\begin{pmatrix} \frac{\partial\varphi}{\partial x}\\[4pt] \frac{\partial\varphi}{\partial y} \end{pmatrix} = \left(T^{-1}\right)^T\cdot\begin{bmatrix} \vec\Phi_\xi^T\\[2pt] \vec\Phi_\nu^T \end{bmatrix}\cdot\vec\varphi = \frac{1}{\det(T)}\begin{bmatrix} y_3-y_1 & -y_2+y_1\\ -x_3+x_1 & x_2-x_1 \end{bmatrix}\cdot\begin{bmatrix} \vec\Phi_\xi^T\\[2pt] \vec\Phi_\nu^T \end{bmatrix}\cdot\vec\varphi$$
and the same identities can be spelled out for the two components independently
$$\frac{\partial\varphi}{\partial x} = \frac{1}{\det(T)}\left[(+y_3-y_1)\,\vec\Phi_\xi^T + (-y_2+y_1)\,\vec\Phi_\nu^T\right]\cdot\vec\varphi\,,$$
$$\frac{\partial\varphi}{\partial y} = \frac{1}{\det(T)}\left[(-x_3+x_1)\,\vec\Phi_\xi^T + (+x_2-x_1)\,\vec\Phi_\nu^T\right]\cdot\vec\varphi\,.$$
For the numerical integration use the values of the gradients at the Gauss integration points $\vec g_j = (\xi_j,\nu_j)$. We already found that the values of the function $\varphi$ at the Gauss points can be computed with the help of the interpolation matrix $M$ by
$$\begin{pmatrix} \varphi(\vec g_1)\\ \varphi(\vec g_2)\\ \vdots\\ \varphi(\vec g_7) \end{pmatrix} = M\cdot\begin{pmatrix} \varphi_1\\ \varphi_2\\ \vdots\\ \varphi_6 \end{pmatrix}\,.$$
Similarly define the interpolation matrices for the partial derivatives. Using
$$M_\xi = \begin{bmatrix}
-3+4\,\xi_1+4\,\nu_1 & 4\,\xi_1-1 & 0 & 4\,\nu_1 & -4\,\nu_1 & 4-8\,\xi_1-4\,\nu_1\\
-3+4\,\xi_2+4\,\nu_2 & 4\,\xi_2-1 & 0 & 4\,\nu_2 & -4\,\nu_2 & 4-8\,\xi_2-4\,\nu_2\\
\vdots & & & & & \vdots\\
-3+4\,\xi_m+4\,\nu_m & 4\,\xi_m-1 & 0 & 4\,\nu_m & -4\,\nu_m & 4-8\,\xi_m-4\,\nu_m
\end{bmatrix} \approx \begin{bmatrix}
-2.18971 & -0.59485 & 0.00000 & 0.40515 & -0.40515 & 2.78456\\
0.59485 & 2.18971 & 0.00000 & 0.40515 & -0.40515 & -2.78456\\
0.59485 & -0.59485 & 0.00000 & 3.18971 & -3.18971 & 0.00000\\
0.76114 & 0.88057 & 0.00000 & 1.88057 & -1.88057 & -1.64170\\
-0.88057 & -0.76114 & 0.00000 & 1.88057 & -1.88057 & 1.64170\\
-0.88057 & 0.88057 & 0.00000 & 0.23886 & -0.23886 & 0.00000\\
-0.33333 & 0.33333 & 0.00000 & 1.33333 & -1.33333 & 0.00000
\end{bmatrix}$$
find
$$\begin{pmatrix} \varphi_\xi(\vec g_1)\\ \varphi_\xi(\vec g_2)\\ \vdots\\ \varphi_\xi(\vec g_7) \end{pmatrix} = M_\xi\cdot\begin{pmatrix} \varphi_1\\ \varphi_2\\ \vdots\\ \varphi_6 \end{pmatrix}\,.$$
Similarly write
$$M_\nu = \begin{bmatrix}
-3+4\,\xi_1+4\,\nu_1 & 0 & 4\,\nu_1-1 & 4\,\xi_1 & 4-4\,\xi_1-8\,\nu_1 & -4\,\xi_1\\
-3+4\,\xi_2+4\,\nu_2 & 0 & 4\,\nu_2-1 & 4\,\xi_2 & 4-4\,\xi_2-8\,\nu_2 & -4\,\xi_2\\
\vdots & & & & & \vdots\\
-3+4\,\xi_m+4\,\nu_m & 0 & 4\,\nu_m-1 & 4\,\xi_m & 4-4\,\xi_m-8\,\nu_m & -4\,\xi_m
\end{bmatrix} \approx \begin{bmatrix}
-2.18971 & 0.00000 & -0.59485 & 0.40515 & 2.78456 & -0.40515\\
0.59485 & 0.00000 & -0.59485 & 3.18971 & 0.00000 & -3.18971\\
0.59485 & 0.00000 & 2.18971 & 0.40515 & -2.78456 & -0.40515\\
0.76114 & 0.00000 & 0.88057 & 1.88057 & -1.64170 & -1.88057\\
-0.88057 & 0.00000 & 0.88057 & 0.23886 & 0.00000 & -0.23886\\
-0.88057 & 0.00000 & -0.76114 & 1.88057 & 1.64170 & -1.88057\\
-0.33333 & 0.00000 & 0.33333 & 1.33333 & 0.00000 & -1.33333
\end{bmatrix}$$
and
$$\begin{pmatrix} \varphi_\nu(\vec g_1)\\ \varphi_\nu(\vec g_2)\\ \vdots\\ \varphi_\nu(\vec g_7) \end{pmatrix} = M_\nu\cdot\begin{pmatrix} \varphi_1\\ \varphi_2\\ \vdots\\ \varphi_6 \end{pmatrix}\,.$$
and find for the first component $\varphi_x = \frac{\partial\varphi}{\partial x}$ of the gradient at the Gauss points
$$\begin{pmatrix} \varphi_x(\vec g_1)\\ \varphi_x(\vec g_2)\\ \vdots\\ \varphi_x(\vec g_7) \end{pmatrix} = \frac{1}{\det(T)}\left[(+y_3-y_1)\,M_\xi + (-y_2+y_1)\,M_\nu\right]\cdot\vec\varphi\,.$$
The above results for $M_\xi$ and $M_\nu$ can be implemented in MATLAB/Octave and then used to compute the element stiffness matrix.
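A minimal sketch of this implementation, evaluating the entries of $M_\xi$ and $M_\nu$ at the seven Gauss points of (6.13):
Octave
l1 = (12-2*sqrt(15))/21; l2 = (12+2*sqrt(15))/21;
G = [l1/2 l1/2; 1-l1 l1/2; l1/2 1-l1; l2/2 l2/2; 1-l2 l2/2; l2/2 1-l2; 1/3 1/3];
xi = G(:,1); nu = G(:,2);
Mxi = [-3+4*xi+4*nu, 4*xi-1, zeros(7,1), 4*nu, -4*nu, 4-8*xi-4*nu];
Mnu = [-3+4*xi+4*nu, zeros(7,1), 4*nu-1, 4*xi, 4-4*xi-8*nu, -4*xi];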
$$\text{Cont} = \iint_E u\,\vec b\cdot\nabla\phi\,dA = \iint_E u\,b_1\,\frac{\partial\phi}{\partial x}\,dA + \iint_E u\,b_2\,\frac{\partial\phi}{\partial y}\,dA$$
$$\begin{aligned}
\iint_E u\,b_1\,\frac{\partial\phi}{\partial x}\,dA &= \iint_E u\,b_1\,\frac{1}{\det T}\left((y_3-y_1)\,\phi_\xi + (-y_2+y_1)\,\phi_\nu\right)dA\\
&\approx \frac{|\det T|}{\det T}\,\langle \operatorname{diag}(\overrightarrow{w\,b_1})\cdot M\cdot\vec u \,,\, (y_3-y_1)\,M_\xi\cdot\vec\phi + (-y_2+y_1)\,M_\nu\cdot\vec\phi\rangle\\
&= \frac{|\det T|}{\det T}\,\langle \left((y_3-y_1)\,M_\xi^T + (-y_2+y_1)\,M_\nu^T\right)\cdot\operatorname{diag}(\overrightarrow{w\,b_1})\cdot M\cdot\vec u \,,\, \vec\phi\rangle\\
&= \langle B_1\,\vec u \,,\, \vec\phi\rangle
\end{aligned}$$
and similarly
$$\begin{aligned}
\iint_E u\,b_2\,\frac{\partial\phi}{\partial y}\,dA &= \iint_E u\,b_2\,\frac{1}{\det T}\left((-x_3+x_1)\,\phi_\xi + (x_2-x_1)\,\phi_\nu\right)dA\\
&\approx \frac{|\det T|}{\det T}\,\langle \operatorname{diag}(\overrightarrow{w\,b_2})\cdot M\cdot\vec u \,,\, (-x_3+x_1)\,M_\xi\cdot\vec\phi + (x_2-x_1)\,M_\nu\cdot\vec\phi\rangle\\
&= \frac{|\det T|}{\det T}\,\langle \left((-x_3+x_1)\,M_\xi^T + (x_2-x_1)\,M_\nu^T\right)\cdot\operatorname{diag}(\overrightarrow{w\,b_2})\cdot M\cdot\vec u \,,\, \vec\phi\rangle\\
&= \langle B_2\,\vec u \,,\, \vec\phi\rangle\,.
\end{aligned}$$
For computations use the above formula. A slightly more compact notation is given by
$$\begin{aligned}
\text{Cont} = \iint_E u\,\vec b\cdot\nabla\phi\,dA
&\approx \frac{|\det T|}{\det T}\,\langle \left[M_\xi^T\,,\,M_\nu^T\right]\cdot\begin{pmatrix} (y_3-y_1)\operatorname{diag}(\overrightarrow{w\,b_1}) + (-x_3+x_1)\operatorname{diag}(\overrightarrow{w\,b_2})\\ (-y_2+y_1)\operatorname{diag}(\overrightarrow{w\,b_1}) + (x_2-x_1)\operatorname{diag}(\overrightarrow{w\,b_2}) \end{pmatrix}\cdot M\cdot\vec u \,,\, \vec\phi\rangle\\
&= \frac{|\det T|}{\det T}\,\langle \left[M_\xi^T\,,\,M_\nu^T\right]\cdot\begin{bmatrix} (y_3-y_1) & (-x_3+x_1)\\ (-y_2+y_1) & (x_2-x_1) \end{bmatrix}\cdot\begin{pmatrix} \operatorname{diag}(\overrightarrow{w\,b_1})\\ \operatorname{diag}(\overrightarrow{w\,b_2}) \end{pmatrix}\cdot M\cdot\vec u \,,\, \vec\phi\rangle\\
&= |\det T|\,\langle \left[M_\xi^T\,,\,M_\nu^T\right]\cdot T^{-1}\cdot\begin{pmatrix} \operatorname{diag}(\overrightarrow{w\,b_1})\\ \operatorname{diag}(\overrightarrow{w\,b_2}) \end{pmatrix}\cdot M\cdot\vec u \,,\, \vec\phi\rangle\,.
\end{aligned}$$
For the integration over the general triangle E use the transformation formula (6.11) and obtain
$$\iint_E a\,\frac{\partial u(\vec x)}{\partial x}\,\frac{\partial\phi(\vec x)}{\partial x}\,dA = |\det T|\iint_\Omega a(\vec x(\xi,\nu))\,\frac{\partial u(\vec x(\xi,\nu))}{\partial x}\,\frac{\partial\phi(\vec x(\xi,\nu))}{\partial x}\,d\xi\,d\nu \approx \frac{|\det T|}{(\det T)^2}\,\langle A_x\cdot\vec u\,,\,\vec\phi\rangle = \frac{1}{|\det T|}\,\langle A_x\cdot\vec u\,,\,\vec\phi\rangle$$
where
h iT −→
h i
Ax = (+y3 − y1 ) Mξ + (−y2 + y1 ) Mν · diag(wa) · (+y3 − y1 ) Mξ + (−y2 + y1 ) Mν
−→
h i h i
= (+y3 − y1 ) MTξ + (−y2 + y1 ) MTν · diag(wa) · (+y3 − y1 ) Mξ + (−y2 + y1 ) Mν
−→ −→
= (+y3 − y1 )2 MTξ · diag(wa) · Mξ + (−y2 + y1 )2 MTν · diag(wa) · Mν
−→ −→
+(+y3 − y1 ) (−y2 + y1 ) MTξ · diag(wa) · Mν + MTν · diag(wa) · Mξ .
For a constant coefficient $a$, more of the above expressions can be computed explicitly and simplified.
$$M_\xi^T\cdot\operatorname{diag}(\vec w)\cdot M_\xi = \frac{1}{6}\begin{bmatrix}
3 & 1 & 0 & 0 & 0 & -4\\
1 & 3 & 0 & 0 & 0 & -4\\
0 & 0 & 0 & 0 & 0 & 0\\
0 & 0 & 0 & 8 & -8 & 0\\
0 & 0 & 0 & -8 & 8 & 0\\
-4 & -4 & 0 & 0 & 0 & 8
\end{bmatrix}$$
$$M_\nu^T\cdot\operatorname{diag}(\vec w)\cdot M_\nu = \frac{1}{6}\begin{bmatrix}
3 & 0 & 1 & 0 & -4 & 0\\
0 & 0 & 0 & 0 & 0 & 0\\
1 & 0 & 3 & 0 & -4 & 0\\
0 & 0 & 0 & 8 & 0 & -8\\
-4 & 0 & -4 & 0 & 8 & 0\\
0 & 0 & 0 & -8 & 0 & 8
\end{bmatrix}$$
$$M_\xi^T\cdot\operatorname{diag}(\vec w)\cdot M_\nu + M_\nu^T\cdot\operatorname{diag}(\vec w)\cdot M_\xi = \frac{1}{6}\begin{bmatrix}
6 & 1 & 1 & 0 & -4 & -4\\
1 & 0 & -1 & 4 & 0 & -4\\
1 & -1 & 0 & 4 & -4 & 0\\
0 & 4 & 4 & 8 & -8 & -8\\
-4 & 0 & -4 & -8 & 8 & 8\\
-4 & -4 & 0 & -8 & 8 & 8
\end{bmatrix}$$
Similarly determine
$$\iint_E a\,\frac{\partial u(\vec x)}{\partial y}\,\frac{\partial\phi(\vec x)}{\partial y}\,dA \approx \frac{1}{|\det T|}\,\langle A_y\cdot\vec u\,,\,\vec\phi\rangle$$
where
$$A_y = \left[(-x_3+x_1)\,M_\xi + (+x_2-x_1)\,M_\nu\right]^T\cdot\operatorname{diag}(\overrightarrow{w\,a})\cdot\left[(-x_3+x_1)\,M_\xi + (+x_2-x_1)\,M_\nu\right]\,.$$
Now put all the above computations into one single formula, leading to
$$\iint_E a\,\nabla u\cdot\nabla\phi\,dA \approx \frac{1}{|\det T|}\,\langle (A_x+A_y)\cdot\vec u\,,\,\vec\phi\rangle\,.$$
tt = T(2,2)*Mxi-T(1,2)*Mnu; Ax = tt'*diag(w.*val)*tt; % (y3-y1)*Mxi + (-y2+y1)*Mnu
tt = -T(2,1)*Mxi+T(1,1)*Mnu; Ay = tt'*diag(w.*val)*tt; % (-x3+x1)*Mxi + (x2-x1)*Mnu
and with this segment the code for the function ElementContribution() is complete. The element
stiffness matrix and the element vector can now be computed by
Octave
corners = [0 0; 1 0; 0 1];
[elMat,elVec] = ElementContribution(corners,'fFunc','bFunc','fFunc')
with the element stiffness matrix $A_E = A_x + A_y + B_0 + B_1 + B_2$. Then use ideas similar to Section 6.2.6 (page 407) to assemble the results to obtain the global stiffness matrix A, and the resulting system of linear equations to be solved is given by
$$A\,\vec u = W\,\vec f\,.$$
The boundary contributions in
$$\iint_\Omega \nabla\phi\cdot(a\,\nabla u + u\,\vec b) + \phi\,(b_0\,u - f)\,dA - \oint_{\Gamma_2} \phi\,(g_2 + g_3\,u)\,ds = 0$$
have to be taken into account by very similar procedures. If all goes well the vector $\vec u\in\mathbb{R}^N$ is an approximation of the solution $u(x,y)$ of the given boundary value problem.
Let $h > 0$ be the typical length of a side of a triangle. For first order elements $\frac{h}{2}$ is used, such that the computational effort is comparable to second order elements, i.e. the same number of degrees of freedom. Nonuniform meshes are used, to avoid the effect of superconvergence. By choosing different values of $h$ one should observe smaller errors for smaller values of $h$. The error is measured by computing the $L_2$ norms of the difference of the exact and approximate solutions, for the values of the functions and its partial derivative with respect to $y$. These are the expressions used in the general convergence estimates, based on the abstract error estimate in Theorem 6–6. A double logarithmic plot leads to Figure 6.18.
• For first order elements:
– The slope of the curve for the absolute values of $u(x,y) - u_e(x,y)$ is approximately 2 and thus conclude that the error is proportional to $h^2$.
– The slope of the curve for the absolute values of $\frac{\partial}{\partial y}(u(x,y) - u_e(x,y))$ is approximately 1 and thus conclude that the error of the gradient is proportional to $h$.
• For second order elements:
– The slope of the curve for the absolute values of $u(x,y) - u_e(x,y)$ is approximately 3 and thus conclude that the error is proportional to $h^3$.
– The slope of the curve for the absolute values of $\frac{\partial}{\partial y}(u(x,y) - u_e(x,y))$ is approximately 2 and thus conclude that the error of the gradient is proportional to $h^2$.
These observations confirm the theoretical error estimates in Theorem 6–9 (page 424) and Theorem 6–
11 (page 426). It is rather obvious from Figure 6.18 that second order elements generate more accurate
solutions for a comparable computational effort.
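Such slopes are easily extracted by a least squares fit in the double logarithmic plot; a sketch with hypothetical vectors of mesh sizes and errors:
Octave
h = [0.2 0.1 0.05 0.025];        % hypothetical mesh sizes
err = [4e-3 1e-3 2.5e-4 6.2e-5]; % hypothetical L2 errors
p = polyfit(log10(h),log10(err),1);
order = p(1)                     % the slope, here approximately 2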
Figure 6.18: Double logarithmic plot of the errors: $\log_{10}$(difference) as a function of $\log_{10}(h)$ for $u-u_e$ and $\frac{\partial}{\partial y}(u-u_e)$, for linear and quadratic elements
6.6.2 Estimate the Number of Nodes and Triangles and the Effect on the Sparse Matrix
Let Ω ⊂ R2 be a domain with a triangular mesh with many triangles. Examine the typical mesh, shown
below, and consider only triangles and nodes inside the mesh, as the contributions by the borders are con-
siderably smaller for large meshes.
• For first order elements the nodes are the corners of the triangles. Each triangle has 3 corners and each corner is typically shared by 6 triangles, thus
$$N \approx \frac{1}{6}\,3\,T = \frac{1}{2}\,T\,.$$
Thus the number N of nodes is approximately half the number T of triangles.
• For second order elements the nodes are the corners of the triangles and the midpoints of the edges. Each midpoint is shared by two triangles, thus
$$N \approx \frac{1}{2}\,T + \frac{3}{2}\,T = 2\,T\,.$$
Thus the number N of nodes is approximately twice the number T of triangles.
The above implies that the number of degrees of freedom to solve a problem with second order elements with a typical diameter $h$ of the triangles is approximately equal to using linear elements on triangles with diameter $h/2$.
The above estimates also allow to estimate how many entries in the sparse matrix resulting from an FEM
algorithm will be different from zero.
• For linear elements each node typically touches 6 triangles and each of the involved corners is shared
by two triangles. Thus there might be 6 + 1 = 7 nonzero entries in each row of the matrix.
• For second order triangles we have to distinguish between corners and midpoints.
– Each corner touches typically six triangles and thus expect up to 6 × 3 + 1 = 19 nonzero entries
in the corresponding row of the matrix.
– Each midpoint touches two triangles and two of the corner points are shared. Thus expect up to
2 + 2 × 3 + 1 = 9 nonzero entries in the corresponding row of the matrix.
The midpoints outnumber the corners by a factor of three. Thus expect an average of $\frac{3\cdot 9+19}{4} = 11.5$ nonzero entries in each row of the matrix.
The degrees of freedom and nodes used coincide for the two approaches, i.e. four triangles in Figure 6.19(a) for the linear elements form one of the eight triangles for the quadratic elements.
Figure 6.20(a) shows the difference of the computed solution with first order elements to the exact solution. Within each of the 32 elements the difference is not too far from a quadratic function. Figure 6.20(b) shows the values of the partial derivative $\frac{\partial u}{\partial y}$. It is clearly visible that the gradient is constant within each triangle, and not continuous across element borders.
Figure 6.21(a) shows the difference of the computed solution with second order elements to the exact solution. The error is considerably smaller than for linear elements, using identical degrees of freedom. Within each of the 8 elements the difference does not show a simple structure.
Figure 6.19: The mesh (a) and the solution (b)
Figure 6.20: Difference to the exact solution (a) and values of $\frac{\partial u}{\partial y}$ (b), using a first order mesh
Figure 6.21: Difference to the exact solution (a) and values of $\frac{\partial u}{\partial y}$ (b), using a second order mesh
Figure 6.21(b) shows the values of the partial derivative $\frac{\partial u}{\partial y}$. It is clearly visible that the gradient is not constant within the triangles. By a careful inspection one has to accept that the gradient is not continuous across element borders, but the jumps are considerably smaller than for linear elements.
Figure 6.22 shows the errors for the partial derivative $\frac{\partial u}{\partial y}$ and confirms this observation. For first order elements (Figure 6.22(a)) the gradient is constant within each triangle and thus the maximal error on the triangles is proportional to the size of the triangles. This confirms the convergence of order 1 (i.e. $\approx c\,h^1$) for the gradients with linear elements. The error for quadratic elements is considerably smaller, for a comparable computational effort.
Figure 6.22: Difference of the approximate values of $\frac{\partial u}{\partial y}$ to the exact values
6.6.4 Remaining Pieces for a Complete FEM Algorithm, the FEMoctave Package
With the above algorithms and codes we can construct the element stiffness matrix and the vector contribu-
tion for a triangular element with second order polynomials as basis functions. Thus we can take advantage
of the convergence result in Theorem 6–11 on page 426. The missing parts for a complete algorithm are:
• Examine the integrals over the boundary edges in a similar fashion. This poses no major technical
problem.
• Assemble the global stiffness matrix and vector, similar to the method in Section 6.2.6, page 407.
• Solve the resulting system of linear equations, using either a direct method or an iterative approach. Use the methods from Chapter 2.
In 2020 I wrote a FEM code in Octave, implementing all of the above. Find the documentation with a description of the algorithms and sample codes at web.sha1.bfh.science/FEMoctave/FEMdoc.pdf. The package is hosted on GitHub at github.com/AndreasStahel/FEMoctave. Within Octave use
pkg install https://github.com/AndreasStahel/FEMoctave/archive/v2.0.3.tar.gz
to download and install the package and then pkg load femoctave to load the package. For Linux systems the complete package is on my web page at web.sha1.bfh.science/FEMoctave2.0.1.tgz. The source code, demos and examples for FEMoctave are also available in the directory web.sha1.bfh.science/FEMoctave.
6.7 Applying the FEM to Other Types of Problems, e.g. Plane Elasticity
In the previous section approximate solutions of the boundary value problem are generated, either as weak solutions or, if $\vec b = \vec 0$, as minimizers of the functional
$$F(u) = \iint_\Omega \frac{1}{2}\,a\,(\nabla u)^2 + \frac{1}{2}\,b_0\,u^2 - f\cdot u\,dA$$
among all functions $u$ with $u = g_1$ on the boundary section $\Gamma_1$. By examining Table 5.2 on page 317 verify that the above setup covers a wide variety of applications. With a standard finite difference approximation of the time derivative many dynamic problems can be solved too.
According to equation (5.28) on page 388 a plane strain problem can be examined as minimizer of the functional
$$U(\vec u) = \frac{1}{2}\iint_\Omega \frac{E}{(1+\nu)(1-2\,\nu)}\,\langle \begin{bmatrix} 1-\nu & \nu & 0\\ \nu & 1-\nu & 0\\ 0 & 0 & 2\,(1-2\,\nu) \end{bmatrix}\cdot\begin{pmatrix} \varepsilon_{xx}\\ \varepsilon_{yy}\\ \varepsilon_{xy} \end{pmatrix}\,,\,\begin{pmatrix} \varepsilon_{xx}\\ \varepsilon_{yy}\\ \varepsilon_{xy} \end{pmatrix}\rangle\,dx\,dy - \iint_\Omega \vec f\cdot\vec u\,dx\,dy - \oint_{\partial\Omega} \vec g\cdot\vec u\,ds\,.$$
For a plane stress problem the total energy is given by (5.31) on page 394. Thus all contributions to the above elastic energy functionals are of the same type as the integrals in the previous sections and identical techniques are used to develop FEM code for elasticity problems. The convergence results in Section 6.4 also apply. The role of Poincaré's inequality is taken over by Korn's inequality.
is a bilinear mapping, i.e. it is linear (resp. affine) with respect to $\xi$ and $\nu$ separately. If the quadrilateral is a parallelogram, then the contribution with the factor $\frac{1}{4}(\xi+1)(\nu+1)$ vanishes and we have the same transformation formula as for triangles. The above transformation is built with the corner at $(x_1,y_1)$ as starting point and the vectors in directions of $(x_2,y_2)$ and $(x_4,y_4)$. One can also build on the central point $\vec c$ and the directional vectors $\vec d_\xi$ and $\vec d_\nu$, given by
$$\vec c = \frac{1}{4}\sum_{i=1}^{4}\begin{pmatrix} x_i\\ y_i \end{pmatrix}\,,\quad \vec d_\xi = \frac{1}{2}\begin{pmatrix} x_2+x_3\\ y_2+y_3 \end{pmatrix} - \vec c\,,\quad \vec d_\nu = \frac{1}{2}\begin{pmatrix} x_3+x_4\\ y_3+y_4 \end{pmatrix} - \vec c\,.$$
If the quadrilateral is a parallelogram then the midpoints of the opposite corners coincide, thus the vector
$$\vec d_{\xi\nu} = \begin{pmatrix} x_1\\ y_1 \end{pmatrix} + \begin{pmatrix} x_3\\ y_3 \end{pmatrix} - \begin{pmatrix} x_2\\ y_2 \end{pmatrix} - \begin{pmatrix} x_4\\ y_4 \end{pmatrix}$$
vanishes for parallelograms. The vector $\vec d_{\xi\nu}$ indicates the deviation from a parallelogram and equation (6.15) is identical to
$$\begin{aligned}
\vec F(\xi,\nu) &= \vec c + \xi\,\vec d_\xi + \nu\,\vec d_\nu + \frac{\xi\,\nu}{4}\,\vec d_{\xi\nu}\\
&= \vec c + \frac{\xi}{4}\begin{pmatrix} -x_1+x_2+x_3-x_4\\ -y_1+y_2+y_3-y_4 \end{pmatrix} + \frac{\nu}{4}\begin{pmatrix} -x_1-x_2+x_3+x_4\\ -y_1-y_2+y_3+y_4 \end{pmatrix} + \frac{\xi\,\nu}{4}\begin{pmatrix} +x_1-x_2+x_3-x_4\\ +y_1-y_2+y_3-y_4 \end{pmatrix}
\end{aligned} \tag{6.16}$$
Observe that
$$\vec F(-1,-1) = \begin{pmatrix} x_1\\ y_1 \end{pmatrix}\,,\quad \vec F(+1,-1) = \begin{pmatrix} x_2\\ y_2 \end{pmatrix}\,,\quad \vec F(-1,+1) = \begin{pmatrix} x_4\\ y_4 \end{pmatrix}\,,\quad \vec F(+1,+1) = \begin{pmatrix} x_3\\ y_3 \end{pmatrix}\,.$$
Figure 6.23: The transformation of a linear quadrilateral element and the four nodes
Observe that the determinant of the Jacobian $T(\xi,\nu)$ of this transformation is not constant, in contrast to the situation for triangular elements. This has to be taken into account when integrating over the general quadrilateral E by using the integration formula (6.11) on page 429.
Along each of the edges of the standard square $\Omega\subset\mathbb{R}^2$ one of $\xi$ or $\nu$ is constant and thus the transformation $\vec F$ leads to straight lines as borders of the domain $E = \vec F(\Omega)\subset\mathbb{R}^2$. Using the functions $1$, $\xi$, $\nu$ and $\xi\,\nu$ construct the four basis functions with the key property
$$\Phi_i(\xi_j,\nu_j) = \delta_{i,j} = \begin{cases} 1 & \text{for } i = j\\ 0 & \text{for } i \neq j \end{cases}\,.$$
$$\begin{aligned}
\Phi_1(\xi,\nu) &= \tfrac{1}{4}\,(1-\xi)\,(1-\nu)\,, & \Phi_2(\xi,\nu) &= \tfrac{1}{4}\,(1+\xi)\,(1-\nu)\,,\\
\Phi_3(\xi,\nu) &= \tfrac{1}{4}\,(1+\xi)\,(1+\nu)\,, & \Phi_4(\xi,\nu) &= \tfrac{1}{4}\,(1-\xi)\,(1+\nu)
\end{aligned}$$
or shorter
$$\Phi_i(\xi,\nu) = \frac{1}{4}\,(1+\xi_i\,\xi)\,(1+\nu_i\,\nu) \qquad\text{for } i = 1,2,3,4\,,$$
where $\xi_i, \nu_i = \pm 1$ are the coordinates of the corners, e.g. $\xi_1 = \nu_1 = -1$ or $\xi_3 = \nu_3 = +1$.
• It is a matter of tedious algebra to verify that for any bilinear function $\vec F(\xi,\nu)$
$$\begin{pmatrix} x\\ y \end{pmatrix} = \vec F(\xi,\nu) = \sum_{i=1}^{4}\begin{pmatrix} x_i\\ y_i \end{pmatrix}\,\Phi_i(\xi,\nu)\,.$$
This is a bilinear interpolation on the standard square $\Omega$. To verify this examine¹⁶ the linear system
$$\frac{1}{4}\begin{bmatrix} +1 & +1 & +1 & +1\\ -1 & +1 & +1 & -1\\ -1 & -1 & +1 & +1\\ +1 & -1 & +1 & -1 \end{bmatrix}\cdot\begin{pmatrix} f_1\\ f_2\\ f_3\\ f_4 \end{pmatrix} = \begin{pmatrix} c_1\\ c_2\\ c_3\\ c_4 \end{pmatrix}$$
and observe that the matrix is invertible. Thus given the values of $c_i$ we can solve uniquely for $f_i$. Let $M$ be the above matrix, then $M\cdot M^T = \frac{1}{4}\,I$, i.e. $2\,M$ is a unitary matrix. This leads to $M^{-1} = 4\,M^T$.
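These identities are quickly confirmed numerically; a small sketch:
Octave
M = [1 1 1 1; -1 1 1 -1; -1 -1 1 1; 1 -1 1 -1]/4;
disp(M*M')              % equals eye(4)/4, i.e. 2*M is orthogonal
disp(norm(inv(M)-4*M')) % vanishes up to rounding, thus inv(M) = 4*M'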
• With these shape functions implement an interpolation on the general quadrilateral E in Figure 6.23
by using the values of the function at the nodes. This leads to the values when pulled back to the
standard square Ω.
• Along each of the edges these basis functions are linear functions, since one of the variables ξ or ν
is constant. This leads to conforming elements, i.e. the values of the functions will be continuous
across element boundaries. Observe that these functions are not linear when displayed on a general
quadrilateral, see Figure 6.24. There is no easy formula for this bilinear interpolation on a general
quadrilateral.
Based on this the general approximation results are applicable and similar to Theorem 6–9 obtain for an FEM algorithm based on this interpolation the error estimates.
To evaluate integrals over the domain $E\subset\mathbb{R}^2$ use the transformation rule (6.11), i.e.
$$\iint_E f\,dA = \iint_\Omega f(\vec x(\xi,\nu))\,\left|\det\frac{\partial(x,y)}{\partial(\xi,\nu)}\right|\,d\xi\,d\nu = \int_{-1}^{+1}\int_{-1}^{+1} f(\vec x(\xi,\nu))\,\left|\det T(\xi,\nu)\right|\,d\xi\,d\nu\,.$$
Observe that the term $|\det T(\xi,\nu)|$ is not constant. The idea of Gauss integration can be applied to squares, using the 1D integrals¹⁷
$$\int_{-1}^{+1} f(t)\,dt \approx \sum_j w_j\,f(t_j) \qquad\text{general formula}$$
¹⁶Compare the coefficients of $1$, $\xi$, $\nu$ and $\xi\nu$:
$$4\sum_{i=1}^{4} f_i\,\Phi_i(\xi,\nu) = f_1\,(1-\xi)(1-\nu) + f_2\,(1+\xi)(1-\nu) + f_3\,(1+\xi)(1+\nu) + f_4\,(1-\xi)(1+\nu) = (f_1+f_2+f_3+f_4) + (-f_1+f_2+f_3-f_4)\,\xi + (-f_1-f_2+f_3+f_4)\,\nu + (+f_1-f_2+f_3-f_4)\,\xi\nu$$
¹⁷To derive the first formula integrate $1$, $t$ and $t^2$ with the ansatz $\int_{-1}^{+1} f(t)\,dt = w_1\,f(-\xi) + w_1\,f(+\xi)$:
$$\begin{aligned}
\int_{-1}^{+1} 1\,dt &= 2 = w_1 + w_1 &&\Longrightarrow\quad w_1 = 1\\
\int_{-1}^{+1} t\,dt &= 0 = -w_1\,\xi + w_1\,\xi = 0\\
\int_{-1}^{+1} t^2\,dt &= \tfrac{2}{3} = +w_1\,\xi^2 + w_1\,\xi^2 &&\Longrightarrow\quad \xi = \sqrt{1/3}\\
\int_{-1}^{+1} t^3\,dt &= 0 = -w_1\,\xi^3 + w_1\,\xi^3 = 0
\end{aligned}$$
Figure 6.24: The bilinear basis functions $\Phi_1$ to $\Phi_4$ on a general quadrilateral
Figure 6.25: Gauss integration points on the standard square, using $2^2 = 4$ or $3^2 = 9$ integration points
$$\int_{-1}^{+1} f(t)\,dt \approx 1\cdot f\!\left(\frac{-1}{\sqrt 3}\right) + 1\cdot f\!\left(\frac{+1}{\sqrt 3}\right) \qquad\text{2 point Gauss integration}$$
$$\int_{-1}^{+1} f(t)\,dt \approx \frac{5}{9}\,f\!\left(-\sqrt{\tfrac{3}{5}}\right) + \frac{8}{9}\,f(0) + \frac{5}{9}\,f\!\left(+\sqrt{\tfrac{3}{5}}\right) \qquad\text{3 point Gauss integration}$$
This leads to the 2D Gauss integration points shown on the left in Figure 6.25. If the 2D integration is based
on the 3 point formula for the integration on [−1, +1] the 2D situation is shown on the right in Figure 6.25.
There are more integration schemes of the same type, e.g. see [TongRoss08, §6.5.3] or [Hugh87, §3.8].
With this we have all the tools to apply the ideas from Section 6.5 to construct the matrices and vectors
required for an FEM algorithm, based on first order quadrilateral elements with four nodes.
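A sketch of the resulting $3\times 3$ tensor product rule on the standard square; the integrand is chosen for illustration only:
Octave
t = [-sqrt(3/5), 0, sqrt(3/5)]; w1 = [5/9, 8/9, 5/9];
f = @(xi,nu) xi.^2.*nu.^4;  % degree at most 5 in each variable, integrated exactly
[XI,NU] = meshgrid(t,t); W = w1'*w1; % the 9 points and weights
res = sum(sum(W.*f(XI,NU))) % exact value (2/3)*(2/5) = 4/15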
Thus $t^4$ is not integrated exactly and the error is proportional to $h^4$. To derive the second formula use the ansatz $\int_{-1}^{+1} f(t)\,dt = w_1\,f(-\xi) + w_0\,f(0) + w_1\,f(+\xi)$:
$$\begin{aligned}
\int_{-1}^{+1} 1\,dt &= 2 = w_1 + w_0 + w_1\\
\int_{-1}^{+1} t\,dt &= 0 = -w_1\,\xi + w_0\,0 + w_1\,\xi = 0\\
\int_{-1}^{+1} t^2\,dt &= \tfrac{2}{3} = +w_1\,\xi^2 + w_1\,\xi^2\\
\int_{-1}^{+1} t^3\,dt &= 0 = -w_1\,\xi^3 + w_1\,\xi^3 = 0\\
\int_{-1}^{+1} t^4\,dt &= \tfrac{2}{5} = +w_1\,\xi^4 + w_1\,\xi^4\\
\int_{-1}^{+1} t^5\,dt &= 0 = -w_1\,\xi^5 + w_1\,\xi^5 = 0
\end{aligned}$$
Thus $t^6$ is not integrated exactly and the error is proportional to $h^6$. The system to be solved is
$$w_0 + 2\,w_1 = 2\,,\quad 2\,w_1\,\xi^2 = \tfrac{2}{3}\,,\quad 2\,w_1\,\xi^4 = \tfrac{2}{5} \quad\Longrightarrow\quad \xi^2 = \frac{3}{5}\,,\; w_1 = \frac{5}{9}\,,\; w_0 = \frac{8}{9}\,.$$
Figure 6.26: The transformation of a second order quadrilateral element and the eight nodes
For second order quadrilateral elements use the eight shape functions
$$\begin{aligned}
\Phi_i(\xi,\nu) &= \frac{1}{4}\,(1+\xi_i\,\xi)\,(1+\nu_i\,\nu)\,(\xi_i\,\xi + \nu_i\,\nu - 1) &&\text{for } i = 1,2,3,4\\
\Phi_i(\xi,\nu) &= \frac{1}{2}\,(1-\xi^2)\,(1+\nu_i\,\nu) &&\text{for } i = 5,7\\
\Phi_i(\xi,\nu) &= \frac{1}{2}\,(1+\xi_i\,\xi)\,(1-\nu^2) &&\text{for } i = 6,8
\end{aligned}$$
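A sketch checking the key property $\Phi_i(\xi_j,\nu_j) = \delta_{i,j}$; the node coordinates follow the numbering in Figure 6.26, i.e. corners 1–4 followed by the edge midpoints 5–8 (an assumption on the exact ordering):
Octave
xi_n = [-1 1 1 -1 0 1 0 -1]; nu_n = [-1 -1 1 1 -1 0 1 0]; % node coordinates
P = zeros(8);
for i = 1:8
  for j = 1:8
    xi = xi_n(j); nu = nu_n(j);
    if i<=4 % corner shape functions
      P(i,j) = (1+xi_n(i)*xi)*(1+nu_n(i)*nu)*(xi_n(i)*xi+nu_n(i)*nu-1)/4;
    elseif any(i==[5 7]) % midpoints of the lower and upper edges
      P(i,j) = (1-xi^2)*(1+nu_n(i)*nu)/2;
    else % midpoints of the right and left edges
      P(i,j) = (1+xi_n(i)*xi)*(1-nu^2)/2;
    end%if
  end%for
end%for
disp(P) % the identity matrix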
• The above shape functions are all of the form
$$f(\xi,\nu) = c_1 + c_2\,\xi + c_3\,\nu + c_4\,\xi^2 + c_5\,\nu^2 + c_6\,\xi\,\nu + c_7\,\xi^2\,\nu + c_8\,\xi\,\nu^2\,,$$
i.e. these are biquadratic¹⁸ functions. Any function of this form can be written as a linear combination of the $\Phi_i(\xi,\nu)$, i.e.
$$f(\xi,\nu) = \sum_{i=1}^{8} f(\xi_i,\nu_i)\,\Phi_i(\xi,\nu) = \sum_{i=1}^{8} f_i\,\Phi_i(\xi,\nu)\,.$$
• Along each of the edges these basis functions are quadratic functions, since one of the variables ξ or
ν is constant. This leads to conforming elements, i.e. the values of the functions will be continuous
across element boundaries. Find the contour lines of the shape functions on a general quadrilateral in
Figure 6.27. There is no easy formula for this biquadratic interpolation on a general quadrilateral.
Based on this the general approximation results are applicable and similar to Theorem 6–11 obtain for an FEM algorithm based on this interpolation the error estimates.
Figure 6.27: Contour levels of the shape functions on an 8 node quadrilateral element
The above is the basis to implement an FEM algorithm based on quadrilateral elements.
• growth: the local growth of the density $u$ of tumor cells is described by $\alpha\,f(u) = \alpha\,u\,(1-u)$, where $\alpha > 0$ is a parameter. This is called a logistic growth model. Find the graph of the function for $\alpha = 1$ and the solution of the corresponding logistic differential equation $\frac{d}{dt}u(t) = u(t)\cdot(1-u(t))$ in Figure 6.28.
• diffusion: the tumor cells will also spread out, just like heat spreads in a medium. Thus describe this
effect by a heat equation.
Figure 6.28: The function leading to logistic growth and the solution of the differential equation ((a) the growth rate as a function of the concentration u, (b) the solution of the logistic differential equation as a function of the time t)
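The solution in Figure 6.28(b) is easily reproduced; a sketch, with an assumed initial value $u(0) = 0.1$:
Octave
u0 = 0.1; t = linspace(0,8,101);       % assumed initial value and time interval
u_exact = 1./(1+(1/u0-1)*exp(-t));     % the closed form solution
u_num = lsode(@(u,t) u.*(1-u), u0, t); % numerical solution
max(abs(u_num(:)-u_exact(:)))          % a very small difference
plot(t,u_num); xlabel('time t'); ylabel('concentration u')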
or after a multiplication by $r^2$
$$\frac{d}{dt}\,r^2\,u(t,r) = \frac{\partial}{\partial r}\left(r^2\,\frac{\partial u(t,r)}{\partial r}\right) + r^2\,\alpha\,f(u(t,r))\,. \tag{6.18}$$
This has to be supplemented with appropriate initial and boundary values. The goal is to examine the
behavior of solutions of this partial differential equation, using finite elements.
with some boundary conditions. Multiplying (6.19), i.e. $(a(r)\,u'(r))' + b(r)\,f(r) = 0$, by a smooth test function $\varphi(r)$ and an integration by parts leads to
$$0 = \int_0^R \left(\left(a(r)\,u'(r)\right)' + b(r)\,f(r)\right)\varphi(r)\,dr = \left.a(r)\,u'(r)\,\varphi(r)\right|_{r=0}^{R} + \int_0^R -a(r)\,u'(r)\,\varphi'(r) + b(r)\,f(r)\,\varphi(r)\,dr\,. \tag{6.20}$$
Using FEM this equation will be discretized, leading to the stiffness matrix A and the weight matrix M, such that $\langle A\,\vec u - M\,\vec f\,,\,\vec\varphi\rangle = 0$ for all vectors $\vec\varphi$. This then leads to the linear system $A\,\vec u = M\,\vec f$ to be solved for the vector $\vec u$.
The section below will lead to a numerical implementation of the above idea. Then the developed MATLAB/Octave code will be tested with the help of a few example problems.
To compute these integrals first examine the very efficient Gauss integration on a standard interval $[\frac{-h}{2}, \frac{+h}{2}]$ of length $h$.
• The three values of a function $u(x)$ at $u(-h/2) = u_-$, $u(0) = u_0$ and $u(h/2) = u_+$ determine a quadratic interpolating polynomial²⁰
$$u(x) = u_0 + \frac{u_+ - u_-}{h}\,x + \frac{u_+ - 2\,u_0 + u_-}{h^2}\,2\,x^2\,.$$
Use $x = 0$ and $x = \pm\sqrt{\tfrac{3}{5}}\,\tfrac{h}{2}$ to determine the values of $u(x)$ at the Gauss points by
$$\begin{pmatrix} u(-\sqrt{\tfrac{3}{5}}\,\tfrac{h}{2})\\ u(0)\\ u(+\sqrt{\tfrac{3}{5}}\,\tfrac{h}{2}) \end{pmatrix} = \frac{1}{10}\begin{bmatrix} 3+\sqrt{15} & 4 & 3-\sqrt{15}\\ 0 & 10 & 0\\ 3-\sqrt{15} & 4 & 3+\sqrt{15} \end{bmatrix}\cdot\begin{pmatrix} u_-\\ u_0\\ u_+ \end{pmatrix} = G_0\cdot\begin{pmatrix} u_-\\ u_0\\ u_+ \end{pmatrix}\,.$$
Use this Gaussian interpolation matrix to compute the values of the function at the Gauss integration points, using the values at the nodes.
• The above can be repeated to obtain the values of the derivative $u'(x)$ at the Gauss points.
$$\begin{pmatrix} u'(-\sqrt{\tfrac{3}{5}}\,\tfrac{h}{2})\\ u'(0)\\ u'(+\sqrt{\tfrac{3}{5}}\,\tfrac{h}{2}) \end{pmatrix} = \frac{1}{h}\begin{bmatrix} -1-2\sqrt{\tfrac{3}{5}} & +4\sqrt{\tfrac{3}{5}} & +1-2\sqrt{\tfrac{3}{5}}\\ -1 & 0 & +1\\ -1+2\sqrt{\tfrac{3}{5}} & -4\sqrt{\tfrac{3}{5}} & +1+2\sqrt{\tfrac{3}{5}} \end{bmatrix}\cdot\begin{pmatrix} u_-\\ u_0\\ u_+ \end{pmatrix} = \frac{1}{h}\,G_1\cdot\begin{pmatrix} u_-\\ u_0\\ u_+ \end{pmatrix}\,.$$
• To evaluate the function $a(x)$ at the Gauss points use the notation
$$\mathbf{a} = \begin{bmatrix} a(-\sqrt{\tfrac{3}{5}}\,\tfrac{h}{2}) & 0 & 0\\ 0 & a(0) & 0\\ 0 & 0 & a(+\sqrt{\tfrac{3}{5}}\,\tfrac{h}{2}) \end{bmatrix}\,.$$
The above notation leads to the required integrals. With $\Delta r_i = r_{i+1} - r_i$ obtain
$$I_0 = \int_{r_i}^{r_{i+1}} b(r)\,f(r)\,\varphi(r)\,dr \approx \Delta r_i\,\langle W\,\mathbf{b}\,G_0\,\vec f\,,\,G_0\,\vec\varphi\rangle = \Delta r_i\,\langle G_0^T\,W\,\mathbf{b}\,G_0\,\vec f\,,\,\vec\varphi\rangle$$
$$I_1 = \int_{r_i}^{r_{i+1}} a(r)\,u'(r)\,\varphi'(r)\,dr \approx \frac{\Delta r_i}{(\Delta r_i)^2}\,\langle W\,\mathbf{a}\,G_1\,\vec u\,,\,G_1\,\vec\varphi\rangle = \frac{1}{\Delta r_i}\,\langle G_1^T\,W\,\mathbf{a}\,G_1\,\vec u\,,\,\vec\varphi\rangle$$
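A sketch of the resulting element matrices on one interval, for constant coefficients $a = b = 1$, an assumed length $\Delta r$ and $W = \operatorname{diag}(\frac{5}{18}, \frac{8}{18}, \frac{5}{18})$:
Octave
s = sqrt(3/5); dr = 0.1; % assumed interval length
G0 = [3+sqrt(15) 4 3-sqrt(15); 0 10 0; 3-sqrt(15) 4 3+sqrt(15)]/10;
G1 = [-1-2*s 4*s 1-2*s; -1 0 1; -1+2*s -4*s 1+2*s];
W = diag([5 8 5]/18);  % scaled Gauss weights
elM = dr*G0'*W*G0      % element weight matrix, compare (6.23)
elA = G1'*W*G1/dr      % element stiffness matrix, compare (6.22)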
Now work on the interval $[0,R]$, discretized by $0 = r_0 < r_1 < r_2 < \ldots < r_{n-1} < r_n = R$. Then examine the discrete version of the weak solution, thus integrals of the type
$$\begin{aligned}
I &= \int_0^R a(r)\,u'(r)\,\varphi'(r) - b(r)\,f(r)\,\varphi(r)\,dr\\
&= \sum_{i=0}^{n-1}\int_{r_i}^{r_{i+1}} a(r)\,u'(r)\,\varphi'(r) - b(r)\,f(r)\,\varphi(r)\,dr\\
&\approx \sum_{i=0}^{n-1} \frac{1}{\Delta r_i}\,\langle G_1^T\,W\,\mathbf{a}_i\,G_1\,\vec u_i\,,\,\vec\varphi_i\rangle - \Delta r_i\,\langle G_0^T\,W\,\mathbf{b}_i\,G_0\,\vec f_i\,,\,\vec\varphi_i\rangle\\
&= \langle A\,\vec u - M\,\vec f\,,\,\vec\varphi\rangle \qquad\text{for all vectors } \vec\varphi\,.
\end{aligned}$$
The stiffness matrix A and the weight matrix M are both of size $(2\,n+1)\times(2\,n+1)$, but possible boundary conditions are not taken into account yet²¹. This has to be done with some care, since the differential equation (6.19) has a unique solution only if boundary conditions are specified.
Thus it is straightforward to adapt the algorithm and the codes to examine differential equations of the type
$$\left(a(r)\,u'(r)\right)' + c(r)\,u'(r) + h(r)\,u(r) + b(r)\,f(r) = 0\,.$$
and the integral $\int_0^R a\,u'(r)\,\varphi'(r)\,dr$ leads to a matrix
$$\frac{a}{3\,\Delta r}\begin{bmatrix}
+7 & -8 & +1 & & & & \\
-8 & +16 & -8 & & & & \\
+1 & -8 & +14 & -8 & +1 & & \\
 & & -8 & +16 & -8 & & \\
 & & +1 & -8 & +14 & -8 & +1\\
 & & & & \ddots & \ddots & \ddots\\
 & & & & +1 & -8 & +7
\end{bmatrix}\,. \tag{6.22}$$
Observe that this matrix has a band structure, with only 3 or 5 nonzero elements per row arranged along the diagonal. This matrix is symmetric and for $a > 0$ positive semi-definite. If we have Dirichlet boundary conditions the first (or last) row and column are removed, then the matrix will be strictly positive definite. This corresponds to the analytical observations
$$\int_0^R a\,|u'(r)|^2\,dr \ge 0 \qquad\text{and}\qquad \int_0^R a\,|u'(r)|^2\,dr = 0 \iff u(r) = \text{const}\,.$$
Some of the key properties are similar to our model matrix An from Section 2.3.1.
Similarly using
$$G_0^T\,W\,G_0 = \frac{1}{30}\begin{bmatrix} +4 & +2 & -1\\ +2 & +16 & +2\\ -1 & +2 & +4 \end{bmatrix}$$
RR
the integral 0 b f (r) φ(r) dr leads to
+4 +2 −1
+2 +16 +2
−1 +2 +8 +2 −1
+2 +16 +2
−1 +2 +8 +2 −1
b ∆r
+2 +16 +2 . (6.23)
30
−1 +2 +8 +2 −1
.. .. .. .. ..
. . . . .
−1 +2 +8 +2 −1
+2 +16 +2
−1 +2 +4
This leads to different algorithms to take Dirichlet or Neumann conditions into account.
• If the value of $u(R)$ is known, then $\varphi(R)$ needs to be zero and the contribution vanishes. If $u(R) = 0$ the last value in the vector $\vec u$ is zero and we can safely remove this zero in $\vec u$ and the last column in the matrix A.
• If we have no constraint on $u(R)$ the natural boundary condition is $a(R)\,u'(R) = 0$. We do not have to do anything to take this condition into account.
• For boundary conditions of the type $a(R)\,u'(R) = c_1 + c_2\,u(R)$ the correct type of contribution will have to be added.
This leads to the linear system
$$A\,\vec u - M\,\vec f = \vec 0 \qquad\text{or}\qquad \vec u = A^{-1}\,M\,\vec f = A\backslash M\,\vec f\,.$$
The resulting matrix A is symmetric and has a band structure with semi-bandwidth 3, i.e. in each row there are up to 5 entries about the diagonal.
A first example
To solve the boundary value problem
$$-u''(r) = 1 \quad\text{on } 0 \le r \le 1 \qquad\text{with}\quad u(0) = u(1) = 0$$
with the exact solution $u_{exact}(r) = \frac{1}{2}\,r\,(1-r)$ use the code below.
Test1.m
N = 10; % number of elements (this setup section is an assumption, following Test2.m)
x = linspace(0,1,N+1);
a = @(x) ones(size(x)); b = @(x) ones(size(x)); % solve -u" = 1
[A,M,r] = GenerateFEM(x,a,b);
A = A(2:end-1,2:end-1); M = M(2:end-1,:); % Dirichlet BC on both ends
f = ones(size(r(:)));
u_exact = r(:).*(1-r(:))/2;
u = A\(M*f);
figure(1); plot(r,[0;u;0],r,u_exact)
xlabel('r'); ylabel('u'); title('solution, exact and approximate')
figure(2); plot(r,[0;u;0]-u_exact)
xlabel('r'); ylabel('u'); title('difference of solutions, exact and approximate')
It turns out that the generated solution coincides with the exact solution. This is no real surprise, since the
exact solution is a quadratic function, which we approximate by a piecewise quadratic function. For real
problems this is very unlikely to occur. Thus this example is only useful to verify the algorithm and the
coding.
A second example
To solve the boundary value problem
$$-u''(r) = \cos(3\,r) \quad\text{on } 0 \le r \le \frac{\pi}{2} \qquad\text{with}\quad u'(0) = u(\tfrac{\pi}{2}) = 0$$
with the exact solution $u_{exact}(r) = \frac{1}{9}\cos(3\,r)$ use the code Test2.m below. Find the solution in Figure 6.29(a).
Test2.m
N = 2*10; % number of elements
x = linspace(0,pi/2,N+1);
a = @(x) 1*ones(size(x)); b = @(x) 1*ones(size(x)); % solve -u"= cos(3*x)
[A,M,r] = GenerateFEM(x,a,b);
A = A(1:end-1,1:end-1); M = M(1:end-1,:); % Dirichlet BC at the right end point
f = cos(3*r(:)); u = A\(M*f); % completion (assumed), analogous to Test3.m below
By using different values N for the number of elements observe (see Table 6.3) an order of convergence at the nodes of approximately 4! This is better than to be expected by Theorem 6–11 (page 426). If the solution is reconstructed between the grid points by a piecewise quadratic interpolation one observes a cubic convergence. This is the expected result by the abstract error estimate for a piecewise quadratic approximation. The additional accuracy is caused by the effect of superconvergence, and we can not count on it to occur. Figure 6.29(b) shows the error at the nodes and also at points between the nodes²². For those we obtain the expected third order of convergence. A closer look at the difference of exact and approximate solution at $r \approx 1.1$ also reveals that the approximate solution is not twice differentiable, since the slope of the curve has a severe jump. The piecewise quadratic interpolation can (and does) lead to non-continuous derivatives.
Figure 6.29: Exact and approximate solution of the second test problem ((a) the solutions, (b) the difference of exact and approximate solution)
A third example
The function $u_{exact}(r) = \exp(-r^2) - \exp(-R^2)$ solves the boundary value problem
$$\left(r^2\,u'(r)\right)' + r^2\,f(r) = 0 \quad\text{on } 0 \le r \le R \qquad\text{with}\quad u(R) = 0\,,$$
where $f(r) = (6 - 4\,r^2)\,\exp(-r^2)$. In this example the coefficient functions $a(r) = b(r) = r^2$ are used. The effect of super-convergence can be observed for this example too.
Test3.m
N = 20; R = 3;
x = linspace(0,R,N);
a = @(x) x.^2; b = @(x) x.^2;
[A,M,r] = GenerateFEM(x,a,b);
A = A(1:end-1,1:end-1); % Dirichlet BC at the right end point
M = M(1:end-1,:);
r = r(:); f = (6-4*r.^2).*exp(-r.^2);
u_exact = exp(-r.^2)-exp(-R^2);
u = A\(M*f);
figure(1); plot(r,[u;0],r,u_exact)
xlabel('r'); ylabel('u'); title('solution, exact and approximate')
²²This figure cannot be generated by the codes in these lecture notes. The Octave command pwquadinterp() allows one to apply the piecewise quadratic interpolation within the elements.
figure(2); plot(r,[u;0]-u_exact)
xlabel('r'); ylabel('u'); title('difference of exact and approximate solution')
$$\left(r^2\,u'(t,r)\right)' + r^2\,\alpha\,f(u(t,r)) - r^2\,\dot u(t,r) = 0 \qquad\text{for } 0 < r < R \text{ and } t > 0\,.$$
$$A\,\vec u(t) - \alpha\,M\,\vec f(\vec u(t)) + M\,\frac{d\,\vec u(t)}{dt} = \vec 0$$
or by rearranging terms
$$M\,\frac{d\,\vec u(t)}{dt} = -A\,\vec u(t) + \alpha\,M\,\vec f(\vec u(t))\,. \tag{6.24}$$
Because of the nonlinear function $f(u) = u\cdot(1-u)$ this is a nonlinear system of ordinary differential equations. Now use the finite difference method from Section 4.5, starting on page 273, to discretize the dynamic behavior. To start out use a Crank–Nicolson scheme for the time discretization, but for the nonlinear contribution use an explicit expression²⁴. This will lead to systems of linear equations to be solved. With the time discretization $\vec u_i = \vec u(i\,\Delta t)$ this leads to
$$\left(M + \tfrac{\Delta t}{2}\,A\right)\vec u_{i+1} = \left(M - \tfrac{\Delta t}{2}\,A\right)\vec u_i + \Delta t\,\alpha\,M\,\vec f(\vec u_i)\,.$$
Create an animation
The above is implemented in MATLAB/Octave and an animation25 can be displayed on screen. Find the final
state in Figure 6.30.
LogisticModel.m
²³The static equation $(a(r)\,u'(r))' + b(r)\,f(r) = 0$ leads to the linear system $A\,\vec u - M\,\vec f = \vec 0$.
²⁴This can be improved, see later in the notes.
²⁵With the animation the code took 12.5 seconds to run, without the plots only 0.12 seconds. Thus most of the time is used to generate and display the plots.
for ii = 1:Nt
  u = (M+dt/2*A)\((M-dt/2*A)*u + dt*M*al*LogF(u)); % Crank-Nicolson
  t = t + dt;
  figure(1); plot(r,u)
  xlabel('radius r'); ylabel('density u')
  axis([0 R -0.2 1.1]); text(0.7*R,0.7,sprintf('t = %5.3f',t))
  drawnow();
end%for
Figure 6.30: The density u as a function of the radius r at the final time t = 6
u_all = zeros(2*Nx-1,Nt+1);
LogF = @(u) max(0,u.*(1-u)); al = 10; % scaling factor
u0 = @(x,R) 0.001*exp(-x.^2/4);
dt = T/Nt;
u = u0(r,R);
u_all(:,1) = u;
t = 0;
for ii = 1:Nt
  u = (M+dt/2*A)\((M-dt/2*A)*u + dt*M*al*LogF(u)); % Crank-Nicolson
  t = t + dt;
  u_all(:,ii+1) = u;
end%for
figure(2); mesh(0:dt:T,r,u_all)
ylabel('radius r'); xlabel('time t'); zlabel('concentration u')
axis([0,T,0,R,0,1])
figure(3); contour(0:dt:T,r,u_all,[0.1 0.5 0.9])
caxis([0 1]); ylabel('radius r'); xlabel('time t')
Figure 6.31: Concentration u(t, r) as function of time t and radius r on a short time interval and contours
Discussion
• In Figure 6.31 observe the small initial seed growing to a moving front where the concentration
increases from 0 to 1 over a short distance.
• In Figure 6.32 it is obvious that this front moves with a constant speed, without changing width or
shape.
• The above is a clear indication that for the moving front section the original equation
$$r^2\,\dot u(t,r) = \left(r^2\,u'(t,r)\right)' + r^2\,\alpha\,f(u(t,r))$$
$$\dot u(t,r) = u''(t,r) + \frac{2}{r}\,u'(t,r) + \alpha\,f(u(t,r))$$
can be replaced by Fisher's equation
$$\dot u(t,r) = u''(t,r) + \alpha\,f(u(t,r))\,,$$
Figure 6.32: Concentration u(t, r) as function of time t and radius r on a long time interval and contours
i.e. the contribution $\frac{2}{r}\,u'(t,r)$ is dropped. The behavior of solutions of this equation is well studied and the literature is vast!
• If clinical data is available the real task will be to find good values for the parameters D and α to
match the observed behavior of the tumors.
• Instead of the function f (u) = u · (1 − u) other functions may be used to describe the growth of the
tumor.
• For small time steps the algorithm showed severe instability, caused by the nonlinear contribution
f (u) = u · (1 − u). Using max{f (u), 0} improved the situation slightly.
• The traveling speed of the front depended rather severely on the time step. This is not surprising as the contribution $\vec f(\vec u_i)$ is the driving force, and it is lagging behind by $\frac{1}{2}\,\Delta t$ since Crank–Nicolson is formulated at time $t = t_i + \frac{1}{2}\,\Delta t$. One should use $\vec f(\frac{1}{2}(\vec u_i + \vec u_{i+1}))$ instead, but this leads to a nonlinear system of equations to be solved for each time step.
This can be implemented with MATLAB/Octave, leading to the code in LogisticModel2.m. The solutions look very similar to the ones from the previous section, but the traveling speed of the front is different. The linear system to be solved is modified for each time step, thus the computational effort is higher. Since all matrices have a very narrow band structure the computational penalty is not very high. A more careful examination of the above approach reveals that this is actually one step of Newton's method to solve the nonlinear system of equations.
• One might as well use Newton's algorithm to solve the nonlinear system of equations for $\vec u_{i+1}$. To apply the algorithm consider each time step as a nonlinear system of equations $F(\vec u_{i+1}) = \vec 0$. To approximate $F(\vec u_{i+1} + \vec\Phi)$ differentiate with respect to $\vec u_{i+1}$.
This can be implemented with MATLAB/Octave, leading to the code in LogisticModel3.m. The solutions look very similar to the ones from the previous section, and the traveling speed of the front is close to the speed observed in LogisticModel2.m.
• The different speeds observed for the above approaches should trigger the question: what is the correct speed? For this examine the results of four different runs.
– The solutions by the linearized and fully nonlinear approaches differ very little. This is no
surprise, since the linearized approach is just the first step of Newton’s method, which is used
for the nonlinear approach.
– The explicit solution with Nt=200 leads to a clearly slower speed and the smaller time step with Nt=3200 leads to a speed much closer to the one observed by the nonlinear approach. This clearly indicates that the observed speed depends on the time step, and for smaller time steps the front is moving with a speed closer to the one observed for the linearized and nonlinear approach.
– As a consequence one should use the linearized or nonlinear approach.
Figure 6.33: Graph of the solutions by four computations with different algorithms
LogisticModel2.m
R = 50; % space interval [0,R]
T = 6; % time interval [0,T]
x = linspace(0,R,200); Nt = 200;
LogF = @(u) u.*(1-u); dLogF = @(u) 1-2*u;
% (setup of A, M, r, dt, al and the initial values u as in LogisticModel.m)
for ii = 1:Nt
  dfu = dt*al/2*M*diag(dLogF(u));
  u = (M+dt/2*A-dfu)\((M-dt/2*A-dfu)*u + dt*M*al*LogF(u)); % linearized Crank-Nicolson
  t = t + dt;
  figure(1); plot(r,u); xlabel('radius r'); ylabel('density u')
  axis([0 R -0.2 1.1]); text(0.7*R,0.7,sprintf('t = %5.3f',t))
  drawnow();
end%for
LogisticModel3.m
% use true Newton
R = 50; % space interval [0,R]
T = 6; % time interval [0,T]
x = linspace(0,R,200); Nt = 200;
LogF = @(u) u.*(1-u); dLogF = @(u) 1-2*u;
for ii = 1:Nt
  % Newton iteration for the nonlinear system in each time step
  % (the remaining lines of this code are not reproduced in these notes)
end%for
Bibliography
[AxelBark84] O. Axelsson and V. A. Barker. Finite Element Solution of Boundary Values Problems. Aca-
demic Press, 1984.
[Brae02] D. Braess. Finite Elemente. Theorie, schnelle Löser und Anwendungen in der Elastizitätstheorie.
Springer, second edition, 2002.
[Cowp73] G. R. Cowper. Gaussian quadrature formulas for triangles. International Journal for Numerical Methods in Engineering, 7:405–408, 1973.
[Davi80] A. J. Davies. The Finite Element Method: a First Approach. Oxford University Press, 1980.
[Gmur00] T. Gmür. Méthode des éléments finis en mécanique des structures. Mécanique (Lausanne).
Presses polytechniques et universitaires romandes, 2000.
[Hugh87] T. J. R. Hughes. The Finite Element Method, Linear Static and Dynamic Finite Element Analysis.
Prentice–Hall, 1987. Reprinted by Dover.
[John87] C. Johnson. Numerical Solution of Partial Differential Equations by the Finite Element Method.
Cambridge University Press, 1987. Republished by Dover.
[LascTheo87] P. Lascaux and R. Théodor. Analyse numérique matricielle appliquée a l’art de l’ingénieur,
Tome 2. Masson, Paris, 1987.
[StraFix73] G. Strang and G. J. Fix. An Analysis of the Finite Element Method. Prentice–Hall, 1973.
[TongRoss08] P. Tong and J. Rossettos. Finite Element Method, Basic Technique and Implementation. MIT,
1977. Republished by Dover in 2008.
[Zien13] O. Zienkiewicz, R. Taylor, and J. Zhu. The Finite Element Method: Its Basis and Fundamentals. Butterworth-Heinemann, seventh edition, 2013.
Find a longer literature list for FEM, with some comments, in Section 0.3.2 and Table 2, starting on
page 3.
Bibliography
[Acto90] F. S. Acton. Numerical Methods that Work; 1990 corrected edition. Mathematical Association of
America, Washington, 1990.
[Agga20] C. Aggarwal. Linear Algebra and Optimization for Machine Learning. Springer, first edition,
2020.
[Aris62] R. Aris. Vectors, Tensors and the Basic Equations of Fluid Mechanics. Prentice Hall, 1962.
Republished by Dover.
[AtkiHan09] K. Atkinson and W. Han. Theoretical Numerical Analysis. Number 39 in Texts in Applied
Mathematics. Springer, 2009.
[AxelBark84] O. Axelsson and V. A. Barker. Finite Element Solution of Boundary Values Problems. Aca-
demic Press, 1984.
[BoneWood08] J. Bonet and R. Wood. Nonlinear Continuum Mechanics for Finite Element Analysis. Cam-
bridge University Press, 2008.
[BoriTara79] A. I. Borisenko and I. E. Tarapov. Vector and Tensor Analysis with Applications. Dover, 1979.
first published in 1966 by Prentice–Hall.
[Bowe10] A. F. Bower. Applied Mechanics of Solids. CRC Press, 2010. web site at solidmechanics.org.
[Brae02] D. Braess. Finite Elemente. Theorie, schnelle Löser und Anwendungen in der Elastizitätstheorie.
Springer, second edition, 2002.
[Butc03] J. Butcher. Numerical Methods for Ordinary Differential Equations. John Wiley & Sons, Ltd,
second edition, 2003.
[Butc16] J. Butcher. Numerical Methods for Ordinary Differential Equations. John Wiley & Sons, Ltd,
third edition, 2016.
[ChouPaga67] P. C. Chou and N. J. Pagano. Elasticity: Tensor, Dyadic, and Engineering Approaches. D. Van Nostrand Company, 1967. Republished by Dover 1992.
[Ciar02] P. G. Ciarlet. The Finite Element Method for Elliptic Problems. SIAM, 2002.
[Cowp73] G. R. Cowper. Gaussian quadrature formulas for triangles. International Journal for Numerical Methods in Engineering, 7:405–408, 1973.
[TopTen] B. A. Cipra. The best of the 20th century: Editors name top 10 algorithms. SIAM News, 2000.
[DahmReus07] W. Dahmen and A. Reusken. Numerik für Ingenieure und Naturwissenschaftler. Springer,
2007.
[Davi80] A. J. Davies. The Finite Element Method: a First Approach. Oxford University Press, 1980.
[www:LinAlgFree] J. Dongarra. Freely available software for linear algebra on the web.
http://www.netlib.org/utk/people/JackDongarra/la-sw.html.
[DowdSeve98] K. Dowd and C. Severance. High Performance Computing. O’Reilly, 2nd edition, 1998.
[DrapSmit98] N. Draper and H. Smith. Applied Regression Analysis. Wiley, third edition, 1998.
[Farl82] S. J. Farlow. Partial Differential Equations for Scientists and Engineers. Dover, New York, 1982.
[GhabPeckWu17] J. Ghaboussi, D. Pecknold, and X. Wu. Nonlinear Computational Solid Mechanics. CRC
Press, 2017.
[GhabWu16] J. Ghaboussi and X. Wu. Numerical Methods in Computational Mechanics. CRC Press, 2016.
[Gmur00] T. Gmür. Méthode des éléments finis en mécanique des structures. Mécanique (Lausanne).
Presses polytechniques et universitaires romandes, 2000.
[Gold91] D. Goldberg. What every computer scientist should know about floating-point arithmetic. ACM
Computing Surveys, 23(1), March 1991.
[GoluVanLoan96] G. Golub and C. Van Loan. Matrix Computations. Johns Hopkins University Press, third
edition, 1996.
[GoluVanLoan13] G. Golub and C. Van Loan. Matrix Computations. Johns Hopkins University Press,
fourth edition, 2013.
[Gree77] D. T. Greenwood. Classical Dynamics. Prentice Hall, 1977. Republished by Dover 1997.
[Hack94] W. Hackbusch. Iterative Solution of Large Sparse Systems of Equations, volume 95 of Applied
Mathematical Sciences. Springer, first edition, 1994.
[Hack16] W. Hackbusch. Iterative Solution of Large Sparse Systems of Equations, volume 95 of Applied
Mathematical Sciences. Springer, second edition, 2016.
[HairNorsWann08] E. Hairer, S. Nørsett, and G. Wanner. Solving Ordinary Differential Equations I: Non-
stiff Problems. Springer Series in Computational Mathematics. Springer Berlin Heidelberg, second
edition, 1993. third printing 2008.
[HairNorsWann96] E. Hairer, S. Nørsett, and G. Wanner. Solving Ordinary Differential Equations II:
Stiff and Differential-Algebraic Problems. Lecture Notes in Economic and Mathematical Systems.
Springer, second edition, 1996.
[HenrWann17] P. Henry and G. Wanner. Johann Bernoulli and the cycloid: A theorem for posterity. Elemente der Mathematik, 72(4):137–163, 2017.
[HeroArnd01] H. Herold and J. Arndt. C-Programmierung unter Linux. SuSE Press, 2001.
[Holz00] G. A. Holzapfel. Nonlinear Solid Mechanics, a Continuum Approach for Engineering. John Wiley & Sons, 2000.
[HornJohn90] R. Horn and C. Johnson. Matrix Analysis. Cambridge University Press, 1990.
[Hugh87] T. J. R. Hughes. The Finite Element Method, Linear Static and Dynamic Finite Element Analysis.
Prentice–Hall, 1987. Reprinted by Dover.
[Intel90] Intel Corporation. i486 Microprocessor Programmers Reference Manual. McGraw-Hill, 1990.
[IsaaKell66] E. Isaacson and H. B. Keller. Analysis of Numerical Methods. John Wiley & Sons, 1966.
Republished by Dover in 1994.
[John87] C. Johnson. Numerical Solution of Partial Differential Equations by the Finite Element Method.
Cambridge University Press, 1987. Republished by Dover.
[Kell92] H. B. Keller. Numerical Methods for Two–Point Boundary Value Problems. Dover, 1992.
[Koko15] J. Koko. Approximation numérique avec Matlab, Programmation vectorisée, équations aux
dérivées partielles. Ellipses, Paris, 2015.
[LascTheo87] P. Lascaux and R. Théodor. Analyse numérique matricielle appliquée a l’art de l’ingénieur,
Tome 2. Masson, Paris, 1987.
[LiesTich05] J. Liesen and P. Tichý. Convergence analysis of Krylov subspace methods. GAMM Mitt. Ges.
Angew. Math. Mech., 27(2):153–173 (2005), 2004.
[Linz79] P. Linz. Theoretical Numerical Analysis. John Wiley& Sons, 1979. Republished by Dover.
[MeybVach91] K. Meyberg and P. Vachenauer. Höhere Mathematik II. Springer, Berlin, 1991.
[MontRung03] D. Montgomery and G. Runger. Applied Statistics and Probability for Engineers. John
Wiley & Sons, third edition, 2003.
[Oden71] J. Oden. Finite Elements of Nonlinear Continua. Advanced engineering series. McGraw-Hill,
1971. Republished by Dover, 2006.
[Ogde13] R. Ogden. Non-Linear Elastic Deformations. Dover Civil and Mechanical Engineering. Dover
Publications, 2013.
[OttoPete92] N. S. Ottosen and H. Petersson. Introduction to the Finite Element Method. Prentice Hall,
1992.
[RalsRabi78] A. Ralston and P. Rabinowitz. A First Course in Numerical Analysis. McGraw–Hill, second edition, 1978. Republished by Dover in 2001.
[RawlPantuDick98] J. Rawlings, S. Pantula, and D. Dickey. Applied regression analysis. Springer texts in
statistics. Springer, New York, 2. ed edition, 1998.
[Redd13] J. N. Reddy. An Introduction to Continuum Mechanics. Cambridge University Press, 2nd edition,
2013.
[Redd15] J. N. Reddy. An Introduction to Nonlinear Finite Element Analysis. Oxford University Press, 2nd
edition, 2015.
[Saad00] Y. Saad. Iterative Methods for Sparse Linear Systems. PWS, second edition, 2000. available on
the internet.
[SchnWihl11] H. R. Schneebeli and T. Wihler. The Newton-Raphson Method and Adaptive ODE Solvers. Fractals, 19(1):87–99, 2011.
[ShamDym95] I. Shames and C. Dym. Energy and Finite Element Methods in Structural Mechanics. New
Age International Publishers Limited, 1995.
[ShamReic97] L. Shampine and M. W. Reichelt. The MATLAB ODE Suite. SIAM Journal on Scientific
Computing, 18:1–22, 1997.
[Shew94] J. R. Shewchuk. An introduction to the conjugate gradient method without the agonizing pain.
Technical report, Carnegie Mellon University, 1994.
[Smit84] G. D. Smith. Numerical Solution of Partial Differential Equations: Finite Difference Methods. Oxford University Press, Oxford, third edition, 1986.
[VarFEM] A. Stahel. Calculus of Variations and Finite Elements. Lecture Notes used at HTA Biel, 2000.
[Stah00] A. Stahel. Calculus of Variations and Finite Elements. supporting notes, 2000.
[Octave07] A. Stahel. Octave and Matlab for Engineers. lecture notes, 2007.
[StraFix73] G. Strang and G. J. Fix. An Analysis of the Finite Element Method. Prentice–Hall, 1973.
[Thom95] J. W. Thomas. Numerical Partial Differential Equations: Finite Difference Methods, volume 22
of Texts in Applied Mathematics. Springer Verlag, New York, 1995.
[TongRoss08] P. Tong and J. Rossettos. Finite Element Method, Basic Technique and Implementation. MIT,
1977. Republished by Dover in 2008.
[YounGreg72] D. M. Young and R. T. Gregory. A Survey of Numerical Analysis, Volume 1. Dover Publica-
tions, New York, 1972.
[Zien13] O. Zienkiewicz, R. Taylor, and J. Zhu. The Finite Element Method: Its Basis and Fundamentals. Butterworth-Heinemann, seventh edition, 2013.
[ZienMorg06] O. C. Zienkiewicz and K. Morgan. Finite Elements and Approximation. John Wiley & Sons,
1983. Republished by Dover in 2006.
List of Figures
4.1 FD stencils for y′(t), forward, backward and centered approximations . . . . . . . . . . . . 255
4.2 Finite difference approximations of derivatives . . . . . . . . . . . . . . . . . . . . . . . . 256
5.21 Stress–strain curve for Hooke's linear law and a compressible Ogden material under hydrostatic load with ν = 0.3 . . . . . . . . . . . . 384
5.22 Plane strain and plane stress situation . . . . . . . . . . . . 387
List of Tables
3.11 Data for a stiff ODE problem with different algorithms . . . . . . . . . . . . . . . . . . . . 216
3.12 Data for a stiff ODE system with different algorithms . . . . . . . . . . . . . . . . . . . . . 218
3.13 Data for a stiff ODE system with different algorithms, using the Jacobian matrix . . . . . . . 219
3.14 Data for non–stiff ODE system with different algorithms . . . . . . . . . . . . . . . . . . . 220
3.15 Commands for linear regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 229
3.16 Examples for linear and nonlinear regression . . . . . . . . . . . . . . . . . . . . . . . . . 240
3.17 Commands for nonlinear regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 241
3.18 Estimated and exact values of the parameters . . . . . . . . . . . . . . . . . . . . . . . . . 243
3.19 Codes for chapter 3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 251
Index
templates, 85
tensor, 345
Cauchy–Green deformation, 365, 366
deformation gradient, 365
displacement gradient, 348, 365, 366
Green strain, 366, 368
infinitesimal strain, 329, 366
infinitesimal stress, 366
thermal conductivity, 8
Tikhonov regularization, 21
time discretization, 465
tolerance, 214
absolute, 214
relative, 214
trapz, 165
Tresca stress, 341, 344
triangle, 407
triangular matrix, 36
triangularization, 407
tube, torsion, 361
tumor growth, 457
unit roundoff, 24
variance
of parameters, 226
vector
norm, 45
outer unit normal, 316, 416
Volterra–Lotka, 184
volume forces, 355
von Mises stress, 341, 345