
Numerical Methods

Andreas Stahel

Version of May 21, 2021


©Andreas Stahel, 2007–2021
All rights reserved. This work may not be translated or copied in whole or in part without the written
permission by the author, except for brief excerpts in connection with reviews or scholarly analysis. Use in
connection with any form of information storage and retrieval, electronic adaptation, computer software is
forbidden.
Contents

0 Introduction 1
0.1 Using these Lecture Notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
0.2 Content and Goals of this Class . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
0.3 Literature . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
0.3.1 A Selection of Literature on Numerical Methods . . . . . . . . . . . . . . . . . . . 2
0.3.2 A Selection of Literature on the Finite Element Method . . . . . . . . . . . . . . . . 3
Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5

1 The Model Problems 8


1.1 Heat Equations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
1.1.1 Description . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
1.1.2 The One Dimensional Heat Equation . . . . . . . . . . . . . . . . . . . . . . . . . 9
1.1.3 The Steady State Problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
1.1.4 The Dynamic Problem, Separation of Variables . . . . . . . . . . . . . . . . . . . . 10
1.1.5 The Two Dimensional Heat Equation . . . . . . . . . . . . . . . . . . . . . . . . . 11
1.1.6 The Steady State 2D–Problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
1.1.7 The Dynamic 2D–Problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
1.2 Vertical Deformations of Strings and Membranes . . . . . . . . . . . . . . . . . . . . . . . 13
1.2.1 Equation for a Steady State Deformation of a String . . . . . . . . . . . . . . . . . 13
1.2.2 Equation for a Vibrating String . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
1.2.3 The Eigenvalue Problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
1.2.4 Equation for a Vibrating Membrane . . . . . . . . . . . . . . . . . . . . . . . . . . 14
1.2.5 Equation for a Steady State Deformation of a Membrane . . . . . . . . . . . . . . . 15
1.2.6 Eigenvalue Problem for a Membrane . . . . . . . . . . . . . . . . . . . . . . . . . 15
1.3 Horizontal Stretching of a Beam . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
1.3.1 Description . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
1.3.2 Poisson’s Ratio, Lateral Contraction . . . . . . . . . . . . . . . . . . . . . . . . . . 17
1.3.3 Nonlinear Stress Strain Relations . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
1.4 Bending and Buckling of a Beam . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
1.4.1 Description . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
1.4.2 Bending of a Beam . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
1.4.3 Buckling of a Beam . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
1.5 Tikhonov Regularization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22

2 Matrix Computations 23
2.1 Prerequisites and Goals . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
2.2 Floating Point Operations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
2.2.1 Floating Point Numbers and Rounding Errors in Arithmetic Operations . . . . . . . 23
2.2.2 Flops, Accessing Memory and Cache . . . . . . . . . . . . . . . . . . . . . . . . . 26
2.2.3 Multi Core Architectures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30


2.2.4 Using Multiple Cores of a CPU with Octave and MATLAB . . . . . . . . . . . . . . 31


2.3 The Model Matrices . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
2.3.1 The 1-d Model Matrix An . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
2.3.2 The 2-d Model Matrix Ann . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
2.4 Solving Systems of Linear Equations and Matrix Factorizations . . . . . . . . . . . . . . . 35
2.4.1 LR Factorization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
2.4.2 LR Factorization and Elementary Matrices . . . . . . . . . . . . . . . . . . . . . . 43
2.5 The Condition Number of a Matrix, Matrix and Vector Norms . . . . . . . . . . . . . . . . 45
2.5.1 Vector Norms and Matrix Norms . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
2.5.2 The Condition Number of a Matrix . . . . . . . . . . . . . . . . . . . . . . . . . . 49
2.5.3 The Effect of Rounding Errors, Pivoting . . . . . . . . . . . . . . . . . . . . . . . . 51
2.6 Structured Matrices . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
2.6.1 Symmetric Matrices, Algorithm of Cholesky . . . . . . . . . . . . . . . . . . . . . 55
2.6.2 Positive Definite Matrices . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
2.6.3 Stability of the Algorithm of Cholesky . . . . . . . . . . . . . . . . . . . . . . . . . 63
2.6.4 Banded Matrices and the Algorithm of Cholesky . . . . . . . . . . . . . . . . . . . 66
2.6.5 Computing with an Inverse Matrix is Usually Inefficient . . . . . . . . . . . . . . . 67
2.6.6 Octave Implementations of Sparse Direct Solvers . . . . . . . . . . . . . . . . . . . 68
2.6.7 A Selection Tree used in Octave for Sparse Linear Systems . . . . . . . . . . . . . 71
2.7 Sparse Matrices and Iterative Solvers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
2.7.1 The Model Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
2.7.2 Basic Definitions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
2.7.3 Steepest Descent Iteration, Gradient Algorithm . . . . . . . . . . . . . . . . . . . . 73
2.7.4 Conjugate Gradient Iteration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
2.7.5 Preconditioning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
2.7.6 The Incomplete Cholesky Preconditioner . . . . . . . . . . . . . . . . . . . . . . . 87
2.7.7 Conjugate Gradient Algorithm with an Incomplete Cholesky Preconditioner . . . . . 89
2.7.8 Conjugate Gradient Algorithm with an Incomplete LU Preconditioner . . . . . . . . 91
2.8 Iterative Solvers for Non-Symmetric Systems . . . . . . . . . . . . . . . . . . . . . . . . . 92
2.8.1 Normal Equation, Conjugate Gradient Normal Residual (CGNR) and BiCGSTAB . 92
2.8.2 Generalized Minimal Residual, GMRES and GMRES(m) . . . . . . . . . . . . . . 93
2.9 Iterative Solvers in MATLAB/Octave and a Comparison with Direct Solvers . . . . . . . . . 99
2.9.1 Iterative Solvers in MATLAB/Octave . . . . . . . . . . . . . . . . . . . . . . . . . . 99
2.9.2 A Comparison of Direct and Iterative Solvers . . . . . . . . . . . . . . . . . . . . . 99
2.10 Other Matrix Factorizations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100

3 Numerical Tools 102


3.0.1 Prerequisites and Goals . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102
3.1 Nonlinear Equations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103
3.1.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103
3.1.2 Bisection, Regula Falsi and Secant Method to Solve one Equation . . . . . . . . . . 105
3.1.3 Systems of Equations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110
3.1.4 The Contraction Mapping Principle and Successive Substitutions . . . . . . . . . . 111
3.1.5 Newton’s Algorithm to Solve Systems of Equations . . . . . . . . . . . . . . . . . . 115
3.1.6 Modifications of Newton’s Method . . . . . . . . . . . . . . . . . . . . . . . . . . 118
3.1.7 Octave/MATLAB Commands to Solve Equations . . . . . . . . . . . . . . . . . . . 120
3.1.8 Examples of Nonlinear Equations . . . . . . . . . . . . . . . . . . . . . . . . . . . 121
3.1.9 Optimization with MATLAB/Octave . . . . . . . . . . . . . . . . . . . . . . . . . . 128
3.2 Eigenvalues and Eigenvectors of Matrices, SVD, PCA . . . . . . . . . . . . . . . . . . . . 130
3.2.1 Matrices and Linear Mappings . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 130


3.2.2 Eigenvalues and Diagonalization of Matrices . . . . . . . . . . . . . . . . . . . . . 131


3.2.3 Level Sets of Quadratic Forms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 136
3.2.4 Commands for Eigenvalues and Eigenvectors in Octave/MATLAB . . . . . . . . . . 141
3.2.5 Eigenvalues and Systems of Ordinary Differential Equations . . . . . . . . . . . . . 144
3.2.6 SVD, Singular Value Decomposition . . . . . . . . . . . . . . . . . . . . . . . . . 149
3.2.7 From Gaussian Distribution to Covariance, and then to PCA . . . . . . . . . . . . . 152
3.3 Numerical Integration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 164
3.3.1 Integration of a Function Given by Data Points . . . . . . . . . . . . . . . . . . . . 164
3.3.2 Integration of a Function given by a Formula . . . . . . . . . . . . . . . . . . . . . 170
3.3.3 Integration Routines Provided by MATLAB/Octave . . . . . . . . . . . . . . . . . . 176
3.3.4 Integration over Domains in R2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . 179
3.4 Solving Ordinary Differential Equations, Initial Value Problems . . . . . . . . . . . . . . . 183
3.4.1 Different Types of Ordinary Differential Equations . . . . . . . . . . . . . . . . . . 183
3.4.2 The Basic Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 187
3.4.3 Stability of the Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 196
3.4.4 General Runge–Kutta Methods, Represented by Butcher Tables . . . . . . . . . . . 201
3.4.5 Adaptive Step Sizes and Extrapolation . . . . . . . . . . . . . . . . . . . . . . . . . 206
3.4.6 ODE solvers in MATLAB/Octave . . . . . . . . . . . . . . . . . . . . . . . . . . . . 208
3.4.7 Comparison of four Algorithms Available with Octave/MATLAB . . . . . . . . . . . 214
3.5 Linear and Nonlinear Regression, Curve Fitting . . . . . . . . . . . . . . . . . . . . . . . . 221
3.5.1 Linear Regression, Method of Least Squares . . . . . . . . . . . . . . . . . . . . . 221
3.5.2 Estimation of the Variance of Parameters, Confidence Intervals, Domain of Confidence . . 226
3.5.3 The commands LinearRegression(), regress() and lscov() for Octave
and MATLAB . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 228
3.5.4 An Elementary Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 231
3.5.5 How to Obtain Wrong Results! . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 237
3.5.6 An Example with Multiple Independent Variables . . . . . . . . . . . . . . . . . . . 239
3.5.7 Introduction to Nonlinear Regression . . . . . . . . . . . . . . . . . . . . . . . . . 240
3.5.8 Nonlinear Regression with a Logistic Function . . . . . . . . . . . . . . . . . . . . 244
3.5.9 Additional Commands from the Package optim in Octave . . . . . . . . . . . . . . 249
3.6 Resources . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 250
Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 250

4 Finite Difference Methods 254


4.1 Prerequisites and Goals . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 254
4.2 Basic Concepts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 254
4.2.1 Finite Difference Approximations of Derivatives . . . . . . . . . . . . . . . . . . . 254
4.2.2 Finite Difference Stencils . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 257
4.3 Consistency, Stability and Convergence . . . . . . . . . . . . . . . . . . . . . . . . . . . . 258
4.3.1 A Finite Difference Approximation of an Initial Value Problem . . . . . . . . . . . . 258
4.3.2 Explicit Method, Conditional Stability . . . . . . . . . . . . . . . . . . . . . . . . . 259
4.3.3 Implicit Method, Unconditional Stability . . . . . . . . . . . . . . . . . . . . . . . 259
4.3.4 General Difference Approximations, Consistency, Stability and Convergence . . . . 260
4.4 Boundary Value Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 264
4.4.1 Two Point Boundary Value Problems . . . . . . . . . . . . . . . . . . . . . . . . . 264
4.4.2 Boundary Values Problems on a Rectangle . . . . . . . . . . . . . . . . . . . . . . 271
4.5 Initial Boundary Value Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 273
4.5.1 The Dynamic Heat Equation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 273
4.5.2 Construction of the Solution Using Eigenvalues and Eigenvectors . . . . . . . . . . 275
4.5.3 Explicit Finite Difference Approximation to the Heat Equation . . . . . . . . . . . . 276
4.5.4 Implicit Finite Difference Approximation to the Heat Equation . . . . . . . . . . . . 278


4.5.5 Crank–Nicolson Approximation to the Heat Equation . . . . . . . . . . . . . . . . . 280


4.5.6 General Parabolic Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 282
4.5.7 A two Dimensional Dynamic Heat Equation . . . . . . . . . . . . . . . . . . . . . . 283
4.5.8 Comparison of a Few More Solvers for Dynamic Problems . . . . . . . . . . . . . . 285
4.6 Hyperbolic Problems, Wave Equations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 289
4.6.1 An Explicit Approximation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 290
4.6.2 An Implicit Approximation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 294
4.6.3 General Wave Type Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 295
4.7 Nonlinear Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 297
4.7.1 Partial Substitution or Picard Iteration . . . . . . . . . . . . . . . . . . . . . . . . . 297
4.7.2 Newton’s Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 298
Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 304

5 Calculus of Variations, Elasticity and Tensors 305


5.1 Prerequisites and Goals . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 305
5.2 Calculus of Variations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 305
5.2.1 The Euler Lagrange Equation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 305
5.2.2 Quadratic Functionals and Second Order Linear Boundary Value Problems . . . . . 312
5.2.3 The Divergence Theorem and its Consequences . . . . . . . . . . . . . . . . . . . . 313
5.2.4 Quadratic Functionals and Second Order Boundary Value Problems in 2 Dimensions 315
5.2.5 Nonlinear Problems and Euler–Lagrange Equations for Systems . . . . . . . . . . . 318
5.2.6 Hamilton’s principle of Least Action . . . . . . . . . . . . . . . . . . . . . . . . . . 320
5.3 Basic Elasticity, Description of Stress and Strain . . . . . . . . . . . . . . . . . . . . . . . . 326
5.3.1 Description of Strain . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 327
5.3.2 Description of Stress . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 336
5.3.3 Invariant Stress Expressions, Von Mises Stress and Tresca Stress . . . . . . . . . . . 341
5.4 Elastic Failure Modes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 343
5.4.1 Maximum Principal Stress Theory . . . . . . . . . . . . . . . . . . . . . . . . . . . 344
5.4.2 Maximum Shear Stress Theory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 344
5.4.3 Maximum Distortion Energy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 345
5.5 Scalars, Vectors and Tensors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 345
5.5.1 Change of Coordinate Systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 345
5.5.2 Zero Order Tensors: Scalars . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 345
5.5.3 First Order Tensors: Vectors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 346
5.5.4 Second Order Tensors: some Matrices . . . . . . . . . . . . . . . . . . . . . . . . . 346
5.6 Hooke’s Law and Elastic Energy Density . . . . . . . . . . . . . . . . . . . . . . . . . . . 349
5.6.1 Hooke’s Law . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 349
5.6.2 Hooke’s law for Incompressible Materials . . . . . . . . . . . . . . . . . . . . . . . 350
5.6.3 Elastic Energy Density . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 350
5.6.4 Volume and Surface Forces, the Bernoulli Principle . . . . . . . . . . . . . . . . . . 355
5.6.5 Some Exemplary Situations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 356
5.7 More on Tensors and Energy Densities for Nonlinear Material Laws . . . . . . . . . . . . . 365
5.7.1 A few More Tensors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 365
5.7.2 Neo–Hookean Energy Density Models . . . . . . . . . . . . . . . . . . . . . . . . . 369
5.7.3 Ogden and Mooney–Rivlin Energy Density Models . . . . . . . . . . . . . . . . . . 378
5.7.4 References used in the Above Section . . . . . . . . . . . . . . . . . . . . . . . . . 386
5.8 Plane Strain . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 387
5.8.1 Description of Plane Strain and Plane Stress . . . . . . . . . . . . . . . . . . . . . . 387
5.8.2 From the Minimization Formulation to a System of PDE’s . . . . . . . . . . . . . . 390
5.8.3 Boundary Conditions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 392
5.9 Plane Stress . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 393


5.9.1 From the Plane Stress Matrix to the Full Stress Matrix . . . . . . . . . . . . . . . . 394
5.9.2 From the Minimization Formulation to a System of PDE’s . . . . . . . . . . . . . . 395
5.9.3 Boundary Conditions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 396
5.9.4 Deriving the Differential Equations using the Euler–Lagrange Equation . . . . . . . 397
Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 399

6 Finite Element Methods 401


6.1 From Minimization to the Finite Element Method . . . . . . . . . . . . . . . . . . . . . . . 401
6.2 Piecewise Linear Finite Elements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 403
6.2.1 Discretization, Approximation and Assembly of the Global Stiffness Matrix . . . . . 403
6.2.2 Integration over one Triangle . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 405
6.2.3 Integration of ∇u · ∇u over one Triangle . . . . . . . . . . . . . . . . . . . . . . . 405
6.2.4 The Element Stiffness Matrix . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 406
6.2.5 Triangularization of the Domain Ω ⊂ R2 . . . . . . . . . . . . . . . . . . . . . . . 407
6.2.6 Assembly of the System of Linear Equations . . . . . . . . . . . . . . . . . . . . . 407
6.2.7 The Algorithm of Cuthill and McKee to Reduce Bandwidth . . . . . . . . . . . . . 409
6.2.8 A First Solution by the FEM . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 411
6.2.9 Contributions to the Approximation Error . . . . . . . . . . . . . . . . . . . . . . . 414
6.3 Classical and Weak Solutions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 415
6.3.1 Weak Solutions of a System of Linear Equations . . . . . . . . . . . . . . . . . . . 415
6.3.2 Classical Solutions and Weak Solutions of Differential Equations . . . . . . . . . . 416
6.4 Energy Norms and Error Estimates . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 417
6.4.1 Basic Assumptions and Regularity Results . . . . . . . . . . . . . . . . . . . . . . 418
6.4.2 Function Spaces, Norms and Continuous Functionals . . . . . . . . . . . . . . . . . 418
6.4.3 Convergence of the Finite Dimensional Approximations . . . . . . . . . . . . . . . 420
6.4.4 Approximation by Piecewise Linear Interpolation . . . . . . . . . . . . . . . . . . . 423
6.4.5 Approximation by Piecewise Quadratic Interpolation . . . . . . . . . . . . . . . . . 425
6.5 Construction of Triangular Second Order Elements . . . . . . . . . . . . . . . . . . . . . . 427
6.5.1 Integration over a General Triangle . . . . . . . . . . . . . . . . . . . . . . . . . . 428
6.5.2 The Basis Functions for a Second Order Element . . . . . . . . . . . . . . . . . . . 432
6.5.3 Integration of Functions Given at the Nodes . . . . . . . . . . . . . . . . . . . . . . 434
6.5.4 Integrals to be Computed . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 435
6.5.5 The Octave code ElementContribution.m . . . . . . . . . . . . . . . . . . . 435
6.5.6 Integration of f φ over one Triangle . . . . . . . . . . . . . . . . . . . . . . . . . . 436
6.5.7 Integration of b0 u φ over one Triangle . . . . . . . . . . . . . . . . . . . . . . . . . 437
6.5.8 Transformation of the Gradient to the Standard Triangle . . . . . . . . . . . . . . . 437
6.5.9 Integration of u ~b ∇φ over one Triangle . . . . . . . . . . . . . . . . . . . . . . . . 441
6.5.10 Integration of a ∇u ∇φ over one Triangle . . . . . . . . . . . . . . . . . . . . . . . 442
6.5.11 Construction of the Element Stiffness Matrix . . . . . . . . . . . . . . . . . . . . . 444
6.6 Comparing First and Second Order Triangular Elements . . . . . . . . . . . . . . . . . . . 444
6.6.1 Observe the Convergence of the Error as h → 0 . . . . . . . . . . . . . . . . . . . . 444
6.6.2 Estimate the Number of Nodes and Triangles and the Effect on the Sparse Matrix . . 445
6.6.3 Behavior of a FEM Solution within Triangular Elements . . . . . . . . . . . . . . . 446
6.6.4 Remaining Pieces for a Complete FEM Algorithm, the FEMoctave Package . . . . 448
6.7 Applying the FEM to Other Types of Problems, e.g. Plane Elasticity . . . . . . . . . . . . . 449
6.8 Using Quadrilateral Elements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 449
6.8.1 First Order Quadrilateral Elements . . . . . . . . . . . . . . . . . . . . . . . . . . . 450
6.8.2 Second Order Quadrilateral Elements . . . . . . . . . . . . . . . . . . . . . . . . . 455
6.9 An Application of FEM to a Tumor Growth Model . . . . . . . . . . . . . . . . . . . . . . 457
6.9.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 457
6.9.2 The Finite Element Method Applied to Static 1D Problems . . . . . . . . . . . . . . 458


6.9.3 Solving the Dynamic Tumor Growth Problem . . . . . . . . . . . . . . . . . . . . . 465


Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 471

Bibliography for all Chapters 472

List of Figures 480

List of Tables 482

Index 483

Chapter 0

Introduction

0.1 Using these Lecture Notes


These lecture notes1 serve as support for the lectures, so that students are not forced to copy many results
and formulas from the blackboard, beamer or projector. The notes will not replace attending the lectures, but
the combination of lectures and notes should provide all necessary information in a digestible form.
In an ideal world a student will read through the lecture notes before the topic is presented in class. This
allows the student to take full advantage of the presentation in the lectures. In class more weight is put on
the ideas for the algorithms and methods, while the notes spell out the tedious details too. In the real world
it is more likely that a student is using the notes in class. It is a good idea to supplement the notes in class
with your personal notes. It should not be necessary to buy additional books for this class. A collection of
additional references and a short description of the content is given in Section 0.3.

0.2 Content and Goals of this Class


In this class we will present the necessary background to choose and use numerical algorithms to solve
problems arising in biomedical engineering applications. Obviously we have to choose a small subset of all
possibly useful topics to be considered.
• Some algorithms to solve large systems of linear equations are examined.

• Some basic numerical tools are presented.

– Basic methods to solve nonlinear equations.


– The basic ideas and tools of numerical integration of functions given by data points or formulas.
– The ideas and tools to generate numerical solutions of ordinary differential equations.
– The ideas and tools to use linear and nonlinear regression.

• The method of finite differences (FD) is presented and applied to a few typical problems.

• The necessary tensor calculus for the description of elasticity problems is presented.

• The principle of virtual work is applied to mechanical problems and the importance of the calculus of
variations is illustrated. This will serve as the basis for the Finite Element Method (FEM).

• The method of finite elements is introduced using 2D model problems. The convergence of triangular
elements of order 1 and 2 is examined carefully.
The topics to be examined in this class are shown in Figure 1.
1 Lecture notes are different from books. A book should be complete, while the combination of lectures and lecture notes should be complete.


[Figure 1: Structure of the topics examined in this class — a diagram connecting Numerical Tools, Matrix Computation, Finite Difference, Model Problems, Calculus of Variations/Elasticity and Finite Elements]

The main goals of this class are:

• Get to know a few basic algorithms and techniques to solve modeling problems.

• Learn a few techniques to examine the performance of numerical algorithms.

• Examine reliability of results of numerical computations.

• Examine speed and memory requirements of some algorithms.

The Purpose of Computing is Insight, not Numbers2

Obviously there are many important topics in numerical methods that we do not consider in this class. The
methods presented should help you to read and understand the corresponding literature. An incomplete list
of non-considered topics is: interpolation and approximation, numerical integration, ordinary differential
equations, . . .

To realize that reliable numerical methods can be very important, examine the list of disasters caused
by numerical errors at http://ta.twi.tudelft.nl/nw/users/vuik/wi211/disasters.html. Among the spectacular
events are Patriot missile failures, an Ariane rocket blowup and the sinking of the Sleipner offshore platform.

0.3 Literature
Below find a selection of books on numerical analysis and the Finite Element Method. The list is not an
attempt to generate a representative selection, but is strongly influenced by my personal preference and
bookshelf. The list of covered topics might help when you face problems not solvable with the methods and
ideas presented in this class.

0.3.1 A Selection of Literature on Numerical Methods


[Stah08] A. Stahel, Numerical Methods: the lectures notes for this class should be sufficient to follow the class.
You might want to consult other books, either to catch up on prerequisite knowledge or to find other
approaches to topics presented in class.

[Schw09] H.R. Schwarz, Numerische Mathematik: this is a good introduction to numerical mathematics and
might serve well as a complement to the lecture notes.
2 Richard Hamming, 1962


[IsaaKell66] Isaacson and Keller, Analysis of Numerical Methods: this is an excellent (and very affordable) in-
troduction to numerical analysis. It is mathematically very solid and I strongly recommend it as a
supplement.

[DahmReus07] Dahmen and Reusken, Numerik für Ingenieure und Naturwissenschaftler: this is a comprehensive
presentation of most of the important numerical algorithms.

[GoluVanLoan96] Gene Golub, Charles Van Loan, Matrix Computations: this is the bible for matrix computations (see
Chapter 2) and an excellent book. Use this as reference for matrix computations. There is a new,
expanded edition of this marvelous book [GoluVanLoan13].

[Smit84] G. D. Smith, Numerical Solution of Partial Differential Equations: Finite Difference Methods: this is
a basic introduction to the method of finite differences.

[Thom95] J. W. Thomas, Numerical Partial Differential Equations: Finite Difference Methods: this is an excel-
lent up-to-date presentation of finite difference methods. Use this book if you want to go beyond the
presentation in class.

[Acto90] F. S. Acton, Numerical Methods that Work: a well written book on many basic aspects of numerical
methods. Common sense advice is given out freely. Well worth reading.

[Pres92] Press et al., Numerical Recipes in C: this is a collection of basic algorithms and some explanation of
the effects and aspects to consider. There are versions of this book for the programming languages C,
C++, Fortran, Pascal, Modula and Basic.

[Knor08] M. Knorrenschild, Numerische Mathematik: a very short and affordable collection of results and
examples. No proofs are given, just the facts stated. It might be useful when reviewing the topics and
preparing an exam.

[GhabWu16] J. Ghaboussi and X.S. Wu, Numerical Methods in Computational Mechanics: a selection of numerical
results and definitions, useful for computational mechanics.

Find a list of these references and the covered topics in Table 1.

0.3.2 A Selection of Literature on the Finite Element Method


[Zien13] Zienkiewicz, Taylor, and Zhu. The Finite Element Method: Its Basis and Fundamentals. A very good
introduction to FEM. Includes the application to linear elasticity.

[John87] Claes Johnson, Numerical Solution of Partial Differential Equations by the Finite Element Method:
this is an excellent introduction to FEM, readable and mathematically precise. It might serve well as
a supplement to the lecture notes.

[TongRoss08] P. Tong and J. Rossettos: Finite Element Method, Basic Technique and Implementation. The book
title says it all. Implementation details are carefully presented. It contains a good presentation of the
necessary elasticity concepts. Since this book is available as a Dover edition, the cost is minimal.

[Schw88] H. R. Schwarz, Finite Element Method: An easily understandable book, presenting the basic algo-
rithms and tools to use FEM.

[Hugh87] Thomas J. R. Hughes, The Finite Element Method: Linear Static and Dynamic Finite Element Analy-
sis: this classic book on FEM contains considerable information for elasticity problems. The presen-
tation discusses many implementation details, also for shells, beams, plates and curved elements. It is
very affordable.


Compared references (one column each): [Stah08], [Schw09], [IsaaKell66], [DahmReus07], [GoluVanLoan96], [Smit84], [Thom95], [Acto90], [Knor08], [GhabWu16]
Topic
Floating point arithmetic x x x x x x x
CPU, memory and cache structure x
Gauss algorithm o x x x x x x
LR factorization x x x x x x x
Cholesky algorithm x x x x x x x
Banded Cholesky x x x x
Stability of linear algorithms x x o x x
Iterative methods for linear systems o x x x x x
Conjugate gradient method x x x
Preconditioners o x x
Solving a single nonlinear equation x x x x x x
Newton’s algorithm for systems x x x x x x
FD approximation x x x x x x x
Consistency, stability, convergence, Lax theorem x x x x x x
Neumann stability analysis x x x
Boundary value problems 1D x x x x x x
Boundary value problems 2D x x x x x x x x
Stability of static problems x x x x
Explicit, implicit, Crank-Nicolson for heat equation x x x x x
Explicit, implicit for wave equation x x x x
Nonlinear FD problems x x
Numerical integration, 1D x x x x x
Gauss integration 1D o x x x x
Gauss integration 2D o x x x
ODE, initial value problems x x x x x x
Zeros of polynomials x x x
Interpolation x x x x
Eigenvalues and eigenvectors x x x x x x
Linear regression o x x x x

Table 1: Literature on Numerical Methods. An extensive coverage of the topic is marked by x, while a brief
coverage is marked by o.


[Brae02] Dietrich Braess, Finite Elemente: this is a modern presentation of the mathematical theory for FEM
and their implementation. Mathematically it is more advanced and more precise than these lecture notes.
Also find the exact formulation of the Bramble-Hilbert Lemma. This book is recommended for further
studies.

[Gmur00] Thomas Gmür, Méthodes des éléments finis en mécanique des structures: An introduction (in french)
for FEM applied to elasticity.

[ZienMorg06] O. C. Zienkiewicz and K. Morgan, Finite Elements and Approximation: A solid presentation of the
FEM is given, preceded by a short review of the finite difference method.

[AtkiHan09] K. Atkinson and W. Han, Theoretical Numerical Analysis: a good functional analysis framework
for numerical analysis is presented in this book. This is a rather theoretical, well written book. It
also shows the connection of partial differential equations and numerical methods. An excellent
presentation of the FEM is given.

[Ciar02] Philippe Ciarlet, The Finite Element Method for Elliptic Problems: Here you find a mathematical
presentation of the error estimates for the FEM, including all details. The Bramble-Hilbert Lemma is
carefully examined. This is a very solid mathematical foundation.

[Prze68] J.S. Przemieniecki, Theory of Matrix Structural Analysis: this is a classical presentation of the me-
chanical approach to FEM. A good introduction of the keywords stress, strain and Hooke law is
shown.

[Shab08] A. A. Shabana, Computational Continuum Mechanics: covers continuum mechanics from elasticity to FEM, including plasticity.

[Koko15] Jonas Koko, Approximation numérique avec MATLAB, Programmation vectorisée, équations aux
dérivées partielles. A nice introduction (in french) to MATLAB and the coding of FEM algorithms.
The details are worked out. Some code for the linear elasticity problem is developed.

[ShamDym95] Shames and Dym, Energy and Finite Element Methods in Structural Mechanics. A solid introduction
to variational methods, applied to structural mechanics. Contains a good description of mechanics
and FEM.

[Froc16] Jörg Frochte, Finite–Elemente–Methode. Eine praxisbezogene Einführung mit GNU Octave/Matlab.
A very readable introduction with complete codes for MATLAB/Octave. Contains a presentation of
dynamic problems.

Find a list of references for the FEM and the covered topics in Table 2.

Bibliography
[Acto90] F. S. Acton. Numerical Methods that Work; 1990 corrected edition. Mathematical Association of
America, Washington, 1990.

[AtkiHan09] K. Atkinson and W. Han. Theoretical Numerical Analysis. Number 39 in Texts in Applied
Mathematics. Springer, 2009.

[Brae02] D. Braess. Finite Elemente. Theorie, schnelle Löser und Anwendungen in der Elastizitätstheorie.
Springer, second edition, 2002.

[Ciar02] P. G. Ciarlet. The Finite Element Method for Elliptic Problems. SIAM, 2002.

[DahmReus07] W. Dahmen and A. Reusken. Numerik für Ingenieure und Naturwissenschaftler. Springer,
2007.


Compared references (one column each): [Stah08], [Zien13], [John87], [TongRoss08], [Schw88], [Hugh87], [Brae02], [Gmur00], [ZienMorg06], [AtkiHan09], [Ciar02], [Prze68], [Koko15], [ShamDym95], [Froc16]
Topic
Intro to Calculus of Variations x x x x x x x x x x x x x x x
Finite element method, linear 2D x x x x x x x x x x x x x x
Generation of stiffness matrix x x x x x x x x x x x x
Error estimates, Lemma of Cea x x x x x x x x
Second order elements in 2D x x x x x x x x x x
Gauss integration x x x x x x x
Formulation of elasticity problems x x x x x x x x x x x x x

Table 2: Literature on the Finite Element Method

[Froc16] J. Frochte. Finite-Elemente-Methode: Eine praxisbezogene Einführung mit GNU Octave/MATLAB. Hanser Fachbuchverlag, 2016. Octave/Matlab code available.

[GhabWu16] J. Ghaboussi and X. Wu. Numerical Methods in Computational Mechanics. CRC Press, 2016.

[Gmur00] T. Gmür. Méthode des éléments finis en mécanique des structures. Mécanique (Lausanne).
Presses polytechniques et universitaires romandes, 2000.

[GoluVanLoan96] G. Golub and C. Van Loan. Matrix Computations. Johns Hopkins University Press, third
edition, 1996.

[GoluVanLoan13] G. Golub and C. Van Loan. Matrix Computations. Johns Hopkins University Press,
fourth edition, 2013.

[Hugh87] T. J. R. Hughes. The Finite Element Method, Linear Static and Dynamic Finite Element Analysis.
Prentice–Hall, 1987. Reprinted by Dover.

[IsaaKell66] E. Isaacson and H. B. Keller. Analysis of Numerical Methods. John Wiley & Sons, 1966.
Republished by Dover in 1994.

[John87] C. Johnson. Numerical Solution of Partial Differential Equations by the Finite Element Method.
Cambridge University Press, 1987. Republished by Dover.

[Knor08] M. Knorrenschild. Numerische Mathematik. Carl Hanser Verlag, 2008.

[Koko15] J. Koko. Approximation numérique avec Matlab, Programmation vectorisée, équations aux
dérivées partielles. Ellipses, Paris, 2015.

[Pres92] W. H. Press, B. P. Flannery, S. A. Teukolsky, and W. T. Vetterling. Numerical Recipes in C, The Art of Scientific Computing. Cambridge University Press, second edition, 1992.

[Prze68] J. Przemieniecki. Theory of Matrix Structural Analysis. McGraw–Hill, 1968. Republished by Dover in 1985.

[Schw88] H. R. Schwarz. Finite Element Method. Academic Press, 1988.

[Schw09] H. R. Schwarz. Numerische Mathematik. Teubner und Vieweg, 7. edition, 2009.

[Shab08] A. A. Shabana. Computational Continuum Mechanics. Cambridge University Press, 2008.


[ShamDym95] I. Shames and C. Dym. Energy and Finite Element Methods in Structural Mechanics. New
Age International Publishers Limited, 1995.

[Smit84] G. D. Smith. Numerical Solution of Partial Differential Equations: Finite Difference Methods.
Oxford University Press, Oxford, third edition, 1986.

[Stah08] A. Stahel. Numerical Methods. lecture notes, BFH-TI, 2008.

[Thom95] J. W. Thomas. Numerical Partial Differential Equations: Finite Difference Methods, volume 22
of Texts in Applied Mathematics. Springer Verlag, New York, 1995.

[TongRoss08] P. Tong and J. Rossettos. Finite Element Method, Basic Technique and Implementation. MIT,
1977. Republished by Dover in 2008.

[Zien13] O. Zienkiewicz, R. Taylor, and J. Zhu. The Finite Element Method: Its Basis and Fundamentals.
Butterworth-Heinemann, 7 edition, 2013.

[ZienMorg06] O. C. Zienkiewicz and K. Morgan. Finite Elements and Approximation. John Wiley & Sons,
1983. Republished by Dover in 2006.

Chapter 1

The Model Problems

The main goal of the introductory chapter is to familiarize you with a small set of sample problems. These
problems will be used throughout the class and in the notes to illustrate the presented methods and algo-
rithms. Each of the selected model problems stands for a class of similar application problems.

1.1 Heat Equations


1.1.1 Description
The heat capacity c of a material gives the amount of energy needed to raise the temperature T of one
kilogram of the material by one degree K (Kelvin). The thermal conductivity k of a material indicates the
amount of energy transmitted through a plate with thickness 1 m and 1 m2 area if the temperatures at the
two sides differ by 1 K. In Table 1.1 find values for c and k for some typical materials. For homogeneous
materials the values of c and k will not depend on the location ~x. For most materials the values depend on
the temperature T , but we will not consider this case (yet). The resulting equations would be nonlinear.

                heat capacity c at 20◦C     heat conductivity k
    unit        kJ/(kg K)                   W/(m K)
    iron        0.452                       74
    steel       0.42 - 0.51                 45
    copper      0.383                       384
    water       4.182                       0.598

Table 1.1: Some values of heat related constants

The flux of thermal energy is a vector indicating the direction of the flow and the amount of thermal
energy flowing per second and square meter. Fourier’s law of heat conduction can be stated in the form

~q = −k ∇T . (1.1)

This basic law of physics indicates that the thermal energy will flow from hot spots to areas with lower
temperature. For some simple situations the consequences of this equation will be examined. The only
other basic physical principle to be used is conservation of energy. Some of the variables and symbols
used in this section are shown in Table 1.2.


                                        symbol    unit
    density of energy                   u         J/m3
    temperature                         T         K
    heat capacity                       c         J/(K kg)
    density                             ρ         kg/m3
    heat conductivity                   k         J/(s m K)
    heat flux                           ~q        J/(s m2)
    external energy source density      f         J/(s m3)
    area of the cross section           A         m2

Table 1.2: Symbols and variables for heat conduction

1.1.2 The One Dimensional Heat Equation


If a temperature T over a solid (with constant cross section A) is known to depend on one coordinate x only,
then the change of temperature ∆T measured over a distance ∆x will lead to a flow of thermal energy ∆Q.
If the time difference is ∆t then (1.1) reads as
    ∆Q/∆t = −k A ∆T/∆x .

This is Fourier’s law and it leads to a heat flux in direction x of

    q = −k ∂T/∂x .

Consider the temperature T as dependent variable and find on the interval a ≤ x ≤ b the thermal energy

    E(t) = ∫_a^b A u(t, x) dx = ∫_a^b A ρ c T (t, x) dx .

Now compute the rate of change of energy in the same interval. The rate of change has to equal the total
flux of energy into this interval plus the input from external sources

total change = input through boundary + external sources ,


    ∂E(t)/∂t = −k A(a) ∂T(t, a)/∂x + k A(b) ∂T(t, b)/∂x + ∫_a^b A(x) f (t, x) dx ,

    ∫_a^b A(x) ρ c ∂T(t, x)/∂t dx = ∫_a^b ∂/∂x ( k A(x) ∂T(t, x)/∂x ) dx + ∫_a^b A(x) f (t, x) dx .

At this point use the conservation of energy principle. Since the above equation has to be correct for all
possible values of a and b the expressions under the integrals have to be equal and we obtain the general
equation for heat conduction in one variable
 
    A(x) ρ c ∂T(t, x)/∂t = ∂/∂x ( k A(x) ∂T(t, x)/∂x ) + A(x) f (t, x) .


1.1.3 The Steady State Problem


If the steady state situation has to be examined then the temperature T can not depend on t and thus

    −∂/∂x ( k A(x) ∂T(x)/∂x ) = A(x) f (x) .

This second order differential equation has to be supplemented by boundary conditions, either prescribed
temperature or prescribed energy flux.
As a standard example consider a beam with constant cross section of length 1 with known temperature
T = 0 at the two ends and a known heating contribution f (x). This leads to

    −d²T(x)/dx² = (1/k) f (x)   for 0 ≤ x ≤ 1   and   T (0) = T (1) = 0 .                  (1.2)
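
A minimal Octave/MATLAB sketch, with the assumed choices k = 1 and f (x) = sin(π x) (so that T (x) = sin(π x)/π² is the exact solution), shows how (1.2) can be approximated by centered finite differences; the systematic treatment of such approximations follows in Chapter 4.

n = 99;  h = 1/(n+1);  x = (1:n)'*h;          % interior grid points
k = 1;  f = sin(pi*x);                        % assumed conductivity and heating
e = ones(n,1);
A = spdiags([-e 2*e -e], -1:1, n, n)/h^2;     % -d^2/dx^2 with T(0) = T(1) = 0
T = A\(f/k);                                  % solve the linear system
disp(max(abs(T - sin(pi*x)/pi^2)))            % error of order h^2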

1.1.4 The Dynamic Problem, Separation of Variables


As standard example consider a beam with constant cross section A of length 1 with prescribed temperature
at both ends for all times and a known temperature profile T (0, x) = T0 (x) at time t = 0. This leads to the
partial differential equation
    (ρ c / k) ∂T(t, x)/∂t − ∂²T(t, x)/∂x² = (1/k) f (x)   for 0 ≤ x ≤ 1 and t ≥ 0
    T (t, 0) = T (t, 1) = 0                                for t ≥ 0                       (1.3)
    T (0, x) = T0 (x)                                      for 0 ≤ x ≤ 1 .

For the special case f (x) = 0 use the method of separation of variables to find a solution to this
problem. With α = k/(ρ c) the above simplifies to

    ∂T(t, x)/∂t = α ∂²T(t, x)/∂x²   for 0 ≤ x ≤ 1 and t ≥ 0
    T (t, 0) = T (t, 1) = 0          for t ≥ 0
    T (0, x) = T0 (x)                for 0 ≤ x ≤ 1 .

• Look for solutions of the form T (t, x) = h(t) · u(x), i.e. a product of functions of one variable each.
Plugging this into the partial differential equation leads to

    ∂T(t, x)/∂t = α ∂²T(t, x)/∂x²
    ḣ(t) · u(x) = α h(t) · u′′(x)
    (1/α) ḣ(t)/h(t) = u′′(x)/u(x) .

Since the left hand side depends on t only and the right hand side on x only, both have to be constant.
One can verify that the constant has to be negative, e.g. −λ2 .

• Using the right hand side leads to the boundary value problem

    u′′(x) = −λ² u(x)   with   u(0) = u(1) = 0 .

Use the boundary conditions T (t, 0) = T (t, 1) = 0 to verify that this problem has nonzero solutions
only for the values λn = n π and the solutions are given by

un (x) = sin(x n π) .


• The resulting differential equation for h(t) is

    ∂h(t)/∂t = −α λn² h(t)

  with the solutions hn (t) = hn (0) exp(−α λn² t).

• By combining the above find solutions

    Tn (t, x) = hn (t) · un (x) = hn (0) sin(x n π) exp(−α λn² t) .

• This solution satisfies the initial condition T (0, x) = hn (0) sin(x n π). To satisfy arbitrary initial
  conditions T (0, x) = T0 (x) use the superposition principle and a Fourier sine series

    T0 (x) = Σ_{n=1}^∞ bn sin(n π x)

  with the Fourier coefficients

    bn = 2 ∫_0^1 T0 (x) sin(n π x) dx .

  Combining the above find the solution

    T (t, x) = Σ_{n=1}^∞ bn sin(x n π) exp(−α λn² t) .

Use this explicit formula to verify the accuracy of the numerical approximations to be examined.
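
The truncated series is easy to evaluate with Octave/MATLAB. The sketch below uses the assumed values α = 1, t = 0.01 and the assumed initial profile T0 (x) = x (1 − x); the coefficients bn are computed by numerical integration and the series is truncated after N terms.

alpha = 1;  N = 50;  t = 0.01;                % assumed diffusivity, number of modes, time
T0 = @(x) x.*(1-x);                           % assumed initial temperature profile
x = linspace(0,1,101);  T = zeros(size(x));
for n = 1:N
  bn = 2*quad(@(s) T0(s).*sin(n*pi*s), 0, 1); % Fourier sine coefficient
  T = T + bn*sin(n*pi*x)*exp(-alpha*(n*pi)^2*t);
end
plot(x, T);  xlabel('x');  ylabel('T(t,x)')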

1.1.5 The Two Dimensional Heat Equation


If the domain G ⊂ R2 with boundary curve C describes a thin plate with constant thickness h then we may
assume that the temperature will depend on t, x and y only and not on z. The total energy stored in that
domain is given by

    E(t) = ∫∫_G h u dA = ∫∫_G h c ρ T (t, x, y) dA .
Again examine the rate of change of energy dE/dt and arrive at

    ∂E/∂t = ∫∫_G h c ρ ∂T/∂t dA = − ∮_C h ~q · ~n ds + ∫∫_G h f dA .

Using the divergence theorem on the second integral and Fourier’s law find
    ∫∫_G h c ρ ∂T/∂t dA = − ∫∫_G div(h ~q) dA + ∫∫_G h f dA
                        = ∫∫_G div(k h ∇T ) dA + ∫∫_G h f dA
                        = ∫∫_G div(k h ∇T (t, x, y)) dA + ∫∫_G h f dA .

This equation has to be correct for all possible domains G and not only for the physical domain. Thus the
expressions under the integral have to be equal and thus find the partial differential equation.
    h c ρ ∂T/∂t = div (k h ∇T (t, x, y)) + h f .                                           (1.4)


If ρ, c, h and k are constant then find in cartesian coordinates


    c ρ ∂T(t, x, y)/∂t = k ( ∂²T(t, x, y)/∂x² + ∂²T(t, x, y)/∂y² ) + f (t, x, y)
or shorter

    c ρ ∂T(t, x, y)/∂t = k ∆T (t, x, y) + f (t, x, y) ,
where ∆ is the well known Laplace operator
    ∆u = ∂²u/∂x² + ∂²u/∂y² .
The heat equation is a second order partial differential equation with respect to the space variable x and y
and of first order with respect to time t.

1.1.6 The Steady State 2D–Problem


If the steady state has to be examined then the temperature T can not depend on t and then use the unit
square 0 ≤ x, y ≤ 1 as the standard domain. Then the problem simplifies to
    −∆T (x, y) = (1/k) f (x, y)   for 0 ≤ x, y ≤ 1
                                                                                           (1.5)
    T (x, y) = 0                  for (x, y) on boundary .
Find a possible solution of the above equation on an L-shaped domain in Figure 1.1. On one part of the
boundary the temperature T (x, y) = 0 is prescribed, but on the section on the right in Figure 1.1 the
condition is thermal insulation, i.e. vanishing normal derivative ∂T/∂n = 0 .

[Figure 1.1: Temperature T as function of the horizontal position]
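
For the unit square problem (1.5) a minimal Octave/MATLAB sketch uses the standard 5-point finite difference stencil, i.e. the 2-d model matrix examined in Chapter 2. The values k = 1 and f (x, y) = 2 π² sin(π x) sin(π y) are assumed, so that T (x, y) = sin(π x) sin(π y) is the exact solution.

n = 50;  h = 1/(n+1);  x = (1:n)'*h;  [X, Y] = meshgrid(x, x);
e = ones(n,1);
A1 = spdiags([-e 2*e -e], -1:1, n, n)/h^2;    % 1-d second difference matrix
A = kron(speye(n), A1) + kron(A1, speye(n));  % 5-point stencil on the unit square
f = 2*pi^2*sin(pi*X).*sin(pi*Y);              % assumed source term, k = 1
T = reshape(A\f(:), n, n);                    % solve and reshape to the grid
disp(max(max(abs(T - sin(pi*X).*sin(pi*Y))))) % error of order h^2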

1.1.7 The Dynamic 2D–Problem


If in the above problem the temperature T depends on time t too we have to solve the dynamic problem,
supplemented with appropriate initial and boundary values. Our example problem turns into
    (ρ c / k) ∂T(t, x, y)/∂t − ∆T (t, x, y) = (1/k) f (t, x, y)   for 0 ≤ x, y ≤ 1 and t ≥ 0
    T (t, x, y) = 0                                                for (x, y) on boundary and t ≥ 0     (1.6)
    T (0, x, y) = T0 (x, y)                                        for 0 ≤ x, y ≤ 1 .


Figure 1.1 might represent the solution T (t, x, y) at a given time t .

1.2 Vertical Deformations of Strings and Membranes


1.2.1 Equation for a Steady State Deformation of a String
In Figure 1.2 a segment of a string is shown. If the segment of horizontal length ∆x is to be at rest the sum
of all the forces applied to the segment has to vanish, i.e.

T~ (x + ∆x) − T~ (x) = −F~ .

[Figure 1.2: Segment of a string of horizontal length ∆x, showing the tension vectors −T~ (x) and T~ (x + ∆x) and the external force F~ ]

Assume

• the vertical displacement of the string is given by a function y = u(x).

• the slopes of the string are small, i.e. |u′(x)| ≪ 1

• the horizontal component of the tension T~ is constant and equals T0 .

• the vertical external force is given by a force density function f (x)

Thus the external vertical force to the segment is given by


    F = ∫_x^{x+∆x} f (s) ds ≈ f (x) ∆x .

For a string the tension vector T~ is required to have the same slope as the string. Using the condition of
small slope arrive at

    T2 (x) = u′(x) T0
    T2 (x + ∆x) = u′(x + ∆x) T0
    T2 (x + ∆x) − T2 (x) = ( u′(x + ∆x) − u′(x) ) T0 ≈ T0 u′′(x) ∆x .

Combining the above approximations and dividing by ∆x leads to the differential equation

    −T0 u′′(x) = f (x) .                                                                   (1.7)


Supplement this with the boundary conditions of a fixed string (i.e. u(0) = u(L) = 0) to obtain a second
order boundary value problem (BVP). Using Calculus of Variations (see Section 5.2) observe that solving
this BVP is equivalent to finding the minimizer of the functional
    F (u) = ∫_0^L (1/2) T0 (u′(x))² − f (x) · u(x) dx

amongst ‘all ’ functions u with u(0) = u(L) = 0 .

1.2.2 Equation for a Vibrating String


If the sum of the vertical forces in the previous section is not equal to zero, then the string will accelerate.
The necessary force equals M ü ≈ ρ ü ∆x, where ρ is the specific mass measured in mass per length. With
the boundary conditions at the two ends arrive at the initial boundary value problem (IBVP).

    ρ(x) ü(t, x) = T0 u′′(t, x) + f (t, x)    for 0 < x < L and t > 0
    u(t, 0) = u(t, L) = 0                     for 0 ≤ t
                                                                                           (1.8)
    u(0, x) = u0 (x)                          for 0 < x < L
    u̇(0, x) = u1 (x)                          for 0 < x < L .

1.2.3 The Eigenvalue Problem


For an eigenvalue λ and an eigenfunction of the problem

    −T0 y′′(x) = λ ρ(x) y(x)   with   y(0) = y(L) = 0 ,                                    (1.9)

the function

    u(t, x) = A cos(√λ t + φ) · y(x)

is a solution of the equation of the vibrating string with no external force, i.e. f (t, x) = 0 . Thus the
eigenvalues λ lead to vibrations with the frequencies ν = √λ/(2 π) .
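
A minimal Octave/MATLAB sketch, with the assumed constants T0 = ρ = L = 1, replaces −(T0/ρ) y′′ by a finite difference matrix and compares the lowest computed frequencies with the exact values νn = n √(T0/ρ)/(2 L).

L = 1;  T0 = 1;  rho = 1;                     % assumed string data
n = 200;  h = L/(n+1);  e = ones(n,1);
A = (T0/rho)*spdiags([-e 2*e -e], -1:1, n, n)/h^2;   % finite difference matrix
lambda = sort(eig(full(A)));                  % eigenvalues of the small dense problem
nu = sqrt(lambda(1:4))/(2*pi)                 % lowest four frequencies
nu_exact = (1:4)'*sqrt(T0/rho)/(2*L)          % exact values n*sqrt(T0/rho)/(2*L)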

1.2.4 Equation for a Vibrating Membrane


The situation of a vibrating membrane is similar to a vibrating string. A careful analysis of the situation
(e.g. [Trim90, §1.4]) shows that the resulting PDE is given by
   
    ρ ü = ∂/∂x ( τ ∂u/∂x ) + ∂/∂y ( τ ∂u/∂y ) + f ,

where the interpretation of the terms is shown in Table 1.3. Thus we are led to the IBVP

                                symbol    unit
    vertical displacement       u         m
    external force density      f         N/m2
    horizontal tension          τ         N/m
    mass density                ρ         kg/m2

Table 1.3: Symbols and variables for a vibrating membrane


ρ(x, y) ü(t, x, y) = ∇ (τ (x, y) ∇u(t, x, y)) + f (t, x, y) for (x, y) ∈ Ω and t>0
u(t, x, y) = 0 for (x, y) ∈ Γ and t > 0
(1.10)
u(0, x, y) = u0 (x, y) for (x, y) ∈ Ω
u̇(0, x, y) = u1 (x, y) for (x, y) ∈ Ω .

Thus we have a second order PDE in the time and space variables.


If the external force f vanishes and ρ and τ are constant then find the standard wave equation with
velocity c = √(τ /ρ) ,

    ü = (τ /ρ) ∆u = c² ∆u .

1.2.5 Equation for a Steady State Deformation of a Membrane


The equation governing a steady state membrane is an elementary consequence of the previous section,
obtained by setting u̇ = ü = 0 . Thus we find a second order BVP.

−∇ (τ (x, y) ∇u(x, y)) = f (x, y) for (x, y) ∈ Ω


(1.11)
u(x, y) = 0 for (x, y) ∈ Γ .

Using Calculus of Variations (see Section 5.2) we will show that solving this BVP is equivalent to finding
the minimizer of the functional
    F (u) = ∫∫ (τ /2) (∇u)² − f · u dA

amongst ‘all ’ functions u with u(x) = 0 on the boundary Γ . Figure 1.1 might represent a solution.

1.2.6 Eigenvalue Problem for a Membrane


If u(x, y) is a solution of the eigenvalue problem

λ ρ(x, y) u(x, y) = −∇ (τ (x, y) ∇u(x, y)) for (x, y) ∈ Ω


(1.12)
u(x, y) = 0 for (x, y) ∈ Γ ,

then the function

    A cos(√λ t + φ) · u(x, y)

is a solution of the equation of a vibrating membrane. Thus the eigenvalues λ lead to the frequencies
ν = √λ/(2 π) . The corresponding eigenfunction u(x, y) shows the shape of the oscillating membrane.
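
For the assumed constant values τ = ρ = 1 on the unit square the exact eigenvalues are λ = π² (m² + n²) with m, n = 1, 2, 3, . . . A minimal Octave/MATLAB sketch approximates the lowest eigenvalues with the 5-point stencil and the sparse eigenvalue solver eigs().

n = 60;  h = 1/(n+1);  e = ones(n,1);         % assumed: unit square, tau = rho = 1
A1 = spdiags([-e 2*e -e], -1:1, n, n)/h^2;
A = kron(speye(n), A1) + kron(A1, speye(n));  % 5-point stencil for -Delta
lambda = sort(eigs(A, 4, 'sm'));              % four smallest eigenvalues
nu = sqrt(lambda)/(2*pi)                      % frequencies of the membrane
nu_exact = sqrt(pi^2*[2; 5; 5; 8])/(2*pi)     % from lambda = pi^2*(m^2 + n^2)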

1.3 Horizontal Stretching of a Beam


1.3.1 Description
A beam of length L with known cross sectional area A (possibly variable) may be stretched horizontally by
different methods:

• forces applied to its ends.

• extending its length.

• applying a horizontal force all along the beam.


    description                     symbol         SI units
    horizontal position             0 ≤ x ≤ L      [m]
    horizontal displacement         u              [m]
    strain                          ε = du/dx      free of units
    force at right end point        F              [N]
    density of force along beam     f              [N/m]
    modulus of elasticity           E              [N/m2]
    Poisson’s ratio                 0 ≤ ν < 1/2    free of units
    area of cross section           A              [m2]
    stress                          σ              [N/m2]

Table 1.4: Variables used for the stretching of a beam

For 0 ≤ x ≤ L the function u(x) indicates by how much the part of the horizontal beam originally at
position x will be displaced horizontally, i.e. the new position is given by x + u(x) . In Table 1.4 find a list
of all relevant expressions.
The basic law of Hooke is given by
    E A (∆L/L) = F
for a beam of length L with cross section A. A force of F will stretch the beam by ∆L. For a short section
of length l of the beam between x and x + l find
    ∆l/l = ( u(x + l) − u(x) ) / l  −→  u′(x)   as l → 0+ .

This expression is called strain and often denoted by ε = ∂u/∂x . Then the force F (x) at a cross section at x is
given by Hooke’s law

    F (x) = E A(x) u′(x) ,
where F (x) is the force the section on the right of the cut will apply to the left section. Now examine all
forces on the section between x and x + ∆x, whose sum has to be zero for a steady state situation (Newton’s
law).
    E A(x + ∆x) u′(x + ∆x) − E A(x) u′(x) + ∫_x^{x+∆x} f (s) ds = 0

    ( E A(x + ∆x) u′(x + ∆x) − E A(x) u′(x) ) / ∆x = − (1/∆x) ∫_x^{x+∆x} f (s) ds .
Taking the limit ∆x → 0 in the above expression leads to the second order differential equation for the
unknown displacement function u(x).
 
    − d/dx ( E A(x) du(x)/dx ) = f (x)   for 0 < x < L .                                   (1.13)
There are different boundary conditions to be considered:
• At the left end point x = 0 we assume no displacement, i.e. u(0) = 0 .
• At the right end point x = L we can examine a given displacement u(L) = uM , i.e. we have a
  Dirichlet condition.
• At the right end point x = L we can examine the situation of a known force F , leading to the
  Neumann condition

      E A(L) du(L)/dx = F .
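
A minimal finite difference sketch for (1.13) with the boundary conditions u(0) = 0 and E A(L) u′(L) = F , using assumed constant values for E A, f and F ; for constant data the exact solution u(x) = ((F + f L) x − f x²/2)/(E A) is available for comparison. The last matrix row uses the standard second order treatment of the Neumann condition.

L = 2;  EA = 1e6;  f = 100;  F = 1000;        % assumed beam data (constant EA and f)
n = 100;  h = L/n;  x = (1:n)'*h;             % unknowns u(x_1), ..., u(x_n), with u(0) = 0
e = ones(n,1);
A = EA*spdiags([-e 2*e -e], -1:1, n, n)/h^2;  % second difference, Dirichlet at x = 0
A(n,n-1) = -EA/h^2;  A(n,n) = EA/h^2;         % Neumann condition at x = L
b = f*e;  b(n) = f/2 + F/h;                   % right hand side with end force F
u = A\b;
uexact = ((F + f*L)*x - f*x.^2/2)/EA;
disp(max(abs(u - uexact)))                    % exact up to rounding for constant data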


1.3.2 Poisson’s Ratio, Lateral Contraction


Poisson’s ratio ν is a material constant and indicates by what factor the lateral directions will contract, when
the material is stretched. Thus the resulting cross sectional area will be reduced from its original values A0 .
Assuming |du/dx| ≪ 1 the modified area is given by

    A = A(du/dx) = A0 · ( 1 − ν du/dx )² ≈ A0 · ( 1 − 2 ν du/dx ) .
Since the area is smaller the stress will increase, leading to a further reduction of the area. The resulting
stress is the true stress, compared to the engineering stress, where the force is divided by the original
area A0 , not the true area A0 (1 − ν du(x)/dx)² . The linear boundary value problem (1.13) is replaced by a
nonlinear problem.

    − d/dx ( E A0 ( 1 − ν du/dx )² du(x)/dx ) = f (x)   for 0 < x < L .                    (1.14)

For a given force F at the endpoint at x = L use the boundary condition

    E A0 (L) ( 1 − ν du(L)/dx )² du(L)/dx = F .

    Material        Modulus of elasticity E in 10^9 N/m2    Poisson’s ratio ν
    Aluminum        73                                      0.34
    Bone (Femur)    17                                      0.30
    Gold            78                                      0.42
    Rubber          3                                       0.50
    Steel           200                                     0.29
    Titanium        110                                     0.36

Table 1.5: Typical values for the elastic constants

If there is no force along the beam (f(x) = 0) the differential equation implies that

    E A0 ( 1 − ν du/dx )^2 d u(x)/dx = const .

The above equation and boundary condition lead to a cubic equation for the unknown function w(x) = u'(x) .

    E A0(x) ( 1 − ν d u(x)/dx )^2 d u(x)/dx = F
    E A0(x) ( 1 − ν w(x) )^2 w(x) = F
    ν^2 w^3(x) − 2 ν w^2(x) + w(x) = F / (E A0(x))                (1.15)
    ν^3 w^3(x) − 2 ν^2 w^2(x) + ν w(x) = ν F / (E A0(x)) .

This is a cubic equation for the unknown ν w(x). Thus this example can be solved without using differential
equations by solving nonlinear equations, see Example 3–13 on page 122, or by using the method of finite
differences, see Example 4–9 on page 297.
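The cubic equation (1.15) is easily solved numerically. The following Octave/MATLAB lines are only an added
sketch; the numbers for E, A0, ν and F are invented sample values, and the real root closest to the linear
strain F/(E A0) is selected as the physically relevant one.

E = 200e9; A0 = 1e-4; nu = 0.29; F = 1e5;   % hypothetical sample data
rhs = F/(E*A0);                             % right hand side of (1.15)
w_all = roots([nu^2, -2*nu, 1, -rhs]);      % nu^2 w^3 - 2 nu w^2 + w - rhs = 0
w_real = w_all(abs(imag(w_all)) < 1e-10);   % keep the real roots only
[~, idx] = min(abs(w_real - rhs));          % root closest to the linear strain
w = real(w_real(idx))                       % strain u'(x), slightly above F/(E*A0)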


1.3.3 Nonlinear Stress Strain Relations


If a bar is stretched by small forces F then the total change of length is proportional to the force, according
to Hooke's law

    ε = σ/E    or    σ = E ε .

The strain ε = ΔL/L = u' (relative change of length) is proportional to the stress σ = F/A (force per area).
The constant of proportionality is the modulus of elasticity E. If the force exceeds a certain critical value,
then the behavior of the material might change, the material might soften up or even break.
The qualitative behavior can be seen in Figure 1.3. The stress σ is a nonlinear function of the strain ε.
Nonlinear material properties are often important. For the sake of simplicity most examples in these notes
use linear material laws, but consider geometric nonlinearities only, e.g. the bending of a beam in Sec-
tion 1.4. A few remarks on nonlinear behavior caused by large deformations are shown in Section 5.7
starting on page 365.

[Figure 1.3 sketches the stress σ as a function of the strain ε: a linear elastic region with slope ∂σ/∂ε = E,
followed by a plastic region and the point of fracture.]
Figure 1.3: Nonlinear stress strain relation

1.4 Bending and Buckling of a Beam


1.4.1 Description
Examine a beam of length L as shown in Figure 1.4. The geometric description is given by its angle α as a
function of the arc length 0 ≤ s ≤ L . In Table 1.6 find a list of all relevant expressions. Since the slope of
the curve (x(s) , y(s)) is given by the angle α(s) find
[Figure 1.4 shows the bent beam as a curve ~x(l) = (x(l), y(l)) in the x-y plane, with the force F~ applied at
the endpoint ~x(L) and the offsets Δx and Δy between a point ~x(l) and the endpoint.]
Figure 1.4: Bending of a Beam


    d/ds ( x(s) , y(s) )^T = ( cos(α(s)) , sin(α(s)) )^T

and construct the curve from the angle function α(s) with an integral

    ~x(l) = ( x(l) , y(l) )^T = ( x(0) , y(0) )^T + ∫_0^l ( cos(α(s)) , sin(α(s)) )^T ds    for 0 ≤ l ≤ L .

    description                          symbol        SI units
    arc length parameter                 0 ≤ s ≤ L     [m]
    horizontal and vertical position     x(s), y(s)    [m]
    curvature                            κ(s)          [m^-1]
    force at right end point             F~            [N]
    modulus of elasticity                E             [N/m^2]
    inertia of cross section             I             [m^4]

Table 1.6: Variables used for a bending beam

The curvature κ is defined as the rate of change of the angle α(s) as a function of the arc length s .
Thus it is given by

    κ(s) = d α(s) / ds .
The theory of elasticity implies that the curvature at each point is proportional to the total moment generated
by all forces to the right of this point. If only one force F~ = (F1 , F2)^T is applied at the endpoint this
moment M is given by

    E I κ = M = F2 Δx − F1 Δy .

Since

    ( Δx , Δy )^T = ( x(L) − x(l) , y(L) − y(l) )^T = ∫_l^L ( cos(α(s)) , sin(α(s)) )^T ds

we can rewrite the above equation as a differential-integral equation. Then computing one derivative with
respect to the variable l transforms the problem into a second order differential equation.

    E I α'(l) = ( F2 , −F1 ) · ∫_l^L ( cos(α(s)) , sin(α(s)) )^T ds

    ( E I α'(l) )' = − ( F2 , −F1 ) · ( cos(α(l)) , sin(α(l)) )^T = F1 sin(α(l)) − F2 cos(α(l)) .

If the beam starts out horizontally we have α(0) = 0 and since the moment at the right end point vanishes
we find the second boundary condition α'(L) = 0 . Thus we find a nonlinear, second order boundary
value problem

    ( E I α'(s) )' = F1 sin(α(s)) − F2 cos(α(s)) .        (1.16)

In the above general form this problem can only be solved approximately by numerical methods, see Exam-
ple 4–11 on page 300.


1.4.2 Bending of a Beam


If the above beam is only subjected to a vertical force (F1 = 0) and the parameters are assumed to be
constant, then the equation to be examined is

    −α''(s) = ( F2 / (E I) ) cos(α(s)) .        (1.17)

The boundary conditions α(0) = α'(L) = 0 describe the situation of a beam clamped at the left edge, with
no moment applied at the right end point.

Approximation for small angles


If we know that F2 ≠ 0 and all angles remain small (|α| ≪ 1) we may simplify the nonlinear term (use
cos α ≈ 1 and sin α ≈ 0) and find the equation

    −α''(s) = F2 / (E I)

and an integration over the interval s < z < L leads to

    α'(s) = α'(L) − ∫_s^L α''(z) dz = 0 + ( F2 / (E I) ) (L − s)

and another integration from 0 to s, using α(0) = 0, leads to

    α(s) = α(0) + ∫_0^s α'(z) dz = ( F2 / (E I) ) ( L s − s^2/2 ) .

Since all angles are assumed to be small (sin α ≈ α) this implies

    y(x) = ∫_0^x α(z) dz = ( F2 / (E I) ) ( L x^2/2 − x^3/6 ) = ( F2 / (6 E I) ) (3 L − x) x^2

and thus we find the maximal deflection at x = L by

    y(L) = F2 L^3 / (3 E I) .
This result may be useful to verify the results of numerical algorithms.
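As an illustration of such a verification, the following Octave/MATLAB sketch (with invented beam data)
integrates the small angle approximation numerically and compares the tip deflection with the formula
y(L) = F2 L^3/(3 E I).

E = 210e9; I = 1e-8; L = 2; F2 = 50;     % hypothetical beam data
s = linspace(0, L, 1001);
alpha = F2/(E*I)*(L*s - s.^2/2);         % angle alpha(s) from the formulas above
y = cumtrapz(s, alpha);                  % integrate, using sin(alpha) ~ alpha
y_numeric  = y(end)
y_analytic = F2*L^3/(3*E*I)              % the two values agree closely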

1.4.3 Buckling of a Beam


If the above beam is horizontally compressed (F1 < 0, F2 = 0) and the parameters are assumed to be
constant the equation to be solved is

    −α''(s) = k sin(α(s))    where    k = −F1 / (E I) > 0 .        (1.18)

The boundary conditions α'(0) = α'(L) = 0 describe the situation with no moments applied at the two end
points. This equation is trivially solved by α(s) = 0, independent of the value of F1 .
For small displacements use sin α ≈ α and the equation simplifies to

    −α''(s) = k α(s) ,

then use the fact that for

    kn = ( n π / L )^2    where n ∈ N



the functions un (s) = cos( kn s) are nontrivial solutions of the linearized equation

u00n (s) = −kn un (s) with u0n (0) = u0n (L) = 0 .

These solutions can be used as starting points for Newtons method applied to the nonlinear problem (1.18) .

The first (n = 1) of the above solutions is characterized by the critical force Fc

    k1 = π^2 / L^2 = Fc / (E I)    =⇒    Fc = E I π^2 / L^2 .

If the actual force F1 is smaller than Fc there will be no deflection of the beam. As soon as the critical value
is reached there will be a sudden, drastic deformation of the beam. The corresponding solution is

    u1(s) = cos( π s / L ) .

Thus for small values of the parameter ε we expect approximate solutions of the form

    α(s) = ε u1(s) = ε cos( π s / L ) .
Using the integral formula find

    ~x(l) = ( x(l) , y(l) )^T = ∫_0^l ( cos(α(s)) , sin(α(s)) )^T ds
          = ∫_0^l ( cos(ε cos(π s/L)) , sin(ε cos(π s/L)) )^T ds
          ≈ ∫_0^l ( 1 , ε cos(π s/L) )^T ds = ( l , ε (L/π) sin(π l/L) )^T .

In the above expression we can eliminate the variable l and find

    y(x) = ε (L/π) sin( π x / L ) .
This shape corresponds to the Euler buckling of the beam.

1.5 Tikhonov Regularization


Assume that you have a non-smooth function f (x) on the interval [a, b] and want to approximate it by
a smoother function u. One possible approach is to use a Tikhonov functional. For a parameter λ > 0
minimize the functional
    Tλ(u) = ∫_a^b |u(x) − f(x)|^2 dx + λ ∫_a^b (u'(x))^2 dx .

• If λ > 0 is very small the functional will be minimized if u ≈ f , i.e. a good approximation of f .

• If λ is very large the integral of (u0 )2 will be very small, i.e. we have a very smooth function.
The above problem can be solved with the help of an Euler–Lagrange equation, to be examined in Sec-
tion 5.2.1, starting on page 305. For smooth functions f (x) the minimizer u(x) of the Tikhonov functional
Tλ (u) is a solution of the boundary value problem

    −λ u''(x) + u(x) = f(x)    with    u'(a) = u'(b) = 0 .
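A minimal finite difference sketch (not part of the original notes) shows how such a regularized approximation
can be computed; the noisy function f and all parameters are invented, and the Neumann conditions
u'(a) = u'(b) = 0 are built in with a first order approximation.

a = 0; b = 1; N = 500; lambda = 1e-3;
x = linspace(a, b, N)'; h = x(2)-x(1);
f = (x > 0.3) + 0.5*(x > 0.7) + 0.05*randn(N,1);   % jumps plus noise
e = ones(N,1);
D2 = spdiags([e, -2*e, e], -1:1, N, N);            % second difference matrix
D2(1,1) = -1; D2(N,N) = -1;                        % Neumann boundary conditions
u = (-lambda/h^2*D2 + speye(N)) \ f;               % solve -lambda u'' + u = f
plot(x, f, '.', x, u, 'r')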

In Figure 1.5 find a function f (x) with a few jumps and some noise added. The above problem was solved
for three different values of the positive parameter λ.


• For small values of λ (e.g. λ = 10−5 ) the original curve f (x) is matched very closely, including the
jumps. Caused by the noise the curve is very wiggly. In Figure 1.5 points only are displayed.

• For intermediate values of λ (e.g. λ = 10−3 ) the original curve f (x) is matched reasonably well and
the jumps are not visible.

• For large values of λ (e.g. λ = 10−1 ) the jumps in f (x) completely disappear, but the result u(x) is
far off the original (noisy) function f (x).

Figure 1.5: A nonsmooth function f and three regularized approximations u, using a small (λ = 10−5 ),
intermediate (λ = 10−3 ) and large (λ = 0.1) parameter

The above idea can be modified by using different smoothing functionals, depending on the specific
requirements of the application.


Chapter 2

Matrix Computations

2.1 Prerequisites and Goals


After having worked through this chapter

• you should understand the basic concepts of floating point operations on computers.

• you should be familiar with the concept of the condition number of a matrix and its consequences.

• you should understand the algorithm for LR factorization and Cholesky’s algorithm.

• you should be familiar with sparse matrices and the conjugate gradient algorithm.

• you should be aware of the importance of a preconditioner for the conjugate gradient algorithm.

In this chapter we assume that you are familiar with

• basic linear algebra.

• matrix and vector operations.

• the algorithm of Gauss to solve a linear system of equations.

2.2 Floating Point Operations


This chapter starts with a very basic introduction to the representation of floating point numbers on
computers and the basic effects of rounding in arithmetic operations. For a more detailed presentation
see [YounGreg72, §1, §2].

2.2.1 Floating Point Numbers and Rounding Errors in Arithmetic Operations


On computers floating point numbers have to be stored in binary format. As an example consider the decimal
number

    x = 178.125 = +1.78125 · 10^2 .

Since 178 = 128 + 32 + 16 + 2 and 0.125 = 0 · 1/2 + 0 · 1/4 + 1 · 1/8 = 2^-3 find the binary representation

    x = 10110010.001 = +1.0110010001 · 2^111    (exponent written in binary, i.e. 2^7) .

In this binary scientific representation the integer part always equals 1. Thus this number could be stored
in 14 bits, consisting of 1 sign bit, 3 exponent bits and 10 base bits. Obviously this type of representation is
asking for standardization, which was done in 1985 with the IEEE-754 standard. As an example examine


the float format, to be stored in 32 bits. The information is given by one sign bit s, 8 exponent bits bj and
23 bits aj for the base.

    s | b0 b1 b2 . . . b7 | a1 a2 a3 . . . a23

This leads to the number

    x = ±a · 2^(b−B0)    where    a = 1 + Σ_{j=1}^{23} aj 2^-j    and    b = Σ_{k=0}^{7} bk 2^k .

The sign bit s indicates whether the number is positive or negative. The exponent bits bk ∈ {0, 1} represent
the exponent b in the binary basis where the bias B0 is chosen such that b ≥ 0. The size of B0 depends on
the exact type of real number, e.g. for the data type float the IEEE committee picked B0 = 127 . The base
bits aj ∈ {0, 1} represent the base 1 ≤ a < 2. Thus numbers x in the range 1.2 · 10^-38 < |x| < 3.4 · 10^38
can be represented with approximately log10(2^23) = 23 ln(2)/ln(10) ≈ 7 significant decimal digits. It is important
to observe that not all numbers can be represented exactly. The absolute error is at least of the order 2^(-B0-23)
and the relative error is at most 2^-23.
On the Intel 486 processor (see [Intel90, §15]) find the floating point data types in Table 2.1. The two
data types float and double exist in the memory only. As soon as a number is loaded into the CPU, it is
automatically converted to the extended format, the format used for all internal operations. When a num-
ber is moved from the CPU to memory it has to be converted to float or double format.¹ The situation
on other hardware is very similar. As a consequence it is reasonable to assume that all computations will
be carried out with one of those two formats. The additional accuracy of the extended format is used as
guard digits. Find additional information in [DowdSeve98] or [HeroArnd01]. The reference [Gold91], also
available on the internet, gives more information on the IEEE-754 standard for floating point operations.

    data type   bytes   bits   base a    digits   exponent b − B0            range
    float         4      32    23 bit      7      −126 ≤ b − B0 ≤ 127        10^38
    double        8      64    52 bit     16      −1022 ≤ b − B0 ≤ 1023      10^308
    extended     10      80    63 bit     19      −16382 ≤ b − B0 ≤ 16383    10^4931

Table 2.1: Binary representation of floating point numbers

When analyzing the performance of algorithms for matrix operations a notation to indicate the accuracy
is necessary. When storing a real number x in memory some roundoff might occur and a number x (1 + ε)
is stored, where ε is bounded by a number u, depending on the CPU architecture used.

    x  --stored-->  x (1 + ε)    where |ε| ≤ u

For the above standard formats we may work with the following values for the unit roundoff u.

    float:   u ≈ 2^-23 ≈ 1.2 · 10^-7
    double:  u ≈ 2^-52 ≈ 2.2 · 10^-16                (2.1)

The notation of u for the unit roundoff is adapted from [GoluVanLoan96].
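As a quick illustration (added here, not part of the original text) these values can be inspected directly in
Octave/MATLAB: eps returns the spacing of the floating point numbers at 1, i.e. 2^-23 for single and 2^-52
for double, matching (2.1).

eps('single')                    % 1.1921e-07
eps('double')                    % 2.2204e-16
x = single(1) + single(1e-8);    % the small contribution is lost completely
x == single(1)                   % yields 1 (true)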

Addition and multiplication of floating point numbers


Examine two floating point numbers x1 = a1 · 2^b1 and x2 = a2 · 2^b2 . Use that 1 ≤ |ai| < 2 and assume
b1 ≤ b2 . When computing the sum x = x1 + x2 the following steps have to be performed:

¹ We quietly ignore that the format long double might be used in C. Most numerical code is using double or one might
consider float, to save memory.


1. Rescale the binary representation of the smaller number, such that it will have the same exponent as
   the larger number.

2. Add the two new base numbers using all available digits. Assuming that B binary digits are correct,
   the error of the sum of the bases is of the size 2^-B .

3. Convert the result into the correct binary representation.

    x1 + x2 = a1 · 2^b1 + a2 · 2^b2 = a1 · 2^(b2−Δb) + a2 · 2^b2 = ( a1 · 2^-Δb + a2 ) · 2^b2 = a · 2^b

The absolute error is of the size err ≈ 2^(b2−B) . The relative error err/(x1 + x2) can not be estimated, since
the sum might be a small number if x1 and x2 are of similar size with opposite sign. If x1 and x2 are of
similar size and have the same sign, then the relative error may be estimated as 2^-B .

When computing the product x = x1 · x2 the following steps have to be performed:

1. Use the standard binary representation of the two numbers.

2. Add the two exponents and multiply the two bases, using all available digits. Since 1 ≤ ai < 2 we
   know 1 ≤ a1 · a2 < 4, but (at most) one shift operation will move the base back into the allowed
   domain. We assume B binary digits are correct and thus the error of the product of the bases is of the
   size 2^-B .

3. Convert the result into the correct binary representation.

    x1 · x2 = a1 · 2^b1 · a2 · 2^b2 = a1 · a2 · 2^(b1+b2) = a · 2^b

The absolute error is of the size err ≈ 2^(b1+b2−B) . The relative error err/(x1 · x2) is estimated by 2^-B .

Based on the above arguments conclude that any implementation of floating point operations will neces-
sarily lead to approximation errors in the results. For an algorithm to be useful we have to assure that those
errors can not falsify the results. More information on floating point operations may be found in [Wilk63]
or [YounGreg72, §2.7].

2–1 Example : To illustrate the above effects examine two numbers given by their decimal representation.
Use x1 = 7.65432 · 10^3 , x2 = 1.23456 · 10^5 and assume that all operations are carried out with 6 significant
digits. There is no clever rounding scheme, all digits beyond the sixth are chopped off.

(a) Addition:

        x1 + x2 = 7.65432 · 10^3 + 1.23456 · 10^5 = ( 7.65432 · 10^-2 + 1.23456 ) · 10^5
                = ( 0.07654 + 1.23456 ) · 10^5 = 1.31110 · 10^5

    Using a computation with more digits leads to x1 + x2 = 1.3111032 · 10^5 and thus find an absolute
    error of approximately 0.3 or a relative error of 0.3/(x1 + x2) ≈ 2 · 10^-6 .

(b) Subtraction:

        x1 − x2 = 7.65432 · 10^3 − 1.23456 · 10^5 = ( 7.65432 · 10^-2 − 1.23456 ) · 10^5
                = ( 0.07654 − 1.23456 ) · 10^5 = −1.15802 · 10^5

    With a computation with more digits find x1 − x2 = −1.1580168 · 10^5 and an absolute error of approx-
    imately 0.3 or a relative error of 0.3/|x1 − x2| ≈ 3 · 10^-6 . Thus for this example the errors for addition and
    subtraction are of similar size. If we were to redo the above computations with x1 = 1.23457 · 10^5 ,
    then the difference equals 0.00001 · 10^5 = 1. The absolute error would not change drastically, but the
    relative error would be huge, caused by the division by a very small number.


(c) Multiplication:

        x1 · x2 = 7.65432 · 10^3 · 1.23456 · 10^5 = ( 7.65432 · 1.23456 ) · 10^8 = 9.44971 · 10^8

    The absolute error is approximately 8 · 10^2 and the relative error 10^-6 , as has to be expected for a
    multiplication with 6 significant digits.

2–2 Example : The effects of approximation errors on additions and subtractions can also be examined
assuming that xi is known with an error of Δxi , i.e. Xi = xi ± Δxi . Find

    X1 + X2 = (x1 ± Δx1) + (x2 ± Δx2) = x1 + x2 ± (Δx1 ± Δx2) ,
    X1 · X2 = (x1 ± Δx1) · (x2 ± Δx2) = x1 · x2 ± (x2 · Δx1 ± x1 · Δx2 ± Δx1 · Δx2) ,
    (X1 · X2) / (x1 · x2) ≈ 1 ± Δx1/x1 ± Δx2/x2 .

The above can be rephrased in a very compact form:

• When adding two numbers the absolute errors will be added.

• When multiplying two numbers the relative errors will be added.

2.2.2 Flops, Accessing Memory and Cache


When evaluating the performance of hardware or algorithms we need a good measure of the computational
effort necessary to run a given code. Most numerical code is dominated by matrix and vector operations and
those in turn are dominated by operations of the type

C(i,j) = C(i,j) + A(i,k)*B(k,j) .

One typical operation involves one multiplication, one addition, a couple of address computations and access
to the data in memory or cache. Thus the above is selected as a standard and the effort to perform these
operations is called one flop², short for floating point operation. The abbreviation FLOPS stands for FLoating
point Operations Per Second.

The computational effort for the evaluation of transcendental functions is obviously higher than for an
addition or multiplication. Using a simple benchmark with MATLAB/Octave and C++ on a laptop with an
Intel I-7 processor Table 2.2 can be generated. The timings are normalized, such that the time to perform
one addition leads to 1 . No memory access is taken into account for this table. Observe that this is (at best)
a rough approximation and depends on many parameters, e.g. the exact hardware and the compiler used.

² Over time the common definition of a flop has evolved. In the old days a multiplication took considerably more time than
an addition or one memory access. Thus one flop used to equal one multiplication. With RISC architectures multiplications
became as fast as additions and thus one flop was either an addition or multiplication. Suddenly most computers were twice as
fast. On current computers the memory access takes up most of the time and the memory structure is more important than the raw
multiplication/addition power. When reading performance data you have to verify which definition of flop is used. In almost all
cases the multiplications are counted.


Operation Octave MATLAB C++


+,-,* 1 1 1
division 1 1 2
sqrt() 2.3 1.8 1.4
exp() 5 4.3 7.7
sin(), cos() 5.8 4.4 13

Table 2.2: Normalized timing for different operations on an Intel I7

2–3 Example : For a few basic operations the flop count is easy to determine.

(a) A scalar product of two vectors ~x, ~y ∈ R^n is computed by

        s = <~x , ~y> = Σ_{i=1}^{n} xi · yi = x1 · y1 + x2 · y2 + x3 · y3 + . . . + xn · yn

    and requires n flops.

(b) For a n × n matrix A computing the product ~y = A · ~x requires n^2 flops, since

        yj = Σ_{i=1}^{n} aj,i · xi    for j = 1, 2, 3, . . . , n .

(c) Multiplying a n × m matrix A with a m × r matrix B requires n · m · r flops to compute the n × r
    matrix C = A · B .
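Based on item (c) a rough estimate of the FLOPS achieved by a machine can be obtained with a few lines
of Octave/MATLAB. This is only an added illustration; the measured numbers depend heavily on the
hardware, the chosen size N and the BLAS library used.

N = 1000;
A = rand(N); B = rand(N);
tic(); C = A*B; t = toc();
FLOPS_estimate = N^3/t       % N*N*N multiply/add pairs, divided by the wall time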

On modern computers the clock rate of the CPU is considerably higher than the clock rate for memory
access, i.e. it takes much longer to copy a floating point number from memory into the CPU, than to perform
a multiplication. To eliminate this bottleneck in the memory bandwidth sophisticated (and expensive) cache
schemes are used. Find further information in [DowdSeve98]. As a typical example examine the three level
cache structure of an Alpha processor shown in Figure 2.1. The data given for the access times in Figure 2.1
are partially estimated and take the cache overhead into account. The data is given by [DowdSeve98].
A typical Pentium IV system has 16 KB of level 1 cache (L1) and 512 or 1024 KB on-chip level 2 cache
(L2). The dual core Opteron processors available from AMD in 2005 have 64 KB of data cache and 64 KB
of instruction cache for each core. The level 2 cache of 1 MB for each core is on-chip too and runs at the
same clock rate as the CPU. These caches take up most of the area on the chip.

The importance of a fast and large cache is illustrated with the performance of a banded Cholesky
algorithm. For a given number n this algorithm requires fast access to n2 numbers. For the data type
double this leads to 8 n2 bytes of fast access memory. The table below shows some typical values.

n numbers fast memory


128 16’384 128 KB
256 65’536 512 KB
512 262’144 2048 KB
750 562’500 4395 KB
1024 1’048’576 8192 KB


 
[Figure 2.1 shows the memory hierarchy of the system: CPU registers at 500 MHz (2 ns), level 1 cache
8+8 KB (2 ns), level 2 cache 96 KB (5 ns), level 3 cache 4 MB off chip (30 ns) and 4 GB of RAM (220 ns).]

Figure 2.1: Memory and cache access times on a typical 500 MHz Alpha 21164 microprocessor system

CPU FLOPS
NeXT (68040/25MHz) 1M
HP 735/100 10 M
SUN Sparc ULTRA 10 (440MHz) 50 M
Pentium III 800 (out of cache) 50 M
Pentium III 800 (in cache) 185 M
Pentium 4 2.6 GHz (out of cache) 370 M
Pentium 4 2.6 GHz (in cache) 450 M
Intel I7-920 (2.67 GHz) 700 M
Intel Haswell I7-5930 (3.5 GHz) 1’800 M
AMD Ryzen 3950X (3.5 GHz) 6’000 M

Table 2.3: FLOPS for a few CPU architectures, using one core only

[Figure 2.2 plots Mflop/sec against the problem size nx for four code variants A, B, C and Cb.]
Figure 2.2: FLOPS for a 21164 microprocessor system, four implementations of one algorithm


[Figure 2.3 plots Mflop/sec against the problem size nx for several systems:
 (a) Pentium III with 512 KB cache              (b) 21264 system with 8 MB cache
 (c) Intel Xeon 2.7 GHz, 2 MB cache             (d) Intel I7-920 2.7 GHz, 8 MB cache
 (e) Intel Xeon, 3.5 GHz with 15 MB cache       (f) AMD Ryzen 3950X, 3.5 GHz with 64 MB cache]

Figure 2.3: FLOPS for a few systems, four implementations of one algorithm


In Figure 2.2 find the performance results for this algorithm on an Alpha 21164 platform with the cache
structure from Figure 2.1. Four slightly different codes were tested with different compilers. All codes show
the typical drop off in performance if the need for fast memory exceeds the available cache (4 MB).
In Figure 2.3 similar results for a few different systems are shown. The graphs clearly show that
Intel made considerable improvements with their memory access performance. All codes and platforms
show the typical drop off in performance if the need for fast memory exceeds the available cache. All results
clearly indicate that the choice of compiler and very careful coding and testing is very important for good
performance of a given algorithm. In Table 2.3 the performance data for a few common CPU architectures
is shown.
For the most recent CPUs this data is rather difficult to determine, since the CPUs do not have a fixed
clock rate any more and might execute more than one instruction per cycle.

2.2.3 Multi Core Architectures


In 2008 Intel introduced the Nehalem architecture. The design has an excellent interface between the CPU
and the memory (RAM) and a three level cache, as shown in Figure 2.4. As a consequence the performance
drop for larger problems is not as severe, visible in Figure 2.3(d), where only one core is used.
• a CPU consists of 4 cores

• each core has a separate L1 cache for data (32 KB) and code (32 KB)

• each core has a separate L2 cache for data and code (256 KB)

• the large (8 MB) L3 cache is dynamically shared between the 4 cores


This multi core architecture allows the four cores to work on the same data. To take full advantage of this
it is required that the algorithms can run several sections in parallel (threading). Since CPUs with this or
similar architecture are very common we will examine all algorithms on whether they can take advantage of
multiple cores. More information on possible architectures is given in [DowdSeve98].

[Figure 2.4 shows the cache structure: each of the 4 cores has its own L1 caches (32 KB code, 32 KB data)
and its own 256 KB L2 cache; the 8 MB L3 cache is shared by all cores.]

Figure 2.4: CPU-cache structure for the Intel I7-920 (Nehalem)

• In 2014 Intel introduced the Haswell-E architecture, a clear step up from the Nehalem architecture.
The size of the third level cache is larger and a better memory interface is used. As an example
consider the CPU I7-5930:

– a CPU consists of 6 cores


– each core has a separate L1 cache for data (32 KB) and code (32 KB)
– each core has a separate L2 cache for data and code (256 KB)
– the large (15 MB) L3 cache is dynamically shared between the 6 cores

The performance is good, as shown in Figure 2.3(e).

• In 2017 AMD introduced the Ryzen Threadripper 1950X with 16 cores, 96KB (32KB data, 64KB
code) of L1 cache per core, 8MB of L2 cache and 32MB of L3 cache.

• In the fall of 2019 AMD topped the above with the Ryzen 9 3950X with 16 cores. This processor has
64KB of L1 cache and 512KB of L2 cache dedicated for each of the 16 cores and 64MB of shared
L3 cache. The performance is outstanding, see Figure 2.3(f). The structure in Figure 2.4 is still valid,
just more cores and (a lot) more cache.

The above observations consider only the aspect on single CPUs, possibly with multiple cores. For super-
computing clusters the effect of a huge number of computing cores is even more essential.
It is rather surprising to observe the development: in 1976 the Cray 1 cost 8.8 M$, weighed 5.5 tons
and could perform up to 160 MFLOPS. In 2014 an NVIDIA GeForce GTX Titan Black card cost 1'000 $ and
could deliver up to 1'800 GFLOPS (double precision, FP64). Thus the 2014 GPU is 11'000 times faster, very
light and at a very affordable price.

2.2.4 Using Multiple Cores of a CPU with Octave and MATLAB


Many commands in MATLAB will take advantage of multiple cores available on the CPU. Current versions
of Octave can use the libraries OPENBLAS or ATLAS to use multiple cores for some of the built-in
commands. On Linux systems use top or htop to observe the load on the system with many cores³.

N = 2^12;                        % = 4096, size of the matrix
A = eye(N) + 0.1*rand(N); Apower = A;
invA = eye(N) + Apower; t0 = cputime(); tic();
for ii = 1:10
  Apower = A * Apower; invA = invA + Apower;
end%for
WallTime = toc()
CPUtime = cputime()-t0
GFLOPS = 10*N^3/WallTime/1e9

The above code was used on the host karman (Ryzen 3950X, 16 cores, 32 threads with hyper-threading)
and htop showed 32 active threads. If launching Octave after setting the environment variable export
OPENBLAS_NUM_THREADS=1, then only one thread is started and the computation time is 23.7 sec. For
each of the above 10 iterations more than N^3 flops are required. With N = 2^12 = 4096 this leads to 687 · 10^9
flops. A computation time of 3.16 seconds corresponds to a computational power of 217 GFLOPS for the
host karman, which is considerably more than shown in Figure 2.3(f), where a single core is used with
a more complex algorithm. Using only a single thread (computation time 23.7 sec) leads to 27 GFLOPS.
Since the CPU has 16 cores another attempt using 16 threads shows a computation time of 2.72 seconds, i.e.
253 GFLOPS⁴. With N = 2^14 = 16'384 using 16 threads the computational power is 324 GFLOPS. For
the same CPU a computational power between 6 GFLOPS and 324 GFLOPS showed up.
Resulting advice:

• Use as many threads as your CPU has true cores, i.e. ignore hyper-threading.
³ For a matrix A with ||A|| < 1 the inverse of I − A can be determined by the Neumann series
(I − A)^-1 = I + A + A^2 + A^3 + A^4 + . . . . The code determines the first 10 contributions of this approximation.
⁴ With N = 4096 = 2^12 and the code Apower*=A; invA+=Apower; (Octave only) in the loop obtain 279 GFLOPS.


• Do not trust the FLOPS data published by companies (e.g AMD, Intel, . . . ), use your own application
as benchmark. The differences can be substantial.

2.3 The Model Matrices


Before examining numerical algorithms for linear systems of equations and the corresponding matrices we
present two typical and rather important matrices. All of the subsequent results shall be applied to these
model matrices. Many problems examined in Chapter 1 lead to matrices similar to An or Ann .

2.3.1 The 1-d Model Matrix An


When using a finite difference approximation with n interior points (see Chapter 4) for the model prob-
lem (1.2)

    − d^2/dx^2 T(x) = (1/k) f(x)    for 0 ≤ x ≤ 1    and    T(0) = T(1) = 0

the function T(x) is replaced by a set of values Ti of the function at the points xi = i · h = i/(n+1) , as shown
in Figure 2.5.

[Figure 2.5 shows a continuous function on the interval, approximated by the values T1 , . . . , T6 at the
grid points x1 , . . . , x6 .]

Figure 2.5: The discrete approximation of a continuous function

The second derivative is replaced by a multiplication by a n × n matrix An , where

                [  2 -1                 ]
                [ -1  2 -1              ]
                [    -1  2 -1           ]
    An = 1/h^2  [        .  .  .        ] .
                [          -1  2 -1     ]
                [             -1  2 -1  ]
                [                -1  2  ]

The value of h = 1/(n+1) represents the distance between two points xi . Multiplying a vector by this matrix
corresponds to computing the second derivative. The differential equation is replaced by a system of linear
equations.

    − d^2/dx^2 u(x) = f(x)    −→    An ~u = f~ .


Observe the following important facts about this matrix An :

• The matrix is symmetric, tridiagonal and positive definite.

• To analyze the performance of the matrix algorithms the eigenvalues λj and eigenvectors ~vj will be
  useful, i.e. use An ~vj = λj ~vj .

  – The exact eigenvalues are given by⁵

        λj = (4/h^2) sin^2( j π h / 2 )    for 1 ≤ j ≤ n .

  – The eigenvector ~vj is generated by discretizing the function sin(j π x) over the interval [0 , 1]
    and thus has j extrema within the interval.

        ~vj = ( sin(1 j π/(n+1)) , sin(2 j π/(n+1)) , sin(3 j π/(n+1)) , . . . , sin((n−1) j π/(n+1)) , sin(n j π/(n+1)) ) .

  – For j ≪ n observe

        λj = (4/h^2) sin^2( j π h / 2 ) = 4 (n+1)^2 sin^2( j π / (2 (n+1)) ) ≈ 4 (n+1)^2 ( j π / (2 (n+1)) )^2 = π^2 j^2

    and in particular λmin = λ1 ≈ π^2 .

  – For the largest eigenvalue use

        λmax = λn = 4 (n+1)^2 sin^2( π n / (2 (n+1)) ) ≈ 4 n^2 .
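These formulas are easily confirmed numerically. The following lines (an added illustration using the built-in
eig(), not part of the original notes) construct An and compare the computed spectrum with the formula.

n = 20; h = 1/(n+1);
A = (diag(2*ones(n,1)) - diag(ones(n-1,1),1) - diag(ones(n-1,1),-1))/h^2;
lambda = sort(eig(A));
j = (1:n)';
lambda_exact = 4/h^2*sin(j*pi*h/2).^2;
max(abs(lambda - lambda_exact))     % difference at the level of rounding errors
[lambda(1), pi^2]                   % smallest eigenvalue, approximately pi^2
[lambda(n), 4*n^2]                  % largest eigenvalue, approximately 4 n^2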

The above matrix will be useful for a number of the problems in the introductory chapter.
• An will be used to solve the static heat problem (1.2).

• For each time step in the dynamic heat problem (1.3) the matrix An will be used.

• To solve the steady state problem for the vertical displacements of a string, i.e. equation (1.7), the
  above matrix is needed.

• For each time step of the dynamic string problem (1.8), we need the above matrix.

• The eigenvalues of the above matrix determine the solutions of the eigenvalue problem for the vibrat-
ing string problem, i.e. equation (1.9).

• The problems of the horizontal stretching of a beam (i.e. equations (1.13) and (1.14)) will lead to
matrices similar to the above.

• When using Newton’s method to solve the nonlinear problem of a bending beam ((1.16), (1.17)
or (1.18)) we will have to use matrices similar to the above.
⁵ To verify these eigenvalues and eigenvectors use the complex exponential function u(x) = exp(i α x). The values at the grid
points xk = k h are uk = exp(i α k h), leading to a vector ~u. The matrix multiplication (without the factor 1/h^2) computes

    −u_{k−1} + 2 u_k − u_{k+1} = exp(i α k h) ( −exp(−i α h) + 2 − exp(+i α h) )
                               = exp(i α k h) ( 2 − 2 cos(α h) ) = exp(i α k h) 4 sin^2( α h / 2 ) .

Thus the values of uk are multiplied with the factor 4 sin^2( α h / 2 ). Now examine the imaginary part of these coefficients, i.e. use
uk = sin(α k h). Based on u0 = u_{n+1} = 0 require sin(αj (n+1) h) = sin(αj · 1) = 0, i.e. αj = j π. Thus the eigenvalues are
λj = (4/h^2) sin^2( j π h / 2 ) and the corresponding eigenvectors are the discretizations of the functions sin(j π x).


2.3.2 The 2-d Model Matrix Ann


Using a finite difference approximation of the Laplace operator (i.e. computing ∂^2u/∂x^2 + ∂^2u/∂y^2) with n^2
interior points on the unit square leads to a n^2 × n^2 matrix Ann . The mesh for this discretisation in the case
of n = 4 is shown in Figure 2.6 with a typical numbering of the nodes.

[Figure 2.6 shows the 4 × 4 grid of interior nodes on the unit square, numbered row by row from the
bottom left: 1–4 in the bottom row, then 5–8, 9–12 and 13–16 in the top row.]

Figure 2.6: A 4 × 4 grid on a square domain

The matrix is characterized by a few key properties:


• The matrix has a leading factor of 1/h^2 , where h = 1/(n+1) represents the distance between neighboring
  points.

• Along the main diagonal all entries are equal to 4 .

• The n-th upper and lower diagonals are filled with −1 .

• The first upper and lower diagonals are almost filled with −1 . If the column/row index of the diagonal
entry is a multiple of n, then the numbers to the right and below are zero.

Below find the 16 × 16 matrix Ann for n = 4.

                 [  4 -1  ·  · -1  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  · ]
                 [ -1  4 -1  ·  · -1  ·  ·  ·  ·  ·  ·  ·  ·  ·  · ]
                 [  · -1  4 -1  ·  · -1  ·  ·  ·  ·  ·  ·  ·  ·  · ]
                 [  ·  · -1  4  ·  ·  · -1  ·  ·  ·  ·  ·  ·  ·  · ]
                 [ -1  ·  ·  ·  4 -1  ·  · -1  ·  ·  ·  ·  ·  ·  · ]
                 [  · -1  ·  · -1  4 -1  ·  · -1  ·  ·  ·  ·  ·  · ]
                 [  ·  · -1  ·  · -1  4 -1  ·  · -1  ·  ·  ·  ·  · ]
    A4,4 = 1/h^2 [  ·  ·  · -1  ·  · -1  4  ·  ·  · -1  ·  ·  ·  · ]
                 [  ·  ·  ·  · -1  ·  ·  ·  4 -1  ·  · -1  ·  ·  · ]
                 [  ·  ·  ·  ·  · -1  ·  · -1  4 -1  ·  · -1  ·  · ]
                 [  ·  ·  ·  ·  ·  · -1  ·  · -1  4 -1  ·  · -1  · ]
                 [  ·  ·  ·  ·  ·  ·  · -1  ·  · -1  4  ·  ·  · -1 ]
                 [  ·  ·  ·  ·  ·  ·  ·  · -1  ·  ·  ·  4 -1  ·  · ]
                 [  ·  ·  ·  ·  ·  ·  ·  ·  · -1  ·  · -1  4 -1  · ]
                 [  ·  ·  ·  ·  ·  ·  ·  ·  ·  · -1  ·  · -1  4 -1 ]
                 [  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  · -1  ·  · -1  4 ]


The missing numbers in the first off-diagonals can be explained by Figure 2.6. The gaps are caused by the
fact that if the node number is a multiple of n (i.e. k · n), then the node has no direct connection to the node
with number k n + 1.
We replace the partial differential equation by a system of linear equations.

    − ∂^2 u(x,y)/∂x^2 − ∂^2 u(x,y)/∂y^2 = (1/k) f(x,y)    −→    Ann ~u = f~ .
Observe the following important facts about this matrix Ann :

• The matrix is symmetric and positive definite.

• The matrix is sparse (very few nonzero entries) and has a band structure.

• To analyze the performance of the different algorithms it is convenient to know the eigenvalues.

  – The exact eigenvalues are given by

        λi,j = (4/h^2) sin^2( j π h / 2 ) + (4/h^2) sin^2( i π h / 2 )    for 1 ≤ i, j ≤ n .

  – For i, j ≪ n obtain

        λi,j ≈ π^2 (i^2 + j^2)

    and in particular λmin = λ1,1 ≈ 2 π^2 . For the largest eigenvalue find λmax = λn,n ≈ 8 n^2 .
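The matrix Ann and its smallest eigenvalue can be examined with a few lines of Octave/MATLAB. The
construction below with Kronecker products is one possible way to generate the matrix; it is an added
illustration and not taken from the original notes.

n = 30; h = 1/(n+1); e = ones(n,1);
A1 = spdiags([-e, 2*e, -e], -1:1, n, n)/h^2;      % the 1-d model matrix, sparse
Ann = kron(speye(n), A1) + kron(A1, speye(n));    % the 2-d model matrix
eigs(Ann, 1, 'sm')                                % smallest eigenvalue
2*pi^2                                            % the predicted value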

The above matrix will be useful for a number of the problems in the introductory chapter.

• Ann will be used to solve the static heat problem (1.5) with 2 space dimensions.

• For each time step in the dynamic heat problem (1.6) the matrix Ann will be used.

• To solve the steady state problem for the vertical displacements of a membrane, i.e. equation (1.11),
we need the above matrix.

• For each time step of the dynamic membrane problem (1.10), the above matrix is needed.

• The eigenvalues of the above matrix determine the solutions of the eigenvalue problem for the vibrat-
ing membrane problem, i.e. equation (1.12).

One can construct the model matrix Annn corresponding to the finite difference approximation of a
differential equation examined on a unit cube in space R3 . In Table 2.4 find the key properties of the model
matrices. Observe that all matrices are symmetric and positive definite.

2.4 Solving Systems of Linear Equations and Matrix Factorizations


Most problems in scientific computing require solutions of linear systems of equations. Even if the problem
is nonlinear you need good, reliable solvers for linear systems, since they form the building block of non-
linear solvers, e.g. Newton’s algorithm (Section 3.1.5). Any algorithm for linear systems will have to be
evaluated using the following criteria:

• Computational cost: how many operations are required to solve the system?


                                   An               Ann             Annn
    differential equation on      unit interval    unit square     unit cube
    size of grid                   n                n × n           n × n × n
    size of matrix                 n × n            n^2 × n^2       n^3 × n^3
    semi bandwidth                 2                n               n^2
    nonzero entries                3 n              5 n^2           7 n^3
    smallest eigenvalue λmin ≈     π^2              2 π^2           3 π^2
    largest eigenvalue λmax ≈      4 n^2            8 n^2           12 n^2
    condition number κ ≈          (4/π^2) n^2      (4/π^2) n^2     (4/π^2) n^2

Table 2.4: Properties of the model matrices An , Ann and Annn .

• Memory cost: how much memory of what type is required to solve the system?

• Accuracy: what type and size of errors are introduced by the algorithm? Is there an unnecessary loss
  of accuracy? This is an essential criterion if you wish to obtain reliable solutions and not a heap of
  random numbers.

• Special properties: does the algorithm use special properties of the matrix A? Some of those proper-
ties to be considered are: symmetry, positive definiteness, band structure and sparsity.

• Multi core architecture: can the algorithm take advantage of a multi core architecture?

There are three classes of solvers for linear systems of equations to be examined in these notes:

• Direct solvers: we will examine the standard LR factorization and the Cholesky factorization, Sec-
tions 2.4.1 and 2.6.1.

• Sparse direct solvers: we examine the banded Cholesky factorization, the simplest possible case, see
Section 2.6.4.

• Iterative solvers: we will examine the most important case, the conjugate gradient algorithm, see
Section 2.7.

One of the conclusions you should draw from the following sections is that it is almost never necessary to
compute the inverse of a given matrix.

2.4.1 LR Factorization
When solving linear systems of equations the concept of matrix factorization is essential. For a known
vector ~b ∈ Rn and a given n × n matrix A we search for solutions ~x ∈ Rn of the system A · ~x = ~b. Many
algorithms use a matrix factorization A = L · R where the matrices have special properties to simplify
solving the system of linear equations.

Solving triangular systems of linear equations


When using a matrix factorization to solve a linear system of equations we regularly have to work with
triangular matrices. Thus we briefly examine the main properties of matrices with triangular structure.


2–4 Definition :

• A matrix R is called an upper triangular matrix iff⁶ ri,j = 0 for all i > j, i.e. all entries below the
  main diagonal are equal to 0 .

• A matrix L is called a lower triangular matrix iff li,j = 0 for all i < j, i.e. all entries above the
  main diagonal are equal to 0 .

• The two n × n matrices L and R form an LR factorization of the matrix A if

  – L is a lower triangular matrix and its diagonal entries are all 1 .
  – R is an upper triangular matrix.
  – A is factorized by the two matrices, i.e. A = L · R .

The above factorization

                [ 1                        ]   [ r1,1 r1,2 r1,3 r1,4 . . . r1,n ]
                [ l2,1  1                  ]   [      r2,2 r2,3 r2,4 . . . r2,n ]
    A = L · R = [ l3,1  l3,2  1            ] · [           r3,3 r3,4 . . . r3,n ]
                [ l4,1  l4,2  l4,3  1      ]   [                r4,4 . . . r4,n ]
                [  ..    ..    ..       .. ]   [                     . .    ..  ]
                [ ln,1  ln,2 . . ln,n-1  1 ]   [                          rn,n  ]
can be represented graphically: A equals a lower triangular matrix L (with unit diagonal) times an upper
triangular matrix R. Solving the system A ~x = L R ~x = ~b of linear equations can be performed in two steps

    A ~x = ~b    ⇐⇒    L · R ~x = ~b    ⇐⇒    { L ~y = ~b   and then   R ~x = ~y } .

• Introduce an auxiliary vector ~y . First solve the system L ~y = ~b from top to bottom. The equation
represented by row i in the matrix L reads as

1 y1 = b1
l2,1 y1 + 1 y2 = b2 .
l3,1 y1 + l3,2 y2 + 1 y3 = b3
⁶ iff: used by mathematicians to spell out "if and only if"


These equations can easily be solved. The general formula is given by

    yi + Σ_{j=1}^{i-1} li,j yj = bi    =⇒    yi = bi − Σ_{j=1}^{i-1} li,j yj .

y(1) = b(1);
for i = 2:n
  y(i) = b(i) - L(i,1:i-1)*y(1:i-1);
end%for

• Subsequently we consider the linear equations with the matrix R, e.g. for a 3 × 3 matrix

    r1,1 x1 + r1,2 x2 + r1,3 x3 = y1
              r2,2 x2 + r2,3 x3 = y2
                        r3,3 x3 = y3 .

  These equations have to be solved bottom to top.

    Σ_{j=i}^{n} ri,j xj = yi    =⇒    xi = yi/ri,i − (1/ri,i) Σ_{j=i+1}^{n} ri,j xj .

x(n) = y(n)/R(n,n);
for i = n-1:-1:1
  x(i) = y(i)/R(i,i) - ( R(i,i+1:n)*x(i+1:n) )/R(i,i);
end%for

• Forward and backward substitution each require approximately Σ_{k=1}^{n} k ≈ n^2/2 flops. Thus the total
  computational effort is given by n^2 flops.

The above observations show that systems of linear equations are easily solved, once the original matrix is
factorized as a product of a left and right triangular matrix. In the next section we show that this factorization
can be performed using the ideas of the algorithm of Gauss to solve linear systems of equations.
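The two substitution loops can be combined into a complete solver. The sketch below is an added
illustration: the factors are taken from the built-in lu(), which additionally performs row pivoting, so the
permuted right hand side P*b has to be used.

n = 6; A = rand(n) + n*eye(n); b = rand(n,1);
[L, R, P] = lu(A);               % P*A = L*R
c = P*b;
y = zeros(n,1); x = zeros(n,1);
y(1) = c(1);
for i = 2:n                      % forward substitution
  y(i) = c(i) - L(i,1:i-1)*y(1:i-1);
end%for
x(n) = y(n)/R(n,n);
for i = n-1:-1:1                 % backward substitution
  x(i) = (y(i) - R(i,i+1:n)*x(i+1:n))/R(i,i);
end%for
norm(x - A\b)                    % agrees with the built-in solver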

LR Factorization and the Algorithm of Gauss


The algorithm of Gauss is based on the idea of row reduction of a matrix. As an example consider a system
of three equations.

    [  2  6  2 ]   [ x1 ]   [ 2 ]
    [ -3 -8  0 ] · [ x2 ] = [ 4 ]
    [  4  9  2 ]   [ x3 ]   [ 6 ]

The basic idea of the algorithm is to use row operations to transform the matrix in echelon form. Using the
notation of an augmented matrix this translates to

    [  2  6  2 | 2 ]        [ 2  6  2 | 2 ]        [ 2  6  2 |  2 ]
    [ -3 -8  0 | 4 ]  -->   [ 0  1  3 | 7 ]  -->   [ 0  1  3 |  7 ] .
    [  4  9  2 | 6 ]        [ 0 -3 -2 | 2 ]        [ 0  0  7 | 23 ]

In the above computations the following steps were applied:


• Multiply the first row by 3/2 and add it to the second row.
• Multiply the first row by 2 and subtract it from the third row.
• Multiply the modified second row by 3 and add it to the third row.
The resulting system is represented by a triangular matrix and can easily be solved from bottom to top.

2 x1 + 6 x2 + 2 x3 = 2
1 x2 + 3 x3 = 7 .
7 x3 = 23

The next goal is to verify that an LR factorization is a clever notation for the algorithm of Gauss. To
verify this we use the notation of block matrices and a recursive scheme, i.e. we start with a problem of
size n × n and reduce it to a problem of size (n − 1) × (n − 1). For this we divide a matrix in 4 blocks of
submatrices, i.e.

        [ a1,1  a1,2  a1,3 . . . a1,n ]   [ a1,1 | a1,2  a1,3 . . . a1,n ]
        [ a2,1  a2,2  a2,3 . . . a2,n ]   [ a2,1 |                       ]
    A = [ a3,1  a3,2  a3,3 . . . a3,n ] = [ a3,1 |         An-1          ] .
        [  ..    ..    ..        ..   ]   [  ..  |                       ]
        [ an,1  an,2  an,3 . . . an,n ]   [ an,1 |                       ]

The submatrix An-1 is a (n − 1) × (n − 1) matrix. Using this notation we are searching n × n matrices L
and R such that A = L · R, i.e.

    [ a1,1 | a1,2 . . . a1,n ]   [  1   |  0 . . . 0 ]   [ r1,1 | r1,2 . . . r1,n ]
    [ a2,1 |                 ]   [ l2,1 |            ]   [  0   |                 ]
    [ a3,1 |      An-1       ] = [ l3,1 |    Ln-1    ] · [  0   |       Rn-1      ] .
    [  ..  |                 ]   [  ..  |            ]   [  ..  |                 ]
    [ an,1 |                 ]   [ ln,1 |            ]   [  0   |                 ]

Using the standard matrix multiplication compute the entries in the 4 segments of A separately and then
examine A block by block.

• Examine the top left block (one single number) in A

      a1,1 = 1 · r1,1 .

• Examine the top right block (row) in A

      a1,j = 1 · r1,j    for j = 2, 3, . . . , n .

  Thus the first row of R is a copy of the first row of A.

• Examine the bottom left block (column) in A

      ( a2,1 , a3,1 , . . . , an,1 )^T = ( l2,1 , l3,1 , . . . , ln,1 )^T · r1,1    =⇒    li,1 = ai,1 / r1,1 = ai,1 / a1,1 .


  For the standard algorithm of Gauss, this is the multiple to be used when adding a multiple of the first
  row to row i. This step might fail if a1,1 = 0. This possible problem can be avoided with the help of
  proper pivoting. This will be examined later in this course, see Section 2.5.3, page 51.

• Examine the bottom right block in A

      An-1 = ( l2,1 , l3,1 , . . . , ln,1 )^T · ( r1,2 , r1,3 , . . . , r1,n ) + Ln-1 · Rn-1 .

  This can be rewritten in the form

      Ln-1 · Rn-1 = An-1 − ( l2,1 , l3,1 , . . . , ln,1 )^T · ( a1,2 , a1,3 , . . . , a1,n ) = Ãn-1 .

  This operation is a sequence of row operations on the (n − 1) × (n − 1) matrix An-1 :

      From each row a multiple (factor li,1 = ai,1 / a1,1 ) of the first row is subtracted.

  Thus the lower triangular matrix L keeps track of what row operations have to be applied to transform
  An-1 into Ãn-1 .

• Verify that these operations require (n − 1) + (n − 1)^2 = n (n − 1) flops.

• Observe that the first row and column of A will not have to be used again in the next step. Thus for a
  memory efficient implementation we may overwrite the first row and column of A with the first row
  of R and the first column of L. The number 1 on the diagonal of L does not have to be stored.
After having performed the above steps we are left with a similar question, but the size of the matrices was
reduced by 1. By recursion we can restart the above process with the reduced matrices. Finally we will find
the LR factorization of the matrix A. The total operation count is given by

    FlopLR = Σ_{k=1}^{n} (k^2 − k) ≈ n^3/3 .

It is possible to compute the inverse A^-1 of a given square matrix A with the help of the LR factoriza-
tion. It can be shown (e.g. [Schw86]) that the computational effort is approximately (4/3) n^3 and thus 4 times
as high as solving one system of linear equations directly.

2–5 Observation : Adding multiples of one row to another row in a large matrix can be implemented in
parallel on a multicore architecture, as shown in Section 2.2.3. For this to be efficient the number of columns
has to be considerably larger than the number of CPU cores to be used. On most computers this task is taken
over by a well chosen BLAS library (Basic Linear Algebra Subroutines). Excellent versions are provided by
OPENBLAS or by ATLAS (Automatically Tuned Linear Algebra Software). MATLAB uses the Intel Math
Kernel Library and Octave is built on OPENBLAS by default. ♦


2–6 Example : The above general calculations are illustrated with the numerical example used at the start
of this section. For the given 3 × 3 matrix A we first examine the first column of the left triangular matrix
L and the first row of the right triangular matrix R.

    [  2  6  2 ]   [  1    0    0 ]   [ r1,1 r1,2 r1,3 ]
    [ -3 -8  0 ] = [ l2,1  1    0 ] · [  0   r2,2 r2,3 ]
    [  4  9  2 ]   [ l3,1 l3,2  1 ]   [  0    0   r3,3 ]

                 = [   1   0    0 ]   [  2    6    2   ]
                   [ -3/2  1    0 ] · [  0   r2,2 r2,3 ] .
                   [   2  l3,2  1 ]   [  0    0   r3,3 ]

Then we restart the computation with the 2 × 2 blocks in the lower right corner of the above matrices. From
the above conclude

    [ -8  0 ]   [ -3/2 ]               [  1    0 ]   [ r2,2 r2,3 ]
    [  9  2 ] = [   2  ] · [ 6  2 ] +  [ l3,2  1 ] · [  0   r3,3 ]

              = [ -9 -3 ]   [  1    0 ]   [ r2,2 r2,3 ]
                [ 12  4 ] + [ l3,2  1 ] · [  0   r3,3 ] .

The 2 × 2 block of A has to be modified first by adding the correct multiples of the first row of A, i.e.

    [ -8  0 ]   [ -9 -3 ]   [  1  3 ]
    [  9  2 ] − [ 12  4 ] = [ -3 -2 ]

and then we use an LR factorization of a 2 × 2 matrix.

    [  1  3 ]   [  1    0 ]   [ r2,2 r2,3 ]   [  1  0 ]   [ 1   3   ]
    [ -3 -2 ] = [ l3,2  1 ] · [  0   r3,3 ] = [ -3  1 ] · [ 0  r3,3 ] .

The only missing value r3,3 can be determined by examining the lower right corner of the above matrix
product.

    −2 = (−3) · 3 + 1 · r3,3    =⇒    r3,3 = 7 .

Thus conclude

        [  2  6  2 ]   [   1   0  0 ]   [ 2  6  2 ]
    A = [ -3 -8  0 ] = [ -3/2  1  0 ] · [ 0  1  3 ] = L · R .
        [  4  9  2 ]   [   2  -3  1 ]   [ 0  0  7 ]

Instead of solving the system

    [  2  6  2 ]   [ x1 ]   [ 2 ]
    [ -3 -8  0 ] · [ x2 ] = [ 4 ]
    [  4  9  2 ]   [ x3 ]   [ 6 ]

first solve

    [   1   0  0 ]   [ y1 ]   [ 2 ]
    [ -3/2  1  0 ] · [ y2 ] = [ 4 ]
    [   2  -3  1 ]   [ y3 ]   [ 6 ]


from top to bottom with the solution

    y1 = 2    =⇒    y2 = 4 + (3/2) y1 = 7    =⇒    y3 = 6 − 2 y1 + 3 y2 = 23 .

Instead of the original system now solve

    [ 2  6  2 ]   [ x1 ]   [  2 ]
    [ 0  1  3 ] · [ x2 ] = [  7 ]
    [ 0  0  7 ]   [ x3 ]   [ 23 ]

This is exactly the system we are left with after the matrix A was reduced to echelon form. This should
illustrate that the LR factorization is a clever way to formulate the algorithm of Gauss. ♦
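The hand computation can be verified with a few Octave/MATLAB commands. This check is an addition to
the notes; the numbers are exactly the ones from the example above.

A = [2 6 2; -3 -8 0; 4 9 2]; b = [2; 4; 6];
L = [1 0 0; -3/2 1 0; 2 -3 1];
R = [2 6 2; 0 1 3; 0 0 7];
norm(L*R - A)        % equals 0, the factorization is exact
y = L\b              % yields [2; 7; 23], as computed above
x = R\y              % solution of the original system
norm(A*x - b)        % at the level of rounding errors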

Implementation in Octave or MATLAB


It is rather straightforward to implement the above algorithm in Octave/MATLAB.
LRtest.m
function [L,R] = LRtest(A)
% [L,R] = LRtest(A) if A is a square matrix
% performs the LR decomposition of the matrix A
% !!!!!!!!!!!! NO PIVOTING IS DONE !!!!!!!!!!!!!!!!!
% this is for instructional purposes only
  [n,n] = size(A);
  R = zeros(n,n);  L = eye(n);
  for k = 1:n
    R(k,k:n) = A(k,k:n);        % first row of the remaining block is copied to R
    if (R(k,k) == 0)
      error("LRtest: division by 0")
    end%if
    % divide numbers in k-th column, below the diagonal, by A(k,k)
    L(k+1:n,k) = A(k+1:n,k)/R(k,k);
    % apply the row operations to A
    for j = k+1:n
      A(j,k+1:n) = A(j,k+1:n) - A(k,k+1:n)*L(j,k);
    end%for
  end%for

The only purpose of the above code is to help the reader to understand the algorithm. It should never be
used to solve a real problem.

• No pivoting is done and thus the code might fail on perfectly solvable problems. Running this code
will also lead to unnecessary rounding errors.

• The code is not memory efficient at all. It keeps copies of 3 full size matrices around.

There are considerably better implementations, based on the above ideas. The code can be tested using the
model matrix An from Section 2.3.1.

n = 5;
h = 1/(n+1)
A = diag(2*ones(n,1)) - diag(ones(n-1,1),1) - diag(ones(n-1,1),-1);
A = (n+1)^2*A;
[L,R] = LRtest(A)


leading to the results

        [  1      0      0     0    0 ]        [ 72  -36    0    0    0   ]
        [ -0.5    1      0     0    0 ]        [  0   54  -36    0    0   ]
    L = [  0    -0.667   1     0    0 ]    R = [  0    0   48  -36    0   ] .
        [  0      0    -0.75   1    0 ]        [  0    0    0   45  -36   ]
        [  0      0      0   -0.8   1 ]        [  0    0    0    0   43.2 ]

Observe that the triangular matrices L and R have nonzero entries only on the diagonal and the first off-
diagonal.

2.4.2 LR Factorization and Elementary Matrices


In this section we show a different notation for the LR factorization, using elementary matrices. This
notation can be useful when applying the factorization to small matrices. In this subsection no new results
are introduced, just a different notation for the LR factorization.
By definition an elementary matrix is generated by applying one row or column operation on the identity
matrix. Thus their inverses are easy to construct. Consider the following examples.

1. Multiply the second row by 7 .

       [ 1  0  0 ]                  [ 1   0   0 ]
   E = [ 0  7  0 ]    and    E^-1 = [ 0  1/7  0 ]
       [ 0  0  1 ]                  [ 0   0   1 ]

2. Add 3 times the first row to the third row.

       [ 1  0  0 ]                  [  1  0  0 ]
   E = [ 0  1  0 ]    and    E^-1 = [  0  1  0 ]
       [ 3  0  1 ]                  [ -3  0  1 ]

Applying row and column operations to matrices can be described by multiplications with elementary
matrices, from the left or the right.

• Multiplying a matrix A by an elementary matrix from the left has the same effect as applying the row
  operation to the matrix.

      [  1   0  0 ]   [  2  6  2 ]   [ 2  6  2 ]
      [ 3/2  1  0 ] · [ -3 -8  0 ] = [ 0  1  3 ]
      [  0   0  1 ]   [  4  9  2 ]   [ 4  9  2 ]

• Multiplying a matrix A by an elementary matrix from the right has the same effect as applying the
  column operation to the matrix.

      [  2  6  2 ]   [ 1  0  0 ]   [  14  6  2 ]
      [ -3 -8  0 ] · [ 2  1  0 ] = [ -19 -8  0 ]
      [  4  9  2 ]   [ 0  0  1 ]   [  22  9  2 ]


Using multiple row operations we can perform an LR factorization. We use again the same example as
before. The row operations to be applied are

1. Add 3/2 times the first row to the second row.

2. Subtract 2 times the first row from the third row.

3. Add 3 times the second row to the third row.

These operations are visible on the left in Figure 2.7. On the right find the corresponding elementary
matrices.

 
    [  2  6  2 ]
    [ -3 -8  0 ]    R2 ← R2 + (3/2) R1
    [  4  9  2 ]
                           [  1   0  0 ]             [   1   0  0 ]
         ↓            E1 = [ 3/2  1  0 ] ,   E1^-1 = [ -3/2  1  0 ]
                           [  0   0  1 ]             [   0   0  1 ]
    [  2  6  2 ]
    [  0  1  3 ]    R3 ← R3 − 2 R1
    [  4  9  2 ]
                           [  1  0  0 ]              [ 1  0  0 ]
         ↓            E2 = [  0  1  0 ] ,    E2^-1 = [ 0  1  0 ]
                           [ -2  0  1 ]              [ 2  0  1 ]
    [  2  6  2 ]
    [  0  1  3 ]    R3 ← R3 + 3 R2
    [  0 -3 -2 ]
                           [ 1  0  0 ]               [ 1   0  0 ]
         ↓            E3 = [ 0  1  0 ] ,     E3^-1 = [ 0   1  0 ]
                           [ 0  3  1 ]               [ 0  -3  1 ]
    [  2  6  2 ]
    [  0  1  3 ]
    [  0  0  7 ]

Figure 2.7: LR factorization, using elementary matrices

The row operations from Figure 2.7 are used to construct the LR factorization. Start with A = I · A and
use the elementary matrices for the row and column operations.
Observe that we apply row operations to transform the matrix on the right to upper echelon form. The
matrix on the left keeps track of the operations to be applied.


     
    [  2  6  2 ]   [ 1  0  0 ]   [  2  6  2 ]
    [ -3 -8  0 ] = [ 0  1  0 ] · [ -3 -8  0 ]          insert E1^-1 · E1
    [  4  9  2 ]   [ 0  0  1 ]   [  4  9  2 ]

                   [   1   0  0 ]   [ 2  6  2 ]
                 = [ -3/2  1  0 ] · [ 0  1  3 ]        insert E2^-1 · E2
                   [   0   0  1 ]   [ 4  9  2 ]

                   [   1   0  0 ]   [ 2  6  2 ]
                 = [ -3/2  1  0 ] · [ 0  1  3 ]        insert E3^-1 · E3
                   [   2   0  1 ]   [ 0 -3 -2 ]

                   [   1   0  0 ]   [ 2  6  2 ]
                 = [ -3/2  1  0 ] · [ 0  1  3 ]
                   [   2  -3  1 ]   [ 0  0  7 ]

Thus we constructed the LR factorization of the matrix A.

    [  2  6  2 ]   [   1   0  0 ]   [ 2  6  2 ]
    [ -3 -8  0 ] = [ -3/2  1  0 ] · [ 0  1  3 ] = L · R
    [  4  9  2 ]   [   2  -3  1 ]   [ 0  0  7 ]
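The above manipulations are easily checked numerically. The following lines are an added illustration
using the matrices from Figure 2.7.

A  = [2 6 2; -3 -8 0; 4 9 2];
E1 = [1 0 0; 3/2 1 0; 0 0 1];
E2 = [1 0 0; 0 1 0; -2 0 1];
E3 = [1 0 0; 0 1 0; 0 3 1];
R = E3*E2*E1*A                  % the upper triangular factor
L = inv(E1)*inv(E2)*inv(E3)     % the lower triangular factor with unit diagonal
norm(L*R - A)                   % equals 0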

2.5 The Condition Number of a Matrix, Matrix and Vector Norms


2.5.1 Vector Norms and Matrix Norms
With a norm of a vector we usually associate its geometric length, but it is sometimes useful to use different
norms. Thus we present three different norms used for the analysis of matrix computations and start out
with the general definition of a norm.
2–7 Definition : A function is called a norm if for all vectors ~x, ~y ∈ Rn and scalars α ∈ R the following
properties are satisfied:
k~xk ≥ 0 and k~xk = 0 ⇐⇒ ~x = ~0
k~x + ~y k ≤ k~xk + k~y k
kα ~xk = |α| k~xk

2–8 Example : It is an exercise to verify that the following three norms satisfy the above properties:

    ||~x|| = ||~x||2 = ( Σ_{i=1}^{n} |xi|^2 )^(1/2) = ( x1^2 + x2^2 + . . . + xn^2 )^(1/2) = sqrt( <~x , ~x> )

    ||~x||1 = Σ_{i=1}^{n} |xi| = |x1| + |x2| + . . . + |xn|

    ||~x||∞ = max_{1≤i≤n} |xi|

On a given vector space one can have multiple norms, e.g. k~xkA and k~xkB . Those norms are said to be
equivalent if there exist positive constants c1 and c2 such that

c1 k~xkA ≤ k~xkB ≤ c2 k~xkA for all ~x .

On the vector space Rn all norms are equivalent and one may verify the following inequalities. If we have
information on one of the possible norms of a vector ~x we have some information on the size of the other
norms too.
2–9 Result : For all vectors ~x ∈ R^n we have

    ||~x||2 ≤ ||~x||1 ≤ sqrt(n) ||~x||2 ,
    ||~x||∞ ≤ ||~x||2 ≤ sqrt(n) ||~x||∞ ,
    ||~x||∞ ≤ ||~x||1 ≤ n ||~x||∞ .
                                                                                3
Proof : For the interested reader some details of the computations are shown here.

    ||~x||2^2 = Σ_{i=1}^{n} xi^2 ≤ ( Σ_{i=1}^{n} |xi| )^2 = ||~x||1^2

    ||~x||1 = Σ_{i=1}^{n} |xi| = <~1 , |~x|> ≤ ||~1||2 · ||~x||2 = sqrt(n) ||~x||2

    ||~x||∞ = max{|xi|} = sqrt( max{xi^2} ) ≤ ( Σ_{i=1}^{n} xi^2 )^(1/2) = ||~x||2 ≤ sqrt( n max{xi^2} ) = sqrt(n) ||~x||∞

    ||~x||∞ = max |xi| ≤ Σ_{i=1}^{n} |xi| = ||~x||1 ≤ n max |xi| = n ||~x||∞

A matrix norm of a matrix A should give us some information on the length of the vector ~y = A · ~x,
based on the length of ~x . Thus we require the basic inequality

kA · ~xk ≤ kAk k~xk for all ~x ∈ Rn

i.e. the norm kAk is the largest occurring amplification factor when multiplying a vector with this matrix A .

2–10 Definition : For each vector norm there is a resulting matrix norm defined by

    ||A||  = max_{~x≠~0} ||A · ~x|| / ||~x||    = max_{||~x||=1} ||A · ~x||

    ||A||2 = max_{~x≠~0} ||A · ~x||2 / ||~x||2  = max_{||~x||2=1} ||A · ~x||2

    ||A||1 = max_{~x≠~0} ||A · ~x||1 / ||~x||1  = max_{||~x||1=1} ||A · ~x||1

    ||A||∞ = max_{~x≠~0} ||A · ~x||∞ / ||~x||∞  = max_{||~x||∞=1} ||A · ~x||∞

where the maximum is taken over all ~x ∈ R^n with ~x ≠ ~0 , or equivalently over all vectors with ||~x|| = 1 .


One may verify that all of the above norms satisfy

    ||A|| ≥ 0    and    ||A|| = 0  ⇐⇒  A = 0
    ||A + B|| ≤ ||A|| + ||B||
    ||α A|| = |α| ||A||

2–11 Example : For a given m × n matrix A the two norms ||A||1 and ||A||∞ are rather easy to compute:

    ||A||1 = max_{1≤j≤n} ( Σ_{i=1}^{m} |ai,j| ) = maximal column sum

    ||A||∞ = max_{1≤i≤m} ( Σ_{j=1}^{n} |ai,j| ) = maximal row sum


Proof : We examine ||A||∞ first. Assume that the maximum value of ||~y||∞ = ||A · ~x||∞ is attained for
the vector ~x with ||~x||∞ = 1. Then all components have to be xj = ±1, otherwise we could increase ||~y||∞
without changing ||~x||∞ . If the maximal value of |yi| is attained at the component with index p then the
matrix multiplication

    [ y1 ]   [ a1,1 a1,2 a1,3 . . . a1,n ]   [ x1 ]
    [ y2 ]   [ a2,1 a2,2 a2,3 . . . a2,n ]   [ x2 ]
    [ y3 ] = [ a3,1 a3,2 a3,3 . . . a3,n ] · [ x3 ]
    [ .. ]   [  ..   ..   ..          .. ]   [ .. ]
    [ ym ]   [ am,1 am,2 am,3 . . . am,n ]   [ xn ]

implies

    yp = Σ_{j=1}^{n} ap,j xj = Σ_{j=1}^{n} ap,j (±1) = Σ_{j=1}^{n} |ap,j| .

This leads to the claimed result

    ||A||∞ = ||~y||∞ = max_{1≤i≤m} ( Σ_{j=1}^{n} |ai,j| ) .

To determine the norm ||A||1 examine

    ||~y||1 = Σ_{i=1}^{m} |yi| = Σ_{i=1}^{m} | Σ_{j=1}^{n} ai,j xj | ≤ Σ_{j=1}^{n} |xj| Σ_{i=1}^{m} |ai,j|
            ≤ Σ_{j=1}^{n} |xj| max_{1≤k≤n} ( Σ_{i=1}^{m} |ai,k| ) = max_{1≤k≤n} ( Σ_{i=1}^{m} |ai,k| ) Σ_{j=1}^{n} |xj|
            = max_{1≤k≤n} ( Σ_{i=1}^{m} |ai,k| ) ||~x||1 .

If the above column maximum is attained in column k choose xk = 1 and all other components of ~x are set
to zero. For this special vector then find

    ||A · ~x||1 = Σ_{i=1}^{m} |ai,k · 1| = Σ_{i=1}^{m} |ai,k| = ||A||1 .
                                                                                2

SHA 21-5-21
CHAPTER 2. MATRIX COMPUTATIONS 48

Unfortunately the most important norm kAk = kAk2 is not easily computed. But for n × m matrices
A we have the following inequalities.
1 √
√ kAk∞ ≤ kAk2 ≤ m kAk∞
n
1 √
√ kAk1 ≤ kAk2 ≤ n kAk1
m
and thus we might be able to estimate the size of kAk2 with the help of the other norms. The proofs of
the above statements are based on Result 2–9. A precise result on the 2-norm is given in [GoluVanLoan96,
Theorem 3.2.1]. The result is stated here for sake of completeness.
2–12 Result : For any m×n matrix A there exists a vector ~z ∈ Rn with k~zk2 = 1 such that AT A·~z = µ2 ~z
and kAk2 = µ . Since AT A is symmetric and positive definite we know that all eigenvalues are real and
positive. Thus kAk2 is the square root of the largest eigenvalue of the n × n matrix AT A . 3

We might attempt to compute all eigenvalues of the symmetric matrix AT A and then compute the
square root of the largest eigenvalue. Since it is computationally expensive to determine all eigenvalues the
task remains difficult. There are special algorithms (power method) to estimate the largest eigenvalue of
AT A, used in the MATLAB/Octave functions normest(), condest() and eigs(), see Section 3.2.4.

In Result 3–22 (page 134) find few facts on symmetric matrices and the corresponding eigenvalues and
eigenvectors. Using this result one can verify that for real, symmetric matrices
A symmetric =⇒ kAk2 = max |λi | .
i
This can also be confirmed directly. For a given symmetric matrix A let ~ej be the basis of normalized
eigenvectors. Then we write the arbitrary vector ~x as a linear combination of the eigenvectors ~ej .
n
X n
X
~x = cj ~ej =⇒ A · ~x = cj λj ~ej
j=1 j=1

Use the orthogonality of the eigenvectors ~ei to conclude


n
X n
X n
X n
X
2 2
k~xk = h~x , ~xi = h ci ~ei , cj ~ej i = |ci | h~ei , ~ei i = |ci |2 .
i=1 j=1 i=1 i=1

If k~xk = 1 then
n
X n
X
2 2 2
1 = k~xk = |cj | =⇒ kA · ~xk = |λj |2 |cj |2
j=1 j=1
and the largest possible value will be attained if cn = 1 and all other cj = 0, i.e. the vector ~x points in the
direction of the eigenvector with the largest eigenvalue.

Similarly we can determine the norm of the inverse matrix A−1 . Since
n n
X X 1
~x = cj ~ej =⇒ A−1 · ~x = cj ~ej
λj
j=1 j=1

we find for k~xk = 1


n
X 1
kA−1 · ~xk2 = |cj |2
|λj |2
j=1
and the largest possible value will be attained if the vector ~x points in the direction of the eigenvector with
the smallest absolute value of the eigenvalue, i.e.
1
kA−1 k2 = .
minj |λj |

SHA 21-5-21
CHAPTER 2. MATRIX COMPUTATIONS 49

2–13 Example : For the matrix An in section 2.3.1 (page 32) the matrix norms are easily computed.
1
(a) The maximal column and rows sums are given by h2
(1 + 2 + 1) and thus
4
kAn k1 = kAn k∞ = ≈ 4 n2 .
h2

(b) The largest eigenvalue λn ≈ 4 n2 implies


kAn k2 ≈ 4 n2 .

(c) Since the smallest eigenvalue is given by λ1 ≈ π 2 we find


1 1
kA−1
n k2 = ≈ 2.
|λ1 | π

In this case the three matrix norms are approximately equal, at least for large values on n . ♦

2–14 Example : The norm of an orthogonal matrix Q equals 1, i.e. kQk2 = 1. To verify this use
kQ~xk22 = hQ~x , Q~xi = h~x , QT Q~xi = h~x , ~xi = k~xk22 .
Similary find that kQ−1 k2 = 1. To verify this use (Q−1 )T = (QT )−1 and
kQ−1 ~xk22 = hQ−1 ~x , Q−1 ~xi = h~x , (Q−1 )T Q−1 ~xi = h~x , ~xi = k~xk22 .

2.5.2 The Condition Number of a Matrix


Condition number for a matrix vector multiplication
Compare the result of ~y = A · ~x with a slightly perturbed result ~yp = A · ~xp . Then we want to compare the
k~xp − ~xk
relative error in ~x =
k~xk
with the
k~yp − ~y k kA · (~xp − ~x)k
relative error in ~y = = .
k~y k kA ~xk
The condition number κ of the matrix A is characterized by the property
k~yp − ~y k kA · (~xp − ~x)k k~xp − ~xk
= ≤κ .
k~y k kA ~xk k~xk

As typical example we consider a symmetric, nonsingular matrix A with eigenvalues


0 < |λ1 | ≤ |λ2 | ≤ |λ3 | ≤ . . . ≤ |λn−1 | ≤ |λn |
and the vectors ~x = ~e1 and ~xp = ~e1 + ε ~en . Thus we examine a relative error of 0 < ε  1 in ~x . Since
A · ~e1 = λ1 ~e1 and A · ~en = λn ~en we find
~y = A · ~x = A · ~e1 = λ1 ~e1
~yp = A · ~xp = A · (~e1 + ε ~en ) = λ1 ~e1 + ε λn~en
k~yp − ~y k kA · (~e1 + ε ~en ) − A · ~e1 k kε A · ~en k |λn |
= = =ε .
k~y k kA · ~e1 k kA · ~e1 k |λ1 |
In the above example the correct vector is multiplied by the smallest possible number (λ1 ) but the error is
multiplied by the largest possible number (λn ). Thus we examined the worst case scenario.

SHA 21-5-21
CHAPTER 2. MATRIX COMPUTATIONS 50

Condition number when solving a linear system of equations


For a given vector ~b compare the solution ~x of A · ~x = ~b with a slightly perturbed result A · ~xp = ~bp . Then
we want to compare the
k~bp − ~bk
relative error in ~b =
k~bk
with the
k~xp − ~xk kA−1 · (~bp − ~b)k
relative error in ~x = = .
k~xk kA−1 ~bk
In this situation the condition number κ is characterized by the property
k~xp − ~xk k~bp − ~bk kA · (~xp − ~x)k
≤κ =κ ,
k~xk ~
kbk kA ~xk

i.e. the relative error in the given vector ~b might at worst be mutliplied by the condition number to obtain
the relative error in the solution ~x.
As an example reconsider the above symmetric matrix A and use the vectors ~b = ~en and ~bp = ~en + ε ~e1 .
Thus we examine a relative error of 0 < ε  1 in ~b .
k~xp − ~xk kA−1 · (~en + ε ~e1 ) − A−1 · ~en k kε A−1 · ~e1 k 1/|λ1 | |λn |
= −1
= −1
=ε ≤ε .
k~xk kA · ~en k kA · ~en k 1/|λn | |λ1 |
In the above example the correct vector is divided by the largest possible number (λn ) but the error is divided
by the smallest possible number (λ1 ). Thus we examined the worst case scenario.

Based on the above two observation we use |λn | = kAk2 and 1


|λ1 | = kA−1 k2 to conclude

|λn |
κ2 (A) = = kAk2 · kA−1 k2
|λ1 |
for symmetric matrices A and if κ2 = 10d we might loose d decimal digits of accuracy when multiplying a
vector by A or when solving a system of linear equations.

Using the above idea we define the condition number for the matrix and the result applies to multipli-
cation by matrices and solving of linear systems systems.
2–15 Definition : The condition number κ(A) of a nonsingular square matrix is defined by
κ = κ(A) = kAk · kA−1 k
Obviously the condition number depends on the matrix norm used.

2–16 Example :
• Based on Example 2–14 the condition number of an orthogonal matrix Q equals 1, using the 2–norm.
• Using the singular value decomposition (SVD) (see equation (3.5) on page 149) and the above idea
one can verify that for a real n × n matrix A
σ1 largest singular value
κ2 = κ2 (A) = =
σn smallest singular value
where the singular values are given by σi .
• MATLAB/Octave provide the command condest() to efficiently compute a good estimate of the
condition number.

SHA 21-5-21
CHAPTER 2. MATRIX COMPUTATIONS 51

For any matrix and norm we have κ(A) ≥ 1. If the condition number κ is not too large we speak of a
well conditioned problem.
For n × n matrices we have the following relations between the different condition numbers.
1
n κ2 ≤ κ1 ≤ n κ2
1
n κ∞ ≤ κ2 ≤ n κ∞

The verification is based on Result 2–9.

2–17 Example : For the model matrix An of size n × n in Section 2.3.1 (page 32) find

λmax 4 n2
κ2 = ≈ 2
λmin π
and for 2D the model matrix Ann of size n2 × n2 in Section 2.3.2 (page 34) find the same result

λmax 4 n2
κ2 = ≈ 2 .
λmin π

2.5.3 The Effect of Rounding Errors, Pivoting


Now we want to examine the effects of arithmetic operations and rounding when solving a system of linear
equations of the form A · ~x = ~b. The main reference for the results in this section is the bible of matrix com-
putations by Gene Golub and Charles van Loan [GoluVanLoan96], or the newer edition [GoluVanLoan13] .

2–18 Result : In an ideal situation absolutely no roundoff occurs during the solution process. Only when
ˆ satisfies
A, ~b and ~x are stored some roundoff will occur. The stored solution ~x
ˆ = ~b + ~e with kEk∞ ≤ u kAk∞
(A + E) · ~x and k~ek∞ ≤ u k~bk∞ .
ˆ solves a nearby system exactly. If now u κ∞ (A) ≤ 1/2 then one can show that
Thus ~x
ˆ − ~xk∞ ≤ 4 u κ∞ (A) k~xk∞ .
k~x

The above bounds are the best possible. 3

As a consequence of the above result we can not expect relative errors smaller than κ u for
any kind of clever algorithm to solve linear systems of equations. The goal has to be to
achieve this accuracy.

2–19 Definition : For the following results we use some special, convenient notations:

(A)i,j = ai,j =⇒ |A|i,j = |ai,j |


A≤B ⇐⇒ ai,j ≤ bi,j for all indices i and j .

The absolute value and the comparison operator are applied to each entry in the matrix.

The following theorem ([GoluVanLoan96, Theorem 3.3.2] keeps track of the rounding errors in the back
substitution process, i.e. when solving the triangular systems. The proof is considerably beyond the scope
of these notes.

SHA 21-5-21
CHAPTER 2. MATRIX COMPUTATIONS 52

2–20 Theorem : Let L̂ and R̂ be the computed LR factors of a n × n matrix A. Suppose ~yˆ is the computed
ˆ the computed solution of R̂ · ~x = ~yˆ. Then
solution of L̂ · ~y = ~b and ~x
ˆ = ~b
(A + E) · ~x

with
|E| ≤ n u (3 |A| + 5 |L| |R|) + O(u2 ) , (2.2)
where |L| |R| is a matrix multiplication. For all practical purposes we may ignore the term O(u2 ), since
with u ≈ 10−16 we have u2 ≈ 10−32 . 3
This result implies that we find exact solutions of a perturbed system A + E. This is called backward
stability.
The above result shows that large values in the triangular matrices L and R should be avoided whenever
possible. Unfortunately we can obtain large numbers during the factorization, even for well-conditioned
matrices, as shown by the following example.
2–21 Example : For a small, positive number ε the matrix
" #
ε 1
A=
1 0

is well-conditioned. When applying the LR factorization obtain


" # " # " #
ε 1 1 0 ε 1
= 1 · = L · R.
1 0 ε 1 0 −1 ε

If ε > 0 is very close to zero then the numbers in L and R will be large and lead to
" # " # " #
1 0 ε 1 ε 1
|L| · |R| = 1 · =
ε 1 0 1ε 1 2ε

and thus one of the entries in the bound in Theorem 2–20 is large. Thus the error in the result might be
unnecessary large. This elementary, but typical example illustrates that pivoting is necessary. ♦

The correct method to avoid the above problem is pivoting. In the LR factorization on page 38 try to
factor the submatrices An−k = Ln−k · Rn−k . In the unmodified algorithm the top left number in An−k
is used as pivot element. Before performing a next step in the LR factorization it is better to exchange
rows (equations) and possibly rows (variables) to avoid divisions by small numbers. There are two possible
strategies:

• partial pivoting:
Choose the largest absolute number in the first column of An−k . Exchange equations for this to
become the top left number.

• total pivoting:
Choose the largest absolute number in the submatrix An−k . Exchange equations and renumber vari-
ables for this to become the top left number.

The computational effort for total pivoting is considerably higher, since (n−k)2 numbers have to be searched
for the maximal value. The bookkeeping requires considerably more effort, since equations and unknowns
have to be rearranged. This additional effort is not compensated by considerably better (more reliable)
results. Thus for almost all problems partial pivoting will be used. As a consequence we will only examine
partial pivoting.

SHA 21-5-21
CHAPTER 2. MATRIX COMPUTATIONS 53

When using partial pivoting all entries in the left matrix L will be smaller than 1 and thus kLk∞ ≤ n .
This leads to an improved error estimate (2.2) in Theorem 2–20. For details see [GoluVanLoan96, §3.4.6].

kEk∞ ≤ n u (3 kAk∞ + 5 n kRk∞ ) + O(u2 )

The formulation for factorization has to be modified slightly and supplemented with a permutation
matrix P.
2–22 Result : A square matrix P is a permutation matrix if each row and each column of P contains exactly
one number 1 and all other entries are zero. Multiplying a matrix or vector from the left by a permutation
matrix has the effect of row permutations:

pi,j = 1 ⇐⇒ the old row j will turn into the new row i

As a consequence we find PT = P−1 . 3


2–23 Example : The effects of permutation matrices are best illustrated by a few elementary examples

     
0 0 1 1 2 5 6
     
 1 0 0 · 3 4 = 1 2 
     
0 1 0 5 6 3 4


     
0 1 0 0 1 2
     
 0 0 1 0   2 
   3 
· =
   
 
 0 0 0 1 
  3   4 
  
 
1 0 0 0 4 1

For a given matrix A we seek triangular matrices L and R and a permutation matrix P, such that

P · A = L · R.

If we now want to solve the system A · ~x = ~b using the factorization replace the original system by two
linear systems with triangular matrices.
(
L ~y = P · ~b
A ~x = ~b ⇐⇒ PA ~x = P~b ⇐⇒ LR ~x = P~b ⇐⇒ .
R ~x = ~y

Example 2–6 has to be modified accordingly, i.e. the permutation given by P is applied to the right hand
side.

SHA 21-5-21
CHAPTER 2. MATRIX COMPUTATIONS 54

2–24 Example : To solve the system


    
3/2 4 −7/2 x1 +1
    
A ~x =   3 2 1    x2  =  0 
   
0 −1 25/3 x3 −1

we use the factorization

  P·A = L
·R   
0 1 0 3/2 4 −7/2 1 0 0 3 2 1
      
 =  1/2   0 3 −4  .
 1 0 0  3 2 1   1 0   
 
0 0 1 0 −1 25/3 0 −1/3 1 0 0 7

Then the system can be solved using two triangular systems. First solve from top to bottom

L ~y = P ~b
    
1 0 0 y1 0
    
 1/2 1 0   y2  =  +1 
    
0 −1/3 1 y3 −1

and then

R ~x = ~y
    
3 2 1 x1 y1
    
 0 3 −4   x2  =  y2 
    
0 0 7 x3 y3

from bottom to top. ♦

Any good numerical library has an implementation of the LR (or LU) factorization with partial pivoting
built in. As an example consider the help provided by Octave on the command lu().
Octave
help lu
--> lu is a built-in function

Compute the LU decomposition of A.


If A is full subroutines from LAPACK are used and if A is sparse
then UMFPACK is used.
The result is returned in a permuted form, according to the optional return
value P. For example, given the matrix ’a = [1, 2; 3, 4]’,

[l, u, p] = lu (A)
returns
l = 1.00000 0.00000
0.33333 1.00000

u = 3.00000 4.00000
0.00000 0.66667

p = 0 1
1 0

The matrix is not required to be square.

SHA 21-5-21
CHAPTER 2. MATRIX COMPUTATIONS 55

Using this factorization one can solve systems of linear equations A~x = ~b for ~x.
Octave
A = randn(3,3); % generate a random matrix
b = rand(3,1);
x1 = A\b; % a first solution
[L,U,P] = lu(A); % compute the LU factorization with pivoting
x2 = U\(L\(P*b)) % the solution with the help of the factorization
DifferenceSol = norm(x1-x2) % display the differences, should be zero

The first solution is generated with the help of the backslash operator \ . Internally Octave/MATLAB use the
LU (same as LR) factorization. Computing and storing the L, U and P matrices is only useful when multiple
systems with the same matrix A have to be solved. The computational effort to apply the back substitution
is considerably smaller than the effort to determine the factorization, at least for general matrices.

2.6 Structured Matrices


Many matrices have special properties. They might be symmetric, have the nonzero entries concentrated in
a narrow band around the diagonal or might have very few nonzero entries. The model matrices An and
Ann in Section 2.3 exhibit those properties. The basic LR factorization can be adapted to take advantage of
these properties.

2.6.1 Symmetric Matrices, Algorithm of Cholesky


If a matrix is symmetric then we might seek a factorization A = L · R with L = RT . This will lead to
the classical Cholesky factorization A = RT · R. This approach is given as an exercise. This algorithm
will require the computation of square roots, which is often undesirable. Observe that on the diagonal of the
factor R of the standard Cholesky factorization you will find numbers different from 1, while the modified
factorization below requires values of 1 along the diagonal.
Instead we examine a slight modification7 . We seek a diagonal matrix D and an upper triangular matrix
R with numbers 1 on the diagonal such that

A = RT · D · R

The approach is adapted from Section 2.4.1, using block matrices again. Using standard matrix multiplica-
tions we find
7
This modification is known as modified Cholesky factorization and MATLAB provides the command ldl() for this algorithm.

SHA 21-5-21
CHAPTER 2. MATRIX COMPUTATIONS 56

 
a1,1 a1,2 a1,3 . . . a1,n
 
 a 
 1,2 
 
 a1,3 = RT · D · R =
An−1
 
 .. 

 . 

a1,n

     
1 0 0 ... 0 d1 0 0 ... 0 1 r1,2 r1,3 . . . r1,n
     
 r   0   0 
 1,2     
     
=
 r1,3
  0   0 
RTn−1 Dn−1 Rn−1
    
 ..   .
 ..
  .
 ..


 . 
 

 


r1,n 0 0

   
1 0 0 ... 0 d1 d1 r1,2 d1 r1,3 . . . d1 r1,n
   
 r   0 
 1,2   
   
=
 r1,3
  0 
RTn−1 Dn−1 · Rn−1
  
 ..   .
 ..


 . 
 


r1,n 0

Now we examine the effects of the last matrix multiplication on the four submatrices. This translates to 4
subsystems.

• Examine the top left block (one single number) in A. Obviously we find a1,1 = d1 .

• Examine the bottom left block (column) in A.


   
a1,2 r1,2
   
 a1,3   r1,3  a1,i a1,i
 .  =  .  · d1 =⇒ r1,i = =
   
 ..   ..  d1 a1,1
   
a1,n r1,n

This operation requires (n − 1) flops.

• The top right block (row) in A is then already taken care of. It is a transposed copy of the first column.

• Examine the bottom right block in A. We need


 
r1,2
 
 r1,3  h i
An−1 = d1  .  · r1,2 r1,3 . . . r1,n + RTn−1 · Dn−1 · Rn−1 .
 
 .. 
 
r1,n

For 2 ≤ i, j ≤ n update the entries in An−1 by applying


a1,i a1,j
ai,j −→ ai,j − d1 r1,i r1,j = ai,j − .
a1,1

SHA 21-5-21
CHAPTER 2. MATRIX COMPUTATIONS 57

1
This operation requires 2 (n − 1)2 flops since the matrix is symmetric. Now we are left with the new
factorization
Ãn−1 = RTn−1 · Dn−1 · Rn−1
with the updated matrix Ãn−1 .

• Then restart the process with the reduced problem of size (n − 1) × (n − 1) in the lower right block.

The total number of operations can be estimated by

n−1
X k2 1
FlopChol ≈ ≈ n3 .
2 6
k=1

Thus we were able to reduce the number of necessary operations by a factor of 2 compared to the standard
LR factorization (FlopLR ≈ 13 n3 ).

2–25 Observation : Adding multiples of one row to another row in a large matrix can be implemented in
parallel on a multicore architecture, as shown in Section 2.2.3. The number of columns has to be consider-
ably larger than the number of CPU cores to be used. ♦

The algorithm and an implementation in Octave


The above algorithm can be implemented in any programming language. Using a MATLAB/Octave pseudo
code one might write.

for each row: for k = 1:n


for each row below the current row for j = k+1:n
find the factor for the row operation R(k,j) = A(j,k)/A(k,k);
do the row operation A(j,:) = A(j,:)-R(k,j)*A(k,:);
do the column operation A(:,j) = A(:,j)-R(k,j)*A(:,k);

The above may be implemented in MATLAB/Octave.

function [R,D] = choleskyDiag(A)


% [R,D] = choleskyDiag(A) if A is a symmetric positive definite matrix
% returns a upper triangular matrix R and a diagonal matrix D
% such that A = R’*D*R

% this code can only be used for didactical purposes


% it has some major flaws!

[n,m] = size(A); R = zeros(n);

for k = 1:n-1
R(k,k) = 1;
for j = k+1:n
R(k,j) = A(j,k)/A(k,k);
A(j,:) = A(j,:) - R(k,j)*A(k,:); % row operations
A(:,j) = A(:,j) - R(k,j)*A(:,k); % column operations
end%for

SHA 21-5-21
CHAPTER 2. MATRIX COMPUTATIONS 58

R(n,n) = 1;
end%for
D = diag(diag(A));

The above code has some serious flaws:

• It does not check for correct size of the input.

• It does not check for possible divisions by 0.

• As we go through the algorithm the coefficients in R can replace the coefficients in A which will not
be used any more. This cuts the memory requirement in half.

• If we do all computations in the upper right part of A, we already know that the result in the lower left
part has to be the same. Thus we can do only half of the calculations.

• As we already know that the numbers in the diagonal of R have to be 1, we do not need to return
them. One can use the diagonal of R to return the coefficients of the diagonal matrix D.

If we implement most8 of the above points we obtain an improved algorithm, shown below.
choleskyM.m
function R = choleskyM(A)
% R = choleskyM(A) if A is a symmetric positive definite matrix
% returns a upper triangular matrix R and a diagonal matrix D
% such that A = R1’*D*R1
% R1 has all diagonal entries equal to 1
% the values of D are returned on the diagonal of R

TOL = 1e-10*max(abs(A(:))); %% there certainly are better tests than this!!

[n,m] = size(A);
if (n˜=m) error (’choleskyM: matrix has to be square’) end%if

for k = 1:n-1
if ( abs(A(k,k)) <= TOL) error (’choleskyM:might be a singular matrix’)
endif
for j = k+1:n
A(j,k) = A(k,j)/A(k,k);
% row operations only
A(j,j:n) = A(j,j:n) - A(j,k)*A(k,j:n);
end%for
end%for
if ( abs(A(n,n)) <= TOL) error (’choleskyM:might be a singular matrix’) end%if

% return the lower triangular part of A.


% Transpose it to obtain an upper triangular matrix
R = tril(A)’;

The above code finds the Cholesky factorization of the matrix, but does not solve a system of linear
equations. It has to be supplemented with the corresponding back-substitution algorithm.

8
The memory requirements can be made considerably smaller

SHA 21-5-21
CHAPTER 2. MATRIX COMPUTATIONS 59

function x = CholeskySolver(R,b)
% x = choleskySolver(R,b) solves A x = b
% R has to be generated by R = choleskyM(A)

[n,m] = size(R);
if (n ˜= length(b))
error ("CholeskySovler: matrix and vector do not have same dimension ") endif

% forward substitution of R’ y = b
y = zeros(size(b));
y(1) = b(1);
for k = 2:n
y(k) = b(k);
for j = 1:k-1 y(k) = y(k) - R(j,k)*y(j);end%for
end%for

% solve diagonal system


for k = 1:n y(k) = y(k)/R(k,k);endfor

% backward substitution of R x = y
x = zeros(size(b));
x(n) = y(n);
for k = 1:n-1
x(n-k) = y(n-k);
for j = n-k+1:n x(n-k) = x(n-k) - R(n-k,j)*x(j);end%for
end%for

The operation count for the back substitution algorithm is given by


FlopSolve ≈ n2 .
For large n this is small compared to the n3 /6 operations for the factorization. Thus we will only keep track
of the computational effort for the factorization.

Now we can solve an exemplary system of linear equations.

A = [1 3 -4; 3 11 0; -4 0 10];
R = choleskyM(A)
b = [1; 2; 3];
x = CholeskySolver(R,b)’
-->
R = 1 3 -4
0 2 6
0 0 -78

x = -1.16667 0.50000 -0.16667

The result says that


       
1 3 −4 1 0 0 1 0 0 1 3 −4
       
= 3 1 0 · 0 2 · 0 1
 3 11 0     0   6 
 
−4 0 10 −4 6 1 0 0 −78 0 0 1
and the system      
1 3 −4 x1 1
     
  x2  =  2 
 3 11 0     

−4 0 10 x3 3

SHA 21-5-21
CHAPTER 2. MATRIX COMPUTATIONS 60

is solved by    
x1 −1.16667
   
 x2  ≈  +0.50000  .
   
x3 −0.16667

2.6.2 Positive Definite Matrices


In the previous section we completely ignored pivoting. This might lead to unnecessary divisions by num-
bers close to zero and thus large errors. For a special type of symmetric matrices one can show that pivoting
is in fact not necessary to obtain numerical stability.

2–26 Definition : A symmetric, real matrix A is called positive definite if and only if
hA · ~x , ~xi = h~x , A · ~xi > 0 for all ~x 6= ~0 .
The matrix is called positive semidefinite if and only if
hA · ~x , ~xi = h~x , A · ~xi ≥ 0 for all ~x .

2–27 Example : To verify that the matrix An (see page 32) is positive definite we have to show that9
h~u , An ~ui > 0 for all ~u ∈ Rn \{~0}

   
u1 2 u1 − u2
   
 u2   −u1 + 2 u2 − u3 
   
u3 −u2 + 2 u3 − u4
   
1    
h~u , An ~ui = h , i
h  ... ..
 
2   
  . 
   
 u  −u
n−2 + 2 un−1 − un 
 
 n−1  
un −un−1 + 2 un
   
u1 u1 − (u2 − u1 )
   
 u2   (u2 − u1 ) − (u3 − u2 ) 
   
u3 (u3 − u2 ) − (u4 − u3 )
   
1    
= h , i
h  ...
2   
 
..
.


   
 u   (u − u ) − (u − u ) 
 n−1   n−1 n−2 n n−1 

un (un − un−1 ) + un
   
u1 u1
   

 (u2 − u1 )  
  (u2 − u1 ) 

(u3 − u2 ) (u3 − u2 ) 2
   
1   
i + un

= h ..
,
..
h2  .
 
  .

 h2
   

   (un−1 − un−2 ) 
(un−1 − un−2 )   

(un − un−1 ) (un − un−1 )


9
This verification corresponds to the integration by parts for twice differentiable functions u(x) with boundary conditions
u(0) = u(1) = 0.
Z 1 Z 1
u(x) · (−u00 (x)) dx = 0 + u(x)0 · u0 (x) dx ≥ 0
0 0

SHA 21-5-21
CHAPTER 2. MATRIX COMPUTATIONS 61

n
!
1 X
= u21 + u2n + (ui − ui−1 )2 .
h2
i=2

This sum of squares is obviously positive. Only if ~u = ~0 the expression will be zero. Thus the matrix An is
positive definite. ♦

A positive definite matrix A has a few properties that are easy to verify.

2–28 Result : If the matrix A = (ai,j )1≤i,j≤n is positive definite then

• ai,i > 0 for 1 ≤ i ≤ n, i.e. the numbers on the diagonal are positive.

• max |ai,j | = max ai,i , i.e. the maximal value has to be on the diagonal.

3
Proof :

• Choose ~x = ~ei and compute h~ei , A · ~ei i = ai,i > 0. For a 4 × 4 matrix this is illustrated by
        
a a a a14 0 0 a13 0
 11 12 13        
 a
 12 a22 a23 a24   0   0   a23   0 
        
hA~x , ~xi = h    ,  i = h  ,  i = a33 > 0 .
 a13 a23 a33 a34   1   1 
     a33   1 
    
a14 a24 a34 a44 0 0 a34 0

• Assume max |ai,j | = ak,l with k 6= l. Choose ~x = ~ek − sign(ak,l ) ~el and compute h~x , A · ~xi =
ak,k + al,l − 2 |ak,l | ≤ 0, contradicting positive definiteness. To illustrate the argument we use a small
matrix again.
    
a11 a12 a13 a14 ±1 ±1
    
 a
 12 a22 a23 a24   0   0 
   
hA~x , ~xi = h  , i
 a13 a23 a33 a34   1   1 
    
a14 a24 a34 a44 0 0
   
±a11 + a13 ±1
   
 .   0 
= h , i = a11 + a33 ± 2 a13 > 0
   
 ±a13 + a33   1 
   
. 0

1
By choosing the correct sign we conclude |a13 | ≤ 2 (a11 + a33 ) and thus |a13 | can not be larger than
both of the other numbers.

2
The above allows to easily detect that a matrix is not positive definite, but it does not contain a criterion
to quickly decide that A is positive definite.

For a large number of applied problems the resulting matrix has to be positive definite, based on physical
or mechanical observations. In many applications the generalized energy of a system is given by
1
energy = hA · ~x , ~xi .
2

SHA 21-5-21
CHAPTER 2. MATRIX COMPUTATIONS 62

If the object is deformed/modified the energy is often strictly increased. and based on this, the matrix A has
to be positive definite.
The eigenvalues contain all information about definiteness of a symmetric matrix, but this is not an
efficient way to detect positive definite matrices. Finding all eigenvalues is computationally expensive. If
the matrix is symmetric and the smallest eigenvalue is positive, then all eigenvalues are positive. For this
the MATLAB/Octave function eigest() is very useful, see Section 3.2.4.
2–29 Result : The symmetric matrix A is positive definite iff all eigenvalues are strictly positive. The
symmetric matrix A is positive semidefinite iff all eigenvalues are positive or zero. 3
Proof : This is a direct consequence of the diagonalization result 3–22 (page 134) A = Q D QT , where
the diagonal matrix contains the eigenvalues λj along its diagonal. The computation is based on ~y = QT · ~x
and the equation
X
hA · ~x , ~xi = hQ · D · QT · ~x , ~xi = hD · QT · ~x , QT · ~xi = hD · ~y , ~y i = λj yi2 .
j
2
This result is of little help to decide whether a given, large matrix is positive definite or not. Finding
all eigenvalues is not an option, as it is computationally expensive. A positive answer can be given using
diagonal dominance and reducible matrices, see e.g. [Axel94, §4].
2–30 Definition : Consider a symmetric n × n matrix A.
• A is called strictly diagonally dominant iff |ai,i | > σi for all 1 ≤ i ≤ n, where
X
σi = |ai,j | .
j6=i , 1≤j≤n

Along each column/row the sum of the off-diagonal elements is smaller than the diagonal element.
• A is called diagonally dominant iff |ai,i | ≥ σi for all 1 ≤ i ≤ n.
• A is called reducible if there exists a permutation matrix P and square matrices B1 , B2 and a matrix
B3 such that " #
B 1 B 3
P · A · PT =
0 B2
Since A is symmetric the matrix P · A · PT is also symmetric and the block B3 has to vanish, i.e. we
have the condition " #
B 1 0
P · A · PT = .
0 B2
This leads to an easy interpretation of a reducible matrix A: the system of linear equation A ~u = ~b
can be decomposed into two smaller, independent systems B1 ~u1 = ~b1 and B2 ~u2 = ~b2 . To arrive at
this situation all one has to do is renumber the equations and variables.
• A is called irreducible if it is not reducible.
• A is called irreducibly diagonally dominant if A is irreducible and
– |ai,i | ≥ σi for all 1 ≤ i ≤ n
– |ai,i | > σi for at least one 1 ≤ i ≤ n

For further explanations concerning reducible matrices see [Axel94] or [VarFEM]. For our purposes it is
sufficient to know that the model matrices An and Ann in Section 2.3 are positive definite, diagonally
dominant and irreducible.

SHA 21-5-21
CHAPTER 2. MATRIX COMPUTATIONS 63

2–31 Result : (see e.g. [Axel94, Theorem 4.9])


Consider a real symmetric matrix A with positive numbers along the diagonal. If A is strictly diagonally
dominant or irreducibly diagonally dominant, then A is positive definite. 3

As a consequence of the above result the model matrices An and Ann are positive definite.

2–32 Example : A positive definite matrix need not be diagonally dominant. As an example consider the
matrix  
5 −4 1
 
 −4 6 −4 1 
 
 
 1 −4 6 −4 1 
 
1 −4 6 −4 1
 
 
A= 
.. .. .. .. ..
.
. . . . .

 
 

 1 −4 6 −4 1 

 

 1 −4 6 −4 

1 −4 5
This matrix was generated with the help of the model matrix An , given on page 32 by A = h4 An · An .
This matrix A is positive definite, but it is clearly not diagonally dominant. ♦

The algorithm of Cholesky will not only determine the factorization, but also indicate if the matrix A is
positive definite. It is an efficient tool to determine positive definiteness.

2–33 Result : Let A = RT · D · R be the Cholesky factorization of the previous section. Then A
is positive definite if and only if all entries in the diagonal matrix D are strictly positive. 3

Proof : Since the triangular matrix R has only numbers 1 along the diagonal, it is invertible. If the vectors
~x ∈ Rn will cover all of Rn , then the constructed vectors ~y = R ~x will also cover all of Rn . Now the
identity
h~x , A ~xi = h~x , RT · D · R ~xi = hR ~x , D · R ~xi = h~y , D ~y i .
implies

h~x , A · ~xi > 0 for all ~x 6= ~0 ⇐⇒ h~y , D · ~y i > 0 for all ~y 6= ~0


X n
⇐⇒ di yi2 > 0 for all ~y 6= ~0
i=1
⇐⇒ di > 0 for all 1 ≤ i ≤ n .

2.6.3 Stability of the Algorithm of Cholesky


To show that the Cholesky algorithm is stable (without pivoting) for positive definite systems two essential
ingredients are used:
• Show that the entries in the factorization R and D are bounded by the entries in A. This is only correct
for positive definite matrices.

• Keep track of rounding errors for the algebraic operations to be executed during the algorithm of
Cholesky.

SHA 21-5-21
CHAPTER 2. MATRIX COMPUTATIONS 64

The entries of R and D are bounded


For a symmetric, positive definite matrix we have the factorization

A = RT · D · R .

By multiplying out the diagonal elements we obtain


i−1
X i−1
X n
X
2 2
ai,i = di + rk,i dk rk,i = di + dk rk,i = dk rk,i .
k=1 k=1 k=1

Since A is positive definite we know that ai,i > 0 and di > 0. Thus we find bounds on the coefficients in R
and D in terms of A.
i−1
X n
X
2 2
di ≤ ai,i and dk rk,i ≤ ai,i or similar dk rk,i ≤ ai,i .
k=1 k=1

Using this and the Cauchy–Schwartz inequality10 we now obtain an estimate for the result of the matrix
multiplication below, where the entries in |R| are given by the absolute values of the entries in R. Estimates
of this type are needed to keep track of the ‘worst case’ situation for rounding errors and the algorithm. We
will need information on the expression below.
n
X n
X p p
T

|R| · D · |R| i,j = |rk,i | dk |rk,j | = (|rk,i | dk ) ( dk |rk,j |)
k=1 k=1
v v
u n u n
2 ≤ √a · √a
uX uX
≤ t 2
dk rk,i · t dk rk,j i,i j,j ≤ max aj,j . (2.3)
1≤j≤n
k=1 k=1

Example 2–36 shows that the above is false if A is not positive definite.

Rounding errors while solving


The following result is a modification of Theorem 2–20 for symmetric matrices.
2–34 Result : (Modification of [GoluVanLoan96, Theorem 3.3.1])
Assume that for a positive definite, symmetric n × n matrix A the algorithm of Cholesky leads to an ap-
proximate factorization
R̂T · D̂ · R̂ = A + E
Then the error matrix E satisfies

|E| ≤ 3 (n − 1) u |A| + |R|T · |D| · |R| + O(u2 ) .




3
The estimate (2.3) for a positive definite A now implies

|E| ≤ 6 (n − 1) u max ai,i .


i

10
For vectors we know that h~ y i = k~
x, ~ xk k~
y k cos α, where α is the angle between the vectors. This implies |h~ y i| ≤ k~
x, ~ xk k~
yk .

SHA 21-5-21
CHAPTER 2. MATRIX COMPUTATIONS 65

2–35 Result : (Modification of [GoluVanLoan96, Theorem 3.3.2])


Let R̂ and D̂ be the computed factors of the Cholesky factorization of the n × n matrix A. Then forward
and back substitution are used to solve D̂ · R̂T ~y = ~b with the computed solution ~yˆ and solve R̂ ~x = ~yˆ with
ˆ . Then
the computed solution ~x
ˆ = ~b with |E| ≤ n u 3 |A| + 5 |R|T · |D| · |R| + O(u2 ) .

(A + E) ~x
3
The estimate (2.3) for a positive definite A now implies
|E| ≤ 8 n u max ai,i ,
i

i.e. the result of the numerical computations is the exact solution of slightly modified equations. The
modification is small compared to the maximal coefficient in the original problem.

As a consequence of the above we conclude that for positive definite, symmetric matrices there is
no need for pivoting when using the Cholesky algorithm.

2–36 Example : If the matrix is not positive definite the effect of roundoff errors may be large, even if the
matrix has an ideal condition number close to 1. Consider the system
" # ! !
0.0001 1 x1 1
= .
1 0.0001 x2 1
Exact arithmetic leads to the factorization
" # " # " # " #
0.0001 1 1 0 0.0001 0 1 10000
= · · .
1 0.0001 10000 1 0 −9999.9999 0 1

The condition number is κ = 1.0002 and thus we expect almost no loss of precision11 . The exact solution
is ~x = (0.99990001 , 0.99990001)T . Since all numbers in A and ~b are smaller than 1 one might hope
for an error of the order of machine precision. The bounds on the entries in R and D in (2.3) are clearly
violated, e.g.
|R|T · |D| · |R| 2,2 = |r1,2 | |d1 | |r1,2 | + |r2,2 | |d2 | |r2,2 | = 108 · 10−4 + 9999.9999 ≈ 20000


Using floating point arithmetic with u ≈ 10−8 (i.e. 8 decimal digits) we obtain a factorization
" # " # " # " #
1 0 0.0001 0 1 10000 0.0001 1
· · =
10000 1 0 −10000 0 1 1 0
ˆ = (1.0 , 0.9999)T . Thus the relative error of the solution is 10−4 . This is by mag-
and the solution is ~x
nitudes larger than the machine precision u ≈ 10−8 . The effect is generated by the large numbers in the
factorization. This can not occur if the matrix A is positive definite since we have the bound (2.3). To over-
come this type of problem a good pivoting scheme has to be used when the matrix is not positive definite, see
e.g. [GoluVanLoan96, §4.4]. This will most often destroy the symmetry of the problem. It is possible to use
row and column permutations to preserve the symmetry. This approach will be examined in Section 2.6.6.

11
Observe that the eigenvalues of the matrix are λ1 = 1.0001 and λ2 = −0.9999. Thus the matrix is not positive definite. But
the permuted matrix (row permutations)
" #" # " #
0 1 0.0001 1 1 0.0001
=
1 0 1 0.0001 0.0001 1
has eigenvalues λ1 = 1.0001 and λ2 = +0.9999 and thus is positive definite.

SHA 21-5-21
CHAPTER 2. MATRIX COMPUTATIONS 66

2.6.4 Banded Matrices and the Algorithm of Cholesky


Both matrices An and Ann in Section 2.3 exhibit a band structure. This is no coincidence as most ma-
trices generated by finite element or finite difference methods are banded matrices. We present the most
elementary direct method using the band structure of the matrix A. This approach is practical if the degrees
of freedom in a finite element problem are numbered to minimize the bandwidth of the matrix. There are
special algorithms to achieve this goal, e.g. Cuthill-McKee as described in Section 6.2.7 in the context of
FEM.
If a symmetric matrix A has all nonzero numbers close to the diagonal, then it is called a banded matrix.
If ai,j = 0 for |i − j| > b then the integer b is called the semibandwidth of A. For a tridiagonal matrix
we find b = 2, the main diagonal and one off-diagonal. As the algorithm of Cholesky is based on row and
column operations we can apply it to a banded matrix and as long as no pivoting is done the band structure of
the matrix is maintained. Thus we can factor a positive definite symmetric matrix A with semibandwidth b
as
A = RT · D · R ,
where R is an upper triangular unity matrix with semibandwidth b and D is a diagonal matrix with positive
entries. This situation is visualized in Figure 2.8. For a n × n matrix A we are interested in the situation
1<bn.

@ @ @ @@
@ @ @@ @ @@
@ @ @@
· ·
@ @@
@ @ = @@ @ @@
@ @ @@ @ @@
@ @ @@ @ @@
@ @@ @ @

Figure 2.8: The Cholesky decomposition for a banded matrix

@@ @@ @@
@@ @ @@ @ @@ @
@@ @ @@ @ @@ @
@@ @ ... @@ @ ... @@ @ ...
@@ @ @@ @ @@ @
@@@ @@@ @@@
@@ @@ @@

Figure 2.9: Cholesky steps for a banded matrix. The active area is marked

When implementing the algorithm of Cholesky one works along the diagonal, top to bottom. For each
step only a block of size b × b of numbers is worked on, i.e. has to be quickly accessible. Three of these
situations are shown in Figure 2.9. Each of those steps requires approximately b2 /2 flops and there are
approximately n of those, thus we find an approximate12 operational count of

1
FlopCholBand ≈ b2 n .
2

12
We ignored the effect that the first row in each diagonal step is left unchanged and we also do not take into account that in the
lower right corner fewer computations are needed. Both effects are of lower order.

SHA 21-5-21
CHAPTER 2. MATRIX COMPUTATIONS 67

This has to be compared to the computational effort for the full Cholesky factorization
1 1
FlopChol ≈ n3 = n2 n .
6 6
The additional cost to solve a system by back substitution is approximately
FlopSolveBand ≈ 2 b n .
We need n · b numbers to store the complete matrix A. As the algorithm proceeds along the diagonal in A (or
its reduction R) in each step only the next b rows will be worked on. As we go to the next row the previous
top row will not be used any more, but a new row at the will be added at the bottom, see Figure 2.9 . Thus
we have only b · b active entries at any time. If these numbers can be placed in fast memory (e.g. in cache)
then the implementation will run faster than in regular memory. Thus for good performance we like to store
b2 numbers in fast memory. This has to be taken in consideration when setting up the data structure for
a banded matrix, together with the memory and cache architecture of the computer to be used. Table 2.5
shows the types of fast and regular memory relevant for most problems and some typical sizes of matrices.
The table assumes that 8 Bytes are used to store one number. If not enough fast memory is available the
algorithm will still generate the result, but not as fast. An efficient implementation of the above idea is
shown in [VarFEM].

size band fast memory memory


cache [MB] RAM [MB]
1’000 100 0.08 0.8
10’000 100 0.08 8
100’000 100 0.08 80
100’000 200 0.32 160
100’000 500 2 400
100’000 1’000 8 800
1’000’000 1’000 8 8’000

Table 2.5: Memory requirements for the Cholesky algorithm for banded matrices

2–37 Observation : If the semibandwidth b is considerably larger than the number of used CPU cores,
then the algorithm can efficiently be implemented on a multi core architecture, as shown in Section 2.2.3.

2.6.5 Computing with an Inverse Matrix is Usually Inefficient


If a linear system with matrix A ∈ Mn×n has to be solved many times one might be tempted to compute
the inverse matrix A−1 . As exampleexamine a symmetric, positive definite matrix A with semibandwidth
b  n. This example illustrates that is is very inefficient to compute with inverse matrices, it uses more
memory and requires more flops, i.e. a longer computation time.
To determine the inverse matrix M = R−1 of the Cholesky factorization supplement the band matrix
R with the identity matrix, and then use row operations to transform the left part to the identity matrix In .
Working with elementary matrices (Section 2.4.2, page 43) examine the augmented matrix.
R · R−1 = In
(M · R) · R−1 = M
In · R−1 = M

SHA 21-5-21
CHAPTER 2. MATRIX COMPUTATIONS 68

As example consider a 6 × 6 matrix with semibandwidth 3.


 
r1,1 r1,2 r1,3 0 0 0 1
 

 r2,2 r2,3 r2,4 0 0 1 

 

 r3,3 r3,4 r3,5 0 1 

r4,4 r4,5 r4,6 1
 
 
 

 r5,5 r5,6 1 

r6,6 1

and using row operations this has to be transformed to


 
1 m1,1 m1,2 m1,3 m1,4 m1,5 m1,6
 

 1 m2,2 m2,3 m2,4 m2,5 m2,6 

 

 1 m3,3 m3,4 m3,5 m3,6 
 .
1 m4,4 m4,5 m4,6 
 

 

 1 m5,5 m5,6 

1 m6,6

To transform R to the diagonal matrix work bottom up.


• Divide the last row by rn,n . Subtract multiples of the last row from the last b rows, such that the only
nonzero entry in column n is the value 1 in the bottom right corner. This requires b flops.

• Divide row n − 1 by rn−1,n−1 . Subtract multiples of row n − 1 from rows n − 2 to n − b rows, such
that the only nonzero the value 1 on the diagonal. This requires b · 2 flops.

• Divide row n − 2 by rn−2,n−2 . Subtract multiples of row n − 2 from rows n − 3 to n − b − 1 rows,


such that the only nonzero the value 1 on the diagonal. This requires b · 3 flops.

• Proceed similary for all rows, up to the first row.


n2
Observe that the inverse matrix M = R−1 is a full, upper triangular matrix. Thus it contains 2 entries.
The number of required flops is estimated by
n
X 1 2
bk ≈ bn .
2
k=1

Since (RT )−1 = (R−1 )T and A−1 = (RT R)−1 = R−1 R−T we could now solve the system A~x = ~b by
~b = R−1 (R−T ~x) ,

i.e. by two matrix mutiplications, each requiring approximately 21 n2 flops. This coincides with the number
of operations required to multiply with the full matrix A−1 = R−1 R−T .
If we avoid the inverse matrix R but use twice a back–substitution: first RT ~y = ~b and then R ~x = ~y we
require only 2 b n flops, i.e. considerably less than 21 n2 .

2.6.6 Octave Implementations of Sparse Direct Solvers


In the previous sections we examined a banded Cholesky algorithm. This is only one example of a sparse
direct solver. The above ideas of a banded Cholesky solver can be improved, with considerable additional
effort. UMFPACK and CHOLMOD (by Timothy Davis) are good libraries for direct solvers for sparse
matrices. Both are used by Octave. We illustrate its use by an example.

SHA 21-5-21
CHAPTER 2. MATRIX COMPUTATIONS 69

Examine the steady state heat equation (1.5) (page 12) on a unit square with nx interior grid points in x
direction and ny points in y direction.
2 T (x,y) ∂ 2 T (x,y)
−∂ ∂x2
− ∂y 2
= 1
k f (x, y) for 0 ≤ x, y ≤ 1
T (x, y) = 0 for (x, y) on boundary .

The resulting matrix is of size nx·ny by nx·ny with a semi-bandwidth of nx. The matrix may be generated
as a Kronecker product of two tridiagonal matrices, representing the second derivatives in x and y direction.

nx = 200; ny = nx; hx = 1/(nx+1); hy = 1/(ny+1);


Dxx = spdiags(ones(nx,1)*[-1, 2, -1],[-1, 0, 1],nx,nx)/(hxˆ2);
Dyy = spdiags(ones(ny,1)*[-1, 2, -1],[-1, 0, 1],ny,ny)/(hyˆ2);

A = kron(speye(ny),Dxx) + kron(Dyy,speye(nx));
b = ones(nx*ny,1);

Now solve the resulting system of linear equations A ~x = ~b with different algorithms.

• The above sparse matrix is converted to a full matrix, whose inverse is used to determine the solution.

Afull = full(A);
x1 = inv(Afull)*b;

This method consumes a lot of computation time and a full matrix with 2004 entries has to be stored,
i.e. we would need 8 · 2004 B = 12.8 GB of memory. This is a foolish method to use. The compu-
tation actually failed for n = 200. A test with n = 80 leads to a computation time of 150 seconds.

• The standard solver of Octave uses a good algorithm and the code

tic()
x2 = A\b;
SolveTime = toc()

takes 0.108 seconds to solve one system. Octave uses the selection tree displayed in Section 2.6.7.

• We may first compute the Cholesky factorization A = RT R and then determine the solution in two
steps: solve RT ~y = ~b, then R ~x = ~y .

R = chol(A);
x3 = R\(R’\b);

It takes 0.667 seconds to compute R and then 0.127 seconds to solve. The command nnz(R) shows
that R has approximately 8 · 106 nonzero entries. This coincides with the n3x = 2003 entries required
by the banded Cholesky solver examined in the previous sections.

• One can modify the Cholesky algorithm with row and column permutations, seeking a factorization

PT AP = RT R

where P is a permutation matrix. Octave uses the Approximate Minimum Degree permutation
matrix generated by the command amd(). Systems of linear equations can then be solved by

A~x = ~b ⇐⇒ PT A P PT ~x = PT ~b ⇐⇒ RT R PT ~x = PT ~b

SHA 21-5-21
CHAPTER 2. MATRIX COMPUTATIONS 70

and thus by the sequence

RT ~y = PT ~b
R ~z = ~y
PT ~x = ~z

With Octave/MATLAB this translates to

[R, m, P] = chol(A);
x4 = P*(R\(R’\(P’*b)));

The result requires 0.122 second for the factorization and only 0.014 seconds to solve the system. The
matrix R has only 1081911 ≈ 106 nonzero entries.
To illustrate the above we use our standard matrix Ann with n = 200. Using Octave (version 5.0.1) on a
Xeon E5-1650 system we found the numbers in Table 2.6.

algorithm storage (numbers) factorization time solving time


1
full Cholesky 2 n4 ≈ 8 · 108 hopeless ??
sparse Cholesky ≈8· 106 0.667 sec 0.127 sec
sparse Cholesky with permutations ≈ 1 · 106 0.122 sec 0.014 sec
standard \ operator 0.108 sec

Table 2.6: Comparison of direct solvers for Ann with n = 200

The Cholesky algorithm with permutations is most efficient, concerning computation time and memory
consumption. We illustrate this by a slightly modified example. We take the above standard example, but
add one nonzero number to destroy most of the band structure.

nx = 20; ny = nx; hx = 1/(nx+1); hy = 1/(ny+1);


Dxx = spdiags(ones(nx,1)*[-1 2 -1],[-1 0 1],nx,nx)/(hxˆ2);
Dyy = spdiags(ones(ny,1)*[-1 2 -1],[-1 0 1],ny,ny)/(hyˆ2);
A = kron(speye(ny),Dxx) + kron(Dyy,speye(nx));
A(nx,nx*ny/2) = -A(1,1)/4; A(nx*ny/2,nx) = -A(1,1)/4;
% add numbers to destroy the band structure
R = chol(A); % standard Cholesky, sparse
[R2,m,P] = chol(A); % sparse Cholesky, with permutations
nonzeroR = nnz(R)
nonzeroR2 = nnz(R2)
-->
nonzeroR = 8179
nonzeroR2 = 3758

The matrix A generated with the above code is of size 400 × 400 and the semi-bandwidth is 20 (approx-
imately). We can compute the size of the sparse matrix required to store the Cholesky factorization R:
• Full Cholesky: half of a N × N = 400 × 400 matrix, leading to 80’000 entries.

• Band Cholesky: N × b = 400 × 20, leading to 8’000 nonzero entries. Ignoring the single nonzero we
still have a semi bandwidth of 20 and thus the banded Cholesky factorization would require 400·20 =
80 000 nonzero entries.

• Sparse Cholesky: 8179 nonzero entries

SHA 21-5-21
CHAPTER 2. MATRIX COMPUTATIONS 71

0 0 0

100 100 100

200 200 200

300 300 300

400 400 400


0 100 200 300 400 0 100 200 300 400 0 100 200 300 400

(a) original matrix A (b) R no permutations (c) R with permutations

Figure 2.10: The sparsity pattern of a band matrix and two Cholesky factorizations

• Sparse Cholesky with permutations: 3785 nonzero entries. As a consequence we need less storage
and the back substitution will be about twice as fast.

The sparsity pattern in Figure 2.10 shows where the non-zeros are. In 2.10(a) find the non-zeros in the
original matrix A. The band structure is clearly visible. By zooming in we would find only 5 diagonals
occupied by numbers, i.e. the band is far from being full. In 2.10(b) we recognize the result of the Cholesky
factorization, where the additional nonzero entry leads to an isolated spike in the matrix R. The band in
this matrix is full. In 2.10(c) find the results with the additional permutations allowed. The band structure is
replaced by a even more sparse pattern of non-zeros.
We observe:

• The chol() implementation in Octave is as efficient as a banded Cholesky and can deal with isolated
nonzeros outside of the band.

• The chol() command with the additional permutations can be considerably more efficient, i.e.
requires less memory and the back substitution is faster.

2.6.7 A Selection Tree used in Octave for Sparse Linear Systems


The banded Cholesky algorithm above shows how to use properties of the matrices to find efficient algo-
rithms to solve systems of linear equations. There are many more tricks of the trade to be used. The goal of
the previous section is to explain one of the essential ideas. Real world codes should use more features of
the matrices. Octave and MATLAB use sparse matrices and more advanced algorithms.

The documentation of Octave contains a selection tree for solving systems of linear equations using
sparse matrices. Find this information in the official Octave manual in the section Linear Algebra on Sparse
Matrices. When using the command A\b with a sparse matrix A to solve a linear system the following
decision tree is used to choose the algorithm to solve the system.

1. If the matrix is diagonal, solve directly and goto 8.

2. If the matrix is a permuted diagonal, solve directly taking into account the permutations. Goto 8

3. If the matrix is square, banded and if the band density is less than that given by spparms (”bandden”)
continue, else goto 4.

(a) If the matrix is tridiagonal and the right-hand side is not sparse continue, else goto 3(b).
i. If the matrix is hermitian, with a positive real diagonal, attempt Cholesky factorization
using Lapack xPTSV.

SHA 21-5-21
CHAPTER 2. MATRIX COMPUTATIONS 72

ii. If the above failed or the matrix is not hermitian with a positive real diagonal use Gaussian
elimination with pivoting using Lapack xGTSV, and goto 8.
(b) If the matrix is hermitian with a positive real diagonal, attempt Cholesky factorization using
Lapack xPBTRF.
(c) if the above failed or the matrix is not hermitian with a positive real diagonal use Gaussian
elimination with pivoting using Lapack xGBTRF, and goto 8.

4. If the matrix is upper or lower triangular perform a sparse forward or backward substitution, and
goto 8.

5. If the matrix is a upper triangular matrix with column permutations or lower triangular matrix with
row permutations, perform a sparse forward or backward substitution, and goto 8.

6. If the matrix is square, hermitian with a real positive diagonal, attempt sparse Cholesky factorization
using CHOLMOD.

7. If the sparse Cholesky factorization failed or the matrix is not hermitian with a real positive diagonal,
and the matrix is square, factorize using UMFPACK.

8. If the matrix is not square, or any of the previous solvers flags a singular or near singular matrix, find
a minimum norm solution using CXSPARSE.

The above clearly illustrates that a reliable and efficient algorithm to solve linear systems of equations
uses more than the most elementary ideas. In particular the keywords Cholesky and band structure appear
often.

2.7 Sparse Matrices and Iterative Solvers


All of the problems in Chapter 1 lead to linear systems A ~x +~b = ~0, where only very few entries of the large
matrix A are different from zero, i.e. we have a sparse matrix. The Cholesky algorithm for banded matrices
is using only some of this sparsity. Due to the sparsity the computational effort to compute a matrix product
A ~x is minimal, compared to the number of operations to solve the above system with a direct method. One
is lead to search for an algorithm to solve the linear system, using matrix multiplications only. This leads
to iterative methods, i.e. we apply computational operations until the desired accuracy is achieved. There
is no reliable method to decide beforehand how many operations will have to be applied. The previously
considered algorithms of LR factorization and Cholesky are both direct methods, since both methods will
lead to the solution of the linear system using a known, finite number of operations.

Sparse matrices can very efficiently be multiplied with a vector. Thus we seek algorithms to solve
linear systems of equations, using multiplications only. The trade-of is that we might have to use
many multiplications of a matrix times a vector.

We will examine methods that allow to solve linear systems with 106 unknowns within a few seconds
on a standard computer, see Table 2.15 on page 91.

2–38 Observation : It is possible to take advantage of a multi-core architecture for the multiplication of a
sparse matrix with a vector. ♦

SHA 21-5-21
CHAPTER 2. MATRIX COMPUTATIONS 73

2.7.1 The Model Problems


In Section 2.3 we find the matrix Ann of size n2 × n2 with a semi-bandwidth of n + 1 ≈ n. In each
row/column only 5 entries are different from zero. For the condition number we obtain

λmax 4
κ= ≈ 2 n2 .
λmin π

When using a banded Cholesky algorithm to solve A ~x + ~b = ~0 we need

• storage for n · n2 = n3 numbers.


1 1
• approximately 2 n2 n2 = 2 n4 floating point operations.

An iterative method will have to do better than this to be considered useful. To multiply the matrix Ann
with a vector we need about 5 n2 multiplications.
The above matrix Ann might appear when solving a two dimensional heat conduction problem. For the
similar three dimensional problem we find a matrix A of size N = n3 and each row has approximately
7 nonzero entries. The semi-bandwidth of the matrix is n2 . Thus the banded Cholesky solver requires
approximately 21 n3 · n4 floating point operations. The condition number is identical to the 2-D situation.

2.7.2 Basic Definitions


For a given invertible N × N matrix A and a given vector ~b we have the exact solution ~x of A ~x + ~b = ~0.
For an iteration mapping Φ : RN → RN we choose an initial vector ~x0 and then compute ~x1 = Φ(~x0 ),
~x2 = Φ(~x1 ) = Φ2 (~x0 ) or
~xk = Φk (~x0 ) .
The mapping Φ is called an iterative method with linear convergence factor q < 1 if the error after k
steps is bounded by
k~xk − ~xk ≤ c q k .
To improve the accuraccy of the inital guess ~x0 by D digits we need q k ≤ 10−D . This is satisfied if

−D −D ln 10
k log q ≤ −D or k ≥ = > 0.
log q ln q
For most applications the factor q < 1 will be very close to 1. Thus examine q = 1 − q1 and use the Taylor
approximation ln q = ln(1 − q1 ) ≈ −q1 . Then the above computations leads to an estimate for the number
of iterations necessary to decrease the error by D digits, i.e.

D ln 10
k≥ . (2.4)
q1
This implies that the numbers of desired correct digits is proportional to the number of required iterations
and inversely proportional to the deviation q1 of the factor q = 1 − q1 from 1.

2.7.3 Steepest Descent Iteration, Gradient Algorithm


For a symmetric, positive definite matrix A the solution of the linear system A ~x + ~b = ~0 is given by the
location of the minimum of the function
1
f (~x) = h~x , A ~xi + h~x , ~bi .
2

SHA 21-5-21
CHAPTER 2. MATRIX COMPUTATIONS 74

A possible graph of such a function and its level curves are shown in Figure 2.11. For symmetric matrices
the gradient of this function is given by13

∇f (~x) = A ~x + ~b = ~0 .

y
x

Figure 2.11: Graph of a function to be minimized and its level curves

~xk

d~k

d~k+1 ~xk+1

Figure 2.12: One step of a gradient iteration

A given point ~xk is assumed to be a good approximation of the exact solution ~x. The error is given by
the residual vector
~rk = A ~xk + ~b .
The direction of steepest descent is given by

d~k = −∇f (~xk ) = −A ~xk − ~b = −~rk .


13
Use a summation notation for the scalar and matrix product and differentiate with the help of the product rule.
n n
!
1 1 X X X
f (~
x) = h~
x, A~ xi + h~x , ~bi = xi ( ai,j xj ) + bj xj
2 2 i=1 j=1 1≤j≤n
n n
!
∂ 1 X X X
0= f (~
x) = 1 ak,j xj + xi (ai,k 1) + bk 1 = ak,j xj + bk
∂xk 2 j=1 i=1 1≤j≤n

SHA 21-5-21
CHAPTER 2. MATRIX COMPUTATIONS 75

This is the reason for the name steepest descent or gradient method, illustrated in Figure 2.12 . Thus
search for a better solution in the direction d~k , i.e. determine the coefficient α ∈ R, such that the value of
the function
        h(α) = f(~xk + α d~k) = ½ ⟨(~xk + α d~k) , A (~xk + α d~k)⟩ + ⟨(~xk + α d~k) , ~b⟩
             = (α²/2) ⟨d~k , A d~k⟩ + (α/2) ( ⟨d~k , A ~xk⟩ + ⟨A d~k , ~xk⟩ + 2 ⟨d~k , ~b⟩ ) + terms independent of α
             = (α²/2) ⟨d~k , A d~k⟩ + α ( ⟨d~k , A ~xk⟩ + ⟨d~k , ~b⟩ ) + terms independent of α

is minimal. This leads to the condition

        0 = d h(α)/dα = α ⟨d~k , A d~k⟩ + ( ⟨d~k , A ~xk⟩ + ⟨d~k , ~b⟩ ) = α ⟨d~k , A d~k⟩ + ⟨d~k , A ~xk + ~b⟩

        α = − ⟨d~k , A ~xk + ~b⟩ / ⟨A d~k , d~k⟩ = − ⟨d~k , ~rk⟩ / ⟨A d~k , d~k⟩ = + ⟨~rk , ~rk⟩ / ⟨A d~k , d~k⟩

and thus the next approximation ~xk+1 of the solution is given by

        ~xk+1 = ~xk + α d~k = ~xk + ( ‖~rk‖² / ⟨A d~k , d~k⟩ ) d~k .
One step of this iteration is shown in Figure 2.12 and a pseudo code for the algorithm is shown as the first
variant in Table 2.7.

   First variant (a first attempt):

        choose initial point ~x0
        k = 0
        while ‖~rk‖ = ‖A ~xk + ~b‖ too large
            d~k = −~rk
            α = − ⟨~rk , d~k⟩ / ⟨A d~k , d~k⟩
            ~xk+1 = ~xk + α d~k
            k = k + 1
        endwhile

   Second variant (efficient implementation):

        choose initial point ~x
        ~r = A ~x + ~b
        while ρ = ‖~r‖² = ⟨~r , ~r⟩ too large
            d~ = A ~r
            α = − ρ / ⟨d~ , ~r⟩
            ~x = ~x + α ~r
            ~r = ~r + α d~
        endwhile

Table 2.7: Gradient algorithm to solve A ~x + ~b = ~0, a first attempt (first variant) and an efficient
implementation (second variant)

The computational effort for one step in the algorithm seems to be: 2 matrix/vector multiplications,
2 scalar products and 2 vector additions. But the residual vector ~rk and the direction vector d~k differ only in
their sign. Since

~rk+1 = A ~xk+1 + ~b = A (~xk + αk d~k ) + ~b = A ~xk + ~b + αk A d~k = ~rk + αk A d~k

the necessary computations for one step of the iteration can be reduced, leading to the second variant
in Table 2.7. To translate between the two implementations use a few ± changes and

basic algorithm ←→ improved algorithm


d~k ←→ ~r
A d~k ←→ d~
~xk+1 = ~xk + α d~k ←→ ~x = ~x + α ~r
~rk+1 = ~rk + α A d~k ←→ ~r = ~r + α d~
The improved algorithm in Table 2.7 requires
• one matrix–vector product and two scalar products
• two vector additions of the type ~x = ~x + α ~r
• storage for the sparse matrix and 3 vectors
If each row of the N × N -matrix A has on average nz nonzero entries, then each iteration requires approx-
imately (4 + nz) N flops (multiplication/addition pairs).
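
The efficient variant translates almost line by line into Octave/MATLAB. The following sketch is not part of
the original notes (the function name and the fixed iteration limit maxit are choices made here); it implements
the second variant of Table 2.7 for a symmetric, positive definite matrix A.

function x = graddescent(A, b, x, tol, maxit)
  % sketch of the efficient gradient (steepest descent) iteration of Table 2.7
  % solves A*x + b = 0 for a symmetric, positive definite matrix A
  r = A*x + b;                   % initial residual
  for k = 1:maxit
    rho = r'*r;                  % squared norm of the residual
    if sqrt(rho) < tol, break; end
    d = A*r;
    alpha = -rho/(d'*r);         % optimal step length
    x = x + alpha*r;             % update the approximation
    r = r + alpha*d;             % update the residual without a second A*x
  end
end

Updating ~r via ~r + α d~ is the whole point of the second variant: only one matrix-vector product is needed
per iteration.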
Since the matrix A is positive definite
                  d²h(α)/dα² = ⟨A d~k , d~k⟩ > 0 ,
unless −d~k = A ~xk +~b = ~0 . This is the minimum of the function h(α) and consequently f (~xk+1 ) < f (~xk ),
unless ~xk equals the exact solution of A ~x + ~b = ~0 . Since d~k = −~rk conclude that α ≥ 0, i.e. the algorithm
made a step of positive length in the direction of the negative gradient.

The algorithm does not perform well if we search the minimal value in a narrow valley, as illustrated
in Figure 2.13. Instead of going down the valley, the algorithm jumps across and it requires many steps to
get close to the lowest point. This is reflected by the error estimate for this algorithm. One can show that
(e.g. [Hack94, Theorem 9.2.3], [Hack16, Theorem 9.10], [LascTheo87, p. 496], [KnabAnge00, p. 212],
[AxelBark84, Theorem 1.8])14
                  ‖~xk − ~x‖_A ≤ ((κ−1)/(κ+1))^k ‖~x0 − ~x‖_A ≈ (1 − 2/κ)^k ‖~x0 − ~x‖_A                  (2.5)

using the energy norm ‖~x‖²_A = ⟨~x , A ~x⟩.

   For most matrices based on finite element problems we know that ‖~x‖ ≤ α ‖~x‖_A and thus

                  ‖~xk − ~x‖ ≤ c ((κ−1)/(κ+1))^k ≈ c (1 − 2/κ)^k

where
                  κ = λmax / λmin = condition number of A .

The resulting number of required iterations is given by

                  k ≥ D ln 10 / q1 = (D ln 10 / 2) κ .
Thus if the ratio of the largest and smallest eigenvalue of the matrix A is large, then the algorithm converges
slowly. Unfortunately this is most often the case, thus Figure 2.13 shows the typical situation and not the
exception.
14
   In [GoluVanLoan13, §11.3.2] find a complete (rather short) proof of

                  ‖~xk+1 − ~x*‖²_A ≤ (1 − 1/κ₂(A)) ‖~xk − ~x*‖²_A .

   Using √(1 − 1/κ) ≈ 1 − 1/(2κ) > 1 − 2/κ observe that (2.5) is a slightly better estimate. I have a note of the proof
   in [GoluVanLoan13, p. 627ff], adapted to the notation of these lecture notes.


Figure 2.13: The gradient algorithm for a large condition number

Performance on the model problem


For the problem in Section 2.7.1 we find κ ≈ (4/π²) n² and thus

                  q = 1 − q1 = 1 − 2/κ ≈ 1 − π²/(2 n²) .

Then equation (2.4) implies that we need

                  k ≥ D ln 10 / q1 = (2 D ln 10 / π²) n²

iterations to increase the accuracy by D digits. Based on the estimated operation counts

                  Operation with Ann        flops
                  Ann · ~x                  5 n²
                  ~x = ~x + α ~r              n²
                  ⟨d~ , ~r⟩                   n²

for the operations necessary for each step in the steepest descent iteration we arrive at the total number of
flops as
                  9 n² k ≈ (18 D ln 10 / π²) n⁴ ≈ 4.2 D n⁴ .

This is slightly worse than a banded Cholesky algorithm (Flop_Chol ≈ ½ n⁴). The gradient algorithm does
use less memory, but requires more flops.

2.7.4 Conjugate Gradient Iteration


The conjugate gradient algorithm15 will improve on the above mentioned problem of the gradient method.
Instead of searching for the minimum of the function f(~x) = ½ ⟨~x , A ~x⟩ + ⟨~b , ~x⟩ in the direction of steepest
descent, combine this direction with the previous search direction and aim to reach the minimal value of the
function f(~x) in this plane with one step only.
The algorithm was named as one of the top ten algorithms of the 20th century, see [TopTen]. Find a
detailed, readable introduction to the method of conjugate gradients in [Shew94].

Conjugate directions
On the left in Figure 2.14 find elliptical level curves of the function g(~x) = h~x , A ~xi. A first vector ~a is
tangential to a given level curve at a point. A second vector ~b is connecting this point to the origin. The two
vectors represent two subsequent search directions. When applying the transformation
                  ~u = (u, v)^T = A^(1/2) (x, y)^T = A^(1/2) ~x
15
The conjugate gradient algorithm was developed by Magnus Hestenes and Eduard Stiefel (ETHZ) in 1952.


obtain
                  g(~x) = ⟨~x , A ~x⟩ = ⟨A^(1/2) ~x , A^(1/2) ~x⟩ = ⟨~u , ~u⟩ = h(~u)

and the level curves of the function h in a (u, v) system will be circles, shown on the right in Figure 2.14.
The two vectors ~a and ~b shown in the left part will transform according to the same transformation rule.
The resulting images will be orthogonal and thus

                  0 = ⟨A^(1/2) ~a , A^(1/2) ~b⟩ = ⟨A ~a , ~b⟩ .

The vectors ~a and ~b are said to be conjugate16.

Figure 2.14: Ellipse and circle to illustrate conjugate directions

The basic conjugate gradient algorithm


The direction vectors d~k−1 and d~k of two subsequent steps of the conjugate gradient algorithm should behave
like the two vectors in the left part of Figure 2.14. The new direction vector d~k is assumed to be a linear
combination of the gradient ∇f (~xk ) = A ~xk + ~b = ~rk and the old search direction d~k−1 , i.e.

d~k = −~rk + β d~k−1 where ~rk = A ~xk + ~b .

Since the two directions d~k and d~k−1 have to be conjugate conclude

                  0 = ⟨d~k , A d~k−1⟩ = ⟨−~rk + β d~k−1 , A d~k−1⟩

                  β = ⟨~rk , A d~k−1⟩ / ⟨d~k−1 , A d~k−1⟩ .

Then the optimal value of αk to minimize h(α) = f (~xk + αk d~k ) can be determined with a calculation
identical to the standard gradient method, i.e.

                  αk = − ⟨~rk , d~k⟩ / ⟨A d~k , d~k⟩
16
   Using the diagonalization of the matrix A (see page 134) we even have a formula for A^(1/2). Since A = Q · diag(λi) · Qᵀ we
   use A^(1/2) = Q · diag(√λi) · Qᵀ and conclude

                  A^(1/2) · A^(1/2) = Q · diag(√λi) · Qᵀ · Q · diag(√λi) · Qᵀ = Q · diag(λi) · Qᵀ = A .

   Fortunately we do not need the explicit formula for A^(1/2), since this would require all eigenvectors, which is computationally
   expensive. For the algorithm it is sufficient to know how to multiply a vector by the matrix A.


and obtain a better approximation of the solution of the linear system by ~xk+1 = ~xk + αk d~k. This algorithm
is spelled out as the first variant in Table 2.8 and its result is illustrated in Figure 2.15. Just as in the standard
gradient algorithm find d²h(α)/dα² = ⟨A d~k , d~k⟩ > 0 and thus

• either the algorithm terminates, i.e. we found the optimal solution at this point

• or αk > 0 .

This allows for division by αk in the analysis of the algorithm.

Figure 2.15: One step of a conjugate gradient iteration

An example in R²

The function
                  f(x, y) = ½ ⟨ A (x, y)^T , (x, y)^T ⟩ + ⟨ (−1, −2)^T , (x, y)^T ⟩    with   A = [ +1  −0.5 ; −0.5  +3 ]
                          = ½ x² − ½ x y + (3/2) y² − x − 2 y
is minimized at (x, y) ≈ (1.45455, 0.90909). With a starting vector at (x0 , y0 ) = (1, 1) one can apply
two steps of the gradient algorithm, or two steps of the conjugate gradient algorithm. The first step of the
conjugate gradient algorithm coincides with the first step of the gradient algorithm, since there is no previous
direction to determine the conjugate direction yet. The result is shown in Figure 2.16. The two blue arrows
are the result of the gradient algorithm (steepest descent) and the green vector is the second step of the
conjugate gradient algorithm. In this example the conjugate gradient algorithm finds the exact solution with
two steps. This is not a coincidence, but generally correct and caused by orthogonality properties of the
conjugate gradient algorithm.
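
This small example can be reproduced with the built-in command pcg(), which implements the conjugate
gradient iteration. The call below is only an illustration (not taken from the original notes); in exact
arithmetic two iterations suffice, and the numerical result matches the minimizer (16/11, 10/11).

% the 2x2 example: minimize f, i.e. solve A*z + b = 0, or A*z = -b
A = [ 1 -0.5; -0.5  3];
b = [-1; -2];
z = pcg(A, -b, 1e-12, 2, [], [], [1; 1]);   % start at (1,1), at most 2 steps
disp(z')                                     % expected: 1.45455  0.90909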

Orthogonality properties
We define the Krylov subspaces generated by the matrix A and the vector d~0

                  K(k, d~0) = span{ d~0 , A d~0 , A² d~0 , . . . , A^(k−1) d~0 , A^k d~0 } .

Since ~rk+1 = ~rk + αk A d~k and d~k = −~rk + βk d~k−1 we conclude

~ri ∈ K(k, d~0 ) , d~i ∈ K(k, d~0 ) and ~xi ∈ ~x0 + K(k, d~0 ) for 0 ≤ i ≤ k .

The above is correct for any choice of the parameters βk . Now we examine the algorithm in Table 2.8 with
the optimal choice for αk , but the values of βk in d~k = −~rk + βk d~k−1 are to be determined by a new
criterion. The theorem below shows that we minimized the function f (~x) on the k + 1 dimensional affine
subspace K(k, d~0 ), and not only on the two dimensional plane spanned by the last two search directions.


   First variant (basic conjugate gradient algorithm):

        choose initial point ~x0
        ~r0 = A ~x0 + ~b
        d~0 = −~r0
        α0 = − ⟨~r0 , d~0⟩ / ⟨A d~0 , d~0⟩
        ~x1 = ~x0 + α0 d~0
        ~r1 = A ~x1 + ~b
        k = 1
        while ‖~rk‖ too large
            βk = ⟨~rk , A d~k−1⟩ / ⟨d~k−1 , A d~k−1⟩
            d~k = −~rk + βk d~k−1
            αk = − ⟨~rk , d~k⟩ / ⟨A d~k , d~k⟩
            ~xk+1 = ~xk + αk d~k
            k = k + 1
            ~rk = A ~xk + ~b
        endwhile

   Second variant (efficient implementation):

        choose initial point ~x
        ~r = A ~x + ~b
        ρ0 = ‖~r‖² ;  d~ = −~r ;  p~ = A d~
        α = ρ0 / ⟨p~ , d~⟩
        ~x = ~x + α d~ ;  ~r = ~r + α p~
        k = 1 ;  ρk = ⟨~r , ~r⟩
        while ρk too large
            β = ρk / ρk−1
            d~ = −~r + β d~
            p~ = A d~
            α = ρk / ⟨p~ , d~⟩
            ~x = ~x + α d~
            ~r = ~r + α p~
            k = k + 1
            ρk = ⟨~r , ~r⟩
        endwhile

Table 2.8: The conjugate gradient algorithm to solve A ~x + ~b = ~0 (first variant) and an efficient
implementation (second variant)

Figure 2.16: Two steps of the gradient algorithm (blue) and the conjugate gradient algorithm (green)


2–39 Theorem : Consider fixed values of k ∈ N, ~x0 and ~r0 = A ~x0 + ~b. Choose the vector ~x ∈ ~x0 + K(k, d~0)
such that the energy function f(~x) = ½ ⟨~x , A ~x⟩ + ⟨~x , ~b⟩ is minimized on the affine subspace ~x0 + K(k, d~0).
The subspace K(k, d~0) has dimension k + 1. The following orthogonality properties are correct

                  ⟨~rj , ~ri⟩ = 0                           for all 0 ≤ i ≠ j < k
                  ⟨d~j , A d~i⟩ = 0                         for all 0 ≤ i ≠ j ≤ k
                  ⟨~rk , ~y⟩ = ⟨~xk − ~x , A ~y⟩ = 0         for all ~y ∈ K(k, d~0) .

The values
                  βk = ⟨~rk , A d~k−1⟩ / ⟨d~k−1 , A d~k−1⟩

will generate the optimal solution, i.e. f(~x) is minimized, with the first variant in Table 2.8.    ♦

Proof : Since the vector ~x ∈ ~x0 + K(k, d~0) minimizes the function f(~x) on the affine subspace ~x0 +
K(k, d~0), its gradient has to be orthogonal to the subspace K(k, d~0), i.e. with ~r = A ~x + ~b = ∇f(~x)

                  ⟨A ~x + ~b , ~h⟩ = ⟨~r , ~h⟩ = 0    for all ~h ∈ K(k, d~0) .

Since ~r = ~rk+1 = A ~x + ~b this leads to

                  ⟨~rk+1 , ~ri⟩ = ⟨~r , ~ri⟩ = 0    for all 0 ≤ i ≤ k

and K(k, d~0) is a strict subspace of K(k + 1, d~0). This implies dim(K(k, d~0)) = k + 1.
   Using ~rk+1 = ~rk + αk A d~k and d~i = −~ri + βi d~i−1 we conclude by recursion

        ⟨d~i , A d~k⟩ = ⟨ −~ri + βi d~i−1 , (~rk+1 − ~rk)/αk ⟩
                     = (βi/αk) ⟨d~i−1 , ~rk+1 − ~rk⟩ = (1/αk) ( ∏_{j=1}^{i} βj ) ⟨d~0 , ~rk+1 − ~rk⟩
                     = (−1/αk) ( ∏_{j=1}^{i} βj ) ⟨~r0 , ~rk+1 − ~rk⟩ = 0 .

The above is correct for all possible choices of βj and also implies

        0 = ⟨d~k , A d~k−1⟩ = ⟨−~rk + βk d~k−1 , A d~k−1⟩ = −⟨~rk , A d~k−1⟩ + βk ⟨d~k−1 , A d~k−1⟩ .

Thus the optimal values for βk are as shown in the theorem.                                          □


2–40 Corollary :

• Since dim(K(k, d~0 )) = k + 1 the conjugate gradient algorithm with exact arithmetic will terminate
after at most N steps. Due to rounding errors this will not be of practical relevance for large matri-
ces. In addition the number of steps might be prohibitively large. Thus use the conjugate gradient
algorithm as an iterative method.

   • Use the orthogonalities in the above theorem to conclude

                  ⟨~rk , d~k⟩ = ⟨~rk , −~rk + βk d~k−1⟩ = −‖~rk‖²
                  ⟨~rk , A d~k−1⟩ = βk ⟨d~k−1 , A d~k−1⟩
                  ~rk+1 = ~rk + αk A d~k
                  ⟨~rk+1 , A d~k⟩ = (1/αk) ⟨~rk+1 , ~rk+1 − ~rk⟩ = (1/αk) ‖~rk+1‖²
                  ⟨d~k , A d~k⟩ = (1/αk) ⟨d~k , ~rk+1 − ~rk⟩ = (1/αk) ‖~rk‖²
                  βk = ⟨~rk , A d~k−1⟩ / ⟨d~k−1 , A d~k−1⟩ = ( (1/αk−1) ‖~rk‖² ) / ( (1/αk−1) ‖~rk−1‖² ) = ‖~rk‖² / ‖~rk−1‖² .

The above properties allow a more efficient implementation of the conjugate gradient algorithm. The
second variant in Table 2.8 is taken from [GoluVanLoan96]. This improved implementation of the
algorithm requires for each iteration

• one matrix–vector product and two scalar products,

• three vector additions of the type ~x = ~x + α ~r,

• storage for the sparse matrix and 4 vectors.

If each row of the matrix A has on average nz nonzero entries then we determine that each iteration requires
approximately (5 + nz) N flops (multiplication/addition pairs).
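
The second variant of Table 2.8 can be written in a few lines of Octave/MATLAB. The sketch below is an
illustration only and not part of the original notes (function name and convergence test are choices made
here); for production work the built-in pcg() should be preferred.

function x = cgsolve(A, b, x, tol, maxit)
  % sketch of the conjugate gradient iteration of Table 2.8 (second variant)
  % solves A*x + b = 0 for a symmetric, positive definite matrix A
  r = A*x + b;  d = -r;  rho = r'*r;
  for k = 1:maxit
    if sqrt(rho) < tol, break; end
    p = A*d;
    alpha = rho/(p'*d);          % optimal step length along d
    x = x + alpha*d;
    r = r + alpha*p;             % update residual without an extra A*x
    rho_new = r'*r;
    d = -r + (rho_new/rho)*d;    % new conjugate search direction
    rho = rho_new;
  end
end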

Convergence estimate
Assume that the exact solution is given by ~z, i.e. A ~z + ~b = ~0. Use the notation ~r = A ~y + ~b, resp.
~y = A⁻¹(~r − ~b), to conclude that ~y − ~z = A⁻¹ ~r. Then consider the following function

                  g(~y) = ‖~y − ~z‖²_A = ⟨~y − ~z , A (~y − ~z)⟩ = ⟨~r , A⁻¹ ~r⟩

and verify that

        ½ ‖~x − ~z‖²_A = ½ ⟨~x − ~z , A (~x − ~z)⟩ = ½ ⟨~x + A⁻¹ ~b , A ~x + ~b⟩
                       = ½ ⟨~x , A ~x⟩ + ⟨~x , ~b⟩ + ½ ⟨A⁻¹ ~b , ~b⟩
                       = f(~x) + ½ ⟨A⁻¹ ~b , ~b⟩ .

Thus the conjugate gradient algorithm minimized the energy norm given by the function g(~y) = ‖~y − ~z‖_A
on the subspaces K(k, d~0). It should be no surprise that the error estimate can be expressed in this

norm. Find the result and proofs17 in [Hack94, Theorem 9.4.12], [Hack16, Theorem 10.14], [LascTheo87],
[KnabAnge00, p. 218] or [AxelBark84]. The relevant convergence estimate is
                  ‖~xk − ~x‖_A ≤ 2 ((√κ − 1)/(√κ + 1))^k ‖~x0 − ~x‖_A ≈ 2 (1 − 2/√κ)^k ‖~x0 − ~x‖_A .

This leads to
                  ‖~xk − ~x‖ ≤ c ((√κ − 1)/(√κ + 1))^k ≈ c (1 − 2/√κ)^k .                  (2.6)

Compare this with the corresponding estimate (2.5) (page 76), where κ appears instead of √κ. The resulting
number of required iterations is thus given by

                  k ≥ D ln 10 / q1 = (D ln 10 / 2) √κ .                                    (2.7)

This is considerably better than the estimate for the steepest descent method, since κ is replaced by √κ ≪ κ.

Performance on the model problems


For the problem in Section 2.7.1 we find √κ ≈ (2/π) n and thus

                  q = 1 − q1 = 1 − 2/√κ ≈ 1 − π/n .

Then equation (2.4) implies that we need

                  k ≥ D ln 10 / q1 = (D ln 10 / π) n

iterations to increase the precision by D digits. Based on the estimate for the operations necessary to
multiply the matrix with a vector we estimate the total number of flops as

                  (5 + 5) n² k ≈ (10 D ln 10 / π) n³ ≈ 7.3 D n³ .

This is considerably better than a banded Cholesky algorithm, since the number of operations is proportional
to n³ instead of n⁴. For large values of n the conjugate gradient method is clearly preferable.

Table 2.9 shows the required storage and the number of necessary flops to solve the 2–D and 3–D model
problems with n free grid points in each direction. The results are illustrated18 in Figure 2.17. Observe that
one operation for the gradient algorithm requires more time than one operation of the Cholesky algorithm,
due to the multiplication of the sparse matrix with a vector.

We may draw the following conclusions from Table 2.9 and the corresponding Figure 2.17.

   • The iterative methods require less memory than direct solvers. For 3–D problems this difference is
     accentuated.

• For 2–D problems with small resolution the banded Cholesky algorithm is more efficient than the
conjugate gradient method. For larger 2–D problems conjugate gradient will perform better.
17
   The simpler proof in [GoluVanLoan13, Theorem 11.3.3] does not produce the best possible estimate. There find the estimate

                  ‖~xk+1 − ~x‖ ≤ (1 − 1/κ)^(1/2) ‖~xk − ~x‖ ≤ (1 − 1/(2κ)) ‖~xk − ~x‖ ,

   leading to q1 = 1/(2κ), while the better estimate leads to q1 = 2/√κ. The difference is essential.
18
   The accuracy is required to improve by 6 digits.


• For 3–D problems one should always use conjugate gradient, even for small problems.
• For small 3–D problems banded Cholesky will compute results with reasonable computation time.
• The method of steepest descent is never competitive.
Table 2.10 lists approximate computation times for a CPU capable of performing 10⁸ = 100 MFLOPS or
10¹¹ = 100 GFLOPS.

                                   2–D                                       3–D
                        storage     flops                         storage     flops
   Cholesky, banded      n³         ½ n⁴                           n⁵         ½ n⁷
   Steepest Descent      8 n²       (18 D ln 10 / π²) n⁴           10 n³      (22 D ln 10 / π²) n⁵
   Conjugate Gradient    9 n²       (10 D ln 10 / π) n³            11 n³      (12 D ln 10 / π) n⁴

Table 2.9: Comparison of algorithms for the model problems

Figure 2.17: Number of operations of banded Cholesky, steepest descent and conjugate gradient algorithm
on the model problem (left panel: 2D model problem, right panel: 3D model problem). The time estimates
are for a CPU capable to perform 100 MFLOPS.

CPU \ flops      10⁸        10⁹        10¹⁰       10¹¹       10¹²       10¹⁴        10¹⁶        10¹⁸
100 MFLOPS 1 sec 10 sec 1.7 min 17 min 2.8 h 11.6 days 3.2 years 320 years
100 GFLOPS 0.001 sec 0.01 sec 0.1 sec 1 sec 10 sec 16 min 28 h 116 days

Table 2.10: Time required to complete a given number of flops on a 100 MFLOPS or 100 GFLOPS CPU

2.7.5 Preconditioning
Based on equation (2.6) the convergence of the conjugate gradient method is heavily influenced by the
condition number of the matrix A. If the problem is modified such that the condition number decreases
there will be faster convergence. The idea is to replace the system A~x = −~b by an equivalent system with
a smaller condition number. There are different options on how to proceed:


   • Left preconditioning:
                  M⁻¹ A ~x = −M⁻¹ ~b .

   • Right preconditioning:
                  A M⁻¹ ~u = −~b    with   ~x = M⁻¹ ~u .

   • Split preconditioning: M is factored by M = M_L · M_R

                  M_L⁻¹ A M_R⁻¹ ~u = −M_L⁻¹ ~b    with   ~x = M_R⁻¹ ~u .

The ideal condition number for M⁻¹ A (resp. A M⁻¹ or M_L⁻¹ A M_R⁻¹) would be 1, but this would require
M = A and thus the system of linear equations is solved. The aim is to get the new matrix M⁻¹ A
(resp. A M⁻¹ or M_L⁻¹ A M_R⁻¹) as close as possible to the identity matrix, but demanding little
computational effort. In addition the new matrix might not be symmetric and we have to modify the above
idea slightly. There are a number of different methods to implement this idea and write efficient code.
Consult the literature before writing your own code, e.g. [Saad00]. This reference is available on the
internet. With Octave/MATLAB many algorithms and preconditioners are available, see Table 2.18 on page 99.
As a good starting reference for code use [templates] and the codes at www.netlib.org/templates/matlab .

As a typical example we examine a Cholesky factorization of a symmetric matrix M, i.e.

M = RT R .

The matrices R and M have to be chosen such that it takes little effort to solve systems with those matrices
and they should need as little memory as possible. Two possible constructions of these matrices will be shown
below, both incomplete Cholesky preconditioners. Then split the preconditioner between the left and right
side. Use ~x = R⁻¹ ~u to conclude

                  Ã ~u = R⁻ᵀ A R⁻¹ ~u = −R⁻ᵀ ~b .

Now verify that the new matrix is symmetric since

                  ⟨R⁻ᵀ A R⁻¹ ~x , ~y⟩ = ⟨A R⁻¹ ~x , R⁻¹ ~y⟩ = ⟨~x , R⁻ᵀ A R⁻¹ ~y⟩

and apply the conjugate gradient algorithm (see Table 2.8 on page 80) with the new matrix

                  Ã = R⁻ᵀ A R⁻¹ .

If the matrix M = Rᵀ R is relatively close to the matrix A = R_eᵀ R_e we conclude that R ≈ R_e and thus

                  Ã = (R⁻ᵀ R_eᵀ) (R_e R⁻¹) ≈ I · I = I .

As a consequence find a small condition number of the modified matrix Ã. This leads to the basic algorithm
shown as the first variant in Table 2.11.
   In the second variant of Table 2.11 introduce the new vector

                  ~zk = R⁻¹ ~rk = R⁻¹ Ã ~xk + R⁻¹ R⁻ᵀ ~b = R⁻¹ R⁻ᵀ A R⁻¹ ~xk + R⁻¹ R⁻ᵀ ~b
                      = M⁻¹ A R⁻¹ ~xk + M⁻¹ ~b = M⁻¹ (A R⁻¹ ~xk + ~b) .

Then realize that the vectors d~k and ~xk appear in the form R⁻¹ d~k and R⁻¹ ~xk. This allows for a translation
of the algorithm with slight changes, as shown in the third variant of Table 2.11. This can serve as starting
point for an efficient implementation.
Observe that the update of ~zk involves the matrix M−1 and thus we have to solve the system

M ~zk = AR−1~rk

for ~zk . Thus it is important that the structure of the matrix M allows for fast solutions.
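
To make the role of the preconditioner concrete, the following Octave/MATLAB sketch (not part of the original
notes; names and the stopping test are chosen here) follows the structure of the third variant in Table 2.11.
The application of M⁻¹ = (Rᵀ R)⁻¹ is realized by two triangular solves, which is cheap.

function x = pcg_sketch(A, b, R, x, tol, maxit)
  % sketch of a preconditioned conjugate gradient iteration for A*x + b = 0
  % M = R'*R with R right triangular; M^(-1)*r is applied by triangular solves
  r = A*x + b;
  z = R \ (R' \ r);              % z = M^(-1) * r
  d = -z;
  for k = 1:maxit
    if norm(r) < tol, break; end
    p = A*d;
    alpha = -(r'*d)/(p'*d);
    x = x + alpha*d;
    r = r + alpha*p;
    z = R \ (R' \ r);            % solve M*z = r, fast for triangular R
    beta = (z'*p)/(d'*p);
    d = -z + beta*d;
  end
end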

   First variant:

        choose initial point ~x0
        ~r0 = Ã ~x0 + R⁻ᵀ ~b
        d~0 = −~r0
        α0 = − ⟨~r0 , d~0⟩ / ⟨Ã d~0 , d~0⟩
        ~x1 = ~x0 + α0 d~0
        k = 1
        while ‖~rk‖ too large
            βk = ⟨~rk , Ã d~k−1⟩ / ⟨d~k−1 , Ã d~k−1⟩
            d~k = −~rk + βk d~k−1
            αk = − ⟨~rk , d~k⟩ / ⟨Ã d~k , d~k⟩
            ~xk+1 = ~xk + αk d~k
            k = k + 1
            ~rk = Ã ~xk + R⁻ᵀ ~b
        endwhile

   Second variant:

        choose initial point ~x0
        ~r0 = R⁻ᵀ A R⁻¹ ~x0 + R⁻ᵀ ~b
        R⁻¹ d~0 = −R⁻¹ ~r0
        α0 = − ⟨~r0 , d~0⟩ / ⟨R⁻ᵀ A R⁻¹ d~0 , d~0⟩
        R⁻¹ ~x1 = R⁻¹ ~x0 + α0 R⁻¹ d~0
        k = 1
        while ‖~rk‖ too large
            βk = ⟨~rk , R⁻ᵀ A R⁻¹ d~k−1⟩ / ⟨d~k−1 , R⁻ᵀ A R⁻¹ d~k−1⟩
            R⁻¹ d~k = −R⁻¹ ~rk + βk R⁻¹ d~k−1
            αk = − ⟨~rk , d~k⟩ / ⟨R⁻ᵀ A R⁻¹ d~k , d~k⟩
            R⁻¹ ~xk+1 = R⁻¹ ~xk + αk R⁻¹ d~k
            k = k + 1
            R⁻¹ ~rk = R⁻¹ (R⁻ᵀ A R⁻¹ ~xk + R⁻ᵀ ~b)
        endwhile

   Third variant:

        choose initial point ~x0 resp. ~x0 = R⁻¹ ~x0
        ~r0 = A ~x0 + ~b ,  ~z0 = M⁻¹ ~r0
        d~0 = −~z0
        α0 = − ⟨~r0 , d~0⟩ / ⟨A d~0 , d~0⟩
        ~x1 = ~x0 + α0 d~0
        k = 1
        while ‖~rk‖ too large
            βk = ⟨~zk , A d~k−1⟩ / ⟨d~k−1 , A d~k−1⟩
            d~k = −~zk + βk d~k−1
            αk = − ⟨~rk , d~k⟩ / ⟨A d~k , d~k⟩
            ~xk+1 = ~xk + αk d~k
            k = k + 1
            ~rk = A ~xk + ~b ,  ~zk = M⁻¹ ~rk
        endwhile
        ~x = R ~xk is the solution

Table 2.11: Preconditioned conjugate gradient algorithms to solve A ~x + ~b = ~0. The first variant is the
original algorithm, the second uses ~zk = R⁻¹ ~rk and the third uses R⁻¹ d~k and R⁻¹ ~xk. The third variant
might serve as starting point for an efficient implementation.


2.7.6 The Incomplete Cholesky Preconditioner


An incomplete Cholesky factorization is based on the standard Cholesky factorization, but does not use all
of the entries of a complete factorization. There are different ideas used to drop elements:

• Keep the sparsity pattern of the original matrix A, i.e. drop the entry at a position in the matrix if the
entry in the original matrix is zero. This algorithm is called IC(0) and it is presented below.

• Drop the entry if its value is below a certain threshold, the drop-tolerance. This is often called the
ICT() algorithm. The results of ICT() and a similar LR factorization are examined in the following
section.

This construction of a preconditioner matrix R is based on the Cholesky factorization of the symmetric,
positive definite matrix A, i.e. A = RT R. But we require that the matrix R has the same sparsity pattern
as the matrix A. Those two wishes can not be satisfied simultaneously. Give up on the exact factorization
and require only
RT R = A + E
for some perturbation matrix E. This leads to the conditions

   1. r_{i,j} = 0 if a_{i,j} = 0 ,

   2. (Rᵀ R)_{i,j} = a_{i,j} if a_{i,j} ≠ 0 .

To develop the algorithm use the same idea as in Section 2.6.1. The approximate factorization

A + E = RT · R

can be written using block matrices.


    
      [ a1,1  a1,2  a1,3  · · ·  a1,n ]         [ r1,1   0    0   · · ·  0 ]   [ r1,1  r1,2  r1,3  · · ·  r1,n ]
      [ a1,2                         ]         [ r1,2                     ]   [  0                            ]
      [ a1,3          An−1           ]  + E  = [ r1,3        Rᵀn−1        ] · [  0            Rn−1            ]
      [   ⋮                          ]         [   ⋮                      ]   [   ⋮                           ]
      [ a1,n                         ]         [ r1,n                     ]   [  0                            ]

Now examine this matrix multiplication on the four submatrices. Keep track of the sparsity pattern. This
translates to 4 subsystems.

   • Examine the top left block (one single number) in A. Obviously a1,1 = r1,1 · r1,1 and thus

                  r1,1 = √(a1,1) .

   • Examine the bottom left block (column) in A

                  (a1,2 , a1,3 , . . . , a1,n)ᵀ = (r1,2 , r1,3 , . . . , r1,n)ᵀ · r1,1

     and thus for 2 ≤ i ≤ n and a1,i ≠ 0 find

                  r1,i = a1,i / r1,1 .


• The top right block (row) in A is then already taken care of, thanks to the symmetry of A .

   • Examine the bottom right block in A. We need

                  An−1 = (r1,2 , r1,3 , . . . , r1,n)ᵀ · (r1,2 , r1,3 , . . . , r1,n) + Rᵀn−1 · Rn−1 .

     For 2 ≤ i, j ≤ n and ai,j ≠ 0 update the entries in An−1 by applying

                  ai,j  −→  ai,j − r1,i r1,j = ai,j − a1,i a1,j / a1,1 .

     If a1,i = 0 or a1,j = 0 there is no need to perform this step.

If a1,i = 0 or a1,j = 0 there is no need to perform this step.

• Then restart the process with the reduced problem of size (n − 1) × (n − 1) in the lower right block.

The above can be translated to Octave code without major problems. Be aware that this implementation is
very far from being efficient and do not use it on large problems. This author has a faster version, but for
some real speed coding in C++ is necessary. In real applications the matrices A or R are rarely computed.
Most often a function to evaluate the matrix products has to be provided. This allows an optimal usage of
the sparsity pattern.

function R = cholInc(A)
% R = cholInc(A) returns the incomplete Cholesky factorization
% of the positive definite matrix A
[n,m] = size(A);
if (n ~= m) error('cholInc: matrix has to be square'); end%if
R = zeros(size(A));
for k = 1:n
  if A(k,k) <= 0 error('cholInc: failed, might be singular'); end%if
  R(k,k) = sqrt(A(k,k));
  for i = k+1:n
    if A(k,i) ~= 0 R(k,i) = A(k,i)/R(k,k); end%if
  end%for
  for j = k+1:n
    for i = j:n
      if A(j,i) ~= 0 A(j,i) = A(j,i) - A(k,j)*A(k,i)/A(k,k); end%if
    end%for
  end%for
end%for
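
A small test of the function, not taken from the original notes, can be run on the 1–D model matrix (the
tridiagonal matrix An). For a tridiagonal matrix there is no fill-in, so IC(0) coincides with the exact
factorization and the preconditioned matrix has condition number close to 1; the construction of the test
matrix below is only an illustration.

n = 20;                                    % small 1-D model matrix
A = full(spdiags(ones(n,1)*[-1 2 -1], -1:1, n, n));
R = cholInc(A);
disp(nnz(R) <= nnz(triu(A)))               % sparsity pattern is preserved
Atilde = inv(R')*A*inv(R);                 % only acceptable for a tiny example
fprintf('cond(A) = %g,  cond(Atilde) = %g\n', cond(A), cond(Atilde))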

The efficiency of the above preconditioner is determined by the change in condition number for the
matrices A and à = R−T AR−1 . As an example consider the model matrix Ann from Section 2.3.2
for different values of n, leading to Table 2.12. Based on equation (2.6) these results might make a big
difference for the computation time. In [AxelBark84] a modified incomplete Cholesky factorization is
presented and you may find the statement that the condition number for the model matrices An and Ann
decreases to order n (instead of n²). This is considerably better than the result indicated by Table 2.12. It
will also change the results in Table 2.9 and Figure 2.17 in favor of the preconditioned conjugate gradient
method. In the following section find numerical results on the performance of the Cholesky preconditioner
with a drop-tolerance.


     n        size of Ann              κ(A) ≈       κ(Ã) ≈

    10          100 × 100                48.2          5.1
    20          400 × 400                 178         16.6
    40        1'600 × 1'600               681           61
    80        6'400 × 6'400             2'658          236
   160       25'600 × 25'600           10'500          949
   320      102'400 × 102'400          41'500        3'691
   520      270'400 × 270'400         109'600        9'723

Table 2.12: The condition numbers κ using the incomplete Cholesky preconditioner IC(0)

2–41 Observation : The preconditioner will reduce the condition number for the conjugate gradient itera-
tion, but it has no effect on the condition number of the original matrix A. Instead of solving A ~x = ~b
the modified equation Ã ~u = R⁻ᵀ A R⁻¹ ~u = R⁻ᵀ ~b is examined, where ~u = R ~x. Thus the sequence of
solvers is

                           Rᵀ ~bnew = ~b                    solve for ~bnew, top to bottom
   A ~x = ~b    ⇐⇒        (R⁻ᵀ A R⁻¹) ~u = ~bnew            solve for ~u by conjugate gradient
                           R ~x = ~u                        solve for ~x, bottom up

The first and last step of the algorithm will increase the condition number again, thus the conditioning of
solving A ~x = ~b is not improved.
   As an example consider the exact Cholesky factorization R of A, then Ã = R⁻ᵀ A R⁻¹ = I and thus
κ(Ã) = 1. A test with Octave for the matrix Ann with n = 200 shows κ(Ann) ≈ 2.4 · 10⁴ and κ(R) ≈ 211,
resp. κ(R)² ≈ 4.4 · 10⁴. Thus the overall condition number of the problem does not change. For an
incomplete Cholesky preconditioner with a small drop tolerance the result will be similar. ♦

2.7.7 Conjugate Gradient Algorithm with an Incomplete Cholesky Preconditioner


The model problem with a 520 × 520 grid leads to a matrix with 270'400 rows and columns and with
MATLAB or Octave use the command ichol() to determine incomplete Cholesky factorizations A ≈ L Lᵀ.
MATLAB/Octave can estimate condition numbers, and then equation (2.6) can be used to estimate the number
of iterations required to improve the accuracy by D = 5 decimal digits. Without preconditioner
approximately 1900 iterations might be required, based on the condition number κ ≈ 110'000. The compu-
tations were performed on an Intel Haswell I7-5930 with a base frequency of 3.5 GHz.

• For strictly positive drop tolerances droptol > 0 the ICT() algorithm is applied. Only entries sat-
isfying the condition abs (L(i,j)) >= droptol * norm (A(j:end, j),1) are stored
in the left triangular matrix L. The command ichol() generates this result by

opts.type = 'ict';
opts.droptol = droptol;
L = ichol(A,opts);

For droptol = 0 the incomplete Cholesky factorization IC(0) is performed by calling


opts.type = 'nofill';
L = ichol(A,opts);

   • For a system A ~x = ~b the iteration terminates if ‖A ~x − ~b‖ ≤ tol · ‖~b‖. At first a relative error of 10⁻⁵
     was asked for, then an improvement by another factor 10⁻².
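
A complete call combining ichol() and pcg() might look as follows. This sketch is not part of the original
notes; it assumes a sparse, symmetric, positive definite matrix A and right hand sides b and b2 are available,
and the parameter values are illustrative only.

% solve A*x = b with pcg() and an incomplete Cholesky preconditioner
opts.type = 'ict';  opts.droptol = 1e-4;
L = ichol(A, opts);                        % A is approximately L*L'
tol = 1e-5;  maxit = 500;
[x, flag, relres, iter] = pcg(A, b, tol, maxit, L, L');
% a restart with a good initial guess, e.g. from a previous time step:
[x2, flag2] = pcg(A, b2, 1e-2*tol, maxit, L, L', x);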
Find the results19 in Tables 2.13 (MATLAB) and 2.14 (Octave). The following facts can be observed:
• The memory required for L increases for smaller drop-tolerances. The matrix for IC(0) requires
considerably less memory.

• The norm of the error matrix E = L · LT − A decreases for smaller drop-tolerances, i.e. the precon-
ditioned matrix is a better approximation of the original matrix.

• The condition number κ moves closer to the ideal value of 1 for smaller drop-tolerances. MATLAB
and Octave obtain identical condition numbers. This is no surprise, as the same algorithm is used.

• Using the estimated condition number we can estimate the number of iterations required to achieve the
desired relative tolerance. The actually observed number of iterations is rather close to the estimate.
The number of iterations for MATLAB and Octave are identical, and all considerably smaller than the
≈ 1900 iterations without preconditioner.

• MATLAB is faster than Octave for computing the factorization, since MATLAB uses a multithreaded
version of ichol(). Octave is faster than MATLAB to perform the conjugate gradient iterations. The
required computation time for one iteration increases for smaller drop-tolerances. This is caused by
the larger factorization matrix L.

• It takes very little time to determine the IC(0) factorization, but the condition number of the precon-
ditioned system is large and thus it takes many iterations to achieve the desired accuracy.

drop tolerance 0.001 0.0003 0.0001 3e-5 1e-5 3e-6 1e-6 IC(0)
time ichol [sec] 0.12 0.21 0.40 0.82 1.52 3.06 6.29 0.06
size ichol [MB] 57.9 87.4 136.9 205.6 313.4 471.5 696.2 15.1
norm ‖E‖/‖A‖ 2.8e-3 1.2e-3 4.6e-4 1.9e-4 7.3e-5 2.8e-5 1.0e-5 7.4e-2
cond for PCG 343.74 148.19 56.16 24.52 10.16 4.52 2.27 9700
estim. # of runs 107 70 44 29 19 13 9 570
first run, # of runs 70 46 29 19 12 9 6 349
first run, time [sec] 2.69 2.14 1.57 1.34 1.13 1.12 1.05 10.32
restart, # of runs 27 18 10 7 5 2 2 146
restart, time [sec] 1.08 0.85 0.55 0.50 0.47 0.27 0.36 4.28

Table 2.13: Performance for MATLAB’s pcg() with an incomplete Cholesky preconditioner

The above results show that large systems of linear equations can be solved quickly. For the above
problem on a 1000 × 1000 grid we have to solve a system of 1 million unknowns, leading to the results in
Table 2.15. If the same system has to be solved many times for different right hand sides this table also
shows that huge systems can be solved within a few seconds. This applies to dynamic problems, where a
good initial guess is provided by the previous time step.
19
Most of the computational effort for these tables goes towards estimating the condition numbers! This is not shown in the
tables. With new versions of Octave the command pcg() will estimate the extreme eigenvalues rather quickly by using the output
variable EIGEST.


drop tolerance 0.001 0.0003 0.0001 3e-5 1e-5 3e-6 1e-6 IC(0)
time ichol [sec] 0.20 0.47 0.97 1.87 3.71 7.30 14.41 0.08
size ichol [MB] 57.9 87.4 136.9 205.6 313.4 471.5 696.2 15.1
norm ‖E‖/‖A‖ 2.7e-3 1.2e-3 4.5e-4 1.9e-4 7.3e-5 2.8e-5 1.0e-5 7.3e-2
cond for PCG 343.74 148.19 56.16 24.52 10.16 4.52 2.27 9700
first run, # of runs 70 46 29 19 12 9 6 349
first run, time [sec] 1.54 1.25 1.06 0.93 0.85 0.97 0.92 5.89
restart, # of runs 27 18 10 7 5 2 2 146
restart, time [sec] 0.60 0.50 0.36 0.35 0.36 0.22 0.31 2.52

Table 2.14: Performance for Octave’s pcg() with an incomplete Cholesky preconditioner

              drop-tolerance     size L       time ichol()    time PCG    # iterations

   Octave          1e-3           215 MB         1.4 sec       11.1 sec        126
   MATLAB          1e-3           215 MB         0.51 sec       9.9 sec        126
   Octave          1e-5         1'197 MB        16.5 sec        6.3 sec         23
   MATLAB          1e-5         1'197 MB         6.17 sec       6.4 sec         23

Table 2.15: Timing for solving a system with 10⁶ = 1'000'000 unknowns

2.7.8 Conjugate Gradient Algorithm with an Incomplete LU Preconditioner


For the above problem one can also use incomplete LU factorizations as preconditioners. The actual
algorithm used is ILUC (Crout's version). This can be used if the matrix A is not symmetric and positive
definite. The corresponding iterative algorithms will be examined in the next section, e.g. GMRES and
BICG. Octave/MATLAB can compute incomplete LU factorizations with the command ilu(). The results
in Tables 2.16 and 2.17 show the following information:

• The drop tolerance used.

• The time required to perform the incomplete LU factorization. Surprisingly the computation time
seems not to depend on the drop tolerance used. Octave is faster than MATLAB.

• The memory required to perform the incomplete LU factorization. Smaller values for the drop toler-
ance lead to larger matrices L and U. Octave uses less memory than MATLAB.

• An estimate of the norm of the matrix E = L · U − A. Smaller values for the drop tolerance lead to
a smaller norm of the error matrix. The results for MATLAB and Octave are identical.

• An estimate of the condition number of the matrix à = L−1 A R−1 . Smaller values for the drop
tolerance lead to a smaller condition number.

• An estimate of the required number of iterations to improve the result by 5 digits. The estimate is
based on (2.7), i.e.
                  k ≥ (D ln 10 / 2) √κ .
• The effective number of iterations required to improve the result by 5 digits. The results illustrate that
the theoretical bounds are rather close to the actual number of iterations. The results for MATLAB and
Octave are identical.


• The time required to run the iteration. At first the CPU time decreases, then it might increase again,
caused by the larger preconditioner matrices.

The computations shown were performed on an Intel Haswell I7-5930 system at 3.5 GHz.

drop tolerance 0.001 0.0003 0.0001 3e-5 1e-5 3e-6 1e-6 ILU(0)
time ILU [sec] 61.50 60.74 62.17 62.97 63.38 68.34 76.88 0.08
size ILU [MB] 124.3 191.5 298.5 458.3 699.1 1049.2 1527.8 30.3
norm ‖E‖/‖A‖ 2.5e-3 9.6e-4 3.8e-4 1.5e-4 5.9e-5 2.1e-5 7.7e-6 7.3e-2
cond for PCG 316 121.4 48.64 19.48 8.10 3.69 1.96
estim. # of runs 103 64 40 25 16 11 8
first run, iterations 67 42 26 17 11 8 6 349
first run, time [sec] 1.56 1.23 1.03 0.92 0.90 0.95 1.02 5.88
restart, iterations 26 15 10 6 4 2 1 146
restart, time [sec] 0.61 0.44 0.39 0.33 0.33 0.24 0.18 2.48

Table 2.16: Performance of Octave’s pcg() with an ilu() preconditioner

drop tolerance 0.001 0.0003 0.0001 3e-5 1e-5 3e-6 1e-6 ILU(0)
time ILU [sec] 140.41 136.47 139.33 137.10 136.16 139.99 147.88 0.08
size ILU [MB] 134.7 193.5 371.8 528.6 803.2 1131.5 1678.2 30.3
norm ‖E‖/‖A‖ 2.5e-3 9.7e-4 3.8e-4 1.5e-4 5.9e-5 2.2e-5 7.8e-6 7.4e-2
cond for PCG 316 121.4 48.64 19.48 8.10 3.69 1.96
first run, iterations 67 42 26 17 11 8 6 349
first run, time [sec] 2.90 2.08 1.58 1.32 1.14 1.10 1.12 10.79
restart, iterations 26 15 10 6 4 2 1 146
restart, time [sec] 1.12 0.74 0.61 0.46 0.43 0.28 0.20 4.36

Table 2.17: Performance of MATLAB’s pcg() with an ilu() preconditioner

2.8 Iterative Solvers for Non-Symmetric Systems


The conjugate gradient algorithm of the previous section is restricted to symmetric, positive definite matrices
A. It is absolutely necessary to adapt the method to nonsymmetric matrices.

2.8.1 Normal Equation, Conjugate Gradient Normal Residual (CGNR) and BiCGSTAB
The method of the normal equation is based on the fact that for an invertible matrix A

A ~x = ~b ⇐⇒ AT A ~x = AT ~b .

The expression can be generated by minimizing the norm of the residual ~r = A ~x − ~b.

                  f(~x) = ‖A ~x − ~b‖² = ⟨A ~x − ~b , A ~x − ~b⟩ = ⟨~x , Aᵀ A ~x⟩ − 2 ⟨~x , Aᵀ ~b⟩ + ⟨~b , ~b⟩

                  ∇f(~x) = 2 (Aᵀ A ~x − Aᵀ ~b) = ~0


Since
                  ⟨~x , Aᵀ A ~y⟩ = ⟨A ~x , A ~y⟩ = ⟨Aᵀ A ~x , ~y⟩    and    ⟨~x , Aᵀ A ~x⟩ = ⟨A ~x , A ~x⟩ = ‖A ~x‖²

the square matrix Aᵀ A is symmetric and positive definite. Thus the conjugate gradient algorithm can be
applied to the modified problem
                  Aᵀ A ~x = Aᵀ ~b .

The computational effort for each iteration step is approximately doubled, as we have to multiply the given
vector by the matrix A and then by its transpose Aᵀ. A more severe disadvantage stems from the fact

                  κ(Aᵀ A) = (κ(A))²

and thus the convergence is usually slow. Using the normal equation is almost never a good idea. The
above idea can be slightly modified, leading to the conjugate residual method or conjugate gradient normal
residual method (CGNR), see e.g. [Saad00, §6.8], [LascTheo87, §8.6.2] or [GoluVanLoan13, §11.3.9]. The
rate of convergence is related to κ(A), but not √κ(A) as for the conjugate gradient method.
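
The squaring of the condition number is easy to observe numerically. The short Octave/MATLAB sketch
below is not from the original notes; the random test matrix is an arbitrary choice and condest() only
estimates the 1-norm condition number, so the relation holds approximately.

% illustration: the normal equation squares the condition number
A = sprandn(200, 200, 0.05) + 10*speye(200);   % a nonsymmetric test matrix
kA  = condest(A);
kAA = condest(A'*A);
fprintf('cond(A) ~ %8.1f,  cond(A''*A) ~ %10.1f  (roughly cond(A)^2)\n', kA, kAA)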

Possibly good choices are the Generalized Minimal Residual algorithm (GMRES) or the BiConjugate
Gradients Stabilized method (BiCGSTAB). A good description is given in [Saad00, §6.5, 7.4.2]. For both
algorithms preconditioning can and should be used. All of the above algorithms are implemented in Octave.

2.8.2 Generalized Minimal Residual, GMRES and GMRES(m)


The above disadvantage can be avoided by a method developed by Saad and Schultz in 1986. The algorithm
is based on Krylov subspace methods. Below find a short outline of the basic ideas for this algorithm. A
detailed presentation of these methods is beyond the scope of these notes and can be found in the literature,
e.g. [Saad00]. The theoretical and practical convergence behavior is nontrivial and examined in current
research projects. In [Saad00] it is shown that the condition number κ(A) enters in the estimates.

Since the GMRES algorithm uses a sizable number of tools from linear algebra, we try to bring some
structure in the presentation.
• First present the basic idea of GMRES.

   • The restarted version GMRES(m) is worth implementing.

• The convergence results are given.

• To solve least square problems orthogonal matrices are used.

• The Arnoldi iteration is one of the building blocks for GMRES.

• Hessenberg matrices and Given’s rotations are used as an efficient tool.

The GMRES algorithm


The basic idea of the GMRES (Generalized Minimum RESiduals) algorithm is to solve a least square prob-
lem at each step of the iteration. We assume that the matrix A is of size N × N . Examine an n dimensional
Krylov space Kn
Kn = span{~b0 , A~b0 , A2~b0 , A3~b0 , . . . , An−1~b0 }
with an orthonormal base Qn = [~q1 , ~q2 , ~q3 , . . . , ~qn ], i.e. the vectors ~qj are given by the columns of the
N × n matrix Qn with QTn · Qn = In . For a given starting vector ~x0 we seek a solution in the affine
subspace ~x0 + Kn . The exact solution ~x0 + ~x∗ = A−1~b is approximated by a vector ~x0 + ~xn , such that

min kA(~x0 + ~x) − ~bk = k~rn k = kA~xn + A~x0 − ~bk = kA~xn − ~b0 k .
x∈Kn
~


Thus we minimize the residual on the Krylov subspace Kn, hereby justifying the name GMRES. This is
different from the conjugate gradient algorithm, where the energy norm ⟨A (~un − ~uexact) , (~un − ~uexact)⟩
is minimized.
   The matrix Qn is generated by the Arnoldi algorithm, see page 96. The vector ~b0 = ~b − A ~x0 is the first
residual. Using the matrix Qn the above can be rephrased (use Qn ~y = ~x) as

                  min_{~y ∈ R^n} ‖A Qn ~y − ~b0‖ = ‖~rn‖ = ‖A ~xn − ~b0‖ .

If we use the initial vector ~q1 = ~b0/‖~b0‖ for the Arnoldi iteration we know that ⟨~qk , ~b0⟩ = 0 for all k ≥ 2 and
thus ~b0 has to be a multiple of ~q1, and ~q1 is the first column of Qn+1.

                  ~b0 = ‖~b0‖ ~q1 = ‖~b0‖ Qn+1 ~e1

With the Arnoldi iteration we have A Qn = Qn+1 Hn ∈ M_{N,n} and thus20 we have a smaller least square
problem to be solved.

                  min_{~y ∈ R^n} ‖Qn+1 Hn ~y − ‖~b0‖ Qn+1 ~e1‖ = min_{~y ∈ R^n} ‖Hn ~y − ‖~b0‖ ~e1‖ .

• The large least square problem with the matrix A of size N × N is replaced with an equivalent least
square problem with the Hessenberg matrix Hn of size (n + 1) × n.

• This least square problem can be solved by the algorithm of your choice, e.g. a QR factorization.
     Using Givens transformations the QR algorithm is very efficient for Hessenberg matrices (≈ n²
     operations), see page 97.

   • The algorithm can be stopped if the desired accuracy is achieved, e.g. if ‖A ~xn − ~b0‖ / ‖~b‖ is small
     enough.

This leads to the basic GMRES algorithm in Figure 2.18.

   1   Choose ~x0 and compute ~b0 = ~b − A ~x0
   2   Q1 = ~q1 = (1/‖~b0‖) ~b0
   3   for n = 1, 2, 3, . . . do
   4       Arnoldi step n to determine ~qn+1 and Hn
   5       minimize ‖Hn ~y − ‖~b0‖ ~e1‖
   6       ~xn = Qn ~y
   7   end
   8   ~x* ≈ ~x0 + ~xn

Figure 2.18: The GMRES algorithm

The GMRES(m) algorithm


The basic GMRES algorithm usually performs well, but has a few key disadvantages if many iterations are
asked for.

• The size of the matrices Qn and Hn increases, and with them the computational effort.
20
   ‖Qn+1 ~z‖² = ⟨Qn+1 ~z , Qn+1 ~z⟩ = ⟨~z , Qn+1ᵀ Qn+1 ~z⟩ = ⟨~z , In+1 ~z⟩ = ‖~z‖²


• The orthogonality of the columns of Qn will not be perfectly maintained, caused by arithmetic errors.
To control both of these problems the GMRES algorithm is usually restarted every m steps, with typical
values of m ≈ 10 ∼ 20. This leads to the GMRES(m) algorithm in Figure 2.19, with an outer and inner
loop. Picking the optimal value for m is an art.

The above is (at best) a rough description of the basic ideas. For a good implementation the above steps
of Arnoldi, Given’s rotation and minimization have to be combined, and a proper termination criterion has
to be used. Find a good description and pseudo code in [templates]. Find a good starting point for code at
www.netlib.org/templates/matlab/ .

    1   Choose restart parameter m
    2   Choose ~x0
    3   for k = 0, 1, 2, 3, . . . do
    4       ~b0 = ~b − A ~x0
    5       Q1 = ~q1 = (1/‖~b0‖) ~b0
    6       for n = 1, 2, 3, . . . , m do
    7           Arnoldi step n to determine ~qn+1 and Hn
    8           minimize ‖Hn ~y − ‖~b0‖ ~e1‖
    9           ~xn = Qn ~y
   10       end
   11       ~x0 = ~x0 + ~xm
   12   end

Figure 2.19: The GMRES(m) algorithm

Convergence of GMRES
• The norm of the residuals ~rn = A~xn − ~b is decreasing k~rn+1 k ≤ k~rn k. This is obvious since the
Krylov subspace is enlarged.
   • In principle GMRES will generate the exact solution after N steps, but this result is useless. The goal
     is to use n ≪ N iterations and arithmetic errors will prevent us from using GMRES as a direct solver.
• One can construct cases where GMRES(m) does stagnate and not converge at all.
• For a fast convergence the eigenvalues of A should be clustered at a point away from the ori-
gin [LiesTich05]. This is different from the conjugate gradient algorithm, where the condition number
determines the rate of convergence.
• The convergence can be improved by preconditioners.

Orthogonal matrices, QR factorization and least square problems


Assume Q is a matrix with Qᵀ Q = I, not necessarily square. For the least square problem with A = Q R
the expression to be minimized is

        ‖A ~x − ~b‖² = ⟨A ~x − ~b , A ~x − ~b⟩ = ⟨Q R ~x − ~b , Q R ~x − ~b⟩
                     = ⟨Q R ~x , Q R ~x⟩ − 2 ⟨Q R ~x , ~b⟩ + ⟨~b , ~b⟩
                     = ⟨R ~x , Qᵀ Q R ~x⟩ − 2 ⟨R ~x , Qᵀ ~b⟩ + ⟨~b , ~b⟩
                     = ⟨R ~x , R ~x⟩ − 2 ⟨R ~x , Qᵀ ~b⟩ + ⟨Qᵀ ~b , Qᵀ ~b⟩ − ⟨Qᵀ ~b , Qᵀ ~b⟩ + ⟨~b , ~b⟩
                     = ‖R ~x − Qᵀ ~b‖² − ‖Qᵀ ~b‖² + ‖~b‖²


and the solution ~x is identical to the solution of the least square problem
                  min_{~x} ‖R ~x − Qᵀ ~b‖

with the solution
                  ~x = R⁻¹ Qᵀ ~b .
For more details see Section 3.5.1.

Arnoldi iteration
For a given matrix A ∈ MN,N with N ≥ n the Arnoldi algorithm generates a sequence of orthonormal
vectors ~qn such that the matrix is transformed to upper Hessenberg form. The key properties are
Qn = [~q1 , ~q2 , ~q3 , . . . , ~qn ] ∈ MN,n
QTn Qn = In
A · Qn = Qn+1 · Hn ∈ MN,n .
The column vectors ~qj of the matrix Q have length 1 and are pairwise orthogonal. The matrix Hn is of
upper Hessenberg form if all entries below the first lower diagonal are zero, i.e.
 
         [ h1,1   h1,2    · · ·                    h1,n   ]
         [ h2,1   h2,2    h2,3    · · ·            h2,n   ]
         [  0     h3,2    h3,3    h3,4    · · ·    h3,n   ]
   Hn =  [  0      0      h4,3    h4,4     ⋱        ⋮     ]  ∈ M_{n+1,n} .
         [  ⋮              ⋱       ⋱       ⋱        ⋮     ]
         [  0     · · ·            0     hn,n−1    hn,n   ]
         [  0     · · ·                    0       hn+1,n ]

The algorithm to generate the vectors ~qn and the matrix Hn is not overly complicated. The last column
of the matrix equation A · Qn = Qn+1 · Hn ∈ MN,n reads as
                  A ~qn = Σ_{j=1}^{n+1} hj,n ~qj

and thus the sequence ~qn of vectors is in the Krylov space

                  Kn = span{~q1 , A ~q1 , A² ~q1 , A³ ~q1 , . . . , A^(n−1) ~q1} .

We also observe that

                  ~qn+1 = (1/hn+1,n) ( A ~qn − Σ_{j=1}^{n} hj,n ~qj ) .

• As starting vector ~q1 we may choose any vector of length 1 .


   • The orthogonality is satisfied if

        0 = ⟨~qn+1 , ~qk⟩ = (1/hn+1,n) ( ⟨A ~qn , ~qk⟩ − Σ_{j=1}^{n} hj,n ⟨~qj , ~qk⟩ ) = (1/hn+1,n) ( ⟨A ~qn , ~qk⟩ − hk,n )

     and thus
                  hk,n = ⟨A ~qn , ~qk⟩    for k = 1, 2, 3, . . . , n .


   • The condition ‖~qn+1‖ = 1 then determines the value of hn+1,n, i.e. use

                  hn+1,n = ‖A ~qn − Σ_{j=1}^{n} hj,n ~qj‖ .

   • If the above would lead to hn+1,n = 0 then the Krylov subspace Kn is invariant under A and the
     GMRES algorithm will have produced the optimal solution, see [Saad00].
The algorithm can be implemented, see Figure 2.20.
• The computational effort for one Arnoldi step is given by one matrix multiplication, n scalar products
for vectors of length N and the summation of n vectors of length N to generate ~qn+1 .
   • To generate all of Qn+1 we need n matrix multiplications and n² N additional operations.
   • For small values of n ≪ N this is dominated by the matrix multiplications (n N² operations).

   1   Choose ~q1 with ‖~q1‖ = 1
   2   for n = 1, 2, 3, . . . do
   3       Set ~qn+1 = A ~qn
   4       for k = 1, 2, 3, . . . , n do
   5           hk,n = ⟨~qn+1 , ~qk⟩
   6           ~qn+1 = ~qn+1 − hk,n ~qk
   7       end
   8       hn+1,n = ‖~qn+1‖ and then ~qn+1 = (1/hn+1,n) ~qn+1
   9   end

Figure 2.20: The Arnoldi algorithm
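
The pseudo code of Figure 2.20 translates directly into Octave/MATLAB. The sketch below is not part of the
original notes (the function name and the fixed number of steps n are choices made here).

function [Q, H] = arnoldi(A, q1, n)
  % sketch of the Arnoldi iteration of Figure 2.20
  % returns Q with N x (n+1) orthonormal columns and the (n+1) x n
  % upper Hessenberg matrix H with A*Q(:,1:n) = Q*H
  N = length(q1);
  Q = zeros(N, n+1);  H = zeros(n+1, n);
  Q(:,1) = q1/norm(q1);
  for j = 1:n
    v = A*Q(:,j);
    for k = 1:j                      % orthogonalize against previous vectors
      H(k,j) = v'*Q(:,k);
      v = v - H(k,j)*Q(:,k);
    end
    H(j+1,j) = norm(v);
    if H(j+1,j) == 0, break; end     % invariant Krylov subspace reached
    Q(:,j+1) = v/H(j+1,j);
  end
end

A quick check is norm(A*Q(:,1:n) - Q*H), which should be at the level of rounding errors.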

Given’s rotations applied to Hessenberg matrices


Any vector (x, y) in the plane R² can be rotated onto the horizontal axis. With the angle α = arctan(y/x)
we find
                  [  cos α   + sin α ]  ( x )     ( √(x² + y²) )
                  [ − sin α    cos α ]  ( y )  =  (     0      ) .

Now we use the same idea on the upper Hessenberg matrix with an angle such that

                  Gᵀ ( h1,1 )  =  [  cos α   + sin α ]  ( h1,1 )  =  ( √(h1,1² + h2,1²) )
                     ( h2,1 )     [ − sin α    cos α ]  ( h2,1 )     (        0         )

i.e. we apply a rotation matrix to the first two rows of Hn and find a zero in the place of h2,1. If applied to
a matrix this is called a Givens rotation. To the modified equation we apply the same idea again to enforce
h3,2 = 0. This can be repeated, i.e. a multiplication by Gi uses a combination of rows i and i + 1 to replace
the entry hi+1,i by zero. Thus the G3 Givens rotation to be applied to 6 × 6 matrices is given by

                        [ 1  ·  ·  ·  ·  · ]
                        [ ·  1  ·  ·  ·  · ]
                        [ ·  ·  c  s  ·  · ]
                  G3 =  [ ·  · −s  c  ·  · ]
                        [ ·  ·  ·  ·  1  · ]
                        [ ·  ·  ·  ·  ·  1 ]


with c = cos(α), s = sin(α) and the angle α chosen such that the entry h4,3 will be rotated to zero. These
rotation matrices are very easy to invert, since

                  G⁻¹ = Gᵀ    and    G Gᵀ = I .

Now multiply the Hessenberg matrix Hn from the left by In+1 = G1 G1ᵀ, then by G2 G2ᵀ, then by G3 G3ᵀ,
. . . Repeat until you end up with a right triangular matrix R, resp. Hn = Q R. For a 7 × 6 Hessenberg
matrix the pattern of nonzero entries (∗) evolves as follows: each factor Giᵀ eliminates the subdiagonal
entry in column i.

         [ ∗ ∗ ∗ ∗ ∗ ∗ ]          [ ∗ ∗ ∗ ∗ ∗ ∗ ]                              [ ∗ ∗ ∗ ∗ ∗ ∗ ]
         [ ∗ ∗ ∗ ∗ ∗ ∗ ]          [ 0 ∗ ∗ ∗ ∗ ∗ ]                              [ 0 ∗ ∗ ∗ ∗ ∗ ]
         [ 0 ∗ ∗ ∗ ∗ ∗ ]          [ 0 ∗ ∗ ∗ ∗ ∗ ]                              [ 0 0 ∗ ∗ ∗ ∗ ]
   Hn =  [ 0 0 ∗ ∗ ∗ ∗ ]  =  G1   [ 0 0 ∗ ∗ ∗ ∗ ]  =  · · ·  =  G1 G2 · · · G6 [ 0 0 0 ∗ ∗ ∗ ]
         [ 0 0 0 ∗ ∗ ∗ ]          [ 0 0 0 ∗ ∗ ∗ ]                              [ 0 0 0 0 ∗ ∗ ]
         [ 0 0 0 0 ∗ ∗ ]          [ 0 0 0 0 ∗ ∗ ]                              [ 0 0 0 0 0 ∗ ]
         [ 0 0 0 0 0 ∗ ]          [ 0 0 0 0 0 ∗ ]                              [ 0 0 0 0 0 0 ]

                  = G1 G2 G3 G4 G5 G6 R = Q R

This leads to
                  Hn = Q R = G1 G2 G3 · · · Gn R

with a right triangular matrix R ∈ M_{(n+1)×n} and Q ∈ M_{(n+1)×(n+1)} with Q · Qᵀ = In+1. Now solve
the minimization problem

                  ‖Hn ~y − ‖~b0‖ ~e1‖ = ‖Q R ~y − ‖~b0‖ Q Qᵀ ~e1‖ = ‖R ~y − ‖~b0‖ Qᵀ ~e1‖ .

Since the vector ~y has no influence on the last component of the vector R~y , minimize by using only the first
n rows of R and Q. This leads to a solution, for more details see Section 3.5.1.
An implementation should take advantage of a few facts.
   • Since
                  Q = G1 G2 G3 · · · Gn    ⟹    Qᵀ = Gnᵀ · · · G3ᵀ G2ᵀ G1ᵀ ,

     the vector Qᵀ ~p can be computed efficiently while the Givens rotations are determined.


• The upper Hessenberg matrix Hn is transformed to the right matrix R row by row. Overwrite the top
rows of Hn by the rows of R, to save some memory.
   • If the original matrix is not of upper Hessenberg form, then Givens rotations are not an efficient choice
     for a QR factorization. A better choice is to use Householder reflections.
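
For an upper Hessenberg matrix the procedure is short. The following Octave/MATLAB sketch is not part of
the original notes (the function name is a choice made here); it accumulates the full matrix Q for clarity,
while an efficient GMRES implementation would only apply the rotations to the right hand side.

function [Q, R] = givens_qr_hessenberg(H)
  % sketch: QR factorization of an (n+1) x n upper Hessenberg matrix H
  % with Givens rotations, so that H = Q*R and R is right (upper) triangular
  [m, n] = size(H);
  Q = eye(m);  R = H;
  for i = 1:n
    r = hypot(R(i,i), R(i+1,i));
    if r == 0, continue; end
    c = R(i,i)/r;  s = R(i+1,i)/r;
    W = [c s; -s c];                       % rotation of rows i and i+1
    R(i:i+1, i:n) = W * R(i:i+1, i:n);     % zeroes the entry R(i+1,i)
    Q(:, i:i+1) = Q(:, i:i+1) * W';        % accumulate Q with H = Q*R
  end
end

As a check, norm(Q*R - H) and norm(Q'*Q - eye(m)) should be at the level of rounding errors; the least
square problem of the GMRES step is then solved from the first n rows of R and of Qᵀ ~e1.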

2.9 Iterative Solvers in MATLAB/Octave and a Comparison with Direct Solvers


2.9.1 Iterative Solvers in MATLAB/Octave
Octave and MATLAB have a few iterative solvers already implemented, see Table 2.18. The list is not
complete. For most (or all) of these implementations you have to provide either the matrix for the linear
system to be solved, or a function to multiply a vector by that matrix. For most algorithms you can specify
a preconditioner. Use the built-in documentation of MATLAB/Octave for more information.

pcg() preconditioned conjugate gradient method


bicg() bi-conjugate gradient iterative method
bicgstab() stabilized bi-conjugate gradient iterative method
gmres() generalized minimum residual method
pcr() preconditioned conjugate residual method
minres() minimum residual method
qmr() quasi-minimal residual method
cgs() conjugate gradients squared method

Table 2.18: Iterative solvers in Octave/MATLAB
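
A typical call for a nonsymmetric sparse system combines ilu() with gmres() or bicgstab(). The sketch
below is not part of the original notes; it assumes a sparse matrix A and right hand side b are given, and
the parameter values are illustrative only.

setup.type = 'ilutp';  setup.droptol = 1e-4;
[L, U] = ilu(A, setup);                    % incomplete LU preconditioner
restart = 20;  tol = 1e-8;  maxit = 200;
[x, flag, relres, iter] = gmres(A, b, restart, tol, maxit, L, U);
% or, without restarts:
[x2, flag2] = bicgstab(A, b, tol, maxit, L, U);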

2.9.2 A Comparison of Direct and Iterative Solvers


In Figure 2.21 find a general comparison of different solvers for linear systems. This figure illustrates that
for different problems different algorithms should be used. On the web you can find a great variety of
software for linear algebra problems. An excellent starting point is given by [www:LinAlgFree].

                                           direct                  iterative
   nonsymmetric (more general)             LR with pivoting        GMRES, BiCGSTAB
   symmetric positive definite             Cholesky (banded)       Conjugate Gradient

   (direct solvers are more robust, iterative solvers need less storage)

Figure 2.21: Comparison of linear solvers

The FEM (Finite Element Method) software COMSOL has a rather large choice of algorithms to solve
the resulting systems on linear equations and some of them are capable of using multiple cores. Amongst


Algorithm time required


number of threads 1 2 3 4
UMFPACK, direct solver running out of memory
TAUCS, Cholesky based direct solver running out of memory
SPOOLES, direct solver 1614 s 1054 s 732 s 710 s
PARDISO, direct solver (by O.Schenk, Basel) 480 s 261 s 197 s 177 s
Conjugate Gradient, with Cholesky preconditioner 276 s 255 s 239 s 230 s
Multigrid, with PARDISO preconditioner 95 s 62 s 50 s 45 s

Table 2.19: Benchmark of different algorithms for linear systems, used with COMSOL Multiphysics

the possible model problems we picked one: on a cylinder with height 10 and radius 1 a horizontal
force was applied at the top and the bottom was fixed. A FEM system was set up, resulting in a system of
800'000 linear equations. Thus we have a sizable 3D problem. Then different algorithms were used to solve
the system. The PC used is based on an I7-920 CPU with 4 cores and has 12 GB of RAM. Thus we have
a system as presented in Section 2.2.3. Find the timing results in Table 2.19. This result is by no means
representative and its main purpose is to show that you have to choose a good algorithm. Your conclusion
will depend on the type of problems at hand, its size, the software and hardware to be used.

2.10 Other Matrix Factorizations


We only examined the LR (or LU) factorization and the Cholesky factorization of a matrix. There are many
more factorizations and possible applications thereof.

• SVD Singular Value Decomposition: has many applications, see Section 3.2.6 starting on page 149.

• QR factorization: this factorization has many applications, including linear regression. Find a short
presentation of QR used for linear regression in Section 3.5.1, starting on page 222.

In this chapter the codes in Table 2.20 were used.

filename function
speed subdirectory with C code to determine the FLOPS for a CPU
AnnGenerate.m code to generate the model matrix Ann
LRtest.m code for LR factorization
cholesky.m code for the Cholesky factorization of a matrix
choleskySolver.m code to solve a linear system with Cholesky
cholInc.m code for the incomplete Cholesky factorization

Table 2.20: Codes for chapter 2

Bibliography
[Axel94] O. Axelsson. Iterative Solution Methods. Cambridge University Press, 1994.


[AxelBark84] O. Axelsson and V. A. Barker. Finite Element Solution of Boundary Values Problems. Aca-
demic Press, 1984.

[templates] R. Barrett, M. Berry, T. F. Chan, J. Demmel, J. Donato, J. Dongarra, V. Eijkhout, R. Pozo,


C. Romine, and H. V. der Vorst. Templates for the Solution of Linear Systems: Building Blocks for
Iterative Methods, 2nd Edition. SIAM, Philadelphia, PA, 1994.

[TopTen] B. A. Cypra. The best of the 20th century: Editors name top 10 algorithms. SIAM News, 2000.

[www:LinAlgFree] J. Dongarra. Freely available software for linear algebra on the web.
http://www.netlib.org/utk/people/JackDongarra/la-sw.html.

[DowdSeve98] K. Dowd and C. Severance. High Performance Computing. O’Reilly, 2nd edition, 1998.

[Gold91] D. Goldberg. What every computer scientist should know about floating-point arithmetic. ACM
Computing Surveys, 23(1), March 1991.

[GoluVanLoan96] G. Golub and C. Van Loan. Matrix Computations. Johns Hopkins University Press, third
edition, 1996.

[GoluVanLoan13] G. Golub and C. Van Loan. Matrix Computations. Johns Hopkins University Press,
fourth edition, 2013.

[Hack94] W. Hackbusch. Iterative Solution of Large Sparse Systems of Equations, volume 95 of Applied
Mathematical Sciences. Springer, first edition, 1994.

[Hack16] W. Hackbusch. Iterative Solution of Large Sparse Systems of Equations, volume 95 of Applied
Mathematical Sciences. Springer, second edition, 2016.

[HeroArnd01] H. Herold and J. Arndt. C-Programmierung unter Linux. SuSE Press, 2001.

[Intel90] Intel Corporation. i486 Microprocessor Programmers Reference Manual. McGraw-Hill, 1990.

[KnabAnge00] P. Knabner and L. Angermann. Numerik partieller Differentialgleichungen. Springer Ver-


lag, Berlin, 2000.

[LascTheo87] P. Lascaux and R. Théodor. Analyse numérique matricielle appliquée a l’art de l’ingénieur,
Tome 2. Masson, Paris, 1987.

[LiesTich05] J. Liesen and P. Tichý. Convergence analysis of Krylov subspace methods. GAMM Mitt. Ges.
Angew. Math. Mech., 27(2):153–173 (2005), 2004.

[Saad00] Y. Saad. Iterative Methods for Sparse Linear Systems. PWS, second edition, 2000. available on
the internet.

[Schw86] H. R. Schwarz. Numerische Mathematik. Teubner, Braunschweig, 1986.

[Shew94] J. R. Shewchuk. An introduction to the conjugate gradient method without the agonizing pain.
Technical report, Carnegie Mellon University, 1994.

[VarFEM] A. Stahel. Calculus of Variations and Finite Elements. Lecture Notes used at HTA Biel, 2000.

[Wilk63] J. H. Wilkinson. Rounding Errors in Algebraic Processes. Prentice-Hall, 1963. Republished by


Dover in 1994.

[YounGreg72] D. M. Young and R. T. Gregory. A Survey of Numerical Analysis, Volume 1. Dover Publica-
tions, New York, 1972.

Chapter 3

Numerical Tools

3.0.1 Prerequisites and Goals


In this chapter we will present some methods to

• solve a nonlinear equation f (x) = 0 or systems of nonlinear equations F~ (~x) = ~0.

• work with eigenvalues and eigenvectors of matrices.

• show some aspects of SVD (singular value decomposition) and PCA (principal component analysis).

• integrate functions numerically, either given by data points or by a formula.

• solve ordinary differential equations numerically.

• apply linear and nonlinear regression. This includes fitting of curves to given data.

• examine intervals and regions of confidence for parameters determined by regression.

After having worked through this chapter

• Nonlinear equations

– you should be able to apply the methods of bisection and false position to solve one nonlinear
equation.
– you should be able to apply Newton’s method reliably to solve one nonlinear equation.
– you should be familiar with possible problems when using Newton’s method.
– you should be able to apply Newton’s method reliably to solve systems of nonlinear equations.

• Eigenvalues and eigenvectors of matrices

– you should understand the importance of eigenvalues and eigenvectors for linear mappings.
– you should be able to use eigenvalues to describe the behavior of solutions of systems of linear
ODEs.
– you should understand the connection between eigenvalues and singular value decomposition.
– you should be able to use the covariance matrix and PCA to describe the distribution of data.

• Numerical integration

– you should be able to integrate functions given by data points using the trapezoidal rule.
– you should understand Simpson’s algorithm and Gauss integration.


– you should understand the basic idea of an adaptive integration of a given function.
– You should be able to use Octave/MATLAB to evaluate integrals reliably.

• Ordinary differential equations (ODE)

– you should be able to use MATLAB/Octave to solve ODEs numerically, with reliable results.
– you should understand the basic idea used for the numerical solvers and the importance of sta-
bility for the algorithms.

• Linear and nonlinear regression

– you should be able to use the matrix notation to set up and solve linear regression problems,
including the confidence intervals for the optimal parameters.
– you should be able to set up and solve nonlinear regression problems, including the confidence
intervals for the optimal parameters.

In this chapter we assume that you are familiar with

• the definition and interpretation of eigenvalues and eigenvectors for matrices.

• the idea and computations for derivatives and linear approximations for a function of one variable.

• the idea and computations for derivatives and linear approximations for a function of multiple vari-
ables.

3.1 Nonlinear Equations


3.1.1 Introduction
When trying to solve a single equation or a system of equations

f (x) = 0 or F~ (~x) = ~0

for nonlinear functions f or F~ , and algebraic manipulations fail to give satisfactory results, then one has
to resort to approximation methods. This will very often involve iterative methods: to a known value
x_n apply some carefully planned operations to obtain a new value x_{n+1} . As the same operation is applied
repeatedly one hopes for the sequence of values to converge, preferably to a solution of the original problem.
As an example pick an arbitrary value x_0 , type it into your pocket calculator and then keep pushing the
cos button. After a few steps you will realize that the displayed numbers converge to x ≈ 0.73909. This
number x solves the equation cos(x) = x.
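
The same experiment is easily repeated in Octave/MATLAB; the loop below is only meant to illustrate the
idea of an iteration:

x = 0.5;                 % an arbitrary starting value in [0,1]
for n = 1:30
  x = cos(x);            % one step of the iteration x_{n+1} = cos(x_n)
end%for
disp(x)                  % approximately 0.73909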

For all iterative methods the same questions have to be answered before launching a lengthy calculation:

• What is the computational cost of one step of the iteration?

• Will the generated sequence converge, and converge to the desired solution?

• How quickly will the sequence converge?

• How reliable is the iteration method?

This section will present some of the basic iterative methods and the corresponding results, such that you
should be able to answer the above questions.


3–1 Definition : Let x_n be a sequence converging to x* . This convergence is said to be of order p if there
is a constant c such that
      ‖x_{n+1} − x*‖ ≤ c ‖x_n − x*‖^p .
A method which produces such a sequence is said to have an order of convergence p .

The expression log ‖x_n − x*‖ corresponds to the number of correct (decimal) digits in the approximation
x_n of the exact value x* . Thus the order of convergence is an indication on how quickly the approximation
sequence will converge to the exact solution. We examine two important cases:
• Linear convergence, convergence of order 1:

      ‖x_n − x*‖ ≤ c ‖x_{n−1} − x*‖ ≤ c² ‖x_{n−2} − x*‖ ≤ . . . ≤ c^n ‖x_0 − x*‖

      log ‖x_n − x*‖ ≤ log c + log ‖x_{n−1} − x*‖
      log ‖x_n − x*‖ ≤ log ‖x_0 − x*‖ + n log c

  Thus the number of accurate digits increases by a fixed number (|log c|) for each step, as long as c < 1,
  i.e. log(c) < 0 . In real applications we do not have the exact solution to check for convergence, but we
  may observe the difference between subsequent values:

      ‖x_{n+1} − x_n‖ = ‖(x_{n+1} − x*) − (x_n − x*)‖ ≤ ‖x_{n+1} − x*‖ + ‖x_n − x*‖
                      ≤ ‖x_0 − x*‖ c^{n+1} + ‖x_0 − x*‖ c^n ≤ ‖x_0 − x*‖ (1 + c) c^n
      log ‖x_{n+1} − x_n‖ ≤ log ‖x_0 − x*‖ + log(1 + c) + n log(c)

  Thus expect the number of stable digits to increase by a fixed amount (|log c|) for each step.

• Quadratic convergence, convergence of order 2:

      ‖x_n − x*‖ ≤ c ‖x_{n−1} − x*‖²
      log(‖x_n − x*‖) ≤ 2 log ‖x_{n−1} − x*‖ + log(c)

  Thus the number of accurate digits is doubled at each step, ignoring the expression log(c). Once we
  have enough digits this simplification is justified. When observing the number of stable digits we find
  similarly

      ‖x_{n+1} − x_n‖ = ‖(x_{n+1} − x*) − (x_n − x*)‖ ≤ c ‖x_n − x*‖² + ‖x_n − x*‖
                      = (c ‖x_n − x*‖ + 1) ‖x_n − x*‖
      log ‖x_{n+1} − x_n‖ ≤ log (c ‖x_n − x*‖ + 1) + log ‖x_n − x*‖ ≈ 0 + log ‖x_n − x*‖

  and consequently the number of stable digits should double at each step, at least once we are close to
  the actual solution [1].
The effect of different orders of convergence is illustrated in Example 3–3 (see page 109), leading to
Table 3.2 on page 110.

How to stop an iteration


When a system of equations F~ (~x) = ~0 is solved by an iterative method you end up with a sequence of
vectors ~x_n for n = 1, 2, 3, . . . Then the task is to determine when to stop the iteration and accept the current
result as good enough. Thus a good termination criterion has to be selected in advance. There are different
possible options and a good choice has to be based on the concrete application and the question one has to
answer [2].

[1] Use log(1) = 0 and thus log(c ‖x_n − x*‖ + 1) ≈ 0 if c ‖x_n − x*‖ ≪ 1.
[2] Richard W. Hamming (1962) is said to have coined the phrase: The purpose of computing is insight, not numbers.


• Terminate if the absolute change in x is small enough, i.e. ‖~x_{n+1} − ~x_n‖ is small.

• Terminate if the relative change in x is small enough, i.e. use one of the expressions
  ‖~x_{n+1} − ~x_n‖ / ‖~x_n‖ ,   ‖~x_{n+1} − ~x_n‖ / ‖~x_{n+1}‖   or   ‖~x_{n+1} − ~x_n‖ / (‖~x_n‖ + ‖~x_{n+1}‖)
  as termination criterion.

• In black box solvers one should use a combination of absolute tolerance A and relative tolerance R,
  e.g. stop if
      ‖~x_{n+1} − ~x_n‖ ≤ A + R ‖~x_n‖ .
  A minimal sketch of this test in code is shown after this list.

• Terminate if the absolute error in ~y = F~ (~x) is small enough, i.e. ‖F~ (~x_n)‖ is small.
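
A minimal sketch of the combined absolute/relative criterion, applied for illustration to the trivial iteration
x_{n+1} = cos(x_n) from the introduction (the tolerance values are arbitrary choices):

A = 1e-12; R = 1e-10;                       % absolute and relative tolerances
x = 0.5; done = false;
while ~done
  xnew = cos(x);                            % one step of the iteration
  done = norm(xnew-x) <= A + R*norm(x);     % combined termination criterion
  x = xnew;
end%while
disp(x)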

3.1.2 Bisection, Regula Falsi and Secant Method to Solve one Equation
In this section find the basic idea for three algorithms to solve a single equation of the form f (x) = 0.
Throughout this section assume that the function f is at least continuous, or as often differentiable as
necessary. Also assume that a solution exists, i.e. f (x*) = 0, and that it is not a double zero, i.e. f '(x*) ≠ 0 .
Find a brief description and an illustrative graphic for each of the algorithms. Since coding of these algorithms
is not too difficult, no explicit code is provided.

Bisection
This basic algorithm will find zeros of a continuous function, once two values of x with opposite signs for y
are known. The solution x* of f (x*) = 0 will be bracketed by x_n and x_{n+1} , i.e. x* is between x_n and
x_{n+1} . Find the description of the algorithm below and an illustration in Figure 3.1.

• Start with two values x0 and x1 such that y0 = f (x0 ) and y1 = f (x1 ) have opposite signs, i.e.
f (x0 ) · f (x1 ) < 0. This leads to an initial interval.

• Repeat until the desired accuracy is achieved:

– Compute the function y = f (x) at the midpoint xn+1 of the current interval and examine the
sign of y .
– Retain the mid point and one of the endpoints, such that the y-values have opposite signs. This
is the new interval to be examined in the next iteration.

Figure 3.1: Method of bisection to solve one equation

This algorithm will always converge, since the function f is assumed to be continuous and the solution
is bracketed. Obviously the maximal error is halved at each step of the iteration and we have an elementary
estimate for the error
      |x_{n+1} − x*| ≤ (1/2^n) |x_1 − x_0| .


Thus we find linear convergence, i.e. the number of accurate decimal digits is increased by ln 2 / ln 10 ≈ 0.3 at
each step.
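
As an illustration only (the notes deliberately provide no code for these algorithms), a minimal sketch of
the bisection idea in Octave/MATLAB, applied to f(x) = x² − 2 with the bracket [0, 2]:

f = @(x) x.^2 - 2;                 % the bracket [0,2] contains the zero sqrt(2)
x0 = 0; x1 = 2;
for n = 1:40
  xm = (x0+x1)/2;                  % midpoint of the current interval
  if f(x0)*f(xm) <= 0              % keep the half interval with the sign change
    x1 = xm;
  else
    x0 = xm;
  end%if
end%for
disp((x0+x1)/2)                    % approximately sqrt(2) = 1.41421...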

False position method, Regula Falsi


This algorithm is a minor modification of the bisection method. Instead of using the midpoint of the interval
we continue with the zero of the secant connecting the two endpoints. Find the illustration in Figure 3.2. It
can be shown that the convergence of the algorithm is linear.

Figure 3.2: Method of false position to solve one equation

Secant method
The two previous algorithms are guaranteed to give a solution, since the solution was bracketed. We can
modify the false position method slightly and always retain the last two values, independent of the sign of
the function. Find the illustration in Figure 3.3.

• Start with two values x0 and x1 , compute y0 = f (x0 ) and y1 = f (x1 ) .

• Repeat until the desired accuracy is achieved:

– Compute the zero of the secant connecting the two given points
        x_{n+1} = x_n − f (x_n) · (x_n − x_{n−1}) / (f (x_n) − f (x_{n−1})) .

– Restart with xn and xn+1 .

One can show that this algorithm has superlinear convergence:
      |x_{n+1} − x*| ≈ c |x_n − x*|^{1.618} .

This implies that the number of correct digits is multiplied by 1.6, as soon as we are close enough. This is
a huge advantage over the bisection and false position methods. As a clear disadvantage we have no guar-
anteed convergence, even if the solution was originally bracketed. The secant might intersect the horizontal
axis at a far away point and thus we might end up with a different solution than expected, or none at all. One
can show that the secant method will converge to a solution x? if the starting values are close enough to x?
and f 0 (x? ) 6= 0 .
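
A corresponding minimal sketch of the secant iteration, again for f(x) = x² − 2 and for illustration only:

f = @(x) x.^2 - 2;
x0 = 0; x1 = 2;                    % two starting values, bracketing is not required
for n = 1:8
  d = f(x1) - f(x0);
  if d == 0, break; end%if         % secant is horizontal, stop
  x2 = x1 - f(x1)*(x1-x0)/d;       % zero of the secant through the last two points
  x0 = x1; x1 = x2;
end%for
disp(x1)                           % approximately sqrt(2)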


Figure 3.3: Secant method to solve one equation

Newton’s method to solve one equation


This important algorithm is based on the idea of a linear approximation. For a given estimate x0 of the
zero of the function f (x) we replace the function by its linear approximation, i.e. the tangent to the curve
y = f (x) at the point (x0 , f (x0 )) .

      f (x_0 + ∆x) = 0
      f (x_0 + ∆x) ≈ f (x_0) + f '(x_0) · ∆x
      f (x_0) + f '(x_0) · ∆x = 0
      ∆x = − f (x_0) / f '(x_0)
      x_1 = x_0 + ∆x = x_0 − f (x_0) / f '(x_0)

The above computations lead to the algorithm of Newton, some authors call it Newton–Raphson.

• Start with a value x0 close to the zero of f (x) .

• Repeat until the desired accuracy is achieved:

– Compute values f (x_n) and the derivative f '(x_n) at the point x_n . Apply Newton’s formula
        x_{n+1} = x_n − f (x_n) / f '(x_n) .

– Restart with xn+1

The algorithm is illustrated in Figure 3.4.

Figure 3.4: Newton’s method to solve one equation


One can show that this algorithm converges quadratically:
      |x_{n+1} − x*| ≈ c |x_n − x*|² .

This implies that the number of correct digits is multiplied by 2, as soon as we are close enough. This
is a huge advantage over the bisection and false position methods. As a clear disadvantage we have no
guaranteed convergence. The tangent might intersect the horizontal axis at a far away point and thus we
might end up with a different solution than expected, or none at all. One can show that Newton’s method
will converge to a solution x? if the starting values are close enough to x? and f 0 (x? ) 6= 0 .

3–2 Example : To compute the value of x = √2 we may try to solve the equation x² − 2 = 0. For this
example we find
      x_{n+1} = x_n − f (x_n) / f '(x_n) = x_n − (x_n² − 2) / (2 x_n) = (2 + x_n²) / (2 x_n) .
With a starting value of x_0 = 1 we find
      x_1 = (2 + 1)/2 = 3/2 ,   x_2 = (2 + 9/4)/3 = 17/12 ≈ 1.417   and   x_3 ≈ 1.414216 .
Thus we are very close to the actual solution with very few iteration steps. ♦
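
The few iteration steps above are easily reproduced; the loop below is only an illustration of the formula
derived in this example:

x = 1;                             % starting value x_0 = 1
for n = 1:5
  x = (2 + x^2)/(2*x);             % one Newton step for f(x) = x^2 - 2
  disp(x)
end%for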

The major problem of Newton’s method is the fact that the initial guess has to be close enough
to the exact solution. If this is not the case, then we might run into severe problems. Consider the three
graphs in Figure 3.5.

• The graph on the left does not have a zero (or is it a double zero?) and thus Newton’s method will
happily iterate along, never getting close to a solution.

• The middle graph has one solution, but if we start a Newton iteration to the right of the maximum,
then the iteration will move further and further to the right. The clearly existing, unique zero will not
be found.

• In the graph on the right it is easy to find starting values x0 such that Newton’s method will converge,
but not to the solution closest to x0 .

Figure 3.5: Three functions that might cause problems for Newton’s methods

• Newton’s method is an excellent tool to compute a zero of a function accurately, and quickly.

• Crucial for a success is the availability of a good initial guess.

• Newton’s method can fail miserably when no good initial guess is available.


Comparison
The above four algorithms all have their weak and strong points, thus a comparison is called for:
• Bisection and False Position are guaranteed to converge to a solution, as long as the starting points x_0
and x_1 lead to different signs for f (x_0) and f (x_1) .
• The secant method converges faster than bisection or Regula Falsi, but the performance of Newton is
hard to beat.
• The secant and Newton’s method might not give the desired/expected solution or might even fail
completely.
• Only Newton’s method requires values of the derivative.
Find the results in Table 3.1.

Bisection False Position Secant Newton


bracketing necessary yes yes no no
guaranteed success yes yes no no
requires derivative no no no yes
order of convergence 1 1 1.618 2

Table 3.1: Comparison of methods to solve one equation

3–3 Example : Performance comparison of solvers
Compute √2 as solution of the equation f (x) = x² − 2 = 0. We use implementations in Octave of the
above four algorithms to solve this elementary equation and keep track of the following quantities:
• The estimate of the solution at each step, xn .
• The number of correct decimal digits: corr.
• The number of non-changing digits from the last iteration step: fix.
The results are shown in Table 3.2. There are some observations to be made:
• The number of accurate digits for the bisection method increases very slowly, but exactly in the
predicted way. For 10 iterations the error has to be divided by 2¹⁰ ≈ 1000, thus we gain 3 digits only.
• The Regula Falsi method leads to linear convergence, the number of correct digits increases by a fixed
number (≈ 0.7) for each step. This is clearly superior to the method of bisection.
• The secant method leads to superlinear convergence, the number of correct digits increases by a fixed
factor (≈ 1.6) for each step. After a few steps (8) we reach machine accuracy and there is no change
in the result any more.
• The Newton method converges very quickly (5 steps) up to machine precision to the exact solution.
The number of correct digits is doubled at each step. This is caused by the quadratic convergence of
the algorithm.
• The number of unchanged digits at each step (fix) is a safe estimate of the number of correct digits
(corr). This is an important observation, since for real world problems the only available information
is the value of fix. Computing corr requires the exact solution and thus there would be no need for
a solution algorithm.


Bisection Regula Falsi Secant Method Newton’s Method


xn corr fix xn corr fix xn corr fix xn corr fix
1.000000 0.4 1.000000 0.4 1.000000 0.4 2.000000 0.2
1.500000 1.1 0.3 1.333333 1.1 0.5 1.333333 1.1 0.5 1.500000 1.1 0.3
1.250000 0.8 0.6 1.400000 1.8 1.2 1.428571 1.8 1.0 1.416667 2.6 1.1
1.375000 1.4 0.9 1.411765 2.6 1.9 1.413793 3.4 1.8 1.414216 5.7 2.6
1.437500 1.6 1.2 1.413793 3.4 2.7 1.414211 5.7 3.4 1.414214 12 5.7
1.406250 2.1 1.5 1.414141 4.1 3.5 1.414214 9.5 5.7 1.414214 16 12
1.421875 2.1 1.8 1.414201 4.9 4.2 1.414214 15 9.5 1.414214 16 16
1.414062 3.8 2.1 1.414211 5.7 5.0 1.414214 16 16 1.414214 16 16
1.417969 2.4 2.4 1.414213 6.4 5.8 1.414214 16 16 1.414214 16 16
1.416016 2.7 2.7 1.414213 7.2 6.5 1.414214 16 16 1.414214 16 16

Table 3.2: Performance of some basic algorithms to solve x2 − 2 = 0

3.1.3 Systems of Equations


In the previous section we found that the situation for solving a single equation is rather comfortable. We have
different algorithms at our disposal, with different strengths and weaknesses. Combined with graphical
tools we should be able to examine almost all situations with reliable results.
The situation changes drastically for systems of equations and one may sum up the situation:

There is no reliable black box algorithm to solve systems of equations

• The ideas of the Bisection method, the False Position method and the Secant method can not be carried
over to the situation of multiple equations.
• The method of Newton can be applied to systems of equations. This will be done in the next section.
It has to be pointed out that a number of problems might occur:
– Newton requires a good starting point to work reliably.
– We also need the derivatives of the functions. For a system of n equations this amounts to n² par-
tial derivatives to be known. The computational (and programming) cost might be prohibitive.
– For each step of Newton’s method a system of n linear equations has to be solved and for very
large n this might be difficult.
• There exist derivative free algorithms to solve systems of equations, e.g. Broyden’s method. As a
possible starting point consider [Pres92].
• If the problem is a minimization problem, i.e. you are searching for ~x ∈ Rn such that the function
f : Rn → R attains its minimum at ~x, this leads to a system of n equations
      grad f (~x) = ~0 .
Since − grad f is pointing in the direction of steepest descent one has good knowledge where the
minimum might be. In this situation reliable and efficient algorithms are known, e.g. the Levenberg–
Marquardt algorithm.
In these notes we concentrate on Newton’s method and its applications. Successive substitution and
partial substitution are mentioned briefly.


3.1.4 The Contraction Mapping Principle and Successive Substitutions


The theoretical foundation for many iterative schemes to solve systems of nonlinear equations is given by
Banach’s fixed point theorem, also called contraction mapping principle. This is one of the most important
results in nonlinear analysis and it has many applications. We give an abstract version below and illustrate
it by a few examples.

With the translation F (x) = G(x) + x it is obvious that a zero of G is a fixed point (F (x) = x) of F .

G(x) = F (x) − x = 0 ⇐⇒ F (x) = G(x) + x = x

Thus we may concentrate our efforts on efficient algorithms to locate fixed points of iterations.

3–4 Theorem : Banach’s fixed point theorem, Contraction Mapping Principle


Let M be a closed subset of a Banach space E and let the mapping F be a contraction from M to
M , i.e. there exists a constant c < 1 such that

      F : M −→ M   with   ‖F (x) − F (y)‖ ≤ c ‖x − y‖   for all x, y ∈ M .          (3.1)

Then there exists exactly one fixed point z ∈ M of the mapping F , i.e. one solution of F (z) = z.
For any initial point x_0 ∈ M the sequence formed by x_{n+1} = F (x_n) will converge to z and we
have the estimate
      ‖x_{n+1} − z‖ ≤ c ‖x_n − z‖ ,
i.e. the order of convergence is at least 1 . By applying the above estimate repeatedly we find the
a priori estimate
      ‖x_n − z‖ ≤ c^n ‖x_0 − z‖ ,
i.e. we can estimate the number of necessary iterations before starting the algorithm. An a posteriori
estimate is given by
      ‖x_{n+1} − z‖ ≤ c/(1 − c) ‖x_n − x_{n+1}‖ ,
i.e. we can estimate the error during the computations by comparing subsequent values.

The proof below is given for the sake of completeness only. It is possible to work through the remainder of
these notes without working through the proof, but it is advisable to understand the illustration in Figure 3.6
and the consequences of the estimates in the above theorem.
Proof : For an arbitrary initial point x_0 ∈ M we examine the sequence x_n = F^n (x_0).

      ‖F^n (x) − F^n (y)‖ ≤ c ‖F^{n−1} (x) − F^{n−1} (y)‖ ≤ c^n ‖x − y‖

      ‖F^n (x_0) − F^{n+k} (x_0)‖ ≤ c^n ‖x_0 − F^k (x_0)‖ ≤ c^n Σ_{i=0}^{k−1} ‖F^i (x_0) − F^{i+1} (x_0)‖
                                  ≤ c^n Σ_{i=0}^{k−1} c^i ‖x_0 − F (x_0)‖ ≤ c^n/(1 − c) ‖x_0 − F (x_0)‖

Thus x_n is a Cauchy sequence and we conclude

      x_n = F^n (x_0) −→ z ∈ M   as n → ∞ .

Since F is continuous we conclude

      x_{n+1} = F (x_n) −→ F (z)


and thus
      F (z) = lim x_{n+1} = lim x_n = z .
If z̄ is also a fixed point we use the contraction property
      ‖z̄ − z‖ = ‖F (z̄) − F (z)‖ ≤ c ‖z̄ − z‖
to conclude z̄ = z. Thus we have a unique fixed point. To verify the linear convergence we use F (z) = z
and the contraction property to conclude
      ‖x_{n+1} − z‖ = ‖F (x_n) − F (z)‖ ≤ c ‖x_n − z‖ .
To verify the a posteriori estimate we use
      ‖x_n − z‖ ≤ ‖x_n − x_{n+1}‖ + ‖x_{n+1} − z‖ ≤ ‖x_n − x_{n+1}‖ + c ‖x_n − z‖
      ‖x_n − z‖ ≤ 1/(1 − c) ‖x_n − x_{n+1}‖
      ‖x_{n+1} − z‖ ≤ c ‖x_n − z‖ ≤ c/(1 − c) ‖x_n − x_{n+1}‖ .                      □

The function F maps the set M to M and it is a contraction, i.e. there is a constant c < 1 such that
‖F (x) − F (y)‖ ≤ c ‖x − y‖ for all x, y ∈ M .

Figure 3.6: The contraction mapping principle

3–5 Example : The function f (x) = cos(x) on the interval M = [0 , 1] satisfies the assumptions of
Banach’s fixed point theorem. Obviously 0 ≤ cos x ≤ 1 for 0 ≤ x ≤ 1 and thus f maps M into M . The
contraction property is a consequence of an integral estimate:

      cos(x) − cos(y) = − ∫_y^x sin(t) dt

      |cos(x) − cos(y)| = | ∫_y^x sin(t) dt | ≤ sin(1) |x − y| .

The contraction constant is given by c = sin(1) < 1. As a consequence we find that the equation cos(x) = x
has exactly one solution in M . We can obtain this solution by choosing an arbitrary initial value x0 ∈ M
and then apply the iteration xn+1 = cos(xn ). This is illustrated in Figure 3.7 . ♦

3–6 Result : Let M ⊂ E be a closed subset of a Banach space E. If a mapping F : M → M is
differentiable and the linear operator DF(x) ∈ L(E , E) (i.e. a bounded linear operator) satisfies ‖DF(x)‖ ≤
c < 1 for all x ∈ M , then F is a contraction. Thus the equation F (x) = x can be solved by successive
substitutions x_{n+1} = F (x_n).


Figure 3.7: Successive substitution to solve cos x = x

Proof : For x, y ∈ M we define

      g(λ) = F (x + λ (y − x))   for 0 ≤ λ ≤ 1 .

We find g(0) = F (x) and g(1) = F (y). The chain rule implies

      d/dλ g(λ) = DF(x + λ (y − x)) · (y − x)

and thus

      F (y) − F (x) = g(1) − g(0) = ∫_0^1 d g(λ)/dλ dλ = ∫_0^1 DF(x + λ (y − x)) · (y − x) dλ .

The estimate of DF now implies

      ‖F (y) − F (x)‖ ≤ ∫_0^1 ‖DF(x + λ (y − x))‖ · ‖y − x‖ dλ ≤ c ‖y − x‖

and thus we have a contraction.                                                      □

3–7 Example : Quadratic convergence of Newton’s method

Newton’s method to solve a single equation f (x) = 0 uses the iteration

      x_{n+1} = F (x_n) = x_n − f (x_n) / f '(x_n) .

Thus we find

      d/dx F (x) = 1 − ( f '(x) · f '(x) − f (x) · f ''(x) ) / (f '(x))² .

If f (x*) = 0 and f '(x*) ≠ 0 we conclude

      d/dx F (x*) = 1 − ( f '(x*) · f '(x*) − 0 ) / (f '(x*))² = 0 .

If the function f is twice continuously differentiable we can conclude that in a neighborhood of x* the
derivative satisfies |d/dx F (x)| ≤ 1/2 and thus F is a contraction. The proof shows that the contraction
constant c gets closer to 0 as the approximate solution x_n approaches the exact solution x*. Based on this
idea one can prove the quadratic convergence of Newton’s method. The result remains valid in the Banach
space context, see e.g. [Deim84, Theorem 15.6]. A precise result is shown in [Linz79, §5.3], without proof.
The situation of n equations for n unknowns is also examined carefully in [IsaaKell66]. ♦


Partial successive substitution


There are problems for which it is advantageous to modify the method of successive substitutions. If we have
a function F (x, y) and we want to solve F (x, x) = x we can use successive substitutions on one of the
arguments only.
• Start with an initial value x0 .

• Repeat until the error is small enough

– Use the known value of xn and solve the equation F (xn , xn+1 ) = xn+1 for the unknown xn+1 .

As a trivial example try to solve the nonlinear equation 3 + 3 x = e^x . Given x_n you can solve
3 + 3 x_{n+1} = e^{x_n} for x_{n+1} by
      x_{n+1} = (e^{x_n} − 3) / 3 .
A simple graph (Figure 3.8) will convince you that the equation has two solutions, one close to x ≈ −1 and
the other close to x ≈ 2.5. Choosing x_0 = −1 will converge to the solution, but x_0 = 2.5 will not converge
at all. This shows that even for simple examples the method can fail.
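
The behavior of the two starting values is easily verified; the following lines are only an illustration:

x = -1;                            % starting value close to the smaller solution
for n = 1:30
  x = (exp(x)-3)/3;                % partial substitution step
end%for
disp(x)                            % converges to the smaller solution of 3+3x = exp(x)

x = 2.5;                           % starting value close to the larger solution
for n = 1:10
  x = (exp(x)-3)/3;
end%for
disp(x)                            % runs away: the second solution is not found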

Figure 3.8: Partial successive substitution to solve 3 + 3 x = exp(x)

Another example is given by the stretching of a beam in equation (1.14) on page 17. To solve the
nonlinear boundary value problem

      − d/dx ( E A_0(x) (1 − ν u'(x))² u'(x) ) = f (x)   for 0 < x < L

use a known function u_n(x) to compute the coefficient function

      a(x) = E A_0(x) (1 − ν u_n'(x))²

and then solve the linear boundary value problem

      − d/dx ( a(x) u_{n+1}'(x) ) = f (x)

for the next approximation u_{n+1}(x). Using the finite difference method this will be used in Example 4–9 on
page 297.

The above approach is sometimes called a Picard iteration.


3.1.5 Newton’s Algorithm to Solve Systems of Equations


In a previous section we used Newton’s method to solve a single equation. The ideas can be applied to
systems of equations. We first use the algorithm to solve two equations in two unknowns.

Newton’s algorithm to solve two equations with two unknowns


Search a solution of two equations in two unknowns

      f (x, y) = 0
      g (x, y) = 0 .

To simplify the problem replace the nonlinear functions f and g by linear approximations about the initial
point (x_0 , y_0):

      f (x_0 + ∆x, y_0 + ∆y) ≈ f (x_0 , y_0) + ∂f/∂x ∆x + ∂f/∂y ∆y
      g (x_0 + ∆x, y_0 + ∆y) ≈ g (x_0 , y_0) + ∂g/∂x ∆x + ∂g/∂y ∆y .

Thus replace the original equations by a set of approximate linear equations. This leads to equations for the
unknowns ∆x and ∆y:

      f (x_0 , y_0) + ∂f/∂x ∆x + ∂f/∂y ∆y = 0
      g (x_0 , y_0) + ∂g/∂x ∆x + ∂g/∂y ∆y = 0 .
Often a shortcut notation is used,

      f_x = ∂f/∂x ,   f_y = ∂f/∂y ,

and thus the approximate linear equations can be written in the form

      [ f (x_0 , y_0) ]       [ ∆x ]   [ 0 ]
      [ g (x_0 , y_0) ] + A · [ ∆y ] = [ 0 ] ,

where the 2 × 2 matrix A of partial derivatives is given by

      A = [ f_x (x_0 , y_0)   f_y (x_0 , y_0) ]
          [ g_x (x_0 , y_0)   g_y (x_0 , y_0) ] .

If the matrix is invertible [3] the solution is given by

      [ ∆x ]         [ f (x_0 , y_0) ]
      [ ∆y ] = −A⁻¹ · [ g (x_0 , y_0) ] .

Just as in the situation of a single equation find a (hopefully) better approximation of the true zero:

      [ x_1 ]   [ x_0 ]   [ ∆x ]
      [ y_1 ] = [ y_0 ] + [ ∆y ] .

[3] If the determinant is different from zero use the formula
      A⁻¹ = [ f_x  f_y ]⁻¹ = 1/(f_x g_y − g_x f_y) · [  g_y  −f_y ]
            [ g_x  g_y ]                             [ −g_x   f_x ]


This leads to an iterative formula for Newton’s method applied to a system of two equations,

      [ x_{n+1} ]   [ x_n ]   [ ∆x ]   [ x_n ]         [ f (x_n , y_n) ]
      [ y_{n+1} ] = [ y_n ] + [ ∆y ] = [ y_n ] − A⁻¹ · [ g (x_n , y_n) ] ,

where
      A = [ f_x (x_n , y_n)   f_y (x_n , y_n) ]
          [ g_x (x_n , y_n)   g_y (x_n , y_n) ] .

This iteration formula is, not surprisingly, very similar to the formula for a single equation,

      x_{n+1} = x_n − f (x_n) / f '(x_n) .

3–8 Example : Examine the equations


      x² + 4 y² = 1
      4 x⁴ + y² = 1
with the estimated solutions x0 = 1 and y0 = 1. We want to apply a few steps of Newton’s method.

With f_1 (x, y) = x² + 4 y² − 1 and f_2 (x, y) = 4 x⁴ + y² − 1 we find the partial derivatives

      [ ∂f_1/∂x   ∂f_1/∂y ]   [  2 x    8 y ]
      [ ∂f_2/∂x   ∂f_2/∂y ] = [ 16 x³   2 y ]

and for (x_0 , y_0) = (1 , 1) we have the values f_1 (x_0 , y_0) = 4 and f_2 (x_0 , y_0) = 4 and we find a system of
linear equations for x_1 and y_1 :

      [ 4 ]   [  2   8 ]   [ x_1 − 1 ]   [ 0 ]
      [ 4 ] + [ 16   2 ] · [ y_1 − 1 ] = [ 0 ]

This can also be written as a system for the update step

      [  2   8 ]   [ ∆x ]     [ 4 ]
      [ 16   2 ] · [ ∆y ] = − [ 4 ]

and thus

      [ x_1 ]   [ x_0 ]   [ ∆x ]   [ 1 ]   [  2   8 ]⁻¹ [ 4 ]   [ 0.8064516 ]
      [ y_1 ] = [ y_0 ] + [ ∆y ] = [ 1 ] − [ 16   2 ]   [ 4 ] ≈ [ 0.5483870 ] .

This is the result of the first Newton step. A visualization of this step can be generated with the code
in Newton2D.m .
For the next step we use f_1 (x_1 , y_1) ≈ 0.853 and f_2 (x_1 , y_1) ≈ 0.993 and find the system for x_2 and y_2 :

      [ 0.853 ]   [ 1.6129   4.3871 ]   [ x_2 − 0.806 ]   [ 0 ]
      [ 0.993 ] + [ 8.3918   1.0968 ] · [ y_2 − 0.548 ] = [ 0 ] .


This and similar calculations lead to

x0 = 1 y0 = 1
x1 = 0.8064516 y1 = 0.5483870
x2 = 0.7088993 y2 = 0.3897547
x3 = 0.6837299 y3 = 0.3658653
x4 = 0.6821996 y4 = 0.3655839
.. ..
. .
x7 = 0.6821941 y7 = 0.3655855

Observe the rapid convergence to a solution.

The above algorithm can be implemented in Octave. Below find a code segment to be stored in a file
NewtonSolve.m . The function NewtonSolve() takes the function f , the function Df for the partial
derivatives and the initial value ~x_0 as arguments and computes the solution of the system f (~x) = ~0 . The
default accuracy of 10⁻¹⁰ can be modified with a fourth argument. The code applies at most 20 iterations.
The code will return the approximate solution and the number of iterations required.
NewtonSolve.m
function [x,counter] = NewtonSolve(f,Df,x0,atol)
  if nargin<4, atol = 1e-10; end%if
  maxit = 20; counter = 0; xOld = x0;
  x = xOld - feval(Df,xOld)\feval(f,xOld);
  while ((counter<=maxit) && (norm(xOld-x)>atol))
    xOld = x;
    x = xOld - feval(Df,xOld)\feval(f,xOld);
    counter = counter+1;
  end%while
end%function

The above problem can now be solved by

%% code to solve a simple system of equations


F = @(x) [x(1)^2+4*x(2)^2-1; 4*x(1)^4 + x(2)^2-1];
DF = @(x) [2*x(1), 8*x(2);
           16*x(1)^3, 2*x(2)];

x0 = [1;1]; % choose the starting value


[sol,iter] = NewtonSolve(F,DF,x0) % apply Newton’s method
-->
sol = 0.68219
0.36559
iter = 5

The standard result for n equations for n unknowns


The situation of n equations with n unknowns can be described with a vector function F~ with domain of
definition Rn , or a subset thereof. Solving the system of n equations is then translated to the search of a


vector ~x ∈ Rn such that

      F~ (~x) = ( f_1 (~x) , f_2 (~x) , f_3 (~x) , . . . , f_n (~x) )ᵀ
             = ( f_1 (x_1, x_2, . . . , x_n) , f_2 (x_1, x_2, . . . , x_n) , . . . , f_n (x_1, x_2, . . . , x_n) )ᵀ = ~0 .

The linear approximation is represented with the help of the matrix DF of partial derivatives,

             [ ∂f_1/∂x_1   ∂f_1/∂x_2   ∂f_1/∂x_3   . . .   ∂f_1/∂x_n ]
             [ ∂f_2/∂x_1   ∂f_2/∂x_2   ∂f_2/∂x_3   . . .   ∂f_2/∂x_n ]
      DF =   [ ∂f_3/∂x_1   ∂f_3/∂x_2   ∂f_3/∂x_3   . . .   ∂f_3/∂x_n ]
             [     ...         ...         ...                ...    ]
             [ ∂f_n/∂x_1   ∂f_n/∂x_2   ∂f_n/∂x_3   . . .   ∂f_n/∂x_n ]

The Taylor approximation can now be written in the form

      F~ (~x + ∆~x) ≈ F~ (~x) + DF(~x) · ∆~x .

Newton’s method is again based on the idea of replacing the nonlinear system by its linear approximation
and then using a good initial guess ~x_0 ∈ Rn :

      F~ (~x_0 + ∆~x) = ~0   −→   F~ (~x_0) + DF(~x_0) · ∆~x = ~0   −→   ~x_1 = ~x_0 + ∆~x .

It is important to understand this basic idea when applying the algorithm to a concrete problem. It will
enable the user to give the algorithm a helping hand when necessary. Quite often Newton is not used as a
black box algorithm, but tuned to the concrete problem.

3–9 Theorem : Let F~ ∈ C² (Rn , Rn) be twice continuously differentiable and for an ~x* ∈ Rn we
have F~ (~x*) = ~0 and the n × n matrix DF(~x*) of partial derivatives is invertible. Then the Newton
iteration
      ~x_{n+1} = ~x_n − (DF(~x_n))⁻¹ · F~ (~x_n)
will converge quadratically to the solution ~x*, provided the initial guess ~x_0 is close enough to ~x*.

The critical point is again the condition that the initial guess ~x_0 has to be close enough to the solution for
the algorithm to converge. Thus the remarks on Newton’s method applied to a single equation (Section 3.1.2)
remain valid.

The above result is not restricted to the space Rn . Using standard analysis on Banach spaces the corre-
sponding result remains valid.

3.1.6 Modifications of Newton’s Method


There are many modifications of the basic idea of Newton’s method.


Numerical evaluation of partial derivatives


If no analytical formula for the partial derivatives ∂f_i/∂x_j is available, then one can consider a finite differ-
ence approximation to these derivatives. Since there are n² partial derivatives this requires at least n² + 1
evaluations of the functions f_i . This might be a delicate problem, and computationally expensive.
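
A minimal sketch of such an approximation (the function name JacobianFD and the forward difference with
a fixed step h are choices made here for illustration only, they are not part of the original code collection):

function J = JacobianFD(F,x,h)
  % forward difference approximation of the matrix of partial derivatives of F at x
  if nargin < 3, h = 1e-6; end%if
  Fx = F(x); n = length(x); J = zeros(length(Fx),n);
  for j = 1:n
    xp = x; xp(j) = xp(j) + h;     % perturb the j-th component only
    J(:,j) = (F(xp) - Fx)/h;       % j-th column of the approximate Jacobian
  end%for
end%function

Observe that this uses n+1 evaluations of the vector valued function F~ , i.e. of the order of n² evaluations of
the individual functions f_i ; the step size h is a compromise between truncation and rounding errors.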

The modified Newton algorithm


The computational effort to determine the n × n matrix DF(~xn ) can be considerable. Thus one can reuse
the same matrix for a fixed number m of steps and only then reevaluate the matrix of partial derivatives.
      ~x_{n+j+1} = ~x_{n+j} − (DF(~x_n))⁻¹ · F~ (~x_{n+j})   for j = 0, 1, 2, . . . , m .
More iterations than with the standard method may be needed, but the computational effort for one step is
smaller.
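
As an illustration (not from the original codes) the idea can be sketched in Octave/MATLAB, here applied to
the system of Example 3–8; the refresh interval m = 4 is an arbitrary choice:

f  = @(x) [x(1)^2 + 4*x(2)^2 - 1; 4*x(1)^4 + x(2)^2 - 1];   % system of Example 3-8
Df = @(x) [2*x(1), 8*x(2); 16*x(1)^3, 2*x(2)];
m = 4; x = [1;1];
for k = 1:30
  if mod(k-1,m) == 0
    [L,U,P] = lu(Df(x));           % recompute and factor the Jacobian only every m-th step
  end%if
  dx = -(U\(L\(P*f(x))));          % reuse the factorization for each linear solve
  x = x + dx;
  if norm(dx) < 1e-10, break; end%if
end%for
disp(x')                           % approximately [0.68219  0.36559]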

Damped Newton’s method


If the initial guess ~x0 is not close enough to the actual solution, then Newton’s method might jump to a
completely different region and continue its search there, see e.g. the computations leading to Figure 4.28
on page 302. To avoid this effect one can at first shorten the step in Newton’s method. For a parameter
0 < α ≤ 1 the iteration formula is modified to
      ~x_{n+1} = ~x_n − α (DF(~x_n))⁻¹ · F~ (~x_n) .
For α = 1 we have the classical formula. For α < 1 we have a damped Newton iteration. In this case we
lose quadratic convergence.
In recent papers [SchnWihl11], [AmreWihl14] Thomas Wihler (University of Bern) proposed a system-
atic approach to choosing the step-sizes, based on an ODE related to the system of equations to be solved.

The damped Newton algorithm is used by the Levenberg–Marquardt algorithm to solve nonlinear
regression problems. At first the parameter α is strictly smaller than 1. As progress is made α approaches 1
to achieve quadratic convergence. This approach is used for nonlinear regression problems in Section 3.5.7
by the command leasqr().

Parameterized Newton’s method


One tool available to circumvent the problem of good initial guesses is to use a parameterized Newton’s
method. It is often known which part of the equations causes problems for convergence.
• Start your computation with the troublesome term turned off and find a solution of the modified
problem.
• Then turn the nonlinear term on step by step and use the previous solution as a initial point for
Newton’s method.
In Example 4–12 we will try to solve the boundary value problem
      −α''(s) = (F_2 / (E I)) cos(α(s))   for 0 < s < L   and   α(0) = α'(L) = 0 .
This is the mathematical model for the bending beam, as shown in Section 1.4 (page 18). For large
values of F_2 this will not give the desired solution. For F_2 = 0 the solution α(s) = 0 is obvious. Thus we
start with F_2 = 0 and then increase F_2 in small steps, solving the BVP each time. If we arrive at the desired value of
F_2 we then have a solution of the original problem. This is more efficient and (even more important)
more reliable than the basic algorithm of Newton.


3.1.7 Octave/MATLAB Commands to Solve Equations


The command fzero() to solve a single equation
With the command fzero() a single equation f (x) = 0 can be solved. It is advisable to give a bracketing
as initial values. To solve x2 − 2 = 0 use

fzero(@(x)x^2-2,[0,2])
-->
ans = 1.4142

A sizable number of options can be used and more outputs are generated, see help fzero.

The command fsolve() to solve one equation or a system


With the command fsolve() a single equation f (x) = 0 or systems f~(~x) = ~0 can be solved. The
algorithm is based on Newton’s method and thus it is essential to provide a good initial guess.
To solve x² − 2 = 0 use

fsolve(@(x)x^2-2,2)
-->
ans = 1.4142

To solve the system in Example 3–8 use

f = @(x)[x(1)^2 + 4*x(2)^2-1; 4*x(1)^4 + x(2)^2-1];


fsolve(f,[1;1])
-->
ans = 0.6822
0.3656

Since Newton’s method requires the Jacobian matrix, fsolve() uses a finite difference approach to de-
termine the matrix of partial derivatives. One can also specify the Jacobian by creating a function [4] with two
return arguments. In addition the option Jacobian has to be set to on.

function [res, Jac] = Fun_System(x)
  res = [x(1)^2 + 4*x(2)^2-1; 4*x(1)^4 + x(2)^2-1];
  if nargout == 2
    Jac = [2*x(1), 8*x(2); 16*x(1)^3, 2*x(2)];
  end%if
end%function

fsolve('Fun_System',[1;1], optimset('Jacobian','on'))
-->
ans = 0.6822
0.3656

A sizable number of options can be used and more outputs are generated, see help fsolve.

[4] With MATLAB this function has to be in a separate file Fun_System.m.


3–10 Example : Using Newton’s algorithm in MATLAB/Octave


In Example 3–14 the solution of the equation x² − 1 − cos x = 0 is needed. In an Octave script first define a
function to evaluate the function f (x) and its derivative. Then create a graph and estimate the location of the
zero as x_0 = 1. Find the result in Figure 3.9. Then a simple call of fsolve() will compute the location
of the zero as x ≈ 1.1765019 . By tracing the calls to the function f (x) one can observe how fsolve()
uses a finite difference approximation to determine the values of the derivative f '(x).

x = 0:0.1:3;

function [y,dy] = f(x)


y = x.*x-1-cos(x); % value of the function
dy = 2*x+sin(x); % value of the derivative
display([x,y]) % show the values of x and y
endfunction

figure(1); plot(x,f(x))
grid on; xlabel('x'); ylabel('f(x) = x^2-1-cos(x)');

[z,info,msg] = fsolve('f',1.0) % without using the derivative

options.Jacobian = 'on'; % use the given derivative
z = fsolve('f',1.0,options)

Figure 3.9: Graph of the function y = x² − 1 − cos(x)

3.1.8 Examples of Nonlinear Equations


3–11 Example : Nonlinear finite difference equations
In Section 4.7, starting on page 297, some examples of nonlinear equations will be examined:

• Example 4–9 examines the stretching of a beam by a given force and variable cross section. The
method of successive substitutions is used.

• Example 4–11 examines the bending of a beam. Large deformations are allowed. Newton’s method
will be used.

• Example 4–12 examines a similar problem, using a parameterized version of Newton’s method.


3–12 Example : Nonlinear methods applied to a tumor growth problem


In Section 6.9, starting on page 457, the idea of linearization or Newton’s method will be used to examine
a tumor growth problem. For the space discretization the finite element method (FEM) will be used and for
the time stepping a nonlinear Crank–Nicolson algorithm. ♦

3–13 Example : Stretching of a beam by a given force and variable cross section
The differential equation describing the longitudinal deformation of a beam with cross section A_0(x) by a
force density f (x) = 0 and a force F at the right end at x = L is given by (see Section 1.3)

      − d/dx ( E A_0(x) (1 − ν u'(x))² u'(x) ) = 0   for 0 < x < L

and the boundary conditions u(0) = 0 and E A_0(L) (1 − ν u'(L))² u'(L) = F . If a function w(x) = u'(x)
for 0 ≤ x ≤ L solves equation (1.15) (page 17)

      ν² w³(x) − 2 ν w²(x) + w(x) = F / (E A_0(x)) ,

then its integral

      u(x) = ∫_0^x w(s) ds
represents the horizontal deflection of a horizontal beam. A horizontal force F is applied at the right end-
point. If F = 0, then the obvious and physically correct solution is w(x) = 0 . For a given function A0 (x)
search for a solution of the above nonlinear equation. The solution plan to be carried out below is as follows:

• Introduce an auxiliary function G to be examined.

• Determine for which domain the equation might be solved, requiring the solution to be realistic.

• Start with a force of F = 0 and increase it step by step. For each force compute the new length of the
beam.

• Plot the force F as a function of the change in length u(L) to confirm Hooke’s law.

Set z = ν u' = ν w and consider the function

      G(z) = z³ − 2 z² + z − ν F / (E A_0(x)) .

The variable to be solved for is z, and x is considered as a parameter. With the help of solutions of the
equation G(z) = 0 construct solutions of the beam problem. In general the solution z will depend on x.
Before launching the computations examine possible failures of the method. Newton’s iteration will fail if
the derivative vanishes. The derivative of G(z) is given by

      d/dz G(z) = 3 z² − 4 z + 1 = (3 z − 1) (z − 1) .

Since G'(0) = 1 we know this derivative to be positive for 0 < z small. The zeros of the derivative G' are
readily determined as

      z = (4 ± √(16 − 12)) / 6 = 1   or   1/3 .

Since the first zero of the derivative is at z = 1/3 expect problems if z ≈ 1/3 . To confirm this examine the
graph of the auxiliary function h shown in Figure 3.10.


h(z) = z³ − 2 z² + z ,   G(z) = h(z) − ν F / (E A_0)

Figure 3.10: Definition and graph of the auxiliary function h

We will start with the force F = 0 and thus w(x) = 0. Then increase the value of F slowly and w
(resp. z) will increase too. We use Newton’s method to determine the function w(x), using the initial value
w_0(x) = 0 to find a solution of the above problem. If the expression

      ν F / (E A_0(x))

is larger than h(1/3) = 4/27 there is no smooth solution any more. If the critical limit for F is exceeded we find
that z = ν u'(x) would have to be larger than 1. This would lead to a negative radius with cross sectional
area A_0 (1 − ν u'(x))², which is obviously mechanical nonsense. This is confirmed by Figure 3.10. The
beam will simply break if the critical limit is exceeded. The Octave code will happily generate numbers and
graphs for larger values of F : GIGO (garbage in, garbage out).

The iteration formula to solve the above equation is given by

      w_{n+1}(x) = w_n(x) − G(w_n(x)) / G'(w_n(x)) .

As a concrete example choose the value ν = 0.3 and the function

      E A_0(x) = (2 − sin(π x / L)) / 2   for 0 ≤ x ≤ L .

This corresponds to a beam with a thinner midsection. These values will be reused in Example 4–9, where
the same problem is solved with the help of a finite difference approximation. Since the minimal value of
E A_0(x) is 1/2 the above condition on F translates to

      F < 4 E A_0 / (27 ν) ≈ 0.24691 .
Thus expect problems beyond this critical value. To find a solution proceed as follows:
• Define the necessary constants and functions

• Choose a number N of grid points on the interval (0 , L)

• Choose a starting function w(x) = 0, resp. vector w~ = ~0

• Choose the forces for which the solution is to be computed. The force should be increased slowly
from 0 to the maximal possible value.


• For each value of the force:

– Run the Newton iteration until the desired accuracy is achieved.


– Compute the new length of the beam with the help of
        u(L) = ∫_0^L w(x) dx .

• Plot the length as a function of the applied force.


The MATLAB/Octave code below and the resulting Figure 3.11 confirm the above observations.

First define all necessary constants and functions.

function testBeam()
clear EA
nu = 0.3; L = 3;

function res = EA(x)


res = (2-sin(x/L*pi))/2;
end%function

function y = G(z,T)
  nu = 0.3;
  y = nu^2*z.^3 - 2*nu*z.^2 + z - T;
end%function

function y = dG(z)
  y = 3*nu^2*z.^2 - 4*nu*z + 1;
end%function

Then run the Newton iteration.

N = 500;
h = L/(N+1); % stepsize
x = (0:h:L)’;
clf; relErrorTol = 1e-10; % choose your relative error tolerance
z = zeros(size(x)); zNew = z;

FList = 0.01:0.01:0.24; maxAmp = zeros(size(FList));


k = 0;
for F = FList
k = k+1;
T = F./EA(x);
relError = 2*relErrorTol;
while relError>relErrorTol;
zNew = z-G(z,T)./dG(z);
relError = max(abs(z-zNew))/max(abs(zNew));
z = zNew;
end%while
maxAmp(k) = trapz(x,zNew/nu);
end%for

u = cumtrapz(x,zNew/nu);
figure(1);
plot(x, u);
grid on; xlabel('position x'); ylabel('displacement u');


figure(2);
plot(maxAmp,FList);
grid on; xlabel('maximal displacement u(L)'); ylabel('force F');
end%function

(a) displacement as function of position      (b) force as function of maximal displacement

Figure 3.11: Graphs for stretching of a beam, with Poisson contraction

The graph in Figure 3.11(b) shows the force F as function of the displacement u(L) at the right endpoint.
If the lateral contraction of the beam were not taken into account (i.e. ν = 0) the result would be a
straight line, confirming Hooke’s law. The Poisson effect weakens the beam, since the area of the cross
sections is reduced and the stress thus increased, i.e. we move from the engineering stress to the true stress. ♦

3–14 Example : In [Kell92, p. 317] the boundary value problem

      −u''(x) = −e^{u(x)}   with   u(−1) = u(1) = 0

is examined. The exact solution is given by

      u(x) = ln( c² / (1 + cos(c x)) ) ,

where the value of the constant c is determined as solution of the equation c² = 1 + cos c . In Example 3–10
Newton’s method is used to find c ≈ 1.1765019 . Now use Newton’s method again to solve the above
nonlinear boundary value problem.
With an approximate solution u_n(x) (start with u_0(x) = 0) search a new solution of the form u_{n+1}(x) =
u_n(x) + φ(x) and examine a linear boundary value problem for the unknown function φ(x). Use the Taylor
approximation e^{u+φ} ≈ e^u + e^u φ = e^u (1 + φ) and solve

      −u_n''(x) − φ''(x) = −e^{u_n(x)+φ(x)} ≈ −e^{u_n(x)} (1 + φ(x))

      −φ''(x) + e^{u_n(x)} φ(x) = u_n''(x) − e^{u_n(x)}   with   φ(−1) = φ(1) = 0 .

This boundary value problem for the function φ(x) can be solved with a finite difference approximation (see
Chapter 4). Let h = 2/(N+1) and x_i = −1 + i h for i = 1, 2, 3, . . . , N . With u_i = u(x_i) and φ_i = φ(x_i) obtain
a finite difference approximation of the second order derivative

      −φ''(x_i) ≈ (−φ_{i−1} + 2 φ_i − φ_{i+1}) / h²


and thus a system of linear equations for the unknowns φ_i . Use φ_0 = φ_{N+1} = 0.

      (−φ_{i−1} + 2 φ_i − φ_{i+1}) / h² + e^{u_i} φ_i = −(−u_{i−1} + 2 u_i − u_{i+1}) / h² − e^{u_i} = b_i   for i = 1, 2, 3, . . . , N
Using a matrix notation this leads to

      [ 2/h² + e^{u_1}   −1/h²                                                 ]   [ φ_1     ]   [ b_1     ]
      [ −1/h²    2/h² + e^{u_2}   −1/h²                                        ]   [ φ_2     ]   [ b_2     ]
      [          −1/h²    2/h² + e^{u_3}   −1/h²                               ] · [ φ_3     ] = [ b_3     ]
      [                     ...       ...       ...                            ]   [  ...    ]   [  ...    ]
      [                   −1/h²    2/h² + e^{u_{N−1}}   −1/h²                  ]   [ φ_{N−1} ]   [ b_{N−1} ]
      [                            −1/h²          2/h² + e^{u_N}               ]   [ φ_N     ]   [ b_N     ]

Solve this system of linear equations and then restart with un+1 (x) = un (x) + φ(x). The matrix A in
Aφ ~ = ~b has a tridiagonal structure. For this type of problem special algorithms exist5 . The above algorithm
is implemented in Octave/MATLAB.

N = 201; h = 2/(N+1); % number of grid points and stepsize
x = (-1:h:1)';
c = 1.176501940; uexact = log(c^2./(1+cos(c*x)));

%%%% Newton %%%%%%%%
%% build the tridiagonal matrix
di = 2*ones(N,1)/h^2 ; % main diagonal
up = -ones(N-1,1)/h^2; % upper and lower diagonal

Niterations = 5; errorNewton = zeros(Niterations,1);
u = zeros(N,1);
for k = 1:Niterations
  g = diff(diff([0;u;0]))/h^2 - exp(u);
  u = u + trisolve(di+exp(u),up,g);
  errorNewton(k) = max(abs(uexact-[0;u;0]));
end%for
errorNewton

The result, shown below, illustrates that the algorithm stabilizes after the fourth step. The error does not
decrease any more. The remaining error is dominated by the number of grid points N = 201. When setting
N = 20001 the error decreases to 2.3 · 10⁻¹⁰. This effect can only be illustrated using the known exact
solution. In real world problems this is not the case and we would stop the iteration as soon as enough digits
do not change any more.

errorNewton =
1.611231387e-02
2.683166420e-05
1.913113033e-06
1.913054659e-06
1.913054659e-06

[5] An implementation is given in Octave as command trisolve. For MATLAB a similar code is provided in Table 3.19 in the
file tridiag.m . With newer versions one can use sparse matrices to solve tridiagonal systems efficiently.

SHA 21-5-21
CHAPTER 3. NUMERICAL TOOLS 127

The above code is listed in Table 3.19 as file Keller.m. In this file the method of successive substitu-
tions is also applied to the problem. A graph with the errors for Newton’s method and successive substitu-
tions is generated. With this code you can verify that both methods converge, but Newton’s convergence rate is
two, while successive substitution converges linearly. In Table 3.3 find a comparison of Newton’s method
and the partial substitution approach applied to problems similar to the above. ♦

Substitution Newton
convergence linear, slow quadratic, fast
complexity of code very simple intermediate
good starting values necessary yes yes
derivative of f (u) required no yes
solve a new linear system for each step no yes

Table 3.3: Compare partial substitution method and Newton’s method

3–15 Example : In the previous example the BVP (Boundary Value Problem)

      −u''(x) = −e^{u(x)}   with   u(−1) = u(1) = 0

was solved by the following steps:

1. Linearize the BVP.

2. Transform the linearized BVP into a system of linear equations, using finite differences.

3. Solve the resulting system of linear equations.

One may also try to apply the operations in a different order:

1. Transform the nonlinear BVP into a system of nonlinear equations, using finite differences.

2. Linearize this system of nonlinear equations.

3. Solve the resulting system of linear equations.

With finite differences the system of nonlinear equations to be solved is

      (−u_{i−1} + 2 u_i − u_{i+1}) / h² = −e^{u_i}   for i = 1, 2, . . . , n .

If u(x) + φ(x) is a small perturbation of u(x) we use the linear approximation e^{u+φ} ≈ e^u (1 + φ) and the
fact that the difference operation on the left is linear. We find

      (−u_{i−1} + 2 u_i − u_{i+1}) / h² + (−φ_{i−1} + 2 φ_i − φ_{i+1}) / h² = −e^{u_i} (1 + φ_i)

or

      (−φ_{i−1} + 2 φ_i − φ_{i+1}) / h² + e^{u_i} φ_i = −(−u_{i−1} + 2 u_i − u_{i+1}) / h² − e^{u_i} .
This system of linear equations is identical to the previous problem and consequently one will find identical
results. ♦


3–16 Example : In the previous example only the right hand side of the BVP contained a nonlinear
function. The method applies also to problems with nonlinear coefficient functions. Consider
      − ( a(u(x)) u'(x) )' = f (u(x))   with   u(0) = u(1) = 0

and use Newton’s method, i.e. for a known starting function u search for u+φ and determine φ as a solution
of a linear problem. Then restart with the new function u1 = u + φ. Use the linear approximations

a(u + φ) ≈ a(u) + a0 (u) · φ


f (u + φ) ≈ f (u) + f 0 (u) · φ
a(u + φ) (u + φ)0 ≈ (a(u) + a0 (u) · φ) (u + φ)0
≈ a(u) u0 + a0 (u) u0 φ + a(u) φ0

to replace the original nonlinear differential equation for u with a linear equation for the unknown function φ.
0
− a0 (u) u0 φ + a(u) φ0 = (a(u) u0 )0 + f (u) + f 0 (u) φ
− (a(u) φ)00 − f 0 (u) φ = (a(u) u0 )0 + f (u)

The finite difference approximation of the expression (a(u) u0 )0 is given in Example 4–8 on page 268. There
are two possible options to solve the above linear BVP for the unknown function φ.
• Option 1: Finite difference approximation of the expression
b(x − h) u(x − h) − 2 b(x) u(x) + b(x + h) u(x + h)
(b(x) u(x))00 ≈
h2
or with a matrix notation
   
−2 b1 b2 u1
   
 b1 −2 b2 b3   u2 
   
b2 −2 b3 b4   u3
   
 
1    
 b3 −2 b4 b5   u
· 4

h2 
 
  .
..

.   .
  .
 
 
   

 bN −2 −2 bN −1 bN   u
  N −1


bN −1 −2 bN uN

• Option 2: If the coefficient function a(u) is strictly positive introduce a new function w(x) =
  a(u(x)) φ(x). Since φ = (1/a) a φ = (1/a) w find the new differential equation

      −w'' − (f '(u) / a(u)) w = (a(u) u')' + f (u)

  for the unknown function w(x). Once w(x) is computed use φ(x) = w(x) / a(u(x)).
Then restart with the new approximation un (x) = u(x) + φ(x). ♦

3.1.9 Optimization with MATLAB/Octave


The command fminbnd() to find minima of a function with one variable

The function f (x) = sin(x) has a local minimum at x = 3π/2 and the command fminbnd() will find this
location between 2 and 8 by


fminbnd(@(x)sin(x),2,8)/pi
-->
ans = 1.50000

The function fminbnd() will only search between x = 2 and x = 8. A sizable number of options can be
used and more outputs are generated, see help fminbnd.

The command fminsearch() to find minima of a function with multiple variables


3–17 Example : One can also optimize functions of multiple variables. Instead of a maximum of

      f (x, y) = −2 x² − 3 x y − 2 y² + 5 x + 2 y

search for a minimum of −f (x, y). Examine the graph of f (x, y) in Figure 3.12.
Octave
[xx,yy] = meshgrid( [-1:0.1:4],[-2:0.1:2]);

function res = f(x,y)
  res = -2*x.^2 - 3*x.*y - 2*y.^2 + 5*x + 2*y;
endfunction

surfc(xx,yy,f(xx,yy)); xlabel('x'); ylabel('y');

Figure 3.12: Graph of a function h = f (x, y), with contour lines

Using the graph conclude that there is a maximum not too far away from (x, y) ≈ (1.5 , 0). Now use
fminsearch() with the function −f and the above starting values.
Octave
xMin = fminsearch(@(x)-f(x(1),x(2)),[1.5,0])
-->
xMin = 2.0000 -1.0000

A sizable number of options can be used and more outputs are generated, see help fminsearch. ♦
To examine the accuracy of the extreme point consider having a closer look at the contour levels.


3.2 Eigenvalues and Eigenvectors of Matrices, SVD, PCA


3.2.1 Matrices and Linear Mappings
Matrices of size m×n can be used to describe linear mappings from Rn into Rm . The columns of A contain
the images of the standard basis vectors in Rn . This is best illustrated by an example.
Examine the matrix

      A = [ 1      0.5  ]
          [ 0.25   0.75 ]

representing a linear mapping from R² to R², i.e. for an arbitrary vector ~x ∈ R² the image is given by

      ~x = (x_1 , x_2)ᵀ  ↦  A ~x = ( 1 x_1 + 0.5 x_2 ,  0.25 x_1 + 0.75 x_2 )ᵀ .

For the standard basis vectors observe

      A ~e_1 = (1 , 0.25)ᵀ   and   A ~e_2 = (0.5 , 0.75)ᵀ .

The columns of A contain the images of the standard basis vectors. Figure 3.13 visualizes this linear
mapping.
Figure 3.13: A linear mapping applied to a rectangle

3–18 Example : Orthogonal matrices, unitary matrices


A real matrix U ∈ Mn×n is called an orthogonal matrix [6] iff

      Uᵀ U = I_n .

[6] This author actually would prefer the notation of an orthonormal matrix.


For matrices with complex entries this is called an unitary matrix. The column vectors have length 1
and are pairwise orthogonal. To examine this property consider the columns of U as vectors ~ui ∈ Rn ,
i.e. U = [~u1 , ~u2 , ~u3 , . . . , ~un ] ∈ Mn×n . Then examine the scalar products of these column vectors, as
components of a matrix product.
 
$$U^T\,U = \begin{bmatrix} \vec u_1^{\,T} \\ \vec u_2^{\,T} \\ \vec u_3^{\,T} \\ \vdots \\ \vec u_n^{\,T} \end{bmatrix}\,[\vec u_1,\ \vec u_2,\ \vec u_3,\ \ldots,\ \vec u_n]
= \begin{bmatrix}
\langle\vec u_1,\vec u_1\rangle & \langle\vec u_1,\vec u_2\rangle & \langle\vec u_1,\vec u_3\rangle & \cdots & \langle\vec u_1,\vec u_n\rangle \\
\langle\vec u_2,\vec u_1\rangle & \langle\vec u_2,\vec u_2\rangle & \langle\vec u_2,\vec u_3\rangle & \cdots & \langle\vec u_2,\vec u_n\rangle \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
\langle\vec u_n,\vec u_1\rangle & \langle\vec u_n,\vec u_2\rangle & \langle\vec u_n,\vec u_3\rangle & \cdots & \langle\vec u_n,\vec u_n\rangle
\end{bmatrix}
= \begin{bmatrix}
1 & 0 & \cdots & 0 \\
0 & 1 & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
0 & 0 & \cdots & 1
\end{bmatrix} = I_n$$

For an orthogonal matrix use U−1 = UT , i.e. the inverse matrix is easily given.

Multiplying a vector ~x by an orthogonal matrix U does not change the length of the vector nor the angles between vectors. To verify this fact use the scalar product again.
$$\|U\vec x\|^2 = \langle U\vec x, U\vec x\rangle = \langle \vec x, U^T U\vec x\rangle = \langle\vec x, I\,\vec x\rangle = \|\vec x\|^2$$
$$\cos(\angle(U\vec x, U\vec y)) = \frac{\langle U\vec x, U\vec y\rangle}{\|\vec x\|\,\|\vec y\|} = \frac{\langle\vec x, U^T U\vec y\rangle}{\|\vec x\|\,\|\vec y\|} = \frac{\langle\vec x, \vec y\rangle}{\|\vec x\|\,\|\vec y\|} = \cos(\angle(\vec x, \vec y))$$
Thus multiplying by an orthogonal matrix U corresponds to a rotation in Rn and possibly one reflection. If
det(U) = −1, there is a reflection involved, with det(U) = +1 it is rotations only.

In the plane R2 the orthogonal matrices with angle of rotation α are the well known rotation matrices
$$U = \begin{bmatrix} +\cos(\alpha) & -\sin(\alpha) \\ +\sin(\alpha) & +\cos(\alpha) \end{bmatrix} \qquad\text{or}\qquad U = \begin{bmatrix} +\cos(\alpha) & +\sin(\alpha) \\ +\sin(\alpha) & -\cos(\alpha) \end{bmatrix}.$$
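A small numerical check of these properties for a rotation matrix; a minimal sketch with an arbitrarily chosen angle.
Octave
alpha = 0.7; U = [cos(alpha) -sin(alpha); sin(alpha) cos(alpha)];
disp(U'*U)                     % identity matrix, up to rounding errors
x = [1;2]; y = [3;-1];
[norm(U*x), norm(x)]           % lengths are preserved
[dot(U*x,U*y), dot(x,y)]       % scalar products (and thus angles) are preserved
det(U)                         % +1, i.e. a pure rotation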

3.2.2 Eigenvalues and Diagonalization of Matrices


In these notes only real valued matrices are examined, i.e. A ∈ Mm×n = Rm×n .

Definition of eigenvalues and eigenvectors


3–19 Definition :
• A number λ ∈ C is called an eigenvalue with corresponding eigenvector ~u ≠ ~0 of the real matrix A ∈ Mn×n iff
  A ~u = λ ~u .    (3.2)
• A number λ ∈ C is called a generalized eigenvalue with corresponding eigenvector ~u ≠ ~0 of the real matrix A ∈ Mn×n and the weight matrix B ∈ Mn×n iff
  A ~u = λ B ~u .    (3.3)


3–20 Observation :
• If λ is an eigenvalue then (A − λ I) ~u = ~0 with ~u ≠ ~0. Thus the matrix A − λ I ∈ Mn×n is not invertible and λ is a zero of the characteristic polynomial
  p(λ) = det(A − λ I) = 0 .
  To determine the characteristic polynomial subtract λ along the diagonal of the matrix A and compute the determinant. This is a polynomial of degree n and consequently any n × n matrix has exactly n eigenvalues λi for i = 1, 2, 3, . . . , n, counted with multiplicity. These eigenvalues can be real or complex.
• Not all eigenvalues have their "own" eigenvector. The matrix
  $$A = \begin{bmatrix} 0 & 1 \\ 0 & 0 \end{bmatrix}$$
  has a double eigenvalue λ1 = λ2 = 0 but only one eigenvector ~u = (1, 0)^T. In this case λ = 0 has algebraic multiplicity 2, but geometric multiplicity 1. This matrix can not be diagonalized. The mathematical tool to be used in this special case are Jordan normal forms, see e.g. [HornJohn90, §3] or https://en.wikipedia.org/wiki/Jordan normal form. In these notes only diagonalizable matrices are examined and used.
• If λ = α + i β ∈ C with β ≠ 0 is a complex eigenvalue with eigenvector ~u + i ~v, then λ̄ = α − i β is an eigenvalue too, with eigenvector ~u − i ~v. To verify this examine
  $$\begin{aligned}
  A\,(\vec u + i\,\vec v) &= (\alpha + i\,\beta)\,(\vec u + i\,\vec v) = (\alpha\,\vec u - \beta\,\vec v) + i\,(\beta\,\vec u + \alpha\,\vec v)\\
  A\,\vec u &= \alpha\,\vec u - \beta\,\vec v \qquad\text{real part}\\
  A\,\vec v &= \beta\,\vec u + \alpha\,\vec v \qquad\text{imaginary part}\\
  A\,(\vec u - i\,\vec v) &= (\alpha - i\,\beta)\,(\vec u - i\,\vec v) = (\alpha\,\vec u - \beta\,\vec v) - i\,(\beta\,\vec u + \alpha\,\vec v)
  \end{aligned}$$
  or with a matrix notation
  $$A\,\bigl[\vec u \;\; \vec v\bigr] = \bigl[\vec u \;\; \vec v\bigr]\begin{bmatrix} +\alpha & +\beta \\ -\beta & +\alpha \end{bmatrix}.$$
  The two vectors ~u ∈ Rn\{~0} and ~v ∈ Rn\{~0} are not zero, verified by a contradiction argument.
  $$\vec v = \vec 0 \;\Longrightarrow\; A\,\vec v = +\beta\,\vec u = \vec 0 \;\Longrightarrow\; \vec u = \vec 0\,,\qquad
  \vec u = \vec 0 \;\Longrightarrow\; A\,\vec u = -\beta\,\vec v = \vec 0 \;\Longrightarrow\; \vec v = \vec 0$$
  The two vectors ~u, ~v ∈ Rn are (real) linearly independent, again verified by contradiction:
  $$\vec v = c\,\vec u \;\Longrightarrow\; A\,\vec u = \alpha\,\vec u - \beta\,\vec v = (\alpha - \beta\,c)\,\vec u$$
  and ~u would be an eigenvector with real eigenvalue (α − β c).
• If λ is a generalized eigenvalue and the invertible weight matrix B has a Cholesky factorization B = R^T R, then
  $$A\,\vec u = \lambda\,B\,\vec u = \lambda\,R^T R\,\vec u \;\Longrightarrow\; R^{-T} A\,R^{-1}\,(R\,\vec u) = \lambda\,(R\,\vec u)$$
  and thus λ is a regular eigenvalue of the matrix R^{-T} A R^{-1} with eigenvector R ~u. Often B = diag([w1, w2, w3, . . . , wn]) = diag(wi) is a diagonal matrix with positive entries wi and thus R = diag(√wi) and λ is an eigenvalue of diag(1/√wi) A diag(1/√wi). Rows and columns of A have to be divided by √wi, which preserves the symmetry of A.
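A minimal sketch of a numerical check for the case of a diagonal weight matrix B = diag(wi): the generalized eigenvalues of (A, B) coincide with the regular eigenvalues of the scaled matrix diag(1/√wi) A diag(1/√wi). The test matrix is chosen arbitrarily for illustration.
Octave
A = [4 1 0; 1 3 1; 0 1 2];          % symmetric test matrix
w = [1; 10; 100]; B = diag(w);      % diagonal weight matrix with positive entries
lambda1 = sort(eig(A,B))            % generalized eigenvalues
S = diag(1./sqrt(w));
lambda2 = sort(eig(S*A*S))          % eigenvalues of the scaled, still symmetric matrix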


3–21 Result : Facts on real, square matrices


Examine a real n × n matrix A. If A has n distinct real eigenvalues λ1 < λ2 < λ3 < . . . < λn then the following facts can be verified (by mathematicians).

• The corresponding eigenvectors ~vi are linearly independent. Thus the matrix with the eigenvectors as
columns is invertible.
V = [~v1 , ~v2 , ~v3 , . . . , ~vn ] ∈ Mn×n

• The eigenvalue property A ~vi = λi ~vi can be written in the matrix form

A V = V diag(λi )

and consequently
V−1 A V = diag(λi ) and A = V diag(λi ) V−1 .
This is a diagonalization of the matrix A.

• Any vector ~x ∈ Rn can be written as a linear combination of the eigenvectors, i.e.


n
X
~x = ci ~vi .
i=1

As a system of linear equations this reads as
$$V\,\vec c = [\vec v_1,\ \vec v_2,\ \vec v_3,\ \ldots,\ \vec v_n]\begin{pmatrix} c_1 \\ c_2 \\ c_3 \\ \vdots \\ c_n \end{pmatrix} = \begin{pmatrix} x_1 \\ x_2 \\ x_3 \\ \vdots \\ x_n \end{pmatrix} = \vec x \qquad\Longrightarrow\qquad \vec c = V^{-1}\,\vec x\,.$$

If the eigenvalues are not isolated (i.e. λi = λj) then some of the above results may fail. The eigenvectors ~vi might not be linearly independent any more. As a consequence not all vectors ~x are generated as linear combinations of eigenvectors. The situation improves drastically for symmetric matrices.


3–22 Result : Facts on symmetric, real matrices


If A is a real, symmetric n × n matrix, then the following facts can be verified (by mathematicians).
• A has n real eigenvalues λ1 ≤ λ2 ≤ λ3 ≤ . . . ≤ λn .
• There are n eigenvectors ~ej for 1 ≤ j ≤ n with A ~ej = λj ~ej . All eigenvectors have length 1 and
they are pairwise orthogonal, i.e. h~ej , ~ei i = 0 if i 6= j. If λi 6= λj this is easy to verify.

(λi − λj ) h~ei , ~ej i = λi h~ei , ~ej i − λj h~ei , ~ej i = hλi ~ei , ~ej i − h~ei , λj ~ej i
= hA ~ei , ~ej i − h~ei , A ~ej i = hA ~ei , ~ej i − hA ~ei , ~ej i = 0

Even if λi = λj the orthogonality h~ei , ~ej i = 0 can be preserved.


• Examine the orthogonal matrix Q, with the normalized eigenvectors ~ei as columns, i.e.

Q = [~e1 , ~e2 , ~e3 , . . . , ~en ] with QT · Q = In and Q−1 = QT .

This leads to
$$A\cdot Q = Q\cdot\operatorname{diag}(\lambda_j)$$
$$Q^T\cdot A\cdot Q = \operatorname{diag}(\lambda_j) = \begin{bmatrix} \lambda_1 & & & \\ & \lambda_2 & & \\ & & \ddots & \\ & & & \lambda_n \end{bmatrix}$$
$$A = Q\cdot\operatorname{diag}(\lambda_j)\cdot Q^T$$

This process is called diagonalization of the symmetric matrix A. This result is extremely useful,
e.g. to simplify general stress or strain situations to principal stress or strain situations, see Section 5.3
starting on page 326.
• Each vector ~x ∈ Rn can be written as a linear combination of the eigenvectors ~ei. The result for the general matrix leads to
  $$\vec c = Q^{-1}\vec x = Q^T\vec x = \begin{bmatrix} \vec e_1^{\,T} \\ \vec e_2^{\,T} \\ \vec e_3^{\,T} \\ \vdots \\ \vec e_n^{\,T} \end{bmatrix}\begin{pmatrix} x_1 \\ x_2 \\ x_3 \\ \vdots \\ x_n \end{pmatrix} = \begin{pmatrix} \langle\vec e_1, \vec x\rangle \\ \langle\vec e_2, \vec x\rangle \\ \langle\vec e_3, \vec x\rangle \\ \vdots \\ \langle\vec e_n, \vec x\rangle \end{pmatrix}$$
  $$\vec x = \sum_{i=1}^n c_i\,\vec e_i = \sum_{i=1}^n \langle\vec x, \vec e_i\rangle\,\vec e_i$$

• The inverse matrix is easily computed if all eigenvalues and vectors are already known. Then use
  $$A^{-1} = (Q\cdot\operatorname{diag}(\lambda_j)\cdot Q^T)^{-1} = Q^{-T}\cdot\operatorname{diag}(\lambda_j)^{-1}\cdot Q^{-1} = Q\cdot\operatorname{diag}(\tfrac{1}{\lambda_j})\cdot Q^T\,.$$
  This is not an efficient way to determine the inverse matrix, since the eigenvalues and vectors are difficult to determine. The result is more useful for analytical purposes.
3
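A minimal sketch verifying the diagonalization and the expansion of a vector in the orthonormal eigenvector basis, using a small symmetric matrix chosen only for illustration.
Octave
A = [2 1 0; 1 3 1; 0 1 4];              % symmetric test matrix
[Q, L] = eig(A);                        % orthonormal eigenvectors and eigenvalues
norm(A - Q*L*Q')                        % approximately zero: A = Q diag(lambda) Q'
x = [1; 2; 3];
c = Q'*x;                               % coefficients c_i = <e_i, x>
norm(x - Q*c)                           % approximately zero: x = sum c_i e_i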


Another characterization of the eigenvalues of a symmetric, positive definite matrix A can be based on the Rayleigh quotient
$$\rho(\vec x) = \frac{\langle\vec x, A\,\vec x\rangle}{\langle\vec x, \vec x\rangle}\,.$$
Assume that the eigenvalues λ1 ≤ λ2 ≤ λ3 ≤ . . . ≤ λn are sorted. When looking for an extremum of the function ⟨~x, A ~x⟩, subject to the constraint ‖~x‖ = 1, use the Lagrange multiplier theorem and
$$\nabla\langle\vec x, A\,\vec x\rangle = 2\,A\,\vec x \qquad\text{and}\qquad \nabla\langle\vec x, \vec x\rangle = 2\,\vec x$$
to conclude that A ~x = λ ~x for some factor λ. Using ⟨~x, A ~x⟩ = ⟨~x, λ ~x⟩ = λ ‖~x‖² conclude
$$\lambda_1 = \min_{\|\vec x\|=1}\langle\vec x, A\,\vec x\rangle \qquad\text{and}\qquad \lambda_n = \max_{\|\vec x\|=1}\langle\vec x, A\,\vec x\rangle\,.$$
For other eigenvalues use a slight modification of this result. If ~e1 is an eigenvector to the first eigenvalue use the fact that the eigenvectors to strictly larger eigenvalues are orthogonal to ~e1. This leads to a method to determine λ2 by
$$\lambda_2 = \min\{\langle\vec x, A\,\vec x\rangle \mid \|\vec x\| = 1 \text{ and } \vec x \perp \vec e_1\}\,.$$
This result can be extended in the obvious way to obtain λ3 by
$$\lambda_3 = \min\{\langle\vec x, A\,\vec x\rangle \mid \|\vec x\| = 1,\ \vec x \perp \vec e_1 \text{ and } \vec x \perp \vec e_2\}\,.$$
The other eigenvalues can also be characterized by looking at subspaces by the Courant–Fischer Minimax Theorem, see [GoluVanLoan96, Theorem 8.1.2] or [Axel94, Lemma 3.13]:
$$\lambda_k = \max_{\dim S = n-k+1}\ \min_{\vec x\in S\setminus\{\vec 0\}} \frac{\langle\vec x, A\,\vec x\rangle}{\langle\vec x, \vec x\rangle}\,.$$
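This extremal characterization can be checked numerically; a minimal sketch that minimizes the Rayleigh quotient with fminsearch() (the test matrix and the random starting vector are illustrative assumptions) and compares the result with the smallest eigenvalue.
Octave
A = [4 1 0; 1 3 1; 0 1 2];                 % symmetric, positive definite test matrix
rho = @(x) (x'*A*x)/(x'*x);                % Rayleigh quotient
x0 = randn(3,1);                           % random starting vector
xmin = fminsearch(rho, x0);
[rho(xmin), min(eig(A))]                   % the two values agree (approximately)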

3–23 Example : The matrix An presented in Section 2.3.1 (page 32)


 
$$A_n = \frac{1}{h^2}\begin{bmatrix}
2 & -1 & & & & \\
-1 & 2 & -1 & & & \\
 & -1 & 2 & -1 & & \\
 & & \ddots & \ddots & \ddots & \\
 & & & -1 & 2 & -1 \\
 & & & & -1 & 2
\end{bmatrix}$$
with h = 1/(n+1) is the finite difference approximation of the second derivative, i.e.
$$-\frac{d^2}{dx^2}\,u(x) = f(x) \qquad\longrightarrow\qquad A_n\,\vec u = \vec f\,.$$
The exact eigenvalues are given by
$$\lambda_j = \frac{4}{h^2}\,\sin^2\!\left(j\,\frac{\pi h}{2}\right) \qquad\text{for } 1\le j\le n$$
and the eigenvectors ~vj are generated by discretizing the function sin(j π x) over the interval [0, 1] and thus have j extrema within the interval.
$$\vec v_j = \left(\sin(\tfrac{1\,j\,\pi}{n+1}),\ \sin(\tfrac{2\,j\,\pi}{n+1}),\ \sin(\tfrac{3\,j\,\pi}{n+1}),\ \ldots,\ \sin(\tfrac{(n-1)\,j\,\pi}{n+1}),\ \sin(\tfrac{n\,j\,\pi}{n+1})\right)$$
♦
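A minimal sketch comparing the numerically computed eigenvalues of A_n with the exact formula, for a small n:
Octave
n = 10; h = 1/(n+1);
A_n = spdiags(ones(n,1)*[-1 2 -1],[-1 0 1],n,n)/h^2;    % finite difference matrix
lambda_exact   = 4/h^2*sin((1:n)*pi*h/2).^2;            % exact eigenvalues
lambda_numeric = sort(eig(full(A_n)))';                 % computed eigenvalues
norm(lambda_exact - lambda_numeric)                     % approximately zero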


3–24 Example : The mapping generated by a symmetric matrix


For a symmetric matrix A the factorization

A = Q diag(λi ) QT = Q diag(λi ) Q−1

implies that the mapping generated by the matrix has a special structure:

1. ~y = Q−1 ~x = QT ~x will rotate the vector ~x to obtain ~y .

2. ~z = diag(λi ) ~y will stretch or compress the i–th component yi by the factors λi .

3. A ~x = Q ~z will invert the rotation from the first step.

This behavior is consistent with the mapping of the eigenvectors ~vi , i.e. the columns of Q, since A ~vi =
λi ~vi . Vectors in the eigenspace spanned by ~vi are stretched (or compressed) by the factor λi . The image
of the unit circle in R2 after a mapping by a symmetric matrix is thus an ellipse with the semi-axis in the
direction of the eigenvectors and the length given by the eigenvalues. The similar result for ellipsoids as
images of the unit sphere in Rn is valid. ♦

3.2.3 Level Sets of Quadratic Forms


3–25 Example : Level sets of quadratic forms
The quadratic form generated by a symmetric n × n matrix A is given by the function

f (~x) = h~x, A~xi for all ~x ∈ Rn .

These expressions have many applications, amongst them level sets for general Gaussian distributions (e.g.
Figures 3.19, 3.20 and 3.23) or the regions of confidence for the parameters determined by linear or nonlinear
regression (e.g. Figures 3.49, 3.58 and 3.59).
Use $\vec x = \sum_{i=1}^n \frac{1}{\sqrt{\lambda_i}}\,c_i\,\vec e_i$ with $\sum_{i=1}^n c_i^2 = r^2$ to examine the quadratic form
$$\langle\vec x, A\,\vec x\rangle = \Bigl\langle \sum_{i=1}^n \frac{c_i}{\sqrt{\lambda_i}}\,\vec e_i\,,\ A\sum_{j=1}^n \frac{c_j}{\sqrt{\lambda_j}}\,\vec e_j \Bigr\rangle = \sum_{i=1}^n\sum_{j=1}^n \frac{c_i\,c_j}{\sqrt{\lambda_i\,\lambda_j}}\,\langle\vec e_i, A\,\vec e_j\rangle = \sum_{i=1}^n\sum_{j=1}^n \frac{c_i\,c_j}{\sqrt{\lambda_i\,\lambda_j}}\,\langle\vec e_i, \lambda_j\,\vec e_j\rangle = \sum_{i=1}^n \frac{1}{\lambda_i}\,c_i^2\,\lambda_i = r^2\,.$$

Thus the eigenvalues and eigenvectors of A characterize the contour levels.

For n = 2 the values (c1, c2) = r (cos(α), sin(α)) satisfy c1² + c2² = r² and thus the vectors ~x = (r/√λ1) cos(α) ~e1 + (r/√λ2) sin(α) ~e2 satisfy ⟨~x, A~x⟩ = r². The code below generates the level curves for ⟨~x, A~x⟩ = 1 visible in Figure 3.14(a) and the rescaled eigenvectors.

A = [2 1.5; 1.5 3];


[Evec, Eval] = eig(A)
alpha = linspace(0,2*pi);
Points = Evec*inv(sqrt(Eval))*[cos(alpha);sin(alpha)];
figure(1); plot(Points(1,:),Points(2,:),’b’)
hold on
plot([0,Evec(1,1)]/sqrt(Eval(1,1)),[0,Evec(2,1)]/sqrt(Eval(1,1)),’k’)
plot([0,Evec(1,2)]/sqrt(Eval(2,2)),[0,Evec(2,2)]/sqrt(Eval(2,2)),’k’)
xlabel(’x_1’); ylabel(’x_2’)
axis([-1 1 -1 1]); axis equal
hold off



(a) 2D level curve in R2 (b) 3D level surface in R3

Figure 3.14: Level curve or surface of quadratic forms in R2 or R3. The level curve and surface are shown in blue and in black the eigenvectors divided by √λi.

Similar arguments generate the level surface in R3, visible in Figure 3.14(b). The computations use spherical coordinates
$$\begin{pmatrix} x \\ y \\ z \end{pmatrix} = r\begin{pmatrix} \cos(\theta)\cos(\phi) \\ \cos(\theta)\sin(\phi) \\ \sin(\theta) \end{pmatrix} \qquad\text{for } 0\le\phi\le 2\pi \text{ and } -\tfrac{\pi}{2}\le\theta\le\tfrac{\pi}{2}\,.$$

A = [2 1.5 1; 1.5 3 0;1 0 3];


[Evec, Eval] = eig(A)
phi = linspace(0,2*pi,51); theta = linspace(-pi,pi,51);
x = cos(phi’)*cos(theta); y = sin(phi’)*cos(theta); z = ones(size(phi’))*sin(theta);
Points = Evec*inv(sqrt(Eval))*[x(:),y(:),z(:)]’;
figure(1); clf
plot3(Points(1,:),Points(2,:),Points(3,:),’.b’)
hold on
plot3([0,Evec(1,1)]/sqrt(Eval(1,1)),...
[0,Evec(2,1)]/sqrt(Eval(1,1)),...
[0,Evec(3,1)]/sqrt(Eval(1,1)),’k’)
plot3([0,Evec(1,2)]/sqrt(Eval(2,2)),...
[0,Evec(2,2)]/sqrt(Eval(2,2)),...
[0,Evec(3,2)]/sqrt(Eval(2,2)),’k’)
plot3([0,Evec(1,3)]/sqrt(Eval(3,3)),...
[0,Evec(2,3)]/sqrt(Eval(3,3)),...
[0,Evec(3,3)]/sqrt(Eval(3,3)),’k’)
xlabel(’x_1’); ylabel(’x_2’); zlabel(’x_3’);
axis([-1 1 -1 1 -1 1]); axis equal
view([-80 30])
hold off


3–26 Example : Ellipsoids as level sets


In the previous example the level curve of a quadratic function was visualized using eigenvalues and eigen-
vectors of a symmetric positive definite matrix A. The process can be used in reverse order: given the three
semi axis of an ellipsoid in R3 by three orthogonal directions d~i and the corresponding lengths li , find the
quadratic form f (~x) = h~x, A~xi such that the ellipsoid is given as level set f (~x) = 1.
Assuming three directions d~i are normalized (kd~i k = 1), then the matrix Q = [d~1 , d~2 , d~3 ] satisfies

QT Q = I3 and thus Q−1 = QT .

Any vector ~x ∈ R3 can be written as linear combination of the three vectors d~i , i.e. ~x = c1 d~1 + c2 d~2 + c3 d~3
with
Q ~c = ~x and k~ck = kQT ~xk = k~xk .
Then the matrix
$$A = Q\begin{bmatrix} 1/l_1^2 & 0 & 0 \\ 0 & 1/l_2^2 & 0 \\ 0 & 0 & 1/l_3^2 \end{bmatrix}Q^T = (Q\,\operatorname{diag}(1/l_i))\,(\operatorname{diag}(1/l_i)\,Q^T)$$
has the three eigenvectors ~di and the eigenvalues λi = 1/li². It satisfies
$$\langle \pm l_i\,\vec d_i,\ A\,(\pm l_i\,\vec d_i)\rangle = (\pm 1)^2\,\langle l_i\,\vec d_i,\ \lambda_i\,l_i\,\vec d_i\rangle = l_i^2\,\frac{1}{l_i^2}\,\langle\vec d_i, \vec d_i\rangle = \|\vec d_i\|^2 = 1\,,$$
i.e. the six endpoints of the axes of the ellipsoid are on the level set f(~x) = 1. Example 3–24 shows that multiplying by the matrix A corresponds to a sequence of three mappings:

1. ~y = Q^{-1} ~x = Q^T ~x will rotate the vector ~x to obtain ~y.

2. ~z = diag(1/li²) ~y will stretch or compress the i–th component yi by the factor 1/li².

3. A ~x = Q ~z will invert the rotation from the first step.

Thus the ellipsoid is given as level set of f(~x) = ⟨~x, A~x⟩ = 1. ♦
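A minimal sketch of this construction (the directions and lengths are chosen arbitrarily for illustration): build A from three orthonormal directions and the lengths of the semi axes, then check that an endpoint of an axis lies on the level set.
Octave
[Q, R] = qr(randn(3));           % three random orthonormal directions d_i as columns of Q
l = [2; 1; 0.5];                 % lengths of the semi axes
A = Q*diag(1./l.^2)*Q';          % matrix of the quadratic form
d1 = Q(:,1);
(l(1)*d1)'*A*(l(1)*d1)           % equals 1: endpoint of the first axis is on the level set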

The above examples allow to visualize level curves and surfaces in R2 and R3, but the task is more difficult when higher dimensions are necessary, i.e. for regression problems with more than three parameters. There are (at least) two methods to reduce the dimensions to visualize the results:

• intersection with a coordinate plane with xi = 0 for some indices i.

• projection of the high dimensional ellipsoid onto a coordinate plane.

As example examine a problem with four independent variables, given by the symmetric matrix A ∈ M4×4, i.e. a_{i,j} = a_{j,i}. Examine reductions to expressions involving two variables only, e.g. x1 and x2. The quadratic form to be examined is
$$f(\vec x) = \langle A\,\vec x, \vec x\rangle = \Bigl\langle \begin{bmatrix} a_{1,1} & a_{1,2} & a_{1,3} & a_{1,4} \\ a_{2,1} & a_{2,2} & a_{2,3} & a_{2,4} \\ a_{3,1} & a_{3,2} & a_{3,3} & a_{3,4} \\ a_{4,1} & a_{4,2} & a_{4,3} & a_{4,4} \end{bmatrix}\begin{pmatrix} x_1 \\ x_2 \\ x_3 \\ x_4 \end{pmatrix},\ \begin{pmatrix} x_1 \\ x_2 \\ x_3 \\ x_4 \end{pmatrix}\Bigr\rangle = \sum_{i,j=1}^4 a_{i,j}\,x_i\,x_j\,.$$


To determine the intersection with the x1x2–plane (i.e. x3 = x4 = 0) it is convenient to write the matrix as a composition of four 2 × 2 block matrices
$$A = \begin{bmatrix} a_{1,1} & a_{1,2} & a_{1,3} & a_{1,4} \\ a_{2,1} & a_{2,2} & a_{2,3} & a_{2,4} \\ a_{3,1} & a_{3,2} & a_{3,3} & a_{3,4} \\ a_{4,1} & a_{4,2} & a_{4,3} & a_{4,4} \end{bmatrix} = \begin{bmatrix} A_{11} & A_{13} \\ A_{13}^T & A_{33} \end{bmatrix}.$$

The restriction to the plane x3 = x4 = 0 is characterized by
$$f(\vec x) = \Bigl\langle A\begin{pmatrix} x_1 \\ x_2 \\ 0 \\ 0 \end{pmatrix},\ \begin{pmatrix} x_1 \\ x_2 \\ 0 \\ 0 \end{pmatrix}\Bigr\rangle = \Bigl\langle \begin{bmatrix} a_{1,1} & a_{1,2} \\ a_{2,1} & a_{2,2} \end{bmatrix}\begin{pmatrix} x_1 \\ x_2 \end{pmatrix},\ \begin{pmatrix} x_1 \\ x_2 \end{pmatrix}\Bigr\rangle = \Bigl\langle A_{11}\begin{pmatrix} x_1 \\ x_2 \end{pmatrix},\ \begin{pmatrix} x_1 \\ x_2 \end{pmatrix}\Bigr\rangle$$
Thus the intersections of the level sets with the plane x3 = x4 = 0 can be visualized with the tools from the above example.
Examining the projection onto this plane requires more effort. In Figure 3.14(b) observe that the contour of the projection along the x3 axis onto the horizontal plane is characterized by the x3 component of the gradient (see page 74) ∇f(~x) = 2 A ~x vanishing. For the four dimensional example this leads to the condition that the last two components of the gradient vanish, i.e.
$$\begin{pmatrix} 0 \\ 0 \end{pmatrix} = \begin{bmatrix} a_{3,1} & a_{3,2} & a_{3,3} & a_{3,4} \\ a_{4,1} & a_{4,2} & a_{4,3} & a_{4,4} \end{bmatrix}\begin{pmatrix} x_1 \\ x_2 \\ x_3 \\ x_4 \end{pmatrix} = \begin{bmatrix} a_{3,1} & a_{3,2} \\ a_{4,1} & a_{4,2} \end{bmatrix}\begin{pmatrix} x_1 \\ x_2 \end{pmatrix} + \begin{bmatrix} a_{3,3} & a_{3,4} \\ a_{4,3} & a_{4,4} \end{bmatrix}\begin{pmatrix} x_3 \\ x_4 \end{pmatrix} = A_{13}^T\begin{pmatrix} x_1 \\ x_2 \end{pmatrix} + A_{33}\begin{pmatrix} x_3 \\ x_4 \end{pmatrix}.$$
This system can be solved for (x3, x4) as function of (x1, x2):
$$\begin{pmatrix} x_3 \\ x_4 \end{pmatrix} = -\begin{bmatrix} a_{3,3} & a_{3,4} \\ a_{4,3} & a_{4,4} \end{bmatrix}^{-1}\begin{bmatrix} a_{3,1} & a_{3,2} \\ a_{4,1} & a_{4,2} \end{bmatrix}\begin{pmatrix} x_1 \\ x_2 \end{pmatrix} = -A_{33}^{-1}\,A_{13}^T\begin{pmatrix} x_1 \\ x_2 \end{pmatrix}.$$

If A is positive definite, then A11 and A33 are positive definite too. With this notation the quadratic form is expressed in terms of (x1, x2) only.
$$\begin{aligned}
f(\vec x) = \langle A\,\vec x, \vec x\rangle &= \Bigl\langle \begin{bmatrix} A_{11} & A_{13} \\ A_{13}^T & A_{33} \end{bmatrix}\begin{pmatrix} x_1 \\ x_2 \\ x_3 \\ x_4 \end{pmatrix},\ \begin{pmatrix} x_1 \\ x_2 \\ x_3 \\ x_4 \end{pmatrix}\Bigr\rangle \\
&= \Bigl\langle A_{11}\binom{x_1}{x_2}, \binom{x_1}{x_2}\Bigr\rangle + 2\,\Bigl\langle A_{13}\binom{x_3}{x_4}, \binom{x_1}{x_2}\Bigr\rangle + \Bigl\langle A_{33}\binom{x_3}{x_4}, \binom{x_3}{x_4}\Bigr\rangle \\
&= \Bigl\langle A_{11}\binom{x_1}{x_2}, \binom{x_1}{x_2}\Bigr\rangle - 2\,\Bigl\langle A_{13}A_{33}^{-1}A_{13}^T\binom{x_1}{x_2}, \binom{x_1}{x_2}\Bigr\rangle + \Bigl\langle A_{33}A_{33}^{-1}A_{13}^T\binom{x_1}{x_2},\ A_{33}^{-1}A_{13}^T\binom{x_1}{x_2}\Bigr\rangle \\
&= \Bigl\langle A_{11}\binom{x_1}{x_2}, \binom{x_1}{x_2}\Bigr\rangle - 2\,\Bigl\langle A_{13}A_{33}^{-1}A_{13}^T\binom{x_1}{x_2}, \binom{x_1}{x_2}\Bigr\rangle + \Bigl\langle A_{13}A_{33}^{-1}A_{13}^T\binom{x_1}{x_2}, \binom{x_1}{x_2}\Bigr\rangle \\
&= \Bigl\langle \bigl(A_{11} - A_{13}\,A_{33}^{-1}\,A_{13}^T\bigr)\binom{x_1}{x_2}, \binom{x_1}{x_2}\Bigr\rangle
\end{aligned}$$
Thus examine the level curves in R2 of the quadratic form generated by the modified matrix Ã ∈ M2×2
$$\tilde A = A_{11} - A_{13}\,A_{33}^{-1}\,A_{13}^T\,.$$

In the case of a projection from R3 onto R2 along the x3 direction this simplifies slightly to
$$f(\vec x) = \Bigl\langle A\begin{pmatrix} x_1 \\ x_2 \\ 0\end{pmatrix}, \begin{pmatrix} x_1 \\ x_2 \\ 0 \end{pmatrix}\Bigr\rangle - \frac{1}{a_{3,3}}\Bigl\langle \begin{bmatrix} a_{1,3} \\ a_{2,3}\end{bmatrix}\bigl[a_{1,3}\ \ a_{2,3}\bigr]\binom{x_1}{x_2}, \binom{x_1}{x_2}\Bigr\rangle = \Bigl\langle \Bigl(\begin{bmatrix} a_{1,1} & a_{1,2} \\ a_{1,2} & a_{2,2}\end{bmatrix} - \frac{1}{a_{3,3}}\begin{bmatrix} a_{1,3} \\ a_{2,3}\end{bmatrix}\bigl[a_{1,3}\ \ a_{2,3}\bigr]\Bigr)\binom{x_1}{x_2}, \binom{x_1}{x_2}\Bigr\rangle.$$

The above approach can be generalized to more than four dimensions and the variables to be removed can be chosen arbitrarily. This is implemented in a MATLAB/Octave function.
ReduceQuadraticForm.m
function A_new = ReduceQuadraticForm(A,remove)
stay = 1:length(A);
for ii = 1:length(remove) %% remove the indicated components
stay = stay(stay~=remove(ii));
end%for
%% construct the sub matrices
Asr = A(stay,remove); Arr = A(remove,remove);
A_new = A(stay,stay) - Asr*(Arr\Asr’); %% the new matrix
end%function

For the above matrix A ∈ M4×4 the coordinates x3 and x4 are removed and the new matrix Ã ∈ M2×2 is generated by calling A_new = ReduceQuadraticForm(A,[3 4]).

3–27 Example : Intersection and Projection of Level Sets of a Quadratic Form onto Coordinate Plane
As example examine the ellipsoid in Figure 3.14(b) and determine the intersection and projection onto the
coordinate plane x3 = 0. Find the results of the code below in Figure 3.15. Observe that the axes of the two
ellipses in Figure 3.15(a) are in slightly different directions.


A = [2 1.5 1; 1.5 3 0;1 0 3];


[Evec, Eval] = eig(A);
phi = linspace(0,2*pi,51); theta = linspace(-pi,pi,51);
x = cos(phi’)*cos(theta); y = sin(phi’)*cos(theta); z = ones(size(phi’))*sin(theta);
Points = Evec*inv(sqrt(Eval))*[x(:),y(:),z(:)]’;

%%% intersection with the plane x_3=0


A_intersect = A(1:2,1:2);
[Evec_i, Eval_i] = eig(A_intersect)
x = cos(phi); y = sin(phi); z = zeros(size(phi));
Points_i = [Evec_i*inv(sqrt(Eval_i))*[x(:),y(:)]’;z];

%%% projection onto the plane x_3=0


A_project = ReduceQuadraticForm(A,3);
[Evec_p, Eval_p] = eig(A_project)
Points_p = [Evec_p*inv(sqrt(Eval_p))*[x(:),y(:)]’;z];

figure(1); clf
plot3(Points(1,:),Points(2,:),Points(3,:),’.b’)
hold on
plot3([0,Evec(1,1)]/sqrt(Eval(1,1)),...
[0,Evec(2,1)]/sqrt(Eval(1,1)),...
[0,Evec(3,1)]/sqrt(Eval(1,1)),’k’)
plot3([0,Evec(1,2)]/sqrt(Eval(2,2)),...
[0,Evec(2,2)]/sqrt(Eval(2,2)),...
[0,Evec(3,2)]/sqrt(Eval(2,2)),’k’)
plot3([0,Evec(1,3)]/sqrt(Eval(3,3)),...
[0,Evec(2,3)]/sqrt(Eval(3,3)),...
[0,Evec(3,3)]/sqrt(Eval(3,3)),’k’)
plot3(Points_i(1,:),Points_i(2,:),Points_i(3,:),’r’)
plot3(Points_p(1,:),Points_p(2,:),Points_p(3,:),’c’)
xlabel(’x_1’); ylabel(’x_2’); zlabel(’x_3’);
axis(1.1*[-1 1 -1 1 -1 1]); axis equal
view([-80 30]); hold off

figure(2); clf
plot(Points_i(1,:),Points_i(2,:),'r',...
     Points_p(1,:),Points_p(2,:),'c')
xlabel(’x_1’); ylabel(’x_2’);
legend(’intersection’,’projection’); axis(1.1*[-1 1 -1 1]); axis equal

3.2.4 Commands for Eigenvalues and Eigenvectors in Octave/MATLAB


MATLAB and Octave provide commands to determine eigenvalues and eigenvectors for square matrices. In this section only the usage of these commands is illustrated; no attempt to examine the algorithms used is made. Consult the bible of matrix computations [GoluVanLoan13, §7, §8] for information on the algorithms to compute the eigenvalues. In [IsaaKell66, §4] find a shorter presentation.


(a) in the coordinate plane x3 = 0 (b) in R3

Figure 3.15: Level surface of a quadratic form in R3 with intersection and projection onto R2

The command eig() to compute all eigenvalues


With the command eig() all n eigenvalues, or all eigenvectors and eigenvalues, of an n × n matrix can be determined. To compute the eigenvalues of the matrix
$$A = \begin{bmatrix} 1 & 2 & 3 \\ 4 & 5 & 6 \\ 7 & 8 & 9 \end{bmatrix}$$

use

A = [1 2 3; 4 5 6; 7 8 9];
eval = eig(A)
-->
eval = 1.6117e+01
-1.1168e+00
-1.3037e-15

and to compute the eigenvectors and eigenvalues use

A = [1 2 3; 4 5 6; 7 8 9];
eval = eig(A)
[evec,eval] = eig(A)
-->
evec = -0.231971 -0.785830 0.408248
-0.525322 -0.086751 -0.816497
-0.818673 0.612328 0.408248

eval = Diagonal Matrix 1.6117e+01 0 0


0 -1.1168e+00 0
0 0 -1.3037e-15

The eigenvectors are all normalized to have length 1 and the eigenvalues are returned on the diagonal of a
matrix. For symmetric matrices the eigenvectors are orthogonal.


The same command is used to compute generalized eigenvalues by supplying two input arguments. To examine
$$\begin{bmatrix} 1 & 2 & 3 \\ 4 & 5 & 6 \\ 7 & 8 & 9 \end{bmatrix}\vec v = \lambda\begin{bmatrix} 1 & 0 & 0 \\ 0 & 10 & 0 \\ 0 & 0 & 100 \end{bmatrix}\vec v$$
use

A = [1 2 3; 4 5 6; 7 8 9]; B = diag([1 10 100]);


eval = eig(A,B)
-->
eval = 1.8221e+00
-2.3214e-01
1.5755e-19

The command eigs() to compute a few eigenvalues


The computational effort to determine all eigenvalues of a large matrix can be huge, or it might not be
possible at all. In many cases not all eigenvalues are required, but only the largest or smallest, and a good
estimate is good enough. Thus Octave/MATLAB provide a command for this special task, eigs(). The
basic idea for the algorithm used by eigs() is power iteration, see [GoluVanLoan13, §7.3], [IsaaKell66,
§4.2] or [Demm97]. Find a detailed description in the lecture notes [Stah00, §11.5] available on my web
site at https://web.sha1.bfh.science/fem/VarFEM/VarFEM.pdf.

As example consider the model matrices An and Ann from Section 2.3 (see page 32). For these matrices
the exact eigenvalues are known, eigs() will return estimates of some of the eigenvalues. To estimate the
three largest eigenvalues of the 1000 × 1000 matrix An for n = 1000 use

n = 1000; h = 1/(n+1);
A_n = spdiags(ones(n,1)*[-1, 2, -1],[-1, 0, 1],n,n)/(h^2);
lambda_max = 4/h^2*sin(n*pi*h/2)^2
options.issym = true; options.tol = n^2*1e-5;
lambda = eigs(A_n,3,’lm’,options)
-->
lambda_max = 4.0080e+06
lambda = 4.0019e+06
3.9569e+06
3.8748e+06

and for the three smallest eigenvalues

lambda_min = 4/h^2*sin(pi*h/2)^2
lambda = eigs(A_n,3,’sm’,options)
-->
lambda_min = 9.8696
lambda = 88.8258
39.4783
9.8696

To estimate the three largest eigenvalues of the 100 000 × 100 000 matrix Ann for n = 100 use


n = 100; h = 1/(n+1);
A_n = spdiags(ones(n,1)*[-1, 2, -1],[-1, 0, 1],n,n)/(h^2);
A_nn = kron(speye(n),A_n) + kron(A_n,speye(n));
lambda_max = 8/h^2*sin(n*pi*h/2)^2
options.issym = true; options.tol = n^2*1e-5;
lambda = eigs(A_nn,3,’lm’,options)
-->
lambda_max = 8.1588e+04
lambda = 8.1344e+04
8.0238e+04
7.8325e+04

This code takes less than 0.02 seconds to run, while using eig() takes 46 seconds on a fast desktop computer.

Using estimates for the smallest and largest eigenvalue allows to estimate the condition number of the
matrix. The command condest() uses similar ideas to estimate the condition number of a matrix.

3.2.5 Eigenvalues and Systems of Ordinary Differential Equations


A general linear system of ordinary differential equations is given by
$$\frac{d}{dt}\vec x(t) = A(t)\,\vec x(t) + \vec f(t)\,.$$
One can show that any solution is of the form

~x(t) = ~xh (t) + ~xp (t)

where
• ~xp(t) is one particular solution

• ~xh(t) is a solution of the corresponding homogeneous system d/dt ~xh(t) = A(t) ~xh(t)

Because of this structure one may examine the homogeneous problem to study the stability of numerical algorithms to solve the system of ODEs, see Section 3.4.

For constant matrices A(t) = A the eigenvalues and eigenvectors provide a lot of information on the solutions of the homogeneous problem d/dt ~xh(t) = A ~xh(t). This will lead to Result 3–29 below.

Examine the system of linear, homogeneous ODEs
$$\frac{d}{dt}\vec x(t) = A\,\vec x(t)\,, \qquad (3.4)$$
where the real n × n matrix A has real eigenvalues λi with corresponding eigenvectors ~vi ∈ Rn.

(a) Any function ~x(t) = e^{λi t} ~vi solves (3.4), since
$$\frac{d}{dt}\vec x(t) = \frac{d}{dt}\bigl(e^{\lambda_i t}\,\vec v_i\bigr) = \lambda_i\,e^{\lambda_i t}\,\vec v_i \qquad\text{and}\qquad A\,\vec x(t) = e^{\lambda_i t}\,A\,\vec v_i = e^{\lambda_i t}\,\lambda_i\,\vec v_i\,.$$

(b) Any linear combination of the above solutions also solves the ODE (3.4), i.e. $\vec x(t) = \sum_{i=1}^n c_i\,e^{\lambda_i t}\,\vec v_i$ is a solution. To verify this examine
$$\frac{d}{dt}\vec x(t) = \frac{d}{dt}\sum_{i=1}^n c_i\,e^{\lambda_i t}\,\vec v_i = \sum_{i=1}^n c_i\,\lambda_i\,e^{\lambda_i t}\,\vec v_i \qquad\text{and}\qquad A\,\vec x(t) = \sum_{i=1}^n c_i\,e^{\lambda_i t}\,A\,\vec v_i = \sum_{i=1}^n c_i\,e^{\lambda_i t}\,\lambda_i\,\vec v_i\,.$$


(c) If the above n eigenvectors ~vi are linearly independent, then any initial value ~x(0) = ~x0 can be written as a linear combination of the eigenvectors, i.e.
$$\vec x(0) = \vec x_0 = \sum_{i=1}^n c_i\,\vec v_i\,.$$
Verify that solving for the coefficients ci leads to a system of linear equations. Using this solve (3.4) with an initial condition ~x(0) = ~x0. With the notation
$$\vec v_i = \begin{pmatrix} v_{1,i} \\ v_{2,i} \\ v_{3,i} \\ \vdots \\ v_{n,i} \end{pmatrix} \qquad\text{and}\qquad \vec x_0 = \begin{pmatrix} x_1 \\ x_2 \\ x_3 \\ \vdots \\ x_n \end{pmatrix}$$
the equation $\vec x_0 = \sum_{i=1}^n c_i\,\vec v_i$ leads to
$$\begin{pmatrix} x_1 \\ x_2 \\ x_3 \\ \vdots \\ x_n \end{pmatrix} = c_1\begin{pmatrix} v_{1,1} \\ v_{2,1} \\ v_{3,1} \\ \vdots \\ v_{n,1} \end{pmatrix} + c_2\begin{pmatrix} v_{1,2} \\ v_{2,2} \\ v_{3,2} \\ \vdots \\ v_{n,2} \end{pmatrix} + \cdots + c_n\begin{pmatrix} v_{1,n} \\ v_{2,n} \\ v_{3,n} \\ \vdots \\ v_{n,n} \end{pmatrix} = \begin{bmatrix} v_{1,1} & v_{1,2} & \cdots & v_{1,n} \\ v_{2,1} & v_{2,2} & \cdots & v_{2,n} \\ v_{3,1} & v_{3,2} & \cdots & v_{3,n} \\ \vdots & \vdots & \ddots & \vdots \\ v_{n,1} & v_{n,2} & \cdots & v_{n,n} \end{bmatrix}\begin{pmatrix} c_1 \\ c_2 \\ c_3 \\ \vdots \\ c_n \end{pmatrix}.$$

If the eigenvectors ~vi are linearly independent, the system has a unique solution.
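A minimal sketch of this construction: determine the coefficients ~c by solving V ~c = ~x0 and evaluate the solution ~x(t) = Σ ci e^{λi t} ~vi. The matrix is the one of the example below; the check with the matrix exponential is an added illustration.
Octave
A = [1 3; 4 1];                  % matrix of the ODE, see the example below
[V, L] = eig(A); lambda = diag(L);
x0 = [1; 0];                     % initial value
c  = V\x0;                       % coefficients of x0 in the eigenvector basis
t  = 0.5;
x_t = V*(c.*exp(lambda*t))       % solution x(t) at time t
norm(x_t - expm(A*t)*x0)         % approximately zero, check with the matrix exponential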

3–28 Example : Examine the system of ODEs

$$\dot x(t) = x(t) + 3\,y(t)\,,\qquad \dot y(t) = 4\,x(t) + y(t)$$
or using the matrix notation
$$\frac{d}{dt}\begin{pmatrix} x(t) \\ y(t) \end{pmatrix} = \begin{bmatrix} 1 & 3 \\ 4 & 1 \end{bmatrix}\begin{pmatrix} x(t) \\ y(t) \end{pmatrix}.$$
For the matrix
$$A = \begin{bmatrix} 1 & 3 \\ 4 & 1 \end{bmatrix}$$
the eigenvalues are given as solutions of a quadratic equation
$$\det\begin{bmatrix} 1-\lambda & 3 \\ 4 & 1-\lambda \end{bmatrix} = (1-\lambda)^2 - 12 = \lambda^2 - 2\,\lambda - 11 = 0$$


with the solutions
$$\lambda_1 = \tfrac12\,(2 + \sqrt{4+44}) \approx 4.4641\,,\qquad \lambda_2 = \tfrac12\,(2 - \sqrt{4+44}) \approx -2.4641$$
and the corresponding eigenvectors
$$\vec v_1 \approx \begin{pmatrix} 1 \\ 1.1547 \end{pmatrix},\qquad \vec v_2 \approx \begin{pmatrix} 1 \\ -1.1547 \end{pmatrix}.$$
Thus the general solution of the system of ODEs is given by
$$\begin{pmatrix} x(t) \\ y(t) \end{pmatrix} \approx c_1\,e^{4.4641\,t}\begin{pmatrix} 1 \\ 1.1547 \end{pmatrix} + c_2\,e^{-2.4641\,t}\begin{pmatrix} 1 \\ -1.1547 \end{pmatrix}.$$

Find the corresponding vector field, two solutions and the eigen–directions in Figure 3.16.


Figure 3.16: A linear vector field with eigenvectors

If λ = α + iβ ∈ C is a complex eigenvalue of a real matrix A with eigenvector ~u + i ~v then λ̄ = α − i β is also an eigenvalue with eigenvector ~u − i ~v. Now use
$$A\,\exp((\alpha + i\,\beta)\,t)\,(\vec u + i\,\vec v) = (\alpha + i\,\beta)\,\exp((\alpha + i\,\beta)\,t)\,(\vec u + i\,\vec v)$$
$$\frac{d}{dt}\exp((\alpha + i\,\beta)\,t)\,(\vec u + i\,\vec v) = (\alpha + i\,\beta)\,\exp((\alpha + i\,\beta)\,t)\,(\vec u + i\,\vec v)$$
and examine
$$\begin{aligned}
\exp((\alpha + i\,\beta)\,t)\,(\vec u + i\,\vec v) &= \exp(\alpha\,t)\,\exp(i\,\beta\,t)\,(\vec u + i\,\vec v)\\
&= \exp(\alpha\,t)\,(\cos(\beta\,t) + i\,\sin(\beta\,t))\,(\vec u + i\,\vec v)\\
&= \exp(\alpha\,t)\,(\cos(\beta\,t)\,\vec u - \sin(\beta\,t)\,\vec v) + i\,\exp(\alpha\,t)\,(\cos(\beta\,t)\,\vec v + \sin(\beta\,t)\,\vec u)
\end{aligned}$$
and its complex conjugate to conclude that
$$e^{\alpha\,t}\,(\cos(\beta\,t)\,\vec u - \sin(\beta\,t)\,\vec v) \qquad\text{and}\qquad e^{\alpha\,t}\,(\cos(\beta\,t)\,\vec v + \sin(\beta\,t)\,\vec u)$$


are linearly independent solutions of the system of ODEs d/dt ~x(t) = A ~x(t). These solutions move in the plane in Rn spanned by ~u and ~v. They grow (or shrink) exponentially like e^{α t} and rotate in the plane with angular velocity β.

The above observations can be collected and lead to a complete description of the solution of the homogeneous system d/dt ~x(t) = A ~x(t), assuming that all eigenvalues are isolated.

3–29 Result : Linear systems of ODEs with simple eigenvalues


Let λj ∈ R, j = 1, . . . , n1 be isolated eigenvalues of the n × n matrix A with eigenvectors ~ej. Use the isolated, complex eigenvalues λk = αk + iβk ∈ C, αk, βk ∈ R, βk > 0 for k = 1, . . . , n2 with eigenvectors ~uk + i~vk. If these are all the eigenvalues (i.e. n1 + 2 n2 = n), then all solutions of the system d/dt ~x(t) = A ~x(t) are given by
$$\vec x_h(t) = \sum_{j=1}^{n_1} c_j\,e^{\lambda_j t}\,\vec e_j + \sum_{k=1}^{n_2} e^{\alpha_k t}\Bigl(r_k\,\bigl(\cos(\beta_k t)\,\vec u_k - \sin(\beta_k t)\,\vec v_k\bigr) + s_k\,\bigl(\cos(\beta_k t)\,\vec v_k + \sin(\beta_k t)\,\vec u_k\bigr)\Bigr)$$
with real constants cj, rk, sk. 3

3–30 Example : Examine the linear system


     
$$\frac{d}{dt}\begin{pmatrix} x(t) \\ y(t) \\ z(t) \end{pmatrix} = \begin{bmatrix} 0.39 & 0.51 & 1.26 \\ -0.68 & 0 & -0.44 \\ -1.47 & -0.05 & -0.5 \end{bmatrix}\begin{pmatrix} x(t) \\ y(t) \\ z(t) \end{pmatrix}.$$
The code below generates the eigenvalues and eigenvectors
$$\lambda_{1,2} \approx -0.10 \pm i\,1.41 \qquad\text{with}\qquad \vec e_{1,2} = \vec u \pm i\,\vec v \approx \begin{pmatrix} 0.67 \\ -0.18 \\ -0.19 \end{pmatrix} \pm i\begin{pmatrix} 0 \\ 0.28 \\ 0.64 \end{pmatrix}$$
and
$$\lambda_3 \approx +0.095 \qquad\text{with}\qquad \vec e_3 \approx \begin{pmatrix} -0.13 \\ -0.91 \\ 0.40 \end{pmatrix}.$$

In the plane generated by the vectors ~u and ~v obtain exponential spirals with decaying radius cr e−0.1 t . In
the direction given by ~e3 the solution is moving away from the origin, the distance described by a function
c e^{+0.095 t}. This behavior is confirmed in Figure 3.17 by the graphs of three solutions.
StabilityExample.m
A = [0.39 0.51 1.26; -0.68 0 -0.44; -1.47 -0.05 -0.5]
[EV,EW] = eig(A)

t = linspace(0,20);
[t1,x1] = ode45(@(t,x)A*x,t,[1;1;1]);
[t2,x2] = ode45(@(t,x)A*x,t,[0.4,-0.2,0]);
[t3,x3] = ode45(@(t,x)A*x,t,[-1,-0.5,-1]);

figure(1) % draw the eigenvectors in green


plot3([0,real(EV(1,1))],[0,real(EV(2,1))],[0,real(EV(3,1))],’g’,...
[0,imag(EV(1,1))],[0,imag(EV(2,1))],[0,imag(EV(3,1))],’g’,...


(a) view onto the direction ~e3    (b) view onto the plane generated by ~u and ~v

Figure 3.17: Spirals as solutions of a system of three differential equations

[0,real(EV(1,3))],[0,real(EV(2,3))],[0,real(EV(3,3))],’g’)
hold on % draw three solutions in different colors
plot3(x1(:,1),x1(:,2),x1(:,3),’r’,x2(:,1),x2(:,2),x2(:,3),’b’,...
x3(:,1),x3(:,2),x3(:,3),’c’)
xlabel(’x’);ylabel(’y’);zlabel(’z’); axis equal; hold off
-->
EV =
0.66972 + 0.00000i 0.66972 - 0.00000i -0.13030 + 0.00000i
-0.17882 + 0.27667i -0.17882 - 0.27667i -0.90807 + 0.00000i
-0.18947 + 0.63801i -0.18947 - 0.63801i 0.39803 + 0.00000i

EW = Diagonal Matrix
-0.10264 + 1.41104i 0 0
0 -0.10264 - 1.41104i 0
0 0 0.09529 + 0.00000i

The situation changes slightly if second order systems are examined. Wave equations lead to this type
of ODEs, see e.g. Section 4.6.

3–31 Result : Systems of second order, linear ordinary differential equations


Let A ∈ Rn×n be a real valued matrix with positive eigenvalues λi = ωi2 > 0 and the corresponding
eigenvectors ~vi . Verify that the second order system of linear differential equations

$$\frac{d^2}{dt^2}\vec x(t) = -A\,\vec x(t)$$
is solved by
~x(t) = cos(ωi t) ~vi and ~x(t) = sin(ωi t) ~vi .
Using trigonometric identities these two solutions can be combined to have one amplitude ai and a phase
shift φi by
~x(t) = (c1 cos(ωi t) + c2 sin(ωi t)) ~vi = ai cos(ωi (t − φi )) ~vi .
As a consequence the frequencies of these oscillations are determined by the eigenvalues λi = ωi2 . 3


Proof : For ~x(t) = cos(ωi t) ~vi compute
$$\frac{d^2}{dt^2}\vec x(t) = \frac{d^2}{dt^2}\bigl(\cos(\omega_i t)\,\vec v_i\bigr) = -\omega_i^2\,\cos(\omega_i t)\,\vec v_i$$
$$-A\,\vec x(t) = -\cos(\omega_i t)\,A\,\vec v_i = -\cos(\omega_i t)\,\lambda_i\,\vec v_i = -\omega_i^2\,\cos(\omega_i t)\,\vec v_i$$
and thus the system of differential equations is solved. The computation for the second solution sin(ωi t) ~vi is similar. 2

The above could also be examined by Result 3–29. Translate the system of n second order equations to a system of 2n ODEs of order one. For this introduce the new variable ~v(t) = d/dt ~x(t) ∈ Rn and examine the (2n) × (2n) system
$$\frac{d}{dt}\begin{pmatrix} \vec x(t) \\ \vec v(t) \end{pmatrix} = \begin{pmatrix} \vec v(t) \\ -A\,\vec x(t) \end{pmatrix} = \begin{bmatrix} 0 & I_n \\ -A & 0 \end{bmatrix}\begin{pmatrix} \vec x(t) \\ \vec v(t) \end{pmatrix}.$$

The generalized eigenvalues and eigenvectors satisfy A ~v = λ B ~v and with those, ODEs with a weight matrix can be solved, i.e.
$$B\,\frac{d}{dt}\vec x(t) = A\,\vec x(t)$$
is solved by ~x(t) = e^{λ t} ~v. To verify this use B d/dt ~x(t) = λ e^{λ t} B ~v and A ~x(t) = e^{λ t} A ~v = λ e^{λ t} B ~v. A similar result is correct for second order ODEs
$$B\,\frac{d^2}{dt^2}\vec x(t) = -A\,\vec x(t)\,,$$
which are solved by
$$\vec x(t) = (c_1\,\cos(\omega_i t) + c_2\,\sin(\omega_i t))\,\vec v_i = a_i\,\cos(\omega_i\,(t - \phi_i))\,\vec v_i\,,$$
where the generalized eigenvalue A ~vi = λi B ~vi is given by λi = ωi².
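A minimal sketch: the angular frequencies of a second order system with weight matrix B are the square roots of the generalized eigenvalues. Both matrices below are chosen arbitrarily for illustration.
Octave
A = [4 1; 1 3]; B = diag([1 2]);       % stiffness and weight matrix, positive definite
[V, L] = eig(A, B);                    % generalized eigenvectors and eigenvalues
omega = sqrt(diag(L))                  % angular frequencies of the oscillations
t = linspace(0, 10);
x = V(:,1)*cos(omega(1)*t);            % one of the solutions of B x'' = -A x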

3.2.6 SVD, Singular Value Decomposition


For non-symmetric, or even non-square, matrices A the idea of diagonalization of a matrix (see Result 3–22
on page 134) can be generalized, leading to the singular value decomposition (SVD) of the matrix.

3–32 Theorem : [GoluVanLoan13, §2.4]


If A ∈ R^{m×n} is a real m × n matrix, then there exist orthogonal matrices U ∈ R^{m×m} and V ∈ R^{n×n} and singular values σi such that
$$U^T A\,V = \Sigma = \operatorname{diag}(\sigma_1, \sigma_2, \ldots, \sigma_p) \in \mathbb{R}^{m\times n} \qquad\text{where } p = \min\{n, m\} \qquad (3.5)$$
and σ1 ≥ σ2 ≥ σ3 ≥ . . . ≥ σp ≥ 0. 3
As consequence of the above find the SVD (Singular Value Decomposition) of the n × n matrix A.
$$A = U\,\operatorname{diag}(\sigma_1, \sigma_2, \ldots, \sigma_p)\,V^T = U\begin{bmatrix} \sigma_1 & 0 & 0 & \cdots & 0 \\ 0 & \sigma_2 & 0 & \cdots & 0 \\ 0 & 0 & \sigma_3 & \cdots & 0 \\ \vdots & & & \ddots & \vdots \\ 0 & 0 & 0 & \cdots & \sigma_n \end{bmatrix}V^T$$


With MATLAB/Octave the singular value decomposition can be computed by [U,S,V] = svd(A). It is not difficult to see that for square matrices A and the usual 2–norm we have
$$\|A\|_2 = \sigma_1\,,\qquad \|A^{-1}\|_2 = \frac{1}{\sigma_n} \qquad\text{and}\qquad \operatorname{cond}(A) = \frac{\sigma_1}{\sigma_n}\,.$$
If the matrix A is symmetric and positive definite, then U = V and
$$A\,U = U\,\Sigma$$
implies that the singular values are given by the eigenvalues λi = σi and in the columns of the orthogonal matrix U find the normalized eigenvectors of A. Thus the SVD coincides with the diagonalization of the matrix A, as examined in Result 3–22 on page 134. If the real matrix A is symmetric but not necessarily positive definite, then σi = |λi| and for the columns (i.e. eigenvectors) corresponding to negative eigenvalues find ~vi = −~ui. Then VT U is a diagonal matrix with numbers ±1 on the diagonal.

If the m × n matrix A has more rows than columns (m > n) then the factorization has the form
$$A = U\begin{bmatrix} \sigma_1 & 0 & \cdots & 0 \\ 0 & \sigma_2 & \cdots & 0 \\ \vdots & & \ddots & \vdots \\ 0 & 0 & \cdots & \sigma_n \\ 0 & 0 & \cdots & 0 \\ \vdots & & & \vdots \\ 0 & 0 & \cdots & 0 \end{bmatrix}V^T = U_{econ}\begin{bmatrix} \sigma_1 & 0 & \cdots & 0 \\ 0 & \sigma_2 & \cdots & 0 \\ \vdots & & \ddots & \vdots \\ 0 & 0 & \cdots & \sigma_n \end{bmatrix}V^T.$$

Since the matrix D has m − n rows with zeros at the bottom, the last m − n columns of U will be multiplied
by zeros. There is no need to compute these columns and the command [U,D,V] = svd(A,’econ’)
generates this more economic form.

The SVD has many more applications: image processing, data compression, regression, robotics, ...
Search on the internet for the keywords Professor SVD, Gene Golub and find an excellent article
about Gene Golub and SVD by Cleve Moler, the founder of MATLAB.

3–33 Example : Approximation of a matrix by SVD


Examine a random 8 × 4 matrix A and compute the SVD by

A = round(100*rand(8,4))
[U,D,V] = svd(A,’econ’);

This leads to the matrix
$$A = \begin{bmatrix}
45 & 91 & 51 & 25 \\
23 & 93 & 13 & 31 \\
32 & 66 & 43 & 21 \\
71 & 8 & 4 & 2 \\
30 & 27 & 78 & 61 \\
96 & 65 & 12 & 16 \\
96 & 59 & 18 & 69 \\
77 & 81 & 31 & 36
\end{bmatrix}.$$


The entries in the diagonal matrix D of the SVD are [279.982, 93.171, 75.224, 41.348] and thus the first entry is considerably larger than the others and one may approximate the matrix by using the first columns of U and V only, i.e.
$$A \approx A_1 = \vec u_1\cdot 279.982\cdot\vec v_1^T\,.$$
This matrix A1 has a simple structure, as illustrated by the trivial example
$$\begin{pmatrix} 1 \\ 2 \\ 3 \\ -1 \end{pmatrix}\begin{bmatrix} 1 & 2 & 10 \end{bmatrix} = \begin{bmatrix} 1 & 2 & 10 \\ 2 & 4 & 20 \\ 3 & 6 & 30 \\ -1 & -2 & -10 \end{bmatrix}.$$

Each row in A1 is a multiple of the row vector ~v1^T and each column in A1 is a multiple of the column vector ~u1. Using the first two columns leads to
$$A \approx A_2 = \bigl[\vec u_1\ \ \vec u_2\bigr]\cdot\begin{bmatrix} 279.982 & 0 \\ 0 & 93.171 \end{bmatrix}\cdot\begin{bmatrix} \vec v_1^T \\ \vec v_2^T \end{bmatrix}.$$

Each row in A2 is a linear combination of the two row vectors ~v1T and ~v2T and each column in A2 is a linear
combination of the two column vectors ~u1 and ~u2 . The norm of these matrices and their differences are

$$\|A\| = \|A_1\| = \|A_2\| = 279.982\,,\qquad \|A_1 - A\| = 93.171 \qquad\text{and}\qquad \|A_2 - A\| = 75.224$$

and the values are not too far from the original matrix A.
   
$$A_1 \approx \begin{bmatrix}
67.00 & 71.43 & 34.06 & 38.14 \\
53.84 & 57.40 & 27.37 & 30.65 \\
50.08 & 53.39 & 25.46 & 28.51 \\
30.42 & 32.43 & 15.46 & 17.31 \\
48.98 & 52.21 & 24.90 & 27.88 \\
66.39 & 70.77 & 33.75 & 37.79 \\
76.25 & 81.29 & 38.76 & 43.40 \\
73.41 & 78.26 & 37.32 & 41.79
\end{bmatrix},\qquad
A_2 \approx \begin{bmatrix}
45.43 & 80.66 & 50.48 & 44.07 \\
34.68 & 65.61 & 41.96 & 35.92 \\
31.17 & 61.49 & 39.86 & 33.71 \\
65.22 & 17.53 & -11.03 & 7.73 \\
17.58 & 65.65 & 48.78 & 36.52 \\
96.11 & 58.05 & 11.13 & 29.61 \\
96.95 & 72.43 & 23.01 & 37.70 \\
78.21 & 76.21 & 33.67 & 40.47
\end{bmatrix}$$
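A minimal sketch reproducing the low rank approximations and the norms above, assuming the matrix A generated at the beginning of this example is still in the workspace:
Octave
[U,D,V] = svd(A,'econ');
A1 = U(:,1)*D(1,1)*V(:,1)';            % rank 1 approximation
A2 = U(:,1:2)*D(1:2,1:2)*V(:,1:2)';    % rank 2 approximation
[norm(A), norm(A-A1), norm(A-A2)]      % 279.982, 93.171, 75.224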

3–34 Example : Image compression by SVD


The above idea can be used to compress images, by using only the first few contributions of the SVD. The
original picture in Figure 3.18 is a 480 × 480 digital picture. Using SVD only n components are used, with
n = 5, 10, 20 and 40.

pkg load image % Octave only


im = rgb2gray(imread(’Lenna.jpg’));
[U,D,V] = svd(im);
n = 10;
imNew = mat2gray(U(:,1:n)*D(1:n,1:n)*V(:,1:n)’);
figure(1); imshow(imNew)


(a) n = 5 (b) n = 10 (c) n = 20 (d) n = 40

Figure 3.18: Image compression by SVD with different number n of contributions

3.2.7 From Gaussian Distribution to Covariance, and then to PCA


In this section the transformation of Gaussian normal distributions by linear or affine transformations is
examined. This is useful to understand the domains of confidence used by linear and nonlinear regression.
As consequence the tool of principal component analysis (PCA) can be examined and the concepts and
interpretation of covariance and correlation matrices are presented. Gaussian distributions and PCA are
essential tools for machine learning, see e.g. [DeisFaisOng20, §6, §10].

Gaussian probability distributions in Rn


The standard Gaussian PDF with mean µ and variance σ² is given by
$$p(x) = \frac{1}{\sqrt{2\pi}\,\sigma}\,e^{-\frac{(x-\mu)^2}{2\,\sigma^2}}\,.$$

Examine a random variable ~x ∈ Rn with n independent components xi with means µi and variances σi². In this case the Gaussian distribution in Rn is given by the PDF
$$p(\vec x\mid\vec\mu,\vec\sigma) = \frac{1}{\sqrt{2\pi}^{\,n}\,\prod_{i=1}^n\sigma_i}\,\exp\left(-\frac12\sum_{i=1}^n\frac{(x_i-\mu_i)^2}{\sigma_i^2}\right)$$
$$\ln(p(\vec x\mid\vec\mu,\vec\sigma)) = -\frac{n\,\ln(2\pi)}{2} - \sum_{i=1}^n\ln(\sigma_i) - \frac12\sum_{i=1}^n\frac{(x_i-\mu_i)^2}{\sigma_i^2}\,.$$

For the standardized Gaussian distribution with means 0 and variances 1 obtain
$$p(\vec x\mid\vec 0,\vec 1) = \frac{1}{\sqrt{2\pi}^{\,n}}\,\exp\left(-\frac12\,\|\vec x\|^2\right) \qquad\text{and}\qquad \ln(p(\vec x\mid\vec 0,\vec 1)) = -\frac{n\,\ln(2\pi)}{2} - \frac12\,\|\vec x\|^2\,.$$

3–35 Observation : To determine the probability that a point ~x ∈ Rn satisfies ‖~x‖ ≤ r one needs the surface of the unit sphere {~x ∈ Rn | ‖~x‖ = 1} ⊂ Rn, given by
$$S_n = \frac{2\,\pi^{n/2}}{\Gamma(\frac n2)}\,,\qquad\text{e.g. } S_2 = 2\pi\,,\quad S_3 = \frac{2\,\pi^{3/2}}{\Gamma(\frac32)} = 4\pi\,,\quad S_4 = \frac{2\,\pi^2}{\Gamma(2)} = 2\,\pi^2\,.$$
To determine the confidence region for an n–dimensional case with confidence level 100 (1 − α)% solve the equation
$$\frac{1}{(2\pi)^{n/2}}\int_0^{r_{\max}} e^{-\frac12 r^2}\,\frac{2\,\pi^{n/2}}{\Gamma(\frac n2)}\,r^{n-1}\,dr = \frac{2^{1-n/2}}{\Gamma(\frac n2)}\int_0^{r_{\max}} e^{-\frac12 r^2}\,r^{n-1}\,dr = 1-\alpha \qquad (3.6)$$


for the upper limit of integration rmax . The value can also be computed using the chi–squared distribution
2
by rmax = chi2inv(1 − α, n). Then the domain of confidence is determined by r = k~xk ≤ rmax . For
n = 2 this leads to
Z rmax rmax
1 1 2 1 2 1 2
1−α = e− 2 r r dr = −e− 2 r 0 = 1 − e− 2 rmax (3.7)
Γ(1) 0
1 2 p
ln(α) = − rmax =⇒ rmax = −2 ln(α)
2
1
and the domain of confidence 2 r2 ≤ − ln(α) = ln α1 .
In Table 3.4 find results for dimensions n from 1 to 10 for three values of α. Listed are the radius rmax
up to which the integral has to be performed and the values of the PDF at this radius. This table contains
useful information:
• For the 1-d Gauss distribution include all values within 2 (or 1.96) standard deviations from the mean
to cover 95% of the events.
• For the 4-d Gauss distribution include all values within 3 (or 3.08) standard deviations from the mean
to cover 95% of the events.
• For the 8-d Gauss distribution include all values within 4 (or 3.94) standard deviations from the mean
to cover 95% of the events.
This is information on the size of the region of confidence for different dimensions n.
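The values r_max in Table 3.4 can be reproduced with the chi–squared distribution; a minimal sketch (chi2inv() is provided by the statistics functions of Octave/MATLAB):
Octave
alpha = 0.05; n = 1:4;
r_max = sqrt(chi2inv(1-alpha, n))     % 1.9600  2.4477  2.7955  3.0802, see Table 3.4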

n α = 0.3173 α = 0.05 α = 0.01


rmax pdf(rmax ) rmax pdf(rmax ) rmax pdf(rmax )
1 1 0.483941 1.95996 0.116890 2.57583 0.0289195
2 1.51517 0.480780 2.44775 0.122387 3.03485 0.0303485
3 1.87796 0.482494 2.79548 0.125287 3.36821 0.0311339
4 2.17244 0.484161 3.08022 0.127198 3.64372 0.0316669
5 2.42644 0.485534 3.32724 0.128595 3.88411 0.0320657
6 2.65300 0.486660 3.54846 0.129683 4.10023 0.0323814
7 2.85941 0.487598 3.75062 0.130565 4.29829 0.0326409
8 3.05023 0.488393 3.93793 0.131301 4.48221 0.0328600
9 3.22852 0.489078 4.11327 0.131928 4.65467 0.0330487
10 3.39647 0.489675 4.27867 0.132473 4.81760 0.0332138

Table 3.4: Standard Gaussian PDF in n dimensions

Linear and affine transformations and the resulting Gaussian PDF


Let A ∈ Mn×n be an invertible matrix and examine the effect of the transformation ~y = A ~x . Search for
the PDF for ~y . Using a substitution for the multidimensional integral leads to
$$P(\vec y \in A(\Omega)) = P(\vec x \in \Omega)\,,\qquad P(\vec y \in A(\Omega)) = \int_{A(\Omega)} p_y(\vec y)\,dV_y = \int_{\Omega} |\det(A)|\,p_y(A\vec x)\,dV_x\,,\qquad P(\vec x \in \Omega) = \int_{\Omega} p_x(\vec x)\,dV_x\,.$$


Since the above has to be correct for any domain Ω ⊂ Rn conclude
$$p_x(\vec x) = |\det(A)|\,p_y(A\vec x) \qquad\text{or}\qquad p_y(\vec y) = \frac{1}{|\det(A)|}\,p_x(A^{-1}\vec y)\,. \qquad (3.8)$$

3–36 Example : Consider the example
$$A = \operatorname{diag}(\vec\sigma) = \begin{bmatrix} \sigma_1 & & & \\ & \sigma_2 & & \\ & & \ddots & \\ & & & \sigma_n \end{bmatrix}$$
with $\det(A) = \prod_{i=1}^n \sigma_i$ and
$$\|\vec x\|^2 = \|A^{-1}\vec y\|^2 = \sum_{i=1}^n\frac{y_i^2}{\sigma_i^2}\,.$$
Use (3.8) to arrive at
$$p(\vec y) = \frac{1}{\sqrt{2\pi}^{\,n}\,\prod_{i=1}^n\sigma_i}\,\exp\left(-\frac12\sum_{i=1}^n\frac{y_i^2}{\sigma_i^2}\right),$$
i.e. the recovered Gaussian distribution with different variances. ♦

With p_x(~x) = (2π)^{−n/2} exp(−½‖~x‖²) equation (3.8) can be written in the form⁷
$$p(\vec y\mid A) = \frac{1}{\sqrt{2\pi}^{\,n}\,|\det(A)|}\,\exp\left(-\frac12\,\vec y^T(AA^T)^{-1}\vec y\right)$$
and the above result can be generalized to affine transformations
$$\vec x \mapsto T(\vec x) = \vec y = \vec y_0 + A\,\vec x$$
with the resulting PDF
$$p(\vec y\mid A, \vec y_0) = \frac{1}{(2\pi)^{n/2}\,|\det(A)|}\,\exp\left(-\frac12\,(\vec y - \vec y_0)^T(AA^T)^{-1}(\vec y - \vec y_0)\right). \qquad (3.9)$$

Maximum likelihood estimators of ~y0 and A


Assume that N independent data points ~yj ∈ Rn are given and it is known that the distribution is an affine transformation of the standard normal distribution. You want to recover information on the affine transformation T~x = ~y0 + A ~x. Using |det(A)| = √(det(AAᵀ)) the PDF for the combined event is
$$p(\vec y_1, \ldots, \vec y_N\mid A, \vec y_0) = \prod_{j=1}^N p(\vec y_j\mid A, \vec y_0) = \prod_{j=1}^N \frac{1}{(2\pi)^{n/2}\,|\det(A)|}\,\exp\left(-\frac12\,(\vec y_j - \vec y_0)^T(AA^T)^{-1}(\vec y_j - \vec y_0)\right)$$
$$f(A, \vec y_0) := \ln(p(\vec y_1, \ldots, \vec y_N\mid A, \vec y_0)) = -\frac N2\,n\,\ln(2\pi) - \frac N2\,\ln(\det(AA^T)) - \sum_{j=1}^N\frac12\,(\vec y_j - \vec y_0)^T(AA^T)^{-1}(\vec y_j - \vec y_0)\,.$$
⁷ ‖A⁻¹~y‖² = ⟨A⁻¹~y, A⁻¹~y⟩ = ⟨~y, A⁻ᵀA⁻¹~y⟩ = ⟨~y, (AAᵀ)⁻¹~y⟩


The above is the answer to: given A and ~y0 , what is the PDF of the measurement points.
Now turn the approach around and seek the answer to: given the measured data points ~yj , recover the
affine transformation with A and the shift ~y0 . The maximum likelihood approach seeks the values of A
and ~y0 to render the function f (A, ~y0 ) maximal. Hoping for an unimodal function search for zeros of the
derivatives of f with respect to ~y0 and the components of Γ = (A AT )−1 .

• Compute the gradient of f with respect to ~y0 and solve for ~y0.
  $$\vec 0 = \frac{\partial f}{\partial \vec y_0} = \sum_{j=1}^N (AA^T)^{-1}(\vec y_j - \vec y_0) \in \mathbb{R}^n \;\Longrightarrow\; \vec 0 = \sum_{j=1}^N(\vec y_j - \vec y_0) = -N\,\vec y_0 + \sum_{j=1}^N\vec y_j \;\Longrightarrow\; \vec y_0 = \frac1N\sum_{j=1}^N\vec y_j$$

Thus the average value of the data points is the most likely estimator of the offset.

• As a consequence of the above you can first determine the average values and then subtract them from the data. This simplifies the function and now examine maxima of the modified function
  $$\hat f(\Gamma) := +N\,\ln(\det(\Gamma)) - \sum_{j=1}^N \vec y_j^T\,\Gamma\,\vec y_j\,.$$

• To examine derivatives with respect to Γrs the adjugate matrix⁸ can be used with the computational rules
  $$\Gamma^{-1} = \frac{1}{\det(\Gamma)}\,\operatorname{adj}(\Gamma) \in M_{n\times n} \qquad\text{and}\qquad \frac{\partial}{\partial\Gamma_{rs}}\det(\Gamma) = \operatorname{adj}(\Gamma)_{sr}\,.$$
  This implies
  $$\nabla\det(\Gamma) = \operatorname{adj}(\Gamma)^T = \det(\Gamma)\cdot\Gamma^{-T}\,.$$

• For the derivative of f̂ with respect to Γrs obtain
  $$0 = +N\,\frac{\operatorname{adj}(\Gamma)_{sr}}{\det(\Gamma)} - \sum_{j=1}^N y_{r,j}\cdot y_{s,j}\,.$$
  Take each data point ~yj (the average ~y0 already subtracted) as a column vector and stack all measurements in one matrix
  $$Y = \bigl[\vec y_1\ \ \vec y_2\ \ \cdots\ \ \vec y_N\bigr] \in M_{n\times N}$$
  and compute the covariance matrix
  $$M = \frac1N\,Y\,Y^T \in M_{n\times n}\,.$$
  Then the sums
  $$\frac1N\sum_{j=1}^N y_{r,j}\cdot y_{s,j} = M_{r,s} \qquad\text{for } 1\le r,s\le n$$

8
First compute the cofactor Cr,s . Start with the original matrix, eliminate row r and column s, compute the determinant,
multiply by (−1)r+s . The transpose of the matrix of cofactors is equal to adj(Γ).


appear as components of the covariance matrix. The above condition for the partial derivatives of f̂ to vanish is transformed to
$$\frac{\operatorname{adj}(\Gamma)}{\det(\Gamma)} = \frac1N\,Y\,Y^T\,.$$
Thus for the matrix A find the necessary condition
$$\Gamma^{-1} = A\,A^T = \frac1N\,Y\,Y^T \in M_{n\times n}\,.$$
One can not recover the components of A, but only the product Γ⁻¹ = AAᵀ. If you insist on one of the possible constructions of a matrix A you may use the Cholesky factorization Γ⁻¹ = RᵀR, i.e. A = Rᵀ is one of the possible reconstructions.
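A minimal sketch of these estimators for data points stored as columns of a matrix Y0 (the random data is an illustrative assumption): the mean estimates ~y0, the covariance matrix estimates AAᵀ, and a Cholesky factor gives one possible reconstruction of A.
Octave
Y0 = randn(2,500);                    % data points, one per column (illustration only)
y0 = mean(Y0,2)                       % estimator for the offset
Y  = Y0 - y0*ones(1,size(Y0,2));      % subtract the mean from each column
M  = Y*Y'/size(Y,2)                   % estimator for the product A*A'
A  = chol(M)'                         % one possible reconstruction, A = R^T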

Interpretation of level curves, eigenvalues and eigenvectors, PCA


Based on equation (3.9) examine the PDF for the values of ~y given by
$$p(\vec y) = \frac{\sqrt{\det(\Gamma)}}{(2\pi)^{n/2}}\,\exp\left(-\frac12\,(\vec y - \vec y_0)^T\,\Gamma\,(\vec y - \vec y_0)\right).$$
The level curves of the exponent yield useful information. It simplifies notation to work with the quadratic function
$$h(\vec y) = \frac12\,(\vec y - \vec y_0)^T\,\Gamma\,(\vec y - \vec y_0)\,.$$
The minimal value of h is 0, attained at ~y = ~y0 . The level curves can be generated by the code presented in
Example 3–25, using the eigenvectors and eigenvalues of Γ = (AAT )−1 , i.e. the inverse of the covariance
matrix.


(a) raw data (b) raw data and level curves

Figure 3.19: Raw data and level curves for the likelihood function. Inside the level curves find 99%, 87%
or 39% of the data points.

This is used to determine the domains of confidence with a level of confidence (1 − α), i.e. a level of significance α. For n = 2 use (3.6) and (3.7) to solve
$$\int_0^{r_{\max}} e^{-\frac12 r^2}\,r\,dr = 1 - e^{-\frac12 r_{\max}^2} = 1 - \alpha\,.$$


Then (1 − α) 100% of the points are in the domain ‖A⁻¹~y‖ = ‖~x‖ ≤ r_max. Find the results for r_max = 1, 2, 3 with the corresponding confidence levels in Figure 3.19. Assuming ~y0 = ~0 this domain can be determined using the covariance matrix M by
$$r_{\max}^2 \ge \|A^{-1}\vec y\|^2 = \langle A^{-1}\vec y, A^{-1}\vec y\rangle = \langle\vec y, A^{-T}A^{-1}\vec y\rangle = \langle\vec y, (AA^T)^{-1}\vec y\rangle = \langle\vec y, M^{-1}\vec y\rangle\,.$$
To generate graphs use the eigenvalues and eigenvectors of the covariance matrix, see Example 3–25. The result has to be compared to the usual individual intervals of confidence for y1 and y2. Since two conditions have to be satisfied use a level of confidence of √(1 − α) and read out the standard deviations as square roots of the diagonal entries in the covariance matrix. Then use the normal distribution by c = norminv(1 − (1 − √(1 − α))) ≈ norminv(1 − α/2) to construct the intervals of confidence for y1 and y2
$$\operatorname{mean}(y_i) - c\,\sigma_i \le y_i \le \operatorname{mean}(y_i) + c\,\sigma_i\,.$$

The result for α = 0.05 is shown in Figure 3.20(a).

• 95% of the data points are inside the ellipse.

• 95% of the data points are inside the rectangle.

The ellipse provides better information on the location of the data points, since the correlation between y1
and y2 is taken into account.


(a) domains of confidence (b) PCA

Figure 3.20: Raw data and the domain of confidence at level of significance α = 0.05 and the PCA

From covariance to PCA


The positive eigenvalues λi of Γ and the normalized eigenvectors ~ei give a good description of the function h. Examine
$$h(\vec y_0 + t\,\vec e_i) = \frac12\,t^2\,\vec e_i^T\,\Gamma\,\vec e_i = \frac12\,t^2\,\lambda_i\,.$$
• If λi is large the function h will grow quickly in that direction and the level curves will be close
together.

• If λi is small the function h will grow slowly in that direction and the level curves will be far apart.



To visualize the above one may rescale the eigenvectors ~ei to length 1/√λi and then draw them starting at the center point ~y0. Find an example in Figure 3.19(b).

Since M = Γ⁻¹ = AAᵀ = (1/N) YYᵀ the eigenvalues λi of Γ lead to the eigenvalues µi = 1/λi of AAᵀ. This leads to the Principal Component Analysis, or short PCA. The main application of PCA is dimension reduction, i.e. instead of many dimensions reduce the data such that only the main (principal) components have to be examined. Such a principal component is a linear combination of the available variables. If λ1 ≤ λ2 ≤ λ3 ≤ . . . then µ1 ≥ µ2 ≥ µ3 ≥ . . ., i.e. large eigenvalues µi indicate a slow growth of the function h and thus the data is spread out in the direction of the corresponding eigenvector. If ~e1 is normalized then compute the principal component of the data, i.e. the projection of the data in the direction of ~e1, given by ⟨~e1, ~y − ~y0⟩. The data can then be displayed at ⟨~e1, ~y − ~y0⟩ ~e1 ∈ R². The second principal component is given by ⟨~e2, ~y − ~y0⟩. Find the graphical representation in Figure 3.20(b).

Elementary code in Octave/MATLAB


Based on the above idea it is easy to write code in Octave/MATLAB to determine the PCA and generate
figures comparable to Figure 3.20(b). The steps in the code below are:

• Use the mean values ~m of the data, mainly for graphical purposes.

• With the command cov() the mean values ~m are subtracted and the covariance computed.
  $$m_i = \frac1n\sum_{k=1}^n \text{data}_{k,i}\,,\qquad \operatorname{cov}(\text{data})_{i,j} = \sum_{k=1}^n (\text{data}_{k,i} - m_i)\,(\text{data}_{k,j} - m_j)$$

• Determine the eigenvalues and eigenvectors using the command eig(). With the option 'descend' in sort() the eigenvalues are in decreasing order.

• Compute the length of the projection of the data into the directions of the eigenvectors, leading to the
scores.

• Multiply the scores with the eigenvectors to obtain the projection of the data onto the eigen directions.

• Display the data and the first two principal components.

data = ....... %% generate your data as a n x 2 matrix


m = mean(data); %% mean values
[PCAvec,PCAval] = eig(cov(data)); %% eigenvectors and values of the covariance matrix
[Pval, idx] = sort(diag(PCAval),1,’descend’);
Pscore1 = (data-m)*PCAvec(:,idx(1));
Pscore2 = (data-m)*PCAvec(:,idx(2));
PCAdata1 = Pscore1*PCAvec(:,idx(1))’;
PCAdata2 = Pscore2*PCAvec(:,idx(2))’;

figure(4)
plot(data(:,1),data(:,2),’.b’,m(1) + PCAdata1(:,1), m(2) + PCAdata1(:,2),’.r’,...
m(1) + PCAdata2(:,1), m(2) + PCAdata2(:,2),’.g’)
legend(’data’,’PCA 1’,’PCA 2’,’location’,’southeast’)


3–37 Example : PCA is not restricted to problems in 2 dimensions, it is only the visualization that is more challenging. In Figure 3.21 find in blue a cloud of data points in R³. The data points seem to be rather close to a plane. A PCA for the first two components was performed and each data point was projected onto the plane spanned by the eigenvectors belonging to these two components. This leads to the red points in the figure. Observe that this approach is different from using linear regression to determine the best fitting plane.

• With a linear regression using z = c0 + c1 x + c2 y the sum of the squared vertical distances is
minimized.

• With PCA the sum of the squared orthogonal distances to the plane is minimized. Consult the next
section, where PCA is regarded as an optimization problem.


Figure 3.21: PCA demo for data in R3

PCA as an optimization problem


Another view on PCA is to examine it as the solution of an optimization problem. For the data matrix Y = [~y1, ~y2, . . . , ~yN] ∈ M_{n×N} find the direction vector ~e ∈ Rn such that the squared sum of the projections onto this direction ~e is maximal, subject to the constraint ‖~e‖ = 1. Since the projection of a vector ~y onto the direction (given by ~e) equals ⟨~e, ~y⟩ the function f(~e) to be optimized is given by
$$f(\vec e) = \frac12\sum_{j=1}^N\bigl(\langle\vec y_j, \vec e\rangle\bigr)^2 = \frac12\sum_{j=1}^N\Bigl(\sum_{i=1}^n y_{i,j}\,e_i\Bigr)^2.$$

To find the extrema differentiate with respect to the components ek of the normalized vector ~e.
$$\frac{\partial}{\partial e_k} f(\vec e) = \sum_{j=1}^N\Bigl(\sum_{i=1}^n y_{i,j}\,e_i\Bigr)\,y_{k,j} = \sum_{i=1}^n\Bigl(\sum_{j=1}^N y_{k,j}\,y_{i,j}\Bigr)\,e_i = \sum_{i=1}^n (Y\,Y^T)_{k,i}\,e_i\,,\qquad \nabla f(\vec e) = Y\,Y^T\,\vec e$$


Thus the Lagrange multiplier theorem for constrained optimization with the constraint ‖~e‖² = 1 and ∇‖~e‖² = 2~e leads to
$$Y\,Y^T\,\vec e = \lambda\,\vec e\,,$$
i.e. the direction ~e has to be an eigenvector of the matrix Y Yᵀ ∈ M_{n×n}. Use a directional derivative of f(~e) to conclude
$$\frac{d}{dc} f(c\,\vec e) = \langle\vec e, \nabla f(c\,\vec e)\rangle = c\,\langle\vec e, Y\,Y^T\,\vec e\rangle = c\,\langle\vec e, \lambda\,\vec e\rangle = c\,\lambda$$
$$f(1\,\vec e) = f(\vec 0) + \int_0^1\frac{d}{dc} f(c\,\vec e)\,dc = 0 + \int_0^1 c\,\lambda\,dc = \frac12\,\lambda$$
and thus f(~e) = ½ λ. The largest value of f(~e) is attained for the eigenvector belonging to the largest eigenvalue λmax.

Figure 3.22: Projection of data in the direction of the first principal component. Original data in blue and
the first principal component in red. The original data points and their projections are connected by green
lines.

This is illustrated by Figure 3.22: the direction of the first principal component is such that the sum of the squared distances of the red points from the origin is largest, while the sum of the squared lengths of the green lines is smallest. This is based on the observation that the square of the principal component and the squared length of the corresponding green line add up to ‖~yj‖² and the sum $\sum_{j=1}^N \|\vec y_j\|^2$ does not depend on the direction ~e.

N = 20; data = randn(N,2); data = data - mean(data);


A = [3 1;1 1.5]; data = data*A;

[coeff,score,latent] = princomp(data);
dataPCA = score(:,1)*coeff(1,:);

figure(1); plot(data(:,1),data(:,2),’+b’,dataPCA(:,1),dataPCA(:,2),’+r’)
hold on
for ii = 1:length(data)
plot([data(ii,1),dataPCA(ii,1)],[data(ii,2),dataPCA(ii,2)],’g’)
end%for
hold off


PCA computed by SVD


For Y ∈ M_{n×N} with n < N compute the SVD (singular value decomposition) with unitary matrices U ∈ M_{n×n}, V ∈ M_{N×N} and the diagonal matrix D ∈ M_{n×N}.
$$Y = U\,D\,V^T\,,\qquad Y\,Y^T = U\,D\,V^T\,V\,D^T\,U^T = U\,D\,D^T\,U^T\,,\qquad D\,D^T = \begin{bmatrix} \sigma_1^2 & & & \\ & \sigma_2^2 & & \\ & & \ddots & \\ & & & \sigma_n^2 \end{bmatrix} \in M_{n\times n}$$

Thus the SVD of the matrix Y ∈ Mn×N leads to the eigenvalues σi2 of the matrix Y YT with the n
eigenvectors in the columns of U ∈ Mn×n . This is the information required for the PCA. Examine the
source code of princomp.m in Octave to realize that SVD is used to determine the PCA. Observe that the
matrix YYT does not have to be computed. This might make a difference for very large data sets, e.g. for
machine learning.
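A minimal sketch (not the actual code of princomp.m; variable names chosen freely) showing how the PCA information is obtained directly from the SVD, without forming Y Y^T:

Y = randn(3,500); Y = Y - mean(Y,2);    % centered data matrix, n = 3, N = 500
[U,D,V] = svd(Y,'econ');                % economy size SVD, Y = U*D*V'
sigma2 = diag(D).^2;                    % eigenvalues of Y*Y', i.e. the principal variances (up to a factor 1/N)
sort(eig(Y*Y'),'descend')               % for comparison: the same values via the eigenvalue problem
sigma2                                  % the columns of U contain the principal directions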

The PCA commands in Octave and MATLAB


The built-in commands princomp() and/or pca() do not use eigenvalues and eigenvectors, but an SVD
to determine the PCA. With Octave use the command princomp() from the statistics package.

help princomp
-->
-- Function File: [COEFF] = princomp(X)
-- Function File: [COEFF,SCORE] = princomp(X)
-- Function File: [COEFF,SCORE,LATENT] = princomp(X)
-- Function File: [COEFF,SCORE,LATENT,TSQUARE] = princomp(X)
-- Function File: [...] = princomp(X,’econ’)
Performs a principal component analysis on a NxP data matrix X

* COEFF : returns the principal component coefficients


* SCORE : returns the principal component scores, the
representation of X in the principal component space
* LATENT : returns the principal component variances, i.e., the
eigenvalues of the covariance matrix X.
* TSQUARE : returns Hotelling’s T-squared Statistic for each
observation in X
* [...] = princomp(X,’econ’) returns only the elements of latent
that are not necessarily zero, and the corresponding columns of
COEFF and SCORE, that is, when n <= p, only the first n-1.
This can be significantly faster when p is much larger than n.
In this case the svd will be applied on the transpose of the data matrix X

References
----------
1. Jolliffe, I. T., Principal Component Analysis, 2nd Edition, Springer, 2002

With this command the PCA is easily generated by

[coeff,score,latent] = princomp(data);

and the visualization is identical to the above code.


The statistics toolbox⁹ (i.e. $$$) in MATLAB contains the command pca(), which generates the identical
result by [coeff,score,latent] = pca(data); .
MATLAB and Octave provide the command pcacov() to determine the PCA for a given covariance
matrix.

From covariance to correlation


The above eigenvectors ~e_i of the symmetric matrix Γ are orthogonal. You will usually not find a right angle
between them when graphing (e.g. Figure 3.19), since the scales in the directions in R^n are usually not the
same. This can be fixed. Change the scale in each coordinate direction in R^n such that the new covariance
matrix M̄ has only numbers 1 on the diagonal. To achieve this divide rows and columns by the square root
of the entries on the diagonal, i.e.

\Gamma^{-1} = \mathbf{M} = \frac{1}{N}\, \mathbf{Y}\,\mathbf{Y}^T \in M_{n\times n}

\mathbf{S}^{-1} = \begin{pmatrix} \sqrt{M_{11}} & & & \\ & \sqrt{M_{22}} & & \\ & & \ddots & \\ & & & \sqrt{M_{nn}} \end{pmatrix}

\bar{\mathbf{M}} = \mathbf{S}\,\mathbf{M}\,\mathbf{S}

The resulting matrix M̄ is called correlation matrix. It has numbers 1 on the diagonal and all other entries
are at most 1 in absolute value. This has the effect that a unit step in any of the coordinate axes will lead to the
same increase for the function value.
Since the matrix is rescaled the scales of the coordinates have to be adapted, leading to

h(\vec y) = \frac12\, \langle \vec y - \vec y_0,\, \Gamma\,(\vec y - \vec y_0)\rangle = \frac12\, \langle \vec y - \vec y_0,\, \mathbf{M}^{-1}(\vec y - \vec y_0)\rangle
          = \frac12\, \langle \vec y - \vec y_0,\, \mathbf{S}\,(\mathbf{S}^{-1}\mathbf{M}^{-1}\mathbf{S}^{-1})\,\mathbf{S}\,(\vec y - \vec y_0)\rangle
          = \frac12\, \langle \mathbf{S}(\vec y - \vec y_0),\, \bar{\mathbf{M}}^{-1}\, \mathbf{S}(\vec y - \vec y_0)\rangle .

Now rerun the eigenvalue/eigenvector algorithm for the correlation matrix M̄ and display the result in a
rescaled graph, using identical scales in all coordinate directions. The eigenvectors appear orthogonal, find
an example in Figure 3.23.
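A minimal sketch (not from the notes; the covariance matrix is chosen for illustration only) of the rescaling from a covariance matrix to the corresponding correlation matrix:

M = [4 1.5; 1.5 1];                % a covariance matrix, illustration only
Sinv = diag(sqrt(diag(M)));        % S^(-1) contains the square roots of the diagonal entries
S = inv(Sinv);                     % the scaling matrix S
Mbar = S*M*S                       % correlation matrix with ones on the diagonal
[EVec,EVal] = eig(Mbar)            % its eigenvectors appear orthogonal in a rescaled graph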

9
On the Mathworks web site find a code princomp.m by B. Jones which works on MATLAB.

Figure 3.23: Scaled data and level curves for the likelihood function


3.3 Numerical Integration


In this section a few methods to evaluate definite integrals numerically are presented.

• In subsection 3.3.1 the trapezoidal rule and Simpson’s rule are used to evaluate integrals for functions
given by data points.

• In subsection 3.3.2 possible algorithms to evaluate integrals of functions given by an expression (for-
mula) are examined. The brilliant idea of Gauss is presented and the method of adaptive integration
is introduced.

• In subsection 3.3.3 some commands in Octave/MATLAB are used to evaluate integrals.

• In subsection 3.3.4 some ideas and Octave/MATLAB codes on how to integrate over domains in the
plane R2 are introduced.

3.3.1 Integration of a Function Given by Data Points


In this subsection assume that a function is given by n + 1 data points a = x_0 < x_1 < x_2 < . . . < x_n = b
and the corresponding values y_i = f (x_i) of the function. A simple situation is shown in Figure 3.24. The
goal is to find a good approximation of the definite integral \int_a^b f(x)\, dx .

The idea of a trapezoidal integration is to replace the function f (x) by a piecewise linear interpolation.
Each segment is a trapezoid, whose area is easy to determine by

\int_{x_{i-1}}^{x_i} f(x)\, dx \approx \text{area of trapezoid} = \text{width} \cdot \text{average height} = (x_i - x_{i-1}) \cdot \frac{f(x_{i-1}) + f(x_i)}{2} .

For Figure 3.24 the resulting approximation of the integral is

n = 5 , \qquad h = \frac{b-a}{5} , \qquad x_i = a + i\,h \quad \text{for } i = 0, \ldots, 5

and

\int_a^b f(x)\, dx \approx h \left( \tfrac12 f(a) + f(x_1) + f(x_2) + f(x_3) + f(x_4) + \tfrac12 f(b) \right)
                  = \frac{h}{2} \left( f(a) + f(b) + 2 \sum_{i=1}^{4} f(x_i) \right) .


Figure 3.24: Trapezoidal integration


3–38 Result : Trapezoidal integration


The integral of a twice differentiable function over the standard interval [−h/2, +h/2] is approximated by

\int_{-h/2}^{+h/2} f(x)\, dx \approx \frac{h}{2} \left( f(-h/2) + f(+h/2) \right)

and the error is estimated by

\text{error}_h \le \frac{1}{6} \max_{-h/2\le \xi \le +h/2} |f''(\xi)| \; h^3 .

For a smooth function defined on the interval [a, b], divided up into n subintervals of length h = \frac{b-a}{n} and
x_i = a + i\,h this leads to

\int_a^b f(x)\, dx \approx \frac{h}{2} \left( f(a) + f(b) + 2 \sum_{i=1}^{n-1} f(x_i) \right)

with the error

\text{error}_{[a,b]} \le \frac{b-a}{12} \max_{a\le \xi\le b} |f''(\xi)| \; h^2 .

For polynomials up to degree 1 this leads to exact values for the integral. 3
The pattern of coefficients for the trapezoidal rule is given by

\frac{b-a}{2n} \, (1\,,\, 2\,,\, 2\,,\, 2\,,\, \ldots\,,\, 2\,,\, 2\,,\, 1) .

The verification of this result is given as an exercise and the arguments to be used are very similar to
Result 3–40.
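A minimal sketch (not from the notes) turning this weight pattern into a composite trapezoidal rule; the function and interval are chosen for illustration only.

f = @(x) cos(x);  a = 0;  b = pi/2;  n = 50;     % example setup
x = linspace(a,b,n+1);
w = [1, 2*ones(1,n-1), 1]*(b-a)/(2*n);           % pattern (1,2,...,2,1)*(b-a)/(2n)
Trapez = w*f(x)'                                 % approximately 1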

3–39 Example : The commands trapz() and cumtrapz()


With Octave/MATLAB the trapezoidal rule is implemented in trapz(). The command takes 2 arguments,
the values of the independent and dependent variables. The spacing does not have to be uniform. With
cumtrapz() (cumulative trapezoidal rule) the definite integral

I(x) = \int_a^x f(t)\, dt

is evaluated, using the trapezoidal rule on each of the sub-intervals. In Figure 3.25 find the results for the
elementary integral I = \int_0^{\pi/2} \cos(x)\, dx = 1.

• For n = 3 sub-intervals the trapezoidal rule leads to the answer I ≈ 0.9770. Since the graph of
cos(x) is above the piecewise linear interpolation used by the trapezoidal rule, it is no surprise that
the approximate answer is too low.

• Using cumtrapz() the integral I(x) = \int_0^x \cos(t)\, dt = \sin(x) is approximated, again leading to
values that are too small.

• For n = 100 sub-intervals the trapezoidal rule leads to the answer I ≈ 0.999979, i.e. considerably
more accurate. Similarly for the indefinite integral \int_0^x \cos(t)\, dt, where the true integral \sin(x) and the
approximate curve are indistinguishable.

The code to generate Figure 3.25 is not too complicated.

Figure 3.25: The integral \int_0^{\pi/2} \cos(x)\, dx by the trapezoidal rule

x100 = linspace(0,pi/2,100); y100 = cos(x100);


n = 3; x = linspace(0,pi/2,n+1); y = cos(x);
Integral = trapz(x,y)
figure(1)
plot(x,y,'+-k',x,cumtrapz(x,y),'+-b',x100,y100,'-g',x100,cumtrapz(x100,y100),'-r')
xlabel('x'); xlim([0,pi/2]);
legend('cos(x), n=3','\int cos(x), n=3', ...
       'cos(x), n=100','\int cos(x), n=100','location','south')

3–40 Result : Simpson integration


The integral of a smooth function over the standard interval [−h, +h] is approximated by

\int_{-h}^{+h} f(x)\, dx \approx \frac{h}{3} \left( f(-h) + 4\, f(0) + f(+h) \right)

and the error is estimated by

\text{error}_{2h} \le \frac{1}{90} \max_{-h\le\xi\le+h} |f^{(4)}(\xi)| \; h^5 .

For a smooth function defined on the interval [a, b], divided up into an even number n of subintervals of length
h = \frac{b-a}{n} and x_i = a + i\,h, this leads to

\int_a^b f(x)\, dx \approx \frac{h}{3} \left( f(a) - f(b) + 2 \sum_{i=1}^{n/2} \bigl( 2\, f(x_{2i-1}) + f(x_{2i}) \bigr) \right)

with the error

\text{error}_{[a,b]} \le \frac{b-a}{180} \max_{a\le\xi\le b} |f^{(4)}(\xi)| \; h^4 .

For polynomials up to degree 3 this leads to exact values for the integral. 3
For the interval [−h, +h] the coefficients for the function values at −h, 0 and +h are \frac{h}{3}\,(1\,,\,4\,,\,1). For
the integration over an interval [a, b] this leads to the pattern of coefficients

\frac{b-a}{3n} \, (1\,,\,4\,,\,2\,,\,4\,,\,2\,,\,4\,,\,2\,,\, \ldots\,,\,4\,,\,2\,,\,4\,,\,1) .


Proof : The goal is to find constants c_{-1}, c_0 and c_{+1} such that

\int_{-h}^{+h} u(x)\, dx \approx h \left( c_{-1}\, u(-h) + c_0\, u(0) + c_{+1}\, u(+h) \right)

with an error as small as possible. To arrive at this goal use a Taylor approximation at x = 0

u(x) = u(0) + u'(0)\, x + \frac{u''(0)}{2}\, x^2 + \frac{u'''(0)}{6}\, x^3 + \frac{u^{(4)}(0)}{4!}\, x^4 + \frac{u^{(5)}(0)}{5!}\, x^5 + \frac{u^{(6)}(0)}{6!}\, x^6 + O(x^7) .   (3.10)

For small values of x contributions of low order are of higher importance. Thus we proceed contribution by
contribution, starting with the low order terms, i.e. first 1, then x, then x^2, . . .
1 : Using only the first contribution u(x) ≈ u(0) and the integral of u(x) = 1 leads to

\int_{-h}^{+h} 1\, dx = 2\,h = h \left( c_{-1}\cdot 1 + c_0\cdot 1 + c_{+1}\cdot 1 \right)

This integral is exact if the equation c_{-1} + c_0 + c_{+1} = 2 is satisfied.

x : Using only the first two contributions u(x) ≈ u(0) + u'(0)\, x verify that the second contribution
u(x) = x is integrated exactly, i.e.

\int_{-h}^{+h} x\, dx = 0 = h \left( c_{-1}\,(-h) + c_0\cdot 0 + c_{+1}\, h \right)

This integral is exact if the equation −c_{-1} + c_{+1} = 0 is satisfied.


x^2 : Using the first three contributions u(x) ≈ u(0) + u'(0)\, x + \frac{u''(0)}{2}\, x^2 verify that the third contribution
u(x) = x^2 is integrated exactly, i.e.

\int_{-h}^{+h} x^2\, dx = \frac{2}{3}\, h^3 = h \left( c_{-1}\, h^2 + c_0\cdot 0 + c_{+1}\, h^2 \right)

This integral is exact if the equation c_{-1} + c_{+1} = \frac{2}{3} is satisfied.

• This leads to a linear system of equations

\begin{pmatrix} 1 & 1 & 1 \\ -1 & 0 & +1 \\ 1 & 0 & 1 \end{pmatrix}
\begin{pmatrix} c_{-1} \\ c_0 \\ c_{+1} \end{pmatrix} =
\begin{pmatrix} 2 \\ 0 \\ \frac{2}{3} \end{pmatrix}
\quad \Longrightarrow \quad
\begin{pmatrix} c_{-1} \\ c_0 \\ c_{+1} \end{pmatrix} =
\begin{pmatrix} \frac{1}{3} \\ \frac{4}{3} \\ \frac{1}{3} \end{pmatrix}

These equations have a unique solution c_{-1} = c_{+1} = \frac{1}{3} and c_0 = \frac{4}{3} and thus the approximation
formula for the integral is

\int_{-h}^{+h} u(x)\, dx = \frac{h}{3} \left( u(-h) + 4\, u(0) + u(+h) \right) + \text{error}_{2h} .

• To estimate the size of the error first verify that this sum leads to the exact integral for the function
u(x) = x^3, since \int_{-h}^{+h} x^3\, dx = 0 = \frac{h}{3}\,((-h)^3 + 4\cdot 0 + (+h)^3). The integration for x^4 is not exact, since
\int_{-h}^{+h} x^4\, dx = \frac{2}{5}\, h^5 \ne \frac{h}{3}\,((-h)^4 + 4\cdot 0 + (+h)^4) = \frac{2}{3}\, h^5. The error equals \frac{2}{3}\, h^5 - \frac{2}{5}\, h^5 = \frac{4}{15}\, h^5. The
integration of the higher power x^5 leads to a contribution proportional to h^6, which is considerably
smaller than the above \frac{4}{15}\, h^5 for 0 < h \ll 1. Thus the error for the integration over [−h, +h] is
estimated by

\text{error}_{2h} \approx \frac{4}{15}\, \frac{u^{(4)}(0)}{4!}\, h^5 = \frac{1}{90}\, u^{(4)}(0)\, h^5 .


• To obtain the estimate for the integral from a to b use n = \frac{b-a}{2h} of those integrals, leading¹⁰ to

\text{error}_{[a,b]} \le \frac{b-a}{180} \max_{a\le\xi\le b} |u^{(4)}(\xi)| \; h^4 .

The above computations can be organized in tabular form.

  p(x)  |  ∫_{-h}^{+h} p(x) dx = h (c_{-1} p(-h) + c_0 p(0) + c_{+1} p(+h))
  1     |  2h = h (c_{-1} + c_0 + c_{+1})                                        equation 1
  x     |  0 = h (-c_{-1} h + c_0·0 + c_{+1} h)                                  equation 2
  x²    |  (2/3) h³ = h (+c_{-1} h² + c_0·0 + c_{+1} h²)                         equation 3
  x³    |  0 = h (-c_{-1} h³ + c_0·0 + c_{+1} h³)                                OK
  x⁴    |  (2/5) h⁵ = h (+c_{-1} h⁴ + c_0·0 + c_{+1} h⁴) = (2/3) h⁵              not OK

The above Simpson integration over an interval [a, b] can only be applied if an even number n of sub-
intervals is used. Thus use an odd number of data points x_i = a + i\,\frac{b-a}{n} for i = 0, 1, 2, . . . , n. If an odd
number of intervals of equal length h is used one can use the Simpson 3/8–rule to preserve the order of
convergence, i.e. the error remains proportional to h^4.

\int_{x_0}^{x_3} f(x)\, dx \approx \frac{3\,h}{8} \left( f(x_0) + 3\, f(x_1) + 3\, f(x_2) + f(x_3) \right)
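A minimal sketch (not from the notes) of the 3/8–rule on one set of four points, verifying that a cubic polynomial is integrated exactly; the polynomial is chosen for illustration only.

f = @(x) x.^3 - 2*x + 1;                   % a cubic polynomial
x0 = 0; h = 0.5; x = x0 + h*(0:3);         % four equidistant points
Simpson38 = 3*h/8*(f(x(1)) + 3*f(x(2)) + 3*f(x(3)) + f(x(4)))
Exact = (x(4)^4/4 - x(4)^2 + x(4)) - (x0^4/4 - x0^2 + x0)   % antiderivative x^4/4 - x^2 + x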

3–41 Example : Examine the elementary integral \int_0^{\pi/2} \sin(x)\, dx = 1 . The comparison of the trapezoidal
rule, Simpson's formula and the Gauss method should convince you that Gauss is often extremely efficient.

• Use the trapezoidal rule with 2 sub-intervals, thus 3 evaluations of sin(x).

\int_0^{\pi/2} \sin(x)\, dx \approx \frac{\pi/4}{2} \left( \sin(0) + 2\,\sin(\tfrac{\pi}{4}) + \sin(\tfrac{\pi}{2}) \right) = \frac{\pi}{8} \left( 0 + 2\,\frac{\sqrt{2}}{2} + 1 \right) \approx 0.9481

• Use Simpson's rule with 1 sub-interval, thus 3 evaluations of sin(x).

\int_0^{\pi/2} \sin(x)\, dx \approx \frac{\pi/4}{3} \left( \sin(0) + 4\,\sin(\tfrac{\pi}{4}) + \sin(\tfrac{\pi}{2}) \right) = \frac{\pi}{12} \left( 0 + 4\,\frac{\sqrt{2}}{2} + 1 \right) \approx 1.0023

Observe that Simpson's approach generates a smaller error than the trapezoidal rule.

• Using the Gauss idea from Result 3–43 in the next section with one sub-interval with three points
leads to an even more accurate result.

\int_0^{\pi/2} \sin(x)\, dx \approx \frac{\pi/4}{9} \left( 5\,\sin(\tfrac{\pi}{4}(1 - \sqrt{\tfrac{3}{5}})) + 8\,\sin(\tfrac{\pi}{4}) + 5\,\sin(\tfrac{\pi}{4}(1 + \sqrt{\tfrac{3}{5}})) \right) \approx 1.000008

10
The proof shown in these notes only leads to an estimate of the error, not a rigorous proof. The simplicity of the argument used
justifies the lack of mathematical rigor. The statement is correct though, find a proof in [RalsRabi78, §4].


x0 = pi/4; h = pi/4;
xm = x0 - sqrt(3/5)*h; xp = x0 + sqrt(3/5)*h;
Gauss = h/9*(5*sin(xm) + 8*sin(x0) + 5*sin(xp))
difference = Gauss - 1
-->
Gauss = 1.0000081216
difference = 8.1216e-06

3–42 Example : Simpson's algorithm is easy to implement in Octave/MATLAB and can compute integrals
rather accurately. Find one possible implementation in simpson.m below and determine approximate
values of \int_0^{\pi/2} \cos(x)\, dx = 1, using 10 and 100 sub-intervals.

format long
n = 10 ; x = linspace(0,pi/2,n+1); Integral10 = simpson(cos(x),0,pi/2,n)
n = 100 ; x = linspace(0,pi/2,n+1); Integral100 = simpson(cos(x),0,pi/2,n)
-->
Integral10 = 1.000003392220900
Integral100 = 1.000000000338236


simpson.m
function res = simpson(f,a,b,n)

%% simpson(integrand,a,b,n) compute the integral of the function f


%% on the interval [a,b] using Simpson's rule
%% use n subintervals of equal length , n has to be even, otherwise n+1 is used
%% f is either a function handle, e.g @sin or a vector of values

if isa(f,'function_handle')
n = round(n/2+0.1)*2; %% assure even number of subintervals
h = (b-a)/n;
x = linspace(a,b,n+1);
f_x = x;
for k = 0:n
f_x(k+1) = feval(f,x(k+1));
end%for
else
n = length(f);
if (floor(n/2)-n/2==0)
error('simpson: odd number of data points required');
else
n = n-1;
h = (b-a)/n;
f_x = f(:)’;
end%if
end%if

w = 2*[ones(1,n/2); 2*ones(1,n/2)]; w = w(:); % construct the simpson weights


w = [w;1]; w(1)=1;
res = (b-a)/(3*n)*f_x*w;


Problematic cases for numerical integration


The above results are all based on the Taylor approximation (3.10), which requires that the function to be
integrated is sufficiently often differentiable. This is not always the case and the domain of integration might not be a
finite interval [a, b]. Thus the algorithms have to be adapted.

• Examine a function that is not smooth at a point inside the interval, e.g. f (x) = | cos(x)|, which is
not differentiable at x = \frac{\pi}{2}. Then determine the integral

\int_0^2 |\cos(x)|\, dx = \int_0^{\pi/2} \cos(x)\, dx + \int_{\pi/2}^2 -\cos(x)\, dx = 1 + 1 - \sin(2) \approx 1.09070 .

Using Simpson's method with n = 100 sub-intervals leads to an error of ≈ 3 · 10^{-5}, while integration
of the smooth function cos(x) leads to a much smaller error of ≈ 8 · 10^{-10}. The problem can be
avoided by splitting up the integral \int_0^2 = \int_0^{\pi/2} + \int_{\pi/2}^2 . Many Octave/MATLAB integration routines
use options to indicate this type of special points inside the interval of integration.

n = 100; x = linspace(0,2,n+1); y = abs(cos(x));


Integral100 = simpson(y,0,2)
Error = Integral100 - (2-sin(2))

Integral100A = simpson(cos(x),0,2)
ErrorA = Integral100A - sin(2)

-->
Integral100 = 1.0907
Error = 2.7390e-05

Integral100A = 0.9093
ErrorA = 8.0830e-10

• For a function with an infinite integrand Simpson's rule and the trapezoidal rule fail. Find information
on how to handle functions with \lim_{x\to a} f(x) = \infty in [IsaaKell66, §6.2, p.346].

• For a function with an infinite domain Simpson's rule and the trapezoidal rule fail. Find information on
how to handle integrals \int_a^{+\infty} f(x)\, dx in [IsaaKell66, §6.3, p.350].

3.3.2 Integration of a Function given by a Formula


In this subsection assume that a function is given by an explicit expression y = f (x) and for fixed values
−∞ < a < b < +∞ the goal is to find a good approximation of the definite integral \int_a^b f(x)\, dx . Usually
the relative or absolute tolerance for the error is specified and the integral then computed. The number and
location of points where the function is evaluated can be selected.

Gauss integration over an interval


3–43 Result : Gauss integration


The integral of a smooth function over the standard interval [−h, +h] is approximated by

\int_{-h}^{+h} f(x)\, dx \approx \frac{h}{9} \left( 5\, f(-\sqrt{\tfrac{3}{5}}\, h) + 8\, f(0) + 5\, f(+\sqrt{\tfrac{3}{5}}\, h) \right)

and the error is estimated by

\text{error}_{2h} \le C\, h^7 .

For a smooth function defined on the interval [a, b], divided up into an even number of subintervals of length
h = \frac{b-a}{n} this leads to the error

\text{error}_{[a,b]} \le \tilde{C}\, h^6 .

For polynomials up to degree 5 the values for the integral are exact. 3
Proof : To integrate a general function f (x) over the interval [−h , +h] choose 3 symmetric integration
points at −ξh, 0 and +ξh and integration weights w_0 and w_1. Select these values such that the formula

\int_{-h}^{+h} f(x)\, dx \approx 2\,h \left( w_1\, f(-\xi h) + w_0\, f(0) + w_1\, f(\xi h) \right)
yields a result with an error as small as possible. To determine the optimal values for w0 , w1 and ξ use
a Taylor approximation and require that polynomials of increasing order are integrated exactly. This will
lead to a small integration error. The computations are very similar to Result 3–40 for the verification of
Simpson’s rule.
Using the monomials 1, x, x2 , x3 and x4 leads to a system of 3 nonlinear equations, shown below. Since
x^5 is also integrated exactly the local integration error is proportional to h^7.
  p(x)  |  ∫_{-h}^{+h} p(x) dx = 2h (w_1 p(-ξh) + w_0 p(0) + w_1 p(+ξh))
  1     |  2h = 2h (w_1 + w_0 + w_1)                                        equation 1
  x     |  0 = 2h (-w_1 ξh + w_0·0 + w_1 ξh)                                OK
  x²    |  (2/3) h³ = 2h (+w_1 ξ²h² + w_0·0 + w_1 ξ²h²)                     equation 2
  x³    |  0 = 2h (-w_1 ξ³h³ + w_0·0 + w_1 ξ³h³)                            OK
  x⁴    |  (2/5) h⁵ = 2h (+w_1 ξ⁴h⁴ + w_0·0 + w_1 ξ⁴h⁴)                     equation 3
  x⁵    |  0 = 2h (-w_1 ξ⁵h⁵ + w_0·0 + w_1 ξ⁵h⁵)                            OK
  x⁶    |  (2/7) h⁷ ≠ 2h (+w_1 ξ⁶h⁶ + w_0·0 + w_1 ξ⁶h⁶)                     not OK

To solve the system

w_0 + 2\,w_1 = 1 , \qquad 2\,w_1\,\xi^2 = \frac{1}{3} \qquad \text{and} \qquad 2\,w_1\,\xi^4 = \frac{1}{5}

divide the second and third equation to conclude \xi^2 = \frac{3}{5} and thus \xi = \pm\sqrt{\frac{3}{5}}. Then the second equation
leads to w_1 = \frac{5}{18} (i.e. 2\,h\,w_1 = \frac{5\,h}{9}), and the first equation allows to conclude w_0 = \frac{4}{9} (i.e. 2\,h\,w_0 = \frac{8\,h}{9}).

\int_{-h}^{h} f(x)\, dx \approx \frac{h}{9} \left( 5\, f(-\sqrt{\tfrac{3}{5}}\, h) + 8\, f(0) + 5\, f(+\sqrt{\tfrac{3}{5}}\, h) \right)


The above is just one example of the Gauss integration idea. It is possible to use fewer or more integra-
tion points. The literature on the topic is vast, see e.g. https://en.wikipedia.org/wiki/Gaussian quadrature .
There are tables for the different Gauss integration schemes, e.g. the wikipedia page, [AbraSteg, Table 25.4]
or the online version [DLMF15, §3.5] at http://dlmf.nist.gov/, or [TongRoss08, Table 6.1, page 188]. The
two point approximation

\int_{-h}^{h} f(x)\, dx \approx h \left( f(-\sqrt{\tfrac{1}{3}}\, h) + f(+\sqrt{\tfrac{1}{3}}\, h) \right)

and its error estimate \text{error}_{2h} \le C\, h^5 are given as an exercise.

3–44 Example : The Gauss algorithm for a three or five point integration is implemented in Octave (the
source code has to be modified to run under MATLAB). Find one possible implementation in IntegrateGauss.m
with the built-in help.

help IntegrateGauss
-->
INTEGRAL = IntegrateGauss (X,F,N)

Integrate the function F over the interval X

parameters:
* F the function, given as a function handle or string with the name
* X the vector of x values with the domain of integration on each
subinterval x_i to x_(i+1) a three or five point Gauss integration is used
* N if this optional parameter equals 5, a five point Gauss formula is used,
the default value is 3

return value: INTEGRAL the value of the integral


R π/2
Use the following code to determine approximate values of 0 cos(x) dx = 1, using 10 sub-intervals
and three or five Gauss points on each sub-interval.

n = 10 ; x = linspace(0,pi/2,n+1);
Integral3 = IntegrateGauss(x,@(x)sin(x))
Error3 = Integral3-1
Integral5 = IntegrateGauss(x,@(x)sin(x),5)
Error5 = Integral5-1
-->
Integral3 = 1.0000
Error3 = 7.4574e-12
Integral5 = 1.0000
Error5 = -1.1102e-16

The result shows that a Gauss integration with 10 sub-intervals and five Gauss points already generates a
result whose accuracy is limited by the CPU accuracy. ♦

IntegrateGauss.m
function Integral = IntegrateGauss (x,f,n)
if nargin <= 2 n = 3; %% default is 3 Gauss points
elseif n!=5 n = 3;
endif
x = x(:)’; dx = diff(x);
if n==3
pos = 0.5 + [-sqrt(0.6)/2 0 +sqrt(0.6)/2]'; %% location of Gauss points
weight = [5 8 5]'/18; %% weight for Gauss integration
else %% n=5
pos = 0.5 + [-1/6*sqrt(5+2*sqrt(10/7)) -1/6*sqrt(5-2*sqrt(10/7))...
0 +1/6*sqrt(5-2*sqrt(10/7)) +1/6*sqrt(5+2*sqrt(10/7))]';
weight = [(322-13*sqrt(70))/1800 (322+13*sqrt(70))/1800 64/225...
(322+13*sqrt(70))/1800 (322-13*sqrt(70))/1800 ]';
endif
IntegrationPoints = x(1:end-1) + pos*dx;
IntegrationWeights = weight*dx;
if (is_function_handle(f))
Integral = sum( f(IntegrationPoints(:)).*IntegrationWeights(:));
else
Integral = sum( feval(f,IntegrationPoints(:)).*IntegrationWeights(:));
endif
endfunction

The method of Gauss to integrate is used extensively by the method of finite elements to integrate over
triangles or rectangles, see Sections 6.5.1 and 6.8.

3–45 Example : Convergence of the basic algorithms


The exact integral to be examined is given by

\int_0^{2\pi} \sin(2\,x)\, \exp(-x)\, dx = \frac{2}{5} \left( 1 - \exp(-2\,\pi) \right) \approx 0.3993 .   (3.11)

Find the graph of this function in Figure 3.27(a). Three algorithms are used to determine the integral. For n
sub-intervals use h = \frac{2\pi}{n}.

• trapezoidal: the error is expected to be proportional to h^2. In a double logarithmic graph this
should¹² lead to a straight line with slope 2.

• Simpson: the error is expected to be proportional to h^4 and thus a straight line with slope 4.

• Gauss: with three points, the error is expected to be proportional to h^6 and thus a straight line with
slope 6.

All of the above is confirmed by Figure 3.26 and Table 3.5. The surprising horizontal segments for the
trapezoidal and Simpson's rule between n = 2 and n = 4 are caused by the zeros of sin(2 x), which
coincide with the points of evaluation. Thus the algorithms estimate the value of the integral by 0 . ♦

   n  |  trapezoidal  |   Simpson    |    Gauss
   2  |  -3.9925e-01  | -3.9925e-01  | -2.0145e-02
   4  |  -3.9925e-01  | -3.9925e-01  | -4.7235e-04
   8  |  -1.0334e-01  | -4.7057e-03  | -5.1212e-06
  16  |  -2.5715e-02  |  1.6044e-04  | -7.1676e-08
  32  |  -6.4176e-03  |  1.5004e-05  | -1.0885e-09
  64  |  -1.6036e-03  |  1.0076e-06  | -1.6886e-11

Table 3.5: Approximation errors of three integration algorithms. The examined integral is shown in (3.11)
and the width of the sub-intervals is h = \frac{2\pi}{n}.

¹² Use the laws of logarithms and z = C · h^p to conclude log(z) = log(C) + p · log(h).


Figure 3.26: The approximation errors of three integration algorithms as function of the stepsize

Richardson extrapolation for improved convergence


For the trapezoidal rule and Simpson's method the order of convergence (for smooth functions) is known.
The result I_h with interval length h and the result I_{h/2} with interval length \frac{h}{2} can be used to generate an even
more accurate estimate by extrapolation.

I_h - I \approx c\, h^p ,\quad I_{h/2} - I \approx c\, (\tfrac{h}{2})^p
\quad \Longrightarrow \quad
I_h - I \approx c\, h^p ,\quad 2^p\, (I_{h/2} - I) \approx c\, h^p
\quad \Longrightarrow \quad
(2^p - 1)\, I \approx 2^p\, I_{h/2} - I_h .

• For the trapezoidal rule use p = 2 to obtain the improved estimate

I \approx \frac{2^2\, I_{h/2} - I_h}{2^2 - 1} = \frac{4\, I_{h/2} - I_h}{3} .

• For Simpson's method use p = 4 to obtain the improved estimate

I \approx \frac{2^4\, I_{h/2} - I_h}{2^4 - 1} = \frac{16\, I_{h/2} - I_h}{15} .

A similar approach is used in Section 3.4.5 to approximate solutions of ordinary differential equations.
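A minimal sketch (not from the notes) of this extrapolation for the trapezoidal rule, applied to the elementary integral from Example 3–39:

f = @(x) cos(x); a = 0; b = pi/2;
n = 10;  x1 = linspace(a,b,n+1);    Ih  = trapz(x1,f(x1));    % step size h
         x2 = linspace(a,b,2*n+1);  Ih2 = trapz(x2,f(x2));    % step size h/2
I_extrapolated = (4*Ih2 - Ih)/3                               % improved estimate with p = 2
errors = [Ih-1, Ih2-1, I_extrapolated-1]                      % the extrapolated error is much smaller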

The idea of adaptive integration


To estimate the error when computing an integral I = \int_a^b f(x)\, dx one can use a subdivision in N sub-
intervals, leading to the approximation I_N, then one with 2 N sub-intervals, leading to I_{2N}. If the difference is
small enough accept the result, otherwise determine I_{4N}.
To apply this idea use the trapezoidal rule with N sub-intervals.

T_N = \frac{b-a}{2\,N} \left( f(a) + 2 \sum_{i=1}^{N-1} f(a + i\,\tfrac{b-a}{N}) + f(b) \right)

For N = 6 a graphical representation of the weights is given by

N = 6 :   (1  2  2  2  2  2  1) · \frac{b-a}{12}


The summations for T_N and T_{2N} are not independent, since all points used by T_N are also used by T_{2N},
and some more.

T_N = \frac{b-a}{2\,N} \left( f(a) + 2 \sum_{i=1}^{N-1} f(a + i\,\tfrac{b-a}{N}) + f(b) \right)

T_{2N} = \frac{b-a}{4\,N} \left( f(a) + 2 \sum_{i=1}^{2N-1} f(a + i\,\tfrac{b-a}{2N}) + f(b) \right)

represented by the weight patterns

N = 6 :    (1  2  2  2  2  2  1) · \frac{b-a}{12}
N = 12 :   (1  2  2  2  2  2  2  2  2  2  2  2  1) · \frac{b-a}{24}

Thus conclude

T_{2N} = \frac{1}{2}\, T_N + \frac{b-a}{2\,N} \sum_{i=1}^{N} f(a + (2i-1)\,\tfrac{b-a}{2N}) .

The function only has to be evaluated at the new points. Using the sequence

T_1 , T_2 , T_4 , T_8 , T_{16} , T_{32} , . . .

it is possible to stop as soon as the desired accuracy is achieved.
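A minimal sketch (not from the notes) of this doubling strategy, reusing the already computed function values; function, interval and tolerance are chosen for illustration only.

f = @(x) cos(x);  a = 0;  b = pi/2;  tol = 1e-6;
N = 1;  T = (b-a)/2*(f(a)+f(b));               % T_1
for k = 1:20
  xnew = a + (2*(1:N)-1)*(b-a)/(2*N);          % only the new midpoints
  Tnew = T/2 + (b-a)/(2*N)*sum(f(xnew));       % T_(2N) computed from T_N
  if abs(Tnew-T) < tol, break; end
  T = Tnew;  N = 2*N;
end
Integral = Tnew                                % final trapezoidal approximation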


Since the error for Simpson's approach is expected to be much smaller, the above has to be adapted to
use this. Use

S_{2N} = \frac{b-a}{3 \cdot 2N} \left( f(a) + 4 \sum_{i=1}^{N} f(x_{2i-1}) + 2 \sum_{i=1}^{N-1} f(x_{2i}) + f(b) \right) ,

where x_i = a + i\,\frac{b-a}{2N}. A graphical representation of the weight patterns is given by

T_6 :     (2  4  4  4  4  4  2) · \frac{b-a}{2\cdot 12}
T_{12} :  (1  2  2  2  2  2  2  2  2  2  2  2  1) · \frac{b-a}{2\cdot 12}
S_{12} :  (1  4  2  4  2  4  2  4  2  4  2  4  1) · \frac{b-a}{3\cdot 12}

Thus for Simpson's rule use

S_{2N} = \frac{2}{3} \left( 2\, T_{2N} - \frac{1}{2}\, T_N \right) = \frac{1}{3} \left( 4\, T_{2N} - T_N \right) .

Based on Gauss integration a similar saving of evaluations of the function f is not possible, since only
points inside the sub-intervals are used.

For graphs similar to Figure 3.27(a) it works well to use finer meshes on all of the interval, but for
graphs similar to Figure 3.27(b) a local adaptation is asked for, i.e. use more points where the function

(a) a smooth graph          (b) a local refinement is necessary

Figure 3.27: Two graphs of functions for integration

values vary severely. As example examine Simpson's approach on a sub-interval of length 4 h starting at x_i
and compare the two approximations of the integral.

\int_{x_i}^{x_i+4h} f(x)\, dx \approx R_1 = \frac{2\,h}{3} \left( f(x_i) + 4\, f(x_{i+2}) + f(x_{i+4}) \right)

\int_{x_i}^{x_i+4h} f(x)\, dx \approx R_2 = \frac{h}{3} \left( f(x_i) + 4\, f(x_{i+1}) + 2\, f(x_{i+2}) + 4\, f(x_{i+3}) + f(x_{i+4}) \right)

If the difference |R1 −R2 | is too large the interval has to be divided, otherwise proceed with the given result.
The integration algorithms provided by MATLAB/Octave use local adaptation.
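A minimal sketch (not one of the built-in routines) of a recursive, locally adaptive Simpson integration based on the comparison of R_1 and R_2; save it e.g. as adaptsimpson.m.

function I = adaptsimpson(f,a,b,tol)
  %% recursive adaptive Simpson integration, illustration only
  m = (a+b)/2;  h = (b-a)/4;
  R1 = (b-a)/6*(f(a) + 4*f(m) + f(b));                        % coarse Simpson estimate
  R2 = (b-a)/12*(f(a) + 4*f(a+h) + 2*f(m) + 4*f(b-h) + f(b)); % refined estimate
  if abs(R1-R2) < tol
    I = R2;                                                   % accept the local result
  else                                                        % otherwise divide the interval
    I = adaptsimpson(f,a,m,tol/2) + adaptsimpson(f,m,b,tol/2);
  end
end

A call adaptsimpson(@(x)abs(cos(x)),0,2,1e-8) uses many points close to x = \pi/2 and only few points elsewhere.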

3.3.3 Integration Routines Provided by MATLAB/Octave


In this section some of the functions provided by Octave and/or MATLAB are illustrated. For more details
consult the appropriate manuals.

How to pass the function as argument to the numerical integration commands


With Octave and MATLAB there are different ways to pass a function to the integration routine.

• As a string with the function name


A function can be defined in a separate file, e.g.
ff.m
function res = ff(x)
res = x.*sin (1./x).*sqrt(abs(x-1));
end

This defines the function

ff(x) = x \cdot \sin(\tfrac{1}{x}) \cdot \sqrt{|x - 1|} .
Observe that the code for the function is written with vectorization in mind, i.e. it can be evaluated
for many arguments with one function call.


ff([0.2:0.2:2])
-->
-0.1715 0.1854 0.3777 0.3395 0 0.3972 0.5800 0.7251 0.8491 0.9589

This is essential for the integration routines and required for speed reasons.

– With Octave the definition of this function can be before the function call in an Octave script or
in a separate file ff.m .
– With MATLAB the definition of this function has to be in a separate file ff.m or at the end of a
MATLAB script.

An integration over the interval [0, 3] can then be performed by

result = quad('ff',0,3)
-->
ans = 1.9819

• As a function handle
The above integral can also be computed by using a function handle.

ff = @(x) x.*sin(1./x).*sqrt(abs(x-1));
integral(ff,0,3)
-->
ans = 1.9819

• Function handles are very convenient to determine integrals depending on parameters. To compute the
integrals

\int_0^2 \sin(\lambda\, t)\, dt

for λ values between 1 and 2 use

for lambda = [1:0.1:2]


integral(@(t)sin(lambda*t),0,2)
end%for

Using the function integral()


For the integration by integral() the function has to be passed as a handle. The limits of integration
can be −∞ or +∞. To determine

\int_0^{+\infty} \cos(t)\, \exp(-t)\, dt

use

integral(@(t)cos(t).*exp(-t),0,Inf)
-->
ans = 0.5000

The function integral() can be used with a few possible optional parameters, specified as pairs of the
string with the name of the option and the corresponding value. The most often used options are:
• AbsTol to specify the absolute tolerance. The default value is AbsTol = 10^{-10}.


• RelTol to specify the relative tolerance. The default value is RelTol = 10^{-6}.

• The adaptive integration is refined to determine the value Q of the integral, until the condition

error ≤ max{AbsTol, RelTol · |Q|}

is satisfied, i.e. either the relative or the absolute error is small enough.

• Waypoints to specify a set of points at which the function to be integrated might not be continuous.
This can be used instead of multiple calls of integral() on sub-intervals.

As an example for the function integral() examine the integral

\int_0^2 |\cos(x)|\, dx = 2 - \sin(2) .

Observe that this function is not differentiable at x = \frac{\pi}{2}. Thus the high order of approximation for Simpson
and Gauss is not valid and problems are to be expected.

integral_exact = 2 - sin(2);
integral_1 = integral(@(x)abs(cos(x)),0,2)
integral_2 = integral(@(x)abs(cos(x)),0,2,'AbsTol',1e-12,'RelTol',1e-12)
integral_3 = integral(@(x)abs(cos(x)),0,2,'Waypoints',pi/2)
Log10_Errors = log10(abs([integral_1, integral_2 integral_3] - integral_exact))
-->
integral_1 = 1.0907
integral_2 = 1.0907
integral_3 = 1.0907
Log10_Errors = -6.5642 -15.1764 -Inf
The results illustrate the usage of the tolerance parameters. Specifying the special point x = \frac{\pi}{2} generates a
more accurate result with less computational effort.

Using the function quad()


This function uses a recursive, adaptive Simpson scheme, as presented in the previous sections. This
function will be removed from MATLAB in a future release.

• To integrate a function f over the interval [a, b] use quad(f,a,b), where f is a string with the
name of the function or a function handle.

• With the optional third argument TOL a vector with the absolute and relative tolerances can be speci-
fied.

• With Octave the optional fourth argument SING is a vector of values where singularities are expected.
In addition options can be read or set by calling quad_options().

The two functions leading to Figure 3.27 are

f_1(x) = \sin(2\,x)\, \exp(-x) \qquad \text{and} \qquad f_2(x) = \sin(\tfrac{x}{2}) - \exp(-20\,(x-2)^2)\, \sin(10\,x) .

To integrate these functions over the interval [0, 2 π] use (Octave only)


a = 0; b = 2*pi;
[q, ier, nfun, err] = quad (@(x)sin(2*x).*exp(-x), a, b)
[q, ier, nfun, err] = quad (@(x)sin(x/2)-exp(-20*(x-2).^2).*sin(10*x).^2, a, b)

The results show that the first function required 21 evaluations of the function f1 (x) while f2 (x) had to be
evaluated 273 times. This is caused by the local adaptation around x ≈ 2, obvious in Figure 3.27(b).

Using the function quadv()


The function quadv() is similar to quad(), but can evaluate vector valued integrals. The basic call is
quadv(f,a,b), where f is a string with the name of the function or a function handle. If a second return
argument is asked for the number of function evaluations is given. This function will be removed from
MATLAB in a future release, use integral() with the option ArrayValued.
To evaluate

\vec{Q} = \int_0^{\pi/2} \begin{pmatrix} \sin(t) \\ \cos(t) \end{pmatrix} \exp(-t)\; dt

use

[Q,num_eval] = quadv(@(t)[sin(t);cos(t)]*exp(-t),0,0.5*pi)
-->
Q = 0.3961
0.6039

num_eval = 17

The same can be obtained by calling integral() with the option ArrayValued .

Q = integral(@(t)[sin(t);cos(t)]*exp(-t),0,0.5*pi, 'ArrayValued',true)

More functions available in Octave


In Section 23.1 of the Octave manual a few more integration functions are documented.

quad Numerical integration based on Gaussian quadrature.

quadv Numerical integration using an adaptive vectorized Simpson’s rule.

quadl Numerical integration using an adaptive Lobatto rule.

quadgk Numerical integration using an adaptive Gauss-Kronrod rule.

quadcc Numerical integration using adaptive Clenshaw-Curtis rules.

integral A compatibility wrapper function that will choose between quadv and quadgk depending on the
integrand and options chosen.

3.3.4 Integration over Domains in R2


Integrals over rectangular domains as in Figure 3.28(a), Ω = [a, b] × [c, d] ⊂ R², are computed by nested 1-D
integrations

Q = \iint_\Omega f(x,y)\; dA = \int_a^b \left( \int_c^d f(x,y)\; dy \right) dx .


MATLAB/Octave provides commands to perform this double integration. The function f (x, y) depends on
two arguments, the corresponding MATLAB/Octave code has to accept matrices for x and y as arguments
and return a matrix of the same size.

• integral2: The basic call is Q = integral2(f,a,b,c,d). Possible options have to be


specified by the string with the name of the option and the corresponding value.

• quad2d: The basic call is Q = quad2d(f,a,b,c,d). Possible options have to be specified by
the string with the name of the option and the corresponding value.

3–46 Example : To integrate the function f (x, y) = x² + y over the rectangular domain 1 ≤ x ≤ 2 and
0 ≤ y ≤ 3 use

quad2d(@(x,y)x.^2+y,1,2,0,3)
integral2(@(x,y)x.^2+y,1,2,0,3)
-->
ans = 11.500
ans = 11.500



Figure 3.28: Domains in the plane R2

Integrals over non rectangular domains as in Figure 3.28(b) can be given by a ≤ x ≤ b and x dependent
limits for the y values c(x) ≤ y ≤ d(x). Integrals are computed by nested 1-D integrations

Q = \iint_\Omega f(x,y)\; dA = \int_a^b \left( \int_{c(x)}^{d(x)} f(x,y)\; dy \right) dx .

MATLAB/Octave provides commands to perform this double integration. The only change is that the upper
and lower limits are provided as functions of x.
Integrals over domains where the left and right limit are given as functions of y (i.e. a(y) ≤ x ≤ b(y))
are covered by swapping the order of integration,

Q = \iint_\Omega f(x,y)\; dA = \int_c^d \left( \int_{a(y)}^{b(y)} f(x,y)\; dx \right) dy .


3–47 Example : Consider the triangular domain with corners (0, 0), (1, 0) and (0, 1). Thus the limits are
0 ≤ x ≤ 1 and 0 ≤ y ≤ 1 − x . To integrate the function

f(x,y) = \frac{1}{\sqrt{x + y}\; (1 + x + y)^2}

use

\iint f(x,y)\; dA = \int_0^1 \left( \int_0^{1-x} \frac{1}{\sqrt{x + y}\; (1 + x + y)^2}\; dy \right) dx .

This is readily implemented using integral2() or quad2d() .

fun = @(x,y) 1./( sqrt(x + y) .* (1 + x + y).^2 );


ymax = @(x) 1 - x;
Q1 = quad2d(fun,0,1,0,ymax)
Q2 = integral2(fun,0,1,0,ymax)
-->
Q1 = 0.2854
Q2 = 0.2854


3–48 Example : The previous example can also be solved using polar coordinates r and φ. Express the
area element dA in polar coordinates by

dA = dx \cdot dy = r \cdot dr\; d\varphi .

For the function use

f_p(r,\varphi) = f(x,y) = f(r\,\cos(\varphi),\, r\,\sin(\varphi)) .

The upper limit y = 1 − x of the domain has to be expressed in terms of r and φ by

y = 1 - x \quad \Longrightarrow \quad r\,\sin(\varphi) = 1 - r\,\cos(\varphi) \quad \Longrightarrow \quad r_{max}(\varphi) = \frac{1}{\cos(\varphi) + \sin(\varphi)} .

Then the double integral

\iint f_p(r,\varphi)\; dA = \int_0^{\pi/2} \left( \int_0^{r_{max}(\varphi)} f_p(r,\varphi)\; r\; dr \right) d\varphi

is computed, using integral2() or quad2d().

polarfun = @(theta,r) fun(r.*cos(theta),r.*sin(theta)).*r;


rmax = @(theta) 1./(sin(theta) + cos(theta));
Q1 = quad2d(polarfun,0,pi/2,0,rmax)
Q2 = integral2(polarfun,0,pi/2,0,rmax)
-->
Q1 = 0.2854
Q2 = 0.2854


3–49 Observation :

• MATLAB/Octave provide another function dblquad().

• For triple integrals MATLAB/Octave provide the command integral3() with a syntax similar to
integral2() .

• To integrate over triangles or rectangles is a task required for the method of finite elements. Find
results in this direction in Sections 6.5.1 and 6.8.


3.4 Solving Ordinary Differential Equations, Initial Value Problems


In this section a few types of ordinary differential equations are solved by numerical methods. The goal is
to
• understand how to set up ordinary differential equations for numerical algorithms.
• understand the basics of some algorithms used to solve ODEs.
• be able to use Octave/MATLAB to solve ODEs and systems of ODEs reliably.
To achieve this goal the section is organized as follows:
• In subsection 3.4.1 different types of ODEs are shown and Octave/MATLAB commands used to gen-
erate numerical approximations. The provided examples should help to use MATLAB/Octave to solve
ODEs.
• In subsection 3.4.2 find the basic ideas for three ODE solvers: Euler, Heun and a Runge–Kutta
method. The essential error estimates are shown. Simple codes for algorithms with fixed step size are
given. The stability of the algorithms is examined.
• In subsection 3.4.4 general Runge–Kutta schemes and their Butcher tables are introduced.
• In subsection 3.4.5 the idea of local extrapolation and adaptive step sizes is presented and illustrated
by one example.
• In subsection 3.4.6 the usage of the solvers provided by Octave/MATLAB is presented.
• In subsection 3.4.7 four algorithms in Octave/MATLAB are examined carefully. Particular attention is
given to stiff problems.

3.4.1 Different Types of Ordinary Differential Equations


Introduction
The simplest form of an ODE (Ordinary Differential Equation) is

\frac{d}{dt} u(t) = f(t, u(t))

for a given function f : [t_{init}, t_{end}] × R −→ R. An additional initial time t_0 and an initial value u_0 = u(t_0)
lead to an IVP (Initial Value Problem). A solution has to be a function, satisfying the differential equation
and the initial condition. One can show that for differentiable functions f any IVP has a unique solution on
an interval a < t_0 < b. The final time t < b might be smaller than +∞, since the solution could blow up in
finite time.

3–50 Example : The logistic differential equation


The behavior of the size of a population 0 ≤ p(t) with a limited nutrition supply can be modeled by the
logistic differential equation

\frac{d}{dt} p(t) = (\alpha - p(t))\; p(t) .
In this case the differential equation with f (p) = (α − p) p is autonomous, i.e. f does not explicitly depend
on the time t. In Figure 3.29 three solutions for α = 2 with different initial values are shown. The vector
field is generated by displaying many (rescaled) vectors (1, f (p)) attached at points (t, p). The function p(t)
being a solution of the ordinary differential equation is equivalent to the slope of the curve to coincide with
the directions of the vector field. Thus vector fields can be useful to understand the qualitative behavior of
solutions of ODEs. Figure 3.29 is generated by the code below.


Figure 3.29: Vector field and three solutions for a logistic equation

t_max = 5; p_max = 3;
[t,p] = meshgrid(linspace(0, t_max,20),linspace(0,p_max,20));
v1 = ones(size(t)); v2 = p.*(2-p);
figure(1); quiver(t,p,v1,v2)
xlabel('time t'); ylabel('population p')
axis([0 t_max, 0 p_max])

[t1,p1] = ode45(@(t,p)p.*(2-p),linspace(0,t_max,50),0.4);
[t2,p2] = ode45(@(t,p)p.*(2-p),linspace(0,t_max,50),3.0);
[t3,p3] = ode45(@(t,p)p.*(2-p),linspace(0,t_max,50),0.01);
hold on
plot(t1,p1,'r',t2,p2,'r', t3,p3,'r')
hold off

Systems of ordinary differential equations


The above idea can be applied to a system of ODEs. As example consider the famous predator–prey model
by Volterra–Lotka¹³.
3–51 Example : Volterra–Lotka predator–prey model
Consider two different species with the size of their population given by x(t) and y(t).

x(t) population size of prey at time t
y(t) population size of predator at time t

The predators y (e.g. sharks) are feeding on the prey x (e.g. small fish). The food supply for the prey is
limited by the environment. The behavior of these two populations can be described by a system of two first
order differential equations.

\frac{d}{dt} x(t) = (c_1 - c_2\, y(t))\; x(t)

\frac{d}{dt} y(t) = (c_3\, x(t) - c_4)\; y(t)
¹³ Proposed by Alfred J. Lotka in 1910 and Vito Volterra in 1926.


where the c_i are positive constants. The right hand side can be implemented in a function file VolterraLotka.m
in MATLAB/Octave.
VolterraLotka.m
function res = VolterraLotka(x)
c1 = 1; c2 = 2; c3 = 1; c4 = 1;
res = [(c1-c2*x(2))*x(1);
(c3*x(1)-c4)*x(2)];
end%function

With the help of the above function generate information about the solutions of this system of ODEs.

• Generate the data for the vector field and then use the command quiver to display the vector field,
shown in Figure 3.30(b).

• Use ode45() to generate numerical solutions, in this case with initial values (x(0), y(0)) = (2, 1)
and for 100 times 0 ≤ ti ≤ 15. Display the result in Figure 3.30.

(a) as function of time (b) vector field and solution

Figure 3.30: One solution and the vector field for the Volterra-Lotka problem

x = 0:0.2:2.6; % define the x values to be examined


y = 0:0.2:2.0; % define the y values to be examined

n = length(x); m = length(y);
Vx = zeros(n,m); Vy = Vx; % create zero vectors for the vector field

for i = 1:n
for j = 1:m
v = VolterraLotka([x(i),y(j)]); % compute the vector
Vx(i,j) = v(1); Vy(i,j) = v(2);
end%for
end%for

t = linspace(0,15,100);
[t,XY] = ode45(@(t,x)VolterraLotka(x),t,[2;1]);

figure(1); plot(t,XY)
xlabel('time'); legend('prey','predator'); axis([0,15,0,3]); grid on


figure(2); quiver(x,y,Vx',Vy',2); hold on


plot(XY(:,1),XY(:,2));
axis([min(x),max(x),min(y),max(y)]);
grid on; xlabel('prey'); ylabel('predator'); hold off

Converting an ODE of higher order to a system of order 1


Ordinary differential equations of higher order can be converted to systems of order 1. Thus most numerical
algorithms are applicable to systems of order 1, but not to higher order ODEs. The method is illustrated by
the example of a pendulum equation.
3–52 Example : ODE for a damped pendulum
The equation

\ddot{x}(t) + \alpha\, \dot{x}(t) + k\, x(t) = f(t)

describes a mass attached to a spring with an additional damping term \alpha\, \dot{x}. Introducing the new variables
y_1(t) = x(t) and y_2(t) = \dot{x}(t) leads to

\frac{d}{dt} y_1(t) = \dot{x}(t) = y_2(t)

and

\frac{d}{dt} y_2(t) = \frac{d}{dt} \dot{x}(t) = \ddot{x}(t) = f(t) - \alpha\, \dot{x}(t) - k\, x(t) = f(t) - \alpha\, y_2(t) - k\, y_1(t) .

This can be written as a system of first order equations

\frac{d}{dt} \begin{pmatrix} y_1(t) \\ y_2(t) \end{pmatrix} = \begin{pmatrix} \dot{y}_1(t) \\ \dot{y}_2(t) \end{pmatrix} = \begin{pmatrix} y_2(t) \\ f(t) - k\, y_1(t) - \alpha\, y_2(t) \end{pmatrix}

or

\frac{d}{dt} \vec{y}(t) = \vec{F}(\vec{y}(t))

and with the help of a function file the problem can be solved with computations very similar to the above
Volterra–Lotka example. The code below will compute a solution with the initial displacement x(0) = 0
and initial velocity \frac{d}{dt} x(0) = 1. Then Figure 3.31 will be generated.

• In Figure 3.31(a) find the graphs of x(t) and v(t) as function of the time t. The effect of the damping
term −α v(t) = −0.1 v(t) is clearly visible.

• In Figure 3.31(b) find the vector field and the computed solution. The horizontal axis represents the
displacement x and the vertical axis indicates the velocity v = ẋ. This is the phase portrait of the
second order ODE.

y = -1:0.2:1; v = -1:0.2:1; n = length(y); m = length(v);


Vx = zeros(n,m); Vy = Vx; % create zero vectors for the vector field

function ydot = Spring(y)


ydot = zeros(size(y));
k = 1; al = 0.1;
ydot(1) = y(2);
ydot(2) = -k*y(1)-al*y(2);
end%function

for i = 1:n

(a) displacement and velocity (b) phase portrait

Figure 3.31: Vector field and a solution for a spring-mass problem

for j = 1:m
z = Spring([y(i),v(j)]); % compute the vector
Vx(i,j) = z(1); Vy(i,j) = z(2); % store the components
end%for
end%for

t = linspace(0,25,100);
[t,XY] = ode45(@(t,y)Spring(y),t,[0;1]);

figure(1); plot(t,XY)
xlabel('time'); legend('displacement','velocity')
axis(); grid on

figure(2); plot(XY(:,1),XY(:,2),'g'); % plot solution in phase portrait

axis([min(y),max(y),min(v),max(v)]);
hold on
quiver(y,v,Vx',Vy','b');
xlabel('displacement'); ylabel('velocity');
grid on; hold off

3.4.2 The Basic Algorithms


In this subsection three basic algorithms are presented. The purpose is to understand how these algorithms
work, the resulting code is only useful for didactical and demo purposes. For real world problems use the
codes presented in subsection 3.4.6, starting on page 208. The explanations are given for single ODEs, but
the methods apply to systems of ODEs too, without major modifications.

The Euler method


To understand the basic idea of numerical ODE solvers the method of Euler is the easiest starting point. The
idea carries over to more sophisticated and efficient approaches.
As simple example examine the IVP

\frac{d}{dt} x(t) = x(t)^2 - 2\,t \qquad \text{with} \qquad x(0) = 0.75 .


Thus search for a curve following the corresponding vector field in Figure 3.32. Use the definition of the


Figure 3.32: Vector field and solution of the ODE \frac{d}{dt} x(t) = x(t)^2 - 2\,t. The true solution is displayed in
red, the solution generated by one Euler step in blue and the result using four Euler steps in green.

derivative

\frac{d}{dt} x(0) = \lim_{h \to 0} \frac{x(h) - x(0)}{h}

at t = 0. Instead of the limit use a "small" value for h. The differential equation is transformed into an
algebraic equation

\frac{x(h) - x(0)}{h} = f(0, x(0)) .

This can easily be solved for x(h)

x(h) = x(0) + h\; f(0, x(0)) .

Thus the first step of this method leads to the straight line in Figure 3.32. A bit more generally, to move from
time t to time t + h use

x(t + h) = x(t) + h\; f(t, x(t)) .

This is Euler's method, also called the explicit Euler method.

Above the stepsize h = 1 was used to determine x(1). To obtain a better approximation use smaller
values for h. With the stepsize h = \frac{1}{4} = 0.25 find

x(0.25) ≈ 0.75 + 0.25\,(0.75^2 - 2 \cdot 0) = 0.890625
x(0.50) ≈ 0.890625 + 0.25\,(0.890625^2 - 2 \cdot 0.25) ≈ 0.963928
x(0.75) ≈ 0.963928 + 0.25\,(0.963928^2 - 2 \cdot 0.5) ≈ 0.9462176
x(1.00) ≈ 0.9462176 + 0.25\,(0.9462176^2 - 2 \cdot 0.75) ≈ 0.7950496

This leads to the four straight line segments in Figure 3.32 and this approximate solution is already closer
to the exact solution.
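The above Euler steps can be generated by a short loop; a minimal sketch (not part of the notes):

f = @(t,x) x^2 - 2*t;          % right hand side of the ODE
x = 0.75;  t = 0;  h = 0.25;   % initial value and step size
for i = 1:4
  x = x + h*f(t,x);            % one explicit Euler step
  t = t + h;
  disp([t, x])                 % reproduces the four values above
end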

The Heun method


This slightly more advanced method is known as the method of Heun, or also Runge–Kutta of order 2. To
discretize

\dot{x} = f(t, x) \qquad \text{with} \qquad x(t_0) = x_0


with step size h from t = t_i to t = t_{i+1} = t_i + h use the computational scheme:

k_1 = f(t_i, x_i)
k_2 = f(t_i + h,\; x_i + k_1\, h)
x_{i+1} = x_i + h\, \frac{k_1 + k_2}{2}
t_{i+1} = t_i + h

At two different positions the function f (t, x) is evaluated, leading to two slopes k_1 and k_2 for the solution
of the ODE. Then one time step is performed with the average of the two slopes. This is visualized in
Figure 3.33.


Figure 3.33: One step of the Heun method for the ODE \frac{d}{dt} x(t) = x(t)^2 - 2\,t with x(0) = 0.75 and step
size h = 1
For the initial value problem \frac{d}{dt} x(t) = x(t)^2 - 2\,t with x(0) = 0.75 the calculations for one Heun step
of length h = 1 are given by

k_1 = f(t_0, x_0) = f(0, 0.75) = 0.5625
k_2 = f(t_0 + h,\, x_0 + h\, k_1) = f(1, 1.3125) = -0.27734375
k = \frac{1}{2}\,(k_1 + k_2) = 0.142578125
x(1) ≈ x(0) + h\, k = 0.75 + 1 \cdot (0.142578125) = 0.892578125
This is a better approximation than the one generated by one Euler step. The above computations can be
performed by MATLAB/Octave:

x0 = 0.75; h = 1; % set initial value and step size


f = @(t,x)x^2-2*t; % define the function f(t,x) = x^2 - 2 t
h = 1;
k1 = f(0,x0)
k2 = f(h,x0+h*k1)
k = (k1+k2)/2
x1 = x0+h*k

The classical Runge–Kutta method


One of the most often used methods is a Runge–Kutta method of order 4. It is often called the classical
Runge–Kutta method. To apply one time step for the IVP

\dot{x} = f(t, x) \qquad \text{with} \qquad x(t_0) = x_0


with step size h from t = t_i to t = t_{i+1} = t_i + h use the following computational scheme:

k_1 = f(t_i, x_i)
k_2 = f(t_i + h/2,\; x_i + k_1\, h/2)
k_3 = f(t_i + h/2,\; x_i + k_2\, h/2)
k_4 = f(t_i + h,\; x_i + k_3\, h)
x_{i+1} = x_i + \frac{h}{6}\,(k_1 + 2\,k_2 + 2\,k_3 + k_4)
t_{i+1} = t_i + h

At four different positions the function f (t, x) is evaluated, leading to four slopes ki for the solution of
the ODE. Then one time step is performed with a weighted average of the four slopes. This is visualized in
Figure 3.34.


Figure 3.34: One step of the Runge–Kutta method of order 4 for the ODE \frac{d}{dt} x(t) = x(t)^2 - 2\,t with
x(0) = 0.75 and step size h = 1
For the initial value problem \frac{d}{dt} x(t) = x(t)^2 - 2\,t with x(0) = 0.75 the calculations for one Runge–
Kutta step of length h = 1 are given by

k_1 = f(t_0, x_0) = f(0, 0.75) = 0.5625
k_2 = f(t_0 + \tfrac{h}{2},\; x_0 + \tfrac{h}{2}\, k_1) = f(\tfrac12, 1.03125) ≈ 0.0634766
k_3 = f(t_0 + \tfrac{h}{2},\; x_0 + \tfrac{h}{2}\, k_2) ≈ f(\tfrac12, 0.781738) ≈ -0.388885
k_4 = f(t_0 + h,\; x_0 + h\, k_3) ≈ f(1, 0.36111474) ≈ -1.869596
k = \frac{1}{6}\,(k_1 + 2\, k_2 + 2\, k_3 + k_4) ≈ -0.326319
x(1) ≈ x(0) + h\, k ≈ 0.75 + 1 \cdot (-0.326319) ≈ 0.423681 .

The advanced numerical solver ode23() generated the answer x(1) ≈ 0.32449. Thus the answer of the
RK algorithm is considerably closer than the answer of x(1) ≈ 0.7950 generated by four Euler steps. For
both approaches the right hand side of the ODE was evaluated four times, i.e. a comparable computational
effort. The above computations can be performed by MATLAB or Octave:


x0 = 0.75; h = 1; % set initial value and step size


f = @(t,x)x^2-2*t; % define the function f(t,x) = x^2 - 2 t
h = 1;
k1 = f(0,x0)
k2 = f(h/2,x0+h/2*k1)
k3 = f(h/2,x0+h/2*k2)
k4 = f(h,x0+h*k3)
k = (k1+2*k2+2*k3+k4)/6
x1 = x0+h*k

Discretization errors for Euler, Heun and Runge–Kutta


For all three of the above approximation methods to solutions of ODEs one can (and should) use estimates
for the discretization errors. For smaller values of the step length h the error is expected to be smaller. As
one example examine the error of the Euler method. Assume that a solution y(t) of an ODE \frac{d}{dt} y(t) =
f(t, y(t)) can be represented by a Taylor approximation

y(t_0 + h) = y(t_0) + y'(t_0)\, h + \frac{y''(t_0)}{2!}\, h^2 + \frac{y'''(t_0)}{3!}\, h^3 + \ldots + \frac{y^{(k)}(t_0)}{k!}\, h^k + \ldots
           = y(t_0) + y'(t_0) \cdot h + h^2 \cdot \left[ \frac{y''(t_0)}{2!} + \frac{y'''(t_0)}{3!}\, h + \ldots + \frac{y^{(k)}(t_0)}{k!}\, h^{k-2} + \ldots \right] .

Using y'(t_0) = f(t_0, y(t_0)) and the Euler approximation

y_E(t_0 + h) = y(t_0) + h\, f(t_0, y(t_0))

leads to

y(t_0 + h) - y_E(t_0 + h) = h^2 \cdot \left[ \frac{y''(t_0)}{2!} + \frac{y'''(t_0)}{3!}\, h + \ldots + \frac{y^{(k)}(t_0)}{k!}\, h^{k-2} + \ldots \right] .

Thus for small values of h obtain the local error estimate for one Euler step

|y_E(t_0 + h) - y(t_0 + h)| \le C_E\, h^2   (3.12)

where the "constant" C_E is related to second derivatives of the function f (t, y). Thus the local discretization
error of the Euler method is of order 2 . To arrive at a final time t_0 + T one has to apply n = T/h steps and
each step might add some error. This leads to the global discretization error

|y_E(t_0 + T) - y(t_0 + T)| \le \tilde{C}_E\, h .   (3.13)
Thus the global discretization error of the Euler method is of order 1 .
Similar arguments (with more tedious computations) can be performed for the method of Heun and
Runge–Kutta, leading to Table 3.6.

  method       |  step size h  |  local error   |  global error
  Euler        |  h = T/n      |  ≈ C_E · h²    |  ≈ C_E · n · h² = C̃_E h
  Heun         |  h = T/n      |  ≈ C_H · h³    |  ≈ C_H · n · h³ = C̃_H h²
  Runge–Kutta  |  h = T/n      |  ≈ C_RK · h⁵   |  ≈ C_RK · n · h⁵ = C̃_RK h⁴

Table 3.6: Discretization errors for the methods of Euler, Heun and Runge–Kutta

For most problems Runge–Kutta is the most efficient algorithm to generate approximate solutions to
ODEs. There are multiple aspects to be taken into account when comparing Euler's method to Runge–Kutta:


1. Computational effort for one step


The computational effort for most applications is dominated by the evaluation of the RHS f (t, x) of
the ODE. For the Euler method one call of the RHS is required, while Runge–Kutta requires 4 calls
of the RHS.
Advantage: Euler

2. Differentiability of f (t, x)
For the above error estimates to be correct Euler requires that the function f be twice differentiable,
while Runge–Kutta requires that f is four times differentiable.
Advantage: Euler

3. Order of consistency
The global discretization error for Euler is proportional to h, while it is proportional to h4 for Runge–
Kutta.
Advantage: Runge–Kutta.

4. Number of time steps


Based on the higher order of convergence one usually gets away with fewer time steps for Runge–
Kutta, for a given total error. The smaller h the better Runge–Kutta will perform.
Advantage: Runge–Kutta.

3–53 Example : The last of the above arguments is by far the most important. This is illustrated by an
example, see [MeybVach91]. The initial value problem \frac{d}{dt} y(t) = 1 + (y(t) - t)^2 with y(0) = 0.5 is
solved by y(t) = t + \frac{1}{2-t}, e.g. y(1.8) = 6.8 . Use n time steps of length h = \frac{1.8}{n} to approximate y(1.8)
numerically. This leads to Table 3.7. With 72 calls of the RHS by Runge–Kutta the global error is of the
same size as with 7200 calls by Euler. ♦

  method       |   h    |   n   | number of calls | global error
  Euler        |  0.1   |   18  |      18         |  2.23 · 10⁻⁰
  Runge–Kutta  |  0.1   |   18  |      72         |  3.40 · 10⁻³
  Euler        |  0.01  |  180  |     180         |  4.91 · 10⁻¹
  Runge–Kutta  |  0.01  |  180  |     720         |  4.20 · 10⁻⁷
  Euler        |  0.001 | 1800  |    1800         |  5.66 · 10⁻²
  Runge–Kutta  |  0.001 | 1800  |    7200         |  4.32 · 10⁻¹¹

Table 3.7: A comparison of Euler and Runge–Kutta
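The numbers in this table can be reproduced (up to rounding) by a short, self-contained script with fixed step size; a minimal sketch, not taken from the notes:

f = @(t,y) 1 + (y-t)^2;   y_exact = 1.8 + 1/(2-1.8);         % exact value y(1.8) = 6.8
for n = [18 180 1800]
  h = 1.8/n;  yE = 0.5;  yRK = 0.5;  t = 0;
  for i = 1:n
    yE = yE + h*f(t,yE);                                      % one Euler step
    k1 = f(t,yRK);              k2 = f(t+h/2, yRK+h/2*k1);    % one classical Runge-Kutta step
    k3 = f(t+h/2, yRK+h/2*k2);  k4 = f(t+h,   yRK+h*k3);
    yRK = yRK + h/6*(k1+2*k2+2*k3+k4);
    t = t + h;
  end
  disp([h, abs(yE-y_exact), abs(yRK-y_exact)])                % step size and global errors
end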

The ordinary differential equation

\frac{d}{dx} u(x) = f(u(x)) \qquad \text{with} \qquad u(x_0) = u_0

is equivalent¹⁴ to the integral equation

u(x) = u_0 + \int_{x_0}^{x} f(u(s))\; ds .

¹⁴ \frac{d}{dx} \left( u_0 + \int_{x_0}^{x} f(u(s))\; ds \right) = f(u(x)) \quad \text{and} \quad u(x_0) = u_0

Thus it is no surprise that there is a close connection between numerical integration and solving ODEs.
Compare the order of convergence of the different methods. It stands out that the order of the error for
Runge–Kutta is higher than expected.

exact solutions:
  integral:                I = \int_x^{x+h} f(t)\, dt
  differential equation:   u'(x) = f(u(x)) ,  u(x+h) = u(x) + \int_x^{x+h} f(u(t))\, dt

approximate solutions:

rectangular rule  <->  Euler method
  I ≈ h\, f(x)
  k_1 = f(u(x)) ,  u(x+h) ≈ u(x) + h\, k_1
  error = O(h^2)   |   local error = O(h^2)

trapezoidal rule  <->  method of Heun
  I ≈ \frac{h}{2}\,(f(x) + f(x+h))
  k_1 = f(u(x)) ,  k_2 = f(u(x) + k_1\, h) ,  u(x+h) ≈ u(x) + \frac{h}{2}\,(k_1 + k_2)
  error = O(h^3)   |   local error = O(h^3)

Simpson's rule  <->  method of Runge–Kutta (RK4)
  I ≈ \frac{h}{6}\,(f(x) + 4\, f(x+h/2) + f(x+h))
  k_1 = f(u(x)) ,  k_2 = f(u(x) + k_1\, h/2) ,  k_3 = f(u(x) + k_2\, h/2) ,  k_4 = f(u(x) + k_3\, h)
  u(x+h) ≈ u(x) + \frac{h}{6}\,(k_1 + 2\,k_2 + 2\,k_3 + k_4)
  error = O(h^4)   |   local error = O(h^5)

Table 3.8: Comparing integration and solving ODEs

Codes for Runge–Kutta, Heun and Euler with fixed step size
An ODE usually has to be solved on a given time interval. To apply the Runge–Kutta approach with a fixed
step size use the code ode RungeKutta.m in Figure 3.35. The function takes several arguments:

• FunFcn: a string with the function name for the RHS of the ODE.

• t: a vector of scalar time values at which the solution is returned. t(1) is the initial time.

• y0: the initial values.

• steps: the number of Runge–Kutta steps to be taken between the output times.

SHA 21-5-21
CHAPTER 3. NUMERICAL TOOLS 194

The function will return the output times in Tout and the values of the solution in Yout.
The code has the following structure:

1. name of the function, declaration of the parameters

2. documentation

3. initialization

4. main loop

(a) determine length of steps h


(b) apply the correct number of Runge–Kutta steps
(c) save the result

5. return the result

Very similar codes ode_Euler.m and ode_Heun.m implement the methods of Euler and Heun with
fixed step size. The usage is illustrated by the following example, solving the equation of a simple pendulum.

3–54 Example : The second order ODE

d2
y(t) = −k sin(y(t))
dt2
describes the angle y(t) of a pendulum, possibly with large angles, since the approximation sin(y) ≈ y is
not used. This second order ODE is transformed to a system of order 1 by
    d/dt ( y(t), v(t) )^T = ( v(t), −k sin(y(t)) )^T .

This ODE leads to a function pend(), which is then used by ode_RungeKutta() to generate the solution
for times [0, 30] for different initial angles.
Pendulum.m
%% demo file to solve a pendulum equation
Tend = 30;
%% on Matlab put the definition of the function in a separate file pend.m
function y = pend(t,x)
k = 1;
y = [x(2);-k*sin(x(1))];
end%function

y0 = [0.1;0]; % small angle


% y0=[pi/2;0]; % large angle
% y0 = [pi-0.01;0]; % very large angle

t = linspace(0,Tend,100);
[t,y] = ode_RungeKutta(’pend’,t,y0,10); % Runge-Kutta
% [t,y] = ode_Euler(’pend’,t,y0,10); % Euler
% [t,y] = ode_Heun(’pend’,t,y0,10); % Heun

plot(t,180/pi*y(:,1))
xlabel(’time’); ylabel(’angle [Deg]’)


ode_RungeKutta.m
function [tout, yout] = ode_RungeKutta(Fun, t, y0, steps)
% [Tout, Yout] = ode_RungeKutta(fun, t, y0, steps)
%
% Integrate a system of ordinary differential equations using
% 4th order Runge-Kutta formula.
%
% INPUT:
% Fun - String containing name of user-supplied problem description.
% Call: yprime = Fun(t,y)
% T - Vector of times (scalar).
% y - Solution column-vector.
% yprime - Returned derivative column-vector; yprime(i) = dy(i)/dt.
% T(1) - Initial value of t.
% y0 - Initial value column-vector.
% steps - steps to take between given output times
%
% OUTPUT:
% Tout - Returned integration time points (column-vector).
% Yout - Returned solution, one solution column-vector per tout-value.
%
% The result can be displayed by: plot(tout, yout).

% Initialization
y = y0(:); yout = y’; tout = t(:);

% The main loop


for i = 2:length(t)
h = (t(i)-t(i-1))/steps;
tau = t(i-1);
for j = 1:steps
% Compute the slopes
s1 = feval(Fun, tau, y); s1 = s1(:);
s2 = feval(Fun, tau+h/2, y+h*s1/2); s2 = s2(:);
s3 = feval(Fun, tau+h/2, y+h*s2/2); s3 = s3(:);
s4 = feval(Fun, tau+h, y+h*s3); s4 = s4(:);
tau = tau + h;
y = y + h*(s1 + 2*s2+ 2*s3 + s4)/6;
end%for
yout = [yout; y.’];
end%for

Figure 3.35: Code for Runge–Kutta with fixed step size


3.4.3 Stability of the Algorithms


Examine the ODE
    d/dt y(t) = λ y(t)
with the exact solution y(t) = y(0) exp(λ t). For λ < 0 this solution converges to zero as t → ∞. Thus
numerical algorithms are expected to have this feature too.
Based on Section 3.2.5 and in particular Result 3–29 the stability behavior for the elementary ODE
d/dt y(t) = λ y(t) carries over to systems of linear ODEs. Using linearization most of the results remain
valid for nonlinear systems of ODEs.

3–55 Definition : A numerical method to solve ODEs is called stable iff for Re(λ) < 0 the numerical
solution of d/dt y(t) = λ y(t) satisfies

    lim_{t→∞} y(t) = 0 .

When using approximation methods to solve d/dt y(t) = λ y(t) one arrives in many cases at an iteration
formula of the type
yi+1 = y(ti + h) = g(λ h) yi ,
i.e. at each time step the current value of the solution y(ti ) is multiplied by g(λ h) to arrive at y(ti + h) =
y(ti+1 ). In this case limt→∞ y(t) = 0 is equivalent to limn→∞ (g(λ h))n = 0, which is equivalent to
|g(λ h)| < 1. The stability can be formulated using the stability function g(z) for z ∈ C.

• The computational scheme is stable in the domain in C where |g(λ h)| < 1.

• A computational scheme is called A-stable iff

|g(z)| ≤ 1   for all z ∈ {z ∈ C : Re(z) < 0} .

• A computational scheme is called L-stable iff it is A–stable and in addition

lim g(z) = 0 as z −→ −∞ .

Based on Result 3–29 the stability carries over to linear systems of ODEs and by linearization to non-
linear systems too.


Figure 3.36: Conditional stability of Euler's approximation to d/dt y(t) = λ y(t) with λ < 0; (a) first steps, (b) many steps

3–56 Result : Conditional stability of Euler’s explicit method


When using Euler's forward difference method to solve d/dt y(t) = λ y(t) with yi = y(ti) and yi+1 = y(ti + h) find

    d/dt y(ti) − λ y(ti) ≈ (yi+1 − yi)/h − λ yi = 0 .
The differential equation is replaced by the difference equation
    (yi+1 − yi)/h = λ yi   =⇒   yi+1 = yi + h λ yi = (1 + h λ) yi .
Verify that this difference equation is solved by

    yi = y0 (1 + h λ)^i .

For this expression to remain bounded independently of i the condition |1 + h λ| ≤ 1 is necessary. For
complex values of λ this condition is satisfied if z = λ h is inside a circle of radius 1 with center at −1 ∈ C.
This domain is visualized in Figure 3.38. For real, negative values of λ the condition simplifies to

    h |λ| < 2   ⇐⇒   h < 2/|λ| .

This is an example of conditional stability, i.e. the scheme is only stable if the above condition on the
step size h is satisfied. To visualize the behavior examine the results in Figure 3.36 for solutions of the
differential equation d/dt y(t) = λ y(t).

• At the starting point the differential equation determines the slope of the straight line approximation
of the solution. The slope is independent of the step size h.

• If the step size is small enough then the numerical solution will not overshoot but converge to zero, as
expected.

• If the step size is too large then the numerical solution will overshoot and will move further and further
away from zero by each step.


3–57 Result : Unconditional stability of the backward Euler method


When using the backward Euler method to solve d/dt y(t) = λ y(t) with yi = y(ti) and yi+1 = y(ti + h) find

    d/dt y(ti+1) − λ y(ti+1) ≈ (yi+1 − yi)/h − λ yi+1 = 0 .
The differential equation is replaced by the difference equation
    (yi+1 − yi)/h = λ yi+1   =⇒   (1 − h λ) yi+1 = yi .

One can verify that this difference equation is solved by

    yi = y0 / (1 − h λ)^i .

For this expression to remain bounded independently of i we need |1 − h λ| ≥ 1. For complex values of λ
this condition is satisfied if z = λ h is outside a circle of radius 1 with center at +1 ∈ C. This domain
is visualized in Figure 3.38. For real, negative values of λ the condition leads to 1 − h λ > +1, which is
automatically satisfied and we have unconditional stability. The method is A–stable and L-stable.
To visualize the behavior we examine the results in Figure 3.37 for solutions of the differential equation
d/dt y(t) = λ y(t).

• The slope of the straight line approximation is determined by the differential equation at the end point
of the straight line segment. Consequently the slope will depend on the step size h.

• If the step size is small enough then the numerical solution will not overshoot but converge to zero.

• Even if the step size is large the numerical solution will not overshoot zero, but converge to zero.
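The contrast between Results 3–56 and 3–57 is easy to observe numerically. The lines below are a small sketch (not part of the lecture codes): for λ = −10 and a step size h = 0.3 > 2/|λ| the explicit Euler iterates grow in magnitude, while the implicit Euler iterates decay, as predicted.

lambda = -10;  h = 0.3;  n = 20;        % h > 2/|lambda|, explicit Euler is unstable
yE = 1;  yI = 1;                        % initial value y(0) = 1
for i = 1:n
  yE(i+1) = (1 + h*lambda)*yE(i);       % explicit Euler: factor 1 + h*lambda = -2
  yI(i+1) = yI(i)/(1 - h*lambda);       % implicit Euler: factor 1/(1 - h*lambda) = 1/4
end
t = h*(0:n);
semilogy(t, abs(yE), '+-', t, abs(yI), 'o-')
xlabel('time t'); ylabel('|y_i|'); legend('explicit Euler','implicit Euler')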

Figure 3.37: Unconditional stability of the implicit approximation to d/dt y(t) = λ y(t) with λ < 0


3–58 Result : Conditional stability for the methods of Heun and Runge–Kutta
Using one step of the method of Heun to solve d/dt y(t) = λ y(t) with yi = y(ti) and yi+1 = y(ti + h) leads to

    k1 = λ yi
    k2 = λ (yi + h λ yi) = (λ + h λ^2) yi
    k  = 1/2 (k1 + k2) = 1/2 (2 λ + h λ^2) yi
    yi+1 = yi + h k = (1 + h λ + 1/2 h^2 λ^2) yi
    yi = (1 + h λ + 1/2 h^2 λ^2)^i y0

For this expression to remain bounded independently of i we need

    |1 + h λ + 1/2 h^2 λ^2| = |1 + z + 1/2 z^2| ≤ 1 .

To examine this set in C use |exp(i α)| = 1 and solve

    e^{i α} = 1 + z + 1/2 z^2
    2 e^{i α} = 2 + 2 z + z^2
    z^2 + 2 z + 2 − 2 e^{i α} = 0
    z_{1,2} = −1 ± √(−1 + 2 e^{i α}) .

This generates an ellipse–like curve between −2 and 0 on the real axis and ±i √3 along the imaginary
direction, visible in Figure 3.38. Heun's method is stable inside this domain, i.e. this is a conditional
stability.
For the Runge–Kutta method the corresponding inequality is^15

    |1 + z + 1/2 z^2 + 1/6 z^3 + 1/24 z^4| = |1 + (h λ) + 1/2 (h λ)^2 + 1/6 (h λ)^3 + 1/24 (h λ)^4| ≤ 1

and the domain is displayed in Figure 3.38. The classical Runge–Kutta method is stable inside this domain,
i.e. this is a conditional stability.

^15 Use elementary, tedious computations or software for symbolic calculations.


3–59 Result : Unconditional stability for the method of Crank–Nicolson


In Section 4.5.5 the method of Crank–Nicolson will be examined. Using one step of this method to solve
d/dt y(t) = λ y(t) with yi = y(ti) and yi+1 = y(ti + h) leads to

    (yi+1 − yi)/h = λ (yi + yi+1)/2
    yi+1 = (2 + h λ)/(2 − h λ) · yi
    yi = ( (2 + h λ)/(2 − h λ) )^i · y0 .

For this expression to remain bounded independently of i we need

    |(2 + h λ)/(2 − h λ)| = |(2 + z)/(2 − z)| ≤ 1   ⇐⇒   |2 + z| ≤ |2 − z| .

Examine the two points 2 ± z = 2 ± h λ ∈ C:

• For Re λ < 0 the two points have imaginary parts of the same magnitude and |Re(2 + h λ)| is smaller than
|Re(2 − h λ)|. Thus the method is stable in the left half plane Re λ < 0.

• For Re λ = 0 observe |2 + h λ| = |2 − h λ|, i.e. the method is stable.

• For Re λ > 0 the method is unstable, as it should be. The exact solution exp(λ t) grows exponentially
for t > 0.

The domain of stability is displayed in Figure 3.38. The method is unconditionally A–stable, but not L-stable.

Figure 3.38: Domains of stability in C for a few algorithms (explicit Euler, implicit Euler, Heun, RK4, Crank–Nicolson; axes Re(h λ) and Im(h λ))


3–60 Result : Domains of stability for a few algorithms


The domains of stability of the above methods to solve ODEs is visible in Figure 3.38.

• Euler: stable in a circle with radius 1 and center at −1 + i 0 ∈ C. For Re(λ) < 0 and large time steps
h the method is unstable, i.e. conditional stability.

• implicit: stable outside a circle with radius 1 and center at +1 + i 0 ∈ C. For Re(λ) < 0 the
method is stable for any time step h > 0, i.e. unconditional stability.

• Heun: stable in an ellipse–like domain with real parts larger than −2. For Re(λ) < 0 and large time
steps h the method is unstable, i.e. conditional stability.

• RK4: stable in an odd shaped domain with real parts larger than −2.8. For Re(λ) < 0 and large time
steps h the method is unstable, i.e. conditional stability.

• CN: Crank–Nicolson, the method is stable in the complex half plane Re(z) < 0, i.e. unconditional
stability.
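The curves in Figure 3.38 can be regenerated by evaluating the stability functions g(z) on a grid in C and drawing the level curve |g(z)| = 1. The sketch below uses the stability functions derived in this section; it is an illustration, not the code used to generate the figure.

[x, y] = meshgrid(-3.5:0.02:2, -3.5:0.02:3.5);  z = x + 1i*y;
gEuler = abs(1 + z);                                % explicit Euler
gHeun  = abs(1 + z + z.^2/2);                       % Heun
gRK4   = abs(1 + z + z.^2/2 + z.^3/6 + z.^4/24);    % classical Runge-Kutta
gImpl  = abs(1./(1 - z));                           % implicit Euler
gCN    = abs((2 + z)./(2 - z));                     % Crank-Nicolson
contour(x, y, gEuler, [1 1], 'b'); hold on
contour(x, y, gHeun , [1 1], 'r')
contour(x, y, gRK4  , [1 1], 'g')
contour(x, y, gImpl , [1 1], 'k')
contour(x, y, gCN   , [1 1], 'm'); hold off
axis equal; xlabel('Re(h \lambda)'); ylabel('Im(h \lambda)')
legend('Euler', 'Heun', 'RK4', 'implicit Euler', 'Crank-Nicolson')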

3.4.4 General Runge–Kutta Methods, Represented by Butcher Tables


Explicit Runge–Kutta schemes
An explicit Runge–Kutta method with s stages is given by

k1 = f (tn , yn )
k2 = f (tn + c2 h, yn + h (a21 k1 ))
k3 = f (tn + c3 h, yn + h (a31 k1 + a32 k2 ))
k4 = f (tn + c4 h, yn + h (a41 k1 + a42 k2 + a43 k3 ))
    ...
    ks = f (tn + cs h, yn + h (as1 k1 + as2 k2 + · · · + as,s−1 ks−1 ))
    yn+1 = yn + h Σ_{i=1}^s bi ki

The computational scheme is conveniently represented by a Butcher table.

      0    |
      c2   | a21
      c3   | a31  a32
      c4   | a41  a42  a43                                    (3.14)
      ...  | ...  ...  ...
      cs   | as1  as2  as3  ...  as,s−1
      -----+-------------------------------
      yn+1 | b1   b2   b3   ...  bs−1   bs

Working from top to bottom for the general, explicit Runge–Kutta scheme one never has to solve an equa-
tion, just evaluate the function f (t, y) and plug in. Thus this is an explicit scheme. One step of length h
requires s evaluations of f (t, y).
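A small sketch shows how one explicit Runge–Kutta step can be coded directly from the Butcher table (3.14). The function name explicit_rk_step and its interface are chosen for this illustration only; they are not part of the codes used in these notes (on MATLAB save the function in its own file).

function ynew = explicit_rk_step(f, t, y, h, A, b, c)
  % one explicit Runge-Kutta step for dy/dt = f(t,y), defined by a Butcher table
  % A: strictly lower triangular s x s matrix, b and c: vectors of length s, y: column vector
  s = length(b);  k = zeros(length(y), s);
  for i = 1:s                                        % work from top to bottom of the table
    k(:,i) = f(t + c(i)*h, y + h*k(:,1:i-1)*A(i,1:i-1)');
  end
  ynew = y + h*k*b(:);                               % y_{n+1} = y_n + h*sum_i b_i*k_i
end

With the table (3.16) of the classical method one step of size h = 0.1 for d/dt y = −y reads
explicit_rk_step(@(t,y)-y, 0, 1, 0.1, [0 0 0 0; 1/2 0 0 0; 0 1/2 0 0; 0 0 1 0], [1 2 2 1]/6, [0 1/2 1/2 1]).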


3–61 Example : Butcher tables for Heun and classical Runge–Kutta


The method of Heun is a 2 stage Runge–Kutta method of order 2 and since
    k1 = f (tn , yn ) ,   k2 = f (tn + h, yn + h k1 ) ,   yn+1 = yn + h (1/2 k1 + 1/2 k2)

its Butcher table is given by

      0    |
      1    | 1                                                (3.15)
      -----+----------
      yn+1 | 1/2  1/2

The classical Runge–Kutta method of order 4 is a 4 stage method with the Butcher table

      0    |
      1/2  | 1/2
      1/2  | 0    1/2                                         (3.16)
      1    | 0    0    1
      -----+--------------------
      yn+1 | 1/6  2/6  2/6  1/6

Implicit Runge–Kutta schemes


A general implicit Runge–Kutta method with s stages is given by

k1 = f (tn + c1 h, yn + h (a1,1 k1 + a1,2 k2 + · · · + a1,s ks ))


k2 = f (tn + c2 h, yn + h (a2,1 k1 + a2,2 k2 + · · · + a2,s ks ))
k3 = f (tn + c3 h, yn + h (a3,1 k1 + a3,2 k2 + · · · + a3,s ks ))
    ...
    ks = f (tn + cs h, yn + h (as,1 k1 + as,2 k2 + · · · + as,s ks ))
    yn+1 = yn + h Σ_{i=1}^s bi ki

Observe that this leads to a system of equations for the slopes ~k ∈ R^s, thus it is an implicit scheme. For
nonlinear functions f (t, y) it is a nonlinear system of equations. For a system of n equations with ~y ∈ R^n it
leads to a nonlinear system of n · s equations for n · s unknowns. Since Newton's algorithm is used to solve
the system it is a good idea to provide the Jacobian matrix J to the algorithms. Use Example 3–69 on how
to use the Jacobian.

        ∂ f (t, ~y )   [ ∂f1/∂y1  ∂f1/∂y2  ...  ∂f1/∂yn ]
    J = ----------- =  [ ∂f2/∂y1  ∂f2/∂y2  ...  ∂f2/∂yn ]
           ∂~y         [   ...      ...    ...    ...   ]
                       [ ∂fn/∂y1  ∂fn/∂y2  ...  ∂fn/∂yn ]


An implicit Runge–Kutta scheme is again represented by a Butcher table.

      c1   | a1,1  a1,2  a1,3  ...  a1,s
      c2   | a2,1  a2,2  a2,3  ...  a2,s                 ~c | A
      c3   | a3,1  a3,2  a3,3  ...  a3,s      ⇐⇒        ---+-----        (3.17)
      ...  | ...   ...   ...   ...  ...                     | ~b^T
      cs   | as,1  as,2  as,3  ...  as,s
      -----+-----------------------------
      yn+1 | b1    b2    b3    ...  bs

3–62 Example : Stability analysis for an implicit scheme


For the linear ODE d/dt y(t) = λ y(t) the system of equations for the slopes ~k is

~k = λ (yn~1 + h A ~k) ⇐⇒ (I − λ h A) ~k = λ yn ~1 .

Thus the next time step is given by


 
    yn+1 = yn + h ⟨~b, ~k⟩ = yn + yn λ h ⟨~b, (I − λ h A)^{−1} ~1⟩ = yn (1 + λ h ⟨~b, (I − λ h A)^{−1} ~1⟩) .

The essential information is given by the stability function

    g(z) = 1 + z ⟨~b, (I − z A)^{−1} ~1⟩ = det(I − z A + z ~1 ~b^T) / det(I − z A)        (3.18)

where z = λ h. For a proof see [Butc03, Lemma 351A, p. 230]. The stability condition is |g(z)| < 1. This
stability function g(z) is a rational function where numerator and denominator are polynomials of degree s.

For an explicit scheme the matrix of coefficients has a lower triangular form
 
        [  0                             ]
        [ a21    0                       ]
    A = [ a31   a32    0                 ]
        [  ...          ...   ...        ]
        [ as,1  as,2    ...  as,s−1   0  ]

and thus det(I − z A) = 1 and the stability function g(z) is a polynomial of degree s. An explicit scheme
of this type is conditionally stable, never unconditionally stable.

3–63 Example : Butcher table for the implicit Euler and Crank–Nicolson Method
The implicit Euler method is given by

    k1 = f (tn + h, yn + h k1 )
    yn+1 = yn + h k1 = yn + h f (tn + h, yn + h k1 )
    (yn+1 − yn)/h = f (tn + h, yn+1 )

Thus the Butcher table is

      1    | 1
      -----+---
      yn+1 | 1


and the stability function is


    g(z) = 1 + z ⟨~b, (I − z A)^{−1} ~1⟩ = 1 + z · 1/(1 − z) = 1/(1 − z) .
As a consequence the implicit Euler method is stable for ODEs d/dt y(t) = λ y(t) with Re λ < 0, i.e.
unconditional stability.
The Butcher table for the Crank–Nicolson method is

      0    | 0    0
      1    | 1/2  1/2
      -----+----------
      yn+1 | 1/2  1/2

and the method is given by

    k1 = f (tn , yn )
    k2 = f (tn + h, yn + h (1/2 k1 + 1/2 k2))
    yn+1 = yn + h (1/2 k1 + 1/2 k2)
         = yn + h/2 ( f (tn , yn ) + f (tn + h, yn + h (1/2 k1 + 1/2 k2)) )
         = yn + h/2 ( f (tn , yn ) + f (tn + h, yn+1 ) )
    (yn+1 − yn)/h = 1/2 ( f (tn , yn ) + f (tn + h, yn+1 ) ) .
This is caused by the identical rows in the Butcher table. If cj = 1 and aj,i = bi for i = 1, 2, . . . , s then
    yn+1 = yn + h Σ_{i=1}^s bi ki
    kj = f (tn + 1 · h, yn + h Σ_{i=1}^s aj,i ki ) = f (tn + h, yn+1 )

The stability function is


! " #−1 !
1
1 0 1
g(z) = 1 + z h~b, (I − z A)−1 ~1i = 1 + z h 2
1
, i
2 − z2 1− z
2 1
! " # !
z
z 1 1 1− 2 0 1
= 1+ h , z i
2 1 1− 2 + z2 1 1
! !
z
z 1 1− 2 z 2+z
= 1+ h , i = 1+ 2 = .
2−z 1 1+ z 2−z 2−z
2

As a consequence the Crank–Nicolson method is stable for ODEs d/dt y(t) = λ y(t) with Re λ < 0, i.e.
unconditional stability. ♦
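The stability function (3.18) can also be evaluated numerically for any Butcher table; the following lines are a sketch that confirms the results of this example, g(z) = 1/(1 − z) for the implicit Euler method and g(z) = (2 + z)/(2 − z) for Crank–Nicolson.

g = @(z, A, b) 1 + z*(b(:)'*((eye(length(b)) - z*A)\ones(length(b),1)));  % formula (3.18)
z = -0.7 + 0.3i;                                        % an arbitrary test point
[g(z, 1, 1), 1/(1-z)]                                   % implicit Euler: A = 1, b = 1
[g(z, [0 0; 1/2 1/2], [1/2 1/2]), (2+z)/(2-z)]          % Crank-Nicolson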

Embedded Runge–Kutta schemes


The embedded methods are designed to produce an estimate of the local truncation error of a single Runge-
Kutta step and, as a result, allow the error to be controlled with an adaptive step size. This is done by having two


methods in one single table, one method of order p and one of order p − 1 . The higher order approximation
is given by
    yn+1 = yn + h Σ_{i=1}^s bi ki

and the lower order approximation by

    y∗n+1 = yn + h Σ_{i=1}^s b∗i ki .

This is represented by an extended Butcher table.

      c1    | a1,1  a1,2  a1,3  ...  a1,s
      c2    | a2,1  a2,2  a2,3  ...  a2,s
      c3    | a3,1  a3,2  a3,3  ...  a3,s                     (3.19)
      ...   | ...   ...   ...   ...  ...
      cs    | as,1  as,2  as,3  ...  as,s
      ------+-----------------------------
      yn+1  | b1    b2    b3    ...  bs
      y∗n+1 | b∗1   b∗2   b∗3   ...  b∗s

The error is then estimated by


    yn+1 − y∗n+1 = h Σ_{i=1}^s (bi − b∗i) ki

and used to possibly adapt the step size h. The key point of an embedded method is to use the same slopes
ki to generate the estimates of order p and p − 1. This uses fewer evaluations of the RHS f (t, y) than
performing one step of size h and two of size h/2 and then compare.

3–64 Example : Butcher table for the Bogacki–Shampine method in ode23.m


This explicit method is used in MATLAB/Octave by the command ode23(). The name indicates that the
result is based on approximations of orders 2 and 3. The Butcher table is

      0     |
      1/2   | 1/2
      3/4   | 0     3/4                                       (3.20)
      1     | 2/9   1/3   4/9
      ------+-------------------------
      yn+1  | 2/9   1/3   4/9   0
      y∗n+1 | 7/24  1/4   1/3   1/8

and one embedded step requires 4 evaluations of f (t, y). Observe that the last row of the coefficient matrix
A and the vector ~bT coincide, thus
    k4 = f (tn + h, yn + h Σ_{i=1}^3 a4,i ki ) ,    yn+1 = yn + h Σ_{i=1}^3 a4,i ki

and k4 can be used during the next step as ”new” k1 . Consequently only 3 evaluations of f (t, y) are required.
For Octave find the Butcher table for the command ode23() in the file runge_kutta_23.m and for
MATLAB in the file ode23.m . ♦
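One embedded Bogacki–Shampine step with its error estimate can be written out directly from the table (3.20). The lines below are a sketch, not the implementation inside ode23().

f = @(t,y) -y + sin(t);  t = 0;  y = 1;  h = 0.2;   % an arbitrary test problem
k1 = f(t        , y);
k2 = f(t + h/2  , y + h/2*k1);
k3 = f(t + 3*h/4, y + 3*h/4*k2);
y_high = y + h*(2/9*k1 + 1/3*k2 + 4/9*k3);          % order 3 approximation
k4 = f(t + h, y_high);                              % equals the k1 of the next step
y_low  = y + h*(7/24*k1 + 1/4*k2 + 1/3*k3 + 1/8*k4);% order 2 approximation
err_est = abs(y_high - y_low)                       % used to adapt the step size h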


3–65 Example : Butcher table for the Dormand–Prince method in ode45.m


This explicit method is used in MATLAB/Octave by the command ode45(). The name indicates that the
result is based on approximations of orders 4 and 5. The Butcher table is ([HairNorsWann08, p. 178],
[Butc03, p. 211])
      0     |
      1/5   | 1/5
      3/10  | 3/40        9/40
      4/5   | 44/45       −56/15       32/9
      8/9   | 19372/6561  −25360/2187  64448/6561   −212/729                              (3.21)
      1     | 9017/3168   −355/33      46732/5247    49/176    −5103/18656
      1     | 35/384       0           500/1113      125/192   −2187/6784     11/84
      ------+----------------------------------------------------------------------------------
      yn+1  | 35/384       0           500/1113      125/192   −2187/6784     11/84     0
      y∗n+1 | 5179/57600   0           7571/16695    393/640   −92097/339200  187/2100  1/40

and one embedded step requires 7 evaluations of f (t, y). Observe that the last row of the coefficient matrix
A and the vector ~bT coincide, thus
    k7 = f (tn + h, yn + h Σ_{i=1}^6 a7,i ki ) ,    yn+1 = yn + h Σ_{i=1}^6 a7,i ki

and k7 can be used during the next step as ”new” k1 . Consequently only 6 evaluations of f (t, y) are required.
For Octave find the Butcher table for the command ode45() in the file runge_kutta_45_dorpri.m
and for MATLAB in the file ode45.m . ♦

3.4.5 Adaptive Step Sizes and Extrapolation


One of the critical points when using numerical tools to approximate solutions of ODEs is a good choice of
the step size h:

• for h too small the computational effort might be too large, i.e. it takes too long or the rounding errors
caused by the arithmetic on the CPU cause problems.

• for h too large the discretization errors will be too large.

A nice way out would be if the algorithms could determine a "good" step size. Fortunately this is possible
in most cases. Here one approach of automatic step size control is presented: compute the solution at t + h
twice, once with one step of length h and once with two steps of length h/2. Use the difference of the two results
to estimate the discretization error and then adapt the step size h accordingly.
Use Table 3.6 to estimate the error when stepping from t with y(t) to t + h with solution y(t + h), once
with one step of size h leading to the first result r1, then with two steps of size h/2 leading to r2. Then use
the approximations

    y(t + h) − r1 ≈ C h^{p+1}
    y(t + 2 · h/2) − r2 ≈ 2 C (h/2)^{p+1}

as exact equations (not completely true, but good enough) and solve for y(t + h) and C. Subtracting the above
two expressions leads to

    r2 − r1 = C ( h^{p+1} − 2 (h/2)^{p+1} ) = C h^{p+1} (2^p − 1)/2^p ,


This implies
    C h^{p+1} = 2^p/(2^p − 1) · (r2 − r1)
    y(t + h) = r1 + 2^p/(2^p − 1) · (r2 − r1) = (2^p r2 − r1)/(2^p − 1)        (3.22)
and equation (3.22) leads to two useful results:

1. control of step size h: (3.22) contains an (often very good) estimate of the local discretization error at
time t:

    y(t + h) − r1 ≈ 2^p/(2^p − 1) · (r2 − r1) .
This deviation should not be larger than a given bound 0 < ε, typically very small. Consider three
outcomes:

• If the estimated error


    2^p/(2^p − 1) · |r2 − r1| ≥ ε
is too large, then restart at t with a smaller step size h.
• If the estimated error is just about right, go on with the same step size and move from t + h to
t + 2 h.
• If the estimated error is "considerably too small", advance by one step and for the next step
choose a larger step size h.

The choice of the error bound ε and the way to adapt the step size have to be made very carefully.

2. local extrapolation: using r1 and r2 generate a new, better approximation for the solution at t + h by
    y(t + h) = (2^p r2 − r1)/(2^p − 1)

This "new" method will have a local discretization error proportional to h^{p+2}, i.e. the order of convergence
is improved by 1.

The above error estimates and extrapolation method assume that the Taylor approximation of the ODE
is correct, which requires the RHS function f (t, y) to be many times differentiable. If the function f (t, y)
is not smooth (e.g. jumps of the values or of derivatives), then the above estimates are not correct. This will cause
the adaptive step size approaches in Section 3.4.5 to try smaller and smaller step sizes and the algorithm
might come to a screeching halt, usually with a warning message. Higher order methods (ode45()) are
more susceptible to this problem than lower order methods (ode23()). Another reason for the algorithm
to stop could be blowup of the solution. As an example consider the ODE d/dt y(t) = 1 + y(t)^2, solved by
y(t) = tan(t), which blows up at t = π/2.

Using Table 3.6 for Runge–Kutta shows p = 4 and thus


    y(t + h) − r1 ≈ 16/15 (r2 − r1) ≈ r2 − r1
    y(t + h) ≈ (16 r2 − r1)/15

The local discretization error is of the form O(h^{p+1}) = O(h^5), i.e. by using h/2 instead of h the error is
expected to be divided by 2^5 = 32. As a consequence the step size will be modified by factors closer to 1,
e.g. by 0.8 h for smaller steps and by 1.2 h for larger steps.
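The step doubling idea with p = 4 can be written out in a few lines. The sketch below (an illustration, not the code in rk45.m) computes r1 with one classical Runge–Kutta step of size h, r2 with two steps of size h/2, the estimate 16/15 (r2 − r1) of the error of r1 and the extrapolated value (16 r2 − r1)/15.

f = @(t,y) 1 + y.^2;  t0 = 0;  y0 = 0;  h = 0.1;    % exact solution y(t) = tan(t)
r = [y0, y0];  steps = [1, 2];                      % r(1): one step h, r(2): two steps h/2
for j = 1:2
  hj = h/steps(j);  t = t0;  y = r(j);
  for i = 1:steps(j)
    k1 = f(t, y);                   k2 = f(t + hj/2, y + hj/2*k1);
    k3 = f(t + hj/2, y + hj/2*k2);  k4 = f(t + hj, y + hj*k3);
    y = y + hj/6*(k1 + 2*k2 + 2*k3 + k4);  t = t + hj;
  end
  r(j) = y;
end
err_est  = 16/15*(r(2) - r(1));                     % estimate of y(t0+h) - r1
err_true = tan(t0 + h) - r(1);                      % true error of the step of size h
y_extrap = (16*r(2) - r(1))/15;                     % local extrapolation
[err_est, err_true, tan(t0+h) - y_extrap]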


The above idea based on the Runge–Kutta method is implemented in a code rk45.m. The name
indicates the order of convergence is 4, improved to 5 by the extrapolation. In the function rk23.m the
idea is implemented based on the method of Heun, which is also called a Runge–Kutta method of order 2 .
In both codes an absolute and relative tolerance for the error can be specified.

The algorithm in ode45.m does not use the above mentioned classical Runge–Kutta with half step
size, for which one step requires 11 evaluations of the ODE function f (t, y). Instead the Dormand–Prince embedded
method in Example 3–65 is used. It requires only 7 evaluations of f (t, y) with the same error proportional
to h^5.

3–66 Example : As an illustrative example examine the ODE

    ÿ(t) = −y^3(t)   with   y(0) = 0 ,   ẏ(0) = 1

on the interval [0, 3π]. Asking for the same relative and absolute tolerance (10^−2 or 10^−6) with the two
adaptive methods based on Heun and Runge–Kutta leads to the results in Table 3.9 and Figure 3.39. The
Runge–Kutta based approach clearly uses fewer time steps. Figure 3.39 also shows that in sections of the
solution with large curvature a smaller time step is used. The results for the tolerance of 10^−6 clearly
show that the difference is more significant for higher accuracy results.

      absolute and relative tolerance      10^−2                       10^−6
                                           Heun     Runge–Kutta        Heun     Runge–Kutta
      ---------------------------------------------------------------------------------------
      global order of convergence          h^2      h^4                h^2      h^4
      number of time steps                 213      34                 10235    274
      number of function calls             1160     440                51400    3487

Table 3.9: Comparison of a Heun based method (rk23.m) and Runge–Kutta method (rk45.m)

Observe that the number of function evaluations is considerably higher than the number of time steps.

• Each time step consists of one step of length h and two steps of length h/2. For Runge–Kutta this
leads to 11 function evaluations for each time step and for Heun to 5 evaluations. Observe that the
first evaluation is shared between the two computations.

• If a step size is rejected and the computation redone with a shorter time step, some of the evaluations
are "thrown away".


To illustrate the results in this section a few MATLAB/Octave codes are used, see Table 3.19 on page 251.

3.4.6 ODE solvers in MATLAB/Octave


Most of the ODE solvers in Octave and MATLAB follow a very similar syntax. Thus it is very easy to switch
the solvers and find the one most suitable for your problem. In the next subsection four of the available
algorithms will be described in more detail.

• Octave 6.2.0: ode15i, ode15s, ode23, ode23s, ode45

• Matlab R2019a: ode113, ode15i, ode15s, ode23, ode23s, ode23t, ode23tb, ode45

The goal of this subsection is to illustrate the application of the commands ode?? to a single ODE or
systems of ODEs. The usage of the options is explained.


Figure 3.39: Graphical results for a Heun based method (a) and a Runge–Kutta based method (b), with tolerances 10^−2

The basic usage of ode??


Typing the command help ode45 in Octave generates on the first few lines the calling options for
ode45(), but all other ode??() commands are very similar, as are the MATLAB commands.

help ode45
-->
-- [T, Y] = ode45 (FUN, TRANGE, INIT)
-- [T, Y] = ode45 (FUN, TRANGE, INIT, ODE_OPT)
-- [T, Y, TE, YE, IE] = ode45 (...)
-- SOLUTION = ode45 (...)
-- ode45 (...)

The input arguments have to be given by the user. A call of the function ode45() will return results.

• input arguments

FUN is a function handle, inline function, or string containing the name of the function that defines
the ODE: d/dt y(t) = f (t, y(t)) (or d/dt ~y(t) = f (t, ~y(t))). The function must accept two inputs,
where the first is time t and the second is a column vector (or scalar) of unknowns y.
TRANGE specifies the time interval over which the ODE will be evaluated. Typically, it is a two-element
vector specifying the initial and final times ([t_init, t_final]). If there are more than two elements,
then the solution will be evaluated at these intermediate time instances.
Observe that the algorithms ode??() will always first choose the intermediate times, using
the adapted step sizes, see Section 3.4.5. If TRANGE is a two-element vector, these values are
returned. If more intermediate times are asked for the algorithm will use a special interpolation
algorithm to return the solution at the desired times. Asking for more (maybe many) interme-
diate times will not increase the accuracy. For increased accuracy use the options RelTol and
AbsTol, see page 212.
INIT contains the initial value for the unknowns. If it is a row vector then the solution Y will be a
matrix in which each column is the solution for the corresponding initial value in tinit .
ODE_OPT The optional fourth argument ODE_OPT specifies non-default options to the ODE solver. It is a
structure generated by odeset(), see page 212.

• return arguments


– If the function [T,Y] = ode??() is called with two return arguments, the first return argu-
ment is column vector T with the times at which the solution is returned. The output Y is a matrix
in which each column refers to a different unknown of the problem and each row corresponds to
a time in T.
– If the function SOL = ode??() is called with one return argument, a structure with three
fields: SOL.x are the times, SOL.y is the matrix of solution values and the string SOL.solver
indicated which solver was used.
– If the function ode??() is called with no return arguments, a graphic is generated. Try
ode45(@(t,y)y,[0,1],1) .
– If using the Events option, then three additional outputs may be returned. TE holds the time
when an Event function returned a zero. YE holds the value of the solution at time TE. IE
contains an index indicating which Event function was triggered in the case of multiple Event
functions.

Solving the ODE d/dt y(t) = (1 − t) y(t) for 0 ≤ t ≤ 5 with initial condition y(0) = 1 is a one-liner.

[t,y] = ode45(@(t,y)(1-t)*y,[0,5],1);
plot(t,y,’+-’)

The plot on the left in Figure 3.40 shows that the time steps used by ode45() are rather large and thus
the solution seems to be inaccurate. The Dormand–Prince method in ode45() used large time steps to
achieve the desired accuracy and then returned the solution at those times only. It might be better to return
the solution at more intermediate times, uniformly spaced. This can be specified in trange when calling
ode45(), see the code below. Find the result on the right in Figure 3.40.

[t,y] = ode45(@(t,y)(1-t)*y,[0:0.1:5],1);
plot(t,y,’+-’)

Figure 3.40: Solution of an ODE by ode45() at the computed times (trange = [0,5]) or at preselected times (trange = [0:0.1:5])

An illustrative example, the SIR model for the spreading of an infection


A pandemic might start with a few infected individuals, but the infection can spread very quickly. One of
the mathematical models for this is the SIR model (Susceptible, Infected, Recovered). Find a description at
https://www.maa.org/press/periodicals/loci/joma/the-sir-model-for-spread-of-disease-the-differential-equation-
model and a YouTube video on the SIR model at https://www.youtube.com/watch?v=NKMHhm2Zbkw.


This (overly) simple model for the spreading of a virus uses three time dependent variables:

S(t) = the susceptible fraction of the population


I(t) = the infected fraction of the population
R(t) = the recovered fraction of the population

and S(t) + I(t) + R(t) = 1 implies

    d S(t)/dt + d I(t)/dt + d R(t)/dt = 0 .
Assuming there are N individuals in the population, use two parameters to describe the spreading of the
virus.

b = number of contacts per day of an infected individual.


b N I(t) new attempted infections, but only the fraction S(t) is susceptible.
k = fraction of infected individuals that will recover in one day.
k N I(t) individuals will recover in one day.

Interpretation of the parameters:


• The value 1/k can be considered as the number of days an individual can spread the virus to new patients.

• Every day an infected individual will make contact to b other individuals, possibly infecting them with
the virus. Only the fraction 0 < S(t) < 1 is susceptible, thus b S(t) will actually be infected by this
individual.
• During his sick period an individual will thus infect b/k · S(t) new patients.

• As we have a total of N I(t) sick individuals we will observe b N I(t) S(t) new infections every day.

This leads to a system of ODEs for the three ratios S(t), I(t) and R(t).

    d S(t)/dt = −b S(t) I(t)
    d I(t)/dt = −d S(t)/dt − d R(t)/dt = +b S(t) I(t) − k I(t)
    d R(t)/dt = +k I(t)

Rewrite this as an ODE for I(t) and R(t), using S(t) = 1 − I(t) − R(t).

    d I(t)/dt = +b (1 − I(t) − R(t)) I(t) − k I(t) = (+b − k − b I(t) − b R(t)) I(t)
    d R(t)/dt = +k I(t)
This system of ODE is solved numerically using e.g ode45(). Find the results of the code below in
Figure 3.41. The additional code in the file SIR_Model.m generates the vector fields in Figure 3.42.


SIR_Model.m
I0 = 1e-4; S0 = 1 - I0; R0 = 0; %% the initial values
b = 1/3; k = 1/10; %% the model parameters
%% b = 1/8; %% use this for a smaller infection rate

%% for MATLAB comment out this function and put it in a file SIR.m


function res = SIR(t,I,R,b,k) %%x = IR
res = [(+b-k-b*I-b*R).*I; k*I];
end%function

[t,IR] = ode45(@(t,x)SIR(t,x(1),x(2),b,k),linspace(0,600,601),[I0,R0]);

figure(1); plot(t,IR)
xlabel(’time [days]’); ylabel(’fraction of Infected and Recovered’)
ylim([-0.05 1.05])
legend(’infected’,’recovered’, ’location’,’east’)
Figure 3.41: SIR model, with infection rate b and recovery rate k; (a) b = 1/3, k = 1/10, (b) b = 1/8, k = 1/10. The positive effect of a small infection rate b is obvious.

Using options for the commands ode??


For these ODE solvers many options can and should be used. The command odeset() will generate a
list of the available options and their default values. With help odeset you obtain more information on
these options. The available options differ slightly for Octave and MATLAB. Below find a list of the options,
including the default values.
Octave odeset()
odeset()
-->
List of the most common ODE solver options.
Default values are in square brackets.

AbsTol: scalar or vector, >0, [1e-6]


RelTol: scalar, >0, [1e-3]
BDF: binary, {["off"], "on"}
Events: function_handle, []


Figure 3.42: SIR model vector field with b = 1/3 and k = 1/10; (a) regular vector field, (b) normalized vector field

InitialSlope: vector, []
InitialStep: scalar, >0, []
Jacobian: matrix or function_handle, []
JConstant: binary, {["off"], "on"}
JPattern: sparse matrix, []
Mass: matrix or function_handle, []
MassSingular: switch, {["maybe"], "no", "yes"}
MaxOrder: switch, {[5], 1, 2, 3, 4, }
MaxStep: scalar, >0, []
MStateDependence: switch, {["weak"], "none", "strong"}
MvPattern: sparse matrix, []
NonNegative: vector of integers, []
NormControl: binary, {["off"], "on"}
OutputFcn: function_handle, []
OutputSel: scalar or vector, []
Refine: scalar, integer, >0, []
Stats: binary, {["off"], "on"}
Vectorized: binary, {["off"], "on"}

Matlab odeset()
odeset()
-->
AbsTol: [ positive scalar or vector {1e-6} ]
RelTol: [ positive scalar {1e-3} ]
NormControl: [ on | {off} ]
NonNegative: [ vector of integers ]
OutputFcn: [ function_handle ]
OutputSel: [ vector of integers ]
Refine: [ positive integer ]
Stats: [ on | {off} ]
InitialStep: [ positive scalar ]
MaxStep: [ positive scalar ]
BDF: [ on | {off} ]
MaxOrder: [ 1 | 2 | 3 | 4 | {5} ]
Jacobian: [ matrix | function_handle ]
JPattern: [ sparse matrix ]
Vectorized: [ on | {off} ]


Mass: [ matrix | function_handle ]


MStateDependence: [ none | {weak} | strong ]
MvPattern: [ sparse matrix ]
MassSingular: [ yes | no | {maybe} ]
InitialSlope: [ vector ]
Events: [ function_handle ]

The most frequently used options are AbsTol (default value 10^−6) and RelTol (default value 10^−3),
used to specify the absolute and relative tolerances for the solution. At each time step the algorithm
estimates the error(i) of the i-th component of the solution. Then the condition

    |error(i)| <= max( RelTol * abs(y(i)), AbsTol(i) )

has to be satisfied, i.e. only one of the absolute or relative error bounds has to be met. As a consequence
it is rather useless to ask for a very small relative error, but keep the absolute error large. Both have to be
made small.
The ODE d/dt x(t) = 1 + x(t)^2 with x(0) = 0 is solved by x(t) = tan(t). Thus we know the exact value
of x(π/4) = 1. With the default values obtain

[t,x] = ode23(@(t,x)1+x^2,[0,pi/4],0);
Error_Steps_default = [x(end)-1, length(t)-1]
-->
Error_Steps_default = [-3.6322e-05 25]

i.e. with 25 Heun steps the error is approximately 3.6 · 10−5 . With the options AbsTol and RelTol the
error can be made smaller. The price to pay are more steps, i.e. a higher computational effort.

ode_opt = odeset('AbsTol',1e-9,'RelTol',1e-9);
[t,x] = ode23(@(t,x)1+x^2,[0,pi/4],0,ode_opt);
Error_Steps_opt = [x(end)-1 ,length(t)-1]
-->
Error_Steps_opt = [-2.3420e-10 495]

With the command odeget() one can read out specific options for the ode solvers.

3.4.7 Comparison of four Algorithms Available with Octave/MATLAB


An important aspect to consider when selecting a solver for an ODE is stiffness. A linear system is considered
stiff if the ratio between the largest and smallest eigenvalue (in absolute value) is large. Thus the corresponding eigen solutions
evolve on very different time scales. Due to stability problems different algorithms have to be used. As a
general rule explicit methods are not suitable for stiff problems, while implicit methods might work. For an
introduction use the Wikipedia article https://en.wikipedia.org/wiki/Stiff_equation.

Below four algorithms available in Octave are applied to a few sample problems to illustrate the differences.
The results by MATLAB are very similar.

ode45 : an implementation of a Runge–Kutta (4,5) formula, the Dormand–Prince method of order 4 and 5.
This algorithm works well on most non–stiff problems and is a good choice as a starting algorithm.

ode23 : an implementation of an explicit Runge–Kutta (2,3) method, the explicit Bogacki–Shampine method
of order 3. It might be more efficient than ode45 at crude tolerances for moderately stiff problems.


ode15s : this command solves stiff ODEs. It uses a variable step, variable order BDF (Backward Differ-
entiation Formula) method that ranges from order 1 to 5. Use ode15s if ode45 fails or is very
inefficient.

ode23s : this command solves stiff ODEs with a Rosenbrock method of order (2,3). The ode23s solver
evaluates the Jacobian during each step of the integration, so supplying it with the Jacobian matrix is
critical to its reliability and efficiency.

To see the statistics of the different solvers use the option^16 opt = odeset('Stats','on').

3–67 Example : The ODE

    d/dt y(t) = −y(t) + 3   with   y(0) = 0
with the exact solution
y(t) = 3 − 3 exp(−t)
is an example of a non stiff problem. The solution should not be a problem at all for either algorithm. This
is confirmed by the results in Table 3.10. Observe that the algorithm in ode45 generates the most accurate
results, see Figure 3.43. The example was solved by the code below.

Figure 3.43: Solution of a non–stiff ODE with different algorithms; (a) differences to exact solution, (b) step sizes

                               ode45   ode23   ode15s   ode23s
      ---------------------------------------------------------
      Number of steps          30      45      48       52
      Number of evaluations    183     138     91       418

Table 3.10: Data for a non–stiff ODE problem with different algorithms

Some of the results in Table 3.10 can be explained.

• For ode45() the number of evaluations is barely more than 6 times the number of time steps. This
is consistent with the information in Example 3–65. Thus the algorithm never had to go to smaller
time steps h. The time step is increased, until it reaches the upper limit^17 MaxStep of h = 0.5.

^16 This author generated the numbers by using a counter within the function (see the sketch after the code below), and obtained slightly different numbers.
^17 Examine the source file odedefaults.m to confirm this default value. The option MaxStep can be modified.


• For ode23() the number of evaluations is barely more than 3 times the number of time steps. This
is consistent with the information in Example 3–64. Thus the algorithm never had to go to smaller
time steps h. In Figure 3.43 observe that the step size is strictly increasing for ode23().

opt = odeset ("RelTol", 1e-8, "AbsTol", 1e-4);


f = @(t,y)-(y-3); y_exact = @(t)3-3*exp(-t);

[t45, y45] = ode45 (f, [0, 5], 0, opt);


[t23, y23] = ode23 (f, [0, 5], 0, opt);
[t15s, y15s] = ode15s (f, [0, 5], 0, opt);
[t23s, y23s] = ode23s (f, [0, 5], 0, opt);

figure(1); plot(t45,y_exact(t45),t45,y45,t23,y23,t15s,y15s,t23s,y23s)
xlabel(’time t’); ylabel(’solution y(t)’); title(’solution’)
legend(’exact’,’ode45’,’ode23’,’ode15s’,’ode23s’,’location’,’southeast’)
figure(2); plot(t45,y45-y_exact(t45),t23,y23-y_exact(t23),...
t15s,y15s-y_exact(t15s),t23s,y23s-y_exact(t23s))
xlabel(’time t’); title(’difference to exact solution’)
legend(’ode45’,’ode23’,’ode15s’,’ode23s’)
figure(3); plot(t45(2:end),diff(t45),’+’,t23(2:end),diff(t23),’+’,...
t15s(2:end),diff(t15s),’*’, t23s(2:end),diff(t23s),’*’)
xlabel(’time t’); title(’size of time step’)
legend(’ode45’,’ode23’,’ode15s’,’ode23s’,’location’,’northwest’)
NumberOfStepsNonStiff = [length(t45),length(t23),length(t15s),length(t23s)]-1
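As mentioned in footnote 16 the number of function calls can be verified by a counter inside the RHS function. A minimal sketch of such a counter (using a global variable; on MATLAB put f_counted in its own file):

global N_EVAL;  N_EVAL = 0;             % counter for the evaluations of the RHS
function res = f_counted(t,y)
  global N_EVAL;  N_EVAL = N_EVAL + 1;  % increment the counter on every call
  res = -(y-3);
end
[t,y] = ode45(@f_counted, [0,5], 0, odeset('RelTol',1e-8,'AbsTol',1e-4));
NumberOfEvaluations = N_EVAL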


3–68 Example : The ODE
    d/dt y(t) = −1000 y(t) + 3000 − 2000 exp(−t)   with   y(0) = 0

with the exact solution

    y(t) = 3 − 0.998 exp(−1000 t) − 2.002 exp(−t)

is an example of a stiff problem. For 0 ≤ t ≪ 1 the term exp(−1000 t) (generated by −1000 y(t))
dominates, then exp(−t) takes over. Those effects occur on a different time scale. Find the graphical
results in Figure 3.44 and the number of steps and evaluations in Table 3.11. The stiffness of the ODE is
confirmed by the number of time steps and evaluation of the RHS in Table 3.11. Observe that the algorithm
in ode15s generates the most accurate results, see Figure 3.44. The Octave code is very similar to the
previous example.
Some of the results in Table 3.11 can be explained.
• For ode45() the number of evaluations equals 6.65 times the number of time steps. One step of
Dormand–Prince uses 6 evaluations, thus the step size had to be made shorter many times.

• The number of evaluations for the stiff solvers ode15s() and ode23s() is considerably smaller
than for the explicit solvers ode45() and ode23().

                               ode45   ode23   ode15s   ode23s
      ---------------------------------------------------------
      Number of steps          1522    2011    99       222
      Number of evaluations    10125   6045    178      1778

Table 3.11: Data for a stiff ODE problem with different algorithms


Figure 3.44: Solution of a stiff ODE with different algorithms; (a) the solutions, (b) differences to exact solution, (c) step sizes, (d) zoom in on differences to exact solution


3–69 Example : The system of ODEs


! " # ! ! ! !
d y1 (t) 98 198 y1 (t) −98 y1 (0) 2
= + with =
dt y2 (t) −99 −199 y2 (t) +99 y2 (0) 0

is solved by ! ! ! !
y1 (t) 2 −1 1
= exp(−t) + exp(−100 t) + .
y2 (t) −1 +1 0
The result is based on the eigenvalues λ1 = −100 and λ2 = −1 of the matrix. Since the eigenvalues have
different magnitudes this is a (moderately) stiff system.

Figure 3.45: Solution of a stiff ODE system with different algorithms; (a) differences to exact solution, (b) step sizes

Some of the results in Table 3.12 can be explained.

• For ode45() the number of evaluations equals 6.9 times the number of time steps. One step of
Dormand–Prince uses 6 evaluations, thus the step size had to be made shorter many times.

• The number of evaluations for the stiff solvers ode15s() and ode23s() is considerably smaller
than for the explicit solvers ode45() and ode23().

                               ode45   ode23   ode15s   ode23s
      ---------------------------------------------------------
      Number of steps          174     308     166      283
      Number of evaluations    1198    950     284      2832

Table 3.12: Data for a stiff ODE system with different algorithms

The results are generated by the code

opt = odeset ("RelTol", 1e-8, "AbsTol", 1e-6);


f = @(t,y)[98 198;-99, -199]*(y-[1;0]);
y_exact = @(t)2*exp(-t)-exp(-100*t)+1; %% only first component

[t45, y45] = ode45 (f, [0, 5],[2;0], opt); y45 = y45(:,1);


[t23, y23] = ode23 (f, [0, 5],[2;0], opt); y23 = y23(:,1);
[t15s, y15s] = ode15s (f, [0, 5],[2;0], opt); y15s = y15s(:,1);


[t23s, y23s] = ode23s (f, [0, 5],[2;0], opt); y23s = y23s(:,1);

figure(1); plot(t45,y_exact(t45),t45,y45,t23,y23,t15s,y15s,t23s,y23s)
xlabel(’time t’); ylabel(’solution y(t)’);
legend(’exact’,’ode45’,’ode23’,’ode15s’,’ode23s’)
title(’solution’)
figure(2); plot(t45,y45-y_exact(t45),t23,y23-y_exact(t23),...
t15s,y15s-y_exact(t15s), t23s,y23s-y_exact(t23s))
xlabel(’time t’); title(’difference to exact solution’)
legend(’ode45’,’ode23’,’ode15s’,’ode23s’)
t_lim = 5; t23_short = t23(t23<t_lim);
figure(3); plot(t45(2:end),diff(t45),’+’,t23(2:end),diff(t23),’+’,...
t15s(2:end),diff(t15s),’*’,t23s(2:end),diff(t23s),’*’)
xlabel(’time t’); title(’size of time step’)
legend(’ode45’,’ode23’,’ode15s’,’ode23s’,’location’,’northwest’)
NumberOfSteps = [length(t45),length(t23),length(t15s),length(t23s)]-1

In the above code the Jacobian for the ODE was not used. In this example the Jacobian is given by
" #
∂ f~ 98 198
J= = .
∂~y −99 −199

Pass this information on to the implicit algorithms ode15s() and ode23s() by setting the option.

opt = odeset ("RelTol", 1e-8, "AbsTol", 1e-6, ’Jacobian’,[98 198;-99, -199]);

Then obtain the results in Table 3.13. Observe that the number of steps does not change, but the number of
evaluations by ode15s() and ode23s() is substantially smaller. This is due to the algorithms not having
to determine the Jacobian numerically by evaluating the function f (t, ~y ).

                               ode45   ode23   ode15s   ode23s
      ---------------------------------------------------------
      Number of steps          174     308     166      283
      Number of evaluations    1198    950     222      1417

Table 3.13: Data for a stiff ODE system with different algorithms, using the Jacobian matrix

3–70 Example : The system of ODEs


! " # ! ! ! !
d y1 (t) 79 72 y1 (t) −79 y1 (0) 2
= + with =
dt y2 (t) −90 −82 y2 (t) +90 y2 (0) −1

is solved by ! ! ! !
y1 (t) +9 −8 1
= exp(t) + exp(−2 t) + .
y2 (t) −10 +9 0
The result is based on the eigenvalues λ1 = −2 and λ2 = −1 of the matrix. Since the eigenvalues have
similar magnitudes this is a non–stiff system. In Figure 3.46 and Table 3.14 verify that the algorithm ode45
generates the most accurate results with the fewest number of evaluations of the RHS of the ODE.
Some of the results in Table 3.14 can be explained.
• For ode45() the number of evaluations equals 6.26 times the number of time steps. One step of
Dormand–Prince uses 6 evaluations, thus the step size had to be made shorter only a few times. This
is visible in Figure 3.46.


Figure 3.46: Solution of a non–stiff ODE system with different algorithms; (a) differences to exact solution, (b) step sizes

• For ode23() the number of evaluations equals 3.01 times the number of time steps. One step of
Bogacki–Shampine uses 3 evaluations, thus the step size was increased most of the times. This is
visible in Figure 3.46. Considerably more time steps are used by ode23(), compared to ode45().

                               ode45   ode23   ode15s   ode23s
      ---------------------------------------------------------
      Number of steps          35      216     155      271
      Number of evaluations    219     651     270      2711

Table 3.14: Data for non–stiff ODE system with different algorithms

The Octave code is very similar to the previous example. ♦


3.5 Linear and Nonlinear Regression, Curve Fitting


In this section the basics of linear and nonlinear regression are presented. Linear regression is one of the
basic tools for machine learning (ML), e.g. [Agga20] or [DeisFaisOng20, §9]. Using Octave/MATLAB a
few examples for linear and nonlinear regression are provided.

3.5.1 Linear Regression, Method of Least Squares


Linear regression for a straight line
For n given points (xi, yi) in a plane try to determine a straight line y(x) = p1 · 1 + p2 · x to match those
points as well as possible. One good option is to examine the residuals ri = p1 · 1 + p2 · xi − yi. Using
matrix notation write

    ~r = F · p~ − ~y   where   F = [ 1  x1 ; 1  x2 ; 1  x3 ; ... ; 1  xn ] ,   p~ = (p1, p2)^T ,   ~y = (y1, y2, ..., yn)^T .

Linear regression corresponds to minimization of the norm of ~r, i.e. minimize

    Σ_{i=1}^n ri^2 = ‖~r‖^2 = ‖F · p~ − ~y‖^2 = ⟨F · p~ − ~y , F · p~ − ~y⟩ .

This is the reason for the often used name method of least squares.

Figure 3.47: Regression of a straight line

Consider ‖~r‖^2 as a function of p1 and p2. At the minimal point the two partial derivatives with respect to
p1 and p2 have to vanish. This leads to a system of linear equations for the vector p~, the normal equation

    F^T F · p~ = F^T ~y .

This can easily be implemented in Octave, leading to the result in Figure 3.47 and a residual of ‖~r‖_2 ≈ 1.23.
Octave
x = [0; 1; 2; 3.5; 4]; y = [-0.5; 1; 2.4; 2.0; 3.1];

F = [ones(size(x)) x]
p = (F’*F)\(F’*y)


residual = norm(F*p-y)

xn = [-1 5]; yn = p(1)+p(2)*xn;


plot(x,y,’*r’,xn,yn);
xlabel(’independent variable x’); ylabel(’dependent variable y’)

Linear regression with a matrix notation


The above idea carries over to a linear combination of functions fj (x) for 1 ≤ j ≤ m. For a vector
~x = (x1 , x2 , . . . , xk )T examine a function of the form
    f (~x) = Σ_{j=1}^m pj · fj (~x) .

The optimal values of the parameter vector p~ = (p1 , p2 , . . . , pm )T have to be determined. Thus minimize
the expression
  2
    χ^2 = ‖~r‖^2 = Σ_{i=1}^n ri^2 = Σ_{i=1}^n (f (xi) − yi)^2 = Σ_{i=1}^n ( Σ_{j=1}^m pj · fj (xi) − yi )^2

Using a vector and matrix notation this leads to


     
    p~ = (p1, p2, ..., pm)^T ,   ~y = (y1, y2, ..., yn)^T ,   F = [ f1(x1)  f2(x1)  ...  fm(x1)
                                                                    f1(x2)  f2(x2)  ...  fm(x2)
                                                                      ...     ...          ...
                                                                    f1(xn)  f2(xn)  ...  fm(xn) ]

the expression to be minimized is

    ‖~r‖^2 = ‖F · p~ − ~y‖^2 = ⟨F · p~ − ~y , F · p~ − ~y⟩ .

Setting the partial derivatives with respect to the parameters to zero leads to a necessary condition.

    F^T F · p~ = F^T ~y


This system of m linear equations for the unknown vector of parameters p~ ∈ R^m is called a normal equation.
Once we have the optimal parameter vector p~, compute the values of the regression curve with a matrix
multiplication

    (F p~)_i = Σ_{j=1}^m pj · fj (xi) .

QR factorization and linear regression


For a n × m matrix F with more rows than columns (n > m) a QR factorization of the matrix can be
computed
F=Q·R
where the n × n matrix Q is orthogonal (Q^{−1} = Q^T) and the n × m matrix R has an upper triangular
structure. Now consider the block matrix notation

    Q = [ Ql  Qr ]   and   R = [ Ru ; 0 ] .


The m × m matrix Ru is square and upper triangular. The left part Ql of the square matrix Q is of size
n × m and satisfies Ql^T Ql = I_m. Use the zeros in the lower part of R to verify that

F = Q · R = Ql · Ru .

MATLAB/Octave can compute the QR factorization by [Q,R] = qr(F) and the reduced form by the
command [Ql,Ru] = qr(F,0). This factorization is very useful to implement linear regression.
Multiplying a vector ~r ∈ Rn with the orthogonal matrix Q or its inverse QT corresponds to a rotation of
the vector and thus will not change its length18 . This observation can be used to rewrite the linear regression
problem from Section 3.5.1.

    F · p~ − ~y = ~r                                              length to be minimized
    Q · R · p~ − ~y = ~r                                          length to be minimized
    R · p~ − Q^T · ~y = Q^T · ~r
    [ Ru · p~ ; 0 ] − [ Ql^T · ~y ; Qr^T · ~y ] = [ Ql^T · ~r ; Qr^T · ~r ]

Since the vector p~ does not change the lower part of the above system, the problem can be replaced by a
smaller system of m equations for m unknowns, namely the upper part only of the above system.

    Ru · p~ − Ql^T · ~y = Ql^T · ~r                               length to be minimized

Obviously this length is minimized if Ql^T · ~r = ~0 and thus find the reduced equations for the vector p~.

    Ru · p~ = Ql^T · ~y
    p~ = Ru^{−1} · Ql^T · ~y

In Octave the above algorithm can be implemented with two commands only.
Octave
[Q,R] = qr(F,0);
p = R\(Q’*y);

It can be shown that the condition number for the QR algorithm is much smaller than the condition number
for the algorithm based on (FT F) · p~ = FT ~y . Thus there are fewer accuracy problems to be expected and
the results are more reliable19 .

As a simple example fit a function f (x) = p1 · 1 + p2 · x + p3 · sin(x) to a given set of data points (xi , yi )
for 1 ≤ i ≤ n, as seen in Figure 3.48. In this example the n × 3 matrix F is given by
 
    F = [ 1  x1  sin(x1)
          1  x2  sin(x2)
          1  x3  sin(x3)
          ...
          1  xn  sin(xn) ] .

The code below first generates random data and then uses the reduced QR factorization to apply the linear
regression.

^18 ‖Q ~x‖^2 = ⟨Q ~x, Q ~x⟩ = ⟨~x, Q^T Q ~x⟩ = ⟨~x, I ~x⟩ = ‖~x‖^2
^19 A careful computation shows that using the QR factorization F = Q R in F^T F p~ = F^T ~y also leads to Ru p~ = Ql^T ~y.


Figure 3.48: An example for linear regression

% generate the random data


n = 100; x = linspace(0,10,n);
y = 6-0.5*x+0.4*sin(x) + 0.2*randn(1,n);

% perform the linear regression, using the QR factorization


F = [ones(n,1) x(:) sin(x(:))];
[Q1,Ru] = qr(F,0); % apply the reduced QR factorization
p = Ru\(Q1’*y(:)) % determine the optimal parameters
Ru % display the upper right matrix
y_reg = F*p; % determine the linear regression curve

figure(1)
plot(x,y,’+’,x,y_reg)
legend(’raw data’,’regression’)
xlabel(’x values’); ylabel(’y values’)
-->
p = 6.00653
-0.50358
0.43408

Ru = -10.00000 -50.00000 -1.79193


0.00000 29.15765 -0.50449
0.00000 0.00000 6.62802


SVD, singular value decomposition and linear regression


For F ∈ R^{n×m} with n > m it is possible to use the SVD F = U Σ V^T to analyze the linear regression
problem. For this split up the matrix U ∈ R^{n×n} into a left part Ul ∈ R^{n×m} and a right part Ur ∈ R^{n×(n−m)}.
 
    F = U [ diag(σ1, σ2, ..., σm) ; 0 ] V^T = Ul diag(σ1, σ2, ..., σm) V^T = Ul Σ V^T

with the m × m diagonal matrix Σ = diag(σ1, σ2, ..., σm).

Now the linear regression problem can be examined. The computations are rather similar to the linear
regression approach using the QR factorization.

    F · p~ − ~y = ~r                                                length to be minimized
    U · [ Σ ; 0 ] · V^T · p~ − ~y = ~r                              length to be minimized
    [ Σ ; 0 ] · V^T · p~ − U^T ~y = U^T ~r                          length to be minimized
    [ Σ · V^T · p~ ; 0 ] − [ Ul^T ~y ; Ur^T ~y ] = [ Ul^T ~r ; Ur^T ~r ]   length to be minimized
    optimize the upper part only, i.e. set Ul^T ~r = ~0
    Σ · V^T · p~ − Ul^T ~y = ~0
    Σ · V^T · p~ = Ul^T ~y

If σm > 0 then the above problem has a unique solution. The ratio σ1/σm contains information about the
sensitivity of this least squares problem. This allows one to detect ill conditioned linear regression problems. For
further information consult [GoluVanLoan13, §5.3].
MATLAB/Octave provide a command to generate the reduced SVD: [U,S,V] = svd(F,’econ’)
or [U,S,V] = svd(F,0). The above algorithm is applied with just two commands.

[Ul,S,V] = svd(F,0);
p = (S*V’)\(Ul’*y(:))

The above linear regression example is solved by

[Ul,S,V] = svd(F,0); % compute the reduced SVD factorization


p = (S*V’)\(Ul’*y(:)) % determine the optimal parameters
y_reg = F*p; % determine the linear regression curve

figure(1)
plot(x,y,'+',x,y_reg)
legend(’raw data’,’regression’)
xlabel(’x values’); ylabel(’y values’)

The result will be identical to the one generated by the the QR factorization and also leads to Figure 3.48.


3.5.2 Estimation of the Variance of Parameters, Confidence Intervals, Domain of Confidence
For regression problems it is essential to not only determine the optimal values for the parameters, but
also gain information on the accuracy and reliability of the result. Thus the domains of confidence for the
parameters have to be determined.

Using the above results (for the parabola fit) determine the residual vector
~r = F · p~ − ~y
and then mean and variance V = σ 2 of the y–errors can be estimated. The estimation is valid if the y–errors
are independent and assumed to be of equal size, i.e. assume a-priori that the errors are given by a
normal distribution. Then the variance σ 2 of the y values can be estimated by
    σ^2 = 1/(n − m) Σ_{i=1}^n ri^2

In most applications the values of the parameters pj contain the essential information. It is often important
to know how reliable the obtained results are, i.e. we want to know the variance of the determined parameter
values pj. To this end consider the normal equation

    F^T · F · p~ = F^T · ~y

and thus the explicit expression for p~

    p~ = (F^T · F)^{−1} · F^T · ~y = M · ~y                                     (3.23)

or

    pj = Σ_{i=1}^n mj,i yi   for 1 ≤ j ≤ m ,   where   M = [mj,i]_{1≤j≤m, 1≤i≤n} = (F^T · F)^{−1} · F^T

is an m × n matrix with more columns than rows (m < n). For a numerically reliable evaluation the same
expression should be computed based on the QR factorization

    p~ = Ru^{−1} · Ql^T · ~y .
This explicit representation of p_j allows one to compute the variance var(p_j) of the parameters p_j (see footnote 20 below), using
the estimated variance σ² of the y–values. The result is given by
$$\operatorname{var}(p_j) = \sum_{i=1}^{n} m_{j,i}^2\, \sigma^2
\quad\text{where}\quad \sigma^2 = \frac{1}{n-m}\sum_{i=1}^{n} r_i^2\,.$$
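
A minimal sketch of (3.23) and the variance formula, reusing F and y from the QR example at the beginning of this section (assumptions: n data points, m = 3 basis functions, and the QR route is used to evaluate M reliably):

[Q1,Ru] = qr(F,0);
M = Ru\Q1';                 % M = (F'*F)^(-1)*F', evaluated via the QR factors
p = M*y(:);                 % optimal parameters
r = F*p - y(:);             % residual vector
sigma2 = sum(r.^2)/(n-3);   % estimated variance of the y values
var_p = sum(M.^2,2)*sigma2  % estimated variances of the parameters p_j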

Knowing this standard deviation and assuming a normal distribution one can (see [MontRung03, §12-3.1]) readily
determine the 95% confidence intervals for the individual parameters (see footnotes 21 and 22 below), i.e. with a probability of
95% the actual value of the parameter lies between p_j − 1.96 √(var(p_j)) and p_j + 1.96 √(var(p_j)) (see footnote 23). The above assumes
that the distribution of the parameters is a normal distribution. Actually the distribution to use is a Student-t
distribution with n − m degrees of freedom. This can be computed in MATLAB/Octave by

p_CI = p + tinv(1-0.05/2,length(x)-m)*[-sigma +sigma]

Footnote 20: If z_k are independent random variables given by a normal distribution with variances var(z_k), then a linear combination of
the z_k also leads to a normal distribution. The variances are given by the computational rules
$$\operatorname{var}(z_1 + z_2) = \operatorname{var}(z_1) + \operatorname{var}(z_2)\,,\qquad
\operatorname{var}(\alpha_1 z_1) = \alpha_1^2 \operatorname{var}(z_1)\,,\qquad
\operatorname{var}\Big(\sum_i \alpha_i z_i\Big) = \sum_i \alpha_i^2 \operatorname{var}(z_i)\,.$$

Footnote 21: Use
$$\int_{-1.96\,\sigma}^{+1.96\,\sigma} \frac{1}{\sqrt{2\pi}\,\sigma}\, \exp(-x^2/(2\,\sigma^2))\,dx \approx 0.95\,.$$

Footnote 22: A more careful approach is to determine the confidence region for $\vec{p} \in \mathbb{R}^m$, using a general m–dimensional F–distribution,
see https://en.wikipedia.org/wiki/Confidence_region . The domain of confidence will be an ellipsoid in R^m, best computed using a
PCA. The difference can be substantial if the off–diagonal entries in the correlation matrix for the parameters $\vec{p}$ are not small. Find
more information in [RawlPantuDick98, §4.6.3] or examine Example 3–71 below.

3–71 Example : Confidence intervals and domains of confidence


As a simple (and academic) regression example examine the fit of a straight line to the curve y = sin(x) for
0 ≤ x ≤ π/2. Create 31 uniformly spaced points x_i = i · π/(2·30) for i = 0, 1, 2, …, 30 and the corresponding
y_i = sin(x_i). Then two regressions are performed:
• Fit f_1(x) = p_1 · 1 + p_2 x by minimizing the length of the residual vector $\vec{r}$.
$$
F_1\,\vec{p} - \vec{y} =
\begin{pmatrix} 1 & x_0\\ 1 & x_1\\ 1 & x_2\\ \vdots & \vdots\\ 1 & x_{30}\end{pmatrix}
\begin{pmatrix} p_1\\ p_2\end{pmatrix} -
\begin{pmatrix} \sin(x_0)\\ \sin(x_1)\\ \sin(x_2)\\ \vdots\\ \sin(x_{30})\end{pmatrix}
\overset{!}{=}
\begin{pmatrix} r_0\\ r_1\\ r_2\\ \vdots\\ r_{30}\end{pmatrix}
$$

• Fit f_2(x) = p_1 · 1 + p_2 (x − 10) by minimizing the length of the residual vector $\vec{r}$.
$$
F_2\,\vec{p} - \vec{y} =
\begin{pmatrix} 1 & x_0 - 10\\ 1 & x_1 - 10\\ 1 & x_2 - 10\\ \vdots & \vdots\\ 1 & x_{30} - 10\end{pmatrix}
\begin{pmatrix} p_1\\ p_2\end{pmatrix} -
\begin{pmatrix} \sin(x_0)\\ \sin(x_1)\\ \sin(x_2)\\ \vdots\\ \sin(x_{30})\end{pmatrix}
\overset{!}{=}
\begin{pmatrix} r_0\\ r_1\\ r_2\\ \vdots\\ r_{30}\end{pmatrix}
$$

Both attempts fit a straight line through the same data points, but represented by slightly different parametrizations.
The optimal values of the parameters and their estimated standard deviations are given by
$$p_1 = 0.112357 \pm 0.024183\,,\quad p_2 = 0.661721 \pm 0.026446
\qquad\text{and}\qquad
p_1 = 6.729570 \pm 0.244002\,,\quad p_2 = 0.661721 \pm 0.026446\,.$$
This leads to the 95% confidence intervals
$$0.062898 \le p_1 \le 0.161816\,,\quad 0.607634 \le p_2 \le 0.715809
\qquad\text{and}\qquad
6.2305 \le p_1 \le 7.2286\,,\quad 0.6076 \le p_2 \le 0.7158\,.$$

Observe that 0.112357 + 0.661721 x ≈ 6.729570 + 0.661721 (x − 10). The two straight lines coincide,
but the standard deviation of the parameter p_1 is much larger for the second attempt. This is caused by the
strong correlation between the functions 1 and (x − 10) for the second approach. This can be made visible
by examining the correlation matrices
$$\text{corp}_1 = \begin{pmatrix} 1 & -0.8589\\ -0.8589 & 1 \end{pmatrix}
\qquad\text{and}\qquad
\text{corp}_2 = \begin{pmatrix} 1 & 0.9987\\ 0.9987 & 1 \end{pmatrix}\,.$$
Footnote 23: Observe that the CIs (confidence intervals) are the probabilistic events, i.e. they will change from one random drawing to the
next. Thus the correct statement is not "With probability 0.95 the true value of p_i is in the CI", but rather "With probability 0.95 the
generated CI contains the true value of p_i". See en.wikipedia.org/wiki/Confidence_interval .


With MATLAB/Octave use lscov() (see footnote 24 below) to obtain the covariance matrix covp and then the correlation matrix
by
$$\text{corp} = \begin{pmatrix} \frac{1}{\sigma_1} & 0\\[2pt] 0 & \frac{1}{\sigma_2}\end{pmatrix}
\cdot \text{covp} \cdot
\begin{pmatrix} \frac{1}{\sigma_1} & 0\\[2pt] 0 & \frac{1}{\sigma_2}\end{pmatrix}\,.$$

• To construct a rectangular domain of confidence at confidence level 1 − α the two confidence intervals
  have to be determined with confidence level √(1 − α) ≈ 1 − α/2. Then the parameters are in the
  rectangle with a confidence level of 1 − α (see the short numeric sketch after this list).

• To construct a better, ellipsoidal domain of confidence use the covariance matrix. To determine the
dimensions of the ellipses the F–distribution has to be used. For more information see [DrapSmit98,
§5.4,§5.5], [RawlPantuDick98, §4.6.3] or use Section 3.2.7 with results on Gaussian distributions.
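
A short numeric sketch of the adjusted level for the rectangular domain (assumption: two parameters, level of significance α = 0.05):

alpha = 0.05;
alpha2 = 1 - sqrt(1-alpha)   % approx. 0.0253, i.e. approximately alpha/2, for each individual interval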

Find the graphical results in Figure 3.49. Observe that the domain of confidence in Figure 3.49(a) contains
more precise information, mainly due to the narrower range 0.06 ≤ p_1 ≤ 0.16 in the horizontal direction. The domain
of confidence in Figure 3.49(a) clearly points out that 95% is in an ellipse, not spread out over all of the
rectangle. In Figure 3.49(b) for the regression f_2(x) = p_1 · 1 + p_2 (x − 10) the ellipse is extremely narrow.
This is caused by the strong correlation between p_1 and p_2, as computed in the correlation matrix corp_2. The
situation is better in Figure 3.49(a) for the regression f_1(x) = p_1 · 1 + p_2 x. An even slightly better solution
would be to fit a function f_3(x) = p_1 · 1 + p_2 (x − π/4), where the two parameters are not correlated at all,
i.e. the correlation matrix would be very close to the identity matrix I_2. The result in Figure 3.49 would be
an ellipse with horizontal axis. To generate the ellipses in Figure 3.49 use eigenvalues and eigenvectors, see
Example 3–25. ♦
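
The numbers above can be reproduced with a few lines of code; a minimal sketch, assuming LinearRegression() is available (pkg load optim in Octave):

x = (0:30)'*pi/(2*30);  y = sin(x);          % the 31 data points
F1 = [ones(size(x)) x];      [p1v,~,~,var1] = LinearRegression(F1,y);
F2 = [ones(size(x)) x-10];   [p2v,~,~,var2] = LinearRegression(F2,y);
[p1v sqrt(var1)]    % parameters and standard deviations for f_1
[p2v sqrt(var2)]    % parameters and standard deviations for f_2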

[Figure 3.49: Regions of confidence for two similar regressions with identical data;
(a) f_1(x) = p_1 · 1 + p_2 x, (b) f_2(x) = p_1 · 1 + p_2 (x − 10); both panels show p_2 versus p_1.]

3.5.3 The commands LinearRegression(), regress() and lscov() for Octave and
MATLAB
In MATLAB and/or Octave there are a few commands useful for linear regression, see Table 3.15. In these
notes three are presented: LinearRegression(), regress() and lscov().
Footnote 24: Based on [GoluVanLoan96, §5.6.3], which is also available in [GoluVanLoan13, §6.1.2].

SHA 21-5-21
CHAPTER 3. NUMERICAL TOOLS 229

Command               Properties
LinearRegression()    standard and weighted linear regression;
                      returns standard deviations for parameters
regress()             standard linear regression;
                      returns confidence intervals for parameters
lscov()               generalized least square estimation, with weights
ols()                 ordinary least square estimation
gls()                 generalized least square estimation
polyfit()             regression with polynomials only
lsqnonneg()           regression with positivity constraint

Table 3.15: Commands for linear regression

The command LinearRegression()


The code for LinearRegression() is available for MATLAB and Octave.
• Octave: either load the optimization package with the command pkg load optim or download
the file from my web site at LinearRegression.m for Octave .

• MATLAB: download the file from my web site at LinearRegression.m for Matlab .
The builtin help LinearRegression leads to

LinearRegression (F, Y)
LinearRegression (F, Y, W)
[P, E_VAR, R, P_VAR, FIT_VAR] = LinearRegression(...)

general linear regression

determine the parameters p_j (j=1,2,...,m) such that the function


f(x) = sum_(j=1,...,m) p_j*f_j(x) is the best fit to the given
values y_i by f(x_i) for i=1,...,n, i.e. minimize
sum_(i=1,...,n)(y_i-sum_(j=1,...,m) p_j*f_j(x_i))ˆ2 with respect to p_j

parameters:
* F is an n*m matrix with the values of the basis functions at
the support points. In column j give the values of f_j at the
points x_i (i=1,2,...,n)
* Y is a column vector of length n with the given values
* W is a column vector of length n with the weights of the data points.
1/w_i is expected to be proportional to the estimated
uncertainty in the y values. Then the weighted expression
sum_(i=1,...,n)(w_iˆ2*(y_i-f(x_i))ˆ2) is minimized.

return values:
* P is the vector of length m with the estimated values of the parameters
* E_VAR is the vector of estimated variances of the provided y values.
If weights are provided, then the product e_var_i*wˆ2_i is assumed
to be constant.
* R is the weighted norm of the residual
* P_VAR is the vector of estimated variances of the parameters p_j
* FIT_VAR is the vector of the estimated variances of the fitted


function values f(x_i)

To estimate the variance of the difference between future y values


and fitted y values use the sum of E_VAR and FIT_VAR

Caution: do NOT request FIT_VAR for large data sets, as a n by n


matrix is generated

The command LinearRegression() allows one to use a weighted least squares algorithm, i.e. instead
of minimizing the standard
$$\|\vec{r}\|^2 = \sum_{i=1}^{n} r_i^2$$
the weighted expression
$$\|\vec{r}\|_W^2 = \sum_{i=1}^{n} w_i^2\, r_i^2$$
is minimized, with given weights w_i. This should be used if some data is known to be more reliable than
other data, or to give outliers a lesser weight.
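
As a minimal sketch of a weighted fit (assumption: reusing the matrix F and the data x, y, n from the QR example at the beginning of this section, with LinearRegression() on the path), the weights are passed as a third argument:

w = ones(n,1); w(x(:)>8) = 0.2;             % give the points with x > 8 a smaller weight
[p_w,e_var,r_w,p_var] = LinearRegression(F,y(:),w);
p_w                                         % weighted estimates of the parameters
sqrt(p_var)                                 % their estimated standard deviations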

The command regress()


The command regress() is contained in the statistics toolbox in MATLAB (i.e. $$$) and in the statistics
package in Octave, i.e. pkg load statistics. The command help regress in Octave leads to

[B, BINT, R, RINT, STATS] = regress (Y, X, [ALPHA])


Multiple Linear Regression using Least Squares Fit of Y on X with
the model ’y = X * beta + e’.

Here,
* ’y’ is a column vector of observed values
* ’X’ is a matrix of regressors, with the first column filled
with the constant value 1
* ’beta’ is a column vector of regression parameters
* ’e’ is a column vector of random errors

Arguments are
* Y is the ’y’ in the model
* X is the ’X’ in the model
* ALPHA is the significance level used to calculate the confidence
intervals BINT and RINT (see ’Return values’ below).
If not specified, ALPHA defaults to 0.05

Return values are


* B is the ’beta’ in the model
* BINT is the confidence interval for B
* R is a column vector of residuals
* RINT is the confidence interval for R
* STATS is a row vector containing:
* The Rˆ2 statistic
* The F statistic
* The p value for the full model
* The estimated error variance

R and RINT can be passed to ’rcoplot’ to visualize the residual


intervals and identify outliers.

NaN values in Y and X are removed before calculation begins.


The command lscov()


The command lscov() is available in MATLAB/Octave. It returns the covariance matrix for the optimal
parameters, which is required to generate the ellipses of confidence for the optimal parameters. The com-
mand help lscov in Octave leads to

-- X = lscov (A, B)
-- X = lscov (A, B, V)
-- X = lscov (A, B, V, ALG)
-- [X, STDX, MSE, S] = lscov (...)

Compute a generalized linear least squares fit.

Estimate X under the model B = AX + W, where the noise W is assumed


to follow a normal distribution with covariance matrix {\sigmaˆ2} V.

If the size of the coefficient matrix A is n-by-p, the size of the


vector/array of constant terms B must be n-by-k.

The optional input argument V may be an n-element vector of


positive weights (inverse variances), or an n-by-n symmetric
positive semi-definite matrix representing the covariance of B. If
V is not supplied, the ordinary least squares solution is returned.

The ALG input argument, a guidance on solution method to use, is


currently ignored.

Besides the least-squares estimate matrix X (p-by-k), the function


also returns STDX (p-by-k), the error standard deviation of
estimated X; MSE (k-by-1), the estimated data error covariance
scale factors (\sigmaˆ2); and S (p-by-p, or p-by-p-by-k if k > 1),
the error covariance of X.

Reference: Golub and Van Loan (1996), ’Matrix Computations (3rd Ed.)’,
Johns Hopkins, Section 5.6.3

See also: ols, gls, lsqnonneg.

3.5.4 An Elementary Example


As an illustrative example examine a true curve y = x − x² for 0 ≤ x ≤ 1. Then some normally
distributed noise is added, leading to data points (x_i, y_i), shown in Figure 3.50. Using the command
LinearRegression() a parabola y(x) = p_1 · 1 + p_2 x + p_3 x² is fitted to the generated values y_i. The
results are the estimated values for the parameters p_i and the estimated standard deviations for the parameters.

Finding the optimal parameters, their standard deviation and the confidence intervals
On the first few lines in the code below the data is generated, using a normally distributed noise contribution.
Then LinearRegression() is applied to determine the solutions.

N = 20 ; % number of data points


x = linspace(0,1); y = x.*(1-x); % the "true" data
x_d = rand(N,1) ; % the (random) data points
noise_small = 0.02*randn(N,1); % the small (random) noise
y_d = x_d.*(1-x_d) + noise_small; % random data points


F = [ones(size(x_d)) x_d x_d.^2];   % construct the regression matrix
[p, e_var, r, p_var, fit_var] = LinearRegression(F,y_d);
sigma = sqrt(p_var);
parameters = [p sigma]   % show the parameters and their standard deviation

y_fit = p(1) + p(2)*x + p(3)*x.^2;  % compute the fitted values

figure(1)
plot(x,y,'k',x_d,y_d,'b+',x,y_fit,'g')
xlabel('independent variable x')
ylabel('dependent variable y')
title('regression with small noise')
legend('true curve','data','best fit','location','south')
-->
parameters = -7.0696e-03 7.3803e-03
1.1126e+00 3.6093e-02
-1.1446e+00 3.7854e-02

Using the standard deviations of the parameters one can then determine the confidence intervals for a chosen
level of significance α = 5% = 0.05.

alpha = 0.05 ; % level of significance


p95_n = p + norminv(1-alpha/2)*[-sigma +sigma] % normal distribution
p95_t = p + tinv(1-alpha/2,length(x_d)-3)*[-sigma +sigma]% Student-t distribution
-->
p95_n = -2.1535e-02 7.3955e-03
1.0419e+00 1.1834e+00
-1.2188e+00 -1.0704e+00

p95_t = -2.2641e-02 8.5014e-03


1.0365e+00 1.1888e+00
-1.2244e+00 -1.0647e+00

The numerical result for the Student-t distribution implies that with a level of confidence of 95% the confi-
dence intervals for the parameters pi in y = p1 1 + p2 x + p3 x2 satisfy

−0.023 ≤ p1 ≤ +0.009
+1.04 ≤ p2 ≤ +1.19
−1.22 ≤ p3 ≤ −1.06 .

This confirms the “exact” values of p1 = 0, p2 = +1 and p3 = −1. The best fit parabola in Figure 3.50(a)
is rather close to the “true” parabola.

The effect of large noise contributions


The above can be repeated with a considerably larger noise contribution.

noise_big = 0.5*randn(N,1); % the large (random) noise


...
parameters = [p sigma]
...
p95 = p + norminv(1-alpha/2)*[-sigma +sigma]
p95 = p + tinv(1-alpha/2,length(x_d)-3)*[-sigma +sigma]
-->


[Figure 3.50: Linear regression for a parabola, with a small or a large noise contribution;
(a) small noise, (b) large noise; each panel shows the true curve, the data and the best fit.]

parameters = -0.082664 0.387973


0.272995 1.897374
0.527731 1.989961

p95_n = -0.8431 0.6777


-3.4458 3.9918
-3.3725 4.4280

p95_t = -0.9012 0.7359


-3.7301 4.2761
-3.6707 4.7262

These numbers imply
p_1 ≈ −0.08 ,  p_2 ≈ 0.27  and  p_3 ≈ 0.53 .
This is obviously far from the "true" values, which is confirmed by the very wide confidence intervals

−0.9 ≤ p_1 ≤ +0.7
−3.7 ≤ p_2 ≤ +4.2
−3.7 ≤ p_3 ≤ +4.7

The best fit parabola in Figure 3.50(b) is a poor approximation of the "true" parabola.

Using the command regress()


The problem above is solved using the function LinearRegression(), but one may use the function
regress() instead. For the small noise case the code below shows the identical results.

[p_regress, p95_t_regress] = regress(y_d,F,alpha)


-->
p_regress = -7.0696e-03
1.1126e+00
-1.1446e+00

p95_t_regress = -2.2641e-02 8.5014e-03


1.0365e+00 1.1888e+00
-1.2244e+00 -1.0647e+00

For the large noise case there is again no surprise.

[p_regress, p95_t_regress] = regress(y_d,F,alpha)


-->
p_regress = -0.082664
0.272995
0.527731

p95_t_regress = -0.9012 0.7359


-3.7301 4.2761
-3.6707 4.7262

Using the command lscov()


The problem above is solved using the functions LinearRegression() or regress(), but one may
use the function lscov() instead. For the small noise case the code below shows the identical results.

[p_lscov, sigma_lscov, ˜, covp_lscov] = lscov (F,y_d);


p_lscov
p95_t_lscov = p_lscov + tinv(1-alpha/2,length(x_d)-3)*[-sigma_lscov +sigma_lscov]
-->
p_lscov = -7.0696e-03
1.1126e+00
-1.1446e+00

p95_t_lscov = -2.2641e-02 8.5014e-03


1.0365e+00 1.1888e+00
-1.2244e+00 -1.0647e+00

The command lscov can also determine the covariance matrix for the parameters.

covp_lscov
-->
covp_lscov = 5.4469e-05 -2.3421e-04 2.0710e-04
-2.3421e-04 1.3027e-03 -1.3021e-03
2.0710e-04 -1.3021e-03 1.4330e-03

The rather large values off the diagonal in the correlation matrix indicate that the three parameters pi are not
independent.

corp = diag(1./sqrt(diag(covp_lscov)))*covp_lscov* diag(1./sqrt(diag(covp_lscov)))


-->
corp = 1.0000 -0.8792 0.7413
-0.8792 1.0000 -0.9530
0.7413 -0.9530 1.0000

With the covariance matrix the ellipsoidal region of confidence for p~ ∈ R3 can be determined. To start
choose a level of significance, in this case α = 0.05.

• Use Example 3–71 to determine the ellipsoidal domain of confidence in R3 and Example 3–25 to visu-
alize, see Figure 3.51(a). This figure can be rotated on screen if generated locally by Octave/MATLAB.


• To obtain printable visualizations restrictions to planes are useful. Use the tools from Example 3–27
to display the intersection and projection of the 3D domain of confidence with the plane p3 = const.
Find the result in Figure 3.51(b).

• If working with individual intervals of confidence the level of significance has to be adapted, such that
  (1 − α_3)³ = 1 − α = 0.95, i.e. α_3 = 1 − (1 − α)^(1/3) ≈ α/3. The projection of the "box" of confidence
  in R³ leads to the rectangle in Figure 3.51(b).

[Figure 3.51: Region of confidence for the parameters p_1, p_2 and p_3;
(a) for the full vector in R³, (b) for (p_1, p_2) ∈ R², showing the projection, the intersection and the rectangle.]

inv_cov = inv(covp_lscov);
%% the ellipsoid in space
[Evec,Eval] = eig(inv_cov);
phi = linspace(0,2*pi,81); theta = linspace(-pi,pi,81);
x = cos(phi')*cos(theta); y = sin(phi')*cos(theta); z = ones(size(phi'))*sin(theta);
Points = Evec*inv(sqrt(Eval))*[x(:),y(:),z(:)]';

%% the intersection with p3=0
A_intersect = inv_cov(1:2,1:2);
[Evec_i, Eval_i] = eig(A_intersect);
x = cos(phi); y = sin(phi); z = zeros(size(phi));
Points_i = [Evec_i*inv(sqrt(Eval_i))*[x(:),y(:)]'];

%%% projection onto the plane p3=0
A_project = ReduceQuadraticForm(inv_cov,3);
[Evec_p, Eval_p] = eig(A_project);
Points_p = [Evec_p*inv(sqrt(Eval_p))*[x(:),y(:)]'];

p1 = p_lscov(1); p2 = p_lscov(2); p3 = p_lscov(3); sigma = sigma_lscov;

alpha3 = 1-(1-alpha)^(1/3); t_lim = tinv(1-alpha3/2,length(x_d)-3);
f_limit = 3*finv(1-alpha,3,length(x_d));

figure(2)
plot3(p1,p2,p3,'or', p1+sqrt(f_limit)*Points(1,:),...
      p2 + sqrt(f_limit)*Points(2,:),p3 + sqrt(f_limit)*Points(3,:),'.b',...
      [p1-t_lim*sigma(1),p1+t_lim*sigma(1)],[p2,p2],[p3,p3],'g',...
      [p1,p1],[p2-t_lim*sigma(2),p2+t_lim*sigma(2)],[p3,p3],'g',...
      [p1,p1],[p2,p2],[p3-t_lim*sigma(3),p3+t_lim*sigma(3)],'g')
xlabel('p_1'); ylabel('p_2'); zlabel('p_3'); view([75,20]);

figure(3)
plot(p1+sqrt(f_limit)*Points_p(1,:),p2+sqrt(f_limit)*Points_p(2,:),'c',...
     p1+sqrt(f_limit)*Points_i(1,:),p2+sqrt(f_limit)*Points_i(2,:),'r',...
     p1+t_lim*sigma(1)*[-1 1 1 -1 -1],p2+t_lim*sigma(2)*[-1,-1,1,1,-1],'g')
xlabel('p_1'); ylabel('p_2'); legend('projection','intersection','square')

For the large noise case there is again no surprise, i.e. the same results as for LinearRegression()
and regress().

[p_lscov, sigma_lscov, ˜, covp_lscov] = lscov (F,y_d);


p_lscov
p95_t_lscov = p_lscov + tinv(1-alpha/2,length(x_d)-3)*[-sigma_lscov +sigma_lscov]
-->
p_lscov = -0.082664
0.272995
0.527731

p95_t_lscov = -0.9012 0.7359


-3.7301 4.2761
-3.6707 4.7262

Improving the confidence interval by using more data points


In the above Figure 3.50(b) the size of the noise was large and as a consequence the results for the parameters
p_i were rather unreliable, i.e. a large standard deviation and wide confidence intervals. This can be improved
by using more data points, assuming that the noise is given by a normal distribution. By multiplying the
number of data points by 100 (N → 100*N) theory predicts that the standard deviation σ should be √100 =
10 times smaller. This is caused by the matrix used to solve for the parameters: more data points lead to
larger entries in Fᵀ F and thus to smaller entries in the matrix (Fᵀ F)⁻¹ Fᵀ or in the matrix R_u⁻¹. The results
confirm this expectation.

noise_big = 0.5*randn(100*N,1);
x_d = rand(100*N,1) ; y_d = x_d.*(1-x_d) + noise_big;

F = [ones(size(x_d)) x_d x_d.^2]; % construct the regression matrix


[p, e_var, r, p_var, fit_var] = LinearRegression(F,y_d);
sigma = sqrt(p_var);
parameters = [p sigma] % show the parameters and their standard deviation
p95_t=p+tinv(1-alpha/2,length(x_d)-3)*[-sigma +sigma] % Student-t distribution
-->
parameters = -0.0086227 0.0334205
1.1139283 0.1555626
-1.1457702 0.1514078

p95_t = -0.074165 0.056920


0.808846 1.419010
-1.442704 -0.848836

Using more data points will increase the reliability of the result, but only if the residuals are normally
distributed. If you are fitting a straight line to data on a parabola the results cannot improve. Thus one has
to make a visual check of the fit (Figure 3.52(a)) to realize that the result cannot be correct, even
if the standard deviations of the parameters p_i are small. The problem is also visible in a histogram of the
residuals (Figure 3.52(b)). Obviously the distribution of the residuals is far from a normal distribution, and
thus the essential working assumption of normally distributed deviations is violated.

[Figure 3.52: Fitting a straight line to data close to a parabola;
(a) fitting a straight line to a parabola (true parabola, data, best fit line), (b) histogram of the residuals (deviation vs. frequency).]

noise_small = 0.02*randn(100*N,1); y_d = x_d.*(1-x_d) + noise_small;


F = [ones(size(x_d)) x_d]; % construct the regression matrix for a straight line
[p, e_var, r, p_var, fit_var] = LinearRegression(F,y_d);
sigma = sqrt(p_var);
parameters = [p sigma] % show the parameters and their standard deviation
p95_t =p+tinv(1-alpha/2,length(x_d)-3)*[-sigma +sigma] % Student-t distribution

y_fit = p(1) + p(2)*x; % compute the fitted values


figure(3)
plot(x,y,'k',x_d,y_d,'b+',x,y_fit,'g')
xlabel('independent variable x')
ylabel('dependent variable y')
title('regression with small noise')
legend('true parabola','data','best fit line')

residuals = F*p-y_d;
figure(4)
hist(residuals)
xlabel('deviation'); ylabel('frequency')
-->
parameters = 0.1639191 0.0034668
0.0035860 0.0059815

p95_t = 0.1571200 0.1707181


-0.0081447 0.0153166

3.5.5 How to Obtain Wrong Results!


In Figure 3.53 find the data for the intensity of the light emitted by an LED, as a function of the angle of
emission. For the design of the optical system the data points had to be "translated" to a function. Using


linear regression leads to the obviously wrong result in Figure 3.53(b). The following computations were
performed:

• The angle was given in degrees, i.e. 0° ≤ α ≤ 90°.

• A polynomial of degree 5 was used, i.e. regression for T(α) = p_1 + p_2 α + p_3 α² + p_4 α³ + p_5 α⁴ + p_6 α⁵.

• The matrix F was constructed, then the normal equation was solved by
  $$\vec{p} = \operatorname{inv}(F^T F) \cdot (F^T \vec{y})\,.$$

• The warning message about a singular matrix was ignored. Thus the computations were performed with a
  matrix with a very large condition number, e.g. κ ≈ 10¹⁷.

[Figure 3.53: Intensity of light as a function of the angle α; (a) setup of LED, (b) intensity profile (intensity vs. angle).]

The problem is perfectly solvable when using better approaches.

• Since the intensity profile is symmetric (i.e. T(−α) = T(α)), there can be no odd contributions to
  the polynomial (see footnote 25 below). Use the simpler function T(α) = p_1 + p_2 α² + p_3 α⁴, based on monomials of even
  degree only.

• Instead of the normal equation with the matrix Fᵀ F use the QR factorization.

• The angle α = 90° leads to numbers 90⁵ ≈ 6·10⁹ and thus to numbers of size 3·10¹⁹ in the matrix Fᵀ F.
  In the same matrix there are also numbers close to 1. Thus the condition number of this matrix is
  extremely large. Using radians instead of degrees helps, i.e. use an appropriate rescaling of the data
  (see the sketch after this list).

Using just one of these three improvements leads to reasonable answers. The most important aspect to
consider when using the linear regression method is to select the correct type of function for the fitting. This
decision has to be made based on insight into the problem at hand.
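
A minimal sketch of the improved fit (assumptions: the measured angles alpha_deg in degrees and the intensities I_data are available as column vectors; the names are chosen for illustration only):

alpha_rad = alpha_deg*pi/180;                            % rescale: radians instead of degrees
F = [ones(size(alpha_rad)) alpha_rad.^2 alpha_rad.^4];   % even monomials only
[Q1,Ru] = qr(F,0);                                       % QR instead of the normal equations
p = Ru\(Q1'*I_data(:));                                  % optimal parameters p_1, p_2, p_3
I_fit = F*p;                                             % fitted intensity values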

Choose your basis functions for linear regression very carefully,


based on information about the system to be examined.

Footnote 25: In my (long) career I have seen no application requiring a regression with a polynomial of high degree. I have seen many problems when using a polynomial of high degree.


3.5.6 An Example with Multiple Independent Variables


It is also possible to perform linear regression with functions of multiple independent variables. The
function
z = p1 · 1 + p2 · x + p3 · y
describes a plane in 3D space. A surface of this type is fit to a set of given points (x_j, y_j, z_j) by the code
below, resulting in Figure 3.54. The columns of the matrix F have to contain the values of the basis functions
1, x and y at the given data points.
$$
F = \begin{pmatrix}
1 & x_1 & y_1\\
1 & x_2 & y_2\\
1 & x_3 & y_3\\
\vdots & \vdots & \vdots\\
1 & x_N & y_N
\end{pmatrix}
$$

N = 100; x = 2*rand(N,1); y = 3*rand(N,1);


z = 2 + 2*x- 1.5*y + 0.5*randn(N,1);

F = [ones(size(x)), x , y];
p = LinearRegression(F,z)

[x_grid, y_grid] = meshgrid([0:0.1:2],[0:0.2:3]);


z_grid = p(1) + p(2)*x_grid + p(3)*y_grid;

figure(1);
plot3(x,y,z,'*')
hold on
mesh(x_grid,y_grid,z_grid)
xlabel('x'); ylabel('y'); zlabel('z');
hold off
-->
p = 1.7689 2.0606 -1.4396

Since only rather few (N=100) points were used, the exact parameter values p⃗ = (+2, +2, −1.5) are not
very accurately reproduced. Increasing N will lead to more accurate results for this simulation, as will decreasing
the size of the random noise in +0.5*randn(N,1).

[Figure 3.54: Result of a 3D linear regression; the data points and the fitted plane over the (x, y) domain.]


The command LinearRegression() does not determine the confidence intervals for the parame-
ters, but it returns the estimated standard deviations, resp. the variances. With these the confidence intervals
can be computed, using the Student-t distribution. To determine the CI modify the above code slightly.

[p,~,~,p_var] = LinearRegression(F,z); alpha = 0.05;


p_CI = p + tinv(1-alpha/2,N-3)*[-sqrt(p_var) +sqrt(p_var)]
-->
p_CI = +1.6944 +2.2357
+1.8490 +2.2222
-1.5869 -1.3495

The result implies that with a confidence level of 95% the parameters p_i satisfy

+1.6944 < p_1 < +2.2357
+1.8490 < p_2 < +2.2222
−1.5869 < p_3 < −1.3495 .

3.5.7 Introduction to Nonlinear Regression


Introduction and first examples
The commands in the above section are well suited for linear regression problems, but there are many
important nonlinear regression problems. Examine Table 3.16 to distinguish linear from nonlinear regression
problems. Unfortunately nonlinear regression problems are considerably more delicate to work with, and
special algorithms have to be used. For many problems it is critical to find good initial guesses for the
parameters to be determined. Linear and nonlinear regression problems may also be treated as general minimization
problems. This is often not a good idea, as regression problems have special properties that one can, and
has to, take advantage of. Find a list of commands for nonlinear regression in Table 3.17. Observe that the
syntax and algorithms of these commands might differ between MATLAB and Octave. You definitely have to
consult the manuals and examine example applications.

function                         parameters    type of regression
y = a + m x                      a, m          linear
y = a x² + b x + c               a, b, c       linear
y = a e^(c x)                    a, c          nonlinear
y = d + a e^(c x)                a, c, d       nonlinear
y = a e^(c x)                    a             linear
y = a sin(ω t + δ)               a, ω, δ       nonlinear
y = a cos(ω t) + b sin(ω t)      a, b          linear

Table 3.16: Examples for linear and nonlinear regression
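
As a minimal sketch for the last (linear) entry of Table 3.16, assuming the angular frequency omega is known and t, y are given data vectors:

F = [cos(omega*t(:)) sin(omega*t(:))];   % values of the basis functions at the data points
p = LinearRegression(F,y(:));            % p(1) = a, p(2) = b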

Nonlinear least square fit with leasqr()

The optimization package of Octave provides the command leasqr() (see footnote 26 below). It is an excellent implementation
of the Levenberg–Marquardt algorithm. The basic idea is a damped Newton's method (see Section 3.1.6 on
page 119) to solve the corresponding system of nonlinear equations for the parameters. Thus leasqr()
needs the matrix of partial derivatives with respect to the parameters. This matrix can be provided as a function
or estimated by the auxiliary function dfdp.m, which uses a finite difference approximation. The Octave
package also provides one example in leasqrdemo.m and you can examine its source code.

Footnote 26: In Octave call pkg load optim. MATLAB users may use the codes on my web page, leasqr.m and dfdp.m. The version of
leasqr.m available on the internet seems to have a small bug computing the covariance matrix covp. The original source code
contains a remark on an alternative method to determine covp. My web page shows the modified version.

Command               Properties
leasqr()              standard nonlinear regression, Levenberg–Marquardt
fsolve()              can be used for nonlinear regression too
nlinfit()             nonlinear regression
lsqcurvefit()         nonlinear curve fitting
nonlin_curvefit()     frontend, Octave only
lsqnonlin()           nonlinear minimization of the residual
nonlin_residmin()     frontend, Octave only
nlparci()             determine confidence intervals of parameters, MATLAB only
expfit()              regression with exponential functions

Table 3.17: Commands for nonlinear regression
As a first example try to fit a function
$$f(t) = A\, e^{-\alpha t} \cos(\omega t + \phi)$$
through a number of measured points (t_i, y_i). Then search for the values of the parameters A, α, ω and φ that
minimize
$$\sum_i |f(t_i) - y_i|^2\,.$$
Since the function is nonlinear with respect to the parameters α, ω and φ one cannot use linear regression.
In Octave the command leasqr() will solve nonlinear regression problems. To set up an example
one may:
1. Choose "exact" values for the parameters.

2. Generate normally distributed random numbers as a perturbation of the "exact" result.

3. Define the appropriate function and generate the data.

Find the code below; the generated data points are shown in Figure 3.55, together with the best possible
approximation by a function of the above type.
Octave
Ae = 1.5; ale = 0.1; omegae = 0.9 ; phie = 1.5;
noise = 0.1;
t = linspace(0,10,50)’; n = noise*randn(size(t));
function y = f(t,p)
y = p(1)*exp(-p(2)*t).*cos(p(3)*t + p(4));
endfunction
y = f(t,[Ae,ale,omegae,phie])+n;
plot(t,y,'+;data;')

You have to provide the function leasqr() with good initial estimates for the parameters. The al-
gorithm in leasqr() uses a damped Newton method (see Section 3.1.5) to find the optimal solution.
Examining the selection of points in Figure 3.55 estimate


• A ≈ 1.5: this might be the amplitude at t = 0.

• α ≈ 0: there seems to be very little damping.

• ω ≈ 0.9: the period seems to be slightly larger than 2 π, thus ω slightly smaller than 1.

• φ ≈ π/2: the graph seems to start out like − sin(ω t) = cos(ω t + π/2)

The results of your simulation might vary slightly, caused by the random numbers involved.
Octave
A0 = 2; al0 = 0; omega0 = 1; phi0 = pi/2;
[fr,p] = leasqr(t,y,[A0,al0,omega0,phi0],'f',1e-10);
p'

yFit = f(t,p);
plot(t,y,'+', t,yFit)
legend('data','fit')
-->
p = 1.523957 0.098949 0.891675 1.545294

[Figure 3.55: Least square approximation of a damped oscillation; data points and the fitted curve.]

The above result contains the estimates for the parameters. For many problems the deviations from
the true curve are randomly distributed, with a normal distribution and small variance. In this case the
parameters are also randomly distributed with a normal distribution. The diagonal of the covariance matrix
contains the variances of the parameters; thus estimate the standard deviations by taking the square root
of the variances.
Octave
pkg load optim % load the optimization package in Octave
[fr,p,kvg,iter,corp,covp,covr,stdresid,Z,r2] =...
        leasqr(t,y,[A0,al0,omega0,phi0],'f',1e-10);
pDev = sqrt(diag(covp))'
-->
pDev = 0.0545981 0.0077622 0.0073468 0.0307322

With the above results obtain Table 3.18. Observe that the results are consistent, i.e. the estimated param-
eters are rather close to the “exact” values. To obtain even better estimates, rerun the simulation with less
noise or more points.


parameter    estimated value    standard dev.    "exact" value
A            1.52               0.055            1.5
α            0.099              0.0078           0.1
ω            0.892              0.0073           0.9
φ            1.54               0.031            1.5

Table 3.18: Estimated and exact values of the parameters

Nonlinear regression with fsolve()


The command fsolve() is used to solve systems of nonlinear equations, using Newton's method. Assume
that a function depends on parameters $\vec{p} \in \mathbb{R}^m$ and the actual variable x, i.e.
$$y = f(\vec{p}, x)\,.$$
A few (n) points are given, thus $\vec{x} \in \mathbb{R}^n$, and the same number of values $\vec{y}_d \in \mathbb{R}^n$ are measured. For
precise measurements we expect $\vec{y}_d \approx \vec{y} = f(\vec{p}, \vec{x})$. Then we can search for the optimal parameters $\vec{p} \in \mathbb{R}^m$
such that
$$f(\vec{p}, \vec{x}) - \vec{y}_d = \vec{0}\,.$$
If m < n this is an overdetermined system of n equations for the m unknowns $\vec{p} \in \mathbb{R}^m$. In this case the
command fsolve() will convert the system of equations to a minimization problem:
$$\|f(\vec{p}, \vec{x}) - \vec{y}_d\| \text{ is minimized with respect to } \vec{p} \in \mathbb{R}^m\,.$$
It is also possible to estimate the variances of the optimal parameters, using the techniques from
Section 3.5.2. This will be done very carefully in the following Section 3.5.8.

As an illustrative example data points on the curve y = exp(−0.2 x) + 3 are generated and then some
random noise is added. As initial parameters use the naive guess y(x) = exp(0 · x) + 0. The best possible
fit is determined and displayed in Figure 3.56.

b0 = 3; a0 = 0.2; % choose the data


x = 0:.5:5; noise = 0.1 * sin (100*x); y = exp (-a0*x) + b0 + noise;

[p, fval, info, output] = fsolve (@(p) (exp(-p(1)*x) + p(2) - y), [0, 0]);
plot(x,y,'+', x,exp(-p(1)*x)+p(2))

[Figure 3.56: Nonlinear least square approximation with fsolve(); data points and the fitted curve.]


Nonlinear regression with lsqnonlin()


The optimization toolbox in MATLAB ($$$) and the optim package in Octave provide the command
lsqnonlin() to solve nonlinear regression problems. The Syntax is very similar to fsolve() and
the result is identical to Figure 3.56.

b0 = 3; a0 = 0.2; % choose the data


x = 0:.5:5; noise = 0.1 * sin (100*x); y = exp (-a0*x) + b0 + noise;

p = lsqnonlin (@(p) (exp(-p(1)*x) + p(2) - y), [0, 0]);


plot(x,y,'+', x,exp(-p(1)*x)+p(2))

3.5.8 Nonlinear Regression with a Logistic Function


In this section a more concrete example of nonlinear regression is examined carefully. Special attention
is given to the individual intervals of confidence and the combined region of confidence for the optimal
parameters.

Many growth phenomena can be described by rescaling and shifting the basic logistic growth function (see footnote 27)
$$g(x) = \frac{\exp(x)}{1+\exp(x)} = \frac{1}{1+\exp(-x)}\,.$$
It is easy to see that this function is monotonically increasing and
$$\lim_{x\to-\infty} g(x) = 0\,,\qquad g(0) = \frac{1}{2}\qquad\text{and}\qquad \lim_{x\to+\infty} g(x) = 1\,.$$

By shifting and rescaling examine the modified logistic function
$$f(x) = p_1 + p_2\, g(p_3 (x - p_4)) = p_1 + \frac{p_2}{1+\exp(-p_3 (x - p_4))} \tag{3.24}$$
with the four parameters p_i, i = 1, 2, 3, 4. An example is shown in Figure 3.57. For the given data points (in
red) the optimal values for the parameters p_i have to be determined. This is a nonlinear regression problem.

[Figure 3.57: Data points and the optimal fit by a logistic function; frequency vs. distance.]

Footnote 27: Also called sigmoid function.


An essential point for a nonlinear regression problem is to find good estimates for the values of the
parameters. Thus examine the graph of the logistic function (3.24) carefully:

• At the midpoint x = p_4 find f(p_4) = p_1 + p_2/2.

• For the extreme values observe lim_{x→−∞} f(x) = p_1 and lim_{x→+∞} f(x) = p_1 + p_2.

• The maximal slope is at the midpoint and given by f'(p_4) = p_2 p_3 / 4 (see footnote 28 below).

Assuming p_2, p_3 > 0 now find good estimates for the parameter values:

• p_1 offset: minimal height of the data points

• p_2 amplitude: difference of maximal and minimal value

• p_3 slope: the maximal slope is m = p_2 p_3 / 4 and thus p_3 = 4 m / p_2

• p_4 midpoint: average of the x values


Based on this use the code below to determine the estimated values.

x_data = [0 1 2 3 4 5 6 7 8]’;
y_data = [46.8 47.2 48.8 51.8 55.7 58.6 61.8 63 63.8]’;

p1 = min(y_data);
p2 = max(y_data)-min(y_data);
p3 = 4*max(diff(y_data)./diff(x_data))/p2;
p4 = mean(x_data);

This result can now be used to apply a nonlinear regression, using the functions leasqr(), fsolve()
or lsqcurvefit().

Solution by leasqr(), Octave and MATLAB


To determine the optimal values of the parameters:
• Define the logistic function, with the four parameters pi .

• Call leasqr(), returning the values and the covariance matrix. On the diagonal of the covariance
matrix find the estimated variances of the parameters pi .
Find the result in Figure 3.57. As numerical result the optimal values of p_i and their standard deviations are
shown. These can be used to determine confidence intervals for the parameters. In addition the number of
required iterations and the resulting residual $\left(\sum_{i=1}^{n} (f(x_i) - y_i)^2\right)^{1/2}$ are displayed.

f = @(x,p) p(1) + p(2)*exp(p(3)*(x-p(4)))./(1+exp(p(3)*(x-p(4))));
[fr, p, ~, iter, corp, covp] = leasqr(x_data,y_data,[p1,p2,p3,p4],f);
sigma = sqrt(diag(covp));
optimal_values = [p'; sigma']
iter_residual = [iter,norm(fr-y_data)]
x = linspace(-2,10);
figure(1); plot(x,f(x,p),x_data,y_data,'or')
xlabel('distance'); ylabel('frequency')
-->
optimal_values = 45.931829 18.428664 0.838742 3.932786
                  0.380858  0.645210 0.062353 0.080993
iter_residual = 4 0.64832

Footnote 28: For g(x) = 1/(1+exp(−x)) use g'(0) = 1/4 and then some rescaling to determine f'(p_4).


Using the estimated variances of the individual parameters p_i and the Student-t distribution, the 95% confidence
intervals for the individual parameters can be estimated.

p_CI_95 = p + tinv(1-0.05/2,length(x_data)-4)*[-sigma +sigma]


-->
p_CI_95 = 44.9528 46.9109
16.7701 20.0872
0.6785 0.9990
3.7246 4.1410

But since the correlation matrix corp contains large off–diagonal entries the result might not be reliable.

corp = 1.000000 -0.873224 0.787091 0.453505


-0.873224 1.000000 -0.904347 -0.043361
0.787091 -0.904347 1.000000 0.034882
0.453505 -0.043361 0.034882 1.000000

The large off–diagonal entries indicate that the four parameters pi are not independent.

Domain of confidence
To examine the joint domain of confidence for the four parameters p_i some more computations have to be
performed. There are two methods to visualize the joint domain of confidence:

1. as a rectangular box in R4

2. as an ellipsoid in R4

For the individual confidence intervals with level of significance α = 0.05 use the Student-t distribution
by calling tinv(1-0.05/2,n-4). If the "box" in R⁴ were constructed with these widths, then the
confidence of the true parameter vector being inside the box would be (1 − α)⁴ = (1 − 0.05)⁴ ≈ 0.81. For a level
of confidence 1 − α for all four parameters together one needs p_4⁴ = 1 − α, i.e. p_4 = (1 − α)^(1/4). This leads
to the level of significance
α_4 = 1 − p_4 = 1 − (1 − α)^(1/4) ≈ α/4 .

alpha4 = 1-(1-alpha)^0.25;
p_CI_95_joint = p + tinv(1-alpha4/2,length(x_data)-4)*[-sigma +sigma]
-->
p_CI_95_joint = 44.4879 47.3758
15.9825 20.8748
0.6023 1.0751
3.6257 4.2399

The result is a clearly larger domain of confidence for the joint parameters; it is a rectangular "box" in R⁴.

Because of the correlation of the parameters p_i one should examine the ellipsoidal domain of confidence
in R⁴. The intersection of the domain in R⁴ with the plane p_3 = p_4 = const is an ellipse in the plane with p_1
and p_2. To determine the dimensions of the ellipses the F–distribution has to be used, see [DrapSmit98] or
[RawlPantuDick98]. Find the result in Figure 3.58(a). The result of the intersection with p_1 = p_2 = const
is shown in Figure 3.58(b). Also shown are the central axes for the rectangular domains of confidence in
green. Use the results in Section 3.2.7 and Example 3–25 to draw the ellipses.


inv_cov = inv(covp); inv_cov = inv_cov(1:2,1:2);   % examine p1 and p2
[Evec,Eval] = eig(inv_cov); angle = linspace(0,2*pi);
Points = Evec*inv(sqrt(Eval))*[cos(angle);sin(angle)];

f_limit = 4*finv(1-alpha,4,length(x_data));
figure(3);
plot(p1+sqrt(f_limit)*Points(1,:),p2+sqrt(f_limit)*Points(2,:),'b','linewidth',2)
t_lim = tinv(1-alpha4/2,length(x_data)-4)
hold on
plot(p1,p2,'or')
plot([p1-t_lim*sigma(1),p1+t_lim*sigma(1)],[p2,p2],'g')
plot([p1,p1],[p2-t_lim*sigma(2),p2+t_lim*sigma(2)],'g')
xlabel('p_1'); ylabel('p_2'); hold off

[Figure 3.58: The intersection of the 95% confidence ellipsoid in R⁴ with 2D planes;
(a) intersection with p_3 = p_4 = const (p_2 vs. p_1), (b) intersection with p_1 = p_2 = const (p_4 vs. p_3).]

The intersection of the confidence domain in R⁴ with the plane p_4 = const leads to an ellipsoid in R³,
shown in Figure 3.59. The intersection of this ellipsoid with the horizontal plane p_3 = const generated by the
green markers leads to the ellipse in Figure 3.58(a). On occasion it might also be useful to use the projections
of the general ellipsoids onto coordinate planes, using the ideas leading to Example 3–27 on page 140.

inv_cov = inv(covp); inv_cov = inv_cov([1 2 3],[1 2 3]);
[Evec,Eval] = eig(inv_cov);
phi = linspace(0,2*pi,81); theta = linspace(-pi,pi,81);
x = cos(phi')*cos(theta); y = sin(phi')*cos(theta); z = ones(size(phi'))*sin(theta);
Points = Evec*inv(sqrt(Eval))*[x(:),y(:),z(:)]';

f_limit = 4*finv(1-alpha,4,length(x_data));
figure(4)
plot3(p1+sqrt(f_limit)*Points(1,:),p2 + sqrt(f_limit)*Points(2,:),...
      p3 + sqrt(f_limit)*Points(3,:),'.b')
hold on
plot3(p1,p2,p3,'or')
plot3([p1-t_lim*sigma(1),p1+t_lim*sigma(1)],[p2,p2],[p3,p3],'g')
plot3([p1,p1],[p2-t_lim*sigma(2),p2+t_lim*sigma(2)],[p3,p3],'g')
plot3([p1,p1],[p2,p2],[p3-t_lim*sigma(3),p3+t_lim*sigma(3)],'g')
xlabel('p_1'); ylabel('p_2'); zlabel('p_3'); view([-115,15]); hold off

[Figure 3.59: The intersection of the 95% confidence ellipsoid in R⁴ with the p_4 = const plane; an ellipsoid in the (p_1, p_2, p_3) space.]

Solution by nonlin_curvefit(), Octave only

With the command nonlin_curvefit() the method of nonlinear least squares can be used to fit a
function to data points. A solution for the above problem is given by
Octave
f = @(p,x) p(1) + p(2)*exp(p(3)*(x-p(4)))./(1+exp(p(3)*(x-p(4))));
[p,fr,convergence_flag,outp] = nonlin_curvefit (f,[p1;p2;p3;p4], x_data, y_data);
optimal_values = p’
iter_residual = [outp.niter,norm(fr-y_data)]
-->
optimal_values = 4.5932e+01 1.8429e+01 8.3874e-01 3.9328e+00
iter_residual = 4 0.6483

The result for the optimal parameters p_i is identical to the previous results. In addition the number of
required iterations and the resulting residual are displayed. The command curvefit_stat() will determine
the covariance matrix covp and the correlation matrix corp.

settings = optimset ('ret_covp',true,'ret_corp',true,'objf_type','wls')
FitInfo = curvefit_stat (f,p,x_data,y_data,settings)
sigma = sqrt(diag(FitInfo.covp))
-->
FitInfo = scalar structure containing the fields:
covp = 1.4505e-01 -2.1458e-01 1.8692e-02 1.3989e-02
-2.1458e-01 4.1630e-01 -3.6383e-02 -2.2660e-03
1.8692e-02 -3.6383e-02 3.8879e-03 1.7616e-04
1.3989e-02 -2.2660e-03 1.7616e-04 6.5599e-03

corp = 1.000000 -0.873224 0.787091 0.453505


-0.873224 1.000000 -0.904347 -0.043361
0.787091 -0.904347 1.000000 0.034882
0.453505 -0.043361 0.034882 1.000000


sigma = 3.8086e-01 6.4521e-01 6.2353e-02 8.0993e-02

With this information the analysis of the domain of confidence can be performed, identical to the results
by leasqr().

Solution by fsolve(), MATLAB and Octave


The command fsolve() is used to solve systems of nonlinear equations. If more data points than param-
eters are given (more equations than unknowns), then a nonlinear least square solution is determined. Thus
solve the above problem using fsolve().

f2 = @(p) p(1) + p(2)*exp(p(3)*(x_data-p(4)))./(1+exp(p(3)*(x_data-p(4))))-y_data;


[p,fval] = fsolve(f2,[p1,p2,p3,p4]);
optimal_values = p
residual = norm(fval)
-->
optimal_values = 45.93183 18.42866 0.83874 3.93279
residual = 0.64832

It is no surprise that the same result is found. fsolve() does not estimate standard deviations for the
parameters. One might use nlparci() to determine confidence intervals.

Solution by lsqcurvefit(), MATLAB and Octave


With the command lsqcurvefit() the method of nonlinear least squares can be used to fit a function to
data points. A solution for the above problem is given by

f3 = @(p,x_data) p(1) + p(2)*exp(p(3)*(x_data-p(4)))./(1+exp(p(3)*(x_data-p(4))));


[p,residual] = lsqcurvefit(f3,[p1,p2,p3,p4],x_data,y_data)
optimal_values = p’
residual = sqrt(residual)
-->
optimal_values = 45.93183 18.42866 0.83874 3.93279
residual = 0.64832

It is no surprise that the same result is found. lsqcurvefit() does not estimate standard deviations for
the parameters. The command lsqcurvefit() can return more results, e.g. the residual or the Jacobian
matrix with the partial derivatives with respect to the parameters.
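
A hedged sketch (MATLAB with the Statistics Toolbox; nlparci() is not part of core Octave) of how the Jacobian returned by lsqcurvefit() can be combined with nlparci() to obtain confidence intervals, reusing f3, the initial guesses and the data from above:

[p,resnorm,residual,~,~,~,J] = lsqcurvefit(f3,[p1,p2,p3,p4],x_data,y_data);
p_CI = nlparci(p,residual,'jacobian',J)   % 95% confidence intervals for the parameters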

3.5.9 Additional Commands from the Package optim in Octave


The package optim in Octave (see https://octave.sourceforge.io/optim/) provides additional commands for
linear and nonlinear regression problems.

• lsqlin: linear least square with linear constraints

• expfit: Prony’s method for non–linear exponential fitting

• polyfit: polynomial fitting, MATLAB and Octave (see the short sketch after this list)

• wpolyfit: polynomial fitting

• polyconf: confidence and prediction intervals for polynomial fitting, uses wpolyfit

• polyfitinf: polynomial fitting with the maximum norm
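
As a minimal illustration of polyfit(), a quadratic fit to the data of Section 3.5.4 (assumption: the column vectors x_d and y_d generated there are still available):

p = polyfit(x_d,y_d,2);     % coefficients, highest power first
y_fit = polyval(p,x_d);     % evaluate the fitted polynomial at the data points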


3.6 Resources
• Nonlinear Equations : based on my notes

• Eigenvalues and Eigenvectors


The bible of matrix computations is clearly [GoluVanLoan13] or one of the earlier editions. For
general matrix analysis [HornJohn90] could be useful.

• Numerical Integration

– Some information from [IsaaKell66, §7] was used.


– In [RalsRabi78] rigorous proofs for some of the error estimates are shown.
– Find more information on basic integration methods in [YounGreg72].

• Solving Ordinary Differential Equations, Initial Value Problems

– Detailed information on ODE solvers can be found in the book [Butc03] by John Butcher
([Butc16] is a newer edition) or in [HairNorsWann08], [HairNorsWann96].
– Use the article ode_suite.pdf ([ShamReic97]) by Shampine and Reichelt for excellent information
  on the MATLAB ODE solvers. The document is available at the Mathworks web site at
  https://www.mathworks.com/help/pdf_doc/otherdocs/ode_suite.pdf .
– Information on different Runge–Kutta methods is available on Wikipedia:
  https://en.wikipedia.org/wiki/List_of_Runge-Kutta_methods
  https://en.wikipedia.org/wiki/Runge-Kutta_methods
– Marc Compere provided codes that (should) work with MATLAB and Octave. They are available in a
  repository at https://gitlab.com/comperem/ode_solvers.
* rk2fixed.m, rk4fixed.m and rk8fixed.m are explicit Runge–Kutta methods with
fixed step size of order 2, 4 and 8 .
* ode23.m, ode45.m and ode78.m are explicit Runge–Kutta methods with adaptive step
sizes.

• Linear and Nonlinear Regression, Curve Fitting


Some of the information in these notes is taken from the notes [Octave07] for a class on MATLAB/Octave,
available at OctaveAtBFH.pdf or from a statistics class with the supporting notes [Stah16], available
at StatisticsWithMatlabOctave.pdf.

In the previous chapter the codes in Table 3.19 were used.

Bibliography
[AbraSteg] M. Abramowitz and I. A. Stegun. Handbook of Mathematical Functions. Dover, 1972.

[Agga20] C. Aggarwal. Linear Algebra and Optimization for Machine Learning. Springer, first edition,
2020.

[AmreWihl14] M. Amrein and T. Wihler. An adaptive Newton-method based on a dynamical systems


approach. Communications in Nonlinear Science and Numerical Simulation, 19(9):2958–2973, 2014.

[Axel94] O. Axelsson. Iterative Solution Methods. Cambridge University Press, 1994.

[Butc03] J. Butcher. Numerical Methods for Ordinary Differential Equations. John Wiley & Sons, Ltd,
second edition, 2003.


filename description
nonlinear equations
NewtonSolve.m function file to apply Newton’s method
exampleSystem.m first example of a system of equations, Example 3–8
Newton2D.m code to visualize Example 3–8
testBeam.m code to solve Example 3–13
Keller.m script file to solve Example 3–14
tridiag.m MATLAB function to solve tridiagonal systems
eigenvalues and eigenvectors
ReduceQuadraticForm.m function used in Example 3–27
numerical integration
simpson.m integration using Simpson’s rule
IntegrateGauss.m integration using the Gauss approach
ordinary differential equations, ODEs
ode RungeKutta Runge–Kutta method with fixed step size
ode Heun Heun method with fixed step size
ode Euler Euler method with fixed step size
Pendulum.m Pendulum demo for the algorithms with fixed step size
rk23.m an adaptive algorithm, based on the method of Heun
rk45.m an adaptive algorithm, based on the method of Runge–Kutta
Test rk23 rk45.m demo code for rk23 and rk45 in Example 3–66
linear and nonlinear regression
LinearRegression.m linear regression, Octave version
LinearRegression.m.Matlab linear regression, MATLAB version
ExampleLinReg.m code to generate Section 3.5.4
leasqr.m MATLAB version of the command leasqr()
dfdp.m MATLAB version of the support file for leasqr()

Table 3.19: Codes for chapter 3


[Butc16] J. Butcher. Numerical Methods for Ordinary Differential Equations. John Wiley & Sons, Ltd,
third edition, 2016.

[Deim84] K. Deimling. Nonlinear Functional Analysis. Springer Verlag, 1984.

[DeisFaisOng20] M. P. Deisenroth, A. A. Faisal, and C. S. Ong. Mathematics for Machine Learning.


Cambridge University Press, 2020. pre–publication.

[Demm97] J. W. Demmel. Applied Numerical Linear Algebra. SIAM, Philadelphia, 1997.

[DrapSmit98] N. Draper and H. Smith. Applied Regression Analysis. Wiley, third edition, 1998.

[GoluVanLoan96] G. Golub and C. Van Loan. Matrix Computations. Johns Hopkins University Press, third
edition, 1996.

[GoluVanLoan13] G. Golub and C. Van Loan. Matrix Computations. Johns Hopkins University Press,
fourth edition, 2013.

[HairNorsWann08] E. Hairer, S. Nørsett, and G. Wanner. Solving Ordinary Differential Equations I: Non-
stiff Problems. Springer Series in Computational Mathematics. Springer Berlin Heidelberg, second
edition, 1993. third printing 2008.

[HairNorsWann96] E. Hairer, S. Nørsett, and G. Wanner. Solving Ordinary Differential Equations II:
Stiff and Differential-Algebraic Problems. Lecture Notes in Economic and Mathematical Systems.
Springer, second edition, 1996.

[HornJohn90] R. Horn and C. Johnson. Matrix Analysis. Cambridge University Press, 1990.

[IsaaKell66] E. Isaacson and H. B. Keller. Analysis of Numerical Methods. John Wiley & Sons, 1966.
Republished by Dover in 1994.

[Kell92] H. B. Keller. Numerical Methods for Two–Point Boundary Value Problems. Dover, 1992.

[Linz79] P. Linz. Theoretical Numerical Analysis. John Wiley & Sons, 1979. Republished by Dover.

[MeybVach91] K. Meyberg and P. Vachenauer. Höhere Mathematik II. Springer, Berlin, 1991.

[MontRung03] D. Montgomery and G. Runger. Applied Statistics and Probability for Engineers. John
Wiley & Sons, third edition, 2003.

[DLMF15] NIST. NIST Digital Library of Mathematical Functions, 2015.

[Pres92] W. H. Press, B. P. Flannery, S. A. Teukolsky, and W. T. Vetterling. Numerical Recipes in C, The


Art of Scientific Computing. Cambridge University Press, second edition, 1992.

[RalsRabi78] A. Ralston and P. Rabinowitz. A First Course in Numerical Analysis. McGraw–Hill, second
edition, 1978. Republished by Dover in 2001.

[RawlPantuDick98] J. Rawlings, S. Pantula, and D. Dickey. Applied regression analysis. Springer texts in
statistics. Springer, New York, 2. ed edition, 1998.

[SchnWihl11] H. R. Schneebeli and T. Wihler. The Newton-Raphson Method and Adaptive ODE Solvers.
Fractals, 19(1):87–99, 2011.

[ShamReic97] L. Shampine and M. W. Reichelt. The MATLAB ODE Suite. SIAM Journal on Scientific
Computing, 18:1–22, 1997.

[Stah00] A. Stahel. Calculus of Variations and Finite Elements. supporting notes, 2000.


[Octave07] A. Stahel. Octave and Matlab for Engineers. lecture notes, 2007.

[Stah16] A. Stahel. Statistics with Matlab/Octave. supporting notes, BFH-TI, 2016.

[TongRoss08] P. Tong and J. Rossettos. Finite Element Method, Basic Technique and Implementation. MIT,
1977. Republished by Dover in 2008.

[YounGreg72] D. M. Young and R. T. Gregory. A Survey of Numerical Analysis, Volume 1. Dover Publica-
tions, New York, 1972.

Chapter 4

Finite Difference Methods

4.1 Prerequisites and Goals


In this chapter we examine one of the methods to replace differential equations by approximating systems
of difference equations. We replace the continuous equation by a discrete system of equations. These
equations are then solved, using the techniques from previous chapters. The goal is to find approximate
solutions that are close to the exact solution. One of the possible standard references is [Smit84]. A more
detailed presentation is given in [Thom95], where you can find state of the art techniques.

After having worked through this chapter

• you should understand the basic concept of a finite difference approximation and finite difference
stencils.

• should be familiar with the concepts of consistency, stability and convergence of a finite difference
approximation.

• should know about conditional and unconditional stability of solvers.

• should be able to set up and solve second order linear boundary value problems on intervals and
rectangles.

• should be able to set up and solve second order linear initial boundary value problems on intervals.

• should be able to set up and solve some nonlinear boundary value problems with the help of a finite
difference approximation.

In this chapter we assume that you are familiar with

• the basic idea and definition of a derivative.

• the concept of ordinary differential equations, in particular with ẏ(t) = −λ y(t) .

• the representation of a vector as a linear combination of eigenvectors.

4.2 Basic Concepts


4.2.1 Finite Difference Approximations of Derivatives
Instead of solving a differential equation replace the derivatives by approximate difference formulas, based
on the definition of a derivative
$$\frac{d}{dt}\, y(t) = \lim_{h\to 0} \frac{y(t+h) - y(t)}{h}\,,$$

254
CHAPTER 4. FINITE DIFFERENCE METHODS 255

by selecting a finite value of 0 < h  1, leading to


d y(t + h) − y(t)
y(t) ≈ .
dt h
Other approximations to the first derivative are possible, using similar ideas and computations. This leads
to the formulas and stencils in Figure 4.1 .

t−h t t+h
time axis -
'$ '$
y(t + h) − y(t) −1 +1
y 0 (t) ≈
h h h
&% &%
'$ '$
y(t) − y(t − h) −1 +1
y 0 (t) ≈
h h h
&% &%
'$ '$ '$
y(t + h) − y(t − h) −1 +1
y 0 (t) ≈ 0
2h 2h 2h
&% &% &%

Figure 4.1: FD stencils for y 0 (t), forward, backward and centered approximations

In Figure 4.2 use the values of the function at the grid points t − h, t and t + h to find formulas for the
first and second order derivatives. The second derivative is examined as derivative of the derivative. The
above observations hint towards the following approximate formulas.
d y(t + h) − y(t) y(t) − y(t − h) y(t + h) − y(t − h)
y(t) ≈ ≈ ≈
dt h h 2h
d y(t + h/2) − y(t − h/2)
y(t) ≈
dt h
d2 y (t + h/2) − y 0 (t − h/2)
0
 
1 y(t + h) − y(t) y(t) − y(t − h)
y(t) ≈ ≈ −
dt2 h h h h
y(t − h) − 2 y(t) + y(t + h)
=
h2

The quality of the above approximations is determined by the error. For smaller values of h > 0 the
error should be as small as possible. To determine the size of this error use the Taylor approximation
y 00 (t) 2 y 000 (t) 3 y (4) (t) 4
y(t + x) = y(t) + y 0 (t) · x + x + x + x + O(x5 )
2 3! 4!
with different values for x (use x = ±h with |h|  1) and verify that
y 00 (t) 2 y 000 (t) 3
y(t + h) = y(t) + y 0 (t) · h + h + h + O(h4 )
2 3!
y 00 (t) 2 y 000 (t) 3
y(t − h) = y(t) − y 0 (t) · h + h − h + O(h4 )
2 3!
y 000 (t) 3
y(t + h) − y(t − h) = 2 y 0 (t) · h + 2 h + O(h4 )
3!
y(t + h) − y(t − h) y 000 (t) 2
− y 0 (t) = h + O(h3 ) = O(h2 )
2h 3!

SHA 21-5-21
CHAPTER 4. FINITE DIFFERENCE METHODS 256

t−h t t+h

Figure 4.2: Finite difference approximations of derivatives

and thus conclude1


y(t + h) − y(t − h)
y 0 (t) = + O(h2 ) .
2h
With computations very similar to the above find the finite difference approximations for the second
order derivative by
y(t − h) − 2 y(t) + y(t + h) 2
y 00 (t) = 2
+ y (4) (t) h2 + O(h3 ) . (4.1)
h 4!
Similarly find the formulas for other derivatives in Table 4.1 . This table also indicates that the error of the
centered difference formula is smaller than for the forward or backward formulas. These finite difference
approximations are often visualized with the help of stencils, as shown in Figure 4.1 .
y(t+h)−y(t)
forward difference y 0 (t) = h + O(h)
0 y(t)−y(t−h)
backward difference y (t) = h + O(h)
0 y(t+h/2)−y(t−h/2)
centered difference y (t) = h + O(h2 )
y 00 (t) = y(t−h)−2 hy(t)+y(t+h)
2 + O(h2 )
y 000 (t) = −y(t−h)+3 y(t)−3h3
y(t+h)+y(t+2 h)
+ O(h)
−y(t−3 h/2)+3 y(t−h/2)−3 y(t+h/2)+y(t+3 h/2)
y 000 (t) = h3
+ O(h2 )
y 000 (t) = −y(t−2 h)+2 y(t−h)−2
2 h3
y(t+h)+y(t+2 h)
+ O(h2 )
y (4) (t) = y(t−2 h)−4 y(t−h)+6 hy(t)−4
4
y(t+h)+y(t+2 h)
+ O(h2 )

Table 4.1: Finite difference approximations

With the above finite difference stencils replace derivatives by approximate finite differences, accepting
a discretization error. As h converges to 0 we expect this error to approach 0. But in most cases small values
of h will lead to larger arithmetic errors when performing the operations and this contribution will get larger
as h approaches 0. For the total error the two contributions have to be added. This basic rule

total error = discretization error + arithmetic error

is illustrated in Figure 4.3. As a consequence do not expect to get arbitrary close to an error of 0. In this
chapter only the discretization errors are carefully examined, assuming that the arithmetic error is negligible.
This does not imply that rounding errors can safely be ignored, as illustrated in an exercise.
1
Use the notation f (h) = O(hn ) to indicate |f (h)| ≤ C hn for some constant C. This indiates that the expression f (h) is of
order hn or less.

SHA 21-5-21
CHAPTER 4. FINITE DIFFERENCE METHODS 257

error
total error
discretization error

arithmetic error

stepsize h

Figure 4.3: Discretization and arithmetic error contributions

4.2.2 Finite Difference Stencils


Based on the above finite difference approximations define finite difference stencils for differential opera-
tors.

Finite difference stencil for a steady state problem


In Chapter 2.7.1 find the differential operator

∂2 u ∂2 u
−∆u = − −
∂x2 ∂y 2
for most 2–dimensional steady state problems. Based on Table 4.1 obtain a simple finite difference approx-
imation
−u(x − h, y) + 2 u(x, y) − u(x + h, y) −u(x, y − h) + 2 u(x, y) − u(x, y + h)
−∆u(x, y) = + +O(h2 ) .
h2 h2
For the rectangular grid in Figure 4.4 set

ui,j = u(xj , yi ) = u(j h , i h)

and then find


4 ui,j − ui−1,j − ui+1,j − ui,j−1 − ui,j+1
(−∆u)i,j ≈
h2

Finite difference stencil for a dynamic heat problem


When trying to discretize the dynamic heat equation dt d
u(t, x) − u00 (t, x) = f (t, x) use the notation ui,j =
d
u(ti , xj ) = u(i ht , j hx ) and the forward difference approximation for dt u.

d ui+1,j − ui,j
u(ti , xj ) = + O(ht )
dt ht
ui,j−1 − 2 ui,j + ui,j+1
u00 (ti , xj ) = + O(h2x )
h2x
 
d 00 1 2 1 1 1
( u − u )i,j ≈ − 2 ui,j−1 + 2
− ui,j − 2 ui,j+1 + ui+1,j
dt hx hx ht hx ht

SHA 21-5-21
CHAPTER 4. FINITE DIFFERENCE METHODS 258


−1/h2
y, i
6 


−1/h2 4/h2 −1/h2
s

y s s s
s 
- x, j −1/h2
x

Figure 4.4: Finite difference stencil for −uxx − uyy if h = hx = hy
'$
t, i
6 1/ht
&%

s '$
# '$
t s s s
−1/h2x 2/h2x − 1/ht −1/h2x
- x, j "
&% !&%
x
Figure 4.5: Finite difference stencil for ut − uxx , explicit, forward

This leads to the stencil in Figure 4.5, the explicit finite difference stencil for the heat equation.
d
If the backward difference approximation is used for the time derivative dt u find the implicit finite
difference stencil in Figure 4.6.

4.3 Consistency, Stability and Convergence


In this section we first examine a finite difference approximation for an elementary ordinary differential
equation. The results and consequences will apply to considerably more difficult problems.

4.3.1 A Finite Difference Approximation of an Initial Value Problem


Consider the ordinary differential equation for λ > 0.
d
y(t) = −λ y(t) with y(0) = y0 .
dt

'$
# '$
t, i
6 −1/h2x 2/h2x + 1/ht −1/h2x
"
&% !&%

t s s s '$
s
−1/ht
- x, j &%
x
Figure 4.6: Finite difference stencil for ut − uxx , implicit, backward

SHA 21-5-21
CHAPTER 4. FINITE DIFFERENCE METHODS 259

The exact solution is given by y(t) = y0 e−λ t . Obviously the solution is bounded on any interval on R+
and we expect its numerical approximation to remain bounded too, independent on the final time T .
d
To visualize the context consider a similar problem dt y(t) + λ y(t) = f (t) for a given function f (t).
This problem can be discretized with stepsize h at the grid points ti = i h for 0 ≤ i ≤ N − 1. This will lead
to an approximation on the interval [0 , T ] = [0 , (N − 1) h] . The unknown function y(t) in the interval
t ∈ [0 , T ] is replaced by a vector ~y ∈ RN and the function f (t) is replaced by a vector f~ ∈ RN .

differential operation
y ∈ C 1 ([0, T ] , R) - f ∈ C 0 ([0, T ] , R)

P P
? finite difference operation ?
~y ∈ RN - f~ ∈ RN

Figure 4.7: A finite difference approximation of an initial value problem

4.3.2 Explicit Method, Conditional Stability


d
When using the forward difference method in Table 4.1 to solve dt y(t) + λ y(t) = 0 with yi = y(t) and
yi+1 = y(t + h) find
d yi+1 − yi
y(t) + λ y(t) ≈ + λ yi = 0
dt h
and the difference will converge to 0 as the stepsize h approaches 0 . This will be called consistency of the
finite difference approximation. The differential equation is replaced by the difference equation
yi+1 − yi
= −λ yi
h
yi+1 = yi − h λ yi = (1 − h λ) yi .
In Result 3–56 (page 197) it is verified the solution of this difference equation satisfies limi→∞ yi = 0 if
and only if |1 − h λ| < 1, Since λ and h are positive this leads to the condition
2
hλ < 2 ⇐⇒ h< .
λ
This is an example of conditional stability, i.e. the schema is only stable if the above condition on the
stepsize h is satisfied.

4.3.3 Implicit Method, Unconditional Stability


We may also use the backward difference method in Table 4.1
d yi − yi−1
y(t) + λ y(t) ≈ + λ yi = 0
dt h
and the difference will converge to 0 as the stepsize h approaches 0 . Thus this scheme is also consistent.
The differential equation is replaced by
yi − yi−1
= −λ yi =⇒ (1 + h λ) yi = yi−1
h
One can verify that this difference equation is solved by
1
yi = y0 .
(1 + h λ)i
In Result 3–57 (page 198) it is verified the solution of this difference equation satisfies limi→∞ yi = 0 for
λ > 0 and we have unconditional stability.

SHA 21-5-21
CHAPTER 4. FINITE DIFFERENCE METHODS 260

4.3.4 General Difference Approximations, Consistency, Stability and Convergence


To explain the approximation behavior of finite difference schemes use the example problem

−u00 (x) = f (x) for 0 < x < L with boundary conditions u(0) = u(L) = 0 . (4.2)

Assume that for a given function f (x) the exact solution is given by the function u(x). The differential
L
equation is replaced by a difference equation. For n ∈ N discretize the interval by xk = k · h = k n+1 and
then consider an approximate solution uk ≈ u(k · h) for k = 0, 1, 2, . . . , n, n + 1. The finite difference
approximation of the second derivative in Table 4.1 leads for interior points to
uk−1 − 2 uk + uk+1
− = fk = f (k · h) for k = 1, 2, 3, . . . , n . (4.3)
h2
The boundary conditions are taken into account by u0 = un+1 = 0 . These linear equations can be written
in the form      
2 −1 u1 f1
     
 −1 2 −1   u2   f2 
     
−1 2 −1 u3   f3
     
1   
· =

.
. . . ..   ..

2
h   .. .. ..   
  .   . 
     

 −1 2 −1  
  un−1   f
  n−1


−1 2 un fn
The solution of this linear system will create the values of the approximate solution at the grid points. Exact
and approximate solution are shown in Figure 4.8. As h → 0 we hope that u will converge to the exact
solution u(x) .

2.5

2
solution u(x)

1.5

0.5

0
0 0.2 0.4 0.6 0.8 1
position x

Figure 4.8: Exact and approximate solution of a boundary value problem

To examine the behavior of the approximate solution use a general framework for finite difference ap-
proximations to boundary value problems. Examine Figure 4.9 to observe how a differential equation is
replaced by an approximate system of linear equations. The similar approach for general problems is shown
in Figure 4.10.
Consider functions defined on a domain Ω ⊂ RN and for a fixed mesh size h cover the domain with a
discrete set of points xk ∈ Ω. This leads to the following vector spaces:

• E1 is a space of functions defined on Ω. In the above example consider u ∈ C 2 ([0, L] , R) with


u(0) = u(L) = 0. On this space use the norm kukE1 = max{|u(x)| : 0 ≤ x ≤ L}.

SHA 21-5-21
CHAPTER 4. FINITE DIFFERENCE METHODS 261

∂ 2
− ∂x2
u ∈ C 2 ([0, L] , R) - f ∈ C 0 ([0, L] , R)

Ph Ph
Ah ·
f~h ∈ RN
? ?
~uh ∈ RN -

Figure 4.9: An approximation scheme for −u00 (x) = f (x)

• E2 is a space of functions defined on Ω. In the above example consider f ∈ C 0 ([0, L] , R) with the
norm kf kE2 = max{|f (x)| : 0 ≤ x ≤ L}.

• E1h is a space of discretized functions. In the above example consider ~u ∈ Rn = E1h , where uk =
u(k · h). The vector space E1h is equipped with the norm k~ukE h = max{|uk | : 1 ≤ k ≤ n}.
1

• E2h is also a space of discretized functions. In the above example consider f~ ∈ Rn = E2h , where
fk = f (k · h). The vector space E2h is equipped with the norm kf~kE h = max{|fk | : 1 ≤ k ≤ n}.
2

On these spaces we examine the following linear operations:

• For u ∈ E1 let F : E1 → E2 be the linear differential operator. In the above example F(u) = −u00 .

• For ~u ∈ E1h let Fh : E1h → E2h be the linear difference operator. In the above example

uk−1 − 2 uk + uk+1
Fh (~u)k =
h2

• For u ∈ E1 let ~u = P1h (u) ∈ E1h be the projection of the function u ∈ E1 onto E1h . It is determined
by evaluation the function at the points xk .

• For f ∈ E2 let f~ = P2h (f ) ∈ E2h be the projection of the function f ∈ E2 onto E2h . It is determined
by evaluation the function at the points xk .

The above operations are illustrated in Figure 4.10. There is a recent article [KhanKhan18] on the impor-
tance of this structure of the for fundamental spaces of numerical analysis.

F
u ∈ E1 - f ∈ E2 h −→ 0
kP1h ukE h −→ kukE1
P1h P2h 1

kP2h f kE h −→ kf kE2
? Fh ? 2
uh ∈ E1h - fh ∈ E h
2 P2h (F(u)) ≈ Fh (P1h (u))

Figure 4.10: A general approximation scheme for boundary value problems

4–1 Definition : For a given f ∈ E2 let u ∈ E1 be the solution of F(u) = f and ~uh the solution of
Fh (~uh ) = P2h (f ).

• The above approximation scheme is said to be convergent of order p if

kP1h (u) − ~uh kE h ≤ c1 hp ,


1

where the constant c1 is independent on h, but it may depend on u.

SHA 21-5-21
CHAPTER 4. FINITE DIFFERENCE METHODS 262

• The above approximation scheme is said to be consistent of order p if


kFh (P1h (u)) − P2h (F(u))kE h ≤ c2 hp ,
2

where the constant c2 is independent on h, but it may depend on u. This implies that the diagram in
Figure 4.10 is almost commutative as h approaches 0 .
• The above approximation scheme is said to be stable if the linear operator Fh ∈ L(E1h , E2h ) is
invertible and there exists a constant M , independent on h, such that
kuh kE h ≤ M kFh (uh )kE h for all uh ∈ E1h
1 2

This is equivalent to kF−1


h k ≤ M , i.e. the inverse linear operators of the approximate problems are
uniformly bounded.

For the above example the stability condition reads as k~uk ≤ M kAh f~k or equivalently kA−1 ~
h fk ≤
M k~uk, independent on h. This is thus the condition on the matrix norm of the inverse matrix to be inde-
pendent on h, i.e. kA−1
h k ≤ M.
Now state a fundamental result for finite difference approximations to differential equations. The theo-
rem is also known as Lax equivalence theorem2 . The result applies to a large variety of problems. We will
examine only a few of them.

4–2 Theorem : If a finite difference scheme is consistent of order p and stable, then it is convergent
of order p. A short formulation is:

consistency and stability imply convergence


3

Proof : Let u be the solution of F(u) = f and ~u the solution of Fh (~u) = P2h (f ) = P2h (F(u)). Since the
scheme is stable and consistent of order p we find
 
kP1h (u) − ~ukE h = kF−1 h F (P
h 1
h
(u) − ~
u ) kE h
1 1

≤ kF−1 h
h k kFh (P1 (u)) − Fh (~
u)kE h
2

≤ M kFh (P1h (u)) − P2h (F(u))kE h


2
≤ M c hp .
Thus the finite difference approximation scheme is convergent. 2

Table 4.2 illustrates the abstract concept using the example equation (4.2).

4–3 Result : To verify convergence of the solution of the finite difference of approximation of equation (4.2)
to the exact solution we have to assure that the scheme is consistent and stable. Use the finite difference
approximation
−uk−1 + 2 uk − uk+1
−u00 (x) = f (x) −→ = fk .
h2
• Consistency: According to equation (4.1) or Table 4.1 (page 256) the scheme is consistent of order 2.
• Stability: Let ~u be the solution of the equation (4.3) with right hand side f~. Then
L2 L2 ~
k~uk∞ = max {|uk |} ≤ max {|fk |} = kf k∞ independent on h . (4.4)
1≤k≤n 2 1≤k≤n 2
3
2
We only use the result that a consistent and stable scheme has to be convergent. Lax also showed that a consistent and
convergent scheme has to be stable. Find a proof in [AtkiHan09].

SHA 21-5-21
CHAPTER 4. FINITE DIFFERENCE METHODS 263

general problem sample problem (4.2)


exact equation F(u) = f −u00 (x) = f (x)
approximate left hand side P1h (u) ∈ E1h uk = u(k · h)
approximate right hand side P2h (f ) ∈ E2h fk = f (k · h)
−uk−1 +2 uk −uk+1
difference expression Fh (~u) h2
−uk−1 +2 uk −uk+1
approximate equation Fh (~u) = P2h (f ) h2
= f (k · h)
stability kuh kE h ≤ M kFh (uh )kE h max{|uk |} ≤ M max{|fk |}
1 2

convergence, as h → 0 kP1h (u) − ~ukE h → 0 max{|u(k · h) − uk |} → 0


1

Table 4.2: Exact and approximate boundary value problem

Proof : The proof of stability of this finite difference scheme is based on a discrete maximum principle3 .
Proceed in two stages.

• As a first step verify a discrete maximum principle. If fk ≤ 0 for k = 0, 1, 2, . . . , n, (n + 1) and


−uk−1 + 2 uk − uk+1
= fk = f (k · h) ≤ 0 for k = 1, 2, 3, . . . , n
h2
then
max {uk } = max{u0 , un+1 } .
0≤k≤n+1

For the continuous case this corresponds to functions with u00 (x) ≥ 0 attaining the largest value on
the boundary.
To verify the discrete statement assume that max1≤k≤n {uk } = ui for some index 1 ≤ i ≤ n. Then

−ui−1 + 2 ui − ui+1 = +h2 fi ≤ 0


1 1
ui = (ui−1 + ui+1 ) + h2 fi ≤ ui + 0 .
2 2
Thus find ui−1 = ui = ui+1 and fi = 0. The process can be repeated with indices i − 1 and i + 1 to
finally obtain the desired estimate. The computations also imply that ~u = ~0 is the only solution of the
homogeneous problem, i.e. the square matrix has a trivial kernel. Using linear algebra this implies
that the matrix representing Fh is invertible.
 2
• Use the vector ~v ∈ Rn defined by vk = (k h)2 = n+1 kL
. The vector corresponds to the discretiza-
tion of the function v(x) = x2 . Verify that
−vk−1 + 2 vk − vk+1
= −2 for k = 1, 2, 3, . . . , n .
h2

Let C = kf~k∞ = max{|fk | : 1 ≤ k ≤ n} and fk+ = fk − C ≤ 0. Then w ~ + = ~u + C


2 ~v is the
~ + ) = f~+ and based on the first part of the proof and u0 = un+1 = 0 find
solution of Fh (w

C C C 2
max {wk+ } = max {uk + vk } ≤ max{v0 , vn+1 } = L .
1≤k≤n 1≤k≤n 2 2 2
C
Since vk ≥ 0, this implies uk ≤ 2 L2 .
3
Readers familiar with partial differential equations will recognize the maximum principle and the construction of sub- and
super–solutions to obtain à priori bounds.

SHA 21-5-21
CHAPTER 4. FINITE DIFFERENCE METHODS 264

A similar argument with fk− = fk + C ≥ 0 and w


~ − = ~u − C
2 ~v implies
C C
min {uk − v k } ≥ − L2 .
1≤k≤n 2 2
These two inequalities imply
C L2 C L2
− ≤ uk ≤ for k = 1, 2, 3, . . . , n
2 2
and thus the stability estimate (4.4).
2
In this section some basic concepts were introduced and illustrated using one sample application. The
above proof for stability of finite difference approximations to elliptic boundary value problems can be
applied to two or higher dimensional problems, e.g. [Smit84, p. 255]. Further information can be found in
many books on numerical methods to solve PDE’s and also in [IsaaKell66, §9.5] and [Wlok82].

4.4 Boundary Value Problems


In a first section examine differential equations defined on an interval, then solve partial differential equa-
tions on rectangles in R2 .

4.4.1 Two Point Boundary Value Problems


4–4 Example : An elementary example
To examine the boundary value problem
−u00 (x) = 2 − x on 0 < x < 2 with u(0) = u(2) = 0
use 4 internal points x1 = 0.4, x2 = 0.8, x3 = 1.2 and x4 = 1.6. With ui = u(xi ) the finite difference
approximation is
−u(xi−1 ) + 2 u(xi ) − u(xi+1 ) −ui−1 + 2 ui − ui+1
−u00 (xi ) ≈ 2
= ,
h h2
2
where h = 5 = 0.4 and u(x0 ) = u0 = u(x5 ) = u5 = 0. Thus find four equations for 4 unknowns.
1
(−0 + 2 u1 − u2 ) = 2 − x1 = 1.6
0.42
1
(−u1 + 2 u2 − u3 ) = 2 − x2 = 1.2
0.42
1
(−u2 + 2 u3 − u4 ) = 2 − x3 = 0.8
0.42
1
(−u3 + 2 u4 − 0) = 2 − x4 = 0.4
0.42
With a matrix notation this leads to
     
+2 −1 0 0 u 1.6
   1   
1  −1 +2 −1 0 
  u  
 2   1.2 
= .
 
0.42  0 −1 +2 −1 
  
 u3   0.8 
     
0 0 −1 +2 u4 0.4
This system can now be solved using an algorithm from chapter 2. Since h = 0.4 is not small we only obtain
a very crude approximation of the exact solution. To obtain better approximations choose h small, leading
to larger systems of equations. Since the approximation is consistent and stable, we have convergence, i.e.
the deviation from the exact solution can be made as small as we wish, by choosing h small enough4 . ♦
4
Ignoring possible arithmetic error contributions, i.e. Figure 4.3 on page 257.

SHA 21-5-21
CHAPTER 4. FINITE DIFFERENCE METHODS 265

4–5 Example : A nonlinear boundary value problem


For the nonlinear problem

−u00 (x) = x + cos(u(x)) on 0 < x < 2 with u(0) = u(2) = 0

one can apply the above procedures again to obtain a nonlinear system of equations.
    
+2 −1 0 0 u1 0.4 + cos(u1 )
    
1   −1 +2 −1 0   u2   0.8 + cos(u2 ) 
   
=
0.42  0 −1 +2 −1   u3   1.2 + cos(u3 ) 
  
    
0 0 −1 +2 u4 1.6 + cos(u4 )

To solve this nonlinear system use methods from chapter 3.1, i.e. partial substitution or Newton’s method.
Using obvious notations denote the above system by

A ~u = ~x + cos(~u) .

• To use the method of partial substitution choose a starting vector ~u0 , e.g. ~u0 = (0, 0, 0, 0)T . Then use
the iteration scheme

A ~uk+1 = ~x + cos(~uk )
~uk+1 = A−1 (~x + cos(~uk ))

and start to iterate and hope for convergence.


~ ≈ cos(~u) − sin(~u) · φ
• To use Newton’s method build on the linearization cos(~u + φ) ~ . Then examine

A (~uk + φ) ~ = ~x + cos(~uk ) − sin(~uk ) · φ


~
~ = −A ~uk + ~x + cos(~uk ) .
(A + diag(sin(~uk ))) · φ
~ The matrix A + diag(sin(~uk )) on the
The last expression is a system of equations for the vector φ.
left hand side has to be modified by adding the values of sin(~uk ) along the diagonal. Thus for each
~ update ~uk+1 = ~uk + φ
iteration a new system of linear equations has to be solved. With the solution φ ~
and restart the iteration.

4–6 Example : Stretching of a beam, with fixed endpoints


In Section 1.3 we found that the boundary value problem in equation (1.13) (see page 16)
 
d d u(x)
− EA = f (x) for 0 < x < L with u(0) = 0 and u(L) = uM
dx dx

corresponds to the stretching of a beam. We consider, at first, constant cross sections only and we will work
with the constant EA . The interval [0 , L] is divided into N + 1 subintervals of equal length h = NL+1 .
Using the notations

xi = i · h , ui = u(xi ) and fi = f (xi ) for 0 ≤ i ≤ N + 1

and the finite difference formula for u00 in Table 4.1 we replace the differential equation at all interior points
by the difference equation
EA
− (ui−1 − 2 ui + ui+1 ) = fi for 1 ≤ i ≤ N
h2

SHA 21-5-21
CHAPTER 4. FINITE DIFFERENCE METHODS 266

for the unknowns ui . The boundary conditions lead to u0 = 0 and uN +1 = uM . Using a matrix notation
we find a linear system of equations.
       
2 −1 u1 f1 0
       
 −1 2 −1   u2   f2   0 
       
−1 2 −1   u3  f3 0 
       
   
    h2    
 −1 2 −1   u
· 4 =
  f
4 +
  0 
 EA
  
  .  . .. 
. . .
  

 .. .. ..   .
  .


 .
 .
 
  . 

       

 −1 2 −1    uN −1
 

 f
 N −1
 
  0 
−1 2 uN fN uM

This system can be written in the form


A · ~u = ~g
with appropriate definition for the matrix A and the vectors ~u and ~g .
Observe that the N × N matrix A is symmetric, positive definite and tridiagonal. Thus this system of
equations can be solved very quickly, even for large values of N . First choose a specific example

EA = 1 , L = 3 , uM = 0.2 , f (x) = sin(x/2) with N = 20

and set the corresponding variables in Octave. Then set up the matrix A, solve the system and plot the
solution, leading to Figure 4.11(a).
BeamStretch.m
EA = 1.0; L = 3; uM = 0.2; N = 20;
fRHS = @(x) sin(0.5*x); % define the function for the RHS

h = L/(N+1); % stepsize
x = (h:h:L-h)’; f = fRHS(x);
g = hˆ2/EA*f; g(N) = g(N)+uM;

%% build the tridiagonal, symmetric matrix


di = 2*ones(N,1); % diagonal
up = -ones(N-1,1); % upper and lower diagonal

u = trisolve(di,up,g); % use the special solver


figure(1);
plot([0;x;L],[0;u;uM]) % plot the displacement
xlabel(’distance’); ylabel(’displacement’); grid on

The force on the beam at position x is given by

d u(x)
F (x) = EA .
dx
This can be approximated by a centered difference formula

h ui+1 − ui
F (xi + ) ≈ EA .
2 h
Thus plot the force F , as seen in Figure 4.11(b). This graph shows that the left part of the beam is stretched
(u0 > 0), while the right part is compressed (u0 < 0).

SHA 21-5-21
CHAPTER 4. FINITE DIFFERENCE METHODS 267

1 1

0.8 0.5
displacement

0.6 0

force
0.4 -0.5

0.2 -1

0 -1.5
0 0.5 1 1.5 2 2.5 3 0 0.5 1 1.5 2 2.5 3
distance distance
(a) displacment (b) force

Figure 4.11: Stretching of a beam, displacement and force

du = diff([0;u;uM])/h;
plot([0;x]+h/2,EA*du); grid on

• The above example also contains the solution to the steady state heat equation (1.2) on page 10.
• The above example also contains the solution to the steady state of the vertical deformation of a
horizontal string, equation (1.7) on page 13.

4–7 Example : Stretching of a beam by a given force


According to Section 1.3 a known force F at the right endpoint is described by the boundary condition
d u(L)
=F. EA
dx
This new boundary condition replaces the old condition u(L) = uM . To hande this case introduce two new
unknowns uN +1 = u(L) and uN +2 = u(L + h). Using a centered difference approximation find5
d u(L)
uN +2 − uN
+ O(h2 )
=
2h dx
uN +2 − uN
F
=
EA 2h
F
uN +2 = uN + 2 h
EA
and using the differential equation at the boundary point x = L we find
−uN + 2 uN +1 − uN +2 1
= fN +1
h2 EA
F h2
−uN + 2 uN +1 − (uN + 2 h ) = fN +1
EA EA
h2 fN +1 F
−uN + uN +1 = +h .
EA 2 EA
5
A simpler approach uses
uN +1 − uN F
u0 (L) ≈ =
h EA
and thus −uN + uN +1 = EA hF
. The approximation error is of order h. For the above approach find an error of h2 . The additional
f
accuracy is well worth the small additional coding. The only change in the final result is a missing N2+1 in the last component of
the right hand side vector.

SHA 21-5-21
CHAPTER 4. FINITE DIFFERENCE METHODS 268

This additional equation can be added to the previous system of equations, leading to a system of N + 1
linear equations.
       
2 −1 u1 f1 0
       
 −1 2 −1   u   f   0 
   2   2   
       

 −1 2 −1   u3
 


 f3

 
  0 
−1 2 −1   u4 2  f4 0 
       
= h
   
· +
.. .. ..   ..  .. .. 
  
 EA
. . .
  
   .   .   . 
       

 −1 2 −1   u
  N −1


 f
 N −1
 
  0 
       

 −1 2 −1    uN
 

 fN

 
  0 
fN +1 hF
−1 +1 uN +1 2 EA

This matrix is again symmetric, positive definite and tridiagonal. For the simple case f (x) = 0 the exact
F
solution u(x) = EA x is known. This is confirmed by the Octave/MATLAB computations below and the
resulting straight line in Figure 4.12.

EA = 1.0; L = 3; F = 0.2; N = 20;

h = L/(N+1); % stepsize

x = (h:h:L)’;
f = zeros(size(x)); % f(x) = 0
g = hˆ2/EA*f; g(N+1) = g(N+1)/2+h*F/EA;

%% build the tridiagonal, symmetric matrix


di = 2*ones(N+1,1); di(N+1) = 1; % diagonal
up = -ones(N,1); % upper and lower diagonal

u = trisolve(di,up,g);
plot([0;x],[0;u])


4–8 Example : Stretching of a beam by a given force and variable cross section
If the cross section A in the previous example 4–7 is not constant, the above algorithm has to be modified.
The differential equation
 
d d u(x)
− EA(x) = f (x) for 0 < x < L
dx dx
now uses a variable coefficient a(x) = EA(x). The boundary conditions remain u(0) = 0 and EA(L) d u(L)dx =
0 0 1
F . To determine the derivative of g(x) = a(x) u (x) use the centered difference formula g (x) = h (g(x +
h h 2
2 ) − g(x − 2 )) + O(h ) and the approximations
u(x) − u(x − h)
u0 (x − h/2) = + O(h2 )
h
u(x + h) − u(x)
u0 (x + h/2) = + O(h2 )
h
d 1
a(x) · u0 (x) = a(x + h/2) · u0 (x + h/2) − a(x − h/2) · u0 (x − h/2) + O(h2 )
 
dx h  
1 u(x + h) − u(x) u(x) − u(x − h)
≈ a(x + h/2) − a(x − h/2)
h h h
 
1 h h h h
= a(x − ) u(x − h) − (a(x − ) + a(x + )) u(x) + a(x + ) u(x + h) .
h2 2 2 2 2

SHA 21-5-21
CHAPTER 4. FINITE DIFFERENCE METHODS 269

1
EA constant
EA variable
0.8
displacement
0.6

0.4

0.2

0
0 0.5 1 1.5 2 2.5 3
distance

Figure 4.12: Stretching of a beam with constant and variable cross section

One can verify that the error of this finite difference approximation is of the order h2 . Observe that the
values of the coefficient function a(x) = EA(x) are used at the midpoints of the intervals of length h. For
0 ≤ i ≤ N set ai = a(i h + h2 ) to find the difference scheme

−ai−1 ui−1 + (ai−1 + ai ) ui − ai ui+1 = h2 fi for 1 ≤ i ≤ N

To take the boundary condition EA(L) u0 (L) = a(L) u0 (L) = F into account proceed just as in Example 4–
7.
d u(L) 1
F = a(L) ≈ (a(L − h/2) u0 (L − h/2) + a(L + h/2) u0 (L + h/2)) + O(h2 )
dx 2
aN (−uN + uN +1 ) aN +1 (−uN +1 + uN +2 )
≈ + + O(h2 )
2h 2h
aN +1 uN +2 = +aN uN − (aN − aN +1 ) uN +1 + 2 h F + O(h3 )

Using this information for the finite difference approximation of the differential equation at x = L this leads
to

−aN uN + (aN + aN +1 ) uN +1 − aN +1 uN +2 = h2 fN +1 + O(h4 )


−2 aN uN + (2 aN + 0 aN +1 ) uN +1 = h2 fN +1 + 2 h F + O(h4 ) + O(h3 )
fN +1
−aN uN + aN uN +1 = h2 + h F + O(h4 ) + O(h3 ) .
2
Thus use the approximation
−aN uN + aN uN +1 fN +1 1
2
= + F. (4.5)
h 2 h
which is consistent of order h. The elementary approach

u(L) − u(L − h) uN +1 − uN
a(L) u0 (L) ≈ a(L − h/2) = aN =F
h h
f
would generate a similar equation, without the contribution N2+1 , which is consistent of order h too. A more
detailed analysis shows that the approach in (4.5) is consistent of order h2 for constant functions a(x). Thus

SHA 21-5-21
CHAPTER 4. FINITE DIFFERENCE METHODS 270

equation (4.5) should be used. With this additional equation arrive at a system of N + 1 linear equations.
   
a0 + a1 −a1 u1
   
 −a1 a 1 + a 2 −a 2
  u2 
   
−a2 a2 + a3 −a3   u3 
   

   

 −a3 a3 + a4 −a4   u
· 4 =

  . 
.. .. ..

. . .   .. 
   

   

 −a N −1 a N −1 + a N −a N
  u
  N


−aN aN uN +1
   
f1 0
   
 f2   0 
   
 f3   0 
   
   
= h2  f4  +  0 
   
 .   . 
 .   . 
 .   . 
   
 f
 N   0 
  
fN +1
2 hF

This is again a linear system of the form


A · ~u = ~g ,
where the matrix A is symmetric, positive definite and tridiagonal.
Reconsider the previous example, but with a thinner cross section A(x) in the middle section of the
beam. Use
1  π x
a(x) = EA(x) = 2 − sin .
2 L
First define all expressions in Octave and then construct and solve the tridiagonal system of equations. In
the code below the sparse, tridiagonal matrix is with the command spdiags() and then the linear system
solved with the usual backslash operator.
BeamStretchVariable.m
L = 3; F = 0.2; N = 20;
fRHS = @(x)zeros(size(x)) % no external forces along the beam
EA = @(x)(2-sin(x/L*pi))/2;

h = L/(N+1); x = (h:h:L)’; f = fRHS(x);


g = hˆ2*f; g(N+1) = g(N+1)/2+h*F;

%% build the tridiagonal, symmetric matrix


di = [EA(x-h/2)+EA(x+h/2)]; % diagonal
di(N+1) = EA(L-h/2); % last entry modified
up = -EA([3*h/2:h:L-h/2]’); % upper and lower diagonal
Mat = spdiags([[up;0],di,[0;up]],[-1 0 1],N+1,N+1); % build the sparse matrix
u = Mat\g; % solve the linear system

plot([0;xB],[0;uB],[0;x],[0;u]) % xB, uB from previous computations


legend(’EA constant’,’EA variable’,’location’,’northwest’)
xlabel(’distance’); ylabel(’displacement’)

The result in Figure 4.12 (page 269) confirms the fact that the thinner beam is weaker, i.e. it will stretch
more than the beam with constant, larger cross section. ♦

SHA 21-5-21
CHAPTER 4. FINITE DIFFERENCE METHODS 271

4.4.2 Boundary Values Problems on a Rectangle


To solve the heat equation on a unit square (equation (1.5) on page 12) one has to solve

−∆u(x, y) = f (x, y) for 0 ≤ x, y ≤ 1


u(x, y) = 0 for (x, y) on boundary
1
using the grid size h = n+1 and ui,j = u(j h , i h) and the finite difference stencil in Section 4.2.2 find the
equations
4 ui,j − ui−1,j − ui+1,j − ui,j−1 − ui,j+1
= fi,j .
h2
The corresponding grid (for n = 7) is shown in Figure 4.13.
y 6
u=0
1H
H
HHHHHHHHHHHHHHHHHHHHHHHHH
H
H H
H H
H H
H H
H H
H H
H H
H H
H H
H H
H H
H H
u = 0H
H
H
H
H
H
u=0
H H
H H
H H
H H
H H
H H
H
H u
t i,j
H
H
i H
H
H
H
H H
H H
H H
H H
H H
H H
H H
H H
H H
H H
H H
H H
H H
H H -
HHHHHHHHHHHHHHHHHHHHHHHHHH
x
0 j u=0 1

Figure 4.13: A finite difference grid for a steady state heat equation

The unknown values of ui,j have to be numbered, and there are different options. Here is one of them:
• First number the nodes in the lowest row with numbers 1 through n.
• Then number the nodes in the second row with numbers n + 1 through 2 n.
• Proceed through all the rows. The top right corner will obtain the number n2 .
The above finite difference approximation of the PDE then reads as
4 u(i−1) n+j − u(i−2) n+j − ui n+j − u(i−1) n+j−1 − u(i−1) n+j+1
= f(i−1) n+j .
h2
This approximation is consistent of order 2. Arguments very similar to Result 4–3 show that the scheme is
stable and thus it is convergent of order 2. Using the model matrix Ann from Section 2.3.2 (page 34) this
leads to a system of linear equations.
Ann ~u = f~
with a banded, symmetric, positive definite matrix Ann . The above is implemented in Octave to solve the
system of linear equations and generate the graphics. As an example solve the problem with the right hand
side f (x, y) = x2 .

SHA 21-5-21
CHAPTER 4. FINITE DIFFERENCE METHODS 272

Plate.m
%%%%% script file to solve the heat equation on a unit square
n = 7;
f = @(x,y) x.ˆ2; % describe the heating contribution
%%%%%%%%%%%%%%%%%%%%%%%%%% no modifications necessary beyond this line
h = 1/(n+1);
Dxx = spdiags(ones(n,1)*[-1 2 -1],[-1 0 1],n,n)/hˆ2;
A = kron(Dxx,eye(n)) + kron(eye(n),Dxx);

x = h:h:1-h; y = x;
[xx,yy] = meshgrid(x,y); % generate the mesh for the graphics
fvec = f(xx(:),yy(:)); % compute the function
tic(); % start the stop watch
u = A\fvec; % solve the system of equations
toc() % display the solution time
mesh(xx,yy,reshape(fvec,n,n)) % generate the graphics
xlabel(’x’); ylabel(’y’);

The result of the above code is not completely satisfying, since the zero values of the function on the
boundary are not displayed. The code below adds these values and will generate Figure 4.14. The graph
clearly displays the higher temperature in the section with large values of x. This is caused by the heating
term f (x, y) = x2 .

%%% add on the zero boundary values for a nicer graphics


x = 0:h:1; y = x;
[xx,yy] = meshgrid(x,y); uu = zeros(size(xx));
uu(2:n+1,2:n+1) = reshape(u,n,n);
mesh(xx,yy,uu)
xlabel(’x’); ylabel(’y’);

0.025

0.02

0.015

0.01

0.005

01
0.8 1
0.6 0.8
y 0.4 0.6
0.2 0.4 x
0.2
0 0

Figure 4.14: Solution of the steady state heat equation on a square

The model matrix Ann on page 34 is symmetric of size n2 × n2 and it has a band structure with semi-
bandwidth n. Thus it is a very sparse matrix. In the above code the system of linear equation is solved
with the command u=A\fvec, and this will take advantage of the symmetry and sparseness. It is not a
good idea to use u = inv(A)*fvec since this creates the full matrix A−1 . A better idea is to use
Octave/MATLAB commands to create A as a sparse matrix, then the built-in algorithm will take advantage
of this. Thus we can solve problems with much finer grids, see the modified code below. In addition we
may allow a different number of grid points in the two directions.

SHA 21-5-21
CHAPTER 4. FINITE DIFFERENCE METHODS 273

• Use nx interior points in x direction and ny points in y direction for a matrix of size (nx · ny) × (nx ·
ny). Numbering in the x direction first will lead to a semi-bandwidth of nx, but numbering in the y
direction first will lead to a semi-bandwidth of ny.
• To construct the matrices representing the derivatives in x and y direction independently use the
command spdiags(). Then use the Kronecker product (command kron()) to construct the sparse
matrix A.
• The backslash operator \ in MATLAB or Octave will take full advantage of the sparsity structure of
this matrix, using algorithms presented in Chapter 2, in particular Section 2.6.7.

Heat2DStatic.m
nx = 55; ny = 50;
f = @(x,y)x.ˆ2;
%%%%%%%%%%%%%%%%%%%%%%%%%%
hx = 1/(nx+1); hy = 1/(ny+1);
Dxx = spdiags(ones(nx,1)*[-1 2 -1],[-1 0 1],nx,nx)/(hxˆ2);
Dyy = spdiags(ones(ny,1)*[-1 2 -1],[-1 0 1],ny,ny)/(hyˆ2);
A = kron(Dxx,speye(ny))+kron(speye(nx),Dyy);
x = hx:hx:1-hx; y = hy:hy:1-hy;
[xx,yy] = meshgrid(x,y);
fvec = f([xx(:),yy(:)]);
tic()
u = A\fvec;
solutionTime = toc()

%%% add on the zero boundary values for a nicer graphics


x = 0:hx:1; y = 0:hy:1;
[xx,yy] = meshgrid(x,y); uu = zeros(size(xx));
uu(2:ny+1,2:nx+1) = reshape(u,ny,nx);
mesh(xx,yy,uu)
xlabel(’x’); ylabel(’y’);

4.5 Initial Boundary Value Problems


4.5.1 The Dynamic Heat Equation
A one dimensional heat equation is given by the partial differential equation
∂ ∂2
u(t, x) = κ u(t, x) for 0 < x < 1 and t > 0
∂t ∂x2
u(t, 0) = u(t, 1) = 0 for t > 0 . (4.6)
u(0, x) = u0 (x) for 0 < x < 1

The maximum principle6 implies that for all t ≥ 0 we find

max{|u(x, t)| : 0 ≤ x ≤ 1} ≤ max{|u0 (x)| : 0 ≤ x ≤ 1} .

The approximate solutions generated by a finite difference scheme should satisfy this property too, leading
to the stability condition.
The two dimensional domain (t, x) ∈ R+ × [0, 1] is discretized as illustrated in Figure 4.15. For step
1
sizes h = ∆x = n+1 and ∆t let

ui,j = u(i · ∆t , j · ∆x ) for j = 0, 1, 2, . . . , n, n + 1 and i ≥ 0.


6
Find the precise statement and proofs in any good book on partial differential equations.

SHA 21-5-21
CHAPTER 4. FINITE DIFFERENCE METHODS 274

The boundary condition u(t, 0) = u(t, 1) = 0 implies ui,0 = ui,n+1 = 0 and the initial condition u(0, x) =
u0 (x) leads to u0,j = u0 (j · ∆x). The PDE (4.6) is replaced by a finite difference approximation on the grid
shown in Figure 4.15 and the result is examined.
t
6
H H
H H
H H
H H
H H
H H
H H
H H
H H
H H
H H
H H
H H
H H
H H
u = 0H
H
H
Hu=0
H
H
H H
H H
H H
H H
H H
H H
H
H u
t i,j
H
H
i H
H
H
H
H H
H H
H H
H H
H H
H H
H H
H H
H H
H H
H H
H H
H H
H H-
HHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHH
x
0 j u(0, x) = u0 (x) 1

Figure 4.15: A finite difference grid for a dynamic heat equation

The solution of the finite difference equation will be computed with the help of time steps, i.e. use the
values at one time level t = i · ∆t and then compute the values at the next level t + ∆t = (i + 1) ∆t. Thus
put all values at one time level t = i ∆t into a vector ~ui by

~ui = (ui,1 , ui,2 , ui,3 , . . . ui,n−1 , ui,n )T .

A finite difference approximation to the second order space derivative is given by (see Table 4.1 on page 256)

∂2 u(t, x − ∆x) − 2 u(t, x) + u(t, x + ∆x)


κ 2
u(t, x) = κ + O((∆x)2 ) . (4.7)
∂x ∆x2
Thus the values of the second order space derivatives at one time level can are approximated by −κ An · ~ui
where the symmetric n × n matrix An is given by
 
2 −1
 
 −1 2 −1 
 
−1 2 −1
 
1  
An = 
.. .. ..
.
∆x2 
 . . .


 

 −1 2 −1 

−1 2

Now find the approximation of the PDE (Partial Differential Equation) by a linear system of ordinary dif-
ferential equations
d
~u(t) = −κ An ~u(t) with ~u(0) = ~u0 .
dt

SHA 21-5-21
CHAPTER 4. FINITE DIFFERENCE METHODS 275

4.5.2 Construction of the Solution Using Eigenvalues and Eigenvectors


To examine the different possible approximations of the dynamic heat equation (4.6) use eigenvalues and
eigenvectors of the matrix An given by

1 kπ 4 kπ
λk = (2 + 2 cos )= sin2
∆x2 n+1 ∆x2 2 (n + 1)

and
nkπ T
 
1kπ 2kπ 3kπ (n − 1) k π
~vk = sin , sin , sin , . . . , sin , sin
n+1 n+1 n+1 n+1 n+1
where k = 1, 2, 3, . . . , n. Thus the eigenvectors are discretizations of the functions sin(k π x) on the interval
[0, 1] . These functions have exactly k local extrema in the interval. The higher the value of k the more the
eigenfunction will oscillate. In these notes find more information on matrices of the above type in Section 2.3
(page 32) and Result 3–22 (page 134). For another proof of the above see [Smit84, p. 154].
Since the matrix An is symmetric the eigenvectors are orthogonal and form a basis. Since the eigenvec-
tors satisfy (
1 if j = k
h~vk , ~vj i =
0 if j 6= k
(i.e. orthonormalized), then any vector ~u can be written as linear combination of normalized eigenvectors
~vk of the matrix An , i.e.
X n
~u = αk ~vk with αk = h~u, ~vk i .
k=1

For arbitrary t ≥ 0 consider the vector ~u(t) of the discretized (in space) solution. The differential equa-
tion (4.7) reads as
d
~u(t) = −κ An · ~u(t) .
dt
The solution ~u(t) of this system of ODEs (Ordinary Differential Equation) can be written as linear combi-
nation of eigenvectors, i.e.
n
X
~u(t) = αk (t) ~vk
k=1
n
d X
~u(t) = α̇k (t) ~vk
dt
k=1
n
X n
X
−κ An ~u(t) = −κ αk (t) An ~vk = −κ αk (t) λk ~vk
k=1 k=1
n
X n
X
α̇k (t) ~vk = − (κ αk (t) λk ) ~vk .
k=1 k=1

Examine the scalar product of the above with a vector ~vj and use the orthogonality to conclude
n
X n
X
h~vj , α̇k (t) ~vk i = h~vj , − (κ αk (t) λk ) ~vk i
k=1 k=1
n
X n
X
α̇k (t) h~vj , ~vk i = − (κ αk (t) λk ) h~vj , ~vk i
k=1 k=1
α̇j (t) = −κ λj αj (t) for j = 1, 2, 3 . . . , n .

SHA 21-5-21
CHAPTER 4. FINITE DIFFERENCE METHODS 276

The above system of n linear equations is converted to n linear, first order differential equations. The
initial values for the coefficient functions are given by αj (0) = h~u0 , ~vj i . For these equations use the
methods and results in Section 4.3.1. The approximation scheme to the system of differential equations
d
dt x(t) = −κ A ~ x(t) is stable if and only if the scheme applied to all of the above ordinary differential
equations is stable. Three different approaches will be examined: explicit, implicit and Crank–Nicolson.
Since the above ordinary differential equation can be solved analytically find a formula for the solution
n
X n
X
~u(t) = αk (t) ~vk = h~u0 , ~vk i exp(−κ λk t) ~vk .
k=1 k=1

For numerical purposes this formula is not very useful, since the effort to determine the eigenvalues and
eigenvectors is too large.

4.5.3 Explicit Finite Difference Approximation to the Heat Equation


The time derivative in the PDE (4.6) can be approximated by a forward difference
∂ u(t + ∆t, x) − u(t, x)
u(t, x) = + O(∆t) .
∂t ∆t
This can be combined with the space derivatives in equation (4.7) to obtain the scheme illustrated in Fig-
ure 4.16. The corresponding stencil is shown in Figure 4.5 on page 258. The results in Table 4.1 imply that
the scheme is consistent with an error of the order O(∆t) + O((∆x)2 ).

t6
i+1 t
ui+1,j − ui,j ui,j−1 − 2 ui,j + ui,j+1 t t t
= κ i
∆t (∆x)2
i–1
κ ∆t
ui+1,j = ui,j + (u i,j−1 − 2 ui,j + ui,j+1 )
(∆x)2 j–1 j j+1
-
x

Figure 4.16: Explicit finite difference approximation

Using a matrix notation the finite difference equation can be written as


~ui+1 − ~ui
= −κ An ~ui
∆t
ore solving for the next timelevel ~ui+1

~ui+1 = ~ui − κ ∆t An · ~ui = (In − κ ∆t An ) · ~ui .

If the vector ~ui is known the values at the next time level ~ui+1 can be computed without solving a system of
linear equations, thus this is called an explicit method. Starting with the discretization of the initial values
~u0 and applying the above formula repeatedly we find the solution

~ui = (In − κ ∆t An )i · ~u0 .

The goal is to examine the stability of this finite difference scheme. Since for eigenvalues λk and eigenvec-
tors ~vk we have
(In − κ ∆t An )i · ~vk = (1 − κ ∆t λk )i · ~vk
and the solution will remain bounded as i → ∞ only if κ ∆t λk < 2 for all k = 1, 2, 3, . . . , n. This
corresponds to the stability condition.

SHA 21-5-21
CHAPTER 4. FINITE DIFFERENCE METHODS 277

Since we want to use the results of Section 4.3.2 on solutions of the ordinary differential equation we
translate to the coefficient functions αk (t) and find

d
αk (t) = −κ λk αk (t)
dt
αk (t + ∆t) − αk (t)
= −κ λk αk (t) finite difference approximation
∆t
αk (t + ∆t) = (1 − ∆t κ λk ) αk (t)
αk (i · ∆t) = (1 − ∆t κ λk )i αk (0) .

The scheme is stable if the absolute value of the bracketed expression is smaller than 1, i.e.

κ λk ∆t < 2 .
4 nπ
Since the largest eigenvalue of An is (see Section 2.3.1 starting on page 32) λn = ∆x2
sin2 2 (n+1) ≈
4 sin2 π2 4
∆x2
= ∆x2
find the stability condition

∆t 1 1
κ < ⇐⇒ ∆t < (∆x)2 .
∆x2 2 2κ
This is a situation of conditional stability. The restriction on the size of the timestep ∆t is severe, since for
small values of ∆x the ∆t will need to be much smaller.

In Figure 4.17 a solution of the dynamic heat problem

∂ ∂2
u(t, x) = κ u(t, x) for 0 < x < 1 and t>0
∂t ∂x2
u(t, 0) = u(t, 1) = 0 for t > 0
u(0, x) = f (x) for 0 < x ≤ 1

∆t
is shown for values of r = κ ∆x 2 slightly smaller or larger than the critical value of 0.5 . The initial value
used is (
2x for 0 < x ≤ 0.5
f (x) = .
2 − 2x for 0.5 ≤ x < 1
Since the largest eigenvalue of An will be the first to exhibit instability examine the corresponding eigen-

0.35 0.4

0.3
0.3
0.25

0.2
0.2

0.15
0.1

0.1
0
0.05

0 -0.1
0 0.5 1 1.5 2 2.5 3 0 0.5 1 1.5 2 2.5 3

(a) stable: κ ∆t = 0.48 (∆x)2 (b) unstable: κ ∆t = 0.52 (∆x)2

Figure 4.17: Solution of 1-d heat equation, stable and unstable algorithms with r ≈ 0.5

SHA 21-5-21
CHAPTER 4. FINITE DIFFERENCE METHODS 278

vector  T
1nπ 2nπ 3nπ (n − 1) n π nnπ
~vn = sin , sin , sin , . . . , sin , sin .
n+1 n+1 n+1 n+1 n+1
The corresponding eigenfunction has n extrema in the interval. Thus the instability should exhibit n extrema,
which is confirmed by Figure 4.17(b) where the calculation is done with n = 9, as shown in the Octave code
below. The deviation from the correct solution exhibits 9 local extrema in the interval. This is an example
of a consistent and non-stable finite difference approximation. Obviously the scheme is not convergent.
HeatDynamic.m
L = 1; % length of the space interval
n = 9; % number of interior grid points
%n = 29; % number of interior grid points
r = 0.45; % ratio to compute time step
%r = 0.52; % ratio to compute time step
T = 0.1; % final time

iv = @(x) min([2*x/L,2-2*x/L]’)’ ;
dx = L/(n+1); dt = r*dxˆ2; x = linspace(0,L,n+2)’;

y = iv(x);
ynew = y;
legend(’off’)
for t = 0:dt:T+dt;
% for k = 2:n+1 % code with loops
% ynew(k) = (1-2*r)*y(k)+r*(y(k-1)+y(k+1));
% endfor
% y = ynew;
y(2:n+1) = (1-2*r)*y(2:n+1)+r*(y(1:n)+y(3:n+2)); % no loops
plot(x,y)
axis([0,1,0,1]); grid on
text(0.1,0.9,[’t = ’,num2str(t,3)]);
pause(0.02);
end%for

In the above code we verify that for each time step approximately 2 · n multiplications/additions are
necessary. Thus the computational cost of one time step is 2 n.

If the differential equation to be solved contains an inhomogeneous term, i.e.


∂ ∂2
u(t, x) = κ 2 u00 (t, x) + f (t, x)
∂t ∂x
then use the difference approximation

~ui+1 − ~ui = −∆t κ An ~ui + ∆t f~i .

This system can be solved similarly.

4.5.4 Implicit Finite Difference Approximation to the Heat Equation


The time derivative in the PDE (4.6) can be approximated by a backward difference
∂ u(t, x) − u(t − ∆t, x)
u(t, x) = + O(∆t) .
∂t ∆t
This will lead to the finite difference scheme shown in Figure 4.18. The corresponding stencil is shown in
Figure 4.6 on page 258. The results in Table 4.1 again imply that the scheme is consistent with the error of
the order O(∆t) + O((∆x)2 ).

SHA 21-5-21
CHAPTER 4. FINITE DIFFERENCE METHODS 279

t6

i+1 u u u

i u
ui+1,j − ui,j ui+1,j−1 − 2 ui+1,j + ui+1,j+1
=κ i–1
∆t (∆x)2
-
j–1 j j+1 x

Figure 4.18: Implicit finite difference approximation

Using a matrix notation find


~ui+1 − ~ui = −κ ∆t An · ~ui+1
or
(In + κ ∆t An ) · ~ui+1 = ~ui .
If the values ~ui at a given time ti = i ∆t are known solve a system of linear equations to determine the
values ~ui+1 at the next time level. This is an implicit method. As in the previous section use the eigenvalues
and vectors of An to examine stability of the scheme. Using the known initial value ~u0 we are lead to the
iteration scheme
~ui = (In + κ ∆t An )−i · ~u0
and thus (use An ~vk = λk ~vk )
 i
−i 1
(In + r An ) · ~vk = · ~vk .
1 + κ ∆t λk

Since λk > 0 we find that this scheme is unconditionally stable, i.e. there are no restrictions on the ratio of
the step sizes ∆x and ∆t. This is confirmed by the results in Figure 4.19. It was generated by code similar
to the one below.

1 1

0.8 0.8
Temperature

Temperature

0.6 0.6

0.4 0.4

0.2 0.2

0 0
0 0.5 1 1.5 2 2.5 3 0 0.5 1 1.5 2 2.5 3
position x position x

(a) stable: κ ∆t = 0.5 (∆x)2 (b) stable: κ ∆t = 2 (∆x)2

Figure 4.19: Solution of 1-d heat equation, implicit scheme with small and large step sizes

HeatDynamicImplicit.m
L = 1; % length of the space interval
n = 29; % number of interior grid points
r = 0.2; % ratio to compute time step
%r = 2.0;% ratio to compute time step

SHA 21-5-21
CHAPTER 4. FINITE DIFFERENCE METHODS 280

T = 0.5; % final time


plots = 5;% number of plots to be saved

iv = @(x)min([2*x/L,2-2*x/L]’)’;
dx = L/(n+1); dt = 2*r*dxˆ2; x = linspace(0,L,n+2)’;
initval = iv(x(2:n+1));

yplot = zeros(plots,n+2);
plotc = 1; tplot = linspace(0,T,plots);

Adiag = ones(n,1)*(1+2*r);
Aoffdiag = -ones(n-1,1)*r;

y = initval;
for t = 0:dt:T+dt;
if min(abs(tplot-t))<dt/2
yplot(plotc,2:n+1) = y’; plotc = plotc +1;
end%if
y = trisolve(Adiag,Aoffdiag,y);
end%for
plot(x,yplot)
grid on
xlabel(’position x’); ylabel(’Temperature’)

To perform one time step one has to solve a system of n linear equations where the matrix is symmetric,
tridiagonal and positive definite. There are efficient algorithms (trisolve()) for this type of problem
(e.g. [GoluVanLoan96], [GoluVanLoan13]), requiring only 5 n multiplications. If the matrix decomposition
and the back-substitution are separately coded, then this can even be reduce to an operation count for one
time-step of only 3 n multiplication. Thus the computational effort for one explicit step is similar to the cost
for one implicit step, but we gain unconditional stability.

If the differential equation to be solved contains an inhomogeneous term, i.e.

u̇(t, x) = κ u00 (t, x) + f (t, x) ,

then we may use the difference approximation

~ui+1 − ~ui = −∆t κ An ~ui+1 + ∆t f~i+1 .

This system can be solved similarly.

4.5.5 Crank–Nicolson Approximation to the Heat Equation


When using a centered difference approximation

∂ u(t + ∆t/2, x) − u(t − ∆t/2, x)


u(t, x) = + O((∆t)2 )
∂t ∆t
at the midpoint between time levels the finite difference scheme in Figure 4.20 is generated. It is an approxi-
d
mation of the differential equation dt u = κ u00 at the midpoint ( ti +t2i+1 , xj ). The results in Table 4.1 imply
that the scheme is consistent with the error of the order O((∆t)2 ) + O((∆x)2 ). The order of convergence
in time is improved by one..
The matrix notation leads to
κ ∆t
~ui+1 − ~ui = − (An · ~ui+1 + An · ~ui )
2

SHA 21-5-21
CHAPTER 4. FINITE DIFFERENCE METHODS 281

t6

i+1 u u u
ui+1,j − ui,j ui,j−1 − 2 ui,j + ui,j+1 +
= i u u u
κ ∆t 2 (∆x)2
ui+1,j−1 − 2 ui+1,j + ui+1,j+1 i–1
+
2 (∆x)2 -
j–1 j j+1 x

Figure 4.20: Crank–Nicolson finite difference approximation

or    
κ ∆t κ ∆t
In + An · ~ui+1 = In − An · ~ui .
2 2
With the values ~ui at a given time multiply the vector with a matrix and then solve a system of linear
equations to determine the values ~ui+1 at the next time level. This is an implicit method. As in the previous
section we use the eigenvalues and vectors of An to examine stability of the scheme. Examine the inequality
i
2 − κ ∆t λk
< 1.
2 + κ ∆t λk
Since λk > 0 this scheme is unconditionally stable.

In Table 4.3 find a comparison of the three different finite difference approximations to equation (4.6).
• For the explicit method one multiplication by a matrix I − α A is required. Thus we need approxi-
mately 3 n multiplications. If one would take advantage of the symmetry it could be reduced to 2 n
multiplications.
• For the implicit method one system of linear equations with a matrix I − α A is required. Using the
standard Cholesky factorization with band structure approximately 4 n multiplications are required.
Working with the modified Cholesky factorization one could reduce to 3 n multiplications.
• For the Crank–Nicolson method one matrix multiplication is paired with one system to be solved.
Thus we need approximately 7 n multiplication, or only 5 n with the optimized algorithms.
Using an inverse matrix is a bad idea in the above context, as this will lead to a full matrix and thus at
least n2 multiplications. Even for relatively large numbers n, the time required to do one time step will be
minimal for all of the above methods. This will be different for the 2D situation, as examined in Table 4.4.
As a consequence one should use either an implicit method or Crank–Nicolson for this type of problem.
If the differential equation to be solved contains an inhomogeneous term, i.e.
∂ ∂2
u(t, x) = κ 2 u00 (t, x) + f (t, x)
∂t ∂x
then use the difference approximation7
∆t κ ∆t ~
~ui+1 − ~ui = − An (~ui + ~ui+1 ) + (fi + f~i+1 ) .
2 2
This leads to
∆t κ ∆t κ ∆t ~
(I +
An ) ~ui+1 = (I − An ) ~ui + (fi + f~i+1 ) .
2 2 2
This system can be solved similarly.
7
Instead of 1
2
(f~i + f~i+1 ) one can (or even should) use f~i+1/2 , i.e. the discrtetization of f (ti + ∆t
2
, x).

SHA 21-5-21
CHAPTER 4. FINITE DIFFERENCE METHODS 282

method order of consistency stability condition flops optimal flops


1
explicit O(∆t) + O((∆x)2 ) ∆t < 2κ (∆x)2 3n 2n
implicit O(∆t) + O((∆x)2 ) unconditional 4n 3n
Crank–Nicolson O((∆t)2 ) + O((∆x)2 ) unconditional 7n 5n
advantage Crank–Nicolson implicit and CN none none

Table 4.3: Comparison of finite difference schemes for the 1D heat equation

4.5.6 General Parabolic Problems


In the previous section we considered only a special case of the space discretization operator A = κ An . A
more general situation may be described by the equation
d
~u(t) = −A · ~u(t) + f~(t)
dt
where the symmetric, positive definite matrix A has eigenvalues 0 ≤ λ1 ≤ λ2 ≤ λ3 ≤ . . . ≤ λn . When
using either Crank–Nicolson or the fully implicit method the resulting finite difference scheme will be
unconditionally stable. The explicit method leads to
~u(t + ∆t) = ~u(t) − ∆t A ~u(t) + f~(t) = (I − ∆t A) ~u(t) + ∆t f~(t) .
As in the previous sections we examine ~u as a linear combination of the eigenvectors ~vk . For the largest
eigenvalue λn the factor has to be smaller than 1 and thus |1 − ∆t λn | < 1. This leads to the stability
condition
2
∆t · λn < 2 ⇐⇒ ∆t < .
λn
This condition remains valid, also for problems with more than one space dimension. To use the explicit
method for these type of problems one needs to estimate the largest eigenvalue of the space discretiza-
tion. Estimates of this type can be given, based on the condition number of the discretization matrix, e.g.
[KnabAnge00, Satz 3.45]. For higher space dimensions the effort to solve one linear system of equations
for the implicit methods will increase drastically, as the resulting matrices will not be tridiagonal, but we
find a band structure. Nonetheless this structure can be used in efficient implementations, all will be shown
in the next section. The relevant results on matrix computations are given in Chapter 2.
For many dynamic problems a mass matrix M has to be taken into account too. Consider a discretized
systems of the form
d
M ~u(t) = −A · ~u(t) + f~(t) .
dt
Often linear systems of equations with the matrix M are easily solved, e.g. M might be a diagonal matrix
with positive entries. The generalized eigenvalues λ and eigenvectors ~v are nonzero solutions of
A · ~v = λ M ~v .
• The explicit discretization scheme leads to
1
M (~u(t + ∆t) − ~u(t)) = −A · ~u(t) + f~(t)
∆t
M ~u(t + ∆t) = M ~u(t) − ∆t A · ~u(t) + ∆t f~(t) = (M − ∆t A) · ~u(t) + ∆t f~(t) .
Using an expansion with eigenvectors of the generalized eigenvalue problem the homogeneous prob-
lem (f~(t) = ~0) leads to
αk (t + ∆t) M ~vk = αk (t) (M − ∆t A) · ~vk
= αk (t) (M − ∆t λk M) · ~vk
αk (t + ∆t) = αk (t) (1 − ∆t λk ) .

SHA 21-5-21
CHAPTER 4. FINITE DIFFERENCE METHODS 283

Thus the stability condition is again ∆t < 2/λn , where λn is the largest generalized eigenvalue.
• The fully implicit scheme will lead to
1
M (~u(t + ∆t) − ~u(t)) = −A · ~u(t + ∆t) + f~(t)
∆t
(M + ∆t A) · ~u(t + ∆t) = M · ~u(t) + ∆t f~(t)
and is unconditionally stable.
• The Crank–Nicolson scheme will lead to
1 1 1
M (~u(t + ∆t) − ~u(t)) = − (A · ~u(t + ∆t) + A · ~u(t)) + (f~(t) + f~(t + ∆t))
∆t  2  2
∆t ∆t ∆t ~
M+ A · ~u(t + ∆t) = M− A · ~u(t) + (f (t) + f~(t + ∆t))
2 2 2
and is unconditionally stable.

4.5.7 A two Dimensional Dynamic Heat Equation


Equation (1.6) on page 12 describes the temperature distribution as a function of the space coordinates x
and y, and the time variable t .

∂t u(t, x, y) − κ ∆u(t, x, y) = f (t, x, y) for 0 ≤ x, y ≤ 1 and t ≥ 0
u(t, x, y) = 0 for (x, y) on boundary and t ≥ 0 (4.8)
u(0, x, y) = u0 (x, y) for 0 ≤ x, y ≤ 1

Explicit approximation
The explicit (with respect to time) finite difference approximation is determined by
1
(~ui+1 − ~ui ) = −κ Ann ~ui + f~i
∆t
or
~ui+1 = ~ui − ∆t (κ Ann ~ui − f~i ) .
For each time step we have to multiply the matrix Ann with a vector. Due to the severe sparsity of the matrix
this requires approximatley 5 n2 multiplications. Since the largest eigenvalue is given by κ λn,n ≈ κ 8 n2 ≈

(∆x)2
we have the stability condition
2 1
∆t ≤ ≈ (∆x)2 .
κ λn,n 4κ
The algorithm is conditionally stable only.

Implicit approximation
The implicit (with respect to time) finite difference approximation is determined by
    (1/∆t) (~ui+1 − ~ui) = −κ Ann ~ui+1 + f~i+1
or
(I + ∆t κ Ann ) ~ui+1 = ~ui + ∆t f~i+1 .
The algorithm is unconditionally stable. For each time step a system of linear equations has to be solved, but the matrix is constant. Thus we can factorize the matrix once (Cholesky) and then only perform the back substitution steps. The symmetric, positive definite matrix Ann has size n² × n² and a semi-bandwidth of b = n. Using the results in Section 2.6.4 the computational effort for one banded Cholesky factorization is approximately (1/2) n² · b² = (1/2) n⁴ multiplications. Each subsequent solving of a system of equations requires approximately 2 n² · b = 2 n³ multiplications.


Crank–Nicolson approximation
The CN finite difference approximation is determined by
    (1/∆t) (~ui+1 − ~ui) = −(κ/2) Ann (~ui + ~ui+1) + (1/2) (f~i + f~i+1)

or

    (I + (∆t κ/2) Ann) ~ui+1 = (I − (∆t κ/2) Ann) ~ui + (∆t/2) (f~i + f~i+1) .
The algorithm is unconditionally stable too and the computational effort is comparable to the implicit
method.

Comparison
A comparison for the explicit, implicit and CN approximation is given in Table 4.4. For the implicit scheme
each time step requires more computations than an explicit time step. The time steps for the implicit scheme
may be larger. The choice of best algorithm thus depends on the time interval on which you want to compute
the solution: for very small times the explicit scheme is more efficient, for very large times the implicit
scheme is more efficient. This differs from the 1D situation in Table 4.3 where the computational effort for
each time step was small and of the same order for the three algorithms examined. The Crank–Nicolson
scheme can be applied to the 2D heat equation, leading to a higher order of consistency.

                                    explicit              implicit              Crank–Nicolson
  order of consistency              O(∆t) + O((∆x)²)      O(∆t) + O((∆x)²)      O((∆t)²) + O((∆x)²)
  condition on time step ∆t         ∆t ≤ (∆x)²/(4κ)       no condition          no condition
  linear system to be solved        no                    yes                   yes
  flops for matrix factorization    none                  (1/2) n⁴              (1/2) n⁴
  flops for each time step          5 n²                  2 n³                  2 n³
Table 4.4: Comparison of finite difference schemes for 2D dynamic heat equations

A sample code in Octave/MATLAB


Below find Octave code to solve the initial boundary value problem with u0(x, y) = 0 and f(t, x, y) = x² on the interval [0, T] = [0, 0.5] with dt = 0.02 and nx · ny = 34 · 35 interior grid points. Figure 4.21(b) shows that the temperature at an interior point converges towards a final value. The result in Figure 4.21(a) is, not surprisingly, very similar to Figure 4.14, i.e. the solution of the steady state problem −∆u = x². The results from Section 2.6.6 (page 68) are used to first determine the Cholesky factorization of the matrix; for each time step only the back substitution is performed, which is efficient.
PlateDynamic.m
%%%%% script file to solve the dynamic heat equation on a unit square
%%%%% using an implicit finite difference scheme
T = 0.5; dt = 0.02;
nx = 34; ny = 35;
f = @(x,y) x.^2;
%%%%%%%%%%%%%%%%%%%%%%%%%%
t = 0:dt:T; utrace = zeros(size(t));
hx = 1/(nx+1); hy = 1/(ny+1);
Dxx = spdiags(ones(nx,1)*[-1 +2 -1],[-1 0 1],nx,nx)/(hx^2);
Dyy = spdiags(ones(ny,1)*[-1 +2 -1],[-1 0 1],ny,ny)/(hy^2);
A = kron(Dxx,speye(ny)) + kron(speye(nx),Dyy);
Astep = speye(nx*ny,nx*ny) + dt*A;
[R,p,P] = chol(Astep);      % Cholesky factorization, with permutations
% R = chol(Astep);          % alternative: factorization without permutations
x = hx:hx:1-hx; y = hy:hy:1-hy;
u = zeros(nx*ny,1);         % define the initial temperature
[xx,yy] = meshgrid(x,y); fvec = dt*f(xx(:),yy(:));

for k = 2:length(t)
  u = P*(R\(R'\(P'*(u+fvec))));   % one time step
  % u = R\(R'\(u+fvec));          % one time step, without permutations
  utrace(k) = u(55);
end%for

%%% add on the zero boundary values for a nicer graphics
x = 0:hx:1; y = 0:hy:1;
[xx,yy] = meshgrid(x,y); uu = zeros(size(xx));
uu(2:ny+1,2:nx+1) = reshape(u,ny,nx);

figure(1); surf(xx,yy,uu); xlabel('x'); ylabel('y');
figure(2); plot(t,utrace); grid on; xlabel('Time t'); ylabel('Temp u');

In the above code we used a sparse Cholesky factorization to solve the system of linear equations at each time step. Since A is a sparse, symmetric, positive definite matrix we may also use an iterative solver, e.g. the conjugate gradient algorithm. According to the results in Section 2.7 and in particular Figure 2.17 (page 84), this might be a faster solution for fine meshes. Since this is a time stepping algorithm we have a good initial guess for the solution of the linear system, namely the result at the previous time step.
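The sketch below, not part of the original script, indicates how the time stepping loop of PlateDynamic.m could be rewritten with the conjugate gradient solver pcg(), using the previous time step as initial guess; the tolerance and iteration limit are assumed values for illustration.

    % A minimal sketch: replace the Cholesky solve by warm-started pcg().
    tol = 1e-8; maxit = 200;
    for k = 2:length(t)
      [u,flag] = pcg(Astep, u+fvec, tol, maxit, [], [], u);  % old u as initial guess
      utrace(k) = u(55);
    end%for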

Figure 4.21: Solution of the dynamic heat equation on a square: (a) final temperature, (b) temperature at one point as function of time

4.5.8 Comparison of a Few More Solvers for Dynamic Problems


We list some facts for a few more numerical solvers for the linear system of ODEs of the form
    d/dt ~u(t) = −A ~u(t) + f~(t)                                                    (4.9)


or for the corresponding stability questions

    d/dt u(t) = −λ u(t)   with   u(0) = 1                                            (4.10)
for values λ > 0. This was examined in Section 3.4.3 on page 196. The exact solution is obviously
u(t) = u(0) exp(−λ t). Of interest are two aspects of the solvers:

1. stability condition: |ui | has to remain bounded. This is the case for all Re(λ) < 0 if the method is
A–stable.

2. decay condition: for large values of λ, the solution should converge to 0 very fast. This is the case if
the method is L–stable.

In a typical situation we find after one step u(∆t) ≈ g(λ ∆t), where the function g() depends on the algorithm to be examined. For the exact solver use g(λ ∆t) = exp(−λ ∆t). The stability condition requires |g(λ ∆t)| ≤ 1 and the decay condition translates to |g(λ ∆t)| ≪ 1 for λ ∆t ≫ 1.

The explicit solver


For the ODE (4.9) consider the basic explicit method

    (~ui+1 − ~ui)/∆t = −A ~ui + f~i
    ~ui+1 = ~ui + ∆t (−A ~ui + f~i) .

The method is consistent of order 1. For the stability (4.10) examine


    (ui+1 − ui)/∆t = −λ ui
    ui+1 = (1 − ∆t λ) ui

and thus the stability function is gEx(λ ∆t) = 1 − λ ∆t. This leads to the conditional stability ∆t < 2/λ and we find no decay at all.

The basic implicit solver


For the ODE (4.9) consider the basic implicit method

    (~ui+1 − ~ui)/∆t = −A ~ui+1 + f~i+1
    (I + ∆t A) ~ui+1 = ~ui + ∆t f~i+1

The method is consistent of order 1. For the stability (4.10) examine


    (ui+1 − ui)/∆t = −λ ui+1
    (1 + ∆t λ) ui+1 = ui
    ui+1 = ui / (1 + ∆t λ)

and thus the stability function with z = λ ∆t is

    gIm(z) = 1/(1 + z) ,

which leads to unconditional stability and unconditional decay. The method is A–stable and L–stable.


The Crank–Nicolson solver


For the ODE (4.9) consider

    (~ui+1 − ~ui)/∆t = −A (~ui+1 + ~ui)/2 + (f~i+1 + f~i)/2
    (I + (∆t/2) A) ~ui+1 = (I − (∆t/2) A) ~ui + (∆t/2) (f~i+1 + f~i)
The method is consistent of order 2. For the stability examine the model problem (4.10).
    (ui+1 − ui)/∆t = −λ (ui + ui+1)/2
    (1 + (∆t/2) λ) ui+1 = (1 − (∆t/2) λ) ui
    ui+1 = ((2 − ∆t λ)/(2 + ∆t λ)) ui

and thus the stability function with z = λ ∆t is

    gCN(z) = (2 − z)/(2 + z) ,

which leads to unconditional stability but no decay, since lim_{z→∞} g(z) = −1. The method is A–stable but not L–stable.

The BDF2 solver


This is an implicit method with good stability properties. Find a good presentation in [Butc03, §225]. The
BDF2 (backwards differentiation formula) uses the approximations
    d/dt ui ≈ (ui+1 − ui)/∆t ≈ (ui − ui−1)/∆t
    (2/3) d/dt ui ≈ (ui+1 − ui)/∆t − (1/3) (ui − ui−1)/∆t = (3 ui+1 − 4 ui + ui−1)/(3 ∆t)

to realize that the ODE (4.9) can be discretized by

    (1/∆t) (~ui+1 − (4/3) ~ui + (1/3) ~ui−1) = −(2/3) A ~ui+1 + (2/3) f~i+1
    ~ui+1 − (4/3) ~ui + (1/3) ~ui−1 = −(2 ∆t/3) A ~ui+1 + (2 ∆t/3) f~i+1
    (I + (2 ∆t/3) A) ~ui+1 = (4/3) ~ui − (1/3) ~ui−1 + (2 ∆t/3) f~i+1
    (3 I + 2 ∆t A) ~ui+1 = 4 ~ui − ~ui−1 + 2 ∆t f~i+1

The method is consistent of order 2. Observe that 2 time levels are used to advance by one time step. Thus
one needs another algorithm to start BDF2, e.g. one RK step, which is also of order 2. To examine the
d
stability use the model equation (4.10), i.e. examine dt u(t) = −λ u(t).
    ui+1 − (4/3) ui + (1/3) ui−1 = −(2 ∆t/3) λ ui+1
    (3 + 2 λ ∆t) ui+1 = 4 ui − ui−1
    ui+1 = 4/(3 + 2 λ ∆t) ui − 1/(3 + 2 λ ∆t) ui−1

or, written as a system,

    [ ui ; ui+1 ] = [ 0 , 1 ; −1/(3 + 2 λ ∆t) , 4/(3 + 2 λ ∆t) ] · [ ui−1 ; ui ]


To examine the stability of this scheme use the eigenvalues of this matrix.
" #
−µ 1 4 1
0 = det = µ2 − µ+
−1 4
−µ 3 + 2 λ ∆t 3 + 2 λ ∆t
3+2 λ ∆t 3+2 λ ∆t
2
0 = (3 + 2 λ ∆t) µ − 4 µ + 1
1  p 
µ1,2 = +4 ± 16 − 4 (3 + 2 λ ∆t)
2 (3 + 2 λ ∆t)

1  p  +2 ± 1 − 2 λ ∆t
= +2 ± 4 − (3 + 2 λ ∆t) =
3 + 2 λ ∆t 3 + 2 λ ∆t
To examine this expression consider two different cases for z = λ ∆t:

• z ≤ 1/2 : Use g1,2(z) = (2 ± √(1 − 2z))/(3 + 2z) and

    gBDF2(z) = max{g1,2(z)} = (2 + √(1 − 2z))/(3 + 2z) .

• z > 1/2 : Use g1,2(z) = (2 ± i √(2z − 1))/(3 + 2z) ∈ C and

    gBDF2(z) = max{|g1,2(z)|} = √(4 + 2z − 1)/(3 + 2z) = √(3 + 2z)/(3 + 2z) = 1/√(3 + 2z) .

Verify that for z > 0 the absolute values are smaller than 1, and thus we have unconditional stability. For the decay condition we observe that lim_{z→∞} |g1,2(z)| = 0. This is based on gBDF2(z) ≈ 1/√(2z) for z ≫ 1. The method is A–stable and L–stable.
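The BDF2 scheme is easy to implement once a starting step of order 2 is available. The following sketch is not from the notes; it uses one Crank–Nicolson step, which is also of order 2, as startup and assumes a small example system with a constant load.

    % A minimal sketch: BDF2 time stepping for du/dt = -A*u + f (f constant).
    A = [2 -1; -1 2]; f = [1; 0]; u0 = [0; 0]; dt = 0.05; nsteps = 100;  % assumed data
    n = length(u0); I = eye(n);
    uold = u0;
    unew = (I + dt/2*A) \ ((I - dt/2*A)*uold + dt*f);    % startup step (Crank-Nicolson)
    for k = 2:nsteps
      utmp = (3*I + 2*dt*A) \ (4*unew - uold + 2*dt*f);  % one BDF2 step
      uold = unew; unew = utmp;
    end%for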

An L-stable RK solver
This is a diagonally implicit Runge–Kutta or DIRK method, find a good presentation in [Butc03, §361].
The ODE (4.9) is discretized by
    (I + θ ∆t A) ~un = (I − (1 − θ) ∆t A) ~ui
    (I + θ ∆t A) ~ui+1 = (I − (1/2) ∆t A) ~ui − (1/2 − θ) ∆t A ~un

where θ = 1 − 1/√2. The method is consistent of order 2. Observe that two systems of linear equations have to be solved for each step. The first stage may also be written as

    ~un − ~ui = −θ ∆t A (~un − ~ui) − ∆t A ~ui = −(1 − 1/√2) ∆t A (~un − ~ui) − ∆t A ~ui .
To examine the stability use (4.10).
    (1 + θ λ ∆t) un = (1 − (1 − θ) λ ∆t) ui
    (1 + θ λ ∆t) ui+1 = (1 − (λ/2) ∆t) ui − (1/2 − θ) λ ∆t un
                      = (1 − (λ/2) ∆t) ui − (1/2 − θ) λ ∆t (1 − (1 − θ) λ ∆t)/(1 + θ λ ∆t) ui
                      = ( (1 − (λ/2) ∆t) − λ ∆t (1/2 − θ) (1 − (1 − θ) λ ∆t)/(1 + θ λ ∆t) ) ui



with θ = 1 − 1/√2 ≈ 0.29. Thus we have to examine the function (z = λ ∆t)

    gRK(z) = (1/(1 + θ z)) ( (1 − z/2) − z (1/2 − θ) (1 − (1 − θ) z)/(1 + θ z) ) .

Use a plot of this function for z > 0 to observe that |g(z)| < 1 and thus we have unconditional stability. For z ≫ 1 use 1 − (1 − θ) z ≈ −(1 − θ) z and Mathematica to verify

    gRK(z) ≈ (1/(1 + θ z)) ( (1 − z/2) + z (1/2 − θ) (1 − θ) z/(1 + θ z) ) = (4 + 2 (1 − √2) z)/(2 + (2 − √2) z)²

For z ≫ 1 we find

    gRK(z) ≈ 2 (1 − √2) z / ((2 − √2)² z²) = 2 (2 + √2)² (1 − √2) / ((2 + √2)² (2 − √2)² z)
           = 2 (6 + 4 √2) (1 − √2) / ((4 − 2)² z) = (6 − 8 − 2 √2)/(2 z) = (−1 − √2)/z .

Consequently we have unconditional decay. The method is A–stable and L–stable.
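One step of this DIRK scheme requires two solves with the same matrix I + θ ∆t A. The short sketch below is not from the notes and uses an assumed small test matrix only to illustrate the two stages.

    % A minimal sketch: one step of the above DIRK scheme for du/dt = -A*u.
    A = [2 -1; -1 2]; u = [1; 0]; dt = 0.1;     % assumed example data
    n = length(u); I = eye(n);
    theta = 1 - 1/sqrt(2);
    un = (I + theta*dt*A) \ ((I - (1-theta)*dt*A)*u);                 % first stage
    u  = (I + theta*dt*A) \ ((I - dt/2*A)*u - (1/2-theta)*dt*(A*un)); % new time level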

Comparison of the solvers


The above algorithms should be compared using different criteria. Table 4.5 shows the key properties and
in Figure 4.22 find the graphs of the stability functions.

  name of algorithm     order     g(z) for z ≫ 1           time levels     systems to solve
  implicit              1         ≈ 1/z                    1               1
  Crank–Nicolson        2         ≈ −1                     1               1
  BDF2                  2         ≈ 1/√(2z) ≈ 0.71/√z      2               1
  RK                    2         ≈ (−1 − √2)/z            1               2

Table 4.5: For different ODE solvers find the order of consistency, the asymptotic approximation of the
stability function g(z), the number of time levels required and the number of linear systems to be solved to
perform one time step.
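The stability functions can also be compared numerically. The sketch below is not part of the original notes; it evaluates gIm, gCN, gBDF2 and gRK for z > 0 and reproduces the qualitative behavior of the curves in Figure 4.22.

    % A minimal sketch: plot the four stability functions and the exact decay.
    z = linspace(0,20,500);
    gIm = 1./(1+z);
    gCN = (2-z)./(2+z);
    gBDF = zeros(size(z));
    for k = 1:length(z)                      % largest modulus of the two BDF2 roots
      mu = roots([3+2*z(k), -4, 1]);
      gBDF(k) = max(abs(mu));
    end%for
    theta = 1 - 1/sqrt(2);
    gRK = ((1-z/2) - z*(1/2-theta).*(1-(1-theta)*z)./(1+theta*z))./(1+theta*z);
    plot(z,gIm, z,gCN, z,gBDF, z,gRK, z,exp(-z))
    legend('IM','CN','BDF2','RK','exact'); xlabel('z'); ylabel('amplification factor')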

4.6 Hyperbolic Problems, Wave Equations


The simplest form of a wave equation is

    ∂²/∂t² u(t, x) = c² ∂²/∂x² u(t, x)   for 0 < x < 1 and t > 0
    u(t, 0) = u(t, 1) = 0                for t > 0
    u(0, x) = u0(x)                      for 0 < x < 1                               (4.11)
    u̇(0, x) = u1(x)                      for 0 < x < 1

The equation of a vibrating string (1.8) on page 14 is of this form. Examine an explicit and an implicit finite
difference approximation.


Figure 4.22: Stability functions (amplification factor as a function of z = λ ∆t) for the four algorithms IM, CN, BDF2 and RK, compared with the exact solver

4.6.1 An Explicit Approximation


Examine the finite difference approximation on a grid given by Figure 4.23 . Given ~ui−1 and ~ui compute the
solution ~ui+1 with a multiplication by a matrix, thus this is an explicit scheme. The finite difference scheme
is consistent of order (∆x)2 + (∆t)2 .
    (~ui+1 − 2 ~ui + ~ui−1)/(∆t)² = −c² An ~ui

or, written for the individual grid points,

    (u_{i+1,j} − 2 u_{i,j} + u_{i−1,j})/(∆t)² = c² (u_{i,j−1} − 2 u_{i,j} + u_{i,j+1})/(∆x)²
    ~ui+1 = (2 I − (∆t)² c² An) ~ui − ~ui−1 .

Figure 4.23: Explicit finite difference approximation for the wave equation (stencil using the points (i, j−1), (i, j), (i, j+1) at the current time level and (i−1, j), (i+1, j) at the previous and next time level)

Stability of the explicit scheme


The next point to be considered is the stability of the finite difference scheme. The technique used is
very similar to the procedure used in Section 4.3.2 (page 259) to examine stability of the finite difference
approximation to the dynamic heat equation. Since the time derivatives are of order 2 it is convenient to first
examine the ordinary differential equation
    d²/dt² α(t) = −λ α(t)

with exact solutions α(t) = A cos(√λ t + δ). Thus the solution remains bounded for all times t. Insist on the solutions of the approximate equation

    (α(t + h) − 2 α(t) + α(t − h))/h² = −λ α(t)


to remain bounded too. Solve the above difference equation for α(t + h) to find
α(t + h) = 2 α(t) − α(t − h) − h2 λ α(t) .
Using a matrix notation write this in the form
! " # !
α(t) 0 1 α(t − h)
= ·
α(t + h) −1 2 − λ h2 α(t)
and with an iteration find
! " #i !
α(i h) 0 1 α(0)
= · .
α((i + 1) h) −1 2 − λ h2 α(h)
These solutions remain bounded as i → ∞ if the eigenvalues µ of the matrix have absolute values smaller
or equal to 18 . Thus we examine the solutions of the characteristic equation
" #
0−µ 1
det = µ2 − µ (2 − λ h2 ) + 1 = 0 .
−1 2 − λ h2 − µ
Using a factorization of this polynomial find
µ2 − µ (2 − λ h2 ) + 1 = (µ − µ1 ) (µ − µ2 ) = µ2 − (µ1 + µ2 ) µ + µ1 µ2 .
Since the constant term equals 1, conclude that µ1 · µ2 = 1. If both values µ1,2 were real (and not equal to ±1) with µ1 · µ2 = 1, then µ2 = 1/µ1 and one of the absolute values would be larger than 1. If the values are complex conjugate use 1 = µ1 · µ2 = µ1 · µ̄1 = |µ1|². Thus the condition for |µ1,2| ≤ 1 to be correct is given by: complex conjugate values on the unit circle with nonzero imaginary part. The solutions µ1,2 are given by

    µ1,2 = (1/2) ( 2 − λ h² ± √((2 − λ h²)² − 4) ) = (1/2) ( 2 − λ h² ± √(λ² h⁴ − 4 λ h²) )
         = ( 2 − λ h² ± √(λ h²) √(λ h² − 4) ) / 2 .
Thus as a necessary and sufficient condition for stability use a negative discriminant.
    λ² h⁴ − 4 λ h² < 0   ⟺   λ h² < 4   ⟺   h² < 4/λ

Setting up the initial values


To start the iteration the values of u(0) and u(∆t) have to be available. A simple approach is to use the initial velocity u̇(0) = v0 and thus u(∆t) ≈ u(0) + v0 ∆t. This approximation is consistent of order ∆t. A better approach is to use the centered approximation and the differential equation at t = 0. The idea used to improve the consistency is similar to the approach in Example 4–7 used to discretize the boundary condition u′(L) = F.

    ∂/∂t u(0) ≈ (u(∆t) − u(−∆t))/(2 ∆t) = v0
    u(−∆t) = u(∆t) − 2 v0 ∆t
    ü(0) ≈ (u(−∆t) − 2 u(0) + u(∆t))/(∆t)² = −λ u(0)
    ((u(∆t) − 2 v0 ∆t) − 2 u(0) + u(∆t))/(∆t)² = −λ u(0)
    u(∆t) = u(0) + v0 ∆t − (1/2) λ u(0) (∆t)² .
Footnote: If µ is an eigenvalue of the matrix A with eigenvector ~v then A^k ~v = µ^k ~v. This expression remains bounded iff |µ| ≤ 1. We quietly assume that the matrix A is diagonalizable.


With the additional term the approximation is consistent of order (∆t)2 .


When applied to the equation ∂²/∂t² ~u(t) = −c² A ~u(t) + f~(t) with the initial conditions ~u(0) = ~u0 and ∂/∂t ~u(0) = ~v0 use

    ∂/∂t ~u(0) ≈ (~u(∆t) − ~u(−∆t))/(2 ∆t) = (~u+1 − ~u−1)/(2 ∆t) = ~v0
    ~u−1 = ~u+1 − 2 ∆t ~v0
    (~u−1 − 2 ~u0 + ~u+1)/(∆t)² ≈ ∂²/∂t² ~u(0) = −c² A ~u0 + f~0
    ((~u+1 − 2 ∆t ~v0) − 2 ~u0 + ~u+1)/(∆t)² = (2 ~u+1 − 2 ∆t ~v0 − 2 ~u0)/(∆t)² = −c² A ~u0 + f~0
    ~u+1 = ~u0 + ∆t ~v0 + ((∆t)²/2) (−c² A ~u0 + f~0) .

This approximation is consistent of order (∆t)² (see the footnote below).
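The second order start vector is a one-line computation. The sketch below is not part of the original notes; the matrix, step sizes and initial data are assumed example values.

    % A minimal sketch: second order accurate start vector u1 for the wave equation.
    n = 5; dx = 0.1; c = 1; dt = 0.05;                    % assumed example values
    A = spdiags(ones(n,1)*[-1 2 -1],[-1 0 1],n,n)/dx^2;   % 1D Laplacian matrix An
    u0 = ones(n,1); v0 = zeros(n,1); f0 = zeros(n,1);     % initial data and load
    u1 = u0 + dt*v0 + dt^2/2*(-c^2*(A*u0) + f0);          % formula derived above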

Solving the equation by time stepping


Now return to the wave equation. With the notation from the previous section write the discretization scheme
in Figure 4.23 in the form
    ~ui+1 − 2 ~ui + ~ui−1 = −c² (∆t)² An ~ui

or, when solved for ~ui+1,

    ~ui+1 = (2 In − c² (∆t)² An) ~ui − ~ui−1 .

With a block matrix notation this can be transformed into a form similar to the ODE situation above.

    [ ~ui ; ~ui+1 ] = [ 0 , In ; −In , 2 In − c² (∆t)² An ] · [ ~ui−1 ; ~ui ] .

Then write the solution as a linear combination of the eigenvectors of the matrix An , i.e.
    ~u(t) = Σ_{k=1}^{n} αk(t) ~vk   where   An ~vk = λk ~vk .

The above matrix is replaced by

    [ 0 , 1 ; −1 , 2 − c² (∆t)² λk ]

and powers of this matrix have to remain bounded, for all eigenvalues λk. The stability condition for the ODE leads to c² (∆t)² λk ≤ 4 for k = 1, 2, 3, . . . , n. Since the largest eigenvalue is given by

    λn = (4/(∆x)²) sin²( n π / (2 (n + 1)) ) ≈ (4/(∆x)²) sin²(π/2) = 4/(∆x)²

find the stability condition

    c² (∆t)²/(∆x)² ≤ 1   ⟺   c² (∆t)² ≤ (∆x)²   ⟺   c ∆t ≤ ∆x .

The solution at the first two time levels has to be known to get the finite difference scheme started. We have
to use the initial conditions to construct the vectors u0 and u1 . The first initial condition in equation (4.11)
Footnote: Currently this is not implemented yet in some of my sample codes.


obviously implies that ~u0 should be the discretization of u(0, x) = u0(x). The Octave code below is an elementary implementation of the presented finite difference scheme. In the example below the initial velocity v0(x) = 0 is used and ~u1 is then determined by the method from the previous section. If the ratio r = ∆t/∆x is increased beyond the critical value of 1/c then the algorithm is unstable and the solution will be far away from the true solution. The instability is again (as in Section 4.5.3) in the direction of the eigenvector belonging to the largest eigenvalue.
Wave.m
L = 3;      % length of the space interval
n = 150;    % number of interior grid points
r = 0.99;   % ratio to compute time step
T = 6;      % final time
iv = @(x)max([min([2*x';2-2*x']);0*x'])';   % initial value

dx = L/(n+1); dt = r*dx; x = linspace(0,L,n+2)';

y0 = iv(x); y0(1) = 0; y0(n+2) = 0;

% use zero initial speed
y1 = y0 + dt^2/2*[0;diff(y0,2);0]/dx^2;   % improved initialization
y2 = y0;                                  % reserve the memory

figure(1); clf;
for t = 0:dt:T+dt
  plot(x,y0); axis([0,L,-1,1]); drawnow();  %% graphics uses most of the time
  if 0   %% for loops, slow
    for k = 2:n+1
      y2(k) = (2-2*r^2)*y1(k)+r^2*(y1(k-1)+y1(k+1))-y0(k);
    end%for
  else   %% no loop, fast
    y2(2:n+1) = (2-2*r^2)*y1(2:n+1) + r^2*(y1(1:n)+y1(3:n+2))-y0(2:n+1);
  end%if
  y0 = y1; y1 = y2;
end%for

figure(2)
plot(x,y0,x,iv(x))

Conditional stability, based on the cone of dependence, D’Alembert’s solution


In the above section we examined the conditional stability of the explicit scheme based on eigenvalues of
the corresponding matrix. One may also use a more physical argument to examine the stability condition.
Using calculus one may verify that the unique solution of the initial value problem

∂2 ∂2
∂t2
u(t, x) = c2 ∂x2
u(t, x) for −∞ < x < ∞ and t ∈ R
u(0, x) = u0 (x) for −∞ < x < ∞ (4.12)

∂t u(0, x) = u1 (x) for −∞ < x < ∞

is given by D’Alembert’s formula.


    u(t, x) = (1/2) ( u0(x − c t) + u0(x + c t) ) + (1/(2 c)) ∫_{x−ct}^{x+ct} u1(ξ) dξ          (4.13)

This implies that the solution of the wave equation at time t and position x is determined by the values in
the cone of dependence, i.e. all times τ < t and positions x̃ such that |x̃ − x| ≤ c (t − τ ). This is visualized
in Figure 4.24. On the left find the domain having an influence on the solution at time t and position x and


Figure 4.24: D'Alembert's solution of the wave equation: (a) cone of dependence with the interval of dependence [x − c t, x + c t] for the point (x, t), (b) cone of influence with the interval influenced by the point (x, 0)

on the right the domain influenced by the values at time t = 0 and position x. This formula and figure also
confirm that information is traveling at most with speed c.
A quick look at Figure 4.23 (page 290) will confirm that for the explicit finite difference scheme the cone of dependence has a slope of ∆t/∆x, while the slope for the exact solution in Figure 4.24 is given by 1/c. Since the numerical cone has to contain the exact cone we have the condition

    ∆t/∆x ≤ 1/c   ⟹   ∆t ≤ (1/c) ∆x .
This confirms the stability condition obtained using eigenvalues.
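For zero initial velocity D'Alembert's formula can be evaluated directly and compared with the finite difference result. The sketch below is not part of the original notes; it extends the initial value by odd reflection (an assumption made here to respect the boundary conditions u(t, 0) = u(t, L) = 0) and evaluates (4.13) with u1 = 0.

    % A minimal sketch: D'Alembert's solution for the hat shaped initial value.
    L = 3; c = 1; t = 0.4;                              % assumed example values
    u0 = @(x) max(min(2*x, 2-2*x), 0);                  % initial displacement
    u0odd = @(x) u0(mod(x,2*L)).*(mod(x,2*L)<=L) - u0(2*L-mod(x,2*L)).*(mod(x,2*L)>L);
    x = linspace(0,L,301);
    u = (u0odd(x - c*t) + u0odd(x + c*t))/2;            % D'Alembert with u1 = 0
    plot(x,u); xlabel('x'); ylabel('u(t,x)');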

4.6.2 An Implicit Approximation


Since the explicit method is again conditionally stable we consider an implicit method, which turns out to
be unconditionally stable. The space discretization at time level i in the previous section is replaced by a
weighted average of discretizations at levels i − 1, i and i + 1 .

    (~ui+1 − 2 ~ui + ~ui−1)/(∆t)² = −(c²/4) (An ~ui+1 + 2 An ~ui + An ~ui−1) .

One can verify (tedious computations) that this difference scheme is consistent of order (∆x)2 + (∆t)2 . As
a consequence we obtain a linear system of equations for ~ui+1 . Thus this is an implicit scheme.
    (I + c² ((∆t)²/4) An) ~ui+1 = (2 I − 2 c² ((∆t)²/4) An) ~ui − (I + c² ((∆t)²/4) An) ~ui−1

Stability of the implicit scheme


To examine the stability consider eigenvalues λ > 0 and the corresponding eigenvectors and use the notation

    γ = c² ((∆t)²/4) λ > 0 .


    (~ui+1 − 2 ~ui + ~ui−1)/(∆t)² = −(c²/4) An (~ui+1 + 2 ~ui + ~ui−1)

Figure 4.25: Implicit finite difference approximation for the wave equation (stencil using three points at each of the time levels i − 1, i and i + 1)

Now examine the time discretization of α̈(t) = −c2 λ α(t).

    α(t + ∆t) − 2 α(t) + α(t − ∆t) = −c² λ ((∆t)²/4) ( α(t + ∆t) + 2 α(t) + α(t − ∆t) )
    (1 + γ) α(t + ∆t) = (2 − 2 γ) α(t) − (1 + γ) α(t − ∆t)
    [ α(t) ; α(t + ∆t) ] = [ 0 , 1 ; −1 , (2 − 2 γ)/(1 + γ) ] · [ α(t − ∆t) ; α(t) ]

Examine the eigenvalues µ1,2 of this matrix by solving the quadratic equation

    det [ −µ , 1 ; −1 , (2 − 2 γ)/(1 + γ) − µ ] = µ² − ((2 − 2 γ)/(1 + γ)) µ + 1 = 0

and observe that µ1 · µ2 = 1. Using the discriminant

    4 ((1 − γ)/(1 + γ))² − 4 < 0

conclude that the two values are complex conjugate. This implies |µ1| = |µ2| = 1 and the scheme is unconditionally stable.
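The unconditional stability allows time steps larger than ∆x/c. The sketch below is not part of the original notes; it shows the implicit time stepping for the 1D wave equation with assumed example parameters and a second order start with zero initial speed.

    % A minimal sketch of the implicit scheme for the 1D wave equation.
    n = 150; L = 3; c = 1; dx = L/(n+1); dt = 2*dx;     % dt may exceed dx/c here
    x = linspace(0,L,n+2)'; xi = x(2:n+1);
    An = spdiags(ones(n,1)*[-1 2 -1],[-1 0 1],n,n)/dx^2;
    B = speye(n) + c^2*dt^2/4*An;                       % matrix on the left hand side
    u0 = max(min(2*xi,2-2*xi),0);                       % initial displacement
    u1 = u0 - dt^2/2*c^2*(An*u0);                       % second order start, zero speed
    for k = 1:round(6/dt)
      u2 = B\((2*speye(n) - 2*c^2*dt^2/4*An)*u1 - B*u0);  % one implicit time step
      u0 = u1; u1 = u2;                                   % shift the time levels
    end%for
    plot(x,[0;u1;0])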

4.6.3 General Wave Type Problems


A more general form of a wave type, dynamic boundary value problem may be given in the form
    M d²/dt² ~u(t) = −A ~u(t) + f~(t)

and the corresponding explicit scheme is

    M (~ui+1 − 2 ~ui + ~ui−1) = −(∆t)² A ~ui + (∆t)² f~i
    M ~ui+1 = (2 M − (∆t)² A) ~ui − M ~ui−1 + (∆t)² f~i .


The scheme will be stable if


(∆t)2 < 4/λn
where λn is the largest of the generalized eigenvalues, i.e. nonzero solutions of

A ~v = λ M ~v .

Thus the explicit scheme is conditionally stable.
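To apply the stability condition one has to estimate the largest generalized eigenvalue. The sketch below is not part of the original notes; it uses eigs() on assumed example matrices to estimate λn and the resulting bound on the admissible time step.

    % A minimal sketch: estimate the largest generalized eigenvalue of A*v = lambda*M*v.
    n = 100; dx = 1/(n+1);
    A = spdiags(ones(n,1)*[-1 2 -1],[-1 0 1],n,n)/dx^2;  % assumed stiffness matrix
    M = spdiags(ones(n,1),0,n,n);                        % assumed (here trivial) mass matrix
    lambda_max = eigs(A, M, 1, 'lm');                    % largest generalized eigenvalue
    dt_max = sqrt(4/lambda_max);                         % stability bound (dt)^2 < 4/lambda_n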


An implicit scheme is given by

    M (~ui+1 − 2 ~ui + ~ui−1)/(∆t)² = −(1/4) (A ~ui+1 + 2 A ~ui + A ~ui−1) + f~i

or equivalently

    (M + ((∆t)²/4) A) ~ui+1 = 2 (M − ((∆t)²/4) A) ~ui − (M + ((∆t)²/4) A) ~ui−1 + (∆t)² f~i .

For a generalized eigenvalue λ with

    A ~v = λ M ~v

and vectors ~ui = αi ~v this translates to

    (1 + ((∆t)²/4) λ) αi+1 = 2 (1 − ((∆t)²/4) λ) αi − (1 + ((∆t)²/4) λ) αi−1 .

With the abbreviation γ = ((∆t)²/4) λ > 0 this can be written as a system

    [ αi ; αi+1 ] = [ 0 , 1 ; −1 , (2 − 2 γ)/(1 + γ) ] · [ αi−1 ; αi ]

and this scheme is unconditionally stable, as expected.


4.7 Nonlinear Problems


4.7.1 Partial Substitution or Picard Iteration
4–9 Example : Stretching of a Beam by a given Force and Variable Cross Section
In the above example we take into account that the cross section will change, due to the stretching of
the beam, i.e. we take Poisson contraction into account. The mathematical description is given in equa-
tion (1.14) (see page 17).
    −d/dx ( E A0(x) (1 − ν du(x)/dx)² du(x)/dx ) = f(x)   for 0 < x < L

with boundary conditions u(0) = 0 and

    E A0(L) (1 − ν du(L)/dx)² du(L)/dx = F .

This is a nonlinear boundary value problem for the unknown displacement function u(x) . We will use the
method of successive substitution from Section 3.1.4. Thus we proceed as follows:

• Pick a starting function u0(x). If possible use a good guess of the solution. For this example u0(x) = 0 will do.

• While changes are too large

– Compute the coefficient function

      a(x) = E A0(x) (1 − ν du(x)/dx)²

    and then solve the linear boundary value problem

      −d/dx ( a(x) du(x)/dx ) = f(x) .

This is the problem in Example 4–8 on page 268.


– Take this solution as your current solution and estimate its error by comparing with the previous
solution.

• Show your final solution.

The above algorithm is implemented in Octave/MATLAB.


BeamNL.m
nu = 0.3; L = 3; F = 0.2;
EA = @(x)(2-sin(x/L*pi))/2;
fRHS = @(x)zeros(size(x));
%%%%%%%%%%%%%%%%%%%%%%%%%%
N = 500; h = L/(N+1);                 % stepsize
x = (h:h:L)'; f = [fRHS(x)];
u = zeros(size(f)); g = h^2*f; g(N+1) = g(N+1)/2 + h*F;

cc = 1; cstring = 'rgbcmykrgbcmyk';   %% color counter and color sequence

figure(1)
clf; hold on; grid on; axis([0 3 0 1.4])  % setup of graphics
xlabel('position x'); ylabel('displacement u');

relDiffTol = 1e-5;                    % choose the relative difference tolerance
Differences = []; relDiff = 2*relDiffTol;
while relDiff > relDiffTol
  a = EA(x-h/2).*(1-nu*diff([0;u])/h).^2;     % compute coefficients
  di = [a(1:N)+a(2:N+1); a(N+1)];             % diagonal entries
  up = -a(2:N+1);                             % upper diagonal entries
  uNew = trisolve(di,up,g);                   % solve the linear symmetric system
  plot([0;x],[0;u],'linewidth',2,'color',cstring(cc)); cc = cc+1;
  pause(0.5)
  relDiff = max(abs(u-uNew))/max(abs(uNew))   % determine relative difference
  Differences = [Differences;relDiff];        % store the relative differences
  u = uNew;                                   % prepare for restart
end%while
axis([0 3 0 1.4])
xlabel('position x'); ylabel('displacement u'); hold off

figure(2)
semilogy(Differences)
xlabel('iterations'); ylabel('relative difference')

The above code required 12 iterations until the relative difference was smaller than 10⁻⁵. Find the graphical results in Figure 4.26(a). The final result has to be compared with Example 4–8 to verify that the beam is even weaker than before. The logarithm of the relative difference can be plotted as a function of the iteration number. The result in Figure 4.26(b) shows a straight line and this is consistent with linear convergence (see the footnote below).

Figure 4.26: Nonlinear beam stretching problem solved by successive substitution: (a) sequence of solutions, (b) the logarithmic differences as function of the number of iterations

4.7.2 Newton’s Method


To illustrate the use of Newton’s method we use a single example, the problem of the bending beam in
Section 1.4.2 on page 20.

Footnote: |diff| ≈ α0 qⁿ  ⟹  ln |diff| ≈ ln α0 + n · ln q


4–10 Example : Bending of a beam for small angles


To bend a horizontal beam we apply a small vertical force F2 at the right end point. We use equation (1.17)

    −α″(s) = (F2/(EI)) cos(α(s))   for 0 < s < L   and   α(0) = α′(L) = 0

shown in Section 1.4.2 (page 20). The boundary conditions α(0) = α′(L) = 0 describe the situation of a beam clamped at the left edge and no moment applied at the right end. For small angles α we use cos(α) ≈ 1 and find a linear problem with constant coefficients.

    −α″(s) = F2/(EI)   with   α(0) = α′(L) = 0 .
We use the discretization x_i = i L/n and α_i = α(x_i) for i = 1, 2, 3, . . . , n. Using the boundary conditions α(0) = α′(L) = 0 find a system of the form

    (1/h²) · [  2 −1                 ]   [ α_1     ]   [ f_1      ]
             [ −1  2 −1              ]   [ α_2     ]   [ f_2      ]
             [    −1  2 −1           ]   [ α_3     ]   [ f_3      ]
             [       .   .   .       ] · [ ...     ] = [ ...      ]
             [         −1  2 −1      ]   [ α_{n−1} ]   [ f_{n−1}  ]
             [            −1  1      ]   [ α_n     ]   [ f_n/2    ]

To take the boundary condition α′(L) = 0 into account proceed as in Example 4–7 on page 267. For this elementary problem the exact solution is known, α(s) = (F2/(EI)) (L s − s²/2), leading to a maximal angle of α(L) = (F2/(2 EI)) L². The exact solution is a polynomial of degree 2 and for this problem the approximate solution will coincide with the exact solution, i.e. no approximation error. According to Section 1.4.2 the maximal vertical deflection is given by y(L) = (F2/(3 EI)) L³. With the angle function α(s) the shape of the beam is given by

    ( x(l) , y(l) ) = ∫_0^l ( cos(α(s)) , sin(α(s)) ) ds .
Caused by the numerical integration by the trapezoidal rule (cumtrapz()) the maximal displacement will
not be reproduced exactly. One error contribution is the approximate integration by the trapezoidal rule and
another effect is using sin α for the integration, instead of only α. The code below implements the above
algorithm and verifies the result.
BeamLinear.m
EI = 1.0; L = 3; F = [0 0.1];
% F = [0 2]   % large force
N = 200;
%%%%%%%% no modifications necessary beyond this line
h = L/N; s = (h:h:L)';
%% build and solve the tridiagonal system
A = spdiags(ones(N,1)*[-1 2 -1],[-1 0 1],N,N)/h^2; A(N,N) = 1/h^2;
g = F(2)/EI*ones(size(s)); g(N) = g(N)/2;
alpha = A\g;
%% display the solution
x = cumtrapz([0;s],[1; cos(alpha)]); y = cumtrapz([0;s],[0; sin(alpha)]);
plot(x,y); xlabel('x'); ylabel('y');

MaximalAngles = [alpha(N), F(2)/(2*EI)*L^2]
MaximalDeflections = [max(y), F(2)/(3*EI)*L^3, trapz([0;s],[0;alpha])]


One may try to solve the above problem using partial substitution. Start with an initial angle α0 (s) and
then solve iteratively the linear problem
    −α″_{k+1}(s) = (F2/(EI)) cos(α_k(s)) .
For small forces F2 this will be successful, but for larger angles the answers are of no value. One has to use
Newton’s method.

4–11 Example : Bending of a beam, with Newton’s method


Since equation (1.17) on page 20

    −α″(s) = (F2/(EI)) cos(α(s))   for 0 < s < L   and   α(0) = α′(L) = 0

is a nonlinear equation use Newton's method (see Section 3.1.5) to find an approximate solution. Build on the linear approximation

    cos(α + φ) ≈ cos(α) − sin(α) · φ

and for a known function α(s) search a solution φ(s) of the boundary value problem

    −φ″(s) = α″(s) + (F2/(EI)) cos(α(s)) − (F2/(EI)) sin(α(s)) φ(s)   for 0 < s < L .

With the definitions f(s) = α″(s) + (F2/(EI)) cos(α(s)) and b(s) = (F2/(EI)) sin(α(s)) this is a differential equation of the form

    −φ″(s) + b(s) φ(s) = f(s) .
The boundary conditions to be satisfied are α(0) + φ(0) = 0 and α′(L) + φ′(L) = 0. Since α(0) = α′(L) = 0 this translates to φ(0) = φ′(L) = 0. To keep the second order consistency we again use the idea from Example 4–7 on page 267 (see the footnote below). The resulting system can be written in the form

    (1/h²) · [  2 −1                 ]   [ φ_1     ]   [ b_1 φ_1         ]   [ f_1     ]
             [ −1  2 −1              ]   [ φ_2     ]   [ b_2 φ_2         ]   [ f_2     ]
             [    −1  2 −1           ]   [ φ_3     ]   [ b_3 φ_3         ]   [ f_3     ]
             [       .   .   .       ] · [ ...     ] + [ ...             ] = [ ...     ]
             [         −1  2 −1      ]   [ φ_{n−1} ]   [ b_{n−1} φ_{n−1} ]   [ f_{n−1} ]
             [            −1  1      ]   [ φ_n     ]   [ b_n φ_n / 2     ]   [ f_n/2   ]

The contributions of the form bi φi have to be integrated into the matrix and thus on the diagonal find the
expressions h22 + bi . This matrix is symmetric, but not necessarily positive definite since the values of bi
might be negative. The new solution αnew (s) can then be computed by
αnew (s) = α(s) + φ(s) resp. αi → αi + φi .
With this new approximation then start the next iteration step for Newton’s method. This has to be repeated
until a solution is found with the desired accuracy.
This algorithm is implemented in MATLAB/Octave and the result is shown in Figure 4.27. Use the previous example as a reference problem: for very small forces F2 the resulting angles of the two computations should be close.
Footnote:

    0 = φ′(L) = (φ(L + h) − φ(L − h))/(2h) + O(h²)   ⟹   φ_{n+1} = φ_{n−1}
    (−φ_{n−1} + 2 φ_n − φ_{n+1})/h² + b_n φ_n = f_n   ⟹   (−φ_{n−1} + φ_n)/h² + (b_n/2) φ_n = f_n/2


Figure 4.27: Bending of a beam, solved by Newton's method

BeamNewton.m
EI = 1.0; L = 3; F = [0 0.1];   % try values of 0.5, 1.5 and 2
N = 200;
%%%%%%%% no modifications necessary beyond this line
h = L/N;                        % stepsize
s = (h:h:L)'; alpha = zeros(size(s));

%% build the tridiagonal matrix
A = spdiags(ones(N,1)*[-1 2 -1],[-1 0 1],N,N)/h^2;
A(N,N) = 1/h^2;
DiffTol = 1e-10; DiffAbs = 2*DiffTol;
while DiffAbs>DiffTol
  b = F(2)/EI*sin(alpha); b(N) = b(N)/2;
  f = F(2)/EI*cos(alpha); f(N) = f(N)/2;
  phi = (A+spdiags(b,0,N,N))\(-A*alpha+f);
  alpha = alpha+phi;
  DiffAbs = max(abs(phi));      % size of the Newton correction
  disp(sprintf('maximal angle = %7.4f, difference = %7.4e', max(abs(alpha)), DiffAbs))
end%while
x = cumtrapz([0;s],[1; cos(alpha)]); y = cumtrapz([0;s],[0; sin(alpha)]);
plot(x,y); xlabel('x'); ylabel('y'); grid on

The values of the differences in the above iterative algorithm are given by 4.5 · 10⁻¹, 2.9 · 10⁻², 1.1 · 10⁻⁴, 1.4 · 10⁻⁹ and 3.1 · 10⁻¹⁵; they show that the number of correct digits is doubled at each step, after an initial search for the solution. This is consistent with the quadratic convergence of Newton's method. ♦

When the codes in Examples 4–10 and 4–11 are used again with a larger value for the vertical force F2 = 2.0 we obtain the (at first) surprising results in Figure 4.28. This obvious problem is created by the geometric nonlinearity in the differential equation, i.e. the nonlinear expression cos(α(s)).

• The computations in Example 4–10 are based on the assumption of small angles and use the approximation cos α ≈ 1. For this computation this is certainly false and thus the results are invalid. Thus the result in Figure 4.28(a) can not be correct.

• The solution found by Newton's method in Example 4–11 is folding down and pulled backward. This might well be a physical solution, but not the one we were looking for. Thus the result in Figure 4.28(b) is correct, but useless. This is an illustration of the fact that nonlinear problems might have multiple solutions and we have to assure that we find the desired solution. When Newton's algorithm is applied to this problem the errors will at first get considerably larger and only after a few searching steps will the iteration start to converge towards one of the possible solutions. This illustrates again that Newton's method is not a good algorithm to search for a solution, but a good method to determine a known solution with good accuracy.

Figure 4.28: Bending of a beam with large force, solved as linear problem (a) and by Newton's method (b), using a zero initial displacement

4–12 Example : Bending of a beam, with parameterized Newton’s method


To solve the problem of a bending beam with a large vertical force and find the solution of a beam bent
upwards use a parameterization method. Instead of searching immediately the solution with the desired
force (F2 = 1.5) increase the force step by step from 0 to the desired value. Newton’s method will find the
solution for one given force and this solution will then be used as starting function for Newton’s method for
the next higher force. Find the intermediate and final results of the code below in Figure 4.29.

Figure 4.29: Nonlinear beam problem for a large force, solved by a parameterized Newton's method: (a) the iterations, (b) final displacement

BeamParam.m
EI = 1.0; L = 3; N = 200; FList = 0.25:0.25:2;
%%%%%%%% no modifications necessary beyond this line
h = L/N;                  % stepsize
s = (h:h:L)';
alpha1 = zeros(size(s));

%% build the tridiagonal matrix
A = spdiags(ones(N,1)*[-1 2 -1],[-1 0 1],N,N)/h^2; A(N,N) = 1/h^2;

errTol = 1e-10;
yPlot = zeros(length(s)+1,1); xPlot = [0;s];

for F2 = FList
  disp(sprintf('Use force F_2 = %4.2g',F2))
  errAbs = 2*errTol;
  while errAbs>errTol
    b = F2/EI*sin(alpha1); b(N) = b(N)/2;
    f = F2/EI*cos(alpha1); f(N) = f(N)/2;
    phi = -(A+spdiags(b,0,N,N))\(A*alpha1-f);
    alpha1 = alpha1 + phi;
    errAbs = max(abs(phi))
  end%while
  x = cumtrapz([0;s],[1; cos(alpha1)]); y = cumtrapz([0;s],[0; sin(alpha1)]);
  xPlot = [xPlot x]; yPlot = [yPlot y];
end%for
figure(1); clf
plot(xPlot,yPlot)
grid on; xlabel('x'); ylabel('y'); axis equal

figure(2);
x = cumtrapz([0;s],[1; cos(alpha1)]); y = cumtrapz([0;s],[0; sin(alpha1)]);
plot(x,y); grid on; xlabel('x'); ylabel('y');

In this chapter the codes in Table 4.6 were used.

filename function
BeamStretch.m code to solve Example 4–6
BeamStretchVariable.m code to solve Example 4–8
Plate.m code to solve the BVP in Section 4.4.2
Heat2DStatic.m code to solve the BVP in Section 4.4.2
HeatDynamic.m code to solve the IBVP in Section 4.5.3
HeatDynamicImplicit.m code to solve the IBVP in Section 4.5.4
PlateDynamic.m code to solve the IBVP in Section 4.5.7
Wave.m code to solve the IBVP in Section 4.6.1
BeamNL.m code to solve Example 4–9
BeamLinear.m code to solve the bending beam problem, Example 4–10
BeamNewton.m code to solve the bending beam problem, Example 4–11
BeamParam.m code to solve the bending beam problem, Example 4–12

Table 4.6: Codes for chapter 4


Bibliography
[AtkiHan09] K. Atkinson and W. Han. Theoretical Numerical Analysis. Number 39 in Texts in Applied
Mathematics. Springer, 2009.

[Butc03] J. Butcher. Numerical Methods for Ordinary Differential Equations. John Wiley & Sons, Ltd,
second edition, 2003.

[GoluVanLoan96] G. Golub and C. Van Loan. Matrix Computations. Johns Hopkins University Press, third
edition, 1996.

[GoluVanLoan13] G. Golub and C. Van Loan. Matrix Computations. Johns Hopkins University Press,
fourth edition, 2013.

[IsaaKell66] E. Isaacson and H. B. Keller. Analysis of Numerical Methods. John Wiley & Sons, 1966.
Republished by Dover in 1994.

[KhanKhan18] O. Khanmohamadi and E. Khanmohammadi. Four fundamental spaces of numerical analysis. Mathematics Magazine, 91(4):243–253, 2018.

[KnabAnge00] P. Knabner and L. Angermann. Numerik partieller Differentialgleichungen. Springer Ver-


lag, Berlin, 2000.

[Smit84] G. D. Smith. Numerical Solution of Partial Differential Equations: Finite Difference Methods. Oxford University Press, Oxford, third edition, 1986.

[Thom95] J. W. Thomas. Numerical Partial Differential Equations: Finite Difference Methods, volume 22
of Texts in Applied Mathematics. Springer Verlag, New York, 1995.

[Wlok82] J. Wloka. Partielle Differentialgleichungen. Teubner, Stuttgart, 1982.

Chapter 5

Calculus of Variations, Elasticity and Tensors

5.1 Prerequisites and Goals


After having worked through this chapter

• you should be familiar with the basic idea of the calculus of variations.

• you should be able to apply the Euler–Lagrange equations to problems in one or multiple variables.

• you should understand the notations of stress and strain and Hooke’s law.

• you should be able to formulate elasticity equations as minimization problems.

• you should recognize plane strain and plane stress situations.

• you should know about the stress and strain invariants.

In this chapter we assume that you are familiar with the following:

• Basic calculus for one and multiple variables.

• Taylor approximations of order one, i.e. linear approximations.

• Some classical mechanics.

5.2 Calculus of Variations


5.2.1 The Euler Lagrange Equation
The goal of this section is to “discover” the Euler–Lagrange equation and then apply it to some sample
problems. If the functions u(x) and f (x, u, u0 ) are given then the definite integral
    F(u) = ∫_a^b f(x, u(x), u′(x)) dx

is well defined. The main idea is now to examine the behavior of F (u) for different functions u. Search
for functions u(x), which will minimize the value of F (u). Technically speaking try to minimize the
functional F .


The basic idea is quite simple: if a function F(u) has a minimum, then its derivative has to vanish. But there is a major technical problem: the variable u is actually a function u(x), i.e. we have to minimize a functional. The techniques of the calculus of variations (see the footnote below) deal with this type of problem.

5–1 Definition : If a mapping F is defined for a set of functions X and returns a number as a result then F
is called a functional on the function space X.

Thus a functional is nothing but a function with a set of functions as domain of definition. It might help to compare typical functions and functionals.

                 domain of definition                                        range
  function       interval [a, b]                                             numbers R
  functional     continuous functions defined on [a, b], i.e. C([a, b], R)   numbers R

Here are a few examples of functionals.



    F(u) = ∫_0^π a(x) u(x)² dx                    defined on C([0, π], R) with u(0) = u(π) = 1
    F(u) = ∫_0^1 √(1 + u′(x)²) dx                 defined on C¹([0, 1], R) with u(0) = 1, u(1) = π
    F(u) = ∫_0^1 (u′(x)² − 1)² + u(x)² dx         defined on C¹([0, 1], R) with u(0) = u(1) = 0
    F(u) = ∫_0^1 a(x) u″(x)² dx                   defined on C²([0, 1], R)

The principal goal of the calculus of variations is to find extrema of functionals.

The fundamental lemma below is related to Hilbert space methods. As a very simple example examine vectors in Rⁿ to visualize the basic idea. A vector ~u ∈ Rⁿ equals ~0 if and only if the scalar product with all vectors ~φ ∈ Rⁿ vanishes, i.e.

    ⟨~u, ~φ⟩ = 0 for all ~φ ∈ Rⁿ   ⟺   ~u = ~0 .

Similarly a continuous function vanishes on an interval [a, b] iff its product with all functions φ integrates to 0, i.e. the role of the scalar product is taken over by an integration.

    ⟨~f, ~g⟩   ⟶   ∫_a^b f(x) · g(x) dx

5–2 Lemma : If u(x) is a continuous function on the interval a ≤ x ≤ b and


Z b
u(x) · φ(x) dx = 0
a

for all differentiable functions φ with φ(a) = φ(b) = 0 then

u(x) = 0 for all a ≤ x ≤ b .

3
Footnote: The calculus of variations was initiated with the problem of the brachistochrone by Johann Bernoulli (1667–1748) in 1696, see [HenrWann17]. Contributions by Jakob Bernoulli (1654–1705) and Leonhard Euler (1707–1783) followed. Joseph-Louis Lagrange (1736–1813) contributed extensively to the method.


Proof : Proceed by contradiction. Assume that u(x0) > 0 for some x0 between a and b. Since the function u(x) is continuous we know that u(x) > 0 on a (possibly small) interval x1 < x < x2. Now choose

    φ(x) = 0                        for x ≤ x1
    φ(x) = (x − x1)² (x − x2)²      for x1 ≤ x ≤ x2
    φ(x) = 0                        for x2 ≤ x .

Then conclude u(x) φ(x) ≥ 0 for all a ≤ x ≤ b and u(x0 ) φ(x0 ) > 0 and thus
    ∫_a^b u(x) · φ(x) dx = ∫_{x1}^{x2} u(x) · φ(x) dx > 0 .

This is a contradiction to the condition in the Lemma. Thus u(x) = 0 for a < x < b. As the function u is
continuous conclude that u(a) = u(b) = 0. 2

With a few more mathematical ideas the above result can be improved to obtain an important result for
the calculus of variations.

5–3 Theorem : The fundamental lemma of calculus of variations

• If u(x) is a continuous function for a ≤ x ≤ b and


Z b
u(x) · φ(x) dx = 0
a

for all infinitely often differentiable functions φ(x) with φ(a) = φ(b) = 0 then

u(x) = 0 for all a ≤ x ≤ b

• If u(x) is a differentiable function for a ≤ x ≤ b and


Z b
u(x) · φ0 (x) dx = 0
a

for all infinitely often differentiable functions φ(x) then

    u′(x) = 0 for all a ≤ x ≤ b   and   u(a) · φ(a) = u(b) · φ(b) = 0

Proof : Find the proof of the first statement in any good book on functional analysis or calculus of variations. For the second part use integration by parts, i.e.

    0 = ∫_a^b u(x) · φ′(x) dx = u(b) · φ(b) − u(a) · φ(a) − ∫_a^b u′(x) · φ(x) dx .

Considering all test functions φ(x) with φ(a) = φ(b) = 0 leads to the condition u′(x) = 0. We are then free to choose test functions with arbitrary values at the endpoints a and b, and thus conclude that u(a) · φ(a) = u(b) · φ(b) = 0. 2


For a given function f(x, u, u′) search for a function u(x) such that the functional

    F(u) = ∫_a^b f(x, u(x), u′(x)) dx

has a critical value for the function u. For the sake of readability use the notations (see the footnote below)

    fx(x, u, u′) = ∂/∂x f(x, u, u′) ,
    fu(x, u, u′) = ∂/∂u f(x, u, u′) ,
    fu′(x, u, u′) = ∂/∂u′ f(x, u, u′) .
If the functional F attains its minimal value at the function u(x) conclude that

g(ε) = F (u + ε φ) ≥ F (u) for all ε ∈ R and arbitrary functions φ(x) .

Thus the scalar function g(ε) has a minimum at ε = 0 and thus the derivative should vanish, i.e.

    d g(0)/dε = d/dε F(u + ε φ) |_{ε=0} = 0   for all functions φ .

To find the equations to be satisfied by the solution u(x) use linear approximations. For small values of ∆u
and ∆u0 use a Taylor approximation to conclude

    f(x, u + ∆u, u′ + ∆u′) ≈ f(x, u, u′) + ∂f(x, u, u′)/∂u ∆u + ∂f(x, u, u′)/∂u′ ∆u′
                           = f(x, u, u′) + fu(x, u, u′) ∆u + fu′(x, u, u′) ∆u′
    f(x, u(x) + ε φ(x), u′(x) + ε φ′(x)) = f(x, u(x), u′(x)) + ε fu(x, u(x), u′(x)) φ(x)
                                           + ε fu′(x, u(x), u′(x)) φ′(x) + O(ε²) .

Now examine the functional in question


    g(0) = F(u) = ∫_a^b f(x, u(x), u′(x)) dx
    g(ε) = F(u + ε φ) = ∫_a^b f(x, u(x) + ε φ(x), u′(x) + ε φ′(x)) dx
         ≈ ∫_a^b f(x, u(x), u′(x)) + ε fu(x, u(x), u′(x)) φ(x) + ε fu′(x, u(x), u′(x)) φ′(x) dx
         = F(u) + ε ∫_a^b fu(x, u(x), u′(x)) φ(x) + fu′(x, u(x), u′(x)) φ′(x) dx
Footnote: Observe the difference between total derivatives and partial derivatives, as illustrated by the example

    f(x, u, u′) = x² (u′)² + cos(x) · u
    ∂/∂x f(x, u(x), u′(x)) = 2 x (u′(x))² − sin(x) · u(x)
    ∂/∂u f(x, u(x), u′(x)) = cos(x)
    ∂/∂u′ f(x, u(x), u′(x)) = 2 x² u′(x)
    d/dx f(x, u(x), u′(x)) = 2 x (u′(x))² + 2 x² u′(x) u″(x) − sin(x) · u(x) + cos(x) · u′(x)


or

    d/dε F(u + ε φ) |_{ε=0} = ∫_a^b fu(x, u(x), u′(x)) φ(x) + fu′(x, u(x), u′(x)) φ′(x) dx .

This integral has to vanish for all functions φ(x); using the Fundamental Lemma 5–3 this leads to a necessary condition. An integration by parts leads to

    0 = ∫_a^b fu(x, u(x), u′(x)) φ(x) + fu′(x, u(x), u′(x)) φ′(x) dx
      = [ fu′(x, u(x), u′(x)) φ(x) ]_{x=a}^{b}
        + ∫_a^b ( fu(x, u(x), u′(x)) − d/dx fu′(x, u(x), u′(x)) ) φ(x) dx .

Since this expression has to vanish for all functions φ(x) the necessary conditions are

    ∫_a^b f(x, u(x), u′(x)) dx extremal   ⟹   d/dx fu′(x, u(x), u′(x)) = fu(x, u(x), u′(x))
                                               fu′(a, u(a), u′(a)) · φ(a) = 0
                                               fu′(b, u(b), u′(b)) · φ(b) = 0 .

The first condition is the Euler–Lagrange equation, the second and third condition are boundary condi-
tions. If the value u(a) is given and we are not free to choose, then we need φ(a) = 0 and the first boundary
condition is automatically satisfied. If we are free to choose u(a), then φ(a) need not vanish and we have
the condition
fu0 (a, u(a), u0 (a)) = 0 .
This is a natural boundary condition. A similar argument applies at the other endpoint x = b .

Now we have the central result for the calculus of variations in one variable.

5–4 Theorem : Euler–Lagrange equation


If a smooth function u(x) leads to a critical value of the functional
    F(u) = ∫_a^b f(x, u(x), u′(x)) dx

the differential equation

    d/dx fu′(x, u(x), u′(x)) = fu(x, u(x), u′(x))                                    (5.1)
has to be satisfied for a < x < b. This is usually a second order differential equation.
• If it is a critical value amongst all functions with prescribed boundary values u(a) and u(b),
use these to solve the differential equation.

• If you are free to choose the values of u(a) and/or u(b), then the natural boundary conditions

    fu′(a, u(a), u′(a)) = 0   and/or   fu′(b, u(b), u′(b)) = 0

  can be used.
3

If the functional is modified by boundary contributions


    F(u) = ∫_a^b f(x, u(x), u′(x)) dx − K1 u(b) − (K2/2) u(b)²


the Euler–Lagrange equation is not modified, but the natural boundary condition at x = b is given by

    fu′(b, u(b), u′(b)) = K1 + K2 u(b) .

The verification follows exactly the above procedure and is left as an exercise.

5–5 Example : Shortest connection between two points


Given two points (a, y1) and (b, y2) in a plane determine the function y = u(x), such that its graph connects the two points and the length of this curve is as short as possible. The length L of the curve is given by the integral

    L(u) = ∫_a^b √(1 + (u′(x))²) dx .
Using the notations of the above results determine the partial derivatives

Figure 5.1: Shortest connection between two points

    f(x, u, u′) = √(1 + (u′)²)
    fx(x, u, u′) = fu(x, u, u′) = 0
    fu′(x, u, u′) = u′/√(1 + (u′)²)

and the Euler–Lagrange equation (5.1) applied to this example leads to

    d/dx ( u′(x)/√(1 + (u′(x))²) ) = 0 .

The derivative of a function being zero everywhere implies that the function has to be constant and thus

    u′(x)/√(1 + (u′(x))²) = c

and conclude that u0 (x) has to be constant. Thus the optimal solution is a straight line. This should not be a
surprise.
If we are free to choose the point of contact along the vertical line at x = b we may use φ(b) ≠ 0 and thus find the natural boundary condition

    u′(b)/√(1 + (u′(b))²) = 0 .

This implies u0 (b) = 0 and thus u0 (x) = 0 for all x. This leads to a horizontal line, which is obviously the
shortest connection from the given height at x = a to the vertical line at x = b . ♦


5–6 Example : String under transversal load


The vertical deformation of a horizontal string can be given by a function y = u(x) for 0 ≤ x ≤ L. Due to this deformation u(x) the string will be lengthened by

    ∆L = ∫_0^L √(1 + (u′(x))²) dx − L

and due to the constant horizontal force T this requires an energy of T ∆L. The applied external, vertical force density f(x) can be modeled by a corresponding potential energy density of −f(x) u(x). Now the total energy is given by

    E(u) = ∫_0^L F(u(x), u′(x)) dx = ∫_0^L T √(1 + (u′(x))²) − f(x) · u(x) dx .

For this functional find the partial derivatives

    F(u, u′) = T √(1 + (u′)²) − f · u
    Fu(u, u′) = −f
    Fu′(u, u′) = T u′/√(1 + (u′)²)

and thus the Euler–Lagrange equation (5.1) applied to this example leads to

    −d/dx ( T u′(x)/√(1 + (u′(x))²) ) = f(x) .

Since the string is attached at both ends, supplement this differential equation with the boundary conditions
u(0) = u(L) = 0 .
If we know a priori that the slope u′(x) along the string is small, use a linear approximation (see the footnote below)

    √(1 + (u′(x))²) ≈ 1 + (1/2) (u′(x))² .

With this the change of length ∆L of the string is given by

    ∆L = ∫_0^L √(1 + (u′(x))²) dx − L ≈ ∫_0^L (1/2) (u′(x))² dx .

Now the total energy can be written in the form

    E(u) = T ∆L + Epot = ∫_0^L (T/2) (u′(x))² − f(x) · u(x) dx

and the resulting Euler–Lagrange equation is given by

    −T u″(x) = f(x) .

Footnote: Use the Taylor approximation √(1 + z) ≈ 1 + z/2.


5–7 Example : Bending of a beam


In Section 1.4 find the description of a bending beam. If α(s) is the angle at a position (x(s), y(s)) construct the curve from the function α(s) with an integral

    ~x(l) = ( x(l) , y(l) ) = ∫_0^l ( cos(α(s)) , sin(α(s)) ) ds   for 0 ≤ l ≤ L .

The elastic energy stored in the bent beam is given by

    Uelast = ∫_0^L (1/2) EI (α′(s))² ds .

An external force F~ = (F1, F2) at the right end point ~x(L) has to satisfy

    F~ = −grad Upot = −( ∂Upot/∂x , ∂Upot/∂y )

and is thus described by the potential energy Upot(x, y) with

    Upot(x(L), y(L)) = −F1 x(L) − F2 y(L) = −F1 ∫_0^L cos(α(s)) ds − F2 ∫_0^L sin(α(s)) ds .

The total energy Utot as a functional of the angle function α(s) is given by

    Utot(α) = Uelast(α) + Upot(~x(L)) = ∫_0^L (1/2) EI (α′(s))² ds + Upot(x(L), y(L))
            = ∫_0^L (1/2) EI (α′(s))² − F1 cos(α(s)) − F2 sin(α(s)) ds .
The physical situation is characterized as a minimum of this functional, using Bernoulli's principle. For the expression to be integrated find the partial derivatives

    F(α, α′) = (1/2) EI (α′)² − F1 cos(α) − F2 sin(α)
    Fα(α, α′) = F1 sin(α) − F2 cos(α)
    Fα′(α, α′) = EI α′

and the Euler–Lagrange equation for this problem is given by

    ( EI α′(s) )′ = F1 sin(α(s)) − F2 cos(α(s)) .

This is identical to equation (1.16) (page 19). For a beam clamped at the left end and no moment at the right end point find the boundary conditions α(0) = α′(L) = 0. The second is a natural boundary condition, as α(L) is not prescribed. This is a nonlinear, second order boundary value problem. ♦

5.2.2 Quadratic Functionals and Second Order Linear Boundary Value Problems
If for given functions a(x), b(x) and g(x) the functional
    F(u) = ∫_{x0}^{x1} (1/2) a(x) (u′(x))² + (1/2) b(x) u(x)² + g(x) · u(x) dx        (5.2)

has to be minimized, then obtain the Euler–Lagrange equation

    d/dx fu′ = fu
    d/dx ( a(x) du(x)/dx ) = b(x) u(x) + g(x) .


This is a linear, second order differential equation which has to be supplemented with appropriate boundary
conditions. If the value at one of the endpoints is given then this is called a Dirichlet boundary condition.
If we are free to choose the value at the boundary then this is called a Neumann boundary condition.
Theorem 5–4 implies that the second situation leads to a natural boundary condition
    a(x) du/dx = 0   for x = x0 or x = x1 .

If we wish to consider non-homogeneous boundary conditions

    a(x) du/dx = r(x)   for x = x0 or x = x1

then the functional has to be supplemented by boundary terms,

    F(u) + r(x0) u(x0) − r(x1) u(x1) .

Thus the above approach shows that many second order differential equations correspond to extremal points of a properly chosen functional. Many physical, mechanical and electrical problems lead to this type of equation, as can be seen in Table 5.1 (Source: [OttoPete92, p. 63]).
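The correspondence between the minimization of (5.2) and the Euler–Lagrange boundary value problem can be checked numerically. The sketch below is not part of the original notes; it uses the assumed example data a(x) = 1, b(x) = 1, g(x) = x on [0, 1] with u(0) = u(1) = 0, solves the Euler–Lagrange equation by finite differences and verifies that perturbations of the solution increase the value of the discretized functional.

    % A minimal sketch: discrete minimizer of (5.2) versus perturbed functions.
    N = 100; h = 1/(N+1); x = (h:h:1-h)'; g = x;
    A = spdiags(ones(N,1)*[-1 2 -1],[-1 0 1],N,N)/h^2 + speye(N);   % -u'' + u
    u = A\(-g);                                    % Euler-Lagrange: -u'' + u = -g
    F = @(v) sum(0.5*(diff([0;v;0])/h).^2)*h + sum(0.5*v.^2 + g.*v)*h;
    phi = sin(pi*x);                               % an admissible perturbation
    disp([F(u), F(u + 0.01*phi), F(u + 0.1*phi)])  % F(u) is the smallest value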

5.2.3 The Divergence Theorem and its Consequences


The well known fundamental theorem of calculus for functions of one variable

    ∫_a^b f′(x) dx = −f(a) + f(b)

can be extended to functions of multiple variables. If Ω ⊂ Rⁿ is a "nice" domain with boundary ∂Ω and outer unit normal vector ~n then the corresponding result is called the divergence theorem. For domains Ω ⊂ R² use

    ∬_Ω div ~v dA = ∬_Ω ∂v1/∂x + ∂v2/∂y dA = ∮_{∂Ω} ⟨~v, ~n⟩ ds

and if Ω ⊂ R³ then the notation is

    ∭_Ω div ~v dV = ∬_{∂Ω} ⟨~v, ~n⟩ dA
Ω ∂Ω

where dV is the standard volume element and dA the surface element. The usual rule to differentiate
products of two functions leads to

∇ · (f ~v ) = (∇f ) · ~v + f (∇ · ~v )
div(f ~v ) = hgrad f, ~v i + f (div ~v ) .

Using the divergence theorem conclude


    ∬_Ω f (div ~v) dA = ∬_Ω div(f ~v) − ⟨grad f, ~v⟩ dA
                      = ∮_{∂Ω} f ⟨~v, ~n⟩ ds − ∬_Ω ⟨grad f, ~v⟩ dA .

This formula is referred to as Green–Gauss theorem or Green’s identity and is similar to integration by
parts for functions of one variable
    ∫_a^b f · g′ dx = −f(a) · g(a) + f(b) · g(b) − ∫_a^b f′ · g dx


  differential equation                    problem description                         constitutive law

  d/dx (A k dT/dx) + Q = 0                 one-dimensional heat flow                   Fourier's law  q = −k dT/dx
      T = temperature, A = area, k = thermal conductivity, Q = heat supply

  d/dx (A E du/dx) + b = 0                 axially loaded elastic bar                  Hooke's law  σ = E du/dx, σ = stress
      u = displacement, A = area, E = Young's modulus, b = axial loading

  d/dx (S dw/dx) + p = 0                   transversely loaded flexible string
      w = deflection, S = string force, p = lateral loading

  d/dx (A D dc/dx) + Q = 0                 one dimensional diffusion                   Fick's law  q = −D dc/dx, q = flux
      c = concentration, A = area, D = diffusion coefficient, Q = external supply

  d/dx (A γ dV/dx) + Q = 0                 one dimensional electric current            Ohm's law  q = −γ dV/dx, q = charge flux
      V = voltage, A = area, γ = electric conductivity, Q = charge supply

  d/dx (A D²/(32µ) dp/dx) + Q = 0          laminar flow in a pipe (Poiseuille flow)    q = D²/(32µ) dp/dx, q = volume flux
      p = pressure, A = area, D = diameter, µ = viscosity, Q = fluid supply
Table 5.1: Examples of second order differential equations


or, if spelled out for the derivative g′,

    ∫_a^b f · g″ dx = −f(a) · g′(a) + f(b) · g′(b) − ∫_a^b f′ · g′ dx .

For finite elements and calculus of variations the divergence theorem is most often used in the form below.

    ∬_Ω f (div grad g) dA = ∮_{∂Ω} f ⟨grad g, ~n⟩ ds − ∬_Ω ⟨grad f, grad g⟩ dA
    ∬_Ω f ∆g dA = ∮_{∂Ω} f ⟨∇g, ~n⟩ ds − ∬_Ω ⟨∇f, ∇g⟩ dA

5.2.4 Quadratic Functionals and Second Order Boundary Value Problems in 2 Dimensions
We want to modify the functional in (5.2) to a 2 dimensional setting and examine the boundary value
problem resulting from the Euler–Lagrange equations.
Consider a domain Ω ⊂ R2 with a boundary ∂Ω = Γ1 ∪ Γ2 consisting of two disjoint parts Γ1 and Γ2 .
For given functions a, b, f, g_1 and g_2 (all depending on x and y) we search a yet unknown function u, such that the functional
\[ F(u) = \iint_\Omega \frac{1}{2}\,a\,\langle\nabla u,\nabla u\rangle + \frac{1}{2}\,b\,u^2 + f\cdot u\,dA - \int_{\Gamma_2} g_2\,u\,ds \tag{5.3} \]
is minimal amongst all functions u which satisfy
\[ u(x,y) = g_1(x,y) \quad\text{for } (x,y)\in\Gamma_1\,. \]
To find the necessary equations assume that \phi and \nabla\phi are small and use the approximations
\begin{align*}
(u+\phi)^2 &= u^2 + 2\,u\,\phi + \phi^2 \approx u^2 + 2\,u\,\phi\\
\langle\nabla(u+\phi),\nabla(u+\phi)\rangle &= \langle\nabla u,\nabla u\rangle + 2\,\langle\nabla u,\nabla\phi\rangle + \langle\nabla\phi,\nabla\phi\rangle \approx \langle\nabla u,\nabla u\rangle + 2\,\langle\nabla u,\nabla\phi\rangle
\end{align*}
and Green's identity to conclude
\begin{align*}
F(u+\phi) - F(u) &\approx \iint_\Omega a\,\langle\nabla u,\nabla\phi\rangle + b\,u\,\phi + f\cdot\phi\,dA - \int_{\Gamma_2} g_2\,\phi\,ds\\
 &= \iint_\Omega \bigl(-\nabla\cdot(a\,\nabla u) + b\,u + f\bigr)\,\phi\,dA + \int_\Gamma a\,\langle\vec n,\nabla u\rangle\,\phi\,ds - \int_{\Gamma_2} g_2\,\phi\,ds\\
 &= \iint_\Omega \bigl(-\nabla\cdot(a\,\nabla u) + b\,u + f\bigr)\,\phi\,dA + \int_{\Gamma_2} \bigl(a\,\langle\vec n,\nabla u\rangle - g_2\bigr)\,\phi\,ds\,.
\end{align*}
The test function \phi is arbitrary, but has to vanish on \Gamma_1. If the functional F is minimal for the function u then the above integral has to vanish for all test functions \phi. First consider only test functions that vanish on \Gamma_2 too and use the fundamental lemma (a modification of Theorem 5–3) to conclude that the expression in the parentheses in the integral over the domain \Omega has to be zero. Then use arbitrary test functions \phi to conclude that the expression in the integral over \Gamma_2 has to vanish too. Thus the resulting linear partial differential equation with boundary conditions is given by
\[ \begin{aligned}
\nabla\cdot(a\,\nabla u) - b\,u &= f && \text{for } (x,y)\in\Omega\\
u &= g_1 && \text{for } (x,y)\in\Gamma_1\\
a\,\langle\nabla u,\vec n\rangle &= g_2 && \text{for } (x,y)\in\Gamma_2
\end{aligned} \tag{5.4} \]


The functions a, b, f and g_i are known and we have to determine the solution u, all depending on the independent variables (x,y)\in\Omega. The vector \vec n is the outer unit normal vector. The expression
\[ \langle\nabla u,\vec n\rangle = \frac{\partial u}{\partial x}\,n_1 + \frac{\partial u}{\partial y}\,n_2 = \frac{\partial u}{\partial\vec n} \]
determines the directional derivative of the function u in the direction of the outer normal \vec n.

A list of typical applications of elliptic equations of second order is shown in Table 5.2, see [Redd84].
The static heat conduction problem in Section 1.1.5 (page 11) is another example. A description of the
ground water flow problem is given in [OttoPete92]. This table clearly illustrates the importance of the
above type of problem.

5–8 Example : Deformation of a membrane


When a small vertical displacement of a thin membrane is given by z = u(x,y), where (x,y)\in\Omega\subset\mathbb{R}^2, we can compute the elastic energy stored in the membrane by
\[ E_{elast} = \frac{\tau}{2}\iint_\Omega \|\nabla u\|^2\,dA = \frac{\tau}{2}\iint_\Omega u_x^2 + u_y^2\,dA\,, \]
where we assume that u = 0 on the boundary \partial\Omega. Now apply a vertical force to the membrane given by a force density function f(x,y) (units: N/m^2). To formulate this we introduce a potential energy
\[ E_{pot} = -\iint_\Omega f\cdot u\,dA\,. \]
Based on the previous results minimizing the total energy
\[ E = E_{elast} + E_{pot} = \iint_\Omega \frac{\tau}{2}\,(u_x^2 + u_y^2) - f\,u\,dA \]
leads to the Euler–Lagrange equation
\[ \tau\,\Delta u = \nabla\cdot(\tau\,\nabla u) = -f\,. \]

This corresponds to the model problem with equation (1.11) on page 15. ♦
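To illustrate the connection between the minimization problem and the differential equation, a minimal finite difference sketch (deliberately not the finite element approach developed in these notes) for -τ Δu = f on the unit square with u = 0 on the boundary might look as follows; the values τ = 1, f = 1 and the grid size are arbitrary choices.

% minimal sketch: solve  -tau*(u_xx + u_yy) = f  on the unit square, u = 0 on the boundary
% by standard finite differences; tau = 1 and f = 1 are arbitrarily chosen values
n = 50; h = 1/(n+1); tau = 1; f = 1;
e  = ones(n,1);
D2 = spdiags([e, -2*e, e], [-1 0 1], n, n)/h^2;   % 1D second derivative matrix
I  = speye(n);
A  = -tau*(kron(I,D2) + kron(D2,I));              % 2D Laplacian, Dirichlet BC built in
u  = reshape(A\(f*ones(n^2,1)), n, n);            % solve and reshape to a grid
[x,y] = meshgrid(h*(1:n));
mesh(x, y, u); xlabel('x'); ylabel('y'); zlabel('u')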

5–9 Example : Vibration of a membrane


Use Newton's law in the above problem for the vertical acceleration \ddot u. If an external device is applying a force f on the membrane in a static situation, then the stretched membrane is applying the opposite force to the external device. If there is no external device this force leads to an acceleration. Thus conclude
\[ f = -\rho\,\ddot u\,, \]
where \rho is the mass density (units kg/m^2). The resulting equation is then
\[ \rho\,\ddot u - \nabla\cdot(\tau\,\nabla u) = 0\,, \]
which corresponds to equation (1.10) on page 15.


The corresponding eigenvalue equation (1.12) leads to harmonic oscillations as solutions.

• General situation: primary variable u, material constant a, source variable f; secondary variables ∂u/∂x, ∂u/∂y.
• Heat transfer: temperature T, conductivity k, heat source Q; heat flow density \vec q = -k\,\nabla T.
• Diffusion: concentration c, diffusion coefficient D, external supply Q; flux \vec q = -D\,\nabla c.
• Electrostatics: scalar potential Φ, dielectric constant ε, charge density ρ; electric flux density D.
• Magnetostatics: magnetic potential Φ, permeability ν; magnetic flux density B.
• Transverse deflection of elastic membrane: transverse deflection u, tension of membrane T, transversely distributed load; normal force q.
• Torsion of a bar: warping function φ, material constant 1, source 0; stresses  \tau_{xz} = \frac{E\alpha}{2(1+\nu)}\,(-y + \frac{\partial\phi}{\partial x}),  \tau_{yz} = \frac{E\alpha}{2(1+\nu)}\,(x + \frac{\partial\phi}{\partial y}).
• Irrotational flow of an ideal fluid: stream function Ψ or velocity potential Φ, density ρ, mass production σ (usually zero); velocity components (u, v)^T obtained from the derivatives of Ψ or Φ, e.g. ∂Φ/∂x = u, ∂Φ/∂y = v.
• Ground-water flow: piezometric head Φ, permeability K, recharge Q (or pumping -Q); velocities u = -K\,\frac{\partial\Phi}{\partial x}, v = -K\,\frac{\partial\Phi}{\partial y}, seepage q = K\,\frac{\partial\Phi}{\partial n}.

Table 5.2: Some examples of Poisson's equation -∇·(a ∇u) = f; each entry lists the field of application, the primary variable u, the material constant a, the source variable f and the secondary variables

5.2.5 Nonlinear Problems and Euler–Lagrange Equations for Systems


If a functional J(u) is given in the form
\[ J(u) = \iint_\Omega F(u,\nabla u)\,dA\,, \]
apply a small perturbation \varphi to the argument u and use a linear approximation to find
\begin{align*}
J(u+\varphi) &= \iint_\Omega F(u+\varphi,\nabla u+\nabla\varphi)\,dA\\
 &= \iint_\Omega F(u,\nabla u) + F_u(u,\nabla u)\,\varphi + F_{\nabla u}(u,\nabla u)\,\nabla\varphi\,dA + O(|\varphi|^2,\|\nabla\varphi\|^2)
\end{align*}
with the notations
\[ F_{\nabla u} = \Bigl(\frac{\partial F}{\partial u_x}\,,\,\frac{\partial F}{\partial u_y}\Bigr) \quad\text{and}\quad F_{\nabla u}\,\nabla\varphi = \frac{\partial F}{\partial u_x}\,\frac{\partial\varphi}{\partial x} + \frac{\partial F}{\partial u_y}\,\frac{\partial\varphi}{\partial y}\,. \]
If the minimum of the functional J(u) is attained at the function u, conclude that for all permissible test functions \varphi the necessary condition
\begin{align*}
0 &= \iint_\Omega F_{\nabla u}(u,\nabla u)\,\nabla\varphi + F_u(u,\nabla u)\,\varphi\,dA\\
 &= \oint_{\partial\Omega} \varphi\,F_{\nabla u}(u,\nabla u)\cdot\vec n\,ds + \iint_\Omega -\nabla\cdot\bigl(F_{\nabla u}(u,\nabla u)\bigr)\,\varphi + F_u(u,\nabla u)\,\varphi\,dA
\end{align*}
has to be satisfied. Since this expression vanishes for all test functions \varphi, use the fundamental lemma to find the Euler–Lagrange equation
\[ -\nabla\cdot\bigl(F_{\nabla u}(u,\nabla u)\bigr) + F_u(u,\nabla u) = 0 \quad\text{in } \Omega \tag{5.5} \]
and, on the sections of the boundary where the test function \varphi does not vanish, the natural boundary condition
\[ \langle F_{\nabla u}(u,\nabla u),\vec n\rangle = 0\,. \]

For the example 5–8 of a deformed membrane the above leads to
\begin{align*}
F(u,\nabla u) &= \frac{\tau}{2}\,(u_x^2 + u_y^2) - f\cdot u\\
F_u(u,\nabla u) &= -f\\
F_{\nabla u}(u,\nabla u) &= \Bigl(\frac{\partial F}{\partial u_x}\,,\,\frac{\partial F}{\partial u_y}\Bigr) = \tau\,(u_x,u_y) = \tau\,\nabla u
\end{align*}
and the Euler–Lagrange equation is given by
\[ -\nabla\cdot\bigl(F_{\nabla u}(u,\nabla u)\bigr) + F_u(u,\nabla u) = -\nabla\cdot(\tau\,\nabla u) - f = 0\,. \]


5–10 Example : Plateau problem


If a surface in \mathbb{R}^3 is described by a function z = u(x,y) where (x,y)\in\Omega\subset\mathbb{R}^2, then the total area is given by the functional
\[ J(u) = \iint_\Omega \sqrt{1 + \|\operatorname{grad} u\|^2}\,dA = \iint_\Omega \sqrt{1 + u_x^2 + u_y^2}\,dA\,. \]
If the goal is to minimize the total area use calculus of variations. To generate the Euler–Lagrange equations use the Taylor approximation \sqrt{u+z} \approx \sqrt{u} + z/(2\sqrt{u}) and conclude
\begin{align*}
J(u+\varphi) &= \iint_\Omega \sqrt{1 + (u_x+\varphi_x)^2 + (u_y+\varphi_y)^2}\,dA\\
 &\approx \iint_\Omega \sqrt{1 + u_x^2 + 2\,u_x\,\varphi_x + u_y^2 + 2\,u_y\,\varphi_y}\,dA\\
 &\approx J(u) + \iint_\Omega \frac{1}{\sqrt{1 + u_x^2 + u_y^2}}\,(u_x\,\varphi_x + u_y\,\varphi_y)\,dA\,.
\end{align*}
If the functional J attains its minimum at the function u conclude that for all test functions \varphi
\begin{align*}
0 &= \iint_\Omega \frac{1}{\sqrt{1 + u_x^2 + u_y^2}}\,\nabla u\cdot\nabla\varphi\,dA\\
 &= \oint_{\partial\Omega} \frac{1}{\sqrt{1 + u_x^2 + u_y^2}}\,\langle\vec n,\nabla u\rangle\,\varphi\,ds - \iint_\Omega \nabla\cdot\Bigl(\frac{1}{\sqrt{1 + u_x^2 + u_y^2}}\,\nabla u\Bigr)\,\varphi\,dA\,.
\end{align*}
If the values z = u(x,y) are known on the boundary \partial\Omega\subset\mathbb{R}^2 then the test functions \varphi vanish on the boundary and the Euler–Lagrange equation is given by
\[ -\nabla\cdot\Bigl(\frac{1}{\sqrt{1 + u_x^2 + u_y^2}}\,\nabla u\Bigr) = 0\,. \]
This is a nonlinear second order differential equation. The identical result may be generated by
\begin{align*}
F(u,\nabla u) &= \sqrt{1 + u_x^2 + u_y^2}\\
F_u(u,\nabla u) &= 0\\
F_{\nabla u}(u,\nabla u) &= \frac{1}{\sqrt{1 + u_x^2 + u_y^2}}\,(u_x,u_y)
\end{align*}
and then use the Euler–Lagrange equation in the form (5.5). ♦

The above idea can also be applied to functionals depending on more than one unknown function. Examine a domain \Omega\subset\mathbb{R}^2 with a boundary \partial\Omega consisting of two parts. On \Gamma_1 the values of u_1 and u_2 are given and on \Gamma_2 these values are free. Then minimize a functional of the form
\[ J(u_1,u_2) = \iint_\Omega F(u_1,u_2,\nabla u_1,\nabla u_2)\,dA - \int_{\Gamma_2} u_1\cdot g_1 + u_2\cdot g_2\,ds\,. \tag{5.6} \]

Apply small perturbations \varphi_1 and \varphi_2 and use a linear approximation (with the notation F_{\nabla u} = (\frac{\partial F}{\partial u_x}, \frac{\partial F}{\partial u_y})^T) to find
\begin{align*}
J &= J(u_1+\varphi_1,u_2+\varphi_2)\\
 &= \iint_\Omega F(u_1+\varphi_1,\nabla u_1+\nabla\varphi_1,u_2+\varphi_2,\nabla u_2+\nabla\varphi_2)\,dA - \int_{\Gamma_2} (u_1+\varphi_1)\cdot g_1 + (u_2+\varphi_2)\cdot g_2\,ds\\
 &= J(u_1,u_2) + \iint_\Omega F_{u_1}(\ldots)\,\varphi_1 + \langle F_{\nabla u_1}(\ldots),\nabla\varphi_1\rangle + F_{u_2}(\ldots)\,\varphi_2 + \langle F_{\nabla u_2}(\ldots),\nabla\varphi_2\rangle\,dA\\
 &\quad - \int_{\Gamma_2} \varphi_1\cdot g_1 + \varphi_2\cdot g_2\,ds + O(|\varphi_1|^2,\|\nabla\varphi_1\|^2,|\varphi_2|^2,\|\nabla\varphi_2\|^2)\,.
\end{align*}

Find necessary conditions if the minimum of the functional J(u_1,u_2) is attained at (u_1,u_2). Conclude that for all permissible test functions \varphi_1 and \varphi_2 vanishing on the boundary \partial\Omega the following integral has to equal zero.
\begin{align*}
0 &= \iint_\Omega F_{u_1}(\ldots)\,\varphi_1 + \langle F_{\nabla u_1}(\ldots),\nabla\varphi_1\rangle + F_{u_2}(\ldots)\,\varphi_2 + \langle F_{\nabla u_2}(\ldots),\nabla\varphi_2\rangle\,dA\\
 &= \iint_\Omega \bigl(F_{u_1}(\ldots) - \operatorname{div}(F_{\nabla u_1}(\ldots))\bigr)\,\varphi_1 + \bigl(F_{u_2}(\ldots) - \operatorname{div}(F_{\nabla u_2}(\ldots))\bigr)\,\varphi_2\,dA
\end{align*}
Since this expression has to vanish for all test functions \varphi_1 and \varphi_2, use the fundamental lemma to arrive at a system of Euler–Lagrange equations.

5–11 Result : A minimizer u_1 and u_2 of the functional J(u_1,u_2) of the form (5.6) solves the system of Euler–Lagrange equations
\begin{align}
\operatorname{div}\bigl(F_{\nabla u_1}(u_1,u_2,\nabla u_1,\nabla u_2)\bigr) &= F_{u_1}(u_1,u_2,\nabla u_1,\nabla u_2) \tag{5.7}\\
\operatorname{div}\bigl(F_{\nabla u_2}(u_1,u_2,\nabla u_1,\nabla u_2)\bigr) &= F_{u_2}(u_1,u_2,\nabla u_1,\nabla u_2)\,. \tag{5.8}
\end{align}
Using these equations with test functions \varphi_1 and \varphi_2 vanishing on \Gamma_1, but not necessarily on \Gamma_2, find
\[ 0 = \int_{\Gamma_2} \varphi_1\,F_{\nabla u_1}(\ldots)\cdot\vec n - \varphi_1\cdot g_1 + \varphi_2\,F_{\nabla u_2}(\ldots)\cdot\vec n - \varphi_2\cdot g_2\,ds\,. \]
This leads to the natural boundary conditions on \Gamma_2:
\begin{align*}
\langle F_{\nabla u_1}(u_1,u_2,\nabla u_1,\nabla u_2),\vec n\rangle &= g_1\\
\langle F_{\nabla u_2}(u_1,u_2,\nabla u_1,\nabla u_2),\vec n\rangle &= g_2
\end{align*}
This method can be used to derive the differential equations governing elastic deformations of solids, see Section 5.9.4 for the PDE governing the plane stress situation.

5.2.6 Hamilton’s principle of Least Action


The notes in this section are mostly taken from [VarFEM]. The starting point was the classical book by Weinberger [Wein74, p. 72].
Examine a system of particles subject to given geometric constraints and otherwise influenced by forces which are functions of the positions of the particles only. In addition we require the system to be conservative, i.e. the forces can be written as the gradient of a potential energy V of the system. We denote the n degrees of freedom of the system by \vec q = (q_1, q_2, \ldots, q_n)^T. The kinetic energy T of the system is the extension of the basic formula E = \frac{1}{2}\,m\,v^2. With those form the Lagrange function L of the system by
\[ L(\vec q,\dot{\vec q}) = T(\vec q,\dot{\vec q}) - V(\vec q)\,. \]
The fundamental principle of Hamilton can now be formulated:


The actual motion of a system with the above Lagrangian L is such as to render the (Hamilton's) integral
\[ I = \int_{t_1}^{t_2} (T - V)\,dt = \int_{t_1}^{t_2} L(\vec q,\dot{\vec q})\,dt \]
an extremum with respect to all twice differentiable functions \vec q(t). Here t_1 and t_2 are arbitrary times.

This is a situation where we (usually) have multiple dependent variables q_i and thus the Euler–Lagrange equations imply
\[ \frac{d}{dt}\,\frac{\partial L}{\partial\dot q_i} = \frac{\partial L}{\partial q_i} \quad\text{for } i = 1, 2, \ldots, n\,. \]
These differential equations apply to many mechanical setups, as the following examples will illustrate.

5–12 Example : Single pendulum


For a single pendulum of length l (a mass m on a mass-less rod, the angle ϕ measured from the vertical) find the kinetic and potential energy
\[ T(\varphi,\dot\varphi) = \frac{1}{2}\,m\,l^2\,(\dot\varphi)^2 \quad\text{and}\quad V(\varphi) = -m\,l\,g\,\cos\varphi \]
and thus the Lagrange function L
\[ L = T - V = \frac{1}{2}\,m\,l^2\,(\dot\varphi)^2 + m\,l\,g\,\cos\varphi\,. \]
The only degree of freedom is q_1 = \varphi and the functional to be minimised is
\[ \int_a^b \frac{1}{2}\,m\,l^2\,(\dot\varphi)^2 + m\,l\,g\,\cos\varphi\,dt\,. \]
The Euler–Lagrange equation leads to
\begin{align*}
\frac{d}{dt}\,\frac{\partial L}{\partial\dot\varphi} &= \frac{\partial L}{\partial\varphi}\\
\frac{d}{dt}\,m\,l^2\,\dot\varphi &= -m\,l\,g\,\sin\varphi\\
\ddot\varphi &= -\frac{g}{l}\,\sin\varphi\,.
\end{align*}
This is the well known differential equation describing a pendulum. One can certainly derive the same equation using Newton's law. ♦
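A numerical companion to this example (not part of the original notes; l, g and the initial angle are arbitrarily chosen) integrates the pendulum equation as a first order system, in the same style as the code later in this section:

% minimal sketch: integrate  phi'' = -(g/l) sin(phi)  as a first order system
l = 1; g = 9.81;
rhs = @(t,y) [y(2); -g/l*sin(y(1))];          % y = [phi; dphi/dt]
[t,Y] = ode45(rhs, 0:0.01:5, [pi/3; 0]);      % start at 60 degrees, at rest
plot(t, Y(:,1)); grid on; xlabel('time'); ylabel('angle')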

5–13 Example : A double pendulum

The calculations for this problem are shown in many books on classical mechanics, e.g. [Gree77]. A double pendulum consists of two particles with mass m suspended by mass-less rods of length l. Assuming that all takes place in a vertical plane we have two degrees of freedom: the two angles ϕ (upper rod) and θ (lower rod), both measured from the vertical. The potential energy is not too hard to find as
\[ V(\varphi,\theta) = -m\,l\,g\,(2\,\cos\varphi + \cos\theta)\,. \]
The velocity of the upper particle is v_1 = l\,\dot\varphi.


To find the kinetic energy we need the velocity of the lower mass. The velocity vector is equal to the
vector sum of the velocity of the upper mass and the velocity of the lower particle relative to the upper mass.
\begin{align*}
\vec v_1 &: \text{ length } l\,\dot\varphi\,,\ \text{angle } \varphi \pm \tfrac{\pi}{2}\\
\vec v_2 &: \text{ length } l\,\dot\theta\,,\ \text{angle } \theta \pm \tfrac{\pi}{2}\\
\varphi - \theta &: \text{ angle between } \vec v_1 \text{ and } \vec v_2
\end{align*}
Since the two vectors differ in direction by an angle of \varphi-\theta we can use the law of cosines to find the absolute velocity as (another approach is to use cartesian coordinates x(\varphi,\theta) = l\,(\sin\varphi+\sin\theta), y(\varphi,\theta) = -l\,(\cos\varphi+\cos\theta) and a few calculations)
\[ \text{speed of second mass} = l\,\sqrt{\dot\varphi^2 + \dot\theta^2 + 2\,\dot\varphi\,\dot\theta\,\cos(\varphi-\theta)}\,. \]
Thus the total kinetic energy is
\[ T(\varphi,\theta,\dot\varphi,\dot\theta) = \frac{m\,l^2}{2}\Bigl( 2\,\dot\varphi^2 + \dot\theta^2 + 2\,\dot\varphi\,\dot\theta\,\cos(\varphi-\theta)\Bigr) \]
and the Lagrange function is
\[ L = T - V = \frac{m\,l^2}{2}\Bigl( 2\,\dot\varphi^2 + \dot\theta^2 + 2\,\dot\varphi\,\dot\theta\,\cos(\varphi-\theta)\Bigr) + m\,l\,g\,(2\,\cos\varphi + \cos\theta)\,. \]
The Euler–Lagrange equation for the free variable \varphi is obtained by
\begin{align*}
\frac{\partial L}{\partial\dot\varphi} &= m\,l^2\Bigl( 2\,\dot\varphi + \dot\theta\,\cos(\varphi-\theta)\Bigr)\\
\frac{d}{dt}\,\frac{\partial L}{\partial\dot\varphi} &= m\,l^2\Bigl( 2\,\ddot\varphi + \ddot\theta\,\cos(\varphi-\theta) - \dot\theta\,(\dot\varphi-\dot\theta)\,\sin(\varphi-\theta)\Bigr)\\
\frac{\partial L}{\partial\varphi} &= -m\,l^2\,\dot\varphi\,\dot\theta\,\sin(\varphi-\theta) - m\,l\,g\,2\,\sin\varphi
\end{align*}
which, upon substitution into the Euler–Lagrange equation, yields
\[ m\,l^2\Bigl( 2\,\ddot\varphi + \ddot\theta\,\cos(\varphi-\theta) + \dot\theta^2\,\sin(\varphi-\theta)\Bigr) = -m\,l\,g\,2\,\sin\varphi\,. \]
In a similar fashion the Euler–Lagrange equation for the variable \theta is obtained by
\begin{align*}
\frac{\partial L}{\partial\dot\theta} &= m\,l^2\Bigl( \dot\theta + \dot\varphi\,\cos(\varphi-\theta)\Bigr)\\
\frac{d}{dt}\,\frac{\partial L}{\partial\dot\theta} &= m\,l^2\Bigl( \ddot\theta + \ddot\varphi\,\cos(\varphi-\theta) - \dot\varphi\,(\dot\varphi-\dot\theta)\,\sin(\varphi-\theta)\Bigr)\\
\frac{\partial L}{\partial\theta} &= +m\,l^2\,\dot\varphi\,\dot\theta\,\sin(\varphi-\theta) - m\,l\,g\,\sin\theta
\end{align*}
leading to
\[ m\,l^2\Bigl( \ddot\theta + \ddot\varphi\,\cos(\varphi-\theta) - \dot\varphi^2\,\sin(\varphi-\theta)\Bigr) = -m\,l\,g\,\sin\theta\,. \]
Those two equations can be divided by m\,l^2 and then lead to a system of ordinary differential equations of order 2.
\begin{align*}
2\,\ddot\varphi + \ddot\theta\,\cos(\varphi-\theta) + \dot\theta^2\,\sin(\varphi-\theta) &= -\frac{g}{l}\,2\,\sin\varphi\\
\ddot\theta + \ddot\varphi\,\cos(\varphi-\theta) - \dot\varphi^2\,\sin(\varphi-\theta) &= -\frac{g}{l}\,\sin\theta
\end{align*}


By isolating the second order terms on the left arrive at
\[ \begin{bmatrix} 2 & \cos(\varphi-\theta)\\ \cos(\varphi-\theta) & 1 \end{bmatrix} \begin{pmatrix} \ddot\varphi\\ \ddot\theta \end{pmatrix} = \sin(\varphi-\theta) \begin{pmatrix} -\dot\theta^2\\ \dot\varphi^2 \end{pmatrix} - \frac{g}{l} \begin{pmatrix} 2\,\sin\varphi\\ \sin\theta \end{pmatrix}\,. \]
The matrix on the left hand side is invertible and thus this differential equation can reliably be solved by numerical procedures, see Section 3.4 starting on page 183.
Assuming that all angles and velocities are small one can use the approximations \cos(\varphi-\theta)\approx 1 and \sin x \approx x to obtain the linearized system of differential equations
\[ \begin{bmatrix} 2 & 1\\ 1 & 1 \end{bmatrix} \begin{pmatrix} \ddot\varphi\\ \ddot\theta \end{pmatrix} = -\frac{g}{l} \begin{pmatrix} 2\,\varphi\\ \theta \end{pmatrix}\,. \]
Solving for the second order derivatives obtain
\[ \begin{pmatrix} \ddot\varphi\\ \ddot\theta \end{pmatrix} = -\frac{g}{l} \begin{bmatrix} 1 & -1\\ -1 & 2 \end{bmatrix} \begin{pmatrix} 2\,\varphi\\ \theta \end{pmatrix} = -\frac{g}{l} \begin{bmatrix} 2 & -1\\ -2 & 2 \end{bmatrix} \begin{pmatrix} \varphi\\ \theta \end{pmatrix}\,. \]
This linear system of equations could be solved explicitly, using eigenvalues and eigenvectors. For the above matrix find
\[ \lambda_1 = 2 - \sqrt{2} \approx 0.59\,,\ \vec v_1 = \begin{pmatrix} 1\\ \sqrt{2} \end{pmatrix} \quad\text{and}\quad \lambda_2 = 2 + \sqrt{2} \approx 3.41\,,\ \vec v_2 = \begin{pmatrix} 1\\ -\sqrt{2} \end{pmatrix}\,. \]
Thus the solutions for small angles are of the form
\[ \begin{pmatrix} \varphi(t)\\ \theta(t) \end{pmatrix} \approx A_1 \cos\Bigl(\sqrt{0.59\,\tfrac{g}{l}}\;t + \delta_1\Bigr) \begin{pmatrix} 1\\ \sqrt{2} \end{pmatrix} + A_2 \cos\Bigl(\sqrt{3.41\,\tfrac{g}{l}}\;t + \delta_2\Bigr) \begin{pmatrix} 1\\ -\sqrt{2} \end{pmatrix}\,. \]
Thus there is one type of in-phase solution with a small frequency and a high frequency solution where the two angles are out of phase. ♦
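The full nonlinear system is easily integrated numerically; the sketch below (not from the original notes; l, g and the initial angles are arbitrary choices) mirrors the structure of the MovingPendulum.m code shown later.

% minimal sketch: integrate the nonlinear double pendulum equations with ode45
function DoublePendulum()
  l = 1; g = 9.81;
  function dy = rhs(y)                 % y = [phi; theta; dphi; dtheta]
    M = [2, cos(y(1)-y(2)); cos(y(1)-y(2)), 1];
    b = sin(y(1)-y(2))*[-y(4)^2; y(3)^2] - g/l*[2*sin(y(1)); sin(y(2))];
    dy = [y(3); y(4); M\b];
  end%function
  [t,Y] = ode45(@(t,y) rhs(y), 0:0.01:10, [pi/6; 0; 0; 0]);
  plot(t, Y(:,1), t, Y(:,2)); grid on
  xlabel('time'); legend('phi', 'theta')
end%function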

5–14 Example : A pendulum with moving support


A chariot of mass m_1 with an attached pendulum of length l and mass m_2 is moving freely. The situation is shown in Figure 5.2. In this example the independent variable is time t and the two general coordinates (degrees of freedom) are x and θ, i.e.
\[ \vec u = \begin{pmatrix} x\\ \theta \end{pmatrix}\,. \]
The position and velocity of the pendulum are
\[ \vec p = \begin{pmatrix} x + l\,\sin\theta\\ -l\,\cos\theta \end{pmatrix}\,, \qquad \vec v = \begin{pmatrix} \dot x + l\,\dot\theta\,\cos\theta\\ l\,\dot\theta\,\sin\theta \end{pmatrix} \]
and potential and kinetic energy are given by
\begin{align*}
V(x,\theta) &= -m_2\,l\,g\,\cos\theta - F\,x\\
T(x,\theta,\dot x,\dot\theta) &= \frac{m_1}{2}\,\dot x^2 + \frac{m_2}{2}\Bigl( (\dot x + l\,\cos\theta\,\dot\theta)^2 + (l\,\sin\theta\,\dot\theta)^2 \Bigr)\\
 &= \frac{m_1}{2}\,\dot x^2 + \frac{m_2}{2}\Bigl( \dot x^2 + 2\,l\,\dot x\,\cos\theta\,\dot\theta + l^2\,\dot\theta^2 \Bigr)\,.
\end{align*}


Figure 5.2: Pendulum with moving support (chariot of mass m_1 at position x, pendulum of length l and mass m_2 at angle θ)

First examine the case F = 0 and for the Lagrange function L = T - V derive two Euler–Lagrange equations. The first equation deals with the dependence on the function x(t) and its derivative \dot x(t).
\begin{align*}
\frac{d}{dt}\,L_{\dot x}(x,\theta,\dot x,\dot\theta) &= L_x(x,\theta,\dot x,\dot\theta)\\
\frac{d}{dt}\Bigl( (m_1+m_2)\,\dot x + m_2\,l\,\cos\theta\,\dot\theta \Bigr) &= 0
\end{align*}
From this conclude that the momentum in x direction is conserved. The second equation deals with the dependence on the function \theta(t).
\begin{align*}
\frac{d}{dt}\,L_{\dot\theta}(x,\theta,\dot x,\dot\theta) &= L_\theta(x,\theta,\dot x,\dot\theta)\\
\frac{d}{dt}\Bigl( m_2\,l\,(\dot x\,\cos\theta + l\,\dot\theta) \Bigr) &= -m_2\,l\,\dot x\,\dot\theta\,\sin\theta - m_2\,l\,g\,\sin\theta\\
\frac{d}{dt}\Bigl( \dot x\,\cos\theta + l\,\dot\theta \Bigr) &= -\dot x\,\dot\theta\,\sin\theta - g\,\sin\theta\\
\ddot x\,\cos\theta - \dot x\,\dot\theta\,\sin\theta + l\,\ddot\theta &= -\dot x\,\dot\theta\,\sin\theta - g\,\sin\theta\\
\ddot x\,\cos\theta + l\,\ddot\theta &= -g\,\sin\theta\,.
\end{align*}
This is a second order differential equation for the functions x(t) and \theta(t). The two equations can be combined, leading to the system
\begin{align*}
(m_1+m_2)\,\ddot x + m_2\,l\,\cos\theta\,\ddot\theta &= m_2\,l\,\sin\theta\,(\dot\theta)^2\\
\ddot x\,\cos\theta + l\,\ddot\theta &= -g\,\sin\theta\,.
\end{align*}
With the help of a matrix the system can be solved for the highest occurring derivatives. A straightforward computation shows that the determinant of the matrix does not vanish and thus we can always find the inverse matrix.
\[ \frac{d^2}{dt^2} \begin{pmatrix} x\\ \theta \end{pmatrix} = \begin{bmatrix} m_1+m_2 & m_2\,l\,\cos\theta\\ \cos\theta & l \end{bmatrix}^{-1} \begin{pmatrix} m_2\,l\,\sin\theta\,(\dot\theta)^2\\ -g\,\sin\theta \end{pmatrix} \]
This is a convenient form to generate numerical solutions for the problem at hand.

The above model does not consider friction. Now we want to include some friction on the moving
chariot. This is not elementary, as the potential V can not depend on the velocity ẋ, but there is a trick to be
used.

1. Introduce a constant force F applied to the chariot. This is done by modifying the potential V accordingly.


2. Find the corresponding differential equations.

3. Set the force F = −αẋ

To take the additional force F into account modify the potential energy
\[ V(x,\theta) = -m_2\,l\,g\,\cos\theta - x\cdot F\,. \]
The Euler–Lagrange equation for the variable \theta will not be affected by this change, but the equation for x turns out to be
\begin{align*}
\frac{d}{dt}\Bigl( (m_1+m_2)\,\dot x + m_2\,l\,\cos\theta\,\dot\theta \Bigr) &= F\\
(m_1+m_2)\,\ddot x + m_2\,l\,\cos\theta\,\ddot\theta &= m_2\,l\,\sin\theta\,(\dot\theta)^2 + F
\end{align*}
and the full system is now given by
\[ \frac{d^2}{dt^2} \begin{pmatrix} x\\ \theta \end{pmatrix} = \begin{bmatrix} m_1+m_2 & m_2\,l\,\cos\theta\\ \cos\theta & l \end{bmatrix}^{-1} \begin{pmatrix} m_2\,l\,\sin\theta\,(\dot\theta)^2 + F\\ -g\,\sin\theta \end{pmatrix}\,. \]
Now replace F by -\alpha\,\dot x and one obtains
\[ \frac{d^2}{dt^2} \begin{pmatrix} x\\ \theta \end{pmatrix} = \begin{bmatrix} m_1+m_2 & m_2\,l\,\cos\theta\\ \cos\theta & l \end{bmatrix}^{-1} \begin{pmatrix} m_2\,l\,\sin\theta\,(\dot\theta)^2 - \alpha\,\dot x\\ -g\,\sin\theta \end{pmatrix}\,. \]

Below find the complete code to solve this example and the resulting Figure 5.3.

Figure 5.3: Numerical solution for a pendulum with moving support (position and angle as functions of time)

MovingPendulum.m
function MovingPendulum()
  t = 0:0.01:5; Y0 = [0;pi/6;0;0];

  function dy = MovPend(y)
    l = 1; m1 = 1; m2 = 8; g = 9.81; al = 0.5;
    ddy = [m1+m2, m2*l*cos(y(2));
           cos(y(2)), l]\[m2*l*sin(y(2))*y(4)^2-al*y(3); -g*sin(y(2))];
    dy = [y(3); y(4); ddy];
  end%function

  [t,Y] = ode45(@(t,y)MovPend(y),t,Y0);

  figure(2)
  plot(t,Y(:,1)); grid on; xlabel('time'); ylabel('position');
  figure(3)
  plot(t,Y(:,2)); grid on; xlabel('time'); ylabel('angle');
end%function

5.3 Basic Elasticity, Description of Stress and Strain


In the following sections we give a very basic introduction to the description of elastic deformations. Find this and more information in the vast literature. One good introductory book is [Bowe10] and the corresponding web page solidmechanics.org . The book [GhabPeckWu17] gives a broad presentation of the basics for computational analysis. It goes clearly beyond the scope of these introductory notes.

In Figure 5.4 the experimental laws of elasticity are illustrated. A beam of original length l with cross-sectional area A = w · h is stretched by applying a force F. Many experiments lead to the following two basic laws of elasticity.

• Hooke's law
\[ \frac{\Delta l}{l} = \frac{1}{E}\,\frac{F}{A}\,, \]
where the material constant E is called modulus of elasticity (Young's modulus).

• Poisson's law
\[ \frac{\Delta h}{h} = \frac{\Delta w}{w} = -\nu\,\frac{\Delta l}{l}\,, \]
where the material constant ν is called Poisson's ratio.

Figure 5.4: Definition of the modulus of elasticity E and the Poisson number ν (a beam of length l, width w and height h is stretched by Δl by the force F, while width and height change by Δw and Δh)

In this section the above basic mechanical facts are formulated for general situations, i.e. introduce the
basic equations of elasticity. The procedure is as follows:

• Description of deformed solid: strain.

• Description of the forces within deformed solid: stress.

• Introduction to scalars, vectors, tensors.

• State the connection between deformations and forces, leading to Hooke’s law.


An elastic solid can be fixed at its left edge and be pulled on at the right edge by a force. Figure 5.5 shows a simple situation. The original shape (dotted line) will change into a deformed state (full line). The goal is to give a mathematical description of the deformation of the solid (strain) and the forces that will occur in the solid (stress). A point originally at position \vec x in the solid is moved to its new position \vec x + \vec u(\vec x), i.e. displaced by \vec u(\vec x).
\[ \vec x \longrightarrow \vec x + \vec u\,, \qquad \begin{pmatrix} x\\ y \end{pmatrix} \longrightarrow \begin{pmatrix} x\\ y \end{pmatrix} + \begin{pmatrix} u_1(x,y)\\ u_2(x,y) \end{pmatrix} \]

Figure 5.5: Deformation of an elastic solid

The notation will be used to give a formula for the elastic energy stored in the deformed solid. Based on this information we will construct a finite element solution to the problem. For a given force we search the displacement vector field \vec u(\vec x).

In order to simplify the treatment enormously we assume that the displacements of the structure are very small compared to the dimensions of the solid.

5.3.1 Description of Strain


The strain will give us a mathematical description of the deformation of a given object. It is a purely geo-
metrical description and at this point not related to elasticity. Find a readable description for the construction
of strain, using displacement, in [ChouPaga67, §2, p.34ff]. First examine the strain for the deformation of
an object in a plane. Later we will extend the construction to objects in space.
Of a large object to be deformed and moved in a plane (see Figure 5.5) examine a small rectangle of width Δx and height Δy and its behavior under the deformation. The original rectangle ABCD and the deformed shape A'B'C'D' are shown in Figure 5.6.

Figure 5.6: Definition of strain: rectangle before and after deformation


Since Δx and Δy are assumed to be very small, the deformation is very close to an affine deformation, i.e. a linear deformation and a translation. Since the deformations are small we also know that the deformed rectangle has to be almost horizontal, thus Figure 5.6 is correct. A straightforward Taylor approximation leads to expressions for the positions of the four corners of the rectangle.
\begin{align*}
A = \begin{pmatrix} x\\ y \end{pmatrix} &\longrightarrow A' = \begin{pmatrix} x\\ y \end{pmatrix} + \begin{pmatrix} u_1(x,y)\\ u_2(x,y) \end{pmatrix}\\
B = \begin{pmatrix} x+\Delta x\\ y \end{pmatrix} &\longrightarrow B' = \begin{pmatrix} x+\Delta x\\ y \end{pmatrix} + \begin{pmatrix} u_1(x,y)\\ u_2(x,y) \end{pmatrix} + \begin{pmatrix} \frac{\partial u_1(x,y)}{\partial x}\,\Delta x\\[2pt] \frac{\partial u_2(x,y)}{\partial x}\,\Delta x \end{pmatrix}\\
C = \begin{pmatrix} x\\ y+\Delta y \end{pmatrix} &\longrightarrow C' = \begin{pmatrix} x\\ y+\Delta y \end{pmatrix} + \begin{pmatrix} u_1(x,y)\\ u_2(x,y) \end{pmatrix} + \begin{pmatrix} \frac{\partial u_1(x,y)}{\partial y}\,\Delta y\\[2pt] \frac{\partial u_2(x,y)}{\partial y}\,\Delta y \end{pmatrix}\\
D = \begin{pmatrix} x+\Delta x\\ y+\Delta y \end{pmatrix} &\longrightarrow D' = \begin{pmatrix} x+\Delta x\\ y+\Delta y \end{pmatrix} + \begin{pmatrix} u_1(x,y)\\ u_2(x,y) \end{pmatrix} + \begin{pmatrix} \frac{\partial u_1(x,y)}{\partial x}\,\Delta x + \frac{\partial u_1(x,y)}{\partial y}\,\Delta y\\[2pt] \frac{\partial u_2(x,y)}{\partial x}\,\Delta x + \frac{\partial u_2(x,y)}{\partial y}\,\Delta y \end{pmatrix}
\end{align*}
The last equation can be rewritten in the form
\begin{align*}
\begin{pmatrix} \Delta u_1\\ \Delta u_2 \end{pmatrix} &:= \begin{pmatrix} u_1(x+\Delta x,y+\Delta y)\\ u_2(x+\Delta x,y+\Delta y) \end{pmatrix} - \begin{pmatrix} u_1(x,y)\\ u_2(x,y) \end{pmatrix} = \begin{bmatrix} \frac{\partial u_1}{\partial x} & \frac{\partial u_1}{\partial y}\\[2pt] \frac{\partial u_2}{\partial x} & \frac{\partial u_2}{\partial y} \end{bmatrix} \cdot \begin{pmatrix} \Delta x\\ \Delta y \end{pmatrix}\\
 &= \frac{1}{2} \begin{bmatrix} 2\,\frac{\partial u_1}{\partial x} & \frac{\partial u_1}{\partial y} + \frac{\partial u_2}{\partial x}\\[2pt] \frac{\partial u_2}{\partial x} + \frac{\partial u_1}{\partial y} & 2\,\frac{\partial u_2}{\partial y} \end{bmatrix} \cdot \begin{pmatrix} \Delta x\\ \Delta y \end{pmatrix} + \frac{1}{2} \begin{bmatrix} 0 & \frac{\partial u_1}{\partial y} - \frac{\partial u_2}{\partial x}\\[2pt] \frac{\partial u_2}{\partial x} - \frac{\partial u_1}{\partial y} & 0 \end{bmatrix} \cdot \begin{pmatrix} \Delta x\\ \Delta y \end{pmatrix}\\
 &= A\cdot \begin{pmatrix} \Delta x\\ \Delta y \end{pmatrix} + R\cdot \begin{pmatrix} \Delta x\\ \Delta y \end{pmatrix}\,.
\end{align*}

Observe that the matrix A is symmetric and R is antisymmetric. (Symmetry implies \langle\vec v, A\cdot\vec w\rangle = \langle A^T\cdot\vec v, \vec w\rangle = \langle A\cdot\vec v, \vec w\rangle and antisymmetry \langle\vec v, R\cdot\vec w\rangle = \langle R^T\cdot\vec v, \vec w\rangle = -\langle R\cdot\vec v, \vec w\rangle.)

Assuming that our structure is only slightly deformed conclude that Δu_1 and Δu_2 are considerably smaller than Δx and Δy. Based on this ignore quadratic contributions (Δu)^2. (Due to this simplification we will later encounter a problem with rotations about large angles. A possible rescue is shown in Section 5.7 using the Cauchy–Green tensor.) Now compute the distance of the points A' and D' in the deformed body
\begin{align*}
|A'D'|^2 &= (\Delta x + \Delta u_1)^2 + (\Delta y + \Delta u_2)^2 = \Bigl\langle \begin{pmatrix} \Delta x + \Delta u_1\\ \Delta y + \Delta u_2 \end{pmatrix}, \begin{pmatrix} \Delta x + \Delta u_1\\ \Delta y + \Delta u_2 \end{pmatrix} \Bigr\rangle\\
 &\approx \Bigl\langle \begin{pmatrix} \Delta x\\ \Delta y \end{pmatrix}, \begin{pmatrix} \Delta x\\ \Delta y \end{pmatrix} \Bigr\rangle + \Bigl\langle \begin{pmatrix} \Delta x\\ \Delta y \end{pmatrix}, A\begin{pmatrix} \Delta x\\ \Delta y \end{pmatrix} + R\begin{pmatrix} \Delta x\\ \Delta y \end{pmatrix} \Bigr\rangle + \Bigl\langle A\begin{pmatrix} \Delta x\\ \Delta y \end{pmatrix} + R\begin{pmatrix} \Delta x\\ \Delta y \end{pmatrix}, \begin{pmatrix} \Delta x\\ \Delta y \end{pmatrix} \Bigr\rangle\\
 &= \Bigl\langle \begin{pmatrix} \Delta x\\ \Delta y \end{pmatrix}, \begin{pmatrix} \Delta x\\ \Delta y \end{pmatrix} \Bigr\rangle + 2\,\Bigl\langle \begin{pmatrix} \Delta x\\ \Delta y \end{pmatrix}, A\begin{pmatrix} \Delta x\\ \Delta y \end{pmatrix} \Bigr\rangle = |AD|^2 + 2\,\Bigl\langle \begin{pmatrix} \Delta x\\ \Delta y \end{pmatrix}, A\begin{pmatrix} \Delta x\\ \Delta y \end{pmatrix} \Bigr\rangle\,.
\end{align*}


Observe that the matrix R does not lead to changes of distances in the body. Its contributions correspond to rotations. Only the contributions by A lead to stretching of the material.
Setting Δy = 0 in the above formula compute the distance |A'B'| as
\begin{align*}
|A'B'|^2 &= (\Delta x)^2 + 2\,\frac{\partial u_1}{\partial x}\,(\Delta x)^2 \approx (\Delta x)^2 \Bigl( 1 + \frac{\partial u_1}{\partial x} \Bigr)^2\\
|A'B'| &= \sqrt{1 + 2\,\frac{\partial u_1}{\partial x}}\;\Delta x \approx \Delta x + \frac{\partial u_1}{\partial x}\,\Delta x\,.
\end{align*}
Now examine the ratio of the change of length over the original length to obtain the normal strains ε_xx and ε_yy in the direction of the two axes.
\begin{align*}
\varepsilon_{xx} &= \frac{\text{change of length in x direction}}{\text{length in x direction}} = \frac{\frac{\partial u_1(x,y)}{\partial x}\,\Delta x}{\Delta x} = \frac{\partial u_1(x,y)}{\partial x}\\
\varepsilon_{yy} &= \frac{\text{change of length in y direction}}{\text{length in y direction}} = \frac{\frac{\partial u_2(x,y)}{\partial y}\,\Delta y}{\Delta y} = \frac{\partial u_2(x,y)}{\partial y}
\end{align*}
To find the geometric interpretation of the shear strain
\[ \varepsilon_{xy} = \varepsilon_{yx} = \frac{1}{2}\Bigl( \frac{\partial u_1}{\partial y} + \frac{\partial u_2}{\partial x} \Bigr) \]
assume that the rectangle ABCD is not rotated, as shown in Figure 5.6. Let γ_1 be the angle formed by the line A'B' with the x axis and γ_2 the angle between the line A'C' and the y axis. The sign convention is such that both angles in Figure 5.6 are positive. Since tan φ ≈ φ for small angles find
\begin{align*}
\tan\gamma_1 &= \frac{\frac{\partial u_2(x,y)}{\partial x}\,\Delta x}{\Delta x} = \frac{\partial u_2(x,y)}{\partial x}\,,\qquad
\tan\gamma_2 = \frac{\frac{\partial u_1(x,y)}{\partial y}\,\Delta y}{\Delta y} = \frac{\partial u_1(x,y)}{\partial y}\\
2\,\varepsilon_{xy} &= \tan\gamma_1 + \tan\gamma_2 \approx \gamma_1 + \gamma_2\,.
\end{align*}
Thus the number ε_xy indicates by how much a right angle between the x and y axis would be diminished by the given deformation.
5–15 Definition : The matrix
\[ \begin{bmatrix} \varepsilon_{xx} & \varepsilon_{xy}\\ \varepsilon_{xy} & \varepsilon_{yy} \end{bmatrix} = \begin{bmatrix} \frac{\partial u_1}{\partial x} & \frac{1}{2}\bigl( \frac{\partial u_1}{\partial y} + \frac{\partial u_2}{\partial x} \bigr)\\[2pt] \frac{1}{2}\bigl( \frac{\partial u_1}{\partial y} + \frac{\partial u_2}{\partial x} \bigr) & \frac{\partial u_2}{\partial y} \end{bmatrix} \]
is the (infinitesimal) strain tensor.

5–16 Example : Examine a small section in a deformed solid and compare the original and deformed shape of a small rectangle. A block (Δx = 2 and Δy = 1) is deformed in the xy plane: the corners (0,0) and (0,1) on the y-axis remain fixed, while the corner (2,0) is moved to (2.2, 0.4) and the corner (2,1) to (2.2, 1.4). Use this picture (original shape in blue, deformed shape in green) to read out the three strains.


• Along the x-axis observe Δu_1 = 0.2 and Δu_2 = 0.4. This leads to
\[ \frac{\partial u_1}{\partial x} = \frac{\Delta u_1}{\Delta x} = \frac{0.2}{2} = 0.1 \quad\text{and}\quad \frac{\partial u_2}{\partial x} = \frac{\Delta u_2}{\Delta x} = \frac{0.4}{2} = 0.2\,. \]

• Along the y-axis observe Δu_1 = Δu_2 = 0. This leads to
\[ \frac{\partial u_1}{\partial y} = \frac{\Delta u_1}{\Delta y} = 0 \quad\text{and}\quad \frac{\partial u_2}{\partial y} = \frac{\Delta u_2}{\Delta y} = 0\,. \]

• Thus the strain tensor is given by
\[ \begin{bmatrix} \varepsilon_{xx} & \varepsilon_{xy}\\ \varepsilon_{xy} & \varepsilon_{yy} \end{bmatrix} = \begin{bmatrix} \frac{\partial u_1}{\partial x} & \frac{1}{2}\bigl( \frac{\partial u_2}{\partial x} + \frac{\partial u_1}{\partial y} \bigr)\\[2pt] \frac{1}{2}\bigl( \frac{\partial u_2}{\partial x} + \frac{\partial u_1}{\partial y} \bigr) & \frac{\partial u_2}{\partial y} \end{bmatrix} = \begin{bmatrix} 0.1 & 0.1\\ 0.1 & 0 \end{bmatrix}\,. \]
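The same numbers are easily checked with a few lines of Octave: build the displacement gradient from the observed corner displacements and symmetrize it (a small verification, not part of the original example).

% strain tensor from the displacement gradient of example 5-16
G = [0.2/2, 0/1;      % [du1/dx, du1/dy]
     0.4/2, 0/1];     % [du2/dx, du2/dy]
E = (G + G')/2        % symmetric part = infinitesimal strain tensor, gives [0.1 0.1; 0.1 0]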


5–17 Example : It is a good exercise to compute the strain components for a few simple deformations.

• pure translation: If the displacement vector \vec u is constant we have the situation of a pure translation, without deformation. Since all derivatives of u_1 and u_2 vanish we find ε_xx = ε_yy = ε_xy = 0, i.e. the strain components are all zero.

• pure rotation: A pure rotation by an angle φ is given by
\[ \begin{pmatrix} x\\ y \end{pmatrix} \longrightarrow \begin{bmatrix} \cos\varphi & -\sin\varphi\\ \sin\varphi & \cos\varphi \end{bmatrix} \cdot \begin{pmatrix} x\\ y \end{pmatrix} = \begin{pmatrix} \cos\varphi\,x - \sin\varphi\,y\\ \sin\varphi\,x + \cos\varphi\,y \end{pmatrix} \]
and thus the displacement vector is given by
\[ \begin{pmatrix} u_1(x,y)\\ u_2(x,y) \end{pmatrix} = \begin{pmatrix} \cos\varphi\,x - \sin\varphi\,y - x\\ \sin\varphi\,x + \cos\varphi\,y - y \end{pmatrix}\,. \]
Since the overall displacement has to be small compute only with small angles φ. (This can be improved by working with the Green strain tensor, see Section 5.7 on page 365.) This leads to
\[ \begin{bmatrix} \varepsilon_{xx} & \varepsilon_{xy}\\ \varepsilon_{xy} & \varepsilon_{yy} \end{bmatrix} = \begin{bmatrix} \cos\varphi - 1 & 0\\ 0 & \cos\varphi - 1 \end{bmatrix} \approx \begin{bmatrix} 0 & 0\\ 0 & 0 \end{bmatrix}\,. \]
Again all components of the strain vanish.

• stretching in both directions: The displacement
\[ \begin{pmatrix} u_1(x,y)\\ u_2(x,y) \end{pmatrix} = \lambda \begin{pmatrix} x\\ y \end{pmatrix} \]
corresponds to a stretching of the solid by the factor 1 + λ in both directions. The components of the strain are given by
\[ \begin{bmatrix} \varepsilon_{xx} & \varepsilon_{xy}\\ \varepsilon_{xy} & \varepsilon_{yy} \end{bmatrix} = \begin{bmatrix} \lambda & 0\\ 0 & \lambda \end{bmatrix}\,, \]
i.e. there is no shear strain in this situation.


• stretching in x direction only: The displacement
\[ \begin{pmatrix} u_1(x,y)\\ u_2(x,y) \end{pmatrix} = \lambda \begin{pmatrix} x\\ 0 \end{pmatrix} \]
corresponds to a stretching by the factor 1 + λ along the x axis. The components of the strain are given by
\[ \begin{bmatrix} \varepsilon_{xx} & \varepsilon_{xy}\\ \varepsilon_{xy} & \varepsilon_{yy} \end{bmatrix} = \begin{bmatrix} \lambda & 0\\ 0 & 0 \end{bmatrix}\,. \]

• stretching in 45° direction: The displacement
\[ \begin{pmatrix} u_1(x,y)\\ u_2(x,y) \end{pmatrix} = \frac{\lambda}{2} \begin{pmatrix} x+y\\ x+y \end{pmatrix} \]
corresponds to a stretching by the factor 1 + λ along the axis x = y. The straight line y = -x is left unchanged. To verify this observe
\[ \begin{pmatrix} u_1(x,x)\\ u_2(x,x) \end{pmatrix} = \lambda \begin{pmatrix} x\\ x \end{pmatrix} \quad\text{and}\quad \begin{pmatrix} u_1(x,-x)\\ u_2(x,-x) \end{pmatrix} = \begin{pmatrix} 0\\ 0 \end{pmatrix}\,. \]
The components of the strain are given by
\[ \begin{bmatrix} \varepsilon_{xx} & \varepsilon_{xy}\\ \varepsilon_{xy} & \varepsilon_{yy} \end{bmatrix} = \begin{bmatrix} \lambda/2 & \lambda/2\\ \lambda/2 & \lambda/2 \end{bmatrix}\,. \]

• The two previous examples both stretch the solid in one direction by a factor λ and leave the orthogonal direction unchanged. Thus it is the same type of deformation, the difference being the coordinate system used to examine the result. Observe that the expressions
\[ \varepsilon_{xx}\,,\ \varepsilon_{yy}\,,\ \varepsilon_{xy} \quad\text{depend on the coordinate system,} \]
\[ \varepsilon_{xx} + \varepsilon_{yy} \quad\text{and}\quad \frac{\partial u_1}{\partial y} - \frac{\partial u_2}{\partial x} \quad\text{do not depend on the coordinate system.} \]
This observation will be confirmed and proven in the next result.

5–18 Observation : Consider two coordinate systems, where one is generated by rotating the first coordinate axes by an angle φ. The situation is shown in Figure 5.7 with φ = π/6 = 30°. Now express a vector \vec u (components in the (xy)–system) also in the (x'y')–system. To achieve this rotate the vector \vec u by -φ and read out the components. In our example find \vec u = (1, 1)^T and thus
\[ \vec u' = R^T\cdot\vec u = \begin{bmatrix} \cos\varphi & \sin\varphi\\ -\sin\varphi & \cos\varphi \end{bmatrix} \cdot \begin{pmatrix} u_1\\ u_2 \end{pmatrix} \approx \begin{bmatrix} 0.866 & 0.5\\ -0.5 & 0.866 \end{bmatrix} \cdot \begin{pmatrix} 1\\ 1 \end{pmatrix} = \begin{pmatrix} 1.366\\ 0.366 \end{pmatrix}\,. \]
The numbers are confirmed by Figure 5.7 . ♦


\[ \vec x' = R^T\cdot\vec x = \begin{bmatrix} \cos\varphi & \sin\varphi\\ -\sin\varphi & \cos\varphi \end{bmatrix} \cdot \begin{pmatrix} x\\ y \end{pmatrix}\,, \qquad \vec x = R\cdot\vec x' = \begin{bmatrix} \cos\varphi & -\sin\varphi\\ \sin\varphi & \cos\varphi \end{bmatrix} \cdot \begin{pmatrix} x'\\ y' \end{pmatrix} \]

Figure 5.7: Rotation of the coordinate system

5–19 Result : A given strain situation is examined in two different coordinate systems, as shown in Figure 5.7. Then we have
\[ \varepsilon_{xx} + \varepsilon_{yy} = \varepsilon'_{x'x'} + \varepsilon'_{y'y'}\,, \qquad \frac{\partial u_1}{\partial y} - \frac{\partial u_2}{\partial x} = \frac{\partial u'_1}{\partial y'} - \frac{\partial u'_2}{\partial x'} \]
and the strain components transform according to the formula
\[ \begin{bmatrix} \varepsilon'_{x'x'} & \varepsilon'_{x'y'}\\ \varepsilon'_{x'y'} & \varepsilon'_{y'y'} \end{bmatrix} = \begin{bmatrix} \cos\varphi & \sin\varphi\\ -\sin\varphi & \cos\varphi \end{bmatrix} \cdot \begin{bmatrix} \varepsilon_{xx} & \varepsilon_{xy}\\ \varepsilon_{xy} & \varepsilon_{yy} \end{bmatrix} \cdot \begin{bmatrix} \cos\varphi & -\sin\varphi\\ \sin\varphi & \cos\varphi \end{bmatrix}\,. \]
Proof : Since the deformations at a given point are identical we have \vec u'(\vec x') = R^T\cdot\vec u(\vec x) = R^T\cdot\vec u(R\cdot\vec x'), i.e.
\begin{align*}
u'_1(\vec x') &= \cos\varphi\,u_1(\vec x) + \sin\varphi\,u_2(\vec x)\\
 &= +\cos\varphi\,u_1(\cos\varphi\,x' - \sin\varphi\,y',\,\sin\varphi\,x' + \cos\varphi\,y') + \sin\varphi\,u_2(\cos\varphi\,x' - \sin\varphi\,y',\,\sin\varphi\,x' + \cos\varphi\,y')\\
u'_2(\vec x') &= -\sin\varphi\,u_1(\vec x) + \cos\varphi\,u_2(\vec x)\\
 &= -\sin\varphi\,u_1(\cos\varphi\,x' - \sin\varphi\,y',\,\sin\varphi\,x' + \cos\varphi\,y') + \cos\varphi\,u_2(\cos\varphi\,x' - \sin\varphi\,y',\,\sin\varphi\,x' + \cos\varphi\,y')
\end{align*}
With elementary, but lengthy application of the chain rule we find
\begin{align*}
\frac{\partial u'_1}{\partial x'}(\vec x') &= \cos\varphi\Bigl( \cos\varphi\,\frac{\partial u_1}{\partial x} + \sin\varphi\,\frac{\partial u_1}{\partial y} \Bigr) + \sin\varphi\Bigl( \cos\varphi\,\frac{\partial u_2}{\partial x} + \sin\varphi\,\frac{\partial u_2}{\partial y} \Bigr)\\
 &= \cos^2\varphi\,\frac{\partial u_1}{\partial x} + \sin^2\varphi\,\frac{\partial u_2}{\partial y} + \cos\varphi\,\sin\varphi\Bigl( \frac{\partial u_1}{\partial y} + \frac{\partial u_2}{\partial x} \Bigr)\\
\frac{\partial u'_1}{\partial y'}(\vec x') &= \cos\varphi\Bigl( -\sin\varphi\,\frac{\partial u_1}{\partial x} + \cos\varphi\,\frac{\partial u_1}{\partial y} \Bigr) + \sin\varphi\Bigl( -\sin\varphi\,\frac{\partial u_2}{\partial x} + \cos\varphi\,\frac{\partial u_2}{\partial y} \Bigr)\\
 &= -\cos\varphi\,\sin\varphi\,\frac{\partial u_1}{\partial x} + \cos\varphi\,\sin\varphi\,\frac{\partial u_2}{\partial y} + \cos^2\varphi\,\frac{\partial u_1}{\partial y} - \sin^2\varphi\,\frac{\partial u_2}{\partial x}\\
\frac{\partial u'_2}{\partial x'}(\vec x') &= -\sin\varphi\Bigl( \cos\varphi\,\frac{\partial u_1}{\partial x} + \sin\varphi\,\frac{\partial u_1}{\partial y} \Bigr) + \cos\varphi\Bigl( \cos\varphi\,\frac{\partial u_2}{\partial x} + \sin\varphi\,\frac{\partial u_2}{\partial y} \Bigr)\\
 &= -\cos\varphi\,\sin\varphi\,\frac{\partial u_1}{\partial x} + \cos\varphi\,\sin\varphi\,\frac{\partial u_2}{\partial y} - \sin^2\varphi\,\frac{\partial u_1}{\partial y} + \cos^2\varphi\,\frac{\partial u_2}{\partial x}
\end{align*}


   
\begin{align*}
\frac{\partial u'_2}{\partial y'}(\vec x') &= -\sin\varphi\Bigl( -\sin\varphi\,\frac{\partial u_1}{\partial x} + \cos\varphi\,\frac{\partial u_1}{\partial y} \Bigr) + \cos\varphi\Bigl( -\sin\varphi\,\frac{\partial u_2}{\partial x} + \cos\varphi\,\frac{\partial u_2}{\partial y} \Bigr)\\
 &= \sin^2\varphi\,\frac{\partial u_1}{\partial x} + \cos^2\varphi\,\frac{\partial u_2}{\partial y} - \cos\varphi\,\sin\varphi\Bigl( \frac{\partial u_1}{\partial y} + \frac{\partial u_2}{\partial x} \Bigr)\,.
\end{align*}
Now verify that
\begin{align*}
\varepsilon'_{x'x'} + \varepsilon'_{y'y'} &= \frac{\partial u'_1}{\partial x'} + \frac{\partial u'_2}{\partial y'} = \frac{\partial u_1}{\partial x} + \frac{\partial u_2}{\partial y} = \varepsilon_{xx} + \varepsilon_{yy}\,,\\
\frac{\partial u'_1}{\partial y'} - \frac{\partial u'_2}{\partial x'} &= \frac{\partial u_1}{\partial y} - \frac{\partial u_2}{\partial x}\,.
\end{align*}
These two expressions are thus independent of the orientation of the coordinate system.
If the matrix multiplication below is carried one step further, then the claimed transformation formula will appear.
\begin{align*}
R^T\cdot \begin{bmatrix} 2\,\varepsilon_{xx} & 2\,\varepsilon_{xy}\\ 2\,\varepsilon_{xy} & 2\,\varepsilon_{yy} \end{bmatrix} \cdot R &= \begin{bmatrix} \cos\varphi & \sin\varphi\\ -\sin\varphi & \cos\varphi \end{bmatrix} \cdot \begin{bmatrix} 2\,\frac{\partial u_1}{\partial x} & \frac{\partial u_2}{\partial x} + \frac{\partial u_1}{\partial y}\\[2pt] \frac{\partial u_2}{\partial x} + \frac{\partial u_1}{\partial y} & 2\,\frac{\partial u_2}{\partial y} \end{bmatrix} \cdot \begin{bmatrix} \cos\varphi & -\sin\varphi\\ \sin\varphi & \cos\varphi \end{bmatrix}\\
 &= \begin{bmatrix} \cos\varphi & \sin\varphi\\ -\sin\varphi & \cos\varphi \end{bmatrix} \cdot \begin{bmatrix} 2\cos\varphi\,\frac{\partial u_1}{\partial x} + \sin\varphi\,(\frac{\partial u_2}{\partial x} + \frac{\partial u_1}{\partial y}) & -2\sin\varphi\,\frac{\partial u_1}{\partial x} + \cos\varphi\,(\frac{\partial u_2}{\partial x} + \frac{\partial u_1}{\partial y})\\[2pt] \cos\varphi\,(\frac{\partial u_2}{\partial x} + \frac{\partial u_1}{\partial y}) + 2\sin\varphi\,\frac{\partial u_2}{\partial y} & -\sin\varphi\,(\frac{\partial u_2}{\partial x} + \frac{\partial u_1}{\partial y}) + 2\cos\varphi\,\frac{\partial u_2}{\partial y} \end{bmatrix}
\end{align*}
□
5–20 Example : To read out the strain in a direction given by the normalized directional vector \vec d = (d_1,d_2)^T = (\cos\alpha,\sin\alpha)^T compute the normal strain Δl/l in that direction by
\[ \frac{\Delta l}{l} = \Bigl\langle \begin{pmatrix} d_1\\ d_2 \end{pmatrix}, \begin{bmatrix} \varepsilon_{xx} & \varepsilon_{xy}\\ \varepsilon_{xy} & \varepsilon_{yy} \end{bmatrix} \cdot \begin{pmatrix} d_1\\ d_2 \end{pmatrix} \Bigr\rangle\,. \]
(This result might be useful when working with strain gauges to measure deformations.) To verify this result

• Construct a rotation matrix R, such that the new x' direction coincides with the direction given by \vec d.
\[ R = \begin{bmatrix} d_1 & -d_2\\ d_2 & d_1 \end{bmatrix} \]

• Apply the above transformation rule for the strain.
\[ \begin{bmatrix} \varepsilon'_{x'x'} & \varepsilon'_{x'y'}\\ \varepsilon'_{x'y'} & \varepsilon'_{y'y'} \end{bmatrix} = R^T\cdot \begin{bmatrix} \varepsilon_{xx} & \varepsilon_{xy}\\ \varepsilon_{xy} & \varepsilon_{yy} \end{bmatrix} \cdot R \]

• Then the top left entry ε'_{x'x'} shows the normal strain Δl/l in that direction.
\begin{align*}
\frac{\Delta l}{l} = \varepsilon'_{x'x'} &= \Bigl\langle \begin{pmatrix} 1\\ 0 \end{pmatrix}, \begin{bmatrix} \varepsilon'_{x'x'} & \varepsilon'_{x'y'}\\ \varepsilon'_{x'y'} & \varepsilon'_{y'y'} \end{bmatrix} \begin{pmatrix} 1\\ 0 \end{pmatrix} \Bigr\rangle = \Bigl\langle \begin{pmatrix} 1\\ 0 \end{pmatrix}, R^T \begin{bmatrix} \varepsilon_{xx} & \varepsilon_{xy}\\ \varepsilon_{xy} & \varepsilon_{yy} \end{bmatrix} R \begin{pmatrix} 1\\ 0 \end{pmatrix} \Bigr\rangle\\
 &= \Bigl\langle \begin{pmatrix} d_1\\ d_2 \end{pmatrix}, \begin{bmatrix} \varepsilon_{xx} & \varepsilon_{xy}\\ \varepsilon_{xy} & \varepsilon_{yy} \end{bmatrix} \begin{pmatrix} d_1\\ d_2 \end{pmatrix} \Bigr\rangle
\end{align*}

As an example consider the 45° direction, thus \vec d = \frac{1}{\sqrt 2}\,(1,1)^T, and conclude
\begin{align*}
\frac{\Delta l}{l} &= \Bigl\langle \frac{1}{\sqrt 2}\begin{pmatrix} 1\\ 1 \end{pmatrix}, \frac{1}{\sqrt 2}\begin{bmatrix} \varepsilon_{xx} & \varepsilon_{xy}\\ \varepsilon_{xy} & \varepsilon_{yy} \end{bmatrix} \begin{pmatrix} 1\\ 1 \end{pmatrix} \Bigr\rangle = \frac{1}{2}\Bigl\langle \begin{pmatrix} 1\\ 1 \end{pmatrix}, \begin{pmatrix} \varepsilon_{xx}+\varepsilon_{xy}\\ \varepsilon_{xy}+\varepsilon_{yy} \end{pmatrix} \Bigr\rangle\\
 &= \frac{1}{2}\,(\varepsilon_{xx} + 2\,\varepsilon_{xy} + \varepsilon_{yy}) = \frac{1}{2}\Bigl( \frac{\partial u_1}{\partial x} + \frac{\partial u_2}{\partial x} + \frac{\partial u_1}{\partial y} + \frac{\partial u_2}{\partial y} \Bigr)\,.
\end{align*}
Since the strain matrix is symmetric, there always exists an angle φ such that the strain matrix in the new coordinate system is diagonal, i.e.
\[ \begin{bmatrix} \varepsilon'_{x'x'} & 0\\ 0 & \varepsilon'_{y'y'} \end{bmatrix} = \begin{bmatrix} \cos\varphi & \sin\varphi\\ -\sin\varphi & \cos\varphi \end{bmatrix} \cdot \begin{bmatrix} \varepsilon_{xx} & \varepsilon_{xy}\\ \varepsilon_{xy} & \varepsilon_{yy} \end{bmatrix} \cdot \begin{bmatrix} \cos\varphi & -\sin\varphi\\ \sin\varphi & \cos\varphi \end{bmatrix}\,. \]
(To see this, normalize the eigenvector \vec e_1 of the first eigenvalue λ_1 and write it as (\cos\varphi,\sin\varphi)^T; the second eigenvector is orthogonal to the first, and collecting both as columns of the rotation matrix R yields A·R = R·diag(λ_1,λ_2). Multiplying from the left by R^T gives the diagonalization.)
Thus, at least close to the examined point, the deformation consists of stretching (or compressing) the x' axis and stretching (or compressing) the y' axis. No shear strain is observed in this new coordinate system. One of the possible displacements is given by
\[ \begin{pmatrix} x'\\ y' \end{pmatrix} \longrightarrow \begin{pmatrix} x'\\ y' \end{pmatrix} + \begin{pmatrix} \varepsilon'_{x'x'}\,x'\\ \varepsilon'_{y'y'}\,y' \end{pmatrix}\,. \]
The values of ε'_{x'x'} and ε'_{y'y'} can be found as eigenvalues of the original strain matrix, i.e. solutions of the quadratic equation
\[ f(\lambda) = \det \begin{bmatrix} \varepsilon_{xx} - \lambda & \varepsilon_{xy}\\ \varepsilon_{xy} & \varepsilon_{yy} - \lambda \end{bmatrix} = 0\,. \]
The eigenvectors indicate the directions of pure strain, i.e. in that coordinate system you find no shear strain. The eigenvalues correspond to the principal strains.

5–21 Example : Examine the strain matrix
\[ A = \begin{bmatrix} 0.04 & 0.01\\ 0.01 & 0 \end{bmatrix}\,. \]
This corresponds to a solid stretched by 4% in the x direction, while the angle between the x and y axis is diminished by 0.02. To diagonalize this matrix we determine the zeros of
\[ \det(A - \lambda I) = \det \begin{bmatrix} 0.04 - \lambda & 0.01\\ 0.01 & 0 - \lambda \end{bmatrix} = \lambda^2 - 0.04\,\lambda - 0.01^2 = (\lambda - 0.02)^2 - 0.02^2 - 0.01^2 = 0 \]


with the solutions λ_1 = 0.02 + \sqrt{0.0005} = 0.01\,(2+\sqrt 5) ≈ 0.04236 and λ_2 = 0.02 - \sqrt{0.0005} = 0.01\,(2-\sqrt 5) ≈ -0.00236. The first eigenvector is then found as solution of the linear system
\[ \begin{bmatrix} 0.04 - \lambda_1 & 0.01\\ 0.01 & 0 - \lambda_1 \end{bmatrix} \begin{pmatrix} x\\ y \end{pmatrix} \approx \begin{bmatrix} -0.00236 & 0.01\\ 0.01 & -0.04236 \end{bmatrix} \begin{pmatrix} x\\ y \end{pmatrix} = \begin{pmatrix} 0\\ 0 \end{pmatrix}\,. \]
The second of the above equations is a multiple of the first and thus use only the first equation
\[ -0.00236\,x + 0.01\,y = 0\,. \]
Since only the direction matters generate an easy solution by
\[ \vec e_1 = \begin{pmatrix} 1\\ 0.236 \end{pmatrix} \quad\text{with}\quad \lambda_1 = 0.04236\,. \]
The second eigenvector \vec e_2 is orthogonal to the first and thus
\[ \vec e_2 = \begin{pmatrix} 0.236\\ -1 \end{pmatrix} \quad\text{with}\quad \lambda_2 = -0.00236\,. \]
As a consequence the above strain corresponds to a pure stretching by 4.2% in the direction of \vec e_1 and a compression of 0.2% in the orthogonal direction. ♦
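The eigenvalues and eigenvectors of this example are easily double-checked numerically (a small verification, not part of the original text):

% principal strains and directions for the strain matrix of example 5-21
A = [0.04, 0.01; 0.01, 0];
[V, D] = eig(A)
% diag(D) contains the principal strains -0.00236 and 0.04236,
% the columns of V the corresponding (normalized) principal directions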

Strain for solids in space


So far all calculations were made in the plane, but they can readily be adapted to solids in space. If the deformation of a solid is given by the deformation vector field \vec u, i.e.
\[ \vec x = \begin{pmatrix} x\\ y\\ z \end{pmatrix} \longrightarrow \vec x + \vec u = \begin{pmatrix} x\\ y\\ z \end{pmatrix} + \begin{pmatrix} u_1\\ u_2\\ u_3 \end{pmatrix}\,, \]
then we can compute the three normal and three shear strain components by the formulas in Table 5.3. (In part of the literature, e.g. [Prze68], the shear strains are defined without the division by 2. All results can be adapted accordingly.)

• ε_xx = ∂u_1/∂x : ratio of change of length divided by length in x direction
• ε_yy = ∂u_2/∂y : ratio of change of length divided by length in y direction
• ε_zz = ∂u_3/∂z : ratio of change of length divided by length in z direction
• ε_xy = ε_yx = ½ (∂u_1/∂y + ∂u_2/∂x) : the angle between the x and y axis is diminished by 2 ε_xy
• ε_xz = ε_zx = ½ (∂u_1/∂z + ∂u_3/∂x) : the angle between the x and z axis is diminished by 2 ε_xz
• ε_yz = ε_zy = ½ (∂u_2/∂z + ∂u_3/∂y) : the angle between the y and z axis is diminished by 2 ε_yz

Table 5.3: Normal and shear strains in space

The above results about the transformation of strains in a rotated coordinate system do also apply. Thus for a given strain there is a rotation of the coordinate system, given by the orthonormal matrix R, such that
\[ \begin{bmatrix} \varepsilon'_{x'x'} & 0 & 0\\ 0 & \varepsilon'_{y'y'} & 0\\ 0 & 0 & \varepsilon'_{z'z'} \end{bmatrix} = R^T \cdot \begin{bmatrix} \varepsilon_{xx} & \varepsilon_{xy} & \varepsilon_{xz}\\ \varepsilon_{yx} & \varepsilon_{yy} & \varepsilon_{yz}\\ \varepsilon_{zx} & \varepsilon_{zy} & \varepsilon_{zz} \end{bmatrix} \cdot R\,. \]


The entries on the diagonal are called principal strains.

Invariant expressions of the strain tensor

Many physical expressions do not depend on the coordinate system used to describe the system, e.g. the energy density of the system. Thus they should be invariant under rotations of the above type. To examine this, invariant expressions are the essential tool. For this use the characteristic polynomial of the above matrices. Let S and S' be the strain matrix in the original and rotated coordinate system. Thus examine
\begin{align*}
S' &= R^T\,S\,R\\
\det(S' - \lambda I) &= \det(R^T\,S\,R - \lambda\,R^T R) = \det(R^T(S - \lambda I)\,R) = \det(R^T)\,\det(S - \lambda I)\,\det(R) = \det(S - \lambda I)\,.
\end{align*}
The two characteristic polynomials are identical and the coefficients for λ^3, λ^2, λ^1 and λ^0 = 1 have to coincide. This leads to three invariants. Compute the determinant to find
\begin{align*}
\det(S - \lambda I) &= \det \begin{bmatrix} \varepsilon_{xx}-\lambda & \varepsilon_{xy} & \varepsilon_{xz}\\ \varepsilon_{yx} & \varepsilon_{yy}-\lambda & \varepsilon_{yz}\\ \varepsilon_{zx} & \varepsilon_{zy} & \varepsilon_{zz}-\lambda \end{bmatrix}\\
 &= -\lambda^3 + \lambda^2\,(\varepsilon_{xx}+\varepsilon_{yy}+\varepsilon_{zz}) - \lambda\,(\varepsilon_{yy}\varepsilon_{zz} - \varepsilon_{yz}^2 + \varepsilon_{xx}\varepsilon_{zz} - \varepsilon_{xz}^2 + \varepsilon_{xx}\varepsilon_{yy} - \varepsilon_{xy}^2) + \det(S)\,.
\end{align*}
As a consequence find three invariant strain expressions.
\begin{align*}
I_1 &= \operatorname{trace}(S) = \varepsilon_{xx} + \varepsilon_{yy} + \varepsilon_{zz}\\
I_2 &= \varepsilon_{yy}\varepsilon_{zz} - \varepsilon_{yz}^2 + \varepsilon_{xx}\varepsilon_{zz} - \varepsilon_{xz}^2 + \varepsilon_{xx}\varepsilon_{yy} - \varepsilon_{xy}^2\\
I_3 &= \det(S) = \varepsilon_{xx}\varepsilon_{yy}\varepsilon_{zz} + 2\,\varepsilon_{xy}\varepsilon_{xz}\varepsilon_{yz} - \varepsilon_{xx}\varepsilon_{yz}^2 - \varepsilon_{yy}\varepsilon_{xz}^2 - \varepsilon_{zz}\varepsilon_{xy}^2
\end{align*}

• These invariant expressions can be expressed in terms of the eigenvalues of the strain matrix S: I_1 = λ_1 + λ_2 + λ_3, I_2 = λ_1 λ_2 + λ_2 λ_3 + λ_3 λ_1 and I_3 = λ_1 · λ_2 · λ_3. To verify this fact diagonalize the matrix, i.e. compute with λ_1 = ε_xx, λ_2 = ε_yy, λ_3 = ε_zz and vanishing shear strains.

• Since det(S' - λ I) = det(S - λ I) the three eigenvalues λ_i are invariant. But there is no easy, explicit expression for the eigenvalues in terms of the coefficients of the tensor, since a cubic equation has to be solved to determine the values of λ_i.

We will see (page 352) that the elastic energy density can be expressed in terms of the invariants I_i.
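A quick numerical illustration of this invariance (the strain matrix and the rotation are arbitrarily chosen) can be done in Octave:

% check that trace, I2 and det of a symmetric strain matrix are invariant under rotation
S  = [0.04, 0.01, 0.00; 0.01, 0.00, 0.02; 0.00, 0.02, -0.01];   % arbitrary symmetric strain
R  = expm([0 -1 0; 1 0 0; 0 0 0]*0.3)*expm([0 0 1; 0 0 0; -1 0 0]*0.7);  % some rotation matrix
Sp = R'*S*R;
I2 = @(M) (trace(M)^2 - trace(M^2))/2;          % second invariant
disp([trace(S), I2(S), det(S); trace(Sp), I2(Sp), det(Sp)])     % the two rows coincide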

5.3.2 Description of Stress


For the sake of simplicity first examine planar situations and at the end of the section apply the obvious extensions to the more realistic situation in space.
Consider an elastic body where all forces are parallel to the xy plane and independent of z and the contour of the solid is independent of z. Consider a small rectangular box of this solid with width Δx, height Δy and depth Δz. A cut parallel to the xy plane is shown in Figure 5.8. Based on the formula
\[ \text{stress} = \frac{\text{force}}{\text{area}} \]
examine the normal stress and tangential stress components on the surfaces of this rectangle. Assume that the small box is in a static situation and there are no external body forces. Balancing all components of

forces and moments leads to the conditions
\[ \sigma_x^2 = \sigma_x^1\,,\quad \sigma_y^3 = \sigma_y^4\,,\quad \tau_{yx}^1 = \tau_{yx}^2 = \tau_{xy}^3 = \tau_{xy}^4\,. \]
(Balancing the forces in x and y direction and the moment leads to
\[ (\sigma_x^1 - \sigma_x^2)\,\Delta y + (\tau_{xy}^3 - \tau_{xy}^4)\,\Delta x = 0\,,\quad (\sigma_y^3 - \sigma_y^4)\,\Delta x + (\tau_{yx}^1 - \tau_{yx}^2)\,\Delta y = 0\,,\quad (\tau_{yx}^1 + \tau_{yx}^2)\,\Delta y - (\tau_{xy}^3 + \tau_{xy}^4)\,\Delta x = 0 \]
for all positive values of Δx and Δy. Change the values of Δx and Δy independently to arrive at the desired conclusion.)
Thus the situation simplifies as shown on the right in Figure 5.8.

Figure 5.8: Definition of stress in a plane, initial (left) and simplified (right) situation

The stress situation of a solid is described by all components of the stress, typically as functions of the
location.

Normal and tangential stress in an arbitrary direction


Figure 5.9 shows a virtual cut of a solid such that the normal vector ~n = (cos φ, sin φ)T forms an angle φ
with the x axis. Now examine the normal stress σ and the tangential stress τ along side A.
Figure 5.9: Normal and tangential stress in an arbitrary direction (a triangular cut with hypotenuse A, normal vector \vec n and stress vector \vec s; the legs A_x and A_y are loaded by σ_x, σ_y and τ_xy)

Since A_x = A\,\sin\varphi and A_y = A\,\cos\varphi the condition of balance of forces leads to
\begin{align*}
s_x\,A &= \sigma_x\,A_y + \tau_{xy}\,A_x\,, \quad\text{i.e.}\quad s_x = \sigma_x\,\cos\varphi + \tau_{xy}\,\sin\varphi\,,\\
s_y\,A &= \sigma_y\,A_x + \tau_{xy}\,A_y\,, \quad\text{i.e.}\quad s_y = \tau_{xy}\,\cos\varphi + \sigma_y\,\sin\varphi\,,
\end{align*}

SHA 21-5-21
CHAPTER 5. CALCULUS OF VARIATIONS, ELASTICITY AND TENSORS 338

where ~s = (sx , sy )T . Using matrices write the above in the form


! " # !
sx σx τxy cos φ
= or ~s = S · ~n (5.9)
sy τxy σy sin φ

where the symmetric stress matrix is given by


" #
σx τxy
S= .
τxy σy

The stress vector ~s may be decomposed in a normal component σ and a tangential component τ . Determine
the component of ~σ in the direction of ~n by
" # !
T σx τxy cos φ
σ = h~n , ~si = ~n · ~s = h~n , ~si = (cos φ, sin φ) ·
τxy σy sin φ
" # !
σx τxy cos φ
τ = (− sin φ, cos φ) · .
τxy σy sin φ

The value of σ is positive if ~σ is pointing out of the solid and τ is positive if ~τ is pointing upward in
Figure 5.9.
This allows to consider a new coordinate system, generated by rotation of the xy system by an angle φ
(see Figure 5.7, page 332). The result is
" # !
σx τxy cos φ
σx0 = (cos φ, sin φ) ·
τxy σy sin φ
" # !
σx τxy − sin φ
σy0 = (− sin φ, cos φ) ·
τxy σy cos φ
" # !
σx τxy cos φ
τx0 y0 = (− sin φ, cos φ) · .
τxy σy sin φ

An elementary matrix multiplication shows that this is equivalent to


" # " # " # " #
σx0 τx0 y0 cos φ sin φ σx τxy cos φ − sin φ
= · · (5.10)
τx0 y0 σy0 − sin φ cos φ τxy σy sin φ cos φ

This transformation formula should be compared with result 5–19 on page 332. It shows that the behavior
under coordinate rotations for the stress matrix and the strain matrix is identical.

Normal and tangential stress in space


All the above observations can be adapted to the situation in space. Figure 5.10 shows the notational convention and Table 5.4 gives a short description.

The symmetric stress matrix S is given by
\[ S = \begin{bmatrix} \sigma_x & \tau_{xy} & \tau_{xz}\\ \tau_{xy} & \sigma_y & \tau_{yz}\\ \tau_{xz} & \tau_{yz} & \sigma_z \end{bmatrix} \]


Figure 5.10: Components of stress in space (normal stresses σ_x, σ_y, σ_z and tangential stresses τ_xy, τ_yx, τ_xz, τ_zx, τ_yz, τ_zy on the faces of a small cube)

• σ_x, σ_y, σ_z : normal stress at a surface orthogonal to x = const, y = const, z = const respectively
• τ_xy = τ_yx : tangential stress in y direction at a surface orthogonal to x = const, and tangential stress in x direction at a surface orthogonal to y = const
• τ_xz = τ_zx : tangential stress in z direction at a surface orthogonal to x = const, and tangential stress in x direction at a surface orthogonal to z = const
• τ_yz = τ_zy : tangential stress in z direction at a surface orthogonal to y = const, and tangential stress in y direction at a surface orthogonal to z = const

Table 5.4: Description of normal and tangential stress in space


and the stress vector \vec s at a plane orthogonal to \vec n is given by
\[ \vec s = S\cdot\vec n\,. \]
The behavior of S under rotation of the coordinate system \vec x = R^T\cdot\vec x' or \vec x' = R\cdot\vec x is given by
\[ S' = \begin{bmatrix} \sigma_{x'} & \tau_{x'y'} & \tau_{x'z'}\\ \tau_{x'y'} & \sigma_{y'} & \tau_{y'z'}\\ \tau_{x'z'} & \tau_{y'z'} & \sigma_{z'} \end{bmatrix} = R^T\cdot \begin{bmatrix} \sigma_x & \tau_{xy} & \tau_{xz}\\ \tau_{xy} & \sigma_y & \tau_{yz}\\ \tau_{xz} & \tau_{yz} & \sigma_z \end{bmatrix} \cdot R\,. \]
When solving the cubic equation
\[ \det(S - \lambda\,I_3) = \det \begin{bmatrix} \sigma_x - \lambda & \tau_{xy} & \tau_{xz}\\ \tau_{xy} & \sigma_y - \lambda & \tau_{yz}\\ \tau_{xz} & \tau_{yz} & \sigma_z - \lambda \end{bmatrix} = 0 \]
for the three eigenvalues λ_{1,2,3} and the corresponding orthonormal eigenvectors \vec e_1, \vec e_2 and \vec e_3, we find a coordinate system in which all tangential stress components vanish. Since there are only normal stresses the stress matrix S' has the form
\[ \begin{bmatrix} \sigma_{x'} & 0 & 0\\ 0 & \sigma_{y'} & 0\\ 0 & 0 & \sigma_{z'} \end{bmatrix}\,. \]
The numbers on the diagonal are the principal stresses. This is very useful to extract results out of stress computations. When asked to find the stress at a given point in a solid many different forms of answers are possible:

• Give all six components of the stress in a given coordinate system.

• Find the three principal stresses and use those as a result. One might also give the corresponding directions.

• Give the maximal principal stress.

• Give the maximal and minimal principal stress.

• Give the von Mises stress or the Tresca stress.

The "correct" form of the answer depends on the context.


5–22 Result : Mohr's circle
Using the above transformation law start out with a situation of principal stresses σ_1 and σ_2 and then rotate the coordinate system by an angle φ.
\begin{align*}
\begin{bmatrix} \sigma_x & \tau_{xy}\\ \tau_{xy} & \sigma_y \end{bmatrix} &= \begin{bmatrix} \cos\varphi & +\sin\varphi\\ -\sin\varphi & \cos\varphi \end{bmatrix} \begin{bmatrix} \sigma_1 & 0\\ 0 & \sigma_2 \end{bmatrix} \begin{bmatrix} \cos\varphi & -\sin\varphi\\ +\sin\varphi & \cos\varphi \end{bmatrix}
 = \begin{bmatrix} \cos\varphi & +\sin\varphi\\ -\sin\varphi & \cos\varphi \end{bmatrix} \begin{bmatrix} \sigma_1\,\cos\varphi & -\sigma_1\,\sin\varphi\\ +\sigma_2\,\sin\varphi & \sigma_2\,\cos\varphi \end{bmatrix}\\
 &= \begin{bmatrix} \sigma_1\,\cos^2\varphi + \sigma_2\,\sin^2\varphi & (\sigma_2-\sigma_1)\,\cos\varphi\,\sin\varphi\\ (\sigma_2-\sigma_1)\,\cos\varphi\,\sin\varphi & \sigma_1\,\sin^2\varphi + \sigma_2\,\cos^2\varphi \end{bmatrix}
\end{align*}
Using trigonometric identities this leads to
\begin{align*}
\sigma_x &= \sigma_1\,\cos^2\varphi + \sigma_2\,\sin^2\varphi = \sigma_1\,\frac{1+\cos 2\varphi}{2} + \sigma_2\,\frac{1-\cos 2\varphi}{2} = \frac{\sigma_1+\sigma_2}{2} - \frac{\sigma_2-\sigma_1}{2}\,\cos 2\varphi\\
\sigma_y &= \sigma_1\,\sin^2\varphi + \sigma_2\,\cos^2\varphi = \sigma_1\,\frac{1-\cos 2\varphi}{2} + \sigma_2\,\frac{1+\cos 2\varphi}{2} = \frac{\sigma_1+\sigma_2}{2} + \frac{\sigma_2-\sigma_1}{2}\,\cos 2\varphi\\
\tau_{xy} &= (\sigma_2-\sigma_1)\,\cos\varphi\,\sin\varphi = \frac{\sigma_2-\sigma_1}{2}\,\sin 2\varphi
\end{align*}
Since (R\,\cos 2\varphi, R\,\sin 2\varphi) is the parameterization of a circle, observe that the points (σ_x, -τ_xy) and (σ_y, +τ_xy) are on a circle in the (σ,τ)–plane with center at \frac{\sigma_1+\sigma_2}{2} on the σ–axis and radius \frac{\sigma_2-\sigma_1}{2}. This is Mohr's circle shown in Figure 5.11(a).

• If the values of σ_x, σ_y and τ_xy are known one can easily construct Mohr's circle and then read out the principal stresses σ_1 and σ_2. This is equivalent to solving the corresponding quadratic equation.
\begin{align*}
0 &= \det(S - \sigma\,I) = (\sigma_x-\sigma)(\sigma_y-\sigma) - \tau_{xy}^2 = \sigma^2 - (\sigma_x+\sigma_y)\,\sigma + \sigma_x\,\sigma_y - \tau_{xy}^2\\
\sigma_{1,2} &= \frac{1}{2}\Bigl( (\sigma_x+\sigma_y) \pm \sqrt{(\sigma_x+\sigma_y)^2 - 4\,\sigma_x\,\sigma_y + 4\,\tau_{xy}^2} \Bigr) = \frac{\sigma_x+\sigma_y}{2} \pm \sqrt{\Bigl(\frac{\sigma_x-\sigma_y}{2}\Bigr)^2 + \tau_{xy}^2}
\end{align*}

• The rotation angle φ is determined by
\[ \tan(2\,\varphi) = \frac{\tau_{xy}}{\sigma_y - \frac{\sigma_x+\sigma_y}{2}} = \frac{2\,\tau_{xy}}{\sigma_y - \sigma_x}\,. \]

• There is an adaption of Mohr's circle to a 3D stress setup, shown in Figure 5.11(b). The Wikipedia page en.wikipedia.org/wiki/Mohr's circle shows more information on Mohr's circle. In [ChouPaga67, §1.5, §1.8] find a derivation of Mohr's circle in 2D and 3D.

• Since the transformation law for strains is identical one can generate a similar circle with the normal strains on the horizontal axis and the shear strains on the vertical axis.
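The quadratic formula above is easy to cross-check against a direct eigenvalue computation (the stress values are arbitrarily chosen, just an illustration):

% principal stresses from (sigma_x, sigma_y, tau_xy): quadratic formula vs. eig
sx = 120; sy = -40; txy = 50;                 % arbitrarily chosen stress values
center = (sx+sy)/2;  radius = sqrt(((sx-sy)/2)^2 + txy^2);
sigma12 = [center+radius, center-radius]      % from Mohr's circle
eig([sx, txy; txy, sy])'                      % gives the same two values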

5.3.3 Invariant Stress Expressions, Von Mises Stress and Tresca Stress
The von Mises stress σ_M (also called octahedral shear stress) is a scalar expression and often used to examine failure modes of solids, see also the following section.
\begin{align*}
\sigma_M^2 &= \sigma_x^2 + \sigma_y^2 + \sigma_z^2 - \sigma_x\,\sigma_y - \sigma_y\,\sigma_z - \sigma_z\,\sigma_x + 3\,\tau_{xy}^2 + 3\,\tau_{yz}^2 + 3\,\tau_{zx}^2\\
 &= \frac{1}{2}\Bigl( (\sigma_x-\sigma_y)^2 + (\sigma_y-\sigma_z)^2 + (\sigma_z-\sigma_x)^2 \Bigr) + 3\,\bigl( \tau_{xy}^2 + \tau_{yz}^2 + \tau_{zx}^2 \bigr)
\end{align*}


Figure 5.11: Mohr's circle for the 2D and 3D situations ((a) 2D situation with the points (σ_x, -τ_xy) and (σ_y, +τ_xy) on the circle; (b) 3D situation, source: Wikipedia)

It is important that the above expression for the von Mises stress does not depend on the orientation of the coordinate system. On page 336 find the invariants for the strain matrix. Using identical arguments determine three invariants for the stress matrix.
\begin{align*}
I_1 &= \sigma_x + \sigma_y + \sigma_z\\
I_2 &= \sigma_y\,\sigma_z + \sigma_x\,\sigma_z + \sigma_x\,\sigma_y - \tau_{yz}^2 - \tau_{xz}^2 - \tau_{xy}^2\\
I_3 &= \det \begin{bmatrix} \sigma_x & \tau_{xy} & \tau_{xz}\\ \tau_{xy} & \sigma_y & \tau_{yz}\\ \tau_{xz} & \tau_{yz} & \sigma_z \end{bmatrix} = +\sigma_x\,\sigma_y\,\sigma_z + 2\,\tau_{xy}\,\tau_{xz}\,\tau_{yz} - \sigma_x\,\tau_{yz}^2 - \sigma_y\,\tau_{xz}^2 - \sigma_z\,\tau_{xy}^2
\end{align*}
Obviously any function of these invariants is an invariant too and consequently independent of the orientation of the coordinate system. Many physically important expressions have to be invariant, e.g. the energy density.

With elementary (but tedious) algebra find
\[ I_1^2 - 3\,I_2 = \sigma_x^2 + \sigma_y^2 + \sigma_z^2 - \sigma_y\,\sigma_z - \sigma_x\,\sigma_z - \sigma_x\,\sigma_y + 3\,\tau_{yz}^2 + 3\,\tau_{xz}^2 + 3\,\tau_{xy}^2 = \sigma_M^2 \]
and consequently the von Mises stress is invariant under rotations. If reduced to principal stresses (no shear stress) find
\[ 2\,\sigma_M^2 = (\sigma_1-\sigma_2)^2 + (\sigma_2-\sigma_3)^2 + (\sigma_3-\sigma_1)^2\,. \]
Thus the von Mises stress is a measure for the differences among the three principal stresses. In the simplest possible case of stress in one direction only, i.e. σ_2 = σ_3 = 0, find
\[ \sigma_M^2 = \frac{1}{2}\Bigl( (\sigma_1-0)^2 + (0-0)^2 + (0-\sigma_1)^2 \Bigr) = \sigma_1^2\,. \]
The Tresca stress σ_T is defined by
\[ \sigma_T = \max\{ |\sigma_1-\sigma_2|\,,\ |\sigma_2-\sigma_3|\,,\ |\sigma_3-\sigma_1| \} \]
and thus a measure of the differences amongst the principal stresses, similar to the von Mises stress.


5–23 Corollary : The von Mises stress is smaller than or equal to the Tresca stress, i.e.
\[ 0 \le \sigma_M \le \sigma_T\,. \]
Proof : Without loss of generality examine the principal stress situation and assume σ_1 ≤ σ_2 ≤ σ_3.
\begin{align*}
2\,\sigma_M^2 &= (\sigma_1-\sigma_2)^2 + (\sigma_2-\sigma_3)^2 + (\sigma_3-\sigma_1)^2\\
2\,\sigma_T^2 &= 2\,(\sigma_3-\sigma_1)^2\\
2\,(\sigma_M^2 - \sigma_T^2) &= (\sigma_1-\sigma_2)^2 + (\sigma_2-\sigma_3)^2 - (\sigma_3-\sigma_1)^2 = 2\,\sigma_2^2 - 2\,\sigma_1\,\sigma_2 - 2\,\sigma_3\,\sigma_2 + 2\,\sigma_1\,\sigma_3\\
 &= 2\,(\sigma_2-\sigma_3)\,(\sigma_2-\sigma_1) \le 0
\end{align*}
This implies σ_M^2 ≤ σ_T^2. □

To decide whether a material will fail you will need the maximal principal stress, the von Mises and Tresca stress. You need the definition of the different stresses and strains, Hooke's law (Section 5.6) and possibly the plane stress and plane strain description (Sections 5.8 and 5.9). For a given situation there are multiple paths to compute these:

• If the 3 × 3 stress matrix is known, then first determine the eigenvalues, i.e. the principal stresses. Once you have these it is easy to read out the desired values.

• If the 3 × 3 strain matrix is known, then you have two options to determine the principal stresses.

  1. First use Hooke's law to determine the 3 × 3 stress matrix, then proceed as above.
  2. Determine the eigenvalues of the strain matrix to determine the principal strains. Then use Hooke's law to determine the principal stresses.

• If the situation is a plane stress situation and you know the 2 × 2 stress matrix, then you may first generate the full 3 × 3 stress matrix, and then proceed as above.

• If the situation is a plane strain situation and you know the 2 × 2 strain matrix, then you may first generate the full 3 × 3 strain matrix, and then proceed as above.

These computational paths are illustrated in Figure 5.12.
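The first of these paths is easily mechanized; a minimal Octave sketch (with arbitrarily chosen stress values) computing the three scalar measures from a given 3 × 3 stress matrix could look like this:

% maximal principal stress, Tresca and von Mises stress from a 3x3 stress matrix
S = [100, 30,  0;            % arbitrarily chosen symmetric stress matrix
      30, 50, 20;
       0, 20, -10];
s = eig(S);                                        % principal stresses
sigma_maxPrinc = max(abs(s))
sigma_Tresca   = max(abs([s(1)-s(2), s(2)-s(3), s(3)-s(1)]))
sigma_vonMises = sqrt(((s(1)-s(2))^2 + (s(2)-s(3))^2 + (s(3)-s(1))^2)/2)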

5.4 Elastic Failure Modes


The results in this section are inspired by [Hear97]. The criteria for elastic failure depend on the type of material to be examined: ductile (in German: dehnbar, zäh) or brittle (in German: spröd, brüchig). To simplify the formulation assume that the stress tensor is given in principal form, i.e.
\[ S = \begin{bmatrix} \sigma_x & \tau_{xy} & \tau_{xz}\\ \tau_{xy} & \sigma_y & \tau_{yz}\\ \tau_{xz} & \tau_{yz} & \sigma_z \end{bmatrix} = \begin{bmatrix} \sigma_1 & 0 & 0\\ 0 & \sigma_2 & 0\\ 0 & 0 & \sigma_3 \end{bmatrix}\,. \]


   
\[ \sigma_{MaxPrinc} = \max\{ |\sigma_1|\,,\ |\sigma_2|\,,\ |\sigma_3| \}\,,\quad \sigma_{Tresca} = \max\{ |\sigma_1-\sigma_2|\,,\ |\sigma_2-\sigma_3|\,,\ |\sigma_3-\sigma_1| \}\,,\quad \sigma_{vonMises} = \tfrac{1}{\sqrt 2}\,\sqrt{(\sigma_1-\sigma_2)^2 + (\sigma_2-\sigma_3)^2 + (\sigma_3-\sigma_1)^2} \]

Figure 5.12: How to determine the maximal principal stress, von Mises stress and Tresca stress (from the 3 × 3 strain matrix via Hooke's law to the 3 × 3 stress matrix, or from a plane stress/strain situation; eigenvalues yield the principal strains and stresses, from which the three scalar stresses above are computed)

5.4.1 Maximum Principal Stress Theory


When the maximum principal stress exceeds a critical yield stress σ_Y the material might fail. For the material not to fail the condition
\[ \max\{ |\sigma_1|\,,\ |\sigma_2|\,,\ |\sigma_3| \} \le \sigma_Y \]
has to be satisfied. This theory can be shown to apply well to brittle materials, while it should not be applied to ductile materials. Even in the case of a pure tension test, the failure of a ductile material is caused by large shearing, as examined in the next section. Homogeneous materials can withstand huge hydrostatic pressures, indicating that a maximum principal stress criterion might not be a wise choice.

5.4.2 Maximum Shear Stress Theory


For the 2D situation we recognize from Mohr's circle in Result 5–22 that the shear stress at a plane with angle φ is given by
\[ \tau_{xy} = (\sigma_2-\sigma_1)\,\cos\varphi\,\sin\varphi\,. \]
The maximal value is attained at 45° angles and the maximal value is \frac{1}{2}\,(\sigma_2-\sigma_1). This leads to the Tresca stress
\[ \sigma_T = \max\{ |\sigma_1-\sigma_2|\,,\ |\sigma_2-\sigma_3|\,,\ |\sigma_3-\sigma_1| \}\,. \]
If working under the condition that the material fails because of shear stresses, we are led to the condition
\[ \sigma_T \le \sigma_Y\,. \]
Most cracks in metals are caused by severe shearing forces, i.e. they are ductile. If a metal is submitted to a traction test (pull until it breaks) the direction of the initial break and the force show an angle of 45°. This is very different for brittle materials. (Twisting a piece of chalk until it breaks usually leads to an angle of 45° for the initial crack. It is an exercise to verify that for twisting the maximal normal stress occurs at this angle. Since chalk is a brittle material it breaks because the maximal normal stress is too large.)


5.4.3 Maximum Distortion Energy


The stress may be written as a sum of a hydrostatic stress (identical in all directions) and shape changing stresses.
\begin{align*}
\sigma_1 &= \frac{1}{3}\,(\sigma_1+\sigma_2+\sigma_3) + \frac{1}{3}\,(\sigma_1-\sigma_2) + \frac{1}{3}\,(\sigma_1-\sigma_3)\\
\sigma_2 &= \frac{1}{3}\,(\sigma_1+\sigma_2+\sigma_3) + \frac{1}{3}\,(\sigma_2-\sigma_1) + \frac{1}{3}\,(\sigma_2-\sigma_3)\\
\sigma_3 &= \frac{1}{3}\,(\sigma_1+\sigma_2+\sigma_3) + \frac{1}{3}\,(\sigma_3-\sigma_1) + \frac{1}{3}\,(\sigma_3-\sigma_2)
\end{align*}
The elastic energy density W can be decomposed into a volume changing component and a shape changing part.
\[ W = \frac{1-2\nu}{6\,E}\,(\sigma_1+\sigma_2+\sigma_3)^2 + \frac{1+\nu}{3\,E}\,\sigma_M^2 \]
This is verified in an exercise. A similar argument can be based on strain instead of stress, see Section 5.6.3 on page 353. When computing the energy contribution of the shape changing stresses we are thus led to the von Mises stress
\[ 2\,\sigma_M^2 = (\sigma_1-\sigma_2)^2 + (\sigma_2-\sigma_3)^2 + (\sigma_3-\sigma_1)^2 \]
and the corresponding criterion
\[ \sigma_M \le \sigma_Y\,. \]

5.5 Scalars, Vectors and Tensors


This section is a very brief introduction to Cartesian tensors. It consists of the main definitions, the transformation rules and a few examples. For a readable, elementary introduction you may consult [Sege77]. An n-th order tensor in R^3 is an object whose specification requires a collection of 3^n numbers, called the components of the tensor. Scalars are tensors of order 0 with 3^0 = 1 component, vectors are tensors of order 1 with 3^1 = 3 components and second order tensors have 3^2 = 9 components. For an object to be called a tensor the components have to satisfy specific transformation rules when the coordinate system is changed.
To simplify the presentation examine the situation in R^2 only, but it has to be pointed out that all results and examples remain valid in R^3.

5.5.1 Change of Coordinate Systems

In Figure 5.7 (page 332) the basis vectors of a coordinate system are rotated by a fixed angle α to obtain the new basis vectors. Then the components of a vector are transformed according to the transformation rules below.
\[ \begin{pmatrix} x'\\ y' \end{pmatrix} = R^T\cdot \begin{pmatrix} x\\ y \end{pmatrix} = \begin{bmatrix} \cos\alpha & \sin\alpha\\ -\sin\alpha & \cos\alpha \end{bmatrix} \cdot \begin{pmatrix} x\\ y \end{pmatrix}\,, \qquad \begin{pmatrix} x\\ y \end{pmatrix} = R\cdot \begin{pmatrix} x'\\ y' \end{pmatrix} = \begin{bmatrix} \cos\alpha & -\sin\alpha\\ \sin\alpha & \cos\alpha \end{bmatrix} \cdot \begin{pmatrix} x'\\ y' \end{pmatrix} \]

5.5.2 Zero Order Tensors: Scalars

A scalar function u(x,y) determines one scalar value at given points in space. It might be given in the original or the transformed system, the resulting values have to coincide. Scalars are invariant under coordinate transformations:
\[ u'(x',y') = u(x,y)\,. \]

Examples of scalars include: temperature, hydrostatic pressure, density, concentration. Observe that not
all expressions leading to a number are invariant under transformations. As an example consider the partial
derivative of a scalar with respect to the first coordinate, i.e. ∂f/∂x₁. The transformation rule for this expression
will be examined below.

5.5.3 First Order Tensors: Vectors


To determine a vector ⃗u two numbers are required, ⃗u = (u, v)ᵀ. In the new coordinate system the same
vector is given by ⃗u' = (u', v')ᵀ. The transformation has to satisfy the property

$$\begin{pmatrix} u' \\ v' \end{pmatrix} = R^T \cdot \begin{pmatrix} u \\ v \end{pmatrix} .$$

5–24 Example : Well known and often used examples of vectors are position vectors, velocity and force
vectors. ♦

5–25 Example : Gradient as first order tensor


The gradient of a scalar f is a first order tensor. This is a consequence of the chain rule.

f 0 (x0 , y 0 ) = f (x, y) = f (x0 cos α − y 0 sin α, x0 sin α + y 0 cos α)


∂ 0 0 0 ∂f ∂f
0
f (x , y ) = + cos α + sin α
∂x ∂x ∂y
∂ 0 0 0 ∂f ∂f
0
f (x , y ) = − sin α + cos α
∂y ∂x ∂y
and thus find ! " # ! !
∂ 0 ∂ ∂
∂x0 f cos α sin α ∂x f ∂x f
∂ 0
= · ∂
= RT · ∂
.
∂y 0 f − sin α cos α ∂y f ∂y f

The gradient is often used as a row vector; transposing the above identity we conclude

$$\left( \frac{\partial f'}{\partial x'} ,\; \frac{\partial f'}{\partial y'} \right)
= \left( \frac{\partial f}{\partial x} ,\; \frac{\partial f}{\partial y} \right) \cdot
\begin{bmatrix} \cos\alpha & -\sin\alpha \\ \sin\alpha & \cos\alpha \end{bmatrix}
= \left( \frac{\partial f}{\partial x} ,\; \frac{\partial f}{\partial y} \right) \cdot R .$$

Observe that not all pairs of scalar expressions transform according to a first order tensor. As examples
consider stress and strain. The transformation rules for

$$\begin{pmatrix} \varepsilon_{xx} \\ \varepsilon_{yy} \end{pmatrix}
\quad\text{and}\quad
\begin{pmatrix} \sigma_x \\ \sigma_y \end{pmatrix}$$

will be examined below.

5.5.4 Second Order Tensors: some Matrices


A second order tensor A requires 4 components, conveniently arranged in the form of a 2 × 2–matrix.

$$A = \begin{bmatrix} a_{1,1} & a_{1,2} \\ a_{2,1} & a_{2,2} \end{bmatrix}$$


When a new coordinate system is introduced the required transformation rule, for A to be a tensor, is

$$\begin{bmatrix} a'_{1,1} & a'_{1,2} \\ a'_{2,1} & a'_{2,2} \end{bmatrix}
= \begin{bmatrix} \cos\alpha & \sin\alpha \\ -\sin\alpha & \cos\alpha \end{bmatrix} \cdot
\begin{bmatrix} a_{1,1} & a_{1,2} \\ a_{2,1} & a_{2,2} \end{bmatrix} \cdot
\begin{bmatrix} \cos\alpha & -\sin\alpha \\ \sin\alpha & \cos\alpha \end{bmatrix},
\qquad A' = R^T \cdot A \cdot R .$$

To decide whether a 2 × 2 matrix is a tensor this transformation rule has to be verified. When all details are
carried out this leads to the following formula, where C = cos α and S = sin α.

$$\begin{aligned}
\begin{bmatrix} a'_{1,1} & a'_{1,2} \\ a'_{2,1} & a'_{2,2} \end{bmatrix}
&= \begin{bmatrix} C & S \\ -S & C \end{bmatrix} \cdot
\begin{bmatrix} a_{1,1} & a_{1,2} \\ a_{2,1} & a_{2,2} \end{bmatrix} \cdot
\begin{bmatrix} C & -S \\ S & C \end{bmatrix} \\
&= \begin{bmatrix} C & S \\ -S & C \end{bmatrix} \cdot
\begin{bmatrix} a_{1,1} C + a_{1,2} S & -a_{1,1} S + a_{1,2} C \\ a_{2,1} C + a_{2,2} S & -a_{2,1} S + a_{2,2} C \end{bmatrix} \\
&= \begin{bmatrix}
a_{1,1} C^2 + (a_{1,2}+a_{2,1})\,C S + a_{2,2} S^2 &
(-a_{1,1}+a_{2,2})\,S C + a_{1,2} C^2 - a_{2,1} S^2 \\
(-a_{1,1}+a_{2,2})\,S C - a_{1,2} S^2 + a_{2,1} C^2 &
a_{1,1} S^2 - (a_{1,2}+a_{2,1})\,C S + a_{2,2} C^2
\end{bmatrix}
\end{aligned}$$
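
The rule A' = Rᵀ · A · R is easy to verify numerically. The sketch below (illustrative values only, not from the notes) compares one entry of the transformed matrix with the expanded formula above.

import numpy as np

def rotate_tensor(A, alpha):
    """Components of a second order tensor in a coordinate system rotated by alpha."""
    C, S = np.cos(alpha), np.sin(alpha)
    R = np.array([[C, -S], [S, C]])
    return R.T @ A @ R

A = np.array([[3.0, 1.0],
              [2.0, -1.0]])          # assumed example matrix
alpha = 0.4                          # assumed rotation angle
C, S = np.cos(alpha), np.sin(alpha)

Ap = rotate_tensor(A, alpha)
# entry (1,1) according to the expanded formula
a11_formula = A[0, 0]*C**2 + (A[0, 1] + A[1, 0])*C*S + A[1, 1]*S**2
print(np.isclose(Ap[0, 0], a11_formula))   # True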

5–26 Example : Stress and strain as tensors


According to Result 5–19 the components of the strain are transformed by

$$\begin{bmatrix} \varepsilon'_{x'x'} & \varepsilon'_{x'y'} \\ \varepsilon'_{x'y'} & \varepsilon'_{y'y'} \end{bmatrix}
= \begin{bmatrix} \cos\phi & \sin\phi \\ -\sin\phi & \cos\phi \end{bmatrix} \cdot
\begin{bmatrix} \varepsilon_{xx} & \varepsilon_{xy} \\ \varepsilon_{xy} & \varepsilon_{yy} \end{bmatrix} \cdot
\begin{bmatrix} \cos\phi & -\sin\phi \\ \sin\phi & \cos\phi \end{bmatrix}$$

and based on equation (5.10) we find for the stress

$$\begin{bmatrix} \sigma_{x'} & \tau_{x'y'} \\ \tau_{x'y'} & \sigma_{y'} \end{bmatrix}
= \begin{bmatrix} \cos\phi & \sin\phi \\ -\sin\phi & \cos\phi \end{bmatrix} \cdot
\begin{bmatrix} \sigma_x & \tau_{xy} \\ \tau_{xy} & \sigma_y \end{bmatrix} \cdot
\begin{bmatrix} \cos\phi & -\sin\phi \\ \sin\phi & \cos\phi \end{bmatrix} .$$

Thus stress and strain, as examined in the previous sections, are second order tensors. ♦

5–27 Example : Linear mapping as tensor


A linear mapping from R2 to R2 is completely determined by the image of the basis vectors, see Sec-
tion 3.2.4. Use the components of these images as columns for a matrix A, then the action of the linear
mapping can be described by a matrix multiplication.
$$\begin{pmatrix} x \\ y \end{pmatrix} \longrightarrow A \cdot \begin{pmatrix} x \\ y \end{pmatrix},
\qquad\text{e.g.}\qquad
\begin{pmatrix} x \\ y \end{pmatrix} \longrightarrow \begin{bmatrix} 2 & 0.5 \\ 1 & 1 \end{bmatrix} \cdot \begin{pmatrix} x \\ y \end{pmatrix} .$$

Find a numerical example and the corresponding picture in Figure 5.13.


If the same linear mapping is to be examined in a new coordinate system (x', y') the matrix will obviously
change. To determine this new matrix A' the following steps have to be performed:
1. Determine the original components x, y based on the new components x' and y'.
2. Compute the original components of the image by multiplying with the matrix A.
3. Determine the new components of the image.
This leads to

$$\begin{pmatrix} x' \\ y' \end{pmatrix} \longrightarrow A' \cdot \begin{pmatrix} x' \\ y' \end{pmatrix}
= R^T \cdot A \cdot R \cdot \begin{pmatrix} x' \\ y' \end{pmatrix} .$$
Thus linear mappings can be considered second order tensors. ♦



  
[Figure omitted in this text version: the basis vectors e₁, e₂ and their images A·e₁, A·e₂ under the mapping with A = [[2, 0.5], [1, 1]], together with the rotated axes x', y'.]

Figure 5.13: Action of a linear mapping from R² onto R²

5–28 Example : Displacement gradient tensor


Use Figure 5.6 (page 327) to conclude
$$\begin{pmatrix} \Delta u_1 \\ \Delta u_2 \end{pmatrix}
= \begin{pmatrix} u_1(x+\Delta x,\, y+\Delta y) \\ u_2(x+\Delta x,\, y+\Delta y) \end{pmatrix}
- \begin{pmatrix} u_1(x, y) \\ u_2(x, y) \end{pmatrix}
\approx \begin{bmatrix} \frac{\partial u_1}{\partial x} & \frac{\partial u_1}{\partial y} \\[4pt]
\frac{\partial u_2}{\partial x} & \frac{\partial u_2}{\partial y} \end{bmatrix}
\cdot \begin{pmatrix} \Delta x \\ \Delta y \end{pmatrix}
= DU \cdot \begin{pmatrix} \Delta x \\ \Delta y \end{pmatrix} .$$

DU is the displacement gradient tensor. Now examine this matrix if generated using a rotated coordi-
nate system. Since the above is a linear mapping we can use the previous results and conclude that the
transformation rule
DU0 = RT DU R (5.11)
is correct, i.e. the displacement gradient is a tensor of order 2 . ♦

5–29 Example : Second order partial derivatives


The computation for the gradient in Example 5–25 can be extended to

$$f'(x', y') = f(x' \cos\alpha - y' \sin\alpha,\; x' \sin\alpha + y' \cos\alpha)$$

$$\frac{\partial}{\partial x'} f'(x', y') = +\cos\alpha\,\frac{\partial f}{\partial x} + \sin\alpha\,\frac{\partial f}{\partial y}
\qquad\qquad
\frac{\partial}{\partial y'} f'(x', y') = -\sin\alpha\,\frac{\partial f}{\partial x} + \cos\alpha\,\frac{\partial f}{\partial y}$$

$$\begin{aligned}
\frac{\partial^2}{\partial x'^2} f'(x', y') &=
\Bigl(+\frac{\partial^2 f}{\partial x^2}\cos\alpha + \frac{\partial^2 f}{\partial x\,\partial y}\sin\alpha\Bigr)\cos\alpha
+\Bigl(+\frac{\partial^2 f}{\partial x\,\partial y}\cos\alpha + \frac{\partial^2 f}{\partial y^2}\sin\alpha\Bigr)\sin\alpha \\
\frac{\partial^2}{\partial x'\,\partial y'} f'(x', y') &=
\Bigl(-\frac{\partial^2 f}{\partial x^2}\sin\alpha + \frac{\partial^2 f}{\partial x\,\partial y}\cos\alpha\Bigr)\cos\alpha
+\Bigl(-\frac{\partial^2 f}{\partial x\,\partial y}\sin\alpha + \frac{\partial^2 f}{\partial y^2}\cos\alpha\Bigr)\sin\alpha \\
\frac{\partial^2}{\partial y'^2} f'(x', y') &=
-\Bigl(-\frac{\partial^2 f}{\partial x^2}\sin\alpha + \frac{\partial^2 f}{\partial x\,\partial y}\cos\alpha\Bigr)\sin\alpha
+\Bigl(-\frac{\partial^2 f}{\partial x\,\partial y}\sin\alpha + \frac{\partial^2 f}{\partial y^2}\cos\alpha\Bigr)\cos\alpha
\end{aligned}$$

and thus the symmetric Hesse matrix of second order partial derivatives satisfies

$$\begin{bmatrix}
\frac{\partial^2 f'}{\partial x'^2} & \frac{\partial^2 f'}{\partial x'\,\partial y'} \\[4pt]
\frac{\partial^2 f'}{\partial x'\,\partial y'} & \frac{\partial^2 f'}{\partial y'^2}
\end{bmatrix}
= \begin{bmatrix} +\cos\alpha & +\sin\alpha \\ -\sin\alpha & +\cos\alpha \end{bmatrix} \cdot
\begin{bmatrix}
\frac{\partial^2 f}{\partial x^2} & \frac{\partial^2 f}{\partial x\,\partial y} \\[4pt]
\frac{\partial^2 f}{\partial x\,\partial y} & \frac{\partial^2 f}{\partial y^2}
\end{bmatrix} \cdot
\begin{bmatrix} +\cos\alpha & -\sin\alpha \\ +\sin\alpha & +\cos\alpha \end{bmatrix} .$$

This implies that the matrix of second order derivatives satisfies the transformation rule of a second order
tensor. ♦

More examples of second order tensors are given in [Aris62] or [BoriTara79].


5.6 Hooke’s Law and Elastic Energy Density


Using the strains in Table 5.3 (page 335) and the stresses in Table 5.4 (page 339) we formulate the basic con-
nection between the geometric deformations (strain) and the resulting forces (stress). It is a basic physical
law¹⁶, confirmed by many measurements. The formulation shown is valid as long as all stresses and strains
are small. For large strains we would have to enter the field of nonlinear elasticity. First steps are shown in
Section 5.7, by listing a few nonlinear material laws.

5.6.1 Hooke’s Law


This is the general form of Hooke's law for a homogeneous (independent of position), isotropic (indepen-
dent of direction) material. This law is the foundation of linear elasticity and any book on elasticity will
show a formulation, e.g. [Prze68, §2.2]¹⁷, [Sout73, §2.7], or [Wein74, §10.1].

     
$$\begin{pmatrix} \varepsilon_{xx} \\ \varepsilon_{yy} \\ \varepsilon_{zz} \end{pmatrix}
= \frac{1}{E}
\begin{bmatrix} 1 & -\nu & -\nu \\ -\nu & 1 & -\nu \\ -\nu & -\nu & 1 \end{bmatrix}
\cdot \begin{pmatrix} \sigma_x \\ \sigma_y \\ \sigma_z \end{pmatrix},
\qquad
\begin{pmatrix} \varepsilon_{xy} \\ \varepsilon_{xz} \\ \varepsilon_{yz} \end{pmatrix}
= \frac{1+\nu}{E}
\begin{pmatrix} \tau_{xy} \\ \tau_{xz} \\ \tau_{yz} \end{pmatrix}
\tag{5.12}$$

or by inverting the matrix

$$\begin{pmatrix} \sigma_x \\ \sigma_y \\ \sigma_z \\ \tau_{xy} \\ \tau_{xz} \\ \tau_{yz} \end{pmatrix}
= \frac{E}{(1+\nu)(1-2\nu)}
\begin{bmatrix}
1-\nu & \nu & \nu & & & \\
\nu & 1-\nu & \nu & & 0 & \\
\nu & \nu & 1-\nu & & & \\
 & & & 1-2\nu & & \\
 & 0 & & & 1-2\nu & \\
 & & & & & 1-2\nu
\end{bmatrix}
\cdot
\begin{pmatrix} \varepsilon_{xx} \\ \varepsilon_{yy} \\ \varepsilon_{zz} \\ \varepsilon_{xy} \\ \varepsilon_{xz} \\ \varepsilon_{yz} \end{pmatrix} .
\tag{5.13}$$

With the obvious notation equation (5.13) may be written in the form σ⃗ = H · ε⃗.
Observe that the equations decouple and we can equivalently write

$$\begin{pmatrix} \sigma_x \\ \sigma_y \\ \sigma_z \end{pmatrix}
= \frac{E}{(1+\nu)(1-2\nu)}
\begin{bmatrix} 1-\nu & \nu & \nu \\ \nu & 1-\nu & \nu \\ \nu & \nu & 1-\nu \end{bmatrix}
\cdot \begin{pmatrix} \varepsilon_{xx} \\ \varepsilon_{yy} \\ \varepsilon_{zz} \end{pmatrix},
\qquad
\begin{pmatrix} \tau_{xy} \\ \tau_{xz} \\ \tau_{yz} \end{pmatrix}
= \frac{E}{1+\nu}
\begin{pmatrix} \varepsilon_{xy} \\ \varepsilon_{xz} \\ \varepsilon_{yz} \end{pmatrix} .
\tag{5.14}$$
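
The 6 × 6 matrix H of equation (5.13) can be assembled directly. The Python sketch below is not part of the original notes; it uses the strain ordering and shear convention of these notes, and the material values are assumed for illustration only. Note how the prefactor blows up as ν → 1/2, in agreement with Section 5.6.2 below.

import numpy as np

def hooke_matrix(E, nu):
    """6x6 matrix H with (sx,sy,sz,txy,txz,tyz) = H @ (exx,eyy,ezz,exy,exz,eyz),
    using the shear strain convention of these notes."""
    c = E / ((1 + nu) * (1 - 2*nu))
    H = np.zeros((6, 6))
    H[:3, :3] = c * (nu * np.ones((3, 3)) + (1 - 2*nu) * np.eye(3))  # normal block
    H[3:, 3:] = E / (1 + nu) * np.eye(3)                              # shear block
    return H

E, nu = 210e9, 0.3                                   # steel-like values (assumed)
H = hooke_matrix(E, nu)
eps = np.array([1e-3, -0.3e-3, -0.3e-3, 0, 0, 0])    # uniaxial strain state eyy = ezz = -nu*exx
print(H @ eps)                                       # sigma_x = E*exx = 2.1e8, sigma_y = sigma_z = 0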

16
One can verify that for homogeneous, isotropic materials a linear law must have this form, e.g. [Sege77]
17
The missing factors 2 are due to the different definition of the shear strains.


5.6.2 Hooke’s law for Incompressible Materials


Observe that Hooke's law in this form can not be used for incompressible materials, i.e. ν = 1/2. Examine

$$\sigma_x + \sigma_y + \sigma_z = \frac{E}{1-2\nu}\,(\varepsilon_{xx} + \varepsilon_{yy} + \varepsilon_{zz})$$

to conclude that ν = 1/2 leads to εxx + εyy + εzz = 0. This is a constraint to be satisfied for incompressible
materials. Equation (5.12) can also be written in the form

$$\begin{pmatrix} \varepsilon_{xx} \\ \varepsilon_{yy} \\ \varepsilon_{zz} \end{pmatrix}
= \frac{1+\nu}{E} \begin{pmatrix} \sigma_x \\ \sigma_y \\ \sigma_z \end{pmatrix}
- \frac{\nu}{E}\,(\sigma_x + \sigma_y + \sigma_z) \begin{pmatrix} 1 \\ 1 \\ 1 \end{pmatrix}$$

and in this form it is easy to read off that

$$\varepsilon_{xx} + \varepsilon_{yy} + \varepsilon_{zz} = \frac{1-2\nu}{E}\,(\sigma_x + \sigma_y + \sigma_z) .$$

Using Hooke's law in the form of equation (5.12) for ν = 1/2 leads to

$$\begin{pmatrix} \varepsilon_{xx} \\ \varepsilon_{yy} \\ \varepsilon_{zz} \end{pmatrix}
= \frac{1}{E}
\begin{bmatrix} 1 & -\tfrac12 & -\tfrac12 \\ -\tfrac12 & 1 & -\tfrac12 \\ -\tfrac12 & -\tfrac12 & 1 \end{bmatrix}
\cdot \begin{pmatrix} \sigma_x \\ \sigma_y \\ \sigma_z \end{pmatrix}$$

and the determinant of this matrix equals zero. Thus one can not determine the inverse matrix.

5.6.3 Elastic Energy Density


The elastic energy density will be rewritten in many different forms, suitable for different interpretations.
This is useful when extending the material properties to nonlinear laws, e.g. [Bowe10, §3.5.3, §3.5.5]. In
this section many tedious, algebraic steps are spelled out with intermediate steps, even if they are trivial.
This should be helpful when using literature on linear and nonlinear elasticity.

To start out we want to find the formula for the energy density of a deformed body. For this we consider
a small block with width ∆x, height ∆y and depth ∆z, located at the fixed origin. For a fixed displacement
vector ~u of the corner P = (∆x, ∆y, ∆z) we deform the block by a sequence of affine deformations, such
that point P moves along straight lines. The displacement vector of P is given by the formula t ~u where the
parameter t varies from 0 to 1. If the final strain is denoted by ~ε then the strains during the deformation are
given by t ~ε. Accordingly the stresses are given by t ~σ where the final stress ~σ can be computed by Hooke’s
law (e.g. equation (5.14)). Now we compute the total work needed to deform this block, using the basic
formula work = force · distance. There are six contributions:
• The face x = ∆x moves from ∆x to ∆x (1 + εxx ). For a parameter step dt at 0 < t < 1 the traveled
distance is thus εxx ∆x dt. The force is determined by the area ∆y · ∆z and the normal stress t σx ,
where σx is the stress in the final position t = 1. The first energy contribution can now be integrated
by
$$\int_0^1 (\Delta y \cdot \Delta z \cdot t\,\sigma_x)\cdot \varepsilon_{xx}\,\Delta x \; dt
= \Delta y \cdot \Delta z \cdot \Delta x \cdot \sigma_x \cdot \varepsilon_{xx} \int_0^1 t \; dt
= \frac{1}{2}\,\Delta V \cdot \sigma_x \cdot \varepsilon_{xx} .$$

• Similarly normal displacement of the faces at y = ∆y and z = ∆z lead to contributions


$$\frac{1}{2}\,\Delta V \cdot \sigma_y \cdot \varepsilon_{yy}
\quad\text{and}\quad
\frac{1}{2}\,\Delta V \cdot \sigma_z \cdot \varepsilon_{zz} .$$


[Figure omitted in this text version: a small block with edges ∆x, ∆y, ∆z and corner P = (∆x, ∆y, ∆z).]

Figure 5.14: Block to be deformed, used to determine the elastic energy density W

• To examine the situation of pure shearing strain observe that the face at x = ∆x is moved by (∂u₂/∂x) ∆x
  in the y–direction. The shear stress in that direction is τxy and thus the resulting energy is

  $$\frac{1}{2}\,\frac{\partial u_2}{\partial x}\,\Delta x \cdot \tau_{xy}\,\Delta y\,\Delta z .$$

  The face at y = ∆y is moved by (∂u₁/∂y) ∆y in the x–direction. The shear stress in that direction is τxy,
  leading to an energy contribution

  $$\frac{1}{2}\,\frac{\partial u_1}{\partial y}\,\Delta y \cdot \tau_{xy}\,\Delta x\,\Delta z .$$

  Adding these two leads to the contribution to the energy due to shearing in the xy plane,

  $$\frac{1}{2}\left(\frac{\partial u_2}{\partial x} + \frac{\partial u_1}{\partial y}\right) \tau_{xy}\,\Delta x\,\Delta y\,\Delta z
  = \varepsilon_{xy}\,\tau_{xy}\,\Delta V .$$

  The similar contributions due to shearing in the xz and yz planes lead to energies εxz τxz ∆V and
  εyz τyz ∆V.

Adding all six contributions and then dividing by the volume ∆V we obtain the energy density
$$W = \frac{\text{energy}}{\Delta V}
= \frac{1}{2}\,\bigl(\sigma_x \varepsilon_{xx} + \sigma_y \varepsilon_{yy} + \sigma_z \varepsilon_{zz}
+ 2\,\tau_{xy}\varepsilon_{xy} + 2\,\tau_{xz}\varepsilon_{xz} + 2\,\tau_{yz}\varepsilon_{yz}\bigr) .$$
This can be written as a scalar product in the form

$$W = \frac{1}{2}\left\langle
\begin{pmatrix} \sigma_x \\ \sigma_y \\ \sigma_z \end{pmatrix},
\begin{pmatrix} \varepsilon_{xx} \\ \varepsilon_{yy} \\ \varepsilon_{zz} \end{pmatrix}\right\rangle
+ \left\langle
\begin{pmatrix} \tau_{xy} \\ \tau_{xz} \\ \tau_{yz} \end{pmatrix},
\begin{pmatrix} \varepsilon_{xy} \\ \varepsilon_{xz} \\ \varepsilon_{yz} \end{pmatrix}\right\rangle
\tag{5.15}$$

or according to Hooke's law in the form of equation (5.14) also as

$$W = \frac{1}{2}\,\frac{E}{(1+\nu)(1-2\nu)}\left\langle
\begin{bmatrix} 1-\nu & \nu & \nu \\ \nu & 1-\nu & \nu \\ \nu & \nu & 1-\nu \end{bmatrix}
\cdot \begin{pmatrix} \varepsilon_{xx} \\ \varepsilon_{yy} \\ \varepsilon_{zz} \end{pmatrix},
\begin{pmatrix} \varepsilon_{xx} \\ \varepsilon_{yy} \\ \varepsilon_{zz} \end{pmatrix}\right\rangle
+ \frac{E}{1+\nu}\left\langle
\begin{pmatrix} \varepsilon_{xy} \\ \varepsilon_{xz} \\ \varepsilon_{yz} \end{pmatrix},
\begin{pmatrix} \varepsilon_{xy} \\ \varepsilon_{xz} \\ \varepsilon_{yz} \end{pmatrix}\right\rangle .
\tag{5.16}$$


The above formula for the energy density W can be written in the form W = ½ ⟨ε⃗, A ε⃗⟩ with a suitably
chosen symmetric matrix A:

$$A = \frac{E}{(1+\nu)(1-2\nu)}
\begin{bmatrix}
1-\nu & \nu & \nu & & & \\
\nu & 1-\nu & \nu & & 0 & \\
\nu & \nu & 1-\nu & & & \\
 & & & 2\,(1-2\nu) & & \\
 & 0 & & & 2\,(1-2\nu) & \\
 & & & & & 2\,(1-2\nu)
\end{bmatrix}$$

The gradient of this expression is given by ∇⟨A ε⃗, ε⃗⟩ = 2 A ε⃗ and thus we conclude

$$\sigma_x = \frac{\partial W}{\partial \varepsilon_{xx}}\,,\qquad
\sigma_y = \frac{\partial W}{\partial \varepsilon_{yy}}\,,\qquad
\sigma_z = \frac{\partial W}{\partial \varepsilon_{zz}}$$

and

$$\tau_{xy} = \frac{1}{2}\,\frac{\partial W}{\partial \varepsilon_{xy}}\,,\qquad
\tau_{xz} = \frac{1}{2}\,\frac{\partial W}{\partial \varepsilon_{xz}}\,,\qquad
\tau_{yz} = \frac{1}{2}\,\frac{\partial W}{\partial \varepsilon_{yz}} = \frac{E}{1+\nu}\,\varepsilon_{yz} .$$

Thus the normal and shear stresses are obtained as partial derivatives of the energy density W with respect to the
normal and shear strains.
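
These relations can be confirmed symbolically, e.g. with sympy. The following sketch is illustrative only and not part of the original notes; the symbol names are chosen freely.

import sympy as sp

E, nu = sp.symbols('E nu', positive=True)
exx, eyy, ezz, exy, exz, eyz = sp.symbols('e_xx e_yy e_zz e_xy e_xz e_yz')

c = E / ((1 + nu) * (1 - 2*nu))
# energy density according to equation (5.16)
W = sp.Rational(1, 2) * c * ((1 - nu)*(exx**2 + eyy**2 + ezz**2)
        + 2*nu*(exx*eyy + exx*ezz + eyy*ezz)) \
    + E/(1 + nu) * (exy**2 + exz**2 + eyz**2)

sigma_x = c * ((1 - nu)*exx + nu*(eyy + ezz))     # from Hooke's law (5.14)
tau_xy  = E/(1 + nu) * exy

print(sp.simplify(sp.diff(W, exx) - sigma_x))                    # 0
print(sp.simplify(sp.Rational(1, 2)*sp.diff(W, exy) - tau_xy))   # 0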

Energy density as function of invariant strain expressions


The energy density is expressed in terms of the invariant expressions on page 336. To achieve this goal first
generate a few invariant strain expressions. Since

$$S'^2 = S'\,S' = R^T S\,R\;R^T S\,R = R^T S\,S\,R = R^T S^2\,R ,$$

the matrix S² is a tensor too, and this leads to more invariant expressions.
 
$$S = \begin{bmatrix} \varepsilon_{xx} & \varepsilon_{xy} & \varepsilon_{xz} \\ \varepsilon_{xy} & \varepsilon_{yy} & \varepsilon_{yz} \\ \varepsilon_{xz} & \varepsilon_{yz} & \varepsilon_{zz} \end{bmatrix}$$

$$\begin{aligned}
I_1 &= \operatorname{trace}(S) = \varepsilon_{xx} + \varepsilon_{yy} + \varepsilon_{zz} \\
I_1^2 &= \operatorname{trace}(S)^2 = \varepsilon_{xx}^2 + \varepsilon_{yy}^2 + \varepsilon_{zz}^2
 + 2\,\varepsilon_{xx}\varepsilon_{yy} + 2\,\varepsilon_{xx}\varepsilon_{zz} + 2\,\varepsilon_{yy}\varepsilon_{zz} \\
S^2 &= \begin{bmatrix} \varepsilon_{xx}^2+\varepsilon_{xy}^2+\varepsilon_{xz}^2 & \cdot & \cdot \\
 \cdot & \varepsilon_{xy}^2+\varepsilon_{yy}^2+\varepsilon_{yz}^2 & \cdot \\
 \cdot & \cdot & \varepsilon_{xz}^2+\varepsilon_{yz}^2+\varepsilon_{zz}^2 \end{bmatrix} \\
I_2 &= \varepsilon_{yy}\varepsilon_{zz} - \varepsilon_{yz}^2 + \varepsilon_{xx}\varepsilon_{zz} - \varepsilon_{xz}^2
 + \varepsilon_{xx}\varepsilon_{yy} - \varepsilon_{xy}^2
 = \frac{1}{2}\,\bigl(\operatorname{trace}(S)^2 - \operatorname{trace}(S^2)\bigr) \\
I_3 &= \det(S)
\end{aligned}$$

Elementary algebra leads to

$$I_4 = I_1^2 - 2\,I_2 = \varepsilon_{xx}^2 + \varepsilon_{yy}^2 + \varepsilon_{zz}^2
 + 2\,\varepsilon_{xy}^2 + 2\,\varepsilon_{xz}^2 + 2\,\varepsilon_{yz}^2$$

$$\begin{aligned}
\frac{2\,(1+\nu)(1-2\nu)}{E}\,W
&= (1-\nu)\,(\varepsilon_{xx}^2 + \varepsilon_{yy}^2 + \varepsilon_{zz}^2)
 + 2\,\nu\,(\varepsilon_{xx}\varepsilon_{yy} + \varepsilon_{xx}\varepsilon_{zz} + \varepsilon_{yy}\varepsilon_{zz})
 + 2\,(1-2\nu)\,(\varepsilon_{xy}^2 + \varepsilon_{xz}^2 + \varepsilon_{yz}^2) \\
&= (1-\nu)\,(\varepsilon_{xx}^2 + \varepsilon_{yy}^2 + \varepsilon_{zz}^2 + 2\,\varepsilon_{xy}^2 + 2\,\varepsilon_{xz}^2 + 2\,\varepsilon_{yz}^2)
 + 2\,\nu\,(\varepsilon_{xx}\varepsilon_{yy} + \varepsilon_{xx}\varepsilon_{zz} + \varepsilon_{yy}\varepsilon_{zz}
 - \varepsilon_{xy}^2 - \varepsilon_{xz}^2 - \varepsilon_{yz}^2) \\
&= (1-\nu)\,(I_1^2 - 2\,I_2) + 2\,\nu\,I_2 = (1-\nu)\,I_1^2 - 2\,(1-2\nu)\,I_2
\end{aligned}$$

$$W = \frac{E\,(1-\nu)}{2\,(1+\nu)(1-2\nu)}\,I_1^2 - \frac{E}{1+\nu}\,I_2$$
and thus conclude that the energy density W is invariant under rotations, as it should be.
The above expression can also be written in terms of the principal strains. Using

$$I_2 = \varepsilon_{11}\varepsilon_{22} + \varepsilon_{11}\varepsilon_{33} + \varepsilon_{22}\varepsilon_{33}
\quad\text{and}\quad
I_4 = \varepsilon_{11}^2 + \varepsilon_{22}^2 + \varepsilon_{33}^2$$

find

$$\begin{aligned}
W &= \frac{E}{2\,(1+\nu)(1-2\nu)}\,\bigl((1-\nu)\,I_4 + 2\,\nu\,I_2\bigr) \\
  &= \frac{E}{2\,(1+\nu)(1-2\nu)}\,\bigl((1-\nu)\,I_1^2 - 2\,(1-2\nu)\,I_2\bigr) \\
  &= \frac{E}{2\,(1+\nu)(1-2\nu)}\,\bigl((1-\nu)\,(\varepsilon_{11}^2+\varepsilon_{22}^2+\varepsilon_{33}^2)
   + 2\,\nu\,(\varepsilon_{11}\varepsilon_{22}+\varepsilon_{11}\varepsilon_{33}+\varepsilon_{22}\varepsilon_{33})\bigr) .
\end{aligned}$$
Observe that different expressions are available for the energy density W as function of the invariants.
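
The rotation invariance of these expressions is easy to confirm numerically. The short Python check below is illustrative only (the strain values and the random rotation are made up) and is not part of the original notes.

import numpy as np

def strain_invariants(S):
    I1 = np.trace(S)
    I2 = 0.5 * (np.trace(S)**2 - np.trace(S @ S))
    I3 = np.linalg.det(S)
    return I1, I2, I3

S = np.array([[ 2.0,  0.5, 0.1],
              [ 0.5, -1.0, 0.3],
              [ 0.1,  0.3, 0.4]]) * 1e-3          # assumed symmetric strain tensor

# a random orthogonal matrix Q (QR factorization of a random matrix)
Q, _ = np.linalg.qr(np.random.rand(3, 3))
S_rot = Q.T @ S @ Q

print(np.allclose(strain_invariants(S), strain_invariants(S_rot)))   # True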

Volume changing and shape changing energy


There are many more invariant expressions. In [Bowe10, p. 89] find
$$\begin{aligned}
I_5 &= I_4 - \tfrac{1}{3}\,I_1^2 \\
    &= \varepsilon_{xx}^2 + \varepsilon_{yy}^2 + \varepsilon_{zz}^2 + 2\,(\varepsilon_{xy}^2+\varepsilon_{xz}^2+\varepsilon_{yz}^2)
     - \tfrac{1}{3}\,\bigl(\varepsilon_{xx}^2 + \varepsilon_{yy}^2 + \varepsilon_{zz}^2
     + 2\,\varepsilon_{xx}\varepsilon_{yy} + 2\,\varepsilon_{xx}\varepsilon_{zz} + 2\,\varepsilon_{yy}\varepsilon_{zz}\bigr) \\
3\,I_5 &= 2\,\varepsilon_{xx}^2 + 2\,\varepsilon_{yy}^2 + 2\,\varepsilon_{zz}^2
     - 2\,\varepsilon_{xx}\varepsilon_{yy} - 2\,\varepsilon_{xx}\varepsilon_{zz} - 2\,\varepsilon_{yy}\varepsilon_{zz}
     + 6\,(\varepsilon_{xy}^2+\varepsilon_{xz}^2+\varepsilon_{yz}^2) \\
    &= (\varepsilon_{xx}-\varepsilon_{yy})^2 + (\varepsilon_{xx}-\varepsilon_{zz})^2 + (\varepsilon_{yy}-\varepsilon_{zz})^2
     + 6\,(\varepsilon_{xy}^2+\varepsilon_{xz}^2+\varepsilon_{yz}^2) .
\end{aligned}$$

This invariant corresponds to the von Mises stress σM on page 341, but formulated with strains instead of
stresses. Expressed in principal strains find

3 I5 = (ε11 − ε22 )2 + (ε11 − ε33 )2 + (ε22 − ε33 )2 .

For the energy density this leads to (using elementary, tedious algebra)

$$\begin{aligned}
I_6 &= \frac{1+\nu}{3}\,I_1^2 + (1-2\nu)\,I_5 \\
 &= \frac{1+\nu}{3}\,(\varepsilon_{xx}+\varepsilon_{yy}+\varepsilon_{zz})^2
  + \frac{1-2\nu}{3}\,\bigl((\varepsilon_{xx}-\varepsilon_{yy})^2 + (\varepsilon_{xx}-\varepsilon_{zz})^2 + (\varepsilon_{yy}-\varepsilon_{zz})^2\bigr)
  + \frac{1-2\nu}{3}\,6\,(\varepsilon_{xy}^2+\varepsilon_{xz}^2+\varepsilon_{yz}^2) \\
 &= \frac{1+\nu}{3}\,(\varepsilon_{xx}+\varepsilon_{yy}+\varepsilon_{zz})^2
  + \frac{1-2\nu}{3}\,\bigl((\varepsilon_{xx}-\varepsilon_{yy})^2 + (\varepsilon_{xx}-\varepsilon_{zz})^2 + (\varepsilon_{yy}-\varepsilon_{zz})^2\bigr)
  + 2\,(1-2\nu)\,(\varepsilon_{xy}^2+\varepsilon_{xz}^2+\varepsilon_{yz}^2) \\
 &= \frac{1+\nu+2\,(1-2\nu)}{3}\,\bigl(\varepsilon_{xx}^2+\varepsilon_{yy}^2+\varepsilon_{zz}^2\bigr)
  + \frac{2\,(1+\nu)-2\,(1-2\nu)}{3}\,(\varepsilon_{xx}\varepsilon_{yy}+\varepsilon_{xx}\varepsilon_{zz}+\varepsilon_{yy}\varepsilon_{zz})
  + 2\,(1-2\nu)\,(\varepsilon_{xy}^2+\varepsilon_{xz}^2+\varepsilon_{yz}^2) \\
 &= (1-\nu)\,\bigl(\varepsilon_{xx}^2+\varepsilon_{yy}^2+\varepsilon_{zz}^2\bigr)
  + 2\,\nu\,(\varepsilon_{xx}\varepsilon_{yy}+\varepsilon_{xx}\varepsilon_{zz}+\varepsilon_{yy}\varepsilon_{zz})
  + 2\,(1-2\nu)\,(\varepsilon_{xy}^2+\varepsilon_{xz}^2+\varepsilon_{yz}^2) \\
 &= \frac{2\,(1+\nu)(1-2\nu)}{E}\,W .
\end{aligned}$$


Thus write the energy density W as a sum of two contributions

$$W = W_{vol} + W_{shape} = \frac{E}{6\,(1-2\nu)}\,I_1^2 + \frac{E}{2\,(1+\nu)}\,I_5 :$$

• a volume changing contribution, caused by hydrostatic pressure:

  $$W_{vol} = \frac{E}{6\,(1-2\nu)}\,(\varepsilon_{xx}+\varepsilon_{yy}+\varepsilon_{zz})^2
           = \frac{E}{6\,(1-2\nu)}\,(\varepsilon_{11}+\varepsilon_{22}+\varepsilon_{33})^2$$

• and a shape changing contribution, caused by shearing:

  $$\begin{aligned}
  W_{shape} &= \frac{E}{6\,(1+\nu)}\,\bigl((\varepsilon_{xx}-\varepsilon_{yy})^2 + (\varepsilon_{xx}-\varepsilon_{zz})^2 + (\varepsilon_{yy}-\varepsilon_{zz})^2
   + 6\,(\varepsilon_{xy}^2+\varepsilon_{xz}^2+\varepsilon_{yz}^2)\bigr) \\
  &= \frac{E}{6\,(1+\nu)}\,\bigl((\varepsilon_{11}-\varepsilon_{22})^2 + (\varepsilon_{11}-\varepsilon_{33})^2 + (\varepsilon_{22}-\varepsilon_{33})^2\bigr)
  \end{aligned}$$

• The above can also be expressed using the shear modulus µ and the bulk modulus K, see Example 5–33
  and Table 5.5 (page 360):

  $$\begin{aligned}
  W &= \frac{E}{6\,(1-2\nu)}\,(\varepsilon_{11}+\varepsilon_{22}+\varepsilon_{33})^2
     + \frac{E}{6\,(1+\nu)}\,\bigl((\varepsilon_{11}-\varepsilon_{22})^2 + (\varepsilon_{11}-\varepsilon_{33})^2 + (\varepsilon_{22}-\varepsilon_{33})^2\bigr) \\
    &= \frac{K}{2}\,(\varepsilon_{11}+\varepsilon_{22}+\varepsilon_{33})^2
     + \frac{\mu}{3}\,\bigl((\varepsilon_{11}-\varepsilon_{22})^2 + (\varepsilon_{11}-\varepsilon_{33})^2 + (\varepsilon_{22}-\varepsilon_{33})^2\bigr) \\
    &= W_{vol} + W_{shape}
  \end{aligned}$$

The above is a starting point for nonlinear material laws, e.g. hyperelastic materials as described in [Bowe10,
§3.3.3]. This approach allows to distinguish shear stress and hydrostatic stress. In principle the energy
density W can be any function of invariants, e.g. W = f (I1 , I5 ). The validity of the model has to be
justified by experiments or by other arguments. As examples find in the FEM software COMSOL the
nonlinear material models Neo–Hookean, Mooney–Rivlin and Murnaghan, amongst others.
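
As a plausibility check of the decomposition W = Wvol + Wshape one can evaluate both sides numerically. The following sketch uses assumed material constants and strain values, purely for illustration; it is not part of the original notes.

import numpy as np

E, nu = 70e9, 0.33                            # aluminium-like values (assumed)
e = np.array([1.2e-3, -0.4e-3, 0.1e-3])       # normal strains e_xx, e_yy, e_zz (assumed)
g = np.array([0.3e-3, -0.2e-3, 0.5e-3])       # shear strains  e_xy, e_xz, e_yz (assumed)

# full energy density according to equation (5.16)
A = E/((1+nu)*(1-2*nu)) * (nu*np.ones((3, 3)) + (1-2*nu)*np.eye(3))
W = 0.5 * e @ A @ e + E/(1+nu) * (g @ g)

# decomposition into volume changing and shape changing parts
W_vol = E/(6*(1-2*nu)) * np.sum(e)**2
W_shape = E/(6*(1+nu)) * ((e[0]-e[1])**2 + (e[0]-e[2])**2 + (e[1]-e[2])**2 + 6*(g @ g))

print(np.isclose(W, W_vol + W_shape))   # True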

Energy density as function of stress


In the above sections the energy density was expressed in terms of strain. We can combine (5.15) and
Hooke's law in the form (5.12) to arrive at

$$\begin{aligned}
W &= \frac{1}{2E}\left\langle
\begin{bmatrix} 1 & -\nu & -\nu \\ -\nu & 1 & -\nu \\ -\nu & -\nu & 1 \end{bmatrix}
\begin{pmatrix} \sigma_x \\ \sigma_y \\ \sigma_z \end{pmatrix},
\begin{pmatrix} \sigma_x \\ \sigma_y \\ \sigma_z \end{pmatrix}\right\rangle
+ \frac{1+\nu}{E}\left\langle
\begin{pmatrix} \tau_{xy} \\ \tau_{xz} \\ \tau_{yz} \end{pmatrix},
\begin{pmatrix} \tau_{xy} \\ \tau_{xz} \\ \tau_{yz} \end{pmatrix}\right\rangle \\
 &= \frac{1}{2E}\,\bigl(\sigma_x^2+\sigma_y^2+\sigma_z^2 - 2\,\nu\,(\sigma_x\sigma_y+\sigma_x\sigma_z+\sigma_y\sigma_z)\bigr)
  + \frac{1+\nu}{E}\,\bigl(\tau_{xy}^2+\tau_{xz}^2+\tau_{yz}^2\bigr) .
\end{aligned}$$

Since invariants of the stress tensor are given by

$$\begin{aligned}
I_1 &= \sigma_x + \sigma_y + \sigma_z \\
I_2 &= \sigma_y\sigma_z - \tau_{yz}^2 + \sigma_x\sigma_z - \tau_{xz}^2 + \sigma_x\sigma_y - \tau_{xy}^2 \\
I_4 &= I_1^2 - 2\,I_2 = \sigma_x^2 + \sigma_y^2 + \sigma_z^2 + 2\,\tau_{xy}^2 + 2\,\tau_{xz}^2 + 2\,\tau_{yz}^2
\end{aligned}$$

we conclude

$$\begin{aligned}
W &= \frac{1}{2E}\,(\sigma_x^2+\sigma_y^2+\sigma_z^2) - \frac{\nu}{E}\,(\sigma_x\sigma_y+\sigma_x\sigma_z+\sigma_y\sigma_z)
  + \frac{1+\nu}{E}\,(\tau_{xy}^2+\tau_{xz}^2+\tau_{yz}^2) \\
 &= \frac{1}{2E}\,(\sigma_x^2+\sigma_y^2+\sigma_z^2+2\,\tau_{xy}^2+2\,\tau_{xz}^2+2\,\tau_{yz}^2)
  - \frac{\nu}{E}\,(\sigma_x\sigma_y+\sigma_x\sigma_z+\sigma_y\sigma_z-\tau_{xy}^2-\tau_{xz}^2-\tau_{yz}^2) \\
 &= \frac{1}{2E}\,I_4 - \frac{\nu}{E}\,I_2 .
\end{aligned}$$

In terms of the principal stresses this leads to the energy density

$$W = \frac{1}{2E}\,(\sigma_1^2+\sigma_2^2+\sigma_3^2) - \frac{\nu}{E}\,(\sigma_1\sigma_2+\sigma_1\sigma_3+\sigma_2\sigma_3) .$$


5.6.4 Volume and Surface Forces, the Bernoulli Principle


The previous section presented expressions for the elastic energy stored in a deformed solid. When using
calculus of variations (leading to FEM algorithms) we also need to take external forces into account. This
is best done by introducing matching potentials energies, representing volume and surface forces.

Volume forces

A force applied to the volume of the solid can be introduced by means of a volume force density f⃗
(units: N/m³). By adding the potential energy

$$U_{Vol} = -\iiint \vec f \cdot \vec u \; dV$$

to the elastic energy and then minimizing, we are led to the correct force term.

As an example consider the weight (given by the density ρ) of the solid and the resulting gravitational
force. This leads to a force density of

$$\vec f = \begin{pmatrix} 0 \\ 0 \\ -\rho\,g \end{pmatrix}$$

and thus the corresponding potential energy is

$$U_{Vol} = +\iiint \rho\,g\,u_3 \; dV .$$

This potential energy decreases if the solid is moved downwards.

Surface forces

Forces applied to the surface only are examined by adding a surface potential energy, using the surface
force density g⃗ (units: N/m²),

$$U_{Surf} = -\iint_{\partial\Omega} \vec g \cdot \vec u \; dA .$$

As an example consider a constant surface pressure p on the surface of the solid. This surface force
density is

$$\vec g = -p\,\vec n$$

where n⃗ is the outer unit normal vector. Thus the corresponding potential energy is

$$U_{Surf} = +\iint_{\partial\Omega} p\,\vec u \cdot \vec n \; dA .$$

This potential energy decreases if the solid is compressed, i.e. u⃗ · n⃗ < 0.

The Bernoulli Principle


Using the above we can now give a formula for the total energy in a deformed solid.

U (~u) = Uelast + UV ol + USurf (5.17)


     
$$\begin{aligned}
\phantom{U(\vec u)} &= \frac{1}{2}\,\frac{E}{(1+\nu)(1-2\nu)} \iiint_\Omega \left\langle
\begin{bmatrix} 1-\nu & \nu & \nu \\ \nu & 1-\nu & \nu \\ \nu & \nu & 1-\nu \end{bmatrix}
\cdot \begin{pmatrix} \varepsilon_{xx} \\ \varepsilon_{yy} \\ \varepsilon_{zz} \end{pmatrix},
\begin{pmatrix} \varepsilon_{xx} \\ \varepsilon_{yy} \\ \varepsilon_{zz} \end{pmatrix}\right\rangle dV \\
&\quad + \frac{E}{1+\nu} \iiint_\Omega \left\langle
\begin{pmatrix} \varepsilon_{xy} \\ \varepsilon_{xz} \\ \varepsilon_{yz} \end{pmatrix},
\begin{pmatrix} \varepsilon_{xy} \\ \varepsilon_{xz} \\ \varepsilon_{yz} \end{pmatrix}\right\rangle dV
- \iiint_\Omega \vec f \cdot \vec u \; dV - \iint_{\partial\Omega} \vec g \cdot \vec u \; dA .
\end{aligned}$$

As in many other situations we use again a principle of least energy to find the equilibrium state of a
deformed solid. This result is known under the name Bernoulli18 principle.

If the above solid is in equilibrium, then the displacement function ~u is a minimizer of the above
energy functional, subject to the given boundary conditions.

This is the basis for a finite element (FEM) solution to elasticity problems.

5.6.5 Some Exemplary Situations


In this section Hooke’s law is illustrated by considering a few simple examples.

5–30 Example : Hooke’s basic law


Consider the situation in Figure 5.15 with the following assumptions:

• The solid of length L has constant cross section perpendicular to the x axis, with area A = ∆y · ∆z.

• The left face is fixed in the x direction, but free to move in the other directions.
• The constant normal stress σx at the right face is given by σx = F/A.

• There are no forces in the y and z directions.

[Figure omitted in this text version: a bar along the x axis with constant cross section A = ∆y · ∆z.]

Figure 5.15: Situation for the basic version of Hooke's law

This leads to the consequences:

• All stresses in y and z direction vanish, i.e.

σy = σz = τxy = τxz = τyz = 0 .


18
Proposed by Daniel Bernoulli (1700–1782), the son of Johann Bernoulli (1667–1748) and nephew of Jakob Bernoulli (1654–
1705).


• Hooke's law (5.12) implies

  $$\begin{pmatrix} \varepsilon_{xx} \\ \varepsilon_{yy} \\ \varepsilon_{zz} \end{pmatrix}
  = \frac{1}{E} \begin{bmatrix} 1 & -\nu & -\nu \\ -\nu & 1 & -\nu \\ -\nu & -\nu & 1 \end{bmatrix}
  \cdot \begin{pmatrix} F/A \\ 0 \\ 0 \end{pmatrix}
  = \frac{1}{E}\,\frac{F}{A} \begin{pmatrix} 1 \\ -\nu \\ -\nu \end{pmatrix} .$$

• The first component of the above equations leads to the classical, basic form of Hooke's law

  $$\varepsilon_{xx} = \frac{\Delta L}{L} = \frac{1}{E}\,\frac{F}{A} .$$

  This is the standard definition of Young's modulus of elasticity. The solid is stretched by a factor of
  1 + (1/E)(F/A).

• In the y and z directions the solid is contracted by a factor of 1 − (ν/E)(F/A). This is an interpretation of
  Poisson's ratio ν:

  $$\varepsilon_{yy} = -\nu\,\varepsilon_{xx} .$$

  Multiply the relative change of length in the x direction by ν to obtain the relative change of length in
  the y and z directions. One expects ν ≥ 0.

• The energy density W can be found by equation (5.16).

  $$W = \frac{1}{2}\left\langle
  \begin{pmatrix} \sigma_x \\ \sigma_y \\ \sigma_z \end{pmatrix},
  \begin{pmatrix} \varepsilon_{xx} \\ \varepsilon_{yy} \\ \varepsilon_{zz} \end{pmatrix}\right\rangle
  + \left\langle
  \begin{pmatrix} \tau_{xy} \\ \tau_{xz} \\ \tau_{yz} \end{pmatrix},
  \begin{pmatrix} \varepsilon_{xy} \\ \varepsilon_{xz} \\ \varepsilon_{yz} \end{pmatrix}\right\rangle
  = \frac{1}{2}\left\langle
  \begin{pmatrix} \tfrac{F}{A} \\ 0 \\ 0 \end{pmatrix},
  \frac{1}{E}\,\frac{F}{A}\begin{pmatrix} 1 \\ -\nu \\ -\nu \end{pmatrix}\right\rangle + 0
  = \frac{1}{2\,E}\left(\frac{F}{A}\right)^2 = \frac{1}{2}\,E\,\varepsilon_{xx}^2$$

• Assuming that we have the principal strain situation the energy density is given by

  $$W = \frac{1}{2}\,\frac{E}{(1+\nu)(1-2\nu)}\left\langle
  \begin{bmatrix} 1-\nu & \nu & \nu \\ \nu & 1-\nu & \nu \\ \nu & \nu & 1-\nu \end{bmatrix}
  \begin{pmatrix} \varepsilon_{xx} \\ \varepsilon_{yy} \\ \varepsilon_{zz} \end{pmatrix},
  \begin{pmatrix} \varepsilon_{xx} \\ \varepsilon_{yy} \\ \varepsilon_{zz} \end{pmatrix}\right\rangle .$$

  For the case of uniaxial loading in the x–direction use εyy = εzz and find

  $$\begin{aligned}
  W &= \frac{1}{2}\,\frac{E}{(1+\nu)(1-2\nu)}\,\bigl((1-\nu)\,(\varepsilon_{xx}^2 + 2\,\varepsilon_{yy}^2)
     + \nu\,(4\,\varepsilon_{xx}\varepsilon_{yy} + 2\,\varepsilon_{yy}^2)\bigr) \\
  \frac{\partial W}{\partial \varepsilon_{xx}} &= \frac{1}{2}\,\frac{E}{(1+\nu)(1-2\nu)}\,\bigl((1-\nu)\,2\,\varepsilon_{xx} + 4\,\nu\,\varepsilon_{yy}\bigr)
   = \frac{E}{(1+\nu)(1-2\nu)}\,\bigl((1-\nu)\,\varepsilon_{xx} + 2\,\nu\,\varepsilon_{yy}\bigr) \\
  \frac{\partial W}{\partial \varepsilon_{yy}} &= \frac{1}{2}\,\frac{E}{(1+\nu)(1-2\nu)}\,\bigl((1-\nu)\,4\,\varepsilon_{yy} + 4\,\nu\,(\varepsilon_{xx}+\varepsilon_{yy})\bigr)
   = \frac{4\,E}{2\,(1+\nu)(1-2\nu)}\,(\nu\,\varepsilon_{xx} + \varepsilon_{yy})
  \end{aligned}$$

• Based on the Bernoulli principle the energy density W is minimized with respect to εyy. Thus conclude
  εyy = −ν εxx, i.e. the Poisson contraction is obtained as a consequence of minimizing the energy.


• Since the stress in the x–direction is given by

  $$\sigma_x = \frac{\partial W}{\partial \varepsilon_{xx}}
   = \frac{E}{(1+\nu)(1-2\nu)}\,\bigl((1-\nu)\,\varepsilon_{xx} + 2\,\nu\,\varepsilon_{yy}\bigr)
   = \frac{E}{(1+\nu)(1-2\nu)}\,\bigl((1-\nu)\,\varepsilon_{xx} - 2\,\nu^2\,\varepsilon_{xx}\bigr) = E\,\varepsilon_{xx}$$

  we also obtain Hooke's elementary law as a consequence of minimizing the energy.

5–31 Example : Solid under hydrostatic pressure


If a rectangular block is submitted to a constant pressure p, then we know all components of the stress
(assuming they are constant throughout the solid), namely

σx = σy = σz = −p and τxy = τxz = τyz = 0 .

Submerging a solid deep into water will lead to this hydrostatic pressure situation. Hooke’s law now implies

εxy = εxz = εyz = 0

and

$$\begin{pmatrix} \varepsilon_{xx} \\ \varepsilon_{yy} \\ \varepsilon_{zz} \end{pmatrix}
= -\frac{1}{E} \begin{bmatrix} 1 & -\nu & -\nu \\ -\nu & 1 & -\nu \\ -\nu & -\nu & 1 \end{bmatrix}
\cdot \begin{pmatrix} p \\ p \\ p \end{pmatrix}
= -\frac{p\,(1-2\nu)}{E} \begin{pmatrix} 1 \\ 1 \\ 1 \end{pmatrix},$$

i.e. in each direction the solid is compressed by a factor of 1 − p(1−2ν)/E. Since putting a solid under pressure
will make it shrink, Poisson's ratio must satisfy the condition 0 ≤ ν ≤ 1/2. The case ν = 1/2 corresponds
to an incompressible object.
Since the length in each direction is multiplied by the same factor (1 + εxx), determine the relative change
of volume by

$$\frac{\Delta V}{V} = \left(1 - \frac{p\,(1-2\nu)}{E}\right)^3 - 1 \approx -\frac{3\,(1-2\nu)}{E}\,p = -\frac{1}{K}\,p$$

if the pressure p is small. The appearing coefficient K = E/(3(1−2ν)) is called the bulk modulus of the material.
The energy density W is given by

$$\begin{aligned}
W &= \frac{1}{2}\left\langle
\begin{pmatrix} \sigma_x \\ \sigma_y \\ \sigma_z \end{pmatrix},
\begin{pmatrix} \varepsilon_{xx} \\ \varepsilon_{yy} \\ \varepsilon_{zz} \end{pmatrix}\right\rangle
+ \left\langle
\begin{pmatrix} \tau_{xy} \\ \tau_{xz} \\ \tau_{yz} \end{pmatrix},
\begin{pmatrix} \varepsilon_{xy} \\ \varepsilon_{xz} \\ \varepsilon_{yz} \end{pmatrix}\right\rangle \\
&= -\frac{1}{2}\left\langle
\begin{pmatrix} -p \\ -p \\ -p \end{pmatrix},
\frac{p\,(1-2\nu)}{E}\begin{pmatrix} 1 \\ 1 \\ 1 \end{pmatrix}\right\rangle + 0
= \frac{1}{2}\,\frac{1-2\nu}{E}\,3\,p^2 .
\end{aligned}$$

This can also be expressed in terms of the normal strains,

$$W = \frac{1}{2}\,\frac{3\,(1-2\nu)}{E}\,p^2 = \frac{1}{2}\,\frac{3\,E}{1-2\nu}\,\varepsilon_{xx}^2 .$$

Then the pressure is determined by

$$-p = \sigma_x = \frac{1}{3}\,\frac{\partial W}{\partial \varepsilon_{xx}} = \frac{E}{1-2\nu}\,\varepsilon_{xx} .$$


5–32 Example : Shear modulus


To the block in Figure 5.14 apply a force of strength F in direction of the x axis to the top (area ∆x · ∆y).
No normal forces in the y direction are applied. The corresponding forces have to be applied to the faces at
x = 0 and x = ∆x for the block to be in equilibrium. No other forces apply, thus
$$\tau_{xz} = \frac{F}{A} \quad\text{and}\quad \tau_{xy} = \tau_{yz} = \sigma_x = \sigma_y = \sigma_z = 0 .$$
Hooke’s law (5.12) leads to
εxx = εyy = εzz = εxy = εyz = 0
and

$$\varepsilon_{xz} = \frac{1+\nu}{E}\,\frac{F}{A} = \frac{1}{2G}\,\frac{F}{A} = \frac{1}{2G}\,\tau_{xz} .$$

This is the reason why some presentations use the shear modulus G = µ = E/(2(1+ν)). The energy density is
given by

$$W = \tau_{xz}\,\varepsilon_{xz} = 2\,G\,\varepsilon_{xz}^2 = \frac{E}{1+\nu}\,\varepsilon_{xz}^2 .$$

One can determine Young's modulus E by uniaxial stretching; measuring the shear modulus G = E/(2(1+ν))
then allows one to determine Poisson's ratio ν.

One can also start with the general formula for the energy density based on Hooke's law,

$$W = \frac{E}{6\,(1-2\nu)}\,(\varepsilon_{xx}+\varepsilon_{yy}+\varepsilon_{zz})^2
 + \frac{E}{6\,(1+\nu)}\,\bigl((\varepsilon_{xx}-\varepsilon_{yy})^2 + (\varepsilon_{xx}-\varepsilon_{zz})^2 + (\varepsilon_{yy}-\varepsilon_{zz})^2
 + 6\,(\varepsilon_{xy}^2+\varepsilon_{yz}^2+\varepsilon_{xz}^2)\bigr) .$$

Use Bernoulli's principle to determine all strains. Minimizing this energy density leads to

$$0 = \frac{\partial W}{\partial \varepsilon_{yz}} = \frac{E}{1+\nu}\,2\,\varepsilon_{yz}
\qquad\text{and}\qquad
0 = \frac{\partial W}{\partial \varepsilon_{xz}} = \frac{E}{1+\nu}\,2\,\varepsilon_{xz}$$

and thus the only nonzero shearing is εxy. Similarly

$$\begin{aligned}
0 = \frac{\partial W}{\partial \varepsilon_{xx}} &= \frac{E}{3\,(1-2\nu)}\,(\varepsilon_{xx}+\varepsilon_{yy}+\varepsilon_{zz})
 + \frac{E}{3\,(1+\nu)}\,\bigl((\varepsilon_{xx}-\varepsilon_{yy}) + (\varepsilon_{xx}-\varepsilon_{zz})\bigr) \\
0 = \frac{\partial W}{\partial \varepsilon_{yy}} &= \frac{E}{3\,(1-2\nu)}\,(\varepsilon_{xx}+\varepsilon_{yy}+\varepsilon_{zz})
 + \frac{E}{3\,(1+\nu)}\,\bigl(-(\varepsilon_{xx}-\varepsilon_{yy}) + (\varepsilon_{yy}-\varepsilon_{zz})\bigr) \\
0 = \frac{\partial W}{\partial \varepsilon_{zz}} &= \frac{E}{3\,(1-2\nu)}\,(\varepsilon_{xx}+\varepsilon_{yy}+\varepsilon_{zz})
 + \frac{E}{3\,(1+\nu)}\,\bigl(-(\varepsilon_{xx}-\varepsilon_{zz}) - (\varepsilon_{yy}-\varepsilon_{zz})\bigr)
\end{aligned}$$

Then conclude

• The sum of the three equations leads to εxx + εyy + εzz = 0.

• The sum of the first two equations leads to εxx + εyy = 2 εzz = −2 (εxx + εyy ). This implies
εxx + εyy = 0 and thus εzz = 0.

• The sum of the first and third equation leads to εxx + εzz = 2 εyy = −2 (εxx + εzz ). This implies
εxx + εzz = 0 and thus εyy = 0.

• Now εxx = 0 is easy to see.

Thus the only non vanishing strain is εxy . ♦


  given     Young's E            Poisson ν            Shear G = µ            Lamé λ                 bulk K                 notes
  (E, ν)    E                    ν                    E/(2(1+ν))             νE/((1+ν)(1−2ν))       E/(3(1−2ν))
  (E, G)    E                    E/(2G) − 1           G                      G(E−2G)/(3G−E)         EG/(3(3G−E))
  (ν, G)    2G(1+ν)              ν                    G                      2νG/(1−2ν)             2G(1+ν)/(3(1−2ν))
  (E, λ)    E                    2λ/(E+λ+R)           (E−3λ+R)/4             λ                      (E+3λ+R)/6             R = √(E² + 9λ² + 2λE)
  (ν, λ)    λ(1+ν)(1−2ν)/ν       ν                    λ(1−2ν)/(2ν)           λ                      λ(1+ν)/(3ν)            useless if ν = 0
  (G, λ)    G(3λ+2G)/(λ+G)       λ/(2(λ+G))           G                      λ                      λ + 2G/3
  (E, K)    E                    (3K−E)/(6K)          3EK/(9K−E)             3K(3K−E)/(9K−E)        K
  (ν, K)    3K(1−2ν)             ν                    3K(1−2ν)/(2(1+ν))      3νK/(1+ν)              K
  (G, K)    9GK/(3K+G)           (3K−2G)/(2(3K+G))    G                      K − 2G/3               K
  (λ, K)    9K(K−λ)/(3K−λ)       λ/(3K−λ)             3(K−λ)/2               λ                      K

Table 5.5: Elastic moduli and their relations

5–33 Example : A few different elastic moduli


For homogeneous, isotropic materials there are many different ways to describe the linear laws of elasticity.
• Young’s modulus E
• Poisson’s ratio ν
• Shear modulus G or µ
• Lamé’s first parameter λ
• Bulk modulus K
Given any two of these the others are determined19 , leading to Table 5.5. ♦
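
The first row of Table 5.5 is the conversion used most often in practice. A small helper function, with made-up input values, might look as follows; this sketch is not part of the original notes.

def moduli_from_E_nu(E, nu):
    """Convert Young's modulus E and Poisson's ratio nu to the other elastic moduli,
    following the first row of Table 5.5."""
    G = E / (2*(1 + nu))                    # shear modulus
    lam = nu*E / ((1 + nu)*(1 - 2*nu))      # Lame's first parameter
    K = E / (3*(1 - 2*nu))                  # bulk modulus
    return G, lam, K

G, lam, K = moduli_from_E_nu(210e9, 0.3)    # steel-like values (assumed)
print(G, lam, K)
# consistency check with the row (G, K) of the table: E = 9*G*K/(3*K + G)
print(abs(9*G*K/(3*K + G) - 210e9) < 1e-3)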

5–34 Example : Principal stress and principal strain


Since the strain tensor is symmetric select a coordinate system such that
   
$$\begin{bmatrix} \varepsilon_{xx} & \varepsilon_{xy} & \varepsilon_{xz} \\
\varepsilon_{yx} & \varepsilon_{yy} & \varepsilon_{yz} \\
\varepsilon_{zx} & \varepsilon_{zy} & \varepsilon_{zz} \end{bmatrix}
= \begin{bmatrix} \varepsilon_{xx} & 0 & 0 \\ 0 & \varepsilon_{yy} & 0 \\ 0 & 0 & \varepsilon_{zz} \end{bmatrix},$$

i.e. vanishing shear strains. Based on Hooke’s law (5.14) conclude that the shear stresses vanish too, i.e.
   
$$\begin{bmatrix} \sigma_x & \tau_{xy} & \tau_{xz} \\ \tau_{xy} & \sigma_y & \tau_{yz} \\ \tau_{xz} & \tau_{yz} & \sigma_z \end{bmatrix}
= \begin{bmatrix} \sigma_x & 0 & 0 \\ 0 & \sigma_y & 0 \\ 0 & 0 & \sigma_z \end{bmatrix}$$

and find the resulting normal stresses


   
$$\begin{pmatrix} \sigma_x \\ \sigma_y \\ \sigma_z \end{pmatrix}
= \frac{E}{(1+\nu)(1-2\nu)}
\begin{pmatrix} (1-\nu)\,\varepsilon_{xx} + \nu\,(\varepsilon_{yy}+\varepsilon_{zz}) \\
(1-\nu)\,\varepsilon_{yy} + \nu\,(\varepsilon_{xx}+\varepsilon_{zz}) \\
(1-\nu)\,\varepsilon_{zz} + \nu\,(\varepsilon_{xx}+\varepsilon_{yy}) \end{pmatrix} .$$
19
See https://en.wikipedia.org/wiki/Elastic modulus .


Using this, the elastic energy density is given by

$$W = \frac{1}{2}\left\langle
\begin{pmatrix} \sigma_x \\ \sigma_y \\ \sigma_z \end{pmatrix},
\begin{pmatrix} \varepsilon_{xx} \\ \varepsilon_{yy} \\ \varepsilon_{zz} \end{pmatrix}\right\rangle
= \frac{1}{2}\,\frac{E}{(1+\nu)(1-2\nu)}\,\bigl((1-\nu)\,(\varepsilon_{xx}^2+\varepsilon_{yy}^2+\varepsilon_{zz}^2)
+ 2\,\nu\,(\varepsilon_{xx}\varepsilon_{yy}+\varepsilon_{yy}\varepsilon_{zz}+\varepsilon_{zz}\varepsilon_{xx})\bigr) .$$

The deformed volume can be estimated by

V + ∆V = V (1 + εxx ) (1 + εyy ) (1 + εzz ) ≈ V (1 + εxx + εyy + εzz )

or
∆V ≈ (εxx + εyy + εzz ) V .
Using

$$\begin{pmatrix} \varepsilon_{xx} \\ \varepsilon_{yy} \\ \varepsilon_{zz} \end{pmatrix}
= \frac{1}{E} \begin{bmatrix} 1 & -\nu & -\nu \\ -\nu & 1 & -\nu \\ -\nu & -\nu & 1 \end{bmatrix}
\cdot \begin{pmatrix} \sigma_x \\ \sigma_y \\ \sigma_z \end{pmatrix}$$

find

$$\varepsilon_{xx} + \varepsilon_{yy} + \varepsilon_{zz} = \frac{1-2\nu}{E}\,(\sigma_x+\sigma_y+\sigma_z)$$

and

$$\frac{\Delta V}{V} \approx \frac{1-2\nu}{E}\,(\sigma_x+\sigma_y+\sigma_z) .$$

This implies that the sum of the three principal stresses determines the volume change. Based on this one
can consider the volume changing hydrostatic stress (pressure)

$$\sigma_h = \frac{1}{3}\,(\sigma_x+\sigma_y+\sigma_z)$$

and the shape changing stresses

$$\hat\sigma_x = \sigma_x - \sigma_h\,,\qquad \hat\sigma_y = \sigma_y - \sigma_h\,,\qquad \hat\sigma_z = \sigma_z - \sigma_h .$$

5–35 Example : Torsion of a tube


In this example examine the torsion of a circular, hollow shaft, as shown in Figure 5.16. Since the compu-
tations are lengthy the computational path shall be spelled out first.

• Express the displacement in terms of radial and angular contributions.

• Determine the normal and shear strains.

• Use Hooke’s law to find the elastic energy density and by an integration find the total elastic energy.

• Use Euler–Lagrange equations to determine the boundary value problems for the radial and angular
displacement functions and solve these.

• Use these solutions to determine the actual energy stored in the twisted tube and determine the re-
quired torque.


Figure 5.16: Torsion of a tube

When twisting a circular tube, on the top surface one can decompose the deformation into a radial component ur
and an angular component uϕ. The vertical component is assumed to vanish²⁰, i.e. u3 = 0.

$$\begin{pmatrix} u_1 \\ u_2 \\ u_3 \end{pmatrix}
= u_r(r, z) \begin{pmatrix} \cos\varphi \\ \sin\varphi \\ 0 \end{pmatrix}
+ u_\varphi(r, z) \begin{pmatrix} -\sin\varphi \\ +\cos\varphi \\ 0 \end{pmatrix}$$

Based on the rotational symmetry²¹ one can examine the expression in the xz plane only, i.e. y = 0. Let
r = x and find the normal strains

$$\varepsilon_{11} = \frac{\partial u_1}{\partial x} = \frac{\partial u_r}{\partial r}\,,\qquad
\varepsilon_{22} = \frac{\partial u_2}{\partial y} = \frac{1}{r}\,u_r\,,\qquad
\varepsilon_{33} = \frac{\partial u_3}{\partial z} = 0$$

and the shear strains

$$2\,\varepsilon_{xy} = \frac{\partial u_1}{\partial y} + \frac{\partial u_2}{\partial x} = -\frac{1}{r}\,u_\varphi + \frac{\partial u_\varphi}{\partial r}\,,\qquad
2\,\varepsilon_{xz} = \frac{\partial u_1}{\partial z} + \frac{\partial u_3}{\partial x} = \frac{\partial u_r}{\partial z}\,,\qquad
2\,\varepsilon_{yz} = \frac{\partial u_2}{\partial z} + \frac{\partial u_3}{\partial y} = \frac{\partial u_\varphi}{\partial z} .$$
Observe that this is neither a plane strain nor a plane stress situation. The energy density W is given by
equation (5.16) (page 351)
     
1−ν ν ν εxx εxx
1 E      
W = h ν 1−ν ν  · εyy  , i +
εyy 
2 (1 + ν) (1 − 2 ν)     
ν ν 1−ν εzz εzz
20
This is just for the sake of simplicity of the presentation. The vertical displacement u3 can be used as an unknown function
too, with zero displacement at the top and the bottom. The resulting three Euler–Lagrange equations will then lead to u3 = 0. This
author has the detailed computations.
21
Using the chain rule one can express the partial derivatives with respect to Cartesian coordinates x and y in terms of derivatives
with respect to cylindrical coordinates r and ϕ:

$$f(x,y) = F(r,\varphi) \;\Longrightarrow\;
\frac{\partial f}{\partial x} = \cos\varphi\,\frac{\partial F}{\partial r} - \frac{\sin\varphi}{r}\,\frac{\partial F}{\partial \varphi}
\quad\text{and}\quad
\frac{\partial f}{\partial y} = \sin\varphi\,\frac{\partial F}{\partial r} + \frac{\cos\varphi}{r}\,\frac{\partial F}{\partial \varphi} .$$
Along the x axis we have ϕ = 0 and thus cos ϕ = 1 and sin ϕ = 0.


   
εxy εxy
E    
+ h εxz  , i
εxz 
(1 + ν)   
εyz εyz
and leads to the energy density along the x axis with r = x.
E
(1 − ν) (ε2xx + ε2yy + ε2zz ) + 2 ν (εxx εyy + εxx εzz + εyy εzz ) +

W (r, z) =
2 (1 + ν) (1 − 2 ν)
E
+ (ε2 + ε2xz + ε2yz )
(1 + ν) xy
 
E (1 − ν) ∂ ur 2 1 2
= ( ) + 2 ur +
2 (1 + ν) (1 − 2 ν) ∂r r
 
E ∂ uϕ uϕ 2 ∂ ur 2 ∂ uϕ 2
+ ( − ) +( ) +( ) .
4 (1 + ν) ∂r r ∂z ∂z
With this the functional U for the total elastic energy is given by an integration over the tube R0 < r < R1
and 0 < z < H, using cylindrical coordinates.
Z H Z R1
U (~u) = W (r, z) 2 π r dr dz
0 R0
H Z R1
∂ ur 2 u2r
 
E (1 − ν) r
Z
= 2π ( ) + 2 +
0 R0 2 (1 + ν) (1 − 2 ν) ∂r r
 
Er ∂ uϕ uϕ 2 ∂ ur 2 ∂ uϕ 2
+ ( − ) +( ) +( ) dr dz
4 (1 + ν) ∂r r ∂z ∂z
Z H Z R1
∂ ur 2 u2r
 
πE 2 (1 − ν)
= r( ) + +
2 (1 + ν) 0 R0 (1 − 2 ν) ∂r r
 
r ∂ uϕ uϕ 2 ∂ ur 2 ∂ uϕ 2
+ ( − ) +( ) +( ) dr dz
2 ∂r r ∂z ∂z
Z H Z R1
πE ∂ ur ∂ ur ∂ uϕ ∂ uϕ
= F (r, z, ur , , , uϕ , , ) dr dz .
2 (1 + ν) 0 R0 ∂r ∂z ∂r ∂z
Thus the elastic energy is given as a quadratic expression in terms of the two displacement functions ur and
uϕ and their partial derivatives. The physical solution will minimize this energy based on the Bernoulli prin-
ciple. Use computations very similar to Section 5.2.4 (page 315) to generate the Euler–Lagrange equations
for the two unknown functions ur and uϕ .
• For the radial displacement function ur (r, z):
∂F 4 (1 − ν) ur
=
∂ur 1 − 2ν r
∂F 4 (1 − ν) ∂ ur
= r
∂ ∂∂r
ur 1 − 2ν ∂r
∂F ∂ ur
= r
∂ ∂∂z
ur ∂z
∂ ∂F ∂ ∂F ∂F
∂ u
+ =
∂r ∂ ∂rr ∂z ∂ ∂∂z
ur ∂ur
 2
∂ 2 ur

4 (1 − ν) ∂ ur ∂ ur 4 (1 − ν) ur
r 2
+ +r =
1 − 2ν ∂r ∂r ∂z 2 1 − 2ν r
on the domain R0 < r < R1 and 0 < z < H. At the bottom z = 0 and the top z = H the boundary
conditions are
ur (r, 0) = 0 and ur (r, H) = 0 for R0 < r < R1


∂F
and on the sides r = Ri we use the natural boundary condition = 0, leading to
∂ ∂∂r
ur

∂ ur (R0 , z) ∂ ur (R1 , z)
= = 0 for 0 < z < H .
∂r ∂r
This boundary value problem is solved by ur (r, z) = 0, i.e. no radial displacement.
• For the angular displacement function uϕ (r, z):
∂F ∂ uϕ uϕ
= −( − )
∂uϕ ∂r r
∂F ∂ uϕ uϕ
∂u
= +r ( − )
∂ ∂rϕ ∂r r
∂F ∂ uϕ
∂u
= r
∂ ∂zϕ ∂z
∂ ∂F ∂ ∂F ∂F
∂ u
+ ∂ u
=
∂r ∂ ϕ ∂z ∂ ϕ ∂uϕ
  ∂r  ∂z

∂ ∂ uϕ uϕ ∂ ∂ uϕ ∂ uϕ uϕ
r( − ) + r = −( − )
∂r ∂r r ∂z ∂z ∂r r
∂ 2 uϕ ∂ 2 uϕ ∂ uϕ uϕ
r 2
+ r 2
= − +
∂r ∂z ∂r r
on the domain R0 < r < R1 and 0 < z < H. At the bottom z = 0 and the top z = H the boundary
conditions are
uϕ (r, 0) = 0 and ur (r, H) = r α for R0 < r < R1
∂F ∂ uϕ uϕ
and on the sides r = Ri use the natural boundary condition ∂u
= r( − ) = 0, leading to
∂ ∂rϕ ∂r r

∂ uϕ (R0 , z) ∂ uϕ (R1 , z)
R0 = uϕ (R0 , z) and R1 = uϕ (R1 , z) for 0 < z < H
∂r ∂r
α
This boundary value problem is solved by uϕ (r, z) = r z H.
α
Using the above solutions ur (r, z) = 0 and uϕ (r, z) = r z H compute the energy density
   
E (1 − ν) ∂ ur 2 1 2 E ∂ uϕ uϕ 2 ∂ ur 2 ∂ uϕ 2
W (r, z) = ( ) + 2 ur + ( − ) +( ) +( )
2 (1 + ν) (1 − 2 ν) ∂r r 4 (1 + ν) ∂r r ∂z ∂z
E (1 − ν) E  α  E r2
= 0+ 02 + 02 + (r )2 = α2
2 (1 + ν) (1 − 2 ν) 4 (1 + ν) H 4 (1 + ν) H 2
and thus the elastic energy by an integration
Z H Z R1 H R1
2 π E α2
Z Z
U (~u) = W (r, z) 2 π r dr dz = r3 dr dz
0 R0 4 (1 + ν) H 2 0 R0
2 π E α2
= (R14 − R04 ) .
4 · 4 (1 + ν) H
The elastic energy is expressed as a function of the angle of rotation α. The torque T required to twist this
circular tube is given by

$$T = \frac{\partial U}{\partial \alpha} = \frac{\pi\,E}{4\,(1+\nu)}\,\frac{\alpha}{H}\,(R_1^4 - R_0^4) .$$

This result can also be obtained by assuming that the cut at height z is rotated by an angle α z/H and then
determining the resulting shear stresses along those planes. The computations are rather easy. The above
approach verifies that this simplification is correct. ♦
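
Assuming the torque formula above, a concrete tube can be evaluated directly. All numbers in the following sketch are illustrative and not taken from the notes.

import numpy as np

def torsion_torque(E, nu, R0, R1, H, alpha):
    """Torque required to twist a circular tube (inner radius R0, outer radius R1,
    height H) by the angle alpha, according to the formula above."""
    return np.pi * E / (4*(1 + nu)) * alpha / H * (R1**4 - R0**4)

# illustrative numbers only (assumed): a steel tube, 1 m long, twisted by 1 degree
T = torsion_torque(E=210e9, nu=0.3, R0=0.04, R1=0.05, H=1.0, alpha=np.deg2rad(1.0))
print(T)   # torque in N*m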


5.7 More on Tensors and Energy Densities for Nonlinear Material Laws
5.7.1 A few More Tensors
In most parts of these lecture notes we only use infinitesimal strains. This restricts the applications to small
strain and displacement situations only. One important example that can not be examined using infinitesimal
strains only is the large bending of slender beams. There are nonlinear extensions allowing to describe more
general situations. In this section we provide a starting point for further investigations. Find more information
in [Bowe10], [Redd13] and [Redd15].

Use the displacement gradient tensor DU (Example 5–28) and examine Figure 5.6 (page 327) to verify
that for a deformation ⃗u(x, y) the vector

$$\begin{pmatrix} \Delta x \\ \Delta y \end{pmatrix}
+ \begin{pmatrix} \Delta u_1 \\ \Delta u_2 \end{pmatrix}
= \begin{pmatrix} \Delta x \\ \Delta y \end{pmatrix}
+ \begin{bmatrix} \frac{\partial u_1}{\partial x} & \frac{\partial u_1}{\partial y} \\[4pt]
\frac{\partial u_2}{\partial x} & \frac{\partial u_2}{\partial y} \end{bmatrix}
\begin{pmatrix} \Delta x \\ \Delta y \end{pmatrix}
= (I + DU)\,\overrightarrow{\Delta x}$$

connects the points A0 to D0 . I + DU is the deformation gradient tensor. For two vectors
$$\overrightarrow{\Delta x_1} = \begin{pmatrix} \Delta x_1 \\ \Delta y_1 \end{pmatrix}
\quad\text{and}\quad
\overrightarrow{\Delta x_2} = \begin{pmatrix} \Delta x_2 \\ \Delta y_2 \end{pmatrix}$$

examine the scalar product of the images of the two vectors:

$$\begin{aligned}
\langle (I+DU)\,\overrightarrow{\Delta x_1} \,,\, (I+DU)\,\overrightarrow{\Delta x_2} \rangle
&= \langle \overrightarrow{\Delta x_1} \,,\, (I+DU^T)(I+DU)\,\overrightarrow{\Delta x_2} \rangle \\
&= \langle \overrightarrow{\Delta x_1} \,,\, (I+DU+DU^T+DU^T DU)\,\overrightarrow{\Delta x_2} \rangle \\
&= \langle \overrightarrow{\Delta x_1} \,,\, C\,\overrightarrow{\Delta x_2} \rangle
\end{aligned}$$

The matrix

$$C = I + DU + DU^T + DU^T\,DU$$

is the Cauchy–Green deformation tensor. In particular obtain in Figure 5.6 (page 327)

$$|A'D'|^2 = \langle (I+DU)\,\overrightarrow{\Delta x} \,,\, (I+DU)\,\overrightarrow{\Delta x} \rangle
= \langle \overrightarrow{\Delta x} \,,\, C\,\overrightarrow{\Delta x} \rangle .$$
For the 2D situation use

$$\begin{aligned}
C &= I + DU + DU^T + DU^T\,DU \\
  &= \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}
   + \begin{bmatrix} 2\,\frac{\partial u_1}{\partial x} & \frac{\partial u_1}{\partial y} + \frac{\partial u_2}{\partial x} \\[4pt]
     \frac{\partial u_2}{\partial x} + \frac{\partial u_1}{\partial y} & 2\,\frac{\partial u_2}{\partial y} \end{bmatrix}
   + \begin{bmatrix}
     \bigl(\frac{\partial u_1}{\partial x}\bigr)^2 + \bigl(\frac{\partial u_2}{\partial x}\bigr)^2 &
     \frac{\partial u_1}{\partial x}\frac{\partial u_1}{\partial y} + \frac{\partial u_2}{\partial x}\frac{\partial u_2}{\partial y} \\[4pt]
     \frac{\partial u_1}{\partial x}\frac{\partial u_1}{\partial y} + \frac{\partial u_2}{\partial x}\frac{\partial u_2}{\partial y} &
     \bigl(\frac{\partial u_1}{\partial y}\bigr)^2 + \bigl(\frac{\partial u_2}{\partial y}\bigr)^2
     \end{bmatrix} \\
  &= I + 2 \begin{bmatrix} \varepsilon_{xx} & \varepsilon_{xy} \\ \varepsilon_{xy} & \varepsilon_{yy} \end{bmatrix}
   + \begin{bmatrix}
     \bigl(\frac{\partial u_1}{\partial x}\bigr)^2 + \bigl(\frac{\partial u_2}{\partial x}\bigr)^2 &
     \frac{\partial u_1}{\partial x}\frac{\partial u_1}{\partial y} + \frac{\partial u_2}{\partial x}\frac{\partial u_2}{\partial y} \\[4pt]
     \frac{\partial u_1}{\partial x}\frac{\partial u_1}{\partial y} + \frac{\partial u_2}{\partial x}\frac{\partial u_2}{\partial y} &
     \bigl(\frac{\partial u_1}{\partial y}\bigr)^2 + \bigl(\frac{\partial u_2}{\partial y}\bigr)^2
     \end{bmatrix} .
\end{aligned}$$

Use the transformation rule (5.11) for the displacement gradient tensor DU to examine the Cauchy–Green
deformation tensor in a rotated coordinate system:

$$\begin{aligned}
C' &= I + DU' + DU'^T + DU'^T\,DU' \\
   &= I + R^T DU\,R + R^T DU^T R + R^T DU^T R\,R^T DU\,R \\
   &= I + R^T DU\,R + R^T DU^T R + R^T DU^T DU\,R \\
   &= R^T\,\bigl(I + DU + DU^T + DU^T DU\bigr)\,R \\
   &= R^T\,C\,R
\end{aligned}$$


This is the transformation rule for a second order tensor.

The Green strain tensor is given by

$$E = \frac{1}{2}\,(C - I)$$

and thus

$$\begin{bmatrix} E_{xx} & E_{xy} \\ E_{xy} & E_{yy} \end{bmatrix}
= \begin{bmatrix} \varepsilon_{xx} & \varepsilon_{xy} \\ \varepsilon_{xy} & \varepsilon_{yy} \end{bmatrix}
+ \frac{1}{2} \begin{bmatrix}
\bigl(\frac{\partial u_1}{\partial x}\bigr)^2 + \bigl(\frac{\partial u_2}{\partial x}\bigr)^2 &
\frac{\partial u_1}{\partial x}\frac{\partial u_1}{\partial y} + \frac{\partial u_2}{\partial x}\frac{\partial u_2}{\partial y} \\[4pt]
\frac{\partial u_1}{\partial x}\frac{\partial u_1}{\partial y} + \frac{\partial u_2}{\partial x}\frac{\partial u_2}{\partial y} &
\bigl(\frac{\partial u_1}{\partial y}\bigr)^2 + \bigl(\frac{\partial u_2}{\partial y}\bigr)^2
\end{bmatrix} .
\tag{5.18}$$

When dropping the quadratic contributions obtain the previous (infinitesimal) strain tensor. In Table 5.6 find
the definitions of the tensors defined in these lecture notes.
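
A minimal sketch of how the Green strain can be evaluated from a displacement gradient, here for a simple shear of moderate size (assumed, illustrative values; not part of the original notes). It also shows the difference to the infinitesimal strain.

import numpy as np

def green_strain(DU):
    """Green strain tensor E = 1/2 (C - I) built from the displacement gradient DU."""
    I = np.eye(DU.shape[0])
    C = I + DU + DU.T + DU.T @ DU          # Cauchy-Green deformation tensor
    return 0.5 * (C - I)

def infinitesimal_strain(DU):
    return 0.5 * (DU + DU.T)

# simple shear of moderate size: u1 depends on y only, with du1/dy = 0.2 (assumed)
DU = np.array([[0.0, 0.2],
               [0.0, 0.0]])
print(infinitesimal_strain(DU))   # e_xy = 0.1, e_yy = 0
print(green_strain(DU))           # E_xy = 0.1, but E_yy = 0.02 from the quadratic term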

Geometric interpretation of the entries in the Green strain tensor:


Exx : Consider a deformation with fixed origin. The point (∆x, 0) is moved to (1 + ∂u₁/∂x, ∂u₂/∂x) ∆x and
      thus the new length l of the original segment from (0, 0) to (∆x, 0) is given by²²

      $$\begin{aligned}
      l^2 &= \Bigl( \bigl(1 + \tfrac{\partial u_1}{\partial x}\bigr)^2 + \bigl(\tfrac{\partial u_2}{\partial x}\bigr)^2 \Bigr) (\Delta x)^2
           = \Bigl( 1 + 2\,\tfrac{\partial u_1}{\partial x} + \bigl(\tfrac{\partial u_1}{\partial x}\bigr)^2 + \bigl(\tfrac{\partial u_2}{\partial x}\bigr)^2 \Bigr) (\Delta x)^2 \\
      l   &= \sqrt{1 + 2\,\tfrac{\partial u_1}{\partial x} + \bigl(\tfrac{\partial u_1}{\partial x}\bigr)^2 + \bigl(\tfrac{\partial u_2}{\partial x}\bigr)^2}\; \Delta x
           \approx \Bigl( 1 + \tfrac{\partial u_1}{\partial x} + \tfrac{1}{2}\bigl(\tfrac{\partial u_1}{\partial x}\bigr)^2 + \tfrac{1}{2}\bigl(\tfrac{\partial u_2}{\partial x}\bigr)^2 \Bigr) \Delta x \\
      \frac{\Delta l}{\Delta x} &\approx \tfrac{\partial u_1}{\partial x} + \tfrac{1}{2}\bigl(\tfrac{\partial u_1}{\partial x}\bigr)^2 + \tfrac{1}{2}\bigl(\tfrac{\partial u_2}{\partial x}\bigr)^2 = E_{xx} .
      \end{aligned}$$

      Thus Exx shows the relative change of length in the x direction. Use Figure 5.6 (page 327) for a visual-
      ization of the result. Observe that the displaced segment need not be vertical any more.

Eyy : Similar to Exx , but in y direction.

Exy : Use the two orthogonal vectors ~v1 = (1 , 0)T and ~v2 = (0 , 1)T , attach these at a point, deform the
solid and then determine the angle φ between the two deformed vectors. Assume that the entries in
      DU are small. Using

      $$(I+DU)\,\vec v_1 = \begin{pmatrix} 1 \\ 0 \end{pmatrix} + \begin{pmatrix} \frac{\partial u_1}{\partial x} \\[2pt] \frac{\partial u_2}{\partial x} \end{pmatrix},
      \qquad
      (I+DU)\,\vec v_2 = \begin{pmatrix} 0 \\ 1 \end{pmatrix} + \begin{pmatrix} \frac{\partial u_1}{\partial y} \\[2pt] \frac{\partial u_2}{\partial y} \end{pmatrix}$$

      $$\cos(\phi) = \frac{\langle \vec v_1 \,,\, C\,\vec v_2 \rangle}{\|(I+DU)\vec v_1\|\;\|(I+DU)\vec v_2\|}
      = \frac{C_{xy}}{\sqrt{1+\text{small}}\,\sqrt{1+\text{small}}} \approx 2\,E_{xy}$$

      conclude

      $$\frac{\pi}{2} - \phi \approx \sin\!\Bigl(\frac{\pi}{2} - \phi\Bigr) = \cos\phi \approx 2\,E_{xy} .$$

      Thus 2 Exy indicates by how much the angle between the two coordinate axes is diminished by the
      deformation. This interpretation is identical to the interpretation of the infinitesimal strain tensor on
      page 329.

²² For |z| ≪ 1 use the linear approximation √(1+z) ≈ 1 + ½ z.


infinitesimal strain tensor

$$\begin{bmatrix} \varepsilon_{xx} & \varepsilon_{xy} \\ \varepsilon_{xy} & \varepsilon_{yy} \end{bmatrix}
= \begin{bmatrix} \frac{\partial u_1}{\partial x} & \frac{1}{2}\bigl(\frac{\partial u_2}{\partial x} + \frac{\partial u_1}{\partial y}\bigr) \\[4pt]
\frac{1}{2}\bigl(\frac{\partial u_2}{\partial x} + \frac{\partial u_1}{\partial y}\bigr) & \frac{\partial u_2}{\partial y} \end{bmatrix}$$

displacement gradient tensor

$$DU = \begin{bmatrix} \frac{\partial u_1}{\partial x} & \frac{\partial u_1}{\partial y} \\[4pt]
\frac{\partial u_2}{\partial x} & \frac{\partial u_2}{\partial y} \end{bmatrix}$$

deformation gradient tensor

$$I + DU = \begin{bmatrix} 1 + \frac{\partial u_1}{\partial x} & \frac{\partial u_1}{\partial y} \\[4pt]
\frac{\partial u_2}{\partial x} & 1 + \frac{\partial u_2}{\partial y} \end{bmatrix}$$

Cauchy–Green deformation tensor

$$C = I + DU + DU^T + DU^T DU
= I + 2 \begin{bmatrix} \varepsilon_{xx} & \varepsilon_{xy} \\ \varepsilon_{xy} & \varepsilon_{yy} \end{bmatrix}
+ \begin{bmatrix}
\bigl(\frac{\partial u_1}{\partial x}\bigr)^2 + \bigl(\frac{\partial u_2}{\partial x}\bigr)^2 &
\frac{\partial u_1}{\partial x}\frac{\partial u_1}{\partial y} + \frac{\partial u_2}{\partial x}\frac{\partial u_2}{\partial y} \\[4pt]
\frac{\partial u_1}{\partial x}\frac{\partial u_1}{\partial y} + \frac{\partial u_2}{\partial x}\frac{\partial u_2}{\partial y} &
\bigl(\frac{\partial u_1}{\partial y}\bigr)^2 + \bigl(\frac{\partial u_2}{\partial y}\bigr)^2
\end{bmatrix}$$

Green strain tensor

$$\begin{bmatrix} E_{xx} & E_{xy} \\ E_{xy} & E_{yy} \end{bmatrix}
= \begin{bmatrix} \varepsilon_{xx} & \varepsilon_{xy} \\ \varepsilon_{xy} & \varepsilon_{yy} \end{bmatrix}
+ \frac{1}{2} \begin{bmatrix}
\bigl(\frac{\partial u_1}{\partial x}\bigr)^2 + \bigl(\frac{\partial u_2}{\partial x}\bigr)^2 &
\frac{\partial u_1}{\partial x}\frac{\partial u_1}{\partial y} + \frac{\partial u_2}{\partial x}\frac{\partial u_2}{\partial y} \\[4pt]
\frac{\partial u_1}{\partial x}\frac{\partial u_1}{\partial y} + \frac{\partial u_2}{\partial x}\frac{\partial u_2}{\partial y} &
\bigl(\frac{\partial u_1}{\partial y}\bigr)^2 + \bigl(\frac{\partial u_2}{\partial y}\bigr)^2
\end{bmatrix}$$

infinitesimal stress tensor

$$\begin{bmatrix} \sigma_x & \tau_{xy} \\ \tau_{xy} & \sigma_y \end{bmatrix}$$

Table 5.6: Different tensors in 2D


5–36 Example : Pure rotation


For a pure rotation

$$\begin{pmatrix} x \\ y \end{pmatrix} \longrightarrow
\begin{bmatrix} \cos\phi & -\sin\phi \\ \sin\phi & \cos\phi \end{bmatrix} \cdot \begin{pmatrix} x \\ y \end{pmatrix}
= \begin{pmatrix} \cos\phi\,x - \sin\phi\,y \\ \sin\phi\,x + \cos\phi\,y \end{pmatrix}$$

we find the displacement vector

$$\begin{pmatrix} u_1(x,y) \\ u_2(x,y) \end{pmatrix}
= \begin{pmatrix} \cos\phi\,x - \sin\phi\,y - x \\ \sin\phi\,x + \cos\phi\,y - y \end{pmatrix} .$$

Now determine the partial derivatives of the displacements with respect to x and y. This leads to the
Cauchy–Green deformation tensor

$$\begin{aligned}
C &= I + \begin{bmatrix} 2\cos\phi - 2 & -\sin\phi + \sin\phi \\ -\sin\phi + \sin\phi & 2\cos\phi - 2 \end{bmatrix}
 + \begin{bmatrix} (\cos\phi-1)^2 + \sin^2\phi & -(1-\cos\phi)\sin\phi + \sin\phi\,(1-\cos\phi) \\
   -(1-\cos\phi)\sin\phi + \sin\phi\,(1-\cos\phi) & \sin^2\phi + (\cos\phi-1)^2 \end{bmatrix} \\
 &= \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}
 + \begin{bmatrix} 2\cos\phi - 2 & 0 \\ 0 & 2\cos\phi - 2 \end{bmatrix}
 + \begin{bmatrix} 2 - 2\cos\phi & 0 \\ 0 & 2 - 2\cos\phi \end{bmatrix}
 = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}
\end{aligned}$$

and we find

$$|A'D'|^2 = \langle C\,\overrightarrow{\Delta x} \,,\, \overrightarrow{\Delta x} \rangle
= \langle \overrightarrow{\Delta x} \,,\, \overrightarrow{\Delta x} \rangle
= \|\overrightarrow{\Delta x}\|^2 = |AD|^2 .$$

Thus no section of the solid is stretched, even for large angles φ. All components of the Green strain tensor
E vanish. Thus the small angle restriction from Example 5–17 disappeared. With this approach we can
examine situations with large deformations, but still small strains, e.g. bending of slender rods. ♦
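
The statement of this example is easy to confirm numerically: for a pure rotation the infinitesimal strain does not vanish, but the Green strain does. A small illustrative check (the angle of 40° is an assumed value; the sketch is not part of the original notes):

import numpy as np

phi = np.deg2rad(40.0)                              # large rotation angle (assumed)
R = np.array([[np.cos(phi), -np.sin(phi)],
              [np.sin(phi),  np.cos(phi)]])
DU = R - np.eye(2)                                  # displacement gradient of the pure rotation

eps_inf = 0.5 * (DU + DU.T)                         # infinitesimal strain: NOT zero
C = np.eye(2) + DU + DU.T + DU.T @ DU
E_green = 0.5 * (C - np.eye(2))                     # Green strain: exactly zero

print(eps_inf)                                      # diagonal entries cos(phi) - 1
print(np.allclose(E_green, 0))                      # True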

Since the Green strain tensor E satisfies the usual transformation rule for a second order tensor it can be
diagonalized by rotating the coordinate system, and in the rotated coordinate system we find

$$\begin{bmatrix} E_{xx} & 0 \\ 0 & E_{yy} \end{bmatrix}
= \begin{bmatrix} \frac{\partial u_1}{\partial x} & 0 \\ 0 & \frac{\partial u_2}{\partial y} \end{bmatrix}
+ \frac{1}{2} \begin{bmatrix}
\bigl(\frac{\partial u_1}{\partial x}\bigr)^2 + \bigl(\frac{\partial u_2}{\partial x}\bigr)^2 & 0 \\
0 & \bigl(\frac{\partial u_1}{\partial y}\bigr)^2 + \bigl(\frac{\partial u_2}{\partial y}\bigr)^2
\end{bmatrix} .$$

This will be useful to determine energy formulas for selected deformation problems, or you may use the
invariant expressions of the Green strain tensor, comparable to the observations on page 352ff.

5–37 Example : More invariants of the Cauchy–Green deformation tensor, used to describe nonlinear
material laws
The Cauchy–Green deformation tensor C is often used to describe the elastic energy density W (C) for large
deformations. It is easiest to work with the tensor in principal form, i.e. the Cauchy-Green deformation
tensor in a principal system.
$$\begin{aligned}
C &= I + DU + DU^T + DU^T DU \\
  &= \begin{bmatrix} 1 + 2\,\frac{\partial u_1}{\partial x} & \cdot & \cdot \\
      \cdot & 1 + 2\,\frac{\partial u_2}{\partial y} & \cdot \\
      \cdot & \cdot & 1 + 2\,\frac{\partial u_3}{\partial z} \end{bmatrix}
   + \begin{bmatrix}
      \bigl(\frac{\partial u_1}{\partial x}\bigr)^2 + \bigl(\frac{\partial u_2}{\partial x}\bigr)^2 + \bigl(\frac{\partial u_3}{\partial x}\bigr)^2 & \cdot & \cdot \\
      \cdot & \bigl(\frac{\partial u_1}{\partial y}\bigr)^2 + \bigl(\frac{\partial u_2}{\partial y}\bigr)^2 + \bigl(\frac{\partial u_3}{\partial y}\bigr)^2 & \cdot \\
      \cdot & \cdot & \bigl(\frac{\partial u_1}{\partial z}\bigr)^2 + \bigl(\frac{\partial u_2}{\partial z}\bigr)^2 + \bigl(\frac{\partial u_3}{\partial z}\bigr)^2 \end{bmatrix} \\
  &= \begin{bmatrix}
      \bigl(1+\frac{\partial u_1}{\partial x}\bigr)^2 + \bigl(\frac{\partial u_2}{\partial x}\bigr)^2 + \bigl(\frac{\partial u_3}{\partial x}\bigr)^2 & \cdot & \cdot \\
      \cdot & \bigl(\frac{\partial u_1}{\partial y}\bigr)^2 + \bigl(1+\frac{\partial u_2}{\partial y}\bigr)^2 + \bigl(\frac{\partial u_3}{\partial y}\bigr)^2 & \cdot \\
      \cdot & \cdot & \bigl(\frac{\partial u_1}{\partial z}\bigr)^2 + \bigl(\frac{\partial u_2}{\partial z}\bigr)^2 + \bigl(1+\frac{\partial u_3}{\partial z}\bigr)^2 \end{bmatrix}
   = \begin{bmatrix} \lambda_1^2 & 0 & 0 \\ 0 & \lambda_2^2 & 0 \\ 0 & 0 & \lambda_3^2 \end{bmatrix}
\end{aligned}$$

The diagonal entries λᵢ² are squares of the principal stretches λᵢ:

$$\begin{aligned}
\lambda_1 &= \sqrt{\Bigl(1+\frac{\partial u_1}{\partial x}\Bigr)^2 + \Bigl(\frac{\partial u_2}{\partial x}\Bigr)^2 + \Bigl(\frac{\partial u_3}{\partial x}\Bigr)^2} = \text{factor by which the x axis will be stretched} \\
\lambda_2 &= \sqrt{\Bigl(\frac{\partial u_1}{\partial y}\Bigr)^2 + \Bigl(1+\frac{\partial u_2}{\partial y}\Bigr)^2 + \Bigl(\frac{\partial u_3}{\partial y}\Bigr)^2} = \text{factor by which the y axis will be stretched} \\
\lambda_3 &= \sqrt{\Bigl(\frac{\partial u_1}{\partial z}\Bigr)^2 + \Bigl(\frac{\partial u_2}{\partial z}\Bigr)^2 + \Bigl(1+\frac{\partial u_3}{\partial z}\Bigr)^2} = \text{factor by which the z axis will be stretched}
\end{aligned}$$
A geometrical reasoning for the above is shown in Figure 5.6 (on page 327).

The elastic energy density W is usually expressed in terms of invariants of the tensors. There are many
different expressions for the energy density, corresponding to different types of materials. A few of the
invariants of the Cauchy–Green deformation tensor are given by

$$\begin{aligned}
I_1 &= \operatorname{trace}(C) = \lambda_1^2 + \lambda_2^2 + \lambda_3^2 \\
    &= \Bigl(1+\frac{\partial u_1}{\partial x}\Bigr)^2 + \Bigl(\frac{\partial u_2}{\partial x}\Bigr)^2 + \Bigl(\frac{\partial u_3}{\partial x}\Bigr)^2
     + \Bigl(1+\frac{\partial u_2}{\partial y}\Bigr)^2 + \Bigl(\frac{\partial u_1}{\partial y}\Bigr)^2 + \Bigl(\frac{\partial u_3}{\partial y}\Bigr)^2
     + \Bigl(1+\frac{\partial u_3}{\partial z}\Bigr)^2 + \Bigl(\frac{\partial u_1}{\partial z}\Bigr)^2 + \Bigl(\frac{\partial u_2}{\partial z}\Bigr)^2 , \\
I_2 &= \lambda_1^2\lambda_2^2 + \lambda_2^2\lambda_3^2 + \lambda_1^2\lambda_3^2
     = \frac{1}{2}\,\bigl(\operatorname{trace}(C)^2 - \operatorname{trace}(C^2)\bigr) , \\
I_3 &= \det(C) = \lambda_1^2 \cdot \lambda_2^2 \cdot \lambda_3^2 , \\
J   &= \sqrt{\det(C)} = \lambda_1 \cdot \lambda_2 \cdot \lambda_3 \quad\text{factor of volume change,} \\
\bar I_1 &= \frac{I_1}{J^{2/3}}
     = \lambda_1^{4/3}\lambda_2^{-2/3}\lambda_3^{-2/3} + \lambda_1^{-2/3}\lambda_2^{4/3}\lambda_3^{-2/3} + \lambda_1^{-2/3}\lambda_2^{-2/3}\lambda_3^{4/3} .
\end{aligned}$$
Observe that these expressions can be determined by algebraic operations, once the entries in C are known.
It is not necessary to compute the eigenvalues. This is used in Example 5–41. These expressions can be
used to examine different models for the energy density W , mainly for nonlinear material laws. ♦
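
Since the invariants can be computed by algebraic operations only, a short function suffices and no eigenvalue computation is needed. The displacement gradient below is an assumed, illustrative example; the sketch is not part of the original notes.

import numpy as np

def cauchy_green_invariants(DU):
    """Invariants I1, J and I1_bar of C = I + DU + DU^T + DU^T DU, without eigenvalues."""
    I = np.eye(3)
    C = I + DU + DU.T + DU.T @ DU
    I1 = np.trace(C)
    J = np.sqrt(np.linalg.det(C))          # volume change factor lambda1*lambda2*lambda3
    I1_bar = I1 / J**(2.0/3.0)
    return I1, J, I1_bar

# homogeneous stretch by 10% in x, contraction by 5% in y and z (illustrative values)
DU = np.diag([0.10, -0.05, -0.05])
print(cauchy_green_invariants(DU))   # J is close to, but not exactly, 1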

Table 5.7 shows some material laws commonly used. The following examples analyze some of these
definitions. The energy density W for the nonlinear material is shown, and then a small strain analysis
is performed to examine the connection to the linear law of Hooke. For most cases uniaxial loading in
x–direction and hydrostatic loading is examined.

5.7.2 Neo–Hookean Energy Density Models


5–38 Example : Neo–Hookean energy density for incompressible materials
Now try to connect the Neo–Hookean form for the energy density with the result for small strains based on
Hooke’s law. To do this investigate a uniaxial stretching in x direction only with
$$\varepsilon_{xx} = \frac{\partial u_1}{\partial x} \quad\text{very small} .$$


energy density W loading small strain approximation Example


Neo–Hookean, incompressible
C10 · (I1 − 3) = C10 · (I¯1 − 3) uniaxial C10 = 1
6 E 5–38
Neo–Hookean, compressible uniaxial E
C10 = 4 (1+ν) = µ2 5–39
C10 · (I¯1 − 3) + 1 (J − 1)2 D hydrostatic 1 E
D = 6 (1−2 ν) = 2
K
5–40
E
shearing C10 = 4 (1+ν) = ν2 5–41
E
Neo–Hookean, compressible uniaxial µ = 2 (1+ν) 5–42
µ
2 (I1 − 3) − µ ln(J) + λ
2 ln(J)2 hydrostatic λ = (1+ν)µ(1−2
E
ν) 5–43
shearing 5–44
Mooney–Rivlin, incompressible
C10 · (I¯1 − 3) + C01 · (I¯2 − 3) uniaxial C10 + C01 = 1
6 E 5–45
1 E K
Mooney–Rivlin, compressible hydrostatic D = 6 (1−2 ν) = 2 5–46
C10 · (I¯1 − 3) + C01 · (I¯2 − 3) + D
1
(J − 1)2 uniaxial C10 + C01 ≈ 1
6 E 5–46
Ogden, incompressible
µ 3
α (λα1 + λα2 + λα3 − 3) uniaxial E= 2 µα 5–47
1 E K
Ogden, compressible hydrostatic D = 6 (1−2 ν) = 2 5–48
µ 1 18 µ α
α (λ̄α1 + λ̄α2 + λ̄α3 − 3) + D (λ1 λ2 λ3 − 1) uniaxial E = 12+µ αD 5–49

Table 5.7: Some nonlinear material laws, given by the energy density W . The connection of the small strain
approximation to Hooke’s linear law is shown.

Start by examining an incompressible material, i.e. J = λ1 λ2 λ3 = 1, and the energy density given by

$$W = C_{10} \cdot (I_1 - 3) = C_{10} \cdot (\lambda_1^2 + \lambda_2^2 + \lambda_3^2 - 3) ,
\tag{5.19}$$

where C10 is a material constant. This is called the Neo–Hookean energy density for incompressible
materials. To examine uniaxial stretching use the symmetry λ2 = λ3 and the incompressibility λ1 · λ2 · λ3 = 1
to conclude

$$\lambda_2 = \lambda_3 = \frac{1}{\sqrt{\lambda_1}}
\qquad\text{and}\qquad
W = C_{10} \cdot \Bigl(\lambda_1^2 + \frac{2}{\lambda_1} - 3\Bigr) .$$

With the notations z₂ = ∂u₂/∂x and z₃ = ∂u₃/∂x minimize W(z₂, z₃):

$$0 = \frac{\partial W}{\partial z_2} = C_{10}\,\Bigl(2\,\lambda_1 - \frac{2}{\lambda_1^2}\Bigr)\frac{\partial \lambda_1}{\partial z_2}
= C_{10}\,\Bigl(2\,\lambda_1 - \frac{2}{\lambda_1^2}\Bigr)\frac{1}{\lambda_1}\,\frac{\partial u_2}{\partial x}
\;\Longrightarrow\; \frac{\partial u_2}{\partial x} = 0$$

$$0 = \frac{\partial W}{\partial z_3} = C_{10}\,\Bigl(2\,\lambda_1 - \frac{2}{\lambda_1^2}\Bigr)\frac{\partial \lambda_1}{\partial z_3}
= C_{10}\,\Bigl(2\,\lambda_1 - \frac{2}{\lambda_1^2}\Bigr)\frac{1}{\lambda_1}\,\frac{\partial u_3}{\partial x}
\;\Longrightarrow\; \frac{\partial u_3}{\partial x} = 0$$

There is no shearing in this situation, as expected. Using

$$\lambda_1 = \sqrt{\Bigl(1+\frac{\partial u_1}{\partial x}\Bigr)^2} = 1 + \frac{\partial u_1}{\partial x} = 1 + \varepsilon_{xx}$$

find the elastic energy density W = ½ E εxx² (based on equation (5.15) on page 351, generated by small
deformations and Hooke's law) and compare to the result based on the above formula:

$$W = C_{10} \cdot \Bigl(\lambda_1^2 + \frac{2}{\lambda_1} - 3\Bigr)
= C_{10} \cdot \Bigl((1+\varepsilon_{xx})^2 + \frac{2}{1+\varepsilon_{xx}} - 3\Bigr)
\approx C_{10} \cdot \bigl(1 + 2\,\varepsilon_{xx} + \varepsilon_{xx}^2 + 2\,(1 - \varepsilon_{xx} + \varepsilon_{xx}^2 - \varepsilon_{xx}^3 + \varepsilon_{xx}^4) - 3\bigr)
= C_{10} \cdot (3\,\varepsilon_{xx}^2 - 2\,\varepsilon_{xx}^3 + 2\,\varepsilon_{xx}^4)$$



For small εxx this should be similar to W = ½ E εxx², leading to

$$C_{10} = \frac{1}{6}\,E .$$

Using this, generate an approximating stress–strain curve based on W ≈ (E/6)(3 εxx² − 2 εxx³):

$$\sigma_x = \frac{\partial W}{\partial \varepsilon_{xx}} \approx E\,\varepsilon_{xx} - E\,\varepsilon_{xx}^2 .$$

Since the material is assumed to be incompressible, confirm the Poisson ratio ν = −εyy/εxx = 1/2 by

$$\lambda_2 = \lambda_3 = \frac{1}{\sqrt{\lambda_1}} = \frac{1}{\sqrt{1+\varepsilon_{xx}}} \approx 1 - \frac{1}{2}\,\varepsilon_{xx} .$$

In Figure 5.17(a) find the stress–strain curve for the linear Hooke's law and the Neo–Hookean material
under uniaxial load. This is only valid if |εxx| ≪ 1, but the stress–strain curve can be generated for larger
values of εxx too:

$$\begin{aligned}
\sigma_x = \frac{\partial W}{\partial \varepsilon_{xx}}
 &= C_{10}\,\frac{\partial}{\partial \varepsilon_{xx}}\Bigl((1+\varepsilon_{xx})^2 + \frac{2}{1+\varepsilon_{xx}} - 3\Bigr)
  = C_{10}\,\Bigl(2\,(1+\varepsilon_{xx}) - \frac{2}{(1+\varepsilon_{xx})^2}\Bigr) \\
 &= C_{10}\,2\,\bigl(1 + \varepsilon_{xx} - 1 + 2\,\varepsilon_{xx} - 3\,\varepsilon_{xx}^2 + 4\,\varepsilon_{xx}^3 - 5\,\varepsilon_{xx}^4 + O(\varepsilon_{xx}^5)\bigr) \\
 &= C_{10}\,\bigl(6\,\varepsilon_{xx} - 6\,\varepsilon_{xx}^2 + 8\,\varepsilon_{xx}^3 - 10\,\varepsilon_{xx}^4 + O(\varepsilon_{xx}^5)\bigr)
\end{aligned}$$

$$\frac{\partial \sigma_x}{\partial \varepsilon_{xx}} = C_{10}\,\Bigl(2 + \frac{4}{(1+\varepsilon_{xx})^3}\Bigr)$$

With C10 = E/6 the stress–strain curve is

• a straight line with slope ∂σx/∂εxx ≈ 6 C10 = E for 0 ≤ |εxx| ≪ 1, identical to Hooke's law.

• a straight line σx ≈ (E/3)(1 + εxx) with slope E/3 for εxx ≫ 1 very large.

• for 0 < εxx the slope is smaller than E and for −1 < εxx < 0 the slope is larger than E.

This is visualized in Figure 5.17(a).

[Figure omitted in this text version: stress/E plotted versus strain for linear Hooke, the neo-Hookean law and its approximation; panel (a) engineering stress, panel (b) true stress.]

Figure 5.17: Stress–strain curve for Hooke's linear law and an incompressible neo–Hookean material under
uniaxial load
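
The curves in Figure 5.17(a) can be reproduced with a few lines. The sketch below tabulates the engineering stress for Hooke's law, the incompressible neo-Hookean law and its quadratic approximation; the values are normalized by E and purely illustrative (not part of the original notes).

import numpy as np

E = 1.0                                     # normalize by E, as in Figure 5.17
C10 = E / 6.0
eps = np.linspace(-0.2, 1.0, 7)

sigma_hooke = E * eps                                        # linear law
sigma_neo = C10 * (2*(1 + eps) - 2/(1 + eps)**2)             # incompressible neo-Hookean
sigma_neo_approx = E*eps - E*eps**2                          # quadratic approximation

for e, sh, sn, sa in zip(eps, sigma_hooke, sigma_neo, sigma_neo_approx):
    print(f"{e:6.2f}  {sh:8.4f}  {sn:8.4f}  {sa:8.4f}")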


The above arguments use the engineering stress, i.e. the force is divided by the area of the undeformed
solid. Since λ2 = λ3 = 1/√λ1 = 1/√(1+εxx) the cross sectional area is diminished by a factor of
λ2 · λ3 = 1/(1+εxx), leading to a true stress of

$$\sigma_{x,\text{true}} = \sigma_x\,(1+\varepsilon_{xx})
= C_{10}\,\Bigl(2\,(1+\varepsilon_{xx}) - \frac{2}{(1+\varepsilon_{xx})^2}\Bigr)(1+\varepsilon_{xx})
= C_{10}\,\Bigl(2\,(1+\varepsilon_{xx})^2 - \frac{2}{1+\varepsilon_{xx}}\Bigr) .$$

The true stress based on Hooke's law is given by E (εxx + εxx²), shown in Figure 5.17(b). ♦

5–39 Example : From large strain energy to Hooke for uniaxial loading for compressible material
In [Bowe10, §3.5.5] a Neo–Hookean energy density of the form

$$W = C_{10} \cdot (\bar I_1 - 3) + \frac{1}{D}\,(J-1)^2
  = C_{10} \cdot \Bigl(\frac{I_1}{J^{2/3}} - 3\Bigr) + \frac{1}{D}\,(J-1)^2
\tag{5.20}$$

is examined. The first expression takes shape changes into account, while the second term is related
to volume changes.
To examine the energy in equation (5.20) for a uniaxial stretching, good approximations of the invariants
as functions of εxx = ∂u₁/∂x are required. Assume a small deformation and use Hooke's law with a Poisson
ratio of 0 ≤ ν ≤ 1/2. As a consequence work with εyy = εzz = −ν εxx and compute
1
ratio of 0 ≤ ν ≤ 2 . As consequence work with εyy = εzz = −ν εxx and compute

J = λ1 λ2 λ3 = (1 + εxx ) · (1 + εyy ) · (1 + εzz ) = (1 + εxx ) · (1 − ν εxx )2


= 1 + (1 − 2 ν) εxx + (−2 ν + ν 2 ) ε2xx + ν 2 ε3xx
2
(J − 1)2 = ε2xx (1 − 2 ν) + ν (−2 + ν) εxx + ν 2 ε2xx
= ε2xx (1 − 2 ν)2 + 2 (1 − 2 ν) ν (−2 + ν) εxx +
+ ν 2 (−2 + ν)2 + 2 (1 − 2 ν) ν 2 ε2xx + . . .
 

= ε2xx (1 − 2 ν) ((1 − 2 ν) − 2 ν (2 − ν) εxx ) + O(ε4xx )


I1 = (1 + εxx )2 + (1 + εyy )2 + (1 + εzz )2
= (1 + εxx )2 + (1 − ν εxx )2 + (1 − ν εxx )2 = 3 + (2 − 4 ν) εxx + (1 + 2 ν 2 ) ε2xx .

Using Mathematica (for the elementary, tedious operations) to find

2 (1 − 2 ν) 5 − 8 ν + 14 ν 2 2
J −2/3 = 1 − εxx + εxx −
3 9
4 (−10 + 15 ν − 21 ν 2 + 35 ν 3 ) 3
− εxx + O(ε4xx )
81
I1 4 (1 + ν)2 2 4 (1 + ν)2 (7 − 11 ν) 3
I¯1 = 2/3 = 3+ εxx − εxx + O(ε4xx ) .
J 3 27
Observe that the invariant I1 contains a contribution proportional to εxx , while I¯1 does not. When minimiz-
ing the energy based on I1 this would lead to a nonzero stress σx 6= 0, even if no deformation is applied
with εxx = 0. Thus one should not use the energy C10 · (I3 − 3) for compressible materials. Now examine

I1 1
W = C10 · ( 3/2 − 3) + (J − 1)2
J D
4 (1 + ν)2 (1 − 2 ν)2
 
= C10 + ε2xx −
3 D
4 (1 + ν)2 (7 − 11 ν) 2 (1 − 2 ν) ν (2 − ν)
 
− C10 + ε3xx + O(ε4xx )
27 D


and use
E 1 1 E 1
C10 = = µ and = = K
4 (1 + ν) 2 D 6 (1 − 2 ν) 2
(with shear modulus µ and bulk modulus K, see Table 5.5 on page 360 in Example 5–33) and elementary
algebra to conclude23

4 (1 + ν)2 (1 − 2 ν)2
 
W = C10 + ε2xx + O(ε3xx )
3 D
4 (1 + ν)2
 
E E 2
= + (1 − 2 ν) ε2xx + O(ε3xx )
4 (1 + ν) 3 6 (1 − 2 ν)
 
E (1 + ν) E 1
= + (1 − 2 ν) ε2xx + O(ε3xx ) = E ε2xx + O(ε3xx ) .
3 6 2

For small, uniaxial deformations the energy densities generated by Hooke’s law and by (5.20) coincide.
To determine the stress–strain curve use
4 (1 + ν)2 (1 − 2 ν)2
 
∂W
σx = = 2 C10 + εxx +
∂εxx 3 D
4 (1 + ν)2 (7 − 11 ν) 2 (1 − 2 ν) ν (−2 + ν)
 
+3 −C10 + ε2xx + O(ε3xx )
27 D
1
and with the above values for C10 and D for ν = 2 this leads to

σx = E εxx − E ε2xx + O(ε3xx ) .

The stress–strain curve is again shown in Figure 5.17 and identical to the approximative curve for the
incompressible Neo–Hookean material. ♦

23
Another approach would be (using λ2 = λ3 = 1 + εyy .)

λ21 + λ22 + λ23 1


W = C10 · ( − 3) + (λ1 λ2 λ3 − 1)2
(λ1 λ2 λ3 )2/3 D
4/3 −4/3 −2/3 2/3 1
= C10 · (λ1 λ2 + 2 λ1 λ2 − 3) + (λ1 λ22 − 1)2
D
∂W 4 1/3 −4/3 4 −5/3 2/3 2
= C10 · ( λ1 λ2 − λ1 λ2 ) + (λ1 λ22 − 1) λ22
∂λ1 3 3 D
∂W 4 2
= C10 · ((1 + εxx )1/3 (1 + εyy )−4/3 − (1 + εxx )−5/3 (1 + εyy )2/3 ) + ((1 + εxx ) (1 + εyy )2 − 1) (1 + εyy )2
∂εxx 3 D
4 1 4 5 2 2
≈ C10 · ((1 + εxx ) (1 − εyy ) − (1 − εxx ) (1 + εyy )) + (εxx + 2 εyy ) (1 + 2 εyy )
3 3 3 3 3 D
4 1 4 5 2 2
≈ C10 · ( εxx − εyy + εxx − εyy ) + (εxx + 2 εyy )
3
 3  3  3 3  D
8 2 6 4
= C10 + εxx + − C10 + εyy .
3 D 3 D
1
With εyy = −ν εxx and the above expressions for C10 and D this leads to
     
∂W 8 2 8 4 8 + 8ν 2 − 4ν
≈ C10 + εxx − ν − C10 + εxx = C10 + εxx
∂εxx 3 D 3 D 3 D
   
8 + 8ν E (2 − 4 ν) E 2E E
= + εxx = + εxx = E εxx .
3 4 (1 + ν) 6 (1 − 2 ν) 3 3
This confirms Hooke’s law for small strains.


5–40 Example : The hydrostatic pressure situation


For the hydrostatic pressure situation we work with εxx = εyy = εzz and thus
$$\lambda = \lambda_1 = \lambda_2 = \lambda_3 = \sqrt{(1+\varepsilon_{xx})^2} = 1 + \varepsilon_{xx}\,,\qquad
I_1 = \lambda_1^2+\lambda_2^2+\lambda_3^2 = 3\,(1 + 2\,\varepsilon_{xx} + \varepsilon_{xx}^2)\,,\qquad
J = \lambda_1\lambda_2\lambda_3 = (1+\varepsilon_{xx})^3$$

and this leads to the expressions used for the different models for the energy density.

$$\begin{aligned}
I_1 - 3 &= 6\,\varepsilon_{xx} + 3\,\varepsilon_{xx}^2 \\
\bar I_1 - 3 &= \frac{I_1}{J^{2/3}} - 3 = \frac{3\,(1+\varepsilon_{xx})^2}{(1+\varepsilon_{xx})^2} - 3 = 0 \\
(J-1)^2 &= (3\,\varepsilon_{xx} + 3\,\varepsilon_{xx}^2 + \varepsilon_{xx}^3)^2 = \varepsilon_{xx}^2\,(3 + 3\,\varepsilon_{xx} + \varepsilon_{xx}^2)^2
 = \varepsilon_{xx}^2\,(9 + 18\,\varepsilon_{xx} + 15\,\varepsilon_{xx}^2 + 6\,\varepsilon_{xx}^3 + \varepsilon_{xx}^4)
\end{aligned}$$

The result I¯1 − 3 = 0 shows that the invariant I¯1 does not take volume changes into account, while I1 − 3
and (J − 1)2 do. Now have a closer look at two models frequently used for the elastic energy density W .
• Neo–Hookean

  $$W = C_{10} \cdot (I_1 - 3) = C_{10}\,3\,\varepsilon_{xx}\,(2 + \varepsilon_{xx})$$

  $$\sigma_x = \frac{1}{3}\,\frac{\partial W}{\partial \varepsilon_{xx}} = C_{10}\,2\,(1 + \varepsilon_{xx})$$
  This can not be a correct model, since a force would be required to keep the solid undeformed, i.e.
  without any external force the body would shrink. This does not make sense and is in contradiction
  to the commonly used statement "Neo–Hookean is for incompressible solids". Obviously any type of incom-
  pressible material model will struggle with the hydrostatic situation. But the material is assumed to be
  incompressible, thus the three stretch factors λi are not independent. Use λ1 · λ2 · λ3 = 1 to conclude
  λ3 = 1/(λ1 · λ2), which contradicts λ1 = λ2 = λ3.

• In [Bowe10, §3.5.5] find a model with the energy density given by

  $$W = C_{10} \cdot (\bar I_1 - 3) + \frac{1}{D}\,(J-1)^2
     = C_{10} \cdot \Bigl(\frac{I_1}{J^{2/3}} - 3\Bigr) + \frac{1}{D}\,(J-1)^2
     = C_{10}\cdot 0 + \frac{1}{D}\,\varepsilon_{xx}^2\,(9 + 18\,\varepsilon_{xx} + 15\,\varepsilon_{xx}^2 + 6\,\varepsilon_{xx}^3 + \varepsilon_{xx}^4)$$

  $$\sigma_x = \frac{1}{3}\,\frac{\partial W}{\partial \varepsilon_{xx}}
     = \frac{1}{D}\,\bigl(6\,\varepsilon_{xx} + 18\,\varepsilon_{xx}^2 + 20\,\varepsilon_{xx}^3 + 10\,\varepsilon_{xx}^4 + 2\,\varepsilon_{xx}^5\bigr) .$$

  In Example 5–31 Hooke's linear law was used to find

  $$-p = \sigma_x = \frac{E}{1-2\nu}\,\varepsilon_{xx} < 0\,,\qquad
  W = \frac{1}{2}\,\frac{1-2\nu}{E}\,3\,\sigma_x^2 = \frac{1}{2}\,\frac{3\,E}{1-2\nu}\,\varepsilon_{xx}^2 .$$

  Comparing the two results for σx for small εxx leads to

  $$\frac{6}{D} = \frac{E}{1-2\nu} \quad\Longrightarrow\quad \frac{1}{D} = \frac{E}{6\,(1-2\nu)} = \frac{K}{2}\,,$$

where K is the bulk modulus, see page 360. This is consistent with the results generated using uniaxial
loading in Example 5–39. ♦


5–41 Example : Pure shearing with a Neo–Hookean energy density


Examine a pure shearing situation, i.e. a plane y = const is shifted in the x–direction. This is achieved with
a displacement ⃗u = (u1, u2, u3) where only u1 is nonzero and depends on y only. The displacement gradient
tensor is then given by

$$DU = \begin{bmatrix} 0 & \frac{\partial u_1}{\partial y} & 0 \\ 0 & 0 & 0 \\ 0 & 0 & 0 \end{bmatrix}$$

and thus the Cauchy–Green deformation tensor is

$$C = I + DU + DU^T + DU^T DU
= \begin{bmatrix} 1 & \frac{\partial u_1}{\partial y} & 0 \\[2pt]
\frac{\partial u_1}{\partial y} & 1 + \bigl(\frac{\partial u_1}{\partial y}\bigr)^2 & 0 \\[2pt]
0 & 0 & 1 \end{bmatrix} .$$

Example 5–32 (page 359) shows that this deformation minimizes the Hookean elastic energy density W for
a given εxy , i.e. this deformation is a consequence of Bernoulli’s principle.
The principal stretches λi are the square roots of the eigenvalues of the tensor C and thus with ∂u
∂y =
1

2 εxy find the invariants24


∂ u1 2
λ21 + λ22 + λ23 = trace(C) = 3 + ( ) = 3 + 4 ε2xy
∂y
λ21 · λ22 · λ23 = det(C) = 1

This shows that the shearing deformation does not compress the material, i.e. the volume is preserved. The
Neo–Hookean energy density is
I1 1
W = C10 ( 2/3
− 3) + (J − 1)2 = C10 4 ε2xy .
J D
This leads to the shearing stress
1 ∂W
τxy = = 4 C10 εxy .
2 ∂εxy
E
Comparing with the similar result based on Hooke’s law in Example 5–32 (τxy = 1+ν εxy ) leads to 4 C10 =
E 1 E 1
1+ν . For incompressible materials with ν = 2 find C10 = 4 (1+ 1 ) = 6 E. This is consistent with the
2
previous examples. These results coincide with the results for an incompressible Neo–Hookean energy
density W = C10 (I1 − 3), since J = det(C) = 1.
In Example 5–39 a uniaxial loading leads to the stress–strain curve
4 (1 + ν)2 (1 − 2 ν)2
 
∂W
σx = ≈ 2 C10 + εxx
∂εxx 3 D
24
The invariants could be computed using the eigenvalues µ1,2 , but this is not as elegant. Use α := ( ∂u
∂y
1 2
) = 4 ε2xy .
∂u1 2 ∂u1 2 ∂u1 2 ∂u1 2
0 = (1 − µ) (1 + ( ) − µ) − ( ) = µ2 − µ (2 + ( ) )+1−( ) = µ2 − µ (2 + α) + 1 − α
∂y ∂y ∂y ∂y
1  p  1  p 
µ1,2 = 2 + α ± (2 + α)2 − 4 (1 − α) = 2 + α ± 4 α + α2 + 4 α
2 2
α 1p
q q
= 1+ ± 8 α + α = 1 + 2 εxy ± 16 ε2xy + 16 ε4xy = 1 + 2 ε2xy ± 2 ε2xy + ε4xy
2 2
2 2

The principal stretches are λ1,2 = µ1,2 and λ3 = 1, leading to the invariants
q q
I1 = λ21 + λ22 + λ23 = 1 + 2 ε2xy + 2 ε2xy + ε4xy + 1 + 2 ε2xy − 2 ε2xy + ε4xy + 1 = 3 + 4 ε2xy
J = λ1 · λ2 · λ3 = (1 + 2 ε2xy )2 − 4 (ε2xy + ε4xy ) = 1 .


and the hydrostatic situation in Example 5–40 shows

   σx = (1/3) ∂W/∂εxx ≈ (6/D) εxx

and the above shearing situation leads to

   τxy = (1/2) ∂W/∂εxy ≈ 4 C10 εxy .

Thus the combination of the three results allows to determine C10, 1/D and Poisson's ratio ν. ♦
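
The invariants used in this example are easily checked numerically. The following Octave sketch (the names epsxy and DU are chosen here for illustration) constructs the displacement gradient for a pure shear with a given εxy, computes the Cauchy–Green tensor C and confirms trace(C) = 3 + 4 εxy² and det(C) = 1.

% numerical check of the invariants for pure shearing (illustrative sketch)
epsxy = 0.1;                      % chosen shear strain
dudy  = 2*epsxy;                  % since eps_xy = (du1/dy)/2
DU = [0 dudy 0; 0 0 0; 0 0 0];    % displacement gradient tensor
C  = eye(3) + DU + DU' + DU'*DU;  % Cauchy-Green deformation tensor
I1 = trace(C)                     % expect 3 + 4*epsxy^2
J2 = det(C)                       % expect 1, i.e. the volume is preserved
lambda = sqrt(eig(C))             % principal stretches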

5–42 Example : Neo–Hookean energy density for compressible materials, uniaxial loading
In [BoneWood08, §6.4] (or [Shab08, p. 146]) find an energy density for Neo–Hookean, compressible materials, given by

   W = µ/2 (I1 − 3) − µ ln J + λ/2 (ln J)²                          (5.21)

using two material constants (see Table 5.5 on page 360)

   shear modulus µ = E/(2 (1 + ν))   and   Lamé parameter λ = ν E/((1 + ν)(1 − 2 ν)) .

Using the FEM software COMSOL these two constants have to be given when using the Neo–Hookean material law. Now verify that this is consistent with Hooke's law for small deformations. For small, uniaxial deformations compute with λ1 = 1 + εxx and λ2 = λ3 = 1 − ν εxx, leading to

   I1 − 3 = λ1² + λ2² + λ3² − 3 = 2 (1 − 2 ν) εxx + (1 + 2 ν²) εxx²
   J = λ1 · λ2 · λ3 = (1 + εxx)(1 − ν εxx)² = 1 + (1 − 2 ν) εxx + (−2 ν + ν²) εxx² + ν² εxx³
   ln(1 + x) ≈ x − x²/2 for |x| ≪ 1                                  Taylor approximation
   ln J ≈ (1 − 2 ν) εxx + (−2 ν + ν²) εxx² − (1/2)(1 − 2 ν)² εxx² = (1 − 2 ν) εxx − (1/2)(1 + 2 ν²) εxx²
   (ln J)² ≈ (1 − 2 ν)² εxx² .

With these expressions use the energy density W in (5.21) and conclude

   W = µ/2 (I1 − 3) − µ ln J + λ/2 (ln J)²
     ≈ µ/2 ( 2 (1 − 2 ν) εxx + (1 + 2 ν²) εxx² ) − µ ( (1 − 2 ν) εxx − (1/2)(1 + 2 ν²) εxx² ) + λ/2 (1 − 2 ν)² εxx²
     = µ (1 + 2 ν²) εxx² + λ/2 (1 − 2 ν)² εxx² = ( E (1 + 2 ν²)/(2 (1 + ν)) + ν E (1 − 2 ν)/(2 (1 + ν)) ) εxx²
     = E ( (1 + 2 ν²) + ν − 2 ν² )/(2 (1 + ν)) εxx² = E/2 εxx² .

Thus the Neo–Hookean energy expression (5.21) is consistent with the usual energy density based on Hooke's law for linear materials.
To obtain the stress–strain curve in Figure 5.18 determine σx = ∂W/∂εxx. Using

   ∂(I1 − 3)/∂εxx = 2 (1 − 2 ν) + 2 (1 + 2 ν²) εxx
   ∂J/∂εxx = (1 − 2 ν) + 2 (−2 ν + ν²) εxx + 3 ν² εxx²
   ∂(ln J)/∂εxx = (1/J) ∂J/∂εxx = (1/J) ( (1 − 2 ν) + 2 (−2 ν + ν²) εxx + 3 ν² εxx² )
   ∂(ln J)²/∂εxx = (2 ln J/J) ∂J/∂εxx = (2 ln J/J) ( (1 − 2 ν) + 2 (−2 ν + ν²) εxx + 3 ν² εxx² )

obtain

   σx = ∂W/∂εxx = ∂/∂εxx ( µ/2 (I1 − 3 − 2 ln J) + λ/2 (ln J)² )
      = µ ( (1 − 2 ν) + (1 + 2 ν²) εxx − (1/J) ( (1 − 2 ν) + 2 (−2 ν + ν²) εxx + 3 ν² εxx² ) ) +
        + (λ ln J/J) ( (1 − 2 ν) + 2 (−2 ν + ν²) εxx + 3 ν² εxx² ) .

For the value ν = 0.3 Figure 5.18 shows the resulting stress–strain curves for the Neo–Hookean and Hooke's law. ♦

[Figure: stress/E as a function of strain for the linear Hooke law and the neo-Hookean law]

Figure 5.18: Stress–strain curve for Hooke's linear law and a compressible Neo–Hookean material with ν = 0.3 under uniaxial load
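
A curve as in Figure 5.18 can be regenerated with a few lines of Octave. The sketch below evaluates the above expression for σx, together with the linear law σx = E εxx; the normalization E = 1 and the strain range are chosen for illustration only.

% sketch: stress-strain curve for the compressible Neo-Hookean law (5.21)
E = 1; nu = 0.3;
mu = E/(2*(1+nu)); lambda = nu*E/((1+nu)*(1-2*nu));
epsx = linspace(-0.4,1,201);            % uniaxial strains
J   = (1+epsx).*(1-nu*epsx).^2;
dI1 = 2*(1-2*nu) + 2*(1+2*nu^2)*epsx;   % d(I1-3)/d(eps_xx)
dJ  = (1-2*nu) + 2*(-2*nu+nu^2)*epsx + 3*nu^2*epsx.^2;
sigma = mu/2*(dI1 - 2*dJ./J) + lambda*log(J).*dJ./J;
plot(epsx,sigma/E,epsx,epsx,'--')
xlabel('strain'); ylabel('stress/E'); legend('neo-Hookean','linear Hooke')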

5–43 Example : Neo–Hookean energy density for compressible materials, hydrostatic loading
For hydrostatic loading use λ1 = λ2 = λ3 = 1 + εxx . This leads to

   I1 − 3 = λ1² + λ2² + λ3² − 3 = 3 (1 + εxx)² − 3 = 6 εxx + 3 εxx²
   J = λ1 · λ2 · λ3 = (1 + εxx)³
   ln J = 3 ln(1 + εxx) ≈ 3 (εxx − εxx²/2)
   (ln J)² ≈ 9 εxx² (1 − εxx/2)² ≈ 9 εxx² (1 − εxx) .

Using the energy density find

   W = µ/2 (I1 − 3) − µ ln J + λ/2 (ln J)²
     ≈ µ/2 (6 εxx + 3 εxx²) − 3 µ (εxx − εxx²/2) + λ/2 9 εxx² (1 − εxx)
     ≈ ( 3 µ + (9/2) λ ) εxx² = ( 3 E/(2 (1 + ν)) + 9 ν E/(2 (1 + ν)(1 − 2 ν)) ) εxx²
     = ( 3 E (1 − 2 ν) + 9 ν E )/(2 (1 + ν)(1 − 2 ν)) εxx² = 3 E/(2 (1 − 2 ν)) εxx² .

This coincides with the energy density determined by Hooke’s linear law in Example 5–31 on page 358. ♦


5–44 Example : Neo–Hookean energy density for compressible materials, pure shearing
Based on Example 5–41 use
   I1 = λ1² + λ2² + λ3² = trace(C) = 3 + (∂u1/∂y)² = 3 + 4 εxy²
   J² = λ1² · λ2² · λ3² = det(C) = 1 ,  i.e.  J = 1
   W = µ/2 (I1 − 3) − µ ln J + λ/2 (ln J)² = µ/2 4 εxy² .

This leads to the shearing stress

   τxy = (1/2) ∂W/∂εxy = 2 µ εxy = E/(1 + ν) εxy .
For small deformations this is consistent with Hooke’s law. ♦

5.7.3 Ogden and Mooney–Rivlin Energy Density Models


Ogden energy density model for incompressible materials
For rubber it is common to use Ogden’s energy density to describe the elastic behavior. The general formula
is given by
   W = Σ_{i=1}^{N} (µi/αi) ( λ1^αi + λ2^αi + λ3^αi − 3 ) ,

where λi are the principal stretches, i.e. the square roots of the eigenvalues of the Cauchy–Green deformation tensor. Since in the above expression all eigenvalues are raised to the same power, the expression is invariant under coordinate rotations.
The simplest case for Ogden's energy density is for N = 1.

   W = µ/α ( λ1^α + λ2^α + λ3^α − 3 )   subject to   J = λ1 · λ2 · λ3 = 1          (5.22)

• Considering α = 2 we find

   W = µ/2 ( λ1² + λ2² + λ3² − 3 ) = µ/2 (I1 − 3) = C10 · (I1 − 3)

  i.e. the Neo–Hookean energy density (5.19) is a special case of Ogden's law.

• With N = 2, α1 = 2, α2 = −2 and λ1 · λ2 · λ3 = 1 find

   W = µ1/2 ( λ1² + λ2² + λ3² − 3 ) + µ2/2 ( λ1^(−2) + λ2^(−2) + λ3^(−2) − 3 )
     = µ1/2 ( λ1² + λ2² + λ3² − 3 ) + µ2/2 ( λ2² λ3² + λ1² λ3² + λ1² λ2² − 3 )
     = C10 · (I1 − 3) + C01 · (I2 − 3)

  which is the energy density for a Mooney–Rivlin material law. For the incompressible case one might use

   W = C10 · (Ī1 − 3) + C01 · (Ī2 − 3) = C10 · (I1 − 3) + C01 · (I2 − 3) ,

  where

   Ī2 = I2/J^(4/3) = ( λ2² λ3² + λ1² λ3² + λ1² λ2² )/(λ1 λ2 λ3)^(4/3)
      = λ1^(−4/3) λ2^(2/3) λ3^(2/3) + λ1^(2/3) λ2^(−4/3) λ3^(2/3) + λ1^(2/3) λ2^(2/3) λ3^(−4/3) .

• With C01 = 0 the Mooney–Rivlin energy density is equal to the Neo–Hookean energy density in
Example 5–38.


• Observe that minimizing the total energy, using the integral of the energy density W , does not respect
the incompressibility constraint J = λ1 · λ2 · λ3 = 1. This has to be taken care of in the design of the
algorithm.
• Using λ̄i = λi/J^(1/3) also find

   W = µ/α ( λ̄1^α + λ̄2^α + λ̄3^α − 3 )                                 (5.23)

  as definition for Ogden energies. The FEM code ABAQUS is using this setup. Since λ̄1 · λ̄2 · λ̄3 = 1
  it seems that the incompressibility constraint is built in. In fact (5.22) and (5.23) are equivalent in the
  case of incompressible materials.

5–45 Example : Mooney–Rivlin energy density for incompressible materials, uniaxial loading
For the Mooney–Rivlin energy density work with
W = C10 · (I¯1 − 3) + C01 · (I¯2 − 3)
= C10 · (λ̄21 + λ̄22 + λ̄23 − 3) + C01 · (λ̄22 λ̄23 + λ̄21 λ̄23 + λ̄21 λ̄22 − 3)
4/3 −2/3 −2/3 −2/3 4/3 −2/3 −2/3 −2/3 4/3
= C10 · (λ1 λ2 λ3 + λ1 λ2 λ3 + λ1 λ2 λ3 − 3) +
−4/3 2/3 2/3 2/3 −4/3 2/3 2/3 2/3 −4/3
+C01 · (λ1 λ2 λ3 + λ1 λ2 λ3 + λ1 λ2 λ3 − 3) .
−1/2
In the case of uniaxial, incompressible loading we use λ1 = 1 + εxx and λ2 = λ3 = λ1 , leading to
4/3 2/3 −2/3 −1/3 −4/3 4/3 2/3 −2/3
W = C10 · (λ1 λ1 + 2 λ1 λ1 − 3) + C01 · (λ1 λ2 + 2 λ1 λ2 − 3)
= C10 · (λ21+ 2 λ−1 −2
1 − 3) + C01 · (λ1 + 2 λ1 − 3)
= C10 · ((1 + εxx )2 + 2 (1 + εxx )−1 − 3) + C01 · ((1 + εxx )−2 + 2 (1 + εxx ) − 3)
≈ C10 · (2 εxx + ε2xx − 2 εxx + 2 ε2xx ) + C01 · (−2 εxx + 3 ε2xx + 2 εxx )
1
= C10 3 ε2xx + C01 3 ε2xx = 3 (C10 + C01 ) ε2xx = E ε2xx .
2
As a consequence obtain for small strains
1
C10 + C01 = E.
6
To determine the stress in x–direction examine
∂W
σx = = C10 · (2 (1 + εxx ) − 2 (1 + εxx )−2 ) + C01 · (−2 (1 + εxx )−3 + 2)
∂εxx
≈ C10 · (2 εxx + 4 εxx − 6 ε2xx + 8 ε3xx − 10 ε4xx )) +
+C01 · (6 εxx − 12 ε2xx + 20 ε3xx − 30 ε4xx )
= 6 (C10 + C01 ) εxx − (6 C10 + 12 C01 ) ε2xx +
+(8 C10 + 20 C01 ) ε3xx − (10 C10 + 30 C01 ) ε4xx .
Find the resulting stress–strain curves in Figure 5.19. ♦

5–46 Example : Mooney–Rivlin energy density for compressible materials, hydrostatic and uniaxial
loading
For the compressible Mooney–Rivlin energy density use
1
W = C10 · (I¯1 − 3) + C01 (I¯2 − 3) + (J − 1)2
D
1
= C10 · (λ̄21 + λ̄22 + λ̄23 − 3) + C01 · (λ̄22 λ̄23 + λ̄21 λ̄23 + λ̄21 λ̄22 − 3) +
(λ1 λ2 λ3 − 1)2
D
4/3 −2/3 −2/3 −2/3 4/3 −2/3 −2/3 −2/3 4/3
= C10 · (λ1 λ2 λ3 + λ1 λ2 λ3 + λ1 λ2 λ3 − 3) +
−4/3 2/3 2/3 2/3 −4/3 2/3 2/3 2/3 −4/3 1
+C01 · (λ1 λ2 λ3 + λ1 λ2 λ3 + λ1 λ2 λ3 − 3) + (λ1 λ2 λ3 − 1)2 .
D


[Figure: stress/E as a function of strain for the linear Hooke law and the three Mooney–Rivlin parameter sets]

Figure 5.19: Stress–strain curve for Hooke's linear law and three cases of incompressible Mooney–Rivlin with uniaxial loading: (1): C10 = E/6, C01 = 0, i.e. Neo–Hooke, (2): C10 = C01 = E/12, (3): C10 = 0, C01 = E/6.

• For the case of hydrostatic loading use λ̄i = λi = 1 + εxx and thus I¯1 = I¯2 = 3 and minimize
1
W (εxx ) = D (J − 1)2 . The setup is identical to the compressible Neo–Hookean case in Example 5–
40. This leads to
1 1
W = (J − 1)2 = + ε2xx (9 + 18 εxx + 15 ε2xx + 6 ε3xx + ε4xx )
D D
1 ∂W 1
6 εxx + 18 ε2xx + O(ε3xx )

σx = =
3 ∂εxx D
and
6 E 1 E 1
= =⇒ = = K.
D 1 − 2ν D 6 (1 − 2 ν) 2

• In the case of uniaxial, compressible loading use λ1 = 1 + εxx and λ2 = λ3 = 1 + εyy , leading to

4/3 −4/3 −2/3 +2/3 −4/3 4/3 2/3 −2/3 1


W = C10 · (λ1 λ2 + 2 λ1 λ2 − 3) + C01 · (λ1 λ2 + 2 λ1 λ2 − 3) + (λ1 λ22 − 1)2
D
= C10 · ((1 + εxx )4/3 (1 + εyy )−4/3 + 2 (1 + εxx )−2/3 (1 + εyy )2/3 − 3) +
+C01 · ((1 + εxx )−4/3 (1 + εyy )4/3 + 2 (1 + εxx )2/3 (1 + εyy )−2/3 − 3) +
1
+ ((1 + εxx ) (1 + εyy )2 − 1)2 .
D
For an uniaxial loading the value of εyy is determined as minimizer of W (εxx , εyy ) with respect to εyy .

∂W −4 4
0= = C10 · ( (1 + εxx )4/3 (1 + εyy )−7/3 + (1 + εxx )−2/3 (1 + εyy )−1/3 ) +
∂εyy 3 3
4 4
+C01 · ( (1 + εxx )−4/3 (1 + εyy )1/3 − (1 + εxx )2/3 (1 + εyy )−5/3 ) +
3 3
4 2
+ ((1 + εxx ) (1 + εyy ) − 1) (1 + εyy )
D
4 4 7 2 1
≈ C10 · (−1 − εxx + εyy + 1 − εxx − εyy ) +
3 3 3 3 3


4 4 1 2 5 4
− C01 · (1 − εxx + εyy − 1 − εxx + εyy ) + (εxx + 2 εyy )
3 3 3 3 3 D
8 8 4
= C10 · (−εxx + εyy ) + C01 · (−εxx + εyy ) + (εxx + 2 εyy )
3
 3  D 
4 8 4 8
= − (C10 + C01 ) εxx + 2 + (C10 + C01 ) εyy
D 3 D 3
4 8
− (C10 + C01 ) 3 − 2 D (C10 + C01 )
εyy = − D4 38 εxx = − εxx
2 D + 3 (C10 + C01 ) 6 + 2 D (C10 + C01 )
 
1 3 D (C10 + C01 )
= − − εxx = −ν εxx
2 6 + 2 D (C10 + C01 )
   
1 D 2 1 E 2
= − − (C10 + C01 ) + O(D ) εxx = − −D + O(D ) εxx
2 2 2 12
1
For the incompressible case D = 0 we obtain ν = 2 again.

For small strains use the energy density W and with the help of Mathematica one can generate a
Taylor approximation for W with respect to εxx , using Poisson’s law in the form εyy = − 12 (1 −
D (C10 + C01 )) εxx .
1
W (εxx ) ≈ (C10 + C01 ) (9 − 3 (C10 + C01 ) D + (C10 + C01 )2 D2 )) ε2xx +
3
1
+ −2 (C10 + 2 C01 ) + (3 C10 + C01 ) (C10 + C01 ) D+
2

1 2 2 1 3 2
+ (C10 + C01 ) (5 C10 + 3 C01 ) D − (C10 + C01 ) (11 C10 + 7 C01 ) D ε3xx
3 54
This leads to the relation to Young’s modulus
2
E = (C10 + C01 ) (9 − 3 (C10 + C01 ) D + (C10 + C01 )2 D2 ))
3
2
= 6 (C10 + C01 ) − 2 (C10 + C01 )2 D + (C10 + C01 )3 D2 )) . (5.24)
3
The normal stress σx is given by
∂W 2
σx = ≈ (C10 + C01 ) (9 − 3 (C10 + C01 ) D + (C10 + C01 )2 D2 )) εxx .
∂εxx 3
For the incompressible case D = 0 this simplifies to

W (εxx ) ≈ 3 (C10 + C01 ) ε2xx − 2 (C10 + 2 C01 ) ε3xx


∂W
σx = ≈ 6 (C10 + C01 ) εxx .
∂εxx
This is consistent with the property 6 (C10 + C01 ) = E with the uniaxial loading of incompressible
Mooney–Rivlin material in Example 5–45.
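
The value of εyy obtained above by a Taylor approximation can be cross checked numerically. The sketch below (the values of C10, C01 and D are chosen for illustration only) minimizes W(εxx, εyy) with respect to εyy for a small fixed εxx with fminbnd and compares the ratio −εyy/εxx with the leading order approximation ν ≈ 1/2 − D (C10 + C01)/2.

% sketch: numerical minimization of the compressible Mooney-Rivlin energy
C10 = 0.1; C01 = 0.05; D = 0.2;          % illustrative material constants
W = @(ex,ey) C10*((1+ex).^(4/3).*(1+ey).^(-4/3) + 2*(1+ex).^(-2/3).*(1+ey).^(2/3) - 3) ...
  + C01*((1+ex).^(-4/3).*(1+ey).^(4/3) + 2*(1+ex).^(2/3).*(1+ey).^(-2/3) - 3) ...
  + ((1+ex).*(1+ey).^2 - 1).^2/D;
exx = 1e-3;                              % small uniaxial strain
eyy = fminbnd(@(ey) W(exx,ey), -0.01, 0.01);
nu_numeric = -eyy/exx
nu_series  = 1/2 - D*(C10+C01)/2         % leading order approximation from above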

5–47 Example : Ogden energy density for incompressible materials, uniaxial loading
If the material is assumed to be incompressible use the energy density in equation (5.22) and J = λ1 · λ2 · λ3 = 1 to conclude λ2 = λ3 = λ1^(−1/2). With λ1 = 1 + εxx this leads to (for small values of εxx)

   W = µ/α ( λ1^α + λ2^α + λ3^α − 3 )   subject to   J = λ1 · λ2 · λ3 = 1
   W = µ/α ( λ1^α + 2 λ1^(−α/2) − 3 )
     = µ/α ( (1 + εxx)^α + 2 (1 + εxx)^(−α/2) − 3 )                  use a Taylor approximation
     ≈ µ/α ( 1 + α εxx + (α/2)(α − 1) εxx² + 2 − α εxx + (α/2)(α/2 + 1) εxx² − 3 )
     = µ/(2 α) ( α² − α + α²/2 + α ) εxx² = (3 µ α/4) εxx² .

Comparing this with the energy density based on Hooke's linear law W = 1/2 E εxx² conclude

   E = (3/2) µ α   or   µ = 2 E/(3 α) .

In the case of α = 2 (Neo–Hookean) find µ = E/3, which is consistent with C10 = E/6.

To examine the stress as function of strain use

   σx = ∂W/∂εxx = ∂/∂εxx ( µ/α ( (1 + εxx)^α + 2 (1 + εxx)^(−α/2) − 3 ) )
      = µ/α ( α (1 + εxx)^(α−1) − α (1 + εxx)^(−α/2−1) )
      = µ ( (1 + εxx)^(α−1) − (1 + εxx)^(−α/2−1) ) .

Find the stress-strain curve for α = 1.5, α = 2 and α = 3 in Figure 5.20, together with the linear law by
Hooke. ♦

[Figure: stress/E as a function of strain for the linear Hooke law and Ogden materials with α = 1.5, α = 2 (neo-Hookean) and α = 3]

Figure 5.20: Stress–strain curve for Hooke's linear law and an incompressible Ogden material under uniaxial load, for different values of α
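
The curves in Figure 5.20 follow directly from the last formula. A minimal Octave sketch (normalization and strain range chosen for illustration):

% sketch: incompressible Ogden material under uniaxial load, several alpha
E = 1; epsx = linspace(-0.2,1,201);
hold on
for alpha = [1.5 2 3]
  mu = 2*E/(3*alpha);                                   % from E = 3 mu alpha/2
  sigma = mu*((1+epsx).^(alpha-1) - (1+epsx).^(-alpha/2-1));
  plot(epsx,sigma/E)
end
plot(epsx,epsx,'--')                                    % linear law by Hooke
hold off; xlabel('strain'); ylabel('stress/E')
legend('\alpha = 1.5','\alpha = 2','\alpha = 3','linear Hooke')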

An attempt to apply the same procedure to a hydrostatic situation has to fail for incompressible materials.
   W = (3 µ/α) ( (1 + εxx)^α − 1 )        using 1 + εxx = λ1 = λ2 = λ3
   σx = (1/3) ∂W/∂εxx = (µ/α) α (1 + εxx)^(α−1) = µ (1 + εxx)^(α−1) ≈ µ (1 + (α − 1) εxx) .
Thus we would have a resulting force, without applying a stretch!


Ogden energy density model for compressible materials


Ogden’s energy density can be modified to take compressibility into account. The general form is
N N
X µi  α α α
 X 1
W = λ1 i + λ2 i + λ3 i −3 + (λ1 · λ2 · λ3 − 1)2i .
αi Di
i=1 i=1

The simplest case N = 1 leads to


µ α α α 1
W = (λ1 + λ2 + λ3 − 3) + (λ1 · λ2 · λ3 − 1)2 . (5.25)
α D
• In the special case of α = 2 find
µ 2 2 2 1 µ 1
W = (λ1 + λ2 + λ3 − 3) + (λ1 · λ2 · λ3 − 1)2 = (I 1 − 3) + (J − 1)2
2 D 2 D
i.e. the Neo–Hookean energy density (5.19) is a special case of Ogden’s law.

• Observe that minimizing the total energy, using the integral of the energy density W , takes the com-
pressibility into account by considering (J − 1)2 .

5–48 Example : Ogden energy density for compressible materials, hydrostatic loading
For hydrostatic loading use λ1 = λ2 = λ3 = 1 + εxx, leading to λ̄i = 1 and thus the first contribution in Ogden's energy density equals zero. The values of the material parameters µ and α have no influence on the result. This leads to

   W = 0 + (1/D) ( (1 + εxx)³ − 1 )² = (1/D) ( 3 εxx + 3 εxx² + εxx³ )² ≈ (9/D) εxx²
   ∂W/∂εxx = (2/D) ( (1 + εxx)³ − 1 ) ( 3 + 6 εxx + 3 εxx² )
           = (6/D) ( (1 + εxx)³ − 1 ) (1 + εxx)² ≈ (18/D) εxx .

Comparing with the computations using Hooke's law in Example 5–31 (page 358) conclude

   (1/2) E/(1 − 2 ν) = 3/D   or   D = 6 (1 − 2 ν)/E .

The stress–strain curves for Hooke's and Ogden's material are given by

   σHooke = E/(1 − 2 ν) εxx
   σOgden = (1/3) ∂W/∂εxx = (2/D) ( (1 + εxx)³ − 1 ) (1 + εxx)²
          = E/(3 (1 − 2 ν)) ( 3 εxx + 3 εxx² + εxx³ ) (1 + εxx)²
          = E/(1 − 2 ν) εxx ( 1 + εxx + εxx²/3 ) (1 + εxx)²
          = E/(1 − 2 ν) εxx ( 1 + 3 εxx + (10/3) εxx² + O(εxx³) ) .

Find the result in Figure 5.21; the computations were performed with a Poisson ratio ν = 0.3. ♦


[Figure: stress/E as a function of strain for the linear Hooke law and the Ogden material]

Figure 5.21: Stress–strain curve for Hooke's linear law and a compressible Ogden material under hydrostatic load with ν = 0.3.

5–49 Example : Ogden energy density for compressible materials, uniaxial loading
For uniaxial loading use λ1 = 1 + εxx and λ2 = λ3 = 1 + εyy, leading to

J = λ1 · λ2 · λ3 = (1 + εxx ) (1 + εyy )2

and thus
λ1 1 + εxx 1 + εyy
λ1 = = , λ2 = λ3 = .
J 1/3 (1 + εxx )1/3 (1 + εyy )2/3 (1 + εxx )1/3 (1 + εyy )2/3

The value of λ2 is determined by minimizing W (λ1 , λ2 ) with respect to λ2 .


µ α 1
W (λ1 , λ2 ) = (λ̄1 + λ̄α2 + λ̄α3 − 3) + (λ1 · λ2 · λ3 − 1)2
α D !
µ α
λ1 α
λ2 1 2
= α/3 2α/3
+ 2 α/3 2α/3 − 3 + λ1 · λ22 − 1
α λ λ λ1 λ2 D
1 2
µ 
2α/3 −2α/3 −α/3 α/3
 1 2
= λ1 λ2 + 2 λ1 λ2 − 3 + λ1 · λ22 − 1
α  D 
∂ µ 2 α 2α/3 −2α/3−1 2 α −α/3 α/3−1 4
λ1 · λ22 − 1 λ1 λ2

W (λ1 , λ2 ) = − λ1 λ2 + λ1 λ2 +
∂λ2 α 3 3 D
2 µ  2α/3 −2α/3−1 −α/3 α/3−1
 4
λ1 · λ22 − 1 λ1 λ2 = 0

= −λ1 λ2 + λ1 λ2 +
3 D
This equation has to be solved for λ2 = 1 + εyy as function of λ1 = 1 + εxx . Using Taylor approximations
for εxx ≈ 0 and εyy ≈ 0 find

∂ 2µ
0= W (εxx , εyy ) = (−(1 + εxx )2α/3 (1 + εyy )−2α/3−1 + (1 + εxx )−α/3 (1 + εyy )α/3−1 ) +
∂εyy 3
4
(1 + εxx ) · (1 + εyy )2 − 1 (1 + εxx ) (1 + εyy )

+
D
2µ 2α −2α − 3 α α−3
≈ (−(1 + εxx ) (1 + εyy ) + (1 − εxx ) (1 + εyy ) +
3 3 3 3 3
4
+ (εxx + 2 εyy ) (1 + εxx ) (1 + εyy )
D
2µ 2α 2α + 3 α α−3 4
≈ (− εxx + εyy − εxx + εyy ) + (εxx + 2 εyy )
3 3 3 3 3 D


   
2µα 4 2µα 8
= − + εxx + + εyy
3 D 3 D
−2 µ α D + 12 2 µ α D + 24
= εxx + εyy
3D 3D
12 − 2 µ α D 6 − µαD
εyy = − εxx = − εxx .
24 + 2 µ α D 12 + µ α D
For small, uniaxial deformations this leads to the Poisson ratio

   ν = −εyy/εxx = (6 − µ α D)/(12 + µ α D) = 1/2 − 3 µ α D/(2 (12 + µ α D)) .

For 0 < D ≪ 1 find ν ≈ 1/2, as expected for an almost incompressible material.

To determine the stress-strain curve examine the dependence of W on λ1 , resp. εxx . Since the energy
density W is minimized with respect to λ2 use
d W (λ1 , λ2 ) ∂ W (λ1 , λ2 ) ∂ W (λ1 , λ2 ) ∂ λ2 ∂ W (λ1 , λ2 )
= + = + 0.
dλ1 ∂λ1 ∂λ2 ∂λ1 ∂λ1
For small strains determine
 
∂ µ 2 α 2α/3−1 −2α/3 2 α −α/3−1 α/3 2
λ1 · λ22 − 1 λ22

W (λ1 , λ2 ) = + λ1 λ2 − λ1 λ2 +
∂λ1 α 3 3 D
∂ 2µ  
W (εxx , εyy ) = (1 + εxx )2α/3−1 (1 + εyy )−2α/3 − (1 + εxx )−α/3−1 (1 + εyy )α/3 +
∂εxx 3
2
(1 + εxx ) · (1 + εyy )2 − 1 (1 + εyy )2

+
D 
2µ 2α − 3 2α α+3 α
≈ (1 + εxx ) (1 − εyy ) − (1 − εxx ) (1 + εyy ) +
3 3 3 3 3
2
+ (εxx + 2 εyy ) (1 + 2 εyy )
D 
2 µ 2α − 3 2α α+3 α 2
≈ εxx − εyy + εxx − εyy + (εxx + 2 εyy )
3 3 3 3 3 D
   
2µα 2 −2 µ α 4
= + εxx + + εyy
3 D 3 D
2µαD + 6 2 µ α D − 12
= εxx − εyy .
3D 3D
Using the above Poisson ratio this leads to
∂ 2µαD + 6 2 µ α D − 12 6 − µ α D
W (εxx , εyy ) ≈ εxx + εxx
∂εxx 3D 3D 12 + µ α D
2 (6 − µ α D)2
 
2 (3 + µ α D) (12 + µ α D)
= − εxx
3 D (12 + µ α D) 3 D (12 + µ α D)
2 (3 + µ α D) (12 + µ α D) − 2 (6 − µ α D)2
= εxx
3 D (12 + µ α D)
(30 + 24) µ α D 18 µ α
= εxx = εxx .
3 D (12 + µ α D) 12 + µ α D
As a consequence we find for small strains εxx
18 µ α
E≈ .
12 + µ α D
3
In case of an incompressible material this simplifies to E = 2 µ α, which is consistent with Example 5–47.


5.7.4 References used in the Above Section


For the above examples I used a few references:
• On Wikipedia find a very informative page on Neo–Hookean material laws at
https://en.wikipedia.org/wiki/Neo-Hookean solid .
• The book by G. Holzapfel [Holz00, §6.5] for a few formulations of strain based energies.
• In [Bowe10, §3.5.5] and the corresponding web page solidmechanics.org a nonlinear energy density
is given by
   W = C10 (Ī1 − 3) + (1/D) (J − 1)² = C10 ( I1/J^(2/3) − 3 ) + (1/D) (J − 1)²

  with

   C10 = E/(4 (1 + ν))   and   1/D = E/(6 (1 − 2 ν)) .

  For ν = 0.5 this leads to C10 = E/6 and D = 0.
• In [Ogde13, p 221] a few models are examined.

W = C10 (I1 − 3) + C01 (I2 − 3) Mooney–Rivlin


W = C10 (I1 − 3) Neo–Hookean

In [Ogde13, §4.4, p. 222] the two Neo–Hookean energies for compressible and incompressible mate-
rials are discussed.
   W = C10 (I1 − 3)   and   W = µ/2 (I1 − 3 − 2 ln J) + λ/2 (ln J)²
• [Hack15, (4.6), p.20] Mooney–Rivlin

W = C10 (I1 − 3) + C01 (I2 − 3) .


  For small strains 2 (C10 + C01) equals the shear modulus G = E/(2 (1 + ν)) and thus C10 + C01 = E/6.
• [Hack15, (4.7)] Neo–Hookean
W = C10 (I1 − 3) .
  For small strains 2 C10 is equal to the shear modulus G = E/(2 (1 + ν)) and thus C10 = E/6.
• [Hack15, p. 21] decoupling deviatoric and volumetric response.

   W = C10 ( I1/J^(2/3) − 3 ) + C01 ( I2/J^(2/3) − 3 ) + D (J − 1)²

  There might be a typo in [Hack15, (4.11a)], since it says Ī1 = I1/J^(1/3) instead of Ī1 = I1/J^(2/3). But

   λ̄i = λi/J^(1/3)
   Ī1 = I1/J^(2/3) = ( λ1² + λ2² + λ3² )/J^(2/3) = λ̄1² + λ̄2² + λ̄3²
• Use [Oden71, p. 315] for graphs and p.222ff for Neo–Hookean.
• The COMSOL manual StructuralMechanicsModuleUsersGuide.pdf for different mate-
rial models, using the above invariants.
• The ABAQUS theory manual contains useful information, e.g. Section 4.6.1 on hyperelastic material
behavior


5.8 Plane Strain


5.8.1 Description of Plane Strain and Plane Stress
If a three dimensional problem can be reduced to two dimensions, then the computational effort can be
reduced considerably and the visualization is simplified. For 3D elasticity problems we have to simplify
the situation such that only two independent variables x and y come into play. There are two important
setups leading to this situation: plane strain and plane stress. In both cases a solid with a constant cross
section Ω (parallel to the xy plane) is considered and horizontal forces are applied to the solid. If the solid
is long (in the z direction) and fixed in z direction on both ends we have the situation of plane strain, i.e. no
deformations in z direction. There might be forces in z direction. If the solid is thin and we have no forces
in z direction we have a plane stress situation. In a concrete situation the user has to decide if one of the
two simplifications is applicable. The facts are listed and illustrated in Table 5.8 and Figure 5.22.

                      plane strain                                plane stress

  assumptions         strains in xy plane only                    stress in xy plane only
                      εzz = εxz = εyz = 0                         σz = τxz = τyz = 0
  free expressions    εxx, εyy, εxy                               σx, σy, τxy
  consequences        τxz = τyz = 0                               εxz = εyz = 0
                      σz = E ν/((1+ν)(1−2ν)) (εxx + εyy)          εzz = −ν/E (σx + σy)

                   Table 5.8: Plane strain and plane stress situation

 
[Figure: two sketches of a solid with constant cross section; all external forces act horizontally. Left: plane strain, the ends are fixed, no deformations in z (εzz = εxz = εyz = 0, τxz = τyz = 0, but σz ≠ 0). Right: plane stress, no forces in z (σz = τxz = τyz = 0, εxz = εyz = 0, but εzz ≠ 0).]

Figure 5.22: Plane strain and plane stress situation

Consider a situation where the z component of the displacement vector is a constant

u3 independent on x, y and z

and
u1 = u1 (x, y) , u2 = u2 (x, y) , independent on z .
This leads to vanishing strains in z direction

εzz = εxz = εyz = 0


and thus this is called a plane strain situation. It can be realized by a long solid in the direction of the z axis
with constant cross section and a force distribution parallel to the xy plane, independent of z. The two ends
are to be fixed in z direction. Due to Saint–Venant's principle (see e.g. [Sout73, §5.6]) the boundary effects
at the two far ends can safely be ignored. Another example is the expansion of a blood vessel, embedded in
body tissue. The pulsating blood pressure will stretch the walls of the vessel, but there is no movement of
the wall in the direction of the blood flow.

Hooke’s law in the form (5.14) (page 349) implies


     
E
σx 1−ν ν ν εxx τxy = (1+ν) εxy
 
 σy  = E    
 ·  εyy  E
  (1 + ν) (1 − 2 ν) 
 ν 1−ν ν    and τxz = (1+ν) 0
E
σz ν ν 1−ν 0 τyz = (1+ν) 0

or equivalently
     
σx 1−ν ν 0 εxx
 
 σy  = E    
 ·  εyy 


 (1 + ν) (1 − 2 ν)  ν 1−ν 0   
τxy 0 0 1 − 2ν εxy . (5.26)

E ν (εxx + εyy )
σz = , τxz = τyz = 0
(1 + ν) (1 − 2 ν)

The energy density can be found by equation (5.15) as

   W = 1/2 · E/((1+ν)(1−2ν)) · ⟨ [ 1−ν  ν  0 ;  ν  1−ν  0 ;  0  0  2(1−2ν) ] · (εxx, εyy, εxy)ᵀ , (εxx, εyy, εxy)ᵀ ⟩ .        (5.27)

As unknown functions examine the two components of the displacement vector ~u = (u1 , u2 )T , as functions
of x and y. The components of the strain can be computed as derivatives of ~u. Thus if ~u is known, all other
expressions can be computed.
If the volume and surface forces are parallel to the xy plane and independent on z, then the corresponding
energy contributions25 can be written as integrals over the domain Ω ⊂ R2 , resp. the boundary ∂Ω. Obtain
the total energy as a functional of the yet unknown function ~u.

U (~u) = Uelast + UV ol + USurf (5.28)


     
1−ν ν 0 εxx εxx
1 E
ZZ
     
= h ν 1−ν 0  ·  εyy  ,  εyy i dx dy −
2 (1 + ν) (1 − 2 ν)      
Ω 0 0 2 (1 − 2 ν) εxy εxy
ZZ I
− f~ · ~u dx dy − ~g · ~u ds
∂Ω

As in many other situations use again the Bernoulli principle to find the solution of the plane strain elasticity
problem.

25
Observe that we quietly switch from a domain in Ω × [0, H] ⊂ R3 to the planar domain Ω ⊂ R2 . The ‘energy’ U actually
denotes the ‘energy divided by height H’.


5–50 Example : Consider a horizontal, rectangular plate, trapped in z direction between two very hard
surfaces. Then compress the plate in x direction (εxx < 0) by a force F applied to its right edge. Assume
that the normal strain εxx in x-direction is known and then determine the other expressions. Since it is a
plane strain situation use u3 = εzz = εxz = εyz = 0. The similar plane stress problem will be examined in
Example 5–51. Assume that all strains are constant. Now εyy and εxy can be determined by minimizing the
energy density. From equation (5.27) obtain
   W = 1/2 · E/((1+ν)(1−2ν)) · ( (1−ν) εxx² + 2 ν εxx εyy + (1−ν) εyy² + 2 (1−2ν) εxy² ) .
As a necessary condition for a minimum the partial derivatives with respect to εyy and εxy have to vanish.
This leads to
+2 ν εxx + 2 (1 − ν) εyy = 0 and εxy = 0 .
This leads to a modified Poisson's ratio ν* for the plane strain situation:

   εyy = − ν/(1−ν) εxx = −ν* εxx .
Since ν > 0 we conclude ν ∗ > ν, i.e. the plate will expand (εyy > 0) more in y direction than a free plate.
This is caused by the plate not being allowed to expand in z direction, i.e. εzz = 0 .
The energy density in this situation is given by

   W = 1/2 · E/((1+ν)(1−2ν)) · ( (1−ν) εxx² − 2 ν²/(1−ν) εxx² + ν² (1−ν)/(1−ν)² εxx² )
     = 1/2 · E/((1+ν)(1−2ν)) · ( (1 − 2ν + ν²)/(1−ν) − 2 ν²/(1−ν) + ν²/(1−ν) ) εxx²
     = 1/2 · E/(1−ν²) εxx² .
By comparing this with the situation of a simple stretched shaft (Example 5–30, page 356) find a modified modulus of elasticity

   E* = 1/(1−ν²) E

and the pressure required to compress the plate is given by using σx = ∂W/∂εxx ²⁶

   F/A = E* ΔL/L = E* εxx .

The fixation of the plate in z direction (plane strain) prevents the plate from showing the Poisson contraction (resp. expansion), when compressed in x direction. Thus more force is required to compress it in x direction. This information is given by E* = 1/(1−ν²) E > E. ♦

Similarly modified constants ν* and E* are used in [Sout73, p. 87] to formulate the partial differential equations governing this situation. It is a matter of elementary algebra to verify that

   (σx, σy, τxy)ᵀ = E/((1+ν)(1−2ν)) · [ 1−ν  ν  0 ;  ν  1−ν  0 ;  0  0  1−2ν ] · (εxx, εyy, εxy)ᵀ
                  = E*/(1−(ν*)²) · [ 1  ν*  0 ;  ν*  1  0 ;  0  0  1−ν* ] · (εxx, εyy, εxy)ᵀ .
²⁶ One may also use Hooke's law for a plane strain setup to conclude

   σx = E/((1+ν)(1−2ν)) ( (1−ν) εxx + ν εyy ) = E/((1+ν)(1−2ν)) ( (1−ν) − ν²/(1−ν) ) εxx
      = E/((1+ν)(1−2ν)) · ( (1−ν)² − ν² )/(1−ν) εxx = E/(1−ν²) εxx .


   
This form of Hooke's law for a plane strain situation coincides with the plane stress situation in equation (5.30) on page 393, but using E* and ν* instead of the usual E and ν.
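
The claimed identity is easy to confirm numerically. The following sketch (the value of ν is arbitrary) builds the plane strain matrix of (5.26) and the plane stress form with the starred constants E* and ν* and compares them.

% sketch: plane strain Hooke matrix equals the plane stress form with E*, nu*
E = 1; nu = 0.3;
H_strain  = E/((1+nu)*(1-2*nu))*[1-nu nu 0; nu 1-nu 0; 0 0 1-2*nu];
nus = nu/(1-nu); Es = E/(1-nu^2);        % modified constants nu* and E*
H_starred = Es/(1-nus^2)*[1 nus 0; nus 1 0; 0 0 1-nus];
norm(H_strain - H_starred)               % expect (numerically) zero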

5.8.2 From the Minimization Formulation to a System of PDE’s


Using the Bernoulli principle the displacement vector ~u has to minimize to total energy of the system, given
by
U (~u) = Uelast + UV ol + USurf
     
1−ν ν 0 εxx εxx
1 E
ZZ
     
= h ν 1−ν 0  ·  εyy  ,  εyy i dx dy −
2 (1 + ν) (1 − 2 ν)      
Ω 0 0 2 (1 − 2 ν) εxy εxy
ZZ I
− f~ · ~u dx dy − ~g · ~u ds .
∂Ω

This can be used to derive a system of partial differential equations that are solved by the actual displacement
function. Use the abbreviation

   k = 1/2 · E/((1+ν)(1−2ν))
to find the main expression for the elastic energy given by
     
ZZ εxx 1−ν ν 0 εxx
     
Uelast = h
 εyy  , k  ν
  1−ν 0  ·  εyy i dx dy
  
Ω εxy 0 0 2 (1 − 2 ν) εxy
     
∂ u1
ZZ ∂x 1−ν ν 0 εxx
∂ u2
     
= h
  ∂y
,k ν 1 − ν 0  ·  εyy i dx dy
     
1 ∂ u1 ∂ u2

2 ∂y + ∂x 0 0 2 (1 − 2 ν) ε xy
 
∂ u1
! " # εxx
1 − ν ν 0
ZZ
∂x
 
= h ∂u ,k ·  εyy 

i dx dy
∂y
1
0 0 1 − 2ν
Ω εxy
 
∂ u2
! " # ε xx
0 0 1 − 2ν
ZZ
h ∂∂x
 
+ u2
, k ·  εyy i dx dy .
∂y ν 1−ν 0  
Ω εxy
Using the divergence theorem (Section 5.2.3, page 313) on the two integrals find
  
" # εxx
1−ν ν 0
ZZ
  
Uelast = − u1 div  k ·  εyy  dx dy
 0 0 1 − 2ν  
Ω εxy
 
" # εxx
1−ν ν 0
I
 
+ u1 h~n , k · εyy i ds
∂Ω 0 0 1 − 2ν 
εxy


  
" # εxx
0 0 1 − 2ν
ZZ
  
− u2 div 
k ·  dx dy
εyy 
ν 1−ν 0 
Ω εxy
 
" # ε xx
0 0 1 − 2ν
I
 
+ u2 h~n , k · ε yy
i ds .
∂Ω ν 1−ν 0  
εxy

Using a calculus of variations argument with perturbations of φ1 of u1 vanishing on the boundary conclude27
  
" # εxx
 E 1−ν ν 0  
div  (1 + ν) (1 − 2 ν) ·  εyy  = −f1
0 0 1 − 2ν  
εxy
  
E (1 − ν) ∂∂x u1
+ ν ∂∂yu2
div  
1−2 ν

∂ u1 ∂ u2
  = −f1
(1 + ν) (1 − 2 ν) 2 ∂y + ∂x

and similarly, using perturbations of u2 , leads to


    
1−2 ν ∂ u1 ∂ u2
E +
div   2 ∂y ∂x  = −f2 .
(1 + ν) (1 − 2 ν) ∂ u1
ν ∂x + (1 − ν) ∂∂y u2

This is a system of second order partial differential equations (PDE) for the unknown displacement vector
function ~u. If the coefficients E and ν are constant one can juggle with these equations and arrive at different
formulations. The first equation may be rewritten in the form

2 (1 − ν) ∂ 2 u1 ∂ 2 u2 ∂ 2 u1 ∂ 2 u2
 
E 2ν
+ + + = −f1
2 (1 + ν) 1 − 2 ν ∂x2 1 − 2 ν ∂y ∂x ∂y 2 ∂x ∂y
1 + (1 − 2 ν) ∂ 2 u1 ∂ 2 u1 ∂ 2 u2
 
E 1
+ + = −f1
2 (1 + ν) 1 − 2ν ∂x2 ∂y 2 1 − 2 ν ∂y ∂x
 2
∂ u1 ∂ 2 u1
 
E 1 ∂ ∂ u1 ∂ u2
+ + + = −f1 .
2 (1 + ν) ∂x2 ∂y 2 1 − 2 ν ∂x ∂x ∂y

By rewriting the second differential equation in a similar fashion we arrive at a formulation given in [Sout73,
p. 87].
  
E 1 ∂ ∂ u 1 ∂ u2
∆ u1 + + = −f1
2 (1 + ν) 1 − 2 ν ∂x ∂x ∂y
  
E 1 ∂ ∂ u 1 ∂ u2
∆ u2 + + = −f2
2 (1 + ν) 1 − 2 ν ∂y ∂x ∂y
27
There is a minor gap in the argument: we only take variations of u1 into account while the resulting variations on εxx , εyy and
εxy are ignored. Thus use
hu + φ , Au − f i minimal =⇒ Au − f = 0 .
The preceding calculation examines an expression hu, Aui for an accordingly defined scalar product. For a symmetric matrix A
and a perturbation u + φ of u we should actually examine

hu + φ , A (u + φ) − f i = hu , A u − f i + hφ , A u − f i + hu , A φi + hφ , A φi
≈ hu , A u − f i + hφ , 2 A u − f i .

If this expression is minimized at u then we conclude 2 Au − f = 0. The only difference to the first approach is a factor of 2,
which is taken into account for the resulting differential equations in the main text.


With the usual definitions of the operators ∇ and ∆ this can be written in the dense form

   E/(2 (1+ν)) ( ∆ ~u + 1/(1−2ν) ∇ (∇ · ~u) ) = −f~ .          (5.29)
2 (1 + ν) 1 − 2ν
This author is convinced however that the above formulation as a system of PDE’s is considerably less
efficient than the formulation as a minimization problem of the energy given by equation (5.28).

5.8.3 Boundary Conditions


There are different types of useful boundary conditions. Only the most important situations are presented in
these notes.

Prescribed displacement
If on a section Γ1 of the boundary ∂Ω the displacement vector ~u is known, use this as a boundary condition
on the section Γ1 . Thus find Dirichlet conditions on this section of the boundary.

Given boundary forces, no constraints


If on a section Γ2 of the boundary ∂Ω the displacement ~u is free, then use calculus of variations again and
examine all contributions of the integral over the boundary section Γ2 in the total energy Uelast + UV ol +
USurf .
 
" # εxx
1−ν ν 0
Z Z
 
. . . ds = u1 h~n , k · ε yy
i ds
Γ2 Γ2 0 0 1 − 2ν  
εxy
 
" # εxx
0 0 1 − 2ν
Z Z
 
+ u2 h~n , k ·  εyy i ds −
  ~g · ~u ds
Γ2 ν 1−ν 0 Γ2
εxy
!
(1 − ν) εxx + ν εyy
Z Z
= u1 h~n , k · i ds − g1 u1 ds
Γ2 (1 − 2 ν) εxy Γ2
!
(1 − 2 ν) εxy
Z Z
+ u2 h~n , k · i ds − g2 u2 ds
Γ2 ν εxx + (1 − ν) εyy Γ2

~ this leads to two boundary conditions.


Using the perturbations ~u → ~u + φ
!
E (1 − ν) εxx + ν εyy
h~n , i = g1
(1 + ν) (1 − 2 ν) (1 − 2 ν) εxy
!
E (1 − 2 ν) εxy
h~n , i = g2
(1 + ν) (1 − 2 ν) ν εxx + (1 − ν) εyy
Using Hooke’s law (equation (5.14), page 349) these conditions are expressed in terms of stresses. This
leads to
!
σx
h~n , i = nx σx + ny τxy = g1
τxy
!
τxy
h~n , i = ny σy + nx τxy = g2 .
σy


This allows a verification of the equations by comparing with (5.9) (page 338), i.e. find
" # ! " # ! !
σx τxy cos φ σx τxy nx g1
= = .
τxy σy sin φ τxy σy ny g2
At the surface the stresses have to coincide with the externally applied stresses. The above boundary condi-
tions can be written in terms of the unknown displacement vector ~u, leading to
    
E ∂ u1 ∂ u2 1 − 2 ν ∂ u1 ∂ u2
nx (1 − ν) +ν + ny + = g1
(1 + ν) (1 − 2 ν) ∂x ∂y 2 ∂y ∂x
    
E ∂ u2 ∂ u1 1 − 2 ν ∂ u1 ∂ u2
ny (1 − ν) +ν + nx + = g2 .
(1 + ν) (1 − 2 ν) ∂y ∂x 2 ∂y ∂x

5.9 Plane Stress


Consider the situation of a thin (thickness h) plate in the plane Ω ⊂ R2 . There are no external stresses on the
top and bottom surface and no vertical forces within the plate. Thus assume that σz = 0 within the plate and
τxz = τyz = 0, i.e all stress components in z direction vanish. Thus this is called a plane stress situation.
σz = τxz = τyz = 0
Hooke’s law in the form (5.12) (page 349) implies
     
   (εxx, εyy, εzz, εxy, εxz, εyz)ᵀ = 1/E · [  1  −ν  −ν   0    0    0 ;
                                             −ν   1  −ν   0    0    0 ;
                                             −ν  −ν   1   0    0    0 ;
                                              0   0   0  1+ν   0    0 ;
                                              0   0   0   0   1+ν   0 ;
                                              0   0   0   0    0   1+ν ] · (σx, σy, 0, τxy, 0, 0)ᵀ

or by eliminating vanishing terms

   (εxx, εyy, εxy)ᵀ = 1/E · [ 1  −ν  0 ;  −ν  1  0 ;  0  0  1+ν ] · (σx, σy, τxy)ᵀ
   and   εzz = −ν/E (σx + σy) ,   εxz = εyz = 0 .

This matrix can be inverted, leading to

   (σx, σy, τxy)ᵀ = E/(1−ν²) · [ 1  ν  0 ;  ν  1  0 ;  0  0  1−ν ] · (εxx, εyy, εxy)ᵀ          (5.30)
   and   εzz = −ν/(1−ν) (εxx + εyy) ,   εxz = εyz = 0 .
The energy density is given by equation (5.15) or (5.16).
           
σx εxx τxy εxy σx εxx
1   ,  εyy i + h 0  ,  εxz i = 1
          
W = h σ y h σ y
 ,  εyy i
2         2    
0 εzz 0 εyz 2 τxy εxy
     
1 ν 0 εxx εxx
E      
= h ν 1 0   εyy  ,  εyy i
2 (1 − ν 2 )      
0 0 2 (1 − ν) εxy εxy
E
ε2xx + ε2yy + 2 ν εxx εyy + 2 (1 − ν) ε2xy

= 2
2 (1 − ν )


Now express the total energy of a plane stress problem as function of the strains.

U (~u) = Uelast + UV ol + USurf (5.31)


     
1 ν 0 εxx εxx
1 E
ZZ
     
= h ν 1
 0  ·  εyy  ,  εyy i dx dy −
2 (1 − ν 2 )      
Ω 0 0 2 (1 − ν) εxy εxy
ZZ I
− ~
f · ~u dx dy − ~g · ~u ds
∂Ω

As in many other situations use again the Bernoulli principle to find the solution of the plane stress elasticity
problem.

5.9.1 From the Plane Stress Matrix to the Full Stress Matrix
For a plane stress problem the reduced stress matrix is a 2 × 2 matrix, while the full stress matrix has to be
3×3:

   plane stress  →  [ σx  τxy ;  τxy  σy ] ,      3D  →  [ σx  τxy  0 ;  τxy  σy  0 ;  0  0  0 ] .

To compute the principal stresses σi determine the eigenvalues of this matrix, i.e. solve

   0 = det [ σx−λ  τxy  0 ;  τxy  σy−λ  0 ;  0  0  −λ ] = −λ det [ σx−λ  τxy ;  τxy  σy−λ ]
     = −λ ( λ² − λ (σx + σy) + σx σy − τxy² ) = −λ ( (σx − λ)(σy − λ) − τxy² ) .

This leads to

   σ1,2 = (σx + σy)/2 ± 1/2 √( (σx + σy)² − 4 (σx σy − τxy²) ) = (σx + σy)/2 ± √( ((σx − σy)/2)² + τxy² )
   σ3 = 0 .

The above principal stresses may be used to determine the von Mises and the Tresca stress. The solution of
the quadratic equation can be visualized using Mohr’s circle, see Result 5–22 on page 341.
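
The formula for σ1,2 can be verified against a direct eigenvalue computation, e.g. with the following short Octave sketch (the stress values are arbitrary illustrative numbers).

% sketch: principal stresses for a plane stress state
sx = 1.2; sy = -0.4; txy = 0.5;          % illustrative stress components
S = [sx txy 0; txy sy 0; 0 0 0];         % full 3x3 stress matrix
sort(eig(S))                             % contains sigma_3 = 0 and sigma_1,2
s12 = (sx+sy)/2 + [1 -1]*sqrt(((sx-sy)/2)^2 + txy^2)   % formula above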

5–51 Example : Consider a horizontal, rectangular plate, compressed in x direction by a force F applied
to its right edge. Assume that the normal strain εxx < 0 in x-direction is known and then determine the
other expressions. Use a plane stress model, i.e. σz = τxz = τyz = 0. The similar plane strain problem was
examined in Example 5–50. Assume that all strains are constant. Now εyy and εxy can be determined by
minimizing the energy density. For the energy density obtain
E
ε2xx + ε2yy + 2 ν εxx εyy + 2 (1 − ν) ε2xy .

W =
2 (1 − ν 2 )

As a necessary condition for a minimum the partial derivatives with respect to εyy and εxy have to vanish.
This leads to
+2 εyy + 2 ν εxx = 0 and εxy = 0 .
This leads to the standard Poisson's ratio ν for the plane stress situation, i.e. εyy = −ν εxx .


The energy density is then given by

   W = 1/2 · E/(1−ν²) · ( εxx² + εyy² + 2 ν εxx εyy + 2 (1−ν) εxy² )
     = 1/2 · E/(1−ν²) · ( εxx² + ν² εxx² − 2 ν² εxx² + 0 )
     = 1/2 · E/(1−ν²) · ( 1 + ν² − 2 ν² ) εxx² = 1/2 · E/(1−ν²) · ( 1 − ν² ) εxx²
     = 1/2 E εxx² .

By comparing this situation with the situation of a simple stretched shaft (Example 5–30, page 356) find the standard modulus of elasticity E. ♦

5.9.2 From the Minimization Formulation to a System of PDE’s


The energy density can be found by equation (5.15) as
     
1 ν 0 εxx εxx
1 E      
W = h ν 1 0  ·  εyy  ,  εyy i . (5.32)
2 1 − ν2      
0 0 2 (1 − ν) εxy εxy

This equation is very similar to the expression for a plane strain situation in equation (5.28) (page 388).
The only difference is in the coefficients. As a starting point for a finite element solution of a plane stress
problem use the Bernoulli principle and minimize the energy functional

U (~u) = Uelast + UV ol + USurf (5.33)


     
1 ν 0 εxx εxx
1 E
ZZ
     
= h ν 1
 0  ·  εyy  ,  εyy i dx dy −
2 1 − ν2      
Ω 0 0 2 (1 − ν) εxy εxy
ZZ I
− ~
f · ~u dx dy − ~g · ~u ds .
∂Ω

Using the divergence theorem rewrite the elastic energy as


     
∂ u1
∂x 1 ν 0 εxx
1 E
ZZ
     
Uelast = h ∂ u2 , ν 1  ·  εyy i dx dy
2

∂y 0
2 1−ν      
1 ∂ u ∂ u
2 ( ∂y + ∂x ) 0 0 2 (1 − ν) εxy
Ω 1 2

 
∂ u1
! " # εxx
E 1 ν 0
ZZ
∂x
 
= h , ·  εyy i dx dy
2 (1 − ν 2 ) ∂ u1
0 0 1−ν  
∂y
Ω εxy
 
∂ u2
! " # ε xx
E 0 0 1−ν
ZZ
∂x
 
+ h , ·  εyy i dx dy
2 (1 − ν 2 ) ∂ u2
ν 1 0  
∂y
Ω εxy
  
" # εxx
E  1 ν 0
ZZ
 
= − 2
u1 div  · εyy  dx dy

2 (1 − ν )  0 0 1−ν 
Ω εxy


 
" # εxx
E 1 ν 0
I
 
+ u1 h~n , · εyy  i ds
2 (1 − ν 2 ) ∂Ω 0 0 1−ν 
εxy
  
" # εxx
E 0 0 1−ν
ZZ
  
− u2 div  · ε yy
 dx dy
2 (1 − ν 2 )  ν 1 0  
Ω εxy
 
" # εxx
E 0 0 1−ν
I
 
+ u2 h~n , · ε yy
i ds .
2 (1 − ν 2 ) ∂Ω ν 1 0  
εxy

Reconsider the calculations for the plane strain situation and make a few minor changes to adapt the
results to the above plane stress situation, arriving at the system of partial differential equations.
  
∂ u1 ∂ u2
E ∂x + ν ∂y
div  2

1−ν

∂ u1 ∂ u2
  = −f1 (5.34)
1−ν 2 ∂y + ∂x
    
1−ν ∂ u1 ∂ u2
E  2 ∂y + ∂x
div  2
 = −f2 (5.35)
1−ν ∂ u
ν 1 + ∂ u2∂x ∂y

If E and ν are constant, i.e. a homogeneous material, we can use elementary, tedious operations to find
 2
∂ u1 ∂ 2 u1 1 + ν ∂
 
E ∂ u1 ∂ u 2
+ + + = −f1
2 (1 + ν) ∂x2 ∂y 2 1 − ν ∂x ∂x ∂y
 2
∂ u2 ∂ 2 u2 1 + ν ∂
 
E ∂ u1 ∂ u 2
2
+ 2
+ + = −f2
2 (1 + ν) ∂x ∂y 1 − ν ∂y ∂x ∂y

or with a shorter notation

   E/(2 (1+ν)) ( ∆~u + (1+ν)/(1−ν) ∇ (∇ · ~u) ) = −f~ .          (5.36)

This has a structure similar to the equations (5.29) for the plane strain situation. If we set

   ν* = ν/(1−ν)

then we find

   (1 + ν*)/(1 − ν*) = ( 1 + ν/(1−ν) )/( 1 − ν/(1−ν) ) = 1/(1−2ν) .

Thus the plane strain equations (5.29) take the form

   E/(2 (1+ν)) ( ∆~u + (1+ν*)/(1−ν*) ∇ (∇ · ~u) ) = −f~

and are very similar to the plane stress equations (5.36).

5.9.3 Boundary Conditions


Again only two types of boundary conditions are presented:

• On a section Γ1 of the boundary we assume that the displacement vector ~u is known and thus we find
Dirichlet boundary conditions.


• On the section Γ2 the displacement ~u is not submitted to constraints, but we apply an external force
~g . Again we use a calculus of variations argument to find the resulting boundary conditions.
The contributions of the integral over the boundary section Γ2 in the total energy Uelast + UV ol + USurf are
given by
 
" # ε xx
E 1 ν 0
Z Z
 
. . . ds = u1 h~n , ·  ε yy
i ds
Γ2 2 (1 − ν 2 ) Γ2 0 0 1−ν  
εxy
 
" # εxx
E 0 0 1−ν
Z Z
 
+ u2 h~n , ·  εyy i ds −
  ~g · ~u ds
2 (1 − ν 2 ) Γ2 ν 1 0 Γ2
εxy
!
E εxx + ν εyy
Z Z
= u1 h~n , i ds − g1 u1 ds
2 (1 − ν 2 ) Γ2 (1 − ν) εxy Γ2
!
E (1 − ν) εxy
Z Z
+ u2 h~n , i ds − g2 u2 ds .
2 (1 − ν 2 ) Γ2 ν εxx + εyy Γ2

This leads to two boundary conditions.


!
E εxx + ν εyy
h~n , i = g1
1 − ν2 (1 − ν) εxy
!
E (1 − ν) εxy
h~n , i = g2 .
1 − ν2 ν εxx + εyy
Using Hooke’s law this can be expressed in terms of stresses
! !
σx τxy
h~n , i = g1 and h~n , i = g2
τxy σy
or with a matrix notation " # ! !
σx τxy nx g1
= .
τxy σy ny g2
The above boundary conditions can be written in terms of the unknown displacement vector ~u, leading to
    
E ∂ u1 ∂ u2 1 − ν ∂ u1 ∂ u 2
nx +ν + ny + = g1
1 − ν2 ∂x ∂y 2 ∂y ∂x
    
E ∂ u2 ∂ u1 1 − ν ∂ u1 ∂ u 2
2
ny +ν + nx + = g2 .
1−ν ∂y ∂x 2 ∂y ∂x
These equations are a mathematical model for the correct situation, but certainly not in a very readable form.

5.9.4 Deriving the Differential Equations using the Euler–Lagrange Equation


In this section the above partial differential equations are regenerated, using the Euler–Lagrange equations
from Result 5–11 on page 320. For this use the energy density W for a plane stress problem
     
1 ν 0 εxx εxx
1 E      
W = 2
h ν 1 0  ·  εyy  ,  εyy i
2 1−ν      
0 0 2 (1 − ν) εxy εxy


1 E 2 2 2

= ε xx + ε yy + 2 ν ε xx ε yy + 2 (1 − ν) ε ex
2 1 − ν2  
1 E ∂ u1 2 ∂ u2 2 ∂ u1 ∂ u 2 1 − ν ∂ u1 ∂ u2 2
= ( ) +( ) + 2ν ( )( )+ ( + ) .
2 1 − ν2 ∂x ∂y ∂x ∂y 2 ∂y ∂x

Thus the total energy in the system is given by

U = U (u1 , u2 ) = Uelast + UV ol + USurf


 
1 E ∂ u1 2 ∂ u2 2 ∂ u 1 ∂ u2 1 − ν ∂ u 1 ∂ u 2 2
ZZ
= ( ) +( ) + 2ν + ( + ) −
2 1 − ν2 ∂x ∂y ∂x ∂y 2 ∂y ∂x

ZZ Z
− f1 · u1 + f2 · u2 dx dy − g1 · u1 + g2 · u2 ds .
Γ2

This leads to a functional of the form (5.6), see page 319. It is a quadratic functional with the expression to
be integrated given by

F = F (u1 , u2 , ∇u1 , ∇u2 )


 
1 E ∂ u1 2 ∂ u2 2 ∂ u1 ∂ u2 1 − ν ∂ u1 ∂ u2 2
= ( ) +( ) + 2ν + ( + ) −
2 1 − ν2 ∂x ∂y ∂x ∂y 2 ∂y ∂x
−f1 · u1 − f2 · u2 .

To use the Euler–Lagrange equations (5.7) and (5.8)

div (F∇u1 (u1 , u2 , ∇u1 , ∇u2 )) = Fu1 (u1 , u2 , ∇u1 , ∇u2 ) ,


div (F∇u2 (u1 , u2 , ∇u1 , ∇u2 )) = Fu2 (u1 , u2 , ∇u1 , ∇u2 ) ,

the partial derivatives of the expression F (u1 , u2 , ∇u1 , ∇u2 ) are required.28

Fu1 = −f1 and Fu2 = −f2


! !
1 E 2 ∂∂x
u1
+ 2ν ∂ u2
∂y E ∂ u1 ∂ u2
∂x + ν ∂y
F∇u1 = =
2 1 − ν2 (1 − ν) ( ∂y + ∂∂x
∂ u1 u2
) 1 − ν2 1−ν ∂ u1 ∂ u2
2 ( ∂y + ∂x )
! !
1 E (1 − ν) ( ∂∂y
u1
+ ∂ u2
∂x ) E 1−ν ∂ u1 ∂ u2
2 ( ∂y + ∂x )
F∇u2 = =
2 1 − ν2 2 ∂∂y
u2
+ 2ν ∂ u1 1 − ν2 ∂ u2 ∂ u1
∂x ∂y + ν ∂x

Now the Euler–Lagrange equations lead to


!!
∂ u1 ∂ u2
E ∂x + ν ∂y
div = −f1
1 − ν2 1−ν ∂ u1 ∂ u2
2 ( ∂y + ∂x )
!!
1−ν ∂ u1 ∂ u2
E 2 ( ∂y + ∂x )
div = −f2 .
1 − ν2 ∂ u2 ∂ u1
∂y + ν ∂x

This is identical to the PDEs (5.34) and (5.35).


On the boundary Γ2 the natural boundary conditions

hF∇u1 (u1 , u2 , ∇u1 , ∇u2 ), ~ni = g1 and hF∇u2 (u1 , u2 , ∇u1 , ∇u2 ), ~ni = g2
28
Use the notation F∇u = ( ∂u∂ x F , ∂
∂uy
F )T .


lead to
!
∂ u1 ∂ u2
E ∂x + ν ∂y
h~n , i = g1
1 − ν2 1−ν ∂ u1 ∂ u2
2 ( ∂y + ∂x )
!
1−ν ∂ u1 ∂ u2
E 2 ( ∂y + ∂x )
h~n , i = g2 .
1 − ν2 ∂ u2 ∂ u1
∂y + ν ∂x

Using Hooke’s law for the plane stress situation (5.30) it can be simplified to
! !
σx τxy
h~n , i = g1 and h~n , i = g2
τxy σy

and this is consistent with (5.9) (page 338), i.e.


" # !
σx τxy g1
~n = .
τxy σy g2

Bibliography
[Aris62] R. Aris. Vectors, Tensors and the Basic Equations of Fluid Mechanics. Prentice Hall, 1962.
Republished by Dover.

[BoneWood08] J. Bonet and R. Wood. Nonlinear Continuum Mechanics for Finite Element Analysis. Cam-
bridge University Press, 2008.

[BoriTara79] A. I. Borisenko and I. E. Tarapov. Vector and Tensor Analysis with Applications. Dover, 1979.
first published in 1966 by Prentice–Hall.

[Bowe10] A. F. Bower. Applied Mechanics of Solids. CRC Press, 2010. web site at solidmechanics.org.

[ChouPaga67] P. C. Chou and N. J. Pagano. Elasticity, Tensors, dyadic and engineering Approaches. D
Van Nostrand Company, 1967. Republished by Dover 1992.

[GhabPeckWu17] J. Ghaboussi, D. Pecknold, and X. Wu. Nonlinear Computational Solid Mechanics. CRC
Press, 2017.

[Gree77] D. T. Greenwood. Classical Dynamics. Prentice Hall, 1977. Republished by Dover 1997.

[Hack15] R. Hackett. Hyperelasticity Primer. Springer International Publishing, 2015.

[Hear97] E. J. Hearns. Mechanics of Materials 1. Butterworth–Heinemann, third edition, 1997.

[HenrWann17] P. Henry and G. Wanner. Johann Bernoulli and the cycloid: A theorem for posteriority.
Elemente der Mathematik, 72(4):137–163, 2017.

[Holz00] G. A. Holzapfel. Nonlinear Solid Mechanics, a Continuum Approach for Engineering. John Wiley & Sons, 2000.

[Oden71] J. Oden. Finite Elements of Nonlinear Continua. Advanced engineering series. McGraw-Hill,
1971. Republished by Dover, 2006.

[Ogde13] R. Ogden. Non-Linear Elastic Deformations. Dover Civil and Mechanical Engineering. Dover
Publications, 2013.


[OttoPete92] N. S. Ottosen and H. Petersson. Introduction to the Finite Element Method. Prentice Hall,
1992.

[Prze68] J. Przemieniecki. Theory of Matrix Structural Analysis. McGraw–Hill, 1968. Republished by


Dover in 1985.

[Redd84] J. N. Reddy. An Introduction to the Finite Element Analysis. McGraw–Hill, 1984.

[Redd13] J. N. Reddy. An Introduction to Continuum Mechanics. Cambridge University Press, 2nd edition,
2013.

[Redd15] J. N. Reddy. An Introduction to Nonlinear Finite Element Analysis. Oxford University Press, 2nd
edition, 2015.

[Sege77] L. A. Segel. Mathematics Applied to Continuum Mechanics. MacMillan Publishing Company,


New York, 1977. republished by Dover 1987.

[Shab08] A. A. Shabana. Computational Continuum Mechanics. Cambridge University Press, 2008.

[Sout73] R. W. Soutas-Little. Elasticity. Prentice–Hall, 1973.

[VarFEM] A. Stahel. Calculus of Variations and Finite Elements. Lecture Notes used at HTA Biel, 2000.

[Wein74] R. Weinstock. Calculus of Variations. Dover, New York, 1974.

Chapter 6

Finite Element Methods

6.1 From Minimization to the Finite Element Method


Let Ω ⊂ R2 be a bounded domain with a smooth boundary Γ, consisting of two disjoint parts Γ1 and Γ2 . In
Section 5.2.4 the minimizer of a functional F (u)

1 1 2
ZZ Z
2
F (u) = a (∇u) + b u + f · u dA − g2 u ds
2 2 Γ2

was examined, with the boundary condition u(x, y) = g1 (x, y) for (x, y) ∈ Γ1 . At the minimal function u
the derivative in the direction of the function φ has to vanish. Thus the minimal solution has to satisfy
ZZ Z
0= (−∇( a ∇u) + b u + f ) · φ dA + (a ~n · ∇u − g2 ) φ ds (6.1)
Γ2

for all test function φ vanishing on the boundary Γ1 . Using Green’s formula (integration by parts) this leads
to ZZ Z
0= a ∇u · ∇φ + (b u + f ) · φ dA − g2 φ ds . (6.2)
Γ2

A function u satisfying this condition is called a weak solution. Using the fundamental lemma of the
calculus of variations conclude that the function u is a solution of the boundary value problem

−∇ · (a ∇u) + b u = −f for (x, y) ∈ Ω


u = g1 for (x, y) ∈ Γ1
a ~n · ∇u = g2 for (x, y) ∈ Γ2 .

Thus we have the following chain of results.

u minimizer of F (u) −→ u weak solution −→ u classical solution

For weak solutions or the minimization formulation use a numerical integration to discretize the for-
mulation. This will lead directly to the Finite Element Method (FEM). The path to follow is illustrated
in Figure 6.1. The branch on the left can only be used for self-adjoint problems and the resulting matrix
A will be symmetric. The branch on the right is applicable to more general problems, leading to possibly
non-symmetric matrices.

In this chapter the order of presentation is as follows:

• Develop the algorithm for piecewise linear FEM.


[Figure: flow chart connecting the different formulations.
 Classical solution: ∇ · (a ∇u) = f in Ω, u = 0 on ∂Ω.
 Via calculus of variations: u is the minimizer of F(u) = ∫∫_Ω 1/2 a (∇u)² + f · u dA with u = 0 on ∂Ω.
 Via multiplying by φ and integrating: u is a weak solution, ∫∫_Ω a ∇u · ∇φ + f φ dA = 0 for all φ vanishing on ∂Ω.
 After discretization the minimizer ~u ∈ R^N of F(~u) = 1/2 ⟨A ~u, ~u⟩ + ⟨W f~, ~u⟩, resp. the vector ~u ∈ R^N with ⟨A ~u, φ~⟩ + ⟨W f~, φ~⟩ = 0 for all φ~ ∈ R^N, leads to the FEM equations A ~u + W f~ = ~0.]

Figure 6.1: Classical and weak solutions, minimizers and FEM


• Examine weak solutions of second order boundary value problems.

• Present the theoretical results for self-adjoint BVP as minimization problems. State the required
approximation results.

• Examine the Galerkin formulation for second order BVP.

• Develop algorithms for triangular second order elements.

• Compare first and second order triangular elements.

• Give a very brief introduction to quadrilateral elements.

• To finish this chapter apply the FEM to a tumor growth problem.

6.2 Piecewise Linear Finite Elements


Start by developing an algorithm for a FEM solution of the model boundary value problem

   ∇ · (a ∇u) − b u = f    for (x, y) ∈ Ω
   u = 0                   for (x, y) ∈ Γ = ∂Ω .

Thus minimize the functional

   F(u) = ∫∫_Ω 1/2 a (∇u)² + 1/2 b · u² + f · u dA          (6.3)

amongst all functions u vanishing on the boundary Γ = ∂Ω.

This simple problem is used to explain the basic algorithms and can be implemented with MATLAB or
Octave 1 . The purpose of the code is to illustrate the necessary degree of complexity.

6.2.1 Discretization, Approximation and Assembly of the Global Stiffness Matrix


In the above functional F (u) integrals over the domain Ω ⊂ R2 have to be computed. To discretize this
integration use a triangulation of the domain, using grid points (xi , yi ) ∈ Ω, 1 ≤ i ≤ n. The grid points
are called nodes of the FEM approach and each node leads to one degree of freedom, i.e. one unknown
function value and one equation to be solved. On each triangle Tk replace the function u by a polynomial
of degree 1, i.e. a linear function. These polynomials are completely determined by their values at the three
corners of the triangle. Integrals over the complete domain Ω are split up into integrals over each triangle
and then a summation, i.e.
¹ The codes presented are simplifications taken from the package FEMoctave by this author and are intended for instructional purposes only.


   ∫∫_Ω ... dA ≈ Σ_k ∫∫_{Tk} ... dA .

The gradient of u is replaced by the gradient of the piecewise polynomials. Since the partial derivatives of a linear function are constants the gradient is constant on each triangle. Each contribution is written in the form

   ∫∫_{Tk} ... dA ≈ 1/2 ⟨Ak ~uk , ~uk⟩ + ⟨Wk f~k , ~uk⟩

where the 3 × 3 matrix Ak is the element stiffness matrix and Wk is a weight matrix, most often Wk is a diagonal matrix.

If the above process is carried out correctly the functional F is replaced by a discrete approximation

   F(u) ≈ 1/2 ⟨~u , A ~u⟩ + ⟨W f~ , ~u⟩

with a symmetric, positive definite matrix A. This expression is minimized by the solution of the linear system of equations

   A ~u = −W f~ .

It is one of the most important advantages of the Finite Element Method that it can be applied on irregularly shaped domains. For rectangular domains Ω the finite difference methods could be used to solve the BVP in this section. Applying the finite differences to non-rectangular domains can be very challenging.

For one possible construction of finite elements the value of the unknown function at each of the nodes
is one degree of freedom. Thus for each triangle we have exactly 3 degrees of freedom and the total number
N of (interior) nodes corresponds to the number of unknowns. The element stiffness matrices AT will be
of size 3 × 3 and the global stiffness matrix A is a N × N matrix.

Thus a rather general FEM algorithm is described by

• Decompose the domain Ω ⊂ R2 in triangles and determine the degrees of freedom.

• Create the N × N matrix A, originally filled with zeros, and the vector f~ ∈ RN .

• For each triangle T :

– Compute the element stiffness matrix AT and the vector WT f~T . Use equation (6.3) and a
numerical integration scheme.
– Add matrix and vector to the global structure.

• Solve the global system A ~u + Wf~ = ~0 for the vector of unknown values in ~u.

• Visualize the solution and make the correct conclusion for your application.

The actual computation of an element stiffness matrix will be examined carefully in the subsequent sections.
It is the most important building block of any FEM approach.
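
To make the above steps concrete, here is a minimal sketch of the assembly loop for the model problem with u = 0 on all of the boundary. It assumes mesh data in the form of a matrix nodes (one row (xi, yi) per node), an index matrix elem (one row with the three corner indices per triangle) and a vector interior with the indices of the interior nodes; ElementContribution is the element routine shown in Section 6.2.4. These data structures are chosen for illustration only and do not correspond to a particular mesh generator.

% sketch of the global assembly, assuming nodes, elem and interior are given
a = 1; b = 0; f = -1;                      % constant coefficients of the BVP
n = size(nodes,1);
A   = sparse(n,n);                         % global stiffness matrix
rhs = zeros(n,1);                          % global vector W*f
for k = 1:size(elem,1)
  ind = elem(k,:);                         % global indices of the 3 corners
  [elMat,elVec] = ElementContribution(nodes(ind,:),a,b,f);
  A(ind,ind) = A(ind,ind) + elMat;         % add the element stiffness matrix
  rhs(ind)   = rhs(ind)   + elVec;
end
u = zeros(n,1);                            % boundary values u = 0
u(interior) = -A(interior,interior)\rhs(interior);   % solve A*u + W*f = 0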


6.2.2 Integration over one Triangle


If a triangle T is given by its three corners at ~x1, ~x2 and ~x3 ∈ R², then its area A can be computed by a cross product²

   A = 1/2 ‖(~x2 − ~x1) × (~x3 − ~x1)‖ = 1/2 |(x2 − x1)(y3 − y1) − (y2 − y1)(x3 − x1)| .

If the values of a general function f are given at the three corners of the triangle by f1, f2 and f3 replace the exact function by a linearly interpolated function and find an approximate integral by

   ∫∫_T f dA ≈ A · (f1 + f2 + f3)/3 .

Observe that there is a systematic integration error due to replacing the true function by an approximate, linear function.
This leads to the integrals

   ∫∫_T f · u dA ≈ A/3 (f1 u1 + f2 u2 + f3 u3)
   ∫∫_T 1/2 b · u² dA ≈ 1/2 · A/3 (b1 u1² + b2 u2² + b3 u3²) .

Using a vector and matrix notation find

   ∫∫_T 1/2 b u² + f · u dA ≈ 1/2 · A/3 ⟨(u1, u2, u3)ᵀ , diag(b1, b2, b3) (u1, u2, u3)ᵀ⟩ + A/3 ⟨(f1, f2, f3)ᵀ , (u1, u2, u3)ᵀ⟩ .

This is one of the contributions in equation (6.3).
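
Both formulas above take only a few lines of Octave. In the sketch below the triangle and the integrand are chosen for illustration; the integral of exp(x+y) over this triangle happens to equal 1, so the size of the systematic integration error is visible.

% sketch: approximate integration of f over one triangle
corners = [0 0; 1 0; 0 1];                 % unit triangle
x1 = corners(1,:); x2 = corners(2,:); x3 = corners(3,:);
area = abs((x2(1)-x1(1))*(x3(2)-x1(2)) - (x2(2)-x1(2))*(x3(1)-x1(1)))/2
f = @(x,y) exp(x+y);                       % test integrand
fCorner = f(corners(:,1),corners(:,2));    % values at the three corners
approx = area*sum(fCorner)/3               % approximately 1.073
% the exact value of the integral over this triangle is 1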

6.2.3 Integration of ∇u · ∇u over one Triangle


To examine the other contribution we first need to compute the gradient of the function u. If the true function
is replaced by a linear interpolation on the triangle, then the gradient is constant on this triangle and can be
determined with the help of a normal vector of the plane passing through the three points
     
x1 x2 x3
     
 y1  ,  y2  and  y3  .
     
u1 u2 u3

The situation of one triangle in the xy plane and the corresponding triangle in the (xyu)–space is shown in
Figure 6.2. A normal vector ~n is given by the vector product ~n = ~a × ~b.
     
\[ \vec n = \begin{pmatrix}x_2-x_1\\y_2-y_1\\u_2-u_1\end{pmatrix}\times\begin{pmatrix}x_3-x_1\\y_3-y_1\\u_3-u_1\end{pmatrix} = \begin{pmatrix}+(y_2-y_1)\,(u_3-u_1) - (u_2-u_1)\,(y_3-y_1)\\ -(x_2-x_1)\,(u_3-u_1) + (u_2-u_1)\,(x_3-x_1)\\ +(x_2-x_1)\,(y_3-y_1) - (y_2-y_1)\,(x_3-x_1)\end{pmatrix} = \lambda\begin{pmatrix}\frac{\partial u}{\partial x}\\ \frac{\partial u}{\partial y}\\ -1\end{pmatrix} \quad\text{with } \lambda = -2\,A\,. \]
² Quietly extend the vector $\vec x = (x,y)\in\mathbb R^2$ to a vector $\vec x = (x,y,0)\in\mathbb R^3$.


Figure 6.2: One triangle in space (green) and projected to plane (blue)

The third component of this vector equals twice the oriented³ area A of the triangle. To obtain the gradient in the first two components, the vector has to be normalized, such that the third component equals −1.
\[ \nabla u = \begin{pmatrix}\frac{\partial u}{\partial x}\\ \frac{\partial u}{\partial y}\end{pmatrix} = \frac{-1}{2\,A}\begin{pmatrix}+(y_2-y_1)\,(u_3-u_1) - (u_2-u_1)\,(y_3-y_1)\\ -(x_2-x_1)\,(u_3-u_1) + (u_2-u_1)\,(x_3-x_1)\end{pmatrix} \]

This formula can be written as
\[ \nabla u = \frac{-1}{2\,A}\begin{bmatrix}(y_3-y_2) & (y_1-y_3) & (y_2-y_1)\\ (x_2-x_3) & (x_3-x_1) & (x_1-x_2)\end{bmatrix}\cdot\begin{pmatrix}u_1\\u_2\\u_3\end{pmatrix} = \frac{-1}{2\,A}\,\mathbf M\cdot\begin{pmatrix}u_1\\u_2\\u_3\end{pmatrix} \tag{6.4} \]
and leads to
\[ \langle\nabla u,\nabla u\rangle = \frac{1}{4\,A^2}\,\langle\mathbf M\begin{pmatrix}u_1\\u_2\\u_3\end{pmatrix},\mathbf M\begin{pmatrix}u_1\\u_2\\u_3\end{pmatrix}\rangle = \frac{1}{4\,A^2}\,\langle\begin{pmatrix}u_1\\u_2\\u_3\end{pmatrix},\mathbf M^T\mathbf M\begin{pmatrix}u_1\\u_2\\u_3\end{pmatrix}\rangle\,. \]
Thus conclude
\[ \iint_T \frac12\,a\,(\nabla u)^2\,dA \approx \frac{a_1+a_2+a_3}{3\cdot 2}\,A\,\langle\nabla u,\nabla u\rangle = \frac{a_1+a_2+a_3}{2\cdot 3\cdot 4\,A}\,\langle\begin{pmatrix}u_1\\u_2\\u_3\end{pmatrix},\mathbf M^T\mathbf M\begin{pmatrix}u_1\\u_2\\u_3\end{pmatrix}\rangle\,. \]

It is an exercise to verify that the matrix MT M is symmetric and positive semidefinite. The expression
vanishes if and only if u1 = u2 = u3 . This corresponds to a horizontal plane in Figure 6.2.
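A few lines of Octave can be used to verify formula (6.4) on an example triangle; the corner coordinates and nodal values below are arbitrary test data.
Octave
% gradient of the linear interpolant on one triangle, using the matrix M from (6.4)
corners = [0 0; 2 0; 0 1];     % an example triangle
u = [1; 3; 5];                 % nodal values, i.e. u(x,y) = 1 + x + 4*y
area = ((corners(2,1)-corners(1,1))*(corners(3,2)-corners(1,2)) - ...
        (corners(2,2)-corners(1,2))*(corners(3,1)-corners(1,1)))/2;
M = [corners(3,2)-corners(2,2), corners(1,2)-corners(3,2), corners(2,2)-corners(1,2);
     corners(2,1)-corners(3,1), corners(3,1)-corners(1,1), corners(1,1)-corners(2,1)];
gradu = -1/(2*area)*M*u        % returns [1; 4], the exact gradient
eig(M'*M)                      % eigenvalues >= 0, i.e. M'*M is positive semidefinite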

6.2.4 The Element Stiffness Matrix


Collecting the above results find
\[ \iint_T \frac12\,a\,(\nabla u)^2 + \frac12\,b\,u^2 + f\cdot u\,dA \approx \frac12\,\langle\begin{pmatrix}u_1\\u_2\\u_3\end{pmatrix},\mathbf A_T\begin{pmatrix}u_1\\u_2\\u_3\end{pmatrix}\rangle + \langle\begin{pmatrix}u_1\\u_2\\u_3\end{pmatrix},\mathbf W_T\vec f_T\rangle\,, \]
³ We quietly assumed that the third component of $\vec n$ is positive. As we use only the square of the gradient the influence of this ignorance will disappear.


where
\[ \mathbf A_T = \frac{a_1+a_2+a_3}{12\,A}\,\mathbf M^T\mathbf M + \frac{A}{3}\begin{bmatrix}b_1&0&0\\0&b_2&0\\0&0&b_3\end{bmatrix} \tag{6.5} \]
\[ \mathbf W_T\vec f_T = \frac{A}{3}\begin{pmatrix}f_1\\f_2\\f_3\end{pmatrix}. \tag{6.6} \]

If bi ≥ 0 then the element stiffness matrix AT is positive semidefinite.


The above can readily be implemented in Octave.
ElementContribution.m
function [elMat,elVec] = ElementContribution(corners,aFunc,bFunc,fFunc)
% element stiffness matrix and element vector for linear elements on one triangle
% corners: 3x2 matrix with the coordinates of the three corners
% aFunc, bFunc, fFunc: coefficients, given as scalars or as function names
  if isscalar(aFunc), aV = aFunc*ones(3,1); else, aV = feval(aFunc,corners); endif
  if isscalar(bFunc), bV = bFunc*ones(3,1); else, bV = feval(bFunc,corners); endif
  if isscalar(fFunc), fV = fFunc*ones(3,1); else, fV = feval(fFunc,corners); endif
  area = ((corners(2,1)-corners(1,1))*(corners(3,2)-corners(1,2))-...
          (corners(2,2)-corners(1,2))*(corners(3,1)-corners(1,1)))/2;
  M = [corners(3,2)-corners(2,2),corners(1,2)-corners(3,2),corners(2,2)-corners(1,2);
       corners(3,1)-corners(2,1),corners(1,1)-corners(3,1),corners(2,1)-corners(1,1)];
  elMat = sum(aV)/(12*area)*M'*M + area/3*diag(bV);   % equation (6.5)
  elVec = area/3*fV;                                  % equation (6.6)
end%function
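A quick test on the standard triangle with constant coefficients, where the result can be checked by hand using equations (6.5) and (6.6):
Octave
corners = [0 0; 1 0; 0 1];
[elMat,elVec] = ElementContribution(corners,1,0,1)
%% expected: elMat = [1 -0.5 -0.5; -0.5 0.5 0; -0.5 0 0.5]
%%           elVec = [1/6; 1/6; 1/6]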

6.2.5 Triangularization of the Domain Ω ⊂ R2


One of the first, and important, tasks for FEM is the generation of a mesh. For the current setup decompose a domain Ω ⊂ R² into triangles. Any domain limited by straight edges can be decomposed into triangles. This is (almost) always performed by suitable software.
As example consider the code triangle from [www:triangle], which can be called by the FEMoctave package within Octave. Apply it to a domain with corners at (0, 0), (2, 0), (2, 2) and (0, 1) and generate the mesh visible in Figure 6.3. The typical value of the area of one triangle is 0.1 .
Octave
ProbName = ’TestLinear’;
xy = [0,0,-1;2,0,-1;2,2,-1;0,1,-1];
Mesh = CreateMeshTriangle(ProbName,xy ,0.1)
figure(1)
FEMtrimesh(Mesh.elem,Mesh.nodes(:,1),Mesh.nodes(:,2))
xlabel(’x’); ylabel(’y’);

The generated mesh consists of 33 nodes, forming 48 triangles. On the boundary the values of the
function u are given by the known function g(x, y) and thus not all nodes are degrees of freedom for the
FEM problem. As a result find 17 interior points in this mesh and thus the resulting system of equations will
have 17 equations and unknowns.

6.2.6 Assembly of the System of Linear Equations


With the above results one can now assemble the global stiffness matrix, i.e. generate the system of linear
equations to be solved.


The structure Mesh contains multiple fields, amongst them:

• The variable nodes contains the coordinates of the numbered nodes and the information whether the node is on the boundary or in the interior of the domain.

• The variable elem contains a list of all the triangles and the node numbers of the corners.

• The variable edges contains a list of all the boundary edges of the discretized domain.

Figure 6.3: A small mesh of a simple domain in R²

For each triangle we find the element stiffness matrix $\mathbf A_T$, which will contribute to the global stiffness matrix $\mathbf A$. As an example consider Figure 6.4 with 3 nodes for each triangle. The entries of $\mathbf A_T$ have to be added to the previous entries in the global matrix $\mathbf A$ and accordingly the entries of $\vec b_k$ have to be added to the global vector $\vec f$.

Figure 6.4: Local and global numbering of nodes (local corner 1 ↔ global node i, 2 ↔ k, 3 ↔ j)

\[ \mathbf A_T = \begin{bmatrix}a_{11} & a_{12} & a_{13}\\ a_{21} & a_{22} & a_{23}\\ a_{31} & a_{32} & a_{33}\end{bmatrix} \quad\longrightarrow\quad \mathbf A = \mathbf A + \begin{array}{c|ccc} & \text{col } i & \text{col } j & \text{col } k\\ \hline \text{row } i & a_{11} & a_{13} & a_{12}\\ \text{row } j & a_{31} & a_{33} & a_{32}\\ \text{row } k & a_{21} & a_{23} & a_{22} \end{array} \]

The above construction allows one to verify that the symmetry of the element stiffness matrices $\mathbf A_T$ carries over to the global matrix $\mathbf A$. If all element stiffness matrices are positive definite the global stiffness matrix will be positive definite. Once the matrix and the right hand side vector are generated, a linear system of equations has to be solved. Thus results from chapter 2 have to be used. This assembling of the global stiffness matrix can be implemented in (almost) any programming language, as sketched below. As an example you might have a look at the Octave package FEMoctave on GitHub⁴.
⁴ Use github.com/AndreasStahel/FEMoctave or the documentation web.sha1.bfh.science/FEMoctave/FEMdoc.pdf.
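A minimal sketch of such an assembly loop in Octave, assuming for simplicity that all nodes are degrees of freedom (the elimination of the Dirichlet boundary nodes is omitted), that the coefficients aFunc, bFunc and fFunc are given, and that the first columns of Mesh.elem and Mesh.nodes contain the corner numbers and coordinates as described above; it uses ElementContribution from the previous section.
Octave
nNodes = size(Mesh.nodes,1);  nElem = size(Mesh.elem,1);
A = sparse(nNodes,nNodes);    f = zeros(nNodes,1);
for k = 1:nElem
  ind = Mesh.elem(k,1:3);                 % global numbers of the three corners
  corners = Mesh.nodes(ind,1:2);          % coordinates of the corners
  [elMat,elVec] = ElementContribution(corners,aFunc,bFunc,fFunc);
  A(ind,ind) = A(ind,ind) + elMat;        % add element stiffness matrix
  f(ind)     = f(ind) + elVec;            % add element vector
end%for
u = A\(-f);                               % solve A*u + W*f = 0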


6.2.7 The Algorithm of Cuthill and McKee to Reduce Bandwidth


The numbering of the nodes of a mesh created on a given domain will determine the bandwidth of the resulting stiffness matrix A for the given differential equation to be solved by the FEM⁵. For linear elements on triangles each (interior) node leads to one degree of freedom, the value of the function at this node. Find $a_{i,j}\neq 0$ if the nodes with number i and j share a common triangle. In view of the result in Section 2.6.4 one should aim for a numbering leading to a small bandwidth. One possible (and rather efficient) algorithm is known as the Cuthill–McKee⁶ algorithm.
choose a starting node and give it the number 1
while there are unnumbered nodes
  pick the next node
  find all its neighbors not yet numbered
  sort them, using the nodes with fewer connections
       to unnumbered nodes first
  give them the next free numbers
endwhile

Table 6.1: Algorithm of Cuthill–McKee

There are different criteria on how to choose an optimal first node. Tests show that nodes with few neighbors are often good starting nodes. Thus one may choose nodes with the minimal number of neighbors. Also good candidates are nodes at extreme points of the discretized domain. A more detailed description of the Cuthill-McKee algorithm and how to choose starting points is given in [LascTheo87].

Figure 6.5: Numbering of a simple mesh by Cuthill–McKee (left: the numbered mesh with 11 nodes, right: the pattern of the nonzero entries in the resulting 11 × 11 stiffness matrix)

The algorithm is illustrated by numbering the simple mesh in Figure 6.5. On the right the structure of
the nonzero elements in the resulting stiffness matrix is shown. The band structure is clearly recognizable.

• The first node is chosen, since it has only two neighbors and is at one end of the domain.

• Node 1 has two neighbors, number 2 is given to the node above, since it has only one free neighbor.
The node on the right (two free neighbors) of 1 will be number 3 .
⁵ When using iterative solvers with sparse matrices, the reduction of bandwidth is irrelevant. Since many (newer) direct solvers internally renumber equations and variables the importance of the Cuthill–McKee algorithm has clearly diminished.
⁶ Named after Elizabeth Cuthill and James McKee.


• Node 2 has only one free node with number 4 .

• Node 3 now has also only one free node left, number 5 .

• Of the two free neighbors of node 4, the one above has fewer free nodes and thus will receive num-
ber 6. The node on the right will be number 7 .

• The only free neighbor of node 5 will now receive number 8 .

• The only free neighbor of node 6 will now receive number 9 .

• The only free neighbor of node 7 will now receive number 10 .

• The last node will be number 11 .

As an example examine a BVP on the domain shown in Figure 6.6, where the mesh was generated by
the program triangle (see [www:triangle]). The mesh has 518 nodes and the original numbering leads
to a semi–bandwidth of 515, i.e. no band structure. Nonetheless we have a sparse matrix, since only 3368
entries are nonzero (i.e. 1.25%). The nonzero elements in the matrix A are shown in Figure 6.7, before and
after applying the Cuthill–McKee algorithm. The new semi-bandwidth is 28. If finer meshes (more nodes)
are used, then the improvements due to a good renumbering of the nodes will be even larger.
Within the band only 21% of the entries are not zero, i.e. we still have a certain sparsity within the band.
The algorithm of Cholesky can not take advantage of this sparsity within the band, but iterative methods
can, as examined in Section 2.7.

Figure 6.6: Mesh generated by triangle

(a) before renumbering    (b) after Cuthill–McKee renumbering

Figure 6.7: Structure of the nonzero entries in a stiffness matrix
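To experiment with this effect in Octave one can use the built-in function symrcm, which implements the reverse Cuthill–McKee ordering; the sparse matrix below is just a random test matrix and not the stiffness matrix of Figure 6.7.
Octave
A = sprandsym(500,0.01) + 10*speye(500);   % random sparse symmetric test matrix
p = symrcm(A);                             % permutation from reverse Cuthill-McKee
B = A(p,p);                                % renumbered matrix
[i,j] = find(A); bandA = max(abs(i-j));    % semi-bandwidth before renumbering
[i,j] = find(B); bandB = max(abs(i-j));    % semi-bandwidth after renumbering
printf('semi-bandwidth before %d, after %d\n',bandA,bandB)
figure(1); spy(A); figure(2); spy(B)       % visualize the nonzero patterns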


6.2.8 A First Solution by the FEM


For a numerical example examine the domain Ω ⊂ R² given in Figure 6.3 and solve the boundary value problem
\[ \begin{array}{rcll} -\nabla\cdot(\nabla u) &=& 5 & \text{for } (x,y)\in\Omega\\ u &=& 0 & \text{for } (x,y)\in\Gamma = \partial\Omega\,. \end{array} \]
Apply the algorithms from the preceding sections to find an approximate solution. The code below generates
a mesh on a non-rectangular domain. The triangles have an area of approximately 0.1. The resulting mesh
consists of 48 triangles based on 33 nodes. Since 16 nodes are on the boundary the resulting system of linear
equations is of size 17 × 17.
TestLinear.m
Mesh = CreateMeshTriangle(’TestLinear’,[0,0,-1;2,0,-1;2,2,-1;0,1,-1],0.1);
figure(1); FEMtrimesh(Mesh.elem,Mesh.nodes(:,1),Mesh.nodes(:,2))
xlabel(’x’); ylabel(’y’);

u = BVP2D(Mesh,1,0,0,0,5,0,0,0);
figure(2); FEMtrimesh(Mesh.elem,Mesh.nodes(:,1),Mesh.nodes(:,2),u)
xlabel(’x’); ylabel(’y’); zlabel(’u’); view([160,35])

figure(3);
tricontour(Mesh.elem,Mesh.nodes(:,1),Mesh.nodes(:,2),u,linspace(0,1,11)+eps)
xlabel(’x’); ylabel(’y’); axis equal

(a) surface of a solution    (b) contour levels of a solution

Figure 6.8: A first FEM solution

The results in Figure 6.8 are obviously far from optimal.


• The level curves are very ragged. This can be improved by generating a finer mesh and consequently
a larger system of equations to be solved. A test run with a typical area of 0.001 (diameter divided by
10) leads to 2470 nodes, forming 4750 triangles. The resulting matrix A has size 2282 × 2282. The
level curves are very smooth.

• There is no control over the approximation error yet. The corresponding results will be examined in
Section 6.4, starting on page 417.

• It is not clear whether the approximation by piecewise linear functions was a good choice. This will be clarified in Section 6.4 and a more efficient approach, using second order elements, will be examined in Section 6.5.


• The numerical integration was done with a rather simplistic idea. A better approach will be presented
in Section 6.5.

• It is not obvious how to adapt the above approach to more general problems. This will be examined
in following sections.

6–1 Example : A finite difference stencil as special case of the FEM


Examine the boundary value problem
\[ \begin{array}{rcll} \frac{\partial^2}{\partial x^2}\,u(x,y) + \frac{\partial^2}{\partial y^2}\,u(x,y) &=& f(x,y) & \text{for } (x,y)\in\Omega = (0,a)\times(0,b)\\ u(x,y) &=& 0 & \text{for } (x,y) \text{ on boundary } \partial\Omega\,. \end{array} \]

This corresponds to minimizing the functional
\[ F(u) = \int_\Omega \frac12\,\left(\left(\frac{\partial u}{\partial x}\right)^2 + \left(\frac{\partial u}{\partial y}\right)^2\right) + u\cdot f\,dA\,. \]

Use the domain Ω in Figure 6.9 with a uniform, rectangular mesh, which has nx interior nodes in x direction and ny interior nodes in y direction. In the shown example nx = 18 and ny = 5. The step sizes are given by $h_x = \frac{a}{n_x+1}$ and $h_y = \frac{b}{n_y+1}$.

Figure 6.9: A simple rectangular mesh on the rectangle (0, a) × (0, b)

In the mesh in Figure 6.9 there are two types of triangles, shown in Figure 6.10 and thus one can compute
all element contributions with the help of those two standard triangles.

Figure 6.10: The two types of triangles (A and B) in a rectangular mesh, with legs of length hx and hy

If the values ui of the function at the three corners of a type A triangle are known, then compute the
gradient of the linearly interpolated function by
   
\[ \nabla u = \frac{-1}{2\,\text{area}}\,\mathbf M\cdot\begin{pmatrix}u_1\\u_2\\u_3\end{pmatrix} = \frac{-1}{2\,\text{area}}\begin{bmatrix}(y_3-y_2) & (y_1-y_3) & (y_2-y_1)\\ (x_2-x_3) & (x_3-x_1) & (x_1-x_2)\end{bmatrix}\cdot\begin{pmatrix}u_1\\u_2\\u_3\end{pmatrix} = \frac{-1}{h_x\,h_y}\begin{bmatrix}h_y & -h_y & 0\\ 0 & h_x & -h_x\end{bmatrix}\cdot\begin{pmatrix}u_1\\u_2\\u_3\end{pmatrix} \]

and thus
\[ \mathbf M^T\cdot\mathbf M = \begin{bmatrix}h_y & 0\\ -h_y & h_x\\ 0 & -h_x\end{bmatrix}\cdot\begin{bmatrix}h_y & -h_y & 0\\ 0 & h_x & -h_x\end{bmatrix} = \begin{bmatrix}h_y^2 & -h_y^2 & 0\\ -h_y^2 & h_y^2+h_x^2 & -h_x^2\\ 0 & -h_x^2 & h_x^2\end{bmatrix}. \]

Observe that the zeros in the off-diagonal corners are based on the facts y1 = y2 and x2 = x3. Equation (6.5) leads to the element stiffness matrix
\[ \mathbf A_A = \frac{a_1+a_2+a_3}{12\,\text{area}}\,\mathbf M^T\cdot\mathbf M = \frac{1}{2\,h_x\,h_y}\begin{bmatrix}h_y^2 & -h_y^2 & 0\\ -h_y^2 & h_y^2+h_x^2 & -h_x^2\\ 0 & -h_x^2 & h_x^2\end{bmatrix} \]
and the vector
\[ \vec b_A = \frac{A}{3}\begin{pmatrix}f_1\\f_2\\f_3\end{pmatrix} = \frac{h_x\,h_y}{6}\begin{pmatrix}f_1\\f_2\\f_3\end{pmatrix}. \]
Similar calculations lead to the element stiffness matrix for triangles of type B
\[ \mathbf A_B = \frac{1}{2\,h_x\,h_y}\begin{bmatrix}h_x^2 & 0 & -h_x^2\\ 0 & h_y^2 & -h_y^2\\ -h_x^2 & -h_y^2 & h_y^2+h_x^2\end{bmatrix} \]
and $\vec b_B = \vec b_A$.

Now construct the system of linear equations for the boundary value problem on the mesh in Figure 6.9. For this consider the mesh point in column i and row j (starting at the bottom) in the mesh and denote the value of the solution at this point by $u_{i,j}$. In Figure 6.9 observe that $u_{i,j}$ is directly connected to 6 neighboring points and 6 triangles are used to build the connections. This is visualized in Figure 6.11, the left part of that figure represents the stencil at this point in the mesh.
Start by examining the contributions to $\iint_\Omega u\cdot f\,dA$ involving the coefficient $u_{i,j}$. As the coefficients in $\vec b_A = \vec b_B$ are all constant, obtain six contributions of the size $\frac{h_x\,h_y}{6}\,f_{i,j}$ leading to a total of
\[ f_{i,j} \to 6\,\left(\frac{h_x\,h_y}{6}\right) = h_x\,h_y\,. \]
With similar arguments examine the contributions to $\frac12\iint_\Omega \nabla u\cdot\nabla u\,dA$. Use the element matrices $\mathbf A_A$, $\mathbf A_B$ and Figure 6.11 to verify the contributions below.

\[ \begin{aligned} u_{i,j} &\to \frac{1}{2\,h_x h_y}\left(h_y^2 + h_x^2 + (h_x^2+h_y^2) + h_y^2 + h_x^2 + (h_x^2+h_y^2)\right) = \frac{2\,(h_x^2+h_y^2)}{h_x h_y}\\ u_{i+1,j} &\to \frac{1}{2\,h_x h_y}\left(-h_x^2 + 0 + 0 + 0 + 0 - h_x^2\right) = \frac{-h_x^2}{h_x h_y}\\ u_{i-1,j} &\to \frac{1}{2\,h_x h_y}\left(0 + 0 - h_x^2 - h_x^2 + 0 + 0\right) = \frac{-h_x^2}{h_x h_y} \end{aligned} \]


Figure 6.11: FEM stencil and neighboring triangles of a mesh point

\[ \begin{aligned} u_{i,j+1} &\to \frac{1}{2\,h_x h_y}\left(0 - h_y^2 - h_y^2 + 0 + 0 + 0\right) = \frac{-h_y^2}{h_x h_y}\\ u_{i,j-1} &\to \frac{1}{2\,h_x h_y}\left(0 + 0 + 0 + 0 - h_y^2 - h_y^2\right) = \frac{-h_y^2}{h_x h_y}\\ u_{i+1,j+1} &\to \frac{1}{2\,h_x h_y}\left(0 + 0 + 0 + 0 + 0 + 0\right) = 0\\ u_{i-1,j-1} &\to \frac{1}{2\,h_x h_y}\left(0 + 0 + 0 + 0 + 0 + 0\right) = 0 \end{aligned} \]
Observe that the two diagonal connections in the stencil lead to zero contributions. This is correct for the rectangular mesh. Thus the resulting equation for the degree of freedom $u_{i,j}$ is given by
\[ \frac{2\,(h_x^2+h_y^2)}{h_x h_y}\,u_{i,j} - \frac{h_x^2}{h_x h_y}\,u_{i+1,j} - \frac{h_x^2}{h_x h_y}\,u_{i-1,j} - \frac{h_y^2}{h_x h_y}\,u_{i,j+1} - \frac{h_y^2}{h_x h_y}\,u_{i,j-1} + f_{i,j}\,h_x h_y = 0 \]
or by rearranging
\[ \frac{-u_{i+1,j} + 2\,u_{i,j} - u_{i-1,j}}{h_x^2} + \frac{-u_{i,j+1} + 2\,u_{i,j} - u_{i,j-1}}{h_y^2} = -f_{i,j}\,. \tag{6.7} \]
For the special case hx = hy obtain
\[ \frac{1}{h_x^2}\,\left(4\,u_{i,j} - u_{i+1,j} - u_{i-1,j} - u_{i,j+1} - u_{i,j-1}\right) = -f_{i,j}\,. \]
This is the finite difference approximation to the differential expression $-\Delta u = -(\frac{\partial^2 u}{\partial x^2} + \frac{\partial^2 u}{\partial y^2})$ and leads to the well-known finite difference stencil in Figure 6.12, used to solve boundary value problems in two variables, see Section 4.4.2.
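The matrix generated by the stencil (6.7) on the full grid of interior nodes can be set up in Octave with a few Kronecker products; this is only a sketch and assumes the domain size a, b, the numbers nx, ny and the nx × ny array fij of values f(i,j) are given, with the unknowns numbered row by row (x direction first).
Octave
hx = a/(nx+1); hy = b/(ny+1);
Dxx = spdiags(ones(nx,1)*[-1 2 -1],-1:1,nx,nx)/hx^2;   % 1D stencil in x direction
Dyy = spdiags(ones(ny,1)*[-1 2 -1],-1:1,ny,ny)/hy^2;   % 1D stencil in y direction
A = kron(speye(ny),Dxx) + kron(Dyy,speye(nx));         % matrix for -u_xx - u_yy
u = A\(-fij(:));                                       % solve the equations (6.7)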

6.2.9 Contributions to the Approximation Error


The algorithm in the previous section constructs an approximation of the exact solution of the boundary
value problem. It is important to identify possible sources of errors of the approximate solution:

1. The true solution is approximated by a piecewise linear function on each triangle.

\[ \frac{1}{h^2}\begin{bmatrix} & -1 & \\ -1 & 4 & -1\\ & -1 & \end{bmatrix} \]

Figure 6.12: Finite difference stencil for $-u_{xx} - u_{yy}$ for $h = h_x = h_y$

2. On each triangle we have to use a numerical integration to determine the contributions.

3. Not all domains Ω ⊂ R2 can be decomposed into triangles.

In the following sections we will examine the first two contributions, and methods to minimize the errors.

• Using quadratic polynomials on each triangle will (usually) lead to smaller errors.

• Using the brilliant integration methods by Gauss, the influence of the integration error will be very
small.

6.3 Classical and Weak Solutions


In this section some of the more mathematical notations are presented. They are used to formulate the
essential convergence results for FEM.

6.3.1 Weak Solutions of a System of Linear Equations


As an introduction to the concept of weak solutions examine systems of linear equations, i.e. for a matrix A
and a vector f~ search for solutions ~u of A ~u = f~. The same idea will be applied to boundary value problems
later in this section.
A vector $\vec x\in\mathbb R^N$ vanishes if and only if it is orthogonal to all vectors $\vec\phi\in\mathbb R^N$, or orthogonal to all vectors $\vec\phi_i\in\mathbb R^N$ ($i = 1,2,\ldots,N$) which form a basis of $\mathbb R^N$.
\[ \vec x = \vec 0 \iff \langle\vec x,\vec\phi\rangle = 0 \text{ for all } \vec\phi\in\mathbb R^N \iff \langle\vec x,\vec\phi_i\rangle = 0 \text{ for } i = 1,2,\ldots,N \]

This obvious fact also applies to Hilbert spaces, which will be the function spaces for our boundary value problems to be examined.
Let H be a Hilbert space with a finite dimensional subspace $H_h$. Let $\vec\phi_i$ for $i = 1,2,3,\ldots,N$ be a basis of $H_h$ and thus any vector $\vec x\in H_h$ can be written in the form
\[ \vec x = \sum_{j=1}^N x_j\,\vec\phi_j\,. \]

With the above examine a weak solution of a system of linear equations and find the following.
\[ \mathbf A\,\vec u = \vec f\in H_h \iff \mathbf A\,\vec u - \vec f = 0 \iff \langle\mathbf A\,\vec u - \vec f,\vec\phi\rangle = 0 \text{ for all } \vec\phi\in H_h \iff \langle\mathbf A\,\vec u - \vec f,\vec\phi_j\rangle = 0 \text{ for all } j = 1,2,\ldots,N \]


Taking inhomogeneous boundary conditions into account

For the boundary value problems the partial differential equation in the domain has to be solved together with boundary conditions. Thus we have to be able to take those into account for the correct definition of a weak solution.
Let $H_0 = \{v\in H\mid B_1 v = 0\}$ be a linear subspace of H with basis $\varphi_j$. Examine the system of equations
\[ \begin{aligned} A\,u &= f\\ B_1\,u &= g_1\\ B_2\,u &= g_2\,. \end{aligned} \]
If $u_1\in H$ satisfies $B_1 u_1 = g_1$, then write $u = u_1 + v$ and arrive at the modified problem
\[ \begin{aligned} A\,v &= f - A\,u_1\\ B_1\,v &= 0\\ B_2\,v &= g_2 - B_2\,u_1\,. \end{aligned} \]
Thus examine weak solutions of linear systems of equations with additional conditions.
\[ A\,v = f - A\,u_1 \iff \langle A\,v - f + A\,u_1,\phi\rangle = 0 \text{ for all } \phi \text{ with } B_1\phi = 0 \]

6.3.2 Classical Solutions and Weak Solutions of Differential Equations


Examine boundary value problems of the form (6.8) and define classical and weak solutions to these problems. The main point of this section is to motivate⁷ that being a classical solution is equivalent to being a weak solution.
\[ \begin{array}{rcll} -\nabla\cdot(a\,\nabla u + u\,\vec b) + b_0\,u &=& f & \text{for } (x,y)\in\Omega\\ u &=& g_1 & \text{for } (x,y)\in\Gamma_1\\ \vec n\cdot(a\,\nabla u + u\,\vec b) &=& g_2 + g_3\,u & \text{for } (x,y)\in\Gamma_2 \end{array} \tag{6.8} \]

The functions $a$, $\vec b$, $b_0$, $f$ and $g_i$ are known functions and we have to determine the solution u, all depending on the independent variables $(x,y)\in\Omega\subset\mathbb R^2$. The vector $\vec n$ is the outer unit normal vector. The expression
\[ \vec n\cdot\nabla u = n_1\,\frac{\partial u}{\partial x} + n_2\,\frac{\partial u}{\partial y} = \frac{\partial u}{\partial\vec n} \]
equals the directional derivative of the function u in the direction of the outer normal $\vec n$.

Consider a smooth test-function φ vanishing on the Dirichlet boundary Γ₁ and a classical solution u of the boundary value problem (6.8). Then multiply the differential equation in (6.8) by φ and integrate over the domain Ω.
\[ \begin{aligned} 0 &= -\nabla\cdot(a\,\nabla u + u\,\vec b) + b_0\,u - f\\ 0 &= \iint_\Omega \phi\,\left(-\nabla\cdot(a\,\nabla u + u\,\vec b) + b_0\,u - f\right)\,dA\\ &= \iint_\Omega \nabla\phi\cdot(a\,\nabla u + u\,\vec b) + \phi\,(b_0\,u - f)\,dA - \int_\Gamma \phi\,\left(a\,\nabla u + u\,\vec b\right)\cdot\vec n\,ds\\ &= \iint_\Omega \nabla\phi\cdot(a\,\nabla u + u\,\vec b) + \phi\,(b_0\,u - f)\,dA - \int_{\Gamma_2} \phi\,(g_2 + g_3\,u)\,ds \end{aligned} \]

⁷ We knowingly ignore regularity problems, i.e. we assume that all expressions and solutions are smooth enough. These problems are carefully examined in books and classes on PDEs.


6–2 Definition : A function u satisfying the above equation for all smooth test functions φ vanishing on Γ₁ is said to be a weak solution of the BVP (6.8).

The above computation shows that classical solutions have to be weak solutions.

If u is a weak solution, then for all smooth test functions φ conclude
\[ \begin{aligned} 0 &= \iint_\Omega \nabla\phi\cdot(a\,\nabla u + u\,\vec b) + \phi\,(b_0\,u - f)\,dA - \int_{\Gamma_2} \phi\,(g_2 + g_3\,u)\,ds\\ &= \iint_\Omega \phi\,\left(-\nabla\cdot(a\,\nabla u + u\,\vec b) + b_0\,u - f\right)\,dA + \int_\Gamma \phi\,\left(a\,\nabla u + u\,\vec b\right)\cdot\vec n - \phi\,(g_2 + g_3\,u)\,ds\,. \end{aligned} \]

In particular find that for all functions φ vanishing on all of the boundary Γ we have
\[ 0 = \iint_\Omega \phi\,\left(-\nabla\cdot(a\,\nabla u + u\,\vec b) + b_0\,u - f\right)\,dA \]
and the fundamental lemma of the calculus of variations implies

0 = −∇ · (a ∇u + u ~b) + b0 u − f

i.e. we have a solution of the differential equation. This in turn now leads to
\[ 0 = \int_\Gamma \phi\,\left(a\,\nabla u + u\,\vec b\right)\cdot\vec n - \phi\,(g_2 + g_3\,u)\,ds \]
for all smooth functions φ on the boundary Γ₂ and thus we also recover the boundary condition in (6.8)
\[ 0 = \left(a\,\nabla u + u\,\vec b\right)\cdot\vec n - (g_2 + g_3\,u)\,. \]

Thus we have equivalence of weak and classical solution (ignoring smoothness problems).

If there were an additional term $\vec c\cdot\nabla u$ in the PDE (6.8) we would have to consider an additional term in the above computations.
\[ \begin{aligned} \iint_\Omega \phi\,\vec c\cdot\nabla u\,dA &= \iint_\Omega \nabla\cdot(\phi\,u\,\vec c) - u\,\nabla\cdot(\phi\,\vec c)\,dA\\ &= \int_\Gamma \phi\,u\,\vec c\cdot\vec n\,ds - \iint_\Omega u\,\nabla\phi\cdot\vec c + u\,\phi\,\nabla\cdot\vec c\,dA \end{aligned} \]

6.4 Energy Norms and Error Estimates


To illustrate the theoretical background examine the boundary value problem
\[ \begin{array}{rcll} -\nabla\cdot(a\,\nabla u) + b_0\,u &=& f & \text{for } (x,y)\in\Omega\\ u &=& g_1 & \text{for } (x,y)\in\Gamma_1\\ \vec n\cdot(a\,\nabla u) &=& g_2 & \text{for } (x,y)\in\Gamma_2\,. \end{array} \tag{6.9} \]

Solving this problem is equivalent to minimize the functional
\[ F(u) = \iint_\Omega \frac12\,a\,(\nabla u)^2 + \frac12\,b_0\,u^2 - f\,u\,dA - \int_{\Gamma_2} g_2\,u\,ds\,. \]


In general the exact solution $u_e$ cannot be found and we have to settle for an approximate solution $u_h$, where the parameter h corresponds to the typical size (e.g. diameter) of the elements used for the approximation. Obviously we hope for the solution $u_h$ to converge to the exact solution $u_e$ as h approaches 0. It is the goal of this section to show under what circumstances this is in fact the case and also to determine the rate of convergence. The methods and ideas used can also be applied to partial differential equations with multiple variables.

6.4.1 Basic Assumptions and Regularity Results


For the results of this section to be correct we need assumptions on the functions a, b₀ and gᵢ, such that the solution of the boundary value problem is well behaved. Throughout the section we assume:

• a, b0 and gi are continuous, bounded functions.

• There is a positive number α0 such that 0 < α0 ≤ a ≤ α1 and 0 ≤ b0 ≤ β1 for all points in the
domain Ω ⊂ R2 .

• The quadratic functional F (u) is strictly positive definite. This condition is satisfied if

– either on a nonempty section Γ1 of the boundary the Dirichlet boundary condition is imposed
– or the function b0 is strictly positive on a subdomain.

There are other combinations of conditions to arrive at a strictly positive functional F , but the above
two are easiest to verify.

With the above assumptions we know that the BVP (6.9) has exactly one solution. The proof of this result
is left to mathematicians. As a rule of thumb we use that the solution u is (k + 2)-times differentiable if f
is k-times differentiable.
This mathematical result tells us that there is a unique solution ue of the boundary value problem, but it
does not give the solution. Now we use the finite element method to find numerical approximations uh to
this exact solution ue .

6.4.2 Function Spaces, Norms and Continuous Functionals


In view of the above definition of a weak solution we define for functions u and v
\[ \begin{aligned} \langle u, v\rangle &:= \iint_\Omega u\,v\,dA\\ A(u,v) &:= \iint_\Omega a\,\nabla u\cdot\nabla v + b_0\,u\,v\,dA = \langle a\,\nabla u,\nabla v\rangle + \langle b_0\,u, v\rangle\\ \langle u, v\rangle_{\Gamma_2} &:= \int_{\Gamma_2} u\,v\,ds\,. \end{aligned} \]

Basic properties of the integral imply that the bilinear form A is symmetric and linear with respect to each
argument, i.e. for λi ∈ R we have

A(u, v) = A(v, u)
A(λ1 u1 + λ2 u2 , v) = λ1 A(u1 , v) + λ2 A(u2 , v)
A(u, λ1 v1 + λ2 v2 ) = λ1 A(u, v1 ) + λ2 A(u, v2 ) .

The function u is a weak solution of (6.9) iff

A(u, φ) = hf, φi + hg2 , φiΓ2 for all functions φ .


We can also search for a minimum of the functional
\[ F(u) = \frac12\,A(u,u) - \langle f, u\rangle - \langle g_2, u\rangle_{\Gamma_2}\,. \]
The only new aspect is a new notation. For the subsequent observations it is convenient to introduce two
spaces of functions.
6–3 Definition : Let u be a piecewise differentiable function defined on the domain Ω ⊂ R². Then L₂ and V denote two sets of functions, both spaces equipped with a norm⁸
\[ L_2 := \{u:\Omega\to\mathbb R\mid u \text{ is square integrable}\}\,,\qquad \|u\|_2^2 := \langle u, u\rangle = \iint_\Omega u^2\,dA\,. \]
For the function u to be in the smaller subspace V we require the function u and its derivatives ∇u to be square integrable and u has to satisfy the Dirichlet boundary condition (if there are any imposed). The norm in this space is given by
\[ \|u\|_V^2 := \|\nabla u\|_2^2 + \|u\|_2^2 = \iint_\Omega |\nabla u|^2 + u^2\,dA = \iint_\Omega \left(\frac{\partial u}{\partial x}\right)^2 + \left(\frac{\partial u}{\partial y}\right)^2 + u^2\,dA\,. \]
L₂ and V are vector spaces and ⟨ · , · ⟩ is a scalar product on L₂. V is called a Sobolev space. Obviously we have
\[ V\subset L_2 \qquad\text{and}\qquad \|u\|_2 \le \|u\|_V\,. \]

Since the ‘energy’ to be minimized
\[ F(u) = \frac12\,A(u,u) = \iint \frac12\,\left(a\,(\nabla u)^2 + b\,u^2\right)\,dA \]
is closely related to $\|u\|_V^2$ this norm is often called an energy norm.


If u, v ∈ V the expression A(u, v) can be computed and find
\[ \begin{aligned} |A(u,v)| &= \left|\iint_\Omega a\,\nabla u\cdot\nabla v + b_0\,u\,v\,dA\right| \le \iint_\Omega |a|\,|\nabla u|\,|\nabla v| + |b_0|\,|u|\,|v|\,dA\\ &\le \alpha_1\iint_\Omega |\nabla u|\,|\nabla v|\,dA + \beta_1\iint_\Omega |u|\,|v|\,dA \le \alpha_1\,\|\nabla u\|_2\,\|\nabla v\|_2 + \beta_1\,\|u\|_2\,\|v\|_2\\ &\le (\alpha_1+\beta_1)\,\|u\|_V\,\|v\|_V\,. \end{aligned} \]

Assuming $0 < \beta_0 \le b_0(x)$ for all x leads to
\[ \begin{aligned} A(u,u) &= \iint_\Omega a\,\nabla u\cdot\nabla u + b_0\,u\,u\,dA = \iint_\Omega a\,|\nabla u|^2 + b_0\,|u|^2\,dA\\ &\ge \iint_\Omega \alpha_0\,|\nabla u|^2 + \beta_0\,|u|^2\,dA \ge \min\{\alpha_0,\beta_0\}\iint_\Omega |\nabla u|^2 + |u|^2\,dA = \gamma_0\,\|u\|_V^2\,. \end{aligned} \]
⁸ A mathematically correct introduction of these function spaces is well beyond the scope of these notes. The tools of Lebesgue integration and completion of spaces are not available. As a consequence we ignore most of the mathematical problems.


It can be shown9 that the above inequality is correct as long as the assumptions in Section 6.4.1 are satisfied.
Thus find
γ0 kuk2V ≤ A(u, u) ≤ (α1 + β1 ) kuk2V for all u ∈ V . (6.10)
This inequality is the starting point for most theoretical results on boundary value problems of the type
examined in these notes. The bilinear form A(·, ·) is called coercive or elliptic. For the purposes of these
notes it is sufficient to realize that the expression A(u, u) corresponds to the squared integral of the function
u and its partial derivatives of order 1.

6.4.3 Convergence of the Finite Dimensional Approximations


The space V contains all piecewise differentiable, continuous functions and thus V is not a finite dimensional vector space. For a fixed parameter h > 0 we choose a discretisation of the bounded domain Ω ⊂ R² into finitely many triangles of typical diameter h. Then we consider only continuous functions that are polynomials of a given degree (e.g. 1 or 2) on each of the triangles, i.e. a piecewise linear or quadratic function. This leads to a finite dimensional subspace Vh , i.e. finitely many degrees of freedom.

Vh ⊂ V finite dimensional subspace

The functions in the finite dimensional subspace Vh have to be piecewise differentiable and everywhere
continuous. This condition is necessary, since we try to minimize a functional involving first order partial
derivatives. This property is called conforming elements. Instead of searching for a minimum on all of V
we now only consider functions in Vh ⊂ V to find the minimizer of the functional. This is illustrated in
Table 6.2. We hope that the minimum uh ∈ Vh will be close to the exact solution ue ∈ V . The main goal
of this section is to show that this is in fact the case. The ideas of proofs are adapted from [John87, p.54]
and [Davi80, §7] and can also be used in more general situations, e.g. for differential equations with more
independent variables. To simplify the proof of the abstract error estimate we use two lemmas.

                                  exact problem                                     approximate problem
functional to minimize            F(u) = ½ A(u,u) − ⟨f, u⟩ − ⟨g₂, u⟩_Γ₂             F(u_h) = ½ A(u_h,u_h) − ⟨f, u_h⟩ − ⟨g₂, u_h⟩_Γ₂
amongst functions                 u ∈ V (infinite dimensional)                      u_h ∈ V_h (finite dimensional)
necessary condition for minimum   A(u,φ) − ⟨f, φ⟩ − ⟨g₂, φ⟩_Γ₂ = 0 for all φ ∈ V    A(u_h,φ_h) − ⟨f, φ_h⟩ − ⟨g₂, φ_h⟩_Γ₂ = 0 for all φ_h ∈ V_h
main goal                         u_h −→ u as h → 0

Table 6.2: Minimization of exact and approximate problem

⁹ The correct mathematical result to be used is Poincaré's inequality. There exists a constant C (depending on Ω only) such that for any smooth function u vanishing on the boundary we have
\[ \iint_\Omega u^2\,dA \le C\iint_\Omega |\nabla u|^2\,dA\,. \]
This inequality replaces the condition $0 < \beta_0 \le b_0$. Intuitively the inequality shows that the values of the function are controlled by the values of the derivatives. For elasticity problems Korn's inequality will play the same role.


6–4 Lemma : If $u_h$ is a minimizer of the functional F on $V_h$, i.e.
\[ F(u_h) = \frac12\,A(u_h,u_h) - \langle f, u_h\rangle - \langle g_2, u_h\rangle_{\Gamma_2} \le \frac12\,A(v_h,v_h) - \langle f, v_h\rangle - \langle g_2, v_h\rangle_{\Gamma_2} \quad\text{for all } v_h\in V_h\,, \]
then
\[ A(u_h,\phi_h) - \langle f, \phi_h\rangle - \langle g_2, \phi_h\rangle_{\Gamma_2} = 0 \quad\text{for all } \phi_h\in V_h\,. \]
3
Proof : Use Figure 6.13 to visualize the proof. We examine the function along one straight line, i.e. the derivative of
\[ \begin{aligned} g(t) = F(u_h + t\,\phi_h) &= \frac12\,A(u_h + t\,\phi_h, u_h + t\,\phi_h) - \langle f, u_h + t\,\phi_h\rangle - \langle g_2, u_h + t\,\phi_h\rangle_{\Gamma_2}\\ &= \frac12\,A(u_h,u_h) - \langle f, u_h\rangle - \langle g_2, u_h\rangle_{\Gamma_2} + t\,\left(A(u_h,\phi_h) - \langle f, \phi_h\rangle - \langle g_2, \phi_h\rangle_{\Gamma_2}\right) + \frac12\,t^2\,A(\phi_h,\phi_h) \end{aligned} \]
has to vanish at t = 0 for all $\phi_h\in V_h$. Since the above expression is of the form $g(t) = c_0 + c_1\,t + c_2\,t^2$ and $\frac{d}{dt}g(0) = c_1$ we find
\[ A(u_h,\phi_h) - \langle f, \phi_h\rangle - \langle g_2, \phi_h\rangle_{\Gamma_2} = 0\,. \]
This implies the desired result. 2

Figure 6.13: A function to be minimized

6–5 Lemma : Let $u_e\in V$ be the minimizer of the functional F on all of V and let $u_h\in V_h$ be the minimizer of
\[ F(\psi_h) = \frac12\,A(\psi_h,\psi_h) - \langle f, \psi_h\rangle - \langle g_2, \psi_h\rangle_{\Gamma_2} \]
amongst all $\psi_h\in V_h$. This implies that $u_h\in V_h$ is also the minimizer of
\[ G(\psi_h) = A(u_e - \psi_h, u_e - \psi_h) \]
amongst all $\psi_h\in V_h$. 3
Proof : If $u_h\in V_h$ minimizes F in $V_h$ and $u_e\in V$ minimizes F in V then the previous lemma implies
\[ \begin{aligned} A(u_e,\phi_h) &= \langle f, \phi_h\rangle + \langle g_2, \phi_h\rangle_{\Gamma_2} \quad\text{for all } \phi_h\in V_h\\ A(u_h,\phi_h) &= \langle f, \phi_h\rangle + \langle g_2, \phi_h\rangle_{\Gamma_2} \quad\text{for all } \phi_h\in V_h \end{aligned} \]
and thus $A(u_e - u_h,\phi_h) = 0$. This leads to
\[ \begin{aligned} G(u_h + \phi_h) &= A(u_e - u_h - \phi_h, u_e - u_h - \phi_h)\\ &= A(u_e - u_h, u_e - u_h) - 2\,A(u_e - u_h, \phi_h) + A(\phi_h,\phi_h)\\ &= A(u_e - u_h, u_e - u_h) + A(\phi_h,\phi_h)\\ &\ge A(u_e - u_h, u_e - u_h)\,. \end{aligned} \]
Equality occurs only if $\phi_h = 0$. Thus $\phi_h = 0\in V_h$ is the unique minimizer of the above function and the result is established. 2

6–6 Theorem : (Abstract error estimate, Lemma of Céa)
If $u_e$ is the minimizer of the functional
\[ F(u) = \frac12\,A(u,u) - \langle f, u\rangle - \langle g_2, u\rangle_{\Gamma_2} \]
amongst all $u\in V$ and $u_h\in V_h$ is the minimizer of F amongst all $u_h$ in the subspace $V_h\subset V$, then the distance of $u_e$ and $u_h$ (in the V–norm) can be estimated. There exists a positive constant k such that
\[ \|u_e - u_h\|_V \le k\,\min_{\psi_h\in V_h}\|u_e - \psi_h\|_V\,. \]
The constant k is independent of h. 3

It assumes that the integrations are carried out without error. Since we will use a Gauss integration this is not far from the truth.
As a consequence of Céa's Lemma we have to be able to approximate the exact solution $u_e\in V$ by approximating functions $\psi_h\in V_h$, and the error of the finite element solution $u_h\in V_h$ is smaller than this approximation error, except for the factor k. Thus the Lemma of Céa reduces the question of estimating the error of the approximate solution to a question of estimating the approximation error for a given function (the exact solution) in the energy norm. Standard interpolation results allow to estimate the error of the approximation, assuming some regularity on the exact solution u.
Proof : Use the inequality (6.10) and the above lemma to conclude that
\[ \gamma_0\,\|u_e - u_h\|_V^2 \le A(u_e - u_h, u_e - u_h) \le A(u_e - u_h - \phi_h, u_e - u_h - \phi_h) \le (\alpha_1+\beta_1)\,\|u_e - u_h - \phi_h\|_V^2 \quad\text{for all } \phi_h\in V_h \]
and thus
\[ \|u_e - u_h\|_V \le \sqrt{\frac{\alpha_1+\beta_1}{\gamma_0}}\,\|u_e - u_h - \phi_h\|_V \quad\text{for all } \phi_h\in V_h\,. \]
As $\phi_h\in V_h$ is arbitrary find the claimed result. 2

6–7 Observation : Connecting FEM and finite difference method


When working with the method of finite differences in Chapter 4 the Lax Equivalence Theorem 4–2 (page 262)
is used to conclude that consistency and stability imply convergence. There is a similar path for the Finite
Element Method:
• The concept of consistency of a finite difference approximation is replaced by approximation results,
based on piecewise interpolation. By general approximation methods one shows that “any” function
can be approximated by piecewise linear functions (Result 6–8) or piecewise quadratic functions (Re-
sult 6–10). These results lead to the order of convergence, i.e. the approximation error is proportional
to hp for some order p, where h is the typical size of an element.


• The concept of stability of a finite difference approach is replaced by the coercive bilinear form
A(u, u), e.g. inequality (6.10). These rather mathematical results might be difficult to prove. For
second order boundary value problems of the type (6.9) Poincaré’s inequality has to be used. For
elasticity problems the tool to be used is Korn’s inequality.

• The abstract error estimate in Theorem 6–6 then implies that the FEM approximation uh converges
to the true solution u0 , in some very specific sense. Thus Theorem 6–6 replaces the Lax Equivalence
Theorem 4–2.

Example 6–1 (page 412) illustrates that in special cases a finite element approximation is equivalent to a
finite difference approximation. ♦

6.4.4 Approximation by Piecewise Linear Interpolation


Let Ω ⊂ R2 be a bounded polygonal domain, divided into triangles of typical diameter h. No corner of a
triangle can be placed in the middle of a triangle side. In addition we require a minimal angle condition:
no angle is smaller than a given minimal angle. Equivalently require that the ratio of the triangle diameter
h and the radius of the inscribed circle be smaller than a given fixed value. This is often mentioned as the
quality of the mesh. A typical triangulation is shown in Figure 6.3 on page 408.
Then compute the value of the function u(x, y) at each corner of the triangles. Within each triangle the
function is replaced by a linear function. Thus an interpolating function Πh u is constructed. The operator
Πh can be considered a projection operator of V onto the finite dimensional subspace Vh .

Πh : V −→ Vh , u 7→ Πh u

For two neighboring triangles the interpolated functions will coincide along the common edge, since the
linear functions coincide at the two corners. Thus the interpolated function is continuous on the domain
and we have conforming elements. The interpolated function Πh u and the original function u coincide
if u happens to be a linear function. By considering a Taylor expansion one can verify10 that the typical
approximation error of the function is of the order c h2 , where the constant c depends on higher order
derivatives of u. The error of the gradient is of order h.

6–8 Result : (Piecewise linear interpolation)


Assume that a function u is at least twice differentiable on the domain Ω and use the piecewise linear
interpolation Πh u. Approximation theory implies that there is a constant M (depending on second order
derivatives of u), such that

\[ \begin{aligned} |u(\vec x) - \Pi_h u(\vec x)| &\le M\,h^2 \quad\text{for all } \vec x\in\Omega\\ |\nabla u(\vec x) - \nabla(\Pi_h u(\vec x))| &\le M\,h \quad\text{for all } \vec x\in\Omega\,. \end{aligned} \]
Thus an integration implies that there is a constant c such that
\[ \|u - \Pi_h u\|_V \le c\,h\,|u|_2 \qquad\text{and}\qquad \|u - \Pi_h u\|_2 \le c\,h^2\,|u|_2 \]
where
\[ |u|_2^2 = \iint_\Omega \left(\frac{\partial^2 u}{\partial x^2}\right)^2 + \left(\frac{\partial^2 u}{\partial x\,\partial y}\right)^2 + \left(\frac{\partial^2 u}{\partial y^2}\right)^2\,dA\,. \]

The constant c does not depend on h and the function u, as long as a minimal angle condition is satisfied. 3
¹⁰ Use the fact that the quadratic terms in the Taylor expansion lead to an approximation error. For an error vanishing at the nodes at x = 0 and h we use a function $f(x) = a\cdot x\cdot(h-x)$ with derivatives $f'(x) = a\,(h-2x)$ and $f''(x) = -2a$. Since the maximal value $a\,h^2/4$ is attained at $h/2$ we find $|f(x)| \le \frac{a\,h^2}{4} = \frac{h^2}{8}\,\max|f''|$ and $|f'(x)| \le a\,h = \frac{h}{2}\,\max|f''|$ for all $0\le x\le h$.


An exact statement and proof of this result is given in [Brae02, §II.6]. The result is based on the
fundamental Bramble–Hilbert–Lemma.

Now we have all the ingredients to state and prove the basic convergence results for finite element solutions to boundary value problems in two variables. The exact solution $u_e\in V$ to be approximated is the minimizer of the functional
\[ F(u) = \iint_\Omega \frac12\,a\,(\nabla u)^2 + \frac12\,b_0\,u^2 - f\,u\,dA - \int_{\Gamma_2} g_2\,u\,ds\,. \]

On a smooth domain Ω ⊂ R2 the exact solution ue is smooth (often differentiable) if a, b0 , f and g2 are
smooth. Instead of searching on the space V we restrict the search on the finite dimensional subspace Vh
and arrive at the approximate minimizer uh . Thus the error function e = uh − ue has to be as small as
possible for the approximation to be of a good quality. In fact we hope for a convergence

uh −→ ue as h −→ 0

in some sense to be specified.

6–9 Theorem : Examine the boundary value problem (6.9) where the conditions are such that the unique solution u and all its partial derivatives up to order 2 are square integrable over the domain Ω. If the subspace $V_h$ is generated by the piecewise linear interpolation operator $\Pi_h$ then we find
\[ \|u_h - u_e\|_V \le C\,h \qquad\text{and}\qquad \|u_h - u_e\|_2 \le C_1\,h^2 \]
for some constants C and C₁ independent of h.
We may say that
• $u_h$ converges to $u_e$ with an error proportional to h² as h → 0.
• $\nabla u_h$ converges to $\nabla u_e$ with an error proportional to h as h → 0.
3

Observe that the above estimates are not point-wise estimates. It is the integrals of the solution and its
derivatives that are controlled.
Proof : The interpolation result 6–8 and the abstract error estimate 6–6 imply immediately
\[ \|u_h - u_e\|_V \le k\,\min_{\phi_h\in V_h}\|\phi_h - u_e\|_V \le k\,\|\Pi_h u_e - u_e\|_V \le k\,c\,h\,. \]

This is the first of the desired estimates. It shows that the error of the function and its first order partial
derivatives are approximately proportional to h. The second estimate states that the error of the function
is proportional to h2 . This second estimate requires considerably more work. The method of proof is
known as Nitsche trick and is due to Joachim Nitsche and Jean-Pierre Aubin. A good presentation is given
in [StraFix73, §3.4] or [KnabAnge00, Satz 3.37]. For sake of completeness a similar presentation is shown
below.
Use the notation $e = u_h - u_e$ and equation (6.10) to conclude
\[ A(e,e) \le (\alpha_1+\beta_1)\,\|e\|_V^2 \le (\alpha_1+\beta_1)\,k^2\,c^2\,h^2\,. \]
Let $w\in V$ be the minimizer of the functional
\[ \frac12\,A(w,w) + \langle e, w\rangle\,. \]
Thus w is a solution of the boundary value problem
\[ \begin{array}{rcll} -\nabla\cdot(a\,\nabla w) + b_0\,w &=& e & \text{for } (x,y)\in\Omega\\ w &=& 0 & \text{for } (x,y)\in\Gamma_1\\ \vec n\cdot(a\,\nabla w) &=& 0 & \text{for } (x,y)\in\Gamma_2\,. \end{array} \]

Regularity theory now implies that the second order derivatives of w are bounded by the values of e (in the
L2 sense) or more precisely
|w|2 ≤ c kek2 = c kuh − ue k2 .
The interpolation result 6–8 leads to

kw − Πh wkV ≤ c1 h |w|2 ≤ c2 h kuh − ue k2

Using Theorem 6–6 conclude

A(w, w) ≤ A(w − Πh w, w − Πh w) ≤ (α1 + β1 ) kw − Πh wk2V ≤ (α1 + β1 ) c22 h2 kuh − ue k22 .

Since w is a minimizer of the functional find
\[ A(w,\psi) + \langle e, \psi\rangle = 0 \quad\text{for all } \psi\in V\,. \]
By choosing ψ = e arrive at
\[ -A(w,e) = \langle e, e\rangle = \|e\|_2^2 = \iint_\Omega |u_h - u_e|^2\,dA\,. \]

Now use the Cauchy–Schwarz inequality to conclude that
\[ \begin{aligned} \|e\|_2^2 = \iint_\Omega |u_h - u_e|^2\,dA = |A(w,e)| &= \left|\iint_\Omega a\,\nabla w\cdot\nabla e + b_0\,w\,e\,dA\right|\\ &\le \left(A(w,w)\right)^{1/2}\cdot\left(A(e,e)\right)^{1/2} \le \sqrt{\alpha_1+\beta_1}\,c_2\,h\,\|u_h - u_e\|_2\cdot\sqrt{\alpha_1+\beta_1}\,k\,c\,h\,. \end{aligned} \]
A division by $\|e\|_2 = \|u_h - u_e\|_2$ leads to
\[ \|u_h - u_e\|_2 \le C_1\,h^2\,. \]

This is the claimed second convergence estimate. 2

6.4.5 Approximation by Piecewise Quadratic Interpolation


We start with a triangulation similar to the piecewise linear interpolation, but we use a piecewise quadratic interpolation on each triangle, i.e. we examine an interpolating function as in Figure 6.14. The six coefficients $c_k$ can be used to assure that the interpolated function coincides with the given function at the corners and the midpoints of the edges of the triangle, as shown in Figure 6.14. Along each edge we find a quadratic function, uniquely determined by the values at the three points. Thus the interpolating functions from neighboring triangles will coincide on all points on the common edge. Thus we find again conforming elements.
Thus we arrive again at a projection operator $\Pi_h$ from V onto the finite dimensional subspace $V_h$. We have
\[ \Pi_h : V\longrightarrow V_h\,,\qquad u\mapsto\Pi_h u\,. \]
The resulting functions $\Pi_h u$ are continuous on the domain and on each triangle we have a quadratic function. The interpolated function $\Pi_h u$ and the original function u coincide if u happens to be a quadratic function.


\[ f(x,y) = c_1 + c_2\,x + c_3\,y + c_4\,x^2 + c_5\,x\,y + c_6\,y^2 \]

Figure 6.14: Quadratic interpolation on a triangle

By considering a Taylor expansion one can verify11 that the typical approximation error of the function is
of the order c h3 where the constant c depends on higher order derivatives of u. The error of the gradient is
of order h2 .

6–10 Result : (Piecewise quadratic interpolation)


Assume that a function u is at least three times differentiable on the domain Ω and use the piecewise
quadratic interpolation Πh u. Approximation theory implies that there is a constant M (depending on
third order derivatives of u), such that

\[ \begin{aligned} |u(\vec x) - \Pi_h u(\vec x)| &\le M\,h^3 \quad\text{for all } \vec x\in\Omega\\ |\nabla u(\vec x) - \nabla(\Pi_h u(\vec x))| &\le M\,h^2 \quad\text{for all } \vec x\in\Omega\,. \end{aligned} \]
Thus an integration implies that there is a constant c such that
\[ \|u - \Pi_h u\|_V \le c\,h^2\,|u|_3 \qquad\text{and}\qquad \|u - \Pi_h u\|_2 \le c\,h^3\,|u|_3 \]
where $|u|_3^2$ is the sum of all squared and integrated partial derivatives of order 3. The constant c does not depend on h and the function u, as long as a minimal angle condition is satisfied. 3
An exact statement and proof of this result is given in [Brae02, §II.6]. The result is based on the
fundamental Bramble–Hilbert–Lemma. Based on this interpolation estimate we can again formulate the
basic convergence result for a finite element approximation using piecewise quadratic approximations.

6–11 Theorem : Examine the boundary value problem (6.9) where the conditions are such that the unique solution u and all its partial derivatives up to order 3 are square integrable over the domain Ω. If the subspace $V_h$ is generated by the piecewise quadratic interpolation operator $\Pi_h$ then we find
\[ \|u_h - u_e\|_V \le C\,h^2 \qquad\text{and}\qquad \|u_h - u_e\|_2 \le C_1\,h^3 \]
for some constants C and C₁ independent of h.
We may say that
• $u_h$ converges to $u_e$ with an error proportional to h³ as h → 0.
• $\nabla u_h$ converges to $\nabla u_e$ with an error proportional to h² as h → 0.
3

¹¹ Use the fact that the cubic terms in the Taylor expansion lead to an approximation error. For an error vanishing at the nodes at x = 0 and ±h we use a function $f(x) = a\cdot x\cdot(h^2 - x^2)$ with derivatives $f'(x) = a\,(h^2 - 3x^2)$, $f''(x) = -6\,a\,x$ and $f'''(x) = -6\,a$. The maximal value $\frac{2\,a\,h^3}{3\sqrt3}$ of the function is attained at $\pm h/\sqrt3$ and we find $|f(x)| \le c\,h^3\,\max|f'''|$ and $|f'(x)| \le c\,h^2\,\max|f'''|$ for all $-h\le x\le h$.


Proof : The interpolation result 6–10 and the abstract error estimate 6–6 imply immediately
\[ \|u_h - u_e\|_V \le k\,\min_{\psi_h\in V_h}\|\psi_h - u_e\|_V \le k\,\|\Pi_h u_e - u_e\|_V \le k\,c\,h^2\,, \]
which is already the first of the desired estimates. The second estimate has to be verified with the Aubin–Nitsche method, as in the proof of Theorem 6–9. 2

Observe that the convergence with quadratic interpolation (Theorem 6–11) is improved by a factor of
h compared to the linear interpolation, i.e. Theorem 6–9. Thus one might be tempted to increase the order
of the approximating polynomials further and further. But there are also reasons that speak against such a
process:

• Carrying out the interpolation will get more and more difficult. In particular the continuity across the
edges of the triangles is not easily obtained. It is more difficult to construct higher order conforming
elements.

• For higher order approximations to be effective we need bounds on higher order derivatives of the
exact solution ue . This might be difficult or impossible to achieve. If the domain is a polygon, there
will be corners and smoothness of the solution is far from obvious. Some of the coefficient functions
in the BVP (6.9) might not be smooth, e.g. by two different materials used in the domain. Thus we
might not benefit from a higher order convergence with higher order elements.

In the interior of the domain Ω smoothness of the exact solution ue is often true and with higher or-
der approximations we get a faster convergence. Thus piecewise approximations of orders 1, 2 and 3 are
regularly used. In the next section a detailed construction of second order elements is presented.

Presentations rather similar to the above can be found in many books on FEM. As example consult
[Brae02] for a proof of energy estimates and also for error estimates in the L∞ norm, i.e. point wise
estimates. In [AxelBark84] find Céa’s lemma and regularity results for non-symmetric problems.

6.5 Construction of Triangular Second Order Elements


In this section we construct a second order FEM approximation to the solution of the BVP shown as equa-
tion (6.8) on page 416.

\[ \begin{array}{rcll} -\nabla\cdot(a\,\nabla u + u\,\vec b) + b_0\,u &=& f & \text{for } (x,y)\in\Omega\\ u &=& g_1 & \text{for } (x,y)\in\Gamma_1\\ \vec n\cdot(a\,\nabla u + u\,\vec b) &=& g_2 + g_3\,u & \text{for } (x,y)\in\Gamma_2 \end{array} \]
The function u is a weak solution of the above BVP if it satisfies the boundary condition u = g₁ on Γ₁ and for all test functions φ (vanishing on Γ₁) we have the integral condition
\[ 0 = \iint_\Omega \nabla\phi\cdot(a\,\nabla u + u\,\vec b) + \phi\,(b_0\,u - f)\,dA - \int_{\Gamma_2} \phi\,(g_2 + g_3\,u)\,ds\,. \]

The domain Ω ⊂ R² is triangulated and the values of the function u at the corners and the midpoints of the edges of the triangles are considered as degrees of freedom of the system. This leads to a vector $\vec u$ to be determined. We have to find a discretized version of the integrals in the above functions and determine the global stiffness matrix A such that the above integral condition translates to
\[ 0 = \langle\mathbf A\,\vec u,\vec\phi\rangle + \langle\mathbf W\vec f,\vec\phi\rangle\,. \]


Then the discretized solution is given as solution of the linear system of equations
\[ \mathbf A\,\vec u + \mathbf W\vec f = \vec 0\,. \]
All computations should be formulated with matrices, such that an implementation with Octave/MATLAB will be easy.
The order of presentation is as follows:

• Integration over a general triangle, using Gauss integration.

• Examine the basis functions for a second order element on the standard triangle.

• Integrate a function using the values at the corners and midpoints.

• Show all the integrals to be computed.

• Integration of f φ .

• Integration of b0 u φ .

• Transformation of the gradient from standard triangle to general triangle.

• Integration of u ~b ∇φ .

• Integration of a ∇u ∇φ .

6.5.1 Integration over a General Triangle


Transformation of coordinates
All of the necessary integrals for the FEM method are integrals over general triangles E. These can be
written as images of a standard triangle in a (ξ, ν)–plane, according to Figure 6.15. The transformation is
given by the linear (resp. affine) mapping
Figure 6.15: Transformation of standard triangle to general triangle

\[ \begin{pmatrix}x\\y\end{pmatrix} = \begin{pmatrix}x_1\\y_1\end{pmatrix} + \xi\begin{pmatrix}x_2-x_1\\y_2-y_1\end{pmatrix} + \nu\begin{pmatrix}x_3-x_1\\y_3-y_1\end{pmatrix} = \begin{pmatrix}x_1\\y_1\end{pmatrix} + \begin{bmatrix}x_2-x_1 & x_3-x_1\\ y_2-y_1 & y_3-y_1\end{bmatrix}\cdot\begin{pmatrix}\xi\\\nu\end{pmatrix} = \begin{pmatrix}x_1\\y_1\end{pmatrix} + \mathbf T\cdot\begin{pmatrix}\xi\\\nu\end{pmatrix} \]
where
\[ \mathbf T = \begin{bmatrix}x_2-x_1 & x_3-x_1\\ y_2-y_1 & y_3-y_1\end{bmatrix} = \frac{\partial(x,y)}{\partial(\xi,\nu)}\,. \]

If the coordinates (x, y) are given find the values of (ξ, ν) with the help of
\[ \begin{pmatrix}\xi\\\nu\end{pmatrix} = \mathbf T^{-1}\cdot\begin{pmatrix}x-x_1\\y-y_1\end{pmatrix} = \frac{1}{\det(\mathbf T)}\begin{bmatrix}y_3-y_1 & -x_3+x_1\\ -y_2+y_1 & x_2-x_1\end{bmatrix}\cdot\begin{pmatrix}x-x_1\\y-y_1\end{pmatrix}\,. \]

Integration over the standard triangle and Gauss integration

If a function f(x, y) is to be integrated over the triangle E use the transformation identity
\[ \iint_E f\,dA = \iint_\Omega f(\vec x(\xi,\nu))\,\left|\det\frac{\partial(x,y)}{\partial(\xi,\nu)}\right|\,d\xi\,d\nu = |\det\mathbf T|\int_0^1\!\!\int_0^{1-\nu} f(\vec x(\xi,\nu))\,d\xi\,d\nu\,. \tag{6.11} \]

The Jacobi determinant is given by
\[ \left|\det\frac{\partial(x,y)}{\partial(\xi,\nu)}\right| = |\det\mathbf T| = |(x_2-x_1)\,(y_3-y_1) - (x_3-x_1)\,(y_2-y_1)|\,. \]

Since the area of the standard triangle Ω is ½, the area of E is given by ½ |det T|. For a numerical integration over the standard triangle Ω choose some integration points $\vec g_j\in\Omega$ and the corresponding weights $w_j$ for $j = 1,2,\ldots,m$ and then work with
\[ \iint_\Omega f(\vec\xi)\,dA \approx \sum_{j=1}^m w_j\,f(\vec g_j)\,. \tag{6.12} \]

The integration points and weights have to be chosen, such that the integration error is as small as possible.
There are many integration schemes available. One of the early papers is [Cowp73] and a list is shown
in [Hugh87, Table 3.1.1, p. 173] and a short list in [TongRoss08]12 , [Gmur00, Tableau D3, page 233]
or [Zien13, Table 6.3, p. 181].
As a concrete and useful example use the points g1 = 12 (λ1 , λ1 ) and g4 = 12 (λ2 , λ2 ) along the diagonal
ξ = ν. Similarly use two more points along each connecting straight line from a corner of the triangle to
the midpoint of the opposing edge. This leads to a total of 6 integration points where groups of 3 have the
same weight, i.e. w1 = w2 = w3 and w4 = w5 = w6 . Finally add the midpoint with weight w7 . This
is illustrated in Figure 6.16. The nodes are shown in red and the integration points in blue. This choice
satisfies two essential conditions:

• If a sample point is used in a Gauss integration, then all other points obtainable by permuting the three
corners of the triangle must appear and with identical weight.

• All sample points must be inside the triangle (or on the triangle boundary) and all weights must be
positive.

Then one arrives at a 7 × 2 matrix G (see equation (6.13)) containing in each row the coordinates of one
integration point ~gj and a vector w
~ with the corresponding integration weights.
¹² The results are known as Hammer's formula. There is a typing error in [TongRoss08, Table 6.2, page 190]. For the integration scheme using four points the coefficient for the central point is negative, i.e. −0.5625. Thus the scheme should not be used for FEM codes, since there is a danger that the stiffness matrix will not be positive definite. For integration purposes the scheme does just fine.

Figure 6.16: Gauss integration of order 5 on the standard triangle, using 7 integration points

To determine the optimal values setup and solve a system of five nonlinear equations for the unknowns λ₁, λ₂, w₁, w₄ and w₇. The integration is approximated by
\[ \iint_\Omega f\,dA \approx w_1\,\left(f(\vec\xi_1) + f(\vec\xi_2) + f(\vec\xi_3)\right) + w_4\,\left(f(\vec\xi_4) + f(\vec\xi_5) + f(\vec\xi_6)\right) + w_7\,f(\vec\xi_7)\,. \]
We require that $\xi^k$ for $0\le k\le 5$ be integrated exactly. This generates five equations to be solved¹³. Due to the symmetric arrangement of the integration points this implies that all polynomials up to degree 5 are integrated exactly. The equations to be solved are generated by elementary computations.
\[ \begin{aligned} \iint_\Omega 1\,dA &= \frac12 = 3\,w_1 + 3\,w_4 + w_7\\ \iint_\Omega \xi\,dA &= \frac16 = w_1\,\left(\frac{\lambda_1}{2} + 1 - \lambda_1 + \frac{\lambda_1}{2}\right) + w_4\,\left(\frac{\lambda_2}{2} + 1 - \lambda_2 + \frac{\lambda_2}{2}\right) + \frac{w_7}{3}\\ &= w_1\cdot 1 + w_4\cdot 1 + \frac{w_7}{3} \qquad\text{equivalent to the above condition}\\ \iint_\Omega \xi^2\,dA &= \frac1{12} = w_1\,\left(1 - 2\,\lambda_1 + \frac32\,\lambda_1^2\right) + w_4\,\left(1 - 2\,\lambda_2 + \frac32\,\lambda_2^2\right) + \frac{w_7}{9}\\ \iint_\Omega \xi^3\,dA &= \frac1{20} = w_1\,\left(1 - 3\,\lambda_1 + 3\,\lambda_1^2 - \frac34\,\lambda_1^3\right) + w_4\,\left(1 - 3\,\lambda_2 + 3\,\lambda_2^2 - \frac34\,\lambda_2^3\right) + \frac{w_7}{27}\\ \iint_\Omega \xi^4\,dA &= \frac1{30} = w_1\,\left(1 - 4\,\lambda_1 + 6\,\lambda_1^2 - 4\,\lambda_1^3 + \frac98\,\lambda_1^4\right) + w_4\,\left(1 - 4\,\lambda_2 + 6\,\lambda_2^2 - 4\,\lambda_2^3 + \frac98\,\lambda_2^4\right) + \frac{w_7}{81}\\ \iint_\Omega \xi^5\,dA &= \frac1{42} = w_1\,\left(1 - 5\,\lambda_1 + 10\,\lambda_1^2 - 10\,\lambda_1^3 + 5\,\lambda_1^4 - \frac{15}{16}\,\lambda_1^5\right)\\ &\qquad\qquad + w_4\,\left(1 - 5\,\lambda_2 + 10\,\lambda_2^2 - 10\,\lambda_2^3 + 5\,\lambda_2^4 - \frac{15}{16}\,\lambda_2^5\right) + \frac{w_7}{243} \end{aligned} \]
¹³ As example we consider the case $f(\xi) = \xi^2$ with some more details.
\[ \iint_\Omega \xi^2\,dA = \int_0^1\!\!\int_0^{1-\xi} \xi^2\,d\nu\,d\xi = \int_0^1 (1-\xi)\,\xi^2\,d\xi = \left[\frac{\xi^3}{3} - \frac{\xi^4}{4}\right]_{\xi=0}^1 = \frac1{12} \]
\[ \begin{aligned} \sum_{j=1}^7 w_j\,f(\vec g_j) &= w_1\,\left(\left(\frac{\lambda_1}{2}\right)^2 + (1-\lambda_1)^2 + \left(\frac{\lambda_1}{2}\right)^2\right) + w_4\,\left(\left(\frac{\lambda_2}{2}\right)^2 + (1-\lambda_2)^2 + \left(\frac{\lambda_2}{2}\right)^2\right) + w_7\,\left(\frac13\right)^2\\ &= w_1\,\left(1 - 2\,\lambda_1 + \frac32\,\lambda_1^2\right) + w_4\,\left(1 - 2\,\lambda_2 + \frac32\,\lambda_2^2\right) + \frac{w_7}{9} \end{aligned} \]

The above leads to a system of five nonlinear equations for the five unknowns λ1 , λ2 , w1 , w4 and w7 . This
can be solved by Octave/MATLAB or Mathematica 14 . There are multiple solutions possible, but you need a
solution satisfying the following key properties:

• Pick a solution with 0 < λ1 < λ2 < 1. This corresponds to the desirable result that all Gauss points
are inside the triangle.

• Pick a solution with positive weights w1 , w4 and w7 . This guarantees that the integral of a positive
function is positive.

The equation resulting from the integral of ξ over the triangle is identical to the equation generated by the
integral of 1 and thus not taken into account. Due to the symmetry of the Gauss points and the weights one
can verify that all polynomials up to degree 5 are integrated exactly, e.g. ν, ν 5 , ξ 2 ν 3 , . . .
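These five equations can also be solved numerically; below is a minimal sketch with Octave and fsolve, where the function name GaussEquations and the starting guess are chosen for this illustration only and the guess is taken close to the desired root.
Octave
% equations for the Gauss parameters p = [lambda1; lambda2; w1; w4; w7]
function res = GaussEquations(p)
  la1 = p(1); la2 = p(2); w1 = p(3); w4 = p(4); w7 = p(5);
  S = @(la,k) 2*(la/2).^k + (1-la).^k;   % sum of xi^k over one group of three points
  exact = @(k) 1/((k+1)*(k+2));          % integral of xi^k over the standard triangle
  kk = [0 2 3 4 5];                      % k = 1 is dropped, it duplicates k = 0
  res = zeros(5,1);
  for i = 1:5
    res(i) = w1*S(la1,kk(i)) + w4*S(la2,kk(i)) + w7*(1/3)^kk(i) - exact(kk(i));
  end%for
end%function

p0 = [0.1; 0.5; 0.06; 0.07; 0.11];       % starting guess
p  = fsolve(@GaussEquations,p0)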
The solution is given by
\[ G = \begin{bmatrix} \lambda_1/2 & \lambda_1/2\\ 1-\lambda_1 & \lambda_1/2\\ \lambda_1/2 & 1-\lambda_1\\ \lambda_2/2 & \lambda_2/2\\ 1-\lambda_2 & \lambda_2/2\\ \lambda_2/2 & 1-\lambda_2\\ 1/3 & 1/3 \end{bmatrix} \approx \begin{bmatrix} 0.101287 & 0.101287\\ 0.797427 & 0.101287\\ 0.101287 & 0.797427\\ 0.470142 & 0.470142\\ 0.059716 & 0.470142\\ 0.470142 & 0.059716\\ 0.333333 & 0.333333 \end{bmatrix},\qquad \vec w = \begin{pmatrix} w_1\\w_2\\w_3\\w_4\\w_5\\w_6\\w_7 \end{pmatrix} \approx \begin{pmatrix} 0.0629696\\0.0629696\\0.0629696\\0.0661971\\0.0661971\\0.0661971\\0.1125000 \end{pmatrix}. \tag{6.13} \]
¹⁴ The exact values are $\lambda_1 = (12 - 2\sqrt{15})/21$, $\lambda_2 = (12 + 2\sqrt{15})/21$, $w_1 = (155 - \sqrt{15})/2400$, $w_4 = (155 + \sqrt{15})/2400$ and $w_7 = 9/80$.

Using the transformation results in this section compute the coordinates $X_G$ for the Gauss integration points in a general triangle by
\[ X_G = \begin{pmatrix}x_1\\y_1\end{pmatrix} + \begin{bmatrix}x_2-x_1 & x_3-x_1\\ y_2-y_1 & y_3-y_1\end{bmatrix}\cdot G^T = \begin{pmatrix}x_1\\y_1\end{pmatrix} + \mathbf T\cdot G^T\,. \]

This approximate integration yields the exact results for polynomials up to degree 5. Thus for one triangle with diameter h and an area of the order h² the integration error for smooth functions is of the order h⁶·h² = h⁸. When dividing a large domain into sub-triangles of size h this leads to a total integration error of the order h⁶. For most problems this error will be considerably smaller than the approximation error of the FEM method, thus one can ignore this error contribution and we will from now on assume that the integrations yield exact results.
The above can be implemented in Octave or MATLAB.
IntegrationTriangle.m
function res = IntegrationTriangle(corners,func)

la1 = (12 - 2*sqrt(15))/21; la2 = (12 + 2*sqrt(15))/21;


w1 = (155 - sqrt(15))/2400; w2 = (155 + sqrt(15))/2400;
w3 = 0.1125; w = [w1,w1,w1,w2,w2,w2,w3];

G = [la1/2 la1/2; 1-la1,la1/2; la1/2, 1-la1;


la2/2 la2/2; 1-la2,la2/2; la2/2, 1-la2; 1/3 1/3];

T = [corners(2,:)-corners(1,:);corners(3,:)-corners(1,:)];



if ischar(func)
P = G*T; P(:,1) = P(:,1)+corners(1,1); P(:,2) = P(:,2)+corners(1,2);
val = feval(func,P);
else
val = func(:);
end%if

res = w*val*abs(det(T));
end%function

and the above function can be tested by integrating the function x + y 2 over a triangle.
Octave
corners = [0 0; 1 0; 0 1];

function res = intFunc(xy)


x = xy(:,1); y = xy(:,2);
res = x+y.ˆ2;
end%function

IntegrationTriangle(corners,’intFunc’)

Observations:
• To save computation time some FEM algorithms use a simplified numerical integration. If the func-
tions to be examined are close to constants over each triangle, then the error is acceptable.

• It is important to observe that the functions and the solutions are only evaluated at the integration
points. This may lead to surprising (and wrong) results. Keywords: hourglassing and shear locking.

• The material properties are usually given by coefficient functions, e.g. for Young’s modulus E and
Poisson’s ratio ν. Thus these properties are evaluated at the Gauss points, but not at the nodes. This
can lead to surprising extrapolation effects, e.g. a material constraint might not be satisfied at the
nodes, but only at the integration points.

6.5.2 The Basis Functions for a Second Order Element


There are 6 linearly independent polynomials of degree 2 or less, namely 1, x, y, x2 , y 2 and x · y. Examine
the standard triangle Ω in Figure 6.15 with the values of a function f (ξ, ν) at the corners and at the midpoints
of the edges. Use the numbering as shown in Figure 6.15. Now construct polynomials φi (ξ, ν) of degree 2,
such that (
1 if i = j
Φi (ξj , νj ) = δi,j = ,
0 if i 6= j
i.e. each basis function is equal to 1 at one of the nodes and vanishes on all other nodes. These basis
polynomials are given by

Φ1 (ξ, ν) = (1 − ξ − ν) (1 − 2 ξ − 2 ν)
Φ2 (ξ, ν) = ξ (2 ξ − 1)
Φ3 (ξ, ν) = ν (2 ν − 1)
Φ4 (ξ, ν) = 4 ξ ν
Φ5 (ξ, ν) = 4 ν (1 − ξ − ν)
Φ6 (ξ, ν) = 4 ξ (1 − ξ − ν)

and find their graphs in Figure 6.17. Any quadratic polynomial f on the standard triangle Ω can be written

SHA 21-5-21
CHAPTER 6. FINITE ELEMENT METHODS 433

1 1
1 1

0 0
ν ν
0 0

ξ 0 ξ 0
1 1

(a) Φ1 (ξ, ν) = (1 − ξ − ν) (1 − 2 ξ − 2 ν) (b) Φ2 (ξ, ν) = ξ (2 ξ − 1)

1 1
1 1

0
0
ν ν
0 0

ξ 0 ξ 0
1 1

(c) Φ3 (ξ, ν) = ν (2 ν − 1) (d) Φ4 (ξ, ν) = 4 ξ ν

1 1
1 1

0 0
ν ν
0 0

ξ 0 ξ 0
1 1

(e) Φ5 (ξ, ν) = 4 ν (1 − ξ − ν) (f) Φ6 (ξ, ν) = 4 ξ (1 − ξ − ν)

Figure 6.17: Basis functions for second order triangular elements

SHA 21-5-21
CHAPTER 6. FINITE ELEMENT METHODS 434

as linear combination of the basis functions by using


6
X 6
X
f (ξ, ν) = f (xi , yi ) Φi (ξ, ν) = fi Φi (ξ, ν) .
i=1 i=1

6.5.3 Integration of Functions Given at the Nodes


Integration of quadratic functions
Compute the values of the basis functions Φi (ξ, ν) at the Gauss points ~gj by mj,i = Φi (~gj ) to find
6
X 6
X
f (~gj ) = fi Φi (~gj ) = mj,i fi
i=1 i=1
or using a matrix notation
     
f (~g1 ) m1,1 m1,2 · · · m1,6 f1
     
 f (~g2 )   m2,1 m2,2 · · · m2,6   f2 
= . ·  = M · f~ .
     
 . .. .. .. ..
 ..   .. . . .   . 
     
f (~g7 ) m7,1 m7,2 · · · m7,6 f6
The integration formula (6.12) now leads to
ZZ 7
~ , M · f~i .
X
f (ξ, ν) dA ≈ wj f (~gj ) = hw
Ω j=1

If the numbers fi represent the values of a quadratic function f (x, y) at the corners and midpoints of a
general triangle then use the above and the transformation rule to conclude
ZZ
~ , M · f~i = | det T| hMT · w
f (~x) dA = | det T| hw ~ , f~i . (6.14)
E
The interpolation matrix M
 
+0.4743526 −0.0807686 −0.0807686 0.0410358 0.3230744 0.3230744
 
 −0.0807686 +0.4743526 −0.0807686 0.3230744 0.0410358 0.3230744 
 
 
 −0.0807686 −0.0807686 +0.4743526 0.3230744 0.3230744 0.0410358 
 
M ≈  −0.0525839 −0.0280749 −0.0280749 0.8841342 0.1122998 0.1122998 
 
 
 −0.0280749 −0.0525839 −0.0280749 0.1122998 0.8841342 0.1122998 
 
 
 −0.0280749 −0.0280749 −0.0525839 0.1122998 0.1122998 0.8841342 
 
−0.1111111 −0.1111111 −0.1111111 0.4444444 0.4444444 0.4444444
and the weight vector w~ do not depend on the triangle E, but only on the standard elements and the choice
of integration method. Thus for a new triangle E only the determinant of T has to be computed. Since
 
0
 
 0 
 
 
T
 0 
M ·w ~ = 
 1/6 
 
 
 1/6 
 
1/6

SHA 21-5-21
CHAPTER 6. FINITE ELEMENT METHODS 435

the integration of quadratic functions by (6.14) is rather easy to do: add up the values of the function at the
three mid-points of the edges, then divide the result by 6 and multiply by | det T|.

Integration of a product of quadratic functions


Let f and g be quadratic functions given at the nodal points in the general triangle E. Then the integral of
the product can be computed by
ZZ
f (~x) g(~x) dA ≈ | det T| hM · f~ , diag(w) ~ · M · f~ , ~g i
~ · M · ~g i = | det T| hMT · diag(w)
E

where  
6 −1 −1 −4 0 0
 

 −1 6 −1 0 −4 0 

 
T 1  −1 −1 6 0 0 −4 
M · diag(w)
~ ·M=  .
360 
 −4 0 0 32 16 16 

 

 0 −4 0 16 32 16 

0 0 −4 16 16 32
This result may be confirmed with the help of a program capable of symbolic computations.

6.5.4 Integrals to be Computed


To examine weak solution of boundary value problems the following integrals have to be computed.
ZZ I
~
∇φ · (a ∇u + u b) + φ (b0 u − f ) dA − φ (g2 + g3 u) ds = 0 .
Γ2

The unknown function u and the test function ϕ will be given at the nodes. Now develop numerical integra-
tion formulas for the above expressions. We have to aim for expression of the form
~
hA · ~u , φi and h~b , φi
~ .

6.5.5 The Octave code ElementContribution.m


To start examine integrals over one triangle only and seek an algorithm implemented in Octave to compute
the above integrals for a given triangle E and functions a, b and f . As a starting point use the codes for the
Gauss points and weights. Observe that the code below is far from complete.
ElementContribution.m
function [elMat,elVec] = ElementContribution(corners,aFunc,bFunc,fFunc)
% [...] = ElementContribution(...)
% compute the element stiffness matrix of one element
%
%[elMat,elVec] = ElementContribution(corners,aFunc,bFunc,fFunc)
% corners coordinates of the three corners forming the triangular element
% aFunc bFunc fFunc function files or scalars for the coefficient functions
% elMat element stiffness matrix
% elVec vector contribution to the RHS of the linear equation

% define the Gauss points and integration weights


l1 = 0.20257301464691267760; l2 = 0.94028412821023017954;
w1 = 0.062969590272413576298; w2 = 0.066197076394253090369;
w3 = 0.1125; w = [w1,w1,w1,w2,w2,w2,w3]’;

SHA 21-5-21
CHAPTER 6. FINITE ELEMENT METHODS 436

G = [l1/2 l1/2; 1-l1,l1/2; l1/2, 1-l1;


l2/2 l2/2; 1-l2,l2/2; l2/2, 1-l2; 1/3 1/3];

T = [corners(2,:)-corners(1,:); corners(3,:)-corners(1,:)]; detT = abs(det(T));


P = G*T;
P(:,1) = P(:,1)+corners(1,1); P(:,2) = P(:,2)+corners(1,2);

InterpolationMatrix = [...give the numerical values...];

6.5.6 Integration of f φ over one Triangle


General situation
First evaluate the coefficient function f (~x) at the Gauss integration points, leading to a vector f~. Use the
computations from the above section to conclude that
ZZ
f φ dA ≈ | det T| hMT · diag(w) ~ · f~ , φi
~ = hWE f~, φi~ .
E

The contribution to the element vector is


 
w1 f (~g1 )
 

 w1 f (~g2 ) 

 

 w1 f (~g3 ) 

WE f~ = | det T| MT · diag(w)
~ · f~ = | det T| MT ·  w2 f (~g4 )  .
 
 

 w2 f (~g5 ) 

 

 w2 f (~g6 ) 

w3 f (~g7 )

This is implemented by the code lines below.


ElementContribution.m
if ischar(fFunc) val = feval(fFunc,P); else val = fFunc(:); end%if
elVec = detT*InterpolationMatrix’*(w.*val);

Simplification if f is constant
Now the contribution to the element vector is
 
0
 

 0 
 
T
 0 
f | det T| M · w
~ = f | det T|  
1/6 
 

 

 1/6 

1/6
RR
and thus the effect of f φ dA is very easy to implement.
E

SHA 21-5-21
CHAPTER 6. FINITE ELEMENT METHODS 437

6.5.7 Integration of b0 u φ over one Triangle


General situation
First evaluate the coefficient function b0 (~x) at the Gauss integration points, leading to a vector ~b0 . Then we
use the computations from the above section to conclude that
ZZ
b0 u φ dA ≈ | det T| hMT · diag(w) ~ · diag(~b0 ) · M · ~u , φi
~ = hB0 ~u, φi
~ .
E

This is implemented by the code lines below.


ElementContribution.m
if ischar(bFunc) val = feval(bFunc,P); else val = bFunc(:); end%if
elMat = detT*InterpolationMatrix’*diag(w.*val)*InterpolationMatrix;

Simplification if b0 is constant
Now the contribution to the element stiffness matrix is
 
6 −1 −1 −4 0 0
 

 −1 6 −1 0 −4 0 

 
T b0 | det T|  −1 −1 6 0 0 −4 
b0 | det T| M · diag(w)
~ ·M=  .
360 
 −4 0 0 32 16 16 

 

 0 −4 0 16 32 16 

0 0 −4 16 16 32

6.5.8 Transformation of the Gradient to the Standard Triangle


The still to be examined integral expression contain components of the gradients ∇ u and ∇φ. First examine
how the gradient behave under the transformation to the standard triangle, only then use the above integration
methods.

According to section 6.5.1 the coordinates (ξ, ν) of the standard triangle are connected to the global
coordinates (x, y) by
! ! " # ! ! !
x x1 x2 − x1 x3 − x1 ξ x1 ξ
= + · = +T·
y y1 y2 − y1 y3 − y1 ν y1 ν

or equivalently
! ! " # !
ξ −1 x − x1 1 y3 − y1 −x3 + x1 x − x1
=T · = · .
ν y − y1 det(T) −y2 + y1 x2 − x1 y − y1

If a function f (x, y) is given on the general triangle E we can pull it back to the standard triangle by

g(ξ, ν) = f (x(ξ, ν) , y(ξ, ν))

and then compute the gradient of g with respect to its independent variables ξ and ν. The result will depend
on the partial derivatives of f with respect to x and y. The standard chain rule implies

∂ ∂ ∂ f (x, y) ∂ x ∂ f (x, y) ∂ y
g(ξ, ν) = f (x(ξ, ν) , y(ξ, ν)) = +
∂ξ ∂ξ ∂x ∂ξ ∂y ∂ξ

SHA 21-5-21
CHAPTER 6. FINITE ELEMENT METHODS 438

∂ f (x, y) ∂ f (x, y)
= (x2 − x1 ) + (y2 − y1 )
∂x ∂y
∂ ∂ ∂ f (x, y) ∂ x ∂ f (x, y) ∂ y
g(ξ, ν) = f (x(ξ, ν) , y(ξ, ν)) = +
∂ν ∂ν ∂x ∂ν ∂y ∂ν
∂ f (x, y) ∂ f (x, y)
= (x3 − x1 ) + (y3 − y1 ) .
∂x ∂y
This can be written with the help of matrices as
! " # ! !
∂g ∂f ∂f
(x 2 − x 1 ) (y2 − y1 )
∂ξ
∂g
= · ∂x
∂f
= TT · ∂x
∂f
∂ν (x3 − x1 ) (y3 − y1 ) ∂y ∂y

or equivalently    
∂g ∂g ∂f ∂f
, = , ·T .
∂ξ ∂ν ∂x ∂y
This implies
 " #
y3 − y1 −x3 + x1
    
∂f ∂f ∂g ∂g 1 ∂g ∂g
, = , · T−1 = , · .
∂x ∂y ∂ξ ∂ν det T ∂ξ ∂ν −y2 + y1 x2 − x1

Let ϕ be a function on the standard triangle Ω, given as a linear combination of the basis functions, i.e.
6
X
ϕ(ξ, ν) = ϕi Φi (ξ, ν) ,
i=1

where    
Φ1 (ξ, ν) (1 − ξ − ν) (1 − 2 ξ − 2 ν)
   

 Φ2 (ξ, ν)  
  ξ (2 ξ − 1) 

   
~
 Φ3 (ξ, ν)   ν (2 ν − 1) 
Φ(ξ, ν) =  = .
Φ4 (ξ, ν)   4ξν
   
 
   

 Φ5 (ξ, ν)  
  4 ν (1 − ξ − ν) 

Φ6 (ξ, ν) 4 ξ (1 − ξ − ν)
Then its gradient with respect to ξ and ν can be determined with the help of
 
−3 + 4 ξ + 4 ν −3 + 4 ξ + 4 ν
 

 4 ξ − 1 0 

  h
~ = 0 4 ν − 1 i
~ ξ (ξ, ν) Φ
~ ν (ξ, ν) .
 
grad Φ = Φ
4ν 4ξ
 
 
 

 −4 ν 4 − 4 ξ − 8 ν 

4 − 8ξ − 4ν −4 ξ

Thus conclude
 
∂ϕ ∂ϕ h
~ ξ (ξ, ν) Φ
i
~ ν (ξ, ν) = ϕ
h
~ ξ (ξ, ν) Φ
i
~ ν (ξ, ν) .
, = (ϕ1 , ϕ2 , ϕ3 , ϕ4 , ϕ5 , ϕ6 ) · Φ ~T · Φ
∂ξ ∂ν
If the function ϕ(x, y) is given on the general triangle as linear combination of the basis-functions on E find
6
X
ϕ(x, y) = ϕi Φi (ξ(x, y) , ν(x, y)) .
i=1

SHA 21-5-21
CHAPTER 6. FINITE ELEMENT METHODS 439

Now combine the results in this section to find


   
∂ϕ ∂ϕ ∂ϕ ∂ϕ h i
, = , · T−1 = ϕ
~T · Φ ~ ν · T−1
~ξ Φ
∂x ∂y ∂ξ ∂ν

or by transposition
! " # " # " #
∂ϕ
T ~T
Φ 1 y3 − y1 −y2 + y1 ~T
Φ
∂x
= T−1 · ξ
·ϕ
~= · ξ
·ϕ
~
∂ϕ ~ Tν
Φ det(T) −x3 + x1 x2 − x1 ~T
Φ
∂y ν

and the same identities can be spelled out for the two components independently

∂ϕ 1 h
~ T + (−y2 + y1 ) Φ
~T ·ϕ
i
= (+y3 − y1 ) Φ ξ ν ~,
∂x det(T)
∂ϕ 1 h
~ T + (+x2 − x1 ) Φ
~T ·ϕ
i
= (−x3 + x1 ) Φ ξ ν ~.
∂y det(T)

For the numerical integration use the values of the gradients at the Gauss integration points ~gj = (ξj , νj ).
We already found that the values of the function ϕ at the Gauss points can be computed with the help of the
interpolation matrix M by    
ϕ(~g1 ) ϕ1
   
 ϕ(~g2 )   ϕ2 
=M· .  .
   

 .
..   .. 
   
ϕ(~g7 ) ϕ6
Similarly define the interpolation matrices for the partial derivatives. Using
 
−3 + 4 ξ1 + 4 ν1 4 ξ1 − 1 0 4 ν1 −4 ν1 4 − 8 ξ1 − 4 ν1
 
 −3 + 4 ξ2 + 4 ν2 4 ξ2 − 1 0 4 ν2 −4 ν2 4 − 8 ξ2 − 4 ν2 
Mξ = 
 
.. .. 

 . . 

−3 + 4 ξm + 4 νm 4 ξm − 1 0 4 νm −4 νm 4 − 8 ξm − 4 νm
 
−2.18971 −0.59485 0.00000 0.40515 −0.40515 2.78456
 
 0.59485 2.18971 0.00000 0.40515 −0.40515 −2.78456 
 
 
 0.59485 −0.59485 0.00000 3.18971 −3.18971 0.00000 
 
≈  0.76114 0.88057 0.00000 1.88057 −1.88057 −1.64170
 

 
 −0.88057 −0.76114 0.00000 1.88057 −1.88057 1.64170 
 
 
 −0.88057 0.88057 0.00000 0.23886 −0.23886 0.00000 
 
−0.33333 0.33333 0.00000 1.33333 −1.33333 0.00000

find    
ϕξ (~g1 ) ϕ1
   
 ϕξ (~g2 )   ϕ2 
 = Mξ ·  .
   
 .. ..

 . 


 . 

ϕξ (~g7 ) ϕ6

SHA 21-5-21
CHAPTER 6. FINITE ELEMENT METHODS 440

Similarly write
 
−3 + 4 ξ1 + 4 ν1 0 4 ν1 − 1 4 ξ1 4 − 4 ξ1 − 8 ν1 −4 ξ1
 
 −3 + 4 ξ2 + 4 ν2 0 4 ν2 − 1 4 ξ2 4 − 4 ξ2 − 8 ν2 −4 ξ2 
Mν = 
 
.. .. 

 . . 

−3 + 4 ξm + 4 νm 0 4 νm − 1 4 ξm 4 − 4 ξm − 8 νm −4 ξm
 
−2.18971 0.00000 −0.59485 0.40515 2.78456 −0.40515
 
 0.59485 0.00000 −0.59485 3.18971 0.00000
−3.18971 
 
 
 0.59485 0.00000 2.18971 0.40515 −2.78456 −0.40515 
 
≈  0.76114 0.00000 0.88057 1.88057 −1.64170 −1.88057 
 
 
 −0.88057 0.00000 0.88057 0.23886 0.00000 −0.23886 
 
 
 −0.88057 0.00000 −0.76114 1.88057 1.64170 −1.88057 
 
−0.33333 0.00000 0.33333 1.33333 0.00000 −1.33333

and    
ϕν (~g1 ) ϕ1
   
 ϕν (~g2 )   ϕ2 
 = Mν · .
   
 .. ..

 . 


 . 

ϕν (~g7 ) ϕ6

Combining the above two computations use the notation


! !
x1 ξi
~xi = +T· for i = 1, 2, 3, . . . , 7
y1 νi

∂ϕ
and find for the first component ϕx = ∂x of the gradient at the Gauss points
 
ϕx (~x1 )
 
 ϕx (~x2 )  1 h i

=
 ~
(+y3 − y1 ) MTξ + (−y2 + y1 ) MTν · φ
 ..  det(T)

 . 
ϕx (~x7 )

and for the second component of the gradient


 
ϕy (~x1 )
 
 ϕy (~x2 )  1 h i
 
= (−x + x ) MT + (+x − x ) MT ~.
·φ
 ..  det(T)
 3 1 ξ 2 1 ν

 . 
ϕy (~x7 )

The above results for Mξ and Mν can be implemeted in MATLAB/Octave and then used to compute the
element stiffness matrix.

SHA 21-5-21
CHAPTER 6. FINITE ELEMENT METHODS 441

6.5.9 Integration of u ~b ∇φ over one Triangle


The vector function ~b(~x) has to be evaluated at the Gauss integration points ~gj . Then the integration of

∂φ ∂φ
ZZ ZZ ZZ
Cont = ~
u b ∇φ dA = u b1 dA + u b2 dA
∂x ∂y
E E E

is replaced by a weighted summation. Introduce the vectors


   
w1 b1 (~g1 ) w1 b2 (~g1 )
   
−→  w2 b1 (~g2 )  −→  w2 b2 (~g2 )
   
wb 1 =   and wb 2 =  .

.
.. ..





 . 

w7 b1 (~g7 ) w7 b2 (~g7 )

Using the results of the previous sections find

∂φ 1
ZZ ZZ
u b1 dA = u b1 ((y3 − y1 ) φξ + (−y2 + y1 ) φν ) dA
∂x det T
E E
| det T| −→
≈ ~ + (−y2 + y1 ) Mν · φi
hdiag( wb 1 ) · M · ~u , (y3 − y1 ) Mξ · φ ~
det T
| det T| −→
~
h (y3 − y1 ) MTξ + (−y2 + y1 ) MTν · diag( wb 1 ) · M · ~u , φi

=
det T
~
= hB1 ~u, φi

and similarly

∂φ 1
ZZ ZZ
u b2 dA = u b2 ((−x3 + x1 ) φξ + (x2 − x1 ) φν ) dA
∂y det T
E E
| det T| −→
≈ ~ + (x2 − x1 ) Mν · φi
hdiag( wb 2 ) · M · ~u , (−x3 + x1 ) Mξ · φ ~
det T
| det T| −→
~
h (−x3 + x1 ) MTξ + (x2 − x1 ) MTν · diag( wb 2 ) · M · ~u , φi

=
det T
~
= hB2 ~u, φi

For computations use the above formula. A slightly more compact notation is given by
ZZ
Cont = u ~b ∇φ dA
E
−→ −→
 
| det T|  T (y 3 − y 1 ) diag( wb 1 ) + (−x 3 + x 1 ) diag( wb 2 ) ~
h Mξ , MTν · 

≈ −→ −→  · M · ~u , φi
det T (−y2 + y1 ) diag( wb 1 ) + (x2 − x1 ) diag( wb 2 )
#  −→
" 
| det T|  T (y3 − y1 ) (−x3 + x1 ) diag( wb 1 ) ~
h Mξ , MTν ·

= · −→  · M · ~u , φi
det T (−y2 + y1 ) (x2 − x1 ) diag( wb 2 )
−→
 
diag( wb 1 ) ~ .
= | det T|h MTξ , MTν · T−1 · 
 
−→  · M · ~u , φi
diag( wb 2 )

SHA 21-5-21
CHAPTER 6. FINITE ELEMENT METHODS 442

In the special case of E = Ω use x2 − x1 = y3 − y1 = 1, x3 − x1 = y3 − y1 = 0 and thus T = I and


det T = 1. For a constant vector ~b the above simplifies to
∂φ
ZZ
u b1 dA = b1 hMTξ · diag(w) ~ = hB1 ~u, φi
~ · M · ~u , φi ~
∂x
ZΩZ
∂φ ~ = hB2 ~u, φi
~ .
u b2 dA = b2 hMTν · diag(w) ~ · M · ~u , φi
∂y

6.5.10 Integration of a ∇u ∇φ over one Triangle


The function a ∇u · ∇φ = a ( ∂∂xu ∂∂xφ + ∂∂yu ∂∂yφ ) has to be evaluated at the Gauss integration points ~gj , then
multiplied by the Gauss weights wi and added up. Introduce the vector
 
w1 a(~x(~g1 ))
 
−→  w2 a(~ x(~g2 )) 

wa=  .

 .
.. 
 
w7 a(~x(~g7 ))

For the integration over the general triangle E use the transformation formula (6.11) and obtain
∂ u(~x) ∂ φ(~x) ∂ u(~x(ξ, ν)) ∂ φ(~x(ξ, ν))
ZZ ZZ
a dA = | det T| a(~x(ξ, ν)) dξ dν
∂x ∂x ∂x ∂x
E Ω
| det T| ~ = 1 ~
≈ 2
hAx · ~u , φi hAx · ~u , φi
(det T) | det T|
where
h iT −→
h i
Ax = (+y3 − y1 ) Mξ + (−y2 + y1 ) Mν · diag(wa) · (+y3 − y1 ) Mξ + (−y2 + y1 ) Mν
−→
h i h i
= (+y3 − y1 ) MTξ + (−y2 + y1 ) MTν · diag(wa) · (+y3 − y1 ) Mξ + (−y2 + y1 ) Mν
−→ −→
= (+y3 − y1 )2 MTξ · diag(wa) · Mξ + (−y2 + y1 )2 MTν · diag(wa) · Mν
−→ −→
 
+(+y3 − y1 ) (−y2 + y1 ) MTξ · diag(wa) · Mν + MTν · diag(wa) · Mξ .

For a constant coefficient a more of the above expressions can be computed explicitly and simplified.
 
3 1 0 0 0 −4
 
 1 3 0 0 0 −4 
 
 
T 1  0 0 0 0 0 0 
Mξ · diag(w)~ · Mξ =  
6 
 0 0 0 8 −8 0


 
 0 0 0 −8 8 0 
 
−4 −4 0 0 0 8
 
3 0 1 0 −4 0
 
 0 0 0 0 0 0 
 
 
T 1 
 1 0 3 0 −4 0 
Mν · diag(w)~ · Mν = 
6  0 0 0

8 0 −8 

 
 −4 0 −4 0 8 0 
 
0 0 0 −8 0 8

SHA 21-5-21
CHAPTER 6. FINITE ELEMENT METHODS 443

 
6 1 1 0 −4 −4
 

 1 0 −1
−4 
4 0
 
1  1 −1 0 4 −4 0 
MTξ · diag(w)
~ · Mν + MTν · diag(w)
~ · Mξ =  
6 
 0 4 4 8 −8 −8 

 

 −4 0 −4 −8 8 8 
−4 −4 0 −8 8 8

Similarly determine

∂ u(~x) ∂ φ(~x) ∂ u(~x(ξ, ν)) ∂ φ(~x(ξ, ν))


ZZ ZZ
a dA = | det T| a(~x(ξ, ν)) dξ dν
∂y ∂y ∂y ∂y
E Ω
| det T| ~ = 1 ~
≈ 2
hAy · ~u , φi hAy · ~u , φi
(det T) | det T|

where
h iT −→
h i
Ay = (−x3 + x1 ) Mξ + (+x2 − x1 ) Mν · diag(wa) · (−x3 + x1 ) Mξ + (+x2 − x1 ) Mν .

Now put all the above computations into one single formula, leading to
1
ZZ
a ∇u · ∇φ dA ≈ ~ .
h(Ax + Ay ) · ~u , φi
| det T|
E

This is implemented by the code lines below


ElementContribution.m
if ischar(aFunc) val = feval(aFunc,P); else val = aFunc(:); end%if

tt = T(2,2)*Mxi-T(1,2)*Mnu; Ax = tt’*diag(w.*val)*tt;
tt = -T(2,1)*Mxi+T(1,1)*Mnu; Ay = tt’*diag(w.*val)*tt;

elMat = elMat + (Ax+Ay)/detT;

and with this segment the code for the function ElementContribution() is complete. The element
stiffness matrix and the element vector can now be computed by
Octave
corners = [0 0; 1 0; 0 1];

function res = fFunc(xy)


x = xy(:,1); y = xy(:,2);
res = 1+x.ˆ2;
end%function

function res = bFunc(xy)


x = xy(:,1); y = xy(:,2);
res = 0*x;
end%function

[elMat,elVec] = ElementContribution(corners,’fFunc’,’bFunc’,’fFunc’)

SHA 21-5-21
CHAPTER 6. FINITE ELEMENT METHODS 444

6.5.11 Construction of the Element Stiffness Matrix


With the above algorithms and resulting codes the element stiffness matrices and vectors can be computed.
For each element E (a triangle) use the approximated integral
ZZ
∇φ · (a ∇u + u ~b) + φ (b0 u − f ) dA ≈ hAE ~uE , φ~ E i − hWE f~E , φ
~E i
E

with the element stiffness matrix AE = Ax + Ay + B0 + B1 + B2 . Then use ideas similar to Section 6.2.6
(page 407) to assemble the results to obtain the global stiffness matrix A and the resulting system of linear
equations to be solved is given by
A~u = W f~ .
The boundary contributions in
ZZ I
∇φ · (a ∇u + u ~b) + φ (b0 u − f ) dA − φ (g2 + g3 u) ds = 0
Γ2

have to be taken into account by very similar procedures. If all goes well the vector ~u ∈ RN is an approxi-
mation of the solution u(x, y) of the boundary value problem

−∇ · (a ∇u + u ~b) + b0 u = f for (x, y) ∈ Ω


u = g1 for (x, y) ∈ Γ1 .
~
~n · (a ∇u + u b) = g2 + g3 u for (x, y) ∈ Γ2

6.6 Comparing First and Second Order Triangular Elements


6.6.1 Observe the Convergence of the Error as h → 0
Consider the unit square Ω = [0, 1] × [0, 1]. Verify that the function ue (x, y) = sin(x) · sin(y) is solution of

−∇ · ∇u = −2 sin(x) · sin(y) for 0 ≤ x, y ≤ 1


∂ u(x,1)
∂y = − sin(x) · cos(1) for 0 ≤ x ≤ 1 and y=1 .
u(x, y) = ue (x, y) on the other sections of the boundary

Let h > 0 be the typical length of a side of a triangle. For first order elements 12 h is used, such that the
computational effort is comparable to second order elements, i.e. the same number of degrees of freedom.
Nonuniform meshes are used, to avoid the effect of superconvergence. By choosing different values of h one
should observe smaller errors for smaller values of h. The error is measured by computing the L2 norms of
the difference of the exact and approximate solutions, for the values of the functions and its partial derivative
with respect to y. These are the expressions used in the general convergence estimates, based on the abstract
error estimate in Theorem 6–6. A double logarithmic plot leads to Figure 6.18.

• For linear elements:

– The slope of the curve for the absolute values of u(x, y) − ue (x, y) is approximately 2 and thus
conclude that the error is proportional to h2 .

– The slope of the curve for the absolute values of ∂y (u(x, y) − ue (x, y)) is approximately 1 and
thus conclude that the error of the gradient is proportional to h.

• For quadratic elements:

SHA 21-5-21
CHAPTER 6. FINITE ELEMENT METHODS 445

– The slope of the curve for the absolute values of u(x, y) − ue (x, y) is approximately 3 and thus
conclude that the error is proportional to h3 .

– The slope of the curve for the absolute values of ∂y (u(x, y) − ue (x, y)) is approximately 2 and
thus conclude that the error of the gradient is proportional to h2 .

These observations confirm the theoretical error estimates in Theorem 6–9 (page 424) and Theorem 6–
11 (page 426). It is rather obvious from Figure 6.18 that second order elements generate more accurate
solutions for a comparable computational effort.

-2
log (difference)

-4

-6 linear, u-u
10

e
linear, d/dy (u-u )
e
-8 quad, u-u
e
quad, d/dy (u-u )
e
-10
-2.5 -2 -1.5 -1 -0.5
log (h)
10

Figure 6.18: Convergence results for linear and quadratic elements

6.6.2 Estimate the Number of Nodes and Triangles and the Effect on the Sparse Matrix
Let Ω ⊂ R2 be a domain with a triangular mesh with many triangles. Examine the typical mesh, shown
below, and consider only triangles and nodes inside the mesh, as the contributions by the borders are con-
siderably smaller for large meshes.

We search a connection between

N = number of nodes and T = number of triangles.

• each triangle has three corners

• each (internal) corner is touched by 6 triangles

• each triangle has 3 midpoints of edges and each of the midpoints


is shared by 2 triangles

• For first order elements the nodes are the corners of the triangles.
1 1
N≈ T3= T
6 2
Thus the number N of nodes is approximately half the number T of triangles.

SHA 21-5-21
CHAPTER 6. FINITE ELEMENT METHODS 446

• For second order elements the nodes are the corners of the triangles and the midpoints of the edges.
Each midpoint is shared by two triangles.
1 3
N≈ T + T = 2T
2 2
Thus the number N of nodes is approximately twice the number T of triangles.

The above implies that the number of degrees of freedom to solve a problem with second order elements
with a typical diameter h of the triangles is approximately equal to using linear element on triangles with
diameter h/2.

The above estimates also allow to estimate how many entries in the sparse matrix resulting from an FEM
algorithm will be different from zero.

• For linear elements each node typically touches 6 triangles and each of the involved corners is shared
by two triangles. Thus there might be 6 + 1 = 7 nonzero entries in each row of the matrix.

• For second order triangles we have to distinguish between corners and midpoints.

– Each corner touches typically six triangles and thus expect up to 6 × 3 + 1 = 19 nonzero entries
in the corresponding row of the matrix.
– Each midpoint touches two triangles and two of the corner points are shared. Thus expect up to
2 + 2 × 3 + 1 = 9 nonzero entries in the corresponding row of the matrix.
3·9+19
The midpoints outnumber the corners by a factor of three. Thus expect an average of 4 = 11.5
nonzero entries in each row of the matrix.

This points to about a factor of 11.5


7 ≈ 1.6 more nonzero entries in the matrix for quadratic elements for
the same number of degrees of freedom. This implies that the computational effort is larger, the size of the
effect depends on the linear solver used.

6.6.3 Behavior of a FEM Solution within Triangular Elements


To examine the behavior of a solution u(x, y) within each of the triangular elements use the boundary value
problem
−∆u = − exp(y) for (x, y) ∈ Ω
u(x, y) = + exp(y) for (x, y) ∈ Γ = ∂Ω
on the domain Ω displayed in Figure 6.19(a). The exact solution is given by u(x, y) = exp(y), shown in
Figure 6.19(b). The problem is solved twice, using different elements:

1. using 32 triangular elements of order 1.

2. using 8 triangular elements of order 2.

The degrees of freedom and nodes used coincide for the two approaches, i.e four triangles in Figure 6.19(a)
for the linear elements from one of the eight triangles for the quadratic elements.
Figure 6.20(a) shows the difference of the computed solution with first order elements to the exact solu-
tion. Within each of the 32 elements the difference is not too far from a quadratic function. Figure 6.20(b)
shows the values of the partial derivative ∂∂yu . It is clearly visible that the gradient is constant within each
triangle, and not continuous across element borders.
Figure 6.21(a) shows the difference of the computed solution with second order elements to the exact
solution. The error is considerably smaller than for linear elements, using identical degrees of freedom.
Within each of the 8 elements the difference does not show a simple structure. Figure 6.21(b) shows the

SHA 21-5-21
CHAPTER 6. FINITE ELEMENT METHODS 447

1.5

1
y

0.5

0
0 0.2 0.4 0.6 0.8 1
x
(a) the mesh (b) the solution

Figure 6.19: The mesh and the solution for a BVP

∂u
(a) the difference to the exact solution (b) the values of ∂y

∂u
Figure 6.20: Difference to the exact solution and values of ∂y , using a first order mesh

∂u
(a) the difference to the exact solution (b) the values of ∂y

∂u
Figure 6.21: Difference to the exact solution and values of ∂y , using a second order mesh

SHA 21-5-21
CHAPTER 6. FINITE ELEMENT METHODS 448

values of the partial derivative ∂∂yu . It is clearly visible that the gradient is not constant within the triangles.
By a careful inspection one has to accept that the gradient is not continuous across element borders, but the
jumps are considerably smaller than for linear elements.
Figure 6.22 shows the errors for the partial derivative ∂∂yu and confirms this observation. For first order
elements (Figure 6.22(a)) the gradient is constant within each triangle and thus the maximal error on the
triangles is proportional to the size of the triangles. This confirms the convergence of order 1 (i.e. ≈ c h1 ) for
the gradients with linear elements. The error for quadratic elements is considerably smaller, for a comparable
computational effort.

(a) using linear elements (b) using quadratic elements

∂u
Figure 6.22: Difference of the approximate values of ∂y to the exact values

6.6.4 Remaining Pieces for a Complete FEM Algorithm, the FEMoctave Package
With the above algorithms and codes we can construct the element stiffness matrix and the vector contribu-
tion for a triangular element with second order polynomials as basis functions. Thus we can take advantage
of the convergence result in Theorem 6–11 on page 426. The missing parts for a complete algorithm are:

• Examine the integrals over the boundary edges in a similar fashion. This poses no major technical
problem.

• Assemble the global stiffness matrix and vector, similar to the method in Section 6.2.6, page 407.

• Solve the resulting system of linear systems, using either a direct method or an iterative approach.
Use the methods from Chapter 2.

• Visualize the results.

In 2020 I wrote a FEM code in Octave, implementing all of the above. Find the documentation with
description of the algorithms and sample codes at web.sha1.bfh.science/FEMoctave/FEMdoc.pdf. The pack-
age is hosted on GitHub at github.com/AndreasStahel/FEMoctave . Within Octave use
pkg install https://github.com/AndreasStahel/FEMoctave/archive/v2.0.3.tar.gz
to download and install the package and then pkg load femoctave to load the package. For Linux
systems the complete package is on my web page at web.sha1.bfh.science/FEMoctave2.0.1.tgz. The source
code, demos and examples for FEMoctave is also available in the directory web.sha1.bfh.science/FEMoctave.

SHA 21-5-21
CHAPTER 6. FINITE ELEMENT METHODS 449

6.7 Applying the FEM to Other Types of Problems, e.g. Plane Elasticity
In the previous section approximate solutions of the boundary value problem

−∇ · (a ∇u + u ~b) + b0 u = −f for (x, y) ∈ Ω


u = g1 for (x, y) ∈ Γ1
~n · (a ∇u + u ~b) = g2 + g3 u for (x, y) ∈ Γ2

are generated, either as weak solutions or, if ~b = ~0, as minimizers of the functional
1 1
ZZ
F (u) = a (∇u)2 + b0 u2 + f · u dA
2 2

among all functions u with u = g1 on the boundary section Γ1 . By examining Table 5.2 on page 317 verify
that the above setup covers a wide variety of applications. With a standard finite difference approximation
of the time derivative many dynamic problems can be solved too.

According to equation (5.28) on page 388 a plane strain problem can be examined as minimizer of the
functional
     
1−ν ν 0 εxx εxx
1 E
ZZ
     
U (~u) = h ν 1 − ν 0  ·  εyy  ,  εyy i dx dy −
2 (1 + ν) (1 − 2 ν)      
Ω 0 0 2 (1 − 2 ν) εxy εxy
ZZ I
− f~ · ~u dx dy − ~g · ~u ds
∂Ω

where the strain tensor ε depends on the displacements ~u through


 
∂u1 ∂u2 1 ∂ u1 ∂ u2
εxx = , εyy = and εyx = + .
∂x ∂y 2 ∂y ∂x

For a plane stress problem the total energy is given by (5.31) on page 394

U (~u) = Uelast + UV ol + USurf


     
1 ν 0 εxx εxx
1 E
ZZ
     
= h ν 1
 0  ·  εyy  ,  εyy i dx dy −
2 (1 − ν 2 )      
Ω 0 0 2 (1 − ν) εxy εxy
ZZ I
− ~
f · ~u dx dy − ~g · ~u ds .
∂Ω

Thus all contributions to the above elastic energy functionals are of the same type as the integrals in the
previous sections. Thus identical techniques are used to develop FEM code for elasticity problems. The
convergence results in Section 6.4 also apply. The role of Poincaré’s inequality is taken over by Korn’s
inequality.

6.8 Using Quadrilateral Elements


The presentation is mostly based on [TongRoss08]. Find additional information in [Hugh87, §3.2].

SHA 21-5-21
CHAPTER 6. FINITE ELEMENT METHODS 450

6.8.1 First Order Quadrilateral Elements


A quadrilateral domain in the (x, y)–plane and its standard square in the (ξ, ν) plane with −1 ≤ ξ, ν ≤ +1
is shown in Figure 6.23. Assume that all interior angles of this quadrilateral are smaller than 180◦ . The
transformation
! ! ! !
x x 1 ξ + 1 x 2 − x 1 ν + 1 x4 − x 1
= F~ (ξ, ν) = + + +
y y1 2 y2 − y1 2 y4 − y1
!
(ξ + 1) (ν + 1) x3 − x2 − x4 + x1
+ (6.15)
4 y3 − y2 − y4 + y1

is a bilinear mapping, i.e. it is linear (resp. affine) with respect to ξ and ν separately. If the quadrilateral
is a parallelogram, then the contribution with the factor 14 (ξ + 1) (ν + 1) vanishes and we have the same
transformation formula as for triangles. The above transformation is built with the corner at (x1 , y1 ) as
starting point and the vectors in directions of (x2 , y2 ) and (x4 , y4 ). One can also build on the central point
~c and the directional vectors d~ξ and d~ν , given by
4
! ! !
1 X xi 1 x2 + x3 1 x3 + x4
~c = , d~ξ = − ~c, d~ν = − ~c .
4 yi 2 y2 + y3 2 y3 + y4
i=1

If the quadrilateral is a parallelogram then the midpoints of the opposite corners coincide, thus the vector
! ! ! !
x 1 x 3 x 2 x4
d~ξν = + − −
y1 y3 y2 y4

vanishes for parallelograms. The vector d~ξν indicates the deviation from a parallelogram and equation (6.15)
is identical to
ξν ~
F~ (ξ, ν) = ~c + ξ d~ξ + ν d~ν + dξν
4
! !
ξ −x1 + x2 + x3 − x4 ν −x1 − x2 + x3 + x4
= ~c + + +
4 −y1 + y2 + y3 − y4 4 −y1 − y2 + y3 + y4
!
ξν +x1 − x2 + x3 − x4
+ (6.16)
4 +y1 − y2 + y3 − y4

Observe that
! ! ! !
x1 x2 x4 x3
F~ (−1, −1) = , F~ (+1, −1) = , F~ (−1, +1) = , F~ (+1, +1) =
x2 y2 y4 y3

and the Jaccobi matrix T(ξ, ν), given by


(ν+1) (x3 −x2 −x4 +x1 ) (ξ+1) (x3 −x2 −x4 +x1 )
" # " # " #
∂ F1 ∂ F1 x2 −x1 x4 −x1
∂ξ ∂ν 2 2 4 4
T(ξ, ν) = ∂ F2 ∂ F2
= y2 −y1 y4 −y1
+ (ν+1) (y3 −y2 −y4 +y1 ) (ξ+1) (y3 −y2 −y4 +y1 )
∂ξ ∂ν 2 2 4 4
" # " #
1 x2 − x1 x4 − x1 1 x3 − x2 − x4 + x1 h i
= + · ν+1 ξ+1
2 y2 − y1 y4 − y1 4 y3 − y2 − y4 + y1
" # " #
−x1 +x2 +x3 −x4 −x1 −x2 +x3 +x4 +x1 −x2 +x3 −x4 h i
4 4 4
= −y1 +y2 +y3 −y4 −y1 −y2 +y3 +y4
+ +y1 −y2 +y3 −y4
· ν ξ ,
4 4 4

SHA 21-5-21
CHAPTER 6. FINITE ELEMENT METHODS 451

ν ! !j
6 ξ x
7→ BMB ν
t4 1 t3 ν y
yt t(x3 , y3 )
6 B
A (x , y ) B
4 4
A B 1

 ξ
Ω 1 ξ A B  
- A B t
A
A E B
B (x2 , y2 )
A B
t1 t2
A B
At
(x1 , y1 )
-x

Figure 6.23: The transformation of a linear quadrilateral element and the four nodes

is not constant, which is the case for triangular elements. This has to be taken into account when integrating
over the general quadrilateral E by using the integration formula (6.11) on page 429.
Along each of the edges of the standard square Ω ⊂ R2 one of ξ or ν is constant and thus the transfor-
mation F~ leads to straight lines as borders of the domain E = F~ (Ω) ∈ R2 . Using the functions 1, ξ, ν and
ξ ν construct the four basis functions with the key property
(
1 for i = j
Φi (ξj , νj ) = δi,j = .
0 for i 6= j

They are given by the bilinear functions15

1 1
Φ1 (ξ, ν) = 4 (1 − ξ) (1 − ν) , Φ2 (ξ, ν) = 4 (1 + ξ) (1 − ν) ,
1 1
Φ3 (ξ, ν) = 4 (1 + ξ) (1 + ν) , Φ4 (ξ, ν) = 4 (1 − ξ) (1 + ν)

or shorter
1
Φi (ξ, ν) =
(1 + ξi ξ) (1 + νi ν) for i = 1, 2, 3, 4 ,
4
where ξi , νi = ±1 are the coordinates of the corners, e.g. ξ1 = ν1 = −1 or ξ3 = ν3 = +1.

• It is a matter of tedious algebra to verify that for any bilinear function F~ (ξ, ν)
4
! !
x xi
= F~ (ξ, ν) =
X
Φi (ξ, ν) .
y i=1 yi

• Any bilinear function f (ξ, ν) = c1 + c2 ξ + c3 ν + c4 ξ ν can be written as a linear combination of


the Φi (ξ, ν), i.e.
4
X 4
X
f (ξ, ν) = f (ξi , νi ) Φi (ξ, ν) = fi Φi (ξ, ν) .
i=1 i=1
15
The function is linear with respect to each of the arguments ξ and ν, but not linear overall, caused by the contribution ξ ν.

SHA 21-5-21
CHAPTER 6. FINITE ELEMENT METHODS 452

This is a bilinear interpolation on the standard square Ω. To verify this examine16 the linear system
    
+1 +1 +1 +1 f1 c1
    
 −1 +1 +1 −1   f2   c2 
1     
=
4  −1 −1 +1 +1   f3   c3 
    
    
+1 −1 +1 −1 f4 c4

and observe that the matrix is invertible. Thus given the values of ci we can solve uniquely for fi . Let
M be the above matrix, then M · MT = 41 I, i.e. 2 M is a unitary matrix. This leads to M−1 = 4 MT .
• With these shape functions implement an interpolation on the general quadrilateral E in Figure 6.23
by using the values of the function at the nodes. This leads to the values when pulled back to the
standard square Ω.
• Along each of the edges these basis functions are linear functions, since one of the variables ξ or ν
is constant. This leads to conforming elements, i.e. the values of the functions will be continuous
across element boundaries. Observe that these functions are not linear when displayed on a general
quadrilateral, see Figure 6.24. There is no easy formula for this bilinear interpolation on a general
quadrilateral.
Based on this the general approximation results are applicable and similar to Theorem 6–9 obtain for an
FEM algorithm based in this interpolation the error estimates

kuh − u0 kV ≤ C h and kuh − u0 k2 ≤ C1 h2

for some constants C and C1 independent on h. A possible formulation is


• uh converges to u0 with an error proportional to h2 as h → 0.
• ∇uh converges to ∇u0 with an error proportional to h as h → 0.

To evaluate integrals over the domain E ⊂ R2 use the transformation rule (6.11), i.e.
Z +1 Z +1 
∂ (x, y)
ZZ ZZ
f dA = f (~x (ξ, ν)) det dξ dν = f (~x (ξ, ν)) |det T(ξ, ν)| dξ dν .
∂ (ξ, ν) −1 −1
E Ω

Observe that the term | det T(ξ, ν)| is not constant. The idea of Gauss integration can be applied to squares,
using the 1D integrals17
Z +1 X
f (t) dt ≈ wj f (tj ) general formula
−1 j

16
Compare the coefficients of 1, ξ, ν and ξν.
4
X
4 fi Φi (ξ, ν) = f1 (1 − ξ)(1 − ν) + f2 (1 + ξ)(1 − ν) + f3 (1 + ξ)(1 + ν) + f1 (1 − ξ)(1 + ν)
i=1
= (f1 + f2 + f3 + f4 ) + (−f1 + f2 + f3 − f4 ) ξ + (−f1 − f2 + f3 + f4 ) ν + (+f1 − f2 + f3 − f4 ) ξν

17
To derive the first formula integrate 1, t and t2 .
R +1
f (t) dt = w1 f (−ξ) + w1 f (+ξ)
R +1−1
1 dt = 2 = w1 1 + w1 1 =⇒ w1 = 1
R−1
+1
−1
t dt = 0 = − w1 ξ + w1 ξ = 0
R +1 2
p
t dt = 32 = + w1 ξ 2 + w1 ξ 2 =⇒ ξ= 1/3
R−1
+1 3
−1
t dt = 0 = − w1 ξ 3 + w1 ξ 3 = 0

SHA 21-5-21
CHAPTER 6. FINITE ELEMENT METHODS 453

Φ
1

1
4 0.2
0.8 0.2

0.6 3 0.4

y
1
Φ

0.2
0.4
0.
2 0.6 4
0.2
5 0.8
4
0 3
2 1
0 1 2 1 y
3 4 5 0
x 1 2 3 4
x
Φ
2

1 0.2
4
0.8 0.4
0.2
0.6 3
y

0.6
2

0.8
Φ

0.4
2

0.4
0.2
5
4 0.2
0 3
2 1
0 1 2 1 y
3 4 5 0
x 1 2 3 4
x
Φ
3
0.2 0. 0.8
4
0.6
1
4
0.8 0.2 0.4

0.6 3 0.2
y
3
Φ

0.4
2
0.2
5
4
0 3
2 1
0 1 2 1 y
3 4 5 0
x 1 2 3 4
x
Φ
4
0.4
0.8

1
0.6

0.2

4
0.8

0.6 3
4
y

0.
4
Φ

0.4
2
0.

2
0.2
5
4
0 3
2 1
0 1 2 1 y
3 4 5 0
x 1 2 3 4
x

Figure 6.24: Bilinear shape functions on a 4 node quadrilateral element

SHA 21-5-21
CHAPTER 6. FINITE ELEMENT METHODS 454

ν
6 ν
6
t t t t
t t t
t t
Ω ξ Ω ξ
- t t t -

t t
t t t
t t t t

Figure 6.25: Gauss integration points on the standard square, using 22 = 4 or 32 = 9 integration points

+1
−1 +1
Z
f (t) dt ≈ 1 f ( √ ) + 1 f ( √ ) 2 point Gauss integration
−1 3 3
Z +1 r r
5 3 8 5 3
f (t) dt ≈ f (− ) + f (0) + f (+ ) 3 point Gauss integration
−1 9 5 9 9 5

For the standard square Ω = [−1, +1] × [−1, +1] ⊂ R2 use


ZZ Z +1 Z +1 Z +1 X XX
f (ξ, ν) dξ dν = f (ξ, ν) dξ dν ≈ wj f (ξj , ν) dν ≈ wj wi f (ξj , νi ) .
−1 −1 −1 j i j

p
As an example for the result based on the 2 point Gauss formula use α = 1/3 and find
ZZ
f dA ≈ 1 · f (−α, −α) + 1 · f (+α, −α) + 1 · f (+α, +α) + 1 · f (−α, +α) .

This leads to the 2D Gauss integration points shown on the left in Figure 6.25. If the 2D integration is based
on the 3 point formula for the integration on [−1, +1] the 2D situation is shown on the right in Figure 6.25.
There are more integration schemes of the same type, e.g. see [TongRoss08, §6.5.3] or [Hugh87, §3.8].
With this we have all the tools to apply the ideas from Section 6.5 to construct the matrices and vectors
required for an FEM algorithm, based on first order quadrilateral elements with four nodes.
Thus t4 is not integrated exactly and the error is proportional to h4 . To derive the second formula use
R +1
f (t) dt = w1 f (−ξ) + w0 f (0) + w1 f (+ξ)
R +1−1
1 dt = 2 = w1 1 + w0 1 + w1 1
R−1
+1
−1
t dt = 0 = − w1 ξ + w0 0 + w1 ξ = 0
R +1
t dt = 23 = + w1 ξ 2 + w1 ξ 2
2
R−1+1 3
t dt = 0 = − w1 ξ 3 + w1 ξ 3 = 0
R −1
+1 4
t dt = 52 = + w1 ξ 4 + w1 ξ 4
R−1+1 5
−1
t dt = 0 = − w1 ξ 5 + w1 ξ 5 = 0

Thus t6 is not integrated exactly and the error is proportional to h6 . The system to be solved is

 w0 + 2 w1 = 2


3 5 8
2 w1 ξ 2 = 32 =⇒ ξ 2 = , w1 = , w0 = .
 5 9 9
2 w ξ4 = 2


1 5

SHA 21-5-21
CHAPTER 6. FINITE ELEMENT METHODS 455

6.8.2 Second Order Quadrilateral Elements


On the above quadrilateral domain a second order element can be constructed, using the midpoints of the
sides as additional nodes, see Figure 6.26. The transformation is identical to the four node element, i.e.
equation (6.15) or (6.16). Using the eight functions 1, ξ, ν, ξ 2 , ν 2 , ξ ν, ξ 2 ν and ξν 2 construct the eight basis
functions

ν
BMB ν
6 ! !j
ξ x
7→ yt Bt t(x3 , y3 )
t4 t7 t3
ν y
6A
(x4 , y4 ) B
A B t 1
 ξ

A B 
Ω A t B t
t8 t6 ξ 
E
- A B
(x2 , y2 )
Bt
A B
A
A B
At
t1 t5 t2 (x1 , y1 )
-x

Figure 6.26: The transformation of a second order quadrilateral element and the eight nodes

1
Φi (ξ, ν) = (1 − ξi ξ) (1 − νi ν) (ξi ξ + νi ν − 1) for i = 1, 2, 3, 4
4
1
Φi (ξ, ν) = (1 − ξ 2 ) (1 + νi ν) for i = 5, 7
2
1
Φi (ξ, ν) = (1 + ξi ξ) (1 − ν 2 ) for i = 6, 8
2
• The above shape functions are all of the form

f (ξ, ν) = c1 + c2 ξ + c3 ν + c4 ξ 2 + c5 ν 2 + c6 ξ ν + c7 ξ 2 ν + c8 ξ ν 2 ,

i.e. these are biquadratic18 functions. Any function of this form can be writen as a linear combination
of the Φi (ξ, ν), i.e.
X 8 8
X
f (ξ, ν) = f (ξi , νi ) Φi (ξ, ν) = fi Φi (ξ, ν)
i=1 i=1

• Along each of the edges these basis functions are quadratic functions, since one of the variables ξ or
ν is constant. This leads to conforming elements, i.e. the values of the functions will be continuous
across element boundaries. Find the contour lines of the shape functions on a general quadrilateral in
Figure 6.27. There is no easy formula for this biquadratic interpolation on a general quadrilateral.

Based on this the general approximation results are applicable and similar to Theorem 6–11 obtain for
an FEM algorithm based in this interpolation the error estimates

kuh − u0 kV ≤ C h2 and kuh − u0 k2 ≤ C1 h3

for some constants C and C1 independent on h. We may say that

• uh converges to u0 with an error proportional to h3 as h → 0.


18
The functions are quadratic with respect to each of the arguments ξ and ν, but not quadratic overall, caused by the contributions
ξ ν and ξ ν 2 .
2

SHA 21-5-21
CHAPTER 6. FINITE ELEMENT METHODS 456

Φ Φ
1 2
0

0
4 -0.2 4 -0.2
0

-0.2
-0.2
3 -0. 3
2

0..2
y

00..864
0

-0.
0

0
2 0.2 0 2
0
0.4
0
1 0..86 1

0
1 2 3 4 1 2 3 4
x x
Φ Φ
3 4
0 0.0.8 0

0.68
0.20.4 6

0.
0.4
0

-0.

0.2
4 -0.2 4 2
0

-0.2
-0

3 3 0
.2

2
-0.
y

y
0

2 0
0 -0.
2 2

1 1 0
0
1 2 3 4 1 2 3 4
x x
Φ Φ
5 6
2
0..4
0
4
0.2
4 0.2 0.6
0.8
0.4
4

0.2
0.
0.2

0.6
0.4

3 3 0.6 .4
y

0
0.2 0.2
0.8

2 2
0.6 0.4
0.2

1 1
1 2 3 4 1 2 3 4
x x
Φ Φ
7 8
0.2

0. 0.2
0.2
0.4

6 0.2
0.

0.4
0.4

4 4
0.6 0.8 0.6
0.4

0.2

3 0.4 3
0.
y

y
2

0.2
0.6
0.8

2 2
4
0.
2
1 1 0.

1 2 3 4 1 2 3 4
x x

Figure 6.27: Contour levels of the shape functions on a 8 node quadrilateral element

SHA 21-5-21
CHAPTER 6. FINITE ELEMENT METHODS 457

• ∇uh converges to ∇u0 with an error proportional to h2 as h → 0.


The numerical integration schemes for linear quadrilateral elements can be used for second order elements
too. With this we have all the tools to apply the ideas from Section 6.5 to construct the matrices and vectors
required for an FEM algorithm, based on second order quadrilateral elements with eight nodes.

The above is the basis to implement a FEM algorithm based on quadrilateral elements.

6.9 An Application of FEM to a Tumor Growth Model


6.9.1 Introduction
The goal is to examine the growth of tumor cells in healthy tissue, i.e. find a simple mathematical model to
describe this situation.
Let 0 ≤ u(t, ~x) ≤ 1 describe the concentration of tumor cells, i.e. u = 0: no tumor and u = 1 only
tumor. There are two effects contribution to the tumor growth and spreading.
• growth: assume that the growth rate of the tumor is given by the ordinary differential equation

u̇(t) = α · f (u(t)) = α · u(t) · (1 − u(t))

where α > 0 is a parameter. This is called a logistic growth model. Find the graph of the function for
d
α = 1 and the solution of the corresponding logistic differential equation dt u(t) = u(t) · (1 − u(t))
in Figure 6.28.
• diffusion: the tumor cells will also spread out, just like heat spreads in a medium. Thus describe this
effect by a heat equation.

0.25 1
growth rate of concentration u

0.2 0.8
concentration u

0.15 0.6

0.1 0.4

0.05 0.2

0 0
0 0.2 0.4 0.6 0.8 1 0 2 4 6 8
concentration u time t
(a) the function for logistic growth (b) solution of the logistic differential equation

Figure 6.28: The function leading to logistic growth and the solution of the differential equation

The above two effects lead to the partial differential equation


d
u(t, ~x) = ∆u(t, ~x) + α f (u(t, ~x)) . (6.17)
dt
To get a first impression it is a good idea to assume radial symmetry, i.e. the function u(t, ~x) depends on
the radius r = k~xk only. For this we express the Laplace operator ∆u in spherical coordinates, i.e. for
functions depending on the radius r only.
∂2 u ∂2 u ∂2 u
 
1 ∂ 2 ∂u
∆u = + + = r .
∂x2 ∂y 2 ∂z 2 r2 ∂r ∂r

SHA 21-5-21
CHAPTER 6. FINITE ELEMENT METHODS 458

Now equation (6.17) takes the form


 
d 1 ∂ 2 ∂ u(t, r)
u(t, r) = 2 r + α f (u(t, r))
dt r ∂r ∂r

or after a multiplication by r2
 
d 2 ∂ 2 ∂ u(t, r)
r u(t, r) = r + r2 α f (u(t, r)) . (6.18)
dt ∂r ∂r

This has to be supplemented with appropriate initial and boundary values. The goal is to examine the
behavior of solutions of this partial differential equation, using finite elements.

6.9.2 The Finite Element Method Applied to Static 1D Problems


First examine steady state boundary value problems. For given functions a(r) > 0, b(r) and f (r) we
examine a boundary value problem of the form19
0
a(r) u0 (r) + b(r) f (r) = 0 (6.19)

with some boundary conditions. Multiplying (6.19) by a smooth test function φ(r) and an integration by
parts leads to
Z R 0 
0 = a(r) u0 (r) + b(r) f (r) φ(r) dr
0
R Z R
0
= a(r) u (r) φ(r) r=0 + −a(r) u0 (r) φ0 (r) + b(r) f (r) φ(r) dr . (6.20)
0

Using FEM this equation will be discretized, leading to the stiffness matrix A and the weight matrix M,
such that hA~u − Mf~ , φi
~ = 0 for all vectors φ.
~ This then leads to the linear system A~u = Mf~ to be solved
for the vector ~u .
The section below will lead to a numerical implementation of the above idea. Then the developed
MATLAB/Octave code will be tested with the help of a few example problems.

Interpolation and Gauss integration


To generate a finite element formulation first examine an interval ri ≤ r ≤ ri+1 . For given coefficient
functions a(r), b(r) and f (r) and the values of the functions u(r) and φ(r) given at the three nodes at
r = ri , ri +r2 i+1 and ri+1 use a quadratic interpolation to construct the functions u(r) and φ(r) on the
interval. Then integrate
Z ri+1 Z ri+1
I0 = b(r) f (r) φ(r) dr and I1 = a(r) u0 (r) φ0 (r) dr .
ri ri

To compute these integrals first examine the very efficient Gauss integration on a standard interval [ −h
2 ,
+h
2 ]
of length h.

• On the interval − h2 ≤ x ≤ + h2 the Gauss integration formula is given by


r r !
h/2
h 3h 3h
Z
u(x) dx ≈ 5 u(− ) + 8 u(0) + 5 u(+ ) . (6.21)
−h/2 18 52 52
19
At this stage using two functions f (r) and b(r) seem to be overkill, but this will turn out to be useful when solving the dynamic
problem in Section 6.9.3.

SHA 21-5-21
CHAPTER 6. FINITE ELEMENT METHODS 459

• The three values of a function u(x) at u(−h/2) = u− , u(0) = u0 and u(h/2) = u+ determine a
quadratic interpolating polynomial20
u+ − u− u+ − 2 u0 + u−
u(x) = u0 + x+ 2 x2 .
h h2
q
Use x = 0 and x = ± 35 h
2 to determine the values of u(x) at the Gauss points by
 q q 
3 3
 q   
3h 3 4 3
u(− 5 2 )  10 +2
5
10 10
2  
 − u−
5
   
u(0)  =  0 1 0  ·  u0 
 
    
 q  q q
3h 3 3 
u

u(+ 5 2 ) 3 5 4 3 5 +
10 − 2 10 10 + 2
 √ √     
3 + 15 4 3 − 15 u− u−
1      
=  0 10 0  ·  u0  = G0 ·  u0  .
10  √ √     
3 − 15 4 3 + 15 u+ u+

Use this Gaussian interpolation matrix to compute the values of the function at the Gauss integration
points, using the values at the nodes.

• The above can be repeated to obtain the values of the derivative u0 (x) at the Gauss points.
 q   q q q     
0 3h
u (− 5 2 ) −1 − 2 5 +4 5 +1 − 2 35
3 3
 1  u − u−
    1  
 0
u (0) =
 
−1 0 +1 ·
 
u = G · u .
 h  q   0  h 1  0
  
 q q q
u0 (+ 35 h2 ) −1 + 2 35 −4 35 +1 + 2 35 u+ u+

• Define a weight matrix W by


 
5
18 0 0
5 8 5 
8

W = diag([ , , ]) = 
 0 18 0 
18 18 18 
5
0 0 18

and then rewrite (6.21) in the form


3
r r !
h/2
h 3h 3h
Z X
u(x) dx ≈ 5 u(− ) + 8 u(0) + 5 u(+ ) =h (W G0 ~u)i
−h/2 18 52 52
i=1

where the vector ~u contains the values of the function at ± h2 and 0 .

• To evaluate the function a(x) at the Gauss points use the notation
 q 
a(− 35 h2 ) 0 0
 
a= 0 a(0) 0
 

 q 
0 0 a(+ 35 h2 )

and similarly for the function b(r), leading to a diagonal matrix b.


20
To verify the formula use u(0) = u0 and for x = ±h
h u+ − u− h u + − 2 u 0 + u − 2 h2 1 1 1 1
u(± ) = u0 ± + = u0 (1 − 1) + u+ (± + ) − u− (± − ) .
2 h 2 h2 4 2 2 2 2

SHA 21-5-21
CHAPTER 6. FINITE ELEMENT METHODS 460

The above notation leads to the required integrals. With ∆ri = ri+1 − ri obtain
Z ri+1
I0 = b(r) f (r) φ(r) dr ≈ ∆ri hW b G0 f~ , G0 φi ~ = ∆ri hGT W b G0 f~ , φi
0
~
r
Z iri+1
∆ri ~ = 1 hGT W a G1 ~u , φi
~
I1 = a(r) u0 (r) φ0 (r) dr ≈ 2
hW a G1 ~u , G1 φi 1
ri (∆ri ) ∆ri

Now work on the interval [0, R], discretized by 0 = r0 < r1 < r2 < . . . < rn−1 < rn = R. Then examine
the discrete version of the weak solution, thus integrals of the type
Z R 0
I = a(r) u0 (r) φ(r) − b(r) f (r) φ(r) dr
0
n−1
X Z ri+1 0
= a(r) u0 (r) φ(r) − b(r) f (r) φ(r) dr
i=0 ri
n−1
X 
1 T ~ T ~ ~
≈ hG1 W ai G1 ~ui , φi i − ∆ri hG0 W bi G0 fi , φi i
∆ri
i=0
= hA ~u − M f~ , φi
~ for all vectors φ
~.

The stiffness matrix A and the weight matrix M are both of size (2 n + 1) × (2 n + 1), but possible
boundary conditions are not take into account yet21 . This has to be done with some care, since the differential
equation (6.19) has a unique solution only if boundary conditions are specified.

Simplified matrices for differential equations with constant coefficients


If the coefficients a and b are constant the above formulas can be simplified by using
 
+7 −8 +1
1 
GT1 WG1 = 

−8 +16 −8 
3  
+1 −8 +7
21
We also ignored contributions of the type c(r) u0 (r) or h(r) u(r) in the differential equation. These contributions can be
treated using exactly the same procedures, e.g.
Z ri+1
∆ri
c(r) u0 (r) φ(r) dr ≈ hW c G1 ~ ~ = hGT0 W c G1 ~
u , G0 φi ~
u , φi
ri ∆ri
Z ri+1
h(r) u(r) φ(r) dr ≈ ∆ri hW h G0 ~ ~ = ∆ri hGT0 W h G0 ~
u , G0 φi ~ .
u , φi
ri

Thus it is straightforward to adapt the algorithm and the codes to examine differential equations of the type
0
a(r) u0 (r) + c(r) u0 (r) + h(r) u(r) + b(r) f (r) = 0 .

SHA 21-5-21
CHAPTER 6. FINITE ELEMENT METHODS 461

RR
and the integral a u0 (r) φ0 (r) dr leads to a matrix
0
 
+7 −8 +1
 
 −8 +16 −8 
 
 
 +1 −8 +14 −8 +1 
 
−8 +16 −8
 
 
 

 +1 −8 +14 −8 +1 

a  
−8 +16 −8 . (6.22)
3 ∆r  
+1 −8 +14 −8 +1
 
 

.. .. .. ..

. . . .
 
 
 

 +1 −8 +14 −8 +1  
 

 −8 +16 −8 

+1 −8 7
Observe that this matrix has a band structure with 3 or 5 elements only arranged along the diagonal. This
matrix is symmetric and for a > 0 positive semi-definite. If we have Dirichlet boundary conditions the first
(or last) row and column are removed, then the matrix will be strictly positive definite. This corresponds to
the analytical observations
Z R Z R
0 2
a |u (r)| dr ≥ 0 and a |u0 (r)|2 dr = 0 ⇐⇒ u(r) = const .
0 0
Some of the key properties are similar to our model matrix An from Section 2.3.1.
Similarly using  
+4 +2 −1
1 
GT0 WG0 =

 +2 +16 +2 
30  
−1 +2 +4
RR
the integral 0 b f (r) φ(r) dr leads to
 
+4 +2 −1
 
 +2 +16 +2 
 
 
 −1 +2 +8 +2 −1 
 
+2 +16 +2
 
 
 

 −1 +2 +8 +2 −1 

b ∆r  
 +2 +16 +2 . (6.23)
30  
−1 +2 +8 +2 −1
 
 

.. .. .. .. ..

. . . . .
 
 
 

 −1 +2 +8 +2 −1 

 

 +2 +16 +2 

−1 +2 +4

Taking boundary conditions into account


The contribution in (6.20) by boundary terms is
R
a(r) u0 (r) φ(r) r=0
= a(R) u0 (R) φ(R) − a(0) u0 (0) φ(0) .

SHA 21-5-21
CHAPTER 6. FINITE ELEMENT METHODS 462

This leads to different algorithms to take Dirichlet or Neumann conditions into account.
• If the value of u(R) is known, then φ(R) needs to be zero and the contribution vanishes. If u(R) = 0
the last value in the vector ~u is zero and we can safely remove this zero in ~u and the last column in
the matrix A.
• If we have no constraint on u(R) the natural boundary condition is a(R) u0 (R) = 0. We do not have
to do anything to take this condition into account.
• For boundary conditions of the type a(R) u0 (r) = c1 + c2 u(R) the correct type of contribution will
have to be added.
This leads to the linear system
A ~u − M f~ = ~0 or ~u = A−1 M f~ = A\M f~ .
The resulting matrix A is symmetric and has a band structure with semi-bandwidth 3, i.e. in each row there
are up to 5 entries about the diagonal.

The MATLAB/Octave code


The matrices A and M generated by the above algorithm depend on the interval and its division in finite
elements and the two functions a(r) and b(r). Thus we write code to construct those matrices. In addition
the new vector with the nodes will be generated, i.e. the mid points of the sub-intervals are added.
GenerateFEM.m
function [A,M,xnew] = GenerateFEM(x,a,b)
% doc to be written
dx = diff(x(:));
n = length(x)-1;
xnew = [x(:)’;x(:)’ + [dx’,0]/2]; xnew = xnew(1:end-1); xnew = xnew(:);
A = sparse(2*n+1,2*n+1); M = A;
%% interpolation matrix for the function values
s06 = sqrt(0.6);
G0 = [3/10+s06/2, 4/10, 3/10-s06/2;
0, 1, 0;
3/10-s06/2, 4/10, 3/10+s06/2];
%% interpolation matrix for the derivative values
G1 = [-1-2*s06, +4*s06, +1-2*s06;
-1, 0, 1;
-1+2*s06, -4*s06, +1+2*s06];
W = diag([5 8 5])/18;
for ind = 1:n
x_elem = x(ind)+dx(ind)/2*(1+[-s06 0 s06]);
M_elem = dx(ind)*G0’*W*diag(b(x_elem))*G0;
A_elem = (G1’*W*diag(a(x_elem))*G1)/dx(ind);
ra = 2*ind-1:2*ind+1;
M(ra,ra) = M(ra,ra) + M_elem;
A(ra,ra) = A(ra,ra) + A_elem;
end%for
end%function

A first example
To solve the boundary value problem
−u00 (r) = 1 on 0 ≤ r ≤ 1 with u(0) = u(1) = 0
1
with the exact solution uexact (r) = 2 r (1 − r) use the code below.
Test1.m

SHA 21-5-21
CHAPTER 6. FINITE ELEMENT METHODS 463

N = 10; % number of elements, then 2*N+1 nodes


x = linspace(0,1,N+1);
a = @(x) 1*ones(size(x)); b = @(x) 1*ones(size(x)); % solve -u"=1

[A,M,r] = GenerateFEM(x,a,b);
A = A(2:end-1,2:end-1); M = M(2:end-1,:); % Dirichlet BC on both ends
f = ones(size(r(:)));
u_exact = r(:).*(1-r(:))/2;
u = A\(M*f);

figure(1); plot(r,[0;u;0],r,u_exact)
xlabel(’r’); ylabel(’u’);title(’solution, exact and approximate’)

figure(2); plot(r,[0;u;0]-u_exact)
xlabel(’r’); ylabel(’u’);title(’difference of solutions, exact and approximate’)

It turns out that the generated solution coincides with the exact solution. This is no real surprise, since the
exact solution is a quadratic function, which we approximate by a piecewise quadratic function. For real
problems this is very unlikely to occur. Thus this example is only useful to verify the algorithm and the
coding.

A second example
To solve the boundary value problem
π π
−u00 (r) = cos(3 r) on 0 ≤ r ≤ with u0 (0) = u( ) = 0
2 2
1
with the exact solution uexact (r) = 9 cos(3 r) use the code Test2.m below. Find the solution in Fig-
ure 6.29(a).
Test2.m
N = 2*10; % number of elements
x = linspace(0,pi/2,N+1);
a = @(x) 1*ones(size(x)); b = @(x) 1*ones(size(x)); % solve -u"= cos(3*x)

[A,M,r] = GenerateFEM(x,a,b);

A = A(1:end-1,1:end-1); % Neumann on the left and Dirichlet BC on the right


M = M(1:end-1,:);
f = cos(3*r(:));
u_exact = 1/9*cos(3*r(:));
u = A\(M*f);
figure(1); plot(r,[u;0],r,u_exact)
xlabel(’r’); ylabel(’u’); title(’solution, exact and approximate’)
figure(2); plot(r,[u;0]-u_exact)
xlabel(’r’); ylabel(’u’); title(’difference of exact and approximate solution’)

By using different values N for the number of elements observe (see Table 6.3) an order of convergence

differences N = 10 N = 20 ratio order of convergence


at the nodes 8.39 · 10−6 5.32 · 10−7 16 = 24 4
at all points 9.36 · 10−5 1.16 · 10−5 8= 23 3

Table 6.3: Maximal approximation error

at the nodes of approximately 4! This is better than to be expected by Theorem 6–11 (page 426). If

SHA 21-5-21
CHAPTER 6. FINITE ELEMENT METHODS 464

the solution is reconstructed between the grid points by a piecewise quadratic interpolation one observes
a cubic convergence. This is the expected result by the abstract error estimate for a piecewise quadratic
approximation. The additional accuracy is caused by the effect of superconvergence, and we can not count
on it to occur. Figure 6.29(b) shows the error at the nodes and also at points between the nodes22 . For those
we obtain the expected third order of convergence. A closer look at the difference of exact and approximate
solution at r ≈ 1.1 also reveals that the approximate solution is not twice differentiable, since the slope of
the curve has a severe jump. The piecewise quadratic interpolation can (and does) lead to non-continuous
derivatives.

Figure 6.29: Exact and approximate solution of the second test problem; (a) the solutions u as functions of r, (b) difference of exact and approximate solution, at the nodes and between the nodes

A third example
The function u_exact(r) = exp(−r²) − exp(−R²) solves the boundary value problem

    −(r² u'(r))' = r² · f(r)    on 0 ≤ r ≤ R    with u'(0) = u(R) = 0

where f(r) = (6 − 4 r²) exp(−r²). In this example the coefficient functions a(r) = b(r) = r² are used.
The effect of super-convergence can be observed for this example too.
Test3.m
N = 20; R = 3;
x = linspace(0,R,N);
a = @(x) x.^2; b = @(x) x.^2;

[A,M,r] = GenerateFEM(x,a,b);
A = A(1:end-1,1:end-1); % Dirichlet BC at the right end point
M = M(1:end-1,:);
r = r(:); f = (6-4*r.^2).*exp(-r.^2);
u_exact = exp(-r.^2)-exp(-R^2);
u = A\(M*f);

figure(1); plot(r,[u;0],r,u_exact)
xlabel('r'); ylabel('u'); title('solution, exact and approximate')
figure(2); plot(r,[u;0]-u_exact)
xlabel('r'); ylabel('u'); title('difference of exact and approximate solution')

²² This figure cannot be generated by the codes in these lecture notes. The Octave command pwquadinterp() allows one to apply the piecewise quadratic interpolation within the elements.
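Returning to Test3.m: the maximal error at the nodes can be computed with one additional line, a small sketch reusing the variables u, u_exact and N of the code above. Rerunning with a larger value of N and comparing the two results exhibits the super-convergence at the nodes.

err_nodes = max(abs([u;0] - u_exact));  % maximal error at the nodes
fprintf('maximal nodal error for N = %d : %g\n', N, err_nodes)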

6.9.3 Solving the Dynamic Tumor Growth Problem


The equation (6.18) to be examined is

    (r² u'(t, r))' + r² α f(u(t, r)) − r² u̇(t, r) = 0    for 0 < r < R and t > 0 .

Using the radial symmetry conclude that ∂u(t,0)/∂r = 0 and thus use a no flux condition ∂u(t,R)/∂r = 0 as
boundary condition for some large radius R. This differential equation has to be supplemented with an
initial condition, e.g.

    u(0, r) = 0.001 · e^(−r²/4) .

This represents a very small initial set of tumor cells located close to the origin at r ≈ 0.

Use FEM and a time discretization


Using the space discretization u(t, r) −→ ~u(t) and the FEM notation from the previous section (with
a(r) = b(r) = r²) leads to²³

    A ~u(t) − α M f~(~u(t)) + M d~u(t)/dt = ~0

or by rearranging terms

    M d~u(t)/dt = −A ~u(t) + α M f~(~u(t)) .                                        (6.24)

Because of the nonlinear function f(u) = u · (1 − u) this is a nonlinear system of ordinary differential
equations. Now use the finite difference method from Section 4.5, starting on page 273, to discretize the
dynamic behavior. To start out use a Crank–Nicolson scheme for the time discretization, but for the nonlinear
contribution use an explicit expression²⁴. This will lead to systems of linear equations to be solved. With
the time discretization ~u_i = ~u(i ∆t) this leads to

    M (~u_{i+1} − ~u_i)/∆t  = −A (~u_{i+1} + ~u_i)/2 + α M f~(~u_i)
    M (~u_{i+1} − ~u_i)     = −(∆t/2) A (~u_{i+1} + ~u_i) + ∆t α M f~(~u_i)
    (M + (∆t/2) A) ~u_{i+1} = (M − (∆t/2) A) ~u_i + ∆t α M f~(~u_i) .

Thus for each time step a system of linear equations has to be solved. The matrix M + (∆t/2) A does not change
for the different time steps. This is a perfect situation to use MATLAB/Octave.
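Since the matrix M + (∆t/2) A does not change, its factorization can be computed once and reused for all time steps. The sketch below illustrates this option; it reuses the variables set up in LogisticModel.m shown further down and is not part of the original code. For the moderate problem sizes used here the plain backslash operator is perfectly adequate.

% optional: factor the constant matrix B = M + dt/2*A once and reuse the
% LU factors in every Crank-Nicolson step
B = M + dt/2*A; C = M - dt/2*A;
[L,U,P] = lu(B);                  % factorization computed only once
for ii = 1:Nt
  rhs = C*u + dt*al*M*LogF(u);
  u = U\(L\(P*rhs));              % two triangular solves per time step
  t = t + dt;
end%for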

Create an animation
The above is implemented in MATLAB/Octave and an animation²⁵ can be displayed on screen. Find the final
state in Figure 6.30.

²³ The static equation (a(r) u'(r))' + b(r) f(r) = 0 leads to the linear system A ~u − M f~ = ~0.
²⁴ This can be improved, see later in the notes.
²⁵ With the animation the code took 12.5 seconds to run, without the plots only 0.12 seconds. Thus most of the time is used to generate and display the plots.

LogisticModel.m

R = 50; % space interval [0,R]
T = 6;  % time interval [0,T]
x = linspace(0,R,200); Nt = 200;
LogF = @(u) max(0,u.*(1-u)); al = 10; % scaling factor
u0 = @(x,R)0.001*exp(-x.^2/4);
a = @(x) x.^2; b = @(x) x.^2;
[A,M,r] = GenerateFEM(x,a,b);
dt = T/Nt; u = u0(r,R); t = 0;

for ii = 1:Nt
  u = (M+dt/2*A)\((M-dt/2*A)*u + dt*M*al*LogF(u));% Crank-Nicolson
  t = t + dt;
  figure(1); plot(r,u)
  xlabel('radius r'); ylabel('density u')
  axis([0 R -0.2 1.1]); text(0.7*R,0.7,sprintf('t = %5.3f',t))
  drawnow();
end%for

Figure 6.30: A snapshot of the solution (density u as function of the radius r at time t = 6.000)

Generate surfaces and contours


With a modification of the above code one can generate surfaces and contours for the concentration of the
tumor cells for short (Figure 6.31) and long (Figure 6.32) time intervals.
LogisticModelContour.m
% short time interval
R = 35/3; % space interval [0,R]
T = 5/3;  % time interval [0,T]
Nx = 100; x = linspace(0,R,Nx); Nt = 100;

% long time interval
%R = 100; % space interval [0,R]
%T = 15;  % time interval [0,T]
%Nx = 400; x = linspace(0,R,Nx); Nt = 400;

u_all = zeros(2*Nx-1,Nt+1);
LogF = @(u) max(0,u.*(1-u)); al = 10; % scaling factor
u0 = @(x,R)0.001*exp(-x.^2/4);

a = @(x) x.^2; b = @(x) x.^2;
[A,M,r] = GenerateFEM(x,a,b);

dt = T/Nt;
u = u0(r,R);
u_all(:,1) = u;
t = 0;
for ii = 1:Nt
  u = (M+dt/2*A)\((M-dt/2*A)*u + dt*M*al*LogF(u));% Crank-Nicolson
  t = t + dt;
  u_all(:,ii+1) = u;
end%for

figure(2); mesh(0:dt:T,r,u_all)
ylabel('radius r'); xlabel('time t'); zlabel('concentration u')
axis([0,T,0,R,0,1])
figure(3); contour(0:dt:T,r,u_all,[0.1 0.5 0.9])
caxis([0 1]); ylabel('radius r'); xlabel('time t')

Figure 6.31: Concentration u(t, r) as function of time t and radius r on a short time interval and contours; (a) surface, (b) contours at concentrations of 10%, 50% and 90%

Discussion
• In Figure 6.31 observe the small initial seed growing to a moving front where the concentration
increases from 0 to 1 over a short distance.

• In Figure 6.32 it is obvious that this front moves with a constant speed, without changing width or
shape.

• The above is a clear indication that for the moving front section the original equation

      r² u̇(t, r) = (r² u'(t, r))' + r² α f(u(t, r))
        u̇(t, r) = u''(t, r) + (2/r) u'(t, r) + α f(u(t, r))

  can be replaced by Fisher's equation

      u̇(t, r) = D u''(t, r) + α u(t, r) (1 − u(t, r)) ,

  i.e. the contribution (2/r) u'(t, r) is dropped. The behavior of solutions of this equation is well studied
  and the literature is vast!

Figure 6.32: Concentration u(t, r) as function of time t and radius r on a long time interval and contours; (a) surface, (b) contours at concentrations of 10%, 50% and 90%
• If clinical data is available the real task will be to find good values for the parameters D and α to
match the observed behavior of the tumors.
• Instead of the function f (u) = u · (1 − u) other functions may be used to describe the growth of the
tumor.
• For small time steps the algorithm showed severe instability, caused by the nonlinear contribution
f (u) = u · (1 − u). Using max{f (u), 0} improved the situation slightly.
• The traveling speed of the front depended rather severely on the time step. This is not surprising as
  the contribution f(~u_i) is the driving force, and it is lagging behind by ∆t/2, since Crank–Nicolson
  is formulated at time t = t_i + ∆t/2. One should use f((~u_i + ~u_{i+1})/2) instead, but this leads to a
  nonlinear system of equations to be solved for each time step. A small sketch for estimating the
  observed front speed is given below.
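The idea of the sketch: track the largest radius where the computed concentration still exceeds 1/2 and fit a straight line to these positions for later times. The threshold 0.5 and the use of the matrix u_all from LogisticModelContour.m are choices made here for illustration only; they are not part of the original codes.

% estimate the observed front speed from u_all computed in LogisticModelContour.m
t_all = 0:dt:T; front = nan(size(t_all));
for k = 1:length(t_all)
  jj = find(u_all(:,k) > 0.5, 1, 'last');   % front position at time t_all(k)
  if ~isempty(jj), front(k) = r(jj); end%if
end%for
sel = (t_all > T/2) & ~isnan(front);        % use later times where a front exists
p = polyfit(t_all(sel), front(sel), 1);     % linear fit front(t) = p(1)*t + p(2)
fprintf('observed front speed = %g\n', p(1))

The fitted slope p(1) can then be compared for different time steps or algorithms, provided the corresponding run stores the solution history in u_all as LogisticModelContour.m does.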

Improvements for the nonlinear Crank–Nicolson step


In the above approach a Crank–Nicolson scheme is used for the linear part and an explicit scheme for the
nonlinear part, i.e. solve
    M (~u_{i+1} − ~u_i)/∆t = −A (~u_{i+1} + ~u_i)/2 + α M f~(~u_i)

for ~u_{i+1}. This approach is consistent of order 1 with respect to the time step, i.e. the error is expected to be
proportional to ∆t. A more consistent approximation should evaluate the nonlinear function at the midpoint
too, i.e. at (~u_{i+1} + ~u_i)/2. Then the approximation error is expected to be proportional to (∆t)². Thus one
should solve the nonlinear system

    M (~u_{i+1} − ~u_i)/∆t = −A (~u_{i+1} + ~u_i)/2 + α M f~((~u_{i+1} + ~u_i)/2)
for the unknown ~ui+1 . There are (at least) two approaches for this.
• Use a linear approximation for the nonlinear term. Based on the approximation²⁶

      f((~u_i + ~u_{i+1})/2) = f(~u_i + (~u_{i+1} − ~u_i)/2) ≈ f(~u_i) + f'(~u_i) · (~u_{i+1} − ~u_i)/2

  find the slightly modified system of equations for ~u_{i+1}:

      M (~u_{i+1} − ~u_i)/∆t = −A (~u_{i+1} + ~u_i)/2 + α M f~((~u_i + ~u_{i+1})/2)
                             ≈ −A (~u_{i+1} + ~u_i)/2 + α M ( f~(~u_i) + f~'(~u_i) · (~u_{i+1} − ~u_i)/2 )

      (M + (∆t/2) A − (α ∆t/2) M f~'(~u_i)) ~u_{i+1} = (M − (∆t/2) A − (α ∆t/2) M f~'(~u_i)) ~u_i + ∆t α M f~(~u_i) .

  ²⁶ Approximating (f(~u_i) + f(~u_{i+1}))/2 leads to the identical formula.

This can be implemented with MATLAB/Octave, leading to the code in LogisticModel2.m. The
solutions look very similar to the ones from the previous section, but the traveling speed of the front
is different. The linear system to be solved is modified for each time step, thus the computational
effort is higher. Since all matrices have a very narrow band structure the computational penalty is not
very high. A more careful examination of the above approach reveals that this is actually one step of
Newton's method to solve the nonlinear system of equations.

• One might as well use Newton's algorithm to solve the nonlinear system of equations for ~u_{i+1}. To
  apply the algorithm consider each time step as a nonlinear system of equations F(~u_{i+1}) = ~0. To
  approximate F(~u_{i+1} + Φ~) differentiate with respect to ~u_{i+1}.

      ~0 = F(~u_{i+1}) = M (~u_{i+1} − ~u_i)/∆t + A (~u_{i+1} + ~u_i)/2 − α M f~((~u_i + ~u_{i+1})/2)
      DF(~u_{i+1}) Φ~ = (1/∆t) M Φ~ + (1/2) A Φ~ − (α/2) M f~'((~u_i + ~u_{i+1})/2) Φ~
      ~0 = F(~u_{i+1}) + DF(~u_{i+1}) Φ~ ,   solve for Φ~
      ~u_{i+1} −→ ~u_{i+1} + Φ~ = ~u_{i+1} − (DF(~u_{i+1}))⁻¹ F(~u_{i+1})

This can be implemented with MATLAB/Octave, leading to the code in LogisticModel3.m. The
solutions look very similar to the ones from the previous section, and the traveling speed of the front
is close to the speed observed in LogisticModel2.m.

• The different speeds observed for the above approaches should trigger the question: what is the correct
  speed? For this examine the results of four different runs.

1. Use the explicit approximation in LogisticModel.m with Nt=200 time steps.
2. Use the explicit approximation in LogisticModel.m with Nt=3200 time steps.
3. Use the linearized approximation in LogisticModel2.m with Nt=200 time steps.
4. Use the nonlinear approximation in LogisticModel3.m with Nt=200 time steps.

Then examine the graphical results in Figure 6.33.

– The solutions by the linearized and fully nonlinear approaches differ very little. This is no
surprise, since the linearized approach is just the first step of Newton’s method, which is used
for the nonlinear approach.
– The explicit solution with Nt=200 leads to a clearly slower speed and the smaller time step with
  Nt=3200 leads to a speed much closer to the one observed by the nonlinear approach. This
  clearly indicates that the observed speed depends on the time step, and for smaller time steps the
  front moves with a speed closer to the one observed for the linearized and nonlinear
  approach.
– As a consequence one should use the linearized or nonlinear approach.


Figure 6.33: Graph of the solutions at t = 6.000 by four computations with different algorithms (explicit with Nt=200, explicit with Nt=3200, linearized with Nt=200, nonlinear with Nt=200); density u as function of the radius r

LogisticModel2.m
R = 50; % space interval [0,R]
T = 6;  % time interval [0,T]
x = linspace(0,R,200); Nt = 200;
LogF = @(u) u.*(1-u); dLogF = @(u) 1-2*u;
al = 10; % scaling factor
u0 = @(x,R)0.001*exp(-x.^2/4);
a = @(x) x.^2; b = @(x) x.^2;
[A,M,r] = GenerateFEM(x,a,b);
dt = T/Nt; u = u0(r,R); t = 0;

for ii = 1:Nt
  dfu = dt*al/2*M*diag(dLogF(u));
  u = (M+dt/2*A-dfu)\((M-dt/2*A-dfu)*u + dt*M*al*LogF(u));% Crank-Nicolson
  t = t + dt;
  figure(1); plot(r,u); xlabel('radius r'); ylabel('density u')
  axis([0 R -0.2 1.1]); text(0.7*R,0.7,sprintf('t = %5.3f',t))
  drawnow();
end%for

LogisticModel3.m
% use true Newton
R = 50; % space interval [0,R]
T = 6;  % time interval [0,T]
x = linspace(0,R,200); Nt = 200;
LogF = @(u) u.*(1-u); dLogF = @(u) 1-2*u;
al = 10; % scaling factor
u0 = @(x,R)0.001*exp(-x.^2/4);
a = @(x) x.^2; b = @(x) x.^2;
[A,M,r] = GenerateFEM(x,a,b);
dt = T/Nt; u = u0(r,R); t = 0;

for ii = 1:Nt
  up = u; % start for Newton iteration
  for iter = 1:3 % choose number of Newton steps
    F = 1/dt*M*(up-u) + 1/2*A*(u+up) - al*M*LogF((u+up)/2);
    DF = 1/dt*M + 1/2*A - 1/2*al*M*diag(dLogF((u+up)/2));
    phi = DF\F;
    % disp([norm(phi),norm(F)]) % trace the error
    up = up - phi;
  end%for
  u = up; t = t + dt; % update solution
  figure(1); plot(r,u); xlabel('radius r'); ylabel('density u')
  axis([0 R -0.2 1.1]); text(0.7*R,0.7,sprintf('t = %5.3f',t))
  drawnow();
end%for

Bibliography
[AxelBark84] O. Axelsson and V. A. Barker. Finite Element Solution of Boundary Value Problems. Aca-
demic Press, 1984.

[Brae02] D. Braess. Finite Elemente. Theorie, schnelle Löser und Anwendungen in der Elastizitätstheorie.
Springer, second edition, 2002.

[Cowp73] G. R. Cowper. Gaussian quadrature formulas for triangles. International Journal on Numerical
Methods and Engineering, 7:405–408, 1973.

[Davi80] A. J. Davies. The Finite Element Method: a First Approach. Oxford University Press, 1980.

[Gmur00] T. Gmür. Méthode des éléments finis en mécanique des structures. Mécanique (Lausanne).
Presses polytechniques et universitaires romandes, 2000.

[Hugh87] T. J. R. Hughes. The Finite Element Method, Linear Static and Dynamic Finite Element Analysis.
Prentice–Hall, 1987. Reprinted by Dover.

[John87] C. Johnson. Numerical Solution of Partial Differential Equations by the Finite Element Method.
Cambridge University Press, 1987. Republished by Dover.

[KnabAnge00] P. Knabner and L. Angermann. Numerik partieller Differentialgleichungen. Springer Ver-


lag, Berlin, 2000.

[LascTheo87] P. Lascaux and R. Théodor. Analyse numérique matricielle appliquée a l’art de l’ingénieur,
Tome 2. Masson, Paris, 1987.

[www:triangle] J. R. Shewchuk. https://www.cs.cmu.edu/˜quake/triangle.html.

[StraFix73] G. Strang and G. J. Fix. An Analysis of the Finite Element Method. Prentice–Hall, 1973.

[TongRoss08] P. Tong and J. Rossettos. Finite Element Method, Basic Technique and Implementation. MIT,
1977. Republished by Dover in 2008.

[Zien13] O. Zienkiewicz, R. Taylor, and J. Zhu. The Finite Element Method: Its Basis and Fundamentals.
Butterworth-Heinemann, 7 edition, 2013.

Find a longer literature list for FEM, with some comments, in Section 0.3.2 and Table 2, starting on
page 3.

Bibliography

[AbraSteg] M. Abramowitz and I. A. Stegun. Handbook of Mathematical Functions. Dover, 1972.

[Acto90] F. S. Acton. Numerical Methods that Work; 1990 corrected edition. Mathematical Association of
America, Washington, 1990.

[Agga20] C. Aggarwal. Linear Algebra and Optimization for Machine Learning. Springer, first edition,
2020.

[AmreWihl14] M. Amrein and T. Wihler. An adaptive Newton-method based on a dynamical systems


approach. Communications in Nonlinear Science and Numerical Simulation, 19(9):2958–2973, 2014.

[Aris62] R. Aris. Vectors, Tensors and the Basic Equations of Fluid Mechanics. Prentice Hall, 1962.
Republished by Dover.

[AtkiHan09] K. Atkinson and W. Han. Theoretical Numerical Analysis. Number 39 in Texts in Applied
Mathematics. Springer, 2009.

[Axel94] O. Axelsson. Iterative Solution Methods. Cambridge University Press, 1994.

[AxelBark84] O. Axelsson and V. A. Barker. Finite Element Solution of Boundary Value Problems. Aca-
demic Press, 1984.

[templates] R. Barrett, M. Berry, T. F. Chan, J. Demmel, J. Donato, J. Dongarra, V. Eijkhout, R. Pozo,


C. Romine, and H. V. der Vorst. Templates for the Solution of Linear Systems: Building Blocks for
Iterative Methods, 2nd Edition. SIAM, Philadelphia, PA, 1994.

[BoneWood08] J. Bonet and R. Wood. Nonlinear Continuum Mechanics for Finite Element Analysis. Cam-
bridge University Press, 2008.

[BoriTara79] A. I. Borisenko and I. E. Tarapov. Vector and Tensor Analysis with Applications. Dover, 1979.
first published in 1966 by Prentice–Hall.

[Bowe10] A. F. Bower. Applied Mechanics of Solids. CRC Press, 2010. web site at solidmechanics.org.

[Brae02] D. Braess. Finite Elemente. Theorie, schnelle Löser und Anwendungen in der Elastizitätstheorie.
Springer, second edition, 2002.

[Butc03] J. Butcher. Numerical Methods for Ordinary Differential Equations. John Wiley & Sons, Ltd,
second edition, 2003.

[Butc16] J. Butcher. Numerical Methods for Ordinary Differential Equations. John Wiley & Sons, Ltd,
third edition, 2016.

[ChouPaga67] P. C. Chou and N. J. Pagano. Elasticity, Tensors, dyadic and engineering Approaches. D
Van Nostrand Company, 1967. Republished by Dover 1992.

[Ciar02] P. G. Ciarlet. The Finite Element Method for Elliptic Problems. SIAM, 2002.


[Cowp73] G. R. Cowper. Gaussian quadrature formulas for triangles. International Journal on Numerical
Methods and Engineering, 7:405–408, 1973.

[TopTen] B. A. Cypra. The best of the 20th century: Editors name top 10 algorithms. SIAM News, 2000.

[DahmReus07] W. Dahmen and A. Reusken. Numerik für Ingenieure und Naturwissenschaftler. Springer,
2007.

[Davi80] A. J. Davies. The Finite Element Method: a First Approach. Oxford University Press, 1980.

[Deim84] K. Deimling. Nonlinear Functional Analysis. Springer Verlag, 1984.

[DeisFaisOng20] M. P. Deisenroth, A. A. Faisal, and C. S. Ong. Mathematics for Machine Learning.


Cambridge University Press, 2020. pre–publication.

[Demm97] J. W. Demmel. Applied Numerical Linear Algebra. SIAM, Philadelphia, 1997.

[www:LinAlgFree] J. Dongarra. Freely available software for linear algebra on the web.
http://www.netlib.org/utk/people/JackDongarra/la-sw.html.

[DowdSeve98] K. Dowd and C. Severance. High Performance Computing. O’Reilly, 2nd edition, 1998.

[DrapSmit98] N. Draper and H. Smith. Applied Regression Analysis. Wiley, third edition, 1998.

[Farl82] S. J. Farlow. Partial Differential Equations for Scientist and Engineers. Dover, New York, 1982.

[Froc16] J. Frochte. Finite-Elemente-Methode: Eine praxisbezogene Einführung mit GNU Octave/MAT-


LAB. Hanser Fachbuchverlag, 2016. Octave/Matlab code available.

[GhabPeckWu17] J. Ghaboussi, D. Pecknold, and X. Wu. Nonlinear Computational Solid Mechanics. CRC
Press, 2017.

[GhabWu16] J. Ghaboussi and X. Wu. Numerical Methods in Computational Mechanics. CRC Press, 2016.

[Gmur00] T. Gmür. Méthode des éléments finis en mécanique des structures. Mécanique (Lausanne).
Presses polytechniques et universitaires romandes, 2000.

[Gold91] D. Goldberg. What every computer scientist should know about floating-point arithmetic. ACM
Computing Surveys, 23(1), March 1991.

[GoluVanLoan96] G. Golub and C. Van Loan. Matrix Computations. Johns Hopkins University Press, third
edition, 1996.

[GoluVanLoan13] G. Golub and C. Van Loan. Matrix Computations. Johns Hopkins University Press,
fourth edition, 2013.

[Gree77] D. T. Greenwood. Classical Dynamics. Prentice Hall, 1977. Republished by Dover 1997.

[Hack94] W. Hackbusch. Iterative Solution of Large Sparse Systems of Equations, volume 95 of Applied
Mathematical Sciences. Springer, first edition, 1994.

[Hack16] W. Hackbusch. Iterative Solution of Large Sparse Systems of Equations, volume 95 of Applied
Mathematical Sciences. Springer, second edition, 2016.

[Hack15] R. Hackett. Hyperelasticity Primer. Springer International Publishing, 2015.

[HairNorsWann08] E. Hairer, S. Nørsett, and G. Wanner. Solving Ordinary Differential Equations I: Non-
stiff Problems. Springer Series in Computational Mathematics. Springer Berlin Heidelberg, second
edition, 1993. third printing 2008.


[HairNorsWann96] E. Hairer, S. Nørsett, and G. Wanner. Solving Ordinary Differential Equations II:
Stiff and Differential-Algebraic Problems. Lecture Notes in Economic and Mathematical Systems.
Springer, second edition, 1996.

[Hear97] E. J. Hearns. Mechanics of Materials 1. Butterworth–Heinemann, third edition, 1997.

[HenrWann17] P. Henry and G. Wanner. Johann Bernoulli and the cycloid: A theorem for posterity.
Elemente der Mathematik, 72(4):137–163, 2017.

[HeroArnd01] H. Herold and J. Arndt. C-Programmierung unter Linux. SuSE Press, 2001.

[Holz00] G. A. Holzapfel. Nonlinear Solid Mechanics, a Continuum Approach for Engineering. John Wi-
ley & Sons, 2000.

[HornJohn90] R. Horn and C. Johnson. Matrix Analysis. Cambridge University Press, 1990.

[Hugh87] T. J. R. Hughes. The Finite Element Method, Linear Static and Dynamic Finite Element Analysis.
Prentice–Hall, 1987. Reprinted by Dover.

[Intel90] Intel Corporation. i486 Microprocessor Programmers Reference Manual. McGraw-Hill, 1990.

[IsaaKell66] E. Isaacson and H. B. Keller. Analysis of Numerical Methods. John Wiley & Sons, 1966.
Republished by Dover in 1994.

[John87] C. Johnson. Numerical Solution of Partial Differential Equations by the Finite Element Method.
Cambridge University Press, 1987. Republished by Dover.

[Kell92] H. B. Keller. Numerical Methods for Two–Point Boundary Value Problems. Dover, 1992.

[KhanKhan18] O. Khanmohamadi and E. Khanmohammadi. Four fundamental spaces of numerical analy-
sis. Mathematics Magazine, 91(4):243–253, 2018.

[KnabAnge00] P. Knabner and L. Angermann. Numerik partieller Differentialgleichungen. Springer Ver-


lag, Berlin, 2000.

[Knor08] M. Knorrenschild. Numerische Mathematik. Carl Hanser Verlag, 2008.

[Koko15] J. Koko. Approximation numérique avec Matlab, Programmation vectorisée, équations aux
dérivées partielles. Ellipses, Paris, 2015.

[LascTheo87] P. Lascaux and R. Théodor. Analyse numérique matricielle appliquée a l’art de l’ingénieur,
Tome 2. Masson, Paris, 1987.

[LiesTich05] J. Liesen and P. Tichý. Convergence analysis of Krylov subspace methods. GAMM Mitt. Ges.
Angew. Math. Mech., 27(2):153–173 (2005), 2004.

[Linz79] P. Linz. Theoretical Numerical Analysis. John Wiley& Sons, 1979. Republished by Dover.

[MeybVach91] K. Meyberg and P. Vachenauer. Höhere Mathematik II. Springer, Berlin, 1991.

[MontRung03] D. Montgomery and G. Runger. Applied Statistics and Probability for Engineers. John
Wiley & Sons, third edition, 2003.

[DLMF15] NIST. NIST Digital Library of Mathematical Functions, 2015.

[Oden71] J. Oden. Finite Elements of Nonlinear Continua. Advanced engineering series. McGraw-Hill,
1971. Republished by Dover, 2006.


[Ogde13] R. Ogden. Non-Linear Elastic Deformations. Dover Civil and Mechanical Engineering. Dover
Publications, 2013.

[OttoPete92] N. S. Ottosen and H. Petersson. Introduction to the Finite Element Method. Prentice Hall,
1992.

[Pres92] W. H. Press, B. P. Flannery, S. A. Teukolsky, and W. T. Vetterling. Numerical Recipes in C, The


Art of Scientific Computing. Cambridge University Press, second edition, 1992.

[Prze68] J. Przemieniecki. Theory of Matrix Structural Analysis. McGraw–Hill, 1968. Republished by


Dover in 1985.

[RalsRabi78] A. Ralston and P. Rabinowitz. A first Course in Numerical Analysis. McGraw–Hill, second
edition, 1978. Republished by Dover in 2001.

[RawlPantuDick98] J. Rawlings, S. Pantula, and D. Dickey. Applied regression analysis. Springer texts in
statistics. Springer, New York, 2. ed edition, 1998.

[Redd84] J. N. Reddy. An Introduction to the Finite Element Analysis. McGraw–Hill, 1984.

[Redd13] J. N. Reddy. An Introduction to Continuum Mechanics. Cambridge University Press, 2nd edition,
2013.

[Redd15] J. N. Reddy. An Introduction to Nonlinear Finite Element Analysis. Oxford University Press, 2nd
edition, 2015.

[Saad00] Y. Saad. Iterative Methods for Sparse Linear Systems. PWS, second edition, 2000. available on
the internet.

[SchnWihl11] H. R. Schneebeli and T. Wihler. The Newton-Raphson Method and Adaptive ODE Solvers.
Fractals, 19(1):87–99, 2011.

[Schw86] H. R. Schwarz. Numerische Mathematik. Teubner, Braunschweig, 1986.

[Schw88] H. R. Schwarz. Finite Element Method. Academic Press, 1988.

[Schw09] H. R. Schwarz. Numerische Mathematik. Teubner und Vieweg, 7. edition, 2009.

[Sege77] L. A. Segel. Mathematics Applied to Continuum Mechanics. MacMillan Publishing Company,


New York, 1977. republished by Dover 1987.

[Shab08] A. A. Shabana. Computational Continuum Mechanics. Cambridge University Press, 2008.

[ShamDym95] I. Shames and C. Dym. Energy and Finite Element Methods in Structural Mechanics. New
Age International Publishers Limited, 1995.

[ShamReic97] L. Shampine and M. W. Reichelt. The MATLAB ODE Suite. SIAM Journal on Scientific
Computing, 18:1–22, 1997.

[www:triangle] J. R. Shewchuk. https://www.cs.cmu.edu/˜quake/triangle.html.

[Shew94] J. R. Shewchuk. An introduction to the conjugate gradient method without the agonizing pain.
Technical report, Carnegie Mellon University, 1994.

[Smit84] G. D. Smith. Numerical Solution of Partial Differential Equations: Finite Difference Methods.
Oxford University Press, Oxford, third edition, 1986.

[Sout73] R. W. Soutas-Little. Elasticity. Prentice–Hall, 1973.


[VarFEM] A. Stahel. Calculus of Variations and Finite Elements. Lecture Notes used at HTA Biel, 2000.

[Stah00] A. Stahel. Calculus of Variations and Finite Elements. supporting notes, 2000.

[Octave07] A. Stahel. Octave and Matlab for Engineers. lecture notes, 2007.

[Stah08] A. Stahel. Numerical Methods. lecture notes, BFH-TI, 2008.

[Stah16] A. Stahel. Statistics with Matlab/Octave. supporting notes, BFH-TI, 2016.

[StraFix73] G. Strang and G. J. Fix. An Analysis of the Finite Element Method. Prentice–Hall, 1973.

[Thom95] J. W. Thomas. Numerical Partial Differential Equations: Finite Difference Methods, volume 22
of Texts in Applied Mathematics. Springer Verlag, New York, 1995.

[TongRoss08] P. Tong and J. Rossettos. Finite Element Method, Basic Technique and Implementation. MIT,
1977. Republished by Dover in 2008.

[Trim90] D. W. Trim. Applied Partial Differential Equations. PWS–Kent, 1990.

[Wein74] R. Weinstock. Calculus of Variations. Dover, New York, 1974.

[Wilk63] J. H. Wilkinson. Rounding Errors in Algebraic Processes. Prentice-Hall, 1963. Republished by


Dover in 1994.

[Wlok82] J. Wloka. Partielle Differentialgleichungen. Teubner, Stuttgart, 1982.

[YounGreg72] D. M. Young and R. T. Gregory. A Survey of Numerical Analysis, Volume 1. Dover Publica-
tions, New York, 1972.

[Zien13] O. Zienkiewicz, R. Taylor, and J. Zhu. The Finite Element Method: Its Basis and Fundamentals.
Butterworth-Heinemann, 7 edition, 2013.

[ZienMorg06] O. C. Zienkiewicz and K. Morgan. Finite Elements and Approximation. John Wiley & Sons,
1983. Republished by Dover in 2006.

List of Figures

1 Structure of the topics examined in this class . . . . . . . . . . . . . . . . . . . . . . . . . . 2

1.1 Temperature T as function of the horizontal position . . . . . . . . . . . . . . . . . . . . . 12


1.2 Segment of a string . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
1.3 Nonlinear stress strain relation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
1.4 Bending of a Beam . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
1.5 A nonsmooth function f and three regularized approximations u . . . . . . . . . . . . . . . 22

2.1 Memory and cache access times on a Alpha 21164 system . . . . . . . . . . . . . . . . . . 28


2.2 FLOPS for a 21164 microprocessor system, four implementations of one algorithm . . . . . 28
2.3 FLOPS for a few systems, four implementations of one algorithm . . . . . . . . . . . . . . 29
2.4 CPU-cache structure for the Intel I7-920 (Nehalem) . . . . . . . . . . . . . . . . . . . . . . 30
2.5 The discrete approximation of a continuous function . . . . . . . . . . . . . . . . . . . . . 32
2.6 A 4 × 4 grid on a square domain . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
2.7 LR factorization, using elementary matrices . . . . . . . . . . . . . . . . . . . . . . . . . . 44
2.8 The Cholesky decomposition for a banded matrix . . . . . . . . . . . . . . . . . . . . . . . 66
2.9 Cholesky steps for a banded matrix. The active area is marked . . . . . . . . . . . . . . . . 66
2.10 The sparsity pattern of a band matrix and two Cholesky factorizations . . . . . . . . . . . . 71
2.11 Graph of a function to be minimized and its level curves . . . . . . . . . . . . . . . . . . . 74
2.12 One step of a gradient iteration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
2.13 The gradient algorithm for a large condition number . . . . . . . . . . . . . . . . . . . . . . 77
2.14 Ellipse and circle to illustrate conjugate directions . . . . . . . . . . . . . . . . . . . . . . . 78
2.15 One step of a conjugate gradient iteration . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
2.16 Two steps of the gradient algorithm (blue) and the conjugate gradient algorithm (green) . . . 80
2.17 Number of flops for banded Cholesky, steepest descent and conjugate gradient algorithm . . 84
2.18 The GMRES algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
2.19 The GMRES(m) algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
2.20 The Arnoldi algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
2.21 Comparison of linear solvers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99

3.1 Method of bisection to solve one equation . . . . . . . . . . . . . . . . . . . . . . . . . . . 105


3.2 Method of false position to solve one equation . . . . . . . . . . . . . . . . . . . . . . . . . 106
3.3 Secant method to solve one equation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107
3.4 Newton’s method to solve one equation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107
3.5 Three functions that might cause problems for Newton’s methods . . . . . . . . . . . . . . . 108
3.6 The contraction mapping principle . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 112
3.7 Successive substitution to solve cos x = x . . . . . . . . . . . . . . . . . . . . . . . . . . . 113
3.8 Partial successive substitution to solve 3 + 3 x = exp(x) . . . . . . . . . . . . . . . . . . . 114
3.9 Graph of the function y = x2 − 1 − cos(x) . . . . . . . . . . . . . . . . . . . . . . . . . . 121
3.10 Definition and graph of the auxiliary function h . . . . . . . . . . . . . . . . . . . . . . . . 123
3.11 Graphs for stretching of a beam, with Poisson contraction . . . . . . . . . . . . . . . . . . . 125


3.12 Graph of a function h = f (x, y), with contour lines . . . . . . . . . . . . . . . . . . . . . . 129


3.13 A linear mapping applied to a rectangle . . . . . . . . . . . . . . . . . . . . . . . . . . . . 130
3.14 Level curve or surface of quadratic forms in R2 or R3 . . . . . . . . . . . . . . . . . . . . . 137
3.15 Level surface of a quadratic form in R3 with intersection and projection onto R2 . . . . . . . 142
3.16 Ein lineares Vektorfeld mit Eigenvektoren . . . . . . . . . . . . . . . . . . . . . . . . . . . 146
3.17 Spirals as solutions of a system of three differential equations . . . . . . . . . . . . . . . . . 148
3.18 Image compression by SVD with different number n of contributions . . . . . . . . . . . . 152
3.19 Raw data and level curves for the likelihood function . . . . . . . . . . . . . . . . . . . . . 156
3.20 Raw data and the domain of confidence at level of significance α = 0.05 and the PCA . . . . 157
3.21 PCA demo for data in R3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 159
3.22 Projection of data in the direction of the first principal component . . . . . . . . . . . . . . 160
3.23 Scaled data and level curves for the likelihood function . . . . . . . . . . . . . . . . . . . . 163
3.24 Trapezoidal integration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 164
3.25 The integral of cos(x) over [0, π/2] by the trapezoidal rule . . . . . . . . . . . . . . . . . . 166
3.26 The approximation errors of three integration algorithms as function of the stepsize . . . . . 174
3.27 Two graphs of functions for integration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 176
3.28 Domains in the plane R2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 180
3.29 Vector field and three solutions for a logistic equation . . . . . . . . . . . . . . . . . . . . . 184
3.30 One solution and the vector field for the Volterra-Lotka problem . . . . . . . . . . . . . . . 185
3.31 Vector field and a solution for a spring-mass problem . . . . . . . . . . . . . . . . . . . . . 187
3.32 Vector field and solution of the ODE d/dt x(t) = x(t)² − 2 t . . . . . . . . . . . . . . . . . 188
3.33 One step of the Heun Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 189
3.34 One step of the Runge–Kutta Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 190
3.35 Code for Runge–Kutta with fixed step size . . . . . . . . . . . . . . . . . . . . . . . . . . . 195
3.36 Conditional stability of Euler’s approximation to d/dt y(t) = λ y(t) with λ < 0 . . . . . . . 197
3.37 Unconditional stability of the implicit approximation to d/dt y(t) = λ y(t) with λ < 0 . . . . 198
3.38 Domains of stability in C for a few algorithms . . . . . . . . . . . . . . . . . . . . . . . . . 200
3.39 Graphical results for a Heun based method and Runge–Kutta . . . . . . . . . . . . . . . . . 209
3.40 Solution of an ODE by ode45() at the computed times or at preselected times . . . . . . . 210
3.41 SIR model, with infection rate b and recovery rate k . . . . . . . . . . . . . . . . . . . . . 212
3.42 SIR model vector field with b = 1/3 and k = 1/10 . . . . . . . . . . . . . . . . . . . . . . 213
3.43 Solution of a non–stiff ODE with different algorithms . . . . . . . . . . . . . . . . . . . . . 215
3.44 Solution of a stiff ODE with different algorithms . . . . . . . . . . . . . . . . . . . . . . . 217
3.45 Solution of a stiff ODE system with different algorithms . . . . . . . . . . . . . . . . . . . 218
3.46 Solution of a non–stiff ODE system with different algorithms . . . . . . . . . . . . . . . . . 220
3.47 Regression of a straight line . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 221
3.48 An example for linear regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 224
3.49 Regions of confidence for two similar regressions with identical data . . . . . . . . . . . . . 228
3.50 Linear regression for a parabola, with a small or a large noise contribution . . . . . . . . . . 233
3.51 Region of confidence for the parameters p1 , p2 and p3 . . . . . . . . . . . . . . . . . . . . . 235
3.52 Fitting a straight line to data close to a parabola . . . . . . . . . . . . . . . . . . . . . . . . 237
3.53 Intensity of light as function of the angle α . . . . . . . . . . . . . . . . . . . . . . . . . . . 238
3.54 Result of a 3D linear regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 239
3.55 Least square approximation of a damped oscillation . . . . . . . . . . . . . . . . . . . . . . 242
3.56 Nonlinear least square approximation with fsolve() . . . . . . . . . . . . . . . . . . . . 243
3.57 Data points and the optimal fit by a logistic function . . . . . . . . . . . . . . . . . . . . . . 244
3.58 The intersection of the 95% confidence ellipsoid in R4 with 2D-planes . . . . . . . . . . . . 247
3.59 The intersection of the 95% confidence ellipsoid in R4 with the p4 = const plane . . . . . . 248

4.1 FD stencils for y 0 (t), forward, backward and centered approximations . . . . . . . . . . . . 255
4.2 Finite difference approximations of derivatives . . . . . . . . . . . . . . . . . . . . . . . . 256


4.3 Discretization and arithmetic error contributions . . . . . . . . . . . . . . . . . . . . . . . . 257


4.4 Finite difference stencil for −uxx − uyy if h = hx = hy . . . . . . . . . . . . . . . . . . . 258
4.5 Finite difference stencil for ut − uxx , explicit, forward . . . . . . . . . . . . . . . . . . . . 258
4.6 Finite difference stencil for ut − uxx , implicit, backward . . . . . . . . . . . . . . . . . . . 258
4.7 A finite difference approximation of an initial value problem . . . . . . . . . . . . . . . . . 259
4.8 Exact and approximate solution of a boundary value problem . . . . . . . . . . . . . . . . . 260
4.9 An approximation scheme for −u00 (x) = f (x) . . . . . . . . . . . . . . . . . . . . . . . . . 261
4.10 A general approximation scheme for boundary value problems . . . . . . . . . . . . . . . . 261
4.11 Stretching of a beam, displacement and force . . . . . . . . . . . . . . . . . . . . . . . . . 267
4.12 Stretching of a beam with constant and variable cross section . . . . . . . . . . . . . . . . . 269
4.13 A finite difference grid for a steady state heat equation . . . . . . . . . . . . . . . . . . . . 271
4.14 Solution of the steady state heat equation on a square . . . . . . . . . . . . . . . . . . . . . 272
4.15 A finite difference grid for a dynamic heat equation . . . . . . . . . . . . . . . . . . . . . . 274
4.16 Explicit finite difference approximation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 276
4.17 Solution of 1-d heat equation, stable and unstable algorithms with r ≈ 0.5 . . . . . . . . . . 277
4.18 Implicit finite difference approximation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 279
4.19 Solution of 1-d heat equation, implicit scheme with small and large step sizes . . . . . . . . 279
4.20 Crank–Nicolson finite difference approximation . . . . . . . . . . . . . . . . . . . . . . . . 281
4.21 Solution of the dynamic heat equation on a square . . . . . . . . . . . . . . . . . . . . . . . 285
4.22 Stability function for four algorithms with z = λ ∆t . . . . . . . . . . . . . . . . . . . . . . 290
4.23 Explicit finite difference approximation for the wave equation . . . . . . . . . . . . . . . . 290
4.24 D’Alembert’s solution of the wave equation . . . . . . . . . . . . . . . . . . . . . . . . . . 294
4.25 Implicit finite difference approximation for the wave equation . . . . . . . . . . . . . . . . 295
4.26 The nonlinear beam stetching problem, solved by successive substitution, with errors . . . . 298
4.27 Bending of a beam, solved by Newton’s method . . . . . . . . . . . . . . . . . . . . . . . . 301
4.28 Bending of a beam with large force, solved as linear problem and by Newton’s method . . . 302
4.29 Nonlinear beam problem for a large force, solved by a parameterized Newton’s method . . . 302

5.1 Shortest connection between two points . . . . . . . . . . . . . . . . . . . . . . . . . . . . 310


5.2 Pendulum with moving support . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 324
5.3 Numerical solution for a pendulum with moving support . . . . . . . . . . . . . . . . . . . 325
5.4 Definition of the modulus of elasticity E and the Poisson number ν . . . . . . . . . . . . . . 326
5.5 Deformation of an elastic solid . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 327
5.6 Definition of strain: rectangle before and after deformation . . . . . . . . . . . . . . . . . . 327
5.7 Rotation of the coordinate system . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 332
5.8 Definition of stress in a plane . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 337
5.9 Normal and tangential stress in an arbitrary direction . . . . . . . . . . . . . . . . . . . . . 337
5.10 Components of stress in space . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 339
5.11 Mohr’s circle for the 2D and 3D situations . . . . . . . . . . . . . . . . . . . . . . . . . . . 342
5.12 How to determine the maximal principal stress, von Mises stress and Tresca stress . . . . . . 344
5.13 Action of a linear mapping from R2 onto R2 . . . . . . . . . . . . . . . . . . . . . . . . . . 348
5.14 Block to be deformed, used to determine the elastic energy density W . . . . . . . . . . . . 351
5.15 Situation for the basic version of Hooke’s law . . . . . . . . . . . . . . . . . . . . . . . . . 356
5.16 Torsion of a tube . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 362
5.17 Stress–strain curve for Hooke’s linear law and an incompressible neo–Hookean material
under uniaxial load . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 371
5.18 Stress–strain curve for Hooke’s linear law and a compressible Neo–Hookean material with
ν = 0.3 under uniaxial load . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 377
5.19 Stress-strain curve for Mooney–Rivlin under uniaxial loading . . . . . . . . . . . . . . . . . 380
5.20 Stress–strain curve for Hooke’s linear law and an incompressible Ogden material under
uniaxial load, for different values of α . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 382


5.21 Stress–strain curve for Hooke’s linear law and a compressible Ogden material under hydro-
static load with ν = 0.3. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 384
5.22 Plane strain and plane stress situation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 387

6.1 Classical and weak solutions, minimizers and FEM . . . . . . . . . . . . . . . . . . . . . . 402


6.2 One triangle in space (green) and projected to plane (blue) . . . . . . . . . . . . . . . . . . 406
6.3 A small mesh of a simpe domain in R2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 408
6.4 Local and global numbering of nodes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 408
6.5 Numbering of a simple mesh by Cuthill–McKee . . . . . . . . . . . . . . . . . . . . . . . . 409
6.6 Mesh generated by triangle . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 410
6.7 Structure of the nonzero entries in a stiffness matrix . . . . . . . . . . . . . . . . . . . . . . 410
6.8 A first FEM solution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 411
6.9 A simple rectangular mesh . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 412
6.10 The two types of triangles in a rectangular mesh . . . . . . . . . . . . . . . . . . . . . . . . 412
6.11 FEM stencil and neighboring triangles of a mesh point . . . . . . . . . . . . . . . . . . . . 414
6.12 Finite difference stencil for −uxx − uyy for h = hx = hy . . . . . . . . . . . . . . . . . . 415
6.13 A function to be minimized . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 421
6.14 Quadratic interpolation on a triangle . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 426
6.15 Transformation of standard triangle to general triangle . . . . . . . . . . . . . . . . . . . . 428
6.16 Gauss integration of order 5 on the standard triangle, using 7 integration points . . . . . . . 430
6.17 Basis functions for second order triangular elements . . . . . . . . . . . . . . . . . . . . . . 433
6.18 Convergence results for linear and quadratic elements . . . . . . . . . . . . . . . . . . . . . 445
6.19 The mesh and the solution for a BVP . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 447
6.20 Difference to the exact solution and values of ∂u/∂y, using a first order mesh . . . . . . . . 447
6.21 Difference to the exact solution and values of ∂u/∂y, using a second order mesh . . . . . . 447
6.22 Difference of the approximate values of ∂u/∂y to the exact values . . . . . . . . . . . . . . 448
6.23 The transformation of a linear quadrilateral element and the four nodes . . . . . . . . . . . . 451
6.24 Bilinear shape functions on a 4 node quadrilateral element . . . . . . . . . . . . . . . . . . 453
6.25 Gauss integration points on the standard square, using 22 = 4 or 32 = 9 integration points . 454
6.26 The transformation of a second order quadrilateral element and the eight nodes . . . . . . . 455
6.27 Contour levels of the shape functions on a 8 node quadrilateral element . . . . . . . . . . . 456
6.28 The function leading to logistic growth and the solution of the differential equation . . . . . 457
6.29 Exact and approximate solution of the second test problem . . . . . . . . . . . . . . . . . . 464
6.30 A snapshot of the solution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 466
6.31 Concentration u(t, r) as function of time t and radius r on a short time interval and contours 467
6.32 Concentration u(t, r) as function of time t and radius r on a long time interval and contours 468
6.33 Graph of the solutions by four computations with different algorithms . . . . . . . . . . . . 470

List of Tables

1 Literature on Numerical Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4


2 Literature on the Finite Element Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6

1.1 Some values of heat related constants . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8


1.2 Symbols and variables for heat conduction . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
1.3 Symbols and variables for a vibrating membrane . . . . . . . . . . . . . . . . . . . . . . . 14
1.4 Variables used for the stretching of a beam . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
1.5 Typical values for the elastic constants . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
1.6 Variables used for a bending beam . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19

2.1 Binary representation of floating point numbers . . . . . . . . . . . . . . . . . . . . . . . . 24


2.2 Normalized timing for different operations on an Intel I7 . . . . . . . . . . . . . . . . . . . 27
2.3 FLOPS for a few CPU architectures, using one core only . . . . . . . . . . . . . . . . . . . 28
2.4 Properties of the model matrices An , Ann and Annn . . . . . . . . . . . . . . . . . . . . . . 36
2.5 Memory requirements for the Cholesky algorithm for banded matrices . . . . . . . . . . . . 67
2.6 Comparison of direct solvers for Ann with n = 200 . . . . . . . . . . . . . . . . . . . . . . 70
2.7 Gradient algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
2.8 The conjugate gradient algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
2.9 Comparison of algorithms for the model problems . . . . . . . . . . . . . . . . . . . . . . . 84
2.10 Time required to complete a given number of flops on a 100 MFLOPS or 100 GFLOPS CPU 84
2.11 Preconditioned conjugate gradient algorithms to solve A ~x + ~b = ~0 . . . . . . . . . . . . . 86
2.12 The condition numbers κ using the incomplete Cholesky preconditioner IC(0) . . . . . . . 89
2.13 Performance for MATLAB’s pcg() with an incomplete Cholesky preconditioner . . . . . . 90
2.14 Performance for Octave’s pcg() with an incomplete Cholesky preconditioner . . . . . . . 91
2.15 Timing for solving a system with 10^6 = 1'000'000 unknowns . . . . . . . . . . . . . . . . 91
2.16 Performance of Octave’s pcg() with an ilu() preconditioner . . . . . . . . . . . . . . . 92
2.17 Performance of MATLAB’s pcg() with an ilu() preconditioner . . . . . . . . . . . . . . 92
2.18 Iterative solvers in Octave/MATLAB . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
2.19 Benchmark of different algorithms for linear systems, used with COMSOL Multiphysics . . 100
2.20 Codes for chapter 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100

3.1 Comparison of methods to solve one equation . . . . . . . . . . . . . . . . . . . . . . . . . 109


3.2 Performance of some basic algorithms to solve x2 − 2 = 0 . . . . . . . . . . . . . . . . . . 110
3.3 Compare partial substitution method and Newton’s method . . . . . . . . . . . . . . . . . . 127
3.4 Standard Gaussian PDF in n dimensions . . . . . . . . . . . . . . . . . . . . . . . . . . . 153
3.5 Approximation errors of three integration algorithms . . . . . . . . . . . . . . . . . . . . . 173
3.6 Discretization errors for the methods of Euler, Heun and Runge–Kutta . . . . . . . . . . . . 191
3.7 A comparison of Euler and Runge–Kutta . . . . . . . . . . . . . . . . . . . . . . . . . . . . 192
3.8 Comparing integration and solving ODEs . . . . . . . . . . . . . . . . . . . . . . . . . . . 193
3.9 Comparison of a Heun based method and Runge–Kutta . . . . . . . . . . . . . . . . . . . . 208
3.10 Data for a non–stiff ODE problem with different algorithms . . . . . . . . . . . . . . . . . . 215


3.11 Data for a stiff ODE problem with different algorithms . . . . . . . . . . . . . . . . . . . . 216
3.12 Data for a stiff ODE system with different algorithms . . . . . . . . . . . . . . . . . . . . . 218
3.13 Data for a stiff ODE system with different algorithms, using the Jacobian matrix . . . . . . . 219
3.14 Data for non–stiff ODE system with different algorithms . . . . . . . . . . . . . . . . . . . 220
3.15 Commands for linear regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 229
3.16 Examples for linear and nonlinear regression . . . . . . . . . . . . . . . . . . . . . . . . . 240
3.17 Commands for nonlinear regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 241
3.18 Estimated and exact values of the parameters . . . . . . . . . . . . . . . . . . . . . . . . . 243
3.19 Codes for chapter 3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 251

4.1 Finite difference approximations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 256


4.2 Exact and approximate boundary value problem . . . . . . . . . . . . . . . . . . . . . . . . 263
4.3 Comparison of finite difference schemes for the 1D heat equation . . . . . . . . . . . . . . . 282
4.4 Comparison of finite difference schemes for 2D dynamic heat equations . . . . . . . . . . . 284
4.5 Properties of the ODE solvers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 289
4.6 Codes for chapter 4 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 303

5.1 Examples of second order differential equations . . . . . . . . . . . . . . . . . . . . . . . . 314


5.2 Some examples of Poisson’s equation −∇ (a ∇u) = f . . . . . . . . . . . . . . . . . . . . 317
5.3 Normal and shear strains in space . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 335
5.4 Description of normal and tangential stress in space . . . . . . . . . . . . . . . . . . . . . . 339
5.5 Elastic moduli and their relations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 360
5.6 Different tensors in 2D . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 367
5.7 Nonlinear material laws . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 370
5.8 Plane strain and plane stress situation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 387

6.1 Algorithm of Cuthill–McKee . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 409


6.2 Minimization of exact and approximate problem . . . . . . . . . . . . . . . . . . . . . . . . 420
6.3 Maximal approximation error . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 463

Index

χ2 distribution, 153 incomplete, 87, 89


modified, 55
A–stability, 196, 286 classical solution, 402, 415, 416
AbsTol, 214 coercive, 420
adaptive step size, 206 condition number, 49
Alpha 21164 processor, 30 cone of dependence, 293
amd, 69 confidence interval, 226, 232, 240
approximation, linear, 469 conforming element, 420, 423, 425, 427, 452, 455
Arnoldi iteration, 93, 96 conjugate
ATLAS, 31, 40 direction, 77, 78
gradient algorithm, 77
Banach’s fixed point theorem, 111
gradient normal residual, 92
bandwidth, 66, 409
conjugate residual, 92
basis function, 432, 451
connection, shortest, 310
BDF2, 287
consistency, 259, 260, 262
beam
contraction mapping principle, 111
bending, 18, 298, 302, 312
convergence, 260, 261, 444
buckling, 20
linear, 104
stretching, 15, 265, 267, 268
quadratic, 104
Bernoulli principle, 312, 355, 363, 388, 390, 394,
correlation, 162
395
Courant-Fischer Minimax Theorem, 135
BiCGSTAB, 92
covariance, 155, 157, 162, 234
binary representation, 24
Crank–Nicolson, 203, 280, 282–284, 287, 465
bisection, 105
cumtrapz, 165
BLAS, 40
curvefit stat, 248
Bogacki–Shampine, 205
Cuthill–McKee, 409
boundary condition, 461
Dirichlet, 313, 392 d’Alembert, 293
natural, 309, 312, 313, 318 diagonalization, 133, 134
Neumann, 313 diagonally dominant, 62
bracketed, 105 strictly, 62
Bramble-Hilbert lemma, 424 direct method, 72
brittle, 343 discretization error, 191
bulk modulus, 354, 358, 360, 374 displacement gradient tensor, 348
Butcher table, 201, 203 displacement vector, 327
distortion energy, 345
Céa lemma, 422
distribution
cache structure, 26
chi–square, 153
calculus of variations, 305
F, 226
CGNR, 92
Gaussian, 152
characteristic polynomial, 132
Student-t, 227, 232, 240
chi–square distribution, 153
divergence theorem, 313
Cholesky, 55
domain of confidence, 246
classical, 55


Dormand–Prince, 206 fzero, 120


drop tolerance, 87
ductile, 343 Gauss integration, 170, 415, 422, 428–431, 452, 454,
458
eig, 142 Gauss, algorithm, 38
eigenvalue, 130, 131 Gaussian probability distribution, 152
generalized, 131, 149, 282 GenerateFEM, 462
eigenvector, 130, 131 geometric nonlinearity, 301
eigs, 143 Given’s rotation, 97
element, second order, 427 GMRES, 93
elliptic, 420 GMRES(m), 94
energy gradient algorithm, 73, 75
density, 354 Green’s identity, 313, 315
density, elastic, 350 Green–Gauss theorem, 313
norm, 76, 417, 419
shape changing, 353 Hamilton’s principle, 320
volume changing, 353 heat capacity, 8
error estimate, 417 heat equation, 8
Euler 2D, 11
implicit, 203 dynamic, 10, 273, 278, 280, 283
Euler buckling, 21 steady state, 10, 271
Euler method, 187, 188 Hesse matrix, 348
Euler–Lagrange equation, 21, 305, 309–312, 315, Hessenberg matrix, 97
316, 318, 320–322, 324, 362, 363, 397 Heun, 188, 202
expfit, 240 Hooke’s law, 18, 326, 349, 356
explicit, 283, 286, 290 hourglassing, 432
extrapolation
IC(0), 87, 89
Richardson, 174, 206
ichol, 89
F–distribution, 228, 246 ICT(), 87, 89
failure mode, elastic, 343 IEEE-754, 23
false position method, 106 iff, 37
FEMoctave, 448 implicit, 198, 258, 259, 278, 281, 283, 286, 294
finite difference, 254, 414 integral, 177
finite difference stencil, 257 integral2, 180
explicit, 258 integration
implicit, 258 adaptive, 174
Finite Element Method, FEM, 401 Gauss, 170
Fisher’s equation, 467 numerical, 164, 415
floating point arithmetic, 23 Simpson, 166
flop, 26 trapezoidal, 164
FLOPS, 26 Intel Haswell processor, 30
flux of thermal energy, 8 Intel I7 processor, 30
fminbnd, 128 interpolation, 458
fminsearch, 129 bilinear, 452
Fourier’s law, 8 biquadratic, 455
fsolve, 120, 240, 243, 249 linear, 405, 424
function piecewise linear, 423
bilinear, 451 piecewise quadratic, 425, 426, 464
function space, 418 quadratic, 425
functional, 305, 306, 417, 418 invariant, 336, 341, 345, 352, 353, 368
quadratic, 312, 315, 398 irreducible, 62


irreducibly diagonally dominant, 62 minimal angle condition, 423, 426


iterative method, 72, 73, 103 model matrix An , 32
iterative solver, 72 model matrix Ann , 34
modulus of elasticity, 16, 326, 357
Jaccobi determinant, 429 Mohr’s circle, 341, 394
Jacobian, 202 Mooney–Rivlin, 378, 379
multi core architecture, 30
Korn’s inequality, 420, 423, 449
Krylov subspace, 79 Neo–Hookean, 370, 372, 374, 375
compressible, 376–378
L–stability, 196, 286
incompressible, 369
Lagrange function, 320
Newton’s method, 107, 115, 298, 469
Lagrange multiplier, 135
parameterized, 302
Lamé’s parameter, 360
damped, 119
Laplace operator, 12
modified, 119
Lax equivalence theorem, 262, 422
Newton–Raphson, 107
leasqr, 240, 242, 245
Nitsche trick, 424
least square, 95, 221
nlinfit, 240
lemma, fundamental, 307
nlparci, 240, 249
Levenberg–Marquardt, 110, 119, 240
nonlin curvefit, 248
linear element, 403
norm, 45
LinearRegression, 229, 231, 237, 239
equivalent, 46
logistic
matrix, 46
equation, 183, 457
vector, 45
function, 244
normal equation, 92, 221, 222
LR factorization, 36–38
normal strain, 329
lscov, 231, 234
lsqcurvefit, 240, 249 octahedral shear stress, 341
lsqnonlin, 240, 244 ode15s, 215
ode23, 205, 214
mapping
ode23s, 215
bilinear, 450
ode45, 186, 206, 214
linear, 347, 428
odeget, 212, 214
mass spring system, damped, 186
odeset, 212
matrix
Ogden, 370
banded, 66
compressible, 381, 383
elementary, 43
incompressible, 384
factorization, 36
OPENBLAS, 31, 40
inverse, 67
options, 212
norm, 45
order of convergence, 104
orthogonal, 134
orthogonal matrix, 130, 134
permutation, 53
orthonormal, 130
positive definite, 60
positive semidefinite, 60 partial successive substitution, 114, 297
sparse, 72 PCA, 156, 157, 161
symmetric, 133, 134 pca, 161
triangular, 36 pcacov, 161
maximum likelihood, 154, 155 pendulum, 186, 194
membrane double, 321
steady state, 15, 316 moving support, 323
vibrating, 14, 316 single, 321
mesh quality, 423 permutation matrix, 53


Picard iteration, 114, 297 Simpson integration, 166


pivoting, 51, 52 singular value decomposition, 50, 149, 225
partial, 52 SIR, 210
total, 52 sparse direct solver, 68
Plateau problem, 319 sparse solver, 68
Poincaré’s inequality, 420, 423, 449 stability, 196, 203, 260, 262, 286, 292
Poisson’s ratio, 17, 326, 357, 360 conditional, 259, 277
preconditioning, 84 backward, 52
pressure conditional, 197, 199, 291
hydrostatic, 358 domain, 201
principal component analysis, 156, 157 function, 203
principal strain, 336 of Cholesky, 63
principal stress, 340 unconditional, 198, 200, 204, 259, 295, 296
principle of least action, 320 steepest descent, 73
princomp, 161 stencil, 257, 413, 414
projection operator, 423, 425 step size
adaptive, 206
QR factorization, 95, 222 stiff ODE, 214
quad, 178 stiffness matrix
quad2d, 180 element, 404, 406, 407
quadv, 179 global, 404, 427
quiver, 186 strain, 16, 18, 327
invariant, 336, 341
Rayleigh quotient, 135
plane, 387
reducible, 62
principal, 334, 336, 360
regress, 230, 233
tensor, 347, 365
regression
stress, 18, 336
linear, 95, 221, 222, 225
engineering, 17, 125, 372
nonlinear, 240, 243, 244
hydrostatic, 345, 361
straight line, 221
matrix, 338
regula falsi, 106
maximal principal, 344
RelTol, 214
normal, 336
residual vector, 74
plane, 387, 393
rounding error, 23
principal, 340, 360
row reduction, 38
shape changing, 345, 361
Runge–Kutta, 288
tangential, 336
classical, 189, 202
tensor, 338, 347
embedded, 204
Tresca, 341–343
explicit, 201
true, 17, 125, 372
implicit, 202
volume changing, 361
order 2, 188
von Mises, 341, 343
order 4, 189
yield, 344
Saint–Venants’s principle, 388 stretch
secant method, 106 principal, 369, 378
selection tree, 71 string
semibandwidth, 66 deformation, 13, 311
separation of variables, 10 vibrating, 14
shear locking, 432 Students’s t-distribution, 227
shear modulus, 354, 359, 360 successive substitution, 111
shear strain, 329 superconvergence, 444, 464
sigmoid function, 244 surface forces, 355


svd, 50, 149, 161, 225

templates, 85
tensor, 345
Cauchy–Green deformation, 365, 366
deformation gradient, 365
displacement gradient, 348, 365, 366
Green strain, 366, 368
infinitesimal strain, 329, 366
infinitesimal stress, 366
thermal conductivity, 8
Tikhonov regularization, 21
time discretization, 465
tolerance, 214
absolute, 214
relative, 214
trapz, 165
Tresca stress, 341, 344
triangle, 407
triangular matrix, 36
triangularization, 407
tube, torsion, 361
tumor growth, 457

unit roundoff, 24

variance
of parameters, 226
vector
norm, 45
outer unit normal, 316, 416
Volterra–Lotka, 184
volume forces, 355
von Mises stress, 341, 345

wave equation, 289


weak solution, 401, 415–418
well conditioned problem, 51

Young’s modulus, 326, 360
