
Notes on Numerical Analysis

数值分析

Qinghai Zhang
张庆海

Fall 2021
2021年秋季学期
Contents

1 Solving Nonlinear Equations  1
  1.1 The bisection method  1
  1.2 The signature of an algorithm  1
  1.3 Proof of correctness and simplification of algorithms  1
  1.4 Q-order convergence  1
  1.5 Newton's method  2
  1.6 The secant method  3
  1.7 Fixed-point iterations  5
  1.8 Problems  6
    1.8.1 Theoretical questions  6
    1.8.2 Programming assignments  7

2 Polynomial Interpolation  9
  2.1 The Vandermonde determinant  9
  2.2 The Cauchy remainder  9
  2.3 The Lagrange formula  10
  2.4 The Newton formula  10
  2.5 The Neville-Aitken algorithm  13
  2.6 The Hermite interpolation  13
  2.7 The Chebyshev polynomials  14
  2.8 The Bernstein polynomials  15
  2.9 Problems  15
    2.9.1 Theoretical questions  15
    2.9.2 Programming assignments  16

3 Splines  18
  3.1 Piecewise-polynomial splines  18
  3.2 The minimum properties  20
  3.3 Error analysis  20
  3.4 B-Splines  21
    3.4.1 Truncated power functions  21
    3.4.2 The local support of B-splines  22
    3.4.3 Integrals and derivatives  23
    3.4.4 Marsden's identity  24
    3.4.5 Symmetric polynomials  25
    3.4.6 B-splines indeed form a basis  26
    3.4.7 Cardinal B-splines  26
  3.5 Curve fitting via splines  28
  3.6 Problems  29
    3.6.1 Theoretical questions  29
    3.6.2 Programming assignments  29

4 Computer Arithmetic  31
  4.1 Floating-point number systems  31
  4.2 Rounding error analysis  33
    4.2.1 Rounding a single number  33
    4.2.2 Binary floating-point operations  33
    4.2.3 The propagation of rounding errors  35
  4.3 Accuracy and stability  36
    4.3.1 Avoiding catastrophic cancellation  36
    4.3.2 Backward stability and numerical stability  36
    4.3.3 Condition numbers: scalar functions  37
    4.3.4 Condition numbers: vector functions  38
    4.3.5 Condition numbers: algorithms  39
    4.3.6 Overall error of a computer solution  39
  4.4 Problems  40
    4.4.1 Theoretical questions  40
    4.4.2 Programming assignments  40

5 Approximation  41
  5.1 Orthonormal systems  42
  5.2 Fourier expansions  43
  5.3 The normal equations  44
  5.4 Discrete least squares (DLS)  46
    5.4.1 Gaussian and Dirac delta functions  47
    5.4.2 Reusing the formalism  48
    5.4.3 DLS via normal equations  48
    5.4.4 DLS via QR decomposition  48
  5.5 Problems  49
    5.5.1 Theoretical questions  49
    5.5.2 Programming assignments  50

6 Numerical Integration and Differentiation  51
  6.1 Accuracy and convergence  51
  6.2 Newton-Cotes formulas  52
  6.3 Composite formulas  53
  6.4 Gauss formulas  53
  6.5 Numerical differentiation  55
  6.6 Problems  56
    6.6.1 Theoretical questions  56

A Sets, Logic, and Functions  58
  A.1 First-order logic  58
  A.2 Ordered sets  59
  A.3 Functions  60

B Linear Algebra  61
  B.1 Vector spaces  61
    B.1.1 Subspaces  61
    B.1.2 Span and linear independence  62
    B.1.3 Bases  62
    B.1.4 Dimension  63
  B.2 Linear maps  63
    B.2.1 Null spaces and ranges  63
    B.2.2 The matrix of a linear map  63
    B.2.3 Duality  64
  B.3 Eigenvalues, eigenvectors, and invariant subspaces  66
    B.3.1 Invariant subspaces  66
    B.3.2 Upper-triangular matrices  66
    B.3.3 Eigenspaces and diagonal matrices  67
  B.4 Inner product spaces  67
    B.4.1 Inner products  67
    B.4.2 Norms induced from inner products  67
    B.4.3 Norms and induced inner-products  68
    B.4.4 Orthonormal bases  68
  B.5 Operators on inner-product spaces  69
    B.5.1 Adjoint and self-adjoint operators  69
    B.5.2 Normal operators  70
    B.5.3 The spectral theorem  71
    B.5.4 Isometries  71
    B.5.5 The singular value decomposition  71
  B.6 Trace and determinant  71

C Basic Analysis  73
  C.1 Sequences  73
    C.1.1 Convergence  73
    C.1.2 Limit points  74
  C.2 Series  75
  C.3 Continuous functions on R  75
  C.4 Differentiation of functions  76
  C.5 Taylor series  76
  C.6 Riemann integral  77
  C.7 Convergence in metric spaces  78
  C.8 Vector calculus  79

D Point-set Topology  81
  D.1 Topological spaces  81
    D.1.1 A motivating problem from biology  81
    D.1.2 Generalizing continuous maps  83
    D.1.3 Open sets: from bases to topologies  84
    D.1.4 Topological spaces: from topologies to bases  85
    D.1.5 Generalized continuous maps  85
    D.1.6 The subbasis topology  86
    D.1.7 The topology of phenotype spaces  86
    D.1.8 Closed sets  87
    D.1.9 Interior–Frontier–Exterior  87
    D.1.10 Hausdorff spaces  88
  D.2 Continuous maps  89
    D.2.1 The subspace/relative topology  89
    D.2.2 New maps from old ones  90
    D.2.3 Homeomorphisms  91
  D.3 A zoo of topologies  91
    D.3.1 Hierarchy of topologies  91
    D.3.2 The order topology  92
    D.3.3 The product topology  93
    D.3.4 The metric topology  93
  D.4 Connectedness  94
  D.5 Compactness  95

E Functional Analysis  97
  E.1 Normed and Banach spaces  97
    E.1.1 Metric spaces  97
    E.1.2 Normed spaces  98
    E.1.3 The topology of normed spaces  99
    E.1.4 Bases of infinite-dimensional spaces  99
    E.1.5 Sequential compactness  100
    E.1.6 Continuous maps of normed spaces  101
    E.1.7 Norm equivalence  102
    E.1.8 Banach spaces  102
  E.2 Continuous linear maps  104
    E.2.1 The space CL(X, Y)  104
    E.2.2 The topology of CL(X, Y)  106
    E.2.3 Invertible operators  107
    E.2.4 Series of operators  108
    E.2.5 Uniform boundedness  109
  E.3 Dual spaces of normed spaces  109
    E.3.1 The Hahn-Banach theorems  110
    E.3.2 Bounded linear functionals  111
Preface

This book grew out of my teaching of the course "Numerical Analysis" (formerly "Numerical Approximation") in the spring semesters of 2018, 2019, and 2020 and in the fall semesters of 2016 and 2021 at the School of Mathematical Sciences of Zhejiang University.

In writing this book, I have made special efforts to
• collect the prerequisites in the appendices so that students can quickly brush up on the preliminaries,
• emphasize the connections between numerical analysis and other branches of mathematics such as elementary analysis and linear algebra,
• arrange the contents carefully with the hope that the total entropy of this book is minimized,
• encourage the student to understand the motivations of definitions, to formally verify all major theorems on her/his own, to think actively about the contents, to relate mathematical theory to real-world physics, and to form a habit of telling logical and coherent stories.

Throughout my teaching, many students asked for clarifications, pointed out typos, reported errors, raised questions, and suggested improvements. Each and every comment, be it small or big, negative or positive, subjective or objective, contributed to better writing and/or teaching.
Some Advice on Studying Mathematics

A. Understand every piece of knowledge thoroughly: where does each step of a proof or derivation come from? Strive to leave no step without a justification; this cultivates rigorous logical thinking.

B. Look for connections between new material and what you already know, or other branches of mathematics. The branches we have already used in studying numerical analysis include elementary analysis and linear algebra. The essence of learning is connecting new content to knowledge you have already mastered firmly!

C. Think deeply about every piece of knowledge: what does a definition capture? Can the hypotheses of a theorem be weakened? If not, where do these hypotheses enter the proof and what roles do they play? Can the conclusion of a theorem be strengthened? If not, why not? What is the range of applicability of a mathematical method, and what are its limitations?

D. Memorize the core definitions and theorems precisely, then link the related pieces of knowledge into a story through logical relations such as inheritance, composition, implication, and specialization; the purpose of building such a network is to minimize the entropy (disorder) of your own knowledge system.

E. Once the knowledge system is in place, do as many exercises as possible, but building the knowledge system always matters more than the exercises themselves.

F. Incorporate new knowledge into your knowledge system in a way compatible with what is already there. Learning mathematics is erecting a tall building, not pitching many tents on a plain; the height of a building depends on its foundation and the solidity of every story.

G. Every branch of mathematics comprises both content and form; the two depend on and complement each other.

H. "A thoroughbred cannot cover ten paces in a single leap, yet an old nag can go far in ten days' journey: the merit lies in not giving up. Carve and give up, and rotten wood will not break; carve without giving up, and even metal and stone can be engraved." (Xunzi)
One baby step at a time!
Do the simplest thing that could possibly work, then keep asking more and refining your answers.

I. "The alternation of yin and yang is called the Dao; what continues it is goodness; what completes it is nature. The benevolent see it and call it benevolence; the wise see it and call it wisdom; the common people use it daily without knowing it. Hence the way of the noble person is rarely seen in full." (Xici I, the I Ching, or Book of Changes)

J. "Think globally, act locally."

K. "The heaviest sword has no edge; supreme skill does not rely on elaborate craft." (The Return of the Condor Heroes, 《神雕侠侣》)
Chapter 1

Solving Nonlinear Equations

1.1 The bisection method

Algorithm 1.1. The bisection method finds a root of a continuous function f : R → R by repeatedly reducing the interval to the half interval where the root must lie.

Input: f : [a, b] → R, a ∈ R, b ∈ R, M ∈ N+, δ ∈ R+, ε ∈ R+
Preconditions: f ∈ C[a, b], sgn(f(a)) ≠ sgn(f(b))
Output: c, h, k
Postconditions: |f(c)| < ε or |h| < δ or k = M

    u ← f(a)
    v ← f(b)
    for k = 1 : M do
        h ← b − a
        c ← a + h/2
        w ← f(c)
        if |h| < δ or |w| < ε then
            break
        else if sgn(w) ≠ sgn(u) then
            b ← c
            v ← w
        else
            a ← c
            u ← w
        end
    end

1.2 The signature of an algorithm

Definition 1.2. An algorithm is a step-by-step procedure that takes some set of values as its input and produces some set of values as its output.

Definition 1.3. A precondition is a condition that holds for the input prior to the execution of an algorithm.

Definition 1.4. A postcondition is a condition that holds for the output after the execution of an algorithm.

Definition 1.5. The signature of an algorithm consists of its input, output, preconditions, postconditions, and how input parameters violating preconditions are handled.

1.3 Proof of correctness and simplification of algorithms

Definition 1.6. An invariant is a condition that holds during the execution of an algorithm.

Definition 1.7. A variable is temporary or derived for a loop if it is initialized inside the loop. A variable is persistent or primary for a loop if it is initialized before the loop and its value changes across different iterations.

Exercise 1.8. What are the invariants in Algorithm 1.1? Which quantities do a, b, c, h, u, v, w represent? Which of them are primary? Which of these variables are temporary? Draw pictures to illustrate the life spans of these variables.

Algorithm 1.9. A simplified bisection algorithm.

Input: f : [a, b] → R, a ∈ R, b ∈ R, M ∈ N+, δ ∈ R+, ε ∈ R+
Preconditions: f ∈ C[a, b], sgn(f(a)) ≠ sgn(f(b))
Output: c, h, k
Postconditions: |f(c)| < ε or |h| < δ or k = M

    h ← b − a
    u ← f(a)
    for k = 1 : M do
        h ← h/2
        c ← a + h
        w ← f(c)
        if |h| < δ or |w| < ε then
            break
        else if sgn(w) = sgn(u) then
            a ← c
        end
    end

1.4 Q-order convergence

Definition 1.10 (Q-order convergence). A convergent sequence {x_n} is said to converge to L with Q-order p (p ≥ 1) if

    lim_{n→∞} |x_{n+1} − L| / |x_n − L|^p = c > 0;    (1.1)

the constant c is called the asymptotic factor. In particular, {x_n} has Q-linear convergence if p = 1 and Q-quadratic convergence if p = 2.
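To make the pseudocode concrete, here is a minimal Python sketch of Algorithm 1.9; the sample equation x³ − x − 1 = 0 on [1, 2], the tolerances, and the iteration cap are illustrative choices, not part of the algorithm:

```python
import math

def bisect(f, a, b, M=60, delta=1e-12, eps=1e-12):
    """Simplified bisection (Algorithm 1.9): halve the width h and
    keep the half interval on which f still changes sign."""
    assert math.copysign(1, f(a)) != math.copysign(1, f(b)), \
        "precondition: sgn(f(a)) != sgn(f(b))"
    h = b - a
    u = f(a)
    for k in range(1, M + 1):
        h /= 2
        c = a + h                  # midpoint of the current interval
        w = f(c)
        if abs(h) < delta or abs(w) < eps:
            break
        if math.copysign(1, w) == math.copysign(1, u):
            a = c                  # root is in [c, c + h]; u keeps its sign
    return c, h, k

# f(x) = x^3 - x - 1 changes sign on [1, 2]
root, h, k = bisect(lambda x: x**3 - x - 1.0, 1.0, 2.0)
```

Termination mirrors the postconditions of the signature: the loop exits when |f(c)| < ε, |h| < δ, or k = M, and the halving of h each iteration is exactly Q-linear convergence with asymptotic factor 1/2 in the sense of (1.1).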
Definition 1.11. A sequence of iterates {x_n} is said to converge linearly to L if

    ∃c ∈ (0, 1), ∃d > 0, s.t. ∀n ∈ N, |x_n − L| ≤ c^n d.    (1.2)

For a sequence {x_n} that converges to L, its order of convergence is the maximum p ∈ R+ satisfying

    ∃c > 0, ∃N ∈ N s.t. ∀n > N, |x_{n+1} − L| ≤ c |x_n − L|^p.    (1.3)

In particular, {x_n} converges quadratically if p = 2.

Theorem 1.12 (Monotonic sequence theorem). Every bounded monotonic sequence is convergent.

Theorem 1.13 (Convergence of the bisection method). For a continuous function f : [a_0, b_0] → R satisfying sgn(f(a_0)) ≠ sgn(f(b_0)), the sequence of iterates in the bisection method converges linearly with asymptotic factor 1/2,

    lim_{n→∞} a_n = lim_{n→∞} b_n = lim_{n→∞} c_n = α,    (1.4)
    f(α) = 0,    (1.5)
    |c_n − α| ≤ 2^{−(n+1)} (b_0 − a_0),    (1.6)

where [a_n, b_n] is the interval in the nth iteration of the bisection method and c_n = (a_n + b_n)/2.

Proof. It follows from the bisection method that

    a_0 ≤ a_1 ≤ a_2 ≤ ··· ≤ b_0,
    b_0 ≥ b_1 ≥ b_2 ≥ ··· ≥ a_0,
    b_{n+1} − a_{n+1} = (b_n − a_n)/2.

In the rest of this proof, "lim" is a shorthand for "lim_{n→∞}". By Theorem 1.12, both {a_n} and {b_n} converge. Also, lim(b_n − a_n) = lim 2^{−n}(b_0 − a_0) = 0, hence lim b_n = lim a_n = α. By the given condition and the algorithm, the invariant f(a_n) f(b_n) ≤ 0 always holds. Since f is continuous, lim f(a_n) f(b_n) = f(lim a_n) f(lim b_n), so f²(α) ≤ 0 implies f(α) = 0. (1.6) is another important invariant that can be proven by induction. Comparing (1.6) to (1.2) yields convergence of the bisection method; the convergence is linear with asymptotic factor c = 1/2.

1.5 Newton's method

Algorithm 1.14. Newton's method finds the root of f : R → R near an initial guess x_0 by the iteration formula

    x_{n+1} = x_n − f(x_n)/f'(x_n),  n ∈ N.    (1.7)

Input: f : R → R, f', x_0 ∈ R, M ∈ N+, ε ∈ R+
Preconditions: f ∈ C² and x_0 is sufficiently close to a root of f
Output: x, k
Postconditions: |f(x)| < ε or k = M

    x ← x_0
    for k = 0 : M do
        u ← f(x)
        if |u| < ε then
            break
        end
        x ← x − u/f'(x)
    end

Theorem 1.15 (Convergence of Newton's method). Consider a C² function f : B → R on B = [α − δ, α + δ] satisfying f(α) = 0 and f'(α) ≠ 0. If x_0 is chosen sufficiently close to α, then the sequence of iterates {x_n} in Newton's method converges quadratically to the root α, i.e.

    lim_{n→∞} (α − x_{n+1}) / (α − x_n)² = − f''(α) / (2 f'(α)).    (1.8)

Proof. By Taylor's theorem (Theorem C.60) and the assumption f ∈ C², we have

    f(α) = f(x_n) + (α − x_n) f'(x_n) + ((α − x_n)²/2) f''(ξ),

where ξ is between α and x_n. f(α) = 0 yields

    −α = −x_n + f(x_n)/f'(x_n) + ((α − x_n)²/2) · f''(ξ)/f'(x_n).

By (1.7), we have

    (∗)  x_{n+1} − α = x_n − f(x_n)/f'(x_n) − α = (x_n − α)² f''(ξ)/(2 f'(x_n)).

The continuity of f' and the assumption f'(α) ≠ 0 yield

    ∃δ_1 ∈ (0, δ) s.t. ∀x ∈ B_1, f'(x) ≠ 0,

where B_1 = [α − δ_1, α + δ_1]. Define

    M = max_{x∈B_1} |f''(x)| / (2 min_{x∈B_1} |f'(x)|)

and pick x_0 sufficiently close to α such that
(i) |x_0 − α| = δ_0 < δ_1;
(ii) M δ_0 < 1.
The definition of M and (∗) imply

    |x_{n+1} − α| ≤ M |x_n − α|².

Comparing the above to (1.3) implies that if {x_n} converges, then the order of convergence is 2. We must still show that (a) it converges and (b) it converges to α. By (i) and (ii), we have M |x_0 − α| < 1. Then it is easy to obtain the following via induction,

    |x_n − α| ≤ (1/M) (M |x_0 − α|)^{2^n},

which shows both (a) and (b) and completes the proof.
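A minimal Python sketch of Algorithm 1.14 follows; the sample function f(x) = eˣ − 2 is an illustrative choice (it is C² with f'(ln 2) = 2 ≠ 0, so the hypotheses of Theorem 1.15 hold near the root ln 2):

```python
import math

def newton(f, fprime, x0, M=50, eps=1e-14):
    """Newton's method (Algorithm 1.14): iterate x <- x - f(x)/f'(x)
    until |f(x)| < eps or the iteration cap M is reached."""
    x = x0
    for k in range(M + 1):
        u = f(x)
        if abs(u) < eps:
            break
        x = x - u / fprime(x)
    return x, k

# f(x) = exp(x) - 2 has the unique root ln 2
root, k = newton(lambda x: math.exp(x) - 2.0, math.exp, 1.0)
```

Printing |x_k − ln 2| after each iteration shows the number of correct digits roughly doubling per step, which is the quadratic convergence predicted by (1.8).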
Theorem 1.16. A continuous function f : [a, b] → [c, d] is bijective if and only if it is strictly monotonic.

Theorem 1.17. If a C² function f : R → R satisfies f(α) = 0, f' > 0, and f'' > 0, then α is the only root of f and, ∀x_0 ∈ R, the sequence of iterates {x_n} in Newton's method converges quadratically to α.

Proof. By Theorem 1.16, f is a bijection since f is continuous and strictly monotonic. With 0 in its range, f must have a unique root. When proving Theorem 1.15, we had

    x_{n+1} − α = (x_n − α)² f''(ξ)/(2 f'(x_n)).    (1.9)

Then f' > 0 and f'' > 0 further imply x_n > α for all n > 0. f being strictly increasing implies that f(x_n) > f(α) = 0 for all n > 0. By the definition of Newton's method, x_{n+1} − α = x_n − α − f(x_n)/f'(x_n), hence the sequence {x_n − α : n > 0} is strictly monotonically decreasing with 0 as a lower bound. By Theorem 1.12 it converges.

Suppose lim_{n→∞} x_n = a. Then take the limit of n → ∞ on both sides of (1.7) and we have

    a = a − f(a)/f'(a),

which implies f(a) = 0. By the uniqueness of the root of f, we have a = α. The quadratic convergence rate can be proved by an induction using (1.9), as in Theorem 1.15.

Definition 1.18. Let V be a vector space. A subset U ⊆ V is a convex set iff

    ∀x, y ∈ U, ∀t ∈ (0, 1), tx + (1 − t)y ∈ U.    (1.10)

A function f : U → R is convex iff

    ∀x, y ∈ U, ∀t ∈ (0, 1), f(tx + (1 − t)y) ≤ t f(x) + (1 − t) f(y).    (1.11)

In particular, f is strictly convex if we replace "≤" with "<" in the above equation.

1.6 The secant method

Algorithm 1.19. The secant method finds a root of f : R → R near initial guesses x_0, x_1 by the iteration

    x_{n+1} = x_n − f(x_n) (x_n − x_{n−1}) / (f(x_n) − f(x_{n−1})),  n ∈ N+.    (1.12)

Input: f : R → R, x_0 ∈ R, x_1 ∈ R, M ∈ N+, δ ∈ R+, ε ∈ R+
Preconditions: f ∈ C²; x_0, x_1 are sufficiently close to a root of f
Output: x_n, x_{n−1}, k
Postconditions: |f(x_n)| < ε or |x_n − x_{n−1}| < δ or k = M

    x_n ← x_1
    x_{n−1} ← x_0
    u ← f(x_n)
    v ← f(x_{n−1})
    for k = 2 : M do
        if |u| > |v| then
            x_n ↔ x_{n−1}
            u ↔ v
        end
        s ← (x_n − x_{n−1})/(u − v)
        x_{n−1} ← x_n
        v ← u
        x_n ← x_n − u × s
        u ← f(x_n)
        if |x_n − x_{n−1}| < δ or |u| < ε then
            break
        end
    end

Definition 1.20. The sequence {F_n} of Fibonacci numbers is defined as

    F_0 = 0,  F_1 = 1,  F_{n+1} = F_n + F_{n−1}.    (1.13)

Theorem 1.21 (Binet's formula). Denote the golden ratio by r_0 = (1 + √5)/2 and let r_1 = 1 − r_0 = (1 − √5)/2; then

    F_n = (r_0^n − r_1^n)/√5.    (1.14)

Proof. By Definition 1.20, we have

    u_k := (F_{k+1}, F_k)^T = A (F_k, F_{k−1})^T,  where A := [1 1; 1 0].

Hence we have u_n = A^n u_0. It follows from

    det(A − λI) = λ² − λ − 1 = 0

that the two eigenvalues of A are λ = r_0, r_1, with their eigenvectors as x_0 = (r_0, 1)^T and x_1 = (r_1, 1)^T, respectively. Indeed, the two eigenpairs stem from λ² = λ + 1, a nice relation between multiplication and addition by 1. Finally, we express u_0 as a linear combination of x_0 and x_1,

    u_0 = (x_0 − x_1)/(r_0 − r_1),
r0 − r1

3
Qinghai Zhang Numerical Analysis 2021

which, together with un = An u0 , yields where B1 = [α − δ1 , α + δ1 ]. Define Ei = |xi − α|,


1 maxx∈B1 |f 00 (x)|
un = (rn x0 − r1n x1 ), M= ,
r0 − r1 0 2 minx∈B1 |f 0 (x)|

the second equation of which yields (1.14). and we have from Lemma 1.23

Corollary 1.22. The ratios r0 , r1 in Theorem 1.21 satisfy M En+1 ≤ M En M En−1 .


Pick x0 , x1 such that
Fn+1 = r0 Fn + r1n . (1.15)
(i) E0 < δ, E1 < δ;
Proof. This follows from (1.14) and values of r0 and r1 .
Lemma 1.23 (Error relation of the secant method). For the
secant method (1.12), there exist ξ_n between x_{n−1} and
x_n and ζ_n between min(x_{n−1}, x_n, α) and
max(x_{n−1}, x_n, α) such that

  x_{n+1} − α = (x_n − α)(x_{n−1} − α) f''(ζ_n)/(2f'(ξ_n)).   (1.16)

Proof. Define a divided difference as

  f[a, b] = (f(a) − f(b))/(a − b).                      (1.17)

Then it takes some algebra to show that the formula (1.12)
is equivalent to

  x_{n+1} − α = (x_n − α)(x_{n−1} − α)
    · ( (f[x_{n−1}, x_n] − f[x_n, α]) / (x_{n−1} − α) ) / f[x_{n−1}, x_n].   (1.18)

By (1.17) and the mean value theorem (Theorem C.51), there
exists ξ_n between x_{n−1} and x_n such that

  f[x_{n−1}, x_n] = f'(ξ_n).                            (1.19)

Define a function g(x) := f[x, x_n], apply the mean value
theorem to g(x), and we have

  (f[x_{n−1}, x_n] − f[x_n, α]) / (x_{n−1} − α) = g'(β)   (1.20)

for some β between x_{n−1} and α. Compute the derivative
g'(β) from (1.17), use the Lagrangian remainder Theorem
C.60, and we have

  (f[x_{n−1}, x_n] − f[x_n, α]) / (x_{n−1} − α) = f''(ζ_n)/2   (1.21)

for some ζ_n between min(x_{n−1}, x_n, α) and
max(x_{n−1}, x_n, α). The proof is completed by substituting
(1.19) and (1.21) into (1.18).

Theorem 1.24 (Convergence of the secant method). Consider a
C² function f : B → R on B = [α − δ, α + δ] satisfying
f(α) = 0 and f'(α) ≠ 0. If both x_0 and x_1 are chosen
sufficiently close to α and f''(α) ≠ 0, then the iterates
{x_n} in the secant method converge to the root α with
order p = (1 + √5)/2 ≈ 1.618.

Proof. The continuity of f' and the assumption f'(α) ≠ 0
yield

  ∃δ_1 ∈ (0, δ) s.t. ∀x ∈ B_1, f'(x) ≠ 0,

where B_1 = [α − δ_1, α + δ_1]. Define E_i = |x_i − α| and

  M = max_{x∈B_1} |f''(x)| / (2 min_{x∈B_1} |f'(x)|),

and we have from Lemma 1.23

  M E_{n+1} ≤ (M E_n)(M E_{n−1}).

Pick x_0, x_1 such that

  (i)  E_0 < δ_1, E_1 < δ_1;
  (ii) max(M E_1, M E_0) = η < 1;

then an induction by the above equation shows that E_n < δ_1
and M E_n < η. To prove convergence, we write M E_0 < η,
M E_1 < η, M E_2 ≤ (M E_1)(M E_0) < η²,
M E_3 ≤ (M E_2)(M E_1) < η³, ...,
M E_{n+1} ≤ (M E_n)(M E_{n−1}) < η^{q_n + q_{n−1}} = η^{q_{n+1}},
i.e.

  E_n < B_n := (1/M) η^{q_n},

where {q_n} is a Fibonacci sequence starting from q_0 = 1,
q_1 = 1. By Theorem 1.21, as n → ∞ we have
q_n → r_0^{n+1}/√5 since |r_1| ≈ 0.618 < 1. Hence
lim_{n→∞} E_n = 0.

To estimate the convergence rate, we first examine the rate
at which the upper bounds {B_n} decrease:

  B_{n+1}/B_n^{r_0} = ((1/M) η^{q_{n+1}}) / ((1/M) η^{q_n})^{r_0}
                    = M^{r_0 − 1} η^{q_{n+1} − r_0 q_n}
                    ≤ M^{r_0 − 1} η^{−1},

where q_{n+1} − r_0 q_n = r_1^{n+1} > −1.

To prove the convergence rate, we define

  m_n := f''(ζ_n)/(2f'(ξ_n)),  m_α := f''(α)/(2f'(α)),   (1.22)

where ζ_n and ξ_n are the same as those in Lemma 1.23. By
induction, we have

  E_n     = E_1^{F_n} E_0^{F_{n−1}} m_1^{F_{n−1}} ··· m_{n−1}^{F_1},
  E_{n+1} = E_1^{F_{n+1}} E_0^{F_n} m_1^{F_n} ··· m_{n−1}^{F_2} m_n^{F_1},

where F_n is a Fibonacci number as in Definition 1.20. Then

  E_{n+1}/E_n^{r_0}
    = E_1^{F_{n+1} − r_0 F_n} E_0^{F_n − r_0 F_{n−1}} m_1^{F_n − r_0 F_{n−1}}
      m_2^{F_{n−1} − r_0 F_{n−2}} ··· m_{n−2}^{F_3 − r_0 F_2} m_{n−1}^{F_2 − r_0 F_1} m_n^{F_1}
    = E_1^{r_1^n} E_0^{r_1^{n−1}} m_1^{r_1^{n−1}} m_2^{r_1^{n−2}} ··· m_{n−1}^{r_1} m_n,   (1.23)

where the second step follows from Corollary 1.22. (1.22)
and the convergence we just proved yield

  lim_{n→+∞} m_n = m_α,                                 (1.24)

which means

  ∃N ∈ N s.t. ∀n > N,  m_n ∈ ( m_α/2, 2m_α ).           (1.25)

We define

  A := E_1^{r_1^n} E_0^{r_1^{n−1}} m_1^{r_1^{n−1}} m_2^{r_1^{n−2}} ··· m_N^{r_1^{n−N}},
  B := m_{N+1}^{r_1^{n−N−1}} m_{N+2}^{r_1^{n−N−2}} ··· m_{n−1}^{r_1} m_n


so that E_{n+1}/E_n^{r_0} = AB. Since |r_1| < 1, we have
lim_{n→∞} A = 1. As for B, we have from (1.25)

  B ≤ (2m_α)^{1 + r_1 + r_1² + ··· + r_1^{n−N}},

and then

  lim_{n→∞} E_{n+1}/E_n^{r_0} = lim_{n→∞} A · lim_{n→∞} B
    = lim_{n→∞} B ≤ (2m_α)^{1/(1−r_1)} = (2m_α)^{1/r_0}.

The proof is then completed by Definition 1.10.
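The order p = (1 + √5)/2 of Theorem 1.24 can be observed numerically. The sketch below runs the standard secant update x_{n+1} = x_n − f(x_n)(x_n − x_{n−1})/(f(x_n) − f(x_{n−1})) on f(x) = x² − 2 and prints the observed order log(E_{n+1}/E_n)/log(E_n/E_{n−1}), which tends to r_0 ≈ 1.618.

```python
import math

def secant(f, x0, x1, tol=1e-12, max_iter=50):
    """Secant iteration; returns the list of iterates."""
    xs = [x0, x1]
    while abs(f(xs[-1])) > tol and len(xs) < max_iter:
        a, b = xs[-2], xs[-1]
        denom = f(b) - f(a)
        if denom == 0:          # guard against stagnation in floating point
            break
        xs.append(b - f(b) * (b - a) / denom)
    return xs

f = lambda x: x * x - 2.0
xs = secant(f, 1.0, 2.0)
alpha = math.sqrt(2.0)
errs = [abs(x - alpha) for x in xs]

# Observed convergence order for the steps not yet at machine precision.
for e0, e1, e2 in zip(errs, errs[1:], errs[2:]):
    if min(e0, e1, e2) > 1e-14:
        print(math.log(e2 / e1) / math.log(e1 / e0))
```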
Corollary 1.25. Consider solving f(x) = 0 near a root α.
Let m and sm be the time to evaluate f(x) and f'(x),
respectively. The minimum times to obtain the desired
absolute accuracy ε with Newton's method and the secant
method are, respectively,

  T_N = (1 + s) m ⌈log₂ K⌉,                             (1.26)
  T_S = m ⌈log_{r_0} K⌉,                                (1.27)

where r_0 = (1 + √5)/2, c = |f''(α)/(2f'(α))|,

  K = log(cε) / log(c|x_0 − α|),                        (1.28)

and ⌈·⌉ denotes the rounding-up operator, i.e. it rounds
towards +∞.

Proof. We showed |x_n − α| ≤ (1/M)(M|x_0 − α|)^{2^n} in
proving Theorem 1.15. Denote E_n = |x_n − α|; we have

  M E_n ≤ (M E_0)^{2^n}.

Let i ∈ N⁺ denote the smallest number of iterations such
that the desired accuracy ε is satisfied, i.e.
(M E_0)^{2^i} ≤ Mε. When ε is sufficiently small, M → c.
Hence we have

  i = ⌈log₂ K⌉.

For each iteration, Newton's method incurs one function
evaluation and one derivative evaluation, which cost time m
and sm, respectively. Therefore (1.26) holds.

For the secant method, assume M E_0 ≥ M E_1. By the proof
of Theorem 1.24, we have

  M E_n ≤ (M E_0)^{r_0^{n+1}/√5}.

Let j ∈ N⁺ denote the smallest number of iterations such
that the desired accuracy ε is satisfied, i.e.
r_0^{j+1} ≥ √5 K. Hence

  j = ⌈log_{r_0} K + log_{r_0}(√5/r_0)⌉ ≤ ⌈log_{r_0} K⌉ + 1.

Since the first two values x_0 and x_1 are given in the
secant method, the least number of iterations is
⌈log_{r_0} K⌉ (compare to Newton's method!). Finally, only
the function value f(x_n) needs to be evaluated per
iteration because f(x_{n−1}) has already been evaluated in
the previous iteration.

1.7 Fixed-point iterations

Definition 1.26. A fixed point of a function g is an
argument α of g satisfying g(α) = α.

Example 1.27. A fixed point of f(x) = x² − 3x + 4 is x = 2.

Lemma 1.28. If g : [a, b] → [a, b] is continuous, then g
has at least one fixed point in [a, b].

Proof. The function f(x) = g(x) − x satisfies f(a) ≥ 0 and
f(b) ≤ 0. The proof is then completed by the intermediate
value theorem (Theorem C.39).

Exercise 1.29. Let A = [−1, 0) ∪ (0, 1]. Give an example of
a continuous function g : A → A that does not have a fixed
point. Give an example of a continuous function f : R → R
that does not have a fixed point.

Theorem 1.30 (Brouwer's fixed point). Any continuous
function f : Dⁿ → Dⁿ with

  Dⁿ := {x ∈ Rⁿ : ‖x‖ ≤ 1}

has a fixed point.

Example 1.31. Take a map of your country C and place it on
the ground of your room. Let f be the function assigning to
each point in your country the point on the map
corresponding to it. Then f can be considered as a
continuous function C → C. If C is homeomorphic to D², then
there must exist a point on the map that corresponds exactly
to the point on the ground directly beneath it.

Exercise 1.32. Take two pieces of the same-sized paper and
lay one on top of the other. Every point on the top sheet of
paper is associated with some point right below it on the
bottom sheet. Crumple the top sheet into a ball without
ripping it. Place the crumpled ball on top of (and
simultaneously within the realm of) the bottom sheet of
paper. Use Theorem 1.30 to prove that there always exists
some point in the crumpled ball that sits above the same
point it sat above prior to crumpling.

Definition 1.33. A fixed-point iteration is a method for
finding a fixed point of g with a formula of the form

  x_{n+1} = g(x_n),  n ∈ N.                             (1.29)

Example 1.34. Newton's method is a fixed-point iteration.

Exercise 1.35. To calculate the square root of some
positive real number a, we can formulate the problem as
finding the root of f(x) = x² − a. For a = 1, the initial
guess x_0 = 2, and the three choices of
g_1(x) := x² + x − a, g_2(x) := a/x, and
g_3(x) := (x + a/x)/2, verify that g_1 diverges, g_2
oscillates, and g_3 converges. The theorems in this section
will explain why.

Definition 1.36. A function f : [a, b] → [a, b] is a
contraction or contractive mapping on [a, b] if

  ∃λ ∈ [0, 1) s.t. ∀x, y ∈ [a, b],
  |f(x) − f(y)| ≤ λ|x − y|.                             (1.30)

Example 1.37. Any linear function f(x) = λx + c with
0 ≤ λ < 1 is a contraction.
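Exercise 1.35 can be explored with a few lines of Python (a sketch; a = 1 and x_0 = 2 as in the exercise):

```python
def iterate(g, x0, n):
    """First n+1 iterates of the fixed-point iteration (1.29)."""
    xs = [x0]
    for _ in range(n):
        xs.append(g(xs[-1]))
    return xs

a, x0 = 1.0, 2.0
g1 = lambda x: x * x + x - a        # diverges
g2 = lambda x: a / x                # oscillates between 2 and 0.5
g3 = lambda x: 0.5 * (x + a / x)    # converges: |g3'| < 1 near the root

print(iterate(g1, x0, 4))           # rapidly increasing
print(iterate(g2, x0, 4))           # [2.0, 0.5, 2.0, 0.5, 2.0]
print(iterate(g3, x0, 6))           # tends to 1.0
```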


Theorem 1.38 (Convergence of contractions). If g(x) is a
continuous contraction on [a, b], then it has a unique fixed
point α in [a, b]. Furthermore, the fixed-point iteration
(1.29) converges to α for any choice x_0 ∈ [a, b] and

  |x_n − α| ≤ (λⁿ/(1 − λ)) |x_1 − x_0|.                 (1.31)

Proof. By Lemma 1.28, g has at least one fixed point in
[a, b]. Suppose there are two distinct fixed points α and
β; then |α − β| = |g(α) − g(β)| ≤ λ|α − β|, which implies
|α − β| ≤ 0, i.e. the two fixed points are identical.

By Definition 1.36, x_{n+1} = g(x_n) implies that all x_n's
stay in [a, b]. To prove convergence,

  |x_{n+1} − α| = |g(x_n) − g(α)| ≤ λ|x_n − α|.

By induction and the triangle inequality,

  |x_n − α| ≤ λⁿ |x_0 − α|
            ≤ λⁿ (|x_1 − x_0| + |x_1 − α|)
            ≤ λⁿ (|x_1 − x_0| + λ|x_0 − α|).

From the first and last right-hand sides (RHSs), we have
|x_0 − α| ≤ |x_1 − x_0|/(1 − λ), which yields (1.31).

Theorem 1.39. Consider g : [a, b] → [a, b]. If
g ∈ C¹[a, b] and λ = max_{x∈[a,b]} |g'(x)| < 1, then g has a
unique fixed point α in [a, b]. Furthermore, the fixed-point
iteration (1.29) converges to α for any choice x_0 ∈ [a, b],
the error bound (1.31) holds, and

  lim_{n→∞} (x_{n+1} − α)/(x_n − α) = g'(α).            (1.32)

Proof. The mean value theorem (Theorem C.51) implies that,
for all x, y ∈ [a, b], |g(x) − g(y)| ≤ λ|x − y|. Theorem
1.38 yields all the results except (1.32), which follows
from

  x_{n+1} − α = g(x_n) − g(α) = g'(ξ)(x_n − α),

lim_{n→∞} x_n = α, and the fact that ξ is between x_n and α.

Corollary 1.40. Let α be a fixed point of g : R → R with
|g'(α)| < 1 and g ∈ C¹(B) on B = [α − δ, α + δ] with some
δ > 0. If x_0 is chosen sufficiently close to α, then the
results of Theorem 1.38 hold.

Proof. Choose λ so that |g'(α)| < λ < 1. Choose δ_0 ≤ δ so
that max_{x∈B_0} |g'(x)| ≤ λ < 1 on
B_0 = [α − δ_0, α + δ_0]. Then g(B_0) ⊂ B_0 and applying
Theorem 1.39 completes the proof.

Corollary 1.41. Consider g : [a, b] → [a, b] with a fixed
point g(α) = α ∈ [a, b]. The fixed-point iteration (1.29)
converges to α with pth-order accuracy (p > 1, p ∈ N) for
any choice x_0 ∈ [a, b] if

  g ∈ Cᵖ[a, b],
  ∀k = 1, 2, . . . , p − 1,  g⁽ᵏ⁾(α) = 0,               (1.33)
  g⁽ᵖ⁾(α) ≠ 0.

Proof. By Corollary 1.40, the fixed-point iteration
converges uniquely to α because g'(α) = 0. By the Taylor
expansion of g at α, we have

  E_abs(x_{n+1}) := |x_{n+1} − α| = |g(x_n) − g(α)|
    = | ∑_{i=1}^{p−1} ((x_n − α)ⁱ/i!) g⁽ⁱ⁾(α)
        + ((x_n − α)ᵖ/p!) g⁽ᵖ⁾(ξ) |

for some ξ ∈ [a, b]. Since g⁽ᵖ⁾ is continuous on [a, b],
Theorem C.48 implies that g⁽ᵖ⁾ is bounded on [a, b]. Hence
there exists a constant M such that
E_abs(x_{n+1}) < M E_abs^p(x_n).

Example 1.42. The following method has third-order
convergence for computing √R:

  x_{n+1} = x_n(x_n² + 3R)/(3x_n² + R).

First, √R is the fixed point of F(x) = x(x² + 3R)/(3x² + R):

  F(√R) = √R(R + 3R)/(3R + R) = √R.

Second, the derivatives of F(x) are

  n | F⁽ⁿ⁾(x)                           | F⁽ⁿ⁾(√R)
  1 | 3(x² − R)²/(3x² + R)²             | 0
  2 | 48Rx(x² − R)/(3x² + R)³           | 0
  3 | −48R(9x⁴ − 18Rx² + R²)/(3x² + R)⁴ | −48R(−8R²)/(4R)⁴ = 3/(2R) ≠ 0

The rest follows from Corollary 1.41.
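Example 1.42 in code (a sketch for R = 2; the error is roughly cubed at every step, as third-order convergence predicts):

```python
import math

def sqrt_third_order(R, x0, n):
    """Iterate x <- x(x^2 + 3R)/(3x^2 + R) as in Example 1.42."""
    x, errs = x0, []
    for _ in range(n):
        x = x * (x * x + 3 * R) / (3 * x * x + R)
        errs.append(abs(x - math.sqrt(R)))
    return x, errs

x, errs = sqrt_third_order(2.0, 1.0, 5)
print(x)        # ≈ 1.41421356...
print(errs)     # roughly 1.4e-2, 3.6e-7, then machine precision
```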
1.8 Problems

1.8.1 Theoretical questions

I. Consider the bisection method starting with the initial
   interval [1.5, 3.5]. In the following questions "the
   interval" refers to the bisection interval whose width
   changes across different loops.

   • What is the width of the interval at the nth step?
   • What is the maximum possible distance between the root
     r and the midpoint of the interval?

II. In using the bisection algorithm with its initial
    interval [a_0, b_0] with a_0 > 0, we want to determine
    the root with its relative error no greater than ε.
    Prove that this goal of accuracy is guaranteed by the
    following choice of the number of steps,

      n ≥ (log(b_0 − a_0) − log ε − log a_0)/log 2 − 1.

III. Perform four iterations of Newton's method for the
     polynomial equation p(x) = 4x³ − 2x² + 3 = 0 with the
     starting point x_0 = −1. Use a hand calculator and
     organize the results of the iterations in a table.


IV. Consider a variation of Newton's method in which only
    the derivative at x_0 is used,

      x_{n+1} = x_n − f(x_n)/f'(x_0).

    Find C and s such that

      e_{n+1} = C e_n^s,

    where e_n is the error of Newton's method at step n, s
    is a constant, and C may depend on x_n, the given
    function f, and its derivatives.

V. Within (−π/2, π/2), will the iteration
   x_{n+1} = arctan(x_n) converge?

VI. Let p > 1. What is the value of the following continued
    fraction?

      x = 1/(p + 1/(p + 1/(p + ···)))

    Prove that the sequence of values converges. (Hint: this
    can be interpreted as x = lim_{n→∞} x_n, where
    x_1 = 1/p, x_2 = 1/(p + 1/p), x_3 = 1/(p + 1/(p + 1/p)),
    and so forth. Formulate x as a fixed point of some
    function.)

VII. What happens in problem II if a_0 < 0 < b_0? Derive an
     inequality of the number of steps similar to that in
     II. In this case, is the relative error still an
     appropriate measure?

VIII. (∗) Consider solving f(x) = 0 (f ∈ C^{k+1}) by
      Newton's method with the starting point x_0 close to a
      root of multiplicity k. Note that α is a zero of
      multiplicity k of the function f iff

        f⁽ᵏ⁾(α) ≠ 0;  ∀i < k, f⁽ⁱ⁾(α) = 0.

      • How can a multiple zero be detected by examining the
        behavior of the points (x_n, f(x_n))?
      • Prove that if r is a zero of multiplicity k of the
        function f, then quadratic convergence in Newton's
        iteration will be restored by making this
        modification:

          x_{n+1} = x_n − k f(x_n)/f'(x_n).

1.8.2 Programming assignments

A. Implement the bisection method, Newton's method, and the
   secant method in a C++ package. You should

   (a) design an abstract base class EquationSolver with a
       pure virtual method solve,
   (b) write a derived class of EquationSolver for each
       method to accommodate its particularities in the
       contract of solving nonlinear equations.

B. Test your implementation of the bisection method on the
   following functions and intervals.

   • x⁻¹ − tan x on [0, π/2],
   • x⁻¹ − 2ˣ on [0, 1],
   • 2⁻ˣ + eˣ + 2 cos x − 6 on [1, 3],
   • (x³ + 4x² + 3x + 5)/(2x³ − 9x² + 18x − 2) on [0, 4].

C. Test your implementation of Newton's method by solving
   x = tan x. Find the roots near 4.5 and 7.7.

D. Test your implementation of the secant method by the
   following functions and initial values.

   • sin(x/2) − 1 with x_0 = 0, x_1 = π/2,
   • eˣ − tan x with x_0 = 1, x_1 = 1.4,
   • x³ − 12x² + 3x + 1 with x_0 = 0, x_1 = −0.5.

   You should play with other initial values and (if you get
   different results) think about the reasons.

E. As shown below, a trough of length L has a cross section
   in the shape of a semicircle with radius r. When filled
   to within a distance h of the top, the water has the
   volume

     V = L [0.5πr² − r² arcsin(h/r) − h(r² − h²)^{1/2}].

   [Figure: cross section of the semicircular trough.]

   Suppose L = 10 ft, r = 1 ft, and V = 12.4 ft³. Find the
   depth of water in the trough to within 0.01 ft by each of
   the three implementations in A.
method with the starting point x0 close to a root of
the three implementations in A.
multiplicity k. Note that α is a zero of multiplicity k
of the function f iff F. In the design of all-terrain vehicles, it is necessary to
consider the failure of the vehicle when attempting to
f (k) (α) 6= 0; ∀i < k, f (i) (α) = 0. negotiate two types of obstacles. One type of failure is
called hang-up failure and occurs when the vehicle at-
• How can a multiple zero be detected by examining tempts to cross an obstacle that causes the bottom of
the behavior of the points (xn , f (xn ))? the vehicle to touch the ground. The other type of fail-
• Prove that if r is a zero of multiplicity k of the ure is called nose-in failure and occurs when the vehicle
function f , then quadratic convergence in New- descends into a ditch and its nose touches the ground.
ton’s iteration will be restored by making this
modification:
f (xn )
xn+1 = xn − k .
f 0 (xn )

1.8.2 Programming assignments


A. Implement the bisection method, Newton’s method, and
the secant method in a C++ package. You should

(a) design an abstract base class EquationSolver with


a pure virtual method solve, The above figure shows the components associated with
(b) write a derived class of EquationSolver for each the nose-in failure of a vehicle. The maximum angle
method to accomodate its particularities in the con- α that can be negotiated by a vehicle when β is the
tract of solving nonlinear equations. maximum angle at which hang-up failure does not occur
satisfies the equation
B. Test your implementation of the bisection method on the
following functions and intervals. A sin α cos α + B sin2 α − C cos α − E sin α = 0,


   where

     A = l sin β₁,  B = l cos β₁,
     C = (h + 0.5D) sin β₁ − 0.5D tan β₁,
     E = (h + 0.5D) cos β₁ − 0.5D.

   (a) Use Newton's method to verify α ≈ 33° when
       l = 89 in., h = 49 in., D = 55 in., and β₁ = 11.5°.
   (b) Use Newton's method to find α with the initial guess
       33° for the situation when l, h, β₁ are the same as
       in part (a) but D = 30 in.
   (c) Use the secant method (with another initial value as
       far away as possible from 33°) to find α. Show that
       you get a different result if the initial value is
       too far away from 33°; discuss the reasons.
89 in., h = 49 in., D = 55 in. and β1 = 11.5◦ . away from 33◦ ; discuss the reasons.

Chapter 2

Polynomial Interpolation

Definition 2.1. Interpolation constructs new data points
within the range of a discrete set of known data points,
usually by generating an interpolating function whose graph
goes through all known data points.

Example 2.2. The interpolating function may be piecewise
constant, piecewise linear, polynomial, spline, or other
non-polynomial functions.

2.1 The Vandermonde determinant

Definition 2.3. For n + 1 given points
x_0, x_1, . . . , x_n ∈ R, the associated Vandermonde matrix
V ∈ R^{(n+1)×(n+1)} is

  V(x_0, x_1, . . . , x_n) =
    [ 1  x_0  ···  x_0ⁿ ]
    [ 1  x_1  ···  x_1ⁿ ]
    [ ⋮   ⋮    ⋱    ⋮  ]
    [ 1  x_n  ···  x_nⁿ ].                              (2.1)

Lemma 2.4. The determinant of a Vandermonde matrix can be
expressed as

  det V(x_0, x_1, . . . , x_n) = ∏_{i>j} (x_i − x_j).   (2.2)

Proof. Consider the function

  U(x) = det V(x_0, x_1, . . . , x_{n−1}, x)
       = det [ 1  x_0      x_0²      ···  x_0ⁿ    ]
             [ 1  x_1      x_1²      ···  x_1ⁿ    ]
             [ ⋮   ⋮        ⋮        ⋱    ⋮       ]
             [ 1  x_{n−1}  x_{n−1}²  ···  x_{n−1}ⁿ ]
             [ 1  x        x²        ···  xⁿ      ].   (2.3)

Clearly, U(x) ∈ P_n and it vanishes at
x_0, x_1, . . . , x_{n−1} since inserting these values in
place of x yields two identical rows in the determinant. It
follows that

  U(x) = A ∏_{i=0}^{n−1} (x − x_i),

where A depends only on x_0, x_1, . . . , x_{n−1}.
Meanwhile, the expansion of U(x) in (2.3) by minors of its
last row implies that the coefficient of xⁿ is
det V(x_0, x_1, . . . , x_{n−1}). Hence we have

  U(x) = det V(x_0, x_1, . . . , x_{n−1}) ∏_{i=0}^{n−1} (x − x_i),

and consequently the recursion

  det V(x_0, x_1, . . . , x_n)
    = det V(x_0, x_1, . . . , x_{n−1}) ∏_{i=0}^{n−1} (x_n − x_i).

An induction based on det V(x_0, x_1) = x_1 − x_0 yields
(2.2).
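Lemma 2.4 can be sanity-checked in exact arithmetic (a sketch; the determinant is computed naively from the Leibniz formula, which is fine for tiny matrices):

```python
from fractions import Fraction
from itertools import permutations

def det(m):
    """Determinant via the Leibniz formula (only for small n)."""
    n = len(m)
    total = Fraction(0)
    for perm in permutations(range(n)):
        sign = 1
        for i in range(n):           # count inversions for the sign
            for j in range(i + 1, n):
                if perm[i] > perm[j]:
                    sign = -sign
        prod = Fraction(1)
        for i in range(n):
            prod *= m[i][perm[i]]
        total += sign * prod
    return total

def vandermonde(xs):
    n = len(xs)
    return [[Fraction(x) ** j for j in range(n)] for x in xs]

xs = [Fraction(1), Fraction(2), Fraction(4), Fraction(7)]
lhs = det(vandermonde(xs))
rhs = Fraction(1)
for i in range(len(xs)):             # the product in (2.2)
    for j in range(i):
        rhs *= xs[i] - xs[j]
print(lhs == rhs)                    # → True
```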
Theorem 2.5 (Uniqueness of polynomial interpolation). Given
distinct points x_0, x_1, . . . , x_n ∈ C and corresponding
values f_0, f_1, . . . , f_n ∈ C, denote by P_n the class of
polynomials of degree at most n. There exists a unique
polynomial p_n(x) ∈ P_n such that

  ∀i = 0, 1, . . . , n,  p_n(x_i) = f_i.                (2.4)

Proof. Set up a polynomial ∑_{i=0}^n a_i xⁱ with n + 1
undetermined coefficients a_i. The condition (2.4) leads to
the system of n + 1 equations

  a_0 + a_1 x_i + a_2 x_i² + ··· + a_n x_iⁿ = f_i,

where i = 0, 1, . . . , n. By Lemma 2.4, the determinant of
the system is ∏_{i>j} (x_i − x_j). The proof is completed by
the distinctness of the points and Cramer's rule.

2.2 The Cauchy remainder

Theorem 2.6 (Generalized Rolle). Let n ≥ 2. Suppose that
f ∈ C^{n−1}[a, b] and f⁽ⁿ⁾(x) exists at each point of
(a, b). Suppose that f(x_0) = f(x_1) = ··· = f(x_n) = 0 for
a ≤ x_0 < x_1 < ··· < x_n ≤ b. Then there is a point
ξ ∈ (x_0, x_n) such that f⁽ⁿ⁾(ξ) = 0.

Proof. Applying Rolle's theorem (Theorem C.50) on the n
intervals (x_i, x_{i+1}) yields n points ζ_i where
f'(ζ_i) = 0. Consider f', f'', . . . , f^{(n−1)} as new
functions. Repeatedly applying the above argument completes
the proof.


Theorem 2.7 (Cauchy remainder of polynomial interpolation).
Let f ∈ Cⁿ[a, b] and suppose that f⁽ⁿ⁺¹⁾(x) exists at each
point of (a, b). Let p_n(f; x) denote the unique polynomial
in P_n that coincides with f at x_0, x_1, . . . , x_n.
Define

  R_n(f; x) := f(x) − p_n(f; x)                         (2.5)

as the Cauchy remainder of the polynomial interpolation. If
a ≤ x_0 < x_1 < ··· < x_n ≤ b, then there exists some
ξ ∈ (a, b) such that

  R_n(f; x) = (f⁽ⁿ⁺¹⁾(ξ)/(n+1)!) ∏_{i=0}^{n} (x − x_i),   (2.6)

where the value of ξ depends on x, x_0, x_1, . . . , x_n,
and f.

Proof. Since f(x_k) = p_n(f; x_k), the remainder R_n(f; x)
vanishes at the x_k's. Fix x ≠ x_0, x_1, . . . , x_n and
define

  K(x) = (f(x) − p_n(f; x)) / ∏_{i=0}^{n} (x − x_i)

and a function of t

  W(t) = f(t) − p_n(f; t) − K(x) ∏_{i=0}^{n} (t − x_i).

The function W(t) vanishes at t = x_0, x_1, . . . , x_n. In
addition, W(x) = 0. By Theorem 2.6, W⁽ⁿ⁺¹⁾(ξ) = 0 for some
ξ ∈ (a, b), i.e.

  0 = W⁽ⁿ⁺¹⁾(ξ) = f⁽ⁿ⁺¹⁾(ξ) − (n+1)! K(x).

Hence K(x) = f⁽ⁿ⁺¹⁾(ξ)/(n+1)! and (2.6) holds.

Corollary 2.8. Suppose f(x) ∈ C^{n+1}[a, b]. Then

  |R_n(f; x)| ≤ (M_{n+1}/(n+1)!) ∏_{i=0}^{n} |x − x_i|
             ≤ (M_{n+1}/(n+1)!) (b − a)^{n+1},          (2.7)

where M_{n+1} = max_{x∈[a,b]} |f⁽ⁿ⁺¹⁾(x)|.

Example 2.9. A value for arcsin(0.5335) is obtained by
interpolating linearly between the values for x = 0.5330 and
x = 0.5340. Estimate the error committed.

Let f(x) = arcsin(x). Then

  f''(x) = x(1 − x²)^{−3/2},
  f'''(x) = (1 + 2x²)(1 − x²)^{−5/2}.

Since the third derivative is positive over
[0.5330, 0.5340], the maximum value of f'' occurs at 0.5340.
By Corollary 2.8 we have |R_1| ≤ 4.42 × 10⁻⁷. The true error
is about 1.10 × 10⁻⁷.

2.3 The Lagrange formula

Definition 2.10. To interpolate given values
f_0, f_1, . . . , f_n at distinct points
x_0, x_1, . . . , x_n, the Lagrange formula is

  p_n(x) = ∑_{k=0}^{n} f_k ℓ_k(x),                      (2.8)

where the fundamental polynomial for pointwise interpolation
(or elementary Lagrange interpolation polynomial) ℓ_k(x) is

  ℓ_k(x) = ∏_{i=0, i≠k}^{n} (x − x_i)/(x_k − x_i).      (2.9)

In particular, for n = 0, ℓ_0 = 1.

Example 2.11. For i = 0, 1, 2, we are given x_i = 1, 2, 4
and f(x_i) = 8, 1, 5, respectively. The Lagrange formula
generates p_2(x) = 3x² − 16x + 21.
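A direct check of Example 2.11 via (2.8)-(2.9), using exact rational arithmetic:

```python
from fractions import Fraction

def lagrange_eval(xs, fs, x):
    """Evaluate the Lagrange form (2.8) at x."""
    total = Fraction(0)
    for k, (xk, fk) in enumerate(zip(xs, fs)):
        lk = Fraction(1)                     # builds l_k(x) of (2.9)
        for i, xi in enumerate(xs):
            if i != k:
                lk *= Fraction(x - xi, xk - xi)
        total += fk * lk
    return total

xs = [1, 2, 4]
fs = [8, 1, 5]
# Agrees everywhere with p2(x) = 3x^2 - 16x + 21 from Example 2.11.
for x in range(-3, 8):
    assert lagrange_eval(xs, fs, x) == 3 * x * x - 16 * x + 21
print(lagrange_eval(xs, fs, 3))              # → 0
```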
Lemma 2.12. Define a symmetric polynomial

  π_n(x) = 1                        if n = 0;
  π_n(x) = ∏_{i=0}^{n−1} (x − x_i)  if n > 0.           (2.10)

Then for n > 0 the fundamental polynomial for pointwise
interpolation can be expressed as

  ∀x ≠ x_k,  ℓ_k(x) = π_{n+1}(x) / ((x − x_k) π'_{n+1}(x_k)).   (2.11)

Proof. By the product rule, π'_{n+1}(x) is the summation of
n + 1 terms, each of which is a product of n factors. When x
is replaced with x_k, all of the n + 1 terms vanish except
one.

Lemma 2.13 (Cauchy relations). The fundamental polynomials
ℓ_k(x) satisfy the Cauchy relations as follows:

  ∑_{k=0}^{n} ℓ_k(x) ≡ 1,                               (2.12)

  ∀j = 1, . . . , n,  ∑_{k=0}^{n} (x_k − x)ʲ ℓ_k(x) ≡ 0.   (2.13)

Proof. By Theorems 2.5 and 2.7, for each q(x) ∈ P_n we have
p_n(q; x) ≡ q(x). Interpolating the constant function
f(x) ≡ 1 with the Lagrange formula yields (2.12). Similarly,
(2.13) can be proved by interpolating the polynomial
q(u) = (u − x)ʲ for each j = 1, . . . , n with the Lagrange
formula.

2.4 The Newton formula

Definition 2.14 (Divided difference and the Newton
formula). The Newton formula for interpolating the values
f_0, f_1, . . . , f_n at distinct points
x_0, x_1, . . . , x_n is

  p_n(x) = ∑_{k=0}^{n} a_k π_k(x),                      (2.14)

where π_k is defined in (2.10) and the kth divided
difference a_k is defined as the coefficient of xᵏ in
p_k(f; x) and is denoted by f[x_0, x_1, . . . , x_k] or
[x_0, x_1, . . . , x_k]f. In particular, f[x_0] = f(x_0).

Corollary 2.15. Suppose (i_0, i_1, i_2, . . . , i_k) is a
permutation of (0, 1, 2, . . . , k). Then

  f[x_0, x_1, . . . , x_k] = f[x_{i_0}, x_{i_1}, . . . , x_{i_k}].   (2.15)


Proof. The interpolating polynomial does not depend on the
numbering of the interpolating nodes. The rest of the proof
follows from the uniqueness of the interpolating polynomial
in Theorem 2.5.

Corollary 2.16. The kth divided difference can be expressed
as

  f[x_0, x_1, . . . , x_k]
    = ∑_{i=0}^{k} f_i / ∏_{j=0, j≠i}^{k} (x_i − x_j)
    = ∑_{i=0}^{k} f_i / π'_{k+1}(x_i),                  (2.16)

where π_{k+1}(x) is defined in (2.10).

Proof. The uniqueness of interpolating polynomials in
Theorem 2.5 implies that the two polynomials in (2.8) and
(2.14) are the same. Then the first equality follows from
(2.9) and Definition 2.14, while the second equality follows
from Lemma 2.12.

Theorem 2.17. Divided differences satisfy the recursion

  f[x_0, x_1, . . . , x_k]
    = (f[x_1, x_2, . . . , x_k] − f[x_0, x_1, . . . , x_{k−1}]) / (x_k − x_0).   (2.17)

Proof. By Definition 2.14, f[x_1, x_2, . . . , x_k] is the
coefficient of x^{k−1} in a degree-(k−1) interpolating
polynomial, say, P_2(x). Similarly, let P_1(x) be the
interpolating polynomial whose coefficient of x^{k−1} is
f[x_0, x_1, . . . , x_{k−1}]. Construct a polynomial

  P(x) = P_1(x) + ((x − x_0)/(x_k − x_0)) (P_2(x) − P_1(x)).

Clearly P(x_0) = P_1(x_0). Furthermore, the interpolation
condition implies P_2(x_i) = P_1(x_i) for
i = 1, 2, . . . , k − 1. Hence P(x_i) = P_1(x_i) for
i = 1, 2, . . . , k − 1. Lastly, P(x_k) = P_2(x_k).
Therefore, P(x) as above is the interpolating polynomial for
the given values at the k + 1 points. In particular, the
term f[x_0, x_1, ··· , x_k]xᵏ in P(x) is contained in
(x/(x_k − x_0))(P_2(x) − P_1(x)). The rest follows from the
definitions of P_1(x), P_2(x), and the kth divided
difference.

Definition 2.18. The kth divided difference (k ∈ N⁺) on the
table of divided differences

  x_0  f[x_0]
  x_1  f[x_1]  f[x_0, x_1]
  x_2  f[x_2]  f[x_1, x_2]  f[x_0, x_1, x_2]
  x_3  f[x_3]  f[x_2, x_3]  f[x_1, x_2, x_3]  f[x_0, x_1, x_2, x_3]
  ···  ···     ···          ···               ···

is calculated as the difference of the entry immediately to
the left and the one above it, divided by the difference of
the x-value horizontally to the left and the one
corresponding to the f-value found by going diagonally up.

Example 2.19. Derive the interpolating polynomial via the
Newton formula for the function f with given values as
follows; then estimate f(3/2).

  x     0   1   2   3
  f(x)  6  −3  −6   9

By Definition 2.18, we can construct the following table of
divided differences:

  0   6
  1  −3   −9
  2  −6   −3   3                                        (2.18)
  3   9   15   9   2

By Definition 2.14, the interpolating polynomial is
generated from the main diagonal and the first column of the
above table as follows:

  p_3 = 6 − 9x + 3x(x − 1) + 2x(x − 1)(x − 2).          (2.19)

Hence f(3/2) ≈ p_3(3/2) = −6.
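Example 2.19 in code (a sketch of the divided-difference table via the recursion (2.17) and a Horner-like evaluation of the Newton form, in exact arithmetic):

```python
from fractions import Fraction

def divided_differences(xs, fs):
    """Return [f[x0], f[x0,x1], ..., f[x0,...,xn]] via (2.17)."""
    coef = [Fraction(f) for f in fs]
    n = len(xs)
    for k in range(1, n):
        for i in range(n - 1, k - 1, -1):
            coef[i] = (coef[i] - coef[i - 1]) / (xs[i] - xs[i - k])
    return coef

def newton_eval(xs, coef, x):
    """Evaluate the Newton form (2.14) at x."""
    result = coef[-1]
    for k in range(len(coef) - 2, -1, -1):
        result = result * (x - xs[k]) + coef[k]
    return result

xs = [0, 1, 2, 3]
fs = [6, -3, -6, 9]
coef = divided_differences(xs, fs)
print([int(c) for c in coef])                  # → [6, -9, 3, 2], the diagonal of (2.18)
print(newton_eval(xs, coef, Fraction(3, 2)))   # → -6, as in Example 2.19
```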
Exercise 2.20. Redo Example 2.11 with the Newton formula.

Theorem 2.21. For distinct points x_0, x_1, . . . , x_n, and
x, we have

  f(x) = f[x_0] + f[x_0, x_1](x − x_0) + ···
       + f[x_0, x_1, ··· , x_n] ∏_{i=0}^{n−1} (x − x_i)   (2.20)
       + f[x_0, x_1, ··· , x_n, x] ∏_{i=0}^{n} (x − x_i).

Proof. Take another point z ≠ x_i. The Newton formula
applied to x_0, x_1, . . . , x_n, z yields an interpolating
polynomial

  Q(x) = f[x_0] + f[x_0, x_1](x − x_0) + ···
       + f[x_0, x_1, ··· , x_n] ∏_{i=0}^{n−1} (x − x_i)
       + f[x_0, x_1, ··· , x_n, z] ∏_{i=0}^{n} (x − x_i).

The interpolation condition Q(z) = f(z) yields

  f(z) = Q(z) = f[x_0] + f[x_0, x_1](z − x_0) + ···
       + f[x_0, x_1, ··· , x_n] ∏_{i=0}^{n−1} (z − x_i)
       + f[x_0, x_1, ··· , x_n, z] ∏_{i=0}^{n} (z − x_i).

Replacing the dummy variable z with x yields (2.20).

The above argument assumes x ≠ x_i. We now consider the
case of x = x_j for some fixed j. Rewrite (2.20) as
f(x) = p_n(f; x) + R(x), where R(x) is clearly the last term
in (2.20). We need to show

  ∀j = 0, 1, ··· , n,  p_n(f; x_j) + R(x_j) − f(x_j) = 0,

where p_n(f; x_j) is the value of p_n(f; x) at x = x_j; this
clearly holds because R(x_j) = 0 and the interpolation
condition at x_j dictates p_n(f; x_j) = f(x_j).

Corollary 2.22. Suppose f ∈ Cⁿ[a, b] and f⁽ⁿ⁺¹⁾(x) exists
at each point of (a, b). If a = x_0 < x_1 < ··· < x_n = b
and x ∈ [a, b], then there exists ξ(x) ∈ (a, b) such that

  f[x_0, x_1, ··· , x_n, x] = f⁽ⁿ⁺¹⁾(ξ(x)) / (n+1)!.    (2.21)


Proof. This follows from Theorems 2.21 and 2.7.

Corollary 2.23. If x_0 < x_1 < ··· < x_n and
f ∈ Cⁿ[x_0, x_n], we have

  lim_{x_n→x_0} f[x_0, x_1, ··· , x_n] = f⁽ⁿ⁾(x_0)/n!.   (2.22)

Proof. Set x = x_{n+1} in Corollary 2.22, replace n + 1 by
n, and we have ξ → x_0 as x_n → x_0 since each x_i → x_0.

Definition 2.24. A bisequence is a function f : Z → R.

Definition 2.25. The forward shift E and the backward shift
B are linear operators V → V on the linear space V of
bisequences given by

  (Ef)(i) = f(i + 1),  (Bf)(i) = f(i − 1).              (2.23)

The forward difference ∆ and the backward difference ∇ are
linear operators V → V given by

  ∆ = E − I,  ∇ = I − B,                                (2.24)

where I is the identity operator on V.

Example 2.26. With the notation f_i := f(i) for a
bisequence f, the nth forward difference and the nth
backward difference are

  ∆ⁿf_i := (∆ⁿf)(i),  ∇ⁿf_i := (∇ⁿf)(i).                (2.25)

In particular, for n = 1 we have

  ∆f_i = f_{i+1} − f_i,  ∇f_i = f_i − f_{i−1}.          (2.26)

Theorem 2.27. The forward difference and backward
difference are related as

  ∀n ∈ N⁺,  ∆ⁿf_i = ∇ⁿf_{i+n}.                          (2.27)

Proof. An easy induction.

Theorem 2.28. The forward difference can be expressed
explicitly as

  ∆ⁿf_i = ∑_{k=0}^{n} (−1)^{n−k} C(n, k) f_{i+k},       (2.28)

where C(n, k) denotes the binomial coefficient.

Proof. For n = 1, (2.28) reduces to ∆f_i = f_{i+1} − f_i.
The rest of the proof is an induction utilizing the identity

  C(n, k) + C(n, k−1) = C(n+1, k).                      (2.29)

Suppose (2.28) holds. For the inductive step, we have

  ∆^{n+1}f_i = ∆∆ⁿf_i
    = ∆( ∑_{k=0}^{n} (−1)^{n−k} C(n, k) f_{i+k} )
    = ∑_{k=0}^{n} (−1)^{n−k} C(n, k) f_{i+k+1}
      − ∑_{k=0}^{n} (−1)^{n−k} C(n, k) f_{i+k}
    = ∑_{k=1}^{n} (−1)^{n+1−k} C(n, k−1) f_{i+k} + f_{i+n+1}
      + ∑_{k=1}^{n} (−1)^{n+1−k} C(n, k) f_{i+k} + (−1)^{n+1} f_i
    = ∑_{k=0}^{n+1} (−1)^{n+1−k} C(n+1, k) f_{i+k},

where the second line follows from (2.26), the third line
from splitting one term out of each sum and replacing the
dummy variable in the first sum, and the fourth line from
(2.29) and the fact that (−1)^{n+1} f_i and f_{i+n+1}
contribute to the first and last terms, respectively.

Theorem 2.29. On a grid x_i = x_0 + ih with uniform
spacing h, the sequence of values f_i = f(x_i) satisfies

  ∀n ∈ N⁺,  f[x_0, x_1, . . . , x_n] = ∆ⁿf_0 / (n! hⁿ).   (2.30)

Proof. Of course (2.30) can be proven by induction. Here we
provide a more informative proof. For π_{n+1}(x) defined in
(2.10), we have
π'_{n+1}(x_k) = ∏_{i=0, i≠k}^{n} (x_k − x_i). It follows
from x_k − x_i = (k − i)h that

  π'_{n+1}(x_k) = ∏_{i=0, i≠k}^{n} (k − i)h
                = hⁿ k!(n − k)!(−1)^{n−k}.              (2.31)

Then we have

  f[x_0, x_1, . . . , x_n] = ∑_{k=0}^{n} f_k / π'_{n+1}(x_k)
    = ∑_{k=0}^{n} (−1)^{n−k} f_k / (hⁿ k!(n − k)!)
    = (1/(hⁿ n!)) ∑_{k=0}^{n} (−1)^{n−k} C(n, k) f_k
    = ∆ⁿf_0 / (hⁿ n!),

where the first step follows from Corollary 2.16, the second
from (2.31), and the last from Theorem 2.28.

Theorem 2.30 (Newton's forward difference formula). Suppose
p_n(f; x) ∈ P_n interpolates f(x) on a uniform grid
x_i = x_0 + ih at x_0, x_1, . . . , x_n with f_i = f(x_i).
Then

  ∀s ∈ R,  p_n(f; x_0 + sh) = ∑_{k=0}^{n} C(s, k) ∆ᵏf_0,   (2.32)

where ∆⁰f_0 = f_0 and

  C(s, k) = s(s − 1) ··· (s − k + 1) / k!.              (2.33)

Proof. Set f(x) = p_n(f; x) in Theorem 2.21, apply Theorem
2.29, and we have

  p(x) = f_0 + ∑_{k=1}^{n} (∆ᵏf_0/(k! hᵏ)) ∏_{i=0}^{k−1} (x − x_i);

the remainder is zero because any (n+1)th divided difference
applied to a degree-n polynomial is zero. The proof is
completed by x = x_0 + sh, x_i = x_0 + ih, and (2.33).
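Relation (2.30) is easy to verify on a concrete grid; the following sketch compares ∆ⁿf_0/(n! hⁿ) against a recursively computed divided difference for f(x) = x³ (exact rational arithmetic avoids rounding questions).

```python
import math
from fractions import Fraction

def forward_diff(fs, n):
    """The nth forward difference of the sequence fs, taken at index 0."""
    vals = list(fs)
    for _ in range(n):
        vals = [b - a for a, b in zip(vals, vals[1:])]
    return vals[0]

def divdiff(xs, fs):
    """f[x_0, ..., x_n] via the recursion (2.17)."""
    if len(xs) == 1:
        return fs[0]
    return (divdiff(xs[1:], fs[1:]) - divdiff(xs[:-1], fs[:-1])) / (xs[-1] - xs[0])

h, x0 = Fraction(1, 3), Fraction(1)
f = lambda x: x ** 3
for n in range(1, 5):
    xs = [x0 + i * h for i in range(n + 1)]
    fs = [f(x) for x in xs]
    assert divdiff(xs, fs) == forward_diff(fs, n) / (math.factorial(n) * h ** n)
print("(2.30) holds for n = 1, ..., 4")
```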


2.5 The Neville-Aitken algorithm

Theorem 2.31. Denote p_0^[i] = f(x_i) for
i = 0, 1, . . . , n. For all k = 0, 1, . . . , n − 1 and
i = 0, 1, . . . , n − k − 1, define

  p_{k+1}^[i](x)
    = ( (x − x_i) p_k^[i+1](x) − (x − x_{i+k+1}) p_k^[i](x) )
      / (x_{i+k+1} − x_i).                              (2.34)

Then each p_k^[i] is the interpolating polynomial for the
function f at the points x_i, x_{i+1}, . . . , x_{i+k}. In
particular, p_n^[0] is the interpolating polynomial of
degree n for the function f at the points
x_0, x_1, . . . , x_n.

Proof. The induction basis clearly holds for k = 0 because
of the definition p_0^[i] = f(x_i). Suppose that p_k^[i] is
the interpolating polynomial of degree k for the function f
at the points x_i, x_{i+1}, . . . , x_{i+k}. Then the
interpolation conditions yield

  ∀j = i+1, i+2, . . . , i+k,
  p_k^[i+1](x_j) = p_k^[i](x_j) = f(x_j),

which, together with (2.34), implies

  ∀j = i+1, i+2, . . . , i+k,  p_{k+1}^[i](x_j) = f(x_j).

In addition, (2.34) and the induction hypothesis yield

  p_{k+1}^[i](x_i) = p_k^[i](x_i) = f(x_i),
  p_{k+1}^[i](x_{i+k+1}) = p_k^[i+1](x_{i+k+1}) = f(x_{i+k+1}).

The proof is completed by the last three equations and the
uniqueness of interpolating polynomials.

Example 2.32. To estimate f(x) for x = 3/2 directly from
the table in Example 2.19, we construct a table by repeating
(2.34) with x_i = i for i = 0, 1, 2, 3:

  x_i  x − x_i  f(x_i)  p_1^[i](x)  p_2^[i](x)  p_3^[i](x)
  0     3/2      6      −15/2       −21/4       −6
  1     1/2     −3      −9/2        −27/4               (2.35)
  2    −1/2     −6      −27/2
  3    −3/2      9

The result is the same as that in Example 2.19. In contrast,
the calculation and layout of the two tables are distinct.
for some ξ ∈ [x0 , x2 ].
2.6 The Hermite interpolation f1′

Definition 2.33. Given distinct points x0 , x1 , . . . , xk in p


[a, b], non-negative integers m0 , m1 , . . . , mk , and a function
f ∈ C M [a, b] where M = maxi mi , the Hermite interpolation
problem seeks to find a polynomial p of the lowest degree
such that f1 f2
(µ)
∀i = 0, 1, . . . , k, ∀µ = 0, 1, · · · , mi , p(µ) (xi ) = fi , (2.36)
f0
(µ) (µ)
where fi = f (xi ) is the value of the µth derivative of f
(0) x0 x1 x2
at xi ; in particular, fi = f (xi ).
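The repeated linear interpolation of Example 2.32 is easy to check in code. Below is a minimal sketch, with the data f(0) = 6, f(1) = −3, f(2) = −6, f(3) = 9 read off the table; the helper name `neville` is ours:

```python
# Repeated linear interpolation (2.34) in the spirit of table (2.35).
def neville(xs, fs, x):
    p = list(fs)
    n = len(xs)
    for k in range(1, n):        # build p_k^[i] from p_{k-1}^[i] and p_{k-1}^[i+1]
        for i in range(n - k):
            p[i] = ((x - xs[i + k]) * p[i] - (x - xs[i]) * p[i + 1]) / (xs[i] - xs[i + k])
    return p[0]                  # p_{n-1}^[0](x)

print(neville([0, 1, 2, 3], [6, -3, -6, 9], 1.5))   # -6.0, matching the table
```

The intermediate values produced by the inner loop are exactly the columns p_1, p_2, p_3 of the table.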

Qinghai Zhang Numerical Analysis 2021

2.7 The Chebyshev polynomials

Example 2.38 (Runge phenomenon). The points x_0, x_1, . . . , x_n in Theorem 2.5 are usually given a priori, e.g., as uniformly distributed over the interval [x_0, x_n]. As n increases, the degree of the interpolating polynomial also increases. Ideally we would like to have

  ∀f ∈ C[x_0, x_n], ∀x ∈ [x_0, x_n],  lim_{n→+∞} p_n(f; x) = f(x).  (2.39)

However, this is not true for polynomial interpolation on equally spaced points. The famous Runge's example illustrates the violent oscillations at the ends of the interval.

[Figure: interpolating polynomials of degrees n = 2, 4, 6, 8 on equally spaced points over [−5, 5]; the higher-degree interpolants oscillate wildly near the endpoints.]

The above plot is created by interpolating

  f(x) = 1/(1 + x²)  (2.40)

on x_i = −5 + 10i/n, i = 0, 1, . . . , n with n = 2, 4, 6, 8.

Definition 2.39. The Chebyshev polynomial of degree n ∈ N of the first kind is a polynomial T_n : [−1, 1] → [−1, 1],

  T_n(x) = cos(n arccos x).  (2.41)

Theorem 2.40. The Chebyshev polynomials of the first kind satisfy the following recursive relation,

  ∀n ∈ N+,  T_{n+1}(x) = 2x T_n(x) − T_{n−1}(x).  (2.42)

Proof. By trigonometric identities, we have

  cos(n + 1)θ = cos nθ cos θ − sin nθ sin θ,
  cos(n − 1)θ = cos nθ cos θ + sin nθ sin θ.

Adding up the two equations and setting cos θ = x complete the proof.

Corollary 2.41. The coefficient of x^n in T_n is 2^{n−1} for each n > 0.

Proof. Use (2.42) and T_1 = x in an induction.

Theorem 2.42. T_n(x) has simple zeros at the n points

  x_k = cos((2k − 1)π/(2n)),  (2.43)

where k = 1, 2, . . . , n. For x ∈ [−1, 1] and n ∈ N+, T_n(x) has extreme values at the n + 1 points

  x_k′ = cos(kπ/n),  k = 0, 1, . . . , n,  (2.44)

where it assumes the alternating values (−1)^k.

Proof. (2.41) and (2.43) yield

  T_n(x_k) = cos(n arccos(cos((2k − 1)π/(2n)))) = cos((2k − 1)π/2) = 0.

Differentiate (2.41) and we have

  T_n′(x) = n sin(n arccos x)/√(1 − x²).

Then each x_k must be a simple zero since

  T_n′(x_k) = n/√(1 − x_k²) · sin((2k − 1)π/2) ≠ 0.

In contrast, ∀k = 1, 2, . . . , n − 1,

  T_n′(x_k′) = n (1 − cos²(kπ/n))^{−1/2} sin(kπ) = 0;
  T_n″(x) = n² cos(n arccos x)/(x² − 1) + n x sin(n arccos x)/(1 − x²)^{3/2};
  T_n″(x_k′) ≠ 0.

Hence a Taylor expansion of T_n yields

  T_n(x_k′ + δ) = T_n(x_k′) + (1/2) T_n″(x_k′) δ² + O(δ³),

and T_n must attain a local extreme at each x_k′. For k = 0, 1, . . . , n, T_n attains its extreme values at x_k′ since T_n(x_0′) = 1, T_n(x_1′) = −1, . . ., and by (2.41) we have |T_n(x)| ≤ 1. Clearly these are the only extrema of T_n(x) on [−1, 1].

[Figure: the Chebyshev polynomials plotted on [−1, 1], oscillating between −1 and 1.]

Exercise 2.43. Write a program to reproduce the above plot.

Theorem 2.44 (Chebyshev). Denote by P̃_n the class of all polynomials of degree n ∈ N+ with leading coefficient 1. Then

  ∀p ∈ P̃_n,  max_{x∈[−1,1]} |T_n(x)/2^{n−1}| ≤ max_{x∈[−1,1]} |p(x)|.  (2.45)
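The recursion (2.42) and the zeros (2.43) can be verified numerically. A minimal sketch (the helper name `chebyshev_T` is ours):

```python
import math

def chebyshev_T(n, x):
    # Three-term recursion (2.42) with T_0 = 1, T_1 = x.
    t0, t1 = 1.0, x
    if n == 0:
        return t0
    for _ in range(n - 1):
        t0, t1 = t1, 2.0 * x * t1 - t0
    return t1

n = 7
zeros = [math.cos((2 * k - 1) * math.pi / (2 * n)) for k in range(1, n + 1)]  # (2.43)
assert all(abs(chebyshev_T(n, z)) < 1e-12 for z in zeros)
for x in (-0.9, -0.3, 0.2, 0.8):           # agreement with (2.41) on [-1, 1]
    assert abs(chebyshev_T(n, x) - math.cos(n * math.acos(x))) < 1e-12
```

The recursion is preferable in practice: unlike (2.41), it is defined for |x| > 1 as well and involves no inverse trigonometric functions.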


Proof. By Theorem 2.42, T_n(x) assumes its extrema n + 1 times at the points x_k′ defined in (2.44). Suppose (2.45) does not hold. Then Theorem 2.42 implies that

  ∃p ∈ P̃_n s.t. max_{x∈[−1,1]} |p(x)| < 1/2^{n−1}.  (2.46)

Consider the polynomial Q(x) = T_n(x)/2^{n−1} − p(x). We have

  Q(x_k′) = (−1)^k/2^{n−1} − p(x_k′),  k = 0, 1, . . . , n.

By (2.46), Q(x) has alternating signs at these n + 1 points. Hence Q(x) must have n zeros. However, by the construction of Q(x), the degree of Q(x) is at most n − 1. Therefore, Q(x) ≡ 0 and p(x) = T_n(x)/2^{n−1}, which implies max_{x∈[−1,1]} |p(x)| = 1/2^{n−1}. This is a contradiction to (2.46).

Corollary 2.45. For n ∈ N+, we have

  max_{x∈[−1,1]} |x^n + a_1 x^{n−1} + · · · + a_n| ≥ 1/2^{n−1}.  (2.47)

Corollary 2.46. Suppose polynomial interpolation is performed for f on the n + 1 zeros of T_{n+1}(x) as in Theorem 2.42. The Cauchy remainder in Theorem 2.7 satisfies

  |R_n(f; x)| ≤ 1/(2^n (n+1)!) · max_{x∈[−1,1]} |f^(n+1)(x)|.  (2.48)

Proof. Theorem 2.7, Corollary 2.41, and Theorem 2.42 yield

  |R_n(f; x)| = |f^(n+1)(ξ)|/(n+1)! · ∏_{i=0}^n |x − x_i| = |f^(n+1)(ξ)|/(2^n (n+1)!) · |T_{n+1}|.

Definition 2.39 completes the proof as |T_{n+1}| ≤ 1.

2.8 The Bernstein polynomials

Definition 2.47. The Bernstein base polynomials of degree n ∈ N+ relative to the unit interval [0, 1] are

  b_{n,k}(t) = C(n, k) t^k (1 − t)^{n−k},  (2.49)

where k = 0, 1, . . . , n and C(n, k) is the binomial coefficient.

Lemma 2.48. The Bernstein base polynomials satisfy

  ∀k = 0, 1, . . . , n, ∀t ∈ (0, 1),  b_{n,k}(t) > 0,  (2.50a)
  ∑_{k=0}^n b_{n,k}(t) = 1,  (2.50b)
  ∑_{k=0}^n k b_{n,k}(t) = nt,  (2.50c)
  ∑_{k=0}^n (k − nt)² b_{n,k}(t) = nt(1 − t).  (2.50d)

Lemma 2.49. The Bernstein base polynomials of degree n form a basis of P_n, the vector space of all polynomials with degree no more than n.

Proof. This follows from Definition 2.47.

Definition 2.50. The nth Bernstein polynomial of a map f ∈ C[0, 1] is

  (B_n f)(t) := ∑_{k=0}^n f(k/n) b_{n,k}(t),  (2.51)

where b_{n,k} is a Bernstein base polynomial in (2.49).

Theorem 2.51 (Weierstrass approximation). Every continuous function f : [a, b] → R can be uniformly approximated as closely as desired by a polynomial function.

  ∀f ∈ C[a, b], ∀ε > 0, ∃N ∈ N+ s.t. ∀n > N, ∃p_n ∈ P_n s.t. ∀x ∈ [a, b], |p_n(x) − f(x)| < ε.  (2.52)

Proof. Without loss of generality, we assume a = 0, b = 1. Set p_n = B_n f in (2.51). For any ε > 0, there exist δ > 0 and n ∈ N+ such that

  |(B_n f)(t) − f(t)| = |(B_n f)(t) − f(t) ∑_{k=0}^n b_{n,k}(t)|
   = |∑_{k=0}^n (f(k/n) − f(t)) b_{n,k}(t)|
   = |(∑_{k:|k/n−t|<δ} + ∑_{k:|k/n−t|≥δ}) (f(k/n) − f(t)) b_{n,k}(t)|
   ≤ sup_{|t−s|≤δ} |f(t) − f(s)| + ‖f‖_∞/(2nδ²)
   ≤ ε/2 + ε/2 = ε,

where the case |k − nt| < nδ in the second inequality follows from (2.50a) and (2.50b), the other case |k − nt| ≥ nδ in the second inequality follows from (2.50d) and

  ∑_{k:|k/n−t|≥δ} b_{n,k}(t) ≤ ∑_{k:|k/n−t|≥δ} (k − nt)²/(δ²n²) b_{n,k}(t) ≤ ∑_{k=0}^n (k − nt)²/(δ²n²) b_{n,k}(t) = t(1 − t)/(nδ²) ≤ 1/(4nδ²),

and the last inequality follows from the uniform continuity of f (c.f. Theorem C.44) and the choice of n > ‖f‖_∞/(εδ²).

2.9 Problems

2.9.1 Theoretical questions

I. For f ∈ C²[x_0, x_1] and x ∈ (x_0, x_1), linear interpolation of f at x_0 and x_1 yields

  f(x) − p_1(f; x) = f″(ξ(x))/2 · (x − x_0)(x − x_1).

Consider the case f(x) = 1/x, x_0 = 1, x_1 = 2.

• Determine ξ(x) explicitly.


• Extend the domain of ξ continuously from (x_0, x_1) to [x_0, x_1]. Find max ξ(x), min ξ(x), and max f″(ξ(x)).

II. Let P+_m be the set of all polynomials of degree ≤ m that are non-negative on the real line,

  P+_m = {p : p ∈ P_m, ∀x ∈ R, p(x) ≥ 0}.

Find p ∈ P+_{2n} such that p(x_i) = f_i for i = 0, 1, . . . , n, where f_i ≥ 0 and the x_i are distinct points on R.

III. Consider f(x) = e^x.

• Prove by induction that

  ∀t ∈ R,  f[t, t+1, . . . , t+n] = (e − 1)^n/n! · e^t.

• From Corollary 2.22 we know

  ∃ξ ∈ (0, n) s.t. f[0, 1, . . . , n] = f^(n)(ξ)/n!.

Determine ξ from the above two equations. Is ξ located to the left or to the right of the midpoint n/2?

IV. Consider f(0) = 5, f(1) = 3, f(3) = 5, f(4) = 12.

• Use the Newton formula to obtain p_3(f; x);
• The data suggest that f has a minimum in x ∈ (1, 3). Find an approximate value for the location x_min of the minimum.

V. Consider f(x) = x⁷.

• Compute f[0, 1, 1, 1, 2, 2].
• We know that this divided difference is expressible in terms of the 5th derivative of f evaluated at some ξ ∈ (0, 2). Determine ξ.

VI. f is a function on [0, 3] for which one knows that f(0) = 1, f(1) = 2, f′(1) = −1, f(3) = f′(3) = 0.

• Estimate f(2) using Hermite interpolation.
• Estimate the maximum possible error of the above answer if one knows, in addition, that f ∈ C⁵[0, 3] and |f^(5)(x)| ≤ M on [0, 3]. Express the answer in terms of M.

VII. Define the forward difference by

  Δf(x) = f(x + h) − f(x),
  Δ^{k+1} f(x) = Δ(Δ^k f)(x) = Δ^k f(x + h) − Δ^k f(x),

and the backward difference by

  ∇f(x) = f(x) − f(x − h),
  ∇^{k+1} f(x) = ∇(∇^k f)(x) = ∇^k f(x) − ∇^k f(x − h).

Prove

  Δ^k f(x) = k! h^k f[x_0, x_1, . . . , x_k],
  ∇^k f(x) = k! h^k f[x_0, x_{−1}, . . . , x_{−k}],

where x_j = x + jh.

VIII. Assume f is differentiable at x_0. Prove

  ∂/∂x_0 f[x_0, x_1, . . . , x_n] = f[x_0, x_0, x_1, . . . , x_n].

What about the partial derivative with respect to one of the other variables?

IX. A min-max problem. For n ∈ N+, determine

  min max_{x∈[a,b]} |a_0 x^n + a_1 x^{n−1} + · · · + a_n|,

where a_0 ≠ 0 is fixed and the minimum is taken over all a_i ∈ R, i = 1, 2, · · · , n.

X. Imitate the proof of the Chebyshev theorem. Express the Chebyshev polynomial of degree n ∈ N as a polynomial T_n and change its domain from [−1, 1] to R. For a fixed a > 1, define P_n^a := {p ∈ P_n : p(a) = 1} and a polynomial p̂_n(x) ∈ P_n^a,

  p̂_n(x) := T_n(x)/T_n(a).

Prove

  ∀p ∈ P_n^a,  ‖p̂_n‖_∞ ≤ ‖p‖_∞,

where the max-norm of a function f : R → R is defined as ‖f‖_∞ = max_{x∈[−1,1]} |f(x)|.

XI. Prove Lemma 2.48.

2.9.2 Programming assignments

A. Implement the Newton formula in a subroutine that produces the value of the interpolation polynomial p_n(f; x_0, x_1, . . . , x_n; x) at any real x, where n ∈ N+, the x_i's are distinct, and f is a function assumed to be available in the form of a subroutine.

B. Run your routine on the function

  f(x) = 1/(1 + x²)

for x ∈ [−5, 5] using x_i = −5 + 10i/n, i = 0, 1, . . . , n, and n = 2, 4, 6, 8. Plot the polynomials against the exact function to reproduce the plot in the notes that illustrates the Runge phenomenon.

C. Reuse your subroutine of Newton interpolation to perform Chebyshev interpolation for the function

  f(x) = 1/(1 + 25x²)

for x ∈ [−1, 1] on the zeros of the Chebyshev polynomials T_n with n = 5, 10, 15, 20. Clearly this Runge function f(x) is a scaled version of the function in B. Plot the interpolating polynomials against the exact function to observe that the Chebyshev interpolation is free of the wide oscillations in the previous assignment.
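The two ingredients of assignments A and C, Newton divided differences and Chebyshev nodes, combine as sketched below. This is an illustrative outline under our own naming, not a reference solution; the error bound asserted at the end is merely "bounded", in contrast to the equispaced case:

```python
import math

def newton_coeffs(xs, ys):
    # In-place table of divided differences: c[k] = f[x_0, ..., x_k]
    c = list(ys)
    for k in range(1, len(xs)):
        for i in range(len(xs) - 1, k - 1, -1):
            c[i] = (c[i] - c[i - 1]) / (xs[i] - xs[i - k])
    return c

def newton_eval(xs, c, x):
    # Horner-style evaluation of the Newton form
    y = c[-1]
    for i in range(len(c) - 2, -1, -1):
        y = y * (x - xs[i]) + c[i]
    return y

f = lambda x: 1.0 / (1.0 + 25.0 * x * x)
n = 15
xs = [math.cos((2 * k - 1) * math.pi / (2 * n)) for k in range(1, n + 1)]  # zeros of T_15
c = newton_coeffs(xs, [f(x) for x in xs])
grid = [-1.0 + 2.0 * j / 400 for j in range(401)]
err = max(abs(newton_eval(xs, c, x) - f(x)) for x in grid)
assert err < 0.5      # bounded error: no Runge-type blow-up on Chebyshev nodes
```

Swapping `xs` for equispaced points and sampling near x = ±1 reproduces the oscillations of assignment B.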


D. A car traveling along a straight road is clocked at a number of points. The data from the observations are given in the following table, where the time is in seconds, the distance is in feet, and the speed is in feet per second.

  Time     |  0 |   3 |   5 |   8 |  13
  Distance |  0 | 225 | 383 | 623 | 993
  Speed    | 75 |  77 |  80 |  74 |  72

(a) Use a Hermite polynomial to predict the position of the car and its speed for t = 10 s.
(b) Use the derivative of the Hermite polynomial to determine whether the car ever exceeds the 55 mi/h (81 feet per second) speed limit.

E. It is suspected that the high amounts of tannin in mature oak leaves inhibit the growth of the winter moth larvae that extensively damage these trees in certain years. The following table lists the average weight of two samples of larvae at times in the first 28 days after birth. The first sample was reared on young oak leaves, whereas the second sample was reared on mature leaves from the same tree.

  Day |    0 |    6 |   10 |   13 |   17 |   20 |   28
  Sp1 | 6.67 | 17.3 | 42.7 | 37.3 | 30.1 | 29.3 | 28.7
  Sp2 | 6.67 | 16.1 | 18.9 | 15.0 | 10.6 | 9.44 | 8.89

(a) Use Newton's formula to approximate the average weight curve for each sample.
(b) Predict whether the two samples of larvae will die after another 15 days.

Chapter 3

Splines

3.1 Piecewise-polynomial splines

Definition 3.1. Given nonnegative integers n, k, and a strictly increasing sequence {x_i} that partitions [a, b],

  a = x_1 < x_2 < · · · < x_N = b,  (3.1)

the set of spline functions of degree n and smoothness class k relative to the partition {x_i} is

  S_n^k = {s : s ∈ C^k[a, b]; ∀i ∈ [1, N − 1], s|[x_i, x_{i+1}] ∈ P_n}.  (3.2)

The x_i's are called knots of the spline.

Notation 1. In Section 2 and this section, the polynomial degree is denoted by n for all methods. Here we use N to denote the number of knots for a spline.

Example 3.2. As an extreme, S_n^n = P_n, i.e. all the pieces of s ∈ S_n^n belong to a single polynomial. On the other end, S_1^0 is the class of piecewise linear interpolating functions. The most popular splines are the cubic splines in S_3^2.

Lemma 3.3. Denote m_i = s′(f; x_i) for s ∈ S_3^2. Then, for each i = 2, 3, . . . , N − 1, we have the three-slope equations

  λ_i m_{i−1} + 2m_i + µ_i m_{i+1} = 3µ_i f[x_i, x_{i+1}] + 3λ_i f[x_{i−1}, x_i],  (3.3)

where

  µ_i = (x_i − x_{i−1})/(x_{i+1} − x_{i−1}),  λ_i = (x_{i+1} − x_i)/(x_{i+1} − x_{i−1}).  (3.4)

Proof. Denote p_i(x) = s|[x_i, x_{i+1}] and K_i = f[x_i, x_{i+1}]. The table of divided differences for the Hermite interpolation problem p_i(x_i) = f_i, p_i(x_{i+1}) = f_{i+1}, p_i′(x_i) = m_i, p_i′(x_{i+1}) = m_{i+1} is

  x_i     | f_i
  x_i     | f_i     | m_i
  x_{i+1} | f_{i+1} | K_i     | (K_i − m_i)/(x_{i+1} − x_i)
  x_{i+1} | f_{i+1} | m_{i+1} | (m_{i+1} − K_i)/(x_{i+1} − x_i) | (m_i + m_{i+1} − 2K_i)/(x_{i+1} − x_i)²

Then the Newton formula yields

  p_i(x) = f_i + (x − x_i) m_i + (x − x_i)² (K_i − m_i)/(x_{i+1} − x_i) + (x − x_i)²(x − x_{i+1}) (m_i + m_{i+1} − 2K_i)/(x_{i+1} − x_i)²,  (3.5)

or equivalently

  p_i(x) = c_{i,0} + c_{i,1}(x − x_i) + c_{i,2}(x − x_i)² + c_{i,3}(x − x_i)³,
  c_{i,0} = f_i,
  c_{i,1} = m_i,
  c_{i,2} = (3K_i − 2m_i − m_{i+1})/(x_{i+1} − x_i),
  c_{i,3} = (m_i + m_{i+1} − 2K_i)/(x_{i+1} − x_i)².  (3.6)

s ∈ C² implies that p_{i−1}″(x_i) = p_i″(x_i), i.e.

  3c_{i−1,3}(x_i − x_{i−1}) = c_{i,2} − c_{i−1,2}.

The substitution of the coefficients c_{i,j} into the above equation yields (3.3).

Lemma 3.4. Denote M_i = s″(f; x_i) for s ∈ S_3^2. Then, for each i = 2, 3, . . . , N − 1, we have the three-moment equations

  µ_i M_{i−1} + 2M_i + λ_i M_{i+1} = 6 f[x_{i−1}, x_i, x_{i+1}],  (3.7)

where µ_i and λ_i are the same as those in (3.4).

Proof. Taylor expansion of s(x) at x_i yields

  s(x) = f_i + s′(x_i)(x − x_i) + M_i/2 (x − x_i)² + s‴(x_i)/6 (x − x_i)³,  (3.8)

where x ∈ [x_i, x_{i+1}] and the derivatives should be interpreted as the right-hand derivatives. Differentiate (3.8) twice, set x = x_{i+1}, and we have

  s‴(x_i) = (M_{i+1} − M_i)/(x_{i+1} − x_i).  (3.9)

Substitute (3.9) into (3.8), set x = x_{i+1}, and we have

  s′(x_i) = f[x_i, x_{i+1}] − (1/6)(M_{i+1} + 2M_i)(x_{i+1} − x_i).  (3.10)

Similarly, differentiate (3.8) twice, set x = x_{i−1}, and we have s‴(x_i) = (M_{i−1} − M_i)/(x_{i−1} − x_i). Its substitution into (3.8) yields

  s′(x_i) = f[x_{i−1}, x_i] − (1/6)(M_{i−1} + 2M_i)(x_{i−1} − x_i).  (3.11)

The subtraction of (3.10) from (3.11) yields (3.7).

Definition 3.5 (Boundary conditions of cubic splines). Common cubic splines include the following.

• A complete cubic spline s ∈ S_3^2 satisfies the boundary conditions s′(f; a) = f′(a) and s′(f; b) = f′(b).
• A cubic spline with specified second derivatives at its end points: s″(f; a) = f″(a) and s″(f; b) = f″(b).
• A natural cubic spline s ∈ S_3^2 satisfies the boundary conditions s″(f; a) = 0 and s″(f; b) = 0.
• A not-a-knot cubic spline s ∈ S_3^2 satisfies that s‴(f; x) exists at x = x_2 and x = x_{N−1}.
• A periodic cubic spline s ∈ S_3^2 is obtained from replacing s(f; b) = f(b) with s(f; b) = s(f; a), s′(f; b) = s′(f; a), and s″(f; b) = s″(f; a).

Lemma 3.6. For a complete cubic spline s ∈ S_3^2, denote M_i = s″(f; x_i) and we have

  2M_1 + M_2 = 6 f[x_1, x_1, x_2],  (3.12)
  M_{N−1} + 2M_N = 6 f[x_{N−1}, x_N, x_N].  (3.13)

Proof. As for (3.12), the cubic polynomial on [x_1, x_2] can be written as

  s_1(x) = f[x_1] + f[x_1, x_1](x − x_1) + M_1/2 (x − x_1)² + s_1‴(x_1)/6 (x − x_1)³.

Differentiate the above equation twice, replace x with x_2, and we have s_1‴(x_1) = (M_2 − M_1)/(x_2 − x_1), which implies

  s_1(x) = f[x_1] + f[x_1, x_1](x − x_1) + M_1/2 (x − x_1)² + (M_2 − M_1)/(6(x_2 − x_1)) (x − x_1)³.  (3.14)

Set x = x_2, divide both sides by x_2 − x_1, and we have

  f[x_1, x_2] = f[x_1, x_1] + (M_1/2 + (M_2 − M_1)/6)(x_2 − x_1),

which yields (3.12). (3.13) can be proven similarly.

Theorem 3.7. For a given function f : [a, b] → R, there exists a unique complete/natural/periodic cubic spline s(f; x) that interpolates f.

Proof. We only prove the case of complete cubic splines since the other cases are similar. By the proof of Lemma 3.3, s is uniquely determined if all the m_i's are uniquely determined on all intervals. For a complete cubic spline we already have m_1 = f′(a) and m_N = f′(b). Assemble (3.3) into a linear system

  | 2      µ_2                                  | | m_2     |
  | λ_3    2      µ_3                           | | m_3     |
  |        ·      ·      ·                      | |  ·      |
  |        λ_i    2      µ_i                    | | m_i     |  =  b,  (3.15)
  |               ·      ·       ·              | |  ·      |
  |               λ_{N−2} 2      µ_{N−2}        | | m_{N−2} |
  |                      λ_{N−1} 2              | | m_{N−1} |

where the vector b is determined from the known information. (3.4) implies that the matrix in the above equation is strictly diagonally dominant. Therefore its determinant is nonzero and the m_i's can be uniquely determined. Alternatively, a complete cubic spline can be uniquely determined from Lemmas 3.4 and 3.6, following arguments similar to the above.

Example 3.8. Construct a complete cubic spline s(x) on the points x_1 = 1, x_2 = 2, x_3 = 3, x_4 = 4, x_5 = 6 from the function values of f(x) = ln(x) and its derivatives at x_1 and x_5. Approximate ln(5) by s(5).

From the given conditions, we set up the table of divided differences as follows.

  x_i | f[x_i]
   1  | 0
   1  | 0      | 1
   2  | 0.6931 | 0.6931 | −0.3069
   3  | 1.0986 | 0.4055 | −0.1438
   4  | 1.3863 | 0.2877 | −0.05889
   6  | 1.7918 | 0.2027 | −0.02831
   6  | 1.7918 | 0.1667 | −0.01803

All values of λ_i and µ_i are 1/2 except that

  λ_4 = 2/3,  µ_4 = 1/3.

Then Lemma 3.4 and Lemma 3.6 yield a linear system

  | 2 1       | | M_1 |     | −1.84112 |
  | 1 4 1     | | M_2 |     | −1.72610 |
  |   1 4 1   | | M_3 |  ≈  | −0.70670 |
  |     1 6 2 | | M_4 |     | −0.50967 |
  |       1 2 | | M_5 |     | −0.10820 |

where the elements in the RHS vector are obtained from the last column of the table of divided differences by multiplying by 6, 12, 12, 18, and 6, respectively. Why? Solve the linear system and we have all the M_i's. Then we derive an expression of the spline on the last interval following procedures similar to those for (3.14). After this expression is obtained, we then evaluate it and obtain s(5) ≈ 1.60977. In comparison, ln(5) ≈ 1.60944.
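The computation in Example 3.8 takes only a few lines of code. The sketch below solves the 5×5 moment system by Gaussian elimination and evaluates the spline on the last interval via (3.10); since it reuses the rounded divided differences of the table, the result agrees with s(5) ≈ 1.60977 only to that rounding:

```python
import math

# Moment system of Example 3.8; the RHS is 6, 12, 12, 18, 6 times the
# last column of the divided-difference table.
A = [[2, 1, 0, 0, 0],
     [1, 4, 1, 0, 0],
     [0, 1, 4, 1, 0],
     [0, 0, 1, 6, 2],
     [0, 0, 0, 1, 2]]
b = [-1.84112, -1.72610, -0.70670, -0.50967, -0.10820]

def solve(A, b):
    # Gaussian elimination without pivoting (safe here: the matrix is
    # strictly diagonally dominant), then back substitution.
    n = len(b)
    A = [row[:] for row in A]
    b = b[:]
    for k in range(n):
        for i in range(k + 1, n):
            m = A[i][k] / A[k][k]
            for j in range(k, n):
                A[i][j] -= m * A[k][j]
            b[i] -= m * b[k]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (b[i] - sum(A[i][j] * x[j] for j in range(i + 1, n))) / A[i][i]
    return x

M = solve(A, b)
M4, M5 = M[3], M[4]
m4 = 0.2027 - (M5 + 2 * M4) * (6 - 4) / 6   # s'(4) by (3.10), with f[4,6] = 0.2027
s5 = 1.3863 + m4 + M4 / 2 + (M5 - M4) / 12  # cubic on [4,6] evaluated at x = 5
print(s5, math.log(5))                      # s(5) close to ln 5
```

For large N, the same tridiagonal system would of course be solved in O(N) operations with the Thomas algorithm rather than full elimination.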


3.2 The minimum properties

Theorem 3.9 (Minimum bending energy). For any function g ∈ C²[a, b] that satisfies g′(a) = f′(a), g′(b) = f′(b), and g(x_i) = f(x_i) for each i = 1, 2, . . . , N, the complete cubic spline s = s(f; x) satisfies

  ∫_a^b [s″(x)]² dx ≤ ∫_a^b [g″(x)]² dx,  (3.16)

where the equality holds only when g(x) = s(f; x).

Proof. Define η(x) = g(x) − s(x). From the given conditions we have η ∈ C²[a, b], η′(a) = η′(b) = 0, and ∀i = 1, 2, . . . , N, η(x_i) = 0. Then

  ∫_a^b [g″(x)]² dx = ∫_a^b [s″(x) + η″(x)]² dx = ∫_a^b [s″(x)]² dx + ∫_a^b [η″(x)]² dx + 2∫_a^b s″(x)η″(x) dx.

From

  ∫_a^b s″(x)η″(x) dx = ∑_{i=1}^{N−1} ∫_{x_i}^{x_{i+1}} s″(x) dη′
   = ∑_{i=1}^{N−1} s″(x)η′(x)|_{x_i}^{x_{i+1}} − ∑_{i=1}^{N−1} ∫_{x_i}^{x_{i+1}} η′(x)s‴(x) dx
   = s″(b)η′(b) − s″(a)η′(a) − ∑_{i=1}^{N−1} ∫_{x_i}^{x_{i+1}} s‴(x) dη
   = −∑_{i=1}^{N−1} s‴(x)η(x)|_{x_i}^{x_{i+1}} + ∑_{i=1}^{N−1} ∫_{x_i}^{x_{i+1}} η(x)s^(4)(x) dx
   = 0,

we have

  ∫_a^b [g″(x)]² dx = ∫_a^b [s″(x)]² dx + ∫_a^b [η″(x)]² dx,

which completes the proof.

Theorem 3.10 (Minimum bending energy). For any function g ∈ C²[a, b] with g(x_i) = f(x_i) for each i = 1, 2, . . . , N, the natural cubic spline s = s(f; x) satisfies

  ∫_a^b [s″(x)]² dx ≤ ∫_a^b [g″(x)]² dx,  (3.17)

where the equality holds only when g(x) = s(f; x).

Proof. The proof is similar to that of Theorem 3.9. Although η′(a) = η′(b) = 0 does not hold, we do have s″(a) = s″(b) = 0.

Lemma 3.11. Suppose a C² function f : [a, b] → R is interpolated by a complete cubic spline or a cubic spline with specified second derivatives at its end points. Then

  ∀x ∈ [a, b],  |s″(x)| ≤ 3 max_{x∈[a,b]} |f″(x)|.  (3.18)

Proof. Since s″(x) is linear on [x_i, x_{i+1}], |s″(x)| attains its maximum at x_j for some j. If j = 2, . . . , N − 1, it follows from Lemma 3.4 and Corollary 2.22 that

  2M_j = 6 f[x_{j−1}, x_j, x_{j+1}] − µ_j M_{j−1} − λ_j M_{j+1}
  ⇒ 2|M_j| ≤ 6 |f[x_{j−1}, x_j, x_{j+1}]| + (µ_j + λ_j)|M_j|
  ⇒ ∃ξ ∈ [x_{j−1}, x_{j+1}] s.t. |M_j| ≤ 3 |f″(ξ)|
  ⇒ |s″(x)| ≤ 3 max_{x∈[a,b]} |f″(x)|.  (3.19)

If |s″(x)| attains its maximum at x_1 or x_N, (3.19) clearly holds at these end points for a cubic spline with specified second derivatives. After all, s″(a) = f″(a) and s″(b) = f″(b). As for the complete cubic spline, it suffices to prove (3.19) when |s″(x)| attains its maximum at x_1. Since the first derivative f′(a) = f[x_1, x_1] is specified, f[x_1, x_1, x_2] is a constant. By (3.12), we have

  2|M_1| ≤ 6 |f[x_1, x_1, x_2]| + |M_2| ≤ 6 |f[x_1, x_1, x_2]| + |M_1|,

which, together with Corollary 2.22, implies

  ∃ξ ∈ [x_1, x_2] s.t. |M_1| ≤ 3 |f″(ξ)|.

This completes the proof.

3.3 Error analysis

Theorem 3.12. Suppose a C⁴ function f : [a, b] → R is interpolated by a complete cubic spline or a cubic spline with specified second derivatives at its end points. Then

  ∀j = 0, 1, 2,  |f^(j)(x) − s^(j)(x)| ≤ c_j h^{4−j} max_{x∈[a,b]} |f^(4)(x)|,  (3.20)

where c_0 = 1/16, c_1 = c_2 = 1/2, and h = max_{i=1}^{N−1} |x_{i+1} − x_i|.

Proof. Our plan is to first prove the case of j = 2, then utilize the conclusion to study the other cases. Consider an auxiliary function ŝ ∈ C²[a, b] that satisfies

  ∀i = 1, 2, . . . , N − 1,  ŝ|[x_i, x_{i+1}] ∈ P_3,  ŝ″(x_i) = f″(x_i).

We can obtain such an ŝ by interpolating f″(x) with some s̃ ∈ S_1^0 and integrating s̃ twice. Then the theorem of the Cauchy remainder (Theorem 2.7) implies

  ∃ξ_i ∈ [x_i, x_{i+1}] s.t. ∀x ∈ [x_i, x_{i+1}],  |f″(x) − s̃(x)| ≤ (1/2) |f^(4)(ξ_i)| |(x − x_i)(x − x_{i+1})|,

hence we have

  |f″(x) − ŝ″(x)|_{x∈[x_i,x_{i+1}]} ≤ (1/8) max_{x∈[x_i,x_{i+1}]} |f^(4)(x)| (x_{i+1} − x_i)²

and thus

  |f″(x) − ŝ″(x)| ≤ (h²/8) max_{x∈[a,b]} |f^(4)(x)|.  (3.21)

Now consider interpolating f(x) − ŝ(x) with a cubic spline. Since ŝ(x) ∈ S_3^2, the interpolant must be s(x) − ŝ(x). Then Lemma 3.11 yields

  ∀x ∈ [a, b],  |s″(x) − ŝ″(x)| ≤ 3 max_{x∈[a,b]} |f″(x) − ŝ″(x)|,


which, together with (3.21), leads to (3.20) for j = 2:

  |f″(x) − s″(x)| ≤ |f″(x) − ŝ″(x)| + |ŝ″(x) − s″(x)| ≤ 4 max_{x∈[a,b]} |f″(x) − ŝ″(x)| ≤ (1/2) h² max_{x∈[a,b]} |f^(4)(x)|.  (3.22)

For j = 1, note that f(x) − s(x) = 0 for x = x_i, x_{i+1}. Then Rolle's theorem C.50 implies f′(ξ_i) − s′(ξ_i) = 0 for some ξ_i ∈ [x_i, x_{i+1}]. It follows from the second fundamental theorem of calculus (Theorem C.73) that

  ∀x ∈ [x_i, x_{i+1}],  f′(x) − s′(x) = ∫_{ξ_i}^x (f″(t) − s″(t)) dt,

which, together with the integral mean value theorem C.71 and (3.22), yields

  |f′(x) − s′(x)|_{x∈[x_i,x_{i+1}]} = |x − ξ_i| |f″(η_i) − s″(η_i)| ≤ (1/2) h³ max_{x∈[a,b]} |f^(4)(x)|.

This proves (3.20) for j = 1. Finally, consider interpolating f(x) − s(x) with some linear spline s̄ ∈ S_1^0. The interpolation conditions dictate ∀x ∈ [a, b], s̄(x) ≡ 0. Hence

  |f(x) − s(x)|_{x∈[x_i,x_{i+1}]} = |f(x) − s(x) − s̄(x)|_{x∈[x_i,x_{i+1}]} ≤ (1/8)(x_{i+1} − x_i)² max_{x∈[x_i,x_{i+1}]} |f″(x) − s″(x)| ≤ (1/16) h⁴ max_{x∈[a,b]} |f^(4)(x)|,

where the second step follows from Theorem 2.7 and the third step from (3.22).

Exercise 3.13. Verify Theorem 3.12 using the results in Example 3.8.

3.4 B-Splines

Notation 2. In the notation S_n^{n−1}(t_1, t_2, · · · , t_N), the t_i's in the parentheses represent knots of a spline. When there is no danger of ambiguity, we also use the shorthand notation S_{n,N}^{n−1} := S_n^{n−1}(t_1, t_2, · · · , t_N) or simply S_n^{n−1}.

Theorem 3.14. The set of splines S_n^{n−1}(t_1, t_2, · · · , t_N) is a linear space with dimension n + N − 1.

Proof. It is easy to verify from (3.2) and Definition B.2 that S_n^{n−1}(t_1, t_2, · · · , t_N) is indeed a linear space. Note that the additive identity is the zero function, not the number zero. One polynomial of degree n is determined by n + 1 coefficients. The N − 1 intervals lead to (N − 1)(n + 1) coefficients. At each of the N − 2 internal knots, the smoothness condition requires that the 0th, 1st, . . ., (n − 1)th derivatives of adjacent polynomials match. Hence the dimension is (N − 1)(n + 1) − n(N − 2) = n + N − 1.

Example 3.15. The cubic splines in Definition 3.5 have n = 3 and hence the dimension of S_3^2 is N + 2. Apart from the N interpolation conditions at the knots, we need to impose two other conditions at the ends of the interpolating interval to obtain a unique spline; this leads to the different types of cubic splines in Definition 3.5.

3.4.1 Truncated power functions

Definition 3.16. The truncated power function with exponent n is defined as

  x_+^n = { x^n if x ≥ 0;  0 if x < 0. }  (3.23)

Example 3.17. According to Definition 3.16, we have

  ∀t ∈ [a, b],  ∫_a^b (t − x)_+^n dx = ∫_a^t (t − x)^n dx = (t − a)^{n+1}/(n + 1).  (3.24)

Lemma 3.18. The following is a basis of S_n^{n−1}(t_1, . . . , t_N):

  1, x, x², . . . , x^n, (x − t_2)_+^n, (x − t_3)_+^n, . . . , (x − t_{N−1})_+^n.  (3.25)

Proof. ∀i = 2, 3, . . . , N − 1, (x − t_i)_+^n ∈ S_{n,N}^{n−1}. Also, ∀i = 0, 1, . . . , n, x^i ∈ S_{n,N}^{n−1}. Suppose

  ∑_{i=0}^n a_i x^i + ∑_{j=2}^{N−1} a_{n+j}(x − t_j)_+^n = 0(x).  (3.26)

To satisfy (3.26) for all x < t_2, a_i must be 0 for each i = 0, 1, · · · , n. To satisfy (3.26) for all x ∈ (t_2, t_3), a_{n+2} must be 0. Similarly, all the a_{n+j}'s must be zero. Hence the functions in (3.25) are linearly independent by Definition B.25. The proof is completed by Theorem 3.14, Lemma B.41, and the fact that there are n + N − 1 functions in (3.25).

Corollary 3.19. Any s ∈ S_{n,N}^{n−1} can be expressed as

  s(x) = ∑_{i=0}^n a_i (x − t_1)^i + ∑_{j=2}^{N−1} a_{n+j}(x − t_j)_+^n,  x ∈ [t_1, t_N].  (3.27)

Proof. By Lemma 3.18, it suffices to point out that

  span{1, x, . . . , x^n} = span{1, (x − t_1), . . . , (x − t_1)^n}.

Example 3.20. (3.27) with n = 1 is the linear spline interpolation. Imagine a plastic rod that is initially straight. Place one of its ends at (t_1, f_1) and let it go through (t_2, f_2). In general (t_3, f_3) will be off the rod, but we can bend the rod at (t_2, f_2) to make it go through (t_3, f_3). This "bending" process corresponds to adding the first truncated power function in (3.27).
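Example 3.20's "bending" interpretation of (3.27) with n = 1 can be made concrete: each truncated-power coefficient is exactly the change of slope at the corresponding knot. A minimal sketch with arbitrary sample data (all names are ours):

```python
# A broken line through (t_j, f_j), written in the truncated-power basis (3.27).
def pos(y):
    return y if y > 0 else 0.0          # y_+ , the truncated power with n = 1

t = [0.0, 1.0, 2.5, 3.0]
f = [1.0, 3.0, 2.0, 4.0]
slopes = [(f[j + 1] - f[j]) / (t[j + 1] - t[j]) for j in range(len(t) - 1)]
a = [f[0], slopes[0]] + [slopes[j] - slopes[j - 1] for j in range(1, len(slopes))]

def s(x):
    # s(x) = a_0 + a_1 (x - t_1) + sum_{j >= 2} a_{1+j} (x - t_j)_+ , cf. (3.27)
    return (a[0] + a[1] * (x - t[0])
            + sum(a[1 + j] * pos(x - t[j]) for j in range(1, len(slopes))))

for tj, fj in zip(t, f):
    assert abs(s(tj) - fj) < 1e-12      # the bent rod passes through every point
```

The drawback discussed next is visible here as well: every basis function (x − t_j)_+ is nonzero on the whole half-line to the right of its knot.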


3.4.2 The local support of B-splines

Definition 3.21. The hat function at t_i is

  B̂_i(x) = { (x − t_{i−1})/(t_i − t_{i−1}), x ∈ (t_{i−1}, t_i];
             (t_{i+1} − x)/(t_{i+1} − t_i), x ∈ (t_i, t_{i+1}];
             0, otherwise. }  (3.28)

Theorem 3.22. The hat functions form a basis of S_1^0.

Proof. By Definition 3.21, we have

  B̂_i(t_j) = { 1 if i = j;  0 if i ≠ j. }  (3.29)

Suppose ∑_{i=1}^N c_i B̂_i(x) = 0(x). Then we have c_i = 0 for each i = 1, 2, · · · , N by setting x = t_j and applying (3.29). Hence by Definition B.25 the hat functions are linearly independent. It suffices to show that span{B̂_1, B̂_2, . . . , B̂_N} = S_1^0, which is true because

  ∀s(x) ∈ S_1^0, ∃s_B(x) = ∑_{i=1}^N s(t_i) B̂_i(x) s.t. s(x) = s_B(x).

On each interval [t_i, t_{i+1}], (3.29) implies s_B(t_i) = s(t_i) and s_B(t_{i+1}) = s(t_{i+1}). Hence s_B(x) ≡ s(x) because they are both linear. Then Definition B.32 completes the proof.

Definition 3.23. B-splines are defined recursively by

  B_i^{n+1}(x) = (x − t_{i−1})/(t_{i+n} − t_{i−1}) · B_i^n(x) + (t_{i+n+1} − x)/(t_{i+n+1} − t_i) · B_{i+1}^n(x).  (3.30)

The recursion base is the B-spline of degree zero,

  B_i^0(x) = { 1 if x ∈ (t_{i−1}, t_i];  0 otherwise. }  (3.31)

Example 3.24. The hat functions in Definition 3.21 are clearly the B-splines of degree one:

  B_i^1 = B̂_i.  (3.32)

In (3.30), B-splines of higher degrees are defined by generalizing the idea of hat functions.

Example 3.25. The quadratic B-splines are

  B_i^2(x) = {
   (x − t_{i−1})²/((t_{i+1} − t_{i−1})(t_i − t_{i−1})), x ∈ (t_{i−1}, t_i];
   (x − t_{i−1})(t_{i+1} − x)/((t_{i+1} − t_{i−1})(t_{i+1} − t_i)) + (t_{i+2} − x)(x − t_i)/((t_{i+2} − t_i)(t_{i+1} − t_i)), x ∈ (t_i, t_{i+1}];
   (t_{i+2} − x)²/((t_{i+2} − t_i)(t_{i+2} − t_{i+1})), x ∈ (t_{i+1}, t_{i+2}];
   0, otherwise. }  (3.33)

Definition 3.26. The support of a function f : X → R is

  supp(f) = closure{x ∈ X | f(x) ≠ 0}.  (3.34)

Lemma 3.27 (Support of B-splines). For n ∈ N+, the interval of support of B_i^n is [t_{i−1}, t_{i+n}], where

  ∀x ∈ (t_{i−1}, t_{i+n}),  B_i^n(x) > 0.  (3.35)

Proof. This is an easy induction by (3.31) and (3.30).

Definition 3.28. Let X be a vector space. For each x ∈ X we associate a unique real (or complex) number L(x). If ∀x, y ∈ X and ∀α, β ∈ R (or C), we have

  L(αx + βy) = αL(x) + βL(y),  (3.36)

then L is called a linear functional over X.

Example 3.29. Let X = C[a, b]; then the elements of X are functions continuous over [a, b].

  L(f) = ∫_a^b f(x) dx,  L(f) = ∫_a^b x² f(x) dx

are both linear functionals over X.

Notation 3. We have used the notation f[x_0, . . . , x_k] for the kth divided difference of f, in line with considering f[x_0, . . . , x_k] as a generalization of the Taylor expansion. Hereafter, for analyzing B-splines, it is both semantically and syntactically better to use the notation [x_0, . . . , x_k]f, in line with considering the procedures of a divided difference as a linear functional over C[x_0, x_k].

Theorem 3.30 (Leibniz formula). For k ∈ N, the kth divided difference of a product of two functions satisfies

  [x_0, . . . , x_k]fg = ∑_{i=0}^k [x_0, . . . , x_i]f · [x_i, . . . , x_k]g.  (3.37)

Proof. The induction basis k = 0 holds because (3.37) reduces to [x_0]fg = f(x_0)g(x_0). Now suppose (3.37) holds. For the induction step, we have from Theorem 2.17 that

  [x_0, . . . , x_{k+1}]fg = ([x_1, . . . , x_{k+1}]fg − [x_0, . . . , x_k]fg)/(x_{k+1} − x_0).

By the induction hypothesis, we have

  [x_1, . . . , x_{k+1}]fg = ∑_{i=0}^k [x_1, . . . , x_{i+1}]f · [x_{i+1}, . . . , x_{k+1}]g
   = S_1 + ∑_{i=0}^k [x_0, . . . , x_i]f · [x_{i+1}, . . . , x_{k+1}]g,  where
  S_1 = ∑_{i=0}^k (x_{i+1} − x_0) · [x_0, . . . , x_{i+1}]f · [x_{i+1}, . . . , x_{k+1}]g
      = ∑_{i=1}^{k+1} (x_i − x_0) · [x_0, . . . , x_i]f · [x_i, . . . , x_{k+1}]g;

  [x_0, . . . , x_k]fg = ∑_{i=0}^k [x_0, . . . , x_i]f · [x_i, . . . , x_k]g
   = −S_2 + ∑_{i=0}^k [x_0, . . . , x_i]f · [x_{i+1}, . . . , x_{k+1}]g,  where
  S_2 = ∑_{i=0}^k [x_0, . . . , x_i]f · (x_{k+1} − x_i) · [x_i, . . . , x_{k+1}]g.


In the above derivation, we have applied Theorem 2.17 to go from the kth divided difference to the (k+1)th. Then

    [x_0, ..., x_{k+1}]fg = (S_1 + S_2)/(x_{k+1} − x_0)
                          = Σ_{i=0}^{k+1} [x_0, ..., x_i]f · [x_i, ..., x_{k+1}]g,

which completes the inductive proof.

Example 3.31. There exists a relation between B-splines and truncated power functions, e.g.,

    (t_{i+1} − t_{i−1}) [t_{i−1}, t_i, t_{i+1}](t − x)_+
      = [t_i, t_{i+1}](t − x)_+ − [t_{i−1}, t_i](t − x)_+
      = ((t_{i+1} − x)_+ − (t_i − x)_+)/(t_{i+1} − t_i) − ((t_i − x)_+ − (t_{i−1} − x)_+)/(t_i − t_{i−1})
      = B_i^1(x) = { (x − t_{i−1})/(t_i − t_{i−1})    if x ∈ (t_{i−1}, t_i],
                     (t_{i+1} − x)/(t_{i+1} − t_i)    if x ∈ (t_i, t_{i+1}],
                     0                                otherwise.

[Figures omitted: the truncated power functions on the knots t_{i−1}, t_i, t_{i+1} and the resulting hat function, illustrating the algebra above.]

The significance is that, by applying divided differences to truncated power functions, we can "cure" their drawback of non-local support. This idea is made precise in the next theorem.

Theorem 3.32 (B-splines as divided differences of truncated power functions). For any n ∈ N, we have

    B_i^n(x) = (t_{i+n} − t_{i−1}) · [t_{i−1}, ..., t_{i+n}](t − x)_+^n.    (3.38)

Proof. For n = 0, (3.38) reduces to

    B_i^0(x) = (t_i − t_{i−1}) · [t_{i−1}, t_i](t − x)_+^0
             = (t_i − x)_+^0 − (t_{i−1} − x)_+^0
             = { 0  if x ∈ (−∞, t_{i−1}],
                 1  if x ∈ (t_{i−1}, t_i],
                 0  if x ∈ (t_i, +∞),

which is the same as (3.31). Hence the induction basis holds.

Now assume the induction hypothesis (3.38) holds. By Definition 3.16, (t − x)_+^{n+1} = (t − x)(t − x)_+^n. Then applying the Leibniz formula (Theorem 3.30) with f = t − x and g = (t − x)_+^n yields

    [t_{i−1}, ..., t_{i+n}](t − x)_+^{n+1}
      = (t_{i−1} − x) · [t_{i−1}, ..., t_{i+n}](t − x)_+^n + [t_i, ..., t_{i+n}](t − x)_+^n.    (3.39)

Definition 3.23 and the induction hypothesis yield B_i^{n+1}(x) = β(x) + γ(x), with

    β(x) = (x − t_{i−1})/(t_{i+n} − t_{i−1}) · B_i^n(x)
         = (x − t_{i−1}) · [t_{i−1}, ..., t_{i+n}](t − x)_+^n
         = [t_i, ..., t_{i+n}](t − x)_+^n − [t_{i−1}, ..., t_{i+n}](t − x)_+^{n+1},

where the last step follows from (3.39). Similarly,

    γ(x) = (t_{i+n+1} − x)/(t_{i+n+1} − t_i) · B_{i+1}^n(x)
         = (t_{i+n+1} − x) · [t_i, ..., t_{i+n+1}](t − x)_+^n
         = (t_{i+n+1} − t_i) · [t_i, ..., t_{i+n+1}](t − x)_+^n + (t_i − x) · [t_i, ..., t_{i+n+1}](t − x)_+^n
         = [t_{i+1}, ..., t_{i+n+1}](t − x)_+^n − [t_i, ..., t_{i+n}](t − x)_+^n
           + [t_i, ..., t_{i+n+1}](t − x)_+^{n+1} − [t_{i+1}, ..., t_{i+n+1}](t − x)_+^n
         = [t_i, ..., t_{i+n+1}](t − x)_+^{n+1} − [t_i, ..., t_{i+n}](t − x)_+^n,

where the second-to-last step follows from Theorem 2.17 and (3.39). The above arguments yield

    B_i^{n+1}(x) = [t_i, ..., t_{i+n+1}](t − x)_+^{n+1} − [t_{i−1}, ..., t_{i+n}](t − x)_+^{n+1}
                 = (t_{i+n+1} − t_{i−1}) · [t_{i−1}, ..., t_{i+n+1}](t − x)_+^{n+1},

which completes the inductive proof.

3.4.3 Integrals and derivatives

Corollary 3.33 (Integrals of B-splines). The average of a B-spline over its support depends only on its degree,

    1/(t_{i+n} − t_{i−1}) ∫_{t_{i−1}}^{t_{i+n}} B_i^n(x) dx = 1/(n + 1).    (3.40)

Proof. The left-hand side (LHS) of (3.40) is

    1/(t_{i+n} − t_{i−1}) ∫_{t_{i−1}}^{t_{i+n}} B_i^n(x) dx
      = ∫_{t_{i−1}}^{t_{i+n}} [t_{i−1}, ..., t_{i+n}](t − x)_+^n dx
      = [t_{i−1}, ..., t_{i+n}] ∫_{t_{i−1}}^{t_{i+n}} (t − x)_+^n dx
      = [t_{i−1}, ..., t_{i+n}] (t − t_{i−1})^{n+1}/(n + 1)
      = 1/(n + 1),

where the first step follows from Theorem 3.32, the second step from the commutativity of integration and taking divided differences, the third step from (3.24), and the last step from Corollary 2.22.
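Corollary 3.33 can be spot-checked numerically on a non-uniform knot set. The sketch below is ours, not the notes': `bspline` implements the recursion (3.30) with B_i^0 taken as the indicator of (t_{i−1}, t_i], and `average_over_support` approximates the left-hand side of (3.40) with a plain composite midpoint rule.

```python
def bspline(t, i, n, x):
    """B_i^n(x) on the knot sequence t, by the recursion (3.30);
    B_i^0 is the indicator of (t[i-1], t[i]]."""
    if n == 0:
        return 1.0 if t[i - 1] < x <= t[i] else 0.0
    return ((x - t[i - 1]) / (t[i + n - 1] - t[i - 1]) * bspline(t, i, n - 1, x)
            + (t[i + n] - x) / (t[i + n] - t[i]) * bspline(t, i + 1, n - 1, x))

def average_over_support(t, i, n, m=20000):
    """Composite-midpoint approximation of the LHS of (3.40)."""
    lo, hi = t[i - 1], t[i + n]
    h = (hi - lo) / m
    integral = sum(bspline(t, i, n, lo + (k + 0.5) * h) for k in range(m)) * h
    return integral / (hi - lo)

knots = [0.0, 0.7, 1.0, 2.3, 3.1, 4.0, 5.5]   # deliberately non-uniform
for n in (1, 2, 3):
    # the computed average should approach 1/(n+1) regardless of the spacing
    print(n, average_over_support(knots, 1, n), 1.0 / (n + 1))
```

The agreement with 1/(n + 1) for several degrees on the same uneven knots is exactly the content of the corollary.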


Theorem 3.34 (Derivatives of B-splines). For n ≥ 2 we have, ∀x ∈ R,

    d/dx B_i^n(x) = n B_i^{n−1}(x)/(t_{i+n−1} − t_{i−1}) − n B_{i+1}^{n−1}(x)/(t_{i+n} − t_i).    (3.41)

For n = 1, (3.41) holds for all x except at the three knots t_{i−1}, t_i, and t_{i+1}, where the derivative of B_i^1 is not defined.

Proof. We first show that (3.41) holds for all x except at the knots t_j. By (3.32), (3.28), and (3.31), we have

    ∀x ∈ R \ {t_{i−1}, t_i, t_{i+1}},
    d/dx B_i^1(x) = 1/(t_i − t_{i−1}) B_i^0(x) − 1/(t_{i+1} − t_i) B_{i+1}^0(x).

Hence the induction basis holds. Now suppose (3.41) holds ∀x ∈ R \ {t_{i−1}, ..., t_{i+n}}. Differentiate (3.30), apply the induction hypothesis (3.41), and we have

    d/dx B_i^{n+1}(x) = B_i^n(x)/(t_{i+n} − t_{i−1}) − B_{i+1}^n(x)/(t_{i+n+1} − t_i) + nC(x),    (3.42)

where C(x) is

    C(x) = (x − t_{i−1})/(t_{i+n} − t_{i−1}) [ B_i^{n−1}(x)/(t_{i+n−1} − t_{i−1}) − B_{i+1}^{n−1}(x)/(t_{i+n} − t_i) ]
         + (t_{i+n+1} − x)/(t_{i+n+1} − t_i) [ B_{i+1}^{n−1}(x)/(t_{i+n} − t_i) − B_{i+2}^{n−1}(x)/(t_{i+n+1} − t_{i+1}) ]
         = 1/(t_{i+n} − t_{i−1}) [ (x − t_{i−1}) B_i^{n−1}(x)/(t_{i+n−1} − t_{i−1}) + (t_{i+n} − x) B_{i+1}^{n−1}(x)/(t_{i+n} − t_i) ]
         − 1/(t_{i+n+1} − t_i) [ (x − t_i) B_{i+1}^{n−1}(x)/(t_{i+n} − t_i) + (t_{i+n+1} − x) B_{i+2}^{n−1}(x)/(t_{i+n+1} − t_{i+1}) ]
         = B_i^n(x)/(t_{i+n} − t_{i−1}) − B_{i+1}^n(x)/(t_{i+n+1} − t_i),

where the last step follows from (3.30). Then (3.42) can be written as

    d/dx B_i^{n+1}(x) = (n + 1) B_i^n(x)/(t_{i+n} − t_{i−1}) − (n + 1) B_{i+1}^n(x)/(t_{i+n+1} − t_i),

which completes the inductive proof of (3.41) except at the knots. Since B_i^1 = B̂_i is continuous, an easy induction with (3.30) shows that B_i^n is continuous for all n ≥ 1. Hence the right-hand side of (3.41) is continuous for all n ≥ 2. Therefore, if n ≥ 2, d/dx B_i^n(x) exists for all x ∈ R. This completes the proof.

Corollary 3.35 (Smoothness of B-splines). B_i^n ∈ S_n^{n−1}.

Proof. For n = 1, the induction basis B_i^1(x) ∈ S_1^0 holds because of (3.32). The rest of the proof follows from (3.30) and Theorem 3.34 via an easy induction.

3.4.4 Marsden's identity

Theorem 3.36 (Marsden's identity). For any n ∈ N,

    (t − x)^n = Σ_{i=−∞}^{+∞} (t − t_i) ··· (t − t_{i+n−1}) B_i^n(x),    (3.43)

where the product (t − t_i) ··· (t − t_{i+n−1}) is defined as 1 for n = 0.

Proof. For n = 0, (3.43) follows from Definition 3.23. Now suppose (3.43) holds. A linear interpolation of the linear function f(t) = t − x is the function itself,

    t − x = (t − t_{i+n})/(t_{i−1} − t_{i+n}) (t_{i−1} − x) + (t − t_{i−1})/(t_{i+n} − t_{i−1}) (t_{i+n} − x).    (3.44)

Hence for the inductive step we have

    (t − x)^{n+1} = (t − x) Σ_{i=−∞}^{+∞} (t − t_i) ··· (t − t_{i+n−1}) B_i^n(x)
      = Σ_{i=−∞}^{+∞} (t − t_i) ··· (t − t_{i+n}) (t_{i−1} − x)/(t_{i−1} − t_{i+n}) B_i^n(x)
        + Σ_{i=−∞}^{+∞} (t − t_{i−1}) ··· (t − t_{i+n−1}) (t_{i+n} − x)/(t_{i+n} − t_{i−1}) B_i^n(x)
      = Σ_{i=−∞}^{+∞} (t − t_i) ··· (t − t_{i+n}) (x − t_{i−1})/(t_{i+n} − t_{i−1}) B_i^n(x)
        + Σ_{i=−∞}^{+∞} (t − t_i) ··· (t − t_{i+n}) (t_{i+n+1} − x)/(t_{i+n+1} − t_i) B_{i+1}^n(x)
      = Σ_{i=−∞}^{+∞} (t − t_i) ··· (t − t_{i+n}) B_i^{n+1}(x),

where the first step follows from the induction hypothesis, the second step from (3.44), the third step from replacing i with i + 1 in the second summation, and the last step from (3.30).

Corollary 3.37 (Truncated power functions as linear combinations of B-splines). For any j ∈ Z and n ∈ N,

    (t_j − x)_+^n = Σ_{i=−∞}^{j−n} (t_j − t_i) ··· (t_j − t_{i+n−1}) B_i^n(x).    (3.45)

Proof. We need to show that the RHS is (t_j − x)^n if x ≤ t_j and 0 otherwise. Set t = t_j in (3.43) and we have

    (t_j − x)^n = Σ_{i=−∞}^{+∞} (t_j − t_i) ··· (t_j − t_{i+n−1}) B_i^n(x).

For each i = j − n + 1, ..., j, the corresponding term in the summation is zero regardless of x; for each i ≥ j + 1, Lemma 3.27 implies that B_i^n(x) = 0 for all x ≤ t_j. Hence

    x ≤ t_j  ⇒  Σ_{i=−∞}^{j−n} (t_j − t_i) ··· (t_j − t_{i+n−1}) B_i^n(x) = (t_j − x)^n.

Otherwise x > t_j; then Lemma 3.27 implies B_i^n(x) = 0 for each i ≤ j − n. This completes the proof.
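Theorem 3.34 lends itself to a quick sanity check with a central finite difference at a non-knot point. The sketch below assumes the recursion (3.30); the name `bspline` is ours.

```python
def bspline(t, i, n, x):
    """B_i^n(x) by the recursion (3.30); B_i^0 is the indicator of (t[i-1], t[i]]."""
    if n == 0:
        return 1.0 if t[i - 1] < x <= t[i] else 0.0
    return ((x - t[i - 1]) / (t[i + n - 1] - t[i - 1]) * bspline(t, i, n - 1, x)
            + (t[i + n] - x) / (t[i + n] - t[i]) * bspline(t, i + 1, n - 1, x))

t = [0.0, 0.8, 1.5, 2.1, 3.0, 4.2]
i, n, x, h = 1, 3, 1.9, 1e-6

# second-order central difference of B_i^n at x
numeric = (bspline(t, i, n, x + h) - bspline(t, i, n, x - h)) / (2 * h)
# the right-hand side of (3.41)
exact = (n * bspline(t, i, n - 1, x) / (t[i + n - 1] - t[i - 1])
         - n * bspline(t, i + 1, n - 1, x) / (t[i + n] - t[i]))
print(abs(numeric - exact))   # should be tiny
```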


3.4.5 Symmetric polynomials

Definition 3.38. The elementary symmetric polynomial of degree k in n variables is the sum of all products of k distinct variables chosen from the n variables,

    σ_k(x_1, ..., x_n) = Σ_{1 ≤ i_1 < ··· < i_k ≤ n} x_{i_1} x_{i_2} ··· x_{i_k}.    (3.46)

In particular, σ_0(x_1, ..., x_n) = 1 and

    ∀k > n,  σ_k(x_1, ..., x_n) = 0.

If the distinctness condition is dropped, we have the complete symmetric polynomial of degree k in n variables,

    τ_k(x_1, ..., x_n) = Σ_{1 ≤ i_1 ≤ ··· ≤ i_k ≤ n} x_{i_1} x_{i_2} ··· x_{i_k}.    (3.47)

Example 3.39. σ_2(x_1, x_2, x_3) = x_1 x_2 + x_1 x_3 + x_2 x_3. In comparison, τ_2(x_1, x_2, x_3) = σ_2(x_1, x_2, x_3) + x_1^2 + x_2^2 + x_3^2.

Lemma 3.40. For k ≤ n, the elementary symmetric polynomials satisfy a recursion,

    σ_{k+1}(x_1, ..., x_n, x_{n+1}) = σ_{k+1}(x_1, ..., x_n) + x_{n+1} σ_k(x_1, ..., x_n).    (3.48)

Proof. The terms in σ_{k+1}(x_1, ..., x_n, x_{n+1}) can be sorted into two groups: (a) those that contain the factor x_{n+1} and (b) those that do not. By the symmetry in (3.46), group (a) must be x_{n+1} σ_k(x_1, ..., x_n) and group (b) must be σ_{k+1}(x_1, ..., x_n).

Example 3.41. σ_2(x_1, x_2, x_3) = x_1 x_2 + x_3(x_1 + x_2).

Definition 3.42. The generating function for the elementary symmetric polynomials is

    g_{σ,n}(z) = Π_{i=1}^n (1 + x_i z) = (1 + x_1 z) ··· (1 + x_n z),    (3.49)

while that for the complete symmetric polynomials is

    g_{τ,n}(z) = Π_{i=1}^n 1/(1 − x_i z) = 1/(1 − x_1 z) ··· 1/(1 − x_n z).    (3.50)

Lemma 3.43 (Generating elementary and complete symmetric polynomials). The elementary and complete symmetric polynomials are related to their generating functions as

    g_{σ,n}(z) = Σ_{k=0}^n σ_k(x_1, ..., x_n) z^k,    (3.51)

    g_{τ,n}(z) = Σ_{k=0}^{+∞} τ_k(x_1, ..., x_n) z^k.    (3.52)

Proof. With Lemma 3.40, we can prove (3.51) by an easy induction. For (3.52), (3.50) and the identity

    1/(1 − x) = Σ_{k=0}^{+∞} x^k    (3.53)

yield

    g_{τ,n}(z) = Π_{i=1}^n Σ_{k=0}^{+∞} x_i^k z^k
               = (1 + x_1 z + x_1^2 z^2 + ···)(1 + x_2 z + x_2^2 z^2 + ···) ··· (1 + x_n z + x_n^2 z^2 + ···).

The coefficient of the monomial z^k is the sum of all possible products of k variables from x_1, x_2, ..., x_n. Definition 3.38 then completes the proof.

Example 3.44.

    (1 + x_1 z)(1 + x_2 z)(1 + x_3 z)
      = 1 + (x_1 + x_2 + x_3) z + (x_1 x_2 + x_1 x_3 + x_2 x_3) z^2 + x_1 x_2 x_3 z^3.

Lemma 3.45 (Recursive relation of complete symmetric polynomials). The complete symmetric polynomials satisfy a recursion,

    τ_{k+1}(x_1, ..., x_n, x_{n+1}) = τ_{k+1}(x_1, ..., x_n) + x_{n+1} τ_k(x_1, ..., x_n, x_{n+1}).    (3.54)

Proof. (3.50) implies

    g_{τ,n+1} = g_{τ,n} + x_{n+1} z g_{τ,n+1}.    (3.55)

The proof is completed by requiring that the coefficient of z^{k+1} on the LHS equal that of z^{k+1} on the RHS.

Theorem 3.46 (Complete symmetric polynomials as divided differences of monomials). The complete symmetric polynomial of degree m − n in n + 1 variables is the nth divided difference of the monomial x^m, i.e.

    ∀m ∈ N^+, ∀i ∈ N, ∀n = 0, 1, ..., m,
    τ_{m−n}(x_i, ..., x_{i+n}) = [x_i, ..., x_{i+n}] x^m.    (3.56)

Proof. By Lemma 3.45, we have

    (x_{n+1} − x_1) τ_k(x_1, ..., x_n, x_{n+1})
      = τ_{k+1}(x_1, ..., x_n, x_{n+1}) − τ_{k+1}(x_1, ..., x_n) − x_1 τ_k(x_1, ..., x_n, x_{n+1})
      = τ_{k+1}(x_2, ..., x_n, x_{n+1}) + x_1 τ_k(x_1, ..., x_n, x_{n+1})
        − τ_{k+1}(x_1, ..., x_n) − x_1 τ_k(x_1, ..., x_n, x_{n+1})
      = τ_{k+1}(x_2, ..., x_n, x_{n+1}) − τ_{k+1}(x_1, ..., x_n).    (3.57)

The rest of the proof is an induction on n. For n = 0, (3.56) reduces to

    τ_m(x_i) = [x_i] x^m,

which is trivially true. Now suppose (3.56) holds for a nonnegative integer n < m. Then (3.57) and the induction hypothesis yield

    τ_{m−n−1}(x_i, ..., x_{i+n+1})
      = (τ_{m−n}(x_{i+1}, ..., x_{i+n+1}) − τ_{m−n}(x_i, ..., x_{i+n})) / (x_{i+n+1} − x_i)
      = ([x_{i+1}, ..., x_{i+n+1}] x^m − [x_i, ..., x_{i+n}] x^m) / (x_{i+n+1} − x_i)
      = [x_i, ..., x_{i+n+1}] x^m,

which completes the proof.
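Definition 3.38 and Theorem 3.46 are easy to check by brute force with `itertools`; the function names `sigma`, `tau`, and `divided_difference` below are ours. The second check mirrors Problem VIII(a) with m = 4 and n = 2, here at the sample points (2, 3, 5).

```python
from itertools import combinations, combinations_with_replacement
from math import prod

def sigma(k, xs):
    """Elementary symmetric polynomial (3.46): distinct factors only."""
    return sum(prod(c) for c in combinations(xs, k))

def tau(k, xs):
    """Complete symmetric polynomial (3.47): repetitions allowed."""
    return sum(prod(c) for c in combinations_with_replacement(xs, k))

def divided_difference(xs, f):
    """[x_0, ..., x_n]f by the recursive definition (Theorem 2.17)."""
    if len(xs) == 1:
        return f(xs[0])
    return ((divided_difference(xs[1:], f) - divided_difference(xs[:-1], f))
            / (xs[-1] - xs[0]))

print(sigma(2, (1, 2, 3)), tau(2, (1, 2, 3)))   # 11 25, cf. Example 3.39
# Theorem 3.46 with m = 4, n = 2: both prints equal 69
print(tau(2, (2, 3, 5)), divided_difference((2, 3, 5), lambda x: x ** 4))
```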


3.4.6 B-splines indeed form a basis

Theorem 3.47. Given any k ∈ N, the monomial x^k can be expressed as a linear combination of B-splines for any fixed n ≥ k, in the form

    C(n, k) x^k = Σ_{i=−∞}^{+∞} σ_k(t_i, ..., t_{i+n−1}) B_i^n(x),    (3.58)

where σ_k(t_i, ..., t_{i+n−1}) is the elementary symmetric polynomial of degree k in the n variables t_i, ..., t_{i+n−1}.

Proof. Lemma 3.43 yields

    (1 + t_i x) ··· (1 + t_{i+n−1} x) = Σ_{k=0}^n σ_k(t_i, ..., t_{i+n−1}) x^k.

Replace x with −1/t, multiply both sides with t^n, and we have

    (t − t_i) ··· (t − t_{i+n−1}) = Σ_{k=0}^n σ_k(t_i, ..., t_{i+n−1}) (−1)^k t^{n−k}.

Substituting the above into (3.43) yields

    (t − x)^n = Σ_{i=−∞}^{+∞} Σ_{k=0}^n σ_k(t_i, ..., t_{i+n−1}) (−1)^k t^{n−k} B_i^n(x)
              = Σ_{k=0}^n t^{n−k} { (−1)^k Σ_{i=−∞}^{+∞} σ_k(t_i, ..., t_{i+n−1}) B_i^n(x) }.

On the other hand, the binomial theorem states that

    (t − x)^n = Σ_{k=0}^n C(n, k) t^{n−k} (−x)^k = Σ_{k=0}^n t^{n−k} (−1)^k C(n, k) x^k.

Comparing the last two equations completes the proof.

Corollary 3.48 (Partition of unity).

    ∀n ∈ N,  Σ_{i=−∞}^{+∞} B_i^n = 1.    (3.59)

Proof. Setting k = 0 in Theorem 3.47 yields (3.59).

Theorem 3.49. The following list of B-splines is a basis of S_n^{n−1}(t_1, t_2, ..., t_N):

    B_{2−n}^n(x), B_{3−n}^n(x), ..., B_N^n(x).    (3.60)

Proof. It is easy to verify that

    ∀t_i ∈ R,  (x − t_i)_+^n = (x − t_i)^n − (−1)^n (t_i − x)_+^n.    (3.61)

Then it follows from Theorem 3.36 and Corollary 3.37 that each truncated power function (x − t_i)_+^n can be expressed as a linear combination of B-splines. By Lemma 3.18, each element in S_n^{n−1}(t_1, t_2, ..., t_N) can be expressed as a linear combination of

    1, x, x^2, ..., x^n, (x − t_2)_+^n, (x − t_3)_+^n, ..., (x − t_{N−1})_+^n.

Theorem 3.47 states that each monomial x^j can also be expressed as a linear combination of B-splines. Since the domain is restricted to [t_1, t_N], we know from Lemma 3.27 that only those B-splines in the list (3.60) appear in the linear combination. Therefore, these B-splines form a spanning list of S_n^{n−1}(t_1, t_2, ..., t_N). The proof is completed by Lemma B.40, Theorem 3.14, and the fact that the length of the list (3.60) is also n + N − 1.

3.4.7 Cardinal B-splines

Definition 3.50. The cardinal B-spline of degree n, denoted by B_{i,Z}^n, is the B-spline in Definition 3.23 on the knot set Z.

Corollary 3.51. Cardinal B-splines of the same degree are translates of one another, i.e.

    ∀x ∈ R,  B_{i,Z}^n(x) = B_{i+1,Z}^n(x + 1).    (3.62)

Proof. The recurrence relation (3.30) reduces to

    B_{i,Z}^{n+1}(x) = (x − i + 1)/(n + 1) B_{i,Z}^n(x) + (i + n + 1 − x)/(n + 1) B_{i+1,Z}^n(x).    (3.63)

The rest of the proof is an easy induction on n.

Corollary 3.52. A cardinal B-spline is symmetric about the center of its interval of support, i.e.

    ∀n > 0, ∀x ∈ R,  B_{i,Z}^n(x) = B_{i,Z}^n(2i + n − 1 − x).    (3.64)

Proof. The proof is similar to that of Corollary 3.51.

Example 3.53. For t_i = i, the quadratic B-spline in Example 3.25 simplifies to

    B_{i,Z}^2(x) = { (x − i + 1)^2 / 2          if x ∈ (i − 1, i];
                     3/4 − (x − (i + 1/2))^2    if x ∈ (i, i + 1];
                     (i + 2 − x)^2 / 2          if x ∈ (i + 1, i + 2];
                     0                          otherwise.    (3.65)

It is straightforward to verify Corollaries 3.51 and 3.52. It also follows from (3.65) that

    B_{i,Z}^2(j) = { 1/2  if j ∈ {i, i + 1};
                     0    if j ∈ Z \ {i, i + 1}.    (3.66)

Example 3.54. For t_i = i, the cubic cardinal B-spline is

    B_{i,Z}^3(x) = { (x − i + 1)^3 / 6                        if x ∈ (i − 1, i];
                     2/3 − (1/2)(x − i + 1)(i + 1 − x)^2      if x ∈ (i, i + 1];
                     B_{i,Z}^3(2i + 2 − x)                    if x ∈ (i + 1, i + 3);
                     0                                        otherwise.    (3.67)

It follows that

    B_{i,Z}^3(j) = { 1/6  if j ∈ {i, i + 2};
                     2/3  if j = i + 1;
                     0    if j ∈ Z \ {i, i + 1, i + 2}.    (3.68)

This illustrates Corollary 3.51: cardinal B-splines have the same shape, i.e., they are invariant under integer translations.
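The integer-point values (3.66) and (3.68), and the partition of unity (3.59), can all be verified in a few lines; the sketch below is ours and uses the cardinal recursion (3.63), with B_{i,Z}^0 taken as the indicator of (i − 1, i].

```python
def B(i, n, x):
    """Cardinal B-spline B_{i,Z}^n(x) by the recursion (3.63);
    B_{i,Z}^0 is the indicator of (i - 1, i]."""
    if n == 0:
        return 1.0 if i - 1 < x <= i else 0.0
    return ((x - i + 1) * B(i, n - 1, x) + (i + n - x) * B(i + 1, n - 1, x)) / n

print(B(0, 2, 0), B(0, 2, 1))              # 0.5 0.5, cf. (3.66)
print(B(0, 3, 0), B(0, 3, 1), B(0, 3, 2))  # 1/6, 2/3, 1/6, cf. (3.68)
# Corollary 3.48 at an arbitrary point: the local sum is ~1
print(sum(B(i, 3, 0.37) for i in range(-3, 2)))
```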


Theorem 3.55. The cardinal B-spline of degree n can be explicitly expressed as

    B_{i,Z}^n(x) = (1/n!) Σ_{k=−1}^n (−1)^{n−k} C(n+1, k+1) (k + i − x)_+^n.    (3.69)

Proof. Theorems 3.32, 2.29, and 2.28 yield

    B_{i,Z}^n(x) = (n + 1) [i − 1, ..., i + n](t − x)_+^n
                 = ((n + 1)/(n + 1)!) Δ^{n+1} (i − 1 − x)_+^n
                 = (1/n!) Σ_{k=0}^{n+1} (−1)^{n+1−k} C(n+1, k) (i − 1 + k − x)_+^n.

Replacing k with k + 1 and accordingly changing the summation bounds completes the proof.

Corollary 3.56. The value of a cardinal B-spline at an integer j is

    B_{i,Z}^n(j) = (1/n!) Σ_{k=j−i+1}^n (−1)^{n−k} C(n+1, k+1) (k + i − j)^n    (3.70)

for j ∈ [i, n + i) and is zero otherwise.

Proof. This follows directly from Theorem 3.55 and Definition 3.16.

Theorem 3.57 (Unique interpolation by complete cubic cardinal B-splines). There is a unique B-spline S(x) ∈ S_3^2 that interpolates f(x) at 1, 2, ..., N with S'(1) = f'(1) and S'(N) = f'(N). Furthermore, this B-spline is

    S(x) = Σ_{i=−1}^N a_i B_{i,Z}^3(x),    (3.71)

where

    a_{−1} = a_1 − 2f'(1),   a_N = a_{N−2} + 2f'(N),    (3.72)

and a^T = [a_0, ..., a_{N−1}] is the solution of the linear system M a = b with

    b^T = [3f(1) + f'(1), 6f(2), ..., 6f(N − 1), 3f(N) − f'(N)],

    M = [ 2 1
          1 4 1
            . . .
              1 4 1
                1 2 ].

Proof. By Theorem 3.49 and Lemma 3.27, we have, at each interpolation site i = 1, 2, ..., N,

    f(i) = a_{i−2} B_{i−2,Z}^3(i) + a_{i−1} B_{i−1,Z}^3(i) + a_i B_{i,Z}^3(i).

Then (3.68) yields

    ∀i = 1, 2, ..., N,  a_{i−2} + 4a_{i−1} + a_i = 6f(i),    (3.73)

which proves the middle N − 2 equations of M a = b. By Theorem 3.34, we have

    d/dx B_{i,Z}^n(x) = B_{i,Z}^{n−1}(x) − B_{i+1,Z}^{n−1}(x).    (3.74)

Differentiate (3.71), apply (3.74), set x = 1, apply (3.66), and we have the first identity in (3.72), which, together with (3.73), yields

    2a_0 + a_1 = f'(1) + 3f(1);

this proves the first equation of M a = b. The last equation of M a = b and the second identity in (3.72) can be shown similarly. The strict diagonal dominance of M implies a nonzero determinant of M and therefore a is uniquely determined. The uniqueness of S(x) then follows from (3.72).

Theorem 3.58. There is a unique B-spline S(x) ∈ S_2^1 that interpolates f(x) at t_i = i + 1/2 for each i = 1, 2, ..., N − 1 with end conditions S(1) = f(1) and S(N) = f(N). Furthermore, this B-spline is

    S(x) = Σ_{i=0}^N a_i B_{i,Z}^2(x),    (3.75)

where

    a_0 = 2f(1) − a_1,   a_N = 2f(N) − a_{N−1},    (3.76)

and a^T = [a_1, ..., a_{N−1}] is the solution of the linear system M a = b with

    b^T = [8f(3/2) − 2f(1), 8f(5/2), ..., 8f(N − 3/2), 8f(N − 1/2) − 2f(N)],

    M = [ 5 1
          1 6 1
            . . .
              1 6 1
                1 5 ].

Proof. It follows from Lemma 3.27 and Definition 3.50 that there are three quadratic cardinal B-splines, namely B_{i−1,Z}^2, B_{i,Z}^2, and B_{i+1,Z}^2, that have nonzero values at each interpolation site t_i = i + 1/2. Hence we have

    f(t_i) = a_{i−1} B_{i−1,Z}^2(t_i) + a_i B_{i,Z}^2(t_i) + a_{i+1} B_{i+1,Z}^2(t_i).    (3.77)

Hence the dimension of the space of relevant cardinal B-splines is N − 1 + 2 = N + 1, which is different from that in the proof of Theorem 3.49! By Theorem 3.55, we can calculate the values of the B-splines:

    B_{0,Z}^2(x) = (1/2) Σ_{k=−1}^2 (−1)^{2−k} C(3, k+1) (k − x)_+^2,
    B_{0,Z}^2(1/2) = 3/4,
    B_{0,Z}^2(−1/2) = B_{0,Z}^2(3/2) = 1/8,


where for B_{0,Z}^2(−1/2) we have used Corollary 3.52. Then Corollary 3.51 and (3.77) yield

    a_{i−1} + 6a_i + a_{i+1} = 8f(t_i),    (3.78)

which proves the middle N − 3 equations in M a = b. At the end point x = 1, only two quadratic cardinal B-splines, B_{0,Z}^2(x) and B_{1,Z}^2(x), are nonzero. Then Example 3.25 yields

    (1/2) a_0 + (1/2) a_1 = f(1),

and this proves the first identity in (3.76). Also, the above equation and (3.78) with i = 1 yield

    5a_1 + a_2 = 8f(3/2) − 2f(1),

which proves the first equation in M a = b. The last equation in M a = b can be proven similarly.

3.5 Curve fitting via splines

Definition 3.59. An open curve is (the image of) a continuous map γ : (α, β) → R^n for some α, β with −∞ ≤ α < β ≤ +∞. It is simple if the map γ is injective.

Definition 3.60. The tangent vector of a curve γ is its first derivative

    γ' := dγ/ds,    (3.79)

and the unit tangent vector of γ, denoted by t, is the normalization of its tangent vector.

Definition 3.61. A unit-speed curve is a curve whose tangent vector has unit length at each of its points.

Definition 3.62. A point γ(t_0) is a regular point of γ if t(t_0) exists and t(t_0) ≠ 0 holds; a curve is regular if all of its points are regular.

Definition 3.63. The arc length of a curve starting at the point γ(t_0) is defined as

    s_γ(t) = ∫_{t_0}^t ||γ'(u)||_2 du.    (3.80)

Definition 3.64. A map X → Y is a homeomorphism if it is continuous and bijective and its inverse is also continuous; then the two sets X and Y are said to be homeomorphic.

Definition 3.65. A curve γ̃ : (α̃, β̃) → R^n is a reparametrization of another curve γ : (α, β) → R^n if there exists a homeomorphism φ : (α̃, β̃) → (α, β) such that γ̃(t̃) = γ(φ(t̃)) for each t̃ ∈ (α̃, β̃).

Lemma 3.66. A reparametrization of a regular curve is unit-speed if and only if it is based on the arc length.

Example 3.67. The spiral γ : R → R^2 given by

    γ(t) = (e^t cos t, e^t sin t)    (3.81)

is a curve. Its tangent vector is

    γ'(t) = (e^t (cos t − sin t), e^t (cos t + sin t)),    (3.82)

and thus the modulus of the tangent vector is

    ||γ'(t)||_2 = √2 e^t.

Consequently, the arc length of the spiral is

    s(t) = ∫_0^t √2 e^τ dτ = √2 (e^t − 1),    (3.83)

and we have t = ln(s/√2 + 1). According to Lemma 3.66, the spiral can be expressed as a unit-speed curve

    γ(s) = (s/√2 + 1) ( cos(ln(s/√2 + 1)), sin(ln(s/√2 + 1)) ).    (3.84)

Despite its complicated form, the parametrization of the spiral in (3.84) makes the curve unit-speed; this is a prominent advantage over the parametrization in (3.81).

Definition 3.68. A closed curve is (the image of) a continuous map γ̊ : [0, 1] → R^2 that satisfies γ̊(0) = γ̊(1). If the restriction of γ̊ to [0, 1) is further injective, then the closed curve is a simple closed curve or Jordan curve.

Definition 3.69. The signed unit normal of a curve, denoted by n_s, is the unit vector obtained by rotating its unit tangent vector counterclockwise by π/2.

Definition 3.70. For a unit-speed curve γ, its signed curvature is defined as

    κ_s := γ'' · n_s.    (3.85)

Definition 3.71. The cumulative chordal lengths associated with a sequence of n points

    {x_i ∈ R^D : i = 1, 2, ..., n}    (3.86)

are the n real numbers

    t_i = { 0                                if i = 1;
            t_{i−1} + ||x_i − x_{i−1}||_2    if i > 1,    (3.87)

where ||·||_2 denotes the Euclidean 2-norm.

Algorithm 3.72. A curve γ : (0, 1) → R^D can be approximated by fitting D splines constructed from n characteristic points (3.86), each of which is on γ:

(a) Compute the cumulative chordal lengths.

(b) Fit a spline for each coordinate of γ with the independent parameter as the cumulative chordal lengths.
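Step (a) of Algorithm 3.72 translates directly into code; the function name `cumulative_chordal_lengths` is ours, and `math.dist` computes the Euclidean 2-norm of the chord in (3.87).

```python
import math

def cumulative_chordal_lengths(points):
    """The parameters t_i of (3.87) for a point sequence in R^D."""
    t = [0.0]
    for p, q in zip(points, points[1:]):
        t.append(t[-1] + math.dist(p, q))
    return t

# three sides of the unit square: chords of length 1 each
print(cumulative_chordal_lengths([(0, 0), (1, 0), (1, 1), (0, 1)]))  # [0.0, 1.0, 2.0, 3.0]
```

Step (b) then fits one spline per coordinate, each as a function of these t_i.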


3.6 Problems

3.6.1 Theoretical questions

I. Consider s ∈ S_3^2 on [0, 2]:

       s(x) = { p(x)       if x ∈ [0, 1],
                (2 − x)^3  if x ∈ [1, 2].

   Determine p ∈ P_3 such that s(0) = 0. Is s(x) a natural cubic spline?

II. Given f_i = f(x_i) of some scalar function at points a = x_1 < x_2 < ··· < x_n = b, we consider interpolating f on [a, b] with a quadratic spline s ∈ S_2^1.

   (a) Why is an additional condition needed to determine s uniquely?
   (b) Define m_i = s'(x_i) and p_i = s|_{[x_i, x_{i+1}]}. Determine p_i in terms of f_i, f_{i+1}, and m_i for i = 1, 2, ..., n − 1.
   (c) Suppose m_1 = f'(a) is given. Show how m_2, m_3, ..., m_{n−1} can be computed.

III. Let s_1(x) = 1 + c(x + 1)^3, where x ∈ [−1, 0] and c ∈ R. Determine s_2(x) on [0, 1] such that

       s(x) = { s_1(x)  if x ∈ [−1, 0],
                s_2(x)  if x ∈ [0, 1]

   is a natural cubic spline on [−1, 1] with knots −1, 0, 1. How must c be chosen if one wants s(1) = −1?

IV. Consider f(x) = cos(πx/2) with x ∈ [−1, 1].

   (a) Determine the natural cubic spline interpolant to f on the knots −1, 0, 1.
   (b) As discussed in class, natural cubic splines have the minimal total bending energy. Verify this by taking g(x) to be (i) the quadratic polynomial that interpolates f at −1, 0, 1, and (ii) f(x).

V. The quadratic B-spline B_i^2(x).

   (a) Derive the same explicit expression of B_i^2(x) as that in the notes from the recursive definition of B-splines and the hat function.
   (b) Verify that d/dx B_i^2(x) is continuous at t_i and t_{i+1}.
   (c) Show that only one x* ∈ (t_{i−1}, t_{i+1}) satisfies d/dx B_i^2(x*) = 0. Express x* in terms of the knots within the interval of support.
   (d) Consequently, show B_i^2(x) ∈ [0, 1).
   (e) Plot B_i^2(x) for t_i = i.

VI. Verify Theorem 3.32 algebraically for the case of n = 2, i.e.

       (t_{i+2} − t_{i−1}) [t_{i−1}, t_i, t_{i+1}, t_{i+2}](t − x)_+^2 = B_i^2(x).

VII. Scaled integral of B-splines. Deduce from the theorem on derivatives of B-splines that the scaled integral of a B-spline B_i^n(x) over its support is independent of its index i even if the spacing of the knots is not uniform.

VIII. Symmetric polynomials. We have a theorem on expressing complete symmetric polynomials as divided differences of monomials.

   (a) Verify this theorem for m = 4 and n = 2 by working out the table of divided differences and comparing the result to the definition of complete symmetric polynomials.
   (b) Prove this theorem by the lemma on the recursive relation of complete symmetric polynomials.

3.6.2 Programming assignments

A. Write a program for cubic-spline interpolation of the function

       f(x) = 1/(1 + 25x^2)

   on evenly spaced nodes within the interval [−1, 1] with N = 6, 11, 21, 41, 81. Compute for each N the max-norm of the interpolation-error vector at the midpoints of the subintervals, and report the errors and convergence rates with respect to the number of subintervals. Your algorithm should follow the example of interpolating the natural logarithm in the notes, and your program must use an implementation of LAPACK. Plot the interpolating spline against the exact function to observe that spline interpolation is free of the wide oscillations of the Runge phenomenon.

B. Let f : R → R be a given function. Implement two subroutines to interpolate f by quadratic and cubic cardinal B-splines, corresponding to Theorems 3.58 and 3.57, respectively.

C. Run your subroutines on the function

       f(x) = 1/(1 + x^2),   x ∈ [−5, 5],

   using t_i = −6 + i, i = 1, ..., 11 for Theorem 3.57 and t_i = i − 11/2, i = 1, ..., 10 for Theorem 3.58, respectively. Plot the polynomials against the exact function.

D. Define E_S(x) = |S(x) − f(x)| as the interpolation error. For the two cardinal B-spline interpolants, output the values of E_S(x) at the sites

       x = −3.5, −3, −0.5, 0, 0.5, 3, 3.5.

   Output these values by a program. Why are some of the errors close to machine precision? Which of the two B-splines is more accurate?
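A minimal sketch of the cubic case of assignment B follows, assuming Theorem 3.57: build the tridiagonal system, solve it (a hand-rolled Thomas algorithm stands in for the LAPACK call the assignments require), and evaluate S via the explicit formula (3.69). All function names are ours. Since a clamped cubic spline reproduces cubic polynomials, feeding it f(x) = x^3 must return f itself, which makes a convenient self-test.

```python
import math

def cardinal_cubic(i, x):
    """B_{i,Z}^3(x) via the explicit formula (3.69) with n = 3."""
    s = 0.0
    for k in range(-1, 4):
        s += (-1) ** (3 - k) * math.comb(4, k + 1) * max(k + i - x, 0.0) ** 3
    return s / 6.0

def solve_tridiagonal(sub, diag, sup, rhs):
    """Thomas algorithm for a tridiagonal system."""
    n = len(rhs)
    d, r = list(diag), list(rhs)
    for k in range(1, n):
        w = sub[k - 1] / d[k - 1]
        d[k] -= w * sup[k - 1]
        r[k] -= w * r[k - 1]
    x = [0.0] * n
    x[-1] = r[-1] / d[-1]
    for k in range(n - 2, -1, -1):
        x[k] = (r[k] - sup[k] * x[k + 1]) / d[k]
    return x

def complete_cubic_coeffs(f, fp, N):
    """Coefficients a_{-1}, ..., a_N of (3.71) for the knots 1, ..., N."""
    rhs = [3 * f(1) + fp(1)] + [6 * f(i) for i in range(2, N)] + [3 * f(N) - fp(N)]
    diag = [2.0] + [4.0] * (N - 2) + [2.0]
    off = [1.0] * (N - 1)
    a = solve_tridiagonal(off, diag, off, rhs)            # a_0 .. a_{N-1}
    return [a[1] - 2 * fp(1)] + a + [a[-2] + 2 * fp(N)]   # end conditions (3.72)

def spline_value(coef, x):
    """S(x) = sum_i a_i B_{i,Z}^3(x), where coef[j] stores a_{j-1}."""
    return sum(c * cardinal_cubic(j - 1, x) for j, c in enumerate(coef))

coef = complete_cubic_coeffs(lambda x: x ** 3, lambda x: 3 * x * x, 6)
print(spline_value(coef, 2.5))   # ~15.625 = 2.5**3, exact up to roundoff
```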


E. The roots of the following equation constitute a closed planar curve in the shape of a heart:

       x^2 + ( (3/2) y − √|x| )^2 = 3.    (3.88)

   Write a program to plot the heart. The parameter of the curve should be the cumulative chordal length defined in (3.87). Choose n = 10, 40, 160 and produce three plots of the heart function. (Hints: your knots should include the characteristic points, and you should think about (i) how many pieces of splines to use and (ii) what boundary conditions are appropriate.)

F. (*) Write a program to illustrate (3.38) by plotting the truncated power functions for n = 1, 2 and build a table of divided differences where the entries are figures instead of numbers. The pictures you generate for n = 1 should be the same as those in Example 3.31.

Chapter 4

Computer Arithmetic

4.1 Floating-point number systems

Definition 4.1. The base or radix of a positional numeral system is the number of unique symbols used to represent numbers.

Example 4.2. The binary numeral system consists of two digits, "0" and "1," and thus its base is 2. The decimal system consists of ten digits, "0"–"9," and thus its base is 10.

Definition 4.3. A bit is the basic unit of information in computing; it can have only one of the two values 0 and 1.

Definition 4.4. A byte is a unit of information in computing that commonly consists of 8 bits; it is the smallest addressable unit of memory in many computers.

Definition 4.5. A word is a group of bits with fixed size that are handled as a unit by the instruction set architecture (ISA) and/or hardware of the processor. The word size/width/length is the number of bits in a word and is an important characteristic of a processor or computer architecture.

Example 4.6. 32-bit and 64-bit computers are most common these days. A 32-bit register can store 2^32 values, hence a processor with 32-bit memory addresses can directly access 4GB of byte-addressable memory.

Definition 4.7 (Floating-point numbers). A floating-point number (FPN) is a number of the form

    x = ±m × β^e,    (4.1)

where β is the base or radix, e ∈ [L, U], and the significand (or mantissa) m is a number of the form

    m = d_0 + d_1/β + ··· + d_{p−1}/β^{p−1},    (4.2)

where each integer d_i satisfies d_i ∈ [0, β − 1] for i ∈ [0, p − 1]. d_0 and d_{p−1} are called the most significant digit and the least significant digit, respectively. The string of digits of m is d_0.d_1 d_2 ··· d_{p−1}, of which the portion .d_1 d_2 ··· d_{p−1} is called the fraction of m.

Algorithm 4.8. A decimal integer can be converted to a binary number via the following method:

• divide by 2 and record the remainder,
• repeat until you reach 0,
• concatenate the remainders backwards.

A decimal fraction can be converted to a binary number via the following method:

• multiply by 2 and check whether the integer part is no less than 1: if so, record 1; otherwise record 0,
• repeat until you reach 0,
• concatenate the recorded bits forwards.

Combining the above two methods, we can convert any decimal number to its binary counterpart.

Example 4.9. Convert 156 to a binary number:

    156 = (10011100)_2.

Example 4.10. What is the normalized binary form of 2/3?

    2/3 = (0.a_1 a_2 a_3 ···)_2 = (0.1010···)_2 = (1.0101010···)_2 × 2^{−1}.

Definition 4.11 (FPN systems). A floating-point number system F is a proper subset of the rational numbers Q, and it is characterized by a 4-tuple (β, p, L, U) with

• the base (or radix) β;
• the precision (or number of significand digits) p;
• the exponent range [L, U].

Definition 4.12. An FPN is normalized if its mantissa satisfies 1 ≤ m < β.

Definition 4.13. The subnormal or denormalized numbers are FPNs of the form (4.1) with e = L and m ∈ (0, 1). A normalized FPN system can be extended by including the subnormal numbers.
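Algorithm 4.8 can be sketched as follows; the function names are ours. The fraction loop is capped because fractions such as 2/3 never terminate in base 2 (Example 4.10).

```python
def int_to_binary(n):
    """Decimal integer -> binary digit string by repeated division by 2."""
    bits = []
    while n > 0:
        n, r = divmod(n, 2)
        bits.append(str(r))
    return "".join(reversed(bits)) or "0"   # concatenate remainders backwards

def fraction_to_binary(x, max_bits=24):
    """Decimal fraction in [0, 1) -> binary digit string by repeated doubling."""
    bits = []
    while x > 0 and len(bits) < max_bits:
        x *= 2
        bits.append("1" if x >= 1 else "0")
        if x >= 1:
            x -= 1
    return "".join(bits)                    # concatenate recorded bits forwards

print(int_to_binary(156))            # 10011100, cf. Example 4.9
print(fraction_to_binary(2 / 3, 8))  # 10101010, cf. Example 4.10
```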


Definition 4.14 (IEEE standard 754-2019). The single-precision and double-precision FPNs of the current IEEE (Institute of Electrical and Electronics Engineers) standard 754, published in 2019, are normalized FPN systems with three binary formats (32, 64, and 128 bits) and two decimal formats (64 and 128 bits):

    β = 2,   p = 23 + 1,   e ∈ [−126, 127];        (4.3a)
    β = 2,   p = 52 + 1,   e ∈ [−1022, 1023];      (4.3b)
    β = 2,   p = 112 + 1,  e ∈ [−16382, 16383];    (4.3c)
    β = 10,  p = 16,       e ∈ [−1022, 1023];      (4.3d)
    β = 10,  p = 34,       e ∈ [−6143, 6144].      (4.3e)

Example 4.15. In the IEEE 754 standard, there are some further details on the representation specifications of FPNs. A word is laid out as

    [ ± | exponent (e) | normalized significand (m) ],

with the radix point implicit after the leading significand digit. For example, some major representation specifications of the 32-bit FPNs are as follows.

(a) Out of the 32 bits, 1 is reserved for the sign, 8 for the exponent, and 23 for the significand (see the layout above for the locations and the implicit radix point).

(b) The precision is 24 because we can choose d_0 = 1 for normalized binary floating-point numbers and get away with never storing d_0.

(c) The exponent has 2^8 = 256 possibilities. If we assigned 1, 2, ..., 256 to these possibilities, it would not be possible to represent numbers whose magnitudes are smaller than one. Hence we subtract 128 from 1, 2, ..., 256 to shift the exponents to −127, −126, ..., 0, ..., 127, 128. Out of these numbers, ±m × β^{−127} is reserved for ±0 and subnormal numbers, while ±m × β^{128} is reserved for ±∞ and NaNs, including qNaN (quiet) and sNaN (signaling).

Definition 4.16. The machine precision of a normalized FPN system F is the distance between 1.0 and the next larger FPN in F,

    ε_M := β^{1−p}.    (4.4)

Definition 4.17. The underflow limit (UFL) and the overflow limit (OFL) of a normalized FPN system F are, respectively,

    UFL(F) := min |F \ {0}| = β^L,    (4.5)
    OFL(F) := max |F| = β^U (β − β^{1−p}).    (4.6)

Example 4.18. By default Matlab adopts IEEE 754 double-precision arithmetic. Three characterizing constants are

• eps, the machine precision

    ε_M = β^{1−p} = 2^{1−(52+1)} = 2^{−52} ≈ 2.22 × 10^{−16};

• realmin, the underflow limit UFL(F)

    min |F \ {0}| = β^L = 2^{−1022} ≈ 2.22 × 10^{−308};

• realmax, the overflow limit OFL(F)

    max |F| = β^U (β − β^{1−p}) ≈ 1.80 × 10^{308}.

In C/C++, these constants are defined in <cfloat> and float.h by the macros DBL_EPSILON, DBL_MIN, and DBL_MAX.

Corollary 4.19 (Cardinality of F). For a normalized binary FPN system F,

    #F = 2^p (U − L + 1) + 1.    (4.7)

Proof. The cardinality can be proved by Axiom A.21. The factor 2^p comes from the sign bit and the mantissa. By Example 4.15, U − L + 1 is the number of exponents represented in F. The trailing "+1" in (4.7) accounts for the number 0.

Definition 4.20. The range of a normalized FPN system is a subset of R that consists of two intervals,

    R(F) := {x : x ∈ R, UFL(F) ≤ |x| ≤ OFL(F)}.    (4.8)

Example 4.21. Consider a normalized FPN system with the characterization β = 2, p = 3, L = −1, U = +1. The four FPNs

    1.00 × 2^0, 1.01 × 2^0, 1.10 × 2^0, 1.11 × 2^0

correspond to four equally spaced ticks starting at 1, while

    1.00 × 2^1, 1.01 × 2^1, 1.10 × 2^1, 1.11 × 2^1

correspond to four ticks starting at 2. [Figures omitted: the FPNs of this system plotted on the real line over [−3, 3], first without and then with the subnormal numbers added.]

Definition 4.22. Two normalized FPNs a, b are adjacent to each other in F iff

    ∀c ∈ F \ {a, b},  |a − b| < |a − c| + |c − b|.    (4.9)

Lemma 4.23. Let a, b be two adjacent normalized FPNs satisfying |a| < |b| and ab > 0. Then

    β^{−1} ε_M |a| < |a − b| ≤ ε_M |a|.    (4.10)

Proof. Consider a > 0; then Δa := b − a > 0. By Definitions 4.7 and 4.12, a = m × β^e with 1.0 ≤ m < β. a and b only differ from each other at the least significant digit, hence Δa = ε_M β^e. Since ε_M/β < ε_M/m ≤ ε_M, we have Δa/a ∈ (β^{−1} ε_M, ε_M]. The other case is similar.
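The toy system of Example 4.21 can be enumerated to confirm Corollary 4.19 together with (4.5) and (4.6); the function name `normalized_fpns` is ours.

```python
def normalized_fpns(p=3, L=-1, U=1):
    """All FPNs of the normalized binary system (2, p, L, U), plus 0."""
    vals = {0.0}
    for e in range(L, U + 1):
        for frac in range(2 ** (p - 1)):        # significand 1.d1...d_{p-1}
            m = 1.0 + frac / 2 ** (p - 1)
            vals.update({m * 2 ** e, -m * 2 ** e})
    return sorted(vals)

F = normalized_fpns()
# cardinality 2^p (U-L+1) + 1 = 25; UFL = beta^L = 0.5; OFL = 2*(2 - 1/4) = 3.5
print(len(F), min(v for v in F if v > 0), max(F))   # 25 0.5 3.5
```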


4.2 Rounding error analysis

4.2.1 Rounding a single number

Definition 4.24 (Rounding). Rounding is a map fl : R → F ∪ {+∞, −∞, NaN}. The default rounding mode is round to nearest, i.e. fl(x) is chosen to minimize |fl(x) − x| for x ∈ R(F). In the case of a tie, fl(x) is chosen by round to even, i.e. fl(x) is the one with an even last digit d_{p−1}.

Definition 4.25. A rounded number fl(x) overflows if |x| > OFL(F), in which case fl(x) = NaN, or underflows if 0 < |x| < UFL(F), in which case fl(x) = 0. An underflow of an extended FPN system is called a gradual underflow.

Definition 4.26. The unit roundoff of F is the number

    ε_u := (1/2) ε_M = (1/2) β^{1−p}.    (4.11)

Theorem 4.27 (Range of round-off errors). For x ∈ R(F) as in (4.8), we have

    fl(x) = x(1 + δ),  |δ| < ε_u.    (4.12)

Proof. By Definition A.32, R(F) is a subset of R and is thus a chain. Therefore ∀x ∈ R(F), ∃x_L, x_R ∈ F s.t.

  • x_L and x_R are adjacent,
  • x_L ≤ x ≤ x_R.

If x = x_L or x_R, then fl(x) − x = 0 and (4.12) clearly holds. Otherwise x_L < x < x_R. Then Lemma 4.23 and Definitions 4.22 and 4.24 yield

    |fl(x) − x| ≤ (1/2)|x_R − x_L| ≤ ε_u min(|x_L|, |x_R|) < ε_u |x|.    (4.13)

Hence −ε_u|x| < fl(x) − x < ε_u|x|, which yields (4.12). □

Theorem 4.28. For x ∈ R(F), we have

    fl(x) = x/(1 + δ),  |δ| ≤ ε_u.    (4.14)

Proof. The proof is the same as that of Theorem 4.27, except that we replace the last inequality "< ε_u|x|" in (4.13) by "≤ ε_u|fl(x)|." Consequently, the equality in (4.14) holds when x = (1/2)(x_L + x_R) and |fl(x)| = min(|x_L|, |x_R|) has its significand as m = 1.0. □

Example 4.29. Find x_L, x_R of x = 2/3 in the normalized single-precision IEEE 754 standard; which of them is fl(x)?
By Example 4.10, we have

    2/3 = (0.1010···)₂ = (1.0101010···)₂ × 2^{−1}.

Hence

    x_L = (1.010···10)₂ × 2^{−1};
    x_R = (1.010···11)₂ × 2^{−1},

where the last bit of x_L must be 0 because the IEEE 754 standard states that 23 bits are reserved for the mantissa. It follows that

    x − x_L = (2/3) × 2^{−24};
    x_R − x_L = 2^{−24};
    x_R − x = (x_R − x_L) − (x − x_L) = (1/3) × 2^{−24}.

Thus Definition 4.24 implies fl(x) = x_R. □

4.2.2 Binary floating-point operations

Definition 4.30 (Addition/subtraction of two FPNs). Express a, b ∈ F as a = M_a × β^{e_a} and b = M_b × β^{e_b} where M_a = ±m_a and M_b = ±m_b. With the assumption |a| ≥ |b|, the sum c := fl(a + b) ∈ F is calculated in a register of precision at least 2p as follows.

 (i) Exponent comparison:
     • If e_a − e_b > p + 1, set c = a and return c;
     • otherwise set e_c ← e_a and M_b ← M_b / β^{e_a − e_b}.
 (ii) Perform the addition M_c ← M_a + M_b in the register with rounding to nearest.
 (iii) Normalization:
     • If |M_c| = 0, return 0.
     • If |M_c| ∈ (β, β²), set M_c ← M_c/β and e_c ← e_c + 1.
     • If |M_c| ∈ [β − ε_u(p), β], set M_c ← 1.0 and e_c ← e_c + 1.
     • If |M_c| ∈ (0, β − ε_u(p)), repeat M_c ← M_c β, e_c ← e_c − 1 until |M_c| ∈ [1, β).
 (iv) Check range:
     • return NaN if e_c overflows,
     • return 0 if e_c underflows.
 (v) Round M_c (to nearest) to precision p.
 (vi) Set c ← M_c × β^{e_c}.

Here ε_u(p) in step (iii) is the unit round-off for FPNs with precision p, c.f. Definition 4.26.

Example 4.31. Consider the calculation of c := fl(a + b) with a = 1.234 × 10⁴ and b = 5.678 × 10⁰ in an FPN system F : (10, 4, −7, 8).

 (i) b ← 0.0005678 × 10⁴; e_c ← 4.
 (ii) m_c ← 1.2345678.
 (iii) do nothing.
 (iv) do nothing.
 (v) m_c ← 1.235.
 (vi) c = 1.235 × 10⁴.

For b = 5.678 × 10^{−2}, c = a would be returned in step (i).

Example 4.32. Consider the calculation of c := fl(a + b) with a = 1.000 × 10⁰ and b = −9.000 × 10^{−5} in an FPN system F : (10, 4, −7, 8).

 (i) b ← −0.0000900 × 10⁰; e_c ← 0.
 (ii) m_c ← 0.9999100.
 (iii) e_c ← e_c − 1; m_c ← 9.9991000.
 (iv) do nothing.


 (v) m_c ← 9.999.
 (vi) c = 9.999 × 10^{−1}.

For b = −9.000 × 10^{−6}, c = a would be returned in step (i).

Exercise 4.33. Repeat Example 4.31 with b = 8.769 × 10⁴, b = −5.678 × 10⁰, and b = −5.678 × 10³.

Lemma 4.34. For a, b ∈ F, a + b ∈ R(F) implies

    fl(a + b) = (a + b)(1 + δ),  |δ| < ε_u.    (4.15)

Proof. The round-off error in step (v) always dominates that in step (ii), which, because of the 2p precision, is nonzero only in the case of e_a − e_b = p + 1. Then (4.15) follows from Theorem 4.27. □

Definition 4.35 (Multiplication of two FPNs). Express a, b ∈ F as a = M_a × β^{e_a} and b = M_b × β^{e_b} where M_a = ±m_a and M_b = ±m_b. The product c := fl(ab) ∈ F is calculated in a register of precision at least 2p as follows.

 (i) Exponent sum: e_c ← e_a + e_b.
 (ii) Perform the multiplication M_c ← M_a M_b in the register.
 (iii) Normalization:
     • If |M_c| ∈ (β, β²), set M_c ← M_c/β and e_c ← e_c + 1.
     • If |M_c| ∈ [β − ε_u(p), β], set M_c ← 1.0 and e_c ← e_c + 1.
 (iv) Check range:
     • return NaN if e_c overflows,
     • return 0 if e_c underflows.
 (v) Round M_c (to nearest) to precision p.
 (vi) Set c ← M_c × β^{e_c}.

Here ε_u(p) in step (iii) is the unit round-off for FPNs with precision p, c.f. Definition 4.26.

Example 4.36. Consider the calculation of c := fl(ab) with a = 2.345 × 10⁴ and b = 6.789 × 10⁰ in an FPN system F : (10, 4, −7, 8).

 (i) e_c ← 4.
 (ii) M_c ← 15.920205.
 (iii) m_c ← 1.5920205, e_c ← 5.
 (iv) do nothing.
 (v) m_c ← 1.592.
 (vi) c = 1.592 × 10⁵.

Lemma 4.37. For a, b ∈ F, |ab| ∈ R(F) implies

    fl(ab) = (ab)(1 + δ),  |δ| < ε_u.    (4.16)

Proof. The error only comes from the round-off in step (v). Then (4.16) follows from Theorem 4.27. □

Definition 4.38 (Division of two FPNs). Express a, b ∈ F as a = M_a × β^{e_a} and b = M_b × β^{e_b} where M_a = ±m_a and M_b = ±m_b. The quotient c = fl(a/b) ∈ F is calculated in a register of precision at least 2p + 1 as follows.

 (i) If m_b = 0, return NaN; otherwise set e_c ← e_a − e_b.
 (ii) Perform the division M_c ← M_a/M_b in the register with rounding to nearest.
 (iii) Normalization:
     • If |M_c| < 1, set M_c ← M_c β, e_c ← e_c − 1.
 (iv) Check range:
     • return NaN if e_c overflows,
     • return 0 if e_c underflows.
 (v) Round M_c (to nearest) to precision p.
 (vi) Set c ← M_c × β^{e_c}.

Lemma 4.39. For a, b ∈ F, a/b ∈ R(F) implies

    fl(a/b) = (a/b)(1 + δ),  |δ| < ε_u.    (4.17)

Proof. In the case of |M_a| = |M_b|, there is no rounding error in Definition 4.38 and (4.17) clearly holds. Hereafter we denote by M_{c1} and M_{c2} the results of steps (ii) and (v) in Definition 4.38, respectively.
In the case of |M_a| > |M_b|, the condition a, b ∈ F, Definition 4.16, and |M_a|, |M_b| ∈ [1, β) imply

    |M_a / M_b| ≥ (β − ε_M)/(β − 2ε_M) > 1 + β^{−1} ε_M,    (4.18)

which further implies that the normalization step (iii) in Definition 4.38 is not invoked. By Definitions 4.24, 4.16, and 4.26, the unit roundoff of a register with precision p + k is

    (1/2) β^{1−p−k} = (1/2) β^{1−p} β^{1−p} β^{p−1−k} = β^{p−1−k} ε_u ε_M,

and hence the unit roundoff of the register in Definition 4.38 is β^{−2} ε_u ε_M. Therefore we have

    M_{c2} = M_{c1} + δ₂,  |δ₂| < ε_u
           = M_a/M_b + δ₁ + δ₂,  |δ₁| < β^{−2} ε_u ε_M
           = (M_a/M_b)(1 + δ);
    |δ| = |δ₁ + δ₂| / |M_a/M_b| < ε_u (1 + β^{−2} ε_M)/(1 + β^{−1} ε_M) < ε_u,

where we have applied (4.18) and the triangular inequality in deriving the first inequality of the last line.
Consider the last case |M_a| < |M_b|. It is impossible to have |M_{c1}| = 1 in step (ii) because

    |M_a|/|M_b| ≤ (β − 2ε_M)/(β − ε_M) = 1 − ε_M/(β − ε_M) < 1 − β^{−1} ε_M
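The decimal toy system F : (10, 4, −7, 8) of Examples 4.31 and 4.32 is easy to emulate; a sketch in Python (an assumption of this note: `fl` and `fpn_add` are illustrative names, the `decimal` module plays the role of the wide register, and steps (ii)–(vi) of Definition 4.30 are collapsed into an exact sum followed by one rounding):

```python
from decimal import Decimal, Context, ROUND_HALF_EVEN

P = 4  # precision of the toy system F : (10, 4, -7, 8)

def fl(x, p=P):
    """Round x to p significant decimal digits, round-half-to-even (Def. 4.24)."""
    return Context(prec=p, rounding=ROUND_HALF_EVEN).plus(Decimal(x))

def fpn_add(a, b, p=P):
    """Sketch of Definition 4.30: drop a negligible addend in step (i);
    emulate the remaining steps by an exact sum plus one final rounding."""
    a, b = Decimal(a), Decimal(b)
    if abs(b) > abs(a):
        a, b = b, a
    if b != 0 and a.adjusted() - b.adjusted() > p + 1:
        return a               # step (i): e_a - e_b > p + 1, b cannot affect a
    return fl(a + b, p)        # steps (ii)-(vi)
```

Called with the data of Example 4.31, `fpn_add("1.234e4", "5.678")` reproduces c = 1.235 × 10⁴, and `fpn_add("1.000", "-9.000e-5")` reproduces Example 4.32.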


and the precision of the register is greater than p + 1. Therefore |M_{c1}| < 1 must hold and in Definition 4.38 step (iii) is invoked to yield

    M_{c1} = M_a/M_b + δ₁,  |δ₁| < β^{−2} ε_u ε_M;
    M_{c2} = β M_{c1} + δ₂,  |δ₂| < ε_u
           = β (M_a/M_b) (1 + (βδ₁ + δ₂)/(β M_a/M_b)),

where the denominator in the parentheses satisfies

    β |M_a/M_b| ≥ β/(β − ε_M) = 1 + ε_M/(β − ε_M) > 1 + β^{−1} ε_M.

Hence we have

    |δ| = |βδ₁ + δ₂| / |β M_a/M_b| < (β^{−1} ε_u ε_M + ε_u)/(1 + β^{−1} ε_M) = ε_u.  □

Theorem 4.40 (Model of machine arithmetic). Denote by F a normalized FPN system with precision p. For each arithmetic operation ⊙ = +, −, ×, /, we have

    ∀a, b ∈ F,  a ⊙ b ∈ R(F) ⇒ fl(a ⊙ b) = (a ⊙ b)(1 + δ),    (4.19)

where |δ| < ε_u if and only if these binary operations are performed in a register with precision 2p + 1.

Proof. This follows from Lemmas 4.34, 4.37, and 4.39. □

4.2.3 The propagation of rounding errors

Theorem 4.41. If ∀i = 0, 1, · · · , n, a_i ∈ F, a_i > 0, then

    fl(Σ_{i=0}^n a_i) = (1 + δ_n) Σ_{i=0}^n a_i,    (4.20)

where |δ_n| < (1 + ε_u)^n − 1 ≈ n ε_u.

Proof. Define s_k := Σ_{i=0}^k a_i,

    s₀ := a₀;                  s₀* := a₀;
    s_{k+1} := s_k + a_{k+1},  s*_{k+1} := fl(s*_k + a_{k+1}),

    δ_k := (s*_k − s_k)/s_k,   ε_k := (s*_{k+1} − (s*_k + a_{k+1}))/(s*_k + a_{k+1}).

In words, δ_k is the accumulated rounding error and ε_k is the rounding error at the kth step. Then we have

    δ_{k+1} = (s*_{k+1} − s_{k+1})/s_{k+1} = ((s*_k + a_{k+1})(1 + ε_k) − s_{k+1})/s_{k+1}
            = ((s_k(1 + δ_k) + a_{k+1})(1 + ε_k) − s_k − a_{k+1})/s_{k+1}
            = ((ε_k + δ_k + ε_k δ_k) s_k + ε_k a_{k+1})/s_{k+1}
            = (ε_k s_{k+1} + δ_k (1 + ε_k) s_k)/s_{k+1} = ε_k + δ_k (1 + ε_k) s_k/s_{k+1}.

The condition of the a_i's being positive implies s_k < s_{k+1}, and Theorem 4.27 states |ε_k| < ε_u. Hence we have

    |δ_{k+1}| < |ε_k| + |δ_k|(1 + ε_u) < ε_u + |δ_k|(1 + ε_u).

An easy induction then shows that

    ∀k ∈ N,  |δ_{k+1}| < ε_u Σ_{i=0}^k (1 + ε_u)^i    (4.21)
           = ε_u ((1 + ε_u)^{k+1} − 1)/((1 + ε_u) − 1) = (1 + ε_u)^{k+1} − 1,

where the second step follows from the summation formula of geometric series. The proof is completed by the binomial theorem. □

Exercise 4.42. If we sort the positive numbers a_i > 0 according to their magnitudes and carry out the additions in this ascending order, we can minimize the rounding error term δ in Theorem 4.41. Can you give some examples?

Exercise 4.43. Derive fl(a₁b₁ + a₂b₂ + a₃b₃) for a_i, b_i ∈ F and make some observations on the corresponding derivation of fl(Σ_i Π_j a_{i,j}).

Theorem 4.44. For given µ ∈ R⁺ and a positive integer n ≤ ⌊(ln 2)/µ⌋, suppose |δ_i| ≤ µ for each i = 1, 2, . . . , n. Then

    1 − nµ ≤ Π_{i=1}^n (1 + δ_i) ≤ 1 + nµ + (nµ)²,    (4.22)

or equivalently, for I_n := [−1/(1 + nµ), 1],

    ∃θ ∈ I_n s.t. Π_{i=1}^n (1 + δ_i) = 1 + θ(nµ + n²µ²).    (4.23)

Proof. The condition |δ_i| ≤ µ implies

    (1 − µ)^n ≤ Π_{i=1}^n (1 + δ_i) ≤ (1 + µ)^n.

Taylor expansion of f(µ) = (1 − µ)^n at µ = 0 with Lagrangian remainder yields

    (1 − µ)^n ≥ 1 − nµ,

which implies the first inequality in (4.22). On the other hand, the Taylor series of e^x for x ∈ R⁺ satisfies

    e^x = 1 + x + x²/2! + x³/3! + ···
        = 1 + x + (x²/2!)(1 + x/3 + 2x²/4! + ···)
        ≤ 1 + x + (x²/2) e^x.

Set x = nµ in the above inequality, apply the condition nµ ≤ ln 2, and we have

    e^{nµ} ≤ 1 + nµ + (nµ)²,

which, together with the inequality (1 + µ)^n ≤ e^{nµ}, yields the second inequality in (4.22).
Finally, (4.22) implies that Π_{i=1}^n (1 + δ_i) is in the range of the continuous function f(τ) = 1 + τ(1 + nµ)nµ on I_n. The rest of the proof follows from the intermediate value theorem. □
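The ordering effect of Exercise 4.42 is easy to observe in single precision; a sketch in Python (the series Σ 1/k² and the term count are arbitrary illustrative choices, and the float64 sum serves as the accurate reference):

```python
import numpy as np

# Sum 1/k^2 for k = 1..n in float32, in descending vs ascending order of
# magnitude; small terms added to a large accumulator lose their low-order
# bits, so the ascending order is typically more accurate.
n = 100_000
terms = 1.0 / np.arange(1, n + 1) ** 2      # float64 terms
ref = terms.sum()                           # float64 reference

desc = np.float32(0.0)
for t in terms.astype(np.float32):          # large terms first
    desc += t
asc = np.float32(0.0)
for t in terms[::-1].astype(np.float32):    # small terms first
    asc += t

err_desc = abs(float(desc) - ref) / ref
err_asc = abs(float(asc) - ref) / ref
print(err_desc, err_asc)
```

The descending order loses the entire tail of the series (relative error ~10⁻⁴ here), while the ascending order stays within a few units of the float32 roundoff, far below the worst-case n ε_u of Theorem 4.41.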


4.3 Accuracy and stability

4.3.1 Avoiding catastrophic cancellation

Definition 4.45. Let x̂ be an approximation to x ∈ R. The accuracy of x̂ can be measured by its absolute error

    E_abs(x̂) = |x̂ − x|    (4.24)

and/or its relative error

    E_rel(x̂) = |x̂ − x| / |x|.    (4.25)

Definition 4.46. For an approximation ŷ to y = f(x) computed by ŷ = f̂(x), the forward error is the relative error of ŷ in approximating y and the backward error is the smallest relative error in approximating x by an x̂ that satisfies f(x̂) = f̂(x), assuming such an x̂ exists.

[Diagram: the exact map x ↦ y = f(x) and the computed map x ↦ ŷ = f̂(x) = f(x̂); the input perturbation ∆x = x̂ − x corresponds to the backward error and the output perturbation ∆y = ŷ − y to the forward error.]

Definition 4.47 (Accuracy). An algorithm ŷ = f̂(x) for computing the function y = f(x) is accurate if its forward error is small for all x, i.e. ∀x ∈ dom(f), E_rel(f̂(x)) ≤ c ε_u, where c is a small constant.

Example 4.48 (Catastrophic cancellation). For two real numbers x, y ∈ R(F), Theorems 4.27 and 4.40 imply

    fl(fl(x) ⊙ fl(y)) = (fl(x) ⊙ fl(y))(1 + δ₃) = (x(1 + δ₁) ⊙ y(1 + δ₂))(1 + δ₃),

where |δ_i| ≤ ε_u. From Theorems 4.40 and 4.44, we know that multiplication is accurate:

    fl(fl(x) × fl(y)) = xy(1 + δ₁)(1 + δ₂)(1 + δ₃) = xy(1 + θ(3ε_u + 9ε_u²)),

where θ ∈ [−1, 1]. Similarly, division is also accurate:

    fl(fl(x)/fl(y)) = (x(1 + δ₁)/(y(1 + δ₂)))(1 + δ₃)
                    = (x/y)(1 + δ₁)(1 − δ₂ + δ₂² − ···)(1 + δ₃)
                    ≈ (x/y)(1 + δ₁)(1 − δ₂)(1 + δ₃).

However, addition and subtraction might not be accurate:

    fl(fl(x) + fl(y)) = (x(1 + δ₁) + y(1 + δ₂))(1 + δ₃)
                      = (x + y + xδ₁ + yδ₂)(1 + δ₃)
                      = (x + y)(1 + δ₃ + (xδ₁ + yδ₂)/(x + y) + δ₃(xδ₁ + yδ₂)/(x + y)).

In other words, the relative error of addition or subtraction can be arbitrarily large when x + y → 0.

Theorem 4.49 (Loss of most significant digits). Suppose x, y ∈ F, x > y > 0, and

    β^{−t} ≤ 1 − y/x ≤ β^{−s}.    (4.26)

Then the number of most significant digits that are lost in the subtraction x − y is at most t and at least s.

Proof. Rewrite x = m_x × β^n and y = m_y × β^m with 1 ≤ m_x, m_y < β. Definition 4.30 and the condition x > y imply that m_y, the significand of y, is shifted so that y has the same exponent as x before m_x − m_y is performed in the register. Then

    y = (m_y × β^{m−n}) × β^n
    ⇒ x − y = (m_x − m_y × β^{m−n}) × β^n
    ⇒ m_{x−y} = m_x (1 − (m_y × β^m)/(m_x × β^n)) = m_x (1 − y/x)
    ⇒ β^{−t} ≤ m_{x−y} < β^{1−s}.

To normalize m_{x−y} into the interval [1, β), it should be multiplied by at least β^s and at most β^t. In other words, m_{x−y} should be shifted to the left for at least s times and at most t times. Therefore the conclusion on the number of lost significant digits follows. □

Rule 4.50. Catastrophic cancellation should be avoided whenever possible.

Example 4.51. Calculate y = f(x) = x − sin x for x → 0. When x is small, a straightforward calculation would result in a catastrophic cancellation because x ≈ sin x. The solution is to use the Taylor series

    x − sin x = x − (x − x³/3! + x⁵/5! − x⁷/7! + ···)
              = x³/3! − x⁵/5! + x⁷/7! − ···.

4.3.2 Backward stability and numerical stability

Definition 4.52 (Backward stability). An algorithm f̂(x) for computing y = f(x) is backward stable if its backward error is small for all x, i.e.

    ∀x ∈ dom(f), ∃x̂ ∈ dom(f) s.t. f̂(x) = f(x̂) and E_rel(x̂) ≤ c ε_u,    (4.27)

where c is a small constant.

Definition 4.53. An algorithm f̂(x₁, x₂) for computing y = f(x₁, x₂) is backward stable if

    ∀(x₁, x₂) ∈ dom(f), ∃(x̂₁, x̂₂) ∈ dom(f) s.t.
    f̂(x₁, x₂) = f(x̂₁, x̂₂) and E_rel(x̂₁) ≤ c₁ ε_u, E_rel(x̂₂) ≤ c₂ ε_u,    (4.28)

where c₁, c₂ are two small constants.
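Example 4.51 can be checked numerically; a sketch (the evaluation point x = 10⁻⁴ and the five-term truncation are arbitrary illustrative choices):

```python
import math

def x_minus_sin_naive(x):
    """Direct evaluation x - sin(x): catastrophic cancellation for small x."""
    return x - math.sin(x)

def x_minus_sin_taylor(x, terms=5):
    """x - sin x = x^3/3! - x^5/5! + x^7/7! - ... (Example 4.51)."""
    s, t = 0.0, x ** 3 / 6.0                       # leading term x^3/3!
    for k in range(terms):
        s += t
        t *= -x * x / ((2 * k + 4) * (2 * k + 5))  # ratio of consecutive terms
    return s
```

At x = 10⁻⁴ the direct form has already lost roughly half of its significant digits (it agrees with the series only to about 10⁻⁷ relative), while the series value is accurate to machine precision.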


Lemma 4.54. For f(x₁, x₂) = x₁ − x₂, x₁, x₂ ∈ R(F), the algorithm f̂(x₁, x₂) = fl(fl(x₁) − fl(x₂)) is backward stable.

Proof. We have f̂(x₁, x₂) = (fl(x₁) − fl(x₂))(1 + δ₃) from Theorem 4.40. Then Theorem 4.27 implies

    f̂(x₁, x₂) = (x₁(1 + δ₁) − x₂(1 + δ₂))(1 + δ₃)
              = x₁(1 + δ₁ + δ₃ + δ₁δ₃) − x₂(1 + δ₂ + δ₃ + δ₂δ₃).

Take x̂₁ and x̂₂ to be the two terms in the above line and we have

    E_rel(x̂₁) = |δ₁ + δ₃ + δ₁δ₃|,  E_rel(x̂₂) = |δ₂ + δ₃ + δ₂δ₃|.

Then Definition 4.53 completes the proof. □

Example 4.55. For f(x) = 1 + x, x ∈ (0, OFL), show that the algorithm f̂(x) = fl(1.0 + fl(x)) is not backward stable.
We prove a stronger statement that implies the negation of (4.27). For each x ∈ (0, ε_u), Definition 4.24 yields f̂(x) = 1.0. Then f̂(x) = f(x̂) implies x̂ = 0, which further implies E_rel(x̂) = 1. □

[Diagram: the exact map x ↦ y = f(x) and the algorithm x ↦ ŷ = f̂(x); ∆x and ∆y mark the backward and forward perturbations of Definition 4.56.]

Definition 4.56. An algorithm f̂(x) for computing y = f(x) is stable or numerically stable iff

    ∀x ∈ dom(f), ∃x̂ ∈ dom(f) s.t.  |f̂(x) − f(x̂)| / |f(x̂)| ≤ c_f ε_u  and  E_rel(x̂) ≤ c ε_u,    (4.29)

where c_f, c are two small constants.

Lemma 4.57. If an algorithm is backward stable, then it is numerically stable.

Proof. By Definition 4.52, f̂(x) = f(x̂), hence c_f = 0. The other condition also follows trivially. □

Example 4.58. For f(x) = 1 + x, x ∈ (0, OFL), show that the algorithm f̂(x) = fl(1.0 + fl(x)) is stable.
If |x| < ε_u, then f̂(x) = 1.0. Choose x̂ = x; then f(x̂) − x = f̂(x) and

    |f̂(x) − f(x̂)| / |f(x̂)| = x/(1 + x) < 2ε_u.

Otherwise |x| ≥ ε_u. The definitions of the range and unit roundoff (Definitions 4.26 and 4.20) yield x ∈ R(F). By Theorem 4.27, f̂(x) = (1 + x(1 + δ₁))(1 + δ₂), i.e. f̂(x) = 1 + δ₂ + x(1 + δ₁ + δ₂ + δ₁δ₂), where |δ₁|, |δ₂| < ε_u. Choose x̂ = x(1 + δ₁ + δ₂ + δ₁δ₂) and we have

    E_rel(x̂) = |δ₁ + δ₂ + δ₁δ₂| < 3ε_u,
    |f̂(x) − f(x̂)| / |f(x̂)| = |δ₂| / |1 + x(1 + δ₁ + δ₂ + δ₁δ₂)| ≤ ε_u,

where the denominator is never close to zero since x > 0. □

4.3.3 Condition numbers: scalar functions

Definition 4.59. The (relative) condition number of a function y = f(x) is a measure of the relative change in the output for a small change in the input,

    C_f(x) = |x f′(x) / f(x)|.    (4.30)

Definition 4.60. A problem with a low condition number is said to be well-conditioned. A problem with a high condition number is said to be ill-conditioned.

Example 4.61. Definition 4.59 yields

    E_rel(ŷ) ⪅ C_f E_rel(x̂).    (4.31)

The approximation mark "≈" refers to the fact that the quadratic term (∆x)² has been ignored. As one way to interpret (4.31) and to understand Definition 4.59, the computed solution to an ill-conditioned problem may have a large forward error.

Example 4.62. For the function f(x) = arcsin(x), its condition number, according to Definition 4.59, is

    C_f(x) = |x f′(x)/f(x)| = x / (√(1 − x²) arcsin x).

Hence C_f(x) → +∞ as x → ±1.

[Plot: C_f(x) for f = arcsin on [−1, 1]; the curve rises from 1 at x = 0 and diverges as x → ±1.]

Lemma 4.63. Consider solving the equation f(x) = 0 near a simple root r, i.e. f(r) = 0 and f′(r) ≠ 0. Suppose we perturb the function f to F = f + εg where f, g ∈ C², g(r) ≠ 0, and |εg′(r)| ≪ |f′(r)|. Then the root of F is r + h where

    h ≈ −ε g(r)/f′(r).    (4.32)

Proof. Suppose r + h is the new root, i.e. F(r + h) = 0, or,

    f(r + h) + ε g(r + h) = 0.

Taylor's expansion of F(r + h) yields

    f(r) + h f′(r) + ε[g(r) + h g′(r)] = O(h²),

and we have

    h ≈ −ε g(r)/(f′(r) + ε g′(r)) ≈ −ε g(r)/f′(r).  □
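Definition 4.59 and the arcsin blow-up of Example 4.62 can be checked numerically; a sketch (the sample point x = 0.9 and the relative perturbation h = 10⁻⁶ are arbitrary choices):

```python
import math

def cond_arcsin(x):
    """C_f(x) = |x f'(x)/f(x)| for f = arcsin (Example 4.62)."""
    return abs(x / (math.sqrt(1.0 - x * x) * math.asin(x)))

# (4.31): a relative input perturbation h produces a relative output
# change of about C_f(x) * h.
x, h = 0.9, 1e-6
rel_out = abs(math.asin(x * (1 + h)) - math.asin(x)) / abs(math.asin(x))
print(rel_out / h, cond_arcsin(x))   # the two agree to a few digits
```

Near the endpoint the problem is ill-conditioned: `cond_arcsin(0.999999)` is already in the hundreds, matching C_f(x) → +∞ as x → 1.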


Example 4.64 (Wilkinson). Define

    f(x) := Π_{k=1}^p (x − k),   g(x) := x^p.

How is the root x = p affected by perturbing f to f + εg? By Lemma 4.63, the answer is

    h ≈ −ε g(p)/f′(p) = −ε p^p/(p − 1)!.

For p = 20, 30, 40, the value of p^p/(p − 1)! is about 8.6 × 10⁸, 2.3 × 10¹³, 5.9 × 10¹⁷, respectively. Hence a small change ε of the coefficient in the monomial x^p would cause a large change of the root. Consequently, the problem of root finding for polynomials with very high degrees is hopeless.

4.3.4 Condition numbers: vector functions

Definition 4.65. The condition number of a vector function f : R^m → R^n is

    cond_f(x) = ‖x‖ ‖∇f‖ / ‖f(x)‖,    (4.33)

where ‖·‖ denotes a Euclidean norm such as the 1-, 2-, and ∞-norms.

Example 4.66. In solving the linear system Au = b, the algorithm can be viewed as taking the input b and returning the output A⁻¹b, i.e. f(b) = A⁻¹b. Clearly ∇f = A⁻¹. Definition 4.65 yields

    cond_f(x) = ‖b‖ ‖A⁻¹‖ / ‖f‖ = ‖Au‖ ‖A⁻¹‖ / ‖u‖.

In practice the input b can take any value, hence we have

    max_{u≠0} cond_f(x) = max_{u≠0} ‖Au‖ ‖A⁻¹‖ / ‖u‖ = ‖A‖ ‖A⁻¹‖,

where the last expression is the condition number of A defined in linear algebra and we have used the common definition

    ‖A‖ := max_{u≠0} ‖Au‖/‖u‖.    (4.34)

This explains why the condition number of a matrix A is usually defined as

    cond A := ‖A‖ ‖A⁻¹‖.    (4.35)

If we take the norm in (4.35) to be the 2-norm, then

    cond₂ A := ‖A‖₂ ‖A⁻¹‖₂ = σ_max/σ_min,

where σ_max and σ_min are respectively the largest and smallest singular values of A. Why? (hint: see Definition B.169.) In addition, if A is normal, then

    cond₂ A = |λ_max|/|λ_min|,

where |λ_max| and |λ_min| are respectively the largest magnitude and the smallest magnitude of eigenvalues of A. Furthermore, if A is unitary, then we have cond₂ A = 1.

Example 4.67. For the matrix

    A = [ 1  1 − δ ]
        [ 1  1 + δ ]    (4.36)

and δ = 10⁻⁸, we have cond₂ A = 199999999.137258.

Definition 4.68. The componentwise condition number of a vector function f : R^m → R^n is

    cond_f(x) = ‖A(x)‖,    (4.37)

where the matrix A(x) = [a_ij(x)] and each component is

    a_ij(x) = x_j (∂f_i/∂x_j) / f_i(x).    (4.38)

Example 4.69. For the vector function

    f(x) := [ 1/x₁ + 1/x₂ ]
            [ 1/x₁ − 1/x₂ ],

its Jacobian matrix is

    ∇f = −(1/(x₁² x₂²)) [ x₂²   x₁² ]
                         [ x₂²  −x₁² ].

The condition number based on Definition 4.68 clearly captures the fact that x₁ ± x₂ ≈ 0 leads to ill-conditioning,

    C_c = [ x₂/(x₁ + x₂)    x₁/(x₁ + x₂) ]
          [ x₂/(x₁ − x₂)   −x₁/(x₁ − x₂) ],

while that based on the 1-norm of Definition 4.65 fails to capture the ill-conditioning,

    C₁ = ‖x‖₁ ‖∇f‖₁ / ‖f‖₁ = (|x₁| + |x₂|) · 2 max(x₁², x₂²) / (|x₁ x₂| (|x₁ + x₂| + |x₁ − x₂|)),

in that the condition x₁ ± x₂ ≈ 0 yields C₁ ≈ 2. Note that we have used the well-known formula

    ∀A ∈ R^{n×n},  ‖A‖₁ = max_j Σ_i |a_ij|.

Definition 4.70. The Hilbert matrix H_n ∈ R^{n×n} is

    h_{i,j} = 1/(i + j − 1).    (4.39)

Example 4.71. The condition number of Hilbert matrices based on the 2-norm is

    cond₂ H_n ∼ (√2 + 1)^{4n+4} / (2^{15/4} √(πn)),

which is 9.22 × 10¹⁴ for n = 10.

Definition 4.72. The Vandermonde matrix V_n ∈ R^{n×n} is

    v_{i,j} = t_j^{i−1},    (4.40)

where t₁, t₂, . . . , t_n are parameters.

Example 4.73. If the parameters t₁, t₂, . . . , t_n are equally spaced in [−1, 1], the condition number of Vandermonde matrices based on the ∞-norm is

    cond_∞ V_n ∼ (1/π) e^{−π/4} e^{(n/4)(π + 2 ln 2)},

which is 9.86 × 10⁸ for n = 20.
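The value quoted in Example 4.67 can be reproduced with an SVD-based condition number; a minimal sketch (numpy's `cond` computes σ_max/σ_min by default):

```python
import numpy as np

# The nearly singular matrix (4.36) with delta = 1e-8.
delta = 1e-8
A = np.array([[1.0, 1.0 - delta],
              [1.0, 1.0 + delta]])
c = np.linalg.cond(A)   # 2-norm condition number via singular values
print(c)                # close to 2/delta = 2e8, as in Example 4.67
```

Since det A = 2δ while ‖A‖₂ ≈ 2, the small singular value is about δ, which explains the ≈ 2/δ growth as δ → 0.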


4.3.5 Condition numbers: algorithms

Definition 4.74. Consider approximating a function f : R^m → R^n with an algorithm f_A : F^m → F^n. Assume

    ∀x ∈ F^m, ∃x_A ∈ R^m s.t. f_A(x) = f(x_A);    (4.41)

the condition number of the algorithm f_A is defined as

    cond_A(x) = (1/ε_u) inf_{x_A} ‖x_A − x‖ / ‖x‖.    (4.42)

Example 4.75. Consider an algorithm A for calculating y = ln x. Suppose that, for any positive number x, this program produces a y_A satisfying y_A = (1 + δ) ln x where |δ| ≤ 5ε_u. What is the condition number of the algorithm?
We clearly have y_A = ln x_A where x_A = x^{1+δ}, and consequently

    E_rel(x_A) = |x^{1+δ} − x| / x = |x^δ − 1| = |e^{δ ln x} − 1| ≈ |δ ln x| ≤ 5|ln x| ε_u.

Hence A is well conditioned except when x → 0⁺.

Theorem 4.76. Suppose a smooth function f : R → R is approximated by an algorithm A : F → F, producing f_A(x) = f(x)(1 + δ(x)) where |δ(x)| ≤ ϕ(x) ε_u. If cond_f(x) is bounded and nonzero, then we have

    ∀x ∈ F,  cond_A(x) ≤ ϕ(x) / cond_f(x).    (4.43)

Proof. Assume ∀x, ∃x_A such that f(x_A) = f_A(x). Write x_A = x(1 + ε_A) and we have

    f(x)(1 + δ) = f(x_A) = f(x(1 + ε_A)) = f(x + x ε_A) = f(x) + x ε_A f′(x) + O(ε_A²).

Neglecting the quadratic term yields

    x ε_A f′(x) = f(x) δ  ⇒  |ε_A| = |x_A − x|/|x| = |f(x)/(x f′(x))| |δ(x)|.

Dividing both sides by ε_u yields

    (1/ε_u) |x_A − x|/|x| = |δ(x)| / (ε_u cond_f(x)).

Take inf with respect to all x_A's, apply the condition |δ(x)| ≤ ϕ(x) ε_u, and we have (4.43). □

Example 4.77. Assume that sin x and cos x are computed with relative error within machine roundoff (this can be satisfied easily by truncating the Taylor series). Apply Theorem 4.76 to analyze the conditioning of the algorithm

    f_A = fl[(1 − fl(cos x)) / fl(sin x)]    (4.44)

that computes f(x) = (1 − cos x)/sin x for x ∈ (0, π/2).
By Definition 4.59, it is easy to compute that

    cond_f(x) = x / sin x.

Furthermore, by Theorem 4.40 and the assumptions on sin x and cos x, we have

    f_A(x) = ((1 − (cos x)(1 + δ₁))(1 + δ₂) / ((sin x)(1 + δ₃))) (1 + δ₄),

where |δ_i| ≤ ε_u for i = 1, 2, 3, 4. Neglecting the quadratic terms of O(δ_i²), the above equation is equivalent to

    f_A(x) = ((1 − cos x)/sin x) (1 + δ₂ + δ₄ − δ₃ − (cos x/(1 − cos x)) δ₁),

hence ϕ(x) = 3 + cos x/(1 − cos x) and

    cond_A(x) ≤ (sin x / x) (3 + cos x/(1 − cos x)).

Hence, cond_A(x) may be unbounded as x → 0. On the other hand, cond_A(x) is controlled by 6/π as x → π/2.

Exercise 4.78. Repeat Example 4.77 for f(x) = sin x/(1 + cos x) on the same interval.

4.3.6 Overall error of a computer solution

[Diagram: the exact input x is first rounded to x*, then the algorithm f_A produces f_A(x*); the exact problem is f.]

Theorem 4.79. Consider using normalized FPN arithmetics to solve a math problem

    f : R^m → R^n,  y = f(x).    (4.45)

Denote the computer input and output as

    x* ≈ x,  y*_A = f_A(x*),    (4.46)

where f_A is the algorithm that approximates f. The relative error of approximating y with y*_A can be bounded as

    E_rel(y*_A) ⪅ E_rel(x*) cond_f(x) + ε_u cond_f(x*) cond_A(x*),    (4.47)

where the relative error is defined in (4.25).

Proof. By the triangle inequality, we have

    ‖y*_A − y‖/‖y‖ = ‖f_A(x*) − f(x)‖/‖f(x)‖
                  ≤ ‖f(x*) − f(x)‖/‖f(x)‖ + ‖f_A(x*) − f(x*)‖/‖f(x)‖.
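The blow-up of cond_A(x) as x → 0 in Example 4.77 is visible directly: the formula (4.44) loses all accuracy for small x. A sketch, using the identity (1 − cos x)/sin x = tan(x/2) (a rewriting assumed here as an accurate reference, not part of the derivation above):

```python
import math

def f_naive(x):
    """(1 - cos x)/sin x evaluated directly, as in (4.44)."""
    return (1.0 - math.cos(x)) / math.sin(x)

def f_stable(x):
    """The same function via the identity (1 - cos x)/sin x = tan(x/2)."""
    return math.tan(0.5 * x)

print(f_naive(0.5), f_stable(0.5))    # moderate x: agreement to ~machine precision
print(f_naive(1e-8), f_stable(1e-8))  # tiny x: 1 - cos x cancels catastrophically
```

At x = 10⁻⁸ the true value is about x/2 = 5 × 10⁻⁹, but 1 − cos x evaluates to 0 (or a single ulp), so the direct formula is wrong by orders of magnitude, consistent with the unbounded ϕ(x) term.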


By (4.31), the first term is

    ‖f(x*) − f(x)‖/‖f(x)‖ ⪅ cond_f(x) ‖x* − x‖/‖x‖ = E_rel(x*) cond_f(x).

By (4.31) and Definition 4.74, the second term is

    ‖f_A(x*) − f(x*)‖/‖f(x)‖ = ‖f(x*_A) − f(x*)‖/‖f(x)‖ ≈ ‖f(x*_A) − f(x*)‖/‖f(x*)‖
        ≤ cond_f(x*) ‖x*_A − x*‖/‖x*‖ = ε_u cond_A(x*) cond_f(x*),

where the last step follows from the fact that we only consider the x*_A that is the least dangerous. □

4.4 Problems

4.4.1 Theoretical questions

I. Convert the decimal integer 477 to a normalized FPN with β = 2.

II. Convert the decimal fraction 3/5 to a normalized FPN with β = 2.

III. Let x = β^e, e ∈ Z, L < e < U be a normalized FPN in F and x_L, x_R ∈ F the two normalized FPNs adjacent to x such that x_L < x < x_R. Prove x_R − x = β(x − x_L).

IV. By reusing your result of II, find out the two normalized FPNs adjacent to x = 3/5 under the IEEE 754 single-precision protocol. What is fl(x) and the relative roundoff error?

V. If the IEEE 754 single-precision protocol did not round off numbers to the nearest, but simply dropped excess bits, what would the unit roundoff be?

VI. How many bits of precision are lost in the subtraction 1 − cos x when x = 1/4?

VII. Suggest at least two ways to compute 1 − cos x to avoid catastrophic cancellation caused by subtraction.

VIII. What are the condition numbers of the following functions? Where are they large?
    • (x − 1)^α,
    • ln x,
    • e^x,
    • arccos x.

IX. Consider the function f(x) = 1 − e^{−x} for x ∈ [0, 1].
    • Show that cond_f(x) ≤ 1 for x ∈ [0, 1].
    • Let A be the algorithm that evaluates f(x) for the machine number x ∈ F. Assume that the exponential function is computed with relative error within machine roundoff. Estimate cond_A(x) for x ∈ [0, 1].
    • Plot cond_f(x) and the estimated upper bound of cond_A(x) as a function of x on [0, 1]. Discuss your results.

X. The math problem of root finding for a polynomial

    q(x) = Σ_{i=0}^n a_i x^i,  a_n = 1,  a₀ ≠ 0,  a_i ∈ R    (4.48)

can be considered as a vector function f : R^n → C:

    r = f(a₀, a₁, . . . , a_{n−1}).

Derive the componentwise condition number of f based on the 1-norm. For the Wilkinson example, compute your condition number, and compare your result with that in the Wilkinson Example. What does the comparison tell you?

XI. Suppose the division of two FPNs is calculated in a register of precision 2p. Give an example that contradicts the conclusion of the model of machine arithmetic.

XII. If the bisection method is used in single-precision FPNs of IEEE 754 starting with the interval [128, 129], can we compute the root with absolute accuracy < 10⁻⁶? Why?

XIII. In fitting a curve by cubic splines, one gets inaccurate results when the distance between two adjacent points is much smaller than those of other adjacent pairs. Use the condition number of a matrix to explain this phenomenon.

4.4.2 Programming assignments

A. Print values of the functions in (4.49) at 101 equally spaced points covering the interval [0.99, 1.01]. Calculate each function in a straightforward way without rearranging or factoring. Note that the three functions are theoretically the same, but the computed values might be very different. Plot these functions near 1.0 using a magnified scale for the function values to see the variations involved. Discuss what you see. Which one is the most accurate? Why?

    f(x) = x⁸ − 8x⁷ + 28x⁶ − 56x⁵ + 70x⁴ − 56x³ + 28x² − 8x + 1    (4.49a)
    g(x) = (((((((x − 8)x + 28)x − 56)x + 70)x − 56)x + 28)x − 8)x + 1    (4.49b)
    h(x) = (x − 1)⁸    (4.49c)

B. Consider a normalized FPN system F with the characterization β = 2, p = 3, L = −1, U = +1.
    • compute UFL(F) and OFL(F) and output them as decimal numbers;
    • enumerate all numbers in F and verify the corollary on the cardinality of F in the summary handout;
    • plot F on the real axis;
    • enumerate all the subnormal numbers of F;
    • plot the extended F on the real axis.

Chapter 5

Approximation

Definition 5.1. Given a normed vector space Y of functions and its subspace X ⊂ Y, a function ϕ̂ ∈ X is called the best approximation to f ∈ Y from X with respect to the norm ‖·‖ iff

    ∀ϕ ∈ X,  ‖f − ϕ̂‖ ≤ ‖f − ϕ‖.    (5.1)

Example 5.2. The Chebyshev Theorem 2.44 can be restated in the format of Definition 5.1 as follows. As in Example B.24, denote by P_n(R) the set of all polynomials with coefficients in R and degree at most n. For Y = P_n(R) and X = P_{n−1}(R), the best approximation to f(x) = −x^n in Y from X with respect to the max-norm

    ‖g‖_∞ = max_{x∈[−1,1]} |g(x)|    (5.2)

is ϕ̂ = T_n/2^{n−1} − x^n, where T_n is the Chebyshev polynomial of degree n. Clearly ϕ̂ satisfies (5.1).

Definition 5.3. The fundamental problem of linear approximation is to find the best approximation ϕ̂ = Σ_{i=1}^n a_i u_i to f ∈ Y from n elements u₁, u₂, . . . , u_n ∈ X ⊂ Y that are linearly independent and given a priori.

Example 5.4. For f(x) = e^x in C^∞[−1, 1], seeking its best approximation of the form ϕ̂ = Σ_{i=1}^n a_i u_i in the subspace X = span{1, x, x², . . .} is a problem of linear approximation, where n can be any positive integer and the norm can be the max-norm (5.2), the 1-norm

    ‖g‖₁ := ∫_{−1}^{+1} |g(x)| dx,    (5.3)

or the 2-norm

    ‖g‖₂ := (∫_{−1}^{+1} |g(x)|² dx)^{1/2}.    (5.4)

The three different norms are motivated differently: the max-norm corresponds to the min-max error, the 1-norm is related to the area bounded between g(x) and the x-axis, and the 2-norm is related to the Euclidean distance, c.f. Section 5.4.

Example 5.5. For a simple closed curve γ : [0, 1) → R² and n points x_i ∈ γ, consider a spline approximation p : [0, 1) → R² with its knots at the x_i's and a scaled cumulative chordal length as in Definition 3.71. Denote by Int(γ) the complement of γ that always lies at the left of an observer who travels γ according to its parametrization. Then the area difference between S₁ := Int(γ) and S₂ := Int(p) can be defined as

    ‖S₁ ⊕ S₂‖₁ := ∫_{S₁⊕S₂} dx,

where

    S₁ ⊕ S₂ := S₁ ∪ S₂ \ (S₁ ∩ S₂)

is the exclusive disjunction of S₁ and S₂. The minimization of this area difference can be formulated by a best approximation problem based on the 1-norm.

Theorem 5.6. Suppose X is a finite-dimensional subspace of a normed vector space (Y, ‖·‖). Then we have

    ∀y ∈ Y, ∃ϕ̂ ∈ X s.t. ∀ϕ ∈ X, ‖ϕ̂ − y‖ ≤ ‖ϕ − y‖.    (5.5)

Proof. For a given y ∈ Y, define a closed ball

    B_y := {x ∈ X : ‖x‖ ≤ 2‖y‖}.

Clearly 0 ∈ B_y, and the distance from y to B_y is

    dist(y, B_y) := inf_{x∈B_y} ‖y − x‖ ≤ ‖y − 0‖ = ‖y‖.

By definition, any z ∈ X, z ∉ B_y must satisfy ‖z‖ > 2‖y‖, and thus

    ‖z − y‖ ≥ ‖z‖ − ‖y‖ > ‖y‖.

Therefore, if a best approximation to y exists, it must be in B_y. As a subset of X, B_y is finite dimensional, closed, and bounded, hence B_y is compact. The extreme value theorem states that a continuous scalar function attains its minimum and maximum on a compact set. A norm is a continuous function, hence the function d : B_y → R⁺ ∪ {0} given by d(x) = ‖x − y‖ must attain its minimum on B_y. □

Theorem 5.7. The set C[a, b] of continuous functions over [a, b] is an inner-product space over C with its inner product as

    ⟨u, v⟩ := ∫_a^b ρ(t) u(t) v̄(t) dt,    (5.6)
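Example 5.2 can be checked numerically for a small degree; a sketch (n = 4, the grid size, and the least-squares competitor from P_{n−1} are all arbitrary illustrative choices):

```python
import numpy as np

n = 4
x = np.linspace(-1.0, 1.0, 20001)
Tn = np.cos(n * np.arccos(x))            # Chebyshev polynomial T_n on [-1, 1]
phi = Tn / 2**(n - 1) - x**n             # claimed best approximation to -x^n
err_best = np.max(np.abs(-x**n - phi))   # = max |T_n| / 2^(n-1) = 2^(1-n)

coef = np.polyfit(x, -x**n, n - 1)       # one competing element of P_{n-1}
err_ls = np.max(np.abs(-x**n - np.polyval(coef, x)))
print(err_best, err_ls)                  # err_best = 0.125; err_ls is no smaller
```

The min-max error equals 2^{1−n} (the equioscillating amplitude of T_n/2^{n−1}), and no other candidate from P_{n−1}, the least-squares fit included, can beat it in the max-norm.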


where v̄(t) is the complex conjugate of v(t) and the weight function ρ(x) ∈ C[a, b] satisfies ρ(x) > 0 for all x ∈ (a, b). In addition, C[a, b] with

    ‖u‖₂ := (∫_a^b ρ(t)|u(t)|² dt)^{1/2}    (5.7)

is a normed vector space over R.

Proof. This follows from Definitions B.2, B.108, and B.113. □

Definition 5.8. The least-square approximation on C[a, b] is a best approximation problem with the norm in (5.1) set to that in (5.7).

5.1 Orthonormal systems

Definition 5.9. A subset S of an inner product space X is called orthonormal if

    ∀u, v ∈ S,  ⟨u, v⟩ = 0 if u ≠ v;  ⟨u, v⟩ = 1 if u = v.    (5.8)

Example 5.10. The standard basis vectors in R^n are orthonormal.

Example 5.11. The Chebyshev polynomials of the first kind as in Definition 2.39 are orthogonal with respect to (5.6) where a = −1, b = 1, ρ = 1/√(1 − x²). However, they do not satisfy the second case in (5.8).

Theorem 5.12. Any finite set of nonzero orthogonal elements u₁, u₂, . . . , u_n is linearly independent.

Proof. This is easily proven by contradiction using Definitions B.25 and 5.9. □

Definition 5.13. The Gram-Schmidt process takes in a finite or infinite independent list (u₁, u₂, . . .) and outputs two other lists (v₁, v₂, . . .) and (u*₁, u*₂, . . .) by

    v_{n+1} = u_{n+1} − Σ_{k=1}^n ⟨u_{n+1}, u*_k⟩ u*_k,    (5.9a)
    u*_{n+1} = v_{n+1} / ‖v_{n+1}‖,    (5.9b)

with the recursion basis as v₁ = u₁, u*₁ = v₁/‖v₁‖.

Theorem 5.14. For a finite or infinite independent list (u₁, u₂, . . .), the Gram-Schmidt process yields constants

    a₁₁
    a₂₁ a₂₂
    a₃₁ a₃₂ a₃₃
    ···

such that a_kk = 1/‖v_k‖ > 0 and the elements

    u*₁ = a₁₁ u₁
    u*₂ = a₂₁ u₁ + a₂₂ u₂
    u*₃ = a₃₁ u₁ + a₃₂ u₂ + a₃₃ u₃    (5.10)
    ···

are orthonormal.

[Diagram: in the plane spanned by u₁ and u₂, u*₁ = u₁/‖u₁‖ and v₂ = u₂ − ⟨u₂, u*₁⟩u*₁ is the component of u₂ orthogonal to u*₁; normalizing v₂ gives u*₂.]

Proof. Definition 5.13 implies that the formulae (5.9) can be rewritten in the form of (5.10); this can be proven by induction. The induction basis is the recursion basis u*₁ = u₁/‖u₁‖ and the inductive step follows from (5.9) as

    u*_n = (1/‖v_n‖) (u_n − Σ_{k=1}^{n−1} ⟨u_n, u*_k⟩ u*_k),

where each u*_k is a linear combination of u₁, u₂, . . . , u_k and a_nn, the coefficient of u_n in (5.10), is clearly 1/‖v_n‖.
By (5.9b), u*_{n+1} is normal. We show by induction that u*_{n+1} is orthogonal to u*_n, u*_{n−1}, . . ., u*₁. The induction base holds because

    ⟨v₂, u*₁⟩ = ⟨u₂ − ⟨u₂, u*₁⟩ u*₁, u*₁⟩ = ⟨u₂, u*₁⟩ − ⟨u₂, u*₁⟩⟨u*₁, u*₁⟩ = 0,

where the second step follows from (IP-3) in Definition B.108 and the third step from u*₁ being normal. The inductive step also holds because for any j < n + 1 we have

    ⟨v_{n+1}, u*_j⟩ = ⟨u_{n+1} − Σ_{k=1}^n ⟨u_{n+1}, u*_k⟩ u*_k, u*_j⟩
                   = ⟨u_{n+1}, u*_j⟩ − Σ_{k=1}^n ⟨u_{n+1}, u*_k⟩ ⟨u*_k, u*_j⟩
                   = ⟨u_{n+1}, u*_j⟩ − ⟨u_{n+1}, u*_j⟩ = 0,

where the third step follows from the induction hypothesis and (5.9b), i.e.,

    ⟨u*_k, u*_j⟩ = 1 if k = j; 0 otherwise.    (5.11)  □

Exercise 5.15. Prove a_kk = 1/‖v_k‖ by using ‖u*_n‖ = 1.

Corollary 5.16. For a finite or infinite independent list (u₁, u₂, . . .), we can find constants

    b₁₁
    b₂₁ b₂₂
    b₃₁ b₃₂ b₃₃
    ···

and an orthonormal list (u*₁, u*₂, . . .) such that b_ii > 0 and

    u₁ = b₁₁ u*₁
    u₂ = b₂₁ u*₁ + b₂₂ u*₂
    u₃ = b₃₁ u*₁ + b₃₂ u*₂ + b₃₃ u*₃    (5.12)
    ···

42
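The recursion (5.9a)-(5.9b) maps directly to a short routine. The following minimal Python sketch (not part of the original notes; all names are illustrative) applies classical Gram-Schmidt to vectors in $\mathbb{R}^3$ with the Euclidean dot product, but any inner product would do.

```python
# Classical Gram-Schmidt, a direct transcription of (5.9a)-(5.9b).
import math

def dot(u, v):
    return sum(x * y for x, y in zip(u, v))

def gram_schmidt(us):
    """Return the orthonormal list (u*_1, u*_2, ...) of Definition 5.13."""
    ustars = []
    for u in us:
        # v_{n+1} = u_{n+1} - sum_k <u_{n+1}, u*_k> u*_k   ... (5.9a)
        v = list(u)
        for ustar in ustars:
            c = dot(u, ustar)
            v = [vi - c * wi for vi, wi in zip(v, ustar)]
        # u*_{n+1} = v_{n+1} / ||v_{n+1}||                 ... (5.9b)
        norm = math.sqrt(dot(v, v))  # nonzero because the u's are independent
        ustars.append([vi / norm for vi in v])
    return ustars

q = gram_schmidt([[1.0, 1.0, 0.0], [1.0, 0.0, 1.0], [0.0, 1.0, 1.0]])
# q satisfies <u*_i, u*_j> = delta_ij up to rounding, cf. (5.11).
```

Note that the projection coefficient is computed from the original vector $u_{n+1}$, exactly as in (5.9a); subtracting projections of the running residual instead would give the (numerically more robust) modified Gram-Schmidt variant.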
Qinghai Zhang Numerical Analysis 2021

Proof. This follows from (5.10) and the fact that a lower-triangular matrix with positive diagonal elements is invertible.

Corollary 5.17. In Theorem 5.14, we have $\langle u_n^*, u_i \rangle = 0$ for each $i = 1, 2, \ldots, n-1$.

Proof. By Corollary 5.16, each $u_i$ can be expressed as

    u_i = \sum_{k=1}^{i} b_{ik} u_k^*.

Take the inner product of the above equation with $u_n^*$, apply the orthonormality conditions, and we reach the conclusion.

Definition 5.18. Using the Gram-Schmidt orthonormalizing process with the inner product (5.6), we obtain from the independent list of monomials $(1, x, x^2, \ldots)$ the following classic orthonormal polynomials:

                                                  a        b        \rho(x)
    Chebyshev polynomials of the first kind      -1        1        1/\sqrt{1-x^2}
    Chebyshev polynomials of the second kind     -1        1        \sqrt{1-x^2}
    Legendre polynomials                         -1        1        1
    Jacobi polynomials                           -1        1        (1-x)^\alpha (1+x)^\beta
    Laguerre polynomials                          0       +\infty   x^\alpha e^{-x}
    Hermite polynomials                        -\infty    +\infty   e^{-x^2}

where $\alpha, \beta > -1$ for Jacobi polynomials and $\alpha > -1$ for Laguerre polynomials.

Example 5.19. We compute the first three Legendre polynomials using the Gram-Schmidt process.

    u_1 = 1, \quad v_1 = 1, \quad \|v_1\|^2 = \int_{-1}^{+1} 1 \,\mathrm{d}x = 2, \quad u_1^* = \frac{1}{\sqrt{2}}.

    u_2 = x, \quad v_2 = x - \Big\langle x, \frac{1}{\sqrt{2}} \Big\rangle \frac{1}{\sqrt{2}} = x, \quad \|v_2\|^2 = \frac{2}{3}, \quad u_2^* = \sqrt{\frac{3}{2}}\, x.

    v_3 = x^2 - \Big\langle x^2, \sqrt{\frac{3}{2}}\,x \Big\rangle \sqrt{\frac{3}{2}}\,x - \Big\langle x^2, \frac{1}{\sqrt{2}} \Big\rangle \frac{1}{\sqrt{2}} = x^2 - \frac{1}{3},

    \|v_3\|^2 = \int_{-1}^{+1} \Big( x^2 - \frac{1}{3} \Big)^2 \mathrm{d}x = \frac{8}{45}, \quad u_3^* = \frac{3\sqrt{10}}{4} \Big( x^2 - \frac{1}{3} \Big).

5.2 Fourier expansions

Definition 5.20. Let $(u_1^*, u_2^*, \ldots)$ be a finite or infinite orthonormal list. The orthogonal expansion or Fourier expansion for an arbitrary $w$ is the series

    \sum_{n=1}^{m} \langle w, u_n^* \rangle u_n^*,    (5.13)

where the constants $\langle w, u_n^* \rangle$ are known as the Fourier coefficients of $w$ and the term $\langle w, u_n^* \rangle u_n^*$ is the projection of $w$ on $u_n^*$. The error of the Fourier expansion of $w$ with respect to $(u_1^*, u_2^*, \ldots)$ is simply $\sum_n \langle w, u_n^* \rangle u_n^* - w$.

Example 5.21. With the Euclidean inner product in Definition B.112, we select orthonormal vectors in $\mathbb{R}^3$ as

    u_1^* = (1, 0, 0)^T, \quad u_2^* = (0, 1, 0)^T, \quad u_3^* = (0, 0, 1)^T.

For the vector $w = (a, b, c)^T$, the Fourier coefficients are

    \langle w, u_1^* \rangle = a, \quad \langle w, u_2^* \rangle = b, \quad \langle w, u_3^* \rangle = c,

and the projections of $w$ onto $u_1^*$ and $u_2^*$ are

    \langle w, u_1^* \rangle u_1^* = (a, 0, 0)^T, \quad \langle w, u_2^* \rangle u_2^* = (0, b, 0)^T.

The Fourier expansion of $w$ is

    w = \langle w, u_1^* \rangle u_1^* + \langle w, u_2^* \rangle u_2^* + \langle w, u_3^* \rangle u_3^*,

with the error of the Fourier expansion being 0; see Theorem 5.23.

Exercise 5.22. For the orthonormal list in $L^2_{\rho=1}[-\pi, \pi]$,

    \frac{1}{\sqrt{2\pi}},\ \frac{\sin x}{\sqrt{\pi}},\ \frac{\cos x}{\sqrt{\pi}},\ \ldots,\ \frac{\sin(nx)}{\sqrt{\pi}},\ \frac{\cos(nx)}{\sqrt{\pi}},\ \ldots,    (5.14)

derive the Fourier series of a function $f(x)$ as

    f(x) \sim \frac{a_0}{2} + \sum_{k=1}^{+\infty} (a_k \cos kx + b_k \sin kx),    (5.15)

where the coefficients are

    a_k = \frac{1}{\pi} \int_{-\pi}^{\pi} f(x) \cos kx \,\mathrm{d}x, \quad b_k = \frac{1}{\pi} \int_{-\pi}^{\pi} f(x) \sin kx \,\mathrm{d}x.

Theorem 5.23. Let $u_1, u_2, \ldots, u_n$ be linearly independent and let $u_i^*$ be the $u_i$'s orthonormalized by the Gram-Schmidt process. If $w = \sum_{i=1}^{n} a_i u_i$, then

    w = \sum_{i=1}^{n} \langle w, u_i^* \rangle u_i^*,    (5.16)

i.e. $w$ is equal to its Fourier expansion.

Proof. By the condition $w = \sum_{i=1}^{n} a_i u_i$ and Corollary 5.16, we can express $w$ as a linear combination of the $u_i^*$'s,

    w = \sum_{i=1}^{n} c_i u_i^*.

Then the orthonormality of the $u_i^*$'s implies

    \forall k = 1, 2, \ldots, n, \quad \langle u_k^*, w \rangle = c_k,

which completes the proof.

Theorem 5.24 (Minimum properties of Fourier expansions). Let $u_1^*, u_2^*, \ldots$ be an orthonormal system and let $w$ be arbitrary. Then

    \Big\| w - \sum_{i=1}^{N} \langle w, u_i^* \rangle u_i^* \Big\| \le \Big\| w - \sum_{i=1}^{N} a_i u_i^* \Big\|    (5.17)

for any selection of constants $a_1, a_2, \ldots, a_N$.
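As a numerical sanity check on Example 5.19 (this sketch is illustrative and not part of the original notes), the three polynomials obtained there should satisfy $\langle u_i^*, u_j^* \rangle = \delta_{ij}$ for $\langle f, g \rangle = \int_{-1}^{1} f g \,\mathrm{d}x$; the integrals are approximated by a fine composite trapezoidal sum.

```python
# Verify the orthonormality of the first three Legendre polynomials
# from Example 5.19 by numerical quadrature.
import math

p = [
    lambda x: 1.0 / math.sqrt(2.0),
    lambda x: math.sqrt(1.5) * x,
    lambda x: 0.75 * math.sqrt(10.0) * (x * x - 1.0 / 3.0),
]

def inner(f, g, n=20000):
    """Composite trapezoidal approximation of the integral of f*g over [-1, 1]."""
    h = 2.0 / n
    ys = [f(-1.0 + k * h) * g(-1.0 + k * h) for k in range(n + 1)]
    return h * (0.5 * ys[0] + sum(ys[1:-1]) + 0.5 * ys[-1])

gram = [[inner(p[i], p[j]) for j in range(3)] for i in range(3)]
# gram approximates the 3x3 identity matrix.
```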

Proof. With the shorthand notation $\sum_i = \sum_{i=1}^{N}$, we deduce from the definition and properties of inner products

    \Big\| w - \sum_i a_i u_i^* \Big\|^2
      = \Big\langle w - \sum_i a_i u_i^*,\ w - \sum_i a_i u_i^* \Big\rangle
      = \langle w, w \rangle - \Big\langle w, \sum_i a_i u_i^* \Big\rangle - \Big\langle \sum_i a_i u_i^*, w \Big\rangle + \Big\langle \sum_i a_i u_i^*, \sum_i a_i u_i^* \Big\rangle
      = \langle w, w \rangle - \sum_i a_i \langle w, u_i^* \rangle - \sum_i a_i \langle u_i^*, w \rangle + \sum_i \sum_j a_i a_j \langle u_i^*, u_j^* \rangle
      = \langle w, w \rangle - \sum_i a_i \langle w, u_i^* \rangle - \sum_i a_i \langle u_i^*, w \rangle + \sum_i |a_i|^2 - \sum_i \langle u_i^*, w \rangle \langle w, u_i^* \rangle + \sum_i \langle u_i^*, w \rangle \langle w, u_i^* \rangle
      = \|w\|^2 - \sum_i |\langle w, u_i^* \rangle|^2 + \sum_i |a_i - \langle w, u_i^* \rangle|^2,    (5.18)

where "$|\cdot|$" denotes the modulus of a complex number. The first two terms are independent of the $a_i$'s. Therefore $\| w - \sum_i a_i u_i^* \|^2$ is minimized only when $a_i = \langle w, u_i^* \rangle$.

Corollary 5.25. Let $(u_1, u_2, \ldots, u_n)$ be an independent list. The fundamental problem of linearly approximating an arbitrary vector $w$ is solved by the best approximation $\hat\varphi = \sum_k \langle w, u_k^* \rangle u_k^*$ where the $u_k^*$'s are the $u_k$'s orthonormalized by the Gram-Schmidt process. The error norm is

    \|w - \hat\varphi\|^2 := \min_{a_k} \Big\| w - \sum_{k=1}^{n} a_k u_k \Big\|^2 = \|w\|^2 - \sum_{k=1}^{n} |\langle w, u_k^* \rangle|^2.    (5.19)

Proof. This follows directly from (5.18).

Corollary 5.26 (Bessel inequality). If $u_1^*, u_2^*, \ldots, u_N^*$ are orthonormal, then, for an arbitrary $w$,

    \sum_{i=1}^{N} |\langle w, u_i^* \rangle|^2 \le \|w\|^2.    (5.20)

Proof. This follows directly from Corollary 5.25 and the nonnegativity of a norm.

Corollary 5.27. The Gram-Schmidt process in Definition 5.13 satisfies

    \forall n \in \mathbb{N}^+, \quad \|v_{n+1}\|^2 = \|u_{n+1}\|^2 - \sum_{k=1}^{n} |\langle u_{n+1}, u_k^* \rangle|^2.    (5.21)

Proof. By (5.9a), each $v_{n+1}$ can be regarded as the error of the Fourier expansion of $u_{n+1}$ with respect to the orthonormal list $(u_1^*, u_2^*, \ldots, u_n^*)$. In Corollary 5.25, identifying $w$ with $u_{n+1}$ completes the proof.

Example 5.28. Consider the problem in Example 5.4 in the sense of least-square approximation with the weight function $\rho = 1$. It is equivalent to

    \min_{a_i} \int_{-1}^{+1} \Big( e^x - \sum_{i=0}^{n} a_i x^i \Big)^2 \mathrm{d}x.    (5.22)

For $n = 1, 2$, use the Legendre polynomials derived in Example 5.19:

    u_1^* = \frac{1}{\sqrt{2}}, \quad u_2^* = \sqrt{\frac{3}{2}}\, x, \quad u_3^* = \frac{\sqrt{10}}{4} (3x^2 - 1),

and we have the Fourier coefficients of $e^x$ as

    b_0 = \int_{-1}^{+1} \frac{1}{\sqrt{2}} e^x \,\mathrm{d}x = \frac{1}{\sqrt{2}} \Big( e - \frac{1}{e} \Big),
    b_1 = \int_{-1}^{+1} \sqrt{\frac{3}{2}}\, x e^x \,\mathrm{d}x = \sqrt{6}\, e^{-1},
    b_2 = \int_{-1}^{+1} \frac{\sqrt{10}}{4} (3x^2 - 1) e^x \,\mathrm{d}x = \frac{\sqrt{10}}{2} \Big( e - \frac{7}{e} \Big).

The minimizing polynomials are thus

    \hat\varphi_n = \begin{cases} \frac{1}{2e}(e^2 - 1) + \frac{3}{e} x & n = 1; \\ \hat\varphi_1 + \frac{5}{4e}(e^2 - 7)(3x^2 - 1) & n = 2. \end{cases}    (5.23)

5.3 The normal equations

Theorem 5.29. Let $u_1, u_2, \ldots, u_n \in X$ be linearly independent and let $u_i^*$ be the $u_i$'s orthonormalized by the Gram-Schmidt process. Then, for any element $w$,

    \forall j = 1, 2, \ldots, n, \quad \Big( w - \sum_{k=1}^{n} \langle w, u_k^* \rangle u_k^* \Big) \perp u_j^*,    (5.24)

where "$\perp$" denotes orthogonality.

Proof. If $w \in \operatorname{span}(u_1, \ldots, u_n)$, we have $w - \sum_{k=1}^{n} \langle w, u_k^* \rangle u_k^* = 0$ by Theorem 5.23 and thus (5.24) holds trivially. For the other case of $w \notin \operatorname{span}(u_1, \ldots, u_n)$, set $u_{n+1} = w$, apply Corollary 5.17, and we have (5.24).

Corollary 5.30. Let $u_1, u_2, \ldots, u_n \in X$ be linearly independent. If $\hat\varphi = \sum_{k=1}^{n} a_k u_k$ is the best linear approximant to $w$, then

    \forall j = 1, 2, \ldots, n, \quad (w - \hat\varphi) \perp u_j.    (5.25)

Proof. Since $\hat\varphi = \sum_{k=1}^{n} a_k u_k$ is the best linear approximant to $w$, Theorem 5.24 implies that

    \sum_{k=1}^{n} a_k u_k = \sum_{k=1}^{n} \langle w, u_k^* \rangle u_k^*.

Corollary 5.16 and Theorem 5.29 complete the proof.
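The closed-form Fourier coefficients of Example 5.28 can be double-checked by quadrature (an illustrative sketch, not part of the notes): each $b_k$ below is the integral $\langle e^x, u_{k+1}^* \rangle$ computed with a fine composite trapezoidal sum.

```python
# Numerical check of b_0, b_1, b_2 in Example 5.28.
import math

def trap(f, a, b, n=20000):
    h = (b - a) / n
    ys = [f(a + k * h) for k in range(n + 1)]
    return h * (0.5 * ys[0] + sum(ys[1:-1]) + 0.5 * ys[-1])

e = math.e
b0 = trap(lambda x: math.exp(x) / math.sqrt(2.0), -1.0, 1.0)
b1 = trap(lambda x: math.sqrt(1.5) * x * math.exp(x), -1.0, 1.0)
b2 = trap(lambda x: 0.25 * math.sqrt(10.0) * (3.0 * x * x - 1.0) * math.exp(x), -1.0, 1.0)

closed = [(e - 1.0 / e) / math.sqrt(2.0),      # (1/sqrt(2))(e - 1/e)
          math.sqrt(6.0) / e,                  # sqrt(6) e^{-1}
          0.5 * math.sqrt(10.0) * (e - 7.0 / e)]  # (sqrt(10)/2)(e - 7/e)
```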

Definition 5.31. Let $u_1, u_2, \ldots, u_n$ be a sequence of elements in an inner product space. The $n \times n$ matrix

    G = G(u_1, u_2, \cdots, u_n) = (\langle u_i, u_j \rangle) = \begin{pmatrix} \langle u_1, u_1 \rangle & \langle u_1, u_2 \rangle & \ldots & \langle u_1, u_n \rangle \\ \langle u_2, u_1 \rangle & \langle u_2, u_2 \rangle & \ldots & \langle u_2, u_n \rangle \\ \vdots & \vdots & \ddots & \vdots \\ \langle u_n, u_1 \rangle & \langle u_n, u_2 \rangle & \ldots & \langle u_n, u_n \rangle \end{pmatrix}    (5.26)

is the Gram matrix of $u_1, u_2, \ldots, u_n$. Its determinant

    g = g(u_1, u_2, \ldots, u_n) = \det(\langle u_i, u_j \rangle)    (5.27)

is the Gram determinant.

Lemma 5.32. Let $w_i = \sum_{j=1}^{n} a_{ij} u_j$ for $i = 1, 2, \ldots, n$. Let $A = (a_{ij})$ and denote its conjugate transpose by $A^H = (\overline{a_{ji}})$. Then we have

    G(w_1, w_2, \ldots, w_n) = A\, G(u_1, u_2, \ldots, u_n)\, A^H    (5.28)

and

    g(w_1, w_2, \ldots, w_n) = |\det A|^2\, g(u_1, u_2, \ldots, u_n).    (5.29)

Proof. The inner product of $u_i$ and $w_j$ yields

    \langle u_i, w_j \rangle = \Big\langle u_i, \sum_k a_{jk} u_k \Big\rangle = \sum_k \langle u_i, u_k \rangle \overline{a_{jk}}, \quad \text{i.e.} \quad (\langle u_i, w_j \rangle) = G(u_1, u_2, \ldots, u_n)\, A^H.

Therefore (5.28) holds since

    G(w_1, w_2, \ldots, w_n) = (\langle w_i, w_j \rangle) = A\, (\langle u_i, w_j \rangle) = A\, G(u_1, u_2, \ldots, u_n)\, A^H.

The following properties of the complex conjugate are well known:

    \overline{z + w} = \bar z + \bar w, \qquad \overline{zw} = \bar z\, \bar w.

Then the identity $\det(A) = \det(A^T)$ and the Leibniz formula for determinants (Definition B.183) yield

    \overline{\det A} = \overline{\det A^T} = \sum_{\sigma \in S_n} \operatorname{sgn}(\sigma) \prod_{i=1}^{n} \overline{a_{\sigma_i, i}} = \det A^H.

Finally, (5.29) follows from taking the determinant of (5.28) and the identity $\det(AB) = \det(A)\det(B)$.

Theorem 5.33. For nonzero elements $u_1, u_2, \ldots, u_n \in X$, we have

    0 \le g(u_1, u_2, \ldots, u_n) \le \prod_{k=1}^{n} \|u_k\|^2,    (5.30)

where the lower equality holds if and only if $u_1, u_2, \ldots, u_n$ are linearly dependent and the upper equality holds if and only if they are orthogonal.

Proof. Suppose $u_1, u_2, \ldots, u_n$ are linearly dependent. Then we can find constants $c_1, c_2, \ldots, c_n$ satisfying $\sum_{i=1}^{n} c_i u_i = 0$ with at least one $c_j$ being nonzero. Construct vectors

    w_k = \begin{cases} \sum_{i=1}^{n} c_i u_i = 0, & k = j; \\ u_k, & k \ne j. \end{cases}

We have $g(w_1, w_2, \ldots, w_n) = 0$ because $\langle w_j, w_k \rangle = 0$ for each $k$. By the Laplace theorem (Theorem B.188), we can expand the determinant of $C = (c_{ij})$, the matrix mapping the $u_i$'s to the $w_k$'s, according to the minors of its $j$th row:

    \det(C) = \det \begin{pmatrix} 1 & 0 & \cdots & 0 & \cdots & 0 \\ 0 & 1 & \cdots & 0 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots & & \vdots \\ c_1 & c_2 & \cdots & c_j & \cdots & c_n \\ \vdots & \vdots & & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & 0 & \cdots & 1 \end{pmatrix} = 0 + \cdots + 0 + c_j + 0 + \cdots + 0 = c_j \ne 0,

where the determinant of each minor matrix $M_i$ of $c_i$ with $i \ne j$ is zero because the $i$th row of each such $M_i$ is a row of all zeros. Then Lemma 5.32 yields $g(u_1, u_2, \ldots, u_n) = 0$.

Now suppose $u_1, u_2, \ldots, u_n$ are linearly independent. Theorem 5.14 yields constants $a_{ij}$ such that $a_{kk} > 0$ and the following vectors are orthonormal:

    u_k^* = \sum_{i=1}^{k} a_{ki} u_i.

Then Definition 5.31 implies $g(u_1^*, u_2^*, \ldots, u_n^*) = 1$. Also, we have $\det(a_{ij}) = \prod_{k=1}^{n} a_{kk}$ because the matrix $(a_{ij})$ is triangular. It then follows from Lemma 5.32 that

    g(u_1, u_2, \ldots, u_n) = \prod_{k=1}^{n} \frac{1}{a_{kk}^2} > 0.    (5.31)

Since the list of vectors $(u_1, u_2, \ldots, u_n)$ is either dependent or independent, the arguments so far show that $g(u_1, u_2, \ldots, u_n) = 0$ if and only if $u_1, u_2, \ldots, u_n$ are linearly dependent.

Suppose $u_1, u_2, \ldots, u_n$ are orthogonal. By Definition 5.31, $G(u_1, u_2, \ldots, u_n)$ is a diagonal matrix with $\|u_k\|^2$ on the diagonal. Hence the orthogonality of the $u_k$'s implies

    g(u_1, u_2, \ldots, u_n) = \prod_{k=1}^{n} \|u_k\|^2.    (5.32)
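A small numerical illustration of Theorem 5.33 in $\mathbb{R}^3$ with the Euclidean inner product (illustrative sketch, not part of the notes): the Gram determinant lies in $[0, \prod_k \|u_k\|^2]$, vanishing exactly for dependent lists and attaining the upper bound exactly for orthogonal ones.

```python
# Gram determinants of three sample lists in R^3.
def dot(u, v):
    return sum(x * y for x, y in zip(u, v))

def det3(M):
    return (M[0][0] * (M[1][1] * M[2][2] - M[1][2] * M[2][1])
            - M[0][1] * (M[1][0] * M[2][2] - M[1][2] * M[2][0])
            + M[0][2] * (M[1][0] * M[2][1] - M[1][1] * M[2][0]))

def gram_det(us):
    """Determinant of the Gram matrix (5.26) for a list of three vectors."""
    return det3([[dot(ui, uj) for uj in us] for ui in us])

independent = [[1.0, 1.0, 0.0], [0.0, 1.0, 1.0], [1.0, 0.0, 1.0]]
dependent = [[1.0, 1.0, 0.0], [0.0, 1.0, 1.0], [1.0, 2.0, 1.0]]   # u3 = u1 + u2
orthogonal = [[2.0, 0.0, 0.0], [0.0, 3.0, 0.0], [0.0, 0.0, 1.0]]
# gram_det(dependent) = 0; 0 < gram_det(independent) = 4 < 8;
# gram_det(orthogonal) = 36 = 4 * 9 * 1, the product of squared norms.
```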

For the converse statement, suppose (5.32) holds. Then $u_1, u_2, \ldots, u_n$ must be independent because otherwise it would contradict the lower equality of (5.30) proved above. Apply the Gram-Schmidt process to $(u_1, u_2, \ldots, u_n)$ and we know from Theorem 5.14 that $\frac{1}{a_{kk}} = \|v_k\|$. Set the length of the list in Theorem 5.14 to $1, 2, \ldots, n$ and we know from (5.31) and (5.32) that

    \forall k = 1, 2, \ldots, n, \quad \|u_k\|^2 = \|v_k\|^2.    (5.33)

Then Corollary 5.27 and (5.33) imply

    \forall k = 1, 2, \ldots, n, \quad \sum_{j=1}^{k-1} |\langle u_k, u_j^* \rangle|^2 = 0,

which further implies

    \forall k = 1, 2, \ldots, n,\ \forall j = 1, 2, \ldots, k-1, \quad \langle u_k, u_j^* \rangle = 0,

which, together with Corollary 5.16, implies the orthogonality of the $u_k$'s. Finally, we remark that the maximum of $g(u_1, u_2, \ldots, u_n)$ is indeed $\prod_{k=1}^{n} \|u_k\|^2$ because of (5.31), $\frac{1}{a_{kk}} = \|v_k\|$, and Corollary 5.27.

Theorem 5.34. Let $\hat\varphi = \sum_{i=1}^{n} a_i u_i$ be the best approximation to $w$ constructed from the list of independent vectors $(u_1, u_2, \ldots, u_n)$. Then the coefficients

    a = [a_1, a_2, \ldots, a_n]^T

are uniquely determined from the linear system of normal equations,

    G(u_1, u_2, \ldots, u_n)^T a = c,    (5.34)

where $c = [\langle w, u_1 \rangle, \langle w, u_2 \rangle, \ldots, \langle w, u_n \rangle]^T$.

Proof. Take the inner product of $\hat\varphi = \sum_{i=1}^{n} a_i u_i$ with $u_j$, apply Corollary 5.30, and we have

    \langle w, u_j \rangle = \sum_{k=1}^{n} a_k \langle u_k, u_j \rangle,

which is simply the $j$th equation of (5.34). The uniqueness of the coefficients follows from Theorem 5.33 and Cramer's rule.

Example 5.35. Solve Example 5.28 by normal equations. To find the best approximation $\hat\varphi = a_0 + a_1 x + a_2 x^2$ to $e^x$ from the linearly independent list $(1, x, x^2)$, we first construct the Gram matrix from (5.26), (5.6), and $\rho = 1$:

    G(1, x, x^2) = \begin{pmatrix} \langle 1, 1 \rangle & \langle 1, x \rangle & \langle 1, x^2 \rangle \\ \langle x, 1 \rangle & \langle x, x \rangle & \langle x, x^2 \rangle \\ \langle x^2, 1 \rangle & \langle x^2, x \rangle & \langle x^2, x^2 \rangle \end{pmatrix} = \begin{pmatrix} 2 & 0 & \frac{2}{3} \\ 0 & \frac{2}{3} & 0 \\ \frac{2}{3} & 0 & \frac{2}{5} \end{pmatrix}.

We then calculate the vector

    c = \begin{pmatrix} \langle e^x, 1 \rangle \\ \langle e^x, x \rangle \\ \langle e^x, x^2 \rangle \end{pmatrix} = \begin{pmatrix} e - 1/e \\ 2/e \\ e - 5/e \end{pmatrix}.

The normal equations then yield

    a_0 = \frac{3(11 - e^2)}{4e}, \quad a_1 = \frac{3}{e}, \quad a_2 = \frac{15(e^2 - 7)}{4e}.

With these values, it is easily verified that the best approximation $\hat\varphi = a_0 + a_1 x + a_2 x^2$ equals that in (5.23).

5.4 Discrete least squares (DLS)

Example 5.36 (An experiment on Newton's second law by discrete least squares). A cart with mass $M$ is pulled along a horizontal track by a cable attached through a pulley to a weight of mass $m_j$.

[Figure: a cart of mass $M$ on a horizontal track with coordinate axis $x$; the cable runs over a pulley at the end of the track to a hanging weight of mass $m$, which supplies the force $F$ through gravity $mg$.]

Neglecting the friction of the track and the pulley system, we have from Newton's second law

    m_j g = (m_j + M) a = (m_j + M) \frac{\mathrm{d}^2 x}{\mathrm{d}t^2}.

The following experiments verify Newton's second law.

(i) For fixed $M$ and $m_j$, measure a number of data points $(t_i, x_i)$ by recording the position of the cart with a high-speed camera.

(ii) Fit a quadratic polynomial $p(t) = c_0 + c_1 t + c_2 t^2$ by minimizing the total squared deviation,

    \min \sum_i (x_i - p(t_i))^2.

(iii) Take $a_j = 2c_2$ as the experimental result of the acceleration for the force $F_j = m_j (g - a_j)$.

(iv) Change the weight $m_j$ and repeat steps (i)-(iii) a number of times to get data points $(a_j, F_j)$.

(v) Fit a linear polynomial $f(x) = c_0 + c_1 x$ by minimizing the total squared deviation,

    \min \sum_j (F_j - f(a_j))^2.

One verifies Newton's second law by showing that the fitted coefficient $c_1$ is very close to $M$. Note that the expressions in steps (ii) and (v) justify the name "least squares."
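The $3 \times 3$ system of Example 5.35 is small enough to solve by Cramer's rule; the sketch below (illustrative, not part of the notes) does so numerically and compares against the closed-form coefficients.

```python
# Solve the normal equations of Example 5.35 and compare with the
# closed forms a0 = 3(11 - e^2)/(4e), a1 = 3/e, a2 = 15(e^2 - 7)/(4e).
import math

e = math.e
G = [[2.0, 0.0, 2.0 / 3.0],
     [0.0, 2.0 / 3.0, 0.0],
     [2.0 / 3.0, 0.0, 2.0 / 5.0]]
c = [e - 1.0 / e, 2.0 / e, e - 5.0 / e]

def det3(M):
    return (M[0][0] * (M[1][1] * M[2][2] - M[1][2] * M[2][1])
            - M[0][1] * (M[1][0] * M[2][2] - M[1][2] * M[2][0])
            + M[0][2] * (M[1][0] * M[2][1] - M[1][1] * M[2][0]))

def solve3(G, c):
    """Cramer's rule: replace column j of G by c and take determinant ratios."""
    d = det3(G)
    return [det3([[c[i] if k == j else G[i][k] for k in range(3)]
                  for i in range(3)]) / d for j in range(3)]

a = solve3(G, c)
closed = [3.0 * (11.0 - e * e) / (4.0 * e), 3.0 / e, 15.0 * (e * e - 7.0) / (4.0 * e)]
```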

5.4.1 Gaussian and Dirac delta functions

Definition 5.37. A Gaussian function, or a Gaussian, is a function of the form

    f(x) = a \exp\Big( -\frac{(x - b)^2}{2c^2} \Big),    (5.35)

where $a \in \mathbb{R}^+$ is the height of the curve's peak, $b \in \mathbb{R}$ is the position of the center of the peak, and $c \in \mathbb{R}^+$ is the standard deviation or the Gaussian RMS (root mean square) width.

[Figure: a bell-shaped Gaussian curve centered at the origin.]

Lemma 5.38. The integral of a Gaussian is

    \int_{-\infty}^{+\infty} a e^{-\frac{(x-b)^2}{2c^2}} \,\mathrm{d}x = ac\sqrt{2\pi}.    (5.36)

Proof. By the trick of combining two one-dimensional Gaussians and the polar coordinate transformation, we have

    \int_{-\infty}^{+\infty} e^{-x^2} \,\mathrm{d}x = \sqrt{ \Big( \int_{-\infty}^{+\infty} e^{-x^2} \,\mathrm{d}x \Big) \Big( \int_{-\infty}^{+\infty} e^{-y^2} \,\mathrm{d}y \Big) }
      = \sqrt{ \int_{-\infty}^{+\infty} \int_{-\infty}^{+\infty} e^{-(x^2 + y^2)} \,\mathrm{d}x\,\mathrm{d}y }
      = \sqrt{ \int_0^{2\pi} \int_0^{+\infty} e^{-r^2} r \,\mathrm{d}r\,\mathrm{d}\theta }
      = \sqrt{ 2\pi \cdot \Big[ -\frac{1}{2} e^{-r^2} \Big]_0^{+\infty} } = \sqrt{\pi},

and hence

    \int_{-\infty}^{+\infty} a e^{-\frac{(x-b)^2}{2c^2}} \,\mathrm{d}x = \sqrt{2}\, ac \int_{-\infty}^{+\infty} e^{-y^2} \,\mathrm{d}y = ac\sqrt{2\pi},

where the first step follows from the transformation $x = b + \sqrt{2}\, c\, y$.

Definition 5.39. A normal distribution or Gaussian distribution is a continuous probability distribution of the form

    f_{\mu,\sigma}(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\Big( -\frac{(x - \mu)^2}{2\sigma^2} \Big),    (5.37)

where $\mu$ is the mean or expectation and $\sigma$ is the standard deviation.

Definition 5.40. The Dirac delta function $\delta(x - \bar x)$ centered at $\bar x$ is

    \delta(x - \bar x) = \lim_{\epsilon \to 0} \phi_\epsilon(x - \bar x),    (5.38)

where $\phi_\epsilon(x - \bar x) = f_{\bar x, \epsilon}$ is a normal distribution with its mean at $\bar x$ and its standard deviation as $\epsilon$.

Lemma 5.41. The Dirac delta function satisfies

    \delta(x - \bar x) = \begin{cases} +\infty, & x = \bar x, \\ 0, & x \ne \bar x; \end{cases}    (5.39a)

    \int_{-\infty}^{+\infty} \delta(x - \bar x) \,\mathrm{d}x = 1.    (5.39b)

Proof. These follow directly from Definitions 5.39 and 5.40 and Lemma 5.38.

Lemma 5.42 (Sifting property of $\delta$). If $f : \mathbb{R} \to \mathbb{R}$ is continuous, then

    \int_{-\infty}^{+\infty} \delta(x - \bar x) f(x) \,\mathrm{d}x = f(\bar x).    (5.40)

Proof. Since $I_\epsilon := [\bar x - \epsilon, \bar x + \epsilon]$ is a compact interval and $f(x)$ is continuous over $I_\epsilon$, $f(x)$ is bounded over $I_\epsilon$, say $f(x) \in [m, M]$. The nonnegativity of $\phi_\epsilon$ and the integral mean value theorem C.71 imply that

    (*): \int_{-\infty}^{+\infty} \delta(x - \bar x) f(x) \,\mathrm{d}x = \int_{\bar x - \epsilon}^{\bar x + \epsilon} \delta(x - \bar x) f(x) \,\mathrm{d}x

is bounded within the interval

    \Big[ m \int_{\bar x - \epsilon}^{\bar x + \epsilon} \delta(x - \bar x) \,\mathrm{d}x,\ M \int_{\bar x - \epsilon}^{\bar x + \epsilon} \delta(x - \bar x) \,\mathrm{d}x \Big].

It follows that

    \lim_{\epsilon \to 0} \int_{\bar x - \epsilon}^{\bar x + \epsilon} \delta(x - \bar x) f(x) \,\mathrm{d}x = f(\bar x) \lim_{\epsilon \to 0} \int_{\bar x - \epsilon}^{\bar x + \epsilon} \delta(x - \bar x) \,\mathrm{d}x = f(\bar x).

Apply $\lim_{\epsilon \to 0}$ to $(*)$ and we have (5.40).

Definition 5.43. The Heaviside function or step function is

    H(x) := \begin{cases} 0 & \text{if } x < 0; \\ 1 & \text{if } x \ge 0. \end{cases}    (5.41)

Lemma 5.44. The Dirac delta function and the Heaviside function are related as

    \int_{-\infty}^{x} \delta(t) \,\mathrm{d}t = H(x).    (5.42)

Proof. This follows from Definitions 5.40 and 5.43 and Lemma 5.41.
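Formula (5.36) is easy to check numerically (an illustrative sketch, not part of the notes): a trapezoidal sum over $[b - 10, b + 10]$ captures the Gaussian up to negligible tails and should reproduce $ac\sqrt{2\pi}$.

```python
# Numerical check of Lemma 5.38 for one choice of (a, b, c).
import math

def trap(f, lo, hi, n=4000):
    h = (hi - lo) / n
    ys = [f(lo + k * h) for k in range(n + 1)]
    return h * (0.5 * ys[0] + sum(ys[1:-1]) + 0.5 * ys[-1])

a, b, c = 2.0, 1.0, 0.5          # height, center, RMS width as in (5.35)
gauss = lambda x: a * math.exp(-(x - b) ** 2 / (2.0 * c * c))
approx = trap(gauss, b - 10.0, b + 10.0)   # tails beyond +-20 sigma are negligible
exact = a * c * math.sqrt(2.0 * math.pi)
```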

5.4.2 Reusing the formalism

Definition 5.45. Define a function $\lambda : \mathbb{R} \to \mathbb{R}$ by

    \lambda(t) = \begin{cases} 0 & \text{if } t \in (-\infty, a), \\ \int_a^t \rho(\tau) \,\mathrm{d}\tau & \text{if } t \in [a, b], \\ \int_a^b \rho(\tau) \,\mathrm{d}\tau & \text{if } t \in (b, +\infty). \end{cases}    (5.43)

Then a corresponding continuous measure $\mathrm{d}\lambda$ can be defined as

    \mathrm{d}\lambda = \begin{cases} \rho(t) \,\mathrm{d}t & \text{if } t \in [a, b], \\ 0 & \text{otherwise,} \end{cases}    (5.44)

where the support of the continuous measure $\mathrm{d}\lambda$ is the interval $[a, b]$.

Definition 5.46. The discrete measure or the Dirac measure associated with the point set $\{t_1, t_2, \ldots, t_N\}$ is a measure $\mathrm{d}\lambda$ that is nonzero only at the points $t_i$ and has the value $\rho_i$ there. The support of the discrete measure is the set $\{t_1, t_2, \ldots, t_N\}$.

Lemma 5.47. For a function $u : \mathbb{R} \to \mathbb{R}$, define

    \lambda(t) = \sum_{i=1}^{N} \rho_i H(t - t_i),    (5.45)

and we have

    \int_{\mathbb{R}} u(t) \,\mathrm{d}\lambda = \sum_{i=1}^{N} \rho_i u(t_i).    (5.46)

Proof. By (5.45) and Lemmas 5.42 and 5.44, we have

    \int_{\mathbb{R}} u(t) \,\mathrm{d}\lambda = \int_{\mathbb{R}} \sum_{i=1}^{N} \rho_i\, \delta(t - t_i)\, u(t) \,\mathrm{d}t = \sum_{i=1}^{N} \rho_i u(t_i).

5.4.3 DLS via normal equations

Example 5.48. Consider a table of sales records.

    x    1    2    3    4    5    6
    y    256  201  159  61   77   40
    x    7    8    9    10   11   12
    y    17   25   103  156  222  345

[Figure: the 12 data points and the fitted quadratic, which decreases from about 280 at $x = 1$ to a minimum near $x = 6$ and rises to about 350 at $x = 12$.]

From the plot of the discrete data, it appears that a quadratic polynomial would be a good fit. Hence we formulate the least square problem as finding the coefficients of a quadratic polynomial to minimize

    \sum_{i=1}^{12} \Big( y_i - \sum_{j=0}^{2} a_j x_i^j \Big)^2.

Reusing the procedures in Example 5.35, we have

    G(1, x, x^2) = \begin{pmatrix} \langle 1, 1 \rangle & \langle 1, x \rangle & \langle 1, x^2 \rangle \\ \langle x, 1 \rangle & \langle x, x \rangle & \langle x, x^2 \rangle \\ \langle x^2, 1 \rangle & \langle x^2, x \rangle & \langle x^2, x^2 \rangle \end{pmatrix} = \begin{pmatrix} 12 & 78 & 650 \\ 78 & 650 & 6084 \\ 650 & 6084 & 60710 \end{pmatrix},

    c = \begin{pmatrix} \langle y, 1 \rangle \\ \langle y, x \rangle \\ \langle y, x^2 \rangle \end{pmatrix} = \begin{pmatrix} \sum_{i=1}^{12} y_i \\ \sum_{i=1}^{12} y_i x_i \\ \sum_{i=1}^{12} y_i x_i^2 \end{pmatrix} = \begin{pmatrix} 1662 \\ 11392 \\ 109750 \end{pmatrix}.

Then the normal equations yield

    a = G^{-1} c = \begin{pmatrix} 386.00 \\ -113.43 \\ 9.04 \end{pmatrix}.

The corresponding polynomial is shown in the above plot.

5.4.4 DLS via QR decomposition

Definition 5.49. A matrix $A \in \mathbb{R}^{n \times n}$ is orthogonal iff $A^T A = I$.

Definition 5.50. A matrix $A$ is upper triangular iff

    \forall i, j, \quad i > j \Rightarrow a_{i,j} = 0.

Similarly, a matrix $A$ is lower triangular iff

    \forall i, j, \quad i < j \Rightarrow a_{i,j} = 0.

Theorem 5.51 (QR factorization). For any $A \in \mathbb{R}^{m \times n}$, there exist an orthogonal matrix $Q \in \mathbb{R}^{m \times m}$ and an upper triangular matrix $R \in \mathbb{R}^{m \times n}$ such that $A = QR$.

Proof. Rewrite $A = [\xi_1, \xi_2, \cdots, \xi_n] \in \mathbb{R}^{m \times n}$ and denote by $r$ the column rank of $A$. Construct a rank-$r$ matrix

    A_r = [u_1, u_2, \ldots, u_r]

by the following steps.

(S-1) Set $u_1 = \xi_{k_1}$ where $k_1$ satisfies $\xi_{k_1} \ne 0$ and $\forall \ell < k_1$, $\xi_\ell = 0$.

(S-2) For each $j = 2, \ldots, r$, set $u_j = \xi_{k_j}$ where $k_j$ satisfies that $K_j = (\xi_{k_1}, \ldots, \xi_{k_j})$ is a list of independent column vectors and, $\forall \ell \in R_j := \{k_{j-1} + 1, \ldots, k_j - 1\}$, $\xi_\ell$ can be expressed as a linear combination of the column vectors in $K_{j-1}$.
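The fit of Example 5.48 can be reproduced from the raw table (an illustrative sketch, not part of the notes): assemble the Gram matrix (5.26) and the right-hand side for the basis $(1, x, x^2)$ under the discrete inner product $\langle u, v \rangle = \sum_i u(x_i) v(x_i)$, then solve the $3 \times 3$ system by Cramer's rule.

```python
# Discrete least squares for the sales-record table via normal equations.
xs = list(range(1, 13))
ys = [256, 201, 159, 61, 77, 40, 17, 25, 103, 156, 222, 345]

# G[i][j] = sum_k x_k^(i+j), c[i] = sum_k y_k x_k^i.
G = [[float(sum(x ** (i + j) for x in xs)) for j in range(3)] for i in range(3)]
c = [float(sum(y * x ** i for x, y in zip(xs, ys))) for i in range(3)]

def det3(M):
    return (M[0][0] * (M[1][1] * M[2][2] - M[1][2] * M[2][1])
            - M[0][1] * (M[1][0] * M[2][2] - M[1][2] * M[2][0])
            + M[0][2] * (M[1][0] * M[2][1] - M[1][1] * M[2][0]))

def solve3(G, c):
    d = det3(G)
    return [det3([[c[i] if k == j else G[i][k] for k in range(3)]
                  for i in range(3)]) / d for j in range(3)]

a = solve3(G, c)
# G = [[12, 78, 650], [78, 650, 6084], [650, 6084, 60710]],
# c = [1662, 11392, 109750], and a is approximately (386.00, -113.43, 9.04).
```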

In plain words, (S-1) means that we skip over the leading zero vectors, and (S-2) states that, starting from $u_{j-1}$, we pick as $u_j$ the first vector that is not in $\operatorname{span}(u_1, u_2, \ldots, u_{j-1})$. By Corollary 5.16, the Gram-Schmidt process determines a unique orthogonal matrix $A_r^* = [u_1^*, u_2^*, \ldots, u_r^*] \in \mathbb{R}^{m \times r}$ and a unique upper triangular matrix such that

    A_r = A_r^* \begin{pmatrix} b_{11} & b_{21} & \ldots & b_{r1} \\ & b_{22} & \ldots & b_{r2} \\ & & \ddots & \vdots \\ & & & b_{rr} \end{pmatrix}.    (5.47)

By the definition of the column rank of a matrix, we have $r \le m$. In the rest of this proof, we insert each column vector in $X = \{\xi_1, \xi_2, \ldots, \xi_n\} \setminus \{u_1, u_2, \ldots, u_r\}$ back into (5.47) and show that the QR form of (5.47) is maintained. For the zero column vectors in (S-1), we have

    A_\xi = [\xi_1 \ldots \xi_{k_1 - 1}\ u_1\ u_2 \ldots u_r] = A_r^* \begin{pmatrix} 0 & \ldots & 0 & b_{11} & b_{21} & \ldots & b_{r1} \\ 0 & \ldots & 0 & & b_{22} & \ldots & b_{r2} \\ \vdots & \ddots & \vdots & & & \ddots & \vdots \\ 0 & \ldots & 0 & & & & b_{rr} \end{pmatrix}.    (5.48)

For each $\xi_\ell$ with $\ell \in R_j$ in (S-2), we have

    [u_1, u_2, \ldots, u_{j-1}, \xi_\ell] = [u_1^*, u_2^*, \ldots, u_{j-1}^*] \begin{pmatrix} b_{11} & \ldots & b_{j-1,1} & c_{\ell,1} \\ & \ddots & \vdots & \vdots \\ & & b_{j-1,j-1} & c_{\ell,j-1} \end{pmatrix},    (5.49)

where $\xi_\ell = c_{\ell,1} u_1^* + \ldots + c_{\ell,j-1} u_{j-1}^*$. With (5.48) as the induction base and (5.49) as the inductive step, it is straightforward to prove by induction that we have $A = A_r^* R$ where $R$ is an upper triangular matrix.

If $r = m$, Definitions 5.49 and 5.9 complete the proof. Otherwise $r < m$ and the proof is completed by the well-known fact in linear algebra that a list of orthonormal vectors can be extended to an orthonormal basis.

Lemma 5.52. An orthogonal matrix preserves the 2-norm of the vectors it acts on.

Proof. Definition 5.49 yields

    \forall x \in \operatorname{dom}(Q), \quad \|Qx\|_2^2 = x^T Q^T Q x = x^T x = \|x\|_2^2.

Theorem 5.53. Consider an over-determined linear system $Ax = b$ where $A \in \mathbb{R}^{m \times n}$ and $m \ge n$. The discrete linear least square problem

    \min_{x \in \mathbb{R}^n} \|Ax - b\|_2^2

is solved by $x^*$ satisfying

    R_1 x^* = c,    (5.50)

where $R_1 \in \mathbb{R}^{n \times n}$ and $c \in \mathbb{R}^n$ result from the QR factorization of $A$:

    Q^T A = R = \begin{pmatrix} R_1 \\ 0 \end{pmatrix}, \qquad Q^T b = \begin{pmatrix} c \\ r \end{pmatrix}.    (5.51)

Furthermore, the minimum is $\|r\|_2^2$.

Proof. For any $x \in \mathbb{R}^n$, we have

    \|Ax - b\|_2^2 = \|Q^T A x - Q^T b\|_2^2 = \|R_1 x - c\|_2^2 + \|r\|_2^2,

where the first step follows from Lemma 5.52.

5.5 Problems

5.5.1 Theoretical questions

I. Give a detailed proof of Theorem 5.7.

II. Consider the Chebyshev polynomials of the first kind.

(a) Show that they are orthogonal on $[-1, 1]$ with respect to the inner product in Theorem 5.7 with the weight function $\rho(x) = \frac{1}{\sqrt{1-x^2}}$.

(b) Normalize the first three Chebyshev polynomials to arrive at an orthonormal system.

III. Least-square approximation of a continuous function. Approximate the circular arc given by the equation $y(x) = \sqrt{1 - x^2}$ for $x \in [-1, 1]$ by a quadratic polynomial with respect to the inner product in Theorem 5.7:

(a) $\rho(x) = \frac{1}{\sqrt{1-x^2}}$ with Fourier expansion,

(b) $\rho(x) = \frac{1}{\sqrt{1-x^2}}$ with normal equations.

IV. Discrete least squares via orthonormal polynomials. Consider the table of sales records in Example 5.48.

(a) Starting from the independent list $(1, x, x^2)$, construct orthonormal polynomials by the Gram-Schmidt process using

    \langle u(t), v(t) \rangle = \sum_{i=1}^{N} \rho(t_i) u(t_i) v(t_i)    (5.52)

as the inner product with $N = 12$ and $\rho(x) = 1$.

(b) Find the best approximation $\hat\varphi = \sum_{i=0}^{2} a_i x^i$ such that $\|y - \hat\varphi\| \le \|y - \sum_{i=0}^{2} b_i x^i\|$ for all $b_i \in \mathbb{R}$. Verify that $\hat\varphi$ is the same as that of Example 5.48.

(c) Suppose there are other tables of sales records in the same format as that in Example 5.48: the values of $N$ and the $x_i$'s are the same, but the values of the $y_i$'s are different. Which of the above calculations can be reused? Which cannot be reused? What advantage of orthonormal polynomials over normal equations does this reuse imply?
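Theorem 5.53 can be tried on the data of Example 5.48 (an illustrative sketch, not part of the notes): build a thin QR factorization of the $12 \times 3$ Vandermonde-type matrix by Gram-Schmidt on its columns (cf. Corollary 5.16), then solve $R_1 a = Q^T b$ by back substitution. The result should match the normal-equation fit $(386.00, -113.43, 9.04)$.

```python
# Discrete least squares via a thin QR factorization.
import math

xs = list(range(1, 13))
ys = [256.0, 201.0, 159.0, 61.0, 77.0, 40.0, 17.0, 25.0, 103.0, 156.0, 222.0, 345.0]
A = [[1.0, float(x), float(x * x)] for x in xs]     # columns 1, x, x^2

m, n = len(A), 3
Q = [[0.0] * n for _ in range(m)]
R = [[0.0] * n for _ in range(n)]
for j in range(n):                                  # Gram-Schmidt on the columns
    v = [A[i][j] for i in range(m)]
    for k in range(j):
        R[k][j] = sum(Q[i][k] * A[i][j] for i in range(m))
        v = [v[i] - R[k][j] * Q[i][k] for i in range(m)]
    R[j][j] = math.sqrt(sum(vi * vi for vi in v))
    for i in range(m):
        Q[i][j] = v[i] / R[j][j]

c = [sum(Q[i][j] * ys[i] for i in range(m)) for j in range(n)]   # Q^T b
a = [0.0] * n
for j in reversed(range(n)):                        # back substitution: R a = c
    a[j] = (c[j] - sum(R[j][k] * a[k] for k in range(j + 1, n))) / R[j][j]
```

This uses the reduced factorization ($Q$ with orthonormal columns only), which solves the same system (5.50) as the full factorization of Theorem 5.53.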

5.5.2 Programming assignments

A. Write a program to perform discrete least squares via normal equations. Your subroutine should take two arrays x and y as the input and output three coefficients $a_0, a_1, a_2$ that determine a quadratic polynomial as the best fitting polynomial in the sense of least squares with the weight function $\rho = 1$. Run your subroutine on the following data.

    x    0.0   0.5   1.0   1.5   2.0   2.5   3.0
    y    2.9   2.7   4.8   5.3   7.1   7.6   7.7
    x    3.5   4.0   4.5   5.0   5.5   6.0   6.5
    y    7.6   9.4   9.0   9.6   10.0  10.2  9.7
    x    7.0   7.5   8.0   8.5   9.0   9.5   10.0
    y    8.3   8.4   9.0   8.3   6.6   6.7   4.1

B. Write a program to solve the previous discrete least square problem via QR factorization. Report the condition number based on the 2-norm of the matrix $G$ in the normal-equation approach and that of the matrix $R_1$ in the QR-factorization approach, verifying that the former is much larger than the latter.
Chapter 6

Numerical Integration and Differentiation

Definition 6.1. A weighted quadrature formula $I_n(f)$ is a linear functional

    I_n(f) := \sum_{k=1}^{n} w_k f(x_k)    (6.1)

that approximates the integral of a function $f \in C[a,b]$,

    I(f) := \int_a^b f(x) \rho(x) \,\mathrm{d}x,    (6.2)

where the weight function $\rho \in C[a,b]$ satisfies $\forall x \in (a,b)$, $\rho(x) > 0$. The points $x_k$ at which the integrand $f$ is evaluated are called nodes or abscissas, and the multipliers $w_k$ are called weights or coefficients.

Example 6.2. If $a$ and/or $b$ are infinite, $I(f)$ and $I_n(f)$ in (6.1) may still be well defined if the moment of the weight function

    \mu_j := \int_a^b x^j \rho(x) \,\mathrm{d}x    (6.3)

exists and is finite for all $j \in \mathbb{N}$.

6.1 Accuracy and convergence

Definition 6.3. The remainder, or error, of $I_n(f)$ is

    E_n(f) := I(f) - I_n(f).    (6.4)

$I_n(f)$ is said to be convergent for $C[a,b]$ iff

    \forall f \in C[a,b], \quad \lim_{n \to +\infty} I_n(f) = I(f).    (6.5)

Definition 6.4. A subset $V \subset C[a,b]$ is dense in $C[a,b]$ iff

    \forall f \in C[a,b],\ \forall \epsilon > 0,\ \exists f_\epsilon \in V,\ \text{s.t.}\ \max_{x \in [a,b]} |f(x) - f_\epsilon(x)| \le \epsilon.    (6.6)

Theorem 6.5. Let $\{I_n(f) : n \in \mathbb{N}^+\}$ be a sequence of quadrature formulas that approximate $I(f)$, where $I_n$ and $I(f)$ are defined in (6.1) and (6.2). Let $V$ be a dense subset of $C[a,b]$. $I_n(f)$ is convergent for $C[a,b]$ if and only if

(a) $\forall f \in V$, $\lim_{n \to +\infty} I_n(f) = I(f)$;

(b) $\exists B \in \mathbb{R}$ s.t. $\forall n \in \mathbb{N}^+$, $W_n := \sum_{k=1}^{n} |w_k| \le B$.

Proof. For sufficiency, we need to prove that for any given $f$ we have $\lim_{n \to +\infty} I_n(f) = I(f)$. To this end, we find $f_\epsilon \in V$ such that (6.6) holds and define $K := \max_{x \in [a,b]} |f(x) - f_\epsilon(x)|$. Then we have

    |E_n(f)| \le |I(f) - I(f_\epsilon)| + |I(f_\epsilon) - I_n(f_\epsilon)| + |I_n(f_\epsilon) - I_n(f)|
             = \Big| \int_a^b [f(x) - f_\epsilon(x)] \rho(x) \,\mathrm{d}x \Big| + |I(f_\epsilon) - I_n(f_\epsilon)| + \Big| \sum_{k=1}^{n} w_k [f_\epsilon(x_k) - f(x_k)] \Big|
             \le K \Big[ \int_a^b \rho(x) \,\mathrm{d}x + \sum_{k=1}^{n} |w_k| \Big] + |I(f_\epsilon) - I_n(f_\epsilon)|,

where the first step follows from the triangle inequality, the second from Definition 6.1, and the third from the definition of $K$. The terms inside the brackets are bounded because of $\rho \in C[a,b]$ and condition (b). By condition (a), $|I(f_\epsilon) - I_n(f_\epsilon)|$ can be made arbitrarily small. Since $K$ can also be made arbitrarily small, we have (6.5).

For necessity, it is trivial to deduce (a) from (6.5). In contrast, it is nontrivial to deduce (b) from (6.5), as the process involves some key theorems in functional analysis. A reader not familiar with the principle of uniform boundedness may skip the rest of the proof.

The numerical quadrature formula $I_n : C[a,b] \to \mathbb{R}$ is a linear functional and is continuous at $f = 0$ because of Definition E.58 and the fact that

    \forall \epsilon > 0,\ \exists \delta = \frac{\epsilon}{2 \sum_k |w_k|},\ \text{s.t.}\ \forall f \in C[a,b],\ \|f - 0\|_\infty < \delta \Rightarrow |I_n(f) - I_n(0)| = \Big| \sum_k w_k f(x_k) \Big| \le \sum_k |w_k| |f(x_k)| \le \delta \sum_k |w_k| < \epsilon.

By Theorem E.96, each $I_n$ is continuous, i.e. $I_n \in \mathcal{CL}(C[a,b], \mathbb{R})$ for each $n \in \mathbb{N}^+$, and the convergence (6.5) implies

    \forall f \in C[a,b], \quad \sup_{n \in \mathbb{N}^+} |I_n(f)| < +\infty.

Then Theorem E.85 and the principle of uniform boundedness (Theorem E.148) yield (b). Note that the operator norm of $I_n$, by Lemma E.109, equals $W_n$.

Definition 6.6. A weighted quadrature formula (6.1) has (polynomial) degree of exactness $d_E$ iff

    \forall f \in \mathbb{P}_{d_E},\ E_n(f) = 0, \quad \text{and} \quad \exists g \in \mathbb{P}_{d_E+1},\ \text{s.t.}\ E_n(g) \ne 0,    (6.7)

where $\mathbb{P}_d$ denotes the set of polynomials of degree no more than $d$.

Example 6.7. By Definition 6.6, $d_E \ge 0$ implies that $\sum_k w_k$ is bounded, since $I_n(c) = c \int_a^b \rho(x) \,\mathrm{d}x$ holds for any constant $c \in \mathbb{R}$.

Lemma 6.8. Let $x_1, \ldots, x_n$ be given as distinct nodes of $I_n(f)$. If $d_E \ge n - 1$, then its weights can be deduced as

    \forall k = 1, \ldots, n, \quad w_k = \int_a^b \rho(x) \ell_k(x) \,\mathrm{d}x,    (6.8)

where $\ell_k(x)$ is the fundamental polynomial for pointwise interpolation in (2.9) applied to the given nodes,

    \ell_k(x) := \prod_{i=1,\, i \ne k}^{n} \frac{x - x_i}{x_k - x_i}.    (6.9)

Proof. Let $p_{n-1}(f; x)$ be the unique polynomial that interpolates $f$ at the distinct nodes, as in the theorem on the uniqueness of polynomial interpolation (Theorem 2.5). Then we have

    \sum_{k=1}^{n} w_k f(x_k) = \sum_{k=1}^{n} w_k p_{n-1}(f; x_k) = \int_a^b p_{n-1}(f; x) \rho(x) \,\mathrm{d}x = \int_a^b \sum_{k=1}^{n} \{\ell_k(x) f(x_k)\} \rho(x) \,\mathrm{d}x = \sum_{k=1}^{n} f(x_k) \int_a^b \rho(x) \ell_k(x) \,\mathrm{d}x,

where the first step follows from the interpolation conditions (2.4), the second from $d_E \ge n - 1$, and the third from the Lagrange formula and the uniqueness of $p_{n-1}(f; x)$. The proof is completed by setting $f$ to be the hat function $\hat B_k(x)$ (see Definition 3.21) for each $x_k$.

6.2 Newton-Cotes formulas

Definition 6.9. A Newton-Cotes formula is a formula (6.1) based on approximating $f(x)$ by interpolating it at uniformly spaced nodes $x_1, \ldots, x_n \in [a, b]$.

Definition 6.10. The trapezoidal rule is a formula (6.1) based on approximating $f(x)$ by the straight line that connects the points $(a, f(a))^T$ and $(b, f(b))^T$. In particular, for $\rho(x) \equiv 1$, it is simply

    I^T(f) = \frac{b - a}{2} [f(a) + f(b)].    (6.10)

Example 6.11. Derive the trapezoidal rule for the weight function $\rho(x) = x^{-1/2}$ on the interval $[0, 1]$. Note that one cannot apply (6.10) to $\rho(x) f(x)$ because $\rho(0) = \infty$. Instead, (6.8) yields

    w_1 = \int_0^1 x^{-1/2} (1 - x) \,\mathrm{d}x = \frac{4}{3}, \qquad w_2 = \int_0^1 x^{-1/2}\, x \,\mathrm{d}x = \frac{2}{3}.

Hence the formula is

    I^T(f) = \frac{2}{3} [2 f(0) + f(1)].    (6.11)

Theorem 6.12. For $f \in C^2[a, b]$ with weight function $\rho(x) \equiv 1$, the remainder of the trapezoidal rule satisfies

    \exists \zeta \in [a, b] \ \text{s.t.}\ E^T(f) = -\frac{(b - a)^3}{12} f''(\zeta).    (6.12)

Proof. By Theorem 2.5, the interpolating polynomial $p_1(f; x)$ is unique. Then we have

    E^T(f) = -\int_a^b \frac{f''(\xi(x))}{2} (x - a)(b - x) \,\mathrm{d}x = -\frac{f''(\zeta)}{2} \int_a^b (x - a)(b - x) \,\mathrm{d}x = -\frac{(b - a)^3}{12} f''(\zeta),

where the first step follows from Theorem 2.7 and the second step from the integral mean value theorem (Theorem C.71). Here we can apply Theorem C.71 because

    w(x) = (x - a)(b - x)

is always positive on $(a, b)$. Also note that $\xi$ is a function of $x$ while $\zeta$ is a constant depending only on $f$, $a$, and $b$.

Definition 6.13. Simpson's rule is a formula (6.1) based on approximating $f(x)$ by a quadratic polynomial that goes through the points $(a, f(a))^T$, $(b, f(b))^T$, and $(\frac{a+b}{2}, f(\frac{a+b}{2}))^T$. For $\rho(x) \equiv 1$, it is simply

    I^S(f) = \frac{b - a}{6} \Big[ f(a) + 4 f\Big(\frac{a + b}{2}\Big) + f(b) \Big].    (6.13)

Theorem 6.14. For $f \in C^4[a, b]$ with weight function $\rho(x) \equiv 1$, the remainder of Simpson's rule satisfies

    \exists \zeta \in [a, b] \ \text{s.t.}\ E^S(f) = -\frac{(b - a)^5}{2880} f^{(4)}(\zeta).    (6.14)

Proof. It is difficult to imitate the proof of Theorem 6.12, since $(x - a)(x - b)(x - \frac{a+b}{2})$ changes sign over $[a, b]$ and the integral mean value theorem is not applicable. To overcome this difficulty, we can formulate the interpolation via a Hermite problem so that Theorem C.71 can be applied. See problem I in Section 6.6 for the main steps.

Example 6.15. Consider the integral

    I = \int_{-4}^{4} \frac{\mathrm{d}x}{1 + x^2} = 2 \tan^{-1}(4) = 2.6516 \cdots    (6.15)

Let $n - 1$ be the number of sub-intervals that partition $[a, b]$ in Definition 6.9. As shown below, the Newton-Cotes formula appears to be non-convergent.
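The remainder formulas (6.12) and (6.14) can be checked directly on $f = \exp$ over $[0, 1]$ (an illustrative sketch, not part of the notes): the actual errors must lie in the intervals obtained by letting $\zeta$ sweep $[a, b]$.

```python
# Single-interval trapezoidal and Simpson rules and their remainders.
import math

def trap_rule(f, a, b):                 # (6.10)
    return 0.5 * (b - a) * (f(a) + f(b))

def simpson_rule(f, a, b):              # (6.13)
    return (b - a) / 6.0 * (f(a) + 4.0 * f(0.5 * (a + b)) + f(b))

I = math.e - 1.0                        # exact value of the integral of e^x over [0, 1]
eT = I - trap_rule(math.exp, 0.0, 1.0)      # = -(1/12)   e^zeta for some zeta in [0, 1]
eS = I - simpson_rule(math.exp, 0.0, 1.0)   # = -(1/2880) e^zeta for some zeta in [0, 1]
# So eT lies in (-e/12, -1/12) and eS lies in (-e/2880, -1/2880).
```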

    n − 1      2        4        6        8        10
    I_{n−1}    5.4902   2.2776   3.3288   1.9411   3.5956

For equally spaced nodes, the interpolating polynomials have wilder and wilder oscillations as the degree increases. Consequently, condition (b) of Theorem 6.5 does not hold. Hence Newton-Cotes formulas are not convergent even for well-behaved functions in C[a, b]. In practice, Newton-Cotes formulas with n > 8 are seldom used.

6.3 Composite formulas

Definition 6.16. The composite trapezoidal rule for approximating I(f) in (6.2) with ρ(x) ≡ 1 is

    I_n^T(f) = (h/2) f(x_0) + h Σ_{k=1}^{n−1} f(x_k) + (h/2) f(x_n),    (6.16)

where h = (b − a)/n and x_k = a + kh.

Theorem 6.17. For f ∈ C²[a, b], the remainder of the composite trapezoidal rule satisfies

    ∃ξ ∈ (a, b) s.t. E_n^T(f) = −((b − a)/12) h² f''(ξ).    (6.17)

Proof. Apply Theorem 6.12 to the subintervals, sum up the errors, and we have

    E_n^T(f) = −((b − a)/12) h² [(1/n) Σ_{k=0}^{n−1} f''(ξ_k)].    (6.18)

The proof is completed by (6.18), the intermediate value Theorem C.39, and the fact f ∈ C²[a, b] ⇒ f'' ∈ C[a, b].

Definition 6.18. The composite Simpson's rule for approximating I(f) in (6.2) with ρ(x) ≡ 1 is

    I_n^S(f) = (h/3)[f(x_0) + 4f(x_1) + 2f(x_2) + 4f(x_3) + 2f(x_4) + · · · + 4f(x_{n−1}) + f(x_n)],    (6.19)

where h = (b − a)/n, x_k = a + kh, and n is even.

Theorem 6.19. For f ∈ C⁴[a, b] and n ∈ 2N⁺, the remainder of the composite Simpson's rule satisfies

    ∃ξ ∈ (a, b) s.t. E_n^S(f) = −((b − a)/180) h⁴ f^{(4)}(ξ).    (6.20)

Proof. Exercise.

Lemma 6.20. The composite trapezoidal rule satisfies

    ∀f ∈ T_{n−1}[0, 2π], E_n^T(f) = 0,    (6.21)

where T_n[0, 2π] is the class of trigonometric polynomials of degree at most n,

    T_n[0, 2π] := span{1, cos x, sin x, . . . , cos(nx), sin(nx)}.

Proof. It suffices to verify that (6.21) holds for the complex exponentials e_m(x) := e^{imx} = cos mx + i sin mx, m ∈ N, i.e.

    E_n^T(e_m) = ∫_0^{2π} e_m(x) dx − (2π/n) [(e_m(0) + e_m(2π))/2 + Σ_{k=1}^{n−1} e_m(2kπ/n)]
               = ∫_0^{2π} e^{imx} dx − (2π/n) Σ_{k=0}^{n−1} e^{imk·2π/n}.

Since ∫_0^{2π} e^{imx} dx = (im)^{−1} e^{imx} |_0^{2π} = 0, the geometric series yields

    E_n^T(e_m) =
      0                                                      if m = 0;
      −2π                                                    if m ≡ 0 (mod n), m > 0;    (6.22)
      −(2π/n) (1 − e^{imn·2π/n})/(1 − e^{im·2π/n}) = 0       if m ≢ 0 (mod n).

Hence (6.21) holds as E_n^T(e_m) = 0 for m = 0, . . . , n − 1.

6.4 Gauss formulas

Lemma 6.21. Let n, m ∈ N⁺ and m ≤ n. Given polynomials p = Σ_{i=0}^{n+m} p_i x^i ∈ P_{n+m} and s = Σ_{i=0}^{n} s_i x^i ∈ P_n satisfying p_{n+m} ≠ 0 and s_n ≠ 0, there exist unique polynomials q ∈ P_m and r ∈ P_{n−1} such that

    p = qs + r.    (6.23)

Proof. Rewrite (6.23) as

    Σ_{i=0}^{n+m} p_i x^i = (Σ_{i=0}^{m} q_i x^i)(Σ_{i=0}^{n} s_i x^i) + Σ_{i=0}^{n−1} r_i x^i.    (6.24)

Since monomials are linearly independent, (6.24) consists of n + m + 1 equations, the last m + 1 of which are

    p_{n+m}   = q_m s_n,
    p_{n+m−1} = q_m s_{n−1} + q_{m−1} s_n,
    · · ·
    p_n       = q_m s_{n−m} + . . . + q_0 s_n,

which can be written as Sq = p with S being a lower triangular matrix whose diagonal entries are s_n ≠ 0. The coefficient vector q can be determined uniquely from the coefficients of p and s. Then r is determined uniquely as p − qs from (6.24).

Definition 6.22. The node polynomial associated with the nodes x_k of a weighted quadrature formula is

    v_n(x) = Π_{k=1}^{n} (x − x_k).    (6.25)

Theorem 6.23. Suppose a quadrature formula (6.1) has d_E ≥ n − 1. Then it can be improved to have d_E ≥ n + j − 1 where j ∈ (0, n] by and only by imposing the additional conditions on its node polynomial and weight function:

    ∀p ∈ P_{j−1}, ∫_a^b v_n(x) p(x) ρ(x) dx = 0.    (6.26)

Proof. For the necessity, we have

    ∫_a^b v_n(x) p(x) ρ(x) dx = Σ_{k=1}^{n} w_k v_n(x_k) p(x_k) = 0,

where the first step follows from the facts d_E ≥ n + j − 1 and v_n(x) p(x) ∈ P_{n+j−1}, and the second step from (6.25).
To prove the sufficiency, we must show that E_n(p) = 0 for any p ∈ P_{n+j−1}. Lemma 6.21 yields

    ∀p ∈ P_{n+j−1}, ∃!q ∈ P_{j−1}, ∃!r ∈ P_{n−1}, s.t. p = q v_n + r.    (6.27)

Consequently, we have

    ∫_a^b p(x) ρ(x) dx = ∫_a^b q(x) v_n(x) ρ(x) dx + ∫_a^b r(x) ρ(x) dx
                       = ∫_a^b r(x) ρ(x) dx = Σ_{k=1}^{n} w_k r(x_k)
                       = Σ_{k=1}^{n} w_k [p(x_k) − q(x_k) v_n(x_k)] = Σ_{k=1}^{n} w_k p(x_k),

where the first step follows from (6.27), the second from (6.26), the third from the condition d_E ≥ n − 1, the fourth from (6.27), and the last from (6.25).

Definition 6.24. A Gaussian quadrature formula (or simply a Gauss formula) is a formula (6.1) whose nodes are the zeros of the polynomial v_n(x) in (6.25) that satisfies (6.26) for j = n.

Corollary 6.25. A Gauss formula has d_E = 2n − 1.

Proof. The index j in (6.26) cannot be n + 1 because the node polynomial v_n(x) ∈ P_n cannot be orthogonal to itself. Therefore j = n in Theorem 6.23 is optimal: the formula (6.1) achieves the highest degree of exactness 2n − 1. From an algebraic viewpoint, the 2n degrees of freedom of nodes and weights in (6.1) determine a polynomial of degree at most 2n − 1. The proof is completed by Theorem 6.23 and Definition 6.6.

Corollary 6.26. The weights of a Gauss formula I_n(f) are

    ∀k = 1, · · · , n, w_k = ∫_a^b [v_n(x) / ((x − x_k) v_n'(x_k))] ρ(x) dx,    (6.28)

where v_n(x) is the node polynomial that defines I_n(f).

Proof. This follows from Lemma 6.8; also see (2.11).

Example 6.27. Derive the Gauss formula of n = 2 for the weight function ρ(x) = x^{−1/2} on the interval [0, 1].
We first construct an orthogonal polynomial

    π(x) = c_0 − c_1 x + x²

such that

    ∀p ∈ P_1, ⟨p(x), π(x)⟩ := ∫_0^1 p(x) π(x) ρ(x) dx = 0,

which is equivalent to ⟨1, π(x)⟩ = 0 and ⟨x, π(x)⟩ = 0 because P_1 = span(1, x). These two conditions yield

    ∫_0^1 (c_0 − c_1 x + x²) x^{−1/2} dx = 2/5 + 2c_0 − (2/3)c_1 = 0,
    ∫_0^1 x (c_0 − c_1 x + x²) x^{−1/2} dx = 2/7 + (2/3)c_0 − (2/5)c_1 = 0.

Hence c_1 = 6/7, c_0 = 3/35, and the orthogonal polynomial is

    π(x) = 3/35 − (6/7)x + x²

with its zeros at

    x_1 = (1/7)(3 − 2√(6/5)),  x_2 = (1/7)(3 + 2√(6/5)).

To calculate w_1 and w_2, we could again use (6.8), but it is simpler to set up a linear system of equations by exploiting Corollary 6.25, i.e. the Gauss quadrature is exact for all constants and linear polynomials:

    w_1 + w_2 = ∫_0^1 x^{−1/2} dx = 2,
    x_1 w_1 + x_2 w_2 = ∫_0^1 x · x^{−1/2} dx = 2/3,

which yields

    w_1 = (−2x_2 + 2/3)/(x_1 − x_2) = 1 + (1/3)√(5/6),
    w_2 = (2x_1 − 2/3)/(x_1 − x_2) = 1 − (1/3)√(5/6).

The desired two-point Gauss formula is thus

    I_2^G(f) = (1 + (1/3)√(5/6)) f(3/7 − (2/7)√(6/5)) + (1 − (1/3)√(5/6)) f(3/7 + (2/7)√(6/5)).    (6.29)

The degree of exactness of the trapezoidal rule is 1 while that of the two-point Gauss formula is 3. Hence we expect the Gauss formula to be much more accurate. Indeed, calculating the errors of the two formulas (6.11) and (6.29) for f(x) = cos(½πx), we have

    E^T = 0.226453 . . . ;  E_2^G = 0.002197 . . . ,

which can be verified by simple calculations.

Definition 6.28. A set of orthogonal polynomials is a set of polynomials P = {p_i : deg(p_i) = i} that satisfy

    ∀p_i, p_j ∈ P, i ≠ j ⇒ ⟨p_i, p_j⟩ = 0.    (6.30)

Example 6.29. In this chapter, the inner product in (6.30) is taken to be

    ⟨p_i, p_j⟩ = ∫_a^b p_i(x) p_j(x) ρ(x) dx,

where [a, b] and ρ are the same as those in (6.2).
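The numbers in Example 6.27 can be checked numerically. The sketch below evaluates (6.29) for f(x) = cos(½πx) and compares it against a reference value of the weighted integral obtained via the substitution x = t², which removes the x^{−1/2} singularity; the substitution and the fine Simpson reference are choices made here, not part of the example.

```python
import math

# Nodes and weights of the two-point Gauss formula (6.29) for
# rho(x) = x**(-1/2) on [0, 1].
s = math.sqrt(6.0 / 5.0)
x1, x2 = (3.0 - 2.0 * s) / 7.0, (3.0 + 2.0 * s) / 7.0
w1 = 1.0 + math.sqrt(5.0 / 6.0) / 3.0
w2 = 1.0 - math.sqrt(5.0 / 6.0) / 3.0

f = lambda x: math.cos(0.5 * math.pi * x)
gauss = w1 * f(x1) + w2 * f(x2)

# Reference value of int_0^1 f(x) x^(-1/2) dx via x = t^2, which
# turns the integrand into the smooth g(t); composite Simpson in t.
n, h = 2000, 1.0 / 2000
g = lambda t: 2.0 * math.cos(0.5 * math.pi * t * t)
ref = g(0.0) + g(1.0) + sum((4 if k % 2 else 2) * g(k * h) for k in range(1, n))
ref *= h / 3.0

print(abs(ref - gauss))   # about 2.2e-3, matching E_2^G = 0.002197...
```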

Theorem 6.30. Each zero of a real orthogonal polynomial over [a, b] is real, simple, and inside (a, b).

Proof. For fixed n ≥ 1, suppose p_n(x) does not change sign in [a, b]. Then ∫_a^b ρ(x) p_n(x) dx = ⟨p_n, p_0⟩ ≠ 0. But this contradicts orthogonality. Hence there exists x_1 ∈ (a, b) such that p_n(x_1) = 0.
Suppose there were a multiple zero at x_1. Then p_n(x)/(x − x_1)² would be a polynomial of degree n − 2. Hence

    0 = ⟨p_n(x), p_n(x)/(x − x_1)²⟩ = ⟨1, p_n²(x)/(x − x_1)²⟩ > 0,

which is false. Therefore every zero is simple.
Suppose that only j < n zeros of p_n, say x_1, x_2, . . . , x_j, are inside (a, b) and all other zeros are outside (a, b). Let v_j(x) = Π_{i=1}^{j} (x − x_i) ∈ P_j. Then p_n v_j = P_{n−j} v_j², where P_{n−j} is a polynomial of degree n − j that does not change sign on [a, b]. Hence ⟨P_{n−j}, v_j²⟩ > 0, which contradicts the orthogonality of p_n(x) and v_j(x).

Corollary 6.31. All nodes of a Gauss formula are real, distinct, and contained in (a, b).

Proof. This follows directly from Definition 6.24 and Theorem 6.30.

Lemma 6.32. Gauss formulas have positive weights.

Proof. For each k = 1, 2, . . . , n, the definition of ℓ_k(x) in (6.9) implies ℓ_k² ∈ P_{2n−2}; then we have

    w_k = Σ_{j=1}^{n} w_j ℓ_k²(x_j) = ∫_a^b ρ(x) ℓ_k²(x) dx > 0,

where the first step follows from (6.9), the second step from d_E = 2n − 1, and the last step from the conditions on ρ.

Lemma 6.33. A Gauss formula satisfies

    Σ_{k=1}^{n} w_k = μ_0 ∈ (0, +∞).

Proof. This follows from setting j = 0 in (6.3) and applying the condition on ρ in Definition 6.1.

Theorem 6.34. Gauss formulas are convergent for C[a, b].

Proof. Denote by P the set of real polynomials. Theorem 2.51 states that P is dense in C[a, b], i.e. condition (a) in Theorem 6.5 holds. Condition (b) also holds because of Lemmas 6.33 and 6.32. Then the proof is completed by Theorem 6.5.

Theorem 6.35. For f ∈ C^{2n}[a, b], the remainder of a Gauss formula I_n(f) satisfies

    ∃ξ ∈ [a, b] s.t. E_n^G(f) = (f^{(2n)}(ξ)/(2n)!) ∫_a^b ρ(x) v_n²(x) dx,    (6.31)

where v_n is the node polynomial that defines I_n.

Proof. One proof was suggested in 1885 by Markov, a student of Chebyshev famous for his work in probability theory on certain random processes now known as Markov chains. See exercise IV in Section 6.6.1.

6.5 Numerical differentiation

Formula 6.36 (The method of undetermined coefficients). A general method to derive FD formulas that approximate u^{(k)}(x̄) is based on an arbitrary stencil of n > k distinct points x_1, x_2, . . . , x_n. Taylor expansions of u at each point x_i in the stencil about u(x̄) yield

    u(x_i) = u(x̄) + (x_i − x̄) u'(x̄) + · · · + (1/k!)(x_i − x̄)^k u^{(k)}(x̄) + · · ·

for i = 1, 2, . . . , n. This leads to a linear combination of point values that approximates u^{(k)}(x̄),

    u^{(k)}(x̄) = c_1 u(x_1) + c_2 u(x_2) + · · · + c_n u(x_n) + O(h^p),

where the c_j's are chosen to make p as large as possible:

    ∀i = 0, . . . , p − 1, (1/i!) Σ_{j=1}^{n} c_j (x_j − x̄)^i = 1 if i = k, and 0 otherwise.    (6.32)

Example 6.37. To approximate u'(x̄) with an FD formula

    D_2 u(x̄) = a u(x̄) + b u(x̄ − h) + c u(x̄ − 2h),    (6.33)

we determine the coefficients a, b, and c to give the best possible accuracy. Taylor expansions at x̄ yield

    D_2 u(x̄) = (a + b + c) u(x̄) − (b + 2c) h u'(x̄) + (1/2)(b + 4c) h² u''(x̄) − (1/6)(b + 8c) h³ u'''(x̄) + O(h⁴).

Set a + b + c = 0, b + 2c = −1/h, and b + 4c = 0, i.e. solve

    [1 1 1; 0 1 2; 0 1 4] (a, b, c)^T = (0, −1/h, 0)^T,    (6.34)

and we get

    a = 3/(2h),  b = −2/h,  c = 1/(2h).    (6.35)

Therefore the FD formula is determined as

    D_2 u(x̄) = (1/(2h)) [3u(x̄) − 4u(x̄ − h) + u(x̄ − 2h)].    (6.36)

Definition 6.38. In approximating a derivative, an FD formula is p-th order accurate if its error E has the form

    E(h) = Θ(h^p),    (6.37)

where h is the maximum distance of adjacent points in the stencil.
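The linear system (6.34) of Example 6.37 can be solved mechanically; a sketch, where the value h = 0.1 and the test function u = sin are arbitrary choices made here:

```python
import numpy as np

# Solve the 3-by-3 system (6.34) for the coefficients of the
# one-sided formula (6.33).
h = 0.1
A = np.array([[1.0, 1.0, 1.0],
              [0.0, 1.0, 2.0],
              [0.0, 1.0, 4.0]])
a, b, c = np.linalg.solve(A, np.array([0.0, -1.0 / h, 0.0]))
print(a * 2 * h, b * 2 * h, c * 2 * h)   # recovers (6.35): 3/(2h), -2/h, 1/(2h)

# Check the resulting formula on u = sin at xbar = 1.
u, xbar = np.sin, 1.0
approx = a * u(xbar) + b * u(xbar - h) + c * u(xbar - 2 * h)
print(abs(approx - np.cos(xbar)))        # O(h^2) error
```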

Example 6.39. Consider approximating u'(x) at a point x̄ using the nearby function values u(x̄ ± h). Three commonly used formulas are

    D_+ u(x̄) := (u(x̄ + h) − u(x̄))/h,    (6.38)
    D_− u(x̄) := (u(x̄) − u(x̄ − h))/h,    (6.39)
    D_0 u(x̄) := (u(x̄ + h) − u(x̄ − h))/(2h) = (1/2)(D_+ + D_−) u(x̄).    (6.40)

For u(x) = sin(x) and x̄ = 1, we calculate the errors of the above three formulas in approximating u'(1) = cos(1) ≈ 0.5403023 with h = 0.01 and 0.005. Define the error as E := Du(x̄) − u'(x̄); the following table shows the values of E for the three formulas.

    h                 D_+ u(x̄)   D_− u(x̄)   D_0 u(x̄)
    h_1 = 1.0e-2      −4.2e-3     4.2e-3     −9.00e-6
    h_2 = 5.0e-3      −2.1e-3     2.1e-3     −2.25e-6
    p = log₂ |E(Du(x̄, h_1))/E(Du(x̄, h_2))|    1    1    2

The last row of the table shows that the error behaves like E(h) ≈ Ch^p, which means reduction of h by a factor of r leads to error reduction by a factor of r^p. Geometrically, each of these formulas approximates the slope of the tangent line at x̄ by the slope of a secant line through nearby points on the graph of u.

Exercise 6.40. Show that the FD formulas D_+ u(x̄) and D_− u(x̄) are first-order accurate while D_0 u(x̄) is second-order accurate.

Exercise 6.41. Construct a table of divided differences (as in Definition 2.18) to derive a quadratic polynomial that agrees with u(x) at x̄, x̄ − h, and x̄ − 2h. Then take the derivative of this polynomial to obtain the FD formula (6.36).

Lemma 6.42. In approximating the second derivative of u ∈ C⁴(R), the formula

    D² u(x̄) = (u(x̄ − h) − 2u(x̄) + u(x̄ + h))/h²    (6.41)

is second-order accurate. Furthermore, if the input function values u(x̄ − h), u(x̄), and u(x̄ + h) are perturbed with random errors ε ∈ [−E, E], then there exists ξ ∈ [x̄ − h, x̄ + h] such that

    |u''(x̄) − D² u(x̄)| ≤ (h²/12) |u^{(4)}(ξ)| + 4E/h².    (6.42)

6.6 Problems

6.6.1 Theoretical questions

I. Simpson's rule.

   (a) Show that Simpson's rule on [−1, 1] can be obtained by

       ∫_{−1}^{1} y(t) dt = ∫_{−1}^{1} p_3(y; −1, 0, 0, 1; t) dt + E^S(y),

       where y ∈ C⁴[−1, 1] and p_3(y; −1, 0, 0, 1; t) is the interpolation polynomial of y that satisfies p_3(−1) = y(−1), p_3(0) = y(0), p_3'(0) = y'(0), and p_3(1) = y(1).

   (b) Derive E^S(y).

   (c) Using (a), (b), and a change of variable, derive the composite Simpson's rule and prove the theorem on its error estimation.

II. Estimate the number of subintervals required to approximate ∫_0^1 e^{−x²} dx to 6 correct decimal places, i.e. the absolute error is no greater than 0.5 × 10^{−6},

   (a) by the composite trapezoidal rule,
   (b) by the composite Simpson's rule.

III. Gauss-Laguerre quadrature formula.

   (a) Construct a polynomial π_2(t) = t² + at + b that is orthogonal to P_1 with respect to the weight function ρ(t) = e^{−t}, i.e.

       ∀p ∈ P_1, ∫_0^{+∞} p(t) π_2(t) ρ(t) dt = 0.

       (hint: ∫_0^{+∞} t^m e^{−t} dt = m!)

   (b) Derive the two-point Gauss-Laguerre quadrature formula

       ∫_0^{+∞} f(t) e^{−t} dt = w_1 f(t_1) + w_2 f(t_2) + E_2(f)

       and express E_2(f) in terms of f^{(4)}(τ) for some τ > 0.

   (c) Apply the formula in (b) to approximate

       I = ∫_0^{+∞} e^{−t}/(1 + t) dt.

       Use the remainder to estimate the error and compare your estimate with the true error. With the true error, identify the unknown quantity τ contained in E_2(f). (hint: use the exact value I = 0.596347361 · · · )

IV. Remainder of Gauss formulas. Consider the Hermite interpolation problem: find p ∈ P_{2n−1} such that

    ∀m = 1, 2, . . . , n, p(x_m) = f_m, p'(x_m) = f'_m.    (6.43)

There are elementary Hermite interpolation polynomials h_m, q_m such that the solution of (6.43) can be expressed in the form

    p(t) = Σ_{m=1}^{n} [h_m(t) f_m + q_m(t) f'_m],

analogous to the Lagrange interpolation formula.

   (a) Seek h_m and q_m in the form

       h_m(t) = (a_m + b_m t) ℓ_m²(t),  q_m(t) = (c_m + d_m t) ℓ_m²(t),

       where ℓ_m is the elementary Lagrange polynomial in (2.9). Determine the constants a_m, b_m, c_m, d_m.

   (b) Obtain the quadrature rule

       I_n(f) = Σ_{k=1}^{n} [w_k f(x_k) + μ_k f'(x_k)]

       that satisfies E_n(p) = 0 for all p ∈ P_{2n−1}.

   (c) What conditions on the node polynomial or on the nodes x_k must be imposed so that μ_k = 0 for each k = 1, 2, . . . , n?

V. Prove Lemma 6.42. How do you choose h to minimize the error bound in (6.42)? Design a fourth-order accurate formula based on a symmetric stencil, derive its error bound, and minimize the error bound. What do you observe in comparing the second-order case and the fourth-order case?
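As a numerical companion to Lemma 6.42 and problem V, the sketch below checks the O(h²) behavior of (6.41) on clean data and verifies the bound (6.42) on perturbed data; the noise level E and the test function u = sin are arbitrary choices made here.

```python
import math, random

def d2(um, u0, up, h):
    """Central second difference (6.41)."""
    return (um - 2.0 * u0 + up) / (h * h)

u, xbar = math.sin, 1.0
exact = -math.sin(xbar)   # u''(1)

# Clean data: halving h divides the error by about 4 (second order).
errs = []
for h in (0.1, 0.05):
    errs.append(abs(d2(u(xbar - h), u(xbar), u(xbar + h), h) - exact))
print(errs[0] / errs[1])   # about 4

# Perturbed data: the error obeys (6.42); |u^(4)| <= 1 for u = sin.
random.seed(0)
E, h = 1e-8, 1e-2
um, u0, up = (u(x) + random.uniform(-E, E) for x in (xbar - h, xbar, xbar + h))
err = abs(d2(um, u0, up, h) - exact)
bound = h * h / 12.0 + 4.0 * E / (h * h)
print(err <= bound)   # True
```

Minimizing the bound h²|u⁽⁴⁾|/12 + 4E/h² over h is exactly the optimization asked for in problem V.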
Appendix A

Sets, Logic, and Functions

A.1 First-order logic

Definition A.1. A set S is a collection of distinct objects that share a common quality; it is often denoted with the following notation

    S = {x | the conditions that x satisfies}.    (A.1)

Notation 4. R, Z, N, Q, C denote the sets of real numbers, integers, natural numbers, rational numbers, and complex numbers, respectively; R⁺, Z⁺, N⁺, Q⁺ the sets of positive such numbers. In particular, N contains the number zero while N⁺ does not.

Definition A.2. S is a subset of U, written S ⊆ U, if and only if (iff) x ∈ S ⇒ x ∈ U. S is a proper subset of U, written S ⊂ U, if S ⊆ U and ∃x ∈ U s.t. x ∉ S.

Definition A.3 (Statements of first-order logic). A universal statement is a logical statement of the form

    U = (∀x ∈ S, A(x)).    (A.2)

An existential statement has the form

    E = (∃x ∈ S, s.t. A(x)),    (A.3)

where ∀ ("for each") and ∃ ("there exists") are the quantifiers, S is a set, "s.t." means "such that," and A(x) is the formula. A statement of implication/conditional has the form

    A ⇒ B.    (A.4)

Example A.4. Universal and existential statements:
    ∀x ∈ [2, +∞), x > 1;
    ∀x ∈ R⁺, x > 1;
    ∃p, q ∈ Z, s.t. p/q = 2;
    ∃p, q ∈ Z, s.t. √p = √q + 1.

Definition A.5. Uniqueness quantification or unique existential quantification, written ∃! or ∃=1, indicates that exactly one object with a certain property exists.

Exercise A.6. Express the logical statement ∃!x, s.t. A(x) with ∃, ∀, and ⇔.

Definition A.7. A universal-existential statement is a logical statement of the form

    UE = (∀x ∈ S, ∃y ∈ T s.t. A(x, y)).    (A.5)

An existential-universal statement has the form

    EU = (∃y ∈ T, s.t. ∀x ∈ S, A(x, y)).    (A.6)

Example A.8. True or false:
    ∀x ∈ [2, +∞), ∃y ∈ Z⁺ s.t. xy < 10⁵;
    ∃y ∈ R s.t. ∀x ∈ [2, +∞), x > y;
    ∃y ∈ R s.t. ∀x ∈ [2, +∞), x < y.

Example A.9 (Translating an English statement into a logical statement). Goldbach's conjecture states that every even natural number greater than 2 is the sum of two primes. Let P ⊂ N⁺ denote the set of prime numbers. Then Goldbach's conjecture is ∀a ∈ 2N⁺ + 2, ∃p, q ∈ P, s.t. a = p + q.

Theorem A.10. The existential-universal statement implies the corresponding universal-existential statement, but not vice versa.

Example A.11 (Translating a logical statement to an English statement). Let S be the set of all human beings.
    UE = (∀p ∈ S, ∃q ∈ S s.t. q is p's mom.)
    EU = (∃q ∈ S s.t. ∀p ∈ S, q is p's mom.)
UE is probably true, but EU is certainly false. If EU were true, then UE would be true. Why?

Axiom A.12 (First-order negation of logical statements). The negations of the statements in Definition A.3 are

    ¬U = (∃x ∈ S, s.t. ¬A(x)).    (A.7)
    ¬E = (∀x ∈ S, ¬A(x)).    (A.8)

Rule A.13. The negation of a more complicated logical statement abides by the following rules:

• switch the type of each quantifier until you reach the last formula without quantifiers;
• negate the last formula.

In particular, the negation of an implication formula P ⇒ Q is P ∧ ¬Q.

Example A.14 (The negation of Goldbach's conjecture). ∃a ∈ 2N⁺ + 2 s.t. ∀p, q ∈ P, a ≠ p + q.
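The universal-existential form of Goldbach's conjecture in Example A.9 lends itself to a finite check; a sketch, where the cutoff 1000 is an arbitrary choice:

```python
def is_prime(n):
    """Trial division; adequate for the small range checked here."""
    if n < 2:
        return False
    d = 2
    while d * d <= n:
        if n % d == 0:
            return False
        d += 1
    return True

def goldbach_witness(a):
    """Return (p, q) with p + q = a and p, q prime, or None."""
    for p in range(2, a // 2 + 1):
        if is_prime(p) and is_prime(a - p):
            return (p, a - p)
    return None

# For each a in 2N+ + 2 up to the cutoff, a witness pair exists.
print(all(goldbach_witness(a) is not None for a in range(4, 1000, 2)))
```

A single even number with no witness would realize the negation in Example A.14.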

Exercise A.15. Negate the logical statement in Definition C.64.

Axiom A.16 (Contraposition). A conditional statement is logically equivalent to its contrapositive:

    (A ⇒ B) ⇔ (¬B ⇒ ¬A).    (A.9)

Example A.17. "If Jack is a man, then Jack is a human being." is equivalent to "If Jack is not a human being, then Jack is not a man."

Exercise A.18. Draw an Euler diagram of subsets to illustrate Example A.17.

Exercise A.19. Rewrite each of the following statements and its negation into logical statements using symbols, quantifiers, and formulas.

(a) The only even prime is 2.
(b) Multiplication of integers is associative.
(c) Goldbach's conjecture has at most a finite number of counterexamples.

A.2 Ordered sets

Definition A.20. The Cartesian product X × Y between two sets X and Y is the set of all possible ordered pairs with first element from X and second element from Y:

    X × Y = {(x, y) | x ∈ X, y ∈ Y}.    (A.10)

Axiom A.21 (Fundamental principle of counting). Consider a task that consists of a sequence of k independent steps. Let n_i denote the number of different choices for the i-th step; the total number of distinct ways to complete the task is

    Π_{i=1}^{k} n_i = n_1 n_2 · · · n_k.    (A.11)

Example A.22. Let A, E, D be the sets of appetizers, main entrees, and desserts in a restaurant. A × E × D is the set of possible dinner combos. If #A = 10, #E = 5, #D = 6, then #(A × E × D) = 300.

Definition A.23 (Maximum and minimum). Consider S ⊆ R, S ≠ ∅. If ∃s_m ∈ S s.t. ∀x ∈ S, x ≤ s_m, then s_m is the maximum of S and denoted by max S. If ∃s_m ∈ S s.t. ∀x ∈ S, x ≥ s_m, then s_m is the minimum of S and denoted by min S.

Definition A.24 (Upper and lower bounds). Consider S ⊆ R, S ≠ ∅. a is an upper bound of S ⊆ R if ∀x ∈ S, x ≤ a; then the set S is said to be bounded above. a is a lower bound of S if ∀x ∈ S, x ≥ a; then the set S is said to be bounded below. S is bounded if it is bounded above and bounded below.

Definition A.25 (Supremum and infimum). Consider a nonempty set S ⊆ R. If S is bounded above and S has a least upper bound, then we call it the supremum of S and denote it by sup S. If S is bounded below and S has a greatest lower bound, then we call it the infimum of S and denote it by inf S.

Example A.26. If a set S ⊂ R has a maximum, we have max S = sup S.

Example A.27. sup[a, b] = sup[a, b) = sup(a, b] = sup(a, b).

Theorem A.28 (Existence and uniqueness of least upper bound). Every nonempty subset of R that is bounded above has exactly one least upper bound.

Corollary A.29. Every nonempty subset of R that is bounded below has a greatest lower bound.

Definition A.30. A binary relation between two sets X and Y is an ordered triple (X, Y, G) where G ⊆ X × Y. A binary relation on X is the relation between X and X. The statement (x, y) ∈ R is read "x is R-related to y," and denoted by xRy or R(x, y).

Definition A.31. An equivalence relation "∼" on A is a binary relation on A that satisfies ∀a, b, c ∈ A,

• a ∼ a (reflexivity);
• a ∼ b implies b ∼ a (symmetry);
• a ∼ b and b ∼ c imply a ∼ c (transitivity).

Definition A.32. A binary relation "≤" on some set S is a total order or linear order on S iff, ∀a, b, c ∈ S,

• a ≤ b and b ≤ a imply a = b (antisymmetry);
• a ≤ b and b ≤ c imply a ≤ c (transitivity);
• a ≤ b or b ≤ a (totality).

A set equipped with a total order is a chain or totally ordered set.

Example A.33. The real numbers with less or equal.

Example A.34. The English letters of the alphabet with dictionary order.

Example A.35. The Cartesian product of a set of totally ordered sets with the lexicographical order.

Example A.36. Sort your books in lexicographical order and save a lot of time: log₂₆ N ≪ N!

Definition A.37. A binary relation "≤" on some set S is a partial order on S iff, ∀a, b, c ∈ S, antisymmetry, transitivity, and reflexivity (a ≤ a) hold. A set equipped with a partial order is called a poset.

Example A.38. The set of subsets of a set S ordered by inclusion "⊆."

Example A.39. The natural numbers equipped with the relation of divisibility.

Example A.40. The set of stuff you will put on your body every morning with the time ordered: undershorts, pants, belt, shirt, tie, jacket, socks, shoes, watch.

Example A.41. Inheritance ("is-a" relation) is a partial order. A → B reads "B is a special type of A".
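The axioms in Definitions A.32 and A.37 can be checked exhaustively on a small set. The sketch below confirms that divisibility (Example A.39) is a partial order, but not a total order, on {1, . . . , 30}; the set size 30 is an arbitrary choice.

```python
# "a <= b" here means "a divides b" on S = {1, ..., 30}.
S = range(1, 31)
leq = lambda a, b: b % a == 0

reflexive     = all(leq(a, a) for a in S)
antisymmetric = all(a == b or not (leq(a, b) and leq(b, a))
                    for a in S for b in S)
transitive    = all(leq(a, c) or not (leq(a, b) and leq(b, c))
                    for a in S for b in S for c in S)
total         = all(leq(a, b) or leq(b, a) for a in S for b in S)

print(reflexive, antisymmetric, transitive, total)  # True True True False
```

The failed totality (e.g. neither 2 | 3 nor 3 | 2) is exactly what separates a poset from a chain.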

Example A.42. Composition ("has-a" relation) is also a partial order. A B reads "B has an instance/object of A."

Example A.43. Implication "⇒" is a partial order on the set of logical statements.

Example A.44. The set of definitions, axioms, propositions, theorems, lemmas, etc., is a poset with inheritance, composition, and implication. It is helpful to relate them with these partial orderings.

Definition A.45. An upper bound of a subset W of a poset M is an element u ∈ M such that x ≤ u for each x ∈ W. A maximal element of a poset M is an m ∈ M such that

    ∀x ∈ M, x ≥ m ⇒ x = m.    (A.12)

Axiom A.46 (Zorn's lemma). For a nonempty poset M, if every chain in M has an upper bound, then M has at least one maximal element.

Lemma A.47 (The Union Lemma). Let X be a set and C be a collection of subsets of X. Assume that for each x ∈ X, there is a set A_x in C such that x ∈ A_x. Then ∪_{x∈X} A_x = X.

A.3 Functions

Definition A.48. A function/map/mapping f from X to Y, written f : X → Y or X ↦ Y, is a subset of the Cartesian product X × Y satisfying that ∀x ∈ X, there is exactly one y ∈ Y s.t. (x, y) ∈ f. X and Y are the domain and range of f, respectively.

Definition A.49. A function f : X → Y is said to be injective or one-to-one iff

    ∀x_1 ∈ X, ∀x_2 ∈ X, x_1 ≠ x_2 ⇒ f(x_1) ≠ f(x_2).    (A.13)

It is surjective or onto iff

    ∀y ∈ Y, ∃x ∈ X, s.t. y = f(x).    (A.14)

It is bijective iff it is both injective and surjective.

Definition A.50. A set S is countably infinite iff there exists a bijective function f : S → N⁺ that maps S to N⁺. A set is countable if it is either finite or countably infinite.

Example A.51. Are the integers countable? Are the rationals countable? Are the real numbers countable?

Definition A.52. A binary function or a binary operation on a set S is a map S × S → S.
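The first question of Example A.51 can be answered affirmatively by exhibiting an explicit bijection Z → N⁺; a sketch, where the interleaving 0, −1, 1, −2, 2, . . . is one standard choice:

```python
# to_nat sends 0, 1, 2, ... to the odd numbers 1, 3, 5, ... and
# -1, -2, ... to the even numbers 2, 4, ...; to_int undoes it.
def to_nat(z):
    return 2 * z + 1 if z >= 0 else -2 * z

def to_int(n):
    return n // 2 if n % 2 == 1 else -(n // 2)

zs = range(-50, 51)
print(sorted(to_nat(z) for z in zs) == list(range(1, 102)))  # True
print(all(to_int(to_nat(z)) == z for z in zs))               # True
```

A similar zig-zag enumeration of fractions handles the rationals; Cantor's diagonal argument shows no such bijection exists for the reals.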
Appendix B

Linear Algebra

B.1 Vector spaces

Definition B.1. A field F is a set together with two binary operations, usually called "addition" and "multiplication" and denoted by "+" and "∗", such that ∀a, b, c ∈ F, the following axioms hold:

• commutativity: a + b = b + a, ab = ba;
• associativity: a + (b + c) = (a + b) + c, a(bc) = (ab)c;
• identity: a + 0 = a, a1 = a;
• invertibility: a + (−a) = 0, aa⁻¹ = 1 (a ≠ 0);
• distributivity: a(b + c) = ab + ac.

Definition B.2. A vector space or linear space over a field F is a set V together with two binary operations "+" and "×" respectively called vector addition and scalar multiplication that satisfy the following axioms:

(VSA-1) commutativity: ∀u, v ∈ V, u + v = v + u;
(VSA-2) associativity: ∀u, v, w ∈ V, (u + v) + w = u + (v + w);
(VSA-3) compatibility: ∀u ∈ V, ∀a, b ∈ F, (ab)u = a(bu);
(VSA-4) additive identity: ∃0 ∈ V s.t. ∀u ∈ V, u + 0 = u;
(VSA-5) additive inverse: ∀u ∈ V, ∃v ∈ V s.t. u + v = 0;
(VSA-6) multiplicative identity: ∃1 ∈ F s.t. ∀u ∈ V, 1u = u;
(VSA-7) distributive laws: ∀u, v ∈ V, ∀a, b ∈ F, (a + b)u = au + bu and a(u + v) = au + av.

The elements of V are called vectors and the elements of F are called scalars.

Definition B.3. A real vector space or a complex vector space is a vector space with F = R or F = C, respectively.

Exercise B.4. Show that a complex vector space can also be considered as a real vector space.

Example B.5. The simplest vector space is {0}. Another simple example of a vector space over a field F is F itself, equipped with its standard addition and multiplication.

B.1.1 Subspaces

Definition B.6. A subset U of V is called a subspace of V if U is also a vector space.

Definition B.7. Suppose U_1, . . . , U_m are subsets of V. The sum of U_1, . . . , U_m is the set of all possible sums of elements of U_1, . . . , U_m:

    U_1 + . . . + U_m := {Σ_{j=1}^{m} u_j : u_j ∈ U_j}.    (B.1)

Example B.8. For U = {(x, x, y, y) ∈ F⁴ : x, y ∈ F} and W = {(x, x, x, y) ∈ F⁴ : x, y ∈ F}, we have

    U + W = {(x, x, z, y) ∈ F⁴ : x, y, z ∈ F}.

Lemma B.9. Suppose U_1, . . . , U_m are subspaces of V. Then U_1 + . . . + U_m is the smallest subspace of V that contains U_1, . . . , U_m.

Definition B.10. Suppose U_1, . . . , U_m are subspaces of V. The sum U_1 + . . . + U_m is called a direct sum if each element in U_1 + . . . + U_m can be written in only one way as a sum Σ_{j=1}^{m} u_j with u_j ∈ U_j for each j = 1, . . . , m. In this case we write the direct sum as U_1 ⊕ . . . ⊕ U_m.

Exercise B.11. Show that U_1 + U_2 + U_3 is not a direct sum:

    U_1 = {(x, y, 0) ∈ F³ : x, y ∈ F},
    U_2 = {(0, 0, z) ∈ F³ : z ∈ F},
    U_3 = {(0, y, y) ∈ F³ : y ∈ F}.

Lemma B.12. Suppose U_1, . . . , U_m are subspaces of V. Then U_1 + . . . + U_m is a direct sum if and only if the only way to write 0 as a sum Σ_{j=1}^{m} u_j, where u_j ∈ U_j for each j = 1, . . . , m, is by taking each u_j equal to 0.

Theorem B.13. Suppose U and W are subspaces of V. Then U + W is a direct sum if and only if U ∩ W = {0}.
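For Exercise B.11, by Lemma B.12 it suffices to write 0 as a nontrivial sum; a concrete witness (this particular decomposition is one of many):

```python
# U1 = {(x, y, 0)}, U2 = {(0, 0, z)}, U3 = {(0, y, y)} in F^3.
u1 = (0.0, -1.0, 0.0)   # in U1: x = 0, y = -1
u2 = (0.0, 0.0, -1.0)   # in U2: z = -1
u3 = (0.0, 1.0, 1.0)    # in U3: y = 1
zero = tuple(a + b + c for a, b, c in zip(u1, u2, u3))
print(zero)   # (0.0, 0.0, 0.0) although not every u_j is zero
```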

B.1.2 Span and linear independence

Definition B.14. A list of length n or n-tuple is an ordered collection of n elements (which might be numbers, other lists, or more abstract entities) separated by commas and surrounded by parentheses: x = (x_1, x_2, . . . , x_n).

Definition B.15. A vector space composed of all the n-tuples of a field F is known as a coordinate space, denoted by Fⁿ (n ∈ N⁺).

Example B.16. The properties of forces or velocities in the real world can be captured by a coordinate space R² or R³.

Example B.17. The set of continuous real-valued functions on the interval [a, b] forms a real vector space.

Notation 5. For a set S, define a vector space

    F^S := {f : S → F}.

Fⁿ is a special case of F^S because n can be regarded as the set {1, 2, . . . , n} and each element in Fⁿ can be considered as a function {1, 2, . . . , n} → F.

Definition B.18. A linear combination of a list of vectors {v_i} is a vector of the form Σ_i a_i v_i where a_i ∈ F.

Example B.19. (17, −4, 2) is a linear combination of (2, 1, −3), (1, −2, 4) because

    (17, −4, 2) = 6(2, 1, −3) + 5(1, −2, 4).

Example B.20. (17, −4, 5) is not a linear combination of (2, 1, −3), (1, −2, 4) because there do not exist numbers a_1, a_2 such that

    (17, −4, 5) = a_1(2, 1, −3) + a_2(1, −2, 4).

Solving from the first two equations yields a_1 = 6, a_2 = 5, but 5 ≠ −3 × 6 + 4 × 5.

Definition B.21. The span of a list of vectors (v_i) is the set of all linear combinations of (v_i),

    span(v_1, v_2, . . . , v_m) = {Σ_{i=1}^{m} a_i v_i : a_i ∈ F}.    (B.2)

In particular, the span of the empty set is {0}. We say that (v_1, v_2, . . . , v_m) spans V if V = span(v_1, v_2, . . . , v_m).

Example B.22.

    (17, −4, 2) ∈ span((2, 1, −3), (1, −2, 4));
    (17, −4, 5) ∉ span((2, 1, −3), (1, −2, 4)).

Definition B.23. A vector space V is called finite dimensional if some list of vectors spans V; otherwise it is infinite dimensional.

Example B.24. Let P_m(F) denote the set of all polynomials with coefficients in F and degree at most m,

    P_m(F) = {p : F → F; p(z) = Σ_{i=0}^{m} a_i z^i, a_i ∈ F}.    (B.3)

Then P_m(F) is a finite-dimensional vector space for each non-negative integer m. The set of all polynomials with coefficients in F, denoted by P(F) := P_{+∞}(F), is infinite-dimensional. Both are subspaces of F^F for F = R or C.

Definition B.25. A list of vectors (v_1, v_2, . . . , v_m) in V is called linearly independent iff

    a_1 v_1 + . . . + a_m v_m = 0 ⇒ a_1 = · · · = a_m = 0.    (B.4)

Otherwise the list of vectors is called linearly dependent.

Example B.26. The empty list is declared to be linearly independent. A list of one vector (v) is linearly independent iff v ≠ 0. A list of two vectors is linearly independent iff neither vector is a scalar multiple of the other.

Example B.27. The list (1, z, . . . , z^m) is linearly independent in P_m(F) for each m ∈ N.

Example B.28. The list (2, 3, 1), (1, −1, 2), (7, 3, 8) is linearly dependent in R³ because

    2(2, 3, 1) + 3(1, −1, 2) + (−1)(7, 3, 8) = (0, 0, 0).

Example B.29. Every list of vectors containing the 0 vector is linearly dependent.

Lemma B.30 (Linear dependence lemma). Suppose V = (v_1, v_2, · · · , v_m) is a linearly dependent list in V. Then there exists j ∈ {1, 2, . . . , m} such that

• v_j ∈ span(v_1, v_2, . . . , v_{j−1});
• if the jth term is removed from V, the span of the remaining list equals span(v_1, v_2, . . . , v_m).

Lemma B.31. In a finite-dimensional vector space, the length of every linearly independent list of vectors is less than or equal to the length of every spanning list of vectors.

B.1.3 Bases

Definition B.32. A basis of a vector space V is a list of vectors in V that is linearly independent and spans V.

Definition B.33. The standard basis of Fⁿ is the list of vectors

    (1, 0, · · · , 0)^T, (0, 1, 0, · · · , 0)^T, . . . , (0, · · · , 0, 1)^T.    (B.5)

Example B.34. (z⁰, z¹, . . . , z^m) is a basis of P_m(F) in (B.3).

Lemma B.35. A list of vectors (v_1, . . . , v_n) is a basis of V iff every vector u ∈ V can be written uniquely as

    u = Σ_{i=1}^{n} a_i v_i,    (B.6)

where a_i ∈ F.

Lemma B.36. Every spanning list in a vector space V can be reduced to a basis of V.

Lemma B.37. Every linearly independent list of vectors in a finite-dimensional vector space can be extended to a basis of that vector space.
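Linear dependence such as in Example B.28 can be detected numerically via matrix rank; a sketch, using `numpy.linalg.matrix_rank` as one convenient test:

```python
import numpy as np

# Rows are the three vectors of Example B.28; a rank deficit means
# the list is linearly dependent.
A = np.array([[2.0, 3.0, 1.0],
              [1.0, -1.0, 2.0],
              [7.0, 3.0, 8.0]])
print(np.linalg.matrix_rank(A))   # 2, so the three vectors are dependent

# The dependence coefficients (2, 3, -1) given in Example B.28:
coeffs = np.array([2.0, 3.0, -1.0])
print(coeffs @ A)
```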

B.1.4 Dimension

Lemma B.38. Any two bases of a finite-dimensional vector space have the same length.

Proof. Suppose B1 and B2 are two bases of V. Then B1 is linearly independent in V and B2 spans V. By Lemma B.31, the length of B1 is no greater than that of B2. The proof is completed by switching the roles of B1 and B2.

Definition B.39. The dimension of a finite-dimensional vector space V, denoted dim V, is the length of any basis of the vector space.

Lemma B.40. If V is finite-dimensional, then every spanning list of vectors in V with length dim V is a basis of V.

Lemma B.41. If V is finite-dimensional, then every linearly independent list of vectors in V with length dim V is a basis of V.

B.2 Linear maps

Definition B.42. A linear map or linear transformation between two vector spaces V and W is a function T : V → W that satisfies

(LNM-1) additivity: ∀u, v ∈ V, T(u + v) = Tu + Tv;

(LNM-2) homogeneity: ∀a ∈ F, ∀v ∈ V, T(av) = a(Tv),

where F is a scalar field. In particular, a linear map T : V → W is called a linear operator if W = V.

Notation 6. The set of all linear maps from V to W is denoted by L(V, W). The set of all linear operators from V to itself is denoted by L(V).

Example B.43. The differentiation operator on R[x] is a linear map T ∈ L(R[x], R[x]).

Example B.44. L(F^n, F^m) ≅ F^{m×n} is a vector space with the zero map 0 as the additive identity.

Lemma B.45. The set L(V, W), equipped with scalar multiplication (aT)v = a(Tv) and vector addition (S + T)v = Sv + Tv, is a vector space.

Proof. The scalar field F of L(V, W) is the same as that of V and W, so the multiplicative identity is still 1, the same as that of F. The additive identity, however, is the zero map 0 ∈ L(V, W).

Definition B.46. The identity map, denoted I, is the function on a vector space that assigns to each element the same element:

    Iv = v. (B.7)

Definition B.47. A complex linear functional is a linear map T : V → C with C being the underlying field of V. A real linear functional is a map T : V → R such that (LNM-1) and (LNM-2) in Definition B.42 hold for F = R.

Lemma B.48. Let V be a complex vector space and f a complex linear functional on V. Then the real part Re f(x) = u(x) is related to f by

    ∀x ∈ V, f(x) = u(x) − iu(ix). (B.8)

Proof. Any α, β ∈ R and z = α + iβ ∈ C satisfy

    z = Re z − i Re(iz).

Set z = f(x) and we have

    f(x) = Re f(x) − i Re(if(x)) = u(x) − i Re(f(ix)) = u(x) − iu(ix).

Lemma B.49. Let V be a complex vector space and u : V → R a real linear functional on V. Then the function f : V → C defined by (B.8) is a complex linear functional.

Proof. The additivity (LNM-1) of f follows from the additivity of u and (B.8). For any c ∈ R, we have f(cx) = cf(x) from (B.8). The rest follows from the additivity of f and

    f(ix) = u(ix) − iu(i²x) = u(ix) + iu(x) = i(u(x) − iu(ix)) = if(x).

B.2.1 Null spaces and ranges

Definition B.50. The null space of a linear map T ∈ L(V, W) is the subset of V consisting of those vectors that T maps to the additive identity 0:

    null T = {v ∈ V : Tv = 0}. (B.9)

Example B.51. The null space of the differentiation map in Example B.43 is R.

Definition B.52. The range of a linear map T ∈ L(V, W) is the subset of W consisting of those vectors that are of the form Tv for some v ∈ V:

    range T = {Tv : v ∈ V}. (B.10)

Example B.53. The range of A ∈ C^{m×n} is the span of its column vectors.

Theorem B.54 (The counting theorem or the fundamental theorem of linear maps). If V is a finite-dimensional vector space and T ∈ L(V, W), then range T is a finite-dimensional subspace of W and

    dim V = dim null T + dim range T. (B.11)

B.2.2 The matrix of a linear map

Definition B.55. The matrix of a linear map T ∈ L(V, W) with respect to the bases (v1, v2, . . . , vn) of V and (w1, w2, . . . , wm) of W, denoted by

    M_T := M(T, (v1, . . . , vn), (w1, . . . , wm)), (B.12)

is the m × n matrix A(T) whose entries a_{i,j} ∈ F satisfy the linear system

    ∀j = 1, 2, . . . , n,  T vj = Σ_{i=1}^m a_{i,j} wi. (B.13)
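The counting theorem (B.11) can be checked numerically for a concrete matrix map Tx = Ax; below is a minimal NumPy sketch, where the matrix A is an arbitrary illustration (not one from the text):

```python
import numpy as np

# T : R^4 -> R^3 given by T x = A x; rank A = dim range T,
# and dim null T = 4 - rank A by the counting theorem (B.11).
A = np.array([[1., 2., 3., 4.],
              [2., 4., 6., 8.],    # twice row 1, so rank < 3
              [0., 1., 0., 1.]])

rank = np.linalg.matrix_rank(A)          # dim range T
nullity = A.shape[1] - rank              # dim null T from (B.11)
print(rank, nullity)                     # 2 2

# Cross-check against an explicit basis of null A from the SVD.
null_basis = np.linalg.svd(A)[2][rank:]  # rows spanning null A
assert np.allclose(A @ null_basis.T, 0)
assert null_basis.shape[0] == nullity
```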


Corollary B.56. The matrix M_T in (B.12) of a linear map T ∈ L(V, W) satisfies

    T[v1, v2, . . . , vn] = [w1, w2, . . . , wm] M_T. (B.14)

Proof. This follows directly from (B.12).

B.2.3 Duality

Dual vector spaces

Definition B.57. The dual space of a vector space V is the vector space of all linear functionals on V,

    V′ = L(V, F). (B.15)

Definition B.58. For a basis v1, . . . , vn of a vector space V, its dual basis is the list ϕ1, . . . , ϕn where each ϕj ∈ V′ is

    ϕj(vk) = 1 if k = j, and ϕj(vk) = 0 if k ≠ j. (B.16)

Exercise B.59. Show that the dual basis is a basis of the dual space.

Lemma B.60. A finite-dimensional vector space V satisfies

    dim V′ = dim V. (B.17)

Proof. This follows from Definition B.57 and the identity dim L(V, W) = dim(V) dim(W).

Definition B.61. The double dual space of a vector space V, denoted by V″, is the dual space of V′.

Lemma B.62. The function Λ : V → V″ defined as

    ∀v ∈ V, ∀ϕ ∈ V′, (Λv)(ϕ) = ϕ(v) (B.18)

is a linear bijection.

Proof. It is easily verified that Λ is a linear map. The rest follows from Definitions B.57, B.61, and Lemma B.60.

Dual linear maps

Definition B.63. The dual map of a linear map T : V → W is the linear map T′ : W′ → V′ defined as

    ∀ϕ ∈ W′, T′(ϕ) = ϕ ∘ T. (B.19)

Exercise B.64. Denote by D the linear map of differentiation Dp = p′ on the vector space P(R) of polynomials with real coefficients. Under the dual map of D, what is the image of the linear functional ϕ(p) = ∫₀¹ p on P(R)?

Theorem B.65. The matrix of T′ is the transpose of the matrix of T.

Proof. Let (v1, . . . , vn), (ϕ1, . . . , ϕn), (w1, . . . , wm), (ψ1, . . . , ψm) be bases of V, V′, W, W′, respectively. Denote by A and C the matrices of T : V → W and T′ : W′ → V′, respectively. We have

    ψj ∘ T = T′(ψj) = Σ_{r=1}^n c_{r,j} ϕr.

By Corollary B.56, applying this equation to vk yields

    (ψj ∘ T)(vk) = Σ_{r=1}^n c_{r,j} ϕr(vk) = c_{k,j}.

On the other hand, we have

    (ψj ∘ T)(vk) = ψj(T vk) = ψj(Σ_{r=1}^m a_{r,k} wr) = Σ_{r=1}^m a_{r,k} ψj(wr) = a_{j,k}.

Definition B.66. The double dual map of a linear map T : V → W is the linear map T″ : V″ → W″ defined as T″ = (T′)′.

Theorem B.67. For T ∈ L(V) and Λ in (B.18), we have

    T″ ∘ Λ = Λ ∘ T. (B.20)

Proof. Definition B.66 and equation (B.18) yield, for all v ∈ V and ϕ ∈ V′,

    (T″ ∘ Λ)vϕ = ((T′)′Λv)ϕ = (Λv ∘ T′)ϕ = Λv(T′ϕ) = (T′ϕ)(v) = ϕ(Tv) = Λ(Tv)(ϕ) = (Λ ∘ T)vϕ,

where the third step is natural since T′ sends V′ to V′.

Corollary B.68. For T ∈ L(V) where V is finite-dimensional, the double dual map is

    T″ = Λ ∘ T ∘ Λ⁻¹. (B.21)

Proof. This follows directly from Theorem B.67 and Lemma B.62.

The null space and range of the dual of a linear map

Definition B.69. For U ⊂ V, the annihilator of U, denoted U⁰, is defined by

    U⁰ := {ϕ ∈ V′ : ∀u ∈ U, ϕ(u) = 0}. (B.22)

Exercise B.70. Let e1, e2, e3, e4, e5 denote the standard basis of V = R⁵, and ϕ1, ϕ2, ϕ3, ϕ4, ϕ5 its dual basis of V′. Suppose

    U = span(e1, e2) = {(x1, x2, 0, 0, 0) ∈ R⁵ : x1, x2 ∈ R}.

Show that U⁰ = span(ϕ3, ϕ4, ϕ5).

Exercise B.71. Let i : U ↪ V be an inclusion. Show that null i′ = U⁰.

Lemma B.72. Suppose U ⊂ V. Then U⁰ is a subspace of V′.

Exercise B.73. Suppose V is finite-dimensional. Prove that every linear map on a subspace of V can be extended to a linear map on V.


Lemma B.74. Suppose V is finite-dimensional and U is a subspace of V. Then

    dim U + dim U⁰ = dim V. (B.23)

Proof. Apply Theorem B.54 to the dual of the inclusion, i′ : V′ → U′, and we have

    dim range i′ + dim null i′ = dim V′
    ⇒ dim range i′ + dim U⁰ = dim V,

where the second line follows from Exercise B.71 and Lemma B.60. For any ϕ ∈ U′, Exercise B.73 states that ϕ can be extended to ψ ∈ V′ such that i′(ψ) = ϕ. Hence i′ is surjective and we have U′ = range i′. The proof is then completed by Lemma B.60.

Lemma B.75. Any linear map T ∈ L(V, W) satisfies

    null T′ = (range T)⁰. (B.24)

Proof. Definitions B.50, B.52, B.63, and B.69 yield

    ϕ ∈ null T′ ⇔ 0 = T′(ϕ) = ϕ ∘ T
               ⇔ ∀v ∈ V, ϕ(Tv) = 0
               ⇔ ϕ(range T) = 0
               ⇔ ϕ ∈ (range T)⁰.

Lemma B.76. For finite-dimensional vector spaces V and W, any linear map T ∈ L(V, W) satisfies

    dim null T′ = dim null T + dim W − dim V. (B.25)

Proof. Lemmas B.74 and B.75 and Theorem B.54 yield

    dim null T′ = dim(range T)⁰ = dim W − dim(range T)
                = dim W − dim V + dim(null T)
                = dim null T + dim W − dim V.

Corollary B.77. For finite-dimensional vector spaces V and W, any linear map T ∈ L(V, W) is surjective if and only if T′ is injective.

Proof. T is surjective ⇔ W = range T ⇔ (range T)⁰ = {0} ⇔ null T′ = {0} ⇔ T′ is injective. The second step follows from Lemma B.74 applied to W:

    dim W = dim(range T) + dim(range T)⁰.

Lemma B.78. For finite-dimensional vector spaces V and W, any linear map T ∈ L(V, W) satisfies

    dim range T′ = dim range T. (B.26)

Proof. Theorem B.54, Lemma B.75, and Lemma B.74 yield

    dim range T′ = dim W − dim null T′
                 = dim W − dim(range T)⁰
                 = dim(range T).

Lemma B.79. For finite-dimensional vector spaces V and W, any linear map T ∈ L(V, W) satisfies

    range T′ = (null T)⁰. (B.27)

Proof. Theorem B.54, Lemma B.75, and Lemma B.74 yield

    ϕ ∈ range T′ ⇒ ∃ψ ∈ W′ s.t. T′(ψ) = ϕ
                ⇒ ∀v ∈ null T, ϕ(v) = ψ(Tv) = 0
                ⇒ ϕ ∈ (null T)⁰.

The proof is completed by

    dim range T′ = dim(range T) = dim V − dim null T = dim(null T)⁰.

Corollary B.80. For finite-dimensional vector spaces V and W, any linear map T ∈ L(V, W) is injective if and only if T′ is surjective.

Proof. T is injective ⇔ null T = {0} ⇔ (null T)⁰ = V′ ⇔ range T′ = V′ ⇔ T′ is surjective. The second step follows from Lemmas B.74 and B.60, and the third step follows from Lemma B.79.

Matrix ranks

Definition B.81. For a matrix A ∈ F^{m×n} : F^n → F^m, its column space (or range or image) consists of all linear combinations of its columns, its row space (or coimage) is the column space of Aᵀ, its null space (or kernel) is the null space of A as a linear operator, and its left null space (or cokernel) is the null space of Aᵀ.

Definition B.82. The column rank and row rank of a matrix A ∈ F^{m×n} are the dimensions of its column space and row space, respectively.

Lemma B.83. Let A_T denote the matrix of a linear operator T ∈ L(V, W). Then the column rank of A_T is the dimension of range T.

Proof. For u = Σ_i c_i v_i, Corollary B.56 yields

    Tu = Σ_i c_i T v_i = T[v1, . . . , vn]c = [w1, . . . , wm] A_T c.

Hence we have

    {Tu : c ∈ F^n} = {[w1, . . . , wm] A_T c : c ∈ F^n}.

The LHS is range T while {A_T c : c ∈ F^n} is the column space of A_T. Since (w1, . . . , wm) is a basis, by Definition B.82 the column rank of the matrix [w1, . . . , wm] is m. Taking dim of both sides of the above equation yields the conclusion. Note that the RHS is a subspace of F^m (why?) and its dimension does not depend on the particular choice of basis; hence we can choose (w1, . . . , wm) to be the standard basis, and then [w1, . . . , wm] is simply the identity matrix.

Theorem B.84. For any A ∈ F^{m×n}, its row rank equals its column rank.


Proof. Define a linear map T : F^n → F^m as Tx = Ax. Clearly, A is the matrix of T for the standard bases of F^n and F^m. Then we have

    column rank of A = dim range T
                     = dim range T′
                     = column rank of the matrix of T′
                     = column rank of Aᵀ
                     = row rank of A,

where the first step follows from Lemma B.83, the second from Lemma B.78, the third from Lemma B.83, the fourth from Theorem B.65, and the last from the definitions of matrix transpose and matrix products.

Definition B.85. The rank of a matrix is its column rank.

Theorem B.86 (Fundamental theorem of linear algebra). For a matrix A ∈ F^{m×n} : F^n → F^m, its column space and row space both have dimension r ≤ min(m, n); its null space and left null space have dimensions n − r and m − r, respectively. In addition, we have

    F^m = range A ⊕ null Aᵀ, (B.28a)
    F^n = range Aᵀ ⊕ null A, (B.28b)

where range A ⊥ null Aᵀ and range Aᵀ ⊥ null A.

Proof. The first sentence is a rephrase of Theorem B.84 and follows from Theorem B.54. For the second sentence, we only prove (B.28b). x ∈ null A implies x ∈ F^n and Ax = 0. The latter expands to

    [a1ᵀ; a2ᵀ; · · · ; amᵀ] x = [0; 0; · · · ; 0],

which implies that ∀j = 1, 2, . . . , m, aj ⊥ x. Hence x is orthogonal to each basis vector of range Aᵀ. The rest of the proof follows from Lemma B.78, Theorem B.65, and Theorem B.54.

B.3 Eigenvalues, eigenvectors, and invariant subspaces

B.3.1 Invariant subspaces

Definition B.87. Under a linear operator T ∈ L(V), a subspace U of V is invariant if u ∈ U implies Tu ∈ U.

Example B.88. Under T ∈ L(V), each of the following subspaces of V is invariant: {0}, V, null T, and range T.

Definition B.89. A number λ ∈ F is called an eigenvalue of an operator T ∈ L(V) if there exists v ∈ V such that Tv = λv and v ≠ 0. Then the vector v is called an eigenvector of T corresponding to λ.

Example B.90. For each eigenvector v of T ∈ L(V), the subspace span(v) is a one-dimensional invariant subspace of V.

Lemma B.91. Suppose λ1, . . . , λm are distinct eigenvalues of T ∈ L(V) with corresponding eigenvectors v1, . . . , vm. Then v1, . . . , vm is linearly independent.

Lemma B.92. Suppose V is finite-dimensional. Then each operator on V has at most dim V distinct eigenvalues.

B.3.2 Upper-triangular matrices

Notation 7. Suppose T ∈ L(V) and p ∈ P(F) is a polynomial given by

    p(z) = a0 + a1 z + · · · + am z^m

for z ∈ F. Then p(T) is the operator given by

    p(T) = a0 I + a1 T + · · · + am T^m,

where I = T⁰ is the identity operator.

Example B.93. Suppose D ∈ L(P(R)) is the differentiation operator defined by Dq = q′ and p is the polynomial defined by p(x) = 7 − 3x + 5x². Then we have

    p(D) = 7I − 3D + 5D²,  (p(D))q = 7q − 3q′ + 5q″.

Definition B.94. The product polynomial of two polynomials p, q ∈ P(F) is the polynomial defined by

    ∀z ∈ F, (pq)(z) := p(z)q(z). (B.29)

Lemma B.95. Any T ∈ L(V) and p, q ∈ P(F) satisfy

    (pq)(T) = p(T)q(T) = q(T)p(T). (B.30)

Theorem B.96. Every linear operator on a finite-dimensional, nonzero, complex vector space has an eigenvalue.

Definition B.97. The matrix of a linear operator T ∈ L(V) is the matrix of the linear map T ∈ L(V, V), c.f. Definition B.55.

Theorem B.98. Suppose T ∈ L(V) and v1, . . . , vn is a basis of V. Then the following are equivalent:

(a) the matrix of T with respect to v1, . . . , vn is upper triangular;
(b) T vj ∈ span(v1, . . . , vj) for each j = 1, . . . , n;
(c) span(v1, . . . , vj) is invariant under T for each j = 1, . . . , n.

Theorem B.99. Every linear operator T ∈ L(V) on a finite-dimensional complex vector space V has an upper-triangular matrix with respect to some basis of V.

Theorem B.100. Suppose T ∈ L(V) has an upper-triangular matrix with respect to some basis of V. Then T is invertible if and only if all the entries on the diagonal of that upper-triangular matrix are nonzero.

Theorem B.101. Suppose T ∈ L(V) has an upper-triangular matrix with respect to some basis of V. Then the eigenvalues of T are precisely the entries on the diagonal of that upper-triangular matrix.
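Lemma B.95 implies that any two polynomials of the same operator commute; a quick numerical check with a random matrix and the illustrative polynomials p(z) = 1 + 2z, q(z) = 3 − z + z²:

```python
import numpy as np

# Lemma B.95: (pq)(T) = p(T) q(T) = q(T) p(T).
rng = np.random.default_rng(2)
T = rng.normal(size=(4, 4))
I = np.eye(4)

pT = I + 2*T                                # p(T), p(z) = 1 + 2z
qT = 3*I - T + T @ T                        # q(T), q(z) = 3 - z + z^2
pqT = 3*I + 5*T - T@T + 2*(T@T@T)           # (pq)(z) = 3 + 5z - z^2 + 2z^3

assert np.allclose(pT @ qT, qT @ pT)        # the two factors commute
assert np.allclose(pT @ qT, pqT)            # and equal (pq)(T)
```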


B.3.3 Eigenspaces and diagonal matrices

Definition B.102. A diagonal entry of a matrix is an entry of the matrix of which the row index equals the column index. The diagonal of a matrix consists of all diagonal entries of the matrix. A diagonal matrix is a square matrix that is zero everywhere except possibly along the diagonal.

Definition B.103. The eigenspace of T ∈ L(V) corresponding to λ ∈ F is

    E(λ, T) := null(T − λI). (B.31)

Lemma B.104. Suppose λ1, . . . , λm are distinct eigenvalues of T ∈ L(V) on a finite-dimensional space V. Then

    E(λ1, T) + · · · + E(λm, T)

is a direct sum and

    dim E(λ1, T) + · · · + dim E(λm, T) ≤ dim V. (B.32)

Definition B.105. An operator T ∈ L(V) is diagonalizable if it has a diagonal matrix with respect to some basis of V.

Theorem B.106 (Conditions of diagonalizability). Suppose λ1, . . . , λm are distinct eigenvalues of T ∈ L(V) on a finite-dimensional space V. Then the following are equivalent:

(a) T is diagonalizable;
(b) V has a basis consisting of eigenvectors of T;
(c) there exist one-dimensional subspaces U1, . . . , Un of V, each invariant under T, such that V = U1 ⊕ · · · ⊕ Un;
(d) V = E(λ1, T) ⊕ · · · ⊕ E(λm, T);
(e) dim V = dim E(λ1, T) + · · · + dim E(λm, T).

Corollary B.107. An operator T ∈ L(V) is diagonalizable if T has dim V distinct eigenvalues.

B.4 Inner product spaces

B.4.1 Inner products

Definition B.108. Denote by F the underlying field of a vector space V. The inner product ⟨u, v⟩ on V is a function V × V → F that satisfies

(IP-1) real positivity: ∀v ∈ V, ⟨v, v⟩ ≥ 0;
(IP-2) definiteness: ⟨v, v⟩ = 0 iff v = 0;
(IP-3) additivity in the first slot: ∀u, v, w ∈ V, ⟨u + v, w⟩ = ⟨u, w⟩ + ⟨v, w⟩;
(IP-4) homogeneity in the first slot: ∀a ∈ F, ∀v, w ∈ V, ⟨av, w⟩ = a⟨v, w⟩;
(IP-5) conjugate symmetry: ∀v, w ∈ V, ⟨v, w⟩ = \overline{⟨w, v⟩}.

An inner product space is a vector space V equipped with an inner product on V.

Corollary B.109. An inner product has additivity in the second slot, i.e. ⟨u, v + w⟩ = ⟨u, v⟩ + ⟨u, w⟩.

Corollary B.110. An inner product has conjugate homogeneity in the second slot, i.e.

    ∀a ∈ F, ∀v, w ∈ V, ⟨v, aw⟩ = ā⟨v, w⟩. (B.33)

Exercise B.111. Prove Corollaries B.109 and B.110 from Definition B.108.

Definition B.112. The Euclidean inner product on F^n is

    ⟨v, w⟩ = Σ_{i=1}^n v_i \overline{w_i}. (B.34)

B.4.2 Norms induced from inner products

Definition B.113. Let F be the underlying field of an inner product space V. The norm induced by an inner product on V is the function

    ‖v‖ = √⟨v, v⟩. (B.35)

Definition B.114. For p ∈ [1, ∞), the Euclidean ℓp norm of a vector v ∈ F^n is

    ‖v‖_p = (Σ_{i=1}^n |v_i|^p)^{1/p} (B.36)

and the Euclidean ℓ∞ norm is

    ‖v‖_∞ = max_i |v_i|. (B.37)

Theorem B.115 (Equivalence of norms). Any two norms ‖·‖_N and ‖·‖_M on a finite-dimensional vector space V = C^n satisfy

    ∃c1, c2 ∈ R⁺ s.t. ∀x ∈ V, c1‖x‖_M ≤ ‖x‖_N ≤ c2‖x‖_M. (B.38)

Definition B.116. The angle between two vectors v, w in an inner product space with F = R is the number θ ∈ [0, π],

    θ = arccos( ⟨v, w⟩ / (‖v‖‖w‖) ). (B.39)

Theorem B.117 (The law of cosines). Any triangle satisfies

    c² = a² + b² − 2ab cos γ. (B.40)

Proof. Taking the dot product of AB with AB = CB − CA yields

    c² = ⟨AB, CB⟩ − ⟨AB, CA⟩.

The dot products of CB and CA with AB = CB − CA yield

    ⟨CB, AB⟩ = a² − ⟨CB, CA⟩;
    −⟨CA, AB⟩ = −⟨CA, CB⟩ + b².

The proof is completed by adding up all three equations and applying (B.39).
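Theorem B.115 guarantees equivalence constants exist; for the concrete ℓ1, ℓ2, ℓ∞ norms on R^n the classical bounds below (the constants are standard facts, not from the text) can be checked on random vectors:

```python
import numpy as np

# Theorem B.115 for concrete norms on R^n:
#   ||x||_inf <= ||x||_2 <= sqrt(n) ||x||_inf
#   ||x||_2   <= ||x||_1 <= sqrt(n) ||x||_2
rng = np.random.default_rng(3)
n = 10
for _ in range(100):
    x = rng.normal(size=n)
    l1, l2, linf = (np.linalg.norm(x, p) for p in (1, 2, np.inf))
    assert linf <= l2 <= np.sqrt(n) * linf
    assert l2 <= l1 <= np.sqrt(n) * l2
```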


Theorem B.118 (The law of cosines: abstract version). Any induced norm on a real vector space satisfies

    ‖u − v‖² = ‖u‖² + ‖v‖² − 2⟨u, v⟩. (B.41)

Proof. Definitions B.113 and B.108 and F = R yield

    ‖u − v‖² = ⟨u − v, u − v⟩
             = ⟨u, u⟩ + ⟨v, v⟩ − ⟨u, v⟩ − ⟨v, u⟩
             = ‖u‖² + ‖v‖² − 2⟨u, v⟩.

B.4.3 Norms and induced inner-products

Definition B.119. A function ‖·‖ : V → F is a norm for a vector space V iff it satisfies

(NRM-1) real positivity: ∀v ∈ V, ‖v‖ ≥ 0;
(NRM-2) point separation: ‖v‖ = 0 ⇒ v = 0;
(NRM-3) absolute homogeneity: ∀a ∈ F, ∀v ∈ V, ‖av‖ = |a|‖v‖;
(NRM-4) triangle inequality: ∀u, v ∈ V, ‖u + v‖ ≤ ‖u‖ + ‖v‖.

The function ‖·‖ : V → F is called a semi-norm iff it satisfies (NRM-1,3,4). A normed vector space (or simply a normed space) is a vector space V equipped with a norm on V.

Exercise B.120. Explain how (NRM-1,2,3,4) relate to the geometric meaning of the norm of vectors in R³.

Lemma B.121. The norm induced by an inner product is a norm as in Definition B.119.

Proof. The induced norm as in (B.35) satisfies (NRM-1,2) trivially. For (NRM-3),

    ‖av‖² = ⟨av, av⟩ = a⟨v, av⟩ = aā⟨v, v⟩ = |a|²‖v‖².

To prove (NRM-4), we have

    ‖u + v‖² = ⟨u + v, u + v⟩
             = ⟨u, u⟩ + ⟨v, v⟩ + ⟨u, v⟩ + \overline{⟨u, v⟩}
             ≤ ⟨u, u⟩ + ⟨v, v⟩ + 2|⟨u, v⟩|
             ≤ ‖u‖² + ‖v‖² + 2‖u‖‖v‖
             = (‖u‖ + ‖v‖)²,

where the second step follows from (IP-5) and the fourth step from the Cauchy-Schwarz inequality.

Theorem B.122 (The parallelogram law). The sum of the squares of the lengths of the four sides of a parallelogram equals the sum of the squares of the two diagonals. More precisely, for a parallelogram ABCD,

    (AB)² + (BC)² + (CD)² + (DA)² = (AC)² + (BD)². (B.42)

Proof. Apply the law of cosines to the two diagonals, add the two equations, and we obtain (B.42).

Theorem B.123 (The parallelogram law: abstract version). Any induced norm (B.35) satisfies

    2‖u‖² + 2‖v‖² = ‖u + v‖² + ‖u − v‖². (B.43)

Proof. Replace v in (B.41) with −v and we have

    ‖u + v‖² = ‖u‖² + ‖v‖² + 2⟨u, v⟩.

(B.43) follows from adding the above equation to (B.41).

Exercise B.124. In the case of Euclidean ℓp norms, show that the parallelogram law (B.43) holds if and only if p = 2.

Theorem B.125. The induced norm (B.35) holds for some inner product ⟨·, ·⟩ if and only if the parallelogram law (B.43) holds for every pair of u, v ∈ V.

Exercise B.126. Prove Theorem B.125.

Example B.127. By Theorem B.125 and Exercise B.124, the ℓ1 and ℓ∞ spaces do not have a corresponding inner product for the Euclidean ℓ1 and ℓ∞ norms.

B.4.4 Orthonormal bases

Definition B.128. Two vectors u, v are called orthogonal if ⟨u, v⟩ = 0, i.e., their inner product is the additive identity of the underlying field.

Example B.129. An inner product on the vector space of continuous real-valued functions on the interval [−1, 1] is

    ⟨f, g⟩ = ∫_{−1}^{+1} f(x)g(x) dx.

f and g are said to be orthogonal if the integral is zero.

Theorem B.130 (Pythagorean). If u, v are orthogonal, then ‖u + v‖² = ‖u‖² + ‖v‖².

Proof. This follows from (B.41) and Definition B.128.

Theorem B.131 (Cauchy-Schwarz inequality).

    |⟨u, v⟩| ≤ ‖u‖‖v‖, (B.44)

where the equality holds iff one of u, v is a scalar multiple of the other.

Proof. For any complex number λ, (IP-1) implies

    ⟨u + λv, u + λv⟩ ≥ 0
    ⇒ ⟨u, u⟩ + λ⟨v, u⟩ + λ̄⟨u, v⟩ + λλ̄⟨v, v⟩ ≥ 0.

If v = 0, (B.44) clearly holds. Otherwise (B.44) follows from substituting λ = −⟨u, v⟩/⟨v, v⟩ into the above equation.

Exercise B.132. To explain the choice of λ in the proof of Theorem B.131, what is the geometric meaning of (B.44) in the plane? When will the equality hold?
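Exercise B.124 and Example B.127 can be seen numerically: the parallelogram law holds for the ℓ2 norm but fails already for the pair of standard basis vectors under the ℓ1 norm.

```python
import numpy as np

# 2||u||^2 + 2||v||^2 = ||u+v||^2 + ||u-v||^2 (B.43)?
u = np.array([1.0, 0.0])
v = np.array([0.0, 1.0])

def lhs_rhs(p):
    n = lambda x: np.linalg.norm(x, p)
    return 2*n(u)**2 + 2*n(v)**2, n(u + v)**2 + n(u - v)**2

l2 = lhs_rhs(2)      # (4.0, 4.0) -> law holds for l2
l1 = lhs_rhs(1)      # (4.0, 8.0) -> law fails for l1
assert np.isclose(l2[0], l2[1])
assert not np.isclose(l1[0], l1[1])
```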


Example B.133. If x_i, y_i ∈ R, then for any n ∈ N⁺

    (Σ_{i=1}^n x_i y_i)² ≤ (Σ_{j=1}^n x_j²)(Σ_{k=1}^n y_k²).

Example B.134. If f, g : [a, b] → R are continuous, then

    (∫_a^b f(x)g(x) dx)² ≤ (∫_a^b f²(x) dx)(∫_a^b g²(x) dx).

Definition B.135. A list of vectors (e1, e2, . . . , em) is called orthonormal if the vectors in it are pairwise orthogonal and each vector has norm 1, i.e.

    ∀i = 1, 2, . . . , m, ‖e_i‖ = 1;  ∀i ≠ j, ⟨e_i, e_j⟩ = 0. (B.45)

Definition B.136. An orthonormal basis of an inner-product space V is an orthonormal list of vectors in V that is also a basis of V.

Theorem B.137. If (e1, e2, . . . , en) is an orthonormal basis of V, then

    ∀v ∈ V, v = Σ_{i=1}^n ⟨v, e_i⟩e_i, (B.46a)
    ‖v‖² = Σ_{i=1}^n |⟨v, e_i⟩|². (B.46b)

Lemma B.138. Every finite-dimensional inner-product space has an orthonormal basis.

Theorem B.139 (Schur). Every linear operator T ∈ L(V) on a finite-dimensional complex vector space V has an upper-triangular matrix with respect to some orthonormal basis of V.

Proof. This follows from Theorem B.99, Lemma B.138, and the Gram-Schmidt process; see Section 5.1.

Definition B.140. A linear functional on V is a linear map from V to F, or, it is an element of L(V, F).

Theorem B.141 (Riesz representation theorem). If V is a finite-dimensional vector space, then

    ∀ϕ ∈ V′, ∃!u ∈ V s.t. ∀v ∈ V, ϕ(v) = ⟨v, u⟩. (B.47)

Proof. Let (e1, e2, . . . , en) be an orthonormal basis of V. Then

    ϕ(v) = ϕ(Σ_{i=1}^n ⟨v, e_i⟩e_i) = Σ_{i=1}^n ⟨v, e_i⟩ϕ(e_i)
         = Σ_{i=1}^n ⟨v, \overline{ϕ(e_i)}e_i⟩ = ⟨v, Σ_{i=1}^n \overline{ϕ(e_i)}e_i⟩,

where the last two steps follow from Corollaries B.109 and B.110.

As for the uniqueness, suppose that ∃u1, u2 ∈ V s.t. ϕ(v) = ⟨v, u1⟩ = ⟨v, u2⟩. Then for each v ∈ V,

    0 = ⟨v, u1⟩ − ⟨v, u2⟩ = ⟨v, u1 − u2⟩.

Taking v = u1 − u2 shows that u1 − u2 = 0.

B.5 Operators on inner-product spaces

B.5.1 Adjoint and self-adjoint operators

Definition B.142. The adjoint of a linear map T ∈ L(V, W) between inner-product spaces is a function T* : W → V that satisfies

    ∀v ∈ V, ∀w ∈ W, ⟨Tv, w⟩ = ⟨v, T*w⟩. (B.48)

Example B.143. Define a linear map T : R³ → R²,

    T(x1, x2, x3) = (x2 + 3x3, 2x1).

Then T*(y1, y2) = (2y2, y1, 3y1) because

    ⟨(x1, x2, x3), T*(y1, y2)⟩ = ⟨T(x1, x2, x3), (y1, y2)⟩
                              = ⟨(x2 + 3x3, 2x1), (y1, y2)⟩
                              = x2 y1 + 3x3 y1 + 2x1 y2
                              = ⟨(x1, x2, x3), (2y2, y1, 3y1)⟩.

Lemma B.144. If T ∈ L(V, W), then T* ∈ L(W, V).

Proof. Use Definition B.42.

Theorem B.145. The adjoint of a linear map has the following properties:

(ADJ-1) additivity: ∀S, T ∈ L(V, W), (S + T)* = S* + T*;
(ADJ-2) conjugate homogeneity: ∀T ∈ L(V, W), ∀a ∈ F, (aT)* = āT*;
(ADJ-3) adjoint of adjoint: ∀T ∈ L(V, W), (T*)* = T;
(ADJ-4) identity: I* = I;
(ADJ-5) products: let U be an inner-product space, ∀T ∈ L(V, W), ∀S ∈ L(W, U), (ST)* = T*S*.

Proof. Use Definitions B.142 and B.108.

Definition B.146. The conjugate transpose, or Hermitian transpose, or Hermitian conjugate, or adjoint matrix, of a matrix A ∈ C^{m×n} is the matrix A* ∈ C^{n×m} defined by

    (A*)_{ij} = \overline{a_{ji}}, (B.49)

where \overline{a_{ji}} denotes the complex conjugate of the entry a_{ji}.

Exercise B.147. Show that the conjugate transpose gives the adjoint of a linear map in L(V, W) with V = C^n and W = C^m.

Definition B.148. A matrix U ∈ C^{n×n} is unitary iff U*U = I. A matrix U ∈ R^{n×n} is orthogonal iff UᵀU = I.

Theorem B.149. A matrix U ∈ C^{n×n} is unitary if and only if its columns form an orthonormal basis for C^n.

Proof. This follows from considering the (i, j)th element of U*U and applying U*U = I in Definition B.148.
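With the Euclidean inner product (B.34), Definition B.142 and Definition B.146 say the adjoint of x ↦ Ax is the conjugate transpose; a quick check on arbitrary random complex data:

```python
import numpy as np

# <u, v> = sum u_i * conj(v_i) as in (B.34); the adjoint of
# x -> A x is then A* = A.conj().T, satisfying (B.48).
rng = np.random.default_rng(4)
A = rng.normal(size=(3, 4)) + 1j * rng.normal(size=(3, 4))
v = rng.normal(size=4) + 1j * rng.normal(size=4)
w = rng.normal(size=3) + 1j * rng.normal(size=3)

inner = lambda x, y: np.sum(x * np.conj(y))
assert np.isclose(inner(A @ v, w), inner(v, A.conj().T @ w))
```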


Corollary B.150. A unitary matrix U preserves norms and inner products. More precisely, we have

    ∀v, w ∈ C^n, ⟨Uv, Uw⟩ = ⟨v, w⟩.

Proof. This follows from Definitions B.142 and B.148.

Theorem B.151. Every unitary matrix U ∈ C^{2×2} with det U = 1 is of the form

    U = [ a   b
          −b̄  ā ], (B.50)

where |a|² + |b|² = 1.

Proof. Let

    U = [ a  b
          c  d ].

Then Theorem B.149 (orthonormality of the columns) and the condition det U = 1 yield

    ab̄ + cd̄ = 0,
    ad − cb = 1.

In other words, the linear system

    [ b̄  d̄ ] [x]   [0]
    [ d  −b ] [y] = [1]

has the solution x = a, y = c. Furthermore, Theorem B.149 and the form of U yield |b|² + |d|² = 1, so the determinant of the coefficient matrix is −(|b|² + |d|²) = −1 ≠ 0 and the solution is unique. Since x = d̄, y = −b̄ also solves the system, we have a = d̄ and c = −b̄, which completes the proof.

Theorem B.152. Let T ∈ L(V, W). Suppose e1, . . . , en is an orthonormal basis of V and f1, . . . , fm is an orthonormal basis of W. Then

    M(T*, (f1, . . . , fm), (e1, . . . , en))

is the conjugate transpose of

    M(T, (e1, . . . , en), (f1, . . . , fm)).

Proof. By Corollary B.56, we have

    T[e1, . . . , en] = [f1, . . . , fm] M_T.

The two bases being orthonormal further implies that the (i, j)th entry of M_T is ⟨T e_j, f_i⟩, i.e.

    M_T = [ ⟨Te1, f1⟩  ⟨Te2, f1⟩  · · ·  ⟨Ten, f1⟩
            ⟨Te1, f2⟩  ⟨Te2, f2⟩  · · ·  ⟨Ten, f2⟩
               ⋮          ⋮        ⋱       ⋮
            ⟨Te1, fm⟩  ⟨Te2, fm⟩  · · ·  ⟨Ten, fm⟩ ].

The proof is completed by repeating the above derivation for T* and then applying Definitions B.108 and B.142.

Lemma B.153. Suppose V is a complex inner product space and T ∈ L(V). If

    ∀v ∈ V, ⟨Tv, v⟩ = 0, (B.51)

then T = 0.

Proof. By Definition B.108 and (B.51), we have, ∀u, w ∈ V,

    ⟨Tu, w⟩ = (⟨T(u+w), u+w⟩ − ⟨T(u−w), u−w⟩)/4
            + i(⟨T(u+iw), u+iw⟩ − ⟨T(u−iw), u−iw⟩)/4
            = 0.

Setting w = Tu completes the proof.

Definition B.154. An operator T ∈ L(V) is self-adjoint iff T = T*, i.e.

    ∀v, w ∈ V, ⟨Tv, w⟩ = ⟨v, Tw⟩. (B.52)

Theorem B.155. Suppose V is a complex inner product space and T ∈ L(V). Then T is self-adjoint if and only if

    ∀v ∈ V, ⟨Tv, v⟩ ∈ R. (B.53)

Proof. By Definitions B.108, B.142, and B.154, we have

    ⟨Tv, v⟩ − \overline{⟨Tv, v⟩} = ⟨Tv, v⟩ − ⟨v, Tv⟩ = ⟨Tv, v⟩ − ⟨T*v, v⟩ = ⟨(T − T*)v, v⟩.

Then Lemma B.153 completes the proof.

Lemma B.156. Suppose V is a real inner product space and T ∈ L(V). If T is self-adjoint and satisfies

    ∀v ∈ V, ⟨Tv, v⟩ = 0, (B.54)

then T = 0.

Proof. By the self-adjointness and the underlying field being real, we have

    ⟨Tw, u⟩ = ⟨w, Tu⟩ = ⟨Tu, w⟩,

which, together with Definition B.108, implies

    ⟨Tu, w⟩ = (⟨T(u+w), u+w⟩ − ⟨T(u−w), u−w⟩)/4.

Setting w = Tu completes the proof.

B.5.2 Normal operators

Definition B.157. An operator T ∈ L(V) is normal iff TT* = T*T.

Corollary B.158. Every self-adjoint operator is normal.

Lemma B.159. An operator T ∈ L(V) is normal if and only if

    ∀v ∈ V, ‖Tv‖ = ‖T*v‖. (B.55)

Proof. By Lemma B.156 and Definition B.142, we have

    T*T − TT* = 0 ⇔ ∀v ∈ V, ⟨(T*T − TT*)v, v⟩ = 0
               ⇔ ∀v ∈ V, ⟨T*Tv, v⟩ = ⟨TT*v, v⟩
               ⇔ ∀v ∈ V, ‖Tv‖² = ‖T*v‖².

The positivity of a norm completes the proof.

Theorem B.160. For a linear operator T ∈ L(V) on a two-dimensional real inner product space V, the following are equivalent:
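A rotation matrix is the standard example of an operator that is normal (Definition B.157) without being self-adjoint; a quick check that also illustrates (B.55):

```python
import numpy as np

# A plane rotation: normal but not self-adjoint.
theta = 0.7
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])

assert np.allclose(R @ R.T, R.T @ R)   # normal: T T* = T* T
assert not np.allclose(R, R.T)         # not self-adjoint

# (B.55): ||T v|| = ||T* v|| for every v when T is normal.
rng = np.random.default_rng(5)
v = rng.normal(size=2)
assert np.isclose(np.linalg.norm(R @ v), np.linalg.norm(R.T @ v))
```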


(a) T is normal but not self-adjoint.

(b) The matrix of T with respect to every orthonormal basis of V has the form

    M(T) = [ a  −b
             b   a ] (B.56)

where b ≠ 0.

Proof. (b) ⇒ (a) trivially holds, so we only prove (a) ⇒ (b). Let (e1, e2) be an orthonormal basis of V and set

    M(T, (e1, e2)) = [ a  c
                       b  d ].

By Definition B.55, Theorem B.130, and Definition B.135, we have ‖Te1‖² = a² + b². In addition, Theorem B.152 yields ‖T*e1‖² = a² + c². Then Lemma B.159 implies b² = c², and the condition of T being not self-adjoint further yields c = −b ≠ 0. Considering ‖Te2‖² and ‖T*e2‖² yields a = d.

B.5.3 The spectral theorem

Theorem B.161 (Complex spectral). For a linear operator T ∈ L(V) with F = C, the following are equivalent:

(a) T is normal;
(b) V has an orthonormal basis consisting of eigenvectors of T;
(c) T has a diagonal matrix with respect to some orthonormal basis of V.

Theorem B.162 (Real spectral). For a linear operator T ∈ L(V) with F = R, the following are equivalent:

(a) T is self-adjoint;
(b) V has an orthonormal basis consisting of eigenvectors of T;
(c) T has a diagonal matrix with respect to some orthonormal basis of V.

B.5.4 Isometries

Definition B.163. An operator S ∈ L(V) is called a (linear) isometry iff

    ∀v ∈ V, ‖Sv‖ = ‖v‖. (B.57)

Theorem B.164. An operator S ∈ L(V) on a real inner product space is an isometry if and only if there exists an orthonormal basis of V with respect to which S has a block diagonal matrix such that each block on the diagonal is a 1-by-1 matrix containing 1 or −1, or is a 2-by-2 matrix of the form

    [ cos θ  −sin θ
      sin θ   cos θ ] (B.58)

where θ ∈ (0, π).

Corollary B.165. For an operator S ∈ L(V) on a two-dimensional real inner product space, the following are equivalent:

(a) S is an isometry;
(b) S is either an identity or a reflection or a rotation.

B.5.5 The singular value decomposition

Definition B.166. A self-adjoint linear operator whose eigenvalues are non-negative is called positive semidefinite or positive, and called positive definite if it is also invertible.

Corollary B.167. For any linear operator f ∈ L(V), both f* ∘ f and f ∘ f* are self-adjoint and positive semidefinite.

Proof. By Definition B.154, f* ∘ f is self-adjoint since

    ⟨(f* ∘ f)u, v⟩ = ⟨fu, fv⟩ = ⟨u, (f* ∘ f)v⟩.

Suppose (λ, u) is an eigen-pair of f* ∘ f. Then we have

    λ⟨u, u⟩ = ⟨(f* ∘ f)u, u⟩ = ⟨fu, fu⟩
    ⇒ λ = ⟨fu, fu⟩/⟨u, u⟩ ≥ 0.

Similar arguments apply to f ∘ f*.

Definition B.168. The singular values of a linear map f are the square roots of the nonnegative eigenvalues of f* ∘ f.

Definition B.169. For a rectangular matrix A ∈ F^{m×n}, the factorization A = PΣQ* is a singular value decomposition (SVD) iff any entry of Σ ∈ R^{m×n} is zero except possibly at a diagonal entry (an entry of which the column index equals the row index), and P ∈ F^{m×m} and Q ∈ F^{n×n} are unitary matrices or orthogonal matrices for F = C or F = R, respectively. The diagonal entries of Σ, written σ1 ≥ σ2 ≥ · · · ≥ σq ≥ 0 where q = min(m, n), are the singular values of A. The column vectors of P and Q are the left singular vectors and the right singular vectors of A, respectively.

Theorem B.170. Any matrix A ∈ C^{m×n} has an SVD.

Definition B.171. Two matrices A, B ∈ R^{n×n} are called similar iff there exists an invertible matrix P such that B = P⁻¹AP. The map A ↦ P⁻¹AP is called a similarity transformation or conjugation of the matrix A.

B.6 Trace and determinant

Definition B.172. The trace of a matrix A, denoted by Trace A, is the sum of the diagonal entries of A.

Lemma B.173. The trace of a matrix is the sum of its eigenvalues, each of which is repeated according to its multiplicity.

Definition B.174. A permutation of a set A is a bijective function σ : A → A.

Definition B.175. Let σ be a permutation of A = {1, 2, . . . , n} and let s denote the number of pairs of integers (j, k) with 1 ≤ j < k ≤ n such that j appears after k in the list (m1, . . . , mn) given by m_i = σ(i). The sign of the permutation σ is 1 if s is even and −1 if s is odd.
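Definitions B.168-B.169 and Theorem B.170 translate directly into a NumPy check: the singular values returned by the SVD are the square roots of the eigenvalues of A*A, and the three factors reassemble A. The matrix below is an arbitrary illustration:

```python
import numpy as np

# A = P Sigma Q* (Definition B.169) for a real 4x3 matrix.
rng = np.random.default_rng(6)
A = rng.normal(size=(4, 3))

P, sigma, Qh = np.linalg.svd(A)            # Qh = Q*
Sigma = np.zeros((4, 3))
Sigma[:3, :3] = np.diag(sigma)
assert np.allclose(A, P @ Sigma @ Qh)

# Definition B.168: singular values = sqrt of eigenvalues of A* A.
eig = np.sort(np.linalg.eigvalsh(A.T @ A))[::-1]   # descending
assert np.allclose(sigma, np.sqrt(np.maximum(eig, 0)))
```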


Definition B.176. The signed volume of a parallelotope spanned by n vectors v1, v2, . . . , vn ∈ R^n is a function δ : R^{n×n} → R that satisfies

(SVP-1) δ(I) = 1;

(SVP-2) δ(v1, v2, . . . , vn) = 0 if vi = vj for some i ≠ j;

(SVP-3) δ is linear in each argument, i.e., ∀j = 1, . . . , n, ∀c ∈ R,

δ(v1, . . . , v_{j-1}, v + cw, v_{j+1}, . . . , vn)
= δ(v1, . . . , v_{j-1}, v, v_{j+1}, . . . , vn) + c δ(v1, . . . , v_{j-1}, w, v_{j+1}, . . . , vn). (B.59)

Exercise B.177. Give a geometric proof that the signed volume of the parallelogram determined by the two vectors v1 = (a, b)^T and v2 = (c, d)^T is

δ(v1, v2) = ad − bc = ⟨v1⊥, v2⟩, (B.60)

where v1⊥ denotes v1 rotated counterclockwise by a right angle.

Lemma B.178. Adding a multiple of one vector to another does not change the signed volume.

Proof. This follows directly from (SVP-2,3).

Lemma B.179. If the vectors v1, v2, . . . , vn are linearly dependent, then δ(v1, v2, . . . , vn) = 0.

Proof. WLOG, we assume v1 = Σ_{i=2}^n ci vi. Then the result follows from (SVP-2,3).

Lemma B.180. The signed volume δ is alternating, i.e.,

δ(v1, . . . , vi, . . . , vj, . . . , vn) = −δ(v1, . . . , vj, . . . , vi, . . . , vn). (B.61)

Exercise B.181. Prove Lemma B.180 using (SVP-2,3).

Lemma B.182. Let Mσ denote the matrix of a permutation σ : E → E where E is the set of standard basis vectors in (B.5). Then we have δ(Mσ) = sgn(σ).

Proof. There is a one-to-one correspondence between the vectors in the matrix

Mσ = [e_{σ(1)}, e_{σ(2)}, . . . , e_{σ(n)}]

and the scalars in the one-line notation

(σ(1) σ(2) . . . σ(n)).

A sequence of transpositions taking σ to the identity map also takes Mσ to the identity matrix. By Lemma B.180, each transposition yields a multiplication factor −1. Definition B.175 and (SVP-1) give δ(Mσ) = sgn(σ)δ(I) = sgn(σ).

Definition B.183 (Leibniz formula of determinants). The determinant of a square matrix A ∈ R^{n×n} is

det A = Σ_{σ∈Sn} sgn(σ) Π_{i=1}^n a_{σ(i),i}, (B.62)

where the sum is over the symmetric group Sn of all permutations and a_{σ(i),i} is the element of A at the σ(i)th row and the ith column.

Lemma B.184. The determinant of a matrix is the product of its eigenvalues, each of which is repeated according to its multiplicity.

Theorem B.185. The signed volume function satisfying (SVP-1,2,3) in Definition B.176 is unique and is the same as the determinant in (B.62).

Proof. Let the parallelotope be spanned by the column vectors v1, v2, . . . , vn, where vj = Σ_i v_{ij} e_i. We have

δ(v1, v2, . . . , vn)
= Σ_{i1=1}^n v_{i1,1} δ(e_{i1}, v2, . . . , vn)
= Σ_{i1,i2=1}^n v_{i1,1} v_{i2,2} δ(e_{i1}, e_{i2}, v3, . . . , vn)
= · · ·
= Σ_{i1,...,in=1}^n v_{i1,1} v_{i2,2} · · · v_{in,n} δ(e_{i1}, e_{i2}, . . . , e_{in})
= Σ_{σ∈Sn} v_{σ(1),1} v_{σ(2),2} · · · v_{σ(n),n} δ(e_{σ(1)}, e_{σ(2)}, . . . , e_{σ(n)})
= Σ_{σ∈Sn} v_{σ(1),1} v_{σ(2),2} · · · v_{σ(n),n} sgn(σ)
= Σ_{σ∈Sn} sgn(σ) Π_{i=1}^n v_{σ(i),i},

where the first four steps follow from (SVP-3), the fifth step from (SVP-2), and the sixth step from Lemma B.182. In other words, the signed volume δ(·) is zero whenever i_j = i_k for some j ≠ k, and hence the only nonzero terms are those for which (i1, i2, . . . , in) is a permutation of (1, 2, . . . , n).

Exercise B.186. Use the formula in (B.62) to show that det A = det A^T.

Definition B.187. The i, j cofactor of A ∈ R^{n×n} is

Cij = (−1)^{i+j} Mij, (B.63)

where Mij is the i, j minor of the matrix A, i.e. the determinant of the (n−1)×(n−1) matrix that results from deleting the i-th row and the j-th column of A.

Theorem B.188 (Laplace formula of determinants). Given fixed indices i, j ∈ {1, 2, . . . , n}, the determinant of an n-by-n matrix A = [aij] is given by

det A = Σ_{j'=1}^n a_{ij'} C_{ij'} = Σ_{i'=1}^n a_{i'j} C_{i'j}. (B.64)

Exercise B.189. Prove Theorem B.188 by induction.
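The Leibniz formula (B.62) and the Laplace expansion (B.64) can be compared directly in code. The sketch below implements both from scratch for a concrete 3-by-3 matrix; since (B.62) is symmetric in rows and columns, it also makes det A = det A^T (Exercise B.186) easy to verify:

```python
# Compare the Leibniz formula (B.62) with the Laplace expansion (B.64)
# on a concrete matrix; both must return det A, and A vs A^T must agree.
from itertools import permutations

def sign(sigma):
    """Sign of a permutation via the inversion count of Definition B.175."""
    n = len(sigma)
    s = sum(1 for j in range(n) for k in range(j + 1, n) if sigma[j] > sigma[k])
    return 1 if s % 2 == 0 else -1

def prod(xs):
    p = 1
    for x in xs:
        p *= x
    return p

def det_leibniz(A):
    """det A = sum over sigma of sgn(sigma) * prod_i A[sigma(i)][i]."""
    n = len(A)
    return sum(sign(sigma) * prod(A[sigma[i]][i] for i in range(n))
               for sigma in permutations(range(n)))

def det_laplace(A):
    """Cofactor expansion (B.64) along the first row, applied recursively."""
    n = len(A)
    if n == 1:
        return A[0][0]
    total = 0
    for j in range(n):
        minor = [row[:j] + row[j + 1:] for row in A[1:]]
        total += (-1) ** j * A[0][j] * det_laplace(minor)
    return total

A = [[2, 1, 0], [1, 3, 1], [0, 1, 4]]
At = [list(col) for col in zip(*A)]          # the transpose of A
assert det_leibniz(A) == det_laplace(A) == det_leibniz(At)
print(det_leibniz(A))
```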

Appendix C

Basic Analysis

C.1 Sequences

Definition C.1. A sequence is a function on N.

Definition C.2. The extended real number system is the real line R with two additional elements −∞ and +∞:

R* := R ∪ {−∞, +∞}. (C.1)

An extended real number x ∈ R* is finite if x ∈ R and it is infinite otherwise.

Definition C.3. The supremum of a sequence (an)_{n=m}^∞ is

sup(an)_{n=m}^∞ := sup{an : n ≥ m}, (C.2)

and the infimum of a sequence (an)_{n=m}^∞ is

inf(an)_{n=m}^∞ := inf{an : n ≥ m}. (C.3)

C.1.1 Convergence

Definition C.4 (Limit of a sequence). A sequence {an} has the limit L, written lim_{n→∞} an = L, or an → L as n → ∞, iff

∀ε > 0, ∃N, s.t. ∀n > N, |an − L| < ε. (C.4)

If such a limit L exists, we say that {an} converges to L.

Example C.5 (A story of π). A famous estimation of π in ancient China was given by Zu Chongzhi about 1500 years ago:

π ≈ 355/113 ≈ 3.14159292.

In modern mathematics, we approximate π with a sequence of increasing accuracy, e.g.

π ≈ 3.141592653589793 . . . (C.5)

As of March 2019, we human beings have more than 31 trillion digits of π. However, real-world applications never use even a small fraction of the 31 trillion digits:

• if you want to build a fence around your backyard swimming pool, several digits of π are probably enough;

• at NASA, calculations involving π use 15 digits for Guidance, Navigation, and Control;

• if you want to compute the circumference of the entire universe to an accuracy of less than the diameter of a hydrogen atom, you need only 39 decimal places of π.

On one hand, computational mathematics is judged by a metric that is different from that of pure mathematics; this may cause a huge gap between what needs to be done and what has been done. On the other hand, a computational mathematician cannot assume that a fixed accuracy is good enough for all applications. In approximating a number or a function, she must develop theory and algorithms that provide the user the choice of an ever-increasing amount of accuracy, so long as the user is willing to invest an increasing amount of computational resources. This is one of the main motivations for infinite sequences and series.

Lemma C.6. A convergent sequence has a unique limit.

Definition C.7. A sequence {an} is Cauchy iff

∀ε > 0, ∃N ∈ N s.t. m, n > N ⇒ |an − am| < ε. (C.6)

Lemma C.8. Every convergent sequence in R is Cauchy.

Proof. Since (xn) converges to some L ∈ R, for any given ε > 0 there exists N ∈ N such that for all n > N we have |xn − L| < ε/2. It follows that

∀n, m > N, |xn − xm| = |xn − L + L − xm| ≤ |xn − L| + |xm − L| < ε.

Lemma C.9. If a Cauchy sequence contains a convergent subsequence, then the entire sequence converges to the same limit.

Proof. Suppose {an} is a Cauchy sequence and {a_{nj}} is a subsequence converging to some a ∈ R. It follows that

∀ε > 0, ∃n0 s.t. ∀m, n ≥ n0, |am − an| ≤ ε/2;
∀ε > 0, ∃j0 s.t. ∀j ≥ j0, |a_{nj} − a| ≤ ε/2.

Set N = max{n0, n_{j0}} and we have

∀ε > 0, ∀n ≥ N, |an − a| ≤ |an − aN| + |aN − a| ≤ ε,

which completes the proof.

Lemma C.10. Every Cauchy sequence is bounded.
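Zu Chongzhi's rational approximation in Example C.5 can be checked against the double-precision value of π shipped with Python's standard library. This small sketch confirms that 355/113 is accurate to six decimal places (but not to eight):

```python
# Check the accuracy of Zu Chongzhi's 355/113 against double-precision pi.
import math

zu = 355 / 113
error = abs(zu - math.pi)

assert error < 5e-7   # correct to six decimal places ...
assert error > 1e-8   # ... but not to eight
print(f"355/113 = {zu:.10f}, |error| = {error:.2e}")
```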


Lemma C.11. Every real sequence has a monotone subsequence.

Theorem C.12. A bounded monotone sequence is convergent.

Theorem C.13 (Bolzano-Weierstrass). Every bounded sequence has a convergent subsequence.

Theorem C.14 (Cauchy criterion). Every Cauchy sequence in R converges to a limit in R.

Proof. By Lemma C.10, the Cauchy sequence (an) is bounded. Theorem C.13 implies that (an)_{n∈N} has a convergent subsequence (a_{nk})_{k∈N}. Then Lemma C.9 completes the proof.

Theorem C.15 (Completeness of R). A sequence of real numbers is Cauchy if and only if it is convergent.

Proof. This is a summary of Lemma C.8 and Theorem C.14.

C.1.2 Limit points

Definition C.16. Let ε > 0 be a real number. Two real numbers x, y are said to be ε-close iff |x − y| ≤ ε.

Definition C.17. A real number x is said to be ε-adherent to a sequence (an)_{n=m}^∞ of real numbers iff there exists an n ≥ m such that an is ε-close to x. x is continually ε-adherent to (an)_{n=m}^∞ iff it is ε-adherent to (an)_{n=N}^∞ for every N ≥ m.

Definition C.18. A real number x is a limit point or adherent point of a sequence (an)_{n=m}^∞ of real numbers iff it is continually ε-adherent to (an)_{n=m}^∞ for every ε > 0.

Definition C.19. The limit superior of a sequence (an)_{n=m}^∞ of real numbers is

lim sup_{n→∞} an := inf(a⁺_N)_{N=m}^∞, (C.7)

where a⁺_N = sup(an)_{n=N}^∞. The limit inferior of (an)_{n=m}^∞ is

lim inf_{n→∞} an := sup(a⁻_N)_{N=m}^∞, (C.8)

where a⁻_N = inf(an)_{n=N}^∞.

Example C.20. Let (an)_{n=m}^∞ be the sequence

1.1, −1.01, 1.001, −1.0001, 1.00001, . . .

Then (a⁺_N)_{N=m}^∞ is the sequence

1.1, 1.001, 1.001, 1.00001, 1.00001, . . .

and (a⁻_N)_{N=m}^∞ is the sequence

−1.01, −1.01, −1.0001, −1.0001, −1.000001, −1.000001, . . .

Hence we have

lim sup_{n→∞} an = 1, lim inf_{n→∞} an = −1.

Lemma C.21. Let (an)_{n=m}^∞ be a sequence of real numbers. For L⁺ = lim sup an and L⁻ = lim inf an, we have

(a) For every x > L⁺, elements of the sequence are eventually less than x:

∀x > L⁺, ∃N ≥ m s.t. ∀n ≥ N, an < x.

Similarly, for every x < L⁻, elements of the sequence are eventually greater than x:

∀x < L⁻, ∃N ≥ m s.t. ∀n ≥ N, an > x.

(b) For every x < L⁺, there are an infinite number of elements in the sequence that are greater than x:

∀x < L⁺, ∀N ≥ m, ∃n ≥ N s.t. an > x.

Similarly, for every x > L⁻, there are an infinite number of elements in the sequence that are less than x:

∀x > L⁻, ∀N ≥ m, ∃n ≥ N s.t. an < x.

(c) inf(an)_{n=m}^∞ ≤ L⁻ ≤ L⁺ ≤ sup(an)_{n=m}^∞.

(d) Any limit point c of (an)_{n=m}^∞ satisfies L⁻ ≤ c ≤ L⁺.

(e) If L⁺ (or L⁻) is finite, then it is a limit point of (an)_{n=m}^∞.

(f) lim_{n→∞} an = c if and only if L⁺ = L⁻ = c.

Theorem C.22 (Squeeze test or the Sandwich Theorem). Let (an)_{n=m}^∞, (bn)_{n=m}^∞, and (cn)_{n=m}^∞ be sequences of real numbers that satisfy

∃M ∈ N s.t. ∀n ≥ M, an ≤ bn ≤ cn.

Suppose (an)_{n=m}^∞ and (cn)_{n=m}^∞ both converge to the same limit L. Then (bn)_{n=m}^∞ also converges to L.

Notation 8 (Asymptotic notation). For g : R⁺ → R⁺, f : R → R, and a ∈ [0, +∞], we write

f(x) = O(g(x)) as x → a

iff

lim sup_{x→a} |f(x)|/g(x) < ∞.

In particular, we have

f(x) = o(g(x)) as x → a  ⇔  lim_{x→a} |f(x)|/g(x) = 0.

We also write

f(x) = Θ(g(x)) as x → a

iff

0 < lim sup_{x→a} |f(x)|/g(x) < ∞.
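The tail suprema a⁺_N and tail infima a⁻_N of Definition C.19 can be computed explicitly for the first few terms of the sequence in Example C.20. This sketch truncates the sequence at twelve terms (so the tails are finite maxima/minima rather than true suprema over infinite tails), and shows the nonincreasing tail suprema heading to lim sup = 1 and the nondecreasing tail infima heading to lim inf = −1:

```python
# Tail suprema/infima for a_n = (-1)^(n+1) * (1 + 10^-n), the sequence
# 1.1, -1.01, 1.001, -1.0001, ... of Example C.20.
def a(n):
    return (-1) ** (n + 1) * (1 + 10 ** (-n))

N_MAX = 12
terms = [a(n) for n in range(1, N_MAX + 1)]

tail_sup = [max(terms[N:]) for N in range(6)]   # a+_N for N = 0..5 (truncated)
tail_inf = [min(terms[N:]) for N in range(6)]   # a-_N for N = 0..5 (truncated)

assert tail_sup == sorted(tail_sup, reverse=True)   # nonincreasing in N
assert tail_inf == sorted(tail_inf)                 # nondecreasing in N
assert abs(tail_sup[-1] - 1) < 1e-4                 # approaching lim sup = 1
assert abs(tail_inf[-1] + 1) < 1e-4                 # approaching lim inf = -1
print(tail_sup[:3], tail_inf[:3])
```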


C.2 Series

Definition C.23 (Finite series). Let m, n be integers and let (ai)_{i=m}^n be a finite sequence of real numbers. The finite series or finite sum associated with the sequence (ai)_{i=m}^n is the number Σ_{i=m}^n ai given by the recursive formula

Σ_{i=m}^n ai := 0 if n < m; a_n + Σ_{i=m}^{n−1} ai otherwise. (C.9)

Definition C.24 (Formal infinite series). A (formal) infinite series associated with an infinite sequence {an} is the expression Σ_{n=0}^∞ an.

Definition C.25. The sequence of partial sums (Sn)_{n=0}^∞ associated with a formal infinite series Σ_{i=0}^∞ ai is defined for each n as the sum of the sequence {ai} from a0 to an,

Sn = Σ_{i=0}^n ai. (C.10)

Definition C.26. A formal infinite series is said to be convergent and to converge to L if its sequence of partial sums converges to some limit L. In this case we write L = Σ_{n=0}^∞ an and call L the sum of the infinite series.

Definition C.27. A formal infinite series is said to be divergent if its sequence of partial sums diverges. In this case we do not assign any real number value to this series.

Lemma C.28. An infinite series Σ_{n=0}^∞ an of real numbers is convergent if and only if

∀ε > 0, ∃N ∈ N s.t. ∀p, q ≥ N, |Σ_{n=p}^q an| ≤ ε. (C.11)

Definition C.29. An infinite series Σ_{n=0}^∞ an is absolutely convergent iff the series Σ_{n=0}^∞ |an| is convergent.

Lemma C.30. An infinite series that is absolutely convergent is convergent.

Theorem C.31 (Root test). For an infinite series Σ_{n=0}^∞ an, define

α := lim sup_{n→∞} |an|^{1/n}. (C.12)

The series is convergent if α < 1 and divergent if α > 1.

Theorem C.32 (Ratio test). An infinite series Σ_{n=0}^∞ an of nonzero real numbers is

• absolutely convergent if lim sup_{n→∞} |a_{n+1}|/|a_n| < 1;

• divergent if lim inf_{n→∞} |a_{n+1}|/|a_n| > 1.

C.3 Continuous functions on R

Definition C.33. A scalar function is a function whose range is a subset of R.

Definition C.34 (Limit of a scalar function with one variable). Consider a function f : I → R with I(c, r) = (c − r, c) ∪ (c, c + r). The limit of f(x) exists as x approaches c, written lim_{x→c} f(x) = L, iff

∀ε > 0, ∃δ > 0, s.t. ∀x ∈ I(c, δ), |f(x) − L| < ε. (C.13)

Example C.35. Show that lim_{x→2} 1/x = 1/2.

Proof. If ε ≥ 1/2, choose δ = 1. Then x ∈ (1, 3) implies |1/x − 1/2| < 1/2 ≤ ε, since 1/x − 1/2 is a monotonically decreasing function with its supremum at x = 1.

If ε ∈ (0, 1/2), choose δ = ε. Then x ∈ (2 − ε, 2 + ε) ⊂ (3/2, 5/2). Hence |1/x − 1/2| = |2 − x|/|2x| < |2 − x| < ε. The proof is completed by Definition C.34.

Definition C.36. f : R → R is continuous at c iff

lim_{x→c} f(x) = f(c). (C.14)

Definition C.37. A scalar function f is continuous on (a, b), written f ∈ C(a, b), iff (C.14) holds ∀x ∈ (a, b).

Theorem C.38 (Extreme values). A continuous function f : [a, b] → R attains its maximum at some point x_max ∈ [a, b] and its minimum at some point x_min ∈ [a, b].

Theorem C.39 (Intermediate value). A scalar function f ∈ C[a, b] satisfies

∀y ∈ [m, M], ∃ξ ∈ [a, b] s.t. y = f(ξ), (C.15)

where m = inf_{x∈[a,b]} f(x) and M = sup_{x∈[a,b]} f(x).

Definition C.40. Let I = (a, b). A function f : I → R is uniformly continuous on I iff

∀ε > 0, ∃δ > 0, s.t. ∀x, y ∈ I, |x − y| < δ ⇒ |f(x) − f(y)| < ε. (C.16)

Example C.41. Show that, on (a, ∞), f(x) = 1/x is uniformly continuous if a > 0 and is not so if a = 0.

Proof. If a > 0, then |f(x) − f(y)| = |x − y|/(xy) < |x − y|/a². Hence ∀ε > 0, ∃δ = a²ε, s.t.

|x − y| < δ ⇒ |f(x) − f(y)| < |x − y|/a² < a²ε/a² = ε.

If a = 0, negating the condition of uniform continuity, i.e. (C.16), yields ∃ε > 0 s.t. ∀δ > 0, ∃x, y > 0 s.t. (|x − y| < δ) ∧ (|1/x − 1/y| ≥ ε).

We prove a stronger version: ∀ε ∈ (0, 2), ∀δ > 0, ∃x, y > 0 s.t. |x − y| < δ and |f(x) − f(y)| ≥ ε; this suffices because the negation only requires a single ε.

If δ ≥ 1/2, choose x = 1/2, y = 1/4. This choice satisfies |x − y| < δ since x − y = 1/4 < 1/2 ≤ δ. However, |f(x) − f(y)| = |x − y|/(xy) = 2 ≥ ε.

If δ < 1/2, then 2δ² < δ. Choose x ∈ (0, δ²) and y ∈ (2δ², δ). This choice satisfies |x − y| < δ and |x − y| > δ². However, |f(x) − f(y)| = |x − y|/(xy) > δ²/(xy) > δ²/(δ² · δ) = 1/δ > 2 > ε.

Exercise C.42. On (a, ∞), f(x) = 1/x² is uniformly continuous if a > 0 and is not so if a = 0.

Theorem C.43. Uniform continuity implies continuity but the converse is not true.
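The ratio test (Theorem C.32) can be watched in action numerically. For a_n = 2^n/n! the ratios |a_{n+1}|/|a_n| = 2/(n+1) tend to 0 < 1, so the series converges (its sum is e²); the sketch below also shows the partial sums settling down, in the spirit of Lemma C.28:

```python
# Ratio test in action for a_n = 2^n / n!: ratios 2/(n+1) -> 0 < 1,
# and the partial sums converge (to e^2 = 7.389056...).
from math import factorial

a = [2 ** n / factorial(n) for n in range(30)]
ratios = [a[n + 1] / a[n] for n in range(29)]

assert abs(ratios[9] - 2 / 10) < 1e-12   # |a_{n+1}|/|a_n| = 2/(n+1)
assert ratios[-1] < 0.1                  # the ratios shrink toward 0

partial_sums = []
s = 0.0
for term in a:
    s += term
    partial_sums.append(s)

# late increments are tiny, as (C.11) requires of a convergent series
assert abs(partial_sums[29] - partial_sums[25]) < 1e-9
print(partial_sums[-1])                  # approximately e^2
```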


Proof. Exercise.

Theorem C.44. f : R → R is uniformly continuous on (a, b) iff it can be extended to a continuous function f̃ on [a, b].

C.4 Differentiation of functions

Definition C.45. The derivative of a function f : R → R at a is the limit

f′(a) = lim_{h→0} (f(a + h) − f(a))/h. (C.17)

If the limit exists, f is differentiable at a.

Example C.46. For the power function f(x) = x^α, we have f′ = αx^{α−1} due to Newton's generalized binomial theorem,

(a + h)^α = Σ_{n=0}^∞ C(α, n) a^{α−n} h^n,

where C(α, n) := α(α − 1) · · · (α − n + 1)/n! is the generalized binomial coefficient.

Definition C.47. A function f(x) is k times continuously differentiable on (a, b) iff f^(k)(x) exists on (a, b) and is itself continuous. The set or space of all such functions on (a, b) is denoted by C^k(a, b). In comparison, C^k[a, b] is the space of functions f for which f^(k)(x) is bounded and uniformly continuous on (a, b).

Theorem C.48. A scalar function f is bounded on [a, b] if f ∈ C[a, b].

Theorem C.49. If f : (a, b) → R assumes its maximum or minimum at x0 ∈ (a, b) and f is differentiable at x0, then f′(x0) = 0.

Proof. Suppose f′(x0) > 0. Then we have

f′(x0) = lim_{x→x0} (f(x) − f(x0))/(x − x0) > 0.

The definition of a limit implies

∃δ > 0 s.t. a < x0 − δ < x0 + δ < b,

which, together with |x − x0| < δ, implies (f(x) − f(x0))/(x − x0) > 0. This is a contradiction to f(x0) being a maximum when we choose x ∈ (x0, x0 + δ). The other cases are similar.

Theorem C.50 (Rolle's). If a function f : R → R satisfies

(i) f ∈ C[a, b] and f′ exists on (a, b),

(ii) f(a) = f(b),

then ∃x ∈ (a, b) s.t. f′(x) = 0.

Proof. By Theorem C.39, all values between sup f and inf f will be assumed. If f(a) = f(b) = sup f = inf f, then f is a constant on [a, b] and thus the conclusion holds. Otherwise, Theorem C.49 completes the proof.

Theorem C.51 (Mean value). If f ∈ C[a, b] and if f′ exists on (a, b), then ∃ξ ∈ (a, b) s.t. f(b) − f(a) = f′(ξ)(b − a).

Proof. Construct a linear function L : [a, b] → R such that L(a) = f(a) and L(b) = f(b); then ∀x ∈ (a, b) we have L′(x) = (f(b) − f(a))/(b − a). Consider g(x) = f(x) − L(x) on [a, b]: g(a) = 0, g(b) = 0. By Theorem C.50, ∃ξ ∈ (a, b) such that g′(ξ) = 0, which completes the proof.

C.5 Taylor series

Lemma C.52. A series converges to L iff the associated sequence of partial sums converges to L.

Definition C.53. A power series centered at c is a series of the form

p(x) = Σ_{n=0}^∞ an (x − c)^n, (C.18)

where the an's are the coefficients. The interval of convergence is the set of values of x for which the series converges:

Ic(p) = {x | p(x) converges}. (C.19)

Definition C.54. If the derivatives f^(i)(x) with i = 1, 2, . . . , n exist for a function f : R → R at x = c, then

Tn(x) = Σ_{k=0}^n (f^(k)(c)/k!) (x − c)^k (C.20)

is called the nth Taylor polynomial for f(x) at c. In particular, the linear approximation for f(x) at c is

T1(x) = f(c) + f′(c)(x − c). (C.21)

Example C.55. If f ∈ C^∞, then ∀n ∈ N, we have

Tn^(m)(x) = Σ_{k=m}^n (f^(k)(c)/(k − m)!) (x − c)^{k−m} if m ∈ N, m ≤ n;
Tn^(m)(x) = 0 if m ∈ N, m > n.

This can be proved by induction. In the inductive step, we regroup the summation into a constant term and another shifted summation.

Definition C.56. The Taylor series (or Taylor expansion) for f(x) at c is

Σ_{k=0}^∞ (f^(k)(c)/k!) (x − c)^k. (C.22)

Definition C.57. The remainder of the nth Taylor polynomial in approximating f(x) is

En(x) = f(x) − Tn(x). (C.23)

Theorem C.58. Let Tn be the nth Taylor polynomial for f(x) at c. Then

lim_{n→∞} En(x) = 0 ⇔ lim_{n→∞} Tn(x) = f(x). (C.24)

Lemma C.59. ∀m = 0, 1, 2, . . . , n, En^(m)(c) = 0.

Proof. This follows from Definition C.54 and Example C.55.
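The Taylor polynomials of Definition C.54 are easy to evaluate for f = exp at c = 0, where f^(k)(0) = 1 for all k. The sketch below tabulates the remainders E_n(x) of Definition C.57 at a fixed x and checks that they decrease toward 0, in line with Theorem C.58:

```python
# Taylor polynomials T_n of exp at c = 0 and their remainders E_n(x).
from math import exp, factorial

def taylor_exp(x, n):
    """T_n(x) = sum_{k=0}^{n} x^k / k! for f = exp at c = 0."""
    return sum(x ** k / factorial(k) for k in range(n + 1))

x = 1.5
remainders = [abs(exp(x) - taylor_exp(x, n)) for n in range(12)]

# E_n(x) decreases monotonically here and is tiny by n = 11
assert all(remainders[n + 1] < remainders[n] for n in range(11))
assert remainders[11] < 1e-6
print(remainders[::4])
```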


Theorem C.60 (Taylor’s theorem with Lagrangian form). Definition C.63. The Riemann sum of f : R → R over a
Consider a function f : R → R. If f ∈ C n [c − d, c + d] and partition Pn is
f (n+1) (x) exists on (c − d, c + d), then ∀x ∈ [c − d, c + d], n
there exists some ξ between c and x such that Sn (f ) =
X
f (x∗i )(xi − xi−1 ), (C.30)
f (n+1) (ξ) i=1
En (x) = (x − c)n+1 . (C.25)
(n + 1)! where x∗i ∈ Ii is a sample point of the ith subinterval.
Proof. Fix x 6= c, let M be the unique solution of Definition C.64. A function f : R → R is Riemann inte-
n+1 grable on [a, b] iff
M (x − c)
En (x) = f (x) − Tn (x) = .
(n + 1)! ∃L ∈ R, s.t. ∀ > 0, ∃δ > 0 s.t.
Consider the function ∀Pn (a, b) with h(Pn ) < δ, |Sn (f ) − L| < . (C.31)
M (t − c)n+1 Rb
g(t) := En (t) − . (C.26) In this case we write L = a
f (x)dx and call it the Riemann
(n + 1)! integral of f on [a, b].
Clearly g(x) = 0. By Lemma C.59, g (k) (c) = 0 for each Example C.65. The following function f : [a, b] → R is
k = 0, 1, . . . , n. Then Rolle’s theorem implies that not Riemann integrable.
∃x1 ∈ (c, x) s.t. g 0 (x1 ) = 0. (
1 if x is rational;
f (x) =
If x < c, change (c, x) above to (x, c). Apply Rolle’s theorem 0 if x is irrational.
to g 0 (t) on (c, x1 ) and we have
To see this, we first negate the logical statement in (C.31)
∃x2 ∈ (c, x1 ) s.t. g (2) (x2 ) = 0. to get
Repeatedly using Rolle’s theorem, ∀L ∈ R, ∃ > 0, s.t. ∀δ > 0
(n+1) ∃Pn (a, b) with h(Pn ) < δ, s.t. |Sn (f ) − L| ≥ .
∃xn+1 ∈ (c, xn ) s.t. g (xn+1 ) = 0. (C.27)
(n+1)
Since Tn is a polynomial of degree n, we have Tn (t) = 0, If |L| < b−a ∗
2 , we choose all xi ’s to be rational so that
which, together with (C.27) and (C.26), yields f (x∗i ) ≡ 1; then (C.30) yields Sn (f ) = b − a. For  = b−a4 ,
the formula |Sn (f ) − L| ≥  clearly holds.
f (n+1) (xn+1 ) − M = 0. If |L| ≥ b−a ∗
2 , we choose all xi ’s to be irrational so that
The proof is completed by identifying ξ with xn+1 . f (xi ) ≡ 0; then (C.30) yields Sn (f ) = 0. For  = b−a

4 , the
formula |Sn (f ) − L| ≥  clearly holds.
Example C.61. How many terms are needed to compute
e2 correctly to four decimal places? Definition C.66. If f : R → R is integrable on [a, b], then
The requirement of four decimal places means an accu- the limit of the Riemann sum of f is called the definite in-
racy of at least  = 10−5 . By Definition C.56, the Taylor tegral of f on [a, b]:
series of ex at c = 0 is Z b
+∞ n f (x)dx = lim Sn (f ). (C.32)
X x hn →0
ex = . a
n!
n=0 Theorem C.67. A scalar function f is integrable on [a, b]
By Theorem C.60, we have if f ∈ C[a, b].

∃ξ ∈ [0, 2] s.t. En (2) = eξ 2n+1 /(n + 1)! < e2 2n+1 /(n + 1)! Definition C.68. A monotonic function is a function be-
tween ordered sets that either preserves or reverses the given
Then e2 2n+1 /(n + 1)! ≤  yields n ≥ 12, i.e., 13 terms. order. In particular, f : R → R is monotonically increasing
if ∀x, y, x ≤ y ⇒ f (x) ≤ f (y); f : R → R is monotonically
decreasing if ∀x, y, x ≤ y ⇒ f (x) ≥ f (y).
C.6 Riemann integral
Theorem C.69. A scalar function is integrable on [a, b] if
Definition C.62. A partition of an interval I = [a, b] is a it is monotonic on [a, b].
totally-ordered finite subset Pn ⊆ I of the form
Exercise C.70. True or false: a bijective function is either
Pn (a, b) = {a = x0 < x1 < · · · < xn = b}. (C.28) order-preserving or order-reversing?

The interval Ii = [xi−1 , xi ] is the ith subinterval of the par- Theorem C.71 (Integral mean value). Let w : [a, b] → R+
tition. The norm of the partition is the length of the longest be integrable on [a, b]. For f ∈ C[a, b], ∃ξ ∈ [a, b] s.t.
subinterval, Z b Z b
h = h(P ) = max(x − x ), i = 1, 2, . . . , n. (C.29) w(x)f (x)dx = f (ξ) w(x)dx. (C.33)
n n i i−1
a a
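The count in Example C.61 can be verified mechanically: the sketch below finds the smallest n for which the Lagrange-remainder bound e² · 2^{n+1}/(n+1)! drops to 10^{-5} or below, and then confirms that the resulting 13-term partial sum really matches e² to within that tolerance:

```python
# Verify Example C.61: the remainder bound forces n = 12, i.e. 13 terms.
from math import exp, factorial

eps = 1e-5
n = 0
while exp(2) * 2 ** (n + 1) / factorial(n + 1) > eps:
    n += 1
assert n == 12                        # 13 terms: k = 0, 1, ..., 12

approx = sum(2 ** k / factorial(k) for k in range(n + 1))
assert abs(approx - exp(2)) < eps     # the partial sum meets the accuracy
print(n + 1, approx)
```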


Proof. Denote m = inf_{x∈[a,b]} f(x), M = sup_{x∈[a,b]} f(x), and I = ∫_a^b w(x)dx. Then mw(x) ≤ f(x)w(x) ≤ Mw(x) and

mI ≤ ∫_a^b w(x)f(x)dx ≤ MI.

w > 0 implies I ≠ 0, hence

m ≤ (1/I) ∫_a^b w(x)f(x)dx ≤ M.

Applying Theorem C.39 completes the proof.

Theorem C.72 (First fundamental theorem of calculus). Let a < b be real numbers. For a continuous function f : [a, b] → R that is Riemann integrable, define a function F : [a, b] → R by

F(x) := ∫_a^x f(y)dy. (C.34)

Then F is differentiable and

∀x0 ∈ [a, b], F′(x0) = f(x0). (C.35)

Theorem C.73 (Second fundamental theorem of calculus). Let a < b be real numbers and let f : [a, b] → R be a Riemann integrable function. If F : [a, b] → R is the antiderivative of f, i.e. F′(x) = f(x), then

∫_a^b f = F(b) − F(a). (C.36)

C.7 Convergence in metric spaces

Definition C.74. A metric is a function d : X × X → [0, +∞) that satisfies, for all x, y, z ∈ X,

(1) non-negativity: d(x, y) ≥ 0;

(2) identity of indiscernibles: x = y ⇔ d(x, y) = 0;

(3) symmetry: d(x, y) = d(y, x);

(4) triangle inequality: d(x, z) ≤ d(x, y) + d(y, z).

A metric space is an ordered pair (X, d) where X is a set and d is a metric on X.

Example C.75. Set X to be C[a, b], the set of continuous functions [a, b] → R. Then the following is a metric on X:

d(x, y) = max_{t∈[a,b]} |x(t) − y(t)|. (C.37)

Definition C.76 (Limiting value of a function). Let (X, dX) and (Y, dY) be metric spaces. Let E be a subset of X and x0 ∈ X be an adherent point of E. A function f : X → Y is said to converge to L ∈ Y as x converges to x0 ∈ E, written

lim_{x→x0; x∈E} f(x) = L, (C.38)

iff

∀ε > 0, ∃δ > 0 s.t. ∀x ∈ E, |x − x0|_X < δ ⇒ |f(x) − L|_Y < ε. (C.39)

Notation 9. In Definition C.76 we used the synonym notation

|u − v|_X := dX(u, v). (C.40)

Definition C.77 (Pointwise convergence). Let (fn)_{n=1}^∞ be a sequence of functions from one metric space (X, dX) to another (Y, dY), and let f : X → Y be another function. We say that (fn)_{n=1}^∞ converges pointwise to f on X iff

∀x ∈ X, lim_{n→∞} fn(x) = f(x), (C.41)

or, equivalently,

∀ε > 0, ∀x ∈ X, ∃N ∈ N⁺ s.t. ∀n > N, |fn(x) − f(x)|_Y < ε. (C.42)

Example C.78. Consider fn : [0, 1] → R defined by fn(x) := x^n and f : [0, 1] → R defined by

f(x) := 1 if x = 1; 0 if x ∈ [0, 1).

The functions fn are continuous and converge pointwise to f, which is discontinuous. Hence pointwise convergence does not preserve continuity.

Example C.79. For the functions in Example C.78, we have lim_{x→1; x∈[0,1)} x^n = 1 for all n and lim_{x→1; x∈[0,1)} f(x) = 0; it follows that

lim_{n→∞} lim_{x→x0; x∈X} fn(x) ≠ lim_{x→x0; x∈X} lim_{n→∞} fn(x).

Hence pointwise convergence does not preserve limits.

Example C.80. Consider the interval [a, b] = [0, 1], and the function sequence fn : [a, b] → R given by

fn(x) := 2n if x ∈ (1/(2n), 1/n); 0 otherwise.

Then (fn) converges pointwise to f(x) = 0. However, ∫_a^b fn = 1 for every n while ∫_a^b f = 0. Hence

lim_{n→∞} ∫_a^b fn ≠ ∫_a^b lim_{n→∞} fn.

Hence pointwise convergence does not preserve the integral.

Example C.81. Pointwise convergence does not preserve boundedness. For example, the function sequence

fn(x) = exp(x) if exp(x) ≤ n; n if exp(x) > n (C.43)

converges pointwise to f(x) = exp(x). Similarly, the function sequence

fn(x) = 1/x if x ≥ 1/n; 0 if x ∈ (0, 1/n) (C.44)

converges pointwise to f(x) = 1/x. As another example, the function sequence

fn(x) = n sin(x/n) (C.45)

converges pointwise to f(x) = x.
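Example C.78 can be probed numerically: for each fixed x < 1 the values x^n die out, yet the sup-distance of Example C.75 between f_n and the pointwise limit stays bounded away from 0 for every n. The sketch approximates that sup-distance by a maximum over a fine grid (an under-estimate of the true supremum, which already suffices here):

```python
# f_n(x) = x^n on [0, 1]: pointwise limit is discontinuous, and the
# convergence is pointwise but not uniform (the sup-distance stays ~1).
def f_n(x, n):
    return x ** n

def f_limit(x):
    return 1.0 if x == 1.0 else 0.0

# pointwise: at x = 0.9 the values x^n die out
assert f_n(0.9, 200) < 1e-9 and f_limit(0.9) == 0.0

# but the grid-sampled sup-distance does not shrink with n
grid = [i / 1000 for i in range(1001)]
for n in (10, 100, 1000):
    d = max(abs(f_n(x, n) - f_limit(x)) for x in grid)
    assert d > 0.3          # stays bounded away from 0, so not uniform
print("pointwise but not uniform")
```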


Definition C.82 (Uniform convergence). Let (fn)_{n=1}^∞ be a sequence of functions from one metric space (X, dX) to another (Y, dY), and let f : X → Y be another function. We say that (fn)_{n=1}^∞ converges uniformly to f on X iff

∀ε > 0, ∃N ∈ N⁺ s.t. ∀x ∈ X, ∀n > N, |fn(x) − f(x)|_Y < ε. (C.46)

The sequence (fn) is locally uniformly convergent to f iff for every point x ∈ X there is an r > 0 such that (fn|_{Br(x)∩X}) is uniformly convergent to f on Br(x) ∩ X.

Theorem C.83. Uniform convergence implies pointwise convergence.

Proof. This follows directly from (C.42), (C.46), and Theorem A.10.

Example C.84 (Uniform convergence of Taylor series). Consider f : R → R and the sequence of its Taylor polynomials (Tn)_{n=1}^∞ in Definition C.54. For any interval Ir := (a − r, a + r), (Tn)_{n=1}^∞ converges locally uniformly to f|_{Ir} if r is less than or equal to the radius of convergence of f at a. In particular, (Tn)_{n=1}^∞ converges locally uniformly to f if the radius of convergence of f is +∞.

Theorem C.85. Consider bij ≥ 0 in R* := R ∪ {+∞, −∞}. If bij is monotone increasing in i for each j and is monotone increasing in j for each i, then we have

lim_{i→∞} lim_{j→∞} bij = lim_{j→∞} lim_{i→∞} bij (C.47)

with all the indicated limits existing in R*.

Theorem C.86. Suppose that {fn : [a, b] → R} is a sequence of continuous functions satisfying

• {fn(x0)} converges for some x0 ∈ [a, b],
• each fn is differentiable on (a, b),
• {f′n} converges uniformly on (a, b).

Then we have

• {fn} converges uniformly on [a, b] to a function f,
• both f′(x) and lim_n f′n(x) exist for any x ∈ (a, b),
• f′(x) = lim_n f′n(x) for any x ∈ (a, b).

C.8 Vector calculus

Lemma C.87. For E ⊂ R, f : E → R, x0 ∈ E, and L ∈ R, the following two statements are equivalent:

(a) f is differentiable at x0 and f′(x0) = L;

(b) lim_{x→x0, x∈E\{x0}} |f(x) − f(x0) − L(x − x0)| / |x − x0| = 0.

Exercise C.88. Prove Lemma C.87.

Definition C.89 (Total derivative). For E ⊂ R^n, f : E → R^m, and x0 ∈ E, f is differentiable at x0 with derivative L : R^n → R^m iff

lim_{x→x0, x∈E\{x0}} ‖f(x) − f(x0) − L(x − x0)‖_2 / ‖x − x0‖_2 = 0. (C.48)

We denote the derivative of f with f′(x0) = L and also call it the total derivative of f.

Example C.90. For f : R² → R² and L : R² → R²,

f(x, y) := (x², y²), L(x, y) := (2x, 4y), (C.49)

we claim that f is differentiable at (1, 2) with f′(x0) = L. To show this, we compute

lim_{(x,y)→(1,2), (x,y)≠(1,2)} ‖f(x, y) − f(1, 2) − L((x, y) − (1, 2))‖_2 / ‖(x, y) − (1, 2)‖_2
= lim_{(a,b)→(0,0), (a,b)≠(0,0)} ‖f(1 + a, 2 + b) − f(1, 2) − L(a, b)‖_2 / ‖(a, b)‖_2
= lim_{(a,b)→(0,0), (a,b)≠(0,0)} ‖((1 + a)², (2 + b)²) − (1, 4) − (2a, 4b)‖_2 / ‖(a, b)‖_2
= lim_{(a,b)→(0,0), (a,b)≠(0,0)} ‖(a², b²)‖_2 / ‖(a, b)‖_2
≤ lim_{(a,b)→(0,0), (a,b)≠(0,0)} [ ‖(a², 0)‖_2 / ‖(a, b)‖_2 + ‖(0, b²)‖_2 / ‖(a, b)‖_2 ]
≤ lim_{(a,b)→(0,0), (a,b)≠(0,0)} sqrt(a² + b²)
= 0.

Lemma C.91. Let E be a subset of R^n, f : E → R^m a function, and x0 ∈ E an interior point of E. Suppose f is differentiable at x0 with derivative L1 and also differentiable at x0 with derivative L2. Then L1 = L2.

Exercise C.92. Prove Lemma C.91.

Definition C.93 (Directional derivative). Let E be a subset of R^n, f : E → R^m a function, x0 ∈ E an interior point of E, and v ∈ R^n a vector. If the limit

lim_{t→0; t>0, x0+tv∈E} (f(x0 + tv) − f(x0))/t

exists, we say that f is differentiable in the direction v at x0, and we denote this limit as

Dv f(x0) := lim_{t→0; t>0, x0+tv∈E} (f(x0 + tv) − f(x0))/t. (C.50)

Example C.94. For v = (3, 4) and f : R² → R² defined in (C.49), we have Dv f(1, 2) = (6, 16).

Example C.95. For f : R → R, D₊₁ f(x) is the right derivative of f at x (if it exists), and similarly D₋₁ f(x) is the left derivative of f at x (if it exists).

Lemma C.96. Let E be a subset of R^n, f : E → R^m a function, x0 ∈ E an interior point of E, and v ∈ R^n a vector. If f is differentiable at x0, then f is also differentiable in the direction v at x0, and

Dv f(x0) = f′(x0)v. (C.51)
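Examples C.90 and C.94 can be checked with a one-sided difference quotient, i.e. by evaluating (C.50) at a small but finite t: for f(x, y) = (x², y²) and v = (3, 4), the result should approach f′(1, 2)v = (6, 16) by Lemma C.96. A minimal sketch:

```python
# Finite-difference check of Example C.94: D_v f(1,2) = (6, 16) for
# f(x, y) = (x^2, y^2) and v = (3, 4), per Lemma C.96.
def f(x, y):
    return (x * x, y * y)

def directional_derivative(x0, y0, v, t=1e-6):
    """One-sided difference quotient (f(x0 + t v) - f(x0)) / t of (C.50)."""
    fx, fy = f(x0, y0)
    gx, gy = f(x0 + t * v[0], y0 + t * v[1])
    return ((gx - fx) / t, (gy - fy) / t)

dv = directional_derivative(1.0, 2.0, (3.0, 4.0))
assert abs(dv[0] - 6.0) < 1e-4 and abs(dv[1] - 16.0) < 1e-4
print(dv)
```

The O(t) truncation error of the one-sided quotient is why the tolerance is 1e-4 rather than machine precision.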


Definition C.97 (Partial derivative). Let E be a subset of R^n, f : E → R^m a function, x0 ∈ E an interior point of E, and 1 ≤ j ≤ n. The partial derivative of f with respect to the xj variable at x0 is defined by

∂f/∂xj (x0) := lim_{t→0; t≠0, x0+tej∈E} (f(x0 + t ej) − f(x0))/t = (d/dt) f(x0 + t ej)|_{t=0} (C.52)

provided that the limit exists. Here ej is the jth standard basis vector of R^n.

Exercise C.98. Show that the existence of partial derivatives at x0 does not imply that the function is differentiable at x0 by considering the differentiability of the following function f : R² → R at (0, 0):

f(x, y) = x³/(x² + y²) if (x, y) ≠ (0, 0); 0 if (x, y) = (0, 0).

Theorem C.99. Let E be a subset of R^n, f : E → R^m a function, F a subset of E, and x0 ∈ E an interior point of F. If all the partial derivatives ∂f/∂xj exist on F and are continuous at x0, then f is differentiable at x0, and the linear transformation f′(x0) : R^n → R^m is defined by

f′(x0)(v) = Σ_{j=1}^n vj (∂f/∂xj)(x0). (C.53)

Definition C.100. The derivative matrix or differential matrix or Jacobian matrix of a differentiable function f : R^n → R^m is an m × n matrix,

Df := [ ∂f1/∂x1  ∂f1/∂x2  · · ·  ∂f1/∂xn ]
      [ ∂f2/∂x1  ∂f2/∂x2  · · ·  ∂f2/∂xn ]
      [   ...       ...    · · ·    ...   ]
      [ ∂fm/∂x1  ∂fm/∂x2  · · ·  ∂fm/∂xn ]  (C.54)
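The Jacobian matrix (C.54) can be approximated column by column with the difference quotients of (C.52). The following sketch uses forward differences (an O(h) approximation, hence the loose tolerance) for f(x, y) = (x², y²) at (1, 2), where the exact Jacobian is [[2, 0], [0, 4]]:

```python
# Forward-difference approximation of the Jacobian matrix (C.54):
# column j of Df collects the partial derivatives with respect to x_j.
def f(v):
    x, y = v
    return [x * x, y * y]

def jacobian(f, x0, h=1e-6):
    """Df approximated column by column with forward differences."""
    fx = f(x0)
    m, n = len(fx), len(x0)
    J = [[0.0] * n for _ in range(m)]
    for j in range(n):
        xh = list(x0)
        xh[j] += h                  # perturb only the j-th coordinate
        fh = f(xh)
        for i in range(m):
            J[i][j] = (fh[i] - fx[i]) / h
    return J

J = jacobian(f, [1.0, 2.0])
exact = [[2.0, 0.0], [0.0, 4.0]]
assert all(abs(J[i][j] - exact[i][j]) < 1e-4 for i in range(2) for j in range(2))
print(J)
```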
Appendix D

Point-set Topology

D.1 Topological spaces


D.1.1 A motivating problem from biology
Definition D.1. Phenotype refers to the physical, organizational, and behavioral expression of an organism during its lifetime while genotype refers to a heritable repository of information that instructs the production of molecules, whose interactions with the environment generate and maintain the phenotype.

Example D.2. The collection of genes responsible for eye color in a particular individual is a genotype while the observable eye coloration in the individual is the corresponding phenotype.

Phenotype and genotype are two fundamental concepts in the classical framework of evolution. The genotype-phenotype relationship is of great importance in biology in that evolution is driven by the selection of phenotypes that causes the amplification of their underlying genotypes and the production of novel phenotypes through genetic mutation.

In phenotypic innovation, the heritable modification of a phenotype usually does not involve a direct intervention at the phenotypic level, but proceeds indirectly through changes at the genetic level during a number of processes known as development. While selection is clearly an important driving force of evolution, the dynamics of selection does not tell us much about how evolutionary innovations arise in the first place. A mutation is advantageous if it generates a phenotype favored by selection, but this definition reveals nothing about why or how that mutation could innovate the phenotype. Hence a model of the genotype-phenotype relation is needed to illuminate how genetic changes map into phenotypic changes.

This subsection concerns such a model proposed by Fontana and Schuster [1998a,b] and Stadler et al. [2001] based on the shape of ribonucleic acid (RNA) sequences. Building blocks of strands of RNA are smaller molecules called nucleotides, which have four different types: guanine (G), cytosine (C), adenine (A), and uracil (U). The sequence of an RNA molecule functions as a genotype, since it can be directly replicated by suitable enzymes. Meanwhile, nonadjacent nucleotide pairs undergo additional (weaker) bonding, contorting the sequence into a more complicated three-dimensional structure. It is by this process that an RNA sequence always acquires a physical shape, and this shape functions as the phenotype.

Definition D.3. The primary structure of an RNA molecule is its unfolded nucleotide chain, often represented by a genotype sequence over the alphabet set {C, G, A, U}. The secondary structure or bonding diagram or RNA shape of an RNA molecule is an unlabeled diagram depicting the bonding that occurs in the resulting RNA molecule.

Example D.4. In the plot below, a genotype sequence gets folded into a three-dimensional structure represented by the planar graph. [Figure: folded RNA structure.] The following plot shows the bonding diagram, with the dot as the location of the first nucleotide in the sequence. [Figure: bonding diagram.]

81
Qinghai Zhang Numerical Analysis 2021

This RNA shape conveys biochemical behavior to an RNA molecule and is therefore subject to selection.

Exercise D.5. How does the phenotype change with the genotype?

Example D.6. A unique RNA shape can be assigned to each genotype sequence. This function is not injective, since multiple genotype sequences may result in the same RNA shape. In the meantime, a single-entry change in the genotype sequence may completely alter the RNA shape. Consider the following genotype sequences.

[Figure: three genotype sequences.]

The corresponding bonding diagrams are as follows.

[Figure: the three bonding diagrams.]

The bonding diagrams are the same for sequences 1 and 3, which are identical except for the 16th entry. On the other hand, sequence 2 differs from sequence 1 in only the 24th entry, but their diagrams are very different.

Definition D.7. A point mutation is a mutation from one genotype sequence to another by changing a single entry in the sequence. Two sequences are called neighbors if they can be converted to each other by a point mutation.

Definition D.8. The sequence space of length n (proposed by Eigen [1971]) is a metric space of all genotype sequences of length n, with the metric being the distance between two sequences, i.e., the smallest number of point mutations required to convert one sequence to the other.

Example D.9. We show below a sequence space of length 4 over the binary alphabet {0, 1}.

[Figure: the sequence space of length 4 over {0, 1}.]

Definition D.10. The neutral network of an RNA shape s, denoted by N(s), is the set of all genotype sequences that result in s after folding and bonding.

Example D.11. Consider the sequence space GC10 of length 10 over the alphabet {G, C}. There are 1024 possible sequences, and, after folding and bonding, they result in eight different RNA shapes, as shown below.

[Figure: the eight RNA shapes.]

The number to the right of a sequence Si is #N(Si).

Example D.12. The possibility of changing the genotype while preserving the phenotype is a manifestation of a certain degree of phenotypic robustness toward genetic mutations. Meanwhile, it is a key factor underlying the capacity of a system to evolve.


In the above plots, imagine a population with phenotype 'star' in an evolutionary situation where phenotype 'triangle' would be advantageous or desirable. But phenotype 'triangle' may not be accessible to phenotype 'star' in the vicinity of the population's current location. However, due to the neutral network of 'star,' the population is not stuck, but can drift on that network into far-away regions, vastly improving its chances of encountering the neutral network of 'triangle.' Therefore, neutral networks enable phenotypic innovation by permitting the accumulation of neutral mutations.

Exercise D.13. How do we capture and quantify the accessibility of one (favorable) phenotype from another (less favorable) by means of mutations in the sequence space? For any two phenotypes, is there always a directed path from one to the other?

Definition D.14. A phenotype space is a set of RNA shapes on which a topology is defined to quantify the proximity of RNA shapes.

Definition D.15. The mutation probability of an RNA shape r to another RNA shape s is defined as

    pr,s := mr,s / mr,∗,    (D.1)

where mr,s is the number of point mutations that change a sequence in N(r) to a neighboring sequence in N(s), and mr,∗ is the number of point mutations that change a sequence in N(r) to a neighboring sequence in any other network.

Exercise D.16. Show that the mutation probability cannot be a metric on the phenotype space.

Example D.17 (Bubble sort). To sort the sequence 51428, the first pass of the algorithm goes as follows.

( 5 1 4 2 8 ) --> ( 1 5 4 2 8 )
( 1 5 4 2 8 ) --> ( 1 4 5 2 8 )
( 1 4 5 2 8 ) --> ( 1 4 2 5 8 )
( 1 4 2 5 8 ) --> ( 1 4 2 5 8 )

The second pass goes as follows.

( 1 4 2 5 8 ) --> ( 1 4 2 5 8 )
( 1 4 2 5 8 ) --> ( 1 2 4 5 8 )
( 1 2 4 5 8 ) --> ( 1 2 4 5 8 )
( 1 2 4 5 8 ) --> ( 1 2 4 5 8 )

Now the array is already sorted, but the algorithm does not know whether it is completed. The algorithm needs one whole pass without any swap to know it is sorted. The third pass goes as follows.

( 1 2 4 5 8 ) --> ( 1 2 4 5 8 )
( 1 2 4 5 8 ) --> ( 1 2 4 5 8 )
( 1 2 4 5 8 ) --> ( 1 2 4 5 8 )
( 1 2 4 5 8 ) --> ( 1 2 4 5 8 )

This algorithm is expressed in C++ as follows. (The pass-by-reference swap requires C++; the array length n is passed explicitly because an array parameter decays to a pointer, so sizeof(a)/sizeof(a[0]) cannot recover the length inside the function.)

void swap(int& b, int& c){
    int temp = b;
    b = c;
    c = temp;
}

void bubble_sort(int a[], int n){
    for (int j=0; j<n-1; j++)
        for (int i=0; i<n-1-j; i++)
            if (a[i] > a[i+1])
                swap(a[i], a[i+1]);
}

As a limitation of the above implementation, the program does not apply to the data type char, nor to any other data type without an implicit conversion, even if the "less than" binary relation for such a data type is natural. You have to manually repeat the above program for each data type. An elegant solution is to use function templates in C++ as follows.

template<typename T>
void swap(T& b, T& c){
    T temp = b;
    b = c;
    c = temp;
}

template<typename T>
void bubble_sort(T a[], int n)
{
    for (int i=0; i<n-1; i++)
        for (int j=0; j<n-1-i; j++)
            if (a[j] > a[j+1])
                swap<T>(a[j], a[j+1]);
}

D.1.2 Generalizing continuous maps

Definition D.18. A function f : R → R is continuous at a iff

∀ε > 0 ∃δ > 0 s.t. |x − a| < δ ⇒ |f(x) − f(a)| < ε.    (D.2)

Definition D.19. A function f : Rn → Rm is continuous at x = a iff

∀ε > 0 ∃δ > 0 s.t. f(B(a, δ)) ⊂ B(f(a), ε),    (D.3)

where the n-dimensional open ball B(p, r) is

B(p, r) = {x ∈ Rn : ‖x − p‖2 < r}.    (D.4)

Definition D.20. A function f : X → Y with X ⊂ Rn and Y ⊂ Rm is continuous at x = a iff

∀ε > 0 ∃δ > 0 s.t. f(Va) ⊂ Ua,    (D.5)

where the two sets associated with a are

Va := B(a, δ) ∩ X,    Ua := B(f(a), ε) ∩ Y.    (D.6)

Definition D.21. A function f : X → Y is continuous if it is continuous at every point a ∈ X.


Example D.22. Is the function x ↦ 1/x continuous? It depends on whether its domain includes the origin. But it is indeed continuous on domains such as (0, 1], R \ {0}, and [1, 2]. Note that definitions of one-sided continuity in calculus are nicely incorporated in Definition D.20.

Definition D.23. A function f : X → Y with X ⊂ Rn and Y ⊂ Rm is continuous iff

∀Ua ∈ γY , ∃Va ∈ γX s.t. f(Va) ⊂ Ua,    (D.7)

where γX and γY are the sets of intersections of the open balls with X and Y , respectively:

γX := {B(a, δ) ∩ X : a ∈ X, δ ∈ R+};
γY := {B(f(a), ε) ∩ Y : f(a) ∈ Y, ε ∈ R+}.

Definition D.24. A basis of neighborhoods (or a basis) on a set X is a collection B of subsets of X such that

• covering: ∪B = X, and
• refining: ∀U, V ∈ B, ∀x ∈ U ∩ V, ∃B ∈ B s.t. x ∈ B ⊂ (U ∩ V).

Definition D.25. For two sets X, Y with bases of neighborhoods BX , BY , a surjective function f : X → Y is continuous iff

∀U ∈ BY ∃V ∈ BX s.t. f(V) ⊂ U.    (D.8)

Lemma D.26. If a surjective function f : X → Y is continuous in the sense of Definitions D.20 and D.21, then it is continuous in the sense of Definition D.25.

Proof. By Definition D.24, the following collections are bases of X ⊆ Rn and Y = f(X) ⊆ Rm, respectively:

BX = {B(a, δ) ∩ X : a ∈ X, δ > 0};
BY = {B(b, ε) ∩ Y : b ∈ Y, ε > 0}.

The rest follows from Definitions D.25 and D.20.

Example D.27. The right rays

BRR = {{x : x > s} : s ∈ R}    (D.9)

form a basis of R.

Exercise D.28. Prove that the set of all right half-intervals in R is a basis of neighborhoods:

B = {[a, b) : a < b}.    (D.10)

Example D.29. A basis on R2 is the set of all quadrants

Bq = {Q(r, s) : r, s ∈ R},    (D.11)
Q(r, s) = {(x, y) ∈ R2 : x > r, y > s}.    (D.12)

Exercise D.30. Denote an open square in R2 as

S((a, b), d) = {(x, y) : max(|x − a|, |y − b|) < d}.

Prove that

• for (m, n) ∈ R2 and d > 0,
  ∀(a, b) ∈ S((m, n), d), ∃r > 0 s.t. S((a, b), r) ⊂ S((m, n), d);
• the set of all open squares in R2 is a basis of R2,
  Bs = {S((a, b), d) : (a, b) ∈ R2, d > 0}.

Exercise D.31. Show that the closed balls (r > 0)

B̄(p, r) = {x ∈ Rn : ‖x − p‖2 ≤ r}    (D.13)

do not form a basis of Rn. However, the following collection is indeed a basis:

Bp = {B̄(a, r) : a ∈ Rn, r ≥ 0},    (D.14)

which is the union of all closed balls and all singleton sets.

D.1.3 Open sets: from bases to topologies

Definition D.32. A subset U of X is open (with respect to a given basis of neighborhoods B of X) iff

∀x ∈ U ∃B ∈ B s.t. x ∈ B ⊂ U.    (D.15)

Lemma D.33. Each neighborhood in the basis B is open.

Proof. This follows from B ⊂ B ∈ B and Definition D.32.

Exercise D.34. What are the open subsets of R with respect to the right rays in (D.9)?

Lemma D.35. The intersection of two open sets is open.

Proof. Let U1 and U2 be two open sets and fix a point x ∈ U1 ∩ U2. By Definition D.32, there exist B1, B2 ∈ B such that x ∈ B1 ⊂ U1 and x ∈ B2 ⊂ U2. Then Definition D.24 implies that there exists B3 ∈ B such that x ∈ B3 ⊂ B1 ∩ B2 ⊂ U1 ∩ U2. The proof is then completed by Definition D.32 and x being arbitrary.

Lemma D.36. The union of two open sets is open.

Lemma D.37. The union of any collection of open sets is open.

Definition D.38. The topology of X generated by a basis B is the collection T of all open subsets of X in the sense of Definition D.32.

Definition D.39. The standard topology is the topology generated by the standard Euclidean basis, which is the collection of all open balls in X = Rn.

Theorem D.40. The topology T of X generated by a basis satisfies

• ∅, X ∈ T ;
• α ⊂ T ⇒ ∪U∈α U ∈ T ;
• U, V ∈ T ⇒ U ∩ V ∈ T .

Proof. The first item follows from Definition D.32. The others follow from Lemmas D.35 and D.37.


Example D.41. The largest basis on a set X is the set of all subsets of X,

Bd(X) = {A : A ⊂ X} = 2^X,    (D.16)

and the topology it generates is called the discrete topology, which coincides with the basis. This topology is more economically generated by the basis of all singletons,

Bs(X) = {{x} : x ∈ X}.    (D.17)

The smallest basis on X is simply {X}, and the topology it generates is called the trivial/anti-discrete/indiscrete topology Ta = {∅, X}.

Exercise D.42. Show that if U is open with respect to a basis B, then B ∪ {U} is also a basis.

D.1.4 Topological spaces: from topologies to bases

Definition D.43. For an arbitrary set X, a collection T of subsets of X is called a topology on X iff it satisfies the following conditions:

(TPO-1) ∅, X ∈ T ;
(TPO-2) α ⊂ T ⇒ ∪α ∈ T ;
(TPO-3) U, V ∈ T ⇒ U ∩ V ∈ T .

The pair (X, T ) is called a topological space. The elements of T are called open sets.

Corollary D.44. The topology of X generated by a basis B as in Definition D.38 is indeed a topology in the sense of Definition D.43.

Proof. This follows directly from Theorem D.40.

Example D.45. For each n ∈ Z, define

B(n) = {n} if n is odd;  {n − 1, n, n + 1} if n is even.    (D.18)

The topology generated by the basis B = {B(n) : n ∈ Z} is called the digital line topology, and we refer to Z with this topology as the digital line.

Theorem D.46. A topology generated by a basis B equals the collection of all unions of elements of B. (In particular, the empty set is the union of an "empty collection" of elements of B.)

Proof. Given a collection of elements of B, Lemma D.33 states that each of them belongs to T . Since T is a topology, (TPO-2) implies that all unions of these elements are also in T . Conversely, given an open set U ∈ T , we can choose for each x ∈ U an element Bx ∈ B such that x ∈ Bx ⊂ U. Hence U = ∪x∈U Bx, and this completes the proof.

Corollary D.47. Let T be a topology on X generated by the basis B. Then every open set U ∈ T is a union of some basis neighborhoods in B. (In particular, the empty set is the union of an "empty collection" of elements of B.)

Lemma D.48. Let (X, T ) be a topological space. Suppose a collection of open sets C ⊂ T satisfies

∀U ∈ T , ∀x ∈ U, ∃C ∈ C s.t. x ∈ C ⊂ U.    (D.19)

Then C is a basis for T .

Proof. We first show that C is a basis. The covering condition holds trivially by setting U = X in (D.19). As for the refining condition, let x ∈ C1 ∩ C2 where C1, C2 ∈ C. Since C1 ∩ C2 is open, (D.19) implies that there exists C3 ∈ C such that x ∈ C3 ⊂ C1 ∩ C2. Hence C is a basis by Definition D.24.

Then we show that the topology T ′ generated by C equals T . On one hand, for any U ∈ T and any x ∈ U, by (D.19) there exists C ∈ C such that x ∈ C ⊂ U. By Definitions D.32 and D.38, we have U ∈ T ′. On the other hand, it follows from Corollary D.47 that any W ∈ T ′ is a union of elements of C. Since each element of C is in T , we have W ∈ T .

Example D.49. The following countable collection

B = {(a, b) : a < b, a and b are rational}    (D.20)

is a basis that generates the standard topology on R.

Lemma D.50. A collection of subsets of X is a topology on X if and only if it generates itself.

Proof. The necessity holds trivially, since (TPO-1) implies the covering condition and (TPO-3) implies the refining condition. As for the sufficiency, suppose U, V ∈ T . By Definition D.32, U ∪ V is also open, hence U ∪ V ∈ T . This argument holds for the union of an arbitrary number of open sets.

D.1.5 Generalized continuous maps

Definition D.51. The preimage of a set U ⊂ Y (or the fiber over U) under f : X → Y is

f⁻¹(U) := {x ∈ X : f(x) ∈ U}.    (D.21)

Exercise D.52. Show that the operation f⁻¹ preserves inclusions, unions, intersections, and differences of sets:

B0 ⊆ B1 ⇒ f⁻¹(B0) ⊆ f⁻¹(B1),
f⁻¹(B0 ∪ B1) = f⁻¹(B0) ∪ f⁻¹(B1),
f⁻¹(B0 ∩ B1) = f⁻¹(B0) ∩ f⁻¹(B1),
f⁻¹(B0 \ B1) = f⁻¹(B0) \ f⁻¹(B1).    (D.22)

In comparison, f only preserves inclusions and unions:

A0 ⊆ A1 ⇒ f(A0) ⊆ f(A1),
f(A0 ∪ A1) = f(A0) ∪ f(A1),
f(A0 ∩ A1) ⊆ f(A0) ∩ f(A1),
f(A0 \ A1) ⊇ f(A0) \ f(A1),    (D.23)

where the inclusions in the last two lines become equalities if f is injective.

Lemma D.53. For a map f : X → Y , A ⊆ X, and B ⊆ Y , we have

A ⊆ f⁻¹(f(A)),    f(f⁻¹(B)) ⊆ B,    (D.24)

where the first inclusion is an equality if f is injective and the second is an equality if f is surjective or B ⊆ f(X).


Proof. By (D.21), a ∈ A implies a ∈ f⁻¹(f(A)). Conversely, a ∈ f⁻¹(f(A)) implies f(a) ∈ f(A); f being injective then dictates a ∈ A.

By (D.21), b ∈ f(f⁻¹(B)) implies b ∈ B. Furthermore, if f is surjective or B ⊆ f(X), then for any b ∈ B we have f⁻¹({b}) ≠ ∅ and thus

b ∈ f(f⁻¹({b})) ⊆ f(f⁻¹(B)).

Definition D.54 (Continuous maps between topological spaces). A function f : X → Y is continuous iff the preimage of each open set U ⊂ Y is open in X.

Lemma D.55. Let f : X → Y be a continuous function in the sense of Definition D.54. Then the preimage of an open subset U ⊂ Y satisfies

f⁻¹(U) = ∪x∈X Vx,    (D.25)

where the set Vx is a basis element of X containing x such that f(Vx) ⊂ U.

Proof. Since U is open, Definition D.54 implies that f⁻¹(U) is open. Then by Definition D.32, x ∈ f⁻¹(U) implies the existence of Vx ∈ BX such that x ∈ Vx ⊂ f⁻¹(U). Hence f⁻¹(U) ⊂ ∪x∈X Vx.

Conversely, the condition f(Vx) ⊂ U and (D.22) yield f⁻¹(f(Vx)) ⊂ f⁻¹(U), which, together with Lemma D.53, implies Vx ⊂ f⁻¹(U). Hence ∪x∈X Vx ⊂ f⁻¹(U).

Theorem D.56. If a surjective function is continuous in the sense of Definition D.54, it is continuous in the sense of Definition D.25.

Proof. Consider U ∈ BY . By Lemma D.33, U is open, and then Definition D.54 implies that f⁻¹(U) is open in X. The surjectivity of f implies that f⁻¹(U) is not empty. Then by Definition D.32 we have

∀x ∈ f⁻¹(U), ∃V ∈ BX s.t. x ∈ V ⊂ f⁻¹(U),

hence f(V) ⊂ f(f⁻¹(U)) = U, cf. Lemma D.53.

Exercise D.57. Show that Lemma D.55 holds in the sense of a strengthened continuity of Definition D.25 as follows:

∀U ∈ BY , ∀x ∈ X satisfying f(x) ∈ U, ∃V ∈ BX satisfying x ∈ V s.t. f(V) ⊂ U.

Example D.58. A continuous function is not necessarily "well behaved," as exemplified by the following space-filling Hilbert curve.

[Figure: stages of the space-filling Hilbert curve.]

D.1.6 The subbasis topology

Definition D.59. A subbasis S on X is a collection of subsets of X such that the covering condition in Definition D.24 holds.

Example D.60. The set of all open balls with radii no less than a given h > 0, written Bh, is a subbasis but not a basis.

Definition D.61. The topology of X generated by a subbasis S is the collection TS of all unions of finite intersections of elements of S.

Exercise D.62. Show that the topology generated by a subbasis S as in Definition D.61 is indeed a topology in the sense of Definition D.43.

Exercise D.63. Show S ⊂ TS. In other words, for the topology generated by a subbasis S, every set in S is an open set in X.

Exercise D.64. Show that if T is a topology on X containing S, then TS ⊂ T .

Exercise D.65. Assume that each x ∈ X is contained in at most finitely many sets in S, and let Bx be the intersection of all sets in S that contain x. Show that

• the collection BS := {Bx : x ∈ X} is a basis for TS;
• if B is a basis for TS, then BS ⊂ B.

D.1.7 The topology of phenotype spaces

Example D.66. Consider the sequence space GC10 in Example D.11. Suppose for GC10 the value of pi,j, i.e. the mutation probability from si to sj as in Definition D.15, is the number in the ith row and the jth column in the following table.

[Table: the mutation probabilities pi,j for the eight RNA shapes.]

The proximity of RNA shapes is based on the likelihood of a point mutation from one RNA shape to another. By Example D.11, there are eight RNA shapes, and hence it is reasonable to assume that pi,j > 1/7 implies that the sj phenotype is accessible to si. Hence we define

∀i = 1, 2, . . . , 8,    Ri := {si} ∪ {sj : pi,j > 1/7}.    (D.26)

Each Ri is a row in the following table.

[Table: the subbasis rows Ri.]


The topology T1/7 on GC10 is defined as the topology generated by the subbasis

R1/7 := {Ri : i = 1, 2, . . . , 8}.    (D.27)

It can be shown that a topology on a finite set has a unique minimal basis that generates the topology. In this case, the basis is illustrated as follows.

[Figure: the minimal basis generating T1/7.]

Exercise D.67. Change the threshold value in Example D.66 from 1/7 to 1/10 and repeat the entire process. What is the minimum value of q such that Tq on GC10 becomes the discrete topology?

D.1.8 Closed sets

Definition D.68. A subset of X is called closed if its complement is open.

Example D.69. The set

K = {1/n : n ∈ Z+}    (D.28)

is neither open nor closed. In comparison, K ∪ {0} is closed.

Theorem D.70. The set σ of all closed subsets of X satisfies the following conditions:

(TPC-1) ∅, X ∈ σ;
(TPC-2) α ⊂ σ ⇒ ∩α ∈ σ;
(TPC-3) U, V ∈ σ ⇒ U ∪ V ∈ σ.

Example D.71. The following example shows that infinite intersections of open sets might not be open and infinite unions of closed sets might not be closed:

∩{(−1/n, 1/n) : n = 1, 2, . . .} = {0};
∪{[−1 + 1/n, 1 − 1/n] : n = 1, 2, . . .} = (−1, 1).

Lemma D.72. A function f : X → Y is continuous if and only if the preimage of any closed set is closed.

Proof. By Definition D.51, we have

f⁻¹(U) = f⁻¹(Y \ (Y \ U)) = X \ f⁻¹(Y \ U).

The rest follows from Definitions D.54 and D.68.

Definition D.73. The graph of a function f : X → Y is the set {(x, y) ∈ X × Y : y = f(x)}.

Lemma D.74. The graph of a continuous function f : [a, b] → R is closed in the space [a, b] × R.

Exercise D.75. Give an example of the graph of a discontinuous function f : R → R that is not closed in R2. Give another example of the graph of a discontinuous function f : R → R that is closed in R2.

Exercise D.76. Let X be a topological space.

(a) For a continuous function f : X → R, show that the set {x ∈ X : f(x) = r}, i.e. the solution set of any equation with respect to f for some r ∈ R, is closed.
(b) Show that this fails for a general continuous function f : X → Y where Y is an arbitrary topological space.
(c) What condition on Y would guarantee that the conclusion holds?

D.1.9 Interior–Frontier–Exterior

Definition D.77. A point x ∈ X is an interior point of A if there is a neighborhood W of x that lies entirely in A. The set of interior points of a set U is called its interior and is denoted by Int(U).

Lemma D.78. Int(A) is open for any A.

Proof. Exercise.

Example D.79. The interior of a closed ball is the corresponding open ball.

Definition D.80. A point x ∈ X is an exterior point of A if there is a neighborhood W of x that lies entirely in X \ A. The set of exterior points of a set U is called its exterior and is denoted by Ext(U).

Example D.81. The exterior of the set K in (D.28) is R \ K \ {0}. Why is 0 excluded?

Definition D.82. A point x is a closure point of A if each neighborhood of x contains some point in A.

Example D.83. Any point in the set K in (D.28) is a closure point of K, and so is 0.

Definition D.84. A point x is an accumulation point (or a limit point) of A if each neighborhood of x contains some point p ∈ A with p ≠ x.

Example D.85. The only accumulation point of the set K in (D.28) is 0.


Example D.86. Each point in R is an accumulation point of Q.

Definition D.87. A point x in a set A is isolated if there exists a neighborhood of x such that x is the only point of A in this neighborhood.

Example D.88. Every point of the set K in (D.28) is isolated.

Definition D.89. A point x is a frontier point of a set A iff it is a closure point of both A and its complement. The set of all frontier points is called the frontier Fr(A) of A.

Theorem D.90. For any set A in X, its interior, its frontier, and its exterior form a partition of X.

Proof. Consider an arbitrary point a ∈ X. If there exists a neighborhood Na of a such that Na ⊂ A, then Definition D.77 implies a ∈ Int(A). If Na ⊂ X \ A, then Definition D.80 implies a ∈ Ext(A). Otherwise, for all neighborhoods of a we have Na ⊄ A and Na ⊄ X \ A, which implies that any Na contains points both from A and from X \ A. The rest follows from Definition D.89.

Definition D.91. The closure of A, written Cl(A) or Ā, is the set of all closure points of A.

Lemma D.92. Int(A) ⊂ A ⊂ Cl(A).

Lemma D.93. Cl(A) = Int(A) ∪ Fr(A).

Theorem D.94. The closure of a set A is the smallest closed set containing A:

Cl(A) = ∩{G : A ⊂ G, G is closed in X}.    (D.29)

Proof. Write α := {G : A ⊂ G, G is closed in X} and A⁻ := ∩α; we need to show

• A⁻ ⊂ Cl(A);
• A⁻ ⊃ Cl(A).

We only prove the first part and leave the other as an exercise. Consider x ∉ Cl(A). Then by Definitions D.82 and D.91 there exists an open neighborhood Nx of x such that Nx ∩ A = ∅. Hence the set P := X \ Nx contains A. P is also closed because Nx is open. Therefore P ∈ α and x ∉ A⁻.

Exercise D.95. Prove Cl(A ∩ B) ⊂ Cl(A) ∩ Cl(B). What if we have infinitely many sets?

Theorem D.96. The interior of a set A is the largest open set contained in A:

Int(A) = ∪{U : U ⊂ A, U is open in X}.    (D.30)

Theorem D.97. Let A′ be the set of accumulation points of A. Then Cl(A) = A ∪ A′.

Proof. Suppose x ∈ Cl(A). If x ∈ A, then x ∈ A ∪ A′ trivially holds. Otherwise x ∉ A, and Definition D.91 dictates that each of its neighborhoods must contain at least one point in A; such a point is necessarily different from x, so Definition D.84 yields x ∈ A′. In both cases we have x ∈ A ∪ A′.

Conversely, suppose x ∈ A ∪ A′. If x ∈ A, Lemma D.92 implies x ∈ Cl(A). If x ∈ A′, x is an accumulation point of A and is thus a closure point of A.

Corollary D.98. A subset of a topological space is closed if and only if it contains all of its accumulation points.

Proof. Suppose A contains A′, the set of all accumulation points of A. We have A = A ∪ A′ = Cl(A) from Theorem D.97, and Theorem D.94 then implies that A is closed.

Conversely, suppose A is closed, but there is an accumulation point x of A such that x ∉ A. By Definition D.84, every neighborhood of x contains a point p ∈ A with p ≠ x; this contradicts the complement of A being open.

D.1.10 Hausdorff spaces

Definition D.99. Suppose X is a set with a basis of neighborhoods γ. Let {xn : n = 1, 2, . . .} be a sequence of elements of X and a ∈ X. Then we say the sequence converges to a, written

lim(n→∞) xn = a,  or  xn → a as n → ∞,

iff

∀U ∈ γ with a ∈ U, ∃N ∈ N+ s.t. n > N ⇒ xn ∈ U.    (D.31)

Exercise D.100. Prove that the definition remains equivalent if we replace "basis γ" with "topology T ."

Exercise D.101. Show that if a sequence converges with respect to a basis γ, it also converges with respect to any basis equivalent to γ.

Theorem D.102. Continuous functions preserve convergence, i.e., for a continuous f : X → Y , lim(n→∞) xn = a implies lim(n→∞) f(xn) = f(a).

Proof. This follows from Definitions D.99 and D.25.

Exercise D.103. A sequence α = {xn : n = 1, 2, . . .} in a topological space X can be viewed as a subset of X, A = {xn : n ∈ N+}. Compare the meanings of the closure points of A and the accumulation points of A. What about the limit of α?

Exercise D.104. For the metric topology, show that a function f : X → Y is continuous if and only if the function commutes with limits for any convergent sequence in X.

Example D.105. When do we have xn → a for the discrete topology?

Example D.106. When do we have xn → a for the anti-discrete topology?

Definition D.107. A topological space (X, T ) is called a Hausdorff space iff

∀a, b ∈ X with a ≠ b, ∃U, V ∈ T s.t. a ∈ U, b ∈ V, U ∩ V = ∅.    (D.32)

Lemma D.108. Every finite set of points in a Hausdorff space is closed.


Proof. By (TPC-3) in Theorem D.70, it suffices to show that every singleton set is closed. Consider X \ {x0}. For any x ≠ x0, Definition D.107 states that there exist U, V ∈ T with x ∈ U and x0 ∈ V such that U ∩ V = ∅; hence x0 ∉ U and U ⊂ X \ {x0}. Therefore X \ {x0} is open.

Exercise D.109. Does there exist a topological space X that is not Hausdorff but in which every finite point set is closed?

Definition D.110. A topological space is called a T1 space iff every finite subset is closed in it.

Theorem D.111. Let X be a T1 space and A a subset of X. A point x is an accumulation point of A if and only if every neighborhood of x intersects with infinitely many points of A.

Proof. The sufficiency follows directly from Definition D.84. As for the necessity, suppose there exists a neighborhood U of x such that (A \ {x}) ∩ U = {x1, x2, . . . , xm}. Then by Definition D.110 we know that

U ∩ (X \ {x1, x2, . . . , xm}) = U ∩ (X \ (A \ {x}))

is an open set containing x, yet it does not contain any points in A other than x. This contradicts the condition of x being an accumulation point of A.

Theorem D.112. A sequence of points in a Hausdorff space X converges to at most one point in X.

Proof. By Definition D.99, convergence to two distinct points in X would be a contradiction to Definition D.107.

D.2 Continuous maps

D.2.1 The subspace/relative topology

Lemma D.113. Consider a subset A of a topological space X. Suppose γX is a basis of neighborhoods of X. Then

γA := {W ∩ A : W ∈ γX}    (D.33)

is a basis of neighborhoods of A.

Proof. The covering condition for A holds because the covering condition of X holds. As for the refining condition, for any U, V ∈ γA and any x ∈ U ∩ V, there exist U′, V′ ∈ γX such that U = U′ ∩ A, V = V′ ∩ A, and x ∈ W′ ⊂ U′ ∩ V′ for some W′ ∈ γX. Setting W := W′ ∩ A, we have

x ∈ W ⊂ (U′ ∩ V′) ∩ A = (U′ ∩ A) ∩ (V′ ∩ A) = U ∩ V,

which completes the proof.

Definition D.114. The topology generated by γA in (D.33) is called the relative topology or subspace topology on A generated by the basis γX of X.

Lemma D.115. Consider a subset A of a topological space X. Suppose TX is a topology on X. Then

TA := {W ∩ A : W ∈ TX}    (D.34)

is a topology on A.

Proof. For (TPO-1), we choose W = ∅ and W = X. For (TPO-2),

∪W∈α (W ∩ A) = (∪W∈α W) ∩ A,

where ∪W∈α W is an open subset of X. For (TPO-3),

(U ∩ A) ∩ (V ∩ A) = (U ∩ V) ∩ A,

where U ∩ V is an open subset of X.

Definition D.116 (Subspace and subspace topology). Given a topological space (X, T ) and a subset A ⊂ X, the topological space (A, TA) is called a subspace of X, and the topology TA in (D.34) is called the subspace topology or relative topology induced by X.

Theorem D.117. Let γX be a basis that generates the topology TX on a topological space X. Then the subspace topology on A induced by TX is equivalent to the subspace topology generated by γX. In other words, TA is generated by γA.

γX --(open)--> TX
 |∩A            |∩A
 v              v
γA --(open)--> TA

Proof. We first show that U is open with respect to (w.r.t.) γA for any given U ∈ TA. By Lemma D.115, there exists U′ ∈ TX such that U = U′ ∩ A. The condition of γX being a basis of X yields

∀y ∈ U′, ∃B′ ∈ γX s.t. y ∈ B′ ⊂ U′,

which implies

∀x ∈ U ⊂ U′, ∃B := (B′ ∩ A) ∈ γA s.t. x ∈ B ⊂ U.

It remains to show that any set U that is open w.r.t. γA is in TA, i.e., we need to find U′ ∈ TX such that U = U′ ∩ A. Since U is open w.r.t. γA, Definition D.32 yields

∀x ∈ U, ∃Nx ∈ γA s.t. x ∈ Nx ⊂ U,

where Nx = N′x ∩ A for some N′x ∈ γX. We then choose

U′ := ∪x∈U N′x.

Theorem D.46 implies that U′ is open, and U = U′ ∩ A.

Lemma D.118. Let A be a subspace of X. If U is open in A and A is open in X, then U is open in X.

Proof. Since U is open in A, Definition D.116 yields

∃U′ ∈ TX s.t. U = U′ ∩ A;

the rest of the proof follows from A being open in X.

Lemma D.119 (Closedness in a subspace). Let A be a subspace of X. Then a set V ⊂ A is closed in A if and only if it equals the intersection of A with a closed subset of X.

Proof. Suppose V is closed in A. Then

∃V′ ⊂ A s.t. V ∪ V′ = A, V′ ∈ TA.

Since A is a subspace of X, we have from Definition D.116

∃U′ ⊂ X s.t. V′ = U′ ∩ A, U′ ∈ TX.

Hence the set U := X \ U′ is closed in X and

A ∩ U = A ∩ (X \ U′) = A \ (X \ (X \ U′)) = A \ U′ = A \ (U′ ∩ A) = A \ V′ = V.

Conversely, suppose

∃U ⊂ X s.t. (X \ U) ∈ TX, V = U ∩ A.

Define V′ := (X \ U) ∩ A and we know from Definition D.116 that V′ is open in A. The proof is then completed by

V ∪ V′ = (U ∩ A) ∪ ((X \ U) ∩ A) = A,

where the last step follows from the condition A ⊂ X.

Corollary D.120 (Transitivity of relative closedness). Let A be a subspace of X. If V is closed in A and A is closed in X, then V is closed in X.

Proof. This follows directly from Lemma D.119 by using V = V̄ ∩ A, where V̄ is the closure of V in X.

D.2.2 New maps from old ones

Theorem D.121. The composition of continuous functions is continuous.

Proof. Suppose we have continuous functions f : X → Y and g : Y → Z. Let h = gf : X → Z be their composition. Then for any open set U in Z,

h⁻¹(U) = (gf)⁻¹(U) = f⁻¹(g⁻¹(U))

is open due to the continuity of g and f and Definition D.54.

Theorem D.122. Suppose X is a topological space and f, g : X → R are continuous functions. Then f + g, f − g, and f · g are continuous; f/g is also continuous if g(x) ≠ 0 for all x.

Proof. By Theorem D.184, the function h : X → R² given by h(x) = (f(x), g(x)) is continuous. We also know that the function + : R² → R is continuous. Hence the function f + g = + ◦ h is continuous.

Definition D.123. Let X be a topological space and A a subset of X. The inclusion iA : A ↪ X is given by

∀x ∈ A, iA(x) = x. (D.35)

Definition D.124. Let X and Y be topological spaces and A a subset of X. The restriction of a function f : X → Y to A is a function given as

∀x ∈ A, f|A(x) := f(x). (D.36)

Theorem D.125 (Restricting the domain). Any restriction of a continuous function is continuous.

Proof. For any open set U in Y, we have iA⁻¹(U) = U ∩ A. The rest follows from the relative topology.

Exercise D.126. Let iA : A ↪ X be an inclusion. Suppose the set A is given a topology such that, for every topological space Y and every function f : Y → A, f is continuous if and only if the composition (iA ◦ f) : Y → X is continuous. Prove that this topology of A is the same as the relative topology of A in X.

Lemma D.127 (Restricting the range). If f : X → Y is a continuous function, so is gf : X → f(X) given by gf(x) := f(x) for all x ∈ X.

Proof. Of course the topology of f(X) is understood as the subspace topology of Y. The rest follows from Definition D.116.

Lemma D.128 (Expanding the range). Let f : X → Y be a continuous function and Y a subspace of Z. Then the function g : X → Z given by g(x) := f(x) for all x ∈ X is continuous.

Proof. Write g = iY ◦ f.

Lemma D.129 (Pasting lemma). Let A, B be two closed subsets of a topological space X such that X = A ∪ B. Suppose fA : A → Y and fB : B → Y are continuous functions satisfying

∀x ∈ A ∩ B, fA(x) = fB(x). (D.37)

Then the following function f : X → Y is continuous:

f(x) := fA(x) if x ∈ A; fB(x) if x ∈ B. (D.38)

Proof. Define W := fA(A) ∪ fB(B). Then for any V ⊂ Y, (D.38) and the condition (D.37) yield

V = (V ∩ W) ∪ (V \ W) = (V ∩ fA(A)) ∪ (V ∩ fB(B)) ∪ (V \ W).

If V is closed in Y, then its preimage is

f⁻¹(V) = f⁻¹(V ∩ fA(A)) ∪ f⁻¹(V ∩ fB(B)) = fA⁻¹(V ∩ fA(A)) ∪ fB⁻¹(V ∩ fB(B)) = gA⁻¹(V ∩ fA(A)) ∪ gB⁻¹(V ∩ fB(B)),

where gA and gB are defined in Lemma D.127. We claim that f⁻¹(V) is also closed in X with arguments as follows.

(i) Since V is closed, Lemma D.119 implies that V ∩ fA(A) and V ∩ fB(B) are closed in fA(A) and fB(B), respectively.

(ii) By Lemma D.127, both gA and gB are continuous. Hence the two sets to be unioned in the last line of the above equation are closed in A and B, respectively.

(iii) By Corollary D.120, both sets in the last step are closed in X.

The rest of the proof follows from Lemma D.72.
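As a concrete sanity check of the pasting lemma (Lemma D.129), the absolute value arises by pasting two continuous pieces that agree on A ∩ B; a minimal numeric sketch (Python; the function name f is ad hoc, not from the text):

```python
# Pasting lemma illustration: A = (-inf, 0], B = [0, +inf),
# fA(x) = -x and fB(x) = x agree on A ∩ B = {0}, so the pasted
# function f(x) = |x| should be continuous, e.g. at the seam x = 0.

def f(x):
    return -x if x <= 0 else x   # pasted from fA and fB

# continuity check at the seam: f(x) -> f(0) as x -> 0 from both sides
for h in [1e-1, 1e-3, 1e-6]:
    assert abs(f(-h) - f(0)) <= h
    assert abs(f(+h) - f(0)) <= h
print("pasted function is continuous at the seam")
```

The check only samples the seam numerically, of course; the lemma itself is what guarantees continuity.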


Exercise D.130. Show that Lemma D.129 fails if A and B Lemma D.142. All closed intervals of a non-zero, finite
are not closed. length are homeomorphic.

Exercise D.131. Formulate the pasting lemma in terms of Lemma D.143. All open intervals, including infinite ones,
open sets and prove it. are homeomorphic.

Exercise D.132. What is the counterpart of the pasting Proof. The tangent function gives you a homeomorphism
lemma in complex analysis? between (− π2 , π2 ) and (−∞, +∞).
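The homeomorphism used in the proof of Lemma D.143 can be sampled numerically; a small sketch (Python, standard math module) checks that tan : (−π/2, π/2) → R and atan are mutually inverse:

```python
import math

# tan : (-pi/2, pi/2) -> R is a continuous bijection with continuous
# inverse atan, hence a homeomorphism (cf. the proof of Lemma D.143).
for t in [-1.5, -0.7, 0.0, 0.3, 1.4]:
    assert abs(math.atan(math.tan(t)) - t) < 1e-12      # round trip on the interval
for y in [-100.0, -1.0, 0.0, 2.5, 1e6]:
    assert abs(math.tan(math.atan(y)) - y) < 1e-6 * max(1.0, abs(y))  # round trip on R
print("tan and atan are mutually inverse on (-pi/2, pi/2) and R")
```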

Definition D.133 (Expanding the domain). For A ⊂ X Lemma D.144. An open interval is not homeomorphic to
and a given function f : A → Y , a function F : X → Y is a closed interval (nor half-open).
called an extension of f if F |A = f . Definition D.145. The n-sphere is a subset in Rn+1 ,

D.2.3 Homeomorphisms Sn := {x ∈ Rn+1 : kxk = 1}. (D.40)

Definition D.134. A function f : X → Y between topo- Its north pole is denoted by N = (0, 0, · · · , 0, 1).
logical spaces X and Y is called a homeomorphism iff f is Definition D.146. The stereographic projection
bijective and both f and f −1 are continuous. Then X and
Y are said to be homeomorphic or topologically equivalent, P : Sn \ N → Rn
written X ≈ Y .
is given by
P(x) := ( x1/(1 − xn+1), x2/(1 − xn+1), . . . , xn/(1 − xn+1) ). (D.41)

Lemma D.135. If two spaces X and Y are homeomorphic, then

∀a ∈ X, ∃b ∈ Y s.t. X \ {a} ≈ Y \ {b}. (D.39)

Exercise D.136. Show that the function f : {A, B} → {C} given by f(A) = f(B) = C is continuous, but not a homeomorphism. Hence a necessary condition for homeomorphism is the number of connected components.

Lemma D.147. The stereographic projection is a homeomorphism with its inverse as

P⁻¹(y) = (1/(1 + kyk²)) (2y1, 2y2, . . . , 2yn, kyk² − 1). (D.42)
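Lemma D.147 lends itself to a quick numerical check; the sketch below (Python; the names P and P_inv are ad hoc) verifies that P⁻¹(y) lands on the sphere and that P ◦ P⁻¹ is the identity:

```python
import random

def P(x):
    # stereographic projection from the north pole, cf. (D.41)
    return [xi / (1 - x[-1]) for xi in x[:-1]]

def P_inv(y):
    # inverse map, cf. (D.42)
    s = sum(yi * yi for yi in y)                     # s = ||y||^2
    return [2 * yi / (1 + s) for yi in y] + [(s - 1) / (1 + s)]

random.seed(0)
y = [random.uniform(-2, 2) for _ in range(2)]        # a point of R^2
x = P_inv(y)                                         # should lie on S^2
assert abs(sum(xi * xi for xi in x) - 1) < 1e-12     # x is on the unit sphere
assert all(abs(a - b) < 1e-12 for a, b in zip(P(x), y))
print("P_inv(y) is on the sphere and P(P_inv(y)) == y")
```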
Example D.137. Consider X the letter “T” and Y a line segment. They are not homeomorphic because removing the junction point in T would result in three pieces while removing any point in the line segment yields at most two connected components.

Exercise D.148. Show that the 2-sphere and the hollow cube are homeomorphic by using the radial projection f,

f(x) = x / kxk. (D.43)
Theorem D.149. Homeomorphisms form an equivalence
Exercise D.138. Classify the following symbols of the relation on the set of all topological spaces.
standard computer keyboard by considering them as 1- Proof. For a homeomorphism f : (X, T ) → (Y, T ), we can
X Y
dimensional topological spaces. define a function fT : TX → TY by setting fT (V ) := f (V ).
‘ 1 2 3 4 5 6 7 8 9 0 - = It is easy to show that fT is also a bijection from Definition
q w e r t y u i o p [ ] \ D.134.
a s d f g h j k l ; ’ Definition D.150. An embedding of X in Y is a function
z x c v b n m , . / f : X → Y that maps X homeomorphically to the subspace
~ ! @ # $ % ^ & * ( ) _ + f (X) in Y .
Q W E R T Y U I O P { } |
A S D F G H J K L : " Example D.151. For an embedding f : [0, 1] → X, its im-
Z X C V B N M < > ? age is called an arc in X. For an embedding f : S1 → X,
its image is called a simple closed curve in X.
Exercise D.139. Consider the identity function f = IX :
(X, T ) → (X, κ) where κ is the anti-discrete topology and
T is not. Show that f −1 is not continuous and hence f is D.3 A zoo of topologies
not a homeomorphism.
D.3.1 Hierarchy of topologies
Exercise D.140. Give an example of a continuous bijec-
tion f : X → Y that isn’t a homeomorphism; this time both Definition D.152. Suppose that T and T 0 are two topolo-
X and Y are subspaces of R2 . gies on a given set X. If T ⊃ T , we say that T 0 is
0

finer /larger than T ; if T 0 properly contains T , we say that


Exercise D.141. For a continuous function f : R → R, T 0 is strictly finer /strictly larger than T . We also say that
define g : R → R2 by g(x) = (x, f (x)). Prove that g is con- T is coarser /smaller, or strictly coarser /strictly smaller, in
tinuous and its image, the graph of f , is homeomorphic to these two respective situations. We say T and T 0 are com-
R. parable if either T 0 ⊃ T or T 0 ⊂ T .


Lemma D.153. Let B and B 0 be bases for the topologies Proof. For any x ∈ (a, b), we can always find [x, b) ∈ T` and
T and T 0 , respectively, on X. T 0 is finer than T if and only (a, b) ∈ TK such that x ∈ [x, b) ⊂ (a, b) and x ∈ (a, b) ⊂
if (a, b). On the other hand, for any x ∈ R and any neighbor-
hood [x, b) ∈ R` , no open interval in the standard topology
∀x ∈ X, ∀B ∈ B with x ∈ B, ∃B 0 ∈ B 0 s.t. x ∈ B 0 ⊂ B. simultaneously contains x and is a subset of [x, b). Similarly,
(D.44) for 0 ∈ R and BK := (−1, 1) \ K ⊃ {0}, no open interval
simultaneously contains 0 and is a subset of BK . Hence R`
Proof. The sufficiency U ∈ T ⇒ U ∈ T 0 follows directly
and RK are strictly finer than the standard topology on R.
from (D.44) and Definition D.152.
To show that R` and RK are not comparable, it suffices
As for the necessity, we start with given x ∈ X and
to give two examples. For any x ∈ K ⊂ R and any neighbor-
B ∈ B with x ∈ B. By Lemma D.33, B is open, i.e. B ∈ T .
hood [x, b) ∈ T` , no open sets in TK simultaneously contains
Then by hypothesis B ∈ T 0 . Definition D.32 implies that
x and is a subset of [x, b). Conversely, for 0 ∈ R and the
there exists B 0 ∈ B 0 such that x ∈ B 0 ⊂ B, which completes
above BK , no interval [a, b) ∈ T` simultaneously contains 0
the proof.
and is a subset of BK .
Exercise D.154. The bounded complements of all non- Exercise D.163. The topologies on R2 generated by the
degenerate Jordan curves form a basis of neighborhoods. Is open balls and the open squares are the same topology.
the topology generated by this basis finer than that gener-
ated by the open balls? Exercise D.164. Show that the collection

Definition D.155. The finite complement topology on X C = {[a, b) : a < b, a and b are rational} (D.49)
is is a basis that generates a topology TQ different from the
T = {U ⊂ X : U = ∅ or X \ U is finite}. (D.45) lower limit topology T` on R. Compare this to Example
The countable complement topology on X is D.49.

T = {U ⊂ X : U = ∅ or X \ U is countable}. (D.46) D.3.2 The order topology


The particular point topology on X is Definition D.165. Let X be a totally ordered set with
more than one element. Let B be the collection of all sets
T = {U ⊂ X : U = ∅ or p ∈ U }. (D.47) of the following types:
(1) All open intervals (a, b) in X;
The excluded point topology on X is
(2) All half-open intervals of the form [a0 , b) where a0 is the
T = {U ⊂ X : U = X or p ∈
/ U }. (D.48) smallest element (if any) of X;
(3) All half-open intervals of the form (a, b0 ] where b0 is the
Exercise D.156. Show that each of the topologies in Def- largest element (if any) of X.
inition D.155 is indeed a topology.
The order topology on X is the topology generated by the
Exercise D.157. For a three-element set X = {a, b, c}, basis B.
enumerate all possible topologies up to the permutation iso-
Exercise D.166. Show that B is indeed a basis of X in
morphism.
Definition D.165.
Exercise D.158. Which topology in the answer of Exercise Example D.167. The standard topology on R as in Defini-
D.157 has a basis other than itself? tion D.39 is the same as the order topology derived from the
Exercise D.159. Define a directed graph G = (V, E) where usual order on R. This is due to the fact that there exists
the vertex set V contains the topologies in Exercise D.157 in R neither the smallest element nor the largest element.
and E contains an edge T1 → T2 iff T2 is strictly finer than Definition D.168. The dictionary order or lexicopgraphi-
T1 . Plot the graph G. cal order on R × R is a total order defined as
Definition D.160. The lower limit topology T` on R is the (a, b) < (c, d) ⇔ a < c or a = c, b < d. (D.50)
topology generated by all half-open intervals of the form
Example D.169. A basis for the order topology on R × R
[a, b) with a < b. The space R endowed with T` is denoted
with the dictionary order is the collection of all open inter-
by R` .
vals of the form (1) in Definition D.165.
Definition D.161. The K-topology TK on R is the topology Example D.170. The order topology of positive integers
generated by all open intervals (a, b) and all sets of the form Z+ is the same as the discrete topology. For n > 1, take
(a, b) \ K where K is set in (D.28). The space R endowed the basis interval (n − 1, n + 1); for n = 1, take the interval
with TK is denoted by RK . [1, 2).
Lemma D.162. The topologies of R` and RK are strictly Exercise D.171. Show that the order topology derived
finer than the standard topology on R, but are not compa- from the dictionary order on the set X = {1, 2} × Z+ is
rable with one another. not the discrete topology.


Exercise D.172. Show that Definition D.165 does not gen- Proof. Let T denote the product topology in Definition
eralize to posets. D.176 and T 0 denote the topology generated by the sub-
basis (D.55). Every element in S belongs to T , so do any
Definition D.173. Let X be an ordered set and a ∈ X. unions of finite intersections of elements of S. Hence Defi-
The rays determined by a are the four subsets of X: nition D.59 yields T 0 ⊂ T .
Conversely, each element in the basis of T is an intersec-
(a, +∞) := {x : x > a}; (D.51a) tion of elements in S,
(−∞, a) := {x : x < a}; (D.51b) B × B = π −1 (B ) ∩ π −1 (B ),
1 2 1 1 2 2
[a, +∞) := {x : x ≥ a}; (D.51c) 0 0
hence B1 × B2 ∈ T and thus T ⊂ T .
(−∞, a] := {x : x ≤ a}. (D.51d)
Corollary D.183. The projections in Definition D.181 are
The first two are open rays while the last two closed rays. continuous (with respect to the product topology).
Proof. Consider π1 : X×Y → X. For each open set U ∈ TX ,
Exercise D.174. Show that the open rays form a subbasis Lemma D.182 and Definition D.61 imply that its preimage
for the order topology on X. under π1 is open in the product topology.
Definition D.175. The set [0, 1] × [0, 1] in the dictionary Theorem D.184 (Product of maps). Given f1 : A → X
order topology is called the ordered square, denoted by Io2 . and f2 : A → Y , the map f : A → X × Y with
f (a) := (f1 (a), f2 (a)) (D.56)
D.3.3 The product topology is continuous if and only if both f1 and f2 are continuous.
Definition D.176. Let X and Y be topological spaces. The Proof. Write f1 = π1 ◦ f and f2 = π2 ◦ f . The necessity
product topology on X × Y is the topology generated by the follows from Corollary D.183 and Theorem D.121.
basis As for the sufficiency, we need to show that the preim-
age f −1 (U × V ) of any basis element U × V is open. By
γ̄X×Y := {B1 × B2 : B1 ∈ TX , B2 ∈ TY }, (D.52) Definition D.176, U × V ∈ BX×Y implies that U ∈ TX and
V ∈ TY . By Definition D.51, any point a ∈ f −1 (U × V ) if
where TX and TY are topologies on X and Y , respectively. and only if f (a) ∈ U × V , which, by (D.56), is equivalent to
f1 (a) ∈ U and f2 (a) ∈ V . Hence, we have
Exercise D.177. Check that γ̄X×Y in (D.52) is indeed a
basis. f −1 (U × V ) = f1−1 (U ) ∩ f2−1 (V ).
The rest of the proof follows from the conditions of both f1
Exercise D.178. Give an example that γ̄X×Y is not a and f2 being continuous.
topology.
Example D.185. A parametrized curve γ(t) = (x(t), y(t))
Exercise D.179. The product of two Hausdorff spaces X is continuous if and only if both x and y are continuous.
and Y is Hausdorff.
D.3.4 The metric topology
Theorem D.180. Let X and Y be topological spaces with
bases γX and γY , respectively. Then the set Definition D.186. For a metric d on X in Definition C.74,
the number d(x, y) is called the distance between x and y.
γX×Y := {B1 × B2 : B1 ∈ γX , B2 ∈ γY }, (D.53) Definition D.187. In a metric space (X , d), an open ball
Br (x) centered at x ∈ X with radius r is the subset
is a basis for the topology of X × Y .
Br (x) := {y ∈ X : d(x, y) < r} . (D.57)
Proof. Both the covering and refining conditions hold triv- Lemma D.188. If d is a metric on X, then the collection
ially. of all open balls is a basis on X.
Definition D.189. The topology on X generated by the
Definition D.181. For topological spaces X and Y , the
basis of all open balls in Definition D.187 is called the met-
functions π1 : X × Y → X and π2 : X × Y → Y given by
ric topology induced by the metric d.
π1 (x, y) = x, π2 (x, y) = y (D.54) Lemma D.190. A set U is open in the metric topology
induced by d if and only if
are called the projections of X × Y onto its first and second ∀x ∈ U, ∃r > 0 s.t. Br (x) ⊂ U.
factors, respectively.
Definition D.191. A topological space X is said to be
Lemma D.182. The product topology on X × Y is the metrizable if there exists a metric d on X that induces the
same as the topology generated by the subbasis topology of X. A metric space is a metrizable topological
space together with a specific metric d that gives the topol-
S := π1−1 (U ) : U ∈ TX ∪ π2−1 (V ) : V ∈ TY . (D.55) ogy of X.
 


Definition D.192. A point x in a normed space X is an Example D.202. The rationals Q are not connected. The
interior point of A if there is an open ball Br (x) that lies only connected subspaces are the one-point spaces: for
entirely in A. The set of interior points of a set U is called Y = {p, q} ⊂ Q with p < q, choose an irrational number
its interior and denoted by Int(U ). a ∈ (p, q) and write
Definition D.193. A point x in a normed space X is an Y = (Y ∩ (−∞, p)) ∪ (Y ∩ (q, +∞)).
exterior point of A if there is an open ball Br (x) that lies
entirely in X \ A. The set of exterior points of a set U is According to Definition D.196, this separation implies that
called its exterior and denoted by Ext(U ). Y is not connected.

Definition D.194. For metric spaces (X, d1 ) and (Y, d2 ), Theorem D.203. Connectedness is preserved by continu-
a function f : X → Y is continuous iff ous functions; i.e., the image of a connected space under a
continuous map is connected.
∀ > 0 ∀x ∈ X ∃δ > 0 s.t. ∀y ∈ X (D.58)
Proof. Let X be a connected space and f : X → Y a contin-
d1 (x, y) < δ ⇒ d2 (f (x), f (y)) <  uous function. We show that the image space Z := f (X) is
Definition D.195. For metric spaces (X, d1 ) and (Y, d2 ), connected. Suppose Z is not connected. Then there exists
a function f : X → Y is uniformly continuous iff disjoint nonempty open sets U, V such that Z = U ∪ V . By
Definition D.54, f −1 (U ) and f −1 (V ) are disjoint open sets
∀ > 0 ∃δ > 0 s.t. ∀x, y ∈ X (D.59) and X = f −1 (U ) ∪ f −1 (V ), which contradicts the condition
d1 (x, y) < δ ⇒ d2 (f (x), f (y)) < . of X being connected.
Theorem D.204 (Intermediate value theorem (general-
D.4 Connectedness ized)). Let f : X → Y be a continuous function where
X is a connected space and Y is an ordered set in the order
Definition D.196. Let X be a topological space. A sepa- topology. If a and b are two points of X and if r is a point
ration of X is a pair U, V of disjoint nonempty open subsets of Y lying between f (a) and f (b), then there exists a point
of X whose union is X. A topological space is connected if c of X such that f (c) = r.
there does not exist a separation of X. Definition D.205. A path in a topological space X is a con-
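For X = [a, b] ⊂ R and Y = R, Theorem D.204 reduces to the classical intermediate value theorem, which underlies the bisection method of Chapter 1. A minimal sketch (Python; the helper name ivt_point is hypothetical):

```python
def ivt_point(f, a, b, r, tol=1e-12):
    """Locate c in [a, b] with f(c) = r by bisection, assuming f is
    continuous and r lies between f(a) and f(b) (cf. Theorem D.204)."""
    fa = f(a) - r
    while b - a > tol:
        m = 0.5 * (a + b)
        if fa * (f(m) - r) <= 0:      # the level r is crossed in [a, m]
            b = m
        else:                          # otherwise it is crossed in [m, b]
            a, fa = m, f(m) - r
    return 0.5 * (a + b)

c = ivt_point(lambda x: x**3 - x, 0.5, 2.0, 1.0)   # solve x^3 - x = 1
assert abs(c**3 - c - 1.0) < 1e-9
print("f(c) = r at c ≈", c)
```

Connectedness of [a, b] is exactly what makes the sign-change invariant of the loop meaningful.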
Exercise D.197. Why do we define the separation as a pair tinuous map f : I → X, where I := [0, 1], and x0 := f (0)
of disjoint open sets? Can we define separation using closed and x1 := f (1) are called its initial point and final point,
sets? respectively.

Example D.198. A space X with indiscrete topology is Definition D.206. A space X is path-connected if, for ev-
connected, since there exists no separation of X. ery x0 , x1 ∈ X, there exists a path from x0 to x1 .

Lemma D.199. For a subspace Y of X, a separation of Exercise D.207. Prove that if [a, b] is path-connected, so
Y is a pair of disjoint nonempty sets A and B such that are (a, b) and [a, b).
A ∪ B = Y and neither of them contains a limit point of the Theorem D.208. Path-connectedness is preserved by con-
other. The space Y is connected if there exists no separation tinuous functions; i.e., the image of a path-connected space
of Y . under a continuous function is path-connected.
Proof. Suppose first that A and B form a separation of Y . Proof. Let X be a connected space and f : X → Y
By Definition D.196, A and B are both open in Y . Further- a continuous function. We show that the image space
more, A is also closed since its complement B is open in Y . Z := f (X) is connected. Any C, D ∈ Z have their preim-
Thus A = A and B ∩ A = ∅. ages A = f −1 (C) ∈ X and B = f −1 (D) ∈ X. The path-
Conversely, suppose A ∪ B = Y , A ∩ B = ∅, and connectedness of X implies that there exists a continuous
B ∩ A = ∅. Then we have function q : [0, 1] → X such that q(0) = A and q(1) = B.
A ∩ Y = A ∩ (A ∪ B) = (A ∩ A) ∪ (A ∩ B) = A. By Theorem D.121, the composition p = f q is continuous,
p(0) = f (q(0)) = f (A) = C, and p(1) = f (q(1)) = f (B) =
Similarly, B ∩ Y = B. Both A and B are closed in Y , hence D. Hence Z is path-connected by Definition D.206.
they are both open in Y and form a separation of Y .
Lemma D.209. Every path-connected space is connected.
Example D.200. Let Y = [−1, 1] be a subspace of X = R.
Proof. Suppose a topological space X is not connected but
The sets [−1, 0] and (0, 1] are disjoint and nonempty, but
path-connected. Then there exists a separation U, V of
they do not form a separation of Y because [−1, 0] is not
X such that X = U ∪ V . Consider an arbitrary path
open in Y . Alternatively, one can use Lemma D.199 to say
f : [0, 1] → X. Since f ([0, 1]) is a continuous image of a
that [−1, 0] contains a limit point 0 of the other set (0, 1].
connected set, we know from Theorem D.203 that f ([0, 1])
Example D.201. Let Y = [−1, 0) ∪ (0, 1]. Each of the is connected, hence it must lie entirely in either U or V .
sets [−1, 0) and (0, 1] is nonempty and open in Y ; therefore, Consequently, there is no path in X joining a point of A to
they form a separation of Y . Again, an alternative argument a point of B, contradicting the condition of X being path-
utilizes Lemma D.199. connected.


Exercise D.210. A connected space is not necessarily path- Definition D.218. A space X is called locally path-
connected, c.f. the topologist’s sine curve. The space connected at x iff for every neighborhood U of x, there exists
   a path-connected neighborhood V of x contained in U . X is
1 locally path-connected iff it is locally path-connected at each
S= x, sin : x ∈ (0, 1] . (D.60)
x of its points.

is connected because it is the image of the connected space Theorem D.219. A space X is locally connected if and
(0, 1] under a continuous map. Hence the closure of S only if for every open set U of X, each component of U is
open in X.
S = S ∪ {(0, y) : y ∈ [−1, 1]}. (D.61) Theorem D.220. A space X is locally path-connected if
and only if for every open set U of X, each path component
is also connected in R2 . But S is not path-connected. Can of U is open in X.
you prove it?
Theorem D.221. Each path component of a topological
Exercise D.211. Deduce Theorem C.39 from Theorem space X lies in a component of X. If X is locally path con-
D.208. nected, then the components and the path components of
X are the same.
Theorem D.212 (Fixed points in one dimension). Every
Proof. The first statement follows from Lemma D.209. Let
continuous function f : [−1, 1] → [−1, 1] has a fixed point.
C be a component of X, P be a path-component of X. If
Proof. If f (−1) = −1 or f (1) = 1, we are done; otherwise there is a point x ∈ P and x ∈ C, we have P ⊂ C. Suppose
we have f (−1) = a > −1 and f (1) = b < 1. Hence none of P 6= C. Let Q be the union of all other path components
the following two disjoint sets is empty, of X, each of which intersects C and thus lies in C. Hence
we have C = P ∪ Q. By Theorem D.220 and the local path-
A := {(x, f (x)) : f (x) > x}, B := {(x, f (x)) : f (x) < x}. connectedness of X, each path component of X must be
open in X. Thus P and Q constitute a separation of X,
By Theorems D.184 and D.203, the graph of f , contradicting the connectedness of C.

G := {(x, f (x)) : x ∈ [−1, 1]},


D.5 Compactness
is path-connected.
Theorem D.222 (Extreme values). A continuous func-
Suppose no x∗ satisfies f (x∗ ) = x∗ , then G = A ∪ B. In
tion attains its extreme values on closed bounded inter-
the topological space G, both A and B are open with re-
vals. In other words, if f is continuous on [a, b], there exist
spect to a subspace topology of the standard topology. By
c, d ∈ [a, b] such that
Definition D.196 and Lemma D.209, this is a contradiction
to G being path connected. f (c) = max f (x), f (d) = min f (x). (D.62)
x∈[a,b] x∈[a,b]

Exercise D.213. Prove Theorem D.212 via connectedness. Definition D.223. A collection α of subsets of a topolog-
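Theorem D.212 is constructive in practice: applying bisection to g(x) = f(x) − x produces a fixed point numerically. A sketch under the theorem's hypotheses (Python; fixed_point is an ad hoc name):

```python
import math

def fixed_point(f, a=-1.0, b=1.0, tol=1e-12):
    """Find x* in [a, b] with f(x*) = x* for continuous f : [a, b] -> [a, b],
    using the sign change of g(x) = f(x) - x (cf. Theorem D.212)."""
    g = lambda x: f(x) - x
    if g(a) == 0:
        return a
    if g(b) == 0:
        return b
    while b - a > tol:           # g(a) > 0 > g(b) since f maps [a, b] into itself
        m = 0.5 * (a + b)
        if g(a) * g(m) <= 0:
            b = m
        else:
            a = m
    return 0.5 * (a + b)

x_star = fixed_point(math.cos)   # cos maps [-1, 1] into [cos 1, 1] ⊂ [-1, 1]
assert abs(math.cos(x_star) - x_star) < 1e-9
print("fixed point of cos ≈", x_star)
```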
ical space X is said to cover X, or to be a covering of X,
Definition D.214. The equivalence classes resulting from if the union of all elements of α equals X; it is an open
connectedness and path-connectedness are called compo- covering of X if each element of α is an open subset of X.
nents and path components, respectively.
Definition D.224. An (open) cover of a subset X in a
Example D.215. The topologist’s sine curve S in Exercise topological space Y is a collection α of (open) subsets in Y
D.210 has only one component, but has two path compo- such that X ⊂ ∪α. A subcover of X is a subcollection of a
nents S and V := S \ S. Note that S is open in S but not cover that also covers X.
closed, while V is closed in S but not open.

If one forms a space from S by deleting all points of V having rational second coordinate, one obtains a space that has only one component but uncountably many path components.

Definition D.216. A space X is called locally connected at x iff for every neighborhood U of x, there exists a connected neighborhood V of x contained in U. X is locally connected iff it is locally connected at each of its points.

Example D.225. Consider K in (D.28) and X = K ∪ {0}. An open cover of K in R is {Un : n ∈ N+} where

Un = (1/n − εn, 1/n + εn), εn := 1/(n(n + 1));

elements of this open cover are pairwise disjoint for all n > 1. An open cover of X in R is {Un : n ∈ N+} ∪ {(−ε, ε)} with ε := 1/N for some N ∈ N+.
with relative topology induced from R. Each singleton set
Example D.217. Q is neither connected nor locally con-  
1
nected; the subspace [−1, 0) ∪ (0, +1] is not connected but sn :=
locally connected; the topologist’s sine curve is connected n
but not locally connected; each interval and each ray in the is open in K since sn = Un ∩ K and Un is open in R. Hence
real line is both connected and locally connected. {sn : n ∈ N+ } is an infinite open cover of K.


Exercise D.227. Consider X in Example D.225 as a space Definition D.234. A topological space is said to be locally
with relative topology induced from R. Is the collection compact at x iff there is some compact subspace C of X that
contains a neighborhood of x; it is locally compact iff it is
{{0}} ∪ {sn : n ∈ N+ } locally compact at each of its points.
an open cover of X? If not, can you find an infinite open
Example D.235. The real line R is not compact, but lo-
cover of X whose elements are pairwise disjoint for suffi-
cally compact. The subspace Q is not locally compact.
ciently large n? If not, can you give a finite open cover of
X? Theorem D.236. A topological space X is locally compact
Exercise D.228. What is the crucial difference between K Hausdorff if and only if there exists a compact Hausdorff
and X in the space R in terms of covers and subcovers? space Y such that X is a subspace of Y and Y \ X consists
of a single point.
Proof. For any open cover U of X, there exists an element
of U containing all but finite many of the points 1/n. Hence, Proof. Munkres p. 183.
we have a finite subcover in U for X. This is not true for
K. Definition D.237. If Y is a compact Hausdorff space and
X is a proper subspace of Y such that X = Y , then Y is said
Definition D.229. A compact topological space is a topo- to be a compactification of X. In particular, if Y \ X is a
logical space X where every open cover of X has a finite singleton set, then Y is called the one-point compactification
subcover. of X.
Lemma D.230. A subspace Y of a topological space X is
Example D.238. In Example D.225, X is the one-point
compact if and only if every open cover of Y contains a finite
compactification of K.
subcover of Y .
Lemma D.231. If X is a compact subset of a space Y , Example D.239. The one-point compactification of the
then X is compact in relative topology. real line R is homeomorphic with the circle. Similarly, the
one-point compactification of the complex plane is homeo-
Theorem D.232 (Bolzano-Weierstrass). In a compact morphic with the sphere S2 . The Riemann sphere is the
space, every infinite subset has an accumulation point. space C ∪ {∞}.
Proof. Suppose there exists an infinite subset A that does
Theorem D.240. Let X be a Hausdorff space. Then X is
not have an accumulation point. Then we can construct an
locally compact if and only if, given x ∈ X and a neighbor-
open cover of X
hood U of x, there is a neighborhood V of x such that V is
α = {Ux : x ∈ X}
compact and V ⊂ U .
such that there is at most one element of A in an element
of α. By compactness, α contains a finite subcover α0 that Corollary D.241. Let X be locally compact Hausdorff and
covers X. However, since each element in the finite set α0 let A be a subspace of X. If A is closed in X or open in X,
only has one element of A and α0 covers A, A must be finite, then A is locally compact.
which contradicts the condition of A being infinite.
Corollary D.242. A space X is homeomorphic to an open
Corollary D.233. In a compact space, every sequence has subspace of a compact Hausdorff space if and only if X is
a convergent subsequence. locally compact Hausdorff.

Appendix E

Functional Analysis

Example E.1. A copper mining company mines in a mountain that has an estimated total amount of Q tonnes of copper. Let x(t) denote the amount of copper removed during the period [0, t], with x(0) = 0 and x(T) = Q. Assume x is a continuous function [0, T] → R and the cost of extracting copper per unit tonne at time t is

c(t) = ax(t) + bx′(t), (E.1)

where a, b ∈ R+. What is the optimal mining operation x(t) that minimizes the cost functional

f(x) = ∫₀ᵀ (ax(t) + bx′(t)) x′(t) dt?

In math terms, we would like to minimize f : C¹_Q[0, T] → R, where C¹_Q[0, T] is the set of continuously differentiable functions x : [0, T] → R satisfying x(0) = 0 and x(T) = Q.

In calculus, the minimizer x∗ of a function f ∈ C² is usually found by the conditions f′(x∗) = 0 and f″(x∗) > 0. However, the above problem does not fit into the usual framework of calculus, since x is not a number but a function that belongs to an infinite-dimensional function space. Solving this problem requires a number of techniques in functional analysis.

E.1 Normed and Banach spaces

E.1.1 Metric spaces

Definition E.2. The ℓ∞ sequence space is a metric space (ℓ∞, d), where ℓ∞ is the set of all bounded sequences of complex numbers,

ℓ∞ := { (ξ1, ξ2, . . .) : ∃cx ∈ R s.t. sup_{i∈N+} |ξi| ≤ cx }, (E.2)

and the metric is given by

d(x, y) = sup_{i∈N+} |ξi − ηi|,

where y = (η1, η2, . . .) ∈ X.

Exercise E.3. Let X be the set of all bounded and unbounded sequences of complex numbers. Show that the following is a metric on X,

d(x, y) = Σ_{j=1}^∞ (1/2^j) · |ξj − ηj| / (1 + |ξj − ηj|), (E.3)

where x = (ξj) and y = (ηj).

Definition E.4. For a real number p ≥ 1, the ℓp sequence space is the metric space (ℓp, d) with

ℓp := { (ξj)_{j=1}^∞ : ξj ∈ C; Σ_{j=1}^∞ |ξj|^p < ∞ }; (E.4)

d(x, y) = ( Σ_{j=1}^∞ |ξj − ηj|^p )^{1/p}, (E.5)

where x = (ξj) and y = (ηj) are both in X. In particular, the Hilbert sequence space ℓ2 is the ℓp space with p = 2.

Definition E.5. A pair of conjugate exponents are two real numbers p, q ∈ [1, ∞] satisfying

p + q = pq, i.e., 1/p + 1/q = 1. (E.6)

Lemma E.6. Any two positive real numbers α, β satisfy

αβ ≤ α^p/p + β^q/q, (E.7)

where p and q are conjugate exponents and the equality holds if β = α^{p−1}.

Proof. By (E.6), we have

u = t^{p−1} ⇔ t = u^{q−1}.

[Figure: the rectangle [0, α] × [0, β] split by the curve u = t^{p−1}; the region below the curve has area ∫₀^α t^{p−1} dt and the region to its left has area ∫₀^β u^{q−1} du.]
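Lemma E.6 (Young's inequality) is easy to stress-test numerically; a sketch with randomly sampled α, β, p (Python, assuming a standard interpreter):

```python
import random

random.seed(1)
for _ in range(1000):
    p = random.uniform(1.01, 10.0)
    q = p / (p - 1.0)                  # conjugate exponent: 1/p + 1/q = 1
    a = random.uniform(0.01, 10.0)     # alpha
    b = random.uniform(0.01, 10.0)     # beta
    rhs = a**p / p + b**q / q
    assert a * b <= rhs * (1 + 1e-9)   # (E.7), with a small floating-point slack
    # equality case: beta = alpha^(p-1) makes both sides equal to alpha^p
    b_eq = a**(p - 1.0)
    lhs_eq = a * b_eq
    rhs_eq = a**p / p + b_eq**q / q
    assert abs(lhs_eq - rhs_eq) <= 1e-9 * max(1.0, rhs_eq)
print("Young's inequality verified on 1000 random samples")
```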


It follows that E.1.2 Normed spaces


Z α Z β
αp βq Example E.11. (Rn , k · kp ) is a normed space, where k · kp
αβ ≤ tp−1 dt + uq−1 du = + , is the Euclidean norm in Definition B.114 with p ∈ [1, ∞):
0 0 p q
  p1
n
where the equality holds if β = αp−1 since p = q(p − 1). X
kxkp =  |xj |p  .
Corollary E.7. A pair of conjugate exponents p, q satisfy j=1

1 1 a b Exercise E.12. Prove the backward triangle inequality of


∀a, b ∈ [0, +∞), ap bq ≤ + . (E.8) a norm k · k, i.e.,
p q
∀u, v ∈ V, |kuk − kvk| ≤ ku − vk. (E.12)
Proof. This follows directly from Lemma E.6.
Exercise E.13. Use Holder’s inequality to verify the trian-
Theorem E.8 (Hölder’s inequality). For n ∈ N+ ∪ {+∞}, gle inequality for the Euclidean norm in Example E.11.
x, y ∈ Cn and conjugate exponents p, q ∈ [1, ∞], we have
Definition E.14. In a normed space (X , k · k), an open ball
X n Br (x) centered at x ∈ X with radius r > 0 is the subset
|xj yj | ≤ kxkp kykq , (E.9)
j=1 Br (x) := {y ∈ X : kx − yk < r} . (E.13)

where k · kp is the Euclidean norm. For p, q ∈ (1, ∞), the Lemma E.15. Any open ball in a normed space is a convex
equality in (E.9) holds if set as in Definition 1.18.
Proof. For α ∈ [0, 1] and x, y ∈ Br (0), we have
∃c ∈ R s.t. ∀j = 1, . . . , n, |xj |p = c|yj |q . (E.10)
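Both the inequality (E.9) and the equality condition (E.10) can be probed numerically; a minimal sketch (Python; real vectors for simplicity, pnorm is an ad hoc helper):

```python
import random

def pnorm(x, p):
    # finite-dimensional Euclidean p-norm
    return sum(abs(t)**p for t in x) ** (1.0 / p)

random.seed(2)
p = 3.0
q = p / (p - 1.0)                           # conjugate exponent
x = [random.uniform(-1, 1) for _ in range(50)]
y = [random.uniform(-1, 1) for _ in range(50)]
lhs = sum(abs(a * b) for a, b in zip(x, y))
assert lhs <= pnorm(x, p) * pnorm(y, q) * (1 + 1e-12)   # Hoelder, (E.9)

# equality when |x_j|^p = c |y_j|^q, cf. (E.10): take y_j = |x_j|^(p-1), so c = 1
y_eq = [abs(a)**(p - 1.0) for a in x]
lhs_eq = sum(abs(a * b) for a, b in zip(x, y_eq))
assert abs(lhs_eq - pnorm(x, p) * pnorm(y_eq, q)) < 1e-9
print("Hoelder inequality and its equality case verified")
```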
kαx + (1 − α)yk ≤ kαxk + k(1 − α)yk
Pn Pn
Proof. If j=1 |xj |p = 0 or j=1 |yj |p = 0 or p = ∞ or ≤ αkxk + (1 − α)kyk < αr + (1 − α)r = r,
q = ∞, then (E.9) holds trivially. Otherwise we define
where we have applied properties of norms. Hence αx + (1 −
|xi |p |yi |q α)y ∈ Br (0).
ai := Pn p
, bi := Pn q
.
j=1 |xj | j=1 |yj | Exercise E.16. Show that the Euclidean norm k · kp in
Example E.11 satisfies a monotonicity property:
It follows from (E.8) that
1 ≤ p ≤ q ≤ ∞ ⇒ ∀x ∈ Rn kxkq ≤ kxkp .
|xi yi | |xi |p |yi |q
≤ Pn + Pn . n
P
n
 p1 P
n
 q1 p j=1 |xj |p q j=1 |yj |q Example E.17. (R , k · k∞ ) is a normed space, where k·k∞
j=1 |xj |p j=1 |yj |q is the Euclidean norm in Definition B.114:

Sum up all equations for i = 1, . . . , n and we have kxk∞ = max |xj |.


j
Pn
j=1 |xi yi | 1 1 Definition E.18. Let Ω ⊂ Rn be a bounded open set. The
≤ + = 1, p-norm of a continuous scalar function in the linear space
kxkp kykq p q
C(Ω) is
which yields (E.9). Substitute (E.10) into (E.9) and we have Z  p1
the equality. ∀v ∈ C(Ω), kvkp := p
|v(x)| dx (E.14)

Example E.9. Cauchy-Schwarz inequality in Theorem
and the ∞-norm or maximum norm is given by
B.131 is a special case of the Hölder inequality (E.9) for
p = q = 2. ∀v ∈ C(Ω), kvk∞ := max |v(x)|. (E.15)
x∈Ω
Exercise E.10. Prove that (E.5) satisfies the triangular
inequality and is indeed a metric. Example E.19. (C(Ω), k·k∞ ) in Definition E.18 is a normed
space, so is (C(Ω), k · kp ) for any p ∈ [1, ∞).
(a) The Hölder inequality implies the Minkowski inequality, Example E.20. For the `∞ sequence space in (E.2),
i.e. for any p ≥ 1, (ξj ) ∈ `p , and (ηj ) ∈ `p ,  
 1/p !1/p !1/p `∞ := (an )n∈N : sup |an | < ∞. ,
X∞ ∞
X ∞
X n∈N
 |ξj + ηj |p  ≤ |ξk |p + |ηm |p .
j=1 k=1 m=1
define k · k∞ : `∞ → R+ ∪ {0} as
(E.11) k(an )n∈N k∞ = sup |an |. (E.16)
n∈N
(b) The Minkowski inequality implies that the triangular
inequality holds for (E.5). Then (`∞ , k · k∞ ) is a normed space.
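The Hölder inequality (E.9) and the Minkowski inequality (E.11) can be spot-checked numerically. The following sketch uses arbitrary random vectors and the conjugate pair p = 3, q = 3/2 (these specific choices are for illustration only):

```python
import random

def p_norm(x, p):
    # Euclidean p-norm of Example E.11: (sum |x_j|^p)^(1/p)
    return sum(abs(t) ** p for t in x) ** (1.0 / p)

random.seed(0)
x = [random.uniform(-1, 1) for _ in range(50)]
y = [random.uniform(-1, 1) for _ in range(50)]
p, q = 3.0, 1.5  # conjugate exponents: 1/3 + 1/1.5 = 1

# Hölder's inequality (E.9): sum |x_j y_j| <= ||x||_p ||y||_q
holder_lhs = sum(abs(a * b) for a, b in zip(x, y))
holder_rhs = p_norm(x, p) * p_norm(y, q)

# Minkowski's inequality (E.11): ||x + y||_p <= ||x||_p + ||y||_p
mink_lhs = p_norm([a + b for a, b in zip(x, y)], p)
mink_rhs = p_norm(x, p) + p_norm(y, p)

print(holder_lhs <= holder_rhs, mink_lhs <= mink_rhs)
```

Such a check does not prove anything, but it is a useful sanity test when implementing norms.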


Example E.21. For the ℓ^p space in (E.4) with p ∈ [1, ∞),

ℓ^p := {(a_n)_{n∈N} : a_n ∈ C; Σ_{n∈N} |a_n|^p < ∞},

we have, for (a_n)_{n∈N} ∈ ℓ^p and (b_n)_{n∈N} ∈ ℓ^p,

|a + b|^p ≤ (|a| + |b|)^p ≤ 2^p (max(|a|, |b|))^p ≤ 2^p (|a|^p + |b|^p)
⇒ (a_n)_{n∈N} + (b_n)_{n∈N} ∈ ℓ^p,

where the comparison test is applied. Then (ℓ^p, ‖·‖_p) is a normed space, where

‖(a_n)_{n∈N+}‖_p := (Σ_{n=1}^∞ |a_n|^p)^{1/p}. (E.17)

Notation 10. Let c00 denote the space of all sequences that are eventually 0, c0 the space of all sequences that converge to 0, and c the space of all sequences that converge.

Definition E.22 (Convergence of sequences). A sequence {u_n} in a normed space (V, ‖·‖) is convergent to u ∈ V iff

lim_{n→∞} ‖u_n − u‖ = 0. (E.18)

E.1.3 The topology of normed spaces

Example E.23. The topology of a normed space is the metric topology in Definition D.189 because a normed space is always a metric space.

Definition E.24. Let (X, d) be a metric space. A point x0 ∈ X is an adherent point (also called a closure point, a point of closure, or a contact point) of E ⊂ X iff

∀r > 0, E ∩ B_r(x0) ≠ ∅. (E.19)

Example E.25. Any point in the set K in (D.28) is a closure point of K, and so is 0.

Definition E.26. A point x is an accumulation point (or a limit point) of A iff

∀r > 0, (B_r(x) \ {x}) ∩ A ≠ ∅. (E.20)

Example E.27. The only accumulation point of the set K in (D.28) is 0.

Example E.28. Each number in R is an accumulation point of Q.

Definition E.29. Let V1 ⊂ V2 be two subsets of a normed space V. The set V1 is dense in V2 iff

∀u ∈ V2, ∀ε > 0, ∃v ∈ V1 s.t. ‖v − u‖ < ε. (E.21)

Theorem E.30. Q is dense in R.

Exercise E.31. Show that c00 is dense in ℓ².

Example E.32. The set of polynomials is dense in (C[a, b], ‖·‖_∞); cf. Theorem 2.51.

Definition E.33. A normed space is separable if it has a countable dense set.

Example E.34. By Definitions E.29 and E.33, L^p(Ω) is separable since the set of all polynomials with rational coefficients is countable and is dense in L^p(Ω).

Lemma E.35. ℓ^∞ is not separable.

Proof. Suppose ℓ^∞ is separable. Then there exists in ℓ^∞ a dense subset D = {x1, x2, x3, . . .}. For the set A of sequences with each term being either 0 or 1, we have

∀a, b ∈ A, a ≠ b ⇔ ‖a − b‖_∞ = 1.

It follows from D being dense in ℓ^∞ that

∀a ∈ A, ∃x_{n(a)} ∈ D s.t. x_{n(a)} ∈ B_{1/2}(a).

Because the open balls B_{1/2}(a) are pairwise disjoint, the map f : A → N given by f(a) = n(a) is injective. However, A is uncountable because A has a one-to-one correspondence with all real numbers in [0, 1) via binary expansion, so no map f : A → N can be injective. This contradiction completes the proof.

Exercise E.36. Prove that ℓ^p is separable for all p ∈ [1, ∞).

E.1.4 Bases of infinite-dimensional spaces

Definition E.37. An infinite-dimensional normed space V has a countably-infinite basis iff

∃{v_i}_{i≥1} ⊂ V s.t. ∀v ∈ V, ∃{α_{n,i}}_{i=1}^n with n ∈ N+ and α_{n,i} ∈ R,
s.t. lim_{n→∞} ‖v − Σ_{i=1}^n α_{n,i} v_i‖ = 0. (E.22)

The sequence {v_i}_{i≥1} is a basis if any finite subset of it is linearly independent.

Definition E.38. A Schauder basis of an infinite-dimensional normed linear space V is a sequence {v_n}_{n≥1} of elements in V such that

∀v ∈ V, ∃!{α_n}_{n≥1} with α_n ∈ R s.t. v = Σ_{n=1}^∞ α_n v_n. (E.23)

Example E.39. The sequence space ℓ² in Definition E.4 has a Schauder basis

{e_j = (0, . . . , 0, 1, 0, 0, . . .)}_{j=1}^∞,

since any ξ = (ξ1, ξ2, . . .) ∈ ℓ² can be uniquely written as ξ = Σ_{j=1}^∞ ξ_j e_j.

Example E.40. It can be proved that the set {1, cos nx, sin nx}_{n=1}^∞ is a Schauder basis of L^p(−π, π) for p ∈ (1, ∞).

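The Schauder-basis expansion of Example E.39 can be illustrated numerically: the ℓ²-distance between ξ and its n-term truncation Σ_{j≤n} ξ_j e_j is the tail norm, which decreases to 0. The choice ξ_j = 1/j below is an arbitrary ℓ² element, and the infinite tail sum is approximated by a long finite sum:

```python
# Truncating xi = sum_j xi_j e_j in l^2, for xi_j = 1/j.
# The l^2 distance between xi and its n-term truncation is the tail norm
# (sum_{j>n} 1/j^2)^(1/2), which decreases to 0 as n grows.
def tail_norm(n, terms=10**6):
    return sum(1.0 / j**2 for j in range(n + 1, terms)) ** 0.5

tails = [tail_norm(n) for n in (1, 10, 100, 1000)]
decreasing = all(a > b for a, b in zip(tails, tails[1:]))
print(tails, decreasing)
```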

E.1.5 Sequential compactness

Definition E.41. A subset K of a normed space (X, ‖·‖) is sequentially compact if every sequence in K has a subsequence that converges in K,

∀(x_n)_{n∈N} ⊂ K, ∃n_k : N → N, ∃L ∈ K s.t. lim_{k→+∞} x_{n_k} = L. (E.24)

Example E.42. Any interval [a, b] is sequentially compact in R. Indeed, any sequence in [a, b] is bounded, and by the Bolzano–Weierstrass theorem (Theorem C.13) it has a convergent subsequence, whose limit must be in [a, b] thanks to the completeness of R (Theorem C.15).

Example E.43. (a, b) is not sequentially compact since the sequence (a + (b−a)/2^n)_{n∈N+} is contained in (a, b), but its limit a is not contained in (a, b).

Example E.44. R is not sequentially compact because the sequence (n)_{n∈N} in R cannot have a convergent subsequence: the distance between any two terms of any subsequence is at least 1.

Lemma E.45. Every bounded sequence in R^n has a convergent subsequence.

Proof. We prove this statement by induction on n. The induction basis is the Bolzano–Weierstrass theorem (Theorem C.13). Suppose the statement holds for n ≥ 1. For a bounded sequence (x_m)_{m∈N} ⊂ R^{n+1}, we split each x_m as x_m = (α_m, β_m), where α_m ∈ R^n and β_m ∈ R. Since (x_m) is bounded and ‖α_m‖₂ ≤ ‖x_m‖₂, (α_m) is also bounded. By the induction hypothesis, (α_m)_{m∈N} has a convergent subsequence, say (α_{m_k})_{k∈N}, that converges to α ∈ R^n. Then (β_{m_k})_{k∈N} is bounded and by Theorem C.13 it has a convergent subsequence (β_{m_{k_p}})_{p∈N} that converges to β ∈ R. Therefore we have

lim_{p→∞} x_{m_{k_p}} = (lim_{p→∞} α_{m_{k_p}}, lim_{p→∞} β_{m_{k_p}}) = (α, β) ∈ R^{n+1},

which completes the proof.

Theorem E.46. In a metric space, sequential compactness is equivalent to compactness.

Lemma E.47. A sequentially compact subset K of a normed space X must be closed and bounded.

Proof. Suppose K is sequentially compact but not bounded. Then

∀n ∈ N, ∃x_n ∈ K s.t. ‖x_n‖ ≥ n.

Hence no subsequence of (x_n)_{n∈N} ⊂ K converges, and this contradicts the sequential compactness of K.

For any convergent sequence (x_n)_{n∈N} ⊂ K, Definition E.41 implies that it has a subsequence that converges in K. The uniqueness of limits (Lemma C.6) dictates that the two sequences converge to the same limit in K. Now that any convergent sequence in K converges to some limit point in K, Corollary D.98 implies that K is closed.

Theorem E.48. A subset K of R^n is sequentially compact if and only if K is closed and bounded.

Proof. The necessity follows from Lemma E.47; we only prove the sufficiency. Any sequence (x_n)_{n∈N} ⊂ K is bounded because K is bounded. Then Lemma E.45 dictates that (x_n)_{n∈N} has a convergent subsequence. Because each term x_n ∈ K and K is closed, Corollary D.98 implies that the limit of this subsequence is also in K. The proof is then completed by Definition E.41.

Example E.49. The intervals (a, b], [a, b), (−∞, b], and [a, +∞) are not sequentially compact in R.

Definition E.50. The Cantor set is a subset of R given by C := ∩_{n=1}^{+∞} F_n, where F_1 = [0, 1] and each F_{n+1} is obtained by deleting from F_n the open middle third of each closed interval.

Example E.51. The Cantor set is an intersection of closed sets and thus it is closed. It is also bounded, and thus it is sequentially compact.

Corollary E.52. A subset K of a finite-dimensional normed space X is sequentially compact if and only if K is closed and bounded.

Example E.53. The closed unit ball in (C[0, 1], ‖·‖_∞),

K := {f ∈ C[0, 1] : ‖f‖_∞ ≤ 1}, (E.25)

is closed and bounded, but K is not sequentially compact. Consider the hat functions

B_n(x) = { (x − a_n)/(b_n − a_n), x ∈ [a_n, b_n];  (x − c_n)/(b_n − c_n), x ∈ [b_n, c_n];  0 otherwise }, (E.26)

where a_n = 1 − 1/2^n, c_n = a_{n+1}, and b_n = (a_n + c_n)/2. Then the sequence (B_n)_{n∈N} has no convergent subsequence.

Example E.54. The closed unit ball in ℓ²,

K := {x ∈ ℓ² : ‖x‖₂ ≤ 1}, (E.27)

is closed and bounded, but is not sequentially compact. For

e_n = (0, · · · , 0, 1, 0, · · · , 0) ∈ K ⊂ ℓ²,

where all terms are zero except that the nth term is 1, the sequence (e_n)_{n∈N+} has no convergent subsequence.

Example E.55. The Hilbert cube in the normed space ℓ²,

C := {(x_n)_{n∈N+} : x_n ∈ [0, 1/n]}, (E.28)

can be shown to be a sequentially compact subset.

Definition E.56. An open cover of a topological space X is a collection of open subsets of X such that any element of X belongs to some open subset in the collection.

Definition E.57. A subset K in a topological space is compact if and only if every open cover of K has a finite subcover.
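The hat functions in Example E.53 have disjoint supports, so ‖B_n − B_m‖_∞ = 1 for n ≠ m and no subsequence can be Cauchy. A minimal numerical check (sampling the sup-norm on a uniform grid is an approximation):

```python
def hat(n, x):
    # B_n of (E.26) with a_n = 1 - 1/2^n, c_n = a_{n+1}, b_n = (a_n + c_n)/2
    a, c = 1.0 - 2.0 ** -n, 1.0 - 2.0 ** -(n + 1)
    b = 0.5 * (a + c)
    if a <= x <= b:
        return (x - a) / (b - a)
    if b < x <= c:
        return (x - c) / (b - c)
    return 0.0

grid = [i / 10**5 for i in range(10**5 + 1)]
# sup-distance between B_1 and B_2, approximated on the grid:
# the peak of B_1 (at x = 0.625) lies outside the support of B_2
dist_12 = max(abs(hat(1, x) - hat(2, x)) for x in grid)
print(dist_12)
```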


E.1.6 Continuous maps of normed spaces

Definition E.58. Let X and Y be normed spaces. A function f : X → Y is continuous at x0 ∈ X iff

∀ε > 0, ∃δ > 0 s.t. ∀x ∈ X, ‖x − x0‖_X < δ ⇒ ‖f(x) − f(x0)‖_Y < ε. (E.29)

The function f : X → Y is continuous iff it is continuous at every x0 ∈ X.

Lemma E.59. Let X and Y be normed spaces. A function f : X → Y is continuous at x ∈ X iff, for any sequence with lim_{n→∞} x_n = x, we have lim_{n→∞} f(x_n) = f(x).

Exercise E.60. Prove Lemma E.59.

Lemma E.61. The norm function ‖·‖ is continuous.

Proof. By Definition E.22, lim_{n→∞} u_n = u implies lim_{n→∞} ‖u_n − u‖ = 0. The rest of the proof follows from the backward triangle inequality (E.12).

Exercise E.62. For V = C[0, 1] and x0 ∈ [0, 1], define a function ℓ_{x0} : V → R as

ℓ_{x0}(v) = v(x0).

Show that ℓ_{x0} is continuous on C[0, 1].

Example E.63. The function S : (C[0, 1], ‖·‖_∞) → (R, |·|),

S(f) = ∫_0^1 f²(x) dx, (E.30)

is continuous. Indeed, for any g ∈ C[0, 1], we have

|S(f) − S(g)| = |∫_0^1 f²(x) dx − ∫_0^1 g²(x) dx|
≤ ∫_0^1 |f(x) − g(x)| |f(x) − g(x) + 2g(x)| dx
≤ ∫_0^1 ‖f − g‖_∞ (‖f − g‖_∞ + 2‖g‖_∞) dx,

which implies

∀ε > 0, ∃δ = min(1, ε/(1 + 2‖g‖_∞)) s.t. ‖f − g‖_∞ < δ ⇒
‖f − g‖_∞ (‖f − g‖_∞ + 2‖g‖_∞) < ‖f − g‖_∞ (1 + 2‖g‖_∞) < ε
⇒ |S(f) − S(g)| < ε.

Example E.64. The differentiation map

d/dt : (C¹[a, b], ‖·‖_∞) → (C[a, b], ‖·‖_∞)

is not continuous, but it can be made continuous if we change the norm on C¹[a, b] to

‖f‖_{1,∞} := ‖f‖_∞ + ‖f′‖_∞. (E.31)

Indeed, for f_n(t) = (1/√n) cos(2πnt), we have

∀n ∈ N+, ‖f_n′ − 0′‖_∞ = 2π√n > 1,

yet ‖f_n − 0‖_∞ can be made arbitrarily small as n → ∞. In contrast, D : (C¹[a, b], ‖·‖_{1,∞}) → (C[a, b], ‖·‖_∞) is continuous because

∀ε > 0, ∃δ = ε s.t. ∀f, g ∈ C¹[a, b], ‖f − g‖_{1,∞} < δ ⇒
‖Df − Dg‖_∞ = ‖f′ − g′‖_∞ ≤ ‖f − g‖_{1,∞} < δ = ε.

Exercise E.65. Show that the arc length function L : C¹[0, 1] → R,

L(f) := ∫_0^1 √(1 + (f′(t))²) dt, (E.32)

is not continuous if the norm of C¹[0, 1] is ‖·‖_∞, whereas it is continuous if we equip C¹[0, 1] with (E.31).

Exercise E.66. Is the function S : (c00, ‖·‖_∞) → (R, |·|),

S((a_n)_{n∈N}) = Σ_{n=1}^∞ a_n², (E.33)

continuous?

Theorem E.67. A map f : X → Y between normed spaces is continuous if and only if the preimage f^{−1}(V) of each open set V in Y is open in X.

Corollary E.68. A map f : X → Y between normed spaces is continuous if and only if the preimage f^{−1}(V) of each closed set V in Y is closed in X.

Lemma E.69. If f : X → Y and g : Y → Z are continuous functions between normed spaces, then the composition map g ∘ f : X → Z is continuous.

Lemma E.70. Let X, Y be normed spaces and let K be a compact subset of X. If f : X → Y is continuous at each x ∈ K, then f(K) is a compact subset of Y.

Proof. For a sequence (y_n)_{n∈N} ⊂ f(K), there exists for each n ∈ N an x_n ∈ K such that f(x_n) = y_n. This defines a sequence (x_n)_{n∈N} ⊂ K. Because K is compact, Definition E.41 implies the existence of a subsequence (x_{n_k})_{k∈N} that converges to L ∈ K. Since f is continuous, Lemma E.59 implies that (y_{n_k})_{k∈N} converges to f(L) ∈ f(K).

Theorem E.71 (Weierstrass). Suppose K is a nonempty compact subset of a normed space X and the function f : X → R is continuous at each x ∈ K. Then

∃a, b ∈ K s.t. f(a) = max{f(x) : x ∈ K}, f(b) = min{f(x) : x ∈ K}.

Proof. It suffices to prove the first clause. By Lemma E.70, f(K) is compact, and thus by Lemma E.47 f(K) is bounded; f(K) is also nonempty because K is nonempty. Then Theorem A.28 implies that f(K) ⊂ R must have a unique supremum

M := sup{f(x) : x ∈ K} ∈ R,

and hence there exists a sequence (x_n)_{n∈N} ⊂ K satisfying lim_{n→∞} f(x_n) = M. By Definition E.41, (x_n)_{n∈N} has a convergent subsequence (x_{n_k})_{k∈N} that converges to some c ∈ K. The continuity of f, Lemma E.59, and Lemma C.9 yield

lim_{k→∞} f(x_{n_k}) = f(c) = M.
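The counterexample f_n(t) = cos(2πnt)/√n from Example E.64 can be checked directly: ‖f_n‖_∞ = 1/√n → 0 while ‖f_n′‖_∞ = 2π√n → ∞. A short sketch, approximating the sup-norms on a grid:

```python
import math

def f(n, t):
    return math.cos(2 * math.pi * n * t) / math.sqrt(n)

def fprime(n, t):
    return -2 * math.pi * math.sqrt(n) * math.sin(2 * math.pi * n * t)

# sample sup-norms over [0, 1]; powers of 2 keep the extrema on the grid
grid = [i / 4096 for i in range(4097)]
sup_f = {n: max(abs(f(n, t)) for t in grid) for n in (1, 4, 16, 64)}
sup_fp = {n: max(abs(fprime(n, t)) for t in grid) for n in (1, 4, 16, 64)}
print(sup_f)   # shrinks like 1/sqrt(n)
print(sup_fp)  # grows like 2*pi*sqrt(n)
```

So f_n → 0 in ‖·‖_∞ while Df_n stays far from D0 = 0, exactly the failure of continuity described above.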


Example E.72. Since the set K = {x ∈ R³ : ‖x‖₂ = 1} is compact in R³ and the function x ↦ Σ_{j=1}^3 x_j is continuous, the optimization problem

minimize Σ_{j=1}^3 x_j, subject to ‖x‖₂ = 1,

has a minimizer.

E.1.7 Norm equivalence

Example E.73. The optimal mining problem in Example E.1 concerns C¹[a, b]. Since C¹[a, b] is a subspace of C[a, b], we could use the norm ‖·‖_∞ of C[a, b] as a norm for C¹[a, b]. But by Example E.64, the differentiation map would not be continuous; instead, if we equip C¹[a, b] with (E.31), then the differentiation map is continuous. Also, it might be more appropriate to regard two functions in C¹[a, b] as being close to each other if both their function values and their derivatives are close.

Definition E.74. Two norms ‖·‖_A and ‖·‖_B on V are equivalent, written ‖·‖_A ∼ ‖·‖_B, iff

∃c₁, c₂ ∈ R+ s.t. ∀v ∈ V, c₁‖v‖_A ≤ ‖v‖_B ≤ c₂‖v‖_A. (E.34)

Example E.75. The Euclidean norms in Definition B.114 satisfy

∀x ∈ R^n, ‖x‖_∞ ≤ ‖x‖_p ≤ n^{1/p} ‖x‖_∞. (E.35)

Therefore all the Euclidean ℓ^p norms are equivalent.

Exercise E.76. Show that ∼ in Definition E.74 defines an equivalence relation on the set of all norms on V.

Exercise E.77. Show that two norms ‖·‖_A and ‖·‖_B on a linear space V are equivalent if and only if each sequence converging with respect to ‖·‖_A also converges with respect to ‖·‖_B.

Theorem E.78. All norms are equivalent on R^n or C^n.

Proof. Since ∼ is an equivalence relation on the set of all norms, it suffices to prove that any norm ‖·‖ is equivalent to the 2-norm. Let e₁, e₂, · · · , e_n be a basis of R^n. Then any vector x ∈ R^n can be expressed as x = Σ_{i=1}^n x_i e_i. Thus

‖x‖ = ‖Σ_{i=1}^n x_i e_i‖ ≤ Σ_{i=1}^n |x_i| ‖e_i‖ ≤ M ‖x‖₂,

where M = (Σ_{i=1}^n ‖e_i‖²)^{1/2} and the last inequality follows from the Cauchy–Schwarz inequality (Theorem B.131). Set

K := {y ∈ R^n : ‖y‖₂ = 1}.

Since K is a compact set and the norm ‖·‖ : K → R is a continuous function (Lemma E.61), ‖·‖ must attain its minimum value m on K. Furthermore, m > 0 since 0 ∉ K. For x ≠ 0, we have y := x/‖x‖₂ ∈ K since ‖y‖₂ = 1. Then ‖y‖ ≥ m implies m‖x‖₂ ≤ ‖x‖.

Corollary E.79. Over a finite-dimensional space, any two norms are equivalent.

Proof. This follows from Theorem E.78 and the isomorphism of linear spaces.

Example E.80. In the normed space V := C[0, 1], consider the sequence of functions {u_n} given by

u_n(x) := { 1 − nx, x ∈ [0, 1/n];  0, x ∈ (1/n, 1] }.

For the p-norm in (E.14), we have

‖u_n‖_p = [n(p + 1)]^{−1/p},

and thus the sequence {u_n} converges to u = 0. However, for the ∞-norm in (E.15), we have

‖u_n‖_∞ = 1,

and thus the sequence {u_n} does not converge to u = 0.

E.1.8 Banach spaces

Definition E.81. A Cauchy sequence in a normed space V is a sequence {u_n} ⊂ V satisfying

lim_{m,n→+∞} ‖u_m − u_n‖ = 0. (E.36)

Definition E.82 (Banach spaces). A Banach space (or a complete normed space) is a normed space V such that every Cauchy sequence in V converges to an element in V.

Example E.83 (Q is not complete). The sequence

x₁ = 3/2; ∀n > 1, x_n = (4 + 3x_{n−1})/(3 + 2x_{n−1}) (E.37)

is bounded below by √2 and is monotonically decreasing. By Theorem C.12, (x_n) is convergent in R. However, although (x_n) is Cauchy in Q, it is not convergent in Q because

L = (4 + 3L)/(3 + 2L) ⇒ L = √2.

Example E.84. The sequence (Σ_{k=1}^n 1/k^k)_{n∈N} is Cauchy, but we do not know yet whether the limit is rational or irrational.

Theorem E.85. (C[a, b], ‖·‖_∞) is a Banach space.

Proof. It is straightforward to show that ‖·‖_∞ is a norm. We only show the completeness, in three steps.

First, at any fixed t ∈ [a, b], we can reduce a Cauchy sequence {f_n}_{n≥1} ⊂ C[a, b] to a sequence {f_n(t)} ⊂ R. The completeness of R (Theorem C.15) yields lim_{n→∞} f_n(t) ∈ R. For any Cauchy sequence {f_n} ⊂ C[a, b], this process furnishes a function f : [a, b] → R given by

f(t) = lim_{n→∞} f_n(t).

Second, we show f ∈ C[a, b], i.e., f is continuous. The sequence {f_n} ⊂ C[a, b] being Cauchy implies

∀ε > 0, ∃N ∈ N s.t. ∀m, n ≥ N, ‖f_m − f_n‖_∞ < ε/3.

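The rational sequence (E.37) of Example E.83 can be iterated in exact arithmetic with Python's Fraction type; every term is rational, yet the terms approach the irrational limit √2:

```python
from fractions import Fraction

x = Fraction(3, 2)
terms = [x]
for _ in range(6):
    x = (4 + 3 * x) / (3 + 2 * x)   # x_n = (4 + 3 x_{n-1}) / (3 + 2 x_{n-1})
    terms.append(x)

# monotonically decreasing, rational throughout, approaching sqrt(2)
decreasing = all(a > b for a, b in zip(terms, terms[1:]))
err = abs(float(terms[-1]) - 2 ** 0.5)
print(decreasing, float(terms[-1]), err)
```

The sequence is Cauchy in Q, but its limit √2 lies outside Q, which is exactly the failure of completeness.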

In particular, set m = N, let n → ∞, and we have

∀t ∈ [a, b], |f_N(t) − f(t)| = lim_{n→∞} |f_N(t) − f_n(t)| ≤ ε/3.

The condition f_N ∈ C[a, b] implies

∀t ∈ [a, b], ∀ε > 0, ∃δ > 0 s.t. |t − τ| < δ ⇒ |f_N(t) − f_N(τ)| < ε/3.

The above two equations yield

∀t ∈ [a, b], ∀ε > 0, ∃δ > 0 s.t. |t − τ| < δ ⇒
|f(t) − f(τ)| ≤ |f(t) − f_N(t)| + |f_N(t) − f_N(τ)| + |f_N(τ) − f(τ)| ≤ ε/3 + ε/3 + ε/3 = ε,

which shows that f is continuous at every t ∈ [a, b].

Finally, we show that {f_n}_{n≥1} indeed converges to f. The sequence {f_n} ⊂ C[a, b] being Cauchy implies

∀ε > 0, ∃N ∈ N s.t. ∀m, n > N, ‖f_m − f_n‖_∞ < ε.

For a fixed n > N, we have

∀m > N, ∀t ∈ [a, b], |f_n(t) − f_m(t)| ≤ ‖f_n − f_m‖_∞ < ε,

which implies

∀t ∈ [a, b], |f_n(t) − f(t)| = |f_n(t) − lim_{m→∞} f_m(t)| ≤ ε.

It follows that

‖f_n − f‖_∞ = max_{t∈[a,b]} |f_n(t) − f(t)| ≤ ε.

In the above process, we could have fixed any n > N at the outset to obtain the same result. Therefore we have

∀ε > 0, ∃N ∈ N s.t. ∀n > N, ‖f_n − f‖_∞ ≤ ε,

which implies lim_{n→∞} f_n = f.

Exercise E.86. Define C_b[0, ∞) as the set of all functions f that are continuous on [0, ∞) and satisfy

‖f‖_∞ := sup_{x≥0} |f(x)| < ∞.

Show that C_b[0, ∞) with this norm is complete.

Exercise E.87. Define C^α[a, b] as the set of all functions f ∈ C[a, b] satisfying

M_α(f) := sup_{x,y∈[a,b]; x≠y} |f(x) − f(y)| / |x − y|^α < ∞.

Define ‖f‖_α = ‖f‖_∞ + M_α(f). Show that (C^α[a, b], ‖·‖_α) is a Banach space.

Example E.88. For p ∈ [1, ∞), (C(Ω), ‖·‖_p) is not a Banach space. Consider (u_n)_{n∈N} ⊂ C[0, 1] given by

u_n(x) = { 0, x ∈ [0, 1/2 − 1/(2n)];  nx − (n−1)/2, x ∈ [1/2 − 1/(2n), 1/2 + 1/(2n)];  1, x ∈ [1/2 + 1/(2n), 1] }. (E.38)

(u_n)_{n∈N} is clearly Cauchy in ‖·‖_p and we have

lim_{n→∞} u_n = u, where u(x) = { 0, x ∈ [0, 1/2);  1, x ∈ (1/2, 1] }.

But u cannot be in C[0, 1] no matter how we define u(1/2).

Exercise E.89. Show that the sequence space (ℓ^p, ‖·‖_p) is complete for p ∈ [1, +∞].

Theorem E.90. In a Banach space, absolutely convergent series converge. More precisely, if (x_n)_{n∈N} is a sequence in a Banach space (X, ‖·‖) such that Σ_{n=1}^∞ ‖x_n‖ converges, then Σ_{n=1}^∞ x_n converges in X. Furthermore,

‖Σ_{n=1}^∞ x_n‖ ≤ Σ_{n=1}^∞ ‖x_n‖. (E.39)

Proof. Since X is Banach, it suffices to prove that the sequence of partial sums (s_n = Σ_{i=1}^n x_i)_{n∈N} is Cauchy. Since the real sequence (σ_n = Σ_{i=1}^n ‖x_i‖)_{n∈N} is Cauchy, we have

∀ε > 0, ∃N ∈ N+ s.t. ∀n > m > N, Σ_{i=m+1}^n ‖x_i‖ < ε,

which implies that (s_n)_{n∈N} is indeed Cauchy:

‖s_n − s_m‖ = ‖Σ_{i=m+1}^n x_i‖ ≤ Σ_{i=m+1}^n ‖x_i‖ < ε.

Set L := Σ_{n=1}^∞ x_n = lim_{n→∞} s_n and we have

∀ε > 0, ∃N ∈ N s.t. ∀n > N, ‖s_n − L‖ < ε,

which implies

‖L‖ ≤ ‖s_n − L‖ + ‖s_n‖ < ε + σ_n ≤ ε + Σ_{n=1}^∞ ‖x_n‖,

where the second inequality follows from the triangle inequality. Then (E.39) holds because ε can be made arbitrarily small.

Example E.91. The series Σ_{n=1}^∞ sin(nx)/n² converges in (C[0, 2π], ‖·‖_∞) since Σ_{n=1}^∞ 1/n² converges in R. Hence x ↦ Σ_{n=1}^∞ sin(nx)/n² defines a continuous function.

Exercise E.92. Prove the converse of Theorem E.90, i.e., a normed space X is complete if every absolutely convergent series converges in X.

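Example E.88 can be probed numerically: the ramps u_n are Cauchy in ‖·‖₁ (the 1-norm distances shrink), yet their sup-distances stay bounded away from 0, so the pointwise limit leaves C[0, 1]. A sketch, approximating both norms on a uniform grid:

```python
def u(n, x):
    # ramp of (E.38): 0 left of the transition, linear in the middle, 1 right
    lo, hi = 0.5 - 0.5 / n, 0.5 + 0.5 / n
    if x <= lo:
        return 0.0
    if x >= hi:
        return 1.0
    return n * x - (n - 1) / 2.0

grid = [i / 10**5 for i in range(10**5 + 1)]
h = 1.0 / 10**5

def dist1(n, m):
    # trapezoid approximation of the 1-norm distance ||u_n - u_m||_1
    vals = [abs(u(n, x) - u(m, x)) for x in grid]
    return h * (sum(vals) - 0.5 * (vals[0] + vals[-1]))

def dist_inf(n, m):
    return max(abs(u(n, x) - u(m, x)) for x in grid)

print(dist1(10, 100), dist1(100, 1000))      # shrinks: Cauchy in the 1-norm
print(dist_inf(10, 100), dist_inf(100, 1000))  # stays near 1/2
```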

Theorem E.93. For each normed space V, there exist a complete normed space W and a dense subspace V̂ ⊂ W such that one can find an isometric isomorphism between V and V̂, i.e., a bijective linear function I : V → V̂ satisfying

∀v ∈ V, ‖Iv‖_W = ‖v‖_V. (E.40)

Furthermore, the complete normed space W is unique up to isometric isomorphism.

Definition E.94. The normed space W in Theorem E.93 is called the completion of the normed space V.

Example E.95. If V is the normed space Q of rational numbers, then W = R is a completion of Q, where each element is an equivalence class of Cauchy sequences of rational numbers.

E.2 Continuous linear maps

E.2.1 The space CL(X, Y)

Notation 11. CL(X, Y) denotes the set of all continuous linear transformations or bounded linear transformations from the normed space X to the normed space Y,

CL(X, Y) := C(X, Y) ∩ L(X, Y). (E.41)

For Y = X, we write CL(X).

Theorem E.96. For any map T ∈ L(X, Y), the following statements are equivalent:

(1) T is continuous;
(2) T is continuous at 0;
(3) ∃M ∈ R+ s.t. ∀x ∈ X, ‖Tx‖_Y ≤ M‖x‖_X.

Proof. (1)⇒(2) follows from Definition E.58. For (2)⇒(3), the continuity of T at 0 implies

for ε = 1, ∃δ > 0 s.t. ‖x‖ < δ ⇒ ‖Tx‖ < 1.

Replacing x with y = δx/(2‖x‖) in the above inequalities yields ‖Tx‖ ≤ M‖x‖ with M = 2/δ. Finally, (3)⇒(1) follows from

∀ε > 0, ∃δ = ε/M s.t. ‖x − y‖ < δ ⇒
‖Tx − Ty‖ = ‖T(x − y)‖ ≤ M‖x − y‖ < Mδ = ε.

Example E.97. The left shift operator L : ℓ² → ℓ² and right shift operator R : ℓ² → ℓ²,

L(a₁, a₂, a₃, . . .) = (a₂, a₃, . . .), (E.42)
R(a₁, a₂, a₃, . . .) = (0, a₁, a₂, . . .), (E.43)

are linear operators. Furthermore, L, R ∈ CL(ℓ²) because they are bounded:

‖L(a_n)_{n∈N}‖ ≤ ‖(a_n)_{n∈N}‖, ‖R(a_n)_{n∈N}‖ = ‖(a_n)_{n∈N}‖.

Example E.98. The linear map T : (C[a, b], ‖·‖_∞) → R given by T(f) = ∫_a^b f(t) dt is continuous because

|T(f)| = |∫_a^b f(t) dt| ≤ ∫_a^b ‖f‖_∞ dt = (b − a)‖f‖_∞.

By Lemma E.59, T preserves convergent sequences:

lim_{n→∞} f_n = f ⇒ lim_{n→∞} ∫_a^b f_n = ∫_a^b f.

In other words, the continuity of T under ‖·‖_∞ guarantees that T and lim_{n→∞} are commutative; see Section C.7.

Theorem E.99 (Existence and uniqueness of ODEs). The IVP

dx/dt (t) = f(x(t), t) (E.44)

with initial condition x(0) = x₀ ∈ R has a unique solution x ∈ C¹[0, T] for some T > 0, if f : R × [0, ∞) → R is Lipschitz continuous in space and continuous in time.

Proof. For existence, we define y₀(t) = x₀ and

(∗) : y_{n+1}(t) = x₀ + ∫_0^t f(y_n(τ), τ) dτ.

For any t ∈ [0, 1/(2L)], where L is the Lipschitz constant,

|y_{n+1}(t) − y_n(t)| = |∫_0^t [f(y_n(τ), τ) − f(y_{n−1}(τ), τ)] dτ|
≤ ∫_0^t |f(y_n(τ), τ) − f(y_{n−1}(τ), τ)| dτ
≤ ∫_0^t L |y_n(τ) − y_{n−1}(τ)| dτ
≤ ∫_0^t L ‖y_n − y_{n−1}‖_∞ dτ
≤ (1/2) ‖y_n − y_{n−1}‖_∞.

Hence we have

‖y_{n+1} − y_n‖_∞ ≤ (1/2) ‖y_n − y_{n−1}‖_∞ ≤ (1/2^n) ‖y₁ − y₀‖_∞.

It follows that (y_n)_{n∈N} is a Cauchy sequence in the ∞-norm on [0, T] with T = 1/(2L), and there exists a continuous y such that lim_{n→∞} y_n = y. Similarly, (f(y_n, t))_{n∈N} is a Cauchy sequence and there exists f(y, t) such that lim_{n→∞} f(y_n, t) = f(y, t). Take lim_{n→∞} of (∗), apply Example E.98, and we have

(∗∗) : y(t) = x₀ + ∫_0^t f(y(τ), τ) dτ.

It is trivial to check that the above y(t) solves (E.44).

For uniqueness, suppose for two solutions x and y of (E.44) there exists t∗ ∈ (0, T) satisfying

t∗ := max{t ∈ [0, T] : ∀τ ≤ t, y(τ) = x(τ)}.

We choose

N := max{2, 1/(L(T − t∗))},
M := max_{t∈[t∗, t∗+1/(LN)]} |x(t) − y(t)|

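The existence proof of Theorem E.99 is constructive: it is Picard iteration (∗). The following sketch applies it to the test problem x′ = x, x(0) = 1 (whose exact solution is e^t), discretizing each integral with the trapezoid rule on [0, 1/2]; the grid size and iteration count are arbitrary illustration choices:

```python
import math

# Picard iteration y_{n+1}(t) = x0 + \int_0^t f(y_n(s), s) ds for x' = x, x(0) = 1
f = lambda y, t: y
x0, T, m = 1.0, 0.5, 2000
h = T / m
ts = [i * h for i in range(m + 1)]

y = [x0] * (m + 1)           # y_0 is the constant function x0
for _ in range(25):          # 25 Picard sweeps
    integral, new = 0.0, [x0]
    for i in range(1, m + 1):
        integral += 0.5 * h * (f(y[i - 1], ts[i - 1]) + f(y[i], ts[i]))
        new.append(x0 + integral)
    y = new

err = max(abs(y[i] - math.exp(ts[i])) for i in range(m + 1))
print(err)  # the iterates converge to the exact solution e^t
```

Each sweep contracts the error by roughly the factor 1/2 from the proof, plus an O(h²) quadrature error.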

to obtain t∗ + 1/(LN) ≤ T. Then (∗∗) implies

∀t ∈ [t∗, t∗ + 1/(LN)],
|x(t) − y(t)| = |∫_{t∗}^t [f(x(τ), τ) − f(y(τ), τ)] dτ|
≤ ∫_{t∗}^t |f(x(τ), τ) − f(y(τ), τ)| dτ ≤ ∫_{t∗}^t L |x(τ) − y(τ)| dτ
≤ LM(t − t∗) ≤ M/N,

which yields M ≤ M/N; since N ≥ 2, this forces M = 0, i.e., x = y on [t∗, t∗ + 1/(LN)], contradicting the maximality of t∗. Hence the uniqueness is proved by the non-existence of such a t∗.

Example E.100. The continuity of the differentiation maps in Example E.64 can be determined by Theorem E.96. For ‖·‖_{1,∞}, we have ‖Dx‖_∞ ≤ ‖x‖_{1,∞}, and thus the operator D : (C¹[0, 1], ‖·‖_{1,∞}) → (C[0, 1], ‖·‖_∞) is continuous. In comparison, D : (C¹[0, 1], ‖·‖_∞) → (C[0, 1], ‖·‖_∞) is not continuous: for x_n = t^n, we have ‖x_n‖_∞ = 1 yet lim_{n→∞} ‖x_n′‖_∞ = ∞.

Corollary E.101. For finite-dimensional normed spaces X and Y, we have L(X, Y) = CL(X, Y).

Proof. Each linear transformation T_A ∈ L(R^n, R^m) has a matrix A ∈ R^{m×n} such that

‖T_A x‖₂² = ‖Ax‖₂² = Σ_{i=1}^m (Σ_{j=1}^n a_{ij} x_j)²
≤ Σ_{i=1}^m (Σ_{j=1}^n a_{ij}²)(Σ_{j=1}^n x_j²) = ‖x‖₂² Σ_{i=1}^m Σ_{j=1}^n a_{ij}²,

where the inequality follows from the Cauchy–Schwarz inequality. The proof is completed by Theorem E.96, Theorem E.78, and the isomorphism of linear spaces.

Exercise E.102. For an infinite-dimensional matrix A satisfying Σ_{i=1}^∞ Σ_{j=1}^∞ a_{ij}² < ∞, define T_A : ℓ² → ℓ² by

∀x = (x_j)_{j∈N} ∈ ℓ², T_A x = Ax = (Σ_{j=1}^∞ a_{ij} x_j)_{i∈N+}.

Prove T_A ∈ CL(ℓ²).

Exercise E.103. For A, B ∈ C[a, b] and

S := {f ∈ C¹[a, b] : f(a) = f(b) = 0}, (E.45)

show that the map L : (S, ‖·‖_{1,∞}) → R given by

L(f) = ∫_a^b [A(t) f′(t) + B(t) f(t)] dt

is a bounded linear transformation.

Exercise E.104. For A ∈ R^{m×n}, show that the subspace ker A is closed in R^n.

Theorem E.105. Every subspace of R^n is closed.

Exercise E.106. Prove Theorem E.105.

Lemma E.107. CL(X, Y) is a subspace of L(X, Y).

Exercise E.108. Prove Lemma E.107.

Lemma E.109. The operator norm ‖·‖ : CL(X, Y) → R,

∀T ∈ CL(X, Y), ‖T‖ := sup{‖Tx‖ : x ∈ X, ‖x‖ ≤ 1}, (E.46)

is well defined, i.e., ‖T‖ is a unique bounded real number.

Proof. By Theorem A.28, it suffices to show that

S := {‖Tx‖ : x ∈ X, ‖x‖ ≤ 1} (E.47)

is a nonempty bounded subset of R. S is nonempty because 0 ∈ X and T0 = 0_Y imply 0 ∈ S. The boundedness of S follows from Theorem E.96(3) and ‖x‖_X ≤ 1.

Lemma E.110. For any T ∈ CL(X, Y), we have

(∀x ∈ X, ‖Tx‖ ≤ M‖x‖) ⇒ ‖T‖ ≤ M. (E.48)

Proof. M is an upper bound of the set S in (E.47) while ‖T‖ is the least upper bound of S.

Lemma E.111. ∀T ∈ CL(X, Y), ∀x ∈ X, ‖Tx‖ ≤ ‖T‖ ‖x‖.

Proof. The statement holds trivially for x = 0. Otherwise, for y = x/‖x‖ we have ‖Ty‖ ∈ S, where S is in (E.47). Hence

‖Ty‖ ≤ ‖T‖ ⇒ ‖Tx‖ ≤ ‖T‖ ‖x‖.

Lemma E.112. ∀S ∈ CL(Y, Z), ∀T ∈ CL(X, Y), we have ‖ST‖ ≤ ‖S‖ ‖T‖.

Proof. This follows from Lemmas E.110 and E.111.

Theorem E.113. (CL(X, Y), ‖·‖) is a normed space.

Exercise E.114. Prove Theorem E.113.

Lemma E.115. For a normed space X, (CL(X, Y), ‖·‖) is a Banach space if Y is a Banach space.

Proof. Let (T_n)_{n∈N} be a Cauchy sequence in CL(X, Y). For any x ∈ X, (T_n x)_{n∈N} ⊂ Y is Cauchy, as Lemma E.111 yields

‖T_n x − T_m x‖ ≤ ‖T_n − T_m‖ ‖x‖.

Since Y is complete, (T_n x)_{n∈N} converges to some L(x) ∈ Y. This defines a map T(x) = L(x). The second step is to show T ∈ CL(X, Y). The third step is to show lim_{n→∞} T_n = T.

Exercise E.116. Supplement the proof of Lemma E.115 with all details.

Corollary E.117. If X is a normed space over R, then the dual space of X, X′ = CL(X, R), is a Banach space with the operator norm.

Proof. This follows directly from Lemma E.115.

Corollary E.118. If X is a Banach space, then CL(X) is a Banach space with the operator norm.

Proof. This follows directly from Lemma E.115.

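The operator norm (E.46) can be estimated by sampling the unit sphere, which also lets us spot-check the submultiplicativity ‖ST‖ ≤ ‖S‖ ‖T‖ of Lemma E.112. The matrices below are arbitrary illustration choices, and random sampling only yields a lower bound for the supremum:

```python
import random

def matvec(A, x):
    return [sum(a * t for a, t in zip(row, x)) for row in A]

def norm2(x):
    return sum(t * t for t in x) ** 0.5

def op_norm_est(A, samples=20000):
    # Monte-Carlo lower bound for sup{ ||Ax|| : ||x|| = 1 }
    best = 0.0
    for _ in range(samples):
        x = [random.gauss(0, 1) for _ in A[0]]
        best = max(best, norm2(matvec(A, x)) / norm2(x))
    return best

random.seed(1)
S = [[2.0, 1.0], [0.0, 1.0]]
T = [[1.0, 0.0], [3.0, 1.0]]
ST = [[sum(S[i][k] * T[k][j] for k in range(2)) for j in range(2)] for i in range(2)]

nS, nT, nST = op_norm_est(S), op_norm_est(T), op_norm_est(ST)
print(nS, nT, nST, nST <= nS * nT + 1e-9)
```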

Definition E.119. An algebra is a vector space V with an Proof. We need to show


associative and distributive multiplication V × V → V ,
∀x ∈ X, ∃r > 0 s.t. B(x, r) ∩ (X \ F ) 6= ∅.
∀u, v, w ∈ V, ∀α ∈ F, If x ∈ (X \ F ), then we are done. Otherwise x ∈ F implies

 u(vw) = (uv)w, that B(x, r) is not contained in F for any r > 0. Therefore,
(u + v)w = uw + vw, u(v + w) = uv + uw, (E.49)

α(uv) = u(αv) = (αu)v. ∀r > 0, ∀x ∈ X, ∃y ∈ B(x, r) ⊂ (X \ F ) s.t. ky − xk < r.
Then the proof is completed by Lemma D.190.
The multiplicative identity is the element e ∈ V such that
∀v ∈ V , ev = v = ve. Theorem E.124 (Baire). Suppose (Fn )n∈N is a sequence
of closed sets in a Banach space X such that X = ∪n∈N Fn .
Definition E.120. A normed algebra is an algebra V with Then there exists an n ∈ N and a nonempty open set U such
a norm k · k satisfying that U ⊂ Fn .

∀u, v ∈ V, kuvk ≤ kukkvk. (E.50) Proof. Suppose that no Fn contains any nonempty open set.
Then Lemma E.123 implies that X \ Fn is dense in X for
A Banach algebra is a normed algebra that is complete. each n ∈ N. Therefore we have
∃x1 ∈ (X \ F1 ), ∃r1 > 0 s.t. B(x1 , r1 ) ⊂ (X \ F1 ).
E.2.2 The topology of CL(X, Y ) Both B(x1 , r1 ) and (X \ F1 ) are open and thus their inter-
Notation 12. For a vector space X and its subsets A, A1 , section D2 := B(x1 , r1 ) ∩ (X \ F1 ) is also open. Hence,
A2 , we write  r 
1
∃x2 ∈ D2 , ∃r2 ∈ 0, s.t. B(x2 , r2 ) ⊂ D2 ;
2
∀α ∈ R, αA := {αa, a ∈ A};
∀w ∈ X, A + w := {a + w, a ∈ A}. Proceed inductively and we have
 r
∀A1 , A2 ⊂ X, A1 + A2 := {a1 + a2 : a1 ∈ A1 , a2 ∈ A2 }.

n−1
∃xn ∈ Dn , ∃rn ∈ 0, s.t. B(xn , rn ) ⊂ Dn ,
(E.51) 2
where Dn := B(xn−1 , rn−1 ) ∩ (X \ Fn−1 ). By construction,
Definition E.121. A linear map T : X → Y between
n > m implies B(xn , rn ) ⊂ B(xm , rm ) and
normed spaces X and Y is open if its image of any open
set is open. r1
kxn − xm k < rm < .
2m−1
Lemma E.122. Let X and Y be normed spaces. A
Hence (xn )n∈N is a Cauchy sequence and converges to x in
bounded linear map T ∈ CL(X, Y ) is open if and only if
the Banach space X. For any m ∈ N, we have
the image of the unit open ball in X under T contains some
open ball centered at 0Y in Y , i.e., x ∈ B(xm , rm ) ⊂ (X \ ∪m
i=1 Fm ) ,

∃δ > 0 s.t. B(0Y , δ) ⊂ T (B(0X , 1)) . (E.52) which contradicts X = ∪n∈N Fn as m → ∞.


Lemma E.125 (Unit open ball). Suppose X and Y are
Proof. For necessity, T being an open map implies that Banach spaces and T ∈ CL(X, Y ) is surjective. Then the
the image T (B(0X , 1)) is open. The linearity of T implies image T (B ) of the open ball B := B(0 , 1) contains an
0 0 X
0Y ∈ T (B(0X , 1)). Then Lemma D.190 yields (E.52). open ball about 0Y .
For sufficiency, let U ⊂ X be open, we need to show
Proof. Define  
(∗) : ∀y0 ∈ T (U ), ∃rY > 0 s.t. B(y0 , rY ) ⊂ T (U ). 1
Bn := B 0X , n
2
y0 ∈ T (U ) implies there exists x0 ∈ U such that T x0 = y0 .
and we show T (B1 ) contains an open ball. Indeed
Since U is open, we have
∀x ∈ X, ∃k > 2kxk s.t. x ∈ kB1
(∗∗) : ∃rX > 0 s.t. B(x0 , rX ) ⊂ U.
and thus X = ∪k∈N kB1 . T being surjective implies
Choose rY = δrX and we have [
!
[ [
Y = T (X) = T kB1 = kT (B1 ) = kT (B1 ),
B(T x0 , rY ) = T x0 + B(0Y , δrX ) ⊂ T x0 + T B(0X , rX ) k∈N+ k∈N+ k∈N+
= T B(x0 , δrX ) ⊂ T (U ), where the last step follows from the condition of Y be-
ing a Banach space. By Theorem E.124, there exists some
where the second step follows from (E.52).
kT (B1 ) that contains a nonempty open ball, which implies
that T (B1 ) also contains an open ball, say,
Lemma E.123. If a closed set F in a normed space X does
not contain any open set, then X \ F is dense in X. B(y0 , ) ⊂ T (B1 ),

106
Qinghai Zhang Numerical Analysis 2021

which implies (cl S denoting the closure of S)

B(0Y, ε) = B(y0, ε) − y0 ⊂ cl T(B1) − cl T(B1) = cl T(B1) + cl T(B1) ⊂ cl(T(B1) + T(B1)) = cl T(B0),

where the second equality uses the symmetry of B1. To sum up the above arguments, we have

(∗) : B(0Y, ε) ⊂ cl T(B0).

Define Vn := B(0Y, ε/2^n). To complete the proof, we show

V1 = B(0Y, ε/2) ⊂ T(B0).

The linearity of T and (∗) imply

(△) : ∀n ∈ N, Vn ⊂ cl T(Bn).

For y ∈ V1, we have y ∈ cl T(B1). Since both T and ‖·‖ are continuous, the map x ↦ ‖y − Tx‖ is also continuous, and therefore

∃x1 ∈ B1 s.t. ‖y − Tx1‖ < ε/4.

By the definition of Vn and (△), y − Tx1 ∈ V2 ⊂ cl T(B2). Thus

∃x2 ∈ B2 s.t. ‖(y − Tx1) − Tx2‖ < ε/8.

Proceed inductively and we have

∀k = 1, 2, . . . , n, ∃xk ∈ Bk s.t. ‖y − Σ_{k=1}^{n} Txk‖ < ε/2^{n+1}.

Take the limit of the above and we have

(♦) : y = T( lim_{n→∞} Σ_{k=1}^{n} xk ).

For each k, xk ∈ Bk implies ‖xk‖ < 1/2^k. Hence

Σ_{k=1}^{∞} ‖xk‖ < Σ_{k=1}^{∞} 1/2^k = 1.

The completeness of X and Theorem E.90 yield

∃x ∈ X s.t. x = Σ_{k=1}^{∞} xk, ‖x‖ < 1.

Then (♦) yields y = Tx ∈ T(B0).

Theorem E.126 (Open mapping). For Banach spaces X and Y, any surjective map T ∈ CL(X, Y) is open.

Proof. This follows from Lemmas E.122 and E.125.

Example E.127. The following function f : R → R,

f(x) = x + 1 if x ∈ (−∞, −1];  0 if x ∈ (−1, +1);  x − 1 if x ∈ [+1, +∞),

is surjective and continuous; but since the image f((−1, +1)) = {0} of the open interval (−1, +1) is not open, f is not open. By the open mapping theorem, if a map between two Banach spaces is surjective and continuous but not open, then it must be nonlinear.

E.2.3 Invertible operators

Lemma E.128. In a finite-dimensional vector space X, if two operators T, S ∈ L(X) satisfy TS = I, then ST = I.

Proof. TS = I implies ker S = {0} because

Sx = 0 ⇒ TSx = 0 ⇒ x = 0.

Thus for any basis (vi)_{i=1}^{n} of X, (Svi)_{i=1}^{n} is also a basis. Hence

∀x ∈ X, ∃(βi)_{i=1}^{n} s.t. x = Σ_{i=1}^{n} βi Svi = S( Σ_{i=1}^{n} βi vi ).

It follows that

∀x ∈ X, STx = STS( Σ_{i=1}^{n} βi vi ) = S( Σ_{i=1}^{n} βi vi ) = x,

which implies ST = I.

Example E.129. For the shift operators on ℓ2 in Example E.97, we have LR = I but RL ≠ I, e.g.

RL(1, 0, 0, . . .) = (0, 0, 0, . . .).

Definition E.130. For vector spaces X and Y, a map A ∈ L(X, Y) is invertible if there exists B ∈ L(Y, X) such that AB = I ∈ L(Y) and BA = I ∈ L(X). Then B is called the inverse of A.

Exercise E.131. Prove that the inverse of A ∈ L(X, Y) is unique if A is invertible.

Lemma E.132. For any vector spaces X and Y, if a linear map A ∈ L(X, Y) is invertible, then A is bijective.

Proof. A is injective because

Ax = Ay ⇒ A⁻¹Ax = A⁻¹Ay ⇒ x = y.

A is surjective because each y ∈ Y satisfies y = Ax for x = A⁻¹y ∈ X.

Lemma E.133. For any vector spaces X and Y, if a map A ∈ L(X, Y) is invertible, then its inverse A⁻¹ is linear.

Proof. For any x, y ∈ Y, set z = A⁻¹(x + y) and we have

x + y = Az ⇒ A(A⁻¹x + A⁻¹y) = x + y = Az ⇒ A⁻¹x + A⁻¹y = z = A⁻¹(x + y),

where the last step follows from the injectivity of A. Similarly, for any nonzero α ∈ F and x ∈ Y, set z = A⁻¹(αx) and we have

Az = αx ⇒ A(z/α) = x ⇒ z/α = A⁻¹x ⇒ αA⁻¹x = z = A⁻¹(αx);

the case α = 0 is trivial.

Lemma E.134. For finite-dimensional vector spaces X and Y, if a map A ∈ L(X, Y) is bijective, then A is invertible.
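The failure of RL = I in Example E.129 can be seen concretely by truncating ℓ2 sequences to finitely many terms. The sketch below represents a sequence as a Python list; the helper names `L_shift` and `R_shift` are ours, not from the notes.

```python
def R_shift(x):
    """Right shift: R(x1, x2, ...) = (0, x1, x2, ...)."""
    return [0.0] + list(x)

def L_shift(x):
    """Left shift: L(x1, x2, ...) = (x2, x3, ...)."""
    return list(x[1:])

x = [1.0, 0.0, 0.0]
lr = L_shift(R_shift(x))   # LR = I: the right shift is undone exactly
rl = R_shift(L_shift(x))   # RL != I: the first term is destroyed
```

Here `lr` recovers `x`, while `rl` is the zero sequence, matching RL(1, 0, 0, . . .) = (0, 0, 0, . . .): L discards the first term, so R cannot restore it.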
Proof. For the bijective map A, define a map B : Y → X by

∀v ∈ Y, A(Bv) = v.

The existence and uniqueness of Bv are guaranteed by the surjectivity and injectivity of A, respectively. Therefore, AB = I. Furthermore, BA = I follows from the injectivity of A and

∀v ∈ X, A(BAv) = (AB)(Av) = Av.

Finally, Lemma E.133 implies that B is a linear map.

Theorem E.135. Suppose X and Y are finite-dimensional normed spaces. Then a map A ∈ CL(X, Y) is invertible with A⁻¹ ∈ CL(Y, X) if and only if A is bijective.

Proof. This follows from Lemmas E.132, E.133, E.134, and Corollary E.101.

Example E.136. The map A : c00 → c00 given by

∀(xn)_{n∈N} ∈ c00, A(x1, x2, x3, . . .) = (x1, x2/2, x3/3, . . .)

is linear, bijective, and continuous (since ‖Ax‖∞ ≤ ‖x‖∞). However, it is not invertible in CL(c00). Suppose it were, with B ∈ CL(c00) the inverse of A. Then for the sequences em := (0, . . . , 0, 1, 0, . . .), where all terms are 0 except that the mth term is 1, we have

1 = ‖em‖∞ = ‖BAem‖∞ ≤ ‖B‖‖Aem‖∞ = ‖B‖/m.

Hence ∀m ∈ N, ‖B‖ ≥ m, and this contradicts Lemma E.109.

Theorem E.137 (Banach). For Banach spaces X and Y, a map T ∈ CL(X, Y) is invertible with T⁻¹ ∈ CL(Y, X) if and only if T is bijective.

Proof. The necessity follows from Lemma E.132. For sufficiency, the bijective map T induces a map T⁻¹ : Y → X,

∀y = Tx ∈ Y, T⁻¹(y) = x.

Since the bijectivity of T guarantees that T⁻¹ is well defined, T⁻¹ is indeed an inverse of T. By Lemma E.133, T⁻¹ is linear. It remains to show that T⁻¹ is continuous. By the surjectivity of T and Theorem E.126, T is open; hence T(U) is open whenever U is open. Meanwhile, for any open U ⊂ X, the preimage of U under T⁻¹ is

(T⁻¹)⁻¹(U) = {y ∈ Y : T⁻¹y ∈ U} = {y ∈ Y : y ∈ T(U)} = T(U).

Thus T⁻¹ is continuous by Theorem E.67.

Definition E.138. A pair of isomorphic normed spaces consists of Banach spaces X and Y for which there exists a bijective map T ∈ CL(X, Y). Then T is called an isomorphism of normed spaces and we write X ≃ Y.

Theorem E.139 (Closed graph). For two Banach spaces (X, ‖·‖X) and (Y, ‖·‖Y), a map T ∈ L(X, Y) is continuous if and only if its graph G(T) := {(x, Tx) : x ∈ X} is closed in (X × Y, ‖·‖∞), where

∀(x, y) ∈ X × Y, ‖(x, y)‖∞ := max(‖x‖X, ‖y‖Y).     (E.53)

Exercise E.140. Prove Theorem E.139.

E.2.4 Series of operators

Theorem E.141 (Neumann series). Suppose X is a Banach space and A ∈ CL(X) has ‖A‖ < 1. Then we have

(NST-1) I − A is invertible in CL(X);

(NST-2) (I − A)⁻¹ = I + A + · · · + A^n + · · · = Σ_{n=0}^{∞} A^n;

(NST-3) ‖(I − A)⁻¹‖ ≤ 1/(1 − ‖A‖).

Proof. Since X is a Banach space, Corollary E.118 states that CL(X) is also a Banach space. Since ‖A^n‖ ≤ ‖A‖^n and ‖A‖ < 1, the series Σ_{n=0}^{∞} ‖A^n‖ converges; then Theorem E.90 implies that the sequence (Sn)_{n∈N} with

Sn = Σ_{k=0}^{n} A^k

converges to some S ∈ CL(X). It follows that

Sn A = A Sn = Σ_{k=1}^{n+1} A^k = S_{n+1} − I

and

‖ASn − AS‖ ≤ ‖A‖‖Sn − S‖,  ‖Sn A − SA‖ ≤ ‖A‖‖S − Sn‖,

whence

SA = AS = S − I ⇒ (I − A)S = I = S(I − A) ⇒ (I − A)⁻¹ = S = Σ_{n=0}^{∞} A^n,

where the last step follows from Definition E.130. Finally, (NST-3) follows from

‖(I − A)⁻¹‖ = ‖Σ_{n=0}^{∞} A^n‖ ≤ Σ_{n=0}^{∞} ‖A^n‖ ≤ Σ_{n=0}^{∞} ‖A‖^n = 1/(1 − ‖A‖),

where the first inequality follows from Theorem E.90 and the second from Lemma E.112.

Theorem E.142. Suppose X is a Banach space. Then the exponential of A ∈ CL(X), defined as

e^A := Σ_{n=0}^{∞} (1/n!) A^n,     (E.54)

converges in CL(X).

Proof. By Lemma E.112, we have

Σ_{n=0}^{∞} ‖(1/n!) A^n‖ ≤ Σ_{n=0}^{∞} (1/n!) ‖A‖^n = e^{‖A‖}.

By the comparison test, Σ_{n=0}^{∞} (1/n!) A^n converges absolutely. The rest of the proof follows from Theorem E.90.

Lemma E.143. For a Banach space X, A ∈ CL(X) satisfies

(d/dt) e^{tA} = A e^{tA} = e^{tA} A.     (E.55)

Lemma E.144. For a Banach space X, if A, B ∈ CL(X) commute, i.e. AB = BA, then

e^{A+B} = e^A e^B.     (E.56)
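The Neumann series of Theorem E.141 is directly computable for matrices (CL(X) with X = R²). The sketch below, with an arbitrarily chosen 2×2 matrix A of norm less than 1 (our illustrative values, not from the notes), compares the partial sums Sn = Σ_{k=0}^{n} A^k against the inverse of I − A computed by the 2×2 cofactor formula.

```python
def matmul(A, B):
    """Product of two 2x2 matrices stored as nested lists."""
    return [[sum(A[i][k] * B[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

def matadd(A, B):
    return [[A[i][j] + B[i][j] for j in range(2)] for i in range(2)]

I = [[1.0, 0.0], [0.0, 1.0]]
A = [[0.2, 0.1], [0.0, 0.3]]            # any induced norm of A is < 1

# Partial sums of the Neumann series: S = I + A + A^2 + ...
S, term = I, I
for _ in range(60):
    term = matmul(term, A)              # term = A^k after k iterations
    S = matadd(S, term)

# Exact inverse of M = I - A via the 2x2 cofactor formula.
M = [[1 - A[0][0], -A[0][1]], [-A[1][0], 1 - A[1][1]]]
det = M[0][0] * M[1][1] - M[0][1] * M[1][0]
Minv = [[M[1][1] / det, -M[0][1] / det],
        [-M[1][0] / det, M[0][0] / det]]
```

After 60 terms the truncation error is on the order of ‖A‖^61, so `S` and `Minv` agree to machine precision, illustrating (NST-2).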
Corollary E.145. For a Banach space X and A ∈ CL(X), e^A is always invertible with inverse e^{−A}.

Theorem E.146 (Existence and uniqueness of ODEs). For a Banach space X and A ∈ CL(X), the IVP

(dx/dt)(t) = Ax(t)     (E.57)

with initial condition x(0) = x0 ∈ X has a unique solution x(t) = e^{tA} x0 for t ∈ R.

Proof. If x(t) solves (E.57), then

(d/dt)(e^{−tA} x(t)) = e^{−tA}(−A)x(t) + e^{−tA}(dx/dt)(t) = 0,

which implies e^{−tA} x(t) = x0 and thus x(t) = e^{tA} x0.

E.2.5 Uniform boundedness

Lemma E.147. Suppose X is a normed space and a subset A ⊂ X satisfies

• A is symmetric, i.e., −A = A;
• A is mid-point convex, i.e., ∀x, y ∈ A, (x + y)/2 ∈ A;
• there exists a nonempty open set U ⊂ A.

Then there exists δ > 0 such that B(0X, δ) ⊂ A.

Proof. For α ≠ 0 and a ∈ X, the maps x ↦ x + a and x ↦ αx are both continuous with continuous inverses. By Theorem E.67, U being open in X implies that its preimage U + {−a} under x ↦ x + a is also open in X. Adopting the notation in (E.51), we find that the set

U + (−A) := ∪_{a∈A} (U + {−a})

is open since it is a union of open sets. For a ∈ U, we have

0X = (a − a)/2 ∈ (U + (−A))/2 ⊂ (A + (−A))/2 = (A + A)/2 = A,

where the last two equalities follow from A being symmetric and mid-point convex, respectively. The proof is completed by Lemma D.190 and the set (U + (−A))/2 being open.

Theorem E.148 (Uniform boundedness principle). Suppose X is a Banach space and Y is a normed linear space. For a family of maps Ti ∈ CL(X, Y), i ∈ I, "pointwise boundedness" implies "uniform boundedness":

∀x ∈ X, sup_{i∈I} ‖Ti x‖ < +∞  ⇒  sup_{i∈I} ‖Ti‖ < +∞.

Proof. For any given n ∈ N, we define

Fn := ∩_{i∈I} {x ∈ X : ‖Ti x‖ ≤ n} = {x ∈ X : sup_{i∈I} ‖Ti x‖ ≤ n}.

As an intersection of closed sets, each Fn is closed. By pointwise boundedness, we have X = ∪_{n∈N} Fn. The Baire theorem E.124 implies that there exists some Fn that contains a nonempty open subset. Since Fn is also symmetric and mid-point convex, Lemma E.147 implies that Fn contains an open ball B(0X, δ). Consequently, x ∈ B(0X, δ) ⊂ Fn implies

‖x‖ < δ ⇒ ∀i ∈ I, ‖Ti x‖ ≤ n.

Thus for any nonzero x ∈ X, there exists y = (δ/2)(x/‖x‖) such that

∀i ∈ I, ‖Ti y‖ ≤ n ⇒ ‖Ti x‖ ≤ (2n/δ)‖x‖,

and the proof is completed by Lemma E.110.

Example E.149. Many PDEs can be written in the form

T x = y,

where y is a known vector incorporating initial and boundary conditions, x is the unknown, and T is a continuous linear operator. If the PDE is well posed, we can often assume that T is a bijection; hence by Theorem E.137 the inverse of T is a bounded linear operator and we write x = T⁻¹y. In numerically solving the PDE, we usually approximate y by a grid function yn and approximate T⁻¹ by a discrete operator Tn⁻¹. Convergence usually means

∀y ∈ C^r(Ω), lim_{n→∞} yn = y, lim_{n→∞} Tn⁻¹ yn = x,

and in particular sup_{n∈N} ‖Tn⁻¹ yn‖ < ∞. Theorem E.148 then implies sup_{n∈N} ‖Tn⁻¹‖ < ∞, which usually implies some form of numerical stability.

Theorem E.150 (Banach-Steinhaus). Suppose X and Y are Banach spaces. If a sequence (Tn)_{n∈N} ⊂ CL(X, Y) is such that lim_{n→∞} Tn x exists for every x ∈ X, then the map x ↦ lim_{n→∞} Tn x belongs to CL(X, Y).

Proof. Clearly the map T(x) := lim_{n→∞} Tn x is linear; it remains to show that it is continuous. Because the limit lim_{n→∞} Tn x exists, we have sup_{n∈N} ‖Tn x‖ < ∞ for all x ∈ X. Then Theorem E.148 implies sup_{n∈N} ‖Tn‖ < ∞. Hence

∃M ∈ R s.t. ∀x ∈ X, ∀n ∈ N, ‖Tn x‖ ≤ M‖x‖;

the limit of the above yields ∀x ∈ X, ‖Tx‖ ≤ M‖x‖. The proof is completed by Theorem E.96.

E.3 Dual spaces of normed spaces

Definition E.151. The dual space of a normed vector space X over a field F, written X′, is the normed space CL(X, F) equipped with the operator norm. The elements of X′ are called bounded linear functionals.

Theorem E.152 (Isomorphism of (ℓp)′ and ℓq). (ℓp)′ ≃ ℓq, where p and q are conjugate exponents with p ≠ ∞.
Proof. Consider a given T ∈ CL(ℓp, R). Let ek ∈ ℓp denote the sequence (0, · · · , 0, 1, 0, · · · ) with the kth term being 1 and all other terms being 0. Define a function φ : (ℓp)′ → ℓq by

φ(T) = y := (T ek)_{k∈N+};

we need to show y ∈ ℓq.

For p = 1, we have y ∈ ℓ∞ because

‖y‖∞ = sup_{k∈N+} |T ek| ≤ ‖T‖ sup_{k∈N+} ‖ek‖1 = ‖T‖ < ∞.

For p > 1, we have (Σ_{k=1}^{n} |yk|^q)^{1/q} ≤ ‖T‖ for each n > 1 since

Σ_{k=1}^{n} |yk|^q = Σ_{k=1}^{n} |yk|^{q−1} T(ek) sgn(yk)
= T( Σ_{k=1}^{n} |yk|^{q−1} sgn(yk) ek )
≤ ‖T‖ ‖ Σ_{k=1}^{n} |yk|^{q−1} sgn(yk) ek ‖p
= ‖T‖ ( Σ_{k=1}^{n} |yk|^{p(q−1)} )^{1/p} = ‖T‖ ( Σ_{k=1}^{n} |yk|^q )^{1/p}.

Then y ∈ ℓq follows from taking the limit n → ∞.

It is straightforward to show that φ is a continuous linear bijection. The proof is completed by Definition E.138.

E.3.1 The Hahn-Banach theorems

Definition E.153. A sublinear functional on a vector space X is a functional p : X → R that is positive-homogeneous and subadditive, i.e.,

∀x ∈ X, ∀α ∈ [0, ∞), p(αx) = αp(x);
∀x, y ∈ X, p(x + y) ≤ p(x) + p(y).     (E.58)

Example E.154. Norms and semi-norms are both sublinear functionals.

Theorem E.155 (Hahn-Banach theorem for real vector spaces). If a linear functional f : Z → R defined on a subspace Z of a real vector space X is bounded by a sublinear functional p : X → R, i.e., ∀x ∈ Z, f(x) ≤ p(x), then f can be extended to a linear functional T : X → R such that

∀x ∈ Z, T(x) = f(x);  ∀x ∈ X, T(x) ≤ p(x).     (E.59)

Proof. A function g is an extension of f if dom(f) ⊂ dom(g) and ∀x ∈ dom(f), g(x) = f(x); then we write f ≺ g. By Definition A.37, extension is a partial order on the set

Ep(f) := {g extends f : ∀x ∈ dom(g) ⊂ X, g(x) ≤ p(x)}.

For any chain C ⊂ Ep(f), we define ĝ : ∪_{g∈C} dom(g) → R by

∀g ∈ C, ∀x ∈ dom(g), ĝ(x) = g(x),

which is well defined because C being a chain implies that dom(ĝ) = ∪_{g∈C} dom(g) is a vector space and

x ∈ dom(g1) ∩ dom(g2) ⇒ g1(x) = g2(x).

By Definition A.45, ĝ is an upper bound of C with respect to ≺. Then Zorn's lemma (Axiom A.46) implies that Ep(f) has a maximal element T.

We claim that dom(T) = X. Suppose this is not true; then we can choose y1 ≠ 0 in X \ dom(T) and consider the subspace Y1 = span(y1, dom(T)). Any x ∈ Y1 can be uniquely expressed as

x = y + αy1, where y ∈ dom(T), α ∈ R.

Indeed, x = y + αy1 = z + βy1 implies y − z = (β − α)y1 with y − z ∈ dom(T) and y1 ∉ dom(T), and thus we must have y − z = 0 = (β − α)y1. We then extend T to a linear functional g1 : Y1 → R that satisfies

(∗) : g1(x) = g1(y + αy1) = T(y) + αc,

where c is a real constant to be determined later. Since T ≺ g1 and dom(g1) strictly contains dom(T), it suffices to prove that g1 ∈ Ep(f), because this would contradict the conclusion in the first paragraph that T is a maximal element of Ep(f).

It remains to show that there exists some choice of c such that ∀x ∈ dom(g1), g1(x) ≤ p(x). For a, b ∈ dom(T),

T(a) − T(b) = T(a − b) ≤ p(a − b) ≤ p(a + y1) + p(−y1 − b),

which implies

−T(b) − p(−y1 − b) ≤ −T(a) + p(y1 + a).

Since a, b ∈ dom(T) are arbitrary, we can choose any c satisfying

c ≥ sup_{b∈dom(T)} {−T(b) − p(−y1 − b)},
c ≤ inf_{a∈dom(T)} {−T(a) + p(y1 + a)}.

If α = 0 in (∗), we have x = y ∈ dom(T) and thus g1(x) = T(y) ≤ p(x) holds trivially. If α < 0, set b = y/α in the first clause and we have

−T(y/α) − p(−y1 − y/α) ≤ c ⇒ g1(x) = T(y) + αc ≤ p(αy1 + y) = p(x),

where the conclusion follows from multiplying the first inequality by −α > 0 and applying the positive homogeneity in (E.58). If α > 0, set a = y/α in the second clause and we have

−T(y/α) + p(y1 + y/α) ≥ c ⇒ g1(x) = T(y) + αc ≤ p(αy1 + y) = p(x).

Theorem E.156 (Hahn-Banach theorem for complex vector spaces). Let X be a vector space over F (F = C or R) and p : X → R a map satisfying subadditivity (as in Definition E.153) and

∀α ∈ F, ∀x ∈ X, p(αx) = |α|p(x).

If a linear functional f : Z → F defined on a subspace Z ⊂ X is bounded by p, i.e., ∀x ∈ Z, |f(x)| ≤ p(x), then f can be extended to a linear functional T : X → F such that

∀x ∈ Z, T(x) = f(x);  ∀x ∈ X, |T(x)| ≤ p(x).     (E.60)

Proof. First, we prove the simple case of F = R, c.f. Exer- there exists some a ∈ `1 such that ϕa = Λ in Example E.158.
cise B.3. Since F = R yields f (x) ≤ |f | ≤ p(x), Theorem However, this cannot hold for the sequences
E.155 implies that we can extend f (x) to a linear functional
T : X → R with T (x) ≤ p(x). Then (E.60) follows from ek = (0, · · · , 0, 1, 0, · · · ) ∈ `∞

−T (x) = T (−x) ≤ p(−x) = | − 1|p(x) = p(x). where the kth term is 1 and all other terms being 0 because
For F = C, let u(x) be the real part of f (x). Lemma ∀n ∈ N, 0 = Λ(en ) = λ(en ) = ϕa (en ) = an
B.48 states that f (x) = u(x) − iu(ix). Since u(x) is a real
⇒ a = 0 ⇒ Λ = 0,
linear functional satisfying |u(x)| ≤ |f (x)| ≤ p(x), the first
paragraph implies that we can extend u(x) to a real linear 0
but Λ(1, 1, ·, ) = 1 shows that Λ 6= 0. To sum up, (`∞ ) 6' `1
functional Tu : X → R satisfying (E.60). Then we construct ∞ 0 1
because (` ) is “bigger” than ` .
a map Tf : X → C by Tf (x) = Tu (x) − iTu (ix). By Lemma
B.49, Tf is a complex linear functional. Furthermore, the
polar form Tf (x) = |Tf (x)|eiθ yields E.3.2 Bounded linear functionals
|Tf (x)| = Tf (x)e−iθ = Tf (xe−iθ ) = uf (xe−iθ ) Theorem E.160. For any x0 6= 0 of a normed space X,
≤ p(xe −iθ
) = |e −iθ
|p(x) = p(x). there exists a bounded linear functional T on X such that

Theorem E.157 (Hahn-Banach theorem for normed kT k = 1, T (x0 ) = kx0 k. (E.62)


spaces). A bounded linear functional f : Z → C on a sub-
space Z of a normed space X can be extended to another Proof. For the subspace Z = {αx0 : α ∈ F} of X, define a
bounded linear functional T : X → C with the same opera- linear functional f : X → F by
tor norm, i.e., kT kX = kf kZ .
∀x = αx0 , f (x) = αkx0 k
Proof. If Z = {0}, then f = 0 and the statement holds
trivially. Define a function p : X → R by and we have kf k = 1 because

p(x) := kf kZ kxk ∀x ∈ Z, |f (x)| = |f (αx0 )| = |α|kx0 k = kxk.


and we have The rest follows from Theorem E.157.

p(αx) = kf kZ kαxk = |α|p(x);
p(x + y) = kf kZ kx + yk ≤ p(x) + p(y). Corollary E.161. Every x in a normed space X has

By Theorem E.156, f can be extended to another bounded |f (x)|


kxk = sup . (E.63)
linear functional T : X → C with f ∈X 0 ;f 6=0 kf k

|T (x)| ≤ p(x) = kf kZ kxk, Therefore, x0 = 0 if for all f ∈ X 0 we have f (x0 ) = 0.


which, together with Lemma E.110, implies kT kX ≤ kf kZ . Proof. For any fixed x ∈ X,
On the other hand, T is an extension of f and kT kX can be
no less than kf kZ . Hence kT kX = kf kZ . |f (x)| |Tf (x)| kxk
sup ≥ = = kxk,
f ∈X 0 ;f 6=0 kf k kTf k 1
Example E.158. For the subspace c of `∞ consisting of
convergent sequences, define a map λ ∈ c0 = CL(c, R) by
where the first equality follows from Theorem E.160. On the
0
λ(an )n∈N := lim an . (E.61) other hand, |f (x)| ≤ kf kkxk for any f ∈ X , which implies
n→∞
|f (x)|
Then |λa| = | limn→∞ an | ≤ kak∞ and the equality holds sup ≤ kxk.
for a sequence with constant terms. Therefore kλk = 1. f ∈X ;f 6=0 kf k
0

By Theorem E.157, we can extend λ to Λ : `∞ → R with


kΛk = 1. Corollary E.162. For two distinct elements x, y of a
normed space X, there exists a linear functional f ∈ X 0
0
Corollary E.159. (`∞ ) 6' `1 . such that f (x) 6= f (y).
Proof. For any a := (an )n∈N ∈ `1 , we define a linear func- Proof. Set f (x−y) = kx−yk and apply Theorem E.160.
0
tional ϕa ∈ (`∞ ) by
∞ Corollary E.163. Let (X, k · kX ) be a normed space and
x ∈ X a fixed element. Then the map ϕx ∈ X 00 given by
X

∀b ∈ ` , ϕa (b) = an bn .
n=1
∀ψ ∈ X 0 ,
ϕx (ψ) = ψ(x) (E.64)
∞ 0 1
This map a 7→ ϕa is clearly injective. Suppose (` ) ' ` .
Then the map a 7→ ϕa must also be surjective, and hence has the operator norm kϕx k = kxkX .

111
Qinghai Zhang Numerical Analysis 2021

Proof. By (E.64), we have Definition E.169. The Riemann-Stieltjes sum of a func-


tion x : R → R with respect to w ∈ BV [a, b] over a partition
∀ψ ∈ X 0 , kϕx (ψ)k = kψ(x)k ≤ kψkkxkX Pn = {t0 = a, t1 , . . . , tn = b} is
and Lemma E.110 implies kϕx k ≤ kxkX . In fact, we have n
kϕx k = kxkX if x = 0X ; otherwise Theorem E.160 implies
X
RSn (x; w) = x(ti )[w(ti ) − w(ti−1 )]. (E.67)
i=1
∀x ∈ X \ {0}, ∃ψ ∈ CL(X, F) s.t. kψk = 1, ψ(x) = kxkX

and hence Definition E.170. A function x : R → R is Riemann-


Stieltjes integrable on [a, b] with respect to w ∈ BV [a, b] iff
kxkX = kψ(x)k = kϕx (ψ)k ≤ kϕx kkψk = kϕx k,
∃L ∈ R, s.t. ∀ > 0, ∃δ > 0 s.t.
which yields kϕx k = kxkX . ∀Pn (a, b) with h(Pn ) < δ, |RSn (x; w) − L| < . (E.68)
Theorem E.164. For a normed space X, CL(X, Y ) is a
Rb
Banach space if and only if Y is a Banach space. In this case we write L = a x(t)dw(t) and call it the
Riemann-Stieltjes integral of x with respect to w on [a, b].
Proof. The sufficiency has been stated in Lemma E.115, here
we only prove the necessity. Consider a Cauchy sequence
(yn )n∈N ⊂ Y . Define Tn ∈ CL(X, Y ) by Exercise E.171. Show that
Rb
φ(x) • for any given w ∈ BV [a, b], the map x 7→ a x(t)dw(t)
∀n ∈ N, ∀x ∈ X, Tn x := yn ,
φ(x∗ ) is a linear functional on C[a, b];

where x∗ 6= 0X is any fixed nonzero element in X and Rb


φ is the linear functional in Corollary E.162 such that • for any given x ∈ C[a, b], the map w 7→ a x(t)dw(t) is
φ(x∗ ) 6= φ(0X ) = 0. Tn is clearly linear, it is also continuous a linear functional on BV [a, b].
because R
kyn kkφk Lemma E.172.
b
x(t)dw(t) ≤ maxt∈[a,b] |x(t)|var(w).

∀x ∈ X, kTn xk ≤ kxk. a
|φ(x∗ )|
A similar computation leads to Proof. This follows from Definitions E.165 and E.170.
kyn − ym kkφk
∀n, m ∈ N, ∀x ∈ X, kTn − Tm k ≤ ,
|φ(x∗ )| Theorem E.173 (Riesz). Every bounded linear functional
f on C[a, b] is represented by a Riemann-Stieltjes integral,
and thus (Tn )n∈N is Cauchy sequence in CL(X, Y ). The
completeness of CL(X, Y ) implies limn→∞ Tn = T for some Z b
T ∈ CL(X, Y ). Thus ∀x ∈ C[a, b], f (x) = x(t)dw(t), (E.69)
a
∀x ∈ X, lim kTn x − T xk ≤ lim kTn − T kkxk = 0
n→∞ n→∞
where the map w ∈ BV [a, b] satisfies var(w) = kf k∞ .
and ∀x ∈ X, limn→∞ Tn x = T x. In particular, the sequence
(yn = Tn x∗ )n∈N converges to T x∗ ∈ Y . Proof. First, the normed space (C[a, b], k · k∞ ) is a subspace
of (BV [a, b], k · k∞ ). By Theorem E.157, f can be extended
Definition E.165. The total variation of a function to T : BV [a, b] → C such that kT k = kf k . We define
∞ ∞
w : [a, b] → R is
(
n
X 0 if t = a;
var(w) := sup |w(tk ) − w(tk−1 )|, (E.65) w(t) :=
Pn ∈P[a,b]
T (ξt ) otherwise,
k=1

where Pn = {t0 = a, t1 , . . . , tn = b} is a partition of [a, b] as where ξt ∈ BV [a, b] is a charateristic function of [a, t],
in Definition C.62 and P[a, b] is the set of all such partitions.
(
Definition E.166. A function w : [a, b] → R is said to have 1 if y ∈ [a, t];
ξt (y) :=
bounded variation on [a, b] if var(w) is finite on [a, b]. 0 otherwise.
Lemma E.167. Denote by BV [a, b] the set of all functions
of bounded variations on [a, b] with the usual pointwise op- Hence for a partition Pn = {t0 = a, t1 , . . . , tn = b} we have
erations. Then (BV [a, b], k · k) is a normed space where 
ξt = · · · = ξtj−1 = 0,
 0


∀w ∈ BV [a, b], kwk := |w(a)| + var(w). (E.66) ξtj = · · · = ξtn = 1,
(∗) : ∀t ∈ (tj−1 , tj ],
ξ t − ξtj−1 = 1,
 j

Exercise E.168. Prove Lemma E.167.

∀i 6= j, ξti − ξti−1 = 0.

112
Qinghai Zhang Numerical Analysis 2021

Second, we claim that var(w) ≤ kT k∞ . For any n ∈ N, Definition E.170, and limn→∞ T (zn ) = limn→∞ RSn (x; w):
n
X n
X |T (zn ) − RSn (x; w)|

|w(tj ) − w(tj−1 )| = |T (ξt1 )| + T (ξtj ) − T (ξtj−1 )
n
X
j=1 j=2
n
= [x(tj ) − x(tj−1 )] [w(tj ) − w(tj−1 )]
X   j=1
= T (ξt1 )u1 + uj T (ξtj ) − T (ξtj−1 )


j=2 n
X


n
 ≤h(Pn ) [w(tj ) − w(tj−1 )] .
X   j=1
= T ξt1 u1 + uj ξtj − ξtj−1 

j=2
Finally, Lemma E.172 and (E.69) yield
Xn
 
≤ kT k ξt1 u1 +
uj ξtj − ξtj−1 |f (x)| ≤ max |x(t)|var(w) = kxk∞ var(w).
t∈[a,b]

j=2

≤ kT k, Take supremum over all x ∈ C[a, b] of unit norm and we
have kf k∞ ≤ var(w). Then the proof is completed by
where in the second step uj ∈ C is given by kf k∞ ≥ var(w), the conclusion of the second paragraph.

Exercise E.174. For the functional f ∈ (C[a, b])0 given by


 
uj T (ξtj ) − T (ξtj−1 ) = T (ξtj ) − T (ξtj−1 )
f (x) = x(a), find a representation of Riemann-Stieltjes in-
and thus |uj | = 1; the last step follows from the last two tegral and verify (E.69).
clauses of (∗) and kξtj k∞ = 1 for all j = 1, . . . , n.
Corollary E.175. (C[a, b])0 ⊂ BV [a, b].
Third, we prove (E.69). For x ∈ C[a, b] and a sequence
of partitions (Pn )n∈N of [a, b], we define (zn )∞
n=2 by Proof. This follows immediately from Theorem E.173.
n
X   Definition E.176. Let X and Y be normed spaces. The
zn := x(t0 )ξt1 + x(tj−1 ) ξtj − ξtj−1 . dual operator of T ∈ CL(X, Y ) is the bounded linear map
j=2 T 0 ∈ CL(Y 0 , X 0 ) given by

By definition of ξt and (∗), we have zn (a) = x(a) and ∀x ∈ X, ∀ψ ∈ Y 0 , (T 0 ψ)(x) = ψ(T x). (E.70)

∀t ∈ (tj−1 , tj ], zn (t) = x(tj−1 ). Exercise E.177. Check that, in Definition E.176,

• ∀ψ ∈ Y 0 , T 0 ψ ∈ X 0 ;
Hence zn is bounded. Furthermore,
• T 0 ∈ CL(Y 0 , X 0 ).
∀ > 0, ∃δ > 0 s.t. max |tj − tj−1 | < δ ⇒ kzn − xk∞ < .
j Example E.178. A dual operator D0 of the differentiation

Therefore, limn→∞ zn = x. By definition of w, we have D : C 1 [0, 1] → C[0, 1], D(x) = x0 ,


n
X is given by Definition E.176 as
 
T (zn ) = x(t0 )T (ξt1 ) + x(tj−1 ) T (ξtj ) − T (ξtj−1 ) .
j=2 ∀ψ ∈ (C[0, 1])0 , ∀x ∈ C 1 [0, 1],
R1
n
X (D0 ψ)(x) = ψ(Dx) = 0 x0 (t)dwψ ,
= x(tj−1 ) [w(tj ) − w(tj−1 )] .
j=1 where, by Theorem E.173, the map wψ ∈ BV [0, 1] satisfies

Then (E.69) follows from taking the limit n → ∞ of the


Z 1

above equation and applying Lemma E.59, w ∈ BV [a, b], ∀y ∈ C[0, 1], ψ(y) = y(t)dwψ .
0

113
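For Exercise E.174, a natural candidate representer (our construction; verify it against the exercise) is the unit jump w(a) = 0, w(t) = 1 for t > a: the only nonzero increment in (E.67) is then w(t1) − w(t0) = 1, so RSn(x; w) = x(t1) → x(a) as the mesh shrinks, and var(w) = 1 = ‖f‖. A quick numerical check:

```python
import math

def rs_sum(x, w, a, b, n):
    """Riemann-Stieltjes sum (E.67) over the uniform n-piece partition."""
    ts = [a + (b - a) * i / n for i in range(n + 1)]
    return sum(x(ts[i]) * (w(ts[i]) - w(ts[i - 1])) for i in range(1, n + 1))

a, b = 0.0, 1.0
w = lambda t: 0.0 if t <= a else 1.0      # jump of size 1 at t = a
x = lambda t: math.cos(t) + 3.0           # an arbitrary continuous test function

approx = rs_sum(x, w, a, b, 10000)        # = x(t1), close to x(a) = 4.0
```

The sum collapses to x(t1) with t1 = a + (b − a)/n, so the error is governed only by the modulus of continuity of x near a, consistent with Definition E.170.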