Error Analysis of Direct Methods of Matrix Inversion

This document analyzes errors in direct methods of matrix inversion. It derives an upper bound for R, the ratio of the maximum element in the successive reduced matrices to the maximum element of the original matrix, for Gaussian elimination of a general matrix. Specifically, it shows that for Gaussian elimination with complete pivoting for size, R is bounded above by g(n), a function that grows slowly with n; this bounds the growth of elements during the elimination process and so ensures numerical stability. The analysis expresses a priori bounds for the errors in the computed inverses in terms of matrix norms, providing useful indicators of the applicability of the different methods.


Error Analysis of Direct Methods of Matrix Inversion*

J. H. WILKINSON
National Physical Laboratory, Teddington, England

Introduction
1. In order to assess the relative effectiveness of methods of inverting a matrix
it is useful to have a priori bounds for the errors in the computed inverses. In
this paper we determine such error bounds for a number of the most effective
direct methods. To illustrate fully the techniques we have used, some of the
analysis has been done for floating-point computation and some for fixed-point.
In all cases it has been assumed that the computation has been performed using
a precision of t binary places, though it should be appreciated that on a computer
which has both fixed- and floating-point facilities the number of permissible
digits in a fixed-point number is greater than the number of digits in the mantissa
of a floating-point number. The techniques used for analyzing floating-point
computation are essentially those of [8], and a familiarity with that paper is
assumed.
2. The error bounds are most conveniently expressed in terms of vector and
matrix norms, and throughout we have used the Euclidean vector norm and the
spectral matrix norm except when explicit reference is made to the contrary.
For convenience the main properties of these norms are given in Section 9.

In a recent paper [7] we analyzed the effect of the rounding errors made in
the solution of the equations

Ax = b    (2.1)

by Gaussian elimination with interchanges ("pivoting for size"). We showed that
the computed solution was the exact solution of

(A + E)x = b + \delta b    (2.2)

and that we could obtain bounds for E and \delta b provided assumptions were made
about the ratio of the maximum element of the successive reduced matrices to the
maximum element of A. All the methods we discuss in this paper depend on the
successive transformation of the original matrix A into matrices A^{(1)}, A^{(2)}, \cdots,
A^{(k)} such that each A^{(s)} is equivalent to A and the final A^{(k)} is triangular. An
essential requirement in an a priori analysis of any of the methods is a bound
for the quantity R, defined by

R = \max_{i,j,s} | a_{ij}^{(s)} | / \max_{i,j} | a_{ij}^{(1)} |    (s = 1, \cdots, k).    (2.3)

* Received March, 1961.

For methods based upon Gaussian elimination we have been able to obtain a
reasonable upper bound for general unsymmetric matrices only when pivoting
for size is used at each stage. This bound is far from sharp, but does at least
lead to a useable result, and further, the proof shows the factors which limit
the size of R. For a number of classes of matrices which are important in practice,
a much smaller bound is given for R, and this bound holds even when a limited
form of pivoting for size is employed. Finally, we show that for positive definite
matrices R ~ 1, even when pivoting for size is not used.
For comparison with methods of this Gaussian type, we give an analysis of
methods which are based on the reduction of A to triangular form by equivalent
transformations with orthogonal matrices. For these it is easy to show t h a t for
any matrix R < x / n , so t h a t control of size is assured. Because of this limitation
on the size of R, the error bounds obtained for orthogonal reductions of a general
matrix are smaller than those that have been obtained for elimination methods.
However, orthogonal reduction requires considerably more computation than
does elimination, and since in practice the value of R falls so far short of the
bound we obtained it is our opinion that the use of the orthogonal transforma-
tions is seldom, if ever, justified.
3. We do not claim that any of the a priori bounds we have given are in any
way "best possible", and in a number of cases we know that they could have
been reduced somewhat b y a more elaborate argument. We feel, however, that
there is a danger that the essential simplicity of the error analysis m a y be ob-
scured by an excess of detail. In any case the bounds take no account of the
statistical effect and, in practice, this invariably reduces the true error b y a far
more substantial factor than can be achieved by an elaboration of the arguments.
We also adduce arguments which partly explain why it is so often true in practice
t h a t computed inverses are far more accurate than might have been expected
even making allowances for the statistical effect.
All the a priori bounds for the error contain explicitly the factor [[ A -1 ]l • This
means that we will hardly ever be in a position to make use of the results before
obtaining the computed inverse. The nature of the analysis makes it obvious
how to use the computed inverse to obtain an a posteriori upper bound for its
error but the main value of the results is to indicate the range of applicability
of a given method for a given precision in the computation, and they should be
judged in this light.

Upper Bound for R in Gaussian Elimination


4. We derive first an upper bound for R when a general matrix is reduced to
triangular form by Gaussian elimination, selecting as pivotal element at each
stage the element of maximum modulus in the whole of the remaining square
matrix. We refer to this as "complete" pivoting for size, in contrast to the selection
of the maximum element in the leading column at each stage, which we call
"partial" pivoting for size.
We denote the original matrix by A^{(n)} and the reduced matrices by A^{(n-1)},
A^{(n-2)}, \cdots, A^{(1)} (so that A^{(r)} is a matrix of order r) and the modulus of the
pivotal element of A^{(r)} by p_r, so that

| a_{ij}^{(r)} | \le p_r .    (4.1)

Then we have

| \det A^{(r)} | = p_r p_{r-1} \cdots p_1    (r = n, n-1, \cdots, 2).    (4.2)

On the other hand, Hadamard's theorem gives

| \det A^{(r)} | \le [ r^{1/2} p_r ]^r    (r = n, \cdots, 2)    (4.3)

since the length of every column of A^{(r)} is bounded by r^{1/2} p_r. We write

p_r p_{r-1} \cdots p_1 \le [ r^{1/2} p_r ]^r    (r = 2, 3, \cdots, n-1)    (4.4)

but retain

p_n p_{n-1} \cdots p_1 = | \det A^{(n)} | .    (4.5)

Taking logarithms of relations (4.4) and (4.5) and writing

\log p_r = q_r    (4.6)

we have

\sum_{i=1}^{r-1} q_i \le \tfrac{1}{2} r \log r + (r-1) q_r    (r = 2, 3, \cdots, n-1)    (4.7)

\sum_{i=1}^{n} q_i = \log | \det A^{(n)} | .    (4.8)

Dividing (4.7) by r(r-1) for r = 2, 3, \cdots, n-1 and (4.8) by (n-1) and
adding, we have, on observing that

\frac{1}{r(r-1)} + \frac{1}{(r+1)r} + \cdots + \frac{1}{(n-1)(n-2)} = \frac{1}{r-1} - \frac{1}{n-1} ,    (4.9)

q_1 + \frac{q_2}{2} + \cdots + \frac{q_{n-2}}{n-2} + \frac{q_{n-1}}{n-1} + \frac{q_n}{n-1}
    \le \log [ 2^{1} 3^{1/2} 4^{1/3} \cdots (n-1)^{1/(n-2)} ]^{1/2}
        + \frac{1}{n-1} \log | \det A^{(n)} |
        + \frac{q_2}{2} + \frac{q_3}{3} + \cdots + \frac{q_{n-1}}{n-1} ,    (4.10)

giving

q_1 + \frac{q_n}{n-1} \le \log f(n-1) + \frac{1}{n-1} \log | \det A^{(n)} | ,    (4.11)

where

f(s) = [ 2^{1} 3^{1/2} \cdots s^{1/(s-1)} ]^{1/2} .    (4.12)

Substituting for | \det A^{(n)} | from (4.3) (with r = n) we have

q_1 + \frac{q_n}{n-1} \le \log f(n-1) + \frac{n}{2(n-1)} \log n + \frac{n q_n}{n-1}    (4.13)

q_1 - q_n \le \log f(n) + \tfrac{1}{2} \log n .    (4.14)

Hence

p_1 / p_n \le \sqrt{n} \, f(n) = g(n)  (say).    (4.15)

This gives an upper bound for the ratio of the last pivot to the first. In particular,
if p_n \le 1 then p_1 \le g(n).
On the other hand, if we write

l = \max_j \Big[ \sum_{i=1}^{n} ( a_{ij}^{(n)} )^2 \Big]^{1/2}

so that all columns of A^{(n)} have lengths which are bounded by l, then

| \det A^{(n)} | \le l^n  and  p_n \ge l / \sqrt{n} .    (4.16)

Hence from (4.11)

q_1 \le \log f(n-1) + \frac{\log n}{2(n-1)} + \log l = \log f(n) + \log l .    (4.17)

In particular, if l \le 1 we have

p_1 \le f(n) .    (4.18)

The functions f(n) and g(n) increase comparatively slowly and it is easy to
show that

f(n) \sim C n^{\frac{1}{4} \log n} .    (4.19)
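As a numerical aside (not part of the original analysis), the functions f(n) and g(n) of (4.12) and (4.15) are easy to evaluate directly; the sketch below works with logarithms, which keeps the evaluation stable for large n:

```python
import math

def f(n):
    # f(n) = [2^(1/1) * 3^(1/2) * ... * n^(1/(n-1))]^(1/2), equation (4.12)
    return math.exp(0.5 * sum(math.log(s) / (s - 1) for s in range(2, n + 1)))

def g(n):
    # g(n) = sqrt(n) * f(n), the bound on p_1/p_n of (4.15)
    return math.sqrt(n) * f(n)

print(f(10), g(10))   # approximately 6.1 and 19.3
```

The values grow roughly like n^{(1/4) log n}, in agreement with (4.19).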
In Table 1 we show the values of f(n) and g(n) for representative values of n.

TABLE 1

n      10     20     50     100     200      1,000
f(n)   6.1    15     75     330     1,800    250,000
g(n)   19     67     530    3,300   26,000   7,900,000

The two forms of normalization

(I)   1/2 \le \max_{i,j} | a_{ij} | \le 1    (4.20)

(II)  1/2 \le \Big[ \sum_{i=1}^{n} a_{ij}^2 \Big]^{1/2} \le 1    (j = 1, \cdots, n)    (4.21)

will be used throughout this paper and will be called normalization (I) and (II)
respectively. The results (4.15) and (4.18) hold for any matrix normalized in
the appropriate way, and indeed their truth is in no way dependent on the left-
hand inequalities in (4.20) and (4.21). However, in our opinion, pivoting for
size is a reasonable strategy only if all rows and columns of the original matrix
have comparable norms.

For normalization (I) this may be achieved by scaling all rows and columns
so that they contain at least one element satisfying 1/2 \le | a_{ij} | < 1. For normalization
(II) this should be followed by an additional scaling of the columns so
that (4.21) is satisfied. F. L. Bauer has suggested that such matrices should
be termed "equilibrated" matrices. In practice it is unnecessary to ensure exact
equilibration, but merely that there are no rows or columns consisting entirely
of "small" elements. Strictly speaking, equilibration should be performed at
each stage of the reduction before selecting the pivot, but these later equilibrations
are of much less importance.
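A row equilibration of the kind just described can be sketched as follows (an illustration, not the paper's own procedure; scaling by powers of 2 introduces no rounding error in binary arithmetic). Column scaling for normalization (II) would follow the same pattern:

```python
import math

def equilibrate_rows(a):
    """Scale each (non-null) row by a power of 2 so that its largest
    element m satisfies 1/2 <= m < 1, as required for normalization (I)."""
    out = []
    for row in a:
        m = max(abs(x) for x in row)      # assumed non-zero
        _, e = math.frexp(m)              # m = fraction * 2**e with 1/2 <= fraction < 1
        out.append([x / 2.0 ** e for x in row])
    return out

A = [[0.001, 0.25], [40.0, 8.0]]
for row in equilibrate_rows(A):
    assert 0.5 <= max(abs(x) for x in row) < 1.0
```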
5. The inequality (4.15) is certainly not sharp, for the equality could hold
only if equality held in (4.4) for all r. If this were true we could show, exactly
as we have for r = n, that

p_1 / p_r = g(r)    (r = 2, \cdots, n)    (5.1)

and hence

p_{r-1} / p_r = g(r) / g(r-1) = \Big[ \frac{r}{r-1} \cdot r^{1/(r-1)} \Big]^{1/2} .    (5.2)

Except for small r, this ratio is only slightly greater than unity. This means
that the pivotal sequence p_n, p_{n-1}, \cdots increases quite slowly at first. Now
equality in (4.4) can hold only if all elements of A^{(r)} are equal to \pm p_r and all
columns are orthogonal. This is clearly not possible for some values of r, for
example, r = 3. Further, if a_{ij}^{(r)} = \pm p_r then all a_{ij}^{(r-1)} are \pm 2 p_r or zero. Hence
if at any stage the equality sign holds in (4.4) then either p_{r-1}/p_r = 2 or A^{(r-1)} is
null, and in either case (5.2) cannot be satisfied.

These considerations suggest that the least upper bound for the pivots may
be much smaller than the one we have given. In our experience the last pivot
has fallen far below the given bound, and no matrix has been encountered in
practice for which p_1/p_n was as large as 8. Note that we must certainly have

p_r \le p_n \, g(n + 1 - r)    (5.3)

since p_r is the final pivot in the reduction of a matrix of order (n + 1 - r),
though if the maximum is to be approached by p_1 then the initial rate of growth
must be much slower than that corresponding to (5.3).

Positive Definite Matrices


6. There are a number of special types of matrices for which a better bound
can be given for the largest pivot and, surprisingly, none of these requires complete
pivoting for size.

The most important class is that of positive definite matrices. We shall show
that when a positive definite matrix is reduced by Gaussian elimination without
any pivoting for size then no element in any reduced matrix exceeds the maximum
element in the original matrix. It is convenient now to call the original matrix
A^{(1)} and the reduced matrices A^{(2)}, A^{(3)}, \cdots, A^{(n)}, so that A^{(r)} is of order
(n - r + 1).

We first show that A^{(2)} is positive definite.

The multipliers m_{i1} used in the first reduction are defined by

m_{i1} = \frac{ a_{i1}^{(1)} }{ a_{11}^{(1)} } ,    (6.1)

and a_{11}^{(1)} is non-zero, because A^{(1)} is positive definite. The elements of A^{(2)} are
given by

a_{ij}^{(2)} = a_{ij}^{(1)} - m_{i1} a_{1j}^{(1)} = a_{ij}^{(1)} - \frac{ a_{i1}^{(1)} a_{1j}^{(1)} }{ a_{11}^{(1)} } = a_{ji}^{(2)}    (6.2)

from the symmetry of A^{(1)}. Now we have

\sum_{i=1}^{n} \sum_{j=1}^{n} a_{ij}^{(1)} x_i x_j
    = a_{11}^{(1)} \Big( x_1 + \sum_{i=2}^{n} \frac{ a_{i1}^{(1)} }{ a_{11}^{(1)} } x_i \Big)^2
      + \sum_{i=2}^{n} \sum_{j=2}^{n} a_{ij}^{(2)} x_i x_j .    (6.3)

If A^{(2)} is not positive definite there is a non-null set of x_2, x_3, \cdots, x_n such that
\sum \sum a_{ij}^{(2)} x_i x_j \le 0. If we write

x_1 = - \sum_{i=2}^{n} \frac{ a_{i1}^{(1)} }{ a_{11}^{(1)} } x_i ,    (6.4)

then with this x_1, x_2, \cdots, x_n we have

\sum_{i=1}^{n} \sum_{j=1}^{n} a_{ij}^{(1)} x_i x_j \le 0 ,    (6.5)

which is impossible since A^{(1)} is positive definite. We have further

a_{ii}^{(2)} = a_{ii}^{(1)} - \frac{ a_{i1}^{(1)} a_{1i}^{(1)} }{ a_{11}^{(1)} } \le a_{ii}^{(1)} ,    (6.6)

and a_{ii}^{(2)} is positive because A^{(2)} is positive definite. The diagonal terms of A^{(2)}
are therefore not greater than those of A^{(1)} and, since the maximum element of a
positive definite matrix lies on the diagonal,

\max | a_{ij}^{(2)} | \le \max | a_{ij}^{(1)} | .    (6.7)

In the same way it follows that A^{(3)}, \cdots, A^{(n)} are positive definite. In particular,
if | a_{ij}^{(1)} | \le 1 (all i, j), then this is true of all a_{ij}^{(r)}.

The multipliers may be large, as for example for the matrix

A = [ .000,002   .001,300 ]
    [ .001,300   .900,000 ] ,    (6.8)

so Gaussian elimination must either be performed in floating-point arithmetic,
or otherwise it requires the introduction of scale-factors in a manner similar to
that which is normally employed in the back-substitution.
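The statement that no element grows during the elimination of a positive definite matrix is easily checked numerically; the following sketch (illustrative, with an identity shift to keep the test matrix comfortably positive definite) eliminates without pivoting and tracks the largest element of each reduced matrix:

```python
import random

random.seed(1)
n = 8
# A = M M^T + I is symmetric positive definite
M = [[random.uniform(-1, 1) for _ in range(n)] for _ in range(n)]
A = [[sum(M[i][k] * M[j][k] for k in range(n)) for j in range(n)] for i in range(n)]
for i in range(n):
    A[i][i] += 1.0
m0 = max(abs(x) for row in A for x in row)
A = [[x / m0 for x in row] for row in A]        # normalization (I): max element is 1

growth = 1.0
a = [row[:] for row in A]
for k in range(n - 1):                          # Gaussian elimination, no pivoting
    for i in range(k + 1, n):
        m = a[i][k] / a[k][k]
        for j in range(k, n):
            a[i][j] -= m * a[k][j]
    sub = [abs(a[i][j]) for i in range(k + 1, n) for j in range(k + 1, n)]
    growth = max(growth, max(sub))

assert growth <= 1.0 + 1e-12    # no element of any reduced matrix exceeds the original maximum
```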

Triangular Decomposition
7. For the symmetrical Cholesky decomposition of a positive definite symmetric
matrix we may prove an even more satisfactory result. We show that
if A is positive definite and | a_{ij} | \le 1, then there exists a lower triangular
matrix L such that

L L^T = A ,    (7.1)

| l_{ij} | \le 1 .    (7.2)

The proof is by induction. Assume it is true for matrices of orders up to (n - 1).
If A is a positive definite matrix of order n, we may partition it in the form

A = [ B      a      ]
    [ a^T    a_{nn} ]    (7.3)

where B is a positive definite matrix of order (n - 1) and a is a vector of order
(n - 1). By hypothesis there is an L satisfying

L L^T = B ,   | l_{ij} | \le 1 .    (7.4)

If we choose l so that

L l = a ,    (7.5)

which can clearly be done since L is triangular and non-singular, then

[ L      0      ] [ L^T    l      ]   =   [ B      a      ]   =   A ,    (7.6)
[ l^T    l_{nn} ] [ 0      l_{nn} ]       [ a^T    a_{nn} ]

where

l^T l + l_{nn}^2 = a_{nn} .    (7.7)

Taking determinants of both sides,

\det L \, \det L^T \, l_{nn}^2 = (\det L)^2 \, l_{nn}^2 = \det A .    (7.8)

Now \det A is positive since A is positive definite; hence l_{nn}^2 is positive and l_{nn}
is real and may be taken to be positive. If the components of l are denoted by
l_{n1}, l_{n2}, \cdots, l_{n,n-1}, then from equation (7.7)

\sum_{i=1}^{n} l_{ni}^2 = a_{nn} \le 1    (7.9)

and every l_{ni} satisfies | l_{ni} | \le 1. Equations (7.6) and (7.7) now give the appropriate
decomposition of A and all elements of the triangular matrix are between
\pm 1. A consequence of this is that an L L^T decomposition may be performed in
fixed-point arithmetic. Incidentally we have proved that the sum of the squares
of the elements in any row of L is less than unity, and this result is used later.
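The conclusions of this section — that A = L L^T with | l_{ij} | \le 1 and row sums of squares at most unity whenever | a_{ij} | \le 1 — can be illustrated directly; the construction of the test matrix below is ours, not the paper's:

```python
import math, random

random.seed(2)
n = 6
M = [[random.uniform(-1, 1) for _ in range(n)] for _ in range(n)]
A = [[sum(M[i][k] * M[j][k] for k in range(n)) for j in range(n)] for i in range(n)]
for i in range(n):
    A[i][i] += 1.0                         # keep A safely positive definite
m0 = max(abs(x) for row in A for x in row)
A = [[x / m0 for x in row] for row in A]   # now |a_ij| <= 1

L = [[0.0] * n for _ in range(n)]          # Cholesky: A = L L^T
for i in range(n):
    for j in range(i + 1):
        s = A[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
        L[i][j] = math.sqrt(s) if i == j else s / L[j][j]

for i in range(n):
    assert all(abs(L[i][j]) <= 1.0 + 1e-12 for j in range(n))
    # row sum of squares equals a_ii, hence is at most 1
    assert sum(L[i][j] ** 2 for j in range(n)) <= 1.0 + 1e-12
```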

Other Matrices for which R < g(n)


8. Matrices of Hessenberg form are of particular interest in connection with
the eigenvalue problem. We consider now the reduction of an upper Hessenberg
matrix with normalization (I), using "partial" pivoting for size. At each stage
there are only two elements in the leading column of the reduced matrix and
one of these is in a row which has not yet been modified. Since even partial
pivoting for size ensures that the multipliers are between \pm 1, it is easy to see
that the maximum element in the rth reduced matrix lies between \pm r, so that
the maximum pivotal value is n and this can be achieved only by the last pivot.
Further, if we call the final triangular matrix U, and M is the matrix by which A
has been multiplied to give U, so that

M A = U ,    (8.1)

it is evident that M is lower triangular and all its elements are less than or equal
to unity in modulus. If we write, in the usual way,

A = L U    (8.2)

then L = M^{-1} and we have shown that all elements of L^{-1} lie between \pm 1.

A special type of Hessenberg matrix is a triple-diagonal form. By much the
same argument it is evident that the maximum elements of U now lie between
\pm 2 and again all elements of L^{-1} lie between \pm 1. Further, \sum_i | u_{ij} | \le 3.
Finally we consider matrices with dominating diagonal elements, such that

| a_{kk}^{(1)} | > \sum_{i \ne k} | a_{ik}^{(1)} |    (k = 1, 2, \cdots, n).    (8.3)

If we perform elimination without pivoting for size, then we have typically, in
the first stage,

\sum_{i=2}^{n} | m_{i1} | = \sum_{i=2}^{n} | a_{i1}^{(1)} | / | a_{11}^{(1)} | < 1 .    (8.4)

Hence

\sum_{i=2}^{n} | a_{ij}^{(2)} | \le \sum_{i=2}^{n} [ \, | a_{ij}^{(1)} | + | m_{i1} | | a_{1j}^{(1)} | \, ]
    \le \sum_{i=2}^{n} | a_{ij}^{(1)} | + | a_{1j}^{(1)} | \sum_{i=2}^{n} | m_{i1} |    (8.5)
    \le \sum_{i=2}^{n} | a_{ij}^{(1)} | + | a_{1j}^{(1)} | = \sum_{i=1}^{n} | a_{ij}^{(1)} | .

Further,

| a_{jj}^{(2)} | \ge | a_{jj}^{(1)} | - | m_{j1} | | a_{1j}^{(1)} |
    > \sum_{i \ge 2, i \ne j} | a_{ij}^{(1)} | + ( 1 - | m_{j1} | ) | a_{1j}^{(1)} |    (8.6)
    \ge \sum_{i \ge 2, i \ne j} [ \, | a_{ij}^{(1)} | + | m_{i1} | | a_{1j}^{(1)} | \, ] \ge \sum_{i \ge 2, i \ne j} | a_{ij}^{(2)} | .

The diagonal terms therefore dominate in exactly the same way in all A^{(r)} and,
from (8.5), the sum of the moduli of the elements of any column of A^{(r)} decreases
as r increases. Hence

\max_{i,j} | a_{ij}^{(r)} | \le \max_j \sum_i | a_{ij}^{(r)} | \le \max_j \sum_i | a_{ij}^{(1)} |
    \le \max_j 2 | a_{jj}^{(1)} | \le 2 \max_{i,j} | a_{ij}^{(1)} | .    (8.7)

Hence the ratio, R, is certainly less than 2. By similar arguments we see that if
we write the reduction in the form

M A = U  or  L U = A ,    (8.8)

then all elements of M (or L^{-1}) are less than unity.
Even when the diagonal elements are not strictly dominant, behavior of this
last type is very common. For example, matrices of the form C_n, where

C_4 = [ a  b  b  b ]
      [ b  a  b  b ]    (a > b > 0),    (8.9)
      [ b  b  a  b ]
      [ b  b  b  a ]

retain this character during Gaussian elimination and the diagonals become
progressively more dominant even when (8.3) is not satisfied at any stage.

Many matrices which are encountered in practice do not belong strictly to
any of the special classes we have mentioned, but nevertheless resemble them in
character, and it is our experience that when "complete" pivoting for size is
used, the pivotal sequence, even if not monotonic decreasing, usually has a
general downward trend. For ill-conditioned matrices this is often very marked.
This is not surprising since, as is shown by (4.11), the last pivot p_1 satisfies

\frac{p_1}{p_n} \le f(n-1) \Big[ \frac{ | \det A^{(n)} | }{ p_n^{\,n} } \Big]^{1/(n-1)}    (8.10)

and | \det A^{(n)} | will usually be much smaller than p_n^{\,n} if A^{(n)} is ill-conditioned.
The behavior of the special matrices may give the impression that "partial"
pivoting for size is as effective as "complete" pivoting. The following example
shows that it may not be adequate. Consider matrices B_n of the type

B_5 = [ +1   0   0   0  +1 ]
      [ -1  +1   0   0  +1 ]
      [ -1  -1  +1   0  +1 ]
      [ -1  -1  -1  +1  +1 ]
      [ -1  -1  -1  -1  +1 ] .

If we use partial pivoting, then the pivot is always in the top left-hand corner
and the pivotal sequence is +1, +1, \cdots, +1, 2^{n-1}. If complete pivoting is used
the first pivot is +1 and all the later pivots are of modulus 2. We show later that we can modify
B_n in such a way as to give a very well-conditioned matrix for which partial pivoting
for size gives a completely inaccurate inverse. Although complete pivoting
is not always the best strategy, we have not been able to construct an example
for which it is a very bad strategy and, in any case, the analysis of Section 16
gives an upper bound for the error.

The example also illustrates the importance of equilibration. If the last column
of B_n is multiplied by 2^{-n} then partial pivoting gives the same pivotal selection
as complete, with disastrous results. (See Section 29.)
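The matrices B_n above (unit diagonal, -1 below the diagonal, +1 in the last column, as reconstructed here) make the contrast between the two strategies easy to demonstrate; growth of the largest element stands in for the pivotal ratio R:

```python
def growth(a, complete):
    """Largest |element| appearing during Gaussian elimination with
    complete (whole submatrix) or partial (leading column) pivoting."""
    n = len(a)
    a = [row[:] for row in a]
    biggest = max(abs(x) for row in a for x in row)
    for k in range(n - 1):
        rows = range(k, n)
        cols = range(k, n) if complete else range(k, k + 1)
        p, q = max(((i, j) for i in rows for j in cols),
                   key=lambda t: abs(a[t[0]][t[1]]))
        a[k], a[p] = a[p], a[k]                    # row interchange
        for row in a:                              # column interchange
            row[k], row[q] = row[q], row[k]
        for i in range(k + 1, n):
            m = a[i][k] / a[k][k]
            for j in range(k, n):
                a[i][j] -= m * a[k][j]
        biggest = max(biggest,
                      max(abs(a[i][j]) for i in range(k, n) for j in range(k, n)))
    return biggest

n = 10
B = [[1.0 if i == j or j == n - 1 else (-1.0 if i > j else 0.0)
      for j in range(n)] for i in range(n)]
print(growth(B, complete=False))   # 512.0, i.e. 2**(n-1) for n = 10
print(growth(B, complete=True))    # 2.0
```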

Some Fundamental Properties of the Spectral Norm


9. A number of results will be required repeatedly and it is convenient to
assemble them here. We give first the properties of the spectral norm which we
have used [9]. We assume that A is real throughout.

(A) Definition: \| A \| = \max [ \lambda(A A^T) ]^{1/2}.
(B) \| A \| = \max \| A x \| / \| x \|, \| x \| \ne 0, and hence \| A x \| \le \| A \| \| x \|.
(C) \| A \| satisfies the usual relations for matrix norms:
    (i) \| k A \| = | k | \, \| A \|
    (ii) \| A + B \| \le \| A \| + \| B \|, and hence
    (iii) \| A - B \| \ge | \, \| A \| - \| B \| \, |
    (iv) \| A B \| \le \| A \| \, \| B \|.
(D) If A^{(r)} is any principal submatrix of A, then \| A^{(r)} \| \le \| A \| (from (B)).
(E) \| a b^T \| = \| a \| \, \| b \| (from (B)).
(F) If A is symmetric, then \| A \| = \max | \lambda_i(A) |.
(G) If A is positive definite, \| A \| = \max \lambda_i(A) = \lambda_1,
    \| A^{-1} \| = \max 1/\lambda_i(A) = 1/\lambda_n.
(H) \| A \| = \| A^T \|.
(I) \| A A^T \| = \| A \|^2.
(J) n \max_{i,j} | a_{ij} | \ge \| A \| \ge \max_{i,j} | a_{ij} |.
(K) \| A \| \le \| A \|_E = [ \sum a_{ij}^2 ]^{1/2}.
    For \max \lambda(A A^T) \le \mathrm{trace}(A A^T) = \| A \|_E^2.
(L) \| A \| \ge n^{-1/2} \| A \|_E. For \max \lambda(A A^T) \ge \frac{1}{n} \mathrm{trace}(A A^T) = \frac{1}{n} \| A \|_E^2.
(M) \| A \| \le \| \, | A | \, \| where | A | has elements | a_{ij} |. This follows from (B).
(N) If | A | \le B, then \| A \| \le \| \, | A | \, \| \le \| B \|.
(O) \| \, | A | \, \| \le n^{1/2} \| A \|. For \| A \| \ge n^{-1/2} \| A \|_E and \| \, | A | \, \| \le \| \, | A | \, \|_E = \| A \|_E.

The following results are fundamental in the error analysis and will be used
repeatedly.

(P) \| A \| \ge \max | \lambda_i(A) | (from (B)).
(Q) If \| A \| < 1, then I + A is non-singular. For

\lambda_i(I + A) = 1 + \lambda_i(A),  | \lambda_i(I + A) | \ge 1 - | \lambda_i(A) | \ge 1 - \| A \| > 0.    (9.1)

Hence \det(I + A) = \prod \lambda_i(I + A) \ne 0.    (9.2)

(R) If \| A \| < 1, then

(i) \| (I + A)^{-1} \| \le 1 + \frac{ \| A \| }{ 1 - \| A \| } = \frac{1}{ 1 - \| A \| } ;

(ii) \| (I + A)^{-1} - I \| \le \frac{ \| A \| }{ 1 - \| A \| } ;

(iii) \| (I + A)^{-1} \| \ge 1 - \frac{ \| A \| }{ 1 - \| A \| } .

These results follow from the identity

I = (I + A)^{-1} (I + A),    (9.3)

the existence of (I + A)^{-1} following from (Q). Writing (I + A)^{-1} = R,

I = R + R A    (9.4)

1 = \| I \| \ge \| R \| - \| R A \| \ge \| R \| - \| R \| \, \| A \|    (9.5)

which gives (i), since (1 - \| A \|) is positive. (9.4) then gives

\| I - R \| = \| R A \| \le \| R \| \, \| A \|    (9.6)

which gives (ii); from (ii) we deduce

| \, \| I \| - \| R \| \, | \le \frac{ \| A \| }{ 1 - \| A \| }    (9.7)

which gives (iii).
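Results (Q) and (R) can be verified on a small symmetric example, where by (F) the spectral norm is just the largest |eigenvalue|; a sketch:

```python
import math

def eigs2(a, b, c):
    """Eigenvalues of the symmetric 2x2 matrix [[a, b], [b, c]]."""
    m, d = (a + c) / 2.0, math.sqrt(((a - c) / 2.0) ** 2 + b * b)
    return m - d, m + d

a, b, c = 0.2, 0.1, 0.3
lo, hi = eigs2(a, b, c)
normA = max(abs(lo), abs(hi))          # ||A|| ~ 0.362 < 1, so (Q) applies
# I + A is symmetric with eigenvalues 1 + lambda_i, so by (F):
norm_inv = max(abs(1.0 / (1.0 + lo)), abs(1.0 / (1.0 + hi)))
norm_inv_minus_I = max(abs(1.0 / (1.0 + lo) - 1.0), abs(1.0 / (1.0 + hi) - 1.0))

bound = normA / (1.0 - normA)
assert norm_inv <= 1.0 + bound         # (R)(i)
assert norm_inv_minus_I <= bound       # (R)(ii)
assert norm_inv >= 1.0 - bound         # (R)(iii)
```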

(S) If A is non-singular and \| A^{-1} E \| < 1, then (A + E) is non-singular and we have

(i) \| (A + E)^{-1} - A^{-1} \| \le \| A^{-1} \| \frac{ \| F \| }{ 1 - \| F \| } = k \| A^{-1} \|  (say),

and hence

(ii) (1 - k) \| A^{-1} \| \le \| (A + E)^{-1} \| \le (1 + k) \| A^{-1} \|

where F = A^{-1} E, so that \| F \| < 1.

The proof is as follows. We may write

(A + E) = A (I + A^{-1} E) = A (I + F).    (9.8)

Hence

(A + E)^{-1} = (I + F)^{-1} A^{-1},    (9.9)

the existence of (I + F)^{-1} following from (Q). This gives

(A + E)^{-1} - A^{-1} = [ (I + F)^{-1} - I ] A^{-1}    (9.10)

\| (A + E)^{-1} - A^{-1} \| \le \frac{ \| A^{-1} \| \, \| F \| }{ 1 - \| F \| } .    (9.11)

This gives (i), and (ii) is an immediate consequence.

In practical applications, E will be a matrix of rounding errors and the result
will be of value only if (A + E)^{-1} can be regarded as a reasonable approximation
to A^{-1}. Often we will not know anything special about the product A^{-1} E and
will have to majorize F, using the inequality

\| F \| \le \| A^{-1} \| \, \| E \| .    (9.12)

The necessity for doing this may well result in our estimate for
\| (A + E)^{-1} - A^{-1} \|
being far above its true value. When the substitution (9.12) is made, (i) becomes

\| (A + E)^{-1} - A^{-1} \| / \| A^{-1} \| \le \frac{ \| A^{-1} \| \, \| E \| }{ 1 - \| A^{-1} \| \, \| E \| } .    (9.13)

We shall refer to the expression on the left as the relative error in (A + E)^{-1}.
Note that the bound we have obtained for this relative error contains \| A^{-1} \|
itself as a factor, while the bound for the absolute error contains \| A^{-1} \|^2.
Whether this represents the true state of affairs depends critically upon whether
(9.12) is a reasonably sharp inequality.

The example in Table 2 illustrates the last point. We give there an ill-conditioned
matrix A, and show the effect on the inverse of two different perturbations,
E_1 and E_2, in A. These perturbations are of the same order of magnitude, but E_1
results in a perturbation in the inverse which is of order \| A^{-1} \|^2 \| E_1 \| and the
second, in a perturbation of order \| A^{-1} \| \, \| E_2 \|. Perturbations E will behave in
the same way as E_2 whenever E can be expressed in the form

E = A G  or  E = G A    (9.14)

where G is of the same order of magnitude as E. We shall see later that such
perturbations are encountered in the error analysis. In applications of (S) we
will normally restrict ourselves ab initio by some such condition as

\| A^{-1} \| \, \| E \| < 0.1

since we will not otherwise be able to prove that the computed inverse is a reasonable
approximation to the true inverse.

TABLE 2

A = [ 1.00000 00000    .90000 00000 ]     A^{-1} = 10^9 [ -.35000 00000   +.45000 00000 ]
    [  .77777 77800    .70000 00000 ]                   [ +.38888 88900   -.50000 00000 ]

E_1 = 10^{-10} [ 0  0 ]      (A + E_1)^{-1} - A^{-1} = 10^8 [ -.18421 0526   +.23684 2115 ]
               [ 0  1 ]                                     [ +.20467 8363   -.26315 7895 ]

E_2 = 10^{-10} A             (A + E_2)^{-1} - A^{-1} = 10^{-1} [ +.35000 00000   -.45000 00000 ]
                                                               [ -.38888 88900   +.50000 00000 ]

With this condition, (i) and (ii) become

\| (A + E)^{-1} - A^{-1} \| / \| A^{-1} \| \le \tfrac{10}{9} \| A^{-1} E \|    (9.15)

[ 1 - \tfrac{10}{9} \| A^{-1} E \| ] \| A^{-1} \| \le \| (A + E)^{-1} \| \le [ 1 + \tfrac{10}{9} \| A^{-1} E \| ] \| A^{-1} \| .    (9.16)

(T) The final result will be used in some form in nearly every analysis. If A is non-singular
and

A Y - I = P + Q + R ;  \| P \| \le a \| Y \|,  \| Q \| \le b \| A^{-1} \|,  \| R \| \le c,    (9.17)

where a, b and c are scalars, then if a \| A^{-1} \| < 1,

(i) \| A^{-1} \| [ 1 - b \| A^{-1} \| - c ] / [ 1 + a \| A^{-1} \| ]
        \le \| Y \| \le \| A^{-1} \| [ 1 + b \| A^{-1} \| + c ] / [ 1 - a \| A^{-1} \| ],

and if the right-hand side is denoted by k \| A^{-1} \|,

(ii) \| A Y - I \| \le (a k + b) \| A^{-1} \| + c,

(iii) \| Y - A^{-1} \| / \| A^{-1} \| \le \| A^{-1} \| (a k + b) + c.

These follow by writing (9.17) in the form

Y - A^{-1} = A^{-1} [ P + Q + R ]    (9.18)

and taking norms of both sides. Provided a \| A^{-1} \| and (b \| A^{-1} \| + c) are appreciably
less than unity, (i) and (iii) show that Y is a good approximation to A^{-1}.
In applications P is always present in (9.17) but Q or R or both may be absent.
Again we stress that the result will be pessimistic if \| A^{-1} \| \, \| P \| is a poor
approximation to \| A^{-1} P \|, and this will happen if P may be expressed in the
form A S where S is of the same order of magnitude as P (a normalized A is,
of course, implicit in this remark).

Solution of Triangular Sets of Equations


10. All the methods of inversion which we analyze require the solution of sets
of equations with a triangular matrix of coefficients. The solution of these sets
in fixed-point arithmetic will usually demand some form of scaling, so we consider
floating-point solution first. Take an upper triangular set of equations

U x = b ;    (10.1)

typically, x_r is calculated from the computed x_{r+1}, \cdots, x_n using the relation

x_r = fl \Big( \frac{ - u_{r,r+1} x_{r+1} - u_{r,r+2} x_{r+2} - \cdots - u_{r,n} x_n + b_r }{ u_{rr} } \Big)    (10.2)

in the notation of [8]. From the result given there for an inner product, we have

u_{rr} x_r (1 \pm \theta 2^{-t}) + u_{r,r+1} x_{r+1} (1 \pm \theta (n+1-r) 2^{-t})
    + u_{r,r+2} x_{r+2} (1 \pm \theta (n+1-r) 2^{-t}) + \cdots
    + u_{r,n-1} x_{n-1} (1 \pm \theta 4 \cdot 2^{-t}) + u_{r,n} x_n (1 \pm \theta 3 \cdot 2^{-t})    (10.3)
    = b_r (1 \pm \theta 2^{-t})

where each \theta denotes a value, ordinarily different in each factor, satisfying
| \theta | \le 1.
Hence the computed solution satisfies

U x + \delta U \, x = b + \delta b    (10.4)

where, typically for n = 5,

| \delta b | \le 2^{-t} [ \, | b_1 |, | b_2 |, | b_3 |, | b_4 |, 0 \, ]^T \le 2^{-t} | b |    (10.5)

| \delta U | \le 2^{-t} [ | u_{11} |  5 | u_{12} |  5 | u_{13} |  4 | u_{14} |  3 | u_{15} | ]
               [            | u_{22} |  4 | u_{23} |  4 | u_{24} |  3 | u_{25} | ]
               [                        | u_{33} |  3 | u_{34} |  3 | u_{35} | ]   = 2^{-t} V  (say),    (10.6)
               [                                    | u_{44} |  2 | u_{45} | ]
               [                                                | u_{55} | ]

and we may write this in the form

| U x - b | \le 2^{-t} [ \, | b | + V | x | \, ] .    (10.7)

For the solution of the n sets of equations

U X = B    (10.8)

we have, for the computed X,

| U X - B | \le 2^{-t} [ \, | B | + V | X | \, ]    (10.9)

and hence

\| U X - B \| \le \| \, | U X - B | \, \| \le 2^{-t} \| \, | B | \, \| + 2^{-t} \| V \| \, \| \, | X | \, \|
    \le 2^{-t} n^{1/2} \| B \| + 2^{-t} n^{1/2} \| V \| \, \| X \| .    (10.10)

If we merely assume that U has normalization (I) then we can obtain an upper
bound for V by replacing all u_{ij} by unity. We have then

\| V \| \le \| V \|_E \le [ \, n \cdot 3^2 + (n-1) 4^2 + \cdots + 2 (n+1)^2 + (n+2)^2 \, ]^{1/2}    (10.11)

where we have replaced all multiples in the rth column of V by (n + 3 - r).
We deduce that \| V \|_E \sim n^2 / \sqrt{12} and in fact

\| V \| < 0.4 n^2    (n \ge 10).    (10.12)

(Here and elsewhere the result holding when n < 10 is not catastrophically
worse than that for n \ge 10 but we are not much concerned with low order
matrices.) Hence

\| U X - B \| \le 2^{-t} [ \, n^{1/2} \| B \| + 0.4 n^{5/2} \| X \| \, ] .    (10.13)
When B = I we find that the computed X is exactly triangular and no term
corresponding to \| B \| is now present in (10.13). This is because the rth column
of the inverse is the solution of U x = e_r, and the error corresponding to \delta b in
(10.5) is therefore the null vector. We do not require the whole of the majorizing
matrix, V, except for the solution corresponding to e_1, but we shall not attempt
to take this into account. We have, therefore, for the computed solution of
U X = I,

\| U X - I \| \le 0.4 \cdot 2^{-t} n^{5/2} \| X \|    (10.14)

and from (T) of Section 9,

\| U^{-1} \| / ( 1 + 0.4 \cdot 2^{-t} n^{5/2} \| U^{-1} \| )
    \le \| X \| \le \| U^{-1} \| / ( 1 - 0.4 \cdot 2^{-t} n^{5/2} \| U^{-1} \| )    (10.15)

\| X - U^{-1} \| \le 0.4 \cdot 2^{-t} n^{5/2} \| U^{-1} \|^2 / ( 1 - 0.4 \cdot 2^{-t} n^{5/2} \| U^{-1} \| ).    (10.16)

We can guarantee that X is a good inverse only if 0.4 \cdot 2^{-t} n^{5/2} \| U^{-1} \| is appreciably
less than unity.

If on the other hand we know (as we often shall) that

\sum_i u_{ij}^2 \le 1  (all j)  or  \sum_j u_{ij}^2 \le 1  (all i),

then we have

\| V \| \le \| V \|_E \le [ \, 1 + 3^2 + 4^2 + \cdots + (n+1)^2 \, ]^{1/2} \le 0.6 n^{3/2}    (n \ge 10).    (10.17)

The result for the inverse is

\| X - U^{-1} \| \le 0.6 \cdot 2^{-t} n^{3/2} \| U^{-1} \|^2 / ( 1 - 0.6 \cdot 2^{-t} n^{3/2} \| U^{-1} \| ).    (10.18)
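The bound (10.14) can be exercised in IEEE double precision, for which t = 53. The check below substitutes Frobenius norms for spectral norms (a slight loosening on both sides of the inequality) and uses a random, well-conditioned U with normalization (I); it is an illustration, not a proof:

```python
import random, math

random.seed(3)
n = 20
t = 53                                   # IEEE double: 53-bit mantissa
U = [[0.0] * n for _ in range(n)]
for i in range(n):
    U[i][i] = random.uniform(0.5, 1.0)   # all |u_ij| <= 1: normalization (I)
    for j in range(i + 1, n):
        U[i][j] = 0.2 * random.uniform(-1.0, 1.0)

# back-substitution for U X = I, column by column
X = [[0.0] * n for _ in range(n)]
for col in range(n):
    for r in range(n - 1, -1, -1):
        s = (1.0 if r == col else 0.0) - sum(U[r][j] * X[j][col] for j in range(r + 1, n))
        X[r][col] = s / U[r][r]

def fro(M):                              # Frobenius norm ||.||_E
    return math.sqrt(sum(x * x for row in M for x in row))

R = [[sum(U[i][k] * X[k][j] for k in range(n)) - (1.0 if i == j else 0.0)
      for j in range(n)] for i in range(n)]
bound = 0.4 * 2.0 ** (-t) * n ** 2.5 * fro(X)    # cf. (10.14)
assert fro(R) <= bound
```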
11. The corresponding results for fixed-point arithmetic are rather less definite
because they depend on the details of the arithmetic facilities and on whether
we have a priori information on the size of the solution. For the sake of definiteness
we assume that scalar products can be accumulated exactly in a double-precision
accumulator and that division into this double accumulator is possible.
These facilities are provided on all desk machines and many high speed computers.

If we know in advance that the computed solution lies between 2^{k-1} and 2^k,
then we may write

x = 2^k y    (11.1)

and solve

U y = 2^{-k} b .    (11.2)

The variable y_r is determined from the computed y_{r+1}, \cdots, y_n by the relation

y_r = fi \Big[ \frac{ 2^{-k} b_r - u_{r,r+1} y_{r+1} - \cdots - u_{r,n} y_n }{ u_{rr} } \Big]    (11.3)
296 s. ItI. WILKINSON

where fi means that we take the fixed-point result of evaluating the expression in brackets. The numerator is accumulated exactly and the only error is the rounding error in the division. (We have tacitly assumed that intermediate values of the scalar product do not require the introduction of a larger scale factor than is necessary for the final value. On ACE this difficulty is avoided quite simply by accumulating exactly 2^{-p} times the scalar product, where 2^p is greater than the largest order matrix that can be dealt with. Overspill in the scalar product cannot therefore occur, and the rounding errors introduced, 2^{p-2t}, are far below the level of the others. We shall not refer to this point again.) Hence

\[ Uy \equiv 2^{-k} b + \delta b \tag{11.4} \]

\[ |\delta b_r| \le |u_{rr}|\, 2^{-t-1} \le 2^{-t-1} \tag{11.5} \]

giving

\[ Ux = b + 2^k\, \delta b. \tag{11.6} \]
Applied to the solution of UX = B, we obtain an X satisfying

\[ UX \equiv B + C \tag{11.7} \]

where

\[ |c_{rs}| \le 2^{-t} \max_{i,j} |x_{ij}| \le 2^{-t} \|X\| \tag{11.8} \]

\[ \|C\| \le n\,2^{-t} \|X\|. \tag{11.9} \]

When B = I, we have

\[ UX = I + C \tag{11.10} \]

where (11.9) is satisfied, giving

\[ \|X - U^{-1}\| \le n\,2^{-t} \|U^{-1}\|^2 \big/ \bigl(1 - n\,2^{-t} \|U^{-1}\|\bigr). \tag{11.11} \]

In this we have tacitly assumed that k \ge 0. If k < 0 no scale factor is necessary and (11.9) becomes \|C\| \le n\,2^{-t-1}. For the inversion problem k must necessarily satisfy k \ge 0, because one element of X is 1/u_{nn} and this is not less than unity. The strictest application of (11.5) gives

\[ UX = I + C; \qquad |c_{rs}| \le 2^{k-t-1}\, |u_{rr}|. \tag{11.12} \]
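The scheme of (11.1)-(11.6) can be simulated exactly with rational arithmetic: the scalar product in (11.3) is accumulated without error, and a single rounding to t binary places occurs at the division. The matrix, right-hand side and t below are illustrative choices of this sketch, not taken from the paper.

```python
from fractions import Fraction

T = 20                                   # t binary digits after the point

def fx(v):
    # fixed-point rounding: nearest multiple of 2^-T
    return Fraction(round(v * 2**T), 2**T)

def solve_scaled(U, b, k):
    # Back-substitution (11.3): solve Uy = 2^-k b with an exactly
    # accumulated numerator and one rounding per division, then x = 2^k y.
    n = len(b)
    y = [Fraction(0)] * n
    for r in range(n - 1, -1, -1):
        num = Fraction(b[r], 2**k) - sum(U[r][j] * y[j] for j in range(r + 1, n))
        y[r] = fx(num / U[r][r])
    return [2**k * yr for yr in y]

F = Fraction
U = [[F(3, 4), F(-1, 2), F(1, 4)],
     [F(0),    F(1, 2),  F(1, 3)],
     [F(0),    F(0),     F(1, 4)]]
b = [0, 0, 1]
k = 3                                    # solution known to lie below 2^k = 8
x = solve_scaled(U, b, k)

# Row r of Ux - b equals 2^k * u_rr * (rounding error of that division),
# so by (11.5)-(11.6) each component is at most 2^(k-t-1) in magnitude.
residuals = [sum(U[r][j] * x[j] for j in range(3)) - b[r] for r in range(3)]
```

Because the arithmetic is exact rationals, the bound 2^{k-t-1} of (11.12) holds as a strict algebraic fact rather than merely being observed.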
12. When the scale factor is not known in advance, then the elements of X are scaled when necessary as the computation proceeds. If the first computed element of X requires the largest scale factor that is necessary for the whole computation, then the analysis is precisely that we have just given. If complete pivoting for size has been used this will commonly be the case. Otherwise the situation will be this. Suppose that at the stage when x_r is about to be computed the current scale factor is 2^{k_{r+1}}. We may denote the current scaled values of the

variables x_i by x_i^{(r+1)} (i = r+1, \cdots, n), so that

\[ x_i^{(r+1)} = 2^{k_{r+1}}\, y_i^{(r+1)}; \qquad |y_i^{(r+1)}| < 1 \quad (i = r+1, \cdots, n), \qquad |y_i^{(r+1)}| \ge \tfrac{1}{2} \ \text{for at least one } i. \tag{12.1} \]

The next step is the computation of x_r^{(r)} with the introduction of an increased scale factor 2^{k_r} if necessary, and we have

\[ u_{rr} x_r^{(r)} + u_{r,r+1} x_{r+1}^{(r+1)} + \cdots + u_{rn} x_n^{(r+1)} \equiv b_r + \delta_r, \qquad |\delta_r| \le 2^{k_r - t - 1}. \tag{12.2} \]

Now it may be necessary to introduce higher scale factors at any or all of the subsequent stages, so that the final values x_r^{(1)}, \cdots, x_n^{(1)} of these variables are obtained by rounding off x_r, x_{r+1}, x_{r+2}, \cdots, x_n to progressively fewer figures. The worst that can have happened is that (k_1 - k_i) successive roundings have taken place. If then we write

\[ x_i^{(s)} - x_i^{(1)} = \epsilon_{i,s}, \tag{12.3} \]

then

\[ |\epsilon_{i,s}| \le 2^{-t}\,[2^{k_1} - 2^{k_i}] < 2^{k_1 - t}. \tag{12.4} \]

The final values therefore satisfy

\[ u_{rr} x_r^{(1)} + u_{r,r+1} x_{r+1}^{(1)} + \cdots + u_{rn} x_n^{(1)} \equiv b_r + \delta_r - u_{rr}\epsilon_{r,r} - u_{r,r+1}\epsilon_{r+1,r+1} - u_{r,r+2}\epsilon_{r+2,r+1} - \cdots - u_{rn}\epsilon_{n,r+1}. \tag{12.5} \]

The error term on the right is bounded by (n + 1 - r)\,2^{k_1 - t}. This is worse by the factor (n + 1 - r) than was obtained when the final scale factor 2^{k_1} was used throughout. In spite of this, it is our opinion that the progressive introduction of the scale factor usually gives the more accurate result, particularly for very ill-conditioned matrices. We may illustrate this by considering the inversion of the matrix A, below, working to five figures.

\[ A = \begin{bmatrix} +.00311 & -.23456 & -.34567 \\ & +.00112 & -.12345 \\ & & +.00113 \end{bmatrix} \]

When calculating the last column of the inverse with progressive scale factors we obtain 884.96 for the last element in the first instance. This is subsequently rounded first to 885 and then to 9 \times 10^2, the final column being 10^2\,[74552,\ 975,\ 9]. If we introduce the correct scale factor from the start, we have for the computed final column 10^2\,[75818,\ 992,\ 9]. It is obvious that the former answer is the better, though it has residuals which are about 100 times as large as those of the latter.
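The five-figure example can be checked numerically. The two computed columns below are transcribed from the text, and a double-precision solution plays the role of the exact answer; this is a verification sketch, not a reproduction of the original five-figure computation.

```python
import numpy as np

U = np.array([[0.00311, -0.23456, -0.34567],
              [0.0,      0.00112, -0.12345],
              [0.0,      0.0,      0.00113]])
e3 = np.array([0.0, 0.0, 1.0])
exact = np.linalg.solve(U, e3)

x_progressive = 1e2 * np.array([74552.0, 975.0, 9.0])  # progressive scale factors
x_fixed_scale = 1e2 * np.array([75818.0, 992.0, 9.0])  # final scale factor throughout

err_p = np.linalg.norm(x_progressive - exact)
err_f = np.linalg.norm(x_fixed_scale - exact)
res_p = np.linalg.norm(U @ x_progressive - e3)
res_f = np.linalg.norm(U @ x_fixed_scale - e3)
```

The progressive-scaling column is far closer to the true solution, yet its residual is nearly two orders of magnitude larger, exactly the apparent paradox the text describes.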

Matrix Multiplication
13. For matrix multiplication in floating-point we write

\[ C \equiv \mathrm{fl}(AB). \tag{13.1} \]

From our result for floating-point scalar products we have

\[ c_{ij} \equiv a_{i1}(1+\epsilon_1)\, b_{1j} + a_{i2}(1+\epsilon_2)\, b_{2j} + \cdots + a_{in}(1+\epsilon_n)\, b_{nj} \]

with the usual bounds for the \epsilon_r. We have therefore, typically for a matrix of order 4:

\[ |C - AB| \le 2^{-t} \begin{bmatrix} 4|a_{11}| & 4|a_{12}| & 3|a_{13}| & 2|a_{14}| \\ 4|a_{21}| & 4|a_{22}| & 3|a_{23}| & 2|a_{24}| \\ 4|a_{31}| & 4|a_{32}| & 3|a_{33}| & 2|a_{34}| \\ 4|a_{41}| & 4|a_{42}| & 3|a_{43}| & 2|a_{44}| \end{bmatrix} \cdot |B|. \tag{13.2} \]

Hence a fortiori

\[ |C - AB| \le n\,2^{-t}\, |A|\,|B| \tag{13.3} \]

and

\[ \|C - AB\| \le \big\|\,|C - AB|\,\big\| \le n\,2^{-t}\, \big\|\,|A|\,|B|\,\big\| \le n\,2^{-t}\, \big\|\,|A|\,\big\|\, \big\|\,|B|\,\big\| \le n^2\, 2^{-t}\, \|A\|\,\|B\|. \tag{13.4} \]

In the only application made in this paper we have B = A^T and for this

\[ \|C - AA^T\| \le n^2\, 2^{-t}\, \|A\|\,\|A^T\| = n^2\, 2^{-t}\, \|AA^T\|. \tag{13.5} \]

Writing this in the form

\[ \|C - AA^T\| \,\big/\, \|AA^T\| \le n^2\, 2^{-t}, \tag{13.6} \]

we see that the bound for the relative error in the computed product is independent of that product. Note that even the computed C is exactly symmetric, so that we need form only the upper half of C.
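The elementwise bound (13.3) is easy to confirm in simulation. Here single precision stands in for t-digit arithmetic (with t = 23 taken conservatively) and double precision supplies the reference product; both choices belong to this sketch, not to the paper.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 50
A = rng.uniform(-1, 1, (n, n)).astype(np.float32)
B = rng.uniform(-1, 1, (n, n)).astype(np.float32)

C = A @ B                                          # fl(AB) in single precision
exact = A.astype(np.float64) @ B.astype(np.float64)

t = 23                                             # conservative for float32
bound = n * 2.0**-t * (np.abs(A).astype(np.float64) @ np.abs(B).astype(np.float64))
err = np.abs(C.astype(np.float64) - exact)
```

Every entry of the error matrix falls under the corresponding entry of the bound, usually by a wide margin, since the worst case requires all rounding errors to conspire.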
For fixed-point computation of AA^T, even if the scale factors are introduced progressively, we have

\[ |C - AA^T| \le 2^{k_1}\, 2^{-t} \tag{13.7} \]

where 2^{k_1} is the final scale factor. If the diagonal elements are computed first, the final scale factor is determined at an early stage, since the maximum element is on the diagonal. Equation (13.7) gives

\[ \|C - AA^T\| \le 2^{k_1}\, n\, 2^{-t} \tag{13.8} \]

and hence

\[ \|C - AA^T\| \,\big/\, \|AA^T\| \le 2n\, 2^{-t} \tag{13.9} \]

since \|AA^T\| \ge \max |(AA^T)_{ii}| \ge \tfrac{1}{2}\cdot 2^{k_1}. This is a most satisfactory error bound.

General Error Analysis of Gaussian Elimination


14. Consider now the inversion of a general matrix, A, by reduction to triangular form by Gaussian elimination using a complete pivotal strategy. This strategy merely determines a re-ordering of the rows and columns of A, and it will simplify the notation, without any loss of generality, if we consider A to be reordered in this way, so that at each stage the pivot is in the top left-hand corner. It should be emphasized that this re-ordering is not known in advance.

We first give a general analysis which is independent of the type of arithmetic that is used. We denote the computed elements of the rth matrix, A^{(r)}, by a_{ij}^{(r)} and the computed multipliers by m_{ij}. The history of the elements in the (i, j) position is as follows.

(i) i \le j.

\[
\begin{aligned}
a_{ij}^{(2)} &\equiv a_{ij}^{(1)} - m_{i1}\, a_{1j}^{(1)} + e_{ij}^{(2)} \\
a_{ij}^{(3)} &\equiv a_{ij}^{(2)} - m_{i2}\, a_{2j}^{(2)} + e_{ij}^{(3)} \\
&\;\;\vdots \\
a_{ij}^{(i)} &\equiv a_{ij}^{(i-1)} - m_{i,i-1}\, a_{i-1,j}^{(i-1)} + e_{ij}^{(i)}
\end{aligned}
\tag{14.1}
\]

where e_{ij}^{(k)} is the error made in computing a_{ij}^{(k-1)} - m_{i,k-1} a_{k-1,j}^{(k-1)} from the computed a_{ij}^{(k-1)}, m_{i,k-1} and a_{k-1,j}^{(k-1)}. The element a_{ij}^{(i)} is an element of the ith pivotal row and undergoes no further change. Summing equations (14.1) and cancelling common elements,

\[ \sum_{k=1}^{i-1} m_{ik}\, a_{kj}^{(k)} + a_{ij}^{(i)} \equiv a_{ij}^{(1)} + \sum_{k=2}^{i} e_{ij}^{(k)}. \tag{14.2} \]

(ii) i > j. We now have the same equations as (14.1) terminating with

\[ a_{ij}^{(j)} \equiv a_{ij}^{(j-1)} - m_{i,j-1}\, a_{j-1,j}^{(j-1)} + e_{ij}^{(j)}. \tag{14.3} \]

The next value, a_{ij}^{(j+1)}, is taken to be exactly zero and remains zero for the rest of the reduction. The element a_{ij}^{(j)} is used to calculate m_{ij} from

\[ m_{ij} = \frac{a_{ij}^{(j)}}{a_{jj}^{(j)}} + \eta_{ij}, \tag{14.4} \]

where \eta_{ij} is the rounding error in the division. Hence

\[ 0 \equiv a_{ij}^{(j)} - m_{ij}\, a_{jj}^{(j)} + e_{ij}^{(j+1)} \tag{14.5} \]

where

\[ e_{ij}^{(j+1)} = a_{jj}^{(j)}\, \eta_{ij}. \tag{14.6} \]

Summing equations of type (14.1) up to (14.3) and adding in (14.5),

\[ \sum_{k=1}^{j} m_{ik}\, a_{kj}^{(k)} \equiv a_{ij}^{(1)} + \sum_{k=2}^{j+1} e_{ij}^{(k)}. \tag{14.7} \]

Note that equations (14.2) and (14.7) give a relation between elements a_{kj}^{(k)} of the final triangular matrix, elements m_{ik}, and elements a_{ij}^{(1)} with added perturbations. Writing L for the lower triangular matrix formed by the m_{ij} augmented by a unit diagonal, and U for the upper triangle formed by the pivotal rows, (14.2) and (14.7) give

\[ LU \equiv A^{(1)} + E^{(2)} + E^{(3)} + \cdots + E^{(n)} \equiv A^{(1)} + E \tag{14.8} \]

where E^{(k)} is the matrix formed by the e_{ij}^{(k)}. Note that this has null rows 1 to (k - 1) and null columns 1 to (k - 2).

We may describe this result as follows. "The computed matrices L and U are the matrices which would have been obtained by correct computation with A^{(1)} + E. Further, A^{(k)} is the matrix which would have resulted from exact computation with

\[ A^{(1)} + E^{(2)} + E^{(3)} + \cdots + E^{(k)} = A^{(1)} + F^{(k)} \quad \text{(say)}." \tag{14.9} \]
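The decomposition identity (14.8) can be observed directly. The sketch below runs Gaussian elimination with complete pivoting in single precision (standing in for t-digit arithmetic) and measures E = LU - A^{(1)} in double precision; it is an illustrative implementation, not the paper's ACE program.

```python
import numpy as np

def lu_complete_pivoting(A):
    """Gaussian elimination with complete pivoting, all arithmetic in
    single precision.  Returns L, U and the row/column orderings chosen."""
    A = A.astype(np.float32).copy()
    n = A.shape[0]
    rows, cols = np.arange(n), np.arange(n)
    for k in range(n - 1):
        sub = np.abs(A[k:, k:])
        i, j = np.unravel_index(np.argmax(sub), sub.shape)
        A[[k, k + i], :] = A[[k + i, k], :]
        rows[[k, k + i]] = rows[[k + i, k]]
        A[:, [k, k + j]] = A[:, [k + j, k]]
        cols[[k, k + j]] = cols[[k + j, k]]
        A[k + 1:, k] /= A[k, k]                    # multipliers, |m_ik| <= 1
        A[k + 1:, k + 1:] -= np.outer(A[k + 1:, k], A[k, k + 1:])
    L = np.tril(A, -1) + np.eye(n, dtype=np.float32)
    U = np.triu(A)
    return L, U, rows, cols

rng = np.random.default_rng(2)
n = 30
A = rng.uniform(-1, 1, (n, n)).astype(np.float32)
L, U, r, c = lu_complete_pivoting(A)
# E of (14.8), with A re-ordered as the pivotal strategy dictated
E = L.astype(np.float64) @ U.astype(np.float64) - A[np.ix_(r, c)].astype(np.float64)
```

The multipliers never exceed unity, as complete pivoting guarantees, and the norm of E lands well inside the bound (15.8) of the next section (checked here in the slightly loosened form 2^{-t} n^2, which leaves room for modest element growth).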

Gaussian Elimination in Floating Point


15. We now assume that the matrix is scaled so that all elements are slightly less than 1/g(n) [see Sec. 4]. Strictly speaking, for floating-point computation this scaling is irrelevant, but it facilitates comparison with fixed-point computation and error analysis elsewhere. We also assume that the matrix is equilibrated, although no use is made of this fact. We consider this to be necessary to justify the complete pivotal strategy. We know that for exact computation the elements of A^{(k)} would be bounded by g(k)/g(n) and hence every element of every A^{(k)} would be less than unity. We shall assume this is also true for the computed A^{(k)}.

It might be felt that the result of the last paragraph guarantees this, since the final L and U do correspond to exact computation with A + F^{(n)}. Unfortunately each A^{(k)} corresponds to exact computation with the corresponding A^{(1)} + F^{(k)}, so that although we can claim to have performed an exact Gaussian elimination with A^{(1)} + F^{(n)} we cannot claim that we have selected the largest pivot at each stage corresponding to this fixed matrix. We hope to deal with this point later, but, for the moment, it remains an assumption. We assume explicitly that if

\[ |a_{ij}^{(1)}| < \frac{1}{g(n)} - 2^{-t}\, \frac{g(1) + 2g(2) + 2g(3) + \cdots + g(n)}{g(n)}, \tag{15.1} \]

then

\[ |a_{ij}^{(k)}| < \frac{g(k)}{g(n)} \qquad (k = 2, \cdots, n). \tag{15.2} \]

Note that |m_{ij}| \le 1 is assured, since we do take as pivot the largest element of the computed A^{(k)} at each stage.

Now we have

\[ a_{ij}^{(k)} = \mathrm{fl}\bigl(a_{ij}^{(k-1)} - m_{i,k-1}\, a_{k-1,j}^{(k-1)}\bigr) = \bigl[a_{ij}^{(k-1)} - m_{i,k-1}\, a_{k-1,j}^{(k-1)}(1+\epsilon_1)\bigr](1+\epsilon_2); \qquad |\epsilon_1|, |\epsilon_2| \le 2^{-t}. \tag{15.3} \]

Hence

\[ \bigl| a_{ij}^{(k)} - \bigl(a_{ij}^{(k-1)} - m_{i,k-1}\, a_{k-1,j}^{(k-1)}\bigr) \bigr| \le 2^{-t}\, \frac{g(k) + g(k-1)}{g(n)}, \tag{15.4} \]

showing that, under our assumption (15.1), all elements of (A + E) are less than 1/g(n).

Since we are assuming that all a_{ij}^{(k)} are less than unity, it is clear that

\[ \bigl| \mathrm{fl}\bigl[a_{ij}^{(k-1)} - m_{i,k-1}\, a_{k-1,j}^{(k-1)}\bigr] - \bigl[a_{ij}^{(k-1)} - m_{i,k-1}\, a_{k-1,j}^{(k-1)}\bigr] \bigr| \le 2^{-t}. \tag{15.5} \]

Hence a bound for the matrix E^{(k)} is, typically for n = 5, k = 3,

\[ 2^{-t} \begin{bmatrix} 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 \\ 0 & 1 & 1 & 1 & 1 \\ 0 & 1 & 1 & 1 & 1 \\ 0 & 1 & 1 & 1 & 1 \end{bmatrix} \tag{15.6} \]

and for E is

\[ 2^{-t} \begin{bmatrix} 0 & 0 & 0 & 0 & 0 \\ 1 & 1 & 1 & 1 & 1 \\ 1 & 2 & 2 & 2 & 2 \\ 1 & 2 & 3 & 3 & 3 \\ 1 & 2 & 3 & 4 & 4 \end{bmatrix}. \tag{15.7} \]

Using the Euclidean norm as an approximation we have

\[ \|E\| \le (0.41)\, 2^{-t} n^2. \tag{15.8} \]

These results all represent extreme upper bounds and cannot be attained even if the maximum possible growth takes place during the reduction. If, as is usually the case, the elements diminish somewhat in size as the process progresses, then the rounding errors, which are certainly less than \bigl(|a_{ij}^{(k-1)}| + |a_{k-1,j}^{(k-1)}|\bigr)\,2^{-t} at each stage, become progressively smaller. If, for example, the elements diminish by a factor of 2 at each stage, then the last element of the bound for E becomes 2 \times 2^{-t}/g(n) instead of (n - 1) \times 2^{-t}.

There remains the possibility that the process may break down prematurely due to the emergence of a null A^{(k)}. This can happen only if A + F^{(k)} is exactly singular. Now \|F^{(k)}\| \le (0.41)\,2^{-t} n^2, since each element of F^{(k)} is certainly bounded by the bound we have for E. Breakdown cannot occur therefore if (0.41)\,2^{-t} n^2\, \|A^{-1}\| < 1, and we shall in any case need a stronger condition if we are to guarantee a good inverse.
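The bound matrix (15.7) has (i, j)-entry 2^{-t} min(i-1, j) with indices counted from 1, and its Euclidean norm gives (15.8); a quick check of that constant:

```python
import numpy as np

def elimination_error_bound(n):
    """Entry (i, j) of the bound (15.7) for E, in units of 2^-t:
    min(i-1, j) rounding errors can accumulate in position (i, j)."""
    i = np.arange(1, n + 1)[:, None]
    j = np.arange(1, n + 1)[None, :]
    return np.minimum(i - 1, j).astype(float)

norms = {n: np.linalg.norm(elimination_error_bound(n)) for n in (5, 10, 50, 200)}
```

The Euclidean norms come out at roughly 0.408 n^2 for large n (the square sum behaves like n^4/6), which is the origin of the 0.41 in (15.8).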
16. The inversion is completed by solving

\[ LUX = I \tag{16.1} \]

which is performed in the two steps

\[ LY = I, \qquad UX = Y. \tag{16.2} \]

Applying our results for the floating-point solution of triangular sets of equations we have finally:

\[
\begin{aligned}
&LU \equiv A + E &&\text{(a)}\\
&\text{where from (15.8):}\quad \|E\| \le (0.41)\,2^{-t} n^2 &&\text{(b)}\\
&LY \equiv I + F &&\text{(c)}\\
&\text{where from (10.14):}\quad \|F\| \le (0.4)\,2^{-t} n^{5/2}\, \|Y\| &&\text{(d)}\\
&UX \equiv Y + G + H &&\text{(e)}\\
&\text{where from (10.13):}\quad \|G\| \le 2^{-t} n^{1/2}\, \|Y\|; \quad \|H\| \le (0.4)\,2^{-t} n^{5/2}\, \|X\|. &&\text{(f)}
\end{aligned}
\]

Hence

\[ LUX \equiv LY + LG + LH \equiv I + F + LG + LH \]

or

\[ (A + E)X \equiv I + F + LG + LH, \qquad (AX - I) \equiv -EX + F + LG + LH. \tag{16.3} \]

We need satisfactory bounds for the norms of each of the terms on the right. It will become apparent that we cannot guarantee a reasonable inverse unless 2^{-t} n^{7/2}\, \|A^{-1}\| is appreciably less than unity. We therefore assume

\[ 2^{-t} n^{7/2}\, \|A^{-1}\| < 0.1, \tag{16.4} \]

and this condition certainly guarantees that there is no breakdown during the reduction. We assume further n \ge 10. Relation (16.4) implies that A is non-singular. Equations (a) and (b) give

\[ \|(A + E)^{-1}\| \le \|A^{-1}\| \left[ 1 + \frac{\|A^{-1}\|\,\|E\|}{1 - \|A^{-1}\|\,\|E\|} \right] \le 1.01\, \|A^{-1}\|. \tag{16.5} \]

Now from (a),

\[
\begin{aligned}
\|L^{-1}\| = \|U(A + E)^{-1}\| &\le \|U\|\, \|(A + E)^{-1}\| \\
&\le \bigl[\tfrac{1}{2}n(n+1)\bigr]^{1/2} \times 1.01\, \|A^{-1}\| \le (0.8)\,n\, \|A^{-1}\|,
\end{aligned}
\tag{16.6}
\]

since all elements of U are less than unity.

From (c) and (d),

\[ \|Y\| \le \|L^{-1}\| \big/ \bigl[1 - (0.4)\,2^{-t} n^{5/2}\, \|L^{-1}\|\bigr] \le (0.84)\,n\, \|A^{-1}\|. \tag{16.7} \]

Applying these results and remembering that \|L\| \le 0.8n, since all its elements are bounded by unity, we have

\[
\begin{aligned}
\|EX\| &\le (0.41)\,2^{-t} n^2\, \|X\| \\
\|F\| &\le (0.34)\,2^{-t} n^{7/2}\, \|A^{-1}\| \\
\|LG\| &\le (0.7)\,2^{-t} n^{5/2}\, \|A^{-1}\| \\
\|LH\| &\le (0.32)\,2^{-t} n^{7/2}\, \|X\|.
\end{aligned}
\tag{16.8}
\]

We have now established a result of the type (T) of Section 9 with

\[
\begin{aligned}
a &= 2^{-t}[0.41 n^2 + 0.32 n^{7/2}] \le 2^{-t}[0.34 n^{7/2}] \\
b &= 2^{-t}[0.7 n^{5/2} + 0.34 n^{7/2}] \le 2^{-t}[0.41 n^{7/2}].
\end{aligned}
\tag{16.9}
\]

Hence

\[ \|A^{-1}\|\, \frac{1 - 0.41 n^{7/2}\, 2^{-t} \|A^{-1}\|}{1 + 0.34 n^{7/2}\, 2^{-t} \|A^{-1}\|} \;\le\; \|X\| \;\le\; \|A^{-1}\|\, \frac{1 + 0.41 n^{7/2}\, 2^{-t} \|A^{-1}\|}{1 - 0.34 n^{7/2}\, 2^{-t} \|A^{-1}\|} \le (1.08)\, \|A^{-1}\| \tag{16.10} \]

\[ \|X - A^{-1}\| \,\big/\, \|A^{-1}\| \le (0.78)\,2^{-t} n^{7/2}\, \|A^{-1}\|, \qquad \|AX - I\| \le (0.78)\,2^{-t} n^{7/2}\, \|A^{-1}\|. \]

We can readily deduce the result for a matrix with elements bounded by unity. For if B is such a matrix then we work with A = B/g(n). Normally we would divide by the smallest power of 2 greater than g(n), so that no rounding was involved. If X is the computed inverse of A, then take Z = X/g(n) as the inverse of B. We have then

\[ BZ = g(n)\,A\,X/g(n) = AX. \]

Hence

\[ \|BZ - I\| = \|AX - I\| \le (0.78)\,2^{-t} n^{7/2}\, \|A^{-1}\| \le (0.78)\,2^{-t} n^{7/2}\, g(n)\, \|B^{-1}\|. \tag{16.11} \]

The presence of the term n^{7/2} is at first sight disappointing. However, we must remember that Goldstine and von Neumann's result for a positive definite symmetric matrix contains the factor n^2\, \|A\|\,\|A^{-1}\|, and \|A\| could be as large as n. In corresponding positions in our analysis we have replaced \|L\| and \|U\| by 0.8n, and this contributes a factor n in our result. We have therefore only the additional factor n^{1/2}, and this result is for a general matrix.

The presence of the factor g(n) is inevitable, but notice that it should be replaced by \max [1, R], where R is the growth ratio, in each particular case; and we know in practice that R is often less than unity. For each of our special types of matrix we have an a priori upper bound for R which is much smaller than g(n).

Fixed-Point Computation
17. We do not wish to repeat the analysis for all varieties of fixed-point computation, and we restrict ourselves to one particular type. Before describing this we justify its selection.

The alternative methods of solving unsymmetric equations for which an error analysis has been performed are

(i) by reducing the problem to that of inverting AA^T, a positive definite matrix [3],

(ii) by orthogonal triangularizations of different types (§21-23).

None of these methods requires less than 2n^3 multiplications, whereas the method we have just described requires n^3. If we are prepared to use 2n^3 multiplications, then the following strategy gives high accuracy.

(I) Invert the matrix as above, thereby determining the interchanges and the required scaling factors.

(II) Repeat the computation in fixed point, taking advantage of the information gained in (I) and accumulating scalar products wherever possible.

In (II) we determine L and U directly from LU = A, where A has its rows and columns permuted as determined by (I). We should compute \tfrac{1}{2}L in case some of the multipliers were just greater than unity as a result of the different rounding errors. For this technique we have

\[ LU \equiv A + E; \qquad |e_{ij}| \le 2^{-t}; \qquad \|E\| \le 2^{-t} n, \tag{17.1} \]

since each element of \tfrac{1}{2}L and U is determined directly by dividing into an exactly accumulated scalar product. Similarly, in solving LY = I and UX = Y the scale factors are known in advance. Some of the bounds we give will be different according as \max_{i,j} |x_{ij}| is greater or less than \max_{i,j} |y_{ij}|. Where this is true, the bounds in the former case are given on the left and for the latter on the right. We have, from (11.8),

\[ LY = I + F; \qquad |f_{rs}| \le 2^{-t} \max_{i,j} |y_{ij}|; \qquad \|F\| \le n\,2^{-t}\, \|Y\| \]

\[ UX = Y + G \tag{17.2} \]

\[ |g_{rs}| \le 2^{-t} \max_{i,j} |x_{ij}| \quad : \quad |g_{rs}| \le 2^{-t} \max_{i,j} |y_{ij}| \]

\[ \|G\| \le n\,2^{-t}\, \|X\| \quad : \quad \|G\| \le n\,2^{-t}\, \|Y\|, \]

giving

\[
\begin{aligned}
&\|EX\| \le 2^{-t} n\, \|X\| \\
&\|F\| \le (0.9)\,2^{-t} n^2\, \|A^{-1}\| \\
&\|LG\| \le (0.8)\,2^{-t} n^2\, \|X\| \quad : \quad \|LG\| \le (0.7)\,2^{-t} n^3\, \|A^{-1}\|
\end{aligned}
\tag{17.3}
\]

provided 2^{-t} n^2\, \|A^{-1}\| \le 0.1. (The constants have been taken to one decimal only.) In both cases we have results of the type (T) of Section 9 with

\[ a = 2^{-t}[n + 0.8n^2] \quad : \quad a = 2^{-t} n \]
\[ b = 2^{-t}[0.9n^2] \quad : \quad b = 2^{-t}[0.9n^2 + 0.7n^3]. \tag{17.4} \]

In practice the second of these is very uncommon, since the largest element of X is usually larger than the largest element in Y, particularly if A is at all ill-conditioned. For several of the special matrices we have already seen that \max_{i,j} |y_{ij}| is equal to unity. In any case \max_{i,j} |y_{ij}| \le 2^{n-1}, so that for small matrices \|Y\| cannot be large.

Hence we will usually have

\[ \|AX - I\| \le (2.0)\,2^{-t} n^2\, \|A^{-1}\| \tag{17.5} \]

if n^2\, 2^{-t}\, \|A^{-1}\| \le 0.1, n \ge 10.

In the alternative case we will not be able to guarantee that X is even an approximate inverse unless n^3\, 2^{-t}\, \|A^{-1}\| \le 0.1, and then we have

\[ \|AX - I\| \le (0.9)\,2^{-t} n^3\, \|A^{-1}\|. \tag{17.6} \]

The results are for a matrix A which has been scaled sufficiently to cover the growth factor (if any) observed in (I). Hence for a matrix A with elements satisfying |a_{ij}| \le 1, we have for the double process

\[ \|AX - I\| \le (2.0)\,2^{-t} n^2 R\, \|A^{-1}\| \quad : \quad \|AX - I\| \le (0.9)\,2^{-t} n^3 R\, \|A^{-1}\|. \tag{17.7} \]
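Stage (II) can be sketched as a Doolittle decomposition in which each element of L and U comes from one exactly accumulated scalar product and a single rounding, which is what makes (17.1) hold with |e_{ij}| of order 2^{-t}. In the sketch, float64 accumulation stands in for the double-length accumulator, float32 for t-digit storage, and a diagonally dominant matrix is used so that the ordering found in stage (I) can be taken as the identity; these are all assumptions of the sketch.

```python
import numpy as np

def doolittle_accumulated(A):
    """LU = A computed element by element: scalar products accumulated
    in double precision, one rounding to single precision per element."""
    n = A.shape[0]
    L = np.eye(n, dtype=np.float32)
    U = np.zeros((n, n), dtype=np.float32)
    for k in range(n):
        for j in range(k, n):        # row k of U
            s = A[k, j] - L[k, :k].astype(np.float64) @ U[:k, j].astype(np.float64)
            U[k, j] = np.float32(s)
        for i in range(k + 1, n):    # column k of L
            s = A[i, k] - L[i, :k].astype(np.float64) @ U[:k, k].astype(np.float64)
            L[i, k] = np.float32(s / np.float64(U[k, k]))
    return L, U

rng = np.random.default_rng(4)
n = 25
A = (rng.uniform(-1, 1, (n, n)) + 2 * n * np.eye(n)) / (2 * n)  # no pivoting needed
L, U = doolittle_accumulated(A)
E = L.astype(np.float64) @ U.astype(np.float64) - A
```

Every element of E is a single rounding error, so the whole matrix satisfies the elementwise and norm bounds of (17.1) with t = 23 taken conservatively for float32.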

Error Analysis of the Symmetric Cholesky Decomposition

18. We omit the analysis of Gaussian elimination of positive-definite matrices, which can now be done in floating-point arithmetic with no pivotal strategy. The analysis is similar to that of the last section except that we know there is no growth in size of pivotal elements, and the norms of L and U are more specially related to those of A. We pass instead to the LL^T decomposition, which is important in other contexts.

We consider fixed-point computation of L for a positive definite matrix A. We show that if

\[ (n+1)\,2^{-t}\, \|A^{-1}\| < 1; \qquad |a_{ij}| < (1 - 2^{-t}), \tag{18.1} \]

then all computed elements of L satisfy |l_{ij}| \le 1 and

\[ LL^T \equiv A + E; \qquad |e_{ij}| \le \begin{cases} 2^{-t-1}\, |l_{ii}|, & j > i \\ 2^{-t-1}\, |l_{jj}|, & i > j \\ 2^{-t}\, |l_{ii}|, & i = j. \end{cases} \tag{18.2} \]

The proof is by a double induction. Assume that when we have computed the first s rows of L and the first r elements of its (s+1)th row, the computed elements are all bounded by unity and are those that would have resulted from the exact decomposition of the matrix (A + E_{sr}). Here E_{sr} is a symmetric matrix having zero elements except in its leading principal submatrix of order s and in the first r elements of its (s+1)th row and column, where it has elements which are bounded as in (18.2). For n = 5, s = 3, r = 1, we have typically:

\[ |E_{31}| \le 2^{-t-1} \begin{bmatrix} 2|l_{11}| & |l_{11}| & |l_{11}| & |l_{11}| & 0 \\ |l_{11}| & 2|l_{22}| & |l_{22}| & 0 & 0 \\ |l_{11}| & |l_{22}| & 2|l_{33}| & 0 & 0 \\ |l_{11}| & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 \end{bmatrix}. \tag{18.3} \]

From (18.1) and our assumptions, A + E_{sr} is a positive definite matrix with elements bounded by unity.

Consider now the determination of l_{s+1,r+1}. The equation from which it is computed is

\[ l_{s+1,r+1} = \mathrm{fi}\!\left[ \frac{a_{s+1,r+1} - l_{s+1,1}\, l_{r+1,1} - \cdots - l_{s+1,r}\, l_{r+1,r}}{l_{r+1,r+1}} \right], \qquad \text{exact quotient} = q \ \text{(say)}. \tag{18.4} \]

If this quotient were computed exactly it would be the (s+1, r+1)-element corresponding to the exact decomposition of (A + E_{sr}), and since this is a positive definite matrix with its elements bounded by unity, the exact quotient is bounded by unity. Hence the computed l_{s+1,r+1} is also bounded by unity and

\[ l_{s+1,r+1} = q + e; \qquad |e| \le 2^{-t-1}. \tag{18.5} \]

Hence

\[ l_{s+1,1}\, l_{r+1,1} + l_{s+1,2}\, l_{r+1,2} + \cdots + l_{s+1,r+1}\, l_{r+1,r+1} \equiv a_{s+1,r+1} + l_{r+1,r+1}\, e. \tag{18.6} \]

This completes our induction apart from the element l_{s+1,s+1}. This is computed from the relation

\[ l_{s+1,s+1} = \sqrt{a_{s+1,s+1} - l_{s+1,1}^2 - l_{s+1,2}^2 - \cdots - l_{s+1,s}^2} = \sqrt{q} \ \text{(say)}. \tag{18.7} \]

Again, if the square root were computed exactly it would be less than unity, since it is an element in the exact decomposition of A + E_{s+1,s}, which is positive definite with elements bounded by unity.

We do not wish to enter into a detailed study of square root subroutines. On ACE, on which the square root is calculated by the Newton process using division into the double-precision accumulator, special care is taken with rounding, and the computed value is the correctly rounded value

\[ l_{s+1,s+1} = \sqrt{q} + \epsilon; \qquad |\epsilon| \le 2^{-t-1}. \tag{18.8} \]

We shall assume this in our analysis, though if we had

\[ |l_{s+1,s+1} - \sqrt{q}| < k \times 2^{-t-1} \]

for any reasonable k, the conditions on A would not need to be much changed from (18.1). Since the computed value is the correctly rounded value, |l_{s+1,s+1}| \le 1.

The result is clearly true for a 1 \times 1 matrix, so the induction is complete. Note that if any l_{ii} is small (and if A is ill-conditioned this is almost certain to be true) the corresponding elements of the error matrix (LL^T - A) are far smaller than 2^{-t-1}. In other words, the errors arising from the decomposition may be far less than those from the initial rounding of the elements of A, in the case when they are not exactly representable by numbers of t binary digits. The LL^T decomposition is fundamental in the L-R technique of Rutishauser [5] as applied to symmetric matrices, and in the calculation of the zeros of det(B - \lambda A) and det(AB - \lambda I) when A is positive definite. The closeness of LL^T to A is therefore of great practical importance.
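The closeness of LL^T to A is easy to observe numerically. One deliberate weakening in the sketch below: bound (18.2) relies on exact accumulation of the scalar products, which LAPACK's single-precision Cholesky (reached through np.linalg.cholesky) does not perform, so the sketch verifies the looser textbook elementwise bound |E| \le (n+1) 2^{-t} |L||L^T| instead; that substitution is an assumption of this sketch.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 25
M = rng.uniform(-1, 1, (n, n))
A = (M @ M.T) / n                      # positive definite, elements below unity
L = np.linalg.cholesky(A.astype(np.float32))

L64 = L.astype(np.float64)
E = L64 @ L64.T - A.astype(np.float32).astype(np.float64)

# looser stand-in for (18.2), with t = 23 taken conservatively for float32
elementwise_bound = (n + 1) * 2.0**-23 * (np.abs(L64) @ np.abs(L64).T)
```

Where a diagonal element l_{jj} is small, the corresponding column of the bound shrinks with it, which is the qualitative content of (18.2).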

Completion of the Inverse


19. The inversion can now be completed by solving

\[ L^T X = I \tag{19.1} \]

and computing XX^T. We can no longer avoid scaling in this part of the work, and we give the analysis for floating-point computation, since it will be seen that the errors made at this stage are less important. Using the result for the inversion of a triangular matrix and taking advantage of the fact that the sum of the squares of the elements in any row of L is not greater than unity (Section 7), we have from (10.18):

\[ L^T X = I + F; \qquad \|F\| \le (0.6)\,2^{-t} n^2\, \|X\|; \qquad \|X\| \le \|(L^T)^{-1}\| \big/ \bigl[1 - (0.6)\,2^{-t} n^2\, \|(L^T)^{-1}\|\bigr]. \tag{19.2} \]

The computed inverse Y satisfies

\[ Y \equiv XX^T + G; \qquad \|G\| \le n^2\, 2^{-t}\, \|X\|^2 \tag{19.3} \]

from (13.5). We need compute only the upper half of Y, since even the computed Y is exactly symmetric. The total number of multiplications involved in computing Y from A is \tfrac{1}{2}n^3, so that full advantage is taken of symmetry in this method.

Equations (19.2) and (19.3), together with

\[ LL^T \equiv A + E; \qquad \|E\| \le n\,2^{-t}, \tag{19.4} \]

enable us to assess the accuracy of Y. Because of the close relationship between the norms of L and L^{-1} and those of A and A^{-1}, it is simpler to proceed to a direct computation of the relative error rather than by assessing \|AY - I\|; we comment on this in Section 20. It will simplify the assessment if we assume

\[ n\,2^{-t}\, \|A^{-1}\| < 0.1; \qquad n^2\, 2^{-t}\, \|A^{-1}\|^{1/2} < 0.1; \qquad n \ge 10. \tag{19.5} \]

Whether the first or the second of these is the more restrictive depends on the size of \|A^{-1}\|. In any case the first is more than adequate to insure against breakdown when computing L, and breakdown cannot occur anywhere else.

We assemble in equations (19.6) the deductions which can be made from the symmetry of A and relations (19.2) to (19.5):

\[
\begin{aligned}
&\text{(a)} \quad \|L^{-1}\|^2 = \|(L^T)^{-1}\|^2 = \|(A + E)^{-1}\| \le 1.12\, \|A^{-1}\| \quad \text{from S(ii)} \\
&\text{(b)} \quad \|(A + E)^{-1} - A^{-1}\| \le 1.12\, n\,2^{-t}\, \|A^{-1}\|^2 \quad \text{from (9.13)} \\
&\text{(c)} \quad 2^{-t} n^2\, \|L^{-1}\| \le 2^{-t} n^2 \times 1.06\, \|A^{-1}\|^{1/2} < 0.106 \quad \text{from (a)} \\
&\text{(d)} \quad \|X\| \le 1.07\, \|L^{-1}\| \le 1.14\, \|A^{-1}\|^{1/2} \\
&\text{(e)} \quad \|F\| \le (0.7)\,2^{-t} n^2\, \|A^{-1}\|^{1/2}.
\end{aligned}
\tag{19.6}
\]

Hence

\[ XX^T = \bigl[(L^T)^{-1} + (L^T)^{-1}F\bigr]\bigl[L^{-1} + F^T L^{-1}\bigr] = (LL^T)^{-1} + H \ \text{(say)} \tag{19.7} \]

\[
\begin{aligned}
\|XX^T - (A + E)^{-1}\| = \|H\| &\le 2\|F\|\,\|L^{-1}\|^2 + \|L^{-1}\|^2\,\|F\|^2 \\
&< (1.6)\,2^{-t} n^2\, \|A^{-1}\|^{3/2} + (0.055)\,2^{-t} n^2\, \|A^{-1}\|^{3/2} \\
&< (1.66)\,2^{-t} n^2\, \|A^{-1}\|^{3/2}.
\end{aligned}
\tag{19.8}
\]

The conditions (19.5) allow us to express the term in \|F\|^2 in a variety of ways, but it is convenient to have it in the same form as the main component of H. Finally we have

\[ \|Y - XX^T\| = \|G\| \le n^2\, 2^{-t}\, \|X\|^2 < 1.3\, n^2\, 2^{-t}\, \|A^{-1}\|. \tag{19.9} \]

Combining (19.6b), (19.8) and (19.9), we have for the relative error in Y,

\[ \|Y - A^{-1}\| \,\big/\, \|A^{-1}\| < 1.12\, n\,2^{-t}\, \|A^{-1}\| + 1.66\, n^2\, 2^{-t}\, \|A^{-1}\|^{1/2} + 1.3\, n^2\, 2^{-t}, \tag{19.10} \]

where the three terms on the right come from the calculation of L, X and Y respectively. By using fixed-point arithmetic in the calculation of X and Y we can reduce the n^2 in the last two terms to n; but if A is very ill-conditioned, so that \|A^{-1}\| is large, the errors made in the triangularization are in any case the most important, and those made in the calculation of XX^T are the least.

Instead of solving L^T X = I we could solve LX = I and then calculate X^T X. The bound obtained for the relative error is identical in the two cases, though, of course, the computed inverses are not identical.
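In modern double precision the whole route A \to L \to X = (L^T)^{-1} \to Y = XX^T is a few lines, and for a well-conditioned matrix Y agrees with the inverse to near working accuracy. np.linalg.solve is used for the triangular solve purely for brevity (it does not exploit triangularity, an implementation shortcut of this sketch).

```python
import numpy as np

rng = np.random.default_rng(6)
n = 20
M = rng.standard_normal((n, n))
A = M @ M.T + n * np.eye(n)            # well-conditioned positive definite

L = np.linalg.cholesky(A)              # A = L L^T
X = np.linalg.solve(L.T, np.eye(n))    # L^T X = I   (19.1)
Y = X @ X.T                            # computed inverse (19.3)

relative_error = (np.linalg.norm(Y - np.linalg.inv(A))
                  / np.linalg.norm(np.linalg.inv(A)))
```

The computed Y is symmetric to rounding level, so in a careful implementation only its upper half would be formed, as the text notes.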

The Residual Matrix

20. In most of the error analysis we have worked with the residual matrix (AY - I). It is worth noting that we can obtain a far better bound for this matrix if we solve L^T X = I than if we solve LX = I. Further, the contributions made to the residual matrix from the separate parts of the computation are not in the same ratio as the contributions to the relative error.

Working with L^T X = I + F we have

\[
\begin{aligned}
AY - I &= A[XX^T + G] - I \\
&= (A + E)XX^T - EXX^T + AG - I \\
&= LL^T XX^T - EXX^T + AG - I \\
&= L[I + F]X^T - EXX^T + AG - I \\
&= L[I + F][L^{-1} + F^T L^{-1}] - EXX^T + AG - I \\
&= LFL^{-1} + LF^T L^{-1} + (LFF^T L^{-1}) - EXX^T + AG.
\end{aligned}
\tag{20.1}
\]

The orders of magnitude of the bounds for the norms of the components arising from E, F, G respectively are

\[ 1.3\, n\,2^{-t}\, \|A^{-1}\|, \qquad 1.5\, n^2\, 2^{-t}\, \|A\|^{1/2}\, \|A^{-1}\| \qquad \text{and} \qquad 1.3\, n^2\, 2^{-t}\, \|A\|\, \|A^{-1}\|. \tag{20.2} \]

The terms containing F and F^T make the same contributions, and that of the FF^T term is negligible. Note that the contribution of G to the residual may easily be the largest even when it has the smallest effect on the inverse itself.

On the other hand, if we use

\[ LX = I + F, \qquad Y = X^T X + G, \]

then F and G are bounded as before but

\[ AY - I = LL^T F^T (L^{-1})^T L^{-1} + F + LL^T F^T (L^{-1})^T L^{-1} F - EX^T X + AG. \tag{20.3} \]

The orders of magnitude of the terms containing F^T and F are now

\[ n^2\, 2^{-t}\, \|A\|\, \|A^{-1}\| \qquad \text{and} \qquad n^2\, 2^{-t}\, \|A^{-1}\|^{1/2} \]

respectively, so that if A is ill-conditioned the first of these may be serious. Except for specially constructed matrices, however, this disparity between the two terms does not occur in practice. This is because of the nature of the inverses of triangular matrices and is discussed in Sections 27-28.
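The two routes can be compared experimentally. Since, as the text observes, the worst-case disparity rarely appears for ordinary matrices, the sketch below only confirms that both residuals stay small; the single-precision arithmetic and the moderately ill-conditioned test matrix are choices of this sketch.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 15
M = rng.standard_normal((n, n))
A64 = M @ M.T + 0.1 * np.eye(n)        # moderately ill-conditioned SPD matrix
A = A64.astype(np.float32)
L = np.linalg.cholesky(A)
I32 = np.eye(n, dtype=np.float32)

X1 = np.linalg.solve(L.T, I32)
Y1 = X1 @ X1.T                          # route of Section 19: L^T X = I
X2 = np.linalg.solve(L, I32)
Y2 = X2.T @ X2                          # alternative route: L X = I

r1 = np.linalg.norm(A64 @ Y1.astype(np.float64) - np.eye(n), 2)
r2 = np.linalg.norm(A64 @ Y2.astype(np.float64) - np.eye(n), 2)
```

For specially constructed triangular factors the residual r2 can be made much worse than r1, which is the point of (20.3); a generic random matrix does not exhibit the effect.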

Orthogonal Reduction to Triangular Form


21. We now consider the reduction of a matrix to triangular form by premultiplication with orthogonal matrices. In methods of this type the original matrix is transformed successively to A_1, A_2, \cdots, A_s, where each member of the sequence has at least one extra zero below the diagonal in addition to those in the previous matrix. These zeros may be produced one at a time (Givens [2]), or a whole subcolumn at a time (Householder [4]). Whatever the details of the method, the error analysis has certain features in common, as we shall now show.

We may define the mathematical process as follows. Starting with

\[ A_1 X = I = B_1, \tag{21.1} \]

we produce

\[ A_r X = B_r \qquad (r = 2, \cdots, s) \tag{21.2} \]

where

\[ R_r A_r = A_{r+1}, \qquad R_r B_r = B_{r+1}, \qquad R_r^T R_r = I, \tag{21.3} \]

the matrices R_r being the successive orthogonal matrices.

We now consider the practical procedure, and we let A_r denote the rth computed matrix. Corresponding to this A_r (N.B. this is not the matrix which would have been obtained by an exact computation in the previous stages), the technique defines an exact orthogonal matrix R_r. The computed matrix \bar{R}_r differs from this, and we may write

\[ \bar{R}_r \equiv R_r + E_r. \tag{21.4} \]

We then premultiply A_r by \bar{R}_r and make further errors, so that the computed A_{r+1} and B_{r+1} satisfy:

\[ A_{r+1} \equiv \bar{R}_r A_r + F_r, \qquad B_{r+1} \equiv \bar{R}_r B_r + G_r. \tag{21.5} \]

Note that we take those elements which should have been made zero to be exactly equal to zero, and F_r must include the errors implicit in this assumption. If F_r is to be small it is therefore essential that \bar{R}_r A_r should at least have small elements in place of those which should be made zero.

We can now pursue the analysis in two different ways.

(A) We may write

\[ A_{r+1} \equiv R_r A_r + (E_r A_r + F_r) \equiv R_r A_r + H_r, \qquad B_{r+1} \equiv R_r B_r + (E_r B_r + G_r) \equiv R_r B_r + K_r \tag{21.6} \]

and assume that we can find bounds for \|H_r\| and \|K_r\| of the form

\[ \|H_r\| \le h(n, r)\,2^{-t}, \qquad \|K_r\| \le k(n, r)\,2^{-t}, \tag{21.7} \]

where h and k are simple functions of n and r.

The final set of equations is

\[ A_s X = B_s, \tag{21.8} \]

a triangular set, and the computed X satisfies

\[ A_s X \equiv B_s + M \ \text{(say)}. \tag{21.9} \]

Premultiplying by R_1^T R_2^T \cdots R_{s-1}^T, we have

\[ \left[ A_1 + \sum_{1}^{s-1} S_r H_r \right] X \equiv I + \sum_{1}^{s-1} S_r K_r + S_{s-1} M \tag{21.10} \]

where

\[ S_r = R_1^T R_2^T \cdots R_r^T. \tag{21.11} \]

Since all the R_r, and hence all the S_r, are orthogonal, we have

\[ [A_1 + N] X \equiv I + P + Q; \qquad \|N\| \le \sum_{1}^{s-1} \|H_r\|, \qquad \|P\| \le \sum_{1}^{s-1} \|K_r\|, \qquad \|Q\| = \|M\|. \tag{21.12} \]

(B) We may work instead with \bar{R}_r and assume that we have bounds for F_r and G_r of the form

\[ \|F_r\| \le f(n, r)\,2^{-t}; \qquad \|G_r\| \le g(n, r)\,2^{-t}. \tag{21.13} \]

Premultiplying (21.9) now by \bar{R}_1^{-1} \bar{R}_2^{-1} \cdots \bar{R}_{s-1}^{-1}, we have

\[ \left[ A_1 + \sum_{1}^{s-1} \bar{S}_r F_r \right] X \equiv I + \sum_{1}^{s-1} \bar{S}_r G_r + \bar{S}_{s-1} M \tag{21.14} \]

where

\[ \bar{S}_r = \bar{R}_1^{-1} \bar{R}_2^{-1} \cdots \bar{R}_r^{-1}. \tag{21.15} \]

Now the \bar{S}_r are no longer exactly orthogonal, but suppose we can show that \|E_r\| \le a < 1. Then we have

\[ \|\bar{R}_r^{-1}\| \le \frac{\|R_r^{-1}\|}{1 - \|E_r R_r^{-1}\|} \le \frac{1}{1 - a}, \tag{21.16} \]

since R_r^{-1} is exactly orthogonal. Hence we have a relation of the same form as (21.12) in which:

\[ \|N\| \le \sum_{1}^{s-1} \frac{\|F_r\|}{(1-a)^r}, \qquad \|P\| \le \sum_{1}^{s-1} \frac{\|G_r\|}{(1-a)^r}, \qquad \|Q\| \le \frac{\|M\|}{(1-a)^{s-1}}. \tag{21.17} \]

Observe now that we can permit a value of a which is considerably greater than 2^{-t} without losing very much. Suppose for example that a = 10^{-4}. Now in the Householder method s = n, so that even for n = 10^4 the factor 1/(1-a)^{s-1} is merely equal to e. We therefore see that although the errors made in premultiplying A_r by \bar{R}_r should be kept as low as possible (of order 2^{-t}), and also that the exact \bar{R}_r A_r should have elements of order 2^{-t} in the positions in which zeros should be produced, we can tolerate a matrix \bar{R}_r which is by no means orthogonal to working accuracy. All we are really demanding of \bar{R}_r is that both its norm and that of \bar{R}_r^{-1} should be reasonably close to unity. Analysis (B) is clearly the more penetrating.
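The tolerance claimed for a is easy to verify: with a = 10^{-4} and s = 10^4 steps, the amplification factor of (21.17) is only about e.

```python
import math

a, s = 1.0e-4, 10_000                 # Householder reduction: s = n
amplification = (1.0 - a) ** -(s - 1)
print(amplification)                  # about 2.718, i.e. close to e
```

So even a computed rotation that is orthogonal only to four decimals costs, over ten thousand steps, no more than a factor e in the error bounds.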
This contrasts with the situation when orthogonal transformations are used in connection with the symmetric eigenproblem. There we premultiply with \bar{R}_r and postmultiply with \bar{R}_r^T, and we assume that the resulting matrix is still symmetric. Unless \bar{R}_r is accurately orthogonal this will not be an accurate similarity transformation.
22. If the analysis is to give satisfactory results, it is essential that we should be able to obtain satisfactory bounds for the norms of H and K, or for F and G, according as we use (A) or (B). The former will imply finding very good bounds for E_r, but the latter is not very demanding in this respect. Now in Gaussian elimination the main obstacle to finding bounds was the possibility of a progressive increase in size of the elements of the reduced matrices, since in general the current rounding error is proportional to the current size of the elements. With orthogonal (or nearly orthogonal) matrices this difficulty is no longer present. The Euclidean norm of each column of A_1 is preserved by exact orthogonal transformations. Hence, if originally the sum of the squares of the elements in each column is less than unity, this remains true in all later matrices. In practice, when working in fixed point, we must scale A_1 so that the accumulation of rounding errors cannot cause the norm of any column of any A_r to exceed unity, and we must take B_1 = \tfrac{1}{2}I for the same reason. In floating point we need not take these extra precautions, but we will need to deal with the possible slight increase in norms when we are estimating bounds for H_r and K_r or F_r and G_r.

Detailed Analysis for Givens' Reduction


23. We now give a brief analysis of the Givens' reduction. We work in floating-
point since the gain involved in working in fixed-point with accumulation of
scalar products is not substantial. We take A_1 to have normalization (II)
originally.
Consider a typical stage in the reduction at which r columns have been made
triangular and elements (r + 2, r + 1), ⋯, (k, r + 1) in the (r + 1)th column
have been made zero, so that typically for n = 5, r = 1, k = 3 the current
matrix is of the form

⎡a11  a12  a13  a14  a15⎤
⎢ 0   a22  a23  a24  a25⎥
⎢ 0    0   a33  a34  a35⎥   (23.1)
⎢ 0   a42  a43  a44  a45⎥
⎣ 0   a52  a53  a54  a55⎦

The next step is to make the (k + 1, r + 1)-element zero by a rotation in the
(r + 1, k + 1)-plane. This modifies only the elements in rows (r + 1) and
(k + 1) and leaves the earlier zeros undisturbed. We shall assume that the
Euclidean norms of all columns of A and B at this stage are bounded by

(1 + 9·2^{-t})^p,

where p is the number of transformations completed so far, and justify this
assumption in our analysis of the next stage.
The components of the rotation matrix are computed from the relations

cos θ ≡ fl[a_{r+1,r+1}/(a_{r+1,r+1}^2 + a_{k+1,r+1}^2)^{1/2}]
      ≡ fl[b/(b^2 + c^2)^{1/2}] (say)   (23.2)

sin θ ≡ fl[c/(b^2 + c^2)^{1/2}]   (23.3)

fl(b^2 + c^2) ≡ [b^2(1 + ε_1) + c^2(1 + ε_2)](1 + ε_3)
             ≡ (b^2 + c^2)(1 + ε_4)(1 + ε_3)   (23.4)

fl[(b^2 + c^2)^{1/2}] ≡ (b^2 + c^2)^{1/2}(1 + ε_4)^{1/2}(1 + ε_3)^{1/2}(1 + 2ε_5)
                     ≡ (b^2 + c^2)^{1/2}(1 + 3ε_6)   (23.5)

where all |ε_i| ≤ 2^{-t} and we have assumed a square root subroutine which gives
errors of less than 2 parts in 2^t. The calculated cos θ and sin θ therefore satisfy

cos θ ≡ fl[b/(b^2 + c^2)^{1/2}] ≡ b(1 + ε_7)/[(b^2 + c^2)^{1/2}(1 + 3ε_6)]
sin θ ≡ fl[c/(b^2 + c^2)^{1/2}] ≡ c(1 + ε_8)/[(b^2 + c^2)^{1/2}(1 + 3ε_6)]   (23.6)

In the notation of the last section we have, by quite crude inequalities,

‖E‖ = ‖R̄ - R‖ < 6·2^{-t}   (23.7)


independent of the size of b and c. Hence in the notation of Section 21 we may
take a = 6·2^{-t}, and it is evident that the departure from orthogonality is quite
unimportant.
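As a concrete sketch of one such step (hypothetical data, 0-based indices, double precision standing in for t-digit work), the rotation of (23.2)–(23.3) can be formed and applied as follows; the computed rotation is orthogonal to working accuracy, the target element is annihilated, and the Euclidean column norms are preserved, as the text requires.

```python
import numpy as np

def givens_step(A, r, k):
    """Zero A[k, r] by a rotation in the (r, k)-plane, with b = A[r, r], c = A[k, r]."""
    b, c = A[r, r], A[k, r]
    h = np.hypot(b, c)                 # (b^2 + c^2)^{1/2}
    cos_t, sin_t = b / h, c / h
    G = np.eye(A.shape[0])
    G[r, r], G[r, k] = cos_t, sin_t
    G[k, r], G[k, k] = -sin_t, cos_t
    return G @ A, G

rng = np.random.default_rng(1)
A = rng.standard_normal((5, 5))
A2, G = givens_step(A, 1, 3)           # n = 5, r = 1, k = 3 as in (23.1), 0-based
```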
We now turn to the determination of the F and G matrices. These are null
except in rows (r + 1) and (k + 1), and for F both rows are also null in posi-
tions 1 to r.
The element (k + 1, r + 1) of F must be treated specially, but bounds for
each of the other non-zero elements of F and G are given either by

ξ_1 = |(x cos θ + y sin θ) - fl(x cos θ + y sin θ)|   (23.8)

or by

ξ_2 = |(-x sin θ + y cos θ) - fl(-x sin θ + y cos θ)|   (23.9)

where x and y are the existing elements in the (r + 1) and (k + 1) positions
of a column of the current A or B. For ξ_1 we have, ignoring 2^{-2t},

ξ_1 ≤ |x cos θ + y sin θ| |ε_1| + |x cos θ ε_2 + y sin θ ε_3|;   |ε_i| ≤ 2^{-t}
    < 2^{-t+1} (x^2 + y^2)^{1/2} ≤ 2^{-t+1} × (Euclidean norm of column)   (23.10)
and the same result holds for ξ_2. This enables us to justify the original assump-
tion made about the norms of the columns of the transformed A and B. For the
ratio of the norms of any column before and after this transformation certainly
lies between (1 ± 9·2^{-t}). This is ample to cover the error in R̄ and the errors
ξ_1 and ξ_2 made in multiplying by it. All norms of all transformed columns there-
fore lie between (1 ± 9·2^{-t})^{n(n-1)/2}. These norms will be of order unity since we
will in any case need to have n^2·2^{-t} appreciably less than unity for a useful
inverse. The exact value of the modulus of the (k + 1, r + 1) element of R̄A is

|bc(ε_7 - ε_8)| / [(b^2 + c^2)^{1/2}(1 + 3ε_6)]

which is certainly less than 2^{-t}. In replacing this by zero we are committing an
error with a bound which is smaller than those we have obtained for other ele-
ments of rows (r + 1) and (k + 1). Combining these results we have for the
current F and G,

‖F‖ < 2[2(n - r)]^{1/2} 2^{-t} (1 + 9·2^{-t})^{n(n-1)/2}
                                                       (23.11)
‖G‖ < 2(2n)^{1/2} 2^{-t} (1 + 9·2^{-t})^{n(n-1)/2}
Summing these results, and taking Σ_{r=1}^{n} r^{1/2} ≈ (2/3)n^{3/2} and
Σ_{r=1}^{n} r^{3/2} ≈ (2/5)n^{5/2}, we have from (21.12) and (21.17),

(A_1 + N)X ≡ I + P + Q   (23.12)

‖N‖ < (4√2/5) n^{5/2} 2^{-t} (1 + 9·2^{-t})^{n(n-1)/2} / (1 - 6·2^{-t})^{n(n-1)/2}

‖P‖ < ½n(n - 1) · 2(2n)^{1/2} 2^{-t} (1 + 9·2^{-t})^{n(n-1)/2} / (1 - 6·2^{-t})^{n(n-1)/2}   (23.13)

‖Q‖ < ‖M‖ / (1 - 6·2^{-t})^{n(n-1)/2}

where we have replaced all 1/(1 - a)^r by 1/(1 - a)^{n(n-1)/2}. We write for the
moment

(1 + 9·2^{-t})^{n(n-1)/2} = 1 + d;   1/(1 - 6·2^{-t})^{n(n-1)/2} = 1 + e.   (23.14)
Now M is the error made in solving the final triangular set of equations

A_n X = B_n,   (23.15)

so that we know from (10.10) and (10.17) that the computed X satisfies

A_n X ≡ B_n + M   (23.16)

where certainly

‖M‖ ≤ 2^{-t}[n^{3/2}‖B_n‖ + 0.6(1 + d)n^2‖X‖]
    < 2^{-t}(1 + d)[n^{3/2} + 0.6n^2‖X‖]   (23.17)
since the norm of each column of A_n and B_n is less than (1 + d). We have there-
fore

‖AX - I‖ < 2^{-t}(1 + d)(1 + e)[(4√2/5)n^{5/2}‖X‖ + √2 n^{5/2} + n + 0.6n^2‖X‖]   (23.18)

which is a result of the standard form. For a reasonably good inverse we shall
require

2^{-t} n^{5/2} < 0.05 (say)

and assuming n ≥ 10 this certainly gives

(1 + d) < 1.1
(1 + e) < 1.07   (23.19)
(1 + d)(1 + e) < 1.2.

We shall require further

2^{-t} n^{5/2} ‖A^{-1}‖ < 0.1

and if we assume this we certainly have when n ≥ 10

‖AX - I‖ < 2^{-t}[1.61 n^{5/2}‖X‖ + 1.71 n^{5/2}].   (23.20)

From (T) of Section 9 we obtain

‖X - A^{-1}‖ / ‖A^{-1}‖ < 2^{-t} n^{5/2} [2.1 ‖A^{-1}‖ + 1.8].   (23.21)

The result is for a matrix with normalization (II). For normalization (I) we
have immediately, in the usual way,

‖X - A^{-1}‖ / ‖A^{-1}‖ < 2^{-t} n^{5/2} [2.1 n^{1/2} ‖A^{-1}‖ + 1.8].   (23.22)

The more dangerous term is the (2.1)2^{-t}n^3‖A^{-1}‖ and this comes primarily
from the rounding errors made in computing R̄A at each stage.
We cannot gain very much by working in fixed-point because there is no stage
at which we can accumulate a scalar product of any magnitude. Care must be
exercised in computing cos θ and sin θ. The most satisfactory technique is to
accumulate (b^2 + c^2) exactly and then scale it by the largest 2^{2k} for which
2^{2k}(b^2 + c^2) ≤ 1. We then calculate cos θ and sin θ from 2^k b and 2^k c. The com-
puted matrix is rather closer to an orthogonal matrix than for floating-point
computation and further, only one rounding error is made in computing each of
(x cos θ + y sin θ) and (-x sin θ + y cos θ). The gain in accuracy is at best
equivalent to a small constant factor and we cannot gain by a factor of n^{1/2}.

Householder's Reduction
24. Only (n - 1) transformations are involved in the Householder reduction.
At the rth step we multiply by a matrix of the form (I - 2w_r w_r^T) where w_r^T
is of the form

(0, 0, ⋯, 0, x_r, x_{r+1}, ⋯, x_n);   Σ x_i^2 = 1   (24.1)

and this produces zeros below the diagonal in the rth column. It leaves unchanged
the first (r - 1) rows of A and B and the first (r - 1) columns of A, so that
(n - r + 1)^2 elements of A are modified and n(n - r + 1) elements of B.
We are forced to round these elements before going on to the next stage so that
we have no hope of having sharper bounds for F_r and G_r than

‖F_r‖ ≤ (n - r + 1)2^{-t-1};   ‖G_r‖ ≤ n^{1/2}(n - r + 1)^{1/2} 2^{-t-1}.   (24.2)

This gives for N and P,

‖N‖ ≤ ½ n^2 2^{-t-1};   ‖P‖ ≤ (2/3) n^2 2^{-t-1}   (24.3)

even if we take a, defined in Section 21, to be zero. The most we can hope to gain
over Givens therefore is a factor of n^{1/2}, and perhaps a small constant factor,
if we compute (I - 2w_r w_r^T)A_r and (I - 2w_r w_r^T)B_r with sufficient care.
If we work in conventional floating-point arithmetic, we do not gain even the
n^{1/2} factor. As before it is easy to see that the departure from orthogonality is
slight and therefore unimportant, and we may concentrate on the errors made in
computing A_{r+1} itself.
We assume that we have an exact w_1 and estimate the errors made in computing
A_2 ≡ A_1 - 2w_1w_1^T A_1. The first stage is typical so we deal with A_1 -
2w_1w_1^T A_1. We have

p_1^T ≡ fl(w_1^T A_1) ≡ w_1^T A_1 + q_1^T (say)   (24.4)

and q_{1k} may be written in the form

q_{1k} ≡ x_1 a_{1k} ε_1 + x_2 a_{2k} ε_2 + ⋯ + x_n a_{nk} ε_n   (24.5)

where

|ε_i| ≤ (n + 1 - i)2^{-t}.

Now Σ x_i^2 = 1 and Σ_i a_{ik}^2 = 1 (the latter persisting approximately at later
stages), but we cannot conclude anything sharper than

|q_{1k}| < n 2^{-t};   ‖q_1‖ < n^{3/2} 2^{-t}.   (24.6)

When computing fl(w_1w_1^T A_1), the contribution from q_1 alone is ‖w_1‖ ‖q_1‖
and the only bound we have for this is n^{3/2} 2^{-t}. Similarly in the later stages we
have contributions of magnitude r^{3/2} 2^{-t} and together they give the term n^{5/2} 2^{-t}.
In practice [1], it is convenient, in floating-point computation, to express the
equations in the form exhibited in equations (24.7) (the first step being typical).

A_2 = A_1 - Ku_1u_1^T A_1
u_1^T = (a_11 ± S, a_21, ⋯, a_n1);   S^2 = a_11^2 + a_21^2 + ⋯ + a_n1^2   (24.7)
K = 1/(S^2 ± a_11 S), where ±S has the sign of a_11.

This form avoids unnecessary square roots, but does not lead to a lower error
bound.
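A minimal sketch of the first step in the form (24.7) (hypothetical data, 0-based indices, double precision): the transformation annihilates the first column below the diagonal and, being orthogonal in exact arithmetic, preserves the Euclidean column norms to working accuracy.

```python
import numpy as np

def householder_step(A):
    """One step of (24.7): A2 = A1 - K*u1*(u1^T A1); one square root per step."""
    a = A[:, 0]
    S = np.sqrt(np.sum(a * a))            # S^2 = a_11^2 + a_21^2 + ... + a_n1^2
    sign = 1.0 if a[0] >= 0 else -1.0     # +/-S takes the sign of a_11
    u = a.copy()
    u[0] += sign * S                      # u1^T = (a_11 +/- S, a_21, ..., a_n1)
    K = 1.0 / (S * S + sign * A[0, 0] * S)
    return A - K * np.outer(u, u @ A)

rng = np.random.default_rng(2)
A1 = rng.standard_normal((6, 6))
A2 = householder_step(A1)
```

After the step, |A2[0, 0]| equals the norm of the original first column and the entries below it vanish.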
In fixed-point arithmetic we can indeed realize the gain of the factor n^{1/2}, but
only on a computer with good arithmetic facilities would this be worthwhile.
To ensure an accurate determination of w, S^2 should be accumulated exactly
and then multiplied by 2^{2k} where k is the greatest integer for which 2^{2k}S^2 is not
greater than unity. The vector w can then be calculated using 2^k a_{i1} instead of
a_{i1}. This gives a very accurate determination of w and we need only concern
ourselves with the errors made in computing (A_1 - 2w_1w_1^T A_1). In forming
p_1^T = fl(w_1^T A_1) we can accumulate scalar products, and since the columns of A_1
have norms bounded by unity, the elements of w_1^T A_1 are less than unity. If we
write

p_1^T ≡ fl(w_1^T A_1) ≡ w_1^T A_1 + q_1^T,   (24.8)

then all elements of q_1 are less than 2^{-t-1} and hence ‖q_1‖ ≤ n^{1/2} 2^{-t-1}. Although
the elements of 2w_1w_1^T A_1 may be greater than unity, we know that those of
A_1 - 2w_1w_1^T A_1 are not, and therefore no harm is done by the overspill of the
former. To be safe throughout, we must scale A originally so that its columns are
sufficiently less than unity for those of A_r to be less than unity throughout.
With sufficient care we can obtain a computed inverse X satisfying

‖X - A^{-1}‖ / ‖A^{-1}‖ < 2^{-t} n^2 [a ‖A^{-1}‖ + b]   (24.9)

where a and b are of order unity. For a matrix with normalization (I) this gives

‖X - A^{-1}‖ / ‖A^{-1}‖ < 2^{-t} n^2 [a n^{1/2} ‖A^{-1}‖ + b].   (24.10)

Comparison of Methods for Inverting Unsymmetric Matrices


25. The method of Givens requires four times as many multiplications as
Gaussian inversion and therefore a fair comparison is between single-precision
Givens inversion and double-precision Gaussian. On this basis the Gaussian solu-
tion is much the better. The method of Householder on the other hand requires
only twice as many multiplications as Gaussian inversion. However, we have
already seen [(17.7) and Section 9, (T)] that if we are prepared to perform Gaus-
sian elimination twice we can expect an inverse satisfying

‖X - A^{-1}‖ / ‖A^{-1}‖ < (2.0)2^{-t} n^2 R ‖A^{-1}‖.

This is more favorable than that for Householder, even when the computation
is carried out with extreme care, provided R, the pivot ratio, is less than n^{1/2}.
In our experience this is almost invariably true and there are many important
types of matrix for which we know in advance that it will be true.
Comparing Gaussian elimination with Householder when both are performed
in floating-point, we have a relative error of 2^{-t}n^{5/2}R‖A^{-1}‖ for the former and
2^{-t}n^3‖A^{-1}‖ for the latter. In our view the presence of the factor R springs
rather from the exigencies of the analysis than from any true shortcoming, and
the case for the Householder is slight.
The other method for which we have an error analysis is that of Goldstine and
von Neumann based on the inversion of the positive definite matrix AA^T.
However, this requires 2n^3 multiplications and has ‖A^{-1}‖^2 in place of ‖A^{-1}‖
in the error terms. If A is at all ill-conditioned, this method has nothing to recom-
mend it.

General Comments on Results


26. The bounds we have obtained are in all cases strict upper bounds. In
general, the statistical distribution of the rounding errors will reduce consider-
ably the function of n occurring in the relative errors. We might expect in each
case that this function should be replaced by something which is no bigger than
its square root and is usually appreciably smaller. For example, in forming a
scalar product in floating-point we have used the result

fl(Σ_{i=1}^{n} a_i b_i) ≡ Σ a_i b_i (1 + ε_i)   (26.1)

and have then replaced the ε_i by their upper bounds, (n + 1 - i)2^{-t}. Now for these
bounds to be attained not only must all the individual rounding errors have
their greatest values but the a_i b_i must have a special distribution.
On the other hand the term ‖A^{-1}‖ which invariably occurs in the relative
error is not usually subject to any such considerations. It is useful to have some
simple standard by which to judge the relative error and this is provided by the
following considerations. Consider the effect on the inverse of perturbations of
order 2^{-t} in the elements of A. We have

‖(A + E)^{-1} - A^{-1}‖ ≤ ‖A^{-1} E A^{-1}‖ / (1 - ‖E A^{-1}‖).   (26.2)

If these perturbations are not specially correlated, then we may write

‖E A^{-1}‖ ≤ ‖E‖ ‖A^{-1}‖   (26.3)

and expect that the right-hand side is a reasonable approximation to the left.

We have therefore

‖(A + E)^{-1} - A^{-1}‖ / ‖A^{-1}‖ ≤ ‖E‖ ‖A^{-1}‖ / (1 - ‖E‖ ‖A^{-1}‖)
                                ≤ n 2^{-t} ‖A^{-1}‖ / (1 - n 2^{-t} ‖A^{-1}‖)   (26.4)

and we would expect to be able to choose some E for which the right-hand side
was not a severe over-estimate. If then a method of inversion gives an inverse X
for which

‖X - A^{-1}‖ / ‖A^{-1}‖ < f(n) 2^{-t} ‖A^{-1}‖,   (26.5)

we could say that the errors in the inverse are of the order of magnitude of those
which could be caused by perturbations of order (1/n)f(n)2^{-t} in the individual
elements of A. It might be thought that the computed X would be the exact inverse of
some matrix (A + F) where the elements of F were of order (1/n)f(n)2^{-t}, but this
is seldom true. Indeed the elements of the inverse of X usually differ from those of A
by quantities which are of the order of (1/n)f(n)2^{-t}‖A^{-1}‖. Since ‖A^{-1}‖ will be
very large if A is ill-conditioned, this last result is a little surprising. However
it is, in general, true that the rth column of X is the rth column of the exact
inverse of (A + E_r) where E_r has elements of order (1/n)f(n)2^{-t}, but it requires
a different E_r for each column. The reader may readily verify these comments
for an ill-conditioned 2 × 2 matrix. They are true equally of symmetric and
unsymmetric matrices.
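The suggested 2 × 2 verification can be sketched as follows (a hypothetical matrix, with 6-figure decimal rounding standing in for t-digit arithmetic): the rounded inverse X has a tiny relative error, yet the inverse of X differs from A by far more than the rounding level, in line with (26.5).

```python
import numpy as np

def round_sig(x, p=6):
    """Round to p significant decimal figures (a crude model of t-digit working)."""
    return float(f"{x:.{p}g}")

A = np.array([[1.2, 1.5],
              [2.0, 2.501]])                     # det = 0.0012, so A is ill-conditioned

A_inv = np.linalg.inv(A)
X = np.vectorize(round_sig)(A_inv)               # "computed" inverse, 6 figures

rel_err_X = np.max(np.abs(X - A_inv)) / np.max(np.abs(A_inv))
back_err = np.max(np.abs(np.linalg.inv(X) - A))  # how far X^{-1} is from A
```

X agrees with A^{-1} to about six figures, while X^{-1} reproduces A only to about three: the errors behave like perturbations of order f(n)2^{-t}‖A^{-1}‖ in the elements of A, as claimed.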

High Accuracy of Computed Inverses


27. It is a common experience that even when the most generous estimate is
made for the reduction of the error by statistical effects, computed inverses are
much more accurate than might be expected. It is unlikely that there is a single
simple explanation of this phenomenon, but the following example in which the
effect is very marked does give considerable insight into some of the main
causes. We consider the inversion, working with 8 decimals, of a symmetric seg-
ment of order five of the Hilbert matrix. The matrix is positive definite and
‖A^{-1}‖ is approximately 10^6. It is so ill-conditioned that truncation of the ele-
ments of the original matrix to 8 decimals already affects the second figure of
the inverse. This is to be expected since ‖A^{-1}‖ ‖E‖ ‖A^{-1}‖ is of order
(10^6)^2 × 10^{-8}, i.e. 10^4.
Naturally we compare our computed inverse with the exact inverse of the
truncated matrix and not with that of the exact Hilbert segment! We denote
the exact Hilbert segment by H, the matrix with the truncated elements by H̄,
and the computed inverse of H̄ by X. Now we see from Table 3 that

max |H^{-1} - H̄^{-1}| ≈ 6,000.0.   (27.1)

On the other hand, for X computed by the techniques of Section 19, we see that

max |H̄^{-1} - X| ≈ 1.0.   (27.2)

The total effect of all the rounding errors made in the process of solution is far less
than those which come from the initial truncation.
There are two main phenomena which account for the remarkable accuracy
of the solution.
(i) We saw in Section 18 that L̄L̄^T = H̄ + E and that a typical element of
E is bounded by 2^{-t-1}|l_{ii}|. If some of the elements |l_{ii}| are much less than unity,
the bounds for the corresponding elements of E are far smaller than 2^{-t}. Now
because H̄ is so ill-conditioned, the l_{ii} are small and become progressively smaller
with increasing i. The matrix E is displayed and it will be seen that its elements
also become progressively smaller as we move to the bottom right-hand corner.
Here they are far smaller than ½·10^{-8} (the optimum value which we might ex-
pect for 8-decimal computation). Note that L̄L̄^T is so close to H̄ precisely be-
cause H̄ is so ill-conditioned!
We now have to consider the effect of the errors E on the inverse. The per-
turbation resulting from a change ε in the (i, j)-element is

-ε (ith column of H̄^{-1})(jth row of H̄^{-1}) / (1 + ε(H̄^{-1})_{ji});

now the denominator is approximately unity and the ith column and jth row of
H̄^{-1} increase generally in size with increasing i and j. This means that the small-
est elements of E are in precisely those positions which have the most effect on
the inverse. The result is that we do not get the full effect of the ‖H̄^{-1}‖ in the
computed inverse. The effect we have just noticed is very common in ill-condi-
tioned matrices with elements which are of a systematic nature. It becomes even
more marked if we consider Hilbert segments of higher order. The presence of
very small elements in E is also quite common in ill-conditioned matrices which
are less obviously of this type.
(ii) The second phenomenon is of a more general nature and is associated
with the inversion of triangular matrices. We have seen that for a matrix A of
general form, the effect of perturbations in the end figures is to introduce a rela-
tive error which is usually proportional to ‖A^{-1}‖, and an absolute error which is
proportional to ‖A^{-1}‖^2. Such an effect is avoided only if the rounding errors are
peculiarly correlated. Now whereas for general matrices such a correlation is
very uncommon, it is so common for triangular matrices as to constitute the
rule rather than the exception (Sect. 28). In formal terms we have, quite com-
monly, for the computed inverse of a triangular matrix,

‖X - A^{-1}‖ / ‖A^{-1}‖ < a(n) 2^{-t}   (27.3)

in contrast to the more common result for a general matrix,

‖X - A^{-1}‖ / ‖A^{-1}‖ < b(n) 2^{-t} ‖A^{-1}‖.   (27.4)

This means that for triangular matrices it is common for the relative error to be
unaffected by the condition of the matrix. The triangular matrix in Table 3 is
one such example. None of the elements in the computed inverse X has an error

[Table 3: entries not recoverable in this copy.]

which is greater than one in the last figure retained in the computation. This means
that the errors made in the triangular inversion are negligible and we already
know that the errors involved in computing XX^T are negligible. The combined
effect is to produce an inverse of quite astonishing accuracy. The total effect
of all the rounding errors is far smaller than that corresponding to a single per-
turbation of ½·10^{-8} in the (5, 5) element of H̄.
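The Hilbert example can be re-run in modern arithmetic. The sketch below is an illustration, not the paper's computation: binary double precision stands in for 8-decimal working, so the solution errors are even smaller than the ≈1.0 of (27.2). It computes the exact inverses of H and of its 8-decimal truncation H̄ in rational arithmetic, and confirms that the truncation alone perturbs the inverse by a large amount while the rounding errors of the inversion itself contribute incomparably less.

```python
import numpy as np
from fractions import Fraction

def exact_inv(M):
    """Gauss-Jordan inversion in exact rational arithmetic."""
    n = len(M)
    A = [[Fraction(M[i][j]) for j in range(n)]
         + [Fraction(int(i == j)) for j in range(n)] for i in range(n)]
    for c in range(n):
        p = next(r for r in range(c, n) if A[r][c] != 0)
        A[c], A[p] = A[p], A[c]
        piv = A[c][c]
        A[c] = [v / piv for v in A[c]]
        for r in range(n):
            if r != c and A[r][c] != 0:
                f = A[r][c]
                A[r] = [v - f * w for v, w in zip(A[r], A[c])]
    return [row[n:] for row in A]

n = 5
H = [[Fraction(1, i + j + 1) for j in range(n)] for i in range(n)]
Ht = [[Fraction(int(10**8 * h), 10**8) for h in row] for row in H]  # truncated to 8 decimals

H_inv, Ht_inv = exact_inv(H), exact_inv(Ht)
trunc_effect = max(abs(float(a - b)) for ra, rb in zip(H_inv, Ht_inv)
                   for a, b in zip(ra, rb))

X = np.linalg.inv(np.array([[float(v) for v in row] for row in Ht]))
float_err = max(abs(X[i, j] - float(Ht_inv[i][j])) for i in range(n) for j in range(n))
```

Here trunc_effect (the analogue of (27.1)) runs to hundreds or thousands, while float_err, the whole contribution of the inversion's rounding errors, is many orders of magnitude smaller.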

The Inversion of Triangular Matrices


28. The remarkable property of triangular matrices mentioned in the last
section is of such widespread significance in practice that it is perhaps worth
indicating the types of triangular matrix which possess it.
Perhaps the most important class is of those having positive diagonal elements
and negative off-diagonal elements. We consider, without loss of generality,
lower triangular matrices and show that if x̄_{ij} is the computed (i, j)-element of
the inverse and x_{ij} the exact, then

x̄_{ij}/x_{ij} = (1 + η_{ij});   |η_{ij}| ≤ 3(i + 1 - j)2^{-t}   (28.1)

however ill-conditioned L may be. The proof is by induction. We need consider
only the first column of the inverse, since in general the rth column is derived
from the triangular matrix obtained by setting the first (r - 1) columns of L
equal to zero and this has the same properties. Assuming that (28.1) is true with
j = 1 for i = 1, ⋯, r - 1, we have, with an abbreviated notation in which
(1 ± θ_m 2^{-t}) denotes a factor (1 + δ) with |δ| ≤ m·2^{-t}:

x̄_{r1} ≡ -fl[(l_{r1}x̄_{11} + l_{r2}x̄_{21} + ⋯ + l_{r,r-1}x̄_{r-1,1})/l_{rr}]
      ≡ -[{l_{r1}x̄_{11}(1 ± θ_{r+1}2^{-t}) + l_{r2}x̄_{21}(1 ± θ_r 2^{-t}) + ⋯
           + l_{r,r-1}x̄_{r-1,1}(1 ± θ_3 2^{-t})}/l_{rr}]   (28.2)
      ≡ -[{l_{r1}x_{11}(1 ± θ_{r+4}2^{-t}) + l_{r2}x_{21}(1 ± θ_{r+6}2^{-t}) + ⋯
           + l_{r,r-1}x_{r-1,1}(1 ± θ_{3r}2^{-t})}/l_{rr}].

Now from our assumption about the signs of the l_{ij}, all x̄_{i1} and x_{i1} are positive.
Further we have

x_{r1} ≡ -[(l_{r1}x_{11} + l_{r2}x_{21} + ⋯ + l_{r,r-1}x_{r-1,1})/l_{rr}],   (28.3)

and hence x̄_{r1}/x_{r1} lies between (1 + 3r·2^{-t}) and (1 - 3r·2^{-t}); this completes
the proof.
The nature of the proof shows that the result is not sharp and we would expect
something appreciably better than (28.1), perhaps

x̄_{ij}/x_{ij} = (1 ± [3(i + 1 - j)]^{1/2} 2^{-t}).   (28.4)

Even for very high order matrices this gives a very good result. It shows that if
we use floating-point arithmetic, then even the small elements have a low relative
error. Matrices of this kind are produced when Gaussian elimination is performed
on the matrices derived from finite-difference approximations to elliptic partial
differential equations.
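The class just described is easy to exhibit (a hypothetical example: unit diagonal, constant negative off-diagonal elements). Although ‖L^{-1}‖ is large, forward substitution in double precision gives every element of the first column of the inverse with a tiny relative error, as (28.1) predicts. The exact reference values are obtained by repeating the recurrence in rational arithmetic on the stored floating-point entries.

```python
import numpy as np
from fractions import Fraction

n = 30
L = np.eye(n) - 0.9 * np.tril(np.ones((n, n)), -1)  # positive diagonal, negative off-diagonal

# First column of L^{-1} by forward substitution in double precision.
x = np.zeros(n)
x[0] = 1.0
for i in range(1, n):
    x[i] = -np.dot(L[i, :i], x[:i]) / L[i, i]

# The same recurrence in exact rational arithmetic on the stored entries.
xe = [Fraction(1)]
for i in range(1, n):
    s = sum(Fraction(L[i, k]) * xe[k] for k in range(i))
    xe.append(-s / Fraction(L[i, i]))

max_rel_err = max(abs((Fraction(x[i]) - xe[i]) / xe[i]) for i in range(n))
```

The inverse column grows past 10^7 (the matrix is badly conditioned), yet the relative error of every computed element remains at roundoff level, unaffected by the condition of L.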
When complete pivoting for size is used then we can prove a much weaker
result which is, however, independent of the sign distribution of the elements of L.
We show that if |l_{ij}| ≤ |l_{ii}|, then for fixed-point computation with a fixed
scale factor,

|x̄_{ij} - x_{ij}| ≤ 2^{i+1-j} max_k |x_{kj}| 2^{-t}.   (28.5)

The proof of this is immediate by induction, since we have, typically for the first
column,

x̄_{r1} ≡ -fl[(l_{r1}x̄_{11} + l_{r2}x̄_{21} + ⋯ + l_{r,r-1}x̄_{r-1,1})/l_{rr}]
      ≡ -[(l_{r1}x̄_{11} + l_{r2}x̄_{21} + ⋯ + l_{r,r-1}x̄_{r-1,1})/l_{rr}] ± 2^{p-t-1}   (28.6)

where 2^p is the scale factor. Hence, since |l_{rk}/l_{rr}| ≤ 1,

|x̄_{r1} - x_{r1}| ≤ |x̄_{11} - x_{11}| + ⋯ + |x̄_{r-1,1} - x_{r-1,1}| + 2^{p-t-1}
              ≤ (2^1 + 2^2 + ⋯ + 2^{r-1}) max_k |x_{k1}| 2^{-t} + 2^{p-t-1}   (28.7)
              ≤ 2^r max_k |x_{k1}| 2^{-t}

since

2^p ≤ 2 max_k |x_{k1}|.
This result is not very spectacular but, for small order matrices, 2^n may be far
smaller than ‖L^{-1}‖. Hence if we have used complete pivoting on a matrix of
low order we are certain to get a "comparatively good" result when the matrix
is ill-conditioned.
When Gaussian elimination with complete pivoting is used on a matrix which
has one isolated very small eigenvalue then it is common for "all the ill-condi-
tion" to be concentrated in the final element. This happens, for example, when
(A - λI) is reduced to triangular form for a value of λ which is very close to an
isolated eigenvalue. The triangular matrix is then commonly of the form typified
in (28.8):

⎡.31265  .12321  .21623⎤
⎢        .41357  .41632⎥   (28.8)
⎣                .00001⎦

When this set of equations is solved with any right-hand side, the computed
solution has a low relative error in all components. It is commonly believed that
it is a poor strategy to use a method of reduction that concentrates all the small-
ness in the last pivot since it means that this last pivot has a high relative error.
In our opinion there is no substance in this belief. The emergence of a small last
pivot shows that the matrix is ill-conditioned, since whatever pivoting may have
been done, the reciprocal of the last pivot is an element of the inverse matrix.
On a computer on which scalar products can be accumulated, it is usually better
to use a pivotal strategy which results in a very small last pivot than one which
does not. In the example of the last section for instance it is better to work with
the matrix as it was, rather than working with the pivots in positions 4, 5, 3, 2
and 1 in that order, in spite of the fact that the former leads to the smaller value
for the last pivot. If scalar products cannot be accumulated, then there is a very
slight advantage in a strategy which does not make the last pivot very small.
It should be realized that the harm that comes from a small last pivot is pri-
marily a property of the matrix and is not something induced by the method of
solution. When scalar products are not accumulated exactly, the perturbation in
the original element of A which corresponds to the last pivot is the sum of
(n - 1) rounding errors, whereas that in the element corresponding to the rth
pivot is the sum of (r - 1) rounding errors. Ideally we should try to make the
last pivot correspond to the element (i, j) for which the product of the norms of
the ith column and the jth row of A^{-1} is as small as possible. There does not seem
to be any practical strategy which assures this.
The triangular matrix of Section 27 does not belong to any of the classes we
have mentioned so far. We now describe a much larger class of matrices for which
our result is true. Consider the solution of a lower triangular set of equations in
fixed-point arithmetic, using a constant scale factor 2^k. Let X̄ be the computed
inverse and X the true inverse. We have

L X̄ ≡ I + E   (28.9)

where

|e_{ij}| ≤ 2^{k-t-1}|l_{ii}|,  i.e.  |E| ≤ 2^{k-t-1} diag(|l_{11}|, |l_{22}|, ⋯, |l_{nn}|) T,   (28.10)

T being the lower triangular matrix whose non-zero elements are all unity.
Suppose X is such that |x_{ij}/x_{jj}| ≤ C for i > j. Now the exact solution of (28.9)
is X + XE, so that the error matrix X̄ - X is bounded by |XE| and we have

|XE| ≤ |X| |E|.   (28.11)

Hence, since x_{ii} = 1/l_{ii},

(|X| |E|)_{ij} ≤ 2^{k-t-1} Σ_m |x_{im}| |l_{mm}| = 2^{k-t-1} Σ_m |x_{im}/x_{mm}| ≤ Cn 2^{k-t-1},   (28.12)

giving

max |(XE)_{ij}| < Cn 2^{k-t-1} ≤ Cn 2^{-t} max |x̄_{ij}|   (28.13)

since the scale factor satisfies 2^k ≤ 2 max |x̄_{ij}|. This shows that the error in any
element of X̄ is a small multiple of its maximum element. In the example of the
last section, C is less than 2.5 and the ratio |x_{ij}/x_{jj}| is less than one for many
of the components.
However, the result we have just proved should not mislead us into thinking
that it is essential that the components of X should not increase rapidly as we
move along any column in a direction away from the diagonal. On the contrary
this type of behavior is usually extremely favorable. Consider, for example, the
matrix of order n with elements

⎡'0.1'                        ⎤
⎢'1.0'  '0.1'                 ⎥
⎢       '1.0'  '0.1'          ⎥   (28.14)
⎢              ⋱      ⋱      ⎥
⎣                '1.0'  '0.1' ⎦

where we use 'a' to denote a number of the order of magnitude of a, but with
end figures which will lead to rounding errors. Now the inverse of this matrix
contains an element of order 10^n. If n is at all large it is therefore very ill-condi-
tioned. However, it is obvious in the light of our previous analysis that the ele-
ments of the computed solution have a very low relative error.
An attempt to construct a matrix for which the result is not true reveals how
widespread the phenomenon is likely to be. An example is the matrix

⎡10^{-10}    .9     -.4   ⎤
⎢            .9     -.4   ⎥   (28.15)
⎣                  10^{-10}⎦

The last column of the exact inverse is [0, (4/9)10^{10}, 10^{10}]. The last column of the
computed inverse is [4 × 10^9; 4444444444; 10^{10}], so that we really do have a
computed inverse with a relative error of order ‖A^{-1}‖ × 10^{-10}. A matrix of
low order must be very specially designed if it is to give such a result. If we have
used complete pivoting, then we cannot have such examples in matrices of low
order.
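The example (28.15) can be reproduced by simulating 10-significant-figure decimal rounding in the back-substitution for the last column of the inverse (an illustrative model of the arithmetic; `round10` is our own helper):

```python
import numpy as np

def round10(v):
    """Round to 10 significant decimal figures (a model of the working precision)."""
    return float(f"{v:.10g}")

U = np.array([[1e-10, 0.9, -0.4],
              [0.0,   0.9, -0.4],
              [0.0,   0.0, 1e-10]])

# Back-substitution for the last column of U^{-1}, rounding each result.
x3 = round10(1.0 / U[2, 2])                             # 1e10
x2 = round10(-U[1, 2] * x3 / U[1, 1])                   # 4444444444
x1 = round10(-(U[0, 1] * x2 + U[0, 2] * x3) / U[0, 0])  # about 4e9; exact value is 0
```

The computed x1 ≈ 4 × 10^9 arises entirely from the cancellation of 0.9·x2 against 0.4·10^{10}, so the error relative to the exact value 0 is of order ‖A^{-1}‖ × 10^{-10}, exactly as the text states.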
Selecting three matrices at random of order 18, 25 and 43 from moderately
ill-conditioned unsymmetric matrices inverted on DEUCE, we found in all cases
that the inversion of the computed L and the computed U and the computation of
U^{-1}L^{-1} made practically no contribution to the errors in the inverse. This shows
the great importance of keeping the difference (LU - A) as small as possible,
since this is usually the main source of error. If partial pivoting for size is used
it is easy to ensure that |(LU - A)_{ij}| < R·2^{-t-1} where R is the pivotal growth
factor [7]. It does not appear to be possible with complete pivoting, without
using twice as much storage as in the ordinary reduction or alternatively doing
the computation twice.
It is natural to ask why this accuracy of computed inverses of triangular
matrices was not revealed by our earlier analysis. The reason is that the analysis
is based on an assessment of ‖X - L^{-1}‖ derived from a bound for ‖LX - I‖.
Now the norm of (LX - I) is not significantly smaller in the cases when X
has the exceptional accuracy than in cases when it has not. The size of the resid-
ual gives a very crude estimate of the accuracy of X, and the range covered by
approximate inverses having residual matrices of a prescribed size is remarkably
wide when the original matrix is ill-conditioned. We are able to put the residual
matrix to such good use when analyzing computed inverses of general matrices
only when the methods used are such as to bias the residual in such a way as to
be almost as small as possible having regard to the accuracy of the computed
inverse.
The really important norm is not that of (LX - I) = E (say) but that of
L^{-1}E. Now it happens that for triangular matrices, E can usually be expressed
in the form LF where F has much the same norm as E. This gives L^{-1}E much
the same norm as E itself. We can now see why it is that the progressive intro-
duction of scale factors in the inversion of a triangular matrix usually gives a
better result than that in which the same scale factor is used throughout, al-
though the latter gives the smaller residual. We may illustrate it by taking an
extreme case. Suppose no scale factor is required until the very last stage, and
this requires quite a large scale factor, 2^k. The computed values x_n, x_{n-1}, ⋯,
x_2 will then have to be rounded to k less figures to give x_n', x_{n-1}', ⋯, x_2' with
|x_i - x_i'| ≤ 2^{k-t-1}. The contributions to the residuals corresponding to this
change are precisely L(x - x').
Now although this may constitute the major part of the residual as far as
magnitude is concerned, it is obvious that the solution of Ly = L(x - x') is
(x - x'), whereas if the other part of the residual is E_1 the solution of Ly = E_1
is L^{-1}E_1 and may be very large. The increased size of the residual is entirely
misleading. On the other hand, because the early stages have been done without
the scale factor, the residuals just before computing the last element x_n are
bounded by 2^{-t-1}|l_{ii}| instead of 2^{k-t-1}|l_{ii}|.
When the residual matrix R, defined by (AX - I), is used to assess the error
in an approximate inverse X, then we can, in general, make no claim unless
|| R || < 1. Thus Goldstine and von Neumann [3, p. 1080] say that unless
|| R || < 1, even the null matrix gives a better residual. Although this is true it is
reasonable to regard a matrix X, for which || X - A^-1 || / || A^-1 || << 1, as a
much better inverse than the null matrix. Indeed if a normalized matrix A has
an inverse such that || A^-1 || is of order 10^20, then if X is the matrix obtained
by rounding the elements of A^-1 to 10 significant figures, the absolute errors in
some elements of X may be as large as (1/2) 10^10 and the residual matrix may have
components as large as (n/2) 10^10. Yet it would be reasonable to regard X as a
"good" inverse.
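A small-scale analogue of this remark can be checked directly (the 2x2 matrix below is my own, chosen only so that det A = -10^-6 exactly): rounding the exact inverse to two significant figures gives a residual far bigger than 1, yet the relative error in the inverse is below one percent.

```python
# Sketch (my own numbers, not the paper's): an ill-conditioned 2x2 matrix
# whose rounded inverse has a huge residual AX - I but a small relative
# error.  Exact rational arithmetic avoids any incidental rounding.
from fractions import Fraction as F

A = [[F('0.835'), F('0.667')],
     [F('0.333'), F('0.266')]]          # det = -1e-6 exactly

d = A[0][0] * A[1][1] - A[0][1] * A[1][0]
Ainv = [[ A[1][1] / d, -A[0][1] / d],
        [-A[1][0] / d,  A[0][0] / d]]   # entries are exact integers ~ 10^6

# Round each element of the inverse to two significant figures.
X = [[F(round(v / 10000)) * 10000 for v in row] for row in Ainv]

# Residual R = AX - I, computed exactly.
R = [[sum(A[i][k] * X[k][j] for k in range(2)) - (i == j)
      for j in range(2)] for i in range(2)]

res_norm = max(abs(v) for row in R for v in row)
rel_err = (max(abs(X[i][j] - Ainv[i][j]) for i in range(2) for j in range(2))
           / max(abs(v) for row in Ainv for v in row))
print(res_norm)               # 5341: far bigger than 1
print(float(rel_err) < 0.01)  # True: X is nevertheless a "good" inverse
```

The null matrix would indeed give a smaller residual norm, but X is close to A^-1 in the sense that matters.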

Failure of Partial Pivoting Strategy


29. That the partial pivoting strategy can fail completely on a well-conditioned
matrix is well illustrated by the following example. Consider matrices of the
form illustrated in (29.1):

    | +1   0   0   0   0  +1 |
    | +1  +1   0   0   0  -1 |
    | -1  +1  +1   0   0  +1 |        (29.1)
    | +1  -1  +1  +1   0  -1 |
    | -1  +1  -1  +1  +1  +1 |
    | +1  -1  +1  -1  +1  -1 |
The inverses are of the form illustrated in (29.2):
               | +2^-1  +2^-2  -2^-3  +2^-4  -2^-5  +2^-5 |
               |   0    +2^-1  +2^-2  -2^-3  +2^-4  -2^-4 |
    A_6^-1  =  |   0      0    +2^-1  +2^-2  -2^-3  +2^-3 |        (29.2)
               |   0      0      0    +2^-1  +2^-2  -2^-2 |
               |   0      0      0      0    +2^-1  +2^-1 |
               | +2^-1  -2^-2  +2^-3  -2^-4  +2^-5  -2^-5 |

Consider now a matrix A_31 of this form. It is clearly a very well-conditioned
matrix. Let B_31 be the matrix obtained by replacing its (31, 31)-element by
1/2. This matrix is also well-conditioned and its inverse differs from that of A_31 by
+1/2 (last column of A_31^-1) (last row of A_31^-1) / (1 - 2^-31). Some of the alterations
are therefore as big as 2^-3, so that the inverses differ in the first decimal. Now if
we invert these matrices by Gaussian elimination with the partial pivotal strat-
egy using floating-point computation with a 30 binary digit mantissa, then in the
reductions all multipliers are identical and all reduced matrices differ by exactly
1/2 in the (31, 31)-element until we reach the final step. In the very last step
A_31 should give the value 2^30 and B_31 should give the value 2^30 - 1/2 for the final
element of the triangle. The latter is rounded to 2^30, so that both A_31 and B_31
give identical upper triangular matrices. They therefore produce identical L
and U matrices and therefore identical inverses! That of the B_31 matrix has
errors of norm greater than (0.1) || B_31^-1 ||. The use of exact numbers plays no essen-
tial part but merely aids the demonstration. If we replace all elements of A_31
by elements differing only in the last five binary digits but such that the pivotal
decisions are unaltered, then we will obtain a poor inverse in nearly all cases.
We know from our a priori analysis that the complete pivotal strategy gives a
good inverse for A_31, B_31 and all the perturbations of A_31.
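The mechanism can be reproduced at a smaller size (a sketch, with n = 11 and a 10-bit mantissa standing in for n = 31 and 30 bits; the +1/-1 pattern used below is inferred from (29.1)-(29.2) and is my reconstruction):

```python
# A scaled-down sketch of the Section 29 example (n = 11 and a 10-bit
# mantissa here stand in for n = 31 and 30 bits; the sign pattern is
# inferred from (29.1)-(29.2)).  With partial pivoting the last pivot of
# A_n is (-2)^(n-1); replacing the (n, n)-element by 1/2 lowers the final
# pivot by exactly 1/2, and a t-bit mantissa rounds that difference away.
from fractions import Fraction

def build_A(n):
    A = [[Fraction(0)] * n for _ in range(n)]
    for i in range(n):
        for j in range(i):
            A[i][j] = Fraction((-1) ** (i - j + 1))  # +1 next to the diagonal
        A[i][i] = Fraction(1)
        A[i][n - 1] = Fraction((-1) ** i)            # last column +1, -1, ...
    return A

def last_pivot(A):
    """Gaussian elimination with partial pivoting; return the final pivot."""
    n = len(A)
    A = [row[:] for row in A]
    for k in range(n - 1):
        p = max(range(k, n), key=lambda r: abs(A[r][k]))
        A[k], A[p] = A[p], A[k]
        for i in range(k + 1, n):
            m = A[i][k] / A[k][k]
            for j in range(k, n):
                A[i][j] -= m * A[k][j]
    return A[n - 1][n - 1]

def fl(x, t):
    """Round a positive rational to a t-bit mantissa (round to nearest)."""
    e = 0
    while x >= 2 ** e: e += 1
    while x < 2 ** (e - 1): e -= 1
    s = Fraction(2) ** (t - e)
    return Fraction(round(x * s)) / s

n = 11
A = build_A(n)
B = build_A(n); B[n - 1][n - 1] = Fraction(1, 2)

pa, pb = last_pivot(A), last_pivot(B)
print(pa, pb)            # 1024 2047/2
print(fl(pb, 10) == pa)  # True: a 10-bit mantissa cannot tell them apart
```

After the final pivot is rounded, the two eliminations are indistinguishable, exactly as in the 31 x 31 case.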
The example also exposes a common fallacy. In the computation with B_31 all
the pivotal values have a very low relative error. In fact all of them are correct except
the last, which has an error of only one part in 2^31. In spite of this the results are
useless. Significant figure arguments are much too superficial in general to give
reliable indications, and reflection back into the original matrix is a much more
reliable guide.

Iterative Methods of Solution


30. There is still a widely held belief that iterative methods of solving linear
equations are less affected by rounding errors than direct methods because in the
former one continues to work with the original matrix instead of modifying it.
Our analysis has shown that the total effect of the rounding errors made in good
methods of solution corresponds to only very small changes in the elements of
the original matrix. It is interesting to compare these perturbations with those
which "effectively" occur when iteration is performed with the original matrix.
Consider a typical step in a Gauss-Seidel iteration performed in floating point.
Typically an improved x_1 is computed from the previous x_2, x_3, ..., x_n by the rela-
tion

    x_1 = fl[(b_1 - a_12 x_2 - a_13 x_3 - ... - a_1n x_n) / a_11]
        = [b_1(1 + e_1) - a_12 x_2(1 + e_2)                        (30.1)
           - a_13 x_3(1 + e_3) - ... - a_1n x_n(1 + e_n)] / a_11


with the usual bounds for the e_i. We have therefore performed an exact Gauss-
Seidel step with a matrix with elements b_1(1 + e_1) and a_1i(1 + e_i). The bounds
for the e_i are no better than the perturbations we obtained in Gaussian elimina-
tion, and further we obtain a different set of e_i for each iteration. The idea that
we have worked with the original matrix is an illusion. Even if we accumulate
scalar products exactly we still do not do an exact iteration with A and b, and
ill-conditioning of A still affects the computation adversely. This is well-illus-
trated by the equations:

    .96326 x_1 + .81321 x_2 = .88824
                                          (30.2)
    .81321 x_1 + .68654 x_2 = .74988

These equations are positive definite and therefore Gauss-Seidel iteration should
be convergent, though very slowly. However, if we work with five decimals
and accumulate scalar products exactly, then starting with x_1 = .33116, x_2 =
.70000, for example, no progress is made towards the true solution

    x_1 = .39473..., x_2 = .62470... .
Convergence is so slow that the modifications to be made in the first step are
less than (1/2) 10^-5. We may express this in similar terms to those used earlier, if we
observe that x_1 is computed from the relation

    x_1 = fl[(.88824 - (.81321)(.70000)) / .96326]
        = fl[.318993 / .96326] = .33116.                          (30.3)

This would be an exact result if the denominator were .318993/.33116 = .963259
... . The computed x_1 corresponds to an exact computation with a perturbed
a_11 element. Similarly the computed x_2 corresponds to an exact computation with
a perturbed a_22 element. Starting with almost any x_2 in the range .4 to .8 the
iterated vectors remain constant.
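This stagnation can be verified mechanically (a sketch of the five-decimal Gauss-Seidel iteration on (30.2), with scalar products accumulated exactly via rational arithmetic):

```python
# Sketch of the five-decimal Gauss-Seidel iteration on (30.2), with scalar
# products accumulated exactly.  Rounding to five decimals happens only
# once per computed component, yet the iterates never move.
from fractions import Fraction

def fl5(x):
    """Round an exact rational to five decimal places."""
    return Fraction(round(x * 100000), 100000)

a11, a12, b1 = Fraction('0.96326'), Fraction('0.81321'), Fraction('0.88824')
a21, a22, b2 = Fraction('0.81321'), Fraction('0.68654'), Fraction('0.74988')

x1, x2 = Fraction('0.33116'), Fraction('0.70000')
for _ in range(10):
    x1 = fl5((b1 - a12 * x2) / a11)   # exact scalar product, one rounding
    x2 = fl5((b2 - a21 * x1) / a22)

print(float(x1), float(x2))   # still 0.33116 0.7
```

The exact updates differ from the current iterates by less than half a unit in the fifth decimal, so the rounding absorbs them completely.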

Relative Effectiveness of Fixed-point and Floating-point Computation


31. At most stages in this paper the results obtained for fixed-point computa-
tion have been better than those for floating-point computation. Strictly speak-
ing the term fixed-point is something of a misnomer, since what we have called
fixed-point computation has usually involved the introduction of scale factors.
We are therefore comparing a technique in which the scale factors are introduced
only as and when they are necessary (so-called fixed-point) and that in which
scale factors are used automatically at every stage (floating-point). The su-
periority of the fixed-point scheme springs entirely from the ability to accumu-
late exact scalar products. Now the inability to do this conveniently in floating-
point is not fundamental; it is merely a result of the design of existing computers.
Further, there are some fixed-point computers on which it is inconvenient to
accumulate exact scalar products. On ACE, where floating-point operations are
performed by subroutines, it was found that the routines for addition and sub-
traction could, without any appreciable loss of speed, deal with double-precision
floating numbers. It has therefore been possible to make a set of floating-point
subroutines which add double-precision numbers, divide a single-precision
number into a double-precision number, but multiply only single-precision
numbers, all operations being essentially at the speed associated with single-
precision working. With these subroutines we have all the convenience of float-
ing-point working with the full advantage of accurate accumulation of scalar
products. The provision of automatic floating-point facilities of this nature
would be of great value in matrix work.
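The gain from accumulation can be imitated in a few lines (a sketch with a 12-bit mantissa simulated in exact rational arithmetic; the point is only that one terminal rounding replaces a rounding per operation):

```python
# Sketch: simulating a t-bit mantissa with exact rationals to compare a
# scalar product rounded after every operation against one accumulated
# exactly and rounded once at the end (the ACE-style scheme described
# above).
from fractions import Fraction

def fl(x, t=12):
    """Round a rational to a t-bit mantissa, nearest."""
    if x == 0:
        return Fraction(0)
    e = 0
    while abs(x) >= 2 ** e: e += 1
    while abs(x) < 2 ** (e - 1): e -= 1
    s = Fraction(2) ** (t - e)
    return Fraction(round(x * s)) / s

a = [fl(Fraction(1, k)) for k in range(1, 9)]
b = [fl(Fraction(1, k + 3)) for k in range(1, 9)]

# (i) round after every multiplication and addition
s1 = Fraction(0)
for x, y in zip(a, b):
    s1 = fl(s1 + fl(x * y))

# (ii) accumulate the scalar product exactly, round once at the end
exact = sum(x * y for x, y in zip(a, b))
s2 = fl(exact)

print(float(s1 - exact), float(s2 - exact))
# s2 is the correctly rounded scalar product by construction
```

Scheme (ii) commits at most half a unit in the last place of the final result, whatever the length of the scalar product; scheme (i) commits a rounding error per term.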

Acknowledgments
The work described here has been carried out as part of the Research Pro-
gramme of the National Physical Laboratory, and this paper is published by
permission of the Director of the Laboratory. The author wishes to thank G. E.
Forsythe and E. T. Goodwin for reading the manuscript and making many
valuable suggestions.

REFERENCES
1. BAUER, F. L. Sequential reduction to tridiagonal form. J. Soc. Indust. Appl. Math. 7
   (1959), 107.
2. GIVENS, W. The linear equations problem. Technical Report No. 3, Applied Mathe-
   matics and Statistics Series, Nonr 225(37). Stanford University, 1959.
3. GOLDSTINE, H. H.; AND VON NEUMANN, J. Numerical inverting of matrices of high
   order. Bull. Amer. Math. Soc. 53 (1947), 1021.
4. HOUSEHOLDER, A. S. Unitary triangularization of a nonsymmetric matrix. J. ACM 5
   (1958), 339.
5. RUTISHAUSER, H. Solution of eigenvalue problems with the LR transformation.
   In National Bureau of Standards Applied Math. Series 49 (1958).
6. TURING, A. M. Rounding-off errors in matrix processes. Quart. J. Mech. Appl. Math. 1
   (1948), 287.
7. WILKINSON, J. H. Rounding errors in algebraic processes. Proceedings of Interna-
   tional Conference on Information Processing, UNESCO, 1959.
8. WILKINSON, J. H. Error analysis of floating-point computation. Num. Math. 2 (1960),
   319.
9. FADDEEVA, V. N. Computational Methods of Linear Algebra. (Translated by C. D.
   Benster.) Dover, New York; Constable, London, 1959.
