2.1 Singular Value Decomposition
which corresponds to a relative error of 99 percent if the L2 matrix norm is used.
What is the reason for such instability? As we can see from the expression for V, the eigenvectors of A are very close to being linearly dependent: ||V_1 + V_2||/||V_2|| = 10^{-2}. Consequently, the matrix V is close to a degenerate matrix, causing V^{-1} to amplify numerical errors in the calculation of the matrix D.
It turns out that there are quantities characterising matrices which remain stable even in the situation above. Namely, consider the matrix B = A^T A. The matrix B is symmetric, therefore its eigenvectors are orthogonal (or can be chosen to be orthogonal). Consider the decomposition of the matrix B,
B = O W O^T,

where

O = \begin{pmatrix} -0.9951 & 0.0987 \\ 0.0987 & 0.9951 \end{pmatrix}, \qquad W = \begin{pmatrix} 0.0020 & 0 \\ 0 & 25.4505 \end{pmatrix}.
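As a quick numerical check in MATLAB (a sketch, assuming that A is the matrix from the example of Section 2.1.1, which is consistent with the values of O and W above):

A = [0.5 5; 0 0.45];   % assumed: the ill-conditioned matrix from Section 2.1.1
B = A'*A;              % B = A^T A is symmetric
[O, W] = eig(B)        % eigenvectors and eigenvalues of B
O'*O                   % the identity matrix: the columns of O are orthonormal

The columns of O returned by eig agree with the matrix displayed above up to sign, and the diagonal of W reproduces 0.0020 and 25.4505.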
there is, and we will introduce the singular value decomposition in the next section.
Finally, let us remark on the geometric meaning of singular values: as we will verify in a moment, each singular value of a matrix A is equal to the length of the corresponding semi-axis of the ellipsoid representing the image of the unit sphere under the linear map generated by A. Examining the matrix W, one may notice that the corresponding ellipse has a large aspect ratio, reflecting the near linear dependence of the eigenvectors of A.
Any real square matrix A can be written as

A = U W V^T,

where W is a diagonal matrix whose diagonal entries w_i are non-negative real numbers (the singular values) and U and V are orthogonal matrices (for complex matrices a similar statement holds, with U and V unitary). This decomposition is not unique, but the singular values are (up to reordering them). Notice that
A A^T = U W V^T V W^T U^T = U W W^T U^T.

Therefore, the singular values of A are indeed the positive square roots of the eigenvalues of A A^T.
Notice also that if u = Av with ||v|| = 1 (and A is invertible), then u^T (A A^T)^{-1} u = 1. Therefore, in the two-dimensional case the singular values w_1, w_2 are the lengths of the semi-axes of the ellipse

\{ u : u^T (A A^T)^{-1} u = 1 \}

(and in the higher-dimensional case the numbers w_i are the lengths of the semi-axes of the corresponding ellipsoid).
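The following MATLAB sketch illustrates this geometric picture (again assuming the example matrix from Section 2.1.1):

A = [0.5 5; 0 0.45];          % assumed example matrix
t = linspace(0, 2*pi, 1000);
v = [cos(t); sin(t)];         % points on the unit circle, ||v|| = 1
u = A*v;                      % their images under A
r = sqrt(sum(u.^2));          % distances of the image points from the origin
[max(r), min(r)]              % lengths of the longest and shortest semi-axes
svd(A)'                       % the singular values of A: the same two numbers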
If A is a square symmetric matrix (as is the case with covariance matrices), then all eigenvalues of A are real and there is an orthonormal basis of eigenvectors. So in this case one can choose U to be the matrix whose columns are the orthonormal eigenvectors of A, and set V = U but with the sign reversed in those columns that correspond to negative eigenvalues, so that the w_i are non-negative. Since U^T = U^{-1}, this gives the above decomposition. For example, the symmetric matrix

A = \begin{pmatrix} 1 & 2 \\ 2 & 1 \end{pmatrix}

has eigenvalues 3 and −1, and singular values 3 and 1. An SVD is

A = \begin{pmatrix} 1/\sqrt{2} & 1/\sqrt{2} \\ 1/\sqrt{2} & -1/\sqrt{2} \end{pmatrix} \begin{pmatrix} 3 & 0 \\ 0 & 1 \end{pmatrix} \begin{pmatrix} 1/\sqrt{2} & -1/\sqrt{2} \\ 1/\sqrt{2} & 1/\sqrt{2} \end{pmatrix}^T.
If A is not symmetric then the relationship between eigenvalues and singular values is less obvious. For example,

A = \begin{pmatrix} 1 & 1 \\ 0 & 1 \end{pmatrix}

has eigenvalues 1 (twice), and singular values \sqrt{(3 \pm \sqrt{5})/2}.
Singular value decomposition (SVD) holds for rectangular matrices as well. Namely, any m × n matrix A can be written as A = U W V^T, where U is an orthogonal m × m matrix, V is an orthogonal n × n matrix, and W is an m × n diagonal matrix with non-negative entries w_i (the singular values). The non-zero singular values can be computed either as the square roots of the non-zero eigenvalues of the m × m matrix A A^T = U W W^T U^T, or as the square roots of the non-zero eigenvalues of the n × n matrix A^T A = V W^T W V^T.
Let us reiterate that

A A^T = U W W^T U^T, \qquad A^T A = V W^T W V^T.
>> A = [1, 10; 0, 1]
A =
     1    10
     0     1
>> svd(A)
ans =
   10.0990
    0.0990
>> [U, S, V] = svd(A)
U =
    0.9951   -0.0985
    0.0985    0.9951
S =
   10.0990         0
         0    0.0990
V =
    0.0985   -0.9951
    0.9951    0.0985
>> eig(A)
ans =
     1
     1
>> [U, V] = eig(A)
U =
    1.0000   -1.0000
         0    0.0000
V =
     1     0
     0     1
is equivalent to

a = A^{-1} b = V \begin{pmatrix} w_1^{-1} & & & \\ & w_2^{-1} & & \\ & & \ddots & \\ & & & w_N^{-1} \end{pmatrix} U^T b.
We see from the second formula that the minimum of ||Aa − b|| is achieved for a vector a with a_k = b_k/w_k for all k with w_k ≠ 0. It then follows from the second formula that the shortest such vector has a_k = 0 for all k with w_k = 0. Combining the last two conclusions, we find that a = W^+ b, or

a = A^+ b

in an arbitrary basis.
It's worth noting the following corollary of Proposition 2.1.1: if Aa = b has infinitely many solutions, the minimal-norm solution is a_0 = A^+ b.
As one can see from the proof, the result holds for rectangular matrices as well. For example, take

A = \begin{pmatrix} 1 \\ 1 \end{pmatrix}.

Then the singular value decomposition is

A = \begin{pmatrix} 1 \\ 1 \end{pmatrix} = \begin{pmatrix} 1/\sqrt{2} & -1/\sqrt{2} \\ 1/\sqrt{2} & 1/\sqrt{2} \end{pmatrix} \begin{pmatrix} \sqrt{2} \\ 0 \end{pmatrix} (1)^T.
>> a=inv(A)*b
a =
0.5263
0.5263
Suppose that A is a matrix we measured experimentally and, due to numerical errors, A(1, 2) = A(2, 1) ≈ 1. The resulting matrix is degenerate:

\tilde{A} = \begin{pmatrix} 1 & 1 \\ 1 & 1 \end{pmatrix}.

As a result, the system \tilde{A} a = b has infinitely many solutions. However, the generalized inverse still picks out a definite answer:

a' = \tilde{A}^+ b = \begin{pmatrix} 0.5 \\ 0.5 \end{pmatrix}.
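In MATLAB (a minimal check; b = [1; 1] is an assumption consistent with the value of a' above):

Atilde = [1 1; 1 1];      % the degenerate matrix above
b = [1; 1];               % assumed right-hand side
a = pinv(Atilde)*b        % returns [0.5; 0.5], the minimal-norm solution
% inv(Atilde)*b would only give a warning and Inf entries, since Atilde is singular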
\hat{\beta} = \mathrm{argmin}_\beta \{\epsilon^T \epsilon\} = \mathrm{argmin}_\beta \{(y - X\beta)^T (y - X\beta)\} = \mathrm{argmin}_\beta \{\|y - X\beta\|\},

\hat{\beta} = X^+ y.
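For illustration, a minimal MATLAB regression sketch of the formula \hat{\beta} = X^+ y (the data X and y below are made up, not from the text):

X = [ones(10,1), (1:10)'];         % hypothetical design matrix: intercept and one regressor
y = X*[2; 0.5] + 0.1*randn(10,1);  % hypothetical noisy observations
beta_hat = pinv(X)*y               % least-squares estimate, beta_hat = X^+ y
% X\y gives the same estimate here, since X has full column rank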
(where we assume that A is an m × n matrix and n ≥ m). Here U is an m × m matrix and V an n × n matrix. Let u_1, \ldots, u_m be the columns of U and v_1, \ldots, v_n those of V. Then the above decomposition gives

A = w_1 u_1 v_1^T + w_2 u_2 v_2^T + \cdots + w_m u_m v_m^T.
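A quick numerical check of this rank-one expansion (a sketch; the 3 × 5 matrix below is random, purely for illustration):

A = randn(3,5);                    % a random 3 x 5 matrix, so m = 3 and n = 5
[U, W, V] = svd(A);
B = zeros(size(A));
for i = 1:3
    B = B + W(i,i)*U(:,i)*V(:,i)'; % add the rank-one term w_i u_i v_i^T
end
norm(A - B)                        % essentially zero (round-off level)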
Background note. This is what Wikipedia has to say about the pairs trade:
The pairs trade was developed in the late 1980s by quantitative analysts.
They found that certain securities, often competitors in the same sector, were
correlated in their day-to-day price movements. When the correlation broke
down, i.e. one stock traded up while the other traded down, they would sell
the outperforming stock and buy the underperforming one, betting that the
"spread" between the two would eventually converge.
Some real-life examples of potentially correlated pairs:
The pairs trade helps to hedge sector- and market-risk. For example, if
the market as a whole crashes and your two stocks plummet along with it,
you should experience a gain on the short position and a negating loss on
the long position, leaving your profit close to zero in spite of the large move.
In a pairs trade, you are not making a bet on the direction of the stocks in
absolute terms, but on the direction of the stocks relative to each other.
Some mathematical algorithms relevant for pairs trading are discussed in Y. Zhu, D. Shasha, Fast approaches to simple problems in financial time series streams.
The mathematical problem we need to solve in order to be able to trade pairs is this: what is the most likely price increment of B given the observed price increment of A?
To be more specific, let us assume that the joint probability distribution of the logarithmic price increment a of stock A and the logarithmic price increment b of stock B is Gaussian with covariance matrix C,

\varrho(a, b) = \frac{1}{Z}\, e^{-\frac{1}{2} (a, b)\, C^{-1} (a, b)^T},

where Z is the normalisation constant. Further in the course we will see how the Gaussianity of logarithmic increments follows from linear stock price models. Note that a and b can be either single price increments or vectors of price increments in the case where we are trying to correlate groups of stocks. So, what is the most probable value of b given a? It can be computed by maximizing \varrho(a, b) over b at fixed a, that is, by minimizing the quadratic form (a, b) C^{-1} (a, b)^T with respect to b. Therefore,

C^{-1}_{bb} b_0^T = - C^{-1}_{ba} a^T,

where C^{-1}_{bb} and C^{-1}_{ba} denote the corresponding blocks of the matrix C^{-1}.
The matrix C^{-1} is the inverse of the covariance matrix. The latter is measured experimentally, which leads to inaccuracies in determining the small correlation components. Therefore, it is often the case that only the principal components of C can be determined accurately. As a result, C^{-1} should be interpreted as a generalized inverse, and the above system may have to be solved in the sense of minimizing the fitting error, as in Section 2.1.4.
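As an illustration of this prediction rule, here is a minimal MATLAB sketch with an invented 2 × 2 covariance matrix (the numbers are not from the text):

C = [0.04 0.03; 0.03 0.04];   % hypothetical covariance of the log-increments (a, b)
P = pinv(C);                  % inverse covariance (a generalized inverse if C is degenerate)
a = 0.01;                     % observed log-increment of stock A
b0 = -(P(2,2) \ (P(2,1)*a))   % most probable log-increment of stock B
% for this 2 x 2 example this reduces to (C(2,1)/C(1,1))*a = 0.0075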
In general, the analysis of covariance matrices of large size using principal components is an important part of the study of movements of financial markets. You will learn more about this in the course on financial time series.
• Is it enough to know the total population at time 0 to find the total population at time t? For example, assume that the population in year 0 is distributed as (0, 100, 900)^T. What will be the population in 10 years? What if the population starts with (900, 100, 0)^T?
• What does the age distribution of these have to do with the eigenvector corresponding to the largest eigenvalue?
• Calculate the distribution of the population after 1000 years for the above two initial conditions without calculating L^{1000} directly (one possible approach is sketched after this list).
• What would happen if the 4 in the matrix were replaced by 5 (check the eigenvalues)? Plot the growth of the populations over a period of 100 years. Consider using the 'semilogy' command. Analyze the dependence of your answers on the initial distribution of the population.
• What will happen to the evolution of the population if you replace the entries 1/2 and 1/2 in the first column by 1/3 and 2/3?
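As mentioned in the bullet above, one way to avoid computing L^{1000} directly is to expand the initial distribution in the eigenvectors of L. A minimal sketch (the matrix L below is an illustrative placeholder with the right structure, not the Leslie matrix from the lecture):

L = [0 1 4; 1/2 0 0; 0 1/2 0];   % hypothetical Leslie matrix (placeholder only)
p0 = [0; 100; 900];              % initial age distribution
[V, D] = eig(L);                 % columns of V are eigenvectors, D holds the eigenvalues
c = V \ p0;                      % coordinates of p0 in the eigenbasis
p = real(V*D^1000*c)             % equals L^1000 * p0, dominated by the largest eigenvalue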
2.2.3 Solving equations
Suppose you want to solve Ax = b where

A = \begin{pmatrix} 3 & 2 & -1 \\ -1 & 3 & 2 \\ 1 & -1 & -1 \end{pmatrix} \quad\text{and}\quad b = \begin{pmatrix} 10 \\ 5 \\ -1 \end{pmatrix}.

There are several ways of solving this. For example,
x=inv(A)*b and x=pinv(A)*b
The second method is based on generalized inverses and is generally more suitable for calculations involving experimental data. (Type 'help pinv' into the command window to learn more about pinv.)
Now suppose you want to solve Ax = b where

A = \begin{pmatrix} 1 & 1 & 0 \\ 1 & 0 & 0 \\ 0 & 1 & 0 \end{pmatrix} \quad\text{and}\quad b = \begin{pmatrix} 10 \\ 5 \\ -1 \end{pmatrix}.

Can you still use both methods? Does A^{-1} exist? The solution

x=pinv(A)*b

is based on the least-squares approach discussed in Sections 2.1.3 and 2.1.4. Compare the norm of the residual

r = A*x - b

to the norm of b.
maximum of the relative error in A^{-1}b divided by the relative error in b, where the maximum is taken over all b ≠ 0.
It follows from the definition that K(A) ≫ 1 describes ill-conditioned systems, for which the relative error in the knowledge of the right-hand side can be amplified by a factor of K(A). Is it easy to calculate K(A)?
Proposition 2.3.1.

K(A) = \|A\| \cdot \|A^{-1}\|.

Proof: Let e be the error in the determination of the vector b, b' = b + e. Then the error in x = A^{-1}b is A^{-1}e. Therefore,

K(A) \stackrel{\mathrm{def}}{=} \max_{b \neq 0,\, e \neq 0} \frac{\|A^{-1}e\| / \|A^{-1}b\|}{\|e\| / \|b\|}
     = \max_{b \neq 0,\, e \neq 0} \frac{\|A^{-1}e\|}{\|e\|} \cdot \frac{\|b\|}{\|A^{-1}b\|}
     = \max_{b \neq 0,\, e \neq 0} \frac{\|A^{-1}e\|}{\|e\|} \cdot \frac{\|Ab\|}{\|b\|}
     = \|A\| \cdot \|A^{-1}\|,

where in the last step we substituted b → Ab, which ranges over all non-zero vectors as b does.
A similar argument applies to errors in the matrix A itself: to leading order, an error δA in A changes the solution in the same way as replacing the right-hand side by

b' = b + e = b - \delta A\, A^{-1} b.

Therefore a large K(A) signals the amplification of the relative error in the solution as a result of errors both in the determination of the equation matrix A and in the equation right-hand side b.
Using the singular value decomposition of the matrix A it is easy to show that

K(A) = w_{\max}/w_{\min},

the ratio of the largest to the smallest singular value of A. Therefore, K(A) can be interpreted as the aspect ratio of the ellipsoid {Av : ||v|| = 1}.
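A quick MATLAB check of this interpretation, using the first of the example matrices considered below:

A = [0.5 5; 0 0.45];     % the first example matrix considered below
w = svd(A);
cond(A)                  % built-in 2-norm condition number
w(1)/w(end)              % ratio of the largest to the smallest singular value
norm(A)*norm(inv(A))     % ||A|| * ||A^{-1}||, the expression from Proposition 2.3.1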
For the two examples of ill-conditioned systems already considered in this chapter,

K\left(\begin{pmatrix} 0.5 & 5 \\ 0 & 0.45 \end{pmatrix}\right) \approx 113 \gg 1, \qquad K\left(\begin{pmatrix} 1 & 0.9 \\ 0.9 & 1 \end{pmatrix}\right) \approx 19.
The beauty of the condition number is that it can be used to correctly identify an ill-conditioned system irrespective of the precise nature of the numerical instability present.
For instance, in the example of Section 2.1.1 we identified the origin of the instability as the near-degeneracy of the eigenvectors of A.
Here is an example of an ill-conditioned system with a different kind of instability:
A = \begin{pmatrix} 100 & -200 \\ -200 & 401 \end{pmatrix} \quad\text{and}\quad Ax = \begin{pmatrix} 100 \\ -100 \end{pmatrix}.
ans =
2.5080e+003
ans =
625.5009
ans =
0.1997
500.8003
ans =
0.7997
500.2003
one should rescale the 2nd row by a factor of 1000 to make it the same order of magnitude as the first one.
2.4 Numerical computations with matrices
Matlab has very good routines for solving systems of linear equations. If you need to code one of these routines in C, check for example the book by Press et al. (Numerical Recipes). This is much safer than writing your own code. However, in order to choose the right routine for a given ill-conditioned system it is important to have a general idea about the underlying algorithms.
L1: x1 + 2x2 + 3x3 = 1
L2: 2x1 + 2x2 + x3 = 1
L3: 3x1 + 2x2 + x3 = 1

Eliminating x1 from the second and third equations:

L1:        x1 + 2x2 + 3x3 = 1
L2 − 2L1:      −2x2 − 5x3 = −1
L3 − 3L1:      −4x2 − 8x3 = −2

Eliminating x2 from the last equation:

L1:                       x1 + 2x2 + 3x3 = 1
L2 − 2L1:                     −2x2 − 5x3 = −1
L3 − 3L1 − 2(L2 − 2L1):              2x3 = 0
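The same system can be checked in MATLAB (backslash performs a pivoted variant of Gaussian elimination):

A = [1 2 3; 2 2 1; 3 2 1];
b = [1; 1; 1];
x = A \ b        % returns [0; 0.5; 0]
% back substitution above gives the same answer: x3 = 0, x2 = 1/2, x1 = 0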
Some general remarks: for the iterative method discussed below, split A into its diagonal, strictly lower triangular and strictly upper triangular parts,

A = L + D + U.

The system Ax = b can then be rewritten as

D x = −(L + U) x + b,

which suggests the Jacobi iteration (referred to below as recursion (R))

D x_{k+1} = −(L + U) x_k + b,

or, equivalently, x_{k+1} = T x_k + D^{-1} b with T = −D^{-1}(L + U).
Proof:
By the Levy–Desplanques theorem, a diagonally dominant matrix A is non-singular. Therefore, Ax = b has a unique solution x. Clearly, x is a fixed point of the recursion (R). Let x_k = x + δx_k. Then

δx_{k+1} = T δx_k,

and for a diagonally dominant matrix \|T\|_\infty < 1, so δx_k → 0 as k → ∞.

The Jacobi algorithm consists of running recursion (R), starting with some initial condition, until sufficient convergence has been achieved. Note that the complexity of one iteration step is approximately 2n^2 for n ≫ 1. Given that the number of iterations required to achieve the desired accuracy is much smaller than n, the Jacobi algorithm costs much less than the Gaussian elimination algorithm (whose cost grows as n^3). It is clear that the rate of convergence is determined by the largest (in absolute value) eigenvalue of the matrix T introduced above.
Example. Consider

4x − y + z = 7
4x − 8y + z = −21
−2x + y + 5z = 15

Then taking

x_{k+1} = (7 + y_k − z_k)/4
y_{k+1} = (21 + 4x_k + z_k)/8
z_{k+1} = (15 + 2x_k − y_k)/5

and any choice of (x_0, y_0, z_0), we get that (x_k, y_k, z_k) is very close to the solution (2, 4, 3) for large k. We have sufficient theoretical background to
explain why the Jacobi algorithm converges for the case at hand: the matrix

T = -D^{-1}(L + U) = \begin{pmatrix} 0 & 1/4 & -1/4 \\ 4/8 & 0 & 1/8 \\ 2/5 & -1/5 & 0 \end{pmatrix}

satisfies \|T\|_\infty = 5/8 < 1, so the error δx_k = T^k δx_0 tends to zero.
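A minimal MATLAB sketch of recursion (R) for this example (an illustration only; the exercises at the end of the chapter ask you to write your own function):

A = [4 -1 1; 4 -8 1; -2 1 5];
b = [7; -21; 15];
D = diag(diag(A));           % diagonal part of A
LU = A - D;                  % L + U
x = zeros(3,1);              % initial condition
for k = 1:50
    x = D \ (b - LU*x);      % D*x_{k+1} = -(L+U)*x_k + b
end
x                            % close to the exact solution [2; 4; 3]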
The Gauss–Seidel iteration modifies the Jacobi recursion by using the updated components of x as soon as they are available:

D x_{n+1} = −L x_{n+1} + b − U x_n.
2.5 Matlab exercises for week 4 (Linear Algebra)
2.5.1 Ill-posed systems
Suppose you want to solve Ax = b where

A = \begin{pmatrix} 10 & 7 & 8 & 7 \\ 7 & 5 & 6 & 5 \\ 8 & 6 & 10 & 9 \\ 7 & 5 & 9 & 10 \end{pmatrix} \quad\text{and}\quad b = \begin{pmatrix} 32 \\ 23 \\ 33 \\ 31 \end{pmatrix}.

Solve for x using, for example,

x=A\b   or   x=inv(A)*b
What would the answer be if b were changed to

b' = \begin{pmatrix} 32.1 \\ 22.9 \\ 32.9 \\ 31.1 \end{pmatrix}?

Our aim is to explain the reason for the large difference between the two solutions corresponding to a small relative change of b. Compute the residual Ax − b in both cases. Notice that these residuals are very small compared with the norms of Ax and b. Therefore, we are in the situation where the problem Ax ≈ b is ill posed.
• Confirm that the condition number of A is the ratio of the largest to the smallest singular value of A. What is the geometric meaning of the largest and of the smallest singular values?
• What is the ratio of the relative error in the solution to the relative error in b? What is the maximal error magnification factor? How well do these two numbers agree with each other? (Take the decimal logarithm of the numbers.) A sketch of the relevant MATLAB commands is given below.
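One possible set of commands for the two bullets above (a sketch; b' is stored as the variable bp):

A = [10 7 8 7; 7 5 6 5; 8 6 10 9; 7 5 9 10];
b = [32; 23; 33; 31];
bp = [32.1; 22.9; 32.9; 31.1];
x = A \ b;  xp = A \ bp;
norm(xp - x)/norm(x)     % relative error in the solution
norm(bp - b)/norm(b)     % relative error in b
cond(A)                  % the maximal error magnification factor K(A)
w = svd(A); w(1)/w(end)  % the same number: ratio of extreme singular values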
2.5.2 MATLAB capabilities investigation: sparse matrices
When solving PDEs numerically one often has to deal with matrices with lots of zeros, but of very high dimension. Such matrices are called sparse. Matlab has commands for efficient operations with sparse matrices.
• What do the MATLAB commands 'sparse' and 'full' do? (Try 'help sparse' to find out. Use 'sparse(eye(n))' for a sufficiently large integer n as an example.)
• Investigate the 'sparse' and 'full' commands a bit more deeply by running through the following sequence of commands, pausing regularly to analyze the intermediate results:
n=6
format rat
P=sparse(1,1,1,n,n)
full(P)
Q=sparse(n,n,1,n,n)
R=P+Q
full(R)
S=sparse(1:n-1,2:n,1/3,n,n)
full(S)
T=P+S
full(T)
• Check that your code works for the matrix used as an example in the lecture.
• Find an example of a non-diagonally-dominant matrix and an initial condition for which the iteration does not converge.
• Create another MATLAB function for solving a linear system using the Gauss–Seidel iteration scheme.
You may need to hand in these codes at the end of the term.