Linear Algebra: Theory, Intuition, Code
ISBN: 9789083136608
Book edition 1.

Dedication
Foreword
Contents

1 Introduction  11
1.1 What is linear algebra and why learn it?  12
1.2 About this book  12
1.3 Prerequisites  15
1.4 Exercises and code challenges  17
1.5 Online and other resources  18

2 Vectors  21
2.1 Scalars  22
2.2 Vectors: geometry and algebra  23
2.3 Transpose operation  30
2.4 Vector addition and subtraction  31
2.5 Vector-scalar multiplication  33
2.6 Exercises  37
2.7 Answers  39
2.8 Code challenges  41
2.9 Code solutions  42

3 Vector multiplication  43
3.1 Vector dot product: Algebra  44
3.2 Dot product properties  46
3.3 Vector dot product: Geometry  50
3.4 Algebra and geometry  52
3.5 Linear weighted combination  57
3.6 The outer product  59
3.7 Hadamard multiplication  62
3.8 Cross product  64
3.9 Unit vectors  65
3.10 Exercises  68
3.11 Answers  70
3.12 Code challenges  72
3.13 Code solutions  73

4 Vector spaces  75
4.1 Dimensions and fields  76
4.2 Vector spaces  78
4.3 Subspaces and ambient spaces  78
4.4 Subsets  85
4.5 Span  86
4.6 Linear independence  90
4.7 Basis  97
4.8 Exercises  102
4.9 Answers  105

5 Matrices  107
5.1 Interpretations and uses of matrices  108
5.2 Matrix terms and notation  109
5.3 Matrix dimensionalities  110
5.4 The transpose operation  110
5.5 Matrix zoology  112
5.6 Matrix addition and subtraction  124
5.7 Scalar-matrix mult.  126
5.8 "Shifting" a matrix  126
5.9 Diagonal and trace  129
5.10 Exercises  131
5.11 Answers  134
5.12 Code challenges  136
5.13 Code solutions  137

7 Rank  179
7.1 Six things about matrix rank  180
7.2 Interpretations of matrix rank  181
7.3 Computing matrix rank  183
7.4 Rank and scalar multiplication  185
7.5 Rank of added matrices  186
7.6 Rank of multiplied matrices  188
7.7 Rank of A, A^T, A^TA, and AA^T  190
7.8 Rank of random matrices  193
7.9 Boosting rank by "shifting"  194
7.10 Rank difficulties  196
7.11 Rank and span  198
7.12 Exercises  200
7.13 Answers  201
7.14 Code challenges  202
7.15 Code solutions  203

11 Determinant  295
11.1 Features of determinants  296
11.2 Determinant of a 2×2 matrix  297
11.3 The characteristic polynomial  299
11.4 3×3 matrix determinant  302
11.5 The full procedure  305
11.6 ∆ of triangles  307
11.7 Determinant and row reduction  309
11.8 ∆ and scalar multiplication  313
11.9 Theory vs practice  314
11.10 Exercises  316
11.11 Answers  317
11.12 Code challenges  318
11.13 Code solutions  319

13 Projections  357
13.1 Projections in R^2  358
13.2 Projections in R^N  361
13.3 Orth and par vect comps  365
13.4 Orthogonal matrices  370
13.5 Orthogonalization via GS  374
13.6 QR decomposition  377
13.7 Inverse via QR  381
13.8 Exercises  383
13.9 Answers  384
13.10 Code challenges  385
13.11 Code solutions  386

14 Least-squares  389
14.1 Introduction  390
14.2 5 steps of model-fitting  391
14.3 Terminology  394
14.4 Least-squares via left inverse  395
14.5 Least-squares via projection  397
14.6 Least-squares via row-reduction  399
14.7 Predictions and residuals  401
14.8 Least-squares example  402
14.9 Code challenges  409
14.10 Code solutions  410

15 Eigendecomposition  415
15.1 Eigenwhatnow?  416
15.2 Finding eigenvalues  421
15.3 Finding eigenvectors  426
15.4 Diagonalization  429
15.5 Conditions for diagonalization  433
15.6 Distinct, repeated eigenvalues  435
15.7 Complex solutions  440
15.8 Symmetric matrices  442
15.9 Eigenvalues of singular matrices  445
15.10 Eigenlayers of a matrix  448
15.11 Matrix powers and inverse  451
15.12 Generalized eigendecomposition  455
15.13 Exercises  457
15.14 Answers  458
15.15 Code challenges  459
15.16 Code solutions  460

19 PCA  555
19.1 PCA: interps and apps  556
19.2 How to perform a PCA  559
19.3 The algebra of PCA  561
19.4 Regularization  563
19.5 Is PCA always the best?  566
19.6 Code challenges  568
19.7 Code solutions  569
Start this chapter happy
CHAPTER 1
Introduction to this book

1.1 What is linear algebra and why learn it?

…differential equations, physics, economics, and many other areas.
1.2 About this book
The purpose of this book is to teach you how to think about and
work with matrices, with an eye towards applications in machine
learning, multivariate statistics, time series, and image process-
ing. If you are interested in data science, quantitative biology,
statistics, or machine learning and artificial intelligence, then this
book is for you. If you don’t have a strong background in mathe-
matics, then don’t be concerned: You need only high-school math
and a bit of dedication to learn linear algebra from this book.
More important equations are given on their own lines. The num-
ber in parentheses to the right will allow me to refer back to that
equation later in the text (the number left of the decimal point
is the chapter, and the number to the right is the equation num-
ber).
Something important!
Algebraic and geometric perspectives on matrices Many con-
cepts in linear algebra can be formulated using both geometric
and algebraic (analytic) methods. This "dualism" promotes com-
prehension and I try to utilize it often. The geometric perspective
provides visual intuitions, although it is usually limited to 2D or
3D. The algebraic perspective facilitates rigorous proofs and com-
putational methods, and is easily extended to N-D. When working
on problems in R2 or R3 , I recommend sketching the problem on
paper or using a computer graphing program.
Just keep in mind that not every concept in linear algebra has
both a geometric and an algebraic formulation. The dualism is useful
in many cases, but it’s not a fundamental fact that necessarily
applies to all linear algebra concepts.
1.3 Prerequisites
I provide code for all concepts and problems in this book in both
MATLAB and Python. I find MATLAB to be more comfortable
for implementing linear algebra concepts. If you don't have access…
1.4 Practice, exercises and code challenges
Math is not a spectator sport. If you simply read this book with-
out solving any problems, then sure, you’ll learn something and
I hope you enjoy it. But to really understand linear algebra, you
need to solve problems.
Code challenges are more involved, require some effort and cre-
ativity, and can only be solved on a computer. These are oppor-
tunities for you to explore concepts, visualizations, and parameter
spaces in ways that are difficult or impossible to do by hand. I
provide my solutions to all code challenges, but keep in mind that
there are many correct solutions; the point is for you to explore
and understand linear algebra using code, not to reproduce my
code.
1.5 Online and other resources
CHAPTER 2
Vectors
2.1 Scalars
We begin not with vectors but with scalars. You already know
everything you need to know about scalars, even if you don’t yet
recognize the term.
Why are single numbers called "scalars"? It’s because single num-
bers "scale," or stretch, vectors and matrices without changing
their direction. This will become clear and intuitive later in this
chapter when you learn about the geometric interpretation of vec-
tors.
Answers
a) λ = 9/2   b) λ = 5/7   c) λ = 3 ln π   d) λ = −9/2
e) λ = .5   f) λ = 0   g) λ = ±6   h) λ = 15 (for λ ≠ 0)
2.2 Vectors: geometry and algebra
For example, the vector [1 -2] simply means a line that goes one unit in the positive direction in
the first dimension, and two units in the negative direction in the
second dimension. This is the key difference between a vector and
a coordinate: For any coordinate system (think of the standard
Cartesian coordinate system), a given coordinate is a unique point
in space. A vector, however, is any line—anywhere—for which the
end point (also called the head of the vector) is a certain number
of units along each dimension away from the starting point (the
tail of the vector).
Code Vectors are easy to create and visualize in MATLAB and
in Python. The code draws the vector in its standard position.
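Here is a minimal Python sketch of that idea (my own illustration, not one of the book's numbered code blocks; it assumes numpy and matplotlib are installed):

import numpy as np
import matplotlib.pyplot as plt

v = np.array([1, -2])                 # the vector [1 -2]
plt.plot([0, v[0]], [0, v[1]], 'k')   # draw it from the origin (standard position)
plt.axis([-3, 3, -3, 3])
plt.grid(True)
plt.show()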
Figure 2.3: The three coordinates (circles) are distinct, but the
three vectors (lines) are the same, because they have the same
magnitude and direction ([1 -2]). When the vector is in its
standard position (the black vector), the head of the vector [1
-2] overlaps with the coordinate [1 -2].
Algebra A vector is an ordered list of numbers. The number of
numbers in the vector is called the dimensionality of the vector.
Here are a few examples of 2D and 3D vectors.
$$\begin{bmatrix}1 & -2\end{bmatrix},\quad \begin{bmatrix}4 & 1\end{bmatrix},\quad \begin{bmatrix}10000 & 0\end{bmatrix}$$
$$\begin{bmatrix}\sqrt{\pi^3} & e^2 & 0\end{bmatrix},\quad \begin{bmatrix}3 & 1 & 4\end{bmatrix},\quad \begin{bmatrix}2 & -7 & 8\end{bmatrix}$$
Figure 2.4: The geometric perspective of the vector in equation 2.1.

Vectors are not limited to numbers; the elements can also be functions. Consider the following vector function.

$$\mathbf{v} = \begin{bmatrix}\cos(t) & \sin(t) & t\end{bmatrix} \tag{2.1}$$
(t is itself a vector of time points.) Vector functions are used in
multivariate calculus, physics, and differential geometry. How-
ever, in this book, vectors will comprise single numbers in each element.
Code block 2.5: Python

import numpy as np
v1 = [2, 5, 4, 7]                     # list
v2 = np.array([2, 5, 4, 7])           # array, no orientation
v3 = np.array([[2], [5], [4], [7]])   # column vector
v4 = np.array([[2, 5, 4, 7]])         # row vector
It’s important to see that the letter is not bold-faced when refer-
ring to a particular element. This is because subscripts can also
be used to indicate different vectors in a set of vectors. Thus, v_i (not bold-faced)
is the ith element in vector v, whereas v_i (bold-faced) is the ith vector in a series of
related vectors (v_1, v_2, ..., v_i). I know, it's a bit confusing, but
unfortunately that’s common notation and you’ll have to get used
to it. I try to make it clear from context whether I’m referring to
vector element vi or vector vi .
That said, there are some special vectors that you should know
about. The vector that contains zeros in all of its elements is
called the zeros vector. A vector that contains some zeros but
other non-zero elements is not the zeros vector; it’s just a regular
vector with some zeros in it. To earn the distinction of being a
"zeros vector," all elements must be equal to zero.
The zeros vector has some interesting and sometimes weird prop-
erties. One weird property is that it doesn’t have a direction. I
don’t mean its direction is zero, I mean that its direction is un-
defined. That’s because the zeros vector is simply a point at the
origin of a graph. Without any magnitude, it doesn’t make sense
to ask which direction it points in.
Practice problems: State the type and dimensionality of the following vectors (e.g., "four-dimensional column vector"). For 2D vectors, additionally draw the vector starting from the origin.

a) [1; 2; 3; 1]   b) [1 2 3 1]   c) [−1; π]   d) [7 1/3]

Answers
a) 4D column   b) 4D row   c) 2D column   d) 2D row
Reflection: …all of linear algebra is built up from scalars and vectors.
From humble beginnings, amazing things emerge. Just
think of everything you can build with wood planks and
nails. (But don’t think of what I could build — I’m a
terrible carpenter.)
2.3 Transpose operation
Now you know some of the basics of vectors. Let’s start learning
what you can do with them. The mathy term for doing stuff with
vectors is operations that act upon vectors.
The transpose operation converts a column vector into a row vector, and vice versa; the elements stay the same, only the orientation changes. The transpose
operation is indicated by a super-scripted T (some authors use an
italics T but I think it looks nicer in regular font). For example:
$$\begin{bmatrix}4 & 3 & 0\end{bmatrix}^T = \begin{bmatrix}4\\3\\0\end{bmatrix}$$
$$\begin{bmatrix}4\\3\\0\end{bmatrix}^T = \begin{bmatrix}4 & 3 & 0\end{bmatrix}$$
$$\begin{bmatrix}4 & 3 & 0\end{bmatrix}^{TT} = \begin{bmatrix}4 & 3 & 0\end{bmatrix}$$
Reminder: Vectors are columns unless otherwise specified.

As mentioned in the previous section, we assume that vectors are columns. Row vectors are therefore indicated as a transposed column vector. Thus, v is a column vector while v^T is a row vector. On the other hand, column vectors written inside text are often indicated as transposed row vectors, for example w = [1 2 3]^T.
Code Transposing is easy both in MATLAB and in Python.
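A minimal Python sketch of the idea (my own, with illustrative values):

import numpy as np

rowvec = np.array([[4, 3, 0]])   # 1x3 row vector
colvec = rowvec.T                # .T transposes: now a 3x1 column vector
print(colvec.T)                  # transposing twice returns the original row vector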
2.4 Vector addition and subtraction
Remember that a vector is defined by length and direction; the vector can start anywhere.

Geometry: To add two vectors a and b, put the start ("tail") of vector b at the end ("head") of vector a; the new vector that goes from the tail of a to the head of b is vector a+b (Figure 2.5).
Vector addition is commutative, which means that a + b = b + a.
This is easy to demonstrate: Get a pen and piece of paper, come
up with two 2D vectors, and follow the procedure above. Of
course, that’s just a demonstration, not a proof. The proof will
come with the algebraic interpretation.
There are two ways to think about subtracting two vectors. One
way is to multiply one of the vectors by -1 and then add them as
above (Figure 2.5, lower left). Multiplying a vector by -1 means
to multiply each vector element by -1 (vector [1 1] becomes vector
[-1 -1]). Geometrically, that flips the vector by 180◦ .
You can see that the two subtraction methods give the same dif- 31
ference vector. In fact, they are not really different methods;
just different ways of thinking about the same method. That will
become clear in the algebraic perspective below.
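A quick numerical illustration in Python (values are mine): addition is element-wise, and subtraction is the same as adding the vector multiplied by -1.

import numpy as np

a = np.array([1, 1])
b = np.array([2, 3])
print(a + b)        # [3 4]
print(a - b)        # [-1 -2]
print(a + (-1)*b)   # [-1 -2], the same result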
Figure 2.5: Two vectors (top left) can be added (top right) and
subtracted (lower two plots).
More formally, for two vectors of the same dimensionality:

$$\mathbf{a} + \mathbf{b} = \begin{bmatrix}a_1+b_1\\a_2+b_2\\\vdots\\a_n+b_n\end{bmatrix}$$
Practice problems: Solve the following operations. For 2D vectors, draw both vectors starting from the origin, and the vector sum (also starting from the origin).

a) [4 5 1 0] + [−4 −3 3 10]   b) [4; 2; 0] − [6; −4; −60] + [2; −5; 40]
c) [1; 0] + [1; 2]   d) [2; 2] − [3; 4]
e) [−3; 1] + [3; −1]   f) [1; 4] + [2; 8]

Answers
a) [0 2 4 10]   b) [0; 1; 100]
c) [2; 2]   d) [−1; −2]
e) [0; 0]   f) [3; 12]
2.5 Vector-scalar multiplication
If the scalar is negative, the vector is rotated by 180°. For now, let's say that the result of
vector-scalar multiplication (the scaled vector) must form either
a 0◦ or a 180◦ angle with the original vector. There is a deeper
and more important explanation for this, which has to do with
an infinitely long line that the vector defines (a "1D subspace");
you’ll learn about this in Chapter 4.
Still, the important thing is that the scalar does not rotate the
vector off of its original orientation. In other words, vector direc-
tion is invariant to scalar multiplication.
This definition holds for any number of dimensions and for any
scalar. Here is one example:
$$3\begin{bmatrix}-1 & 3 & 0 & 2\end{bmatrix} = \begin{bmatrix}-3 & 9 & 0 & 6\end{bmatrix}$$
Practice problems: Compute scalar-vector multiplication for the following pairs:

a) −2·[4 3 0]   b) (−9+2×5)·[0; 4; 3]   c) 0·[3.14×π; 3.14; 9; −234987234]   d) λ·[3; 1; 11]

Answers
a) [−8 −6 0]   b) [0; 4; 3]   c) [0; 0; 0; 0]   d) [3λ; λ; 11λ]
…importance: Stretching a vector without rotating it is fundamental to many applications in linear algebra, including eigendecomposition. Sometimes, the simple things (in mathematics and in life) are the most powerful.
2.6 Exercises
2. Draw the following vectors [ ] using the listed starting point ().
a) [2 2] (0,0)   b) [6 12] (1,-2)
c) [−1 0] (4,1)   d) [π e] (0,0)
e) [1 2] (0,0)   f) [1 2] (-3,0)
g) [1 2] (2,4)   h) [−1 −2] (0,0)
i) [−3 0] (1,0)   j) [−4 −2] (0,3/2)
k) [8 4] (1,1)   l) [−8 −4] (8,4)
4. Perform vector-scalar multiplication on the following. For 2D vectors, additionally draw the original and scalar-multiplied vectors.
a) 3·[1; 2]   b) (1/3)·[12 6]   c) 0·[e^10000; 1; 0; √π]
d) 4·[−3; 3]   e) λ·[a b c d e]^T   f) γ·[0 0 0 0 0]
2.7 Answers
1.
5
2 4
2 4
1 1
a) 3 b) 12 c) 6
3 d) 9 11
11 2
4 8
3 5
3
2. This one you should be able to do on your own. You just need
to plot the lines from their starting positions. The key here is
to appreciate the distinction between vectors and coordinates
(they overlap when vectors are in the standard position of
starting at the origin).
4. I’ll let you handle the drawing; below are the algebraic solu-
tions.
0
" #
3 h i 0
a) b) 4 2 c)
6 0
0
T
λa
" # λb
−12 h i
d) e) λc f) 0 0 0 0 0
12
λd
λe
5. These should be easy to solve if you passed elementary school arithmetic. There is, however, a high probability of careless mistakes (indeed, the further along in math you go, the more likely you are to make arithmetic errors).
a) [1; 1]   b) [4; 4]   c) [3; 3]   d) [4; 6]
e) [0; 0]   f) [2; 0]   g) [5; 2]   h) [4; −2]
i) [2; 5]   j) [−1; 7]   k) [5; −3]   l) [−2; 0]
2.8 Code challenges
2.9 Code solutions
1. You’ll notice that all scaled versions of the vector form a line.
Note that the Python implementation requires specifying the
vector as a numpy array, not as a list.
3 v = np . a r r a y ( [ 1 , 2 ] )
4 plt . plot ([0 , v [ 0 ] ] , [ 0 , v [ 1 ] ] )
5 f o r i in range ( 1 0 ) :
6 s = np . random . randn ( )
7 sv = s ∗v
8 p l t . p l o t ( [ 0 , sv [ 0 ] ] , [ 0 , sv [ 1 ] ] )
9 p l t . g r i d ( ’ on ’ )
10 p l t . a x i s ( [ − 4 , 4 , − 4 , 4 ] ) ;
Start this chapter happy
CHAPTER 3
Vector multiplications
In this chapter, we continue our adventures through the land
of vectors. Warning: There are a few sections here that might
be challenging. Please don’t get discouraged—once you make it
through this chapter, you can be confident that you can make it
through the rest of the book.
There are four ways to multiply a pair of vectors. They are: dot
product, outer product, element-wise multiplication, and cross
product. The dot product is the most important and owns most
of the real estate in this chapter.
3.1 Vector dot product: Algebra
The dot product, also called the inner product or the scalar prod-
uct (not to be confused with the scalar-vector product), is one of
the most important operations in all of linear algebra. It
is the basic computational building-block from which many opera-
tions and algorithms are built, including convolution, correlation,
the Fourier transform, matrix multiplication, signal filtering, and
so on.
= 5 + 12 + 21 + 32 = 70
I will mostly use the notation aT b for reasons that will become
clear after learning about matrix multiplication.
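A quick check in Python (the vectors here are illustrative; their element-wise products are 5, 12, 21, and 32, matching the sum above):

import numpy as np

a = np.array([1, 2, 3, 4])
b = np.array([5, 6, 7, 8])
print(np.dot(a, b))   # 70
print(a @ b)          # 70, using the @ operator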
Why does the dot product require two vectors of equal dimen-
sionality? Try to compute the dot product between the following
two vectors:
$$\begin{bmatrix}3\\4\\5\\0\\2\end{bmatrix} \cdot \begin{bmatrix}-2\\0\\1\\8\end{bmatrix} = 3\times(-2) + 4\times 0 + 5\times 1 + 0\times 8 + 2\times\,???$$
$$\mathbf{a}^T\mathbf{a} = \|\mathbf{a}\|^2 = \sum_{i=1}^{n} a_i a_i = \sum_{i=1}^{n} a_i^2 \tag{3.2}$$
I guess there’s a little voice in your head wondering whether it’s
called the length because it corresponds to the geometric length
of the vector. Your intuition is correct, as you will soon learn.
But first, we need to discuss some of the properties of the dot
product that can be derived algebraically.
3.2 Dot product properties
Therefore, the two sides of the equation are not equal; they wouldn’t
even satisfy a "soft equality" of having the same elements but in
a different orientation.
" #!
h i h i 2 h i
T T
u v w = 1 2 1 3 = 11 22 (3.5)
3
i 1 T 2
" #! " # " #
T
T h 14
u v w= 1 2 = (3.6)
3 3 21
And it gets even worse, because if the three vectors have different
dimensionalities, then one or both sides of Equation 3.4 might
even be invalid. I’ll let you do the work to figure this one out, but
imagine what would happen if the dimensionalities of u, v, and
w, were, respectively, 3, 3, and 4.
The conclusion here is that the vector dot product does not obey
the associative property. (Just to make life confusing, matrix mul-
tiplication does obey the associative property, but at least you
don’t need to worry about that for several more chapters.) 47
Commutative property The commutative property holds for the
vector dot product. This means that you can swap the order of
the vectors that are being "dot producted" together (I’m not sure
if dot product can be used as a verb like that, but you know what
I mean), and the result is the same.
$$\mathbf{a}^T\mathbf{b} = \mathbf{b}^T\mathbf{a} \tag{3.7}$$
The distributive law is that scalars distribute inside parentheses, e.g., a(b+c) = ab+ac.

Distributive property: This one also holds for the dot product, and it turns out to be really important for showing the link between the algebraic definition of the dot product and the geometric definition of the dot product, which you will learn below.
When looking at the equation below, keep in mind that the sum of
two vectors is simply another vector. (Needless to say, Equation
3.9 is valid only when all three vectors have the same dimension-
ality.)
$$\mathbf{w}^T(\mathbf{u} + \mathbf{v}) = \mathbf{w}^T\mathbf{u} + \mathbf{w}^T\mathbf{v} \tag{3.9}$$
Why is Equation 3.9 true? This has to do with how the dot product is defined as the sum of element-wise multiplications. Common terms can be combined across sums, which brings us to the following:

$$\sum_{i=1}^{n} w_i(u_i + v_i) = \sum_{i=1}^{n} w_i u_i + \sum_{i=1}^{n} w_i v_i \tag{3.10}$$
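A quick numerical check of Equation 3.9 (illustrative vectors):

import numpy as np

w = np.array([1, 2, 3])
u = np.array([4, 5, 6])
v = np.array([-1, 0, 2])
print(np.dot(w, u + v))              # 37
print(np.dot(w, u) + np.dot(w, v))   # 37, the same number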
There are many proofs of this inequality; I’m going to show one
that relies on the geometric perspective of the dot product. So
put a mental pin in this inequality and we’ll come back to it in a
few pages.
Practice problems: Compute the dot product between the following pairs of vectors.

a) [−4; −2]^T [1; 3]   b) [2; 3]^T [2; −3]   c) [7; −2]^T [−7; −24]   d) [3/2; 4/5]^T [2/3; −5/4]
e) [0; 1; 2]^T [−2; −1; 0]   f) [4; 1; 3]^T [4; 1; 3]   g) [7/2; −3; 6]^T [10; 3.5; −4]   h) [81; 3; 9]^T (1/3)[1; 1; 1]

Answers
a) -10   b) -5   c) -1   d) 0
e) -1   f) 26   g) 1/2   h) 31 = (81+3+9)/3
3.3 Vector dot product: Geometry
Note that if both vectors have unit length (‖a‖ = ‖b‖ = 1), then the dot product is simply the cosine of the angle between them.
First, let’s understand why the sign of the dot product is deter-
mined exclusively by the angle between the two vectors. Equation
3.15 says that the dot product is the product of three quanti-
ties: two magnitudes and a cosine. Magnitudes are lengths, and
therefore cannot be negative (magnitudes can be zero for the ze-
ros vector, but let’s assume for now that we are working with
non-zero-magnitude vectors). The cosine of an angle can range
between -1 and +1. Thus, the first two terms (kakkbk) are neces-
sarily non-negative, meaning the cosine of the angle between the
two vectors alone determines whether the dot product is positive
or negative. With that in mind, we can group dot products into
five categories according to the angle between the vectors (Figure Figure 3.1: Unit
3.2) (in the list below, θ is the angle between the two vectors and circle. The x-axis
coordinate corre-
α is the dot product): sponds to the
51 co-
1. θ < 90◦ → α > 0. The cosine of an acute angle is always
positive, so the dot product will be positive.
2. θ > 90◦ → α < 0. The cosine of an obtuse angle is
always negative, so the dot product will be negative.
Important: Vectors are orthogonal when they meet at a 90° angle, and orthogonal vectors have a dot product of zero.

3. θ = 90° → α = 0. The cosine of a right angle is zero, so the dot product will be zero, regardless of the magnitudes of the vectors. This is such an important case that it has its own name: orthogonal. Commit to memory that if two vectors meet at a right angle, their dot product is exactly zero.
Figure 3.2: Inferences that can be made about the sign of the
dot product (α), based on the angle (θ) between the two vec-
tors. Visualization is in 2D, but the terms and conclusions
extend to any number of dimensions.
3.4 Algebraic and geometric equivalence
Were you surprised that the algebraic and geometric equations for
the dot product looked so different? It’s hard to see that they’re
the same, but they really are, and that’s what we’re going to
discover now. The two expressions are printed again below for
convenience.
$$a_1 = a\sin(\theta_{ab}) \tag{3.20}$$
$$b_1 = a\cos(\theta_{ab}) \tag{3.22}$$
$$b = b_1 + b_2 \tag{3.23}$$
$$b_2 = b - b_1 \tag{3.24}$$
$$b_2 = b - a\cos(\theta_{ab}) \tag{3.25}$$
$$c^2 = a_1^2 + b_2^2 \tag{3.26}$$
Recall the trig identity that cos2 (θ) + sin2 (θ) = 1. Notice that
when θab = 90◦ , the third term in Equation 3.30 drops out and
we get the familiar Pythagorean theorem.
I realize this was a bit of a long tangent, but we need the Law
of Cosines to prove the equivalence between the algebraic and
geometric equations for the dot product.
$$\|\mathbf{c}\| = \|\mathbf{a} - \mathbf{b}\|$$
Notice that kak2 and kbk2 appear on both sides of the equation, so
these simply cancel. Same goes for the factor of −2. That leaves
us with the remarkable conclusion of the equation we started with
Whew! That was a really long proof! I’m pretty sure it’s the
longest one in the entire book. But it was important, because
we discovered that the algebraic and geometric definitions of the
dot product are merely different interpretations of the same op-
eration.
I also wrote that the equality holds when the two vectors form a
linearly dependent set. Two co-linear vectors meet at an angle of
0◦ or 180◦ , and the absolute value of the cosines of those angles
is 1.
There is a lot to say about the dot product. That’s
no surprise—the dot product is one of the most funda-
mental computational building blocks in linear algebra,
statistics, and signal processing, out of which myriad al-
gorithms in math, signal processing, graphics, and other
fields are built. A few examples: In statistics, the cosine of the angle between a pair of nor-
malized vectors is called the Pearson correlation coeffi-
cient; In the Fourier transform, the magnitude of the
dot product between a signal and a sine wave is the
power of the signal at the frequency of the sine wave; In
pattern recognition and feature-identification, when the
dot product is computed between multiple templates and
an image, the template with the largest-magnitude dot
product identifies the pattern or feature most present in
the image.
3.5 Linear weighted combination
$$\mathbf{w} = \lambda_1\mathbf{v}_1 + \lambda_2\mathbf{v}_2 + \dots + \lambda_n\mathbf{v}_n \tag{3.35}$$
It is assumed that all vectors vi have the same dimensionality,
otherwise the addition is invalid. The λ’s can be any real number,
including zero.
An example:
$$\lambda_1 = 1,\ \lambda_2 = 2,\ \lambda_3 = -3,\quad \mathbf{v}_1 = \begin{bmatrix}4\\5\\1\end{bmatrix},\ \mathbf{v}_2 = \begin{bmatrix}-4\\0\\-4\end{bmatrix},\ \mathbf{v}_3 = \begin{bmatrix}1\\3\\2\end{bmatrix}$$

$$\mathbf{w} = \lambda_1\mathbf{v}_1 + \lambda_2\mathbf{v}_2 + \lambda_3\mathbf{v}_3 = \begin{bmatrix}-7\\-4\\-13\end{bmatrix}$$
Code block 3.4: MATLAB

l1 = 1;
l2 = 2;
l3 = -3;
v1 = [4 5 1]';
v2 = [-4 0 -4]';
v3 = [1 3 2]';
l1*v1 + l2*v2 + l3*v3
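For completeness, here is a Python version of the same computation (a sketch; the book's corresponding Python listing is not reproduced here):

import numpy as np

l1, l2, l3 = 1, 2, -3
v1 = np.array([4, 5, 1])
v2 = np.array([-4, 0, -4])
v3 = np.array([1, 3, 2])
print(l1*v1 + l2*v2 + l3*v3)   # [-7 -4 -13]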
But let’s start with a bit of notation. The outer product is indi-
cated using a notation is that initially confusingly similar to that
of the dot product. In the definitions below, v is an M -element
column vector and w is an N -element column vector.
This notation indicates that the dot product (vT w) is a 1×1 array
(just a single number; a scalar) whereas the outer product (vwT )
is a matrix whose sizes are defined by the number of elements in
the vectors.
The vectors do not need to be the same size for the outer product,
unlike with the dot product. Indeed, the dot product expression
above is valid only when M = N, but the outer product is valid
even if M ≠ N.
Now let’s talk about how to create an outer product. There are
The element perspective lends itself well to a formula, but it’s not
always the best way to conceptualize the outer product.
If you look closely at the two examples above, you’ll notice that
when we swapped the order of the two vectors, the two outer
product matrices look the same but with the columns and rows
swapped. In fact, that’s not just a coincidence of this particular
example; that’s a general property of the outer product. It’s fairly
straightforward to prove that this is generally the case, but you
need to learn more about matrix multiplications before getting to
the proof. I’m trying to build excitement for you to stay motivated
to continue with this book. I hope it’s working!
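In the meantime, you can at least verify the swap property numerically (a Python sketch with illustrative vectors):

import numpy as np

v = np.array([2, 5, 4, 7])
w = np.array([4, 1, 0, 2])
A = np.outer(v, w)
B = np.outer(w, v)
print(np.array_equal(A, B.T))   # True: swapping the vectors transposes the outer product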
Code The outer product, like the dot product, can be imple-
mented in several ways. Here are two of them. (Notice that
MATLAB row vector elements can be separated using a space
or a comma, but separating elements with a semicolon would
produce a column vector, in which case you wouldn’t need the
transpose.)
Code block 3.5: Python

import numpy as np
v1 = np.array([2, 5, 4, 7])
v2 = np.array([4, 1, 0, 2])
op = np.outer(v1, v2)
Practice problems: Compute the outer product of the following pairs of vectors.

a) [−1; 1], [2; 3]^T   b) [4; 6], [2; 3]^T   c) [−1; 0; 1], [1; 2; 3]^T   d) [1; 3; 5; 7], [0; 1; 1; 0]^T

Answers
a) [−2 −3; 2 3]   b) [8 12; 12 18]   c) [−1 −2 −3; 0 0 0; 1 2 3]   d) [0 1 1 0; 0 3 3 0; 0 5 5 0; 0 7 7 0]
3.7 Element-wise (Hadamard) vector product
This is actually the way of multiplying two vectors that you might
have intuitively guessed before reading the past few sections.
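In Python, element-wise multiplication is simply the * operator on numpy arrays (a quick sketch with illustrative values):

import numpy as np

a = np.array([1, 3, 2])
b = np.array([-1, 1, -2])
print(a * b)               # [-1  3 -4]
print(np.multiply(a, b))   # the same thing, written as a function call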
Practice problems: Compute the element-wise product between the following pairs of vectors.

a) [1; 3; 2] ⊙ [−1; 1; −2]   b) [0; 0; 0] ⊙ [π; e⁵; 2]   c) [3; 5] ⊙ [1.3; 1.4; 3.2]

Answers
a) [−1; 3; −4]   b) [0; 0; 0]   c) undefined!
3.8 Cross product
The cross product is defined only for two 3-element vectors, and
the result is another 3-element vector. It is commonly indicated
using a multiplication symbol (×).
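In Python, the cross product is available as np.cross (illustrative values):

import numpy as np

a = np.array([1, 0, 0])
b = np.array([0, 1, 0])
print(np.cross(a, b))   # [0 0 1]: orthogonal to both a and b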
Answers
a) [−7; −3; 11]   b) [0; 0; 0]   c) [0; 0; 1]   d) undefined.
3.9 Unit vectors

|µ·3| = 1 when µ = 1/3.

$$\hat{\mathbf{v}} = \frac{1}{\|\mathbf{v}\|}\mathbf{v} = \frac{1}{\sqrt{\sum_{i=1}^{n} v_i^2}}\,\mathbf{v} \tag{3.41}$$
The norm of the vector, kvk, is a scalar, which means (1) divi-
sion is allowed (division by a full vector is not defined) and (2)
importantly, the direction of the vector does not change. Here is
a simple example:
" # " # " #
0 1 0 0
v= , v̂ = √ =
2 2
0 +2 2 2 1
This example also shows why the divisor is the magnitude (the
square root of sum of squared vector elements), and not the
squared magnitude vT v. It is also clear that the unit vector v̂
points in the same direction as v.
Taking µ = 1/‖v‖ allows for a quick proof that the unit vector really does have unit length:

$$\|\mu\mathbf{v}\| = \frac{1}{\|\mathbf{v}\|}\|\mathbf{v}\| = 1 \tag{3.42}$$
Code Fortunately, both Python and MATLAB have built-in
functions for computing the norm of a vector.
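A Python sketch that normalizes the example vector from above:

import numpy as np

v = np.array([0, 2])
v_hat = v / np.linalg.norm(v)    # divide by the magnitude
print(v_hat)                     # [0. 1.]
print(np.linalg.norm(v_hat))     # 1.0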
Practice problems: Compute a unit vector in the same direction as the following vectors. Then confirm that the magnitude of the resulting vector is 1.

a) [1; 3]   b) [3; 4]   c) [6; −8]   d) [.1; .2; .4; .2]

Answers
a) (1/√10)[1; 3]   b) [3/5; 4/5]   c) [3/5; −4/5]   d) (1/√.25)[.1; .2; .4; .2] = [.2; .4; .8; .4]
3.10 Exercises
" # " # 1 3 10 −1
1 3
d) , e) 4 , −3
f) 14 , 1
4 −3
0 9 −3 −6
" # " # 2i 3 " # " #
1 b 1 2
g) , h) 4 − i , 3
i) ,
a 1 2 −1
3 1
1 −4 2 −1
−2 3 −5 5
" # " #
−3 4 0 0 2.47 π3
j) , k) , l) , 3
4 −5 0 −1 −2.47 π
5 6 2 0
−6 −2 −8 −4
2 0
2 −2 " # " #
3 12 0 a
−3 , −8
m) −1 , 1
n) o) ,
5 0 b
−2 −2
2 4
xT y yT y
c) xT x
− xT yyT y
=0 d) ky + 23 xk2 = 49 (1 + 4xT y)
1 T 1 T
x x2y y kxk2
e) xT x+xT x−2yT y = xT y f) 2
yT ( 14 y)
+ kxk2 kyk3
−
yT yyT y
xT x
=1
3. Compute the angle θ between the following pairs of vectors. See if you can do this without a calculator!
a) [1; 2; 3], [−2; 1; 0]   b) [10; 12; 4], [2.5; 3; 1]   c) [−1; −2; 3], [3; 6; −9]
column-wise (each column in the product matrix is the left-
vector scaled by each element in the right-vector).
T T
" # " #T −1 0 1 0
1 1
a) b) 3
1
c)
0 1
2 2
0 3 0 0
T T T
1 5 1 10 1 a
2 6 2 20 1 b
d)
3 7 e)
3 30 f)
2 c
4 8 4 40 2 d
T T
4 T 10 a 1
6 " #
2 3 b
2
1
g)
3 7
h) i)
c
40 30
2
5
1 4 d 2
6
6. What is the magnitude of vector µv for the following µ?
a) µ = 0 b) µ = kvk c) µ = 1/kvk d) µ = 1/kvk2
3.11 Answers
1. a) 0 b) ac + bd c) ab + cd
d) −9 e) −9 f) 22
g) a + b h) 15 + 3i i) 0
j) 0 k) 5 l) 0
m)68 n) 0 o) 0
5 6 7 8
0 1 0
10 12 14 16
c)
0 0 0
d)
15 18 21 24
0 0 0
20 24 28 32
10 20 30 40 a b c d
20 40 60 80 a b c d
e) f)
2 a 2 b 2 c 2 d
30 60 90 120
70 40 80 120 160 2a 2b 2c 2d
T
24 28 20 30 400
12 14 10 6 80
g)
h)
18 21 15
90 1200
6 7 5 12 160
a a 2a 2a
b b 2b 2b
i)
c
c 2c 2c
d d 2d 2d
5. a) Yes b) No c) Yes d) No
6. a) 0   b) ‖v‖²   c) 1   d) 1/‖v‖
3.12 Code challenges
3.13 Code solutions
1. Only the basic code solutions are presented here. Keep in mind
that these challenges are also about critical thinking, not just
writing a few lines of code.
3. We don’t know which numbers are more important than oth-
ers, so I will use randomized weights. In fact, the solution here
is simply the the dot product between the data vector and any
other vector that sums to one.
CHAPTER 4
Vector spaces
Answers
a) R4 b) R3
c) R2 d) R1
4.2 Vector spaces

An axiom is a statement that is taken to be true without requiring a formal proof.

A vector space refers to any set of objects for which addition and scalar multiplication are defined. Addition and scalar multiplication obey the following axioms; these should all be sensible requirements based on your knowledge of arithmetic:
4.3 Subspaces and ambient spaces
Geometry A subspace is the set of all points that you can reach
by stretching and combining a collection of vectors (that is, addi-
tion and scalar multiplication).
Let’s start with a simple example of the vector v=[-1 2]. In its
standard position, this is a line from the origin to the coordinate (-
1,2). This vector on its own is not a subspace. However, consider
the set of all possible vectors that can be obtained by λv for
the infinity of possible real-valued λ’s, ranging from −∞ to +∞:
That set describes an infinitely long line in the same direction
as v, and is depicted in Figure 4.4 (showing the entire subspace
would require an infinitely long page).
Figure 4.4: A 1D subspace (gray dashed line) created from a vector (solid black line).

That gray dashed line is the set of all points that you can reach by scaling and combining all vectors in our collection (in this case, it's a collection of one vector). That gray line extends infinitely far in both directions, although the vector v is finite.
In fact, the set of all points reachable by scaling and adding two
vectors (that is, the linear weighted combination of those two
vectors) creates a new 2D subspace, which is a plane that extends
infinitely far in all directions of that plane. Figure 4.6 shows how
this looks in 3D: The subspace created from two vectors is a plane.
Any point in the plane can be reached by some linear combination
of the two grey vectors.
But wait a minute—will any two vectors form a plane? No, the
vectors must be distinct from each other. This should make sense
intuitively: two vectors that lie on the same line cannot define a
unique plane. In a later section, I’ll define this concept as linear
independence and provide a formal explanation; for now, try to
use your visual intuition and high-school geometry knowledge to
understand that a unique plane can be defined only from two
vectors that are different from each other.
as the 4th dimension: There is a single instant in time in
which an infinitely expansive space exists, but for all of
time before and after, that space doesn’t exist (and yes,
I am aware that time had a beginning and might have an
end; it’s just a visualization trick, not a perfect analogy).
Now take a moment to try to visualize what an 18D
subspace embedded in an ambient R96 "looks like." You
can understand why we need the algebraic perspective
to prevent overcrowding at psychiatric institutions...
∀ v, w ∈ V, ∀ λ, α ∈ R; λv + αw ∈ V (4.1)
4.4 Subsets
Subsets and subspaces are related concepts, but they are easily confused because of the similarity
of the names. Subsets are actually not important for the linear
algebra topics covered in this book. But it is important to be able
to distinguish subsets from subspaces.
• The set of all points on the XY plane such that x > 0 and
y > 0.
• The set of all points such that 4 > x > 2 and y > x2 .
• The set of all points such that y = 4x, for x ranging from
−∞ to +∞.
These are all valid subsets. The third example is also a subspace,
because the definition of that set is consistent with the defini-
tion of a subspace: an infinitely long line that passes through the
origin.
Practice problems Identify whether the following subsets are also subspaces.
If they are not subspaces, find a vector in the set and a scalar such that λv is outside the set.
a) All points in R²   b) All points on the line y = 2x + .1
c) Points satisfying x² + y² + z² = 1   d) All points on the line y = 2x + 0
Answers
a) subspace and subset b) subset
c) subset (doesn’t contain the origin!) d) subspace and subset
4.5 Span
Asking whether a vector w is "in the span of" a set S is just some fancy math-speak for asking whether you can create
some vector w by scalar-multiplying and adding vectors from set
S.
For now, it’s important to understand the concept that the span
of a set of vectors is the entire subspace that can be reached
of vectors and "see" the subspace dimensionality. You will learn
several algorithms for computing this, but for now, focus on the
concept that it is possible for a set to contain five 4D vectors that
together span a 2D subspace.
Reflection: Span makes me think of a robot holding a laser pointer in a dark room, and each new vector in the set is a degree of freedom of movement, like an extra joint in the robot's arm. With one vector the robot can shine the light only in one fixed direction. With two vectors the robot can swivel its arm and illuminate a plane. Three vectors means the robot can move its arm in all directions and can follow a mosquito around the room so I can finally kill it. The analogy kind of breaks down in higher dimensions. Maybe there are quantum curled-up mosquitos and we need better robots to hunt them down.
Practice problems Determine whether each vector is in the span of the associated set.
√
π3 − 2 1 0
a) v = , w= . S= ,
ln (e3 ) 1 0 1
−5 5 −1 0
b) p = 5 , q = 5 .
T = 0 , 1
25 25 4 1
0 3
1 2 1 0
c) m = , x = . U = 1 , 1
2 4
0 1
3 0
Answers
4.6 Linear independence

A set of vectors is independent if the subspace dimensionality equals the number of vectors.

Geometry: A set of vectors is independent if the dimensionality of the subspace spanned by that set of vectors is equal to the number of vectors in that set. For example, a set with one vector spans a line (assuming it is not the zeros vector) and is always an independent set (1 vector, 1 dimension); a linearly independent set of two vectors spans a plane (2 vectors, 2 dimensions); an independent set with three vectors spans a 3D space (3 vectors, 3 dimensions).
Finally, consider the right-hand set (panel C): This set of three
vectors in R2 is linearly dependent, because any one of the vectors
can be obtained by a linear combination of the other two vectors.
In this example, the middle vector can be obtained by averag-
ing the other two vectors (that is, summing them and scalar-
multiplying by λ = .5). But that’s not just a quirk of this ex-
ample. In fact, there is a theorem about independence that is
illustrated in Figure 4.9:
The proof of this theorem involves creating a matrix out of the set
of vectors and then computing the rank of that matrix. That's
beyond the scope of this chapter, but I wanted to present this
theorem now anyway, because I think it is intuitive: For example,
three vectors that lie on the same plane (a 2D subspace) cannot
possibly create a cube (a 3D subspace).
Also notice the clause could be: M ≤ N merely creates the op-
portunity for independence; it is up to the vectors themselves to
be independent or not. For example, imagine a set of 20 vectors
in R25 that all lie on the same line (that is, the same 1D subspace
embedded in ambient R25 ). That set contains 20 vectors, but ge-
ometrically, they’re occupying a 1D subspace, hence, that set is
linearly dependent.
$$\{\mathbf{v}_1, \mathbf{v}_2, \mathbf{v}_3\} = \left\{\begin{bmatrix}0\\2\\5\end{bmatrix}, \begin{bmatrix}-27\\5\\-37\end{bmatrix}, \begin{bmatrix}3\\1\\8\end{bmatrix}\right\}, \qquad \mathbf{v}_2 = 7\mathbf{v}_1 - 9\mathbf{v}_3 \tag{4.5}$$
Both examples show dependent sets. The way I’ve written the
equations on the right, it seems like w2 and v2 are the dependent
vectors while the other vectors are independent. But remember
from the beginning of this section that linear (in)dependence is a
property of a set of vectors, not of individual vectors. You could
just as easily isolate v1 or v3 on the left-hand side of Equation
4.5.
The important point is that it is possible to create any one vector
in the set as some linear weighted combination of the other vectors
in the same set.
Try as hard as you can and for as long as you like; you will never
be able to define any one vector in each set using a linear weighted
combination of the other vectors in the same set. That’s easy to
see in the first set: When considering only the first two rows, then
w1 = 2w2 . However, this weighted combination fails for the third
row. Mapping this back onto the geometric perspective, these
two vectors are two separate lines in R3 ; they point in similar
directions but are definitely not collinear.
Linear dependence: $\mathbf{0} = \lambda_1\mathbf{v}_1 + \lambda_2\mathbf{v}_2 + \dots + \lambda_n\mathbf{v}_n$, with at least one λ ≠ 0.
This may seem like a strange definition: Where does it come from
and why is the zeros vector so important? Some rearranging,
starting with subtracting λ1 v1 from both sides of the equation,
will reveal why this equation indicates dependence:
$$\lambda_1\mathbf{v}_1 = \lambda_2\mathbf{v}_2 + \dots + \lambda_n\mathbf{v}_n$$
$$\mathbf{v}_1 = \frac{\lambda_2}{\lambda_1}\mathbf{v}_2 + \dots + \frac{\lambda_n}{\lambda_1}\mathbf{v}_n, \qquad \lambda \in \mathbb{R},\ \lambda_1 \neq 0 \tag{4.7}$$
Because the λ’s are scalars, then λn /λ1 is also just some scalar. If
you like, you could replace all the fractional constants with some
other constant, e.g., βn = λn /λ1 .
Step 2: Check for a vector of all zeros. Any set that contains the
zeros vector is a dependent set.
Step 3: If you’ve gotten this far, it means you need to start doing
some trial-and-error educated guesswork. Start by looking
for zeros in the entries of some vectors, with the knowledge
that zeros in some vectors in combination with non-zero
entries in corresponding dimensions in other vectors is a tip
towards independence (you cannot create something from
nothing, with the possible exception of the big bang).
of all vectors gives the zeros vector. For the first dimension, the
coefficients (-2, -1, 1) will produce a zero (−2×1+−1×2+1×4 = 0).
Those same coefficients will also work for the second dimension,
but they don’t work for the third dimension. This means either (1)
a different set of coefficients could work for all three dimensions,
or (2) the set is linearly independent.
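In practice, you can settle that question numerically with the rank-based test mentioned earlier: put the vectors into the columns of a matrix and compare the rank to the number of vectors. A Python sketch (illustrative vectors; note that numpy's matrix_rank uses a numerical tolerance):

import numpy as np

V = np.column_stack(([1, 2, 3], [2, 4, 6], [1, 0, 1]))   # vectors as columns
print(np.linalg.matrix_rank(V), V.shape[1])              # rank 2 < 3 vectors: dependent set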
Reflection: Consistent terminology is important in any branch of knowledge, and confusing or redundant terminology can hinder progress and create confusion. Unfortunately, the term independence has different meanings in different areas of mathematics. In statistics, two variables are "independent" if the correlation coefficient between them
is zero. In linear algebra, that case would be called "or-
thogonal." What is called "independent" in linear algebra
would be called "correlated" in statistics. There is no
easy solution to terminological confusions; you just have
to be flexible and make sure to clarify your terms when
talking to an audience from a different background.
Practice problems: Determine whether the following sets are independent or dependent. If the set is dependent, come up with a set of coefficients that illustrates their dependence.

a) [0; 0; 0], [1; 1; 1], [2; 2; 2]   b) [1; 1; 2], [1; 1; 1], [2; 2; 1]
c) [0; 7; 2], [1; 2; 6], [−2; 17; −6]   d) [1; 1], [2; 2.000001]

Answers
a) Dependent (1,0,0)   b) Dependent (1,-3,1)
c) Dependent (-3, 2, 1)   d) Independent
4.7 Basis
The answers are that set M2 is a basis for R3 because there are
three vectors in R3 that form a linearly independent set.
Set M3 is also not a basis set because one can define the first
or second vectors as a linear combination of the other vectors.
However, the third vector and either the first or second vector
would form a basis for a 2D subspace, which is a plane in R3 .
0 1 1 2 0 1
Figure 4.10: A point on a graph has different coordinates depending on the basis vectors.

For basis set T, we have p_[T] = [2, -.5]. Why is this the correct answer? Starting from the origin, draw a vector that is two times the first vector in set T, and then add -.5 times the second vector in set T. That will get you from the origin to point p.
Now for set U . Ah, this is a trick. In fact, set U is not a basis set,
because it is not a linearly independent set. No set with the zeros
vector can be independent. For our exercise here, it is impossible
to reach point p using the vectors in set U , because span(U ) is a
1D subspace that does not touch point p.
Independent bases ensure uniqueness.

This set is not a basis for R², but let's pretend for a moment that it is. The vector that, in standard coordinates, is given by [-2 6] can be obtained by scaling these "basis vectors" by (-6,4,0) or (0,-2,6) or (-2,0,4), or an infinite number of other possibilities. That's confusing. Therefore, mathematicians decided that a vector must have unique coordinates within some basis set, which happens only when the set is linearly independent.
For example, using the first two sets, the Cartesian-basis coordi-
nate (3,4) would be obtained by scaling the basis vectors by [3,8]
and [-3,4], respectively.
The truth is that not all basis sets are created equal. Some bases
are better than others, and some problems are easier to solve in
certain bases and harder to solve in other bases. For example,
that third basis set above is valid, but would be a huge pain to
work with. In fact, finding optimal basis sets is one of the most
important problems in multivariate data science, in particular data
compression and components analyses.
Consider Figure 4.11: The dots correspond to data points, and
the black lines correspond to basis vectors. The basis set in the
left graph was obtained via principal components analysis (PCA)
whereas the basis set in the right graph was obtained via inde-
pendent components analysis (ICA). (You’ll learn the math and
implementation of PCA in Chapter 19.) Thus, these two analyses
identified two different basis sets for R2 ; both bases are valid, but
which one describes the patterns in the data better?
Figure 4.11: This example of 2D datasets (both graphs show
the same data) with different basis vectors (black lines) is an
example of the importance of choosing basis sets for charac-
terizing and understanding patterns in datasets.
Reflection: In this section we are discussing only basis vectors. You may have also heard about basis functions or basis images. The concept is the same—a basis is the set of the minimum number of metrics needed to describe something. In the Fourier transform, for example, sine waves are basis functions, because all signals can be represented using sine waves of different frequencies, phases, and amplitudes.
4.8
Exercises
2 0 0 −10 −3
a)
2 b)
4 c)
0 −2
d) e)
2
6 12 0 −3 −3
0 1 −1
c) 1 , 0 , 1
3 3 2
1 2 3
i) 0 , 1 , 1
1 2 3
0 c
3 13.5
0 d
1
4 5 1
c) , ,
2 λ 0 1
,
3 5 8 1
1 2 1 6 2 4
c) 1 , 0 , 2 d) 9 , 3 , 6
1 1 3 3 1 2
−3 4.5 −1.5
c) 2 , −3 , 1
13 −19.5 6
4.9
Answers
d) T : [2,-3] e) S: [2,-3]
2. a) no b) yes c) no
d) Dependent, no e) Dependent, yes f) Dependent, no
4. a) λ = 9 b) λ = 0 or any- c) Any λ
thing if
a=b=c=d=0
7. a) 1D b) 2D c) 3D d) 1D
CHAPTER 5
Matrices
In this book, we will work only with matrices that are rows ×
columns, or matrices that you could print out and lay flat on a
table. The terminology can get a bit confusing here, because you
might think that these are 2D matrices. However, although they
would be in two physical dimensions when printed on a piece of
paper, the number of dimensions in which a matrix lives is more
open to interpretation compared to vector dimensionality (more on this below).
5.2 Matrix terminology and notation
5.3 Matrix dimensionalities
• ℝ^{M×N}
• ℝ^{MN}, if each matrix element is its own dimension (this is the closest interpretation to the vector dimensionality definition).
• ℝ^M, if the matrix is conceptualized as a series of column vectors (each column contains M elements and is thus in ℝ^M).
• ℝ^N, if the matrix is conceptualized as a stack of row vectors.
5.4 The transpose operation

$$\mathbf{A}^{TT} = \mathbf{A} \tag{5.2}$$
T 16 6 0 −4
0 1 0 5 1
a) b) 6 10 −3 c)
−5
2 0 3 −4 6
0 −3 22
0 5 5
Answers
T −1 −3 1
0 2 16 6 0 −4 5 1
a) 1 0 b) 6 10 −3 c)
−5
−4 6
0 3 0 −3 22
0 5 5
5.5 Matrix zoology
There are many categories of special matrices that are given names
according to their properties, or sometimes according to the per-
son who discovered or characterized them.
Let me guess what you are thinking: "But Mike, a square is also a
rectangle!" Yes, dear reader, that’s technically true. However, for
ease of comprehension, it is assumed that a "rectangular matrix"
$$\mathbf{A} = \mathbf{A}^T \tag{5.3}$$
$$\mathbf{A} = -\mathbf{A}^T \tag{5.5}$$
$$\mathbf{A}\mathbf{I} = \mathbf{A}$$
If all diagonal elements are the same, then the matrix can be writ-
Diagonal matrices are useful because they simplify operations in-
cluding matrix multiplication and matrix powers (An ). Trans-
forming a matrix into a diagonal matrix is called diagonalization
and can be achieved via eigendecomposition or singular value de-
composition, as you will learn in later chapters.
$$\mathbf{Q}^T\mathbf{Q} = \mathbf{I} \tag{5.9}$$
Notice that the main diagonal is the same as the first element of
the vector (a), the next off-diagonal is the second element of the
vector (b), and so on.
Notice that the ith column (and the ith row) of the Hankel matrix
comes from starting the vector at the ith element and wrapping
around. You can also see how the anti-diagonals relate to the
vector.
The algorithm to compute a Hankel matrix involves populating a
matrix Y from elements of a vector x:
the final row of the matrix, or else it will have zeros like in the
first example above.
5.6 Matrix addition and subtraction
Now that you know some matrix terminology and some special
matrices, it is time to begin learning how to work with matrices.
The rest of this chapter focuses on the "easy" arithmetic opera-
tions on matrices, and the next few chapters will deal with more
advanced topics. We begin with basic matrix arithmetic.
Matrix addition is simple, and it works how you would think it
should work: Add or subtract each corresponding element in the
two matrices. For addition to be a valid operation, both matrices
must be the same size—M×N —and the resulting matrix will also
be M ×N .
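A quick Python illustration (values are mine):

import numpy as np

A = np.array([[1, 1], [2, 3]])
B = np.array([[4, 3], [2, 1]])
print(A + B)   # [[5 4] [4 4]]
print(A - B)   # [[-3 -2] [ 0  2]]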
Practice problems: Solve the following matrix arithmetic problems.

a) [1 1; 2 3] + [4 3; 2 1]   b) [0 5; −4 6; −3 0] + [0 1; 1 1; 1 0]   c) [2 4 6; 1 3 5; 4 2 0; 5 6 7] + [9 8; 7 6; 5 4]

Answers
a) [5 4; 4 4]   b) [0 6; −3 7; −2 0]   c) Invalid operation!
5.7 Scalar-matrix multiplication
This means that you can move the scalar around, and the result is
unchanged. That feature turns out to be crucial for several proofs
and derivations. The following illustrates the idea.
5.8 "Shifting" a matrix
There are three properties of matrix shifting that can be seen from
this example:
import numpy as np
l = .01                      # the shift amount (lambda); any scalar works
I = np.eye(4)                # 4x4 identity matrix
A = np.random.randn(4, 4)    # a random square matrix
As = A + l*I                 # the "shifted" matrix
Answers
a) [.5 3 0; 2 1.1 0; .1 .4 .1]   b) [42 42 42; 234 745 12; 0 33 1000]
5.9 Diagonal and trace
v = diag(A)
Note that this formula does not require the matrix to be square.
Observe the following examples (diagonal elements highlighted).
5 3
10
5 3 8 5 5 3 8 3 9 5
8
" #
6 5
diag
1 0 4 = 0 , diag
1 0 4 3 1 = 0 , diag =
35 0
8 6 2 2 8 6 2 9 9 2
97
8 2
$$\text{tr}(\mathbf{A}) = \sum_{i=1}^{M} a_{i,i} \tag{5.15}$$
import numpy as np
A = np.random.randn(4, 4)
tr = np.trace(A)
Answers
a) −9 b) 4 c) 0
5.10 Exercises
1. For the following matrix and vectors, solve the given arith-
metic problems, or state why they are not solvable.
3 1
2 −1 −2 −6 −6
5 0
4 , v = 0 , w = 1 , A = 1 −1
u= 0 −2
1 −1 0 −1 −4
2 5
d) vwT − A e) vwT + AT
is valid.
" # " #
2 4 3 −2 −1 3
A= , B= ,
0 1 3 6 −7 7
0 −6 1 2
C = −3 −2
D = 3 4
−2 7 2 4
a) A + 3B b) A + C c) C − D
d) D + C e) AT + D f) (A+B)T +2C
a b c 0 b c 1 0 0 0
d) −b
d e e) −b
0 f f) 0 32 0 0
−c −e f −c −e 0 0 0 42 0
…regarding trace, matrix addition, and scalar multiplication.

A = [5 −3; 2 −3],  B = [−4 −1; 1 3],  C = [a c; b d],  λ = 5,  α = −3
5.11 Answers

1.
1 5 0
−2 −1
0
a) Size mismatch. b)
−4
c)
7
11 21 0
4
12 6 10
5 1 6
10
4 0
4 18 1
d)
0 2 e)
7 5 −3 −10
f)
0 11 5 7 2 24
" #
4 8 10
g) h) Invalid.
0 −6 23
3. n(n+1)/2
3 1 2 3 1 2
…because of the element-wise definition of both trace and matrix addition:

$$\sum_{i=1}^{M} a_{i,i} + \sum_{i=1}^{M} b_{i,i} = \sum_{i=1}^{M} (a_{i,i} + b_{i,i})$$
9. a) -6 b) 0 c) ae + bf + cg +
dh
a) 2 b) -1 c) 1
j) 5 k) 5 l) 1
5.12 Code challenges
5.13 Code solutions
3. The point here is to appreciate that indexing the diagonal of
a matrix involves the i, i indices.
Start this chapter happy
CHAPTER 6
Matrix multiplication
Multiplying matrices is considerably more complicated than mul-
tiplying regular numbers. The normal rules you know about mul-
tiplication (e.g., a×b = b×a) often don’t apply to matrices, and
there are extra rules you will need to learn. Furthermore, there
are several ways of multiplying matrices. And to make matters
more complicated, not all pairs of matrices can be multiplied. So,
take a deep breath, because this chapter is your first deep-dive
into linear algebra!
6.1 "Standard" matrix multiplication
For lack of a better term, this method will be called the "standard"
method. Unless otherwise explicitly stated, you can assume (in
this book and elsewhere) that two matrices next to each other
(like this: AB) indicates "standard" matrix multiplication.
Validity Before learning how standard matrix multiplication works,
you need to learn when matrix multiplication is valid. The rule
for multiplication validity is simple and visual, and you need to
memorize this rule before learning the mechanics of multiplica-
tion.
If you write the matrix sizes underneath the matrices, then ma-
trix multiplication is valid only when the two "inner dimensions"
match, and the size of the resulting product matrix is given by
the "outer dimensions." By "inner" and "outer" I’m referring to
the spatial organization of the matrix sizes, as in Figure 6.1.
The first pair (AB) is valid because the "inner" dimensions match
(both 2). The resulting product matrix will be of size 5×7. The
second pair shows the lack of commutativity in matrix multitpli-
cation: The "inner" dimensions (7 and 5) do not match, and thus
the multiplication is not valid. The third pair is an interesting
case. You might be tempted to call this an invalid operation;
however, when transposing C, the rows and columns swap, and
so the "inner" dimensions become consistent (both 5). So this 141
multiplication is valid.
Here’s something exciting: you are now armed with the knowledge
to understand the notation for the dot product and outer product.
In particular, you can now appreciate why the order of transposi-
tion (vT w or vwT ) determines whether the multiplication is the
dot product or the outer product (Figure 6.2).
Matrix multiplication
Practice problems For the following matrices, determine whether matrix multiplication is
valid, and, if so, the size of the product matrix.
Answers
a) no b) 2×4 c) 2×3 d) 3×2 e) no
142
Code block 6.2: MATLAB
1 M1 = randn ( 4 , 3 ) ;
2 M2 = randn ( 3 , 5 ) ;
3 C = M1 ∗ M2
Here is another example; make sure you see how each element in
the product matrix is the dot product between the corresponding
row or the left matrix and column in the right matrix.
3 4 " # 3·5 + 4·3 3·1 + 4·1 27 7
5 1
−1 2
3 1 = −1·5 + 2·3 −1·1 + 2·1 = 1 1
0 4 0·5 + 4·3 0·1 + 4·1 12 4
Now for the same numerical example you’ve seen in the previous
two perspectives:
3 4 " # 3 4 3 4 27 7
−1 2 5 1 = 5 −1 + 3 2
3 1 1−1
+ 1 2 = 1 1
0 4 0 4 0 4 12 4
(4) The "row perspective" You guessed it—it’s the same con-
cept as the column perspective but you build up the product ma-
trix one row at a time, and everything is done by taking weighted
146 combinations of rows. Thus: each row in the product matrix is the
weighted sum of all rows in the right matrix, where the weights
are given by the elements in each row of the left matrix. Let’s
begin with the simple example:
h i h i
" #" # 1 a b +2 c d
1 2 a b
= (6.3)
3 4 c d h i h i
3 a b +4 c d
Practice problems Multiply the following pairs of matrices four times, using each of four
perspectives. Make sure you get the same result each time.
4 4 1 2 3 a b c
3 0 3
a) 0 1 b) 4
5 6 d e f
1 1 0
4 1 7 8 9 g h i
Answers
1a + 2 d + 3 g 1b + 2 e + 3 h 1c + 2 f + 3 i
24 15
a) b) 4 a + 5 d + 6 g 4b + 5e + 6h 4 c + 5 f + 6 i
4 5
7a + 8d + 9g 7b + 8e + 9h 7c + 8f + 9i
147
Matrix multiplication
Practice problems Perform the following matrix multiplications. Use whichever perspec-
tive you find most confusing (that’s the perspective you need to practice!).
2 −1 2
3 4 0 1 1 3 2
a) 4 0 0 b)
0 4 1 1 2 0 1
0 1 1
Answers
22 −3 6 3 3
a) b)
16 1 1 3 4
6.2
Multiplication and equations
You know from your high-school algebra course that you are al-
lowed to add and multiply terms to an equation, as long as you
148 apply that operation to both sides. For example, I can multiply
both sides of this equation by 7:
4 + x = 5(y + 3)
Notice that the bottom two equations are the same: Because
scalars obey the commutative law, the 7 can go on the left or
the right of the parenthetic term.
B = λ(C + D)
AB = λ(C + D)A
149
Code The non-commutivity of matrix multiplication is easy to
confirm in code. Compare matrices C1 and C2.
1 A = randn ( 2 , 2 ) ;
2 B = randn ( 2 , 2 ) ;
3 C1 = A∗B ;
4 C2 = B∗A;
6.3
Matrix multiplication with a diagonal matrix
Let’s see two examples of 3×3 matrices; notice how the diagonal
elements appear in the rows (pre-multiply) or the columns (post-
multiply). Also notice how the product matrix is the same as the
dense matrix, but either the columns or the rows are scaled by
150 each corresponding diagonal element of the diagonal matrix.
a 0 0 1 2 3 1a 2a 3a
0 b 0 4 5 6 = 4b 5b 6b (6.4)
0 0 c 7 8 9 7c 8c 9c
1 2 3 a 0 0 1a 2b 3c
4 5 6 0 b 0 = 4a 5b 6c (6.5)
7 8 9 0 0 c 7a 8b 9c
Practice problems Perform the following matrix multiplications. Note the differences be-
tween a and c, and between b and d.
2 0 1 1 5 0 4 3 1 1 2 0 4 3 5 0
a) b) c) d)
0 3 1 1 0 6 3 4 1 1 0 3 3 4 0 6
Answers
2 2 20 15 2 3 20 18
a) b) c) d)
3 3 18 24 2 3 15 24
6.4
LIVE EVIL! (a.k.a. order of operations)
" #
4 2
I assume you got the matrix . Now let’s try it again, but
6 3
this time transpose each matrix individually before multiplying
them: " #T " #T
2 4 0 1
= ?
1 2 1 1
OK, now let’s try it one more time. But now, instead of applying
152 the transpose operation to each individual matrix, swap the order
of the matrices. Thus:
" #T " #T
0 1 2 4
= ?
1 1 1 2
" #
4 2
And now you get the same result as the first multiplication: .
6 3
Practice problems Perform the following matrix multiplications. Compare with the prob-
lems and answers on page 148.
T
2 −1 2 T T T
3 4 0 3 2 1 1
a) 4
0 0 b)
0 4 1 0 1 1 2
0 1 1
Answers
22 16
3 3
a) −3 1 b)
3 4
6 1
154
6.5
Matrix-vector multiplication
h h i
T
b A= d e = ad + be bd + ce
b c
Notice, as mentioned above, that the two results are not identical
vectors because one is a column and the other is a row. However,
they have identical elements in a different orientation.
Answers
21
16 26
a)
21 b)
34
8
156
6.6
Creating symmetric matrices
In the previous chapter, I claimed that all hope is not lost for
non-symmetric matrices who aspire to the glorious status of their
symmetric conspecifics, upon whom so many luxuries of linear
algebra are bestowed. How can a non-symmetric matrix become
a symmetric matrix?
1
C = (AT + A) (6.9)
2
CT = A + AT (6.14)
proof does not depend on the size of the matrix, which shows that
our example above was not a fluke.
Practice problems Create symmetric matrices from the following matrices using the ad-
ditive method.
−3 0 −2 −2
1 11 1 −1 −3 −2 −6
a) −5 −2 1 b)
−4
−8 −7 4
−1 −5 2
6 −2 5 4
Answers
− 21
−3 −3 2
1 3 0 − 1
2
−3 −5 −4
a) 3 −2 −2 b)
−3 9
−5 −7 2
0 −2 2 9
2 −4 2
4
pends on how the data are stored (e.g., observations ×
features vs. features × observations), which usually de-
pends on coding preferences or software format. In the
written medium, I prefer AT A because of the aesthetic
symmetry of having the T in the middle.
Practice problems Create two symmetric matrices from each of the following matrices, using
the multiplicative method (AT A and AAT ).
3 0 3 1 6
a) b)
0 7 0 6 1
Answers
9 0 9
37 12 37 12
a) AT A = 0 49 0 , AAT = b) AT A = , AAT =
12 37 12 37
9 0 9
18 0
0 49
159
6.7
Multiplication of two symmetric matrices
You can also see that the product matrix is not symmetric by
trying to prove that it is symmetric (the proof fails, which is
called proof-by-contradition). Assume that A and B below are
both symmetric matrices.
(AB)T = BT AT = BA 6= AB (6.18)
components analysis, and one of the most important
advantages of generalized eigendecomposition, which is
the computational backbone of many machine-learning
methods, most prominently linear classifiers and discrim-
inant analyses.
6.8
Element-wise (Hadamard) multiplication
C=AB (6.19)
Answers
4 −12 27
a) 15 8 6 b) Undefined!
0 −14 35
6.9
Frobenius dot product
The Frobenius dot product, also called the Frobenius inner prod-
uct, is an operation that produces a scalar (a single number) given
two matrices of the same size (M ×N ).
Vectorizing a matrix
a c e c
vec =
b d f d
e
Anyway, with that tangent out of the way, we can now compute
the Frobenius dot product. Here is an example:
*" # " #+
1 5 0 4 −1 3
hA, BiF = , =5
164 −4 0 2 2 6 7 F
Note the notation for the Frobenius dot product: hA, BiF
I’ve omitted the matrix sizes in this equation, but you can tell
from inspection that the operation is valid if both matrices are
Code The code below shows the trace-transpose trick for com-
puting the Frobenius dot product.
165
Code block 6.10: MATLAB
1 A = randn ( 4 , 3 ) ;
2 B = randn ( 4 , 3 ) ;
3 f = t r a c e (A’ ∗B ) ;
Practice problems Compute the Frobenius dot product between the following pairs of
matrices.
4 2 7 −2
6 1 2 4 −11 1
a) 3 2 , −7 −8 b) ,
3 3 −2 6 1 −2
−5 −1 −1 8
Answers
a) −16 b) 40
Matrix multiplication
In section 3.9 you learned that the square root of the dot product
of a vector with itself is the magnitude or length of the vector,
which is also called the norm of the vector.
Let’s start with the Frobenius norm, because it’s fresh in your
mind from the previous section. The equation below is an alter-
166 native way to express the Frobenius norm.
Frobenius matrix norm
v
um X
n
uX
kAkF = t (ai,j )2 (6.24)
i=1 j=1
Now you know three ways to compute the Frobenius norm: (1) di-
rectly implementing Equation 6.24, (2) vectorizing the matrix and
computing the dot product with itself, (3) computing tr(AT A).
Of course, this formula is valid only for two matrices that have
the same size.
There are many other matrix norms with varied formulas. Dif-
ferent applications use different norms to satisfy different crite-
ria or to minimize different features of the matrix. Rather than
overwhelm you with an exhaustive list, I will provide one general
formula for the matrix p-norm; you can see that for p = 2, the
following formula is equal to the Frobenius norm.
1/p
XM X
N
kAkp = |aij |p (6.26)
i=1 j=1
167
Cauchy-Schwarz inequality In Chapter 3 you learned about the
Cauchy-Schwarz inequality (the magnitude of the dot product be-
tween two vectors is no larger than the product of the norms of the
two vectors). There is a comparable inequality for the Frobenius
norm of a matrix-vector multiplication:
The proof for this inequality comes from integrating (1) the row
perspective of multiplication, (2) the Cauchy-Schwarz inequality
for the vector dot product introduced in Chapter 3, and (3) linear-
ity of the Frobenius norm across the rows (that is, the Frobenius
norm of a matrix equals the sum of the Frobenius norms of the
rows; this comes from the summation in Equation 6.24).
Matrix multiplication
Practice problems Compute Euclidean distance between the following pairs of matrices
(note: compare with the exercises on page 166).
4 2 7 −2
6 1 2 4 −11 1
a) 3 2 , −7 −8 b) ,
3 3 −2 6 1 −2
−5 −1 −1 8
a) CB b) CT B
c) (CB)T d) CT BC
Matrix multiplication
e) ABCB f) ABC
g) CT BAT AC h) BT BCCT A
i) AAT j) AT A
q) A (ABC) r) A ABC(BC)T
a 0 1 a b c
i) #2,3: 0 b 0
1 2 3
1 0 c 0 0 1
6.12 Exercises
3 2 3
1 0 0 c 1 0 1 1
T
1 1 0 1 3 1 3 2
g) 2 0 −4 0 h) 3 6 1 5
5 1 0 1 2 3 5 0
a) AB = AT BT b) AB = (AB)T
c) AB = ABT d) AB = AT B
e) AB = BT A f) AB = (BA)T
2 5 7 a d f
5 3 6 , d b e
7 6 4 f e c
4 −5 8 4 −5 8 " # " #
a b a b
c) 1 −1 2 , 1 −1 2 d)
,
c d a b
−2 2 −4 −2 2 −4
" # " # 1 1 7 1 0
a b a b
e) , f) 2 2 6 , 0 1
c d c d
3 3 5 1 1
a) AB b) AC c) BC d) CA
172
6.13 Answers
1. a) no b) yes: 4×3
c) no d) yes: 4×4
e) no f) yes: 2×4
g) yes: 4×4 h) no
k) no l) no
m)yes: 3×3 n) no
6.13 Answers
o) yes: 3×4 p) yes: 3×3
q) no r) yes: 2×3
" # " #
3. 4 h i 7 h i
a) b) 4 9 c) d) 10 11
9 13
b 6 h i
e)
c −8
f) g) invalid h) 27 22 21
a 6
173
4. a) Both matrices symmetric. b) A = B, both symmetric.
Or A = BT
c) B = BT d) A = AT
6. a) a+2b+3c+4d b) 63 c) 135
7. 6 0 0 6 3 9
a) 0 10
0
b) 0
8 2
0 0 −3 −2 −2 −3
4 2 6 6 2 −3
c) 0 20 5
d) 0 8 −1
6 6 9 6 4 −3
4 5 9 12 4 −6
e)
0 20 3 0 40 −5
f)
4 10 9 18 12 −9
12 15 27 12 6 18
g) 0
40 6
h) 0
40 10
−4 −10 −9 −6 −6 −9
174
6.14 Code challenges
175
6.15 Code solutions
3 C1 = np . z e r o s ( ( 2 , 3 ) )
4 f o r i in range ( 4 ) :
5 C1 += np . o u t e r (A [ : , i ] , B [ i , : ] )
6
7 C1 − A@B # show e q u a l i t y
176
2. The diagonals of the two product matrices are the same.
1/2
3. C1 = C2 sounds like an amazing and quirky property of
matrix multiplication, but it’s actually trivial because A is
√
diagonal. The reason it works is the same reason that x = x2
for x ≥ 0.
177
4. The challenge isn’t so challenging, but it’s good excuse to gain
experience coding with norms. My strategy for the inequality
is to show that the right-hand side minus the left-hand side is
positive. You can run the code multiples times to test different
random numbers.
8 RHS−LHS # s h o u l d always be p o s i t i v e
178
Matrix rank
CHAPTER 7
Why all the fuss about rank? Why are full-rank matrices
so important? There are some operations in linear alge-
bra that are valid only for full-rank matrices (the matrix
inverse being the most important). Other operations
are valid on reduced-rank matrices (for example, eigen-
Reflection
decomposition) but having full rank endows some ad-
ditional properties. Furthermore, many computer algo-
rithms return more reliable results when using full-rank
compared to reduced-rank matrices. Indeed, one of the
main goals of regularization in statistics and machine-
learning is to increase numerical stability by ensuring
that data matrices are full rank. So yeah, matrix rank
is a big deal.
181
7.2
Interpretations of matrix rank
Below are a few matrices and their ranks. Although I haven’t yet
taught you any algorithms for computing rank, try to understand
why each matrix has its associated rank based on the description
Rank
above.
1 1 3 1 3.1 1 3 2 1 1 1 0 0 0
2, 2 6 , 2 6 , 6 6 1, 1 1 1, 0 0 0
4 4 12 4 12 4 2 0 1 1 1 0 0 0
r=1 r=1 r=2 r=3 r=1 r=0
183
7.3
Computing matrix rank
Code Except for your linear algebra exams, you will never com-
pute matrix rank by hand. Fortunately, both Python and MAT-
LAB have functions that return matrix rank.
184
Code block 7.2: MATLAB
1 A = randn ( 3 , 6 ) ;
2 r = rank (A)
Practice problems Compute the rank of the following matrices based on visual inspection.
1 4 2 −4 10
1 −2 1 2
a) b) c) 5 −220 d) 2 3 −4
2 −4 2 1
5 5/2 4 2 0
Answers
7.4
Rank and scalar multiplication
I’ll keep this brief: Scalar multiplication has no effect on the rank
of a matrix, with one exception when the scalar is 0 (because this
produces the zeros matrix, which has a rank of 0).
Code There isn’t any new code in this section, but I decided
to add these code blocks to increase your familiarity with matrix 185
rank. You should confirm that the ranks of the random matrix
and the scalar-multiplied matrix are the same.
7.5
Rank of added matrices
For example, imagine two 3×4 matrices, each with rank 3. The
sum of those matrices cannot possibly have a rank of 6 (Equation
7.3), because 6 is greater than the matrix sizes. Thus, the largest
possible rank of the summed matrix is 3 (the rank could be smaller
than 3, depending on the values in the matrices).
1
Top to bottom: 3, 3, 2, 1
187
Practice problems Determine the maximum possible rank of the following expressions,
based on the following information (assume matrix sizes match for valid addition):
rank(A) = 4, rank(B) = 11, rank(C) = 14.
a) A + B b) (A + B) + 0C c) C − 3A d) 0((C+A)+4A)
Answers
a) 15 b) 15 c) 18 d) 0
Rank
7.6
Rank of multiplied matrices
Reflection
applied linear algebra, primarily statistics and machine-
learning. Moreover, the more comfortable you are with
matrix rank, the more intuitive advanced linear algebra
concepts will be.
Practice problems Determine the maximum possible rank of the following expressions,
based on the following information (assume valid matrix sizes):
rank(A) = 4, rank(B) = 11, rank(C) = 14.
Answers
a) 4 b) 4 c) 4 d) 4
Rank
7.7
Rank of A, AT , AT A, and AAT
The key take-home message from this section is that these four
matrices—A, AT , AT A, and AAT —all have exactly the same
rank.
You already know that A and AT have the same rank because
of property 3 in the first section of this chapter: the rank is a
property of the matrix; it does not reflect the columns or the
rows separately. Thus, transposing a matrix does not affect its
rank.
Ay = 0 (7.14)
AT Ay = AT 0 (7.15)
AT Ay = 0 (7.16)
Equations 7.14 and 7.16 show that any vector in the null space
of A is also in the null space of AT A. This proves that the null
space of AT A is a subset of the null space of A. That’s half of
the proof, because we also need to show that any vector in the
null space of AT A is also in the null space of A.
AT Ay = 0 (7.17)
yT AT Ay = yT 0 (7.18)
kAyk = 0 (7.20)
Equations 7.17 and 7.20 together show that any vector in the null
space of AT A is also in the null space of A.
Now we’ve proven that AT A and A have the same null spaces.
Why does that matter? You will learn in the next chapter that
the row space (the set of all possible weighted combinations of
the rows) and the null space together span all of RN , and so if the
null spaces are the same, then the row spaces must have the same
dimensionality (this is called the rank-nullity theorem). And the 191
rank of a matrix is the dimensionality of the row space, hence,
the ranks of AT A and A are the same.
Proving this for AAT follows the same proof as above, except you
start from yT A = 0 instead. I encourage you to reproduce the
proof with a pen and some paper.
=VΣ2 VT (7.23)
192
" #
1 3 4
A= rank(A) = 1
3 9 12
1 3
T
rank(AT ) = 1
A =
3 9
4 12
10 30 40
T
rank(AT A) = 1
A A = 30 90 120
40 120 160
" #
T26 78
rank(AAT ) = 1
7.8
Rank of random matrices
Apologies for making the font so small. The point isn’t for you
to read the actual numbers; the point is for you to appreciate
that the probability of linear dependencies leading to a reduced-
rank matrix is infinitesimal. Thus, whenever you create random
matrices on computers, you can assume that their rank is their
maximum possible rank.
0 0 0 0
Rank
0 0 0 0
1 0 1 1
1 1 0 0
Thus, matrices populated with floating-point random numbers
have maximum possible rank. This is useful because it allows you
to create matrices with arbitrary rank, which in turn will unlock
many opportunities for exploring linear algebra in code. Code
challenge 1 will guide you through the process.
7.9
Full-rank matrices via “shifting”
Your computer would tell you that the rank of this data matrix is
3, which you know is actually due to sensor noise. So you might
want your rank-estimating-algorithm to ignore some small amount
of noise, based on what you know about the data contained in the
matrix.
Rank
I wrote that there are several algorithms that can answer that
question, but that you hadn’t yet learned the necessary concepts.
198 We can now re-interpret this problem in the context of matrices
and rank.
199
7.12Exercises
a) A b) B c) A + B d) AAT
a) A b) B c) C d) D
e) CT B f) CT C g) AD h) CD
200
7.13 Answers
1. a) r = 1 b) r = 2 c) r = 3 d) r = 3
2. a) 2 b) 2 c) 2 d) 2
e) 2 f) 2 g) 2 h) 2
3. a) λ = 3 b) λ 6= 0 c) λ 6= 2 d) λ = 0
4. a) 2 b) 3 c) 3 d) 3
7.13 Answers
e) 3 f) 3 g) 2 h) invalid!
i) 3 j) 3 k) 2 l) 3
201
7.14 Code challenges
202
7.15 Code solutions
2. On the laptop that I’m using while writing this exercise, I got
rank-5 matrices down to scaling factors of around 10−307 .
Ax = b ? (8.1)
Ay = 0 ? (8.2)
As you work through this chapter, try to think about how each
concept fits into these questions. At the end of the chapter, I will
provide a philosophical discussion of the meaning of these two
questions.
8.1
Column space of a matrix
"Column space" sounds like a fancy and exotic term, but in fact,
you already know what a column space is: The column space of
Column space
a matrix is the subspace spanned by all columns of that matrix.
is also some- In other words, think of a matrix as a set of column vectors, and
times called the the subspace spanned by that set of vectors as the column space
range or the im- of the matrix.
age of a matrix.
The two equations above show two different, and equally accept-
able, ways to express the same concept.
Also relevant here is the distinction between basis and span. The
columns of a matrix span a subspace, but they may or may not
be a basis for that subspace. Remember that a set of vectors is a
basis for a subspace only if that set is linearly independent.
Consider the two matrices below; their column spaces are iden-
tical, but the columns of the left matrix form a basis for that
subspace, whereas the columns of the right matrix do not.
3 9 3 9 9
7 5 7 5 5
1 8 , 1 8 8
2 7 2 7 7
9 6 9 6 6
Let’s try a few examples. For each of the matrices below, deter- 207
mine (1) the dimensionality of the ambient space in which the
column space is embedded, (2) the dimensionality of the column
space, and (3) whether the columns form a basis for the column
space.
1 5 −3 4
1 2 4 2 6 −2 4
1 2 3 5
0 4 4 3 7 −1 4
0 8 13 21
B = 4 1 9 , C= , D=
4 8 0 4
0 0 34 55
6 0 12
5 9
1 4
0 0 0 89
1 1 3 6 10
2 4
7 11 3 4
Now let’s see why the column spaces must be equal. Recall from
section 6.1 that the "column perspective" of matrix multiplica-
Let’s see this in an example. I’m going to write out the multipli-
cation AAT using the column perspective.
0 10 " # 0 10 0 10 0 10
0 3 5
3 7
10 7 3 = 0 3+10 7
33+7 7
53+3 7
5 3 5 3 5 3 5 3
Next we rely on the fact that AAT and A have the same rank (see
rank-nullity theorem, page 191), which means the dimensionalities
of their column spaces are the same. If the column space of AAT
is a subset of the column space of AAT and those two subspaces
have the same dimensionality, then they must be equal. 209
There is another explanation of why AAT and A have the same
subspace, which relies on the singular value decomposition. More
on this in chapter 16!
8.3
Determining whether v ∈ C(A)
of a matrix.
Whenever you see a matrix equation, the first thing you should do
is confirm that the matrix sizes allow for a valid equation. Here
we have matrix sizes (3×2)×(2×1) = (3×1). That works.
Practice problems Determine whether the following vectors are in the column space of
the accompanying matrices, and, if so, the coefficients on the columns to reproduce the vector.
1 0 1 1 1 0 1
1 0 1
a) , b) 2 , 1 0 0 c) 2 , 1 0
2 1 0
0 3 3 0 0 3 3
Answers
a) yes. (2,1) b) yes. (2,-2,3) c) no.
212
8.4
Row space of a matrix
The primary difference is the way that you ask the question whether
a given vector is in the row space of the matrix. In particular, this
changes how you multiply the matrix and the vector: Instead of
matrix-vector multiplication as with the column space (Ax = v),
you have to put the row vector on the left side of the matrix, like
213
Practice problems Determine whether the following vectors are in the row space of the
accompanying matrices, and, if so, the coefficients on the rows to reproduce the vector.
T T
T 1 0 1 1 1 0 1
π 0 1
a) , b) 1 , 2 1 1 c) 2 , 1 0
0 0 1
1 0 0 0 0 3 3
Answers
a) no. b) yes. (.5,.5,0) c) wrong sizes!
8.5
Row spaces of A and AT A
One new concept I will add here is that the fact that R(A) =
R(AT A) can be an example of dimensionality reduction: both of
those matrices have the same row space, but AT A might be a
smaller matrix (that is, it might have fewer numbers), and there-
fore be computationally easier to work with.
8.6
Null space of a matrix
Ay = 0 (8.7)
Let’s see an example to make this more concrete. See if you can
come up with a vector (that is, find x and y) that satisfies the
equation.
" #" # " #
1 2 x 0
= (8.8)
4 8 y 0
Did you come up with the vector [-2 1]T ? That satisfies the
equation and therefore is in the null space of that matrix (thus:
y ∈ N (A)).
But that’s not the only vector that satisfies that equation—perhaps
your solution was [2 -1]T or [-1 .5]T or [-2000 1000]T . I’m sure you
see where this is going: There is an infinite number of vectors in
the null space of this matrix, and all of those vectors are scaled
versions of each other. In other words:
" #! ( " # )
1 2 −2
N = σ , σ∈R (8.9)
4 8 1
Can you find a (non-trivial) vector in the null space? The answer
is No, you cannot. There is no way to combine the columns to
produce the zeros vector. It is colloquially said that this matrix
Matrix spaces
" #!
1 2
N = {} (8.12)
4 7
Did you notice anything about the two matrices in equations 8.8
and 8.11? Perhaps you noticed that the matrix in 8.8 was singular
(rank=1) while the matrix in 8.11 was full rank (rank=2). This
is no coincidence—full-rank square matrices and full-column-rank
matrices necessarily have an empty null space, whereas reduced-
rank and reduced-column-rank matrices necessarily have a non-
empty null space. You’ll learn more about why this is in a few
pages.
The previous examples used square matrices; below are two exam-
ples of rectangular matrices so you can see that null space is not
just for squares. The hard work of finding the vector y is already
done, so pay attention to the sizes: the null space is all about
linear weighted combinations of the N columns, which are in RM ,
216 and the vector that contains the weightings for the columns is
y ∈ RN , corresponding to the dimensionality of the row space.
1 2 0 1
" # " # " #
−2 = 0 , 1 −4
1 2 1 1 1 1 = 0
1
2 1
0
2 2 2 2 −8 1
0
1 2 0 1
There is a deterministic relationship between the rank of a matrix,
its size, and the dimensionality of the four matrix spaces. We will
return to this in section 8.10.
Practice problems Find a vector in the null space of each matrix, if there is one.
1 0
1 2 3 1 2 3
4 0 −5 5 5
a) b) 3 1 4 c) 3 1 5 d)
2 0 −3 7 15
4 4 8 4 4 8
1 0
Practice problems For each combination of matrix and vectors, determine whether neither,
one, or both vectors are in the null space of the matrix.
1 2 3 0 1
1 1 0 0 0 1 0
−1 −2 3 1 5 1 −2
a) 2
1 , , b) 0
0 0 , 1 , 1
c) , ,
1 −1 4 4 8 0 −1
4 0 0 0 1 0 0
0 4 2 2 0
Answers
a) Neither b) Both c) Wrong sizes!
yT A = 0T (8.13)
The "regular" null
The left-null space can be thought of as the "regular" null space of space is formally
the matrix transpose. This becomes apparent when transposing the "right null 217
both sides of Equation 8.13.
AT y = 0 (8.14)
Considering the left-null space as the (right) null space of the ma-
trix transpose is analogous to how R(A) = C(AT ). This also
means that the two null spaces are equal when the matrix is sym-
metric:
This result is not
terribly surprising, N (A) = N (AT ), if A = AT (8.15)
but it is relevant
for finding null
space basis vec-
tors via the SVD. Let’s think about the dimensionality of the left-null space. It
should be intuitive that it will mirror the null space. I’ll start
by showing the left-null spaces of the two rectangular matrices
used above, and then discuss the rule. Notice that the null-space
vector is now on the left of the matrix.
Matrix spaces
1 2
h i
1 2
h i
−1 −1 1 1
1
= 0 0 (8.16)
2
1 2
" #
h i 1 1 1 1 −4 h i
−2 1 = 0 0 0 0 (8.17)
2 2 2 2 −8
Here’s the rule: For an M×N matrix, the row space is in ambient
RN and the left-null space is in ambient RM . Again, this is sensible
because the left-null space provides a weighting of all the rows to
produce the zeros row vector. There are M rows and so a null
space vector must have M elements.
Look closely at Equation 8.16: Is that vector the only vector in the
left-null space? I’m not referring to scaled version of that vector;
I mean that there are more vectors in the left-null space that are
separate from the one I printed. How many more (non-trivial)
218 vectors can you identify?
Practice problems Find a vector in the left-null space of each matrix, if there is one. Notice
that these matrices appeared in the practice problems on page 217; are the answers the same?
1 0
1 2 3 1 2 3
4 0 −5 5 5
a) b) 3 1 4 c) 3 1 5 d)
2 0 −3 7 15
4 4 8 4 4 8
1 0
Answers
a) 1 0 −1 1 b) −8 −4 5 c) Empty left d) Empty left
null space null space
Code Python and MATLAB will return basis vectors for the null
space of a matrix (if it has one). How do they solve this seem-
8.7
Geometric interpretation of the null space
Now that you know the horror-movie analogy (the evil basement
from which no unsuspecting visitors return) and the algebraic
definition (Ay = 0), this is a good time to learn the geometric
perspective of the null space.
Recall that a matrix times a vector produces another vector. We’ll 219
stick to R2 here so that everything can be easily visualized. Con-
sider the following matrix and vectors.
" # " # " #
1 2 −1 −2
M= , u= , y=
2 4 1 1
Figure 8.2 shows that when you multiply a vector in the null
space by that matrix, the resulting "vector" is just a point at the
origin. And that’s basically the end of the line for this matrix-
vector product My—it cannot do anything else but sit in the
origin. Just like the basement of a horror movie: Once you go
in, you never come out. There is no escape. There is no possible
other matrix A such that AMy 6= 0. In other words, the matrix
entered its null space and became a singularity at the origin; no
other matrix can bring it back from the abyss.
Practice problems For the following matrices, find a basis for the column space and a basis
for the left-null space. Then draw those two basis vectors on the same Cartesian plot (one plot
per matrix). Do you notice anything about the plots?
1 2 −3 1
a) b)
2 4 6 −2
Answers In both cases, the column space and the left-null space are orthogonal to each other.
You can easily confirm that in these examples by computing the dot products between the basis
vectors. That’s not just some quirky effect of these specific matrices; that’s a general principle,
and you will soon learn the reason why. (You can repeat this exercise for the row space and the
8.8
Orthogonal subspaces, orthogonal complements
(α1 v1 + α2 v2 ) ⊥ βw (8.19)
Orthogonal subspaces
Don’t confuse
Orthogonal complements This leads us to orthogonal comple- complement with
ments. The idea of orthogonal complements is that any ambient compliment! (Al-
space RN can be decomposed into two subspaces W and V such though both are
that W ∪ V spans all of RN and W ⊥ V . In other words, the things you should
do for your signifi-
entire ambient space is carved into subspaces that are mutually
cant other.)
orthogonal and that meet only at the origin (all subspaces meet
at the origin).
223
8.9
Orthogonalities of the matrix spaces
y ⊥ C(A) (8.21)
AT y = 0 (8.22)
N
Note: 0 ∈ R .
Remarkably, we’ve just re-derived Equation 8.13 (see also Equa-
tion 8.14), which was the definition of the left-null space. That’s
pretty neat, because we started from the question of a vector that
is orthogonal to the column space, and we ended up re-discovering
the left-null space.
Analogously, if someone asks you if you want ice cream with sprin-
OK, perhaps not
" # " #
1 1 1 1
B= , D=
2 2 2 3
225
I chose the following as bases:
(" #) (" #)
1 T −2
C(B) = , N (B ) =
2 1
(" #)
1 1 nhio
C(D) = , N (DT ) =
2 3
The easiest way to choose a basis for the column space is to take
the first r columns, where r is the rank of the matrix (make sure
that those columns form a linearly independent set). Of course,
this isn’t necessarily the best basis set; it’s just the easiest for
small matrices.
Figure 8.6 shows these vectors, with black lines for the column
Matrix spaces
space basis vectors and gray lines for the left-null space basis
vectors.
Figure 8.6: Basis vectors for the column and null spaces of
matrices B and D.
We can express this by writing that the dot product between each
row of the matrix (indicated as am below) and the vector y is 0.
a2 y = 0
..
.
am y = 0
As with the column and left-null spaces, the row space and null
space are orthogonal complements that together span all of RN :
The dimensionalities of the column space, the row space, and the
two null spaces are all interconnected. 227
First, I want to reiterate that "dimension" is not the same thing
as "rank." The rank is a property of a matrix, and it’s the same
regardless of whether you are thinking about rows, columns, or
null spaces. The ambient dimensionality differs between rows and
columns for non-square matrices.
The null space contains one basis vector, which means it has di-
mensionality of one, while the column and row spaces each has
dimensionalities of 2 (notice that row 3 is a multiple of row 1).
The rank of the matrix is also 2.
You can see in this example that the dimensionality of the column
space plus the dimensionality of the left-null space adds up to the
ambient dimensionality R3 .
Two more examples and then I’ll present the rules: If the column
space is 2D and embedded in R2 , then the column space already
covers the entire ambient space, which means there’s nothing for
the left-null space to capture; the left-null space must therefore
be the empty set. Here’s a matrix to illustrate this point; note
that the left-null is empty, because there is no way to combine
the rows of the matrix (or, the columns of the matrix transpose)
to get a vector of zeros. " #
1 1
2 3
Final example: The 2 × 2 zeros matrix has columns in ambient
R2 , but the column space is empty; it contains nothing but a
228 point at the origin. It is 0-dimensional. Therefore, its orthogonal
complement must fill up the entirety of R2 . This tells us that
the left-null space must be 2-dimensional. What is a basis set for
that left-null space? Literally any independent set of vectors can
be a basis set. A common choice in this situation is the identity
matrix, but that’s because it’s convenient, not because it’s the
only basis set.
The story is the same for the row space and the null space, so I
will just re-state it briefly: The row space lives in ambient RN but
can span a lower-dimensional subspace depending on the elements
in the matrix. The orthogonal complement—the null space—fills
Subspace dimensionalities
One more relevant formula: The rank of the matrix is the dimen-
sionality of the column space, which is the same as the dimen-
sionality of the row space:
Answers The four numbers below are the dimensionalities of the column space, left-null
space, row space, and null space. You can confirm that the sum of the first two numbers
corresponds to the number of rows, and the sum of the second two numbers corresponds to the
number of columns.
a) 2,1,2,1 b) 2,0,2,4 c) 1,2,1,1
ment to write that most people learn linear algebra because they
want to know how to solve these equations. You might not realize
that this is what you want to solve, but most of applied linear
algebra boils down to solving one of these two equations.
I hope this helps put things in perspective. It’s not the case that
every problem in linear algebra boils down to one of these two
equations. But as you proceed in your adventures through the
jungle of linear algebra, please keep these two equations in mind;
the terminology may differ across fields, but the core concepts are
the same.
231
8.12Exercises
" # " # 1 1 2
1 0 3
c) , d) 3 1 , 2
2 0 0
0 1 0
1 1 " # " # " #
2
, 1 2 −3
e) 3 1 f) ,
Matrix spaces
2 2 1 3
0 1
0 0 1 3 −1 5 2 0
g) ,
0 1 0 6 −7 9 8 0
h) ,
1 0 0 4 −1 4 π 0
T
" # " #T " #1
1 6 2 1 0 1
c) , d) , −2
2 12 9 2 2 0
3
3. For each matrix-set pair, determine whether the vector set can
form a basis for the column space of the matrix.
8.12 Exercises
b) dim(C(A)) = 1, dim(N (AT )) = ___
233
8.13 Answers
" #
1. 2
a) b) not in column space
3
" #
3
e) sizes don’t match f)
−3
4 0
g) 6
h)
0
3 0
" #T " #T
Matrix spaces
2. 2 1
a) b)
3 1
" #T
3
c) Not in the row space d)
−1
c) Yes
d) Yes
" #
4. 0
a) b) No null space
234 1
1
c) No null space d) −4
1/5
5. a) 2
b) 1
c) 0
f) 2
g) 1
h) 0
8.13 Answers
235
8.14 Code challenges
236
8.15 Code solutions
1. The goal here is to see that once you’ve entered the null space
of a matrix (An), you can never come back; and that matrix
b rotates n such that Bn is no longer in the null space of A.
237
Code block 8.6: MATLAB
1 A = randn ( 1 6 , 9 ) ∗ randn ( 9 , 1 1 ) ;
2 rn = n u l l (A ) ;
3 l n = n u l l (A ’ ) ;
4 r = rank (A ) ;
5 s i z e ( rn ,2)+ r
6 s i z e ( ln ,2)+ r
Matrix spaces
238
CHAPTER 9
Complex numbers in
9.1
Complex numbers and C
Complex numbers
You’ve probably heard this story before: A long, long time ago,
mathematicians scratched their heads about how to solve equa-
√
tions like x2 + 1 = 0. The answer, of course, is x = ± −1, but
this is not a sensible answer (or so the elders thought) because
no number times itself can give a negative number. Confused and
√
concerned, they adopted the symbol i = −1 and called it "imag-
Fun fact: Gauss inary" because it was the best term their imaginations could come
and many other up with.
people despised
the term "imagi-
nary," instead ar-
The imaginary operator was, for a long time, just a quirky excep-
guing that "lateral" tion. It was Karl Friederich Gauss (yes, that Gauss) who had the
would be better. I brilliant insight that the imaginary operator was not merely an ex-
(and many others) ceptional case study in solving one kind of equation, but instead,
whole-heartedly
that the imaginary unit was the basis of an entirely different di-
agree, but unfor-
tunately "imagi- mension of numbers. These numbers were termed "complex" and
nary" remains stan- had both a "real" part and an "imaginary" part. Thus was born
dard terminology. the complex plane as well as the field of complex numbers, C.
a0 x0 + a1 x1 + ... + an xn = 0 (9.1)
9.2
What are complex numbers?
Don’t lose sleep over what complex numbers really mean, whether
imaginary numbers have any physical interpretation, or whether Figure 9.2: The
complex numbers,
intelligent life elsewhere in the universe would also come up with all planed up.
imaginary numbers (I think the answer is Yes, and I hope they call
them "lateral numbers"). It doesn’t matter. What does matter
is that complex numbers are useful from a practical perspective,
and they simplify many operations in both theoretical and applied
mathematics. 241
Complex numbers are referred to using the real and imaginary
axis coordinates, just like how you would refer to XY coordinates
on a Cartesian axis. Figure 9.3 shows a few examples of com-
plex numbers as geometric coordinates and their corresponding
labels.
The reason why complex numbers are so useful is that they pack
a lot of information into a compact representation. For the real
number line, a number has only two pieces of information: its
distance away from the zero and its sign (left or right of zero).
1 z = complex ( 3 , 4 ) ;
2 Z = zeros (2 ,1);
3 Z ( 1 ) = 3+4 i ;
9.3
The complex conjugate
Im
of the imaginary component, not set it to be negative. Thus, if
the imaginary component is already negative, then its conjugate
would have a positive imaginary component: a − bi = a + bi.
Figure 9.4 depicts the geometric interpretation of the complex Re
conjugate, which is to reflect the complex number across the real
axis.
[1 -2i]
Complex conjugate pairs A complex conjugate pair is a complex Figure 9.4: A com-
plex number ([1
number and its conjugate, together forever, just like penguins. 2i] and its complex
Here are a few examples; note that the magnitudes of the real conjugate ([1 -2i]).
and imaginary parts are the same; the only difference is the sign
of the imaginary part.
z, z = 4 + 2i, 4 − 2i
u, u = a + bi, a − bi
x2 + 9 = 0 x = +3i, −3i
This should also be familiar from the quadratic equation: When The quadratic
b2 < 4ac then the roots come in conjugate pairs. equation:
√ x =
−b± b2 −4ac
2a
245
Code block 9.3: Python
1 r = np . random . r a n d i n t ( −3 ,4 , s i z e =3)
2 i = np . random . r a n d i n t ( −3 ,4 , s i z e =3)
3 Z = r+i ∗1 j
4 p r i n t (Z)
5 p r i n t (Z . conj ( ) )
a) 3 + 4i b) −6(−5 − i) c) j d) 17
Answers
a) 3 − 4i b) −6(−5 + i) c) −j d) 17
Complex conjugate pairs are used in many areas of ap-
plied mathematics. For example, the most efficient way
to compute a power spectrum from the Fourier trans-
Reflection
form is to multiply the complex-valued spectrum by its
complex conjugate. More germane to this book: A ma-
trix with entirely real-valued entries can have complex-
valued eigenvalues; and when there are complex-valued
eigenvalues, they always come in conjugate pairs.
246
9.4
Arithmetic with complex numbers
z = a + ib
w = c + id
z + w = a + ib + c + id
= (a+c) + i(b+d)
In other words, sum the real parts, then sum the imaginary parts
and attach the i. The only potential source of error here is mis-
Subtraction works exactly the same way; just be careful that you
are replacing the correct plus signs with a minus signs. Pay at-
tention to the pluses and minuses here:
z − w = a + ib − (c + id)
= (a-c) + i(b-d)
z ∗ z = (a+bi)(a-bi)
= a2 + b 2 (9.7)
z a + ib
=
w c + id
(c − id)(a + ib)
=
(c − id)(c + id)
(c − id)(a + ib)
=
c2 + d2
(ca+db) + i(cb-da)
=
c2 + d2
z wz
= (9.8)
248 w ww
Why does the denominator have to be real-valued? I
honestly have no idea. But all my math teachers and
everyone who writes about math in books and on the
Reflection
Internet says that we should avoid having complex num-
bers in the denominator of a fraction. I don’t condone
conformity and I think it’s important to question the
status-quo, but on the other hand, you gotta pick your
battles. This one isn’t worth fighting.
Practice problems For the following two complex numbers, implement the indicated arith-
metic operations.
z = 3 + 6i, w = −2 − 5i
(w+2)z
a) 2z + wz b) w(z + z) c) 5z + 6w d) 5z + 6w e) w
9.5
The Hermitian and complex dot products
2 − i9 249
Notice that nothing happened to the real-valued elements (first
and third entries). For this reason, the Hermitian and "regular"
transpose are identical operations for real-valued matrices.
Dot product with complex vectors The dot product with com-
plex vectors is exactly the same as as the dot product with real-
valued vectors: element-wise multiply and sum.
Im
[0 i]
vT v = 02 + i2 = −1
vH v = 02 + (−i)(−i)
" #
= −1
0
v= .
i vH v = 02 + (−i)(i)
=1
vT v = 02 + (i)(−i)
=1
Practice problems For the following vectors, implement the specified operations.
1 i
z = 3i , w = 1 − i ,
4 − 2i 0
zH z
a) zH (z + w) b) c) 2z + w z
Answers
2+i
a) 27 − 2i b) −10 c) 3 + 9i
8 − 4i
9.6
Special complex matrices
252
9.7
Exercises
a) wd b) dH wd c) Rd d) RH Rd
e) wz f) wz ∗ g) wdR h) wdH R
9.7 Exercises
253
9.8
Answers
1. 8 + 20i 24 + 8i
−6 + 14i
a) b) 80 + 200i −8i
c)
20 − 8i 18 − 2i
160 + 16i
d) 114 + 30i
e) 5 − 2i f) −5 + 2i
26 − 26i
h i
g) Wrong sizes! h) −44 + 64i 12 + 88i −40 + 16i
Complex numbers
254
9.9
Code challenges
255
9.10Code solutions
256
Code block 9.10: MATLAB
1 A = complex ( randn ( 3 , 3 ) , randn ( 3 , 3 ) ) ;
2 A1 = A+A ’ ;
3 A2 = A∗A ’ ;
4 i s h e r m i t i a n (A1) % i s s y m m e t r i c (A1) i s f a l s e !
5 i s h e r m i t i a n (A2)
257
Complex numbers
CHAPTER 10
Systems of equations
2x = 6
To be sure, this single equation does not need matrices and vec-
tors. In fact, using matrices to solve one equation simply creates
gratuitous confusion. On the other hand, consider the following
Systems of equations
system of equations.
2x + 3y − 5z = 8
−2y + 2z = −3
5x − 4z = 3
You can imagine an even larger system, with more variables and
more equations. It turns out that this system can be represented
compactly using the form Ax = b. And this isn’t just about
saving space and ink—converting a system of many equations
into one matrix equation leads to new and efficient ways of solving
those equations. Are you excited to learn? Let’s begin.
2x + 3 = 11 (10.1)
And I’m sure you know that the geometric translation of this
equation is a line in a 2D space. You probably also know that any
point on this line is a valid solution to the equation.
Now things are starting to get interesting. The point on the graph
where the two lines intersect is the solution to both equations. In
this case, that point is (x, y) = (4/3, 8/3). Try this yourself by
plugging those values into both equations in system 10.3. You can
Systems of equations
also try points that are on one line but not the other; you will find
that those pairs of numbers will solve only one of the equations.
Try, for example, (0,4) and (-4,0).
With a system of equations, you still have the same rule, although
you don’t have to apply the same operation to all equations in
the system. But having a system of equations allows you to do
something more: You may add and subtract entire equations from
each other (this is analogous to multiplying the left-hand side by
"8" and the right-hand side by "4×2"). Let’s try this with Equation
10.3. I will transform the first equation to be itself minus the
second equation:
( )
0y = 3x/2 − 2
(10.4)
y = −x + 4
262
Next, I will replace the second equation by itself plus two times
the original first equation:
( )
0y = 3x/2 − 2
(10.5)
3y = 0x + 8
At a superficial glance, system 10.5 looks really different from
system 10.3. What does the graph of this "new" system look like?
Let’s see:
The amazing thing here is that the solution stays the same, even
though the lines are different. The same can be said of the al-
gebraic equations (10.3 and 10.5): They look different, but the
solution to the system remains the same before and after we sub-
tracted equations from each other.
Did you get it? It’s actually pretty difficult to solve in your head.
Now I will add the first equation to the second; try again to solve
the system without writing anything down.
( )
2x + 3y = 8
(10.7)
5y = 5
have (1) one intersecting point like in the examples above, (2) no
points in common, or (3) an infinite number of points in common.
We’ll get to this discussion later on in this chapter. First you need
to learn how to convert a system of equations into a matrix-vector
equation.
2x + 3y − 4z = 5 (10.8)
The variables are (x, y, z), the corresponding coefficients are (2, 3, −4),
and the constant is 5. In other equations, you might need to do
a bit of arithmetic to separate the components:
The variables are still (x, y, z), the coefficients are (2, 6, −1), and
the constant is 6.
Life pro tip: If you are converting this system into matrices while simulta-
Vietnamese co-
neously Facebooking, savoring Vietnamese coconut coffee, and
conut coffee is
really delicious. watching reruns of Rick and Morty, you might end up with the
following matrix:
" # x " #
2 3 −5
y = 8
0 0 5 2
z
2x + 3y − 5z = 8 2x + 3y − 5z = 8
⇒
266 5x = 2 5x + 0y + 0z = 2
Practice problems (1 of 2) Convert the following systems of equations into their matrix
form.
2x + 3y + 75z = 8 x − z/2 = 1/3
a) b)
− 2y + 2z = −3 3y + 6z = 4/3
s−t = 6
u+v = 1 x+y = 2
c) d)
t+u = 0 x−y = 0
2v + 3t = 10
Answers
x x
2 3 75 8 1 0 −1/2 1 1
a) y = b) y = 3
0 −2 2 −3 0 3 6 4
z z
1 −1 0 0 s 6
0 0 1 1 t 1 1 1 x 2
c) = d) =
0 1 1 0 u 0 1 −1 y 0
0 3 0 2 v 10
Answers
j = 10
a) Not a valid equation! b)
k = 9
s + 3t = 5
7q + 7w + 8e + 8r + 6t + 7y = 9 2s + 4t = 4
c) d)
1q + 9e + r + 2t = 9
3s + 4t = 6
4s + 2t = 2
267
Sometimes, the biggest challenge in data analysis and
modeling is figuring out how to represent a problem us-
ing equations; the rest is usually just a matter of alge-
bra and number-crunching. Indeed, the translation from
Reflection
real-world problem to matrix equation is rarely trivial
and sometimes impossible. In this case, representing a
system of equations as matrix-vector multiplication leads
to the compact and simplified notation: Ax = b. And
this form leads to an equally compact solution via the
least-squares algorithm, which is a major topic of Chap-
ter 14.
So how does row reduction work? It’s exactly the same procedure
268 we applied in Equation 10.6 (page 264): Replace rows in the ma-
trix with linear combinations of other rows in the same matrix.
Let’s start with an example.
Let’s try an example with a 3×3 matrix. The goal is to find the
multiples of some rows to add to other rows in order to obtain the
echelon form of the matrix. It’s often easiest to start by clearing
out the bottom row using multiples of the top row(s).
1 2 2 1 2 2 1 2 2
R1 +R2
−1 3 0 −1 3 0 −−−−→ 0 5 2
−2R1 +R3
2 4 −3 −−−−−−→ 0 0 −7 0 0 −7
As with many operations in linear algebra (and math in general),
the procedure is easy to implement in small examples with care-
fully chosen integer numbers. Computers can take care of the
arithmetic for harder problems, but it’s important that you un-
derstand the procedure.
One of the nice features of the echelon form is that linear de-
Systems of equations
Watch what happens to the echelon form when there are linear
dependencies.
1 2 2 1 2 2 1 2 2
R1 +R2
−1 3 0 −1 3 0 −−−−→ 0 5 2
−2R1 +R3
2 4 4 −−−−−−→ 0 0 0 0 0 0
When the columns (or rows) of a matrix form a linearly dependent
set, the echelon form of the matrix has at least one row of zeros.
Practice problems Convert the following matrices into their echelon form.
1 2 2 2 1
1 0 2
a) b) 0 1 c) 4 0 1
2 1 1
2 1 3 1 1
Answers
1 2 2 2 1
1 0 2
a) b) 0 1 c) 0 −4 −1
0 1 −3
0 0 0 0 0
Each nth step of row reduction has its own Rn matrix, each one to
the left of A and any previously applied R matrices. Depending
on why you are implementing row reduction, you might not need
to keep track of the transformation matrices, but it is important
to understand that every echelon matrix E is related to its original
form A through a series of transformation matrices:
This also shows how row reduction does not actually entail losing
information in the matrix, or even fundamentally changing the
information in the matrix. Instead, we are merely applying a
sequence of linear transformations to reorganize the information
in the matrix. (That said, row reduction without storing the
R matrices does involve information-loss; whether that matters
272 depends on the goal of row-reduction. More on this point later.)
Exchanging rows in a matrix In the examples of row reduction
thus far, the matrices "just so happened" to be constructed such
that each row had nonzero elements to the right of nonzero el-
ements in higher rows (criteria #1 of an echelon form matrix).
This is not guaranteed to happen, and sometimes rows need to be
exchanged. Row exchanges have implications for several matrix
properties.
For example, if you want to exchange the first and second rows of a
3×3 matrix, create matrix P, which is the identity matrix with the
first two rows exchanged, then left-multiply by this permutation
matrix.
Notice that the right-most matrix is not in its echelon form, be-
cause the leading non-zero term in the third row is to the left of
the leading non-zero term in the second row. Swapping the sec-
ond and third rows will set things right, and we’ll have our proper
echelon form.
After putting the matrix into echelon form, the pivots are the left-
most non-zero elements in each row. Not every row has a pivot,
and not every column has a pivot. A zero cannot be a pivot even
if it’s in a position that could be a pivot. That’s because pivots
are used as a denominator in row-reduction, and terrible things
happen when you divide by zero. In the following matrix, the
gray boxes highlight the pivots.
a b c d e
LU = PA (10.18)
E=U (10.19)
Practice problems Use row reduction to transform the matrix into its echelon form, and
then identify the pivots.
3 2 1
1 2 2 4 2 0 1 3 2 7
a) 2 0 3 2 b) 1 1 2 c)
9
6 3
8 0 4 0 3 1 3
1 1 1
During row reduction, any row that can be created using a linear
276 weighted combination of other rows in the matrix will turn into a
row of zeros. We could phrase this another way: If row reduction
produces a zeros row, then it means that some linear combina-
tion of row vectors equals the zeros vector. This is literally the
definition of linear dependence (Equation 4.6, page 94).
But this only proves that row reduction can distinguish a reduced-
rank matrix from a full-rank matrix. How do we know that the
number of pivots equals the rank of the matrix?
For this same reason, row reduction reveals a basis set for the row
space: You simply take all the non-zeros rows.
So the echelon form "cleans up" the matrix to reveal the dimen-
sionality of the row space and therefore the rank of the ma-
trix. And the "cleaning" process happens by moving information
around in the matrix to increase the number of zero-valued ele-
ments. 277
Practice problems Use row reduction to compute the rank of the following matrices by
counting pivots.
3 3 5
0 0 0 0 −4
4 0 2 4 6
a) 0 0 0 0 b)
5
c)
1 5 5 3 2
0 0 0 0
6 2 1
different results.
278
10.4 Gaussian elimination
280
Practice problems Perform Gaussian elimination to solve the following systems of equations.
3y + x + 2z = 36.5
2x − 3z = −8
2x + 6y − 2z = −29
a) y + 4z + 3x = 25 b)
y + z = 18
−z + x = −2
14x − z = −24
Answers
x = 2 x = −.5
a) y=3 b) y=1
z=4 z = 17
Linear dependen-
1 2 3 1 0 −1
4 5 6 ⇒ 0 1 2
cies in the columns
produce zeros 7 8 9 0 0 0
row(s) and piv-
otless column(s).
5 2 1 0 4 1 0 0 3.4 −.03
A wide matrix
4 1 5 6 3 ⇒ 0 1 0 −8.6 1.967
with independent
rows becomes I 2 1 9 0 4 0 0 1 .2 .233
augmented by
other numbers. 1 2 1 0
A tall matrix with
linearly indepen- 6 2 0 1
4 ⇒
dent columns has 7
0
0
I on top and ze- 1 7 0 0
ros underneath.
1 2 3 1 0 0
4 1 2 ⇒ 0 1 0
A full-rank square
matrix trans- 6 4 2 0 0 1
forms into I.
Answers
1 0 0
1 0 0 −3 0 1 0 0 0
1 0
a) 0 1 0 2 b)
0
c) 0 1 0 1
0 1
0 0 1 −1/4 0 0 1 0
0 0 0
283
10.6 Gauss-Jordan elimination
" #
2 3 8
(10.25)
0 1 1
Now let’s continue with upwards row reduction to get all pivots
equal to 1 and the only non-zero elements in their columns:
Now we map the right-most matrix above back onto their equa-
tions:
( )
x = 5/2
y=1
Answers
x=4 x = 5/2
a) b) c) x = y = z = 1
y=5 y = 2/5
284
Math has many life-lessons if you look at it in the right
way. The wisdom in this chapter is the following: Try
to solve a problem (system of equations) with no prepa-
ration or organization, and you will spend a lot of time
with little to show except stress and frustration. Do
Reflection
some preparation and organization (Gaussian elimina-
tion) and you can solve the problem with a bit of dedica-
tion and patience. And if you spend a lot of time prepar-
ing and strategizing before even starting to solve the
problem (Gauss-Jordan elimination), the solution may
simply reveal itself with little additional work. You’ll be
left with a deep sense of satisfaction and you will have
earned your Mojito on the beach.
All of the example systems shown so far in this chapter have had
a unique solution. But this is not always the case. In fact, there
are three possibilities:
" # " #
−.5 1 2 1 −2 0
⇒
−.5 1 4 0 0 1
Do you spot the problem? Let’s map the RREF back onto a
system of equations:
( )
y − 2x = 0
(10.27)
0=1
Unique solution (Figure 10.5B) This is the case for the exam-
ples we’ve worked through earlier in this chapter. The system has
exactly one solution and you can use Gaussian elimination to find
it.
1
For example: https://www.pleacher.com/mp/mhumor/onezero2.html
286
Infinite solutions (Figure 10.5C) Geometrically, this means that
the two equations are collinear; they are in the same subspace.
Here’s a system that would produce Figure 10.5C:
( )
y = x/2 + 2
(10.28)
2y = x + 4
n o
y − 2x = 4 (10.29)
Yeah, it’s not really a "system" anymore, unless you want to in-
(3) an infinite number of equally good solutions. So if
life is all about balancing equations, then it seems that
one of these three possibilities is correct:
1. Life has no purpose.
2. Life has exactly one purpose.
3. Life has an infinite number of equally good purposes.
287
10.8 Matrix spaces after row reduction
Let’s talk about rank first. You already know that rank is a
property of the matrix, not of the rows or the columns. The
rank doesn’t change before vs. after row-reduction. In fact, row-
Systems of equations
Next let’s think about the row space. That doesn’t change after
row reduction. To understand why, think about the definition of a
subspace and the process of row reduction: A subspace is defined
as all possible linear combinations of a set of vectors. That means
that you can take any constant times any vector in the subspace,
add that to any other constant times any other vector in the
subspace, and the resulting vector will still be in the subspace.
Now think about the process of row reduction: you take some
constant times one row and add it to another row. That is entirely
consistent with the algebraic definition of a subspace. So you can
do row-reduction to your heart’s delight and you will never leave
the initial subspace.
The only characteristic of the row space that could change is the
basis vectors: If you take the rows of the matrix (possibly sub-
selecting to get a linearly independent set) as a basis for the row
space, then row reduction will change the basis vectors. But those
288 are just different vectors that span the same space; the subspace
Anyway, the rows
spanned by the rows is unchanged before, during, and after row
tend to be a poor
reduction. choice for basis
vectors. You will
Now let’s talk about the column space. The column space actually understand why
when you learn
can change during row reduction. Let me first clarify that the
about the SVD.
dimensionality of the column space does not change with row
reduction. The dimensionality will stay the same because the
dimensionality of the row space, and the rank of the matrix, are
unaffected by row reduction.
But are they the same plane, i.e., the same subspace? Not at all!
They intersect at the origin, and they have a line in common (that
is, there is a 1D subspace in both 2D subspaces), but otherwise
they are very different from each other—before RREF it was a
tilted plane and after RREF it’s the XY plane at Z=0 (Figure
10.6).
On the other hand, keep in mind that row reduction is not nec-
essarily guaranteed to change the column space. Consider the
following matrix and its RREF:
" # " #
1 2 1 0
⇒
3 7 0 1 289
Figure 10.6: A graph of the column spaces before and after
RREF for matrix M (Equation 10.30). Column vectors are
drawn on top of the planes (unit-normalized for visualization).
It’s a full-rank matrix and its RREF is the identity matrix. The
column space spans all of R2 both before and after row reduction.
Systems of equations
Again, the take-home messages in this section are that row re-
duction (1) does not affect the rank or the dimensionalities of the
matrix subspaces, (2) does not change the row space, and (3) can
(but does not necessarily) change the column space.
290
10.9Exercises
g) R6×7 , r = 0 h) R4×4 , r = 4
10.9 Exercises
need to solve the system.
4x − 3y + 3z = 4 4x − 3y + 3z = 4 4x − 3y + 3z = 4
a) 5x + y + z = 7 b) 5x + y + z = 7 c) 5x + y + z = 7
x + 4y − 2z = 1
x + 4y − 2z = 3
2x + 4y − 2z = 1
291
10.10
Answers
1. 1 −2 −2 5 1 2 1 −1
0 4 3 −11 0 −3 1
1
a)
b)
0 0 −37 45
0 0 2 −1
1554 0 0 0 0
0 0 0 29
2. a) 0 b) 1 c) 0 d) 4
e) 5 f) 0 g) 6 h) 0
4. a) x = 1, y = −2, z = 3/2 b) x = 2, y = 2, z = 4
Systems of equations
292
10.11 Coding challenges
And voilà! You now have all the pieces of a system of equa-
tions. If you want to create systems that have zero, one, or
infinite solutions, you will need to make some minor adjust-
ments to the above procedure (e.g., by changing a coefficient
after already solving for the constants).
293
10.12 Code solutions
1 A = [ 2 0 −3; 3 1 4 ; 1 0 −1];
2 x = [ 2 3 4 ] ’ ; % n o t e : column v e c t o r !
3 b = A∗x ;
294
CHAPTER 11
Matrix determinant
297
Here are a few examples with real numbers.
1 2
= 1×1 − 2×0 = 1 (11.2)
0 1
5 −3
= 5×2 − (−3×−4) = −2 (11.3)
−4 2
1 2
= 1×4 − 2×2 = 0 (11.4)
2 4
Notice that the first two matrices have a rank of 2 (and, thus, the
columns form a linearly independent set), while the third matrix
has a rank of 1. And that the determinant was non-zero for the
first two matrices and zero for the third matrix.
" #
a λa
c λc
=0
Answers
a) Correct b) Correct c) No, ∆ = −53 d) No, ∆ = 3
e) Correct f) No, ∆ = 0 g) No, ∆ = −15 h) Correct
Notice that the difference between the matrix and its transpose is
simply the swapping of c and d. Scalar multiplication is commu-
tative, and thus the determinant is unaffected by the transpose.
This proof is valid only for 2×2 matrices because Equation 11.1
is a short-cut for 2 × 2 matrices. However, this property does
generalize to any sized square matrix. We will revisit this concept
when learning about 3×3 matrices.
ab − cd = ∆ (11.8)
Notice how the equation worked out: There were two λs on the di-
agonal, which produced a second-order polynomial equation, and
thus there were two solutions. So in this case, there is no single
λ that satisfies our determinant equation; instead, there are two
Determinant
Now I want to push this idea one step further. Instead of having
only λ on the diagonal of the matrix, let’s have both a real number
and an unknown. I’m also assuming that the matrix below is
singular, which means I know that its determinant is zero.
1 − λ 3
=0 ⇒ (1 − λ)2 − 9 = 0 (11.9)
3 1 − λ
⇒ λ2 − 2λ − 8 = 0
⇒ (λ + 2)(λ − 4) = 0
⇒ λ = −2, 4
There are two possible values of λ, which means there are two
300 distinct matrices that solve this equation. Those two matrices
are shown below (obtained by plugging in λ). It should take only
a quick glance to confirm that both matrices have ∆ = 0 (which is
the same thing as saying that both matrices are reduced-rank).
" #
3 3
λ = −2 ⇒
3 3
" #
−3 3
λ=4 ⇒
3 −3
But more importantly for our purposes in this book, when the
characteristic polynomial is set to zero (that is, when we assume
the determinant of the shifted matrix is 0), the λ’s—the roots of
the polynomial—are the eigenvalues of the matrix. Pretty neat,
eh? More on this in Chapter 15. For now, we’re going to continue
our explorations of determinants of larger matrices. 301
Practice problems Which value(s) of λ would make the following matrices singular (that
is, have a determinant of zero)?
1 2 0 14 4 1 10 1
a) b) c) d)
4 λ λ 4 1 λ 3−λ 1
6 2 λ 4 λ 18 5−λ −1/3
e) f) g) h)
5 3λ 1 λ 1/2 λ −3 5−λ
Answers
a) λ = 8 b) λ = 0 c) λ = 1/4 d) λ = −7
e) λ = 5/9 f) λ = ±2 g) λ = ±3 h) λ = 6, 4
302
Figure 11.1: Two visualizations of the short-cut for computing
the determinant of a 3×3 matrix.
0 3 −9
−6 0
1 = 0·0·1 + 3·1·8 + 9·6·0 − 9·0·8 + 3·6·1 − 0·1·0 = 24 + 18 = 42
8 0 1
1 3 5
2 4 2 = 1·4·7 + 3·2·3 + 5·2·7 − 5·4·3 − 3·2·7 − 1·2·7 = 116 − 116 = 0
3 7 7
The third example has a determinant of zero, which means the ma-
trix is singular (reduced-rank). It’s singular because the columns
(or rows) form a dependent set. Can you guess by visual inspec-
tion how the columns are linearly related to each other? It might
be easier to see from looking at the rows.
=0
Answers
a) No, ∆ = −3 b) No, ∆ = +3 c) Correct d) No, ∆ = 0
Practice problems Which value(s) of λ would make the following matrices singular?
0 1 1 1 0 −3
a) 0 4 7 b) −4 −11 1
0 3 λ 3 λ 0
1 1 1 1−λ 0 4
c) 2 2 3 d) −4 3−λ 4
3 λ 4 0 0 5−λ
Answers
a) any λ b) λ = 9
c) λ = 3 d) λ = 1, 3, 5
a d g
b e
h = aei + dhc + gbf − gec − dbi − ahf (11.15)
c f i
11.5
nant
Full procedure to compute the determi-
It turns out that the aforediscussed tricks for computing the de-
Now let’s see what happens when we apply this same procedure
to a 3×3 matrix (Figure 11.3).
Now it’s your turn: apply the full procedure to a 2×2 matrix. (The
determinant of a scalar is simply that scalar, thus, det(−4) = −4.)
You will arrive at the familiar ad − bc.
The full procedure scales up to any sized matrix. But that means
that computing the determinant of a 5 × 5 matrix requires com-
puting 5 determinants of 4×4 submatrices, and each of those sub-
matrices must be broken down into 3×3 submatrices. Honestly, I
could live 1000 years and die happy without ever computing the
306 determinant of a 5×5 matrix by hand.
Practice problems Compute the determinant of the following matrices (lots of zeros in these
matrices... you’re welcome).
0 −1 2 0 0 0 −3 3 0 −1 0 0
1 2 0 5 6 0 0 −1 1 2 0 5
a) b) c)
0 3 −4 0 0 0 −4 0 0 3 0 0
1 0 2 0 −1 −1 −3 2 1 0 0 0
Answers
a) -10 b) 72 c) 0
11.6 ∆ of triangles
In fact, for some matrices, it’s easier to apply row reduction to get
the matrix into its echelon form and then compute the determi-
Useful tip for ex-
nant as the product of the diagonal. HOWEVER, be aware that
ams!
row exchanges and row-scalar-multiplication affect the determi-
nant, as you will learn in subsequent sections.
You can now prove this for lower-triangular and diagonal matri-
ces.
0 1
= 0×0 − 1×1 = −1
1 0
1 0 0
0 1
0 = 1·1·1 + 0·0·0 + 0·0·0 − 0·1·0 − 0·0·1 − 1·0·0 = 1
0 0 1
0 1 0
1 0
0 = 0·0·1 + 1·0·0 + 0·1·0 − 0·0·0 − 1·1·1 − 0·0·0 = −1
0 0 1
0 1 0
0 0
1 = 0·0·0 + 1·1·1 + 0·0·0 − 0·0·1 − 1·0·0 − 0·1·0 = 1
1 0 0
Thus, each row swap reverses the sign of the determinant. Two
row swaps therefore "double-reverses" the sign. More generally,
an odd number of row swaps effectively multiplies the sign of
Determinant
d e f
a
b c = bdi + ceg + af h − bf g − aei − cdh (11.19)
g h i
= ad + cd − (bc + dc)
= ad − bc (11.21)
Let’s try this again with a 3×3 matrix to make sure our conclusion
isn’t based on a quirk of the 2×2 case:
a + g b + h c + i
d e f
g h i
It’s not immediately visually obvious that the "extra" terms will
cancel, but notice that there are identical terms with opposite
signs and with letters in a different order (for example: +gei and
−ieg). Canceling those terms will bring us back to the familiar
determinant formula. It’s as we never added any rows in the first
place.
The reason why this happens is that each term in the determi-
nant formula contains exactly one element from each row in the
matrix. Therefore, multiplying an entire row by a scalar means
that every term in the formula contains that scalar, and it can
thus be factored out.
Row swap → ∆ = −∆
Add rows → ∆=∆
Scale row by β → ∆ = ∆β
11.8
tion
Determinant & matrix-scalar multiplica-
The previous section might have led you to wonder what happens
to the determinant of a scalar-multiplication over the entire ma-
trix. That is, what is the relationship between |A| and |βA|?
You can see that each term in the determinant formula brings β 2 ,
which can be factored out. Let’s see how this looks for a 3 × 3
matrix.
a b c βa βb βc
β d e f = βd βe βf
g h i βg βh βi
You can consider how this works for the 4×4 case (refer to Figure
11.2): The determinant of each submatrix has a scalar of β 3 ,
which then multiplies β times the element in the first row, leading
to β 4 .
315
11.10 Exercises
6 0 −2 −2
1. a) a(c − b) b) 0 c) 0
d) 1 e) 16 f) 2
g) 1 h) −1 i) −1
j) 107 k) −1 l) +1
m) 1 n) 0 o) −24
11.11 Answers
, ∆ = −4
c) 0 −2 −6 d) 0 −4 8,∆ = 0
0 0 1 0 0 0
317
11.12 Code challenges
318
11.13 Code solutions
319
2. One easy way to create a singular matrix is to set column 1
equal to column 2.
1 ns = 3 : 3 0 ;
2 i t e r s = 100;
3 d e t s = z e r o s ( l e n g t h ( ns ) , i t e r s ) ;
4
5 f o r n i =1: l e n g t h ( ns )
6 f o r i =1: i t e r s
7 A = randn ( ns ( n i ) ) ; % s t e p 1
8 A( : , 1 ) = A ( : , 2 ) ; % s t e p 2
9 d e t s ( ni , i ) = abs ( d e t (A ) ) ; % s t e p 3
10 end
11 end
12
13 p l o t ( ns , l o g ( mean ( d e t s , 2 ) ) , ’ s− ’ )
14 x l a b e l ( ’ Matrix s i z e ’ )
15 y l a b e l ( ’ Log d e t e r m i n a n t ’ )
320
CHAPTER 12
Matrix inverse
3x = 1 (12.1)
Obviously, x = 1/3. How did you solve this equation? I guess you
divided both sides of the equation by 3. But let me write this in
a slightly different way:
3x = 1
3−1 3x = 3−1 1
Matrix inverse
1x = 3−1
1
x=
3
I realize that this is a gratuitously excessive number of steps to
write out explicitly, but it does illustrate a point: To separate
the 3 from the x, we multiplied by 3−1 , which we might call the
"inverse of 3." And of course, we had to do this to both sides of
the equation. Thus, 3×3−1 = 1. A number times its inverse is 1.
The number 1 is special because it is the multiplicative identity.
What about the following equation; can you solve for x here?
0x = 1
Nope, you cannot solve for x here, because you cannot compute
0/0. There is no number that can multiply 0 to give 1. The lesson
322 here is that not all numbers have inverses.
OK, now let’s talk about the matrix inverse. The matrix inverse is
a matrix that multiplies another matrix such that the product is
the identity matrix. The identity matrix is important because it
is the matrix analog of the number 1—the multiplicative identity
(AI = A). Here it is in math notation:
A−1 A = I (12.2)
Ax = b (12.3)
Ix = A−1 b (12.5)
x = A−1 b (12.6)
This was already
IMPORTANT: Because matrix multiplication is non-commutative mentioned in sec-
Conditions for invertibility You saw above that not all num-
bers have an inverse. Not all matrices have inverses either. In
fact, many (or perhaps most) matrices that you will work with in
practical applications are not invertible. Remember that square
matrices without an inverse are variously called singular, reduced-
rank, or rank-deficient.
met:
1. It is square
2. It is full-rank
So the full matrix inverse exists only for square matrices. What
does a "full" matrix inverse mean? It means you can put the
inverse on either side of the matrix and still get the identity ma-
trix:
Thus, the full matrix inverse is one of the few exceptions to matrix
multiplication commutivity.
AB = I (12.13)
AC = I (12.14)
C = CI = CAB = IB = B (12.17)
A−1 A = I (12.18)
AT A−T = I (12.20)
Matrix inverse
AA−T = I (12.21)
Avoid the inverse when possible! The last thing I want to dis-
cuss before teaching you how to compute the matrix inverse is that
the matrix inverse is great in theory. When doing abstract paper-
and-pencil work, you can invert matrices as much as you want,
regardless of their size and content (assuming they are square and
full-rank). But in practice, computing the inverse of a matrix on
a computer is difficult and can be wrought with numerical inac-
326 curacies and rounding errors.
Thus, in practical computer applications of linear algebra, you
should avoid using the explicit inverse unless it is absolutely nec-
essary.
Computer scientists have worked hard over the past several decades
to develop algorithms to solve problems that—on paper—require
the inverse, without actually computing the inverse. The details of
those algorithms are beyond the scope of this book. Fortunately,
they are implemented in low-level libraries called by MATLAB,
Python, C, and other numerical processing languages. This is
good news, because it allows you to focus on understanding the
conceptual aspects of the inverse, while letting your computer deal
with the number-crunching.
This example is for a 3×3 matrix for visualization, but the prin-
ciple holds for any number of dimensions.
This inverse procedure also shows one reason why singular ma-
trices are not invertible: A singular diagonal matrix has at least
one diagonal element equal to zero. If you try to apply the above
short-cut, you’ll end up with an element of 0/0.
1 0 2
AÃ = 0 1 0
(12.26)
0 0 1
1 0 8/3
ÃA = 0 1 0
(12.27)
0 0 1
328
You can see that à is definitely not the inverse of A, because
their product is not the identity matrix.
Before learning the full formula for computing the matrix inverse,
let’s spend some time learning another short-cut for the inverse
that works on 2×2 matrices.
You can stop the computation as soon as you see that the de-
terminant, which goes into the denominator, is zero. This is one
explanation for why a singular matrix (with ∆ = 0) has no in-
verse.
Below are two more examples of matrices and their inverses, and
a check that the multiplication of the two yields I.
" #−1 " # " #" # " #
3 1 1 1 −1 3 1 1 −1 1 0
= ⇒ =
2 1 1 −2 3 2 1 −2 3 0 1
Hint: Sometimes
it’s easier to leave
#−1
the determinant as
" " # " #" # " #
2 7 1 −8 −7 1 −8 −7 2 7 1 0
a scalar, rather = −37 ⇒ −37 =
3 −8 3 2 3 2 3 −8 0 1
than to divide
each element.
Matrix inverse
Practice problems Compute the inverse (if it exists) of the following matrices.
1 4 2 1/2 2 1/2 −9 3
a) b) c) d)
4 4 3 1/3 3 3/4 −9 −8
Answers
1 −1 1 −1 1/3 −1/2 1 −8 −3
a) 3
b) 6/5
c) No inverse. d) 99
1 −1/4 −3 2 9 −9
332
2 1 1 2
0 4 2 ,
A= m1,1 = (12.29)
1 3 2
−2
m1,2 =
(12.30)
2 −2 −4
−1
M= 3 5
(12.31)
−2 4 8
+ − + −
" # + − +
+ − − −
+ +
, − + − ,
− +
+
− + −
+ − +
− + − + 333
The formula that defines each element of the G matrix is
where i and j refer, respectively, to row and column index. Try out
a few examples to see how this formula produces the alternating
+ − + grids above. For example, the (1,1) position is −11+1 = 1 while
− + − the (1,2) position is −11+2 = −1. That said, I think it’s easier to
remember what the matrix looks like rather than memorizing the
+ − +
formula.
Figure 12.2: Sim-
ple and elegant. Finally, the cofactors matrix: C = G M. Using the example
Linear algebra is
matrix from the previous page,
truly inspiring.
2 2 −4
C= 1
3 −5
−2 −4 8
Answers
1 −2 1 −10 15 −10
4 −3
a) b) −3 −2 4 −2 c) 0 −2 8
−2 1
1 −2 1 10 −9 6
Adjugate matrix At this point, all the hard work is behind you.
The adjugate matrix is simply the transpose of the cofactors ma-
trix, scalar-multiplied by the inverse of the determinant of the
matrix (note: it’s the determinant of the original matrix, not the
minors or cofactors matrices). Again, if the determinant is zero,
334 then this step will fail because of the division by zero.
Assuming the determinant is not zero, the adjugate matrix is the
inverse of the original matrix.
And now let’s test that this matrix really is the inverse of our
original matrix:
Here is another
2 1 1 2 −1 −2 2 0 0
example of how
0 4 2 1 −2 1
2 3 4 = 0 2 0
it’s often easier
2
1 3 2 −4 5 8 0 0 2 to leave scalars
outside the matrix,
rather than dealing
Now that you know the full MCA formula, we can apply this to with fractions.
335
Practice problems Compute the adjugate matrices of the following matrices (in case you
haven’t noticed: you’ve already computed the minors and cofactors matrices above).
1 2 3 3 4 1
1 2
a) b) 4
5 6 c) 0 2 3
3 4
7 8 9 5 4 1
Answers
1 −2 1 −10 0 −10
1 4 −2 −3 −2 1
a) −2
b) 0 4 −2 c) 20
15 −2 −9
−3 1
1 −2 1 10 8 6
Practice problems Apply the MCA algorithm to the following matrices to derive their
inverses, when they exist.
2 2 4 1 −5 4
1 1
a) b) 4 2 5 c) −1 −15 6
1 4
1 2 1 2 0 3
Answers
−8 6 2
4 −1
a) 3 b) 10 1 −2 6 c) No inverse!
−1 1
6 −2 −4
Code The code challenges at the end of this chapter will provide
Matrix inverse
336
12.5 Inverse via row reduction
R +R
" # −1/2R " #
2
−−2−−→
1
1 0 −2 1 −−−−−→ 1 0 −2 1
0 −2 −3 1 0 1 3/2 −1/2
You can confirm that the augmented part of the final matrix is
the same as the inverse we computing from the MCA algorithm
in the practice problems in the previous section.
Practice problems Use row reduction to compute the inverse of the following matrices, or
to discover that the matrix is singular. Then multiply each matrix by its inverse to make sure
you’ve gotten the correct result.
1 0 2
−4 7
a) b) 1 2 1
3 −8
1 0 1
1 0 2 5 1 0 2 5
3 0 0 1 3 0 0 1
c)
0
d)
4 5 0 0 4 5 0
5 8 2 −17 0 3 4 0
Answers
−1 0 2
1 −8 −7
a) 11
b) 0 1/2 −1/2
−3 −4
1 0 −1
−1 5 −6 8
0 0 56 −70
c) No inverse. d) 14
0 0 −42 56
3 −1 18 −24
Matrix inverse
338
Code block 12.4: MATLAB
1 A = randn ( 3 ) ;
2 Ar = r r e f ( [ A eye ( 3 ) ] ) ; % RREF
3 Ar = Ar ( : , 4 : end ) ; % keep i n v e r s e
4 Ai = i n v (A ) ;
5 Ar−Ai
In Chapter 10, you learned that you can solve the equation Ax =
b by performing Gauss-Jordan elimination on the augmented ma-
trix [A|b]. If there is a solution—that is, if b is in the column
less absolutely necessary. So why, you might wonder,
should I suffer through learning how to compute it when
I can type inv on a computer? For the same reason that
you need to learn how to compute 3+4 without a calcu-
lator: You will never really learn math unless you can
do it without a computer. Frustrating but true.
Matrix inverse
I wrote on page 324 that only square matrices can have a full
inverse. That’s true, but it applies only to a full, a.k.a. two-sided,
inverse. Rectangular matrices can have a one-sided inverse. The
goal of this section is to derive the one-sided left inverse, explain
how to interpret it, and define the conditions for a rectangular
matrix to have a left inverse.
C = TT T
C−1 C = I (12.38)
Here’s the thing about Equation 12.37: The first set of parentheses
is necessary because we are inverting the product of two matrices
(neither of those matrices is individually invertible!). However,
the second set of parentheses is not necessary; they’re there just
for aesthetic balance. By removing the unnecessary parentheses
and re-grouping, some magic happens: the product of three ma-
trices that can left-multiply T to produce the identity matrix.
T-L T = I
T(TT T)−1 TT 6= I
Conditions for validity Now let’s think about the conditions for
the left inverse to be valid. Looking back to Equation 12.38,
matrix C (which is actually TT T) is invertible if it is full rank,
meaning rank(C) = N . When is TT T a full-rank matrix? Recall
from section 7.7 that a matrix times its transpose has the same
rank as the matrix on its own. Thus, TT T is full-rank if T has a
rank of N , meaning it is full column-rank.
And this leads us to the two conditions for a matrix to have a left
inverse:
" #
T 3 9
T T= (12.41)
9 29
" #
T −1 1 29 −9
(T T) = (12.42)
6 −9 3
" #
-L T −1 T 1 11 2 −7
T = (T T) T = (12.43)
6 −3 0 3
" #
-L1 6 0
T T= (12.44)
6 0 6
5 2 −1
TT-L
= 2 2
2
(12.45)
−1 2 5
You can probably guess where we’re going in this section: wide
matrices (that is, more columns than rows) do not have a left
inverse. However, they can have a right inverse. I encourage you
to try to discover the right inverse, as well as the conditions for
343
Did you figure it out? The reasoning is the same as for the left
inverse. The key difference is that you post-multiply by the trans-
posed matrix instead of pre-multiplying by it. Let’s call this ma-
trix W for wide.
WW-R = I
Code The code below shows the left inverse for a tall matrix; it’s
your job to modify the code to produce the right-inverse for a wide
matrix! (Note: In practice, it’s better to compute the one-sided
inverses via the Moore-Penrose pseudoinverse algorithm, but it’s
good practice to translate the formulas directly into code.)
345
Code block 12.5: Python
1 import numpy a s np
2 A = np . random . randn ( 5 , 3 )
3 Al = np . l i n a l g . i n v (A.T@A)@A. T
4 Al@A
This section is called "part 1" because I’m going to introduce you
to the pseudoinverse here, but I’m not going to teach you how to
compute it until section 16.12. That’s not because I’m mean, and
it’s not because I’m testing your patience (which is, by the way, a
virtue). Instead, it’s because the algorithm to compute the Moore-
Matrix inverse
1 0 1 2 −1 1
† 1 † 1
A A = 0 2 0
, AA = −1
2 1
2 3
1 0 1 1 1 2
348
12.9Exercises
12.9 Exercises
A−1 A−1 or (A−1 )−1 ? Think of" an answer
# and then confirm xn xm = xn+m
1 2
it empirically using the matrix . (xn )m = xnm
1 3
349
4. For the following matrices and vectors, compute and use A−1
n
to solve for x in An x = bn .
" # " # 1 3 4 9 0 1
1 4 3 −3
A1 = , A2 = , A3 = 0 0 4 A4 = 3 1 0
5 2 2 2
2 2 1 0 6 1
" # " # 1 8
4 6
b1 = , b2 = , b3 = 3 b4 = 3
1 −3
8 1
a) A1 x = b1 b) A1 x = b2 c) A2 x = b1 d) A2 x = b2
e) A3 x = b3 f) A3 x = b4 g) A4 x = b2 h) A4 x = b4
Matrix inverse
350
12.10 Answers
" # " #
1. 1 0 1/4 0
a) b)
0 1 0 −1/2
" #
1 −1 2
c) 3 d) No inverse!
2 −1
" # 3 0 0
1 6 4 1
e) 28 f) 3 0 −1 2
−1 4
0 2 −1
2 −1 0
1
g) 3 −1 2 0 h) No inverse!
0 0 3
1/2 0 0 " #
1 d −b
i) 0 1/4 0
j) ad−bc
−c a
0 0 1/3
−16 6 −8 16
12.10 Answers
" #
b −1 −32 −98
1 1
44 24
k) 3b l)−54 −4 −12 −2
0 3 4
−11 −6 8 11
2. The correct expression for the inverse of the inverse is (A−1 )−1 .
A−1 A−1 would mean to matrix-multiply the inverse by itself,
which is a valid mathematical operation but is not relevant
here.
" # " #
1 d −b 1 4 −7
a) ad−bc b) 18
−c a 2 1 351
7 −4 −1
1
c) 2 −1 2 −1 d) Not invertible!
−1 0 1
−8 5 12 1 6 −1
1 1
A−1 A−1
8 −7 −4 −3
3 =
16 , 4 =
27 9 3
0 4 0 18 −54 9
h iT h iT
a) 18 −4 19 b) 18 −24 33
h iT h iT
c) 12 11 −5 d) 12 3 −21
h iT h iT
e) 16 103 −45 12 f) 16 −37 39 12
h iT
g) Invalid operation! h) 27 25 6 −9
Matrix inverse
352
12.11 Code challenges
2. I wrote earlier that the algorithm underlying the MP pseudoin- Reminder: Demon-
verse is only understandable after learning about the SVD. But strations in code
help build insight
that needn’t stop us from exploring the pseudoinverse with
but are no substi-
code! The goal of this challenge is to illustrate that the pseu- tute for a formal
doinverse is the same as (1) the "real" inverse for a full-rank proof.
square matrix, and (2) the left inverse for a tall full-column-
rank matrix.
353
12.12 Code solutions
16
17 c o l s = [ True ] ∗m
18 cols [ j ] = False
19
20 M[ i , j ]=np . l i n a l g . d e t (A[ rows , : ] [ : , c o l s ] )
21
22 # compute G
23 G[ i , j ] = ( −1)∗∗( i+j )
24
25 # compute C
26 C =M ∗ G
27
28 # compute A
29 Ainv = C . T / np . l i n a l g . d e t (A)
30 AinvI = np . l i n a l g . i n v (A)
31 AinvI−Ainv # compare a g a i n s t i n v ( )
354
Code block 12.10: MATLAB
1 % c r e a t e matrix
2 m = 4;
3 A = randn (m) ;
4 [M,G] = d e a l ( z e r o s (m) ) ;
5
6 % compute m a t r i c e s
7 f o r i =1:m
8 f o r j =1:m
9
10 %% s e l e c t rows / c o l s
11 rows = t r u e ( 1 ,m) ;
12 rows ( i ) = f a l s e ;
13
14 c o l s = t r u e ( 1 ,m) ;
15 cols ( j ) = false ;
16
17 % compute M
18 M( i , j ) = d e t ( A( rows , c o l s ) ) ;
19
20 % compute G
21 G( i , j ) = ( −1)^( i+j ) ;
22 end
355
2. .
6
7 % t a l l matrix
8 T = randn ( 5 , 3 ) ;
9 Tl = i n v (T’ ∗T) ∗T ’ ; % l e f t i n v
10 Tpi = pinv (T ) ; % pinv
11 Tl − Tpi % t e s t e q u i v a l a n c e
356
CHAPTER 13
Projections and
orthogonalization
13.1 Projections in R2
There is some We start with a vector a, a point b not on a, and a scalar β such
"overloading" (us- that βa is as close to b as possible without leaving a. Figure 13.1
ing the same no-
shows the situation. (Because we are working with standard-
tation for different
meanings) in this position vectors, it is possible to equate coordinate points with
figure—the let- vectors.)
Projections
One way to think about this is to imagine that the line from βa
to b is one side of a right triangle. Then the line from b to a is the
hypotenuse of that triangle. Any hypotenuse is longer than the
adjacent side, and so the shortest hypotenuse (i.e., the shortest
distance from b to a) is the adjacent side.
(b − βa) ⊥ a (13.1)
And that in turns means that the dot product between them is
zero. Thus, we can rewrite Equation 13.1 as
13.1 Projections in R2
(b − βa)T a = 0 (13.2)
And that is the key insight that geometry provides us. From here,
solving for β just involves a bit of algebra. It’s a beautiful and
important derivation, so I’ll put all of it into a math box.
aT (b − βa) = 0 (13.3)
aT b − βaT a = 0
βaT a = aT b
aT b
β= (13.4)
aT a
359
Note that dividing both sides of the equation by aT a is valid
because it is a scalar quantity.
In the fraction, one Note that Equation 13.4 gives the scalar value β, not the actual
a in the denomina- point represented by the open circle in Figure 13.1. To calculate
tor doesn’t cancel that vector βa, simply replace β with its definition:
the multiplying
a, because the
Projection of a point onto a line
vector in the de-
nominator is part
aT b
of a dot product. proja (b) = a (13.5)
aT a
Let’s work through an example with real numbers. We’ll use the
following vector and point.
" #
−2
Projections
a= , b = (3, −1)
−1
If you draw these two objects on a 2D plane, you’ll see that this
example is different from that presented in Figure 13.1: The point
is "behind" the line, so it will project negatively onto the vector.
Let’s see how this works out algebraically.
" #T " #
−2 3
" # " # " #
−1−1 −2 −6 + 1 −2 −2
proja (b) = " #T " # = = −1
−2 −2 −1 4 + 1 −1 −1
−1 −1
(b − βa)
a b
Practice problems Draw the following lines (a) and points (b). Draw the approximate
location of the orthogonal projection of b onto a. Then compute the exact proja (b) and compare
with your guess.
−.5 0
a) a = , b = (.5, 2) b) a = , b = (0, 1)
2 2
2 1
c) a = , b = (0, −1) d) a = , b = (2, 1)
0 2
13.2 Projections in RN
c) 0 d) 4/5
Mapping over magnitude: Meditating on Equation 13.4
will reveal that it is a mapping between two vectors,
scaled by the squared length of the "target" vector. It’s
Reflection
useful to understand this intuition (mapping over magni-
tude), because many computations in linear algebra and
its applications (e.g., correlation, convolution, normal-
ization) involve some kind of mapping divided by some
kind of magnitude or norm.
361
13.2 Projections in RN
AT Ax = AT b (13.7)
Projections
362
I’m sure you guessed it: Left-multiply by (AT A)−1 .
You can see that this equation involves inverting a matrix, which
should immediately raise the question in your mind: What is the
condition for this equation to be valid? The condition, of course,
is that AT A is square and full-rank. And you know from the
previous chapter that this is the case if A is already square and
full-rank, or if it is tall and full column-rank.
13.2 Projections in RN
It is insightful to think back to the geometric perspective: we
are trying to project a point (also represented as the end-point
of a vector in its standard position) b onto the column space of
matrix A. If b is in the column space of A, then x indicates the
combinations of columns in A that produce b. Figure 13.3 shows
a visualization of this idea for a 2D column space embedded in
3D ambient space (in other words, the matrix A has three rows
and rank=2).
x = A−1 A−T AT b
x = A−1 b (13.9)
1 2 5.5
x1 −2 2 x1 12
a) 3 1 = −3.5 b) =
x2 1 3 x2 46
1 1 1.5
1 0 2 1 3 4
x1 x1
c) 0 1 = 1 d) 2 6 =8
x2 x2
0 0 1.2 3 9 11
Answers
T T
a) −2.5 4 b) 7 13
T
c) 2 1 d) Doesn’t work! (AT A is singular)
ition underlying the least-squares formula, which is the
mathematical backbone of many analyses in statistics,
364 machine-learning, and AI. Stay tuned...
Code You can implement Equation 13.8 based on what you
learned in the previous chapter. But that equation is so important
and is used so often that numerical processing software packages
have short-cuts for solving it. The Python code uses the numpy
function lstsq, which stands for least-squares.
13.3
13.3 Orth and par vect comps
Orthogonal and parallel vector components
In this section, you will learn how to decompose one vector into
two separate vectors that are orthogonal to each other, and that
have a special relationship to a third vector. This procedure will
draw on concepts you learned in Chapter 2 (adding and subtract-
ing vectors) and in the previous section (projecting onto vectors).
And in turn, it forms the basis for orthogonalization, which you’ll
learn about in the following section.
Now that you have the picture, let’s start deriving formulas and
proofs. Don’t worry, it’ll be easier than you might think.
wT v
This is just a mod- w||v = projv w = v (13.11)
vT v
ified version of
Equation 13.5.
wT v
w||v = projv w = v (13.14)
vT v
Here’s a tip that might help you remember these formulas: Be-
cause wkv is parallel to v, it’s really the same vector as v but
scaled. Thus, wkv = αv. Try to link the formulas to their geo-
metric pictures rather than memorizing letters; the vector symbols
(e.g., w and v) will change in different examples and textbooks.
We need to prove that wkv and w⊥v really are orthogonal to each
other. The proof comes from taking the dot product between
them, and then expanding then simplifying the results. Hopefully
we’ll find that the dot product is zero.
wT v T wT v wT v T
= v w − v v If you squint, this
vT v vT v vT v
proof looks like a
(wT v)2 (wT v)2 flock of birds.
= −
vT v vT v
=0
That looks on first glance like a messy set of equations, but when
keeping in mind that the dot product is a single number, I hope
you will see that it’s actually a straightforward proof. If you
find it difficult to follow, then consider re-writing the proof using
α = wT v and β = vT v.
" # " #
2·4 + 3·0 4 2
vkw = =
4·4 + 0·0 0 0
" # " # " #
2 2 0
v⊥w = − =
3 0 3
It’s easily seen in this example that v||w and v⊥w are orthogonal
and that v||w + v⊥w = v. This example is also easy to map onto
geometry (Figure 13.5).
v⊥w v
vkw w
2 0
3 −2
5 −2
a = , b =
0 5
1 5
368 2 −6
0 0
−2 −2
2·0 + 3·-2 + 5·-2 + 0·5 + 1·5 + 2·−6 −2 −23
−2
akb = =
0·0 + -2·-2 + -2·-2 + 5·5 + 5·5 + -6·-6
5
94
5
5 5
−6 −6
2 0 2 0 188
3 −2 3 2 236
5 −23 −2 94 5 23
2 1 424
a⊥b = − = − =
0
94
5
94
0 94 −5
94
115
1 5 1 −5 209
2 −6 2 6 50
You can confirm (on paper or computer) that the two vector com-
ponents (1) sum to produce the original vector a and (2) are or-
Practice problems For the following pairs of vectors (a, b), decompose a into parallel and
perpendicular components relative to b. For R2 problems, additionally draw all vectors.
0 2
2 0 1/2 −1
a) , b) , c) 0 , 4
2 −3 0 2
3 0
369
A quick glance at the geometric representations of vec-
tors in R2 , particularly when one vector is horizontal or
vertical, provides an important visual confirmation that
the algebra was correct. Whenever possible, you should
learn math by solving problems where you can see the
Reflection
correct answer, and working on more challenging prob-
lems only after understanding the concept and algorithm
in the visualizable examples. This same principle under-
lies the motivation to test statistical analysis methods
on simulated data (where ground truth is known) before
applying those methods to empirical data (where ground
truth is usually unknown).
QT Q = I (13.18)
1 2 2
1 2 4 2 5
A wide matrix has size M < N and its maximum possible rank is
M , corresponding to the number of rows. Thus, the largest possi-
ble set of linearly independent columns is M ; the rest of columns
r + 1 through N can be expressed as some linear combination of
the first M columns, and thus cannot possibly be orthogonal.
You can see that all columns are pairwise orthogonal, but the third
column has a norm of 0. This matrix transposed still provides a
right inverse. However, it does not provide a left inverse.
−2 −1 " # 5 0 0
T 1 −2 −1 0 1
Q Q= −1 2
= 0 5 0 (13.29)
5 −1 2 0 5
0 0 0 0 0
" # −2 −1 " #
1 −2 −1 0 1 5 0
QQT =
−1
5 −1 2 0
2= 5 0 5 (13.30)
0 0
Practice problems Determine the value of ζ that will produce orthogonal matrices.
2−.5 2−.5
0 cos(ζ) − sin(ζ) 0
−1/2 1/2
a) ζ b) 0 2−.5 −2−.5 c) sin(ζ) cos(ζ) 0
1/2 1/2
ζ 0 0 0 0 1
|v1 |
" #
1 1
v1∗ =√
10 3 Normalize to unit
" # " # " # " # vector
1 1·1 + 3·−1 1 1 + 1/5 1 6
v2∗ = v2 − v2kv1 = − = =
−1 1·1 + 3·3 3 −1 + 3/5 5 −2 Orthogonalize to
v1
√ " #
5 6
v2∗ = √
10 2 −2 Normalize to unit
" # " # vector
−2 10−1/2 ·−2 + 3·10−1/2 1
v3∗ = v3 − v3kv∗1 − v3kv∗2 = −
1 (10−1/2 )2 + (3·10−1/2 )2 3 Orthogonalize to
v∗1 and to v∗2
√ √ √ √ " #
6 5/10 2·−2 + −2 5/10 2 6
− √ √ √ √
(6 5/10 2)2 + (−2 5/10 2)2 −2
√
−7 5
Figure 13.6 shows the column vectors of V and Q. Aside from the
intense arithmetic on the previous page, it is also geometrically
obvious that q3 must be zeros, because it is not possible to have
more than two orthogonal vectors in R2 . This is an application
of the theorem about the maximum number of vectors that can
form a linearly independent set, which was introduced in section
4.6.
Projections
376
Figure 13.6: Lines show column vectors in matrix V (left) and
Q (right). Note that vector q3 is a dot at the origin.
Answers
1 −1 1 3 3 0
a) √1 b) √1 c) 1
2 10 5
1 1 −3 1 4 0
13.6 QR decomposition
due to round-off errors that propagate forward to each
subsequent vector and affect both the normalization and
Reflection
13.6 QR decomposition
A = QR (13.31)
The Q here is the same Q that you learned about above; it’s the
result of Gram-Schmidt orthogonalization (or other comparable
QR decomposi- but more numerically stable algorithm). R is like a "residual"
tion is unrelated matrix that contains the information that was orthogonalized out
to QR codes. of A. You already know how to create matrix Q; how do you
QT A = QT QR (13.32)
QT A = R (13.33)
I’d like to show you an example, but because you already know
how to compute Q, I’ll simply write the answer without details
or fanfare; that will allow us to focus the discussion on R.
√ −1 √ −1
1 0 2 6
√ −1 √ −1
A = 1 1 , ⇒ Q = 2
− 6 (13.34)
√ −1
0 1 0 −2 6
Projections
"√ −1 √ −1 # 1 0 "√ √ −1 #
2 2 0 2 2
QT A
R= = √ −1 √ −1 √ −1
1 1 = 0
√
6 − 6 −2 6 − 6/2
0 1
(13.35)
13.6 QR decomposition
same size as A. This is true regardless of the rank of A (more on
rank and QR decomposition below).
It may seem surprising that the rank of Q can be higher than the
rank of A, considering that Q is created from A. But consider
this: If I give you a vector [-1 1], can you give me a different,
orthogonal, vector? Of course you could: [1 1] (among others).
Thus, new columns in Q can be created that are orthogonal to
all previous columns. You can try this yourself in Python or
MATLAB by taking the QR decomposition of 1 (the matrix of all
380 1’s).
On the other hand, R will have the same rank as A. First of
all, because R is created from the product QT A, the maximum
possible rank of R will be the rank of A, because the rank of A
is equal to or less than the rank of Q. Now let’s think about why
the rank of R equals the rank of A: Each diagonal element in R
is the dot product between corresponding columns in A and Q.
Because each column of Q is orthogonalized to earlier columns in
A, the dot products forming the diagonal of R will be non-zero
as long as each column in A is linearly independent of its earlier
columns. On the other hand, if column k in A can be expressed
as a linear combination of earlier columns in A, then column k of
Q is orthogonal to column k of A, meaning that matrix element
Rk,k will be zero.
A = QR
A−1 = (QR)−1
382
13.8 Exercises
1. Without looking back at page 367, derive wkv and w⊥v and
prove that they are orthogonal.
a) Q−T
383
13.9 Answers
2. a) Q
b) (QT )2
c) Q2 A
4. Here’s the proof; the key result at the end is that kQxk = kxk.
= xT QT Qx
= xT x = kxk2
b) Yes
Projections
384
13.10 Code challenges
The first two columns of your Q should match v1∗ and v2∗
in the text. But you will not get a third column of zeros;
it will be some other vector that has magnitude of 1 and is
not orthogonal to the previous two columns (easily confirmed:
QT Q 6= I). What on Earth is going on here?!?!
385
13.11 Code solutions
1. This one’s easy, so I’m not giving you code! You simply need
to create random matrices of various sizes, and check the sizes
of the resulting Q and R matrices.
14 f o r j in range ( i ) : # only to e a r l i e r c o l s
15 q = Q[ : , j ] # convenience
16 Q [ : , i ]=Q [ : , i ]−np . dot ( a , q ) / np . dot ( q , q ) ∗ q
17
18 # normalize
19 Q [ : , i ] = Q [ : , i ] / np . l i n a l g . norm (Q [ : , i ] )
20
21 # QR
22 Q2 ,R = np . l i n a l g . qr (A)
386
Code block 13.6: MATLAB
1 m = 4;
2 n = 4;
3 A = randn (m, n ) ;
4 Q = z e r o s (m, n ) ;
5
6 f o r i =1:n % l o o p through columns ( n )
7
8 Q( : , i ) = A( : , i ) ;
9
10 % orthogonalize
11 i f i >1
12 a = A( : , i ) ; % c o n v e n i e n c e
13 f o r j =1: i −1 % o n l y t o e a r l i e r columns
14 q = Q( : , j ) ; % c o n v e n i e n c e
15 Q( : , i ) = Q( : , i ) − ( a ’ ∗ q / ( q ’ ∗ q ) ) ∗ q ;
16 end
17 end
18
19 % normalize
20 Q( : , i ) = Q( : , i ) / norm (Q( : , i ) ) ;
21 end
22
23 % QR
387
3. The issue here is normalization of computer rounding errors.
To see this, modify your algorithm so that column 3 of Q is
not normalized.
You will find that both components of the third column have
values close to zero, e.g., around 10−15 . That’s basically zero
plus computer round error. The normalization step is making
mountains out of microscopic anthills. Congratulations, you
have just discovered one of the reasons why the "textbook"
Gram-Schmidt algorithm is avoided in computer applications!
Projections
388
Least-squares
CHAPTER 14
This is why models contain both fixed features and free param-
eters. The fixed features are components of the model that the
scientist imposes, based on scientific evidence, theories, and in-
tuition (a.k.a. random guesses). Free parameters are variables
that can be adjusted to allow the model to fit any particular data
set. And this brings us to the primary goal of model-fitting: Find
values for these free parameters that make the model match the
data as closely as possible.
390 Here is an example to make this more concrete. Let’s say you want
to predict how tall someone is. Your hypothesis is that height is a
result of the person’s sex (that is, male or female), their parents’
height, and their childhood nutrition rated on a scale from 1-10.
So, males tend to be taller, people born to taller parents tend to be
taller, and people who ate healthier as children tend to be taller.
Obviously, what really determines an adult’s height is much more
complicated than this, but we are trying to capture a few of the
important factors in a simplistic way. We can then construct our
model of height:
h = β1 s + β2 p + β3 n + (14.1)
But here’s the thing: I don’t know how important each of these
Of course, that leads to the question of how you find the model
parameters that make the model match the data as closely as
possible. Randomly picking different values for the βs is a ter-
rible idea; instead, we need an algorithm that will find the best
parameters given the data. This is the idea of model fitting. The
most commonly used algorithm for fitting models to data is linear
least-squares, and that is what you will learn in this chapter.
391
14.2 The five steps of model-fitting
Step 2 Work the data into the model. You get existing data
from a database or by collecting data in a scientific experiment.
Or you can simulate your own data. I made up the data in Table
14.1 for illustration purposes.
But this is not the format that we need the data in. We need to
map the data in this table into the model. That means putting
these numbers inside the equation from step 1. The first row of
Binary variables data, converted into an equation, would look like this:
are often "dummy-
coded," meaning 180 = β1 1 + β2 175 + β3 8
that female=0 and
male=1 (this map-
ping will make pos- One row of data gives us one equation. But we have multiple
itive β correspond rows, corresponding to multiple individuals, and so we need to
to taller adults). create a set of equations. And because those equations are all
linked, we can consider them a system of equations. That system
392 will look like this:
180 = β1 1 + β2 175 + β3 8
170 = β1 0 + β2 172 + β3 6
.. (14.2)
.
176 = β1 0 + β2 189 + β3 7
Notice that each equation follows the same "template" from Equa-
tion 14.1, but with the numbers taken from the data table. (The
statistically astute reader will notice the absence of an intercept
term, which captures expected value of the data when the pre-
dictors are all equal to zero. I’m omitting that for now, but will
discuss it later.)
Step 4 Solve for β, which is also called fitting the model, es-
timating the parameters, computing the best-fit parameters, or
some other related terminology. The rest of the chapter is fo-
cused on this step. 393
Step 5 Statistically evaluate the model: is it a good model? How
well does it fit the data? Does it generalize to new data or have we
over-fit our sample data? Do all model terms contribute or should
some be excluded? This step is all about inferential statistics, and
it produces things like p-values and t-values and F-values. Step
5 is important for statistics applications, but is outside the scope
of this book, in part because it relies on probability theory, not
on linear algebra. Thus, step 5 is not further discussed here.
14.3 Terminology
The final point I want to make here is about the term "linear" in
linear least-squares. Linear refers to the way that the parameters
are estimated; it is not a restriction on the model. The model may
contain nonlinear terms and interactions; the restriction for lin-
earity is that the coefficients (the free parameters) scalar-multiply
their variables and sum to predict the dependent variable. Basi-
cally that just means that it’s possible to transform the system
394 of equations into a matrix equation. That restriction allows us to
use linear algebra methods to solve for β; there are also nonlinear
methods for estimating parameters in nonlinear models.
h = β1 s + β2 β3 p + β3 n3 +
p
(14.4)
√
One nonlinearity is in the regressors (β2 β3 ), and one is in a
predictor (n3 ). The latter is no problem; the former prevents
linear least-squares from fitting this model (there are nonlinear
alternatives that you would learn about in a statistics course).
You know what the solution is: instead of the full inverse, we use
the left-inverse. Thus, the solution to the least squares problem 395
is:
Xβ = y
y = Xβ + (14.6)
ŷ = y + (14.7)
ŷ = Xβ̂ (14.8)
There is more to say about this vector (section 14.7), but first
I want to prove to you that the least squares equation minimizes
. And to do that, we need to re-derive the least-squares solution
from a geometric perspective.
XT = 0 (14.9)
XT (y − Xβ) = 0 (14.10)
XT y − XT Xβ = 0 (14.11)
XT Xβ = XT y (14.12)
398 We also see that the design matrix X and the residuals vector
are orthogonal. Geometrically, that makes sense; statistically, it
means that the prediction errors should be unrelated to the model,
which is an important quality check of the model performance.
You might have noticed that I’m a bit loose with the
plus signs and minus signs. For example, why is de-
fined as y − Xβ and not Xβ − y? And why is added
in Equation 14.6 instead of subtracted? Sign-invariance
Reflection
often rears its confusing head in linear algebra, and in
many cases, it turns out that when the sign seems like
it’s arbitrary, then the solution ends up being the same
regardless. Also, in many cases, there are coefficients
floating around that can absorb the signs. For example,
you could flip the signs of all the elements in to turn
vector − into +.
XT Xβ = XT y (14.16)
the right.
The reason why this works comes from thinking about the row
spaces of X and XT X. Remember from Chapter 8 (section 8.5)
that these two matrices have the same row space, and that XT X
is a more compact representation of the space spanned by rows
of X. Furthermore, assuming that X has linearly independent
columns (which is an assumption we’ve been making in this entire
chapter), then XT X spans all of RN , which means that any point
in RN is guaranteed to be in the column space of XT X. And
XT y is just some point in RN , so it’s definitely going to be in the
column space of XT X. Therefore, β contains the coefficients on
the columns of matrix XT X to get us exactly to XT y. Note that
400 with the Gauss-Jordan approach to solving least-squares using
the normal equations, we never leave the N-dimensional subspace
of RN , so the question of whether y ∈ C(X) doesn’t even come
up.
= Xβ − y (14.18)
β = (XT X)−1 XT y
Let me bring your attention back to Figure 14.1. The data vector
y is unlikely to be exactly in the column space of the design matrix
X, and vector gets us from the data into the subspace spanned
by the design matrix. We also know that is defined as being
orthogonal to the column space of X.
= Xβ − y (14.21)
401
When is the model a good fit to the data? It is sensible that the
smaller is, the better the model fits the data. In fact, we don’t
really care about the exact values comprising (remember, there
is one i for each data value yi ); instead, we care about the norm
of .
0 = XT Xβ − XT y (14.25)
XT Xβ = XT y (14.26)
D = {−4, 0, −3, 1, 2, 8, 5, 8}
d = β1 + (14.28)
That was step 1. Step 2 is to work the data into the model. That
will give a series of equations that looks like this:
−4 = β1
0 = β1
−3 = β1
17
β= = 2.125 (14.33)
8
It’s always a good idea to visualize data and models. Let’s see
what they look like (Figure 14.2).
a "linear trend"). The new model equation from step 3 will look
like this:
1 −4
2 0
204
β= = .7255 (14.37)
148
Figure 14.3 shows the same data with the new model-predicted
data values. 405
Figure 14.3: Our second modeling attempt.
It looks better compared to Figure 14.2, but still not quite right:
The predicted data are too high (over-estimated) in the beginning
and too low (under-estimated) in the end. The problem here is
that the model lacks an intercept term.
The intercept is the expected value of the data when all other
parameters are set to 0. That’s not the same thing as the average
value of the data. Instead, it is the expected value when all model
parameters have a value of zero. Thinking back to the example
Least-squares
The arithmetic gets a bit more involved, but produces the follow-
ing.
I’m sure you agree that this third model (linear trend plus inter-
cept) fits the data reasonably well (Figure 14.4). There are still
residuals that the model does not capture, but these seem like
they could be random fluctuations. Notice that the best-fit line
does not go through the origin of the graph. We don’t see exactly
where the line will cross the y-axis because the x-axis values start
at 1. However, β1 , the intercept term, predicts that this crossing
would be at y = −5.5357, which looks plausible from the graph.
The final thing I want to do here is confirm that ⊥ Xβ. The two
columns below show the residual and predicted values, truncated
at 4 digits after the decimal point. 407
Figure 14.4: Our final model of this dataset.
Xβ
0.1667 −3.8333
−2.1310 −2.1310
2.5714 −0.4286
0.2738 1.2738
0.9762 2.9762
−3.3214 4.6786
1.3810 6.3810
0.0833 8.0833
Least-squares
T (Xβ) = 0.0000000000000142...
It may seem like the dot product is not exactly zero, but it is 14
orders of magnitude smaller than the data values, which we can
consider to be zero plus computer rounding error.
408
14.9 Code challenges
5. One measure of how well the model fits the data is called
R2 ("R-squared"), and can be interpreted as the proportion
of variance in the dependent variable that is explained by the
design matrix. Thus, an R2 of 1 indicates a perfect fit between
the model. The definition of R2 is
P 2
2 i i
R =1− P 2
(14.42)
i (yi − ȳ)
Compute R2 for the model to see how well it fits the data. Are
you surprised, based on what you see in the scatterplots?
409
14.10 Code solutions
A simple model would predict that time and age are both
linear predictors of widget purchases. There needs to be an
intercept term, because the average number of widgets pur-
chased is greater than zero. The variables could interact (e.g.,
older people buy widgets in the morning while younger peo-
ple buy widgets in the evening), but I’ve excluded interaction
terms in the interest of brevity.
y = β1 + β2 t + β3 a
410
Code block 14.1: Python
1 import numpy a s np
2
3 # l o a d t h e data
4 data = np . l o a d t x t ( fname= ’ widget_data . t x t ’ ,
5 d e l i m i t e r= ’ , ’ )
6
7 # d e s i g n matrix
8 X = np . c o n c a t e n a t e ( ( np . o n e s ( ( 1 0 0 0 , 1 ) ) ,
9 data [ : , : 2 ] ) , a x i s =1)
10
11 # outcome v a r i a b l e
12 y = data [ : , 2 ]
13
14 # beta c o e f f i c i e n t s
15 b e t a = np . l i n a l g . l s t s q (X, y ) [ 0 ]
16
17 # s c a l e d c o e f f i c i e n t s ( i n t e r c e p t not s c a l e d )
18 b e t a S c a l e d = b e t a /np . s t d (X, a x i s =0, ddof =1)
411
4. The figure is below, code thereafter.
412
5. The model accounts for 36.6% of the variance of the data.
That seems plausible given the variability in the data that can
be seen in the graphs.
413
Least-squares
CHAPTER 15
Eigendecomposition
Eigenvalue equation
Av = λv (15.1)
The example shown in Figure 15.3 has λ = .5. That means that
Av = .5v. In other words, the matrix shrunk the vector by half,
without changing its direction.
Why am I writing all of this? It turns out that that the gray line is
an eigenvector of the data matrix times its transpose, which is also
called a covariance matrix. In fact, the gray line shows the first
principal component of the data. Principal components analysis
(PCA) is one of the most important tools in data science (for 419
example, it is a primary method used in unsupervised machine
learning), and it is nothing fancier than an eigendecomposition of
a data matrix. More on this in Chapter 19.
420
15.2 Finding eigenvalues
Eigenvectors are like secret passages that are hidden inside the
matrix. In order to find those secret passages, we first need to
find the secret keys. Eigenvalues are those keys. Thus, eigende-
composition requires first finding the eigenvalues, and then using
those eigenvalues as "magic keys" to unlock the eigenvectors.
Av − λv = 0 (15.2)
Notice that both terms of the left-hand side of the equation con-
tain vector v, which means we can factor out that vector. In
order to make this a valid operation, we need to insert an identity
matrix before the vector.
Av − λIv = 0 (15.3)
(A − λI)v = 0 (15.4)
Finding eigenvalues
|A − λI| = 0 (15.5)
c d − λ
(a − λ)(d − λ) − bc = 0
αλ2 − βλ + γ = 0 (15.8)
α=1
β = −(a + d)
γ = ad − bc
p
−β ± β 2 − 4γ
λ= (15.9)
2
p
(a + d) ± (a + d)2 − 4(ad − bc)
λ= (15.10)
2
(1 − λ)(1 − λ) − 4 = 0
λ2 − 2λ − 3 = 0
(λ − 3)(λ + 1) = 0
⇒ λ1 = 3, λ2 = −1
(1 − λ)(3 − λ) − 4 = 0
λ2 − 4λ − 1 = 0
√
4± 20
λ=
2
√
λ=2± 5
Answers
√
a) 4, −4 b) ± 15 c) 3, 5 d) −3, 8
You still need to solve the equation for λ, so this admittedly isn’t
such a brilliant short-cut. But it will get to you the characteristic
polynomial slightly faster.
Be mindful that this trick works only for 2×2 matrices; don’t try
to apply it to any larger matrices.
Practice problems Use the trace-determinant trick to find the eigenvalues of the following
matrices.
−4 1 −2 2 6 −6 6 3
a) b) c) d)
1 3 −3 2 0 −3 3 1.5
Answers
√ √
a) (−1 ± 53)/2 b) ± 2i c) 6, -3 d) 0, 7.5
424
Eigenvalues of a 3×3 matrix The algebra gets more complicated,
but the principle is the same: Shift the matrix by −λ and solve
for ∆ = 0. The characteristic polynomial produces a third-order
equation, so you will have three eigenvalues as roots of the equa-
tion. Here is an example.
9 0 −8 9 − λ 0 −8
15 3 −15
⇒ 15
3 − λ −15 = 0
0 0 1 0 0 1 − λ
(9 − λ)(3 − λ)(1 − λ) = 0
λ1 = 9, λ2 = 3, λ3 = 1
−λ3 + λ2 + 5λ + 9 = 0
(3 − λ)(1 − λ)(3 − λ) = 0
Answers
a) 1, 2, 3 b) 0, 1, 3 c) 4, 3, 3
You can already see from the examples above that eigen-
values have no intrinsic sorting. We can come up with
sensible sorting, for example, ordering eigenvalues ac-
Reflection
cording to their position on the number line or mag-
nitude (distance from zero), or by a property of their
corresponding eigenvectors. Sorted eigenvalues can facil-
itate data analyses, but eigenvalues are an intrinsically
unsorted set.
into the matrix, turn it, and the eigenvector will be revealed.
vi ∈ N (A − λi I) (15.13)
(A − λi I)vi = 0 (15.14)
426
Let’s work through a practice problem. I’ll use the matrix pre-
sented on page 423. As a quick reminder, that matrix and its
eigenvalues are:
" #
1 2
⇒ λ1 = 3, λ2 = −1
2 1
normalization.
Our three eigenvalues are 0, -1, and 11. Notice that we got a λ =
0, and the matrix is reduced-rank (the third column is the sum of
the first two). That’s not a coincidence, but a deeper discussion
of the interpretation of zero-valued eigenvalues will come later.
Now we can solve for the three eigenvectors, which are the vectors
just to the left of the equals sign (non-normalized).
1−0 2 3 1 2 3 1 0
15.4 Diagonalization
λ1 : 4
3−0 7 ⇒ 4 3 7 1 = 0
3 3 6−0 3 3 6 −1 0
1−1 2 3 2 2 3 1 0
λ2 : 4
3−1 7 ⇒ 4 4 7 −1 = 0
3 3 6−1 3 3 7 0 0
You can appreciate
1 − 11 2 3 −10 2 3 19 0
why linear algebra
λ3 : 4
3 − 11 7 ⇒ 4 −8
7 41 = 0
was mostly a the-
3 3 6 − 11 3 3 −5 36 0 oretical branch of
mathematics before
computers came to
the rescue.
429
15.4 Diagonalization via eigendecomposition
It is time to take a step back and look at the big picture. An M×M
matrix contains M eigenvalues and M associated eigenvectors.
The set of eigenvalue/vector pairs produces a set of similar-looking
equations:
Av1 = λ1 v1
Av2 = λ2 v2
.. (15.18)
.
Avm = λm vm
Those two questions are closely related. The reason why the
430 eigenvalues go in the diagonal of a matrix that post-multiplies
the eigenvectors matrix is that the eigenvalues must scale each
column of the V matrix, not each row (refer to page 151 for the
rule about pre- vs. post-multiplying a diagonal matrix). If Λ pre-
multiplied V, each element of each eigenvector would be scaled
by a different λ.
There is another reason why it’s VΛ and not ΛV. If the equa-
tion read AV = ΛV, then we could multiply both sides by V−1 ,
producing the statement A = Λ, which is not generally true.
So then why did I write λv for the single equation instead of the
more-consistent vλ? For better or worse, λv is the common way
to write it, and that’s the form you will nearly always see.
A = VΛV−1 (15.21)
Let’s now revisit the Rubik’s cube analogy from the beginning of
this chapter: In Equation 15.21, A is the scrambled Rubik’s cube
with all sides having inter-mixed colors; V is the set of rotations
that you apply to the Rubik’s cube in order to solve the puzzle; Λ
is the cube in its "ordered" form with each side having exactly one
color; and V−1 is the inverse of the rotations, which is how you
would get from the ordered form to the original mixed form.
Figure 15.5 shows what diagonalization looks like for a 5×5 ma- 431
Figure 15.5: Diagonalization of a matrix in pictures.
The previous reflection box mentioned sorting eigenval-
ues. Figure 15.5 shows that the eigenvalues are sorted
Reflection
ascending along the diagonal. Re-sorting eigenvalues is
fine, but you need to be diligent to apply the same re-
sorting to the columns of V, otherwise the eigenvalues
and their associated eigenvectors will be mismatched.
Eigendecomposition
But there are matrices for which no matrix V can make that
decomposition true. Here’s an example of a non-diagonalizable
matrix.
" # " #
1 1 1 −1
A= , λ = {0, 0}, V=
−1 −1 −1 1
Notice that the matrix is rank-1 and yet has two zero-valued eigen-
values. This means that our diagonal matrix of eigenvalues would
be the zeros matrix, and it is impossible to reconstruct the original
matrix using Λ = 0.
All triangular matrices that have zeros on the diagonal are nilpo-
tent, have all zero-valued eigenvalues, and thus cannot be diago-
nalized.
All hope is not lost, however, because the singular value decom-
position is valid on all matrices, even the non-diagonalizable ones.
The two non-diagonalizable example matrices above have singular
values, respectively, of {2,0} and {1,0}.
Eigendecomposition
1
A proof for this statement is given in section 15.11.
434
15.6 Distinct vs. repeated eigenvalues
Suppose that two eigenvectors associated with distinct eigenvalues were linearly dependent. Then there would be non-zero weights β1 and β2 such that

β1 v1 + β2 v2 = 0,   β1, β2 ≠ 0    (15.22)

Multiplying both sides by A gives

β1 Av1 + β2 Av2 = A0
β1 λ1 v1 + β2 λ2 v2 = 0    (15.23)

Next, multiply Equation 15.22 by λ1:

β1 λ1 v1 + β2 λ1 v2 = 0

Subtracting Equation 15.23 from this equation eliminates the v1 terms:

β2 λ1 v2 − β2 λ2 v2 = 0

The two terms in the left-hand side of the difference equation both contain β2 v2, so that can be factored out, revealing the final nail in the coffin of our to-be-falsified hypothesis:

(λ1 − λ2) β2 v2 = 0    (15.24)
Why is this the key equation? It says that we multiply three terms to obtain the zeros vector, which means at least one of these terms must be zero. (λ1 − λ2) cannot equal zero, because we began from the assumption that the eigenvalues are distinct. v2 ≠ 0 because we do not consider zeros eigenvectors. That leaves β2 = 0, which contradicts the assumption that β1 and β2 are non-zero; the eigenvectors therefore cannot be linearly dependent.

Either way you look at it, the conclusion is that distinct eigenvalues lead to linearly independent eigenvectors.
We did not need to impose any assumptions about the field from
which the λs are drawn; they can be real or complex-valued, ratio-
nal or irrational, integers or fractions. The only important quality
is that each λ is distinct.
It's clear how to get the first eigenvector, but then how do you get the second eigenvector? The answer is that you don't: there is only one eigenvector. MATLAB will return the following. (MATLAB normalized the vectors to be unit-length: the vector [2 1] points in the same direction as [2 1]/√5.)

>> [V,L] = eig([6 -4; 1 2])
V =
    0.8944   -0.8944
    0.4472   -0.4472
L =
     4     0
     0     4

Now consider the following matrix:

[4 0; 0 4]  ⇒  |4−λ  0;  0  4−λ| = 0  ⇒  λ² − 8λ + 16 = 0  ⇒  (λ − 4)² = 0
Again, the two solutions are λ = 4. Plugging this into the matrix yields

[0 0; 0 0] [v1; v2] = 0  ⇒  ?    (15.25)
Now we have an interesting situation: any vector times the zeros
matrix produces the zeros vector. So all vectors are eigenvectors
of this matrix. Which two vectors to select? The standard basis
vectors are a good practical choice, because they are easy to work
with, are orthogonal, and have unit length. Therefore, for the
matrix above, V = I. To be clear: V could be I, or it could
be any other 2 × 2 full-rank matrix. This is a special case of
eigendecomposition of the identity matrix. I just multiplied it by
4 so we’d get the same eigenvalues as in the previous example.
This matrix has one distinct eigenvalue and one repeated eigen-
value. We know with certainty that the eigenvector associated
with λ = 6 will be distinct (I encourage you to confirm that a
good integer-valued choice is [3 -1 1]T ); what will happen when
we plug 4 into the shifted matrix?
[1 −1 0; −1 1 0; 1/3 −1/3 0] [v1; v2; v3] = [0; 0; 0]  ⇒  v1 = [1; 1; 0],  v2 = [0; 0; 1]    (15.27)
Again, we can ask the question whether these are the only two
eigenvectors. I’m not referring to scaled versions of these vectors
such as αv2 , I mean whether we could pick equally good eigen-
vectors with different directions from the two listed above.
To understand why this can happen, let’s revisit the proof in the
beginning of this section. We need to modify the first assump-
tion, though: We now assume that the two eigenvalues are the
same, thus λ1 = λ2 . Most of the rest of the proof continues as
already written above. Below is Equation 15.24, re-written for
convenience.
(λ1 − λ2 )β2 v2 = 0
0β2 v2 = 0
What can we now conclude? Not much. The equation is satisfied regardless of the value of β2, so we cannot force β2 to be zero, and the assumption of linear dependence is not contradicted. In other words, repeated eigenvalues do not guarantee linearly independent eigenvectors.
Reflection: Do you really need to worry about repeated eigenvalues for applications of eigendecomposition? Repeated eigenvalues may seem like some weird quirk of abstract mathematics, but real datasets can have eigenvalues that are exactly repeated or statistically indistinguishable. In my own research on multivariate neural time series analysis, I find nearly identical eigenvalues to be infrequent, but common enough that I keep an eye out for them.
If 4ac > b2 in Equation 15.6, then you end up with the square root
of a negative number, which means the eigenvalues will be com-
plex numbers. And complex eigenvalues lead to complex eigen-
vectors.
Complex solutions can also arise from "normal" matrices with real-valued entries. Consider the following matrix, which I generated in MATLAB from random integers (that is, I did not carefully hand-select this matrix or look it up in a secret online website for math teachers who want to torment their students). (Numbers are rounded to nearest tenth.)

[−1 15; −6 4]  ⇒  Λ = [1.5+9.2i  0;  0  1.5−9.2i],   V = [.85  .85;  .1+.5i  .1−.5i]

Av = λv
Āv̄ = λ̄v̄
Av̄ = λ̄v̄    (15.28)

Equation 15.28 follows from the previous line because the matrix is real-valued, thus A = Ā.
Complex-valued solutions in eigendecomposition can be difficult
to work with in applications with real datasets, but there is noth-
ing in principle weird or strange about them.
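A quick numerical look at the example above (a sketch assuming NumPy):

import numpy as np

A = np.array([[-1, 15], [-6, 4]])     # real-valued matrix from the example above
L, V = np.linalg.eig(A)
print(L)                              # a complex-conjugate pair, roughly 1.5 +/- 9.2i
print(np.allclose(A @ V, V * L))      # True: Av = lambda*v still holds, just with complex numbers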
Practice problems Identify the eigenvalues of the following matrices (consult page 423 for the quadratic formula).

a) [2 −3; 1 2]      b) [3 7; −7 3]      c) [a b; −b a]

Answers  a) λ = 2 ± i√3      b) λ = 3 ± 7i      c) λ = a ± ib
15.8 Eigendecomposition of a symmetric matrix
By this point in the book, you know that I’m a HUGE fan of
symmetric matrices. And I’m not the only one—everyone who
works with matrices has a special place in their heart for the
elation that comes with symmetry across the diagonal. In this
section, you are going to learn two additional properties that make
symmetric matrices really great to work with.
But of course we need to prove this rigorously for all symmetric matrices. The goal is to show that the dot product between any pair of eigenvectors is zero. We start from two assumptions: (1) matrix A is symmetric (A = Aᵀ) and (2) λ1 and λ2 are distinct eigenvalues of A (thus λ1 ≠ λ2). v1 and v2 are their corresponding eigenvectors. I'm going to write a series of equalities; make sure you can follow each step from left to right.

λ1 v1ᵀv2 = (λ1 v1)ᵀ v2 = (Av1)ᵀ v2 = v1ᵀAᵀv2 = v1ᵀAv2 = v1ᵀ(λ2 v2) = λ2 v1ᵀv2

The middle steps are actually just way-points; we care about the equality between the first and last terms. I'll write them below, and then set the equation to zero.

λ1 v1ᵀv2 − λ2 v1ᵀv2 = 0

Notice that both terms contain the dot product v1ᵀv2, which can be factored out, bringing us to the crux of the proof:

(λ1 − λ2) v1ᵀv2 = 0

Because λ1 ≠ λ2, the term in parentheses cannot be zero, which means the dot product v1ᵀv2 must be zero: the eigenvectors are orthogonal.
VT V = I (15.35)
VT = V−1 (15.36)
(The superscript H indicates the Hermitian transpose, which means transpose and flip the sign of the imaginary parts.)

1) Av = λv
2) (Av)ᴴ = (λv)ᴴ
3) vᴴA = λᴴvᴴ
4) vᴴAv = λᴴvᴴv
5) λvᴴv = λᴴvᴴv
6) λ = λᴴ
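Both special properties of symmetric matrices (real eigenvalues and orthogonal eigenvectors) are easy to check numerically. A minimal sketch, assuming NumPy and a random symmetric matrix:

import numpy as np

A = np.random.randn(6, 6)
A = (A + A.T) / 2                         # force symmetry
L, V = np.linalg.eig(A)

print(np.max(np.abs(np.imag(L))))         # 0: the eigenvalues are real
print(np.allclose(V.T @ V, np.eye(6)))    # True: V'V = I (orthogonal eigenvectors)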
Final note for this section: Some people use different letters to
indicate the eigendecomposition of a symmetric matrix. In other
texts or lectures you might see the following options.
A = PDP−1
A = UDU−1
Practice problems Each problem gives a matrix and its unscaled eigenvectors. Fill in the missing eigenvector element, then find the eigenvalues using the trace and the determinant.

Answers  a) ∗ = 1, λ = −3, 2      b) ∗ = −4, λ = 20, 60      c) ∗ = 1, λ = 0, 6

A = [1 2; 1 2]  ⇒  |A − λI| = 0  ⇒  |1−λ  2;  1  2−λ| = 0

(1 − λ)(2 − λ) − 2 = 0  ⇒  λ² − 3λ = 0  ⇒  λ(λ − 3) = 0  ⇒  λ1 = 0, λ2 = 3
Lest you think I carefully selected that matrix to give a zero eigenvalue, the next example shows that even in the general case of a matrix where one column is a multiple of another column (thus, a rank-1 matrix), one of the eigenvalues will be zero.

A = [a  σa;  b  σb]  ⇒  |A − λI| = 0  ⇒  |a−λ  σa;  b  σb−λ| = 0

λ(λ − (a + σb)) = 0  ⇒  λ1 = 0, λ2 = a + σb
You can see from this example that the off-diagonal product (−(σa × b)) cancels the constant term from the diagonal product (a × σb); it doesn't matter what values you assign to a, b, and σ. (The same effect happens if you construct one row to be a multiple of the other, which you should try on your own.) The
canceling of the constant term means all terms on the left-hand
side of the characteristic polynomial contain λ, which means 0
will always be a solution.
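Here is that claim in code (a sketch assuming NumPy; the values of a, b, and σ are arbitrary):

import numpy as np

a, b, sigma = 3.0, -2.0, 0.7
A = np.array([[a, sigma*a],
              [b, sigma*b]])         # one column is sigma times the other (rank-1)
print(np.linalg.eigvals(A))          # one eigenvalue is 0; the other is a + sigma*b
print(a + sigma*b)                   # 1.6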
Let’s see how this would work with a 3 × 3 rank-2 matrix (I en-
courage you to work through this problem on your own before
inspecting the equations below).
A = [a d σa; b e σb; c f σc]  ⇒  |a−λ  d  σa;  b  e−λ  σb;  c  f  σc−λ| = 0
(Recall Equation 15.4: (A − λI)v = 0.)
Practice problems Compute the missing eigenvalues of these matrices (hint: remember that the sum of the eigenvalues equals the trace of the matrix).

a) [1 0; 1 0],  λ = 1, ?      b) [3 4; 6 8],  λ = ?, ?      c) [2 −2 −2; 3 −3 −2; 2 −2 −2],  λ = −2, ?, ?
from an eigenvector.
A = VΛVT (15.38)
But actually, this is not exactly Equation 15.21 (page 431). Previ-
ously we right-multiplied by V−1 but here we’re right-multiplying
by VT . Can you guess what that means? It means that recon-
structing a matrix using Equation 15.37 is valid only for symmet-
ric matrices, because V−1 = VT .
A = Σ_{i=1}^{M} v_i λ_i w_iᵀ    (15.40)

(What happens here for λ_i = 0?)
Now we have the outer product between the eigenvector and the corresponding row of the inverse of the eigenvector matrix transposed, which here is printed as the ith column of matrix W. I hope it is clear why Equation 15.37 is a simplification of Equation 15.40: V⁻ᵀ = V for an orthogonal matrix.
A = [2 3; 3 2],   λ1 = −1 with v1 = (1/√2)[−1; 1],   λ2 = 5 with v2 = (1/√2)[1; 1]

A1 = v1 λ1 v1ᵀ = [−.5  .5;  .5  −.5]

A2 = v2 λ2 v2ᵀ = [2.5  2.5;  2.5  2.5]

A1 + A2 = [−.5  .5;  .5  −.5] + [2.5  2.5;  2.5  2.5] = [2  3;  3  2] = A
Practice problems The matrices below are eigenvectors and eigenvalues. Reconstruct the matrix from which these were extracted by summing over eigenlayers. It's safer to apply Equation 15.40 if you don't know for sure that the matrix is symmetric.

a) V = [2 3; −4 2],  Λ = [1 0; 0 3]
b) V = [1 2; 0 1],  Λ = [2 0; 0 3]
c) V = [−2 2 0; 1 1 0; 1 0 1],  Λ = [1 0 0; 0 2 0; 0 0 3]

Answers
a) [2.5  .75;  1  1.5]
b) [2  2;  0  3]
c) [1.5  1  0;  .25  1.5  0;  .5  −1  3]
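The sum over eigenlayers (Equation 15.40) is straightforward to code. A sketch (assuming NumPy) that reconstructs the answer to problem (a), using W = (V⁻¹)ᵀ so that w_iᵀ is the i-th row of V⁻¹:

import numpy as np

V = np.array([[2., 3.], [-4., 2.]])      # eigenvectors (columns) from problem (a)
L = np.array([1., 3.])                   # eigenvalues
W = np.linalg.inv(V).T                   # columns of W are the rows of inv(V)

A = np.zeros((2, 2))
for i in range(2):
    A += L[i] * np.outer(V[:, i], W[:, i])   # one eigenlayer: v_i * lambda_i * w_i^T

print(A)                                 # [[2.5, 0.75], [1., 1.5]], matching answer (a)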
What if you summed only the layers with the largest k < r eigenvalues? That would actually be a low-rank approximation of the original matrix. Or maybe this is a data matrix and you identify certain eigenvectors that reflect noise; you can then reconstruct the data without the "noise layers." More on this in the next few chapters!
You see where this is going: The general form for matrix powers
is:
An v = λn v (15.45)
Practice problems Compute A³ for the following matrices. The eigenvalues are provided, but you'll have to compute the eigenvectors.

a) [1 0; 6 −1],  λ = 1, −1      b) [4 −2; −1 5],  λ = 3, 6      c) [4 0 1; −2 1 0; −2 0 1],  λ = 1, 2, 3

Answers
a) [1  0;  6  −1]      b) [90  −126;  −63  153]      c) [46  0  19;  −38  1  −12;  −38  0  −11]
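Matrix powers via the eigendecomposition (Aⁿ = VΛⁿV⁻¹) give a quick way to check these answers. A sketch (assuming NumPy) for matrix (b):

import numpy as np

A = np.array([[4., -2.], [-1., 5.]])        # matrix (b); eigenvalues 3 and 6
L, V = np.linalg.eig(A)

A3 = V @ np.diag(L**3) @ np.linalg.inv(V)   # A^3 = V * Lambda^3 * inv(V)
print(np.round(A3))                         # [[90, -126], [-63, 153]]
print(np.allclose(A3, np.linalg.matrix_power(A, 3)))   # True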
A⁻¹ = (VDV⁻¹)⁻¹ = (V⁻¹)⁻¹ D⁻¹ V⁻¹ = VD⁻¹V⁻¹    (15.46)

Of course, this procedure is valid only for matrices with all non-zero eigenvalues (that is, all non-zero diagonal elements in D), which unsurprisingly excludes all singular matrices.
You might wonder whether this is really a short-cut, considering
that V still needs to be inverted. There are two computational
advantages of Equation 15.46. One advantage is obviously from
inverting a symmetric matrix (with an orthogonal eigenvectors
matrix), where V−1 = VT . A second advantage is that because
the eigenvectors are normalized, V has a low condition number
and is therefore more numerically stable. Thus, the V of a non-
symmetric matrix might be easier to invert than A. (Condition
number is a measure of the "spread" of a matrix that characterizes
the stability of a matrix to minor perturbations or noise; you’ll
learn more about this quantity in the next chapter. The point is that even if A is theoretically invertible, V may have a numerically more accurate inverse than A.)
Practice problems Each exercise provides a matrix, its eigenvectors, and eigenvalues. Compute A⁻¹ via VD⁻¹V⁻¹. Then confirm that AA⁻¹ = I.

a) A = [1 2; 3 −4],  V = [2 1; 1 −3],  λ = 2, −5      b) A = [3 −2; 1 0],  V = [2 1; 1 1],  λ = 2, 1
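A sketch (assuming NumPy) that checks problem (a) by computing the inverse through the eigendecomposition:

import numpy as np

A = np.array([[1., 2.], [3., -4.]])      # problem (a)
V = np.array([[2., 1.], [1., -3.]])      # eigenvectors as columns
D = np.diag([2., -5.])                   # eigenvalues on the diagonal

Ainv = V @ np.linalg.inv(D) @ np.linalg.inv(V)
print(np.allclose(A @ Ainv, np.eye(2)))  # True: A times its inverse is I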
15.12 Generalized eigendecomposition
One quick glance will reveal that the following two equations are
equivalent.
Av = λv
Av = λIv
Av = λBv (15.47)
AV = BVΛ (15.48)
A = BVΛV−1 (15.50)
Cv = λv , C = B−1 A
This makes generalized eigendecomposition a computational workhorse for several multivariate data science and machine-learning applications, including linear discriminant analysis, source separation, and classifiers.
15.13 Exercises
g) 2, 4, 1 h) a, b, c i) a, b, c
4. You get the same results: Same eigenvalues and same eigen-
vectors. It’s a bit awkward in my opinion, because you have
15.15 Code challenges
15.16 Code solutions
import numpy as np
import matplotlib.pyplot as plt
from scipy.linalg import eig

avediffs = np.zeros(100)

for n in range(1,101):
    A = np.random.randn(n,n)
    B = np.random.randn(n,n)
    l1 = eig(A,B)[0]
    l2 = eig(np.linalg.inv(B)@A)[0]

    # important to sort eigvals
    l1.sort()
    l2.sort()

    avediffs[n-1] = np.mean(np.abs(l1-l2))

plt.plot(avediffs);
Code block 15.6: MATLAB

for n=1:100
    A = randn(n);
    B = randn(n);
    l1 = eig(A,B);
    l2 = eig(inv(B)*A);

    % important to sort evals
    l1 = sort(l1);
    l2 = sort(l2);
    avediffs(n) = mean(abs(l1-l2));
end
plot(avediffs,'s-')
xlabel('Matrix size')
ylabel('\Delta\lambda')
2. Conveniently, the eigenvalues of a diagonal matrix are simply
the diagonal elements, while the eigenvectors matrix is the
identity matrix. This is because (A − λI) is made singular
simply by setting λ to each diagonal element.
3. The eigenvectors matrix looks cute, like a pixelated flower from
a 1970’s computer. Plotting the eigenvectors reveals their re-
markable property—they’re sine waves! (Figure is shown be-
low the code.) If you are familiar with signal processing, then
this might look familiar from the Fourier transform. In fact,
there is a deep connection between the Hankel matrix and the
Fourier transform. By the way, you can run this code on a
Toeplitz matrix for comparison (spoiler alert: Not nearly as
cool!).
Code block 15.10: MATLAB

N = 50;
T = toeplitz(1:N);
H = hankel(1:N,[N 1:N-1]);

[V,D] = eig(H);
[~,sidx] = sort(diag(D),'descend');
V = V(:,sidx);

subplot(221) % the matrix
imagesc(H), axis square
subplot(222) % all eigenvectors
imagesc(V), axis square
subplot(212) % a few evecs
plot(V(:,1:4),'o-')
CHAPTER 16
Singular value decomposition
A = UΣVT (16.1)
A: The M×N matrix to be decomposed. It can be square or rectangular, and any rank.
As you can see in the descriptions above, the sizes of the right-
hand-side matrices depend on the size of A. Figure 16.1 shows
graphical representations of the SVD for square, tall, and wide matrices.
Let’s see what the SVD looks like for a real matrix. Figure 16.2
shows a matrix (created by applying a 2D smoothing kernel to
random noise; the grayscale intensity at each pixel in the image is
mapped onto the numerical value at that element of the matrix)
and its SVD.
A      =      U        Σ        Vᵀ
(M×N)      (M×M)   (M×N)   (N×N)
AᵀA = VΣ²Vᵀ    (16.4)
You can immediately see why the singular values are non-negative—
any real number squared will be non-negative. This doesn’t show
immediately why the singular values are real-valued, though; that
comes from the proof that the eigenvalues of a symmetric matrix
are real-valued (page 444).
We're missing the U matrix, but I'm sure you've already figured
it out: Take the eigendecomposition of matrix AAT :
AAᵀ = UΣ²Uᵀ    (16.7)
So there you have it: the way to compute the SVD of any rectangular matrix is to eigendecompose AᵀA to obtain V and Σ² (and therefore Σ), and then obtain U from Equation 16.8 (or eigendecompose AAᵀ to obtain U directly):

AVΣ⁻¹ = U    (16.8)

Σ⁻¹UᵀA = Vᵀ    (16.9)
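Here is that recipe in code, as a minimal sketch assuming NumPy; the matrix is random with full column rank, so all singular values are non-zero and Equation 16.8 can be applied directly:

import numpy as np

A = np.random.randn(4, 3)                  # a tall matrix

# V and the squared singular values come from the eigendecomposition of A'A
evals, V = np.linalg.eigh(A.T @ A)
sidx = np.argsort(evals)[::-1]             # sort descending
s = np.sqrt(evals[sidx])                   # singular values
V = V[:, sidx]

U = A @ V / s                              # Equation 16.8 (economy-size U)

print(np.allclose((U * s) @ V.T, A))       # reconstruction matches A
print(np.allclose(np.sort(s), np.sort(np.linalg.svd(A, compute_uv=False))))  # same singular values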
You can see the effects of not normalizing in the exercises below. It will take you a while to work through these two exercises by hand, but I believe it's worth the effort and will help you understand the SVD.
Practice problems Perform the following steps on each matrix below: (1) Compute the SVD by hand. Construct the singular vectors to be integer matrices for ease of visual inspection. (2) Using your non-normalized singular vectors, compute UΣVᵀ to confirm that it does not equal the original matrix. (3) Normalize each singular vector (each column of U and row of Vᵀ). (4) Re-compute UΣVᵀ and confirm that it exactly equals the original matrix.

a) [1 1 0; 0 1 1]      b) [1 −1; 0 2; 1 1]

Answers Matrices are ordered as U, Σ, and Vᵀ. Top rows show integer vectors; bottom rows show normalized versions.

a) U = [1 −1; 1 1],  Σ = [√3 0 0; 0 1 0],  Vᵀ = [1 2 1; 1 0 −1; 1 −1 1];
product = [√3−1  2√3  √3+1;  √3+1  2√3  √3−1]

U = (1/√2)[−1 1; −1 −1],  Σ = [√3 0 0; 0 1 0],  Vᵀ = [−1/√6 −2/√6 −1/√6; 1/√2 0 −1/√2; 1/√3 −1/√3 1/√3];
product = [1 1 0; 0 1 1]

b) U = [1 1 −1; −2 0 −1; −1 1 1],  Σ = [√6 0; 0 √2; 0 0],  Vᵀ = [0 1; 1 0];
product = [√2 √6; 0 −2√6; √2 −√6]

U = [1/√6 −1/√2 −1/√3; −2/√6 0 −1/√3; −1/√6 −1/√2 1/√3],  Σ = [√6 0; 0 √2; 0 0],  Vᵀ = [0 −1; −1 0];
product = [1 −1; 0 2; 1 1]
Why are there three λs but only two σs? It’s because AT A is 3×3
but Σ has the same size as the matrix A, which is 2 × 3; hence,
the diagonal has only two elements. But the non-zero λ’s equal
the squared σ’s.
A = [3 1 0; 1 1 0; 1 1 1]
Code block 16.1: Python

import numpy as np
A = [[1,1,0],[0,1,1]]
U,s,V = np.linalg.svd(A)   # note: the returned V is actually V^T
A = UΣUT , if A = AT (16.10)
Proving that this is the case simply involves writing out the SVD
of the matrix and its transpose:
A = UΣVT (16.11)
And this proves that U = V, which means that the left and right
singular vectors are the same.
Notice that this is not necessarily the same thing as "Case 2" of
the relationship between singular values and eigenvalues, because
not all symmetric matrices can be expressed as some other matrix
times its transpose.
16.5 SVD and the four subspaces
But let me start by talking more about Σ—the matrix that con-
tains the singular values on the diagonal and zeros everywhere
else. Below is an example of a Σ matrix for a 5 × 3, rank-2 ma-
trix.
Σ = [σ1 0 0; 0 σ2 0; 0 0 0; 0 0 0; 0 0 0]
By construction, σ1 > σ2 . Indeed, SVD algorithms always sort
the singular values descending from top-left to lower-right. Thus,
zero-valued singular values will be in the lower-right of the diag-
onal. σ3 = 0 because this is a rank-2 matrix. You’ll learn below
that any zero-valued singular values correspond to the null space
of the matrix. Therefore, the number of non-zero singular values
corresponds to the dimensionality of the row and column spaces, which is the rank of the matrix.
Figure 16.3 shows the "big picture" of the SVD and the four sub-
spaces. There is a lot going on in that figure, so let’s go through
each piece in turn. The overall picture is a visualization of Equa-
tion 16.1; as a quick review: the matrix gets decomposed into
three matrices: U provides an orthogonal basis for RM and con-
tains the left singular vectors; Σ is the diagonal matrix of singular
values (all non-negative, real-valued); and V provides an orthogo-
nal basis for RN (remember that the SVD uses VT , meaning that
the rows are the singular vectors, not the columns).
Figure 16.3: Visualization of how the SVD reveals basis sets for the four subspaces of a rank-r matrix. (orth. = orthogonal matrix; diag. = diagonal matrix; 1:r = columns (or rows) 1 through r.)

This figure additionally shows how the columns of U are organized
into basis vectors for the column space (light gray) and left-null
space (darker gray); and how the rows of VT are organized into
basis vectors for the row space (light gray) and null space (darker
gray). In particular, the first r columns in U, and the first r rows
in VT , are the bases for the column and row spaces of A. The
columns and rows after r get multiplied by the zero-valued singu-
lar values, and thus form bases for the null spaces. The singular
vectors for the column and row spaces are sorted according to
their "importance" to the matrix A, as indicated by the relative
magnitude of the corresponding singular values.
You can see that the SVD reveals a lot of important information
about the matrix. The points below are implicit in the visualiza-
tion, and written explicitly in the interest of clarity:
• The rank of A, and thus the dimensionality of the column and row spaces, equals the number of non-zero singular values.
• The dimensionality of the left-null space is the number of
columns in U from r + 1 to M .
• The dimensionality of the null space is the number of rows
in VT from r + 1 to N .
Figure 16.3 also nicely captures the fact that the column space and
left-null space are orthogonal: If each column in U is orthogonal
to each other column, then any subset of columns is orthogonal
to any other non-overlapping subset of columns. Together, all of
these columns span all of RM , which means the rank of U is M ,
even if the rank of A is r < M .
The story for VT is the same, except we deal with rows instead of
columns (or, if you prefer, the columns of V) and the row space
and null space of A. So, the first r rows provide an orthonormal
basis for the row space of A, and the rest of the rows, which get
multiplied by the zero-valued singular values, are a basis for the
null space of A.
Practice problems The following triplets of matrices are UΣVᵀ that were computed from a matrix A that is not shown. From visual inspection, determine the size and rank of A, and identify the basis vectors for the four spaces of A. (Note: I re-scaled U and V to integers.)

a) U = [−1 −1; −1 1],  Σ = [4.89 0 0; 0 2 0],  Vᵀ = [−1 −1 −1; 1 0 −1; 1 −2 1]

b) U = [−1 0 −3 0; 0 1 0 0; −3 0 1 0; 0 0 0 1],  Σ = [π 0 0; 0 2 0; 0 0 0; 0 0 0],  Vᵀ = [0 1 0; −1 0 0; 0 0 1]
Don’t expect to understand everything about the SVD
just by staring at that figure. You’ll gain more famil-
iarity and intuition about the SVD by working with it,
which is the goal of the rest of this chapter!
16.6 SVD and matrix rank
Let’s start by rewriting the SVD using one pair of singular vectors
and their corresponding singular value. This is analogous to the
single-vector eigenvalue equation.
Av = uσ (16.14)
Notice that replacing the vectors with matrices and then right-
multiplying by VT gives the SVD matrix equation that you’re
now familiar with.
Now let me remind you of the definition of the column space and left-null space: The column space comprises all vectors that can be expressed as some combination of the columns of A, whereas the left-null space comprises all non-trivial combinations of the rows of A that produce the zeros vector. In other words:

C(A) : Ax = b    (16.15)

N(Aᵀ) : Aᵀy = 0    (16.16)
Now we can think about Equation 16.14 in this context: all singu-
lar vectors are non-zeros, and thus the right-hand side of Equation
16.14 must be non-zero—and thus in the column space of A—if
σ is non-zero. Likewise, the only possible way for the right-hand
side of Equation 16.14 to be the zeros vector—and thus in the
left-null space of A—is for σ to equal zero.
You can make the same argument for the row space, by starting
from the equation uT A = σvT .
A is the product of the three SVD matrices; therefore, the rank
of A is constrained by the ranks of those matrices. U and V are
by definition full-rank. Σ is of size M ×N but could have a rank
smaller than M or N . Thus, the maximum possible rank of A
is the rank of Σ. However, the rank of A could not be smaller
than the rank of Σ, because the ranks of A and UΣVT are equal.
Therefore, the rank of A must equal the smallest rank of the three SVD matrices, which is always the rank of Σ. And as a diagonal matrix, the
rank of Σ is the number of non-zero diagonal elements, which is
the number of non-zero singular values.
"Effective" rank You’ve read many times in this book that com-
puters have difficulties with really small and really large numbers.
These are called rounding errors, precision errors, underflow, over-
flow, etc. How does a computer decide whether a singular value
is small but non-zero vs. zero with rounding error?
The MATLAB source code for computing rank looks like this:
s = svd(A);
r = sum(s > tol);
In other words, retrieve the singular values, and count the number
of those values that are above a tolerance (variable tol). So the
question is, how to set that tolerance? If it’s too small, then the
rank will be over-estimated; if it’s too large, then the rank will
be under-estimated. MATLAB's solution is to set the tolerance dynamically, based on the size and elements in the matrix:

tol = max(size(A)) * eps(norm(A));
The function eps returns the distance between a number and the
next-larger number that your computer is capable of representing.
For example, if your computer could only represent integers, then
eps=1. (And you probably need to buy a new computer...)
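The same tolerance idea works in Python. A sketch assuming NumPy, with a rank-3 matrix constructed on purpose:

import numpy as np

A = np.random.randn(5, 3) @ np.random.randn(3, 5)        # 5x5 matrix with rank 3
s = np.linalg.svd(A, compute_uv=False)

tol = max(A.shape) * np.finfo(A.dtype).eps * np.max(s)   # size- and scale-dependent tolerance
r = np.sum(s > tol)
print(r, np.linalg.matrix_rank(A))                       # both report 3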
With matrix multiplication via layers, the two vectors that multi-
ply to create each "layer" of the product matrix are defined purely
by their physical position in the matrix. Is that really the best way
to define basis vectors to create each layer? Probably not: The
position of a row or column in a matrix may not be the organizing
principle of "importance"; indeed, in many matrices—particularly
matrices that contain data—rows or columns can be swapped or
even randomized without any change in the information content
of the data matrix.
A = Σ_{i=1}^{r} u_i σ_i v_iᵀ    (16.17)

where r is the rank of the matrix (the singular values after σr are zeros, and thus can be omitted from this equation). Your delicate linear-algebraic sensibilities might be offended by going from the elegant matrix Equation 16.1 to the clumsy vector-sum Equation 16.17. But this equation will set us up for the SVD spectral theory, and will also lead into one of the important applications of the SVD, which is low-rank approximations (next section).
Let’s consider only the first iteration of the summation:
A1 = u1 σ1 v1T (16.18)
Low-rank approximation

Ã = Σ_{i=1}^{k} u_i σ_i v_iᵀ,   k < r    (16.19)
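In code, a low-rank approximation is just a truncated matrix multiplication. A sketch assuming NumPy, with an arbitrary matrix and k:

import numpy as np

A = np.random.randn(8, 5)
U, s, Vt = np.linalg.svd(A)

k = 2                                            # keep only the first k layers
Ak = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]       # Equation 16.19

print(np.linalg.matrix_rank(Ak))                 # 2
print(np.linalg.norm(A - Ak))                    # error carried by the discarded layers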
16.9 Normalizing singular values
Imagine you have two different matrices, and find that the largest
singular value of matrix A is 8 and the largest singular value of
matrix B is 678. How would you interpret that difference? Are
those two numbers comparable? And what does the number "8"
even mean??
The answer is that those two σmax values are probably not compa-
rable, unless you know that the matrices have numbers in exactly
the same range.

A = [8 4; 10 …],   Σ_A = [15.867 0; 0 …]
Because the singular vectors are all unit-length and are scaled by the singular values when reconstructing the matrix from its SVD layers, the sum over all singular values can be interpreted as the total variance or "energy" in the matrix. This sum over all singular values is formally called the Schatten 1-norm of the matrix. (See also section 6.10 about matrix norms.) For completeness, the equation below is the full formula for the Schatten p-norm:

‖A‖_p = ( Σ_{i=1}^{r} σ_i^p )^{1/p}    (16.20)
The next step is to scale each singular value to the percent of the Schatten 1-norm:

σ̃_i = 100 σ_i / ‖A‖_1    (16.21)
The matrices with largest singular values of 8 and 678 are not com-
parable, but let’s say their normalized largest singular values are
35% and 82%. In this case, 35% means that the largest SVD-layer
accounts for only 35% of the total variance in the matrix. One
interpretation is that this is a complicated matrix that contains a
lot of information along several different directions. In contrast,
a largest normalized singular value of 82% means nearly all of the
variance is explained by one component, so this matrix probably
contains less complex information. If this were a data matrix, it
might correspond to one pattern and 18% noise.
Now let’s think back to the question of how many SVD-layers to
use in a low-rank approximation (that is, how to select k). When
the singular values are normalized, you can pick some percent
variance threshold and retain all SVD-layers that contribute at
least that much variance. For example, you might keep all SVD-
layers with σ > 1%, or perhaps 0.1% to retain more information.
The choice of a threshold is still somewhat arbitrary, but this is at
least more quantitative and reproducible than visual inspection of
the scree plot.
Code You can compute the condition number on your own based
on the SVD, but Python and MATLAB have built-in functions.
Code block 16.3: Python

import numpy as np
A = np.random.randn(5,5)
s = np.linalg.svd(A)[1]
condnum = np.max(s)/np.min(s)
# compare above with cond()
print(condnum, np.linalg.cond(A))
A⁻¹ = VΣ⁻¹Uᵀ    (16.25)
Expressing the inverse via the SVD may seem like an academic
exercise, but this is a crucial introduction to the pseudoinverse,
as you will now learn.
A† = (UΣVᵀ)†
   = VΣ†Uᵀ    (16.28)

Σ†_{i,i} = 1/σ_i  if σ_i ≠ 0,   Σ†_{i,i} = 0  if σ_i = 0
Notice that this procedure will work for any matrix, square or
rectangular, full-rank or singular. When the matrix is square
and full-rank, then the Moore-Penrose pseudoinverse will equal
the true inverse, which you can see yourself by considering what
happens when all of the singular values are non-zero.
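A sketch of this procedure (assuming NumPy), compared against the built-in pinv; the rank-deficient rectangular matrix is constructed on purpose:

import numpy as np

A = np.random.randn(5, 3) @ np.random.randn(3, 4)    # 5x4 matrix of rank 3
U, s, Vt = np.linalg.svd(A)

tol = max(A.shape) * np.finfo(float).eps * s.max()
s_inv = np.where(s > tol, 1/s, 0)                    # invert only the non-zero singular values

Apinv = Vt.T @ np.diag(s_inv) @ U[:, :len(s)].T      # V * Sigma^+ * U'
print(np.allclose(Apinv, np.linalg.pinv(A)))         # True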
T = UΣVᵀ    (16.29)

(TᵀT)⁻¹Tᵀ = ((UΣVᵀ)ᵀ UΣVᵀ)⁻¹ (UΣVᵀ)ᵀ    (16.30)
          = (VΣᵀUᵀUΣVᵀ)⁻¹ VΣᵀUᵀ    (16.31)
          = (VΣ²Vᵀ)⁻¹ VΣᵀUᵀ    (16.32)
          = VΣ⁻²Vᵀ VΣᵀUᵀ    (16.33)
          = VΣ⁻¹Uᵀ    (16.34)
Needless to say, the conclusion is the same for the right inverse,
which you can work through on your own.
16.13 Code challenges
5. This and the next two challenges involve taking the SVD of
a picture. A picture is represented as a matrix, with the ma-
trix values corresponding to grayscale intensities of the pixels.
We will use a picture of Einstein. You can download the file at
https://upload.wikimedia.org/wikipedia/en/8/86/Einstein_tongue.jpg
Of course, you can replace this with any other picture—a selfie
of you, your dog, your kids, your grandmother on her wedding
day... However, you may need to apply some image processing
to reduce the image matrix from 3D to 2D (thus, grayscale in-
stead of RGB) and the datatype must be double (MATLAB)
or floats (Python).
10. This challenge follows up on the first code challenge from the
previous chapter (about generalized eigendecomposition im-
plemented as two matrices vs. the product of one matrix and
the other’s inverse). The goal is to repeat the exploration of
differences between eig(A,B) and eig(inv(B)*A). Use only
10×10 matrices, but now vary the condition number of the random matrices between 10^1 and 10^10. Do you come to different
conclusions from the previous chapter?
11. This isn’t a specific code challenge, but instead a general sug-
gestion: Take any claim or proof I made in this chapter (or any
other chapter), and demonstrate that concept using numerical
examples in code. Doing so (1) helps build intuition, (2) im-
proves your skills at translating math into code, and (3) gives
you opportunities to continue exploring other linear algebra
principles (I can’t cover everything in one book!).
16.14 Code solutions
2. Hint: You need to sort the columns of U and V based on
descending eigenvalues. You can check your results by sub-
tracting the eigenvectors and the matching singular vectors
matrices. Due to sign-indeterminacy, you will likely find a few
columns of zeros and a few columns of non-zeros; comparing
against −U will flip which columns are zeros and which are
non-zeros. Don’t forget that Python returns VT .
% create Sigma
S = zeros(size(A));
for i=1:length(L2)
    S(i,i) = sqrt(L2(i));
end
[U2,S2,V2] = svd(A); % svd
3. I changed the code slightly from the Figure to include the orig-
inal matrix to the right of the reconstructed matrix. Anyway,
the important part is creating the low-rank approximations in
a for-loop. Be very careful with the slicing in Python!
Code block 16.10: MATLAB

A = randn(5,3);
[U,S,V] = svd(A);

for i=1:3
    subplot(2,4,i)
    onelayer = U(:,i)*S(i,i)*V(:,i)';
    imagesc(onelayer)
    title(sprintf('Layer %g',i))

    subplot(2,4,i+4)
    lowrank = U(:,1:i)*S(1:i,1:i)*V(:,1:i)';
    imagesc(lowrank)
    title(sprintf('Layers 1:%g',i))
end
subplot(248)
imagesc(A), title('Original A')
4. There are two insights to this challenge. First, U and V must
be orthogonal, which you can obtain via QR decomposition on
random matrices. Second, you only need to specify the target
singular value; the smallest can be 1 and the rest can be anything
in between (for simplicity, I’ve made them linearly spaced from
smallest to largest).
for i=1:min(m,n)
    S(i,i) = s(i);
end
A = U*S*V'; % construct matrix
cond(A) % confirm!
5. You might have struggled a bit with transforming the image,
but hopefully the SVD-related code wasn’t too difficult. My
code below reconstructs the image using components 1-20, but
you can also try, e.g., 21-40, etc.
Code block 16.14: MATLAB

pic = imread('https://upload.wikimedia.org/wikipedia/en/8/86/Einstein_tongue.jpg');
pic = double(pic); % convert to double

[U,S,V] = svd(pic);
comps = 1:20; % low-rank approximation
lowrank = U(:,comps) * ...
    S(comps,comps)*V(:,comps)';

% show the original and low-rank
subplot(121)
imagesc(pic), axis image
title('Original')
subplot(122)
imagesc(lowrank), axis image
title(sprintf('Comps. %g-%g', ...
    comps(1),comps(end)))
colormap gray
6. The low-rank calculations and plotting are basically the same
as the previous exercise. The main additions here are comput-
ing percent variance explained and thresholding. It’s a good
idea to check that all of the normalized singular values sum to
100.
Code block 16.16: MATLAB

% convert to percent explained
s = 100*diag(S)./sum(S(:));
plot(s,'s-'), xlim([0 100])
xlabel('Component number')
ylabel('Pct variance explained')

thresh = 4; % threshold in percent
comps = s>thresh; % comps greater than X%
lowrank = U(:,comps) * ...
    S(comps,comps)*V(:,comps)';

% show the original and low-rank
figure, subplot(121)
imagesc(pic), axis image
title('Original')
subplot(122)
imagesc(lowrank), axis image
title(sprintf('%g comps with > %g%%', ...
    sum(comps),thresh))
colormap gray
7. The RMS error plot goes down when you include more compo-
nents. That’s sensible. The scale of the data is pixel intensity
errors, with pixel values ranging from 0 to 255. However, each
number in the plot is the average over the entire picture, and
therefore obscures local regions of high- vs. low-errors. You
can visualize the error map (variable diffimg).
ylabel('Error (a.u.)')
8. This code challenge illustrates that translating formulas into
code is not always straightforward. I hope you enjoyed it!
9. The pseudoinverse of a column of constants is a row vector where each element is 1/(kn), where k is the constant and n is the dimensionality. The reason is that the vector times its pseudoinverse is actually just a dot product: summing k × 1/(nk) over the n elements yields 1, so 1/(nk) is the correct value to produce an inverse. (I'm not sure if this has any practical value, but I hope it helps you think about the pseudoinverse.)
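You can confirm this in a couple of lines (a sketch assuming NumPy; k and n are arbitrary):

import numpy as np

k, n = 3.0, 5
v = np.full((n, 1), k)              # column of constants
print(np.linalg.pinv(v))            # each element is 1/(k*n) = 1/15
print(np.linalg.pinv(v) @ v)        # [[1.]]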
10. The differences between the two approaches are much more apparent. The issue is that high-conditioned matrices are more unstable, and thus so are their inverses. In practical applications, B might be singular, and so an eigendecomposition on B⁻¹A is usually not a good idea.
# GEDs and sort
l1 = eig(A,B)[0]
l2 = eig(np.linalg.inv(B)@A)[0]
l1.sort()
l2.sort()

avediffs[condi] = np.mean(np.abs(l1-l2))

plt.plot(cns,avediffs);
Code block 16.24: MATLAB

M = 10; % matrix size
cns = linspace(10,1e10,30);

% loop over condition numbers
for condi=1:length(cns)

    % create A
    [U,~] = qr(randn(M));
    [V,~] = qr(randn(M));
    S = diag(linspace(cns(condi),1,M));
    A = U*S*V'; % construct matrix

    % create B
    [U,~] = qr(randn(M));
    [V,~] = qr(randn(M));
    S = diag(linspace(cns(condi),1,M));
    B = U*S*V'; % construct matrix

    % eigenvalues and sort
    l1 = eig(A,B);
    l2 = eig(inv(B)*A);
    l1 = sort(l1);
    l2 = sort(l2);

    % store the differences
    avediffs(condi) = mean(abs(l1-l2));
end

plot(cns,avediffs)
xlabel('Cond. number')
ylabel('\Delta\lambda')
CHAPTER 17
Quadratic form and definiteness
vᵀAv = ζ    (17.1)

v1ᵀ A v1 = [3 −1] [2 4; 0 3] [3; −1] = 9    (17.2)

v2ᵀ A v2 = [2 1] [2 4; 0 3] [2; 1] = 19    (17.3)
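In code, the quadratic form is just two matrix-vector multiplications (a sketch assuming NumPy, using the matrix and two vectors above):

import numpy as np

A = np.array([[2, 4], [0, 3]])
v1 = np.array([3, -1])
v2 = np.array([2, 1])

print(v1 @ A @ v1)    # 9, matching Equation 17.2
print(v2 @ A @ v2)    # 19, matching Equation 17.3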
Now imagine that the matrix elements are fixed scalars and x and y are continuous variables, as if this were a function of two variables:

f(A, v_i) = [x_i  y_i] [a b; c d] [x_i; y_i] = a x_i² + (b + c) x_i y_i + d y_i² = ζ_i    (17.7)
The interpretation is that for each matrix A, we can vary the
vector elements xi and yi and obtain a scalar value for each (xi , yi )
pair.
= [ax + fy + dz    fx + by + ez    dx + ey + cz] [x; y; z]    (17.9)

Yeah, that's quite a lot to look at. Still, you see the squared terms and the cross-terms, with coefficients defined by the elements in the matrix, and the diagonal matrix elements paired with their corresponding squared vector elements.
I’m sure you’re super-curious to see how it looks for a 4×4 matrix.
It’s written out below. The principle is the same: diagonal matrix
elements are coefficients on the squared vector elements, and the
off-diagonals are coefficients on the cross-terms. Just don’t expect
me to be patient enough to keep this going for larger matrices...
[w x y z] [a b c d; b e f g; c f h i; d g i j] [w; x; y; z] =
aw² + ex² + hy² + jz² + 2bwx + 2cwy + 2dwz + 2fxy + 2gxz + 2iyz
With the identity matrix in the middle, the quadratic form vᵀIv = vᵀv is just the magnitude-squared of the vector. Thus, putting a different matrix in between the vectors is like using this matrix to modulate, or scale, the magnitude of this vector. In fact, this is the mechanism for measuring distance in several non-Euclidean geometries.
17.2 Geometric perspective
Let’s return to the idea that the quadratic form is a function that
takes a matrix and a vector as inputs, and produces a scalar as
an output:
f (A, v) = vT Av = ζ (17.14)
Now let’s think about applying this function over and over again,
for the same matrix and different elements in vector v. In fact, we
can think about the vector as a coordinate space where the axes
are defined by the vector elements. This can be conceptualized
in R2 , which is illustrated in Figure 17.1. In fact, the graph of
f (A, v) for v ∈ R2 is a 3D graph, because the two elements of v
provide a 2D coordinate space, and the function value is mapped
onto the height above (or below) that plane.
Thus, the 2D plane defined by the v1 and v2 axes is the coordinate system; each location on that plane corresponds to a unique combination of elements in the vector, that is, when setting v = [v1, v2]. The z-axis is the function result (ζ). The vertical dashed gray lines leading to the gray dots indicate the value of ζ for two particular v's.

Figure 17.1: A visualization of the quadratic form result of a matrix at two specific coordinates.
Once we have this visualization, the next step is to evaluate
f (A, v) for many different possible values of v (of course, using
the same A). If we keep using stick-and-button lines like in Fig-
ure 17.1, the plot will be impossible to interpret. So let's switch to a surface plot (Figure 17.2).
That graph is called the quadratic form surface, and it’s like an
energy landscape: The matrix has a different amount of "energy"
at different points in the coordinate system (that is, plugging in
different values into v), and this surface allows us to visualize that
energy landscape.
Let's make sure this is concrete with the matrix that I used to create Figure 17.2.

Figure 17.2: A better visualization of the quadratic form surface of a matrix.
The coordinates for this quadratic form surface are not bound by
±2; the plane goes to infinity in all directions, but it’s trimmed
here because I cannot afford to sell this book with one infinitely
large page. Fortunately, though, the characteristics of the surface
you see here don’t change as the axes grow; for this particular ma-
trix, two directions of the quadratic form surface will continue to
grow to infinity (away from the origin), while two other directions
will continue at 0.
This surface is the result for one specific matrix; different matrices (with the same vector v) will have different surfaces. Figure 17.3 shows examples of quadratic form surfaces for four different matrices. Notice the three possibilities of the quadratic form surface: the quadratic form can bend up to positive infinity, bend down to negative infinity, or stay at zero.

Figure 17.3: Examples of quadratic form surfaces for four different matrices. The v1, v2 axes are the same in all subplots, and the f(A, v) = ζ axis is adapted to each matrix.
That said, one thing that all quadratic form surfaces have (for all
matrices) is that they equal zero at the origin of the graph, corre-
sponding to v = [0 0]. That’s obvious algebraically—the matrix
is both pre- and post-multiplied by all zeros—but geometrically,
it means that we are interested in the shape of the matrix relative
to the origin.
On the quadratic form surface, any direction away from the origin
goes in one of three directions: down to negative infinity, up to
positive infinity, or to zero. However, the surface doesn’t neces-
sarily go to infinity equally fast in all directions; some directions
have steeper slopes than others. However, if a quadratic form goes up, it will eventually get to infinity. (Philosophical side-note: What does it mean to "eventually" get to ∞, and how long does that take?)

That's because the value of the function at each [v1, v2] coordinate is not only a function of the matrix; it is also a function of the vector v. As the vector elements grow, so does the quadratic form.
That statement may feel intuitive, but we’re going to take the
time to prove it rigorously. This is important because the proof
will help us discover how to normalize the quadratic form, which
becomes the mathematical foundation of principal components
analysis.
The proof involves (1) expressing the quadratic form as a vector dot product, (2) applying the Cauchy-Schwarz inequality to that dot product, and then (3) applying the Cauchy-Schwarz inequality again to the matrix-vector product. (The Cauchy-Schwarz inequality for the dot product was on page 49 if you need a reminder.)
To start with, think about vT Av as the dot product between two
vectors: (v)T (Av). Then apply the Cauchy-Schwarz inequality:
You can call this a feature or you can call it a bug. Either way,
it impedes using the quadratic form in statistics and in machine
learning. We need something like the quadratic form that reveals
important directions in the matrix space independent of the magnitude of the vector.
This means we are looking for the argument (here v) that maxi-
mizes the expression.
f(A, v) = ( ax² + (b + c)xy + dy² ) / ( x² + y² )    (17.19)

f_norm(A, v) = vᵀAv / vᵀv    (17.20)
|vᵀAv| / (vᵀv) ≤ ‖A‖_F    (17.22)
Now we see that the magnitude of the normalized quadratic form
is bounded by the magnitude of the matrix, and does not depend
on the vector that provides the coordinate space.
Have you noticed the failure scenario yet? If you don’t already
know the answer, I think you can figure it out from looking again
at Equation 17.20. The failure happens when v = 0, in other
words, with the zeros vector.
In section 13.1 I introduced the concept of "mapping over
magnitude." The normalized quadratic form can be con-
ceptualized in the same way: It’s a mapping of a matrix
onto a vector coordinate space, over the magnitude of
that coordinate space.
17.4 Eigenvectors and quadratic form surfaces
but the surrounding space curves up in some directions and curves down in other directions. An analogy is that the absolute ridge and valley are like the global maximum and minimum, and the intermediate saddle points are like local maxima and minima. (All analogies break down at some point, so you should never think too deeply about them.)

For the 3D surfaces in these figures, you can identify the ridges and valleys simply by visual inspection. But that's obviously not very precise, nor is it scalable. It turns out that the eigenvectors of a symmetric matrix point along the ridges and valleys of the normalized quadratic form surface. This applies only to symmetric matrices; the eigenvectors of a non-symmetric matrix do not necessarily point along these important directions. This is another one of the special properties of symmetric matrices that make all other matrices fume with jealousy. (Important information hidden inside a long paragraph. Terrible textbook writing style!)
Figure 17.5: The bird's-eye-view of the quadratic form surface of a symmetric matrix, with the eigenvectors plotted on top. The matrix and its eigenvectors (W, rounded to the nearest tenth) are printed on top. Colorbar indicates the value of ζ. Notice the small missing box at the center; this was an NaN value corresponding to v = 0.
Why is this the case? Why do the eigenvectors point along the
directions of maximal and minimal "energy" in the matrix? The
short answer is that the vectors that maximize the quadratic form
(Equation 17.20) turn out to be the eigenvectors of the matrix.
A deeper discussion of why that happens is literally the mathe-
matical basis of principal components analysis, and so I will go
through the math in that chapter.
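You can preview that result numerically: for a symmetric matrix, the normalized quadratic form evaluated at random vectors never exceeds the largest eigenvalue and never drops below the smallest one, and it hits those bounds exactly at the eigenvectors. A sketch assuming NumPy and a random symmetric matrix:

import numpy as np

A = np.random.randn(4, 4)
A = (A + A.T) / 2                        # symmetric matrix
L, V = np.linalg.eigh(A)                 # eigenvalues sorted ascending

def nqf(v):                              # normalized quadratic form
    return (v @ A @ v) / (v @ v)

samples = [nqf(np.random.randn(4)) for _ in range(10000)]
print(min(samples) >= L[0], max(samples) <= L[-1])   # True True
print(nqf(V[:, -1]), L[-1])                          # identical: the top eigenvector gives lambda_max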
Reflection: One of the most important applications of the normalized quadratic form is principal components analysis (PCA), which is a dimension-reduction and data-exploration method used in multivariate statistics, signal processing, and machine learning. The math of PCA is pretty simple: Compute a covariance matrix by multiplying the data matrix by its transpose, eigendecompose that covariance matrix, and then multiply the data by the eigenvectors to obtain the principal components. More on this in Chapter 19!
they have at least one zero-valued eigenvalue). The invertibility of indefinite matrices is unpredictable; it depends on the numbers in the matrix. Finally, matrix definiteness is exclusive: a matrix cannot have more than one definiteness label (except for the always-bizarre zeros matrix; see the Reflection box). (Indefinite matrices are sometimes called indeterminate.)
With this list in mind, refer back to Figure 17.3 and see if you
can determine the definiteness of each matrix based on visual
inspection and the associated descriptions.
¹ The answer is No, because all singular values are non-negative.
Reflection: When it comes to definiteness, no matrix is weirder than the zeros matrix. Its quadratic form surface is flat and zero everywhere, and it actually belongs to two definiteness categories: positive semidefinite and negative semidefinite. I'm not sure if that will help you live a richer, healthier, and more meaningful life, but I thought I'd mention it just in case.
= ‖Aw‖²

17.7 λ and definiteness

A few pages ago, I wrote that the best way to determine the definiteness of a matrix is from the signs of its eigenvalues:
Av = λv    (17.24)
vᵀAv = vᵀλv    (17.25)
vᵀAv = λvᵀv    (17.26)
vᵀAv = λ‖v‖²    (17.27)
But wait a minute: the quadratic form is defined for all vectors,
not only for eigenvectors. You might be tempted to say that
our conclusion (positive definite means all positive eigenvalues) is
valid only for eigenvectors.
w = v1 + v2    (17.31)
β = (λ1 + λ2)    (17.32)
wᵀAw = β‖w‖²    (17.33)
The other categories I trust you see the pattern here. Proving
the relationship between the signs of the eigenvalues and the def-
initeness of a matrix does not require additional math; it simply
requires changing your assumption about A and re-examining the
above equations. As math textbook authors love to write: The
proof is left as an exercise to the reader.
17.8 Code challenges
17.9 Code solutions
ax.set_zlabel('$\zeta$')
plt.show()
Code block 17.4: MATLAB

A = [1 2; 2 3]; % matrix
vi = -2:.1:2; % vector elements
quadform = zeros(length(vi));

for i=1:length(vi)
    for j=1:length(vi)
        v = [vi(i) vi(j)]'; % vector
        quadform(i,j) = v'*A*v / (v'*v);
    end
end

surf(vi,vi,quadform')
xlabel('v_1'), ylabel('v_2')
zlabel('$\zeta$','Interpreter','latex')
2. Notice that I’ve set a tolerance for "zero"-valued eigenvalues,
as discussed in previous chapters for thresholds for computing
rank. You will find that all or nearly all random matrices are
indefinite (positive and negative eigenvalues). If you create
smaller matrices (3 × 3 or 2 × 2), you’ll find more matrices in
the other categories, although indefinite will still dominate.
Category number corresponds to the rows of table 17.1.
        defcat[iteri] = 4 # neg. semidef
    elif np.all(np.sign(e)==-1):
        defcat[iteri] = 5 # neg. def
    else:
        defcat[iteri] = 3 # indefinite

# print out summary
for i in range(1,6):
    print('cat %g: %g' %(i,sum(defcat==i)))
Code block 17.6: MATLAB

n = 4;
nIterations = 500;
defcat = zeros(nIterations,1);

for iteri=1:nIterations

    % create the matrix
    A = randi([-10 10],n);
    ev = eig(A); % ev = EigenValues
    while ~isreal(ev)
        A = randi([-10 10],n);
        ev = eig(A);
    end

    % "zero" threshold (from rank)
    t = n*eps(max(svd(A)));

    % test definiteness
    if all(sign(ev)==1)
        defcat(iteri) = 1; % pos. def
    elseif all(sign(ev)>-1) && sum(abs(ev)<t)>0
        defcat(iteri) = 2; % pos. semidef
    elseif all(sign(ev)<1) && sum(abs(ev)<t)>0
        defcat(iteri) = 4; % neg. semidef
    elseif all(sign(ev)==-1)
        defcat(iteri) = 5; % neg. def
    else
        defcat(iteri) = 3; % indefinite
    end
end

% print out summary
for i=1:5
    fprintf('cat %g: %g\n',i,sum(defcat==i))
end
CHAPTER 18
Data and covariance matrices
Clearly, these two variables are related to each other; you can
imagine drawing a straight line through that relationship. The
correlation coefficient is the statistical analysis method that quan-
tifies this relationship.
You can imagine from the formula that when all data values are
close to the mean, the variance is small; and when the data val-
ues are far away from the mean, the variance is large. Figure
18.3 shows examples of two datasets with the identical means but
different variances.
18.3 Covariance

c_{x,y} = xᵀy / (n − 1),   where x̄ = ȳ = 0    (18.5)
results, which you will see in the next chapter. Furthermore, mean-centering ensures that the diagonal of the covariance matrix contains the variances of each data feature. Thus, it is convenient to mean-center all data before computing covariances, even if it is not strictly necessary for a particular application.
18.5 Covariance matrices
C = XT X (18.8)
Matrix X is the data matrix, and we assume that the columns are
mean-centered. (For simplicity, I omitted the factor 1/(n − 1).)
Note that there is nothing magical about transposing the first
matrix; if your data are stored as features-by-observations, then
the correct formula is C = XXT . The covariance matrix should
be features-by-features.
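A sketch of Equation 18.8 (assuming NumPy), including the 1/(n−1) factor and a comparison with the built-in covariance function:

import numpy as np

n = 200
X = np.random.randn(n, 4)              # observations-by-features data matrix
X = X - X.mean(axis=0)                 # mean-center each column
C = X.T @ X / (n - 1)                  # covariance matrix (features-by-features)

print(np.allclose(C, np.cov(X, rowvar=False)))   # True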
18.6 From correlation to covariance matrices
18.8 Code solutions
X = randn(n,4); % data
X = X-mean(X,1); % mean-center
covM = X'*X / (n-1); % covariance
stdM = inv(diag(std(X))); % stdevs
corM = stdM* X'*X *stdM / (n-1); % R

disp(covM-cov(X)) % compare covariances
disp(corM-corrcoef(X)) % compare corrs
CHAPTER 19
Principal components analysis
Now let’s consider panel B. The two data channels are negatively
correlated. Keeping the same weighting of .7 for each channel
actually reduced the variance of the result (that is, the result-
ing component has less variance than either individual channel).
Thus, a weighting of .7 for both channels is not a good PCA
solution.
This is the idea of PCA: You input the data, PCA finds the best set of weights, with "best" corresponding to the weights that maximize the variance of the weighted combination of data channels.

19.2 How to perform a PCA

In this section, I will describe how a PCA is done, and then in the next section, you'll learn about the math underlying PCA. There are five steps to performing a PCA, and you will see how linear algebra is central to those steps.
Panel B shows the same data, mean-centered, and with the two principal components (the eigenvectors of the covariance matrix) drawn on top.

Panel C shows the same data but redrawn in PC space. Because PCs are orthogonal, the PC space is a pure rotation of the original data space. Therefore, the data projected through the PCs are decorrelated.
v_max = arg max_v { vᵀCv }    (19.1)

v_max = arg max_v { vᵀCv },  s.t. ‖v‖ = 1    (19.2)

v_max = arg max_v { vᵀCv / vᵀv }    (19.3)

When we obtain v_max, we can plug that value into the normalized quadratic form to obtain a scalar, which is the amount of variance in the covariance matrix along direction v_max.

w = v_max
λ = wᵀCw / wᵀw    (19.4)
Λ = W−1 CW (19.7)
WΛ = CW (19.8)
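The claim that the leading eigenvector of the covariance matrix maximizes the variance of the weighted combination (Equation 19.4) can be checked directly. A sketch assuming NumPy, with a simulated two-channel dataset:

import numpy as np

n = 1000
x = np.random.randn(n)
data = np.column_stack((x, .7*x + .3*np.random.randn(n)))   # two correlated channels
data = data - data.mean(axis=0)
C = data.T @ data / (n - 1)

L, V = np.linalg.eigh(C)            # eigenvalues ascending
w = V[:, -1]                        # PC1: eigenvector with the largest eigenvalue

comp = data @ w                     # the weighted combination of channels
print(np.var(comp, ddof=1), L[-1])  # these two numbers match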
19.4 Regularization
C̃ = (1 − γ)C + γαI    (19.9)

α = n⁻¹ Σ λ(C)    (19.10)
You can see from Equation 19.9 that we reduce C and then inflate
the diagonal by shifting. Scaling the identity by α ensures that the
regularization amount reflects the numerical range of the matrix.
Imagine that that scalar wasn’t in the equation: The effect of 1%
shrinkage could be huge or insignificant, depending on the values
in C.
19.5 Is PCA always the best?
Figure 19.6: PCA and ICA (via the jade algorithm) on the
same dataset.
19.6 Code challenges
centered data. But generate the plot using the mean-centered data. How do the PCs look, and what does this tell you about mean-centering and PCA?
19.7 Code solutions
plt.plot([0,V[0,0]*45],[0,V[1,0]*45],'r')
plt.plot([0,V[0,1]*25],[0,V[1,1]*25],'r')
plt.xlabel('Height'), plt.ylabel('Weight')
plt.axis([-50,50,-50,50])
plt.show()
Code block 19.2: MATLAB

1  % create data
2  N = 1000;
3  h = linspace(150,190,N) + randn(1,N)*5;
4  w = h*.7 - 50 + randn(1,N)*10;
5
6  % covariance
7  X = [h' w'];
8  X = X-mean(X,1);
9  C = X'*X / (length(h)-1);
10
11 % PCA and sort results
12 [V,D] = eig(C);
13 [eigvals,i] = sort(diag(D),'descend');
14 V = V(:,i);
15 eigvals = 100*eigvals/sum(eigvals);
16 scores = X*V; % not used but useful code
17
18 % plot data with PCs
19 figure, hold on
20 plot(X(:,1),X(:,2),'ko')
21 plot([0 V(1,1)]*45,[0 V(2,1)]*45,'k')
22 plot([0 V(1,2)]*25,[0 V(2,2)]*25,'k')
23 xlabel('Height (cm)'), ylabel('Weight (kg)')
24 axis([-1 1 -1 1]*50), axis square
2. The tricky part of this exercise is normalizing the singular
values to match the eigenvalues. Refer back to section 16.3
if you need a refresher on their relationship. There’s also the
normalization factor of n − 1 to incorporate. Finally, you still
need to mean-center the data. That should already have been
done from the previous exercise, but I included that line here
as a reminder of the importance of mean-centering.
3. All you need to do in the code is move line 11 to 24 in the
Python code, or line 8 to 20 in the MATLAB code. See figure
below.
CHAPTER 20
Where to go from here?
20.2 Thanks!
Thank you for choosing to learn from this book, and for trusting
me with your linear algebra education. I hope you found the book
useful, informative, and perhaps even a bit entertaining. Your
brain is your most valuable asset, and investments in your brain
always pay large dividends.
The end.
Index
Standard position, 24
Statistics, 392
Subset, 85
Subspace, 79, 83