0% found this document useful (0 votes)

23 views297 pages

La4cs Free

The document is a comprehensive guide on Linear Algebra tailored for Computer Science students, emphasizing its importance in applied mathematics, particularly in fields like machine learning and artificial intelligence. It covers various topics including numerical computations, algebraic and geometric views, and advanced topics such as eigenvalue decomposition and singular value decomposition. The author, Manoj Thulasidas, aims to provide a practical understanding of Linear Algebra's relevance to computing applications.

Uploaded by

JonathanLaureanoRamirez

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

23 views297 pages

La4cs Free

Uploaded by

JonathanLaureanoRamirez

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 297

Linear Algebra for

! * =.×- rank ! = -

Computer Science
! * =! = all possible ! !" = $
! 6 =- ÿ =.
- * =" = all possible & impossible -

! " : dim = (
! "! : dim = ( Column Space
Row Space ") ÿ + ) ÿ "# + .#
."
,#
,"

ï
ÿ+
ï

) ) ÿ " #(+ +
"() + )" +" )

$
.
$

", ÿ ,
,

#
,ÿ" ,
! "! :
! " : dim = ( 2 *
dim = + 2 * , , ÿ " #+ Left Null Space
Null Space ") " ÿ " $
$ # !" # !"
#
! !" ! !"
#

!4$
ï
"=
ï

&
#

!4: =. ÿ =-
%
!

Essential
! * = Mathematics
4
rank ! = - for -×. 4

Computer and Data Scientists

Manoj Thulasidas
Buy it

Scan or Tap
Linear Algebra for
Computer Science

Get the Full Edition of LA4CS with

Summaries, Exercises and Solutions
Only $7.95. Scan, Click or Tap to buy.
Linear Algebra for
Computer Science
Essential Mathematics for
Computer Scientists

Manoj Thulasidas

ASIAN BOOKS
Singapore
ASIAN BOOKS

Manoj Thulasidas
Associate Professor of Computer Science (Education)
80 Stamford Road, School of Computing and Information Systems,
Singapore Management University, Singapore 178902

https://www.thulasidas.com
https://smu.sg/manoj
https://LA4CS.com

c Manoj Thulasidas, 2021.

All rights reserved. No part of this Book may be reproduced or

transmitted in any form, or by any means (electronic or mechanical,
including photocopying, recording or taping on information storage
and retrieval systems), without the prior written permission of the
copyright owner.
ISBN: 978-981-18-2045-8 (ebook)

Get the Full Edition of LA4CS with

Summaries, Exercises and Solutions
Only $7.95. Scan, Click or Tap to buy.

Published in Singapore
“You can’t learn too much linear algebra.”

—Prof. Benedict Gross, Harvard

“Matrices act. They don’t just sit there.”

—Prof. Gilbert Strang, MIT

Contents

Acknowledgments 1
Preface 2
Introduction 4
I.1 Why Learn Linear Algebra? 5
I.2 Learning Objectives and Competencies 6
I.3 Organization 7

Part I Numerical Computations

1. Functions, Equations and Linearity 11

1.1 Linearity 11
1.2 The Big Picture 17

2. Vectors, Matrices and Their Operations 19

2.1 Vectors 19
2.2 Vector Operations 21
2.3 Linear Independence of Vectors 26
2.4 Matrices 32
2.5 Matrix Operations 32
2.6 Properties of Scaling and Addition 33
2.7 Matrix Multiplication 34
2.8 Generalized Vectors 42

3. Transposes and Determinants 44

3.1 Transpose of a Matrix 44
3.2 Definitions and Matrices with Special Properties 48
3.3 Determinant of a Matrix 50
3.4 Numerical Computations 60

Part II Algebraic View

4. Gaussian Elimination 62
4.1 Solvability of System of Linear Equations 62
4.2 Gaussian Elimination 67
4.3 Applications of Gaussian Elimination 72
4.4 More Examples 83
4.5 Beyond Gaussian Elimination 86

5. Ranks and Inverses of Matrices 87

5.1 Rank of a Matrix 88
5.2 Gauss-Jordan Elimination 91
5.3 Inverse of a Matrix 97
5.4 Left and Right Inverses 102
5.5 Cramer’s Rule 103
5.6 Algebraic View of Linear Algebra 104

Part III Geometric View

6. Vector Spaces, Basis and Dimensions 106

6.1 Linear Combinations 106
6.2 Vector Spaces and Subspaces 112
6.3 Basis and Dimensions 115
6.4 Geometry of Linear Equations 119

7. Change of Basis, Orthogonality and Gram-Schmidt 124

7.1 Basis and Components 125
7.2 Change of Basis 128
7.3 Basis of Subspaces 131
7.4 Orthogonality 133
7.5 Gram-Schmidt Process 137
7.6 Rotation Matrices 140

8. Review and Recap 144

8.1 A Generalization 144
8.2 Product Rules: Transposes and Inverses 146
8.3 Column Picture of Matrix Multiplication 147
8.4 Named Algorithms 150
8.5 Pivots, Ranks, Inverses and Determinants 151
8.6 Two Geometries 154

9. The Four Fundamental Spaces 155

9.1 Column Space 156
9.2 Null Space 157
9.3 Row Space 158
9.4 Left Null Space 161
9.5 Computing the Four Fundamental Subspaces 162
9.6 Summary of the Four Spaces 163
9.7 Computing the Spaces 166
9.8 Review: Complete Solution 169
9.9 Other Names 173

10. Projection, Least Squares and Linear Regression 175

10.1 Projection Revisited 175
10.2 Projection to Subspace 178
10.3 Meaning of Projection 181
10.4 Linear Regression 182
Part IV Advanced Topics

11. Eigenvalue Decomposition and Diagonalization 189

11.1 Definition and Notation 190
11.2 Examples of Eigenvalues and Eigenvectors 191
11.3 Computing Eigenvalues and Finding Eigenvectors 193
11.4 Properties 196
11.5 Unit Circles and Ellipses 201
11.6 Diagonalization 203
11.7 Fibonacci Numbers 208
11.8 Applications of Eigenvalues and Eigenvectors 210
11.9 Big Recap: The Story So Far 211

12. Special Matrices, Similarity and Algorithms 221

12.1 Real, Symmetric Matrices 221
12.2 Hermitian Matrices 223
12.3 Eigen Properties of Hermitian Matrices 224
12.4 Markov Matrices 226
12.5 Positive Definite Matrices 230
12.6 Gram Matrix 235
12.7 Matrix Similarity 237
12.8 Jordan Normal Forms 240
12.9 Algorithms 244

13. Singular Value Decomposition 249

13.1 What SVD Does 250
13.2 How SVD Works 255
13.3 Why SVD Is Important 261
13.4 Pseudo-Inverse 267
13.5 Fundamental Spaces and SVD 270

Summing Up. . . 273

Glossary 275
Credits 277
List of Figures

2.1 Example of Scalar Multiplication 22

2.2 Addition of Vectors 24
2.3 Two Linearly Independent Vectors 25
2.4 Two Linearly Dependent Vectors 27
2.5 Dot Product and the Angle Between Vectors 30
2.6 Matrix Multiplication: Illustrated 36
2.7 Column and Row Pictures of Matrix Multiplication 40
3.1 How a Matrix Transforms Unit Vectors 53
3.2 Determinant and Area: Proof Without Words 54
3.3 Determinants as Positive and Negative Areas 54
3.4 Minors and Cofactors 57
4.1 Visualizing Equations 65
4.2 Gaussian Elimination on a Simple Matrix 73
4.3 Gaussian Elimination: Elementary Matrix 79
5.1 Gauss-Jordan Elimination 92
6.1 Linear Combinations: Two Independent Vectors
in R2 107
6.2 Linear Combinations: Two Dependent Vectors in
R2 108
6.3 Linear Combinations: Two Vectors in R3 109
6.4 Linear Combinations: Two Other Vectors in R3 109
6.5 Geometric View of Solvable Equations 120
6.6 Geometric View of Two Inconsistent Equations 121
6.7 Geometric View of Three Inconsistent Equations 122
7.1 Visualization of Change of Basis 129
7.2 Dot Product as Projection 136
7.3 The Gram-Schmidt Process 138
7.4 Rotation Matrix in R2 141
8.1 Matrix Shapes, RREF and Solvability 153
9.1 The Four Fundamental Subspaces 164
10.1 Projection of a Vector onto Another 176
10.2 Projection to a Subspace 178
10.3 Example of Simple Linear Regression 183
10.4 Multiple Linear Regression: Data Matrix 185
10.5 Multiple Linear Regression: Notations 186
10.6 Multiple Linear Regression: Visualization 187
11.1 The Shear Matrix 192
11.2 Eigen Analysis: Visualization 202
11.3 Recap: Four Fundamental Spaces 211
11.4 Recap: Full-Rank Square Matrix 213
11.5 Recap: Full-Column-Rank Tall Matrix 214
11.6 Recap: Full-Row-Rank Wide Matrix 216
11.7 Recap: General Matrix 218
13.1 SVD Input Matrix 251
13.2 SVD interpretation: V T 252
13.3 SVD interpretation: Σ 253
13.4 SVD interpretation: U and summary 254
13.5 SVD interpretation: summary 254
13.6 Shapes of Û , Σ̂ and V in the SVD of a
full-column-rank A. 257
13.7 Shapes of Û , Σ̂ and V in the SVD of a general A. 259
13.8 Examples of PCA 266
13.9 Pseudo-Inverse and the Elegant Symmetry of
Linear Algebra 272
List of Tables

4.1 Various permutations of simultaneous equations

in two variables, showing solvability 63
4.2 Properties and behavior of linear equations and
solutions 67
4.3 Illustration of solvability conditions based on the
characteristics of REF 74
4.4 A = LU Decomposition 80
4.5 A = P LU Decomposition 82
4.6 Example 1: Gaussian elimination and solvability:
Unique solution 83
4.7 Example 2: Gaussian elimination and solvability:
No solutions 84
4.8 Ecample 3: More equations than unknowns, but
solvable 85
4.9 Example 4: More equations than unknowns, with
no solutions 85
4.10 Example 5: Gaussian elimination and solvability:
Infinity of solutions 86
5.1 Gauss-Jordan on a full-rank, square coefficient
matrix, giving us the unique solution 94
5.2 Gauss-Jordan on a rank-deficient coefficient
matrix, resulting in an infinity of solutions 95
5.3 Gauss-Jordan on a rank-deficient “wide”
coefficient matrix 96
7.1 Examples of change of basis 129
8.1 Named algorithms and their uses 150
9.1 Summary of the four fundamental spaces 165
9.2 The fundamental spaces of matrices of different
shapes 165
11.1 Fibonacci numbers (fk ) vs. its approximation
(fk(approx) ) 210
12.1 Migration probabilities 229
2×2
12.2 Jordan Normal Forms J of A ∈ R with
Asi = λi si 242
12.3 Multiplicities of eigenvalues from Joran blocks 244
Acknowledgments

This book is the final metamorphosis of a collection of lecture notes,

written specifically to help my students of Linear Algebra for Com-
puting Applications at Singapore Management University. It would
not have existed but for the opportunity to teach that course, and for
the encouragement of my friends, colleagues, and students. I am
grateful to them for their help and support, and especially to the ones
who proofread the drafts and suggested edits.

M.T.
Preface

This book has its origin in my experience teaching Linear Algebra

to Computer Science students at Singapore Management University.
Traditionally, Linear Algebra is taught as a pure mathematics course,
almost as an afterthought, not fully integrated with any other applied
curriculum. That certainly was how it was taught to me. The course
I was teaching, however, had a definite pedagogical objective of
bringing out the applicability and usefulness of Linear Algebra in
Computer Science, which is essentially applied mathematics. In the
era of machine learning and artificial intelligence, Linear Algebra is
the branch of mathematics that holds the most relevance to computing.
One question I got from one of my brightest students the first time
I taught the course was why they were forced to learn this particular
branch of mathematics. It was not a defiant or rebellious question, but
one of pure curiosity. I did not have a ready answer then, but I think I
have one now. When we embark on any profession, we have to start
with the tools of the trade. For instance, if we want to be a musician,
we have to learn the notes before we can perform. If we want to be a
writer, we need to know the vocabulary and grammar of our chosen
language first. Similarly, in order to be a computer scientist, we have
to have the necessary mathematical skills. And Linear Algebra is
arguably the most critical expertise needed, especially when it comes
to dealing with large quantities of data efficiently.
Preface 3

Linear Algebra is a well-established field, and we can find several

resources freely available on the Internet. In writing this book, I have
made use of many of them. In fact, the whole process of writing can
be thought of as an exercise in curating the right pieces of information
at the right level, and standardizing them with consistent notations to
form a coherent and fluid narrative.
Mathematics books often tend to be dry, unreadable collections of
facts, theorems, proofs and problems. This type of discourse is under-
standable, given the nature of the subject that demands accuracy and
completeness, often at the expense of readability. My objective was
to write a book that would be read. For this reason, I set practicality
as my goal, and even used, dare I say this, humor to keep my reader
engaged.
I employed two more tricks to improve the readability. The first
one is to restrict the field (over which our vectors and matrices are
defined) to real numbers (R) because of its relevance to computer
science. The second trick is to pepper the text with “boxes,” which
are curious applications, background information, or other tidbits that
are topic-adjacent to the subject matter under discussion.
From experience, I know that a book of this length takes about a
year to write and another year to polish and publish. The first version
of this book was done and dusted in about three months, almost ten
times faster than normal. For this reason, it is continuously updated,
corrected and improved upon over the last couple of years. How
my students respond to the course on which the book is based will
also inspire further revisions. These new versions and/or editions are
made available periodically.

MANOJ THULASIDAS
Singapore
February 4, 2025
Introduction

I cannot teach anybody anything, I can only make them

think.
—Socrates

This book is for an introductory course in Linear Algebra. It

teaches the mathematical foundations of Linear Algebra to illustrate
their relevance to computer science and applications. It also prepares
the students for advanced numerical methods in computing, especially
in machine learning and data analytics. Designed for a first course in
Linear Algebra, this book will cover the basic concepts and techniques
of Linear Algebra and provide an appreciation of the wide application
of this discipline within the field of computer science.
The book will require development of some theoretical results,
with proofs and consequences employing some level of mathematical
rigor, algebraic manipulation, geometry and numerical algorithms.
However, the main focus will be on the computational aspects and
the applications of Linear Algebra.
Why Learn Linear Algebra? 5

Regardless of how application-oriented we would like to make it,

Linear Algebra is a branch of pure mathematics. And in a textbook
written for a first course introducing it, we will have a hard time
staying away from the purely theoretical; however, Linear Algebra
touches upon so many aspects of the technologies that enhance our
modern life that we may discover interesting connections and appli-
cations even from the most basic of its concepts. This book will bring
out such connections wherever possible, thereby attempting to be an
intellectual treat to its readers.

I.1 Why Learn Linear Algebra?

From a purely mathematical perspective, Linear Algebra is a branch

that takes us to mathematical maturity, whereby we start to appreciate
the interconnections among its various branches, such as algebra (as
in solving equations) and geometry (such as vector spaces), advanced
algorithms and mathematical intuitions behind them and so on.
We have a few major branches of mathematics that are highly rel-
evant to scientific computing, numerical methods and technology in
general. The most important ones among them are Calculus (of the
multivariable kind) and Linear Algebra. Roughly speaking, Calculus
corresponds to the analog world–concerned with continuous vari-
ables, their evolution, effects on others etc. It plays an enormously
important role in all branches of engineering and physical sciences.
However, the world is digital now. And, the mathematics of digital
technology, at a high level, is Linear Algebra.
For computer science specifically, we have one more relevant
branch of mathematics, which is discrete math. This one is a lot
of fun, full of puzzles and brain-teasers. It involves number theory,
game theory and other fun stuff.
Relevant to all fields where data and number crunching are involved
(ranging from social sciences to medicine to data science) is Statistics
and Probability. Later on, we will see some of the statistical quantities
(covariance and its principal components, for instance) and even
certain quintessentially calculus operations (error minimization in
linear regression, viewed as a projection operation) coming out of
operations specific to Linear Algebra in an elegant fashion.
6 Introduction

While we can list a number of impressive applications of Linear

Algebra in computer science (such as the page rank algorithm that
made Google what it is, and certain image and data compression
algorithms), the real role of this branch of mathematics is similar to
that of the alphabet or vocabulary or grammar in learning a language.
If we want to be a writer, for instance, we have to be good at all these
aspects of the language of our choice. Having these background
skills alone is not enough; it will not make us a writer. But what
is absolutely certain is that without these skills, we will never be a
writer, not a good one at any rate.
Linear Algebra, in much the same way, is really the basic backdrop
of several of the pivotal numerical algorithms in computer science.

I.2 Learning Objectives and Competencies

The learning objectives of the book include imparting the knowledge

and the foundational concepts of Linear Algebra, as well as a set of
skills and an appreciation for its application in computing. Upon the
successful completion of a course based on this book, students should
have a sound understanding the following:

• The concept of linearity as it applies to expressions, functions

and transformations.
• The notion of vectors and matrices, their operations.
• Important characteristics of matrices, such as its four funda-
mental subspaces, rank, determinant, eigenvalues and eigen-
vectors, different factorizations, etc.
• How to use the characteristics of a matrix to solve a linear
system of equations using algorithms such as Gaussian and
Gauss-Jordan eliminations.
• Important concepts of vector spaces such as independence,
basis, dimensions, orthogonality, Gram-Schmidt process for
orthonormalization, etc.
• Properties of special categories of matrices such as symmetric,
positive definite, etc.
• Eigenvalue, singular value and other decompositions, and their
applications.
Organization 7

• Several algorithms in and using Linear Algebra, like linear re-

gression, QR and power iteration for eigen-analysis, Cholesky
decomposition etc.

The learning objectives listed above translate to the following learn-

ing outcomes or competencies in the students. Upon the successful
completion of a course based on this book, students should be able to
perform the following (both manually and in software tools such as
SageMath):

1. Determine the existence and uniqueness of the solution of a

linear system, and find its complete solution by choosing an
effective method such as Gaussian elimination, factorization,
diagonalization, etc.
2. Test for linear independence of vectors, orthogonality of vectors
and vector spaces.
3. Compute the rank, determinant, inverse, Gram-Schmidt or-
thogonalization and different factorizations of a matrix.
4. Visualize the four fundamental subspaces of a matrix, and iden-
tify their relation to systems of linear equations, and find their
dimension and basis.
5. Identify special properties of a matrix, such as symmetry, pos-
itive definiteness, etc., and use this information to facilitate the
calculation of matrix characteristics.
6. Describe the use of mathematical techniques from linear alge-
bra as applied to computer applications.
7. Compute eigenvalues and eigenvectors of a matrix, use them
for diagonalizing, taking its powers, and applying them to solve
advanced problems. [Optional Topic]
8. Perform diagonalization and the singular value decomposition
of a matrix, and identify its principal components. [Optional
Topic]

I.3 Organization

The book is organized in four parts, as listed below. For courses

based on this book, the same topic flow is recommended, although
8 Introduction

it may be necessary to shuffle some of the sub-topics for clarity and

ease of understanding.

Part I: Numerical Computations

In the first part, we will go over the basics of Linear Algebra as
it is usually taught in an undergraduate curriculum. Here, we will
be working with vectors and matrices as represented by arrays of
numbers, and their basic operations.
1. Functions, Equations and Linearity
2. Vectors, Matrices and Operations
3. Transposes and Determinants

Part II: Algebraic View

After covering the basic operations, we move on to the view of ma-
trices and vectors as encoding linear equations, and ways of solving
them.
4. Gaussian Elimination
5. Ranks and Inverses of Matrices

Part III: Geometric View

In the third part, we will look at the beautiful geometry that arises
from vectors and matrices, from the perspective of the spaces they
define, and their properties and operations.
6. Vector Spaces, Basis and Dimensions
7. Change of Basis, Orthogonality and Gram-Schmidt
8. Review and Recap
9. The Four Fundamental Spaces
10. Projection, Least Squares and Linear Regression

Part IV : Advanced Topics

In the last part, we will discuss the operations on matrices such as
eigenvalue and singular value decompositions, their significance and
applications.
Organization 9

11. Eigenvalue Decomposition and Diagonalization

12. Special Matrices, Similarity and Algorithms
13. Singular Value Decomposition
Part IV of the book, Advanced Topics, although significant as
the basis of many algorithms, may be too difficult in an introductory
undergraduate course in Linear Algebra. It may be revisited later on
in the curriculum.
Even though the emphasis in each part will be as declared in its
title, we will see that it is difficult to segment Linear Algebra into
watertight compartments like Numeric, Algebraic, Geometric etc.
We will necessarily see some overlap.
The book will also list some of the tools and resources to be used,
especially to illustrate the computing applications. It will also show
that most of the numeric calculations that used to be performed by
hand are now carried out using mathematical programs such as Matlab
or SageMath, the latter being our tool of choice.

I.3.1 Our Focus

The value we can find in learning Linear Algebra is not in developing

arithmetic dexterity in performing numeric computations, for we can
find all such numerical computations neatly implemented in tools like
SageMath. Even some symbolic manipulations are found in SageMath
or other tools such as Mathematica. What we focus on, therefore, are
the aspects of Linear Algebra that will improve our intuitive insights
and conceptual understanding. Our hope is that these aspects will
help us be better computer and data scientists. For this reason, the
exercises that follow every chapter after a chapter summary are very
different from what we usually find in books in Linear Algebra; they
are mainly objective type questions testing our grasp of the theoretical,
conceptual and intuitive aspects.
Part I

Numerical
Computations
1
Functions, Equations and
Linearity

It is a monstrous thing to force a child to learn Latin

or Greek or mathematics on the ground that they are an
indispensable gymnastic for the mental powers. It would
be monstrous even if it were true.
—George Bernard Shaw

1.1 Linearity

Formally, Linear Algebra deals with objects (mathematical entities)

that transform in a linear fashion. After all, it has the word “Linear” in
its name. Let’s define linearity, along with what “transform” means.
In order to do that, however, we have to take a step back and look at
some of the foundations.

1.1.1 Expressions and Functions

When we combine one or more mathematical variables in a valid way

(using operations like addition, multiplication, exponentiation etc.,
or using functions like sine, logarithm etc.), we get a mathematical
12 Functions, Equations and Linearity

expression. Here are some examples of expressions: 3x, sin x, x +

y, Ax, ln x.
A variable is a symbol (like a, x, A etc.) that is a placeholder or
a container for a value. Note that the value can be a single number,
or a complicated one like a vector or a matrix, although we have not
defined them yet. For now, however, we are thinking of the variables
as placeholders of single, real values. For a variable x, we will
state that it is real using this mathematical jargon: x ∈ R where R
represents the set of all real numbers and ∈ says, “is a member of.”
When we have an expression (which is a combination of vari-
ables), we can also think of it as a relationship between its value
and the set of variables it contains, which is what we mean by a
function. For example, 4x + y is an expression; f (x, y) = 4x + y
is a function. We usually write a function of a single variable as
f (x) = an expression in x. Some examples of single-variable func-
2
tions are: f (x) = 3x, f (x) = e−x . We require the functions to be
single-valued.
In other words, we can think of an expression as defining or speci-
fying a function, with the additional constraint that one set of inputs
(corresponding to the variables in the expression) gives one √ unique
output (which is the value√ of the expression). Examples: x is an
expression, but f (x) = x is not a function by our definition
√ because
it gives two values for every x. However, f (x) = | x | (where |y|
stands for the absolute value of y) is a function.
We can think of a function as a transformation: A mathematical
object that transforms the inputs to the outputs. Although we are
thinking of the inputs (x, y etc.) and the output (f (x, y)) as numeric
values, they can be other mathematical entities, like vectors and
matrices (to be defined). Again, for now, let’s limit ourselves to real
values of the inputs and outputs for our discussion now.

Linearity
Definition: We call a transformation (or a function) of a single, real
value (x ∈ R) linear if it satisfies the following two conditions:
• Homogeneity: When the value of x is multiplied by a real
number, the value of the function also gets multiplied by the
same number.
f (sx) = sf (x) ∀ s, x ∈ R
Linearity 13

This property of multiplicative scaling behavior is known as

homogeneity.
• Additivity: The output of the function for the sum of two
inputs is the sum of the corresponding outputs.
f (x + x′ ) = f (x) + f (x′ ) ∀ x, x′ ∈ R

Here are some examples of linear functions: f (x) = 3x, f (x) = 0.

Some functions that are not linear would be: f (x) = x2 , f (x) = 5.
Note that f (x) = mx + c is not linear if c ̸= 0 because it does not
satisfy the linearity conditions listed above. This fact may look weird
to us because we know that y = mx + c is the equation to a line. We
will take a closer look at it very soon.

1.1.2 Equations and Equality

In order to understand why f (x) = mx + c is not linear, we have to

understand what an equation is, and, even more fundamentally, what
equality (or the equal sign, =) means. When we write f (x) = mx+c,
we are defining a function. The equal sign (=) in f (x) = mx + c is
one of assignment: We are assigning the symbol f to the function of
one variable mx + c, which is not a difficult concept for computer
scientists with exposure to programming languages1 . This function,
indeed, is not a linear function according to our definition of linearity.
When we write y = mx + c, on the other hand, we do not mean an
assignment or a definition. It is a statement of truth, or a recipé for
computing any y given a value for x. It is an equation. In fact, it is
an equation in two variables, x, y ∈ R. In other words, the equal sign
(=) in y = mx + c is an assertion of truth or a condition. We call
such assertions of truth equations. Our preferred form of equations
is: expression = constant, or function (which is a placeholder for
an expression) = constant. Therefore, we will write y = mx + c as
−mx + y = c.
To summarize, equations in one variable have the form f (x) = b
(where x, b ∈ R, in our context). If the function f (x) (which is a

1
Programming languages such as Python are procedural, where we specify assignments and
operations (or steps) to be performed on them. We have another class of programming
languages called functional, where we list statements of truth and specify mathematical
operations on the variables. Haskell is one of them.
14 Functions, Equations and Linearity

proxy for some, hitherto unspecified expression) is linear, we consider

the equation f (x) = b to be linear as well. In fact, in the case of
expressions (or functions) of one variable, we know a bit more than
that because we have defined what we mean by linearity: The only
possible form f (x) can take so that it is linear is ax. So the most
general linear equation of one variable is simply ax = b.
How do we define linearity for two (or more) variables? In par-
ticular, is y = mx + c (or, equivalently, −mx + y = c) a linear
equation? We know that mx + c is not a linear expression, and
f (x) = mx + c is not a linear function. But, −mx + y is a function of
two variables, f (x, y). In order to determine whether it is linear, the
linearity conditions listed earlier (homogeneity and additivity) need
to be generalized for multiple variables.

1.1.3 Linearity of Multivariable Functions

Let’s generalize the homogeneity property to two variables. For the

one variable case, it was f (sx) = sf (x) ∀ s, x ∈ R . For a two-
variable function f (x, y), let’s say that homogeneity is satisfied when
both variables are multiplied by the same scaling factor (or scalar).
f (sx, sy) = sf (x, y) ∀ s, x, y ∈ R
For one variable, the additivity property was f (x + x′ ) = f (x) +
f (x′ ) ∀ x, x′ ∈ R. Let’s generalize it as follows:
f (x + x′ , y + y ′ ) = f (x, y) + f (x′ , y ′ ) ∀ x, y, x′ , y ′ ∈ R
If we define a new mathematical entity, consisting of two num-
bers (elements) in a specified order, for which scaling is defined as
scaling each element, and addition is defined as the addition of the
corresponding elements, we could reuse our original (one-variable)
linearity conditions for the two-variable case as well, or indeed for
the n-variable case. This mathematical entity (a group of numbers in
a specified order) is a vector. Let’s define it as a column of numbers,
with scaling and addition operations.
′
x sx ′ x ′ x + x′
x= sx = x = ′ x+x =
y sy y y + y′
Our vector x has two real numbers in it, and we write x ∈ R2 . We can
easily generalize the vectors such as x to n dimensions. All we have
Linearity 15

to do is write its n elements (say, x1 , x2 , · · · , xn ) in a column. We

would then write x ∈ Rn . With these definitions and generalization,
we can restate the two linearity conditions as follows:
• Homogeneity: f (sx) = sf (x) ∀ s ∈ R, x ∈ Rn
• Additivity: f (x + x′ ) = f (x) + f (x′ ) ∀ x, x′ ∈ Rn
For the sake of completeness, let’s verify whether f (x, y) =
−mx + y is indeed linear. If we scale x and y by s, the value of
the function gets scaled by the same factor:
f (sx, sy) = −msx + sy = s(−mx + y) = sf (x, y)
Therefore, the homogeneity condition is satisfied. For the additivity
condition:
f (x + x′ , y + y ′ ) = −m(x + x′ ) + y + y ′
= (−mx + y) + (−mx′ + y ′ )
= f (x, y) + f (x′ , y ′ )
Therefore, the second, additivity, condition also is satisfied, and
f (x, y) = −mx + y is indeed linear. And the equation −mx + y = c
is linear as well. To be clear, earlier we said mx + c, as an expression
in (or f (x) = mx+c as a function of) one variable is not linear, which
is not in contradiction with the statement that f (x, y) = −mx + y is
linear as a function of (or −mx+y as an expression in) two variables.
And −mx + y = c is a linear equation in two variables.

1.1.4 Vectors and Matrices

We stated earlier that the most general linear equation of one variable
was ax = b. We start with the linear equation −mx + y = c. What
is the most general linear equation (or system of linear equations) in
two or n dimensions?
Let’s start by defining a multiplication operation between two vec-
tors: The product of two vectors is the sum of the products of the
corresponding elements of each of them. With this definition, and a
notational trick2 of writing the first of the two vectors horizontally,

2
The reason for this trick is to have a generalized definition of the product of two matrices,
of which this vector multiplication will become a special case. We will go through matrix
multiplication in much more detail in the next chapter.
16 Functions, Equations and Linearity

we write:

x x
a1 a2 = a1 x + a2 y =⇒ −m 1 = −mx + y
y y
Our linear equation −mx+y = c and a more general a1 x+a2 y = b
then becomes:

x x
−m 1 =c a1 a2 =b
y y
If we had one more equation, a21 x + a22 y = b2 , we could add another
row in the compact notation above. In fact, the notation is capable of
handling as many equations as we like. For instance, if we had three
equations:
1. a11 x + a12 y = b1
2. a21 x + a22 y = b2
3. a31 x + a32 y = b3
We could write the system of three equations, more compactly,
   
a11 a12 b1
a21 a22  x = b2  or Ax = b (1.1)
y
a31 a32 b3
Here, the table of numbers we called A is a matrix. What we have
written down as Ax = b is a system of three linear equations (in
2-dimensions, but readily extended to n dimensions).

1.1.5 Linear Transformations

Now that we got a glimpse of vectors and matrices, and defined

what linearity means in the context of Linear Algebra, let’s quickly
summarize the discussion with linear transformations. A function
f (x) can be thought of as a transformation, or a mapping3 . When we
write y = f (x), what we are saying is that we can transform for every
x ∈ R using f (x) and get a value of y ∈ R to which it is mapped. f is
a mapping from R to R. Of all possible functions f (x), the simplest
one is a linear transformation, f (x) = ax. In fact, it is the only form

3
For our purposes, there is very little difference between functions, transformations and
mapping.
The Big Picture 17

of linear transformation in one dimension—the only one that respects

linearity as we defined it.
f (x) = ax, x ∈ R. f : R 7→ R
And the most general form of equation (in the form expression or
function = constant) in one dimension would be:
ax = b, a, b ∈ R
What is the simplest possible transformation from a multi-dimensional
space Rn to a different multi-dimensional space Rm ? This turns out
to be the linear transformation encoded in a matrix.
y = Ax, x ∈ Rn , y ∈ Rm . A : Rn 7→ Rm
We can prove (as indeed we will, later on) that every linear trans-
formation Rn 7→ Rm has a unique matrix A associated with it. The
matrix A, therefore, represents a mapping from Rn to Rm , taking the
vector x and giving us the vector y. When we define an equation (as
expression = constant), we get the deceptively simple equation:
Ax = b
As we saw earlier, the equation Ax = b also represents a system
of linear equations. The properties of such systems as encoded in
the matrix A, the conditions under which the system can be solved,
and the geometry and intuitions behind it will become the rest of this
book.

1.2 The Big Picture

For our purposes in computer science, the most appropriate set of

numbers to build vectors and matrices is that of real numbers, arranged
in columns or rows or tables. A scalar is a member of the set of real
numbers, s ∈ R. It can also be thought of as a 1 × 1 matrix. A vector
is a n×1 matrix, x ∈ Rn . Vectors of fewer than four components live
in spaces4 that we can understand and visualize: R2 is a plane and

4
As we get more sophisticated later on, we will qualify this statement and draw a distinction
between coordinate spaces where we live and vector spaces where vectors live.
18 Functions, Equations and Linearity

R3 is the three-dimensional space we live in. We can extrapolate and

understand, if not visualize, dimensions higher than three, namely
Rn , n > 3, as well.
Matrices (such as data sets), on the other hand, live in spaces that
are harder to visualize: A ∈ Rm×n . However, as we will see, the
properties of these spaces are not unlike the spaces of vectors.
Although we will deal with vectors and matrices as arrays (columns
or tables) of numbers, it is important to keep in mind that they are, in
fact, abstract entities defined only by the operations (such as scaling,
addition, multiplication) specified on them.
In computer science, the representation of vectors and matrices
as arrays of numbers is the one that makes most sense because our
data sets tend to comprise numbers. However, if we can find other
entities on which we can specify the requisite set of operations, we
are allowed to treat them as vectors as well. This generalization,
though not critical in computer science, plays an important role in
the applications of Linear Algebra in other fields, most notably in
physics.
The whole point of Linear Algebra is that we do not have to think of
the representation; we can work with the properties of the underlying
entities. So, as the book progresses, we will stop worrying about the
individual elements of matrices or vectors. Towards the end, in Part
IV, we will go over some advanced applications of Linear Algebra,
where, our intuition will become completely abstract and independent
of any representation. We will then be in a position to apply such
intuitions to come up with efficient algorithms and computational
techniques.

Get the Full Edition of LA4CS with

Summaries, Exercises and Solutions
Only $7.95. Scan, Click or Tap to buy.
2
Vectors, Matrices and
Their Operations

So far as the theories of mathematics are about reality,

they are not certain; so far as they are certain, they are not
about reality.
—Albert Einstein

Vectors and matrices, along with the operations defined on them,

form the basis of Linear Algebra as we will learn in this book. In the
previous chapter, we got a glimpse of them. Now, let us look at them
in more detail and in a formal fashion.

2.1 Vectors

For our purposes in Computer Science, vectors are an ordered list of

numbers. The numbers are of the same type. In our case, typically
they are real numbers because data tends to be full of real numbers.
We would formally call the vectors by the type of the numbers. In
other words, our vectors are vectors over the field of real numbers.
The last statement calls for a digression to describe what a field
is, which is in the box below. If we have a m dimensional vector
x, we indicate its field by saying, x ∈ Rm . In our tool, SageMath,
20 Vectors, Matrices and Their Operations

Groups, Rings and Fields

Group: A Group is a set of mathematical entities with one binary operation, the identity
of the operation and an inverse for every element with respect to that operation. The
operation needs to be commutative.
The set of all integers with addition defined will be an example. Zero is the identity,
for every number, its negative is the additive inverse.
Ring: A Ring is a set of mathematical entities with two binary operations (general-
ized versions of the arithmetic operations of addition and multiplication) with the key
properties:
• Addition is associative and commutative
• There is an additive identity, a zero
• Every element has an additive inverse
• Multiplication is associative
• Not necessary that multiplication be commutative
• Multiplication distributes over addition
• It need not have an multiplicative inverse
Classic examples of Rings are:
• Integers
• Integers modulo some Natural number greater than one
Field: A Field has (in addition to what Rings have):
• Multiplication is commutative
• Every nonzero element has a multiplicative inverse
Classic examples of Fields are:
• Rational numbers
• Real numbers
• Complex numbers
• Integers modulo a Prime number
A Field is a Ring with extra properties. And a Ring is a Group with extra properties.

we will see that vectors (and matrices) can be over the ring (see the
box titled Groups, Rings and Fields) of integers (Z, called ZZ in
SageMath), or the field of rationals (Q, QQ), real (R, RR) or complex
(C, CC) numbers1 .
In order to build a realistic example of a vector that we might come
across in computer science, let’s think of a data set where we have
multiple observations with three quantities:

1
See https://www.mathsisfun.com/sets/number-types.html for common sets of numbers
in mathematics.
Vector Operations 21

1. w: Weight in kg

2. h: Height in cm

3. a: Age in years

A sample data point could be a list of three values: (w, h, a) =

(76, 175, 51). We can think of this data point as a vector (over the
field2 of real numbers). Our vectors are always Column Vectors, or
numbers arranged in a column. Therefore, we write:

76
x = 175 ∈ R3 (2.1)
51

A column vector is, in fact, a matrix of size m × 1. Note that although

the numbers of the vector (also known as its elements, entries or
components) belong to the same field mathematically, they do not
have to signify the same physical quantity3 . They do not need to have
the same physical significance. However, the order of the elements
in a vector is significant.

2.2 Vector Operations

2.2.1 Scalar Multiplication

We define the scalar multiplication of a vector with a number as the

follows: The product of the multiplication is a new vector with each
of its elements multiplied by the number. In other words, the number
(called scalar) scales the vector.

Scalar Multiplication
Definition: For any vector x and any scalar s, the result of scalar

2
In computer science, the difference between using the rational ring instead of real field is
too subtle to worry about. It has to do with the definition of the norm (or the size) of a vector.
3
In fact, the whole machinery of Linear Algebra is a big syntactical engine, yielding us
important insights into the underlying structure of the numbers under study. But it has little
semantic content or correspondence to the physical world, other than the insights themselves.
22 Vectors, Matrices and Their Operations

multiplication (sx) is defined as follows:

   
x1 sx1
 x2   sx2 
def 
∀ x =  ..  ∈ Rn and s ∈ R, sx =  ..  ∈ Rn (2.2)
  
.  . 
xn sxn
We say that the set of vectors is closed under scalar multiplication,
which is a fancy way of saying that if we scale a vector, we get another
vector. Note that for a vector over a certain field (like R), the scalar
also belongs to the same field. In fact, it can be any number in that
field, including zero.
The fact that the scaling factor can be zero, and that the set of vectors
is closed under scalar multiplication has something to say about the
zero vector (denoted by 0, which is a vector whose elements are all
zeros). It has to be a vector too, because by scaling any vector with
the scalar zero, we get an entity with all zeros, and because the set
of vectors is closed under scalar multiplication, this entity has to be a
vector. The zero vector is special, just like zero is a special number.
Scalar multiplication is commutative: sx = xs. It is not a pre-
condition, but a consequence of the definition.
6 "

SCALAR MULTIPLICATION 5

" Multiply every element by a $#

4
number
%
" In Linear Algebra, we write 3 #
$
vectors in a column, as a
column of numbers 2 #

1 1
!=
2 !
28 27 26 25 24 23 22 21 0 1 2 3 4 5 6 7 8
1 2
2! = 2 = 21
2 4
1 21
2! = 21! = 21 = 2#
22
2 22
' ' 1 ' 23
!= = 2
2 2 2 ' 24

Fig. 2.1 Example of scalar multiplication of a vector x ∈ R2 . Note that all the scaled
versions lie on the same line defined by the original vector.

Figure 2.1 shows an example of scalar multiplication. The original

vector x ∈ R2 has elements 1 and 2, which we plot with one unit
Vector Operations 23

along the x axis, and two units along the y axis. The vector then is
an arrow from the origin to the point (1, 2), shown in red. Note how
the scaled versions (in blue, green and purple) of the vector are all
collinear with the original one. Note also that the line defined by the
original vector and its scaled versions all go through the origin.

2.2.2 Vector Addition

We define the addition of two vectors such that the result of the
addition is another vector whose elements are the sum of the cor-
responding elements in the two vectors. Note that the sum of two
vectors also is vector. In other words, the set of vectors is closed
under addition as well.

Vector Addition
Definition: For any two vector x1 and x2 , the result of addition
(x1 + x2 ) is defined as follows:
   
x11 x12
 x21   x22 
∀ x1 =  ..  and x2 =  ..  ∈ Rn ,
   
 .   . 
xn1 xn2
  (2.3)
x11 + x12
def  x21 + x22 
 
x1 + x2 =  ..  ∈ Rn
 . 
xn1 + xn2
Vector addition also is commutative (x1 + x2 = x2 + x1 ), as a
consequence of its definition. Moreover, we can only add vectors
of the same number of elements (which is called the dimension of
the vector). Although it is not critical to our use of Linear Algebra,
vectors over different fields also should not be added. We will not see
the latter restriction because our vectors are all members of Rn . Even
if we come across vectors over other fields, we will not be impacted
because we have the hierarchy integers (Z) ¦ rationals (Q) ¦ real (R)
¦ complex (C) numbers. In case we happen to add a vector over the
field of integers to another one over complex numbers, we will get
a sum vector over the field of complex numbers; we may not realize
that we are committing a Linear Algebra felony.
24 Vectors, Matrices and Their Operations

6 "

ADDITION OF VECTORS 5

" Add the corresponding 4

elements
4
" Two Vectors 3
#" + #! =
1 3
1 #" =
!! = 2
2 2
3 3
!" = 1 #! =
1 1
!
28 ! 27 26 1 25 3
+ 24 23 22 21 0 1 2 3 4 5 6 7 8
! + !# =
2 1 21
1+3 4
= =
2+1 3 22

Fig. 2.2 Adding two vectors x1 , x2 ∈ R2 . We add the element of x2 to the correspond-
ing elements of x1 , which we can think of as moving x2 (in blue) to the tip of x1 (in red),
so that we get the arrow drawn with dashed blue line. The sum of the two vectors is in green,
drawn from the origin to the tip of the dashed arrow.

Figure 2.2 shows the addition of two vectors x1 , x2 ∈ R2 . We can

think of the addition operation as transporting x2 (in blue) to the tip
of x1 (in red), so that we get the arrow drawn with dashed blue line,
parallel to the original x2 , and with the same size. The sum of the
two vectors is in green, drawn from the origin to the tip of the dashed
arrow. We could, of course, x1 (the red vector) to the tip of x2 (blue)
and perform the addition that way as well. In other words, we could
complete the parallelogram with x1 and x2 as adjacent sides, and
think of the diagonal (from the origin) as the sum4 .
Now that we have defined addition and scaling, we can ask the
question whether each vector has an additive inverse. In other words,
for each vector x ∈ Rn , can we find another vector x′ such that
x + x′ = 0. Note that 0 is not the number 0, but a vector 0 ∈ Rn .
We can see that if we scale x by −1 so that x′ = −1 × x, by our
definition of the addition of vectors, we have x + x′ = 0. Therefore,
every vector in Rn has an additive inverse.

4
In Linear Algebra as taught in this book, our vectors are always drawn from the origin, as
a general rule. This description of the addition of vectors is the only exception to the rule,
when we think of a vector transported to the tip of another one.
Vector Operations 25

2.2.3 Linear Combinations

Once we have scalar multiplication and addition, we can perform

them both on our vectors, and be sure that what we get are valid
vectors because of the closure property of the set of vectors under the
two operations. If we have two vectors in Rn , x1 and x2 , and any two
scalars, s1 , s2 ∈ R, we can construct a new vector y = s1 x1 + s2 x2 .
We know that y ∈ Rn is a valid vector.
For example, if we look at the two vectors in Figure 2.2, taking
s1 = 2 and s2 = −1, we can get

1 3 2 −3 −1
y = s 1 x1 + s 2 x 2 = 2 + (−1) = + =
2 1 4 −1 3
By taking different values for s1 and s2 , we can get all sorts of y
vectors. This notion of linear combinations of vectors is foundational
to Linear Algebra, and we will have much more to say about it later
on.
6 "

4 ##"
-
3
1
2 #" =
2
3
#! =
1 1
!
28 27 26 25 24 23 22 21 0 1 2 3 4 5 6 7 8
21
LINEAR INDEPENDENCE
##$ 22
Draw a line (blue Length of #"#
dotted) through the tip .% =
23 Length of #"
of -, parallel to #$ , Length of ##$
24 intersecting the line of .! =
#" and constructing ##" . Length of #$
25 Construct #$# similarly. - = .% #" + .! #$

Fig. 2.3 Given any green vector y and the two vectors x1 and x2 , how to find the scaling
factors s1 and s2 in y = s1 x1 + s2 x2 ? This geometric construction shows that any vector
can be written as a linear combination of x1 and x2 .

Is it possible to have two sets of s1 and s2 for the same y? The

answer is no, and we can probably already see why. Suppose we
want to find the scaling factors for the green vector in Figure 2.3 so
that it can be written as s1 x1 + s2 x2 . Here is one way to think about
it: Draw a line through the tip of the green vector, parallel to the blue
26 Vectors, Matrices and Their Operations

vector x2 . Let it intersect the line of the red vector, giving us the light
red vector x′1 . The (signed) length of this vector x′1 divided by the
length of x1 will be s1 . The sign is positive if x1 and x′1 are in the
same direction, and negative otherwise. By a similar construction,
we can get s2 as well. From the definition of the addition of two
vectors (as the diagonal from 0 of the parallelogram of which the two
vectors are sides), we can see that y = s1 x1 + s2 x2 , as shown in
Figure 2.3. Since two lines (that are not parallel to each other) can
intersect only in one point (in R2 ), the lengths and the scaling factors
are unique.
The second question is whether we can get the zero vector

0
y=0= = s 1 x 1 + s 2 x2
0
without having s1 = s2 = 0 (when x1 and x2 are what we will
call linearly independent in a minute). The answer is again no,
as a corollary to the geometric “proof” for the uniqueness of the
scaling factors. But we will formally learn the real reasons (à la
Linear Algebra) in a later chapter dedicated to vector spaces and the
associated goodness.

2.3 Linear Independence of Vectors

In Figure 2.3, we saw that we can get any general vector y as a linear
combination of x1 and x2 . In other words, we can always find s1 and
s2 such that y = s1 x1 + s2 x2 . Later on, we will say that x1 and x2
span R2 , which is another way of saying that all vectors in R2 can be
written as a unique linear combination of x1 and x2 .
The fact that x1 and x2 span R2 brings us to another pivotal concept
in Linear Algebra: Linear Independence, which we will get back to,
in much more detail in Chapter 6. x1 and x2 are indeed two linearly
independent vectors in R2 .
A set of vectors are linearly independent of each other if none
of them can be expressed as a linear combination of the rest. For
R2 and two vectors x1 and x2 , it means x1 is not a scalar multiple
of x2 . Another equivalent statement is that x1 and x2 are linearly
independent if the only s1 and s2 we can find such that 0 = s1 x1 +
s2 x2 is s1 = 0 and s2 = 0.
Linear Independence of Vectors 27

6 "

-
3

2
1.5 3
#" = #! =
1 0.5 1
!
28 27 26 25 24 23 22 21 0 1 2 3 4 5 6 7 8
21
LINEAR DEPENDENCE
22
Line through the tip of -, parallel
23 to #$ , never intersects the line of
#" . We cannot construct .% or .!
24 (unless it is already in the line of
#" and #$ , in which case we get
25 an infinite number of .% and .! )

Fig. 2.4 Two linearly dependent vectors x1 and x2 , which cannot form a linear combi-
nation such that the green y = s1 x1 + s2 x2 .

Figure 2.3 shows two x1 and x2 that are linearly independent.

In Figure 2.4, we have another pair, x1 and x2 , that are linearly
dependent. When we try to construct y = s1 x1 + s2 x2 out of it, we
fail, unless y happens to be in the same line as x1 and x2 . Even in
that case, we do not get a unique pair of s1 and s2 , but an infinite
number of them.
Note that in Figure 2.4, x2 = 2x1 , or 2x1 − x2 = 0: The zero
vector can be written as a linear combination of x1 and x2 . Thus,
our test for linear independence fails for these two vectors, indicating
that they are, indeed, linearly dependent.

2.3.1 Vector Dot Product

Given some scalars or vectors, now we know how to multiply a vector

by a scalar and add two vectors. The operations between two scalars
are not interesting, and a scalar cannot add to a (nontrivial) vector.
So we are left with only one operation left to define, multiplication
of a vector by another vector. Let’s define a dot product between two
vectors.

Dot Product
Definition: For any two vector x1 and x2 , the dot product (x1 · x2 )
28 Vectors, Matrices and Their Operations

is defined as follows:

  
x11 x12
 x21   x22 
∀ x1 =  ..  and x2 =  ..  ∈ Rn ,
   
 .   . 
xn1 xn2
(2.4)
def
x1 · x2 = x11 x12 + x21 x22 + · · · + xn1 xn2
Xn
= xi1 xi2 ∈ R
1

In other words, we compute the dot product between two vectors

by multiplying their corresponding elements and summing up the
products. Clearly, if the numbers of elements are different for the
two vectors, the dot product is not defined. We cannot multiply
vectors of different dimensions.
The dot product of two vectors is also known as their scalar product
because the result of the operation is a scalar. This is not to be
confused with scalar multiplication, which is about scaling a vector
using a scalar. Another commonly used name for the dot product is
inner product.

Norm of a Vector
Definition: For a vector x, its norm, ∥x∥, is defined as follows:
 
x1
 x2  q
def
∀ x =  ..  ∈ Rn , ∥x∥ = x21 + x22 + · · · + x2n
 
.
xn (2.5)
v
u n
uX
=t x2i ∈ R
1
√
By the definition of the dot product, we can see that ∥x∥ = x · x.
Related to the dot product, the norm of a vector is a measure of its size.
Also note that if we scale a vector by a factor s, the norm also scales
by the same factor: ∥sx∥ = s∥x∥. If we divide a vector by its norm
(which means we perform scalar multiplication by the reciprocal of
the norm), the resulting vector has unit length, or is normalized. We
Linear Independence of Vectors 29

Cosine Similarity
In text analytics, in what they call Vector-Space Model, documents are represented as
vectors. We would take the terms in all documents and call each a direction in some
space. A document then would be a vector having components equal to the frequency
of each term. Before creating such term-frequency vectors, we may want to clean up
the documents by normalizing different forms of words (by stemming or lemmatization)
removing common words like articles (the, a, etc.) and prepositions (in, on, etc.), which
are considered stop words. We may also want to remove or assign lower weight to words
that are common to all documents, using the so-called inverse document frequency (IDF)
instead of raw term frequency (TF). If we treat the chapters in this book as documents,
words like linear, vector, matrix, etc. may be of little distinguishing value. Consequently,
they should get lower weights.
Once such document vectors are created (either using TF or TF-IDF), one common
task is to quantify similarity, for instance, for plagiarism detection or document retrieval
(searching). How would we do it? We could use the norm of the difference vector, but
given that the documents are likely to be of different length, and document length is not
a metric by which we want to quantify similarity, we may need another measure. The
cosine of the angle between the document vectors is a good metric, and it is called the
Cosine Similarity, computed exactly as we described in this section.
P
m
xi1 xi2
i=1
cos θ =
∥x1 ∥∥x2 ∥

How accurate would the cosine similarity measure be? It turns out that it would
be very good. If, for instance, we compare one chapter against another one in this
book as opposed to one from another book on Linear Algebra, the former is likely to
have a higher cosine similarity. Why? Because we tend to use slightly flowery (albeit
totally appropriate) language because we believe it makes for a nuanced treatment of
the intricacies of this elegant branch of mathematics. How many other Linear Algebra
textbooks are likely to contain words like albeit, flowery, nuance, intricacy etc.?

use the notation x̂ to indicate unit vectors.

x
x̂ =
∥x∥
The norm we defined above is the Euclidean Norm. Other norms
are possible, defined as below for various values of p:
v
u n
def u
X
∥x∥p = t p
|xi |p ∈ R (2.6)
i=1

This so-called p-norm becomes the Euclidean norm we defined above

for p = 2. For p = 1, it becomes what is known as the Manhattan
(or Taxicab) norm. As p → ∞, the p-norm becomes the maximum
30 Vectors, Matrices and Their Operations

absolute value of the elements of the vector (xi ), known as the infinity
norm, or simply the maximum norm.

6 "
3 4.96
!! = !# = . = 60°
4 20.60 5

!! = !# = 5 #" =
3
4 4
Dot Product Using angle:
1 3
!! 1 !# = !! !# cos . = 5×5× = 12.5
2 2
Using element-by-element multiplication:
' 1 =
!! 1 !# = 6 7$& 7$" = 3×4.96 + 4× 20.60
<= !
3
28 27 26 25 24 23 22 21 0 1 2 3 4 5 6 7 8
$%& 4.96
= 12.5 21 #$ =
20.60
The Projection (of !# on to !! ) is: 22
1
!# cos . = 5× = 2.5
2
23

Fig. 2.5 Dot product between two vectors x1 , x2 ∈ R2 . We compute the dot product
using the elements of the vectors as well as the angle between them and show that we get
the same result.

Dot Product as Projection: The dot product can also be defined using
the angle between the two vectors. Let’s consider two vectors x1 and
x2 with an angle θ between them. Then,

def
x1 · x2 = ∥x1 ∥∥x2 ∥ cos θ

Rearranging this definition, we can see that what it is describing is

the projection of one vector in the direction of the other.

x1 · x2 x1
∥x2 ∥ cos θ = = · x2 = x̂1 · x2
∥x1 ∥ ∥x1 ∥

x̂1 is the unit vector in the direction of the first vector. ∥x2 ∥ cos θ is the
projection of the second vector onto this direction. As a consequence,
if the angle between the two vectors θ = π2 , the projection is zero. If
the angle is zero, the projected length is the same as the length of the
second vector.
Linear Independence of Vectors 31

Quantum Computing
The backbone of Quantum Mechanics is, in fact, Linear Algebra. By the time we finish
Chapter 11 (Eigenvalue Decomposition), we will have learned everything we need to
deal with the mathematics of QM. However, we may not be able to understand the
lingo because physicists use a completely different and alien-looking notation. Let’s go
through this so-called Dirac (AKA "bra-ket") notation for two good reasons. Firstly, we
may come across it in our online searches on Linear Algebra topics. Secondly, closer
to our field, Quantum Computing is gathering momentum. And the lingo used in the
description of its technical aspects is likely to be the one from physics.
A vector in QM is a ket vector. What we usually write as x would appear as |xð in this
notation. The transpose of a vector is the bra vector, written as y T ≡ ïy|. Therefore, a
dot product x · y = xT y ≡ ïx|yð.
Now that we got started with QM, let’s go ahead and complete the physics story. The
QM vectors are typically the wave functions of the probability amplitudes. So, if we have
an electron with a wave function |ψð = ψ(x) (x being its location in a one dimensional
problem), what it is describing is the probability amplitude of finding the electron at x.
The corresponding probability is the square of the norm of this vector, which is ïψ|ψð.
There are a couple of complications here: Since |ψð is actually a function ψ(x), it
has a value at each point x, and if we are going to think of it as a vector, it is an infinite-
dimensional vector. Secondly, the values of |ψð can be (and typically are) complex
numbers. So when we take the norm, since we like the norm to be positive, we cannot
merely take the transpose, we have to take the complex-conjugate transpose (called the
Hermitian transpose, or simply conjugate transpose). Lastly, the analog of summation
of elements, when we have an infinity of them, is going to be an integral. The space in
which such wave functions live is called the Hilbert Space.
Finally, toward the end of this book, we will come across expressions like xT Ax
which will look like ïψ|H|ψð. The matrix A has become an operator H corresponding
to a physical observable (in this case, the energy, if H is the so-called Hamiltonian), and
the values we can get are, in fact, its eigenvalues, which can be thought of as the reason
why the observable can take only discrete, quantized values. That, in a nutshell, is how
we get the various allowed energy levels in a Hydrogen atom.

The two definitions of the dot product are equivalent, which is a

remarkable fact. In other words,

n
X
x1i x2i = ∥x1 ∥∥x2 ∥ cos θ
1

It can be proven using the Law of Cosines. Figure 2.5 shows the
equivalence of the two definitions using an example of two vectors in
R2 , one projecting onto the other.
32 Vectors, Matrices and Their Operations

2.4 Matrices

We introduced the notion of a matrix when we explored the conditions

of linearity for functions and expressions with two variables in §1.1.4,
page 15. Now it is time to describe it more completely and formally.
Exactly like vectors, matrices are tables of numbers. In Computer
Science, the numbers we are dealing with are typically real (which
we call floating or double in various programming languages). We
will, therefore, talk about matrices over the field of real numbers, and
write A ∈ Rm×n , which says A is a matrix over the field of reals,
with m rows and n columns (rows first, always).

2.5 Matrix Operations

The first two operations on matrices that we will define are identical
to the ones for vectors.

2.5.1 Scalar Multiplication

Definition: For any matrix A and any scalar s, the result of scalar
multiplication (sA) is defined as follows:

 
a11 · · · a1n
∀ A =  ... aij ..  ∈ Rm×n and s ∈ R,

. 
am1 · · · amn
  (2.7)
sa11 · · · sa1n
sA =  ... ..  ∈ Rm×n
def 
saij . 
sam1 · · · samn

In other words, when we multiply a matrix by a scalar, we multiply

each of its elements by the scalar. The resulting matrix is of the same
dimension: If A ∈ Rm×n , then sA ∈ Rm×n .
Properties of Scaling and Addition 33

2.5.2 Matrix Addition

Definition: For any two matrices A and B, their sum (the result of
their addition, A + B) is defined as follows:
   
a11 ··· a1n b11 ··· b1n
∀ A =  ... ..  and B =  .. ..  ∈ Rm×n

aij .   . bij . 
am1 · · · amn bm1 ··· bmn
  (2.8)
a11 + b11 ··· a1n + b1n
def  .. .. m×n
A+B =  ∈R

. aij + bij .
am1 + bm1 ··· amn + bmn

To add two matrices, we simply add the corresponding elements.

Naturally, the two matrices we are trying to add should have the same
dimensions.
As we can see, although the notations are slightly different, vectors
and matrices have the same operations, at least when it comes to scalar
multiplication and addition. Soon, we will see that the dot product,
a quintessentially vector-type operation (with the cosine of the angle
between them) is also, in fact, a matrix operation. The reason for this
connection is simple: A vector in Linear Algebra is a matrix with
just one column; it is a column matrix. Everything we specify for
matrices applies to vectors as well.
It may be worth our time to compare the first two operations that
we defined for vectors and matrices (namely, scalar multiplication
and addition) to the conditions of linearity (namely, homogeneity and
additivity, described in §1.1.3, page 15). The fact that they seem
parallel is not an accident, but points to the underlying structure of
Linear Algebra.

2.6 Properties of Scaling and Addition

We have to keep in mind that in matrix (and therefore vector) addition,

the dimensions should match: A + B is defined if and only if they
both have the same number of rows and columns. In particular, a
vector cannot add to a matrix (of more than one column). Vectors of
different lengths cannot interact with each other; they live in different
spaces. A zero vector in R3 (our 3-D space) is not the same as the
zero vector in R2 (a plane).
34 Vectors, Matrices and Their Operations

When dealing with matrices (or vectors) of compatible sizes, we

can list the mathematical properties of scalar multiplication and ad-
dition. Both these operations are:

Commutativity

The order in which the matrices (or vectors) appear in the operations
does not matter.

sx = xs; sA = As
x1 + x2 = x2 + x1 ; A + B = B + A

Associativity

We can group and perform the operations in any order we want.

s1 s2 x = s1 (s2 x) = (s1 s2 )x; s1 s2 A = (s1 s2 )A = s1 (s2 A)

x1 + x2 + x3 = (x1 + x2 ) + x3 = x1 + (x2 + x3 )
A + B + C = (A + B) + C = A + (B + C)

Distributivity

Scalar multiplication distributes over matrix addition.

s(x1 + x2 ) = sx1 + sx2 ; s(A + B) = sA + sB

(s1 + s2 )x = s1 x + s2 x; (s1 + s2 )A = s1 A + s2 A

These properties are not arbitrarily imposed, but the consequences

of the definitions of the operations of scalar multiplication and matrix
addition. In other words, they can be proven to be true starting from
the definitions.

2.7 Matrix Multiplication

We will now define and describe how matrices multiply. Two ma-
trices can be multiplied to get a new matrix, but only under certain
conditions. Matrix multiplication, in fact, forms the backbone of
much of subject matter to follow. For that reason, we will look at it
carefully, and from different perspectives.
Matrix Multiplication 35

2.7.1 Conformance Requirement

In order to multiply one matrix with another, the number of columns

of the first has to be the same as the number of rows of the second.
The product will have the same number of rows as the first matrix,
and the same number of columns as the second one. In other words,
if A ∈ Rm×k and B ∈ Rk×n , then they can be multiplied to give
AB ∈ Rm×n .
Note that this conformance requirement is different from the one
for matrix addition, where the individual matrices and the sum are of
the same size. In contrast, for matrix multiplication, if the individual
matrices are of the same size, they cannot be multiplied, unless they
both have the same number of rows and columns (in which case, we
will call the square matrices).

Element-wise Multiplication
Definition: For conformant matrices, we define the matrix multipli-
cation as follows: The element in the ith row and the j th column of
the product is the sum of the products of the elements in ith row of
the first matrix and the j th column of them second matrix.
More formally, for any two conformant matrices A ∈ Rm×k and
B ∈ Rk×n , their product (the result of their multiplication, C = AB)
is defined as follows:
   
a11 ···a1k b11 · · · b1n
A =  ... ..  and B =  .. ..  ,

ail .   . blj . 
am1 · · · amk bk1 · · · bkn
 
c11 · · · c1n
(2.9)
AB = C =  ... ..  ∈ Rm×n where

cij . 
cm1 ··· cmn
k
def
X
cij = ai1 b1j + ai2 b2j + · · · + aik bkj = ail blj
l=1

Since the definition of is bit cumbersome, let us take a look at what it

is saying using Figure 2.6.
Note that matrix multiplication is not commutative. AB ̸= BA,
in general. In fact, for general (non-square) matrices, if they are
conformant for the multiplication AB, they cannot be conformant
for BA and the product is not even defined.
36 Vectors, Matrices and Their Operations

! " = $
!;; !;< ï !;= %;; %;< ï %;? ';; ';< ï ';?
!<; !<< ï !<= %<; %<< ï %<? '<; '<< ï '<?
î î ó î = î î ó î
î î ó î
!>; !>< ï !>= %=; %=< ï %=? '>; '>< ï '>?

!++ ï !+, ï !+- %++ ï %+, ï %+0 '++ ï '+, ï '+0

î ó î ó î î ó î ó î î ó î ó î
!.+ ï !., ï !.- %.+ ï %., ï %.0 = '.+ ï '., ï '.0
î ó î ó î î ó î ó î î ó î ó î
!/+ ï !/, ï !/- %-+ ï %-, ï %-0 '/+ ï '/, ï '/0

Fig. 2.6 Illustration of matrix multiplication AB = C . In the top panel, the element
c11 (in red) of the product C is obtained by taking the elements in the first row of A
and multiplying them with the elements in the first column of B (shown in red letters and
arrows) and summing them up. c22 (blue) is the sum-product of the second row of A and
the second column of B (blue). In the bottom panel, we see a general element, cij (green)
as the sum-product of the ith row of A and the j th column of B (in green).

2.7.2 Vector Dot Product

Now that we have defined the general matrix multiplication AB =

C, with A ∈ Rm×k , B ∈ Rk×n , and C ∈ Rm×n , let’s consider a
special case when m = n = 1. Both A and B have k numbers in
them, arranged horizontally and vertically.
 
b1
 b2 
 
A = a1 a2 · · · ak = aT B =  ..  = b

.
bk (2.10)
k
X
C = [c11 ] = a1 b1 + a2 b2 + · · · + ak bk = ai b i = s
i=1

In this case, C has become a matrix of one row and one column,
which is the same as a scalar. We can, therefore, use the symbol s
to represent it. B is a column matrix with k rows, which is what
we earlier called a vector; remember, our vectors are all column
vectors. Let’s use the symbol b instead of B to stay consistent in our
Matrix Multiplication 37

notations. Now, A is a special matrix with k elements arranged in one

row. Some people call it a row vector, but let’s call it the transpose
of a column vector, a, and denoted it by aT . We get the transpose
of a matrix switching its rows and columns. In the case of a column
matrix (a vector), which is a sequence of numbers standing vertically,
its transpose happens when we make them lie down horizontally.
What we have written down above is mathematically identical to
our definition of vector dot product in Eqn (2.4), with a = x1 and
b = x2 . In other words, the vector dot product a · b is identical to
the matrix product aT b. Note that a · b = b · a and aT b = bT a, as
we can see from the summation in Equations (2.4) and (2.10). Since
this dot product is something we may need to cross-reference later
on, let’s restate it and give it a new equation number:
a · b = aT b = bT a = b · a (2.11)
As we see clearly now, the vector dot product is a special case of
matrix multiplication. In fact, all the operations we defined for vec-
tors are special cases of the corresponding operations for matrices,
which is not surprising because our vectors are merely matrices with
one column. We actually used matrix multiplication way back in
Eqn (1.1), where we called it a notational trick.

Multiplication as Dot Products

Definition: In AB = D (with A ∈ Rm×k , B ∈ Rk×n , and D ∈
Rm×n )5 , A consists of m rows riT , each of which have k elements
and B consists of n column vectors cj with k elements each. Then
the element dij of the product is the dot product of ri and cj .
dij = ri · cj = riT cj
Here, we are restating the matrix multiplication using the definition
of vector dot product, rather cyclically. We think of A consisting of
m vectors ri arranged horizontally. To be more precise, A consists
of m rows riT , since our vectors are column vectors, as are the n
columns in the matrix B which we call cj . Because of the matrix
dimensions, all riT and cj have k elements each, and we are allowed
to take their dot products.

5
We are using D (instead of C) as the product in order to avoid possible confusion between
the symbols for column vectors cj and the elements of the product matrix.
38 Vectors, Matrices and Their Operations

2.7.3 Block-wise Matrix Multiplication

We can perform matrix multiplication in a block-wise fashion, where

we segment the matrices we are multiplying into blocks that are
conformant sub-matrices, and multiply block-by-block. While this
statement may be of dubious usefulness, we have already used it
above, in defining matrix multiplication as dot product of the rows
of the first and the columns of the second. What we did was to
divide up the matrices into smaller blocks that are conformant for
multiplication and define the product in terms of the products of the
smaller blocks.
The block-wise multiplication brings out the recursive nature of the
operation. In the product C = AB, if A is segmented into m × k
blocks Ail and B into Blj (k × n of them), we can write the block
Cij as:
k
X
Cij = Ail Blj
l=1

provided that the block Ail is conformant for multiplication with Blj .
Compare this equation to Eqn (2.9) and we can immediately see that
the latter is a special case of partitioning A and B into blocks of
single elements.
In general, segmenting matrices into comformant blocks may not
be trivial, but we have advanced topics in Linear Algebra where it
comes in handy. For our purposes, we can think of few cases of simple
segmentation of a matrix and perform block-wise multiplication.

Special Cases of Block-wise Multiplication

Here is a situation where block-wise multiplication makes sense: If
we have A ∈ Rn×n and B ∈ Rn×2n , we can certainly multiply them
and get AB ∈ Rn×2n . We can also think of B as composed of two
matrices, B1 and B2 , both of size Rn×n , arranged side-by-side. In
other words, B consists of two blocks of square matrices and can
be written as B = [B1 | B2 ]. The notation [C | D] stands for a
new matrix in which we have the columns of C and D side-by-side.
Later on, we will call such matrices “augmented matrices.” Now,

AB = A[B1 | B2 ] = [AB1 | AB2 ]

Matrix Multiplication 39

which is like distributing the matrix multiplier A over the operation

of segmenting the matrix into blocks. We will use this type of block-
wise matrix multiplication in understanding some of the techniques
later. We have keep it in mind that the order in which the matrices
appear in the product is significant because matrix multiplication is
not commutative. In particular, BA ̸= [B1 A | B2 A]. In fact, the
B and A are not even conformant for the multiplication BA.
Another case where block-wise multiplication makes sense is when
we think of the second matrix as blocks of column vectors. In other
words, we have A ∈ Rm×n and B ∈ Rn×k and if we think of B as k
column vectors bi standing side-by-side, which we write as B = [bi ].
Then we have the block-wise multiplication,

AB = A[bi ] = A[b1 | b2 | · · · | bk ] = [Ab1 | Ab2 | · · · | Abn ]

Earlier, we spoke of Ax = b as set of m linear equations on n

unknowns. We can think of the product AB as shown in this block-
wise multiplication as a collection of k such sets. We will circle back
to this notion in Chapter 5.

2.7.4 Column and Row Pictures of Matrix Multiplication

We can think of our first matrix as composed of column vectors, and

the product as a linear combination of these vectors, which gives us
the column picture of matrix multiplication. This column picture is a
view that underpins several critical concepts in Linear Algebra, and
we will come back to it time and again in this book.
Remember, we defined the Linear Combinations of vectors in
§2.2.3 (page 25). Let’s now restate the definition of matrix multipli-
cation using the notion of the linear combinations of the columns of
the first matrix, scaled by the elements of the second matrix. To keep
it simple, let’s first consider the multiplication of a matrix A ∈ Rm×n
by a column vector x ∈ Rn . The matrices are conformant, and the
multiplication is allowed. We will get a vector as the product, which
we will call b ∈ Rm . We can think of A as being composed of n
column vectors (ci ∈ Rm ) standing side-by-side. The product Ax
is then the linear combination of the columns of A, scaled by the
components of x.
40 Vectors, Matrices and Their Operations


  x1
| | |  x2 
A = c1 c2 · · · cn  ∈ Rm×n x =  ..  ∈ Rn
 
| | |
. (2.12)
xn
Ax = b = x1 c1 + x2 c2 + · · · + xn cn ∈ Rm
Similarly, multiplication on the left gives us the row picture. Take
xT to be a matrix of single row (xT ∈ R1×m ), which means x is
column vector x ∈ Rm . And A ∈ Rm×n . Now the product xT A is
a row matrix bT ∈ R1×n . The row picture says that xT A is a linear
combination of the rows of the matrix A.

COLUMN PICTURE ROW PICTURE

4! 4" 4# 4$
3 3 4 ,%!
7 1 1 4
5 5 1 ,"%
(= 5 8 0 7 2= 25 = 7 1 1 4 (=
2 2 3 ,%#
6 9 2 5 ,%$
7 7 2

(2 = 341 + 54$ + 242 + 743 25( = 7651 + 65$ + 652 + 7653

7 1 1 4 =73 4 + 5 1 + 2 3 +4 7 2
=3 5 +5 8 +2 0 +7 7
6 9 2 5 = 21 28 + 5 1 + 2 3 + 28 8
21 5 2 28 56
= 15 + 40 + 0 + 49 = 104 = 56 40
18 45 4 35 102

Fig. 2.7 Illustration of matrix multiplication as column and row pictures. On the left,
we have Ax, where the product (which is a column vector we might call b) is the linear
combination of the columns of A. On the right, we have xT A, where the product (say bT )
is a linear combination of the rows of A.

Since the column and row pictures are hard to grasp as concepts, we
illustrate them in Figure 2.7 using example matrices with color-coded
rows and columns. We can use the basic element-wise multiplication
(Eqn (2.9)) of matrices and satisfy ourselves that the column and row
pictures indeed give the same numeric answers as element-by-element
multiplication. To put it as a mnemonic, matrix multiplication is the
linear combination of the columns of the matrix on the left and of the
rows of the matrix on the right.
Matrix Multiplication 41

The column and row pictures are, in fact, special cases of the block-
wise multiplication we discussed earlier. In the column picture, we
are segmenting the first matrix into blocks that are columns, which are
conformant for multiplication by single-element blocks (or scalars)
of the second matrix. The sum of the individual products is then
the linear combinations of the columns of the first matrix. We can
visualize the row picture also as a similar block-wise multiplication.
If, instead of a vector x, we had a multi-column second matrix,
then the product also would have multiple columns. Each column of
the product would then be the linear combinations of the columns of
A, scaled by the corresponding column of the second matrix. In other
words, in AB, A ∈ Rm×k , B ∈ Rk×n , the product has n columns,
each of which is a linear combination of the columns of A.
Considering this a teachable moment, let’s look at it from the
perspective of block-wise multiplication once more. Let’s think of
B as composed on of n column matrices stacked side-by-side as
B = [b1 | b2 | · · · | bn ]. Then, by block-wise multiplication, we
have:

AB = A[b1 | b2 | · · · | bn ] = [Ab1 | Ab2 | · · · | Abn ]

which says that each column of the product is Abi , which, by the
column picture of matrix multiplication, is a linear combination of
the columns of A using the coefficients from bi .

2.7.5 Vector Inner and Outer Products

Vector dot product is commutative: a · b = b · a. This property can

be easily verified by examining its definition in Eqn (2.10). When
written using the matrix notation, commutativity means: aT b = bT a.
However, note that matrix multiplication by itself is not commutative.
In particular, aT b ̸= abT (although both multiplications are between
conformant matrices and therefore allowed). The latter is, in fact, not
a scalar, but a matrix of size k × k if a, b ∈ Rk . The matrix abT
is called the outer product (as opposed to the dot product, which is
the inner product) of the two vectors. The outer product has certain
properties that will become important to us later in the book.
42 Vectors, Matrices and Their Operations

Groups, Rings and Fields

2.8 Generalized Vectors

We should keep in mind that, in its generality, vectors are defined

only by their operations, namely scalar multiplication, addition, and
the inner product. Any set of mathematical entities, literally any
at all, for which we can consistently define these operations with
the right commutative, additive and associative properties can be
treated as vectors. Once we do that, the vast machinery of Linear
Algebra stands ready to help us ensure consistency, derive insights
and deepen our understanding further, which is the way it is used
Generalized Vectors 43

Convolution
Convolution in image processing involves sliding a small matrix (kernel) over an image.
At each position, the kernel’s values are multiplied with the underlying image pixels, and
the results are summed to form a new pixel value in the output image. This operation is
used for tasks like blurring, edge detection, and feature extraction.
Here’s how convolution works in the context of image processing:
• Input Image: A two-dimensional matrix representing an image’s pixel values.
• Kernel/Filter: A smaller matrix with numerical values defining the convolution
operation.
• Sliding: The kernel is systematically moved over the image in small steps.
• Element-Wise Multiplication: Values in the kernel and underlying pixels are
multiplied.
• Summation: The products are added up at each kernel position.
• Output: The sums are placed in the output matrix, which is also known as the
feature map or convolved image.
Convolution is used for various image processing tasks:
• Blurring/Smoothing: By using a kernel with equal values, convolution can
smooth an image, reducing noise and sharp transitions.
• Edge Detection: Specific kernels can detect edges in an image by highlighting
areas with rapid intensity changes.
• Feature Extraction: Convolution with various filters can extract specific fea-
tures from images, such as texture or pattern information.
• Sharpening: Convolution with a sharpening filter enhances edges and details in
an image.
It is left as an exercise to the student to look up how exactly the convolution is performed,
and whether it is linear.

in modern physics, most notably in quantum mechanics and special

relativity.

Get the Full Edition of LA4CS with

Summaries, Exercises and Solutions
Only $7.95. Scan, Click or Tap to buy.
3
Transposes and
Determinants

I would rather have questions that can’t be answered than

answers that can’t be questioned.
—Richard Feynman

Now that we have defined matrices and mastered the basic oper-
ations on them, let’s look at another operation that comes up very
often in Linear Algebra, namely taking the transpose of a matrix. We
will also introduce the concept of the determinant of a matrix, which
a single number with a lot of information about the matrix, and with
a nice geometrical interpretation. Also in this chapter, we will go
over the nomenclature of various special matrices and entities related
to matrices with a view to familiarizing ourselves with the lingo of
Linear Algebra. This familiarity will come in handy in later chapters.

3.1 Transpose of a Matrix

We get the transpose of a matrix by flipping it over its main diagonal.

Before defining it more formally, let’s first look at an example in
Transpose of a Matrix 45

Eqn (3.1), where the so-called main diagonal is highlighted in bold.

The transpose of a matrix is obtained, basically, by switching the
rows and columns.
 
  7 8 6
7 1 1 4
1 8 9
A = 8 8 0 7 ∈ R3×4  ∈ R4×3

 AT = 
1 (3.1)
0 2
6 9 2 8
4 7 8
The main diagonal is the elements whose row number is the same
as the column number: aii . Flipping a matrix over it means we take
the element aij and put in the location of aji .
To make future definitions easier to write, let’s
first introduce a
m×n
shorthand notation for a matrix A ∈ R as aij , where we expect
ourselves to understand, from the context, that we mean aij to stand
for a typical element in A, with the row index i, 1 f i f m and
the column index j, 1 f j f n. With this notation, we can state the
definition of the transpose of a matrix as follows:

Matrix Transpose
Definition: For any matrix A = aij ∈ Rm×n , its transpose is
def
defined as AT = aji ∈ Rn×m .

3.1.1 Properties of Transposes

From the definitions of transposes and the basic operations of matri-

ces, we can prove the following properties.
1. Transpose of a Transpose: If we take the transpose twice, we
get the original matrix back.
T
AT = A

2. Transpose of a Sum: The transpose of the sum of two matrices

is the sum of the transposes of the individual matrices.
(A + B)T = AT + B T

3. Scalar Multiplication: The operation of taking the transpose

of a matrix commutes with scalar multiplication.
(sA)T = sAT
46 Transposes and Determinants

4. Transpose of a Scalar: Scalars can be considered a matrix of

one row and one column. Therefore, it is its own transpose.

s ≡ [s] =⇒ sT = [s]T = [s] = s

3.1.2 Product Rule

The transpose of the product of two matrices is the product of the

transposes, taken in the reverse order.

(AB)T = B T AT

This product rule also can be proven by looking at the (i, j) element
of the product matrices on the left and right hand sides, although it is
a bit tedious to do so. Before proving the product rule, let’s look at
the dimensions of the matrices involved in the product rule and easily
see why the rule makes sense.
Let’s consider A ∈ Rm×k and B ∈ Rk×n so that AB ∈ Rm×n .
The dimensions have to be this way by the conformance requirement
of matrix multiplication, which says that the number of columns of
the first matrix has to be the same as the number of rows of the
second one. Otherwise, we cannot define the product. In particular,
for m ̸= n, BA is not defined.
By the definition of transpose, we have

A ∈ Rm×k =⇒ AT ∈ Rk×m and B ∈ Rk×n =⇒ B T ∈ Rn×k

Note that the number of columns of B T is the same as number of rows

of AT . Therefore, the product B T AT is well defined, while AT B T
cannot be defined (unless m = n). Furthermore, B T AT ∈ Rn×m .
And, from the definition of the transpose of a matrix again, we
know that
AB ∈ Rm×n =⇒ (AB)T ∈ Rn×m

which is the same as the dimensions of product of the transposes

in the reverse order: B T AT ∈ Rn×m . Therefore, at least from the
perspective of conformance of matrix multiplication, the product rule
of transposes makes sense.
Transpose of a Matrix 47

Let’s illustrate this product rule of transposes using an example

with two simple matrices A and B:
 
3 4
1 2 3
A= and B = 5 6
4 5 6
7 8

34 40 T 34 79
AB = =⇒ (AB) =
79 94 40 94
 
1 4
3 5 7
BT = and AT = 2 5
4 6 8
3 6

34 79
B T AT = = (AB)T
40 94

We have postponed the proof for the product rule of transposes for
as long as possible. Now, we will present two proofs.

Proof 1: Here, we will use the element-wise multiplication of matrices

to prove the product rule. In C = AB, where A ∈ Rm×k and
B ∈ Rk×n , let’s
denote AT = A′ = a′ij , B T = B ′ = b′ij and
C T = C ′ = c′ij .

k
X
(1) Element-wise matrix multiplication: cij = aip bpj
p=1

(2) Definition of transpose: c′ij = cji

k
X
(3) Using (2), and i ´ j: = ajp bpi
p=1
k
X
(4) Definition of transpose: = a′pj b′ip
p=1
k
X
(5) Rearranging: c′ij = b′ip a′pj
p=1

(6) Recognizing (5) as matrix product: C = B T AT

Proof 2: Here, we will use the dot-product view of matrix multiplica-

tion to prove the product rule. In C = AB, let’s denote C T = C ′ .
Noting that the element cij of C is the dot product of ith row of A
48 Transposes and Determinants

and j th column of B, we can write:

(1) The element cij as dot product: cij = ai · bj
(2) Definition of transpose: c′ij = cji
(3) Using (1), and switching i ´ j: cji = aj · bi
(4) Since dot product is commutative: cji = bi · aj
(5) Using (2) and the definition of transposes: c′ij = bi · aj = b′i · a′j
(6) Writing (5) in matrix form: C ′ = C T = B T AT

In (5), we used the fact that the ith row of a matrix is the same as
the ith column of its transpose. In both proofs, we have shown that
C T = (AB)T = B T AT .

3.2 Definitions and Matrices with Special Properties

Main Diagonal: The line of elements in a matrix with the same row
and column indexes is known as the main diagonal. It may be
referred to as the leading diagonal as well. In other
words, it
is the line formed by the elements aii in A = aij ∈ Rm×n .
Note that number of rows and columns the matrix A does not
have to be equal.

Square Matrix: When the number of rows in a matrix is the same as

the number of columns, we call it a square matrix. If A ∈ Rn×n
then A is a square matrix.
Here is an interesting fact: For any matrix A ∈ Rm×n , AAT ∈
Rm×m and AT A ∈ Rn×n . Therefore, for any matrix (of any
size), its product with its transpose (multiplying on the right or
left) is always a square matrix.

Symmetric Matrix: A matrix is symmetric when its transpose is

identical to itself. AT = A =⇒ A is symmetric. This condi-
tion cannot be satisfied by a non-square matrix, and therefore
all symmetric matrices are square matrices. For a symmetric
matrix, aij = aji .
For a square matrix A ∈ Rn×n , A + AT is always a symmetric
matrix, which is easily proved by the fact that the transpose of
the sum of two matrices is the sum of their transposes.
Definitions and Matrices with Special Properties 49

Skew-Symmetric Matrix: A matrix is skew symmetric (AKA anti-

symmetric) when its transpose is the negative of itself. AT =
−A =⇒ A is skew symmetric. Like in the previous case,
this condition also cannot be met by a non-square matrix, and
therefore all skew-symmetric matrices are also square matrices.
For a skew-symmetric matrix, aij = −aji .
For a square matrix A ∈ Rn×n , A − AT is always a skew-
symmetric matrix, which can be proven using the same tech-
nique as for A + AT being symmetric.
Gram Matrix: Another interesting fact: For any matrix A ∈ Rm×n ,
T
what is AT A ? By the product rule of matrix transposes,
T T
AT A = AT AT = AT A
In other words, AT A is symmetric. So is AAT , by the same
argument. AT A is called the Gram Matrix.
Diagonal Matrix: A matrix that has zero elements everywhere other
than the main diagonal elements is called a diagonal matrix.
It is usually a square matrix, but we can call a non-square
matrix also a diagonal matrix without ambiguity because we
have a clear
definition for the main diagonal. Using symbols,
A = aij ∈ Rm×n is a diagonal matrix if aij = 0 for all i ̸= j.
Note that we do not specify the values of the diagonal elements
in anyway: They may or may not be zero. In particular, a zero
matrix (with all zero elements) is also a diagonal matrix.
Identity Matrix: The identity matrix is a square, diagonal matrix
with ones along the diagonal and zeros everywhere else. A = I
if and only if A ∈ Rn×n , aii = 1 and aij = 0 for i ̸= j. We
use the symbol I or In to refer to the identity matrix.
Unit Vectors We can think of the columns of the identity matrix as
unit vectors defining directions in space. For I3 ∈ R3×3 , the
unit vectors (î, ĵ and k̂, as some people, especially physicists,
denote them) would be:
       
1 0 0 1 0 0
I= 0 1 0
  î = 0
  ĵ = 1
  k̂ = 0

0 0 1 0 0 1
50 Transposes and Determinants

Now, we can write any general vector, say,

 
x
x = y  as x = xî + y ĵ + z k̂

z
Note that the unit vectors, as defined here, all have unit length.
They are also perpendicular to each other because the dot prod-
uct of any distinct pair of them is zero.
Upper and Lower Triangular
Matrices: An upper triangular ma-
m×n
trix (U = uij ∈ R ) is the one with all the elements
below the main diagonal zero.
uij = 0 for i > j

Similarly, a lower triangular matrix (L = lij ∈ Rm×n ) is the
one with all the elements above the main diagonal zero.
lij = 0 for i < j
Here are some examples:

a 0 a b
L= ∈ R2×2 U= ∈ R2×2
c d 0 d

Again, note that we do not specify anything about nonzero

elements in the definitions. They can be zeros as well, and
a diagonal matrix, in principle, is both an upper and a lower
triangular matrix at the same time.
Inverse of a Matrix: A (square) matrix (A), when multiplied by its
inverse (A−1 ) will result in the identity matrix (I): A−1 A =
AA−1 = I. For non-square matrices, we can have left and
right inverses, but they will be different from each other.
Singular Matrix: Not all square matrices have inverses. If a matrix
is noninvertible, it is called singular.

3.3 Determinant of a Matrix

The determinant of a square matrix is a scalar value derived out

of its elements. It contains a large amount of information about
Determinant of a Matrix 51

Morphisms
Earlier (in §1.1.5, page 16), we talked about how a matrix A ∈ Rm×n as encoding a
linear transformations between Rn and Rm . A : Rn 7→ Rm . Other names for a linear
transformation include linear map, linear mapping, homomorphism etc. If n = m, then
A maps Rn to itself: Every vector (x) in Rn , when multiplied by A, gives us another
vector b ∈ Rn . The name used for it is a linear endomorphism. Note that two different
vectors x1 , x2 do not necessarily have to give us two different vectors. If they do, then
the transformation is called an automorphism. Such a transformation can be inverted:
We can find another transformation that will reverse the operation.
To complete the story of morphisms, a transformation Rn 7→ Rm (with n and m not
necessarily equal) is called an isomorphism if it can be reversed, which means that it is
a one-to-one and onto mapping, also known as bijective. If two vectors in Rn (x1 and
x2 ) map to the same vector in Rm (b), then given b, we have no way of knowing which
vector (x1 or x2 ) it came from, and we cannot invert the operation.
Of course, these morphisms are more general: They are not necessarily between the
so-called Euclidean spaces Rn , but between any mathematical structure. We, for our
purposes in computer science, are interested only in Rn though.
Let’s illustrate the use of the idea of isomorphisms with an (admittedly academic)
example. We know that the number of points in a line segment between 0 and 1 is
infinite. So is the number of points in a square of side 1. Are these two infinities the
same?
If we can find an isomorphism from the square (in R2 ) to the line (R), then we can
argue that they are. Here is such an isomorphism: For any point in the square, take its
coordinates, 0 < x, y < 1. Express them as decimal numbers. Create a new number x′
by taking the first digit of x, then the first digit of y, followed by the second digit of x,
second digit of y and so on, thereby interleaving the coordinates into a new number. As
we can see, 0 < x′ < 1, and this transformation T is a one-to-one mapping and onto,
or an isomorphism: T : (x, y) 7→ x′ : R2 7→ R. It is always possible to reverse the
operation and find x and y given any x′ , T −1 : x′ 7→ (x, y) : R 7→ R2 . Therefore the
infinities have to be equal.
Another transformation that is definitely not an isomorphism is the projection op-
eration: Take any point (x, y) and project it to the x axis, so that the new x′ = x.
P : (x, y) 7→ x : R2 7→ R maps multiple points to the same number, and it cannot be
reversed. P −1 does not exist.
Interestingly, P is a linear transformation, while T is not.

the matrix and the linear transformation (see §1.1.5, page 16) that
it represents. We use the symbol det(A), ∆A or |A| to denote the
determinant of the square matrix A.
Before actually defining the determinant, let’s formally state what
we said in the previous paragraph.

A = aij ∈ Rn×n

|A| = f (aij ) where
52 Transposes and Determinants

which says that the determinant is a function (a hitherto unknown one)

of the elements of the square matrix. We will soon specify what the
function is. Before defining it, however, let us list some interesting
facts about the determinant.
Thinking of A ∈ Rn×n as a linear transformation, we can say that
|A| = 0 means that it is not a one-to-one mapping or an isomor-
phism. In other words, A takes multiple vectors in Rn to the same
transformed vector (again in Rn ). See the box titled “Morphisms”
for the formal nomenclature and some more details.
If the determinant is not zero, then A represents an isomorphism,
and we can invert its transformation. We can find another matrix to do
the reversal, and we will call it the inverse of A, and denote it using the
symbol A−1 , which is indeed the same inverse we defined on page 50.
The transformations that are not isomorphisms cannot be reversed,
and the corresponding matrices are noninvertible; they are singular
matrices. To use the right mathematical lingo, the determinant |A|
being nonzero is a necessary and sufficient condition for the matrix
A to be invertible.
As we shall see later, when the matrix A is invertible, the system
of linear equations Ax = b has a unique solution. We can think of
this characteristic of the determinant an algebraic property because it
says something about the solutions of linear equations. In addition,
we will find a geometric property as well, stating that the determinant
behaves like the volume of a parallelepiped that the matrix represents,
which we will explore once we define what the determinant is. Par-
allelepiped, by the way, is the higher-dimensional generalization of a
parallelogram, which is a diamond-shaped, sheared rectangle in 2-D.

3.3.1 2 × 2 Matrices
If we have a matrix A ∈ R2×2 as in the equation below, its determinant
is defined as |A| = ad − bc. To restate it formally,

a c a c def
A= ∈ R2×2 |A| = = ad − bc ∈ R (3.2)
b d b d
We can think of A as having two column vectors (c1 and c2 ) standing
side-by-side.

a c 2×2 a c
A= ∈R c1 = , c2 = , ∈ R2
b d b d
Determinant of a Matrix 53

1.25 "

TRANSFORMATION: MEANING 0
1 1
K
1 > ? 1 > L
= = =
0 @ A 0 @ 0.75

0 > ? 0 ?
= = = 0.5
1 @ A 1 A
1 0 > ? 0.25 I
B= §==
0 1 @ A J
22 21.75 21.5 21.25 21 20.75 20.5 20.25 0 0.25 0.5 0.75 1 1.25 1.5 1.7
1
§ 20.25 0

2 = 1 § ! = 45 2 78 20.5

20.75

Fig. 3.1 The matrix A transforms the unit vectors to its columns, a1 and a2 , thus
transforming the unit square to a parallelogram.

How does A transform the unit vectors? (The ith unit vector is the ith
column of the identity matrix I). The transformed versions are just
c1 and c2 , the columns of A.

1 a c 1 a 0 a c 0 c
A = = = c1 A = = = c2
0 b d 0 b 1 b d 1 d

It is easiest to use the column picture of matrix multiplication (§2.7.4,

page 39) to understand this fact: The first product above is a trivial
linear combination of the columns of A, taking one of the first column
and zero of the second column (1 × c1 + 0 × c2 ), giving us the first
column c1 of A back. Similarly the second unit vector transforms to
the second column c2 of A.
As we can see in Figures 3.1 and 3.2, the transformation that A per-
forms on the unit square with vertices at (0, 0), (1, 0), (1, 1) and (0, 1)
takes it to a parallelogram with vertices at (0, 0), (a, b), (a + c, b + d)
and (c, d). And the area of this parallelogram, as Figure 3.2 proves
without words1 , is indeed the determinant as defined in Eqn (3.2).
In Rn , instead of the area, we get the (signed) volume of hyper-
parallelepiped. Note that we have proved only the absolute value of

1
The proof is recreated from Mathematics StackExchange, attributed to Solomon W. Golomb.
If the geometrical version of the proof is hard to digest, there is an algebraic version as well.
54 Transposes and Determinants

(L, O + M) - (L + N, O + M)

.
(0, M)
(N, M) (L, M) (L + N, M)

(L, O)

(0,0) (L, 0)

L N
= LM 2 ON = 2 =
O M

Fig. 3.2 The parallelogram that results from the transformation of the unit square by the
action of A. The picture proves, without words, that its area is the same as |A|.

the area in Figure 3.2; we will learn more about the sign part in the
following examples.
In Figure 3.3, we have three different examples of A in R2 and
their determinants. In the left panel, we see how A transforms the
red and blue unit vectors (which have unit elements in the first and
second dimension respectively). The red unit vector gets transformed
to the first column of A (shown in a lighter shade of red), and the
blue one to the second column (light blue vector). If we complete the
parallelogram, its area is 2, which is the determinant, |A|.
In the middle panel of Figure 3.3, we have a different A, which
does something strange to the red and blue unit vectors: They get
transformed to a line, which is a collapsed parallelogram with zero
area. And by the definition of |A|, it is indeed zero. Looking at the
column vectors in A, we can see that they are scalar multiples of each
other; they are not linearly independent.
In the right panel of the same figure, we have shuffled the columns
of A, so that parallelogram is the same as in the left panel, but the
determinant now is negative. We, therefore, call the determinant the
signed area of the parallelogram formed with the column vectors as
adjacent sides. Why is the area negative in this case? It is because A
has flipped the order of the transformed vectors: Our blue unit vector
is to the left of the red one. In the left panel, the transformed blue
Determinant of a Matrix 55

6 " 6 " 6 "

2 2 2 4 2 2
== == ==
5 1 2 5
1 2 5 2 1
= = 2×2 2 2×1 = = 2×2 2 4×1 = = 2×1 2 2×2
4 =2 4
=0 4 = 22
># = 2>!
3 0 2 3 3 1 2
§ §
1 2 0 4 0 2
2 2 § 2
1 2

1 1 2 1 1 2 1 0 2
§ § §
0 1 0 1 1 1
21 0 1 2 3 4 5 21
6 0 1 2 3 4 5 21
6 0 1 2 3 4 5 6
21 Blue unit vector is to 21 Blue unit vector is to 21 Blue unit vector is to
the left of the red one the left of the red one the left of the red one
22 22 22
Blue transformed vector also Blue transformed vector is Blue transformed vector is to
23 to the left of the red one 23 on top of the red one 23 the right of the red one
The determinant is positive The determinant is zero The determinant is negative
24 24 24

25 25 25

Fig. 3.3 Determinants as areas: The matrix A transforms the unit square (shaded grey)
into the amber parallelogram. The determinant |A| is the signed area of this parallelogram.
The sign is negative when the transformed unit vectors “flip.”

vector is still to the left. But in the right panel, the blue one has gone
to the right of the red one after transformation, thereby attributing a
negative sign to the determinant.
To use a more formal language, to go from the first (red) unit vector
to the second (blue) one, we go in the counterclockwise direction.
The direction is the same for the transformed versions in the left panel
of Figure 3.3, in which case the determinant is positive. So is the
area. In the right panel, the direction for the transformed vectors is
clockwise, opposite of the unit vectors. In this case, the determinant
and the signed area are negative.

3.3.2 3 × 3 Matrices
We have defined the determinant of a 2 × 2 matrix in Eqn (3.2).
We
now extend it to higher dimensions recursively. For A = aij ∈
R3×3 , we have:
 
a11 a12 a13
A = a21 a22 a23  ∈ R3×3
a31 a32 a33 (3.3)
def a a a a a a
|A| = a11 22 23 − a12 21 23 + a13 21 22 ∈ R
a32 a33 a31 a33 a31 a32

Since each of the 2 × 2 determinants in Eqn (3.3) are defined in

Eqn (3.2), we have a recursive formula for the determinant of A ∈
56 Transposes and Determinants

R3×3 . Notice that the first term has a positive sign, the second one
a negative sign and the third a positive sign again. This pattern of
alternating signs extends to higher dimensions as well.

Minors and Cofactors: The first submatrix whose determinant ap-

pears in Eqn (3.3) (multiplying a11 ) is obtained by removing the first
row and column from A. Generalizing, the one multiplying aij is the
determinant of the submatrix obtained by removing the ith row and
the j th column. The determinants of such submatrices are called the
minors of A, denoted by Mij .
The minor with the associated sign is called the cofactor, Cij =
−1i+j Mij . What we did in Eqn (3.3) was to expand the determinant
along the first row. We could have done it along the first column as
well. In fact, we can compute the determinant by expanding along any
row or column and summing up the cofactors. With these definitions
of minors and cofactors, we rewrite Eqn (3.3) more compactly as
follows (where we are expanding |A| along the ith row):
3
X 3
X
i+j
|A| = (−1) aij Mij = aij Cij (3.4)
j=1 j=1

Determinant as Volume: For 2 × 2 matrices, we saw that the deter-

minant was the (signed) area of the parallelogram to which the unit
square transformed. In the 3 × 3 case, it becomes the volume of the
parallelepiped that is the transformed version of the unit cube. It is
also signed: If two unit vectors flip orientation when transformed,
the determinant gets multiplied by −1, which means if one more unit
vector flips, the sign reverts back to the original.

3.3.3 n × n Matrices
We can extend the notion of volume to Rn . If we think of A as being
composed of n column vectors,
 
| | |
A = c1 c2 · · · cn  ∈ Rn×n ci ∈ Rn
| | |
the determinant, |A|, is the signed volume of the n-dimensional par-
allelepiped formed with edges ci .
Determinant of a Matrix 57

7 1 1 4 7 1 1 4 7 1 1 4 7 1 1 4 7 1 1 4
5 8 0 7 5 8 0 7 5 8 0 7 5 8 0 7 5 8 0 7
|A| = =7 21 +1 24
6 9 2 5 6 9 2 5 6 9 2 5 6 9 2 5 6 9 2 5
3 5 2 7 3 5 2 7 3 5 2 7 3 5 2 7 3 5 2 7

8 0 7 5 0 7 5 8 7 5 8 0
=7 9 2 5 21 6 2 5 +1 6 9 5 24 6 9 2
5 2 7 3 2 7 3 5 7 3 5 2

Fig. 3.4 Illustration of minors and cofactors using an example 4 × 4 matrix. The minor is
the determinant of the submatrix obtained by removing the row and column corresponding
to each element, as shown. Notice the sign in the summation. The minor with the associated
sign is the cofactor.

Laplace Formula: What we wrote down in Eqn (3.4) is, in fact, the
general version of the recursive formula for computing the determi-
nant, expanding over the ith row.
n
X n
X
i+j
|A| = (−1) aij Mij = aij Cij (3.5)
j=1 j=1

This is the Laplace formula (or expansion). We could do the expan-

sion over the j th column as well.
n
X n
X
i+j
|A| = (−1) aij Mij = aij Cij
i=1 i=1

In fact, one of the properties of determinants is that they are invariant

under row-column transposition, which is to say, when taking the
transpose of the matrix. In other words |A| = AT .

3.3.4 Properties of Determinants

We went over some of the characteristics and properties of determi-

nants. Let’s list them all here for completeness.

1. Identity matrices (of any dimension, I ∈ Rn×n ) have a deter-

minant of one.
|I| = 1
This property is consistent with the fact that a hypercube of
unit sides has unit (hyper)volume.
58 Transposes and Determinants

2. If we exchange two rows (or two columns) the determinant

changes sign. (As a consequence, it should be impossible to
get back to the same matrix by performing an odd number of
row or column exchanges.)
3. If we multiply a row (or a column) by a scalar s, the determinant
gets multiplied by the same factor:

a c sa sc sa c
A= A1 = A2 =
b d b d sb d
=⇒ |A1 | = |A2 | = s |A|
Since determinants behave like volumes, we can see that if we
double one side of a hypercube or a parallelepiped, its volume
gets doubled. If we double all sides, then its volume gets
multiplied by 2n .
4. If we can express all elements in one row (or a column) as sums
of two numbers each, we can split the whole determinant as
sum of two determinants.
a + a′ c + c′ a c a′ c′
= +
b d b d b d

5. If one row (or columns) is all zeros, then the determinant is

zero. This statement should be self-evident from the definition
of determinants (Eqn (3.5)): We expand the Laplace formula
over the row or column with all zeros to get |A| = 0.
6. The determinant does not change if we add or subtract any mul-
tiple of one row (or column) from any other row (or column).
While this property may sound a bit mysterious, it corresponds
to the fact that the matrix equations (such as Ax = b) actu-
ally encode systems of linear equations, which we can add and
subtract without affecting their solvability or solutions.
7. If two rows (or columns) of the matrix are the same, the de-
terminant is zero, which is a corollary of the previous two
properties.
8. For a triangular matrix (e.g., upper triangular matrix, U : uij =
0 for i > j), the determinant is the product of the diagonal
Determinant of a Matrix 59

Deriving |A|
It is possible to start from the properties (listed in §3.3.4, page 57) and derive the formula
for the determinant for a 2 × 2 matrix, which is an interesting exercise.
1 0
• Start with the identity matrix in R2 . By Property (1), we have: =1
0 1
a 0
• Property (3): Scale the first row by a, the determinant scales by a: =a
0 1
• Using the same property again, this time to scale the second row by d,
a 0
= ad (3.6)
0 d

b 0
• Similarly, by multiplying the rows of I by b and c, we get: = bc
0 c
• Property (2): Swap the rows, the determinant changes sign, which means:
0 c b 0
=− = −bc (3.7)
b 0 0 c
• Property (4): Express the elements of the first row as sums, we can write:
a c a+0 0+c a 0 0 c
= = + (3.8)
b d b d b d b d
• We can express the first term in Eqn (3.8), again by using Property (4) as:
a 0 a 0 a 0 a 0 a 0
= = + =
b d 0+b d+0 0 d b 0 0 d
The second determinant is zero by Property (5) because it contains a column of zeros.
0 c 0 c
• Similarly, the second term becomes: =
b d b 0
• Therefore we get, using Equations (3.6) and (3.7) in Eqn (3.8),
a c a 0 0 c
= + = ad − bc
b d 0 d b 0
What this derivation is telling us is that the properties of the determinant are not merely
a consequence of its definition, but also its origin. In other words, if we are looking for
a number associated with a matrix with the specified set of properties, the determinant
turns out to be that number.

elements. We can appreciate the veracity of this property by

expanding the determinant over the first column for an upper
triangular matrix, or the first row of a lower triangular matrix.

9. If the determinant is zero, the matrix cannot be inverted. We

have not defined or discussed matrix inversion, but the inverse
of a matrix is something that reverses its transformation. If
60 Transposes and Determinants

Ax = b, and x = A−1 b, we call A−1 the inverse of A. What

this property says is that if |A| = 0, we cannot find A−1 . As
we saw earlier, such noninvertible matrices are called singular
matrices, and have deep implications in much of what we will
learn form now on.
10. The determinant of the product of two matrices (of the same
size) is the product of their determinants: |AB| = |A| |B|.
11. The determinant of the transpose of a matrix is the same as that
of the matrix. We could have used this last property to avoid
specifying “or column” in most of the properties above.

3.4 Numerical Computations

This first part of the book (comprising Chapters 1, 2 and 3) was meant
to be about numerical computations involving vectors and matrices.
The moment we start speaking of vectors, however, we are already
thinking in geometrical terms. In this chapter, we also saw how
determinants had a geometric meaning as well.
In the first chapter, in order to provide a motivation for the idea
of matrices, we introduced linear equations, which is what we will
expand on, in the next part on the algebraic view of Linear Algebra.
We will go deeper into systems of linear equations. While discussing
the properties of determinants, we hinted at their connection with
linear equations again.
As we can see, although we might want to keep the algebra and
geometry separated, it may not be possible (nor is it perhaps advis-
able) to do so because we are dealing with different views of the same
subject of Linear Algebra.

Get the Full Edition of LA4CS with

Summaries, Exercises and Solutions
Only $7.95. Scan, Click or Tap to buy.
Part II

Algebraic View
4
Gaussian Elimination

Confidence is what you have before you understand the

problem.
—Woody Allen

If we have a system of linear equations, we may or may not have

solutions, we may have a unique solution or an infinite number of
solutions. As we will see very soon, it is not easy to determine the
solvability of a system even for relatively small number of equations.
We can, of course, perform a series of manipulations on the equations
(like adding them, eliminating variables, substituting the solved ones
back etc.) to arrive at their solutions. How do we get a computer to
solve the system of equations though? We need an algorithm, both to
determine the solvability conditions and to actually find the solutions.
Such an algorithm is called Gaussian elimination or row reduction.

4.1 Solvability of System of Linear Equations

Let’s start with a simple system of two linear equations, and see what
the problem really is. Here, we will have only two variables, x and
Solvability of System of Linear Equations 63

Table 4.1 Various permutations of simultaneous equations in two variables, showing

solvability

Equations Solution Comment

x+y =5 x=3
1 Unique solution
x−y =1 y=2
x=t Infinity of solutions
2 x+y =5
y =5−t Only one equation
x+y =5 x=t Infinity of solutions
3
2x + 2y = 10 y =5−t Really, only one equation
x+y =5 No solutions
4 –
x+y =1 Inconsistent equations
x+y =5
x=3 Unique solution
5 x−y =1
y=2 Really only two equations
3x − y = 7
x+y =5
No solutions
6 x−y =1 –
Inconsistent equations
3x − y = 9

y. We are told that if we have two equations and two unknowns, we

can solve them, as though the equality of the number of variables to
the number of equations is the necessary and sufficient condition for
solvability. However, as Table 4.1 shows, it is only part of the story,
and not always true.
In the first row of the table, we see the ideal case: We have two
good equations for our two unknowns and we can easily find their
solutions.
In the second row, we have too few equations. What happens is
not that we do not have a solution. Any x and y satisfying the one
equation we have is a solution, and we have infinitely many of them.
In the third row of Table 4.1, we do have two equations, but
the second equation is derived from the first, and is therefore not
independent. In effect, we have only one equation and an infinite
number of solutions, as in the second row.
64 Gaussian Elimination

In the fourth row, we have two equations, but they are not consistent
with each other. They both cannot be true at the same time for any
pair of values of x and y.
Things get complicated in the fifth row, where we seem to have
three equations. But the third one can be derived from the first two. It
is, in fact, Eq.1 + 2×Eq.2, which means, in reality, we only have two
good equations for two unknowns. We therefore get a good solution,
much like the first row of Table 4.1.
The sixth row looks similar to the fifth, but the third equation there
is different on the right hand side. It cannot be derived from the other
two, and is inconsistent with them. Therefore, we have no solutions.
In light of these results, we can state the solvability condition, in
a general case as follows: If we have a system of n independent
and consistent linear equations on n unknowns, we can find a unique
solution for them. We are yet to define the concepts of independence
and consistency though.

Independence
Definition: An equation in a system of linear equations is considered
independent if it cannot be derived from the rest using algebraic
manipulations.
If we multiply an equation with a scalar, or add two equations, the
new equation we get is not independent. Again, notice the similarity
of the dependence of equations with our requirements for linearity
(§1.1.3, page 15).

Consistency
Definition: An inconsistent system of linear equations is the one with
no solutions.
The concept of consistency is harder to pin down. For now, we are
defining it rather cyclically as in the statement above. It is possible,
however, to visualize why some equations are inconsistent with oth-
ers. In fourth row of Table 4.1, for instance, the lines described by
the two equations are parallel to each other.

4.1.1 Visualizing Equations

In Table 4.1, our equations represent line because we have only two
variables (x and y) and we are dealing with R2 , which is a plane.
Solvability of System of Linear Equations 65

6 " 6 "
Row1: Consistent Equations Row4: Inconsistent Equations
Parallel lines
5 5

!
4 + 4
"
=
5 3 !
3 +
"
=
2 2 5
!
1 +
= 1 "
1 " =
2 1
! !
2 21 0 1 2 3 4 5 36 22 21 0 1 2 3 4 5
21 21

22 22
6 " Row5: Consistent: 6 " Row6: Inconsistent:
Single Point of No Single Point of
7

5 Intersection 5 Intersection

9
"=

"=
! !
2

+ +

2
4 4
3!

" "

3!
= =
5 5
3 3

2 2
1 1
= =
1 " 1 "
2 2
! ! ! !
22 21 0 1 2 3 4 5 26 21 0 1 2 3 4 5 6
21 21

22
Fig. 4.1 Visualizing equations listed in Table 4.1. 22
Clockwise from top-left: Row 1, 4, 5
and 6 in the table.

In Figure 4.1, we can see the lines corresponding to the equations

in the various rows in the table. For the first row, we have two
good (independent and consistent) equations, corresponding to the
red and blue lines in the top-left panel. This system has nice solution,
where the two lines intersect, at x = 3, y = 2. When people say “as
many equations as unknowns,” what they have in mind is this type of
independent and consistent system.
In the second row of Table 4.1, we have only one equation, which
corresponds to the red one in all panels of Figure 4.1. Every point on
the line is a solution to the equation, which goes for the third row too.
The fourth row of the table is shown in the top-right panel of
Figure 4.1 as two parallel lines, which will never meet. No point in
the red line will ever be a point in the blue line, and the equations do
not have a solution, which is why they are inconsistent.
66 Gaussian Elimination

In the fifth row of Table 4.1, we have an extra third equation,

shown as a green line in the bottom-left panel of Figure 4.1. It is
consistent with the other two equations because it goes through the
same solution point x = 3, y = 2. The equations are, in fact, not
independent, any one of them can be derived from the other two. The
three equations are consistent with each other and any two of them
would be enough to fully solve the system.
The sixth row shows the case where the third equation is not
consistent. Note that in this case, we do not have parallel lines,
but three lines not having a common point (which would have been
the solution). For future use, it may be interesting to wonder what an
approximate, or best possible, solution would be. Is the centroid of
the triangle formed by the three lines, perhaps, a good candidate as
the best possible solution?

4.1.2 Generalizing to Rn

The representation of the one-dimensional space R, as we learn in

school, is the number line. An equation, such as x = 1 or, in general,
ax = b defines a point on this line.
When we move to R2 , the same equation, x = 1 defines a vertical
line. In general, however, we have an equation, a1 x + a2 y = b, which
defines a line with a slope. If we have two such equations, then we
get two lines, with the possibility of them intersecting and giving us
a solution to the system of the two equations.
Similarly, in R3 , a single linear equation, such as x = 1 defines
a plane. It is a plane parallel to the yz plane, at unit distance from
it. x = 0 would be the yz plane. A general equation of the form
a1 x + a2 y + a3 z = b gives us a plane with some orientation. Another
linear equation defines another plane. If the equations are consistent
and independent, the planes will intersect, giving us the solution,
which is a line. Since the solution is a line, we have an infinity of
solutions with two linear equations in R3 . If we have one more linear
equation, we have another plane, potentially intersecting this solution
line at a point, giving us a unique solution.
With this picture in mind, let’s list the behavior of systems of linear
equations and their solutions in R2 , R3 and extrapolate to Rn , as in
Table 4.2.
Summarizing the insights from Table 4.2,
Gaussian Elimination 67

Table 4.2 Properties and behavior of linear equations and solutions

Equations R2 R3 Rn : n > 3

A line A plane An n − 1 subspace

One equation
Infinite solutions Infinite solutions Infinite solutions

Two independent and A point of intersection A line of intersection An n − 2 subspace as

consistent equations Unique solution Infinite solutions intersection. ∞ solutions

Three independent and A point of intersection An n − 3 subspace as

Cannot happen
consistent equations Unique solution intersection. ∞ solutions

n independent and A point of intersection.

Cannot happen Cannot happen
consistent equations Unique solution

Parallel lines Parallel planes Parallel subspaces

Two independent, but
No intersection No intersection No intersection
inconsistent equations
No solutions No solutions No solutions

Lines that make Planes making |||, ∦, △ Hard to visualize

Three independent, but
|||, ∦, △1 in cross section2 No common intersection
inconsistent equations
No solutions No solutions No solutions

Lines that make Planes making |||, ∦, △ Hard to visualize

n independent, but
|||, ∦, △3 in cross section4 No common intersection
inconsistent equations
No solutions No solutions No solutions
1,3
These symbols represent three parallel lines, two parallel lines plus an intersecting line, or three lines making
a triangle.
2,4
These symbols represent three planes, which, when sliced perpendicularly by a plane, show up as three
parallel lines, two parallel lines plus an intersecting line, or three lines making a triangle. In other words, the
three planes are parallel, two parallel with one intersecting or the three making a triangular tube respectively.

• The number of independent equations in a system of consistent

linear equations can never be greater than the number of unknowns.
• If a system of linear equations has more independent equations
than unknowns, it is necessarily inconsistent and has no solutions.
As we can see, the behavior of systems of equations is much more
complicated than what we learned in highschool. Furthermore, it
looks hard to generalize to higher dimensions, and harder still to
understand, how a given system is going to behave. Imagine if we
have 50 unknowns and 100 equations–a relatively small system that
we may come across in our computer/data science career. Do they
have solutions? Are the equations independent? Consistent? Are
there too many? Or too few? We clearly need a systematic way, an
algorithm, to tell us these things.

4.2 Gaussian Elimination

The critical step in the process to determine the solvability and to

actually solve systems of linear equations is Gaussian Elimination,
68 Gaussian Elimination

also known as Row Reduction. Gaussian elimination is the algorithm

to transform a matrix by applying a series of elementary row opera-
tions to arrive at its Row-Echelon Form (REF). When working with
systems of linear equations and their solutions, we apply Gaussian
elimination on the augmented matrix of the system. (We will soon
define the terms in italics.) Gaussian elimination, however, has appli-
cations other than solving equations, such as computing determinants,
finding inverses, determining ranks of matrices, and so on.

4.2.1 Matrix as System of Linear Equations

The first step in applying Gaussian elimination is to cast our linear

equations in a compact matrix form. We, in fact, did this in Chapter
1 (§1.1.4, page 15), when we introduced vectors and matrices. Let’s
do it again, looking at the deceptively simple equation, Ax = b, that
will become the mainstay of our discussion for the rest of the book.
Writing it out explicitly:
 
  x1
a11 · · · a1n  x2 
A =  ... aij ..  ∈ Rm×n and x =   ∈ Rn ,

.   .. 
.
am1 · · · amn
xn
  (4.1)
b1
 b2 
Ax = b, where b =  ..  ∈ Rm
 
 . 
bm
Now that we know matrix multiplication, we can see that Ax = b
represents a system of m linear equations of the kind
ai1 x1 + ai2 x2 + · · · + ain xn = bi where 1 f i f m
on n unknowns, xj , 1 f j f n.
On these equations, we can perform algebraic operations, like
adding or subtracting them, multiplying by a scalar, etc. In fact,
we can perform operations similar to taking linear combinations,
described in §2.2.3 (page 25). Notice the similarity between these
algebraic manipulations of equations and some of the properties of
determinants (§3.3.4, page 57)? We will make use of this similarity
in using Gaussian elimination for determinant calculation.
Gaussian Elimination 69

4.2.2 Elementary Row Operations

Analogous to the algebraic manipulation of equations, let’s first define

a set of elementary row operations, as listed below:

• Swap any two rows.

• Multiply any row by a scalar.

• Add a multiple of any row to another.

We will use these row operations to build an algorithm to solve the

system of linear equations Ax = b, much the same way we would
use the corresponding algebraic operations to solve the equations
symbolically. The advantage in the matrix formulation is that we are
dealing with numbers, and we can program a computer to perform
the operations once the algorithm is ready.

4.2.3 Augmented Matrix

We have to keep in mind that what we are doing when performing

row operation on A is basically the same as algebraic manipulation
of equations. Therefore, we have to apply the same operation to the
right hand side, b as well. For this reason, it may be best to add an
extra column to A with the elements of b. Such a matrix is called the
augmented matrix. It is merely a convenient bookkeeping technique,
which turns out to be useful when implementing row operations in a
computer program.
 
a11 · · · a1n b1
 a21 · · · a2n b2 

A | b =  .. ..  (4.2)
 . aij . bi 
am1 · · · amn bm

Now
we are naming matrices, let’s also call A (either as part of
that
A | b or by itself) the coefficient matrix, and the b part the constant
matrix or constant vector.
Before describing the algorithm of Gaussian elimination, let’s look
at the endpoint of the algorithm,
which is the form in which we would
like to have our matrix A or A | b .
70 Gaussian Elimination

Row-Echelon form
Definition: A matrix is considered to be in its row-echelon form
(REF) if it satisfies the following two conditions:

1. All rows with zero elements are at the bottom of the matrix.

2. The first nonzero element in any row is strictly to the right of

the first nonzero element in the row above it.

Pivots
Definition: The leading nonzero element in a row of a matrix in its
row-echelon form is called a pivot. The corresponding column is
called the pivot column. Pivots are also called the leading coefficient.
The largest number of pivots a matrix can have is the smaller of
its dimensions (numbers of rows and columns). In other words, for
A ∈ Rm×n , the largest number of pivots would be min(m, n).

Rank
Definition: The number of pivots of a matrix (in its REF) is its rank.
We will have a better definition of rank later on. If a matrix has its
largest possible rank (which is the largest possible number of pivots,
min(m, n)) is called a full-rank matrix. If a matrix is not full rank, we
call it rank deficient, and its rank deficiency is min(m, n) − rank(A).
We have a few examples of matrices in their REF in Eqn (4.3)
below, where the pivots are shown in bold. The first matrix shows a
square matrix of size 4 × 4, and it has four pivots, and is therefore
full rank. The second matrix is 2 × 3, and has two pivots–the largest
possible number. It is also full-rank. The third one is a 4 × 4 matrix,
but has only three pivots. It is rank deficient by one. The fourth
matrix also has a rank deficiency of one.
5 2 11 3 5 2 11 3
     
5 2 11 3
0 3 7 13  5 11 3 0 0 17 2 
0 0 0 17 2 (4.3)
0 17 2 0 0 13 0 0 0 13 
0 0 0 0
0 0 0 13 0 0 0 0

4.2.4 The Algorithm

With the definitions of REF, pivots and ranks in place, we can

state

the Gaussian elimination algorithm, running on a matrix A = aij ∈
Rm×n as follows:
Gaussian Elimination 71

Gaussian Elimination: Algorithm

Since this is a book aimed at computer and data scientists, it is probably worth our time
stating the algorithm of Gaussian Elimination as an algorithm.
Input: A = aij ∈ Rm×n

Output: RRF(A) ∈ Rm×n

1: REF ← A; nrows ← m
2: while m > 0 and n > 0 do
If the first row does not start with a nonzero element,
try to find a row that does:
3: if a11 = 0 then
4: repeat
5: if a11 = 0 then
6: for i = 1 to m do
7: if ai1 ̸= 0 then
Swap row 1 with row i:
8: r1 ´ ri
9: Exit repeat loop
10: end if
11: end for
Could not finda row withnonzero element in the first column
12: A ← A a22 : amn , m ← m − 1, n ← n − 1
13: end if
14: until a11 ̸= 0
15: else
Subtract scaled first row from other rows to get zero first column
16: for i = 2 to m do
17: ri ← ri − aa11i1
r1
18: end for
19: end if
Save the current row in the REF
20: REF(nrows−m) ← A a11 : a1n

21: A ← A a22 : amn , m ← m − 1, n ← n − 1
22: end while
23: return REF
Note that instead of finding the first row with nonzero element (starting at line 5),
it may be a better numerical strategy to unconditionally locate the row with the largest
absolute value at the first element (starting at line 3), and swapping it with the first row.
Most programs implement Gaussian Elimination that way.

1. If a11 = 0, loop over the rows of A to find the row that has a
nonzero element in the first column.

2. If found, swap it with the first row. If not, ignore the first row
and column and move on to the second row (calling it the first).

3. Multiply the first row with − aa11

i1
and add it to the ith row to get
zeros in the first column of all rows other than the first one.
72 Gaussian Elimination

4. Now consider the submatrix from the second row, second col-
umn to the last row, last element (i.e., from a22 to amn ) as the
new matrix:
A ← A[a22 : amn ], m ← m − 1, n ← n − 1

5. Loop back to step 1 and perform all the steps until all rows or
columns are exhausted.

Note that in step 3, we are subtracting a number after dividing it by

a11 , which may cause numerical instability in the algorithm when
a11 → 0. For this reason, we may want to modify the first step to
find the row with the largest absolute value in the first column. (In
other words, modify the first step to read, “Loop over the rows of A
to find the row that has the largest absolute value (̸= 0) in the first
column,” which is the way Gaussian elimination is implemented in
most programs, including SageMath.)

4.3 Applications of Gaussian Elimination

Although the primary objective of Gaussian elimination is to solve

systems of linear equations, we also use it for a couple of other
purposes.
We saw that the rank of a matrix is the same as the number of
pivots, which we can get directly from Gaussian elimination (AKA
row reduction, as a reminder).
At the end of Gaussian elimination, we have an upper triangular
matrix. If we start with a square matrix, whose determinant we need
to compute, we can get it (or its absolute value) by simply taking the
product of the diagonal elements (which are the pivots) because the
elementary row operations we perform in Gaussian elimination do
not change the absolute value of the determinant.
For these applications, we can also perform Gaussian elimination
by column (as column reduction) instead of row. However, for the
main application of solving a system of linear equations, we can do
it only row-wise. Besides, column-wise Gaussian elimination is the
same as row reduction on the transpose of the matrix anyway.
Applications of Gaussian Elimination 73

1 1 ) 5
$% = ' ó * = 1
1 21
Subtract Row 1 from Row 2:
1 1 5 1 1 5
1 21 1 0 22 24
Augmented Matrix: [1 | 3] Pivots ó First non-zero element

The Elementary Row Operation as a Matrix:

Fig. 4.2 Gaussian elimination on a simple augmented matrix, showing the pivots.
1 0 1 1 5 1 1 5
= =.

4.3.1 Solution to System of Linear Equations

To solve a system of linear equations, we apply the algorithm of

Gaussian elimination to its augmented matrix. Let’s first look at
a simple example from the first row in Table 4.1. As shown in
x + y = 5, x − y = 1,
Figure 4.2, the system of linear equations,

translates to an augmented matrix A | b with two rows, and single
step Gaussian elimination, after which, we get two pivots, 1 and −2.
Although it is a simple system, we can make a few observations about
it:

• The rank of the 2 × 2 matrix A is two, and it is full-rank matrix.

• Since the coefficient matrix (A) is square and full rank, we can
infer that the system has a unique solution.

• The determinant, |A| is the product of the pivots = −2. It

can also be calculated directly: |A| = a11 a22 − a21 a12 =
(1 × −1) − (1 × 1) = −2.

• The last row of the REF form of A | b stands for the equation
−2y = −4, which gives us y = 2.

• We can back-substitute this value of y in the second last row

(which is the first row) to solve for the other variable, giving us
x + 2 = 5 =⇒ x = 3.

Back Substitution
Definition: The process of solving a system of linear equations from
the REF of the augmented matrix of the system is known as back
substitution. The equation corresponding to the last nonzero row is
74 Gaussian Elimination

Table 4.3 Illustration of solvability conditions based on the characteristics of REF

Observations1

Equations A|b REF( A | b )

x+y =5

1 1 5 1 1 5 REF has no 0 = bi =⇒ Solvable
1
x−y =1 1 −1 1 0 −2 −4 Rank = # Vars =⇒ Unique Solution
REF has no 0 = bi =⇒ Solvable
2 x+y =5 1 1 5 1 1 5
Rank < # Vars =⇒ Infinity of Solns
x+y =5

1 1 5 1 1 5 REF has no 0 = bi =⇒ Solvable
3
2x + 2y = 10 2 2 10 0 0 0 Rank < # Vars =⇒ Infinity of Solns

x+y =5

1 1 5 1 1 5 REF has 0 = bi =⇒ Inconsistency
4
x+y =6 1 1 6 0 0 1 Not Solvable

x+y =5 
1 1 5
 
1 1 5

REF has no 0 = bi =⇒ Solvable
5 x−y =1 1 −1 1 0 −2 −4
Rank = # Vars =⇒ Unique Solution
3x − y = 7 3 −1 7 0 0 0

x+y =5 
1 1 5
 
1 1 5


6 x−y =1 1 −1 1 0 −2 −4 REF has 0 = bi =⇒ Inconsistency

3 −1 9 0 0 2 Not Solvable
3x − y = 9
1
Note: When we say 0 = bi in the observations, we mean with bi ̸= 0.

solved first, and the solution is substituted in the one for the row
above and so on, until all the variables are solved.
With Gaussian elimination and back substitution, we can solve
a system of linear equations is as fully as possible. Moreover, we
can say a lot about the solvability of the system by looking at the
pivots, as we shall illustrate using the examples in Table 4.1. We
have the augmented matrices for these equations, their REF, and our
observations on solvability of the system based on the properties
of the REF in Table 4.3. These observations are, in fact, general
statements about the solvability of systems of linear equations, as
listed below.
Solvability Conditions of a system of linear equations based on the
properties of the REF of its augmented matrix:

1. If we have a row in the REF (of the augmented matrix A | b )
with all zeros in the coefficient (A) part and a nonzero element
in the constant (b) part, the system is not solvable.

2. If the number of pivots in the REF is the same as the number

of unknowns, we have a unique solution, provided the system
is solvable by the first condition.
Applications of Gaussian Elimination 75

3. If the number of pivots is smaller than the number of vari-

ables, we have an infinity of solutions, providing the system is
solvable (by the first condition).
The first solvability condition above can be stated in a variety of ways:
• If a certain linear combination of the rows of the coefficient
matrix gives all zeros, the same combination of the elements of
the constant vector also should give zero. Else, the system is
inconsistent and not solvable.
• If the rank of the augmented matrix is greater than the rank
of the coefficient matrix, the system is inconsistent and not
solvable.
• If the pivot in any row of the REF of the augmented matrix is
in the augmented column (meaning, in the constants column,
coming from b), the system is not solvable.
Knowing that the number of pivots in the REF is the same as the
rank of the matrix, we can restate the solvability conditions more
formally (albeit absolutely equivalently and therefore superfluously)
as follows: For a system of linear equations Ax = b with m equations
and nunknowns
(in other words, A ∈ Rm×n ), with rank(A) = r and
rank( A | b ) = r′ , we have:
1. If r′ > r, the system is inconsistent and unsolvable.
2. If n = r = r′ , the system has a unique solution.
3. If n > r, the system has an infinity of solutions.

Notes:

• The largest value the rank of any matrix can have is the smaller
of its dimensions.
• The rank r of the coefficient
matrix
A ∈ Rm×n can never be
larger than the rank r′ of A | b ∈ Rm×(n+1) .
• As a corollary, if n > m, we have too few equations. The
system, if solvable, will always have infinitely many solutions.
We will look at some more examples to illustrate the solvability
conditions when we discuss the elementary matrices (the matrices
76 Gaussian Elimination

associated with the elementary operations) after the next important

topic on finding the complete solution of a system of linear equations
when we have an infinite number of solutions.

4.3.2 Complete Solution

Looking at the solvability of the systems of equations in Table 4.3,

we see that in several of them, we have either a unique solution or
no solutions at all. In the case of no solution, we give up right
away. When we have a unique solution, we can get at it using back
substitution. What is more complicated is when we have infinitely
many solutions as in rows 2 and 3 in Table 4.1, but the solution was
still trivial to find because it was just a line representing one of the
equations in the system.
Here is a system of equations with nontrivial infinity of solutions,
along with its augmented matrix and its REF.

x+y+z =6    
1 1 1 6 1 1 1 6
REF
2x + 2y + z = 9 =⇒ A | b = 2 2 1 9 −−→ 0 0 −1 −3
x+y =3 1 1 0 3 0 0 0 0

From the REF above, we can start back substituting: Row 3 does
not say anything. Row 2 says:
−z = −3 =⇒ z = 3
Substituting it in the row above, we get:
x+y+3=6 or x+y =3
By convention, we take the variable corresponding to the non-pivot
column (which is column 2 in this case, corresponding to the variable
y) as a free variable. It can take any value y = t. Once y takes a
value, x is fixed: x = 3 − t. So the complete solution to this system
of equations is:
       
x 3−t 3 −1
x= y =
   t  = 0 +t 1 
  
z 3 3 0
Note that a linear equation (such as z = 3) in R3 defines a plane.
x + y = 3 also defines a plane. (The pair x = 3 − t and y = t is
Applications of Gaussian Elimination 77

a parametric equation to the same plane.) The intersection of these

two planes is a line. And any point in this line is a solution to our set
of linear equations.
Let’s take one more example, this time with four variables x1 , x2 , x3
and x4 :
x1 + x2 + x3 + 2x4 = 6
2x1 + 2x2 + x3 + 7x4 = 9

1 1 1 2 6 REF 1 1 1 2 6
A|b = −−→
2 2 1 7 9 0 0 −1 3 −3
Columns 1 and 3 have pivots, which means x2 and x4 are free vari-
ables. Assigning values t1 and t2 to them, the last row gives us the
equation: −x3 + 3t2 = −3 or x3 = 3 + 3t2 . The first row stands for
the equation:
x1 + x2 + x3 + 2x4 = 6
Now we know x2 = t1 , x3 = 3 + 3t2 and x4 = t2 . Substituting, we
get:
x1 + t1 + 3 + 3t2 + 2t2 = 6
This equation gives us the parametrized value of x1 = 3 − t1 − 5t2
and the complete solution:
         
x1 3 − t1 − 5t2 3 −1 −5
x2   t 1
 0 1 0
x3  =  3 + 3t2  = 3 + t1  0  + t2  3 
x=          (4.4)
x4 t2 0 0 1

We have chosen to decompose the complete solution in this way (as

the sum of one vector and a linear combination of two others) because
it is the right form for a later topic. Just as a preview, we will call the
first vector a particular solution and the linear combination the null
space later on.

General Solution The complete solution in Eqn (4.4) is also called

the general solution. Let’s take a closer look at it. It has the form
xp + t1 xs1 + t2 xs2 . The first vector, xp is the so-called particular
solution. We can get it by setting all the free variables (the ones
corresponding to the non-pivot columns, which are the second and
fourth variables in our example above) to the value zero. The second
and third terms in the solution form a linear combination of two
78 Gaussian Elimination

vectors xs1 and xs2 . These vectors are called the special solutions
of the system. We can see that they are, in fact, the solutions to the
equations when the right hand side, the constants part, is zero, which
is to say b = 0.
When a system of linear equations is of the form Ax = 0, it is
called a homogeneous system because all the terms in the system are
of the same order one in the variables (as opposed to some with order
zero if we had a nonzero b). Therefore, the special solutions are also
called homogeneous solutions.
Lastly, the linear combinations of the special solutions in our ex-
ample, t1 xs1 +t2 xs2 , define a plane in R4 , as two linearly independent
vectors always form a plane going through the origin 0. What the ad-
dition of xp does is to shift the plane to its tip, namely the coordinate
point (3, 0, 3, 0), if we allow ourselves to visualize it in a coordinate
space. In other words, the complete solution is any vector whose
tip is on this plane defined by the special solutions xs1 and xs2 , and
shifted by the particular solution xp .

4.3.3 Elementary Matrices

Each of the row operations that we perform on a matrix A in Gaussian

elimination can be thought of as a matrix multiplying A on the left.
This insight comes from the row picture of matrix multiplication
(§2.7.4, page 39). For easy reference, here are the elementary row
operations once more:

1. Swap any two rows.

e.g., Swap the second and third rows (r3 ´ r2 ).

2. Multiply any row by a scalar.

e.g., Multiply the second row by 3 (r2 ← 3r2 ).

3. Add a multiple of any row to another.

e.g., Subtract three times the first row from the third (r3 ←
−3r1 + r3 ).
Applications of Gaussian Elimination 79

For any A ∈ R3×n , here are the so-called elementary matrices (or
operators) that would implement the examples listed above:
     
1 0 0 1 0 0 1 0 0
E1 =  0 0 1  E 2 =  0 3 0 E3 =  0 1 0
0 1 0 0 0 1 −3 0 1

By the row picture of matrix multiplication, U = E1 A can be stated

as follows, denoting the rows of A by ri and the rows of U by ri′ :
• r1′ is a linear combination of the rows of A:
r1′ = 1r1 + 0r2 + 0r3 = r1
• r2′ is another linear combination: r2′ = 0r1 + 0r2 + 1r3 = r3
• r3′ is another linear combination: r3′ = 0r1 + 1r2 + 0r3 = r2
As we can see, E1 implements the swap of the second and third rows
of any A ∈ R3×n . In particular, it does it for the identity matrix I in
R3×3 . Therefore, we can see that the elementary matrices differ from
the identity matrix of the same size by one elementary row operation.
Summarizing, we can capture each elementary row operation that
we apply to a matrix in Gaussian elimination as a matrix multiplying it
on the left. This matrix is called the elementary matrix or elementary
operator, denoted by E. Since we are reducing a matrix A to its
REF, it is customary to think of its rows being replaced by linear
combinations other rows: In other words, we usually write ri ←
(rather than ri′ =) a specified linear combination, as we see in our
examples.
Figure 4.3 shows the elementary
matrix (in blue) that reduced
our augmented matrix A | b for the system of linear equations,
x + y = 5, x − y = 1 to REF. We needed only one elementary
operation in this example. We have more examples of elementary
matrices in Tables 4.6 to 4.10.

4.3.4 LU Decomposition

As we have seen, the Row-Echelon Form (REF), the end result of

Gaussian elimination, is always an upper triangular matrix. We hinted
at this by calling it U in one of our equations. Since Gaussian
elimination can be applied to any matrix, we can always get an upper
triangular matrix from any matrix by applying the elementary row
80 Gaussian Elimination

1 1 ) 5
$% = ' ó * = 1
1 21
Subtract Row 1 from Row 2:
1 1 5 1 1 5
1 21 1 0 22 24
Augmented Matrix: [1 | 3] Pivots ó First non-zero element

The Elementary Row Operation as a Matrix:

1 0 1 1 5 1 1 5
= =.
21 1 1 21 1 0 22 24
Using the row picture for the first matrix multiplication above:
" First Row of the product ! = 1× Row 1 + 0× Row 2 of & '
" Second Row of the product ! = 21× Row 1 + 1× Row 2 of & '

Fig. 4.3 Gaussian elimination on a simple augmented matrix, showing the elementary
operator that implements the row reduction.

operations listed earlier (in §4.2.2, page 69). Table 4.4 shows an
example of this process of getting a U out of a matrix by finding its
REF.
We have more examples coming up (in Tables 4.4 to 4.10), where
we will write down the elementary matrix that implemented each row
operation. We will call these elementary matrices Ei . Referring to
Table 4.4, we can therefore write the REF form as:

U = E2 E1 A = EA

Table 4.4 A = LU Decomposition

[A] → REF Row Op. Ei Inv. Op. Ei−1
     
1 1 1 r2 ← −2r1 + r2 1 0 0 r2 ← 2r1 + r2 1 0 0
2 1 1 −2 1 0  2 1 0
1 3 0 r3 ← −r1 + r3 −1 0 1 r3 ← r1 + r3 1 0 1
     
1 1 1 1 0 0 1 0 0
0 −1 −1 r3 ←2r2 + r3 0 1 0 r3 ← −2r2 + r3 0 1 0
0 2 −1 0 2 1 0 −2 1
     
1 1 1 1 0 0 1 1 1
0 −1 −1 L=
−1 −1
E1 E2 = 2 1 0 U = REF = 0 −1 −1
0 0 −3 1 −2 1 0 0 −3

Let’s focus on one elementary row operation, E2 (second row, under the column Ei ). This operation replaces
row 3 with the sum of twice row 2 and row 3. The inverse of this operation would be to replace the row 3 with
row 3 − twice row 2, as shown in the column Inv. Op. The elementary operation for this inverse operation is,
−1
in fact, the inverse of E2 , which we can find in the last column, under Ei .
Applications of Gaussian Elimination 81

Note the order in which the elementary matrices appear in the product.
We perform the first row operation E1 on A, get the product E1 A
and apply the second operation E2 on this product, and so on.
We have not yet fully discussed the inverse of matrices, but the
idea of the inverse undoing what the original matrix does is probably
clear enough at this point. Now, starting from the upper triangular
matrix U that is the REF of A, we can write the following:
U = EA = E2 E1 A =⇒ A = E1−1 E2−1 U = LU
where we have called the matrix E1−1 E2−1 , which is a lower triangular
matrix (because each of the Ei−1 is a lower triangular matrix) L.
Again notice the order in which the inverses Ei appear in the equation:
We undo the last operation first before moving on to the previous one.
This statement is the famous LU decomposition, which states
that any matrix can be written as the product of a lower triangular
matrix L and an upper triangular matrix U . The algorithm we use
to perform the decomposition is Gaussian elimination. Notice that
all the elementary matrices and their inverses have unit determinants,
and the row operations we carry out do not change the determinant
of A, such that |A| = |U |.

Permutation Matrices: In discussing the example in Table 4.4, we

cleverly glossed over one fact: We did not use the first elementary
row operation (namely, swapping rows). Row exchanges introduce
symmetric matrices as their elementary matrices. They are not lower
triangular. The matrix implementing row exchanges is called a per-
mutation matrix, and is denoted by P .
Every permutation matrix (of single row exchange) differs from
the identity matrix of the same size by one or more row exchange5 .
And it is its own inverse (because exchanging row with another is
undone by the same exchange): P 2 = I. To introduce another name,
the matrices that are their own inverses are called involutory. And
the permutation matrices implementing a single row exchange are
involutory.
Here are all possible permutation matrices that swap one or more
pairs of rows in R2 and R3 :

5
In principle, we can have permutation matrices that perform multiple row exchanges, but
for the purposes of this book, we focus on one row exchange at a time.
82 Gaussian Elimination

Table 4.5 A = P LU Decomposition

[A] → REF Row Op. Ei Inv. Op. Ei−1
     
1 1 1 1 0 0 1 0 0
 2 2 1 r2 ´ r3  0 0 1 r2 ´ r3 0 0 1
1 3 0 0 1 0 0 1 0
     
1 1 1 r2 ← −r1 + r2 1 0 0 r2 ← r1 + r2 1 0 0
 1 3 0 −1 1 0 1 1 0
2 2 1 r3 ← −2r1 + r3 −2 0 1 r3 ← 2r1 + r3 2 0 1
       
1 1 1 1 0 0 1 0 0 1 1 1
0 2 −1 P =  0 0 1  L = E2 =  1
−1
1 0 U = REF = 0 2 −1
0 0 −1 0 1 0 2 0 1 0 0 −1

r1 ´ r2
r 1 ´ r2  r1 ´ r2   r2 ´ r3   r3 ´ r1   r2 ´ r3 
0 1 0 1 0 0 0 0 1 0 1 0
0 1 1 0 0 0 0 1 0 1 0 0 0 1
1 0
0 0 1 0 1 0 1 0 0 1 0 0

Note that the last one does multiple row exchanges and conse-
quently P 2 ̸= I: It is not involutory.

4.3.5 P LU Decomposition
Since the elementary operation of row exchanges breaks our decom-
position of A = LU , we keep the permutation part separate, and
come up with the general, unbreakable, universally applicable decom-
position A = P LU . We can see an example of this decomposition
in Table 4.5.

4.3.6 Computing Determinants

As breifly mentioned earlier, we can use Gaussian elimination to

compute the determinant of a square matrix. Once we get the REF
(U ), the determinant is simply the product of diagonal elements. The
reason is that the elementary row operations do not change the abso-
lute value of the determinant. Or rather, they change the determinant
in a way we can keep track of.

• Swapping two rows will result in the determinant flipping sign

(Property 1 in §3.3.4, page 57). We therefore have to keep track
of how many times we swapped rows when applying Gaussian
elimination to compute determinants.
More Examples 83

• Although we defined it as an elementary row operation (for

future use), we do not scale rows in Gaussian elimination. If
we did, we would have to keep track of it as well because
the determinant then would be multiplied by the same scaling
factor (Property 3 in §3.3.4, page 57)

• The third elementary operation (adding/subtracting a multiple

of a row from another one) does not change the determinant
(by Property 6 in §3.3.4, page 57).

Why would we want to compute determinants this way rather than

using the Laplace expansion (Eqn (3.4))? The computational com-
plexity of the Laplace expansion is O(n!) while that of the Gaussian-
elimination way is only O(n3 ), which is a huge gain for large values
of n. As a special case, if the matrix does not have a full set of
pivots, the determinant is zero, which means we do not even have
to complete Gaussian elimination; we can stop the moment we get a
row without a pivot.

4.4 More Examples

In Table 4.6, we have the first example of a system of three equations:

We have as many equations as unknowns. After Gaussian elimination,
we can see that we have a full set of pivots, and A is indeed full rank.
Therefore, we get a unique solution through back substitution, as
shown after the table.
We can restate the determinant computation more formally in terms
of the P LU decomposition we discussed above. If we start with a
square matrix A ∈ Rn×n , we can make the following statements:

1. By the product rule of determinants (Property 10 in §3.3.4, page

57, |AB| = |A| |B|), we can see that |A| = |P | |L| |U |.

2. |P | = ±1. (Exercise: Prove this statement.)

3. |L| = 1. (Exercise: Prove this statement.)

4. Therefore |A| = |P | |U | = ± |U |, with the sign given by |P |.

84 Gaussian Elimination

Table 4.6 Example 1: Gaussian elimination and solvability: Unique solution

Equations A | b → REF Row Operation Elementary Matrix Ei
x+y+z =6 
1 1 1 6
 
1 0 0

r2 ← −2r1 + r2
2x + 2y + z = 9 2 2 1 9 −2 1 0
1 3 0 7 r3 ← −r1 + r3 −1 0 1
x + 3y = 7
   
1 1 1 6 1 0 0
0 0 −1 −3 r2 ´ r3  0 0 1
0 2 −1 1 0 1 0
 
1 1 1 6
0 2 −1 1 REF
0 0 −1 −3

Back Substitution:
The last row of the REF(A) says −z = −3 =⇒ z = 3.
Substituting it in the row above, 2y − 3 = 1 =⇒ y = 2.
Substituting z and y in the first row, x + 2 + 3 = 6 =⇒ x = 1.
The complete and unique solution is: (x, y, z) = (1, 2, 3)

5. Since U is triangular, its determinant is the product of its

diagonal elements.
Y n
|U | = uii
i=1

6. In particular, if U does not have full set of pivots (and is rank

deficient), at least one of uii = 0 =⇒ |A| = 0.
In the second example, Table 4.7, we have modified the third
equation such that it is no longer consistent with the first two. We
again have as many equations as unknowns. However, after Gaussian

elimination, we can see that the last row reads 0 0 0 3 . It
translates to an equation 0 = 3, which
indicates that the equations are
inconsistent. Note that rank( A | b ) = 3 > rank(A) = 2.

Table 4.7 Example 2: Gaussian elimination and solvability: No solutions

Equations A | b → REF Row Operation Elementary Matrix Ei
x+y+z =6 
1 1 1 6
 
1 0 0

r2 ← −2r1 + r2
2x + 2y + z = 9 2 2 1 9 −2 1 0
1 1 0 6 r3 ← −r1 + r3 −1 0 1
x+y =6
   
1 1 1 6 1 0 0
0 0 −1 −3 r3 ← −r2 + r3 0 1 0
0 0 −1 0 0 −1 1
 
1 1 1 6
0 0 −1 −3 REF
0 0 0 3

The last row of the REF(A) is 0 0 0 3 , saying 0 = 3 =⇒ The system is inconsistent.
More Examples 85

Table 4.8 Ecample 3: More equations than unknowns, but solvable

Equations A | b → REF Row Operation Elementary Matrix Ei
x+y+z =6
1 1 1 6 1 0 00
   
r2 ← −2r1 + r2
2x + 2y + z = 9 2 2 1 9 −2 1 00
1 r3 ← −r1 + r3
x + 3y = 7 3 0 7 −1 0 10
3 1 1 8 r4 ← −3r1 + r4 −3 0 01
3x + y + z = 8
1 1 1 6 1 0 0 0
   
0 0 −1 −3 0 0 1 0
0 r2 ´ r3
2 −1 1 0 1 0 0
0 −2 −2 −10 0 0 0 1
1 1 1 6 1 0 0 0
   
0 2 −1 1 0 1 0 0
0 r4 ← r2 + r4
0 −1 −3 0 0 1 0
0 −2 −2 −10 0 1 0 1
1 1 1 6 1 0 0 0
   
0 2 −1 1 0 1 0 0
0 0 r4 ← −3r3 + r4
−1 −3 0 0 1 0
0 0 −3 −9 0 0 −3 1
1 1 1 6
 
0 2 −1 1
0 0 REF
−1 −3
0 0 0 0

The last row of the REF(A), 0 0 0 0 ,says 0 = 0 =⇒ The system is consistent.

The third example in Table 4.8, we have added one more equation
to the system in Table 4.6. Thus, we have four equations for three
unknowns. But the augmented matrix reduces with the last row
reading 0 = 0, which means the fourth equation is consistent with

Table 4.9 Example 4: More equations than unknowns, with no solutions

Equations A | b → REF Row Operation Elementary Matrix Ei
x+y+z =6
1 1 1 6 1 0 00
   
r2 ← −2r1 + r2
2x + 2y + z = 9 2 2 1 9 −2 1 00
1 r3 ← −r1 + r3
x + 3y = 7 3 0 7 −1 0 10
3 1 1 11 r4 ← −3r1 + r4
−3 0 01
3x + y + z = 11
1 1 1 6 1 0 0 0
   
0 0 −1 −3 0 0 1 0
0 r2 ´ r3
2 −1 1   0 1 0 0
0 −2 −2 −7 0 0 0 1
1 1 1 6 1 0 0 0
   
0 2 −1 1 0 1 0 0
0 r4 ← r2 + r4
0 −1 −3   0 0 1 0
0 −2 −2 −7 0 1 0 1
1 1 1 6 1 0 0 0
   
0 2 −1 1 0 1 0 0
0 0 −1 −3 r4 ← −3r3 + r4 0 0 1 0
0 0 −3 −6 0 0 −3 1
1 1 1 6
 
0 2 −1 1
0 0 −1 −3 REF
0 0 0 3

The last row of the REF(A) is 0 0 0 3 , saying 0 = 3 =⇒ The system is inconsistent.
86 Gaussian Elimination

Table 4.10 Example 5: Gaussian elimination and solvability: Infinity of solutions

Equations A | b → REF Row Operation Elementary Matrix Ei
x+y+z =6 
1 1 1 6
 
1 0 0

r2 ← −2r1 + r2
2x + 2y + z = 9 2 2 1 9 −2 1 0
1 1 0 3 r3 ← −r1 + r3 −1 0 1
x+y =3
   
1 1 1 6 1 0 0
0 0 −1 −3 r3 ← −r2 + r3 0 1 0
0 0 −1 −3 0 −1 1
 
1 1 1 6
0 0 −1 −3 REF
0 0 0 0

The last row of the REF(A) is 0 0 0 0 , indicating consistency. With two independent equations for three
unknowns, we get an infinity of solutions.

the rest. And rank( A | b ) = rank(A) = 3, same as the number of
unknowns. Hence unique solution.
The fourth example is similar to the third one; we have added a
fourth equation in Table 4.9 as well. However, the new equation is
not consistent with the rest.
In the last example in Table 4.10 (which we used to illustrate the
complete solution with free variables), we have as many equations as
unknowns.
After
Gaussian elimination, we get the last row reading
0 0 0 0 ; no zero=nonzero row, indicating that the equations
are consistent. However rank( A | b ) = rank(A) = 2 with three
unknowns, which means we have infinitely many combinations of
(x, y, z) that can satisfy these equations.

4.5 Beyond Gaussian Elimination

In this chapter, we looked at the applications of Gaussian elimination

in computing determinants and solving equations. We briefly intro-
duced matrix inverses. The interplay between ranks, determinants
and inverses will be further explored in the next chapter.

Get the Full Edition of LA4CS with

Summaries, Exercises and Solutions
Only $7.95. Scan, Click or Tap to buy.
5
Ranks and Inverses of
Matrices

Everything is simpler than you think and at the same time

more complex than you imagine.
—Johann Wolfgang von Goethe

Along with determinants, the rank of a matrix is a number that

says a lot about the properties of the matrix. Unlike determinants,
however, rank is defined for all matrices, not merely square ones. And
rank is an integer, while determinant is a function of the elements of
the matrix, and belongs to the same field over which it is defined. For
square matrices, ranks and determinants are related to each other. In
fact, most things in Linear Algebra tend to be interconnected. In this
chapter, we will summarize what we already learned about ranks, and
expand on it. The full discussion of ranks, however, will have to wait
until we look at the underlying geometry of matrices and vectors.
88 Ranks and Inverses of Matrices

5.1 Rank of a Matrix

Earlier, we stated that the rank of a matrix is the number of pivots in its
row echelon form (REF), which we get through Gaussian elimination.
This algorithm is merely a series of elementary row operations, each
of which is about taking a linear combination of the rows of the
matrix. And, right from the opening chapters where we introduced
vectors and matrices (in §2.2.3, page 25), we talked about linear
combinations.
If we can combine a bunch of rows in a matrix so as to get a different
row, then the rows are not linearly independent. To state it without
ambiguity using mathematical lingo, for a matrix A ∈ Rm×n , we say
that a nonzero ith row (riT ) is a linear combination of the other rows
if we can write:
m
X
riT = sj rjT , r j ∈ Rn , sj ∈ R
j=1,j̸=i

When we can write one row as the linear combination of the rest,
we say that that row is not linearly independent. Note that any row
that is not linearly independent can be reduced to zero using the
elementary row operations. Once a row is reduced to zero, it does
not have a pivot. Conversely, all nonzero rows do have pivots, and
they are linearly independent. Therefore, we can state that the rank
of a matrix is the number of linearly independent rows.

5.1.1 Row and Column Ranks

Strictly speaking, the number of linearly independent rows would be

the row rank of the matrix. Similarly, we can define a column rank,
which is the number of linearly independent columns of the matrix.
Or, the number of pivots when Gaussian elimination is performed
column-wise, using column-reduction operations.
We have a theorem that states that the row rank is the same as
the column rank. It is fairly easy to prove this fact. We will provide
some hints and leave it as an exercise after looking at various possible
shapes of matrices, the Gauss-Jordan algorithm and the canonical
form it produces (in §5.2.2, page 94). We also have the full proof in
one of the advanced topics, in §12.6 (page 235).
Rank of a Matrix 89

5.1.2 Shapes of Matrices

A general matrix A ∈ Rm×n can be tall if the number of rows is

greater than the number of columns: m > n. Or it can be wide (or,
less charitably, fat) if it has more columns than rows: m < n. When
m = n, what we have is a square matrix.
For a square matrix, the best possible situation is that we may have
a pivot in every row/column. We cannot have more pivots than that.
Therefore, the rank r of the matrix has a maximum value of m = n.

A ∈ Rn×n =⇒ rank(A) f n

For a “tall” matrix, the best scenario is when we have a pivot in

each of the columns. We cannot have more than one pivot in the same
column. So we have r f n when A ∈ Rm×n , n < m. Similarly,
for a “wide” matrix, we have, r f m when A ∈ Rm×n , n > m.
Combining these two limits, we can write:

A ∈ Rm×n =⇒ rank(A) f min(m, n)

If the rank r = min(m, n), we call A a full-rank matrix; else, it is

rank deficient. If r = m, it is often called a full-row-rank (or if r = n,
full-column-rank) matrix.

5.1.3 Properties of Ranks

Here are some of the properties of ranks of matrices. Although listed

without any formal proof, the properties are either obvious, or derived
from the fact that the rank is the number of linearly independent
columns or rows, and that matrix multiplication is about taking their
linear combinations (as in the row and column pictures). The proofs
will have to wait till we learn more about the geometry associated
with matrices because ranks are closely linked to it.
For A ∈ Rm×n , we have:
1. The only matrix that has a rank of zero is the zero matrix, which
is a matrix that has all entries as zeros.
2. If B ∈ Rn×k , conformant for multiplication with A ∈ Rm×n
on the right,

rank(AB) f min(rank(A) , rank(B))

90 Ranks and Inverses of Matrices

3. If B ∈ Rn×k , conformant for multiplication with A ∈ Rm×n

on the right, with rank(A) = r and rank(B) = n (full-row
rank =⇒ k g n),
rank(AB) = rank(A) = r
This property follows from the column picture of matrix mul-
tiplication and the definition of rank as the number of lin-
early independent columns: The product AB has columns
that are linear combinations of the r independent columns
of A. Note that m g r and n g r (since rank(A) = r),
k g n (since rank(B) = n) =⇒ k g r. In other words,
the product AB ∈ Rm×k has enough rows and columns to
accommodate the rank of r. Or, stating it using still other
words, while multiplying on the right, B has enough rows and
columns to preserve the rank of A, subject to the condition
rank(AB) f min(rank(A) , rank(B)).
4. Similarly, if B ∈ Rk×m , conformant for multiplication with
A ∈ Rm×n on the left, with rank(B) = m (full-column rank),
rank(BA) = rank(A)
This property follows from the row picture of matrix multi-
plication and the definition of rank as the number of linearly
independent rows. The explanation is the transpose case of the
previous case. Keep in mind that these two properties (3 and
4) do not violate the condition in property (2).
5. If B ∈ Rm×n , same dimensions as A so that we can add them,
rank(A + B) f rank(A) + rank(B)

6. As a corollary of the previous property, we can say that if

rank(A) = r, we can write it as a sum of r matrices of rank
one, and we do not need more matrices. Although stated as an
innocuous-looking factoid, this property has important impli-
cations in computer science, in the context of data compression,
as we shall see later on.
7. A very important property that we will prove later on (much
later, in §12.6, page 235):
rank AAT = rank AT A = rank(A) = rank AT

Gauss-Jordan Elimination 91

AT A is called the Gram matrix, and has properties that make

it important in machine learning.

5.2 Gauss-Jordan Elimination

While discussing the Gaussian elimination algorithm in the previous

chapter, we ended up with the A = P LU decomposition. The
U matrix had the same determinant as A (and P contained the
information about the sign of the determinant).
If we are not really concerned about computing the determinant, but
only want solve the system of linear equations Ax = b, we can keep
applying elementary row operations instead of the back substitution
we discussed earlier. This algorithm to solve linear equations is
known as Gauss-Jordan Elimination.
As a reminder to ourselves, we have the elementary row operations,
as listed below:
• Swap any two rows.
• Multiply any row by a scalar.
• Add a multiple of any row to another.
In the previous chapter, although we listed the three elementary op-
erations, we did not use the second one, namely scaling a row. In the
Gauss-Jordan algorithm, we will use it as well.
Before stating the steps involved in Gauss-Jordan elimination, let’s
look at the example in Figure 5.1. Here, we start from the REF of
the augmented matrix of the equations x + y = 5, x − y = 1 (see
Figure 4.2), and keep applying the elementary row operations, which
do not affect the solution of the system of equations.

In this case, the REF of the augmented matrix, A | b , becomes
the identity matrix in the coefficient (A) side, at which point the
constants column contains the solution. Remember, each row of
the augmented matrix (whether row reduced or not) represents an
equation. What equation does the top row of final matrix from
Gauss-Jordan elimination represent? It reads: 1x + 0y = 3 or x = 3.
Similarly, the second row gives us the equation y = 2.
The aim of the Gauss-Jordan algorithm is to get to an identity
matrix in the coefficient part of the augmented matrix. In other
92 Ranks and Inverses of Matrices

1 1 5 Elementary Operators
1 21 1 1 0
( 1 5 21 1
Subtract Row 1 0 26 24
1 0 1 0
from Row 2: Kill leading non-zeros !
( 1 5 0 2 21 1
"
0 ( 2
1 21 1 0 1 0
Divide Row 2 by 22: Make Pivots 1 ( 0 3 !
0 1 0 2 21 1
"
0 ( 2
Subtract Row 2 from Row 1: Make non-pivots 0

))*+
! " # ", ó Find the solutions in the augmented column (", )
) = 3, *=2
The Upper Triangular Matrix in the row-echelon form (REF) in Gaussian Elimination
becomes the Identity matrix in the Reduced REF (RREF) in Gauss-Jordan Elimination

Fig. 5.1 Example of Gauss-Jordan elimination to solve a simple system of linear equations

words, try to make A in Ax = b an I, failing which, get as close to

the identity matrix as possible. This form of the coefficient matrix is
called the Reduced Row Echelon Form, or RREF, also known as the
Canonical Form. For a full-rank, square matrix, its canonical form
is the identity matrix of the same size, as we saw in the example in
Figure 5.1. We will soon discuss other combinations of shapes and
ranks and their canonical forms.

Reduced Row Echelon Form

Definition: A matrix is in its Reduced Row Echelon or Canonical
Form if it has the following characteristics:
1. It is in REF. (See §4.2.3, page 69)
2. Every nonzero row is scaled such that the pivot value is one.
3. In every pivot column, the only nonzero element is the pivot.
As we can see, the Reduced Row Echelon Form (RREF) is the Row
Echelon Form (REF) with some special properties, as its name indi-
cates. The actual definition of REF and RREF are a bit fluid: Different
textbooks may define them differently. For instance, some may sug-
gest that REF should have all pivots equalling unity, through scaling.
Such differences, however, do not affect the conceptual significance
of the forms and perhaps not even their applications.
Gauss-Jordan Elimination 93

Gauss-Jordan Elimination: Algorithm

We can state the Gauss-Jordan algorithm also using pseudo-code so that the steps may
be clearer to the students of computer science.

Input: A = aij ∈ Rm×n
Output: RRRF(A) ∈ Rm×n
Get the REF using Gaussian Elimination
1: A ← REF(A); pivots ← pivots of A
2: for i = 1 to m do
3: if pivotsi = 0 then
4: Exit for loop
5: end if
Scale the row to get unit pivot
ri
6: ri ← pivots
i
Subtract scaled pivot row from other rows to get zero pivot column
7: for j = 1 to m do
8: if j ̸= i then
9: k ← pivot column index
10: rj ← rj − ajk ri
11: end if
12: end for
13: end for
14: return A as RREF

Note that while the REF (which is the result of Gaussian elimina-
tion) can have different shapes depending on the order in which the
row operations are performed, RREF (the result of Gauss-Jordan) is
immutable: A matrix has a unique RREF. In this sense RREF is more
fundamental than REF, and points to the underlying characteristics
of the matrix.

5.2.1 The Algorithm

With the insight from the example in Figure 5.1, we canstate the steps
in the Gauss-Jordan algorithm (for a matrix A = aij ∈ Rm×n ) as
follows:

1. Run the Gaussian elimination algorithm (§4.2.4, page 70) to

get REF(A).
94 Ranks and Inverses of Matrices

2. Loop (with index i) over the rows of REF(A) and scale the ith
row by a1ik (where k is the column where Pivoti appears) so that
Pivoti = 1.
3. Loop (with index j) over all the elements of the pivot column
from 1 to i − 1, multiply the ith row with −ajk and add it to the
j th row to get zeros in the k th for all rows above the ith row.
4. Loop back to step 2 (with i ← i + 1) and iterate until all rows
or columns are exhausted.
Let’s look at two examples to illustrate the outputs of Gauss-Jordan
algorithm. Starting from the REF of the augmented matrices in Ta-
bles 4.7 and 4.10, we find their RREF. The examples are in Tables 5.1
and 5.2.

Table 5.1 Gauss-Jordan on a full-rank, square coefficient matrix, giving us the unique
solution

Equations A|b REF RREF
x+y+z =6 
1 1 1

6

1 1 1 6
 
1 0 0 1

2x + 2y + z = 9 2 2 1 9 0 2 −1 1 0 1 0 2
x + 3y = 7 1 3 0 7 0 0 −1 −3 0 0 1 3

Here, we start from the REF form of A | b (as in Table 4.7), and do further row reduction to get
I in the coefficient part. In other words, in total, we apply the Gauss-Jordan algorithm. We can then
read out the solution (x, y, z) = (1, 2, 3) from the augmented part of the RREF.

If we started from the REF of A | b in Table 4.9, where we have
inconsistent equations, the RREF would be an uninteresting 4 × 4
identity matrix. What it says is that we have a rank of four in A | b ,
with only three unknowns, indicating that we have no solutions.
Notice that we did not write down the elementary matrices for the
Gauss-Jordan algorithm, as we diligently did for Gaussian elimina-
tion? The reason for doing that in Gaussian elimination was to arrive
at the A = P LU decomposition so as to compute its determinant.
We do not use Gauss-Jordan to compute determinants, although we
could, if we wanted to.

5.2.2 Shapes of Canonical Forms

We looked at what we called “tall” and “wide” matrices in §5.1.2.

What are the canonical forms of matrices of these various shapes?
Remember, canonical form is an alias for RREF.
Gauss-Jordan Elimination 95

Table 5.2 Gauss-Jordan on a rank-deficient coefficient matrix, resulting in an infinity of

solutions

Equations A|b REF RREF
x+y+z =6 
1 1 1

6

1 1 1 6
 
1 1 0 3

2x + 2y + z = 9 2 2 1 9 0 0 −1 −3 0 0 1 3
x+y =3 1 1 0 3 0 0 0 0 0 0 0 0

This second example starts from the REF form of A | b from Table 4.10. The Gauss-Jordan

algorithm gets us as close to the identity matrix as possible in the coefficient part of A | b . The
solution is: z = 3 and x + y = 3, which is usually written as (x, y, z) = (3 − t, t, 3).

Square Matrices Let’s start with a square matrix of full rank: A ∈

n×n
R , rank(A) = n. Since all the rows have pivots, and since we can
scale the pivots to get ones, we will end up with the identity matrix
I as the canonical form. In fact, this is what we did in Table 5.1, for
the coefficient matrix.
RREF
A ∈ Rn×n , rank(A) = n =⇒ A −−−→ In ∈ Rn×n
where the symbol In stands for the identity matrix in Rn×n .
What happens if the matrix is rank deficient? If rank(A) = r < n,
we have only r pivots, which the Gauss-Jordan algorithm will place
in the first r rows. The last n − r rows of the RREF, therefore, will be
zeros. The top r rows will contain the identity matrix I because the
pivots will be scaled to one, and they will be used to eradicate other
elements in the pivot columns.
The pivot columns, however, may not be ideally placed to give
us an Ir matrix in the top-left corner, but may appear interspersed
among other pivot-less columns. As we remember from solving
linear equations, columns with no pivots result in free variables, and
we may call these columns part of a matrix F . Thus the canonical
form of a rank-deficient square matrix looks like:
" #
n×n RREF Ir · Fr×(n−r)
A∈R , rank(A) = r < n =⇒ A −−−→ ∈ Rn×n
0(n−r)×n

We use the expression I · F to indicate that I and F are shuffled

in together. Note that · is not a standard notation for this purpose.
In Table 5.2, we have an example of such a system: The coefficient
matrix is a square matrix with rank deficiency. Looking at the RREF
column of the table, we can see that pivot columns do make an iden-
tity matrix, but the pivot-less, free-variable columns are interspersed
96 Ranks and Inverses of Matrices

among it. The pivot columns are 1 and 3, and the free variable is in
column 2. The bottom m − r = 1 row is a zero row.

Tall Matrices For a “tall” matrix of full (column) rank, we will get
pivots for the first n rows, and the rest m − n rows will be zeros. The
n pivots can be normalized so that the top part of the canonical form
becomes In , and we will have:
" #
m×n RREF In
A∈R , rank(A) = n < m =⇒ A −−−→ ∈ Rm×n
0(m−n)×n

If the tall matrix is rank deficient, we again have an I and F in

the top r rows, followed by zeros. The pivot and pivot-less columns
may get interspersed among each other, as in the case of the square
matrix, giving us a canonical form that may see identical to the one
for the rank-deficient square matrix, but the indices are different: It
has one instance of n replaced by m.
" #
m×n RREF Ir · Fr×(n−r)
A∈R , rank(A) = r < n < m =⇒ A −−−→ ∈ Rm×n
0(m−r)×n

Wide Matrices Moving on to “wide” matrices (with more columns

than rows), if they have full (row) rank, they will get a canonical form
that may not cleanly separate into I and F . We can see an example in
Table 5.3, which has two independent
and consistent equations, and
therefore a full-row-rank A | b . The first and third columns have
pivots, and make an identity matrix, but the free-variable column
appears in between. The general shape of a full-row-rank, wide
matrix, therefore, looks like:

RREF
A ∈ Rm×n , rank(A) = m < n =⇒ A −−−→ Im · Fm×(n−m) ∈ Rm×n

For a rank-deficient wide matrix also, the canonical form is similar

to the other rank-deficient cases. What we had forthe coefficient
matrix (or for the whole augmented matrix A | b ) in Table 5.2
was a wide matrix (not enough independent equations) with rank
deficiency. Looking at the RREF column of the table, we can see that
pivot elements do make an identity matrix, but the pivot-less, free
variable columns are interspersed among it. The pivot columns are 1
Inverse of a Matrix 97

Table 5.3 Gauss-Jordan on a rank-deficient “wide” coefficient matrix

Equations A|b REF RREF
x+y+z =6

1 1 1 6 1 1 1 6 1 1 0 3
2x + 2y + z = 9 2 2 1 9 0 0 −1 −3 0 0 1 3

This third example is similar to the one in Table 5.2, but we removed the third equation. The
Gauss-Jordan
algorithm gets us as close to the identity matrix as possible in the coefficient part of
A | b . Notice how the columns of the identity matrix and the pivot-less columns are interspersed.

and 3, and the free variable is in column 2. The rest m − r rows are
zeros. We represent this as:
" #
m×n RREF Ir · Fr×(n−r)
A∈R , rank(A) = r < m < n =⇒ A −−−→ ∈ Rm×n
0(m−r)×n

Summary We see that the Gauss-Jordan algorithm tries very hard

to get to an identity matrix as the RREF. If the matrix has full rank,
it succeeds fully in the case of square and tall matrices, with a clean
identity matrix visible in the RREF. In the case of a wide matrix,
it may not be able to cleanly separate the identity matrix from the
pivot-less columns. If the matrix is rank deficient, in general, the
RREF will have the columns of the identity matrix (which are pivot
columns) interspersed with the free-variable columns. In any case,
the RREF is unique for a matrix, unlike REF, the output of Gaussian
elimination, which depends on the order in which the elementary row
operations are performed.
Earlier, we stated that the row and column ranks were the same
for any matrix, and that we would provide a hint to prove it. This
whole section is the hint, plus the fact that elementary row operations
do not change row or column rank. Neither do elementary column
operations. There is nothing that prevents us from doing both, one
after the other, if we want, and then asserting the fact that the row and
column ranks do not change.

5.3 Inverse of a Matrix

We came across the inverse of matrices in a few previous occasions,

starting with the nomenclature, where we defined it as that matrix
which, when multiplied with the original gives us the identity matrix.
98 Ranks and Inverses of Matrices

We again saw it in the properties of determinants: If a matrix has zero

determinant, it cannot be inverted. While talking about elementary
matrices, we spoke of their inverses as those matrices that would
reverse the effect of multiplying with the original matrices. Let’s
recap and expand on what we have learned about inverses.

5.3.1 Invertibility

A square matrix is invertible if we can compute its inverse, obviously.

Invertibility can be expressed in a variety of ways, as listed below.
For a matrix A ∈ Rn×n , if A−1 exists, we may say:
• A is invertible.
• A is not singular.
• A is not degenerate.
• A has n pivots in its REF.
• A has full rank: rank(A) = n.
• The system of linear equations Ax = b has a unique solution.
• A represents a linear automorphism. (See §3.3, page 51.)
These are the statements about the invertibility of A that we have
learned so far. We have quite a few more of such statements, related to
the linear independence of the rows and columns of A, the geometry
of its so-called fundamental spaces, its eigenvalues etc. We will list
them later, when appropriate.

5.3.2 Properties of Inverses

In addition to the various statements about invertibility, we also have

a list of some interesting properties of inverses.
−1
1. Inverse of an inverse is the original matrix: (A−1 ) =A
2. Scaling: When we multiply a matrix by a scalar, its inverse
gets multiplied by the reciprocal of the scalar:

(sA)−1 = s−1 A−1 because

(sA) s−1 A−1 = (ss−1 )AA−1 = 1 I = I
Inverse of a Matrix 99

3. Product Rule: Similar to the product rule of transposes, the

inverse of a product is the product of the inverses of the factors
in the reverse order:

(AB)−1 = B −1 A−1 because (AB)B −1 A−1

= A BB −1 A−1 = AIA−1 = AA−1 = I

We used this property in constructing the lower triangular ma-

trix L (on A = LU ) from the inverses of elementary matrices.
(See §4.3.4, page 80.)

4. Involutory Matrix: A matrix whose inverse is the same as

itself is called an involutory matrix. If P is involutory, P 2 = I,
as the single-row-exchange permutation matrices.

5. The determinant of the inverse of a matrix is the reciprocal of

its determinant:
1
A−1 =
|A|

5.3.3 Gauss-Jordan Method to Compute Inverse

The inverse of A ∈ Rn×n , if it exists, is easily computed as:

RREF
A | I −−−→ I | A−1

We have two ways of seeing why this works. The first explanation
below is more elegant and less complicated than the second one that
follows.

A Using Block Multiplication

The first and most elegant explanation of why Gauss-Jordan elimina-
tion gives A−1 uses the Special Case of Block-wise Multiplication

we discussed on page 38. If we have a block matrix A | I ∈ Rn×2n ,
we can multiply it by another matrix E ∈ Rn×n as:

E A | I = EA | EI = EA | E

If E happens to be the elementary matrix that takes A to its RREF,

then the process of multiplying by E on
the left
is the same
as per-

forming Gauss-Jordan elimination: E A | I = RREF A | I .
100 Ranks and Inverses of Matrices

And since we start with a full-rank, invertible A, its RREF is nothing

but I. Putting it all together, we write:

RREF A | I = EA | E = I | E

Comparing the left blocks in the equation above, we get EA =

I =⇒ E = A−1 , and we can see E sitting as the right block in the
final term. In other words,
RREF
A | I −−−→ I | A−1

B Using Super-Augmented Matrices

The previous explanation was elegant because we worked at the level
of matrices (using matrix-algebra, as it were), rather than descend
to the column/row/element level. If we need another explanation,
we can get it by looking at A as encoding one system of equations
Ax = b, and Gauss-Jordan giving us the solution. If we have
multiple systems of equations with the same coefficient matrix, we
can perform Gauss-Jordan elimination on all of them in one go.
Limiting ourselves to the simple case of full-rank, square coeffi-
cient matrix A so that A−1 exists, if we have Ax = b1 and Ax = b2 ,
we could write a “super” augmented matrix like, A | [b b ]
1 2 . If the
RREF of this augmented matrix becomes I | [b1 b2 ] , we know that
′ ′

b′1 is the solution to the first system, and b′2 that of the second one.
In other words:
RREF
A | [b1 b2 ] −−−→ I | [b′1 b′2 ] =⇒ Ab′1 = b1 and Ab′2 = b2

More generally, we could write this complicated super system as

(again, limiting ourselves to A, B, B ′ ∈ Rn×n ):
RREF
AX = B =⇒ A | B −−−→ I | B ′

Noting that X is a symbolic matrix (xij are symbols, not numbers)

and B ′ is the solution that satisfies the equations, we can confidently
write:
AB ′ = B
We know that there is nothing special about this B. In other words,
if we have a A ∈ Rn×n , we can build any number of systems of
linear equations by an arbitrary choice of B. Let’s therefore choose
Inverse of a Matrix 101

B = I, and investigate what it says. With this choice, we get:

AB ′ = I
What does this equation mean? It says that B ′ is the matrix that,
when multiplied with A results in the identity matrix I, which is the
definition of the inverse of A. In other words, B ′ = A−1
′
Remembering again that B is merely the part of the result of
Gauss-Jordan running on A | B , and that B now is I, we get the
same cryptic recipé for A−1 :
RREF
A | I −−−→ I | A−1
Elaborating on it: To find the A−1 ,

1. Construct the augmented matrix A | I .
2. Run the Gauss-Jordan algorithm on it.
3. If we get theidentity matrix I in the place of the original matrix
A in A | I , the augmented part will now contain the inverse
A−1 .
4. If we cannot get I, it means A is singular.
We see that Gauss-Jordan elimination (which is an extension of
Gaussian elimination) gives us a recipé for computing the inverse of
any invertible matrix A. It so happens that the Gauss-Jordan method
of finding the inverse is the one that is computationally most efficient.

5.3.4 Cofactors to Compute Inverse

In addition to the Gauss-Jordan algorithm, we also have an analytic

formula, similar to the Laplace expansion for determinant compu-
tation, Eqn (3.5), to directly compute the inverse, using the same
cofactors we defined in §3.3.2, page 56.
Refreshing our memory, every element in a matrix A = aij has
cofactor Cij , which is the determinant of the submatrix obtained by
deleting the ith row and j th column of A, with an associated sign. We
can put these cofactors (being just numbers) in a matrix of the same
size to get C = [Cij ]. The transpose of this matrix of cofactors goes
by many names, such as the adjugate, the classical adjoint and the
adjoint matrix.
102 Ranks and Inverses of Matrices

Once we have the matrix of cofactors, we can compute its transpose

(which is the adjugate matrix) and then the inverse as:
1 T
A−1 = C (5.1)
|A|
Although this equation looks compact, it is horribly expensive com-
putationally. As we can see from the formula, when |A| = 0, we
cannot compute A−1 and the matrix A is singular.

5.3.5 Inverse of a 2 × 2 Matrix

For a 2 × 2 matrix, we can apply Eqn (5.1) and write down its inverse.

a c −1 1 d −c
A= =⇒ |A| = ad − bc and A =
b d ad − bc −b a
Remember this formula as: Swap the diagonal elements and switch
the sign of the non-diagonal elements. And then divide the determi-
nant.

5.4 Left and Right Inverses

Left inverse of a matrix A is that matrix which produces I when

multiplied on the left of A. Similarly for the right inverse, it is
the multiplication on the right. In other words, if BA = I, then
B = A−1 Left . And, if AC = I, then C = AReft .
−1

Although we did not specify it, the inverse A−1 we have been
talking about is a double-sided inverse: AA−1 = I = A−1 A.
For a full-rank, square matrix, we can prove that if A, B, C ∈
Rn×n , rank(A) = n, and AB = CA = I, then B = C.
For “tall” or “wide” matrices (as described in §5.1.2, page 89),
the situation is a bit more complicated, even if they are of full rank.
Remember, for tall matrices full rank means full column rank, and for
wide ones, it means full row rank. To state it with deadly mathematical
precision:
Tall, full rank =⇒ A ∈ Rm×n , m > n, rank(A) = r = n
Wide, full rank =⇒ A ∈ Rm×n , m < n, rank(A) = r = m
For tall matrices, what is the shape of AT A? It is a square matrix of
size n × n = r × r. It has a rank of r, as we stated in Property 7, while
Cramer’s Rule 103

listing the properties of ranks in §5.1.3, page 89. When A has full
column-rank, r = n, and AT A is full rank as well. This so-called
−1
Gram matrix is therefore an invertible matrix, and AT A exists.
T
T T
It is also symmetric because A A = A A. With a bit of matrix
algebra wizardry, we can write:

−1 −1 T
AT A AT A = I =⇒ AT A

A A=I (5.2)

This is of the form BA = I, and therefore (AT A)−1 AT is
something like an inverse of A, when multiplied on the left. We will
m×n
call this the left inverse of A, A−1
Left . Remember, A ∈ R ,m > n
is not square. Although we cannot multiply A on the right with A−1 Left
(because of conformance requirements is met), what we will get is a
matrix in Rm×m with a rank of n < m. And the product cannot be
an identity matrix because it is rank-deficient.
In data science, our matrices tend to be “tall,” with m k n, and
A A ∈ Rn×n is a nice, small matrix. Moreover, it tends to be a full-
T

rank matrix because data points are usually measurements, and one
measurement is very unlikely to be a linear combination of others.
AAT ∈ Rm×m , on the other hand, is a horrible, huge matrix with a
rank r = n j m, which means it is hopelessly rank deficient.
When we have a “wide” matrix, however, it is the other Gram
matrix, AAT ∈ Rm×m , that is the nice matrix. All we wrote down
above for the left inverse will work for the right inverse with obvious
and trivial changes, and altogether we get the following results:

Tall, full rank : A ∈ Rm×n , m > n, r = n =⇒ A−1 T

Left = (A A)
−1 T
A
Wide, full rank : A ∈ Rm×n , m < n, r = m =⇒ A−1 T T −1
Reft = A (AA )

5.5 Cramer’s Rule

Related to the matrix of cofactors and the analytic formulas for de-
terminant and inverse of matrices (Equations (3.5) and (5.1)) is a
beautifully compact and elegant (albeit computationally useless) for-
mula for solving a system of linear equations. This formula is known
as the Cramer’s Rule. It states that for a system of linear equations
Ax = b with a square, invertible A and x = [xj ], the solution is
104 Ranks and Inverses of Matrices

given by:
|Aj |
xj = (5.3)
|A|
where Aj is the matrix formed by replacing the j th column of A with
b.
To recap, we have learned four ways of solving the system of linear
equations Ax = b.

1. Do Gaussian elimination on the augmented matrix A | b ,
followed by back substitution.
2. Perform
Gauss-Jordan elimination on the augmented matrix
A | b (which is not really different from the previous method).
3. If A is a full-rank, square matrix Ax = b =⇒ x = A−1 b.
This is often written in some programs as x = A\b, indicating
something like dividing by A on the left.
4. Use Cramer’s rule, again if A is a full-rank, square matrix.
As we can see, although the methods (3) and (4) are elegant, they
work only for the ideal situation of unique solutions. Besides, they
are (especially the Cramer’s rule method is) prohibitively expensive
to compute for nontrivial matrix sizes. To top it off, their numeric
stability also is questionable.

5.6 Algebraic View of Linear Algebra

We have completed algebraic view of Linear Algebra, by which

we meant determining the solvability conditions of systems of lin-
ear equations and characterizing the solutions. Along the way, we
touched upon the concepts of ranks and determinants, inverses and a
couple of decompositions.
While looking at linear equations, we did speak of their shapes in
coordinate spaces: a linear equation is a line in R2 and a plane in
R3 , and their solutions are the intersections of lines or planes. While
this view is undoubtedly geometric, the geometric view of Linear
Algebra we will start in the next part of the book is the deeper and
richer structure of vector spaces and subspaces and their properties.
The insights from such geometry, as we shall see, will have direct
Algebraic View of Linear Algebra 105

relevance to certain algorithms, and a wider relevance to how we

discuss and present our ideas in computer science.

Get the Full Edition of LA4CS with

Summaries, Exercises and Solutions
Only $7.95. Scan, Click or Tap to buy.
Part III

Geometric View
6
Vector Spaces, Basis and
Dimensions

To those who do not know mathematics, it is difficult to get

across a real feeling as to the beauty, the deepest beauty, of
nature. . . If you want to learn about nature, to appreciate
nature, it is necessary to understand the language that she
speaks in.
—Richard Feynman

The notion of linear combinations is central to Linear Algebra,

which is why we started seeing it from the very first chapter of this
book. Now that we are taking a geometric view to understanding the
subject, we will deal with this notion and the associated concepts in
much more detail and in a formal way.

6.1 Linear Combinations

Let’s start by reminding ourselves of the basic operations on vectors

(and indeed matrices):
• Vectors can be scaled. A vector multiplied by a scalar is another
vector, which means the collection of vectors is closed under
scalar multiplication.
108 Vector Spaces, Basis and Dimensions

• Vectors can be added. The set of vectors is closed under vector

addition as well: When we add vectors, we get other vectors.
We can do both these operations together: Take a bunch of vectors,
scale them and add the scaled versions, which is exactly what we mean
by a linear combination. As we can see from the basic operations,
the set of vectors is closed under linear combinations as well; when
we take linear combinations, what we get are other vectors. Stating
this rather obvious fact formally:
∀xi ∈ Rn ,si ∈ R and any nonnegative integer k,
z = s1 x1 + s2 x2 + · · · + sk xk
(6.1)
X
k
n
= si xi ∈ R is a linear combination
i=1

Let’s look a few examples of linear combinations:

LINEAR COMBINATIONS OF TWO 6 "

LINEARLY INDEPENDENT
5
VECTORS IN =$ #"! '$
1 3 4
!! = ! = (# = #"! '$ + ,## '#
2 " 1 ($ = #!! '$ + #!" '# =
4
3 3
4
&! = '## !! + '#$ !" =
3 2 '$
#!! = 1, #!" = 1
&" = '$# !! + '$$ !" 1
'#

&% = '&# !! + '&$ !" !

28 27 26 25 24 23 22 21 0 1 2 3 4 5 6 7 8
21
#%" '#
#"" '#
22

23 Any $- * =. can be
#%! '$ written as a unique
24 linear combination of
(& = #%! '$ + #%" '#
25 7/ and 70

Fig. 6.1 Linear combinations of two vectors in R2 . We can create all possible vectors in
R2 starting from our x1 and x2 .

1. Two vectors in R2 : In Figure 6.1, we start with two vectors

x1 and x2 . The first linear combination we make is the sum
z1 = x1 + x2 , which is the same as z1 = s1 x1 + s2 x2 , with
the scalar multipliers s1 and s2 set to 1. Then we change the
values of the scalars s1 and s2 so as to get other vectors. As
we can see, we can get to any vector in R2 by the appropriate
choice of s1 and s2 .
Linear Combinations 109

LINEAR COMBINATIONS OF TWO 6 "

LINEARLY DEPENDENT
5
VECTORS IN =$
1.5 3 4
!! = ! =
0.5 " 1
3
4.5
&! = '## !! + '#$ !" =
1.5 2
#!! = 1, #!" = 1 '# 4.5
($ = #!! '$ + #!" '# =
&" = '$# !! + '$$ !" 1 '$ 1.5
#"! '$ !
&% = '&# !! + '&$ !"
28 27 26 25 24 23 22 21 0 1 2 3 4 5 6 7 8
(# = #"! '$ + #"" '#
#%" '# 21
#"" '#
#%! '$ 22
Any &' on the green line can be
written in an infinite number of
(& = #%! '$ + #%" '#
23 linear combinations of !! and !"
($ = '$ + '#
24 = 3'$
= 1.5'#
25
= 3 + 24 '$ 2 4'#
26

Fig. 6.2 Linear combinations of two vectors in R2 . Starting from our x1 and x2 , we
cannot get out of the line defined by the vectors, no matter what scaling factors we try. The
vectors are not linearly independent.

2. In Figure 6.2, we again have two vectors. In this case, however,

when we take all possible linear combinations, we do not seem
be able get to all vectors in R2 ; we are confined to one line.
In order see why, we have to first notice that the two vectors
we defined were not independent of each other. Our second
vector, x2 is a scalar multiple of x1 : x2 = 2x1 . Therefore,
when we take a linear combination z = s1 x1 + s2 x2 , what we
are actually doing, in effect, is only scaling one vector.

z = s1 x1 + 2s2 x1 = (s1 + 2s2 )x1 = sx1

All the scaled versions of x1 fall in the same direction as x1 .

3. In the third example. Figure 6.3, we move on to R3 , but
with vectors very similar to the first case, but with a third
component (because vectors in in R3 have three components)
equal to zero. When we take all possible linear combinations
z = s1 x1 + s2 x2 , we can see that the third component of z can
never be anything other than zero, and therefore all such linear
combinations live on the xy plane.
4. In Figure 6.4, we have added some nonzero values as the third
component of x1 and x2 . Still, the linear combinations of
110 Vector Spaces, Basis and Dimensions

"
LINEAR COMBINATIONS OF TWO
6
VECTORS ON CD PLANE IN =& #"
'# !' '#
1 3 # "" 5 $
# !"
+ +
'$ = 2 '# = 1 !
'$ !
'$
#" 4 #!
0 0 = =
4 (# ($
3
($ = #!! '$ + #!" '# = 3
'$
0 2
,!! = 1, ,!" = 1
(# = #"! '$ + #"" '# 1 '#
(& = #%! '$ + #%" '# !
28 27 26 25 24 23 22 21 0 1 2 3 4 5 6 7 8
21
# %"' #
#""' # 22

23 All linear combinations /. of

6 1/ and 10 are confined to
#%! '
'# $ 24
# %"
the 45 plane because the
+ 25 third component is 0
'$
# %!
= 26
(&

Fig. 6.3 Linear combinations of two vectors in R3 . All the vectors we create starting
from our x1 and x2 live on the xy plane. We do not have enough vectors to span the whole
of R3 .

" LINEAR COMBINATIONS OF

6 TWO GENERAL VECTORS IN =&
1 3
5 1# = 2 1$ = 1
(# = #"! '$ + #"" '#
2 21.5
4 4
($ = #!! '$ + #!" '# = 3
3 '$
# "! 0.5
#" ' ,!! = 1, ,!" = 1
" 2
#
(# = #"! '$ + #"" '#
1
'$
#%" ' (& = #%! '$ + #%" '#
#
!
28 27 26 25 24 23 22 21 0 1 2 3 4
($ = #!!5'$ + 6#!" '#7 8
'
(& = #%! '$ + #%" # 21
'#
'$ 22
, &$
23
6
24
All linear combinations 25
/. are confined to the plane
26
defined by of 1/ and 10

Fig. 6.4 Linear combinations of two vectors in R3 . Here again, we do not have enough
vectors to span R3 . The linear combinations of x1 and x2 live in a plane defined by the
two vectors. (Not drawn to scale.)

the two vectors, z = s1 x1 + s2 x2 , all fall on a plane de-

fined by the two vectors, or, equivalently, by the three points
(0, 0, 0), (1, 2, 2) and (3, 1, −1.5). Two vectors are not enough
to span R3 .
Linear Combinations 111

5. As we can also see from Figure 6.4, we have a hard time

depicting to general vectors in R3 . In this fifth example, we
are going to move to three vectors, but will not attempt to draw
them. If we have the following three vectors, we still cannot
get all vectors in R3 :
     
1 3 11
x1 = 2 x2 = 1 x3 = 17
    
0 0 0
In addition to the two vectors in Figure 6.3, we have added a
third vector, but with its third component equal to 0. Now we
take all possible linear combinations z = s1 x1 + s2 x2 + s3 x3 .
The third component of z will have to be zero no matter what
scaling factors we try, and we are confined to the xy plane.
Although it may not be obvious, x3 is a linear combination of
x1 and x2 , which we can see by solving for s1 and s2 in the
equation x3 = s1 x1 + s2 x2 .
6. In the next example, we start with the two vectors in Figure 6.4
and add a third vector, which we know is a linear combination
of the two. We have these three vectors in R3 :
     
1 3 9
x1 =  2  x2 =  1  x3 = 8 = 3x1 + 2x2

2 −1.5 3
The three vectors, being linear combinations of each other, are
on the same plane in R3 . All other linear combinations we can
make with them will fall on the same plane.
7. In the last example, we will start from the simpler vectors in
Figure 6.3, which live on the xy plane in R3 . We add a simple
third vector with a component in the third direction.
     
1 3 0
x1 = 2 x2 = 1 x3 = 0 
    
0 0 1
Note that none of these three vectors can be written as the linear
combination of the other two. Now we are no longer confined
to any plane. We can indeed create any vector in R3 as a linear
combination of these three.
112 Vector Spaces, Basis and Dimensions

We have listed seven examples with many permutations of linear

combinations of vectors above. In order to make sense of them, and
to introduce some terms that we will define very soon, let’s take stock
of what we have done. In the first example, we had two linearly
independent vectors in R2 . Two linearly independent vectors are
enough to span R2 , which is the collection of all possible vectors
with two components. We will soon call R2 a vector space. In
the second example, we had two vectors, but they were not linearly
independent. What they spanned was a subset of R2 , which we will
call a subspace. In the third example, we went on to R3 , but with
two linearly independent vectors, which were not enough to span
the whole vector space. Instead, they spanned a subspace that was
the xy plane. In the fourth example, we again had only two vectors
in R3 , spanning a subspace. In the fifth and sixth examples, we
did add the third vector, but it was not linearly independent of the
other two, and therefore, the spans of the three vectors were still
only subspaces. Finally, in the seventh example, we had three good,
linearly independent vectors that could span the entirety of R3 .
Our job now is to define the unfamiliar terms in the previous
paragraph, express them in mathematical language, and generalize
them to higher dimensions, Rn . Clearly, we will not be able to
visualize them, but that inability will not prevent us from developing
insights and intuitions that will apply to all dimensions and datasets.

6.1.1 Spans of Vectors

Definition: The set of all possible linear combinations of a given set

of vectors is their span. To write it mathematically,
( k
)
X
n
Given k vectors xi ∈ R , span({xi }) = z|z= s i xi
i=1
(6.2)
for any k scalars si ∈ R

In Figure 6.1, the span of the red and blue vectors is all possible
vectors in R2 . In Figure 6.2, the span of the two vectors is a much
smaller subset: Only those vectors that are in the same direction (the
dotted line in the figure) as the original two, collinear vectors.
Vector Spaces and Subspaces 113

6.1.2 Cardinality

Definition: The number of vectors in the span is called its cardinality.

Cardinality is just a fancy name for the size or the number of elements
in a set. It applies to our notion of span of vectors as well as the set
of vectors that define a span, as in Eqn (6.2) for instance. The
cardinality of the former is indeed infinity (as there are an infinity of
linear combinations we can form), while the cardinality of the latter
(the vectors of which we are forming the linear combinations) is k.

6.1.3 Linear Independence

Definition: The vectors in a set are linearly independent if we cannot

write any one of them as a linear combination of the others. An
equivalent definition, again in mathematical lingo, is that the neces-
sary and sufficient condition for a given set of k vectors xi ∈ Rn to
be linearly independent is that there be no set of scalars si ∈ R that
are not all zero such that:
X k
s i xi = 0 (6.3)
i=1

Clearly, with all si = 0, the sum is always the zero vector. What we
are looking for is a set of scaling factors with at least some nonzero
numbers such that the sum is 0. If we can find such a set, the vectors
are not linearly independent. They are linearly dependent.
Looking at the example in Figure 6.2, we have x2 = 2x1 =⇒
x2 − 2x1 = 0, the zero vector as a linear combination with nonzero
scalars, which means x1 and x2 are not linearly independent. Sim-
ilarly, for the sixth example in the list of examples above, x3 =
3x1 + 2x2 or 3x1 + 2x2 − x3 = 0, again the zero vector as a a linear
combination with nonzero scalars, implying linear dependence.

6.2 Vector Spaces and Subspaces

For our purpose in this book, we will define a vector space as a set of
all possible vectors. In fact, while defining vectors, their operations
and properties, we were indirectly defining vector spaces as well. For
instance, we said when we scale a vector or add two vectors, we get
another vector, which is the same as saying that the set of vectors
114 Vector Spaces, Basis and Dimensions

is closed under scalar multiplication and vector addition. We also

listed the associativity, commutativity and distributivity properties
of vector operations, which are, in fact properties of the underlying
vector space. Although mathematical puritans may look askance at
such sloppiness, we will consider it a pardonable foible of computer
scientists. We will, however, present the full and precise definition in
all its mathematical glory in §8.1, page 144.

6.2.1 Vector Space

Definition: The set V of all vectors x ∈ Rn is a vector space (AKA

linear space) if the operations of a scalar multiplication and a vector
addition (with the right properties as specified earlier) are defined for
x and V is closed under the operations.
As we mentioned earlier, any mathematical object that can fulfill
the requirements of vectors is a vector, as far as Linear Algebra is
concerned. And a closed set of any such vectors would be a vector
space. Here are some examples of vector spaces:
Coordinate Space: The usual Euclidean space of points is a vector
space, treating the numbers and coordinates in them as vectors.
For instance, the good old number line (same as R) is a vector
space. If we treat the (x, y) coordinates in R2 as vectors, the
two-dimensional plane, being a collection of such coordinates,
is a vector space. Such spaces (Rn ) are called coordinate
spaces, naturally.
Space of Matrices: Matrices also have operations identical to vec-
tors, as we defined them. Vectors are indeed single-column
matrices. Therefore, we can treat a set of matrices (of the
same size) as a vector space. For instance, if we create a set
M = {A | A ∈ Rm×n }, it would be a vector space.
Complex Numbers: The set of complex numbers (C) is a vector
space as well, very similar to the coordinate space of R2 .

6.2.2 Subspace

Definition: Given a vector space V, a subset of it is a vector (or

linear) subspace S if it is closed under the same operations of a scalar
multiplication and a vector addition defined in V. In other words, for
Vector Spaces and Subspaces 115

any vectors xi ∈ S, the same operations (as defined in V) applied on

them results in other vectors yi ∈ S.
In practice, we can say that all linear combinations of the vectors
in S have to be in S for it to qualify as a subspace. Therefore, a
practical definition of a subspace S ¦ V is that it is the span of a
given set of vectors in V.
In Figure 6.2, we have two vectors, x1 and x2 , that were scaled
versions of each other. All their linear combinations (which are, in
fact, just scaled versions of x1 ) form a line in R2 . This line represents
a subspace because, when we add any two vectors in this line, we get
another vector that is collinear with it. Note that the zero vector has
to be a part of any subspace because the trivial linear combination
of the vectors (that define the span, which is the subspace) with zero
scalars is still a linear combination and has to be in the subspace.
If we have two subspaces, their intersection is a subspace, but
their union is not. Let’s formally prove the intersection part of this
statement.
1. Let S1 and S2 be the subspaces and I = S1 ∩ S2 , with x1 , x2 ∈ I.
2. By the definition of the intersection, x1 , x2 ∈ S1 and x1 , x2 ∈ S2 .
3. Since S1 is a subspace, for any two scalars, s1 and s2 , s1 x1 + s2 x2 ∈ S1 .
4. Since S2 is a subspace, for the same scalars, s1 x1 + s2 x2 ∈ S2 as well.
5. Therefore, s1 x1 +s2 x2 ∈ I, since it is in both S1 and S2 =⇒ I = S1 ∩S1
is a subspace.

6.2.3 Orthogonal Subspaces

Orthogonality in Linear Algebra is a generalization of perpendicular-

ity in geometry; the latter applies to lines or other shapes where the
concept of angles makes sense. Two lines are perpendicular when
the angle between them is 90◦ . On the other hand, two vectors are
orthogonal when their dot product is zero. In our definition of vectors
for the purpose, there is little difference between orthogonal vectors
and perpendicular vectors, but there can be, with other definitions of
vectors and their dot products. We shall, therefore, never again talk
of perpendicular vectors.
We can extend the concept of orthogonality to vector subspaces as
well. Two subspaces are orthogonal to each other when every vector
in one subspace is perpendicular to every vector in the other one.
116 Vector Spaces, Basis and Dimensions

Here are two subspaces: The xy plane and the yz plane in R3 . Are
they orthogonal subspaces? From the geometric shapes, they look
like two planes at right angles to each other. But is every vector in
the xy subspace orthogonal to those in the yz subspace? Clearly not,
because we have a whole line of vectors along the y axis that are in
both subspaces. They are clearly not orthogonal to themselves. Any
two subspaces with nontrivial (meaning more than the zero vector, 0)
intersection, therefore, cannot be orthogonal to each other. Note that
all subspaces have to contain the zero vector.
To look at a positive example, the xy plane and the z axis can be
considered two subspaces. They are indeed orthogonal to each other.
Every vector on the xy plane is orthogonal to every vector along the
z axis. Here is a generalized proof using symbolic vectors.
1. A general vector on the xy plane and a general vector along the
z axis have the form:
   
x 0
x= y   z = 0

0 z

2. The dot product between any vector on the xy plane and any
vector along the z axis is therefore:
x · z = xT z = 0 =⇒ x § z

3. Therefore, by the definition of orthogonal subspaces, xy plane

§ z axis.
Both perpendicularity and orthogonality are indicated by the same
symbol §.

6.3 Basis and Dimensions

6.3.1 Basis

Definition: A basis of a vector space or a vector subspace is a set of

vectors that meet two criteria:
1. That they span the vector space or the vector subspace.
2. That they are linearly independent.
Basis and Dimensions 117

Notational Abuse
For our purposes in Linear Algebra, R2 , R3 , Rn etc. are vector spaces, which means they
are collections or sets of all possible vectors of the right dimensions. R2 , for instance, is
a collection of all vectors of the kind

x
x = 1 with x1 , x2 ∈ R
x2
and nothing else.
However, as we saw, R2 also is a coordinate space—the normal and familiar two-
dimensional plane where we have points with (x, y) coordinates. Because of such
familiarity, we may switch lightly between the vector space that is R2 and the coordinate
space that is R2 . We did it, for instance, when speaking of the span of a single vector
which is a line. In a vector space, there is no line, there are no points, only vectors.
Similarly when we spoke of vector subspaces, we spoke of the xy plane in R3 without
really distinguishing it from the vector space R3 .
Since the vectors in R2 or R3 , as we described them so far, all have components (or
elements) that are identical to the coordinates of the points in the coordinate space, this
abuse of notation goes unnoticed. Soon, we will see that the coordinates are artifacts of
the basis that we choose. It just so happens that the most natural and convenient basis
vectors are indeed the ones that will give coordinates as components.
Ultimately, however, it may be a difference without a distinction, but it is still good to
know when we are guilty of notational abuse so that we may be careful to avoid mistakes
arising from it.

The vectors in the basis are called basis vectors, obviously. As we

can see, the concept of basis builds on the related concepts of linear
combinations and spans of vectors. We can define basis in a variety of
ways. Here is another definition: A basis of a vector space or vector
subspace is the minimum set of vectors that span it. Yet another
one: A set of vectors in a vector space (or subspace) is called a basis
if every element in it can be written as a unique and finite linear
combination of them. All these definitions are indeed equivalent.
Illustrating it with a couple of examples and counterexamples:

• In Figure 6.1 illustrating linear combinations, we have two

vectors and linear combinations like z:

1 3 2 5
x1 = and x2 = Test vector: z ∈ R =
2 1 1

x1 and x2 span all of R2 and are linearly independent. There-

fore they form a basis for the vector space R2 . Given any
118 Vector Spaces, Basis and Dimensions

vector z ∈ R2 , we can find unique scalars s1 and s2 such that

z = s 1 x1 + s 2 x2 .
• In Figure 6.2, we have two candidate basis vectors x1 and x2 :

1.5 3 2 5
x1 = and x2 = Test vector: z ∈ R =
0.5 1 1

x1 and x2 do not span all of R2 and are not linearly independent.

Therefore they are not a basis for the vector space R2 . Given
our test vector z, we are not able to find scalars s1 and s2 such
that z = s1 x1 + s2 x2 .
• However, considering the subspace indicated by the line in the
same Figure 6.2, are the two vectors in the previous example a
basis? The vectors x1 and x2 do span this subspace, but they
are still not a basis because they are not linearly independent.
For this reason, a new vector z2 , which is in this subspace, can
be expressed as an infinite number of linear combinations of
x1 and x2 .
• The reason why the two vectors did not form a basis for the
subspace in the previous example was that we had too many
vectors: for a subspace formed as a span of one vector (which
is a line of vectors), we need only one vector in the basis. Simi-
larly, for a subspace that looks like a plane (span of two vectors)
as in Figure 6.4, we need exactly two vectors in the basis. If
we had a third vector, it is necessarily a linear combination of
the other two that are already in the basis, and is therefore not
linearly independent.

6.3.2 Dimension

Definition: The dimension of a vector space or subspace is the number

of vectors in its basis. Or, in fancier language, the dimension is the
cardinality of any basis of the space or subspace.
Note that a vector space or vector subspace (which we may as well
start calling spaces and subspaces, dropping the vector part, now that
we know them well enough) may have many different bases, but all
of them will have the same cardinality. The dimension of a space or
a subspace is an immutable property.
Basis and Dimensions 119

Earlier, we spoke of the orthogonal subspaces. We can restate the

orthogonality in terms of their bases: Two subspaces are orthogonal
to each other if all the basis vectors of one are orthogonal to all the
basis vectors of the other.
In R3 , we spoke of the xy plane as a subspace being orthogonal
to the z axis (taking the liberty to mix vector and coordinate spaces).
The dimension of the first space is 2 and the second one is 1, adding
up to the dimension of the containing space R3 . We saw that the xy
plane and the yz plane do not form orthogonal subspaces, and one
reason is that their dimensions add up to 4, which is more than the
dimension of R3 .
In R4 , however, we will be able to find two-dimensional subspaces
that are orthogonal to each other. Remember that subspaces cannot
have nontrivial (nonzero) intersections. Therefore, we can have two
planes intersecting at a point in four dimensions!

6.3.3 Orthogonal Complements

Definition: If we have a subspace S ¢ Rn of dimension r, then its

orthogonal complement S § is the collection of all vectors x ∈ Rn
that are orthogonal to all vectors in S.

S § = x | xT y = 0 ∀ y ∈ S

As a consequence, the dimension of S § = n − r so that the dimen-

sions of the subspace and its orthogonal complement add up to the
dimension of the containing space.
In the earlier example of orthogonal subspaces, the xy plane and the
z axis (in the coordinate space R3 ) are, in fact, more than orthogonal
to each other. They are orthogonal complements, which means there
are no other dimensions left in the space beyond what is accounted
for by these two. To look at a counter example, the x and z axes
define orthogonal subspaces, but not orthogonal complements.
Keep in mind that although all the dimensions are accounted for
when we think of the orthogonal subspaces (such as the xy plane and
the z axis in R3 ), there are still plenty of vectors not in either. In fact,
most vectors in R3 are not on the xy plane or the z axis; they are
linear combinations of the ones in these to subspaces.
120 Vector Spaces, Basis and Dimensions

6.4 Geometry of Linear Equations

Earlier, we listed a set of equations in Table 4.1 and visualized some

of them in Figure 4.1, as part of algebraic view of solving equa-
tions. Our linear equations formed lines in two-dimensional xy
planes (which are coordinate spaces, not vector spaces, underscoring
the need for care in distinguishing between them as described in the
box on “Notational Abuse”). Now we want to look at the equations
again as vectors in some vector space, where we present the geometry
in a completely different way.
Looking at the system of linear equations Ax = b one row at a
time, as an equation making a shape in the coordinate space is the
row picture. Our equations in Figure 4.1 have two variables each,
and the row picture produces visualizations in the coordinate space
of R2 . In general, for A ∈ Rm×n , the row picture works in Rn ,
n being the number of variables. Our new way of looking at the
equations, which we will call the column picture, will mean that we
are working in Rm , of dimension the same as the number of rows in
A, or, equivalently, the number of equations in the system. In our
examples in Figures 4.1 and 6.5 to 6.7, if the row picture looks simpler
than the column picture, it is only because we have two variables,
and potentially more than two equations. As the coefficient matrix A
becomes large, both pictures become equally challenging to visualize,
but the column view gives us deeper insights.
In Figure 6.5, we have our favorite system of two equations, this
time shown as a linear combination of vectors that form the columns
of the coefficient matrix A. Here is the system:
x+y =5

1 1 5
Ax = b A = , b=
x−y =1 1 −1 1
By the column picture of matrix multiplication, we know that the
product Ax is a linear combination of the columns of A, taken with
the scaling factors in x, which are x and y. The system of equations
Ax = b says that the solution is the right values for the scalars in
the linear combination of columns of A that will give is b. How we
find these right values as depicted in Figure 6.5. Note that we had to
relabel the axes (in Figures 6.5 to 6.7) as Directions 1 and 2 because
they are indicators in a vector space now, not in the coordinate space.
In particular, the x and y in our equations have nothing to do with
Geometry of Linear Equations 121

" Direction 2
6 ·
!+" =5
!2" =1 5
8' = 9 ó ;$ ;# ' = 9
1 1 ! 5 4
= !"! = 3
1
1 21 " 1 1
Column Picture of Equations: 3
1 1 5
! +" =
1 21 1 2 !! =
1
!;$ + ";# = 9 1 5
(=
1 1
Direction 1
·!
28 27 26 25 24 23 22 21 0 1 2 3 4 5 6 7 8
21
1
!# =
21
22
Draw the blue dotted ± Length of ;'$ 1
!= &$" = 2
line parallel to ;# , Length of ;$ 21
23
through the tip of 9, ± Length of ;'#
intersecting the line of "=
Length of ;# 24
;$ and constructing ;'$ .
Construct ;#' similarly. 9 = !;$ + ";#
25

Fig. 6.5 Two linear equations on two unknowns with a unique solution. The scalars for
the column vectors of A to produce b as linear combinations are the solution to the system
of equations. Note that the length of a′i is signed (as indicated by ±): It is positive if a′i is
in the same direction ai , negative otherwise.

these directions, once again highlighting the need for care described
in the box above on notational abuse.
To convince ourselves that this system does have a solution in this
case, let’s outline, as a series of steps, or an algorithm of sorts, how
we can get to the right values for the scalars1 referring to Figure 6.5:
1. Draw a line parallel to the second blue vector (the second
column of A, a2 ), going through the tip of the green b vector.
It is shown in blue as a thin dashed line.

2. Scale the first red vector (the first column of A, a1 ) to reach this
line. The scaling required tells us what the scalar should be.
For our equations, the scalar for a1 is x = 3 in b = xa1 + ya2 .

3. Similarly, draw the red dashed line parallel to a1 , going through

the tip of the green b vector.

1
In listing these steps, we break our own rule about notational abuse: We talk about drawing
lines in the vector space, which we cannot do. A vector space contains only vectors; it does
not contain lines. It is coordinate spaces that contain lines. This predicament of ours shows
the difficulty in staying absolutely rigorous about concepts. Perhaps pure mathematical rigor
for its own sake is not essential, especially for an applied field like computer science.
122 Vector Spaces, Basis and Dimensions

4. Scale a2 to reach this line, which gives us the scalar y = 2 in

b = xa1 + ya2 .
Following the steps listed above, we can see that the solution is the
scalars needed for a1 and a2 , giving us (x, y) = (3, 2). Although
we presented our steps above as an “algorithm,” we should point
out that it is never used as a method for solving the equations. It is
an algorithm only for mentally constructing the requisite scalars for
the right linear combination, illustrating that it is possible to do so,
and that the scalars are unique in this case of linear equations with a
unique solution.
" Direction 2
6 ·
!+" =5
!+" =1 5
8' = 9 ó ;$ ;# ' = 9
1 1 ! 5 4
= !"! = *
1
1 1 " 1 1
Column Picture of Equations: 3
1 1 5
! +" =
1 1 1 2 !! =
1
!;$ + ";# = 9 1 5
1 (=
1 !# = 1
1 Direction 1
·!
28 27 26 25 24 23 22 21 0 1 2 3 4 5 6 7 8
21

The blue dotted line parallel to 22

;# , through the tip of 9 does not
intersect the line of ;$ 23
ó Cannot construct ;'$ .
Nor can we construct ;'# . 24
Cannot find ! and " such that
!;$ + ";# = 9 25

Fig. 6.6 Two inconsistent linear equations on two unknowns with no solution. All linear
combination of the column vectors of A fall on the purple line, and the right hand side
does not. Therefore, b is not in the span of a1 and a2 . Hence no solution. Note that we
have drawn the red and blue vectors slightly offset from each other for visibility; they are
supposed to be on top of each other.

Let’s now move on to the case where we have two linear equations
that are not consistent with each other: x + y = 5 and x + y = 1. In
Figure 4.1, we saw that they were parallel lines, which would never
meet. What do they look like in our advanced geometric view in the
vector space? The column vectors of the coefficient matrix are now
identical. As we now know, all possible linear combinations of the
two vectors a1 and a2 fall in a subspace that is a line defined by the
direction of either of them. Their span, to use the technical term, is
only a subspace. The green vector b that we would like to create is
Geometry of Linear Equations 123

not along this line, which means no matter what scalars we try, we
will never be able to get b out of a1 and a2 . The system has no
solutions.
If we try the steps of our little construction algorithm above, we
see that while we can draw a line parallel to a2 going through the tip
of b, there is scaling factor (other than ∞, to be absolutely rigorous)
that will take a1 to this line.

" Direction 2
6 ·
!+" =5
!2" =1 5
!+" =6
0+0=1 4
!"! = ?
1 1 ! 5
8' = 9 ó 1 21 " = 1 3
0 0 1 5
Column Picture: 1
2 1
!! = 1
1 1 5 0 5
0
! 1 + " 21 = 1 1 (= 1
1
Direction 1
0 0 1 ·!
28 27 26!;$ +
25";# =
249 23 22 21 0 1 2 3 4 5 6 7 8
21
1
! = 21
22 #
The blue dotted line ó Cannot construct ;'$ . 0
&$" = ?
parallel to ;# , through Cannot construct ;#' .
23
the tip of 9 does not Cannot find ! and "
intersect the line of ;$ such that
24
because they are on !;$ + ";# = 9
different planes ó No solution
25

Fig. 6.7 Three linear equations on two unknowns with no solution. Here, a1 and a2 are
not collinear, but span the plane of the page. The right hand side, b, has a component in the
third direction, perpendicular to the page coming toward us, indicated only by the shadow
that b casts.

In Figure 6.7, we have three equations, with the third one incon-
sistent with the first two. The geometric view is in three dimensions,
as opposed to the algebraic visualization of the equations, which still
stays in two dimensions because of the number of variables. In other
words, the geometric view is based on the column vectors of A,
which have as many elements as equations, or number of rows of A.
The algebraic view is based on the number of unknowns, which is
the same as the number of columns of A.
The fact that the vector space now has three directions makes it
harder for us to visualize it. We have simplified it: First, we reduced
the third equation to a simpler form. Secondly, we indicate the third
direction (assumed to be roughly perpendicular to the page, coming
124 Vector Spaces, Basis and Dimensions

toward us) only by the shadows that the vector and the dashed lines
would cast if we were to shine light on them from above the page.
Here are the equations, the columns of A and the RHS:
x+y =5
     
1 1 5
x−y =1 a1 1 , a2 −1
    b = 1

x + y = 6 (reduced to 0 = 1) 0 0 1
Running our construction algorithm on this system:
1. Drawing a line parallel to a2 (blue vector), going through the
tip of the green b, we get the thin dashed line in blue. This
line is one unit above the plane of the page (because the third
component of b is one).
2. Trying to scale the first red vector (a1 ) to reach this blue dashed
line, we fail because the scaled versions go under the blue line.
3. Similarly, we fail trying to scale the blue a2 as well, for the
same reason. We cannot find x and y such that b = xa1 + ya2 ,
because of the pesky, nonzero third component in b.
4. However, we can see that the shadows of these red and blue
dashed lines (shown in grey) on the plane of the page do meet
at the tip of the shadow of b. Think of these shadows as
projections and we have a teaser for a future topic.
In Figure 6.7, we took a coefficient matrix such that the third
components of its column vectors were zero so that we could visualize
the system relatively easily: Most of the action was taking place on
the xy-plane. In a general case of three equations on two unknowns,
even when we have nonzero third components, the two column vectors
still make a plane as their span. If the RHS vector is in the span of
the two vectors, we get a unique solution. If not, the equations are
inconsistent and we get no solution.
The geometric view of Linear Algebra, much like the algebraic
view, concerns itself with the solvability conditions, the characteris-
tics of the solutions etc., but from the geometry of the vector spaces
associated with the coefficient matrix (as opposed to row-reduction
type of operations in the algebraic view). As we saw in this chapter,
and as we will appreciate even more in later chapters, the backdrop
of the geometric view is the column picture of matrix multiplication.
Geometry of Linear Equations 125

Get the Full Edition of LA4CS with

Summaries, Exercises and Solutions
Only $7.95. Scan, Click or Tap to buy.
7
Change of Basis,
Orthogonality and
Gram-Schmidt
Mathematics takes us still further from what is human,
into the region of absolute necessity, to which not only the
world, but every possible world, must conform.
—Bertrand Russell

Right from the start of this book, we wrote vectors as column ma-
trices, with numbers arranged in a column. These numbers are the
components of the vector. Where do the components come from?
How do we get them? They are, in fact, the byproduct of the under-
lying basis that we did not hitherto talk much about. In this chapter,
we will expand on our understanding of bases, learn how the compo-
nents change when we change bases and explore some of the desirable
properties of basis vectors.
Basis and Components 127

7.1 Basis and Components

Basis of a space1 (such as Rn ) is the minimal set of vectors that span

it. Minimal implies that the basis vectors are linearly independent. In
Rn , we can have a maximum of n linearly independent vectors ai ∈
Rn . Any set of such n linearly independent vectors would be a basis
for Rn . Once we have the basis, we can express any vector x ∈ Rn
as a unique linear combination of the basis vectors. The components
(or the coordinates) of x are the coefficients (or the scaling factors)
of this linear combination. Since the linear combination is unique, so
are the components.

7.1.1 Components of a Vector

Definition: For any x ∈ Rn , if the set of n vectors ai (which are

columns of a matrix A) form a basis for Rn and

n
X
x= xi|A ai (7.1)
i=1

then the n scalars xi|A ∈ R are the components (or coordinates) of x

in the basis ai , or in the basis matrix A.
If we were to place the basis vectors
as the columns of a matrix,
we would get a square matrix A = ai ∈ Rn×n . Now that we know
that we can place a set of basis vectors as the columns of a matrix,
we will start calling the matrices themselves the basis of spaces and
subspaces, which is why we call the components of x in the A basis
as xi|A in Eqn (7.1).

7.1.2 Identity Matrix as a Basis

To answer the question that we started this chapter with: We already

had components to our vectors. Where do they come from? These
components are, in fact, with reference to the identity matrix as the
basis . Let’s look at an example to illustrate it. Consider a vector

1
From this chapter onward, when we say space or subspace, we mean a vector space or a
vector subspace.
128 Change of Basis, Orthogonality and Gram-Schmidt

x ∈ R2 .

2 1 0 2 1 0
x= = Ix = =2 +3 = 2q1 + 3q2
3 0 1 3 0 1

where we called the basis vectors q1 and q2 . Some textbooks, espe-

cially the ones on physics, may call the unit vectors î and ĵ, so that
x = 2î + 3ĵ. In either case, the components of the vector are 2 in the
first direction and 3 in the second one. Clearly, this example of how
a two-dimensional vector gets its components from an I2 extends to
higher dimensions as well. For x ∈ Rn , In is the basis that gives it
the components.
Identity matrices are pretty much the perfect basis we could ask for.
First of all, the column vectors of I all have their norm equal to one,
which is why we call them unit vectors in the previous paragraph.
Secondly, each column is orthogonal to the rest. And lastly, the
matrix is diagonal, which says that the dimension of the space of
which it is a basis is the sum of the diagonal elements. This sum is
also called the trace (usually written as Tr(A) or trace(A) or Tr. A)
of the matrix. Since it is an important concept in theoretical Linear
Algebra, let’s define it:

Trace
Definition: For any A = aij ∈ Rn×n , its trace is defined as

n
X
trace(A) = aii
i=1

When we use the identity matrix as the basis (as we almost always
do), what we get as the components of vectors are, in fact, the coor-
dinates of the points where the tips of the vectors lie. For this reason,
the identity basis may also be referred to as the coordinate basis. The
components of a vector may be called coordinates. And a vector (as
we define and use it in Linear Algebra, as starting from the origin)
may be called a position vector to distinguish it from other vectors
(such as the electric or magnetic field strength, which may have a
specific value and direction at any point in the coordinate space).
Basis and Components 129

7.1.3 Orthogonal Basis

We sneaked in orthogonality earlier in the previous chapter, although

we were planning a fuller treatment of the topic here in this chapter.
Orthogonal vectors are the ones whose dot product is zero: x and
y are orthogonal if xT y = 0. If our basis vectors are orthogonal to
each other, then we have an orthogonal basis.

7.1.4 Orthonormal Basis

In addition to being orthogonal to each other, if our basis vectors are

all of unit length, then we have an orthonormal basis. The identity
matrix as a basis is a nice orthonormal basis, but it is not the only
one. Before looking at some examples, let’s study orthonormal bases
in their generality.
Let’s call our orthonormal basis matrix Q = [qi ] ∈ Rn×n , with
column vectors qi ∈ Rn . The fact that the columns qi form an
orthonormal basis for means two things: The dot product of any two
distinct columns is zero, and the norm of each column is one. Both
these characteristics can be stated compactly as:
(
1 i=j
qiT qj = (7.2)
0 i ̸= j

As a consequence, we get some interesting properties for the matrix

1. The inverse of an orthonormal matrix is its transpose:

Q−1 = QT =⇒ QT Q = QQT = I (7.3)

We can easily prove this by looking at the product QQT as

the dot products of the rows of Q with the columns of QT .
Because of Eqn (7.2) above, only the elements where the row
number is the same as the column number survive, giving us
the identity matrix.

2. An orthonormal matrix does not change the norm of a vector:

∥Qx∥ = ∥x∥ (7.4)

130 Change of Basis, Orthogonality and Gram-Schmidt

Proof: Consider ∥Qx∥2 :

∥Qx∥2 = (Qx)T Qx = xT QT Qx
= xT Q−1 Qx = xT Ix
= xT x = ∥x∥2
=⇒ ∥Qx∥ = ∥x∥

3. Orthonormal matrices have unit determinants: |Q| = ±1,

which follows conceptually from the fact that they do not
change the size of the vector it is multiplying with. But it
can be easily proven by the properties of determinants:
1
Q−1 = QT =⇒ Q−1 = QT = |Q| =⇒ = |Q|
|Q|
=⇒ |Q|2 = 1 =⇒ |Q| = ±1

Keep in mind that although we made a distinction between or-

thogonal and orthonormal matrices, most authors write the former to
mean the latter. We also will adapt this sloppy practice soon.

7.2 Change of Basis

One of the important applications of Linear Algebra uses the change

of basis: Going from one basis to another. We see this application
in multi-player video games where the underlying world is rendered
from the perspective each player. We also see it in the perspective
correction apps in our smart phones.
Let’s start with some examples as shown in Table 7.1 and drawn
in Figure 7.1. In the first row of the vector x has components 7 and
5 in the usual basis we are used to, which may be called identity or
coordinate basis. In the table, it is written as [x]I .
In the second row of Table 7.1, we are using the matrix A′ as the
basis. Notice that the basis vectors a′1 and a′2 are still in the same
direction as the corresponding unit vectors from the identity basis q1
and q2 . But they have been scaled up. Looking at the components
of the same vector in this basis A′ , as we see in [x]A′ , we see that
they are smaller than the coordinates: 7 → 2 and 5 → 2. What it
is saying is that since the basis vectors are bigger, we need to take
Change of Basis 131

Table 7.1 Examples of change of basis

Basis Basis Linear Vector
Matrix Vectors Combination Components

1 0 1 0 7
A=I= q1 = q2 = x = 7q1 + 5q2 [x]I =
0 1 0 1 5

3.5 0 3.5 0 2
A′ = a′1 = a′2 = x = 2a′1 + 2a′2 [x]A′ =
0 2.5 0 2.5 2

2 1 2 1 2
A′′ = a′′
1 = a ′′
2 = x = 2a′′ ′′
1 + 3a2 [x]A′′ =
1 1 1 1 3
How the same vector x is represented in various bases. The notation of square brackets
around a vector with a matrix as subscript stands for the vector represented in the basis
of the matrix: [x]A is the representation of x using the columns of A as the basis
vectors.

6 "
CHANGE OF BASIS 5
#

7
!= 4
5
1 0
2! = 2 = 3
0 # 1 5!%
3.5 & 0
3&! = 3" = 2
0 2.5 5%%
!
2 1
3&&
! = 3&& = P# 1
5%%
1 " 1 "
!
28 27 26 25 24 23 22 21 0 1 2 3 4 5 6 7 8
P" 5%"
21

22 7
! = 72! + 52# ó ! ' =
5
23 2
!= 23&! + 23&" ó ! ($ =
2
24 2
!= 23&&
! + 33&&
" ó ! ($$ =
3
25

Fig. 7.1 Visualizing the change-of-basis examples listed in Table 7.1. The green vector
is represented in three different bases, giving vastly different values for its components.

only two in each direction to get to the tip of our vector x. When
the basis vectors become bigger, the components get smaller, which
should indicate to us that the change of basis probably has the inverse
of the basis matrix involved in some fashion.
In the third row of Table 7.1, things get really complicated. Now
we have the basis vectors in some random direction (not orthogonal
to each other) with some random size (pretty far from unity). The
first component now is 2 and the second 3, as described in Table 7.1
and illustrated in Figure 7.1.
132 Change of Basis, Orthogonality and Gram-Schmidt

How do we compute the new components given the original vector

and the new basis vectors? In other words, how do we perform the
change of basis in a general way? Let’s call our original basis is
the coordinate basis In ∈ Rn×n and the new basis A ∈ Rn×n . The
column vectors of A = ai ∈ Rn are also specified in the identity
basis. Eqn (7.1) tells us how to arrive the components of x (which
are xi ) in this basis. Remember, we denoted the representation of x
in A as [x]A . In order to find the components of x in the A basis,
we have to find the vector [x]A such that x = A[x]A , which is the
same as solving the system of linear equations x = A[x]A as shown
in Eqn (7.5) below. Since we know that A is a basis matrix, it is
square, full rank and is therefore invertible. We can, therefore, get
the components in the new basis A through matrix inversion:
n
X
x= xi|A ai =⇒ x = A[x]A =⇒ [x]A = A−1 x (7.5)
i=1

Let’s now verify Eqn (7.5) using the most complicated example we
did, namely the third row of Table 7.1. Remembering the prescription
for the inverse of a 2 × 2 matrix (swap the diagonal elements, switch
the sign of the off-diagonal elements and divide by the determinant),
we have:

7
The original vector in the coordinate basis: x =
5

2 1 −1 1 −1
The new basis: A = =⇒ |A| = 1 and A =
1 1 −1 2

1 −1 7 2
The vector in the new basis: [x]A = A−1 x = =
−1 2 5 3
Note that we have used the symbol A for the new basis rather than
A′′ as in Eqn (7.5). Comparing our [x]A with [x]A′′ in the Table 7.1,
we can satisfy ourselves that the matrix equation in Eqn (7.5) does
work.

7.2.1 Consequences of Basis Change

When we change the basis to A, what happens to vector dot products,

and therefore, to the norm of vector? Using the definition of dot
products way back in Eqn (2.11) and the change of basis in Eqn (7.5),
Basis of Subspaces 133

we can write:
x · y = xT y = (A[x]A )T A[y]A = [x]TA AT A [y]A

If the dot product is to remain unchanged, we need x · y = xT y =

[x]TA [y]A , or AT A = I for the new basis A. What type of matrices
satisfy this condition? As we see in Eqn (7.3), orthogonal matrices do.
Therefore, preservation of the norm of the vector under basis change
also requires the basis matrix to be orthogonal. (Orthonormal, in fact,
but as promised, we will let it slide from now on.)

7.3 Basis of Subspaces

Whatever we said about bases and components for spaces also applies
to subspaces, but with one important and interesting difference. As
we know, subspaces live inside a bigger space. For example, we
can have a subspace of dimension r inside Rn with r < n. For
this subspace, we will need r basis vectors, each of which is a n-
dimensional vector: ai ∈ Rn . If we were to place these r vectors in
a matrix, we would get A ∈ Rn×r , not a square matrix, but a “tall”
one.
Remember, this subspace of dimension r is not Rr . In particular, a
two-dimensional subspace (a plane going through the origin) inside
R3 is not the same as R2 . Let’s take an example, built on the third row
of Table 7.1 again, to illustrate it. Let’s take the two vectors in the
example, make them three-dimensional by adding a third component.
The subspace we are considering is the span of these two vectors,
which is a plane in the coordinate space R3 : All linear combinations
of the two vectors lie on this plane. We will use the same two vectors
as the basis A and write our vector x (old, coordinate basis) as [x]A
(new basis for the subspace). We have:
   
2 1 7
2
A = 1 1 x = 2a1 + 3a2 = 5 =⇒ [x]A =
3
1 0 2
Note that our [x]A has only two components because the subspace
has a dimension of two. Why is that? Although all the vectors in
the subspace are in R3 , they are all linear combinations of the two
column vectors of A. The two scaling factors required in taking the
134 Change of Basis, Orthogonality and Gram-Schmidt

linear combination of the basis vectors are the two components of the
vectors in this basis.
In the case of full spaces Rn , we had a formula in Eqn (7.5) to
compute [x]A , which had A−1 in it. For a subspace, however, what
we have is a “tall” matrix with more rows (m) than columns (r < m).
What is the equivalent of Eqn (7.5) in this case? Here’s where we
will use the left inverse as defined in Eqn (5.2). Note that A is a
tall matrix with full column rank because its r columns (being basis
vectors for a subspace) are linearly independent. AT A is a full-rank,
square matrix of size r × r, whose inverse figures in the left inverse.
As a reminder, here is how we defined it:
−1
AT A AT A = I =⇒ A−1 T −1 T

left = (A A) A

Coming back to the change-of-basis problem, we get the compo-

nents in the new basis A following the same arguments used earlier
in deriving Eqn (7.5):

A[x]A = x =⇒ [x]A = A−1 T

Left x = (A A)
−1 T
A x (7.6)

Once again, let’s verify the veracity of this prescription using our
example.
   
7 2 1
2 1 1 6 3
x =  5 A = 1 1 =⇒ AT = and AT A =
1 1 0 3 2
2 1 0
2
1 2 −3 −1
AT A = 3 and AT A
−1
= = 3
3 −3 6 −1 2
2 1
− 31 2

T −1 T −1 2 1 1
(A A) A = ALeft = −1 3 = 3 3
−1 2 1 1 0 0 1 −1
 
1 7
− 31 2

2
[x]A = A−1
Left x = 3 3  5  =
0 1 −1 3
2

It is a tedious calculation, but it works out to be exactly the scalars

in the linear combination we started from: x = 2a1 + 3a2 . Happily,
we will never have to do such calculations by hand; we have SageMath
to do it for us. For a deeper understanding of this left inverse, we
will have to wait for the projection operation coming up two chapters
down the line, but based on what we will discuss next.
Orthogonality 135

7.4 Orthogonality

7.4.1 Orthogonal Vectors

We came across orthogonal vectors in a couple of places earlier in

this book. If we imagine vectors as line segments in coordinate
spaces with arrows at their tips, we can say that orthogonal vectors
are simply perpendicular vectors. We are clearly more sophisticated
than that by now, and our vector spaces do not have lines and arrows.
Orthogonal vectors are, therefore, those vectors that have their inner
(or dot) product equal to zero.

Orthogonal Vectors
Definition: Two vectors, x, y ∈ Rn are orthogonal to each other if
and only if
xT y = y T x = 0
Note that the vectors in the inner product can commute. Note also
that the less sophisticated definition of the inner product (namely
x · y = ∥x∥∥y∥ cos θ) shows that the inner (or dot) product is zero
when x and y are perpendicular to each other because the angle θ
between them is π2 and cos θ = 0.
We still stay away from the definition of the inner product using the
angle because, by now, we know that the machinery of Linear Allegra
may be applied to vector-like objects where we may not be able to
talk about directions and angles. For example, in Fourier transforms
or the wave functions in quantum mechanics, vectors are functions
with the inner product defined with no reference any kind of angles.
We can still have orthogonal “vectors” in such vector spaces when the
inner product is zero. Clearly, we cannot have perpendicular vectors
without abusing the notion and notation a bit too much for our (or at
least, the author’s) liking.
Earlier, in Eqn (2.5), we defined the Euclidean norm of a vector
∥a∥:
∥a∥2 = aT a
Using this definition, we can prove that the inner product of orthog-
onal vectors has to be zero, albeit not completely devoid of the lack
of sophistication associated with perpendicularity.
If we have a § b, then we know that ∥a∥and ∥b∥ make the sides
of a right-angled triangle (which is where the pesky perpendicularity
136 Change of Basis, Orthogonality and Gram-Schmidt

rears its ugly head again) whose hypotenuse is of length ∥a + b∥.

Then, by the Pythagorean theorem, we can write:

∥a∥2 + ∥b∥2 = ∥a + b∥2

aT a + bT b = (a + b)T (a + b)
= aT + bT (a + b)

= aT a + bT b + bT a + aT b
=⇒ 0 = bT a + aT b
=⇒ bT a = aT b = 0 since bT a = aT b

7.4.2 Orthogonalization

Given that the orthonormal (we may as well call it orthogonal because
everybody does it) basis is the best possible basis we can ever hope
to have, we may want to have an algorithm to make any matrix
A ∈ Rn×n orthogonal. An orthogonal matrix is the one in which
the columns are normalized and orthogonal to one another. In other
words, it is matrix that could be a basis matrix as described in §7.1.4
with the associated properties. And orthogonalization is the process
or algorithm that can make a square matrix orthogonal.
The first question we may want to ask ourselves is why we would
want to do this; why orthogonalize? We know that a perfect basis
for Rn×n is In , the identity matrix. Why not just use it? We have
two reasons for doing it. The first one is pedagogical: We get
to see how projection works in a general way, which we will use
later. The second reason is a practical one from a computer science
perspective: Certain numerical algorithms use the decomposition that
results from the orthogonalization process. Another reason, again
from our neck of the woods, is that when we know that a matrix
Q is orthonormal, we know that the transformation it performs on
a vector is numerically stable: Qn x does not suffer from overflow
or underflow errors because the norm of x is not modified by the
multiplication with Q.

7.4.3 Projection

Earlier, in Chapter 2, we looked at vector dot product as projection

using the cosine of the angle between them, in Figure 2.5. To remind
Orthogonality 137

ourselves, if we have two vectors a1 and a2 with an angle θ between

them, the projection of a2 onto a1 has the length ∥a2 ∥ cos θ. From
the definition of the dot product using the angle, we can write:
a1 · a2 a1
∥a2 ∥ cos θ = = · a2
∥a1 ∥ ∥a1 ∥
To move to a totally matrix notation, we can use the matrix definition
of dot products, a1 · a2 = aT1 a2 = aT2 a1 , which changes the formula
above to:
aT a1
∥a2 ∥ cos θ = 1 a2 = aT2
∥a1 ∥ ∥a1 ∥
This formula gives us the length of the projection, which we call x
in Figure 7.2. For our purposes, we want a vector as the projection,
which would be a vector in the direction of a1 , with its norm equal
to the length of the projection x. Let’s call this projection vector a2∥ .
The direction of a1 is specified by the unit vector q1 :
a1
q1 =
∥a1 ∥
Putting it all together, we write:
a2∥ = aT2 q1 q1

(7.7)
This form of the projection is what we will use in the Gram-Schmidt
process coming up in the next section.
However, we want to go a bit further and with this Eqn (7.7) and
develop what we will call a projection matrix. Knowing the definition
of the norm of a vector:
∥a1 ∥2 = aT1 a1
we can write the projection in a form that we will use later, when we
start projecting vectors onto subspaces rather than other vectors. The
derivation of this form of projection is shown in Figure 7.2, which
we will repeat as a recasting of Eqn (7.7) merely to get used to the
vector/matrix manipulations, if nothing else.
Rewriting Eqn (7.7) using definitions of the norm (∥a1 ∥) and the
direction (q1 ):
T
T
a1 a1
a2∥ = q1 q1 a2 = a2
∥a1 ∥ ∥a1 ∥
T (7.8)
a1 aT1

a1 a2
= a1 = a2
∥a1 ∥2 a1 T a1
138 Change of Basis, Orthogonality and Gram-Schmidt

PROJECTION OF VECTORS
5" and 5# * =& with Q between them
5"
Dot Product Using angle:
5'" 5# = 5" 5# cos Q ÿ
5'" 5#
cos Q = 5#'
5" 5#
Projection Length (of 5# on to 5" ):
5'" 5#

!
! = 5# cos Q =
5"
Q
Projection Vector (of 5# on to 5" ):
5" 5" 5'" 5# 5" 5'" 5#
5#' = != = ' 5# 5"'
5" 5" 5" 5" 5"
Projection Vector (of 5" on to 5# ):
5# 5'#
5"' = 5"
5'# 5#

Fig. 7.2 Dot product between two vectors a1 , a2 ∈ Rn , shown on the plane defined by
the two vectors.

In the relatively complicated manipulations in Eqn (7.8), we have a

couple of observations to write down:
• The entities in the parentheses in all but the last RHS are scalars.
If we were to call it, say s, it commutes with matrix (and vector)
multiplication:

sa1 a2 = a1 sa2 = a1 a2 s

• In the first RHS, we have the projection of a2 equal to a scalar

times the direction of a1 , which makes sense.
• This pattern repeats itself in all the RHS until the very last one:
The projection vector is a1 scaled.
• In the last and final RHS (where we used the associativity of
matrix multiplication to put the parenthesis where we wanted),
the entity multiplying a2 is a matrix: It is a1 aT1 ∈ Rn×n
−1
multiplied by a scalar aT1 a1 .
• We can think of this scaled matrix as an operator, which, when
applied to any vector a2 , gives its projection to a1
We will call the scaled matrix P , the projection matrix. Let’s write it
down once more as the definition of the projection matrix, which we
Gram-Schmidt Process 139

will revisit in a couple of chapters.

−1
P = a1 aT1 a1 aT1 (7.9)

7.5 Gram-Schmidt Process

The Gram-Schmidt process is an algorithm to turn a square, invertible

matrix into an orthonormal one through a series of steps. These steps
are, in fact, elementary column operations, akin to the row operations
we performed in Gaussian and Gauss-Jordan eliminations.
At the end of the Gram-Schmidt process, starting from A, we will
get a matrix Q that is a rotated version of the identity matrix I with
some of the rows permuted. As we will see, the algorithm keeps the
direction of the first column vector and the order of the rest fixed. In
other words, if the second column vector of A is to the “right” of the
first one, we will have the second column of Q also to the right of
the first one. (To be more precise, by “right,” we mean that we have
to turn clockwise to go from the first column to the second.)

7.5.1 The Algorithm

Since our objective is to get an orthonormal basis matrix, we will

normalize each column vector and ensure that it is orthogonal to the
rest through this iterative process, starting from the first column. In
order to describe the algorithm (which is what a step-by-step process
is), let’s set the stage by specifying our symbols. We will start from
a square matrix A and end up with another square matrix Q.
(
G-S 0 i ̸= j
A = ai , Q = qi ∈ Rn×n , A −−→ Q =⇒ qiT qj =

1 i=j

Here are the steps in the Gram-Schmidt process, as illustrated in

Figure 7.3:
1. Take the first column of A and normalize it to get the first
column of Q.
a1
q1 =
∥a1 ∥
2. Use q1 to get q2 (or qj with j = 2) using the steps below:
140 Change of Basis, Orthogonality and Gram-Schmidt

GRAM-SCHMIDT PROCESS

5!
5 = 3) 3" 3% ï 3*
1. Normalize 5) as P) : 5 !+
5)
P) =
5)
2. Project 5! on to P) : 5!' = (5*! P) )P)
" Get part perpendicular to P) :
5!+ = 5! 2 5!' = 5! 2 (5*! P) )P) P!

,'
5
" Normalize 5!+ as P!
5!+
P! = 5 ,' "
5!+ 5)
5,
3. Project 5, on to P) and P! : 5 !'
P)
5,'! = (5*, P) )P) and 5,'" = 5*, P! P! 5 ,' !
" Get part of 5, perpendicular to P) and P! :
5,+ = 5, 2 5,' = 5, 2 5*, P) P) + 5*, P! P! 5,+
" Normalize 5,+ as P, :
5,+ P,
P, =
5,+

Fig. 7.3 Illustration of the Gram-Schmidt process running on a matrix A. The first
column (red) is normalized to get the unit vector q1 , which is then used to create q2 from
the second (blue) column. Both q1 and q2 are used in projecting the third (green) column
and computing q3 .

(a) Take the second column of A and project it onto q1 to get

the part of a2 parallel to it, using Eqn (7.7).
a2∥ = aT2 q1 q1

(b) Subtract a2∥ from a2 to get the perpendicular part.

a2§ = a2 − a2∥ = a2 − aT2 q1 q1

(c) Normalize a2§ to get the second column of Q,

a2§
q2 =
∥a2§ ∥
3. Now get qj using the j − 1 vector’s normalized so far, qi ,
1 f i < j , following the steps below:

(a) Take aj , the next column of A, project it successively

to qi to get the
part of aj parallel to it, using Eqn (7.7).
aj∥i = aTj qi qi
(b) Subtract all parallel parts aj∥i from aj to get the part
perpendicular to all qi .
j−1 j−1
X X
aTj qi qi

aj§ = aj − aj∥i = aj −
i=1 i=1
Gram-Schmidt Process 141

(c) Normalize aj§ to get the next column of Q, qj .

aj§
qj =
∥aj§ ∥

4. Iterate until we run out of columns, which is when j = n.

Note that step 2 in the algorithm above is actually the same as step 3,
with j = 2, but we chose to spell it out in order to make the process
clear. In fact, even the first step can be thought of as the same as the
third step, making it easy to list the steps as an algorithm, and indeed
to write code based on them.

7.5.2 Numerical Considerations

When we look at the steps of the Gram-Schmidt process, we see

that we are normalizing a vector, using it to compute the next one,
normalizing it and working our way through the whole matrix A. As
we know, floating point operations on a computer have an inherent
precision, and the first step necessarily incurs a certain error. Since it
is being used in the next step, the errors accumulate. For this reason,
the process may be numerically unstable.
The modified version of the algorithm mitigates this problem by
breaking down the summation in the step 3(b) above iteratively as:
(1)
aj§ = aj − aTj q1 q1

(i+1) (i) (i)T
aj§ = aj§ − aj§ qi qi 1 < i < j − 1 (7.10)
(j+1)
aj§ = aj§

In other words, instead of projecting aj onto all the currently available

orthonormal vectors qi , 1 f i < j and subtracting the sum in step 3(b),
we project it onto q1 first, subtract it from aj to get the perpendicular
(1)
part aj§ . Then instead of projecting aj onto q1 again, we project
(1)
aj§ instead, thereby disrupting the accumulation of rounding errors.

7.5.3 QR Decomposition
Since the Gram-Schmidt algorithm is about taking linear combina-
tions of the columns of the matrix, it should be possible to write it as
142 Change of Basis, Orthogonality and Gram-Schmidt

a matrix multiplication on the right.

G-S
A, Q ∈ Rn×n , A −−→ Q =⇒ AX = Q, X ∈ Rn×n
Furthermore, since q1 is a scaled version of a1 , the first column
of X has only one nonzero element, x11 . In general, qi is a linear
combination of all aj , 1 f j f i. Therefore, X is an upper triangular
matrix. We are more interested in the inverse of this matrix, which
we will call R. It is also an upper triangular matrix: a fact that
can be appreciated by noticing that ai is a linear combination of all
qj , 1 f j f i. It is also a fact that can be proven in general, and is
left as an exercise.
With these notations, we can write:
AX = Q =⇒ A = QR =⇒ R = Q−1 A = QT A
Where we used the fact that Q is orthogonal, and therefore its inverse
exists, and Q−1 = QT .
As we can see, the Gram-Schmidt algorithm leads to a decomposi-
tion of a square, full-rank matrix into an orthogonal matrix Q and an
upper triangular matrix R, which is called the QR decomposition,
naturally.

7.6 Rotation Matrices

Thinking of matrix multiplication of the type Ax = b as a transforma-

tion A : x 7→ b, we can appreciate that Q has to be a transformation
that either rotates or reflects a vector, or shuffles its elements.
An important class of orthonormal matrices are rotation matrices.
They are orthonormal matrices because they do not change the norm
of the vector being rotated. As shown in Figure 7.4, we can easily
derive the rotation matrix in Qθ in R2 . Note that the inverse of the
rotation matrix would be a matrix that would rotate in the opposite
direction: Q−1θ should be Q−θ . And, being an orthonormal matrix,
it should also be the same as Qθ T .

cos θ − sin θ
Qθ = |Qθ | = cos2 θ + sin2 θ = 1
sin θ cos θ

−1 cos(−θ) − sin(−θ) cos θ sin θ
Qθ = Q−θ = = = QTθ
sin(−θ) cos(−θ) − sin θ cos θ
Rotation Matrices 143

1.25 "
0
P O# =
in 1
2s os P 1
sP
=
c co n P
% i
O# 0.75 % = s
O"

0.5

Y
0.25
Y
Z !
22 ROTATION
21.75 21.5 21.25MATRIX
21 IN =20.5
20.75 20.25 0 0.25 0.5 0.75 1 1.25 1.5 1.75 2
1
O" =
0
The unit vectors transform as: 20.25

1 cos ;
2! = § 2&! = 20.5
0 sin ;
0 2sin ;
2# = § 2&# = 20.75
1 cos ;
ó The Rotation Matrix
21
cos ; 2sin ;
?+ =
sin ; cos ;
21.25

Fig. 7.4 The rotation matrix in R2 can be written down by looking at where the unit
vectors go under a rotation through a specified angle.

Rotation in R3 is defined by three angles, the pitch, roll and yaw,

as they are known in flying. The matrix can actually be written as
the product of three independent rotations, but it is probably of not
much interest in computer science, except, perhaps for developing
flight simulators.

Get the Full Edition of LA4CS with

Summaries, Exercises and Solutions
Only $7.95. Scan, Click or Tap to buy.
144 Change of Basis, Orthogonality and Gram-Schmidt

Linearity in Relativity
We talked about rotation matrices. In R3 , for instance, if we have a vector x that
gets rotated by a yaw (ψ), pitch (θ) and roll (φ) and ends up as x′ , we have a linear
transformation: x′ = R(ψ, θ, φ)x. If all the rotation angles are zero (ψ = θ = φ = 0),
then clearly R = I3 . Since the physical world we live in is R3 , these vectors (x and
x′ ) represent the position of a point and how its coordinates change in a rotated frame.
These are the so-called position vectors.
Then Einstein came along and said we should be thinking of events rather that posi-
tions. An event takes place ata position and at a time, which would be represented by
x
a four-dimensional vector . Then Einstein proceeded to write a "rotation" between
t
′
x x
two events as something like: = L Here, if we think of time as universal
t′ t
(meaning, independent of the position or the state of motion of an observer), then t′ = t,
which says the fourth-row, fourth-column element of this matrix L is 1, the rest being
identical to those in R. And if the rotation angles are all zero, L = I4 .
So far so good. But what comes next is the jaw-dropping, god-level genius of Albert
Einstein when he said the matrix L depends on the speed v of the observer. In other
words, it is a function of the rotation angles as well as the speed, L(ψ, θ, φ, v). This is
the so-called Lorentz transformation. If we assume there is no rotation, and the motion
is along the z axis, then we get the Lorentz matrix as:
1 0 0 0
 
0 1 0 0 
L=
0 0 γ −γβ 
0 0 −γβ γ
What γ and β are is not so important for our discussion here, but β = vc , the velocity as
a fraction of the speed of light and γ = √ 1 2 , the Lorentz factor. What is important to
1−β
note is that the z coordinate (which is the length along the direction of motion) and time
are interconnected, which leads to length contraction and time dilation in such a way as
to keep the speed of light a constant. How this transformation is derived and what the
physical reasons behind them form the initial part of the paper that literally changed all
of physics and our understanding of the universe forever.
Once the transformation is written as a linear transformation, Einstein had all of linear
algebra standing behind his equations. There was never going to be a mathematical
inconsistency in all the crazily counter-intuitive predications of special relativity. But
there is this crucial assumption of linearity, which was introduced on page six, third line
in this English translation of the original paper: "In the first place it is clear that the
equations must be linear on account of the properties of homogeneity which we attribute
to space and time."
Although I never explicitly stated it this way so far, I have taken issue with this
assumption of linearity, which formed the basis of my first book, The Unreal Universe,
with its key finding described in this short video.
Rotation Matrices 145

Importance of Linearity in Relativity á la ChatGPT

One of the foundational assumptions in special relativity is the linearity of the trans-
formations between inertial reference frames. This assumption plays a crucial role in
deriving the Lorentz transformations and establishing the consistency of the theory. Here,
we explore why this assumption is both natural and necessary.
In special relativity, inertial frames are related by transformations that preserve the
uniformity of space and time. If xµ = (t, x, y, z) represents coordinates in one frame
and x′µ = (t′ , x′ , y ′ , z ′ ) in another, the transformation x′µ = f (xµ ) must:

1. Preserve the Principle of Relativity: All inertial frames are equivalent, meaning
the laws of physics are the same in each.
2. Preserve Homogeneity and Isotropy: Space and time are uniform and direc-
tionally invariant.

These conditions imply that the transformation must act uniformly on all points in
spacetime. This uniformity necessitates linearity, as nonlinear transformations would
introduce position-dependent effects, violating homogeneity.
A linear transformation can be written as:
x′µ = Λµν xν ,
where Λµν is a constant matrix. Linearity ensures that the spacetime interval:
s2 = −c2 t2 + x2 + y 2 + z 2
is preserved up to a constant factor, ensuring consistency with the invariant speed of light
c.
The linearity assumption simplifies the interplay between space and time while en-
suring causality and consistency with experimental observations. Without linearity,
transformations would introduce arbitrary distortions, undermining the predictive power
of the theory.
In summary, the assumption of linearity in special relativity arises naturally from the
symmetry and uniformity of spacetime, enabling the elegant derivation of the Lorentz
transformations and the unification of space and time.
8
Review and Recap

The story so far:

In the beginning the Universe was created.
This has made a lot of people very angry and been widely
regarded as a bad move.
—Douglas Adams

8.1 A Generalization

We have come about halfway in our Linear-Algebra journey. The

way we discussed our topics in this book was explicitly geared toward
computer science. Specifically, our vectors and matrices were over
the field of real numbers, R. We could generalize it to any field, F or
K, and almost all of our statements and discussions will still stand.
However, for computer scientists, the most appropriate field to work
with, we believe, is R.
We also presented certain definitions, such as the pivotal one of
vector space, incrementally: We defined vectors first, their operations
and the properties thereof, and finally said a vector space are merely
A Generalization 147

a complete collection or closed set of all possible vectors of the same

dimension.
We may find some value in taking a step back and defining the
concept of vector space in a general and formal way, which starts
from scratch and lists all the properties and operations we are looking
for. Here is such a definition, which also provides a good summary
of our initial, introductory chapters.

Vector Space
Definition: A vector space over a field of K is a set of elements
that have two operations defined on them. We will call the elements
“vectors” and use the symbol S for the set.
1. Addition (denoted by +): For any two vectors, x, y ∈ S,
addition assigns a third (not necessarily distinct) vector (called
the sum) in z ∈ S. We will write z = x + y.
2. Scalar Multiplication: For an element s ∈ K (which we will
call a scalar) and a vector x ∈ S, scalar multiplication assigns
a new (not necessarily distinct) vector z ∈ S such that z = sx.
These two operations have to satisfy the properties listed below:
Commutativity: Addition should respect associativity, which means
the order in which the vectors appear in the operation does not
matter. For any two vectors x1 , x2 ∈ S,
x1 + x2 = x 2 + x1
We can make scalar multiplication also commutative by defin-
ing sx = xs.
Associativity: Both operations should respect associativity, which
means we can group and perform the operations in any order
we want. For any two scalars s1 , s2 ∈ K and a vector x ∈ S,
s1 s2 x = s1 (s2 x) = (s1 s2 )x
and for any three vectors x1 , x2 , x3 ∈ S,
x1 + x2 + x3 = (x1 + x2 ) + x3 = x1 + (x2 + x3 )
Distributivity: Scalar multiplication distributes over vector addition.
For any scalar s ∈ K and any two vectors x1 , x2 ∈ S,
s(x1 + x2 ) = sx1 + sx2
148 Review and Recap

and for any two scalars s1 , s2 ∈ K and a vector x ∈ S,

(s1 + s2 )x = s1 x + s2 x

Identity of Addition: The set S has an identity of addition (called

the zero vector) 0 ∈ S such that x + 0 = 0 + x = x.
Additive Inverse: For every x ∈ S, the set S has an inverse of
addition −x such that x + (−x) = 0, the identity of addition.
Identity of Scalar Multiplication: The field K has a unit element,
1 (called the multiplicative identity) such that for every x ∈ S,
1x = x1 = x.
If S is a set that satisfies all these conditions, then it is a vector space.
As we can see, we do not say anything about what the “vectors” are.
In particular, we are free to call our matrices by the name vectors also,
and consider a collection of matrices that satisfy these conditions a
vector space. What are such matrices? They are matrices of the same
dimensions, of which our vectors are examples.
Note that the existence of a third operation, vector-vector multipli-
cation (such as dot products, or matrix multiplication, more generally)
is not required for our set of vectors S to be considered a vector space.

8.2 Product Rules: Transposes and Inverses

The product rule of transposes and inverses is similar: When we

transpose (or invert) a product, we get the operations in the reverse
order on the factor matrices.

(AB)T = B T AT and (AB)−1 = B −1 A−1

The product rule for transposes gives us two interesting symmetric
matrices from any odd matrix:

T T
(AT A) = AT (AT )T = AT A and (AAT ) = (AT )T AT = AAT

AT A is the Gram matrix. AT A, AAT and AT all have the same

rank as A. In particular, if A has full column rank, AT A is a full-
rank, square matrix. And if A has full row rank, AAT is a full-rank,
square matrix.
Column Picture of Matrix Multiplication 149

Combining the invertibility of AT A and AAT with the product

rule for inverses, we can come up with left and right inverses.
(AT A)−1 AT A = I =⇒ A−1 T
Left = (A A)
−1 T
A
In writing the first part of the expression above, we assume that
AT A (the Gram matrix) is invertible, which means A has full column
rank. Similarly, for full-row-rank matrix A has a right inverse:
AAT (AAT )−1 = I =⇒ A−1 T T −1
Reft = A (AA )

Keep in mind that A is not necessarily square, and does not,

therefore, have a double-sided inverse A−1 . But the left inverse
can take the place of an inverse, as it did in Eqn (7.6).
While describing the Gram-Schmidt process, we wrote down the
projection operator in a special form in Eqn (7.8):
−1
a2∥ = a1 aT1 a1 aT1 a2

Notice a similar pattern emerging? The projection of a2 onto a1 is

the left inverse of a1 = a−1 1 left ∈ R
1×n
multiplying a2 ∈ Rn on the
left, giving us a scalar (which then scales a1 to make the projection a
vector).

8.3 Column Picture of Matrix Multiplication

While describing matrix multiplication, we presented many different

ways of understanding it. All these different pictures are indeed
equivalent. First, we started with an element-wise definition. We
then saw that vectors, being matrices of single columns, obeyed the
same multiplication rules, and realized that vector dot product is a
matrix multiplication. We then redefined matrix multiplication as
dot products of the rows of the first matrix with the columns of the
second.
We also looked at the row and column pictures of matrix multiplica-
tion: Multiplication on the left is the same as the linear combinations
of the rows of the second matrix, and multiplication on the right is
the same as the linear combinations of the columns of the first. Of
all these different views of matrix multiplication, the column picture
is most useful one. Let’s look at it once more, and apply it to our
150 Review and Recap

understanding of some of the crucial and basic concepts of Linear

Algebra.
In the product AX = B with A ∈ Rm×k , X ∈ Rk×n and con-
sequently B ∈ Rm×n , the columns of the product B are linear
combinations of the columns of A, taken with the scalars in each
column of X.
More specifically, in our favorite equation Ax = b, b is a linear
combination of the columns of A scaled by the components of x. This
view makes it possible for us to unearth a system of linear equations
hiding behind the definitions of span and linear independence of
vectors.

8.3.1 Span of Vectors

Earlier, we defined the span of vectors in Eqn (6.2). For our own
nefarious ulterior motives, let’s recast the definition using other no-
tations and symbols as in the following equation:
( n
)
X
m
Given n vectors ai ∈ R , span({ai }) = b|b= xi a i
i=1
(8.1)
for any n scalars xi ∈ R

We switched from our previous notation of vectors (from xi → ai )

and scaling factors (from si → xi ). Can we unearth our favorite
matrix equation Ax = b from this new definition?
As we can see, we may redefine the span of a set of vectors as
follows: If we arrange a set of vectors as the columns of a matrix
A, then their span is all possible vectors we can get as the product
Ax = b for all possible vectors x. In the very next chapter, we will
have a lot more to say about this way of looking at the span.

8.3.2 Linear Independence

The definition of linear independence in Eqn (6.3) also hides within

it a system of linear equations. In order to rebrand linear dependency
also as a matrix equation that we will have the pleasure of meeting
again later on, let’s consider it for the columns ai ∈ Rm of A ∈
Rm×n , where the scalars are called xi ∈ R. The condition now
Column Picture of Matrix Multiplication 151

The Zero Vector

The zero vector 0 ∈ Rn has the same role as the number zero in R: It is the identity of
vector addition. It has few special properties. It is the only vector that is parallel and
perpendicular (or, to be more precise, collinear and orthogonal) to all other vectors at
the same time.
As we know by now, a vector is collinear with a scaled version of itself. And, since
0 = sx for s = 0 for any x, we can say that 0 is collinear with any x. Furthermore, if
x · y = 0, x is orthogonal to y. Sure enough, x · 0 = 0 for any x, proving that 0 § x.
When it comes to vector spaces, 0 ∈ R0 is the smallest vector space we can think of.
It has only one vector. Scale it by any s ∈ R, we get 0. Add any “two” members of
this tiny vector space, we get 0 + 0 = 0. Therefore, this vector space (let’s call it Z)
is indeed closed under scalar multiplication and vector addition, as all vector spaces are
supposed to be.
The strange thing about it is the dimension of Z. It has to be zero because, as we
know, a vector space of dimension one is a line, not a point. What is the basis of Z?
We might think that it is the set {0}, in which case the cardinality of the set has to be
zero, because the dimension of a vector space defined as the cardinality of its basis. It
looks as though 0 does not count. Does it mean that a set of real numbers {0} has zero
members in it? Or that {0, 1} has only one member, despite the fact that we can clearly
see two of them? Are our eyes deceiving us?
The weirdness of the zero vector extends to the notion of vector subspaces as well.
All vector subspaces contain the zero vector, as do all vector spaces. Otherwise they
would not be closed under scalar multiplication (with s = 0). Therefore, the intersection
of any two subspaces is at least {0}. Are we justified in calling it a null set because we
know that its cardinality is zero?
These contradictions lead to the inescapable conclusion that the vector space contain-
ing only the zero vector Z = {0} does not need a basis at all. Remember, the basis set
{0} of Z needs to satisfy two conditions: (1) It should space the space (which it does),
(2) The vectors in the basis set need to be linearly independent. But the zero vector is
not linearly independent of itself, and therefore cannot form a basis of anything!
There is much more to zero than meets the eye.

becomes:

n
X
xi ai = 0 =⇒ Ax = 0 (8.2)
i=1

In other words, if we can find nontrivial solutions (meaning x ̸= 0)

for the matrix equation Ax = 0, then the columns of A are not
linearly independent.
152 Review and Recap

8.4 Named Algorithms

As we have encountered a few algorithms (by which we mean step-

by-step instructions to achieve an objective), it may be useful now to
compare them in terms of their purpose and end products, which is
what we have in Table 8.1.
When we started solving systems of linear equations, we encoun-
tered the Gaussian-elimination algorithm. Its objective is to get to
the rowechelon
form (REF) of the matrix (typically, the augmented
matrix A | b of the system Ax = b). Once we have the REF, we
can count the number of pivots as rank(A), and fully solve the system
by back substitution.
As a byproduct, Gaussian elimination produces a decomposition as
well, A = P LU . P , being a permutation matrix, has a determinant
of ±1, while L, being the inverse of a bunch of elementary matrices,
has a determinant of 1. Therefore, |A| can be computed from its REF,
which is the upper triangular matrix U as the product of its diagonal
elements, multiplied by |P | = ±1.

Table 8.1 Named algorithms and their uses

Type of
Name Uses Decomposition
Operation
Row Echelon Form (REF).
Elementary Solve equations using back
Gaussian
Row substitution. P LU
Elimnation
Operations Determine rank.
Compute determinants.
Reduced Row Echelon Form
Elementary (RREF).
Gauss-Jordan
Row Solve equations. None
Elimination
Operations Determine rank.
Compute inverses.
Gram- Elementary Orthonormal basis matrix.
Schmidt Column Works only on square, QR
Process Operations invertible A.

Gauss-Jordan elimination is an extension of Gaussian elimination,

where we scale the pivot rows to get the pivots to have a value of one.
We also get rid of all the matrix elements above the pivot in each pivot
column by subtracting appropriate multiples of the pivot row. Our
objective is to get an identity matrix if we were to look only at the
Pivots, Ranks, Inverses and Determinants 153

pivot rows and columns. The resulting form is called the reduced row
echelon form or RREF. Once we have the RREF of the augmented
matrix of a system of linear equations, we can read off the solutions
from the pivot rows. Gauss-Jordan elimination also may produce a
decomposition, but we are not really concerned about it.
The RREF that Gauss-Jordan elimination produces can also tell
us about the rank, which is the number of pivots (also equal to the
sum of pivots because they all have value equal to one). But the real
interesting usage of this algorithm is in inverting a matrix, A. For this
purpose, we augment it with I and run the G-J algorithm on [A | I].
When the A part of the augmented matrix becomes I, we have the
inverse of A where I used to be.
RREF
[A | I] −−−→ [I | A−1 ]
If the algorithm fails to produce I, it means that it could not find
a pivot in at least one row, and the matrix is not invertible; it is
rank-deficient.
The third algorithm, the Gram-Schmidt process, is different from
the first two in the sense that it does elementary column operations
(not row ones as the other two). However, it is not such a big differ-
ence because a column operation is a row operation on the transpose.
The purpose of Gram-Schmidt is to produce a matrix that is orthogo-
nal. In the process, it creates the so-called QR decomposition, where
R, an upper triangular matrix, is the inverse of all the elementary op-
erations performed by the matrix. But the process does not really
have to keep track of the operations and take its inverse because of
the basic property of orthogonal matrices, Q−1 = QT . Therefore,
A = QR =⇒ R = Q−1 A = QT A
Once Q is obtained, R is only a matrix multiplication away.
We have created a neat little Table 8.1 to compare these three
algorithms. It may be useful as a quick reference to their steps,
purposes and usages.

8.5 Pivots, Ranks, Inverses and Determinants

Another topic worthy of a recap is the interconnected story of pivots,

ranks, inverses and determinants of matrices, and how it relates to the
154 Review and Recap

solvability of the underlying system of linear equations. We looked

at each of these topics at various points in our discussions so far.
Because of the intimate interconnections among them, it is best to
summarize it all as we do in Figure 8.1.
Matrices come in different shapes: Square, “tall” and “wide,” as
we described way back in §5.1.2 (page 89). Each shape of matrix
can be either full-rank or rank-deficient, thereby giving us six possi-
ble permutations. When we run Gauss-Jordan elimination on these
matrices, depending on their rank, we will get different canonical
forms, as we discussed in §5.2.2 (page 94). Figure 8.1 is an attempt
to put all these things together, and comment on the solvability of the
underlying system of linear equations.
Corresponding to each permutation1 of the matrix shape and its
rank status, we have a row in Figure 8.1. Referring to these rows, we
can make the following statements.
The only matrix that can be inverted is in the first row. It has a
nonzero determinant, and it leads to a unique solution for the system
of equations that it represents. The matrix in the second row also
has a determinant, but it is zero. The matrix is not invertible. Note
that its RREF has at least one zero row at the bottom. If the entry
on the augmented part of the matrix is also zero, then the equations
are consistent, but we do not have enough of them. Therefore, we
get an infinite number of solutions. We can see it because the RREF
has F in it, which indicates columns with no pivots and therefore
free variables. On the other hand, if the constant in this zero row is
nonzero, we have inconsistency and hence no solution.
We can make a similar statement about the third row also: The
system can be consistent (with 0 = 0 in the bottom rows), in which
case we get a unique solution, or it can be inconsistent, with no
solutions. Note the absence of F in the RREF, indicating that we will
never get an infinity of solutions.
The fourth row has a matrix with RREF with both free variables
and zero rows. So we have the possibilities of no solution or an
infinite number of solutions.

1
To be very pedantic about it, it is not a permutation but an element of the Cartesian product
of the sets of matrix shapes and rank statuses.
Pivots, Ranks, Inverses and Determinants 155

Shape Rank RREF Solvability

Square, Full-rank:
1
> * =&×&
2 3O Unique Solution

Square, Rank-deficient: 3P · 7P× ORP No Solution or

2
> * =&×&
4<2 Infinity of Solutions
8 ORP ×O

Tall, Full-rank: 3O No Solution or

3
> * =(×& , M > O
2 Unique Solution
8 SRO ×O

Tall, Rank-deficient: 3P · 7P× ORP No Solution or

4
> * =(×& , M > O
4<2 Infinity of Solutions
8 SRP ×O

Wide, Full-rank:
5
> * =(×& , M < O
9 3S · 7S× ORS Infinity of Solutions

Wide, Rank-deficient: 3P · 7P× ORP No Solution or

6
> * =(×& , M < O
4<9 Infinity of Solutions
8 SRP ×O

Fig. 8.1 Properties and solvability of the linear equations Ax = b based on the RREF
of the coefficient matrix A,

When it comes to the full-rank “wide” matrix, we have no zero

row, which means we will never have inconsistency. But we never
have enough equations either (as indicated by the free variable part
in the RREF), and we always have an infinity of solutions.
Rank-deficient “wide” matrix leads to situations similar to rows 2
and 4, with no solution or infinite number of solutions. To reiterate,
when a matrix is rank-deficient (as in rows 2, 4 or 6), regardless of
its shape, we either have no solution (inconsistent equations) or an
infinite number of solutions (not enough equations).
Figure 8.1 is color-coded for easy reference: Whenever we see free
variables (corresponding to columns with no pivots) in RREF, shown
in purple, an infinity of solutions is a possibility, obviated only by
the existence of inconsistent
equations
(shown in red) as indicated by
zero rows in the A part of A | b , with the b part nonzero. The only
instances where we get a unique solution is when we can have an In
in the RREF, shown in green. We can also see that a tall matrix can
never have an infinity of solutions and a wide one can never have a
unique solution.
156 Review and Recap

8.6 Two Geometries

When we started visualizing linear equations (Figure 4.1), we were

doing it in the coordinate space, where we have points and lines and
planes. A linear equation in the coordinate space of R2 , for instance,

is a line. As we know, each row of the augmented matrix A | b
represents an equation, which has a shape in the coordinate space.
In other words, when we think of equations, we are thinking rows,
and we can consider the thinking process the row picture of Ax = b.
The dimension of this coordinate space is the same as the number of
unknowns.
We have a wholly separate vector space R2 as well, where we have
all possible vectors with two components and nothing else. When
we think of Ax = b as a system of equations demanding that b be
the right linear combination of the columns of A as specified by x,
we are working with the latter. If we have two rows, our column
vectors have two elements, and we are working with the vector space
R2 , regardless of how many unknowns we may have. If we have
three equations, as in Figure 6.7, we are in the vector space R3 , even
though we have only two unknowns. This mode of thinking can be
considered the column picture of systems of linear equations. This
column picture is the more sophisticated, university-level thinking
we are trying to foster in this course.

Get the Full Edition of LA4CS with

Summaries, Exercises and Solutions
Only $7.95. Scan, Click or Tap to buy.
9
The Four Fundamental
Spaces

Time and space are modes by which we think and not

conditions in which we live.
—Albert Einstein

Our geometric view started with the notion that we can think of
the solution to the system of linear equations Ax = b as a quest for
that special x whose components become the coefficients in taking
the linear combination of the columns of A to give us b. Since this
opening statement was a bit too long and tortured, let’s break it down.
Here is what it means:

• The product of the matrix multiplication Ax is a linear combi-

nation of the columns ai of A.

• The linear combination is taken with coefficients equal to the

components of x = xi . In other words, the product is:
n
X
Ax = xi a i
i=1
158 The Four Fundamental Spaces

• Our system of linear equations Ax = b, therefore becomes the

condition:
X n
xi a i = b
i=1

Solving this system means finding the right xi so that the linear
combination satisfies the condition.
We also learned that the collection of all possible linear combina-
tions of a set of vectors is called the span of those vectors, and it is
a vector subspace, which is a subset of a vector space. For example,
Rm is a space1 and n vectors ai ∈ Rm span a subspace contained
within Rm . Let’s call this subspace C ¦ Rm . If, among the n vectors
that span C, only r f n of them are linearly independent, then those
r vectors form a basis for C and the dimension of this subspace C
is indeed r (which is the cardinality of the basis). In fact, even if
n > m, the n vectors span only a subspace C ¢ Rm if the number of
independent vectors r < m. Note that r can never be greater than m,
the number of components of each of our vectors ai ∈ Rm .

9.1 Column Space

The choice of the symbols in the previous paragraph was not an

accident. We are ready to introduce the concept of the Column Space
of a matrix.

Column Space
Definition: The column space C of a matrix is the span of its columns.
( n
)
def
X
For a matrix A = ai ∈ Rm×n , C(A) = z | z =

xi a i
i=1

where xi ∈ R. Note that C(A) ¦ Rm , where m is the number of

rows of A because each of its column vectors ai ∈ Rm .
Are the columns of A (ai ) a basis for C(A)? Remembering that
the basis of a space or subspace is the minimal set of vectors that
span it, we can see that a decent basis for C(A) would be the linearly

1
Once again, we will be dropping the ubiquitous “vector” prefix from spaces and subspaces.
Null Space 159

independent columns of A. How many of them are there? We saw

earlier that the number of linearly independent columns of A is its
rank, rank(A). To be more precise, we should call it the column-rank
of A, but we also saw that the column and row ranks were the same.
Finally, the dimension of a space or subspace is the number of vectors
in its basis. Therefore, the dimension of the column space is the rank.
Conversely, we can define rank(A) as the dimension of C(A).

9.1.1 Significance of Column Space

As we said in the introduction to this chapter, the solution to Ax = b

is that x which produces the linear combination of the columns of A
equalling b. We then defined the column space C(A) as all possible
linear combinations of the columns of A. In other words, if b ̸∈ C(A),
we will never be able to find a solution vector x.
We can, in fact, turn the last statement about the solvability of
Ax = b and C(A) into an alternative definition of column space:
C(A) is the set of all vectors b if there exists x ̸= 0 such that
Ax = b.
Note that the elementary column operations, which we used in
the Gram-Schmidt process, do not change the column space because
they are all about taking linear combinations of the columns or scaling
them. The span of the columns is unaffected by such linear operations.
G-S
Therefore, in A −−→ Q, both A and its orthogonal cousin Q have
the same column space: C(A) = C(Q).
On the other hand, the elementary row operations in Gaussian
and Gauss-Jordan eliminations do affect the column space. What is
unaffected would be the linear combination of the rows, which we
will call the row space. But before getting to it, let’s look at the null
space.

9.2 Null Space

So far, we have concerned ourselves mostly with the solution of

Ax = b. Let’s now take a look at the so-called homogeneous system
of linear equations Ax = 0. In fact, we did come across this system
when we were talking about linear independence in the last chapter,
rebranding its condition as a matrix equation in Eqn (8.2). If Ax = 0
160 The Four Fundamental Spaces

for some nonzero vector x (in other words, if the homogeneous system
has a nontrivial solution), then the columns of A are not linearly
independent.
The collection of all vectors x that solve Ax = 0 (called the
solution set) is the null space of A, denoted by N (A). Let’s put
down this statement as the definition of the null space.

Null Space
Definition: The null space N of a matrix A is the complete set of all
vectors that form the solutions to the homogeneous system of linear
equations Ax = 0.
def
N (A) = {x | Ax = 0}
Note that if we have two vectors in the null space, then all their
linear combinations are also in the null space. x1 , x2 ∈ N (A) =⇒
x = s1 x1 + s2 x2 ∈ N (A). If Ax1 = 0 and Ax2 = 0, then
A(s1 x1 + s2 x2 ) = 0 because of the basic linearity condition we
learned way back in the first chapter. What all this means is that
the null space is indeed a vector subspace. Furthermore, N (A) is
complete, by definition. In other words, we will not find a vector x
that is not in N (A) such that Ax = 0. In the cold hard language of
mathematics, x ∈ N (A) ⇐⇒ Ax = 0.
For any and all vector x ∈ N (A), in the null space of A, its dot
product with the rows of A are zero. If riT is a row of A, we have:
 . 
..
Ax = 0 =⇒ ri  x = 0 =⇒ riT x = 0 ∀ i
 T
(9.1)
..
.
The dot product being zero means that the vectors in N (A) and the
rows of A are orthogonal. It then follows that any linear combinations
of the rows of A are also orthogonal to the vectors in N (A). And,
the collection of the linear combinations of the rows of A is indeed
their span, which is a subspace we will soon call the row space of A.

9.3 Row Space

The set of all possible linear combinations of the rows of a matrix is

called its row space, with one little complication. Since our vectors
Row Space 161

are all column vectors, and the rows of a matrix are definitely not
columns, we think of the transpose of the rows as the vectors whose
linear combinations make the row space. Equivalently, we may
transpose the matrix first and call the column space of the transpose
the row space of the original. That is to say, the row space of A is
C(AT ), which is the notation we will use for row space. Since the row
rank is the same as column rank, the number of linearly independent
rows in A is its rank, which is the same as the dimension of C(AT ).

9.3.1 Significance of Row Space

If all vectors in a subspace are orthogonal to all vectors in another

subspace, then we call these two subspaces orthogonal. Earlier, we
established that the row and null spaces of a matrix are orthogonal to
each other. In fact, the row and null spaces are orthogonal comple-
ments of each other, which means that all the vectors in the containing
space that are orthogonal to all the vectors in the null space are in
the row space. Again, this complex statement needs some disman-
tling (and convincing): What this statement says, in mathematical
language, is the following:
• We have a matrix A, with a row space C(AT ) and null space
N (A).
• x ̸∈ N (A), y ∈ N (A), and xT y = 0 =⇒ x ∈ C(AT ).
The proof of this assertion is probably beyond the remit of this book
on applied Linear Algebra, and is, therefore, included as an optional
box on “Orthogonal Complementarity.” What it means, however,
is important for us to know. Let’s first illustrate it using an example.
      
1 0 0  1 0 
T
A = 0 1 0 =⇒ Basis for C(A ) =
   0 , 1
 
0 0 0 0 0
 

Clearly, ∈ R3×3 , rank(A) = 2 and therefore C(AT ) ¢ R3 with the

dimension 2, which is why we have two basis vectors, the pivot rows
of A. In this case, A is already in its RREF.
In order to compute N (A), we have to find the solutions to Ax =
0. For our A, we can see that all vectors with the first two components
zero and a nonzero third component will give Ax = 0. We can
162 The Four Fundamental Spaces

Orthogonal Complementarity
Before attempting to prove that C(AT ) and N (A) are orthogonal complements, it is
perhaps best to spell out what it means. It means that all the vectors in C(AT ) are
orthogonal to all the vectors in N (A), which is the orthogonal part. It also means
if there is a vector orthogonal to all vector in C(AT ), it is in N (A), which is the
complement part. Here are the two parts in formal lingo:
1. y ∈ C(AT ) and x ∈ N (A) =⇒ xT y = 0
2. (a) y ̸= 0 ∈ C(AT ) and
(b) xT y = 0 ∀ y ∈ C(AT ) =⇒ x ∈ N (A)
3. Or, conversely,
(a) x ∈ N (A) and
(b) xT y = 0 ∀ x ∈ N (A) =⇒ y ∈ C(AT )
The first part is easy to prove, and we did it in Eqn (9.1), which just says that each
element of Ax has to be zero for Ax = 0 to be true.
To prove the second part, let’s think of the matrix AT as being composed of columns
ci . If y ∈ C(AT ), as in condition 2(a), we can write y as a linear combination of ci :
X
y= si ci = A T s

for some si , which we put together as a column vector s. If xT y = 0 as condition 2(b)

says, then xT AT s = 0 =⇒ (Ax)T s = 0. We know that s ̸= 0 because it is the
linear combination that gives y ̸= 0. The only way (Ax)T s = 0 can be true for all
y ∈ C(AT ) is if Ax = 0 =⇒ x ∈ N (A).
The third part says Ax = b and y § x. Let’s assume that y ̸∈ C(AT ), with the
hope that it leads to a contradiction. y ̸∈ C(AT ) =⇒ Ay = 0, because otherwise y
would be in the column space. But if y ̸∈ C(AT ), y ∈ N (A), which means it is a linear
combination of the vectors in N (A). And it means y cannot be orthogonal to all of
them, as 3(b) requires. Hence our starting assumption has to be wrong, and y ∈ C(AT ).

therefore write:
  
 0 
N (A) ¢ R3 , of one dimension, with the basis 0
1
 

Thinking in terms of the coordinate space, we see that the row space
is actually the xy-plane and the null space is the z-axis. They are
indeed orthogonal complements.
Let’s go back to talking about the general case, A ∈ Rm×n , rank(A) =
r. We know that by definition, all vectors x ∈ N (A) get transformed
to 0 by Ax = 0. Because of the orthogonal complementarity, we
also know that all orthogonal vectors are in C(AT ), the row space
Left Null Space 163

of A, as illustrated above with our toy example. All the vectors in

the row space get transformed to nonzero vectors in Rm . What about
the rest of the vectors? After all, most vectors in Rn are in neither
C(AT ) nor N (A): They are linear combinations of these two sets of
vectors. Let’s take such a vector x = x∥ + x§ where x∥ ∈ C(AT )
and x§ ∈ N (A).

Ax = A(x∥ + x§ ) = Ax∥ + Ax§ = Ax∥ + 0 = Ax∥

What this equation tells is that all vectors not in N (A) also end up in
C(A) by Ax. Moreover, multiple vectors x ̸∈ N (A) get transformed
to the same b ∈ C(A): It is a many-to-one (surjective) mapping.
What is special about the row space is that the mapping from C(AT )
to C(A) is a one-to-one (injective) mapping. Since it is an important
point, let’s state it mathematically:

x ∈ C(AT ) and Ax = b =⇒ for a given x, b is unique.

In order to see it, we have to appreciate that b is a linear combination of

the linearly independent columns of A. For a specific x, it is a linear
combination with the specified coefficients. As we know, probably
from the first chapter, there is only one such linear combination; we
cannot get two different linear combinations with the same set of
linearly independent vectors and coefficients.
A consequence of C(AT ) being N (A)§ is that their dimensions
add up to the dimension of the containing space. Or, n = r +
dim (N (A)) =⇒ dim (N (A)) = n − r. The dimension of the
null space of A is called its nullity, and the statement that the rank
and nullity add up to the number of columns of A is the famous
rank-nullity theorem.

9.4 Left Null Space

To complete the picture and bring out the beautiful symmetry of the
whole system, we will define one more null space, which is N (AT ),
which is also called the left null space. It lives on the same side as
the column space, C(A), and no vector in the input space Rn (other
than the zero vector) can reach this space.
164 The Four Fundamental Spaces

9.5 Computing the Four Fundamental Subspaces

If we want to “find” or “compute” the column, row or null space

of a matrix, how do we proceed? First of all, we need to clarify
what “finding” means, when it comes to a subspace. Here is the
clarification: Find the basis for the subspace, state its dimension, and
the dimension of the space of which it is a subspace. For instance,
we could say, “C(A) ¢ Rm , dim(C(A)) = r, with the basis = {vi }”
specifying the three pieces of information we are after.

9.5.1 Row Space

We start with the computation of the row space because it is the

easiest one. To find the rank and basis of C(AT ), we will fall back on
Gaussian (or Gauss-Jordan) elimination, to get to the REF (or RREF)
of A. Remembering that row operations do not change the row space,
a good basis for the row space would be the pivot rows of RREF. The
dimension of the row space is the number of pivots, AKA rank(A) and
the containing space is Rn , where n is the number of columns of A.
The full specification of the row space of A ∈ Rm×n , rank(A) = r
would therefore be C(AT ) ¦ Rn with dimension = r and basis = the
set of r pivot rows.

9.5.2 Column Space

For a matrix A ∈ Rm×n of rank r, the dimension of C(A) is r, the

dimension of the containing space is m, the number of rows. Or,
more compactly, C(A) ¢ Rm , dim(C(A)) = r.
How do we find the basis, or the linearly independent columns of
A? The pivot columns are the ones that are linearly independent in
the reduced forms. It turns out that the corresponding columns in
the original matrix A are the ones that are linearly independent, and
therefore, they form a good basis for C(A). Let’s see why.
In order to make the discussion easier, we are going to call the
RREF of A by the name R. As we know, the solution set of the
homogeneous
system Ax = 0 can be obtained from its augmented
matrix A | 0 , by finding its RREF, which is R | 0 . Note that the
constants part of the augmented matrices are still 0, which is different
RREF
from what happens if we do A | b −−−→ R | b′ where b ̸= b′ in

Summary of the Four Spaces 165

general. But when b = 0, the constants part does not get affected by
row operations. Since the solution of a system of linear equations is
not affected by row operations, the solution sets for A and R are the
same.
Next, we can see that in R, the pivot columns are linearly indepen-
dent of each other by design. They are the only ones with a nonzero
element (actually 1) in the pivot positions, and there is no way we
can create a 1 by taking finite combinations of 0. Therefore, a linear
combination of the pivot columns in R will be 0 if and only if the
coefficients multiplying them are all zero. Since the solution sets
for A and R are the same, the same statement applies to A as well.
Therefore, the columns in A that correspond to the pivot columns in
R are linearly independent. And indeed, they form a basis for the
column space of A.

9.5.3 Null Spaces

The null space of a matrix would be the complete solution of the

homogeneous system of equations Ax = 0. When we write down the
augmented matrix, it becomes A | 0 , and whatever row operations
we perform on it will not change the constants or RHS part because
it is 0. We can therefore solve Ax = 0 by performing Gauss-Jordan
elimination on the matrix A itself. If we find columns with no pivots,
we will have free variables and null space.
To compute the left null space, we will follow the same procedure
to find the complete solution set of the homogeneous system of linear
equations AT x = 0. After all, the left null space is N (AT ), which
is the null space of AT .

9.6 Summary of the Four Spaces

We are now ready to summarize the four fundamental subspaces

defined by a matrix. We will do it as a nice picture first, discuss
it in the text here and then present it again in two tables. It is that
important.
166 The Four Fundamental Spaces

I * =Y×U rank I = C
! * =( = all possible ! I7 = K c * =* = all possible & impossible c

I 6 =U § =Y

e 8 : dim = f
e 8- : dim = f Column Space
Row Space 8' § 9 Image
Coimage Range
§9
8 '' + !,

8h § h

3 4- :
h
3 4 : 8' § dim = 9 2 :
dim = ; 2 : Left Null Space
Null Space Left or Cokernel
Kernel

=( = Domain = Linear Combinations of =* = Codomain =Linear Combinations of

vectors in i L/ and j L vectors in i L and j L0

Fig. 9.1 A pictorial representation of the four fundamental subspaces defined by a matrix.

9.6.1 General Properties

In Figure 9.1, we have a pictorial representation of what the four

fundamental subspaces mean in terms of the mapping A : Rn 7→ Rm
in the system Ax = b. They basically carve out different segments
of the two vector spaces, Rn and Rm , with specific properties.
The row space and null space are orthogonal complements. So are
the column space and the left null space. All the vectors in the mull
space (N (A)) map to the zero vector 0. All other vectors in the row
space get mapped to the column space C(A). All other vectors are
linear combinations of the vectors in N (A) and C(AT ). They also go
to C(A), was we saw earlier. Since C(AT ) and N (A) are orthogonal
complements, we have nothing left in Rn to go to the left null space
N (AT ), other that the zero vector 0 ∈ Rn .

9.6.2 Matrix Shapes and the Spaces

As we saw, the geometric view of the fundamental spaces defined by

a matrix are closely connected to its rank and the solvability of the
system of linear equations it represents. It may be worth our time to
look at the shapes of the matrix to see what its fundamental spaces
look like. The four subspaces are closely related to the RREF of the
matrix, and so are the solvability of the underlying system of linear
Summary of the Four Spaces 167

Table 9.1 Summary of the four fundamental spaces

Name, Space, Dim. Basis Description
Column Space Range (Image) of
C(A) ¦ Rm Pivot columns of A A : Rn 7→ Rm
dim(C(A)) = r b ̸∈ C(A) =⇒ no solution
Row Space
C(AT ) ¦ Rn Coimage of A : Rn 7→ Rm
Pivot rows of R or A
x ∈ C(AT ) =⇒ x 7→ b ̸= 0
dim(C(AT )) = r
Null Space (AKA
x ∈ N (A) =⇒ x 7→ 0
Kernel) Solution set of
If Ax = b and x§ ∈ N (A),
N (A) ¦ Rn Ax = 0 or Rx = 0
A(x + x§ ) = b
dim(N (A)) = n − r
Left Null Space Solution set of
N (AT ) ¦ Rm Unreachable, AKA Cokernel
AT x = 0 or
b ∈ N (AT ) =⇒ Ax ̸= b
dim(N (AT )) = m−r R′ x = 0
The table shows how the four fundamental subspaces of a matrix are related to the linear
RREF
equations it represents. A ∈ Rm×n , rank(A) = r in Ax = b with A −−−→ R and
RREF
AT −−−→ R′

equations. Keep in mind that while the REF (which is the result of
Gaussian elimination) can have different shapes depending on the
order in which the row operations are performed, RREF (the result
of Gauss-Jordan) is immutable: A matrix has a unique RREF.

Table 9.2 The fundamental spaces of matrices of different shapes

Matrix Shape RREF Observations on the Four Spaces
Square, full-rank
C(A), C(AT ) = Rn
A ∈ Rn×n In
rank(A) = n N (A), N (AT ) = {0}

“Tall,” full-rank C(A) ¢ Rm , dim (C(A)) = n

A ∈ Rm×n , m > C(AT ) = Rn
" #
In
n N (A) = {0}
rank(A) = n 0(m−n)×n N (AT ) ¢ Rm , dim (N (AT )) = m − n
“Wide,” full-rank C(A) = Rm
A ∈ Rm×n , m < C(A T
) ¢ Rn , dim (C(AT )) = m
Fm×(n−m)N (A) ¢ Rn , dim (N (A)) = n − m

n Im ·
rank(A) = m N (AT ) = {0}

Rank-deficient C(A) ¢ Rm , dim (C(A)) = r

A ∈ Rm×n C(AT ) ¢ Rn , dim (C(AT )) = r
" #
Ir · Fr×(n−r)
rank(A) = r < rm N (A) ¢ Rn , dim (N (AT )) = n − r
rm = min(m, n) 0(m−r)×n
N (AT ) ¢ Rm , dim (N (A)) = m − r
168 The Four Fundamental Spaces

We have presented, in Table 9.2, a comparison of the various

possible shapes and ranks of the matrices and the corresponding
variations in their fundamental spaces. We should note the following
points:
1. The column and row spaces always have the same dimension,
and it is equal to the rank of the matrix. In fact, it is another
definition for the rank.
dim (C(A)) = dim (C(AT )) = rank(A)

2. The dimension of the column space (which is the same as the

rank) and the dimension of the null space (which is also called
nullity) add up to the dimension of the domain. This is the
famous rank-nullity theorem2 .

For A ∈ Rm×n : Rn 7→ Rm , dim (C(A)) + dim (N (A)) = n

9.7 Computing the Spaces

It is probably best to illustrate the computation of the four funda-

mental spaces of a matrix using an example. Let’s compute all four
fundamental subspaces of the following matrix:
 
1 1 1 6
A =  2 2 1 9
1 1 0 3
In fact, this matrix and an associated system Ax = b are very similar
to something we solved way back in Chapter 4, in Eqn (4.4), when
we were still in Linear Algebra primary school. The difference now
is that we have added one more row, and removed the RHS of the
equations in A | b ; we are working with just A now.
     
1 1 1 6 1 1 1 6 1 1 0 3
REF RREF
A = 2 2 1 9 −−→ 0 0 −1 −3 −−−→ 0 0 1 3 = R
1 1 0 3 0 0 0 0 0 0 0 0

2
Wikipedia describes it as: “The rank-nullity theorem is a theorem in linear algebra, which
asserts that the dimension of the domain of a linear map is the sum of its rank (the dimension
of its image) and its nullity (the dimension of its kernel).”
Computing the Spaces 169

Looking at R, we can see that we have pivots in columns 1 and 3.

The rank of the matrix is 2. Let’s write down all the subspaces.

9.7.1 Column Space

The column space lives in R3 because the columns have three compo-
nents. The dimension of the column space = the rank = the number of
pivots = 2. The basis (which is a set) consists of the column vectors
in A (not in R) corresponding to the pivot columns, namely 1 and 3.
So here is our answer:

    
 1 1 
C(A) ¢ R3 ; dim (C(A)) = 2; Basis = 2 , 1
1 0
 

9.7.2 Row Space

For the row space, we can take the pivot rows of R as our basis. Note
that the row space lives in R4 because the number of columns of A
is 4. It also has a dimension of 2, same as the rank.

    

 1 0 
    
1 0

T 4 T

C(A ) ¢ R ; dim C(A ) = 2; Basis =   ,  


 0 1 
3 3
 

Notice how we have been careful to write the basis as a set of column
vectors, although we are talking about the row space. Our vectors are
always columns.

9.7.3 Null Space

For the null space, we will first solve the underlying Ax = 0 equa-
tions completely, highlight a pattern, and present it as a possible
shortcut, to be used with care.
We saw earlier that Ax = 0 and Rx = 0 have the same solution
set, which is the null space. Writing down the equations explicitly,
170 The Four Fundamental Spaces

we have
 
  x1
1 1 0 3  
x2 
Rx = 0 0 1 3 

 x3  = 0
0 0 0 0 (9.2)
x4
=⇒ x1 + x2 + 3x4 = 0 and x3 + 3x4 = 0
Noting that x2 and x4 are free variables (because the corresponding
columns have no pivots), we solve for x1 and x3 as x3 = −3x4 and
x1 = −x2 − 3x4 . Therefore the complete solution becomes:
       
x1 −x2 − 3x4 −1 −3
 x2   x 2 1
 = x2   + x4  0 
    
 =
x3   −3x4  0 −3
x4 x4 0 1
The complete solution is a linear combination of the two vectors
above because x2 and x4 , being free variables, they can take any
value in R. In other words, the complete solution is the span of these
two vectors, which is what the null space is. We give our computation
of the null space as follows:
    

 −1 −3 
    
1 0

4
N (A) ¢ R ; dim (N (A)) = 2; Basis =   ,  

 0 −3 
0 1
 

9.7.4 Left Null Space

The steps to compute N (AT ) are identical to the ones for N (A),
except that we start with AT instead of A, naturally. We will not go
through them here, but, as promised earlier, we will share a shortcut
for computing null spaces in the box on “Null Spaces: A Shortcut,”
and use the left-null-space computation of this A as an example.
From the examples worked out, we can see that the null-space
computations and the complete solutions of the underlying system of
linear equations have a lot in common. It is now time to put these
two topics on an equal footing, which also gives us an opportunity to
review the process of completely solving a system of linear equations
and present it as a step-by-step algorithm.
Review: Complete Solution 171

Null Spaces: A Shortcut

Once we have the RREF of a matrix, we can write down the null space by looking at it.
Here are the steps:
1. Identify the free-variable columns (which are the ones without pivots).
2. Copy these columns with a negative sign for the entries, while leaving a blank
space for the free variables. Ignore zero rows at the bottom of the RREF.
3. Type in 1 for the free variable and zero for all others in the vectors created in
(2). This is our basis.
4. The dimension of the containing space is the number of variables. The dimension
of the null space is the number of basis vectors in (3)
Illustrating it with the left-null-space computation of the example in the text:
1 2 1 1 0 −1
   
1 2 1  0 1 1 
AT =  ; RREF(AT ) = 
1 1 0 0 0 0 
6 9 3 0 0 0
Using our shortcut,
1. The only free variable is the third one (call it x3 ).
2. Copying it, with the negative sign and a blank ⃝ in the third position (because
our free variable is x3 ), we get:
 
1
−1
⃝

3. In the blank position, indicated by ⃝, of the first basis vector created in (2) we
type in 1.
4. The final answer is:
  
 1 
N (AT ) ¢ R3 ; dim N (AT ) = 1; Basis = −1
 1 

A bit of thinking should convince us that this shortcut is, in fact, the same as what we
did in computing the null space in the text, by writing down the complete solution. We
can easily verify that we get the same answer by applying this shortcut on the matrix R
in Eqn (9.2).

9.8 Review: Complete Solution

The computation of the null space of a matrix is, in fact, the same
as finding the special solutions of the underlying system of linear
equations. When we found the complete solution earlier in §4.3.2
(page 76) and when we computed the null space above, the procedures
172 The Four Fundamental Spaces

may have looked ad-hoc. Now that we know all there is to know about
the fundamental spaces of a matrix, it is time to finally put to rest the
tentativeness in the solving procedure and present it like an algorithm
with unambiguous steps. Let’s start by defining the terms.
In Ax = b, the complete solution is the sum of the particular so-
lution and the special solutions. We can see an example in Eqn (4.4),
where the first vector is the particular solution and the second and
third terms making a linear combination of two vectors is the special
solution. The linear combination is, in fact, the null space of A. As
we saw earlier, if x∥ is a solution to Ax = b, then x∥ + x§ also is a
solution for any x§ ∈ N (A) because, as we saw earlier,

Ax = A(x∥ + x§ ) = Ax∥ + Ax§ = Ax∥ + 0 = Ax∥

x∥ is a particular solution (which may be called xp ) and the set of x§

(AKA xs ) is the special solution.
Since the special solution, the set of x§ , is a subspace, we can select
any full set of linearly independent vectors as the basis to specify it.
Let’s work through an example to illustrate whatever we stated so far,
before listing the algorithm for completely solving a system of linear
equations.

9.8.1 An Example

We will reuse the example in Eqn (4.4) (on page 77), where we started
with these equations:

x1 + x2 + x3 + 2x4 = 6
2x1 + 2x2 + x3 + 7x4 = 9
from which we got the augmented matrix:

1 1 1 2 6 REF 1 1 1 2 6
A|b = −−→
2 2 1 7 9 0 0 −1 3 −3
and ended up with the complete solution:
     
3 −1 −5
0 1 0
x= 3 + t1  0  + t2  3 
     (9.3)
0 0 1
Review: Complete Solution 173

The null space of the coefficient matrix has the basis vectors appearing
as the linear combination in the last two terms above. N (A) is a plane
in R4 , and we can use any two linearly independent vectors on it as
its basis. The exact basis vectors we wind up with depend on the
actual elimination steps we use, but they all specify the same plane
of vectors, and indeed the same subspace. As we can see, in solving
the system of equations above,
we
started with the row-echelon form
of the augmented matrix A | b . The REF (the output of Gaussian
elimniation) of a matrix is not unique; it is the RREF (from Gauss-
Jordan elimination) that is unique.
Let’s solve the system again. This time, we will start by finding
the RREF (the output of Gauss-Jordan) because it is unique for any
given matrix.

1 1 1 2 6 RREF 1 1 0 5 3
A|b = −−−→
2 2 1 7 9 0 0 1 −3 3
In order to find a particular solution, we first set values of the
free variables to zero, knowing that they are free to take any values.
This step, in effect, ignores the free variables for the moment. In
the example above, the free variables are x2 and x4 , corresponding
to the columns with no pivots. Ignoring these pivot-less columns,
what we see is an augmented matrix A | b = I | b′ , giving us the

particular solution3 as below:

   
x1 3
1 0 3 x1 3  x2   0 
=⇒ = =⇒ xp =   x3  =  3 
  
0 1 3 x3 3
x4 0
The special solutions are from the null space, and we have to
solve the homogeneous of equations Ax = 0, for which the
system
augmented matrix is A | 0 . What we do, in practice, is to ignore
the b part of Ax = b, or set it to zero, and work with the coefficient
part. We then cyclically set one free variable
to one, and the rest to
zero in the equations specified by RREF( A | 0 ):

1 1 0 5 0 x1 + x2 + 5x4 = 0
=⇒
0 0 1 −3 0 x3 − 3x4 = 0

3
Note that the particular solution obtained using this prescription does not have to be in the
row space of the coefficient matrix because we are taking zero values for the free variables.
174 The Four Fundamental Spaces

Setting x2 = 1 and x4 = 0:
   
x1 −1
x1 + x2 + 5x4 =0 x1 + 1 = 0 x2   1 
=⇒ =⇒ x⊥1 x3  =  0 
=   
x3 − 3x4 =0 x3 = 0
x4 0
Now, setting x2 = 0 and x4 = 1:
   
x1 −5
x1 + x2 + 5x4 =0 x1 + 5 = 0 x2   0 
=⇒ =⇒ x⊥2 =
x3  =  3 
  
x3 − 3x4 =0 x3 − 3 = 0
x4 1
Putting it all together, we can write down the complete solution as:
     
3 −1 −5
0 1 0
x= 3 + t1  0  + t2  3 
     (9.4)
0 0 1

Comparing Eqn (9.4) above to Eqn (9.3), we can see that we

got identical solutions in our example. However, in principle, the
vectors in the linear combination in the last two terms may differ.
Our assertion is that what we may have in Eqn (9.4), if they were
different from the ones in Eqn (9.3), would merely be another pair
of linearly independent vectors in the same planar subspace specified
by the latter.
How do we make sure though? We know enough Linear Algebra
to answer this question: If we were to place these four vectors in a
matrix, we would have only two linearly independent columns, and
the rank of the matrix would be two. How do we know the rank? We
perform row-reduction on it and count the number of pivots.
Squeezing out the last teachable moment from this review of com-
pletely solving linear systems, the rank is also the trace of the RREF
of the coefficient matrix because the pivots are all normalized to one.

9.8.2 The Algorithm

Here is a description of what we did in the example above, but using

the most general case. We start with Ax = b, where A ∈ Rm×n
and rank(A) = r f min(m, n). We know the shape of the canonical
Other Names 175

form, RREF:
" #
RREF Ir · Fr×(n−r)
A −−−→ R = b′
0(m−r)×n
Note that we use the symbol · to indicate that the columns of Ir
and Fr×(n−r) may be shuffled in; we may not have the columns of F
neatly to the right of I. With this picture in mind, let’s describe the
algorithm for completely solving the system of linear equations:
1. Find RREF
through Gauss-Jordan on the augmented matrix
A|b →R
2. Ignore the free variables by setting them to zero, which is the
same as deleting the pivot-less columns in R and zero rows,
giving us Ir
3. Get the particular solution, x∥ with the r values in b′ , and zeros
for the n − r free variables
4. For each free variable,
with
the RREF of the homogeneous
augmented matrix A | 0 (which is the same R as above, but
with 0 instead of b′ in the augmenting column):
• Set its value to one, and the values of all others to zero
• Solve the resulting equations to get one special solution,
x§i
• Iterate over all free variables
5. Write down the complete solution:
n−r
X
x = x∥ + ti x§i
i=1

6. Know that we have computed the null space as well:

N (A) ¢ Rn , dim (N (A)) = n − r, Basis = {x§i }

9.9 Other Names

In order to see why these fundamental subspaces go by their aliases,

we have to look at functions again. In normal algebra, when we
176 The Four Fundamental Spaces

have a function f (x) : R 7→ R, the totality of the possible values of x

(which would be all of R since x is the so-called independent variable)
is called the domain and that of y = f (x) is called the codomain.
Not all of the codomain may be reached though. For instance, if
f (x) = x2 , the codomain is R, but the part of the codomain that is
reachable is y > 0. This part is called the range. Some people call
the range the image (because it is what x is reflected into, if we were
to guess). Not to be partial to y, they also call the part of the domain
that is reflected the coimage.
Thinking of A as a mapping or function A : Rn 7→ Rm , the
domain is Rn and the codomain is Rm . The column space may be
called the range or image, and the row space is the coimage, although
we will stay away from such abominations. Not to be outdone, the
null spaces also go by names such as kernel and right kernel. This
naming is a bit more mysterious, but everything in the kernel gets
mapped to 0, and nothing in the domain Rn ever gets mapped to the
left kernel.
Regardless of what names they go by, the four fundamental sub-
spaces bring together almost everything we learned so far. They form
the basis of the advanced topics in Linear Algebra and are therefore
critical for furthering our understanding.

Get the Full Edition of LA4CS with

Summaries, Exercises and Solutions
Only $7.95. Scan, Click or Tap to buy.
10
Projection, Least Squares
and Linear Regression

It was one thing to use computers as a tool, quite another

to let them do your thinking for you.
—Tom Clancy

Now that we learned the four fundamental subspaces defined by

a matrix, we can look at one of its most widely used applications
in machine learning, namely linear regression. For this, we will be
expanding on our notion of projection of one vector onto others. We
will be projecting to subspaces instead.

10.1 Projection Revisited

As a preparation for orthonormalization and the Gram-Schmidt pro-

cess, we looked at projection earlier in §7.4.3 (page 134). Our ap-
proach was to fall back on the trigonometric definition of the dot
product using the cosine of the angle between the vectors. This ap-
proach was perhaps ill-advised for we are not learning trigonometry
178 Projection, Least Squares and Linear Regression

here, but the much superior Linear Algebra. The right approach to
be taken, as we shall see here, is much more elegant.
6 "

5 6 "
) Several vectors
4 5 give same
)

'
c
projection 3.
3 4 Projection is not
invertible.
2 3 c = d3, d is a
In 3
&

3
(' Singular matrix
1 2

! 5
24 23 22 21 0 1 2 3 4 5 1 c3
ce &
21
spa (+) 24 23 22 21 0 1 2 3 4 5
Sub 22
'='
21
( ='2)
'
23 = +& = &+ 22

24 23

25 24
Fig. 10.1 Projection of one vector (b in blue) onto another (a in red). The projection
b̂ is shown in light blue, and the error vector in green. The right panel shows that many
different vectors (blue ones) can all have the same projection. The projection operation is
many-to-one, and cannot be inverted.

In Figure 10.1, in the left panel, we have a blue vector b being

projected onto the red a. The projection is another vector, b̂ in a
brighter shade of blue. Of all the vectors along a, why is this the
right vector to call the projection? One way to look it is to imagine
that we are shining some light from the top, perpendicular to a, in
the plane containing a and b. The projection then is the shadow
cast by the blue vector on the red one. Another way is to think of
the difference vector, shown in green as e, to be a minimum. The
projection is that vector along the direction a which is closest to b,
where closeness is defined as the distance between the tips of b and
its projection b̂, which is the same as the norm ∥e∥. We can call
this green vector the error (hence the name e), and we are trying to
minimize this error.
And when is e the smallest? It is when a and e are orthogonal to
each other, or when a § e. We know the condition for orthogonality:
The dot product aT e = 0. Since b̂ lies along the direction of a, we
know that it has to be a scalar multiple: b̂ = ax, where we called
the scalar x (and put it after a for our own dark purposes). For
each value of x, we will get a different candidate for the projection.
The right candidate, the one that minimizes the error, is b̂, for which
x = x̂. Now the problem of computing the projection boils down to
Projection Revisited 179

computing x̂.

aT e = aT (b − b̂) = aT (b − ax̂) = 0
aT b
aT ax̂ = aT b =⇒ x̂ =
aT a (10.1)
aT b −1
b̂ = ax̂ = a T = a(aT a) aT b
a a
−1
P = a(aT a) aT
Here, in addition to computing the projection value x̂, the projection
vector b̂, we have also defined a projection matrix P which we can
multiply with any vector and gets its projection onto a. It is indeed
a matrix because aaT is of a column matrix (n × 1) multiplying a
−1
row matrix (1 × n), giving as an n × n matrix. The factor (aT a)
in between is just a scalar, which does not change the shape.
Comparing the derivation of the projection matrix above in Eqn (10.1)
to the one we did earlier in Eqn (7.9), we can appreciate that they
are identical. It is just that the Linear Algebra way is much more
elegant. In this derivation, we took certain facts to be self-evident,
such as when two vectors are orthogonal, their dot product is zero.
We can indeed prove it, as shown in the box on “Orthogonality and
Dot Product.”

10.1.1 Why Project?

What we did was to project a vector b onto another one a, which is

to say we found the vector closest to b in the subspace spanned by a.
We now want to expand on this notion on projecting to a subspace,
but before that, we may want to ask ourselves why we want to project
to subspaces.
The motivation, as with most things in Linear Algebra, is connected
to the solvability of Ax = b. Let’s take the situation where we have
too many equations, which means our coefficient matrix, A is “tall.”
We are particularly interested in such matrices because our datasets
in computer science tends to be of that shape.
In general, “tall” data matrices with real data tend to be full rank
because data points, which make up the rows of such matrices, tend
to be linearly independent due to measurement errors and statistical
fluctuations. The RREF of a “tall” full-rank matrix is I near the top,
180 Projection, Least Squares and Linear Regression

and zero rows below it. Such matrices, as coefficient matrices in

Ax = b, have solutions only if the RHS, b is in the column space of
A, which in general, it will not be.
What do we do in such a situation? We still want to solve the
system even when b ̸∈ C(A). In this case, the best we can do is to
find the vector b̂ in the column space that is closest to b, which will
be its projection onto C(A).

10.2 Projection to Subspace

4
3 3 5)

f 2
c)
3
e 1
c
3 !
28 27 26 25 24 23 22 21 0 1 2 3 4 5 6 7 8
5! c!
3 21

23
-
24

Fig. 10.2 Projecting the blue b onto the subspace that is the span of a1 and a2 , the two
red vectors. The subspace is a plane, shown in light red. The projection, b̂ (in bright blue)
of b is in the subspace, and is therefore a linear combination of its basis vectors, a1 and a2 ,
shown in red.

In order to project to a subspace, we need to specify what the

subspace is, which we do in Linear Algebra by specifying the basis.
If our basis vectors are a1 and a2 , the subspace onto which we are
projecting is a plane, as shown in Figure 10.2. Exactly as we did in
Eqn (10.1), we want the formulas for x̂, the length of the project, or
b̂, the projected vector and P , the projection matrix. In the case of a
plane with two basis vectors, we will have two lengths x̂1 and x̂2 .
Since our projected vector b̂ is on the plane spanned by the basis
vectors, it is a linear combinations of them, and the coefficients of
the linear combination are what we call x̂1 and x̂2 . Referring to
Projection to Subspace 181

Figure 10.2, where we are projecting the blue b onto the red plane
which is the span of a1 and a2 , we have the projections along each
basis vector, b̂i = ai x̂i . Arranging the basis vectors as the columns
of a matrix, we can write our linear combination for b̂ as:
 
| |
x̂
A = a1 a2 , b̂ = A 1 = Ax̂ = b̂
 
x̂2
| |

which recovers the form of our favorite equation Ax = b. All we

need to do is to find x̂ for which we have the fact that the green
error vector e = b − b̂ is orthogonal to the red plane in Figure 10.2.
When a vector is perpendicular to a subspace, it is perpendicular to
every vector in it. In particular, e is orthogonal to every basis vector
ai , which means we have aTi e = 0. We can again write these two
equations (for i = 1 and 2) as a matrix equation: AT e = 0.

AT e = 0 =⇒ AT (b − b̂) = 0 =⇒ AT b = AT b̂ = AT Ax̂

Recasting the last part of the previous equation in the form Ax = b,

we write our equation which will solve all our problems from now
on:
−1 T
AT Ax̂ = AT b =⇒ x̂ = AT A A b (10.2)
What this equation tells us is that even if Ax = b does not have
a solution (because A is full-column-rank with inconsistent RHS),
multiplying both sides of it with AT magically makes it solvable, at
least in an approximate sense.
This approximation that gives us the best possible solution x̂ is, in
fact, identical to the least square minimization of the sum of squared
errors, if we were to do it in the old, calculus way. A numerical, itera-
tive minimization of the calculus kind may or may not converge, may
not converge to the global minimum, and may have computational
complexity issues. In the elegant world of Linear Algebra, it is only
a couple of matrix multiplications and an inversion.
Using our recently acquired knowledge of the four fundamental
space of A, we can immediately see that e ∈ N (AT ). And, b̂ ∈
C(A) because it is a linear combination of the columns of A. A is
guaranteed to be full-rank because its columns span a subspace, and
Ax̂ = b̂ is guaranteed to have a unique solution.
182 Projection, Least Squares and Linear Regression

The Hat Matrix

As we saw, the projection matrix P takes b to its estimated, “hatted” counterpart, b̂. For
this reason, some textbooks and articles, especially the ones on statistics, when dealing
with linear regression, call the projection matrix P the hat matrix H.
While deriving the projection matrix, we also looked the error vector e = b − b̂ when
projecting b onto the column space, C(A) where b̂ = P b was the projection. We can
then write
e = b − b̂ = b − P b = (I − P )b
Some statisticians consider this a second projection matrix, P2 = I − P . Remem-
bering that e ∈ N (AT ), P2 is the operator that will project a vector to the left null space
of A.

The last thing we need to do in this section is to write down the

projection matrix P such that P b = b̂. Knowing that Ax̂ = b̂ and
using the formula for x̂ from Eqn (10.2), the projection matrix is:
−1
P = A AT A AT (10.3)

10.2.1 Properties of Projection Matrix

The first property of the projection matrix is that applying it multiple

times on a vector is the same as applying it once. It makes sense from
the meaning of projection—once we project a vector onto another, the
projection is already collinear with it. Projecting a vector collinear
with another one onto the latter does not do anything. This property
is called idempotence and the projection matrix is idempotent, which
we can easily verify:

P = A AT A AT =⇒ P 2 = A AT A
−1 T
A A AT A
−1 −1 T
A

= A AT A AT A AT A AT
−1 −1

= AI AT A
−1 T
A = A AT A
−1 T
A
=P

Another property is that P has to be symmetric. This property

comes from the fact that dot product of two vectors is the same as the
dot product of one with the projection of the other onto it. Referring
Meaning of Projection 183

to Figure 10.1, it means:

aT b = aT b̂ = aT P b

On the other hand, a projected onto itself is, of course, a. Therefore,

aT b = (P a)T b̂ = aT P T b

With the two expressions for aT b, we can equate them and write,

aT P b = aT P T b =⇒ P = P T

Let’s verify if our P in Eqn (10.3) is indeed symmetric as it should

be.
−1 T T
P T = A AT A A
T T −1 T T
(1) = AT A A A
T −1 T
(10.4)
(2) = A AT A A
−1 T
(3) = A AT A A
=P

where we used (1) the product rule of transposes, (2) the fact that
the inverse of a transpose is the transpose of the inverse and (3) the
symmetry of AT A.

10.3 Meaning of Projection

The projection operation is a many-to-one mapping, as shown in

Figure 10.1, on its right panel. Since many vectors like b can be
projected onto a to have the same b̂ as shown, given b̂, we just cannot
figure out what b it came from. Such mappings are called injective
and they cannot be inverted. In other words, the projection operation
throws away some information.
In the world of Linear Algebra, the matrix corresponding to such
operations would be a rank-deficient one. When projecting to a
subspace defined by one vector, P is a rank-one matrix. For a
subspace of dimension two in Rn , it is a rank-two matrix. In general,
184 Projection, Least Squares and Linear Regression

for any subspace, P is singular, which is the reason why we cannot

apply the product rule for inverses to (AT A)−1 .
−1
P = A(AT A)−1 AT ̸= AA−1 AT AT = II = I

We should not do this expansion and the product rule does not apply
here because A−1 is not defined, which was the whole point of
embarking on this projection trip to begin with.
What happens if we take the full space and try to project onto it? In
other words, we try to project a vector onto a “subspace” of dimension
n in Rn . In this case, we get a full-rank projection matrix, and the
expansion of the inverse above is indeed valid, and the projection
matrix really is I, which is an invertible matrix because every vector
gets “projected” onto itself. Note that the two properties we were
looking for in P are satisfied by I: It is idempotent because I 2 = I
and of course I T = I.
For rank-deficient matrices, P is AA−1 Left , the left inverse multi-
plying on the right, almost like an attempt to get as close to I as
possible.

10.4 Linear Regression

One of the interesting applications of the idea of projection is linear

regression, where we try to model our data using linear functions.
Before going to real data with multiple variables and several obser-
vations, let’s start with a toy example.

10.4.1 Simple Linear Regression

When we have only one variable y depending linearly on another x,

we get the normal line fitting, which is called simple linear regres-
sion, as opposed to multiple linear regression when we have multiple
independent variables xi
Figure 10.3 shows a data table with just two variables and five
data points. We can think of y as a dependent variable, and x the
independent one on which y depends. The dependence is modeled as
a line, y = mx + c, which is the simple linear regression model. With
the data shown in Figure 10.3, we can easily type in the numbers and
get a “trendline” from our popular spreadsheet applications, which
Linear Regression 185

x y
1 1
2 2
3 2
4 5

' = 1.2* 2 0.5

+! = 0.8

Fig. 10.3 An example of simple linear regression. The data points in the table on the left
are plotted in the chart on the right, and a “trendline” is estimated and drawn.

gives us the values of m = 1.2 and c = −0.5 as shown. Let’s look at

how these data points become a system of linear equations, and how
their solution then become a projection problem.

y = mx + c    
1=m+c 1 1 1
2 1 m 2
2 = 2m + c Ax = b A=
3
 x= b= 
1 c 2
2 = 3m + c 4 1 5
5 = 4m + c
Since we are modeling our data as y = mx + c and we have five
(x, y) pairs, we get five equations as shown above, which we massage
into the Ax = b form. Notice how A has two columns, the first for
m and another one for c full of ones. If we had written our model as
y = c + mx, the column for the intercept c would have been the first
one.
As we can see, we have five equations,
and if we were to do
Gauss-Jordan elimination on A | b , the third row would indicate
an inconsistent equation 0 = 1, and the system is not solvable, which
is fine by us at this point in our Linear Algebra journey. We will get
the best possible solution x̂, which will give us our best estimates
for the slope and the intercept as m̂ and ĉ. The steps are shown below.
186 Projection, Least Squares and Linear Regression

1 2 3 4
T
A =
1 1 1 1

T 30 10
A A= AT A = 20
10 4
" 1 1
#
−1 −
AT A = 51 32

−
" 23 2 1 1 3
#
−1 T − 10
− 10 10 10
AT A

A = 1
1 0 − 21
" 6 # 2
−1 T 6 1
x̂ = AT A A b = 5 1 =⇒ m̂ = and ĉ = −
−2 5 2
As we can see, our linear regression model becomes y = m̂x + ĉ =
1.2x − 0.5, same as the trendline that the spreadsheet application
computed in Figure 10.3.
We also have the error vector e = b − b̂. b is what we project
onto the column space, C(A) and b̂ = P b is the projection. Let’s
go ahead and compute P and b̂ as well. Using the formula for the
projection matrix from Eqn (10.3), we get:
 7 2 1
− 51

10 5 10
−1 T  25 10 3 1 1 

5 10
T

P =A A A A =  1 1 3 2 

 10 5 10 5 
− 51 10
1 2
5
7
10
7  3 
10 10
 19   1 
 10   10  9
b̂ = P b =  
 31  ê = b − b̂ =  
− 11  ∥e∥ =
 10   10  5
43 7
10 10
The norm of the error vector is the variance (∥e∥) in the data that is ex-
plained by the model. The total variance is computed independently
(using its formula from statistics) as σ 2 = 2.25. The coefficient of
determination R2 is the fraction of the variance in the data that is
explained by our model, and it is 0.8, just as the trendline from the
spreadsheet application reports it in Figure 10.3.
Linear Regression 187

10.4.2 Multiple Linear Regression

When we have one dependent variable and one independent variable,

it is a simple linear regression (SLR). The statistical model in SLR is
a line, which means we are saying that the data points all should have
been on a line, but for some reason, they may wander and fluctuate;
therefore let’s look for the best fitting line.
When we have multiple independent variables, we have multiple
linear regression (MLR). For two independent variables, the model is
a plane, and we are trying to find the best plane that fits the data. For
n independent variables, the model is an n-dimensional subspace in
the coordinate space (not a vector subspace) that best describes the
data.

Height Weight Hair Len. Age Sex

(cm) (kg) (cm) (years) (M/F)
169.0 60.0 10.0 21.2 M 80
174.0 73.0 12.0 26.8 M
163.0 58.0 22.0 21.0 F
163.0 55.0 18.0 20.0 F
174.0 67.0 15.0 22.3 M
171.0 69.0 3.0 23.9 M
162.5 50.0 40.0 22.5 F 70
179.5 74.0 7.0 25.7 M
180.0 76.0 5.0 24.2 M
170.0 65.0 8.0 24.6 M
Weight (kg)

176.0 74.0 7.0 23.1 M

165.0 46.0 20.0 22.6 F
60
161.0 50.0 23.0 22.2 F
145.0 42.0 40.0 23.1 F
163.0 46.0 27.0 23.9 F
160.0 50.0 25.0 23.0 F
160.0 53.0 27.0 21.0 F
160.0 47.0 30.0 23.0 F
50
160.0 51.0 30.0 21.0 F
156.0 49.0 25.0 20.5 F
160.0 56.0 28.0 21.6 F
158.0 60.0 40.0 22.4 F
182.0 75.0 5.0 25.3 M
179.0 62.0 8.0 22.2 M
40
140

150

160

170

180

190

165.0 59.0 5.1 23.1 M

175.0 59.0 12.0 23.1 M
Height (cm)
î î î î î

Fig. 10.4 Example of a data matrix and its visualization in multiple linear regression.

In Figure 10.4, we have the first 26 rows of a dataset of 127

observations of weight, height, length of hair, age and sex, arranged
as a matrix. This so-called Young Adult dataset was used for an
unrelated research project. The visualization shows Height in the x
axis and Weight in y. The size of the point encodes Hair Len., which
we can think of on the z axis coming towards us. The bigger bubbles
indicate larger Hair Len., because they are closer to us. Finally the
color indicates the last column, Sex: Blue for F and orange for M.
Treating this as a teachable moment, we are going to switch from
our standard Ax = b notation to what is generally used in the
188 Projection, Least Squares and Linear Regression

Height Hair Len. Weight

Intercept
(cm) (cm) (kg)
&' = ( !" = $
1 169.0 10.0 60.0
1 174.0 12.0 73.0
G % = !G $
! !"
1
1
163.0
163.0
22.0
18.0
58.0
55.0
1 169 10
1 174.0 15.0 67.0 1 174 12
1 171.0 3.0 69.0 != = H ó Design Matrix * =O× NRS
1 162.5 40.0 50.0 1 163 22
1 179.5 7.0 74.0
1 180.0 5.0 76.0
î î î
1 170.0 8.0 65.0
1
1
176.0
165.0
7.0
20.0
74.0
46.0
IT
1 161.0 23.0 50.0 " = IS = J ó Regression Parameters * = NRS
1 145.0 40.0 42.0
1 163.0 27.0 46.0 IU
1 160.0 25.0 50.0
1 160.0 27.0 53.0
1 160.0 30.0 47.0 60
1 160.0 30.0 51.0
1 156.0 25.0 49.0 73
1 160.0 28.0 56.0
$= =M ó Target Variable * =O
1 158.0 40.0 60.0
58
1
1
182.0
179.0
5.0
8.0
75.0
62.0
î
1 165.0 5.1 59.0
1 175.0 12.0 59.0 HI
î î î î % = !G !
" !G $

Fig. 10.5 The standard notations used in multiple linear regression. Weight is the depen-
dent (or target or output) variable. Height and Hair Len. are the independent (or predictor
or input) variables.

literature in the context of MLR, as shown in Figure 10.5, color

coded for easy comprehension by our tired brains. Notice that in
the case of SLR, we had the model y = mx + c, with the intercept
introducing a column of ones in our A matrix. In MLR, we are going
to keep our intercept as the first parameter, and the column of ones
will be the first in our matrix, which we will now call X with the
fancy name Design Matrix. Our model is
y = β0 + β1 x1 + β2 x2 or Weight = β0 + β1 Height + β2 Hair Len.

Following the same matrix equations, now with the new notations
as in Figure 10.5, we get the best estimate for the parameter vector
β̂, so that our model (coming from the 26 data points shown in
Figures 10.4 and 10.5) becomes:
Weight = β̂0 + β̂1 Height + β̂2 Hair Len.
= −74.06 + 0.814 Height − 0.151 Hair Len.
Although it is not easy to visualize the model and the points, even in
the simple intuitive three-dimensional space, we have attempted to
show this model in Figure 10.5. What is perhaps more important is
to understand the model in terms of its parameters: We can see that
Linear Regression 189

Fig. 10.6 Attempt to visualize a three dimensional MLR model for Weight. All three
panels show the model (which is a plane) and the associated data points, but from different
perspectives. The middle one shows the dependency of Weight on Height, and the last one
shows that on Hair Len.

β̂1 = 0.814, a positive number indicating that the weight increases as

the height increases. On the other hand, β̂2 = −0.151, which tells
us that the weight decreases as the hair length increases, which is
consistent with the fact that women tend to have longer hair and they
tend to be smaller and lighter.
As in the case of SLR, we could compute R2 and analyze whether
it is appropriate to have the hair length in the model and so on, which
is what we might expect to see in a book on data analytics. This
book, however, is written for Linear Algebra, and this is probably an
appropriate point to stop our work on its geometric aspects.

Get the Full Edition of LA4CS with

Summaries, Exercises and Solutions
Only $7.95. Scan, Click or Tap to buy.
Part IV

Advanced Topics
11
Eigenvalue Decomposition
and Diagonalization

Would it save you a lot of time if I just gave up and went

mad now?
—Douglas Adams

We have completed the basics of Linear Algebra. We may have

gone a bit beyond the basics in its algebraic and geometric aspects.
Now it is time to switch gears and look at some topics that have enor-
mous impact in computer science as well as more classical sciences.
Eigenvalues are eigenvectors are the entry point to such topics. Al-
though it appears in the “Advanced Topics” part of this book, eigen-
value decomposition and the associated discussion usually appear
toward the end of all undergraduate-level courses on Linear Algebra.
The word “eigen” is German, and it means own or characteristic.
We may, therefore, see some people calling the eigenvectors the
characteristic vectors, although it is not common. What is much
more common is to call the expression that gives us the eigenvalues
of a matrix its characteristic polynomial.
192 Eigenvalue Decomposition and Diagonalization

11.1 Definition and Notation

We talked about Ax = b as a transformation A ∈ Rm×n : Rn 7→ Rm .

If we consider square matrices A ∈ Rn×n , then the mapping is from
Rn to Rn . We can then say that the matrix A is a mapping from a
space to itself: It takes vectors in one space and transforms them to
other vectors in the same space.
Are there vectors in the space that get transformed to a scalar
multiple of itself? If there are, such vectors are called eigenvectors.
Writing this statement in symbols, we come up with their definition.

Eigenvectors and eigenvalues

Definition: For A ∈ Rn×n , s ∈ Rn ̸= 0 is an eigenvector if As = λs
with the eigenvalue λ.

A few points to note about eigenvalues and eigenvectors:

1. For A ∈ Rn×n , its eigenvalues do not have to be in R. In other
words, just because we have square matrix over the field of
reals (A ∈ Rn×n ), we cannot assume that its eigenvalues are
real; it may have complex or imaginary eigenvalues.
2. Similarly, not all real matrices (A ∈ Rn×n ) have real eigenvec-
tors.
3. If s is an eigenvector of A with an eigenvalue λ, so is any
scaled version of it (rs), with the same eigenvalue λ. Proof:
A(rs) = rAs = rλs = λ(rs)

4. As a special case, λ can be zero. When λ = 0, we have

As = 0 =⇒ s ∈ N (A). Note that s ̸= 0.
5. Although presented here in terms of matrices and vectors, the
ideas behind the eigenvectors came from other fields, such as
physics. For this reason, eigenvectors are defined as those
vectors that do not change their “direction” when A applies on
them.
In this book, however, for our own overly pedantic reasons,
we stay away from the notion of “direction” of vectors to the
extent possible.
Examples of Eigenvalues and Eigenvectors 193

11.2 Examples of Eigenvalues and Eigenvectors

Before writing down a general method for finding eigenvalues and

eigenvectors, let’s look at a few examples.

11.2.1 Permutation Matrix

A permutation matrix is the one that shuffles the elements of a vec-

tor (or the rows of a matrix, as we saw earlier in §4.3.4, page 81,
when dealing with elementary matrices). In R2 , the only possible
permutation is r1 ´ r2 with the following A.

0 1 1 1 1 −1 1
A = 1 0 =⇒ A 1 = 1 , A −1 = 1 = −1 −1

As we can see above, we have two eigenvectors for this matrix, with
eigenvalues 1 and −1.

11.2.2 Projection Matrix

We came across projection matrices P ∈ Rn×n , which take any

vector x ∈ Rn to its projection onto a subspace S ¢ Rn . If the vector
x that P is acting on is already in the subspace S, we know that its
projection is itself. Calling these vectors x∥ , we can write:

P x∥ = x∥ = 1 × x∥

So we have a whole bunch of eigenvectors x∥ in S. Furthermore, the

eigenvalues for these eigenvectors would be one, λ = 1.
We also know that if x is orthogonal to the subspace S, the projec-
tion will be the zero vector. Calling such vectors x§ , we write:

P x§ = 0 = 0 × x§

So x§ is an eigenvector with λ = 0. Note that for any matrix A,

we can always write A0 = 0, but 0 is not an eigenvector by our
definition above.
194 Eigenvalue Decomposition and Diagonalization

11.2.3 Shear Matrix

A shear matrix transforms a square to a parallelogram1 . Here is an

example of a horizontal shear matrix, as shown in Figure 11.1.

1 0.5 1 1 0 0.5
A= =⇒ Aq1 = A = , Aq2 = A =
0 1 0 0 1 1

When any vector is transformed by A, its first component is not

affected. A horizontal shear leaves the x axis alone, and therefore q1
is an eigenvector with eigenvalue λ = 1.

1.25 "
SHEAR MATRIX 0 0.5
#" = ##" =
1 1
1
The unit vectors transform as:
1 1
!! = § !"! = 0.75
0 0
0 0.5
!# = § !"# = 0.5
1 1
ó The Shear Matrix
0.25
1
1 0.5 ##! =
)= 0
2 21.75 21.5 21.25
021 120.75 20.5 20.25 0 0.25 0.5 0.75 1
11.25 1.5 1.
#! =
20.25 0

20.5
Fig. 11.1 An example shear matrix, showing a square being transformed into a parallelo-
gram.

11.2.4 Rotation Matrix

We looked at rotation matrices earlier, when talking about orthogonal

matrices. In R2 , the rotation matrix is

cos θ − sin θ
Qθ =
sin θ cos θ

1
Although we state it like this, we should note that squares and parallelograms do not exist
in a vector space. They live in coordinate spaces, and this statement is an example of the
Notational Abuse, about which we complained in a box earlier. What we mean is that the
two vectors forming the sides of a unit square get transformed such that they form sides of
a parallelogram. We should perhaps eschew our adherence to this pedantic exactitude, now
that we are in the advanced section.
Computing Eigenvalues and Finding Eigenvectors 195

For any nontrivial θ (which means θ ̸= 2kπ for integer k), we can see
that Qθ changes every single vector in R2 . We have no eigenvectors
for this matrix in R2 .

11.2.5 Differentiation

We can think of the set of all functions (of one variable, for instance)
as a vector space. It satisfies all the requisite properties. The calculus
operation of differentiation is a linear transformation in this space; it
satisfies both the homogeneity and additivity properties of linearity.
d ax
e = aeax =⇒ eax is an eigenvector with eigenvalue a
dx
d2
sin x = − sin x =⇒ sin x is an eigenvector with eigenvalue − 1
dx2

11.3 Computing Eigenvalues and Finding Eigenvectors

In order to find the eigenvalues and then eigenvectors, we start from

their definitions.
As = λs =⇒ (A − λI)s = 0
Remembering that s ̸= 0, we can see that A − λI has a nontrivial
null space (to which the eigenvector s belongs). Since N (A − λI)
has nonzero vectors in it, A − λI is singular and its determinant has
to be zero, which gives us an equation for eigenvalues.
|A − λI| = 0
Thus, we get rid of s and end up with a polynomial in λ (when we
expand the determinant) equalling zero from which we can solve for
the possible values of λ. This polynomial is called the characteristic
polynomial of the matrix.
For a matrix A ∈ Rn×n , when we expand the determinant |A − λI|
using the Laplace formula in Eqn (3.5), we get a polynomial of order
n in λ, which should give us n roots, but not all of them may be real.
Once we have the eigenvalues, we can put them back in (A−λI) = 0
to find the eigenvectors s, which is the same as finding the null space
of (A − λI).
Let’s look at the examples from the previous section again to see
how we get the eigenvalues and eigenvectors.
196 Eigenvalue Decomposition and Diagonalization

Permutation Matrix

Starting from the permutation matrix in R2 , here are the steps.

0 1 −λ 1
A= 1 0 |A − λI| = 0 =⇒ 1 −λ = 0

Expanding the determinant, λ2 − 1 = 0 =⇒ λ = ±1

As we saw earlier, we have two eigenvalues, 1 and −1. To find
the corresponding eigenvectors, we substitute the λ values in either
As = λs and solve, or, equivalently, find the null space of A − λI.

−1 1 1
With λ = 1, (A − λI)s = s = 0 =⇒ s = 1
1 −1

1 1 1
With λ = −1, (A − λI)s = s = 0 =⇒ s = −1
1 1

Here, to find s, we are still using the column-picture of matrix mul-

tiplication and figuring out what linear combinations of the columns
of A − λI give the zero vector. As we can see, we get two dis-
tinct eigenvectors. We can also see that the eigenvectors are actually
orthogonal to each other. Note that any scaled versions of the eigen-
vectors are still eigenvectors with the same λ. For this reason, we
typically normalize them.
The fact that we got real eigenvalues and distinct and orthogonal
eigenvectors is not an accident. Real symmetric matrices (as our
A was, in the case of this permutation matrix) always have real
eigenvalues and a full set of orthogonal eigenvectors. They are the
best matrices to work with.

Projection Matrix

As a concrete example, let’s consider the projection matrix in R3 that

projects vectors to the xy plane. Still calling it A, what A needs
to do is to leave the first two components alone and make the last
component zero.
" #
1 0 0 1−λ 0 0
A= 0 1 0 |A − λI| = 0 =⇒ 0 1−λ 0 =0
0 0 0 0 0 −λ
Expanding the determinant, − λ(1 − λ)2 = 0 =⇒ λ = 0, 1, 1
Computing Eigenvalues and Finding Eigenvectors 197

Again, as we saw earlier, we have two eigenvalues, 0 and 1. But

note that λ = 1 is repeated. We state this fact more fancifully, that
the algebraic multiplicity of the eigenvalue (λ = 1) is two. Let’s go
ahead and try to find the corresponding eigenvectors.
" # " #
1 0 0 0
With λ = 0, (A − λI)s = 0 1 0 s = 0 =⇒ s = 0
0 0 0 1
" # " #
0 0 0 t1
With λ = 1, (A − λI)s = 0 0 0 s = 0 =⇒ s = t2
0 0 −1 0
Where t1 and t2 are any real numbers. We see that we have a small
issue with the second eigenvalue with algebraic multiplicity two: The
eigenvectors corresponding to it span a subspace, which is called the
eigenspace associated with the second eigenvalue (λ = 1). The
dimension of this eigenspace is two, which is called its geometric
multiplicity. All eigenvalues have eigenspaces associated with them,
with geometric multiplicity of at least one.
What we need to do when we have an eigenvalue with a geometric
multiplicity greater than one is to select any full set of linearly inde-
pendent vectors that span its eigenspace. In other words, we take its
basis as the eigenvectors. In this particular case of projecting to the
xy plane, the perfect basis would be the unit vectors along x and y
directions. Putting it all together, here is the full solution:
     
0 1 0
λ1 = 0, s1 = 0 λ2 = 1, s2 = 0 , s3 = 1
1 0 0

Shear Matrix

Moving on to our next example,

1 0.5 1 − λ 0.5
A= |A − λI| = 0 =⇒ =0
0 1 0 1−λ
Expanding the determinant, (1 − λ)2 = 0 =⇒ λ = 1
As we saw earlier, we have a single eigenvalue of 1, but with an
algebraic multiplicity of two. The corresponding eigenvector is.

0 0.5 1
With λ = 1, (A − λI)s = s = 0 =⇒ s =
0 0 0
198 Eigenvalue Decomposition and Diagonalization

The eigenvalue has an algebraic multiplicity of two, and a geometric

multiplicity of one. We cannot find a full set of (real) eigenvectors,
and we are in trouble because for A ∈ Rn×n , we would like to have
n eigenvectors.

Rotation Matrix

We saw that there were no eigenvectors for a rotation matrix in R2 .

Let’s consider a π2 -rotation, call the matrix A and attempt to find its
eigenvalues and eigenvectors.
cos π2 − sin π2

cos θ − sin θ 0 −1
Qθ = = = =A
sin θ cos θ sin π2 cos π2 1 0
−λ −1
|A − λI| = 0 =⇒ =0
1 −λ
Expanding the determinant, λ2 + 1 = 0 =⇒ λ = ±i
We have no real eigenvalues. For the sake of completeness, we can
try to find the eigenvectors, although we do not expect to find any in
R2 .

−i −1 1
With λ = i, (A − λI)s = s = 0 =⇒ s =
1 −i −i

i −1 1
With λ = −i, (A − λI)s = s = 0 =⇒ s =
1 i i
Thus, if we allow ourselves to step into the field of complex numbers,
we can find eigenvectors of the rotation matrix s ∈ C2 . Physically,
the vector that is conserved during rotation is perpendicular to the
plane of rotation, which is why gyroscopes work the way they do.
How that fact corresponds to the actual eigenvectors of the matrix we
computed above, however, is a fairly tortured explanation.

11.4 Properties

The eigenvalues and eigenvectors provide deep insights into the struc-
ture of the matrix, and have properties related to the properties of the
matrix itself. Here are some of them with proofs, where possible.
It is worth our time to verify these properties on the examples we
worked out above.
Properties 199

11.4.1 Eigenvalues

1. The sum of eigenvalues

equals the trace of the matrix. For an
n × n matrix A = aij ,
n
X n
X
λi = trace(A) = aii
i=1 i=1

Proof : Since the characteristic polynomial has roots λi , we can

write:
|A − λI| = (−1)n (λ − λ1 )(λ − λ2 ) · · · (λ − λn )
where we constructed the RHS to have roots λi and to match the
coefficient of λn in the determinant on the LHS, which expands
to give a polynomial in λ with the coefficient of λn = (−1)n .
On the LHS, the coefficient of λn−1 is
n
X
n
(−1) aii
i=1

On the RHS, the coefficient of λn−1 is

Xn
n
(−1) λi
i=1

Since the LHS and RHS coefficients have to match, we see that
Xn Xn
λi = aii = trace(A)
i=1 i=1

which proves the property.

Although we proved it by comparing the coefficient of λn−1 , it
is a lot easier to prove once we learn matrix similarity in the
next chapter. Note that we have to include λi in the summation
as many times as its algebraic multiplicity. Note also that for
A ∈ Rn×n , this property means that if A has any complex
eigenvalues, they should come in pairs of complex conjugates.
2. The product of eigenvalues equals the determinant of the ma-
trix. For an n × n matrix A,
Yn
λi = |A|
i=1
200 Eigenvalue Decomposition and Diagonalization

Proof : Again we start with the equality:

|A − λI| = (−1)n (λ − λ1 )(λ − λ2 ) · · · (λ − λn )
Set λ = 0 to get
n
Y
n n
|A| = (−1) (−1) λ1 λ2 · · · λn = λ1 λ2 · · · λn = λi
i=1

A corollary of this property is that singular matrices have at

least one zero eigenvalue.
3. The eigenvalues of a real, symmetric matrix are real.
A ∈ Rn×n , AT = A =⇒ λi ∈ R
Proof : To prove this property, we have to step into the scary
field of complex numbers again. The strategy for proving
something is real is to assume that it is complex, and then show
that its conjugate (where we replace all i with −i) is the same
as itself, which shows that it has to be real. Following this
strategy, let’s say that λ ∈ C is a possibly complex eigenvalue
of A, and s ∈ Cn×n is the corresponding eigenvector. (Note
that in step (2) below, we take the complex conjugate, which is
defined as: (a + ib)∗ = a − ib.)

(1) By the definition of eigenvalues, we have: As = λs

(2) Taking the complex conjugate, we get: A ∗ s∗ = λ ∗ s ∗
(3) Since A is real A∗ = A. Therefore: As∗ = λ∗ s∗
(4) Multiplying (1) on the left with s∗T : s∗T As = s∗T λs
(5) Flipping it around and reordering: λs∗T s = s∗T As
T
(6) Using the product rule of transposes: λs∗T s = AT s∗ s
T
(7) Since A is symmetric: = (As∗ ) s
(8) Using step (3) above: = λ∗ s∗T s

Finally, from steps (6) and (8), we get:

λs∗T s = λ∗ s∗T s =⇒ (λ − λ∗ )s∗T s = 0 =⇒ λ = λ∗
since s∗T s is the square of the norm ∥s∥2 (it is, for s ∈ Cn ,
as we shall see in the next chapter) of an eigenvector, which
cannot be the zero vector 0. λ = λ∗ means λ ∈ R.
Properties 201

4. The “opposite” of the previous property: The eigenvalues of a

real, antisymmetric (AKA skew symmetric) matrix are imagi-
nary.
A ∈ Rn×n , AT = −A =⇒ λi = iλ′i , where λ′i ∈ R
Proof : Identical to the previous proof, but with a negative sign
in step (7), showing λ = −λ∗ .
5. If we multiply a matrix (A) by a scalar (α), then all its eigen-
values (λi ) are multiplied by the same scalar.
Proof : As = λs =⇒ (αA)s = (αλ)s. That is it.
6. The eigenvalues of A + αI are λi + α, the eigenvalues of A
“shifted” by α.
Proof : If s is an eigenvector of A with the eigenvalue λ, we
have:
(A + αI)s = As + αs = λs + αs = (λ + α)s
which means s is an eigenvector of A + αI with the eigenvalue
λ + α, which goes for every eigenvector/eigenvalue pair.
7. The eigenvalues of real, symmetric matrices are related to the
pivots in the row echelon form (REF, or U in the LU or P LU
decomposition): The number of positive eigenvalues is the
same as the number of positive pivots. Same goes for negative
ones too. We will leave this property, known as Sylvester’s
Law of Inertia, without proof.

11.4.2 Eigenvectors

1. The eigenvectors of a matrix2 corresponding to distinct eigen-

values are linearly independent.

A ∈ Rn×n with Asi = λi si , si ∈ Rn , for 0 < i f n

λi ̸= λj and ai si + aj sj = 0
=⇒ ai = aj = 0 for 0 < i, j f n, i ̸= j

2
We use R (the real field, A ∈ Rn×n and s ∈ Rn ) in the mathematical statement and proof
of this property for convenience and because of its relevance to computer science, but the
property applies to C as well.
202 Eigenvalue Decomposition and Diagonalization

Sylvester’s Law of Inertia

The connection between the signs of the eigenvalues of a real, symmetric matrix and
its pivots is called Sylvester’s Law of Inertia. Inertia, in this context, is just the triplet
Inertia(A) = (p, n, z), with p− the number of positive eigenvalues, n, that of negative
ones and z the zeros. What the rule states is that these numbers are the same as numbers
of positive and negative pivots and the zero rows/columns (which, of course, is the
nullity). Note that this applies only if A ∈ Rn×n , AT = A.
The interesting corollary to this law is that the number of zero eigenvalues is the same
as the nullity of the matrix. Remembering that AT ARn×n and is symmetric for any
A ∈ Rm×n , we can extend the law to state that the rank of A (which is the same as the
rank of AT A) is the number of nonzero eigenvalues of AT A.
The nullity of A is the number of pivotless columns in A, which has to be the same as
the number of pivotless columns in AT A, Why? Because AT A and A have the same
rank, and the same number of columns n. Therefore, the nullity of A is the number of
zero eigenvalues of AT A.

Proof : We need to prove that if si and sj are eigenvectors of

A with distinct eigenvalues λi and λj , we will not be able to
find nonzero ai and aj such that ai si + aj sj = 0.

(1) Starting from: a i si + a j s j =0

(2) Multiplying on the left by A: ai Asi + aj Asj = 0
(3) Since si and sj are eigenvectors: a i λ i s i + a j λ j sj = 0
(4) Multiplying (1) with λi , we get: a i λ i s i + a j λ i sj = 0
(5) Subtracting (3) from (4): aj (λi − λj )sj =0
(6) Since λi ̸= λj and sj ̸= 0: aj =0
(7) Putting aj in (1), we can show: ai =0

The converse of this statement is not true: If the eigenvalues are

not distinct, the eigenvectors may still be linearly independent,
as in the case of the projection-matrix. Or the identity matrix In ,
which has the eigenvalue one repeated n times, but is already
diagonalized.
2. Real symmetric matrices have a full set of orthogonal eigen-
vectors.

A ∈ Rn×n with Asi = λi si , si ∈ Rn , for 0 < i f n

AT = A =⇒ si § sj for 0 < i, j f n, i ̸= j
Unit Circles and Ellipses 203

Proof :

(1) By the definition of eigenvalues: Asi = λi si

T
(2) Taking the dot product with sj : (Asi ) sj = λi sT
i sj

(3) Using the product rule of transposes: sT T

i A sj = λi sT
i sj

(4) Since A is symmetric: sT

i Asj = λi sT
i sj

(5) Since sj is an eigenvector of A: sT

i λj sj = λi sT
i sj

(6) Reordering: λj sT
i sj = λi sT
i sj

(7) Gathering terms: (λj − λi )sT

i sj = 0

(8) Since λj ̸= λi =⇒ sT
i sj = 0

(9) sT
i sj = 0 =⇒ si § sj

It may happen that the eigenvalues are repeated. For instance,

the identity matrix I ∈ Rn×n has n repeated eigenvalues of 1.
Every vector in Rn×n is an eigenvector of I. In this case also,
we can choose an orthogonal set of vectors as the full set of
eigenvectors for the matrix.
In some cases, the eigenvectors may span a subspace (called
eigenspace, of course), as in the case of the projection matrix.
Here again, we can choose an orthogonal eigenbasis for the
eigenspace associated with the repeated eigenvalue.

3. The eigenvectors of A + αI are the same as those of A.

Proof : As we already proved, if s is an eigenvector of A with
the eigenvalue λ,

(A + αI)s = As + αs = λs + αs = (λ + α)s

which should be proof enough. If not, it means s is an eigen-

vector of A + αI with the eigenvalue λ + α.

11.5 Unit Circles and Ellipses

One fair question we may have at this point is why we are doing all
this. It is all an academic exercise in intellectual acrobatics? We may
not be able to answer this question completely yet, but we can look
at a linear transformation and see what the eigenvalue analysis tells
204 Eigenvalue Decomposition and Diagonalization

us about it. In the last chapter, we will see how these insights are
harnessed in statistical analyses.
Let’s start with an example A ∈ R2×2 , find its eigenvalues and
eigenvectors, and look at them in the coordinate space R2 .
√ √
1 5
√ − 3 3 1 3 1 1 √1
A= λ1 = ; s1 = λ2 = ; s2 =
4 − 3 3 2 2 −1 2 2 3

1.25 "
-
+$ =
,
3 $"
1 = 1
1 2 3 2 1
= 4 3
*#
0.75

0.5
(!
= 3
2
0.25
,
+# =
- !
22 21.75 21.5 21.25 21 20.75 20.5 20.25 0 0.25 0.5 0.75 1 1.25 1.5 1.75 2
= 1
2

20.25
("

EIGENVALUES & VECTORS

20.5 ,!
1 5 $! = 1
2 3 = 1 4 2 5
3= 2 3
4 2 3 3 21 3
20.75
/ .
*. = *0 =
0 0
21
. 3 . 1
+! = +# =
0 21 0 3
21.25

Fig. 11.2 Visualization of eigenvalues and eigenvectors: A transforms the unit circle into
a rotate ellipse. The eigenvalues specify the lengths of its major and minor axes. And the
eigenvectors specify the orientation of the axes.

As we can see from Figure 11.2, A takes the first basis vector
(q1 , shown in red, dashed arrow) to its first column vector (a1 shown
bright red arrow): q1 7→ a1 . Similarly for the second one as well,
q2 7→ a2 , shown in various shades of blue. What happens to the
basis vectors happens to all vectors, and therefore, the unit circle in
the figure gets mapped to the ellipse, as shown.
Although we know about this unit-circle-to-ellipse business, from
the matrix A itself, we know very little else. Note that the vectors
to which the unit vectors transform (in qi 7→ ai ) are nothing special;
they are on the ellipse somewhere. What we would like to know are
the details of the ellipse, like its size and orientation, which is exactly
what the eigenvalues and eigenvectors tell us. The eigenvalues λi
are the lengths of the major and minor axes of the ellipse and the
eigenvectors are the unit vectors along the axes. In Figure 11.2, the
Diagonalization 205

eigenvectors (s1 and s2 ) are shown in darker shades of red and blue,
while the corresponding eigenvalues are marked as the lengths of the
axes.
When we move on to higher dimensions, ellipses become ellip-
soids or hyper-ellipsoids, and the axes are their principal axes. The
mathematics of eigenanalysis still stays the same: We get the direc-
tions and lengths of the principal axes. And, if the matrix on which
we are performing the eigenanalysis happens to contain the covari-
ance of the variables in a dataset, then what the eigenanalysis gives
us are insights about the directions along which we can decompose
the covariance. If we sort the directions by the eigenvalues, we can
extract the direction for the highest variance, second highest variance
and so on. We will revisit this idea in more detail in one of our last
topics, the Principal Component Analysis, which is the mainstay of
dimensionality reduction in data science.

11.6 Diagonalization

We learned quite a bit about eigenvalues and eigenvectors by now. We

might still wonder why at this point. What is the point in learning all
this trivia about them? We hinted at its significance in data analytics
for dimensionality reduction. We have one more good reason; there
is a method to this madness. Once we have the eigenvectors of a
matrix, we can diagonalize it. And once we diagonalize, we can
immediately see how it can help in computing the powers of the
matrix. Why would we want to take powers of matrices? Because it
is the basis of modeling time-varying systems mathematically.

11.6.1 S and Λ
Suppose A ∈ Rn×n has its eigenvalues λi and the corresponding
eigenvectors si . Let’s construct two matrices, arranging the eigen-
vectors as columns, and the eigenvalues as diagonal elements:
 
  λ1 0 ··· 0
| | ··· |  0 λ2 ··· 0
S =  s1 s2 · · · sn  Λ =  ..
 
.. .. 
| | ··· |
. . . 0
0 0 · · · λn
206 Eigenvalue Decomposition and Diagonalization

For simplicity, we may write S = [s] and Λ = [λ]. With these new
matrices, we arrive at the most important result from this chapter:
AS = SΛ
The LHS is a matrix multiplication, where the product matrix has
columns that are product of A and the corresponding column in S.
In other words, AS = A[si ] = [Asi ]. We think of the RHS using
the column picture of matrix multiplication again: The columns of
SΛ = [si ]Λ are the linear combinations of si taken with the scaling
factors in the columns of Λ. But the scaling factors are merely λi in
the ith place. Therefore, the ith column in AS = SΛ is the same as
Asi = si λi , which is the now-familiar the definition of eigenvalues
and eigenvectors.

11.6.2 The Decomposition

If we know that S is invertible, we can go one step further and write:

If S −1 exists, AS = SΛ =⇒ A = SΛS −1
This is the famous eigenvalue decomposition of a real, square, diag-
onalizable matrix. Most matrices are diagonalizable over the field of
complex numbers.

11.6.3 Powers of A

As previously advertised, the reason for this decomposition is that it

gives us a way to express the powers of the matrix. For a square,
diagonalizable matrix A ∈ Rn×n , we can write,
A = SΛS −1 =⇒ A2 = SΛS −1 SΛS −1 = SΛ(S −1 S)ΛS −1
= SΛIΛS −1 = SΛΛS −1
A2 = SΛ2 S −1

Similarly, we can easily see that Ak = SΛk S −1 . While this result

may look fairly mundane, let’s think about a medium-sized matrix,
say 100×100, and k = 60. Imagine the number of operations required
to compute A60 , which is probably of the order of 60×1003 = 6×107
(for the 60 matrix multiplications). But if we have the decomposition,
taking the 60th power of Λ is trivial, merely 100 exponentiations. The
two matrix multiplications cost about 2 × 1003 = 2 × 106 operations
Diagonalization 207

Invertibility vs. Diagonalizability

Not all invertible matrices can be diganolized. We saw an example earlier. The shear
matrix has an inverse that is easily written down.:

1 0.5 1 −0.5
A= A−1 =
0 1 0 1
But A has only one eigenvector, and is not diagonalizable.
Not all diagonalizable matrices are invertible. We already know that the projection
matrix is not invertible because it destroys information; it is a many-to-one (or injective)
mapping. But it can be diagonalized.
     
1 0 0 0 1 0 0 0 1
A = 0 1 0  S = 0 0 1  S −1
= 1 0 0  = S T
0 0 0 1 0 0 0 1 0

A = SΛS −1
In fact, A is already a diagonal matrix: It has nonzero elements only along its diagonal.
We know the condition for A to be invertible, or for A−1 to exist. Let’s state it
several different ways, as a means to remind ourselves. A is invertible if:
• |A| =
̸ 0. Otherwise, as Eqn (5.1) clearly shows, we cannot compute A−1
because |A| appears in the denominator.
• N (A) = 0, its null space contains only the zero vector. Otherwise, for some x,
we have Ax = 0, and there is no way we can invert it to go from 0 to x.
• λi ̸= 0, all its eigenvalues are nonzero. Otherwise, |A|, being the product of
eigenvalues, would be zero.
• λi ̸= 0, all its eigenvalues are nonzero. Another reason, otherwise, for the zero
λ, we have Ax = 0, which implies the existence of a nontrivial null space.
The diagonalizability of A is tested using the invertibility of its eigenvector matrix S.
Although this point is probably not critical for our view of Linear Algebra as it applies to
computer science, we might as well state it here. For a matrix to be non-diagonalizable,
the algebraic multiplicity of one of its eigenvalues (the number of times it is repeated)
has to be greater than its geometric multiplicity (the number of associated eigenvectors),
which means the characteristic polynomial needs to have repeated roots to begin with.
The roots are repeated if the discriminant of the polynomial (similar to b2 − 4ac in the
quadratic case) is zero. The discriminant being a continuous function of the coefficients
of the polynomial, which are the elements of the matrix, it being zero happens with
a frequency of the order of zero. But the roots being complex happens half the time
because the discriminant is less than zero half the time.

and the exponentiations of the 100 diagonal elements, a negligible

amount. The overall saving in computational time, therefore, is
roughly 30.
The best algorithms for matrix multiplications take about n2.3 op-
erations. If we are computing the k th power of A ∈ Rn×n , therefore,
208 Eigenvalue Decomposition and Diagonalization

it would cost us kn2.3 operations. But with diagonalization, it will

cost us n exponentiations of the diagonal elements of Λ and two
multiplications, or a total of n + 2n2.3 operations, which implies a
kn2.3 k
reduction of n+2n 2.3 ≈ 2 for large n.

Since we have any power of A being expressed as, essentially,

powers of Λ, we can make the same statement about polynomials of
A.

11.6.4 Inverse of A

Since we have a product for A, we can take its inverse using the
product rule of inverses.
−1 −1 −1 −1
A−1 = SΛS −1 = S −1 Λ S = SΛ−1 S −1

Since Λ is a diagonal matrix with λi along the diagonal, its inverse

is another diagonal matrix with the elements equal to the reciprocal
of λ1 . Therefore, A−1 has the same eigenvectors si as A, with
eigenvalues equal to the reciprocals of the eigenvalues of A: λ1 i . It
can be seen even more directly as follows:
1
Asi = λi si =⇒ si = λi A−1 si =⇒ A−1 si = si
λi
Enough said.
Since A−1 = SΛ−1 S −1 , we can see that Ak = SΛk S −1 holds
for k < 0 as well. Extrapolating even further, through the Taylor
series expansion, we can compute entities like eA (matrix exponenti-
ation), which are essential in solving differential equations—a topic
we consider beyond the scope of this book.

11.6.5 Difference Equations

In building mathematical models of systems that evolve in time, we

may come across the situation where the state of the system at any time
step depends on the state at the previous step. If all the parameters
specifying the state can be written as a vector, we may be able to
write the time evolution as xk+1 = Axk .
If we know the initial conditions at time step zero, we can write:

xk = Axk−1 = A2 xk−2 = · · · = Ak x0
Diagonalization 209

And, if we have the eigenvalue decomposition of A = SΛS −1 , we

know that we can compute the matrix raised to the power k without
worrying too much about the computational cost.

11.6.6 Eigenbasis

If we have a full set of real eigenvectors for A ∈ Rn×n , we can use

them as a basis for Rn , which we will call the eigenbasis. We can then
express any vector x0 ∈ Rn as a linear combination of the eigenbasis
vectors, remembering what we learned about changing the bases of
vectors earlier in §7.2 (page 128).
n
X
x0 = s 1 c 1 + s 2 c 2 + · · · + s n c n = si ci = Sc
i=1

where ci are the coordinates of x in the eigenbasis. (We wrote the

scalar after the vector in the summation so that the matrix product is
easier to spot.)
Once we have the vector x0 in the eigenbasis of A, we can do
simplify the multiplications of the powers of A with x as in the
following:
n
X n
X n
X
Ax0 = A si c i = Asi ci = λi si ci = SΛc
i=1 i=1 i=1

n
X
k
A x0 = λki si ci = SΛk c
i=1

Why does this matter? Why use the eigenbasis? Let’s think of A
as the transformation encoding the time evolution of a system with
xk its state at a given step (or iteration, or a point in time). Given
the state of the system at one step, we evolve it to the next step by
multiplying with A to get xk+1 = Axk .
Knowing the initial state x0 and, more importantly, the transition
matrix A, what can we say about the stability of the system? We can
say the following:
n
X
k
lim xk = lim A x0 = λki si ci
k→∞ k→∞
i=1,|λi |>1
210 Eigenvalue Decomposition and Diagonalization

In other words, in the sum that makes up xk , only those eigenvalues

(whether they are real or complex) whose absolute value is greater
than 1 matter; they are the only ones that will survive when we take
the limit k → ∞.

11.7 Fibonacci Numbers

As an example of how this idea of the powers of a matrix applies

to a real-world problem, let’s look at the Fibonacci numbers. This
problem may be academic in nature, but it does show the kind of
thinking that goes into transforming a problem to bring it into the
domain of eigenvalues.
The famous Fibonacci sequence appears in nature in unexpected
ways, and is heavily used in mathematics and, closer to home, in
computer science. The Wikipedia entry on it has a comprehensive
listing of its properties and interesting facts.
To see how we connect the eigenvalue computation with Fibonacci
numbers, let’s start by writing them down. The sequence of num-
bers is: 0, 1, 1, 2, 3, 5, 8, 13, · · · : Each number in the sequence (af-
ter the first two) is the sum of the previous two. Calling them
f0 , f1 , f2 , · · · , fk , · · · , we can say that fk+2 = fk+1 + fk . This is
what we might call a second-order difference equation because each
number depends on the previous two. We do not see any vector or
matrix here, do we? In order to reveal them, let’s make a vector out
of two Fibonacci numbers:

fk+1 fk+2 fk+1 + fk
xk = =⇒ xk+1 = =
fk fk+1 fk+1
which gives us a chance to write it as matrix equation, and connect
xk to x0 :

1 1 fk+1 1 1
xk+1 = = x = Axk = Ak+1 x0
1 0 fk 1 0 k
From the Fibonacci sequence, we can see that the numbers are
growing, and we may want to find out how fast they are growing.
Or maybe we want to have an approximation for the k th Fibonacci
number. The first question about the growth rate is directly answered
by the eigenvalues of A, and the second one by the eigenbasis repre-
sentation of xk .
Fibonacci Numbers 211

Let’s first compute the eigenvalues.

1 1 1−λ 1
A= 1 0 A − λI = 1 −λ

|A − λI| = 0 =⇒ −(1 − λ)λ − 1 = 0 or λ2 − λ − 1 = 0

√ √
1+ 5 1− 5
λ1 = ≈ 1.618 λ2 = ≈ −0.618
2 2
Since |λ1 | > 1 and |λ2 | < 1, it is the first eigenvalue that will
dominate large Fibonacci numbers. In particular, fk+1 is going to be
about 1.618 times bigger than fk as k → ∞.
Let’s now try to find an equation, a closed-form formula, for fk .
We will start with the eigenvectors, express x0 in the eigenbasis
and evolve it to xk . The eigenvectors of A are the solutions to
(A − λI)s = 0.

1−λ 1
(A − λI)s = 0 =⇒ 1
s=0 −λ

λ
=⇒ s = 1 for λ = λ1 , λ2

x0 is a linear combination of s1 and s2 .

1 λ1 λ
x0 = c1 s1 + c2 s2 =⇒ = c1 + c2 2
0 1 1

After a bit of algebra, we will get:

1 1
c 1 = √ , c1 = − √
5 5
We know how the evolution of x:

f
k k
xk = A x0 = A (c1 s1 + c2 s2 ) = c1 λk1 s1 + c2 λk2 s2 = k+1
fk

We can now read the second element in xk as fk :

fk+1 k λ1 k λ2
= c 1 λ1 + c 2 λ2 =⇒ fk = c1 λk1 + c2 λk2
fk 1 1
212 Eigenvalue Decomposition and Diagonalization

Knowing that the second term vanishes for large k (because |λ2 | =
0.618 < 1) , we finally get an expression for fk :
√ !k
1 1 + 5
fk ≈ c1 λk1 = √ = fk(approx)
5 2

How good this approximation is is shown in Table 11.1, which shows

that the approximation is stunningly accurate. By the time we reach
k = 11, the error is about 25 parts in a million.

(approx)
Table 11.1 Fibonacci numbers (fk ) vs. its approximation (fk )

k fk fk(approx) k fk fk(approx) k fk fk(approx)

0 0 0.45 4 3 3.07 8 21 21.01
1 1 0.72 5 5 4.96 9 34 33.99
2 1 1.17 6 8 8.02 10 55 55.00
3 2 1.89 7 13 12.98 11 89 89.00

11.8 Applications of Eigenvalues and Eigenvectors

The ideas behind eigenvalue decomposition has a multitude of ap-

plications, especially in physics and other physical sciences. In our
domain of computer science, the Google Page Rank algorithm, de-
scribed in its own box in the next chapter, is a brilliant success story
of this line of thinking. Since this chapter is in the advanced part of
this book, we do not list the applications here, but rather note that a
good starting point to explore would be the Wikipedia page.
Big Recap: The Story So Far 213

11.9 Big Recap: The Story So Far

As we saw multiple times, everything in Linear Algebra is connected

to everything else. It is a big and beautiful jigsaw puzzle. Although
we chose to learn it in the particular sequence that we did, we could
have started out exploration any one of its corner pieces. Let’s look at
what we have learned, this time from the perspective of the interplay
between the ranks and shapes of matrices and the four fundamental
spaces. As we shall see, this summary will also tells more about
the process of solving equations and projecting onto column (or even
row) spaces.

! * ="×! rank ! = .
! * =! = all possible ! !" = $ . * =" = all possible .
! 6 =! § ="

$ % : dim = ,
$ %! : dim = , Column Space
Row Space %- § / Image
Coimage Range
§/
% - ' + 2#

%0 § 0

! "! :
0
! " : %- § dim = ) 2 +
dim = , 2 + Left Null Space
Null Space Left or Cokernel
Kernel

=! = Domain = Linear Combinations of =" = Codomain =Linear Combinations of

Fig. 11.3 Recap of the four fundamental spaced defined by a matrix.

Given a matrix A ∈ Rm×n , we have a mapping from Rn to Rm , as

shown in Figure 11.3, where the input space is shown in green while
the output space is red. Ax = b says that b is a linear combination of
the columns of A. All such linear combinations are in C(A), which
is a subspace of Rm .
It is true that multiple vectors x ∈ Rn will get mapped to b ∈ C(A).
But the mapping from the row space C(AT ) to the column space C(A)
is one-to-one. In other words, if we take any vector b ∈ C(A), there
is one and only one vector x ∈ Rn such that Ax = b. Why is that?
Let’s prove it rather formally.
214 Eigenvalue Decomposition and Diagonalization

Proof : To prove this fact, we will assume its negative and establish
that leads to a contradiction. Let’s assume that we have two nonzero
vectors x1 , x2 ∈ C(AT ), x1 ̸= x2 such that Ax1 = b and Ax2 = b.

(1) Assumption: Ax1 = b, Ax1 = b

(2) Subtracting, we get: A(x1 − x2 ) = 0
(3) A(x1 − x2 ) = 0 =⇒ x1 − x2 ∈ N (A)
(4) x1 ̸= x2 =⇒ x1 − x2 ̸= 0
(5) Closure property of C(AT ) :
x1 , x2 ∈ C(AT ) =⇒ x1 − x2 ∈ C(AT )
(6) x1 − x2 ∈ C(AT ) and
x1 − x2 ∈ N (A) =⇒ x1 − x2 ∈ C(AT ) ∩ N (A)
(7) C(AT ) § N (A) =⇒ x1 − x2 =0

The statement (7) says that since C(AT ) and N (A) are orthogonal
to each other, the only vector in both is the zero vector. As we can
see, we statements (4) and (7) both of which cannot be true at the
same time. Therefore, for every b ∈ C(A) such that Ax = b, there is
only one vector x ∈ C(AT ) that satisfies the equation. The mapping
C(AT ) 7→ C(A) is one-to-one. Intuitively, since we have r linearly
independent columns (the pivot Pr columns) that span C(A), any one
nonzero linear combination ( i=1 xi ai ) should have a unique set of
coefficients xi , and these coefficients will form a vector x ∈ C(AT ).
If they did not, where would the vector x be, in the input space?

11.9.1 Full-Rank Square Matrices

When we have a full rank, square matrix A ∈ Rn×n , rank(A) = n,

the row space is all of Rn . So is the column space. The mapping
A : Rn 7→ Rn , which is the same as A : C(AT ) 7→ C(A), is one-
to-one. If we are given any b in Ax = b, we can always find the
corresponding x by the inverse mapping, A−1 , which is guaranteed to
exist. In terms of the four fundamental spaces, the picture looks like
what is depicted in Figure 11.4. Note that both N (A) and N (AT )
contain only the zero vector.
Solving Ax = b is the same as finding the inverse A−1 , either
of which can be done using Gauss-Jordan elimination. This is what
Big Recap: The Story So Far 215

! * =!×! rank ! = 0
! * =! = all possible ! . * =! = all possible .
!" = $
! 6 =! ÿ =!
$ % : dim =3
$ %! : dim =3 Column Space
Row Space %- ÿ / -ÿ% $% /
2#
2"
0#
0"

ï
ï

$
2
$

%0 ÿ 0
0

$% 0
0ÿ% Left Null Space: ! "! ={0}
Null Space: ! " ={0}
" = !67 $
!67 : =! ÿ =!
!67 * =!×! rank !67 = 0

Fig. 11.4 Recap of the four fundamental spaced defined by a full-rank, square matrix.
Notice how A−1 is a mapping from C(A) back to C(AT ).

people mean by “n equations in n unknowns means you can solve.”

As we shall see again soon, they should be saying “n independent
and consistent equations in n unknowns means we can find a unique
solution.”

11.9.2 Full-Column-Rank, Tall Matrices

Figure 11.5 shows the four fundamental subspaces of a full-column-

rank, tall matrix. Notice that the null space, N (A) is trivial (meaning,
it contains only the zero vector), and the row space is all of the
input space: C(AT ) = Rn . What this means is that any vector in
the input space will be mapped to a unique vector in C(A). Why
unique? Because in Ax = b, b ∈ Rm is a linear combination of
the columns of A and, as we saw earlier, any linear combination of
linearly independent vectors is unique. In other words,P given a set of
n linearly independent vectors ai and scalars xi , xi ai is a unique
vector. And, of course, this sum is exactly what Ax is.
The column space, C(A), however, does not cover all of the output
space Rm . Therefore, if we take a vector b ∈ Rm such that it is not
in the column space (b ∈ / C(A)), it is not a linear combination of
the columns of A, and there is no solution to x such that Ax = b.
The system of linear equations is inconsistent.
If we do Gaussian
Elimination on the augmented matrix A | b , we will get at least
216 Eigenvalue Decomposition and Diagonalization

! * ="×! rank ! = 0 < 3

! * =! = all possible ! . * =! = all possible .
!" = $
! 6 =! ÿ ="
$ % : dim =3
$ %! = =& Column Space
%- ÿ / %$%
Row Space -ÿ '()* / 2#
2"
0#
0"

ï
ï

- ÿ % $% (/
+ /# )

$
'()*

2
%

%0 ÿ 0
0

$%
0 ÿ %'()* 0
! "! :
dim = ) 2 +
0 ÿ % $% / Left Null Space
'()* # $

" = !67 ! !" #

;<=> $ ! !"

ï
!67 "
;<=> : = ÿ =
!

Fig. 11.5 Recap of the four fundamental spaced defined by a full-column-rank, tall matrix.
Here, ALeft is the one-to-one mapping C(A) 7→ C(AT ).
−1

one zero equal to nonzero row, indicating this fact. For Ax = b, for
a general b ∈ Rm : No solution if b ∈ / C(A).
We also saw that we can get to the best approximation to the
solution, x̂, by projecting b onto C(A) as b̂, in which case Ax̂ = b̂
has a unique solution because b̂ ∈ C(A) by construction (projection).
This process of projection is considered least square minimization
because when b̂ is the vector in C(A) “closest” to b. In other words,
the error vector b − b̂ has the smallest Euclidean norm.
We now have a recipe for finding the best approximation: Project
−1 T
b onto C(A) as b̂ = A AT A A b and solve Ax̂ = b̂ and we
T
−1 T
get Ax̂ = b̂ =⇒ Ax̂ = A A A A b. Notice that this last
equation states that some linear combination of the columns of A is
the same as some other linear combination of the same. Since we
know that the linear combinations of linearly independent vectors are
unique, we conclude:

−1
x̂ = AT A AT b

We also saw, in our discussion earlier that even when Ax = b is

not solvable (with a full-column-rank, tall A), AT Ax̂ = AT b has
Big Recap: The Story So Far 217

solutions, which also gives

−1
x̂ = AT A AT b

Finally, remember the left inverse, which is defined for a full-

column-rank, tall matrix? A−1 T
Left = (A A)
−1 T
A with A−1Left A = I.
Therefore, when we have Ax = b and A is full-column-rank, tall
matrix, we can write
T
−1 T
A−1 −1 −1
Left Ax̂ = ALeft b =⇒ x̂ = ALeft b = x̂ = A A A b

Why do we put a hat on x there? Because we know that the system

Ax = b does not have solutions, in general, unless b ∈ C(A). What
we are getting is the best approximation to the solution, which is
indeed the same as projecting b on to C(A) and solving. It is also the
same as the least square solution.
In summary, A ∈ Rm×n , rank(A) = n < m =⇒ |A| and A−1
are not defined, and we do not have (in general) solutions to Ax = b.
We do have a least-square solution, which is the best approximation,
and it can be arrived at in a variety of ways, all of which give the
same answer. Although unrelated to solving the system of equations,
keep in mind that an eigenanalysis is not possible because the matrix
A is not a square one.

11.9.3 Full-Row-Rank, Wide Matrices

A full-row-rank, wide matrix is indeed the transpose of a full-column-

rank, tall matrix. In terms of the four fundamental spaces, as shown
in Figure 11.6, it is roughly equivalent to swapping the input and
output spaces. We might expect to see the right inverse in the place
of left inverse as the mapping from the column space back to the row
space. Let’s see how it comes about.
For a full-row-rank, wide matrix, its column space is all of the
output space: C(A) = Rn . (See Figure 11.6.) The row space is a
subspace of the input space: C(AT ) ¢ Rn and there is a null space
N (A), which contains all vectors that are orthogonal to the ones in
the row space.
The mapping A : Rn 7→ Rm is many-to-one: multiple vectors in
the input space map to the same output vector. However, the mapping
from the row space to the column space, A : C(AT ) 7→ C(A), is still
218 Eigenvalue Decomposition and Diagonalization

! * ="×! rank ! = 3
! * =! = all possible ! . * =! = all possible .
!" = $
! 6 =! ÿ ="
$ % = =,
$ %! : dim =B Column Space
Row Space %- ÿ / $%
- ÿ %-./0* / 2#
2"
0#
0"

ï
ÿ/
ï

)
%(- + -#

%
2
%

%0 ÿ 0
0

$%
0 ÿ %-./0* 0

! " :
dim = , 2 ) 0
Null Space %- # ÿ
$
# !"
" = !67
#
# !"
CDEF> $
ï

!67 "
CDEF> : = ÿ =
!
&
#

Fig. 11.6 Recap of the four fundamental spaced defined by a full-row-rank, wide matrix,
showing AReft : C(A) 7→ C(AT ).
−1

one-to-one, for the reasons of uniqueness of linear combinations we

expounded on in the previous case. In other words, given any vector
b ∈ C(A) (which is the same as saying b ∈ Rm because C(A) − Rm ),
there is a unique vector x ∈ C(AT ) such that Ax = b.

Why Many-to-One? We know the answer already: We do not have

enough equations to constrain the system to a single solution. If we
have one equation (a11 x + a12 y = b1 ) in two variables, we have a line
and any point (x, y) in the line will map to the same b1 .
We can also see this geometrically. Let’s take a vector x ∈ C(AT ),
which maps to b ∈ C(A) through Ax = b. Now take another vector
in the input space that is in the null space, N (A). Remembering that
the null space is the orthogonal complement of the row space, we
call this vector x§ ∈ N (A). We know, by the definition of N (A)
as the solution set of Ax = 0, Ax§ = 0. Therefore, Ax + Ax§ =
b + 0 = b or, A(x + x§ ) = b. In other words, the moment we have
one vector x ∈ C(AT ) such that Ax = b, we can add any vector
x§ ∈ N (A) to it and the sum x + x§ will still map to the same b.
One point to note is that the sum x + x§ will always have a norm
greater than or equal to that of x: ∥x + x§ ∥ g ∥x∥ because x and
x§ for two sides of a right-angled triangle and the sum x + x§ is the
Big Recap: The Story So Far 219

hypotenuse. In other words, the unique solution x is the minimum-

norm solution of Ax = b.

How to Find the Unique Part? We actually know the answer to this
one also. When we found the complete solution of Ax = b, where
we have free variables, we wrote it as xp + t1 xs1 + t2 xs2 + · · · , where
xp was the particular solution and xsi were the special solutions (of
which we had as many as the number of free variables). Now we can
see that xp is, in fact, the part of the solution that is in the row space.
Any linear combination of the n − r special solutions (which form a
basis for the null space) would indeed be in the null space, and that
is what is denoted as x§ in Figure 11.6. We do see how all these
different views are coming together beautifully, don’t we?
One way of finding the particular solution is to perform Gaussian
Elimination on the augmented matrix to locate the pivotless columns
which point to the free variables. We then solve the system after
setting the free variables to zero. Now we have r equations and r
unknowns because we have set the n − r free variables to zero. The
unique solution to this system is the particular solution xp .
Thinking geometrically, we first notice that the complete solution
is the sum x + x§ . If we project this sum to the row space C(AT ),
it becomes just x because x§ is orthogonal to C(AT ). When we
did the projection, we wrote the matrix that projects onto C(A) as
−1 T
P = A AT A A . The row space C(AT ) is, as the symbol
indicates, the column space of AT and the matrix that would project
onto it, Pr would be just P , but with A replaced by AT . Thus,
−1
Pr = AT AAT A (where we also made use of the fact that the
transpose of AT is A).
Let’s say we found a solution x′ = x + x§ such that Ax′ = b.
−1
x = Pr x′ = AT AAT Ax′ . But Ax′ = b, which means we
−1
get the minimum-norm solution x = AT AAT b.
Let’s remind ourselves that what the right inverse is. A is a full-
row-rank, wide matrix. Therefore, AT A is full rank and invertible,
−1 −1
which means AAT AAT = I =⇒ A−1 Reft = A
T
AAT .
−1
Now we see that AReft is the mapping that will take any vector
b ∈ C(A) and give us the unique vector x ∈ C(AT ) such that
Ax = b. Beautifully symmetric, isn’t it? This is indicated in
Figure 11.6.
220 Eigenvalue Decomposition and Diagonalization

As we went through these cases of full-rank matrices, square, tall

or wide, we saw how solving the system of linear equations using
Gaussian Elimination is related to the picture of the four fundamental
subspaces, and how the mapping between the row space and the
column space was one-to-one, which could be inverted. We further
saw that the left and right inverses were in fact these mappings from
C(A) to C(AT ). In the case of a full-rank, square matrix, they indeed
reduce to he double-sided inverse, A−1 .

11.9.4 Any General Matrix

! * ="×! rank ! = .
! * =! = all possible ! . * =! = all possible .
!" = $
! 6 =! ÿ ="
$ % : dim = ,
$ %! : dim = , Column Space
Row Space %- ÿ / -ÿ% 5/
2#
2"
0#
0"
ÿ/ ï
ï

) - ÿ % 5(/ +
%(- + -# /# ) %
2
%

%0 ÿ 0
0

5
0ÿ% 0
! "! :
! " : dim = ) 2 +
dim = , 2 + 0 0 ÿ % 5/ Left Null Space
Null Space %- # ÿ # $
$ ! !" ! !"
#
# !" # !"
#
"= !R$
ï
ï

!R: =" ÿ =!
%
!
&
#

!R * =!×" rank !R = . CS103 | Week 13 | Slide 13

Fig. 11.7 The four fundamental spaced defined by a general rank-devicient matrix. Here,
the newly defined pseaudo inverse maps C(A) back to C(AT ).

We have not yet dealt with rank-deficient matrices. Even if the

matrix is rank deficient, the mapping between the row space and the
column space is one-to-one and invertible: If Ax = b, then the
mapping A : C(AT ) 7→ C(A) is invertible such that there is some
pseudo-inverse A+ : C(A) 7→ C(AT ). But we do not yet have all the
tools necessary to unearth this elusive inverse yet. We need to learn
Singular Value Decomposition, which we will go through in the very
last chapter of this book. Here, let’s do a quick preview, which may
be useful.
Big Recap: The Story So Far 221

In this chapter, we learned that eigenanalysis is possible only for

square matrices. It works best for symmetric matrices that are di-
agonalizable. The “thin” version of Singular Value Decomposition,
which is closely related to the eigenanalysis of AT A or AAT , works
for all matrices because these products are symmetric. What it gives
is, in fact, a decomposition of the form:
SVD
A ∈ Rm×n −−→ A = Û Σ̂V̂ T
Û ∈ Rm×r , Σ̂ ∈ Rr×r , V̂ ∈ Rr×n

Σ̂ is a diagonal matrix, with positive values along the main diagonal,

starting from σ1 all the way to σr , where r = rank(A), with the rest
of the elements all zero.
In terms of the four fundamental spaces, what SVD is telling
us is that there is a special orthonormal basis in the row space,
{v1 , v1 , · · · , vr }, that maps to an orthonormal basis in the column
space {u1 , u1 , · · · , ur }. The very fact that there are such bases is
already remarkable. The fact that these bases can be unearthed is
even more so. We will later see how the bases matrices Û and V̂ as
well as the matrix of singular values Σ̂. For our purposes here, let’s
note that following:
1
0 ··· 0
  
σ1 0 · · · 0 σ1
 0 σ2 · · · ...  −1
 0 1 · · · ... 
Σ̂ =  =⇒ Σ̂ =  . σ.2
 ... .. .. 
..
 
.. 
.  .. .. ..
. . . .
0 0 · · · σr 0 0 · · · σ1r

With this, we now define the pseudo-inverse A+ = V̂ Σ̂−1 Û T . Note

that A+ ∈ Rn×m .
If we knew that AA+ A = A, we could do the following:

AA+ Ax = Ax = b
AA+ (Ax) = Ax = b
AA+ b = Ax = b

Since we already have Ax = b, we can say (using the arguments

about the uniqueness of linear combinations of linearly independent
vectors again) that A+ b = x, which is the inverse transformation
from C(A) to C(AT ). We have to verify that AA+ A = A, which
222 Eigenvalue Decomposition and Diagonalization

is a property of the pseudo-inverse. Furthermore, A+ reduces to

A−1 , A−1 −1
Left or AReft for full-rank square, full-column-rank and full-
row-rank matrices respectively. As we defined it here, we do not have
enough knowledge to do prove it yet, but we will prove it in the last
chapter, considering the full SVD (rather than the “thin” version we
used in our definition of the pseudo-inverse A+ above).
The only thing left to do now is to look at the picture in Figure 11.7
and marvel at its beauty, elegance and completeness. In some sense,
we have come full circle. We can now appreciate how the algebraic
notion of solving equations (using Gauss-Jordan elimination, for in-
stance) is related to the geometric view of four fundamental spaces,
and the more abstract ideas of mappings and their inversions.

Get the Full Edition of LA4CS with

Summaries, Exercises and Solutions
Only $7.95. Scan, Click or Tap to buy.
12
Special Matrices,
Similarity and Algorithms

The surest way to corrupt a youth is to instruct him to hold

in higher esteem those who think alike than those who
think differently.
—Friedrich Nietzsche

In the last chapter, we saw how the properties of eigenvalues and

eigenvectors are related to the characteristics of the matrices. We
looked at real, symmetric matrices and studied their eigen-analysis
to some extent. In this chapter, we will look at some other special
matrices and their eigenvalues and vectors. Let’s start with another
look at symmetric matrices.

12.1 Real, Symmetric Matrices

Real, symmetric matrices (A ∈ Rn×n , A = AT ) have real eigenval-

ues and orthogonal eigenvectors, as we proved in some painstaking
detail in the previous chapter (Item §3, page 198). Here are two
theorems on such matrices.
224 Special Matrices, Similarity and Algorithms

12.1.1 Spectral Theorem

As we (almost) saw, a real symmetric matrix has real, orthonormal

(which people may call orthogonal at times) eigenvectors (restricting
ourselves to R again). Using the symbol Q for orthonormal matri-
ces, and knowing that they are invertible, with the inverse being the
transpose, we can state our eigenvalue decomposition as:
A = SΛS −1 ⇐⇒ A = QΛQ−1 = QΛQT
Knowing that Q is a matrix with columns qi (which we may write as
Q = [qi ]), we can expand the product to read:
n
X
A= λi qi qiT (12.1)
1

This expansion is possible because Λ has only diagonal entries λi

and qi § qj =⇒ qiT qj = 0 if i ̸= j. Looking at each term in the
expansion, we can make the following remarks:
• Each term has a projection matrix qi qiT .
• It is a projection to the eigenspace of the ith eigenvector.
• Each eigenspace is one-dimensional, being the span of just one
vector qi .
• Consequently, each projection matrix is a rank-one matrix.
Eqn (12.1) is the spectral theorem, stating that any real, symmetric
matrix can be decomposed as a sum of projection matrices of rank
one, scaled by the eigenvalue. The eigenvalue is considered the
spectrum. Each component in the term is akin to a pure component
of the matrix, much like white light has a spectrum of pure primary
colors.

12.1.2 Sylvester’s Law of Inertia

Now that we are listing theorems, we have another one that goes by
the physics-inspired name, the Law of Inertia (attributed to James
Joseph Sylvester, not the brawny movie star). We stated it earlier:
For real, symmetric matrices, the number of positive eigenvalues is
the same as the number of positive pivots. Similarly, the negative
Hermitian Matrices 225

pivots and negative eigenvalues are equal in number. The proof of

this theorem (which we will not attempt) involves the concept of
similarity of matrices, which is our topic later in this chapter. We
will, however, state the theorem more formally.
We have a symmetric matrix A ∈ Rn×n so that AT = A, with
Asi = λi si . The Reduced Echelon Form (REF) of A is R, with the
pivot in the ith row is ri . Then, Sylvester’s Law of Inertia states that:

Count(λi > 0) = Count(ri > 0)

Count(λi = 0) = Count(ri = 0)
Count(λi < 0) = Count(ri < 0)

Note that the REF in this law is the result of Gaussian elimination,
done with no scaling of the rows. It is not the RREF from Gauss-
Jordan elimination, which makes all pivots one by scaling rows.

12.2 Hermitian Matrices

Some of the properties of real, symmetric matrices we stated so far

apply also to complex matrices, with one important caveat: We need
to redefine what “transpose” means for A ∈ Cn×n . Let’s expand our
field to complex numbers and see how the properties and their proofs
holds up. But before doing that, we have to state what “transpose”
means in C.

12.2.1 Conjugate Transpose and General Symmetry

Earlier, we defined the (Euclidean) norm of x ∈ Rn as ∥x∥2 = xT x.

The norm stands for the length or the size of the vector, and we
would like it to be real (and positive). In other words, we would like
∥x∥ ∈ R even if x ∈ Cn .
When xi is complex (in the form a + ib with b ̸= 0), x2i is not
real, and xT x is not real. What is always real and positive is (a +
ib)(a − ib) = a2 + b2 . a − ib is the complex conjugate of a + ib,
which we write as (a + ib)∗ = a − ib. So we would like to have x∗i xi
in ∥x∥2 rather than x2i , which means the right definition of the norm
of x ∈ Cn is ∥x∥2 = x∗T x.
226 Special Matrices, Similarity and Algorithms

x∗T is the conjugate transpose, also known as Hermitian transpose,

of x, which we will write1 as x = x∗T . Note that it is a generalization
of xT : For x ∈ R, x = xT . We may come across yet another
notation in some texts for the Hermitian transpose as xH .
Similarly, we can define the complex conjugate transpose (AKA
Hermitian transpose) of a matrix A ∈ Cn×n as its generalized trans-
pose, A . When A = A, we have the generalized version of
symmetry, and we call such matrices Hermitian.
Note that the product rule of transposes applies to the Hermitian
transposes as well: (AB) = B A .
Earlier, while discussing orthogonality, we stated that the inverse of
an orthonormal matrix (in Rn×n ) is its transpose. For A ∈ Cn×n , the
condition would be A = A−1 , and we call such matrices unitary.
When we look up information we may come across “unitary” used as a
synonym for “orthonormal,” which is technically correct because the
set of orthonormal matrices is a subset of the set of unitary matrices,
just as R ¢ C.

12.3 Eigen Properties of Hermitian Matrices

With this definition of generalized symmetry of matrices, we are ready

to restate some of the properties of eigenvalues and eigenvectors we
listed for real, symmetric matrices, and extend them to Hermitian
matrices. Note, however, that the field of complex numbers, C, is
not critical for our use in computer science, except when we look
for information on the internet, for instance, we may come across
terminology and explanations stated in terms of Hermitian, unitary,
complex conjugates etc. Such usage is common in the research
literature as well.
1. The eigenvalues of Hermitian matrices are real:
A ∈ Cn×n , A = A, As = λs =⇒ λ ∈ R
Proof :

1
Some people write x∗ to mean both conjugation on top of transposition. For this reason,
a less confusing notation for conjugate by itself may be an overline a + ib = a − ib, but it
may lead to another contextual confusion: Are we underlining the line above or overlining
the variable below? Good or bad, we are going to stick with ∗ for complex conjugate and †
for conjugate transpose.
Eigen Properties of Hermitian Matrices 227

(1) By the definition of eigenvalues: As = λs

(2) Multiplying on the left with s : s As = λs s
(3) Product rule of Hermitian transposes: (A s) s = λs s
(4) Since A is Hermitian: (As) s = λs s
(5) Using (1): (λs) s = λs s
(6) Since λ is a scalar: λ∗ s s = λs s
∗
(7) Since s s is never zero: λ =λ

Step (7) says λ is real. Since all eigenvalues are real, the Λ
matrix (with eigenvalues in the diagonal) is Hermitian as well.
In fact, the proof we gave in §3 (page 198) holds for Hermitian
matrices as well, with minor changes. We , however, provided
a brand-new proof, now that we are in the happy position of
being able to do the same thing in multiple ways.

2. The eigenvectors of Hermitian matrices are orthogonal:

A ∈ Cn×n , A = A, Asi = λi si i ̸= j =⇒ si § sj

Note that the eigenvectors of complex matrix are, in general,

complex: si , sj ∈ Cn . The proof is pretty much identical to
the one given in the case of real, symmetric matrices (§2, page
200), but with minor changes to call transposes Hermitians.
Here is another, higher level, matrix-algebra way of looking at
it:

(1) Eigenvector matrix: A = SΛS −1

(2) Taking the Hermitian transpose: A = S −1 Λ S
(3) Since A and Λ are Hermitian: A = S −1 ΛS
(4) Equating the RHS of (1) and (3): SΛS −1 = S −1 ΛS

One way this can be true is if S = S −1 for S ∈ Cn×n , which

is the same as saying S T = S −1 for S ∈ Rn×n . If the transpose
of a matrix is its inverse, then the matrix is orthonormal. We
have not actually proven it because we are not sure at this point
228 Special Matrices, Similarity and Algorithms

whether the only way step (4) can be true is if S = S −1 . Of

course it is, but we cannot yet see why.

12.4 Markov Matrices

A square matrix is called a Markov matrix if all its entries are non-
negative and the sum of each column vector is equal to one. It is
also known as a left stochastic matrix2 . “Stochastic” by the way is a
fancy word meaning probabilistic. Markov matrices are also known
as probability/transition/stochastic matrices.

Markov Matrix
Definition: A = aij ∈ Rn×n is a Markov matrix if
n
X
0 f aij f 1 and aij = 1
i=1

Markov matrices usually describe the transition probabilities between

states in a stochastic mathematical model.

12.4.1 Properties of Markov Matrices

All the following properties of Markov matrices follow from the fact
that the columns add up to one. In other words, if we were to add up
all the rows of a Markov matrix, we would get a row of ones because
each column adds up to one.
1. Markov matrices have one eigenvalue equal to one.
2. The product of two Markov Matrices is another Markov matrix.
3. All eigenvalues of a Markov matrix are less than or equal to
one, in absolute value: |λi | f 1.
Let’s try proving these properties one by one. Since its columns add
up to one, a Markov matrix always has one eigenvalue equal to one.

2
The right stochastic matrix, on the other hand, would be one in which the rows add up to
one. Since our vectors are all column vectors, it is the left stochastic matrix that we will
focus on. But we should keep in mind that the rows of a matrix are, at times, considered
“row vectors.” For such a row vector, a matrix would multiply it on the right and we can
think of the so-called right-eigenvectors.
Markov Matrices 229

Proof :
n
X
(1) Since A is a Markov matrix: aij =1
i=1
n
X
(2) Therefore: ajj =1− aij
i=1,i̸=j
Xn
(3) The diagonal element in A − I: (A − I)jj = 1 − aij
i=1,i̸=j

(4) Sum of columns in A − I: (A − I)jj = 0

(5) Therefore: |A − I| = 0

(5) above says that |A − λI| = 0 with λ = 1, which means one is an

eigenvalue of any Markov matrix. All other eigenvalues are less than
one in absolute value,
Let’s prove it again using a slightly more sophisticated technique.
Let’s construct a column vector of all ones, and call it u ∈ Rn .
Taking the sum of the rows of the Markov matrix A is the same
as multiplying on the left with uT (which is a single-row matrix of
all ones), and we know that the product uT A = uT because the
columns of A add up to one. Taking the transpose, AT u = u, which
says that AT has an eigenvalue of one (with u as the eigenvector,
which is not important for us). Now, the eigenvalues of a matrix
and its transpose are the same (but the eigenvectors may be different)
because the characteristic polynomial, being a determinant, does not
change when we take the transpose. So we can see that A has at least
one eigenvalue equal to one.
The second property is that the product of two Markov matrices is
another Markov matrix. If A and B are Markov, the AB cannot have
any negative entries because matrix multiplication, being addition
and multiplication of elements, cannot introduce a negative sign.
We also know that the each one of the columns of A and B adds
up to one, which means uT A = uT and uT B = uT . Consider
uT AB = uT B = uT . Therefore the columns of AB also add up
to one, or the product of two Markov matrices is another Markov
matrix.
The third property is that all the absolute values of the eigenvalues
of a Markov matrix are less than one: |λi | f 1, which follows from the
product property. An is a Markov matrix. If one of the eigenvalues
230 Special Matrices, Similarity and Algorithms

was more than one, then Ak , k → ∞ could not be a Markov matrix

because its elements would have to be growing exponentially.

12.4.2 Steady State

In order to make this cryptic proof of the third property a bit more
accessible, let’s work out an example. Let’s say we are studying
the human migration patterns across the globe, and know the yearly
migration probabilities as in Table 12.1 through some unspecified
demographic studies. We know nothing else, except perhaps that the
birth and death rate are close enough to each other for us to assume
that they add up to zero everywhere. One reasonable question to
ask would be about the steady state: If we wait long enough, do the
populations stabilize?
Note that the numbers in each column add up to 100% because
people either stay or leave. The numbers in each row, on the other
hand, do not. Asia-Pacific and Africa lose people to the Americas
and Europe.
Once we have probabilities like Table 12.1, the first thing to do
would be to put the values in matrices, now that we know enough
Linear Algebra.
   
0.80 0.04 0.05 0.05 4.68
0.10 0.90 0.07 0.08
 x0 = 1.20 xk+1 = Axk
 
A= 0.03 0.01 0.75 0.02 1.34
0.07 0.05 0.13 0.85 0.75
where we put the initial populations in a vector x0 . As we can see,
A is a Markov matrix. It describes how the population evolves over
time. The populations for year k evolve to that of year k + 1 as Axk ,
which is identical to what we did in the case of Fibonacci numbers
in §11.7 (page 208). As we learned there, the long-term evolution
of the populations is fully described by the eigenvalues λi of A. If
|λi | > 1, we will have a growing system, if |λi | < 1, we will have
system tending to zero. If we have an eigenvalue |λi | = 1, we will
have as steady state. And we know that A does have a eigenvalue
equal to one.
Knowing that x = xk will stabilize and reach an equilibrium value,
we can implement an iterative method to compute it: First, initialize
it x ← x0 Then iterate until convergence: x ← Ax. Doing all this
Markov Matrices 231

Table 12.1 Migration probabilities

Destination ↓ Source → Asia-Pacific Americas Africa Europe
Asia-Pacific 80% 4% 5% 5%
Americas 10% 90% 7% 8%
Africa 3% 1% 75% 2%
Europe 7% 5% 13% 85%
Population (billions) 4.68 1.20 1.34 0.75

in numeric a program, we get:

xT = 0.7262 1.844 0.2549 1.175

The eigen way of doing it would be to find the eigenbasis, find

their linear combination to form x0 :
n
X
x0 = c i si
i=1

and then say that (with λ1 = 1 and all other |λi | < 1):
n
X n
X n
X
k k
xk = A c i si = c i A si = ci λki si = c1 λk1 s1 = c1 s1
i=1 i=1 i=1

Therefore, we need to know only c1 . But it still looks like a pain

to compute because, in order to find c1 , we have to find all other ci ,
which means we have to find all the eigenvectors.
Luckily for us, there is a shortcut. As x evolves from x0 to xk , the
sum of its components is the total population, which we are assuming
to be a constant. Therefore, in xk = c1 s1 , the components of xk and
c1 s1 should add up to the same number, which is the total population.
In other words, P
x0i
c1 = P
s1i
where the summation is over the components of x0 and s1 . So we
only need to know one eigenvector, s1 corresponding to the dominant
eigenvalue, λ = 1 in order to compute the limiting value of x = xk
as k → ∞. As expected, this computation will also yield the same
answer:
xT = 0.7262 1.844 0.2549 1.175

232 Special Matrices, Similarity and Algorithms

12.4.3 Portal to a Big Field

Markov matrices are the starting point of the associated topics in

mathematical modeling, some of which are heavily used in machine
learning and data mining. For instance, Hidden Markov Models are
used in text mining for the so-called part-of-speech tagging. In our
short introduction here, we focused only on the basics, ignoring many
subtleties. A quick flip through the pages of this book will reveal that
permutation matrices are, in fact, Markov matrices. What are their
steady states? They have none; they have oscillating solutions. Other,
more complicated matrices may have multiple steady states among
which the solutions oscillate.
The Google Page Rank algorithm is a multi-billion dollar success
story of Markov matrices, which, as we see in the box, does not take
much more than our discussion here to fully understand.

12.5 Positive Definite Matrices

We saw that real, symmetric matrices were “good” matrices to work

with because they have real eigenvalues and orthogonal eigenvectors.
Positive definite matrices are even better.

Positive Definite Matrix

Definition: A real, symmetric matrix A ∈ Rn×n is positive definite
if all it eigenvalues are positive (λi > 0).
If an eigenvalue is equal to zero (which means A is singular), then
we call the matrix positive semidefinite. In other words, if we only
have λi g 0, then all we can say is that A is positive semidefinite.
Similarly, we can define negative definite and negative semidefinite
matrices, although they have dubious mathematical relevance.
We can extend the definition to A ∈ Cn×n using our definition of
Hermitian transpose, and most of what we learn here will apply to
them as well. However, for our purpose in computer science, we can
safely restrict ourselves to the real domain.

12.5.1 Test for Positive Definiteness

1. λi > 0? This test follows directly from our definition of

positive definite matrices.
Positive Definite Matrices 233

2. All the pivots > 0? The second test is essentially the same as
the first, by Sylvester’s law connecting the signs of pivots and
eigenvalues.

3. xT Ax > 0? For a positive definite A and for any nonzero

vector x, xT Ax > 0. Test (3) is a powerful one. In fact,
some textbooks define positive definiteness using this state-
ment. Let’s prove that it is equivalent to Test (1), which is our
definition of positive definiteness.
Proof : xT Ax > 0 =⇒ λi > 0

(1) Consider an eigenvector of A: As = λs

T
(2) Left-multiplying with s : s As = λsT s
T

sT As
(3) Rearranging: λ =
sT s
(4) Since xT Ax > 0 for any x and sT s > 0: λ > 0

Proof : λi > 0 =⇒ xT Ax > 0

(1) For a real symmetric A: A = QΛQT

(2) Multiply with xT and x: xT Ax = xT QΛQT x
(3) Calling y = Qx: xT Ax = y T Λy
X
(4) Since Λ is diagonal [λi ]: xT Ax = λi yi2
(5) Since λi > 0: xT Ax > 0

4. Do all the leading submatrices have |Ak | > 0? For A ∈ Rn×n ,

the leading submatrix (AKA upperleft submatrix) of dimension
k < n, Ak ∈ Rk×k , is a block matrix of the first k rows and
columns of A. For positive definite A, the determinants of all
such Ak are greater than zero.
a11 a21
 
a21 a22 Ak 
 . .. 
A=
 .
. ··· .  (12.2)
··· ···

ak1 ak2 akk 
.. ..
. ··· .
234 Special Matrices, Similarity and Algorithms

Eqn (12.2) illustrates the definition of a leading submatrix Ak .

Proof : Since A is positive definite, we know that xT Ax > 0
for any x.
Let’s consider an x[k,0] which has nonzero entries in the first k
elements, the rest being zero. xT Ax = xT[k,0] Ak x[k,0] , has to
be greater than zero.
Then, by Test (3), Ak is positive definite, all its eigenvalues are
positive, and its determinant, being the product of its eigenval-
ues, has to be positive.

5. Can we write B = AT A? For a positive definite matrix B, we

can always find an invertible matrix A such that B = AT A.
Although not usually used as a test, this property is an important
one, and our last topic in the last chapter depends on it. Let’s
therefore prove it here.
Proof : B = AT A =⇒ B is positive definite.

xT Bx = xT AT Ax = (Ax)T Ax = ∥Ax∥2 > 0

Since A is invertible, Ax ̸= 0 unless x = 0.
Proof : B is positive definite =⇒ B = AT A for some
invertible A.

(1) Since B is positive definite: B = QΛQT

1
h√ i 1 1
(2) Defining Λ 2 = λ : B = QΛ 2 Λ 2 QT
√ 1
1
T
(Matrix with λi in the diagonal) = QΛ 2 QΛ 2
1
(3) Defining AT = QΛ 2 : B = AT A

1
Since B is positive definite, its eigenvalues are positive, and Λ 2
is invertible. So is Q because Q−1 = QT . So A is invertible.

12.5.2 Applying the Tests

Let’s see how we can apply these five tests to various matrices to
determine whether they are positive definite.

• If A is positive definite, is A−1 positive definite as well?

Positive Definite Matrices 235

We can apply Test (1): A has positive λi . The eigenvalues of

A−1 are λ1i which are positive too. Therefore, A−1 is positive
definite.
• If A and B are positive definite, how about A + B?
We apply Test (3) here: xT (A + B)x = xT Ax + xT Bx > 0
because A and B are both positive definite. So is the sum.
• For any A ∈ Rm×n with rank(A) = n (full column rank), is
AT A positive definite?
Test (3) proves useful again: xT AT Ax = (Ax)T Ax =
∥Ax∥2 g 0. It is equal to zero only if Ax = 0 has nonzero so-
lutions, which means A has null space and A is rank-deficient.
Since we started with the assumption that A is full rank,
xT AT Ax > 0 and AT A is positive definite.
• For a positive definite A, is M −1 AM positive definite for any
invertible M ?
The answer is yes, and the reason is that A and M −1 AM are
similar matrices, and they have the same eigenvalues. We shall
soon define matrix similarity and prove that similar matrices
have the same eigenvalues.

12.5.3 Quadratic Forms

One of the tests for positive definiteness, namely xT Ax > 0, which

we called Test (3), leads to the big topic of quadratic forms. Although
not directly relevant to computer science, we will introduce the topic
of quadratic forms here, and list some of its general properties, so
that we may recognize it if we happen to come across it in some
mathematically oriented literature later on.
For a real, symmetric matrix, we can expand xT Ax in R2 as below:

a b x
2×2 T
A ∈ R , A = A =⇒ A = and x = 1
b c x2
(12.3)
T
a b x1 2 2
x Ax = x1 x2 = ax1 + 2bx1 x2 + cx2
b c x2

As we can see, the product xT Ax is a pure quadratic form, with

no linear or constant terms in it. Furthermore, as in Test (4), if the
236 Special Matrices, Similarity and Algorithms

determinants of the leading submatrices of A are positive, we have:

a > 0 and ac − b2 > 0

Let’s take an example and see how a > 0 and |A| > 0 implies that
T
x Ax > 0 for all x.

2 6
A= and |A| = 2c − 36 > 0 if c > 18.
6 c

Let’s set c = 20 =⇒

xT Ax = 2x21 + 12x1 x2 + 20x22 = 2(x21 + 6x1 x2 + 9x22 ) + 2x22

xT Ax = 2(x1 + 3x2 )2 + 2x22 > 0

As we can see, if c < 18, the last term in xT Ax above becomes

negative, and the whole expression would be negative for some value
of x1 and x2 (both equal to zero, for instance). But if c > 18, the
expression, being the sum of two squares, can never be negative. And
hence, A is positive definite. If c = 18, A is positive semidefinite.
Note that in xT Ax = 2(x1 +3x2 )2 +2x22 , the 2 multiplying the first
term is a in A, which is the first pivot. The multiplier of the second
term, 2, is the second pivot. Inside the parentheses of the first term,
the factor multiplying x2 , namely 3, is the multiplier of the first row in
Gaussian elimination to subtract from the second row. The quadratic
form, as we can see, is intrinsically connected to pivots and Gaussian
elimination. Our discussion with this little R2×2 generalizes to Rn×n ,
which has interesting mathematical, if not numeric or algorithmic,
applications.
This connection between row operations in Gaussian elimination
and completing the squares in the quadratic form becomes even
clearer if we work with the general matrix A in Eqn (12.3).

a b r2 ←− ab r1 +r2 a b
A= −−−−−−−−→ 2
b c 0 c − ba
2
b2 b2

b
a x1 + x2 = ax21 + 2bx1 x2 + x22 = xT Ax − cx22 + x22
a a a
2
b2

b
=⇒ xT Ax = a x1 + x2 + c − x22
a a
Gram Matrix 237

We can see the pivots and row multipliers in the expression for
(square-completed) xT Ax in the last line above, can’t we? Fur-
thermore, xT Ax > 0 if:
b2 2 b2
cx22 > x or when c > =⇒ ac − b2 = |A| > 0
a 2 a

12.5.4 Positive Definiteness and Symmetry

In our definition of positive definite matrix, we restricted ourselves

to symmetric ones. Our basic definition indeed made it abundantly
clear, by stating, “A real, symmetric matrix. . . ”. We should note that
some people relax this restriction, and consider matrices that are not
symmetric also positive definite if they pass the tests listed. While
we will stick with our definition because it makes sense for our use
in computer science, purely mathematical work may consider more
general definitions, and indeed more general fields like C, of even
unspecified ones.

12.6 Gram Matrix

For A ∈ Rm×n , rank(A) = r, its Gram matrix is defined3 as AT A

and has important applications in machine learning. It also has some
interesting properties that we exploit at various points in this book.
Before getting into the properties of Gram matrices, let’s revisit
some basic properties of ranks. The first one is that the row-rank
and the column-rank of a matrix are the same. Earlier, we defined
the rank as the number of pivots and also as the number of linearly
independent vectors in its row (or column) space. The number of
independent vectors is indeed the dimension of the subspace. To
prove the equality of row and column ranks, we will use the latter
definition.
We know that elementary row operations do not change the row
space of a matrix because they involve linear combinations of the
rows. Therefore, A has the same row rank as its RREF because they
both have the same row space. As we saw earlier, the shape of RREF

3
The author is not quite sure if AAT also considered a Gram matrix, although it should be,
by symmetry.
238 Special Matrices, Similarity and Algorithms

is, in its most general case, as follows:

" #
RREF Ir ⊕ Fr×(n−r)
A −−−→ = R1
0(m−r)×n

where · indicates that the I and F matrices have their columns

mixed in, not cleanly separated. Clearly, the dimension of the row
space of A is r, which is the row rank of both A and R1 .
We can go further and get rid of the columns of F using elementary
column operations, which do not change the column space. Starting
from R1 , the RREF of A, we can perform a series of elementary
column operations, without affecting its column space, to reduce the
matrix to the following form:
" # " #
Ir · Fr×(n−r) Col. Ops Ir 0r×(n−r)
RREF = −−−−−→ = R2
0(m−r)×n 0(m−r)×n
This form, obviously, has the same row and column rank of r, and
since our elementary operations changed neither, we can say that the
row rank of R1 is the same as the column rank of R2 , which is r.
This claim propagates to the original A, and it has the same row and
column ranks. Therefore, row rank of any matrix is the same as its
column rank. It should be immediately obvious from this statement
that rank(A) = rank AT .
Getting back to AT A, we can see that every vector x ∈ N (A) is
also in the null space of AT A:

Ax = 0 =⇒ AT Ax = 0 =⇒ x ∈ N (AT A) =⇒ N (A) ¦ N (AT A)

There may be more vectors in N (AT A), which is why we only

claim N (A) ¦ N (AT A). Now, coming at it from the other end, if
x ∈ N (AT A):

AT Ax = 0 =⇒ xT AT Ax = 0, (Ax)T (Ax) = 0 =⇒ Ax = 0

Every vector in N (AT A) is also in N (A). Both these conditions

can happen only if N (A) = N (AT A), which means A and AT A
have the same nullity.
Since A ∈ Rm×n , its domain is Rn . And since AT A ∈ Rn×n , its
domain also is Rn . Therefore, by the rank-nullity theorem, we have
Matrix Similarity 239

rank(A) = rank AT A because

rank(A) = dim(Rn ) − nullity(A)

rank AT A = dim(Rn ) − nullity(AT A)

And nullity(A) = nullity(AT A) =⇒ rank(A) = rank AT A

It is also easy enough to prove that A and AT A have the same row
space, looking at the product AT A as the linear combinations of the
rows of A and as the linear combinations of the columns of AT . At
this stage in our Linear Algebra journey, we are in the happy position
of being able to prove a lemma in multiple ways, and we can afford
to pick and choose.
Summarizing, A and AT A have the same row and null spaces.
AT and AAT have the same column space and left-null space. All
four of them have the same rank.
For a full-column-rank matrix (A ∈ Rm×n , rank(A) = n), the
Gram matrix AT A is a full-rank, square matrix with the same rank. It
is also much smaller. We can, therefore test the linear independence
of the n column vectors in A by looking at the invertibility (or,
equivalently, the determinant) of the Gram matrix. In data science,
as we shall see in the last chapter of this book, the Gram matrix is the
covariance matrix of a zero-centered data set.

12.7 Matrix Similarity

We consider a matrix A similar to another one B if we can write A =

M −1 BM for some invertible matrix M . We denote similarity as
A ∼ B. As we can immediately see, A is similar to Λ, its eigenvalue
matrix, when we have a full set of eigenvectors (and therefore the
eigenvector matrix S is invertible) because then A = SΛS −1 . Note
that we are talking about any square matrix A now, not necessarily
symmetric matrices.
While this definition of matrix similarity may look strange at first,
there are good reasons behind it: Similar matrices share several key
properties. Here is a list, with proof wherever necessary:

1. For invertible matrices A and B, AB ∼ BA.

Proof :
240 Special Matrices, Similarity and Algorithms

AB ∼ M ABM −1 ∼ BABB −1 ∼ BA with M = B

2. If A ∼ B, they have the same characteristic polynomials.

Proof : Let’s use pA (λ) to denote the Characteristic Polynomial
of A. Since A ∼ B, A = M −1 BM .
(1) Characteristic Polynomial pA (λ) = |A − λI|
−1
(2) Since A = M BM : pA (λ) = M −1 BM − λI
(3) Since M −1 M = I: pA (λ) = M −1 BM − M −1 λIM
(4) Factorizing M and M −1 : pA (λ) = M −1 (B − λI)M
(5) Since |AB| = |A| |B|: pA (λ) = M −1 |B − λI| |M |
(6) Since |A| A−1 = |I| = 1: pA (λ) = |B − λI| = pB (λ)

Step (6) above shows that both A and the similar matrix B have
the same characteristic polynomial. Note that, as a consequence
of the characteristic polynomials being the same, the algebraic
multiplicities of the eigenvalues (how many times each one is
repeated) are also the same for A and B.

3. If A ∼ B, they have the same eigenvalues, but not necessarily

the same eigenvectors.
Proof : We already showed that A and B have the same charac-
teristic polynomial. Therefore they have the same eigenvalues
(which are the roots of the said polynomial). But here is another
proof.

(1) Definition of similarity: A = M −1 BM

(2) Left-multiplying by M : MA = BM
(3) Right-multiplying by s: M As = BM s
(4) Since s is eigenvector of A: M λs = BM s
(5) Commuting λ: λ(M s) = B(M s)
(6) LHS ´ RHS: B(M s) = λ(M s)

Step (6) says that M s is an eigenvector of B with the same

eigenvalue λ, by the definition of eigenvalues and eigenvectors.

4. Combining the previous property with the first one, we can see
that AB and BA have the same eigenvalues.
Matrix Similarity 241

5. If A ∼ B and A is positive definite, so is B

Proof : Determinant is the product of eigenvalues.
6. A ∼ B =⇒ |A| = |B|
Proof : Determinant is the product of eigenvalues.
7. A ∼ B =⇒ trace(A) = trace(B)
Proof : Trace is the sum of eigenvalues.
8. Similar matrices have the same rank.
Proof : The number of (positive and negative) pivots is the same
the number of (positive and negative) eigenvalues, through
Sylvester’s Law of Inertia. And the rank is the number of
pivots.

12.7.1 Equivalence Relation

The properties listed above are the ones that are useful for us in Linear
Algebra. However, more basic than them, similarity as a relation has
some fundamental properties that make it an equivalence relation.
Here is a formal statement of what it means.
For A, B, C ∈ Rn×n , we can easily show that similarity as a
relation is:
Reflexive: A ∼ A
Proof : A = IAI −1
Symmetric: A ∼ B =⇒ B ∼ A
Proof : A = M BM −1 =⇒ B = M −1 AM = M ′ AM ′−1
Transitive: A ∼ B and B ∼ C then A ∼ C
Proof : A = M BM −1 , B = N CN −1
−1
=⇒ A = M N CN −1 M −1 = (M N ) C (M N )
=⇒ A = M ′ CM ′−1

When a relation has these three fundamental properties, it is called an

equivalence relation. And matrix similarity is an equivalence relation.

12.7.2 Diagonalizability

Since A is similar to its eigenvalue matrix Λ, if A is diagonalizable,

so are the matrices similar to it. In fact, this statement would be a
good, albeit incomplete, definition of similarity.
242 Special Matrices, Similarity and Algorithms

Similarity
Definition: A matrix is similar to another matrix if they diagonalize
to the same diagonal matrix.
As we can see, the similarity relation puts diagonalizable matrices
in families. In Rn×n , we have an infinity of such mutually exclusive
families. All matrices with the same set of eigenvalues belong to the
same family.
When we said “the same diagonal matrix” in the definition of
similarity above, we were being slightly imprecise: We should have
specified that shuffling the eigenvalues is okay. We are really looking
for the same set of eigenvalues, regardless of the order.
However, this definition leaves something unspecified: What hap-
pens if a matrix is not diagonalizable? Does it belong to no family?
Is it similar to none? Is it an orphan? It is in this context that the
Jordan Normal Forms come in to help. Since it is an important topic,
we will promote it to a section of its own.

12.8 Jordan Normal Forms

As we remember, in order for a matrix to be diagonalizable, its

eigenvector matrix S needs to be invertible, so that we can go from
AS = ΛS (which is always true) to A = SΛS −1 . For S to be in-
vertible, we need its columns to be linearly independent, which means
we need a full set of eigenvectors. In other words, for A ∈ Rn×n , we
need n linearly independent eigenvectors for it to be diagonalizable.
We also saw (in §11.4.2, page 199) that the eigenvectors corre-
sponding to distinct eigenvalues are linearly independent. We can,
therefore, say that a matrix with no repeated (AKA degenerate) eigen-
values is diagonalizable. The converse is not true though: If a matrix
does have repeated eigenvalues, it does not mean that it cannot be
diagonalized. The identity matrix in Rn , for instance, has the eigen-
value one repeated n times, but is perfectly diagonalizable. In fact,
it is already in the diagonal form. As we shall see shortly, the right
statement about diagonalizability is that if the algebraic multiplicity
of an eigenvalue is greater than its geometric multiplicity, the matrix
cannot be diagonalized.
We looked at such an example earlier, namely the shear matrix,
which was not diagonalizable. A general form of such a shear matrix
Jordan Normal Forms 243

A ∈ R2×2 is shown below. It has a twice-repeated eigenvalue of

λ because the product of the eigenvalues is |A| = λ2 and their
sum is trace(A) = 2λ, which can be easily verified by solving its
characteristic polynomial4 . But it has only one eigenvector s.

λ a 1
A= =⇒ λ1 = λ2 = λ, s = (12.4)
0 λ 0
In fancier language, the eigenvalue λ has an algebraic multiplicity
of two, and a geometric multiplicity of one. It is when the geo-
metric multiplicity is smaller than the algebraic multiplicity that we
have a matrix that cannot be diagonalized. Note that the geometric
multiplicity can never be greater than the algebraic multiplicity.
The closest we can get to a diagonal matrix for this matrix A
in Eqn (12.4) is when we have the repeated eigenvalues along the
diagonal, and a one in the position of a. This matrix is called the
Jordan normal (or canonical) form of A. It is defined as a block
matrix in terms of what are known as Jordan blocks.

Joran Normal Form

Definition: A square matrix (J ) made up of Jordan blocks is called a
Jordan normal form (JNF) of a matrix A if A ∼ J .

Jordan Block
Definition:A Jordan block of size k and value λ, Jk (λ) is a square
matrix with the value λ repeated along its main diagonal and ones
along the superdiagonal with zeros everywhere else. Here are some
examples of Jordan blocks:
 
7 1 0
λ 1
J2 (λ1 ) = 1

J1 (λ) = λ J3 (7) = 0 7 1 (12.5)
0 λ1
0 0 7
As we can see, superdiagonal means the diagonal above the main
diagonal, so to speak. In the matrix J , each eigenvector has a Jordan
block of its own, and one block has only one eigenvalue. If we have
repeated eigenvalues, but linearly independent eigenvectors, we get
multiple Jordan blocks. For instance, for the identity matrix in Rn×n ,

4
The characteristic equation is |A − λ′ I| = (λ − λ′ )2 = 0, with λ′ as the dummy variable
in the polynomial because we already used λ as the diagonal elements of the shear matrix.
244 Special Matrices, Similarity and Algorithms

Table 12.2 Jordan Normal Forms J of A ∈ R2×2 with Asi = λi si

Comments Multiplicities Jordan Examples of
on A Algeb. Geom. Normal Form Similar Matrices
Any A with trace and
Most matrices

1 1, 1 1, 1 λ1 0 |A| equal to sum and
rank(A) = 2 0 λ2 product of λi

λ2 = 0, s2 ∈ N (A)
" #
λ 1 − t 1 t2

2 1, 1 1, 1 λ1 0
rank(A) = 1 0 0 λ1 t1 −t2
t2
1
t1

Repeated λ and s

3 2 2 λ 0 Only J = A
rank(A) = 2 0 λ

λ1 = λ2 = 0
4 s1 , s2 ∈ N (A) 2 2 0 0 Only J = A
rank(A) = 0 0 0

Repeated λ, one s
" #
2λ − t1

5 2 1 λ 1 t2
rank(A) = 2 0 λ −λ2 −2λt1 −t2
t2
1
t1

In the last column, t1 , t2 ∈ R are any numbers that will generate an example of a similar
matrix for the corresponding row. They are constructed such that the sum and product of
the eigenvalues come out right. Note that the Jordan normal forms in all rows except the
fifth one have two Jordan blocks each.

we have n Jordan blocks, J1 (1), each with λ = 1. In the examples

above in Eqn (12.5), the first one corresponds to a good eigenvalue
with an associated eigenvector. The second one has the eigenvalue
repeated twice, but with only one eigenvector, much like our shear
matrix. The third one is for an eigenvalue with algebraic multiplicity
of three and geometric multiplicity of two, to use the right terms.
With these definitions of Jordan normal forms and blocks, we can
state the Jordan’s Theorem.

Jordan’s Theorem
Every square matrix A ∈ Rn×n with k f n linearly independent
eigenvectors si , 1 f i f k and the associated eigenvalues λi , 1 f
i f k, which are not necessarily linearly independent, is similar to a
Jordan matrix J made up of Jordan blocks along its diagonal.
To start with something simple before generalizing and compli-
cating life, let’s look at A ∈ R2×2 with eigenvalues λ1 and λ2 and
the corresponding eigenvectors s1 and s2 . Table 12.2 tabulates the
various possibilities. Let’s go over each of the rows, and generalize
it to from R2×2 to Rn×n .
Jordan Normal Forms 245

1. The first row is the good case, where we have distinct eigenval-
ues and linearly independent eigenvectors. The Jordan Normal
Form (JNF) is the same as Λ. Each λi is in a Jordan block
J1 (λi ) of its own.
2. When one of the two eigenvalues is zero, the matrix is singular,
and one of the eigenvectors is the basis of its null space. Qual-
itatively though, this case is not different from the first row. In
Rn×n , we will have JNF = Λ.
3. When the eigenvalues are repeated (meaning algebraic multi-
plicity is two), but we have two linearly independent eigenvec-
tors. Again, JNF = Λ. However, we have no other similar
matrices: M J M −1 = λM IM −1 = λI = J .
4. This row shows a rank-zero matrix. Although troublesome, it
is also qualitatively similar to the previous row.
5. In the last row, we have the geometric multiplicity smaller than
the algebraic one, and we have a J2 (λ).
In order to show the Jordan form in all its gory detail, we have a
general Jordan matrix made up of a large number of Jordan blocks in
Eqn (12.6) below:
 J1 (λ1 ) 
λ1 J1 (λ1 )
0 λ1 J1 (λ1 ) 
0 0 λ1
 

0 0 0 λ2 1
 
J3 (λ2 ) 
0 0 0 0 λ2 1 
J =0 (12.6)
 
 0 0 0 0 λ2 J1 (λ3 )


0 0 0 0 0 0 λ3 1 
(λ4 )
J2
 
0 0 0 0 0 0 0 λ4 1
0 0 0 0 0 0 0 0 λ4
 

..
0 0 0 0 0 0 0 0 0 .

Each Jordan block is highlighted in a colored box with its label Jk (λ).
Here is what this Jordan matrix J tells us about the original matrix
A (of which J is the Jordan Normal Form):
• The eigenvalues of J and A are the same: λi .
• The algebraic multiplicity of any eigenvalue is the size of the
Jordan blocks associated with it.
• Its geometric multiplicity is the number of Jordan blocks asso-
ciated with it.
246 Special Matrices, Similarity and Algorithms

• A can be diagonalized if and only if each Jordan block in J is

of size one. In other words, J needs to be diagonal for A to be
diagonalizable.

Table 12.3 lists the multiplicities of the eigenvalues shown in

Eqn (12.6). Since J has some elements in the superdiagonal, A
is not diagonalizable. The closest A can get to a diagonal matrix is
indeed its Jordan normal form J .

Table 12.3 Multiplicities of eigenvalues from Joran blocks

Eigenvalue Multiplicities
λi Algebraic Geometric
λ1 3 3
λ2 3 1
λ3 1 1
λ4 2 1

Much like the other topics in this chapter, Jordan canonical form
also is a portal, this time to advanced theoretical explorations in Linear
Algebra, perhaps more relevant to mathematicians than computer
scientists.

12.9 Algorithms

We already learned three named algorithms, namely,

1. Gaussian Elimination, also known as P LU or just LU de-

composition.

2. Gauss-Jordan Elimination, for matrix inversion.

3. Gram-Schmidt Orthonormalization, which gave us QR de-

composition.

They are neatly summarized in Table 8.1 and recapped in the as-
sociated text. We came across one more algorithm earlier in this
chapter (see §12.4.2, page 228), where we computed one eigenvector
corresponding to λ = 1. The method is called the Power Iteration
Algorithm, and can, in fact, find the largest eigenvalue/eigenvector
pair.
Algorithms 247

12.9.1 Power Iteration

We saw power iteration in our Markov matrix, migration patterns,

example. In general, the power iteration algorithm returns the dom-
inant eigenvalue (which is the one with the largest absolute value).
The process is simple: We start with a random initial vector, apply
the matrix to it, renormalize it and iterate.
Input: A ∈ Rn×n
Output: Dominant λ ∈ C
1: Start with a random s
2: repeat
s
3: Normalize it: s ← ∥s∥
4: s ← As
5: until Convergence
6: return s is the eigenvector for the largest λ
Once we have s, we can calculate λ as:
sT As
λ= T (Because As = λs)
s s
which, by the way, is called the Rayleigh quotient.

Limitations The power iteration algorithm has a couple of limita-

5
tions .
1. If we start with a bad guess for the initial vector, the power iter-
ation algorithm may not converge to the dominant eigenvalue.
2. For complex eigenvalues, we have to start with a complex initial
vector. Otherwise, the algorithm may oscillate between the two
conjugate eigenvalue pairs.

12.9.2 QR Algorithm
A general numerical method to compute all eigenvalues and eigen-
vectors of a matrix is the QR algorithm, based on the Gram-Schmidt
process.
Input: A ∈ Rn×n
Output: λi ∈ C
1: repeat
2: Perform the Gram-Schmidt process to get QR

5
From Wiki University.
248 Special Matrices, Similarity and Algorithms

3: We get: A = QR
4: Consider: RQ = Q−1 QRQ = Q−1 AQ
5: =⇒ RQ and A are similar and have the same eigenvalues
6: Therefore, set A ← RQ
7: until Convergence
8: return The diagonal elements of R as the eigenvalues
Convergence is obtained when A becomes close enough to a trian-
gular matrix. At that point, the eigenvalues are its diagonal elements.
Once we have the eigenvalues, we can compute the eigenvectors as
the null space of A − λI using elimination algorithms.

12.9.3 Cholesky Decomposition

A matrix factorization technique that has several applications is the

Cholesky Decomposition, which says that any positive definite matrix
A can be written as the product of a lower triangular matrix and its
transpose. We are, once again, dealing with real matrices, but this
decomposition applies to complex matrices as well.

A = LLT

The algorithm to perform this factorization is written in terms of

block matrices. We first partition A and L into blocks as shown
in Eqn (12.7). We keep the element in the first row, first column
as one block, and partition the rest of each matrix into a column
vector (a21 , l21 ∈ Rn−1 ), its transpose and a smaller matrix like
A22 , L22 ∈ R(n−1)×(n−1) .
   
a11 aT21 l11 0T
   
A=  L= (12.7)
   

a21 A22  l21 L22 

Note that we used the fact that A is symmetric in dividing up the top
row of A as [a11 aT21 ]. Similarly, we could write the first row of L as
[l11 0T ] because it is lower triangular. And, we do not have to worry
about the second factor LT at all because it is just the transpose of L.
" T
# " T
#" T
#
a11 a 21 l 11 0 l 11 l 21
A = LLT =⇒ =
a21 A22 l21 L22 0 LT22
Algorithms 249

Expanding the matrix multiplication A = LLT using the block

matrices, and comparing the corresponding elements in A and LLT :
" # " 2 #
a11 aT21 l11 l11 l21
= T
a21 A22 l11 l21 l21 l21 + L22 LT22
2 √
a11 = l11 =⇒ l11 = a11 (12.8)
a21 a21
a21 = l11 l21 =⇒ l21 = =√
l11 a11
T
A22 = l21 l21 + L22 LT22 =⇒ A22 − l21 l21
T
= L22 LT22

The second and third last lines in Eqn (12.8) tell us the elements of
L. Note that we decide to go with the positive square root for l11 .
The very last line tells us that once we got l11 and l21 , the problem
reduces to computing the Cholesky decomposition of a smaller matrix
A′ = A22 − l21 l21 T
.
Before we write it down as a formal algorithm, the only thing left
to do is to ensure that A′ is positive definite. Otherwise, we are
not allowed to assume that we can find A′ = L′ L′T . In the fourth
test for positive definiteness (on 231), we proved that the upper-left
submatrices of a positive definite matrix were also positive definite. If
we look closely at the proof, we can see that we did not need to confine
ourselves to “upper-left”: Any submatrix sharing the main diagonal
is positive definite. Therefore, if A in our Cholesky factorization is
positive definite, so is A22 . We also saw that sums of positive definite
matrices are positive definite. Extending it to nontrivial (meaning,
nonzero) differences, we can see that A′ = A22 − l21 l21 T
is positive
6
definite .
Looking at Eqn (12.8), we can translate it to an algorithm7 as
below:
Input: A ∈ Rn×n
Output: L ∈ Rn×n
1: Initialize L with zeros
2: repeat
3: Divide A into block-matrices as in Eqn (12.7)
√
4: l11 ← a11

6 T
To be very precise, l21 l21 is a rank-one matrix, and is only positive semidefinite. But the
difference is still positive definite.
7
This description and the algorithm listed are based on the discussion in this video.
250 Special Matrices, Similarity and Algorithms

5: l21 ← al11
21

T
6: A ← A22 − l21 l21
7: until A22 becomes a scalar
8: return The matrix L

Applications of Cholesky Decomposition

1. The LLT decomposition is faster in solving systems of linear

equations than Gauss-Jordan by about a factor of two. The
catch is that it applies only when the coefficient matrix A is
positive semidefinite.
2. In simulation programs, we usually face the problem of having
to work with multivariate normal distribution, where we need to
simulate correlated, normally distributed variables, which are
specified by their covariance matrix (Σ ∈ Rn×n ) with means
(µ ∈ Rn ). The standard way of accomplishing this task is to
draw the required number (n) random variables for a standard
normal distribution (with σ = 1, µ = 0) and multiply this
random vector (x ∈ Rn ) by the lower triangular matrix L and
shift it by µ, where L comes from the Cholesky Decomposition
of Σ = LLT . The required vector of random numbers would
then be Lx + µ, which would have the right covariance matrix
(Σ ∈ Rn×n ) and means (µ ∈ Rn ).

Get the Full Edition of LA4CS with

Summaries, Exercises and Solutions
Only $7.95. Scan, Click or Tap to buy.
13
Singular Value
Decomposition

What is best in mathematics deserves not merely to be

learnt as a task, but to be assimilated as a part of daily
thought, and brought again and again before the mind
with ever-renewed encouragement.
—Bertrand Russell

Singular Value Decomposition (SVD) has come into prominence

in the last couple of decades because of its direct applicability in
algorithms in computer science, especially when dealing with large
volumes of data. It finds applications in, for instance, data com-
pression, dimensionality reduction, principal component analysis etc.
From a mathematical perspective, SVD is a topic that embodies pure
elegance. And it brings a large part of what we discussed in this
book (eigen analysis, vector spaces and bases, orthogonality, the four
fundamental subspaces etc.) together in one cohesive unity. In many
more ways than one, therefore, Singular Value Decomposition is ap-
propriate as a closing chapter for an introductory book on Linear
Algebra.
252 Singular Value Decomposition

13.1 What SVD Does

In eigenvalue decomposition (EVD), we took a square matrix A, and

wrote it as:
AS = SΛ
where Λ is a matrix with eigenvalues in the diagonal, S is the matrix
where we have the eigenvectors as columns. For real, symmetric
matrices, we saw that S was orthogonal, whose transpose is the same
as its inverse, and we wrote:
A = QΛQT
Since an orthogonal matrix is a rotation, we can interpret this decom-
position geometrically. In Ax = QΛQT x, what A does to x (or,
how it transforms x) is the same as a rotation (QT ), followed by a
scaling (Λ) and then another rotation (Q) by the same angle(s) in the
opposite direction.
This interpretation of eigenvalue decomposition (EVD), which we
called the Spectral Theorem, works only for real, symmetric matrices.
In fact, the computation of eigenvalue and eigenvectors can be done
only for a square matrices. Given that the data matrices that we deal
with are far from square ones, these restrictions of EVD severely
restrict its applicability to computer science. What SVD does is to
decompose any matrix into a rotation, followed by a scaling, followed
by another rotation, viewing A ∈ Rm×n as a transformation of x,
taking it from Rn to Rm .
To reiterate, when we apply SVD, A ∈ Rm×n is any matrix. It
does not have to be symmetric, not even square. Although it does not
need to be real either, we will again worry only about real matrices.
To assign symbols to these verbose statements:
SVD
A ∈ Rm×n −−→ A = U ΣV T
(13.1)
U ∈ Rm×m , Σ ∈ Rm×n , V ∈ Rn×n
Σ is a diagonal matrix (or as diagonal as a non-square matrix can
be), with positive values along the main diagonal, starting from σ1
all the way to σr , where r = min(m, n, rank(A)), with the rest of the
elements all zero. Note also that the action of a diagonal matrix [σi ]
on a matrix X multiplying it on the right is to scale the ith row of X
by σi .
What SVD Does 253

1.25 "
0
#" =
1
1
0
####
" = 34 0.75
2
!: SHEAR MATRIX
0.5
1 //
!! = § !"""
! =
0
0 21 0.25

0 0 !
22
! = § !""" = 20.75
21.75 # 21.5 1 21.25 # 21 // 20.5 20.25 0 0.25 0.5 0.75 1 2
0 11.25 1.5 1.75
#! =
// 20.25 0
0 0
)=
//
21 0 20.5

20.75

21 34
####
! = 2
21
21.25

Fig. 13.1 The shear matrix under SVD analysis.

Before learning how SVD does its magic, let’s take a look at an
example. For ease of visualization, we will work with a square matrix
so that we can draw the vectors and their transformations in R2 .
" √3 #
2
0 SVD
A= √ −−→ A = U ΣV T
−1 23
" √ # "3 # " √3 #
1 3 1
2 2 2
0 2 2
U= √ Σ= 1
V = √
− 3 1 0 2 − 21 3
2 2 2

A few points to note about these matrices:

• We do not yet know how we got the decomposition A =
U ΣV T , but we can verify its validity by multiplying. In fact,
what we did in coming up with the “decomposition” above was
to start from the U, Σ and V matrices and take their product
to get A as A = U ΣV T .
• U is indeed a rotation, as described in Figure 7.4, with the
angle of rotation θ = − π3 , a clockwise rotation through 60°.
• Similarly, V is a rotation θ = − π6 , or clockwise 30°. Therefore,
V T (being the same as V −1 ) is an anti-clockwise rotation of
30°.
254 Singular Value Decomposition

• Most importantly, as shown in Figure 13.1, the matrix A is a

vertical shear matrix: It shears the x-component down (propor-
tional to the y component), and leaves the y component alone,
but for some scaling. As we learned earlier (in §11.2.3, page
192 and further explanation on page 195), shear matrices do
not have a full set of eigenvectors and they cannot be diagonal-
ized. But, they do have a singular value decomposition, as do
all matrices.

1.25 "
1 0
2 #" =
2 1
1
3
2
0.75
3
2
0.5 1
2

0.25 #
1
!= $ #! =
0 !
22 21.75 21.5 21.25 21 20.75 20.5 20.25 0 0.25 0.5 0.75 1 1.25 1.5 1.75 2

>1: ROTATION 20.25

1 $ 3
#! = § ##! = 20.5
0 % 1

0 $ 21
#" = § ##" = % 20.75
1 3
$ 3 21
6& = %
21
1 3
Anticlockwise rotation of 30° 21.25

Fig. 13.2 The first rotation by V T , anti-clockwise rotation through 30°.

Figures 13.2 to 13.5 show the actions of these three matrices, and
a summary. The transformation a vector x by Ax is broken down
into U ΣV T x. The first multiplication by V T is a rotation, shown in
Figure 13.2. It does not change the size of any vector x, nor of the
basis vectors q1 and q2 shown in red and blue. In their new, rotated
positions, q1 and q2 are shown in lighter red and blue. The unit
vectors all have their tips on the unit circle, as shown in Figure 13.2.
Because of the rotation (of 30°) by V T , all the vectors in the first
quadrant are now between the bright red and blue vectors q1′ and q2′ .
We then apply the scaling Σ on the product V T x, which scales
along the original (not the rotated) unit vectors q1 and q2 . In Fig-
ure 13.3, we can see how the rotated unit vectors (now shown in
translucent red and blue) get transformed into their new versions.
What Σ does to the unit vectors qi , it does to all vectors. Therefore,
What SVD Does 255

1.25 "
0
#" =
1
1

0.75
3
2
4
3 0.5
3 3
4
4
0.25
1
4
!
22 21.75 21.5 21.25 21 20.75 20.5 20.25 0 0.25 0.5 0.75 1 1.25 1.5 1.75 2
1
#! =
20.25 0
D: SCALING
D>1
1 $ 3 20.5
#! = §
0 % 0
§ ### 3 3
$
! = '
0 0 20.75
1
#" = §
1 0.5 $ 23
§ ###
" ='
$ 3 0 3
7=% 21
0 1 $ 3 3 23
? ³ /d0 ?; C ³ .d0 C 86& = '
21.25 1 3

Fig. 13.3 The second scaling by Σ, x-components by 1.5 and y by 0.5.

the unit circle gets elongated along the x direction, and squashed
along the y direction. What we mean by this statement is that all
vectors whose tips are on the unit circle get transformed such that
their tips end up on the said ellipse. As a part of this transformation,
the rotated unit vectors, the translucent red and blue vectors q1′ and
q2′ , get transformed to q1′′ and q2′′ (in brighter colors) on the ellipse. In
other words, the effect of the two transformations, the product ΣV T ,
is to move all the vectors in the first quadrant of the unit circle to the
arc of the ellipse between the bright red and blue vectors q1′′ and q2′′ .
Notice that the transformed ellipse in Figure 13.3 has its axes along
the x and y directions. The last step, shown in Figure 13.4, is the
rotation by U . It is a anti-clockwise rotation of 60°. It rotates the
unit vectors through that angle. Remembering that the axes of the
ellipse after the scaling (in Figure 13.3) were along the directions of
x and y unit vectors, we can see that how the ellipse gets rotated. Of
course, the rotation happens to all the vectors. The ΣV T -transformed
versions of the original unit vectors (from Figure 13.2), now shown
in translucent red and blue in Figure 13.4 as q1′′ and q√2′′ , for instance,
get rotated to the bright red vector,√with its tip at ( 23 , −1) and the
bright blue vector with its tip at (0, 23 ). This indeed is exactly what
the shear matrix A does, as illustrated in Figure 13.1.
256 Singular Value Decomposition

1.25 "
2 (4
#" = § ####
! =
%
1 21
1
0 0
§ #### = (
3 " 4%
0.75 2
(4
% 0
986& = =3
(4
0.5 21 %

0.25

!
22 21.75 21.5 21.25 21 20.75 20.5 20.25 0 0.25 0.5 0.75 1 1
1.25 1.5 1.75 2
#! =
!= 2
E: ROTATION 20.25 2 #
*

1 $ 1
#! = § 20.5
0 % 2 3

0
#" = § 2 3 20.75
1 1
$ 1 3 21 3
9=
% 2 3 1 2
21
Clockwise rotation of 60° 21.25

Fig. 13.4 The third rotation by U , clockwise rotation through 60°.

Now we are in a position to summarize what A does, either through

the direct path Ax or by the three steps U Σ V T x, as shown in
Figure 13.5.

;$
!

#: ! = #$%! %

<%
$ ;%
;$
<$

Fig. 13.5 The geometry of SVD, summarized.

We may be tempted to think, from the figures, that one rotation and
one scaling might be enough to do everything that A does. We should,
however, note that the first quadrant of the unit circle in Figure 13.2
is getting mapped to the arc of the ellipse between the light red and
blue vectors in Figure 13.3. One rotation and one scaling could give
is this mapping, but the ellipse would have its axes along the unit
How SVD Works 257

vectors. Or, by applying a scaling and then a rotation, we could get

to the right ellipse as in Figure 13.4, left panel, but the points in the
first quadrant on the unit circle would be mapped to the first quadrant
of the ellipse. We really do need two rotations and a scaling.
The story in the general case of A ∈ Rm×n , of course, is much
more complicated, but the fundamental idea is still the same. We use
the intuitions from R2 in order to make use of the power of SVD in
higher dimensions as well.

13.2 How SVD Works

From our discussions on the four fundamental subspaces defined by

a matrix A, we know that its row space C(AT ) gets mapped to its
column space C(A). Furthermore, the mapping is one-to-one and
onto, which means a vector v ∈ C(AT ) gets mapped to a unique
vector u ∈ C(A). Both C(AT ) and C(A) have the same dimension,
which is the rank of A, r. In other words, we should be able to find
r orthogonal vectors in C(AT ) to span the row space; we could, for
instance, apply the Gram-Schmidt algorithm to do it.
Let’s say that these r orthonormal basis vectors in C(AT ) are vi .
When applying the transformation of A to them, we get some vectors
in C(A): Avi = ui . We know that these are r unique vectors ui
because A : C(AT ) 7→ C(A) is a one-to-one mapping. However,
we have no reason to think that they are orthogonal or normalized.
We can insist that they be normalized by factoring out their norms:
Avi = σi ui with ∥ui ∥ = 1.
The vectors ui are unique because each of them is a different linear
combination of the columns of A with the coefficients as specified in
vi and the vectors vi are linearly independent. Taking this as one of
the last teachable moments in this textbook, let’s harp on this point
a bit more. Since vi ∈ C(AT ), we know that Avi ̸= 0. And since
the vectors vi form a basis for C(AT ), they are linearly independent,
which means, at the very least, the components of one of them cannot
all be the same as that of another. By the column picture of matrix
multiplication, ui = Avi is a linear combination of the columns of
A with different coefficients that are components of vi . And, from
way back in the first or second chapter, we know that two different
258 Singular Value Decomposition
P P
linear combinations si vi and ti vi cannot be the same: Linear
combinations are unique.
What we are demanding in Singular Value Decomposition (SVD)
of A ∈ Rm×n , rank(A) = r f min(m, n) is something remarkable:
Our goal is to find a special orthonormal basis in the row space of
A that gets mapped to an orthonormal basis in its column space:
We want a basis V for C(AT ), and a U for C(A), with a special
requirement each vi ∈ C(AT ) gets mapped to a ui ∈ C(A), with a
possible scaling factor σi . The remarkable fact behind SVD is that
we will always be able find such bases, and we shall shortly see why
and how. We will have a mathematical proof and recipe to find such
bases.
Since the rank of the matrix is r, we can find only r linearly
independent vectors in C(A) and C(AT ). We will worry about the
rest of the column vectors in U and V later.
Let’s write down all we have so far:
 
| | ··· |
Û = u1 u2 · · · ur  ∈ Rm×r = [ui ]
| | ··· |
 
| | ··· |
V̂ = v1 v2 · · · vr  ∈ Rn×r = [vi ]
| | ··· |
We are putting a hat on the U and V matrices because they are
smaller, more economical ones for now. They are not the full-sized
versions in Eqn (13.1). Each vi ∈ C(AT ) is going to be mapped to
a corresponding ui ∈ C(A), as per our requirement. We then have
the following, where we have arranged σi , 1 f i f r as the diagonal
elements in Σ̂ ∈ Rr×r . Remember that we require the columns of V̂
and Û to be orthonormal.

Avi = σi ui =⇒ AV̂ = Û Σ̂

Remembering that for a matrix with orthonormal columns, we know

that the product of its transpose with itself is the identity matrix. For
instance, the product Û T Û for Û ∈ Rm×r is Ir ∈ Rr×r because
the elements in the product are the dot products of the columns of
Û , which are either zero or one. We should not, however, call Û
How SVD Works 259

orthonormal because it is not a square matrix. Same same argument

applies to V̂ T V̂ as well.
We can now do a bit of matrix algebra magic to compute Û and V̂

(1) Orthonormal requirement: AV̂ = Û Σ̂

T T
(2) Taking the transpose of (1): V̂ A = Σ̂T Û T
(3) Multiplying (2) and (1): V̂ T AT AV̂ = Σ̂T Û T Û Σ̂
(4) Since Û T Û = I: V̂ T AT AV̂ = Σ̂T Σ̂
(5) Since Σ̂ is diagonal: V̂ T AT AV̂ = Σ̂2
(6) Left and right multiply by V̂ and V̂ T : V̂ V̂ T AT AV̂ V̂ T = V̂ Σ̂2 V̂ T
(7) If V̂ V̂ T = I: AT A = V̂ Σ̂2 V̂ T

In statement (7), we specify a conditional: If V̂ V̂ T = I, then

we get an equation involving V̂ and Σ̂2 , where we can identify an
eigenvalue equation, much like the Spectral theorem, A = QΛQT .
It then follows that the singular value decomposition is the eigenvalue
decomposition of the product AT A and the singular values are the
positive square roots of its eigenvalues.
The condition V̂ V̂ T = I is a problem though. We proved that
V̂ T V̂ = I. Can we also show that V̂ V̂ T = I? It turns out that
we cannot, for one very good reason: It is not true. One way to get
around this difficulty is to consider only full-column-rank matrices
A, in which case we have rank(A) = r = n and a square matrix
V ∈ Rn×n . We will drop the hat on V because it is now the basis
for the whole of the domain of A, namely Rn . What we now have
are the matrices as shown in Figure 13.6. Following the same logic,
we can show that
AAT = Û Σ̂2 Û T
with the proviso that A be a full-row-rank matrix in this case.
As for Σ̂2 = [σi2 ], it is the square of a diagonal matrix. It is,
therefore, simply another matrix with the squared elements along the
diagonal.
Comparing these results with our eigenvalue equations in the Spec-
tral Theorem (§12.1.1, page 222), we can see that these are the eigen-
value equations for AT A and AAT . Therefore, we conclude that the
left-singular vectors ui are the eigenvectors of AAT and the right-
260 Singular Value Decomposition

! has full column rank ó ! = %$#K = %

4$4 #K

! * =!×#

=>? =
rank % = '

4$
! is rank-de/icient ó ! = %$#K = % 4#4K
Fig. 13.6 The shapes of U , Σ and V in SVD when A is a full-column-rank matrix.

singular vectors vi those of AT A. And the singular values are the

square root of the eigenvalues of either AT A or AAT . As we saw in
one of the exercises in the previous chapter, the nonzero eigenvalues
of AT A are the same as those of AAT .
Although we derived the formulas for the singular vectors using the
provisos of the matrix A being full-column or full-row rank, they are
indeed valid for all matrices. The right approach would have been to
complete the orthonormal basis on the domain side (from V̂ , which
is the basis for the row space) to all of Rn (by adding the basis for the
null space as well). Similarly, we should have completed the basis
on the output space (Rm ) by including the basis vectors of the left
null space so that we can drop the hats on both V and U and get the
picture shown in Figure 13.7. It is perhaps wise to repeat these wordy
statements with more mathematical precision as in the following list,
where we will use the symbol B to denote basis.

• V̂ is a basis for the row space of A: V̂ = B C(AT )

• V is a basis for the domain, which is the union of the bases for
the row and null spaces of A:

V̂ = B {Rn } = B C(AT ) ∪ B {N (A)}

• Û is a basis for the column space of A: V̂ = B {C(A)}

• U is a basis for the output space (AKA co-domain), which is

the union of the bases for the column and the left null spaces
of A:

Û = B {Rm } = B {C(A)} ∪ B N (AT )

How SVD Works 261

4$
! is rank-de/icient ó ! = %$#K = % 4#
4K

! * =!×#

rank 3 = D =
D < min(=, ?)

Fig. 13.7 The shapes of U , Σ and V in SVD when A is a general, rank-deficient matrix.

Now we can confidently state V −1 = V T and U −1 = U T and

matrix algebra we performed earlier can be repeated without the hats
and everything is perfect.
We have one more way of arriving at this realization, by noting that
since these Gram matrices, AT A and AAT have the same rank as A,
which is r, we have r positive eigenvalues and as many orthogonal
eigenvectors. The rest of the eigenvalues (n−r for AT A and m−r for
AAT ) are zeros. Remembering that the eigenvectors corresponding
to zero eigenvalues are, in fact, the basis for the null space of the
matrix, we can complete the U , Σ and V (the unhatted ones) matrices.
To get the final answer of SVD, we pad U with the basis of the left null
space N (AT ) and V with the basis of the null space N (A), which
are really the eigenvectors of AAT and AT A. They are, indeed, the
missing left and right singular vectors we are seeking. And the Σ
matrix gets the correct zero eigenvalues to pad from the (r + 1)th to
the nth diagonal position, resulting in Figure 13.7 and the following
262 Singular Value Decomposition

equations.
 
| | ··· |
U = u1 u2 · · · um  ∈ Rm×m = [ui ]
| | ··· |
 
| | ··· |
V =  v1 v2 · · · vn  ∈ Rn×n = [vi ]
| | ··· |
  (13.2)
σ1 0 · · · 0 · · · 0
 0 σ2 · · · ... · · · 0 
. .. ..
· · · ... 
..

 .. . . .
Σ=  ∈ Rm×n = [σi ]
.0 0 · · · σ r · · · 0 
. . . .
.

 .. .. .. .. .. .. 
0 0 0 0 0 0
The Σ matrix is arranged such that σ1 g σ2 g · · · σr g 0 so that the
first singular value is the most important one. The singular vectors
are eigenvectors of AT A and AAT :

AT AV = V Σ2 and AAT U = U Σ2 (13.3)

We can summarize the singular value decomposition of A =

U ΣV T (where A ∈ Rm×n and rank(A) = r) in a few bullet points:
1. The matrix V ∈ Rn×n is the right1 singular matrix.
(a) It is an orthonormal basis for Rn , the domain of the trans-
formation A : Rn 7→ Rm .
(b) The columns of V , vi are the right singular vectors.
(c) vi , 0 < i f r are eigenvectors of AT A.
(d) The first r of them span the row space of A, C(AT ).
(e) The rest (n − r) span its null space, N (A).
(f) vi , r < i f n are found (if needed) by computing the null
space.
2. U ∈ Rm×m is the left singular matrix.

1
To remember whether U or V is left or right, note that the left singular matrix U appears
on the left in U ΣV T and V on the right.
How SVD Works 263

(a) It is an orthonormal basis for Rm , the output space (AKA

the co-domain) of the transformation A : Rn 7→ Rm .
(b) Its columns, ui , are the left singular vectors.
(c) ui , 0 < i f r are the eigenvectors of AAT .
(d) Again, the first r of them form a basis for the the column
space C(A).
(e) The rest (m − r) of them span the left null space, N (AT ).
(f) ui , r < i f m are found (if needed) by computing the
left null space.

3. The diagonal matrix Σ ∈ Rm×n (the same size as A) contains

the r singular values.

(a) The first r singular values, σi , 0 < i f r are positive.

(b) They are the nonnegative square roots of AT A (for tall
matrices) or AAT (for wide ones).
(c) The rest of the singular values, σi , r < i f min(m, n),
are all zeros.
(d) By convention, the singular values are arranged in de-
scending order:

σ1 g σ1 g · · · g σr

These bullet points are illustrated in Figure 13.7, the full-blown

SVD. Since only the first r singular values are nonzero (and also
because the basis vectors vi of the null space map to 0 ∈ Rm ), we
can ignore the last (m − r) columns of U and (n − r) columns of
V . We can also take the leading r × r part of Σ to come up with a
“thin” or “economical” version of SVD. Note that this version does
not involve any approximations; it is merely a choice of ignoring the
zeros that do not matter any way.
We also have the following properties for singular values:
• For square matrices, the product of the singular values is the
determinant. For instance, for A ∈ R2×2 , σ1 σ2 = λ1 λ2 = |A|.

• For A ∈ R2×2 , σ1 g λ1 g λ2 g σ2
264 Singular Value Decomposition

13.3 Why SVD Is Important

13.3.1 Data Compression

Remember that the Spectral Theorem (§12.1.1, page 222) allowed us

to write: n
X
A= λi qi qiT
i=1

from the EVD of a real, symmetric A ∈ Rn×n = QΛQT .

In SVD, we have A ∈ Rm×n = U ΣV T (for any matrix). This
product also can be expanded as:
r
X
A= σi ui viT (13.4)
i=1

where we used the fact from Eqn (13.2) that the diagonal matrix Σ
has only the first r = rank(A) nonzero elements, and therefore, the
sum runs from 1 only to r, not to m or n, the matrix dimensions.
It is perhaps important enough to reiterate that Eqn (13.4), by
itself, is not an approximation just because we are summing only up
to r, which is to say, we are using the “economical” (hatted) SVD
matrices. Even if we were to use the full matrices, the multiplication
ΣV T would have resulted in a matrix (∈ Rm×n ) with the last n − r
columns zero because only the first r singular values (σi , 0 < i f r)
are nonzero.
Each term in the summation in Eqn (13.4), Ai = ui viT , is a rank-
one matrix (because it is a linear combination of one row matrix
viT , by the row-picture of matrix multiplication. The first of them,
A1 = u1 v1T , is the most important one because σ1 is the largest
among the singular values. We can therefore see that A1 is the best
rank-one approximation of the original matrix A.
Let’s say A is a megapixel image of size 1000 × 1000. It takes a
million bytes to store it. A1 , on the other hand, takes up only 2001
bytes 1(σ1 ) + 1000(ui ) + 1000(ui ). If A1 is not good enough, we
may include up to k such rank one matrix at a storage cost of 2001k,
which is smaller than a million for k < 499. Typically, the first few
tens of σi would be enough to keep most of the information in A.
In general, in order to store up to k rank-one approximations of
A ∈ Rm×n , we need k(m + n + 1) units of memory, which could be
Why SVD Is Important 265

significantly smaller than mn. The Singular-Value Decomposition,

therefore, can be the basis of a powerful data compression algorithm.
Although we have more powerful techniques than the plain old SVD
for image compression, it is a technique that still inspires some re-
searchers2 .

13.3.2 Principal Component Analysis

In data science, SVD is commonly used for dimensionality reduction

to uncover the low-dimensional patterns in the data. Although it may
be called the Principal Component Analysis (PCA), it is essentially
SVD using different jargon and with one extra step. We will stick
with our current notation to describe PCA.
Let’s say A ∈ Rm×n is a data matrix, such as the Young Adult
dataset we used in Figure 10.4 while studying multiple linear re-
gression. Here, we have an observation as a row in the matrix, and
variables along which the observations are made in the columns. We
have m observations along n variables (xj , 1 f j f n), and as we
saw multiple times, m k n.
The first step in PCA is to compute one extra row called the mean
(as in average) row, and subtract it from all other rows.
m
1 X
µj = aij = E[xj ]; aij ← aij − µj
m i=0

Thus, we get the so-called zero-centered data in our A matrix (which

we will call A0 to avoid confusion), where each column has a mean of
0. Now, the Gram matrix, AT0 A0 becomes a much smaller matrix in
terms of the number of elements because AT0 A0 ∈ Rn×n and n j m.
It also becomes proportional to the covariance matrix of the data.
To see how this magic happens, we only need to expand the product.
Calling AT0 A0 by a new name, C ∈ Rn×n = [cij ] = AT0 A0 , we can
write:
m
X
T
cij = ai(0) aj(0) = (aki − µi )(akj − µj )
k=0

2
A quick search revealed this article from 2017.
266 Singular Value Decomposition

which is the covariance between the ith and j th variables. For example,
if we set i = j to look at the diagonal elements of C, we get
m
X
(aki − µi )(aki − µi ) = m (aki − µi )2 = m Var(xi )
k=0

which is the variance of the ith variable xi (times the number of

observations).
Taking the SVD of A0 is the same as finding the eigenvalues
and vectors of C, the elements of which are proportional to the
covariances of the variables in the data. Therefore, as we saw in
§11.5 (page 201), what PCA does is to find directions along which
the covariance is the highest, second highest and so on. In other
words, it gives us a hierarchical coordinate system, in which the first
direction corresponds to the highest variance in the data, the next one
the second highest and so on. Note that it comes from the data directly,
without assuming any kind of underlying probability distribution for
the variables.
Once we have the SVD of A0 , we can look at the right singular
vectors vi , which are the eigenvectors of AT0 A0 (as in Eqn (13.3))
and write:
AT0 A0 V = V Σ2 (13.5)
We know that the singular values are sorted so that σ1 is the largest.
Using the insights about what eigen-analysis does (from §11.5, page
201), we make the following observations:
1. The eigenvectors of the Gram matrix AT0 A0 represent the di-
rections in its row space along which the variances in the data
are sorted descending. Remembering that AT0 A0 and A0 have
the same row space from §12.6 (page 235), the eigenvectors
in the columns of V are the directions in the data space along
which we have the sorted variances as well.
2. The matrix T = A0 V holds the principal components of the
data (the first of which is a linear combination of the columns
of A0 with the largest variance, the second the second largest
and so on).
3. V , whose columns specify the coefficients required in the linear
combinations, is called the loading.
Why SVD Is Important 267

4. The elements of Σ tell us how much variance each principal

component holds. If we want, for instance, to capture 90%
of the variance in the data, we start with the first component,
and look at σ1 . If it is less than 90%, we keep adding the next
component until the cumulative sum of σi , 1 f i f k crosses
the required threshold of variance, 90%.
5. The k principal components are then used as engineered fea-
tures in our machine learning algorithm. Typically, k j n,
resulting in significant dimensionality reduction. Although its
description looks different, what we are doing here is not unlike
taking the first k terms in the summation in Eqn (13.4).
6. Dimensionality reduction has the added advantage of smooth-
ing out the noise in the data. Much like fitting a line or a
polynomial on a bunch of (x, y) data points may result in a bet-
ter fit than the collection of the points themselves, which may
be subject to statistical or measurement fluctuations, a lower
rank approximation of the data matrix may result in a better
model.
We discussed the whole PCA topic as though it was a sub-topic of
SVD. For historical reasons, however, PCA may appear with different
jargon and notations in other sources, with no reference to SVD. The
mathematics behind it is the same as our discourse here. It has to be,
for mathematical truth is singular.

A Simulated Example
As a well-established and widely used technique, PCA is available
in all modern statistical applications. In order to make its discussion
clear, we will use R and create a toy example of it using simulation.
We are going to simulate a multivariate normal distribution, centered
at µ with a covariance matrix S, generating 1000 tuples (x, y).

2.5 2 1
µ= S=
2.5 1 2
Since we set the parameters for the simulation, we already know what
to expect: We started with the covariance matrix S, and we expect
the bivariate normal distribution to show up as an ellipse, centered at
(1, 3) and with the major and minor axis along the eigenvectors s1
268 Singular Value Decomposition

and s2 of S. The lengths of the axes are going to be the eigenvalues,

λi , as we established in §11.5 (page 201).
Running the eigenvalue decomposition on S in R, we get:
eigen() decomposition
$values
[1] 3 1

$vectors
[,1] [,2]
[1,] 0.7071068 -0.7071068
[2,] 0.7071068 0.7071068

In this output, $values are our eigenvalues λi , which says the vari-
ances along the major and minor axes of the generated data in Fig-
ure 13.8, left panel, are 3 and 1. Therefore the lengths of the axes
(σi ) are the square roots, namely about 1.73 and 1, which is what
we see in Figure 13.8, on the left (except that we scaled the standard
deviations by a factor of two, so that 95% of the points are within the
ellipse).

21.0 20.5 0.0 0.5 1.0 1.5

1.5
440

191

550
3

119
6

325
771
1.0

708
307
686 952 116
914 47 562
622 330
558 312
156 603 421
466 911 399 865 36 947
2

316 24742 639 867 406

VVara 2 664345
591 907
r2 523
204 565958
162 998 819 261 117
831 720
(y) 623
358
106
339
230
53578 113
224 994
509
189
803 350 522
33
141 571 446 574560 833 677
4

653 829
154
340 969 552
121 381418 602 783 908
627 754
702 459 871 74
285104
0.5

625
424
275 51062 545 41 393
818
965 327 573 948 100
270 772 67 328482 997
839 648 515 521 883 99 216 836527 790 486 987
347
87 886 915252
10 949
300 655 53 876788 483176
481 12717 991 355
738 370
813 152 126 366 150209142 841586709 879 449
631 656422 51
265223 764
1

28743 872 359

540 693
539 581 193
88749580
144 494 610 556 194
349 68 120 3391 188 658
100017329953 147930 435 696214 954830 973 475
694 781 919 918 629
779 395 640 551787
184 452213890372548621 587 111
506 663 975 642
582 736 978
294 360
163 950 218
149672 44 873315187
79 169 319
266 968 254
611 112 16
86 268
557
921
242 666
775
9715441 190 516761 739 286 593 132
404 943 390
Var22 (y)

158 857 81 970 226 322 240

976 364 367 373

928 80138 959174267 484
659 412
8
766 476 138 145662 159 356 956 776 680146 537 920 909
878
407453 23182 528 902 832
PC2

245 259 845

2.5

511 263 105

862 409 817
456 90165 308 533 389 83 988 889
172 357 377904136 343185 712
400 597225202 434893
59 26 368644
605
880 301
760 698566 260859972 727 799
546 43 50822266 827 157
578 57331
526 596
220 293 519
96 496 199 244
229 46440 363 410 853
977 378 944262 18660 553 419 825 306
962856 667 995
354 767
441 606166 722730
326 164 380
Var

981
531289 92 798 344
283 645
131 608 480
946
812 534 103 793
177 650 21 941
398 95 957 321128999
804
735
488
25332 155
842620 501
460 549 716
269 85 729 135
869 179 19 411362 866 7956 786408 800 961 903497 609 719 589671989
607 178 641
2

0.0

231 212 507

711 690740983 695 379 469
912 657 442600792
332
479 236
588
617624 383
592896 598529 615 167982
0

493
54425 630
936 585
699 913 980 963 61 707
304
465 303 924 917
614 133 93
118
689 341
385
256
72 955891 718394
249
840 4791 805685
91 30
2 942 922 454101858 700 468324 755945
816 525 22183 524 203 20
757168
84
217
940821272601 45 65127 170 415925 432 572 846 594122 431 931
98 744 826517 258 82 430
899 282 933
487 612
337
974
401643
723
471 271710725669
877 161 728 455 458
463107 478
923 384 14
374 714737 137 417 828171 990
375 490
6473295417
746
76334 554 498 835
514875 427
996701984 433 248114239
733 31
724292 811 416 279
884 758284
688
110 756 342 868 429
726619 491 681
351 652 854870 467
782
200
86152 70 109
17 361
472 789
5 276
513 838 683 749 530
985 445
299 386
495 403 65 88852 352
387 713
397485 77 691 298 745 58 731
197251 278616905668932500 874396576 11 763
148
124
277559
661 502 966 820 219 180
388
462297
97
151 63 774
784802
420
916646
13 888 195153678 807 273 73
979211
55 447 320 238 208 134
850 670 250
844 287 512503
863613628310291 960 247520
181
227 336 443 769882
536 196
473 538
48 305 926 365 123
37 770 967
102 206 89765
518 561 139 335
235
257910626 338 697860499
851 437 477 632 207
855 414 382 785
935 470986
302 402 274 423
951721 773 444590461806 604 704676 333 797
21

569 175
210186 246376 543474 198 584 234
895 649 436 438 237
898
405 703 847 654
692 78075
927570 815577 750 937824 323 674
637
353
115
706 64 228 752568
413 822
20.5

759 618 823 864 809243 837

241
583 900
448 939
60
489834 796
575 9848 233 457426 814 885 3550
318 897 290 892 849 934 679634
0

15 288 280 25741 532 564 665

281
39269 232
451 309 317 687 843 542
964 505 143 810 201 80
56
635 992160 929 125 595129
(x)
636
130
567 563 192 311
205 547 264748675 768
314 346 705 753 295 747 296
715 579

r1 777108
938 46 255
71 794682
Va
Var 1 993 684
633 140
599 313734 428
901894
762
22

221 638 732

504 127 881 492 34
439 808
555 369
39 906
371
21.0

673
450
778 751
348
22

94
42215
23

22 0 2 2.5 4 6 23 22 21 0 1 2 3 4

Var 1 =(x)
Var 1 x PC1

Fig. 13.8 Left: Example of simulated (x, y) pairs, showing the elliptical shape, the
directions and sizes of the major and minor axes. Right: The “biplot” from the PCA
function in R.

Once we have the 1000 generated rows of simulated data, we run

PCA on it, and look at what it can reveal about it.
Standard deviations (1, .., p=2):
[1] 1.718043 1.018423
Pseudo-Inverse 269

Rotation (n x k) = (2 x 2):
PC1 PC2
[1,] -0.7043863 -0.7098168
[2,] -0.7098168 0.7043863

From the PCA output, we see the standard deviations σ1 and σ2 , close
to what we specified in our covariance matrix, S, as revealed√by the
eigen-analysis on it. Ideally, the values should have been 3 and
√
1, but we got 1.718 and 1.018—close enough. The first principal
component is linear combination of x and y with coefficients −0.704
and −0.710. Note that SVD vectors are unique in values only; their
signs are not fixed.
The right panel in Figure 13.8 is the so-called “biplot” of the
analysis, which shows a scatter plot between the first and second
principal components, as well as the loading of the original variables.
The first thing to note is that PC1 and PC2 are now uncorrelated standard
normal distributions, as indicated by the circle in the biplot rather than
the ellipse in the data.
The directions shown in red and blue are to be understood as
follows: The first variable x “loads” PC1 with a weight of −0.704
and and PC2 −0.710 (which form the first right singular vector, v1 ).
It is shown as the red arrow in the biplot, but with some scaling so
that its length corresponds to the weight. The second variable x loads
PC1 and PC2 at −0710 and 0.704 (the second right singular vector,
v2 ), shown in the blue arrow, again with some scaling. Since both
the principal components load each variable with similar weights, the
lengths of the red and blue arrows are similar.

13.4 Pseudo-Inverse

We talked about the left and right inverses in §5.4 (page 102).
Eqn (5.2), for instance, shows how we define the left inverse of a
“tall,” full-rank matrix. The SVD method gives us another way to
define an inverse of any matrix, called the pseudo-inverse, A+ .
Let’s first define what we are looking for.
270 Singular Value Decomposition

PCA on Iris Dataset

All introductory data analytics courses will deal with the Iris dataset at some point.
This dataset contains 150 flower measurements along four variables (Sepal Length,
Sepal Width, Petal Length and Petal Width) from three different iris species (Setosa,
Versicolor and Virgnica). There are 50 data points for each species.
Since creating two principal components out of two variables (as we did with the
simulated data) does not make much sense, let’s try PCA on the Iris dataset. Running
PCA on it gives us the following output:
Standard deviations (1, .., p=4):
[1] 2.0562689 0.4926162 0.2796596 0.1543862
Rotation (n x k) = (4 x 4):
PC1 PC2 PC3 PC4
Sepal.Length 0.36138659 -0.65658877 0.58202985 0.3154872
Sepal.Width -0.08452251 -0.73016143 -0.59791083 -0.3197231
Petal.Length 0.85667061 0.17337266 -0.07623608 -0.4798390
Petal.Width 0.35828920 0.07548102 -0.54583143 0.7536574

Pseudo-Inverse
Definition: A matrix A ∈ Rm×n has an associated pseudo-inverse
A+ if the following four criteria are met:
1. AA+ A = A: Note that AA+ does not have to be I.
2. A+ AA+ = A+ : The product A+ A does not have to be I
either.
T
3. (A+ A) = A+ A: Like the Gram matrix, AA+ needs to be
symmetric.
T
4. (AA+ ) = AA+ : The other product, A+ A should be sym-
metric too.
With the SVD of A, we can come up with A+ that satisfies the
four criteria.
A = U ΣV T =⇒ A+ = V Σ+ U T (13.6)
where Σ+ is a diagonal matrix with the reciprocals of σi when σi ̸=
0 and zero when σi = 0. In practice, since floating point zero
comparison is always troublesome in computing, we will use a lower
bound for σi . Note that Σ+ has the same size as AT , while Σ has the
same size as A. In other words, for A ∈ Rm×n , Σ+ ∈ Rn×m .
Pseudo-Inverse 271

At this stage of our Linear Algebra learning, it should be trivial for

us to verify that this definition of A+ does satisfy the four criteria.
We only need to note that U and V are square, full rank, orthonormal
matrices, with the associated property U U T = U T U = Im and
V V T = V T V = In . Remember that U and V are, in fact, the
bases for Rm and Rn respectively.
It is possible to rewrite the pseudo-inverse using the “thin” SVD,
where we have the hatted matrices. The shapes of these matrices are
specified in Eqn (13.2) and shown again pictorially in Figure 13.6,
which shows the typical case in data science, where we have a full-
column-rank “tall” matrix A ∈ Rm×n , m > n, rank(A) = n: All the
columns in A are linearly independent. U contains the left singular
vectors, which form a complete basis for the output space Rm . The
first n of these vectors are a basis for the column space C(A). The
rest m − n vectors are the basis for the left null space N (AT ). We
have n singular values, in decreasing order, in the leading diagonal of
Σ. Since Σ has m rows (same shape as A), we have m−n zero rows.
V ∈ Rn×n holds a full basis for Rn . There are no other dimensions
left anywhere to account for.
Figure 13.7 shows the most general case of a rank-deficient matrix
A ∈ Rm×n , rank(A) = r < min(m, n). The left and right singular
matrices (U and V ) are both square matrices forming the full basis
for Rm×m and Rn×n respectively. The first r columns of both contain
the bases for the column space and the row space of A, namely C(AT )
and C(A). The rest of the columns are the bases for the null spaces:
The last m − r columns in U span the left null space N (AT ) and the
last n − r columns of V span the (right) null space N (A). And, as
for the singular values in Σ, its first r of them are nonzero, the rest
are all zeros.
As we can see in Figure 13.7, the columns (and rows) to the right
of (and below) the thin lines in U , Σ and V T are there to account for
the null spaces. They do not affect the product at all because the zeros
in Σ will kill them anyway. We can, therefore, write an economical
SVD for A keeping only the columns (and rows) to the left of (and
above) the thin lines in U , Σ and V T . We call these smaller matrices
Û , Σ̂ and V̂ T . We can then write:

A = Û Σ̂V̂ T U ∈ Rm×r , Σ ∈ Rr×r , V ∈ Rn×r

272 Singular Value Decomposition

Since we defined Σ+ appearing in Eqn (13.6) as having nonzero

values only in the leading r × r submatrix, we can see that only the
first r columns of U and V have any bearing on A+ . Therefore, we
can write it as:
A+ = V̂ Σ̂−1 Û T (13.7)
Note that Σ̂+ is indeed Σ̂−1 because Σ̂ is a full-rank, diagonal matrix
and its inverse is another matrix with the reciprocals of the diagonal
elements along its diagonal:
1
0 ··· 0

σ1
 0 1 · · · ... 

σ ∈R r×r 1
Σ̂ = 
−1
. .
2
. . =
 .. .. .. ..  σi
1
0 0 · · · σr

Since we are working with “economical” version Σ̂, there are no

divide-by-zero errors to worry about when computing its inverse: all
the singular values σi are nonzero.
Note that in Eqn (13.6), the matrices of singular vectors on the
right are all square and invertible ones. We are justified in taking
their inverses (with U −1 = U T and V −1 = V T ). But their product
is not a square matrix, and we are probably not justified in calling it an
inverse, which is one of the reasons why we call it a pseudo-inverse.
A ∈ Rm×n and A+ ∈ Rn×m and both AA+ and A+ A are valid
matrix multiplications.
The pseudo-inverse reduces to the left-inverse for full-column-rank
matrices, and the right-inverse for the full-row-rank matrices. And
of course, for a full-rank, square matrix, it is the plain old inverse. It
is indeed worth our time to verify these statements, which is done in
a box in this chapter.
The first criterion in the definition of A+ says AA+ A = A, which
means AA+ Ax = Ax or AA+ b = b in our favorite equation
Ax = b. Let’s consolidate:
AA+ Ax = Ax = b
AA+ (Ax) = Ax = b
AA+ b = Ax = b
What is it telling us? Let’s say A+ b = y. Then we have the
equation Ay = b. Earlier, in §11.9.4, page 219, we argued that this,
Fundamental Spaces and SVD 273

in conjunction with Ax = b, implied that x = y because of the

uniqueness of linear combinations. Here’s another argument: Since
we have Eqn (13.7), we can see that A+ b is in the row space of A
because A+ b = V̂ Σ̂−1 Û T b is a linear combinations of the columns
of V̂ . Since V̂ is a basis for C(AT ), A+ b ∈ C(AT ). In other words,
for a nonzero b, y is in C(AT ), which is to say, for a b ∈ C(A), we
have A+ b = y with y ∈ C(AT ). Thus, we have the inverse mapping
A+ : C(A) 7→ C(AT ), leading to our closing discussion.

13.5 Fundamental Spaces and SVD

! * =T×S rank ! = @
I* =2 = all possible I !: = ; N * =3 = all possible & impossible N

! 6 =S ÿ =T
L 3 : dim = O
L 3) : dim = O Column Space
Row Space 3P ÿ R P ÿ 3+ R 1/
1,
//
/, ï
ÿR
ï

) P ÿ 3 +(R +
3(P + P* R* )
0
1
0

32 ÿ 2
/

+2
2ÿ3
$ %. :
$ % : dim = + 2 -
dim = . 2 - 2 2 ÿ 3 +R Left Null Space
Null Space 3P * ÿ * $
$ . !" . !"
#
, !" , !"
#

: = !X;
ï
ï

&
.

!X: =T ÿ =S
%
,

!X * =S×T rank !X = @

Fig. 13.9 The elegant symmetry of the four fundamental spaces, completed by the pseudo-
inverse A+ .

Singular Value Decomposition and the subsequent definition of the

pseudo-inverse complete the beautiful symmetry of the fundamental
spaces we started in Chapter 9. A is a mapping from Rn to Rm : It
takes vectors x ∈ Rn and gives us vectors b ∈ Rm . Or, A : Rn 7→
Rm . From the row space to the column space of A (from C(AT ) to
C(A)), the mapping is one-to-one.
A+ does just the reverse: A+ : Rm 7→ Rn . As Figure 13.9 shows,
it does it in a completely symmetric fashion. It is a one-to-one
mapping from C(A) to C(AT ), and it takes the null spaces to their
right counterparts. Instead of describing the picture using our words,
274 Singular Value Decomposition

Pesudo-Inverse vs. Other Inverses

The pseudo-inverse reduces to the old familiar double-sided inverse (A−1 ) in the case
of full-rank, square matrix. It also reduces to the left inverse (A−1
Left ) and the right inverse
when they are defined, as shown below.
Pseudo-inverse to Double-Sided Inverse: If A is a full-rank, square matrix (A
∈
Rn×n , rank(A) = n), the A+ reduces to A−1 . Since rank(A) = n, rank AT A =
n =⇒ AT A has n (positive) eigenvalues =⇒ Σ is full rank. We also have, in
A = U ΣV T , U −1 = U T , U U T = In , V −1 = V T , V V T = In and
1
··· 0
  
σ1 · · · 0 σ1

Σ =  ... .. ..  , Σ+ =  .. .. ..  = Σ−1 .

. .   . . . 
1
0 · · · σn 0 · · · σn

Now, A−1 = (V T )−1 Σ−1 U −1 = (V −1 )−1 Σ+ U T = V Σ+ U T = A+ .

Pseudo-inverse to Left Inverse: If A is a full-column-rank tall matrix (A ∈
T −1 T
Rm×n , rank(A) = n < m), A+ reduces to: A−1Left = (A A) A . Here, we have
−1 T T −1 T T
U = U , U U = Im , V = V , V V = In and
 
σ1 · · · 0
 . .. .. 
 .. . . 
1
··· 0 0 ··· 0

  σ1
 0 · · · σn  +  . .. .. .. . +
Σ=  0 · · · 0  , Σ =  ..

. . . · · · ..  =⇒ Σ Σ = In .
0 · · · σ1n 0 · · · 0
 
 . .. 
 .. ··· . 
0 ··· 0
Consider A+ A = (V Σ+ U T )(U ΣV T ) = V Σ+ (U T U )ΣV T = V (Σ+ Σ)V T =
V In V T = In . Thus A+ is the left inverse of A =⇒ A+ = A−1 T
Left = (A A)
−1 T
A .
Pseudo-inverse to Right Inverse: If A is a full-row-rank wide matrix (A ∈
T T −1
Rm×n , rank(A) = m < n), A+ reduces to: A−1
Reft = A (AA ) . Here again, we
−1 T T −1 T T
have U = U , U U = Im , V = V , V V = In and
1
0

σ1
···
 . .. .. 

σ1 · · · 0 0 ··· 0
  .. . . 
 
1
 . .. . . . +
 0 · · ·  =⇒ ΣΣ+ = Im .

Σ =  .. . . . Σ =

. . . ·· · . , 
 0 ···
σ m
0 
0 · · · σm 0 · · · 0
 
 . .. 
 .
. .

···
0 ··· 0
In this case, consider AA+ = (U ΣV T )(V Σ+ U T ) = U Σ(V T V )Σ+ U T =
U (ΣΣ+ )U T = U Im U T = Im . Thus A+ is the right inverse of A =⇒ A+ =
T T −1
Reft = A (AA )
A−1 .

it is perhaps wiser to let it speak its own thousand words. Let’s stare
at it to appreciate its exquisite completeness.
Fundamental Spaces and SVD 275

If we can allow ourselves to be mesmerized by the beauty and

symmetry of this picture and the elegance of the path that brought
us here, perhaps we have earned the right to call ourselves mathe-
maticians or computer scientists. And this picture, perhaps, is the
appropriate point to bring this book to its conclusion.

Get the Full Edition of LA4CS with

Summaries, Exercises and Solutions
Only $7.95. Scan, Click or Tap to buy.
Summing Up. . .

What we want is to see the child in pursuit of knowledge,

and not knowledge in pursuit of the child.
—George Bernard Shaw

We have now come to the end of our journey in the wonderful

world of Linear Algebra. Although we have covered a lot of mate-
rial, we have only scratched the surface of this vast and deep branch
of mathematics, for Linear Algebra is a very well-established field
of endeavor. Depending on the domain where it is applied, it may
appear different. Studied as a pure math course in a graduate level
course, Linear Algebra may focus on algebraic structures and gen-
eralized fields. When applied to quantum mechanics, it may look
like a different beast with alternative notations, working in infinite
dimensional spaces and vectors that are functions, where the focus is
on the physical interpretation of eigenvalues and vectors.
Even in our own field of computer science, when we go through the
literature, we may come across different notations and focus, depend-
ing on the background of the authors, and the specific applications
Summing Up. . . 277

under their investigation. Despite this vastness and diversity, our

hope is that we have covered enough ground to dissect the intricacies
of any of these applications and styles of discourse. To be sure, the
discourses may sound unfamiliar, and the applications may look diffi-
cult, even daunting, but their difficulty is one of notations and jargon,
not of the underlying mathematics and foundational concepts.
More importantly, we hope that our interest in this particularly
beautiful field of mathematics has been kindled, so that we look
forward to exploring more, learning further and applying the insights
in enriching our professional life as computer scientists.
Glossary

The symbols, terms and abbreviations most commonly used in this

book are listed and described below for easy reference.
We can also refer to an Online Glossary of Linear Algebra terms
and definitions, courtesy of Robert Campbell of UMBC.
Scalar Scalars are written in lowercase letters, e.g., s, similar to the
notation for the elements of vectors and matrices.
Vectors Bold lowercase letters represent vectors, e.g., x, with ele-
ments xi . Vectors are always column matrices, and are writ-
ten with square brackets when needed. Note, however, that
SageMath writes vectors as (x1 , x2 , · · · , xm ) (using parenthe-
ses with commas between the elements), and we may use that
notation as well, albeit rarely.
Matrices Bold capital letters represent matrices, e.g., A. For the
elements of A, the corresponding lowercase letter, aij is used.
When explicitly writing out the elements of the matrix, square
brackets are used.
Elements of Matrices: The elements of matrices, being scalars,
are
written using lowercase letters. When we write A = aij , we
mean that A is a matrix with a general element aij in the ith
row and j th column. We also write A = [aj ] to indicate that
A is composed of column vectors, with ai as the j th column.
Although used only a couple of times in the book, we also write
Glossary 279

A = [aTi ] to denote the matrix A consisting of row vectors, aTi

as the ith row.
Determinant: Our favorite symbol for determinant is a vertical line:
The determinant of A is |A|.
Transpose: The symbol T represents transpose. A ∈ Rm×n =⇒
AT ∈ Rn×m .
Hermitian Transpose: The complex conjugate transpose of A is
indicated as A† , although it is used very sparingly in this book
Fields: Our matrices and vectors are almost always over the field of
reals, which is represented as R. We will write A ∈ Rm×n
and x ∈ Rm . Although we may not use it, it is possible to
have matrices and vectors over other fields and rings, such as
integers (Z), rationals (Q) or complex (C).
Spaces: We use calligraphic symbols such as S for spaces. In par-
ticular, the column space of A is C(A), row space C(AT ), null
space N (A) and its left null space is N (AT ).
Math Symbols: In definitions and equations, we will use common
mathematical symbols such as:
• ∀: For all or for any. ∀ x means for any vector x.
• ∈: Is a member of the set. s ∈ R says that the scalar s is
a member of the set of reals.
i
• =⇒ : Implies. i, j ∈ Z =⇒ j
∈Q
• ¢, ¦: Subset of. Z ¦ Q ¦ R ¦ C.
nonstandard Notation: We use the symbol · in a manner not seen
elsewhere: We write A = [I · F ], for instance, to indicate A
is a matrix composed of the columns of the identity matrix and
the matrix F , but the columns are not necessarily in the order
in which they appear in the constituent matrices. They may be
“shuffled.”
Norm: Double vertical lines indicate the norm (usually the Euclidean
norm, usually of a vector). ∥x∥ is the norm of the vector x.
Credits

SageMath and Labs

In order to illustrate the topics in a hands-on manner, we will be using some

exercises to be run on SageMath. At the time of writing this book, SageMath is
on version 9.2, and can be downloaded from its website. However, we will be
using version 9.1 because of some problems of 9.2 on some platforms. However,
SageMath is a stable application, and the actual version used is not expected to make
much difference in the learning experience. We gratefully acknowledge the efforts
of the SageMath team in making their excellent tool available freely.

Resources

This book is written explicitly as a textbook to support a corresponding course at

Singapore Management University. For each weekly session of our course, which
corresponds to a chapter in the book, we will have a curated list of videos either as
preparatory, or for review and problem solving.
Linear Algebra is a well-established branch of mathematics, and we have an
abundance of online resources to draw from. Here are some more resources:
MIT Open Course Ware 18.06SC: Available on YouTube for free, these excel-
lent lectures by Gilbert Strang and associated recitations are a great resource
for our course. Many of the suggested readings (especially the problem
solving kind) in this book are drawn from this resource.
3Blue1Brown: Another excellent YouTube channel on mathematics, provided for
free by Grant Sanderson, is a must watch for students of applied mathematics
and computer science. We have listed several videos from its the playlist
Essence of Linear Algebra in our chapters.
Credits 281

Gilbert Strang “Linear Algebra and its Applications”: This is the book version
of the lectures in 18.06SC, and may be useful for sample problems and as
lecture notes. However, lecture notes, problem sets and lecture transcripts
from the book are all available online at MIT Open Courseware.
Philip Klein “Coding the Matrix: Linear Algebra through Computer Science
Applications”: A very comprehensive and well-known work, this book
teaches Linear Algebra from a computer science perspective. Some of the
labs in our course are inspired by or based on the topics in this book, which
has an associated website with a lot of information.

Books

In addition to this textbook, “Linear Algebra for Computer Science”, here are some
other books that we can freely download and learn from:
Jim Hefferson “Linear Algebra”: This book uses SageMath and has tutorials and
labs that can be downloaded for more practice from the author’s website.
Robert Beezer “A First Course in Linear Algebra”: Another downloadable book
with associated web resources on SageMath. In particular, it has an on-line
tutorial that can be used as a reference for SageMath.
Stephen Boyd “Introduction to Applied Linear Algebra”: This well-known book
takes a pragmatic approach to teaching Linear Algebra. Commonly referred
to by the acronym (VMLS) of its subtitle (“Vectors, Matrices, and Least
Squares”), this book is also freely downloadable and recommended for
computer science students.

A legal disclaimer: The websites and resources listed above are governed by their
own copyright and other policies. By listing and referring to them, we do not imply
any affiliation with or endorsement from them or their authors.
282 Credits ABOUT LA4CS

An enjoyable and readable textbook on mathematics,

LA4CS introduces the essential concepts and practice of
Linear Algebra to the undergraduate student of
computer science.
The focus of this book is on the elegance and beauty of
the numerical techniques and algorithms originating from
Linear Algebra. As a practical handbook for computer
and data scientists, LA4CS restricts itself mostly to real
fields and tractable discourses, rather than deep and
theoretical mathematics.
Its companion website, LA4CS.com, features extra
information, downloadable material, and links to the
video lectures based on this textbook.

ABOUT THE AUTH0R

Manoj Thulasidas, is an Associate Professor of
Computer Science (Education) who teaches Data
Analytics and Linear Algebra to undergraduate
students of computer science and information
systems at Singapore Management University.
His other works include The Unreal Universe, an
inquiry into the philosophical underpinnings of
physics (from his career as a physicist at CERN in
Geneva) and Principles of Quantitative
Development, a practitionerÕs guide to the
lucrative profession of quantitative finance (from
his experiences as a quant in Singapore).

Buy it ASIAN BOOKS

ISBN 978-981-18-2045-8

Scan or Tap

Linear Algebra With Its Applications
No ratings yet
Linear Algebra With Its Applications
336 pages
Lin Alg Book
No ratings yet
Lin Alg Book
251 pages
Linear Algebra - Intuition, Math, Code
No ratings yet
Linear Algebra - Intuition, Math, Code
565 pages
Linear Algebra For Data Science 9811276226 9789811276224 - Compress
100% (3)
Linear Algebra For Data Science 9811276226 9789811276224 - Compress
257 pages
Arindama Singh - Introduction To Matrix Theory-Ane Books, New Delhi (2017)
No ratings yet
Arindama Singh - Introduction To Matrix Theory-Ane Books, New Delhi (2017)
214 pages
Math 221 Notes - Chad Davis (Last Update July 2022)
No ratings yet
Math 221 Notes - Chad Davis (Last Update July 2022)
273 pages
(Ebook PDF) Elementary Linear Algebra 7th Edition by Ron Larsoninstant Download
100% (3)
(Ebook PDF) Elementary Linear Algebra 7th Edition by Ron Larsoninstant Download
57 pages
Maths 2 Book Linear Algebra (All The Chapters)
No ratings yet
Maths 2 Book Linear Algebra (All The Chapters)
240 pages
(Ebook PDF) Elementary Linear Algebra 7th Edition by Ron Larson Download
100% (1)
(Ebook PDF) Elementary Linear Algebra 7th Edition by Ron Larson Download
41 pages
Tsopméné P Matrix Algebra 2022
No ratings yet
Tsopméné P Matrix Algebra 2022
341 pages
(Ebook PDF) Elementary Linear Algebra 7th Edition by Ron Larson Download
100% (5)
(Ebook PDF) Elementary Linear Algebra 7th Edition by Ron Larson Download
46 pages
Oliveira L Linear Algebra
100% (3)
Oliveira L Linear Algebra
329 pages
Matrix Analysis For Scientists and Engineers Alan J Laub
No ratings yet
Matrix Analysis For Scientists and Engineers Alan J Laub
172 pages
Self Learning LinAlgebra
No ratings yet
Self Learning LinAlgebra
44 pages
Linear Algebra Done Openly
No ratings yet
Linear Algebra Done Openly
288 pages
MA1522 Note Binder
No ratings yet
MA1522 Note Binder
43 pages
Linear Algebra Lecture Notes
100% (1)
Linear Algebra Lecture Notes
69 pages
Advanced Linear Algebra PDF
100% (13)
Advanced Linear Algebra PDF
348 pages
Linear Notes
No ratings yet
Linear Notes
152 pages
Matrix Algebra by A.S.Hadi
0% (2)
Matrix Algebra by A.S.Hadi
4 pages
Network Defense and Countermeasures: Principles and Practices, 4th Edition William Easttom pdf download
No ratings yet
Network Defense and Countermeasures: Principles and Practices, 4th Edition William Easttom pdf download
94 pages
2021 Module 2 - Mathematical Language and Symbols
No ratings yet
2021 Module 2 - Mathematical Language and Symbols
34 pages
Linear Algebra I Notes
No ratings yet
Linear Algebra I Notes
107 pages
Bruce Cooperstein-Elementary Linear Algebra (2010)
100% (1)
Bruce Cooperstein-Elementary Linear Algebra (2010)
954 pages
Linear Algebra For Computer Science
No ratings yet
Linear Algebra For Computer Science
279 pages
DLL Evaluating Algebraic Expressions
100% (1)
DLL Evaluating Algebraic Expressions
6 pages
Linear Algebra - Pure & Applied
86% (7)
Linear Algebra - Pure & Applied
734 pages
Linear Algebra and Its Applications 5th Edition-6-8
No ratings yet
Linear Algebra and Its Applications 5th Edition-6-8
3 pages
200-901 V15.95
No ratings yet
200-901 V15.95
121 pages
LA4CS
No ratings yet
LA4CS
336 pages
Mathematical Language and Symbols
100% (2)
Mathematical Language and Symbols
23 pages
Unit-2 DSM
No ratings yet
Unit-2 DSM
22 pages
Duc Tran - Basic Linear Algebra - An Introduction With An Intuitive Approach (2022)
No ratings yet
Duc Tran - Basic Linear Algebra - An Introduction With An Intuitive Approach (2022)
190 pages
Introduction To Linear Algebra For Science and Engineering 1st Ed
90% (58)
Introduction To Linear Algebra For Science and Engineering 1st Ed
550 pages
Linearalgebra: Pure Applied
No ratings yet
Linearalgebra: Pure Applied
726 pages
MATLAB Games for Kids: Learn to Code and Play
From Everand
MATLAB Games for Kids: Learn to Code and Play
Eric Okoth Ogur
No ratings yet
Linear Algebra
100% (2)
Linear Algebra
395 pages
Linear Algebra
67% (3)
Linear Algebra
395 pages
Linalg Elia
No ratings yet
Linalg Elia
20 pages
Practical Linear Algebra
100% (1)
Practical Linear Algebra
253 pages
Learn Python through Nursery Rhymes and Fairy Tales: Classic Stories Translated into Python Programs (Coding for Kids and Beginners)
From Everand
Learn Python through Nursery Rhymes and Fairy Tales: Classic Stories Translated into Python Programs (Coding for Kids and Beginners)
Shari Eskenas
5/5 (1)
Schaum S Outline of Theory and Problems of Matrix Operations PDF
No ratings yet
Schaum S Outline of Theory and Problems of Matrix Operations PDF
235 pages
Matrices and Linear Algebra
No ratings yet
Matrices and Linear Algebra
258 pages
Lab. Manual PDF
No ratings yet
Lab. Manual PDF
310 pages
Lecture Notes in LInear Algebra - ICEF Moscow
No ratings yet
Lecture Notes in LInear Algebra - ICEF Moscow
121 pages
Syllabus - Linear Algebra For Engineers
No ratings yet
Syllabus - Linear Algebra For Engineers
5 pages
Ubco Math221
100% (1)
Ubco Math221
241 pages
Kuttler LinearAlgebra AFirstCourse Yorku MATH2022 Summer2016
No ratings yet
Kuttler LinearAlgebra AFirstCourse Yorku MATH2022 Summer2016
256 pages
線性代數113B v2
No ratings yet
線性代數113B v2
5 pages
Notes On MIT 18.06 Linear Algebra: Hello My Love
No ratings yet
Notes On MIT 18.06 Linear Algebra: Hello My Love
19 pages
Programming in C 1
No ratings yet
Programming in C 1
42 pages
Math 7 2nd Grading Exam
No ratings yet
Math 7 2nd Grading Exam
4 pages
Linear Algebra
No ratings yet
Linear Algebra
6 pages
ISC Class 11 Computer Science Syllabus 2023 24
No ratings yet
ISC Class 11 Computer Science Syllabus 2023 24
5 pages
Basic Algebra Drill
No ratings yet
Basic Algebra Drill
38 pages
Cis515 15 sl1 A
No ratings yet
Cis515 15 sl1 A
68 pages
Warwick Linear Algebra Inna
No ratings yet
Warwick Linear Algebra Inna
61 pages
Linear Algebra MATH 211 Textbook
No ratings yet
Linear Algebra MATH 211 Textbook
253 pages
Rational Functions
No ratings yet
Rational Functions
37 pages
1 Introduction To Vectors: Factorization: A LU - . - . - . - . - . - . - . - . .
No ratings yet
1 Introduction To Vectors: Factorization: A LU - . - . - . - . - . - . - . - . .
2 pages
Linear Algebra and Matrices
No ratings yet
Linear Algebra and Matrices
181 pages
Linear Algebra Reivew: All Linear Algebra, So This Is A Fairly Serious Weakness. This Review Is
No ratings yet
Linear Algebra Reivew: All Linear Algebra, So This Is A Fairly Serious Weakness. This Review Is
10 pages
Linear Algebra
No ratings yet
Linear Algebra
4 pages
Mission Ruby
From Everand
Mission Ruby
Sheela Preuitt
No ratings yet
Lattices
No ratings yet
Lattices
37 pages
In Writing and Speaking The Language of Mathematics, It Is Important That You Know
100% (1)
In Writing and Speaking The Language of Mathematics, It Is Important That You Know
20 pages
DLL Math 7 Week 1-2025
No ratings yet
DLL Math 7 Week 1-2025
3 pages
Advanced Math 60 Qs
No ratings yet
Advanced Math 60 Qs
30 pages
Linear Inequalities in One or Two Variables
No ratings yet
Linear Inequalities in One or Two Variables
71 pages
Veritas College of Irosin: Schedule Lesson No.: 4-6
No ratings yet
Veritas College of Irosin: Schedule Lesson No.: 4-6
17 pages
Intermediate Algebra 4th Edition Larson Instant Download
100% (1)
Intermediate Algebra 4th Edition Larson Instant Download
74 pages
Python W3 - Control Flows
No ratings yet
Python W3 - Control Flows
21 pages
PQC ENISA Feb 2021 1613083129
No ratings yet
PQC ENISA Feb 2021 1613083129
37 pages
Matrices and Linear Algebra in Control Applications
No ratings yet
Matrices and Linear Algebra in Control Applications
38 pages
Introduction To Python Chapter 5
No ratings yet
Introduction To Python Chapter 5
16 pages
Matrices: A Complete Course
No ratings yet
Matrices: A Complete Course
22 pages
GR 8 Maths Exam Term 2
No ratings yet
GR 8 Maths Exam Term 2
6 pages
GM Addition Subtraction of Functions 1
No ratings yet
GM Addition Subtraction of Functions 1
6 pages
Course Syllabus: Faculty of Computers and Information Faculty of Computers and Information
No ratings yet
Course Syllabus: Faculty of Computers and Information Faculty of Computers and Information
1 page
Entraînement Advanced Maths 1
No ratings yet
Entraînement Advanced Maths 1
11 pages
50 Python Concepts Every Developer Should Know
From Everand
50 Python Concepts Every Developer Should Know
Hernando Abella
No ratings yet
MMW Module 2a FD
No ratings yet
MMW Module 2a FD
13 pages
Amazing Math: Projects You Can Build Yourself
From Everand
Amazing Math: Projects You Can Build Yourself
Lazlo C. Bardos
No ratings yet
Lesson Plan Title:: Students Will Understand That - .
No ratings yet
Lesson Plan Title:: Students Will Understand That - .
3 pages
Foundations of Algebra
No ratings yet
Foundations of Algebra
4 pages
Pros Dle21 2
No ratings yet
Pros Dle21 2
3 pages
NSMQ Selection Quiz
No ratings yet
NSMQ Selection Quiz
4 pages
Lab5 2 ImageOperationswithDocker
No ratings yet
Lab5 2 ImageOperationswithDocker
6 pages
Hashing
From Everand
Hashing
Prakash Hegade
No ratings yet
EPC 5GC Session Management Differences 1631471394
No ratings yet
EPC 5GC Session Management Differences 1631471394
1 page
Green Composites Second Edition Waste and Nature based Materials for a Sustainable Future Caroline Baillie instant download
No ratings yet
Green Composites Second Edition Waste and Nature based Materials for a Sustainable Future Caroline Baillie instant download
111 pages
Adipose Tissue Biology 2nd Edition Michael E. Symonds (Eds.) Download
100% (1)
Adipose Tissue Biology 2nd Edition Michael E. Symonds (Eds.) Download
66 pages
Organic Chemistry 12th 12th Edition Francis Carey PDF Download
100% (1)
Organic Chemistry 12th 12th Edition Francis Carey PDF Download
136 pages
Future Foods: Global Trends, Opportunities and Sustainability Challenges 1st Edition Rajeev Bhat Download
100% (1)
Future Foods: Global Trends, Opportunities and Sustainability Challenges 1st Edition Rajeev Bhat Download
124 pages
Maths Year 7 - Weekly Planning - 2024-2025 - Unit 2
No ratings yet
Maths Year 7 - Weekly Planning - 2024-2025 - Unit 2
1 page