MathCamp2024 Linear Algebra
1 Vectors
A vector can be defined in many different ways, but let us broadly define a vector as a list of objects that
is used to represent and conceptualize ideas.
Definition 1.3. Two vectors in Rn are equal if and only if their corresponding elements are equal.
Example 1.4. The vectors u = \begin{pmatrix} 2 \\ 1 \end{pmatrix} and v = \begin{pmatrix} 1 \\ 2 \end{pmatrix} are not equal, since their corresponding elements differ.
Definition 1.5. Given two vectors u and v, their sum is the vector u + v, obtained by adding the
corresponding elements.
∗ This lecture note is for personal use only and is not intended for reproduction, distribution, or citation.
† This lecture note was originally written by James Banovetz.
Example 1.6. Consider the vectors u = \begin{pmatrix} 2 \\ 1 \end{pmatrix} and v = \begin{pmatrix} 1 \\ 2 \end{pmatrix}. Then
u + v = \begin{pmatrix} 2 \\ 1 \end{pmatrix} + \begin{pmatrix} 1 \\ 2 \end{pmatrix} = \begin{pmatrix} 3 \\ 3 \end{pmatrix}
Definition 1.7. Given a vector u and a real number c, the scalar multiple of u by c is the vector
obtained by multiplying each element in u by c.
Example 1.8. Consider the vector u = \begin{pmatrix} 2 \\ 1 \end{pmatrix} and the scalar c = 3. Then
cu = 3 \begin{pmatrix} 2 \\ 1 \end{pmatrix} = \begin{pmatrix} 6 \\ 3 \end{pmatrix}
Aside. Try to imagine each element as a scalar weight on a basis vector. For example, the vector [2, 3] can be represented as 2i + 3j, where i and j are the “basis vectors” of the xy coordinate system. This exercise will come in handy later.
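Aside. If you want to play with these operations on a computer, here is a minimal sketch in Python using NumPy (the vectors simply reproduce the examples above; any linear algebra library would do):

import numpy as np

u = np.array([2, 1])
v = np.array([1, 2])

print(u + v)    # elementwise sum: [3 3]
print(3 * u)    # scalar multiple:  [6 3]

# the "basis vector" view: [2, 3] = 2i + 3j
i, j = np.array([1, 0]), np.array([0, 1])
print(np.array_equal(2 * i + 3 * j, np.array([2, 3])))    # True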
Definition 1.9. Two vectors u and v in Rn are parallel if and only if there exists a real number c ∈ R\{0}
such that
u = cv
Definition 1.10. Given vectors v_1, v_2, \cdots, v_k and scalars c_1, c_2, \cdots, c_k, the vector y defined by
y = c_1 v_1 + \cdots + c_k v_k
is called a linear combination of v_1, \cdots, v_k with weights c_1, \cdots, c_k.
Definition 1.13. A set of vectors v_1, \cdots, v_k ∈ R^n is linearly dependent if and only if at least one of the vectors can be written as a linear combination of the others. If no vector is a linear combination of the others, the set of vectors is linearly independent.
Remark. A set of vectors v_1, \cdots, v_k ∈ R^n is linearly independent if and only if the vector equation
c_1 v_1 + c_2 v_2 + \cdots + c_k v_k = 0_n
has only the trivial solution c_1 = c_2 = \cdots = c_k = 0.
Aside. Let us assume that v_1, \cdots, v_k ∈ R^n are linearly dependent, so there exist scalars c_1, \cdots, c_k, not all zero, with c_1 v_1 + \cdots + c_k v_k = 0_n. WLOG, there exists i ∈ {1, \cdots, k} such that c_i ≠ 0. Then
c_i v_i + \sum_{j=1,\, j \neq i}^{k} c_j v_j = 0_n \quad \text{and} \quad v_i = -\sum_{j=1,\, j \neq i}^{k} \frac{c_j}{c_i} v_j
Definition. A set of vectors v_1, \cdots, v_k ∈ R^n is a basis for a subspace W if and only if
1. v_1, \cdots, v_k spans W, and
2. v_1, \cdots, v_k are linearly independent.
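Aside. A quick numerical way to check linear independence (and hence whether a set of vectors can serve as a basis for its span) is to stack the vectors as the columns of a matrix and compute its rank: the columns are linearly independent exactly when the rank equals the number of columns. A minimal sketch in Python with NumPy; the vectors are arbitrary illustrations, not taken from the notes:

import numpy as np

V = np.column_stack([[2, 1, 0], [1, 2, 0]])    # two vectors in R^3 as columns
W = np.column_stack([[2, 1, 0], [4, 2, 0]])    # second column is 2 times the first

print(np.linalg.matrix_rank(V))    # 2 -> columns are linearly independent
print(np.linalg.matrix_rank(W))    # 1 -> columns are linearly dependent (parallel)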
Aside. At this point, I highly encourage everyone to watch “Lecture 3: Essence of linear algebra” by 3b1b if you haven’t already watched it. If you have ever been confused about some notions in linear algebra, this video may provide you with a different way to interpret and understand them.
2 Systems of Linear Equations and Matrix Equations
Definition 2.1. A linear equation in the variables x1 , x2 , · · · , xn is an equation that can be written in
the form
a_1 x_1 + a_2 x_2 + \cdots + a_n x_n = b
where b and the coefficients a1 , . . . , an are real or complex numbers. A system of linear equations is a
collection of one or more linear equations involving the same variables.
Consider, for example, the following system of linear equations:
x_1 − 2x_2 + x_3 = 0
2x_2 − 8x_3 = 8
−4x_1 + 5x_2 + 9x_3 = −9
Definition 2.3. The compact rectangular array containing the coefficients of each variable aligned in
columns is called the coefficient matrix.
Example 2.4. From the system of linear equations above, the coefficient matrix is
\begin{pmatrix} 1 & -2 & 1 \\ 0 & 2 & -8 \\ -4 & 5 & 9 \end{pmatrix}
Definition 2.5. The coefficient matrix, concatenated with a column consisting of the right-hand-side
constants is called the augmented matrix.
Example 2.6. From the system of linear equations above, the augmented matrix is
\left[\begin{array}{rrr|r} 1 & -2 & 1 & 0 \\ 0 & 2 & -8 & 8 \\ -4 & 5 & 9 & -9 \end{array}\right]
Definition 2.7. The size of a matrix tells how many rows and columns it has. An m × n matrix has m
rows and n columns:
\begin{pmatrix} a_{11} & a_{12} & a_{13} & \cdots & a_{1n} \\ a_{21} & a_{22} & a_{23} & \cdots & a_{2n} \\ a_{31} & a_{32} & a_{33} & \cdots & a_{3n} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ a_{m1} & a_{m2} & a_{m3} & \cdots & a_{mn} \end{pmatrix}
Definition 2.9. The three basic operations that are used to solve the system of equations are known as elementary row operations. These operations consist of:
• Replacement - replace one row by the sum of itself and a multiple of another row
• Interchange - swap two rows
• Scaling - multiply all entries in a row by a nonzero constant
Example 2.10. Consider the augmented matrix from our system above:
\left[\begin{array}{rrr|r} 1 & -2 & 1 & 0 \\ 0 & 2 & -8 & 8 \\ -4 & 5 & 9 & -9 \end{array}\right]
Replacing R3 with R3 + 4R1:
\left[\begin{array}{rrr|r} 1 & -2 & 1 & 0 \\ 0 & 2 & -8 & 8 \\ 0 & -3 & 13 & -9 \end{array}\right]
Multiplying R2 by 1/2:
\left[\begin{array}{rrr|r} 1 & -2 & 1 & 0 \\ 0 & 1 & -4 & 4 \\ 0 & -3 & 13 & -9 \end{array}\right]
Replacing R3 with R3 + 3R2:
\left[\begin{array}{rrr|r} 1 & -2 & 1 & 0 \\ 0 & 1 & -4 & 4 \\ 0 & 0 & 1 & 3 \end{array}\right]
Aside. This matrix is now in what we call row echelon form. Don’t worry about the terminology, however; what’s important is the row operations themselves, which may come in handy a few times throughout the year. Also note that not all linear systems have unique solutions; a system may have a unique solution, infinitely many solutions, or no solution.
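Aside. In practice we let software carry out the elimination. A minimal check of the system above in Python with NumPy (np.linalg.solve performs essentially this kind of elimination internally):

import numpy as np

A = np.array([[ 1, -2,  1],
              [ 0,  2, -8],
              [-4,  5,  9]], dtype=float)
b = np.array([0, 8, -9], dtype=float)

x = np.linalg.solve(A, b)
print(x)                       # approximately [29, 16, 3]; note x3 = 3, matching the last row above
print(np.allclose(A @ x, b))   # True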
Definition 2.11. If A is an m×n matrix with columns a1 , · · · , an ∈ Rm , and if x ∈ Rn , then the product
of A and x, denoted Ax, is the linear combination of the columns of A using the corresponding entries in
x as weights
Ax = \begin{pmatrix} a_1 & \cdots & a_n \end{pmatrix} \begin{pmatrix} x_1 \\ \vdots \\ x_n \end{pmatrix} = x_1 a_1 + \cdots + x_n a_n \qquad \text{(Column Expansion)}
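Aside. The column-expansion view of Ax is easy to verify numerically; a minimal sketch reusing the coefficient matrix from above with an arbitrary x:

import numpy as np

A = np.array([[ 1, -2,  1],
              [ 0,  2, -8],
              [-4,  5,  9]], dtype=float)
x = np.array([29.0, 16.0, 3.0])

direct = A @ x                                                 # matrix-vector product
expansion = x[0] * A[:, 0] + x[1] * A[:, 1] + x[2] * A[:, 2]   # weighted sum of the columns

print(np.allclose(direct, expansion))    # True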
Example. Recall the system of linear equations from above:
x_1 − 2x_2 + x_3 = 0
2x_2 − 8x_3 = 8
−4x_1 + 5x_2 + 9x_3 = −9
Let A be the coefficient matrix:
A = \begin{pmatrix} 1 & -2 & 1 \\ 0 & 2 & -8 \\ -4 & 5 & 9 \end{pmatrix}
Then we can rewrite the linear system as a matrix equation Ax = b:
\begin{pmatrix} 1 & -2 & 1 \\ 0 & 2 & -8 \\ -4 & 5 & 9 \end{pmatrix} \begin{pmatrix} x_1 \\ x_2 \\ x_3 \end{pmatrix} = \begin{pmatrix} 0 \\ 8 \\ -9 \end{pmatrix}
Aside. We will work with matrix equations quite a bit during the first year, especially when we learn a regression framework. However, we will almost never apply elementary row operations to convert matrices into reduced row echelon form by hand. We are revisiting them because they are taught in a linear algebra course and we’d be remiss if we didn’t know what they are. The key takeaway is that these elementary row operations do not alter the solution set. It is a good time to start thinking about how matrix equations are written and what they mean; the regression framework, for example, is itself a matrix equation:
y = Xβ + u
3 Matrix Operations
Definition 3.1. Let A and B be m × n matrices. Let aij and bij denote the elements in the ith row and
jth column of the matrices A and B, respectively. Then A and B are equal, denoted A = B, if and only
if aij = bij for all i and all j.
Aside. Note that this is simply an extension of our definition for vector equality; indeed, we could define
matrix equality in terms of vector equality, i.e., two matrices are equal if and only if their corresponding
vectors are equal.
Definition 3.2. If A and B are both m × n matrices, then the sum A + B is the m × n matrix whose
elements are the sums of the corresponding elements from A and B. The difference A − B is the m × n
matrix whose elements are the differences of the corresponding elements of A and B.
A + B = \begin{pmatrix} a_{11} & a_{12} & \cdots & a_{1n} \\ a_{21} & a_{22} & \cdots & a_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ a_{m1} & a_{m2} & \cdots & a_{mn} \end{pmatrix} + \begin{pmatrix} b_{11} & b_{12} & \cdots & b_{1n} \\ b_{21} & b_{22} & \cdots & b_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ b_{m1} & b_{m2} & \cdots & b_{mn} \end{pmatrix} = \begin{pmatrix} a_{11}+b_{11} & a_{12}+b_{12} & \cdots & a_{1n}+b_{1n} \\ a_{21}+b_{21} & a_{22}+b_{22} & \cdots & a_{2n}+b_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ a_{m1}+b_{m1} & a_{m2}+b_{m2} & \cdots & a_{mn}+b_{mn} \end{pmatrix}
Definition 3.3. If k is a scalar and A is a matrix, then the scalar multiple kA is the matrix whose
elements are k times the corresponding elements in A.
kA = \begin{pmatrix} ka_{11} & ka_{12} & \cdots & ka_{1n} \\ ka_{21} & ka_{22} & \cdots & ka_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ ka_{m1} & ka_{m2} & \cdots & ka_{mn} \end{pmatrix}
Definition 3.4. If A is an m × n matrix and B is an n × k matrix with columns b_1, \cdots, b_k, then the product AB is the m × k matrix whose columns are Ab_1, \cdots, Ab_k. Note that the number of columns of A must match the number of rows in B for the matrices to be conformable.
Example 3.5. Suppose we have two matrices A_{1×2} and B_{2×3}. Then AB is permissible, but BA is not: the product A_{1×2} B_{2×3} = C_{1×3} is defined because the inner dimensions match, whereas BA would require the number of columns of B (three) to equal the number of rows of A (one).
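Aside. Conformability is exactly what linear algebra software checks before multiplying. A minimal sketch in Python with NumPy, using hypothetical 1 × 2 and 2 × 3 matrices (the entries are arbitrary):

import numpy as np

A = np.array([[1.0, 2.0]])              # 1 x 2
B = np.array([[1.0, 0.0, 2.0],
              [3.0, 1.0, 4.0]])         # 2 x 3

C = A @ B
print(C.shape)    # (1, 3): A(1x2) B(2x3) = C(1x3)

try:
    B @ A         # inner dimensions (3 and 1) do not match
except ValueError as err:
    print("BA is not conformable:", err)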
Aside. There are a few special matrices with which you should be familiar. For example:
1. Identity Matrix (a diagonal matrix of ones):
I_n = \begin{pmatrix} 1 & 0 & \cdots & 0 \\ 0 & 1 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & 1 \end{pmatrix}
2. Null (Zero) Matrix (a matrix of all zeros), which satisfies
A + 0 = A, \quad 0 + A = A, \quad 0A = 0, \quad A0 = 0
3. Idempotent Matrix (a matrix that equals its own square):
A = AA
Theorem 3.6. Let A be an m × n matrix, and let B and C have dimensions for which the indicated sums and products are defined. Then
1. A(BC) = (AB)C
2. A(B + C) = AB + AC and (B + C)A = BA + CA
3. r(AB) = (rA)B = A(rB) for any scalar r
4. I_m A = A = A I_n
Aside. Note that in matrix algebra, we do not have commutativity; that is, AB ̸= BA in general. It
may be true of specific matrices, but in most cases (as in an example above), the matrices won’t even be
conformable in both orders.
Definition 3.7. Given an m × n matrix A, the transpose of A is the n × m matrix, denoted A^T or A′, whose columns are formed from the corresponding rows of A.
Example 3.8. Consider the following matrices:
A = \begin{pmatrix} 3 & 8 & -9 \\ 1 & 0 & 4 \end{pmatrix} \qquad B = \begin{pmatrix} 3 & 4 \\ 1 & 7 \end{pmatrix}
A^T = \begin{pmatrix} 3 & 1 \\ 8 & 0 \\ -9 & 4 \end{pmatrix} \qquad B^T = \begin{pmatrix} 3 & 1 \\ 4 & 7 \end{pmatrix}
Remark. A property of matrices that occasionally comes in handy is symmetry, which means A = AT .
Later on, we’ll talk about Hessian matrices, which have this property.
Theorem 3.9. Let A and B denote matrices whose sizes are appropriate for the following sums and
products.
(AT )T = A
(A + B)T = AT + B T
(AB)T = B T AT
Definition 3.10. The determinant of a square matrix A, denoted by |A| or det(A), is a uniquely defined
scalar associated with the matrix. For matrices of various sizes:
1. 1 × 1 Matrix: A = [a], so |A| = a
2. 2 × 2 Matrix: A = \begin{pmatrix} a & b \\ c & d \end{pmatrix}, so
|A| = \begin{vmatrix} a & b \\ c & d \end{vmatrix} = ad − bc
The determinant has several useful properties:
1. The determinant of a matrix equals the determinant of its transpose:
|A| = |A^T|
2. Scaling a row by k will change the value of the determinant k-fold, e.g.:
\begin{vmatrix} ka & kb \\ c & d \end{vmatrix} = k \begin{vmatrix} a & b \\ c & d \end{vmatrix}
3. Replacement - adding a multiple of a row (or column) to another row (or column) - leaves the determinant unchanged, e.g.:
\begin{vmatrix} a & b \\ c + ka & d + kb \end{vmatrix} = \begin{vmatrix} a & b \\ c & d \end{vmatrix}
4. Interchanging any two rows (or columns) reverses the sign of the determinant (but does not change the absolute value), e.g.:
\begin{vmatrix} a & b \\ c & d \end{vmatrix} = -\begin{vmatrix} c & d \\ a & b \end{vmatrix}
Definition 3.12. A matrix A is singular if and only if
|A| = 0
which occurs exactly when its rows (or columns) are linearly dependent.
Remark. I’m bringing this to your attention primarily for econometric reasons; perfect collinearity means that one of your explanatory variables is a linear combination of the others. In this case, STATA, R, or whatever program you ultimately use to run regressions will fail (as they use matrix algebra to do OLS).
Definition 3.13. An n × n matrix A is said to be invertible if there is an n × n matrix C such that
CA = I_n and AC = I_n
In this case, C is the inverse of A. In fact, the inverse is unique, so it is typically denoted A^{-1}.
Theorem 3.14. If A is a 2 × 2 matrix
A = \begin{pmatrix} a & b \\ c & d \end{pmatrix}
and ad − bc ≠ 0 (i.e., the determinant is non-zero/the matrix is non-singular), then the inverse is given by the formula:
A^{-1} = \frac{1}{ad - bc} \begin{pmatrix} d & -b \\ -c & a \end{pmatrix}
See Appendix 2 for a larger matrix.
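Aside. The 2 × 2 formula is easy to check against a numerical inverse; a minimal sketch in Python with NumPy, using an arbitrary invertible matrix:

import numpy as np

A = np.array([[3.0, 1.0],
              [4.0, 2.0]])
a, b, c, d = A.ravel()

det = a * d - b * c                                # ad - bc = 2
formula = (1.0 / det) * np.array([[ d, -b],
                                  [-c,  a]])

print(np.isclose(det, np.linalg.det(A)))           # True
print(np.allclose(formula, np.linalg.inv(A)))      # True
print(np.allclose(A @ formula, np.eye(2)))         # AC = I_2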
Theorem. Let A and B be invertible n × n matrices. Then:
1. A^{-1} is invertible and
(A^{-1})^{-1} = A
2. AB is invertible and
(AB)^{-1} = B^{-1} A^{-1}
3. A^T is invertible and
(A^T)^{-1} = (A^{-1})^T
4 Connecting Concepts
Theorem 4.1. Let A be an n × n matrix. Then the following statements are equivalent:
1. |A| ≠ 0
2. A is invertible
Remark. For an m × n matrix A, the rank satisfies rank(A) ≤ min{m, n}.
5 Quadratic Forms
Definition 5.1. A quadratic form on Rn is a function Q defined on Rn whose value at a vector x ∈ Rn
can be written in the form
Q(x) = xT Ax
where A is an n × n symmetric matrix.
Aside. The matrix associated with a quadratic form need not be symmetric; however, there is no loss of generality in assuming that A is symmetric. We can always take definite and semidefinite matrices to be symmetric since they are defined by a quadratic form. Specifically, given a nonsymmetric matrix A, replace it with \frac{1}{2}(A + A^T); the new matrix is symmetric and yields the same Q(x).
Aside. Frequently, you will also see quadratic forms written in double-sum notation:
x^T A x = \sum_{i=1}^{n} \sum_{j=1}^{n} a_{ij} x_i x_j
where a_{ij} are the elements of A, which is symmetric, i.e., a_{ij} = a_{ji} for all i, j.
If the off-diagonal elements are all zero, then
x^T A x = \sum_{i=1}^{n} a_{ii} x_i^2
Quadratic forms can also be positive semi-definite or negative semi-definite if we replace the strict inequalities with weak inequalities. A matrix A that satisfies the corresponding definite or semi-definite condition is called a positive (semi)definite or negative (semi)definite matrix.
Example. Consider the symmetric matrix
A = \begin{pmatrix} 2 & -1 \\ -1 & 2 \end{pmatrix}
This defines a positive definite quadratic form. For example, if x^T = (1, 1):
x^T A x = \begin{pmatrix} 1 & 1 \end{pmatrix} \begin{pmatrix} 2 & -1 \\ -1 & 2 \end{pmatrix} \begin{pmatrix} 1 \\ 1 \end{pmatrix} = 2
Similarly, if x^T = (-3, 2):
x^T A x = \begin{pmatrix} -3 & 2 \end{pmatrix} \begin{pmatrix} 2 & -1 \\ -1 & 2 \end{pmatrix} \begin{pmatrix} -3 \\ 2 \end{pmatrix} = 38
Indeed, any vector in R^2 (besides the zero vector) will produce a strictly positive value.
Example 5.5. We can prove that a positive definite matrix A_{n×n} is nonsingular. In other words, since P ⇒ Q is equivalent to ¬P ∨ Q, it suffices to rule out the case where A is positive definite and singular: if A were singular, there would exist x ≠ 0_n with Ax = 0_n, so that x^T A x = 0, contradicting the requirement that x^T A x > 0 for all x ≠ 0_n.
Definition 5.6. Let A be an n × n matrix with elements a_{ij}. A principal minor is a minor produced by deleting rows and the correspondingly numbered columns, i.e., the rows and columns associated with elements for which i = j. A kth-order principal minor is produced by deleting n − k such rows and columns, leaving a kth-order determinant.
Example. Consider a generic 3 × 3 matrix A with elements a_{ij}. The 1st-order principal minors are simply the diagonal elements a_{11}, a_{22}, and a_{33}. The 2nd-order principal minors (each obtained by deleting one row and the corresponding column) are:
\begin{vmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{vmatrix} = a_{11}a_{22} − a_{12}a_{21} (deleting the 3rd row/column)
\begin{vmatrix} a_{11} & a_{13} \\ a_{31} & a_{33} \end{vmatrix} = a_{11}a_{33} − a_{13}a_{31} (deleting the 2nd row/column)
\begin{vmatrix} a_{22} & a_{23} \\ a_{32} & a_{33} \end{vmatrix} = a_{22}a_{33} − a_{23}a_{32} (deleting the 1st row/column)
Definition 5.8. The leading principal minors are the principal minors that are associated with con-
tiguous, upper-left elements of the matrix.
Example 5.9. For the 3 × 3 matrix A in the preceding example, the leading principal minors are
|A_1| = a_{11}, \quad |A_2| = \begin{vmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{vmatrix}, \quad |A_3| = |A|
Aside. Note that in the example, we are using Simon and Blume’s notation. This is by no means consistent
throughout the various mathematical economics textbooks available; indeed, notation and terminology vary
quite a bit. We always start enumerating leading principal minors, however, with element a11 , which is
the upper-left corner of every leading principal minor.
Theorem 5.10. Let A be an n × n symmetric matrix. Then the quadratic form x^T A x is:
• positive definite if and only if all of its leading principal minors are strictly positive
• positive semi-definite if and only if all of its principal minors are non-negative
• negative definite if and only if its leading principal minors alternate in sign starting negative, i.e., all odd-order leading principal minors are < 0 and all even-order leading principal minors are > 0
• negative semi-definite if and only if all of its principal minors of odd order are ≤ 0 and all of its principal minors of even order are ≥ 0
• indefinite if none of the above hold
Example 5.11. Consider the matrix
A = \begin{pmatrix} 1 & -1 & 0 \\ -1 & 6 & -2 \\ 0 & -2 & 3 \end{pmatrix}
We can determine the definiteness of the matrix by first evaluating the leading principal minors:
|1| = 1
\begin{vmatrix} 1 & -1 \\ -1 & 6 \end{vmatrix} = 5
\begin{vmatrix} 1 & -1 & 0 \\ -1 & 6 & -2 \\ 0 & -2 & 3 \end{vmatrix} = 11
All the LPMs are positive, so A is positive definite. Note that if one of the leading principal minors was
equal to zero, we’d need to check all of the other PMs for positive semi-definiteness.
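Aside. Both the leading-principal-minor test and the eigenvalue characterization (see the eigenvalue section below) are easy to run numerically; a minimal sketch in Python with NumPy for the matrix in Example 5.11:

import numpy as np

A = np.array([[ 1, -1,  0],
              [-1,  6, -2],
              [ 0, -2,  3]], dtype=float)

# leading principal minors: determinants of the upper-left 1x1, 2x2, 3x3 blocks
lpms = [np.linalg.det(A[:k, :k]) for k in range(1, 4)]
print(np.round(lpms, 6))        # [ 1.  5. 11.] -> all positive

# for a symmetric matrix, positive definite <=> all eigenvalues strictly positive
print(np.linalg.eigvalsh(A))    # all strictly positive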
The material below (connecting definiteness to the Hessian) will be covered in the Analysis session.
6 Eigenvalues and Eigenvectors
Definition 6.1. For an n × n matrix A, the trace is the sum of the elements along the main diagonal:
tr(A) = \sum_{i=1}^{n} a_{ii}
The trace is the sum along the main diagonal, so tr(I_3) = 3 (indeed, the trace of any identity matrix is the dimensionality of that matrix). The trace also satisfies the following useful properties (whenever the indicated sums and products are defined):
tr(A^T) = tr(A), \quad tr(A + B) = tr(A) + tr(B), \quad tr(AB) = tr(BA)
Example 6.4. Given an n × k matrix X where n > k, suppose that rank(X) = k (full rank). Then
tr(I_n − X(X^T X)^{-1} X^T) = tr(I_n) − tr(X(X^T X)^{-1} X^T) (by linearity of the trace)
= n − tr((X^T X)^{-1} X^T X) (using tr(AB) = tr(BA))
= n − tr(I_k)
= n − k
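Aside. The n − k result is easy to confirm numerically; a minimal sketch in Python with NumPy, using a simulated X (the dimensions are arbitrary, and a random Gaussian X has full column rank almost surely):

import numpy as np

rng = np.random.default_rng(0)
n, k = 50, 3
X = rng.normal(size=(n, k))

M = np.eye(n) - X @ np.linalg.inv(X.T @ X) @ X.T
print(np.isclose(np.trace(M), n - k))    # True: the trace equals 50 - 3 = 47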
Example 6.5. Given a vector \hat{u} ∈ R^n,
\hat{u}^T \hat{u} = tr(\hat{u} \hat{u}^T)
Definition 6.6. An eigenvector of an n × n matrix A is a nonzero vector x such that Ax = λx for some scalar λ. A scalar λ is called an eigenvalue of A if there is a nontrivial solution x of Ax = λx. Note that we can rewrite this equation as:
(A − λI_n)x = 0_n
Example. Let
A = \begin{pmatrix} 1 & 6 \\ 5 & 2 \end{pmatrix} \quad \text{and} \quad u = \begin{pmatrix} 6 \\ -5 \end{pmatrix}
We can show that u is an eigenvector of A:
Au = \begin{pmatrix} 1 & 6 \\ 5 & 2 \end{pmatrix} \begin{pmatrix} 6 \\ -5 \end{pmatrix} (the product)
= \begin{pmatrix} 6 - 30 \\ 30 - 10 \end{pmatrix} (multiplying)
= \begin{pmatrix} -24 \\ 20 \end{pmatrix} (simplifying)
= -4 \begin{pmatrix} 6 \\ -5 \end{pmatrix} (factoring)
so u is an eigenvector of A with eigenvalue λ = −4.
Aside. Recall that one of our theorems states that Bx = 0 has a non-trivial solution x if and only if B is singular. Thus, for there to be a non-trivial eigenvector, it must be that (A − λI_n) is singular (even if A itself is not singular):
|A − λI_n| = 0
This equation is known as the characteristic equation of A.
Theorem 6.9. A scalar λ is an eigenvalue of an n × n matrix A if and only if λ satisfies the characteristic equation of A. For a 2 × 2 matrix A = \begin{pmatrix} a & b \\ c & d \end{pmatrix}, the characteristic equation is
|A − λI_2| = \begin{vmatrix} a - λ & b \\ c & d - λ \end{vmatrix} = (a − λ)(d − λ) − bc = 0
Example. Consider the matrix A = \begin{pmatrix} 2 & 2 \\ 2 & -1 \end{pmatrix}. Then
|A − λI_2| = \begin{vmatrix} 2 - λ & 2 \\ 2 & -1 - λ \end{vmatrix} (the characteristic equation)
= (2 − λ)(−1 − λ) − 4 (the determinant)
= λ^2 − λ − 6 (simplifying)
λ^2 − λ − 6 = 0 (setting equal to zero)
(λ − 3)(λ + 2) = 0 (factoring)
Thus, λ_1 = 3 and λ_2 = −2 are the eigenvalues. To find associated eigenvectors, recall that we need (A − λI_2)x = 0_2. Thus, for our first eigenvalue:
(A − λ_1 I_2)x = 0 (our condition)
\begin{pmatrix} 2-3 & 2 \\ 2 & -1-3 \end{pmatrix} \begin{pmatrix} x_1 \\ x_2 \end{pmatrix} = \begin{pmatrix} 0 \\ 0 \end{pmatrix} (plugging in values)
\begin{pmatrix} -1 & 2 \\ 2 & -4 \end{pmatrix} \begin{pmatrix} x_1 \\ x_2 \end{pmatrix} = \begin{pmatrix} 0 \\ 0 \end{pmatrix} (simplifying)
Note that the rows are linear combinations of one another; this reduces to the single equation −x_1 + 2x_2 = 0, so any vector of the form x = c(2, 1)^T with c ≠ 0 is an eigenvector associated with λ_1 = 3.
Definition. A matrix A is diagonalizable if and only if there exists an invertible matrix P such that P^{-1}AP is diagonal.
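Aside. NumPy returns eigenvalues and (unit-length) eigenvectors directly, which gives a quick check of the computation above and of the diagonalization condition; a minimal sketch:

import numpy as np

A = np.array([[2.0, 2.0],
              [2.0, -1.0]])

lam, P = np.linalg.eig(A)    # columns of P are eigenvectors
print(lam)                   # eigenvalues 3 and -2 (possibly in a different order)

# each column of P satisfies Av = lambda v
for i in range(2):
    print(np.allclose(A @ P[:, i], lam[i] * P[:, i]))    # True, True

# P^{-1} A P is diagonal, with the eigenvalues on the diagonal
print(np.round(np.linalg.inv(P) @ A @ P, 6))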
Remark. Let A be an n × n matrix with eigenvalues λ_1, \cdots, λ_n. Then:
1. The determinant of A is the product of the eigenvalues.
2. The trace of A is the sum of the eigenvalues.
3. The trace of A^k is the sum of the eigenvalues raised to the kth power:
tr(A^k) = \sum_{i=1}^{n} λ_i^k
4. Let A^T = A. For eigenpairs (λ_i, x_i) and (λ_j, x_j),
λ_i x_j^T x_i = x_j^T (λ_i x_i) = x_j^T (A x_i) = x_j^T (A^T x_i) = (x_j^T A^T) x_i = (A x_j)^T x_i = (λ_j x_j)^T x_i = λ_j x_j^T x_i
so if λ_i ≠ λ_j, then x_j^T x_i = 0: eigenvectors associated with distinct eigenvalues of a symmetric matrix are orthogonal.
5. Let A^T = A. Then A is:
(a) Positive (semi)definite if and only if the eigenvalues of A are all (weakly) positive.
(b) Negative (semi)definite if and only if the eigenvalues of A are all (weakly) negative.
(c) Indefinite if and only if A has both positive and negative eigenvalues.
Example. Given an n × k matrix X with full column rank, define
M_X := I_n − X(X^T X)^{-1} X^T
Then:
1. M_X is symmetric.
2. M_X is idempotent.
7 Vector Spaces and Norms
Aside. Building on what we already know about vectors, we can start thinking about vector spaces. Recall that earlier, we mentioned that two linearly independent vectors v, u ∈ R^2 span R^2. Indeed, R^2 is the vector space defined by the various linear combinations of v and u. More generally, R^n is the vector space we usually deal with; however, it is probably a good idea for us to know a more formal definition.
Definition 7.1. A vector space is a set of vectors V equipped with addition and scalar multiplication that satisfies the following properties for all vectors u, v, and w in V and all scalars a and b:
• Associativity of addition: u + (v + w) = (u + v) + w
• Commutativity of addition: u + v = v + u
• Additive identity: there exists 0 ∈ V such that v + 0 = v
• Additive inverse: for every v ∈ V there exists −v ∈ V such that v + (−v) = 0
• Distributivity: a(u + v) = au + av and (a + b)v = av + bv
• Compatibility of scalar multiplication: a(bv) = (ab)v
• Scalar identity: 1v = v
Definition. A subset W of a vector space V is a subspace of V if it is closed under the vector space operations:
• 0 ∈ W
• If u, v ∈ W, then u + v ∈ W
• If u ∈ W and c is a scalar, then cu ∈ W
Example 7.5. Consider the matrix
B = \begin{pmatrix} 1 & 0 & -3 & 5 & 0 \\ 0 & 1 & 2 & -1 & 0 \\ 0 & 0 & 0 & 0 & 1 \\ 0 & 0 & 0 & 0 & 0 \end{pmatrix}
The pivot columns of B (the first, second, and fifth columns) form a basis for the column space of B: these are linearly independent, and all other columns of the matrix are linear combinations of these.
Definition 7.6. A metric space (X, d) is a set X and a function d : X × X → R such that for any x, y, z ∈ X, it satisfies the following properties:
• d(x, y) ≥ 0
• d(x, y) = 0 ⇐⇒ x = y
• d(x, y) = d(y, x)
• d(x, z) ≤ d(x, y) + d(y, z) (the triangle inequality)
Definition 7.7. A normed vector space (V, || · ||) is a vector space V and a function || · || : V → R such that for any x, y ∈ V and for any α ∈ R, it satisfies the following properties:
• ||x|| ≥ 0, with ||x|| = 0 if and only if x = 0
• ||αx|| = |α| ||x||
• ||x + y|| ≤ ||x|| + ||y|| (the triangle inequality)
Aside. Norms are the first step in getting us a notion of distance in a space. While we won’t end up using
norms explicitly all that frequently during the first year, they come about a great deal implicitly, any time
we talk about things like minimum distance, or minimum distance estimators. We typically deal with one
norm, but there are a few others.
Example 7.8. The following are all norms, which may be familiar:
1. Euclidean Norm (usually denoted ||x||_2):
||x|| = \left( \sum_{i=1}^{n} x_i^2 \right)^{1/2}
This is the “as the crow flies” distance from the origin in R^2, and is (by far) the most frequent norm we will use.
2. Manhattan Norm (occasionally denoted ||x||_1):
||x|| = \sum_{i=1}^{n} |x_i|
This is akin to measuring distance from the origin on a map laid out as a grid of streets.
Example 7.9. Consider x, y ∈ R^2. The most common metric we use is the notion of Euclidean distance, i.e.,
d(x, y) = \left( (x_1 − y_1)^2 + (x_2 − y_2)^2 \right)^{1/2}
This is equivalent to ||x − y||_2.
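Aside. NumPy computes both norms directly; a minimal sketch with an arbitrary vector:

import numpy as np

x = np.array([3.0, -4.0])
y = np.array([0.0, 0.0])

print(np.linalg.norm(x))        # Euclidean norm: 5.0
print(np.linalg.norm(x, 1))     # Manhattan norm: 7.0
print(np.sqrt(x @ x))           # the Euclidean norm via the inner product

# the Euclidean metric between two points is the norm of their difference
print(np.linalg.norm(x - y))    # 5.0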
Example 7.10. Let X ⊂ R^n and let S = C(X) be the set of all continuous and bounded functions f : X → R. Define the metric d : S × S → R such that, given any g, h ∈ S,
d(g, h) = \sup_{x ∈ X} |g(x) − h(x)|
4. (Triangle inequality.) Let f, g, h ∈ S and note that d′(s, t) = |s − t| is itself a metric on R. For any c ∈ X,
|f(c) − h(c)| ≤ |f(c) − g(c)| + |g(c) − h(c)| (d′ fulfills the metric space axioms)
≤ \sup_{x ∈ X} |f(x) − g(x)| + \sup_{x ∈ X} |g(x) − h(x)| (definition of the supremum of a real-valued function)
= d(f, g) + d(g, h) (definition of d)
Therefore, d(f, g) + d(g, h) is an upper bound for Y := {|f(c) − h(c)| : c ∈ X}.
Thus, d(f, h) = \sup Y ≤ d(f, g) + d(g, h).
Aside. As we move forward in econometrics, micro theory, or anything else during the first year sequences,
we will implicitly assume that any “minimum distance” is minimizing Euclidean distance unless specifically
told otherwise. A notion of distance, however, will be used implicitly (or explicitly) in classes, talks, papers,
etc., so it is a concept with which you should be familiar.
8 Orthogonal Projections
Definition 8.1. Given a nonzero vector x ∈ Rn and a vector y ∈ Rn , then y may be written as a sum of
two vectors
y = ŷ + û
where ŷ = αx for some scalar α and û is orthogonal to x. Then ŷ is the orthogonal projection of y
onto x, and û is the component of y that is orthogonal to x.
Aside. Alternative terminology would state that ŷ is the orthogonal projection of y onto the span of x.
For us, it does not matter which terminology we use. The first is probably more common, but the second
is perhaps a bit easier to understand graphically.
Theorem 8.2. The α that satisfies the definition of the orthogonal projection is given by
α = \frac{x^T y}{x^T x}
Proof.
f(a) = ||ax − y||^2 = (ax − y)^T (ax − y) = a^2 x^T x − 2a x^T y + y^T y
Setting \frac{\partial f(α)}{\partial a} = 0:
2α x^T x − 2 x^T y = 0
α = \frac{x^T y}{x^T x}
\hat{y} = \frac{x^T y}{x^T x} x
Theorem 8.3. Let θ ∈ (0, π) be the angle between two vectors x, y. Then
x^T y = ||x|| \, ||y|| \cos(θ)
Proof.
αx = \frac{||y|| \cos(θ)}{||x||} x (from the trigonometric functions)
αx = \frac{x^T y}{||x||^2} x (from Theorem 8.2, using ||x||^2 = x^T x)
Equating the two expressions for α gives x^T y = ||x|| \, ||y|| \cos(θ).
Aside. Two vectors x, y ∈ R^n are orthogonal when the angle between them is right (θ = π/2).
Aside. Note that these formulas look virtually identical to our typical matrix formula for the coefficient estimates in OLS. This is not a coincidence; indeed, if we have only one explanatory variable and no constant, this is the exact formula for \hat{β} and the fitted values.
Example 8.4. Consider the two vectors
y = \begin{pmatrix} 7 \\ 6 \end{pmatrix} \quad \text{and} \quad x = \begin{pmatrix} 4 \\ 2 \end{pmatrix}
We can find the orthogonal projection of y onto the span of x, then write y as the sum of two orthogonal vectors, one in span(x) and one orthogonal to x:
x^T y = \begin{pmatrix} 4 & 2 \end{pmatrix} \begin{pmatrix} 7 \\ 6 \end{pmatrix} = 40 (the numerator)
x^T x = \begin{pmatrix} 4 & 2 \end{pmatrix} \begin{pmatrix} 4 \\ 2 \end{pmatrix} = 20 (the denominator)
\hat{y} = \frac{x^T y}{x^T x} x = \frac{40}{20} \begin{pmatrix} 4 \\ 2 \end{pmatrix} (the projection formula)
= \begin{pmatrix} 8 \\ 4 \end{pmatrix} (simplifying)
so \hat{y} ∈ span(x), and \hat{u} = y − \hat{y} = \begin{pmatrix} -1 \\ 2 \end{pmatrix} is orthogonal to x, since x^T \hat{u} = 4(−1) + 2(2) = 0.
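Aside. A minimal numerical version of Example 8.4 in Python with NumPy:

import numpy as np

y = np.array([7.0, 6.0])
x = np.array([4.0, 2.0])

alpha = (x @ y) / (x @ x)           # 40 / 20 = 2
y_hat = alpha * x                   # projection of y onto span(x): [8. 4.]
u_hat = y - y_hat                   # component orthogonal to x:    [-1. 2.]

print(y_hat, u_hat)
print(np.isclose(x @ u_hat, 0.0))   # True: u_hat is orthogonal to x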
Aside. This idea extends to vectors that don’t span the x-axis (it’s just a bit easier to draw). Similarly,
we can think about projections in an analogous way for multiple dimensions (although it is much harder to
draw and much harder to picture mentally). Indeed, higher dimensionality is important–the fitted values
from OLS, for example, are the projection of y onto the column space of the data matrix X. Luckily, we
have theorems to help us handle more dimensions.
9 Orthogonality
Definition 9.1. Let u and v be vectors in Rn . Then the inner product of the vectors is a scalar given
by uT v or u · v.
u^T v = \begin{pmatrix} u_1 & u_2 & \cdots & u_n \end{pmatrix} \begin{pmatrix} v_1 \\ v_2 \\ \vdots \\ v_n \end{pmatrix} = u_1 v_1 + u_2 v_2 + \cdots + u_n v_n
This is also known as the dot product.
Aside. The outer product of the vectors is uvT .
Aside. Note that we can write our usual notion of length, the Euclidean norm, in terms of the inner product:
||x||_2 = \sqrt{x^T x} = \left( \sum_{i=1}^{n} x_i^2 \right)^{1/2}
Definition. Two vectors v, w ∈ R^n are orthogonal if and only if
v^T w = 0
Remark. If x and y are orthogonal, then ||x + y||^2 = (x + y)^T (x + y) = x^T x + 2x^T y + y^T y = ||x||^2 + ||y||^2 (the Pythagorean theorem). The same steps can be used to show that ||x − y||^2 = ||x||^2 + ||y||^2. This implies that ||x + y|| = ||x − y||, i.e., x is equidistant from y and −y: graphically, x lies on the perpendicular bisector of the segment joining −y and y.
Remark. If every vector in a set V is orthogonal to every vector in a set W, then the two sets can share at most the zero vector: V ∩ W = ∅ or V ∩ W = {0}.
Definition 9.6. Let V be a subset of Rn . The orthogonal complement of V , denoted by V ⊥ , is the set of
all vectors w ∈ Rn that are orthogonal to the set V .
Aside.
1. V ⊥ is a subspace of Rn .
2. Span(V ) = (V ⊥ )⊥
Theorem 9.7. (Lay THM 6.3). Let A be an m × n matrix. The orthogonal complement of the row space of A is the null space of A, and the orthogonal complement of the column space of A is the null space of A^T:
(Col(A))^⊥ = Nul(A^T) and (Row(A))^⊥ = Nul(A)
Proof.
Nul(A) = {u ∈ R^n : Au = 0_m}
= {u ∈ R^n : A_i · u = 0 for all i = 1, \cdots, m}
where A_i denotes the ith row vector of A. Therefore, u is in the orthogonal complement of Row(A).
∴ Nul(A) = (Row(A))^⊥.
Similarly,
Nul(A^T) = {u ∈ R^m : A^T u = 0_n}
= {u ∈ R^m : a_i^T u = 0 for all i = 1, \cdots, n}
where a_i denotes the ith column vector of A. Therefore, u is in the orthogonal complement of Col(A).
∴ Nul(A^T) = (Col(A))^⊥.
Definition 9.8. A set of vectors {x1 , . . . , xk } in Rn is an orthogonal set if each pair of distinct vectors
from the set are orthogonal. An orthogonal basis for a subspace W ⊂ Rn is a basis for W that is also
an orthogonal set.
Theorem 9.9. (Lay THM 6.8) Orthogonal decomposition theorem
Let W be a subspace of Rn . Then each y ∈ Rn can be written uniquely in the form
y = ŷ + û
where ŷ is in W and û ∈ W ⊥ . If {x1 , . . . , xk } is any orthogonal basis of W , then
\hat{y} = \frac{y′x_1}{x_1′x_1} x_1 + \cdots + \frac{y′x_k}{x_k′x_k} x_k
and û = y − ŷ. Here, ŷ is the orthogonal projection of y onto W .
Aside. Two quick points are in order. First, this gives us a tool to understand how we can project y onto a higher-dimensional space than the line spanned by a single vector. Second, note that the theorem states that any orthogonal basis can be used; it is the projection itself (not the basis) that is unique. Scale any vector in the basis (or, analogously, any column in the data matrix X) and it does not change the fitted values.
Theorem 9.10. (Lay THM 6.9). Let W be a subspace of R^n, let y ∈ R^n, and let \hat{y} be the orthogonal projection of y onto W. Then \hat{y} is the point in W closest to y:
||y − \hat{y}|| < ||y − v|| for all v ∈ W with v ≠ \hat{y}
Proof.
Let v ∈ W with v ≠ \hat{y}. Note that \hat{y} − v ∈ W (both vectors “live” in W, so their difference does as well). Further, note that y − \hat{y} ∈ W^⊥ (by definition of the orthogonal projection). Thus:
||y − v||^2 = ||y − \hat{y}||^2 + ||\hat{y} − v||^2 (by the Pythagorean Theorem)
||\hat{y} − v|| > 0 (v ≠ \hat{y} and by def. of a norm)
⇒ ||y − v||^2 > ||y − \hat{y}||^2 (by ||\hat{y} − v|| > 0)
⇒ ||y − v|| > ||y − \hat{y}|| (by non-negativity of norms)
10 OLS as a Projection
Definition 10.1. If X is an n × k matrix and y ∈ R^n, a least-squares solution of y = Xβ + u is a \hat{β} ∈ R^k such that
||y − X\hat{β}|| ≤ ||y − Xb|| for all b ∈ R^k
Aside. Note that no matter what, the answer we pick is going to “live” in the column space of X, Col(X). Thus, we’re trying to find the point in Col(X) closest to y. To do this, we’ll use our best approximation theorem (Theorem 9.10), which tells us that the closest point is the orthogonal projection. Since \hat{y} ∈ Col(X), a solution to the system X\hat{β} = \hat{y} will exist.
Theorem 10.2. (Lay THM 6.14). Let X be an n × k matrix. The following statements are logically equivalent:
1. \hat{β} is a least-squares solution of y = Xβ + u.
2. \hat{β} satisfies the normal equations X^T X \hat{β} = X^T y.
Proof.
By the orthogonal decomposition theorem, \hat{y} = X\hat{β} is the orthogonal projection of y onto Col(X) exactly when y − X\hat{β} ∈ (Col(X))^⊥ = Nul(X^T). Thus:
0_k = X^T (y − X\hat{β}) (plugging in for \hat{y})
0_k = X^T y − X^T X \hat{β} (distributing)
\hat{β} = (X^T X)^{-1} X^T y (solving for \hat{β}, when X^T X is invertible)
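Aside. A minimal sketch of the least-squares formula in Python with NumPy; the data are simulated purely for illustration, and the true coefficient vector is an arbitrary choice:

import numpy as np

rng = np.random.default_rng(1)
n, k = 200, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])    # include a constant
beta = np.array([1.0, 2.0, -0.5])
y = X @ beta + rng.normal(size=n)

# normal equations: (X'X) beta_hat = X'y
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(beta_hat)    # close to (1, 2, -0.5)

# NumPy's own least-squares routine gives the same answer
print(np.allclose(beta_hat, np.linalg.lstsq(X, y, rcond=None)[0]))    # True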
Definition 10.3. Let X be an n × k matrix of full rank and let y ∈ R^n. Then the matrix
P_X = X(X^T X)^{-1} X^T
is called the projection matrix associated with the column space of X.
Theorem 10.4. Let PX be the projection matrix where PX = X(XT X)−1 XT . Then PX is
1. Symmetric
2. Idempotent
3. Positive Semidefinite
Definition 10.5. Let P_X be the projection matrix associated with the column space of an n × k matrix X. Then the annihilator matrix is defined as
M_X = I_n − P_X
Remark. Given y = Xβ + u and the least-squares solution \hat{β},
1. \hat{y} = P_X y = X\hat{β} (the fitted values)
2. M_X X = 0
3. \hat{u} = M_X y = M_X u and \hat{u}^T \hat{u} = u^T M_X u
4. \hat{u} ∈ Nul(X^T) and X^T \hat{u} = 0
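Aside. These properties of the projection and annihilator matrices are easy to inspect numerically; a minimal sketch in Python with NumPy, using simulated X and y (for illustration only):

import numpy as np

rng = np.random.default_rng(2)
n, k = 100, 2
X = rng.normal(size=(n, k))
y = rng.normal(size=n)

P = X @ np.linalg.inv(X.T @ X) @ X.T    # projection matrix
M = np.eye(n) - P                       # annihilator matrix

print(np.allclose(M, M.T))              # symmetric
print(np.allclose(M @ M, M))            # idempotent
print(np.allclose(M @ X, 0))            # M_X annihilates X
print(np.allclose(X.T @ (M @ y), 0))    # residuals are orthogonal to the columns of X
print(np.allclose(P @ y + M @ y, y))    # y = y_hat + u_hat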
Aside. Depending on the presentation of OLS, the projection and annihilator matrices are important tools
in developing the theory behind econometrics. Even if we focus on a more statistics-motivated approach
to OLS, it is valuable to understand these matrices and what they do.
11 Matrix Differentiation
Aside. We won’t get into the details, but it is occasionally helpful to employ differentiation at a matrix level; for example, we can derive an estimator \hat{β} for the regression equation y = Xβ + u by minimizing u′u via matrix differentiation. Here are a few of the formulas you’ll likely run across:
1. Scalar-by-Vector: y ∈ R and x ∈ R^n
(a) In general:
\frac{\partial y}{\partial x} = \begin{pmatrix} \frac{\partial y}{\partial x_1} \\ \frac{\partial y}{\partial x_2} \\ \vdots \\ \frac{\partial y}{\partial x_n} \end{pmatrix} \quad \text{and} \quad \frac{\partial y}{\partial x^T} = \begin{pmatrix} \frac{\partial y}{\partial x_1} & \frac{\partial y}{\partial x_2} & \cdots & \frac{\partial y}{\partial x_n} \end{pmatrix}
2. Vector-by-Vector: y ∈ R^m and x ∈ R^n
(a) In general:
\frac{\partial y}{\partial x} = \begin{pmatrix} \frac{\partial y_1}{\partial x_1} & \frac{\partial y_1}{\partial x_2} & \cdots & \frac{\partial y_1}{\partial x_n} \\ \frac{\partial y_2}{\partial x_1} & \frac{\partial y_2}{\partial x_2} & \cdots & \frac{\partial y_2}{\partial x_n} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial y_m}{\partial x_1} & \frac{\partial y_m}{\partial x_2} & \cdots & \frac{\partial y_m}{\partial x_n} \end{pmatrix}
(c) Quadratic form:
\frac{\partial x^T A x}{\partial x} = (A + A^T)x (if A is not symmetric)
\frac{\partial x^T A x}{\partial x} = 2Ax (if A is symmetric)
Example 11.1. Ordinary Least Squares. Let S(b) = (y − Xb)^T (y − Xb) be the sum of squared residuals. Then
S(b) = y^T y − b^T X^T y − y^T X b + b^T X^T X b
= y^T y − 2 b^T X^T y + b^T X^T X b (since y^T X b is a scalar, y^T X b = b^T X^T y)
\frac{\partial S(b)}{\partial b} = −2 X^T y + 2 X^T X b (using \frac{\partial (b^T a)}{\partial b} = a and the quadratic-form rule above)
Setting \frac{\partial S(\hat{β})}{\partial b} = 0:
(X^T X)\hat{β} − X^T y = 0
\hat{β} = (X^T X)^{-1} X^T y
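Aside. The matrix derivative used above can be sanity-checked against a finite-difference approximation; a minimal sketch in Python with NumPy (the simulated data and the evaluation point b are arbitrary):

import numpy as np

rng = np.random.default_rng(3)
n, k = 60, 3
X = rng.normal(size=(n, k))
y = rng.normal(size=n)
b = rng.normal(size=k)

S = lambda bb: (y - X @ bb) @ (y - X @ bb)        # S(b) = (y - Xb)'(y - Xb)
grad_analytic = -2 * X.T @ y + 2 * X.T @ X @ b    # the formula derived above

eps = 1e-6
grad_numeric = np.array([(S(b + eps * e) - S(b - eps * e)) / (2 * eps) for e in np.eye(k)])

print(np.allclose(grad_analytic, grad_numeric, atol=1e-3))    # True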
12 Appendices
12.1 Appendix 1: determinant
1. 3 × 3 Matrix: A = \begin{pmatrix} a_{11} & a_{12} & a_{13} \\ a_{21} & a_{22} & a_{23} \\ a_{31} & a_{32} & a_{33} \end{pmatrix}
First method: multiply the three elements along each right-leaning diagonal and sum; multiply the three elements along each left-leaning diagonal and subtract:
|A| = a_{11}a_{22}a_{33} + a_{12}a_{23}a_{31} + a_{13}a_{32}a_{21} − a_{13}a_{22}a_{31} − a_{12}a_{21}a_{33} − a_{11}a_{32}a_{23}
Second method: Laplace Expansion, built from minors and cofactors. The minor |M_{ij}| associated with the element a_{ij} is the determinant of the submatrix formed by deleting the ith row and jth column. For example, the minor associated with a_{22} is
|M_{22}| = \begin{vmatrix} a_{11} & a_{13} \\ a_{31} & a_{33} \end{vmatrix}
Definition 12.3. The cofactor |Cij | is a minor modified by a prescribed algebraic sign that follows
the convention:
|Cij | ≡ (−1)i+j |Mij |
Definition 12.4. The determinant found by Laplace Expansion is the value found by “expanding”
along any row or column:
|A| = \sum_{j=1}^{n} a_{ij} |C_{ij}| (expansion by the ith row)
|A| = \sum_{i=1}^{n} a_{ij} |C_{ij}| (expansion by the jth column)
Example 12.5. Considering once again the 3 × 3 matrix C (defined above), the determinant can be found by performing the Laplace Expansion along the first row.
12.2 Appendix 2: inverse
Example 12.7. Suppose we have an invertible matrix
A = \begin{pmatrix} a_{11} & a_{12} & a_{13} \\ a_{21} & a_{22} & a_{23} \\ a_{31} & a_{32} & a_{33} \end{pmatrix}
Then we could find the inverse of A by performing row operations on the augmented matrix
\left[\, A \;\; I_3 \,\right] = \left[\begin{array}{ccc|ccc} a_{11} & a_{12} & a_{13} & 1 & 0 & 0 \\ a_{21} & a_{22} & a_{23} & 0 & 1 & 0 \\ a_{31} & a_{32} & a_{33} & 0 & 0 & 1 \end{array}\right]
until the left-hand block becomes I_3; the right-hand block is then A^{-1}.
Theorem 12.8. If A is an n × n invertible matrix, the inverse may be found by using the determinant and cofactors of the matrix A.
Definition 12.9. The cofactor matrix of A is a matrix defined by replacing the elements of A with
their associated cofactors:
C = \begin{pmatrix} |C_{11}| & |C_{12}| & \cdots & |C_{1n}| \\ |C_{21}| & |C_{22}| & \cdots & |C_{2n}| \\ \vdots & \vdots & \ddots & \vdots \\ |C_{n1}| & |C_{n2}| & \cdots & |C_{nn}| \end{pmatrix}
Example. Consider the matrix
A = \begin{pmatrix} 0 & 1 & 0 \\ 2 & 1 & 2 \\ 4 & 0 & 0 \end{pmatrix}
First, check that A is invertible:
|A| = 0 \begin{vmatrix} 1 & 2 \\ 0 & 0 \end{vmatrix} − 1 \begin{vmatrix} 2 & 2 \\ 4 & 0 \end{vmatrix} + 0 \begin{vmatrix} 2 & 1 \\ 4 & 0 \end{vmatrix} (expanding on the 1st row)
= 0 − 1(0 − 8) + 0 = 8 (simplifying)
Since |A| ≠ 0, we can invert this matrix. Next, find the cofactor matrix C_A:
C_A = \begin{pmatrix} \begin{vmatrix} 1 & 2 \\ 0 & 0 \end{vmatrix} & -\begin{vmatrix} 2 & 2 \\ 4 & 0 \end{vmatrix} & \begin{vmatrix} 2 & 1 \\ 4 & 0 \end{vmatrix} \\ -\begin{vmatrix} 1 & 0 \\ 0 & 0 \end{vmatrix} & \begin{vmatrix} 0 & 0 \\ 4 & 0 \end{vmatrix} & -\begin{vmatrix} 0 & 1 \\ 4 & 0 \end{vmatrix} \\ \begin{vmatrix} 1 & 0 \\ 1 & 2 \end{vmatrix} & -\begin{vmatrix} 0 & 0 \\ 2 & 2 \end{vmatrix} & \begin{vmatrix} 0 & 1 \\ 2 & 1 \end{vmatrix} \end{pmatrix} (the cofactor matrix)
C_A = \begin{pmatrix} 0 & 8 & -4 \\ 0 & 0 & 4 \\ 2 & 0 & -2 \end{pmatrix} (simplifying)
The adjoint of A is the transpose of the cofactor matrix, adj(A) = C_A^T, and the inverse is
A^{-1} = \frac{1}{|A|} \operatorname{adj}(A) = \frac{1}{8} \begin{pmatrix} 0 & 0 & 2 \\ 8 & 0 & 0 \\ -4 & 4 & -2 \end{pmatrix}
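Aside. The cofactor computation above is easy to check directly; a minimal sketch in Python with NumPy for the same matrix:

import numpy as np

A = np.array([[0.0, 1.0, 0.0],
              [2.0, 1.0, 2.0],
              [4.0, 0.0, 0.0]])

def cofactor_matrix(A):
    n = A.shape[0]
    C = np.empty_like(A)
    for i in range(n):
        for j in range(n):
            minor = np.delete(np.delete(A, i, axis=0), j, axis=1)
            C[i, j] = (-1) ** (i + j) * np.linalg.det(minor)
    return C

C = cofactor_matrix(A)
print(np.round(C))    # matches the cofactor matrix computed above

adj = C.T             # the adjoint (transpose of the cofactor matrix)
print(np.allclose(np.linalg.inv(A), adj / np.linalg.det(A)))    # True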
12.3 Appendix 3: Cramer’s Rule
Cramer’s Rule expresses each unknown of a square linear system as a ratio of determinants: replace the column of the coefficient matrix corresponding to that unknown with the right-hand-side vector, take the determinant, and divide by the determinant of the coefficient matrix (which must be non-zero).
Example 12.12. Consider a two-commodity, linear market model. For the first market:
Qd1 = a0 + a1 P1 + a2 P2 (demand 1)
Qs1 = b0 + b1 P1 + b2 P2 (supply 1)
0 = Qd1 − Qs1 (S = D)
Qd2 = α0 + α1 P1 + α2 P2 (demand 2)
Qs2 = β0 + β1 P1 + β2 P2 (supply 2)
0 = Qd2 − Qs2 (S = D)
Setting demand equal to supply in each market gives
0 = (a_0 − b_0) + (a_1 − b_1)P_1 + (a_2 − b_2)P_2 (for market 1)
0 = (α_0 − β_0) + (α_1 − β_1)P_1 + (α_2 − β_2)P_2 (for market 2)
Letting c_i := a_i − b_i and γ_i := α_i − β_i, the system can be written as
c_1 P_1 + c_2 P_2 = −c_0 (market 1)
γ_1 P_1 + γ_2 P_2 = −γ_0 (market 2)
Solving for P_1^* (the market clearing price for good 1):
P_1^* = \frac{\begin{vmatrix} -c_0 & c_2 \\ -γ_0 & γ_2 \end{vmatrix}}{\begin{vmatrix} c_1 & c_2 \\ γ_1 & γ_2 \end{vmatrix}} = \frac{c_2 γ_0 − c_0 γ_2}{c_1 γ_2 − c_2 γ_1} (by Cramer’s Rule)
Solving for P_2^* (the market clearing price for good 2):
P_2^* = \frac{\begin{vmatrix} c_1 & -c_0 \\ γ_1 & -γ_0 \end{vmatrix}}{\begin{vmatrix} c_1 & c_2 \\ γ_1 & γ_2 \end{vmatrix}} = \frac{c_0 γ_1 − c_1 γ_0}{c_1 γ_2 − c_2 γ_1} (by Cramer’s Rule)
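Aside. With numbers in place of the symbolic coefficients, Cramer’s Rule is one determinant ratio per unknown; a minimal sketch in Python with NumPy, using hypothetical values of the c_i and γ_i chosen only for illustration:

import numpy as np

# hypothetical coefficients for c1 P1 + c2 P2 = -c0 and g1 P1 + g2 P2 = -g0
c0, c1, c2 = -10.0, 2.0, -1.0
g0, g1, g2 = -15.0, -1.0, 3.0

A = np.array([[c1, c2],
              [g1, g2]])
b = np.array([-c0, -g0])

D = np.linalg.det(A)
P1 = np.linalg.det(np.array([[b[0], c2], [b[1], g2]])) / D    # replace column 1 with b
P2 = np.linalg.det(np.array([[c1, b[0]], [g1, b[1]]])) / D    # replace column 2 with b

print(P1, P2)                                        # approximately 9.0 and 8.0
print(np.allclose(np.linalg.solve(A, b), [P1, P2]))  # True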