3 Analytic Geometry
Draft chapter (September 21, 2018) from "Mathematics for Machine Learning", © 2018 by Marc Peter Deisenroth, A. Aldo Faisal, and Cheng Soon Ong. To be published by Cambridge University Press. Report errata and feedback to https://mml-book.com. Please do not post or distribute this file; please link to https://mml-book.com.
3.1 Norms
Definition 3.1 (Norm). A norm on a vector space V is a function

\| \cdot \| : V \to \mathbb{R} ,   (3.1)
x \mapsto \|x\| ,   (3.2)

which assigns each vector x its length ‖x‖ ∈ ℝ, such that for all λ ∈ ℝ and x, y ∈ V the following hold:

• Absolutely homogeneous: ‖λx‖ = |λ| ‖x‖
• Triangle inequality: ‖x + y‖ ≤ ‖x‖ + ‖y‖
• Positive definite: ‖x‖ ≥ 0 and ‖x‖ = 0 ⟺ x = 0
The Manhattan norm on ℝⁿ is defined for x ∈ ℝⁿ as

\|x\|_1 := \sum_{i=1}^{n} |x_i| ,   (3.3)
where |·| is the absolute value. The left panel of Figure 3.3 indicates all vectors x ∈ ℝ² with ‖x‖₁ = 1. The Manhattan norm is also called ℓ1 norm.
The Euclidean norm of x ∈ ℝⁿ is defined as

\|x\|_2 := \sqrt{\sum_{i=1}^{n} x_i^2} = \sqrt{x^\top x} ,   (3.4)

which computes the Euclidean distance of x from the origin. This norm is called the Euclidean norm. The right panel of Figure 3.3 shows all vectors x ∈ ℝ² with ‖x‖₂ = 1. The Euclidean norm is also called ℓ2 norm.
Remark. Throughout this book, we will use the Euclidean norm (3.4) by default if not stated otherwise. ♦
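A short numerical sketch with NumPy (added here for illustration; the vector x is an arbitrary example) showing how the ℓ1 norm (3.3) and the Euclidean norm (3.4) can be evaluated:

import numpy as np

x = np.array([1.0, -2.0, 3.0])          # an arbitrary example vector

l1 = np.sum(np.abs(x))                  # Manhattan / l1 norm, cf. (3.3)
l2 = np.sqrt(x @ x)                     # Euclidean / l2 norm, cf. (3.4)

# NumPy's built-in norm agrees with the explicit formulas.
assert np.isclose(l1, np.linalg.norm(x, ord=1))
assert np.isclose(l2, np.linalg.norm(x, ord=2))
print(l1, l2)                           # 6.0 and ~3.742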
Remark (Inner Products and Norms). Every inner product induces a norm, but there are norms (like the ℓ1 norm) without a corresponding inner product. For an inner product vector space (V, ⟨·,·⟩), the induced norm ‖·‖ satisfies the Cauchy-Schwarz inequality

|\langle x, y \rangle| \le \|x\| \, \|y\| .   (3.5)

♦
3.2 Inner Products

We will refer to this particular inner product, the scalar product xᵀy = x1y1 + · · · + xnyn, as the dot product in this book. However, inner products are more general concepts with specific properties, which we will now introduce.
Here, (3.7) asserts that Ω is linear in the first argument, and (3.8) asserts that Ω is linear in the second argument.

• Ω is called symmetric if Ω(x, y) = Ω(y, x) for all x, y ∈ V, i.e., the order of the arguments does not matter.
• Ω is called positive definite if Ω(x, x) > 0 for all x ∈ V \ {0}.
x = \sum_{i=1}^{n} \psi_i b_i ∈ V and y = \sum_{j=1}^{n} \lambda_j b_j ∈ V for suitable ψi, λj ∈ ℝ. Due to the bilinearity of the inner product, it holds for all x, y ∈ V that

\langle x, y \rangle = \Big\langle \sum_{i=1}^{n} \psi_i b_i , \sum_{j=1}^{n} \lambda_j b_j \Big\rangle = \sum_{i=1}^{n} \sum_{j=1}^{n} \psi_i \langle b_i, b_j \rangle \lambda_j = \hat{x}^\top A \hat{y} ,   (3.11)

where Aij := ⟨bi, bj⟩ and x̂, ŷ are the coordinates of x and y with respect to the basis B. This implies that the inner product ⟨·,·⟩ is uniquely determined through A. The symmetry of the inner product also means that A is symmetric. Furthermore, the positive definiteness of the inner product implies that

\forall x \in V \setminus \{0\} : x^\top A x > 0 .   (3.12)
Definition 3.4 (Symmetric, Positive Definite Matrix). A symmetric matrix A ∈ ℝⁿˣⁿ that satisfies (3.12) is called symmetric, positive definite, or just positive definite. If only ≥ holds in (3.12), then A is called symmetric, positive semidefinite.
Example 3.4 (Symmetric, Positive Definite Matrices)
Consider the following matrices:

A_1 = \begin{bmatrix} 9 & 6 \\ 6 & 5 \end{bmatrix} , \quad A_2 = \begin{bmatrix} 9 & 6 \\ 6 & 3 \end{bmatrix}   (3.13)

Then, A1 is positive definite because it is symmetric and

x^\top A_1 x = \begin{bmatrix} x_1 & x_2 \end{bmatrix} \begin{bmatrix} 9 & 6 \\ 6 & 5 \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \end{bmatrix}   (3.14a)
= 9x_1^2 + 12x_1x_2 + 5x_2^2 = (3x_1 + 2x_2)^2 + x_2^2 > 0   (3.14b)

for all x ∈ V \ {0}. However, A2 is symmetric but not positive definite because xᵀA2x = 9x1² + 12x1x2 + 3x2² = (3x1 + 2x2)² − x2² can be smaller than 0, e.g., for x = [2, −3]ᵀ.
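The claims in this example can be checked numerically; the following sketch (an illustrative addition) tests positive definiteness via the eigenvalues of the symmetric matrices and via the quadratic form:

import numpy as np

A1 = np.array([[9.0, 6.0], [6.0, 5.0]])
A2 = np.array([[9.0, 6.0], [6.0, 3.0]])

# A symmetric matrix is positive definite iff all of its eigenvalues are positive.
print(np.linalg.eigvalsh(A1))           # both eigenvalues > 0, so A1 is positive definite
print(np.linalg.eigvalsh(A2))           # one eigenvalue < 0, so A2 is not

# The counterexample from the text: x = [2, -3]^T yields a negative quadratic form for A2.
x = np.array([2.0, -3.0])
print(x @ A1 @ x, x @ A2 @ x)           # 9.0 and -9.0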
If A ∈ ℝⁿˣⁿ is symmetric, positive definite, then the following properties hold:

• The null space (kernel) of A consists only of 0 because xᵀAx > 0 for all x ≠ 0. This implies that Ax ≠ 0 if x ≠ 0.
• The diagonal elements aᵢᵢ of A are positive because aᵢᵢ = eᵢᵀAeᵢ > 0, where eᵢ is the ith vector of the standard basis in ℝⁿ.

In Section 4.3, we will return to symmetric, positive definite matrices in the context of matrix decompositions.
3.4 Angles and Orthogonality

[Figure 3.4: plot of f(ω) = cos(ω).]

Therefore, there exists a unique ω ∈ [0, π] with

\cos\omega = \frac{\langle x, y \rangle}{\|x\| \, \|y\|} ,
see Figure 3.4 for an illustration. The number ω is the angle between the vectors x and y. Intuitively, the angle between two vectors tells us how similar their orientations are. For example, using the dot product, the angle between x and y = 4x, i.e., y is a scaled version of x, is 0: their orientation is the same.

[Figure 3.5: The angle ω between two vectors x, y is computed using the inner product.]

Example 3.6 (Angle between Vectors)
Let us compute the angle between x = [1, 1]ᵀ ∈ ℝ² and y = [1, 2]ᵀ ∈ ℝ², see Figure 3.5, where we use the dot product as the inner product. Then we get

\cos\omega = \frac{\langle x, y \rangle}{\sqrt{\langle x, x \rangle \langle y, y \rangle}} = \frac{x^\top y}{\sqrt{x^\top x \; y^\top y}} = \frac{3}{\sqrt{10}} ,   (3.26)

and the angle between the two vectors is arccos(3/√10) ≈ 0.32 rad, which corresponds to about 18°.
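The computation in Example 3.6 can be reproduced with a few lines of NumPy (an illustrative addition):

import numpy as np

x = np.array([1.0, 1.0])
y = np.array([1.0, 2.0])

cos_omega = (x @ y) / np.sqrt((x @ x) * (y @ y))   # cf. (3.26)
omega = np.arccos(cos_omega)
print(cos_omega, omega, np.degrees(omega))         # ~0.949, ~0.32 rad, ~18.4 degrees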
The inner product also allows us to characterize vectors that are orthogonal.

Definition 3.7 (Orthogonality). Two vectors x and y are orthogonal if and only if ⟨x, y⟩ = 0, and we write x ⊥ y. If additionally ‖x‖ = 1 = ‖y‖, i.e., the vectors are unit vectors, then x and y are orthonormal.
Consider two vectors x = [1, 1]ᵀ, y = [−1, 1]ᵀ ∈ ℝ², see Figure 3.6. We are interested in determining the angle ω between them using two different inner products. Using the dot product as inner product yields an angle ω between x and y of 90°, such that x ⊥ y. However, if we choose the inner product

\langle x, y \rangle = x^\top \begin{bmatrix} 2 & 0 \\ 0 & 1 \end{bmatrix} y ,   (3.27)

we get that the angle ω between x and y is given by

\cos\omega = \frac{\langle x, y \rangle}{\|x\| \, \|y\|} = -\frac{1}{3} \implies \omega \approx 1.91 \text{ rad} \approx 109.5^\circ ,   (3.28)

and x and y are not orthogonal. Therefore, vectors that are orthogonal with respect to one inner product do not have to be orthogonal with respect to a different inner product.
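The same pair of vectors can be checked under both inner products; in the sketch below (an added illustration), the matrix A encodes the inner product from (3.27), and the identity matrix recovers the dot product:

import numpy as np

def angle(x, y, A):
    """Angle between x and y with respect to the inner product <x, y> = x^T A y."""
    inner = x @ A @ y
    norm_x = np.sqrt(x @ A @ x)
    norm_y = np.sqrt(y @ A @ y)
    return np.arccos(inner / (norm_x * norm_y))

x = np.array([1.0, 1.0])
y = np.array([-1.0, 1.0])
A = np.array([[2.0, 0.0], [0.0, 1.0]])    # inner product from (3.27)

print(angle(x, y, np.eye(2)))             # pi/2: orthogonal under the dot product
print(angle(x, y, A))                     # ~1.91 rad (~109.5 degrees): not orthogonal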
A square matrix A ∈ ℝⁿˣⁿ is an orthogonal matrix if and only if its columns are orthonormal, so that AAᵀ = I = AᵀA, which implies that A⁻¹ = Aᵀ, i.e., the inverse is obtained by simply transposing the matrix. (It is convention to call these matrices "orthogonal", but a more precise description would be "orthonormal".)

Remark. Transformations by orthogonal matrices are special because the length of a vector x is not changed when transforming it using an orthogonal matrix A. For the dot product we obtain

\|Ax\|^2 = (Ax)^\top (Ax) = x^\top A^\top A x = x^\top I x = x^\top x = \|x\|^2 .   (3.31)

Moreover, the angle between any two vectors x, y, as measured by their inner product, is also unchanged when transforming both of them using an orthogonal matrix A. Assuming the dot product as the inner product, the angle of the images Ax and Ay is given as

\cos\omega = \frac{(Ax)^\top (Ay)}{\|Ax\| \, \|Ay\|} = \frac{x^\top A^\top A y}{\sqrt{x^\top A^\top A x \; y^\top A^\top A y}} = \frac{x^\top y}{\|x\| \, \|y\|} ,

which gives exactly the angle between x and y. This means that orthogonal matrices A with Aᵀ = A⁻¹ preserve both angles and distances. ♦
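A short numerical sketch (added here) with a 2x2 rotation matrix, which is orthogonal, confirms that lengths and angles are preserved; the vectors and the angle are arbitrary choices:

import numpy as np

theta = 0.7
A = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])    # rotation matrices are orthogonal
assert np.allclose(A.T @ A, np.eye(2))             # A^T A = I, i.e., A^{-1} = A^T

x = np.array([3.0, 1.0])
y = np.array([-1.0, 2.0])

# Lengths are preserved, cf. (3.31).
assert np.isclose(np.linalg.norm(A @ x), np.linalg.norm(x))

# Angles (measured via the dot product) are preserved as well.
cos_before = (x @ y) / (np.linalg.norm(x) * np.linalg.norm(y))
cos_after = ((A @ x) @ (A @ y)) / (np.linalg.norm(A @ x) * np.linalg.norm(A @ y))
assert np.isclose(cos_before, cos_after)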
Consider an n-dimensional vector space V and a basis {b1, . . . , bn} of V. If

\langle b_i, b_j \rangle = 0 \quad \text{for } i \neq j   (3.33)
\langle b_i, b_i \rangle = 1   (3.34)

for all i, j = 1, . . . , n, then the basis is called an orthonormal basis (ONB). If only (3.33) is satisfied, then the basis is called an orthogonal basis. Note that (3.34) implies that every basis vector has length/norm 1. The Gram-Schmidt process (Strang, 2003) is a constructive way to iteratively build an orthonormal basis {b1, . . . , bn} given a set {b̃1, . . . , b̃n} of non-orthogonal and unnormalized basis vectors.
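A minimal sketch of the Gram-Schmidt process (the classical variant, assuming the input vectors are linearly independent; the example vectors are arbitrary):

import numpy as np

def gram_schmidt(vectors):
    """Orthonormalize linearly independent vectors via classical Gram-Schmidt."""
    basis = []
    for v in vectors:
        w = np.array(v, dtype=float)
        for b in basis:
            w = w - (b @ v) * b              # remove the component of v along each earlier b
        basis.append(w / np.linalg.norm(w))  # normalize to unit length
    return np.array(basis)

B_tilde = [np.array([2.0, 0.0, 0.0]),
           np.array([1.0, 1.0, 0.0]),
           np.array([1.0, 1.0, 1.0])]
B = gram_schmidt(B_tilde)
print(B)                                     # rows form an orthonormal basis
print(np.round(B @ B.T, 10))                 # identity matrix: orthonormality check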
3.6 Inner Product of Functions
An inner product of two functions u : ℝ → ℝ and v : ℝ → ℝ can be defined as the definite integral

\langle u, v \rangle := \int_a^b u(x)\, v(x)\, dx   (3.36)

for lower and upper limits a, b < ∞, respectively. As with our usual inner product, we can define norms and orthogonality by looking at the inner product. If (3.36) evaluates to 0, the functions u and v are orthogonal. To make the above inner product mathematically precise, we need to take care of measures and the definition of integrals. Furthermore, unlike inner products on finite-dimensional vectors, inner products on functions may diverge (have infinite value). Some careful definitions need to be observed, which requires a foray into real and functional analysis that we do not cover in this book.
[Figure 3.8: Orthogonal projection of a two-dimensional data set onto a one-dimensional subspace. (a) Original dataset. (b) Original data (blue) and their corresponding orthogonal projections (orange) onto a lower-dimensional subspace (straight line).]
Remark. It also holds that the collection of functions

\{1, \cos(x), \cos(2x), \cos(3x), \dots\}   (3.37)

is orthogonal if we integrate from −π to π, i.e., any pair of functions are orthogonal to each other. ♦
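This orthogonality can be verified numerically; the sketch below (an added illustration) approximates the integrals over [−π, π] with a simple Riemann sum:

import numpy as np

n = 100000
xs = np.linspace(-np.pi, np.pi, n, endpoint=False)
dx = 2 * np.pi / n
funcs = [np.ones_like(xs)] + [np.cos(k * xs) for k in range(1, 4)]  # {1, cos(x), cos(2x), cos(3x)}

# Riemann-sum approximation of the function inner product on [-pi, pi], cf. (3.36).
for i in range(len(funcs)):
    for j in range(i + 1, len(funcs)):
        inner = np.sum(funcs[i] * funcs[j]) * dx
        print(i, j, round(inner, 8))        # every pairwise inner product is (numerically) zero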
In Chapter 6, we will have a look at a second type of unconventional inner products: the inner product of random variables.
3.7 Orthogonal Projections
[Figure 3.9: Examples of projections onto one-dimensional subspaces. (a) Projection of x ∈ ℝ² onto a subspace U with basis vector b. (b) Projection of a two-dimensional vector x with ‖x‖ = 1 onto a one-dimensional subspace spanned by b.]
Key machine learning algorithms, such as principal component analysis (Hotelling, 1933) and deep neural networks (e.g., deep auto-encoders; Deng et al., 2010), heavily exploit the idea of dimensionality reduction. In the following, we will focus on orthogonal projections, which we will use in Chapter 10 for linear dimensionality reduction and in Chapter 12 for classification. Even linear regression, which we discuss in Chapter 9, can be interpreted using orthogonal projections. For a given lower-dimensional subspace, orthogonal projections of high-dimensional data retain as much information as possible and minimize the difference/error between the original data and the corresponding projection. An illustration of such an orthogonal projection is given in Figure 3.8.
Before we detail how to obtain these projections, let us define what a projection actually is.
Definition 3.10 (Projection). Let V be a vector space and W ⊆ V a subspace of V. A linear mapping π : V → W is called a projection if π² = π ∘ π = π.

Remark (Projection matrix). Since linear mappings can be expressed by transformation matrices (see Section 2.7), the definition above applies equally to a special kind of transformation matrices, the projection matrices Pπ, which exhibit the property that P²π = Pπ. ♦

In the following, we will derive orthogonal projections of vectors in the inner product space (ℝⁿ, ⟨·,·⟩) onto subspaces. We will start with one-dimensional subspaces, which are also called lines. If not mentioned otherwise, we assume the dot product ⟨x, y⟩ = xᵀy as the inner product.
Here, ω is the angle between x and b. This equation should be familiar from trigonometry: If ‖x‖ = 1, then x lies on the unit circle. It follows that the projection onto the horizontal axis spanned by b is exactly cos ω, and the length of the corresponding vector is ‖πU(x)‖ = |cos ω|. (The horizontal axis is a one-dimensional subspace.) An illustration is given in Figure 3.9.
3. Finding the projection matrix Pπ. We know that a projection is a linear mapping (see Definition 3.10). Therefore, there exists a projection matrix Pπ, such that πU(x) = Pπx. With the dot product as inner product and

\pi_U(x) = \lambda b = b \lambda = b \, \frac{b^\top x}{\|b\|^2} = \frac{b b^\top}{\|b\|^2} \, x ,   (3.46)

we immediately see that

P_\pi = \frac{b b^\top}{\|b\|^2} .   (3.47)

Note that bbᵀ is a symmetric matrix (with rank 1) and ‖b‖² = ⟨b, b⟩ is a scalar. (Projection matrices are always symmetric.)
The projection matrix Pπ projects any vector x ∈ ℝⁿ onto the line through the origin with direction b (equivalently, the subspace U spanned by b).

Remark. The projection πU(x) ∈ ℝⁿ is still an n-dimensional vector and not a scalar. However, we no longer require n coordinates to represent the projection, but only a single one if we want to express it with respect to the basis vector b that spans the subspace U: λ. ♦
[Figure 3.10: Projection onto a two-dimensional subspace U with basis b1, b2. The projection πU(x) of x ∈ ℝ³ onto U can be expressed as a linear combination of b1 and b2, and the displacement vector x − πU(x) is orthogonal to both b1 and b2.]

Let us now choose a particular x and see whether it lies in the subspace spanned by b. For x = [1, 1, 1]ᵀ, the projection is
\pi_U(x) = P_\pi x = \frac{1}{9} \begin{bmatrix} 1 & 2 & 2 \\ 2 & 4 & 4 \\ 2 & 4 & 4 \end{bmatrix} \begin{bmatrix} 1 \\ 1 \\ 1 \end{bmatrix} = \frac{1}{9} \begin{bmatrix} 5 \\ 10 \\ 10 \end{bmatrix} \in \mathrm{span}\Big[ \begin{bmatrix} 1 \\ 2 \\ 2 \end{bmatrix} \Big] .   (3.49)
Note that the application of Pπ to πU(x) does not change anything, i.e., PπU(x) = πU(x). This is expected because according to Definition 3.10 we know that a projection matrix Pπ satisfies P²πx = Pπx for all x. (With the results from Chapter 4 we can show that πU(x) is also an eigenvector of Pπ, and the corresponding eigenvalue is 1.)
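The example can be verified numerically. In the sketch below (an added illustration), b = [1, 2, 2]ᵀ is the basis vector implied by the span in (3.49); its explicit definition is not part of this excerpt.

import numpy as np

b = np.array([1.0, 2.0, 2.0])               # basis vector of the line U, implied by (3.49)
P = np.outer(b, b) / (b @ b)                # projection matrix, cf. (3.47)

x = np.array([1.0, 1.0, 1.0])
proj = P @ x
print(9 * P)                                # recovers the integer matrix in (3.49)
print(9 * proj)                             # [5, 10, 10], i.e., proj = [5, 10, 10] / 9

# Idempotence and the single coordinate lambda with respect to b:
assert np.allclose(P @ P, P)
lam = (b @ x) / (b @ b)
assert np.allclose(proj, lam * b)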
3.7.2 Projection onto General Subspaces

In the following, we look at orthogonal projections of vectors x ∈ ℝⁿ onto higher-dimensional subspaces U ⊆ ℝⁿ with dim(U) = m ≥ 1. An illustration is given in Figure 3.10.
Assume that (b1, . . . , bm) is an ordered basis of U. (If U is given by a set of spanning vectors that are not a basis, make sure you determine a basis b1, . . . , bm before proceeding.) Any projection πU(x) onto U is necessarily an element of U. Therefore, it can be represented as a linear combination of the basis vectors b1, . . . , bm of U, such that πU(x) = λ1b1 + · · · + λmbm.
As in the 1D case, we follow a three-step procedure to find the projection πU(x) and the projection matrix Pπ (the basis vectors form the columns of B ∈ ℝⁿˣᵐ, where B = [b1, . . . , bm]):
1. Find the coordinates λ1, . . . , λm of the projection (with respect to the basis of U), such that the linear combination

\pi_U(x) = \sum_{i=1}^{m} \lambda_i b_i = B \lambda   (3.50)

is closest to x ∈ ℝⁿ.
\langle b_1, x - \pi_U(x) \rangle = b_1^\top (x - \pi_U(x)) = 0   (3.52)
\vdots   (3.53)
\langle b_m, x - \pi_U(x) \rangle = b_m^\top (x - \pi_U(x)) = 0   (3.54)

which, with πU(x) = Bλ, can be written as

b_1^\top (x - B\lambda) = 0   (3.55)
\vdots   (3.56)
b_m^\top (x - B\lambda) = 0   (3.57)

such that we obtain a homogeneous linear equation system

\begin{bmatrix} b_1^\top \\ \vdots \\ b_m^\top \end{bmatrix} (x - B\lambda) = 0 \iff B^\top (x - B\lambda) = 0 .   (3.58)

Solving for λ yields the normal equation BᵀBλ = Bᵀx and hence

\lambda = (B^\top B)^{-1} B^\top x .   (3.60)
The matrix (BᵀB)⁻¹Bᵀ is also called the pseudo-inverse of B, which can be computed for non-square matrices B. It only requires that BᵀB is positive definite, which is the case if B is full rank. (In practical applications, e.g., linear regression, we often add a "jitter term" εI to BᵀB to guarantee increased numerical stability and positive definiteness. This "ridge" can be rigorously derived using Bayesian inference; see Chapter 9 for details.)

2. Find the projection πU(x) ∈ U. We already established that πU(x) = Bλ. Therefore, with (3.60),

\pi_U(x) = B (B^\top B)^{-1} B^\top x .   (3.61)

3. Find the projection matrix Pπ. From (3.61) we can immediately see that the projection matrix that solves Pπx = πU(x) must be

P_\pi = B (B^\top B)^{-1} B^\top .   (3.62)

Remark. Comparing the solutions for projecting onto a one-dimensional subspace and the general case, we see that the general case includes the 1D case as a special case: If dim(U) = 1, then BᵀB ∈ ℝ is a scalar and we can rewrite the projection matrix in (3.62), Pπ = B(BᵀB)⁻¹Bᵀ, as Pπ = BBᵀ/(BᵀB), which is exactly the projection matrix in (3.47). ♦
To verify the results, we can (a) check whether the displacement vector
πU (x) − x is orthogonal to all basis vectors of U , (b) verify that P π = P 2π
(see Definition 3.10).
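A sketch of the three steps (an added illustration with an arbitrarily chosen full-rank B and vector x), following (3.60)–(3.62) together with the two checks just mentioned:

import numpy as np

B = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0]])                   # columns form a basis of a 2D subspace U of R^3
x = np.array([6.0, 0.0, 0.0])

lam = np.linalg.solve(B.T @ B, B.T @ x)      # coordinates lambda, cf. (3.60)
proj = B @ lam                               # projection pi_U(x), cf. (3.61)
P = B @ np.linalg.inv(B.T @ B) @ B.T         # projection matrix, cf. (3.62)

# (a) The displacement x - pi_U(x) is orthogonal to all basis vectors (columns of B).
assert np.allclose(B.T @ (x - proj), 0.0)
# (b) The projection matrix is idempotent: P_pi^2 = P_pi.
assert np.allclose(P @ P, P)
print(lam, proj)                             # lambda = [5, -3], pi_U(x) = [5, 2, -1]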
Remark. The projections πU(x) are still vectors in ℝⁿ although they lie in an m-dimensional subspace U ⊆ ℝⁿ. However, to represent a projected vector we only need the m coordinates λ1, . . . , λm with respect to the basis vectors b1, . . . , bm of U. ♦

Remark. In vector spaces with general inner products, we have to pay attention when computing angles and distances, which are defined by means of the inner product. ♦
[Figure 3.11: Projection onto an affine space. (a) The original setting; (b) the setting is shifted by −x0, so that x − x0 can be projected onto the direction space U = L − x0; (c) the projection is translated back to x0 + πU(x − x0), which gives the final orthogonal projection πL(x).]
Projections allow us to look at situations where we have a linear system Ax = b without a solution. Recall that this means that b does not lie in the span of A, i.e., the vector b does not lie in the subspace spanned by the columns of A. Given that the linear equation cannot be solved exactly, we can find an approximate solution. The idea is to find the vector in the subspace spanned by the columns of A that is closest to b, i.e., we compute the orthogonal projection of b onto the subspace spanned by the columns of A. This problem arises often in practice, and the solution is called the least squares solution (assuming the dot product as the inner product) of an overdetermined system. This is discussed further in Chapter 9.
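As an illustrative sketch (with an arbitrary example system), the least-squares solution can be obtained from the normal equations, i.e., by projecting b onto the column space of A; NumPy's lstsq returns the same result:

import numpy as np

A = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0]])
b = np.array([1.0, 2.0, 2.0])                # b does not lie in the column space of A

x_ls = np.linalg.solve(A.T @ A, A.T @ b)     # normal equations: A^T A x = A^T b
b_proj = A @ x_ls                            # orthogonal projection of b onto the column space of A

x_np, *_ = np.linalg.lstsq(A, b, rcond=None) # library least-squares solver for comparison
assert np.allclose(x_ls, x_np)
print(x_ls, b_proj)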
[Figure 3.12: A rotation rotates objects in a plane about the origin. If the rotation angle is positive, we rotate counterclockwise. Shown: original object and the object rotated by 112.5°.]
In order to get there, we subtract the support point x0 from x and from L, so that L − x0 = U is exactly the vector subspace U. We can now use the orthogonal projections onto a subspace we discussed in Section 3.7.2 and obtain the projection πU(x − x0), which is illustrated in Figure 3.11(b). This projection can now be translated back into L by adding x0, such that we obtain the orthogonal projection onto an affine space L as

\pi_L(x) = x_0 + \pi_U(x - x_0) ,   (3.70)

where πU(·) is the orthogonal projection onto the subspace U, i.e., the direction space of L; see Figure 3.11(c).
From Figure 3.11 it is also evident that the distance of x from the affine space L is identical to the distance of x − x0 from U, i.e.,

d(x, L) = \|x - \pi_L(x)\| = \|x - (x_0 + \pi_U(x - x_0))\|   (3.71)
= d(x - x_0, \pi_U(x - x_0)) .   (3.72)
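A small sketch of (3.70) (an added illustration; the subspace, support point, and vector below are arbitrary choices):

import numpy as np

B = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [0.0, 0.0]])                   # U = span of b1, b2 (the x1-x2 plane in R^3)
x0 = np.array([0.0, 0.0, 2.0])               # support point of the affine space L = x0 + U
x = np.array([1.0, 2.0, 5.0])

P = B @ np.linalg.inv(B.T @ B) @ B.T         # projection matrix onto U, cf. (3.62)
pi_L = x0 + P @ (x - x0)                     # affine projection, cf. (3.70)

print(pi_L)                                  # [1, 2, 2]
print(np.linalg.norm(x - pi_L))              # distance d(x, L), cf. (3.71): 3.0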
3.8 Rotations
[Figure: rotation of the standard basis vectors e1, e2 of ℝ² by an angle θ.]

Therefore, the rotation matrix that performs the basis change into the coordinates of the rotated vectors is

R(\theta) = \begin{bmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{bmatrix} .
[Figure 3.15: Rotation of a vector (gray) in ℝ³ by an angle θ about the e3-axis. The rotated vector is shown in blue.]
• Rotations preserve distances, i.e., ‖x − y‖ = ‖Rθ(x) − Rθ(y)‖. In other words, rotations leave the distance between any two points unchanged after the transformation.
• Rotations preserve angles, i.e., the angle between Rθx and Rθy equals the angle between x and y.
• Rotations in three (or more) dimensions are generally not commutative. Therefore, the order in which rotations are applied is important, even if they rotate about the same point. Only in two dimensions are vector rotations commutative, such that R(φ)R(θ) = R(θ)R(φ) for all φ, θ ∈ [0, 2π), and they form an Abelian group (with multiplication) only if they rotate about the same point (e.g., the origin); see the sketch below.
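A short numerical sketch of the last point (an added illustration): two elementary rotations in ℝ³ about different axes do not commute, whereas rotations in the plane do.

import numpy as np

def rot_z(t):
    """Rotation about the e3-axis by angle t."""
    return np.array([[np.cos(t), -np.sin(t), 0.0],
                     [np.sin(t),  np.cos(t), 0.0],
                     [0.0,        0.0,       1.0]])

def rot_x(t):
    """Rotation about the e1-axis by angle t."""
    return np.array([[1.0, 0.0,        0.0],
                     [0.0, np.cos(t), -np.sin(t)],
                     [0.0, np.sin(t),  np.cos(t)]])

def rot_2d(t):
    """Planar rotation about the origin by angle t."""
    return np.array([[np.cos(t), -np.sin(t)],
                     [np.sin(t),  np.cos(t)]])

a, b = 0.3, 1.1
print(np.allclose(rot_z(a) @ rot_x(b), rot_x(b) @ rot_z(a)))      # False: 3D rotations do not commute
print(np.allclose(rot_2d(a) @ rot_2d(b), rot_2d(b) @ rot_2d(a)))  # True: 2D rotations commute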
Exercises

3.1 Show that ⟨·,·⟩ defined for all x = (x1, x2) and y = (y1, y2) in ℝ² by: