
Analytic Geometry

In Chapter 2, we studied vectors, vector spaces and linear mappings at a general but abstract level. In this chapter, we will add some geometric interpretation and intuition to all of these concepts. In particular, we will look at geometric vectors and compute their lengths and distances or angles between two vectors. To be able to do this, we equip the vector space with an inner product that induces the geometry of the vector space. Inner products and their corresponding norms and metrics capture the intuitive notions of similarity and distances, which we use to develop the Support Vector Machine in Chapter 12. We will then use the concepts of lengths and angles between vectors to discuss orthogonal projections, which will play a central role when we discuss principal component analysis in Chapter 10 and regression via maximum likelihood estimation in Chapter 9. Figure 3.1 gives an overview of how concepts in this chapter are related and how they are connected to other chapters of the book.

Figure 3.1 A mind map of the concepts introduced in this chapter, along with when they are used in other parts of the book: the inner product induces norms; lengths, angles, orthogonal projections and rotations build on them and connect to Chapter 9 (Regression), Chapter 4 (Matrix decomposition), Chapter 10 (Dimensionality reduction) and Chapter 12 (Classification).

Figure 3.3 For different norms, the red lines indicate the set of vectors with norm 1. Left: Manhattan norm (‖x‖₁ = 1); Right: Euclidean distance (‖x‖₂ = 1).

3.1 Norms


When we think of geometric vectors, i.e., directed line segments that start at the origin, then intuitively the length of a vector is the distance of the "end" of this directed line segment from the origin. In the following, we will discuss the notion of the length of vectors using the concept of a norm.

Definition 3.1 (Norm). A norm on a vector space V is a function

    \| \cdot \| : V \to \mathbb{R} ,   (3.1)
    x \mapsto \|x\| ,   (3.2)

which assigns each vector x its length ‖x‖ ∈ R, such that for all λ ∈ R and x, y ∈ V the following hold:

• Absolutely homogeneous: ‖λx‖ = |λ| ‖x‖
• Triangle inequality: ‖x + y‖ ≤ ‖x‖ + ‖y‖
• Positive definite: ‖x‖ ≥ 0 and ‖x‖ = 0 ⟺ x = 0.

In geometric terms, the triangle inequality states that for any triangle, the sum of the lengths of any two sides must be greater than or equal to the length of the remaining side; see Figure 3.2 for an illustration (for a triangle with sides a, b, c it reads c ≤ a + b).

Recall that for a vector x ∈ R^n we denote the elements of the vector using a subscript, that is, x_i is the i-th element of the vector x.

Example 3.1 (Manhattan Norm)

The Manhattan norm on R^n is defined for x ∈ R^n as

    \|x\|_1 := \sum_{i=1}^{n} |x_i| ,   (3.3)

where |·| is the absolute value. The left panel of Figure 3.3 indicates all vectors x ∈ R^2 with ‖x‖₁ = 1. The Manhattan norm is also called ℓ1 norm.


Example 3.2 (Euclidean Norm)

The length of a vector x ∈ R^n is given by

    \|x\|_2 := \sqrt{\sum_{i=1}^{n} x_i^2} = \sqrt{x^\top x} ,   (3.4)

which computes the Euclidean distance of x from the origin. This norm is called the Euclidean norm. The right panel of Figure 3.3 shows all vectors x ∈ R^2 with ‖x‖₂ = 1. The Euclidean norm is also called ℓ2 norm.
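To make these definitions concrete, here is a minimal NumPy sketch (added for illustration, not part of the book) that computes the Manhattan and Euclidean norms of a vector directly from (3.3) and (3.4) and cross-checks them against numpy.linalg.norm.

```python
import numpy as np

x = np.array([1.0, -2.0, 3.0])

# Manhattan (l1) norm, definition (3.3): sum of absolute values.
manhattan = np.sum(np.abs(x))

# Euclidean (l2) norm, definition (3.4): square root of the sum of squares.
euclidean = np.sqrt(x @ x)

# The same norms via NumPy's built-in routine.
assert np.isclose(manhattan, np.linalg.norm(x, ord=1))
assert np.isclose(euclidean, np.linalg.norm(x, ord=2))
print(manhattan, euclidean)  # 6.0  3.7416...
```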

Remark. Throughout this book, we will use the Euclidean norm (3.4) by default if not stated otherwise. ♦

Remark (Inner Products and Norms). Every inner product induces a norm, but there are norms (like the ℓ1 norm) without a corresponding inner product. For an inner product vector space (V, ⟨·, ·⟩), the induced norm ‖·‖ satisfies the Cauchy-Schwarz inequality

    |\langle x, y \rangle| \le \|x\| \, \|y\| .   (3.5)

♦

3.2 Inner Products


Inner products allow for the introduction of intuitive geometrical concepts, such as the length of a vector and the angle or distance between two vectors. A major purpose of inner products is to determine whether vectors are orthogonal to each other.

3.2.1 Dot Product


We may already be familiar with a particular type of inner product, the scalar product/dot product in R^n, which is given by

    x^\top y = \sum_{i=1}^{n} x_i y_i .   (3.6)

We will refer to the particular inner product above as the dot product in this book. However, inner products are more general concepts with specific properties, which we will now introduce.

3.2.2 General Inner Products


Recall the linear mapping from Section 2.7, where we can rearrange the mapping with respect to addition and multiplication with a scalar. A bilinear mapping Ω is a mapping with two arguments, and it is linear in each argument, i.e., when we look at a vector space V then it holds that for all x, y, z ∈ V, λ ∈ R

    \Omega(\lambda x + y, z) = \lambda \Omega(x, z) + \Omega(y, z)   (3.7)
    \Omega(x, \lambda y + z) = \lambda \Omega(x, y) + \Omega(x, z) .   (3.8)

Here, (3.7) asserts that Ω is linear in the first argument, and (3.8) asserts that Ω is linear in the second argument.

Definition 3.2. Let V be a vector space and Ω : V × V → R be a bilinear mapping that takes two vectors and maps them onto a real number. Then

• Ω is called symmetric if Ω(x, y) = Ω(y, x) for all x, y ∈ V, i.e., the order of the arguments does not matter.
• Ω is called positive definite if

    \forall x \in V \setminus \{0\} : \Omega(x, x) > 0 , \quad \Omega(0, 0) = 0 .   (3.9)

Definition 3.3. Let V be a vector space and Ω : V × V → R be a bilinear mapping that takes two vectors and maps them onto a real number. Then

• A positive definite, symmetric bilinear mapping Ω : V × V → R is called an inner product on V. We typically write ⟨x, y⟩ instead of Ω(x, y).
• The pair (V, ⟨·, ·⟩) is called an inner product space or (real) vector space with inner product. If we use the dot product defined in (3.6), we call (V, ⟨·, ·⟩) a Euclidean vector space.

We will refer to the spaces above as inner product spaces in this book.

Example 3.3 (Inner Product that is not the Dot Product)

Consider V = R^2. If we define

    \langle x, y \rangle := x_1 y_1 - (x_1 y_2 + x_2 y_1) + 2 x_2 y_2 ,   (3.10)

then ⟨·, ·⟩ is an inner product but different from the dot product. The proof will be an exercise.
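As a quick numerical sanity check (an added sketch, not from the book): the bilinear form (3.10) can be written as x^⊤ A y with A = [[1, −1], [−1, 2]], so it suffices to verify that A is symmetric and positive definite.

```python
import numpy as np

# Matrix representation of the bilinear form (3.10):
# <x, y> = x1*y1 - (x1*y2 + x2*y1) + 2*x2*y2 = x^T A y.
A = np.array([[1.0, -1.0],
              [-1.0, 2.0]])

def form(x, y):
    return x @ A @ y

# Symmetry: <x, y> == <y, x> for randomly drawn vectors.
rng = np.random.default_rng(0)
x, y = rng.normal(size=2), rng.normal(size=2)
assert np.isclose(form(x, y), form(y, x))

# Positive definiteness: all eigenvalues of the symmetric matrix A are positive.
print(np.linalg.eigvalsh(A))  # [0.38..., 2.61...], both > 0
```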

3.2.3 Symmetric, Positive Definite Matrices


Symmetric, positive definite matrices play an important role in machine learning, and they are defined via the inner product.

Consider an n-dimensional vector space V with an inner product ⟨·, ·⟩ : V × V → R (see Definition 3.3) and an ordered basis B = (b_1, ..., b_n) of V. Recall from Section 2.6.1 that any vectors x, y ∈ V can be written as linear combinations of the basis vectors, so that x = \sum_{i=1}^{n} \psi_i b_i ∈ V and y = \sum_{j=1}^{n} \lambda_j b_j ∈ V for suitable ψ_i, λ_j ∈ R. Due to the bilinearity of the inner product, it holds for all x, y ∈ V that

    \langle x, y \rangle = \left\langle \sum_{i=1}^{n} \psi_i b_i, \sum_{j=1}^{n} \lambda_j b_j \right\rangle = \sum_{i=1}^{n} \sum_{j=1}^{n} \psi_i \langle b_i, b_j \rangle \lambda_j = \hat{x}^\top A \hat{y} ,   (3.11)

where A_{ij} := ⟨b_i, b_j⟩ and x̂, ŷ are the coordinates of x and y with respect to the basis B. This implies that the inner product ⟨·, ·⟩ is uniquely determined through A. The symmetry of the inner product also means that A is symmetric. Furthermore, the positive definiteness of the inner product implies that

    \forall x \in V \setminus \{0\} : x^\top A x > 0 .   (3.12)
Definition 3.4 (Symmetric, positive definite matrix). A symmetric matrix A ∈ R^{n×n} that satisfies (3.12) is called symmetric, positive definite, or just positive definite. If only ≥ holds in (3.12), then A is called symmetric, positive semi-definite.
Example 3.4 (Symmetric, Positive Definite Matrices)
Consider the following matrices:

    A_1 = \begin{bmatrix} 9 & 6 \\ 6 & 5 \end{bmatrix} , \quad A_2 = \begin{bmatrix} 9 & 6 \\ 6 & 3 \end{bmatrix}   (3.13)

Then, A_1 is positive definite because it is symmetric and

    x^\top A_1 x = \begin{bmatrix} x_1 & x_2 \end{bmatrix} \begin{bmatrix} 9 & 6 \\ 6 & 5 \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \end{bmatrix}   (3.14a)
    = 9 x_1^2 + 12 x_1 x_2 + 5 x_2^2 = (3 x_1 + 2 x_2)^2 + x_2^2 > 0   (3.14b)

for all x ∈ V \ {0}. However, A_2 is symmetric but not positive definite because x^⊤ A_2 x = 9x_1^2 + 12x_1x_2 + 3x_2^2 = (3x_1 + 2x_2)^2 − x_2^2 can be smaller than 0, e.g., for x = [2, −3]^⊤.
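As a side note (added here, not in the text), the eigenvalue criterion gives a convenient numerical check of (3.12) for symmetric matrices; a short NumPy sketch for the two matrices above:

```python
import numpy as np

A1 = np.array([[9.0, 6.0], [6.0, 5.0]])
A2 = np.array([[9.0, 6.0], [6.0, 3.0]])

# A symmetric matrix is positive definite iff all its eigenvalues are positive.
print(np.linalg.eigvalsh(A1))  # both eigenvalues > 0  -> positive definite
print(np.linalg.eigvalsh(A2))  # one eigenvalue < 0    -> not positive definite

# The counterexample from the text: x = [2, -3] gives a negative quadratic form.
x = np.array([2.0, -3.0])
print(x @ A2 @ x)  # -9.0
```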

If A ∈ R^{n×n} is symmetric, positive definite, then

    \langle x, y \rangle = \hat{x}^\top A \hat{y}   (3.15)

defines an inner product with respect to an ordered basis B, where x̂ and ŷ are the coordinate representations of x, y ∈ V with respect to B.

Theorem 3.5. For a real-valued, finite-dimensional vector space V and an ordered basis B of V, it holds that ⟨·, ·⟩ : V × V → R is an inner product if and only if there exists a symmetric, positive definite matrix A ∈ R^{n×n} with

    \langle x, y \rangle = \hat{x}^\top A \hat{y} .   (3.16)

The following properties hold if A ∈ R^{n×n} is symmetric and positive definite:


• The null space (kernel) of A consists only of 0 because x^⊤ A x > 0 for all x ≠ 0. This implies that Ax ≠ 0 if x ≠ 0.
• The diagonal elements a_{ii} of A are positive because a_{ii} = e_i^⊤ A e_i > 0, where e_i is the i-th vector of the standard basis in R^n.

In Section 4.3, we will return to symmetric, positive definite matrices in the context of matrix decompositions.

3.3 Lengths and Distances


In Section 3.1, we already discussed norms that we can use to compute the length of a vector. Inner products and norms are closely related in the sense that any inner product induces a norm

    \|x\| := \sqrt{\langle x, x \rangle}   (3.17)

in a natural way, such that we can compute lengths of vectors using the inner product. However, not every norm is induced by an inner product. The Manhattan norm (3.3) is an example of a norm that is not induced by an inner product. In the following, we will focus on norms that are induced by inner products and introduce geometric concepts, such as lengths, distances and angles.

Example 3.5 (Lengths of Vectors using Inner Products)

In geometry, we are often interested in lengths of vectors. We can now use an inner product to compute them using (3.17). Let us take x = [1, 1]^⊤ ∈ R^2. If we use the dot product as the inner product, with (3.17) we obtain

    \|x\| = \sqrt{x^\top x} = \sqrt{1^2 + 1^2} = \sqrt{2}   (3.18)

as the length of x. Let us now choose a different inner product:

    \langle x, y \rangle := x^\top \begin{bmatrix} 1 & -\tfrac{1}{2} \\ -\tfrac{1}{2} & 1 \end{bmatrix} y = x_1 y_1 - \tfrac{1}{2}(x_1 y_2 + x_2 y_1) + x_2 y_2 .   (3.19)

If we compute the norm of a vector, then this inner product returns smaller values than the dot product if x_1 and x_2 have the same sign (and x_1 x_2 > 0), otherwise it returns greater values than the dot product. With this inner product we obtain

    \langle x, x \rangle = x_1^2 - x_1 x_2 + x_2^2 = 1 - 1 + 1 = 1 \implies \|x\| = \sqrt{1} = 1 ,   (3.20)

such that x is "shorter" with this inner product than with the dot product.
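The same computation in NumPy (an added sketch, not from the book), with each inner product represented by its matrix as in (3.15); the dot product corresponds to the identity matrix.

```python
import numpy as np

x = np.array([1.0, 1.0])

# Inner product matrix from (3.19).
A = np.array([[1.0, -0.5],
              [-0.5, 1.0]])

def norm_with(M, v):
    # Norm induced by the inner product <u, v> = u^T M v, see (3.17).
    return np.sqrt(v @ M @ v)

print(norm_with(np.eye(2), x))  # sqrt(2) ~ 1.414 (dot product)
print(norm_with(A, x))          # 1.0 (inner product (3.19))
```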

Definition 3.6 (Distance and Metric). Consider an inner product space (V, ⟨·, ·⟩). Then

    d(x, y) := \|x - y\| = \sqrt{\langle x - y, x - y \rangle}   (3.21)

is called the distance of x and y ∈ V.

If we use the dot product as the inner product, then the distance is called Euclidean distance. The mapping

    d : V \times V \to \mathbb{R}   (3.22)
    (x, y) \mapsto d(x, y)   (3.23)

is called a metric.
Remark. Similar to the length of a vector, the distance between vectors does not require an inner product: a norm is sufficient. If we have a norm induced by an inner product, the distance may vary depending on the choice of the inner product. ♦
A metric d satisfies:

1. d is positive definite, i.e., d(x, y) ≥ 0 for all x, y ∈ V and d(x, y) = 0 ⟺ x = y.
2. d is symmetric, i.e., d(x, y) = d(y, x) for all x, y ∈ V.
3. Triangle inequality: d(x, z) ≤ d(x, y) + d(y, z).

3.4 Angles and Orthogonality


Figure 3.4 When restricted to [0, π], f(ω) = cos(ω) returns a unique number in the interval [−1, 1].

The Cauchy-Schwarz inequality (3.5) allows us to define angles ω in inner product spaces between two vectors x, y. Assume that x ≠ 0, y ≠ 0. Then

    -1 \le \frac{\langle x, y \rangle}{\|x\| \, \|y\|} \le 1 .   (3.24)

Therefore, there exists a unique ω ∈ [0, π] with

    \cos\omega = \frac{\langle x, y \rangle}{\|x\| \, \|y\|} ,   (3.25)
see Figure 3.4 for an illustration. The number ω is the angle between the vectors x and y. Intuitively, the angle between two vectors tells us how similar their orientations are. For example, using the dot product, the angle between x and y = 4x, i.e., y is a scaled version of x, is 0: their orientation is the same.

Figure 3.5 The angle ω between two vectors x, y is computed using the inner product.

Example 3.6 (Angle between Vectors)
Let us compute the angle between x = [1, 1]^⊤ ∈ R^2 and y = [1, 2]^⊤ ∈ R^2, see Figure 3.5, where we use the dot product as the inner product. Then we get

    \cos\omega = \frac{\langle x, y \rangle}{\sqrt{\langle x, x \rangle \langle y, y \rangle}} = \frac{x^\top y}{\sqrt{x^\top x \, y^\top y}} = \frac{3}{\sqrt{10}} ,   (3.26)

and the angle between the two vectors is arccos(3/√10) ≈ 0.32 rad, which corresponds to about 18°.
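A one-line numerical check of this example (an added sketch, not from the book):

```python
import numpy as np

x = np.array([1.0, 1.0])
y = np.array([1.0, 2.0])

# cos(omega) = <x, y> / (||x|| * ||y||) with the dot product, see (3.25).
cos_omega = (x @ y) / (np.linalg.norm(x) * np.linalg.norm(y))
omega = np.arccos(cos_omega)

print(omega, np.degrees(omega))  # ~0.3217 rad, ~18.43 degrees
```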


The inner product also allows us to characterize vectors that are orthogonal.

Definition 3.7 (Orthogonality). Two vectors x and y are orthogonal if and only if ⟨x, y⟩ = 0, and we write x ⊥ y. If additionally ‖x‖ = 1 = ‖y‖, i.e., the vectors are unit vectors, then x and y are orthonormal.

An implication of this definition is that the 0-vector is orthogonal to every vector in the vector space.

Remark. Orthogonality is the generalization of the concept of perpendicularity to bilinear forms that do not have to be the dot product. In our context, geometrically, we can think of orthogonal vectors as having a right angle with respect to a specific inner product. ♦

Example 3.7 (Orthogonal Vectors)

Figure 3.6 The angle ω between two vectors x, y can change depending on the inner product.

Consider two vectors x = [1, 1]^⊤, y = [−1, 1]^⊤ ∈ R^2, see Figure 3.6. We are interested in determining the angle ω between them using two different inner products. Using the dot product as inner product yields an angle ω between x and y of 90°, such that x ⊥ y. However, if we choose the inner product

    \langle x, y \rangle = x^\top \begin{bmatrix} 2 & 0 \\ 0 & 1 \end{bmatrix} y ,   (3.27)

we get that the angle ω between x and y is given by

    \cos\omega = \frac{\langle x, y \rangle}{\|x\| \, \|y\|} = -\frac{1}{3} \implies \omega \approx 1.91 \text{ rad} \approx 109.5^\circ ,   (3.28)

and x and y are not orthogonal. Therefore, vectors that are orthogonal with respect to one inner product do not have to be orthogonal with respect to a different inner product.
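The same comparison, numerically (an added sketch, not part of the book):

```python
import numpy as np

x = np.array([1.0, 1.0])
y = np.array([-1.0, 1.0])
A = np.array([[2.0, 0.0], [0.0, 1.0]])  # inner product matrix from (3.27)

def angle_deg(x, y, M):
    # Angle induced by <u, v> = u^T M v via (3.25), in degrees.
    cos_omega = (x @ M @ y) / np.sqrt((x @ M @ x) * (y @ M @ y))
    return np.degrees(np.arccos(cos_omega))

print(angle_deg(x, y, np.eye(2)))  # 90.0     -> orthogonal under the dot product
print(angle_deg(x, y, A))          # ~109.47  -> not orthogonal under (3.27)
```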

Definition 3.8 (Orthogonal Matrix). A square matrix A ∈ R^{n×n} is an orthogonal matrix if and only if its columns are orthonormal so that

    A A^\top = I = A^\top A ,   (3.29)

which implies that

    A^{-1} = A^\top ,   (3.30)

i.e., the inverse is obtained by simply transposing the matrix. (It is convention to call these matrices "orthogonal", but a more precise description would be "orthonormal".)

Remark. Transformations by orthogonal matrices are special because the length of a vector x is not changed when transforming it using an orthogonal matrix A. For the dot product we obtain

    \|Ax\|^2 = (Ax)^\top (Ax) = x^\top A^\top A x = x^\top I x = x^\top x = \|x\|^2 .   (3.31)

Moreover, the angle between any two vectors x, y, as measured by their inner product, is also unchanged when transforming both of them using an orthogonal matrix A. Assuming the dot product as the inner product, the angle of the images Ax and Ay is given as

    \cos\omega = \frac{(Ax)^\top (Ay)}{\|Ax\| \, \|Ay\|} = \frac{x^\top A^\top A y}{\sqrt{x^\top A^\top A x \; y^\top A^\top A y}} = \frac{x^\top y}{\|x\| \, \|y\|} ,   (3.32)

which gives exactly the angle between x and y. This means that orthogonal matrices A with A^⊤ = A^{-1} preserve both angles and distances. ♦
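To illustrate (3.29)-(3.32) numerically, here is a small NumPy sketch (added here, not from the book) that uses a 2×2 rotation matrix as a concrete orthogonal matrix; rotations are discussed in Section 3.8.

```python
import numpy as np

theta = 0.7
A = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])

assert np.allclose(A.T @ A, np.eye(2))  # A^T A = I, see (3.29)

rng = np.random.default_rng(1)
x, y = rng.normal(size=2), rng.normal(size=2)

# Lengths are preserved, see (3.31).
assert np.isclose(np.linalg.norm(A @ x), np.linalg.norm(x))

# Angles are preserved, see (3.32).
cos_before = (x @ y) / (np.linalg.norm(x) * np.linalg.norm(y))
cos_after = ((A @ x) @ (A @ y)) / (np.linalg.norm(A @ x) * np.linalg.norm(A @ y))
assert np.isclose(cos_before, cos_after)
print("orthogonal transformation preserved lengths and angles")
```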

3.5 Orthonormal Basis


In Section 2.6.1, we characterized properties of basis vectors and found that in an n-dimensional vector space, we need n basis vectors, i.e., n vectors that are linearly independent. In Sections 3.3 and 3.4, we used inner products to compute the length of vectors and the angle between vectors. In the following, we will discuss the special case where the basis vectors are orthogonal to each other and where the length of each basis vector is 1. We will call this basis then an orthonormal basis. Let us introduce this more formally.

Definition 3.9 (Orthonormal basis). Consider an n-dimensional vector space V and a basis {b_1, ..., b_n} of V. If

    \langle b_i, b_j \rangle = 0 \quad \text{for } i \neq j   (3.33)
    \langle b_i, b_i \rangle = 1   (3.34)

for all i, j = 1, ..., n, then the basis is called an orthonormal basis (ONB). If only (3.33) is satisfied, then the basis is called an orthogonal basis.

Note that (3.34) implies that every basis vector has length/norm 1. The Gram-Schmidt process (Strang, 2003) is a constructive way to iteratively build an orthonormal basis {b_1, ..., b_n} given a set {b̃_1, ..., b̃_n} of non-orthogonal and unnormalized basis vectors.
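The Gram-Schmidt process itself is not spelled out in this chapter; the following is a minimal sketch of the classical variant (added for illustration, not the book's implementation), using the dot product and assuming the input columns are linearly independent.

```python
import numpy as np

def gram_schmidt(B):
    """Classical Gram-Schmidt: orthonormalize the columns of B (assumed
    linearly independent) with respect to the dot product."""
    Q = np.zeros_like(B, dtype=float)
    for j in range(B.shape[1]):
        v = B[:, j].astype(float)
        for i in range(j):
            # Remove the component of b_j along the already-computed q_i.
            v = v - (Q[:, i] @ B[:, j]) * Q[:, i]
        Q[:, j] = v / np.linalg.norm(v)  # normalize to unit length
    return Q

B = np.array([[2.0, 1.0, 0.0],
              [0.0, 1.0, 1.0],
              [0.0, 0.0, 3.0]])
Q = gram_schmidt(B)
print(np.round(Q.T @ Q, 8))  # identity matrix: the columns form an ONB
```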


Example 3.8 (Orthonormal basis)

The canonical/standard basis for a Euclidean vector space R^n is an orthonormal basis, where the inner product is the dot product of vectors. In R^2, the vectors

    b_1 = \frac{1}{\sqrt{2}} \begin{bmatrix} 1 \\ 1 \end{bmatrix} , \quad b_2 = \frac{1}{\sqrt{2}} \begin{bmatrix} 1 \\ -1 \end{bmatrix}   (3.35)

form an orthonormal basis since b_1^⊤ b_2 = 0 and ‖b_1‖ = 1 = ‖b_2‖.

We will exploit the concept of an orthonormal basis in Chapter 12 and Chapter 10 when we discuss Support Vector Machines and Principal Component Analysis.

3.6 Inner Product of Functions


Thus far, we looked at properties of inner products to compute lengths, angles and distances. We focused on inner products of finite-dimensional vectors. In the following, we will look at an example of inner products of a different type of vectors: inner products of functions.

The inner products we discussed so far were defined for vectors with a finite number of entries. We can think of a vector x ∈ R^n as a function with n function values. The concept of an inner product can be generalized to vectors with an infinite number of entries (countably infinite) and also continuous-valued functions (uncountably infinite). Then, the sum over individual components of vectors, see (3.6) for example, turns into an integral.

An inner product of two functions u : R → R and v : R → R can be defined as the definite integral

    \langle u, v \rangle := \int_a^b u(x) \, v(x) \, dx   (3.36)

for lower and upper limits a, b < ∞, respectively. As with our usual inner product, we can define norms and orthogonality by looking at the inner product. If (3.36) evaluates to 0, the functions u and v are orthogonal. To make the above inner product mathematically precise, we need to take care of measures and the definition of integrals. Furthermore, unlike inner products on finite-dimensional vectors, inner products on functions may diverge (have infinite value). Some careful definitions need to be observed, which requires a foray into real and functional analysis that we do not cover in this book.

Figure 3.8 Orthogonal projection of a two-dimensional data set onto a one-dimensional subspace. (a) Original dataset. (b) Original data (blue) and their corresponding orthogonal projections (orange) onto a lower-dimensional subspace (straight line).

Example 3.9 (Inner Product of Functions)

If we choose u = sin(x) and v = cos(x), the integrand f(x) = u(x)v(x) of (3.36) is shown in Figure 3.7. We see that this function is odd, i.e., f(−x) = −f(x). Therefore, the integral with limits a = −π, b = π of this product evaluates to 0. Therefore, sin and cos are orthogonal functions.

Figure 3.7 f(x) = sin(x) cos(x).

Remark. It also holds that the collection of functions

    \{1, \cos(x), \cos(2x), \cos(3x), \dots\}   (3.37)

is orthogonal if we integrate from −π to π, i.e., any pair of functions are orthogonal to each other. ♦
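A quick numerical check of these claims (an added sketch, not from the book), approximating the integral in (3.36) on [−π, π] by a simple Riemann sum:

```python
import numpy as np

xs = np.linspace(-np.pi, np.pi, 100001)
dx = xs[1] - xs[0]

def inner(u, v):
    # Inner product of functions on [-pi, pi], see (3.36), approximated numerically.
    return np.sum(u(xs) * v(xs)) * dx

print(inner(np.sin, np.cos))                   # ~0: sin and cos are orthogonal
print(inner(np.cos, lambda t: np.cos(2 * t)))  # ~0: cos(x) and cos(2x) are orthogonal
print(inner(np.cos, np.cos))                   # ~pi: cos is not orthogonal to itself
```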
In Chapter 6, we will have a look at a second type of unconventional inner products: the inner product of random variables.

3.7 Orthogonal Projections


Projections are an important class of linear transformations (besides rotations and reflections). Projections play an important role in graphics, coding theory, statistics and machine learning. In machine learning, we often deal with data that is high-dimensional. High-dimensional data is often hard to analyze or visualize. However, high-dimensional data quite often possesses the property that only a few dimensions contain most information, and most other dimensions are not essential to describe key properties of the data. When we compress or visualize high-dimensional data, we will lose information. To minimize this compression loss, we ideally find the most informative dimensions in the data. Then, we can project the original high-dimensional data onto a lower-dimensional feature space ("feature" is a commonly used word for "data representation") and work in this lower-dimensional space to learn more about the dataset and extract patterns.

Figure 3.9 Examples of projections onto one-dimensional subspaces. (a) Projection of x ∈ R^2 onto a subspace U with basis vector b. (b) Projection of a two-dimensional vector x with ‖x‖ = 1 onto a one-dimensional subspace spanned by b.
For example, machine learning algorithms, such as Principal Component Analysis (PCA) by Pearson (1901b) and Hotelling (1933) and Deep Neural Networks (e.g., deep auto-encoders, Deng et al. (2010)), heavily exploit the idea of dimensionality reduction. In the following, we will focus on orthogonal projections, which we will use in Chapter 10 for linear dimensionality reduction and in Chapter 12 for classification. Even linear regression, which we discuss in Chapter 9, can be interpreted using orthogonal projections. For a given lower-dimensional subspace, orthogonal projections of high-dimensional data retain as much information as possible and minimize the difference/error between the original data and the corresponding projection. An illustration of such an orthogonal projection is given in Figure 3.8.

Before we detail how to obtain these projections, let us define what a projection actually is.

Definition 3.10 (Projection). Let V be a vector space and W ⊆ V a subspace of V. A linear mapping π : V → W is called a projection if π² = π ∘ π = π.

Remark (Projection matrix). Since linear mappings can be expressed by transformation matrices (see Section 2.7), the definition above applies equally to a special kind of transformation matrices, the projection matrices P_π, which exhibit the property that P_π² = P_π. ♦

In the following, we will derive orthogonal projections of vectors in the inner product space (R^n, ⟨·, ·⟩) onto subspaces. We will start with one-dimensional subspaces, which are also called lines. If not mentioned otherwise, we assume the dot product ⟨x, y⟩ = x^⊤ y as the inner product.

3.7.1 Projection onto 1-Dimensional Subspaces (Lines)


Assume we are given a line (1-dimensional subspace) through the origin with basis vector b ∈ R^n. The line is a one-dimensional subspace U ⊆ R^n spanned by b.

When we project x ∈ R^n onto U, we want to find the point π_U(x) ∈ U that is closest to x. Using geometric arguments, let us characterize some properties of the projection π_U(x) (Figure 3.9 serves as an illustration):

• The projection π_U(x) is closest to x, where "closest" implies that the distance ‖x − π_U(x)‖ is minimal. It follows that the segment π_U(x) − x from π_U(x) to x is orthogonal to U and, therefore, to the basis b of U. The orthogonality condition yields ⟨π_U(x) − x, b⟩ = 0 since angles between vectors are defined by means of the inner product.
• The projection π_U(x) of x onto U must be an element of U and, therefore, a multiple of the basis vector b that spans U. Hence, π_U(x) = λb, for some λ ∈ R. (λ is then the coordinate of π_U(x) with respect to b.)

In the following three steps, we determine the coordinate λ, the projection π_U(x) ∈ U and the projection matrix P_π that maps arbitrary x ∈ R^n onto U.
1. Finding the coordinate λ. The orthogonality condition yields

    \langle x - \pi_U(x), b \rangle = 0   (3.38)
    \overset{\pi_U(x) = \lambda b}{\iff} \langle x - \lambda b, b \rangle = 0 .   (3.39)

We can now exploit the bilinearity of the inner product and arrive at

    \langle x, b \rangle - \lambda \langle b, b \rangle = 0   (3.40)
    \iff \lambda = \frac{\langle x, b \rangle}{\langle b, b \rangle} = \frac{\langle x, b \rangle}{\|b\|^2} .   (3.41)

If we choose ⟨·, ·⟩ to be the dot product, we obtain

    \lambda = \frac{b^\top x}{b^\top b} = \frac{b^\top x}{\|b\|^2} .   (3.42)

If ‖b‖ = 1, then the coordinate λ of the projection is given by b^⊤ x (with a general inner product, λ = ⟨x, b⟩ if ‖b‖ = 1).
2. Finding the projection point π_U(x) ∈ U. Since π_U(x) = λb, we immediately obtain with (3.42) that

    \pi_U(x) = \lambda b = \frac{\langle x, b \rangle}{\|b\|^2} b = \frac{b^\top x}{\|b\|^2} b ,   (3.43)

where the last equality holds for the dot product only. We can also compute the length of π_U(x) by means of Definition 3.1 as

    \|\pi_U(x)\| = \|\lambda b\| = |\lambda| \, \|b\| .   (3.44)

This means that our projection is of length |λ| times the length of b. This also adds the intuition that λ is the coordinate of π_U(x) with respect to the basis vector b that spans our one-dimensional subspace U.


If we use the dot product as an inner product, we get

    \|\pi_U(x)\| \overset{(3.43)}{=} \frac{|b^\top x|}{\|b\|^2} \|b\| \overset{(3.25)}{=} |\cos\omega| \, \|x\| \, \|b\| \frac{\|b\|}{\|b\|^2} = |\cos\omega| \, \|x\| .   (3.45)

Here, ω is the angle between x and b. This equation should be familiar from trigonometry: If ‖x‖ = 1, then x lies on the unit circle. It follows that the projection onto the horizontal axis spanned by b is exactly cos ω, and the length of the corresponding vector π_U(x) is |cos ω|. (The horizontal axis is a one-dimensional subspace.) An illustration is given in Figure 3.9.
3. Finding the projection matrix P_π. We know that a projection is a linear mapping (see Definition 3.10). Therefore, there exists a projection matrix P_π such that π_U(x) = P_π x. With the dot product as inner product and

    \pi_U(x) = \lambda b = b \lambda = b \frac{b^\top x}{\|b\|^2} = \frac{b b^\top}{\|b\|^2} x   (3.46)

we immediately see that

    P_\pi = \frac{b b^\top}{\|b\|^2} .   (3.47)

Note that b b^⊤ is a symmetric matrix (with rank 1) and ‖b‖² = ⟨b, b⟩ is a scalar; projection matrices are always symmetric.

The projection matrix P_π projects any vector x ∈ R^n onto the line through the origin with direction b (equivalently, the subspace U spanned by b).

Remark. The projection π_U(x) ∈ R^n is still an n-dimensional vector and not a scalar. However, we no longer require n coordinates to represent the projection, but only a single one if we want to express it with respect to the basis vector b that spans the subspace U: λ. ♦

Example 3.10 (Projection onto a Line)

Find the projection matrix P_π onto the line through the origin spanned by b = [1, 2, 2]^⊤. b is a direction and a basis of the one-dimensional subspace (line through origin).
With (3.47), we obtain

    P_\pi = \frac{b b^\top}{b^\top b} = \frac{1}{9} \begin{bmatrix} 1 \\ 2 \\ 2 \end{bmatrix} \begin{bmatrix} 1 & 2 & 2 \end{bmatrix} = \frac{1}{9} \begin{bmatrix} 1 & 2 & 2 \\ 2 & 4 & 4 \\ 2 & 4 & 4 \end{bmatrix} .   (3.48)

Let us now choose a particular x and see whether it lies in the subspace spanned by b.

For x = [1, 1, 1]^⊤, the projection is

    \pi_U(x) = P_\pi x = \frac{1}{9} \begin{bmatrix} 1 & 2 & 2 \\ 2 & 4 & 4 \\ 2 & 4 & 4 \end{bmatrix} \begin{bmatrix} 1 \\ 1 \\ 1 \end{bmatrix} = \frac{1}{9} \begin{bmatrix} 5 \\ 10 \\ 10 \end{bmatrix} \in \operatorname{span}\!\left[ \begin{bmatrix} 1 \\ 2 \\ 2 \end{bmatrix} \right] .   (3.49)

Note that the application of P_π to π_U(x) does not change anything, i.e., P_π π_U(x) = π_U(x). This is expected because, according to Definition 3.10, we know that a projection matrix P_π satisfies P_π² x = P_π x for all x. (With the results from Chapter 4, we can show that π_U(x) is also an eigenvector of P_π, and the corresponding eigenvalue is 1.)

Figure 3.10 Projection onto a two-dimensional subspace U with basis b_1, b_2. The projection π_U(x) of x ∈ R^3 onto U can be expressed as a linear combination of b_1, b_2, and the displacement vector x − π_U(x) is orthogonal to both b_1 and b_2.
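The computations of this example in NumPy (an added sketch, not from the book):

```python
import numpy as np

b = np.array([1.0, 2.0, 2.0])

# Projection matrix onto span[b], see (3.47).
P = np.outer(b, b) / (b @ b)

x = np.array([1.0, 1.0, 1.0])
proj = P @ x
print(proj)                          # [5/9, 10/9, 10/9]
assert np.allclose(P @ proj, proj)   # P is idempotent: projecting twice changes nothing
assert np.allclose(P, P.T)           # P is symmetric
```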
3.7.2 Projection onto General Subspaces

In the following, we look at orthogonal projections of vectors x ∈ R^n onto higher-dimensional subspaces U ⊆ R^n with dim(U) = m ≥ 1. An illustration is given in Figure 3.10. (If U is given by a set of spanning vectors that are not a basis, make sure you determine a basis b_1, ..., b_m before proceeding.)

Assume that (b_1, ..., b_m) is an ordered basis of U. Any projection π_U(x) onto U is necessarily an element of U. Therefore, it can be represented as a linear combination of the basis vectors b_1, ..., b_m of U, such that π_U(x) = \sum_{i=1}^{m} \lambda_i b_i. The basis vectors form the columns of B ∈ R^{n×m}, where B = [b_1, ..., b_m].

As in the 1D case, we follow a three-step procedure to find the projection π_U(x) and the projection matrix P_π:
1. Find the coordinates λ_1, ..., λ_m of the projection (with respect to the basis of U), such that the linear combination

    \pi_U(x) = \sum_{i=1}^{m} \lambda_i b_i = B\lambda ,   (3.50)
    B = [b_1, \dots, b_m] \in \mathbb{R}^{n \times m} , \quad \lambda = [\lambda_1, \dots, \lambda_m]^\top \in \mathbb{R}^m ,   (3.51)

is closest to x ∈ R^n. As in the 1D case, "closest" means "minimum distance", which implies that the vector connecting π_U(x) ∈ U and x ∈ R^n must be orthogonal to all basis vectors of U. Therefore, we obtain m simultaneous conditions (assuming the dot product as the inner product)

    \langle b_1, x - \pi_U(x) \rangle = b_1^\top (x - \pi_U(x)) = 0   (3.52)
    \vdots   (3.53)
    \langle b_m, x - \pi_U(x) \rangle = b_m^\top (x - \pi_U(x)) = 0   (3.54)

which, with π_U(x) = Bλ, can be written as

    b_1^\top (x - B\lambda) = 0   (3.55)
    \vdots   (3.56)
    b_m^\top (x - B\lambda) = 0   (3.57)

such that we obtain a homogeneous linear equation system

    \begin{bmatrix} b_1^\top \\ \vdots \\ b_m^\top \end{bmatrix} (x - B\lambda) = 0 \iff B^\top (x - B\lambda) = 0   (3.58)
    \iff B^\top B \lambda = B^\top x .   (3.59)


The last expression is called the normal equation. Since b_1, ..., b_m are a basis of U and, therefore, linearly independent, B^⊤B ∈ R^{m×m} is regular and can be inverted. This allows us to solve for the coefficients/coordinates

    \lambda = (B^\top B)^{-1} B^\top x .   (3.60)

The matrix (B^⊤B)^{-1}B^⊤ is also called the pseudo-inverse of B, which can be computed for non-square matrices B. It only requires that B^⊤B is positive definite, which is the case if B is full rank. (In practical applications, e.g., linear regression, we often add a "jitter term", a small multiple of the identity matrix, to B^⊤B to guarantee increased numerical stability and positive definiteness. This "ridge" can be rigorously derived using Bayesian inference; see Chapter 9 for details.)

2. Find the projection π_U(x) ∈ U. We already established that π_U(x) = Bλ. Therefore, with (3.60),

    \pi_U(x) = B (B^\top B)^{-1} B^\top x .   (3.61)

3. Find the projection matrix P_π. From (3.61) we can immediately see that the projection matrix that solves P_π x = π_U(x) must be

    P_\pi = B (B^\top B)^{-1} B^\top .   (3.62)

Remark. Comparing the solutions for projecting onto a one-dimensional subspace and the general case, we see that the general case includes the 1D case as a special case: If dim(U) = 1, then B^⊤B ∈ R is a scalar and we can rewrite the projection matrix in (3.62), P_π = B(B^⊤B)^{-1}B^⊤, as P_π = BB^⊤/(B^⊤B), which is exactly the projection matrix in (3.47). ♦

Example 3.11 (Projection onto a Two-dimensional Subspace)

For a subspace U = span[[1, 1, 1]^⊤, [0, 1, 2]^⊤] ⊆ R^3 and x = [6, 0, 0]^⊤ ∈ R^3, find the coordinates λ of x in terms of the subspace U, the projection point π_U(x) and the projection matrix P_π.

First, we see that the generating set of U is a basis (linear independence) and write the basis vectors of U into a matrix

    B = \begin{bmatrix} 1 & 0 \\ 1 & 1 \\ 1 & 2 \end{bmatrix} .

Second, we compute the matrix B^⊤B and the vector B^⊤x as

    B^\top B = \begin{bmatrix} 1 & 1 & 1 \\ 0 & 1 & 2 \end{bmatrix} \begin{bmatrix} 1 & 0 \\ 1 & 1 \\ 1 & 2 \end{bmatrix} = \begin{bmatrix} 3 & 3 \\ 3 & 5 \end{bmatrix} , \quad B^\top x = \begin{bmatrix} 1 & 1 & 1 \\ 0 & 1 & 2 \end{bmatrix} \begin{bmatrix} 6 \\ 0 \\ 0 \end{bmatrix} = \begin{bmatrix} 6 \\ 0 \end{bmatrix} .   (3.63)

Third, we solve the normal equation B^⊤Bλ = B^⊤x to find λ:

    \begin{bmatrix} 3 & 3 \\ 3 & 5 \end{bmatrix} \begin{bmatrix} \lambda_1 \\ \lambda_2 \end{bmatrix} = \begin{bmatrix} 6 \\ 0 \end{bmatrix} \iff \lambda = \begin{bmatrix} 5 \\ -3 \end{bmatrix} .   (3.64)

Fourth, the projection π_U(x) of x onto U, i.e., into the column space of B, can be directly computed via

    \pi_U(x) = B\lambda = \begin{bmatrix} 5 \\ 2 \\ -1 \end{bmatrix} .   (3.65)

The corresponding projection error (also called the reconstruction error) is the norm of the difference vector between the original vector and its projection onto U, i.e.,

    \|x - \pi_U(x)\| = \left\| [1, -2, 1]^\top \right\| = \sqrt{6} .   (3.66)

Fifth, the projection matrix (for any x ∈ R^3) is given by

    P_\pi = B (B^\top B)^{-1} B^\top = \frac{1}{6} \begin{bmatrix} 5 & 2 & -1 \\ 2 & 2 & 2 \\ -1 & 2 & 5 \end{bmatrix} .   (3.67)

To verify the results, we can (a) check whether the displacement vector π_U(x) − x is orthogonal to all basis vectors of U, and (b) verify that P_π = P_π² (see Definition 3.10).
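The same five steps in NumPy (an added sketch, not part of the book):

```python
import numpy as np

B = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0]])
x = np.array([6.0, 0.0, 0.0])

# Solve the normal equation (3.59) for the coordinates lambda.
lam = np.linalg.solve(B.T @ B, B.T @ x)      # [5, -3]

proj = B @ lam                               # [5, 2, -1], see (3.65)
P = B @ np.linalg.inv(B.T @ B) @ B.T         # projection matrix, see (3.62)

print(lam, proj)
print(np.linalg.norm(x - proj))              # sqrt(6): the projection error (3.66)
assert np.allclose(B.T @ (x - proj), 0.0)    # displacement is orthogonal to U
assert np.allclose(P @ P, P)                 # P is idempotent
```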

Remark. The projections π_U(x) are still vectors in R^n although they lie in an m-dimensional subspace U ⊆ R^n. However, to represent a projected vector we only need the m coordinates λ_1, ..., λ_m with respect to the basis vectors b_1, ..., b_m of U. ♦

Remark. In vector spaces with general inner products, we have to pay attention when computing angles and distances, which are defined by means of the inner product. ♦

Figure 3.11 Projection onto an affine space. (a) The original setting; (b) the setting is shifted by −x_0, so that x − x_0 can be projected onto the direction space U = L − x_0; (c) the projection is translated back to x_0 + π_U(x − x_0), which gives the final orthogonal projection π_L(x).

Projections allow us to look at situations where we have a linear system Ax = b without a solution. Recall that this means that b does not lie in the span of A, i.e., the vector b does not lie in the subspace spanned by the columns of A. Given that the linear equation cannot be solved exactly, we can find an approximate solution: the idea is to find the vector in the subspace spanned by the columns of A that is closest to b, i.e., we compute the orthogonal projection of b onto the subspace spanned by the columns of A. This problem arises often in practice, and the solution is called the least squares solution (assuming the dot product as the inner product) of an overdetermined system. This is discussed further in Chapter 9.
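As a small illustration (added here, not from the book), NumPy's least-squares routine computes exactly this projection-based approximate solution of an overdetermined system:

```python
import numpy as np

# An overdetermined system Ax = b with no exact solution.
A = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0]])
b = np.array([6.0, 0.0, 0.0])

# Least-squares solution: coordinates of the orthogonal projection of b
# onto the column space of A (cf. the normal equation (3.59)).
x_ls, residual, rank, _ = np.linalg.lstsq(A, b, rcond=None)

b_proj = A @ x_ls   # the vector in span(A) closest to b
print(x_ls, b_proj, np.linalg.norm(b - b_proj))
```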

Remark. We just looked at projections of vectors x onto a subspace U with basis vectors {b_1, ..., b_k}. If this basis is an ONB, i.e., (3.33)-(3.34) are satisfied, the projection equation (3.61) simplifies greatly to

    \pi_U(x) = B B^\top x   (3.68)

since B^⊤B = I, with coordinates

    \lambda = B^\top x .   (3.69)

This means that we no longer have to compute the tedious inverse from (3.61), which saves us much computation time. ♦

3.7.3 Projection onto Affine Subspaces


Thus far, we discussed how to project a vector onto a lower-dimensional subspace U. In the following, we provide a solution to projecting a vector onto an affine subspace.

Consider the setting in Figure 3.11(a). We are given an affine space L = x_0 + U, where b_1, b_2 are basis vectors of U. To determine the orthogonal projection π_L(x) of x onto L, we transform the problem into a problem that we know how to solve: the projection onto a vector subspace.

In order to get there, we subtract the support point x_0 from x and from L, so that L − x_0 = U is exactly the vector subspace U. We can now use the orthogonal projections onto a subspace we discussed in Section 3.7.2 and obtain the projection π_U(x − x_0), which is illustrated in Figure 3.11(b). This projection can now be translated back into L by adding x_0, such that we obtain the orthogonal projection onto an affine space L as

    \pi_L(x) = x_0 + \pi_U(x - x_0) ,   (3.70)

where π_U(·) is the orthogonal projection onto the subspace U, i.e., the direction space of L; see Figure 3.11(c).

From Figure 3.11 it is also evident that the distance of x from the affine space L is identical to the distance of x − x_0 from U, i.e.,

    d(x, L) = \|x - \pi_L(x)\| = \|x - (x_0 + \pi_U(x - x_0))\|   (3.71)
    = d(x - x_0, \pi_U(x - x_0)) .   (3.72)
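A short sketch of (3.70)-(3.71) in NumPy (added illustration; the specific support point, directions and test vector are made up for this example):

```python
import numpy as np

def project_affine(x, x0, B):
    """Orthogonal projection of x onto the affine space L = x0 + span(columns of B),
    following (3.70): shift by -x0, project onto the direction space, shift back."""
    d = x - x0
    proj_U = B @ np.linalg.solve(B.T @ B, B.T @ d)   # projection onto U, see (3.61)
    return x0 + proj_U

# Hypothetical example: a plane in R^3 through x0 with directions e1, e2.
x0 = np.array([1.0, 1.0, 1.0])
B = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [0.0, 0.0]])
x = np.array([3.0, 2.0, 5.0])

pi_L = project_affine(x, x0, B)
print(pi_L)                         # [3, 2, 1]
print(np.linalg.norm(x - pi_L))     # distance d(x, L) = 4, cf. (3.71)
```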

3.8 Rotations


Length and angle preservation, as discussed in Section 3.4, are the two characteristics of linear mappings with orthogonal transformation matrices. In the following, we will have a closer look at specific orthogonal transformation matrices, which describe rotations.

A rotation is a linear mapping (more specifically, an automorphism of a Euclidean vector space) that rotates a plane by an angle θ about the origin, i.e., the origin is a fixed point. For a positive angle θ > 0, by common convention, we rotate in a counterclockwise direction. An example is shown in Figure 3.12, where the transformation matrix is

    R = \begin{bmatrix} -0.38 & -0.92 \\ 0.92 & -0.38 \end{bmatrix} .   (3.73)

Figure 3.12 A rotation rotates objects in a plane about the origin. If the rotation angle is positive, we rotate counterclockwise. (The figure shows an original object and the same object rotated by 112.5°.)


Figure 3.13 The robotic arm needs to rotate its joints in order to pick up objects or to place them correctly. Figure taken from (Deisenroth et al., 2015).

Figure 3.14 Rotation of the standard basis in R^2 by an angle θ: Φ(e_1) = [cos θ, sin θ]^⊤ and Φ(e_2) = [−sin θ, cos θ]^⊤.

Important application areas of rotations include computer graphics and robotics. For example, in robotics, it is often important to know how to rotate the joints of a robotic arm in order to pick up or place an object; see Figure 3.13.

3.8.1 Rotations in R2


    
Consider the standard basis e_1 = [1, 0]^⊤, e_2 = [0, 1]^⊤ of R^2, which defines the standard coordinate system in R^2. We aim to rotate this coordinate system by an angle θ as illustrated in Figure 3.14. Note that the rotated vectors are still linearly independent and, therefore, are a basis of R^2. This means that the rotation performs a basis change.

Rotations Φ are linear mappings, so that we can express them by a rotation matrix R(θ). Trigonometry (see Figure 3.14) allows us to determine the coordinates of the rotated axes (the image of Φ) with respect to the standard basis in R^2. We obtain

    \Phi(e_1) = \begin{bmatrix} \cos\theta \\ \sin\theta \end{bmatrix} , \quad \Phi(e_2) = \begin{bmatrix} -\sin\theta \\ \cos\theta \end{bmatrix} .   (3.74)


Therefore, the rotation matrix that performs the basis change into the rotated coordinates R(θ) is given as

    R(\theta) = \begin{bmatrix} \Phi(e_1) & \Phi(e_2) \end{bmatrix} = \begin{bmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{bmatrix} .   (3.75)
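For concreteness, (3.75) in NumPy (an added sketch, not from the book):

```python
import numpy as np

def rotation_2d(theta):
    """Rotation matrix R(theta) from (3.75)."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s],
                     [s,  c]])

R = rotation_2d(np.deg2rad(90))
print(np.round(R @ np.array([1.0, 0.0]), 8))  # e1 is rotated onto e2: [0, 1]

# Rotation matrices are orthogonal and have determinant 1.
assert np.allclose(R.T @ R, np.eye(2))
assert np.isclose(np.linalg.det(R), 1.0)
```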

3.8.2 Rotations in R3


In contrast to the R^2 case, in R^3 we can rotate any two-dimensional plane about a one-dimensional axis. The easiest way to specify the general rotation matrix is to specify how the images of the standard basis e_1, e_2, e_3 are supposed to be rotated, and making sure these images Re_1, Re_2, Re_3 are orthonormal to each other. We can then obtain a general rotation matrix R by combining the images of the standard basis.
To have a meaningful rotation angle, we have to define what "counterclockwise" means when we operate in more than two dimensions. We use the convention that a "counterclockwise" (planar) rotation about an axis refers to a rotation about an axis when we look at the axis "head on, from the end toward the origin". In R^3, there are therefore three (planar) rotations about the three standard basis vectors (see Figure 3.15):

• Rotation about the e_1-axis

    R_1(\theta) = \begin{bmatrix} \Phi(e_1) & \Phi(e_2) & \Phi(e_3) \end{bmatrix} = \begin{bmatrix} 1 & 0 & 0 \\ 0 & \cos\theta & -\sin\theta \\ 0 & \sin\theta & \cos\theta \end{bmatrix} .   (3.76)

Here, the e_1 coordinate is fixed, and the counterclockwise rotation is performed in the e_2e_3 plane.

• Rotation about the e_2-axis

    R_2(\theta) = \begin{bmatrix} \cos\theta & 0 & \sin\theta \\ 0 & 1 & 0 \\ -\sin\theta & 0 & \cos\theta \end{bmatrix} .   (3.77)

If we rotate the e_1e_3 plane about the e_2 axis, we need to look at the e_2 axis from its "tip" toward the origin.


• Rotation about the e_3-axis

    R_3(\theta) = \begin{bmatrix} \cos\theta & -\sin\theta & 0 \\ \sin\theta & \cos\theta & 0 \\ 0 & 0 & 1 \end{bmatrix} .   (3.78)

Figure 3.15 illustrates this.

Figure 3.15 Rotation of a vector (gray) in R^3 by an angle θ about the e_3-axis. The rotated vector is shown in blue.

3.8.3 Rotations in n Dimensions


The generalization of rotations from 2D and 3D to n-dimensional Euclidean vector spaces can be intuitively described as fixing n − 2 dimensions and restricting the rotation to a two-dimensional plane in the n-dimensional space. As in the three-dimensional case, we can rotate any plane (two-dimensional subspace of R^n).
Definition 3.11 (Givens Rotation). Let V be an n-dimensional Euclidean vector space and Φ : V → V an automorphism with transformation matrix

    R_{ij}(\theta) := \begin{bmatrix} I_{i-1} & 0 & \cdots & \cdots & 0 \\ 0 & \cos\theta & 0 & -\sin\theta & 0 \\ 0 & 0 & I_{j-i-1} & 0 & 0 \\ 0 & \sin\theta & 0 & \cos\theta & 0 \\ 0 & \cdots & \cdots & 0 & I_{n-j} \end{bmatrix} \in \mathbb{R}^{n \times n} ,   (3.79)

for 1 ≤ i < j ≤ n and θ ∈ R. Then R_{ij}(θ) is called a Givens rotation. Essentially, R_{ij}(θ) is the identity matrix I_n with

    r_{ii} = \cos\theta , \quad r_{ij} = -\sin\theta , \quad r_{ji} = \sin\theta , \quad r_{jj} = \cos\theta .   (3.80)

In two dimensions (i.e., n = 2), we obtain (3.75) as a special case.
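A minimal construction of (3.79)-(3.80) in NumPy (an added sketch; note it uses 0-based indices, unlike the 1-based indices in the definition):

```python
import numpy as np

def givens(n, i, j, theta):
    """Givens rotation R_ij(theta) in R^n, see (3.79)-(3.80); 0-based indices."""
    R = np.eye(n)
    c, s = np.cos(theta), np.sin(theta)
    R[i, i], R[j, j] = c, c
    R[i, j], R[j, i] = -s, s
    return R

R = givens(4, 1, 3, np.pi / 6)           # rotates the plane spanned by e2 and e4 in R^4
assert np.allclose(R.T @ R, np.eye(4))   # Givens rotations are orthogonal
assert np.allclose(givens(2, 0, 1, 0.3), # n = 2 recovers (3.75)
                   np.array([[np.cos(0.3), -np.sin(0.3)],
                             [np.sin(0.3),  np.cos(0.3)]]))
```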

3.8.4 Properties of Rotations


Rotations exhibit a number of useful properties:

• Rotations preserve distances, i.e., ‖x − y‖ = ‖R_θ(x) − R_θ(y)‖. In other words, rotations leave the distance between any two points unchanged after the transformation.
• Rotations preserve angles, i.e., the angle between R_θ x and R_θ y equals the angle between x and y.
• Rotations in three (or more) dimensions are generally not commutative. Therefore, the order in which rotations are applied is important, even if they rotate about the same point. Only in two dimensions are vector rotations commutative, such that R(φ)R(θ) = R(θ)R(φ) for all φ, θ ∈ [0, 2π), and they form an Abelian group (with multiplication) only if they rotate about the same point (e.g., the origin).


3.9 Further Reading


In this chapter, we gave a brief overview of some of the important concepts of analytic geometry, which we will use in later chapters of the book. For a broader and more in-depth overview of some of the concepts we presented, we refer to the following excellent books: Axler (2015) and Boyd and Vandenberghe (2018).

Inner products allow us to determine specific bases of vector (sub)spaces, where each vector is orthogonal to all others (orthogonal bases), using the Gram-Schmidt method. These bases are important in optimization and numerical algorithms for solving linear equation systems. For instance, Krylov subspace methods, such as Conjugate Gradients or GMRES, minimize residual errors that are orthogonal to each other (Stoer and Burlirsch, 2002).

In machine learning, inner products are important in the context of kernel methods (Schölkopf and Smola, 2002). Kernel methods exploit the fact that many linear algorithms can be expressed purely by inner product computations. Then, the "kernel trick" allows us to compute these inner products implicitly in a (potentially infinite-dimensional) feature space, without even knowing this feature space explicitly. This allowed the "non-linearization" of many algorithms used in machine learning, such as kernel-PCA (Schölkopf et al., 1997) for dimensionality reduction. Gaussian processes (Rasmussen and Williams, 2006) also fall into the category of kernel methods and are the current state-of-the-art in probabilistic regression (fitting curves to data points). The idea of kernels is explored further in Chapter 12.

Projections are often used in computer graphics, e.g., to generate shadows. In optimization, orthogonal projections are often used to (iteratively) minimize residual errors. This also has applications in machine learning, e.g., in linear regression, where we want to find a (linear) function that minimizes the residual errors, i.e., the lengths of the orthogonal projections of the data onto the linear function (Bishop, 2006). We will investigate this further in Chapter 9. PCA (Hotelling, 1933; Pearson, 1901b) also uses projections to reduce the dimensionality of high-dimensional data. We will discuss this in more detail in Chapter 10.

Exercises
3.1 Show that ⟨·, ·⟩ defined for all x = (x_1, x_2) and y = (y_1, y_2) in R^2 by

    \langle x, y \rangle := x_1 y_1 - (x_1 y_2 + x_2 y_1) + 2(x_2 y_2)

is an inner product.


3.2 Consider R^2 with ⟨·, ·⟩ defined for all x and y in R^2 as

    \langle x, y \rangle := x^\top \underbrace{\begin{bmatrix} 2 & 0 \\ 1 & 2 \end{bmatrix}}_{=:A} y .

Is ⟨·, ·⟩ an inner product?


3.3 Consider the Euclidean vector space R^5 with the dot product. A subspace U ⊆ R^5 and x ∈ R^5 are given by

    U = \operatorname{span}\!\left[ \begin{bmatrix} 0 \\ -1 \\ 2 \\ 0 \\ 2 \end{bmatrix}, \begin{bmatrix} 1 \\ -3 \\ -1 \\ -1 \\ 2 \end{bmatrix}, \begin{bmatrix} -3 \\ 4 \\ 1 \\ 2 \\ 1 \end{bmatrix}, \begin{bmatrix} -1 \\ -3 \\ 1 \\ 0 \\ 7 \end{bmatrix} \right] , \quad x = \begin{bmatrix} -1 \\ -9 \\ 5 \\ 4 \\ 1 \end{bmatrix} .

1. Determine the orthogonal projection π_U(x) of x onto U.
2. Determine the distance d(x, U).
3.4 Consider R^3 with the inner product

    \langle x, y \rangle := x^\top \begin{bmatrix} 2 & 1 & 0 \\ 1 & 2 & -1 \\ 0 & -1 & 2 \end{bmatrix} y .

Furthermore, we define e_1, e_2, e_3 as the standard/canonical basis in R^3.

1. Determine the orthogonal projection π_U(e_2) of e_2 onto U = span[e_1, e_3].
   Hint: Orthogonality is defined through the inner product.
2. Compute the distance d(e_2, U).
3. Draw the scenario: standard basis vectors and π_U(e_2).
3.5 Prove the Cauchy-Schwarz inequality |⟨x, y⟩| ≤ ‖x‖ ‖y‖ for x, y ∈ V, where V is a vector space.
