Tungban Machine Learning Math Course
1
Disclaimer
2
Sets, Scalars, Vectors, Matrices and Tensors
Sets
I A set S is the mathematical model for a collection of different things.
I Examples:
I ∅: Empty set
I {red, green, blue}: Set of colors
I {0, 1}: Set of binary numbers
I R: Set of real numbers (e.g., 1.234 ∈ R)
I N0 = {0, 1, 2, . . . }: Set of natural numbers including 0
I {R, N0 }: Set of two sets
I The cardinality of a set S, denoted |S|, is the number of members of S
I We can take the union of two sets S = S1 ∪ S2
I We can take the intersection of two sets S = S1 ∩ S2
I We can take the Cartesian product of two sets: S = S1 × S2 , e.g., R2 = R × R
1
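To make these operations concrete, here is a minimal Python sketch (the example values are illustrative and not from the slides):

# Basic set operations in Python, mirroring the definitions above.
from itertools import product

S1 = {"red", "green", "blue"}   # a set of colors
S2 = {0, 1}                     # the set of binary numbers

print(len(S1))                  # cardinality |S1| = 3
print(S1 | {"red", "black"})    # union
print(S2 & {1, 2, 3})           # intersection -> {1}
print(set(product(S2, S2)))     # Cartesian product S2 x S2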
Scalars
2
Vectors
I A vector is an array of numbers with specific order
I We write vectors in lower case, bold typeface font
I Example:
x = (x1 , x2 , x3 )T (a column vector with three elements)
I We can identify each element of the vector (scalar) via its index (x1 , x2 , . . . )
I If each xi ∈ R, then x ∈ R3 = R × R × R (x is in the Cartesian product of R)
I We can think of vectors as identifying points in space (element = coordinate)
I We can also index a subset of elements of a vector: xS with S = {1, 3}
3
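A minimal NumPy sketch of vector indexing (note that NumPy uses 0-based indices, so the index set S = {1, 3} becomes [0, 2]):

import numpy as np

x = np.array([1.0, 2.0, 3.0])   # x in R^3
print(x[0], x[2])               # elements x_1 and x_3 (0-based indexing)
S = [0, 2]                      # index set S = {1, 3} in 0-based form
print(x[S])                     # the sub-vector x_S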
Matrices
I A matrix is a 2D array of numbers, each element is identified by two indices
I We write matrices in upper case, bold typeface font
I Example:
A = ( a1,1  a1,2 )
    ( a2,1  a2,2 )
I For a matrix with M rows and N columns we write A ∈ RM×N ;
a tensor generalizes this to arrays with more than two indices, e.g., A ∈ RM×N×K
5
Transpose
I The transpose AT of a matrix A mirrors it at its main diagonal:
A = ( a1,1  a1,2 )      ⇒   AT = ( a1,1  a2,1  a3,1 )
    ( a2,1  a2,2 )               ( a1,2  a2,2  a3,2 )
    ( a3,1  a3,2 )
I We can write a column vector inline as x = (x1 , x2 , x3 )T , which is better for inline math
I Example: for
A = ( 1  2 )
    ( 3  4 )
we have
A:,1 = (1, 3)T ∈ R2×1      A1,: = (1, 2) ∈ R1×2
A:,2 = (2, 4)T ∈ R2×1      (AT )2,: = (2, 4) ∈ R1×2
I These rows and columns can be considered either as matrices (R1×2 / R2×1 ) or as vectors in R2
I If they are considered vectors, we write either a or aT (with a ∈ R2 )
to distinguish row from column vectors, which would otherwise not be clear
7
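A minimal NumPy sketch of transposition and of extracting rows and columns (values are illustrative):

import numpy as np

A = np.array([[1, 2],
              [3, 4]])
print(A.T)           # transpose: mirrors A at its main diagonal
print(A[:, 0])       # first column (1, 3), returned as a 1-D array (vector)
print(A[0, :])       # first row (1, 2)
print(A[:, [0]])     # first column kept as a 2x1 matrix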
Adding and Multiplying Matrices and Vectors
Adding and Subtracting Matrices and Vectors
I We add/subtract vectors or matrices by adding/subtracting them elementwise:
C = A + B with ci,j = ai,j + bi,j
I Note that the vectors or matrices must have the same shape
1
Adding and Subtracting Matrices and Vectors
I We can also add a scalar to a matrix or multiply a matrix by a scalar:
D = aB + c
di,j = a bi,j + c
I In deep learning, we sometimes also allow the addition of a matrix and a vector:
C = A + b with ci,j = ai,j + bj (the vector b is added to each row of A; this implicit copying is called broadcasting)
3
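A minimal NumPy sketch of scalar operations and of matrix-plus-vector broadcasting (values are illustrative):

import numpy as np

B = np.array([[1., 2.],
              [3., 4.]])
c_vec = np.array([10., 20.])

print(2 * B + 1)     # scalar multiply and add: d_ij = 2*b_ij + 1
print(B + c_vec)     # matrix + vector: the vector is added to every row (broadcasting)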
Multiplying Matrices and Vectors
I Two vectors or matrices A and B can be multiplied if A has the same number of
columns as B has rows (i.e., A ∈ RM ×N and B ∈ RN ×L ):
C = AB ∈ RM×L with ci,l = Σn ai,n bn,l
I Note that the matrix product is not the matrix containing the products of the
individual elements, which is called the Hadamard product: A ⊙ B (A, B ∈ RM×N )
4
Multiplying Matrices and Vectors
I Example for a matrix product:
A = ( 3  1 )      B = ( 1  2 )      ⇒   AB = ( 6  7 )
    ( 2  1 )          ( 3  1 )                ( 5  5 )
I Example for an inner product between two vectors (= “dot product”):
a = (3, 2)T      b = (1, 3)T      ⇒   aT b = 9
I Example for an outer product between two vectors (= rank 1 matrix):
a = (3, 2)T      b = (1, 3)T      ⇒   abT = ( 3  9 )
                                            ( 2  6 )
5
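The examples above can be checked with a minimal NumPy sketch (same numbers as in the slides):

import numpy as np

A = np.array([[3, 1],
              [2, 1]])
B = np.array([[1, 2],
              [3, 1]])
a = np.array([3, 2])
b = np.array([1, 3])

print(A @ B)          # matrix product  -> [[6, 7], [5, 5]]
print(A * B)          # Hadamard (element-wise) product, not the matrix product
print(a @ b)          # inner (dot) product -> 9
print(np.outer(a, b)) # outer product   -> [[3, 9], [2, 6]]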
Useful Properties
A(B + C) = AB + AC
A(BC) = (AB)C
AB ≠ BA (in general)
7
Useful Properties
I The transpose of a matrix product has a simple form:
(AB)T = BT AT
I Since the transpose of a scalar is the scalar itself, the dot product is symmetric:
xT y = (xT y)T = yT x
I Multiplying a vector by the identity matrix IN does not change it:
∀ x ∈ RN : IN x = x
I Example:
I3 = diag(1, 1, 1) = ( 1  0  0 )      and      I3 (x1 , x2 , x3 )T = (x1 , x2 , x3 )T
                     ( 0  1  0 )
                     ( 0  0  1 )
I IN is a square matrix with all diagonal elements equal to one and others zero
1
Identity and Inverse Matrices
I The matrix inverse A−1 of square matrix A ∈ RN ×N is defined by
A−1 A = IN
Ax = b   ⇒   A−1 Ax = A−1 b   ⇒   IN x = A−1 b   ⇒   x = A−1 b
I Remark: This is only possible if A−1 exists (has full rank, will discuss later)
I In practice it is not advisable to solve linear systems by explicitly computing A−1 (numerical precision)
2
Identity and Inverse Matrices
I Example:
A = ( 1  1 )      ⇒   A−1 = ( 0    0.5 )
    ( 2  0 )                ( 1   −0.5 )
A−1 A = ( 0    0.5 ) ( 1  1 ) = ( 1  0 ) = I2
        ( 1   −0.5 ) ( 2  0 )   ( 0  1 )
b = (2, 1)T      ⇒   x = A−1 b = ( 0    0.5 ) ( 2 ) = ( 0.5 )
                                 ( 1   −0.5 ) ( 1 )   ( 1.5 )
3
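A minimal NumPy sketch of the example above; np.linalg.solve is the numerically preferable route mentioned in the remark:

import numpy as np

A = np.array([[1., 1.],
              [2., 0.]])
b = np.array([2., 1.])

x_inv   = np.linalg.inv(A) @ b   # explicit inverse (discouraged in practice)
x_solve = np.linalg.solve(A, b)  # preferred: solves Ax = b directly
print(x_inv, x_solve)            # both give [0.5, 1.5]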
Linear Dependence and Span
Linear Dependence and Span
I For A−1 to exist, Ax = b must have exactly one solution for every value of b
I It is possible for Ax = b to have 0, 1 or infinitely many solutions
I If both x and y are solutions, then z = α x + (1 − α)y is also a solution for any α ∈ R,
so there are infinitely many solutions
I Example:
( 1  1 ) ( 3 )  =  3 ( 1 )  +  2 ( 1 )  =  ( 3 )  +  ( 2 )  =  ( 5 )
( 2  0 ) ( 2 )       ( 2 )       ( 0 )     ( 6 )     ( 0 )     ( 6 )
   A       x        x1           x2
1
Linear Dependence and Span
I The span of a set of vectors is the set of all points obtained by linear combination:
( 1  1 ) ( 3 )  =  3 ( 1 )  +  2 ( 1 )  =  ( 3 )  +  ( 2 )  =  ( 5 )
( 2  0 ) ( 2 )       ( 2 )       ( 0 )     ( 6 )     ( 0 )     ( 6 )
   A       x        x1 a1       x2 a2                            b
[Figure: the columns a1 , a2 of A and the point b in the 2D plane]
I Ax = b has a solution ⇔ b is in the span of the columns of A
I This particular span is known as the column space or range of A
I When is there no solution?
2
Linear Dependence and Span
I Consider another example:
( 1  2 ) ( x1 )  =  x1 ( 1 )  +  x2 ( 2 )  =  ( 5 )
( 2  4 ) ( x2 )         ( 2 )        ( 4 )    ( 6 )
    A      x             a1           a2       b
[Figure: a1 , a2 and b in the 2D plane; a1 and a2 lie on the same line, b does not]
I a2 is a multiple of a1
I The columns are linearly dependent
I The column space is a line
I b is not in the span of the columns of A
⇒ Ax = b has no solution
I When are there ∞ many solutions?
3
Linear Dependence and Span
I Consider another example:
( 1  2 ) ( x1 )  =  x1 ( 1 )  +  x2 ( 2 )  =  ( 3 )
( 2  4 ) ( x2 )         ( 2 )        ( 4 )    ( 6 )
    A      x             a1           a2       b
[Figure: a1 , a2 and b in the 2D plane; b lies on the same line as a1 and a2 ]
I a2 is a multiple of a1
I The columns are linearly dependent
I The column space is a line
I b is in the span of the columns of A
⇒ Ax = b has ∞ many solutions
4
Linear Dependence and Span
I In order for Ax = b to have a solution for all values of b ∈ RN ,
the column space of A must encompass all of RN
I For the column space of a matrix with N rows to encompass all of RN ,
the matrix must contain at least one set of N linearly independent columns
I For the matrix to have an inverse, we additionally need to ensure
that Ax = b has at most one solution x for each value of b
I Hence, the matrix must have exactly N columns (square matrix A ∈ RN ×N )
and all columns must be linearly independent
I The rank of a matrix refers to the number of linearly independent rows or columns
I A square matrix with linearly dependent columns is called singular
I Every square matrix with full rank (= every non-singular matrix) can be inverted
5
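A minimal NumPy sketch checking rank and singularity for the earlier examples (matrices taken from the slides):

import numpy as np

A_full = np.array([[1., 1.],
                   [2., 0.]])
A_sing = np.array([[1., 2.],
                   [2., 4.]])   # second column is a multiple of the first

print(np.linalg.matrix_rank(A_full))  # 2 -> full rank, invertible
print(np.linalg.matrix_rank(A_sing))  # 1 -> singular, no inverse exists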
Vector and Matrix Norms
Vector and Matrix Norms
I The Lp norm of a vector x is defined as
‖x‖p = ( Σi |xi |^p )^(1/p)    for p ≥ 1
1
Vector and Matrix Norms
‖x‖1 = Σi |xi | = |x1 | + |x2 |
‖x‖2 = √( Σi xi² ) = √( xT x ) = √( x1² + x2² )
2
Vector and Matrix Norms
I The dot product of two vectors x and y can be written in terms of norms:
xT y = ‖x‖2 ‖y‖2 cos θ, where θ is the angle between x and y
3
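A minimal NumPy sketch of these norm and dot-product relations (values are illustrative):

import numpy as np

x = np.array([3., 4.])
print(np.linalg.norm(x, 1))   # L1 norm: |3| + |4| = 7
print(np.linalg.norm(x, 2))   # L2 norm: sqrt(9 + 16) = 5
print(np.sqrt(x @ x))         # same L2 norm via sqrt(x^T x)

y = np.array([1., 0.])
cos_theta = (x @ y) / (np.linalg.norm(x) * np.linalg.norm(y))
print(x @ y, cos_theta)       # dot product and cosine of the angle between x and y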
Special Matrices and Vectors
Special Matrices and Vectors
Some special kinds of matrices and vectors are particularly useful:
I Diagonal matrices D = diag(d) have non-zero values only on their diagonal
I Multiplying with diagonal matrices is easy: diag(d)x = d ⊙ x
I Inverting diagonal matrices is easy: diag(d)−1 = diag((1/d1 , . . . , 1/dN )T )
I A symmetric matrix is equal to its own transpose: A = AT
I Symmetric matrices arise for instance if entries are distances with ai,j = aj,i
I A unit vector is a vector with unit ℓ2 norm: ‖x‖2 = 1
I A vector x is orthogonal to vector y if xT y = 0
I Unit vectors that are orthogonal are called orthonormal
I An orthogonal matrix is a square matrix whose rows/columns are mutually orthonormal:
AT A = AAT = I, hence A−1 = AT
Eigenvalue Decomposition
I Many mathematical objects can be better understood by breaking them into parts
(e.g., integers can be decomposed into prime numbers)
I An eigendecomposition decomposes a matrix into eigenvectors and eigenvalues
I An eigenvector of a square matrix A is a non-zero vector v such that multiplication
by A only alters its scale by a factor λ, known as the corresponding eigenvalue:
Av = λv
1
Eigenvalue Decomposition
I We concatenate all eigenvectors to form a matrix V = (v1 , . . . , vN )
I We form all eigenvalues into a diagonal matrix Λ = diag((λ1 , . . . , λN )T )
(remark: by convention, we typically sort the eigenvalues in descending order)
I The eigendecomposition of A is given by
A = VΛV−1
I For a real symmetric matrix A, the eigenvectors can be chosen orthonormal, giving
A = QΛQT
with an orthogonal matrix Q of eigenvectors
[Figure: geometric interpretation as 1. Rotate, 2. Scale, 3. Rotate]
Singular Value Decomposition
I The singular value decomposition (SVD) decomposes a (possibly non-square) matrix A as
A = UDVT
where U ∈ RM×M , D ∈ RM×N and V ∈ RN×N
I U and V are orthogonal matrices
I D is a diagonal (square or non-square) matrix
I The elements along the diagonal of D are known as singular values
I The columns of U and V are left-/right-singular vectors, respectively
5
Relationship between EVD and SVD
A = UDVT
AT A = (UDVT )T UDVT = VDT UT UDVT = VDT DVT
AAT = UDVT (UDVT )T = UDVT VDT UT = UDDT UT
I Hence the eigenvalues of AT A (and AAT ) are the squared singular values of A,
and the eigenvectors of AT A / AAT are the right-/left-singular vectors
6
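A minimal NumPy sketch checking this relationship numerically (random test matrix, purely illustrative):

import numpy as np

A = np.random.randn(4, 3)
U, s, Vt = np.linalg.svd(A)                 # A = U diag(s) V^T
eigvals, eigvecs = np.linalg.eigh(A.T @ A)  # eigendecomposition of the symmetric matrix A^T A

# The eigenvalues of A^T A equal the squared singular values of A.
print(np.sort(eigvals)[::-1])
print(s**2)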
Application: Principal Component Analysis (PCA)
[Figure: scatter plots of a 2D dataset and its PCA reconstruction; axes x1 and x2 ]
I PCA is a technique for analyzing large high-dimensional datasets (see DL lec. 11)
I Eigenvectors of cov. matrix (=principal comp.) face direction of largest variance
I In this illustration, the two eigenvectors vi shown in black are scaled by √λi
7
The Trace Operator and Determinant
The Trace Operator
I The trace operator returns the sum of all diagonal elements of a matrix:
Tr(A) = Σi ai,i
1
The Trace Operator
I The trace operator allows for writing the Frobenius norm without summation:
‖A‖F = √( Tr(AAT ) )
Tr(A) = Tr(AT )
2
The Determinant
I The determinant det(A) maps a square matrix A to a real scalar
I det(A) equals the product of the eigenvalues of A
I |det(A)| measures how much multiplication by A expands or contracts space
I A is singular (not invertible) ⇔ det(A) = 0
3
Differential Calculus
Derivative
I The derivative of a function f (x) measures the
sensitivity to change of the function value wrt.
a change in its argument and is given by
f′(a) = df/dx (a) = lim h→0 ( f (a + h) − f (a) ) / h
I Example: for f (x) = x² we obtain f′(x) = 2x
3
Derivatives of Multivariate Functions
Let f (x, y) be a function where y = y(x) depends on x.
The partial derivative (y treated as constant) is defined via the chain rule as
∂f (x, y)/∂x = ∂f/∂x · dx/dx
The total derivative follows the multi-variable chain rule:
df (x, y)/dx = ∂f/∂x · dx/dx + ∂f/∂y · dy/dx
Example (partial derivative):
f (x, y) = xy   ⇒   ∂f (x, y)/∂x = y
Example (total derivative):
f (x, y) = xy ∧ y = x   ⇒   df (x, y)/dx = y + x = 2x
An implicit relation such as x² + y² = 1 can be written as f (x, y) = 0.
5
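A minimal Python sketch checking the example above numerically with finite differences (step size chosen arbitrarily):

# f(x, y) = x*y with y = x. The partial derivative treats y as a constant;
# the total derivative also accounts for the dependence of y on x.
def f(x, y):
    return x * y

x0, h = 2.0, 1e-6
partial = (f(x0 + h, x0) - f(x0, x0)) / h        # y held fixed at x0 -> ~2.0 (= y)
total   = (f(x0 + h, x0 + h) - f(x0, x0)) / h    # y = x moves with x  -> ~4.0 (= 2*x)
print(partial, total)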
Vector Calculus
Derivative of a Vector-to-Vector Function
I For a vector-to-vector function f : RN → RM , the derivative at x is the matrix
J(x) = ∂f/∂x (x) ∈ RM×N with entries Ji,j (x) = ∂fi/∂xj (x),
where ∂fi/∂xj (x) is the (partial) derivative of the function f ’s i-th output wrt. the j-th
component xj of the function f ’s input vector x. J(x) is called the Jacobian matrix.
1
Two frequent Special Cases
I For a scalar-to-scalar function f : R → R, the Jacobian reduces to the ordinary derivative:
f′(x) = ∂f/∂x (x) ∈ R1×1 ≡ R
I For a vector-to-scalar function f : RN → R, it reduces to a single row of partial derivatives:
(∇x f )(x) = ∂f/∂x (x) = ( ∂f/∂x1 (x) · · · ∂f/∂xN (x) ) ∈ R1×N
The alternative notation (∇x f )(x) is read as the gradient of f at x and is exclusively
used for vector-to-scalar functions.
2
Gradient and Hessian
Consider a vector-to-scalar function f : RN → R.
I The first-order derivative of f is called the gradient vector. The components of the
gradient vector are the first-order partial derivatives of f at x:
(∇x f )(x) = ∂f/∂x (x) = ( ∂f/∂x1 (x) · · · ∂f/∂xN (x) ) ∈ R1×N
I The second-order derivative of f is called the Hessian matrix, a square matrix
with entries Hi,j (x) = ∂²f/∂xi ∂xj (x)
Chain Rule
Let f : RN → RM and g : RM → RP and define the composition h(x) = g(f (x));
then h : RN → RP and:
∂h/∂x (x) = ∂g/∂y (f (x)) · ∂f/∂x (x)
Here, the P × M matrix ∂g/∂y (f (x)) gets multiplied by the M × N matrix ∂f/∂x (x)
to form the resulting P × N matrix ∂h/∂x (x).
4
Special Case of Chain Rule
Let f : R → RN and g : RN → R and define h(x) = g(f (x));
then h : R → R and:
∂h/∂x (x) = ∂g/∂y (f (x)) · ∂f/∂x (x) = Σ i=1..N ∂g/∂yi (f (x)) · ∂fi/∂x (x)
Here the 1 × N matrix ∂g/∂y (f (x)) gets multiplied by the N × 1 matrix ∂f/∂x (x)
to form the resulting 1 × 1 matrix ∂h/∂x (x). Note that this is the total derivative!
5
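A minimal Python sketch checking this special case numerically (the functions f and g below are made-up examples, not from the slides):

import numpy as np

def f(x):             # f: R -> R^2, f(x) = (x, x**2)
    return np.array([x, x**2])

def g(y):             # g: R^2 -> R, g(y) = y_1 + 3*y_2
    return y[0] + 3 * y[1]

def h(x):             # composition h = g(f(x))
    return g(f(x))

x0, eps = 1.5, 1e-6
numeric  = (h(x0 + eps) - h(x0)) / eps
# dg/dy = (1, 3), df/dx = (1, 2*x) -> dh/dx = 1*1 + 3*2*x = 1 + 6*x
analytic = 1 + 6 * x0
print(numeric, analytic)   # both ≈ 10.0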
Random Variables and Probability Distributions
Why Probability?
Nearly all activities require some ability to reason in the presence of uncertainty.
Probability Theory:
I Mathematical framework for representing uncertain statements
I Tool to quantify uncertainty and reason in the presence of uncertainty
I Information quantifies the amount of uncertainty in a distribution
1
Terminology
2
Discrete Probability Distributions
I A random variable X can take values from a discrete set of outcomes X
I Example: 6-sided die with equal probability 1/6 for each of the 6 numbers
I We usually use the short-hand notation
p(x|y) = p(x, y) / p(y)    for    p(X = x | Y = y) = p(X = x, Y = y) / p(Y = y)
4
Joint, Marginal and Conditional Probability
Consider two discrete random variables X and Y. Let ni,j denote the number of observations
with X = xi and Y = yj , let cj = Σi ni,j denote the number of observations with Y = yj ,
and let N = Σi,j ni,j denote the total number of observations.
I Joint probability
p(xi , yj ) = ni,j / N = ni,j / Σi,j ni,j
I Conditional probability
p(xi | yj ) = ni,j / cj = ni,j / Σi ni,j
I Marginal probability
p(yj ) = cj / N = Σi ni,j / Σi,j ni,j
5
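A minimal NumPy sketch computing these quantities from a table of counts (the counts are made up):

import numpy as np

# Counts n_ij of joint outcomes (rows: values of X, columns: values of Y).
n = np.array([[3., 1.],
              [2., 4.]])
N = n.sum()

p_joint = n / N                      # p(x_i, y_j) = n_ij / N
p_y     = n.sum(axis=0) / N          # marginal p(y_j) = c_j / N
p_x_given_y = n / n.sum(axis=0)      # conditional p(x_i | y_j) = n_ij / c_j
print(p_joint, p_y, p_x_given_y, sep="\n")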
The Rules of Probability
I Sum rule
p(x) = Σ y∈Y p(x, y)    or    p(y) = Σ x∈X p(x, y)
I Product rule
p(x, y) = p(y|x)p(x) = p(x|y)p(y)
I Bayes’ theorem
p(y|x) = p(x|y) p(y) / p(x)
8
Continuous Probability Distributions
9
Probability Densities
I p(x) must satisfy the following conditions:
p(x) ≥ 0
∫−∞..∞ p(x) dx = 1
I Note that p(X = x) is often 0 while p(x) is often greater than 0. For continuous
distributions we consider p(x) as the PDF and not as short-hand for p(X = x).
10
Probability Densities
I The conditional density is defined as
p(x|y) = p(x, y) / p(y)
12
Joint, Marginal and Conditional Probability
I Product rule
p(x, y) = p(y|x)p(x) = p(x|y)p(y)
I Bayes’ theorem
p(y|x) = p(x|y) p(y) / p(x)
14
Independence and Conditional Independence
I Two random variables X and Y are independent if their probability distribution
can be expressed as a product of two factors: p(x, y) = p(x) p(y)
Bernoulli Distribution
p(x) = µ^x (1 − µ)^(1−x)    for x ∈ {0, 1}
I µ: probability for x = 1
I Handles two classes, e.g. (“cats” vs. “dogs”)
[Figure: bar plot of p(x) over x ∈ {0, 1}]
1
Categorical Distribution
p(x) = µx
I µx : probability for class x
I Handles multiple classes
[Figure: bar plot of p(x) over x ∈ {1, 2, 3, 4}]
2
Gaussian Distribution
p(x) = 1/√(2πσ²) · exp( −(x − µ)² / (2σ²) )
I µ : mean
I σ² : variance
I The distribution has thin “tails”:
p(x) → 0 quickly as x → ±∞
[Figure: Gaussian density p(x) for x ∈ [−6, 6]]
3
Multivariate Gaussian Distribution
p(x) = (2π)^(−N/2) det(Σ)^(−1/2) exp( −(x − µ)T Σ−1 (x − µ) / 2 )
I µ ∈ RN : mean vector
I Σ ∈ RN×N : covariance matrix
I The distribution has thin “tails”:
p(x) → 0 quickly as ‖x‖ → ∞
4
Laplace Distribution
Laplace distribution:
p(x) = 1/(2b) · exp( −|x − µ| / b )
I µ : location
I b : scale
[Figure: Laplace density p(x) for x ∈ [−6, 6]]
5
Mixture Distributions
We can also model mixture densities:
p(x) = Σ m=1..M πm · 1/(2bm ) · exp( −|x − µm | / bm )
Example:
I Mixture of Laplace distributions
I Constraint: Σm πm = 1
[Figure: density of a mixture of Laplace distributions for x ∈ [−6, 6]]
6
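A minimal NumPy sketch evaluating such a mixture density (parameter values are made up):

import numpy as np

def laplace_pdf(x, mu, b):
    # Density of a single Laplace distribution with location mu and scale b.
    return np.exp(-np.abs(x - mu) / b) / (2 * b)

def mixture_pdf(x, pis, mus, bs):
    # Weighted sum of Laplace densities; the weights pis must sum to 1.
    return sum(pi * laplace_pdf(x, mu, b) for pi, mu, b in zip(pis, mus, bs))

pis, mus, bs = [0.3, 0.7], [-2.0, 1.0], [0.5, 1.0]
x = np.linspace(-6, 6, 7)
print(mixture_pdf(x, pis, mus, bs))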
Bayesian Decision Theory
Digit Classification
1
Digit classification: Prior
Prior Distribution:
I How often do the letters “a” and “b” occur?
I Let us assume
c1 = “a” with p(c1 ) = 0.75
c2 = “b” with p(c2 ) = 0.25
2
Digit Classification: Class Conditionals
I We describe every digit using some feature vector x, e.g.:
I the number of black pixels in each box
I relation between width and height
I Likelihood: how likely is it that x was generated from p(x | a) or from p(x | b)?
3
Bayes Theorem
p(y|x) = p(x|y) p(y) / p(x)
7
Bayes Theorem
I Repeated from last slide:
Posterior = (Likelihood × Prior) / Normalization Factor
i.e., p(y|x) = p(x|y) p(y) / p(x)
1
Expectation of a Random Variable
I The expectation (or expected value) of a function f of a random variable X with
distribution p is the average value that f takes on under p:
EX∼p [f(x)] = Σx∈X p(x) f(x) (discrete)      EX∼p [f(x)] = ∫ p(x) f(x) dx (continuous)
2
Properties of Expectations
I Expectations are linear: E [α f(x) + β g(x)] = α E [f(x)] + β E [g(x)]
I The expected value of a Gaussian random variable is the mean of its distribution: E [x] = µ
[Figure: Gaussian density p(x) for x ∈ [−6, 6]]
3
Variance of a Function
I The variance measures how much the values of a function of a random variable
X vary as we sample different values of x from its probability distribution:
Var[f(x)] = E [ (f(x) − E [f(x)])² ]
I When the variance is low, the values of f (x) cluster near their expected value
I The square root of the variance is known as the standard deviation
I The variance/standard deviation is a parameter of the Gaussian distribution
4
Covariance of Functions
I The covariance measures how much two values are linearly related:
Cov[f(x), g(y)] = E [ (f(x) − E [f(x)]) (g(y) − E [g(y)]) ]
5
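A minimal NumPy sketch estimating expectation, variance and covariance from samples (parameters are made up):

import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=1.0, scale=2.0, size=100_000)   # Gaussian with mean 1, std 2
y = 0.5 * x + rng.normal(size=x.shape)             # linearly related to x

print(x.mean())            # sample estimate of E[x]   (≈ 1)
print(x.var())             # sample estimate of Var[x] (≈ 4)
print(np.cov(x, y)[0, 1])  # sample covariance between x and y (≈ 0.5 * Var[x] = 2)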
Information and Entropy
Information Theory
1
A Mathematical Theory of Communication
Basic Intuition:
I The basic intuition behind information theory is that observing
an unlikely event is more informative than observing a likely event
Example:
I The message “the sun rose this morning” is uninformative (not worth sending)
I In contrast, the message “there was a solar eclipse this morning” is informative
I This is a rare event - in other words, receiving this message surprises us
I We will quantify information content as the “level of surprise”
5
Self-Information
We would like to quantify information in a way that formalizes this intuition:
I Likely events should have low information content & certain events none
I Less likely events should have higher information content
I Independent events should have additive information
(Finding out that a tossed coin has come up as heads twice should convey twice
as much information as finding out that it has come up as heads once.)
These requirements lead to the definition of self-information: I(x) = − log p(x),
where log refers to the natural logarithm with units of “nats” and I(x) ≥ 0.
6
Self-Information
Self-information: I(x) = − log p(x),
where log refers to the natural logarithm with units of “nats” and I(x) ≥ 0.
Remarks:
I Self-information is also interpreted as quantifying the level of “surprise”
I One nat is the amount of information gained by observing an event of probability 1/e
I When using base-2 logarithms, units are called “bits” or “shannons”
I Information measured in bits is just a rescaling of information measured in nats
7
Self-Information
I The expected self-information of a distribution p is called its entropy
Shannon Entropy:
H(p) = −EX∼p [log p(x)] = − Σx∈X p(x) log p(x)
Differential Entropy (continuous distributions):
H(p) = −EX∼p [log p(x)] = − ∫x∈X p(x) log p(x) dx
10
Shannon Entropy Example
I X-Axis: Probability p of a binary random variable (e.g., coin toss) being equal to 1
I Y-Axis: Entropy of corresponding distribution H(p) = −(1 − p) log(1 − p) − p log p
I Distributions that are close to deterministic have low entropy
while distributions that are close to uniform have high entropy
11
Kullback-Leibler Divergence
I The Kullback-Leibler (KL) divergence measures how different two distributions p and q are
Discrete Distributions:
DKL (p || q) = Σx∈X p(x) log( p(x) / q(x) )
Continuous Distributions:
DKL (p || q) = ∫x∈X p(x) log( p(x) / q(x) ) dx
2
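A minimal Python sketch of the discrete KL divergence (the distributions are made up; zero probabilities in p or q are not handled):

import numpy as np

def kl_divergence(p, q):
    # D_KL(p || q) for two discrete distributions over the same outcomes.
    p, q = np.asarray(p, float), np.asarray(q, float)
    return np.sum(p * np.log(p / q))

p = np.array([0.5, 0.4, 0.1])
q = np.array([0.3, 0.3, 0.4])
print(kl_divergence(p, q), kl_divergence(q, p))   # note: not symmetric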
Kullback-Leibler Divergence Example
[Figure: two panels comparing pdata and pmodel ; left panel: “KL Divergence Large”, right panel: “KL Divergence Small”; axes x and p(x)]
3
Kullback-Leibler Divergence Example
I Minimizing DKL (p || q) and DKL (q || p) wrt. q does not lead to the same result
I Left: select a q that has high probability where p has high probability
⇒ this blurs multiple modes together in order to put high probability on all of them
I Right: select a q that has low probability where p has low probability
⇒ this picks a single mode so as not to put probability mass in low-probability regions
5
Relationship to Cross-Entropy
I The cross-entropy between p and q is closely related to the KL divergence:
H(p, q) = −EX∼p [log q(x)] = H(p) + DKL (p || q)
6
The Argmin and Argmax Operators
The Argmin and Argmax Operators
Let X denote a set. We define the argmin and argmax operators as follows:
argmin x∈X f(x) = { x ∈ X | f(x) = min x′∈X f(x′) }
argmax x∈X f(x) = { x ∈ X | f(x) = max x′∈X f(x′) }
Examples:
I argmin x∈R x² = 0
I argmin x∈[−1,1] x = −1
I argmax x∈[0,4π] cos(x) = {0, 2π, 4π}
I argmin x∈R 2 = R
1
Example: Maximum Likelihood Estimation
I Let X = {(xi , yi )} i=1..N be a dataset with samples drawn i.i.d. from pdata
I Let the model pmodel (y|x, w) be a parametric family of probability distributions
I The conditional maximum likelihood estimator for w is given by
wML = argmax w Σ i=1..N log pmodel (yi | xi , w)
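A minimal Python sketch of conditional maximum likelihood estimation, assuming a simple Gaussian model p(y | x, w) = N(y; w·x, 1) that is not from the slides:

import numpy as np

rng = np.random.default_rng(0)
w_true = 2.0
x = rng.uniform(-1, 1, size=1000)
y = w_true * x + rng.normal(size=x.shape)   # synthetic data from the assumed model

def log_likelihood(w, x, y):
    # Sum of log N(y_i; w*x_i, 1) over the dataset.
    return np.sum(-0.5 * (y - w * x) ** 2 - 0.5 * np.log(2 * np.pi))

# Maximize the log-likelihood over a grid, and compare with the closed form
# w = sum(x*y) / sum(x**2) that this particular model admits.
w_grid = np.linspace(0, 4, 401)
w_ml_grid   = w_grid[np.argmax([log_likelihood(w, x, y) for w in w_grid])]
w_ml_closed = np.sum(x * y) / np.sum(x ** 2)
print(w_ml_grid, w_ml_closed)   # both close to w_true = 2.0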