
Mathematics for Deep Learning

A Short Guided Tour on Linear Algebra and Probability Theory

Prof. Dr.-Ing. Andreas Geiger


Autonomous Vision Group
University of Tübingen
Contents

I Sets, Scalars, Vectors, Matrices, Tensors
I Adding and Multiplying Matrices/Vectors
I Identity and Inverse Matrices
I Linear Dependence and Span
I Vector and Matrix Norms
I Eigen and Singular Value Decomposition
I The Trace Operator and Determinant
I Differential Calculus
I Vector Calculus
I Random Variables and Distributions
I Common Probability Distributions
I Marginal and Conditional Distributions
I Conditional Independence
I Bayesian Decision Theory
I Expectation, Variance and Covariance
I Information and Entropy
I Kullback-Leibler Divergence
I The ArgMin and ArgMax Operators

1
Disclaimer

I This is not a math lecture!


I No axioms, no claims, no theorems, no proofs
I The goal is to repeat the necessary minimal math background
to follow our deep learning, computer vision and self-driving lectures
I Enjoy!

2
Sets, Scalars, Vectors, Matrices and Tensors
Sets
I A set S is the mathematical model for a collection of different things.
I Examples:
I ∅: Empty set
I {red, green, blue}: Set of colors
I {0, 1}: Set of binary numbers
I R: Set of real numbers (e.g., 1.234 ∈ R)
I N0 = {0, 1, 2, . . . }: Set of natural numbers including 0
I {R, N0 }: Set of two sets
I The cardinality of a set S, denoted |S|, is the number of members of S
I We can take the union of two sets S = S1 ∪ S2
I We can take the intersection of two sets S = S1 ∩ S2
I We can take the Cartesian product of two sets S = S1 × S2 (e.g., R^2 = R × R)
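
A minimal Python sketch of these set operations (not part of the original slides; the example sets are made up for illustration):

from itertools import product

S1 = {"red", "green", "blue"}
S2 = {"blue", "yellow"}

print(len(S1))                  # cardinality |S1| = 3
print(S1 | S2)                  # union S1 ∪ S2
print(S1 & S2)                  # intersection S1 ∩ S2: {'blue'}
print(set(product(S1, S2)))     # Cartesian product S1 × S2 (set of pairs)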
1
Scalars

I A scalar is a single number (in contrast to other objects in linear algebra)


I We write scalars in lower case, non-bold typeface font
I Examples:
I x∈R
I c ∈ N0
I When introducing them, we specify their type

2
Vectors
I A vector is an array of numbers with specific order
I We write vectors in lower case, bold typeface font
I Example:

x = \begin{pmatrix} x_1 \\ x_2 \\ x_3 \end{pmatrix}

I We can identify each element of the vector (scalar) via its index (x1 , x2 , . . . )
I If each xi ∈ R, then x ∈ R3 = R × R × R (x is in the Cartesian product of R)
I We can think of vectors as identifying points in space (element = coordinate)
I We can also index a subset of elements of a vector: xS with S = {1, 3}
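
A minimal NumPy sketch of vector indexing (not from the original slides; note that NumPy indices start at 0, while the slides count from 1):

import numpy as np

x = np.array([1.0, 2.0, 3.0])   # a vector x in R^3
print(x[0])                      # first element x_1
print(x[[0, 2]])                 # subset x_S with S = {1, 3} -> [1. 3.]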
3
Matrices
I A matrix is a 2D array of numbers, each element is identified by two indices
I We write matrices in upper case, bold typeface font
I Example:

A = \begin{pmatrix} a_{1,1} & a_{1,2} \\ a_{2,1} & a_{2,2} \end{pmatrix}

I If a real-valued matrix has M rows and N columns we write A ∈ RM ×N


I M × N is the shape of the matrix
I We can identify each element of a matrix via its two indices (a1,1 , a1,2 , . . . )
I For example, ai,j is the element of A in the i’th row and j’th column
I Sometimes the ’,’ separating indices is omitted in the notation
I Columns and rows of matrices can be accessed via aj = A:,j and aTi = Ai,:
4
Tensors

I A tensor is an array with more than 2 axes (e.g.: RGB image)


I We write tensors in upper case, bold typeface font
I Example for tensor of shape M × N × K:

A ∈ RM ×N ×K

I We identify each element of a tensor via its indices (ai,j,k )


I Elements of a tensor are scalars and written in lower case non-boldface font

5
Transpose
I The transpose A^T of a matrix A mirrors it at its main diagonal:

A = \begin{pmatrix} a_{1,1} & a_{1,2} \\ a_{2,1} & a_{2,2} \\ a_{3,1} & a_{3,2} \end{pmatrix} \quad\Rightarrow\quad A^T = \begin{pmatrix} a_{1,1} & a_{2,1} & a_{3,1} \\ a_{1,2} & a_{2,2} & a_{3,2} \end{pmatrix}

I Similarly, a standard column vector can be transposed into a row vector:

x = \begin{pmatrix} x_1 \\ x_2 \\ x_3 \end{pmatrix} = (x_1 \; x_2 \; x_3)^T \quad\Rightarrow\quad x^T = (x_1 \; x_2 \; x_3)

(the notation (x_1 \; x_2 \; x_3)^T is better for inline math)

I For scalars, we have xT = x


6
Examples with Numbers
A = \begin{pmatrix} 1 & 2 \\ 3 & 4 \end{pmatrix} ∈ R^{2×2} \quad\Rightarrow\quad A^T = \begin{pmatrix} 1 & 3 \\ 2 & 4 \end{pmatrix} ∈ R^{2×2}

a_{1,1} = 1 ∈ R, \quad a_{1,2} = 2 ∈ R, \quad a_{2,1} = 3 ∈ R, \quad a_{2,2} = 4 ∈ R

A_{:,1} = \begin{pmatrix} 1 \\ 3 \end{pmatrix} ∈ R^{2×1}, \quad A_{:,2} = \begin{pmatrix} 2 \\ 4 \end{pmatrix} ∈ R^{2×1}, \quad A_{1,:} = (1 \; 2) ∈ R^{1×2}, \quad A^T_{2,:} = (2 \; 4) ∈ R^{1×2}

I The entries in the last row can be considered either as matrices (R^{1×2} / R^{2×1}) or as vectors in R^2
I If they are considered vectors, we write either a or a^T (with a ∈ R^2)
to distinguish row from column vectors, which would otherwise not be clear
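
The same example in NumPy (a sketch added here, not from the original slides):

import numpy as np

A = np.array([[1, 2],
              [3, 4]])
print(A.T)          # transpose: [[1 3] [2 4]]
print(A[:, 0])      # first column A_{:,1} -> [1 3]
print(A[0, :])      # first row A_{1,:} -> [1 2]
print(A.T[1, :])    # second row of A^T -> [2 4]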
7
Adding and Multiplying Matrices and Vectors
Adding and Subtracting Matrices and Vectors
I We add/subtract vectors or matrices by adding/subtracting them elementwise:

ci = ai + bi ci,j = ai,j + bi,j

I Example (subtraction analogously):


a = \begin{pmatrix} 1 \\ 3 \end{pmatrix}, \quad b = \begin{pmatrix} 4 \\ 2 \end{pmatrix} \quad\Rightarrow\quad c = a + b = \begin{pmatrix} 5 \\ 5 \end{pmatrix}

A = \begin{pmatrix} 1 & 2 \\ 3 & 4 \end{pmatrix}, \quad B = \begin{pmatrix} 4 & 3 \\ 2 & 1 \end{pmatrix} \quad\Rightarrow\quad C = A + B = \begin{pmatrix} 5 & 5 \\ 5 & 5 \end{pmatrix}

I Note that the vectors or matrices must have the same shape
1
Adding and Subtracting Matrices and Vectors
I We can also add a scalar to a matrix or multiply a matrix by a scalar:

D = aB + c

I This corresponds to performing the operation on each element:

di,j = a bi,j + c

I In deep learning, we sometimes also allow the addition of a matrix and a vector:

C = A + b where ci,j = ai,j + bi

I In this example, the vector b is added to each column of matrix A


I This implicit copying of b to many locations is called broadcasting
2
Adding and Subtracting Matrices and Vectors

I Example for scalar addition and multiplication:


2 \begin{pmatrix} 1 & 2 \\ 2 & 0 \end{pmatrix} + 1 = \begin{pmatrix} 3 & 5 \\ 5 & 1 \end{pmatrix}

I Example for broadcasting (the shape of the vector determines which type):

\begin{pmatrix} 1 & 1 \\ 2 & 0 \end{pmatrix} + \begin{pmatrix} 1 \\ 3 \end{pmatrix} = \begin{pmatrix} 2 & 2 \\ 5 & 3 \end{pmatrix} \qquad \begin{pmatrix} 1 & 1 \\ 2 & 0 \end{pmatrix} + \begin{pmatrix} 1 & 3 \end{pmatrix} = \begin{pmatrix} 2 & 4 \\ 3 & 3 \end{pmatrix}

I NumPy supports broadcasting natively:


https://numpy.org/doc/stable/user/basics.broadcasting.html
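
A minimal NumPy sketch of the broadcasting example above (not part of the original slides):

import numpy as np

A = np.array([[1, 1],
              [2, 0]])
col = np.array([[1], [3]])   # shape (2, 1): added to each column
row = np.array([1, 3])       # shape (2,):   added to each row
print(A + col)               # [[2 2] [5 3]]
print(A + row)               # [[2 4] [3 3]]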

3
Multiplying Matrices and Vectors

I Two vectors or matrices A and B can be multiplied if A has the same number of
columns as B has rows (i.e., A ∈ RM ×N and B ∈ RN ×L ):

C = AB

I The matrix product is defined by

c_{i,j} = \sum_k a_{i,k} b_{k,j}

I Note that the matrix product is not the matrix containing the products of the
individual elements; that is called the Hadamard product A ⊙ B (A, B ∈ R^{M×N})

4
Multiplying Matrices and Vectors
I Example for a matrix product:

A = \begin{pmatrix} 3 & 1 \\ 2 & 1 \end{pmatrix}, \quad B = \begin{pmatrix} 1 & 2 \\ 3 & 1 \end{pmatrix} \quad\Rightarrow\quad AB = \begin{pmatrix} 6 & 7 \\ 5 & 5 \end{pmatrix}

I Example for an inner product between two vectors (= "dot product"):

a = \begin{pmatrix} 3 \\ 2 \end{pmatrix}, \quad b = \begin{pmatrix} 1 \\ 3 \end{pmatrix} \quad\Rightarrow\quad a^T b = 9

I Example for an outer product between two vectors (= rank 1 matrix):

a = \begin{pmatrix} 3 \\ 2 \end{pmatrix}, \quad b = \begin{pmatrix} 1 \\ 3 \end{pmatrix} \quad\Rightarrow\quad a b^T = \begin{pmatrix} 3 & 9 \\ 2 & 6 \end{pmatrix}
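
The same products in NumPy (a sketch added for this write-up, not from the original slides):

import numpy as np

A = np.array([[3, 1], [2, 1]])
B = np.array([[1, 2], [3, 1]])
a = np.array([3, 2])
b = np.array([1, 3])

print(A @ B)            # matrix product: [[6 7] [5 5]]
print(A * B)            # Hadamard (elementwise) product
print(a @ b)            # inner product a^T b = 9
print(np.outer(a, b))   # outer product a b^T: [[3 9] [2 6]]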
5
Multiplying Matrices and Vectors

I Example for broadcasting with the help of the outer product:

\begin{pmatrix} 1 & 1 \\ 2 & 0 \end{pmatrix} + \begin{pmatrix} 1 \\ 3 \end{pmatrix} = \begin{pmatrix} 1 & 1 \\ 2 & 0 \end{pmatrix} + \begin{pmatrix} 1 \\ 3 \end{pmatrix} \begin{pmatrix} 1 & 1 \end{pmatrix} = \begin{pmatrix} 1 & 1 \\ 2 & 0 \end{pmatrix} + \begin{pmatrix} 1 & 1 \\ 3 & 3 \end{pmatrix} = \begin{pmatrix} 2 & 2 \\ 5 & 3 \end{pmatrix}

6
Useful Properties

I Matrix multiplication is distributive:

A(B + C) = AB + AC

I Matrix multiplication is associative:

A(BC) = (AB)C

I However, in general matrix multiplication is not commutative:

AB ≠ BA

7
Useful Properties
I The transpose of a matrix product has a simple form:

(AB)^T = B^T A^T

I Hence, the vector dot product is commutative:

x^T y = (x^T y)^T = y^T x

I Matrix-vector products allow to compactly write systems of linear equations:

a_1^T x = b_1, \quad a_2^T x = b_2, \quad \ldots, \quad a_M^T x = b_M \qquad \text{as} \qquad Ax = b

... where a_i^T = A_{i,:} denotes the i'th row of matrix A.


8
Identity and Inverse Matrices
Identity and Inverse Matrices
I A matrix that does not change a vector multiplied with it is called identity matrix
I We denote the identity matrix preserving N-dimensional vectors as I_N ∈ R^{N×N}

∀ x ∈ R^N : I_N x = x

I Example:

I_3 = diag(1, 1, 1) = \begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{pmatrix} \quad\Rightarrow\quad \begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{pmatrix} \begin{pmatrix} x_1 \\ x_2 \\ x_3 \end{pmatrix} = \begin{pmatrix} x_1 \\ x_2 \\ x_3 \end{pmatrix}

I I_N is a square matrix with all diagonal elements equal to one and all others zero
1
Identity and Inverse Matrices
I The matrix inverse A^{-1} of a square matrix A ∈ R^{N×N} is defined by

A^{-1} A = I_N

I This allows for solving linear systems of the form Ax = b:

Ax = b
A^{-1} A x = A^{-1} b
I_N x = A^{-1} b
x = A^{-1} b

I Remark: This is only possible if A−1 exists (has full rank, will discuss later)
I It is not advisable to numerically solve linear systems like this (precision)
2
Identity and Inverse Matrices

I Example:

A = \begin{pmatrix} 1 & 1 \\ 2 & 0 \end{pmatrix} \quad\Rightarrow\quad A^{-1} = \begin{pmatrix} 0 & 0.5 \\ 1 & -0.5 \end{pmatrix}

A^{-1} A = \begin{pmatrix} 0 & 0.5 \\ 1 & -0.5 \end{pmatrix} \begin{pmatrix} 1 & 1 \\ 2 & 0 \end{pmatrix} = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix} = I_2

b = \begin{pmatrix} 2 \\ 1 \end{pmatrix} \quad\Rightarrow\quad x = A^{-1} b = \begin{pmatrix} 0 & 0.5 \\ 1 & -0.5 \end{pmatrix} \begin{pmatrix} 2 \\ 1 \end{pmatrix} = \begin{pmatrix} 0.5 \\ 1.5 \end{pmatrix}
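
A quick NumPy check of this example (a sketch, not from the slides); np.linalg.solve is the numerically preferable route alluded to on the previous slide:

import numpy as np

A = np.array([[1.0, 1.0], [2.0, 0.0]])
b = np.array([2.0, 1.0])

print(np.linalg.inv(A) @ b)    # x = A^{-1} b = [0.5 1.5]
print(np.linalg.solve(A, b))   # numerically preferable, same result here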

3
Linear Dependence and Span
Linear Dependence and Span
I For A−1 to exist, Ax = b must have exactly one solution for every value of b
I It is possible for Ax = b to have 0, 1 or infinitely many solutions
I If both x and y are solutions then

z = α x + (1 − α)y

is also a solution for any α ∈ R


I We call Ax = \sum_i x_i A_{:,i} a linear combination (sum of scalar-vector products)

I Example:

\underbrace{\begin{pmatrix} 1 & 1 \\ 2 & 0 \end{pmatrix}}_{A} \underbrace{\begin{pmatrix} 3 \\ 2 \end{pmatrix}}_{x} = \underbrace{3}_{x_1} \begin{pmatrix} 1 \\ 2 \end{pmatrix} + \underbrace{2}_{x_2} \begin{pmatrix} 1 \\ 0 \end{pmatrix} = \begin{pmatrix} 3 \\ 6 \end{pmatrix} + \begin{pmatrix} 2 \\ 0 \end{pmatrix} = \begin{pmatrix} 5 \\ 6 \end{pmatrix}
1
Linear Dependence and Span
I The span of a set of vectors is the set of all points obtained by linear combination:
\underbrace{\begin{pmatrix} 1 & 1 \\ 2 & 0 \end{pmatrix}}_{A} \underbrace{\begin{pmatrix} 3 \\ 2 \end{pmatrix}}_{x} = \underbrace{3}_{x_1} \underbrace{\begin{pmatrix} 1 \\ 2 \end{pmatrix}}_{a_1} + \underbrace{2}_{x_2} \underbrace{\begin{pmatrix} 1 \\ 0 \end{pmatrix}}_{a_2} = \begin{pmatrix} 3 \\ 6 \end{pmatrix} + \begin{pmatrix} 2 \\ 0 \end{pmatrix} = \underbrace{\begin{pmatrix} 5 \\ 6 \end{pmatrix}}_{b}
[Figure: the columns a_1, a_2 and the point b plotted in the 2D plane]

I Ax = b has a solution ⇔ b is in the span of the columns of A
I This particular span is known as the column space or range
I When is there no solution?
2
Linear Dependence and Span
I Consider another example:
\underbrace{\begin{pmatrix} 1 & 2 \\ 2 & 4 \end{pmatrix}}_{A} \underbrace{\begin{pmatrix} x_1 \\ x_2 \end{pmatrix}}_{x} = x_1 \underbrace{\begin{pmatrix} 1 \\ 2 \end{pmatrix}}_{a_1} + x_2 \underbrace{\begin{pmatrix} 2 \\ 4 \end{pmatrix}}_{a_2} = \underbrace{\begin{pmatrix} 5 \\ 6 \end{pmatrix}}_{b}
[Figure: the line spanned by a_1 and a_2, with b lying off this line]

I a_2 is a multiple of a_1
I The columns are linearly dependent
I The column space is a line
I b is not in the span of the columns of A
⇒ Ax = b has no solution
I When are there ∞ many solutions?
3
Linear Dependence and Span
I Consider another example:
\underbrace{\begin{pmatrix} 1 & 2 \\ 2 & 4 \end{pmatrix}}_{A} \underbrace{\begin{pmatrix} x_1 \\ x_2 \end{pmatrix}}_{x} = x_1 \underbrace{\begin{pmatrix} 1 \\ 2 \end{pmatrix}}_{a_1} + x_2 \underbrace{\begin{pmatrix} 2 \\ 4 \end{pmatrix}}_{a_2} = \underbrace{\begin{pmatrix} 3 \\ 6 \end{pmatrix}}_{b}
[Figure: the line spanned by a_1 and a_2, with b lying on this line]

I a_2 is a multiple of a_1
I The columns are linearly dependent
I The column space is a line
I b is in the span of the columns of A
⇒ Ax = b has ∞ many solutions
4
Linear Dependence and Span
I In order for Ax = b to have a solution for all values of b ∈ RN ,
the column space of A must encompass all of RN
I For the column space of a matrix with N rows to encompass all of RN ,
the matrix must contain at least one set of N linearly independent columns
I For the matrix to have an inverse, we additionally need to ensure
that Ax = b has at most one solution x for each value of b
I Hence, the matrix must have exactly N columns (square matrix A ∈ RN ×N )
and all columns must be linearly independent
I The rank of a matrix refers to the number of linearly independent rows or columns
I A square matrix with any two linearly dependent columns is called singular
I Every matrix with full rank (= every non-singular matrix) can be inverted
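
A quick NumPy check of rank and singularity (not from the slides), reusing the matrices from the earlier examples:

import numpy as np

A = np.array([[1.0, 1.0], [2.0, 0.0]])   # full rank -> invertible
B = np.array([[1.0, 2.0], [2.0, 4.0]])   # linearly dependent columns -> singular

print(np.linalg.matrix_rank(A))  # 2
print(np.linalg.matrix_rank(B))  # 1
# np.linalg.inv(B) would raise a LinAlgError ("Singular matrix")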
5
Vector and Matrix Norms
Vector and Matrix Norms

I We can measure the “size” of a vector using a function called norm


I The ℓ_p-norm is defined as

‖x‖_p = \left( \sum_i |x_i|^p \right)^{1/p}

I A norm maps vectors to non-negative values (= distance to the origin)


I Mathematically, a norm f (·) satisfies:
I f (x) = 0 ⇒ x = 0
I f (x + y) ≤ f (x) + f (y) (triangle inequality)
I ∀α ∈ R : f (αx) = |α|f (x)

1
Vector and Matrix Norms

‖x‖_1 = \sum_i |x_i| = |x_1| + |x_2|

‖x‖_2 = \sqrt{\sum_i x_i^2} = \sqrt{x^T x} = \sqrt{x_1^2 + x_2^2}

‖x‖_∞ = \max_i |x_i| = \max(|x_1|, |x_2|)

I ℓ_2 is called the Euclidean norm (= Euclidean distance to the origin)
I ℓ_∞ is called the max/infinity norm
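
A minimal NumPy sketch of these three norms (not on the original slide):

import numpy as np

x = np.array([3.0, -4.0])
print(np.linalg.norm(x, 1))       # l1 norm: 7.0
print(np.linalg.norm(x, 2))       # l2 (Euclidean) norm: 5.0
print(np.linalg.norm(x, np.inf))  # max/infinity norm: 4.0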

2
Vector and Matrix Norms

I The dot product of two vectors x and y can be written in terms of norms

x^T y = ‖x‖_2 ‖y‖_2 \cos θ

where θ is the angle between x and y

I The "size" of a matrix can be measured with the Frobenius norm:

‖A‖_F = \sqrt{\sum_{i,j} a_{i,j}^2}

3
Special Matrices and Vectors
Special Matrices and Vectors
Some special kinds of matrices and vectors are particularly useful:
I Diagonal matrices D = diag(d) have non-zero values only on their diagonal
I Multiplying with diagonal matrices is easy: diag(d) x = d ⊙ x
I Inverting diagonal matrices is easy: diag(d)^{-1} = diag((1/d_1, . . . , 1/d_N)^T)
I A symmetric matrix is equal to its own transpose: A = A^T
I Symmetric matrices arise for instance if entries are distances with a_{i,j} = a_{j,i}
I A unit vector is a vector with unit ℓ_2 norm: ‖x‖_2 = 1
I A vector x is orthogonal to a vector y if x^T y = 0
I Unit vectors that are orthogonal are called orthonormal
I An orthogonal matrix is a square matrix whose rows/columns are orthonormal:

A^T A = A A^T = I ⇒ A^{-1} = A^T (cheap inverse)


1
Eigenvalue and Singular Value Decomposition
Eigenvalue Decomposition

I Many mathematical objects can be better understood by breaking them into parts
(e.g., integers can be decomposed into prime numbers)
I An eigendecomposition decomposes a matrix into eigenvectors and eigenvalues
I An eigenvector of a square matrix A is a non-zero vector v such that multiplication
by A only alters its scale; the scaling factor λ is known as the corresponding eigenvalue:

Av = λv

I As any scaled version of v is an eigenvector we consider unit eigenvectors

1
Eigenvalue Decomposition
I We concatenate all eigenvectors to form a matrix V = (v_1, . . . , v_N)
I We form all eigenvalues into a diagonal matrix Λ = diag(λ_1, . . . , λ_N)
(remark: by convention, we typically sort the eigenvalues in descending order)
I The eigendecomposition of A is given by

A = V Λ V^{-1}

I Every real symmetric matrix A (common case) can be decomposed into

A = Q Λ Q^T

where Q is an orthogonal matrix composed of the eigenvectors {v_i}_{i=1}^N of A
2
Eigenvalue Decomposition

[Figure: the unit circle transformed in three steps: 1. Rotate, 2. Scale, 3. Rotate]

I Consider a matrix A with two orthonormal EVs v1 (with λ1 ) and v2 (with λ2 )


I A transforms space by 1. rotating, 2. scaling (along CS axes) and 3. rotating back
I Hence, A distorts the unit circle by scaling space in direction vi by λi
3
Eigenvalue Decomposition

I A matrix is singular ⇔ any of its eigenvalues is zero


I The rank of a matrix equals the number of non-zero eigenvalues
I Only matrices with full rank can be inverted (non-singular matrices)
I A matrix whose eigenvalues are all positive is called positive definite
I A matrix whose eigenvalues are all positive or zero is called positive semi-definite
I For positive semi-definite matrices we have ∀x : xT A x ≥ 0
I Positive definite matrices additionally guarantee that xT A x = 0 ⇒ x = 0
I The EVD can be computed easily in NumPy: numpy.linalg.eig
I More info: https://guzintamath.com/textsavvy/2018/05/26/eigenvalues-and-eigenvectors/
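
A minimal numpy.linalg.eig sketch (not from the slides), assuming a small symmetric matrix chosen for illustration:

import numpy as np

A = np.array([[2.0, 1.0], [1.0, 2.0]])     # real symmetric matrix
lam, V = np.linalg.eig(A)                   # eigenvalues and unit eigenvectors (columns of V)
print(lam)                                  # e.g. [3. 1.]
print(np.allclose(A, V @ np.diag(lam) @ np.linalg.inv(V)))   # A = V Lambda V^{-1}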
4
Singular Value Decomposition
I Eigenvalue decomposition can only be applied to square matrices
I For non-square matrices we can use singular value decomposition
I The singular value decomposition factorizes a matrix A ∈ RM ×N as

A = UDVT

where U ∈ RM ×M , D ∈ RM ×N and V ∈ RN ×N
I U and V are orthogonal matrices
I D is a diagonal (square or non-square) matrix
I The elements along the diagonal of D are known as singular values
I The columns of U and V are left-/right-singular vectors, respectively
5
Relationship between EVD and SVD

A = U D V^T
A^T A = V D U^T U D V^T = V D^2 V^T
A A^T = U D V^T V D U^T = U D^2 U^T

I The right-singular vectors V are the eigenvectors of AT A


I The left-singular vectors U are the eigenvectors of AAT
I The eigenvalues of AT A and AAT are equal to the squared singular values of A
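
A quick NumPy check of this relationship (a sketch, not from the slides; the random matrix is just for illustration):

import numpy as np

A = np.random.randn(4, 3)                  # non-square matrix
U, s, Vt = np.linalg.svd(A)                # s holds the singular values (descending)
lam = np.linalg.eigvalsh(A.T @ A)[::-1]    # eigenvalues of A^T A, sorted descending
print(np.allclose(lam, s**2))              # equal to the squared singular values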

6
Application: Principal Component Analysis (PCA)

[Figure: 2D dataset and its reconstruction; the two eigenvectors v_i of the covariance matrix are shown in black, scaled by λ_i]

I PCA is a technique for analyzing large high-dimensional datasets (see DL lecture 11)
I The eigenvectors of the covariance matrix (= principal components) point in the directions of largest variance
I In this illustration, the two eigenvectors v_i shown in black are scaled by λ_i
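
A minimal PCA sketch via the eigendecomposition of the covariance matrix (not from the slides; the synthetic 2D data is made up for illustration):

import numpy as np

X = np.random.randn(500, 2) @ np.array([[2.0, 0.0], [1.0, 0.5]])  # correlated 2D data
X = X - X.mean(axis=0)                      # center the data
C = np.cov(X, rowvar=False)                 # 2x2 covariance matrix
lam, V = np.linalg.eigh(C)                  # eigenvalues ascending, eigenvectors in columns
print(V[:, -1])                             # first principal component (largest variance)
Z = X @ V[:, -1:]                           # projection onto the top principal component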
7
The Trace Operator and Determinant
The Trace Operator

I The trace operator returns the sum of all diagonal elements of a matrix:

Tr(A) = \sum_i a_{i,i}

1
The Trace Operator
I The trace operator allows for writing the Frobenius norm without summation:

‖A‖_F = \sqrt{Tr(A A^T)}

I The trace operator is invariant to the transpose operator:

Tr(A) = Tr(AT )

I The trace of the product of square matrices is invariant to cyclic permutations:

Tr(ABC) = Tr(CAB) = Tr(BCA)

2
The Determinant

I The determinant of a square matrix is equal to the product of all eigenvalues:

\det(A) = \prod_i λ_i

I The determinant can be thought of as a measure of how much


multiplication by the matrix expands or contracts space
I If the determinant is 1, then the transformation is volume-preserving
I If the determinant is 0, then space is contracted completely
along at least one dimension, causing it to lose all of its volume
I In other words, the matrix is singular or rank-deficient
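
A quick NumPy check (not from the slides) of the determinant identity, together with the transpose invariance of the trace from the previous slide; the example matrix is arbitrary:

import numpy as np

A = np.array([[2.0, 1.0], [1.0, 2.0]])
lam, _ = np.linalg.eig(A)
print(np.isclose(np.linalg.det(A), np.prod(lam)))   # det(A) = product of eigenvalues (3.0)
print(np.trace(A) == np.trace(A.T))                  # trace is invariant to transposition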

3
Differential Calculus
Derivative
I The derivative of a function f (x) measures the
sensitivity to change of the function value wrt.
a change in its argument and is given by

f'(a) = \frac{df}{dx}(a) = \lim_{h \to 0} \frac{f(a+h) - f(a)}{h}

I A function is differentiable at a if the limit exists


I The slope of the tangent line is equivalent to the
derivative at the tangent point (dark red)
I The second derivative is written as

f''(a) \quad\text{or}\quad \frac{d^2 f}{dx^2}(a)
1
Examples of Non-Differentiable Functions

[Figure: two examples — left: a function that is not continuous and not differentiable; right: a function that is continuous but not differentiable]
2
Chain Rule
I The chain rule expresses the derivative of the composition of two differentiable
functions f and g in terms of the derivatives of f and g
I More precisely, if h = f ◦ g is the function such that h(x) = f (g(x)) then

h'(x) = f'(g(x)) \, g'(x)    (Lagrange notation)

\frac{df}{dx} = \frac{df}{dg} \frac{dg}{dx}    (Leibniz notation)

I Example:

h(x) = (2x^2 + x)^3

h'(x) = 3(2x^2 + x)^2 (4x + 1)
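
A finite-difference sanity check of this example (a sketch, not from the slides; the step size 1e-6 and the evaluation point are arbitrary):

def h(x):
    return (2 * x**2 + x) ** 3

def h_prime(x):                       # analytic derivative from the slide
    return 3 * (2 * x**2 + x) ** 2 * (4 * x + 1)

x, eps = 1.5, 1e-6
numeric = (h(x + eps) - h(x - eps)) / (2 * eps)   # central difference approximation
print(numeric, h_prime(x))                        # both approx. 756.0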

3
Derivatives of Multivariate Functions
Let f (x, y) be a function where y = y(x) depends on x.

The partial derivative is defined as:

\frac{\partial f(x,y)}{\partial x} = \frac{\partial f}{\partial x}\frac{dx}{dx}    (Chain Rule)

Example: f(x, y) = xy, so \frac{\partial f(x,y)}{\partial x} = y

The total derivative is defined as:

\frac{df(x,y)}{dx} = \frac{\partial f}{\partial x}\frac{dx}{dx} + \frac{\partial f}{\partial y}\frac{dy}{dx}    (Multi-Variable Chain Rule)

Example: f(x, y) = xy with y = x, so \frac{df(x,y)}{dx} = y + x = 2x

I Remark: We sometimes write ∂ instead of d, but refer to the total derivative


4
Implicit Functions and Implicit Differentiation
An implicit equation is a relation

f(x, y) = 0

where y(x) is defined only implicitly. Implicit differentiation computes the total
derivative of both sides wrt. x,

\frac{\partial f}{\partial x}\frac{dx}{dx} + \frac{\partial f}{\partial y}\frac{dy}{dx} = 0

and solves for \frac{dy}{dx}.

Example: Let's assume x^2 + y^2 = 1. Implicit differentiation yields

2x + 2y \frac{dy}{dx} = 0 \quad\Rightarrow\quad \frac{dy}{dx} = -\frac{x}{y}

Note the presence of y in this term, i.e., the implicit "function" is a curve.

5
Vector Calculus
Derivative of a Vector-to-Vector Function

Let f : R^N → R^M. Then the (partial) derivative of f wrt. x is given by

\frac{\partial f}{\partial x}(x) = J(x) = \begin{pmatrix} \frac{\partial f_1}{\partial x_1}(x) & \cdots & \frac{\partial f_1}{\partial x_N}(x) \\ \vdots & \ddots & \vdots \\ \frac{\partial f_M}{\partial x_1}(x) & \cdots & \frac{\partial f_M}{\partial x_N}(x) \end{pmatrix} \in R^{M×N}

where \frac{\partial f_i}{\partial x_j}(x) is the (partial) derivative of the function f's i-th output wrt. the j-th
component x_j of the function f's input vector x. J(x) is called the Jacobian matrix.

1
Two frequent Special Cases

Scalar-to-scalar function: N = M = 1. Then f : R^1 → R^1.

f'(x) = \frac{\partial f}{\partial x}(x) \in R^{1×1} \equiv R

Vector-to-scalar function: N > 1, M = 1. Then f : R^N → R^1.

(\nabla_x f)(x) = \frac{\partial f}{\partial x}(x) = \left( \frac{\partial f}{\partial x_1}(x) \; \cdots \; \frac{\partial f}{\partial x_N}(x) \right) \in R^{1×N}

The alternative notation (\nabla_x f)(x) is read as the gradient of f at x and is exclusively
used for vector-to-scalar functions.

2
Gradient and Hessian
Consider a vector-to-scalar function f : R^N → R^1.
I The first-order derivative of f is called the gradient vector. The components of the
gradient vector are the first-order partial derivatives of f at x:

(\nabla_x f)(x) = \frac{\partial f}{\partial x}(x) = \left( \frac{\partial f}{\partial x_1}(x) \; \cdots \; \frac{\partial f}{\partial x_N}(x) \right) \in R^{1×N}

I The second-order derivative of f is called the Hessian matrix H. The components
of the Hessian matrix are the second-order partial derivatives of f at x:

H(x) = \begin{pmatrix} \frac{\partial^2 f}{\partial x_1^2}(x) & \cdots & \frac{\partial^2 f}{\partial x_1 \partial x_N}(x) \\ \vdots & \ddots & \vdots \\ \frac{\partial^2 f}{\partial x_N \partial x_1}(x) & \cdots & \frac{\partial^2 f}{\partial x_N^2}(x) \end{pmatrix} \in R^{N×N}
3
Chain Rule for Vector-to-Vector Functions

Let f : R^N → R^M with f(x) and g : R^M → R^P with g(y). If

h(x) = g(f(x))

then h : R^N → R^P and:

\frac{\partial h}{\partial x}(x) = \frac{\partial g}{\partial y}(f(x)) \frac{\partial f}{\partial x}(x)

Here, the P × M matrix \frac{\partial g}{\partial y}(f(x)) gets multiplied by the M × N matrix \frac{\partial f}{\partial x}(x)
to form the resulting P × N matrix \frac{\partial h}{\partial x}(x).

4
Special Case of Chain Rule

Let f : R^1 → R^N with f(x) and g : R^N → R^1 with g(y). If

h(x) = g(f(x))

then h : R^1 → R^1 and:

\frac{\partial h}{\partial x}(x) = \frac{\partial g}{\partial y}(f(x)) \frac{\partial f}{\partial x}(x) = \sum_{i=1}^{N} \frac{\partial g}{\partial y_i}(f(x)) \frac{\partial f_i}{\partial x}(x)

Here the 1 × N matrix \frac{\partial g}{\partial y}(f(x)) gets multiplied by the N × 1 matrix \frac{\partial f}{\partial x}(x)
to form the resulting 1 × 1 matrix \frac{\partial h}{\partial x}(x). Note that this is the total derivative!

5
Random Variables and Probability Distributions
Why Probability?
Nearly all activities require some ability to reason in the presence of uncertainty.

There exist 3 sources of uncertainty:


I Inherent stochasticity in the modeled system (e.g., traffic participants)
I Incomplete observability (e.g., 3D reconstruction)
I Incomplete modeling (e.g., discretization of robot location)

Probability Theory:
I Mathematical framework for representing uncertain statements
I Tool to quantify uncertainty and reason in the presence of uncertainty
I Information quantifies the amount of uncertainty in a distribution
1
Terminology

I A random variable is a variable that can take on different values randomly


I Random variables may be either discrete or continuous
I A discrete random variable has a finite or countably infinite number of states
I A continuous random variable is associated with a real value
I A probability distribution is a description of how likely a random variable or set of
random variables is to take on each of its possible states. It is described by:
I a probability mass function (PMF) in the case of discrete variables
I a probability density function (PDF) in the case of continuous variables
I The reader has to infer which function to use based on the identity of the RV

2
Discrete Probability Distributions
I A random variable X can take values from a discrete set of outcomes X
I Example: 6-sided dice with equal probability 1/6 for each of the 6 numbers
I We usually use the short-hand notation

p(x) for p(X = x) ∈ [0, 1]

for the probability that random variable X takes value x


I p(x) is called the probability mass function
I In contrast, with p(X) we denote the probability distribution over X
I If X follows distribution p(X) we also write X ∼ p(X)
I The probability p(x) must satisfy the following conditions:

p(x) ≥ 0 \quad\text{and}\quad \sum_{x ∈ X} p(x) = 1
3
Joint, Marginal and Conditional Probability
I Joint probability (of X and Y)

p(x, y) for p(X = x, Y = y)

I Conditional probability (of X conditioned on Y)

p(x|y) = \frac{p(x, y)}{p(y)} \quad\text{for}\quad p(X = x | Y = y) = \frac{p(X = x, Y = y)}{p(Y = y)}

I Marginal probability (of Y)

p(y) = \sum_{x ∈ X} p(x, y) \quad\text{for}\quad p(Y = y) = \sum_{x ∈ X} p(X = x, Y = y)

4
Joint, Marginal and Conditional Probability
I Joint probability

p(x_i, y_j) = \frac{n_{i,j}}{N} = \frac{n_{i,j}}{\sum_{i,j} n_{i,j}}

I Conditional probability

p(x_i | y_j) = \frac{n_{i,j}}{c_j} = \frac{n_{i,j}}{\sum_i n_{i,j}}

I Marginal probability

p(y_j) = \frac{c_j}{N} = \frac{\sum_i n_{i,j}}{\sum_{i,j} n_{i,j}}

where n_{i,j} is the number of observations in cell (x_i, y_j), c_j = \sum_i n_{i,j} is the column sum,
and N = \sum_{i,j} n_{i,j} is the total number of observations.

5
Joint, Marginal and Conditional Probability
Example:

Counts n_{i,j}:                 Joint probabilities p(x_i, y_j):

       y1  y2  y3                      y1   y2   y3
  x1    1   1   1                x1   0.1  0.1  0.1
  x2    1   1   2                x2   0.1  0.1  0.2
  x3    1   1   1                x3   0.1  0.1  0.1

Marginals p(x) (row sums) and p(y) (column sums):

        y1   y2   y3  | p(x)
  x1   0.1  0.1  0.1  | 0.3
  x2   0.1  0.1  0.2  | 0.4
  x3   0.1  0.1  0.1  | 0.3
  p(y) 0.3  0.3  0.4  |

Conditionals: p(x|y3) = (0.25, 0.50, 0.25)^T and p(y|x2) = (0.25, 0.25, 0.50)
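
The same computation in NumPy (a sketch, not from the slides), using the count table above:

import numpy as np

n = np.array([[1, 1, 1],     # counts n_{i,j}, rows = x_1..x_3, columns = y_1..y_3
              [1, 1, 2],
              [1, 1, 1]])
joint = n / n.sum()          # p(x_i, y_j)
p_x = joint.sum(axis=1)      # marginal p(x): [0.3 0.4 0.3]
p_y = joint.sum(axis=0)      # marginal p(y): [0.3 0.3 0.4]
print(joint[:, 2] / p_y[2])  # conditional p(x | y_3): [0.25 0.5 0.25]
print(joint[1, :] / p_x[1])  # conditional p(y | x_2): [0.25 0.25 0.5]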
6
Joint, Marginal and Conditional Probability

7
The Rules of Probability
I Sum rule

p(x) = \sum_{y ∈ Y} p(x, y) \quad\text{or}\quad p(y) = \sum_{x ∈ X} p(x, y)

We say that we “marginalize” x / y ⇒ p(x) / p(y) are called marginal probability.

I Product rule
p(x, y) = p(y|x)p(x) = p(x|y)p(y)

I And as a consequence we obtain Bayes Theorem:

p(y|x) = \frac{p(x|y) p(y)}{p(x)}

8
Continuous Probability Distributions

I Now X is a continuous random variable, e.g., taking values in R


I The probability that X takes a value in the interval (a, b) is

p(X ∈ (a, b)) = \int_a^b p(x) \, dx

and we call p(x) the probability density function (PDF)

9
Probability Densities
I p(x) must satisfy the following conditions:

p(x) ≥ 0 \quad\text{and}\quad \int_{-\infty}^{\infty} p(x) \, dx = 1

I The probability of x ∈ (−∞, z) is given by the cumulative distribution function:

P(z) = \int_{-\infty}^{z} p(x) \, dx

I Note that p(X = x) is often 0 while p(x) is often greater than 0. For continuous
distributions we consider p(x) as the PDF and not as short-hand for p(X = x).
10
Probability Densities

Probability density of a continuous variable


11
Joint, Marginal and Conditional Probability

I Joint density (of X and Y)

p(x, y)

I Conditional density (of X conditioned on Y)

p(x|y) = \frac{p(x, y)}{p(y)}

I Marginal density (of Y)

p(y) = \int_{x ∈ X} p(x, y) \, dx

12
Joint, Marginal and Conditional Probability

joint, marginal, conditional probability


13
The Rules of Probability
I Sum rule

p(x) = \int_{y ∈ Y} p(x, y) \, dy \quad\text{or}\quad p(y) = \int_{x ∈ X} p(x, y) \, dx

We say that we “marginalize” x / y ⇒ p(x) / p(y) are called marginal density.

I Product rule
p(x, y) = p(y|x)p(x) = p(x|y)p(y)

I And as a consequence we obtain Bayes Theorem:

p(y|x) = \frac{p(x|y) p(y)}{p(x)}

14
Independence and Conditional Independence
I Two random variables X and Y are independent if their probability distribution
can be expressed as a product of two factors:

p(x, y) = p(x) p(y)

I Two random variables X and Y are conditionally independent given a random
variable Z if the conditional probability distribution over X and Y factorizes as
follows:

p(x, y|z) = p(x|z) p(y|z)

I We denote these two statements compactly as X ⊥⊥ Y and X ⊥⊥ Y |Z


15
Common Probability Distributions
Bernoulli Distribution

Bernoulli distribution:

p(x) = µ^x (1 − µ)^{(1−x)}

I µ: probability for x = 1
I Handles two classes, e.g. "cats" vs. "dogs"

[Figure: bar plot of a Bernoulli PMF over x ∈ {0, 1}]

1
Categorical Distribution

Categorical distribution:

p(x) = µ_x

I µ_x: probability for class x
I Handles multiple classes

[Figure: bar plot of a categorical PMF over x ∈ {1, 2, 3, 4}]

2
Gaussian Distribution

Gaussian distribution:

p(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left( -\frac{(x − µ)^2}{2\sigma^2} \right)

I µ: mean
I σ: standard deviation
I The distribution has thin "tails": p(x) → 0 quickly as x → ∞

[Figure: PDF of a Gaussian distribution]

3
Multivariate Gaussian Distribution

Multivariate Gaussian distribution:


p(x) = \sqrt{\frac{1}{(2\pi)^N \det(\Sigma)}} \exp\left( -\frac{1}{2} (x − µ)^T \Sigma^{-1} (x − µ) \right)

I µ ∈ RN : mean vector
I Σ ∈ RN ×N : covariance matrix
I The distribution has thin “tails”:
p(x) → 0 quickly as x → ∞

4
Laplace Distribution

Laplace distribution:

p(x) = \frac{1}{2b} \exp\left( -\frac{|x − µ|}{b} \right)

I µ: location
I b: scale
I The distribution has heavy "tails": p(x) → 0 more slowly as x → ∞

[Figure: PDF of a Laplace distribution]

5
Mixture Distributions
We can also model mixture densities:
p(x) = \sum_{m=1}^{M} \pi_m \frac{1}{2 b_m} \exp\left( -\frac{|x − µ_m|}{b_m} \right)

Example:
I Mixture of Laplace distributions
I π_m: weight of mode m
I Constraint: \sum_m \pi_m = 1

[Figure: PDF of a mixture of Laplace distributions with multiple modes]

6
Mixture Distributions

7
Bayesian Decision Theory
Digit Classification

I Classify digits “a” versus “b”

I Goal: classify new digits such that probability of error is minimized

1
Digit classification: Prior
Prior Distribution:
I How often do the letters “a” and “b” occur ?
I Let us assume

c_1 = a: p(c_1) = 0.75
c_2 = b: p(c_2) = 0.25

I Note that the prior has to be a distribution, in particular:

\sum_{k=1,2} p(c_k) = 1

2
Digit Classification: Class Conditionals
I We describe every digit using some feature vector x, e.g.:
I the number of black pixels in each box
I relation between width and height

I Likelihood: How likely has x been generated from p(x | a) or p(x | b)?

3
Digit Classification

I Which class should we assign x to?


I Class a

4
Digit Classification

I Which class should we assign x to ?


I Class b

5
Digit Classification

I Which class should we assign x to ?


I Class a, since p(a)=0.75

6
Bayes Theorem

I How do we formalize this?


I We already mentioned Bayes Theorem:

p(y|x) = \frac{p(x|y) p(y)}{p(x)}

I Let us now apply it:

p(c_k|x) = \frac{p(x|c_k) p(c_k)}{p(x)} = \frac{p(x|c_k) p(c_k)}{\sum_i p(x|c_i) p(c_i)}

7
Bayes Theorem
I Repeated from last slide:

p(c_k|x) = \frac{p(x|c_k) p(c_k)}{p(x)} = \frac{p(x|c_k) p(c_k)}{\sum_i p(x|c_i) p(c_i)}

I We use the following names:

Posterior = \frac{Likelihood × Prior}{Normalization Factor}

I The normalization factor is also called the Partition Function


or Evidence and commonly denoted with the symbol Z
8
Bayes Theorem
Posterior = \frac{Likelihood × Prior}{Normalization Factor}

Bayesian Decision Theory:


I Model prior and likelihood; decide for class with highest posterior
I Decision theory typically additionally considers loss function/risk
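
A minimal Python sketch of this decision rule (not from the slides); the prior values are from the earlier slide, while the likelihood values are hypothetical numbers chosen for illustration:

prior = {"a": 0.75, "b": 0.25}
likelihood = {"a": 0.2, "b": 0.6}          # hypothetical p(x | c_k) for one observed feature x

evidence = sum(likelihood[c] * prior[c] for c in prior)        # normalization factor Z
posterior = {c: likelihood[c] * prior[c] / evidence for c in prior}
print(posterior)   # {'a': 0.5, 'b': 0.5} -> the strong prior compensates the weaker likelihood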
9
Expectation, Variance and Covariance
Expectation of a Function

I The expectation or expected value of a function f (x) with respect to a


probability distribution p(X) is the average of f (x) for X ∼ p(X)

I For discrete variables this can be computed with a summation:


E_{X∼p}[f(x)] = \sum_{x ∈ X} p(x) f(x)

I For continuous variables, it is computed with an integral:

E_{X∼p}[f(x)] = \int_{x ∈ X} p(x) f(x) \, dx

1
Expectation of a Random Variable

I An important special case is the one where f (x) = x


in which case we obtain the expectation of the random variable X

I For discrete variables we have:


E_{X∼p}[x] = \sum_{x ∈ X} p(x) \, x

I For continuous variables we obtain:

E_{X∼p}[x] = \int_{x ∈ X} p(x) \, x \, dx

2
Properties of Expectations
I Expectations are linear:

E[α f (x) + β g(x)] = α E[f (x)] + β E[g(x)]

I The expected value of a Gaussian random variable is the mean of its distribution:

[Figure: Gaussian PDF; the expected value coincides with the mean µ]
3
Variance of a Function

I The variance measures how much the values of a function of a random variable
X vary as we sample different values of x from its probability distribution:
Var[f(x)] = E\left[ (f(x) − E[f(x)])^2 \right]

I When the variance is low, the values of f (x) cluster near their expected value
I The square root of the variance is known as the standard deviation
I The variance/standard deviation is a parameter of the Gaussian distribution

4
Covariance of Functions
I The covariance measures how much two values are linearly related:

Cov[f (x), g(y)] = E [(f (x) − E [f (x)]) (g(y) − E [g(y)])]


I The covariance matrix of a random vector x ∈ RN is a N × N matrix
I The diagonal elements of the covariance matrix are the individual variances
I The sign of the covariance determines if variables are pos./neg. correlated
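
A minimal NumPy sketch of these quantities (not from the slides; the synthetic samples are made up for illustration):

import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=1.5, size=100_000)   # samples from a Gaussian with mu=2, sigma=1.5
y = 3.0 * x + rng.normal(size=x.shape)             # linearly related second variable

print(x.mean())          # approx. E[x] = 2.0
print(x.std())           # approx. sigma = 1.5
print(np.cov(x, y))      # 2x2 covariance matrix: variances on the diagonal, positive off-diagonal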

5
Information and Entropy
Information Theory

I Branch of applied mathematics


I Goal: quantify how much information is in a signal
I Father of information theory: Claude E. Shannon
I Founding work of the field of information theory:
Claude E. Shannon: “A Mathematical Theory of Communication”

1
A Mathematical Theory of Communication

[Figure: excerpts from Shannon: A mathematical theory of communication. Bell System Technical Journal, 1948]


Information Theory

I Information theory was originally invented to study sending messages from


discrete alphabets over noisy channels (e.g., communication via radio waves)
I Information theory tells how to design optimal codes and calculate the expected
length of messages sampled using various coding schemes
I Here, we mostly use a few key ideas from information theory to characterize
probability distributions or quantify similarity between probability distributions
4
Information Theory

Basic Intuition:
I The basic intuition behind information theory is that observing
an unlikely event is more informative than observing a likely event

Example:
I The message “the sun rose this morning” is uninformative (not worth sending)
I In contrast, the message “there was a solar eclipse this morning” is informative
I This is a rare event - in other words, receiving this message surprises us
I We will quantify information content as the “level of surprise”

5
Self-Information
We would like to quantify information in a way that formalizes this intuition:
I Likely events should have low information content & certain events none
I Less likely events should have higher information content
I Independent events should have additive information
(Finding out that a tossed coin has come up as heads twice should convey twice
as much information as finding out that it has come up as heads once.)

To satisfy all requirements, we define the self-information of an event X = x as

I(x) = − log p(x) [nats]

where log refers to the natural logarithm with units of “nats” and I(x) ≥ 0.
6
Self-Information

To satisfy all requirements, we define the self-information of an event X = x as

I(x) = − log p(x) [nats]

where log refers to the natural logarithm with units of “nats” and I(x) ≥ 0.

Remarks:
I Self-information is also interpreted as quantifying the level of “surprise”
I One nat is the amount of information gained by observing an event of probability 1/e
I When using base-2 logarithms, units are called “bits” or “shannons”
I Information measured in bits is just a rescaling of information measured in nats

7
Self-Information

I Single fair coin toss: I(x) = − log_2 p(x) = − log_2 (1/2) = 1 bit
I Two fair coin tosses: I(x) = − log_2 p(x) = − log_2 (1/4) = 2 bits
I Single unfair coin toss: I(x) = − log_2 p(x) = − log_2 1 = 0 bits
8
Shannon Entropy
I Self-information deals only with a single outcome
I We can quantify the amount of uncertainty in an entire probability distribution
p(X) using the Shannon Entropy:

H(p) = EX∼p [I(x)] = −EX∼p [log p(x)]

I Entropy H(p) = expected amount of information when drawing from p(X)


I As the self-information is positive I(x) ≥ 0, the entropy is also positive H(p) ≥ 0
I When using base-2 logarithm, the entropy specifies the lower bound on number
of bits required on average to encode symbols drawn from a distribution p(X)
I When x is continuous, the Shannon entropy is known as differential entropy
9
Shannon Entropy

Shannon Entropy:

H(p) = −E_{X∼p}[\log p(x)] = −\sum_{x ∈ X} p(x) \log p(x)

Differential Entropy:

H(p) = −E_{X∼p}[\log p(x)] = −\int_{x ∈ X} p(x) \log p(x) \, dx

I By convention, we treat 0 log 0 as limx→0 x log x = 0

10
Shannon Entropy Example

I X-Axis: Probability p of a binary random variable (e.g., coin toss) being equal to 1
I Y-Axis: Entropy of corresponding distribution H(p) = −(1 − p) log(1 − p) − p log p
I Distributions that are close to deterministic have low entropy
while distributions that are close to uniform have high entropy
11
Shannon Entropy Example

I Histograms of two discrete probability distributions over 30 bins


I The entropy of the broader distribution (on the right) is higher
I A uniform distribution would yield the largest entropy H = − log(1/30) = 3.4 nats
12
Relation to Shortest Coding Length
I Consider a RV X with 8 possible states {a, b, c, d, e, f, g, h} and a distribution p(X)
I If each state is equally likely, the entropy is: H(p) = −8 · (1/8) log_2 (1/8) = 3 bits
I Now let the probabilities be: 1/2, 1/4, 1/8, 1/16, 1/64, 1/64, 1/64, 1/64
I Then we have: H(p) = −(1/2) log_2 (1/2) − (1/4) log_2 (1/4) − . . . − (1/64) log_2 (1/64) = 2 bits
I The nonuniform distribution has a smaller entropy than the uniform one
I Let's consider coding a message to be sent from a sender to a receiver
I We could do this using a 3-bit number, but this would be suboptimal
I Instead, we can use shorter codes for the more probable events:
a = 0, b = 10, c = 110, d = 1110, e = 111100, f = 111101, g = 111110, h = 111111
I The average code length equals the entropy: (1/2) · 1 + (1/4) · 2 + · · · + (1/64) · 6 = 2 bits
I Note that shorter codes cannot be used while uniquely disambiguating symbols
e.g., 11001110 decodes uniquely into the sequence c, a, d
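
A NumPy check of this coding-length example (not from the slides):

import numpy as np

p = np.array([1/2, 1/4, 1/8, 1/16, 1/64, 1/64, 1/64, 1/64])
H = -np.sum(p * np.log2(p))             # Shannon entropy in bits
code_lengths = np.array([1, 2, 3, 4, 6, 6, 6, 6])
print(H, np.sum(p * code_lengths))      # both equal 2.0 bits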
13
Kullback-Leibler Divergence
Kullback-Leibler Divergence
If we have two probability distributions p(X) and q(X) over the same random variable
X, we can use the Kullback-Leibler (KL) divergence as a “measure of distance”:
 
D_{KL}(p ‖ q) = E_{X∼p}\left[ \log \frac{p(x)}{q(x)} \right] = E_{X∼p}[\log p(x) − \log q(x)]

I In the case of discrete variables, it measures the extra amount of information


required to send a message containing symbols drawn from p(X), when we use a
code that was designed to minimize the length of messages drawn from q(X)
I The KL divergence is non-negative, and zero only iff p(X) = q(X)
I The KL divergence is not a true distance measure: D_{KL}(p ‖ q) ≠ D_{KL}(q ‖ p)
I The KL divergence is a special case of a larger class of so-called f-divergences
1
Kullback-Leibler Divergence
Discrete Distributions:

D_{KL}(p ‖ q) = \sum_{x ∈ X} p(x) \log \frac{p(x)}{q(x)}

Continuous Distributions:

D_{KL}(p ‖ q) = \int_{x ∈ X} p(x) \log \frac{p(x)}{q(x)} \, dx

I By convention, we treat 0 log 0 as limx→0 x log x = 0

2
Kullback-Leibler Divergence Example

[Figure: two pairs of distributions p_data and p_model; left: KL divergence large, right: KL divergence small]

3
Kullback-Leibler Divergence Example

D_{KL}(p ‖ q) = \sum_{x ∈ X} p(x) \log \frac{p(x)}{q(x)} = 0.36 \log \frac{0.36}{0.33} + 0.48 \log \frac{0.48}{0.33} + 0.16 \log \frac{0.16}{0.33} ≈ 0.085

D_{KL}(q ‖ p) = \sum_{x ∈ X} q(x) \log \frac{q(x)}{p(x)} ≈ 0.097
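
A NumPy check of this example (not from the slides), assuming q is the uniform distribution over the three states (displayed as 0.33 above):

import numpy as np

p = np.array([0.36, 0.48, 0.16])
q = np.array([1/3, 1/3, 1/3])
print(np.sum(p * np.log(p / q)))   # D_KL(p || q) approx. 0.085 nats
print(np.sum(q * np.log(q / p)))   # D_KL(q || p) approx. 0.097 nats -> asymmetric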
4
Kullback-Leibler Divergence Asymmetry

I Minimizing DKL (p || q) and DKL (q || p) wrt. q does not lead to the same result
I Left: Select a q that has high probability where p has high probability
⇒ leads to blurring multiple modes to put high probability to all of them
I Right: Select a q that has low probability where p has low probability
⇒ picks one of the modes to not put probability mass in low-prob. region
5
Relationship to Cross-Entropy

The cross-entropy is closely related to the KL divergence:

H(p, q) = −E_{X∼p}[\log q(x)]
        = −E_{X∼p}[\log p(x)] + E_{X∼p}[\log p(x)] − E_{X∼p}[\log q(x)]
        = −E_{X∼p}[\log p(x)] + \underbrace{E_{X∼p}[\log p(x) − \log q(x)]}_{\text{Kullback-Leibler Divergence}}
        = H(p) + D_{KL}(p ‖ q)

I Minimizing the cross-entropy wrt. q is equivalent to minimizing


the KL divergence wrt. q, because H(p) does not depend on q

6
The Argmin and Argmax Operators
The Argmin and Argmax Operators

Let X denote a set. We define the argmin and argmax operators as follows:

\operatorname{argmin}_{x ∈ X} f(x) = \left\{ x \mid f(x) = \min_{x' ∈ X} f(x') \right\}

\operatorname{argmax}_{x ∈ X} f(x) = \left\{ x \mid f(x) = \max_{x' ∈ X} f(x') \right\}

Examples:
I argmin_{x ∈ R} x^2 = 0
I argmin_{x ∈ [−1,1]} x = −1
I argmax_{x ∈ [0,4π]} cos(x) = {0, 2π, 4π}
I argmin_{x ∈ R} 2 = R
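
A discretized NumPy illustration (not from the slides); np.argmin/np.argmax return indices, and only the first maximizer is reported when there are several:

import numpy as np

x = np.linspace(-1.0, 1.0, 201)       # a discretized version of the set X = [-1, 1]
f = x**2
print(x[np.argmin(f)])                 # argmin of x^2 over [-1, 1] -> 0.0
print(x[np.argmax(f)])                 # argmax -> -1.0 (first of the two maximizers)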

1
Example: Maximum Likelihood Estimation
I Let X = {(x_i, y_i)}_{i=1}^N be a dataset with samples drawn i.i.d. from p_data
I Let the model p_model(y|x, w) be a parametric family of probability distributions
I The conditional maximum likelihood estimator for w is given by

ŵ_{ML} = \operatorname{argmax}_w \; p_{model}(y | X, w)
       \stackrel{iid}{=} \operatorname{argmax}_w \; \prod_{i=1}^{N} p_{model}(y_i | x_i, w)
       = \operatorname{argmax}_w \; \underbrace{\sum_{i=1}^{N} \log p_{model}(y_i | x_i, w)}_{\text{Log-Likelihood}}
