Lecture 3 Mathematics For Machine Learning
Lecture Outline
• Linear algebra
Vectors
Matrices
Eigen decomposition
• Differential calculus
• Optimization algorithms
• Probability
Random variables
Probability distributions
• Information theory
Notation
Vectors
• Vector definition
Computer science: a vector is a one-dimensional array of ordered real-valued scalars
Mathematics: a vector is a quantity possessing both magnitude and direction, represented by an arrow whose direction indicates the vector's direction and whose length is proportional to its magnitude
• Vectors are written in column form or in row form
Denoted by bold-font lower-case letters
• For a general vector $\mathbf{x}$ with $n$ elements, the vector lies in the $n$-dimensional space $\mathbb{R}^n$
Geometry of Vectors
• Vector addition
We add the corresponding coordinates; geometrically, we place the vectors head to tail and follow the directions given by the two vectors that are added
Geometry of Vectors
• The dot product of two vectors: $\mathbf{u} \cdot \mathbf{v} = \|\mathbf{u}\| \|\mathbf{v}\| \cos(\theta)$
The angle between the vectors: $\theta = \arccos\left(\frac{\mathbf{u} \cdot \mathbf{v}}{\|\mathbf{u}\| \|\mathbf{v}\|}\right)$
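A quick numerical check of this relation (a minimal NumPy sketch; the vectors u and v below are arbitrary illustrative values, not from the slides):

```python
import numpy as np

# Two example vectors (arbitrary illustrative values)
u = np.array([1.0, 2.0, 3.0])
v = np.array([4.0, -1.0, 2.0])

# Dot product and norms
dot = np.dot(u, v)
norm_u = np.linalg.norm(u)
norm_v = np.linalg.norm(v)

# Angle between the vectors: theta = arccos(u.v / (||u|| ||v||))
theta = np.arccos(dot / (norm_u * norm_v))
print(np.degrees(theta))  # angle in degrees
```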
Norm of a Vector
• The general $\ell_p$ norm of a vector is obtained as: $\|\mathbf{x}\|_p = \left( \sum_{i=1}^{n} |x_i|^p \right)^{1/p}$
On the next page we will review the most common norms, obtained for $p = 2$ and $p = 1$
Norm of a Vector
• For $p = 2$ we have the $\ell_2$ norm: $\|\mathbf{x}\|_2 = \sqrt{\sum_{i=1}^{n} x_i^2} = \sqrt{\mathbf{x}^T \mathbf{x}}$
Also called Euclidean norm
It is the most often used norm
The $\ell_2$ norm is often denoted just as $\|\mathbf{x}\|$, with the subscript 2 omitted
• For $p = 1$ we have the $\ell_1$ norm: $\|\mathbf{x}\|_1 = \sum_{i=1}^{n} |x_i|$
Uses the absolute values of the elements
It discriminates between zero and non-zero elements
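These norms can be computed directly with NumPy; a minimal sketch (the vector x is an arbitrary example):

```python
import numpy as np

x = np.array([3.0, -4.0, 0.0])

l2 = np.linalg.norm(x)         # Euclidean (L2) norm: sqrt(9 + 16 + 0) = 5.0
l1 = np.linalg.norm(x, ord=1)  # L1 norm: |3| + |-4| + |0| = 7.0
lp = np.sum(np.abs(x) ** 3) ** (1 / 3)  # general p-norm with p = 3

print(l2, l1, lp)
```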
Vector Projection
Hyperplanes
• A hyperplane is a subspace whose dimension is one less than that of its ambient space
In a 2D space, a hyperplane is a straight line (i.e., 1D)
In a 3D space, a hyperplane is a plane (i.e., 2D)
In a d-dimensional vector space, a hyperplane has $d-1$ dimensions, and divides the space into two half-spaces
• A hyperplane is a generalization of the concept of a plane to high-dimensional spaces
• In ML, hyperplanes are decision boundaries used for linear classification
Data points falling on either side of the hyperplane are attributed to different classes
Hyperplanes
• Solving , we obtain
Hyperplanes
• In a 3D space, if we have a vector $\mathbf{w}$ and try to find all points $\mathbf{x}$ that satisfy $\mathbf{w} \cdot \mathbf{x} = b$, we obtain a plane that is orthogonal to the vector $\mathbf{w}$
The inequalities $\mathbf{w} \cdot \mathbf{x} > b$ and $\mathbf{w} \cdot \mathbf{x} < b$ again define the two half-spaces that are created by the plane
Matrices
• Addition or subtraction: $(\mathbf{A} \pm \mathbf{B})_{i,j} = \mathbf{A}_{i,j} \pm \mathbf{B}_{i,j}$
• Scalar multiplication: $(c\mathbf{A})_{i,j} = c\,\mathbf{A}_{i,j}$
• Matrix multiplication: $(\mathbf{A}\mathbf{B})_{i,j} = \mathbf{A}_{i,1}\mathbf{B}_{1,j} + \mathbf{A}_{i,2}\mathbf{B}_{2,j} + \cdots + \mathbf{A}_{i,n}\mathbf{B}_{n,j}$
Defined only if the number of columns of the left matrix is the same as the number of rows of the right matrix
Note that in general $\mathbf{A}\mathbf{B} \neq \mathbf{B}\mathbf{A}$
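A minimal NumPy sketch of these operations (the matrices A and B are arbitrary illustrative values):

```python
import numpy as np

A = np.array([[1.0, 2.0],
              [3.0, 4.0]])
B = np.array([[0.0, 1.0],
              [1.0, 0.0]])

print(A + B)        # element-wise addition
print(3 * A)        # scalar multiplication
print(A @ B)        # matrix multiplication (rows of A times columns of B)
print(np.allclose(A @ B, B @ A))  # False in general: AB != BA
```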
Matrices
• Transpose: $(\mathbf{A}^T)_{i,j} = \mathbf{A}_{j,i}$
Some properties: $(\mathbf{A}^T)^T = \mathbf{A}$, $(\mathbf{A} + \mathbf{B})^T = \mathbf{A}^T + \mathbf{B}^T$, $(\mathbf{A}\mathbf{B})^T = \mathbf{B}^T \mathbf{A}^T$
• Identity matrix ($\mathbf{I}_n$): has ones on the main diagonal, and zeros elsewhere
E.g., identity matrix of size 3×3: $\mathbf{I}_3 = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix}$
Matrices
In the above, the minor $M_{ij}$ of the matrix is obtained by removing the row and column associated with the indices i and j
• The trace of a matrix is the sum of all diagonal elements: $\operatorname{tr}(\mathbf{A}) = \sum_{i} \mathbf{A}_{i,i}$
Matrix-Vector Products
Matrix-Matrix Products
• Size: the product of an $m \times n$ matrix and an $n \times p$ matrix is an $m \times p$ matrix
Linear Dependence
Matrix Rank
• For an $m \times n$ matrix, the rank of the matrix is the largest number of linearly independent columns
• The matrix B from the previous example has $\operatorname{rank}(\mathbf{B}) = 1$, since the two columns are linearly dependent
$\mathbf{B} = \begin{bmatrix} 2 & -1 \\ 4 & -2 \end{bmatrix}$
• The matrix C from the same example has $\operatorname{rank}(\mathbf{C}) = 2$, since it has two linearly independent columns
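These rank claims can be verified with NumPy; a minimal sketch using the matrix B above (the matrix C here is just an arbitrary full-rank stand-in, not necessarily the one from the slides):

```python
import numpy as np

B = np.array([[2.0, -1.0],
              [4.0, -2.0]])   # second column is -1/2 times the first
C = np.array([[1.0, 0.0],
              [0.0, 1.0]])    # arbitrary example with independent columns

print(np.linalg.matrix_rank(B))  # 1
print(np.linalg.matrix_rank(C))  # 2
```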
Inverse of a Matrix
• For a square $n \times n$ matrix $\mathbf{A}$ with rank $n$, $\mathbf{A}^{-1}$ is its inverse matrix if their product is an identity matrix $\mathbf{I}$, i.e., $\mathbf{A}\mathbf{A}^{-1} = \mathbf{A}^{-1}\mathbf{A} = \mathbf{I}$
A useful property: $(\mathbf{A}\mathbf{B})^{-1} = \mathbf{B}^{-1}\mathbf{A}^{-1}$
• If $\operatorname{rank}(\mathbf{A}) < n$ (i.e., $\det(\mathbf{A}) = 0$), then the inverse does not exist
A matrix that is not invertible is called a singular matrix
• Note that finding an inverse of a large matrix is computationally expensive
In addition, it can lead to numerical instability
• If the inverse of a matrix is equal to its transpose, the matrix is said to be an orthogonal matrix
$\mathbf{A}^{-1} = \mathbf{A}^T$
Pseudo-Inverse of a Matrix
• Pseudo-inverse of a matrix
Also known as Moore-Penrose pseudo-inverse
• For matrices that are not square, the inverse does not exist
Therefore, a pseudo-inverse is used
• If $m > n$ (more rows than columns), then the pseudo-inverse is $\mathbf{A}^+ = (\mathbf{A}^T\mathbf{A})^{-1}\mathbf{A}^T$ and $\mathbf{A}^+\mathbf{A} = \mathbf{I}$
• If $m < n$ (more columns than rows), then the pseudo-inverse is $\mathbf{A}^+ = \mathbf{A}^T(\mathbf{A}\mathbf{A}^T)^{-1}$ and $\mathbf{A}\mathbf{A}^+ = \mathbf{I}$
E.g., for a matrix $\mathbf{A}$ with dimension $m \times n$, a pseudo-inverse $\mathbf{A}^+$ can be found of size $n \times m$, so that $\mathbf{A}^+\mathbf{A} = \mathbf{I}$
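A minimal NumPy sketch of the inverse and the Moore-Penrose pseudo-inverse (the matrices are arbitrary examples):

```python
import numpy as np

# Square, full-rank matrix: the ordinary inverse exists
A = np.array([[2.0, 1.0],
              [1.0, 3.0]])
A_inv = np.linalg.inv(A)
print(np.allclose(A @ A_inv, np.eye(2)))   # True

# Non-square matrix: use the pseudo-inverse instead
M = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])            # shape (3, 2)
M_pinv = np.linalg.pinv(M)            # shape (2, 3)
print(np.allclose(M_pinv @ M, np.eye(2)))  # True, since M has full column rank
```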
Tensors
Manifolds
• Example:
The data points have 3 dimensions (left figure), i.e., the input space of the data is 3-
dimensional
The data points lie on a 2-dimensional manifold, shown in the right figure
Most ML algorithms extract lower-dimensional data features that make it possible to distinguish between various classes of high-dimensional input data
o The low-dimensional representations of the input data are called embeddings
Eigen Decomposition
If any of the eigenvalues are zero, the matrix is singular (it does not have an inverse)
• However, not every matrix can be decomposed into eigenvalues and
eigenvectors
Also, in some cases the decomposition may involve complex numbers
Still, every real symmetric matrix is guaranteed to have an eigen decomposition according to $\mathbf{A} = \mathbf{Q}\boldsymbol{\Lambda}\mathbf{Q}^T$, where $\mathbf{Q}$ is an orthogonal matrix of eigenvectors and $\boldsymbol{\Lambda}$ is a diagonal matrix of eigenvalues
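A minimal NumPy sketch of the eigen decomposition of a real symmetric matrix (the matrix A is an arbitrary example):

```python
import numpy as np

# Symmetric matrix, so A = Q diag(w) Q^T holds with an orthogonal Q
A = np.array([[4.0, 1.0],
              [1.0, 3.0]])

w, Q = np.linalg.eigh(A)   # eigenvalues w, orthonormal eigenvectors as columns of Q
print(w)                                        # all non-zero, so A is non-singular
print(np.allclose(Q @ np.diag(w) @ Q.T, A))     # reconstructs A
print(np.allclose(Q.T @ Q, np.eye(2)))          # Q is orthogonal
```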
Matrix Norms
• Frobenius norm – calculates the square root of the summed squares of the elements of matrix $\mathbf{X}$: $\|\mathbf{X}\|_F = \sqrt{\sum_{i=1}^{m} \sum_{j=1}^{n} x_{ij}^2}$
This norm is similar to the Euclidean norm of a vector
• $L_{2,1}$ norm – is the sum of the Euclidean norms of the columns of matrix $\mathbf{X}$: $\|\mathbf{X}\|_{2,1} = \sum_{j=1}^{n} \sqrt{\sum_{i=1}^{m} x_{ij}^2}$
• Max norm – is the largest element of matrix $\mathbf{X}$: $\|\mathbf{X}\|_{\max} = \max_{i,j} \left( x_{ij} \right)$
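A minimal NumPy sketch of the three matrix norms (the matrix X is an arbitrary example):

```python
import numpy as np

X = np.array([[1.0, -2.0],
              [3.0,  4.0]])

fro = np.linalg.norm(X, 'fro')                   # Frobenius norm
l21 = np.sum(np.sqrt(np.sum(X ** 2, axis=0)))    # L2,1 norm: sum of column Euclidean norms
max_norm = np.max(np.abs(X))                     # max norm: largest element (in absolute value)

print(fro, l21, max_norm)
```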
Differential Calculus
• The following rules are used for computing the derivatives of explicit functions
Taylor Series
Geometric Interpretation
• The expression $f(x) \approx f(a) + f'(a)(x - a)$ approximates the function by a line which passes through the point $(a, f(a))$ and has slope $f'(a)$ (i.e., the value of the derivative of $f$ at the point $a$)
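As a small worked example of this linear approximation (not from the slides): for $f(x) = e^x$ at the point $a = 0$, the line is $f(x) \approx f(0) + f'(0)(x - 0) = 1 + x$; at $x = 0.1$ it gives 1.1, close to the true value $e^{0.1} \approx 1.105$.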
Partial Derivatives
Gradient
• When there is no ambiguity, the notations $\nabla f(\mathbf{x})$ or $\nabla f$ are often used for the gradient instead of $\nabla_{\mathbf{x}} f(\mathbf{x})$
The symbol for the gradient is the Greek letter $\nabla$ (pronounced "nabla"), although $\nabla_{\mathbf{x}} f(\mathbf{x})$ is more often pronounced "gradient of f with respect to x"
• In ML, the gradient descent algorithm relies on moving in the direction opposite to the gradient of the loss function with respect to the model parameters in order to minimize the loss function (see the sketch below)
Adversarial examples can be created by adding a perturbation in the direction of the gradient of the loss with respect to the input examples, in order to maximize the loss function
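A minimal sketch of using the negative gradient direction, on a simple quadratic loss (an arbitrary toy example, not the loss of any actual model):

```python
import numpy as np

# Toy loss: L(w) = ||w - w_opt||^2, with gradient 2 * (w - w_opt)
w_opt = np.array([3.0, -1.0])

def loss(w):
    return np.sum((w - w_opt) ** 2)

def grad(w):
    return 2 * (w - w_opt)

w = np.zeros(2)          # initial parameters
lr = 0.1                 # learning rate
for _ in range(100):
    w = w - lr * grad(w)   # step in the direction opposite to the gradient

print(w, loss(w))        # w approaches [3, -1], loss approaches 0
```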
Hessian Matrix
• The second partial derivatives are assembled in a matrix called the Hessian
• Computing and storing the Hessian matrix for functions with high-dimensional
inputs can be computationally prohibitive
E.g., the loss function for a ResNet50 model with approximately 23 million parameters has a Hessian with $(23 \times 10^6)^2 \approx 5.3 \times 10^{14}$ (about 529 trillion) elements
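For low-dimensional inputs the Hessian can be formed explicitly; a minimal sketch using central finite differences (the function f is an arbitrary example):

```python
import numpy as np

def f(x):
    # Example function of two variables: f(x, y) = x^2 * y + y^3
    return x[0] ** 2 * x[1] + x[1] ** 3

def hessian(f, x, eps=1e-4):
    n = x.size
    H = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            e_i = np.zeros(n); e_i[i] = eps
            e_j = np.zeros(n); e_j[j] = eps
            # Central-difference approximation of d^2 f / (dx_i dx_j)
            H[i, j] = (f(x + e_i + e_j) - f(x + e_i - e_j)
                       - f(x - e_i + e_j) + f(x - e_i - e_j)) / (4 * eps ** 2)
    return H

x0 = np.array([1.0, 2.0])
print(hessian(f, x0))   # analytic Hessian at (1, 2): [[4, 2], [2, 12]]
```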
Jacobian Matrix
For example, in robotics, a robot Jacobian matrix gives the partial derivatives of the translational and angular velocities of the robot end-effector with respect to the joint (i.e., axis) velocities
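A minimal sketch of a numerical Jacobian via finite differences for a vector-valued function (the function g below is an arbitrary example, not a robot model):

```python
import numpy as np

def g(x):
    # Example mapping from R^2 to R^3
    return np.array([x[0] * x[1],
                     np.sin(x[0]),
                     x[1] ** 2])

def jacobian(g, x, eps=1e-6):
    y0 = g(x)
    J = np.zeros((y0.size, x.size))
    for j in range(x.size):
        dx = np.zeros_like(x)
        dx[j] = eps
        # Forward-difference approximation of column j: d g / d x_j
        J[:, j] = (g(x + dx) - y0) / eps
    return J

x0 = np.array([1.0, 2.0])
print(jacobian(g, x0))  # analytic Jacobian: [[2, 1], [cos(1), 0], [0, 4]]
```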
Integral Calculus
• For a function $f(x)$ defined on the domain $[a, b]$, the definite integral of the function is denoted $\int_{a}^{b} f(x)\,dx$
• The geometric interpretation of the integral is the area between the horizontal axis and the graph of $f(x)$ between the points a and b
In this figure, the integral is the sum of the blue areas (where $f(x) > 0$) minus the pink area (where $f(x) < 0$)
Optimization
• Optimization and machine learning have related, but somewhat different goals
Goal in optimization: minimize an objective function
o For a set of training examples, reduce the training error
Goal in ML: find a suitable model, to predict on data examples
o For a set of testing examples, reduce the generalization error
• For a given empirical function g (dashed purple curve), optimization algorithms
attempt to find the point of minimum empirical risk
• The expected function f (blue curve) differs from the empirical function g, which is obtained from a limited amount of training data examples
• ML algorithms attempt to find the point of
minimum expected risk, based on minimizing
the error on a set of testing examples
o Which may be at a different location than the
minimum of the training examples
o And which may not be minimal in a formal sense
Stationary Points
Local Minima
Saddle Points
• The gradient of a function at a saddle point is 0, but the point is not a minimum
or maximum point
The optimization algorithms may stall at saddle points, without reaching a minimum
• Note also that the point of a function at which the sign of the curvature changes
is called an inflection point
An inflection point can also be a saddle point, but it does not have to be
• For the 2D function $f(x, y) = x^2 - y^2$ (right figure), the saddle point is at (0, 0)
The point looks like a saddle, and gives the minimum with respect to x and the maximum with respect to y
Picture from: http://d2l.ai/chapter_optimization/optimization-intro.html
Convex Optimization
Convex Functions
• In mathematical terms, the function f is a convex function if for all points $x_1, x_2$ and for all $\lambda \in [0, 1]$: $f(\lambda x_1 + (1 - \lambda) x_2) \leq \lambda f(x_1) + (1 - \lambda) f(x_2)$
Convex Functions
• One important property of convex functions is that they do not have suboptimal local minima
Every local minimum of a convex function is a global minimum
I.e., every point at which the gradient of a convex function equals 0 is a global minimum
The figure below illustrates two convex functions, and one nonconvex function
Convex Functions
• The Danish mathematician Johan Jensen showed that this can be generalized to all convex combinations with non-negative coefficients $\alpha_i$ that sum to 1: $f\left(\sum_i \alpha_i x_i\right) \leq \sum_i \alpha_i f(x_i)$, or in terms of expectations, $f(\mathbb{E}[X]) \leq \mathbb{E}[f(X)]$
I.e., the expectation of a convex function is larger than (or equal to) the convex function of an expectation
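A minimal numerical check of Jensen's inequality for the convex function $f(x) = e^x$ (a sketch; the samples are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=0.0, scale=1.0, size=100_000)  # samples of a random variable X

f = np.exp                       # a convex function
lhs = f(np.mean(x))              # f(E[X])
rhs = np.mean(f(x))              # E[f(X)]
print(lhs, rhs, lhs <= rhs)      # E[f(X)] >= f(E[X]) holds
```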
Convex Sets
• A set $\mathcal{X}$ in a vector space is a convex set if for any $a, b \in \mathcal{X}$ the line segment connecting a and b is also in $\mathcal{X}$
• For all $\lambda \in [0, 1]$, we have $\lambda a + (1 - \lambda) b \in \mathcal{X}$ for all $a, b \in \mathcal{X}$
• In the figure, each point represents a 2D vector
The left set is nonconvex, and the other two sets are convex
• Properties of convex sets include:
If $\mathcal{X}$ and $\mathcal{Y}$ are convex sets, then their intersection $\mathcal{X} \cap \mathcal{Y}$ is also convex
If $\mathcal{X}$ and $\mathcal{Y}$ are convex sets, then their union $\mathcal{X} \cup \mathcal{Y}$ is not necessarily convex
Constrained Optimization
• The points that satisfy the constraints form the feasible region
• Various optimization algorithms have been developed for handling optimization
problems based on whether the constraints are equalities, inequalities, or a
combination of equalities and inequalities
Lagrange Multipliers
Projections
• This means that the vector is projected onto the closest vector that belongs to
the set
• A lower bound of a subset $\mathcal{S}$ of a partially ordered set $\mathcal{K}$ is an element $a \in \mathcal{K}$, such that $a \leq s$ for all $s \in \mathcal{S}$
E.g., for a subset of the natural numbers whose smallest element is 2, the lower bounds are the numbers 2, 1, and 0
• The infimum of a subset $\mathcal{S}$ of a partially ordered set $\mathcal{K}$ is the greatest lower bound, denoted $\inf \mathcal{S}$
It is the maximal quantity $a$ such that $a \leq s$ for all $s \in \mathcal{S}$
E.g., the infimum of the above subset is 2, since it is the greatest lower bound
• Example: consider the subset of positive real numbers (excluding zero)
The subset does not have a minimum, because for every small positive number there is another even smaller positive number
On the other hand, all real negative numbers and 0 are lower bounds on the subset
0 is the greatest of all lower bounds, and therefore, the infimum of the subset is 0
• An upper bound of a subset $\mathcal{S}$ of a partially ordered set $\mathcal{K}$ is an element $b \in \mathcal{K}$, such that $s \leq b$ for all $s \in \mathcal{S}$
E.g., for a subset of the natural numbers whose largest element is 8, the upper bounds are the numbers 8, 9, 40, and all other natural numbers greater than 8
• The supremum of a subset $\mathcal{S}$ of a partially ordered set $\mathcal{K}$ is the least upper bound, denoted $\sup \mathcal{S}$
It is the minimal quantity $b$ such that $s \leq b$ for all $s \in \mathcal{S}$
E.g., the supremum of the above subset is 8, since it is the least upper bound
• Example: for the subset of negative real numbers (excluding zero)
All real positive numbers and 0 are upper bounds
0 is the least upper bound, and therefore, the supremum of the subset is 0
Lipschitz Function
Probability
• Intuition:
In a process, several outcomes are possible
When the process is repeated a large number of times, each outcome occurs with a
relative frequency, or probability
If a particular outcome occurs more often, we say it is more probable
• Probability arises in two contexts
In actual repeated experiments
o Example: You record the color of 1,000 cars driving by. 57 of them are green. You estimate the
probability of a car being green as 57/1,000 = 0.057.
In idealized conceptions of a repeated process
o Example: You consider the behavior of an unbiased six-sided die. The expected probability of rolling a 5 is 1/6 ≈ 0.1667.
o Example: You need a model for how people’s heights are distributed. You choose a normal
distribution to represent the expected relative probabilities.
Random variables
Axioms of probability
• The probability of a random variable must obey the axioms of probability over
the possible values in the sample space
Discrete Variables
• Probability distribution that acts on many variables at the same time is known as
a joint probability distribution
• Given any values x and y of two random variables X and Y, what is the probability that X = x and Y = y simultaneously?
$P(X = x, Y = y)$ denotes the joint probability
We may also write $P(x, y)$ for brevity
Bayes’ Theorem
Independence
• Two random variables X and Y are independent if the occurrence of X does not reveal any information about the occurrence of Y
E.g., two successive rolls of a die are independent
• Therefore, we can write: $P(Y \mid X) = P(Y)$
The following notation is used: $X \perp Y$
Also note that for independent random variables: $P(X, Y) = P(X)\,P(Y)$
• In all other cases, the random variables are dependent
E.g., duration of successive eruptions of Old Faithful
Getting a king on successive draws from a deck (the drawn card is not replaced)
Expected Value
When the identity of the distribution is clear from the context, we can write $\mathbb{E}_x[f(x)]$
If it is clear which random variable is used, we can write just $\mathbb{E}[f(x)]$
• The mean is the most common measure of central tendency of a distribution
For a random variable: $\mu = \mathbb{E}[X] = \sum_{x} x\,P(x)$
This is similar to the mean of a sample of observations: $\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$
Other measures of central tendency: median, mode
Variance
• Variance gives a measure of how much the values of the function $f(x)$ deviate from the expected value as we sample values of X from its distribution: $\mathrm{Var}(f(x)) = \mathbb{E}\left[(f(x) - \mathbb{E}[f(x)])^2\right]$
• When the variance is low, the values of $f(x)$ cluster near the expected value
• Variance is commonly denoted with $\sigma^2$
The above equation is written for a general function $f(x)$; for $f(x) = x$ we have $\mathrm{Var}(X) = \mathbb{E}\left[(X - \mathbb{E}[X])^2\right] = \mathbb{E}[X^2] - (\mathbb{E}[X])^2$
This is similar to the formula for calculating the variance of a sample of observations: $s^2 = \frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2$
• The square root of the variance is the standard deviation
Denoted $\sigma$
Covariance
• Covariance gives the measure of how much two random variables are linearly
related to each other
• If $\mu_X = \mathbb{E}[X]$ and $\mu_Y = \mathbb{E}[Y]$
Then, the covariance is: $\mathrm{Cov}(X, Y) = \mathbb{E}\left[(X - \mu_X)(Y - \mu_Y)\right]$
Compare to the covariance of actual samples: $\mathrm{cov}(x, y) = \frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})$
• The covariance measures the tendency for X and Y to deviate from their means in the same (or opposite) directions at the same time
Picture from: Jeff Howbert — Machine Learning Math Essentials
Correlation
Covariance Matrix
• I.e., $\boldsymbol{\Sigma}_{i,j} = \mathrm{Cov}(X_i, X_j)$
• The diagonal elements of the covariance matrix are the variances of the elements of the vector, i.e., $\boldsymbol{\Sigma}_{i,i} = \mathrm{Var}(X_i)$
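A minimal NumPy sketch of estimating a covariance matrix from samples (arbitrary synthetic data):

```python
import numpy as np

rng = np.random.default_rng(0)
# 1,000 samples of a 2-dimensional random vector with correlated components
x1 = rng.normal(size=1000)
x2 = 0.8 * x1 + 0.2 * rng.normal(size=1000)
data = np.stack([x1, x2])        # shape (2, 1000): each row is one variable

cov = np.cov(data)               # 2 x 2 covariance matrix
print(cov)                       # diagonal entries are the variances of x1 and x2
```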
Probability Distributions
• Bernoulli distribution
Binary random variable with states $\{0, 1\}$
The random variable can encode a coin flip which comes up 1 with probability p and 0 with probability $1 - p$
Notation: $X \sim \mathrm{Bernoulli}(p)$
• Uniform distribution
The probability of each of the k possible values is $1/k$
Notation: $X \sim \mathrm{U}(k)$
Figure: Bernoulli distribution with $p = 0.3$
• Binomial distribution
Performing a sequence of n independent experiments, each of which has probability p of succeeding, where $p \in [0, 1]$
The probability of getting k successes in n trials is $P(X = k) = \binom{n}{k} p^k (1 - p)^{n - k}$
Notation: $X \sim \mathrm{Binomial}(n, p)$ (figure: $n = 10$, $p = 0.2$)
• Poisson distribution
A number of events occurring independently in a fixed interval of time with a known rate $\lambda$
A discrete random variable with states $\{0, 1, 2, \dots\}$ has probability $P(X = k) = \frac{\lambda^k e^{-\lambda}}{k!}$
The rate $\lambda$ is the average number of occurrences of the event
Notation: $X \sim \mathrm{Poisson}(\lambda)$ (figure: $\lambda = 5$)
• Gaussian distribution
The most well-studied distribution
o Referred to as normal distribution or informally bell-shaped distribution
Defined with the mean $\mu$ and variance $\sigma^2$
Notation: $X \sim \mathcal{N}(\mu, \sigma^2)$
For a random variable x, the density is $p(x) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{(x - \mu)^2}{2\sigma^2}}$; for n independent measurements, the joint density is the product of the individual densities
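A minimal sketch that evaluates these probability functions directly from the formulas above (n = 10, p = 0.2 and λ = 5 match the figure annotations; the value of k and the Gaussian evaluation point are arbitrary):

```python
from math import comb, exp, factorial, pi, sqrt

# Binomial(n=10, p=0.2): probability of k successes
n, p, k = 10, 0.2, 3
p_binom = comb(n, k) * p ** k * (1 - p) ** (n - k)

# Poisson(lambda=5): probability of k events
lam = 5
p_pois = lam ** k * exp(-lam) / factorial(k)

# Gaussian density with mean mu and variance sigma^2, evaluated at x
mu, sigma, x = 0.0, 1.0, 0.5
p_gauss = exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / sqrt(2 * pi * sigma ** 2)

print(p_binom, p_pois, p_gauss)
```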
• Multinoulli distribution
It is an extension of the Bernoulli distribution, from binary class to multi-class
Multinoulli distribution is also called categorical distribution or generalized Bernoulli
distribution
Multinoulli is a discrete probability distribution that describes the possible results of a
random variable that can take on one of k possible categories
o A categorical random variable is a discrete variable with more than two possible outcomes
(such as the roll of a die)
For example, in multi-class classification in machine learning, we have a set of data examples $x_1, \dots, x_n$, and corresponding to the data example $x_i$ is a k-class label $\mathbf{y}_i$ representing a one-hot encoding
o One-hot encoding is also called 1-of-k vector, where one element has the value 1 and all other elements have the value 0
o Let's denote the probabilities for assigning the class labels to a data example by $p_1, p_2, \dots, p_k$
o We know that $p_j \geq 0$ and $\sum_{j=1}^{k} p_j = 1$ for the different classes
o The multinoulli probability of the data example is $P(\mathbf{y}_i \mid \mathbf{p}) = \prod_{j=1}^{k} p_j^{y_{ij}}$
o Similarly, we can calculate the probability of all data examples as $P(\mathbf{y}_1, \dots, \mathbf{y}_n \mid \mathbf{p}) = \prod_{i=1}^{n} \prod_{j=1}^{k} p_j^{y_{ij}}$
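A minimal sketch of the multinoulli probability of a one-hot label (the class probabilities and label below are arbitrary illustrative values):

```python
import numpy as np

p = np.array([0.1, 0.2, 0.7])   # class probabilities, non-negative and summing to 1
y = np.array([0, 0, 1])         # one-hot label: the example belongs to class 3

# Multinoulli probability: product over classes of p_j ** y_j
prob = np.prod(p ** y)
print(prob)                      # 0.7 -- the probability assigned to the true class
```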
Information Theory
Self-information
• The basic intuition behind information theory is that learning that an unlikely
event has occurred is more informative than learning that a likely event has
occurred
E.g., a message saying "the sun rose this morning" is so uninformative that it is unnecessary to send it
But, a message saying "there was a solar eclipse this morning" is very informative
• Based on that intuition, Shannon defined the self-information of an event x as $I(x) = -\log P(x)$
Entropy
For continuous random variables, the entropy is also called differential entropy
Kullback–Leibler Divergence
• KL divergence is non-negative: $D_{KL}(P \| Q) \geq 0$
• $D_{KL}(P \| Q) = 0$ if and only if $P$ and $Q$ are the same distribution
• The most important property of KL divergence is that it is non-symmetric, i.e., $D_{KL}(P \| Q) \neq D_{KL}(Q \| P)$ in general
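A minimal NumPy sketch relating entropy, cross-entropy, and KL divergence for two discrete distributions (arbitrary example distributions):

```python
import numpy as np

p = np.array([0.5, 0.3, 0.2])    # distribution P
q = np.array([0.1, 0.6, 0.3])    # distribution Q

entropy_p = -np.sum(p * np.log(p))          # H(P)
cross_pq = -np.sum(p * np.log(q))           # cross-entropy H(P, Q)
kl_pq = np.sum(p * np.log(p / q))           # D_KL(P || Q)
kl_qp = np.sum(q * np.log(q / p))           # D_KL(Q || P)

print(np.isclose(cross_pq, entropy_p + kl_pq))  # H(P, Q) = H(P) + D_KL(P || Q)
print(kl_pq, kl_qp)                             # non-negative and not equal (non-symmetric)
```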
Cross-entropy
Maximum Likelihood
• For a total number of n observed data examples $x_1, \dots, x_n$, the predicted class probabilities for the data example $x_i$ are denoted $\hat{\mathbf{y}}_i$
Using the multinoulli distribution, the probability of predicting the true class label $\mathbf{y}_i$ is $P(\mathbf{y}_i \mid x_i) = \prod_{j=1}^{k} \hat{y}_{ij}^{\,y_{ij}}$, where $y_{ij}$ is the one-hot encoding of the true label
E.g., for a problem with 3 classes and an image of a car, the true label is the one-hot vector for the class "car", and the probability of the example reduces to the predicted probability assigned to the "car" class
• Assuming that the data examples are independent, the likelihood of the data given the model parameters $\theta$ can be written as $L(\theta) = \prod_{i=1}^{n} P(\mathbf{y}_i \mid x_i, \theta)$
• Log-likelihood is often used because it simplifies numerical calculations, since it transforms a product with many terms into a summation, e.g., $\log \prod_{i=1}^{n} P(\mathbf{y}_i \mid x_i, \theta) = \sum_{i=1}^{n} \log P(\mathbf{y}_i \mid x_i, \theta)$
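A minimal sketch of why the log-likelihood is preferred numerically (the per-example probabilities below are arbitrary synthetic values):

```python
import numpy as np

rng = np.random.default_rng(0)
# Probabilities assigned by a model to the true class of 1,000 examples
probs = rng.uniform(0.05, 1.0, size=1000)

likelihood = np.prod(probs)             # underflows to 0.0 for many small factors
log_likelihood = np.sum(np.log(probs))  # stable: a sum instead of a product

print(likelihood, log_likelihood)
```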
References