23ECE216 Machine Learning
Partial Derivatives
Stationary points
A stationary point of a differentiable function f is a point x* at which f′(x*) = 0.
Relative (local) and Global Optimum …contd.
[Figure: plots of f(x) on the interval [a, b]; the points A1, A2, A3 and B1, B2 mark relative (local) optima, and A2 is also the global optimum.]
Example 1
Find the optimum value of the function and state whether it attains a maximum or a minimum: f(x) = x² + 3x − 5.
Solution:
For a maximum or minimum, set f′(x) = 2x + 3 = 0, which gives x* = −3/2.
Since f″(x) = 2 > 0, the point x* = −3/2 is a point of minimum, and the minimum value is f(−3/2) = −29/4.
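The same result can be checked symbolically. A minimal sketch using Python's sympy (the library choice and variable names are assumptions, not part of the slides):

```python
import sympy as sp

x = sp.symbols('x')
f = x**2 + 3*x - 5

crit = sp.solve(sp.diff(f, x), x)        # stationary point(s): [-3/2]
second = sp.diff(f, x, 2)                # f''(x) = 2 > 0, so it is a minimum
print(crit, second, f.subs(x, crit[0]))  # [-3/2] 2 -29/4
```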
Example 2
Solution:
f⁽ⁿ⁾(x*) = 24 at x* = 2.
Since f⁽ⁿ⁾(x*) is positive and n is even, the point x = x* = 2 is a point of minimum, and the function attains a minimum value of 0 at this point.
Example 3
Analyze the function f(x) = 12x⁵ − 45x⁴ + 40x³ + 5 and classify its stationary points as maxima, minima, and points of inflection.
Solution:
f(x) = 12x⁵ − 45x⁴ + 40x³ + 5
f′(x) = 60x⁴ − 180x³ + 120x² = 0
⟹ x⁴ − 3x³ + 2x² = x²(x − 1)(x − 2) = 0
⟹ x = 0, 1, 2
At x* = 0:
f″(x*) = 240(x*)³ − 540(x*)² + 240x* = 0, so the second-derivative test is inconclusive. Using the next derivative,
f‴(x*) = 720(x*)² − 1080x* + 240 = 240 ≠ 0 at x* = 0.
Since the first non-zero derivative at x* = 0 is of odd order, x = 0 is a point of inflection.
Example 3 …contd.
Consider x = x* = 1:
f″(x*) = 240(x*)³ − 540(x*)² + 240x* = −60 at x* = 1
Since the second derivative is negative, the point x = x* = 1 is a point of local maximum, with a maximum value of f(1) = 12 − 45 + 40 + 5 = 12.
Consider x = x* = 2:
f″(x*) = 240(x*)³ − 540(x*)² + 240x* = 240 at x* = 2
Since the second derivative is positive, the point x = x* = 2 is a point of local minimum, with a minimum value of f(2) = 384 − 720 + 320 + 5 = −11.
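The classification of all three stationary points can be reproduced programmatically by differentiating until the first non-zero derivative is found at each point. A minimal sympy sketch (assumed for illustration; not part of the original slides):

```python
import sympy as sp

x = sp.symbols('x')
f = 12*x**5 - 45*x**4 + 40*x**3 + 5

for x_star in sp.solve(sp.diff(f, x), x):        # stationary points: 0, 1, 2
    n = 2
    val = sp.diff(f, x, n).subs(x, x_star)
    while val == 0:                              # move on to the next higher derivative
        n += 1
        val = sp.diff(f, x, n).subs(x, x_star)
    kind = "inflection" if n % 2 else ("minimum" if val > 0 else "maximum")
    print(x_star, n, val, kind)
# 0 3 240 inflection
# 1 2 -60 maximum
# 2 2 240 minimum
```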
Differentiation in Machine Learning
• Machine learning often seeks to determine the minimum
of a cost/loss function.
• As such, iterative techniques pioneered in numerical computing are widely used in various ML algorithms.
• At their core, most such methods aim to find critical points, where the derivative is 0 (see the sketch after this list).
• Differentiation tools are also used in probabilistic
parameter estimation, most notably in maximum
likelihood estimation, to identify the set of most likely
parameters for a data-generating probability distribution.
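As a small illustration of such an iterative technique, here is a minimal one-dimensional gradient-descent sketch (the loss function, starting point, and learning rate are assumptions chosen only for illustration):

```python
# Loss L(w) = (w - 3)**2 has its minimum where dL/dw = 0, i.e. at w = 3.
def dL(w):
    return 2.0 * (w - 3.0)      # derivative of the loss

w, lr = 0.0, 0.1                # initial guess and learning rate
for _ in range(100):
    w -= lr * dL(w)             # step against the derivative
print(w)                        # ≈ 3.0
```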
Partial Derivatives
In general, if f is a function of two variables x and y, suppose we let only x vary while keeping y fixed, say y = b, where b is a constant. Then we are really considering a function of a single variable x, namely g(x) = f(x, b). If g has a derivative at a, we call it the partial derivative of f with respect to x at (a, b) and denote it by fx(a, b). Thus
fx(a, b) = g′(a)   where   g(x) = f(x, b)                                     (1)
Partial Derivatives
By the definition of a derivative, we have
fx(a, b) = lim(h→0) [f(a + h, b) − f(a, b)] / h                               (2)
Partial Derivatives
Similarly, the partial derivative of f with respect to y at (a, b), denoted by fy(a, b), is obtained by keeping x fixed (x = a) and finding the ordinary derivative at b of the function G(y) = f(a, y):
fy(a, b) = lim(h→0) [f(a, b + h) − f(a, b)] / h                               (3)
Partial Derivatives
If we now let the point (a, b) vary in Equations 2 and 3, fx and fy become functions of two variables:
fx(x, y) = lim(h→0) [f(x + h, y) − f(x, y)] / h
fy(x, y) = lim(h→0) [f(x, y + h) − f(x, y)] / h
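These limit definitions suggest a simple numerical approximation: use a small but finite h. A minimal sketch (the sample function and step size h are assumptions for illustration):

```python
# Forward-difference approximations of the partial derivatives at (a, b).
def fx(f, a, b, h=1e-6):
    return (f(a + h, b) - f(a, b)) / h      # y is held fixed at b

def fy(f, a, b, h=1e-6):
    return (f(a, b + h) - f(a, b)) / h      # x is held fixed at a

f = lambda x, y: x**3 + x**2 * y**3 - 2 * y**2   # the function used in Example 1 below
print(fx(f, 2, 1), fy(f, 2, 1))                  # ≈ 16 and ≈ 8
```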
Partial Derivatives
There are many alternative notations for partial derivatives. If z = f(x, y), we write, for instance,
fx(x, y) = fx = ∂f/∂x = ∂z/∂x = Dx f,   and similarly   fy = ∂f/∂y = ∂z/∂y = Dy f.
Partial Derivatives
To compute partial derivatives, all we have to do is remember from Equation 1 that the partial derivative with respect to x is just the ordinary derivative of the function g of a single variable that we get by keeping y fixed. Thus:
1. To find fx, regard y as a constant and differentiate f(x, y) with respect to x.
2. To find fy, regard x as a constant and differentiate f(x, y) with respect to y.
Example 1
If f(x, y) = x³ + x²y³ − 2y², find fx(2, 1) and fy(2, 1).
Solution:
Holding y constant and differentiating with respect to x, we get
fx(x, y) = ∂/∂x f(x, y) = ∂/∂x (x³ + x²y³ − 2y²) = 3x² + 2xy³
and so fx(2, 1) = 3·2² + 2·2·1³ = 16.
Example 1 …contd.
If f(x, y) = x³ + x²y³ − 2y², find fx(2, 1) and fy(2, 1).
Solution:
Holding x constant and differentiating with respect to y, we get
fy(x, y) = ∂/∂y f(x, y) = ∂/∂y (x³ + x²y³ − 2y²) = 3x²y² − 4y
and so fy(2, 1) = 3·2²·1² − 4·1 = 8.
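Both values can be verified symbolically. A minimal sympy sketch (assumed; not part of the slides):

```python
import sympy as sp

x, y = sp.symbols('x y')
f = x**3 + x**2*y**3 - 2*y**2

fx = sp.diff(f, x)      # 3*x**2 + 2*x*y**3
fy = sp.diff(f, y)      # 3*x**2*y**2 - 4*y
print(fx.subs({x: 2, y: 1}), fy.subs({x: 2, y: 1}))   # 16 8
```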
Example 2
If f(x, y) = 4 − x² − 2y², find fx(1, 1) and fy(1, 1).
Solution:
We have
fx(x, y) = −2x   and   fy(x, y) = −4y
so fx(1, 1) = −2 and fy(1, 1) = −4.
Interpretations of Partial Derivatives
Partial derivatives can be interpreted as rates of change. If z = f(x, y), then ∂z/∂x represents the rate of change of z with respect to x when y is held fixed; similarly, ∂z/∂y is the rate of change of z with respect to y when x is held fixed.
Functions of More Than Two Variables
Partial derivatives can also be defined for functions of three or more variables.
For example, if f is a function of three variables x, y, and z, then its partial derivative with respect to x is defined as
fx(x, y, z) = lim(h→0) [f(x + h, y, z) − f(x, y, z)] / h
Functions of More Than Two Variables
If w = f(x, y, z), then fx = ∂w/∂x can be interpreted as the rate of change of w with respect to x when y and z are held fixed.
Example 3
Find fx, fy, and fz if f(x, y, z) = e^(xy) ln z.
Solution:
Holding y and z constant and differentiating with respect to x, we have
fx = y e^(xy) ln z
Similarly,
fy = x e^(xy) ln z   and   fz = e^(xy) / z
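A short symbolic check of all three partial derivatives (sympy usage assumed):

```python
import sympy as sp

x, y, z = sp.symbols('x y z')
f = sp.exp(x*y) * sp.log(z)

print(sp.diff(f, x))    # y*exp(x*y)*log(z)
print(sp.diff(f, y))    # x*exp(x*y)*log(z)
print(sp.diff(f, z))    # exp(x*y)/z
```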
Higher Derivatives
If f is a function of two variables, then its partial derivatives fx and fy are also functions of two variables, so we can consider their partial derivatives (fx)x, (fx)y, (fy)x, and (fy)y, which are called the second partial derivatives of f.
If z = f(x, y), we use the following notation:
(fx)x = fxx = ∂²f/∂x² = ∂²z/∂x²
(fx)y = fxy = ∂²f/∂y∂x = ∂²z/∂y∂x
(fy)x = fyx = ∂²f/∂x∂y = ∂²z/∂x∂y
(fy)y = fyy = ∂²f/∂y² = ∂²z/∂y²
Higher Derivatives
Thus the notation fxy (or ∂²f/∂y∂x) means that we first differentiate with respect to x and then with respect to y, whereas in computing fyx the order is reversed.
Example 4
Find the second partial derivatives of f(x, y) = x³ + x²y³ − 2y².
Solution:
In Example 1 we found that
fx(x, y) = 3x² + 2xy³   and   fy(x, y) = 3x²y² − 4y
Therefore
fxx = ∂/∂x (3x² + 2xy³) = 6x + 2y³
fxy = ∂/∂y (3x² + 2xy³) = 6xy²
fyx = ∂/∂x (3x²y² − 4y) = 6xy²
fyy = ∂/∂y (3x²y² − 4y) = 6x²y − 4
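The four second partial derivatives, and the equality fxy = fyx, can be confirmed symbolically. A minimal sympy sketch (assumed; not from the slides):

```python
import sympy as sp

x, y = sp.symbols('x y')
f = x**3 + x**2*y**3 - 2*y**2

print(sp.diff(f, x, x))   # 6*x + 2*y**3
print(sp.diff(f, x, y))   # 6*x*y**2   (first x, then y)
print(sp.diff(f, y, x))   # 6*x*y**2   (order reversed, same result)
print(sp.diff(f, y, y))   # 6*x**2*y - 4
```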
Higher Derivatives
Notice that fxy = fyx in Example 4. This is not just a coincidence: the mixed partial derivatives fxy and fyx are equal for most functions that one meets in practice (by Clairaut's Theorem, they agree wherever both are continuous).
Higher Derivatives
Partial derivatives of order 3 or higher can also be defined. For instance,
fxyy = (fxy)y = ∂/∂y (∂²f/∂y∂x) = ∂³f/∂y²∂x
Moving to Higher Dimensions: Multivariate Calculus
Gradient Vector
We can compute the partial derivative for all dimensions xᵢ ∈ x and collect all partial derivatives in a gradient vector:
∇f(x₁, …, xₙ) = [ ∂f/∂x₁ (x₁, …, xₙ)   ∂f/∂x₂ (x₁, …, xₙ)   ⋯   ∂f/∂xₙ (x₁, …, xₙ) ]
Example 5
Consider the function f(x, y) = x²y + 3 ln(xy). Find the partial derivatives and the gradient.
Solution
∂f/∂x = 2xy + 3/x,   ∂f/∂y = x² + 3/y
∇f(x, y) = [ 2xy + 3/x   x² + 3/y ]

Jacobian Matrix
For a vector-valued function f : ℝⁿ → ℝᵐ with component functions f₁, …, fₘ, the gradients of the components are collected row-wise in the Jacobian matrix:
J = [ ∇f₁ᵀ ]   [ ∂f₁/∂x₁   ∂f₁/∂x₂   ⋯   ∂f₁/∂xₙ ]
    [   ⋮  ] = [ ∂f₂/∂x₁   ∂f₂/∂x₂   ⋯   ∂f₂/∂xₙ ]
    [ ∇fₘᵀ ]   [    ⋮          ⋮      ⋱      ⋮    ]
               [ ∂fₘ/∂x₁   ∂fₘ/∂x₂   ⋯   ∂fₘ/∂xₙ ]
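The gradient of Example 5 can be checked symbolically (sympy usage assumed; not part of the slides):

```python
import sympy as sp

x, y = sp.symbols('x y')
f = x**2*y + 3*sp.log(x*y)

grad = [sp.diff(f, x), sp.diff(f, y)]
print(grad)    # [2*x*y + 3/x, x**2 + 3/y]
```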
Example 6
Consider the multivariate function f(x, y) = [ f₁(x, y) ; f₂(x, y) ] = [ x²y ; 5x + sin(y) ]. Find its Jacobian.
Solution
J = [ ∂f₁/∂x   ∂f₁/∂y ]   =   [ 2xy   x²     ]
    [ ∂f₂/∂x   ∂f₂/∂y ]       [ 5     cos(y) ]
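The same Jacobian can be obtained with sympy's Matrix.jacobian (usage assumed; not part of the slides):

```python
import sympy as sp

x, y = sp.symbols('x y')
F = sp.Matrix([x**2*y, 5*x + sp.sin(y)])    # the vector-valued function of Example 6

print(F.jacobian([x, y]))   # Matrix([[2*x*y, x**2], [5, cos(y)]])
```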
Example 7
Consider the function f(x, y, z) = [ u(x, y, z) ; v(x, y, z) ; w(x, y, z) ], where u, v, and w are the given component functions. Find its Jacobian and evaluate it at (0, 0, 0).
Example 7 - Solution
f(x, y, z) = [ u(x, y, z) ; v(x, y, z) ; w(x, y, z) ]
J = [ ∂u/∂x   ∂u/∂y   ∂u/∂z ]
    [ ∂v/∂x   ∂v/∂y   ∂v/∂z ]
    [ ∂w/∂x   ∂w/∂y   ∂w/∂z ]
  = [ 18xy² + z e^x            18x²y                 e^x               ]
    [ y + 2xy³                 x + 3x²y²             2                 ]
    [ −sin(x) sin(z) e^y       cos(x) sin(z) e^y     cos(x) cos(z) e^y ]
Example 7 - Solution
At (0, 0, 0), we get:
J(0, 0, 0) = [ 18·0·0² + 0·e⁰          18·0²·0               e⁰               ]
             [ 0 + 2·0·0³              0 + 3·0²·0²           2                ]
             [ −sin(0) sin(0) e⁰       cos(0) sin(0) e⁰      cos(0) cos(0) e⁰ ]
           = [ 0  0  1 ]
             [ 0  0  2 ]
             [ 0  0  1 ]
Gradient and Jacobian in ML
• Both the gradient and the Jacobian are bread-and-butter tools of classical numerical approximation algorithms and of machine learning.
• In particular, they are core ingredients of gradient descent, an algorithm that iteratively searches for a local minimum of a differentiable function (a minimal sketch follows this list).
• For instance, neural network learning relies on gradient descent via the backpropagation algorithm.
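A minimal two-variable gradient-descent sketch (the quadratic objective, starting point, and step size are assumptions chosen only for illustration):

```python
import numpy as np

# Objective f(x, y) = (x - 1)**2 + 2*(y + 2)**2, minimised at (1, -2).
def grad(p):
    x, y = p
    return np.array([2*(x - 1), 4*(y + 2)])   # gradient vector of f

p, lr = np.array([0.0, 0.0]), 0.1
for _ in range(200):
    p = p - lr * grad(p)                       # step along the negative gradient
print(p)                                       # ≈ [ 1. -2.]
```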
Hessian Matrix
For a twice-differentiable scalar-valued function f : ℝⁿ → ℝ, the Hessian matrix H is defined as the matrix containing all second-order partial derivatives:
H = [ ∂²f/∂x₁²       ∂²f/∂x₁∂x₂    ⋯   ∂²f/∂x₁∂xₙ ]
    [ ∂²f/∂x₂∂x₁     ∂²f/∂x₂²      ⋯   ∂²f/∂x₂∂xₙ ]
    [     ⋮               ⋮         ⋱       ⋮      ]
    [ ∂²f/∂xₙ∂x₁     ∂²f/∂xₙ∂x₂    ⋯   ∂²f/∂xₙ²   ]
Example 8
Solution:
The Hessian is computed as follows:
H = [ ∂²f/∂x² (x, y)     ∂²f/∂x∂y (x, y) ]   =   [ 6x    −2    ]
    [ ∂²f/∂y∂x (x, y)    ∂²f/∂y² (x, y)  ]       [ −2   −30y⁴ ]
Example 9
Consider the function f(x, y) = e^x cos(y). Compute the Hessian.
Solution:
The Hessian is computed as follows:
H = [ ∂²f/∂x² (x, y)     ∂²f/∂x∂y (x, y) ]   =   [ e^x cos(y)    −e^x sin(y) ]
    [ ∂²f/∂y∂x (x, y)    ∂²f/∂y² (x, y)  ]       [ −e^x sin(y)   −e^x cos(y) ]
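The same Hessian can be reproduced with sympy's hessian helper (usage assumed; not part of the slides):

```python
import sympy as sp

x, y = sp.symbols('x y')
f = sp.exp(x) * sp.cos(y)

print(sp.hessian(f, (x, y)))
# Matrix([[exp(x)*cos(y), -exp(x)*sin(y)], [-exp(x)*sin(y), -exp(x)*cos(y)]])
```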
Hessian in Machine Learning
• The Hessian matrix and its approximations are frequently used to assess the curvature of loss landscapes in neural networks.
• Similar to the univariate second-order derivative of a function, the Hessian enables us to identify saddle points and directions of higher and lower curvature.
• In particular, the ratio between the largest and the smallest eigenvalue of H, i.e. λmax / λmin, defines the condition number (a small numerical sketch follows this list).
• A large condition number implies slower convergence, while a condition number of 1 enables gradient descent to converge quickly in all curvature directions.
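A small numerical sketch of the condition number of a symmetric Hessian (the example matrix values are assumptions chosen only for illustration):

```python
import numpy as np

H = np.array([[4.0, 1.0],
              [1.0, 3.0]])                # an assumed symmetric Hessian

eig = np.linalg.eigvalsh(H)               # eigenvalues in ascending order
kappa = eig[-1] / eig[0]                  # condition number = lambda_max / lambda_min
print(eig, kappa)                         # ≈ [2.38 4.62], kappa ≈ 1.94
```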
References
Chapter 14: Partial Derivatives, Cengage Learning.