
Om Namo Bhagavate Vasudevaya

23ECE216
Machine Learning

Partial Derivatives

Dr. Binoy B Nair


1
Mathematical Preliminaries
Covered in previous semesters
Requirements for ML:
• Linear Algebra: matrix multiplication, matrix inversion, eigenvalues and eigenvectors
• Probability and Statistics: random variables, distributions
• Information Theory: entropy, mutual information, KL divergence
• Mathematical Optimization: formulation, optimality conditions, numerical optimization methods

2
Identifying Critical Points
• Differentiation allows for the identification of critical points
such as minima, maxima, and inflection points.
• Critical points are defined as points where the derivative
is 0 (or non-existent).

• Considering the function f(x), a critical point is represented as a point xᵢ where
  f′(xᵢ) = 0.
• The maximum is at point xᵢ when f″(xᵢ) < 0.
• The minimum is at point xᵢ when f″(xᵢ) > 0.
• When f″(xᵢ) = 0, the second-derivative test is inconclusive: xᵢ may be a point of
  inflection, and higher-order derivatives must be examined (see Example 2).

3
Stationary points

Figure showing the three types of stationary points


(a) inflection point (b) minimum (c) maximum

4
Relative(local) and Global Optimum
…contd.

• A1, A2, A3 = Relative maxima
• A2 = Global maximum (A2 is also the global optimum)
• B1, B2 = Relative minima
• B1 = Global minimum

Figure: a function f(x) plotted on the interval [a, b], with the relative maxima
A1, A2, A3 and the relative minima B1, B2 marked on the curve.

5
Example 1
Find the optimum value of the function f(x) = x² + 3x − 5 and also state
whether the function attains a maximum or a minimum.

Solution:
For a maximum or minimum, set f′(x) = 2x + 3 = 0 to get x* = −3/2.

Also, f″(x) = 2, which is positive. Hence the point x* = −3/2 is a point of
minimum, and the function attains a minimum value of −29/4 at this point.
6
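A quick symbolic check of this result is possible. The sketch below is an illustration only (it assumes SymPy is available; it is not part of the original slides): it solves f′(x) = 0 and applies the second-derivative test.

```python
import sympy as sp

x = sp.symbols('x')
f = x**2 + 3*x - 5

# Critical points: solve f'(x) = 0
for xc in sp.solve(sp.diff(f, x), x):          # [-3/2]
    second = sp.diff(f, x, 2).subs(x, xc)      # f''(x*) = 2
    kind = 'minimum' if second > 0 else ('maximum' if second < 0 else 'inconclusive')
    print(xc, kind, f.subs(x, xc))             # -3/2 minimum -29/4
```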
Example 2
Find the optimum value of the function f(x) = (x − 2)⁴ and also state whether the
function attains a maximum or a minimum.

Solution:
f′(x) = 4(x − 2)³ = 0, giving x = x* = 2 as the candidate for a maximum or minimum.

f″(x*) = 12(x* − 2)² = 0 at x* = 2

f‴(x*) = 24(x* − 2) = 0 at x* = 2

f⁽⁴⁾(x*) = 24 at x* = 2

Hence the first non-zero derivative f⁽ⁿ⁾(x*) = 24 is positive and n = 4 is even, so the
point x = x* = 2 is a point of minimum and the function attains a minimum value of 0
at this point.

7
Example 3
Analyze the function f(x) = 12x⁵ − 45x⁴ + 40x³ + 5 and classify the stationary
points as maxima, minima and points of inflection.
Solution:
f(x) = 12x⁵ − 45x⁴ + 40x³ + 5
f′(x) = 60x⁴ − 180x³ + 120x² = 0
⟹ x⁴ − 3x³ + 2x² = x²(x − 1)(x − 2) = 0
or x = 0, 1, 2

Consider the point x = x* = 0:
f″(x*) = 240(x*)³ − 540(x*)² + 240x* = 0 at x* = 0
f‴(x*) = 720(x*)² − 1080x* + 240 = 240 at x* = 0

8
Example 3 …contd.

Since the third derivative is non-zero, x = x* = 0 is neither a point of maximum nor
a point of minimum; it is a point of inflection.

Consider x = x* = 1:
f″(x*) = 240(x*)³ − 540(x*)² + 240x* = −60 at x* = 1

Since the second derivative is negative, the point x = x* = 1 is a point of local maximum
with a maximum value of f(1) = 12 − 45 + 40 + 5 = 12.

Consider x = x* = 2:
f″(x*) = 240(x*)³ − 540(x*)² + 240x* = 240 at x* = 2

Since the second derivative is positive, the point x = x* = 2 is a point of local minimum
with a minimum value of f(2) = −11.

9
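The higher-derivative test used in Examples 2 and 3 can be automated. The sketch below is illustrative only (SymPy assumed): for each stationary point it keeps differentiating until it reaches the first non-zero derivative, then classifies the point by whether that order is odd or even.

```python
import sympy as sp

x = sp.symbols('x')
f = 12*x**5 - 45*x**4 + 40*x**3 + 5

for xc in sp.solve(sp.diff(f, x), x):           # stationary points: 0, 1, 2
    n, d = 2, sp.diff(f, x, 2)
    while d.subs(x, xc) == 0:                   # first non-zero derivative (terminates for this polynomial)
        n, d = n + 1, sp.diff(d, x)
    if n % 2 == 1:
        kind = 'point of inflection'
    else:
        kind = 'local minimum' if d.subs(x, xc) > 0 else 'local maximum'
    print(xc, kind, f.subs(x, xc))
# 0 point of inflection 5
# 1 local maximum 12
# 2 local minimum -11
```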
Differentiation in Machine Learning
• Machine learning often seeks to determine the minimum
of a cost/loss function.
• As such, iterative techniques pioneered in numerical
computing find widespread usage in various ML
algorithms.
• At their core, most such methods aim to find critical
points where the derivative is 0.
• Differentiation tools are also used in probabilistic
parameter estimation, most notably in maximum
likelihood estimation, to identify the set of most likely
parameters for a data-generating probability distribution.

10
Partial Derivatives
In general, if f is a function of two variables x and y,
suppose we let only x vary while keeping y fixed, say y = b,
where b is a constant.

Then we are really considering a function of a single
variable x, namely, g(x) = f(x, b). If g has a derivative at a,
then we call it the partial derivative of f with respect to x
at (a, b) and denote it by fx(a, b). Thus

fx(a, b) = g′(a)   where g(x) = f(x, b)        (1)

11
Partial Derivatives
By the definition of a derivative, we have

g′(a) = lim_{h→0} [g(a + h) − g(a)] / h

and so Equation 1 becomes

fx(a, b) = lim_{h→0} [f(a + h, b) − f(a, b)] / h        (2)

12
Partial Derivatives
Similarly, the partial derivative of f with respect to y at
(a, b), denoted by fy(a, b), is obtained by keeping x fixed
(x = a) and finding the ordinary derivative at b of the
function G(y) = f(a, y):

fy(a, b) = lim_{h→0} [f(a, b + h) − f(a, b)] / h        (3)

13
Partial Derivatives
If we now let the point (a, b) vary in Equations 2 and 3,
fx and fy become functions of two variables:

fx(x, y) = lim_{h→0} [f(x + h, y) − f(x, y)] / h
fy(x, y) = lim_{h→0} [f(x, y + h) − f(x, y)] / h

14
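These limit definitions translate directly into a numerical approximation: for a small step h, the difference quotient approximates the partial derivative. A minimal sketch (illustrative; it uses the function from Example 1 below and a forward difference with an arbitrarily chosen h):

```python
def partial_x(f, x, y, h=1e-6):
    """Forward-difference approximation of f_x(x, y)."""
    return (f(x + h, y) - f(x, y)) / h

def partial_y(f, x, y, h=1e-6):
    """Forward-difference approximation of f_y(x, y)."""
    return (f(x, y + h) - f(x, y)) / h

f = lambda x, y: x**3 + x**2 * y**3 - 2 * y**2   # function of Example 1 below
print(partial_x(f, 2, 1))   # ~16 (exact value: 3x^2 + 2xy^3 = 16)
print(partial_y(f, 2, 1))   # ~8  (exact value: 3x^2*y^2 - 4y = 8)
```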
Partial Derivatives
There are many alternative notations for partial derivatives.

For instance, instead of fx we can write f1 or D1f (to indicate
differentiation with respect to the first variable) or ∂f/∂x.

But here ∂f/∂x can’t be interpreted as a ratio of differentials.

15
Partial Derivatives
To compute partial derivatives, all we have to do is
remember from Equation 1 that the partial derivative with
respect to x is just the ordinary derivative of the function g
of a single variable that we get by keeping y fixed.

Thus we have the following rule.

Rule for finding partial derivatives of z = f(x, y):
1. To find fx, regard y as a constant and differentiate f(x, y) with respect to x.
2. To find fy, regard x as a constant and differentiate f(x, y) with respect to y.

16
Example 1
If f(x, y) = x³ + x²y³ − 2y², find fx(2, 1) and fy(2, 1).

Solution:
Holding y constant and differentiating with respect to x,
we get

fx(x, y) = ∂/∂x [x³ + x²y³ − 2y²] = 3x² + 2xy³

and so fx(2, 1) = 3·2² + 2·2·1³ = 16

17
Example 1
If f(x, y) = x³ + x²y³ − 2y², find fx(2, 1) and fy(2, 1).

Solution:
Holding x constant and differentiating with respect to y,
we get

fy(x, y) = ∂/∂y [x³ + x²y³ − 2y²] = 3x²y² − 4y

and so fy(2, 1) = 3·2²·1² − 4·1 = 8

18
Example 2
If f(x, y) = 4 − x² − 2y², find fx(1, 1) and fy(1, 1).

Solution:
We have
fx(x, y) = −2x        fy(x, y) = −4y

fx(1, 1) = −2         fy(1, 1) = −4

19
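Both examples can be verified symbolically. A short sketch (illustration only, assuming SymPy):

```python
import sympy as sp

x, y = sp.symbols('x y')

f1 = x**3 + x**2*y**3 - 2*y**2                 # Example 1
print(sp.diff(f1, x).subs({x: 2, y: 1}))       # 16
print(sp.diff(f1, y).subs({x: 2, y: 1}))       # 8

f2 = 4 - x**2 - 2*y**2                         # Example 2
print(sp.diff(f2, x).subs({x: 1, y: 1}))       # -2
print(sp.diff(f2, y).subs({x: 1, y: 1}))       # -4
```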
Interpretations of Partial Derivatives

20
Interpretations of Partial Derivatives
Partial derivatives can be interpreted as rates of change.

If z = f(x, y), then ∂z/∂x represents the rate of change of z
with respect to x when y is fixed.

Similarly, ∂z/∂y represents the rate of change of z with
respect to y when x is fixed.

21
Functions of More Than Two Variables

22
Functions of More Than Two Variables
Partial derivatives can also be defined for functions of three
or more variables.
For example, if f is a function of three variables x, y, and z,
then its partial derivative with respect to x is defined as

fx(x, y, z) = lim_{h→0} [f(x + h, y, z) − f(x, y, z)] / h

and it is found by regarding y and z as constants and
differentiating f(x, y, z) with respect to x.

23
Functions of More Than Two Variables
If w = f(x, y, z), then fx = ∂w/∂x can be interpreted as the
rate of change of w with respect to x when y and z are held
fixed.

In general, if u is a function of n variables,
u = f(x1, x2, …, xn), its partial derivative with respect to the
i-th variable xi is

∂u/∂xi = lim_{h→0} [f(x1, …, xi−1, xi + h, xi+1, …, xn) − f(x1, …, xi, …, xn)] / h

and we also write

∂u/∂xi = ∂f/∂xi = fxi = fi = Di f

24
Example 3
Find fx, fy, and fz if f(x, y, z) = e^(xy) ln z.

Solution:
Holding y and z constant and differentiating with respect
to x, we have

fx = y e^(xy) ln z

Similarly,

fy = x e^(xy) ln z   and   fz = e^(xy)/z

25
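The three-variable case works the same way in code. A small sketch (illustrative, assuming SymPy; the positive-symbol assumption simply keeps ln z well defined):

```python
import sympy as sp

x, y, z = sp.symbols('x y z', positive=True)
f = sp.exp(x*y) * sp.log(z)

print(sp.diff(f, x))   # y*exp(x*y)*log(z)
print(sp.diff(f, y))   # x*exp(x*y)*log(z)
print(sp.diff(f, z))   # exp(x*y)/z
```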
Higher Derivatives

26
Higher Derivatives
If f is a function of two variables, then its partial derivatives
fx and fy are also functions of two variables, so we can
consider their partial derivatives (fx)x, (fx)y, (fy)x, and (fy)y,
which are called the second partial derivatives of f.
If z = f(x, y), we use the following notation:

(fx)x = fxx = ∂/∂x (∂f/∂x) = ∂²f/∂x²        (fx)y = fxy = ∂/∂y (∂f/∂x) = ∂²f/∂y∂x
(fy)x = fyx = ∂/∂x (∂f/∂y) = ∂²f/∂x∂y       (fy)y = fyy = ∂/∂y (∂f/∂y) = ∂²f/∂y²

27
Higher Derivatives

Thus the notation fxy (or ∂²f/∂y∂x) means that we first
differentiate with respect to x and then with respect to y,
whereas in computing fyx the order is reversed.

28
Example 4
Find the second partial derivatives of
f(x, y) = x³ + x²y³ − 2y²

Solution:
In Example 1 we found that
fx(x, y) = 3x² + 2xy³
fy(x, y) = 3x²y² − 4y

Therefore

fxx = ∂/∂x (3x² + 2xy³) = 6x + 2y³
29
Example 4 – Solution cont’d

fxy = ∂/∂y (3x² + 2xy³) = 6xy²

fyx = ∂/∂x (3x²y² − 4y) = 6xy²

fyy = ∂/∂y (3x²y² − 4y) = 6x²y − 4
30
Higher Derivatives
Notice that fxy = fyx in Example 4. This is not just a
coincidence.

It turns out that the mixed partial derivatives fxy and fyx are
equal for most functions that one meets in practice.

The following theorem, which was discovered by the
French mathematician Alexis Clairaut (1713–1765), gives
conditions under which we can assert that fxy = fyx.

Clairaut’s Theorem: Suppose f is defined on a disk D that contains the point (a, b).
If the functions fxy and fyx are both continuous on D, then fxy(a, b) = fyx(a, b).

31
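Clairaut’s Theorem is easy to check for the function of Example 4. A short sketch (illustration only, assuming SymPy):

```python
import sympy as sp

x, y = sp.symbols('x y')
f = x**3 + x**2*y**3 - 2*y**2

f_xy = sp.diff(f, x, y)   # differentiate w.r.t. x, then y
f_yx = sp.diff(f, y, x)   # differentiate w.r.t. y, then x
print(f_xy, f_yx)                     # 6*x*y**2  6*x*y**2
print(sp.simplify(f_xy - f_yx) == 0)  # True
```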
Higher Derivatives
Partial derivatives of order 3 or higher can also be defined.
For instance,

fxyy = (fxy)y = ∂/∂y (∂²f/∂y∂x) = ∂³f/∂y²∂x

and using Clairaut’s Theorem it can be shown that
fxyy = fyxy = fyyx if these functions are continuous.

32
Moving to Higher Dimensions
Multivariate Calculus

33
Multivariate Calculus

We can generalize the concepts from univariate
differentiation to higher dimensions by studying
multivariate (or multivariable) differentiation.

34
Gradient Vector
We can compute the partial derivative for all dimensions
xi ∈ x and collect all partial derivatives in a gradient vector:

∇f(x1, …, xn) = [ ∂f/∂x1 (x1, …, xn),  ∂f/∂x2 (x1, …, xn),  …,  ∂f/∂xn (x1, …, xn) ]

35
Example 5
Consider the function f(x, y) = x²y + 3 ln(xy). Find the
partial derivatives and the gradient.

Solution:
The partial derivatives and the gradient are computed as
follows:

∂f/∂x = ∂/∂x [x²y + 3 ln(xy)] = 2xy + 3·y/(xy) = 2xy + 3/x
∂f/∂y = ∂/∂y [x²y + 3 ln(xy)] = x² + 3·x/(xy) = x² + 3/y

∇f(x, y) = [ ∂f/∂x,  ∂f/∂y ] = [ 2xy + 3/x,  x² + 3/y ]

36
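The gradient of Example 5 can be assembled programmatically. A minimal sketch (illustration only, assuming SymPy; the evaluation point (1, 2) is an arbitrary choice for a numeric check):

```python
import sympy as sp

x, y = sp.symbols('x y', positive=True)
f = x**2*y + 3*sp.log(x*y)

grad_f = sp.Matrix([sp.diff(f, x), sp.diff(f, y)]).applyfunc(sp.simplify)
print(grad_f.T)                     # Matrix([[2*x*y + 3/x, x**2 + 3/y]])

print(grad_f.subs({x: 1, y: 2}).T)  # Matrix([[7, 5/2]])
```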
Jacobian Matrix
For a differentiable vector-valued function f: ℝⁿ → ℝᵐ, the
Jacobian is defined as:

J = [ ∇f1ᵀ; …; ∇fmᵀ ]

  = [ ∂f1/∂x1   ∂f1/∂x2   ⋯   ∂f1/∂xn ]
    [ ∂f2/∂x1   ∂f2/∂x2   ⋯   ∂f2/∂xn ]
    [    ⋮          ⋮       ⋱      ⋮   ]
    [ ∂fm/∂x1   ∂fm/∂x2   ⋯   ∂fm/∂xn ]

37
Example 6
Consider the multivariate function

f(x, y) = [ f1(x, y) ]  =  [ x²y         ]
          [ f2(x, y) ]     [ 5x + sin(y) ]

Find its Jacobian.

Solution:

J = [ ∂f1/∂x   ∂f1/∂y ]  =  [ 2xy   x²     ]
    [ ∂f2/∂x   ∂f2/∂y ]     [ 5     cos(y) ]

38
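SymPy can produce the same Jacobian directly from the vector-valued function via Matrix.jacobian. A sketch for illustration:

```python
import sympy as sp

x, y = sp.symbols('x y')
f = sp.Matrix([x**2*y, 5*x + sp.sin(y)])

J = f.jacobian([x, y])
print(J)   # Matrix([[2*x*y, x**2], [5, cos(y)]])
```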
Example 7
Consider the function

f(x, y, z) = [ u(x, y, z) ]
             [ v(x, y, z) ]
             [ w(x, y, z) ]

where u(x, y, z) = 9x²y² + z·eˣ, v(x, y, z) = xy + x²y³ + 2z and
w(x, y, z) = cos(x)·sin(z)·eʸ.

Find the Jacobian matrix and evaluate it at (0, 0, 0).

39
Example 7 - Solution
f(x, y, z) = [ u(x, y, z) ]
             [ v(x, y, z) ]
             [ w(x, y, z) ]

J = [ ∂u/∂x   ∂u/∂y   ∂u/∂z ]
    [ ∂v/∂x   ∂v/∂y   ∂v/∂z ]
    [ ∂w/∂x   ∂w/∂y   ∂w/∂z ]

  = [ 18xy² + z·eˣ         18x²y              eˣ               ]
    [ y + 2xy³             x + 3x²y²          2                ]
    [ −sin(x)·sin(z)·eʸ    cos(x)·sin(z)·eʸ   cos(x)·cos(z)·eʸ ]
40
Example 7 - Solution
At (0,0,0), we get:

J(0, 0, 0) = [ 18·0·0² + 0·e⁰     18·0²·0            e⁰               ]
             [ 0 + 2·0·0³         0 + 3·0²·0²        2                ]
             [ −sin(0)·sin(0)·e⁰  cos(0)·sin(0)·e⁰   cos(0)·cos(0)·e⁰ ]

           = [ 0  0  1 ]
             [ 0  0  2 ]
             [ 0  0  1 ]

41
Gradient and Jacobian in ML
• Both the gradient and the Jacobian are part of the bread
and butter of classical numerical approximation algorithms
and of machine learning.
• In particular, they are the core ingredients of an
algorithm called gradient descent which enables us to
iteratively search for a local minimum of a differentiable
function.
• For instance, neural network learning relies on gradient
descent via the backpropagation algorithm.

42
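As a toy illustration of gradient descent (not from the slides; the objective f(x, y) = x² + 2y², the step size and the starting point are arbitrary choices), the loop below repeatedly steps against the gradient until the update becomes negligible:

```python
import numpy as np

def grad(p):
    """Gradient of f(x, y) = x^2 + 2*y^2, a simple convex bowl."""
    x, y = p
    return np.array([2.0 * x, 4.0 * y])

p = np.array([3.0, -2.0])        # arbitrary starting point
lr = 0.1                         # learning rate (step size)
for _ in range(1000):
    step = lr * grad(p)
    p = p - step
    if np.linalg.norm(step) < 1e-10:
        break
print(p)                         # close to the minimizer (0, 0)
```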
Hessian Matrix
For a twice-differentiable scalar-valued function f: ℝⁿ → ℝ,
the Hessian matrix H is defined as the matrix containing all
combinations of second-order partial derivatives:

H = [ ∂²f/∂x1²      ∂²f/∂x1∂x2    ⋯   ∂²f/∂x1∂xn ]
    [ ∂²f/∂x2∂x1    ∂²f/∂x2²      ⋯   ∂²f/∂x2∂xn ]
    [     ⋮              ⋮         ⋱       ⋮      ]
    [ ∂²f/∂xn∂x1    ∂²f/∂xn∂x2    ⋯   ∂²f/∂xn²   ]

Note that ∂xi² = ∂xi∂xi.
43

Example 8
Consider the function f(x, y) = x³ − 2xy − y⁶. Compute the
Hessian.

Solution:
The Hessian is computed as follows:

H = [ ∂²f/∂x²    ∂²f/∂x∂y ]  =  [ 6x    −2    ]
    [ ∂²f/∂y∂x   ∂²f/∂y²  ]     [ −2    −30y⁴ ]

44
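SymPy ships a hessian helper that builds this matrix automatically; the sketch below (illustration only) reproduces the result of Example 8:

```python
import sympy as sp

x, y = sp.symbols('x y')
f = x**3 - 2*x*y - y**6

H = sp.hessian(f, (x, y))
print(H)   # Matrix([[6*x, -2], [-2, -30*y**4]])
```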
Example 9
Consider the function f(x, y) = eˣ cos(y). Compute the
Hessian.

Solution:
The Hessian is computed as follows:

H = [ ∂²f/∂x²    ∂²f/∂x∂y ]  =  [ eˣ cos(y)    −eˣ sin(y) ]
    [ ∂²f/∂y∂x   ∂²f/∂y²  ]     [ −eˣ sin(y)   −eˣ cos(y) ]

45
Hessian in Machine Learning
• The Hessian matrix and its approximations are
frequently used to assess the curvature of loss
landscapes in neural networks.
• Similar to the univariate second-order derivative of a
function, the Hessian enables us to identify saddle points
and directions of higher and lower curvature.
• In particular, the ratio between the largest and the smallest
  eigenvalue of H, i.e. λmax/λmin, defines the condition number.
• A large condition number implies slower convergence,
  while a condition number of 1 enables gradient descent
  to converge quickly in all curvature directions.
46
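As a small illustration of the condition number (the Hessian below comes from the quadratic f(x, y) = x² + 10y², an arbitrary example not taken from the slides), one can inspect the eigenvalues numerically:

```python
import numpy as np

# Hessian of f(x, y) = x^2 + 10*y^2 (constant, since f is quadratic)
H = np.array([[2.0,  0.0],
              [0.0, 20.0]])

eigvals = np.linalg.eigvalsh(H)           # eigenvalues of the symmetric matrix H
cond = eigvals.max() / eigvals.min()      # lambda_max / lambda_min
print(eigvals, cond)                      # [ 2. 20.] 10.0
```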
References
Chapter 14 – Partial Derivatives, Cengage Learning.

Stephan Rabanser, Tutorial: Multivariate Differentiation, ECE421 – Introduction To Machine Learning (Fall 2022), University of Toronto & Vector Institute for AI.

47
