
Introduction to Gradient Descent
Gradient descent is one of the most common optimization algorithms in
machine learning. Understanding its basic implementation is fundamental to
understanding all the advanced optimization algorithms built on top of it.

Optimization Algorithms

In machine learning, optimization is the process of finding the ideal parameters,
or weights, that maximize or minimize a cost or loss function. The global
maximum is the largest value on the domain of the function, whereas the global
minimum is the smallest. While a function has only one global maximum and one
global minimum value, it can have many local maxima and minima. The global
minimum or maximum of a cost function indicates where a model’s parameters
generate predictions that are close to the actual targets. Local maxima and
minima can cause problems when training a model, so their presence should
always be considered.
There are a few major algorithm groups within this category: bracketing, local
descent, first-order, and second-order. The focus of this article will be first-
order algorithms that use the first derivative for optimization. Within this
category, the gradient descent algorithm is the most popular.

Gradient Descent in One Dimension


Gradient descent is a first-order, iterative optimization algorithm used to
minimize a cost function. By using partial derivatives, a direction, and a learning
rate, gradient descent decreases the error, or difference, between the
predicted and actual values.

The idea behind gradient descent is that the derivative of each weight reveals
its direction and influence on the cost function. Consider the cost function
f(w) = w², a parabola with its minimum at (0, 0). If the current weight is -5.6,
the current loss is f(-5.6) = 31.36, and the derivative, or current rate of change
for the weight, is f′(-5.6) = -11.2. The negative derivative indicates the weight
needs to move “downhill” — or become more positive — to reach a loss of 0.
This is where gradient descent comes in.




By scaling the gradient with a value known as the learning rate and subtracting
the scaled gradient from the weight’s current value, the cost function’s output
decreases. In the example that follows, ten iterations (w₀ to w₉) with a learning
rate of 0.1 are used to minimize the cost function.



In the steps for the algorithm below, a weight is represented by w, with the
subscript j representing its current value and j+1 representing its new value.
The cost function used to measure the error is represented by f, the partial
derivative ∂f/∂wⱼ is the gradient of the cost function with respect to the
parameter, and the learning rate is represented by α.

1. Select a learning rate and the number of iterations.

2. Choose random values for the parameters.

3. Update the parameters with the update equation:

   wⱼ₊₁ = wⱼ − α · ∂f/∂wⱼ

4. Repeat step three until the max number of iterations is reached.
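The steps above can be sketched in a few lines of Python; the function and values here (f(w) = w² with derivative 2w, a learning rate of 0.1, and a starting weight of -5.6) come from the worked example in this article.

```python
# Minimal gradient descent loop: repeatedly subtract the scaled gradient.
def gradient_descent(w, grad, lr, iterations):
    for _ in range(iterations):
        w = w - lr * grad(w)  # w_{j+1} = w_j - alpha * df/dw_j
    return w

# f(w) = w^2, so df/dw = 2w; start at w = -5.6 with a learning rate of 0.1.
w = gradient_descent(w=-5.6, grad=lambda w: 2 * w, lr=0.1, iterations=10)
print(round(w, 4))  # -0.6013, much closer to the minimum at w = 0
```

Each update here multiplies the weight by 0.8, which is why the weight shrinks toward zero geometrically.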

When taking the partial derivative, or gradient, of a function, only one
parameter can be assessed at a time; the other parameters are treated as
constants. For the example above, f(w) = w², there is only one parameter, so
the derivative is f′(w) = 2w. The formula for updating the parameter follows:

wⱼ₊₁ = wⱼ − α · 2wⱼ



Using a learning rate of 0.1 and a starting weight of -5.6, the first ten iterations
follow:

j    wⱼ         2wⱼ        f(wⱼ)
0    -5.6000   -11.2000   31.3600
1    -4.4800    -8.9600   20.0704
2    -3.5840    -7.1680   12.8451
3    -2.8672    -5.7344    8.2208
4    -2.2938    -4.5875    5.2613
5    -1.8350    -3.6700    3.3673
6    -1.4680    -2.9360    2.1550
7    -1.1744    -2.3488    1.3792
8    -0.9395    -1.8790    0.8827
9    -0.7516    -1.5032    0.5649

The table demonstrates how each component of the formula helps minimize
the loss. By subtracting the scaled gradient, the new weight becomes more
positive, and the slope at the new weight is less steep. As the gradient
approaches zero, each iteration yields a smaller update.

This basic implementation of gradient descent can be applied to almost any
cost function, including those with numerous weights. The next sections
consider a few variations of the mean squared error.

Gradient Descent with the Mean Squared Error (MSE)

What is the MSE?
What is the MSE?
A popular cost function for machine learning is the Mean Squared Error (MSE).



This function finds the difference between the model’s prediction (Ŷ) and
the expected output (Y), then squares the difference to ensure the output is
always positive. Because of the squaring, either Ŷ or Y can come first when
calculating the difference. This is repeated across a set of points of size n.
By summing the squared differences of all these points and dividing by n, the
output is the mean squared difference, or error. It is an easy way of assessing
the model’s performance on all the points simultaneously:

MSE = (1/n) Σᵢ₌₁ⁿ (Ŷᵢ − Yᵢ)²
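A few lines of Python make the formula concrete; the prediction and target values below are made up for illustration.

```python
# Mean Squared Error: the average of the squared prediction-target gaps.
def mse(y_hat, y):
    n = len(y)
    return sum((p - t) ** 2 for p, t in zip(y_hat, y)) / n

error = mse([2.5, 0.0, 2.0], [3.0, -0.5, 2.0])
print(error)  # squared gaps 0.25, 0.25, 0.0 average to 0.1666...
```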

In this formula, Ŷ represents a model’s prediction. In regression, the model’s
equation may contain one or more weights depending on the requirements of
the training data. The table below reflects these situations.



Weights   Model’s equation
1         Ŷ = wX
2         Ŷ = w₀ + w₁X
3         Ŷ = w₀ + w₁X₁ + w₂X₂
Now, to perform gradient descent with any of these equations, their gradients
must be calculated. The gradient is the vector containing the partial derivatives
of a function with respect to each of its parameters:

∇MSE = (∂MSE/∂w₀, ∂MSE/∂w₁, …, ∂MSE/∂wₖ)

Each weight’s partial derivative has to be calculated. A partial derivative is
calculated in the same manner as a normal derivative, but every variable that is
not being considered must be treated as a constant. The gradients for the MSE
variations listed above can be examined below.

One Weight

When taking the gradient of the MSE with only one weight, the derivative is
calculated with respect to w, treating X, Y, and n as constants. With this in
mind, the fraction and sum can be moved outside of the derivative:

∂MSE/∂w = (1/n) Σᵢ ∂/∂w (wXᵢ − Yᵢ)²

From here, the chain rule can be used to calculate the derivative with respect
to w:

∂MSE/∂w = (1/n) Σᵢ 2(wXᵢ − Yᵢ) · Xᵢ

Now, this can be simplified:

∂MSE/∂w = (2/n) Σᵢ (wXᵢ − Yᵢ)Xᵢ
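One quick way to sanity-check the simplified one-weight gradient is to compare it against a numerical estimate of the derivative; the data points and the weight value below are made up for illustration.

```python
# Analytic one-weight MSE gradient vs. a central-difference estimate.
X = [1.0, 2.0, 3.0]
Y = [2.1, 3.9, 6.2]
n = len(X)

def mse(w):
    return sum((w * x - y) ** 2 for x, y in zip(X, Y)) / n

def grad(w):  # (2/n) * sum((w*X_i - Y_i) * X_i)
    return 2 / n * sum((w * x - y) * x for x, y in zip(X, Y))

w, h = 1.5, 1e-6
numeric = (mse(w + h) - mse(w - h)) / (2 * h)
print(abs(grad(w) - numeric) < 1e-5)  # True: the two estimates agree
```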

Two Weights

When taking the gradient of the MSE with two weights, the partial derivatives
must be taken with respect to both parameters, w₀ and w₁. When taking the
partial derivative with respect to w₀, the variables X, Y, n, and w₁ are treated
as constants; with respect to w₁, the constants are X, Y, n, and w₀. The same
steps as the previous example can be repeated. First, the fraction and sum can
be moved outside the derivative:

∂MSE/∂w₀ = (1/n) Σᵢ ∂/∂w₀ (w₀ + w₁Xᵢ − Yᵢ)²
∂MSE/∂w₁ = (1/n) Σᵢ ∂/∂w₁ (w₀ + w₁Xᵢ − Yᵢ)²

From here, the chain rule can be used to calculate the derivative with respect to
each weight:

∂MSE/∂w₀ = (1/n) Σᵢ 2(w₀ + w₁Xᵢ − Yᵢ) · 1
∂MSE/∂w₁ = (1/n) Σᵢ 2(w₀ + w₁Xᵢ − Yᵢ) · Xᵢ

Finally, they can be simplified:

∂MSE/∂w₀ = (2/n) Σᵢ (w₀ + w₁Xᵢ − Yᵢ)
∂MSE/∂w₁ = (2/n) Σᵢ (w₀ + w₁Xᵢ − Yᵢ) · Xᵢ

Notice that the only difference between the equations is the factor Xᵢ.
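As a sketch, the two partial derivatives can drive gradient descent for the line model Ŷ = w₀ + w₁X; the data points below are made up and lie exactly on the line y = 1 + 2x, so the weights should approach (1, 2).

```python
# Two-weight gradient descent on the line model y_hat = w0 + w1 * x.
X = [0.0, 1.0, 2.0, 3.0]
Y = [1.0, 3.0, 5.0, 7.0]  # points on y = 1 + 2x
n = len(X)

w0, w1, lr = 0.0, 0.0, 0.05
for _ in range(5000):
    # The two partials differ only by the extra factor x.
    grad_w0 = 2 / n * sum(w0 + w1 * x - y for x, y in zip(X, Y))
    grad_w1 = 2 / n * sum((w0 + w1 * x - y) * x for x, y in zip(X, Y))
    w0 -= lr * grad_w0
    w1 -= lr * grad_w1

print(round(w0, 3), round(w1, 3))  # 1.0 2.0
```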

Three Weights

When taking the gradient of the MSE with three weights, the partial derivatives
must be taken with respect to each parameter. When taking the partial
derivative of one weight, X, Y, n, and the other two weights are treated as
constants. The same steps as the previous example can be repeated. First, with
Ŷᵢ = w₀ + w₁X₁ᵢ + w₂X₂ᵢ, the fraction and sum can be moved outside the
derivative:

∂MSE/∂wⱼ = (1/n) Σᵢ ∂/∂wⱼ (Ŷᵢ − Yᵢ)²   for j = 0, 1, 2

From here, the chain rule can be used to calculate the derivative with respect to
each weight:

∂MSE/∂w₀ = (1/n) Σᵢ 2(Ŷᵢ − Yᵢ) · 1
∂MSE/∂w₁ = (1/n) Σᵢ 2(Ŷᵢ − Yᵢ) · X₁ᵢ
∂MSE/∂w₂ = (1/n) Σᵢ 2(Ŷᵢ − Yᵢ) · X₂ᵢ

Finally, they can be simplified:

∂MSE/∂w₀ = (2/n) Σᵢ (Ŷᵢ − Yᵢ)
∂MSE/∂w₁ = (2/n) Σᵢ (Ŷᵢ − Yᵢ) · X₁ᵢ
∂MSE/∂w₂ = (2/n) Σᵢ (Ŷᵢ − Yᵢ) · X₂ᵢ

As mentioned previously, the only difference between each partial derivative is
the input feature, X. This can be generalized for k weights in the next example.

More Than Three Weights

When taking the gradient of the MSE with k weights, the partial derivatives
must be taken with respect to each parameter. When taking the partial
derivative of one weight, X, Y, n, and the other k−1 weights are treated as
constants. As seen in the previous examples, only the input feature of each
partial derivative changes:

∂MSE/∂wⱼ = (2/n) Σᵢ (Ŷᵢ − Yᵢ) · Xⱼᵢ
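The k-weight pattern can be sketched in plain Python: each partial derivative reuses the shared error term and multiplies by its own feature column. The sample data, including a leading 1 in each row acting as the bias feature, is made up for illustration.

```python
# Gradient of the MSE for k weights: one partial per feature column.
def gradients(X, Y, w):
    n, k = len(X), len(w)
    preds = [sum(wj * xj for wj, xj in zip(w, row)) for row in X]
    return [2 / n * sum((p - y) * row[j] for p, y, row in zip(preds, Y, X))
            for j in range(k)]

X = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]]  # leading 1 = bias feature
Y = [1.0, 3.0, 5.0]
print(gradients(X, Y, [0.0, 0.0]))  # roughly [-6.0, -8.667]
```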

Matrix Derivation



The formulas above show how to use gradient descent without explicitly taking
advantage of vectors and matrices. However, most of machine learning is best
understood by using their operations. For a quick overview, see A Simple
Introduction to Tensors.

The rest of this article will be dedicated to using matrix calculus to derive the
gradient of the MSE. To start, Ŷ and Y should be understood as matrices with
sizes of (n samples, 1). Both are matrices with 1 column and n rows; equivalently,
they can be viewed as column vectors, which changes their notation to
lowercase: ŷ and y.

The MSE is an element-wise vector subtraction between ŷ and y, followed by
the dot product of the difference with itself. Remember, the dot product can
only occur if the sizes are compatible. Since the goal is a scalar output, the
first vector must be transposed:

MSE = (1/n)(ŷ − y)ᵀ(ŷ − y)

From here, ŷ can be replaced with Xw for regression. X is a matrix with a size
of (n samples, num features), and w is a column vector with a size of
(num features, 1):

MSE = (1/n)(Xw − y)ᵀ(Xw − y)

The next step is to simplify the equation before taking the derivative. Notice
that w and X switch positions under the transpose to keep the multiplication
valid: (1, num features) × (num features, n samples) = (1, n samples).

MSE = (1/n)(wᵀXᵀ − yᵀ)(Xw − y)



These error terms can then be multiplied together:

MSE = (1/n)(wᵀXᵀXw − wᵀXᵀy − yᵀXw + yᵀy)

Notice that the third term, yᵀXw, is a scalar, so it can be rewritten by
transposing it: yᵀXw = (yᵀXw)ᵀ = wᵀXᵀy. Then, it can be added to the second
term:

MSE = (1/n)(yᵀy − 2wᵀXᵀy + wᵀXᵀXw)

Now, the partial derivative of the MSE can be taken with respect to the weight
vector. This is equivalent to taking the derivative of each term:

∂MSE/∂w = (1/n) ∂/∂w (yᵀy − 2wᵀXᵀy + wᵀXᵀXw)

Each term that does not contain w can be treated as a constant. The derivative
of each component can be computed using these standard matrix-calculus rules:

∂(c)/∂w = 0,   ∂(wᵀa)/∂w = a,   ∂(wᵀAw)/∂w = 2Aw (for symmetric A)



The first term, yᵀy, contains no w, so its derivative is zero. The second term
follows the linear rule, and the third term follows the quadratic rule, since
XᵀX is symmetric:

∂MSE/∂w = (1/n)(−2Xᵀy + 2XᵀXw) = (2/n)Xᵀ(Xw − y)

This equation can be used in gradient descent to simultaneously calculate all
the partial derivatives and update every weight at once:

wⱼ₊₁ = wⱼ − α · (2/n)Xᵀ(Xwⱼ − y)
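With NumPy, the matrix form turns the whole update into two lines per iteration; the data below is made up (points on y = 1 + 2x, with a column of ones acting as the bias feature), so the recovered weights should be close to (1, 2).

```python
import numpy as np

X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])  # (n, features)
y = np.array([1.0, 3.0, 5.0, 7.0])                              # targets on y = 1 + 2x
n = len(y)

w = np.zeros(2)
lr = 0.05
for _ in range(5000):
    grad = 2 / n * X.T @ (X @ w - y)  # all partial derivatives at once
    w -= lr * grad

print(np.round(w, 3))  # close to [1. 2.]
```

Compared with the per-weight loops earlier, nothing about the math has changed; the matrix form simply lets the linear-algebra library compute every partial derivative in one expression.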
