Introduction To Gradient Descent
Gradient descent is one of the most common optimization algorithms in
machine learning. Understanding its basic implementation is fundamental to
understanding all the advanced optimization algorithms built off of it.
Optimization Algorithms
By scaling the gradient with a value known as the learning rate and subtracting
the scaled gradient from the weight's current value, the loss is gradually
minimized: w_new = w_old − α · ∂L/∂w, where α is the learning rate. This can be
seen in the table below, where ten iterations (w₀ to w₉) with a learning rate
of 0.1 are used to minimize the cost function.
Table by Author: ten iterations of gradient descent (w₀ through w₉) with a learning rate of 0.1.
The table demonstrates how each component of the formula helps minimize
the loss. Because the gradient is negative, subtracting the scaled gradient
makes the new weight more positive, and the slope at the new weight is less
steep. As the slope approaches zero, each iteration yields a smaller update.
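A short sketch makes the loop concrete. The cost function and starting weight behind the original table are not reproduced here, so this assumes a simple quadratic cost L(w) = w² and a hypothetical starting weight of −2; the shrinking updates follow the same pattern described above.

```python
# Minimal gradient descent sketch, assuming the hypothetical cost L(w) = w**2
# and a hypothetical starting weight of -2 (not taken from the original table).
learning_rate = 0.1

w = -2.0  # hypothetical w0
for step in range(10):
    gradient = 2 * w  # dL/dw for L(w) = w**2
    print(f"w{step} = {w:.4f}, gradient = {gradient:.4f}")
    w = w - learning_rate * gradient  # subtract the scaled gradient
```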
One Weight
When taking the gradient of the MSE with only one weight, the derivative can
be calculated with respect to w. X, Y, and n must be treated as constants. With
this in mind, the fraction and sum can be moved outside of the derivative:
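Assuming the single-weight model ŷᵢ = wxᵢ (consistent with the Xw form used later in the article), this step can be written as:

```latex
\frac{\partial}{\partial w}\,\mathrm{MSE}
  = \frac{\partial}{\partial w}\,\frac{1}{n}\sum_{i=1}^{n}\left(w x_i - y_i\right)^{2}
  = \frac{1}{n}\sum_{i=1}^{n}\frac{\partial}{\partial w}\left(w x_i - y_i\right)^{2}
```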
From here, the chain rule can be used to calculate the derivative with respect
to w:
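Under the same assumed model, the chain rule gives:

```latex
\frac{\partial}{\partial w}\,\mathrm{MSE}
  = \frac{1}{n}\sum_{i=1}^{n} 2\left(w x_i - y_i\right) x_i
  = \frac{2}{n}\sum_{i=1}^{n}\left(\hat{y}_i - y_i\right) x_i
```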
Two Weights
When taking the gradient of the MSE with two weights, the partial derivatives
must be taken with respect to both parameters, w₀ and w₁. When taking the
partial derivative of w₀, the following variables are treated as constants: X, Y,
n, and w₁. When taking the partial derivative of w₁, the following variables are
treated as constants: X, Y, n, and w₀. The same steps as the previous example
can be repeated. First, the fraction and sum can be moved outside the
derivative.
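Assuming each weight multiplies its own input feature, ŷᵢ = w₀xᵢ₀ + w₁xᵢ₁, both partial derivatives start from the same form:

```latex
\frac{\partial\,\mathrm{MSE}}{\partial w_0}
  = \frac{1}{n}\sum_{i=1}^{n}\frac{\partial}{\partial w_0}\left(w_0 x_{i0} + w_1 x_{i1} - y_i\right)^{2},
\qquad
\frac{\partial\,\mathrm{MSE}}{\partial w_1}
  = \frac{1}{n}\sum_{i=1}^{n}\frac{\partial}{\partial w_1}\left(w_0 x_{i0} + w_1 x_{i1} - y_i\right)^{2}
```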
From here, the chain rule can be used to calculate the derivative with respect to
each weight:
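Under that same assumption, the chain rule yields one expression per weight, differing only in the input feature:

```latex
\frac{\partial\,\mathrm{MSE}}{\partial w_0}
  = \frac{2}{n}\sum_{i=1}^{n}\left(\hat{y}_i - y_i\right) x_{i0},
\qquad
\frac{\partial\,\mathrm{MSE}}{\partial w_1}
  = \frac{2}{n}\sum_{i=1}^{n}\left(\hat{y}_i - y_i\right) x_{i1}
```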
Three Weights
When taking the gradient of the MSE with three weights, the partial derivatives
must be taken with respect to each parameter. When taking the partial
derivative of one weight, X, Y, n, and the other two weights will be treated as
constants. The same steps as the previous example can be repeated. First, the
fraction and sum can be moved outside the derivative.
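Extending the same assumed model to ŷᵢ = w₀xᵢ₀ + w₁xᵢ₁ + w₂xᵢ₂, each partial derivative takes the same shape:

```latex
\frac{\partial\,\mathrm{MSE}}{\partial w_j}
  = \frac{1}{n}\sum_{i=1}^{n}\frac{\partial}{\partial w_j}\left(\hat{y}_i - y_i\right)^{2}
  = \frac{2}{n}\sum_{i=1}^{n}\left(\hat{y}_i - y_i\right) x_{ij},
\qquad j \in \{0, 1, 2\}
```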
When taking the gradient of the MSE with k weights, the partial derivatives
must be taken with respect to each parameter. When taking the partial
derivative of one weight, X, Y, n, and the other k-1 weights will be treated as
constants. As seen in the previous example, only the input feature of each
partial derivative changes when there are more than two weights.
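Written out, the general pattern for k weights is:

```latex
\frac{\partial\,\mathrm{MSE}}{\partial w_j}
  = \frac{2}{n}\sum_{i=1}^{n}\left(\hat{y}_i - y_i\right) x_{ij},
\qquad j = 0, 1, \ldots, k-1
```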
Matrix Derivation
The rest of this article is dedicated to using matrix calculus to derive the
gradient of the MSE. To start, Ŷ and Y should be understood as matrices of
size (n samples, 1). Both have one column and n rows, so they can also be
viewed as column vectors, which changes their notation to lowercase ŷ and y:
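In this notation, the MSE becomes:

```latex
\mathrm{MSE} = \frac{1}{n}\,(\hat{y} - y)^{\top}(\hat{y} - y)
```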
From here, ŷ can be replaced with Xw for linear regression. X is a matrix with a
size of (n samples, num features), and w is a column vector with a size
of (num features, 1).
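Substituting Xw for ŷ gives:

```latex
\mathrm{MSE} = \frac{1}{n}\,(Xw - y)^{\top}(Xw - y)
```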
The next step is to expand the equation before taking the derivative. Notice
that w and X switch positions when the first factor is transposed, which keeps
their multiplication valid: (1, num features) x (num features, n samples) =
(1, n samples).
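Since (Xw)ᵀ = wᵀXᵀ, expanding the product gives four terms:

```latex
\mathrm{MSE}
  = \frac{1}{n}\left(w^{\top}X^{\top} - y^{\top}\right)\left(Xw - y\right)
  = \frac{1}{n}\left(w^{\top}X^{\top}Xw - w^{\top}X^{\top}y - y^{\top}Xw + y^{\top}y\right)
```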
Notice that the third term, yᵀXw, is a 1 × 1 matrix, so it is equal to its own
transpose, wᵀXᵀy. After transposing it, it can be combined with the second term.
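With the third term transposed and merged into the second, the expression becomes:

```latex
y^{\top}Xw = \left(y^{\top}Xw\right)^{\top} = w^{\top}X^{\top}y
\quad\Longrightarrow\quad
\mathrm{MSE} = \frac{1}{n}\left(w^{\top}X^{\top}Xw - 2\,w^{\top}X^{\top}y + y^{\top}y\right)
```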
Now, the partial derivative of the MSE can be taken with respect to the weight
vector w. Each term that does not contain w can be treated as a constant. The
derivative of each component can be computed using these standard
matrix-calculus rules:
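The rules in question are the standard identities for a quadratic term, a linear term, and a constant (using the fact that XᵀX is symmetric); applying them term by term gives the familiar gradient, which matches the per-weight results from the earlier sections:

```latex
\frac{\partial}{\partial w}\,w^{\top}X^{\top}Xw = 2\,X^{\top}Xw,
\qquad
\frac{\partial}{\partial w}\,w^{\top}X^{\top}y = X^{\top}y,
\qquad
\frac{\partial}{\partial w}\,y^{\top}y = 0

\frac{\partial\,\mathrm{MSE}}{\partial w}
  = \frac{1}{n}\left(2\,X^{\top}Xw - 2\,X^{\top}y\right)
  = \frac{2}{n}\,X^{\top}\left(Xw - y\right)
```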