What is Gradient Descent?
Last Updated: 30 May, 2025
Gradient Descent is a fundamental algorithm in machine learning and optimization. It is used for tasks like training neural networks, fitting regression lines, and minimizing cost functions in models. In this article we will look at what gradient descent is, how it works, the mathematics behind it, and why it is so important in machine learning.
Introduction to Gradient Descent
Gradient Descent is an algorithm used to find the best solution to a problem by making small adjustments in the right direction. It’s like trying to find the lowest point in a hilly area by walking down the slope, step by step, until you reach the bottom.
For example:
Imagine you're at the top of a hill and your goal is to find the lowest point in the valley. You can't see the entire valley from the top, but you can feel the slope under your feet.
- Start at the Top: You begin at the top of the hill (this is like starting with random guesses for the model's parameters).
- Feel the Slope: You look around to find out which direction the ground is sloping down. This is like calculating the gradient, which tells you the steepest way downhill.
- Take a Step Down: Move in the direction where the slope is steepest (this is adjusting the model's parameters). The bigger the slope, the bigger the step you take.
- Repeat: You keep repeating the process — feeling the slope and moving downhill — until you reach the bottom of the valley (this is when the model has learned and minimized the error).
The key idea is that, just like walking down a hill, Gradient Descent moves towards the "bottom" or minimum of the loss function, which represents the error in predictions.
Moving in the opposite direction of the gradient allows the algorithm to gradually descend towards lower values of the function, eventually reaching its minimum. The gradients guide each update, ensuring convergence towards the optimal parameter values. The size of each step is controlled by a hyperparameter called the learning rate.
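As a minimal illustration of this idea (a toy example of our own, not from the article), consider minimizing f(x) = x², whose gradient is f'(x) = 2x. Repeatedly stepping opposite to the gradient drives x towards the minimum at 0:

```python
# Toy 1-D gradient descent: minimize f(x) = x^2, whose gradient is f'(x) = 2x.

def grad_f(x):
    return 2 * x              # gradient of f(x) = x^2

x = 5.0                       # arbitrary starting point ("top of the hill")
learning_rate = 0.1           # step size (gamma)

for step in range(25):
    x = x - learning_rate * grad_f(x)   # move against the gradient

print(x)                      # close to 0, the minimizer of f
```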
What is Learning Rate?
The learning rate is an important hyperparameter in gradient descent that controls how big or small each step is when moving down the gradient to update the model's parameters. It determines how quickly or slowly the algorithm converges towards the minimum of the cost function.
1. If the learning rate is too small: The algorithm takes tiny steps in each iteration and converges very slowly. This can significantly increase training time and computational cost, especially for large datasets.
2. If the learning rate is too big: The algorithm may take huge steps and overshoot the minimum of the cost function without ever settling. It can fail to converge, causing the parameter values to oscillate; when the updates grow uncontrollably this behaviour is often described as the exploding gradient problem.
In the corresponding figure, the point oscillates from right to left without converging to the minimum (the toy example below illustrates both cases).
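To make the effect of the step size concrete, here is a toy sketch (our own example on f(x) = x², not from the article) comparing a too-small, a reasonable, and a too-large learning rate:

```python
# Effect of the learning rate on gradient descent for f(x) = x^2 (gradient 2x).

def run_gd(learning_rate, steps=20, x=5.0):
    for _ in range(steps):
        x = x - learning_rate * 2 * x   # one gradient step
    return x

print(run_gd(0.01))   # too small: still far from 0 after 20 steps (slow convergence)
print(run_gd(0.1))    # reasonable: very close to 0
print(run_gd(1.1))    # too big: the iterate flips sign and grows, i.e. it diverges
```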
To address these problems, several techniques can be used:
- Weight initialization and regularization: The initialization of the weights can be adjusted to ensure they fall in an appropriate range. Using a different activation function, such as the Rectified Linear Unit (ReLU), can also help mitigate the vanishing gradient problem.
- Gradient clipping: Restrict the gradients to a predefined range to prevent them from becoming excessively large or small (a minimal sketch of this appears below).
- Batch normalization: It can also help by normalizing the input of each layer, preventing the activation functions from saturating and thus reducing the vanishing and exploding gradient problems.
Choosing the right learning rate leads to fast and stable convergence and improves the efficiency of training. Sometimes, however, vanishing or exploding gradients are unavoidable, and the techniques listed above help address them.
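Gradient clipping, listed above, is simple to sketch. The snippet below is a minimal NumPy illustration (the function name clip_by_norm is ours, not a library API): the gradient vector is rescaled whenever its L2 norm exceeds a chosen threshold.

```python
import numpy as np

def clip_by_norm(grad, max_norm=1.0):
    """Rescale the gradient if its L2 norm exceeds max_norm (gradient clipping)."""
    norm = np.linalg.norm(grad)
    if norm > max_norm:
        grad = grad * (max_norm / norm)
    return grad

grad = np.array([30.0, -40.0])           # an excessively large gradient (norm = 50)
print(clip_by_norm(grad, max_norm=5.0))  # rescaled to norm 5: [ 3. -4.]
```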
Mathematics Behind Gradient Descent
For simplicity, let's consider a linear regression model with a single input feature x and target y. The loss function (or cost function) over the n data points is the Mean Squared Error (MSE):
J(w, b) = \frac{1}{n} \sum_{i=1}^{n} \left( y_p - y \right)^2
Here:
- y_p = x \cdot w + b: The predicted value.
- w: Weight (slope of the line).
- b: Bias (intercept).
- n: Number of data points.
To optimize the model parameter w, we compute the gradient of the loss function with respect to w. This involves taking the partial derivative of J(w, b).
The gradient with respect to w is:
\frac{\partial J(w, b)}{\partial w} = \frac{\partial}{\partial w} \left[ \frac{1}{n} \sum_{i=1}^{n} \left( y_p - y \right)^2 \right]
\frac{\partial J(w, b)}{\partial w} = \frac{2}{n} \sum_{i=1}^{n} (y_p - y) \cdot \frac{\partial}{\partial w} \left( y_p - y \right)
Substituting y_p = x \cdot w + b:
\frac{\partial J(w, b)}{\partial w} = \frac{2}{n} \sum_{i=1}^{n} (y_p - y) \cdot \frac{\partial}{\partial w} \left( x \cdot w + b - y \right)
Since \frac{\partial}{\partial w} \left( x \cdot w + b - y \right) = x, the final gradient with respect to w is:
\frac{\partial J(w, b)}{\partial w} = \frac{2}{n} \sum_{i=1}^{n} (y_p - y) \cdot x
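The derived expression is easy to check numerically. The sketch below (our own illustration with made-up data; the b-gradient is obtained by the same steps) computes the analytic gradients of the MSE and compares the w-gradient against a finite-difference estimate:

```python
import numpy as np

# Made-up data and parameter guesses, purely for illustration
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([3.0, 5.0, 7.0, 9.0])   # roughly y = 2x + 1
w, b = 0.5, 0.0

def mse(w, b):
    y_p = w * x + b
    return np.mean((y_p - y) ** 2)

# Analytic gradients matching the derivation above
y_p = w * x + b
grad_w = (2 / len(x)) * np.sum((y_p - y) * x)
grad_b = (2 / len(x)) * np.sum(y_p - y)   # same derivation, without the factor x

# Finite-difference check of the w-gradient
eps = 1e-6
numeric_grad_w = (mse(w + eps, b) - mse(w - eps, b)) / (2 * eps)

print(grad_w, numeric_grad_w)   # the two values should agree closely
```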
Gradient Descent Update:
Once the gradient is calculated, we update the parameter w in the direction opposite to the gradient (to minimize the loss function):
1. For a positive gradient:
w = w - \gamma \cdot \frac{\partial J(w, b)}{\partial w}
Here:
- γ: Learning rate (step size for each update).
- \frac{\partial J(w, b)}{\partial w}: Gradient with respect to w.
Since the gradient is positive, subtracting it decreases w, which reduces the cost function.
2. For a negative gradient:
Since the gradient is negative, subtracting it effectively increases w (the update adds a positive amount), which again reduces the cost function.
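As a quick worked illustration with made-up numbers: suppose w = 2.0 and \gamma = 0.1. For a positive gradient \frac{\partial J}{\partial w} = 4, the update gives w = 2.0 - 0.1 \cdot 4 = 1.6, so w decreases. For a negative gradient \frac{\partial J}{\partial w} = -4, it gives w = 2.0 - 0.1 \cdot (-4) = 2.4, so w increases. In both cases the same rule moves w towards lower cost.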
Working of Gradient Descent
- Step 1: Initialize the parameters of the model randomly.
- Step 2: Compute the gradient of the cost function with respect to each parameter by taking the partial derivative of the cost function with respect to that parameter.
- Step 3: Update the parameters by taking steps in the direction opposite to the gradient. The step size is controlled by the learning rate hyperparameter, denoted by γ.
- Step 4: Repeat steps 2 and 3 iteratively until convergence to obtain the best parameters for the model.
This animation shows the iterative process of gradient descent as it traverses the 3D convex surface of the cost function. Each step represents an adjustment of the model parameters to minimize the loss, illustrating how the algorithm moves in the direction opposite to the gradient until it converges.
Pseudocode:
t ← 0
max_iterations ← 1000
w, b ← initialize randomly
while t < max_iterations do
    t ← t + 1
    w ← w − γ · ∂J(w, b)/∂w
    b ← b − γ · ∂J(w, b)/∂b
end while
Here:
- max_iterations: the number of iterations used to update the parameters.
- w, b: the weight and bias parameters.
- γ: the learning rate.
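A runnable Python version of this pseudocode for the single-feature linear regression above (a minimal sketch with made-up data, not production code) could look like this:

```python
import numpy as np

# Made-up data roughly following y = 2x + 1
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([3.1, 4.9, 7.2, 9.0, 10.8])

gamma = 0.05                  # learning rate
max_iterations = 1000
rng = np.random.default_rng(42)
w, b = rng.normal(), rng.normal()   # random initialization

for t in range(max_iterations):
    y_p = w * x + b                                  # predictions
    grad_w = (2 / len(x)) * np.sum((y_p - y) * x)    # dJ/dw from the derivation
    grad_b = (2 / len(x)) * np.sum(y_p - y)          # dJ/db
    w = w - gamma * grad_w                           # step opposite to the gradient
    b = b - gamma * grad_b

print(w, b)   # should end up close to the underlying slope 2 and intercept 1
```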
Now that we have seen what gradient descent is and how it works, let's look at its variants.
Different Variants of Gradient Descent
Types of gradient descent are:
- Batch Gradient Descent: Batch Gradient Descent computes gradients using the entire dataset in each iteration.
- Stochastic Gradient Descent (SGD): SGD uses one data point per iteration to compute gradients, making it faster.
- Mini-batch Gradient Descent: Mini-batch Gradient Descent combines batch and SGD by using small batches of data for updates.
- Momentum-based Gradient Descent: Momentum-based Gradient Descent speeds up convergence by adding a fraction of the previous gradient to the current update.
- Adagrad: Adagrad adjusts learning rates based on the historical magnitude of gradients.
- RMSprop: RMSprop is similar to Adagrad but uses a moving average of squared gradients for learning rate adjustments.
- Adam: Adam combines the ideas of Momentum and RMSprop by maintaining moving averages of both the gradients and the squared gradients.
To understand their details and use cases, please refer to: Types of Gradient Descent.
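To give a flavour of how the variants differ, here is a small sketch of our own (made-up data, not from the article) of a mini-batch update loop; batch gradient descent would use all samples in every step, and pure SGD exactly one:

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
y = 2.0 * x + 1.0 + rng.normal(0, 0.1, size=x.shape)   # noisy y = 2x + 1

w, b = 0.0, 0.0
gamma, batch_size = 0.005, 2

for epoch in range(1000):
    order = rng.permutation(len(x))              # reshuffle the data each epoch
    for start in range(0, len(x), batch_size):
        idx = order[start:start + batch_size]    # indices of one mini-batch
        xb, yb = x[idx], y[idx]
        y_p = w * xb + b
        grad_w = (2 / len(xb)) * np.sum((y_p - yb) * xb)
        grad_b = (2 / len(xb)) * np.sum(y_p - yb)
        w -= gamma * grad_w                      # update from this mini-batch only
        b -= gamma * grad_b

print(w, b)   # should approach the underlying values 2 and 1
```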
Advantages Of Gradient Descent
- Flexibility: It can be used with various cost functions and can handle non-linear regression problems.
- Scalability: Variants such as stochastic gradient descent scale well to large datasets, since they update the parameters using one (or a few) training examples at a time.
- Convergence: For a convex cost function, it can converge to the global minimum, provided the learning rate is set appropriately.
Disadvantages Of Gradient Descent
- Sensitivity to Learning Rate: The choice of learning rate is critical; a poor choice can cause slow convergence, oscillation, or vanishing/exploding gradients.
- Sensitivity to initialization: It can be sensitive to the initialization of the model's parameters, which affects both convergence and the quality of the solution.
- Local Minima: It can get stuck in a local minimum if the cost function has multiple local minima.
- Time-consuming: It can be time-consuming especially when dealing with large datasets.
Gradient Descent is a fundamental optimization technique used in machine learning. Understanding it allows us to build efficient and accurate models by iteratively reducing the error measured by the cost function during training, making gradient descent essential for building effective machine learning models.