Difference Between Gradient Descent and Normal Equation
Last Updated: 12 Feb, 2025
In regression models, the goal is to find a model that makes accurate predictions by estimating the model parameters. This is achieved by minimizing the error between actual and predicted values, which means adjusting the parameters until the cost function reaches its lowest possible value.
For models like Linear Regression there are two main methods to estimate these parameters: the Normal Equation and Gradient Descent.
Understanding Gradient Descent
Gradient Descent is an iterative optimization algorithm used to find parameter values that minimize a cost function. It is widely used in machine learning to update the model parameters like coefficients in Linear Regression or weights in neural networks.
Here’s how it works:
1. Initialization: You start by randomly initializing the model parameters. These parameters could be the weights (in neural networks) or coefficients (in linear regression).
2. Compute the Gradient: The gradient is the derivative of the cost function with respect to each parameter. It points in the direction in which the cost function increases fastest, so it tells us how the parameters need to be adjusted to reduce the cost.
3. Update the Parameters: Using the gradient, we update the parameters by taking a step in the opposite direction of the gradient, since we want to minimize the cost. The size of each step is controlled by a parameter called the learning rate, denoted α (a worked one-step example follows this list). The update rule looks like this:
\theta_{j} = \theta_{j} - \alpha \frac{\partial}{\partial \theta_{j}} J(\theta)
Where:
- \theta_j is the parameter (e.g., weight or coefficient)
- α is the learning rate, determining how large a step to take
- \frac{\partial}{\partial \theta_j} J(\theta) is the gradient (partial derivative) of the cost function with respect to \theta_j
4. Repeat: The gradient computation and parameter update are repeated iteratively until the algorithm converges. Convergence happens when the parameters stop changing significantly, or when the cost function reaches a minimum.
5. Convergence: Ideally, the iterative process leads to a point where the cost function is minimized, and the parameters represent the best fit to the data.
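For intuition, here is a worked single update for an assumed one-parameter cost J(\theta) = \theta^{2} (this toy cost is only for illustration, not from the model above). Its gradient is \frac{\partial}{\partial \theta} J(\theta) = 2\theta, so starting from \theta = 3 with \alpha = 0.1:
\theta \leftarrow 3 - 0.1 \times (2 \times 3) = 2.4
Repeating the update keeps shrinking \theta toward 0, where this cost is minimized.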
Python Implementation of Gradient Descent
We can apply gradient descent to our input features using the numpy library. However, to apply gradient descent we have to choose some hyperparameters, such as the learning rate and the number of iterations, which are set manually rather than learned by the model.
Python
import numpy as np

def gradient_descent(X, y):
    learning_rate = 0.01      # step size for each update
    num_iterations = 100      # number of gradient steps
    num_samples, num_features = X.shape
    weights = np.random.randn(num_features)   # random initialization

    for iteration in range(num_iterations):
        predictions = np.dot(X, weights)                      # current model output
        error = predictions - y                               # difference from targets
        gradients = 2 * np.dot(X.T, error) / num_samples      # gradient of the MSE cost
        weights -= learning_rate * gradients                  # step opposite to the gradient
    return weights

X = np.array([[1, 2], [3, 4], [5, 6]])
y = np.array([3, 7, 11])
weights = gradient_descent(X, y)

X_test = np.array([[3, 4], [5, 4], [7, 9]])
predictions = np.dot(X_test, weights)
print(predictions)
Output (the exact values vary from run to run because the weights are initialized randomly):
[ 6.90093216 10.1567656 15.93407653]
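To see the convergence described in steps 4 and 5 above, a minimal variant of the same loop can record the cost at every iteration. The function name gradient_descent_with_history, the zero initialization, and the fixed hyperparameters below are choices made for this sketch, not part of the original code:
Python
import numpy as np

def gradient_descent_with_history(X, y, learning_rate=0.01, num_iterations=100):
    num_samples, num_features = X.shape
    weights = np.zeros(num_features)      # deterministic start so results are reproducible
    cost_history = []
    for _ in range(num_iterations):
        predictions = np.dot(X, weights)
        error = predictions - y
        cost_history.append(np.mean(error ** 2))                         # MSE cost at this step
        weights -= learning_rate * 2 * np.dot(X.T, error) / num_samples  # gradient step
    return weights, cost_history

X = np.array([[1, 2], [3, 4], [5, 6]])
y = np.array([3, 7, 11])
weights, costs = gradient_descent_with_history(X, y)
print(costs[0], costs[-1])   # the cost should decrease as the iterations proceed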
Understanding Normal Equation
The Normal Equation is a closed-form solution for finding the optimal parameters (coefficients) in linear regression. It’s an analytical method that directly computes the values of the model parameters that minimize the sum of squared errors, which is the cost function in linear regression.
Unlike Gradient Descent, the Normal Equation does not involve iterations or hyperparameter tuning. It gives a direct formula to calculate the best-fit parameters in one step, based on the input data.
Here’s how it works:
\theta=\left(X^{T} X\right)^{-1} X^{T} y
Where:
- X is the input feature matrix
- y is the vector of output values
- X^T is the transpose of X
- (X^T X)^{-1} is the inverse of X^T X
- X^T y is the product of X^T and y
Python Implementation of Normal Equation
We can use the numpy library's linear algebra functions to compute the parameters of the linear regression model directly from the data. We also include a 1 at the beginning of each row of the feature matrix so the model learns a bias (intercept) term.
Python
import numpy as np

# Training data: the first column of ones provides the bias (intercept) term
X = np.array([[1, 1], [1, 2], [1, 3], [1, 4]])
y = np.array([[2], [3], [4], [5]])

X_transpose = X.T
X_transpose_X = np.dot(X_transpose, X)   # X^T X
X_transpose_y = np.dot(X_transpose, y)   # X^T y

# Normal Equation: solve (X^T X) theta = X^T y rather than inverting explicitly
theta = np.linalg.solve(X_transpose_X, X_transpose_y)

X_test = np.array([[1], [4]])
X_test_with_intercept = np.c_[np.ones((X_test.shape[0], 1)), X_test]   # prepend ones
predictions = np.dot(X_test_with_intercept, theta)
print(predictions)
Output:
[[2.]
 [5.]]
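As a quick cross-check (not part of the original article), NumPy's least-squares solver minimizes the same sum of squared errors and should agree with the Normal Equation result:
Python
import numpy as np

X = np.array([[1, 1], [1, 2], [1, 3], [1, 4]])
y = np.array([[2], [3], [4], [5]])

# np.linalg.lstsq solves min ||X theta - y||^2 directly
theta_lstsq, residuals, rank, singular_values = np.linalg.lstsq(X, y, rcond=None)
print(theta_lstsq)   # expected to match theta from the Normal Equation above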
Difference Between Gradient Descent and the Normal Equation
| Feature | Gradient Descent | Normal Equation |
|---|---|---|
| Hyperparameters | Requires choosing a learning rate, number of iterations, etc. | No hyperparameters are needed. |
| Approach | Iterative algorithm. | Analytical (closed-form) approach. |
| Data Size | Works well with a large number of features. | Works well with a small number of features. |
| Feature Scaling | Feature scaling is needed. | No need for feature scaling. |
| Handling Non-Invertibility | No need to handle non-invertibility of X^T X. | Regularization (or a pseudoinverse) is needed when X^T X is non-invertible. |
| Time Complexity | Depends on the number of iterations and the data size. | Dominated by computing and inverting X^T X, which grows roughly cubically with the number of features. |
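To make the time-complexity trade-off concrete, a rough timing sketch is given below; the dataset size, learning rate, and iteration count are arbitrary choices for illustration, not benchmarks from the article:
Python
import time
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_features = 5000, 500
X = rng.standard_normal((n_samples, n_features))
y = X @ rng.standard_normal(n_features)

# Normal Equation: one linear solve whose cost grows roughly cubically with n_features
start = time.perf_counter()
theta_ne = np.linalg.solve(X.T @ X, X.T @ y)
ne_time = time.perf_counter() - start

# Gradient Descent: repeated cheap passes over the data with a fixed iteration budget
start = time.perf_counter()
theta_gd = np.zeros(n_features)
for _ in range(200):
    gradients = 2 * X.T @ (X @ theta_gd - y) / n_samples
    theta_gd -= 0.01 * gradients
gd_time = time.perf_counter() - start

print(f"Normal Equation: {ne_time:.3f} s, Gradient Descent: {gd_time:.3f} s")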
Both Gradient Descent and the Normal Equation are widely used techniques for finding the optimal parameters in linear regression. Gradient Descent is an iterative approach that requires tuning hyperparameters and scales well to large datasets, while the Normal Equation offers a direct solution that avoids iteration but can become computationally expensive when the number of features is large.