Optimization algorithms are essential tools across many fields, ranging from engineering and computer science to economics and physics. Among them, Newton's method holds a significant place due to its efficiency in finding the roots of equations and optimizing functions. In this article, we study Newton's method and its use in machine learning.
Newton's Method for Optimization
Newton's method can be extended to solve optimization problems by finding the minima or maxima of a real-valued function f(x). The goal of optimization is to find the value of x that minimizes or maximizes the function f(x). We are interested in finding critical points of the function where the first derivative is zero (for minima or maxima). Newton's method utilizes the first and second derivatives of the function to iteratively refine the solution.
The iterative formula for Newton's method is given by:
x_{n+1} = x_n - \frac{f'(x_n)}{f''(x_n)}
where x_{n+1} is the next approximation of the critical point, x_n is the current approximation, f'(x_n) is the first derivative of the function at x_n, and f''(x_n) is the second derivative at x_n (in the multivariate case, the gradient and the Hessian take their places).
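As a quick sanity check (our own worked example, not part of the original text), consider f(x) = x^2 - 4x, whose minimum lies at x = 2. Here f'(x) = 2x - 4 and f''(x) = 2, so starting from x_0 = 0:
x_1 = x_0 - \frac{f'(x_0)}{f''(x_0)} = 0 - \frac{-4}{2} = 2
Because f is quadratic, the method reaches the exact minimizer in a single step; for general functions, several iterations are needed.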
Intuitive Understanding of Newton's Method
Intuitively, Newton's method can be understood as follows:
- At each iteration, the method updates the current approximation x_n by subtracting the ratio of the gradient f'(x_n) and the curvature f''(x_n).
- If the curvature is positive (locally convex function), the step moves x opposite to the gradient, bringing it closer to a minimum.
- If the curvature is negative (locally concave function), the step moves x along the gradient, bringing it closer to a maximum.
The process continues iteratively until a stopping criterion is met or a desired convergence is achieved.
Keep in mind that Newton's method may not always converge, especially if the initial guess is far from the true critical point or if the function behaves poorly (e.g., oscillations, multiple critical points). Extra care is needed for functions with singularities or regions where the second derivative approaches zero.
Newton's Method: Second-Order Approximation
We begin the derivation by considering the second-order Taylor approximation of the function f(x) around the point x = x_n, given by:
f(x) \approx f(x_n) + f'(x_n) (x-x_n) + \frac{1}{2}f''(x_n) (x-x_n)^2
Now, rearranging the terms, we obtain
f(x) = \frac{1}{2} f''(x_n) x^2 + [f'(x_n) - f''(x_n) x_n]x + [f(x_n) - f'(x_n) x_n + \frac{1}{2} f''(x_n) x_n^{2}]
Next, to find the value of x at which this quadratic model is minimized, we compute its first derivative with respect to x and set it to zero, which gives:
f''(x_n) x = f''(x_n) x_n - f'(x_n)
Finally, rearranging the terms, we obtain the update rule as
x = x_n - \frac{f'(x_n)}{f''(x_n)}
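As a quick check of the derivation, the snippet below (a minimal sketch using SymPy, with an arbitrary test function chosen only for illustration) builds the second-order Taylor model around x_n, minimizes it, and confirms that the minimizer equals the Newton update x_n - f'(x_n)/f''(x_n).
Python3
import sympy as sp

x, xn = sp.symbols('x x_n', real=True)
f = sp.cos(x) + x**2 / 10  # arbitrary smooth test function (our assumption, for illustration only)

# Second-order Taylor model of f around x_n
model = (f.subs(x, xn)
         + sp.diff(f, x).subs(x, xn) * (x - xn)
         + sp.Rational(1, 2) * sp.diff(f, x, 2).subs(x, xn) * (x - xn) ** 2)

# Minimize the quadratic model: set its derivative to zero and solve for x
x_next = sp.solve(sp.Eq(sp.diff(model, x), 0), x)[0]

# Compare with the Newton update x_n - f'(x_n) / f''(x_n)
newton_update = xn - sp.diff(f, x).subs(x, xn) / sp.diff(f, x, 2).subs(x, xn)
print(sp.simplify(x_next - newton_update))  # prints 0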
Newton's Method for Finding Local Minima or Maxima in Python
In this section, we demonstrate how to use Newton's method for optimization in Python.
The following code finds a local minimum of f(x) using these components:
- f(x): the function to be minimized; in this case, x^2 - 4.
- numerical_derivative(f, x, h=1e-6): computes the numerical derivative of f at x using a small step size h.
- newton_method_optimization(initial_guess, f, tolerance=1e-6, max_iterations=100): the main function implementing Newton's method. It takes an initial guess for the minimum (initial_guess), the function to minimize (f), a tolerance level for convergence (tolerance), and a maximum number of iterations (max_iterations).
- initial_guess: a random initial guess for the minimum, generated using random.uniform(-10, 10).
- optimal_solution: the result of running Newton's method.
- Printing the result: if an optimal solution is found, the code prints the rounded optimal solution and the value of the function at that point.
Python3
import random

def f(x):
    return x**2 - 4

def numerical_derivative(f, x, h=1e-6):
    return (f(x + h) - f(x)) / h

def newton_method_optimization(initial_guess, f, tolerance=1e-6, max_iterations=100):
    x = initial_guess
    for iteration in range(max_iterations):
        # Compute the first derivative at the current point
        f_prime_x = numerical_derivative(f, x)
        # Compute the second derivative by differentiating the numerical first derivative
        f_double_prime_x = numerical_derivative(lambda x: numerical_derivative(f, x), x)
        if f_double_prime_x == 0:
            print("Error: Division by zero or small derivative.")
            return None
        # Compute the update using Newton's method
        delta_x = -f_prime_x / f_double_prime_x
        # Update x
        x += delta_x
        # Check for convergence
        if abs(delta_x) < tolerance:
            print("Converged after", iteration + 1, "iterations.")
            return x
    print("Did not converge within", max_iterations, "iterations.")
    return None

# Initial guess
initial_guess = random.uniform(-10, 10)

# Run Newton's method
optimal_solution = newton_method_optimization(initial_guess, f)

# Print the optimal solution rounded to three decimal places
if optimal_solution is not None:
    optimal_solution_rounded = round(optimal_solution, 3)
    function_value_rounded = round(f(optimal_solution), 3)
    print("Optimal solution:", optimal_solution_rounded)
    print("Value of the function at optimal solution:", function_value_rounded)
Output:
Converged after 3 iterations.
Optimal solution: -0.0
Value of the function at optimal solution: -4.0
In the above implementation, we apply Newton's method to find the minimum of the function f(x) = x^2 - 4. The convergence criteria are max_iterations = 100 and a tolerance of 10^-6.
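Finite differences work here, but they introduce numerical error, especially in the second derivative. When the derivatives are available in closed form, they can be used directly. A minimal variant for the same function (a sketch of our own, not part of the original code):
Python3
def f(x):
    return x**2 - 4

def f_prime(x):         # analytic first derivative
    return 2 * x

def f_double_prime(x):  # analytic second derivative
    return 2

def newton_step(x):
    return x - f_prime(x) / f_double_prime(x)

x = 7.5               # arbitrary starting point
x = newton_step(x)    # for a quadratic, one Newton step reaches the minimizer
print(x)              # 0.0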
Convergence Properties of Newton's Method
Newton's method converges quadratically near a solution, meaning that the number of correct digits roughly doubles with each iteration. Its convergence may be affected by several factors:
- Choice of Initial Guess: The convergence of Newton's method can depend significantly on the initial guess. If the initial guess is close to the minimum, it usually converges rapidly. However, far-off initial guesses may lead to slow convergence or even divergence (a small demonstration follows this list).
- Behavior of the Function: Newton's method assumes that the function is well-behaved in the vicinity of the minimum, meaning it's smooth and has continuous first and second derivatives. Discontinuities, singularities, or regions where derivatives are difficult to compute can affect convergence.
- Convergence Criteria: Newton's method typically terminates when the change in x between iterations becomes small enough, or when the function value becomes close to zero.
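To make the initial-guess sensitivity concrete, here is a small illustrative sketch (the example function is our own choice, not taken from the discussion above). For f(x) = \sqrt{1 + x^2}, whose minimum is at x = 0, the Newton update simplifies to x_{n+1} = -x_n^3, so the iteration converges only when |x_0| < 1:
Python3
import numpy as np

# Illustrative example function, not from the article: f(x) = sqrt(1 + x^2)
def f_prime(x):
    return x / np.sqrt(1 + x**2)

def f_double_prime(x):
    return (1 + x**2) ** -1.5

def newton_iterates(x, steps=6):
    history = [x]
    for _ in range(steps):
        x = x - f_prime(x) / f_double_prime(x)  # equals -x**3 for this function
        history.append(x)
    return history

print(newton_iterates(0.5))  # shrinks rapidly toward 0 (converges)
print(newton_iterates(1.5))  # magnitude grows each step (diverges)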
Complexity of Newton's Method
Newton's method has a favorable convergence rate, but each iteration can be considerably more expensive than in methods like gradient descent. The main reasons are:
- Computational Cost per Iteration: Each iteration of Newton's method requires the computation of both the gradient and the Hessian of the function. For functions with a large number of variables, computing the Hessian can be computationally expensive, especially if it's dense.
- Storage Requirements: Storing and manipulating the Hessian matrix can be memory-intensive, especially for functions with a large number of variables. This can become a bottleneck for high-dimensional optimization problems.
- Numerical Stability: The numerical computation of the Hessian can introduce errors, especially if the function has regions of high curvature or ill-conditioned Hessian matrices. Ensuring numerical stability adds computational overhead.
Time Complexity of Newton's Method
- Computing Gradient and Hessian: Computing the gradient typically requires O(n) operations for a function with n variables. Computing the Hessian involves O(n^2) operations for a function with n variables. However, if the Hessian has a specific structure (e.g., it's sparse), specialized algorithms can reduce this complexity.
- Solving a Linear System: In each iteration, Newton's method involves solving a linear system with the Hessian, usually by methods like Gaussian elimination, LU/Cholesky decomposition, or iterative solvers such as the conjugate gradient method. Solving a dense linear system typically requires O(n^3) operations. If the Hessian is sparse or structured, specialized solvers can reduce this cost significantly.
- Number of Iterations: The number of iterations required for convergence varies depending on factors such as the chosen optimization tolerance, the curvature of the function, and the choice of initial guess. In ideal conditions, Newton's method converges quadratically.
So, the total time complexity of Newton's method, after considering the cost per iteration and the number of iterations can be approximated as O(k⋅T), where k is the number of iterations and T is the complexity per iteration.
However, the actual time complexity can vary significantly based on the specific characteristics of the optimization problem, including the size of the problem (number of variables), the sparsity of the Hessian, and the computational efficiency of the algorithms used for gradient and Hessian computations and linear system solving.
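In practice, the Newton step is usually obtained by solving the linear system H \Delta = \nabla f rather than forming H^{-1} explicitly, which is both cheaper and numerically more stable. A minimal multivariate sketch (the quadratic test function is our own assumption, used purely for illustration):
Python3
import numpy as np

# Gradient and Hessian of an illustrative quadratic
# f(w) = 2*w0^2 + w0*w1 + w1^2 - 3*w0 - w1
def gradient(w):
    return np.array([4 * w[0] + w[1] - 3, w[0] + 2 * w[1] - 1])

def hessian(w):
    return np.array([[4.0, 1.0], [1.0, 2.0]])

w = np.zeros(2)
for _ in range(20):
    g, H = gradient(w), hessian(w)
    step = np.linalg.solve(H, g)  # solve H @ step = g instead of computing inv(H)
    w -= step
    if np.linalg.norm(step) < 1e-8:
        break

print(w)  # for a quadratic, Newton's method reaches the minimizer in one step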
Parameter Estimation in Logistic Regression using Newton's Method
Parameter Estimation in Logistic Regression
Logistic regression is a popular statistical method used for binary classification tasks. Given a dataset with input features X and corresponding binary labels y, logistic regression models the probability that an input belongs to a particular class. The logistic regression model is defined by the logistic (or sigmoid) function, which maps the linear combination of input features to a probability:
p(y=1\mid x) = \sigma(w^Tx) = \frac{1}{1+e^{-w^Tx}}
where
- x is a d-dimensional feature vector (a single row of the N×d feature matrix X, with N samples and d features).
- w is the weight vector of size d×1.
- σ(z) is the logistic (sigmoid) function.
Logistic Regression Model and Loss Function
The goal of logistic regression is to find the optimal weight vector w that maximizes the likelihood of the observed data. This is achieved by minimizing the loss function, known as the cross-entropy loss function:
L(w) = -\frac{1}{N}\sum_{i=1}^{N}[y_i \log(\sigma(w^T x_i))+(1-y_i)\log(1-\sigma(w^T x_i))]
where y_i is the label of the i-th sample and x_i is its feature vector.
Newton's method can be applied to optimize the parameters w by iteratively updating the weight vector to minimize the logistic loss function. The steps to perform this are as follows:
- Choose an initial guess for the weight vector w0.
- For each iteration:
- Compute the gradient vector ∇L(w_n) and the Hessian H(w_n) of the loss function with respect to w_n.
- Update the weight vector using Newton's formula:
w_{n+1} = w_n - H^{-1}(w_n)\nabla{L(w_n)}
- Check for convergence. If the change in the weight vector is sufficiently small or the maximum number of iterations is reached, terminate the iteration.
- The final weight vector obtained at the end of the iterations is considered the optimal solution
The gradient vector and Hessian matrix can be computed using the first and second derivatives of the logistic loss function, respectively. The Hessian matrix represents the curvature of the loss function, and its inverse is used to correct the weight updates, taking into account the local curvature of the loss surface.
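For the cross-entropy loss defined above, the gradient and Hessian have standard closed forms (written here in matrix notation, with \sigma applied element-wise and S a diagonal matrix):
\nabla L(w) = \frac{1}{N} X^T\left(\sigma(Xw) - y\right)
H(w) = \frac{1}{N} X^T S X, \quad S = \text{diag}\big(\sigma(Xw)(1 - \sigma(Xw))\big)
Since the entries of S are non-negative, the Hessian is positive semi-definite, so the loss is convex and Newton's method is well suited to this problem.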
We will now apply Newton's Method to find the optimal parameters of a logistic regression model for binary classification.
Let us consider a simple binary classification problem with just one feature, where we aim to predict whether a student passes (y=1) or fails (y=0) an exam based on the number of hours they studied. We have the following dataset:
| Hours Studied | Exam Result (Pass/Fail) |
|---|---|
| 2 | 0 |
| 3 | 0 |
| 4 | 1 |
| 5 | 0 |
| 6 | 1 |
We use logistic regression to model the probability of the student passing the exam, and apply Newton's Method to find the optimal weights of the model.
First, we define our model:
\sigma(z) = \frac{1}{1+e^{-z}}
where z = w_0 + w_1 \cdot \text{hours\_studied}, and w_0, w_1 are the weights to be optimized.
The logistic loss function is given by:
L(w_0, w_1) = -\frac{1}{N}\sum_{i=1}^{N}[y_i \log(\sigma(z_i))+(1-y_i)\log(1-\sigma(z_i))]
where N is the number of samples.
To apply Newton's method, we need to compute the gradient vector and Hessian matrix of the loss function with respect to the weights w0 and w1.
Let's assume an initial guess for the weights w0 = 0 and w1 = 0. Using Newton's Method, for each iteration, we perform the following steps.
- Compute the gradient vector ∇ L(w0, w1) and Hessian matrix H(w0, w1)
- Update the weights as follows:
\begin{pmatrix} w_0 \\ w_1 \end{pmatrix} \leftarrow \begin{pmatrix} w_0 \\ w_1 \end{pmatrix} - H^{-1}\nabla L(w_0, w_1)
- Check for convergence and terminate once convergence is achieved.
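For this one-feature model, writing p_i = \sigma(w_0 + w_1 x_i) for the predicted probability of the i-th sample, the gradient and Hessian entries reduce to simple averages, which is exactly what the code below computes:
\frac{\partial L}{\partial w_0} = \frac{1}{N}\sum_{i=1}^{N}(p_i - y_i), \quad \frac{\partial L}{\partial w_1} = \frac{1}{N}\sum_{i=1}^{N}(p_i - y_i)x_i
H = \frac{1}{N}\begin{pmatrix} \sum_i p_i(1-p_i) & \sum_i x_i p_i(1-p_i) \\ \sum_i x_i p_i(1-p_i) & \sum_i x_i^2 p_i(1-p_i) \end{pmatrix}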
Python Implementation
The Python implementation is given as follows:
Python3
import numpy as np

# Given dataset
hours_studied = np.array([2, 3, 4, 5, 6])
exam_result = np.array([0, 0, 1, 0, 1])

# Initialize weights
w0 = 0
w1 = 0

# Sigmoid function
def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Convergence criteria
max_iterations = 1000
tolerance = 1e-6
prev_delta_w = np.inf

# Iterative optimization
for iteration in range(max_iterations):
    # Compute z = w0 + w1 * hours_studied
    z = w0 + w1 * hours_studied

    # Compute predicted probabilities
    probabilities = sigmoid(z)

    # Compute error = predicted_prob - actual_result
    error = probabilities - exam_result

    # Compute gradient vector
    grad_w0 = np.mean(error)
    grad_w1 = np.mean(error * hours_studied)

    # Compute Hessian matrix
    hessian_w0_w0 = np.mean(probabilities * (1 - probabilities))
    hessian_w1_w1 = np.mean(hours_studied**2 * probabilities * (1 - probabilities))
    hessian_w0_w1 = np.mean(hours_studied * probabilities * (1 - probabilities))

    # Inverse of Hessian matrix
    hessian_inv = np.linalg.inv([[hessian_w0_w0, hessian_w0_w1],
                                 [hessian_w0_w1, hessian_w1_w1]])

    # Newton step: delta_w = H^-1 * gradient
    grad = np.array([grad_w0, grad_w1])
    delta_w = np.dot(hessian_inv, grad)

    # Check convergence
    if np.linalg.norm(delta_w) < tolerance:
        print("Converged after", iteration + 1, "iterations.")
        break

    # Update weights
    w0 -= delta_w[0]
    w1 -= delta_w[1]

    # Check for improvement in convergence
    if np.linalg.norm(delta_w - prev_delta_w) < tolerance:
        print("Converged (no significant change in weights) after", iteration + 1, "iterations.")
        break

    # Store delta_w for the next iteration's comparison
    prev_delta_w = delta_w

# Output
print("Optimal weights:")
print("w0 =", w0)
print("w1 =", w1)
Output:
Converged after 6 iterations.
Optimal weights:
w0 = -4.984392306601187
w1 = 1.0904255602930342
- In the above code, we use the NumPy library for the numerical computations. The dataset is stored in the NumPy arrays 'hours_studied' and 'exam_result', and the weights are initialized to (w0, w1) = (0, 0).
- The convergence criteria are max_iterations = 1000 and a tolerance of 10^-6. The rest of the code implements the algorithm as described above and finally prints the optimal weights.
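Once the weights have been fitted, the model can be used for prediction. A short usage sketch (the weight values are taken from the run above and will vary slightly between runs):
Python3
import numpy as np

# Approximate fitted weights from the run above
w0, w1 = -4.984, 1.090

def pass_probability(hours):
    return 1 / (1 + np.exp(-(w0 + w1 * hours)))

print(round(pass_probability(3), 3))  # low probability of passing
print(round(pass_probability(6), 3))  # high probability of passing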
Data Fitting with Newton's Method
Suppose we have some data points in the form of (x, y), and we want to fit a line of the form y = mx + b to these points, where m is the slope and b is the y-intercept. We can use Newton's method to minimize the sum of squared errors between the observed y-values and the predicted y-values from our model.
For the implementation we are generating some random sample data. Along with finding the optimal parameters, we are using the Matplotlib library to plot the fitted line. Following is the Python implementation:
Python3
import numpy as np
import matplotlib.pyplot as plt

# Generate some sample data
np.random.seed(0)
x = np.linspace(0, 10, 20)
y = 2 * x + 1 + np.random.normal(0, 1, 20)

# Define the model: y = mx + b
def model(x, m, b):
    return m * x + b

# Define the loss function: Mean Squared Error
def loss_function(params):
    m, b = params
    y_pred = model(x, m, b)
    return np.mean((y - y_pred) ** 2)

# Define the gradient of the loss function
def gradient(params):
    m, b = params
    grad_m = -2 * np.mean(x * (y - model(x, m, b)))
    grad_b = -2 * np.mean(y - model(x, m, b))
    return np.array([grad_m, grad_b])

# Define the Hessian matrix of the loss function
def hessian(params):
    m, b = params
    hessian_mm = 2 * np.mean(x ** 2)
    hessian_mb = 2 * np.mean(x)
    hessian_bb = 2
    return np.array([[hessian_mm, hessian_mb], [hessian_mb, hessian_bb]])

# Newton's method for optimization
def newtons_method(init_params, max_iterations=100, tolerance=1e-6):
    params = init_params
    for i in range(max_iterations):
        grad = gradient(params)
        hess = hessian(params)
        params -= np.linalg.inv(hess).dot(grad)
        if np.linalg.norm(grad) < tolerance:
            break
    return params

# Initial parameters
initial_params = np.array([0.0, 0.0])

# Run Newton's method to find optimal parameters
optimal_params = newtons_method(initial_params)
print('The optimal parameters are:', optimal_params)

# Plot the data points
plt.scatter(x, y, label='Data')

# Plot the fitted line
plt.plot(x, model(x, *optimal_params), color='red', label='Fitted Line')
plt.xlabel('X')
plt.ylabel('Y')
plt.title('Line fitting with Newton\'s Method')
plt.legend()
plt.grid(True)
plt.show()
Output:
The optimal parameters are: [1.88627741 2.13794752]
(Figure: output of the code for data fitting using Newton's Method.) The plot displays the data points along with the fitted line obtained using Newton's method.
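As a quick cross-check (our own addition, reusing the x and y arrays from the code above): because the mean-squared-error loss is quadratic in m and b, the Hessian is constant and Newton's method lands on the ordinary least-squares solution, which can be compared against NumPy's built-in fit:
Python3
# Ordinary least-squares fit for comparison (uses x and y defined above)
m_ls, b_ls = np.polyfit(x, y, 1)
print('Least-squares fit:', m_ls, b_ls)  # should match optimal_params up to numerical precision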
Newton's Method vs Other Optimization Algorithms
Now, we compare Newton's Method with some other popular optimization algorithms.
| Criteria | Newton's Method | Gradient Descent (GD) | Quasi-Newton Methods | Genetic Algorithms |
|---|---|---|---|---|
| Convergence Rate | Quadratic | Linear | Faster than GD, slower than Newton's | Typically slower than gradient-based methods |
| Initialization Sensitivity | Sensitive | Less sensitive | Less sensitive | Less sensitive |
| Memory Requirement | High (stores the Hessian) | Low | Moderate | Moderate |
| Derivative Requirement | First and second order derivatives | First order derivatives | First order derivatives | Doesn't require derivatives |
| Optimizer Type | Local | Local | Local | Global |
| Complexity | Moderate | Low | Moderate | High |
Applications of Newton's Method
- Root Finding: Newton's method can be used to find the roots of equations in various engineering and scientific applications, such as in solving nonlinear equations and systems of equations.
- Optimization in Machine Learning: Newton's method can optimize parameters in machine learning algorithms, such as logistic regression, support vector machines (SVMs) and Gaussian mixture models (GMMs).
- Computer Graphics: Newton's method is used in computer graphics for tasks such as finding intersections of curves and surfaces, ray tracing, and solving geometric problems.
- Signal Processing: It's used in signal processing for tasks like system identification, filter design, and spectral analysis.
- Image Processing: It's used in image processing for tasks like image registration, image reconstruction, and image segmentation.
Specifically, in the context of data science and machine learning, Newton's method plays a crucial role in optimizing model parameters, such as those in logistic regression or neural networks.