Stochastic Gradient Descent: Math and Python Code
Deep Dive on Stochastic Gradient Descent. Algorithm,
assumptions, benefits, formula, and practical
implementation
Cristian Leo
Jan 16, 2024
Image by DALL-E-2
Introduction
The image above is not just an appealing visual that drew you to
this article (despite its length), but it also represents a potential
journey of the SGD algorithm in search of a global minimum. In
this journey, it navigates rocky paths where the height
symbolizes the loss. If this doesn’t sound clear now, don’t worry,
it will be by the end of this article.
Index:
· 1: Understanding the Basics
∘ 1.1: What is Gradient Descent
∘ 1.2: The ‘Stochastic’ in Stochastic Gradient Descent
· 2: The Mechanics of SGD
∘ 2.1: The Algorithm Explained
∘ 2.2: Understanding Learning Rate
· 3: SGD in Practice
∘ 3.1: Implementing SGD in Machine Learning Models
∘ 3.2: SGD in Scikit-learn and TensorFlow
· 4: Advantages and Challenges
∘ 4.1: Why Choose SGD?
∘ 4.2: Overcoming Challenges in SGD
· 5: Beyond Basic SGD
∘ 5.1: Variants of SGD
∘ 5.2: Future of SGD
· Conclusion
Image by DALL-E-2
Initialization (Step 1)
First, you initialize the parameters (weights) of your model. This
can be done randomly or by some other initialization technique.
The starting point for SGD is crucial as it influences the path the
algorithm will take.
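As a rough illustration of this step, here is a minimal sketch of two common starting points for a linear model. The helper name `init_params`, the `method` argument, and the 0.01 scale for the random draw are illustrative assumptions, not something the article prescribes.

```python
import numpy as np

def init_params(n_features, method="zeros", seed=0):
    """Return (weights, bias) for a linear model.

    'zeros' starts every weight at 0; 'random' draws small Gaussian values.
    The function name and the 0.01 scale are illustrative assumptions.
    """
    if method == "zeros":
        weights = np.zeros(n_features)
    elif method == "random":
        rng = np.random.default_rng(seed)
        weights = rng.normal(loc=0.0, scale=0.01, size=n_features)
    else:
        raise ValueError(f"unknown method: {method}")
    return weights, 0.0

w_zero, b_zero = init_params(3, "zeros")
w_rand, b_rand = init_params(3, "random", seed=42)
print("zeros :", w_zero, b_zero)
print("random:", w_rand, b_rand)
```

Fixing the seed makes the random start reproducible, which helps when comparing runs that differ only in their starting point.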
Gradient Formula
The learning rate determines the size of the steps you take
towards the minimum. If it’s too small, the algorithm will be
slow; if it’s too large, you might overshoot the minimum.
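To make this concrete, the sketch below runs plain gradient descent on the toy function f(w) = w², using the update w ← w − learning_rate · gradient. The function name `gradient_descent_1d`, the starting point, and the three learning rates are arbitrary choices for illustration.

```python
def gradient_descent_1d(learning_rate, steps=20, start=5.0):
    """Minimize f(w) = w**2 with plain gradient descent; return the final w."""
    w = start
    for _ in range(steps):
        grad = 2.0 * w                 # derivative of w**2
        w = w - learning_rate * grad   # update rule: w <- w - lr * grad
    return w

for lr in (0.001, 0.1, 1.1):
    print(f"lr={lr}: final w = {gradient_descent_1d(lr):.4f}")
```

With the smallest rate, w has barely moved from its starting value after 20 steps. With the middle rate it ends close to the minimum at 0. With the largest rate every step overshoots, and w moves further from the minimum each time.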
import numpy as np

class SGDRegressor:
    def __init__(self, learning_rate=0.01, epochs=100, batch_size=1,
                 reg=None, reg_param=0.0):
        """
        Constructor for the SGDRegressor.

        Parameters:
        learning_rate (float): The step size used in each update.
        epochs (int): Number of passes over the training dataset.
        batch_size (int): Number of samples to be used in each batch.
        reg (str): Type of regularization ('l1' or 'l2'); None if no
            regularization.
        reg_param (float): Regularization parameter.

        The weights and bias are initialized as None and will be set
        during the fit method.
        """
        self.learning_rate = learning_rate
        self.epochs = epochs
        self.batch_size = batch_size
        self.reg = reg
        self.reg_param = reg_param
        self.weights = None
        self.bias = None
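    def fit(self, X, y):
        """
        Fits the SGDRegressor to the training data.

        Parameters:
        X (numpy.ndarray): Training data, shape (m_samples, n_features).
        y (numpy.ndarray): Target values, shape (m_samples,).

        This method initializes the weights and bias, and then updates
        them over a number of epochs.
        """
        m, n = X.shape  # m is number of samples, n is number of features
        self.weights = np.zeros(n)
        self.bias = 0

        for _ in range(self.epochs):
            indices = np.random.permutation(m)  # reshuffle every epoch
            X_shuffled = X[indices]
            y_shuffled = y[indices]

            # Assumption: the batch slicing, MSE gradients, and parameter
            # updates are not shown in the text; this is a standard form.
            for i in range(0, m, self.batch_size):
                X_batch = X_shuffled[i:i + self.batch_size]
                y_batch = y_shuffled[i:i + self.batch_size]
                error = np.dot(X_batch, self.weights) + self.bias - y_batch
                gradient_w = 2 * np.dot(X_batch.T, error) / len(y_batch)
                gradient_b = 2 * np.mean(error)

                if self.reg == 'l1':
                    gradient_w += self.reg_param * np.sign(self.weights)
                elif self.reg == 'l2':
                    gradient_w += self.reg_param * self.weights

                self.weights -= self.learning_rate * gradient_w
                self.bias -= self.learning_rate * gradient_b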
    def predict(self, X):
        """
        Predicts target values for the given data.

        Parameters:
        X (numpy.ndarray): Data for which to predict target values.

        Returns:
        numpy.ndarray: Predicted target values.
        """
        return np.dot(X, self.weights) + self.bias
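    def compute_loss(self, X, y):  # method name is an assumption
        """
        Computes the loss: the square root of the mean squared error
        plus the regularization term.

        Parameters:
        X (numpy.ndarray): The input data.
        y (numpy.ndarray): The true target values.

        Returns:
        float: The computed loss value.
        """
        return (np.mean((y - self.predict(X)) ** 2) +
                self._get_regularization_loss()) ** 0.5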
    def _get_regularization_loss(self):
        """
        Computes the regularization loss based on the regularization
        type.

        Returns:
        float: The regularization loss.
        """
        if self.reg == 'l1':
            return self.reg_param * np.sum(np.abs(self.weights))
        elif self.reg == 'l2':
            return self.reg_param * np.sum(self.weights ** 2)
        else:
            return 0

    def get_weights(self):
        """
        Returns the weights of the model.

        Returns:
        numpy.ndarray: The weights of the linear model.
        """
        return self.weights
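To see this kind of training loop run end to end, here is a compact standalone sketch of mini-batch SGD for linear regression with a mean squared error loss. It is not the class above: the function name `sgd_linear_regression`, the synthetic data, and the chosen hyperparameters are all assumptions made for illustration.

```python
import numpy as np

def sgd_linear_regression(X, y, learning_rate=0.01, epochs=100,
                          batch_size=1, seed=0):
    """Fit y ~ X @ w + b by mini-batch SGD on the mean squared error."""
    rng = np.random.default_rng(seed)
    m, n = X.shape
    w, b = np.zeros(n), 0.0
    for _ in range(epochs):
        order = rng.permutation(m)             # reshuffle once per epoch
        for start in range(0, m, batch_size):
            idx = order[start:start + batch_size]
            X_b, y_b = X[idx], y[idx]
            error = X_b @ w + b - y_b          # prediction minus target
            grad_w = 2.0 * X_b.T @ error / len(idx)
            grad_b = 2.0 * error.mean()
            w -= learning_rate * grad_w
            b -= learning_rate * grad_b
    return w, b

# Synthetic data: y = 3*x0 - 2*x1 + 1 plus a little noise
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))
y = X @ np.array([3.0, -2.0]) + 1.0 + 0.01 * rng.normal(size=200)

w, b = sgd_linear_regression(X, y, learning_rate=0.05, epochs=50,
                             batch_size=16)
print("weights:", w, "bias:", b)
```

The recovered weights and bias should land close to the values used to generate the data. Setting `batch_size=1` gives pure stochastic updates, while setting it to the number of samples gives ordinary batch gradient descent.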
Initialization (Step 1)
for _ in range(self.epochs):
    indices = np.random.permutation(m)
    X_shuffled = X[indices]
    y_shuffled = y[indices]

if self.reg == 'l1':
    gradient_w += self.reg_param * np.sign(self.weights)
elif self.reg == 'l2':
    gradient_w += self.reg_param * self.weights
def _get_regularization_loss(self):
    if self.reg == 'l1':
        return self.reg_param * np.sum(np.abs(self.weights))
    elif self.reg == 'l2':
        return self.reg_param * np.sum(self.weights ** 2)
    else:
        return 0
# Making predictions
predictions = model.predict(X)
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.optimizers import SGD
sgd = SGD(learning_rate=0.01)
General Applicability:
SGD can be applied to a wide range of problems and is not
limited to specific types of models. This general applicability
makes it a versatile tool in the machine learning toolbox.
Improved Generalization:
Because each update is computed from a small, noisy sample of
the data, SGD often leads to models that generalize better to
unseen data. The noise in the updates acts as a form of implicit
regularization, which makes the algorithm less likely to settle on
a solution that overfits the training data.
Hyperparameter Tuning
SGD requires careful tuning of hyperparameters, not just the
learning rate but also parameters like momentum and the size of
the mini-batch.
Utilize grid search, random search, or more advanced methods
like Bayesian optimization to find the optimal set of
hyperparameters.
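As one way to picture the simplest of these options, the sketch below runs a small grid search over the learning rate and mini-batch size, scoring each combination on a hold-out validation set. The helper names `train_sgd` and `mse`, the synthetic data, and the candidate values in the grid are all assumptions chosen for illustration.

```python
import itertools
import numpy as np

def train_sgd(X, y, learning_rate, batch_size, epochs=30, seed=0):
    """Mini-batch SGD for linear regression (MSE loss); returns (w, b)."""
    rng = np.random.default_rng(seed)
    m, n = X.shape
    w, b = np.zeros(n), 0.0
    for _ in range(epochs):
        order = rng.permutation(m)
        for start in range(0, m, batch_size):
            idx = order[start:start + batch_size]
            error = X[idx] @ w + b - y[idx]
            w -= learning_rate * 2.0 * X[idx].T @ error / len(idx)
            b -= learning_rate * 2.0 * error.mean()
    return w, b

def mse(X, y, w, b):
    """Mean squared error of the linear model on (X, y)."""
    return float(np.mean((X @ w + b - y) ** 2))

# Synthetic regression data, split into training and validation sets
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 3))
y = X @ np.array([1.5, -0.5, 2.0]) + 0.5 + 0.1 * rng.normal(size=300)
X_train, y_train, X_val, y_val = X[:240], y[:240], X[240:], y[240:]

# Exhaustive grid over two hyperparameters (candidate values are arbitrary)
grid = {"learning_rate": [0.0001, 0.001, 0.01, 0.05], "batch_size": [8, 32]}
results = []
for lr, bs in itertools.product(grid["learning_rate"], grid["batch_size"]):
    w, b = train_sgd(X_train, y_train, lr, bs)
    results.append((mse(X_val, y_val, w, b), lr, bs))

best_loss, best_lr, best_bs = min(results)
print(f"best: lr={best_lr}, batch_size={best_bs}, val MSE={best_loss:.4f}")
```

Grid search grows multiplicatively with every extra hyperparameter, which is why random search or Bayesian optimization tends to be preferred once the search space gets larger.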
Overfitting
Like any machine learning algorithm, there’s a risk of
overfitting, where the model performs well on training data but
poorly on unseen data.
Use regularization techniques such as L1 or L2 regularization,
and validate the model using a hold-out set or cross-validation.
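The sketch below combines both ideas on synthetic data: it trains with and without an L2 penalty and evaluates each model on a hold-out set. The penalty enters the gradient as `reg_param * w`, mirroring the regularization term in the article's code. The function name `fit_sgd_l2`, the data, and the penalty strength of 1.0 are illustrative assumptions.

```python
import numpy as np

def fit_sgd_l2(X, y, reg_param, learning_rate=0.01, epochs=100,
               batch_size=10, seed=0):
    """Mini-batch SGD on MSE with an L2 term reg_param * w in the gradient."""
    rng = np.random.default_rng(seed)
    m, n = X.shape
    w, b = np.zeros(n), 0.0
    for _ in range(epochs):
        order = rng.permutation(m)
        for start in range(0, m, batch_size):
            idx = order[start:start + batch_size]
            error = X[idx] @ w + b - y[idx]
            grad_w = 2.0 * X[idx].T @ error / len(idx) + reg_param * w
            w -= learning_rate * grad_w
            b -= learning_rate * 2.0 * error.mean()
    return w, b

# Noisy data where only 3 of the 10 features carry signal
rng = np.random.default_rng(3)
X = rng.normal(size=(100, 10))
true_w = np.zeros(10)
true_w[:3] = [2.0, -1.0, 0.5]
y = X @ true_w + 0.5 * rng.normal(size=100)

# Hold out the last 40 samples for validation
X_train, y_train = X[:60], y[:60]
X_val, y_val = X[60:], y[60:]

norms, val_errors = {}, {}
for reg_param in (0.0, 1.0):
    w, b = fit_sgd_l2(X_train, y_train, reg_param)
    norms[reg_param] = float(np.linalg.norm(w))
    val_errors[reg_param] = float(np.mean((X_val @ w + b - y_val) ** 2))
    print(f"reg_param={reg_param}: ||w||={norms[reg_param]:.3f}, "
          f"val MSE={val_errors[reg_param]:.3f}")
```

The penalty pulls the weights toward zero, so the regularized model ends with a smaller weight norm. Whether that improves the validation error depends on the data and on the penalty strength, which is why the hold-out comparison matters.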
RMSprop
RMSprop (Root Mean Square Propagation) modifies Adagrad to
address its radically diminishing learning rates. It uses a moving
average of squared gradients to normalize the gradient.
It works well in online and non-stationary settings and has been
found to be an effective and practical optimization algorithm for
neural networks.
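A minimal sketch of the RMSprop update follows, applied to a badly scaled quadratic so the effect of the normalization is visible. The function name `rmsprop`, the decay of 0.9, the small constant `eps`, and the test function are assumptions for illustration; real libraries differ in details such as where `eps` is added.

```python
import numpy as np

def rmsprop(grad_fn, w0, learning_rate=0.01, decay=0.9, eps=1e-8, steps=500):
    """Minimize a function, given its gradient, using the RMSprop update."""
    w = np.asarray(w0, dtype=float).copy()
    avg_sq = np.zeros_like(w)          # moving average of squared gradients
    for _ in range(steps):
        g = grad_fn(w)
        avg_sq = decay * avg_sq + (1.0 - decay) * g ** 2
        # Divide each coordinate by the root of its own running average
        w -= learning_rate * g / (np.sqrt(avg_sq) + eps)
    return w

def grad(w):
    """Gradient of f(w) = 0.5 * (100 * w[0]**2 + w[1]**2)."""
    return np.array([100.0, 1.0]) * w

w_final = rmsprop(grad, w0=[1.0, 1.0])
print(w_final)
```

Because each coordinate is divided by the root of its own average squared gradient, the steep and shallow directions take steps of similar size, even though their raw gradients differ by a factor of 100.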
Each of these variants has its own strengths and is suited for
specific types of problems. Their development reflects the
ongoing effort in the machine learning community to refine and
enhance optimization algorithms to achieve better and faster
results. Understanding these variants and their appropriate
applications is crucial for anyone looking to delve deeper into
machine learning optimization techniques.
Quantum SGD
The advent of quantum computing presents an opportunity to
explore quantum versions of SGD, leveraging quantum
algorithms for optimization.
Quantum SGD has the potential to dramatically speed up the
training process for certain types of models, though this is still
largely in the research phase.
Conclusion
As we wrap up our exploration of Stochastic Gradient Descent
(SGD), it’s clear that this algorithm is much more than just a
method for optimizing machine learning models. It stands as a
testament to the ingenuity and continuous evolution in the field
of artificial intelligence. From its basic form to its more
advanced variants, SGD remains a critical tool in the machine
learning toolkit, adaptable to a wide array of challenges and
applications.
If you liked the article please leave a clap, and let me know in the
comments what you think about it!