Lesson 4: Gradient Descent
ARTIFICIAL INTELLIGENCE / MACHINE LEARNING ALGORITHMS
GRADIENT DESCENT
AI and ML Libraries

Armadillo
Developers: NICTA research center, Australia, and independent contributors
Language: C++
Purpose: Linear algebra and scientific computing
Used for: Bioinformatics, computer vision, econometrics, pattern recognition, signal processing, and statistics.

FANN
Developers: Steffen Nissen (original), several collaborators (present)
Language: C (binds to C#, Python, and others)
Purpose: Developing multi-layer feed-forward artificial neural nets
Used for: Aerospace engineering, AI, biology, environmental sciences, genetics, image recognition, and machine learning.
https://youtu.be/lI-M7O_bRNg
GRADIENT DESCENT
Gradient descent is an optimization algorithm, grounded in calculus and linear algebra, that is commonly used to train machine learning models and neural networks.
Training data helps these models learn over time, and the cost function within gradient descent acts as a barometer, gauging the model's accuracy with each iteration of parameter updates.
Until the cost function is close to or equal to zero, the model will continue to adjust its parameters to yield the smallest possible error.
Once machine learning models are optimized for accuracy, they can be powerful tools for artificial intelligence (AI) and computer science applications.
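As a minimal sketch of that loop, the following Python snippet descends a toy quadratic cost, f(w) = (w - 3)^2 (an assumption chosen only for illustration, not a function from the lesson), and stops once the cost is close to zero:

# Minimal one-parameter gradient descent sketch.
# Assumed toy cost: f(w) = (w - 3)^2, with gradient f'(w) = 2 * (w - 3);
# its minimum (cost = 0) sits at w = 3.

def cost(w):
    return (w - 3) ** 2

def gradient(w):
    return 2 * (w - 3)

w = 0.0              # initial guess for the parameter
learning_rate = 0.1  # assumed step size

for step in range(100):
    w -= learning_rate * gradient(w)  # adjust the parameter against the gradient
    if cost(w) <= 1e-12:              # stop when the cost is close to zero
        break

print(f"w = {w:.4f}, cost = {cost(w):.2e}")  # w approaches 3, cost approaches 0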
Recall the formula for the slope of a line, y = mx + b, where m represents the slope and b is the intercept on the y-axis. Linear regression fits this line to the data points in the graph, shifting it up or down.
The gradient descent algorithm behaves similarly, but it operates on a convex function (such as the bowl-shaped cost curve pictured on the slide).
GRADIENT DESCENT
Gradient Descent Procedure
1) Initialize the coefficient.
2) Evaluate the cost function for the coefficient.
3) Get the derivative of the cost.
4) Update the coefficient with the learning rate value.
5) Repeat from step 2 until all dataset records have been processed.
Local minima: points that appear to be minima, but are not where the function actually takes its minimum value, are called local minima.
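The danger of local minima can be seen with a hedged Python sketch on a non-convex toy function, f(x) = (x^2 - 1)^2 + 0.3x (an illustrative assumption, not a function from the slides), where the final resting point depends entirely on the starting point:

# Gradient descent on a non-convex function to illustrate local minima.
# Assumed toy function: f(x) = (x^2 - 1)^2 + 0.3 * x, which has a
# local minimum near x = +0.97 and a lower, global minimum near x = -1.04.

def f(x):
    return (x ** 2 - 1) ** 2 + 0.3 * x

def df(x):
    return 4 * x * (x ** 2 - 1) + 0.3  # derivative of f

def descend(x, learning_rate=0.01, steps=2000):
    for _ in range(steps):
        x -= learning_rate * df(x)
    return x

for start in (1.5, -1.5):
    x = descend(start)
    print(f"start = {start:+.1f} -> x = {x:+.3f}, f(x) = {f(x):+.3f}")

# Starting at +1.5, descent settles in the local minimum (f is about +0.29);
# starting at -1.5, it reaches the global minimum (f is about -0.31).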
y = mx + b

J = \sum_{i=1}^{m} (\text{guess}_i - y_i)^2

Writing the per-point error as e_i = \text{guess}_i - y_i, the cost becomes J = \sum_{i=1}^{m} e_i^2, and the chain rule gives \frac{dJ}{dm} = \frac{dJ}{de} \times \frac{de}{dm}.

The resulting coefficient update, scaled by the learning rate \alpha, is m \leftarrow m - \alpha \cdot (2 \cdot e \cdot x).
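Putting the five-step procedure and this update rule together, here is a hedged Python sketch for fitting y = mx + b; the four-point dataset, learning rate, and iteration cap are illustrative assumptions, not values from the slides:

# Gradient descent for the line y = m*x + b, following the procedure above.
# The tiny dataset and the hyperparameters are assumptions for illustration;
# the points below lie exactly on y = 2x + 1.

xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]

m, b = 0.0, 0.0   # step 1: initialize the coefficients
alpha = 0.01      # learning rate

for _ in range(5000):
    # step 2: evaluate the cost J = sum over points of (guess - y)^2
    errors = [(m * x + b) - y for x, y in zip(xs, ys)]
    J = sum(e ** 2 for e in errors)

    # step 3: derivatives of the cost (chain rule: dJ/dm = sum(2 * e * x))
    dJ_dm = sum(2 * e * x for e, x in zip(errors, xs))
    dJ_db = sum(2 * e for e in errors)

    # step 4: update the coefficients with the learning rate
    m -= alpha * dJ_dm
    b -= alpha * dJ_db

    # step 5: repeat until the cost is close to zero
    if J < 1e-12:
        break

print(f"m = {m:.3f}, b = {b:.3f}")  # expect m near 2 and b near 1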
Plot Cost versus Time: Collect and plot the cost values calculated by the algorithm on each iteration. The expectation for a well-performing gradient descent run is a decrease in cost on each iteration. If it does not decrease, try reducing your learning rate.
Learning Rate: The learning rate is a small real value such as 0.1, 0.001, or 0.0001. Try different values for your problem and see which works best.
Rescale Inputs: The algorithm will reach the minimum cost faster if the shape of the cost function is not skewed and distorted. You can achieve this by rescaling all of the input variables (X) to the same range, such as [0, 1] or [-1, 1].
Few Passes: Stochastic gradient descent often does not need more than 1-to-10 passes through the training dataset to converge on good, or good enough, coefficients.
Plot Mean Cost: The updates for each training dataset instance can result in a noisy plot of cost over time when using stochastic gradient descent. Taking the average over 10, 100, or 1000 updates can give you a better idea of the learning trend for the algorithm.
From Machinelearningmastery.com
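Several of these tips can be made concrete in a single stochastic gradient descent sketch; the data, the [0, 1] rescaling, the learning rate, and the averaging window below are all illustrative assumptions:

# Stochastic gradient descent sketch illustrating the tips above: per-record
# updates, rescaled inputs, few passes, and a mean-cost view of the trend.
import random

random.seed(0)

# Assumed toy data: y = 2x + 1 with inputs on a wide scale (1 to 100).
xs = [float(i) for i in range(1, 101)]
ys = [2 * x + 1 for x in xs]

# "Rescale Inputs": map X into [0, 1] so the cost surface is less skewed.
x_min, x_max = min(xs), max(xs)
xs = [(x - x_min) / (x_max - x_min) for x in xs]

m, b = 0.0, 0.0
alpha = 0.1    # "Learning Rate": a small real value; tune per problem
costs = []     # "Plot Cost versus Time": record the cost of every update

for epoch in range(10):                # "Few Passes": 1-to-10 epochs
    data = list(zip(xs, ys))
    random.shuffle(data)               # visit the records in random order
    for x, y in data:
        e = (m * x + b) - y            # error on this single record
        m -= alpha * 2 * e * x         # per-record coefficient updates
        b -= alpha * 2 * e
        costs.append(e ** 2)

# "Plot Mean Cost": average the noisy per-update costs over windows of 100.
window = 100
means = [sum(costs[i:i + window]) / window
         for i in range(0, len(costs), window)]
print(f"mean cost, first window: {means[0]:.1f}, last: {means[-1]:.2e}")
# The averaged cost should fall sharply from the first window to the last.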