Intro_DL_01
Prerequisites
Algebra:
  variables, coefficients, and functions
  linear equations such as y = b + w1 x1 + w2 x2
  logarithms and logarithmic equations such as y = ln(1 + e^z)
  sigmoid function: σ(x) = 1/(1 + e^(−x)) = e^x/(e^x + 1) = 1 − σ(−x)
  tanh (discussed as an activation function): f(x) = (e^x − e^(−x))/(e^x + e^(−x))
Calculus:
  limits, derivatives, measures, integrals
  concept of a derivative, gradient, or slope
  partial derivatives (which are closely related to gradients)
  chain rule (for a full understanding of the backpropagation algorithm for training neural networks)
Linear Algebra:
  vector spaces, tensors and tensor rank
  matrix operations: multiplication, inversion, singular value decomposition (SVD)
Probability Theory and Statistics:
  familiarity with distributions, conditional and marginal distributions, expectation, variance, etc.
  mean, median, outliers, and standard deviation
  ability to read a histogram
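As a quick refresher on the activation functions listed above, a minimal NumPy sketch (function names and test values are illustrative):

```python
import numpy as np

def sigmoid(x):
    # sigma(x) = 1 / (1 + exp(-x)); equivalently e^x / (e^x + 1) = 1 - sigma(-x)
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):
    # tanh(x) = (e^x - e^-x) / (e^x + e^-x); NumPy also provides np.tanh
    return (np.exp(x) - np.exp(-x)) / (np.exp(x) + np.exp(-x))

x = np.linspace(-5, 5, 11)
print(sigmoid(x))                         # values in (0, 1)
print(np.allclose(tanh(x), np.tanh(x)))   # True
```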
2
What is Machine Learning
Arthur Samuel (1959) defined Machine Learning as the “field of study that gives
computers the capability to learn without being explicitly programmed”.
Traditional Programming: Data and a program are run on the computer to
produce the output.
Machine Learning: Data and the desired output are run on the computer to create a
program. This program can then be used in traditional programming.
3
Types of Machine Learning
Supervised ML: (inductive learning) Training
data includes desired outputs.
Unsupervised ML: Training data does not
include desired outputs. An example is
clustering. It is hard to tell what is good
learning and what is not.
Reinforcement Learning: the model learns from
rewards obtained over a sequence of actions.
It is the most ambitious type of learning.
Ensemble Learning: techniques that create
multiple models and then combine them to
produce improved results.
Deep Learning: uses multiple layers to
progressively extract higher-level features
from the raw input.
4
Key Elements of Machine Learning Algorithms
There are tens of thousands of machine learning algorithms and hundreds of
new algorithms are developed every year. Each of them has three components:
Representation: how to represent knowledge. Examples include decision
trees, sets of rules, instances, graphical models, neural networks, support
vector machines, model ensembles and others.
Evaluation: the way to evaluate candidate programs (hypotheses). Examples
include accuracy, precision and recall, squared error, likelihood, posterior
probability, cost, margin, entropy, K-L divergence, and others.
Optimization: the way candidate programs are generated, known as the
search process. Examples include combinatorial optimization, convex
optimization, and constrained optimization.
All machine learning algorithms are combinations of these three components,
which together provide a framework for understanding any algorithm.
5
Key Elements of Machine Learning
6
Some Terminology of Machine Learning
Model: Also known as “hypothesis”, a machine learning model is the mathematical representa-
tion of a real-world process. A machine learning algorithm along with the training data
builds a machine learning model.
Feature Vector: It is a set of multiple numeric features. We use it as an input to the machine learning
model for training and prediction purposes.
Training: An algorithm takes a set of data known as “training data” as input. The learning algorithm
finds patterns in the input data and trains the model for expected results (target). The
output of the training process is the machine learning model.
Prediction: Once the machine learning model is ready, it can be fed with input data to provide a
predicted output.
Target (Label): The value that the machine learning model has to predict is called the target or label.
Overfitting: When a model fits its training data too closely, it learns from the noise and
inaccurate entries in that data. The model then fails to generalise to new data.
Underfitting: It is the scenario when the model fails to decipher the underlying trend in the input data.
It destroys the accuracy of the machine learning model. In simple terms, the model or
the algorithm does not fit the data well enough.
7
ML in Practice
Start Loop
1 Understand the domain, prior knowledge and goals. Talk to domain experts. Often
the goals are very unclear. You often have more things to try than you can possibly
implement.
2 Data integration, selection, cleaning and pre-processing: The most time consuming
part. It is important to have high quality data. The more data you have, the more
cleaning it needs, because real data is dirty. Garbage in, garbage out.
3 Learning models. The fun part. This part is very mature. The tools are general.
4 Interpreting results. Sometimes it does not matter how the model works as long as it
delivers results. Other domains require that the model is understandable. You will
be challenged by human experts.
5 Consolidating and deploying discovered knowledge. The majority of projects that
are successful in the lab are not used in practice. It is very hard to get something
used.
End Loop
8
How to be the expert in ML?
Linear algebra for data analysis: Scalars, Vectors, Matrices, and Tensors
Mathematical Analysis: Derivatives and Gradients
Probability theory and statistics
Multivariate Calculus
Algorithms and Complex Optimizations
9
Become an expert in ML
Python is hands down the most widely used programming language for Machine Learning
applications, thanks to the rich ecosystem of libraries listed below.
Numpy, OpenCV, and Scikit are used when working with images
NLTK along with Numpy and Scikit again when working with text
Librosa for audio applications
Matplotlib, Seaborn, and Scikit for data representation
TensorFlow and Pytorch for Deep Learning applications
Scipy for Scientific Computing
Django for integrating web applications
Pandas for high-level data structures and analysis
Other programming languages that can be used for Machine Learning
applications are R, C++, JavaScript, Java, C#, Julia, Shell, TypeScript, and Scala.
10
Commonly used Supervised Learning Algorithms
Linear Regression
Logistic Regression
Decision Tree
SVM
Naive Bayes
kNN
Random Forest
Dimensionality Reduction Algorithms
Gradient Boosting algorithms
GBM
XGBoost
LightGBM
CatBoost
11
Section 1
Statistical Learning
minimize E(f) over f : X → Y,
where E(f) = ∫_{X×Y} ℓ(f(x), y) dρ(x, y)
12
Input and output spaces
INPUT SPACE
Linear Spaces: Vectors, Matrices, Functions, …
Structured Spaces: Strings, Graphs, Probabilities, Points on a manifold
OUTPUT SPACE
Linear Spaces: Y = R: Regression; Y = {1, . . . , T}: Classification; Y = R^T: Multi-task learning; …
Structured Spaces: Strings, Graphs, Probabilities, Orders (i.e. Ranking)
13
Probability Distribution
ρ(y | x) characterizes the relation between a given input x and the possible
outcomes y that could be observed.
In noisy settings it represents the uncertainty in our observations.
Example: y = f∗(x) + ε, where f∗ : X → R is the true function and ε ∼ N (0, σ) is
Gaussian distributed noise. Then ρ(y | x) = N (f∗(x), σ).
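A tiny NumPy sketch of this noise model, assuming an arbitrary linear f∗ chosen only for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

def f_star(x):
    # an arbitrary "true" function used only for illustration
    return 2.0 * x + 1.0

x = rng.uniform(-1, 1, size=100)                      # inputs drawn from X
eps = rng.normal(loc=0.0, scale=0.3, size=x.shape)    # Gaussian noise
y = f_star(x) + eps                                   # noisy observations
```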
14
Definition of Statistical Learning
15
Why Estimate f?
16
Loss and Cost
Cost(f) = (1/n) Σ_{i=1}^{n} L(y_i, ŷ_i), where ŷ_i = f(x_i)
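A minimal sketch of this empirical cost, assuming a squared-error loss and illustrative helper names:

```python
import numpy as np

def squared_error(y, y_hat):
    # L(y, y_hat) = (y - y_hat)^2
    return (y - y_hat) ** 2

def cost(f, xs, ys, loss=squared_error):
    # Cost(f) = (1/n) * sum_i L(y_i, f(x_i))
    y_hat = np.array([f(x) for x in xs])
    return np.mean(loss(np.array(ys), y_hat))

# usage: cost(lambda x: 2 * x + 1, [0.0, 1.0, 2.0], [1.1, 2.9, 5.2])
```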
17
18
Some loss functions for classification
where ŷ = (ŷ1 , · · · , ŷC ) is the class distribution, i.e. ŷ1 + · · · + ŷC = 1, returned by
the model.
Hinge Loss (for binary classification: Y = {−1, 1})
LH : {−1, 1} × R → R
LH (y, ŷ) = max{0, 1 − y · ŷ}
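A small NumPy sketch of the hinge loss for labels y ∈ {−1, 1}, vectorized over a batch (the example scores are made up):

```python
import numpy as np

def hinge_loss(y, y_hat):
    # L_H(y, y_hat) = max(0, 1 - y * y_hat), with y in {-1, +1}
    return np.maximum(0.0, 1.0 - y * y_hat)

y = np.array([1, -1, 1, -1])
scores = np.array([0.8, -2.0, -0.3, 0.4])   # raw model outputs
print(hinge_loss(y, scores))                # [0.2, 0.0, 1.3, 1.4]
```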
19
Illustrations of loss functions
20
Other Loss functions for y = 1
21
Some loss functions for learning class distributions
KL(q_θ ∥ p) = Σ_d q(d) log( q(d)/p(d) ) = Σ_d q(d) (log q(d) − log p(d))
            = Σ_d q(d) log q(d) − Σ_d q(d) log p(d),
where the first term is the negative entropy of q and the second term is the cross-entropy between q and p.

KL(p ∥ q_θ) = Σ_d p(d) log( p(d)/q(d) ) = Σ_d p(d) (log p(d) − log q(d))
            = Σ_d p(d) log p(d) − Σ_d p(d) log q(d)
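A small NumPy sketch of these identities for discrete distributions; the epsilon added for numerical safety is my own choice, not part of the formulas:

```python
import numpy as np

def kl_divergence(q, p, eps=1e-12):
    # KL(q || p) = sum_d q(d) * (log q(d) - log p(d))
    q, p = np.asarray(q) + eps, np.asarray(p) + eps
    return np.sum(q * (np.log(q) - np.log(p)))

def entropy(q, eps=1e-12):
    q = np.asarray(q) + eps
    return -np.sum(q * np.log(q))

def cross_entropy(q, p, eps=1e-12):
    q, p = np.asarray(q) + eps, np.asarray(p) + eps
    return -np.sum(q * np.log(p))

q = [0.7, 0.2, 0.1]
p = [0.5, 0.3, 0.2]
# KL(q || p) = cross_entropy(q, p) - entropy(q)
print(np.isclose(kl_divergence(q, p), cross_entropy(q, p) - entropy(q)))  # True
```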
22
Some loss functions for Regression
Loss functions used in the regression task, i.e. Y = R, for the minimization
inf E(f) over f : X → Y
25
Defining Learning Algorithms
Let S = ∪_{n∈N} (X × Y)^n be the set of all finite datasets on X × Y. A learning
algorithm is a map
A : S → F,    S ↦ A(S) : X → Y,
so that f_n = A({(x_i, y_i) : i = 1, . . . , n})
26
Defining Learning Algorithms
27
Defining Learning Algorithms
28
Section 2
Gradient Descent
Gradient Descent
Gradient descent is an iterative optimization algorithm for finding the minimum
of a function. How? Take steps proportional to the negative of the gradient of the
function at the current point.
29
Gradient Descent Update
If we consider a function f(θ), the gradient descent update can be expressed as:
θ_j := θ_j − α ∂f(θ)/∂θ_j    (1)
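A minimal sketch of this update rule, assuming a simple quadratic f(θ) = ‖θ‖² whose gradient 2θ is known in closed form; the step count and learning rate are illustrative:

```python
import numpy as np

def grad_f(theta):
    # gradient of f(theta) = ||theta||^2 is 2 * theta
    return 2.0 * theta

theta = np.array([3.0, -2.0])   # initial guess
alpha = 0.1                     # learning rate

for _ in range(100):
    theta = theta - alpha * grad_f(theta)   # theta_j := theta_j - alpha * df/dtheta_j

print(theta)  # close to the minimizer [0, 0]
```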
30
Visualizing Gradient Descent
31
Convexity
It turns out that if the function is convex, gradient descent will converge to the
global minimum. For non-convex functions, it may converge to a local minimum.
The cost function depends on the model’s parameters and is a proxy for evaluating the
model’s performance. Generally speaking, in this framework minimizing the cost
amounts to maximizing the effectiveness of the model.
33
Stochastic Gradient Descent
In principle, to perform a single update step you should run through all your
training examples. This is known as batch gradient descent.
In the extreme case in which only a single random example of the training set is
considered to perform the update step, we speak of stochastic gradient descent.
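A sketch of stochastic gradient descent for least-squares linear regression, updating on one randomly drawn example per step; the synthetic data and hyperparameters are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

# synthetic data: y = 2x + 1 plus noise
X = rng.uniform(-1, 1, size=200)
Y = 2.0 * X + 1.0 + rng.normal(0, 0.1, size=200)

w, b, alpha = 0.0, 0.0, 0.05

for _ in range(5000):
    i = rng.integers(len(X))     # pick one random training example
    y_hat = w * X[i] + b
    err = y_hat - Y[i]           # gradient of 0.5 * (y_hat - y)^2 w.r.t. y_hat
    w -= alpha * err * X[i]
    b -= alpha * err

print(w, b)   # approximately 2 and 1
```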
34
Learning Rate
Choosing the right learning rate α is essential to correctly proceed towards
the minimum. A step that is too small could lead to extremely slow convergence. If
the step is too big, the optimizer could overshoot the minimum or even diverge.
35
Advanced Optimizers
In practice, it is quite rare to see the procedure described above (so-called vanilla
SGD) used for real-world optimization; adaptive variants such as AdaGrad [1], Adam [2],
and Adadelta [3] are far more common.
[1] J. Duchi, E. Hazan, and Y. Singer. Adaptive subgradient methods for online
learning and stochastic optimization. Journal of Machine Learning Research,
12(Jul):2121–2159, 2011.
[2] D. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv
preprint arXiv:1412.6980, 2014.
[3] M. D. Zeiler. Adadelta: an adaptive learning rate method. arXiv preprint
arXiv:1212.5701, 2012.
36
Section 3
Back propagation
[Figure: feed-forward network mapping four inputs (Input #1–#4) to three outputs (Output #1–#3)]
37
Multilayer Perceptrons
38
Multilayer Perceptrons – First Layer
On input x ∈ R^D, layer l = 1 computes a vector y^(1) := (y^(1)_1, . . . , y^(1)_{m^(1)}) where

y^(1)_i = f(z^(1)_i)  with  z^(1)_i = Σ_{j=1}^{D} w^(1)_{i,j} x_j + w^(1)_{i,0}.    (3)
39
Multilayer Perceptrons – First Layer
Idea: Recursively apply L additional layers on the output y(1) of the first layer.
40
Multilayer Perceptrons – Further Layers
In general, layer l computes a vector y^(l) := (y^(l)_1, . . . , y^(l)_{m^(l)}) as follows:

y^(l)_i = f(z^(l)_i)  with  z^(l)_i = Σ_{j=1}^{m^(l−1)} w^(l)_{i,j} y^(l−1)_j + w^(l)_{i,0}.    (5)
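A compact NumPy sketch of this layer-by-layer forward pass, with the bias w_{i,0} kept as a separate vector; layer sizes and the tanh activation are illustrative choices:

```python
import numpy as np

def layer_forward(y_prev, W, b, f=np.tanh):
    # z_i = sum_j W[i, j] * y_prev[j] + b[i];  y_i = f(z_i)
    z = W @ y_prev + b
    return f(z)

rng = np.random.default_rng(0)
D, sizes = 4, [5, 3]                      # input dimension and layer widths m^(l)
params = []
m_prev = D
for m in sizes:
    params.append((rng.normal(size=(m, m_prev)), np.zeros(m)))
    m_prev = m

x = rng.normal(size=D)
y = x
for W, b in params:
    y = layer_forward(y, W, b)            # compute y^(l) from y^(l-1)
print(y)
```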
41
Multilayer Perceptrons – Output Layer
42
Network Graph
[Figure: network graph with inputs x_1, . . . , x_D, hidden layers y^(1), . . . , y^(L) of widths m^(1), . . . , m^(L), and outputs y^(L+1)_1, . . . , y^(L+1)_C]
43
Activation Functions – Notions
44
Activation Functions
45
Activation Functions
For classification with C > 1 classes, layer (L + 1) uses the softmax activation
function:
y^(L+1)_i = σ(z^(L+1), i) = exp(z^(L+1)_i) / Σ_{k=1}^{C} exp(z^(L+1)_k).    (8)
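A small NumPy sketch of the softmax output layer; subtracting the maximum before exponentiating is a standard numerical-stability trick, not part of equation (8):

```python
import numpy as np

def softmax(z):
    # y_i = exp(z_i) / sum_k exp(z_k); shifting by max(z) avoids overflow
    z = z - np.max(z)
    e = np.exp(z)
    return e / np.sum(e)

z = np.array([2.0, 1.0, 0.1])
y = softmax(z)
print(y, y.sum())   # class probabilities over C classes, summing to 1
```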
46
Network Training – Notions
A network can be trained for regression or classification.
We focus on classification.
47
Network Training – Training Set
Given a training set with C classes, targets are encoded with the 1-of-C coding scheme,
and the error over all N examples is

E(w) = Σ_{n=1}^{N} E_n(w) = Σ_{n=1}^{N} Σ_{i=1}^{C} (y_i(x_n, w) − t_{n,i})²    (10)
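A brief NumPy sketch of the 1-of-C coding and the error E(w) above, given some matrix of network outputs; all names are illustrative:

```python
import numpy as np

def one_of_c(labels, C):
    # 1-of-C coding: T[n, i] = 1 if example n belongs to class i, else 0
    T = np.zeros((len(labels), C))
    T[np.arange(len(labels)), labels] = 1.0
    return T

def sum_of_squares_error(Y, T):
    # E = sum_n sum_i (y_i(x_n) - t_{n,i})^2
    return np.sum((Y - T) ** 2)

labels = np.array([0, 2, 1])
T = one_of_c(labels, C=3)
Y = np.array([[0.8, 0.1, 0.1],
              [0.2, 0.2, 0.6],
              [0.1, 0.7, 0.2]])   # network outputs for three examples
print(sum_of_squares_error(Y, T))
```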
48
Training Protocols
We distinguish ...
49
Iterative Optimization
Problem: How to minimize En (w) (stochastic training)?
Let w[0] be an initial guess for the weights (several initialization techniques are
available), and let w[t] be the weights at iteration t.
50
Gradient Descent
Remember:
Gradient descent minimizes the error En (w) by taking steps in the direction of
the negative gradient:
∆w[t] = −γ ∂E_n/∂w[t]    (12)
51
Gradient Descent – Visualization
[Figure: gradient descent iterates w[0], w[1], w[2], w[3], w[4] descending the error surface]
52
Error Backpropagation
Problem: How to evaluate ∂E_n/∂w[t] in iteration [t + 1]?
53
Backpropagation: Feed-forward step
For an input vector xn do a forward step to compute the activations and outputs
for all layers in the network (as described in previous slides):
for l = 2, ..., L + 1
54
Backpropagation: Backward step

δ^(L+1)_k = 2(y^(L+1)_k − t_k) · f′(z^(L+1)_k)
55
Backpropagation: Backward step

w[t + 1] = w[t] − γ ∂E_n/∂w[t]
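A minimal NumPy sketch of one stochastic training iteration combining the feed-forward and backward steps above for a single-hidden-layer network; the tanh hidden units, linear output units (so f′ = 1), and sum-of-squares error are simplifying assumptions on my part:

```python
import numpy as np

rng = np.random.default_rng(0)
D, M, C, gamma = 4, 5, 3, 0.1
W1, b1 = rng.normal(size=(M, D)) * 0.1, np.zeros(M)
W2, b2 = rng.normal(size=(C, M)) * 0.1, np.zeros(C)

x = rng.normal(size=D)
t = np.array([1.0, 0.0, 0.0])                 # 1-of-C target

# feed-forward step
z1 = W1 @ x + b1
y1 = np.tanh(z1)
z2 = W2 @ y1 + b2
y2 = z2                                       # linear output units

# backward step: deltas for output and hidden layer
delta2 = 2.0 * (y2 - t)                       # f'(z) = 1 for linear outputs
delta1 = (1.0 - y1 ** 2) * (W2.T @ delta2)    # tanh'(z) = 1 - tanh(z)^2

# gradient descent update: w[t+1] = w[t] - gamma * dE_n/dw
W2 -= gamma * np.outer(delta2, y1); b2 -= gamma * delta2
W1 -= gamma * np.outer(delta1, x);  b1 -= gamma * delta1
```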
56
Deep Learning
Multilayer perceptrons are called deep if they have more than three layers:
L + 1 > 3.
57
Approaches to Deep Learning
Possible approaches:
See “Learning Deep Architectures for AI” by Y. Bengio for a detailed
discussion of state-of-the-art approaches to deep learning.
58
Summary
Most prominent advantages of Backpropagation are:
Backpropagation is fast, simple and easy to program
It has no parameters to tune apart from the number of inputs
It is a flexible method as it does not require prior knowledge about the
network
It is a standard method that generally works well
It does not need any special mention of the features of the function to be
learned.
Disadvantages of using Backpropagation
The actual performance of backpropagation on a specific problem is
dependent on the input data.
Backpropagation can be quite sensitive to noisy data
You need to use the matrix-based approach for backpropagation instead of mini-batch.
59
Summary
60