
INTRODUCTION TO DEEP LEARNING (IT3320E)

1 - Preliminaries, Machine Learning, Artificial Neural Network

Hung Son Nguyen

HANOI UNIVERSITY OF SCIENCE AND TECHNOLOGY


SCHOOL OF INFORMATION AND COMMUNICATION TECHNOLOGY

September 20, 2023


Agenda

1 OVERVIEW OF STATISTICAL LEARNING


Formulating The Learning Problem
Loss functions
Defining Learning Algorithms

2 GRADIENT DESCENT

3 BACK PROPAGATION ALGORITHM


Network Training
Deep Learning

Prerequisites
Algebra:
  variables, coefficients, and functions
  linear equations such as $y = b + w_1 x_1 + w_2 x_2$
  logarithms, and logarithmic equations such as $y = \ln(1 + e^z)$
  sigmoid function: $\sigma(x) = \frac{1}{1 + e^{-x}} = \frac{e^x}{e^x + 1} = 1 - \sigma(-x)$
  tanh (discussed as an activation function): $f(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}}$
Linear Algebra:
  vector spaces, tensor and tensor rank
  matrix operations: multiplication, inversion, singular value decomposition (SVD)
Calculus:
  limits, derivatives, measures, integrals, etc.
  concept of a derivative, gradient or slope
  partial derivatives (which are closely related to gradients)
  chain rule (for a full understanding of the backpropagation algorithm for training neural networks)
Probability Theory and Statistics:
  familiarity with distributions, conditional and marginal distribution, expectation, variance, etc.
  mean, median, outliers, and standard deviation
  ability to read a histogram

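The sigmoid and tanh formulas listed in the prerequisites can be checked numerically. The following is a minimal sketch using only Python's standard library; the function names and the test points are chosen here for illustration and do not come from the slides.

```python
import math

def sigmoid(x: float) -> float:
    """Logistic sigmoid 1 / (1 + e^(-x))."""
    return 1.0 / (1.0 + math.exp(-x))

def tanh(x: float) -> float:
    """Hyperbolic tangent written out from its definition."""
    return (math.exp(x) - math.exp(-x)) / (math.exp(x) + math.exp(-x))

for x in (-2.0, 0.0, 0.5, 3.0):
    # Both closed forms of the sigmoid agree, and sigma(x) = 1 - sigma(-x).
    assert math.isclose(sigmoid(x), math.exp(x) / (math.exp(x) + 1.0))
    assert math.isclose(sigmoid(x), 1.0 - sigmoid(-x))
    # The hand-written tanh matches the library implementation.
    assert math.isclose(tanh(x), math.tanh(x), abs_tol=1e-12)
```
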
What is Machine Learning
In 1959, Arthur Samuel defined Machine Learning as the “field of study that gives
computers the capability to learn without being explicitly programmed”.
Traditional Programming: data and a program are run on the computer to
produce the output.
Machine Learning: data and the desired output are run on the computer to create a
program. This program can then be used in traditional programming.

Types of Machine Learning
Supervised ML (inductive learning): training data includes the desired outputs.
Unsupervised ML: training data does not include the desired outputs. Clustering is an example. It is hard to tell what is good learning and what is not.
Reinforcement learning: learning from rewards received after a sequence of actions. It is popular in AI research and is the most ambitious type of learning.
Ensemble Learning: techniques that create multiple models and then combine them to produce improved results.
Deep Learning: uses multiple layers to progressively extract higher-level features from the raw input.

Key Elements of Machine Learning Algorithms
There are tens of thousands of machine learning algorithms, and hundreds of
new algorithms are developed every year. Each of them has three components:
Representation: how to represent knowledge. Examples include decision
trees, sets of rules, instances, graphical models, neural networks, support
vector machines, model ensembles and others.
Evaluation: how to evaluate candidate programs (hypotheses). Examples
include accuracy, precision and recall, squared error, likelihood, posterior
probability, cost, margin, entropy, KL divergence and others.
Optimization: how candidate programs are generated, also known as the
search process. Examples include combinatorial optimization, convex
optimization and constrained optimization.
All machine learning algorithms are combinations of these three components, which
gives a framework for understanding all algorithms.

Key Elements of Machine Learning

[Figure: key elements of machine learning]

Some Terminology of Machine Learning
Model: Also known as “hypothesis”, a machine learning model is the mathematical representation
of a real-world process. A machine learning algorithm along with the training data
builds a machine learning model.

Feature: A feature is a measurable property or parameter of the data-set.

Feature Vector: It is a set of multiple numeric features. We use it as an input to the machine learning
model for training and prediction purposes.

Training: An algorithm takes a set of data known as “training data” as input. The learning algorithm
finds patterns in the input data and trains the model for expected results (target). The
output of the training process is the machine learning model.

Prediction: Once the machine learning model is ready, it can be fed with input data to provide a
predicted output.

Target (Label): The value that the machine learning model has to predict is called the target or label.

Overfitting: The model fits the training data too closely and learns from
the noise and inaccurate data entries as well as from the underlying pattern. Such a model
fails to characterise new data correctly.

Underfitting: The model fails to capture the underlying trend in the input data,
which degrades its accuracy. In simple terms, the model or
the algorithm does not fit the data well enough.

ML in Practice
Start Loop
1 Understand the domain, prior knowledge and goals. Talk to domain experts. Often
the goals are very unclear. You often have more things to try than you can possibly
implement.
2 Data integration, selection, cleaning and pre-processing: the most time-consuming
part. It is important to have high-quality data. The more data you have, the harder
this gets, because real data is dirty. Garbage in, garbage out.
3 Learning models. The fun part. This part is very mature. The tools are general.
4 Interpreting results. Sometimes it does not matter how the model works as long as it
delivers results. Other domains require that the model is understandable. You will
be challenged by human experts.
5 Consolidating and deploying discovered knowledge. The majority of projects that
are successful in the lab are not used in practice. It is very hard to get something
used.

End Loop

How to be the expert in ML?

It is mandatory to learn a programming language, preferably Python, along with
the required analytical and mathematical knowledge. Here are the five
mathematical areas that you need to brush up on before jumping into solving
Machine Learning problems:

Linear algebra for data analysis: Scalars, Vectors, Matrices, and Tensors
Mathematical Analysis: Derivatives and Gradients
Probability theory and statistics
Multivariate Calculus
Algorithms and Complex Optimizations

Become an expert in ML

Python is widely considered the best programming language for Machine Learning
applications, largely because of its libraries:
NumPy, OpenCV, and Scikit are used when working with images
NLTK along with NumPy and Scikit again when working with text
Librosa for audio applications
Matplotlib, Seaborn, and Scikit for data representation
TensorFlow and PyTorch for Deep Learning applications
SciPy for Scientific Computing
Django for integrating web applications
Pandas for high-level data structures and analysis
Other programming languages that can be used for Machine Learning
applications are R, C++, JavaScript, Java, C#, Julia, Shell, TypeScript, and Scala.

Commonly used Supervised Learning Algorithms

Linear Regression
Logistic Regression
Decision Tree
SVM
Naive Bayes
kNN
Random Forest
Dimensionality Reduction Algorithms
Gradient Boosting algorithms
GBM
XGBoost
LightGBM
CatBoost

Section 1

Overview of Statistical Learning


Formulating the Learning Problem
MAIN INGREDIENTS:
X : the input space, Y: the output space;
ρ: the unknown distribution on X × Y
ℓ : Y × Y → R a loss function measuring the discrepancy ℓ(y, y′ ) between
any two values y, y′ ∈ Y.

WE WOULD LIKE TO MINIMIZE THE EXPECTED RISK:

\[
\min_{f : X \to Y} E(f), \qquad \text{where} \quad E(f) = \int_{X \times Y} \ell(f(x), y) \, d\rho(x, y)
\]

E(f) is the expected prediction error incurred by a predictor f : X → Y.

Input and output spaces
INPUT SPACE
Linear Spaces:                      Structured Spaces:
  Vectors                             Strings
  Matrices                            Graphs
  Functions                           Probabilities
  …                                   Points on a manifold

OUTPUT SPACE
Linear Spaces:                      Structured Spaces:
  Y = R: Regression                   Strings
  Y = {1, . . . , T}: Classification  Graphs
  Y = R^T: Multi-task learning        Probabilities
  …                                   Orders (i.e. Ranking)

Probability Distribution

Informally: the distribution ρ on X × Y encodes the probability of getting a
pair (x, y) ∈ X × Y when observing (sampling from) the unknown process.
Throughout the course we will assume ρ(x, y) = ρ(y | x) · ρ_X(x), where
  ρ_X(x) is the marginal distribution on X,
  ρ(y | x) is the conditional distribution on Y given x ∈ X.

ρ(y | x) characterizes the relation between a given input x and the possible
outcomes y that could be observed.
In noisy settings it represents the uncertainty in our observations.
Example: y = f∗(x) + ε, where f∗ : X → R is the true function and ε ∼ N(0, σ) is
Gaussian distributed noise. Then:

ρ(y | x) = N(f∗(x), σ)

Definition of Statistical Learning

y = f(x) + ε
f represents the information that x provides about y.
Definition: Statistical Learning refers to a set of techniques/approaches to
estimate the function f.
Questions: Why should we estimate f, and what are the techniques to
estimate f?

Why Estimate f?

Prediction: E(y − ŷ)² is the average, or expected value, of the squared
difference between the predicted value ŷ = f̂(x) and the actual value of y:

\[
E(y - \hat{y})^2 = E[f(x) + \varepsilon - \hat{f}(x)]^2
= \underbrace{[f(x) - \hat{f}(x)]^2}_{\text{reducible}} + \underbrace{\mathrm{Var}(\varepsilon)}_{\text{irreducible}}
\]

Inference: We are often interested in understanding the association
between the output and the inputs:
  Which predictors are associated with the response?
  What is the relationship between the response and each predictor?
  Can the relationship between Y and each predictor be adequately summarized
  using a linear equation, or is the relationship more complicated?

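The split of the prediction error into a reducible and an irreducible part can be illustrated with a small simulation. This is only a sketch: `f_true`, `f_hat`, the input `x`, and the noise level `sigma` (taken here as a standard deviation) are hypothetical choices made for the example, and the noise is assumed to be zero-mean Gaussian.

```python
import random

random.seed(0)

def f_true(x):
    return 2.0 * x          # hypothetical "true" function f

def f_hat(x):
    return 2.0 * x + 0.3    # hypothetical estimate, off by a constant 0.3

sigma = 0.5                 # assumed standard deviation of the noise
x = 1.0                     # the decomposition is evaluated at a fixed input x
n = 200_000

# Monte Carlo estimate of E(y - y_hat)^2 with y = f(x) + eps
mse = sum((f_true(x) + random.gauss(0.0, sigma) - f_hat(x)) ** 2
          for _ in range(n)) / n

reducible = (f_true(x) - f_hat(x)) ** 2   # [f(x) - f_hat(x)]^2
irreducible = sigma ** 2                  # Var(eps)
print(mse, reducible + irreducible)       # the two numbers should be close
```

Improving `f_hat` can shrink only the first term; the noise variance remains even for a perfect estimate.
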
Loss and Cost

A loss function is any function L : Y × Y′ → R+ that evaluates how well our
algorithm models our dataset. Usually, the loss function is applied to the
desired output and the predicted output.
The cost function is the average loss over the entire training dataset:

\[
\mathrm{Cost}(f) = \frac{1}{n} \sum_{i=1}^{n} L(y_i, \hat{y}_i), \qquad \text{where } \hat{y}_i = f(x_i)
\]

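As a small illustration of the difference between a loss (one example) and a cost (average over the dataset), consider the sketch below. The squared loss, the toy data, and the predictor `f` are assumptions made for this example.

```python
def squared_loss(y, y_hat):
    """Loss for a single (desired output, predicted output) pair."""
    return (y - y_hat) ** 2

def cost(f, xs, ys, loss):
    """Average loss of predictor f over the dataset (xs, ys)."""
    return sum(loss(y, f(x)) for x, y in zip(xs, ys)) / len(xs)

xs = [0.0, 1.0, 2.0, 3.0]
ys = [0.0, 2.0, 4.0, 6.0]            # generated by y = 2x

def f(x):
    return 2.0 * x + 1.0             # hypothetical predictor, off by 1 everywhere

print(cost(f, xs, ys, squared_loss))  # 1.0
```
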
Some loss functions for classifications

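Cross Entropy Loss function for 2 classes, Y = {0, 1}:

\[
L_{CE} : \{0, 1\} \times [0, 1] \to \mathbb{R}, \qquad
L_{CE}(y, \hat{y}) = -\left[ y \ln(\hat{y}) + (1 - y) \ln(1 - \hat{y}) \right]
\]

Cross Entropy Loss function for multiclass, Y = {1, · · · , C}:

\[
L_{CE} : \{0, 1\}^C \times [0, 1]^C \to \mathbb{R}, \qquad
L_{CE}(y, \hat{y}) = -\sum_{i=1}^{C} y_i \ln(\hat{y}_i)
\]

where y is the one-hot encoding of the class label and ŷ = (ŷ_1, · · · , ŷ_C) is the class
distribution returned by the model, i.e. ŷ_1 + · · · + ŷ_C = 1.
Hinge Loss (for binary classification, Y = {−1, 1}):

\[
L_H : \{-1, 1\} \times \mathbb{R} \to \mathbb{R}, \qquad
L_H(y, \hat{y}) = \max\{0, 1 - y \cdot \hat{y}\}
\]
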
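The three classification losses above can be written in a few lines. This is a minimal sketch in plain Python; the function names are illustrative, and predicted probabilities are assumed to lie strictly between 0 and 1 so that the logarithms are defined.

```python
import math

def binary_cross_entropy(y, y_hat):
    """y in {0, 1}; y_hat in (0, 1) is the predicted probability of class 1."""
    return -(y * math.log(y_hat) + (1 - y) * math.log(1 - y_hat))

def cross_entropy(y_onehot, y_hat):
    """y_onehot: one-hot target; y_hat: predicted class distribution."""
    return -sum(t * math.log(p) for t, p in zip(y_onehot, y_hat) if t > 0)

def hinge(y, score):
    """y in {-1, +1}; score is the real-valued model output."""
    return max(0.0, 1.0 - y * score)

print(binary_cross_entropy(1, 0.9))               # equals -ln(0.9)
print(cross_entropy([0, 1, 0], [0.2, 0.7, 0.1]))  # equals -ln(0.7)
print(hinge(1, 2.5), hinge(-1, 0.5))              # 0.0 1.5
```

With two classes, the multiclass formula applied to the one-hot target (1 − y, y) and the distribution (1 − ŷ, ŷ) reduces to the binary formula.
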
Illustrations of loss functions

Loss             Formula                             Used in
Cross entropy    −[y ln(ŷ) + (1 − y) ln(1 − ŷ)]      Neural Network
Hinge loss       max{0, 1 − y · ŷ}                   SVM
Logistic loss    log(1 + exp(−y · ŷ))                Logistic regression

Other Loss functions for y = 1

[Figure: other loss functions for the target y = 1]

Some loss functions for learning class distributions

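Kullback-Leibler (KL)-divergence loss function

\[
\begin{aligned}
KL(q_\theta \,\|\, p) &= \sum_d q(d) \log\frac{q(d)}{p(d)} = \sum_d q(d) \left( \log q(d) - \log p(d) \right) \\
&= \underbrace{\sum_d q(d) \log q(d)}_{-\text{entropy}} \; \underbrace{- \sum_d q(d) \log p(d)}_{\text{cross-entropy}}
\end{aligned}
\]

\[
\begin{aligned}
KL(p \,\|\, q_\theta) &= \sum_d p(d) \log\frac{p(d)}{q(d)} = \sum_d p(d) \left( \log p(d) - \log q(d) \right) \\
&= \sum_d p(d) \log p(d) - \sum_d p(d) \log q(d)
\end{aligned}
\]
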
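The decomposition of the KL divergence into cross-entropy minus entropy, and the fact that the divergence is not symmetric in its arguments, can be verified on a small example. The two distributions `p` and `q` below are arbitrary illustrative values.

```python
import math

def entropy(p):
    """H(p) = -sum_d p(d) log p(d)."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def cross_entropy(p, q):
    """H(p, q) = -sum_d p(d) log q(d)."""
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)

def kl(p, q):
    """KL(p || q) = sum_d p(d) log(p(d) / q(d))."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.3, 0.2]
q = [0.4, 0.4, 0.2]

# KL(p || q) = cross-entropy(p, q) - entropy(p)
assert math.isclose(kl(p, q), cross_entropy(p, q) - entropy(p))
print(kl(p, q), kl(q, p))   # two different non-negative values
```
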
Some loss functions for Regression
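Loss functions used in regression tasks, i.e. Y = R:

MAE (absolute error) or L1:
\[ L_1(y, \hat{y}) = |y - \hat{y}| \]
MSE (square error) or L2:
\[ L_2(y, \hat{y}) = \tfrac{1}{2} (y - \hat{y})^2 \]
ε-insensitive:
\[ V_\varepsilon(y, \hat{y}) = \max(|y - \hat{y}| - \varepsilon, 0) \]
Huber Loss is a modification of MSE:
\[
L_\delta(y, \hat{y}) =
\begin{cases}
\tfrac{1}{2} (y - \hat{y})^2 & \text{if } |y - \hat{y}| < \delta \\
\delta |y - \hat{y}| - \tfrac{1}{2} \delta^2 & \text{otherwise}
\end{cases}
\]
Log-Cosh Loss:
\[
L_{\log}(y, \hat{y}) = \frac{1}{a} \ln(\cosh(a(\hat{y} - y))) = \frac{1}{a} \ln \frac{e^{a(\hat{y} - y)} + e^{-a(\hat{y} - y)}}{2}
\]
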
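To compare how these regression losses treat small and large residuals, the formulas can be sketched directly in Python. The function names and the sample values are illustrative assumptions.

```python
import math

def l1(y, y_hat):
    return abs(y - y_hat)

def l2(y, y_hat):
    return 0.5 * (y - y_hat) ** 2

def eps_insensitive(y, y_hat, eps):
    return max(abs(y - y_hat) - eps, 0.0)

def huber(y, y_hat, delta):
    r = abs(y - y_hat)
    # quadratic near zero, linear for large residuals
    return 0.5 * r ** 2 if r < delta else delta * r - 0.5 * delta ** 2

def log_cosh(y, y_hat, a=1.0):
    return math.log(math.cosh(a * (y_hat - y))) / a

for y_hat in (0.5, 3.0):   # a small and a large residual for y = 0
    print(y_hat, l1(0.0, y_hat), l2(0.0, y_hat),
          eps_insensitive(0.0, y_hat, 0.5), huber(0.0, y_hat, 1.0),
          log_cosh(0.0, y_hat))
```

For the large residual the squared error grows quadratically, while the Huber and log-cosh losses grow roughly linearly, which makes them less sensitive to outliers.
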
Formulating the Learning Problem

The relation between X and Y encoded by the distribution ρ is unknown in
reality. The only way we have to access a phenomenon is from finite
observations.
The goal of a learning algorithm is therefore to find a good approximation
f_n : X → Y for the minimizer of the expected risk

\[
\inf_{f : X \to Y} E(f)
\]

from a finite set of examples S = {(x_i, y_i) : i = 1, . . . , n} sampled
independently from ρ.

Defining Learning Algorithms

Let $\mathcal{S} = \bigcup_{n \in \mathbb{N}} (X \times Y)^n$ be the set of all finite datasets on X × Y. A learning
algorithm is a map

\[
A : \mathcal{S} \to F, \qquad S \mapsto A(S) : X \to Y
\]

where F is a set of possible (but not all) functions from X to Y.

In case S = {(x_i, y_i) : i = 1, . . . , n}, we will denote:

\[
f_n = A(\{(x_i, y_i) : i = 1, \dots, n\})
\]

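Viewed as code, a learning algorithm is a function that takes a finite dataset and returns another function, the predictor. The sketch below uses ordinary least squares on one-dimensional inputs as a stand-in for A; restricting F to affine functions x ↦ w·x + b is an assumption made for the example, not something the slides prescribe.

```python
from typing import Callable, List, Tuple

Dataset = List[Tuple[float, float]]
Predictor = Callable[[float], float]

def least_squares(S: Dataset) -> Predictor:
    """A learning algorithm A: maps a finite dataset S to f_n = A(S).

    F is taken to be the affine functions x -> w*x + b (an assumption).
    Requires at least two distinct x values in S.
    """
    n = len(S)
    mean_x = sum(x for x, _ in S) / n
    mean_y = sum(y for _, y in S) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in S)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in S)
    w = sxy / sxx
    b = mean_y - w * mean_x
    return lambda x: w * x + b

S = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0)]   # three samples of y = 2x + 1
f_n = least_squares(S)
print(f_n(3.0))   # prediction at an unseen input
```
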
Defining Learning Algorithms

DEFINITION: CLASSIFIER LEARNING

Given a data set of example pairs D = {(x_i, y_i), i = 1, . . . , n}, where x_i ∈ X ⊂ R^D is
a feature vector and y_i ∈ Y is a class label, learn a function f : R^D → Y′ that
accurately predicts the class label y for any feature vector x.
The function f is also called the model.

Section 2

Gradient Descent
Gradient Descent
Gradient descent is an iterative optimization algorithm for finding the minimum
of a function. How? Take steps proportional to the negative of the gradient of the
function at the current point.

[Figure: gradient descent on a series of level sets]

Gradient Descent Update

If we consider a function f(θ), the gradient descent update can be expressed as:

\[
\theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} f(\theta) \tag{1}
\]

for each parameter θ_j.

The size of the step is controlled by the learning rate α.

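The update rule (1) translates almost literally into code. The sketch below applies it to a hypothetical convex function f(θ) = (θ_0 − 3)² + 2(θ_1 + 1)², whose gradient is written out by hand; the function, the starting point, and the learning rate are assumptions chosen for illustration.

```python
def gradient_descent(grad, theta, alpha, steps):
    """Repeatedly apply theta_j := theta_j - alpha * d f / d theta_j."""
    for _ in range(steps):
        g = grad(theta)
        # all parameters are updated simultaneously from the same gradient
        theta = [t - alpha * g_j for t, g_j in zip(theta, g)]
    return theta

def grad_f(theta):
    """Gradient of f(theta) = (theta_0 - 3)^2 + 2 * (theta_1 + 1)^2."""
    return [2.0 * (theta[0] - 3.0), 4.0 * (theta[1] + 1.0)]

theta = gradient_descent(grad_f, theta=[0.0, 0.0], alpha=0.1, steps=200)
print(theta)   # close to the minimizer [3.0, -1.0]
```
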
Visualizing Gradient Descent

[Figure: gradient descent for a 1-d function f(θ)]

Convexity
It turns out that if the function is convex, gradient descent (with a suitable learning rate)
converges to the global minimum. For non-convex functions, it may converge to a local minimum instead.

[Figure: a convex function and a non-convex function]

Gradient Descent

Gradient descent is often used in machine learning to minimize a cost function,
also called the objective or loss function and usually denoted L(·) or J(·).

The cost function depends on the model’s parameters and is a proxy for evaluating the
model’s performance. Generally speaking, in this framework minimizing the cost
amounts to maximizing the effectiveness of the model.

Stochastic Gradient Descent

In principle, to perform a single update step you should run through all your
training examples. This is known as batch gradient descent.

A different strategy is minibatch stochastic gradient descent. In this
case, only a small subset of the training dataset is considered at each update
step.

In the extreme case in which only a single random example of the training set is
used to perform the update step, we speak of stochastic gradient descent.

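The three variants differ only in how many examples contribute to each update. The following sketch fits a line y ≈ w·x + b to noiseless toy data by minimizing the mean squared error; the data, learning rate, and epoch count are illustrative assumptions, and shuffling once per epoch is one common way (not the only one) to draw the batches.

```python
import random

def minibatch_sgd(data, batch_size, alpha, epochs, seed=0):
    """Fit y ~ w*x + b by minimizing the mean squared error.

    batch_size = len(data) gives batch gradient descent,
    batch_size = 1 gives stochastic gradient descent.
    """
    rng = random.Random(seed)
    data = list(data)
    w, b = 0.0, 0.0
    for _ in range(epochs):
        rng.shuffle(data)
        for start in range(0, len(data), batch_size):
            batch = data[start:start + batch_size]
            # gradient of the mean of (w*x + b - y)^2 over the current batch
            gw = sum(2.0 * (w * x + b - y) * x for x, y in batch) / len(batch)
            gb = sum(2.0 * (w * x + b - y) for x, y in batch) / len(batch)
            w -= alpha * gw
            b -= alpha * gb
    return w, b

# noiseless samples of y = 2x + 1 on x = 0.0, 0.1, ..., 1.9
data = [(k / 10.0, 2.0 * (k / 10.0) + 1.0) for k in range(20)]

for batch_size in (len(data), 5, 1):
    print(batch_size, minibatch_sgd(data, batch_size, alpha=0.1, epochs=500))
```

All three settings should approach w = 2 and b = 1 here; smaller batches perform more (and noisier) updates per pass over the data.
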
Learning Rate

Choosing the right learning rate α is essential to proceed correctly towards
the minimum. A step that is too small can lead to extremely slow convergence. If
the step is too big, the optimizer can overshoot the minimum or even diverge.

[Figure: learning rate too small vs. learning rate too big]

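The effect of the learning rate is easy to see on the one-dimensional function f(θ) = θ², where each update multiplies θ by (1 − 2α). The three values of α below are illustrative assumptions.

```python
def run(alpha, steps=50, theta=1.0):
    """Gradient descent on f(theta) = theta^2, starting from theta = 1."""
    for _ in range(steps):
        theta -= alpha * 2.0 * theta   # the gradient of theta^2 is 2 * theta
    return theta

for alpha in (0.001, 0.4, 1.1):
    print(alpha, run(alpha))
```

With α = 0.001 the iterate has barely moved after 50 steps, with α = 0.4 it is essentially at the minimum, and with α = 1.1 the factor (1 − 2α) has magnitude larger than 1, so the iterates overshoot and grow without bound.
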
Advanced Optimizers
In practice, it’s quite rare to see the procedure described above (so-called vanilla
SGD) used for real-world optimization.

Instead, a number of more advanced optimizers [1,2,3] are commonly used.
However, these advanced optimization techniques are out of the scope of this
short overview.

[1] J. Duchi, E. Hazan, and Y. Singer. Adaptive subgradient methods for online
learning and stochastic optimization. Journal of Machine Learning Research,
12(Jul):2121–2159, 2011.
[2] D. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv
preprint arXiv:1412.6980, 2014.
[3] M. D. Zeiler. Adadelta: an adaptive learning rate method. arXiv preprint
arXiv:1212.5701, 2012.
Section 3

Back propagation algorithm


Back propagation algorithm
[Figure: network with an input layer (Input #1 to #4), hidden layer 1, hidden layer 2 and an output layer (Output #1 to #3), annotated “Back propagation”]

Multilayer Perceptrons

A multilayer perceptron represents an adaptable model y(·, w) able to map
D-dimensional input to C-dimensional output:

\[
y(\cdot, w) : \mathbb{R}^D \to \mathbb{R}^C, \quad x \mapsto y(x, w) =
\begin{pmatrix} y_1(x, w) \\ \vdots \\ y_C(x, w) \end{pmatrix}. \tag{2}
\]

In general, an (L + 1)-layer perceptron consists of (L + 1) layers, each layer l
computing linear combinations of the previous layer (l − 1) (or the input).

Multilayer Perceptrons – First Layer

On input x ∈ R^D, layer l = 1 computes a vector $y^{(1)} := (y_1^{(1)}, \dots, y_{m^{(1)}}^{(1)})$ where

\[
y_i^{(1)} = f\left(z_i^{(1)}\right) \quad \text{with} \quad z_i^{(1)} = \sum_{j=1}^{D} w_{i,j}^{(1)} x_j + w_{i,0}^{(1)}, \tag{3}
\]

where f is called the activation function and $w_{i,j}^{(1)}$ are adjustable weights. The i-th component is called “unit i”.

Multilayer Perceptrons – First Layer

What does this mean?

Layer l = 1 computes linear combinations of the input and applies a
(non-linear) activation function ...

The first layer can be interpreted as a generalized linear model:

\[
y_i^{(1)} = f\left( \left(w_i^{(1)}\right)^T x + w_{i,0}^{(1)} \right). \tag{4}
\]

Idea: Recursively apply L additional layers on the output $y^{(1)}$ of the first layer.

Multilayer Perceptrons – Further Layers

In general, layer l computes a vector $y^{(l)} := (y_1^{(l)}, \dots, y_{m^{(l)}}^{(l)})$ as follows:

\[
y_i^{(l)} = f\left(z_i^{(l)}\right) \quad \text{with} \quad z_i^{(l)} = \sum_{j=1}^{m^{(l-1)}} w_{i,j}^{(l)} y_j^{(l-1)} + w_{i,0}^{(l)}. \tag{5}
\]

Thus, layer l computes linear combinations of layer (l − 1) and applies an
activation function ...

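Equation (5) can be sketched as a short forward pass. In this illustration each layer is stored as a pair `(W, b)`, where `W[i][j]` plays the role of the weight w_{i,j} and `b[i]` the bias w_{i,0}; this storage layout, the ReLU activation, and the tiny example network are assumptions made for the example.

```python
import math

def layer(W, b, y_prev, f):
    """One layer: z_i = sum_j W[i][j] * y_prev[j] + b[i], then y_i = f(z_i)."""
    return [f(sum(w_ij * y_j for w_ij, y_j in zip(row, y_prev)) + b_i)
            for row, b_i in zip(W, b)]

def forward(layers, x, f=math.tanh):
    """Apply the layers one after another, starting from the input x."""
    y = x
    for W, b in layers:
        y = layer(W, b, y, f)
    return y

def relu(z):
    return max(0.0, z)

# hypothetical network: 2 inputs -> 2 hidden units -> 1 output
layers = [
    ([[1.0, -1.0], [0.5, 0.5]], [0.0, -1.0]),
    ([[2.0, 1.0]], [0.5]),
]
print(forward(layers, [3.0, 1.0], relu))   # [5.5]
```

Here the hidden layer yields (2.0, 1.0), and the output unit computes 2·2.0 + 1·1.0 + 0.5 = 5.5.
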
Multilayer Perceptrons – Output Layer

Layer (L + 1) is called the output layer because it computes the output of the
multilayer perceptron:

\[
y(x, w) = \begin{pmatrix} y_1(x, w) \\ \vdots \\ y_C(x, w) \end{pmatrix}
:= \begin{pmatrix} y_1^{(L+1)} \\ \vdots \\ y_C^{(L+1)} \end{pmatrix} = y^{(L+1)} \tag{6}
\]

where $C = m^{(L+1)}$ is the number of output dimensions.

Network Graph

[Figure: network graph with inputs $x_1, \dots, x_D$, the 1st layer with units $y_1^{(1)}, \dots, y_{m^{(1)}}^{(1)}$, ..., the L-th layer with units $y_1^{(L)}, \dots, y_{m^{(L)}}^{(L)}$, and the outputs $y_1^{(L+1)}, \dots, y_C^{(L+1)}$]

Activation Functions – Notions

How to choose the activation function f in each layer?

Non-linear activation functions will increase the expressive power:
Multilayer perceptrons with L + 1 ≥ 2 are universal approximators [?]!
Depending on the application: For classification we may want to interpret
the output as posterior probabilities:

\[
y_i(x, w) \overset{!}{=} p(c = i \mid x) \tag{7}
\]

where c denotes the random variable for the class.

Activation Functions

Activation layers compute a non-linear activation function elementwise on the
input volume. The most common activations are ReLU, sigmoid and tanh.

Nonetheless, more complex activation functions exist [?, ?].

Activation Functions
For classification with C > 1 classes, layer (L + 1) uses the softmax activation
function:

\[
y_i^{(L+1)} = \sigma(z^{(L+1)}, i) = \frac{\exp\left(z_i^{(L+1)}\right)}{\sum_{k=1}^{C} \exp\left(z_k^{(L+1)}\right)}. \tag{8}
\]

Then, the output can be interpreted as posterior probabilities.

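A direct implementation of equation (8) can overflow for large inputs, so the sketch below subtracts the maximum entry before exponentiating. That shift is a common numerical safeguard rather than something stated on the slide; it leaves the result unchanged because the same factor cancels in numerator and denominator.

```python
import math

def softmax(z):
    """Map a list of real scores z to a probability distribution."""
    m = max(z)                              # shift for numerical stability
    exps = [math.exp(z_i - m) for z_i in z]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])
print(probs, sum(probs))   # non-negative entries that sum to 1 (up to rounding)
```
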
Network Training – Notions

By now, we have a general model y(·, w) depending on W weights.

Idea: Learn the weights to perform

regression,
or classification.

We focus on classification.

Network Training – Training Set

Given a training set

\[
U_S = \{(x_n, t_n) : 1 \le n \le N\}, \tag{9}
\]

where, for C classes, the targets t_n use a 1-of-C coding scheme, learn the mapping represented by U_S ...

by minimizing the squared error

\[
E(w) = \sum_{n=1}^{N} E_n(w) = \sum_{n=1}^{N} \sum_{i=1}^{C} \left( y_i(x_n, w) - t_{n,i} \right)^2 \tag{10}
\]

using iterative optimization.

Training Protocols

We distinguish ...

STOCHASTIC TRAINING: A training sample (x_n, t_n) is chosen at random, and the
weights w are updated to minimize E_n(w).

BATCH AND MINI-BATCH TRAINING: A set M ⊆ {1, . . . , N} of training samples is
chosen and the weights w are updated based on the cumulative
error $E_M(w) = \sum_{n \in M} E_n(w)$.

Of course, online training is possible as well.

Iterative Optimization
Problem: How to minimize En (w) (stochastic training)?

En (w) may be highly non-linear with many poor local minima.

Framework for iterative optimization: Let ...

w[0] be an initial guess for the weights (several initialization techniques are
available),
and w[t] be the weights at iteration t.

In iteration [t + 1], choose a weight update ∆w[t] and set

w[t + 1] = w[t] + ∆w[t] (11)

Gradient Descent

Remember:

Gradient descent minimizes the error E_n(w) by taking steps in the direction of
the negative gradient:

\[
\Delta w[t] = -\gamma \frac{\partial E_n}{\partial w[t]} \tag{12}
\]

where γ defines the step size.

Gradient Descent – Visualization

[Figure: gradient descent iterates w[0], w[1], w[2], w[3], w[4]]

Error Backpropagation
Problem: How to evaluate $\frac{\partial E_n}{\partial w[t]}$ in iteration [t + 1]?

The “Error Backpropagation” algorithm allows us to evaluate $\frac{\partial E_n}{\partial w[t]}$ in O(W)!
Feed-forward step: Calculate the output for every neuron from the input
layer, to the hidden layers, to the output layer.
Backward step: Calculate the error in the outputs and travel back from the
output layer to the hidden layers to adjust the weights such that the error is
decreased.
A similar algorithm allows us to evaluate the Hessian $\frac{\partial^2 E_n}{\partial w[t]^2}$ such that
second-order optimization can be used.

Further details ...

See the original paper “Learning Representations by Back-Propagating
Errors,” by Rumelhart et al. [?].

Backpropagation: Feed-forward step

For an input vector x_n, do a forward step to compute the activations and outputs
for all layers in the network (as described in previous slides):

The first layer:

\[
y_i^{(1)} = f\left( \left(w_i^{(1)}\right)^T \cdot x_n + w_{i,0}^{(1)} \right).
\]

Layer l computes linear combinations of layer (l − 1) and applies an
activation function:

\[
z_i^{(l)} = \sum_{j=1}^{m^{(l-1)}} w_{i,j}^{(l)} y_j^{(l-1)} + w_{i,0}^{(l)} \quad \text{and then} \quad y_i^{(l)} = f\left(z_i^{(l)}\right)
\]

for l = 2, ..., L + 1.

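Backpropagation: Backward step

1 Calculate the error functions δ starting from the output units:

\[
\delta_k^{(L+1)} = 2 \left( y_k^{(L+1)} - t_k \right) \cdot f'\left(z_k^{(L+1)}\right)
\]

2 Calculate the remaining error functions by working backwards using the
backpropagation algorithm:

\[
\delta_j^{(l)} = f'\left(z_j^{(l)}\right) \cdot \sum_k w_{k,j}^{(l+1)} \delta_k^{(l+1)}
\]

3 Estimate the required derivatives (with $y^{(0)} = x_n$ for the first layer):

\[
\nabla E = \frac{\partial E_n}{\partial w_{k,j}^{(l)}} = \delta_k^{(l)} \cdot y_j^{(l-1)}
\]

Note that for the bias term $b_k^{(l)} = w_{k,0}^{(l)}$ of a layer l, the input is the constant 1, so $\frac{\partial E_n}{\partial b_k^{(l)}} = \delta_k^{(l)}$.
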
Backpropagation: Backward step

4 Change the weights based on the estimated gradients by −γ · ∇E:

\[
w[t + 1] = w[t] - \gamma \frac{\partial E_n}{\partial w[t]}
\]

where γ defines the step size.

5 Go back to the forward step and repeat until a maximum number of iterations or a desired
minimum error is reached.

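The feed-forward and backward steps fit together as in the following sketch, written in plain Python for a small fully connected network. Several choices here are assumptions made for illustration: tanh is used as the activation f in every layer, each layer is stored as a pair `(W, b)` with `b[k]` standing for the bias w_{k,0}, and the network size, sample, and step size are arbitrary.

```python
import math
import random

def f(z):
    return math.tanh(z)              # activation used in every layer (an assumption)

def df(z):
    return 1.0 - math.tanh(z) ** 2   # its derivative f'(z)

def forward(layers, x):
    """Feed-forward step: pre-activations zs[l] and outputs ys[l + 1] of each layer."""
    zs, ys = [], [x]
    for W, b in layers:
        z = [sum(w * y for w, y in zip(row, ys[-1])) + b_i for row, b_i in zip(W, b)]
        zs.append(z)
        ys.append([f(z_i) for z_i in z])
    return zs, ys

def error(layers, x, t):
    """Squared error E_n = sum_k (y_k - t_k)^2 for one sample."""
    return sum((y_k - t_k) ** 2 for y_k, t_k in zip(forward(layers, x)[1][-1], t))

def backward(layers, zs, ys, t):
    """Backward step: returns one (dE/dW, dE/db) pair per layer."""
    # Step 1: errors of the output units
    delta = [2.0 * (y_k - t_k) * df(z_k) for y_k, t_k, z_k in zip(ys[-1], t, zs[-1])]
    grads = []
    for l in range(len(layers) - 1, -1, -1):
        # Step 3: dE/dW[k][j] = delta_k * (input j of this layer), dE/db[k] = delta_k
        grads.append(([[d_k * y_j for y_j in ys[l]] for d_k in delta], delta))
        if l > 0:
            # Step 2: propagate the errors back through the weights of layer l
            W = layers[l][0]
            delta = [df(z_j) * sum(W[k][j] * delta[k] for k in range(len(delta)))
                     for j, z_j in enumerate(zs[l - 1])]
    return grads[::-1]

def sgd_step(layers, x, t, gamma):
    """Step 4: w[t + 1] = w[t] - gamma * dE_n/dw[t]."""
    zs, ys = forward(layers, x)
    grads = backward(layers, zs, ys, t)
    return [([[w - gamma * g for w, g in zip(row_w, row_g)] for row_w, row_g in zip(W, gW)],
             [b_i - gamma * g_i for b_i, g_i in zip(b, gb)])
            for (W, b), (gW, gb) in zip(layers, grads)]

rng = random.Random(0)
sizes = [3, 4, 2]   # hypothetical network: 3 inputs, 4 hidden units, 2 outputs
layers = [([[rng.uniform(-1.0, 1.0) for _ in range(n_in)] for _ in range(n_out)],
           [0.0] * n_out) for n_in, n_out in zip(sizes, sizes[1:])]
x, t = [0.5, -0.2, 0.1], [1.0, 0.0]   # one sample with a 1-of-C target

before = error(layers, x, t)
for _ in range(200):                  # Step 5: repeat forward and backward steps
    layers = sgd_step(layers, x, t, gamma=0.05)
after = error(layers, x, t)
print(before, after)                  # the error should have decreased
```

A useful sanity check for any backpropagation code is to compare one entry of the computed gradient against a finite-difference approximation of the same derivative.
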
Deep Learning

Multilayer perceptrons are called deep if they have more than three layers:
L + 1 > 3.

Motivation: Lower layers can automatically learn a hierarchy of features or a
suitable dimensionality reduction.

No hand-crafted features necessary anymore!

However, training deep neural networks is considered very difficult!

The error measure represents a highly non-convex, “potentially intractable” [?]
optimization problem.

Approaches to Deep Learning

Possible approaches:

Different activation functions offer faster learning, for example

max(0, z) or | tanh(z)|; (13)

unsupervised pre-training can be done layer-wise;


...

Further details ...

See “Learning Deep Architectures for AI,” by Y. Bengio [?] for a detailed
discussion of state-of-the-art approaches to deep learning.

Summary
Most prominent advantages of Backpropagation are:
Backpropagation is fast, simple and easy to program
It has no parameters to tune apart from the numbers of input
It is a flexible method as it does not require prior knowledge about the
network
It is a standard method that generally works well
It does not need any special mention of the features of the function to be
learned.
Disadvantages of using Backpropagation
The actual performance of backpropagation on a specific problem is
dependent on the input data.
Backpropagation can be quite sensitive to noisy data
You need to use the matrix-based approach for backpropagation instead of
mini-batch.

Summary

The multilayer perceptron represents a standard model of neural networks. Multilayer perceptrons ...

allow us to tailor the architecture (layers, activation functions) to the problem;
can be trained using gradient descent and error backpropagation;
can be used for learning feature hierarchies (deep learning).

Deep learning is considered difficult.
