Machine Learning
Giuditta Sigona
1. Introduction
ML can be seen as learning a function from samples, or producing knowledge from data. Learning as search requires the definition of a hypothesis space and an algorithm to search for solutions in this space.
• Supervised learning: we have an output $y_i$ for each sample $x_i$ in the dataset $D = \{(x_i, y_i)\}$.
o Classification: return the class to which a specific instance belongs.
o Regression: approximate a real-valued function.
Call $H = \{h_1, h_2, \dots, h_n\}$ the hypothesis space, i.e. the set of all possible approximations of the problem.
Given a target function $c(x)$ that we want to learn and a set $H = \{h_1, h_2, \dots, h_n\}$, the goal is to find the best $h^*$ such that $h^*(x) \approx c(x)$.
Any hypothesis that well approximates the target function over a sufficiently large set of training examples will also approximate the target function well over other unobserved examples → $h^*$ will predict correct values $h^*(x')$ for unseen instances $x'$ with respect to the unknown values $c(x')$.
• Sample Error of h w.r.t. the target function f and data sample S is the proportion of examples h misclassifies:
$error_S(h) = \frac{1}{|S|} \sum_{x \in S} \delta(f(x) \neq h(x))$
In general:
• Having more samples for training and fewer for testing improves the performance of the model: potentially a better model, but $error_S(h) \not\approx error_D(h)$.
• Having more samples for evaluation and fewer for training reduces the variance of the estimate: $error_S(h) \approx error_D(h)$, but this value may not be satisfactory.
→ Trade-off for medium-sized datasets: 2/3 for training, 1/3 for testing.
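As an illustration of this trade-off, here is a minimal sketch of a 2/3–1/3 split using scikit-learn; the dataset and the model are placeholders, not from the notes.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Toy data (placeholder): 300 samples, 4 features, 2 classes.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# 2/3 for training, 1/3 for testing, as suggested for medium-sized datasets.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=1/3, random_state=0)

h = DecisionTreeClassifier().fit(X_train, y_train)
# error_S(h): proportion of misclassified test samples.
print("test error:", np.mean(h.predict(X_test) != y_test))
```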
Overfitting: h overfits the training data if there exists another hypothesis h′ such that h has a smaller error than h′ on the training examples, but h′ has a smaller error than h over the entire distribution of instances.
K-fold Cross Validation: we can use it to compare solutions and learning algorithms. Useful evaluation measures include:
• Precision
• F1-score
• Confusion Matrix: reports how many times an instance of class Ci is classified as class Cj. The main diagonal contains the accuracy for each class; off-diagonal entries show which classes are most often confused.
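A minimal sketch of k-fold comparison and a confusion matrix with scikit-learn (the model and the data are placeholders):

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.metrics import confusion_matrix
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))
y = (X[:, 0] > 0).astype(int)

clf = DecisionTreeClassifier()
# 5-fold cross validation: the mean accuracy estimates performance on D.
scores = cross_val_score(clf, X, y, cv=5)
print("mean accuracy:", scores.mean())

# Confusion matrix on one split (here: fit on the first 200, test on the rest).
clf.fit(X[:200], y[:200])
print(confusion_matrix(y[200:], clf.predict(X[200:])))
```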
3. Decision Trees
A DT can represent a classification function by making decisions explicit. Given an instance space X defined by a set of attributes, a DT has: internal nodes that test an attribute, branches that correspond to attribute values, and leaf nodes that assign a classification.
Information gain measures how well a given attribute separates the training examples according to their target classification. It is measured as the expected reduction in the entropy of S caused by knowing the value of attribute A.
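In formulas (the standard definitions, added here for completeness):
$Entropy(S) = -\sum_{v} p_v \log_2 p_v$
$Gain(S, A) = Entropy(S) - \sum_{v \in Values(A)} \frac{|S_v|}{|S|} Entropy(S_v)$
where $S_v$ is the subset of S for which attribute A has value v.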
▪ Optimality = the attribute that reaches a decision first, guaranteeing a short tree → choose an attribute that produces pure partitions (all negative 0− or all positive 0+), which terminates that branch of the tree.
▪ The ID3 algorithm selects the attribute with the highest information gain.
Overfitting: condition in which the model completely fits the training data but fails to generalize to unseen test data. It can happen that we continue developing the tree for a single sample, obtaining a deeper tree with a deep branch only for that sample.
We must evaluate the tree at each step on a test set to see at which step we have a peak of accuracy → after this step accuracy drops and we are overfitting.
Rule Post-Pruning
Infer the tree as well as possible (allowing for overfitting). Convert the tree to an equivalent set of rules. Prune each rule by removing any preconditions whose removal improves its estimated accuracy. Sort the final rules by their estimated accuracy and consider them in this sequence when classifying.
• greedy! So not optimal.
Random Forest: ensemble method that generates a set of DTs with some random criteria (bagging, feature selection, …) and integrates their outputs into a final result.
Integration of results: majority vote (most common class returned by all the trees).
Random Forests are less sensitive to overfitting.
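A minimal scikit-learn sketch (the data is a placeholder):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))
y = (X[:, 0] * X[:, 1] > 0).astype(int)

# 100 trees, each trained on a bootstrap sample (bagging) with random
# feature selection at each split; prediction = majority vote of the trees.
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
print(forest.predict(X[:5]))
```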
• A and B are independent iff one does not affect the other:
$P(A|B) = P(A)$ or $P(B|A) = P(B)$ or $P(A, B) = P(A)P(B)$
- if $X_1, \dots, X_n$ are independent → $P(X_1, \dots, X_n) = P(X_1)P(X_2) \cdots P(X_n)$, reducing the size of the distribution from exponential to linear.
- X is conditionally independent from Y, given Z, iff: $P(X|Y, Z) = P(X|Z)$
Bayes Rule
$P(cause|effect) = \frac{P(effect|cause)\,P(cause)}{P(effect)} \iff P(Y|X) = \frac{P(X|Y)\,P(Y)}{P(X)}$
5. Bayes Learning
Bayesian learning uses Bayes' theorem to determine the conditional probability of a hypothesis given some evidence or observations.
• Provides practical learning algorithms:
- Naive Bayes learning (examples affect the probability that a hypothesis is correct)
- Combine prior knowledge (probabilities) with observed data
- Make probabilistic predictions (new instances classified by a weighted combination of multiple hypotheses)
- Requires prior probabilities (often estimated from available data)
• Provides a useful conceptual framework for evaluating other learning algorithms
Bayes Theorem: given P(h) the prior probability of the hypothesis h, and P(D) the prior probability of the training data D, the Bayes rule is:
$P(h|D) = \frac{P(D|h)P(h)}{P(D)}$
and the Maximum A Posteriori hypothesis is $h_{MAP} = \arg\max_{h \in H} P(D|h)P(h)$.
Moreover, if the prior distribution is uniform, i.e. $P(h_i) = P(h_j), \forall h_i, h_j \in H$, we can use the Maximum Likelihood hypothesis $h_{ML}$ and have:
$h_{ML} = \arg\max_{h \in H} P(D|h)$
We can estimate $h_{MAP}$ by computing $P(h_i|D)$ for every $h_i \in H$ and then taking the maximum, but $h_{MAP}$ returns the most probable hypothesis, not the most probable classification; so, given a new instance x′, $h_{MAP}(x')$ might return neither the correct classification nor the most probable one.
Given the target function $f : X \to V$ that maps an instance to a class v, a dataset D and a new instance x′, we want to classify it correctly: $v^* = \hat{f}(x')$.
In general: $v^* = \arg\max_{v \in V} P(v|x', D)$
where $P(v_j|x', D)$ is the probability that x′ belongs to class $v_j$ conditioned on the entire dataset D (every hypothesis).
When we have to deal with a large hypothesis space, the Bayes Optimal Classifier is not practical anymore. A way to avoid computing every hypothesis is to use conditional independence.
When X is conditionally independent of Y given Z:
$P(X, Y|Z) = P(X|Y, Z)\,P(Y|Z) = P(X|Z)\,P(Y|Z)$
NB: if none of the training instances with target value $v_j$ have attribute value $a_i$ → $\hat{P}(a_i|v_j, D) = 0$ → $P(v_j|D) \prod_i P(a_i|v_j, D) = 0$. In this case, to avoid the zero we can set a virtual prior to some arbitrary number that guarantees $\hat{P} > 0$:
$\hat{P}(a_i|v_j, D) = \frac{|\cdot| + mp}{|\cdot| + m}$
- p = prior estimate for P
- m = weight given to the prior
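A minimal sketch of this m-estimate for a single attribute/class pair; the counts n_c (instances of class v_j with attribute value a_i) and n (instances of class v_j) stand in for the |·| terms above and are my reading of what they denote:

```python
def m_estimate(n_c: int, n: int, p: float, m: float) -> float:
    """Smoothed estimate of P(a_i | v_j): (n_c + m*p) / (n + m).

    p: prior estimate for the probability, m: weight given to the prior.
    Guarantees a strictly positive estimate even when n_c == 0.
    """
    return (n_c + m * p) / (n + m)

# No instance of class v_j has attribute value a_i (n_c = 0):
print(m_estimate(n_c=0, n=10, p=0.5, m=2))  # 0.0833... instead of 0
```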
• In the case of generative models, to find the conditional probability $P(C_i|x, D)$, estimate the prior probability $P(C_i)$ and the likelihood $P(x|C_i)$ from the training data D, and use the Bayes Theorem to calculate the posterior probability $P(C_i|x)$ → e.g. naive Bayes classifier.
• In the case of discriminative models, to find the probability, directly assume some functional form for $P(C_i|x)$ and then estimate its parameters from the training data → e.g. logistic regression.
Assuming 2 classes $C_1, C_2$ and $D = \{(x_n, t_n)\}_{n=1}^N$ with $t_n = 1$ if $x_n \in C_1$ and $t_n = 0$ if $x_n \in C_2$. Let $N_1$ be the number of samples in D belonging to $C_1$ and $N_2$ the number belonging to $C_2$.
In this case $P(C_1|x) = \sigma(a) = \sigma(w^T x + w_0)$ and $P(C_2|x) = 1 - P(C_1|x)$, with
$w = \Sigma^{-1}(\mu_1 - \mu_2)$
$w_0 = -\frac{1}{2}\mu_1^T \Sigma^{-1} \mu_1 + \frac{1}{2}\mu_2^T \Sigma^{-1} \mu_2 + \ln\frac{P(C_1)}{P(C_2)}$
where $P(t|\pi, \mu_1, \mu_2, \Sigma, D) = \prod_{n=1}^N [\pi \cdot N(x_n; \mu_1, \Sigma)]^{t_n} \cdot [(1-\pi) \cdot N(x_n; \mu_2, \Sigma)]^{(1-t_n)}$
Logistic Regression
Given a target function $f: X \to C$ and a dataset D, assume a parametric model for the posterior probability $P(C_k|\tilde{x})$:
- $\sigma(\tilde{w}^T \tilde{x})$ if 2 classes
- $\frac{\exp(\tilde{w}_k^T \tilde{x})}{\sum_{j=1}^k \exp(\tilde{w}_j^T \tilde{x})}$ if k classes
You can solve the minimization analytically or iteratively → Iterative Re-weighted Least Squares.
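A minimal two-class sketch using plain gradient descent on the cross-entropy (a simpler stand-in for IRLS; the data and the learning rate are assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
t = (X[:, 0] + X[:, 1] > 0).astype(float)
X_tilde = np.hstack([np.ones((200, 1)), X])   # prepend the bias feature

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

w = np.zeros(3)
for _ in range(500):
    y = sigmoid(X_tilde @ w)                  # P(C1 | x)
    grad = X_tilde.T @ (y - t)                # gradient of the cross-entropy
    w -= 0.01 * grad                          # gradient descent step
print("learned weights:", w)
```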
Generalization:
Given a target function $f : X \to C$ and a data set D:
- assume a prediction parametric model $y(x; \theta)$, with $y(x; \theta) \approx f(x)$
- define an error function $E(\theta)$
- solve the optimization problem $\theta^* = \arg\min_\theta E(\theta)$
- classify a new sample x′ as $y(x'; \theta^*)$.
NB: All methods described above can be applied in a transformed space of the input (feature space).
Given a function $\phi: \tilde{x} \to \Phi$ (Φ is the feature space), each sample $\tilde{x}_n$ can be mapped to a feature vector $\phi_n = \phi(\tilde{x}_n)$.
- If 2 classes: $y(x) = w^T x + w_0 = \tilde{w}^T \tilde{x}$
- If k classes: $y_i(x) = w_i^T x + w_{i,0} = \tilde{w}_i^T \tilde{x}$ for $i = 1, \dots, k$, i.e. $y(x) = (y_1(x), \dots, y_k(x))^T = \tilde{W}^T \tilde{x}$
Least Squares
Given D, find the linear discriminant $y(x) = \tilde{W}^T \tilde{x}$.
→ Minimize the sum-of-squares error function $E(\tilde{W})$; the solution is:
$\tilde{W} = (\tilde{X}^T \tilde{X})^{-1} \tilde{X}^T T = \tilde{X}^\dagger T \;\Rightarrow\; y(x) = \tilde{W}^T \tilde{x} = T^T (\tilde{X}^\dagger)^T \tilde{x}$
with $T = (t_1^T; \dots; t_N^T)$ where, if $x_n \in C_k$, then $t_{n,k} = 1$ and $t_{n,j} = 0, \forall j \neq k$ (one-hot rows).
Classification of a new instance x not in the dataset:
use the learnt $\tilde{W}$ to compute $y(x)$, then assign class $C_k$ to x, where $k = \arg\max_{i \in \{1,\dots,k\}} y_i(x)$.
PROBLEM: least squares assumes Gaussian conditional distributions
→ not robust to outliers!
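A minimal sketch of the pseudo-inverse solution with one-hot targets (the data is a placeholder):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(150, 2))
labels = (X[:, 0] > 0).astype(int)           # 2 classes: 0 and 1

X_tilde = np.hstack([np.ones((150, 1)), X])  # prepend the bias feature
T = np.eye(2)[labels]                        # one-hot target rows

W = np.linalg.pinv(X_tilde) @ T              # W = X^+ T
y = X_tilde @ W                              # discriminant values
pred = np.argmax(y, axis=1)                  # argmax over classes
print("training accuracy:", np.mean(pred == labels))
```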
Perceptron
The Perceptron is a linear classification algorithm. It consists of a single node that takes a row of data as input and predicts a class label. This is achieved by calculating the weighted sum of the inputs plus a bias (whose input is fixed to 1). The weighted sum of the inputs of the model is called the activation.
Since we need to minimize the error, we move in the direction opposite to the gradient, thus computing the derivative of the error.
Perceptron algorithm:
Given the perceptron model $o(x) = sign(w^T x)$ and data set D, determine the weights w.
• The initial values of the model weights are set to small random values.
• $\hat{w}_i \leftarrow \hat{w}_i + \Delta w_i$: model weights are updated with a small proportion of the error each batch; the proportion is controlled by a hyperparameter called the learning rate, typically set to a small value (though too small a value can result in premature convergence):
$w_i(t+1) = w_i(t) + learning\_rate \cdot (expected_i - predicted_i) \cdot input_i$
▪ Training stops when the error made by the model falls to a low level or no longer improves.
▪ The perceptron is a linear classifier: it will classify all the inputs correctly if the training set D is linearly separable and η is sufficiently small.
▪ Incremental and mini-batch modes speed up convergence and are less sensitive to local minima.
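A minimal sketch of the update rule above (the data, number of epochs and learning rate are assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
t = np.where(X[:, 0] + X[:, 1] > 0, 1, -1)   # linearly separable labels
X_tilde = np.hstack([np.ones((100, 1)), X])  # bias input fixed to 1

w = rng.normal(scale=0.01, size=3)           # small random initial weights
eta = 0.1                                    # learning rate
for _ in range(20):                          # epochs (incremental mode)
    for x_n, t_n in zip(X_tilde, t):
        o = np.sign(w @ x_n)                 # o(x) = sign(w^T x)
        w += eta * (t_n - o) * x_n           # update only when o != t_n
print("training errors:", np.sum(np.sign(X_tilde @ w) != t))
```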
Given a two-class classification problem, Fisher's linear discriminant is given by the function $y = w^T x$ and the classification of new instances is given by:
$x \in C_1$ if $y \geq -w_0$, $x \in C_2$ otherwise,
corresponding to the projection on a line determined by w.
Adjust w to find the direction that maximizes class separation: setting $\frac{d}{dw} J(w) = 0$ gives $w^* = S_w^{-1}(m_2 - m_1)$ and $w_0 = w^T m$.
The margin is estimated as the minimum distance from the separating line among all the points in the dataset.
To solve:
rescaling all the points does not affect the solution, so rescale in such a way that for the closest point $x_k$ we have $t_k(w^T x_k + w_0) = 1$.
When the maximum-margin hyperplane $(w^*, w_0^*)$ is found, there will be at least 2 closest points $x_k^+$ and $x_k^-$ (one for each class).
The optimal solution is when both are at the same distance $\frac{1}{||w||}$.
In the canonical representation of the problem the maximum-margin hyperplane can be found by solving the optimization problem:
$\min_{w, w_0} \frac{1}{2}||w||^2$ subject to $t_n(w^T x_n + w_0) \geq 1, \; n = 1, \dots, N$.
A more stable solution for $w_0$ is obtained by averaging over all the support vectors.
2) Basis functions
IDEA: transform the input (e.g., into polar coordinates) → replace x with $\phi(x)$ in all the formulas.
(*there exists a family of basis functions that works well when you cannot find $\phi(x)$ explicitly)
Maximum Likelihood
If our target value t is affected by noise ε: $t = y(x; w) + \epsilon$.
We now have a probability that the target is correct given the regression. If we assume ε to be Gaussian we have: $P(\epsilon|\beta) = \mathcal{N}(\epsilon|0, \beta^{-1})$, with precision (inverse variance) β.
Or equivalently:
Since the second term is constant we focus on the first one. Then to solve the problem we do: $w^* = \arg\min_w E_D(w)$.
Thus maximum likelihood ⟺ least-squares error minimization.
NB: if we plot the solution we may find some values of w that are very high ($||w_i|| \gg 0$) → this determines a non-smooth function! → To control overfitting we can add a regularization factor on the parameters of the kind:
$\arg\min_w E_D(w) + \lambda E_W(w)$
a common choice is $E_W(w) = \frac{1}{2} w^T w$.
Moreover, note that: $E_D(w) = \frac{1}{2}(t - \Phi w)^T (t - \Phi w)$
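With these two terms the regularized minimization has the standard closed-form solution (a known result, stated here for completeness): $w^* = (\lambda I + \Phi^T \Phi)^{-1} \Phi^T t$. A minimal numpy sketch with a polynomial basis (the degree and λ are assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 30)
t = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=30)  # noisy target

Phi = np.vander(x, 10, increasing=True)      # polynomial basis, degree 9
lam = 1e-3                                   # regularization factor lambda

# w* = (lambda*I + Phi^T Phi)^{-1} Phi^T t
w = np.linalg.solve(lam * np.eye(10) + Phi.T @ Phi, Phi.T @ t)
print("max |w_i|:", np.abs(w).max())         # regularization keeps weights small
```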
9. Kernel Methods
Kernel methods overcome difficulties in defining non-linear models. Kernel methods use kernels (or basis functions) to map the input data into a different space. After this mapping, simple models can be trained on the new feature space, instead of the input space, which can result in an increase in the performance of the models.
This approach is called the "kernel trick": it avoids the explicit mapping that is needed to get linear learning algorithms to learn a nonlinear function or decision boundary.
→ Optimal solution:
$w^* = X^T \alpha = \sum_{n=1}^N \alpha_n x_n$
Kernel trick: if the input vector x appears in an algorithm only in the form of an inner product $x^T x'$, we can replace it with some kernel $k(x, x')$.
Approach: use a similarity measure $k(x, x') \geq 0$ between the instances x, x′.
- $k(x, x')$ is called a kernel function.
- Note: if we have $\phi(x)$, a possible choice is $k(x, x') = \phi(x)^T \phi(x')$
Typically k is:
- symmetric: $k(x, x') = k(x', x)$
- non-negative: $k(x, x') \geq 0$
Input normalization:
Input data in the dataset D must be normalized in order for the kernel to be a good similarity measure in practice.
Kernel families:
- Linear
- Polynomial
- Radial Basis Function (RBF)
- Sigmoid
Kernelized SVM
It is one of the most effective ML methods for classification and regression.
- Still requires model selection and hyper-parameter tuning.
Classification:
In SVM, the solution has the form: $w^* = \sum_{n=1}^N \alpha_n x_n$
Linear model:
$y(x; \alpha) = sign(w_0 + \sum_{n=1}^N \alpha_n x_n^T x) \;\rightarrow\; y(x; \alpha) = sign(w_0 + \sum_{n=1}^N \alpha_n k(x_n, x))$
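A minimal sketch of this kernel substitution at prediction time with an RBF kernel (the coefficients α and w0 are assumed to be already learned by the training procedure; the toy values are placeholders):

```python
import numpy as np

def rbf_kernel(x, x_prime, gamma=1.0):
    # k(x, x') = exp(-gamma * ||x - x'||^2)
    return np.exp(-gamma * np.sum((x - x_prime) ** 2))

def predict(x, X_train, alpha, w0, gamma=1.0):
    # y(x; alpha) = sign(w0 + sum_n alpha_n * k(x_n, x))
    s = w0 + sum(a * rbf_kernel(x_n, x, gamma) for x_n, a in zip(X_train, alpha))
    return np.sign(s)

# Toy usage with made-up coefficients:
X_train = np.array([[0.0, 0.0], [1.0, 1.0]])
alpha = np.array([1.0, -1.0])
print(predict(np.array([0.1, 0.1]), X_train, alpha, w0=0.0))
```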
Regression:
Linear model for regression $y = w^T x$ and data set D.
Minimize the regularized loss function: $J(w) = \sum_{n=1}^N E(y_n, t_n) + \lambda ||w||^2$
The IDEA:
- points close to the predicted model are good enough → don't count them in the error of the model;
- decrease the effect of the points far away.
Consider: $J(w) = C \sum_{n=1}^N E_\epsilon(y_n, t_n) + \frac{1}{2}||w||^2$
subject to the ε-tube constraints on the slack variables.
All other data points inside the ε-tube have $\hat{a}_n = 0$ and $\hat{a}'_n = 0$ and thus do not contribute to the prediction.
Recap:
• Parametric Algorithm: we have a fixed set of parameters θ that we try to find while training on the data. After we have found the optimal values for these parameters, we can use the model to make predictions.
▪ Instead of creating a generalizable model from all of the data, KNN looks for similarities among individual data points and makes predictions accordingly.
▪ Requires storage of all the data.
▪ Increasing K leads to smoother decision regions (reducing overfitting).
One of the many issues that affect the performance of the kNN algorithm is the choice of k. If k is too small, the algorithm is more sensitive to outliers. If k is too large, the neighborhood may include too many points from other classes.
Issues:
- distance values are affected by feature scaling problems
- the model can be considered as a "compression"
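A minimal numpy sketch of k-NN classification (the data and k are placeholders):

```python
import numpy as np

def knn_predict(X_train, y_train, x, k=5):
    # Distances from x to every stored training point (no model is learned).
    dists = np.linalg.norm(X_train - x, axis=1)
    nearest = np.argsort(dists)[:k]            # indices of the k closest points
    votes = y_train[nearest]
    return np.bincount(votes).argmax()         # majority vote

rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 2))
y_train = (X_train[:, 0] > 0).astype(int)
print(knn_predict(X_train, y_train, np.array([0.5, -0.2]), k=5))
```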
FeedForward NN (FNN)
Most NNs are feedforward: information flows in one direction only, from input to output. There are no loops or cycles in the network, and the output of each layer is determined by the weights and biases of the connections between neurons, as well as the activation function of each neuron.
A hidden layer's output can be seen as an array of unit (neuron) activations based on the connections with the previous units.
The final function is a composition of elementary functions f and parameters θ (one for each layer): $f(x; \theta) = f^{(3)}(f^{(2)}(f^{(1)}(x; \theta^{(1)}); \theta^{(2)}); \theta^{(3)})$
In general, when you have multiple layers, you can see each layer as transforming a space into another one → an NN can be seen as a sequence of transformations.
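A minimal sketch of this composition for a network with two hidden layers (the sizes and the activation function are assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
# One (weights, biases) pair per layer: theta^(1), theta^(2), theta^(3).
theta = [(rng.normal(size=(4, 8)), np.zeros(8)),
         (rng.normal(size=(8, 8)), np.zeros(8)),
         (rng.normal(size=(8, 2)), np.zeros(2))]

def relu(a):
    return np.maximum(0, a)

def forward(x, theta):
    # f(x; theta) = f3(f2(f1(x; W1, b1); W2, b2); W3, b3):
    # each layer transforms one space into another.
    h = x
    for W, b in theta[:-1]:
        h = relu(h @ W + b)
    W, b = theta[-1]
    return h @ W + b                          # identity output unit

print(forward(rng.normal(size=4), theta))
```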
Architecture Design:
Choosing an appropriate architecture for a neural network is an important
consideration, as it can impact the model’s performance and ability to learn.
They are used to introduce non-linearity into the network, allowing the model to learn and make more complex decisions.
There are several types of activation functions that are commonly used in neural networks, such as:
▪ Rectified linear units (ReLU): g(α) = max(0, α), which is easy to optimize but not differentiable at 0.
▪ Sigmoid g(α) = σ(α) and hyperbolic tangent g(α) = tanh(α); both saturate easily (when there is no logarithm at the output to undo the exponential), are slow, and are useful for RNNs and autoencoders.
NB: The loss function usually includes the cost function + a regularization term to prevent overfitting by penalizing large weights.
Recall the ML setting in which we wanted the class that maximized the conditional distribution $P(C_i|x, D)$; if we use the same principle here, we get the cross-entropy loss function:
1) Regression
We use the identity activation function $y = W^T h + b$ and a Gaussian noise model $p(t|x) = \mathcal{N}(t|y, \beta^{-1})$.
→ Cost function: maximum likelihood (cross-entropy), which is equivalent to minimizing the mean squared error:
$J(\theta) = -\ln P(t|x, \theta)$
Note: linear units do not saturate.
2) Binary Classification
We use the sigmoid activation function $y = \sigma(w^T h + b)$.
The likelihood corresponds to a Bernoulli distribution:
$J(\theta) = E_{x,t \sim D}[-\ln p(t|x)]$
$-\ln p(t|x) = \dots = softplus((1 - 2t)\alpha)$ with $\alpha = w^T h + b$.
Note: the unit saturates only when it gives the correct answer.
3) Multi-class Classification
We use the softmax activation function $y_i = softmax(\alpha)_i = \frac{\exp(\alpha_i)}{\sum_j \exp(\alpha_j)}$.
The likelihood corresponds to a multinomial distribution:
$J_i(\theta) = E_{x,t \sim D}[-\ln softmax(\alpha)_i]$
Note: the unit saturates only when there are minimal errors.
Since we propagate the error back through the network, we compute the gradient of the cost function with respect to each parameter using the chain rule.
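A minimal, numerically stable sketch of the softmax output and its cross-entropy cost (the max-subtraction trick is a standard implementation detail, not from the notes):

```python
import numpy as np

def softmax(alpha):
    # Subtracting the max is a standard trick: it leaves the result
    # unchanged but avoids overflow in exp().
    e = np.exp(alpha - np.max(alpha))
    return e / e.sum()

def cross_entropy(alpha, target_class):
    # J(theta) = -ln softmax(alpha)_i for the correct class i.
    return -np.log(softmax(alpha)[target_class])

alpha = np.array([2.0, 1.0, -1.0])            # pre-activations of the output layer
print(softmax(alpha), cross_entropy(alpha, target_class=0))
```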
Learning Algorithms
Since backprop is just a method to compute the gradient (it does not learn by itself), there are various other algorithms for the actual optimization.
Regularization
Technique used to reduce overfitting. In general it involves adding a penalty term to the cost function, which discourages the network from learning overly complex patterns in the training data. For FNNs we have:
• Early stopping: stop the iterations early to avoid overfitting to the training set, i.e. when the training loss keeps decreasing towards zero while the test loss increases.
Use cross-validation to determine when to stop.
3) Pooling: used to reduce the spatial volume of the input image after convolution. It is an aggregation of some sort (usually max or average) used to implement invariance to local translations.
5) Softmax / Logistic Layer: resides at the end of the FC layer. Logistic is used for binary classification and softmax for multi-class classification.
▪ Stride denotes how many steps we move at each step of the convolution (by default it is one). We can observe that the size of the output is smaller than the input.
Parameter size:
Consider input size $w_{in} \times h_{in} \times d_{in}$, $d_{out}$ kernels of size $w_k \times h_k \times d_{in}$, stride s and padding p.
- the dimension of the output feature map is:
$w_{out} = \frac{w_{in} - w_k + 2p}{s} + 1$ and $h_{out} = \frac{h_{in} - h_k + 2p}{s} + 1$
- the number of trainable parameters of the convolutional layer is:
$|\theta| = w_k \cdot h_k \cdot d_{in} \cdot d_{out} + d_{out}$
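A minimal sketch that applies these two formulas (the example numbers are placeholders):

```python
def conv_layer_shape(w_in, h_in, d_in, d_out, w_k, h_k, s=1, p=0):
    # Output feature map size: (w_in - w_k + 2p)/s + 1, same for the height.
    w_out = (w_in - w_k + 2 * p) // s + 1
    h_out = (h_in - h_k + 2 * p) // s + 1
    # Trainable parameters: one w_k*h_k*d_in kernel per output channel, plus biases.
    n_params = w_k * h_k * d_in * d_out + d_out
    return (w_out, h_out, d_out), n_params

# E.g. a 32x32x3 input, 16 kernels of size 5x5x3, stride 1, padding 2:
print(conv_layer_shape(32, 32, 3, 16, 5, 5, s=1, p=2))  # ((32, 32, 16), 1216)
```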
Famous" CNNs
• LeNet: designed to recognize handwritten digits, and consists of a series of
convolutional and pooling layers followed by F.C. (dense) layers. LeNet is a
relatively small and simple CNN, but it was an important step in the development
of modern CNNs.
• AlexNet, which won the ILSVRC in 2012 and significantly advanced the state of
the art in image classification. Not commonly used anymore.
• VGG, which won the ILSVRC in 2014 and is known for its deep and narrow
architecture. Commonly used also today.
• Inception by Google, which won the ILSVRC in 2014 and introduced the concept
of "inception modules" which make use of multiple parallel convolutional and
pooling layers. Commonly used also today.
• ResNet by Microsoft, which won ILSVRC 2015 and introduced the concept of
residual connections, which made it possible to train very deep CNNs effectively.
Transfer Learning
GOAL → improve learning of $f_T$ for the target learning task, using the knowledge in the source domain $D_S$ and source learning task $T_S$ (i.e., after training $f_S$).
The main difference between the two is that in fine-tuning, more layers of the pre-trained
model get unfrozen and tuned on custom data. This fine-tuning usually takes more data
than feature extraction to be effective.
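A minimal PyTorch-style sketch of feature extraction vs. fine-tuning (the ResNet-18 backbone, the unfrozen layer, and the class count are assumptions; torchvision is assumed available):

```python
import torch.nn as nn
from torchvision import models

model = models.resnet18(weights="IMAGENET1K_V1")  # pretrained source model f_S

# Feature extraction: freeze every pretrained layer...
for param in model.parameters():
    param.requires_grad = False
# ...and replace only the final classifier head for the target task.
model.fc = nn.Linear(model.fc.in_features, 10)    # e.g. 10 target classes

# Fine-tuning: additionally unfreeze some deeper layers (needs more data).
for param in model.layer4.parameters():
    param.requires_grad = True
```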
Voting: simple method in which the models are trained in parallel on the same dataset D and then the outputs are combined: summed (for regression) or chosen as the most voted class (for classification).
$P(x) = \sum_{k=1}^K \pi_k N(x; \mu_k, \Sigma_k)$
• INVERSE WAY: from $(\pi_k, \mu_k, \Sigma_k)$, and by using P(x), you generate data.
• NORMAL WAY: making some assumptions (uniform $\pi_k = 1/K$, same covariance $\Sigma_k = \Sigma_{k'}$) → we want to estimate the k means of the data.
K-means
The K-means is a clustering algorithm that aims to partition n observations into k
clusters in which each observation belongs to the cluster with the nearest mean.
It is an iterative algorithm that starts by randomly initializing k centroids, then
assigns each observation to the cluster corresponding to the closest centroid. The
centroids are then updated based on the mean of the points in each cluster, and
the process is repeated until convergence.
Step 4. Repeat step 3 until convergence is achieved, that is, until a pass through the training samples causes no new assignments.
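A minimal numpy sketch of the iteration described above (the data, k and the stopping rule are placeholders):

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(5, 1, (100, 2))])

k = 2
centroids = X[rng.choice(len(X), k, replace=False)]  # random initialization
while True:
    # Assignment step: each point goes to the closest centroid.
    assign = np.argmin(np.linalg.norm(X[:, None] - centroids[None], axis=2), axis=1)
    # Update step: each centroid moves to the mean of its points.
    new_centroids = np.array([X[assign == j].mean(axis=0) for j in range(k)])
    if np.allclose(new_centroids, centroids):        # convergence: no change
        break
    centroids = new_centroids
print(centroids)
```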
CONS:
- sensitive to initial conditions; it can get stuck in local minima when few data are available.
- not robust to outliers: data very far from a centroid may pull the centroid away from the real one.
• M-STEP: maximize the expected value of the complete data log likelihood
function with respect to the model parameters
General EM problem
Given:
• Observed data $X = \{x_1, \dots, x_N\}$
• Unobserved latent values $Z = \{z_1, \dots, z_N\}$
• Parametrized probability distribution $P(Y|\theta)$, where
▪ $Y = \{y_1, \dots, y_N\}$ is the full data, $y_n = \langle x_n, z_n \rangle$
▪ θ are the parameters
Determine the values of the model parameters that best explain the observations:
• $\theta^*$ that (locally) maximizes $E[\ln P(Y|\theta)]$
The GMM distribution P(x) can be seen as the marginalization of a distribution P(x, z) over the variables z.
Given observations $D = \{(x_n)_{n=1}^N\}$, each data point $x_n$ is associated to the corresponding variable $z_n$, which is unknown.
Note: $z_{nk} = 1$ denotes that $x_n$ was sampled from Gaussian k. The $z_n$ are called latent variables.
→ Analysis of latent variables allows for a better understanding of the input data (e.g., dimensionality reduction).
▪ $\pi_k$: prior probability of $z_k$
▪ $\gamma(z_k)$: posterior probability after observation of x.
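The standard expression of this posterior (the responsibility), stated here for completeness:
$\gamma(z_k) = P(z_k = 1|x) = \frac{\pi_k N(x; \mu_k, \Sigma_k)}{\sum_{j=1}^K \pi_j N(x; \mu_j, \Sigma_j)}$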
Latent Variables: variables that are not directly observed in the data but are inferred from the observed data. For example, if the input is an image we have many degrees of freedom, and each configuration of the parameters can generate a sample of our dataset, but some configurations have no meaning. The goal is to identify these variables and use them to represent the data in a lower-dimensional space (smaller dimensional space ↔ more representative of the problem).
General idea:
- Start by calculating the average of $x_1$ and $x_2$ to get the center of the data.
- Shift the data so that the center of the graph is the center of the data, and fit a line to the samples. Start with a random line and iterate until the best fit, which is when, projecting the samples onto the line, the line minimizes the sample-to-projection distances, or equivalently maximizes the distances from the projected points to the origin = maximizes the sum of squared distances (= eigenvalue for PC1). This line is the first principal component PC1, which is a linear combination of $x_1$ and $x_2$.
- The unit vector of that line is the singular vector or eigenvector for PC1. When you do PCA with SVD, the vector is scaled so that its length = 1.
- In 2D, PC2 is simply the line through the origin that is perpendicular to PC1.
- Now rotate until horizontal for the final plot and use the projections to draw the samples.
Math steps:
Given data $\{x_n\} \in R^d$:
1. Compute the covariance matrix of the dataset.
2. Compute the eigenvectors $\{u_i\}$ and eigenvalues $\{\lambda_i\}$ of the covariance matrix: the eigenvectors are the principal components, and the eigenvalues are their corresponding variances → we want to maximize the variance, which happens when the eigenvector corresponds to the largest eigenvalue.
3. Order the eigenvalues in decreasing order and select the k eigenvectors with the largest eigenvalues. By selecting the top k principal components, we retain as much information as possible while still reducing the dimensionality of the data.
4. Use this eigenvector matrix to transform the original dataset into a new k-dimensional subspace by matrix multiplication.
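A minimal numpy sketch of these four steps (the data and k are placeholders):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))                 # placeholder data, d = 5
k = 2

Xc = X - X.mean(axis=0)                       # center the data
cov = np.cov(Xc, rowvar=False)                # 1. covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov)        # 2. eigenvalues/eigenvectors
order = np.argsort(eigvals)[::-1]             # 3. decreasing eigenvalues
U_k = eigvecs[:, order[:k]]                   #    top-k principal components
Z = Xc @ U_k                                  # 4. project to the k-dim subspace
print(Z.shape)                                # (200, 2)
```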
Probabilistic PCA
It is an extension of PCA that models the data as being generated by a probabilistic process. This allows the data to be represented by a low-dimensional latent variable, which can be useful in cases where the data is noisy or incomplete. The goal is to find a set of latent variables that best explains the data distribution.
- Assume the data is a linear combination of a low-dimensional latent variable z, with some added Gaussian noise: $x = Wz + \mu + \epsilon$
- Assume a Gaussian distribution of z: $P(z) = N(z; 0, I)$
- Assume a linear-Gaussian relationship between latent variables and data:
$P(x|z) = N(x; Wz + \mu, \sigma^2 I)$
The goal is to find the parameters of the model $(W, \mu, \sigma^2)$ that best explain the data distribution. This is done by maximizing the likelihood of the data given the model parameters:
$P(X|W, \mu, \sigma^2) = \prod_{n=1}^N P(x_n|W, \mu, \sigma^2)$
The maximum likelihood solution for the probabilistic PCA model can also be obtained with the EM algorithm.
Linear representations are not sufficient for complex data: if you use PCA, different points may be projected onto the same point.
→ How to deal with non-linear transformations? → Non-linear latent variable models (Autoencoders, GANs, …). These models use non-linear functions to map the observed data to the latent space.
An autoencoder is a function that transforms the input into an intermediate value z (latent space) and then reconstructs it as $x_n'$.
The GOAL is to minimize the difference between the input and the reconstruction ($x_n' \approx x_n$); this is typically done by training the model to minimize the reconstruction error (typically the MSE).
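A minimal numpy sketch of a linear autoencoder trained by gradient descent on the MSE (the architecture sizes and learning rate are assumptions; real autoencoders use non-linear layers):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10))                # placeholder data, d = 10

W_enc = rng.normal(scale=0.1, size=(10, 2))   # encoder: x -> z (latent dim 2)
W_dec = rng.normal(scale=0.1, size=(2, 10))   # decoder: z -> x'

lr = 0.01
for _ in range(200):
    Z = X @ W_enc                             # latent representation z
    X_rec = Z @ W_dec                         # reconstruction x'
    err = X_rec - X                           # d(MSE)/d(X_rec), up to a constant
    # Gradient descent on the reconstruction error (MSE).
    grad_dec = Z.T @ err / len(X)
    grad_enc = X.T @ (err @ W_dec.T) / len(X)
    W_dec -= lr * grad_dec
    W_enc -= lr * grad_enc
print("MSE:", np.mean((X - (X @ W_enc) @ W_dec) ** 2))
```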
Generative Models are models that learn to generate new data from some underlying probability distribution. They can be used for a variety of tasks, such as image synthesis, language generation, and as a way to extract features from a dataset.
- Train the generator by using the entire model (generator + discriminator), with the discriminator layers frozen, on a batch of data {(r_k, Real)}, to generate data that "fools" the discriminator into believing that the samples are "real" (r_k are random values of the latent variable).
→ maximize the loss of the discriminator when given generated data;
▪ Once the current state is known, the evolution of the dynamic system does not
depend on the history of states, actions and observations.
▪ The current state contains all the information needed to predict the future.
▪ Future states are conditionally independent of past states and past
observations given the current state.
▪ The knowledge about the current state makes past, present and future
observations statistically independent.
Given an MDP, we want to find an optimal policy $\pi: X \to A$ that takes as input some state $x_t$ and chooses an action $a_t$ so as to maximize the reward $r_t$.
Optimality = maximizing the (expected value of the) cumulative discounted reward:
$V^\pi(x_1) = E[\bar{r}_1 + \gamma \bar{r}_2 + \gamma^2 \bar{r}_3 + \dots]$
where $\bar{r}_t = r(x_t, a_t, x_{t+1})$, $a_t = \pi(x_t)$ and $\gamma \in [0, 1]$ is the discount factor for future rewards.
Optimal policy: $\pi^* = \arg\max_\pi V^\pi(x), \forall x \in X$.
▪ $\pi^*$ is an optimal policy iff for any other policy π: $V^{\pi^*}(x) \geq V^\pi(x), \forall x$.
▪ For infinite-horizon problems, a stationary MDP always has an optimal stationary policy.
When δ and r are not known, the agent cannot predict the effect of its actions, but it can execute them and then observe the outcome.
Value iteration: estimate the value function $V^{\pi^*}$ from which we can determine the optimal policy $\pi^*$:
$\pi^*(x) = \arg\max_{a \in A} [r(x, a) + \gamma V^{\pi^*}(\delta(x, a))]$
Only if δ and r are known.
To determine the optimal policy without knowing δ and r, the agent learns the Q-table: we get the policy just by observing the new state x′ and the immediate reward r after the execution of the chosen action.
Exploration-Exploitation trade-off
During trials, an agent has a set of actions to select from:
- some have been selected before → exploiting what the agent already knows, selecting the action that maximizes $\hat{Q}(x, a)$;
- others have never been taken before → exploring a random action (low value of $\hat{Q}(x, a)$).
When the agent explores, it can improve its current knowledge and gain better rewards in the long run; when it exploits, it gets more reward immediately → we wish to keep a balance between exploration and exploitation, not giving up on one or the other.
Action selection:
• ε-greedy strategy: it selects the action with the highest estimated reward most of the time, choosing the best action with probability 1 − ε (exploitation) and a random action with probability ε (exploration). ε can decrease over time to balance exploration and exploitation (first exploration, then exploitation).
• k > 0 determines how strongly the selection favours actions with high $\hat{Q}$ values; k may increase over time (first exploration, then exploitation).
But if the probability $\mu_i$ changes (non-deterministic case), then the above solutions do not work: the first fails because $\mu_i$ can change after the x trials, and the ε-greedy strategy will not reach an optimal value.
$\alpha = \frac{1}{1 + v_{n-1}(a_i)}$, where $v_{n-1}(a_i)$ is the number of executions of action $a_i$ up to time $n - 1$.
Q Function (deterministic case): $Q^\pi(x, a)$ is the expected value when executing a in the state x and then acting according to π.
$Q(x, a) \equiv r(x, a) + \gamma V^*(\delta(x, a)) \;\rightarrow\; \pi^*(x) = \arg\max_{a \in A} Q(x, a)$
When we consider the reward function to be non-deterministic too, then we must include it in the value function:
$Q(x, a) = E[r(x, a) + \gamma V^*(\delta(x, a))] = \dots = E[r(x, a)] + \gamma \sum_{x'} P(x'|x, a) \max_{a'} Q(x', a')$
Now the optimal policy becomes: $\pi^*(x) = \arg\max_{a \in A} Q(x, a)$
Q-learning generalizes to non-deterministic worlds with the training rule (the standard rule, restated here since the formula was lost in extraction):
$\hat{Q}_n(x, a) \leftarrow (1 - \alpha_n)\hat{Q}_{n-1}(x, a) + \alpha_n \left[ r + \gamma \max_{a'} \hat{Q}_{n-1}(x', a') \right]$
with $\alpha_n = \frac{1}{1 + v_n(x, a)}$.
Q-Learning Algorithm
It updates the Q-value using the Q-value of the next state and the greedy action after it (off-policy). The goal is to maximize the total reward: it adds the maximum reward attainable from future states to the reward for reaching the current state, effectively influencing the current action by the potential future reward.
The formula that updates the Q-value is:
$Q(x, a) \leftarrow Q(x, a) + \alpha \left[ r + \gamma \max_{a'} Q(x', a') - Q(x, a) \right]$
This is called the action-value function or Q-function. The function approximates the value of selecting a certain action in a certain state.
SARSA
SARSA (State-Action-Reward-State-Action) is on-policy, as it uses an ε-greedy strategy for all the steps. It updates the Q-value using the Q-value of the next state and keeps following the policy for the next action.
In this case:
$Q(x, a) \leftarrow Q(x, a) + \alpha \left[ r + \gamma Q(x', a') - Q(x, a) \right]$
This means that SARSA uses the next action as a starting action in the next state (on-policy), whereas Q-Learning replaces it with the maximisation over the next action's reward.
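A minimal tabular sketch contrasting the two update rules (the environment interface, α, γ and ε are assumptions):

```python
import numpy as np

n_states, n_actions = 10, 4
alpha, gamma, eps = 0.1, 0.9, 0.1
Q = np.zeros((n_states, n_actions))
rng = np.random.default_rng(0)

def eps_greedy(x):
    # Random action with probability eps, otherwise the greedy one.
    return rng.integers(n_actions) if rng.random() < eps else int(np.argmax(Q[x]))

def q_learning_update(x, a, r, x_next):
    # Off-policy: bootstrap with the greedy action in x'.
    Q[x, a] += alpha * (r + gamma * np.max(Q[x_next]) - Q[x, a])

def sarsa_update(x, a, r, x_next, a_next):
    # On-policy: bootstrap with the action actually chosen by eps-greedy.
    Q[x, a] += alpha * (r + gamma * Q[x_next, a_next] - Q[x, a])
```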
*bootstrapping usually refers to a self-starting process that is supposed to continue or grow without external input.
- Thus, since $P(z_t|x_t)$ and $P(x_t|x_{t-1})$ are given by the model's observation distribution and transition probabilities, one can quickly calculate $\alpha_t(x_t)$ from $\alpha_{t-1}(x_{t-1})$ and avoid incurring exponential computation time.
Given a current belief state b, the next belief state b′(x′), which represents the probability of being in state x′ after b, a and o, is calculated as:
$b'(x') = \eta \, P(o|x') \sum_{x} P(x'|x, a) \, b(x)$
(with η a normalization constant; this is the standard belief update, restated since the formula was lost in extraction) — so from the current belief state, the observation probability, and the original transition probability.
▪ Transition function:
▪ Reward function:
▪ Value function: similar to traditional MDPs, replacing states with belief states.
Policy tree: