
Machine Learning

1. Introduction
2. Evaluation and Estimators
3. Decision Trees
   - ID3 Algorithm
   - Overfitting
   - Reduced-Error Pruning
   - Rule Post-Pruning
   - Random Forest
4. Probability and Bayes
   - Bayes Rule
5. Bayes Learning
   - Bayes Optimal Classifier
   - Naive Bayes Classifier
6. Probabilistic Models for Classification
   - Probabilistic Generative Models
   - Probabilistic Discriminative Models
   - Logistic Regression
7. Linear Models for Classification
   - Least Squares
   - Perceptron
   - Fisher's Linear Discriminant
   - Support Vector Machine (SVM)
8. Linear Models for Regression
   - Maximum Likelihood
9. Kernel Methods
   - Kernelized SVM
10. Instance-based Learning
   - K-Nearest Neighbors (KNN)
   - Locally Weighted Regression
11. Artificial Neural Networks
   - FeedForward NN (FNN)
   - Architecture Design
   - Gradient Computation (BackProp)
   - Learning Algorithms
   - Regularization
12. Convolutional Neural Networks (CNNs)
   - "Famous" CNNs
   - Transfer Learning
13. Multiple Learners
   - Voting
   - Stacking
   - Cascading
   - Bagging
   - Boosting
   - AdaBoost
14. Unsupervised Learning
   - Gaussian Mixture Model (GMM)
   - K-means
   - Expectation Maximization (EM)
   - Gaussian Mixture Model
15. Dimensionality Reduction
   - Principal Component Analysis (PCA)
   - Probabilistic PCA
   - Autoassociative Neural Networks (Autoencoders)
   - Variational Autoencoder (VAE)
   - Generative Adversarial Networks (GANs)
16. MDP and RL
   - Markov Decision Process (MDP)
   - One-state Markov Decision Processes (MDP)
   - Exploration-Exploitation Trade-off
   - Temporal Difference Learning
   - Q-Learning Algorithm
   - SARSA
17. HMM and POMDP
   - Hidden Markov Model (HMM)
   - Partially Observable MDP (POMDP)


1. Introduction
ML can be seen as learning a function from samples, or producing knowledge from
data. Learning as search requires the definition of a hypothesis space and an
algorithm to search for solutions in this space.

The ML problem is to learn a function (target function) f : X → Y, given a dataset
D = {(x, y)}, in such a way as to find an approximation f̂(x′) ≈ f(x′), ∀ x′ ∉ D.

Depending on the dataset we have different problems:

• Supervised learning: we have an output y for each sample x in the dataset D = {(x_i, y_i)}.
  o Classification: return the class to which a specific instance belongs.
  o Regression: approximate a real-valued function.

• Unsupervised learning: we do not have y_i, so D = {x_i}.
  o Clustering.

• Reinforcement Learning: given triples D = (s_i, a_i, r_i) (state, action, reward),
  RL is the practice of learning a policy π (i.e., which action to take in each state).

Call H = {h_1, h_2, ..., h_n} the hypothesis space, i.e. the set of all possible approximations of the
problem.
Given a target function c(x) that we want to learn and a set H = {h_1, h_2, ..., h_n}, the goal is
to find the best h* so that h*(x) ≈ c(x).
Any hypothesis that approximates the target function well over a sufficiently large set of
training examples will also approximate the target function well over other unobserved
examples → h* will predict correct values h*(x′) for instances x′ with respect to the
unknown values c(x′).


2. Evaluation and Estimators


• True Error of h w.r.t. target function f and distribution D is the probability that h
will misclassify an instance drawn at random according to D:

error_D(h) ≡ Pr_{x∈D} [ f(x) ≠ h(x) ]

It cannot be computed.

• Sample Error of h w.r.t. the target function f and data sample S is the proportion of
examples h misclassifies:
error_S(h) = (1/|S|) Σ_{x∈S} δ( f(x) ≠ h(x) )

Accuracy: 𝑎𝑐𝑐𝑢𝑟𝑎𝑐𝑦(ℎ) = 1 − 𝑒𝑟𝑟𝑜𝑟(ℎ)


The goal of a learning system is to be accurate in ℎ(𝑥) ∀𝑥 ∉ 𝑆.
If the 𝑎𝑐𝑐𝑢𝑟𝑎𝑐𝑦𝑠 (ℎ) is very high but the 𝑎𝑐𝑐𝑢𝑟𝑎𝑐𝑦𝐷 (ℎ) is low our system will not
be useful.
→ We want 𝑒𝑟𝑟𝑜𝑟𝑆 (ℎ) ≈ 𝑒𝑟𝑟𝑜𝑟𝐷 (ℎ)

To guarantee this we consider:


• Estimation Bias: bias ≡ E[error_S(h) − error_D(h)]

error_S(h) is a random variable: how can we estimate the true error from it?

1) Statistical method: Confidence Intervals:

Compute an interval that is guaranteed to contain the TrueError with a certain
probability:

error_D(h) ∈ error_S(h) ± z_N √( error_S(h)(1 − error_S(h)) / n )

2) Build methods to compute (unbiased) estimates: Estimators

1) Partition the data set D = T ∪ S, T ∩ S = ∅
2) Compute a hypothesis h using the training set T
3) Evaluate error_S(h) = (1/|S|) Σ_{x∈S} δ( f(x) ≠ h(x) )

In general:
• Having more samples for training and fewer for testing improves the performance of the
model: potentially a better model, but error_S(h) may NOT be ≈ error_D(h).
• Having more samples for evaluation and fewer for training reduces the variance of the
estimate: error_S(h) ≈ error_D(h), but this value may not be satisfactory.
→ Trade-off for medium-sized datasets: 2/3 for training, 1/3 for testing.

Overfitting: h overfits the training data if there exists another hypothesis h′ such that
error_T(h) < error_T(h′) but error_D(h) > error_D(h′),
i.e. h does better on the training set but worse on the whole distribution.

K-fold Cross Validation: partition D into k disjoint folds; for each fold, train on the other k−1 folds and evaluate on the held-out fold, then average the k errors. We can use it to compare solutions and learning algorithms.
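A minimal sketch of k-fold cross-validation as described above (NumPy; `train_fn` and `error_fn` are assumed user-supplied helpers, their names are illustrative and not from the notes):

```python
import numpy as np

def k_fold_cv(X, y, k, train_fn, error_fn, seed=0):
    """Estimate error_D(h) by averaging the hold-out error over k folds.

    train_fn(X_train, y_train) -> model
    error_fn(model, X_test, y_test) -> scalar error (e.g. misclassification rate)
    """
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))              # shuffle once, then split into k folds
    folds = np.array_split(idx, k)
    errors = []
    for i in range(k):
        test_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        model = train_fn(X[train_idx], y[train_idx])
        errors.append(error_fn(model, X[test_idx], y[test_idx]))
    return np.mean(errors), np.std(errors)
```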

Other performance metrics:

• Error rate = |errors| / |instances|
• Accuracy = 1 − error rate
• Recall = TP / (TP + FN)
• Precision = TP / (TP + FP)
• F1-score = 2 · Precision · Recall / (Precision + Recall)
• Confusion Matrix: reports how many times an instance of class C_i
  is classified as class C_j. The main diagonal contains the accuracy (correct
  classifications) for each class; the entries outside the diagonal show which
  classes are more often confused.


3. Decision Trees
DTs represent a classification function by making decisions explicit.
Given an instance space X defined by a set of attributes, a DT has:

▪ an internal node for each attribute test
▪ a branch for each value a_{i,j} ∈ A_i of an attribute
▪ a leaf to which a classification value {yes, no} is assigned

A DT represents a disjunction of conjunctions of constraints on the attribute values of
instances. In this way you can transform the tree into a rule for each path to a leaf node.
Example: if (O = sunny) ∧ (H = high) then PlayTennis = yes

Entropy: measures the impurity of the set of samples S:

Entropy(S) = −p₊ log₂ p₊ − p₋ log₂ p₋

where p₊ is the proportion of positive samples and p₋ = 1 − p₊ the proportion of negative ones.

Information gain measures how well a given attribute separates the training
examples according to their target classification. It is measured as the expected
reduction in the entropy of S caused by knowing the value of attribute A:

Gain(S, A) = Entropy(S) − Σ_{v∈Values(A)} (|S_v| / |S|) Entropy(S_v)
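A small sketch of how the two quantities above can be computed on labelled samples (NumPy; function names are illustrative):

```python
import numpy as np

def entropy(labels):
    """Entropy(S) = -sum_c p_c log2 p_c (multi-class generalization of the binary formula)."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(attribute_values, labels):
    """Gain(S, A) = Entropy(S) - sum_v |S_v|/|S| * Entropy(S_v)."""
    total = entropy(labels)
    weighted = 0.0
    for v in np.unique(attribute_values):
        mask = attribute_values == v
        weighted += mask.mean() * entropy(labels[mask])
    return total - weighted

# example (PlayTennis-style data): labels and one attribute column
labels = np.array(["yes", "yes", "no", "no", "yes"])
wind   = np.array(["weak", "strong", "strong", "weak", "weak"])
print(information_gain(wind, labels))
```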


ID3 Algorithm: an algorithm used to generate a DT from a dataset, given the
Examples (sample dataset), the Target_attribute and the Attributes.

▪ The output DT depends on the attribute order!
▪ Optimality = the tree that reaches a decision with the fewest tests, i.e. a short tree
  → choosing an attribute whose split leaves 0 negatives or 0 positives terminates that branch of the tree.
▪ The ID3 algorithm selects the attribute with the highest information gain.

Hypothesis Space Search by ID3:


• Hypothesis space is complete (target concept is there!)
• Outputs a single hypothesis (cannot determine how many DTs are consistent!)
• No back tracking (local minima!)
• Statistically-based search choices (robust to noisy data!)
• Uses all the training examples at each step (not incremental!)

Issues in Decision Tree Learning


• Determining how deeply to grow the DT
• Handling continuous attributes
• Choosing appropriate attribute selection measures
• Handling training data with missing attribute values
• Handling attributes with different costs

Overfitting: the condition in which the model fits the training data completely but fails to
generalize to unseen test data. It can happen that we keep developing the tree just to fit
a single sample, obtaining a deeper tree with a deep branch only for that sample.


We must evaluate the tree at each step on a test set to see at which step we reach a peak
of accuracy → after this step the accuracy drops and we are overfitting.

To avoid tree overfitting:

• Stop growing when a data split is not statistically significant
• Grow a full tree and then post-prune (replace unimportant nodes with
  leaves).
To determine the correct tree size:
• use a separate set of examples (distinct from the training examples) to
  evaluate the utility of post-pruning,
• apply a statistical test to estimate the accuracy of a tree on the entire data
  distribution,
• use an explicit measure of the complexity of encoding the examples and
  the decision trees.

Reduced-Error Pruning: produces the smallest version of the most accurate subtree.

Split the data into a Training and a Validation set. Do until further pruning is harmful
(decreases accuracy):
1. Evaluate the impact on the validation set of pruning each possible
   node (remove the whole subtree and assign the most
   common classification)
2. Greedily remove the one that most improves validation
   set accuracy

Rule Post-Pruning
Infer tree as well as possible (allowing for overfitting). Convert tree to equivalent set
of rules. Prune each rule by removing any preconditions that result in improving its
estimated accuracy. Sort final rules by their estimated accuracy and consider them in
this sequence when classifying.
• greedy! So not optimal

Random Forest ensemble method that generates a set of DTs with some random
criteria (bagging, feature selection, …) and integrates their values into a final result.
Integration of results: majority vote (most common class returned by all the trees).
Random Forests are less sensitive to overfitting.


4. Probability and Bayes


Representation of uncertainty with probabilities.

Prior Probability: the probability before collecting any experience (the dataset is
empty). It corresponds to the belief prior to the arrival of any (new) evidence.

Posterior/Conditional probability: the probability after the arrival of some
evidence. If I know the outcome of a random variable, how will this affect the probability
of other random variables?

P(a|b) ≡ P(a ∧ b) / P(b),  if P(b) ≠ 0

P(a ∧ b) = P(a|b) P(b) = P(b|a) P(a)

• A and B are independent iff one does not affect the other:
𝑃(𝐴|𝐵) = 𝑃(𝐴) or 𝑃(𝐵|𝐴) = 𝑃(𝐵) or 𝑃(𝐴, 𝐵) = 𝑃(𝐴)𝑃(𝐵)
- if 𝑋1 , … , 𝑋𝑛 independent → 𝑃(𝑋1 , … , 𝑋𝑛 ) = 𝑃(𝑋1 )𝑃(𝑋2 ) ∙∙∙ 𝑃(𝑋𝑛 ) reducing the
size of the distribution from exponential to linear.
- X is conditionally independent from Y, given Z, iff: P(X|Y, Z) = P(X|Z)

Chain rule: P(X, Y, Z) = P(X|Y, Z) P(Y, Z) = P(X|Y, Z) P(Y|Z) P(Z); if X and Y are conditionally independent given Z this becomes P(X|Z) P(Y|Z) P(Z).

Bayes Rule

P(cause|effect) = P(effect|cause) P(cause) / P(effect)   ⟺   P(Y|X) = P(X|Y) P(Y) / P(X)

- For multiple variables we can write: P(Z|Y_1, ..., Y_n) = α P(Y_1, ..., Y_n|Z) P(Z)

- If Y_1, ..., Y_n are conditionally independent given Z: P(Z|Y_1, ..., Y_n) = α P(Y_1|Z) ··· P(Y_n|Z) P(Z)
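A tiny numeric sketch of the rule above (the disease/test probabilities are invented purely for illustration):

```python
# P(cause|effect) = P(effect|cause) P(cause) / P(effect)
p_disease = 0.01                      # prior P(cause)
p_pos_given_disease = 0.95            # likelihood P(effect|cause)
p_pos_given_healthy = 0.05            # P(effect|not cause)

# evidence P(effect) by total probability
p_pos = p_pos_given_disease * p_disease + p_pos_given_healthy * (1 - p_disease)

p_disease_given_pos = p_pos_given_disease * p_disease / p_pos
print(p_disease_given_pos)            # ~0.16: a positive test with a small prior is still weak evidence
```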

Bayesian Networks provide a graphical notation for conditional
independence assertions and hence for the specification of the full joint distribution:
- a set of nodes, one per variable
- a directed, acyclic graph (a link means "directly influences")
- a conditional distribution for each node given its parents: P(X_i | Parents(X_i))


5. Bayes Learning
Bayesian learning uses Bayes' theorem to determine the conditional probability of a
hypothesis given some evidence or observations.
• Provides practical learning algorithms:
- Naive Bayes learning (examples affect prob. that a hypothesis is correct)
- Combine prior knowledge (probabilities) with observed data
- Make probabilistic predictions (new instances classified by weighted combination
of multiple hypotheses)
- Requires prior probabilities (often estimated from available data)
• Provides useful conceptual framework for evaluating other learning algorithms

Bayes Theorem: given P(h) the prior probability of the hypothesis h, and P(D) the prior
probability of the training data D, the Bayes rule is:

P(h|D) = P(D|h) P(h) / P(D)

Maximum a Posteriori probability (MAP)

When classifying new data we want to assign it the most probable hypothesis. To
do so we can use the maximum a posteriori hypothesis h_MAP:

h_MAP = argmax_{h∈H} P(h|D) = argmax_{h∈H} P(D|h) P(h)

(the normalizing constant P(D) has been removed)

1. For each hypothesis h ∈ H, calculate the posterior probability P(h|D)

2. Output the hypothesis h_MAP with the highest posterior probability

Moreover, if the prior distribution is uniform, i.e. P(h_i) = P(h_j) ∀ i, j, we can use
the Maximum Likelihood hypothesis h_ML:

h_ML = argmax_{h∈H} P(D|h)

We can estimate h_MAP by computing P(h_i|D) for every h_i ∈ H and then
taking the maximum, but h_MAP returns the most probable hypothesis, not the most
probable classification, so, given a new instance x′, h_MAP(x′) might return neither the
correct classification nor the most probable one.


Bayes Optimal Classifier


The Bayes Optimal Classifier maximizes the probability that the new instance x’ is
classified correctly.

Given the target function f : X → V that maps an instance to a class v, a dataset D and a
new instance x′, we want to classify it correctly: v* = f̂(x′)
In general: v* = argmax_{v∈V} P(v|x′, D)
where P(v_j|x′, D) is the probability that x′ belongs to the class v_j conditioned on the entire
dataset D (every hypothesis).

If we consider P(v_j|x′, h_i) as the probability that a new instance x′ is classified as the
class v_j by a hypothesis h_i, we have:
P(v_j|x′, D) = Σ_{h_i} P(v_j|x′, h_i) P(h_i|D)
And so the most probable class v_OB for a new instance x′ is:
v_OB = argmax_{v_j} Σ_{h_i} P(v_j|x′, h_i) P(h_i|D)

When the hypothesis space is large, the Bayes Optimal Classifier
is no longer practical. A way to avoid computing every hypothesis is to use
conditional independence.
When X is conditionally independent of Y given Z:
P(X, Y|Z) = P(X|Y, Z) P(Y|Z) = P(X|Z) P(Y|Z)


Naive Bayes Classifier

The NBC uses conditional independence to approximate the solution.
Consider f : X → V where each instance x is described by attributes <a_1, a_2, ..., a_n>.

V_MAP = argmax_{v_j∈V} P(v_j|x, D) = argmax_{v_j∈V} P(v_j|a_1, ..., a_n, D)
      = argmax_{v_j∈V} P(a_1, ..., a_n|v_j, D) P(v_j|D) / P(a_1, ..., a_n|D)
      = argmax_{v_j∈V} P(a_1, ..., a_n|v_j, D) P(v_j|D)

Assuming each a_i conditionally independent given the class, we get:

V_NB = argmax_{v_j∈V} P(v_j|D) Π_i P(a_i|v_j, D)

NB: if none of the training instances with target value v_j have attribute value a_i, then
P̂(a_i|v_j, D) = 0 → P(v_j|D) Π_i P(a_i|v_j, D) = 0. In this case, to avoid the zero, we can
use an m-estimate that guarantees P̂ > 0:

P̂(a_i|v_j, D) = (n_c + m·p) / (n + m)

- n = number of training examples with class v_j, n_c = number of those that also have attribute value a_i
- p = prior estimate for P
- m = weight given to the prior
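A compact sketch of the V_NB rule above for categorical attributes, with the m-estimate used as smoothing (plain Python; the class and function names are illustrative):

```python
from collections import Counter, defaultdict

class NaiveBayes:
    """Categorical naive Bayes with m-estimate smoothing: (n_c + m*p) / (n + m)."""

    def __init__(self, m=1.0):
        self.m = m

    def fit(self, X, y):                       # X: list of attribute tuples, y: class labels
        self.classes = Counter(y)
        self.N = len(y)
        self.counts = defaultdict(Counter)     # (class, attribute index) -> value counts
        self.values = defaultdict(set)         # attribute index -> observed values
        for xs, v in zip(X, y):
            for i, a in enumerate(xs):
                self.counts[(v, i)][a] += 1
                self.values[i].add(a)
        return self

    def predict(self, xs):
        best, best_score = None, float("-inf")
        for v, n_v in self.classes.items():
            score = n_v / self.N               # P(v_j | D)
            for i, a in enumerate(xs):
                p = 1.0 / len(self.values[i])  # uniform prior estimate p
                n_c = self.counts[(v, i)][a]
                score *= (n_c + self.m * p) / (n_v + self.m)
            if score > best_score:
                best, best_score = v, score
        return best

# usage: NaiveBayes().fit(X_train, y_train).predict(("sunny", "high"))
```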

6. Probabilistic Model for CLASSIFICATION


In statistical classification, there are two main models:

• In the case of generative models, to find the conditional probability P(C_i|x, D), we
estimate the prior probability P(C_i) and the likelihood P(x|C_i) with the
help of the training data D and use Bayes' theorem to calculate the posterior
probability P(C_i|x) → e.g. naive Bayes classifier

• In the case of discriminative models, to find the probability, directly assume some
functional form for 𝑃(𝐶𝑖 |𝒙) and then estimate the parameters of 𝑃(𝐶𝑖 |𝒙) with the
help of the training data. → e.g. logistic regression


Probabilistic Generative Models

The posterior can be expressed as:

posterior = (prior × likelihood) / evidence  →  P(C_i|x) = P(x|C_i) P(C_i) / Σ_j P(x|C_j) P(C_j)

with P(C_i) = π_i and P(x|C_i) = N(x; μ_i, Σ) Gaussian.

Assuming 2 classes C_1, C_2 and D = {(x_n, t_n)}_{n=1}^N, with t_n = 1 if x_n ∈ C_1 and t_n = 0 if x_n ∈ C_2.
Let N_1 be the number of samples in D belonging to C_1, and N_2 the number of samples in D belonging to C_2.
In this case P(C_1|x) = σ(a) = σ(w^T x + w_0) and P(C_2|x) = 1 − P(C_1|x), with

w = Σ^{-1} (μ_1 − μ_2)
w_0 = −(1/2) μ_1^T Σ^{-1} μ_1 + (1/2) μ_2^T Σ^{-1} μ_2 + ln( P(C_1) / P(C_2) )

Then the optimal maximum likelihood solution is given by:

π*, μ_1*, μ_2*, Σ* = argmax_{π, μ_1, μ_2, Σ} P(t|π, μ_1, μ_2, Σ, D)

where P(t|π, μ_1, μ_2, Σ, D) = Π_{n=1}^N [π · N(x_n; μ_1, Σ)]^{t_n} · [(1 − π) · N(x_n; μ_2, Σ)]^{1 − t_n}

Then the prediction for a new sample x′ ∉ D is P(C_1|x′) = σ(w*^T x′ + w_0*).

Probabilistic Discriminative Models

Without estimating the generative model parameters, estimate directly

P(C_k|x̃, D) = exp(a_k) / Σ_j exp(a_j)

with x̃ = (1, x)^T, w̃ = (w_0, w)^T, a_k = w^T x + w_0 = w̃^T x̃.
The maximum likelihood solution is: w̃* = argmax_{w̃} ln P(t|w̃, X)


Logistic Regression
Given a target function f : X → C and a dataset D,
assume a parametric model for the posterior probability P(C_k|x̃, w̃):
- σ(w̃^T x̃) if 2 classes
- exp(w̃_k^T x̃) / Σ_{j=1}^K exp(w̃_j^T x̃) if K classes

Define the error function E(w̃) as the negative log-likelihood.

Solve the optimization problem: w̃* = argmin_{w̃} E(w̃)

Classify a new sample x̃′ as C_{k*} where k* = argmax_{k=1,...,K} P(C_k|x̃′, w̃*)

The minimization can in principle be solved analytically or iteratively; for logistic
regression it is solved iteratively → Iterative Re-weighted Least Squares.

Iterative Re-weighted Least Squares: apply Newton-Raphson iterative optimization
to minimize E(w̃), with gradient ∇E(w̃) = Σ_{n=1}^N (y_n − t_n) x̃_n

Generalization:
Given a target function f : X → C and a data set D:
- assume a prediction parametric model y(x; θ), y(x; θ) ≈ f(x)
- define an error function E(θ)
- solve the optimization problem θ* = argmin_θ E(θ)
- classify a new sample x′ as y(x′; θ*).

NB: all the methods described above can be applied in a transformed space of the input
(feature space).
Given a function φ : x̃ → Φ (Φ is the feature space), each sample x̃_n can be mapped to a
feature vector φ_n = φ(x̃_n).
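The notes use IRLS; as a simpler illustration of the same objective (negative log-likelihood with gradient Σ (y_n − t_n) x̃_n), here is a plain gradient-descent sketch for the two-class case (NumPy; the learning rate and iteration count are illustrative):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def fit_logistic(X, t, lr=0.1, iters=1000):
    """Two-class logistic regression, y = sigma(w~^T x~), targets t in {0, 1}."""
    Xt = np.hstack([np.ones((len(X), 1)), X])   # x~ = (1, x): absorb the bias w0
    w = np.zeros(Xt.shape[1])
    for _ in range(iters):
        y = sigmoid(Xt @ w)                     # predicted P(C1 | x~)
        grad = Xt.T @ (y - t)                   # gradient of E(w~) = sum_n (y_n - t_n) x~_n
        w -= lr * grad / len(X)                 # step on the averaged gradient
    return w

def predict(w, X):
    Xt = np.hstack([np.ones((len(X), 1)), X])
    return (sigmoid(Xt @ w) >= 0.5).astype(int)
```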


7. Linear Models for Classification


Assume that the dataset is linearly separable (there exists some hyperplane that
splits the space into two regions such that different classes are separated).
Such a hyperplane is generated by a function f : R^n → C = {c_1, c_2, ..., c_m}

- If 2 classes: y(x) = w^T x + w_0 = w̃^T x̃
- If K classes: y(x) = ( y_1(x), ..., y_K(x) )^T, with y_k(x) = w_k^T x + w_{k,0} = w̃_k^T x̃, i.e. y(x) = W̃^T x̃

A K-class discriminant comprises K linear functions. For a new instance x (not in the dataset):

y(x) = W̃^T x̃

Classify x as C_k if y_k(x) > y_j(x) for all j ≠ k (j, k = 1, ..., K),
where the decision boundary between C_k and C_j is: (w̃_k − w̃_j)^T x̃ = 0

Our goal is to determine W̃ such that y(x) = W̃^T x̃ is the K-class discriminant. To do
this there are different approaches:
• Least squares
• Perceptron
• Fisher's linear discriminant
• Support Vector Machines

Least Squares
Given D, find the linear discriminant y(x) = W̃^T x̃.
→ Minimize the sum-of-squares error function E(W̃); the solution is:

W̃ = (X̃^T X̃)^{-1} X̃^T T = X̃^† T  →  y(x) = W̃^T x̃ = T^T (X̃^†)^T x̃

with T the target matrix whose rows are t_1^T, ..., t_N^T, where if x_n ∈ C_k then t_{n,k} = 1 and t_{n,j} = 0, ∀ j ≠ k.

Classification of a new instance x not in the dataset:
use the learnt W̃ to compute y(x), then assign class C_k to x, where k = argmax_{i∈{1,...,K}} y_i(x).

PROBLEM: least squares corresponds to assuming Gaussian conditional distributions
→ not robust to outliers!


Perceptron
The Perceptron is a linear classification algorithm. It consists of a single node that takes a
row of data as input and predicts a class label. This is achieved by calculating the
weighted sum of the inputs plus a bias (whose input is fixed to 1). The weighted sum of the inputs of the model is
called the activation.

Perceptron model: o(x) = sign(w^T x), or o(x) = w^T x if unthresholded.

To learn w from the training examples D = {(x_n, t_n)}_{n=1}^N, minimize the squared error (loss
function) of the unthresholded output:

E(w) = (1/2) Σ_n (t_n − w^T x_n)^2

Since we need to minimize this error we move in the direction opposite to the gradient,
whose components are:

∂E/∂w_i = −Σ_n (t_n − o_n) x_{n,i}

so we can update the weight w_i by w_i ← w_i + Δw_i, where Δw_i = η Σ_n (t_n − o_n) x_{n,i}.

Perceptron algorithm:
Given the perceptron model o(x) = sign(w^T x) and data set D, determine the weights w.
• The initial values of the model weights are set to small random values.
• ŵ_i ← ŵ_i + Δw_i: model weights are updated with a small proportion of the error
  each batch; the proportion is controlled by a hyperparameter called the
  learning rate, typically set to a small value (too large a value makes the updates unstable):
  w(t + 1) = w(t) + learning_rate · (expected − predicted) · input
▪ Training stops when the error made by the model falls to a low level or no longer
  improves.

▪ The perceptron is a linear classifier: it will classify all the inputs correctly if the
  training set D is linearly separable and η is sufficiently small.
▪ Incremental and mini-batch modes speed up convergence and are less sensitive to
  local minima.
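A minimal sketch of the update rule above for a binary problem with labels in {+1, −1} (NumPy; the stopping criterion and learning rate are illustrative choices):

```python
import numpy as np

def train_perceptron(X, t, eta=0.1, epochs=100):
    """Perceptron with o(x) = sign(w^T x); the bias is absorbed as an extra input fixed to 1."""
    Xt = np.hstack([np.ones((len(X), 1)), X])
    w = np.zeros(Xt.shape[1])
    for _ in range(epochs):
        errors = 0
        for x_n, t_n in zip(Xt, t):             # incremental (sample-by-sample) updates
            o_n = np.sign(x_n @ w) or -1.0      # treat sign(0) as -1
            if o_n != t_n:
                w += eta * (t_n - o_n) * x_n    # w <- w + eta * (expected - predicted) * input
                errors += 1
        if errors == 0:                          # linearly separable data: converged
            break
    return w
```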


Fisher’s linear discriminant

Given a two-class classification problem, Fisher’s linear discriminant is given by the
function y = w^T x, and the classification of new instances is given by
  x ∈ C_1 if y ≥ −w_0
  x ∈ C_2 otherwise,
corresponding to the projection on a line determined by w.
We adjust w to find a direction that maximizes class separation.

Consider a data set with N_1 points in C_1 and N_2 points in C_2:

→ choose w that maximizes the Fisher criterion

J(w) = (m_2 − m_1)^2 / (s_1^2 + s_2^2) = (w^T S_B w) / (w^T S_W w)

with m_k the projected class means, s_k^2 the within-class variances of the projected data,
S_B the between-class and S_W the within-class covariance matrices.

dJ(w)/dw = 0 → w* = S_W^{-1} (m_2 − m_1)  and  w_0 = w^T m

For multiple classes: y = W^T x → maximize J(W) = Tr{ (W S_W W^T)^{-1} (W S_B W^T) }


Support Vector Machine [SVM]


SVMs are based on the idea of finding a hyperplane that best divides a dataset into
two classes.
Support vectors are the data points nearest to the
hyperplane, the points of a data set that, if removed,
would alter the position of the dividing hyperplane. The
distance between the hyperplane and the nearest data
point from either set is known as the margin. The goal is
to choose a hyperplane with the greatest possible margin
between the hyperplane and any point within the
training set.

Let’s consider binary classification f : X → {+1, −1} with data set
D = {(x_n, t_n)}_{n=1}^N, t_n ∈ {+1, −1}, and a linear model y(x) = w^T x + w_0.

Assume D is linearly separable → there exists a hyperplane such that t_n y(x_n) > 0 for all n.

Let x_k be the closest point of the dataset D to the hyperplane h̄ : w̄^T x + w̄_0 = 0.

The margin is estimated as the minimum distance among all the points in the
dataset from the hyperplane:

margin = min_n t_n (w^T x_n + w_0) / ||w||

and to maximize the margin:

w*, w_0* = argmax_{w, w_0} { (1/||w||) min_n [ t_n (w^T x_n + w_0) ] }

To solve:
Rescaling w and w_0 does not affect the solution, so we rescale in such a way that for the closest point x_k
we have: t_k (w^T x_k + w_0) = 1.
When the maximum margin hyperplane w*, w_0* is found, there will be at least 2 closest
points x_k^+ and x_k^- (one for each class).
The optimal solution is when both are at the same distance 1/||w||.


In the canonical representation of the problem the maximum margin hyperplane can
be found by solving the optimization problem:

minimize (1/2)||w||^2   subject to   t_n (w^T x_n + w_0) ≥ 1,  n = 1, ..., N

This is a quadratic programming problem, solved with the Lagrangian method. →

w* = Σ_{n=1}^N a_n* t_n x_n

▪ Very sparse (because of the KKT conditions)
▪ even if the dimension is large, only few values a_n* are > 0, so the optimization behaves well
▪ all the points/samples x_n for which a_n* = 0 will not contribute to the solution!
▪ Robust to outliers.

Support vectors: the points x_k such that t_k y(x_k) = 1 and a_k* > 0

Hyperplane expressed with support vectors: y(x) = Σ_{x_j∈SV} a_j* t_j x^T x_j + w_0* = 0
In fact, the other vectors x_n ∉ SV do not contribute (a_n* = 0).
To compute w_0: for any support vector x_k, w_0* = t_k − Σ_{x_j∈SV} a_j* t_j x_k^T x_j;

a more stable solution is obtained by averaging over all the support vectors.

Given the maximum margin hyperplane determined by a_k*, w_0*,
classification of a new instance x′ is given by the prediction model:

y(x′) = sign( Σ_{x_j∈SV} a_j* t_j x′^T x_j + w_0* )

▪ The optimization problem for determining w, w_0 (dimension d + 1, with X = R^d) is
  transformed into an optimization problem for determining a (dimension |D|)
▪ Efficient when d << |D| (most of the a_i will be zero).
▪ Very useful when d is large or infinite.

What if data are ALMOST (the majority) linearly separable?


Two cases:


1) SVM with soft margin constraints

IDEA: relax the constraints (adding a cost!)
Introduce new variables: slack variables ξ_n ≥ 0, n = 1, ..., N

Soft margin constraint: t_n y(x_n) ≥ 1 − ξ_n

Optimization problem:

minimize C Σ_{n=1}^N ξ_n + (1/2)||w||^2   subject to   t_n y(x_n) ≥ 1 − ξ_n,  ξ_n ≥ 0

The solution is similar to the case of linearly separable data →

w* = Σ_{n=1}^N a_n* t_n x_n

with a_n* computed as the solution of a Lagrangian optimization problem.

2) Basis functions
IDEA: transform the input into a new space (e.g. polar coordinates) → replace x with φ(x) in
all the formulas.

- Decision boundaries will be linear in the feature
  space φ and non-linear in the original space x
- Classes that are linearly separable in the feature
  space φ may not be separable in the input space x.

(* there exists a family of basis functions, the kernels, that works well when you cannot find φ(x) explicitly)

Linear models for non-linear functions:


To learn non-linear function 𝑓 ∶ 𝑋 → {𝐶1 , . . . , 𝐶𝐾 } from data set 𝐷 non-linearly
separable → find a non-linear transformation 𝜙 and learn a linear model
- Two classes: 𝑦(𝒙) = 𝒘𝑇 𝜙(𝑥) + 𝑤0
- Multiple classes: 𝑦𝑘 (𝒙) = 𝒘𝑇𝑘 𝜙(𝑥) + 𝑤𝑘0


8. Linear Models for Regression


Define a model 𝑦(𝑥; 𝑤) with parameter w to approximate the target
function 𝑓, our model will be of the kind:
𝑦(𝒙; 𝒘) = 𝑤0 + 𝑤1 𝑥1 + ⋯ + 𝑤𝑑 𝑥𝑑 = 𝒘𝑻 𝒙

Where both w, x are vectors of dimension d.


There are some cases in which the dataset is non linear, so we can
use a non linear function on x of the kind:
𝑀
𝑦(𝒙; 𝒘) = ∑ 𝑤𝑗 𝜙𝑗 (𝑥) = 𝒘𝑻 𝜑(𝒙)
𝑗=0
Which is still linear in 𝑤 but not in 𝑥.

Es: polynomial curve fitting

Learning algorithm for linear regression:

Maximum Likelihood
If our target value t is affected by noise 𝜖: 𝑡 = 𝑦(𝒙; 𝒘) + 𝜖

We now have a probability that the target is correct given the regression. If we
assume 𝜖 to be gaussian we have: 𝑃(𝜖|𝛽) = 𝒩(𝜖|0, 𝛽 −1 ), with precision (inverse
variance) 𝛽.


Assume the observations are independent and identically distributed (i.i.d.).

We seek the maximum of the likelihood function: P(t|x, w, β) = N(t|y(x; w), β^{-1})

or, equivalently, of the log-likelihood:

ln P(t|X, w, β) = −β E_D(w) + (N/2) ln β − (N/2) ln(2π),  with  E_D(w) = (1/2) Σ_{n=1}^N (t_n − w^T φ(x_n))^2

Since the second term is constant we focus on the first one. Then to solve the problem
we do: w* = argmin_w E_D(w),
since maximizing the likelihood ⟺ minimizing the least-squares error.

NB: if we plot the solution we may find some values of w that are very large (||w_i|| ≫ 0) → this determines a
non-smooth function! → To control overfitting we can add a regularization term on
the parameters of the kind:

argmin_w E_D(w) + λ E_W(w)

a common choice is E_W(w) = (1/2) w^T w

Moreover, note that: E_D(w) = (1/2) (t − Φw)^T (t − Φw)

Optimality condition → ∇E_D = 0 ↔ Φ^T Φ w = Φ^T t → w_ML = (Φ^T Φ)^{-1} Φ^T t = Φ^† t

If the dimension is big, the pseudoinverse Φ^† is complicated to compute → we use the
stochastic gradient descent algorithm (SGD) (sequential learning):

ŵ ← ŵ − η ∇E_n  →  ŵ ← ŵ + η [t_n − ŵ^T φ(x_n)] φ(x_n)

For multiple outputs: y(x; W) = W^T φ(x), and similarly as before we obtain: W_ML = (Φ^T Φ)^{-1} Φ^T T
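A short sketch of both solutions above, the closed-form pseudoinverse and the sequential SGD update, for a polynomial basis (NumPy; the basis choice and learning rate are illustrative):

```python
import numpy as np

def poly_design(x, M=3):
    """Design matrix Phi with phi_j(x) = x^j, j = 0..M."""
    return np.vander(x, M + 1, increasing=True)

def fit_ml(x, t, M=3):
    """w_ML = (Phi^T Phi)^(-1) Phi^T t, via the pseudoinverse."""
    Phi = poly_design(x, M)
    return np.linalg.pinv(Phi) @ t

def fit_sgd(x, t, M=3, eta=0.01, epochs=200):
    """Sequential learning: w <- w + eta * (t_n - w^T phi(x_n)) * phi(x_n)."""
    Phi = poly_design(x, M)
    w = np.zeros(M + 1)
    for _ in range(epochs):
        for phi_n, t_n in zip(Phi, t):
            w += eta * (t_n - w @ phi_n) * phi_n
    return w
```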


9. Kernel Methods
Kernel methods overcome difficulties in defining non-linear models. Kernel methods
use kernels (or basis functions) to map the input data into a different space. After this
mapping, simple models can be trained on the new feature space, instead of the input
space, which can result in an increase in the performance of the models.
This approach is called the "kernel trick", which avoids the explicit mapping that is
needed to get linear learning algorithms to learn a nonlinear function or decision
boundary

Consider a linear model y(x; w) = w^T x with dataset D = {(x_n, t_n)}_{n=1}^N.

Minimize J(w) = (t − Xw)^T (t − Xw) + λ ||w||^2

→ Optimal solution: w* = (X^T X + λ I_d)^{-1} X^T t

X is the design matrix (representation of the dataset).

We can express w* by defining α = (X X^T + λ I_N)^{-1} t, with α_n = −(1/λ)(w^T x_n − t_n) →

w* = X^T α = Σ_{n=1}^N α_n x_n

So our model will be: y(x; w*) = w*^T x = Σ_{n=1}^N α_n x_n^T x,
with α = (K + λ I_N)^{-1} t and K = X X^T the Gram matrix.
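A compact sketch of this dual (kernelized) solution, using an RBF kernel as an example of replacing x^T x′ with k(x, x′) (NumPy; the kernel width gamma is an illustrative hyper-parameter):

```python
import numpy as np

def rbf_kernel(A, B, gamma=1.0):
    """k(x, x') = exp(-gamma * ||x - x'||^2), computed for all pairs of rows of A and B."""
    d2 = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-gamma * d2)

def kernel_ridge_fit(X, t, lam=0.1, gamma=1.0):
    """alpha = (K + lambda I_N)^(-1) t, with K the Gram matrix."""
    K = rbf_kernel(X, X, gamma)
    return np.linalg.solve(K + lam * np.eye(len(X)), t)

def kernel_ridge_predict(X_train, alpha, X_new, gamma=1.0):
    """y(x) = sum_n alpha_n k(x_n, x)."""
    return rbf_kernel(X_new, X_train, gamma) @ alpha
```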

▪ linear model with linear kernel k(x, x′) = x^T x′: solve argmin_α J(α) with Gram matrix K_{nm} = x_n^T x_m

▪ linear model with any kernel k: solve argmin_α J(α) with K_{nm} = k(x_n, x_m)

Kernel trick: if the input vector x appears in an algorithm only in the form of an inner
product x^T x′, we can replace it with some kernel k(x, x′).
Approach: use a similarity measure k(x, x′) ≥ 0 between the instances x, x′
- k(x, x′) is called a kernel function.
- Note: if we have φ(x), a possible choice is k(x, x′) = φ(x)^T φ(x′)
Typically k is:
  symmetric: k(x, x′) = k(x′, x)
  non-negative: k(x, x′) ≥ 0


- We can apply kernelization also in regression and SVM.
- It is usually good to normalize the data.

Input normalization:
the input data in the dataset D must be normalized in order for the kernel to be a good
similarity measure in practice.

Several types of normalizations:

Kernel families:
 Linear
 Polynomial
 Radial Basis Function (RBF)
 Sigmoid

Kernelized SVM
It is one of the most effective ML methods for classification and regression.
- It still requires model selection and hyper-parameter tuning.

Classification:
In SVM, the solution has the form: w* = Σ_{n=1}^N α_n x_n

Linear model:
y(x; α) = sign(w_0 + Σ_{n=1}^N α_n x_n^T x)  →  kernelized: y(x; α) = sign(w_0 + Σ_{n=1}^N α_n k(x_n, x))

and it can be solved as a Lagrangian problem.
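As a usage-level sketch, scikit-learn's SVC implements this kind of kernelized classifier; the toy data and hyper-parameter values below are illustrative assumptions:

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

# toy two-class data (two Gaussian blobs), purely for illustration
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(3, 1, (50, 2))])
y = np.array([0] * 50 + [1] * 50)

# normalize the inputs (see "Input normalization" above), then fit an RBF-kernel SVM
model = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0, gamma="scale"))
model.fit(X, y)
print(model.score(X, y))   # training accuracy; in practice evaluate on a held-out set
```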

Regression:
Linear model for regression y = w^T x and data set D.
Minimize the regularized loss function: J(w) = Σ_{n=1}^N E(y_n, t_n) + λ ||w||^2

The IDEA:
- points close to the predicted model are good enough → do not count them in the
  error of the model
- decrease the effect of the points far away

Consider: J(w) = C Σ_{n=1}^N E_ε(y_n, t_n) + (1/2)||w||^2

with C the inverse of λ and E_ε the ε-insensitive error function:

E_ε(y, t) = 0 if |y − t| < ε,  E_ε(y, t) = |y − t| − ε otherwise.

It is not differentiable → difficult to solve.

Introduce slack variables ξ_n^+, ξ_n^- ≥ 0:

- if all points are inside the ε-tube → 0 error (optimal!)
- if there are some external points, the solution will depend on them

The loss function can be rewritten as: J(w) = C Σ_{n=1}^N (ξ_n^+ + ξ_n^-) + (1/2)||w||^2

subject to:  t_n ≤ y_n + ε + ξ_n^+  and  t_n ≥ y_n − ε − ξ_n^-

and it can be solved as a Lagrangian (it is a standard quadratic program).

From the Karush-Kuhn-Tucker (KKT) conditions, only support vectors contribute to
predictions:
• â_n > 0 → ε + ξ_n + y_n − t_n = 0 → the data point lies on or above the upper boundary of the
  ε-tube
• â′_n > 0 → ε + ξ′_n − y_n + t_n = 0 → the data point lies on or below the lower boundary of the
  ε-tube

All the other data points inside the ε-tube have â_n = 0 and â′_n = 0 and thus do not
contribute to the prediction.


10. Instance based Learning

Recap:
• Parametric algorithm: we have a fixed set of parameters θ that we try to
  find while training on the data. After we have found the optimal values for these
  parameters, we can use the model to make predictions.

• Non-parametric algorithm: the number of parameters grows with the amount of
  data and the model is not explicit (e.g. instance-based learning).

Instance-based learning involves memorizing training data in order to make


predictions about future data points. This approach doesn’t require any prior
knowledge or assumptions about the data, which makes it easy to implement and
understand. However, it can be computationally expensive since all of the training data
needs to be stored in memory before making a prediction. Additionally, this approach
doesn’t generalize well to unseen data sets because its predictions are based on
memorized examples rather than learned models.

K-nearest neighbors (KNN)

KNN is an algorithm that belongs to the instance-based learning class of algorithms.
It relies on a measure of similarity between each pair of data points (e.g. Euclidean
distance).
Once the similarity between two points is calculated, KNN looks at the neighbors
around that point and uses these neighbors as examples to make its prediction.
Classification rule:
1. Find the K nearest neighbours of the new instance x
2. Assign to x the most common label among those K neighbours

Likelihood of class c for a new instance x:

P(c|x, D) = K_c / K

with N_K(x, D) the K points of D nearest to x and K_c the number of them belonging to class c.


▪ instead of creating a generalizable model from all of the data, KNN looks for
similarities among individual data points and makes predictions accordingly.
▪ Require storage of all data
▪ Increasing K brings to smoother regions (reducing overfitting)

One of the many issues that affect the performance of the kNN algorithm is the choice
of k. If k is too small, the algorithm would be more sensitive to outliers. If k is too large,
then the neighborhood may include too many points from other classes.
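A minimal sketch of the classification rule above (NumPy; Euclidean distance and majority vote, with K as a free parameter):

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x, K=3):
    """Assign to x the most common label among its K nearest neighbours."""
    dists = np.linalg.norm(X_train - x, axis=1)      # Euclidean distance to every stored sample
    nearest = np.argsort(dists)[:K]                   # indices of the K closest points
    votes = Counter(y_train[i] for i in nearest)
    return votes.most_common(1)[0][0]

def knn_class_likelihood(X_train, y_train, x, c, K=3):
    """P(c | x, D) = K_c / K: fraction of the K neighbours that belong to class c."""
    dists = np.linalg.norm(X_train - x, axis=1)
    nearest = np.argsort(dists)[:K]
    return sum(y_train[i] == c for i in nearest) / K
```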

K-NN can be kernelized.

The distance function used in computing N_K(x, D),

||x − x_n||^2 = x^T x + x_n^T x_n − 2 x^T x_n,

can be kernelized by replacing each inner product with a kernel k(x, x_n).

Locally weighted regression


Locally weighted regression methods are a generalization of k-Nearest Neighbour.
Instead of fitting a single regression line, you fit many local linear regression models. The final
resulting smooth curve is obtained by combining all those local regression models.

→ fit a local regression model around the query sample 𝑥𝑞


1. Compute 𝑁𝐾 (𝑥𝑞 , 𝐷): K-nearest neighbors of 𝑥𝑞
2. Fit a regression model 𝑦(𝑥; 𝑤) on 𝑁𝐾 (𝑥𝑞 , 𝐷)
3. Return 𝑦(𝑥𝑞 ; 𝑤)

Issues:
- Distance values are affected by scaling problems
- The model can be considered as a "compression"


11. Artificial Neural Network


Artificial neural networks (ANNs) are a subset of ML and are at the heart of deep
learning algorithms. Their name and structure are inspired by the human brain,
mimicking the way that biological neurons signal to one another.

FeedForward NN (FNN)
Most NNs are feedforward: information flows in one direction only, from input to output.
There are no loops or cycles in the network, and the output of each layer is
determined by the weights and biases of the connections between neurons, as
well as the activation function of each neuron.

The hidden layer output can be seen as an array of unit (neuron) activations based on the
connections with the previous units.
The final function is a composition of elementary functions f and parameters θ
(one for each layer): f(x; θ) = f^(3)( f^(2)( f^(1)(x; θ^(1)); θ^(2) ); θ^(3) )

Such compositions of functions can tackle non-convex problems, contrary to linear or
kernel methods. Moreover, linear models cannot model interactions between
input variables.

In general, when you have multiple layers, each layer transforms one space
into another → a NN can be seen as a sequence of transformations.

Architecture Design:
Choosing an appropriate architecture for a neural network is an important
consideration, as it can impact the model’s performance and ability to learn.


1. Depth: given by the number of hidden layers.


Universal approximation theorem: a FFN with a linear output layer and at least
one hidden layer with any “squashing” activation function (e.g., sigmoid) can
approximate any Borel measurable function with any desired amount of error,
provided that enough hidden units are used.

The depth is correlated with performance, but adding too many layers does not
keep increasing it.

+ Overfitting problem if the model is too powerful.

2. Width: number of units (neurons) in each layer.


In general, it is exponential in the size of the input. In theory, a short and wide
network can approximate any function. In practice a deep and narrow
network is easier to train and provides better results in generalization.

3. Activation function: which kind of units.


The activation function of a neuron determines the output of the neuron given
a set of inputs. Different activation functions can be used in different layers of
the network.

They are used to introduce non-linearity into the network, allowing the model
to learn and make more complex decisions.
There are several types of activation functions that are commonly used in
neural networks such as:
▪ Rectified linear units (ReLU): g(α) = max(0, α), which is easy to
  optimize but not differentiable at 0.

▪ Sigmoid g(α) = σ(α) and hyperbolic tangent g(α) = tanh(α); both saturate
  easily when there is no logarithm at the output, are slow, and are useful for
  RNNs and autoencoders.

4. Loss function: which kind of cost function.


It is the Cost function used for Training. This cost function is a guide to
training process by providing a measure of how well the network is
performing. The goal is to minimize the cost function by updating the weights
during training (this process is usually done by using an optimization
algorithm such as the Stochastic Gradient Descent).

NB: the loss function usually includes the cost function plus a regularization term to
prevent overfitting by penalizing large weights.

Recall maximum likelihood, in which we wanted the class that maximized the conditional
distribution P(C_i|x, D); if we use the same principle here we get the cross-
entropy loss function:

J(θ) = E_{x,t∼D}[ −ln P(t|x, θ) ]


Choice of network output units and cost function are related.


(𝑜𝑢𝑡𝑝𝑢𝑡 𝑎𝑐𝑡𝑖𝑣𝑎𝑡𝑖𝑜𝑛 𝑓𝑢𝑛𝑐𝑡𝑖𝑜𝑛 ↔ 𝑙𝑜𝑠𝑠 𝑓𝑢𝑛𝑐𝑡𝑖𝑜𝑛) They change together with the
problem requirement:

1) Regression
We use the identity activation function y = W^T h + b
and a Gaussian noise model p(t|x) = N(t|y, β^{-1})
→ Cost function: maximum likelihood (cross-entropy), which is
equivalent to minimizing the mean squared error:
J(θ) = E_{x,t∼D}[ −ln P(t|x, θ) ]
Note: linear units do not saturate.

2) Binary Classification
We use the Sigmoid activation function y = σ(w^T h + b).
The likelihood corresponds to a Bernoulli distribution:
J(θ) = E_{x,t∼D}[ −ln p(t|x) ],  with  −ln p(t|x) = softplus((1 − 2t)α)
and α = w^T h + b.
Note: the unit saturates only when it already gives the correct answer.

3) Multi-class Classification
We use the Softmax activation function y_i = softmax(α)_i = exp(α_i) / Σ_j exp(α_j).
The likelihood corresponds to a Multinomial distribution:
J_i(θ) = E_{x,t∼D}[ −ln softmax(α)_i ]
Note: the unit saturates only when the errors are minimal.

Gradient computation (BackProp)


Gradient computation “BackProp” (Back-propagation) is an algorithm used to train
an ANN.
Algorithm steps:
a. Feed the input data through the network, and calculate the output of each
neuron in each layer.
b. Calculate the error at the output layer using the cost function.
c. Propagate the error back through the network
d. Adjust the weights and biases of the connections between neurons
e. Repeat the process until the cost function is minimized and the network
is able to make accurate predictions.

Since we propagate the error back through the network, we compute the gradient
of the cost function with respect to each parameter using the chain rule.
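A small sketch of steps a-d for a one-hidden-layer network with sigmoid hidden units and a squared-error cost (NumPy; the layer sizes and learning rate are illustrative):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def backprop_step(x, t, W1, b1, W2, b2, eta=0.1):
    """One forward + backward pass; returns the updated parameters."""
    # a. forward: compute the output of each layer
    h = sigmoid(W1 @ x + b1)
    y = W2 @ h + b2                              # linear output unit (regression case)
    # b. error at the output layer (squared-error cost)
    delta_out = y - t
    # c. propagate the error back through the network (chain rule)
    delta_hidden = (W2.T @ delta_out) * h * (1 - h)
    # d. adjust weights and biases
    W2 -= eta * np.outer(delta_out, h);   b2 -= eta * delta_out
    W1 -= eta * np.outer(delta_hidden, x); b1 -= eta * delta_hidden
    return W1, b1, W2, b2
```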


Learning Algorithms
Since the backprop is just a method to compute the gradient (not learn) there are
various other algorithms for this purpose.

• Stochastic Gradient Descent:

It involves calculating the gradient of the cost function with respect to the weights and
biases of the connections between neurons, and using it to update them in such a way as to
minimize the cost function.
SGD is an iterative process, and the learning rate determines the size of the steps taken
to minimize the cost function. Using a learning rate η, this method computes the
gradient on a subset (minibatch) of samples from the dataset. This gradient is
computed with the backprop method and the parameters are updated with the
following formula:

θ_{i+1} = θ_i − η g

where g is the value of the gradient with respect to θ_i.

Critical choice for η → η usually changes according to some rule through
the iterations (e.g. if we are far away from the optimum η should be large, then it should
become roughly constant).

optimization performance can be improved with momentum and adaptive learning


rate:

• SGD with momentum: To accelerate the training an additional parameter v can


be used to increase or decrease the value of the update depending on the training
iteration.
- Momentum is applied before computing the gradient.
- Sometimes it improves convergence rate.
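A brief sketch of minibatch SGD with momentum as described above (NumPy; `grad_fn` is an assumed user-supplied function returning the minibatch gradient, and the hyper-parameters are illustrative):

```python
import numpy as np

def sgd_momentum(theta, grad_fn, data, eta=0.01, mu=0.9, batch_size=32, epochs=10, seed=0):
    """theta <- theta + v, with v <- mu * v - eta * g (g: minibatch gradient from grad_fn)."""
    rng = np.random.default_rng(seed)
    v = np.zeros_like(theta)
    for _ in range(epochs):
        idx = rng.permutation(len(data))
        for start in range(0, len(data), batch_size):
            batch = [data[i] for i in idx[start:start + batch_size]]
            g = grad_fn(theta, batch)        # gradient of the cost on the minibatch (backprop)
            v = mu * v - eta * g             # momentum accumulates past gradients
            theta = theta + v
    return theta
```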

• Algorithms with adaptive learning rates: Based on analysis of the gradient of


the loss function it is possible to determine, at any step of the algorithm, whether
the learning rate should be increased or decreased.

Which optimization algorithm should I choose? → Empirical approach.


generalization error can be reduced with regularization.

Regularization
Technique used to reduce overfitting. In general it involves adding a penalty term
to the cost function, which discourages the network from learning overly complex
patterns in the training data. For FNN we have:

• Parameter norm penalties: add a regularization term to the cost function


in order to decrease the magnitude of each parameter, so no parameter
saturates.
J̄(θ) = J(θ) + λ E_reg(θ),  with  E_reg(θ) = Σ_j |θ_j|^q

• Dataset Augmentation: Transform the dataset (image distortion, noise adding)


in order to generate additional data in the Dataset. Done before training step.

• Early stopping: stop the iterations early to avoid overfitting the training set,
  i.e. when the training loss keeps decreasing while the test/validation loss starts increasing.
  Use cross-validation to determine when to stop.

• Parameter sharing: constrain subset of parameters to be equal. (limit the


model = limit the overfitting)
o Decrease memory consumption
o In CNNs allow translation invariance.

• Dropout: Randomly remove network units with some probability 𝛼.


12. Convolutional Neural Networks (CNNs)


Convolutional neural networks (CNNs) are particularly effective at processing
data that has a grid-like topology, such as an image. They take the image as
input, subject it to combinations of weights and biases, extract features and
output the results (convolution). In simple words, they extract the features of the image
and convert it into a lower dimension without losing its characteristics, by using a
kernel.

In CNNs there are usually more stages:

𝑖𝑛𝑝𝑢𝑡 → 𝑐𝑜𝑛𝑣𝑜𝑙𝑢𝑡𝑖𝑜𝑛 + 𝑎𝑐𝑡𝑖𝑣𝑎𝑡𝑖𝑜𝑛 𝑓𝑢𝑛𝑐𝑡𝑖𝑜𝑛 → 𝑝𝑜𝑜𝑙𝑖𝑛𝑔 → 𝐹. 𝐶 → 𝑠𝑜𝑓𝑡𝑚𝑎𝑥 → 𝑜𝑢𝑡𝑝𝑢𝑡

1) Convolution: a mathematical (linear) operation in the convolutional layer, applied to
   an input image to produce a feature map by sliding the kernel matrix over the
   image, element-wise multiplying and summing the entries, in
   such a way as to extract some features.

Different types of convolutions:


- 1D conv: kernel slide in 1D (e.g. sequence of words)
- 2D conv: kernel slide in 2D (e.g. image)
- 3D conv: kernel slide in 3D (e.g. video)

2) Activation function (detector): the convolutional layer also uses a nonlinear
   activation function such as ReLU, which sets all negative values to zero.

3) Pooling: used to reduce the spatial volume of the input after convolution.
   It is an averaging of some sort (usually max or average), used to implement
   invariance to local translations.

If applied with a stride greater than 1, it reduces the dimension of the output.


4) Fully Connected Layer (FC): involves weights, biases and neurons. It
   connects the neurons in one layer to the neurons in another layer. It is used to
   classify images between different categories by training.

5) Softmax / Logistic Layer: resides at the end of the FC layers. Logistic is used for
   binary classification, softmax for multi-class classification.

▪ Stride denotes how many steps we move at each step of the convolution
  (by default it is one). We can observe that the size of the output is smaller than the
  input.

▪ To maintain the same dimension of the output as the input, we use Padding: the process
  of adding zeros to the input matrix symmetrically.
  In Keras, this is specified via the "padding" argument of the Conv2D layer, which
  has two possible values:
  - valid (no padding, p = 0, the default): the filter is applied only at positions where it fully overlaps the input.
  - same: calculates and adds the padding required to the input image to ensure
    that the output has the same shape as the input (p = w_k / 2)

▪ In order to reduce overfitting:

  o Sparse connectivity: outputs depend only on a few inputs
  o Parameter sharing: learn only one set of parameters (the kernel) shared by
    all the units: k parameters instead of m × n (NB: k ≪ m)

Parameter size:
Consider an input of size w_in × h_in × d_in, d_out kernels of size w_k × h_k × d_in, stride s and
padding p.
- The dimension of the output feature map is:
  w_out = (w_in − w_k + 2p)/s + 1  and  h_out = (h_in − h_k + 2p)/s + 1
- The number of trainable parameters of the convolutional layer is:
  |θ| = w_k · h_k · d_in · d_out + d_out

(NB: parameters in a convolutional layer ≪ parameters in a F.C. layer)
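A tiny helper implementing the two formulas above (Python; the example numbers are illustrative):

```python
def conv_layer_size(w_in, h_in, d_in, d_out, w_k, h_k, stride=1, pad=0):
    """Output feature-map size and trainable-parameter count of a conv layer."""
    w_out = (w_in - w_k + 2 * pad) // stride + 1
    h_out = (h_in - h_k + 2 * pad) // stride + 1
    n_params = w_k * h_k * d_in * d_out + d_out   # weights + one bias per output channel
    return (w_out, h_out, d_out), n_params

# e.g. a 32x32x3 input, 16 kernels of 3x3, stride 1, "same" padding (p = 1)
print(conv_layer_size(32, 32, 3, 16, 3, 3, stride=1, pad=1))   # ((32, 32, 16), 448)
```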


Famous" CNNs
• LeNet: designed to recognize handwritten digits, and consists of a series of
convolutional and pooling layers followed by F.C. (dense) layers. LeNet is a
relatively small and simple CNN, but it was an important step in the development
of modern CNNs.
• AlexNet, which won the ILSVRC in 2012 and significantly advanced the state of
the art in image classification. Not commonly used anymore.
• VGG, which won the ILSVRC in 2014 and is known for its deep and narrow
architecture. Commonly used also today.
• Inception by Google, which won the ILSVRC in 2014 and introduced the concept
of "inception modules" which make use of multiple parallel convolutional and
pooling layers. Commonly used also today.
• ResNet by Microsoft, which won ILSVRC 2015 and introduced the concept of
residual connections, which made it possible to train very deep CNNs effectively.

Transfer Learning

Transfer learning is a technique that involves using a pre-trained CNN as a starting


point for training a new task, usually by adding additional layers or fine-tuning the
existing layers on a new dataset. It is a useful technique when the amount of labeled
data available for the new task is limited, as it allows the model to make use of the
knowledge it has acquired from a related task.

GOAL → improve learning of 𝑓𝑇 of the target learning task, using the knowledge in the
source domain 𝐷𝑆 and source learning task 𝑇𝑠 (i.e., after training 𝑓𝑆 )

There are two main approaches to transfer learning:

1. Fine-tuning: consists of unfreezing a few of the top layers of a frozen model


base used for feature extraction, and jointly training both the newly added
part of the model and these top layers.
PRO: better performance as the model can learn task-specific features that
are not present in the pre-trained model.
CON: more computationally expensive than feature extraction.

2. Feature extraction: This involves using the pre-trained model as a fixed


feature extractor, where the output of the pre-trained model's layers is fed as
input to a new model that is trained to perform the target task. The weights
of the pretrained model are not updated during training.
PRO: no need to train the CNN!
CON: cannot modify features, source and target domains should be as
compatible as possible.


The main difference between the two is that in fine-tuning, more layers of the pre-trained
model get unfrozen and tuned on custom data. This fine-tuning usually takes more data
than feature extraction to be effective.
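A hedged sketch of the feature-extraction approach in Keras (the notes already mention Keras for the Conv2D padding argument); the base model, head architecture and number of target classes below are illustrative assumptions:

```python
import tensorflow as tf

# pre-trained convolutional base (source task: ImageNet), used as a frozen feature extractor
base = tf.keras.applications.VGG16(weights="imagenet", include_top=False,
                                   input_shape=(224, 224, 3))
base.trainable = False          # feature extraction: do not update the pre-trained weights

# new head trained for the target task (e.g. 10 classes)
model = tf.keras.Sequential([
    base,
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(10, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])
# model.fit(train_images, train_labels, epochs=5)
# fine-tuning would instead unfreeze a few of the top layers of `base` and train them jointly
```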


13. Multiple Learner


The main idea is to train multiple models/learners and combine their results to solve a
more complex problem. By training multiple models and combining their
predictions, it is often possible to achieve better results than with a single model.
Another reason is to reduce the risk of overfitting.

Models can be trained in parallel (voting or bagging) or in sequence (boosting).

Voting: simple method in which the models are trained in parallel on the same
dataset D and then the outputs are summed (for regression) or chosen based on the
most voted class (for classification).

Stacking: it is an ensemble method in which you train some base


learners on the original data and then use them to make predictions
on a hold-out set. These predictions are then used as input features
for the second-level model, which is trained to predict the target
variable using the base learner predictions as input.

Cascading: ensemble learning based on the concatenation of


several classifiers, using all information collected from the output from a given
classifier as additional information for the next classifier in the cascade.


Bagging: is an ensemble method that involves training multiple models in parallel


on random subsets of the training data. The final prediction is made by averaging the
predictions of all the models.

Boosting: an ensemble method that involves training multiple models
sequentially on weighted data (base classifiers),
with each model attempting to correct the mistakes of the previous
one (points misclassified by previous classifiers are given
greater weight). The final prediction is based on a weighted
majority of votes.

Adaboost: is a boosting algorithm which works by weighting the training data


points that are misclassified by the weak learners in such a way that the next weak
learner focuses more on the misclassified examples.
This process is repeated for a predetermined number of rounds or until a satisfactory
level of accuracy is achieved.
The final prediction of the AdaBoost model is made by taking a weighted average of
the predictions of all the weak learners, with the weights reflecting the learners'
importance or accuracy.


It outperforms many other base learners in many problems.

AdaBoost can be explained as the sequential minimization of an exponential
error function:
after each classifier is trained, the weights of misclassified examples are
increased and the weights of correctly classified examples are decreased,
in such a way that the exponential error of the current ensemble is minimized.


14. Unsupervised Learning


This kind of learning is used when the dataset does not have any labels, so we need
to cluster similar samples x based on some similarity metric.

Gaussian Mixture Model (GMM)


One way to do it is to assume the data is generated from a mixture (sum) of a finite number k of Gaussian distributions with unknown parameters.
Given a dataset, a GMM estimates the parameters of each of the component Gaussians and the mixture weights, which represent the proportion of the data generated by each component. The probability density function of a GMM with K components is given by:

P(x) = ∑_{k=1}^{K} π_k 𝒩(x; μ_k, Σ_k)

→ Unsupervised Learning algorithms determine mixed probability distributions from data!
Each instance x_n is generated by:
- Choosing k according to the prior probabilities [π_1, ..., π_K]
- Generating an instance at random according to that Gaussian, thus using μ_k, Σ_k

• INVERSE WAY: from (π_k, μ_k, Σ_k), and by using P(x), you generate data.
• NORMAL WAY: making some assumptions (uniform π_k = 1/K, same covariance Σ_k = Σ_{k′}) → we want to estimate the k means of the data.

K-means
The K-means is a clustering algorithm that aims to partition n observations into k
clusters in which each observation belongs to the cluster with the nearest mean.
It is an iterative algorithm that starts by randomly initializing k centroids, then
assigns each observation to the cluster corresponding to the closest centroid. The
centroids are then updated based on the mean of the points in each cluster, and
the process is repeated until convergence.

Step 1. Decide the value of k (= the number of clusters you want to compute).
Step 2. Assign the training samples as follows:
- Take the first k training samples as single-element clusters.
- Assign each of the remaining (N − k) training samples to the cluster with the nearest centroid. After each assignment, recompute the centroid of the new cluster.
Step 3. Take each sample in sequence and compute its distance from the centroid of each of the clusters. If a sample is not currently in the cluster with the closest centroid, switch it to that cluster and update the centroids of the two clusters involved in the switch.
Step 4. Repeat step 3 until convergence is achieved, that is, until a pass through the training samples causes no new assignments.
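The steps above describe a sequential assignment scheme; as an illustrative sketch, the batch (Lloyd-style) variant usually implemented in practice looks like this (empty-cluster handling omitted, names not from the notes):

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Batch K-means sketch: assign each point to the nearest centroid,
    then recompute centroids, until assignments stop changing."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]    # random initial centroids
    for _ in range(n_iter):
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)                            # nearest-centroid assignment
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centroids, centroids):                # convergence: no change
            break
        centroids = new_centroids
    return labels, centroids
```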


CONS:
- Sensitive to initial conditions: it can get stuck in local minima, especially when few data are available.
- Not robust to outliers: data points very far from a centroid may pull the centroid away from the real one.

Expectation Maximization (EM)


The EM algorithm is an iterative method for finding maximum likelihood or
maximum a posteriori (MAP) estimates of parameters when the model depends
on unobserved latent variables 𝑧𝑛𝑘 *.
(NB: it is an extended version of the K-means algorithm)

Step 1. Initialize the model parameters randomly: μ_k^0, π_k^0, Σ_k^0.

Step 2. Repeat until a termination condition, for t = 0, ..., T:
• E-STEP: compute the expected value of the complete-data log likelihood function given the current parameter estimates.

• M-STEP: maximize the expected value of the complete-data log likelihood function with respect to the model parameters.

Step 3. Return the final estimates of the model parameters.
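An illustrative NumPy/SciPy sketch of EM for a Gaussian mixture (the initialization and the fixed number of iterations are arbitrary assumptions):

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(X, K, n_iter=50, seed=0):
    """EM sketch for a GMM: the E-step computes responsibilities gamma(z_nk),
    the M-step re-estimates pi_k, mu_k, Sigma_k from them."""
    N, d = X.shape
    rng = np.random.default_rng(seed)
    mu = X[rng.choice(N, K, replace=False)]          # random initial means
    sigma = np.array([np.eye(d)] * K)                # identity covariances
    pi = np.full(K, 1.0 / K)                         # uniform mixing weights
    for _ in range(n_iter):
        # E-step: responsibilities gamma[n, k] = P(z_k = 1 | x_n)
        gamma = np.stack([pi[k] * multivariate_normal.pdf(X, mu[k], sigma[k])
                          for k in range(K)], axis=1)
        gamma /= gamma.sum(axis=1, keepdims=True)
        # M-step: update parameters using the expected assignments
        Nk = gamma.sum(axis=0)
        mu = (gamma.T @ X) / Nk[:, None]
        for k in range(K):
            diff = X - mu[k]
            sigma[k] = (gamma[:, k, None] * diff).T @ diff / Nk[k]
        pi = Nk / N
    return pi, mu, sigma
```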


General EM problem
Given:
• Observed data 𝑋 = {𝑥1 , . . . , 𝑥𝑁 }
• Unobserved latent values 𝑍 = {𝑧1 , . . . , 𝑧𝑁 }
• Parametrized probability distribution 𝑃(𝑌|𝜃), where
▪ Y = {y_1, ..., y_N} is the full data, y_n = <x_n, z_n>
▪ 𝜃 are the parameters
Determine the values of the model parameters that best explain the observations:
• 𝜃 ∗ that (locally) maximizes 𝐸[𝑙𝑛 𝑃(𝑌|𝜃)]


* Gaussian Mixture Model


P(x) = ∑_{k=1}^{K} π_k 𝒩(x; μ_k, Σ_k)

Introduce new variables z_k ∈ {0, 1}, with z = (z_1, ..., z_K)^T, using a 1-out-of-K encoding (z_k = 1 only for one value of k, 0 otherwise).
Let's define P(z_k = 1) = π_k → thus: P(z) = ∏_{k=1}^{K} π_k^{z_k}
For a given value of z: P(x|z_k = 1) = 𝒩(x; μ_k, Σ_k) → P(x|z) = ∏_{k=1}^{K} 𝒩(x; μ_k, Σ_k)^{z_k}

Joint distribution: 𝑃(𝑥, 𝑧) = 𝑃(𝑥|𝑧)𝑃(𝑧) (chain rule).

When z are variables with 1-out-of-K encoding and P(z_k = 1) = π_k:

P(x) = ∑_z P(z) P(x|z) = ∑_{k=1}^{K} π_k 𝒩(x; μ_k, Σ_k)

GMM distribution 𝑃(𝑥) can be seen as the marginalization of a distribution 𝑃(𝑥, 𝑧) over
variables 𝑧.
Given observations D = {x_n}_{n=1}^{N}, each data point x_n is associated with the corresponding variable z_n, which is unknown.

Note: 𝑧𝑛𝑘 = 1 denotes 𝑥𝑛 sampled from Gaussian 𝑘. 𝒛𝒏 are called latent variables.
→ Analysis of latent variables allows for a better understanding of input data
(e.g., dimensionality reduction).

Let’s define the posterior

γ(z_k) ≡ P(z_k = 1|x) = P(z_k = 1) P(x|z_k = 1) / P(x)

γ(z_k) = π_k 𝒩(x; μ_k, Σ_k) / ∑_{j=1}^{K} π_j 𝒩(x; μ_j, Σ_j)

▪ 𝜋𝑘 : prior probability of 𝑧𝑘
▪ 𝛾(𝑧𝑘 ): posterior probability after observation of 𝑥.


15. Dimensionality Reduction


Dimensionality reduction is the process of reducing the number of features or variables
in a dataset while preserving as much information as possible. Dimensionality
reduction aims at identifying the real/intrinsic degrees of freedom of a data set.
This can be done for a variety of reasons, such as to reduce noise in the data, to make
the data more visually comprehensible, or to reduce the computational cost of certain
algorithms.

Latent Variables: variables that are not directly observed in the data but are inferred from the observed data. For example, if the input is an image, there are many nominal degrees of freedom, and each combination of parameter values could generate a sample of our dataset, but most configurations have no meaning. The goal is to identify these variables and use them to represent the data in a lower-dimensional space (smaller dimensional space ↔ more representative of the problem).

→ Dimensionality reduction: how to transform a problem with many dimensions into one with far fewer dimensions, with the goal of keeping as much information as possible.

Principal Component Analysis (PCA)


PCA is a linear technique for dimensionality reduction that is based on finding the
directions of maximum variance in the data. PCA finds a set of orthogonal axes,
called principal components, that capture the most information in the data. Data can
then be projected onto these principal components to obtain a lower-dimensional
representation of the data.

General idea:
- Start by calculating the average of x_1 and x_2 to get the center of the data.
- Shift the data so that the center of the graph coincides with the center of the data, and fit a line to the samples. Start with a random line and adjust it until the best fit is found: projecting the samples onto the line, the best line minimizes the sample-to-projection distances, or equivalently maximizes the distances from the projected points to the origin, i.e. it maximizes the sum of squared distances (= eigenvalue for PC1). This line is the first principal component PC1, which is a linear combination of x_1 and x_2.
- The direction vector of that line is the singular vector or eigenvector for PC1. When you do PCA with SVD, the vector is scaled so that its length = 1.
- In 2D, PC2 is simply the line through the origin that is perpendicular to PC1.
- Finally, rotate until the components are axis-aligned for the final plot and use the projections to draw the samples.


Math steps:
Given data {𝑥𝑛 } ∈ 𝑅𝑑
1. Compute the Covariance of the dataset

2. Compute the eigenvectors {u_i} and eigenvalues {λ_i} of the covariance matrix. The eigenvectors of the covariance matrix are the principal components, and the eigenvalues are their corresponding variances → we want to maximize the variance, which is achieved by the eigenvector corresponding to the largest eigenvalue.

3. Order the eigenvalues in decreasing order and select the k eigenvectors with the largest eigenvalues. By selecting the top k principal components, we are able to retain as much information as possible while still reducing the dimensionality of the data.

4. Use this eigenvector matrix to transform the original dataset into a new k-dimensional subspace by matrix multiplication.
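A minimal NumPy sketch of these steps (illustrative; variable names are not from the notes):

```python
import numpy as np

def pca(X, k):
    """PCA sketch: center the data, eigendecompose the covariance matrix,
    keep the top-k eigenvectors, and project the data onto them."""
    Xc = X - X.mean(axis=0)                # center the data
    S = np.cov(Xc, rowvar=False)           # covariance matrix of the dataset
    eigvals, eigvecs = np.linalg.eigh(S)   # eigh: for symmetric matrices
    order = np.argsort(eigvals)[::-1][:k]  # largest eigenvalues first
    U = eigvecs[:, order]                  # d x k matrix of principal components
    return Xc @ U, eigvals[order]          # projected data and retained variances
```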

PCA for high-dimensional data: if the number of points is smaller than the dimensionality, i.e. N < d → at least d − N + 1 eigenvalues of S are zero.
Then consider u_i = (1 / √(N λ_i)) X^T v_i, where v_i are the eigenvectors of X X^T.

Probabilistic PCA
It is an extension of PCA that models the data as being generated by a probabilistic
process. This allows for the data to be represented by a low-dimensional latent
variable, which can be useful in cases where the data is noisy or incomplete. The goal
is to find a set of latent variables that best explains the data distribution.
- Assume data as a linear combination of a low-dimensional latent variable Z, with
some added Gaussian noise 𝑥 = 𝑊𝑧 + 𝜇
- Assume Gaussian distribution of z: 𝑃(𝑧) = 𝑁(𝑧; 0, 𝐼)
- Assume linear-gaussian relationship between latent variables and data
𝑃(𝑥|𝑧) = 𝑁(𝑥; 𝑊𝑧 + 𝜇, 𝜎 2 𝐼)

The goal is to find the parameters of the model (W, μ, σ²) that best explain the data distribution. This is done by maximizing the likelihood of the data, given the model parameters:

arg max_{W,μ,σ²} ln P(X|W, μ, σ²) = ∑_{n=1}^{N} ln P(x_n|W, μ, σ²)

Setting the derivatives to 0 gives a closed-form solution.

Maximum likelihood solution for the probabilistic PCA model can be obtained also
with EM algorithm.


Linear representations are not sufficient for complex data: if you use PCA, different points may be projected onto the same location.
→ How to deal with non-linear transformations? → Non-linear latent variable models (Autoencoders, GANs, ...). These models use non-linear functions to map the observed data to the latent space.

Autoassociative Neural Networks (Autoencoders)


These are NNs with reduced-size hidden layers (bottlenecks) in the middle, which learn to reconstruct their input by minimizing a loss function. An autoencoder is a combination of two networks: an encoder, which maps the input data to a lower-dimensional representation, and a decoder, which maps the lower-dimensional representation back to the original data.

The encoder transforms the input into an intermediate value z (latent space) and the decoder reconstructs it as x_n′.
The GOAL is to minimize the difference between the input and the reconstruction (x_n′ ≈ x_n); this is typically done by training the model to minimize the reconstruction error (typically the MSE).

By minimizing the reconstruction error, the autoencoder learns a compressed representation of the data that captures the most important features and discards the noise.

Autoencoders are very useful for anomaly detection (one-class classification), as in the sketch below.
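A minimal PyTorch sketch (the 784-dimensional input and 32-dimensional latent space are illustrative assumptions):

```python
import torch.nn as nn

class Autoencoder(nn.Module):
    def __init__(self, d_in=784, d_latent=32):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(d_in, 128), nn.ReLU(),
                                     nn.Linear(128, d_latent))        # bottleneck z
        self.decoder = nn.Sequential(nn.Linear(d_latent, 128), nn.ReLU(),
                                     nn.Linear(128, d_in))            # reconstruction x'

    def forward(self, x):
        z = self.encoder(x)          # latent representation
        return self.decoder(z)       # reconstructed input

model = Autoencoder()
loss_fn = nn.MSELoss()               # reconstruction error ||x - x'||^2
# Anomaly detection: after training on "normal" data only, a sample whose
# reconstruction error loss_fn(model(x), x) exceeds a chosen threshold
# is flagged as anomalous.
```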


Generative Models are models that learn to generate new data from some underlying probability distribution. They can be used for a variety of tasks, such as image synthesis and language generation, and as a way to extract features from a dataset.

Variational Autoencoder (VAE)


They are generative models that combine the encoder-decoder architecture of
autoencoders with the probabilistic modelling framework: instead of encoding an
input as a single point, we encode it as a distribution over the latent space.
It can be defined as being an autoencoder whose training is regularised to avoid
overfitting and ensure that the latent space has good properties that enable generative
process.

To produce a distribution, consider parametric distributions (typically Gaussian).

The VAE generative model can be represented as:
p(x | z) = 𝒩(x | f(z), I)    q(z | x) = 𝒩(z | g(x), Σ)
where 𝒩 is a Gaussian distribution, z is the latent variable, x is the data, and f and g are the decoder and encoder networks.

The objective function of the VAE is:

L = E_{q(z|x)}[log p(x|z)] − KL(q(z|x) || p(z))

where the first term is the (reconstruction) likelihood of the data and the second term, the Kullback-Leibler divergence, is used as a regularization term to ensure that the encoder q(z|x) approximates the prior distribution of the latent variable, p(z).

- How to prevent degeneration? → add a loss term based on the Kullback-Leibler divergence (maximizing the Evidence Lower Bound).
- The sampling operation is not differentiable → re-parametrization trick, as sketched below.
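A hedged sketch of the re-parametrization trick and the closed-form KL term for a Gaussian encoder (standard formulas, not taken verbatim from the notes):

```python
import torch

def reparameterize(mu, log_var):
    """Instead of sampling z ~ N(mu, sigma^2) directly (non-differentiable),
    sample eps ~ N(0, I) and shift/scale it, so gradients flow through
    mu and log_var."""
    std = torch.exp(0.5 * log_var)
    eps = torch.randn_like(std)
    return mu + eps * std

def kl_term(mu, log_var):
    """Closed-form KL(q(z|x) || N(0, I)) used as the VAE regularizer."""
    return -0.5 * torch.sum(1 + log_var - mu.pow(2) - log_var.exp())
```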

Generative Adversarial Networks (GANs)


GANs are an approach that can generate data with similar characteristics as the input real data. The idea is to use an inverted CNN and adversarial training.
A GAN consists of two networks that train together:
• Generator (decoder) which given a vector of random values (latent
inputs) as input, generates data similar to the real data;
• Discriminator (critic) which is trained to identify if the observations are
from the generated or from the real data.


To train a GAN, train both networks simultaneously to maximize the performance of both (making the networks compete with each other):

- Train the generator by using the entire model (generator + discriminator), with the discriminator layers fixed, on a batch of data {(r_k, Real)}, to generate data that "fools" the discriminator into believing that the sample is "real" (r_k are random values of the latent variable).
→ maximize the loss of the discriminator when given generated data;

- Train the discriminator with batches of data {(x_n, Real)}, {(x′_m, Fake)} to distinguish between real and generated data as well as possible.
→ minimize the loss of the discriminator when given batches of both real and generated data.

Ideally, these strategies result in a generator that generates convincingly realistic data and a discriminator that has learned strong feature representations that are characteristic of the training data.
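A compact sketch of one adversarial training step (G, D, their optimizers and the 100-dimensional latent space are hypothetical assumptions; D is assumed to output probabilities of shape (batch, 1)):

```python
import torch
import torch.nn as nn

bce = nn.BCELoss()

def gan_step(G, D, opt_g, opt_d, real, latent_dim=100):
    b = real.size(0)
    ones, zeros = torch.ones(b, 1), torch.zeros(b, 1)

    # Discriminator step: real batch labelled Real, generated batch labelled Fake.
    fake = G(torch.randn(b, latent_dim))
    loss_d = bce(D(real), ones) + bce(D(fake.detach()), zeros)
    opt_d.zero_grad(); loss_d.backward(); opt_d.step()

    # Generator step: try to make the discriminator call the fakes Real.
    loss_g = bce(D(fake), ones)
    opt_g.zero_grad(); loss_g.backward(); opt_g.step()
    return loss_d.item(), loss_g.item()
```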


16. MDP and RL


In a dynamic system, when the state is fully observable, the decision-making
problem for an agent is to decide which action must be executed in a given state.

Reinforcement Learning: learning a behaviour function π: X → A, given D = {<x_1, a_1, r_1, ..., x_n, a_n, r_n>_i}.
In a mathematical framework we consider the so-called Markov Decision Process.

Markov Decision Process (MDP)


MDP is a discrete-time control process. It provides a mathematical framework for
modelling decision-making in situations where outcomes are partly random and partly
under the control of a decision maker. In MDP an agent (e.g. a human) observes the
environment and takes actions.
𝑀𝐷𝑃 =< 𝑋, 𝐴, 𝛿, 𝑟 >
At each time step, the process is in some state, and the decision maker may choose any available action. The process responds at the next time step by randomly moving into a new state and giving the decision maker a corresponding reward.

A MDP follows the Markov properties which state:

▪ Once the current state is known, the evolution of the dynamic system does not
depend on the history of states, actions and observations.
▪ The current state contains all the information needed to predict the future.
▪ Future states are conditionally independent of past states and past
observations given the current state.
▪ The knowledge about the current state makes past, present and future
observations statistically independent.

Given an MDP, we want to find an optimal policy 𝝅: 𝑋 → 𝐴, that takes as input some
state 𝑥𝑡 and chooses an action 𝑎𝑡 to maximize the reward 𝑟𝑡 .
Optimality = maximizing the (expected value of the) cumulative discounted reward:
V^π(x_1) = E[r̄_1 + γ r̄_2 + γ² r̄_3 + ...]

where r̄_t = r(x_t, a_t, x_{t+1}), a_t = π(x_t) and γ ∈ [0, 1] is the discount factor for future rewards.
Optimal policy: π* = arg max_π V^π(x), ∀x ∈ X.

▪ π* is an optimal policy iff, for any other policy π: V^{π*}(x) ≥ V^π(x), ∀x.
▪ For infinite-horizon problems, a stationary MDP always has an optimal stationary policy.


One-state Markov Decision Processes (MDP)

𝑀𝐷𝑃 =< {𝑥0 }, 𝐴, 𝛿, 𝑟 >

The optimal policy: 𝜋 ∗ (𝑥0 ) = 𝑎𝑖 .

1. If r(a_i) is deterministic and known → we can compute the optimal policy without experiments: π*(x_0) = arg max_{a_i∈A} r(a_i).
2. If r(a_i) is deterministic and unknown → we do experiments: we execute each action a_i and collect reward r_i, and then the optimal policy is π*(x_0) = a_i with i = arg max_{i=1...|A|} r(i) (NB: |A| iterations needed).
3. If r(a_i) is non-deterministic (Gaussian distribution) and known → we have information about the Gaussian distribution, so we know the mean, and the optimal policy is π*(x_0) = arg max_{a_i∈A} E[r(a_i)] (no test needed).
4. If r(a_i) is non-deterministic and unknown → we have no info about r → run repeated trials (e.g. 30 per action) and compute the average reward.
NB: if we collect info over time, at a certain point we may already know that a_i is better than a_j, and use fewer iterations.

When 𝛿 and 𝑟 are not known, the agent cannot predict the effect of its actions. But it can
execute them and then observe the outcome.

Policy iteration: estimate directly π*.

Value iteration: estimate the value function V^π, from which the optimal policy π* can be determined:

π*(x) = arg max_{a∈A} [r(x, a) + γ V^π(δ(x, a))]

Only if δ and r are known.

To determine the optimal policy without knowing δ and r, the agent learns a Q-table: the policy is obtained just by observing the new state x′ and the immediate reward r after executing the chosen action.

Exploration-Exploitation trade-off
During trials, an agent has a set of actions to select from:
- some have been selected before → exploiting what the agent already knows, selecting the action that maximizes Q̂(x, a)
- others have never been taken before → exploring a random action (possibly with a low value of Q̂(x, a))
When the agent explores, it can improve its current knowledge and gain better rewards in the long run; when it exploits, it gets more reward immediately → we wish to keep a balance between exploration and exploitation, without giving up on either.


Action selection:
• ε-greedy strategy: it selects the action with the highest estimated reward most of the time. It works by choosing the best action with probability 1 − ε (exploitation) and a random action with probability ε (exploration). ε can decrease over time to balance exploration and exploitation (first exploration, then exploitation).

• soft-max strategy: it controls the relative levels of exploration and exploitation by mapping values into action probabilities: actions with higher Q̂ are assigned higher probabilities, but every action is assigned a non-zero probability, typically P(a_i) = k^{Q̂(x, a_i)} / ∑_j k^{Q̂(x, a_j)}.

k > 0 determines how strongly the selection favours actions with high Q̂ values.
k may increase over time (first exploration, then exploitation).

K-Armed bandit example:


The classic (stochastic) version of the k-armed bandit problem has k slot machines, each with some Gaussian distribution of winnings 𝒩(μ_i, σ_i) = r(a_i), and the goal is to earn the most money. In this context an RL agent can either:
- perform x trials on each machine, estimate the mean winning rate, and then choose the machine with the highest estimate;
- adopt an ε-greedy strategy, playing at random with probability ε and choosing the current best action with probability 1 − ε. In this case, the training rule is the incremental update
  Q̂_n(a_i) = Q̂_{n−1}(a_i) + α [r − Q̂_{n−1}(a_i)],
  with α = 1 / (1 + v_{n−1}(a_i)), where v_{n−1}(a_i) is the number of executions of action a_i up to time n − 1.

But if the mean μ_i changes over time (non-stationary case), then the above solutions would not work: the first would fail because μ_i can change after the x trials, and the ε-greedy strategy with decreasing α would not reach an optimal value.
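A rough sketch of this ε-greedy scheme with the incremental update (the `pull` function is a hypothetical stand-in for playing an arm and observing its reward):

```python
import numpy as np

def epsilon_greedy_bandit(pull, n_arms, n_steps=1000, eps=0.1, seed=0):
    """epsilon-greedy sketch for the k-armed bandit: pull(i) is assumed to
    return a (noisy) reward for arm i."""
    rng = np.random.default_rng(seed)
    Q = np.zeros(n_arms)     # estimated value of each arm
    v = np.zeros(n_arms)     # number of times each arm has been played
    for _ in range(n_steps):
        if rng.random() < eps:                 # explore: random arm
            a = int(rng.integers(n_arms))
        else:                                  # exploit: best current estimate
            a = int(np.argmax(Q))
        r = pull(a)
        Q[a] += (r - Q[a]) / (1 + v[a])        # alpha = 1 / (1 + v_{n-1}(a))
        v[a] += 1
    return Q
```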

Evaluating RL Agents: this is usually performed through the cumulative reward gained over time, which can be very noisy → a better approach could be:
- Repeat: execute k steps of learning and evaluate the current policy π_k
- Domain-specific performance metrics


Q Function (deterministic case): Q^π(x, a) is the expected value obtained when executing a in the state x and then acting according to π.

Q(x, a) ≡ r(x, a) + γ V*(δ(x, a))  →  π*(x) = arg max_{a∈A} Q(x, a)

Observe that: V*(x) = max_{a∈A} {r(x, a) + γ V*(δ(x, a))} = max_{a∈A} Q(x, a)
→ Q(x, a) ≡ r(x, a) + γ max_{a′∈A} Q(δ(x, a), a′)
→ Training rule: Q̂(x, a) ← r + γ max_{a′} Q̂(x′, a′), where x′ = δ(x, a) is the observed next state.

Q Function (non-deterministic case):


When the problem is non-deterministic, the transition function δ(x, a) is a probability distribution over next states given the current state and action:
P_a(x, x′) = P(x_{t+1} = x′ | x_t = x, a_t = a).
To define our value function, we must take expected values:
V^π(x) = E[r_t + γ r_{t+1} + γ² r_{t+2} + ...]  →  the optimal policy: π* = arg max_π V^π(x)

When we consider the reward function to be non-deterministic too, we must include it in the value function:
Q(x, a) = E[r(x, a) + γ V*(δ(x, a))] = ... = E[r(x, a)] + γ ∑_{x′} P(x′|x, a) max_{a′} Q(x′, a′)
and the optimal policy becomes: π*(x) = arg max_{a∈A} Q(x, a)

Q-learning generalizes to non-deterministic worlds with the training rule
Q̂_n(x, a) ← (1 − α_n) Q̂_{n−1}(x, a) + α_n [r + γ max_{a′} Q̂_{n−1}(x′, a′)],
which is equivalent to
Q̂_n(x, a) ← Q̂_{n−1}(x, a) + α_n [r + γ max_{a′} Q̂_{n−1}(x′, a′) − Q̂_{n−1}(x, a)],
with a decreasing learning rate such as α_n = 1 / (1 + visits_n(x, a)).


Temporal Difference Learning


Temporal difference learning (TD) is a class of model-free RL methods which learn by bootstrapping* the current estimate of the value function. We can think of model-free algorithms as trial-and-error methods: the agent explores the environment and learns from the outcomes of its actions directly, without constructing an internal model or an MDP, but by filling a table storing state-action values Q(s, a).
There are two different TD algorithms:
- On-policy uses the same strategy for both the behaviour and target policy
- Off-policy algorithms use a different strategy for the behaviour and target policy

Q-Learning Algorithm
It updates the Q-value using the Q-value of the next state and the greedy action after that
(off-policy). The goal is to maximize its total reward. It does this by adding the maximum
reward attainable from future states to the reward for achieving its current state,
effectively influencing the current action by the potential future reward.
The formula that updates the Q-value is as follows:

Q(x, a) ← Q(x, a) + α [r + γ max_{a′} Q(x′, a′) − Q(x, a)]

This is called the action-value function or Q-function. The function approximates the value of selecting a certain action in a certain state.

SARSA
SARSA (State–action–reward–state–action) is an on-policy algorithm, as it uses an ε-greedy strategy for all the steps. It updates the Q-value using the Q-value of the next state and keeps following the policy for the next action.
In this case:

Q(x, a) ← Q(x, a) + α [r + γ Q(x′, a′) − Q(x, a)]

This means that SARSA uses the action actually chosen in the next state (following the policy), whereas Q-Learning replaces it with the maximisation over the next actions' values.
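As a hedged sketch, the two tabular updates side by side (Q is assumed to be a NumPy array indexed by [state, action]; (x, a, r, x2) is one observed transition and a2 the next action chosen by the policy):

```python
import numpy as np

def q_learning_update(Q, x, a, r, x2, alpha, gamma):
    # Off-policy: bootstrap with the greedy (max) action in the next state.
    Q[x, a] += alpha * (r + gamma * np.max(Q[x2]) - Q[x, a])

def sarsa_update(Q, x, a, r, x2, a2, alpha, gamma):
    # On-policy: bootstrap with the action a2 actually taken by the policy.
    Q[x, a] += alpha * (r + gamma * Q[x2, a2] - Q[x, a])
```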

Convergence of non-deterministic algorithms: Q-learning will converge faster to an optimal policy than SARSA. However, fast convergence does not imply a better solution in terms of the resulting policy.

*bootstrapping usually refers to a self-starting process that is supposed to continue or grow without external input.


17. HMM and POMDP


Hidden Markov Model (HMM)
If the states x of an MDP are discrete and non-observable, but we have some observations z tied to them, we are dealing with a Hidden Markov Model (HMM).

In this case the process HMM = <X, Z, π_0> is defined by:

• observation (emission) model b_k(z_t) = P(z_t | x_t = k): the probability that a certain state emits a specific observation;
• transition model A_ij = P(x_t = j | x_{t−1} = i): the probabilities governing the transitions between hidden states;
• an initial distribution π_0 = P(x_0), which defines the probability that the model starts in state x_0.

Training → Forward-backward algorithm: it is used to calculate the probability P(x_t, z_1:t) of a state at a certain time, given the sequence of observations. Computing this probability directly would require marginalizing over all possible state sequences, the number of which grows exponentially with time; the forward algorithm takes advantage of the conditional independence rules of the hidden Markov model (HMM) to perform the calculation recursively.

- Let α_t(x_t) = P(x_t, z_1:t) = ∑_{x_{t−1}} P(x_t, x_{t−1}, z_1:t)

- The HMM is based on augmenting the Markov chain, so we can apply the chain rule to factorize the probabilities of the sequence:
  P(x_t, z_1:t) = ∑_{x_{t−1}} P(z_t | x_t, x_{t−1}, z_1:t−1) P(x_t | x_{t−1}, z_1:t−1) P(x_{t−1}, z_1:t−1)
  which, by the Markov assumptions, reduces to α_t(x_t) = P(z_t | x_t) ∑_{x_{t−1}} P(x_t | x_{t−1}) α_{t−1}(x_{t−1}).

- Thus, since 𝑃(𝑧𝑡 │𝑥𝑡 ) and 𝑃(𝑥𝑡 |𝑥𝑡−1 ) are given by the model's distribution
and transition probabilities, one can quickly calculate 𝛼𝑡 (𝑥𝑡 ) from
𝛼𝑡−1 (𝑥𝑡−1 ) and avoid incurring exponential computation time.
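A small NumPy sketch of this recursion (matrix shapes are assumptions: A is K×K with A[i, j] = P(x_t = j | x_{t−1} = i), B is K×|Z| with B[k, z] = P(z | x = k)):

```python
import numpy as np

def forward(A, B, pi0, z):
    """Forward algorithm sketch. z is the observation sequence (integer codes).
    Returns alpha[t, k] = P(x_t = k, z_1:t)."""
    T, K = len(z), len(pi0)
    alpha = np.zeros((T, K))
    alpha[0] = pi0 * B[:, z[0]]                      # base case
    for t in range(1, T):
        alpha[t] = B[:, z[t]] * (alpha[t - 1] @ A)   # recursion: avoids the exponential blow-up
    return alpha
```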

Learning: to determine the maximum likelihood estimate of the parameters


(transition probabilities and emission probability):
• if states can be observed at training time → parameters can be estimated with
statistical analysis,
• if states cannot be observed at training time →compute a local maximum
likelihood with an Expectation-Maximization (EM) method (e.g., Baum-Welch
algorithm).


Partially Observable MDP (POMDP)


POMDP is an MDP in which the agent cannot directly observe the underlying state.
(combines decision making of MDP and non-observability of HMM).

𝑃𝑂𝑀𝐷𝑃 = < 𝑋, 𝐴, 𝑍, 𝛿, 𝑟, 𝑜 >

Solution concept for POMDP:


 Option 1: map from history of observations to actions - too long!
 Option 2: belief state - probability distribution over the current state

Belief MDP: a Markovian belief state allows a POMDP to be formulated as an MDP where every belief is a state. The belief state b(x) is a probability distribution over the states (note that the set of belief states is infinite). The belief MDP is defined by:
• B, a set of belief states
• A, a set of actions
• τ(b, a, b′), a probability distribution over belief transitions
• ρ(b, a, b′), a reward function
A POMDP policy is a mapping from the history of observations (or belief states) to the actions: π: B → A

Given a current belief state b, the next belief state b′(x′), which represents the probability of being in state x′ after b, a and the observation z, is calculated as:

b′(x′) = η P(z | x′, a) ∑_{x∈X} b(x) P(x′ | x, a)

where η is a normalizing constant: the update combines the current belief state, the observation probability, and the original transition probability.
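A rough sketch of this update with tabular models (T and O are hypothetical per-action transition and observation matrices, indexed as T[a][x, x2] = P(x2 | x, a) and O[a][x2, z] = P(z | x2, a)):

```python
import numpy as np

def belief_update(b, a, z, T, O):
    """One belief-state update: predict with the transition model,
    weight by the observation likelihood, then normalize."""
    b2 = O[a][:, z] * (b @ T[a])
    return b2 / b2.sum()
```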

▪ Transition function: in the standard formulation, τ(b, a, b′) = ∑_{z∈Z} P(b′ | b, a, z) P(z | b, a)

▪ Reward function: in the standard formulation, ρ(b, a) = ∑_{x∈X} b(x) r(x, a)

▪ Value function: similar to traditional MDPs, replacing states with belief states

→ the optimal policy is obtained by optimizing the long-term reward; it can be represented as a policy tree over action/observation sequences.
