Week 3
Machine Learning
Introduction to Supervised Learning
• https://developers.google.com/machine-learning/glossary#model
• https://scikit-learn.org/stable/glossary.html
• https://seaborn.pydata.org/examples/index.html
3
Applications
• Association
• Supervised Learning
  • Classification
  • Regression
• Unsupervised Learning
• Reinforcement Learning
4
Supervised Learning: Uses
• Prediction of future cases: Use the rule to predict the output for
future inputs
• Knowledge extraction: The rule is easy to understand
• Compression: The rule is simpler than the data it explains
• Outlier detection: Exceptions that are not covered by the rule, e.g.,
fraud
5
Supervised Learning
• We discuss supervised learning starting from the simplest case, which
is learning a class from its positive and negative examples.
• We generalize and discuss the case of multiple classes, then
regression, where the outputs are continuous.
6
Learning a Class from Examples
• Class C of a “family car”
• Prediction: Is car x a family car?
• Knowledge extraction: What do people expect from a family car?
• Output:
Positive (+) and negative (–) examples
• Input representation:
x1: price, x2 : engine power
7
Training set X
$r = \begin{cases} 1 & \text{if } x \text{ is positive} \\ 0 & \text{if } x \text{ is negative} \end{cases}$

$X = \{x^t, r^t\}_{t=1}^{N}$

$h(x) = \begin{cases} 1 & \text{if } h \text{ classifies } x \text{ as positive} \\ 0 & \text{if } h \text{ classifies } x \text{ as negative} \end{cases}$

Empirical error: $E(h \mid X) = \sum_{t=1}^{N} \mathbf{1}\big(h(x^t) \neq r^t\big)$
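A minimal sketch of this empirical error in Python, using made-up price/engine-power values and an assumed axis-aligned rectangle hypothesis (the bounds are illustrative, not learned from the data):

```python
import numpy as np

# Toy training set: x1 = price, x2 = engine power; r = 1 for family car, 0 otherwise.
# All values are made up for illustration.
X = np.array([[12_000, 90], [15_000, 110], [30_000, 200], [8_000, 60], [22_000, 160]])
r = np.array([1, 1, 0, 0, 1])

def h(x, p1=10_000, p2=20_000, e1=80, e2=150):
    """Axis-aligned rectangle hypothesis: predict positive iff price and
    engine power fall inside the assumed bounds (bounds are illustrative)."""
    return int(p1 <= x[0] <= p2 and e1 <= x[1] <= e2)

# Empirical error E(h | X): number of training examples h misclassifies.
E = sum(h(x) != rt for x, rt in zip(X, r))
print(f"E(h | X) = {E} errors on {len(X)} examples")
```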
Any h ∈ H between S and G is consistent, and together they make up the
version space (Mitchell, 1997). 14
Margin
• Choose h with largest margin
18
Multiple Classes, C_i, i = 1, ..., K
The training set: $X = \{x^t, r^t\}_{t=1}^{N}$, where
$r_i^t = \begin{cases} 1 & \text{if } x^t \in C_i \\ 0 & \text{if } x^t \in C_j,\ j \neq i \end{cases}$
Train hypotheses h_i(x), i = 1, ..., K, one per class, each predicting r_i^t.

Regression
• If there is no noise, the task is interpolation. We would like to find the function
f(x) that passes through these points such that we have
$r^t = f(x^t)$ 20
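A minimal sketch of the one-vs-all encoding behind the multi-class setup above, using made-up class labels (the array names are illustrative):

```python
import numpy as np

# Toy 3-class problem; labels are integers 0..K-1 (illustrative data only).
K = 3
labels = np.array([0, 2, 1, 0, 1, 2])

# r_i^t = 1 if x^t belongs to class C_i, 0 otherwise (one column per class).
R = np.zeros((len(labels), K), dtype=int)
R[np.arange(len(labels)), labels] = 1
print(R)

# One hypothesis h_i is then trained per column, each treating class i as
# positive and every other class as negative (one-vs-all).
```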
Regression
• In time-series prediction, for example, we have data up to the present and we
want to predict the value for the future.
• In regression, there is noise added to the output of the unknown function:
  $r^t = f(x^t) + \varepsilon$
• The empirical error of an estimate g on the training set is
  $E(g \mid X) = \frac{1}{N} \sum_{t=1}^{N} \big( r^t - g(x^t) \big)^2$
21
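A minimal sketch of this empirical error for regression, using noisy samples from an assumed linear function:

```python
import numpy as np

def empirical_error(g, x, r):
    """E(g | X) = (1/N) * sum_t (r^t - g(x^t))^2 for a candidate function g."""
    return np.mean((r - g(x)) ** 2)

# Illustrative noisy samples: r^t = f(x^t) + noise, with f(x) = 2x + 1 assumed.
rng = np.random.default_rng(0)
x = np.linspace(0, 5, 20)
r = 2 * x + 1 + rng.normal(scale=0.3, size=x.shape)

print(empirical_error(lambda x: 2 * x + 1, x, r))  # small: close to the noise variance
print(empirical_error(lambda x: 0.5 * x, x, r))    # much larger error
```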
Regression
• The square of the difference is one error (loss) function that can be used; another
is the absolute value of the difference.
• Our aim is to find g(·) that minimizes the empirical error.
• Our approach is the same; we assume a hypothesis class for g(·) with a small set
of parameters.
• If we assume that g(x) is linear, we have
$g(x) = w_1 x_1 + \dots + w_d x_d + w_0 = \sum_{j=1}^{d} w_j x_j + w_0$
22
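A minimal sketch of fitting such a multivariate linear model with ordinary least squares (the data and "true" weights below are assumed for illustration):

```python
import numpy as np

# Illustrative data with d = 2 inputs; the "true" weights below are assumed.
rng = np.random.default_rng(1)
X = rng.uniform(0, 10, size=(50, 2))
r = 3 * X[:, 0] - 2 * X[:, 1] + 5 + rng.normal(scale=0.5, size=50)

# Append a column of ones so w0 is estimated together with w1..wd.
A = np.hstack([X, np.ones((len(X), 1))])
w, *_ = np.linalg.lstsq(A, r, rcond=None)

print("w1, w2, w0 =", np.round(w, 2))            # close to 3, -2, 5
print("g(x) at x = (1, 1):", np.array([1, 1, 1]) @ w)
```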
Regression
• Let us now go back to our example in previous section where we estimated the price of a
used car.
• There we used a single input linear model
$g(x) = w_1 x + w_0$
where $w_1$ and $w_0$ are the parameters to learn from data. The $w_1$ and $w_0$ values should minimize
$E(w_1, w_0 \mid X) = \frac{1}{N} \sum_{t=1}^{N} \big( r^t - (w_1 x^t + w_0) \big)^2$
• The output may be taken as a higher-order function of the input, for example quadratic:
$g(x) = w_2 x^2 + w_1 x + w_0$
23
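A minimal sketch of the used-car example, with made-up age/price values; np.polyfit is used here as a stand-in for minimizing E(w1, w0 | X):

```python
import numpy as np

# Made-up used-car data: x = age in years, r = price in $1000s (illustrative only).
x = np.array([1, 2, 3, 4, 5, 6, 7, 8], dtype=float)
r = np.array([19.0, 17.2, 15.1, 13.8, 12.9, 12.1, 11.8, 11.5])

# np.polyfit minimizes the same squared-error criterion.
lin = np.polyfit(x, r, deg=1)    # g(x) = w1*x + w0
quad = np.polyfit(x, r, deg=2)   # g(x) = w2*x^2 + w1*x + w0

for name, coeffs in [("linear", lin), ("quadratic", quad)]:
    mse = np.mean((r - np.polyval(coeffs, x)) ** 2)
    print(f"{name}: training MSE = {mse:.3f}")
```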
Regression
• Linear, second-order, and sixth-order polynomials are fitted to the same set of points.
• The highest order gives a perfect fit, but given this much data it is very unlikely that the real curve is so shaped.
• The second order seems better than the linear fit in capturing the trend in the training data.
[Figure: the same data points fitted with $g(x) = w_1 x + w_0$, $g(x) = w_2 x^2 + w_1 x + w_0$, and a sixth-order polynomial.]
24
Model Selection & Generalization
• If the training set we are given contains only a small subset of all
possible instances, the solution is not unique.
• This is an example of an ill-posed problem, where the data by itself is
not sufficient to find a unique solution.
• If learning is ill-posed, and data by itself is not sufficient to find the
solution, we should make some extra assumptions to have a unique
solution with the data we have.
• The set of assumptions we make to have learning possible is called
the inductive bias of the learning algorithm.
• The need for inductive bias: assumptions about the hypothesis class H.
25
Model Selection & Generalization
• Learning is not possible without inductive bias, and now
the question is how to choose the right bias.
• This is called model selection, which is choosing between
possible H.
• In answering this question, we should remember that the
aim of machine learning is rarely to replicate the training
data but rather to predict correctly for new cases.
• We would like to be able to generate the right output for
an input instance outside the training set, one for which
the correct output is not given in the training set.
• How well a model trained on the training set predicts the
right output for new instances is called generalization.
26
Underfitting
• For best generalization, we should match the
complexity of the hypothesis class H with the
complexity of the function underlying the
data.
• If H is less complex than the function, we have
underfitting, for example, when trying to fit a
line to data sampled from a third-order
polynomial.
• In such a case, as we increase the complexity,
the training error decreases.
• But if we have H that is too complex, the data
is not enough to constrain it and we may end
up with a bad hypothesis, h ∈ H.
27
Overfitting
• If there is noise, an overcomplex hypothesis may learn not only the underlying
function but also the noise in the data and may make a bad fit, for example, when
fitting a sixth-order polynomial to noisy data sampled from a third-order
polynomial.
• This is called overfitting.
• In such a case, having more training data helps but only up to a certain point.
• Given a training set and H, we can find h∈ H that
has the minimum training error but if H is not
chosen well, no matter which h∈ H we pick,
we will not have good generalization.
28
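A minimal sketch of this effect, fitting polynomials of increasing degree to noisy samples from an assumed third-order polynomial:

```python
import numpy as np

# Noisy samples from a third-order polynomial (the ground truth is assumed here).
rng = np.random.default_rng(2)
x = np.linspace(-1, 1, 10)
r = x**3 - 0.5 * x + rng.normal(scale=0.1, size=x.shape)

for deg in (1, 2, 3, 6):
    coeffs = np.polyfit(x, r, deg=deg)
    train_mse = np.mean((r - np.polyval(coeffs, x)) ** 2)
    print(f"degree {deg}: training MSE = {train_mse:.4f}")

# Training error keeps shrinking as the degree grows, but the degree-6 fit is
# chasing the noise (overfitting), while degree 1 is too simple (underfitting).
```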
Triple Trade-Off
• In all learning algorithms that are trained from example data, there is
a trade-off between three factors:
• the complexity of the hypothesis C(H) we fit to data, namely, the capacity of
the hypothesis class,
• the amount of training data (N), and
• the generalization error (E) on new examples.
• As N increases, E decreases.
• As C(H) increases, E first decreases and then increases.
29
Train and Validation Set
• We can measure the generalization ability of a hypothesis, namely, the
quality of its inductive bias, if we have access to data outside the training
set.
• We simulate this by dividing the dataset we have into two parts.
• We use one part for training (i.e., to fit a hypothesis); the remaining
part is called the validation set and is used to test the
generalization ability.
• Assuming large enough training and validation sets, the hypothesis that is
the most accurate on the validation set is the best one (best inductive
bias).
30
Cross-Validation and Test Set
• This process of splitting the data is called cross-validation.
• Note that if we then need to report the error to give an idea about the
expected error of our best model, we should not use the validation error.
• We have used the validation set to choose the best model, and it has
effectively become a part of the training set.
• We need a test set, containing examples not used in training or validation.
• We split the data as
• Training set (50%)
• Validation set (25%)
• Test set (25%)
• Resampling is used when there is little data.
31
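A minimal sketch of this 50/25/25 split and of choosing a model on the validation set before reporting test error (the data and candidate models are assumed for illustration):

```python
import numpy as np

# Illustrative 50/25/25 split; data again sampled from an assumed cubic + noise.
rng = np.random.default_rng(3)
x = rng.uniform(-1, 1, 200)
r = x**3 - 0.5 * x + rng.normal(scale=0.1, size=x.shape)

idx = rng.permutation(len(x))
tr, va, te = idx[:100], idx[100:150], idx[150:]   # 50% / 25% / 25%

def mse(coeffs, xs, rs):
    return np.mean((rs - np.polyval(coeffs, xs)) ** 2)

# Model selection: fit candidate models on the training set and pick
# the one with the lowest validation error ...
fits = {d: np.polyfit(x[tr], r[tr], deg=d) for d in (1, 2, 3, 6)}
best = min(fits, key=lambda d: mse(fits[d], x[va], r[va]))

# ... then report the error of the chosen model on the untouched test set.
print("chosen degree:", best, "  test MSE:", round(mse(fits[best], x[te], r[te]), 4))
```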
Holdout Method for Model Evaluation and Selection
• For model evaluation, the dataset is split into two parts. Generally, a 70-30% split is used.
• For model selection, the dataset is split into three different sets: training, validation, and test.
• The hold-out method can also be used for hyperparameter tuning.
32
Leave-p-out Cross-Validation
• Leave-p-out cross-validation involves using p observations as the validation set
and the remaining observations as the training set.
• This is repeated for all possible ways of splitting the original sample into a
validation set of p observations and a training set.
33
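A minimal sketch with scikit-learn's LeavePOut, on a tiny made-up dataset:

```python
import numpy as np
from sklearn.model_selection import LeavePOut

# Tiny illustrative dataset: leave-p-out is only practical for small N,
# because the number of splits grows combinatorially with N.
X = np.arange(10).reshape(5, 2)
y = np.array([0, 1, 0, 1, 0])

lpo = LeavePOut(p=2)
print("number of splits:", lpo.get_n_splits(X))   # C(5, 2) = 10
for train_idx, val_idx in lpo.split(X):
    # Each iteration holds out a different pair of observations for validation.
    print("train:", train_idx, "validation:", val_idx)
```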
K-Fold Cross-Validation
• For more advanced statistical evaluation, experienced experimenters often prefer
the so-called K-fold cross-validation.
• To begin with, the set of pre-classified examples is divided into K equally sized (or
almost equally sized) subsets, which machine-learning jargon sometimes (not
quite correctly) refers to as “folds.”
34
K-Fold Cross-Validation
• K-fold cross-validation then runs K experiments.
• In each, one of the K subsets is removed so as to be used only for testing (this
guarantees that, in each run, a different testing set is used).
• The training is then carried out on the union of the remaining K-1 subsets.
• Again, the results are averaged, and the standard deviation calculated.
35
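A minimal sketch of K-fold cross-validation with scikit-learn, averaging the fold scores and reporting their standard deviation (the classifier and K = 5 are arbitrary choices here):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

# Illustrative 5-fold cross-validation on the Iris data.
X, y = load_iris(return_X_y=True)
cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=cv)

# Average the K results and report the standard deviation, as described above.
print("fold accuracies:", np.round(scores, 3))
print(f"mean = {scores.mean():.3f}, std = {scores.std():.3f}")
```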
Dimensions of a Supervised Learner
1. Model: $g(x \mid \theta)$
where g(·) is the model, x is the input, and θ are the parameters.
37
Homework
Week 3
LAB.
https://scikit-learn.org/stable/auto_examples/index.html#dataset-examples
The Iris Dataset
• This dataset consists of 3 different types of irises (Setosa, Versicolour, and
Virginica), with petal and sepal measurements (length and width) stored in a
150×4 numpy.ndarray.
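A minimal sketch of loading this dataset with scikit-learn:

```python
from sklearn.datasets import load_iris

# Load the Iris data: 150 samples x 4 features.
iris = load_iris()
print(iris.data.shape)      # (150, 4) numpy.ndarray
print(iris.feature_names)   # sepal/petal length and width
print(iris.target_names)    # ['setosa' 'versicolor' 'virginica']
print(iris.data[:3])        # first three rows
```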