W2 ECS7020P
5 Oct 2023
How far is the equator from the North Pole? 10,000 km
2/54
Embrace the error!
3/54
Agenda
Recap
Summary
4/54
Machine learning
5/54
Machine learning taxonomy
Machine Learning
    Supervised: Regression, Classification
    Unsupervised: Structure Analysis, Density Estimation
6/54
Agenda
Recap
Summary
7/54
Problem formulation
8/54
Examples of regression problems
The following are examples of problems that can be formulated as a
regression problem:
1. Predict the energy consumption of a household, given the location
of the house, household size, income, intensity of occupation.
2. Predict future values of a company stock, given past stock prices.
3. Predict distance driven by a vehicle given its speed and journey
duration.
4. Predict demand given past demand and currency exchange rate.
5. Predict tomorrow’s temperature given today’s temperature and
pressure.
6. Predict the probability of developing a specific heart condition given
BMI, alcohol consumption, diet, and number of daily steps.
9/54
Mathematical notation
$x_i \;\rightarrow\; f(\cdot) \;\rightarrow\; \hat{y}_i$
Population:
$x$ is the predictor attribute
$y$ is the label attribute
Dataset:
$N$ is the number of samples, $i$ identifies each sample
$x_i$ is the predictor of sample $i$
$y_i$ is the actual label of sample $i$
$(x_i, y_i)$ is sample $i$; $\{(x_i, y_i) : 1 \le i \le N\}$ is the entire dataset
Model:
$f(\cdot)$ denotes the model
$\hat{y}_i = f(x_i)$ is the predicted label for sample $i$
$e_i = y_i - \hat{y}_i$ is the prediction error for sample $i$
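As a minimal sketch of how this notation maps onto code, the snippet below builds a small hypothetical dataset as NumPy arrays and evaluates an arbitrary candidate model on it; the values and the model are made up purely for illustration.

```python
import numpy as np

# Hypothetical dataset of N = 4 samples: predictors x_i and actual labels y_i
x = np.array([1.0, 2.0, 3.0, 4.0])   # x_i, 1 <= i <= N
y = np.array([2.1, 3.9, 6.2, 8.1])   # y_i

def f(x):
    """A candidate model f(.) mapping a predictor to a predicted label."""
    return 2.0 * x                    # an arbitrary choice for illustration

y_hat = f(x)        # predicted labels: y_hat_i = f(x_i)
e = y - y_hat       # prediction errors: e_i = y_i - y_hat_i
print(e)
```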
10/54
Visualising our mathematical notation
11/54
Candidate solutions
Which line is the best mapping of age to salary?
12/54
What is a good model?
Based on the squared error, we can define dataset-level quality metrics. Two
quality metrics based on the squared error are the sum of squared errors
(SSE) and the mean squared error (MSE), which are computed as:
$$E_{SSE} = e_1^2 + e_2^2 + \cdots + e_N^2 = \sum_{i=1}^{N} e_i^2$$
$$E_{MSE} = \frac{1}{N} \sum_{i=1}^{N} e_i^2$$
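As a quick illustration, both metrics take one line each in NumPy; the error values below are hypothetical.

```python
import numpy as np

e = np.array([0.1, -0.1, 0.2, 0.1])   # hypothetical prediction errors e_i

sse = np.sum(e ** 2)     # E_SSE: sum of squared errors
mse = np.mean(e ** 2)    # E_MSE: mean squared error, equal to E_SSE / N
print(sse, mse)
```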
13/54
MSE: Example
14/54
A zero-error model?
Given a dataset, is it possible to find a model such that $\hat{y}_i = y_i$ for every
instance $i$ in the dataset, i.e. a model whose error is zero, $E_{MSE} = 0$?
15/54
The nature of the error
$$y = \hat{y} + e = f(x) + e$$
There will always be some discrepancy (error $e$) between the true label $y$
and our model prediction $f(x)$. Embrace the error!
16/54
Regression as an optimisation problem (take 1)
$$f_{best} = \arg\min_{f} \frac{1}{N} \sum_{i=1}^{N} \left(y_i - f(x_i)\right)^2$$
The question is, how do we find such a model? Finding such a model is an
optimisation problem.
Why take 1? Note that we are defining regression as finding the model
that minimises $E_{MSE}$ on the dataset, without considering what happens
once deployed. We'll revise this definition.
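To make the argmin concrete, here is a minimal sketch that searches a coarse grid of candidate linear models $f(x) = w_0 + w_1 x$ and keeps the one with the lowest $E_{MSE}$; the dataset and the grid ranges are made-up assumptions.

```python
import numpy as np

# Hypothetical dataset (e.g. age vs salary in arbitrary units)
x = np.array([25.0, 30.0, 35.0, 40.0, 50.0])
y = np.array([28.0, 35.0, 41.0, 46.0, 60.0])

def mse(w0, w1):
    """E_MSE of the linear model f(x) = w0 + w1*x on the dataset."""
    e = y - (w0 + w1 * x)
    return np.mean(e ** 2)

# Brute-force search over a grid of candidate models
best = min(((mse(w0, w1), w0, w1)
            for w0 in np.linspace(-20.0, 20.0, 201)
            for w1 in np.linspace(0.0, 3.0, 301)),
           key=lambda t: t[0])
print("best E_MSE = %.3f at w0 = %.2f, w1 = %.2f" % best)
```

In practice we rarely search exhaustively; for linear models the optimisation has a closed-form answer, introduced later as the least squares solution.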
17/54
Agenda
Recap
Summary
18/54
Our regression learner
[Diagram: Knowledge and Data are fed into the Learner, which produces a Model.]
19/54
Simple regression
Simple regression considers one predictor x and one label y.
20/54
Simple linear regression
$$\hat{y}_i = f(x_i) = w_0 + w_1 x_i$$
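A minimal sketch of fitting this model with NumPy on hypothetical age and salary values; np.polyfit with degree 1 returns the slope $w_1$ and intercept $w_0$.

```python
import numpy as np

# Hypothetical age (predictor) and salary (label) values
age = np.array([22.0, 28.0, 35.0, 41.0, 50.0, 58.0])
salary = np.array([24.0, 30.0, 38.0, 45.0, 52.0, 60.0])

# np.polyfit with deg=1 returns [w1, w0], highest power first
w1, w0 = np.polyfit(age, salary, deg=1)

salary_hat = w0 + w1 * age                             # predicted labels
print(w0, w1, np.mean((salary - salary_hat) ** 2))     # coefficients and training E_MSE
```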
21/54
Linear solution: Example
22/54
Beyond linearity
Sketch the model that you would choose for the Salary Vs Age dataset
and try to find a suitable mathematical expression.
23/54
Simple polynomial regression
$$f(x_i) = w_0 + w_1 x_i + w_2 x_i^2 + \cdots + w_D x_i^D$$
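The same fitting routine extends to polynomial models by increasing the degree; this sketch uses hypothetical data and an assumed degree $D = 2$.

```python
import numpy as np

# Hypothetical data following a curved trend
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([1.2, 2.1, 4.3, 8.9, 16.8, 26.5])

D = 2                                # assumed polynomial degree
w = np.polyfit(x, y, deg=D)          # coefficients [w_D, ..., w_1, w_0]
y_hat = np.polyval(w, x)             # evaluates w_0 + w_1*x + ... + w_D*x^D
print(np.mean((y - y_hat) ** 2))     # training E_MSE
```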
24/54
Quadratic solution
25/54
Cubic solution
26/54
5-power solution
27/54
Multiple regression
In multiple regression there are two or more predictors (also known as
features). Given item $i$, we denote each individual predictor as
$x_{i,1}, x_{i,2}, \ldots, x_{i,K}$, where $K$ is the number of predictors.
28/54
Multiple regression: Linear model
29/54
Multiple regression: Vector notation
$$\hat{y}_i = f(\mathbf{x}_i)$$
Good news: the notation developed for simple regression translates easily
to the multivariate scenario; no extra effort is required!
30/54
Multiple linear regression: Formulation
Linear models in multiple regression are simply the sum of a constant (or
intercept) and each predictor multiplied by its own coefficient:
$$\hat{y}_i = w_0 + w_1 x_{i,1} + w_2 x_{i,2} + \cdots + w_K x_{i,K} = \mathbf{w}^T \mathbf{x}_i$$
where $\mathbf{w} = [w_0, w_1, \ldots, w_K]^T$ and $\mathbf{x}_i = [1, x_{i,1}, \ldots, x_{i,K}]^T$.
Note that we can use the same vector notation for simple linear regression
models, by defining $\mathbf{w} = [w_0, w_1]^T$ and $\mathbf{x}_i = [1, x_i]^T$.
31/54
Multiple linear regression: Solution visualisation
Multiple linear regression models are planes (or hyperplanes).
32/54
Multiple regression: More notation
The actual labels of all $N$ samples can be stacked into the vector
$$\mathbf{y} = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_N \end{bmatrix}$$
33/54
Multiple regression: More notation
Given a linear model defined by coefficients
$$\mathbf{w} = \begin{bmatrix} w_0 \\ w_1 \\ \vdots \\ w_K \end{bmatrix}$$
we can calculate the predicted label vector $\hat{\mathbf{y}}$ as
$$\hat{\mathbf{y}} = \begin{bmatrix} \hat{y}_1 \\ \hat{y}_2 \\ \vdots \\ \hat{y}_N \end{bmatrix} = \mathbf{X}\mathbf{w} = \begin{bmatrix} 1 & x_{1,1} & x_{1,2} & \cdots & x_{1,K} \\ 1 & x_{2,1} & x_{2,2} & \cdots & x_{2,K} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & x_{N,1} & x_{N,2} & \cdots & x_{N,K} \end{bmatrix} \begin{bmatrix} w_0 \\ w_1 \\ \vdots \\ w_K \end{bmatrix}$$
and the error vector $\mathbf{e}$ as
$$\mathbf{e} = \mathbf{y} - \hat{\mathbf{y}}$$
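As a small sketch of this notation in code, the snippet below builds a design matrix from hypothetical predictor values (prepending the column of ones) and computes $\hat{\mathbf{y}} = \mathbf{X}\mathbf{w}$ and $\mathbf{e} = \mathbf{y} - \hat{\mathbf{y}}$ for an arbitrary coefficient vector.

```python
import numpy as np

# Hypothetical dataset: N = 3 samples, K = 2 predictors each
X_pred = np.array([[2.0, 1.0],
                   [3.0, 0.5],
                   [5.0, 2.0]])
y = np.array([4.1, 4.8, 8.9])

# Design matrix X: a leading column of ones accounts for the intercept w0
X = np.hstack([np.ones((X_pred.shape[0], 1)), X_pred])

w = np.array([0.5, 1.2, 0.8])   # arbitrary candidate coefficients [w0, w1, w2]
y_hat = X @ w                   # predicted labels, y_hat = X w
e = y - y_hat                   # error vector, e = y - y_hat
print(y_hat, e)
```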
34/54
Multiple linear regression: Example
35/54
The least squares solution
It can be shown that the linear model that minimises the metric $E_{MSE}$
on a training dataset defined by a design matrix $\mathbf{X}$ and a label vector $\mathbf{y}$
has the parameter vector
$$\mathbf{w}_{best} = (\mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T \mathbf{y}$$
This solution can also be used for polynomial models, by treating the
powers of the predictor as predictors themselves.
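A minimal sketch of the least squares solution on hypothetical data; np.linalg.solve is applied to the normal equations rather than forming the inverse explicitly, which is numerically preferable but mathematically equivalent.

```python
import numpy as np

# Hypothetical design matrix X (intercept column included) and label vector y
X = np.array([[1.0, 2.0, 1.0],
              [1.0, 3.0, 0.5],
              [1.0, 5.0, 2.0],
              [1.0, 6.0, 1.5]])
y = np.array([4.1, 4.8, 8.9, 9.4])

# Normal equations: (X^T X) w_best = X^T y, i.e. w_best = (X^T X)^{-1} X^T y
w_best = np.linalg.solve(X.T @ X, X.T @ y)

y_hat = X @ w_best
print(w_best, np.mean((y - y_hat) ** 2))   # fitted coefficients and training E_MSE
```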
36/54
Other models for regression
Linear and polynomial models are not the only options available. Other
families of models that can be used include:
Exponential
Sinusoids
Radial basis functions
Splines
And many more!
37/54
Other quality metrics
In addition to the MSE, we can consider other quality metrics:
Root mean squared error. Measures the sample standard deviation of the
prediction error.
$$E_{RMSE} = \sqrt{\frac{1}{N} \sum_{i=1}^{N} e_i^2}$$
Mean absolute error. Measures the average of the absolute prediction error.
$$E_{MAE} = \frac{1}{N} \sum_{i=1}^{N} |e_i|$$
R-squared. Measures the proportion of the variance in the response that is
predictable from the predictors.
$$E_R = 1 - \frac{\sum_{i=1}^{N} e_i^2}{\sum_{i=1}^{N} (y_i - \bar{y})^2}, \quad \text{where } \bar{y} = \frac{1}{N} \sum_{i=1}^{N} y_i$$
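As a quick sketch, all three metrics take a few lines of NumPy; the labels and predictions below are hypothetical.

```python
import numpy as np

# Hypothetical actual labels and model predictions
y = np.array([2.0, 3.5, 5.0, 7.2, 9.1])
y_hat = np.array([2.3, 3.1, 5.4, 6.8, 9.5])

e = y - y_hat                                              # prediction errors
rmse = np.sqrt(np.mean(e ** 2))                            # E_RMSE
mae = np.mean(np.abs(e))                                   # E_MAE
r2 = 1 - np.sum(e ** 2) / np.sum((y - y.mean()) ** 2)      # E_R (R-squared)
print(rmse, mae, r2)
```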
38/54
Agenda
Recap
Summary
39/54
Flexibility
40/54
Interpretability
Model interpretability is crucial for us, as humans, to understand in a
qualitative manner how a predictor is mapped to a label. Inflexible
models produce solutions that are usually simpler and easier to interpret.
The training error of the best linear model is $E_{MSE} = 0.0983$, while the
training error of the best polynomial model is $E_{MSE} = 0.0379$.
42/54
Generalisation
[Diagram: Priors and Data are fed into the Learner, which produces a Model.]
Will our model work well during deployment, when presented with new
data? Generalisation is the ability of our model to successfully translate
what was learnt during the learning stage to deployment.
43/54
Generalisation
In this figure, the red curve represents the training MSE of different
models of increasing complexity, whereas the blue curve represents the
deployment MSE for the same models. What’s happening?
[Figure: mean squared error as a function of model flexibility. The training
MSE decreases steadily as flexibility grows, whereas the deployment MSE is
U-shaped: high in the underfitting region, lowest where the flexibility is
just right, and high again in the overfitting region.]
44/54
Underfitting and overfitting
45/54
Underfitting and overfitting
46/54
Underfitting
47/54
Overfitting
48/54
Just right
49/54
Underfitting and overfitting
Remember this: Generalisation can only be assessed by comparing
training and deployment performance, not by just looking at how each
model fits the training data.
[Figure: training and deployment MSE plotted against model flexibility.]
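A minimal sketch of this comparison: polynomial models of increasing degree are fitted on a small hypothetical training set and evaluated on a separately generated "deployment" set drawn from the same noisy relationship; the data, noise level, and degrees are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
f_true = lambda x: 1.0 + 0.5 * x - 2.0 * x ** 2   # assumed underlying relationship

# Hypothetical training and deployment data (same distribution, different samples)
x_train = np.linspace(0.0, 1.0, 15)
x_deploy = np.linspace(0.0, 1.0, 100)
y_train = f_true(x_train) + rng.normal(0.0, 0.1, x_train.shape)
y_deploy = f_true(x_deploy) + rng.normal(0.0, 0.1, x_deploy.shape)

# Compare training vs deployment MSE as flexibility (polynomial degree) grows
for D in [1, 3, 6, 9]:
    w = np.polyfit(x_train, y_train, deg=D)
    mse_train = np.mean((y_train - np.polyval(w, x_train)) ** 2)
    mse_deploy = np.mean((y_deploy - np.polyval(w, x_deploy)) ** 2)
    print(f"D={D}  training MSE={mse_train:.4f}  deployment MSE={mse_deploy:.4f}")
```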
50/54
Agenda
Recap
Summary
51/54
Regression: Basic methodology
52/54
Model generalisation
53/54
Final historical note
In the 19th century, Galton noticed that children of tall people tend to be
taller than average – but not as tall as their parents. Galton called this
"reversion", and later "regression towards mediocrity".
This observation is nowadays called regression to the mean. You can read more
about this curious phenomenon in Kahneman's Thinking, Fast and Slow.
54/54