(WEEK 3, 2024)
SUPERVISED LEARNING (REGRESSION)
DR ANTHONY CONSTANTINOU
SCHOOL OF ELECTRONIC ENGINEERING AND COMPUTER SCIENCE
TIMETABLE
LECTURE OVERVIEW
Supervised Learning (Regression)
▪ Coursework 1.
▪ Week 3 Lab.
▪ Association.
▪ Bivariate Linear Regression.
▪ Multivariate Linear Regression.
▪ Non-linear Regression.
▪ Optimisation.
▪ Regularisation.
COURSEWORK 1: DATES
Important Dates:
▪ Release date:
▪ end of Week 2, Friday 2nd February 2024 at 12:00 noon.
▪ Submission deadline:
▪ mid-Week 8, Wednesday 13th March 2024 at 10:00AM.
COURSEWORK 1: GUIDELINES (reading slide)
General information:
▪ Students will sometimes upload their coursework as a draft and not hit the submit button.
Make sure you fully complete the submission process.
▪ A penalty will be applied automatically by the system for late submissions.
▪ Lecturers cannot remove the penalty!
▪ Penalties can only be challenged via submission of an Extenuating Circumstances
(EC) form which can be found on your Student Support page. All the information
you need to know is on that page, including how to submit an EC claim along with the
deadline dates and full guidelines.
▪ Deadline extensions can only be granted through approval of an EC claim.
▪ If you submit an EC form, your case will be reviewed by a panel. When the panel
reaches a decision, they will inform both you and the module organiser (Anthony).
▪ If you miss both the submission deadline and the late submission deadline, you will
automatically receive a score of 0.
▪ Submissions via e-mail are not accepted.
▪ The School recommends that we set the deadline during a weekday at 10:00 AM.
▪ For more details on submission regulations, please refer to your relevant student
handbook.
COURSEWORK 1: OVERVIEW
▪ The submission involves two files:
▪ a data analytic report (Deliverable 1).
▪ a Jupyter notebook (Deliverable 2).
▪ Once you determine the area that interests you the most, you
should search for a suitable data set online, or collate the
data set yourself (see Section 5 for possible data sources).
COURSEWORK 1: OVERVIEW
▪ You should apply a minimum of TWO data analytic techniques (i.e. machine
learning algorithms) of your choice to your data, from those covered in this
course up to and including Week 5.
▪ The aim is to learn two models and contrast their performance on your
input data.
▪ You are allowed to test more than TWO data analytic techniques if you
wish (e.g., using multiple techniques to learn a model, or learning more
than two models), but this is not a requirement and will not necessarily
improve your mark.
▪ Remember to use the page limit wisely against the marking criteria.
▪ In Windows, you can generate a PDF file by right-clicking the Jupyter notebook loaded in
your browser and selecting ‘Print’, after which you should be given an option to save it
as a PDF file.
▪ You do NOT need to submit your data set or the actual .ipynb file. These
might be requested at a later stage, if and only if we would like to review
your code and/or data in greater depth.
COURSEWORK 1: MARKING CRITERIA
This coursework contributes 60% towards your total module mark.
COURSEWORK 1: TIMETABLE (reading slide)
COURSEWORK 1: DATA SOURCES
Using public data is the most common choice. If you have access to private data, that is also
an option, though you will have to be careful about what results you can release to us. Some
sources of publicly available data are listed below (you don't have to use these sources).
▪ Kaggle
▪ https://www.kaggle.com/
▪ Over 50,000 public data sets for machine learning.
▪ Football datasets
▪ http://www.football-data.co.uk/
▪ This site provides historical data for football matches around the world.
More data sources are provided in the coursework specification document available on QM+.
WEEK 3 LAB OVERVIEW (reading slide)
Pandas
What is Pandas?
▪ Pandas is one of the fundamental Python libraries for data manipulation, data
cleaning, and analysis.
▪ It is suitable for manipulating two-dimensional tabular data, so you might find it
useful for Coursework 1.
▪ Three Jupyter notebook files:
▪ Notebook 1: covers Series and DataFrames.
▪ A Series is a one-dimensional object similar to an array, list, or column in a
table (e.g., each column in Excel).
▪ A DataFrame (one of the most commonly used Pandas objects) is a two-
dimensional data structure, similar to a spreadsheet, relational database
table, or R's data.frame object.
▪ Notebook 2: Manipulating DataFrames.
▪ E.g., merging, concatenating, pivoting and deleting data.
▪ Notebook 3: Processing data from DataFrames.
▪ E.g., drop/remove/replace operations, data discretisation, outliers,
sampling, conditioning/grouping.
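As a flavour of what the lab notebooks cover, here is a minimal Pandas sketch showing a Series, a DataFrame, and a few of the operations listed above; the column names and values are invented for illustration.

```python
import pandas as pd

# A Series: a one-dimensional labelled array (like a single spreadsheet column).
prices = pd.Series([250, 320, 180], name="house_price_k")

# A DataFrame: a two-dimensional table of labelled columns.
df = pd.DataFrame({
    "bedrooms": [2, 3, 1],
    "house_price_k": [250, 320, 180],
})

print(df.describe())                  # summary statistics
print(df[df["bedrooms"] >= 2])        # conditioning (row filtering)
print(df.groupby("bedrooms").mean())  # grouping
```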
ASSOCIATION
[Figure: three scatter plots of Variable 1 (y) against Variable 2 (x), illustrating different patterns of association.]
POSITIVE ASSOCIATION
[Figures: example scatter plots with axes labelled Car value, Car colour, Life expectancy and Cigarettes smoked, followed by a diagram dividing machine learning into supervised (regression, classification) and unsupervised learning.]
SUPERVISED LEARNING
There are two types of supervised learning.
▪ Regression
▪ Predict a numeric target 𝑦.
▪ E.g., house price growth, temperature, etc.
▪ Classification
▪ Predict a categorical target 𝑦.
▪ E.g., Female/Male, True/False, Win/Draw/Lose, Rain/No-Rain.
LINEAR REGRESSION
Why is it important?
▪ Linear Regression is one of the simplest predictive models.
▪ Not to be confused with regression vs classification
terminology! E.g., logistic regression is a classification
method.
▪ Despite its simplicity, regression serves as the foundation
for more advanced statistical and machine learning models,
especially because it can be extended to non-linear
representations.
▪ For example, neural networks can be viewed as a set of
parametric (i.e., with a fixed set of parameters) non-linear
regression functions nested within one another.
BIVARIATE LINEAR REGRESSION
[Figure: scatter plot of data points against Feature (x).]
HOW DOES IT WORK?
▪ It answers the following question: what is the prediction ŷ given observation 𝑥?
[Figure: regression line fitted through data points, with Feature (x) on the horizontal axis.]
HOW DO WE EXPLAIN THE LINE?
▪ Regression line: ŷ = 𝑎𝑥 + 𝑏
▪ If we assume 𝑏 = 4, then ŷ = 𝑎𝑥 + 4, and this means that ŷ is expected to
be 4 when feature 𝑥 has value 0.
▪ Parameter 𝑏 is known as the intercept.
▪ The other parameter, 𝑎, tells us how much we can expect ŷ to change as
𝑥 increases.
▪ For example, if 𝑎 = 4.5 then ŷ is expected to increase at 4.5 times the
rate of increase in 𝑥.
▪ Parameter 𝑎 is known as the slope.
[Figure: regression line through data points, Feature (x) from 0 to 5, Target (y) up to 20.]
HOW DO WE EXPLAIN THE LINE?
Assuming 𝑏 = 4 and 𝑎 = 4.5:
▪ when 𝑥 = 0, then ŷ = 𝑎𝑥 + 𝑏 = 𝑏 = 4
▪ when 𝑥 = 1, then ŷ = 𝑎𝑥 + 𝑏 = 4.5 ∗ 1 + 4 = 8.5
[Figures: the same regression line with the predictions at 𝑥 = 0 and 𝑥 = 1 highlighted; Feature (x) from 0 to 5, Target (y) up to 20.]
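A minimal sketch of these calculations in Python, using the slope and intercept values assumed on this slide:

```python
# Bivariate linear regression prediction: y_hat = a*x + b
a, b = 4.5, 4.0          # slope and intercept assumed above

def predict(x, a, b):
    """Return the predicted target y_hat for a single feature value x."""
    return a * x + b

print(predict(0, a, b))  # 4.0  (the intercept)
print(predict(1, a, b))  # 8.5  (4.5 * 1 + 4)
```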
BIVARIATE LINEAR REGRESSION
▪ The predictions are only as accurate as the strength of the correlation
between 𝑥 and 𝑦.
▪ Pearson’s correlation coefficient 𝑟 (or 𝑅) is most commonly used.
▪ The value of 𝑟 ranges between −1 (negative correlation) and 1 (positive
correlation), where 0 represents no correlation.
▪ Note that 𝑟 does not take into consideration whether a variable is defined
as a feature 𝑥 or a target 𝑦 variable; it treats both equally (it is symmetric).
Figure taken from Statistics Laerd. (2017). Pearson Product-Moment Correlation. Retrieved August 16, 2017,
from https://statistics.laerd.com/statistical-guides/pearson-correlation-coefficient-statistical-guide.php
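A quick way to check the strength of a linear association in code is shown below with NumPy; the data values are invented for illustration.

```python
import numpy as np

# Hypothetical feature and target values.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([4.2, 8.1, 13.0, 17.5, 22.6])

# Pearson's r; np.corrcoef returns the 2x2 correlation matrix, so take an off-diagonal entry.
r = np.corrcoef(x, y)[0, 1]
print(r)  # close to 1: strong positive correlation

# Note the symmetry: np.corrcoef(y, x)[0, 1] gives the same value.
```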
BIVARIATE LINEAR REGRESSION
▪ The line of best fit is the line that best represents the relationship between
the two variables.
▪ When 𝑟 = 1 it simply means that there is no variation between the
data points and the line of best fit; it does not tell us anything about the
slope of the line of best fit.
Figure taken from Statistics Laerd. (2017). Pearson Product-Moment Correlation. Retrieved August 16, 2017,
from https://statistics.laerd.com/statistical-guides/pearson-correlation-coefficient-statistical-guide.php
BIVARIATE LINEAR REGRESSION
The lower the variability between observed data and the line of best fit, the
stronger the 𝑟 coefficient towards −1 or 1. This means that different levels
of variability can generate similar regression lines.
Figure taken from Statistics Laerd. (2017). Pearson Product-Moment Correlation. Retrieved August 16, 2017,
from https://statistics.laerd.com/statistical-guides/pearson-correlation-coefficient-statistical-guide.php
BIVARIATE LINEAR REGRESSION (reading slide)
The 𝑟 correlation for several sets of (x, y) points. Note that the correlation reflects the noisiness and
direction of a linear relationship (top row), but not the slope of that relationship (middle), nor many
aspects of nonlinear relationships (bottom).
N.B.: the figure in the center has a slope of 0 but in that case the correlation coefficient is undefined because
the variance of Y is zero. Source: Wikipedia, https://en.wikipedia.org/wiki/Correlation_and_dependence
10 MINUTES BREAK
ERROR (RESIDUALS)
▪ The prediction is ŷ = 𝑓(𝑥) = 𝑎𝑥 + 𝑏
▪ The observed value is 𝑦 = 𝑎𝑥 + 𝑏 + 𝐸, where 𝐸 is the error (residual) between the observation and the prediction.
[Figure: regression line with residuals shown as vertical distances from the data points to the line, Feature (x) on the horizontal axis.]
OUTLIERS
▪ An outlier is a data point that is abnormally distant from other
observations.
▪ An outlier occurs due to variability or due to error.
▪ Some error measures are sensitive to outliers, such as the Mean Squared Error (MSE), while
others are much less sensitive, such as the Mean Absolute Error (MAE):
𝐸_MAE = (1/𝑁) Σᵢ |𝑦ᵢ − 𝑓(𝑥ᵢ)|
[Figure: scatter plot with a single outlier far from the other data points.]
ERROR MEASURES (reading slide)
There are many different error measures that can be used to
compute 𝑬.
Assuming 𝑁 is sample size:
▪ MSE: Mean Squared Error.
▪ Measures the average of the squares of the prediction errors: 𝐸_MSE = (1/𝑁) Σᵢ (𝑦ᵢ − ŷᵢ)²
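A minimal sketch of computing MSE and MAE with NumPy, assuming we already have observed targets and model predictions; the values below are made up, with one deliberate outlier to show the difference in sensitivity.

```python
import numpy as np

# Hypothetical observed targets and model predictions.
y_true = np.array([4.0, 8.5, 13.0, 17.5, 40.0])   # the last observation is an outlier
y_pred = np.array([4.0, 8.5, 13.0, 17.5, 22.0])

mse = np.mean((y_true - y_pred) ** 2)   # Mean Squared Error: penalises the outlier heavily
mae = np.mean(np.abs(y_true - y_pred))  # Mean Absolute Error: much less sensitive to it

print(mse, mae)  # 64.8 vs. 3.6
```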
HOW DOES LINEAR REGRESSION WORK AS
A MACHINE LEARNING ALGORITHM?
▪ ŷ = 𝑓(𝑥) = 𝑎𝑥 + 𝑏
▪ 𝑥 is a feature and 𝑎 and 𝑏 are model parameters.
▪ As a search problem, iterate over different linear lines; i.e., over
parameters 𝑎 and 𝑏.
▪ The lines searched depend on how we iterate through 𝑎 and 𝑏.
▪ For each line searched, compute 𝐸, which represents the error of
the line/model relative to the observed data points.
▪ Move towards the line that minimises 𝐸; i.e., the 𝑎 and 𝑏 given by argmin 𝐸(𝑎, 𝑏).
[Figure: two scatter plots of Target (y) against Feature (x), each with a candidate regression line.]
LINEAR REGRESSION PSEUDOCODE (reading slide)
▪ trainLinearRegression(𝑦, 𝑥)
▪ For 𝑎 = −10.0 to 10.0
▪ For 𝑏 = −10.0 to 10.0
▪ Store how well the learnt line ŷ = 𝑎𝑥 + 𝑏 fits the observed data.
▪ Pick the optimal combination of 𝑎 and 𝑏.
▪ predictionWithLinearRegression(𝑥, 𝑎, 𝑏)
▪ Return ŷ = 𝑎𝑥 + 𝑏
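A runnable Python sketch of this exhaustive-search idea, using MSE as the fit measure; the step size, parameter range and toy data are assumptions for illustration, not values from the lecture.

```python
import numpy as np

def train_linear_regression(y, x, step=0.1):
    """Exhaustively search slope a and intercept b in [-10, 10] and keep the pair
    that minimises the mean squared error on the observed data."""
    best_a, best_b, best_error = None, None, float("inf")
    for a in np.arange(-10.0, 10.0 + step, step):
        for b in np.arange(-10.0, 10.0 + step, step):
            error = np.mean((y - (a * x + b)) ** 2)  # how well y_hat = a*x + b fits
            if error < best_error:
                best_a, best_b, best_error = a, b, error
    return best_a, best_b

def prediction_with_linear_regression(x, a, b):
    return a * x + b

# Toy data generated from roughly y = 4.5x + 4 with a little noise.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([4.1, 8.4, 13.2, 17.4, 22.1, 26.4])
a, b = train_linear_regression(y, x)
print(a, b, prediction_with_linear_regression(1.0, a, b))
```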
SEARCH AND OPTIMISATION
Optimisation represents the process of arriving at the optimal model
parameters.
▪ There are different approaches to optimisation, and each of those
approaches determines how to iterate over the different parameter values.
▪ E.g., exhaustive search explores all possible combinations (within a
range) and returns the combination that minimises the error.
▪ E.g., gradient descent makes bigger or smaller steps towards the
optimal weights, proportional to the change in the error.
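As a contrast with the exhaustive search shown earlier, here is a minimal gradient-descent sketch for the same two parameters; the learning rate, iteration count and toy data are arbitrary choices for illustration, not values from the lecture.

```python
import numpy as np

def gradient_descent_linear_regression(y, x, learning_rate=0.01, iterations=5000):
    """Fit y_hat = a*x + b by moving a and b in proportion to the gradient of the MSE."""
    a, b = 0.0, 0.0
    n = len(x)
    for _ in range(iterations):
        residuals = (a * x + b) - y
        grad_a = (2.0 / n) * np.sum(residuals * x)  # dE/da
        grad_b = (2.0 / n) * np.sum(residuals)      # dE/db
        a -= learning_rate * grad_a                 # step size is proportional to the error gradient
        b -= learning_rate * grad_b
    return a, b

x = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([4.1, 8.4, 13.2, 17.4, 22.1, 26.4])
print(gradient_descent_linear_regression(y, x))  # approximately (4.5, 4.0)
```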
SEARCH AND OPTIMISATION (reading slide)
▪ Gradient descent, for example, is much more efficient than exhaustive search.
▪ The search approach represents an important decision when learning from
large data sets, since each additional parameter might mean an additional
nested loop, and the number of parameter combinations explored grows
exponentially with the number of nested loops.
▪ Moreover, exploring non-linear relationships might involve an additional
loop for each additional polynomial order searched.
▪ There are many different search methods for optimisation, but we will not be
covering them in this course.
MULTIVARIATE LINEAR REGRESSION
▪ We now know how to predict 𝑦ො given a single feature 𝑥.
▪ e.g., carCrashes = 𝑎 · drivingSpeed + 𝑏
MULTIVARIATE LINEAR REGRESSION (reading slide)
▪ The general expression of multivariate regression is
ŷ = 𝛽₀ + 𝛽₁·𝑥₁ + 𝛽₂·𝑥₂ + … + 𝛽ₖ·𝑥ₖ
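A minimal sketch of fitting this expression by ordinary least squares with NumPy; the two features and all values are invented for illustration.

```python
import numpy as np

# Hypothetical design matrix: a column of ones (for the intercept beta_0) plus features x1 and x2.
X = np.array([
    [1.0, 2.0, 30.0],
    [1.0, 3.0, 25.0],
    [1.0, 5.0, 40.0],
    [1.0, 4.0, 20.0],
    [1.0, 6.0, 35.0],
])
y = np.array([12.0, 14.0, 22.0, 16.0, 24.0])

# Least-squares estimate of (beta_0, beta_1, beta_2).
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta)

# Prediction for a new observation with x1 = 4.5, x2 = 30: y_hat = beta_0 + beta_1*x1 + beta_2*x2
print(np.array([1.0, 4.5, 30.0]) @ beta)
```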
[Figure: scatter plot of Electricity (𝑦) against Temp (𝑥), showing a non-linear relationship.]
For a quadratic (second-order polynomial) regression, the mean squared error over the data is
𝐸_MSE(𝑎, 𝑏, 𝑐) = (1/𝑁) Σᵢ (𝑦ᵢ − ŷᵢ)² = (1/𝑁) Σᵢ (𝑦ᵢ − (𝑎𝑥ᵢ² + 𝑏𝑥ᵢ + 𝑐))²
POLYNOMIAL REGRESSION (reading slide)
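A minimal sketch of fitting polynomials of different orders with NumPy; the temperature and electricity values are invented to mimic the U-shaped relationship in the figure above.

```python
import numpy as np

# Hypothetical temperature (x) and electricity demand (y): high at cold and hot temperatures.
temp = np.array([0.0, 5.0, 10.0, 15.0, 20.0, 25.0, 30.0])
electricity = np.array([55.0, 40.0, 30.0, 27.0, 32.0, 43.0, 58.0])

# Order-1 (linear) vs. order-2 (quadratic) polynomial fits.
linear_coeffs = np.polyfit(temp, electricity, deg=1)     # likely to underfit this data
quadratic_coeffs = np.polyfit(temp, electricity, deg=2)  # captures the curvature

# Predictions for a new temperature of 12 degrees.
print(np.polyval(linear_coeffs, 12.0))
print(np.polyval(quadratic_coeffs, 12.0))
```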
MODEL UNDERFITTING
▪ Recall the relationship between temperature and electricity from a previous slide.
▪ Since that relationship is non-linear, a linear regression model is likely to underfit the data.
▪ Underfitting occurs when the trained model is too simple relative to the patterns available in the data.
[Figure: scatter of Electricity (𝑦) against Temp (𝑥) with a straight regression line missing the curvature.]
MODEL OVERFITTING
▪ The higher the order of polynomial regression, the more complex the regression line becomes.
▪ It is possible to introduce enough curves to fit the data perfectly.
▪ Complex models are at risk of adjusting to specific data patterns that do not generalise well. This phenomenon is known as model overfitting.
▪ This means that increasing the polynomial order does not mean the regression will be better, or offer better predictions.
[Figure: high-order polynomial curve weaving through the Electricity (𝑦) vs. Temp (𝑥) data points.]
MODEL OVERFITTING
[Figures: two Electricity (𝑦) vs. Temp (𝑥) scatter plots with fitted polynomial curves.]
▪ Acquiring more training data would have resulted in a better polynomial fit, and this represents one way of addressing model overfitting.
▪ What if we cannot acquire more data?
[Remaining slides hidden.]