
Artificial Intelligence II (CS4442 & CS9542)

Supervised Learning: Linear Regression

Boyu Wang
Department of Computer Science
University of Western Ontario
Motivation examples

Slide credit: Maria-Florina Balcan

1
Motivation examples

Slide credit: Maria-Florina Balcan

2
Motivation examples

Slide credit: Doina Precup

3
Data

Slide credit: Doina Precup

4
Terminology

I Columns are called features, attributes, or input variables.

I The outcome and time (which we are trying to predict) are called output
variables or targets.

I A row in the table is called a training example or an instance.

I The problem of predicting the recurrence is called (binary)
classification.

I The problem of predicting the time is called regression.

5
Problem formulation

I A training example i has the form (x_i, y_i), where x_i ∈ R^n is the feature
vector (n is the number of features, i.e., the feature dimension), y_i ∈ {0, 1}
for (binary) classification, and y_i ∈ R for regression.

I The training set consists of m examples: S = {(x_i, y_i)}_{i=1}^m.

I We denote the matrix of features by X = [x_1, . . . , x_m]^T ∈ R^{m×n}, and the
column vector of outputs by Y ∈ R^m.

I Objective: construct a good predictor using training data S (i.e., X and
Y), such that it works well for new patients (test data).

6
Problem formulation

I Let X denote the space of input values

I Let Y denote the space of output values

I Given a data set S ⊆ X × Y, find a function

h : X → Y

such that for a new example (x, y), h(x) can correctly predict the value
of y.

I Types of supervised learning problems: classification, regression,
ranking, structured prediction...

I Key assumption: training and test data are sampled from the same
distribution.

7
Steps to solving a supervised learning problem

I Collect the data set (input-output pairs)

I Choose a class of hypotheses (hypothesis space) H

I Choose a hypothesis h ∈ H

8
Example: what hypothesis class should we choose?

[Scatter plot of the training data: y on the vertical axis (roughly 3.5 to 6.5) against x on the horizontal axis (0 to 5).]

9
Steps to solving a supervised learning problem

I Collect the data set (input-output pairs)

I Choose a class of hypotheses (hypothesis space) H

I Choose a hypothesis h ∈ H

10
Linear model

I In linear regression, we consider a model h_w of the form

h_w(x) = w^T x + b

where w = [w_1, . . . , w_n]^T ∈ R^n, the w_i's are called parameters or
weights, and b is called the bias or intercept term.

I To simplify the notation, we can add a constant feature 1 to the other n
features of x: x → [1; x] ∈ R^{n+1}, so that we have

h_w(x) = w^T x

where w and x are now vectors of size n + 1 (a sketch of this trick follows below).

I How to choose w?
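A minimal sketch of the bias-augmentation trick in NumPy, assuming a small hypothetical dataset (the numbers are illustrative, not from the slides): prepend a constant 1 to each input so the intercept b becomes just another weight.

```python
import numpy as np

# Hypothetical toy data: m = 4 examples with n = 2 features each.
X = np.array([[1.0, 2.0],
              [2.0, 0.5],
              [3.0, 1.5],
              [4.0, 3.0]])                        # shape (m, n)

# Augment: x -> [1; x], so the bias becomes the first weight w[0].
X_aug = np.hstack([np.ones((X.shape[0], 1)), X])  # shape (m, n + 1)

# A candidate weight vector of size n + 1; w[0] plays the role of b.
w = np.array([0.5, 1.0, -2.0])

# h_w(x) = w^T x, evaluated for every training example at once.
predictions = X_aug @ w
print(predictions)
```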

11
Error (cost) Minimization

I Intuitively, w should make the predictions of h_w close to the true
values y on the data we have.

I Therefore, we will define an error function or cost function to
measure the difference between our prediction and the true value.

I We will choose w such that the error function is minimized:

w = arg min_w Σ_{i=1}^m ℓ(h_w(x_i), y_i),   (1)

where ℓ is a loss function.

I (1) is the fundamental problem in machine learning!

- How to choose ℓ
- How to solve (1)
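To make the role of ℓ in (1) concrete, here is a small hedged sketch in NumPy comparing two common loss functions on hypothetical predictions (the numbers are made up for illustration).

```python
import numpy as np

def squared_loss(y_hat, y):
    # l(h_w(x), y) = (h_w(x) - y)^2
    return (y_hat - y) ** 2

def absolute_loss(y_hat, y):
    # l(h_w(x), y) = |h_w(x) - y|
    return np.abs(y_hat - y)

# Hypothetical predictions h_w(x_i) and true targets y_i.
y_hat = np.array([1.2, 3.9, 2.5])
y = np.array([1.0, 4.0, 3.5])

# The objective in (1) is the sum of per-example losses; different choices
# of the loss lead to different optimal w.
print(squared_loss(y_hat, y).sum())   # ~1.05
print(absolute_loss(y_hat, y).sum())  # ~1.30
```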
12
Least squares

I We use the squared error to measure our prediction performance, and
choose w by minimizing the sum-of-squared errors:

J(w) = (1/2) Σ_{i=1}^m (h_w(x_i) − y_i)^2

(the 1/2 is just for convenience)

I Fit by solving min_w J(w)
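A minimal sketch in NumPy of evaluating J(w) for a candidate weight vector on hypothetical (augmented) data; it is just the squared-loss instance of (1) with the extra factor of 1/2.

```python
import numpy as np

def cost_J(w, X_aug, y):
    # J(w) = 1/2 * sum_i (h_w(x_i) - y_i)^2, with h_w(x) = w^T x.
    residuals = X_aug @ w - y
    return 0.5 * np.dot(residuals, residuals)

# Hypothetical augmented data (first column of ones) and targets.
X_aug = np.array([[1.0, 0.0],
                  [1.0, 1.0],
                  [1.0, 2.0]])
y = np.array([1.0, 2.1, 2.9])

print(cost_J(np.array([1.0, 1.0]), X_aug, y))  # near-perfect fit -> J ≈ 0.01
print(cost_J(np.array([0.0, 0.0]), X_aug, y))  # all-zero weights -> J ≈ 6.91
```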

13
Solve ∇J(w) = 0 using the definition

I In order to minimize J(w), we need to solve ∇J(w) = 0

I ∇J(w) = [∂J(w)/∂w_1, . . . , ∂J(w)/∂w_n]^T

∂J(w)/∂w_j = ∂/∂w_j (1/2) Σ_{i=1}^m (h_w(x_i) − y_i)^2
           = (1/2) · 2 Σ_{i=1}^m (h_w(x_i) − y_i) ∂/∂w_j (h_w(x_i) − y_i)
           = Σ_{i=1}^m (h_w(x_i) − y_i) ∂/∂w_j (w^T x_i − y_i)
           = Σ_{i=1}^m (h_w(x_i) − y_i) x_{i,j}

where x_{i,j} is the j-th feature of the i-th instance.

I Setting all these partial derivatives to 0, we get a linear system with
(n+1) equations and (n+1) unknown variables.

14
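A sketch in NumPy of these partial derivatives on hypothetical data: the j-th gradient component Σ_i (h_w(x_i) − y_i) x_{i,j} can be computed for all j at once as X^T(Xw − y), and a finite-difference check confirms the formula.

```python
import numpy as np

def gradient_J(w, X_aug, y):
    # dJ/dw_j = sum_i (h_w(x_i) - y_i) * x_{i,j}; in matrix form: X^T (Xw - y).
    return X_aug.T @ (X_aug @ w - y)

def numerical_gradient(w, X_aug, y, eps=1e-6):
    # Central-difference approximation of each partial derivative (check only).
    def J(w):
        r = X_aug @ w - y
        return 0.5 * np.dot(r, r)
    grad = np.zeros_like(w)
    for j in range(len(w)):
        e = np.zeros_like(w)
        e[j] = eps
        grad[j] = (J(w + e) - J(w - e)) / (2 * eps)
    return grad

# Hypothetical augmented data (first column of ones) and targets.
X_aug = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]])
y = np.array([1.0, 2.1, 2.9])
w = np.array([0.3, 0.7])

print(gradient_J(w, X_aug, y))          # analytic gradient
print(numerical_gradient(w, X_aug, y))  # should agree to several decimal places
```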
Solve ∇J(w) = 0 using vector calculus

∇J(w) = ∇_w ( (1/2) Σ_{i=1}^m (h_w(x_i) − y_i)^2 )
      = ∇_w ( (1/2) (Xw − y)^T (Xw − y) )
      = (1/2) ∇_w ( w^T X^T X w − y^T X w − w^T X^T y + y^T y )
      = X^T X w − X^T y

I Setting ∇J(w) = 0 gives

X^T X w − X^T y = 0
⇒ X^T X w = X^T y
⇒ w = (X^T X)^{−1} X^T y

I The inverse exists if the columns of X are linearly independent.
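A minimal sketch in NumPy (hypothetical data) of solving the normal equations X^T X w = X^T y; np.linalg.solve is used rather than forming the inverse explicitly, and np.linalg.lstsq gives the same least-squares answer.

```python
import numpy as np

# Hypothetical augmented data matrix (first column of ones) and targets.
X = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0]])
y = np.array([1.1, 1.9, 3.2, 3.8])

# Normal equations X^T X w = X^T y, solved without forming the inverse explicitly.
w_normal = np.linalg.solve(X.T @ X, X.T @ y)

# NumPy's built-in least-squares solver should give the same answer.
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(w_normal)
print(w_lstsq)  # identical (up to numerical precision) when X has full column rank
```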

15
Linear regression summary

I The solution is w = (X^T X)^{−1} X^T y, where X is the data matrix
augmented with a column of ones, and y is the column vector of
target outputs.

I The number of data points must be at least the number of features
(so that the columns of X can be linearly independent) to ensure a
unique solution exists.

I An analytical, exact solution exists for linear regression, which is
a rare case in machine learning.

I Demo!
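Since the slide calls for a demo, here is a hedged end-to-end sketch (synthetic NumPy data, not the course demo): generate noisy points from a known line, fit w with the closed-form solution, and predict at new inputs.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic 1-D data: y = 2x + 1 plus Gaussian noise (illustrative only).
m = 30
x = rng.uniform(0, 5, size=m)
y = 2.0 * x + 1.0 + rng.normal(scale=0.3, size=m)

# Augment with a column of ones so the intercept is learned as w[0].
X = np.column_stack([np.ones(m), x])

# Closed-form least-squares solution of X^T X w = X^T y.
w = np.linalg.solve(X.T @ X, X.T @ y)
print("learned [intercept, slope]:", w)  # should be close to [1.0, 2.0]

# Predict at new (test) inputs.
x_new = np.array([0.5, 2.5, 4.5])
X_new = np.column_stack([np.ones(len(x_new)), x_new])
print("predictions:", X_new @ w)
```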

16
Predicting recurrence time based on tumor size

Slide credit: Doina Precup

17
Geometric interpretation: orthogonal projection

https://towardsdatascience.com/3-things-you-didnt-know-about-simple-linear-regression-b5a86362dd53

18
Probabilistic perspective of linear regression

Slide credit: Doina Precup

19
Bayes rule in machine learning

Slide credit: Doina Precup

20
Select a hypothesis by maximum a posteriori (MAP)

Slide credit: Doina Precup

21
Maximum likelihood estimation

Slide credit: Doina Precup

22
The log trick

Slide credit: Doina Precup

23
Maximum likelihood for regression

Slide credit: Doina Precup

24
Applying the log trick

Slide credit: Doina Precup

25
Maximum likelihood and linear regression

I Under the assumption that the training examples are i.i.d. and
that we have Gaussian target noise, the maximum likelihood
solution is equivalent to the least-squares solution for linear
regression:
w = arg min_w Σ_{i=1}^m (h_w(x_i) − y_i)^2

I If the noise is not normally distributed, maximizing the likelihood
will not be the same as minimizing the sum-of-squared errors.

I In practice, different loss functions are used depending on the
noise assumption.

I Squared loss is the most popular one because it is the simplest
one!
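For completeness, a short worked derivation (the standard argument, not reproduced from the slides) of why i.i.d. Gaussian noise y_i = w^T x_i + ε_i, ε_i ~ N(0, σ²) makes maximum likelihood coincide with least squares:

```latex
% Assume y_i = w^\top x_i + \epsilon_i with i.i.d. noise \epsilon_i \sim \mathcal{N}(0, \sigma^2).
\begin{align*}
\log p(y_1,\dots,y_m \mid x_1,\dots,x_m, w)
  &= \sum_{i=1}^{m} \log \mathcal{N}\!\left(y_i \mid w^\top x_i, \sigma^2\right) \\
  &= \sum_{i=1}^{m} \left[-\tfrac{1}{2}\log(2\pi\sigma^2)
        - \frac{(y_i - w^\top x_i)^2}{2\sigma^2}\right] \\
  &= \text{const} - \frac{1}{2\sigma^2}\sum_{i=1}^{m} (y_i - w^\top x_i)^2 .
\end{align*}
% Maximizing over w is therefore the same as minimizing \sum_i (y_i - w^\top x_i)^2.
```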

26
Extending linear regression to more complex models

I Linear regression should be the first thing you try for real-valued
outputs!

I Two possible solutions when a linear fit is not good enough:

1. Transform the data – apply a (nonlinear) transformation x → φ(x), then do
linear regression in the transformed space (our focus in this lecture; see the
sketch after this list).
I e.g., log, exp, sin, cos
I polynomial transformation: cross-terms, higher-order terms...
I basis expansions: Gaussian basis, polynomial basis...
I Dummy coding of categorical features

2. Use a different hypothesis class (e.g., non-linear functions)
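A sketch of option 1 in NumPy with synthetic data: build polynomial features φ(x) = [1, x, x², x³] by hand and reuse exactly the same linear-regression machinery in the transformed space (the degree and data are illustrative choices).

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic nonlinear data: y = sin(x) plus noise (illustrative only).
m = 40
x = rng.uniform(0, 5, size=m)
y = np.sin(x) + rng.normal(scale=0.1, size=m)

def poly_features(x, degree):
    # phi(x) = [1, x, x^2, ..., x^degree] for each scalar input x.
    return np.column_stack([x ** d for d in range(degree + 1)])

# Linear regression in the transformed space: the same normal equations as before.
Phi = poly_features(x, degree=3)
w = np.linalg.solve(Phi.T @ Phi, Phi.T @ y)

# Predict at new inputs by applying the same transformation phi.
x_new = np.linspace(0.0, 5.0, 6)
print(poly_features(x_new, degree=3) @ w)  # roughly tracks sin(x) on [0, 5]
```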

27
Linear models in general

Slide credit: Doina Precup

28
Linear models in general

Slide credit: Doina Precup

29
Example basis functions: Polynomial

Slide credit: Doina Precup

30
Example basis functions: Gaussian

Slide credit: Doina Precup

31
Example basis functions: Sigmoidal

Slide credit: Doina Precup

32
Polynomial regression

Slide credit: Doina Precup

33
Example: second-order polynomial regression

Slide credit: Doina Precup

34
Example: second-order polynomial regression

Slide credit: Doina Precup

35
Example: second-order polynomial regression

Slide credit: Doina Precup

36
Example: second-order polynomial regression

Slide credit: Doina Precup

37
Higher-order polynomial regression: Order-2 fit

Slide credit: Doina Precup

38
Higher-order polynomial regression: Order-3 fit

Slide credit: Doina Precup

39
Higher-order polynomial regression: Order-4 fit

Slide credit: Doina Precup

40
Higher-order polynomial regression: Order-5 fit

Slide credit: Doina Precup

41
Higher-order polynomial regression: Order-6 fit

Slide credit: Doina Precup

42
Higher-order polynomial regression: Order-7 fit

Slide credit: Doina Precup

43
Higher-order polynomial regression: Order-8 fit

Slide credit: Doina Precup

44
Higher-order polynomial regression: Order-9 fit

Slide credit: Doina Precup

45
Overfitting and model selection

I As long as the hypothesis is complicated enough, we can always
achieve zero training loss (e.g., memorize all the data).

I Minimizing the training loss does NOT guarantee good test
performance.

I Overfitting: a (complicated) model fits the training data too well but
fails to generalize to new data.

I An extremely important problem for (almost all) machine
learning algorithms.
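A small sketch (NumPy, synthetic data) of this phenomenon: as the polynomial degree grows, the training error keeps shrinking while the test error eventually gets worse; the degrees, sample sizes, and noise level are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(2)

def poly_features(x, degree):
    # phi(x) = [1, x, x^2, ..., x^degree] for each scalar input.
    return np.column_stack([x ** d for d in range(degree + 1)])

def fit(Phi, y):
    # Least-squares fit; lstsq also copes with nearly rank-deficient Phi.
    w, *_ = np.linalg.lstsq(Phi, y, rcond=None)
    return w

def mse(Phi, y, w):
    return np.mean((Phi @ w - y) ** 2)

def make_data(m):
    # Small noisy sample from an underlying nonlinear function.
    x = rng.uniform(0, 5, size=m)
    return x, np.sin(x) + rng.normal(scale=0.2, size=m)

x_train, y_train = make_data(12)   # small training set
x_test, y_test = make_data(200)    # held-out test set from the same distribution

for degree in [1, 3, 9]:
    w = fit(poly_features(x_train, degree), y_train)
    train_err = mse(poly_features(x_train, degree), y_train, w)
    test_err = mse(poly_features(x_test, degree), y_test, w)
    # Training error decreases with degree; test error typically blows up at degree 9.
    print(degree, train_err, test_err)
```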

46
