2 - Multiple Linear Regression

This document discusses machine learning and linear regression. It defines machine learning as a field that gives computers the ability to learn without being explicitly programmed. The document then discusses supervised learning and describes linear regression as a type of supervised learning where the goal is to predict a continuous numeric value by finding parameters m and b that minimize a cost function J. Linear regression finds the linear relationship between an input x and output y.


COURSE DIT 822

Machine Learning

Multiple Linear Regression

LECTURE 2
MACHINE LEARNING
What is it?

Arthur Samuel (1959):


“Machine Learning is a field of study that gives computers the ability to learn without being explicitly programmed.”

Machines don’t learn in a literal sense: machine learning is a way to build advanced statistical models using math. The computer learns a formula that, applied to the input data, produces the desired output.

By: Tom Mitchell, 1998

https://www.coursera.org/learn/machine-learning
Types of Machine Learning

SUPERVISED (task driven): predict the next value, classification. Observing some example input–output pairs and learning a function that maps from input to output. E.g. online advertising, object detection, speech recognition.

UNSUPERVISED (data driven): identify clusters. Learning patterns in the input even though no explicit feedback is supplied.

REINFORCEMENT (learn from mistakes): trial and error. Learning from a series of reinforcements (rewards or punishments).
Types of Supervised Learning

The task of supervised learning: given a training set of example input–output pairs, learn a hypothesis h that approximates the true function mapping inputs to outputs.

h = hypothesis

Russell, Norvig - Artificial Intelligence: A Modern Approach
Types of Supervised Learning

Training Dataset:
A collection of labeled examples (input attributes and expected output). The training dataset can take any kind of data as an input, like the values of a database row, the pixels of an image, or even an audio frequency histogram.

Learning:
A model is prepared through a training process in which it is required to make predictions and is corrected when those predictions are wrong. Learning continues until the algorithm achieves an acceptable level of performance.
Part 1.1 Supervised Learning - Simple Linear Regression

SUPERVISED (task driven: predict the next value, classification)

  Regression: Simple Linear Regression, Multiple Linear Regression, Polynomial Regression
  Classification: Logistic Regression, Decision Tree

Classification: in classification, the goal is to predict a discrete category or label for each input data point.
Regression: in regression, the goal is to predict a continuous numeric value. The output is a numerical variable, and the model estimates a real-valued number as the target variable.
Supervised Learning
Input (x)          Output (y)               Application
Home features      Price                    Real Estate
Ad, user info      Click on ad? (0/1)       Online Advertising
Image              Object (1, …, 1000)      Photo tagging
Audio              Text transcript          Speech recognition
English            Chinese                  Machine translation
Image, radar info  Position of other cars   Autonomous driving

https://www.coursera.org/learn/machine-learning
Supervised Learning
Structured data: highly organized and follows a predefined schema or format. It is typically stored in databases, spreadsheets, or tables.

Unstructured data: lacks a specific organizational structure. It does not conform to a fixed schema and is often free-form text, audio, video, or images.

Structured data example (housing):

Size   #bedrooms  Price (1000$s)
2104   3          400
1600   3          330
2400   3          369
⋮      ⋮          ⋮
3000   4          540

Structured data example (online advertising):

User Age  Ad Id  Click
41        93242  1
80        93287  0
18        87312  1
⋮         ⋮      ⋮
27        71244  1

Unstructured data examples: audio, images, free text (“Four score and seven years ago…”).

https://www.coursera.org/learn/machine-learning
Summary – linear regression (univariate LR)

Hypothesis: h(x) = mx + b

Parameters: m, b

Cost function: J(m, b) = \frac{1}{2N} \sum_{i=1}^{N} \left( m x^{(i)} + b - y^{(i)} \right)^2

Goal: \min_{m,b} J(m, b)

That is: find m, b such that J is minimum.
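To make the cost function concrete, here is a minimal Python sketch. The toy dataset (1,1), (2,2), (3,3) is an assumption, chosen because it reproduces the values J(1) = 0, J(0.5) = 3.5, and J = 14 that appear on the plots below (with J taken as the plain sum of squared errors):

import numpy as np

# Assumed toy dataset: three points on the line y = x, so the best slope is m = 1
x = np.array([1.0, 2.0, 3.0])
y = np.array([1.0, 2.0, 3.0])

def J(m, b=0.0):
    """Sum of squared errors: J(m, b) = sum_i (m*x_i + b - y_i)^2."""
    return float(((m * x + b - y) ** 2).sum())

# Evaluate J with b fixed to 0, as on the following slides
for m in [0.0, 0.5, 1.0, 1.5, 2.0]:
    print(f"J({m}) = {J(m)}")   # J(0.0) = 14.0, J(0.5) = 3.5, J(1.0) = 0.0, ...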
Cost Function

Let’s set b to 0 and look for m.

[Figure, repeated over several slides: the left panel plots the training data y against x together with a candidate line of slope m; the right panel plots the cost J(m) against m. Evaluating J at different slopes, e.g. J(1) = 0, J(0.5) = 3.5, and J = 14, traces out the bowl-shaped curve of J(m).]

GOAL = minimize J
Hypothesis: h(x) = mx + b

Parameters: m, b

Cost function: J(m, b)

Goal: min J(m, b)

Cost Function

Contour plots/figures

Objective: minimising the cost function with respect to m and b.
Naming Conventions

The parameters m and b are also called “weights”. We redefine b and m as w_0 and w_1 respectively, so the hypothesis becomes h(x) = w_0 + w_1 x and the objective is minimising the cost function J(w_0, w_1) with respect to w_0 and w_1.
Simple Linear Regression
Summary

Given a collection of labeled examples {(x^{(i)}, y^{(i)})}_{i=1}^{N}, where N is the number of training samples,

find a linear function h(x) = w_0 + w_1 x

that best fits the data, i.e. that minimises a cost function. The difference h(x^{(i)}) - y^{(i)} is called the “residual”.

Cost function: e.g. Sum of Squared Errors (SSE) or Mean Squared Error (MSE):

J(w_0, w_1) = \frac{1}{2N} \sum_{i=1}^{N} \left( w_0 + w_1 x^{(i)} - y^{(i)} \right)^2

How do we find w_0 and w_1?
Simple Linear Regression, how to find the weights?
Analytical Solution

J(w_0, w_1) is minimized when the partial derivatives with respect to w_0 and w_1 are zero. Solving the resulting linear system of two equations with two unknowns gives the optimal weights:

w_1 = \frac{\sum_i (x^{(i)} - \bar{x})(y^{(i)} - \bar{y})}{\sum_i (x^{(i)} - \bar{x})^2}, \qquad w_0 = \bar{y} - w_1 \bar{x}

Visualization of Linear Regression: http://setosa.io/ev/ordinary-least-squares-regression/

Full explanation of the formula: https://www.cs.toronto.edu/~rgrosse/courses/csc321_2018/readings/L02%20Linear%20Regression.pdf


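A minimal Python sketch of this closed-form fit (the function name is illustrative; the data points are the sizes and prices from the normal-equation example later in this lecture):

import numpy as np

def fit_simple_linear(x, y):
    """Closed-form least-squares fit of h(x) = w0 + w1*x, obtained by setting
    the partial derivatives of the cost to zero and solving the 2x2 system."""
    x_bar, y_bar = x.mean(), y.mean()
    w1 = ((x - x_bar) * (y - y_bar)).sum() / ((x - x_bar) ** 2).sum()
    w0 = y_bar - w1 * x_bar
    return w0, w1

size = np.array([852.0, 1416.0, 1534.0, 2104.0])   # sq ft
price = np.array([178.0, 232.0, 315.0, 460.0])     # 1000$s
w0, w1 = fit_simple_linear(size, price)
print(f"price ≈ {w0:.1f} + {w1:.3f} * size")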
Simple Linear Regression, how to find the weights?
Analytical Solution

For simple linear regression problems (1 feature, linear fit) with a small dataset, the analytical solution is feasible and preferred.

For example: predicting the price of a house given its size.
Simple Linear Regression, how to find the weights?
Analytical Solution Example

[Worked-example slides: fitting house price against size with the analytical solution.]

What is the predicted price for a house of 1600 square feet?
Summary Part 1.1
What we have seen so far:
• Types of Machine Learning
• What is Supervised Learning
• Regression vs Classification
• Function Approximation
• Correlation vs Regression
• Simple Linear Regression
• Analytical Solution with Example

Piergiuseppe Mallozzi
Part 1.2 Supervised Learning - Multiple Linear Regression

SUPERVISED (task driven: predict the next value, classification)

  Regression: Simple Linear Regression, Multiple Linear Regression, Polynomial Regression
  Classification: Logistic Regression, Decision Tree
1 Feature = Simple Linear Regression

How much is a house worth on the market? One input feature (size) is used to predict the price.

Multiple Features = Multiple Linear Regression

How much is a house worth on the market? Several input features (size, # rooms, location) are used to predict the price.
1 Feature vs 2 Features

With 1 feature, linear regression finds a line; with 2 features, linear regression finds a plane.
N features?
Visual understanding is more abstract…

Linear regression finds a hyperplane (a subspace whose dimension is one less than that of its ambient space).

But the math still works!

Visualizing higher dimensions: https://www.youtube.com/watch?v=zwAD6dRSVyI&t=829s
(Picture from the movie Interstellar)
Multiple Features
4 features

n = number of features
x^{(i)} = features of the i-th training example (a vector)
x_j^{(i)} = value of feature j in the i-th training example

h(x) = w_0 + w_1 x is not enough anymore.

https://www.coursera.org/learn/machine-learning
Multiple Features
4 features

Example: x^{(3)} is a vector = [1534, 3, 2, 30]

x_3^{(3)} is the number 2

h(x) = w_0 + w_1 x is not enough anymore.

https://www.coursera.org/learn/machine-learning
Multiple Features
4 features

For convenience we define x_0 = 1, so each example becomes the (n+1)-dimensional vector x = [x_0, x_1, …, x_n]^T and the hypothesis can be written as h(x) = w^T x.

https://www.coursera.org/learn/machine-learning
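A small NumPy sketch of this convention (the weight values are the approximated ones from the normal-equation slide later in this lecture; the feature names in the comment are assumptions):

import numpy as np

# Features of one training example, assumed to be [size, #bedrooms, #floors, age]
x = np.array([1534.0, 3.0, 2.0, 30.0])

# Prepend x_0 = 1 so the intercept w_0 folds into the dot product
x_aug = np.concatenate(([1.0], x))        # [1, 1534, 3, 2, 30]

w = np.array([191.94, 0.38, -59.03, -89.47, -3.75])
print(x_aug @ w)   # h(x) = w^T x = w0 + w1*x1 + ... + wn*xn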
Multiple Features, how to find the weights?

x = [x_0, x_1, …, x_n]^T is one sample vector with n features; with the added x_0 = 1 we call it an (n+1)-dimensional feature vector.

h = Xw – the hypothesis, a vector in which each element corresponds to the predicted value of one example.
w – the weight vector; each element corresponds to a feature.
X – a matrix; each row holds the feature values of one training example (an (n+1)-dimensional feature vector), and it has as many rows as there are training examples.

Full explanation here: https://www.geeksforgeeks.org/ml-normal-equation-in-linear-regression/

Need a linear algebra review? Watch these videos!
https://www.youtube.com/watch?v=fNk_zzaMoSs&list=PLZHQObOWTQDPD3MizzM2xVFitgF8hE_ab
Multiple Features

For all samples of the training set we have the matrix X, in which each row is one sample (the vector x^{(i)T}), and the vector y of the target values.

https://www.coursera.org/learn/machine-learning
Multiple Features, how to find the weights?

The normal equation gives the weights directly:

w = (X^T X)^{-1} X^T y

Once we have computed the weight vector, we can predict y for new values of x.

Full explanation here: https://www.geeksforgeeks.org/ml-normal-equation-in-linear-regression/

Need a linear algebra review? Watch these videos!
https://www.youtube.com/watch?v=fNk_zzaMoSs&list=PLZHQObOWTQDPD3MizzM2xVFitgF8hE_ab
Quick Review on the Transpose

Images from: https://www.mathsisfun.com/definitions/transpose-matrix-.html

Quick Review on Matrix Multiplication

Visual understanding of what the dot product means:
https://www.youtube.com/watch?v=LyGKycYT2v0

Image from: https://www.mathsisfun.com/algebra/matrix-multiplying.html

Quick Review on the Inverse

Images from: https://www.mathsisfun.com/algebra/matrix-inverse.html
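In NumPy, the three operations just reviewed (transpose, matrix multiplication, inverse) look like this; a minimal sketch:

import numpy as np

A = np.array([[1.0, 2.0],
              [3.0, 4.0]])

print(A.T)                 # transpose: rows become columns
print(A @ A.T)             # matrix multiplication (dot products of rows and columns)
A_inv = np.linalg.inv(A)   # inverse: A @ A_inv gives the identity matrix
print(A @ A_inv)           # ≈ [[1, 0], [0, 1]] up to floating-point error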


Normal Equation
Example

Matlab (or Octave):

X = [1 2104 5 1 45;
     1 1416 3 2 40;
     1 1534 3 2 30;
     1  852 2 1 36]

y = [460; 232; 315; 178]

w = pinv(X'*X)*(X'*y)

w =
  191.94
    0.38
  -59.03
  -89.47
   -3.75
(approximated values for the slides)

Need a linear algebra review? Watch these videos! https://www.youtube.com/watch?v=fNk_zzaMoSs&list=PLZHQObOWTQDPD3MizzM2xVFitgF8hE_ab
Normal Equation
Example - Predict the price of a new house

w =
  191.94
    0.38
  -59.03
  -89.47
   -3.75
(approximated values for the slides)

New entry: x = [1, 900, 3, 1, 14], price = ?

Predicted price: w^T x ≈ 215 (in 1000$s), i.e. about $215,000 with the approximated weights above.
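The same computation in NumPy; a minimal sketch reproducing the Octave example and the new-house prediction:

import numpy as np

# Design matrix from the example; the first column of ones is the x_0 = 1 feature
X = np.array([[1, 2104, 5, 1, 45],
              [1, 1416, 3, 2, 40],
              [1, 1534, 3, 2, 30],
              [1,  852, 2, 1, 36]], dtype=float)
y = np.array([460, 232, 315, 178], dtype=float)

# Normal equation w = pinv(X^T X) X^T y; pinv also works when X^T X is not invertible
w = np.linalg.pinv(X.T @ X) @ (X.T @ y)
print(w)

# Predict the price of the new entry [1, 900, 3, 1, 14]
x_new = np.array([1, 900, 3, 1, 14], dtype=float)
print(x_new @ w)   # predicted price in 1000$s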
Normal Equation
Can we use it for any problem?

w = (X^T X)^{-1} X^T y is a “closed-form solution”.

Computing the inverse of X^T X is computationally very expensive for large datasets, and it might not even be possible (X^T X might not be invertible).

Other methods?

Full explanation of the formula: https://www.geeksforgeeks.org/ml-normal-equation-in-linear-regression/

Why can it be non-invertible? https://www.coursera.org/learn/machine-learning/lecture/zSiE6/normal-equation-noninvertibility
Summary

Hypothesis: h(x) = w^T x = w_0 x_0 + w_1 x_1 + … + w_n x_n (with x_0 = 1)

Parameters: w = [w_0, w_1, …, w_n]

Cost function: J(w) = \frac{1}{2N} \sum_{i=1}^{N} \left( h(x^{(i)}) - y^{(i)} \right)^2

Goal: \min_w J(w)
Single/Multiple Features, how to find the weights?
Gradient Descent

A search algorithm that starts with some “initial guess” of w and repeatedly changes w to make the cost function smaller, until it converges to a value that minimizes the cost function.

How do we change w?
In the direction of the negative of the gradient of the cost function.

How much do we change w at each step?
By α (alpha = the “learning rate”).

A convex cost function has only one minimum (the global one).
Gradient Descent
Which cost function do we use?

Mean Squared Error:

J(w) = \frac{1}{N} \sum_{i=1}^{N} \left( h(x^{(i)}) - y^{(i)} \right)^2

where N is the number of samples in the training set, h(x^{(i)}) is the predicted value, y^{(i)} is the true value (target), and the factor 1/N computes the mean. Squaring emphasizes larger differences and makes the algebra easier. (Some formulations include an extra factor 1/2 so that the 2 cancels when differentiating, as in the earlier summary slides.)
Gradient Descent

Notation: “=” is a truth assertion (e.g. a = 1), while “:=” is the assignment operator used to update a value (e.g. a := a + 1).

Update the weights using the gradient; loop until convergence.

General update rule for all models:

w_j := w_j - α \frac{∂}{∂ w_j} J(w)

With the MSE cost function, the solution for linear regression models is:

w_j := w_j - α \frac{2}{N} \sum_{i=1}^{N} \left( h(x^{(i)}) - y^{(i)} \right) x_j^{(i)}
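A minimal vectorized sketch of this update rule for linear regression (the dataset, learning rate, and iteration count are illustrative):

import numpy as np

def gradient_descent(X, y, alpha=0.1, iterations=2000):
    """Minimise the MSE cost J(w) = (1/N) * sum((X @ w - y)^2) by repeatedly
    stepping in the direction of the negative gradient."""
    N, d = X.shape
    w = np.zeros(d)                              # "initial guess" of the weights
    for _ in range(iterations):
        gradient = (2.0 / N) * X.T @ (X @ w - y) # dJ/dw_j for every j at once
        w -= alpha * gradient                    # w_j := w_j - alpha * dJ/dw_j
    return w

# Illustrative data generated from y = 1 + 2x
x = np.array([0.0, 1.0, 2.0, 3.0])
X = np.column_stack([np.ones_like(x), x])        # x_0 = 1 intercept column
y = 1.0 + 2.0 * x
print(gradient_descent(X, y))                    # ≈ [1.0, 2.0]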
Question:

a)
b)
c)
d)

Send your response in the Google form.


Gradient Descent Intuition

[Figure: J(w) plotted against w. At a point where the slope is positive, the update decreases w; where the slope is negative, it increases w.]

The gradient gives us the direction in which the cost function (error) increases, so we go in the opposite direction until convergence (minimum found).
Gradient Descent
How to choose the learning rate?

Too small… too big… just right!

Pictures from: https://developers.google.com/machine-learning/crash-course/reducing-loss/learning-rate
More on... Choosing Step-Size
Fixed Stepsize

https://developers.google.com/machine-learning/crash-course/fitter/graph
More on... Choosing Step-Size
Decreasing Stepsize

Adaptive stepsize: the stepsize is a function of the number of iterations (t), decaying as t grows (a common choice, not necessarily the one on the original slide, is α_t = α_0 / (1 + t)).

Why can this be better?
It effectively takes big “jumps” at the beginning and slows down once you are getting closer to the solution.
More on... Choosing Step-Size
Is gradient descent working correctly?

[Figure: J(w) plotted against the number of iterations, decreasing steadily.]

If gradient descent is working properly, then J(w) should decrease after every iteration.
More on... Convergence Criteria
What does it mean to converge?

For convex functions the optimum occurs when the derivative of the cost function is equal to zero.

In practice, though, we want to stop when we reach a certain threshold.
More on... Convergence Criteria
What does it mean to converge?

How many iterations does gradient descent need to converge?

We don’t know in advance how many iterations gradient descent needs to converge, as it is very much dependent on the application.

[Figure: J(w) plotted against the number of iterations, flattening out.]

When it looks like J(w) hasn’t gone down much more, gradient descent has converged.
More on… Learning Rate and Convergence
Plotting the cost function with different learning rates

[Figure: J(w) plotted against the number of iterations for different step-size values (alpha).]

A curve whose slope is close to zero means gradient descent has converged.

Example of a convergence criterion: declare convergence if the cost function decreases by less than some small ε (for example 10^-3) in one iteration.
More on… Learning Rate and Convergence
Example when Gradient Descent is not working

[Figure: J(w) increasing or oscillating with the number of iterations.]

Use a smaller α.

For sufficiently small α, J(w) should decrease on every iteration. But if α is too small, gradient descent can be slow to converge.

https://www.coursera.org/learn/machine-learning
Question: After one iteration of Gradient Descent, what changes will occur in this figure?

[Figure: J(w_1) plotted against w_1, with the current value of w_1 already at a local optimum.]

w_1 := w_1 - α \frac{d}{d w_1} J(w_1)

Answer: No change in w_1, because at a local optimum the derivative \frac{d}{d w_1} J(w_1) = 0.

Note: Gradient descent can converge to a local minimum, even with a fixed learning rate. As we approach a local minimum, gradient descent will automatically take smaller steps. So, there is no need to decrease the learning rate (alpha) over time.

https://www.coursera.org/learn/machine-learning
More on… Learning Rate and Convergence
Summary

• If α is too small: slow convergence.
• If α is too large: J may not decrease on every iteration; it may not converge.

Concretely, try: 0.003…, 0.03…, 0.3…

Plot J with respect to the number of iterations, then pick the value of alpha that seems to cause J to decrease most rapidly.
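A sketch of this procedure, recording J after every iteration for a few candidate learning rates (the dataset and the exact rate values are illustrative):

import numpy as np

def cost_history(X, y, alpha, iterations=100):
    """Run gradient descent, recording the MSE cost J(w) at every iteration."""
    N, d = X.shape
    w, history = np.zeros(d), []
    for _ in range(iterations):
        residual = X @ w - y
        history.append(float((residual ** 2).sum() / N))   # J(w)
        w -= alpha * (2.0 / N) * X.T @ residual
    return history

x = np.array([0.0, 1.0, 2.0, 3.0])
X = np.column_stack([np.ones_like(x), x])
y = 1.0 + 2.0 * x

for alpha in [0.003, 0.03, 0.3]:   # candidate rates, roughly 10x apart as above
    J = cost_history(X, y, alpha)
    print(f"alpha={alpha}: J after 100 iterations = {J[-1]:.3g}")
# Plot each history against the iteration number to compare them visually;
# on this data a rate that is too large (here 0.3) visibly diverges.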
Gradient Descent
What does it look like?

Picture source: https://alykhantejani.github.io/images/gradient_descent_line_graph.gif
Gradient Descent
Example with Simple Linear Regression

Linear regression model: y = wx + b, where w and b are the “weights” to be found.

Example from Burkov, Andriy - The Hundred-Page Machine Learning Book
Gradient Descent
Task: find the values of w and b that minimise the mean squared error.

Step 1: Compute the partial derivative for every parameter. Since we only have one feature, the index i represents the example in the training set, not a feature:

\frac{∂J}{∂w} = \frac{1}{N} \sum_{i=1}^{N} -2 x^{(i)} \left( y^{(i)} - (w x^{(i)} + b) \right)

\frac{∂J}{∂b} = \frac{1}{N} \sum_{i=1}^{N} -2 \left( y^{(i)} - (w x^{(i)} + b) \right)

Step 2: Update the parameters:

w := w - α \frac{∂J}{∂w}, \qquad b := b - α \frac{∂J}{∂b}

Step 3: Next epoch: repeat step 1 and step 2 until convergence.

Example from Burkov, Andriy - The Hundred-Page Machine Learning Book
71
Gradient Descent

[Code figure: a function to make predictions on new data, and the code to execute the training loop.]

Full code here: https://github.com/aburkov/theMLbook/blob/master/gradient_descent.py
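The code itself did not survive extraction; below is a minimal sketch in the spirit of Burkov’s gradient_descent.py (see the link above for the full version; function and variable names here are assumptions):

def update_w_and_b(x, y, w, b, alpha):
    """Steps 1 and 2: compute the partial derivatives of the MSE with respect
    to w and b, then move both parameters against the gradient."""
    N = len(x)
    dl_dw = sum(-2 * x[i] * (y[i] - (w * x[i] + b)) for i in range(N)) / N
    dl_db = sum(-2 * (y[i] - (w * x[i] + b)) for i in range(N)) / N
    return w - alpha * dl_dw, b - alpha * dl_db

def train(x, y, w=0.0, b=0.0, alpha=0.001, epochs=15000):
    """Step 3: repeat the update every epoch until (approximate) convergence."""
    for _ in range(epochs):
        w, b = update_w_and_b(x, y, w, b, alpha)
    return w, b

def predict(x_new, w, b):
    """Make a prediction on new data with the learned weights."""
    return w * x_new + b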

Example from Burkov, Andriy - The Hundred-Page Machine Learning Book
Gradient Descent
With scikit-learn

[Code figure: the same regression fitted with scikit-learn.]

Example from Burkov, Andriy - The Hundred-Page Machine Learning Book
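A sketch of the scikit-learn version (a minimal example; the dataset is illustrative and the full script is in Burkov’s repository linked above):

import numpy as np
from sklearn.linear_model import LinearRegression

x = np.array([0.0, 1.0, 2.0, 3.0]).reshape(-1, 1)  # sklearn expects a 2D feature matrix
y = 1.0 + 2.0 * x.ravel()                           # targets generated from y = 1 + 2x

model = LinearRegression()            # least-squares linear regression
model.fit(x, y)                       # learn intercept (b) and coefficient (w)
print(model.intercept_, model.coef_)  # ≈ 1.0 [2.0]
print(model.predict([[4.0]]))         # ≈ [9.0]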
Gradient Descent

[Figure: result of the gradient descent example.]

Example from Burkov, Andriy - The Hundred-Page Machine Learning Book
Gradient Descent VS Normal Equation
Pros and Cons

Gradient Descent:
• Need to choose alpha
• Needs many iterations
• Works well also with big datasets

Normal Equation:
• No need to choose alpha
• No iterations
• Inverse of the matrix is expensive
• Slow with large datasets

https://www.coursera.org/learn/machine-learning/lecture/2DKxQ/normal-equation
Summary Lecture 2
What we have seen in this lecture:
• Linear Regression with Multiple Features
• Normal Equation
• Gradient Descent
• Cost Function
• Learning Rate
• Gradient Descent vs Normal Equation

Piergiuseppe Mallozzi
