Machine Learning
Lecture 2 - Multiple Linear Regression
What is it?
The computer learns a formula that, applied to the input data, produces the desired output.
By: Tom Mitchell, 1998
https://www.coursera.org/learn/machine-learning
Types of Machine Learning
Types of Supervised Learning
SUPERVISED (task driven: predict the next value, classification)
h = hypothesis
Training Dataset:
Collection of labeled examples (input attributes and expected output). The training dataset can take any kind of data as an input, such as the values of a database row, the pixels of an image, or even an audio frequency histogram.
Learning:
A model is prepared through a training process in which it is required to make predictions and is corrected when those predictions are wrong. The learning continues until the algorithm achieves an acceptable level of performance.
Part 1.1 Supervised Learning - Simple Linear Regression
SUPERVISED (task driven: predict the next value, classification)
Regression: Simple Linear Regression, Multiple Linear Regression, Polynomial Regression
Classification: Logistic Regression, Decision Tree
Classification: In classification, the goal is to predict a discrete category or label for each input data point.
Regression: In regression, the goal is to predict a continuous numeric value. The output is a numerical variable, and the model estimates a real-valued number as the target variable.
Supervised Learning
Input (x)      Output (y)   Application
Home features  Price        Real Estate
https://www.coursera.org/learn/machine-learning
Supervised Learning
Structured data: Structured data is highly organized and follows a predefined schema or format. It is typically stored in databases, spreadsheets, or tables.

Size   #bedrooms   Price (1000$s)
2104   3           400
1600   3           330
2400   3           369
⋮      ⋮           ⋮
3000   4           540

User Age   Ad Id   …   Click
41         93242       1
80         93287       0
18         87312       1
⋮          ⋮           ⋮
27         71244       1

Unstructured data: It lacks a specific organizational structure. It does not conform to a fixed schema and is often free-form text, audio, video, or images. Examples: audio, images, text ("Four score and seven years ago…").
https://www.coursera.org/learn/machine-learning
Summary – Linear Regression (Univariate)
Hypothesis: $\hat{y} = m x + b$
Parameters: $m$, $b$
Cost function: $J(m, b) = \frac{1}{2N} \sum_{i=1}^{N} (\hat{y}_i - y_i)^2$
Goal: minimize $J(m, b)$ over $m$ and $b$
Cost Function
Let's set b to 0 and look for m.
[Figure: left, the training data plotted as y vs x; right, the cost J(m) plotted as a function of the slope m.]
Cost Function
Let's set b to 0 and look for m.
[Figure: the fit for m = 1 over the data, and the corresponding cost J(1) marked on the J(m) curve.]
Cost Function
Let's set b to 0 and look for m.
[Figure: the fit for m = 0.5 over the data, and its cost J(0.5) = 3.5 marked on the J(m) curve.]
Cost Function
Let's set b to 0 and look for m.
[Figure: a worse fit over the data, and its cost J(m) = 14 marked on the J(m) curve.]
Cost Function
[Figure: the full J(m) curve traced out by all the candidate values of m.]
GOAL = minimize J
Hypothesis: $\hat{y} = m x + b$
Parameters: $m$, $b$
Cost function: $J(m, b) = \frac{1}{2N} \sum_{i=1}^{N} (\hat{y}_i - y_i)^2$
With both parameters free, the cost is a surface over $(m, b)$, visualized with contour plots/figures.
Naming Conventions
The parameters m and b are often called "weights": $b = w_0$ (intercept), $m = w_1$ (slope).
Simple Linear Regression
Summary
Cost function: e.g. Sum Squared Errors (SSE) or Mean Squared Error (MSE)
$J(w_0, w_1) = \frac{1}{2N} \sum_{i=1}^{N} (w_0 + w_1 x_i - y_i)^2$
How do we find $w_0$ and $w_1$?
N is the number of training samples.
Simple Linear Regression, how to find the weights?
Analytical Solution
Setting the partial derivatives of J to zero gives the optimal weights in closed form:
$w_1 = \frac{\sum_{i=1}^{N} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{N} (x_i - \bar{x})^2}$, $\quad w_0 = \bar{y} - w_1 \bar{x}$
For simple linear regression problems (1 feature, linear fit) with a small dataset, the analytical solution is feasible and preferred.
Simple Linear Regression, how to find the weights?
Analytical Solution Example
What is the predicted price for a house of 1600 feet²?
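As a sketch of how such an example can be computed, here is the closed-form solution in Octave (the language of the later pinv slide); the dataset is an assumption, borrowed from the housing table earlier in the lecture (size in feet², price in 1000$s):

% Closed-form simple linear regression (sketch; dataset assumed from
% the earlier housing table).
x = [2104; 1600; 2400; 3000];    % size (feet^2)
y = [400; 330; 369; 540];        % price (1000$s)

w1 = sum((x - mean(x)) .* (y - mean(y))) / sum((x - mean(x)).^2);  % slope
w0 = mean(y) - w1 * mean(x);                                       % intercept

price_1600 = w0 + w1 * 1600      % predicted price for a 1600 feet^2 house

With this assumed dataset the prediction comes out around 314 (in 1000$s); the slide's actual numbers may differ.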
Summary Part 1.1
What we have seen so far:
• Types of Machine Learning
• What is Supervised Learning
• Regression vs Classification
• Function Approximation
• Correlation vs Regression
• Simple Linear Regression
• Analytical Solution with Example
Part 1.2 Supervised Learning - Multiple Linear Regression
SUPERVISED (task driven: predict the next value, classification)
Regression: Simple Linear Regression, Multiple Linear Regression, Polynomial Regression
Classification: Logistic Regression, Decision Tree
1 Feature = Simple Linear Regression
size
Multiple Features = Multiple Linear Regression
size, # rooms, location
1 Feature vs 2 Features
N features?
Visual understanding becomes more abstract…
https://www.youtube.com/watch?v=zwAD6dRSVyI&t=829s
Picture from the movie Interstellar
Multiple Features
4 features
n = number of features
https://www.coursera.org/learn/machine-learning
Multiple Features, how to find the weights?
https://www.coursera.org/learn/machine-learning 39
Multiple Features, how to find the weights?

w = pinv(X'*X)*(X'*y)

w =
   191.94
     0.38
   -59.03
   -89.47
    -3.75

(approximated values for the slides)

New entry: 900  3  1  14  →  predicted price = ? $
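To make the prediction step concrete, here is a minimal Octave sketch; the leading column of ones for the intercept and the price units are assumptions, since the slide only shows the weight vector and the new entry:

% Weights come from the normal equation w = pinv(X'*X)*(X'*y), where X is
% the training design matrix with a leading column of ones (assumed layout).
w = [191.94; 0.38; -59.03; -89.47; -3.75];   % approximated values from the slide

x_new = [1; 900; 3; 1; 14];   % new entry, with 1 prepended for the intercept
price = w' * x_new            % ≈ 214.9 (in 1000$s, assuming the earlier units)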
Normal Equation
Can we use it for any problem?
This is a "closed-form solution": $w = (X^T X)^{-1} X^T y$
Computing the inverse is computationally very expensive for large datasets, and it might not even be possible ($X^T X$ might not be invertible). Other methods?

Hypothesis: $\hat{y} = w_0 + w_1 x_1 + \dots + w_n x_n$
Parameters: $w_0, w_1, \dots, w_n$
Cost function: $J(w) = \frac{1}{2N} \sum_{i=1}^{N} (\hat{y}_i - y_i)^2$
Goal: minimize $J(w)$
Single/Multiple Features, how to find the weights?
Gradient Descent
A search algorithm that starts with some "initial guess" of w and repeatedly changes w to make the cost function smaller, until we converge to a value of w that minimizes the cost function.
How do we change w?
Gradient Descent
Which cost function do we use?
Update the weights using the gradient.
Notation: ":=" is the assignment operator (e.g., a := a + 1), while "=" is a truth assertion (e.g., a = 1).
• loop until convergence do:
  ○ $w := w - \alpha \frac{\partial}{\partial w} J(w)$
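A minimal Octave sketch of this loop for linear regression; the design matrix X (with a leading column of ones), the targets y, and the vectorized MSE gradient are assumptions:

% Gradient descent loop (sketch).
alpha = 0.01;                        % learning rate (has to be chosen)
w = zeros(size(X, 2), 1);            % initial guess
N = size(X, 1);                      % number of training samples
for iter = 1:1000                    % "loop until convergence"
  grad = (1/N) * X' * (X*w - y);     % gradient of the MSE cost
  w = w - alpha * grad;              % w := w - alpha * dJ/dw
end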
Question: what does the gradient descent update do to w where J(w) has a positive slope, and where it has a negative slope?
[Figure: J(w) with the positive-slope and negative-slope regions of the curve marked.]
Gradient Descent
How to choose the learning rate?
Too small: convergence is slow. Too large: the updates can overshoot the minimum. We want it just right!
https://developers.google.com/machine-learning/crash-course/fitter/graph
More on... Choosing Step-Size
Decreasing Stepsize
For example, a stepsize that decays with the iteration number $t$, such as $\alpha_t = \alpha_0 / (1 + t)$.
This enables you to effectively take big "jumps" at the beginning and slow down once you are getting closer to the solution.
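A sketch of how such a schedule plugs into the previous loop; the 1/(1+t) decay is just one assumed choice, and X, y, w, N are carried over from the earlier sketch:

% Decaying stepsize inside the gradient descent loop (sketch).
alpha0 = 0.1;                            % initial stepsize (assumed value)
for t = 1:1000
  alpha = alpha0 / (1 + t);              % big "jumps" early, smaller steps later
  w = w - alpha * (1/N) * X' * (X*w - y);
end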
More on... Choosing Step-Size
Is gradient descent working correctly?
[Figure: J(w) plotted against the number of iterations.]
If gradient descent is working properly, then J(w) should decrease after every iteration.
More on... Convergence Criteria
What does it mean to converge?
For convex functions, the optimum occurs when the derivative of the cost function is equal to zero.
More on... Convergence Criteria
What does it mean to converge?
[Figure: J(w) plotted against the number of iterations, flattening out.]
It looks like J(w) hasn't gone down much more: gradient descent has converged.
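In code, a practical version of this criterion stops when successive costs barely differ; the tolerance is an assumed value, and X, y, w, alpha, N are carried over from the earlier sketches:

% Convergence test on the cost (sketch).
tol = 1e-6;                              % assumed tolerance
J_prev = Inf;
for iter = 1:10000
  w = w - alpha * (1/N) * X' * (X*w - y);
  J = (1/(2*N)) * sum((X*w - y).^2);     % cost after the update
  if abs(J_prev - J) < tol               % J hasn't gone down much more
    break;                               % converged
  end
  J_prev = J;
end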
More on… Learning Rate and Convergence
Plotting the cost function with different learning rates
More on… Learning Rate and Convergence
Example when gradient descent is not working (J(w) growing with the iterations): use a smaller α.
https://www.coursera.org/learn/machine-learning
Question: After one iteration of Gradient Descent, what changes will occur in this figure?
[Figure: J(w_1) with the current value of w_1 already at a local optimum.]
$w_1 := w_1 - \alpha \frac{d}{dw_1} J(w_1)$
Answer: No change in $w_1$, because at a local optimum $\frac{d}{dw_1} J(w_1) = 0$, so the update leaves $w_1$ unchanged.
Note: Gradient descent can converge to a local minimum even with a fixed learning rate. As we approach a local minimum, gradient descent automatically takes smaller steps, so there is no need to decrease the learning rate (alpha) over time.
https://www.coursera.org/learn/machine-learning
More on… Learning Rate and Convergence
Summary
Concretely, try: 0.003… 0.03… 0.3…
Plot J(w) with respect to the number of iterations.
Then pick the value of alpha that seems to cause J(w) to decrease rapidly.
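A sketch of this procedure in Octave, sweeping the slide's candidate values and plotting each cost curve; X, y and the cost from the earlier sketches are assumed:

% Compare learning rates by plotting J(w) vs. the number of iterations.
alphas = [0.003, 0.03, 0.3];           % candidate values from the slide
iters = 100;
N = size(X, 1);
for a = alphas
  w = zeros(size(X, 2), 1);
  J_hist = zeros(iters, 1);
  for iter = 1:iters
    w = w - a * (1/N) * X' * (X*w - y);
    J_hist(iter) = (1/(2*N)) * sum((X*w - y).^2);
  end
  plot(1:iters, J_hist); hold on;      % pick the alpha that decreases J rapidly
end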
Gradient Descent
What does it look like?
Step 1: Compute the partial derivative of the cost for every parameter. Since we only have one feature, the index i represents the example in the training set, not which feature:
$\frac{\partial J}{\partial w_0} = \frac{1}{N} \sum_{i=1}^{N} (\hat{y}_i - y_i)$, $\quad \frac{\partial J}{\partial w_1} = \frac{1}{N} \sum_{i=1}^{N} (\hat{y}_i - y_i) x_i$
Step 2: Update the parameters (simultaneously): $w_0 := w_0 - \alpha \frac{\partial J}{\partial w_0}$, $w_1 := w_1 - \alpha \frac{\partial J}{\partial w_1}$
Step 3: Next epoch: repeat step 1 and step 2 until convergence.
Code to execute
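Here is a minimal Octave sketch of the three steps for the one-feature case; the data vectors x and y are assumed:

% One-feature gradient descent following steps 1-3 (sketch).
alpha = 0.01;  w0 = 0;  w1 = 0;  N = length(y);
for epoch = 1:1000                 % Step 3: repeat until convergence
  err = (w0 + w1 .* x) - y;        % error of example i: y_hat_i - y_i
  dw0 = (1/N) * sum(err);          % Step 1: partial derivative wrt w0
  dw1 = (1/N) * sum(err .* x);     % Step 1: partial derivative wrt w1
  w0 = w0 - alpha * dw0;           % Step 2: simultaneous update
  w1 = w1 - alpha * dw1;
end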
Gradient Descent vs Normal Equation
Gradient Descent:
• Need to choose alpha
• Needs many iterations
• Works well also with big datasets
Normal Equation:
• No need to choose alpha
• No iterations
• Inverse of matrix is expensive
• Slow with large datasets
https://www.coursera.org/learn/machine-learning/lecture/2DKxQ/normal-equation
Summary Lecture 2
What we have seen in this lecture:
• Linear Regression with Multiple Features
• Normal Equation
• Gradient Descent
• Cost Function
• Learning Rate
• Gradient Descent vs Normal Equation
Piergiuseppe Mallozzi