EECS 836: Machine Learning
Zijun Yao
Assistant Professor, EECS Department
The University of Kansas
Agenda
• Linear Regression model
• Model definition
• Loss function
• Optimizing parameters
Supervised learning setup
• Given a collection of records (training set)
  • Each record is characterized by a pair (x, y)
  • x: feature, attribute, independent variable
  • y: target, label, dependent variable
• Goal
  • Learn a model (a function f) so that f(x) can correctly predict the value ŷ for the corresponding true y
• Tasks
  • Regression: predict a continuous value ŷ
  • Classification: predict a categorical class ŷ
[Diagram: the training set feeds a learning algorithm, which produces a function f; a feature x goes into f, which outputs the predicted value ŷ of y]
* x is bolded because it represents a set of features; y is not bolded because it is just a single value.
House price prediction - regression
[Diagram: features (size of house, # of bedrooms, …) are fed into f, which outputs the price of the house]
Linear regression
• Given
  • Data: feature values $x^{(1)}, \dots, x^{(n)}$
  • Corresponding labels: $y^{(1)}, \dots, y^{(n)}$
• Goal: find a continuous function that models the continuous points
[Scatter plot: independent variables (features) x on the horizontal axis, dependent variables (targets) y on the vertical axis]
3 ML steps for linear regression
• Step 1: define a set of functions* (define a model)
• Step 2: goodness of function (measure the error)
• Step 3: pick the best function (optimize the parameters)
*A set of functions means the same model but with different values of the parameters.
Step 1: Model definition
$\hat{y} = b + w_1 x_1 + w_2 x_2 + \dots + w_d x_d$
• ŷ is the predicted value of y (the target); $x_1, x_2, \dots, x_d$ are the 1st, 2nd, …, d-th features (the data)
• Parameters: the bias b is a fixed offset; the weights $w_1, \dots, w_d$ give the significance of each feature
• A linear relationship between the features and the target
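To make the model concrete, here is a minimal NumPy sketch (not from the slides; the parameter and feature values are made up for illustration) that evaluates ŷ = b + w₁x₁ + … + w_d x_d for one example:

```python
import numpy as np

# Hypothetical parameter values for a 3-feature model (illustration only).
w = np.array([120.0, -3.5, 15.0])   # weights: significance of each feature
b = 50_000.0                        # bias: a fixed offset

x = np.array([2000.0, 26.0, 3.0])   # one example's feature values

# Linear model: y_hat = b + w_1*x_1 + ... + w_d*x_d
y_hat = b + np.dot(w, x)
print(y_hat)
```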
Step 1: Define a set of functions
Example: predict the price y of a house from two features, the size of the house $x_{size}$ and the # of baths $x_{bath}$:
$f(\mathbf{x}) = y$ (price)
Linear regression model: $\hat{y} = b + w_{size} x_{size} + w_{bath} x_{bath}$
w and b are parameters (they can take any value).
A set of functions $f_1, f_2, \dots$: the same model with different values of the parameters — there are infinitely many such functions.
Step 1: A variant form of the linear model
$\hat{y} = b + w_1 x_1 + w_2 x_2 + \dots + w_d x_d$
Equivalence: by introducing a 0th feature fixed to $x_0 = 1$ and letting $w_0 = b$,
$\hat{y} = w_0 x_0 + w_1 x_1 + w_2 x_2 + \dots + w_d x_d = \mathbf{w}^\top \mathbf{x}$
Linear regression model: the prediction ŷ is a function of x, where $\mathbf{w}$ collects the parameters and $\mathbf{x}$ the features.
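A small sketch (my own illustration, not from the slides) of this equivalence: prepending a constant feature x₀ = 1 lets the bias fold into the weight vector, so ŷ = wᵀx.

```python
import numpy as np

w = np.array([-3.5, 15.0])      # weights for the original d = 2 features (hypothetical values)
b = 50.0                        # bias

x = np.array([26.0, 3.0])       # one example with d = 2 features

# Original form: y_hat = b + w . x
y_hat_original = b + w @ x

# Variant form: prepend x_0 = 1 and fold b in as w_0
w_aug = np.concatenate(([b], w))    # [w_0, w_1, ..., w_d] with w_0 = b
x_aug = np.concatenate(([1.0], x))  # [x_0, x_1, ..., x_d] with x_0 = 1
y_hat_variant = w_aug @ x_aug

assert np.isclose(y_hat_original, y_hat_variant)   # the two forms give the same prediction
```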
Agenda
• Linear Regression model
• Model definition
• Loss function
• Optimizing parameters
Step 2: Goodness of function
How good is a function? Measure the difference between the predicted and the true y.
Suppose the training data has two house features, size and age (the superscript is the data index):
• Function input $x^{(1)}_{size}$ = 4,043 sqft, $x^{(1)}_{age}$ = 26 years → function output (scalar) $\hat{y}^{(1)} = f(\mathbf{x}^{(1)})$; label value $y^{(1)}$ = 784,000
• Function input $x^{(2)}_{size}$ = 4,976 sqft, $x^{(2)}_{age}$ = 8 years → function output (scalar) $\hat{y}^{(2)} = f(\mathbf{x}^{(2)})$; label value $y^{(2)}$ = 724,900
Measure the difference between each $\hat{y}^{(i)}$ and $y^{(i)}$.
Step 2: Measure error
How good is a function? Use a loss function L.
• Input: a function and the data. Output: the loss, i.e. how far the predictions are from the true values.
• Sum of squared errors (SSE), summed over the examples, where $f(x^{(i)})$ is the estimated y based on the input function and $y^{(i)}$ is the true y:
$L(f) = \sum_{i=1}^{n} \left( y^{(i)} - f(x^{(i)}) \right)^2$
• Averaged by n, you get the mean squared error (MSE) loss: $\frac{1}{n}\sum_{i=1}^{n} \left( y^{(i)} - f(x^{(i)}) \right)^2$
• For the two-feature house example:
$L(w_{size}, w_{age}, b) = \sum_{i=1}^{n} \left( y^{(i)} - \left( b + w_{size} \cdot x^{(i)}_{size} + w_{age} \cdot x^{(i)}_{age} \right) \right)^2$
The loss function is also called the cost function or the objective function.
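As a sketch (only the two examples in the table above come from the slides; the parameter values below are made up), the SSE and MSE losses can be computed like this:

```python
import numpy as np

# The two training examples from the slide: (size, age) -> price
X = np.array([[4043.0, 26.0],
              [4976.0,  8.0]])
y = np.array([784_000.0, 724_900.0])

def sse_loss(w_size, w_age, b):
    """Sum of squared errors L(w_size, w_age, b) over the training data."""
    y_hat = b + w_size * X[:, 0] + w_age * X[:, 1]
    return np.sum((y - y_hat) ** 2)

# Evaluate the loss for one (hypothetical) choice of parameters.
print(sse_loss(w_size=150.0, w_age=-1000.0, b=100_000.0))
# The MSE is the same quantity averaged by n:
print(sse_loss(150.0, -1000.0, 100_000.0) / len(y))
```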
Step 2: Loss function
$L(f) = \sum_{i=1}^{n} \left( y^{(i)} - f(x^{(i)}) \right)^2$
where $f(x^{(i)})$ is the estimated y based on the input function.
[Plots: the residuals between the data points and the fitted function — a simple case where only one feature is used to predict y, and the case where there are 2 features to predict y]
Step 2: Intuition of loss function
• Let's use a simple case with only one feature and no bias (b = 0), with data points (1, 1), (2, 2), (3, 3):
$L(f) = \sum_{i=1}^{n} \left( y^{(i)} - w \cdot x^{(i)} \right)^2$
[Plots: the fitted line f(x) against the data, and the loss L(f) as a function of w]
• w = 1:   $L(f) = (1-1)^2 + (2-2)^2 + (3-3)^2 = 0$
• w = 0.5: $L(f) = (1-0.5)^2 + (2-1)^2 + (3-1.5)^2 = 3.5$
• w = 0:   $L(f) = (1-0)^2 + (2-0)^2 + (3-0)^2 = 14$
The loss function L is convex in w (a bowl shape), so it has a single minimum.
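A quick check of the three values above (a sketch that just reproduces the slide's toy data with b = 0):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = np.array([1.0, 2.0, 3.0])

def loss(w):
    """L(f) = sum_i (y_i - w * x_i)^2 for the one-feature, no-bias model."""
    return np.sum((y - w * x) ** 2)

for w in (1.0, 0.5, 0.0):
    print(w, loss(w))   # -> 0.0, 3.5, 14.0
```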
Step 2: Intuition of loss function
[Plots: in the one-feature case, L(f) is a curve over w; in the two-feature case, L(f) is a surface over (w₁, w₂)]
The loss function tracks the performance of the model as the parameters change.
Agenda
• Linear Regression model
• Model definition
• Loss function
• Optimizing parameters
Step 3: Find the best function
$L(w, b) = \sum_{i=1}^{n} \left( y^{(i)} - \left( b + w_{size} \cdot x^{(i)}_{size} + w_{age} \cdot x^{(i)}_{age} \right) \right)^2$
The "best" function is the one that gives the minimum loss. Optimizing the parameters means searching over w, b to find the minimum of L:
$f^* = \arg\min_f L(f)$
$w^*, b^* = \arg\min_{w,b} L(w, b) = \arg\min_{w,b} \sum_{i=1}^{n} \left( y^{(i)} - \left( b + w_{size} \cdot x^{(i)}_{size} + w_{age} \cdot x^{(i)}_{age} \right) \right)^2$
Derivatives
[Plot: L(w) with its tangent line at w = 1]
• The derivative of the loss function L, dL/dw, is the sensitivity of the loss function to a change in a parameter w.
• Partial derivatives: let $L(w_1, \dots, w_d)$ be the loss as a function of several parameters. The partial derivative $\partial L / \partial w_i$ is the derivative with respect to one of the parameters, $w_i$, with the others held constant.
• Gradient: the gradient $\nabla L$ is the vector consisting of the partial derivatives with respect to each parameter.
• How do we reduce the loss function? Subtract the gradient from each parameter w: the gradient is the direction that increases the value of L(w), so stepping against it decreases the loss.
Step 3: Gradient descent
• Consider a loss function L(w) with one parameter w: $w^* = \arg\min_w L(w)$
  ➢ (Randomly) pick an initial value $w^0$ at time 0
  ➢ Compute $\frac{dL}{dw}\big|_{w=w^0}$: if it is negative, increase w; if it is positive, decrease w
  ➢ Update $w^1 \leftarrow w^0 - \alpha \frac{dL}{dw}\big|_{w=w^0}$, where $\alpha$ is called the "learning rate" (usually small, like 0.05)
  ➢ Compute $\frac{dL}{dw}\big|_{w=w^1}$ and update $w^2 \leftarrow w^1 - \alpha \frac{dL}{dw}\big|_{w=w^1}$
  ➢ Repeat the iterations until convergence: $w^{t+1} \leftarrow w^t - \alpha \frac{dL}{dw}\big|_{w=w^t}$
[Plot: the sequence $w^0, w^1, w^2, \dots, w^T$ stepping downhill on L(w); the search can settle in a local minimum instead of the global minimum]
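A minimal sketch of this one-parameter update rule on the toy data used earlier (the learning rate 0.05 is the value the slide mentions; everything else is illustrative):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = np.array([1.0, 2.0, 3.0])

def dL_dw(w):
    """Derivative of L(w) = sum_i (y_i - w*x_i)^2 with respect to w."""
    return -2.0 * np.sum((y - w * x) * x)

alpha = 0.05   # learning rate
w = 0.0        # (randomly) picked initial value w^0

for t in range(100):             # repeat until (approximate) convergence
    w = w - alpha * dL_dw(w)     # w^{t+1} <- w^t - alpha * dL/dw

print(w)   # converges toward the minimizer w* = 1
```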
Step 3: Gradient descent
• How about two parameters? The gradient is $\nabla L = \left[ \frac{\partial L}{\partial w}, \frac{\partial L}{\partial b} \right]^\top$
$w^*, b^* = \arg\min_{w,b} L(w, b)$
  ➢ (Randomly) pick an initial value for each parameter: $w^0, b^0$
  ➢ Compute $\frac{\partial L}{\partial w}\big|_{w=w^0, b=b^0}$ and $\frac{\partial L}{\partial b}\big|_{w=w^0, b=b^0}$, then update
     $w^1 \leftarrow w^0 - \alpha \frac{\partial L}{\partial w}\big|_{w=w^0, b=b^0}$,  $b^1 \leftarrow b^0 - \alpha \frac{\partial L}{\partial b}\big|_{w=w^0, b=b^0}$
  ➢ Compute $\frac{\partial L}{\partial w}\big|_{w=w^1, b=b^1}$ and $\frac{\partial L}{\partial b}\big|_{w=w^1, b=b^1}$, then update $w^2$ and $b^2$ in the same way
  ➢ Repeat the iterations until convergence
Step 3: Gradient descent
• Gradient of linear regression
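The derivation on this slide did not survive extraction; as a sketch, the standard gradient of the SSE loss defined earlier is:

```latex
% Gradient of L(w, b) = \sum_{i=1}^{n} \big( y^{(i)} - (b + \mathbf{w}^\top \mathbf{x}^{(i)}) \big)^2
\frac{\partial L}{\partial w_j} = -2 \sum_{i=1}^{n} \left( y^{(i)} - \hat{y}^{(i)} \right) x_j^{(i)},
\qquad
\frac{\partial L}{\partial b} = -2 \sum_{i=1}^{n} \left( y^{(i)} - \hat{y}^{(i)} \right),
\qquad \text{where } \hat{y}^{(i)} = b + \mathbf{w}^\top \mathbf{x}^{(i)}.
```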
Step 3: Gradient descent
How does the loss get minimized by gradient descent?
[Animation across several slides: on the left, the fitted line f(x) over the training data; on the right, the contour plot of L(w, b) from high L to low L, with the current (w, b) stepping toward lower loss at each iteration until the minimum is found]
Slides by Andrew Ng
Step 3: Gradient descent
• A small gradient can slow down or halt the optimization
[Plot of the loss against the value of the parameter w:
  - very slow at a plateau ($\partial L / \partial w \approx 0$)
  - stuck at a saddle point ($\partial L / \partial w = 0$)
  - stuck at a local minimum ($\partial L / \partial w = 0$)]
Step 3: Gradient descent – learning rate α
Monitor the loss at each iteration:
- Search α by order of magnitude at first (e.g., $10^{-1} \dots 10^{-5}$)
- Then tune α locally to achieve efficient convergence
[Link]
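A sketch of that order-of-magnitude search (it reuses the earlier toy one-feature data, which is my choice, not the linked demo; the $10^{-1} \dots 10^{-5}$ range is the one mentioned above):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = np.array([1.0, 2.0, 3.0])

def loss(w):
    return np.sum((y - w * x) ** 2)

def run_gd(alpha, steps=50):
    """Run gradient descent on the toy one-parameter problem; return the loss at each iteration."""
    w, history = 0.0, []
    for _ in range(steps):
        w -= alpha * (-2.0 * np.sum((y - w * x) * x))   # w <- w - alpha * dL/dw
        history.append(loss(w))
    return history

# Search alpha by order of magnitude first, monitoring the loss at each iteration.
for alpha in (1e-1, 1e-2, 1e-3, 1e-4, 1e-5):
    hist = run_gd(alpha)
    print(f"alpha={alpha:g}  final loss={hist[-1]:.4g}  decreasing={hist[-1] < hist[0]}")
```

Too large a learning rate makes the loss blow up; too small a one converges very slowly, which is exactly why monitoring the loss curve is useful.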
Linear algebra review
• A vector in $\mathbb{R}^d$ is an ordered set of d real values
• A matrix in $\mathbb{R}^{n \times m}$ is an n-by-m object with n rows and m columns
• Transpose
• Matrix product: the dimensions must be compatible, e.g., a 4×2 matrix times a 2×3 matrix gives a 4×3 matrix
Vectorization of linear regression
• Benefits of vectorization
  • More compact equations
  • Faster code (using optimized matrix libraries)
• Linear regression model: $\hat{y} = w_0 x_0 + w_1 x_1 + \dots + w_d x_d$ (with $x_0 = 1$)
• Let $\mathbf{w} = [w_0, w_1, \dots, w_d]^\top$ and $\mathbf{x} = [x_0, x_1, \dots, x_d]^\top$
• In vectorized form, the linear regression model is $\hat{y} = \mathbf{w}^\top \mathbf{x}$
Vectorization of linear regression
• Consider the model for n instances
• Let $\mathbf{w} \in \mathbb{R}^{(d+1) \times 1}$ and $\mathbf{X} \in \mathbb{R}^{n \times (d+1)}$, where the i-th row of $\mathbf{X}$ is the instance $[1, x_1^{(i)}, \dots, x_d^{(i)}]$
• In vectorized form, the linear regression model is $\hat{\mathbf{y}} = \mathbf{X} \mathbf{w}$
Vectorization of linear regression
• For the loss function: $L(\mathbf{w}) = (\mathbf{y} - \mathbf{X}\mathbf{w})^\top (\mathbf{y} - \mathbf{X}\mathbf{w})$
• A one-time matrix calculation, without iterating through all data samples.
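A sketch of the vectorized computation in NumPy (the synthetic data, true weights, learning rate, and iteration count below are all illustrative; the formulas match the vectorized model and loss above):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 100, 3
X = np.c_[np.ones(n), rng.normal(size=(n, d))]   # n x (d+1) design matrix, first column is x_0 = 1
true_w = np.array([2.0, 1.0, -3.0, 0.5])
y = X @ true_w + 0.1 * rng.normal(size=n)

def predict(X, w):
    return X @ w                                  # y_hat = X w: all n instances at once

def sse(X, y, w):
    r = y - X @ w                                 # residual vector
    return r @ r                                  # (y - Xw)^T (y - Xw): one matrix calculation

# Vectorized gradient of the SSE loss: -2 X^T (y - Xw)
w = np.zeros(d + 1)
alpha = 0.001
for _ in range(500):
    w -= alpha * (-2.0 * X.T @ (y - X @ w))

print(np.round(w, 2))   # close to true_w
```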
Improving learning
• Feature scaling (or normalization)
  • Ensure all features have similar scales
  • Gradient descent converges faster
[Link]
Feature standardization
• Rescale features to have zero mean and unit variance
  • Let $\mu_j$ be the mean of feature j
  • Let $s_j$ be the standard deviation of feature j
  • Replace each value with $x_j \leftarrow \frac{x_j - \mu_j}{s_j}$ for $j = 1 \dots d$ (not $x_0$)
• Must apply the same transformation to both training and testing instances
• Outliers can cause problems
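A sketch of that rule (the first two training rows reuse the slide's house sizes and ages; the third row and the test row are hypothetical): the means and standard deviations are computed on the training set only and then reused for the test set.

```python
import numpy as np

X_train = np.array([[4043.0, 26.0],
                    [4976.0,  8.0],
                    [3100.0, 40.0]])   # third row is made up for illustration
X_test  = np.array([[3600.0, 15.0]])   # hypothetical test instance

mu = X_train.mean(axis=0)          # mu_j: mean of feature j (training data only)
s  = X_train.std(axis=0)           # s_j: standard deviation of feature j

X_train_std = (X_train - mu) / s   # replace each value with (x_j - mu_j) / s_j
X_test_std  = (X_test  - mu) / s   # same transformation applied to the test instances
```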
Regularization
• A method to control the complexity of the model and avoid overfitting
• Why: address overfitting by keeping w small
• How: penalize large values of $w_j$
• Can be incorporated into the loss function
• Works well when we have a lot of features
$L(f) = \underbrace{\sum_{i=1}^{n} \left( y^{(i)} - f(x^{(i)}) \right)^2}_{\text{model fit to data}} + \underbrace{\lambda \sum_{j=1}^{d} w_j^2}_{\text{regularization}}$
The penalty term is also called the (squared) $L_2$-norm.
o $\lambda$ is a predefined hyperparameter that controls the degree of regularization
o No regularization on $w_0$ (the bias b)
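A sketch extending the vectorized gradient step with this $L_2$ penalty (the λ, learning rate, data, and iteration count are arbitrary choices; the bias $w_0$ is excluded from the penalty, as the slide says):

```python
import numpy as np

def ridge_gradient_step(w, X, y, alpha=0.005, lam=0.1):
    """One gradient descent step on L(w) = ||y - Xw||^2 + lam * sum_{j>=1} w_j^2."""
    grad = -2.0 * X.T @ (y - X @ w)     # gradient of the data-fit term
    penalty_grad = 2.0 * lam * w        # gradient of the regularization term
    penalty_grad[0] = 0.0               # no regularization on w_0 (the bias b)
    return w - alpha * (grad + penalty_grad)

# Example usage on small synthetic data:
rng = np.random.default_rng(0)
X = np.c_[np.ones(20), rng.normal(size=(20, 2))]
y = X @ np.array([1.0, 2.0, -1.0])
w = np.zeros(3)
for _ in range(300):
    w = ridge_gradient_step(w, X, y)
print(np.round(w, 2))   # close to [1, 2, -1], slightly shrunk by the penalty
```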
Summary
• Problem: estimate a real value
• Model: $\hat{y} = b + w_1 x_1 + w_2 x_2 + \dots + w_d x_d$
• Loss function: the sum of squared errors (SSE), here averaged over n (the MSE):
$L(\mathbf{w}, b) = \frac{1}{n} \sum_{i=1}^{n} l^{(i)}(\mathbf{w}, b) = \frac{1}{n} \sum_{i=1}^{n} \left( y^{(i)} - \hat{y}^{(i)} \right)^2$
• Optimize the parameters by the gradient descent method
  • Choose a starting point
  • Repeat:
    • Compute the gradient
    • Update the parameters
Demo
• Use ML library
• [Link]
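The linked demo itself is not reproduced here; as a guess at what "use ML library" might look like, a minimal scikit-learn sketch on random data (all values hypothetical):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))                            # 100 instances, 2 features
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 5.0 + 0.1 * rng.normal(size=100)

model = LinearRegression().fit(X, y)   # fits the weights and the bias by least squares
print(model.coef_, model.intercept_)   # weights close to [3, -2], bias close to 5
print(model.predict(X[:3]))            # predictions for the first 3 instances
```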