11 - Basic Machine Learning - Linear Regression 1
Linear Regression -
The Linear Regression Model
Reference: Domingos, Pedro. "A few useful things to know about machine learning." Communications of the ACM 55.10 (2012): 78-87.
Linear Models for Regression
• Let's discuss those key questions in the following cases
[Figure: Boston housing features (CRIM, ZN, ...) combined with weights w_0, w_1, w_2 to predict Price]
• Loss function
• Goal: minimize ℓ(w)
• Solution
Linear Models for Regression
• Linear regression with two variables
• Linear model
  y(x) = w_0 + w_1 x_1 + w_2 x_2
• Parameters
  w_0, w_1, w_2
• Loss function
  \ell(w) = \frac{1}{2N} \sum_{i=1}^{N} \left[ t^{(i)} - y(x^{(i)}) \right]^2 = \frac{1}{2N} \sum_{i=1}^{N} \left[ t^{(i)} - \left( w_0 + w_1 x_1^{(i)} + w_2 x_2^{(i)} \right) \right]^2
• Goal: minimize ℓ(w)
Steps:
• Initialize w (e.g., randomly)
• Repeatedly update w based on the gradient
  w = w − ε ∇_w ℓ(w)
  where ε is the learning rate.
Linear Models for Regression
• Linear regression with two variables
• Gradient
  \nabla_w \ell(w) = \left( \frac{\partial \ell(w)}{\partial w_0}, \frac{\partial \ell(w)}{\partial w_1}, \frac{\partial \ell(w)}{\partial w_2} \right)^T
  \frac{\partial \ell(w)}{\partial w_0} = -\frac{1}{N} \sum_{i=1}^{N} \left[ t^{(i)} - y(x^{(i)}) \right]
  \frac{\partial \ell(w)}{\partial w_1} = -\frac{1}{N} \sum_{i=1}^{N} \left[ t^{(i)} - y(x^{(i)}) \right] x_1^{(i)}
  \frac{\partial \ell(w)}{\partial w_2} = -\frac{1}{N} \sum_{i=1}^{N} \left[ t^{(i)} - y(x^{(i)}) \right] x_2^{(i)}
Steps:
• Initialize w (e.g., randomly)
• Repeatedly update w based on the gradient
  w = w − ε ∇_w ℓ(w)
  where ε is the learning rate.
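A minimal NumPy sketch of these per-parameter updates. The arrays x1, x2, t and the settings eps and the iteration count are hypothetical toy values for illustration, not from the slides:

import numpy as np

# Toy data (hypothetical): N examples with two features and a target.
x1 = np.array([1.0, 2.0, 3.0, 4.0])
x2 = np.array([2.0, 1.0, 4.0, 3.0])
t  = np.array([5.0, 6.0, 11.0, 12.0])
N  = len(t)

w0, w1, w2 = 0.0, 0.0, 0.0   # initialize w (zeros here; random also works)
eps = 0.05                   # learning rate

for _ in range(1000):
    y = w0 + w1 * x1 + w2 * x2          # current predictions y(x^(i))
    err = t - y                         # residuals t^(i) - y(x^(i))
    g0 = -np.mean(err)                  # partial derivative w.r.t. w0
    g1 = -np.mean(err * x1)             # partial derivative w.r.t. w1
    g2 = -np.mean(err * x2)             # partial derivative w.r.t. w2
    w0, w1, w2 = w0 - eps * g0, w1 - eps * g1, w2 - eps * g2

print(w0, w1, w2)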
In Math: Use Matrix Here!
In Math ( Linear Algebra )
• How do we use matrices to represent the data?
  Features: X = \begin{bmatrix} (x^{(1)})^T \\ (x^{(2)})^T \\ \vdots \\ (x^{(N)})^T \end{bmatrix}  (one example per row)
  Labels: t = \begin{bmatrix} t^{(1)} \\ t^{(2)} \\ \vdots \\ t^{(N)} \end{bmatrix}
  Model: y(x) = w_0 + w_1 x_1 + w_2 x_2, with parameter vector w = (w_0, w_1, w_2)^T
In Math ( Linear Algebra )
• For linear regression with two variables, each example is the feature vector
  x^{(i)} = \begin{bmatrix} x_1^{(i)} \\ x_2^{(i)} \end{bmatrix}
In Math ( Linear Algebra )
• For linear regression with two variables, prepend a constant 1 to every example so that the bias w_0 is handled by the same matrix product:
  X = \begin{bmatrix} 1 & (x^{(1)})^T \\ 1 & (x^{(2)})^T \\ \vdots & \vdots \\ 1 & (x^{(N)})^T \end{bmatrix}, \quad t = \begin{bmatrix} t^{(1)} \\ \vdots \\ t^{(N)} \end{bmatrix}, \quad w = \begin{bmatrix} w_0 \\ w_1 \\ w_2 \end{bmatrix}
  y(x) = w_0 + w_1 x_1 + w_2 x_2
In Math ( Linear Algebra )
• For linear regression with two variables, the loss can be written as a matrix product:
  \ell(w) = \frac{1}{2N} \sum_{i=1}^{N} \left[ t^{(i)} - \left( w_0 + w_1 x_1^{(i)} + w_2 x_2^{(i)} \right) \right]^2 = \frac{1}{2N} (t - Xw)^T (t - Xw)
In Math ( Linear Algebra )
• For linear regression with two variables
  \ell(w) = \frac{1}{2N} (t - Xw)^T (t - Xw)
  Gradient: \nabla_w \ell(w) = -\frac{1}{N} X^T (t - Xw)
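These matrix expressions can be checked with a few lines of NumPy. This is only an illustrative sketch; the toy matrix X (with its leading column of ones), targets t, and weights w are hypothetical:

import numpy as np

# Design matrix with a leading column of ones, targets t, weights w.
X = np.array([[1.0, 1.0, 2.0],
              [1.0, 2.0, 1.0],
              [1.0, 3.0, 4.0],
              [1.0, 4.0, 3.0]])
t = np.array([5.0, 6.0, 11.0, 12.0])
w = np.zeros(3)
N = len(t)

loss = (t - X @ w) @ (t - X @ w) / (2 * N)   # (1/2N)(t - Xw)^T (t - Xw)
grad = -X.T @ (t - X @ w) / N                # -(1/N) X^T (t - Xw)
print(loss, grad)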
Linear Models for Regression
• Linear regression with multiple variables
• Linear model
  y(x) = w_0 + w_1 x_1 + w_2 x_2 + ⋯ + w_k x_k
• Parameters
  w_0, w_1, w_2, …, w_k
• Loss function
  \ell(w) = \frac{1}{2N} \sum_{i=1}^{N} \left[ t^{(i)} - y(x^{(i)}) \right]^2 = \frac{1}{2N} \sum_{i=1}^{N} \left[ t^{(i)} - \left( w_0 + w_1 x_1^{(i)} + w_2 x_2^{(i)} + \cdots + w_k x_k^{(i)} \right) \right]^2
• Goal: minimize ℓ(w)
Steps:
• Initialize w (e.g., randomly)
• Repeatedly update w based on the gradient
  w = w − ε ∇_w ℓ(w)
  where ε is the learning rate.
Linear Models for Regression
• Linear regression with multiple variables
• Gradient
  \nabla_w \ell(w) = \left( \frac{\partial \ell(w)}{\partial w_0}, \frac{\partial \ell(w)}{\partial w_1}, \ldots, \frac{\partial \ell(w)}{\partial w_k} \right)^T
  \frac{\partial \ell(w)}{\partial w_0} = -\frac{1}{N} \sum_{i=1}^{N} \left[ t^{(i)} - y(x^{(i)}) \right]
  \frac{\partial \ell(w)}{\partial w_j} = -\frac{1}{N} \sum_{i=1}^{N} \left[ t^{(i)} - y(x^{(i)}) \right] x_j^{(i)}, \quad j = 1, \ldots, k
In Math ( Linear Algebra )
• Linear model
  y(x) = w^T x, where w = (w_0, w_1, …, w_k)^T and x = (1, x_1, …, x_k)^T
• Parameters
  w_0, w_1, …, w_k
• Loss function
  \ell(w) = \frac{1}{2N} \sum_{i=1}^{N} \left[ t^{(i)} - y(x^{(i)}) \right]^2 = \frac{1}{2N} (t - Xw)^T (t - Xw)
  where X = \begin{bmatrix} (x^{(1)})^T \\ (x^{(2)})^T \\ (x^{(3)})^T \\ \vdots \\ (x^{(N)})^T \end{bmatrix} and t = \begin{bmatrix} t^{(1)} \\ t^{(2)} \\ t^{(3)} \\ \vdots \\ t^{(N)} \end{bmatrix}
• Goal: minimize ℓ(w)
Steps:
• Initialize w (e.g., randomly)
• Repeatedly update w based on the gradient
  w = w − ε ∇_w ℓ(w)
  where ε is the learning rate.
In Math ( Linear Algebra )
• Loss function (matrix form)
  \ell(w) = \frac{1}{2N} \sum_{i=1}^{N} \left[ t^{(i)} - y(x^{(i)}) \right]^2 = \frac{1}{2N} (t - Xw)^T (t - Xw)
• Goal: minimize ℓ(w)
  \nabla_w \ell(w) = -\frac{1}{N} X^T (t - Xw)
Steps:
• Initialize w (e.g., randomly)
• Repeatedly update w based on the gradient
  w = w − ε ∇_w ℓ(w)
  where ε is the learning rate.
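Putting the pieces together, a sketch of the full vectorized gradient-descent loop. The helper name fit_linear_regression, the toy data, and the hyperparameters are hypothetical choices for illustration:

import numpy as np

def fit_linear_regression(X_raw, t, eps=0.05, n_iters=2000):
    """Gradient descent for y(x) = w^T x with a bias column prepended."""
    X = np.column_stack([np.ones(len(t)), X_raw])   # prepend the column of ones
    w = np.zeros(X.shape[1])                        # initialize w
    N = len(t)
    for _ in range(n_iters):
        grad = -X.T @ (t - X @ w) / N               # gradient of the loss
        w = w - eps * grad                          # w <- w - eps * grad
    return w

# Hypothetical toy data: two features, four examples.
X_raw = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0]])
t = np.array([5.0, 6.0, 11.0, 12.0])
print(fit_linear_regression(X_raw, t))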
Generalization
• Generalization: the model's ability to predict new data
[Figure: a model fit to training data, then evaluated on testing data; how well does it predict the new points?]
• In geometry, a hyperplane is a subspace of one dimension less than its ambient space.
[Figure: with one feature x_1 the fitted model is a line; with two features x_1, x_2 it is a plane]
Courtesy of Dr. Andrew Ng
Generalization
• What if our linear model is not good (for the data shown on the right)?
• We can use a more complicated model (a polynomial).
Courtesy of Dr. Sanja Fidler
Fitting a Polynomial
• Example: an M-th order polynomial function of a one-dimensional feature x:
  y(x, w) = w_0 + \sum_{j=1}^{M} w_j x^j
Courtesy of Dr. Sanja Fidler
Fitting a Polynomial
• For example, y(x, w) = w_0 + \sum_{j=1}^{3} w_j x^j
[Figure: the single feature x (size in square feet) is expanded into 1, x, x^2, x^3, which are weighted by w_0, w_1, w_2, w_3 to predict Price]
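A sketch of how such a polynomial can be fitted by building the powers 1, x, x^2, x^3 as features and reusing linear least squares. The data values and the helper polynomial_design_matrix are illustrative assumptions:

import numpy as np

def polynomial_design_matrix(x, M):
    """Columns [1, x, x^2, ..., x^M] for a one-dimensional feature x."""
    return np.vander(x, M + 1, increasing=True)

# Hypothetical data: house size (x) vs. price (t).
x = np.array([1.0, 1.5, 2.0, 2.5, 3.0])
t = np.array([2.1, 2.9, 4.2, 5.8, 8.1])

X = polynomial_design_matrix(x, M=3)
# Least-squares fit; it has the same minimizer as the loss (1/2N)||t - Xw||^2.
w, *_ = np.linalg.lstsq(X, t, rcond=None)
print(w)   # w0, w1, w2, w3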
Overfitting
• Let's use the polynomial model to explain overfitting
  y(x, w) = w_0 + \sum_{j=1}^{M} w_j x^j
[Figure: polynomial fits of increasing order M to the same data x_1, y]
Overfitting
• Observations
• A more complex model yields lower error on the training data. (If we truly find the best function in the model class, the error on the training data may go to zero.)
• However, lower training error does not guarantee lower error on new (test) data. (A small experiment illustrating this is sketched below.)
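A small synthetic experiment that illustrates the observation above; the sin-based target, the noise level, and the chosen orders M are assumptions for illustration only:

import numpy as np

rng = np.random.default_rng(0)
x_train = np.linspace(0, 1, 10)
x_test = np.linspace(0, 1, 100)
t_train = np.sin(2 * np.pi * x_train) + 0.2 * rng.standard_normal(10)
t_test = np.sin(2 * np.pi * x_test)

for M in [1, 3, 9]:
    X_tr = np.vander(x_train, M + 1, increasing=True)
    X_te = np.vander(x_test, M + 1, increasing=True)
    w, *_ = np.linalg.lstsq(X_tr, t_train, rcond=None)
    err_tr = np.mean((t_train - X_tr @ w) ** 2) / 2
    err_te = np.mean((t_test - X_te @ w) ** 2) / 2
    # Training error typically drops as M grows; test error eventually rises.
    print(M, err_tr, err_te)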
Question?
• Consider y(x, w) = w_0 + \sum_{j=1}^{M} w_j x^j
• When the weights are large, a small amount of noise in x produces a large error in the prediction:
  y(x, w) + large error ⇐ w_0 + \sum_{j=1}^{M} w_j (x + small noise)^j
• Regularization
• Redesign the loss function by introducing a regularization term:
  \ell(w) = \frac{1}{2N} \left( \sum_{i=1}^{N} \left[ t^{(i)} - y(x^{(i)}) \right]^2 + \lambda \sum_{j=1}^{M} w_j^2 \right)
Overfitting
• Let's look at the estimated weights for various M.
[Table of the estimated weight values for different M not reproduced here]
• Standard approach
• New loss function
  \ell(w) = \frac{1}{2N} \left( \sum_{i=1}^{N} \left[ t^{(i)} - y(x^{(i)}) \right]^2 + \lambda \sum_{j=1}^{M} w_j^2 \right)
• One way of dealing with this is to encourage the weights to be small. This is called regularization.
[Figure: the M = 9 fit with ln λ = −∞, i.e., no regularization]
• The penalty on the squared weights is known as ridge regression (in statistics). [Figure: the M = 9 fit with ln λ = 0] (A closed-form ridge fit is sketched below.)
• However, choose the value of λ carefully: we prefer a smooth function, but not one that is too smooth.
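A sketch of a closed-form ridge fit for the M = 9 polynomial example. The function name ridge_fit, the synthetic data, and the λ values are illustrative assumptions (λ = 0, e^{-18}, 1 roughly correspond to ln λ = −∞, −18, 0):

import numpy as np

def ridge_fit(X, t, lam):
    """Minimize (1/2N)(||t - Xw||^2 + lam * ||w_1..M||^2); the bias w0 is not penalized."""
    I = np.eye(X.shape[1])
    I[0, 0] = 0.0                            # do not regularize w0
    return np.linalg.solve(X.T @ X + lam * I, X.T @ t)

# Synthetic polynomial setup (illustrative).
rng = np.random.default_rng(0)
x = np.linspace(0, 1, 10)
t = np.sin(2 * np.pi * x) + 0.2 * rng.standard_normal(10)
X = np.vander(x, 10, increasing=True)        # M = 9

for lam in [0.0, np.exp(-18), 1.0]:
    w = ridge_fit(X, t, lam)
    print(lam, np.round(w, 2))               # large weights shrink as lambda grows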
Question?
• Why can w_0 be left out of the regularization term?
  \ell(w) = \frac{1}{2N} \left( \sum_{i=1}^{N} \left[ t^{(i)} - y(x^{(i)}) \right]^2 + \lambda \sum_{j=1}^{M} w_j^2 \right)
[Figures: the M = 9 fits for ln λ = −∞, ln λ = −18, and ln λ = 0]
Linear Models for Regression
• Linear regression (with regularization)
• Linear regression model
  y(x) = w_0 + \sum_{j=1}^{M} w_j x^j
• Parameters
  w
• Regularized loss function
  \ell(w) = \frac{1}{2N} \left( \sum_{i=1}^{N} \left[ t^{(i)} - y(x^{(i)}) \right]^2 + \lambda \sum_{j=1}^{M} w_j^2 \right)
• Goal: minimize ℓ(w)
• Gradient
  \nabla_w \ell(w) = \left( \frac{\partial \ell(w)}{\partial w_0}, \frac{\partial \ell(w)}{\partial w_1}, \ldots, \frac{\partial \ell(w)}{\partial w_M} \right)^T
  \frac{\partial \ell(w)}{\partial w_0} = -\frac{1}{N} \sum_{i=1}^{N} \left[ t^{(i)} - y(x^{(i)}) \right]
  \frac{\partial \ell(w)}{\partial w_j} = \frac{1}{N} \left( -\sum_{i=1}^{N} \left[ t^{(i)} - y(x^{(i)}) \right] \left( x^{(i)} \right)^j + \lambda w_j \right), \quad j = 1, \ldots, M
Steps:
• Initialize w (e.g., randomly)
• Repeatedly update w based on the gradient
  w = w − ε ∇_w ℓ(w)
  where ε is the learning rate.
Linear Models for Regression
• Note: substituting the gradient into the update rule gives, for j = 1, …, M:
  w_j = w_j - \epsilon \cdot \frac{1}{N} \left( -\sum_{i=1}^{N} \left[ t^{(i)} - y(x^{(i)}) \right] \left( x^{(i)} \right)^j + \lambda w_j \right)
      = w_j + \epsilon \frac{1}{N} \sum_{i=1}^{N} \left[ t^{(i)} - y(x^{(i)}) \right] \left( x^{(i)} \right)^j - \frac{\epsilon \lambda}{N} w_j
      = \left( 1 - \frac{\epsilon \lambda}{N} \right) w_j + \epsilon \frac{1}{N} \sum_{i=1}^{N} \left[ t^{(i)} - y(x^{(i)}) \right] \left( x^{(i)} \right)^j
• N is usually very large, so the factor (1 − ελ/N) is slightly less than 1; at every step it shrinks w_j toward zero, which is exactly the effect of regularization (see the sketch below).
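A sketch of this regularized update in code, showing where the (1 − ελ/N) weight-decay factor comes from. The helper ridge_gradient_step and all numbers below are hypothetical:

import numpy as np

def ridge_gradient_step(w, X, t, eps, lam):
    """One update of w for the regularized loss; w[0] (the bias) is not decayed."""
    N = len(t)
    err = t - X @ w
    grad = -X.T @ err / N              # data term of the gradient
    grad[1:] += lam * w[1:] / N        # + (lambda / N) * w_j for j >= 1
    # Equivalently: w_j <- (1 - eps*lam/N) * w_j + (eps/N) * sum_i err_i * x_j^(i)
    return w - eps * grad

# Usage (illustrative): M = 9 polynomial features on [0, 1].
X = np.vander(np.linspace(0, 1, 10), 10, increasing=True)
t = np.sin(2 * np.pi * np.linspace(0, 1, 10))
w = np.zeros(10)
for _ in range(5000):
    w = ridge_gradient_step(w, X, t, eps=0.1, lam=0.01)
print(np.round(w, 2))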
Key Concepts
• To find a good model
  • Loss function (measures the error, i.e., judges the fit)
  • Optimization (how to find a good fit)
  • Generalization (fit to unseen test data)
  • Regularization (avoid overfitting)
• What we really care about is the prediction error on new data
  • Prediction error
  • Validation
  • Bias and variance
  • Typical workflow to reduce error
  • Feature scaling
• These are key concepts in supervised learning, not just for linear regression.
Validation
• What should you NOT do in machine learning?
[Figure: (your) training dataset vs. (your) testing dataset]
• Do not use (your) testing dataset when fitting or tuning the model; it is reserved for the final evaluation (see the sketch below).
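A sketch of the honest protocol with scikit-learn's train_test_split: fit only on the training split and report error on the held-out test split. The synthetic data here is an assumption for illustration:

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Synthetic data (illustrative): two features plus a little noise.
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(200, 2))
t = 3.0 + 2.0 * X[:, 0] - 1.0 * X[:, 1] + 0.1 * rng.standard_normal(200)

X_train, X_test, t_train, t_test = train_test_split(X, t, test_size=0.25, random_state=0)

model = LinearRegression().fit(X_train, t_train)   # never touch the test set while fitting
print("train MSE:", mean_squared_error(t_train, model.predict(X_train)))
print("test  MSE:", mean_squared_error(t_test, model.predict(X_test)))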
Prediction Error
• Sources:
  • Imprecision in the data attributes (input noise, e.g., noise in per-capita crime)
  • Errors in the data targets (mis-labeling, e.g., noise in house prices)
  • Additional attributes that affect the target values but are not captured by the data attributes (latent variables). In the example, what else could affect house prices?
  • The model may be too simple to account for the data targets
Courtesy of Dr. Sanja Fidler
Bias and Variance
• Underfitting: your model cannot ever fit the training data.
• Overfitting: your model fits the training data, but has large error on testing data.
• Our goal: a model that avoids both.
• Example (courtesy of Dr. Andrew Ng):
  • Human-level error: 0.1 (the best model)
  • Training set error: 0.15
  • Validation set error: 0.8 ← the large gap indicates variance
• For large variance, what we can do:
  • Regularization
  • More data
  • …
Reference: http://scott.fortmann-roe.com/docs/BiasVariance.html
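The error comparison above can be turned into a rough rule-of-thumb diagnostic. The helper diagnose below is hypothetical and only mirrors the reasoning on this slide:

def diagnose(human_error, train_error, val_error):
    """Rough bias/variance diagnosis from the gaps between error levels (rule of thumb)."""
    bias_gap = train_error - human_error        # avoidable bias
    variance_gap = val_error - train_error      # variance
    if variance_gap > bias_gap:
        return "large variance: try regularization or more data"
    return "large bias: try a more expressive model"

print(diagnose(human_error=0.1, train_error=0.15, val_error=0.8))   # -> large variance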
More on Reducing Error
Courtesy of Dr. Andrew Ng
Reference:
• https://kevinzakka.github.io/2016/09/26/applying-deep-learning/
• https://www.youtube.com/watch?v=F1ka6a13S9I&t=4s
Preprocessing data
• What is the result? (of the numerical example shown on the slide)
• The result: 1.00000761449 (not zero)
References:
https://en.wikipedia.org/wiki/Feature_scaling
http://scikit-learn.org/stable/modules/preprocessing.html
Example: Boston Housing data
• Estimate median house price in a neighborhood based on neighborhood statistics
• Data: https://archive.ics.uci.edu/ml/datasets/Housing
(from sklearn.datasets import load_boston)
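A minimal sketch of loading this dataset and fitting a linear model. Note that load_boston has been removed from recent scikit-learn releases, so this assumes an older version (or an equivalent local copy of the data):

from sklearn.datasets import load_boston            # removed in scikit-learn >= 1.2
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

data = load_boston()
X, t = data.data, data.target                        # 13 neighborhood features, median price
X_train, X_test, t_train, t_test = train_test_split(X, t, random_state=0)

model = LinearRegression().fit(X_train, t_train)
print("R^2 on test data:", model.score(X_test, t_test))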
Preprocessing data
• Feature scaling
  • In practice, we center the data by removing the mean value of each feature, then scale it by dividing non-constant features by their standard deviation.
  • Rescaling (min-max): x' = (x − min(x)) / (max(x) − min(x))
  • Standardization (commonly used; see the sketch below): x' = (x − mean(x)) / std(x)
References:
https://en.wikipedia.org/wiki/Feature_scaling
http://scikit-learn.org/stable/modules/preprocessing.html
http://m.blog.csdn.net/article/details?id=50670674
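A sketch of the standardization step described above, done both by hand and with scikit-learn's StandardScaler (the small array is an illustrative assumption):

import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[1.0, 200.0],
              [2.0, 400.0],
              [3.0, 600.0]])

# By hand: remove the mean of each feature, divide by its standard deviation.
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

# Same thing with scikit-learn (fit on training data, then reuse for test data).
scaler = StandardScaler()
X_std_sklearn = scaler.fit_transform(X)
print(np.allclose(X_std, X_std_sklearn))   # True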
Readings
• Sections 1.1 and 1.3 in the book "Pattern Recognition and Machine Learning", by Christopher M. Bishop, Springer, 2006.
  http://users.isr.ist.utl.pt/~wurmd/Livros/school/Bishop%20-%20Pattern%20Recognition%20And%20Machine%20Learning%20-%20Springer%20%202006.pdf