11 - Basic Machine Learning - Linear Regression 1
Linear Regression -
The Linear Regression Model
Reference: Domingos, Pedro. "A few useful things to know about machine learning." Communications of the ACM 55.10 (2012): 78-87.
Linear Models for Regression
• Let's discuss those key questions in the following cases
[Figure: Boston housing features (CRIM, ZN, ...) combined with weights w_0, w_1, w_2 to predict Price]
• Loss function
• Goal: minimize ℓ(w)
• Solution
Linear Models for Regression
• Linear regression with two variables
• Linear model
  y(x) = w_0 + w_1 x_1 + w_2 x_2
• Parameters
  w_0, w_1, w_2
• Loss function
  \ell(w) = \frac{1}{2N} \sum_{i=1}^{N} \left[ t^{(i)} - y(x^{(i)}) \right]^2 = \frac{1}{2N} \sum_{i=1}^{N} \left[ t^{(i)} - \left( w_0 + w_1 x_1^{(i)} + w_2 x_2^{(i)} \right) \right]^2
• Goal: minimize ℓ(w)
Steps:
• Initialize w (e.g., randomly)
• Repeatedly update w based on the gradient
  w = w − ε ∇_w ℓ(w)
  where ε is the learning rate.
Linear Models for Regression
• Linear regression with two variables
• Gradient
  \nabla_w \ell(w) = \left( \frac{\partial \ell(w)}{\partial w_0}, \frac{\partial \ell(w)}{\partial w_1}, \frac{\partial \ell(w)}{\partial w_2} \right)^T
  \frac{\partial \ell(w)}{\partial w_0} = -\frac{1}{N} \sum_{i=1}^{N} \left[ t^{(i)} - y(x^{(i)}) \right]
  \frac{\partial \ell(w)}{\partial w_1} = -\frac{1}{N} \sum_{i=1}^{N} \left[ t^{(i)} - y(x^{(i)}) \right] x_1^{(i)}
  \frac{\partial \ell(w)}{\partial w_2} = -\frac{1}{N} \sum_{i=1}^{N} \left[ t^{(i)} - y(x^{(i)}) \right] x_2^{(i)}
Steps:
• Initialize w (e.g., randomly)
• Repeatedly update w based on the gradient
  w = w − ε ∇_w ℓ(w)
  where ε is the learning rate.
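A minimal NumPy sketch of these per-parameter updates. The arrays x1, x2, t and the settings eps and the iteration count are hypothetical toy values for illustration, not from the slides:

import numpy as np

# Toy data (hypothetical): N examples with two features and a target.
x1 = np.array([1.0, 2.0, 3.0, 4.0])
x2 = np.array([2.0, 1.0, 4.0, 3.0])
t  = np.array([5.0, 6.0, 11.0, 12.0])
N  = len(t)

w0, w1, w2 = 0.0, 0.0, 0.0   # initialize w (zeros here; random also works)
eps = 0.05                   # learning rate

for _ in range(1000):
    y = w0 + w1 * x1 + w2 * x2          # current predictions y(x^(i))
    err = t - y                         # residuals t^(i) - y(x^(i))
    g0 = -np.mean(err)                  # partial derivative w.r.t. w0
    g1 = -np.mean(err * x1)             # partial derivative w.r.t. w1
    g2 = -np.mean(err * x2)             # partial derivative w.r.t. w2
    w0, w1, w2 = w0 - eps * g0, w1 - eps * g1, w2 - eps * g2

print(w0, w1, w2)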
In Math: Use Matrix Here!
In Math ( Linear Algebra )
• How do we use matrices to represent the data?
  Features: X = \begin{bmatrix} (x^{(1)})^T \\ (x^{(2)})^T \\ \vdots \\ (x^{(N)})^T \end{bmatrix}  (one example per row)
  Labels: t = \begin{bmatrix} t^{(1)} \\ t^{(2)} \\ \vdots \\ t^{(N)} \end{bmatrix}
  Model: y(x) = w_0 + w_1 x_1 + w_2 x_2, with parameter vector w = (w_0, w_1, w_2)^T
In Math ( Linear Algebra )
• For linear regression with two variables, each example is the feature vector
  x^{(i)} = \begin{bmatrix} x_1^{(i)} \\ x_2^{(i)} \end{bmatrix}
In Math ( Linear Algebra )
• For linear regression with two variables, prepend a constant 1 to every example so that the bias w_0 is handled by the same matrix product:
  X = \begin{bmatrix} 1 & (x^{(1)})^T \\ 1 & (x^{(2)})^T \\ \vdots & \vdots \\ 1 & (x^{(N)})^T \end{bmatrix}, \quad t = \begin{bmatrix} t^{(1)} \\ \vdots \\ t^{(N)} \end{bmatrix}, \quad w = \begin{bmatrix} w_0 \\ w_1 \\ w_2 \end{bmatrix}
  y(x) = w_0 + w_1 x_1 + w_2 x_2
In Math ( Linear Algebra )
• For linear regression with two variables, the loss can be written as a matrix product:
  \ell(w) = \frac{1}{2N} \sum_{i=1}^{N} \left[ t^{(i)} - \left( w_0 + w_1 x_1^{(i)} + w_2 x_2^{(i)} \right) \right]^2 = \frac{1}{2N} (t - Xw)^T (t - Xw)
In Math ( Linear Algebra )
• For linear regression with two variables
  \ell(w) = \frac{1}{2N} (t - Xw)^T (t - Xw)
  Gradient: \nabla_w \ell(w) = -\frac{1}{N} X^T (t - Xw)
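These matrix expressions can be checked with a few lines of NumPy. This is only an illustrative sketch; the toy matrix X (with its leading column of ones), targets t, and weights w are hypothetical:

import numpy as np

# Design matrix with a leading column of ones, targets t, weights w.
X = np.array([[1.0, 1.0, 2.0],
              [1.0, 2.0, 1.0],
              [1.0, 3.0, 4.0],
              [1.0, 4.0, 3.0]])
t = np.array([5.0, 6.0, 11.0, 12.0])
w = np.zeros(3)
N = len(t)

loss = (t - X @ w) @ (t - X @ w) / (2 * N)   # (1/2N)(t - Xw)^T (t - Xw)
grad = -X.T @ (t - X @ w) / N                # -(1/N) X^T (t - Xw)
print(loss, grad)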
Linear Models for Regression
• Linear regression with multiple variables
• Linear model
  y(x) = w_0 + w_1 x_1 + w_2 x_2 + ⋯ + w_k x_k
• Parameters
  w_0, w_1, w_2, …, w_k
• Loss function
  \ell(w) = \frac{1}{2N} \sum_{i=1}^{N} \left[ t^{(i)} - y(x^{(i)}) \right]^2 = \frac{1}{2N} \sum_{i=1}^{N} \left[ t^{(i)} - \left( w_0 + w_1 x_1^{(i)} + w_2 x_2^{(i)} + \cdots + w_k x_k^{(i)} \right) \right]^2
• Goal: minimize ℓ(w)
Steps:
• Initialize w (e.g., randomly)
• Repeatedly update w based on the gradient
  w = w − ε ∇_w ℓ(w)
  where ε is the learning rate.
Linear Models for Regression
• Linear regression with multiple variables
• Gradient
  \nabla_w \ell(w) = \left( \frac{\partial \ell(w)}{\partial w_0}, \frac{\partial \ell(w)}{\partial w_1}, \ldots, \frac{\partial \ell(w)}{\partial w_k} \right)^T
  \frac{\partial \ell(w)}{\partial w_0} = -\frac{1}{N} \sum_{i=1}^{N} \left[ t^{(i)} - y(x^{(i)}) \right]
  \frac{\partial \ell(w)}{\partial w_j} = -\frac{1}{N} \sum_{i=1}^{N} \left[ t^{(i)} - y(x^{(i)}) \right] x_j^{(i)}, \quad j = 1, \ldots, k
In Math ( Linear Algebra )
• Linear model
  y(x) = w^T x, where w = (w_0, w_1, …, w_k)^T and x = (1, x_1, …, x_k)^T
• Parameters
  w_0, w_1, …, w_k
• Loss function
  \ell(w) = \frac{1}{2N} \sum_{i=1}^{N} \left[ t^{(i)} - y(x^{(i)}) \right]^2 = \frac{1}{2N} (t - Xw)^T (t - Xw)
  where X = \begin{bmatrix} (x^{(1)})^T \\ (x^{(2)})^T \\ (x^{(3)})^T \\ \vdots \\ (x^{(N)})^T \end{bmatrix} and t = \begin{bmatrix} t^{(1)} \\ t^{(2)} \\ t^{(3)} \\ \vdots \\ t^{(N)} \end{bmatrix}
• Goal: minimize ℓ(w)
Steps:
• Initialize w (e.g., randomly)
• Repeatedly update w based on the gradient
  w = w − ε ∇_w ℓ(w)
  where ε is the learning rate.
In Math ( Linear Algebra )
• Loss function (matrix form)
  \ell(w) = \frac{1}{2N} \sum_{i=1}^{N} \left[ t^{(i)} - y(x^{(i)}) \right]^2 = \frac{1}{2N} (t - Xw)^T (t - Xw)
• Goal: minimize ℓ(w)
  \nabla_w \ell(w) = -\frac{1}{N} X^T (t - Xw)
Steps:
• Initialize w (e.g., randomly)
• Repeatedly update w based on the gradient
  w = w − ε ∇_w ℓ(w)
  where ε is the learning rate.
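Putting the pieces together, a sketch of the full vectorized gradient-descent loop. The helper name fit_linear_regression, the toy data, and the hyperparameters are hypothetical choices for illustration:

import numpy as np

def fit_linear_regression(X_raw, t, eps=0.05, n_iters=2000):
    """Gradient descent for y(x) = w^T x with a bias column prepended."""
    X = np.column_stack([np.ones(len(t)), X_raw])   # prepend the column of ones
    w = np.zeros(X.shape[1])                        # initialize w
    N = len(t)
    for _ in range(n_iters):
        grad = -X.T @ (t - X @ w) / N               # gradient of the loss
        w = w - eps * grad                          # w <- w - eps * grad
    return w

# Hypothetical toy data: two features, four examples.
X_raw = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0]])
t = np.array([5.0, 6.0, 11.0, 12.0])
print(fit_linear_regression(X_raw, t))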
Generalization
• Generalization: the model's ability to predict new data
[Figure: a model fit to training data, then evaluated on testing data; how well does it predict the new points?]
• In geometry, a hyperplane is a subspace of one dimension less than its ambient space.
[Figure: with one feature x_1 the fitted model is a line; with two features x_1, x_2 it is a plane]
Courtesy of Dr. Andrew Ng
Generalization
• What if our linear model is not good (for the data shown on the right)?
• We can use a more complicated model (a polynomial).
Courtesy of Dr. Sanja Fidler
Fitting a Polynomial
• Example: an M-th order polynomial function of a one-dimensional feature x:
  y(x, w) = w_0 + \sum_{j=1}^{M} w_j x^j
Courtesy of Dr. Sanja Fidler
Fitting a Polynomial
• For example, y(x, w) = w_0 + \sum_{j=1}^{3} w_j x^j
[Figure: the single feature x (size in square feet) is expanded into 1, x, x^2, x^3, which are weighted by w_0, w_1, w_2, w_3 to predict Price]
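A sketch of how such a polynomial can be fitted by building the powers 1, x, x^2, x^3 as features and reusing linear least squares. The data values and the helper polynomial_design_matrix are illustrative assumptions:

import numpy as np

def polynomial_design_matrix(x, M):
    """Columns [1, x, x^2, ..., x^M] for a one-dimensional feature x."""
    return np.vander(x, M + 1, increasing=True)

# Hypothetical data: house size (x) vs. price (t).
x = np.array([1.0, 1.5, 2.0, 2.5, 3.0])
t = np.array([2.1, 2.9, 4.2, 5.8, 8.1])

X = polynomial_design_matrix(x, M=3)
# Least-squares fit; it has the same minimizer as the loss (1/2N)||t - Xw||^2.
w, *_ = np.linalg.lstsq(X, t, rcond=None)
print(w)   # w0, w1, w2, w3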
Overfitting
• Let's use the polynomial model to explain overfitting
  y(x, w) = w_0 + \sum_{j=1}^{M} w_j x^j
[Figure: polynomial fits of increasing order M to the same data x_1, y]
Overfitting
• Observations
• A more complex model yields lower error on the training data. (If we truly find the best function in the model class, the error on the training data may go to zero.)
• However, lower training error does not guarantee lower error on new (test) data. (A small experiment illustrating this is sketched below.)
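A small synthetic experiment that illustrates the observation above; the sin-based target, the noise level, and the chosen orders M are assumptions for illustration only:

import numpy as np

rng = np.random.default_rng(0)
x_train = np.linspace(0, 1, 10)
x_test = np.linspace(0, 1, 100)
t_train = np.sin(2 * np.pi * x_train) + 0.2 * rng.standard_normal(10)
t_test = np.sin(2 * np.pi * x_test)

for M in [1, 3, 9]:
    X_tr = np.vander(x_train, M + 1, increasing=True)
    X_te = np.vander(x_test, M + 1, increasing=True)
    w, *_ = np.linalg.lstsq(X_tr, t_train, rcond=None)
    err_tr = np.mean((t_train - X_tr @ w) ** 2) / 2
    err_te = np.mean((t_test - X_te @ w) ** 2) / 2
    # Training error typically drops as M grows; test error eventually rises.
    print(M, err_tr, err_te)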
Question?
• Consider y(x, w) = w_0 + \sum_{j=1}^{M} w_j x^j
• When the weights are large, a small amount of noise in x produces a large error in the prediction:
  y(x, w) + large error ⇐ w_0 + \sum_{j=1}^{M} w_j (x + small noise)^j
• Regularization
• Redesign the loss function by introducing a regularization term:
  \ell(w) = \frac{1}{2N} \left( \sum_{i=1}^{N} \left[ t^{(i)} - y(x^{(i)}) \right]^2 + \lambda \sum_{j=1}^{M} w_j^2 \right)
Overfitting
• Let's look at the estimated weights for various M.
[Table of the estimated weight values for different M not reproduced here]
• Standard approach
• New loss function
  \ell(w) = \frac{1}{2N} \left( \sum_{i=1}^{N} \left[ t^{(i)} - y(x^{(i)}) \right]^2 + \lambda \sum_{j=1}^{M} w_j^2 \right)
• One way of dealing with this is to encourage the weights to be small. This is called regularization.
[Figure: the M = 9 fit with ln λ = −∞, i.e., no regularization]
• The penalty on the squared weights is known as ridge regression (in statistics). [Figure: the M = 9 fit with ln λ = 0] (A closed-form ridge fit is sketched below.)
• However, choose the value of λ carefully: we prefer a smooth function, but not one that is too smooth.
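A sketch of a closed-form ridge fit for the M = 9 polynomial example. The function name ridge_fit, the synthetic data, and the λ values are illustrative assumptions (λ = 0, e^{-18}, 1 roughly correspond to ln λ = −∞, −18, 0):

import numpy as np

def ridge_fit(X, t, lam):
    """Minimize (1/2N)(||t - Xw||^2 + lam * ||w_1..M||^2); the bias w0 is not penalized."""
    I = np.eye(X.shape[1])
    I[0, 0] = 0.0                            # do not regularize w0
    return np.linalg.solve(X.T @ X + lam * I, X.T @ t)

# Synthetic polynomial setup (illustrative).
rng = np.random.default_rng(0)
x = np.linspace(0, 1, 10)
t = np.sin(2 * np.pi * x) + 0.2 * rng.standard_normal(10)
X = np.vander(x, 10, increasing=True)        # M = 9

for lam in [0.0, np.exp(-18), 1.0]:
    w = ridge_fit(X, t, lam)
    print(lam, np.round(w, 2))               # large weights shrink as lambda grows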
Question?
• Why can w_0 be left out of the regularization term?
  \ell(w) = \frac{1}{2N} \left( \sum_{i=1}^{N} \left[ t^{(i)} - y(x^{(i)}) \right]^2 + \lambda \sum_{j=1}^{M} w_j^2 \right)
[Figures: the M = 9 fits for ln λ = −∞, ln λ = −18, and ln λ = 0]
Linear Models for Regression
• Linear regression (with regularization)
• Linear regression model
  y(x) = w_0 + \sum_{j=1}^{M} w_j x^j
• Parameters
  w
• Regularized loss function
  \ell(w) = \frac{1}{2N} \left( \sum_{i=1}^{N} \left[ t^{(i)} - y(x^{(i)}) \right]^2 + \lambda \sum_{j=1}^{M} w_j^2 \right)
• Goal: minimize ℓ(w)
• Gradient
  \nabla_w \ell(w) = \left( \frac{\partial \ell(w)}{\partial w_0}, \frac{\partial \ell(w)}{\partial w_1}, \ldots, \frac{\partial \ell(w)}{\partial w_M} \right)^T
  \frac{\partial \ell(w)}{\partial w_0} = -\frac{1}{N} \sum_{i=1}^{N} \left[ t^{(i)} - y(x^{(i)}) \right]
  \frac{\partial \ell(w)}{\partial w_j} = \frac{1}{N} \left( -\sum_{i=1}^{N} \left[ t^{(i)} - y(x^{(i)}) \right] \left( x^{(i)} \right)^j + \lambda w_j \right), \quad j = 1, \ldots, M
Steps:
• Initialize w (e.g., randomly)
• Repeatedly update w based on the gradient
  w = w − ε ∇_w ℓ(w)
  where ε is the learning rate.
Linear Models for Regression
• Note: substituting the gradient into the update rule gives, for j = 1, …, M:
  w_j = w_j - \epsilon \cdot \frac{1}{N} \left( -\sum_{i=1}^{N} \left[ t^{(i)} - y(x^{(i)}) \right] \left( x^{(i)} \right)^j + \lambda w_j \right)
      = w_j + \epsilon \frac{1}{N} \sum_{i=1}^{N} \left[ t^{(i)} - y(x^{(i)}) \right] \left( x^{(i)} \right)^j - \frac{\epsilon \lambda}{N} w_j
      = \left( 1 - \frac{\epsilon \lambda}{N} \right) w_j + \epsilon \frac{1}{N} \sum_{i=1}^{N} \left[ t^{(i)} - y(x^{(i)}) \right] \left( x^{(i)} \right)^j
• N is usually very large, so the factor (1 − ελ/N) is slightly less than 1; at every step it shrinks w_j toward zero, which is exactly the effect of regularization (see the sketch below).
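A sketch of this regularized update in code, showing where the (1 − ελ/N) weight-decay factor comes from. The helper ridge_gradient_step and all numbers below are hypothetical:

import numpy as np

def ridge_gradient_step(w, X, t, eps, lam):
    """One update of w for the regularized loss; w[0] (the bias) is not decayed."""
    N = len(t)
    err = t - X @ w
    grad = -X.T @ err / N              # data term of the gradient
    grad[1:] += lam * w[1:] / N        # + (lambda / N) * w_j for j >= 1
    # Equivalently: w_j <- (1 - eps*lam/N) * w_j + (eps/N) * sum_i err_i * x_j^(i)
    return w - eps * grad

# Usage (illustrative): M = 9 polynomial features on [0, 1].
X = np.vander(np.linspace(0, 1, 10), 10, increasing=True)
t = np.sin(2 * np.pi * np.linspace(0, 1, 10))
w = np.zeros(10)
for _ in range(5000):
    w = ridge_gradient_step(w, X, t, eps=0.1, lam=0.01)
print(np.round(w, 2))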
Key Concepts
• To find a good model
  • Loss function (measures the error, i.e., judges the fit)
  • Optimization (how to find a good fit)
  • Generalization (fit to unseen test data)
  • Regularization (avoid overfitting)
• What we really care about is the prediction error on new data
  • Prediction error
  • Validation
  • Bias and variance
  • Typical workflow to reduce error
  • Feature scaling
• These are key concepts in supervised learning, not just for linear regression.
Validation
• What should you NOT do in machine learning?
[Figure: (your) training dataset vs. (your) testing dataset]
• Do not use (your) testing dataset when fitting or tuning the model; it is reserved for the final evaluation (see the sketch below).
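A sketch of the honest protocol with scikit-learn's train_test_split: fit only on the training split and report error on the held-out test split. The synthetic data here is an assumption for illustration:

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Synthetic data (illustrative): two features plus a little noise.
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(200, 2))
t = 3.0 + 2.0 * X[:, 0] - 1.0 * X[:, 1] + 0.1 * rng.standard_normal(200)

X_train, X_test, t_train, t_test = train_test_split(X, t, test_size=0.25, random_state=0)

model = LinearRegression().fit(X_train, t_train)   # never touch the test set while fitting
print("train MSE:", mean_squared_error(t_train, model.predict(X_train)))
print("test  MSE:", mean_squared_error(t_test, model.predict(X_test)))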
Prediction Error
• Sources:
  • Imprecision in the data attributes (input noise, e.g., noise in per-capita crime)
  • Errors in the data targets (mis-labeling, e.g., noise in house prices)
  • Additional attributes that affect the target values but are not captured by the data attributes (latent variables). In the example, what else could affect house prices?
  • The model may be too simple to account for the data targets
Courtesy of Dr. Sanja Fidler
Bias and Variance
• Underfitting: your model cannot ever fit the training data.
• Overfitting: your model fits the training data, but has large error on testing data.
• Our goal: a model that avoids both.
• Example (courtesy of Dr. Andrew Ng):
  • Human-level error: 0.1 (the best model)
  • Training set error: 0.15
  • Validation set error: 0.8 ← the large gap indicates variance
• For large variance, what we can do:
  • Regularization
  • More data
  • …
Reference: http://scott.fortmann-roe.com/docs/BiasVariance.html
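The error comparison above can be turned into a rough rule-of-thumb diagnostic. The helper diagnose below is hypothetical and only mirrors the reasoning on this slide:

def diagnose(human_error, train_error, val_error):
    """Rough bias/variance diagnosis from the gaps between error levels (rule of thumb)."""
    bias_gap = train_error - human_error        # avoidable bias
    variance_gap = val_error - train_error      # variance
    if variance_gap > bias_gap:
        return "large variance: try regularization or more data"
    return "large bias: try a more expressive model"

print(diagnose(human_error=0.1, train_error=0.15, val_error=0.8))   # -> large variance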
More on Reducing Error
Courtesy of Dr. Andrew Ng
Reference:
• https://kevinzakka.github.io/2016/09/26/applying-deep-learning/
• https://www.youtube.com/watch?v=F1ka6a13S9I&t=4s
Preprocessing data
• What is the result? (of the numerical example shown on the slide)
• The result: 1.00000761449 (not zero)
References:
https://en.wikipedia.org/wiki/Feature_scaling
http://scikit-learn.org/stable/modules/preprocessing.html
Example: Boston Housing data
• Estimate median house price in a neighborhood based on neighborhood statistics
• Data: https://archive.ics.uci.edu/ml/datasets/Housing
(from sklearn.datasets import load_boston)
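A minimal sketch of loading this dataset and fitting a linear model. Note that load_boston has been removed from recent scikit-learn releases, so this assumes an older version (or an equivalent local copy of the data):

from sklearn.datasets import load_boston            # removed in scikit-learn >= 1.2
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

data = load_boston()
X, t = data.data, data.target                        # 13 neighborhood features, median price
X_train, X_test, t_train, t_test = train_test_split(X, t, random_state=0)

model = LinearRegression().fit(X_train, t_train)
print("R^2 on test data:", model.score(X_test, t_test))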
Preprocessing data
• Feature scaling
  • In practice, we center the data by removing the mean value of each feature, then scale it by dividing non-constant features by their standard deviation.
  • Rescaling (min-max): x' = (x − min(x)) / (max(x) − min(x))
  • Standardization (commonly used; see the sketch below): x' = (x − mean(x)) / std(x)
References:
https://en.wikipedia.org/wiki/Feature_scaling
http://scikit-learn.org/stable/modules/preprocessing.html
http://m.blog.csdn.net/article/details?id=50670674
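A sketch of the standardization step described above, done both by hand and with scikit-learn's StandardScaler (the small array is an illustrative assumption):

import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[1.0, 200.0],
              [2.0, 400.0],
              [3.0, 600.0]])

# By hand: remove the mean of each feature, divide by its standard deviation.
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

# Same thing with scikit-learn (fit on training data, then reuse for test data).
scaler = StandardScaler()
X_std_sklearn = scaler.fit_transform(X)
print(np.allclose(X_std, X_std_sklearn))   # True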
Readings
• Sections 1.1 and 1.3 in the book "Pattern Recognition and Machine Learning", by Christopher M. Bishop, Springer, 2006.
  http://users.isr.ist.utl.pt/~wurmd/Livros/school/Bishop%20-%20Pattern%20Recognition%20And%20Machine%20Learning%20-%20Springer%20%202006.pdf