11 - Basic Machine Learning - Linear Regression 1 (Học máy cơ bản - Hồi quy tuyến tính 1)

The document provides an overview of linear regression as a fundamental concept in machine learning, detailing its application in modeling relationships between variables. It discusses the structure of linear regression models, including single and multiple variables, loss functions, and optimization techniques. Additionally, it emphasizes the importance of using matrix representations for data in linear regression analysis.


Basic Machine Learning (Học máy cơ bản)

Linear Regression -
The Linear Regression Model

Using the linear regression model to illustrate
the key concepts of machine learning
Machine Learning

Learning = Representation + Evaluation + Optimization

Reference: Domingos, Pedro. "A few useful things to know about machine learning." Communications of the ACM 55.10 (2012): 78-87.
Linear Models for Regression
• Let's discuss those key questions in the following cases:
  • Linear regression with one variable (simple 1-D regression)
  • Linear regression with multiple variables
  • Linear regression using polynomial fitting
(Figure courtesy of Dr. Andrew Ng)

Linear Models for Regression
• Linear regression with two variables
  • Linear model
    y(x) = w_0 + w_1 x_1 + w_2 x_2
  • Parameters
    w_0, w_1, w_2
  • Loss function
    \ell(w) = \frac{1}{2N}\sum_{i=1}^{N}\big[t^{(i)} - y(x^{(i)})\big]^2 = \frac{1}{2N}\sum_{i=1}^{N}\big[t^{(i)} - (w_0 + w_1 x_1^{(i)} + w_2 x_2^{(i)})\big]^2
  • Goal: minimize ℓ(w)
  • Solution (discussed below)
[Figure: two input features (e.g., CRIM and ZN from the Boston housing data) are combined through the weights w_1, w_2, with bias w_0, to predict Price]
Linear Models for Regression
• Linear regression with two variables
  • Linear model: y(x) = w_0 + w_1 x_1 + w_2 x_2
  • Parameters: w_0, w_1, w_2
  • Loss function:
    \ell(w) = \frac{1}{2N}\sum_{i=1}^{N}\big[t^{(i)} - y(x^{(i)})\big]^2
  • Goal: minimize ℓ(w)

Steps:
• Initialize w (e.g., randomly)
• Repeatedly update w based on the gradient
  w = w - \epsilon \nabla_w \ell(w)
  where \epsilon is the learning rate.

Gradient:
  \nabla_w \ell(w) = \begin{bmatrix} \partial \ell(w)/\partial w_0 \\ \partial \ell(w)/\partial w_1 \\ \partial \ell(w)/\partial w_2 \end{bmatrix}, \qquad
  \frac{\partial \ell(w)}{\partial w_0} = -\frac{1}{N}\sum_{i=1}^{N}\big[t^{(i)} - y(x^{(i)})\big], \qquad
  \frac{\partial \ell(w)}{\partial w_1} = -\frac{1}{N}\sum_{i=1}^{N}\big[t^{(i)} - y(x^{(i)})\big]\, x_1^{(i)}, \qquad
  \frac{\partial \ell(w)}{\partial w_2} = -\frac{1}{N}\sum_{i=1}^{N}\big[t^{(i)} - y(x^{(i)})\big]\, x_2^{(i)}
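A minimal sketch of this procedure in Python/NumPy (my own illustration, not from the slides), assuming X is an N×2 array holding the two features and t is the length-N vector of targets:

```python
import numpy as np

def gradient_descent_2var(X, t, epsilon=0.01, n_iters=1000):
    """Fit y(x) = w0 + w1*x1 + w2*x2 by minimizing (1/2N) * sum of squared errors."""
    N = X.shape[0]
    Xb = np.hstack([np.ones((N, 1)), X])        # prepend a column of 1s so w[0] acts as w0
    w = np.random.randn(Xb.shape[1]) * 0.01     # initialize w (e.g., randomly)
    for _ in range(n_iters):
        residual = t - Xb @ w                   # t^(i) - y(x^(i)) for every example
        grad = -(1.0 / N) * (Xb.T @ residual)   # partial derivatives w.r.t. w0, w1, w2
        w = w - epsilon * grad                  # w = w - epsilon * grad_w l(w)
    return w
```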
In Math: Use Matrix Here!
• Linear regression with two variables
  • Linear model: y(x) = w_0 + w_1 x_1 + w_2 x_2
  • Parameters: w_0, w_1, w_2
  • Loss function:
    \ell(w) = \frac{1}{2N}\sum_{i=1}^{N}\big[t^{(i)} - y(x^{(i)})\big]^2 = \frac{1}{2N}\sum_{i=1}^{N}\big[t^{(i)} - (w_0 + w_1 x_1^{(i)} + w_2 x_2^{(i)})\big]^2
  • Goal: minimize ℓ(w)
• The per-example sums above become much cleaner once we express the data as matrices.
In Math (Linear Algebra)
• How do we use matrices to represent the data?
• We want to learn a mapping from inputs x to outputs t, given a set of input-output pairs
  \mathcal{D} = \{(x^{(1)}, t^{(1)}), \dots, (x^{(i)}, t^{(i)}), \dots, (x^{(N)}, t^{(N)})\}
• Stack the examples: the feature vectors become the rows of X, and the labels become the entries of t:
  X = \begin{bmatrix} (x^{(1)})^T \\ (x^{(2)})^T \\ \vdots \\ (x^{(N)})^T \end{bmatrix} (features), \qquad
  t = \begin{bmatrix} t^{(1)} \\ t^{(2)} \\ \vdots \\ t^{(N)} \end{bmatrix} (labels)
In Math (Linear Algebra)
• For linear regression with two variables:
  y(x) = w_0 + w_1 x_1 + w_2 x_2, \qquad
  w = \begin{bmatrix} w_0 \\ w_1 \\ w_2 \end{bmatrix}, \qquad
  X = \begin{bmatrix} (x^{(1)})^T \\ (x^{(2)})^T \\ \vdots \\ (x^{(N)})^T \end{bmatrix}, \qquad
  t = \begin{bmatrix} t^{(1)} \\ t^{(2)} \\ \vdots \\ t^{(N)} \end{bmatrix},
  \qquad \text{where } x^{(i)} = \begin{bmatrix} x_1^{(i)} \\ x_2^{(i)} \end{bmatrix}
• For convenience, we can add a column of 1's to X to incorporate w_0:
  X = \begin{bmatrix} 1 & (x^{(1)})^T \\ 1 & (x^{(2)})^T \\ \vdots & \vdots \\ 1 & (x^{(N)})^T \end{bmatrix},
  \qquad x^{(i)} = \begin{bmatrix} x_0^{(i)} \\ x_1^{(i)} \\ x_2^{(i)} \end{bmatrix} = \begin{bmatrix} 1 \\ x_1^{(i)} \\ x_2^{(i)} \end{bmatrix}
In Math (Linear Algebra)
• For linear regression with two variables:
  \ell(w) = \frac{1}{2N}\sum_{i=1}^{N}\big[t^{(i)} - (w_0 + w_1 x_1^{(i)} + w_2 x_2^{(i)})\big]^2
          = \frac{1}{2N}\,(t - Xw)^T (t - Xw)
  Gradient: \nabla_w \ell(w) = -\frac{1}{N}\, X^T (t - Xw)
  with y(x) = w_0 + w_1 x_1 + w_2 x_2 and w = \begin{bmatrix} w_0 \\ w_1 \\ w_2 \end{bmatrix}.
Linear Models for Regression
• Linear regression with multiple variables
  • Linear model
    y(x) = w_0 + w_1 x_1 + w_2 x_2 + \dots + w_k x_k
  • Parameters
    w_0, w_1, w_2, \dots, w_k
  • Loss function
    \ell(w) = \frac{1}{2N}\sum_{i=1}^{N}\big[t^{(i)} - y(x^{(i)})\big]^2 = \frac{1}{2N}\sum_{i=1}^{N}\big[t^{(i)} - (w_0 + w_1 x_1^{(i)} + w_2 x_2^{(i)} + \dots + w_k x_k^{(i)})\big]^2
  • Goal: minimize ℓ(w)
Steps:
• Initialize w (e.g., randomly)
• Repeatedly update w based on the gradient
  w = w - \epsilon \nabla_w \ell(w)
  where \epsilon is the learning rate.
Linear Models for Regression
• Linear regression with multiple variables
  Gradient:
  \nabla_w \ell(w) = \begin{bmatrix} \partial \ell(w)/\partial w_0 \\ \partial \ell(w)/\partial w_1 \\ \vdots \\ \partial \ell(w)/\partial w_k \end{bmatrix}, \qquad
  \frac{\partial \ell(w)}{\partial w_0} = -\frac{1}{N}\sum_{i=1}^{N}\big[t^{(i)} - y(x^{(i)})\big], \qquad
  \frac{\partial \ell(w)}{\partial w_j} = -\frac{1}{N}\sum_{i=1}^{N}\big[t^{(i)} - y(x^{(i)})\big]\, x_j^{(i)} \quad (j = 1, \dots, k)
In Math (Linear Algebra)
• Linear model
  y(x) = w^T x, \qquad
  w = \begin{bmatrix} w_0 \\ w_1 \\ w_2 \\ \vdots \\ w_k \end{bmatrix}, \qquad
  x = \begin{bmatrix} 1 \\ x_1 \\ x_2 \\ \vdots \\ x_k \end{bmatrix}
• Parameters: w_0, w_1, \dots, w_k
• Loss function
  \ell(w) = \frac{1}{2N}\sum_{i=1}^{N}\big[t^{(i)} - y(x^{(i)})\big]^2 = \frac{1}{2N}\,(t - Xw)^T (t - Xw)
  where X = \begin{bmatrix} (x^{(1)})^T \\ (x^{(2)})^T \\ (x^{(3)})^T \\ \vdots \\ (x^{(N)})^T \end{bmatrix}, \qquad
  t = \begin{bmatrix} t^{(1)} \\ t^{(2)} \\ t^{(3)} \\ \vdots \\ t^{(N)} \end{bmatrix}
• Goal: minimize ℓ(w)
Steps:
• Initialize w (e.g., randomly)
• Repeatedly update w based on the gradient
  w = w - \epsilon \nabla_w \ell(w)
  where \epsilon is the learning rate.
In Math (Linear Algebra)
• \ell(w) = \frac{1}{2N}\sum_{i=1}^{N}\big[t^{(i)} - y(x^{(i)})\big]^2 = \frac{1}{2N}\,(t - Xw)^T (t - Xw)
• Goal: minimize ℓ(w)
  \nabla_w \ell(w) = -\frac{1}{N}\, X^T (t - Xw)
Steps:
• Initialize w (e.g., randomly)
• Repeatedly update w based on the gradient
  w = w - \epsilon \nabla_w \ell(w)
  where \epsilon is the learning rate.
In Math (Linear Algebra)
Solution 1: Gradient Descent
Steps:
• Initialize w (e.g., randomly)
• Repeatedly update w based on the gradient
  w = w + \epsilon \cdot \frac{1}{N}\, X^T (t - Xw)
  where \epsilon is the learning rate.

Solution 2: Analytical Solution
• Find the w that makes \nabla_w \ell(w) zero:
  \nabla_w \ell(w) = -\frac{1}{N}\, X^T (t - Xw) = 0
  \;\Rightarrow\; w = (X^T X)^{-1} X^T t
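Both solutions are easy to express with NumPy. The following sketch is my own illustration (the function names and synthetic data are assumptions, not from the slides); X is assumed to already contain the leading column of 1's:

```python
import numpy as np

def fit_gradient_descent(X, t, epsilon=0.1, n_iters=5000):
    """Solution 1: gradient descent, w = w + epsilon * (1/N) X^T (t - Xw)."""
    N, d = X.shape
    w = np.zeros(d)
    for _ in range(n_iters):
        w = w + epsilon * (X.T @ (t - X @ w)) / N
    return w

def fit_normal_equation(X, t):
    """Solution 2: analytical solution, w = (X^T X)^{-1} X^T t."""
    # Solving the linear system is preferred over forming the explicit inverse.
    return np.linalg.solve(X.T @ X, X.T @ t)

# Tiny usage example with synthetic data:
rng = np.random.default_rng(0)
X = np.hstack([np.ones((100, 1)), rng.normal(size=(100, 2))])
t = X @ np.array([1.0, 2.0, -3.0]) + 0.1 * rng.normal(size=100)
print(fit_gradient_descent(X, t))   # both results should be close to [1, 2, -3]
print(fit_normal_equation(X, t))
```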
Linear Models for Regression
• Let's discuss those key questions in the following cases:
  • Linear regression with one variable (simple 1-D regression)
  • Linear regression with multiple variables
  • Generalization in supervised learning (one key concept)
  • Linear regression using polynomial fitting
(Figure courtesy of Dr. Andrew Ng)
Generalization
• Generalization: the model's ability to predict new data.
[Figure: training data and a fitted line in the (x, y) plane]
(Courtesy of Dr. Andrew Ng)
Generalization
• Generalization: the model's ability to predict new data.
• In fact, what we really care about is the error on new data (on a testing dataset).
[Figure: training data and testing data in the (x, y) plane, together with the fitted line]
(Courtesy of Dr. Andrew Ng)

Datasets in Machine Learning
• Training dataset
• Validation dataset (important! we will discuss it soon)
• Testing dataset
Hyperplane
• Linear regression models we learned:
  • One variable:    y(x) = w_0 + w_1 x_1
  • Two variables:   y(x) = w_0 + w_1 x_1 + w_2 x_2
  • More variables:  y(x) = w_0 + w_1 x_1 + ... + w_N x_N
[Figures: a fitted line over x_1, a fitted plane over (x_1, x_2), and a question mark for higher dimensions]
• In geometry, a hyperplane is a subspace of one dimension less than its ambient space.
Generalization
• What if our linear model is not good for the data (shown in the figure)?
• We can use a more complicated model (a polynomial).
[Figure: data in the (x_1, y) plane that a straight line fits poorly]
(Courtesy of Dr. Sanja Fidler)
Fitting a Polynomial
• Example: an M-th order polynomial function of a one-dimensional feature x:
  y(x, w) = w_0 + \sum_{j=1}^{M} w_j x^j
  where x^j is the j-th power of x.
[Figure: a polynomial curve fitted to the training data]
(Courtesy of Dr. Sanja Fidler)
• Note: we can optimize the weights w using the same approach as for the previous linear models, because y(x, w) is still a linear function of w.
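Because y(x, w) is linear in w, the same machinery applies once the single feature x is expanded into the powers 1, x, ..., x^M. A small sketch (my own, with assumed function names) that builds this polynomial design matrix and solves the least-squares problem:

```python
import numpy as np

def polynomial_design_matrix(x, M):
    """Columns are x^0 (=1), x^1, ..., x^M for a 1-D feature vector x."""
    return np.vander(x, N=M + 1, increasing=True)

def fit_polynomial(x, t, M):
    """Least-squares fit of y(x, w) = w0 + sum_j w_j x^j."""
    X = polynomial_design_matrix(x, M)
    w, *_ = np.linalg.lstsq(X, t, rcond=None)   # more stable than inverting X^T X
    return w

# Example: fit a cubic to noisy data
x = np.linspace(0, 1, 20)
t = np.sin(2 * np.pi * x) + 0.1 * np.random.randn(20)
w = fit_polynomial(x, t, M=3)
```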
Fitting a Polynomial
• For example, with the size of a house in square feet as the single feature x:
  y(x, w) = w_0 + \sum_{j=1}^{3} w_j x^j
[Figure: the feature x (size in square feet) is expanded into 1, x, x², x³, which are combined with weights w_0, w_1, w_2, w_3 to predict Price]
• Find the weight values in the vector w that reduce the prediction error.
Linear Models for Regression
• Linear regression
  • Continuous outputs
  • Simplest model (a linear combination of features)
• Key concepts in supervised learning (very, very ... very important!!!)
  • Loss function (measure error, or judge the fit)
  • Optimization (how to find a good fit)
  • Generalization (fit to unseen test data)
  • Regularization (avoid overfitting)
Overfitting
• Let's use the polynomial model to explain overfitting:
  y(x, w) = w_0 + \sum_{j=1}^{M} w_j x^j
• Note: the data is generated from sin(2πx) with small noise.
[Figures: polynomial fits of the training data for several orders M (omitted)]
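A small, self-contained experiment in the spirit of these slides (my own sketch, not the slides' exact data or code): sample a few noisy points from sin(2πx), fit polynomials of increasing order M, and compare the training error with the error on held-out data:

```python
import numpy as np

rng = np.random.default_rng(1)

def make_data(n):
    x = rng.uniform(0, 1, n)
    return x, np.sin(2 * np.pi * x) + 0.1 * rng.normal(size=n)  # sin(2*pi*x) plus small noise

def poly_fit_and_error(x_tr, t_tr, x_te, t_te, M):
    X_tr = np.vander(x_tr, M + 1, increasing=True)
    X_te = np.vander(x_te, M + 1, increasing=True)
    w, *_ = np.linalg.lstsq(X_tr, t_tr, rcond=None)
    rmse = lambda X, t: np.sqrt(np.mean((t - X @ w) ** 2))
    return rmse(X_tr, t_tr), rmse(X_te, t_te)

x_tr, t_tr = make_data(10)    # a small training set
x_te, t_te = make_data(100)   # held-out data drawn from the same curve
for M in (1, 3, 9):
    tr, te = poly_fit_and_error(x_tr, t_tr, x_te, t_te, M)
    print(f"M={M}: train RMSE={tr:.3f}, test RMSE={te:.3f}")  # high M fits training data almost perfectly but tests poorly
```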
Overfitting
• Observations
  • A more complex model yields lower error on the training data. (If we truly find the best function, the error on the training data may go to zero.)
  • A more complex model may perform very badly on the testing data. (Our model with M = 9 overfits the data.)
(Courtesy of Dr. Hung-yi Lee)
Question?
• Consider y(x, w) = w_0 + \sum_{j=1}^{M} w_j x^j
• Why can a more complex model yield lower error on the training data?
• Why can a more complex model perform very badly on the testing data (overfitting)?
(Courtesy of Dr. Hung-yi Lee)
Overfitting
• Let's look at the estimated weights for various M.
[Table: estimated weight values for various M (omitted); the magnitudes grow rapidly as M increases]
  y(x, w) + large error ⇐ w_0 + \sum_{j=1}^{M} w_j x^j + small noise
• The weights become huge to compensate for the noise: a small noise on x causes a large fluctuation in the prediction y.
Overfitting
• Possible solutions
  • One workaround: use more data.
  • Second workaround: use regularization (very important!)
• Regularization: redesign the loss function by introducing a regularization term
  \ell(w) = \frac{1}{2N}\sum_{i=1}^{N}\big[t^{(i)} - y(x^{(i)})\big]^2 + \lambda \sum_{i=1}^{M} w_i^2
Overfitting
• One way of dealing with this is to encourage the weights to be small. This is called regularization.
• Standard approach: a new loss function
  \ell(w) = \frac{1}{2N}\sum_{i=1}^{N}\big[t^{(i)} - y(x^{(i)})\big]^2 + \lambda \sum_{i=1}^{M} w_i^2
• The penalty on the squared weights is known as ridge regression (in statistics).
• When the w_i's are small, the prediction y is not sensitive to small changes in x: smooth functions are preferred.
[Figures: M = 9 polynomial fits with ln λ = −∞, ln λ = −18, and ln λ = 0]
• However, choose the value of λ carefully (we prefer a smooth function, but not one that is too smooth).
Question?
• Why can w_0 be left out of the regularization term?
  \ell(w) = \frac{1}{2N}\sum_{i=1}^{N}\big[t^{(i)} - y(x^{(i)})\big]^2 + \lambda \sum_{i=1}^{M} w_i^2, \qquad
  y(x, w) = w_0 + \sum_{i=1}^{M} w_i x^i
Linear Models for Regression
• Linear regression with regularization
  • Linear regression model: y(x) = w_0 + \sum_{j=1}^{M} w_j x^j
  • Parameters: w
  • Regularized loss function:
    \ell(w) = \frac{1}{2N}\sum_{i=1}^{N}\big[t^{(i)} - y(x^{(i)})\big]^2 + \lambda \sum_{i=1}^{M} w_i^2
  • Goal: minimize ℓ(w)
  Gradient:
  \nabla_w \ell(w) = \begin{bmatrix} \partial \ell(w)/\partial w_0 \\ \partial \ell(w)/\partial w_1 \\ \vdots \\ \partial \ell(w)/\partial w_M \end{bmatrix}, \qquad
  \frac{\partial \ell(w)}{\partial w_0} = -\frac{1}{N}\sum_{i=1}^{N}\big[t^{(i)} - y(x^{(i)})\big], \qquad
  \frac{\partial \ell(w)}{\partial w_j} = \frac{1}{N}\Big[-\sum_{i=1}^{N}\big(t^{(i)} - y(x^{(i)})\big)\,(x^{(i)})^j + \lambda w_j\Big] \quad (j = 1, \dots, M)
Steps:
• Initialize w (e.g., randomly)
• Repeatedly update w based on the gradient
  w = w - \epsilon \nabla_w \ell(w)
  where \epsilon is the learning rate.
Linear Models for Regression
• Note: with the regularized loss, the gradient-descent update for each weight w_j (j ≥ 1) becomes
  w_j = w_j - \epsilon\,\frac{1}{N}\Big[-\sum_{i=1}^{N}\big(t^{(i)} - y(x^{(i)})\big)(x^{(i)})^j + \lambda w_j\Big]
      = w_j + \epsilon\,\frac{1}{N}\sum_{i=1}^{N}\big(t^{(i)} - y(x^{(i)})\big)(x^{(i)})^j - \frac{\epsilon\lambda}{N}\,w_j
      = \Big(1 - \frac{\epsilon\lambda}{N}\Big) w_j + \epsilon\,\frac{1}{N}\sum_{i=1}^{N}\big(t^{(i)} - y(x^{(i)})\big)(x^{(i)})^j
• N is usually very large, so the factor (1 - \epsilon\lambda / N) gently decreases w_j at every update.
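A minimal sketch of one such regularized update (my own illustration; it follows the slides' convention that the λw_j term is also scaled by 1/N and that the bias w_0 is not regularized, and it assumes X has columns 1, x, ..., x^M):

```python
import numpy as np

def ridge_gradient_step(w, X, t, epsilon, lam):
    """One gradient-descent step for the regularized loss.

    X has columns 1, x, x^2, ..., x^M; w[0] (the bias) is not regularized,
    following the slides' convention.
    """
    N = X.shape[0]
    residual = t - X @ w
    grad = -(X.T @ residual) / N       # data term for every weight
    grad[1:] += lam * w[1:] / N        # regularization term lambda*w_j / N, skipping w0
    return w - epsilon * grad          # equivalent to w_j <- (1 - eps*lam/N) w_j + eps*(1/N)*sum(...)
```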
Key Concepts
(Key concepts in supervised learning, not just for linear regression.)
• To find a good model
  • Loss function (measure error, or judge the fit)
  • Optimization (how to find a good fit)
  • Generalization (fit to unseen test data)
  • Regularization (avoid overfitting)
• Prediction error (what we really care about is the prediction error on new data)
• Validation
• Bias and variance
• Typical workflow to reduce error
• Feature scaling
Validation
• What you should NOT do in machine learning:
  (Your) Training dataset → (Your) Testing dataset
  Model 1: error 0.8
    refine ↓
  Model 2: error 0.6
    refine ↓
  Model 3: error 0.3
  On real-world data: error > 0.3
• Testing data is usually used to judge the goodness of a fully trained model. But here, it is used to refine (tune) the model.
• The testing data is then not independent of the training process and may not reflect the prediction error on real-world data.
Validation
• What you can do in machine learning: split your data into a Training dataset, a Validation dataset, and a Testing dataset.
  Model 1: validation error 0.8
    refine ↓
  Model 2: validation error 0.6
    refine ↓
  Model 3: validation error 0.3 → testing error 0.34
• The error on the held-out testing dataset (0.34 here) is your announced accuracy; it better reflects the error on real-world data.
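A simple sketch of this workflow (my own illustration with assumed function names): split the data once, tune models on the validation set only, and report the test error a single time at the end:

```python
import numpy as np

def train_val_test_split(X, t, val_frac=0.2, test_frac=0.2, seed=0):
    """Randomly split the data into training, validation, and testing sets."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(t))
    n_test = int(len(t) * test_frac)
    n_val = int(len(t) * val_frac)
    test, val, train = idx[:n_test], idx[n_test:n_test + n_val], idx[n_test + n_val:]
    return (X[train], t[train]), (X[val], t[val]), (X[test], t[test])

# Workflow: refine models using the validation error only,
# then report the error of the chosen model on the test set once.
```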
Validation
• In many applications, the supply of data for training and testing may be limited. We can use Cross-Validation (CV).
• Split the training dataset into folds; each fold is held out once for validation while the remaining folds are used for training:

  Held-out fold   Model 1   Model 2   Model 3
  Fold 1          0.8       0.7       0.8
  Fold 2          0.5       0.6       0.4
  Fold 3          0.7       0.3       0.7
  Fold 4          0.3       0.9       0.5
  Fold 5          0.9       0.4       0.9
  Average         0.64      0.58      0.66

• Pick the model with the lowest average validation error (Model 2 here).
Validation
• Cross-Validation
  • We split the training data into K folds; then, for each fold k ∈ {1, ..., K}, we train on all the folds but the k-th and test on the k-th, in a round-robin fashion.
  • We then compute the error averaged over all the folds and use this as a proxy for the test error. (Note that each point gets predicted only once, although it is used for training K−1 times.)
  • It is common to use K = 5; this is called 5-fold CV.
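A compact sketch of K-fold cross-validation (my own illustration; the fit and predict callables are assumptions standing in for whatever model-training routine is being compared):

```python
import numpy as np

def k_fold_cv_error(fit, predict, X, t, K=5):
    """Average validation error of a model over K folds (round-robin)."""
    folds = np.array_split(np.random.permutation(len(t)), K)
    errors = []
    for k in range(K):
        val = folds[k]
        train = np.concatenate([folds[j] for j in range(K) if j != k])
        w = fit(X[train], t[train])                                   # train on all folds but the k-th
        errors.append(np.mean((t[val] - predict(X[val], w)) ** 2))    # test on the k-th fold
    return np.mean(errors)   # proxy for the test error; pick the model with the lowest value
```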


Prediction Error
• Sources:
  • Imprecision in data attributes (input noise, e.g., noise in per-capita crime)
  • Errors in data targets (mis-labeling, e.g., noise in house prices)
  • Additional attributes, not captured by the recorded data attributes, that affect the target values (latent variables). In the example, what else could affect house prices?
  • The model may be too simple to account for the data targets
  • The data size is not sufficient
(Courtesy of Dr. Sanja Fidler)
Bias and Variance
• Bias measures how far off, in general, these models' predictions are from the correct value.
• Variance is how much the predictions for a given point vary between different realizations of the model.

Reference: http://scott.fortmann-roe.com/docs/BiasVariance.html
Bias and Variance

Case 1: high bias (underfitting)
  Human-level error: 0.1 (the best model)
  Training set error: 0.7   (the gap to human-level error is the bias)
  Validation set error: 0.8
• Underfitting: your model cannot even fit the training data.
• For large bias, redesign your model:
  • Add more features as input
  • Create a more complex model
  • ...
  (Note: adding more data is not useful here!)

Case 2: high variance (overfitting)
  Human-level error: 0.1 (the best model)
  Training set error: 0.15
  Validation set error: 0.8   (the gap to the training error is the variance)
• Overfitting: your model fits the training data, but has large error on testing data.
• For large variance, what we can do:
  • Regularization
  • More data
  • ...

Our goal: low bias and low variance.
(Courtesy of Dr. Andrew Ng)
Reference: http://scott.fortmann-roe.com/docs/BiasVariance.html
More on Reducing Error
[Figure: a typical workflow for reducing error (omitted)]
(Courtesy of Dr. Andrew Ng)
References:
• https://kevinzakka.github.io/2016/09/26/applying-deep-learning/
• https://www.youtube.com/watch?v=F1ka6a13S9I&t=4s
Preprocessing data
• What is the result?
[Figure: a small numerical example (omitted). The result: 1.00000761449, not zero.]
References:
• https://en.wikipedia.org/wiki/Feature_scaling
• http://scikit-learn.org/stable/modules/preprocessing.html
Example: Boston Housing data
• Estimate median house price in a neighborhood based on neighborhood statistics

• Data: https://archive.ics.uci.edu/ml/datasets/Housing
(from sklearn.datasets import load_boston)
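A loading sketch matching the slide's note; load_boston was removed in scikit-learn 1.2, so this assumes an older scikit-learn version:

```python
# The slides use the (older) scikit-learn loader for the Boston housing data.
# Note: load_boston was removed in scikit-learn 1.2, so this requires an older version.
from sklearn.datasets import load_boston

boston = load_boston()
X = boston.data           # 13 neighborhood features, e.g., per-capita crime rate (CRIM), ZN, ...
t = boston.target         # median house price of each neighborhood
print(X.shape, t.shape)   # (506, 13) (506,)
```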
Preprocessing data
• Feature scaling
  • In practice, we transform the data to center it by removing the mean value of each feature, then scale it by dividing non-constant features by their standard deviation.
  • Rescaling (min-max normalization): x' = (x − min(x)) / (max(x) − min(x))
  • Standardization (commonly used): x' = (x − mean(x)) / std(x)
References:
• https://en.wikipedia.org/wiki/Feature_scaling
• http://scikit-learn.org/stable/modules/preprocessing.html
• http://m.blog.csdn.net/article/details?id=50670674
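A short sketch of standardization (my own illustration), first by hand and then with scikit-learn's StandardScaler; in both cases the statistics are computed on the training data only:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

def standardize(X_train, X_test):
    """Center each feature by its training mean and divide by its training std."""
    mean = X_train.mean(axis=0)
    std = X_train.std(axis=0)
    std[std == 0] = 1.0                      # leave constant features unscaled
    return (X_train - mean) / std, (X_test - mean) / std

def standardize_sklearn(X_train, X_test):
    """Same idea with scikit-learn; the scaler is fit on the training data only."""
    scaler = StandardScaler().fit(X_train)
    return scaler.transform(X_train), scaler.transform(X_test)
```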
Readings
• Sections 1.1 and 1.3 in the book "Pattern Recognition and Machine Learning", by Christopher M. Bishop, Springer, 2006.
  http://users.isr.ist.utl.pt/~wurmd/Livros/school/Bishop%20-%20Pattern%20Recognition%20And%20Machine%20Learning%20-%20Springer%20%202006.pdf
• Understanding the Bias-Variance Tradeoff
  http://scott.fortmann-roe.com/docs/BiasVariance.html
• Video: Nuts and Bolts of Applying Deep Learning (Andrew Ng)
  https://www.youtube.com/watch?v=F1ka6a13S9I&t=4s