
ITCS 6156/8156 Fall 2023

Machine Learning

Linear Regression

Instructor: Hongfei Xue


Email: [email protected]
Class Meeting: Mon & Wed, 4:00 PM – 5:15 PM, CHHS 376

Some content in the slides is based on Dr. Razvan’s lecture


Machine Learning as Optimization
Convexity
Convex Optimization
Gradient Descent

Gradient Descent

[Figure, repeated across several slides: successive gradient descent steps on the error surface 𝐽(𝑤₁, 𝑤₂), plotted over the parameter axes 𝑤₁ and 𝑤₂]
Gradient Descent


Taylor Expansion
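A hedged reconstruction of the standard first-order argument behind gradient descent (the step Δ𝐰 and learning rate 𝜂 are notation introduced here):

$$J(\mathbf{w} + \Delta\mathbf{w}) \approx J(\mathbf{w}) + \nabla J(\mathbf{w})^T \Delta\mathbf{w}$$

Choosing $\Delta\mathbf{w} = -\eta\,\nabla J(\mathbf{w})$ for a small learning rate $\eta > 0$ makes the right-hand side smaller than $J(\mathbf{w})$, which motivates the update $\mathbf{w} \leftarrow \mathbf{w} - \eta\,\nabla J(\mathbf{w})$.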
Gradient Descent


Gradient Descent

• The key operation in the above update step is the calculation of each partial derivative.


Gradient Descent

• The final weight update rule:
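As a hedged sketch: assuming the sum-of-squares cost $J(\mathbf{w}) = \frac{1}{2N}\sum_{n=1}^{N}(h_{\mathbf{w}}(\mathbf{x}_n) - y_n)^2$ with $h_{\mathbf{w}}(\mathbf{x}) = \mathbf{w}^T\mathbf{x}$ (as defined later in these slides), the partial derivatives are $\partial J/\partial w_j = \frac{1}{N}\sum_{n=1}^{N}(h_{\mathbf{w}}(\mathbf{x}_n) - y_n)\,x_{nj}$, giving the batch update $\mathbf{w} \leftarrow \mathbf{w} - \frac{\eta}{N}\mathbf{X}^T(\mathbf{X}\mathbf{w} - \mathbf{y})$. The Python below is an illustrative implementation, not the instructor's code; eta and n_iters are arbitrary defaults.

import numpy as np

def batch_gradient_descent(X, y, eta=0.1, n_iters=1000):
    """Minimize J(w) = 1/(2N) * ||Xw - y||^2 by full-batch gradient descent."""
    N, D = X.shape
    w = np.zeros(D)
    for _ in range(n_iters):
        residuals = X @ w - y        # h_w(x_n) - y_n for every example
        grad = X.T @ residuals / N   # vector of partial derivatives dJ/dw_j
        w -= eta * grad              # step against the gradient
    return w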


Issues with Gradient Descent

• Issues with Gradient Descent:
• Slow convergence
• Stuck in local minima

• Note that the second issue does not arise in the case of a convex problem, since the error surface has only one global minimum.

• More efficient algorithms exist for batch optimization, including Conjugate Gradient Descent and other quasi-Newton methods. Another approach is to consider training examples in an online or incremental fashion, resulting in an online algorithm called Stochastic Gradient Descent.
Stochastic Gradient Descent (SGD)
• Update weights after every (or a small subset of) training
example(s).

• Why SGD?
Stochastic Gradient Descent (SGD)

[Algorithm: the batch gradient descent update, computed from only 1 or K (a small number of) training examples at a time; sketched in code below]
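A minimal sketch of the stochastic / mini-batch variant under the same assumptions (illustrative only; the batch size K, learning rate, and number of epochs are arbitrary choices):

import numpy as np

def sgd(X, y, eta=0.01, K=1, n_epochs=50, seed=0):
    """Stochastic / mini-batch gradient descent for J(w) = 1/(2N) * ||Xw - y||^2."""
    rng = np.random.default_rng(seed)
    N, D = X.shape
    w = np.zeros(D)
    for _ in range(n_epochs):
        order = rng.permutation(N)            # reshuffle the examples each epoch
        for start in range(0, N, K):
            idx = order[start:start + K]      # use only 1 or K examples per update
            grad = X[idx].T @ (X[idx] @ w - y[idx]) / len(idx)
            w -= eta * grad
    return w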
Polynomial Basis Functions

• Q: What if the raw feature is insufficient for good performance?
• Example: non-linear dependency between the label and the raw feature.

• A: Engineer / learn higher-level features, as functions of the raw feature.

• Polynomial curve fitting:
- Add new features, as polynomials of the original feature (see the sketch below).
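As an illustration of adding polynomial features (a sketch; polynomial_features is a helper name introduced here, not from the slides):

import numpy as np

def polynomial_features(x, M):
    """Map a 1-D array of raw inputs x to the design matrix [1, x, x^2, ..., x^M]."""
    x = np.asarray(x, dtype=float)
    return np.vander(x, N=M + 1, increasing=True)   # column j holds x**j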
Regression: Curve Fitting

[Figure: the target function 𝑓]
Regression: Curve Fitting

[Figure: noisy training points, the target function 𝑓, and the learned curve ℎ plotted against 𝑦]

• Training: build a function ℎ(𝑥) based on (noisy) training examples $(x_1, y_1), (x_2, y_2), \cdots, (x_N, y_N)$.
Regression: Curve Fitting

[Figure: noisy training points, the target function 𝑓, and the learned curve ℎ plotted against 𝑦]

• Testing: for an arbitrary (unseen) instance 𝑥 ∈ 𝐗, compute the output ℎ(𝑥); we want it to be close to 𝑓(𝑥).
Regression: Polynomial Curve Fitting

$$h(x) = h(x, \mathbf{w}) = w_0 + w_1 x + w_2 x^2 + \cdots + w_M x^M = \sum_{j=0}^{M} w_j x^j$$

(the coefficients $w_j$ are the parameters; the powers $x^j$ are the features)
Polynomial Curve Fitting
• Parametric model:
$$h(x) = h(x, \mathbf{w}) = w_0 + w_1 x + w_2 x^2 + \cdots + w_M x^M = \sum_{j=0}^{M} w_j x^j$$

• Polynomial curve fitting is (Multiple) Linear Regression:


$$\mathbf{x} = [1, x, x^2, \cdots, x^M]^T, \qquad h(x) = h(\mathbf{x}, \mathbf{w}) = h_{\mathbf{w}}(\mathbf{x}) = \mathbf{w}^T \mathbf{x}$$

• Learning = minimize the Sum-of-Squares error function:

$$\hat{\mathbf{w}} = \underset{\mathbf{w}}{\operatorname{argmin}}\; J(\mathbf{w}), \qquad J(\mathbf{w}) = \frac{1}{2N} \sum_{n=1}^{N} \left( h_{\mathbf{w}}(\mathbf{x}_n) - y_n \right)^2$$

• Least Square Estimate:

$$\hat{\mathbf{w}} = (\mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T \mathbf{y}$$
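A small numpy sketch of the least-squares estimate above (illustrative, not the course code); np.linalg.lstsq is used instead of forming (𝐗ᵀ𝐗)⁻¹ explicitly, which is equivalent here but numerically safer:

import numpy as np

def fit_least_squares(X, y):
    """Return w_hat minimizing ||Xw - y||^2, i.e. (X^T X)^{-1} X^T y when X^T X is invertible."""
    w_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    return w_hat

# Example: fit an M = 3 polynomial to noisy samples of a sine curve
rng = np.random.default_rng(0)
x = np.linspace(0, 1, 10)
y = np.sin(2 * np.pi * x) + 0.1 * rng.normal(size=x.shape)
X = np.vander(x, N=4, increasing=True)   # columns [1, x, x^2, x^3]
w_hat = fit_least_squares(X, y)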
Polynomial Curve Fitting

• Generalization = how well the parameterized ℎ(𝑥, 𝐰) performs on arbitrary (unseen) test instances 𝑥 ∈ 𝑋.

• Generalization performance depends on the value of M


0th Order Polynomial
1st Order Polynomial
3rd Order Polynomial
9th Order Polynomial

• Which M to pick? Why?


• Follow the wisdom of a philosopher.
Occam’s Razor

William of Occam (1288 – 1348)


English Franciscan friar, theologian and
philosopher.

“Entia non sunt multiplicanda praeter necessitatem”


• Entities must not be multiplied beyond necessity.

i.e. Do not make things needlessly complicated.


i.e. Prefer the simplest hypothesis that fits the data.
Polynomial Curve Fitting

• Model Selection: choosing the order M of the polynomial.


- Best generalization obtained with M=3.
- M = 9 obtains poor generalization, even though it fits
training examples perfectly:
• But M = 9 polynomials subsume M = 3 polynomials!

• Overfitting ≡ good performance on training examples, poor


performance on test examples.
Over-fitting and Parameter Values
Overfitting
• Measure fit using the Root-Mean-Square (RMS) error (RMSE):
$$E_{\mathrm{RMS}}(\mathbf{w}) = \sqrt{\frac{1}{N} \sum_{n} \left( \mathbf{w}^T \mathbf{x}_n - t_n \right)^2}$$
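For reference, the RMS error above can be computed in one line (rms_error is a helper name introduced here):

import numpy as np

def rms_error(X, t, w):
    """Root-mean-square error of the predictions Xw against the targets t."""
    return np.sqrt(np.mean((X @ w - t) ** 2))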
• Use 100 random test examples, generated in the same way:
Overfitting vs. Data Set Size

• More training data ⟹ less overfitting

• What if we do not have more training data?


- Use regularization
Regularization

• Penalize large parameter values:


$$E(\mathbf{w}) = \frac{1}{2N} \sum_{n=1}^{N} \left( h_{\mathbf{w}}(\mathbf{x}_n) - t_n \right)^2 + \underbrace{\frac{\lambda}{2} \lVert \mathbf{w} \rVert^2}_{\text{regularizer}}$$

$$\mathbf{w}^{*} = \underset{\mathbf{w}}{\operatorname{argmin}}\; E(\mathbf{w})$$
Ridge Regression

• Multiple linear regression with L2 regularization:


$$J(\mathbf{w}) = \frac{1}{2N} \sum_{n=1}^{N} \left( h_{\mathbf{w}}(\mathbf{x}_n) - t_n \right)^2 + \frac{\lambda}{2} \lVert \mathbf{w} \rVert^2, \qquad \hat{\mathbf{w}} = \underset{\mathbf{w}}{\operatorname{argmin}}\; J(\mathbf{w})$$

• Solution: $\hat{\mathbf{w}} = (\lambda N \mathbf{I} + \mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T \mathbf{t}$


- Prove it.
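A hedged sketch of the requested proof, assuming $h_{\mathbf{w}}(\mathbf{x}) = \mathbf{w}^T\mathbf{x}$ so that $J$ can be written in matrix form: setting the gradient to zero,

$$\nabla J(\mathbf{w}) = \frac{1}{N}\mathbf{X}^T(\mathbf{X}\mathbf{w} - \mathbf{t}) + \lambda\mathbf{w} = \mathbf{0} \;\Longrightarrow\; (\lambda N\mathbf{I} + \mathbf{X}^T\mathbf{X})\,\mathbf{w} = \mathbf{X}^T\mathbf{t} \;\Longrightarrow\; \hat{\mathbf{w}} = (\lambda N\mathbf{I} + \mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{t},$$

and for $\lambda > 0$ the matrix $\lambda N\mathbf{I} + \mathbf{X}^T\mathbf{X}$ is positive definite, so the inverse always exists. A small numpy sketch of this closed form (fit_ridge is a name introduced here):

import numpy as np

def fit_ridge(X, t, lam):
    """Closed-form ridge solution w_hat = (lam*N*I + X^T X)^{-1} X^T t."""
    N, D = X.shape
    A = lam * N * np.eye(D) + X.T @ X
    return np.linalg.solve(A, X.T @ t)   # solve the linear system rather than inverting A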
9th Order Polynomial with Regularization
9th Order Polynomial with Regularization
Training & Test error vs. ln 𝜆

How do we find the optimal value of 𝜆?


Model Selection

• Put aside an independent validation set.


• Select parameters giving best performance on validation set.

ln 𝜆 ∈ {−40, −35, −30, −25, −20, −15}


K-fold Cross-Validation

Source: https://scikit-learn.org/stable/modules/cross_validation.html
K-fold Cross-Validation

• Split the training data into K folds and try a wide range of tuning parameter values (see the sketch below):
- split the data into K folds of roughly equal size
- iterate over a set of values for 𝜆
• iterate over k = 1, 2, ⋯ , K
- use all folds except k for training
- validate (calculate the test error) on the k-th fold
• error[𝜆] = average error over the K folds
- choose the value of 𝜆 that gives the smallest error
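A minimal sketch of this procedure, reusing the fit_ridge and rms_error helpers introduced in the earlier sketches (all names here are illustrative, not from the slides):

import numpy as np

def kfold_select_lambda(X, t, lambdas, K=5, seed=0):
    """Return the lambda with the smallest average validation RMS error over K folds."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(t)), K)   # K folds of roughly equal size
    avg_error = {}
    for lam in lambdas:
        fold_errors = []
        for k in range(K):
            val_idx = folds[k]
            train_idx = np.concatenate([folds[i] for i in range(K) if i != k])
            w = fit_ridge(X[train_idx], t[train_idx], lam)            # train on all folds but k
            fold_errors.append(rms_error(X[val_idx], t[val_idx], w))  # validate on fold k
        avg_error[lam] = np.mean(fold_errors)
    return min(avg_error, key=avg_error.get)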
Regularization: Ridge vs. Lasso

• Ridge regression:

$$J(\mathbf{w}) = \frac{1}{2N} \sum_{n=1}^{N} \left( h_{\mathbf{w}}(\mathbf{x}_n) - t_n \right)^2 + \frac{\lambda}{2} \sum_{j=1}^{M} w_j^{2}$$

• Lasso:

$$J(\mathbf{w}) = \frac{1}{2N} \sum_{n=1}^{N} \left( h_{\mathbf{w}}(\mathbf{x}_n) - t_n \right)^2 + \frac{\lambda}{2} \sum_{j=1}^{M} \lvert w_j \rvert$$

- if 𝜆 is sufficiently large, some of the coefficients 𝑤ⱼ are driven to 0 ⟹ sparse model (see the example below)
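A small illustration of this sparsity effect using scikit-learn's Ridge and Lasso estimators (the data, polynomial degree, and regularization strength alpha are arbitrary; scikit-learn parameterizes the penalty slightly differently from the slides):

import numpy as np
from sklearn.linear_model import Ridge, Lasso

rng = np.random.default_rng(0)
x = rng.uniform(0, 1, size=30)
t = np.sin(2 * np.pi * x) + 0.1 * rng.normal(size=x.shape)
X = np.vander(x, N=10, increasing=True)[:, 1:]   # features x, x^2, ..., x^9

ridge = Ridge(alpha=1e-3).fit(X, t)
lasso = Lasso(alpha=1e-3, max_iter=100_000).fit(X, t)

print("ridge nonzero coefficients:", np.sum(ridge.coef_ != 0))  # typically all 9
print("lasso nonzero coefficients:", np.sum(lasso.coef_ != 0))  # typically fewer (sparse)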
Regularization: Ridge vs. Lasso

Plot of the contours of the unregularized error function (blue) along with the
constraint region (3.30) for the quadratic regularizer 𝑞 = 2 on the left and the lasso
regularizer 𝑞 = 1 on the right, in which the optimum value for the parameter vector
𝐰 is denoted by 𝐰∗. The lasso gives a sparse solution in which 𝑤₁∗ = 0.
Regularization

• Parameter norm penalties (term in the objective).


• Limit parameter norm (constraint).
• Dataset augmentation.
• Dropout.
• Ensembles.
• Semi-supervised learning.
• Early stopping.
• Noise robustness.
• Sparse representations.
• Adversarial training.
Questions?
