
M6. Regression: Linear Models
[a popular (supervised ML) appn. of linear algebra]
Manikandan Narayanan
Week 11 (Oct 6-, 2023)
PRML Jul-Nov 2023 (Grads section)
Acknowledgment of Sources
• Slides based on content from related courses and books:
• Courses:
• IITM – Profs. Arun/Harish/Chandra’s PRML offerings (slides, quizzes, notes, etc.), Prof.
Ravi’s “Intro to ML” slides – cited respectively as [AR], [HR]/[HG], [CC], [BR] in the bottom
right of a slide.
• India – NPTEL PR course by IISc Prof. PS. Sastry (slides, etc.) – cited as [PSS] in the bottom
right of a slide.

• Books:
• PRML by Bishop. (content, figures, slides, etc.) – cited as [CMB]
• Pattern Classification by Duda, Hart and Stork. (content, figures, etc.) – [DHS]
• Mathematics for ML by Deisenroth, Faisal and Ong. (content, figures, etc.) – [DFO]
• Information Theory, Inference and Learning Algorithms by David JC MacKay – [DJM]
Outline for Module M6
• M6. Regression (Linear models)
• M6.0 Introduction/context
• Problem formulation
• Motivating examples, incl. basis functions
• M6.1 Linear regression approaches
• Direct approach I: least-squares (error) (𝑤𝐿𝑆 )
• Direct approach II: least-squares (error) with regularization (𝑤𝑅𝐿𝑆 )
• Discriminative model approach I: MLE (𝑤𝑀𝐿 = 𝑤𝐿𝑆 )
• Discriminative model approach IIa: Bayesian linear regression (𝑤𝑀𝐴𝑃 = 𝑤𝑅𝐿𝑆 )
• Discriminative model approach IIb: Bayesian linear regression (posterior & predictions)
• M6.2 Model Complexity/Selection
• Motivation (hyperparameter tuning to avoid overfitting)
• Frequentist view (bias-variance decomposition)
• M6.3 Concluding thoughts
Outline for Module M6
• M6. Regression (Linear models)
• M6.0 Introduction/context
• Problem formulation
• Motivating examples, incl. basis functions
• M6.1 Linear regression approaches
• Direct approach I: least-squares (error) (𝑤𝐿𝑆 )
• Direct approach II: least-squares (error) with regularization (𝑤𝑅𝐿𝑆 )
• Discriminative model approach I: MLE (𝑤𝑀𝐿 = 𝑤𝐿𝑆 )
• Discriminative model approach IIa: Bayesian linear regression (𝑤𝑀𝐴𝑃 = 𝑤𝑅𝐿𝑆 )
• Discriminative model approach IIb: Bayesian linear regression (posterior & predictions)
• M6.2 Model Complexity/Selection
• Motivation (hyperparameter tuning to avoid overfitting)
• Frequentist view (bias-variance decomposition)
• M6.3 Concluding thoughts
Context: ML Paradigms
• Unsupervised Learning (informally aka “learning patterns from
(unlabelled) data”)
• Density estimation
• Clustering
• Dimensionality reduction
• Supervised Learning (informally aka curve-fitting or function
approximation or “function learning from (labelled) data”)
• Learn an input and output map (features x to target t)
• Regression: continuous output/target t
• Classification: categorical output

[BR,HR]
Regression: the problem
• Learn a map from input variables (features x) to a continuous output variable (target t).
Informally known as function approximation/learning or curve fitting, since
given pairs {(x_n, t_n)}, n = 1…N, we seek a function y_w(x) that “approximates” the map x → t.

Linear regression assumes y_w(x) := y(x, w) is a linear function of the adjustable parameters w.
It could be linear or non-linear in x.

• A foundational supervised learning problem/algorithm:


• Practical limitations for complex data, but sets analytical foundation for other sophisticated learning algorithms.
• Due to its simplicity, it is the first and predominant choice of statistical model in many applied areas with moderate sample sizes:
((E.g., in bioinformatics: to adjust for known confounding factors (covariates) in disease genomics and genome-wide association studies (GWAS), causal inference such as in Mendelian randomization, etc.))

• Our approach in this lecture: mostly [CMB, Chapters 1, 3].

[CMB]
Regression: what does “approximate” mean?
recall three approaches (from Decision Theory)
• Generative model approach:
(I) Model the joint density p(x, t)
(I) Infer the conditional p(t|x) from p(x, t)
(D) Take the conditional mean/median/mode/any other optimal decision outcome as y(x)

• Discriminative model approach:
(I) Model p(t|x) directly
(D) Take the conditional mean/median/mode/any other optimal decision outcome as y(x)

• Direct regression approach:


(D) Learn a regression function y(x) directly from training data

[CMB]
Linear regression: from linear combination of input variables (x ∈
ℝ𝐷 ) to that of basis functions (𝜙(x) ∈ ℝ𝑀 )
• Simplest model of linear regn. involving 𝐷 input vars.:

y(x, w) = w_0 + w_1 x_1 + … + w_D x_D
        = w_0 · 1 + Σ_{j=1}^{D} w_j x_j
        = [w_0 w_1 … w_D] [1; x] = w^T [1; x]

• Model of linear regn. involving 𝑀 basis fns. (fixed non-linear fns. of the input vars.):

y(x, w) = w_0 + w_1 φ_1(x) + … + w_{M−1} φ_{M−1}(x)
        = w_0 · 1 + Σ_{j=1}^{M−1} w_j φ_j(x)
        = w^T φ(x)    (φ: ℝ^D → ℝ^M, with convention φ_0(x) = 1)
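Below is a minimal numpy sketch (not from the slides) of the model y(x, w) = w^T φ(x) above, using a polynomial basis φ_j(x) = x^j for scalar inputs; the basis choice and the example weights are illustrative assumptions.

```python
import numpy as np

def poly_design_matrix(x, M):
    """N x M design matrix Phi with phi_j(x_n) = x_n**j, j = 0..M-1 (phi_0 = 1)."""
    x = np.asarray(x, dtype=float)
    return np.vstack([x**j for j in range(M)]).T

def predict(x, w):
    """y(x, w) = w^T phi(x), evaluated for every point in x."""
    Phi = poly_design_matrix(x, len(w))
    return Phi @ w

# Illustrative example: a cubic y = 1 + 2x - 0.5x^3 evaluated at a few inputs.
w = np.array([1.0, 2.0, 0.0, -0.5])   # [w_0, w_1, w_2, w_3]
x = np.array([-1.0, 0.0, 0.5, 1.0])
print(poly_design_matrix(x, 4))       # rows are phi(x_n)^T
print(predict(x, w))                  # y(x_n, w) for each n
```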


Linear regression: recall standard examples
• Predicting weight 𝑡 from height 𝑥 : 𝒚(𝒙, 𝒘) linear in both 𝒙 and 𝒘.

• Estimation of fetal weight 𝑡 (actually log10 t) from ultrasound measurements: 𝒚 𝒙, 𝒘 linear in 𝒘, not 𝒙.

[From Hoopmann et al., Fetal Diagn Ther. 2011;30(1):29-34. doi:10.1159/000323586]
More examples of basis functions for lin. regn.

• Polynomials (& extn. to spatially restricted splines)
• Fourier basis (sinusoidal fns.), wavelets, etc.
[CMB]
Outline for Module M6
• M6. Regression (Linear models)
• M6.0 Introduction/context
• Problem formulation
• Motivating examples, incl. basis functions
• M6.1 Linear regression approaches
• Direct approach I: least-squares (error) (𝒘𝑳𝑺 )
• Direct approach II: least-squares (error) with regularization (𝑤𝑅𝐿𝑆 )
• Discriminative model approach I: MLE (𝑤𝑀𝐿 = 𝑤𝐿𝑆 )
• Discriminative model approach IIa: Bayesian linear regression (𝑤𝑀𝐴𝑃 = 𝑤𝑅𝐿𝑆 )
• Discriminative model approach IIb: Bayesian linear regression (posterior & predictions)
• M6.2 Model Complexity/Selection
• Motivation (hyperparameter tuning to avoid overfitting)
• Frequentist view (bias-variance decomposition)
• M6.3 Concluding thoughts
Linear regression: a direct approach
Approach: minimize sum-of-squares error; aka least-squares solution/approach

min_w E_D(w) = (1/2) Σ_{n=1}^{N} (t_n − y(x_n, w))²

where y(x, w) = w^T φ(x).

Solution: w_LS that minimizes E_D(w) (via matrix notation)

Normal equations from setting the gradient to zero, using the N x M design matrix Φ (with Φ_{nj} = φ_j(x_n)):

Φ^T Φ w_LS = Φ^T t,  i.e.,  w_LS = (Φ^T Φ)^{−1} Φ^T t
[CMB, HR]
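A hedged numpy sketch of the least-squares solution above: w_LS satisfies the normal equations Φ^T Φ w = Φ^T t. The synthetic sinusoidal data and the polynomial basis are assumptions for illustration; np.linalg.lstsq is used instead of an explicit matrix inverse for numerical stability.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (assumption): noisy samples of sin(2*pi*x), as in the running example.
N = 20
x = rng.uniform(0.0, 1.0, size=N)
t = np.sin(2 * np.pi * x) + 0.2 * rng.standard_normal(N)

# N x M design matrix with polynomial basis functions phi_j(x) = x**j.
M = 4
Phi = np.vstack([x**j for j in range(M)]).T

# w_LS solves the least-squares problem; lstsq is more stable than inverting Phi^T Phi.
w_ls, *_ = np.linalg.lstsq(Phi, t, rcond=None)

# Sanity check: w_LS satisfies the normal equations Phi^T Phi w = Phi^T t.
print(np.allclose(Phi.T @ Phi @ w_ls, Phi.T @ t))
print("w_LS =", w_ls)
```

The same w_LS would be obtained from the closed form (Φ^T Φ)^{-1} Φ^T t whenever Φ has linearly independent columns.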
Soln.: Geometry of (least-squares) sol.

[CMB, Figure 3.2]

[CMB, HR]
LA Refresher
• See Appendix for LA refresher on
• related topic of Ax=b when no solution is possible, and
• some matrix-vector gradient formulas/tricks.
Recall LA: To solve 𝐴𝑥 = 𝑏, we premult. by 𝐴𝑇 , and
simply solve 𝐴𝑇 𝐴𝑥 = 𝐴𝑇 𝑏.

Ex.: Prove:
1) at least one soln. x* exists for the normal eqn.
2) soln. x* unique if (𝐴𝑇 𝐴) is invertible (⟺ 𝐴 has lin. indep. cols.)
3) infinite solns. x* if (𝐴𝑇 𝐴) is non-invertible (⟺ 𝐴 has lin. dep. cols.)

Ex.:
i) Prove 𝑁𝑆(𝐴) = 𝑁𝑆(𝐴𝑇 𝐴).
ii) Use orthog. complementarity of 𝑁𝑆(𝐴𝑇 ), 𝐶𝑆(𝐴) to derive normal eqns.
Outline for Module M6
• M6. Regression (Linear models)
• M6.0 Introduction/context
• Problem formulation
• Motivating examples, incl. basis functions
• M6.1 Linear regression approaches
• Direct approach I: least-squares (error) (𝑤𝐿𝑆 )
• Direct approach II: least-squares (error) with regularization (𝒘𝑹𝑳𝑺 )
• Discriminative model approach I: MLE (𝑤𝑀𝐿 = 𝑤𝐿𝑆 )
• Discriminative model approach IIa: Bayesian linear regression (𝑤𝑀𝐴𝑃 = 𝑤𝑅𝐿𝑆 )
• Discriminative model approach IIb: Bayesian linear regression (posterior & predictions)
• M6.2 Model Complexity/Selection
• Motivation (hyperparameter tuning to avoid overfitting)
• Frequentist view (bias-variance decomposition)
• M6.3 Concluding thoughts
From sum-of-squares to regularized error!
Running example: polynomial curve fitting ((via sum-of-squares error function / least-squares approach))

min_w E_D(w) = (1/2) Σ_{n=1}^{N} (t_n − y(x_n, w))²

where y(x, w) = w^T φ(x).

[CMB]
Polynomial Curve Fitting

[CMB]
0th Order Polynomial

[CMB]
1st Order Polynomial

[CMB]
3rd Order Polynomial

[CMB]
9th Order Polynomial

[CMB]
Brainstorm: What degree polynomial would
you choose?
Over-fitting

Root-Mean-Square (RMS) Error: E_RMS = √(2 E(w*) / N)
[CMB]
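A small numpy sketch (assumed data-generating curve sin(2πx), assumed noise level and sample sizes) illustrating the over-fitting behaviour above: training RMS error keeps dropping with polynomial degree, while held-out RMS error eventually rises.

```python
import numpy as np

rng = np.random.default_rng(1)

def make_data(n, noise=0.3):
    x = rng.uniform(0.0, 1.0, size=n)
    return x, np.sin(2 * np.pi * x) + noise * rng.standard_normal(n)

def design(x, degree):
    return np.vstack([x**j for j in range(degree + 1)]).T

def rms(y, t):
    return np.sqrt(np.mean((y - t) ** 2))

x_tr, t_tr = make_data(10)     # small training set, as in the slides
x_te, t_te = make_data(100)    # larger held-out test set

for degree in [0, 1, 3, 9]:
    Phi_tr, Phi_te = design(x_tr, degree), design(x_te, degree)
    w, *_ = np.linalg.lstsq(Phi_tr, t_tr, rcond=None)
    print(f"degree {degree}: train RMS {rms(Phi_tr @ w, t_tr):.3f}, "
          f"test RMS {rms(Phi_te @ w, t_te):.3f}")
```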
Polynomial Coefficients

[CMB]
The role of data set size N?
Data Set Size: 𝑁 = 10
9th Order Polynomial

𝑁 = 10

[CMB]
Data Set Size:
9th Order Polynomial

[CMB]
Data Set Size:
9th Order Polynomial

[CMB]
How to take care of both data set size and
model complexity tradeoffs?
Regularization
• Penalize large coefficient values:

Ẽ(w) = E_D(w) + λ E_W(w) = (1/2) Σ_{n=1}^{N} (t_n − y(x_n, w))² + (λ/2) ||w||²

(the data term favors complex models; the regularization term penalizes complex models)

[CMB]
Polynomial Coefficients

[CMB]
Regularization:

[CMB]
Regularization:

[CMB]
Regularization: vs.

[CMB]
Now, let’s see how to solve the minimization problem!

min_w Ẽ(w) = (1/2) Σ_{n=1}^{N} (t_n − w^T φ(x_n))² + (λ/2) w^T w

Matrix notation gradient formula (from Appendix): ∇_w (1/2)||t − Φw||² = −Φ^T (t − Φw);  ∇_w (1/2) w^T w = w
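A quick numerical sanity check (not from the slides) that the matrix gradient formula above matches a finite-difference gradient of Ẽ(w); the data here are random placeholders.

```python
import numpy as np

rng = np.random.default_rng(2)
N, M, lam = 15, 4, 0.1
Phi = rng.standard_normal((N, M))   # placeholder design matrix
t = rng.standard_normal(N)          # placeholder targets
w = rng.standard_normal(M)          # point at which the gradient is checked

def E(w):
    r = t - Phi @ w
    return 0.5 * r @ r + 0.5 * lam * w @ w

analytic = -Phi.T @ (t - Phi @ w) + lam * w        # from the matrix gradient formula

eps = 1e-6
numeric = np.array([(E(w + eps * np.eye(M)[j]) - E(w - eps * np.eye(M)[j])) / (2 * eps)
                    for j in range(M)])
print(np.allclose(analytic, numeric, atol=1e-5))   # should print True
```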


Regularized Least Squares (1): Solution for
ridge regression
• Consider the error function: Data term + λ · Regularization term (λ is called the regularization coefficient).

• With the sum-of-squares error function and a quadratic regularizer, we get

Ẽ(w) = (1/2) Σ_{n=1}^{N} (t_n − w^T φ(x_n))² + (λ/2) w^T w

• which is minimized by

w_RLS = (λI + Φ^T Φ)^{−1} Φ^T t
[CMB]
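A hedged numpy sketch of the closed-form ridge solution w_RLS = (λI + Φ^T Φ)^{-1} Φ^T t stated above; the synthetic data, polynomial degree and value of λ are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)

# Synthetic data (assumption): noisy sin(2*pi*x), 9th-order polynomial basis.
N, degree, lam = 10, 9, np.exp(-5)
x = rng.uniform(0.0, 1.0, size=N)
t = np.sin(2 * np.pi * x) + 0.3 * rng.standard_normal(N)
Phi = np.vstack([x**j for j in range(degree + 1)]).T   # N x M design matrix, M = degree + 1

M = Phi.shape[1]
w_rls = np.linalg.solve(lam * np.eye(M) + Phi.T @ Phi, Phi.T @ t)   # ridge solution
w_ls, *_ = np.linalg.lstsq(Phi, t, rcond=None)                      # unregularized, for comparison

print("||w_LS||  =", np.linalg.norm(w_ls))    # typically much larger for this degree
print("||w_RLS|| =", np.linalg.norm(w_rls))   # regularization shrinks the coefficients
```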
Regularized Least Squares (2)
• With a more general regularizer, we have

Ẽ(w) = (1/2) Σ_{n=1}^{N} (t_n − w^T φ(x_n))² + (λ/2) Σ_j |w_j|^q

(q = 1 gives the lasso; q = 2 gives the quadratic/ridge regularizer)

[CMB]
Regularized Least Squares (3)
• Lasso tends to generate sparser solutions than a ridge (quadratic) regularizer.
• Regularization is also known as penalization/weight decay in ML, or parameter shrinkage in the statistics literature.

[CMB]
Outline for Module M6
• M6. Regression (Linear models)
• M6.0 Introduction/context
• Problem formulation
• Motivating examples, incl. basis functions
• M6.1 Linear regression approaches
• Direct approach I: least-squares (error) (𝑤𝐿𝑆 )
• Direct approach II: least-squares (error) with regularization (𝑤𝑅𝐿𝑆 )
• Discriminative model approach I: MLE (𝒘𝑴𝑳 = 𝒘𝑳𝑺 )
• Discriminative model approach IIa: Bayesian linear regression (𝑤𝑀𝐴𝑃 = 𝑤𝑅𝐿𝑆 )
• Discriminative model approach IIb: Bayesian linear regression (posterior & predictions)
• M6.2 Model Complexity/Selection
• Motivation (hyperparameter tuning to avoid overfitting)
• Frequentist view (bias-variance decomposition)
• M6.3 Concluding thoughts
A different view of min. E(w) or its regularized
version?
Q: How do you convert this intuition/empirical-art into science, and
derive E(w) or its (other) regularized versions systematically?
A: Probabilistic view helps. Discriminative Approach: Model 𝑝(𝑡|𝑥)
Q: What has regression got to do with our
previous topic: density estimation?
• Brainstorm: how to model P(t|x)?
Q: What has regression got to do with our
previous topics?
A: P(t|x) captures the input-output map. Steps involved are:
(1) Model/estimate P(t|x)
(how? Density Estimation; MLE/Bayesian Inference)
(2) Predict t for a new x from estimated P(t|x)
(how? Decision Theory; e.g., 𝑦(x𝒏𝒆𝒘 ) = 𝐸[𝒕|x = x𝒏𝒆𝒘 ])
Curve Fitting: Going to the basics!
using a Probabilistic Model & its Density Estimation (MLE/Bayesian)

[CMB]
ML estimation

Model: p(t | x, w, β) = N(t | y(x, w), β^{−1}). Determine w_ML by maximizing the likelihood of the training data, which amounts to minimizing the sum-of-squares error E_D(w).

[CMB]
Summary: Linear model for regression -
𝑤𝑀𝐿 == 𝑤𝐿𝑆

[CMB]
Addtnl. Advantage: ML Predictive Distribution

[CMB]
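A brief numpy illustration (under assumed synthetic data and basis; not the slides' own code) of the extra quantity MLE provides beyond plain least squares: the noise precision β_ML, and with it a Gaussian predictive distribution N(t | w_ML^T φ(x), β_ML^{-1}).

```python
import numpy as np

rng = np.random.default_rng(4)

# Assumed synthetic data and a cubic polynomial basis, for illustration only.
N = 30
x = rng.uniform(0.0, 1.0, size=N)
t = np.sin(2 * np.pi * x) + 0.2 * rng.standard_normal(N)
Phi = np.vstack([x**j for j in range(4)]).T

w_ml, *_ = np.linalg.lstsq(Phi, t, rcond=None)    # w_ML coincides with w_LS
beta_ml = 1.0 / np.mean((Phi @ w_ml - t) ** 2)    # 1/beta_ML = average squared residual

x_new = 0.25
phi_new = np.array([x_new**j for j in range(4)])
mean, var = w_ml @ phi_new, 1.0 / beta_ml         # predictive N(t | mean, var): same var for every x
print(f"p(t | x={x_new}) ~ N({mean:.3f}, {var:.3f})")
```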
Outline for Module M6
• M6. Regression (Linear models)
• M6.0 Introduction/context
• Problem formulation
• Motivating examples, incl. basis functions
• M6.1 Linear regression approaches
• Direct approach I: least-squares (error) (𝑤𝐿𝑆 )
• Direct approach II: least-squares (error) with regularization (𝑤𝑅𝐿𝑆 )
• Discriminative model approach I: MLE (𝑤𝑀𝐿 = 𝑤𝐿𝑆 )
• Discriminative model approach IIa: Bayesian linear regression (𝒘𝑴𝑨𝑷 = 𝒘𝑹𝑳𝑺 )
• Discriminative model approach IIb: Bayesian linear regression (posterior & predictions)
• M6.2 Model Complexity/Selection
• Motivation (hyperparameter tuning to avoid overfitting)
• Frequentist view (bias-variance decomposition)
• M6.3 Concluding thoughts
Recall: Probab. Model & its Density Estimation (MLE)

[CMB]
Bayesian inference: what would you model as
a rv instead of a fixed value?
• Brainstorm
• What would you model?
• What would you infer?
Relation between Bayesian MAP and
Regularized linear regression

Let’s look at a simpler problem first – the mode of the posterior of 𝑤:

w_MAP = argmax_w P(w | D_N),  where D_N := (x, t) := {(x_n, t_n)}, n = 1 to N

We will actually show that 𝑤𝑀𝐴𝑃 = 𝑤𝑅𝐿𝑆 !!


Bayesian inference: a first step via MAP
Assume the likelihood p(t | x, w, β) = N(t | y(x, w), β^{−1}) (as in MLE).

Then, assume the prior p(w | α) = N(w | 0, α^{−1} I)   (a prior that shrinks w towards 0).

Determine the maximizer of this posterior, i.e., w_MAP, by minimizing the regularized sum-of-squares error Ẽ(w), because:

[cf. full details of proof in next slide]
[CMB]
Full details of 𝑤𝑀𝐴𝑃 derivation, & 𝑤𝑀𝐴𝑃 = 𝑤𝑅𝐿𝑆 proof!

ln p(w | α) = (M/2) ln α − (M/2) ln(2π) − (α/2) w^T w

⇒ ln p(w | x, t, α, β) = ln p(t | x, w, α, β) + ln p(w | α) + const.   (assume α, β are both known hyperparams.)

                       = (N/2) ln β − (N/2) ln(2π) − β E_D(w) + (M/2) ln α − (M/2) ln(2π) − (α/2) w^T w + const.

So maximizing this ln(posterior) is the same as maximizing −β E_D(w) − (α/2) ||w||² (ignoring terms indept. of w), or equivalently

minimizing β E_D(w) + (α/2) ||w||² = β (E_D(w) + λ E_W(w)) = β Ẽ(w), or equivalently minimizing Ẽ(w).   ((here, we set λ := α/β))
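A small numerical check of the w_MAP = w_RLS claim just derived: minimize β E_D(w) + (α/2)||w||² directly with scipy and compare to the closed-form ridge solution with λ = α/β. The data, α and β are placeholder assumptions.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(5)

# Placeholder data and hyperparameters (assumptions for this check).
N, M, alpha, beta = 25, 5, 2.0, 10.0
x = rng.uniform(0.0, 1.0, size=N)
t = np.sin(2 * np.pi * x) + 0.3 * rng.standard_normal(N)
Phi = np.vstack([x**j for j in range(M)]).T

def neg_log_posterior(w):
    """beta * E_D(w) + (alpha/2) * ||w||^2, up to constants independent of w."""
    r = t - Phi @ w
    return beta * 0.5 * r @ r + 0.5 * alpha * w @ w

w_map = minimize(neg_log_posterior, np.zeros(M)).x                   # numerical MAP estimate
lam = alpha / beta
w_rls = np.linalg.solve(lam * np.eye(M) + Phi.T @ Phi, Phi.T @ t)    # closed-form ridge solution

print(np.allclose(w_map, w_rls, atol=1e-4))   # should print True (up to optimizer tolerance)
```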
Changing the prior from Gaussian to Laplacian!
• Prior: p(w) = p(w_0) p(w_1) … p(w_{M−1})

• What if p(w_i) is changed from

  √(α/(2π)) exp{−(α/2) w_i²}  →  (α/4) exp{−(α/2) |w_i|} ?

• Then p(w) changes from

  (α/(2π))^{M/2} exp(−(α/2) ||w||_2²)  →  (α/4)^M exp(−(α/2) ||w||_1)

• Then the regularization term E_W(w) in the regularized error Ẽ(w) = E_D(w) + λ E_W(w) changes from:

  (1/2) Σ_i w_i²  (ridge regn. / L2 regularization)  →  (1/2) Σ_i |w_i|  (lasso regn. / L1 regularization)  (see the sketch below)
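A hedged sketch of the practical difference between the two penalties, assuming scikit-learn is available; the synthetic data, the polynomial features and the regularization strengths are illustrative choices, not values from the slides.

```python
import numpy as np
from sklearn.linear_model import Ridge, Lasso

rng = np.random.default_rng(6)

# Synthetic data (assumption): noisy sin(2*pi*x) with a 9th-order polynomial basis.
N = 30
x = rng.uniform(0.0, 1.0, size=N)
t = np.sin(2 * np.pi * x) + 0.2 * rng.standard_normal(N)
Phi = np.vstack([x**j for j in range(1, 10)]).T   # no constant column; fit_intercept handles w_0

ridge = Ridge(alpha=1e-3).fit(Phi, t)                   # L2 / quadratic regularizer
lasso = Lasso(alpha=1e-3, max_iter=50000).fit(Phi, t)   # L1 regularizer

print("ridge coefs:", np.round(ridge.coef_, 3))   # typically all non-zero, just shrunk
print("lasso coefs:", np.round(lasso.coef_, 3))   # typically several driven exactly to zero
print("lasso zeros:", np.sum(lasso.coef_ == 0.0))
```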

Recall:
General prior p(𝒘) that generalizes both
Gaussian and Laplacian prior
Outline for Module M6
• M6. Regression (Linear models)
• M6.0 Introduction/context
• Problem formulation
• Motivating examples, incl. basis functions
• M6.1 Linear regression approaches
• Direct approach I: least-squares (error) (𝑤𝐿𝑆 )
• Direct approach II: least-squares (error) with regularization (𝑤𝑅𝐿𝑆 )
• Discriminative model approach I: MLE (𝑤𝑀𝐿 = 𝑤𝐿𝑆 )
• Discriminative model approach IIa: Bayesian linear regression (𝑤𝑀𝐴𝑃 = 𝑤𝑅𝐿𝑆 )
• Discriminative model approach IIb: Bayesian linear regression (posterior & predictions)
• M6.2 Model Complexity/Selection
• Motivation (hyperparameter tuning to avoid overfitting)
• Frequentist view (bias-variance decomposition)
• M6.3 Concluding thoughts
Bayesian inference: second and third steps
• Assume Gaussian prior for 𝒘 going forward.
• We don’t want just a single-point estimate (MAP) of 𝒘; we want

Step 2) the full posterior of 𝒘, and in turn use it to


Step 3) predict t for new x (via model-averaging…
• …wherein each model is given by a particular possible value of w and the averaging
weight is given by the model’s posterior)
Step 2) Let’s see an example of full posterior…
t ≈ y(x, w) = w_0 + w_1 x

Use a given set of training data points to not just infer one optimal model specified by 𝑤𝑀𝐴𝑃 , but all
possible 𝑤 models and the training dataset’s support (posterior probab.) for each such model 𝑤.
Bayesian Linear Regression Example (1)
0 data points observed

Prior Data Space

[CMB]
Bayesian Linear Regression Example (2)
1 data point observed

Likelihood Posterior Data Space

[CMB]
Bayesian Linear Regression Example (3)
2 data points observed

Likelihood Posterior Data Space

[CMB]
Bayesian Linear Regression Example (4)
20 data points observed

Posterior Data Space

[CMB]
From example posterior plots to
Full posterior in the general case:
Bayesian Linear Regression
• Define a conjugate prior over w. A common choice is the Gaussian p(w) = N(w | m_0, S_0).

• Combining this with the likelihood function and using results for marginal
and conditional Gaussian distributions gives the posterior p(w | t) = N(w | m_N, S_N).

[Qn. What is 𝑚𝑁 ?]

[CMB]
Recall: MVG Handy Results (cheat-sheet)

[CMB: Bishop, Chapter 2]


Finally, Step 3: What about predicting t for a
new point x?
• ((we don’t want just an estimate or posterior of w; we want to use it
to predict t for new x))

((again example plots first, and post. predictive distbn. form for the
general case after that.))
Example Predictive Distribution (1)
• Example: Sinusoidal data, 9 Gaussian basis fns. (𝜙: ℝ → ℝ9 ), 1 data point

[CMB]
Example Predictive Distribution (2)
• Example: Sinusoidal data, 9 Gaussian basis functions, 2 data points

[CMB]
Example Predictive Distribution (3)
• Example: Sinusoidal data, 9 Gaussian basis functions, 4 data points

[CMB]
Example Predictive Distribution (4)
• Example: Sinusoidal data, 9 Gaussian basis functions, 25 data points

[CMB]
(Bayesian/Posterior) Predictive Distribution (1)
• Predict 𝑡 for new values of 𝑥 by integrating over w (model-averaging):

p(t | x, 𝐭, α, β) = ∫ p(t | x, w, β) p(w | 𝐭, α, β) dw = N(t | m_N^T φ(x), σ_N²(x))

• where σ_N²(x) = β^{−1} + φ(x)^T S_N φ(x).

Exercise: Prove that m_N = w_RLS = w_MAP (recall λ := α/β), and hence that the posterior
predictive mean is the same as the predicted value in the direct RLS approach. [CMB]
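A numpy sketch of the posterior and posterior-predictive computations for the standard zero-mean isotropic prior p(w | α) = N(w | 0, α^{-1} I): S_N^{-1} = αI + βΦ^TΦ, m_N = β S_N Φ^T t, σ_N²(x) = β^{-1} + φ(x)^T S_N φ(x). The sinusoidal data, the Gaussian-basis centres/width and the values of α, β are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(7)

alpha, beta = 2.0, 25.0                 # prior precision and noise precision (assumed known)

def gaussian_basis(x, centres=np.linspace(0, 1, 9), s=0.1):
    """phi: R -> R^9, nine Gaussian bumps (an assumed concrete choice for the example)."""
    x = np.atleast_1d(x)[:, None]
    return np.exp(-0.5 * ((x - centres) / s) ** 2)

# Assumed training data: noisy sin(2*pi*x).
N = 25
x = rng.uniform(0.0, 1.0, size=N)
t = np.sin(2 * np.pi * x) + 0.2 * rng.standard_normal(N)
Phi = gaussian_basis(x)

# Posterior p(w | t) = N(w | m_N, S_N).
S_N = np.linalg.inv(alpha * np.eye(Phi.shape[1]) + beta * Phi.T @ Phi)
m_N = beta * S_N @ Phi.T @ t

# Predictive p(t | x_new) = N(t | m_N^T phi(x_new), 1/beta + phi(x_new)^T S_N phi(x_new)).
x_new = 0.5
phi_new = gaussian_basis(x_new)[0]
pred_mean = m_N @ phi_new
pred_var = 1.0 / beta + phi_new @ S_N @ phi_new   # datapoint-specific variance
print(f"p(t | x={x_new}) ~ N({pred_mean:.3f}, {pred_var:.4f})")
```

Evaluating pred_var at several x values shows the uncertainty shrinking near observed data and growing away from it, as in the example plots above.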
Linear regression direct vs. discriminative
model approaches: summary
Discriminative model-based approaches have two advantages:
1. Convert intuition for obj. fns. to probabilistic model driven motivations:
(Least-squares or min 𝐸(𝑤)) 𝑤𝐿𝑆 = 𝑤𝑀𝐿 (MLE)

(Reg. Least-squares or min 𝐸(𝑤)) 𝑤𝑅𝐿𝑆 = 𝑤𝑀𝐴𝑃 (Bayesian)

2. Give additional advantage of capturing the uncertainty over the predicted values, viz., a
predictive distribution

p(t | x) = N(t | y(x, w*) := w*^T φ(x), var)   (and not just a single predicted value y(x, w*) as in the direct approach).

• In MLE, the (pred.) var is a dataset-wide single variance (σ² = β^{−1})

• In Bayesian, the (post. pred.) var is datapoint-specific (σ²(x) = β^{−1} + φ(x)^T S_N φ(x))
Outline for Module M6
• M6. Regression (Linear models)
• M6.0 Introduction/context
• Problem formulation
• Motivating examples, incl. basis functions
• M6.1 Linear regression approaches
• Direct approach I: least-squares (error) (𝑤𝐿𝑆 )
• Direct approach II: least-squares (error) with regularization (𝑤𝑅𝐿𝑆 )
• Discriminative model approach I: MLE (𝑤𝑀𝐿 = 𝑤𝐿𝑆 )
• Discriminative model approach IIa: Bayesian linear regression (𝑤𝑀𝐴𝑃 = 𝑤𝑅𝐿𝑆 )
• Discriminative model approach IIb: Bayesian linear regression (posterior & predictions)
• M6.2 Model Complexity/Selection
• Motivation (hyperparameter tuning to avoid overfitting)
• Frequentist view (bias-variance decomposition)
• M6.3 Concluding thoughts
Recall: Motivating example for Regularization,
and Hyperparam. Tuning

ln 𝜆 = −∞
[CMB]
Regularization, and Hyperparam. tuning

Regularization penalizes large coefs. of the 9th-order polynomial fit of the data.
[CMB]
Some Motivating Questions
• Can we understand the error in our predictions better?
• That is, can we identify the different components of the error in our predictions?
• How are these different components related to the complexity of our model, and to
overfitting?

• Can we use the above knowledge to better tune model complexity (hyperparameters) to avoid overfitting?

• If the trends in data require fitting of a complex model, then


• can overfitting be detected by understanding the stability of the optimal
(frequentist/ML) model across different training datasets?
• can a Bayesian model overcome the overfitting “naturally/implicitly” by not settling
in on a single optimal model and instead averaging over multiple models?
• Bayesian view (model averaging & empirical Bayes) possible, but out of scope for this course.
Outline for Module M6
• M6. Regression (Linear models)
• M6.0 Introduction/context
• Problem formulation
• Motivating examples, incl. basis functions
• M6.1 Linear regression approaches
• Direct approach I: least-squares (error) (𝑤𝐿𝑆 )
• Direct approach II: least-squares (error) with regularization (𝑤𝑅𝐿𝑆 )
• Discriminative model approach I: MLE (𝑤𝑀𝐿 = 𝑤𝐿𝑆 )
• Discriminative model approach IIa: Bayesian linear regression (𝑤𝑀𝐴𝑃 = 𝑤𝑅𝐿𝑆 )
• Discriminative model approach IIb: Bayesian linear regression (posterior & predictions)
• M6.2 Model Complexity/Selection
• Motivation (hyperparameter tuning to avoid overfitting)
• Frequentist view (bias-variance decomposition)
• M6.3 Concluding thoughts
Recall: Decision Theory for Regression
min. squared loss (cond. expn. as minimizer)

[CMB]
Bias-variance analysis proof

Goal: Decompose average error 𝐸𝐷 [𝐸𝑥,𝑡 [𝐿]] into different terms.

Now, simply view “y(x;D) – h(x)” as a random variable Z; and apply the variance formula:
Var_D(Z) = E_D[Z²] − (E_D[Z])² to get the bias-variance decomposition of error below:

[CMB]
Bias-variance decomposition in formula

expected loss = E_D[E_{x,t}[L]] = E_D[E_{x,t}[(y(x; D) − t)²]] = (bias)² + variance + noise, where
(bias)² = ∫ (E_D[y(x; D)] − h(x))² p(x) dx,
variance = ∫ E_D[(y(x; D) − E_D[y(x; D)])²] p(x) dx,
noise = ∬ (h(x) − t)² p(x, t) dx dt,  with h(x) := E[t | x].

Exercise: cf. worksheet for careful understanding of what random variables the expectation above is
taken over!
[CMB]
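A compact simulation (under assumed settings, not from the slides) of the decomposition above: draw many training sets D, fit a regularized 9th-order polynomial y(x; D) to each, and estimate (bias)², variance and their sum on a test grid; larger λ raises bias and lowers variance.

```python
import numpy as np

rng = np.random.default_rng(8)

def h(x):                         # true regression function E[t|x] (assumed)
    return np.sin(2 * np.pi * x)

def design(x, degree=9):
    return np.vstack([x**j for j in range(degree + 1)]).T

def fit(x, t, lam):               # regularized least squares on one dataset D
    Phi = design(x)
    return np.linalg.solve(lam * np.eye(Phi.shape[1]) + Phi.T @ Phi, Phi.T @ t)

L, N, noise = 200, 25, 0.3        # number of datasets, points per dataset, noise std (assumed)
x_test = np.linspace(0.0, 1.0, 100)
Phi_test = design(x_test)

for lam in [np.exp(2), np.exp(-3), np.exp(-10)]:   # high -> low regularization
    preds = []
    for _ in range(L):            # y(x; D) for each training set D
        x_tr = rng.uniform(0.0, 1.0, size=N)
        t_tr = h(x_tr) + noise * rng.standard_normal(N)
        preds.append(Phi_test @ fit(x_tr, t_tr, lam))
    preds = np.array(preds)                        # L x 100 matrix of fitted curves
    avg = preds.mean(axis=0)                       # E_D[y(x; D)]
    bias2 = np.mean((avg - h(x_test)) ** 2)        # (bias)^2
    var = np.mean(preds.var(axis=0))               # variance
    print(f"ln lam = {np.log(lam):6.1f}: bias^2 = {bias2:.4f}, variance = {var:.4f}, "
          f"sum = {bias2 + var:.4f}")
```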
Bias-variance in pictures (for an example)

[CMB]
Bias-variance analysis (for the example)

[CMB]
Bias-variance anals.: applicability in practice?

[CMB]
Bias-variance anals.: applicability in practice?
details

[CMB]
Concluding thoughts
• Linear regression forms the foundation of other sophisticated methods, so
it is good to invest enough time in it.
• Two views: direct loss fn. view (E(w)/regularization) & probabilistic model view (MLE/Bayesian)
• But lin. regn. has limitations in practice, despite non-linear basis functions, closed-form
solutions and other analytical advantages.
• Mainly because basis functions are fixed before seeing the training data (curse of
dimensionality when dimensionality of feature vectors D grows).

• Next steps:
• Linear models for classification, which play a similar foundational role for other classification
methods.
• Move from fixed basis fns to selection of basis functions or adaptation of basis
functions using training data, in later lectures on non-linear models.
Thank you!
Backup slides
Linear Algebra (LA) Refresher
• Switch to LN Pandey’s notes
LA Cheat Sheet: The four subspaces of a matrix

[From https://mathworld.wolfram.com/FundamentalTheoremofLinearAlgebra.html, Strang LA book]


LA + Opt. Cheat Sheet
• Real, symmetric matrices 𝑆 can be diagonalized as 𝑆 = 𝑄ΛQ𝑇
• 𝑆 is psd iff all its eigenvalues are non-negative.

[HR]
Recall LA: To solve 𝐴𝑥 = 𝑏, we premult. by 𝐴𝑇 , and
simply solve 𝐴𝑇 𝐴𝑥 = 𝐴𝑇 𝑏.

Ex.: Prove:
1) at least one soln. x* exists for the normal eqn.
2) soln. x* unique if (𝐴𝑇 𝐴) is invertible (⟺ 𝐴 has lin. indep. cols.)
3) infinite solns. x* if (𝐴𝑇 𝐴) is non-invertible (⟺ 𝐴 has lin. dep. cols.)

Ex.:
i) Prove 𝑁𝑆(𝐴) = 𝑁𝑆(𝐴𝑇 𝐴).
ii) Use orthog. complementarity of 𝑁𝑆(𝐴𝑇 ), 𝐶𝑆(𝐴) to derive normal eqns.
Choice of Prior
Different priors on the parameter 𝜃 (≔ 𝑤) lead to…
…different regularizations (ridge vs. lasso regn.)
Bias-variance analysis (alternate proof)

[CMB]
