M6 Regression (Linear Models) v2
• Books:
• PRML by Bishop. (content, figures, slides, etc.) – cited as [CMB]
• Pattern Classification by Duda, Hart and Stork. (content, figures, etc.) – [DHS]
• Mathematics for ML by Deisenroth, Faisal and Ong. (content, figures, etc.) – [DFO]
• Information Theory, Inference and Learning Algorithms by David JC MacKay – [DJM]
Outline for Module M6
• M6. Regression (Linear models)
• M6.0 Introduction/context
• Problem formulation
• Motivating examples, incl. basis functions
• M6.1 Linear regression approaches
• Direct approach I: least-squares (error) (𝑤𝐿𝑆 )
• Direct approach II: least-squares (error) with regularization (𝑤𝑅𝐿𝑆 )
• Discriminative model approach I: MLE (𝑤𝑀𝐿 = 𝑤𝐿𝑆 )
• Discriminative model approach IIa: Bayesian linear regression (𝑤𝑀𝐴𝑃 = 𝑤𝑅𝐿𝑆 )
• Discriminative model approach IIb: Bayesian linear regression (posterior & predictions)
• M6.2 Model Complexity/Selection
• Motivation (hyperparameter tuning to avoid overfitting)
• Frequentist view (bias-variance decomposition)
• M6.3 Concluding thoughts
Context: ML Paradigms
• Unsupervised Learning (informally aka “learning patterns from (unlabelled) data”)
• Density estimation
• Clustering
• Dimensionality reduction
• Supervised Learning (informally aka curve-fitting or function approximation or “function learning from (labelled) data”)
• Learn an input and output map (features x to target t)
• Regression: continuous output/target t
• Classification: categorical output
[BR,HR]
Regression: the problem
• Learn a map from input variables (features x) to a continuous output variable (target t).
Informally known as function approximation/learning or curve fitting: given pairs $\{(\mathbf{x}_n, t_n)\}_{n=1}^{N}$, we seek a function $y_{\mathbf{w}}(\mathbf{x})$ that “approximates” the map x → t.
[CMB]
Regression: what does “approximate” mean?
recall the three approaches (from Decision Theory)
• Generative model approach:
(I) Model the joint density p(x, t)
(I) Infer the conditional p(t | x) from p(x, t)
(D) Take the conditional mean/median/mode/any other optimal decision outcome as y(x)
[CMB]
Linear regression: from a linear combination of the input variables (x ∈ ℝ^D) to one of basis functions (𝜙(x) ∈ ℝ^M)
• Simplest model of linear regn. involving 𝐷 input vars.:
$y(\mathbf{x}, \mathbf{w}) = w_0 + w_1 x_1 + \dots + w_D x_D = w_0 \cdot 1 + \sum_{j=1}^{D} w_j x_j = \mathbf{w}^T \begin{pmatrix} 1 \\ \mathbf{x} \end{pmatrix}$, with $\mathbf{w} = (w_0, w_1, \dots, w_D)^T$
• Model of linear regn. involving 𝑀 basis fns. (fixed non-linear fns. of the input vars.):
$y(\mathbf{x}, \mathbf{w}) = w_0 \cdot 1 + \sum_{j=1}^{M-1} w_j \phi_j(\mathbf{x}) = \mathbf{w}^T \boldsymbol{\phi}(\mathbf{x})$, with $\phi_0(\mathbf{x}) \equiv 1$
• Estimation of fetal weight 𝑡 (actually log10 t) from ultrasound measurements: 𝒚 𝒙, 𝒘 linear in 𝒘, not 𝒙.
Example basis functions: polynomials (& extn. to spatially restricted splines), Fourier basis (sinusoidal fns.), wavelets, etc.
[CMB]
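To make the fixed-basis-function view concrete, here is a minimal sketch of building the design matrix Φ whose n-th row is 𝜙(xₙ)ᵀ (the helper names and the specific bases are illustrative assumptions, not from the slides):

```python
import numpy as np

def polynomial_design_matrix(x, M):
    """Phi[n, j] = x_n**j for j = 0..M-1; column j = 0 is the bias feature phi_0(x) = 1."""
    x = np.asarray(x, dtype=float).ravel()
    return np.vander(x, M, increasing=True)              # shape (N, M)

def gaussian_design_matrix(x, centers, s):
    """Gaussian basis fns. phi_j(x) = exp(-(x - mu_j)^2 / (2 s^2)), plus a constant bias column."""
    x = np.asarray(x, dtype=float).ravel()[:, None]
    phi = np.exp(-(x - np.asarray(centers)[None, :]) ** 2 / (2.0 * s ** 2))
    return np.hstack([np.ones((x.shape[0], 1)), phi])    # shape (N, len(centers) + 1)
```

Either way, the model y(x, w) = wᵀ𝜙(x) stays linear in w even though it is non-linear in x.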
Outline for Module M6
• M6. Regression (Linear models)
• M6.0 Introduction/context
• Problem formulation
• Motivating examples, incl. basis functions
• M6.1 Linear regression approaches
• Direct approach I: least-squares (error) (𝒘𝑳𝑺 )
• Direct approach II: least-squares (error) with regularization (𝑤𝑅𝐿𝑆 )
• Discriminative model approach I: MLE (𝑤𝑀𝐿 = 𝑤𝐿𝑆 )
• Discriminative model approach IIa: Bayesian linear regression (𝑤𝑀𝐴𝑃 = 𝑤𝑅𝐿𝑆 )
• Discriminative model approach IIb: Bayesian linear regression (posterior & predictions)
• M6.2 Model Complexity/Selection
• Motivation (hyperparameter tuning to avoid overfitting)
• Frequentist view (bias-variance decomposition)
• M6.3 Concluding thoughts
Linear regression: a direct approach
Approach: minimize the sum-of-squares error; aka the least-squares solution/approach:
$\min_{\mathbf{w}} E_D(\mathbf{w}) = \min_{\mathbf{w}} \frac{1}{2}\sum_{n=1}^{N}\bigl\{t_n - y(\mathbf{x}_n, \mathbf{w})\bigr\}^2$, where $y(\mathbf{x}, \mathbf{w}) = \mathbf{w}^T \boldsymbol{\phi}(\mathbf{x})$
Solution: $\mathbf{w}_{LS}$ that minimizes $E_D(\mathbf{w})$ (via matrix notation): $\mathbf{w}_{LS} = (\boldsymbol{\Phi}^T\boldsymbol{\Phi})^{-1}\boldsymbol{\Phi}^T\mathbf{t}$, where $\boldsymbol{\Phi}$ is the $N \times M$ design matrix with rows $\boldsymbol{\phi}(\mathbf{x}_n)^T$.
[CMB, HR]
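As a small illustration of the closed-form least-squares solution above (an assumed sketch, not the slides' code):

```python
import numpy as np

def fit_least_squares(Phi, t):
    """Return w_LS minimizing E_D(w) = 0.5 * ||t - Phi @ w||^2."""
    # Equivalent to solving the normal equations Phi^T Phi w = Phi^T t, but
    # np.linalg.lstsq (pseudo-inverse based) is numerically more stable than
    # explicitly inverting Phi^T Phi.
    w_ls, *_ = np.linalg.lstsq(Phi, t, rcond=None)
    return w_ls
```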
Solution: Geometry of the least-squares solution
[CMB, HR]
LA Refresher
• See Appendix for LA refresher on
• related topic of Ax=b when no solution is possible, and
• some matrix-vector gradient formulas/tricks.
Recall LA: To “solve” 𝐴𝑥 = 𝑏 in the least-squares sense when no exact solution exists, we premult. by 𝐴ᵀ and solve the normal eqn. 𝐴ᵀ𝐴𝑥 = 𝐴ᵀ𝑏.
Ex.: Prove:
1) at least one soln. x* exists for the normal eqn.
2) the soln. x* is unique iff (𝐴ᵀ𝐴) is invertible (⇔ 𝐴 has lin. indep. cols.)
3) there are infinitely many solns. x* if (𝐴ᵀ𝐴) is non-invertible (⇔ 𝐴 has lin. dep. cols.)
Ex.:
i) Prove 𝑁𝑆(𝐴) = 𝑁𝑆(𝐴𝑇 𝐴).
ii) Use orthog. complementarity of 𝑁𝑆(𝐴𝑇 ), 𝐶𝑆(𝐴) to derive normal eqns.
Outline for Module M6
• M6. Regression (Linear models)
• M6.0 Introduction/context
• Problem formulation
• Motivating examples, incl. basis functions
• M6.1 Linear regression approaches
• Direct approach I: least-squares (error) (𝑤𝐿𝑆 )
• Direct approach II: least-squares (error) with regularization (𝒘𝑹𝑳𝑺 )
• Discriminative model approach I: MLE (𝑤𝑀𝐿 = 𝑤𝐿𝑆 )
• Discriminative model approach IIa: Bayesian linear regression (𝑤𝑀𝐴𝑃 = 𝑤𝑅𝐿𝑆 )
• Discriminative model approach IIb: Bayesian linear regression (posterior & predictions)
• M6.2 Model Complexity/Selection
• Motivation (hyperparameter tuning to avoid overfitting)
• Frequentist view (bias-variance decomposition)
• M6.3 Concluding thoughts
From sum-of-squares to regularized error!
Running example: polynomial curve fitting ((via the sum-of-squares error function / least-squares approach))
$\min_{\mathbf{w}} E_D(\mathbf{w}) = \min_{\mathbf{w}} \frac{1}{2}\sum_{n=1}^{N}\bigl\{t_n - y(x_n, \mathbf{w})\bigr\}^2$, where $y(x, \mathbf{w}) = \mathbf{w}^T \boldsymbol{\phi}(x)$
[CMB]
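A runnable sketch of this running example (the data-generation details, e.g. the sin(2πx) target and the noise level, are illustrative assumptions in the spirit of [CMB], not values taken from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)
N = 10
x = np.linspace(0.0, 1.0, N)
t = np.sin(2 * np.pi * x) + rng.normal(scale=0.3, size=N)    # noisy sinusoidal targets

for M in (0, 1, 3, 9):                                        # polynomial order
    Phi = np.vander(x, M + 1, increasing=True)                # design matrix, phi_j(x) = x**j
    w, *_ = np.linalg.lstsq(Phi, t, rcond=None)               # least-squares fit
    rms = np.sqrt(np.mean((Phi @ w - t) ** 2))
    print(f"order M = {M}: training RMS error = {rms:.3f}")   # training error drops as M grows
```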
Polynomial Curve Fitting
[CMB]
0th Order Polynomial
[CMB]
1st Order Polynomial
[CMB]
3rd Order Polynomial
[CMB]
9th Order Polynomial
[CMB]
Brainstorm: What degree polynomial would
you choose?
Over-fitting
[CMB]
Polynomial Coefficients
[CMB]
The role of data set size N?
Data Set Size: 𝑁 = 10, 9th Order Polynomial
[CMB]
Data Set Size:
9th Order Polynomial
[CMB]
Data Set Size:
9th Order Polynomial
[CMB]
How to take care of both data set size and
model complexity tradeoffs?
Regularization
• Penalize large coefficient values
[CMB]
Polynomial Coefficients
[CMB]
Regularization:
[CMB]
Regularization:
[CMB]
Regularization: vs.
[CMB]
Now, let’s see how to solve the minimization problem!
$\min_{\mathbf{w}} \ \frac{1}{2}\sum_{n=1}^{N}\bigl\{t_n - \mathbf{w}^T\boldsymbol{\phi}(\mathbf{x}_n)\bigr\}^2 + \frac{\lambda}{2}\mathbf{w}^T\mathbf{w}$
• which is minimized by $\mathbf{w}_{RLS} = (\lambda\mathbf{I} + \boldsymbol{\Phi}^T\boldsymbol{\Phi})^{-1}\boldsymbol{\Phi}^T\mathbf{t}$
[CMB]
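A minimal sketch of that closed-form regularized solution (assumed helper name; lam is the regularization coefficient λ):

```python
import numpy as np

def fit_ridge(Phi, t, lam):
    """Return w_RLS = (lam * I + Phi^T Phi)^{-1} Phi^T t (quadratic / ridge regularizer)."""
    M = Phi.shape[1]
    A = lam * np.eye(M) + Phi.T @ Phi
    return np.linalg.solve(A, Phi.T @ t)   # solve the linear system rather than inverting A
```

As λ → 0 this recovers the unregularized least-squares solution (when Φᵀ Φ is invertible).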
Regularized Least Squares (2)
• With a more general regularizer, we have
$\frac{1}{2}\sum_{n=1}^{N}\bigl\{t_n - \mathbf{w}^T\boldsymbol{\phi}(\mathbf{x}_n)\bigr\}^2 + \frac{\lambda}{2}\sum_{j=1}^{M}|w_j|^q$
(q = 1: lasso; q = 2: quadratic/ridge)
[CMB]
Regularized Least Squares (3)
• Lasso tends to generate sparser solutions than a ridge (quadratic) regularizer.
• Regularization aka penalization/weight-decay in ML or parameter shrinkage in statistics literature.
[CMB]
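An illustrative comparison of the two regularizers using scikit-learn (an assumed setup, not from the slides): with only a few truly relevant features, the lasso typically zeros out most coefficients, while ridge merely shrinks them.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 20))
w_true = np.zeros(20)
w_true[:3] = [2.0, -1.0, 0.5]                      # only 3 of 20 features matter
y = X @ w_true + rng.normal(scale=0.1, size=50)

ridge = Ridge(alpha=1.0).fit(X, y)                 # q = 2 penalty
lasso = Lasso(alpha=0.1).fit(X, y)                 # q = 1 penalty
print("non-zero ridge coefs:", int(np.sum(np.abs(ridge.coef_) > 1e-6)))   # typically all 20
print("non-zero lasso coefs:", int(np.sum(np.abs(lasso.coef_) > 1e-6)))   # typically close to 3
```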
Outline for Module M6
• M6. Regression (Linear models)
• M6.0 Introduction/context
• Problem formulation
• Motivating examples, incl. basis functions
• M6.1 Linear regression approaches
• Direct approach I: least-squares (error) (𝑤𝐿𝑆 )
• Direct approach II: least-squares (error) with regularization (𝑤𝑅𝐿𝑆 )
• Discriminative model approach I: MLE (𝒘𝑴𝑳 = 𝒘𝑳𝑺 )
• Discriminative model approach IIa: Bayesian linear regression (𝑤𝑀𝐴𝑃 = 𝑤𝑅𝐿𝑆 )
• Discriminative model approach IIb: Bayesian linear regression (posterior & predictions)
• M6.2 Model Complexity/Selection
• Motivation (hyperparameter tuning to avoid overfitting)
• Frequentist view (bias-variance decomposition)
• M6.3 Concluding thoughts
A different view of min. E(w) or its regularized
version?
Q: How do you convert this intuition/empirical-art into science, and
derive E(w) or its (other) regularized versions systematically?
A: Probabilistic view helps. Discriminative Approach: Model 𝑝(𝑡|𝑥)
Q: What has regression got to do with our
previous topic: density estimation?
• Brainstorm: how to model P(t|x)?
Q: What has regression got to do with our
previous topics?
A: P(t|x) captures the input-output map. Steps involved are:
(1) Model/estimate P(t|x)
(how? Density Estimation; MLE/Bayesian Inference)
(2) Predict t for a new x from estimated P(t|x)
(how? Decision Theory; e.g., 𝑦(x𝒏𝒆𝒘 ) = 𝐸[𝒕|x = x𝒏𝒆𝒘 ])
Curve Fitting: Going to the basics!
using a Probabilistic Model & its Density Estimation (MLE/Bayesian)
[CMB]
ML estimation
[CMB]
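The argument behind this slide, written out in the standard form from [CMB] (Gaussian noise model t = y(x, w) + ε with ε ~ N(0, β⁻¹); notation as in the slides):

```latex
\begin{align*}
p(\mathbf{t}\mid \mathbf{X},\mathbf{w},\beta)
  &= \prod_{n=1}^{N}\mathcal{N}\!\bigl(t_n \mid \mathbf{w}^{T}\boldsymbol{\phi}(\mathbf{x}_n),\ \beta^{-1}\bigr),\\
\ln p(\mathbf{t}\mid \mathbf{X},\mathbf{w},\beta)
  &= \frac{N}{2}\ln\beta - \frac{N}{2}\ln(2\pi) - \beta\,E_D(\mathbf{w}),
  \quad\text{where } E_D(\mathbf{w}) = \frac{1}{2}\sum_{n=1}^{N}\bigl\{t_n - \mathbf{w}^{T}\boldsymbol{\phi}(\mathbf{x}_n)\bigr\}^{2}.
\end{align*}
% Maximizing over w ignores the beta-only terms, so w_ML minimizes E_D(w), i.e., w_ML = w_LS.
```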
Summary: Linear model for regression - $\mathbf{w}_{ML} = \mathbf{w}_{LS}$
[CMB]
Addtnl. Advantage: ML Predictive Distribution
[CMB]
Outline for Module M6
• M6. Regression (Linear models)
• M6.0 Introduction/context
• Problem formulation
• Motivating examples, incl. basis functions
• M6.1 Linear regression approaches
• Direct approach I: least-squares (error) (𝑤𝐿𝑆 )
• Direct approach II: least-squares (error) with regularization (𝑤𝑅𝐿𝑆 )
• Discriminative model approach I: MLE (𝑤𝑀𝐿 = 𝑤𝐿𝑆 )
• Discriminative model approach IIa: Bayesian linear regression (𝒘𝑴𝑨𝑷 = 𝒘𝑹𝑳𝑺 )
• Discriminative model approach IIb: Bayesian linear regression (posterior & predictions)
• M6.2 Model Complexity/Selection
• Motivation (hyperparameter tuning to avoid overfitting)
• Frequentist view (bias-variance decomposition)
• M6.3 Concluding thoughts
Recall: Probab. Model & its Density Estimation (MLE)
[CMB]
Bayesian inference: what would you model as
a rv instead of a fixed value?
• Brainstorm
• What would you model?
• What would you infer?
Relation between Bayesian MAP and regularized linear regression
• Assume a zero-mean Gaussian prior $p(\mathbf{w}\mid\alpha) = \mathcal{N}(\mathbf{w}\mid\mathbf{0}, \alpha^{-1}\mathbf{I})$ (shrinks w towards 0).
• Then maximizing the posterior over $\mathbf{w}$ gives $\mathbf{w}_{MAP} = \mathbf{w}_{RLS}$ (with $\lambda = \alpha/\beta$).
[cf. full details of proof in next slide]
[CMB]
Full details of the 𝑤_MAP derivation, & the 𝑤_MAP = 𝑤_RLS proof!
$\ln p(\mathbf{w}\mid\alpha) = \frac{M}{2}\ln\alpha - \frac{M}{2}\ln(2\pi) - \frac{\alpha}{2}\mathbf{w}^T\mathbf{w}$
Adding this to the log-likelihood, maximizing the ln(posterior) is the same as maximizing $-\beta E_D(\mathbf{w}) - \frac{\alpha}{2}\|\mathbf{w}\|^2$ (ignoring terms indept. of $\mathbf{w}$), or equivalently
minimizing $\beta E_D(\mathbf{w}) + \frac{\alpha}{2}\|\mathbf{w}\|^2 = \beta\bigl(E_D(\mathbf{w}) + \tfrac{\alpha}{\beta}E_W(\mathbf{w})\bigr) = \beta\tilde{E}(\mathbf{w})$, with $E_W(\mathbf{w}) := \tfrac{1}{2}\mathbf{w}^T\mathbf{w}$,
or equivalently minimizing $\tilde{E}(\mathbf{w})$. ((Here, we set $\lambda := \alpha/\beta$.))
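A quick numerical sanity check of 𝑤_MAP = 𝑤_RLS (an illustrative sketch with made-up data; the α, β values and matrix sizes are arbitrary assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)
Phi = rng.normal(size=(30, 5))                     # design matrix
t = rng.normal(size=30)                            # targets
alpha, beta = 2.0, 25.0                            # prior and noise precisions

# MAP: minimize beta*E_D(w) + (alpha/2)*||w||^2  =>  (alpha*I + beta*Phi^T Phi) w = beta*Phi^T t
w_map = np.linalg.solve(alpha * np.eye(5) + beta * Phi.T @ Phi, beta * Phi.T @ t)

# RLS with lambda = alpha / beta:  (lambda*I + Phi^T Phi) w = Phi^T t
lam = alpha / beta
w_rls = np.linalg.solve(lam * np.eye(5) + Phi.T @ Phi, Phi.T @ t)

print(np.allclose(w_map, w_rls))                   # True
```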
Changing the prior from Gaussian to Laplacian!
• Prior: $p(\mathbf{w}) = p(w_0)\, p(w_1) \cdots p(w_{M-1})$
• What if $p(w_i)$ changed from
$\sqrt{\tfrac{\alpha}{2\pi}}\, \exp\{-\tfrac{\alpha}{2} w_i^2\} \ \to\ \tfrac{\alpha}{4}\, \exp\{-\tfrac{\alpha}{2} |w_i|\}$ ?
• Then $p(\mathbf{w})$ changes from
$\bigl(\tfrac{\alpha}{2\pi}\bigr)^{M/2} \exp\bigl(-\tfrac{\alpha}{2}\|\mathbf{w}\|_2^2\bigr) \ \to\ \bigl(\tfrac{\alpha}{4}\bigr)^{M} \exp\bigl(-\tfrac{\alpha}{2}\|\mathbf{w}\|_1\bigr)$
Recall: a general prior p(𝒘) that generalizes both the Gaussian and Laplacian priors
Outline for Module M6
• M6. Regression (Linear models)
• M6.0 Introduction/context
• Problem formulation
• Motivating examples, incl. basis functions
• M6.1 Linear regression approaches
• Direct approach I: least-squares (error) (𝑤𝐿𝑆 )
• Direct approach II: least-squares (error) with regularization (𝑤𝑅𝐿𝑆 )
• Discriminative model approach I: MLE (𝑤𝑀𝐿 = 𝑤𝐿𝑆 )
• Discriminative model approach IIa: Bayesian linear regression (𝑤𝑀𝐴𝑃 = 𝑤𝑅𝐿𝑆 )
• Discriminative model approach IIb: Bayesian linear regression (posterior & predictions)
• M6.2 Model Complexity/Selection
• Motivation (hyperparameter tuning to avoid overfitting)
• Frequentist view (bias-variance decomposition)
• M6.3 Concluding thoughts
Bayesian inference: second and third steps
• Assume Gaussian prior for 𝒘 going forward.
• We don’t want just a single-point estimate (MAP) of 𝒘; we want to use the given set of training data points not just to infer one optimal model specified by 𝑤_MAP, but to keep all possible models 𝒘 together with the training dataset’s support (posterior probab.) for each such model 𝒘.
Bayesian Linear Regression Example (1)
0 data points observed
[CMB]
Bayesian Linear Regression Example (2)
1 data point observed
[CMB]
Bayesian Linear Regression Example (3)
2 data points observed
[CMB]
Bayesian Linear Regression Example (4)
20 data points observed
[CMB]
From example posterior plots to
Full posterior in the general case:
Bayesian Linear Regression
• Define a conjugate prior over w. A common choice is $p(\mathbf{w}) = \mathcal{N}(\mathbf{w}\mid\mathbf{m}_0, \mathbf{S}_0)$.
• Combining this with the likelihood function and using results for marginal and conditional Gaussian distributions gives the posterior $p(\mathbf{w}\mid\mathbf{t}) = \mathcal{N}(\mathbf{w}\mid\mathbf{m}_N, \mathbf{S}_N)$, with $\mathbf{S}_N^{-1} = \mathbf{S}_0^{-1} + \beta\boldsymbol{\Phi}^T\boldsymbol{\Phi}$.
[Qn. What is $\mathbf{m}_N$?]
[CMB]
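A minimal sketch of this general posterior update (the standard result, following [CMB]; the helper name is an assumption):

```python
import numpy as np

def posterior(Phi, t, m0, S0, beta):
    """Posterior N(w | mN, SN) for prior N(w | m0, S0) and noise precision beta."""
    S0_inv = np.linalg.inv(S0)
    SN = np.linalg.inv(S0_inv + beta * Phi.T @ Phi)      # S_N^{-1} = S_0^{-1} + beta * Phi^T Phi
    mN = SN @ (S0_inv @ m0 + beta * Phi.T @ t)           # m_N = S_N (S_0^{-1} m_0 + beta * Phi^T t)
    return mN, SN
```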
Recall: MVG Handy Results (cheat-sheet)
((again example plots first, and post. predictive distbn. form for the
general case after that.))
Example Predictive Distribution (1)
• Example: Sinusoidal data, 9 Gaussian basis fns. (𝜙: ℝ → ℝ9 ), 1 data point
[CMB]
Example Predictive Distribution (2)
• Example: Sinusoidal data, 9 Gaussian basis functions, 2 data points
[CMB]
Example Predictive Distribution (3)
• Example: Sinusoidal data, 9 Gaussian basis functions, 4 data points
[CMB]
Example Predictive Distribution (4)
• Example: Sinusoidal data, 9 Gaussian basis functions, 25 data points
[CMB]
(Bayesian/Posterior) Predictive Distribution (1)
• Predict 𝑡 for new values of 𝑥 by integrating over w (model-averaging):
$p(t \mid x, \mathbf{t}, \alpha, \beta) = \int p(t \mid x, \mathbf{w}, \beta)\, p(\mathbf{w} \mid \mathbf{t}, \alpha, \beta)\, d\mathbf{w} = \mathcal{N}\bigl(t \mid \mathbf{m}_N^T \boldsymbol{\phi}(x),\ \sigma_N^2(x)\bigr)$
• where $\sigma_N^2(x) = \frac{1}{\beta} + \boldsymbol{\phi}(x)^T \mathbf{S}_N \boldsymbol{\phi}(x)$
Exercise: Prove that mN = 𝑤𝑅𝐿𝑆 = 𝑤𝑀𝐴𝑃 (recall 𝜆 ≔ 𝛼/𝛽), and hence that the posterior
predictive mean is same as that of the predicted value in the direct RLS approach. [CMB]
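A matching sketch of the predictive mean and variance at a new input (assumed helper; phi_x stands for 𝜙(x_new), and mN, SN come from the posterior above):

```python
import numpy as np

def predictive(phi_x, mN, SN, beta):
    """Posterior predictive N(t | mN^T phi(x), 1/beta + phi(x)^T SN phi(x))."""
    mean = mN @ phi_x
    var = 1.0 / beta + phi_x @ SN @ phi_x
    return mean, var
```

The variance has two parts: the observation noise 1/β, and the uncertainty 𝜙(x)ᵀ S_N 𝜙(x) coming from w, which shrinks as more data points are observed (as in the plots above).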
Linear regression direct vs. discriminative
model approaches: summary
Discriminative model-based approaches have two advantages:
1. They convert intuition for obj. fns. into probabilistic-model-driven motivations:
(Least-squares, i.e. min $E(\mathbf{w})$): $\mathbf{w}_{LS} = \mathbf{w}_{ML}$ (MLE)
(Regularized least-squares, i.e. min $\tilde{E}(\mathbf{w})$): $\mathbf{w}_{RLS} = \mathbf{w}_{MAP}$ (Bayesian MAP)
2. They give the additional advantage of capturing the uncertainty over the predicted values, viz., a predictive distribution
$p(t \mid x) = \mathcal{N}\bigl(t \mid y(x, \mathbf{w}_*) := \mathbf{w}_*^T \boldsymbol{\phi}(x),\ \mathrm{var}\bigr)$, and not just a single predicted value $y(x, \mathbf{w}_*)$ as in the direct approach.
[CMB]
Regularization, and Hyperparam. tuning
[CMB]
Bias-variance analysis proof
Now, simply view $y(\mathbf{x}; D) - h(\mathbf{x})$ as a random variable $Z$, and apply the variance formula
$\mathrm{Var}_D(Z) = E_D[Z^2] - (E_D[Z])^2$ to get the bias-variance decomposition of the error below:
[CMB]
Bias-variance decomposition in formula
Exercise: cf. worksheet for careful understanding of what random variables the expectation above is
taken over!
[CMB]
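The decomposition referred to above, written out in the standard form from [CMB] (here h(x) = E[t | x] is the optimal regression function and D the training set, as on the previous slide):

```latex
\begin{align*}
\text{expected loss}
 &= \underbrace{\int \bigl\{\mathbb{E}_D[\,y(\mathbf{x};D)\,] - h(\mathbf{x})\bigr\}^{2}\, p(\mathbf{x})\,d\mathbf{x}}_{(\text{bias})^{2}}
  + \underbrace{\int \mathbb{E}_D\!\Bigl[\bigl\{y(\mathbf{x};D) - \mathbb{E}_D[\,y(\mathbf{x};D)\,]\bigr\}^{2}\Bigr]\, p(\mathbf{x})\,d\mathbf{x}}_{\text{variance}} \\
 &\quad + \underbrace{\iint \bigl\{h(\mathbf{x}) - t\bigr\}^{2}\, p(\mathbf{x}, t)\,d\mathbf{x}\,dt}_{\text{noise}}.
\end{align*}
```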
Bias-variance in pictures (for an example)
[CMB]
Bias-variance analysis (for the example)
[CMB]
Bias-variance analysis: applicability in practice?
[CMB]
Bias-variance analysis: applicability in practice? (details)
[CMB]
Concluding thoughts
• Linear regression forms the foundation of other, more sophisticated methods, so it is worth investing enough time in it.
• Two views: direct loss fn. view (E(w)/regularization) & probabilistic model view (MLE/Bayesian)
• But lin. regn. has limitations in practice, even with non-linear basis functions and despite its closed-form solutions and other analytical advantages.
• Mainly because the basis functions are fixed before seeing the training data (leading to the curse of dimensionality as the dimensionality D of the feature vectors grows).
• Next steps:
• linear models for classification, which play a similar foundational role for other classification methods.
• Move from fixed basis fns to selection of basis functions or adaptation of basis
functions using training data, in later lectures on non-linear models.
Thank you!
Backup slides
Linear Algebra (LA) Refresher
• Switch to LN Pandey’s notes
LA Cheat Sheet: The four subspaces of a matrix
[HR]
Recall LA: To “solve” 𝐴𝑥 = 𝑏 in the least-squares sense when no exact solution exists, we premult. by 𝐴ᵀ and solve the normal eqn. 𝐴ᵀ𝐴𝑥 = 𝐴ᵀ𝑏.
Ex.: Prove:
1) at least one soln. x* exists for the normal eqn.
2) the soln. x* is unique iff (𝐴ᵀ𝐴) is invertible (⇔ 𝐴 has lin. indep. cols.)
3) there are infinitely many solns. x* if (𝐴ᵀ𝐴) is non-invertible (⇔ 𝐴 has lin. dep. cols.)
Ex.:
i) Prove 𝑁𝑆(𝐴) = 𝑁𝑆(𝐴𝑇 𝐴).
ii) Use orthog. complementarity of 𝑁𝑆(𝐴𝑇 ), 𝐶𝑆(𝐴) to derive normal eqns.
Choice of Prior
Different priors on the parameter 𝜃 (≔ 𝑤) lead to…
…different regularizations (ridge vs. lasso regression)
Bias-variance analysis (alternate proof)
[CMB]