M6 Regression (Linear Models) v2
• Books:
• PRML by Bishop. (content, figures, slides, etc.) – cited as [CMB]
• Pattern Classification by Duda, Hart and Stork. (content, figures, etc.) – [DHS]
• Mathematics for ML by Deisenroth, Faisal and Ong. (content, figures, etc.) – [DFO]
• Information Theory, Inference and Learning Algorithms by David JC MacKay – [DJM]
Outline for Module M6
• M6. Regression (Linear models)
• M6.0 Introduction/context
• Problem formulation
• Motivating examples, incl. basis functions
• M6.1 Linear regression approaches
• Direct approach I: least-squares (error) (𝑤𝐿𝑆 )
• Direct approach II: least-squares (error) with regularization (𝑤𝑅𝐿𝑆 )
• Discriminative model approach I: MLE (𝑤𝑀𝐿 = 𝑤𝐿𝑆 )
• Discriminative model approach IIa: Bayesian linear regression (𝑤𝑀𝐴𝑃 = 𝑤𝑅𝐿𝑆 )
• Discriminative model approach IIb: Bayesian linear regression (posterior & predictions)
• M6.2 Model Complexity/Selection
• Motivation (hyperparameter tuning to avoid overfitting)
• Frequentist view (bias-variance decomposition)
• M6.3 Concluding thoughts
Context: ML Paradigms
• Unsupervised Learning (informally aka “learning patterns from (unlabelled) data”)
• Density estimation
• Clustering
• Dimensionality reduction
• Supervised Learning (informally aka curve-fitting or function approximation or “function learning from (labelled) data”)
• Learn an input and output map (features x to target t)
• Regression: continuous output/target t
• Classification: categorical output
[BR,HR]
Regression: the problem
• Learn a map from input variables (features x) to a continuous output variable (target t).
Informally known as function approximation/learning or curve fitting: given pairs $\{(\mathbf{x}_n, t_n)\}_{n=1}^{N}$, we seek a function $y_{\mathbf{w}}(\mathbf{x})$ that “approximates” the map x → t.
[CMB]
Regression: what does “approximate” mean?
recall the three approaches (from Decision Theory)
• Generative model approach:
(I) Model the joint density p(x, t)
(I) Infer the conditional p(t | x) from p(x, t)
(D) Take the conditional mean/median/mode/any other optimal decision outcome as y(x)
[CMB]
Linear regression: from a linear combination of the input variables (x ∈ ℝ^D) to one of basis functions (𝜙(x) ∈ ℝ^M)
• Simplest model of linear regn. involving 𝐷 input vars.:
$y(\mathbf{x}, \mathbf{w}) = w_0 + w_1 x_1 + \dots + w_D x_D = w_0 \cdot 1 + \sum_{j=1}^{D} w_j x_j = \mathbf{w}^T \begin{pmatrix} 1 \\ \mathbf{x} \end{pmatrix}$, with $\mathbf{w} = (w_0, w_1, \dots, w_D)^T$
• Model of linear regn. involving 𝑀 basis fns. (fixed non-linear fns. of the input vars.):
$y(\mathbf{x}, \mathbf{w}) = w_0 \cdot 1 + \sum_{j=1}^{M-1} w_j \phi_j(\mathbf{x}) = \mathbf{w}^T \boldsymbol{\phi}(\mathbf{x})$, with $\phi_0(\mathbf{x}) \equiv 1$
• Estimation of fetal weight 𝑡 (actually log10 t) from ultrasound measurements: 𝒚 𝒙, 𝒘 linear in 𝒘, not 𝒙.
Example basis functions: polynomials (& extn. to spatially restricted splines), Fourier basis (sinusoidal fns.), wavelets, etc.
[CMB]
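To make the fixed-basis-function view concrete, here is a minimal sketch of building the design matrix Φ whose n-th row is 𝜙(xₙ)ᵀ (the helper names and the specific bases are illustrative assumptions, not from the slides):

```python
import numpy as np

def polynomial_design_matrix(x, M):
    """Phi[n, j] = x_n**j for j = 0..M-1; column j = 0 is the bias feature phi_0(x) = 1."""
    x = np.asarray(x, dtype=float).ravel()
    return np.vander(x, M, increasing=True)              # shape (N, M)

def gaussian_design_matrix(x, centers, s):
    """Gaussian basis fns. phi_j(x) = exp(-(x - mu_j)^2 / (2 s^2)), plus a constant bias column."""
    x = np.asarray(x, dtype=float).ravel()[:, None]
    phi = np.exp(-(x - np.asarray(centers)[None, :]) ** 2 / (2.0 * s ** 2))
    return np.hstack([np.ones((x.shape[0], 1)), phi])    # shape (N, len(centers) + 1)
```

Either way, the model y(x, w) = wᵀ𝜙(x) stays linear in w even though it is non-linear in x.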
Outline for Module M6
• M6. Regression (Linear models)
• M6.0 Introduction/context
• Problem formulation
• Motivating examples, incl. basis functions
• M6.1 Linear regression approaches
• Direct approach I: least-squares (error) (𝒘𝑳𝑺 )
• Direct approach II: least-squares (error) with regularization (𝑤𝑅𝐿𝑆 )
• Discriminative model approach I: MLE (𝑤𝑀𝐿 = 𝑤𝐿𝑆 )
• Discriminative model approach IIa: Bayesian linear regression (𝑤𝑀𝐴𝑃 = 𝑤𝑅𝐿𝑆 )
• Discriminative model approach IIb: Bayesian linear regression (posterior & predictions)
• M6.2 Model Complexity/Selection
• Motivation (hyperparameter tuning to avoid overfitting)
• Frequentist view (bias-variance decomposition)
• M6.3 Concluding thoughts
Linear regression: a direct approach
Approach: minimize the sum-of-squares error; aka the least-squares solution/approach:
$\min_{\mathbf{w}} E_D(\mathbf{w}) = \min_{\mathbf{w}} \frac{1}{2}\sum_{n=1}^{N}\bigl\{t_n - y(\mathbf{x}_n, \mathbf{w})\bigr\}^2$, where $y(\mathbf{x}, \mathbf{w}) = \mathbf{w}^T \boldsymbol{\phi}(\mathbf{x})$
Solution: $\mathbf{w}_{LS}$ that minimizes $E_D(\mathbf{w})$ (via matrix notation): $\mathbf{w}_{LS} = (\boldsymbol{\Phi}^T\boldsymbol{\Phi})^{-1}\boldsymbol{\Phi}^T\mathbf{t}$, where $\boldsymbol{\Phi}$ is the $N \times M$ design matrix with rows $\boldsymbol{\phi}(\mathbf{x}_n)^T$.
[CMB, HR]
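As a small illustration of the closed-form least-squares solution above (an assumed sketch, not the slides' code):

```python
import numpy as np

def fit_least_squares(Phi, t):
    """Return w_LS minimizing E_D(w) = 0.5 * ||t - Phi @ w||^2."""
    # Equivalent to solving the normal equations Phi^T Phi w = Phi^T t, but
    # np.linalg.lstsq (pseudo-inverse based) is numerically more stable than
    # explicitly inverting Phi^T Phi.
    w_ls, *_ = np.linalg.lstsq(Phi, t, rcond=None)
    return w_ls
```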
Solution: Geometry of the least-squares solution
[CMB, HR]
LA Refresher
• See Appendix for LA refresher on
• related topic of Ax=b when no solution is possible, and
• some matrix-vector gradient formulas/tricks.
Recall LA: To “solve” 𝐴𝑥 = 𝑏 in the least-squares sense when no exact solution exists, we premult. by 𝐴ᵀ and solve the normal eqn. 𝐴ᵀ𝐴𝑥 = 𝐴ᵀ𝑏.
Ex.: Prove:
1) at least one soln. x* exists for the normal eqn.
2) the soln. x* is unique iff (𝐴ᵀ𝐴) is invertible (⇔ 𝐴 has lin. indep. cols.)
3) there are infinitely many solns. x* if (𝐴ᵀ𝐴) is non-invertible (⇔ 𝐴 has lin. dep. cols.)
Ex.:
i) Prove 𝑁𝑆(𝐴) = 𝑁𝑆(𝐴𝑇 𝐴).
ii) Use orthog. complementarity of 𝑁𝑆(𝐴𝑇 ), 𝐶𝑆(𝐴) to derive normal eqns.
Outline for Module M6
• M6. Regression (Linear models)
• M6.0 Introduction/context
• Problem formulation
• Motivating examples, incl. basis functions
• M6.1 Linear regression approaches
• Direct approach I: least-squares (error) (𝑤𝐿𝑆 )
• Direct approach II: least-squares (error) with regularization (𝒘𝑹𝑳𝑺 )
• Discriminative model approach I: MLE (𝑤𝑀𝐿 = 𝑤𝐿𝑆 )
• Discriminative model approach IIa: Bayesian linear regression (𝑤𝑀𝐴𝑃 = 𝑤𝑅𝐿𝑆 )
• Discriminative model approach IIb: Bayesian linear regression (posterior & predictions)
• M6.2 Model Complexity/Selection
• Motivation (hyperparameter tuning to avoid overfitting)
• Frequentist view (bias-variance decomposition)
• M6.3 Concluding thoughts
From sum-of-squares to regularized error!
Running example: polynomial curve fitting ((via the sum-of-squares error function / least-squares approach))
$\min_{\mathbf{w}} E_D(\mathbf{w}) = \min_{\mathbf{w}} \frac{1}{2}\sum_{n=1}^{N}\bigl\{t_n - y(x_n, \mathbf{w})\bigr\}^2$, where $y(x, \mathbf{w}) = \mathbf{w}^T \boldsymbol{\phi}(x)$
[CMB]
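A runnable sketch of this running example (the data-generation details, e.g. the sin(2πx) target and the noise level, are illustrative assumptions in the spirit of [CMB], not values taken from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)
N = 10
x = np.linspace(0.0, 1.0, N)
t = np.sin(2 * np.pi * x) + rng.normal(scale=0.3, size=N)    # noisy sinusoidal targets

for M in (0, 1, 3, 9):                                        # polynomial order
    Phi = np.vander(x, M + 1, increasing=True)                # design matrix, phi_j(x) = x**j
    w, *_ = np.linalg.lstsq(Phi, t, rcond=None)               # least-squares fit
    rms = np.sqrt(np.mean((Phi @ w - t) ** 2))
    print(f"order M = {M}: training RMS error = {rms:.3f}")   # training error drops as M grows
```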
Polynomial Curve Fitting
[CMB]
0th Order Polynomial
[CMB]
1st Order Polynomial
[CMB]
3rd Order Polynomial
[CMB]
9th Order Polynomial
[CMB]
Brainstorm: What degree polynomial would
you choose?
Over-fitting
[CMB]
Polynomial Coefficients
[CMB]
The role of data set size N?
Data Set Size: 𝑁 = 10, 9th Order Polynomial
[CMB]
Data Set Size:
9th Order Polynomial
[CMB]
Data Set Size:
9th Order Polynomial
[CMB]
How to take care of both data set size and
model complexity tradeoffs?
Regularization
• Penalize large coefficient values
[CMB]
Polynomial Coefficients
[CMB]
Regularization:
[CMB]
Regularization:
[CMB]
Regularization: vs.
[CMB]
Now, let’s see how to solve the minimization problem!
$\min_{\mathbf{w}} \ \frac{1}{2}\sum_{n=1}^{N}\bigl\{t_n - \mathbf{w}^T\boldsymbol{\phi}(\mathbf{x}_n)\bigr\}^2 + \frac{\lambda}{2}\mathbf{w}^T\mathbf{w}$
• which is minimized by $\mathbf{w}_{RLS} = (\lambda\mathbf{I} + \boldsymbol{\Phi}^T\boldsymbol{\Phi})^{-1}\boldsymbol{\Phi}^T\mathbf{t}$
[CMB]
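A minimal sketch of that closed-form regularized solution (assumed helper name; lam is the regularization coefficient λ):

```python
import numpy as np

def fit_ridge(Phi, t, lam):
    """Return w_RLS = (lam * I + Phi^T Phi)^{-1} Phi^T t (quadratic / ridge regularizer)."""
    M = Phi.shape[1]
    A = lam * np.eye(M) + Phi.T @ Phi
    return np.linalg.solve(A, Phi.T @ t)   # solve the linear system rather than inverting A
```

As λ → 0 this recovers the unregularized least-squares solution (when Φᵀ Φ is invertible).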
Regularized Least Squares (2)
• With a more general regularizer, we have
$\frac{1}{2}\sum_{n=1}^{N}\bigl\{t_n - \mathbf{w}^T\boldsymbol{\phi}(\mathbf{x}_n)\bigr\}^2 + \frac{\lambda}{2}\sum_{j=1}^{M}|w_j|^q$
(q = 1: lasso; q = 2: quadratic/ridge)
[CMB]
Regularized Least Squares (3)
• Lasso tends to generate sparser solutions than a ridge (quadratic) regularizer.
• Regularization aka penalization/weight-decay in ML or parameter shrinkage in statistics literature.
[CMB]
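An illustrative comparison of the two regularizers using scikit-learn (an assumed setup, not from the slides): with only a few truly relevant features, the lasso typically zeros out most coefficients, while ridge merely shrinks them.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 20))
w_true = np.zeros(20)
w_true[:3] = [2.0, -1.0, 0.5]                      # only 3 of 20 features matter
y = X @ w_true + rng.normal(scale=0.1, size=50)

ridge = Ridge(alpha=1.0).fit(X, y)                 # q = 2 penalty
lasso = Lasso(alpha=0.1).fit(X, y)                 # q = 1 penalty
print("non-zero ridge coefs:", int(np.sum(np.abs(ridge.coef_) > 1e-6)))   # typically all 20
print("non-zero lasso coefs:", int(np.sum(np.abs(lasso.coef_) > 1e-6)))   # typically close to 3
```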
Outline for Module M6
• M6. Regression (Linear models)
• M6.0 Introduction/context
• Problem formulation
• Motivating examples, incl. basis functions
• M6.1 Linear regression approaches
• Direct approach I: least-squares (error) (𝑤𝐿𝑆 )
• Direct approach II: least-squares (error) with regularization (𝑤𝑅𝐿𝑆 )
• Discriminative model approach I: MLE (𝒘𝑴𝑳 = 𝒘𝑳𝑺 )
• Discriminative model approach IIa: Bayesian linear regression (𝑤𝑀𝐴𝑃 = 𝑤𝑅𝐿𝑆 )
• Discriminative model approach IIb: Bayesian linear regression (posterior & predictions)
• M6.2 Model Complexity/Selection
• Motivation (hyperparameter tuning to avoid overfitting)
• Frequentist view (bias-variance decomposition)
• M6.3 Concluding thoughts
A different view of min. E(w) or its regularized
version?
Q: How do you convert this intuition/empirical-art into science, and
derive E(w) or its (other) regularized versions systematically?
A: Probabilistic view helps. Discriminative Approach: Model 𝑝(𝑡|𝑥)
Q: What has regression got to do with our
previous topic: density estimation?
• Brainstorm: how to model P(t|x)?
Q: What has regression got to do with our
previous topics?
A: P(t|x) captures the input-output map. Steps involved are:
(1) Model/estimate P(t|x)
(how? Density Estimation; MLE/Bayesian Inference)
(2) Predict t for a new x from estimated P(t|x)
(how? Decision Theory; e.g., 𝑦(x𝒏𝒆𝒘 ) = 𝐸[𝒕|x = x𝒏𝒆𝒘 ])
Curve Fitting: Going to the basics!
using a Probabilistic Model & its Density Estimation (MLE/Bayesian)
[CMB]
ML estimation
[CMB]
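The argument behind this slide, written out in the standard form from [CMB] (Gaussian noise model t = y(x, w) + ε with ε ~ N(0, β⁻¹); notation as in the slides):

```latex
\begin{align*}
p(\mathbf{t}\mid \mathbf{X},\mathbf{w},\beta)
  &= \prod_{n=1}^{N}\mathcal{N}\!\bigl(t_n \mid \mathbf{w}^{T}\boldsymbol{\phi}(\mathbf{x}_n),\ \beta^{-1}\bigr),\\
\ln p(\mathbf{t}\mid \mathbf{X},\mathbf{w},\beta)
  &= \frac{N}{2}\ln\beta - \frac{N}{2}\ln(2\pi) - \beta\,E_D(\mathbf{w}),
  \quad\text{where } E_D(\mathbf{w}) = \frac{1}{2}\sum_{n=1}^{N}\bigl\{t_n - \mathbf{w}^{T}\boldsymbol{\phi}(\mathbf{x}_n)\bigr\}^{2}.
\end{align*}
% Maximizing over w ignores the beta-only terms, so w_ML minimizes E_D(w), i.e., w_ML = w_LS.
```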
Summary: Linear model for regression - $\mathbf{w}_{ML} = \mathbf{w}_{LS}$
[CMB]
Addtnl. Advantage: ML Predictive Distribution
[CMB]
Outline for Module M6
• M6. Regression (Linear models)
• M6.0 Introduction/context
• Problem formulation
• Motivating examples, incl. basis functions
• M6.1 Linear regression approaches
• Direct approach I: least-squares (error) (𝑤𝐿𝑆 )
• Direct approach II: least-squares (error) with regularization (𝑤𝑅𝐿𝑆 )
• Discriminative model approach I: MLE (𝑤𝑀𝐿 = 𝑤𝐿𝑆 )
• Discriminative model approach IIa: Bayesian linear regression (𝒘𝑴𝑨𝑷 = 𝒘𝑹𝑳𝑺 )
• Discriminative model approach IIb: Bayesian linear regression (posterior & predictions)
• M6.2 Model Complexity/Selection
• Motivation (hyperparameter tuning to avoid overfitting)
• Frequentist view (bias-variance decomposition)
• M6.3 Concluding thoughts
Recall: Probab. Model & its Density Estimation (MLE)
[CMB]
Bayesian inference: what would you model as
a rv instead of a fixed value?
• Brainstorm
• What would you model?
• What would you infer?
Relation between Bayesian MAP and regularized linear regression
• Assume a zero-mean Gaussian prior $p(\mathbf{w}\mid\alpha) = \mathcal{N}(\mathbf{w}\mid\mathbf{0}, \alpha^{-1}\mathbf{I})$ (shrinks w towards 0).
• Then maximizing the posterior over $\mathbf{w}$ gives $\mathbf{w}_{MAP} = \mathbf{w}_{RLS}$ (with $\lambda = \alpha/\beta$).
[cf. full details of proof in next slide]
[CMB]
Full details of the 𝑤_MAP derivation, & the 𝑤_MAP = 𝑤_RLS proof!
$\ln p(\mathbf{w}\mid\alpha) = \frac{M}{2}\ln\alpha - \frac{M}{2}\ln(2\pi) - \frac{\alpha}{2}\mathbf{w}^T\mathbf{w}$
Adding this to the log-likelihood, maximizing the ln(posterior) is the same as maximizing $-\beta E_D(\mathbf{w}) - \frac{\alpha}{2}\|\mathbf{w}\|^2$ (ignoring terms indept. of $\mathbf{w}$), or equivalently
minimizing $\beta E_D(\mathbf{w}) + \frac{\alpha}{2}\|\mathbf{w}\|^2 = \beta\bigl(E_D(\mathbf{w}) + \tfrac{\alpha}{\beta}E_W(\mathbf{w})\bigr) = \beta\tilde{E}(\mathbf{w})$, with $E_W(\mathbf{w}) := \tfrac{1}{2}\mathbf{w}^T\mathbf{w}$,
or equivalently minimizing $\tilde{E}(\mathbf{w})$. ((Here, we set $\lambda := \alpha/\beta$.))
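A quick numerical sanity check of 𝑤_MAP = 𝑤_RLS (an illustrative sketch with made-up data; the α, β values and matrix sizes are arbitrary assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)
Phi = rng.normal(size=(30, 5))                     # design matrix
t = rng.normal(size=30)                            # targets
alpha, beta = 2.0, 25.0                            # prior and noise precisions

# MAP: minimize beta*E_D(w) + (alpha/2)*||w||^2  =>  (alpha*I + beta*Phi^T Phi) w = beta*Phi^T t
w_map = np.linalg.solve(alpha * np.eye(5) + beta * Phi.T @ Phi, beta * Phi.T @ t)

# RLS with lambda = alpha / beta:  (lambda*I + Phi^T Phi) w = Phi^T t
lam = alpha / beta
w_rls = np.linalg.solve(lam * np.eye(5) + Phi.T @ Phi, Phi.T @ t)

print(np.allclose(w_map, w_rls))                   # True
```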
Changing the prior from Gaussian to Laplacian!
• Prior: $p(\mathbf{w}) = p(w_0)\, p(w_1) \cdots p(w_{M-1})$
• What if $p(w_i)$ changed from
$\sqrt{\tfrac{\alpha}{2\pi}}\, \exp\{-\tfrac{\alpha}{2} w_i^2\} \ \to\ \tfrac{\alpha}{4}\, \exp\{-\tfrac{\alpha}{2} |w_i|\}$ ?
• Then $p(\mathbf{w})$ changes from
$\bigl(\tfrac{\alpha}{2\pi}\bigr)^{M/2} \exp\bigl(-\tfrac{\alpha}{2}\|\mathbf{w}\|_2^2\bigr) \ \to\ \bigl(\tfrac{\alpha}{4}\bigr)^{M} \exp\bigl(-\tfrac{\alpha}{2}\|\mathbf{w}\|_1\bigr)$
Recall: a general prior p(𝒘) that generalizes both the Gaussian and Laplacian priors
Outline for Module M6
• M6. Regression (Linear models)
• M6.0 Introduction/context
• Problem formulation
• Motivating examples, incl. basis functions
• M6.1 Linear regression approaches
• Direct approach I: least-squares (error) (𝑤𝐿𝑆 )
• Direct approach II: least-squares (error) with regularization (𝑤𝑅𝐿𝑆 )
• Discriminative model approach I: MLE (𝑤𝑀𝐿 = 𝑤𝐿𝑆 )
• Discriminative model approach IIa: Bayesian linear regression (𝑤𝑀𝐴𝑃 = 𝑤𝑅𝐿𝑆 )
• Discriminative model approach IIb: Bayesian linear regression (posterior & predictions)
• M6.2 Model Complexity/Selection
• Motivation (hyperparameter tuning to avoid overfitting)
• Frequentist view (bias-variance decomposition)
• M6.3 Concluding thoughts
Bayesian inference: second and third steps
• Assume Gaussian prior for 𝒘 going forward.
• We don’t want just a single-point estimate (MAP) of 𝒘; we want to use the given set of training data points not just to infer one optimal model specified by 𝑤_MAP, but to keep all possible models 𝒘 together with the training dataset’s support (posterior probab.) for each such model 𝒘.
Bayesian Linear Regression Example (1)
0 data points observed
[CMB]
Bayesian Linear Regression Example (2)
1 data point observed
[CMB]
Bayesian Linear Regression Example (3)
2 data points observed
[CMB]
Bayesian Linear Regression Example (4)
20 data points observed
[CMB]
From example posterior plots to
Full posterior in the general case:
Bayesian Linear Regression
• Define a conjugate prior over w. A common choice is $p(\mathbf{w}) = \mathcal{N}(\mathbf{w}\mid\mathbf{m}_0, \mathbf{S}_0)$.
• Combining this with the likelihood function and using results for marginal and conditional Gaussian distributions gives the posterior $p(\mathbf{w}\mid\mathbf{t}) = \mathcal{N}(\mathbf{w}\mid\mathbf{m}_N, \mathbf{S}_N)$, with $\mathbf{S}_N^{-1} = \mathbf{S}_0^{-1} + \beta\boldsymbol{\Phi}^T\boldsymbol{\Phi}$.
[Qn. What is $\mathbf{m}_N$?]
[CMB]
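A minimal sketch of this general posterior update (the standard result, following [CMB]; the helper name is an assumption):

```python
import numpy as np

def posterior(Phi, t, m0, S0, beta):
    """Posterior N(w | mN, SN) for prior N(w | m0, S0) and noise precision beta."""
    S0_inv = np.linalg.inv(S0)
    SN = np.linalg.inv(S0_inv + beta * Phi.T @ Phi)      # S_N^{-1} = S_0^{-1} + beta * Phi^T Phi
    mN = SN @ (S0_inv @ m0 + beta * Phi.T @ t)           # m_N = S_N (S_0^{-1} m_0 + beta * Phi^T t)
    return mN, SN
```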
Recall: MVG Handy Results (cheat-sheet)
((again example plots first, and post. predictive distbn. form for the
general case after that.))
Example Predictive Distribution (1)
• Example: Sinusoidal data, 9 Gaussian basis fns. (𝜙: ℝ → ℝ9 ), 1 data point
[CMB]
Example Predictive Distribution (2)
• Example: Sinusoidal data, 9 Gaussian basis functions, 2 data points
[CMB]
Example Predictive Distribution (3)
• Example: Sinusoidal data, 9 Gaussian basis functions, 4 data points
[CMB]
Example Predictive Distribution (4)
• Example: Sinusoidal data, 9 Gaussian basis functions, 25 data points
[CMB]
(Bayesian/Posterior) Predictive Distribution (1)
• Predict 𝑡 for new values of 𝑥 by integrating over w (model-averaging):
$p(t \mid x, \mathbf{t}, \alpha, \beta) = \int p(t \mid x, \mathbf{w}, \beta)\, p(\mathbf{w} \mid \mathbf{t}, \alpha, \beta)\, d\mathbf{w} = \mathcal{N}\bigl(t \mid \mathbf{m}_N^T \boldsymbol{\phi}(x),\ \sigma_N^2(x)\bigr)$
• where $\sigma_N^2(x) = \frac{1}{\beta} + \boldsymbol{\phi}(x)^T \mathbf{S}_N \boldsymbol{\phi}(x)$
Exercise: Prove that mN = 𝑤𝑅𝐿𝑆 = 𝑤𝑀𝐴𝑃 (recall 𝜆 ≔ 𝛼/𝛽), and hence that the posterior
predictive mean is same as that of the predicted value in the direct RLS approach. [CMB]
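A matching sketch of the predictive mean and variance at a new input (assumed helper; phi_x stands for 𝜙(x_new), and mN, SN come from the posterior above):

```python
import numpy as np

def predictive(phi_x, mN, SN, beta):
    """Posterior predictive N(t | mN^T phi(x), 1/beta + phi(x)^T SN phi(x))."""
    mean = mN @ phi_x
    var = 1.0 / beta + phi_x @ SN @ phi_x
    return mean, var
```

The variance has two parts: the observation noise 1/β, and the uncertainty 𝜙(x)ᵀ S_N 𝜙(x) coming from w, which shrinks as more data points are observed (as in the plots above).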
Linear regression direct vs. discriminative
model approaches: summary
Discriminative model-based approaches have two advantages:
1. They convert intuition for obj. fns. into probabilistic-model-driven motivations:
(Least-squares, i.e. min $E(\mathbf{w})$): $\mathbf{w}_{LS} = \mathbf{w}_{ML}$ (MLE)
(Regularized least-squares, i.e. min $\tilde{E}(\mathbf{w})$): $\mathbf{w}_{RLS} = \mathbf{w}_{MAP}$ (Bayesian MAP)
2. They give the additional advantage of capturing the uncertainty over the predicted values, viz., a predictive distribution
$p(t \mid x) = \mathcal{N}\bigl(t \mid y(x, \mathbf{w}_*) := \mathbf{w}_*^T \boldsymbol{\phi}(x),\ \mathrm{var}\bigr)$, and not just a single predicted value $y(x, \mathbf{w}_*)$ as in the direct approach.
[CMB]
Regularization, and Hyperparam. tuning
[CMB]
Bias-variance analysis proof
Now, simply view $y(\mathbf{x}; D) - h(\mathbf{x})$ as a random variable $Z$, and apply the variance formula
$\mathrm{Var}_D(Z) = E_D[Z^2] - (E_D[Z])^2$ to get the bias-variance decomposition of the error below:
[CMB]
Bias-variance decomposition in formula
Exercise: cf. worksheet for careful understanding of what random variables the expectation above is
taken over!
[CMB]
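The decomposition referred to above, written out in the standard form from [CMB] (here h(x) = E[t | x] is the optimal regression function and D the training set, as on the previous slide):

```latex
\begin{align*}
\text{expected loss}
 &= \underbrace{\int \bigl\{\mathbb{E}_D[\,y(\mathbf{x};D)\,] - h(\mathbf{x})\bigr\}^{2}\, p(\mathbf{x})\,d\mathbf{x}}_{(\text{bias})^{2}}
  + \underbrace{\int \mathbb{E}_D\!\Bigl[\bigl\{y(\mathbf{x};D) - \mathbb{E}_D[\,y(\mathbf{x};D)\,]\bigr\}^{2}\Bigr]\, p(\mathbf{x})\,d\mathbf{x}}_{\text{variance}} \\
 &\quad + \underbrace{\iint \bigl\{h(\mathbf{x}) - t\bigr\}^{2}\, p(\mathbf{x}, t)\,d\mathbf{x}\,dt}_{\text{noise}}.
\end{align*}
```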
Bias-variance in pictures (for an example)
[CMB]
Bias-variance analysis (for the example)
[CMB]
Bias-variance analysis: applicability in practice?
[CMB]
Bias-variance analysis: applicability in practice? (details)
[CMB]
Concluding thoughts
• Linear regression forms the foundation of other, more sophisticated methods, so it is worth investing enough time in it.
• Two views: direct loss fn. view (E(w)/regularization) & probabilistic model view (MLE/Bayesian)
• But lin. regn. has limitations in practice, even with non-linear basis functions and despite its closed-form solutions and other analytical advantages.
• Mainly because the basis functions are fixed before seeing the training data (leading to the curse of dimensionality as the dimensionality D of the feature vectors grows).
• Next steps:
• linear models for classification, which play a similar foundational role for other classification methods.
• Move from fixed basis fns to selection of basis functions or adaptation of basis
functions using training data, in later lectures on non-linear models.
Thank you!
Backup slides
Linear Algebra (LA) Refresher
• Switch to LN Pandey’s notes
LA Cheat Sheet: The four subspaces of a matrix
[HR]
Recall LA: To “solve” 𝐴𝑥 = 𝑏 in the least-squares sense when no exact solution exists, we premult. by 𝐴ᵀ and solve the normal eqn. 𝐴ᵀ𝐴𝑥 = 𝐴ᵀ𝑏.
Ex.: Prove:
1) at least one soln. x* exists for the normal eqn.
2) the soln. x* is unique iff (𝐴ᵀ𝐴) is invertible (⇔ 𝐴 has lin. indep. cols.)
3) there are infinitely many solns. x* if (𝐴ᵀ𝐴) is non-invertible (⇔ 𝐴 has lin. dep. cols.)
Ex.:
i) Prove 𝑁𝑆(𝐴) = 𝑁𝑆(𝐴𝑇 𝐴).
ii) Use orthog. complementarity of 𝑁𝑆(𝐴𝑇 ), 𝐶𝑆(𝐴) to derive normal eqns.
Choice of Prior
Different priors on the parameter 𝜃 (≔ 𝑤) lead to…
…different regularizations (ridge vs. lasso regression)
Bias-variance analysis (alternate proof)
[CMB]