
Linear Regression

Volker Tresp
2017

1
Learning Machine: The Linear Model / ADALINE

• As with the Perceptron, we start with an activation function that is a linearly weighted sum of the inputs

$$h = \sum_{j=0}^{M-1} w_j x_j$$

(Note: $x_0 = 1$ is a constant input, so that $w_0$ is the bias)
• New: The activation is the output (no thresholding)

$$\hat{y} = f_w(x) = h$$

• Regression: the target function can take on real values
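A minimal sketch (not from the slides) of this model in NumPy; the weight and input values are made up for illustration:

```python
import numpy as np

# The activation h is the weighted sum of the inputs; unlike the Perceptron,
# it is passed through unchanged as the output y_hat = f_w(x) = h.
def predict(w, x):
    h = np.dot(w, x)   # h = sum_{j=0}^{M-1} w_j * x_j
    return h           # no thresholding

w = np.array([0.5, 1.0])   # w_0 is the bias
x = np.array([1.0, 0.3])   # x_0 = 1 is the constant input
print(predict(w, x))       # 0.5 + 1.0 * 0.3 = 0.8
```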

2
Method of Least Squares

• Squared-loss cost function:

$$\mathrm{cost}(w) = \sum_{i=1}^{N} (y_i - f_w(x_i))^2$$

• The parameters that minimize the cost function are called least squares (LS) estimators

$$w_{\mathrm{ls}} = \arg\min_w \mathrm{cost}(w)$$

• For visualization, one chooses $M = 2$ (although linear regression is often applied to high-dimensional inputs)
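A small illustrative sketch of this cost function in NumPy (the data layout, one row of $X$ per $x_i$ with the constant 1 in position 0, is an assumption used in the examples below):

```python
import numpy as np

# Squared-loss cost: sum over the N data points of (y_i - f_w(x_i))^2,
# where f_w(x_i) = x_i^T w.
def cost(w, X, y):
    total = 0.0
    for x_i, y_i in zip(X, y):        # one row of X per data point
        total += (y_i - x_i @ w) ** 2
    return total
```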

3
Least-squares Estimator for Regression

One-dimensional regression:

$$f_w(x) = w_0 + w_1 x, \qquad w = (w_0, w_1)^T$$

Squared error:

$$\mathrm{cost}(w) = \sum_{i=1}^{N} (y_i - f_w(x_i))^2$$

Goal:

$$w_{\mathrm{ls}} = \arg\min_w \mathrm{cost}(w)$$

(Figure: data generated with $w_0 = 1$, $w_1 = 2$, $\mathrm{var}(\epsilon) = 1$)
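A sketch of how such toy data can be generated; only the true parameters and the noise variance come from the figure caption, while the sample size and input range are assumptions:

```python
import numpy as np

# Generate 1-D regression data from the true model y = w_0 + w_1 x + eps
# with w_0 = 1, w_1 = 2 and noise variance var(eps) = 1.
rng = np.random.default_rng(0)
N = 20                                  # sample size: an assumption
x = rng.uniform(-1.0, 1.0, size=N)      # input range: an assumption
eps = rng.normal(0.0, 1.0, size=N)      # standard deviation 1 -> var(eps) = 1
y = 1.0 + 2.0 * x + eps
```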

4
Least-squares Estimator in General

General Model:

$$\hat{y}_i = f(x_i, w) = w_0 + \sum_{j=1}^{M-1} w_j x_{i,j} = x_i^T w$$

$$w = (w_0, w_1, \ldots, w_{M-1})^T, \qquad x_i = (1, x_{i,1}, \ldots, x_{i,M-1})^T$$

5
Linear Regression with Several Inputs

6
Contribution to the Cost Function of one Data Point

7
Gradient Descent Learning

• Initialize parameters (typically using small random numbers)


• Adapt the parameters in the direction of the negative gradient
• With

$$\mathrm{cost}(w) = \sum_{i=1}^{N} \left( y_i - \sum_{j=0}^{M-1} w_j x_{i,j} \right)^2$$

• The parameter gradient is (example: $w_j$)

$$\frac{\partial \mathrm{cost}}{\partial w_j} = -2 \sum_{i=1}^{N} (y_i - f_w(x_i)) x_{i,j}$$

• A sensible learning rule is

$$w_j \longleftarrow w_j + \eta \sum_{i=1}^{N} (y_i - f_w(x_i)) x_{i,j}$$
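A minimal batch gradient-descent sketch of this rule in vector form (learning rate and iteration count are illustrative choices, not values from the slides):

```python
import numpy as np

# Batch gradient descent: initialize with small random weights, then move in
# the direction of the negative gradient, w <- w + eta * X^T (y - X w).
def gradient_descent(X, y, eta=0.01, n_iter=1000):
    rng = np.random.default_rng(0)
    w = 0.01 * rng.normal(size=X.shape[1])   # small random initialization
    for _ in range(n_iter):
        residuals = y - X @ w                # y_i - f_w(x_i) for all i
        w = w + eta * X.T @ residuals        # sum_i (y_i - f_w(x_i)) x_{i,j}
    return w
```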
8
ADALINE-Learning Rule

• ADALINE: ADAptive LINear Element

• The ADALINE uses stochastic gradient descent (SGD)

• Let $x_t$ and $y_t$ be the training pattern in iteration $t$. Then we adapt, $t = 1, 2, \ldots$

$$w_j \longleftarrow w_j + \eta (y_t - \hat{y}_t) x_{t,j} \qquad j = 1, 2, \ldots, M$$

• $\eta > 0$ is the learning rate, typically $0 < \eta \ll 0.1$

• Compare: the Perceptron learning rule (only applied to misclassified patterns)

$$w_j \longleftarrow w_j + \eta y_t x_{t,j} \qquad j = 1, \ldots, M$$
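A sketch of one pass of the ADALINE update over the training set (all weight components, including the bias, are updated after each pattern; the learning rate is illustrative):

```python
import numpy as np

# Stochastic gradient descent: after each single pattern (x_t, y_t),
# every weight is moved by eta * (y_t - y_hat_t) * x_{t,j}.
def adaline_epoch(w, X, y, eta=0.05):
    for x_t, y_t in zip(X, y):        # one pattern per iteration t
        y_hat = x_t @ w               # current prediction
        w = w + eta * (y_t - y_hat) * x_t
    return w
```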

9
Analytic Solution

• The least-squares solution can be calculated in one step

10
Cost Function in Matrix Form

$$\mathrm{cost}(w) = \sum_{i=1}^{N} (y_i - f_w(x_i))^2 = (y - Xw)^T (y - Xw)$$

$$y = (y_1, \ldots, y_N)^T, \qquad
X = \begin{pmatrix} x_{1,0} & \ldots & x_{1,M-1} \\ \vdots & \ddots & \vdots \\ x_{N,0} & \ldots & x_{N,M-1} \end{pmatrix}$$
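A quick numerical check (with randomly generated data, purely for illustration) that the matrix form equals the sum of squared residuals:

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.column_stack([np.ones(5), rng.normal(size=(5, 2))])  # N = 5, M = 3
y = rng.normal(size=5)
w = rng.normal(size=3)

r = y - X @ w
print(r @ r)                        # (y - Xw)^T (y - Xw)
print(np.sum((y - X @ w) ** 2))     # same value as the sum over data points
```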

11
Calculating the First Derivative
Matrix calculus (with the gradient written as a column vector, so that $\frac{\partial}{\partial w}(y - Xw) = -X^T$):

Thus

$$\frac{\partial \mathrm{cost}(w)}{\partial w} = \frac{\partial (y - Xw)}{\partial w} \times 2(y - Xw) = -2 X^T (y - Xw)$$
12
Setting First Derivative to Zero

Calculating the LS-solution:

$$\frac{\partial \mathrm{cost}(w)}{\partial w} = -2 X^T (y - Xw) = 0$$

$$\hat{w}_{\mathrm{ls}} = (X^T X)^{-1} X^T y$$

Complexity (linear in $N$!):

$$O(M^3 + N M^2)$$

(Figure: fitted parameters $\hat{w}_0 = 0.75$, $\hat{w}_1 = 2.13$)
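A sketch of the closed-form solution; solving the normal equations (or using a least-squares routine) is generally preferred over forming the inverse explicitly:

```python
import numpy as np

# Closed-form least-squares solution w_ls = (X^T X)^{-1} X^T y.
# Cost: O(N M^2) for X^T X and X^T y, O(M^3) for the solve.
def least_squares(X, y):
    return np.linalg.solve(X.T @ X, X.T @ y)

# Equivalent and numerically more robust:
# w_ls = np.linalg.lstsq(X, y, rcond=None)[0]
```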

13
Alternative Convention

Comment: one also finds the conventions

$$\frac{\partial}{\partial x} Ax = A, \qquad \frac{\partial}{\partial x} x^T x = 2x^T, \qquad \frac{\partial}{\partial x} x^T A x = x^T (A + A^T)$$

Thus

$$\frac{\partial \mathrm{cost}(w)}{\partial w} = 2(y - Xw)^T \times \frac{\partial (y - Xw)}{\partial w} = -2(y - Xw)^T X$$

This leads to the same solution.

14
Stability of the Solution

• When $N \gg M$, the LS solution is stable (small changes in the data lead to small changes in the parameter estimates)

• When $N < M$, there are many solutions which all produce zero training error

• Of all these solutions, one selects the one that minimizes $\sum_{i=0}^{M-1} w_i^2$ (regularised solution; see the sketch below)

• Even with $N > M$ it is advantageous to regularize the solution, in particular with noise on the target
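A sketch of the underdetermined case $N < M$: among all zero-training-error solutions, the Moore–Penrose pseudoinverse returns the one with minimal sum of squared weights (random data, for illustration only):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(3, 5))            # N = 3 data points, M = 5 parameters
y = rng.normal(size=3)

w_min_norm = np.linalg.pinv(X) @ y     # minimum-norm zero-error solution
print(np.allclose(X @ w_min_norm, y))  # True: zero training error
```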

15
Linear Regression and Regularisation

• Regularised cost function (Penalized Least Squares (PLS), Ridge Regression, Weight Decay): the influence of a single data point should be small

$$\mathrm{cost}^{\mathrm{pen}}(w) = \sum_{i=1}^{N} (y_i - f_w(x_i))^2 + \lambda \sum_{i=0}^{M-1} w_i^2$$

$$\hat{w}_{\mathrm{pen}} = \left( X^T X + \lambda I \right)^{-1} X^T y$$

Derivation:

$$\frac{\partial \mathrm{cost}^{\mathrm{pen}}(w)}{\partial w} = -2 X^T (y - Xw) + 2\lambda w = 2\left[ -X^T y + (X^T X + \lambda I) w \right]$$
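A sketch of the penalized least-squares (ridge) solution; following the slide's formula, all weights including $w_0$ are penalized:

```python
import numpy as np

# Penalized LS / ridge solution w_pen = (X^T X + lambda I)^{-1} X^T y.
def ridge(X, y, lam):
    M = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(M), X.T @ y)
```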

16
Example: Correlated Input with no Effect on Output
(Redundant Input)

• Three data points are generated as (system; true model)

$$y_i = 0.5 + x_{i,1} + \epsilon_i$$

Here, $\epsilon_i$ is independent noise

• Model 1 (correct structure)

$$f_w(x) = w_0 + w_1 x_1$$

• Training data for Model 1:

    x1      y
  -0.2   0.49
   0.2   0.64
   1     1.39

• The LS solution gives $w_{\mathrm{ls}} = (0.58, 0.77)^T$


17
• In comparison, the true parameters are: $w = (0.50, 1.00)^T$. The parameter estimates are reasonable, considering that only three training patterns are available
Model 2

• For Model 2, we generate a second correlated input

$$x_{i,2} = x_{i,1} + \delta_i$$

Again, $\delta_i$ is uncorrelated noise

• Model 2 (redundant additional input)

$$f_w(x_i) = w_0 + w_1 x_{i,1} + w_2 x_{i,2}$$

• Data of Model 2:

    x1       x2       y
  -0.2  -0.1996    0.49
   0.2   0.1993    0.64
   1     1.0017    1.39

• The least squares solution gives $w_{\mathrm{ls}} = (0.67, -136, 137)^T$!!! The parameter estimates are far from the true parameters: this might not be surprising since $M = N = 3$

18
Model 2 with Regularisation

• As Model 2, except that large weights are penalized

• The penalized least squares solution gives $w_{\mathrm{pen}} = (0.58, 0.38, 0.39)^T$, also difficult to interpret!!!

• (Compare: the LS-solution for Model 1 gave $w_{\mathrm{ls}} = (0.58, 0.77)^T$)
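A sketch reproducing this three-point experiment with the data from the tables above; the regularization strength $\lambda$ is an assumption, since the slides do not state the value used:

```python
import numpy as np

x1 = np.array([-0.2, 0.2, 1.0])
x2 = np.array([-0.1996, 0.1993, 1.0017])     # nearly collinear with x1
y  = np.array([0.49, 0.64, 1.39])

X1 = np.column_stack([np.ones(3), x1])       # Model 1: bias + x1
X2 = np.column_stack([np.ones(3), x1, x2])   # Model 2: bias + x1 + x2

w1 = np.linalg.lstsq(X1, y, rcond=None)[0]   # approx (0.58, 0.77)
w2 = np.linalg.lstsq(X2, y, rcond=None)[0]   # huge, unstable weights (M = N = 3)

lam = 0.1                                    # assumed value, not from the slides
w2_pen = np.linalg.solve(X2.T @ X2 + lam * np.eye(3), X2.T @ y)
print(w1, w2, w2_pen)
```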

19
Performance on Training Data for the Models

• Training:

      y   M1: ŷ_ML   M2: ŷ_ML   M2: ŷ_pen
   0.50       0.43       0.50        0.43
   0.65       0.74       0.65        0.74
   1.39       1.36       1.39        1.36

• For Model 1 and Model 2 with regularization we have nonzero error on the training data

• For Model 2 without regularization, the training error is zero

• Thus, if we only consider the training error, we would prefer Model 2 without regularization

20
Performance on Test Data for the Models

• Test Data:

      y   M1: ŷ_ML   M2: ŷ_ML   M2: ŷ_pen
   0.20       0.36       0.69        0.36
   0.80       0.82       0.51        0.82
   1.10       1.05       1.30        1.05

• On test data, Model 1 and Model 2 with regularization give better results

• Even more dramatic: extrapolation (not shown)

• As a conclusion: Model 1, which corresponds to the system, performs best. For Model 2 (with the additional correlated input) the penalized version gives the best predictive results, although the parameter values are difficult to interpret. Without regularization, the prediction error of Model 2 on test data is large. Asymptotically, with $N \to \infty$, Model 2 might learn to ignore the second input, and $w_0$ and $w_1$ converge to the true parameters.

21
Remarks

• If one is only interested in prediction accuracy: adding inputs liberally can be beneficial if regularization is used (in ad placement and ad bidding, hundreds or thousands of features are used)

• The weight parameters of useless (noisy) features become close to zero with regularization (ill-conditioned parameters); without regularization they might assume large positive or negative values

• If parameter interpretation is essential:

• Forward selection: start with the empty model; at each step add the input that reduces the error most (see the sketch after this list)

• Backward selection (pruning): start with the full model; at each step remove the input that increases the error the least

• But there is no guarantee that one finds the best subset of inputs or that one finds the true inputs
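A greedy forward-selection sketch (an illustrative helper, not code from the slides); it assumes the constant input sits in column 0 of $X$ and, following the wording above, scores candidate subsets by their squared training error:

```python
import numpy as np

# Forward selection: start from the empty model (bias only) and repeatedly
# add the input whose inclusion reduces the squared training error the most.
def forward_selection(X, y, n_select):
    selected = [0]                           # always keep the constant column x_0
    candidates = list(range(1, X.shape[1]))

    def sse(cols):                           # training error of an LS fit on cols
        w, *_ = np.linalg.lstsq(X[:, cols], y, rcond=None)
        r = y - X[:, cols] @ w
        return r @ r

    for _ in range(n_select):
        best = min(candidates, key=lambda j: sse(selected + [j]))
        selected.append(best)
        candidates.remove(best)
    return selected
```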

22
Experiments with Real World Data: Data from Prostate Cancer
Patients

8 inputs, 97 data points; y: prostate-specific antigen

10-fold cross-validation error:

  LS                  0.586
  Best Subset (3)     0.574
  Ridge (Penalized)   0.540
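A sketch of how such a comparison can be run with scikit-learn; loading the prostate data set itself is left open (the commented lines are placeholders), and the ridge strength is an assumption rather than the value behind the table:

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import cross_val_score

# Mean squared error estimated by 10-fold cross-validation.
def cv_mse(model, X, y):
    scores = cross_val_score(model, X, y, cv=10,
                             scoring="neg_mean_squared_error")
    return -scores.mean()

# X, y = ...  # 97 x 8 input matrix and the target vector, loaded separately
# print(cv_mse(LinearRegression(), X, y))  # plain least squares
# print(cv_mse(Ridge(alpha=1.0), X, y))    # penalized least squares (ridge)
```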

23
GWAS Study

The trait (here: the disease systemic sclerosis) is the output and the SNPs are the inputs. The major allele is encoded as 0 and the minor allele as 1. Thus $w_j$ is the influence of SNP $j$ on the trait. Shown is the log of the p-value of $w_j$, ordered by location on the chromosomes. The weights can be calculated by penalized least squares (ridge regression).

24
