
Linear Regression

Volker Tresp
2017

1
Learning Machine: The Linear Model / ADALINE

• As with the Perceptron, we start with an activation function that is a linearly weighted sum of the inputs

$$h = \sum_{j=0}^{M-1} w_j x_j$$

(Note: $x_0 = 1$ is a constant input, so that $w_0$ is the bias)
• New: The activation is the output (no thresholding)

$$\hat{y} = f_w(x) = h$$

• Regression: the target function can take on real values
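A minimal sketch (not from the slides) of this model in NumPy; the weight and input values are made up for illustration:

```python
import numpy as np

# The activation h is the weighted sum of the inputs; unlike the Perceptron,
# it is passed through unchanged as the output y_hat = f_w(x) = h.
def predict(w, x):
    h = np.dot(w, x)   # h = sum_{j=0}^{M-1} w_j * x_j
    return h           # no thresholding

w = np.array([0.5, 1.0])   # w_0 is the bias
x = np.array([1.0, 0.3])   # x_0 = 1 is the constant input
print(predict(w, x))       # 0.5 + 1.0 * 0.3 = 0.8
```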

2
Method of Least Squares

• Squared-loss cost function:

$$\mathrm{cost}(w) = \sum_{i=1}^{N} (y_i - f_w(x_i))^2$$

• The parameters that minimize the cost function are called least squares (LS) estimators

$$w_{\mathrm{ls}} = \arg\min_w \mathrm{cost}(w)$$

• For visualization, one chooses $M = 2$ (although linear regression is often applied to high-dimensional inputs)
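A small illustrative sketch of this cost function in NumPy (the data layout, one row of $X$ per $x_i$ with the constant 1 in position 0, is an assumption used in the examples below):

```python
import numpy as np

# Squared-loss cost: sum over the N data points of (y_i - f_w(x_i))^2,
# where f_w(x_i) = x_i^T w.
def cost(w, X, y):
    total = 0.0
    for x_i, y_i in zip(X, y):        # one row of X per data point
        total += (y_i - x_i @ w) ** 2
    return total
```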

3
Least-squares Estimator for Regression

One-dimensional regression:

$$f_w(x) = w_0 + w_1 x, \qquad w = (w_0, w_1)^T$$

Squared error:

$$\mathrm{cost}(w) = \sum_{i=1}^{N} (y_i - f_w(x_i))^2$$

Goal:

$$w_{\mathrm{ls}} = \arg\min_w \mathrm{cost}(w)$$

(Figure: data generated with $w_0 = 1$, $w_1 = 2$, $\mathrm{var}(\epsilon) = 1$)
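A sketch of how such toy data can be generated; only the true parameters and the noise variance come from the figure caption, while the sample size and input range are assumptions:

```python
import numpy as np

# Generate 1-D regression data from the true model y = w_0 + w_1 x + eps
# with w_0 = 1, w_1 = 2 and noise variance var(eps) = 1.
rng = np.random.default_rng(0)
N = 20                                  # sample size: an assumption
x = rng.uniform(-1.0, 1.0, size=N)      # input range: an assumption
eps = rng.normal(0.0, 1.0, size=N)      # standard deviation 1 -> var(eps) = 1
y = 1.0 + 2.0 * x + eps
```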

4
Least-squares Estimator in General

General Model:

$$\hat{y}_i = f(x_i, w) = w_0 + \sum_{j=1}^{M-1} w_j x_{i,j} = x_i^T w$$

$$w = (w_0, w_1, \ldots, w_{M-1})^T, \qquad x_i = (1, x_{i,1}, \ldots, x_{i,M-1})^T$$

5
Linear Regression with Several Inputs

6
Contribution to the Cost Function of one Data Point

7
Gradient Descent Learning

• Initialize parameters (typically using small random numbers)


• Adapt the parameters in the direction of the negative gradient
• With

$$\mathrm{cost}(w) = \sum_{i=1}^{N} \left( y_i - \sum_{j=0}^{M-1} w_j x_{i,j} \right)^2$$

• The parameter gradient is (example: $w_j$)

$$\frac{\partial \mathrm{cost}}{\partial w_j} = -2 \sum_{i=1}^{N} (y_i - f_w(x_i)) x_{i,j}$$

• A sensible learning rule is

$$w_j \longleftarrow w_j + \eta \sum_{i=1}^{N} (y_i - f_w(x_i)) x_{i,j}$$
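A minimal batch gradient-descent sketch of this rule in vector form (learning rate and iteration count are illustrative choices, not values from the slides):

```python
import numpy as np

# Batch gradient descent: initialize with small random weights, then move in
# the direction of the negative gradient, w <- w + eta * X^T (y - X w).
def gradient_descent(X, y, eta=0.01, n_iter=1000):
    rng = np.random.default_rng(0)
    w = 0.01 * rng.normal(size=X.shape[1])   # small random initialization
    for _ in range(n_iter):
        residuals = y - X @ w                # y_i - f_w(x_i) for all i
        w = w + eta * X.T @ residuals        # sum_i (y_i - f_w(x_i)) x_{i,j}
    return w
```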
8
ADALINE-Learning Rule

• ADALINE: ADAptive LINear Element

• The ADALINE uses stochastic gradient descent (SGD)

• Let $x_t$ and $y_t$ be the training pattern in iteration $t$. Then we adapt, $t = 1, 2, \ldots$

$$w_j \longleftarrow w_j + \eta (y_t - \hat{y}_t) x_{t,j} \qquad j = 1, 2, \ldots, M$$

• $\eta > 0$ is the learning rate, typically $0 < \eta \ll 0.1$

• Compare: the Perceptron learning rule (only applied to misclassified patterns)

$$w_j \longleftarrow w_j + \eta y_t x_{t,j} \qquad j = 1, \ldots, M$$
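A sketch of one pass of the ADALINE update over the training set (all weight components, including the bias, are updated after each pattern; the learning rate is illustrative):

```python
import numpy as np

# Stochastic gradient descent: after each single pattern (x_t, y_t),
# every weight is moved by eta * (y_t - y_hat_t) * x_{t,j}.
def adaline_epoch(w, X, y, eta=0.05):
    for x_t, y_t in zip(X, y):        # one pattern per iteration t
        y_hat = x_t @ w               # current prediction
        w = w + eta * (y_t - y_hat) * x_t
    return w
```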

9
Analytic Solution

• The least-squares solution can be calculated in one step

10
Cost Function in Matrix Form

$$\mathrm{cost}(w) = \sum_{i=1}^{N} (y_i - f_w(x_i))^2 = (y - Xw)^T (y - Xw)$$

$$y = (y_1, \ldots, y_N)^T, \qquad
X = \begin{pmatrix} x_{1,0} & \ldots & x_{1,M-1} \\ \vdots & \ddots & \vdots \\ x_{N,0} & \ldots & x_{N,M-1} \end{pmatrix}$$
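A quick numerical check (with randomly generated data, purely for illustration) that the matrix form equals the sum of squared residuals:

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.column_stack([np.ones(5), rng.normal(size=(5, 2))])  # N = 5, M = 3
y = rng.normal(size=5)
w = rng.normal(size=3)

r = y - X @ w
print(r @ r)                        # (y - Xw)^T (y - Xw)
print(np.sum((y - X @ w) ** 2))     # same value as the sum over data points
```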

11
Calculating the First Derivative
Matrix calculus (with the gradient written as a column vector, so that $\frac{\partial}{\partial w}(y - Xw) = -X^T$):

Thus

$$\frac{\partial \mathrm{cost}(w)}{\partial w} = \frac{\partial (y - Xw)}{\partial w} \times 2(y - Xw) = -2 X^T (y - Xw)$$
12
Setting First Derivative to Zero

Calculating the LS-solution:

$$\frac{\partial \mathrm{cost}(w)}{\partial w} = -2 X^T (y - Xw) = 0$$

$$\hat{w}_{\mathrm{ls}} = (X^T X)^{-1} X^T y$$

Complexity (linear in $N$!):

$$O(M^3 + N M^2)$$

(Figure: fitted parameters $\hat{w}_0 = 0.75$, $\hat{w}_1 = 2.13$)
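A sketch of the closed-form solution; solving the normal equations (or using a least-squares routine) is generally preferred over forming the inverse explicitly:

```python
import numpy as np

# Closed-form least-squares solution w_ls = (X^T X)^{-1} X^T y.
# Cost: O(N M^2) for X^T X and X^T y, O(M^3) for the solve.
def least_squares(X, y):
    return np.linalg.solve(X.T @ X, X.T @ y)

# Equivalent and numerically more robust:
# w_ls = np.linalg.lstsq(X, y, rcond=None)[0]
```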

13
Alternative Convention

Comment: one also finds the conventions

$$\frac{\partial}{\partial x} Ax = A, \qquad \frac{\partial}{\partial x} x^T x = 2x^T, \qquad \frac{\partial}{\partial x} x^T A x = x^T (A + A^T)$$

Thus

$$\frac{\partial \mathrm{cost}(w)}{\partial w} = 2(y - Xw)^T \times \frac{\partial (y - Xw)}{\partial w} = -2(y - Xw)^T X$$

This leads to the same solution.

14
Stability of the Solution

• When $N \gg M$, the LS solution is stable (small changes in the data lead to small changes in the parameter estimates)

• When $N < M$, there are many solutions which all produce zero training error

• Of all these solutions, one selects the one that minimizes $\sum_{i=0}^{M-1} w_i^2$ (regularised solution; see the sketch below)

• Even with $N > M$ it is advantageous to regularize the solution, in particular with noise on the target
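A sketch of the underdetermined case $N < M$: among all zero-training-error solutions, the Moore–Penrose pseudoinverse returns the one with minimal sum of squared weights (random data, for illustration only):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(3, 5))            # N = 3 data points, M = 5 parameters
y = rng.normal(size=3)

w_min_norm = np.linalg.pinv(X) @ y     # minimum-norm zero-error solution
print(np.allclose(X @ w_min_norm, y))  # True: zero training error
```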

15
Linear Regression and Regularisation

• Regularised cost function (Penalized Least Squares (PLS), Ridge Regression, Weight Decay): the influence of a single data point should be small

$$\mathrm{cost}^{\mathrm{pen}}(w) = \sum_{i=1}^{N} (y_i - f_w(x_i))^2 + \lambda \sum_{i=0}^{M-1} w_i^2$$

$$\hat{w}_{\mathrm{pen}} = \left( X^T X + \lambda I \right)^{-1} X^T y$$

Derivation:

$$\frac{\partial \mathrm{cost}^{\mathrm{pen}}(w)}{\partial w} = -2 X^T (y - Xw) + 2\lambda w = 2\left[ -X^T y + (X^T X + \lambda I) w \right]$$
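A sketch of the penalized least-squares (ridge) solution; following the slide's formula, all weights including $w_0$ are penalized:

```python
import numpy as np

# Penalized LS / ridge solution w_pen = (X^T X + lambda I)^{-1} X^T y.
def ridge(X, y, lam):
    M = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(M), X.T @ y)
```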

16
Example: Correlated Input with no Effect on Output
(Redundant Input)

• Three data points are generated as (system; true model)

$$y_i = 0.5 + x_{i,1} + \epsilon_i$$

Here, $\epsilon_i$ is independent noise

• Model 1 (correct structure)

$$f_w(x) = w_0 + w_1 x_1$$

• Training data for Model 1:

    x1      y
  -0.2   0.49
   0.2   0.64
   1     1.39

• The LS solution gives $w_{\mathrm{ls}} = (0.58, 0.77)^T$


17
• In comparison, the true parameters are: $w = (0.50, 1.00)^T$. The parameter estimates are reasonable, considering that only three training patterns are available
Model 2

• For Model 2, we generate a second correlated input

$$x_{i,2} = x_{i,1} + \delta_i$$

Again, $\delta_i$ is uncorrelated noise

• Model 2 (redundant additional input)

$$f_w(x_i) = w_0 + w_1 x_{i,1} + w_2 x_{i,2}$$

• Data of Model 2:

    x1       x2       y
  -0.2  -0.1996    0.49
   0.2   0.1993    0.64
   1     1.0017    1.39

• The least squares solution gives $w_{\mathrm{ls}} = (0.67, -136, 137)^T$!!! The parameter estimates are far from the true parameters: this might not be surprising since $M = N = 3$

18
Model 2 with Regularisation

• As Model 2, except that large weights are penalized

• The penalized least squares solution gives $w_{\mathrm{pen}} = (0.58, 0.38, 0.39)^T$, also difficult to interpret!!!

• (Compare: the LS-solution for Model 1 gave $w_{\mathrm{ls}} = (0.58, 0.77)^T$)
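A sketch reproducing this three-point experiment with the data from the tables above; the regularization strength $\lambda$ is an assumption, since the slides do not state the value used:

```python
import numpy as np

x1 = np.array([-0.2, 0.2, 1.0])
x2 = np.array([-0.1996, 0.1993, 1.0017])     # nearly collinear with x1
y  = np.array([0.49, 0.64, 1.39])

X1 = np.column_stack([np.ones(3), x1])       # Model 1: bias + x1
X2 = np.column_stack([np.ones(3), x1, x2])   # Model 2: bias + x1 + x2

w1 = np.linalg.lstsq(X1, y, rcond=None)[0]   # approx (0.58, 0.77)
w2 = np.linalg.lstsq(X2, y, rcond=None)[0]   # huge, unstable weights (M = N = 3)

lam = 0.1                                    # assumed value, not from the slides
w2_pen = np.linalg.solve(X2.T @ X2 + lam * np.eye(3), X2.T @ y)
print(w1, w2, w2_pen)
```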

19
Performance on Training Data for the Models

• Training:

      y   M1: ŷ_ML   M2: ŷ_ML   M2: ŷ_pen
   0.50       0.43       0.50        0.43
   0.65       0.74       0.65        0.74
   1.39       1.36       1.39        1.36

• For Model 1 and Model 2 with regularization we have nonzero error on the training data

• For Model 2 without regularization, the training error is zero

• Thus, if we only consider the training error, we would prefer Model 2 without regularization

20
Performance on Test Data for the Models

• Test Data:

      y   M1: ŷ_ML   M2: ŷ_ML   M2: ŷ_pen
   0.20       0.36       0.69        0.36
   0.80       0.82       0.51        0.82
   1.10       1.05       1.30        1.05

• On test data, Model 1 and Model 2 with regularization give better results

• Even more dramatic: extrapolation (not shown)

• As a conclusion: Model 1, which corresponds to the system, performs best. For Model 2 (with the additional correlated input) the penalized version gives the best predictive results, although the parameter values are difficult to interpret. Without regularization, the prediction error of Model 2 on test data is large. Asymptotically, with $N \to \infty$, Model 2 might learn to ignore the second input, and $w_0$ and $w_1$ converge to the true parameters.

21
Remarks

• If one is only interested in prediction accuracy: adding inputs liberally can be beneficial if regularization is used (in ad placement and ad bidding, hundreds or thousands of features are used)

• The weight parameters of useless (noisy) features become close to zero with regularization (ill-conditioned parameters); without regularization they might assume large positive or negative values

• If parameter interpretation is essential:

• Forward selection: start with the empty model; at each step add the input that reduces the error most (see the sketch after this list)

• Backward selection (pruning): start with the full model; at each step remove the input that increases the error the least

• But there is no guarantee that one finds the best subset of inputs or that one finds the true inputs
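A greedy forward-selection sketch (an illustrative helper, not code from the slides); it assumes the constant input sits in column 0 of $X$ and, following the wording above, scores candidate subsets by their squared training error:

```python
import numpy as np

# Forward selection: start from the empty model (bias only) and repeatedly
# add the input whose inclusion reduces the squared training error the most.
def forward_selection(X, y, n_select):
    selected = [0]                           # always keep the constant column x_0
    candidates = list(range(1, X.shape[1]))

    def sse(cols):                           # training error of an LS fit on cols
        w, *_ = np.linalg.lstsq(X[:, cols], y, rcond=None)
        r = y - X[:, cols] @ w
        return r @ r

    for _ in range(n_select):
        best = min(candidates, key=lambda j: sse(selected + [j]))
        selected.append(best)
        candidates.remove(best)
    return selected
```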

22
Experiments with Real World Data: Data from Prostate Cancer
Patients

8 inputs, 97 data points; y: prostate-specific antigen

10-fold cross-validation error:

  LS                  0.586
  Best Subset (3)     0.574
  Ridge (Penalized)   0.540
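A sketch of how such a comparison can be run with scikit-learn; loading the prostate data set itself is left open (the commented lines are placeholders), and the ridge strength is an assumption rather than the value behind the table:

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import cross_val_score

# Mean squared error estimated by 10-fold cross-validation.
def cv_mse(model, X, y):
    scores = cross_val_score(model, X, y, cv=10,
                             scoring="neg_mean_squared_error")
    return -scores.mean()

# X, y = ...  # 97 x 8 input matrix and the target vector, loaded separately
# print(cv_mse(LinearRegression(), X, y))  # plain least squares
# print(cv_mse(Ridge(alpha=1.0), X, y))    # penalized least squares (ridge)
```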

23
GWAS Study

The trait (here: the disease systemic sclerosis) is the output and the SNPs are the inputs. The major allele is encoded as 0 and the minor allele as 1. Thus $w_j$ is the influence of SNP $j$ on the trait. Shown is the log of the p-value of $w_j$, ordered by location on the chromosomes. The weights can be calculated by penalized least squares (ridge regression).

24
