Linear Regression
Volker Tresp
2017
1
Learning Machine: The Linear Model / ADALINE
ŷ = f_w(x) = h(Σ_{j=0}^{M−1} w_j x_j)
• Regression: the target function can take on real values
2
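A tiny numpy sketch (illustrative, not from the slides) of the linear model's forward computation; here the output function h is taken to be the identity, as is usual for regression, and the weight and input values are made up.

```python
import numpy as np

def h(a):
    """Output function; the identity for regression."""
    return a

def predict(x, w):
    """Linear model / ADALINE output: y_hat = h(sum_j w_j * x_j)."""
    return h(np.dot(w, x))

w = np.array([0.5, 1.0, -2.0])   # w_0, w_1, w_2 (example values)
x = np.array([1.0, 0.3, 0.7])    # x_0 = 1 encodes the bias term
print(predict(x, w))             # 0.5 + 1.0*0.3 - 2.0*0.7 = -0.6
```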
Method of Least Squares
cost(w) = Σ_{i=1}^{N} (y_i − f_w(x_i))²
• The parameters that minimize the cost function are called least squares (LS) estimators
3
Least-squares Estimator for Regression
One-dimensional regression:
f_w(x) = w_0 + w_1 x

w = (w_0, w_1)^T

Squared error:

cost(w) = Σ_{i=1}^{N} (y_i − f_w(x_i))²

Goal: find the parameters ŵ = (ŵ_0, ŵ_1)^T that minimize cost(w)
4
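As an illustration (not part of the slides), a minimal numpy sketch that fits the one-dimensional model by minimizing the squared-error cost; the data arrays are made up for this example.

```python
import numpy as np

# Toy data (made up for illustration): y is roughly w0 + w1*x plus noise
x = np.array([-0.2, 0.2, 1.0, 1.5, 2.0])
y = np.array([0.49, 0.64, 1.39, 2.05, 2.48])

# Design matrix with a column of ones for the bias w0
X = np.column_stack([np.ones_like(x), x])

# Least-squares estimate: minimizes cost(w) = sum_i (y_i - w0 - w1*x_i)^2
w_ls, *_ = np.linalg.lstsq(X, y, rcond=None)
w0, w1 = w_ls
print(f"w0 = {w0:.3f}, w1 = {w1:.3f}")

# Squared-error cost at the solution
cost = np.sum((y - X @ w_ls) ** 2)
print(f"cost(w_ls) = {cost:.4f}")
```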
Least-squares Estimator in General
General Model:
ŷ_i = f(x_i, w) = w_0 + Σ_{j=1}^{M−1} w_j x_{i,j} = x_i^T w      (with x_{i,0} = 1)
5
Linear Regression with Several Inputs
6
Contribution to the Cost Function of one Data Point
7
Gradient Descent Learning
w_j ← w_j + η (y_t − ŷ_t) x_{t,j},    j = 0, 1, . . . , M−1

w_j ← w_j + η y_t x_{t,j},    j = 0, 1, . . . , M−1
9
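A minimal sketch (not from the slides) of the pattern-by-pattern gradient-descent (LMS / delta-rule) update w_j ← w_j + η (y_t − ŷ_t) x_{t,j}; the learning rate, number of epochs, and toy data are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data (assumed): N patterns, M-1 inputs plus a constant x_0 = 1
N, M = 100, 3
X = np.column_stack([np.ones(N), rng.normal(size=(N, M - 1))])
w_true = np.array([0.5, 1.0, -2.0])
y = X @ w_true + 0.1 * rng.normal(size=N)

eta = 0.01          # learning rate (assumed)
w = np.zeros(M)     # initial weights

for epoch in range(50):                   # number of passes over the data (assumed)
    for t in rng.permutation(N):          # visit patterns in random order
        y_hat = X[t] @ w                  # prediction for pattern t
        w += eta * (y[t] - y_hat) * X[t]  # delta-rule update for all j

print("estimated w:", np.round(w, 3))
print("true w:     ", w_true)
```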
Analytic Solution
10
Cost Function in Matrix Form
cost(w) = Σ_{i=1}^{N} (y_i − f_w(x_i))² = (y − Xw)^T (y − Xw)

where

y = (y_1, . . . , y_N)^T

X = [ x_{1,0}   · · ·   x_{1,M−1}
        ⋮          ⋱         ⋮
      x_{N,0}   · · ·   x_{N,M−1} ]
11
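A small numpy check (illustrative, not from the slides) that the matrix form (y − Xw)^T (y − Xw) equals the sum of squared residuals; the data and weights are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)
N, M = 5, 3
X = np.column_stack([np.ones(N), rng.normal(size=(N, M - 1))])  # design matrix, first column = 1
y = rng.normal(size=N)
w = rng.normal(size=M)

r = y - X @ w                        # residual vector y - Xw
cost_matrix = r @ r                  # (y - Xw)^T (y - Xw)
cost_sum = np.sum((y - X @ w) ** 2)  # sum_i (y_i - f_w(x_i))^2
assert np.isclose(cost_matrix, cost_sum)
print(cost_matrix)
```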
Calculating the First Derivative
Matrix calculus: for v = v(w), ∂(v^T v)/∂w = (∂v/∂w) · 2v; here v = y − Xw and ∂(y − Xw)/∂w = −X^T

Thus

∂cost(w)/∂w = ∂(y − Xw)/∂w × 2(y − Xw) = −2 X^T (y − Xw)
12
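An illustrative finite-difference check (not part of the slides) that the gradient of the cost is −2 X^T (y − Xw); the data and weights are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(2)
N, M = 6, 3
X = np.column_stack([np.ones(N), rng.normal(size=(N, M - 1))])
y = rng.normal(size=N)
w = rng.normal(size=M)

def cost(w):
    r = y - X @ w
    return r @ r

grad_analytic = -2 * X.T @ (y - X @ w)

# Central finite differences, one coordinate at a time
eps = 1e-6
grad_numeric = np.array([
    (cost(w + eps * e) - cost(w - eps * e)) / (2 * eps)
    for e in np.eye(M)
])

print(np.allclose(grad_analytic, grad_numeric, atol=1e-4))  # True
```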
Setting First Derivative to Zero
Setting −2 X^T (y − Xŵ) = 0 gives the normal equations X^T X ŵ = X^T y, and thus

ŵ_ls = (X^T X)^{−1} X^T y

Computational cost: O(M³ + N M²)
13
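A minimal numpy sketch (illustrative) of the analytic least-squares solution; in practice np.linalg.solve (or lstsq) is preferred over forming an explicit matrix inverse. The data are synthetic.

```python
import numpy as np

rng = np.random.default_rng(3)
N, M = 50, 4
X = np.column_stack([np.ones(N), rng.normal(size=(N, M - 1))])
w_true = rng.normal(size=M)
y = X @ w_true + 0.05 * rng.normal(size=N)

# Solve the normal equations X^T X w = X^T y
# (forming X^T X costs O(N M^2), solving the M x M system costs O(M^3))
w_ls = np.linalg.solve(X.T @ X, X.T @ y)
print(np.round(w_ls, 3), np.round(w_true, 3))
```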
Alternative Convention
With this convention the derivative of a scalar with respect to w is a row vector, and ∂(y − Xw)/∂w = −X. Thus

∂cost(w)/∂w = 2 (y − Xw)^T × ∂(y − Xw)/∂w = −2 (y − Xw)^T X
This leads to the same solution
14
Stability of the Solution
• When N ≫ M, the LS solution is stable (small changes in the data lead to small changes in the parameter estimates)
• When N < M, there are many solutions which all produce zero training error
• Of all these solutions, one selects the one that minimizes Σ_{i=0}^{M−1} w_i² (regularised solution)
15
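An illustrative numpy sketch (not from the slides) of the N < M case: np.linalg.lstsq (equivalently, the pseudo-inverse) returns the minimum-norm solution among all solutions with zero training error. The data are synthetic.

```python
import numpy as np

rng = np.random.default_rng(4)
N, M = 3, 6                       # fewer data points than parameters
X = rng.normal(size=(N, M))
y = rng.normal(size=N)

# Minimum-norm least-squares solution (zero training error, smallest sum of w_i^2)
w_min_norm, *_ = np.linalg.lstsq(X, y, rcond=None)

print(np.allclose(X @ w_min_norm, y))   # True: zero training error
print(np.round(w_min_norm, 3))
```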
Linear Regression and Regularisation
• Regularised cost function (Penalized Least Squares (PLS), Ridge Regression, Weight Decay): the influence of a single data point should be small

cost_pen(w) = Σ_{i=1}^{N} (y_i − f_w(x_i))² + λ Σ_{i=0}^{M−1} w_i²

ŵ_pen = (X^T X + λI)^{−1} X^T y

Derivation:

∂cost_pen(w)/∂w = −2 X^T (y − Xw) + 2λw = 2[−X^T y + (X^T X + λI) w]

Setting this derivative to zero yields ŵ_pen above.
16
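A minimal numpy sketch (illustrative) of the penalized least-squares / ridge solution ŵ_pen = (X^T X + λI)^{−1} X^T y; the value of λ is an arbitrary choice for the example, and in practice the bias weight w_0 is often left unpenalized.

```python
import numpy as np

def ridge(X, y, lam):
    """Penalized least squares: solve (X^T X + lam*I) w = X^T y."""
    M = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(M), X.T @ y)

rng = np.random.default_rng(5)
N, M = 30, 5
X = np.column_stack([np.ones(N), rng.normal(size=(N, M - 1))])
w_true = np.array([0.5, 1.0, 0.0, -1.5, 2.0])
y = X @ w_true + 0.1 * rng.normal(size=N)

print(np.round(ridge(X, y, lam=1.0), 3))   # lambda = 1.0 chosen arbitrarily
```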
Example: Correlated Input with no Effect on Output
(Redundant Input)
y = 0.5 + x_1 + ε
Here, ε is independent noise

Model 1: f_w(x) = w_0 + w_1 x_1

The second input is correlated with the first but has no effect on the output:
x_{i,2} = x_{i,1} + δ_i
Again, δ_i is uncorrelated noise

Model 2 uses both inputs: f_w(x) = w_0 + w_1 x_1 + w_2 x_2

Data of Model 2:
  x_1      x_2       y
 -0.2    -0.1996    0.49
  0.2     0.1993    0.64
  1.0     1.0017    1.39

• The least squares solution for Model 2 gives ŵ_ls = (0.67, −136, 137)^T !!! The parameter estimates are far from the true parameters; this might not be surprising since M = N = 3
18
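A small numpy reproduction sketch (illustrative) of this example, using the three Model 2 data points from the table above. The LS coefficients depend very sensitively on the rounded table entries, so they will not exactly reproduce the slide's numbers, but they do show the same instability; the value of λ used on the slide is not given, so lam = 0.1 is an arbitrary choice.

```python
import numpy as np

# The three data points of Model 2 from the table above
x1 = np.array([-0.2, 0.2, 1.0])
x2 = np.array([-0.1996, 0.1993, 1.0017])
y  = np.array([0.49, 0.64, 1.39])

X = np.column_stack([np.ones(3), x1, x2])   # M = N = 3

# Unregularized least squares: nearly collinear columns -> huge, unstable weights
w_ls = np.linalg.solve(X.T @ X, X.T @ y)
print("LS:   ", np.round(w_ls, 2))

# Penalized least squares (ridge) keeps the weights moderate
lam = 0.1                                   # lambda chosen for illustration
w_pen = np.linalg.solve(X.T @ X + lam * np.eye(3), X.T @ y)
print("Ridge:", np.round(w_pen, 2))
```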
Model 2 with Regularisation
• The penalized least squares solution gives ŵ_pen = (0.58, 0.38, 0.39)^T, also difficult to interpret!!!
19
Performance on Training Data for the Models
• Training:
   y      M1: ŷ_ML    M2: ŷ_ML    M2: ŷ_pen
  0.50      0.43        0.50        0.43
  0.65      0.74        0.65        0.74
  1.39      1.36        1.39        1.36

• For Model 1 and Model 2 with regularization we have nonzero error on the training data
• Thus, if we only consider the training error, we would prefer Model 2 without regularization
20
Performance on Test Data for the Models
• Test Data:
   y      M1: ŷ_ML    M2: ŷ_ML    M2: ŷ_pen
  0.20      0.36        0.69        0.36
  0.80      0.82        0.51        0.82
  1.10      1.05        1.30        1.05
• On test data Model 1 and Model 2 with regularization give better results
• As a conclusion: Model 1, which corresponds to the true system, performs best. For Model 2 (with the additional correlated input) the penalized version gives the best predictive results, although the parameter values are difficult to interpret. Without regularization, the prediction error of Model 2 on test data is large. Asymptotically, with N → ∞, Model 2 might learn to ignore the second input, so that w_0 and w_1 converge to the true parameters.
21
Remarks
• If one is only interested in prediction accuracy: adding inputs liberally can be beneficial
if regularization is used (in ad placements and ad bidding, hundreds or thousands of
features are used)
• The weight parameters of useless (noisy) features become close to zero with regularization (ill-conditioned parameters); without regularization they might assume large positive or negative values
• Forward selection: start with the empty model; at each step add the input that reduces the error the most (see the sketch after this list)
• Backward selection (pruning): start with the full model; at each step remove the input that increases the error the least
• But there is no guarantee that one finds the best subset of inputs or that one finds the true inputs
22
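A minimal sketch (illustrative, not from the slides) of forward selection with plain least squares on synthetic data; the candidate-scoring criterion here is simply the training error, whereas in practice one would score candidates on validation data or by cross-validation.

```python
import numpy as np

def ls_error(X, y, cols):
    """Training SSE of the LS fit that uses only the given columns of X."""
    Xs = X[:, cols]
    w, *_ = np.linalg.lstsq(Xs, y, rcond=None)
    return np.sum((y - Xs @ w) ** 2)

def forward_selection(X, y, n_select):
    selected = [0]                       # always keep column 0 (the constant input)
    remaining = list(range(1, X.shape[1]))
    for _ in range(n_select):
        # add the input that reduces the error the most
        best = min(remaining, key=lambda j: ls_error(X, y, selected + [j]))
        selected.append(best)
        remaining.remove(best)
    return selected

# Synthetic data: only inputs 1 and 3 actually influence y
rng = np.random.default_rng(7)
N = 200
X = np.column_stack([np.ones(N), rng.normal(size=(N, 5))])
y = 0.5 + 2.0 * X[:, 1] - 1.5 * X[:, 3] + 0.1 * rng.normal(size=N)

print(forward_selection(X, y, n_select=2))   # typically [0, 1, 3]
```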
Experiments with Real World Data: Data from Prostate Cancer Patients
10-fold cross-validation error:
  LS                   0.586
  Best Subset (3)      0.574
  Ridge (Penalized)    0.540
23
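An illustrative sketch (not the original experiment; the prostate cancer data are not included here) of how a 10-fold cross-validation error for LS versus ridge could be computed; synthetic data stand in for the real dataset, and the λ value is arbitrary.

```python
import numpy as np

def cv_error(X, y, lam, K=10, seed=0):
    """Mean squared test error over K cross-validation folds (lam=0 gives plain LS)."""
    N, M = X.shape
    idx = np.random.default_rng(seed).permutation(N)
    folds = np.array_split(idx, K)
    errs = []
    for k in range(K):
        test = folds[k]
        train = np.concatenate([folds[j] for j in range(K) if j != k])
        Xtr, ytr = X[train], y[train]
        w = np.linalg.solve(Xtr.T @ Xtr + lam * np.eye(M), Xtr.T @ ytr)
        errs.append(np.mean((y[test] - X[test] @ w) ** 2))
    return np.mean(errs)

# Synthetic stand-in data (the real experiment used prostate cancer patient data)
rng = np.random.default_rng(6)
N, M = 100, 9
X = np.column_stack([np.ones(N), rng.normal(size=(N, M - 1))])
y = X @ rng.normal(size=M) + 0.5 * rng.normal(size=N)

print("LS   :", round(cv_error(X, y, lam=0.0), 3))
print("Ridge:", round(cv_error(X, y, lam=5.0), 3))
```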
GWAS Study
The trait (here: the disease systemic sclerosis) is the output and the SNPs are the inputs. The major allele is encoded as 0 and the minor allele as 1. Thus w_j is the influence of SNP j on the trait. Shown is the (log of the) p-value of w_j, ordered by the locations on the chromosomes. The weights can be calculated by penalized least squares (ridge regression).
24