4 Linear Regression Additional Notes


Statistical Learning Fall 2024

Linear Models in Machine Learning


Instructor: Yingyu Liang Date: Scriber: Yingyu Liang

We briefly go over two linear models frequently used in machine learning: linear regression
for regression, and logistic regression for classification.

1 Linear regression
1.1 The 1D case
Let's first introduce the following notation:

• x: input variable (also called the independent, predictor, or explanatory variable).

• y: output variable (also called the dependent or response variable).

We are given training data points $\{(x_i, y_i)\}_{i=1}^n$. We would like to learn a model for $p(y \mid x^*)$ at new test points $x^*$ (usually we predict a single value $E(y \mid x^*)$ instead of outputting the whole distribution).

Statistical Model. For the goal of prediction, we first need a probabilistic model for the
data points. The typical model assumption is:

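$$y = \theta_0 + \theta_1 x + \epsilon, \qquad \epsilon \sim N(0, \sigma^2).$$

This is equivalent to assuming a Gaussian distribution for the conditional probability $p(y \mid x)$:

$$y \mid x \sim N(\theta_0 + \theta_1 x, \sigma^2).$$

Then on a test point $x^*$,

$$E(y \mid x^*) = E(\theta_0 + \theta_1 x^* + \epsilon) = \theta_0 + \theta_1 x^* + E(\epsilon) = \theta_0 + \theta_1 x^*.$$

So we need to learn the parameters $\theta_0, \theta_1$.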

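Learning via MLE. Now we show that for learning, the maximum likelihood estimate leads to the least squares estimate. Recall that $\epsilon \sim N(0, \sigma^2)$ has density $\frac{1}{\sqrt{2\pi}\,\sigma} \exp\left(-\frac{\epsilon^2}{2\sigma^2}\right)$. The likelihood is then:

$$
\begin{aligned}
L(\theta_0, \theta_1) &= \Pr\left[\{x_i, y_i\}_{i=1}^n \mid \theta_0, \theta_1\right] \\
&= \prod_{i=1}^n \Pr\left[x_i, y_i \mid \theta_0, \theta_1\right] \\
&= \prod_{i=1}^n \Pr[x_i] \, \Pr\left[y_i \mid x_i, \theta_0, \theta_1\right] \\
&= \prod_{i=1}^n \Pr[x_i] \cdot \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left(-\frac{(\theta_0 + \theta_1 x_i - y_i)^2}{2\sigma^2}\right).
\end{aligned}
$$

The logarithm is monotone, and the factors $\Pr[x_i]$ and $\frac{1}{\sqrt{2\pi}\,\sigma}$ do not depend on $(\theta_0, \theta_1)$. Dropping them, along with the positive constant $\frac{1}{2\sigma^2}$, gives:

$$
\begin{aligned}
\operatorname*{argmax}_{\theta_0, \theta_1} L(\theta_0, \theta_1) &= \operatorname*{argmax}_{\theta_0, \theta_1} \log L(\theta_0, \theta_1) \\
&= \operatorname*{argmax}_{\theta_0, \theta_1} \; -\sum_{i=1}^n (\theta_0 + \theta_1 x_i - y_i)^2.
\end{aligned}
$$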

Therefore, the maximum likelihood estimate is the least squares estimate:

$$(\hat\theta_0, \hat\theta_1) = \operatorname*{argmin}_{\theta_0, \theta_1} \sum_{i=1}^n (\theta_0 + \theta_1 x_i - y_i)^2.$$

Optional: Optimization. The least squares training objective is convex, so we can minimize it by setting the gradient to zero, which gives:

$$\hat\theta_1 = \frac{\sum_i (x_i - \bar x)(y_i - \bar y)}{\sum_i (x_i - \bar x)^2}, \qquad \hat\theta_0 = \bar y - \hat\theta_1 \bar x,$$

where $\bar x = (\sum_i x_i)/n$ and $\bar y = (\sum_i y_i)/n$. This is known as ordinary least squares (OLS). The prediction is then $\hat y_i = \hat\theta_0 + \hat\theta_1 x_i$, and the residual is $y_i - \hat y_i$.
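
The closed-form 1D estimate can be checked numerically. The following is a minimal sketch in Python with NumPy; the function name `ols_1d` and the toy data are our own illustrative choices, not part of the notes.

```python
import numpy as np

def ols_1d(x, y):
    """Closed-form least squares fit of y ~ theta0 + theta1 * x."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    x_bar, y_bar = x.mean(), y.mean()
    theta1 = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
    theta0 = y_bar - theta1 * x_bar
    return theta0, theta1

# Hypothetical toy data lying exactly on y = 1 + 2x, so the fit is exact.
x = np.array([0.0, 1.0, 2.0, 3.0])
y = 1.0 + 2.0 * x
theta0_hat, theta1_hat = ols_1d(x, y)
residuals = y - (theta0_hat + theta1_hat * x)   # all zero for noiseless data
```

With noisy data the residuals are no longer zero, but for this fit they still sum to zero because the model includes an intercept.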

1.2 General Dimension


Statistical Model. Denote the input variables $x = (x_0 = 1, x_1, \ldots, x_p)^\top \in \mathbb{R}^{p+1}$, and the output variable $y \in \mathbb{R}$. The model is then:

$$y = x^\top \theta + \epsilon, \qquad \epsilon \sim N(0, \sigma^2).$$

This is linear in the parameters $\theta$.

Learning via MLE. Following the same procedure as above, the MLE leads to the least squares training objective:

$$\hat\theta = \operatorname*{argmin}_{\theta} \sum_{i=1}^n (x_i^\top \theta - y_i)^2.$$

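Optional: Optimization. It will be convenient to arrange the inputs in a matrix $X \in \mathbb{R}^{n \times (p+1)}$ whose $i$-th row is the feature vector $x_i^\top$ (including the constant feature $x_0 = 1$), and the labels in a vector $y \in \mathbb{R}^n$. The linear regression model is

$$y = X\theta + \epsilon,$$

where $\epsilon$ is now a vector as well. The ordinary least squares (OLS) method finds the parameter $\theta$ that solves

$$\min_\theta \|y - X\theta\|^2.$$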

Rewrite the objective as

$$(y - X\theta)^\top (y - X\theta) = y^\top y - 2\theta^\top X^\top y + \theta^\top X^\top X \theta.$$

Take the gradient with respect to $\theta$ and set it to the zero vector:

$$-2X^\top y + 2X^\top X\theta = 0.$$

Assuming $X^\top X$ is invertible, we obtain the solution

$$\hat\theta = (X^\top X)^{-1} X^\top y.$$

This closed form is not available in the high-dimensional setting where $X$ has more columns than rows (in particular when $p > n$), since $X^\top X$ then has rank at most $n$ and is not invertible.
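
As an illustration of the normal equations, here is a minimal NumPy sketch. The helper name `ols_fit` and the synthetic data are assumptions made for this example. It solves the linear system $X^\top X \theta = X^\top y$ directly instead of forming the inverse, which is the usual numerical practice.

```python
import numpy as np

def ols_fit(X_raw, y):
    """Solve the normal equations X^T X theta = X^T y.

    X_raw has shape (n, p); a column of ones is prepended for the intercept.
    """
    X = np.column_stack([np.ones(len(X_raw)), X_raw])
    # Solving the linear system avoids computing the inverse explicitly.
    return np.linalg.solve(X.T @ X, X.T @ y)

rng = np.random.default_rng(0)                 # hypothetical synthetic data
X_raw = rng.normal(size=(50, 3))
theta_true = np.array([0.5, 1.0, -2.0, 3.0])   # intercept followed by 3 slopes
y = theta_true[0] + X_raw @ theta_true[1:]     # noiseless, so recovery is exact
theta_hat = ols_fit(X_raw, y)
```

When $X^\top X$ is singular, `np.linalg.solve` raises an error; `np.linalg.lstsq` returns a minimum-norm least squares solution in that case.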

1.3 Ridge Regression


Recall that OLS needs to assume $X^\top X$ is invertible, which may not be true in practice. A remedy that is frequently used in practice for its stability is to add a regularization term to the optimization problem. This leads to ridge regression:

$$\min_\theta \|y - X\theta\|^2 + \lambda \|\theta\|^2. \tag{1}$$

This can also be derived from the MAP estimate.

Statistical Model. The model for the data is still:

$$y = x^\top \theta + \epsilon, \qquad \epsilon \sim N(0, \sigma^2).$$

Additionally, we assume a prior on $\theta$:

$$\theta \sim N(0, \sigma_\theta^2 I).$$

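Learning via MAP. The likelihood is:

$$
\begin{aligned}
L(\theta) &= \Pr\left[\{x_i, y_i\}_{i=1}^n \mid \theta\right] \\
&= \prod_{i=1}^n \Pr\left[x_i, y_i \mid \theta\right] \\
&= \prod_{i=1}^n \Pr[x_i] \, \Pr\left[y_i \mid x_i, \theta\right] \\
&= \prod_{i=1}^n \Pr[x_i] \cdot \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left(-\frac{(x_i^\top \theta - y_i)^2}{2\sigma^2}\right).
\end{aligned}
$$

The MAP estimate maximizes the posterior, which is proportional to $L(\theta)\, p(\theta)$:

$$
\begin{aligned}
\operatorname*{argmax}_\theta L(\theta) \cdot p(\theta) &= \operatorname*{argmax}_\theta \; \log L(\theta) + \log p(\theta) \\
&= \operatorname*{argmax}_\theta \; -\frac{1}{2\sigma^2} \sum_{i=1}^n (x_i^\top \theta - y_i)^2 - \frac{1}{2\sigma_\theta^2} \|\theta\|^2 \\
&= \operatorname*{argmin}_\theta \; \|y - X\theta\|^2 + \frac{\sigma^2}{\sigma_\theta^2} \|\theta\|^2.
\end{aligned}
$$

The last step multiplies the objective by $-2\sigma^2$. This is the ridge regression objective with $\lambda = \sigma^2 / \sigma_\theta^2$.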

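Optional: Optimization. Rewrite the objective as

$$y^\top y - 2\theta^\top X^\top y + \theta^\top X^\top X \theta + \lambda \theta^\top \theta.$$

Take the gradient with respect to $\theta$ and set it to the zero vector:

$$-2X^\top y + 2X^\top X\theta + 2\lambda\theta = 0$$

$$-2X^\top y + 2(X^\top X + \lambda I)\theta = 0$$

The solution is

$$\hat\theta = (X^\top X + \lambda I)^{-1} X^\top y.$$

Note that $X^\top X + \lambda I$ is always invertible for $\lambda > 0$, because $X^\top X$ is positive semidefinite. With an appropriate $\lambda$, ridge regression can have smaller mean squared error than OLS.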
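
The ridge solution stays well defined even when $X$ has more columns than rows. The sketch below illustrates this in NumPy; `ridge_fit` and the random data are hypothetical names and values chosen for the example. Following the formula above, it penalizes every coordinate of $\theta$, including any intercept column.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge solution (X^T X + lam * I)^{-1} X^T y, for lam > 0."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 10))   # n = 5 rows, 10 columns, so X^T X is singular
y = rng.normal(size=5)
theta_small = ridge_fit(X, y, lam=0.1)
theta_large = ridge_fit(X, y, lam=10.0)   # stronger shrinkage toward zero
```

Increasing `lam` shrinks the solution: the norm of `theta_large` is smaller than that of `theta_small`.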

1.4 Optional: Additional Topics


Estimating the noise variance. The residual sum of squares is

$$\sum_i (y_i - \hat y_i)^2,$$

and an unbiased estimate of $\sigma^2$ is

$$\hat\sigma^2 = \frac{\sum_i (y_i - \hat y_i)^2}{n - (p + 1)}.$$

The denominator is $n - (p + 1)$ because $\hat\theta$ used up $p + 1$ degrees of freedom.

The coefficient of determination. Note that the best constant fit would have been $f(x) = \bar y$, whose residual sum of squares is

$$\sum_i (y_i - \bar y)^2.$$

The coefficient of determination $r^2$ measures how much better a non-constant fit is:

$$r^2 = 1 - \frac{\sum_i (y_i - \hat y_i)^2}{\sum_i (y_i - \bar y)^2}.$$

For an OLS fit that includes the intercept, $r^2 \in [0, 1]$ on the training data, and values closer to 1 indicate a better fit.
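
Both quantities are straightforward to compute from the residuals. The sketch below assumes an OLS fit obtained with `np.linalg.lstsq`; the function name `noise_variance_and_r2`, the true parameters, and the noise level 0.5 are illustrative assumptions.

```python
import numpy as np

def noise_variance_and_r2(X, y, theta_hat):
    """X includes the column of ones, so it has p + 1 columns."""
    n, d = X.shape                          # d = p + 1
    rss = np.sum((y - X @ theta_hat) ** 2)  # residual sum of squares
    tss = np.sum((y - y.mean()) ** 2)       # RSS of the best constant fit
    sigma2_hat = rss / (n - d)
    r2 = 1.0 - rss / tss
    return sigma2_hat, r2

rng = np.random.default_rng(1)              # hypothetical synthetic data
n = 200
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(scale=0.5, size=n)
theta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
sigma2_hat, r2 = noise_variance_and_r2(X, y, theta_hat)
```

Here the true noise variance is $0.5^2 = 0.25$, so `sigma2_hat` should land near that value, up to sampling error.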

Feature expansion. Note that we only need the model to be linear in the parameters $\theta$. The features can be nonlinear functions of the input. Examples:

• $p$th-degree polynomial regression:

$$y = \theta_0 + \theta_1 x + \theta_2 x^2 + \ldots + \theta_p x^p + \epsilon.$$

• second-order model:

$$y = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \theta_3 x_1^2 + \theta_4 x_1 x_2 + \theta_5 x_2^2 + \epsilon.$$

• In general,

$$y = \theta_0 + \sum_j \theta_j \phi_j(x) + \epsilon,$$

where the $\phi_j(\cdot)$'s are arbitrary feature functions of $x$.
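
Feature expansion only changes the design matrix; the fitting step is unchanged. As a minimal sketch, polynomial regression can be run through ordinary least squares as follows (the helper `poly_features` and the quadratic target are assumptions for this example):

```python
import numpy as np

def poly_features(x, degree):
    """Map scalar inputs to the columns [1, x, x^2, ..., x^degree]."""
    return np.vander(np.asarray(x, dtype=float), N=degree + 1, increasing=True)

x = np.linspace(-1.0, 1.0, 20)
y = 1.0 - 2.0 * x + 3.0 * x ** 2        # hypothetical quadratic target, no noise
Phi = poly_features(x, degree=2)        # design matrix of shape (20, 3)
theta_hat = np.linalg.lstsq(Phi, y, rcond=None)[0]
```

The model is nonlinear in $x$ but linear in $\theta$, so the least squares machinery applies without modification.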

2 Logistic Regression
Statistical Model. Consider binary classification with $y \in \{-1, 1\}$, where each example is represented by a feature vector $x$. The intuition is to first map $x$ to a real number, such that a large positive number means that $x$ is likely to be positive ($y = 1$), and a large negative number means that $x$ is likely to be negative ($y = -1$). This is done just like in linear regression:

$$\theta^\top x \tag{2}$$

The next step is to "squash" the range down to $[0, 1]$ so one can interpret it as the label probability. This is done via the logistic function:

$$p(y = 1 \mid x) = \sigma(\theta^\top x) = \frac{1}{1 + \exp(-\theta^\top x)}. \tag{3}$$

For binary classification with the $-1, 1$ class encoding, one can unify the definitions of $p(y = 1 \mid x)$ and $p(y = -1 \mid x)$, using $1 - \sigma(z) = \sigma(-z)$, as

$$p(y \mid x) = \frac{1}{1 + \exp(-y\theta^\top x)}. \tag{4}$$
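
The unified form (4) can be sketched in a few lines of NumPy. The names `sigmoid` and `label_prob` and the numeric values below are illustrative assumptions.

```python
import numpy as np

def sigmoid(z):
    """Logistic function 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

def label_prob(theta, x, y):
    """p(y | x) for y in {-1, +1}, following equation (4)."""
    return sigmoid(y * (theta @ x))

theta = np.array([1.0, -2.0])   # hypothetical parameter vector
x = np.array([2.0, 0.5])        # theta @ x = 1.0
p_pos = label_prob(theta, x, +1)
p_neg = label_prob(theta, x, -1)
```

Because $1 - \sigma(z) = \sigma(-z)$, the two probabilities sum to one.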

Note that logistic regression can be easily generalized to multiple classes. Let there be $K$ classes. Each class has its own parameter vector $\theta_k$ (see footnote 1). The probability is defined via the softmax function

$$p(y = k \mid x) = \frac{\exp(\theta_k^\top x)}{\sum_{i=1}^K \exp(\theta_i^\top x)}. \tag{5}$$

If a predicted class label (instead of a probability) is desired, simply predict the class with the largest probability $p(y = k \mid x)$. Logistic regression produces linear decision boundaries between classes. We will focus on binary classification in the rest of this note.
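
A small sketch of softmax prediction, assuming the class parameters are stacked as rows of a matrix `Theta` (a layout chosen here for convenience, not taken from the notes). Subtracting the largest score before exponentiating is a standard numerical safeguard that leaves the probabilities unchanged.

```python
import numpy as np

def softmax_probs(Theta, x):
    """Class probabilities from equation (5); row k of Theta is theta_k."""
    scores = Theta @ x
    scores = scores - scores.max()   # shift for numerical stability
    w = np.exp(scores)
    return w / w.sum()

Theta = np.array([[1.0, 0.0],
                  [0.0, 1.0],
                  [-1.0, -1.0]])     # hypothetical parameters, K = 3
x = np.array([2.0, 1.0])             # scores are 2, 1, -3
probs = softmax_probs(Theta, x)
predicted_class = int(np.argmax(probs))   # index of the most probable class
```

Adding the same vector to every row of `Theta` shifts all scores equally and leaves `probs` unchanged, which is why $K - 1$ parameter vectors suffice.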

Learning via MLE. Training (i.e., finding the parameter $\theta$) can be done by maximizing the log likelihood of the training data:

$$
\begin{aligned}
\max_\theta \sum_{i=1}^n \log p(x_i, y_i \mid \theta) &= \max_\theta \sum_{i=1}^n \log\left[p(y_i \mid x_i, \theta)\, p(x_i)\right] \\
&= \max_\theta \sum_{i=1}^n \log p(y_i \mid x_i, \theta) + \sum_{i=1}^n \log p(x_i).
\end{aligned}
$$

Since no relation is assumed between the parameters $\theta$ and the distribution of the $x_i$'s, the above reduces to maximizing the conditional log likelihood of the training data $\{(x_i, y_i)\}_{i=1}^n$:

$$\max_\theta \sum_{i=1}^n \log p(y_i \mid x_i, \theta). \tag{6}$$

However, when the training data is linearly separable, two bad things happen: (1) $\|\theta\|$ goes to infinity; (2) there are infinitely many MLEs. To see this, note that any step function (a sigmoid with $\|\theta\| = \infty$) whose boundary lies in the gap between the two classes is an MLE.
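
To make the first claim concrete, here is a short worked argument (a sketch; the margin notation $m_i$ is introduced here and does not appear elsewhere in the notes). Suppose $\theta$ separates the training data, i.e., $m_i = y_i \theta^\top x_i > 0$ for all $i$. Scaling $\theta$ by $c > 0$ gives the conditional log likelihood

$$\sum_{i=1}^n \log p(y_i \mid x_i, c\theta) = -\sum_{i=1}^n \log\left(1 + \exp(-c\, m_i)\right),$$

whose derivative with respect to $c$ is

$$\sum_{i=1}^n \frac{m_i \exp(-c\, m_i)}{1 + \exp(-c\, m_i)} > 0.$$

The log likelihood therefore strictly increases with $c$ and approaches its supremum $0$ only as $c \to \infty$, so no finite $\theta$ attains the maximum.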

Learning via MAP. One way to avoid the above issues is to incorporate a prior on $\theta$ in the form of a zero-mean Gaussian with covariance $\frac{1}{\lambda} I$,

$$\theta \sim N\left(0, \frac{1}{\lambda} I\right), \tag{7}$$

and seek the MAP estimate. This is essentially smoothing, since large $\theta$ values will be penalized more. That is, we seek the $\theta$ that maximizes

$$\log p(\theta) + \sum_{i=1}^n \log p(y_i \mid x_i, \theta) \tag{8}$$

$$= -\frac{\lambda}{2} \|\theta\|^2 + \sum_{i=1}^n \log p(y_i \mid x_i, \theta) + \text{constant} \tag{9}$$

$$= -\frac{\lambda}{2} \|\theta\|^2 - \sum_{i=1}^n \log\left(1 + \exp(-y_i \theta^\top x_i)\right) + \text{constant}. \tag{10}$$

Footnote 1: Strictly speaking, one needs only $K - 1$ parameter vectors.

Equivalently, one minimizes the $\ell_2$-regularized negative log likelihood loss

$$\min_\theta \; \frac{\lambda}{2} \|\theta\|^2 + \sum_{i=1}^n \log\left(1 + \exp(-y_i \theta^\top x_i)\right). \tag{11}$$

For $\lambda > 0$ this objective is strictly convex, so it has a unique global minimizer. Unfortunately there is no closed-form solution. One typically solves the optimization problem with iterative methods such as (stochastic) gradient descent or Newton-Raphson iterations (the latter is also known as iteratively reweighted least squares for logistic regression).

3 Stochastic Gradient Descent


In anticipation of more complex non-convex learners, we present a simple training algorithm that works for both ridge regression (1) and logistic regression (11). Observe that both objectives can be written, up to a rescaling of $\lambda$, as follows:

$$\min_\theta \sum_{i=1}^n \ell(x_i, y_i, \theta) + \frac{\lambda}{2} \|\theta\|^2 \tag{12}$$

where $\ell(x_i, y_i, \theta)$ is the loss function.


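The gradient descent algorithm starts at an arbitrary initial point $\theta_0$ (the iterate at $t = 0$, not the intercept), and repeats the following update until convergence:

$$\theta_{t+1} = \theta_t - \eta \nabla_\theta \left( \sum_{i=1}^n \ell(x_i, y_i, \theta) + \frac{\lambda}{2} \|\theta\|^2 \right) \Bigg|_{\theta = \theta_t} \tag{13}$$

$$= \theta_t - \eta \sum_{i=1}^n \nabla_\theta \ell(x_i, y_i, \theta_t) - \eta\lambda\theta_t \tag{14}$$

where $\eta$ is a small positive number known as the step size.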
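
The update can be sketched for the squared loss as follows. This is a minimal illustration, not a tuned implementation: the function name, the synthetic data, the step size `eta=1e-3`, and the iteration count are all assumptions. Because objective (12) is quadratic here, the result can be compared against the exact minimizer.

```python
import numpy as np

def gradient_descent(X, y, lam, eta, num_iters):
    """Full-batch gradient descent on sum_i (x_i^T theta - y_i)^2 + (lam/2) ||theta||^2."""
    theta = np.zeros(X.shape[1])
    for _ in range(num_iters):
        grad_loss = 2.0 * X.T @ (X @ theta - y)   # sum of per-example gradients
        theta = theta - eta * grad_loss - eta * lam * theta
    return theta

rng = np.random.default_rng(0)                    # hypothetical synthetic data
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -1.0, 0.5]) + 0.1 * rng.normal(size=100)
lam = 1.0
theta_gd = gradient_descent(X, y, lam, eta=1e-3, num_iters=2000)
# Setting the gradient of (12) to zero gives (2 X^T X + lam I) theta = 2 X^T y.
theta_exact = np.linalg.solve(2.0 * X.T @ X + lam * np.eye(3), 2.0 * X.T @ y)
```

The step size must be small relative to the curvature of the objective; too large an `eta` makes the iterates diverge.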


Each iteration of gradient descent needs to go over all $n$ training points to compute the gradient, which may be too expensive. Note that we can write it equivalently as

$$\theta_{t+1} = \theta_t - n\eta\, E\left[\nabla_\theta \ell(x, y, \theta_t)\right] - \eta\lambda\theta_t \tag{15}$$

where the expectation is with respect to the empirical distribution on the training set.
Stochastic Gradient Descent (SGD) instead performs the following:

• At each iteration, draw $(\tilde x, \tilde y)$ from the empirical distribution, i.e., select a training point uniformly at random.

• Perform the update

$$\theta_{t+1} = \theta_t - n\eta \nabla_\theta \ell(\tilde x, \tilde y, \theta_t) - \eta\lambda\theta_t. \tag{16}$$

The update, in expectation, agrees with (15). Note that the above may differ from other texts by the coefficient $n$. This is an artifact of us not dividing the sum of losses by $n$, and amounts to a rescaling of $\lambda$.
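
The two steps above can be sketched for the squared loss as follows. The function name `sgd_ridge`, the synthetic data, and the constants `eta=1e-4` and `num_iters=20000` are assumptions for illustration. With a constant step size the iterates do not converge exactly; they fluctuate in a small neighborhood of the minimizer.

```python
import numpy as np

def sgd_ridge(X, y, lam, eta, num_iters, seed=0):
    """SGD for objective (12) with squared loss, using update (16)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    theta = np.zeros(d)
    for _ in range(num_iters):
        i = rng.integers(n)                          # uniform draw from the training set
        grad_i = 2.0 * (X[i] @ theta - y[i]) * X[i]  # gradient of one squared loss
        theta = theta - n * eta * grad_i - eta * lam * theta
    return theta

rng = np.random.default_rng(0)                       # hypothetical synthetic data
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -1.0, 0.5]) + 0.1 * rng.normal(size=100)
lam = 1.0
theta_sgd = sgd_ridge(X, y, lam, eta=1e-4, num_iters=20000)
# Exact minimizer of (12) for comparison: (2 X^T X + lam I) theta = 2 X^T y.
theta_exact = np.linalg.solve(2.0 * X.T @ X + lam * np.eye(3), 2.0 * X.T @ y)
```

Each iteration touches a single training point, so its cost does not grow with $n$.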

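For ridge regression,

$$\ell(x, y, \theta) = (x^\top \theta - y)^2.$$

Therefore,

$$\nabla_\theta \ell(x, y, \theta) = 2(x^\top \theta - y)\, x.$$

For logistic regression,

$$\ell(x, y, \theta) = \log\left(1 + \exp(-y\theta^\top x)\right).$$

Therefore,

$$\nabla_\theta \ell(x, y, \theta) = \frac{1}{1 + \exp(-y\theta^\top x)} \exp(-y\theta^\top x)\, (-y x).$$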
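
A quick way to gain confidence in a hand-derived gradient is to compare it with finite differences. The sketch below does this for the logistic loss; the helper names and the numeric values are hypothetical.

```python
import numpy as np

def logistic_loss(theta, x, y):
    """log(1 + exp(-y * theta^T x)) for a single example."""
    return np.log1p(np.exp(-y * (theta @ x)))

def logistic_grad(theta, x, y):
    """Gradient from the text: exp(-m) / (1 + exp(-m)) * (-y x), with m = y theta^T x."""
    m = y * (theta @ x)
    return np.exp(-m) / (1.0 + np.exp(-m)) * (-y * x)

def numerical_grad(f, theta, eps=1e-6):
    """Central finite differences, used only as a sanity check."""
    g = np.zeros_like(theta)
    for j in range(len(theta)):
        e = np.zeros_like(theta)
        e[j] = eps
        g[j] = (f(theta + e) - f(theta - e)) / (2.0 * eps)
    return g

theta = np.array([0.3, -0.7, 1.2])   # hypothetical values
x = np.array([1.0, 2.0, -0.5])
y = -1
analytic = logistic_grad(theta, x, y)
numeric = numerical_grad(lambda t: logistic_loss(t, x, y), theta)
```

The two vectors should agree up to the finite-difference error, which is tiny for this smooth loss.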
SGD can be further stabilized by:

1. Instead of taking the last iteration's parameter vector $\theta_T$, take the average of all parameters over time:

$$\bar\theta = \frac{1}{T} \sum_{t=1}^T \theta_t.$$

You will see that, for reasonably large $T$, the average parameter achieves an MSE objective that is very close to the minimum. It will still fluctuate a little, as one should expect, but overall it is a lot more stable.

2. Alternatively (or in conjunction), you may try to diminish the step size $\eta$ as you go. One common scheme is to let the step size in iteration $t$ be

$$\eta_t = \eta_0 / t$$

where $\eta_0$ is the original step size. As $t = 1, 2, \ldots$ increases, $\eta_t$ decreases. This will also stabilize SGD.

Both methods are motivated by theoretical analysis of SGD.
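
Both stabilization techniques fit into the SGD loop with a few extra lines. The following is a sketch under assumptions: the function name `sgd_stabilized`, the `decay` flag, the synthetic data, and all constants are our own choices. The demonstration uses iterate averaging with a constant step size; the decaying schedule is available through `decay=True`.

```python
import numpy as np

def sgd_stabilized(X, y, lam, eta0, num_iters, decay=False, seed=0):
    """SGD on objective (12) with squared loss.

    Returns the last iterate and the average of all iterates.
    With decay=True the step size in iteration t is eta0 / t.
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    theta = np.zeros(d)
    theta_sum = np.zeros(d)
    for t in range(1, num_iters + 1):
        eta_t = eta0 / t if decay else eta0
        i = rng.integers(n)                          # pick one training point
        grad_i = 2.0 * (X[i] @ theta - y[i]) * X[i]
        theta = theta - n * eta_t * grad_i - eta_t * lam * theta
        theta_sum += theta
    return theta, theta_sum / num_iters

rng = np.random.default_rng(0)                       # hypothetical synthetic data
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -1.0, 0.5]) + 0.1 * rng.normal(size=100)
lam = 1.0
theta_last, theta_avg = sgd_stabilized(X, y, lam, eta0=1e-4, num_iters=20000)
# Exact minimizer of (12) for comparison: (2 X^T X + lam I) theta = 2 X^T y.
theta_exact = np.linalg.solve(2.0 * X.T @ X + lam * np.eye(3), 2.0 * X.T @ y)
```

With the $\eta_0 / t$ schedule, progress can become very slow if $\eta_0$ is chosen too small, so $\eta_0$ usually needs some tuning.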
