4 Linear Regression Additional Notes
We briefly go over two linear models frequently used in machine learning: linear regression
for regression, and logistic regression for classification.
1 Linear regression
1.1 The 1D case
Let’s first introduce some notation. We are given training data points {(xi , yi )}, i = 1, . . . , n. We would like to learn a model of p(y|x∗ ) for a new test point x∗ ; in practice one usually predicts the single value E(y|x∗ ) instead of outputting the whole distribution.
Statistical Model. For the goal of prediction, we first need a probabilistic model for the
data points. The typical model assumption is:
y = θ0 + θ1 x + ϵ, ϵ ∼ N (0, σ 2 ).
This is equivalent to assuming a Gaussian distribution for the conditional probability p(y|x):
y|x ∼ N (θ0 + θ1 x, σ 2 ).
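Learning via MLE. We now show that the maximum likelihood estimate (MLE) leads to the least squares estimate. Recall that ϵ ∼ N (0, σ 2 ) has density
\[ p(\epsilon) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{\epsilon^2}{2\sigma^2}\right). \]
The likelihood is then:
\[ \prod_{i=1}^{n} p(y_i \mid x_i) = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(y_i - \theta_0 - \theta_1 x_i)^2}{2\sigma^2}\right). \]
Taking the logarithm and dropping the terms that do not depend on θ, maximizing the likelihood is the same as minimizing the sum of squared errors $\sum_{i=1}^{n} (y_i - \theta_0 - \theta_1 x_i)^2$.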
Optional: Optimization. The least squares training objective is convex, so we can minimize it by setting the gradient to zero, which gives:
\[ \hat\theta_1 = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sum_i (x_i - \bar{x})^2}, \qquad \hat\theta_0 = \bar{y} - \hat\theta_1 \bar{x}, \]
where $\bar{x} = (\sum_i x_i)/n$ and $\bar{y} = (\sum_i y_i)/n$. This is known as ordinary least squares (OLS).
The prediction is then $\hat{y}_i = \hat\theta_0 + \hat\theta_1 x_i$, and the residual is $y_i - \hat{y}_i$.
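The closed-form estimates above can be sketched in a few lines of plain Python. The function name `fit_ols_1d` and the toy data are assumptions made for illustration; they do not come from the notes.

```python
def fit_ols_1d(xs, ys):
    """Return (theta0, theta1) for the 1D least squares line."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Slope: sum of centered cross-products over sum of centered squares.
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    theta1 = sxy / sxx
    # Intercept: the fitted line passes through (x_bar, y_bar).
    theta0 = y_bar - theta1 * x_bar
    return theta0, theta1

# Toy data lying exactly on y = 1 + 2x (made up for illustration).
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
theta0, theta1 = fit_ols_1d(xs, ys)

# Residuals y_i - y_hat_i; they vanish here because the data are noiseless.
residuals = [y - (theta0 + theta1 * x) for x, y in zip(xs, ys)]
```

Because the toy points lie exactly on a line, the fit recovers that line and every residual is zero up to rounding; with noisy data the residuals would be nonzero.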
For a feature vector x with p entries, the model becomes:
y = x⊤ θ + ϵ, ϵ ∼ N (0, σ 2 ).
Learning via MLE. Following the same procedure as above, the MLE leads to the least squares training objective:
\[ \hat\theta = \operatorname*{argmin}_{\theta} \sum_{i=1}^{n} (x_i^\top \theta - y_i)^2. \]
Optional: Optimization. It is convenient to arrange the inputs in a matrix $X \in \mathbb{R}^{n \times p}$ whose rows are the feature vectors, and the labels in a vector $y \in \mathbb{R}^{n}$. The linear regression model is
\[ y = X\theta + \epsilon, \]
where ϵ is now a vector, too. The ordinary least squares (OLS) method finds the parameter
\[ \hat\theta = \operatorname*{argmin}_{\theta} \|y - X\theta\|^2. \]
Take the gradient with respect to θ and set it to the zero vector:
\[ -2X^\top y + 2X^\top X\theta = 0 \quad\Longrightarrow\quad \hat\theta = (X^\top X)^{-1} X^\top y. \]
This closed form does not work in the high-dimensional setting p > n, however, since X ⊤ X is then rank deficient and not invertible.
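Both the closed form and the rank deficiency for p > n can be checked numerically. The following is a minimal sketch assuming NumPy is available; the synthetic data and the names `theta_true`, `X_wide`, and `rank_wide` are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic regression problem with n > p (all values are illustrative).
n, p = 50, 3
X = rng.normal(size=(n, p))
theta_true = np.array([1.0, -2.0, 0.5])
y = X @ theta_true + 0.1 * rng.normal(size=n)

# Solve the normal equations X^T X theta = X^T y without forming the inverse.
theta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# The gradient -2 X^T y + 2 X^T X theta should vanish at the solution.
grad = -2 * X.T @ y + 2 * X.T @ X @ theta_hat

# With p > n (here n = 2, p = 3), X^T X is p x p but has rank at most n.
X_wide = rng.normal(size=(2, 3))
rank_wide = np.linalg.matrix_rank(X_wide.T @ X_wide)
```

Using `np.linalg.solve` instead of an explicit inverse is a common numerical choice, not something the notes prescribe.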
y = x⊤ θ + ϵ, ϵ ∼ N (0, σ 2 ).
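Learning via MAP. Up to constants, the MAP estimate minimizes the ℓ2-regularized least squares objective $\|y - X\theta\|^2 + \lambda\|\theta\|^2$, which expands to
\[ y^\top y - 2\theta^\top X^\top y + \theta^\top X^\top X\theta + \lambda\theta^\top\theta. \]
Take the gradient with respect to θ and set it to the zero vector:
\[ -2X^\top y + 2X^\top X\theta + 2\lambda\theta = 0 \quad\Longrightarrow\quad \hat\theta = (X^\top X + \lambda I)^{-1} X^\top y. \]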
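The stationarity condition above can be solved numerically even when p > n, because X⊤X + λI is invertible for any λ > 0. The following is a minimal sketch assuming NumPy; the helper name `fit_ridge`, the random data, and the value of `lam` are assumptions made for illustration.

```python
import numpy as np

def fit_ridge(X, y, lam):
    """Solve (X^T X + lam * I) theta = X^T y for theta."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

rng = np.random.default_rng(0)

# A deliberately underdetermined problem: p = 5 features, n = 2 examples.
X = rng.normal(size=(2, 5))
y = rng.normal(size=2)
lam = 0.1

theta_ridge = fit_ridge(X, y, lam)

# Gradient of the regularized objective, which should be zero at the solution.
grad = -2 * X.T @ y + 2 * X.T @ X @ theta_ridge + 2 * lam * theta_ridge
```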
The coefficient of determination. Note that the best constant fit would have been f (x) = ȳ, whose residual sum of squares is
\[ \sum_i (y_i - \bar{y})^2. \]
The coefficient of determination r2 is a measure of how much better a non-constant line fit is:
\[ r^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2}. \]
r2 ∈ [0, 1], with 1 being desirable.
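As a quick illustration of the two extremes, the sketch below computes r2 for a perfect fit and for the best constant fit. The function name `r_squared` and the toy labels are assumptions made for illustration.

```python
def r_squared(ys, y_hats):
    """Coefficient of determination: 1 - RSS(fit) / RSS(best constant)."""
    y_bar = sum(ys) / len(ys)
    rss_fit = sum((y - y_hat) ** 2 for y, y_hat in zip(ys, y_hats))
    rss_const = sum((y - y_bar) ** 2 for y in ys)
    return 1.0 - rss_fit / rss_const

ys = [1.0, 2.0, 3.0, 4.0]

# Predictions equal to the labels give zero residuals.
r2_perfect = r_squared(ys, ys)

# Predicting the mean (2.5) everywhere is exactly the constant baseline.
r2_constant = r_squared(ys, [2.5, 2.5, 2.5, 2.5])
```

The perfect fit gives r2 = 1, and the constant fit gives r2 = 0, since it is no better than the baseline it is compared against.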
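Feature expansion. Note that we only need the model to be linear in the parameters θ. The features themselves can be nonlinear functions of the input. Examples:
• $y = \theta_0 + \theta_1 x + \theta_2 x^2 + \dots + \theta_p x^p + \epsilon$.
• $y = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \theta_3 x_1^2 + \theta_4 x_1 x_2 + \theta_5 x_2^2 + \epsilon$.
• In general
\[ y = \theta_0 + \sum_j \theta_j \phi_j(x) + \epsilon, \]
where each ϕj is a fixed, possibly nonlinear, function of the input.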
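The polynomial case can be sketched as ordinary least squares on an expanded design matrix. This assumes NumPy; the helper name `poly_features` and the noiseless quadratic data are made up for illustration.

```python
import numpy as np

def poly_features(x, degree):
    """Map a 1D input array to columns x^0, x^1, ..., x^degree."""
    # The x^0 column of ones plays the role of the intercept theta_0.
    return np.vander(x, degree + 1, increasing=True)

# Noiseless quadratic y = 1 - 2x + 3x^2 (illustrative data).
x = np.linspace(-1.0, 1.0, 20)
y = 1.0 - 2.0 * x + 3.0 * x**2

# The model is nonlinear in x but still linear in theta,
# so a standard least squares solve applies to the expanded features.
Phi = poly_features(x, degree=2)
theta_hat, *_ = np.linalg.lstsq(Phi, y, rcond=None)
```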
2 Logistic Regression
Statistical Model. Consider binary classification with labels y ∈ {−1, 1}, where each example is represented by a feature vector x. The intuition is to first map x to a real number, such that a large positive number means that x is likely to be positive (y = 1), and a large negative number means that x is likely to be negative (y = −1). This is done just like in linear regression:
θ⊤ x (2)
The next step is to “squash” the range down to [0, 1] so one can interpret it as the label probability. This is done via the logistic function:
\[ p(y = 1 \mid x) = \sigma(\theta^\top x) = \frac{1}{1 + \exp(-\theta^\top x)}. \tag{3} \]
For binary classification with the −1, 1 class encoding, one can unify the definitions of p(y = 1|x) and p(y = −1|x) with
\[ p(y \mid x) = \frac{1}{1 + \exp(-y\theta^\top x)}. \tag{4} \]
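A small sketch can confirm that the unified form assigns complementary probabilities to the two labels. The function names `sigmoid` and `prob`, and the values of `theta` and `x`, are assumptions made for illustration.

```python
import math

def sigmoid(z):
    """Logistic function 1 / (1 + exp(-z)), written to avoid overflow."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def prob(y, theta, x):
    """p(y | x) for y in {-1, +1}, using the unified form sigma(y * theta^T x)."""
    score = sum(t * xi for t, xi in zip(theta, x))
    return sigmoid(y * score)

theta = [0.5, -1.0]   # illustrative parameter vector
x = [2.0, 0.5]        # illustrative feature vector, theta^T x = 0.5

p_pos = prob(+1, theta, x)
p_neg = prob(-1, theta, x)
```

The two probabilities sum to one because σ(z) + σ(−z) = 1, which is what makes the single formula valid for both labels.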
Note that logistic regression can be easily generalized to multiple classes. Let there be K classes. Each class has its own¹ parameter θk . The probability is defined via the softmax function
\[ p(y = k \mid x) = \frac{\exp(\theta_k^\top x)}{\sum_{i=1}^{K} \exp(\theta_i^\top x)}. \tag{5} \]
If a predicted class label (instead of a probability) is desired, simply predict the class with
the largest probability p(y = k|x). Logistic regression produces linear decision boundaries
between classes. We will focus on binary classification in the rest of this note.
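The softmax probabilities and the resulting class prediction can be sketched as follows. The function name `softmax_probs` and the three parameter vectors are assumptions made for illustration; subtracting the maximum score is a standard numerical precaution that leaves the probabilities unchanged.

```python
import math

def softmax_probs(thetas, x):
    """Return [p(y = k | x) for each class k] under the softmax model."""
    scores = [sum(t * xi for t, xi in zip(theta_k, x)) for theta_k in thetas]
    # Shifting all scores by a constant cancels in the ratio.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Three classes in two dimensions (illustrative parameters).
thetas = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
x = [2.0, 1.0]   # scores are 2, 1, and -3

probs = softmax_probs(thetas, x)

# Predict the class with the largest probability.
predicted = max(range(len(probs)), key=lambda k: probs[k])
```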
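Learning via MLE. Training (i.e., finding the parameter θ) can be done by maximizing the log likelihood of the training data:
\[ \begin{aligned} \max_{\theta} \sum_{i=1}^{n} \log p(x_i, y_i \mid \theta) &= \max_{\theta} \sum_{i=1}^{n} \log\left[p(y_i \mid x_i, \theta)\, p(x_i)\right] \\ &= \max_{\theta} \sum_{i=1}^{n} \log p(y_i \mid x_i, \theta) + \sum_{i=1}^{n} \log p(x_i). \end{aligned} \]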
Since no relation is assumed between the parameters θ and the distribution of the xi ’s, the term $\sum_i \log p(x_i)$ is constant in θ, and the above reduces to maximizing the conditional log likelihood of the training data $\{(x_i, y_i)\}_{i=1}^{n}$:
\[ \max_{\theta} \sum_{i=1}^{n} \log p(y_i \mid x_i, \theta). \tag{6} \]
However, when the training data is linearly separable, two bad things happen: 1. ∥θ∥ goes to infinity; 2. there are infinitely many MLEs. To see this, note that any step function (a sigmoid with ∥θ∥ = ∞) whose boundary lies in the gap between the two classes is an MLE.
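The first problem can be seen numerically on a tiny separable data set: scaling θ up always increases the conditional log likelihood, so no finite maximizer exists. The sketch below uses a scalar θ and one-dimensional inputs; the data and the function name `log_likelihood` are assumptions made for illustration.

```python
import math

def log_likelihood(theta, data):
    """Conditional log likelihood for scalar theta and 1D inputs."""
    # log p(y | x, theta) = -log(1 + exp(-y * theta * x))
    return -sum(math.log1p(math.exp(-y * theta * x)) for x, y in data)

# Linearly separable 1D data: negatives left of zero, positives right of zero.
data = [(-2.0, -1), (-1.0, -1), (1.0, 1), (2.0, 1)]

# Evaluate along the same direction with growing norm.
lls = [log_likelihood(c, data) for c in (1.0, 10.0, 100.0)]
```

The values increase toward 0 as the scale grows but never reach it, which is why ∥θ∥ diverges under plain MLE.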
Learning via MAP. One way to avoid the above issues is to incorporate a prior on θ in the form of a zero-mean Gaussian with covariance $\frac{1}{\lambda} I$,
\[ \theta \sim N\left(0, \frac{1}{\lambda} I\right), \tag{7} \]
and seek the MAP estimate. This is essentially smoothing, since large θ values will be
penalized more. That is, we seek the θ that maximizes
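\[ \log p(\theta) + \sum_{i=1}^{n} \log p(y_i \mid x_i, \theta) \tag{8} \]
\[ = -\frac{\lambda}{2}\|\theta\|^2 + \sum_{i=1}^{n} \log p(y_i \mid x_i, \theta) + \text{constant} \tag{9} \]
\[ = -\frac{\lambda}{2}\|\theta\|^2 - \sum_{i=1}^{n} \log\left(1 + \exp(-y_i \theta^\top x_i)\right) + \text{constant}. \tag{10} \]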
¹ Strictly speaking, one needs only K − 1 parameter vectors.
Equivalently, one minimizes the ℓ2-regularized negative log likelihood loss
\[ \min_{\theta} \frac{\lambda}{2}\|\theta\|^2 + \sum_{i=1}^{n} \log\left(1 + \exp(-y_i \theta^\top x_i)\right). \tag{11} \]
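To see how the penalty restores a finite minimizer on separable data, the regularized objective can be evaluated at a few scales. This is a minimal sketch; the function name `regularized_nll`, the data, and the choice λ = 1 are assumptions made for illustration.

```python
import math

def regularized_nll(theta, data, lam):
    """(lam / 2) * ||theta||^2 + sum_i log(1 + exp(-y_i * theta^T x_i))."""
    penalty = 0.5 * lam * sum(t * t for t in theta)
    nll = 0.0
    for x, y in data:
        margin = y * sum(t * xi for t, xi in zip(theta, x))
        # Stable evaluation of log(1 + exp(-margin)) for large |margin|.
        nll += max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))
    return penalty + nll

# Linearly separable 1D data, each input stored as a length-1 vector.
data = [([-2.0], -1), ([-1.0], -1), ([1.0], 1), ([2.0], 1)]

# Objective at theta = 0, 1, and 100 with lam = 1.
values = [regularized_nll([c], data, lam=1.0) for c in (0.0, 1.0, 100.0)]
```

Unlike the unregularized likelihood, the objective is worse at θ = 100 than at θ = 1, because the quadratic penalty eventually dominates the shrinking loss term.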
where the expectation is with respect to the empirical distribution on the training set.
Stochastic Gradient Descent (SGD) instead performs the following:
• At each iteration draw $(\tilde{x}, \tilde{y})$ from the empirical distribution, i.e., select a training point uniformly at random.
• Perform the update
The update, in expectation, agrees with (15). Note that the above may differ from other texts by a factor of n. This is an artifact of our not dividing the sum of losses by n, and amounts to a rescaling of λ.
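The update rule itself is not reproduced in the text above, so the following sketch rests on an assumption: that the objective is (λ/2)∥θ∥² plus the sum of the per-example losses, and that the stochastic step is θ ← θ − η(λθ + n∇ℓ(x̃, ỹ, θ)). That form is consistent with the remarks about the factor n, but it is a reconstruction, not the notes' own equation. All function names, the data, and the hyperparameters are likewise illustrative, and the logistic loss is used as the example.

```python
import math
import random

def sigmoid(z):
    # Logistic function, written to avoid overflow for large |z|.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def logistic_grad(x, y, theta):
    # Gradient of log(1 + exp(-y * theta^T x)) with respect to theta.
    margin = y * sum(t * xi for t, xi in zip(theta, x))
    coef = -y * sigmoid(-margin)
    return [coef * xi for xi in x]

def full_gradient(data, lam, theta):
    # Gradient of (lam / 2) * ||theta||^2 + sum_i loss_i (assumed objective).
    grad = [lam * t for t in theta]
    for x, y in data:
        grad = [g + gi for g, gi in zip(grad, logistic_grad(x, y, theta))]
    return grad

def stochastic_gradient(point, n, lam, theta):
    # Assumed form: lam * theta + n * (loss gradient at one sampled point).
    x, y = point
    return [lam * t + n * gi for t, gi in zip(theta, logistic_grad(x, y, theta))]

def sgd(data, lam, eta, num_iters, seed=0):
    rng = random.Random(seed)
    n = len(data)
    theta = [0.0] * len(data[0][0])
    for _ in range(num_iters):
        point = rng.choice(data)  # uniform draw from the training set
        g = stochastic_gradient(point, n, lam, theta)
        theta = [t - eta * gi for t, gi in zip(theta, g)]
    return theta

data = [([-2.0], -1), ([-1.0], -1), ([1.0], 1), ([2.0], 1)]
theta_sgd = sgd(data, lam=1.0, eta=0.05, num_iters=2000)

# Averaging the stochastic gradient over all n points recovers the full gradient.
theta_probe = [0.3]
avg_grad = [
    sum(stochastic_gradient(pt, len(data), 1.0, theta_probe)[j] for pt in data) / len(data)
    for j in range(len(theta_probe))
]
full_grad = full_gradient(data, 1.0, theta_probe)
```

The last lines check the "agrees in expectation" property directly: under uniform sampling, the mean of the stochastic gradients equals the full gradient of the assumed objective.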
For ridge regression,
ℓ(x, y, θ) = (x⊤ θ − y)2 .
Therefore,
∇θ ℓ(x, y, θ) = 2(x⊤ θ − y)x.
For logistic regression,
\[ \ell(x, y, \theta) = \log\left(1 + \exp(-y\theta^\top x)\right). \]
Therefore,
\[ \nabla_\theta \ell(x, y, \theta) = \frac{1}{1 + \exp(-y\theta^\top x)} \exp(-y\theta^\top x)\,(-yx). \]
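Both gradient formulas can be sanity-checked against central finite differences. This is a minimal sketch; the function names, the step `h`, and the test point are assumptions made for illustration.

```python
import math

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def squared_loss(x, y, theta):
    return (dot(x, theta) - y) ** 2

def squared_grad(x, y, theta):
    # 2 * (x^T theta - y) * x
    r = 2.0 * (dot(x, theta) - y)
    return [r * xi for xi in x]

def logistic_loss(x, y, theta):
    return math.log1p(math.exp(-y * dot(theta, x)))

def logistic_grad(x, y, theta):
    # exp(-y theta^T x) / (1 + exp(-y theta^T x)) * (-y x)
    e = math.exp(-y * dot(theta, x))
    coef = e / (1.0 + e)
    return [coef * (-y * xi) for xi in x]

def numeric_grad(loss, x, y, theta, h=1e-6):
    # Central difference approximation, one coordinate at a time.
    grad = []
    for j in range(len(theta)):
        up = list(theta)
        up[j] += h
        down = list(theta)
        down[j] -= h
        grad.append((loss(x, y, up) - loss(x, y, down)) / (2.0 * h))
    return grad

# Arbitrary test point (illustrative values).
x, y, theta = [1.0, -2.0, 0.5], -1, [0.3, 0.1, -0.4]

err_squared = max(
    abs(a - b)
    for a, b in zip(squared_grad(x, y, theta), numeric_grad(squared_loss, x, y, theta))
)
err_logistic = max(
    abs(a - b)
    for a, b in zip(logistic_grad(x, y, theta), numeric_grad(logistic_loss, x, y, theta))
)
```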
SGD can be further stabilized by:
1. Instead of taking the last iteration’s parameter vector θT , take the average of all parameters over time:
\[ \bar\theta = \frac{1}{T} \sum_{t=1}^{T} \theta_t. \]
You will see that, for reasonably large T the average parameter achieves an MSE
objective that is very close to the minimum. It will still fluctuate a little as one should
expect, but overall it’s a lot more stable.
2. Alternatively (or in conjunction), you may try to diminish the step size η as you go. One common scheme is to let the step size in iteration t be
\[ \eta_t = \eta_0 / \sqrt{t}. \]
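Both stabilization ideas can be combined in a short sketch. It assumes a one-parameter squared-loss problem with no regularization and a plain per-example update (any factor n is absorbed into η0); the function name `averaged_sgd`, the synthetic data, and the hyperparameters are all made up for illustration.

```python
import math
import random

def averaged_sgd(data, eta0, num_iters, seed=0):
    """SGD on the squared loss (x * theta - y)^2 with eta_t = eta0 / sqrt(t).

    Returns the last iterate and the average of all iterates.
    """
    rng = random.Random(seed)
    theta = 0.0
    theta_sum = 0.0
    for t in range(1, num_iters + 1):
        x, y = rng.choice(data)              # uniform draw from the training set
        grad = 2.0 * (x * theta - y) * x     # gradient of the squared loss
        theta -= (eta0 / math.sqrt(t)) * grad
        theta_sum += theta
    return theta, theta_sum / num_iters

# Synthetic data around the line y = 2x with small Gaussian noise.
data_rng = random.Random(1)
data = []
for _ in range(100):
    x = data_rng.uniform(-1.0, 1.0)
    data.append((x, 2.0 * x + data_rng.gauss(0.0, 0.1)))

theta_last, theta_avg = averaged_sgd(data, eta0=0.5, num_iters=5000)
```

With inputs in [−1, 1] and η0 = 0.5, each step contracts the error, so both the last iterate and the average settle near the slope 2 used to generate the data.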