
EE2211: Spring 2023

Tutorial 8

1. Suppose we are minimizing f(x) = x^4 with respect to x. We initialize x to be 2. We perform gradient descent with learning rate 0.1. What is the value of x after the first iteration?
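The update rule in question 1 is direct to code; a minimal sketch of a single iteration (running it prints the requested value, so try deriving it by hand first):

```python
# One gradient-descent iteration for f(x) = x^4: f'(x) = 4 x^3,
# so the update is x <- x - eta * 4 * x^3.
def gd_step(x, eta):
    return x - eta * 4 * x**3

x = 2.0
x = gd_step(x, 0.1)  # one iteration from x = 2 with learning rate 0.1
print(x)
```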
2. Please consider the csv file (government-expenditure-on-education2.csv), which depicts the government's educational expenditure over the years. We would like to predict expenditure as a function of year. To do this, fit an exponential model f(x, w) = exp(−x^T w) with squared error loss to estimate w based on the csv file and gradient descent. In other words,

C(w) = Σ_{i=1}^m (f(x_i, w) − y_i)^2.

Note that even though year is one dimensional, we should add the bias term, so x = [1 year]^T. Furthermore, optimizing the exponential function is tricky (because a small change in w can lead to a large change in f). Therefore, for the purpose of optimization, divide the “year” variable by the largest year (2018) and divide the “expenditure” by the largest expenditure, so that the resulting normalized year and normalized expenditure variables have maximum values of 1. Use a learning rate of 0.03 and run gradient descent for 2,000,000 iterations.

(a) Plot the cost function C(w) as a function of the number of iterations.
(b) Use the fitted parameters to plot the predicted educational expenditure from year 1981 to year
2023.
(c) Repeat (a) using a learning rate of 0.1 and a learning rate of 0.001. What do you observe relative to (a)?

The goal of this question is for you to code up gradient descent, so I will provide you with the gradient
derivation. First, please note that in general, ∇_w (x^T w) = x. To see this:
 >  
∂(x w) ∂(w1 x1 +w2 x2 +···+wd xd )
  
∂w1  ∂w1 x1
 > w)   ∂(w1 x1 +w2 x2 +···+wd xd ) 
 ∂(x
∂w2   x2 

∇w (x> w) =  ∂w. 2  = 
  = .  =x

 ..   .
.  .
   .  .
∂(x> w) ∂(w1 x1 +w2 x2 +···+wd xd ) xd
∂wd ∂wd
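The identity can also be confirmed numerically with central finite differences; a minimal check (vectors chosen arbitrarily):

```python
import numpy as np

x = np.array([1.0, -2.0, 0.5])
w = np.array([0.3, 0.7, -1.1])

eps = 1e-6
grad = np.zeros_like(w)
for j in range(len(w)):
    e = np.zeros_like(w)
    e[j] = eps
    # central finite difference of x^T w in coordinate j
    grad[j] = (x @ (w + e) - x @ (w - e)) / (2 * eps)

print(np.allclose(grad, x))  # the numerical gradient matches x
```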

The above equality will be very useful for the other questions as well. Now, going back to our question,

∇_w C(w) = ∇_w Σ_{i=1}^m (f(x_i, w) − y_i)^2
         = Σ_{i=1}^m ∇_w (f(x_i, w) − y_i)^2
         = Σ_{i=1}^m 2 (f(x_i, w) − y_i) ∇_w f(x_i, w)                    (chain rule)
         = Σ_{i=1}^m 2 (f(x_i, w) − y_i) ∇_w exp(−x_i^T w)
         = −Σ_{i=1}^m 2 (f(x_i, w) − y_i) exp(−x_i^T w) ∇_w (x_i^T w)    (chain rule)
         = −Σ_{i=1}^m 2 (f(x_i, w) − y_i) exp(−x_i^T w) x_i
         = −Σ_{i=1}^m 2 (f(x_i, w) − y_i) f(x_i, w) x_i
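Since the csv file itself is not reproduced in this sheet, the sketch below runs the derived gradient update on synthetic normalized data standing in for the real (year, expenditure) values; file handling and the plots in parts (a)-(c) are left to you, and the iteration count is shortened from 2,000,000 for speed.

```python
import numpy as np

def f(X, w):
    # exponential model f(x, w) = exp(-x^T w), applied row-wise to X
    return np.exp(-X @ w)

def grad_C(X, y, w):
    # gradient derived above: -sum_i 2 (f(x_i, w) - y_i) f(x_i, w) x_i
    r = f(X, w)
    return -2 * ((r - y) * r) @ X

# Synthetic stand-in for the normalized data (NOT the real csv values):
# normalized year in (0, 1], fake "expenditure" generated from the model itself.
years = np.linspace(1981, 2018, 38) / 2018.0
X = np.column_stack([np.ones_like(years), years])  # bias term: x = [1, year]^T
y = np.exp(-(4.0 - 3.5 * years))                   # values stay below 1

w = np.zeros(2)
eta = 0.03
costs = []
for _ in range(20_000):  # the question asks for 2,000,000; shortened for speed
    costs.append(np.sum((f(X, w) - y) ** 2))
    w -= eta * grad_C(X, y, w)
print(costs[0], costs[-1])  # the cost drops sharply over the run
```

Plotting `costs` against the iteration index gives part (a); evaluating `f` on normalized years 1981/2018 through 2023/2018 gives part (b).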

3. Consider the linear learning model f(x, w) = x^T w, where x ∈ R^d, and the loss function

L(f(x_i, w), y_i) = (f(x_i, w) − y_i)^4,

where i indexes the i-th training sample. The final cost function is

C(w) = Σ_{i=1}^m L(f(x_i, w), y_i),

where m is the total number of training samples. Derive the gradient of the cost function with respect to w.
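Whatever gradient you derive by hand for questions 3-5 can be verified with a generic finite-difference check; the sketch below demonstrates the pattern on the question 2 cost, whose gradient is already given above, so it does not give the answers away (data chosen arbitrarily):

```python
import numpy as np

def numeric_grad(C, w, eps=1e-6):
    # central finite-difference approximation of grad C at w
    g = np.zeros_like(w)
    for j in range(len(w)):
        e = np.zeros_like(w)
        e[j] = eps
        g[j] = (C(w + e) - C(w - e)) / (2 * eps)
    return g

# Demo on the question 2 cost, whose analytic gradient was derived above.
X = np.array([[1.0, 0.2], [1.0, 0.5], [1.0, 0.9]])
y = np.array([0.8, 0.6, 0.4])

def C(w):
    return np.sum((np.exp(-X @ w) - y) ** 2)

def analytic_grad(w):
    r = np.exp(-X @ w)
    return -2 * ((r - y) * r) @ X

w = np.array([0.3, -0.4])
print(np.allclose(numeric_grad(C, w), analytic_grad(w), atol=1e-6))
```

To check your own derivations, swap in the quartic (or sigmoid/ReLU) cost for `C` and your hand-derived expression for `analytic_grad`.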
4. Repeat Question 3 using f(x, w) = σ(x^T w), where σ(a) = 1/(1 + exp(−βa)).

5. Repeat Question 3 using f(x, w) = σ(x^T w), where σ(a) = max(0, a).


Remark 1. Strictly speaking, the function σ as defined in this question is not differentiable at zero. This function is, however, widely used as the so-called ReLU (Rectified Linear Unit) activation function in deep neural networks. Hence, it is of great importance in machine learning. Cavalierly, you may take the “derivative” of σ here to be

σ′(a) = dσ(a)/da = 1 if a ≥ 0, and 0 if a < 0.

6. Consider the univariate (one-input, one-output) function C(w) = 5w^2 and the initial point w_0 = 2.
Find the learning rate η so that gradient descent converges in one step. Find the set of learning rates
such that gradient descent diverges. Finally, find the set of learning rates such that gradient descent
converges.
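To build intuition before answering question 6 analytically, one can run gradient descent numerically for a few learning rates and observe which ones shrink |w| and which blow it up (the rates below are illustrative choices, not the answer):

```python
# C(w) = 5 w^2 has C'(w) = 10 w, so each step is w <- w - eta * 10 * w,
# i.e. w is multiplied by the factor (1 - 10 * eta) at every iteration.
def run_gd(eta, steps=50, w0=2.0):
    w = w0
    for _ in range(steps):
        w -= eta * 10 * w
    return w

for eta in (0.05, 0.15, 0.25):
    print(eta, run_gd(eta))  # tiny |w| means convergence; huge |w| divergence
```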
7. (Optional) The logistic regression model (a generalized linear model) posits that the target y_i ∈ {−1, +1} is related to the feature vector x_i as follows:

Pr(y_i | x_i, w) = g(y_i x_i^T w), where g(z) = 1/(1 + exp(−z)).

Suppose we have a set of training samples {(x_i, y_i)}_{i=1}^m. The loss we want to minimize is

min_w −Σ_{i=1}^m log Pr(y_i | x_i, w).

This is the same as minimizing the logistic loss

min_w Σ_{i=1}^m Logistic(y_i x_i^T w), where Logistic(z) = log[1 + exp(−z)].

In stochastic gradient descent, which is used to train large-scale networks, we randomly choose one sample, say (x_i, y_i), and take a step in the negative gradient direction associated with this sample. Show that if we use a learning rate of η, the update can be expressed as follows:

w_{k+1} = w_k + η y_i x_i (1 − Pr(y_i | x_i, w_k)).
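A quick numerical way to sanity-check the claimed update before proving it: compare a finite-difference gradient step on the logistic loss against the closed form (the sample, weights, and learning rate below are arbitrary, and g(z) = 1/(1 + exp(−z)) as above):

```python
import numpy as np

def pr(y, x, w):
    # Pr(y | x, w) = g(y x^T w) = 1 / (1 + exp(-y x^T w))
    return 1.0 / (1.0 + np.exp(-y * (x @ w)))

def logistic_loss(y, x, w):
    # Logistic(y x^T w) = log(1 + exp(-y x^T w))
    return np.log(1.0 + np.exp(-y * (x @ w)))

x = np.array([0.5, -1.2, 2.0])
y = -1.0
w = np.array([0.1, 0.4, -0.3])
eta = 0.05

# gradient of the single-sample loss via central finite differences
g = np.zeros_like(w)
eps = 1e-6
for j in range(len(w)):
    e = np.zeros_like(w)
    e[j] = eps
    g[j] = (logistic_loss(y, x, w + e) - logistic_loss(y, x, w - e)) / (2 * eps)

update_gd = w - eta * g                                   # plain SGD step
update_closed = w + eta * y * x * (1.0 - pr(y, x, w))     # claimed closed form
print(np.allclose(update_gd, update_closed))
```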

Remark 2. Note that Pr(y_i | x_i, w) is the probability that, under parameters w, we correctly predict the training label y_i from the features x_i, and 1 − Pr(y_i | x_i, w) is the probability of making a mistake. Hence, we make a larger correction to the current weight vector when the mistake made on the i-th sample is “larger” in the sense that 1 − Pr(y_i | x_i, w) is larger, so this update makes sense. If you are interested in this, a good course to take is MA4270.
8. (Optional) In this problem, we want to show some properties of gradient descent on “nice” functions.
First, we have to make a few definitions.
• We say that f : R^d → R is convex if

f(tx + (1 − t)y) ≤ t f(x) + (1 − t) f(y) for all x, y ∈ R^d, t ∈ [0, 1].

• We say that f is L-smooth if

f(y) − (f(x) + (∇f(x))^T (y − x)) ≤ (L/2) ‖x − y‖^2 for all x, y ∈ R^d,

where ‖·‖ denotes the usual ℓ2 norm (i.e., ‖z‖^2 = Σ_i z_i^2).

• We say that f is µ-strongly convex if

f(y) ≥ f(x) + (∇f(x))^T (y − x) + (µ/2) ‖y − x‖^2 for all x, y ∈ R^d.

It can be shown that if f is µ-strongly convex for some µ ≥ 0, it is convex.
(a) Show that if f is L-smooth, then for all x ∈ R^d,

f(x − (1/L) ∇f(x)) − f(x) ≤ −(1/(2L)) ‖∇f(x)‖^2

and

f(x∗) − f(x) ≤ −(1/(2L)) ‖∇f(x)‖^2,

where x∗ is a global minimizer of f.
(b) Recall that in gradient descent, we implement

x_{k+1} = x_k − η ∇f(x_k).

Show that if f is L-smooth and µ-strongly convex, the learning rate η ∈ (0, 1/L], and we start the iterative procedure at x_0, then

‖x_{k+1} − x∗‖^2 ≤ (1 − ηµ)^{k+1} ‖x_0 − x∗‖^2.

In particular, if we choose the learning rate η = 1/L, the iterates converge geometrically fast at a rate µ/L.
Remark 3. This result says that if we want to get ε-close to the optimal solution x∗ (i.e., ‖x_k − x∗‖ < ε), we need to use roughly

2 log(‖x_0 − x∗‖/ε) / (− log(1 − µ/L))

iterations. The numerator can be interpreted as the log of the ratio of the initial suboptimality (i.e., the gap between x_0 and x∗) to the final suboptimality (i.e., less than ε). The denominator is − log(1 − µ/L) ≈ µ/L. When µ/L is small, it takes many iterations to get to an ε-close solution. Thus, we favor situations in which µ/L is large. In fact, L/µ is known as the condition number; the smaller it is, the better! If you are interested in this, a good course to take is EE5138.
Remark 4. In fact, among the class of all L-smooth, µ-strongly convex functions, the “contraction
factor” (1 − µ/L) is the best possible. See the following:
Drori, Y., Teboulle, M.: Performance of first-order methods for smooth convex minimization: A novel
approach. Math. Program. 145(1), 451–482 (2014)
