Homework 2
Collaboration policy: You may complete this assignment with a partner, if you choose. In that case,
both partners should sign up on Canvas as a “team”, and only one of you should submit the assignment.
You are not permitted to use ChatGPT on any part of this assignment.
1. A linear NN will never solve the XOR problem [10 points, on paper]: Read the description of the XOR problem from Section 6.1 of the Deep Learning textbook: https://www.deeplearningbook.org/contents/mlp.html. Then show (by deriving the gradient, setting it to 0, and solving mathematically, not in Python) that the values for $w = [w_1, w_2]^\top$ and $b$ that minimize the function $f_{\text{MSE}}(w, b)$ in Equation 6.1 are: $w_1 = 0$, $w_2 = 0$, and $b = 0.5$ – in other words, the best prediction line is simply flat and always guesses $\hat{y} = 0.5$.
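Although the derivation itself must be done on paper, a quick numerical spot check can help confirm the claimed minimizer before you start. The sketch below assumes the setup of Section 6.1 (the four XOR inputs with targets 0, 1, 1, 0, and the MSE averaged over those four points); it is only a sanity check, not a substitute for, or part of, the required derivation.

```python
import numpy as np

# The four XOR inputs and their targets, as in Section 6.1 of the textbook.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0.0, 1.0, 1.0, 0.0])

def f_mse(w, b):
    """MSE of the linear model yhat = Xw + b, averaged over the four XOR points."""
    yhat = X @ w + b
    return np.mean((yhat - y) ** 2)

# The claimed minimizer gives an MSE of 0.25...
print(f_mse(np.array([0.0, 0.0]), 0.5))

# ...and randomly chosen alternative settings of (w, b) should never do better.
rng = np.random.default_rng(0)
print(all(f_mse(rng.normal(size=2), rng.normal()) >= 0.25 for _ in range(1000)))
```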
2. Derivation of softmax regression gradient updates [20 points, on paper]: As explained in class, let
\[
W = \begin{bmatrix} w^{(1)} & \cdots & w^{(c)} \end{bmatrix}
\]
be an $m \times c$ matrix containing the weight vectors from the $c$ different classes. The output of the softmax regression neural network is a vector with $c$ dimensions such that:
\[
\hat{y}_k = \frac{\exp z_k}{\sum_{k'=1}^{c} \exp z_{k'}} \tag{1}
\]
\[
z_k = x^\top w^{(k)} + b_k
\]
for each $k = 1, \ldots, c$. Correspondingly, our cost function will sum over all $c$ classes:
\[
f_{\text{CE}}(W, b) = -\frac{1}{n} \sum_{i=1}^{n} \sum_{k=1}^{c} y_k^{(i)} \log \hat{y}_k^{(i)}
\]
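For concreteness, here is a minimal NumPy sketch of the forward pass in Equation 1 and the cross-entropy cost above. The variable names are assumptions: X holds one example per row, Y holds the corresponding one-hot labels, W is the $m \times c$ weight matrix, and b is the vector of $c$ biases.

```python
import numpy as np

def forward(X, W, b):
    """Softmax outputs (Equation 1) for every row of X.

    X: (n, m) examples, W: (m, c) weight matrix, b: (c,) biases.
    Returns an (n, c) matrix whose i-th row is yhat^(i).
    """
    Z = X @ W + b                          # z_k = x^T w^(k) + b_k, for every example at once
    Z = Z - Z.max(axis=1, keepdims=True)   # subtract each row's max for numerical stability (yhat is unchanged)
    expZ = np.exp(Z)
    return expZ / expZ.sum(axis=1, keepdims=True)

def f_ce(X, Y, W, b):
    """Cross-entropy cost -(1/n) sum_i sum_k y_k^(i) log yhat_k^(i); Y is one-hot, shape (n, c)."""
    yhat = forward(X, W, b)
    return -np.mean(np.sum(Y * np.log(yhat), axis=1))
```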
Important note: When deriving the gradient expression for each weight vector $w^{(l)}$, it is crucial to keep in mind that the weight vector for each class $l \in \{1, \ldots, c\}$ affects the outputs of the network for every class, not just for class $l$. This is due to the normalization in Equation 1 – if changing the weight vector increases the value of $\hat{y}_l$, then it necessarily must decrease the values of the other $\hat{y}_{l'}$ for $l' \neq l$.
In this homework problem, please complete the derivation outlined below:
Derivation: For each weight vector $w^{(l)}$, we can derive the gradient expression as:
\[
\nabla_{w^{(l)}} f_{\text{CE}}(W, b) = -\frac{1}{n} \sum_{i=1}^{n} \sum_{k=1}^{c} y_k^{(i)} \nabla_{w^{(l)}} \log \hat{y}_k^{(i)}
\]
\[
= -\frac{1}{n} \sum_{i=1}^{n} \sum_{k=1}^{c} y_k^{(i)} \left( \frac{\nabla_{w^{(l)}} \hat{y}_k^{(i)}}{\hat{y}_k^{(i)}} \right)
\]
For $l \neq k$:
\[
\nabla_{w^{(l)}} \hat{y}_k^{(i)} = \text{complete me...}
\]
\[
= -x^{(i)} \hat{y}_k^{(i)} \hat{y}_l^{(i)}
\]
To compute the total gradient of $f_{\text{CE}}$ w.r.t. each $w^{(l)}$, we have to sum over all examples and over $k = 1, \ldots, c$. (Hint: $\sum_k a_k = a_l + \sum_{k \neq l} a_k$. Also, $\sum_k y_k = 1$.)
\[
\nabla_{w^{(l)}} f_{\text{CE}}(W, b) = -\frac{1}{n} \sum_{i=1}^{n} \sum_{k=1}^{c} y_k^{(i)} \nabla_{w^{(l)}} \log \hat{y}_k^{(i)}
\]
\[
= \text{complete me...}
\]
\[
= -\frac{1}{n} \sum_{i=1}^{n} x^{(i)} \left( y_l^{(i)} - \hat{y}_l^{(i)} \right)
\]
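Once you have filled in the missing steps, a finite-difference check is a quick way to gain confidence in the final expression. The sketch below reuses the forward and f_ce helpers from the earlier sketch (an assumption) and compares the derived gradient against a numerical estimate on small random data.

```python
import numpy as np

# Assumes forward() and f_ce() from the earlier sketch are already defined.
rng = np.random.default_rng(0)
n, m, c = 6, 4, 3
X = rng.normal(size=(n, m))
Y = np.eye(c)[rng.integers(c, size=n)]   # random one-hot labels
W = rng.normal(size=(m, c))
b = rng.normal(size=c)
l = 1                                    # which weight vector w^(l) to check

# Analytic gradient from the derivation: -(1/n) sum_i x^(i) (y_l^(i) - yhat_l^(i)).
yhat = forward(X, W, b)
grad_analytic = -(1.0 / n) * X.T @ (Y[:, l] - yhat[:, l])

# Numerical gradient by central differences on each component of w^(l).
eps = 1e-6
grad_numeric = np.zeros(m)
for j in range(m):
    Wp, Wm = W.copy(), W.copy()
    Wp[j, l] += eps
    Wm[j, l] -= eps
    grad_numeric[j] = (f_ce(X, Y, Wp, b) - f_ce(X, Y, Wm, b)) / (2 * eps)

print(np.max(np.abs(grad_analytic - grad_numeric)))   # should be tiny (around 1e-9 or less)
```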
3. Train a 2-layer softmax neural network to classify images of fashion items (10 different classes, such as shoes, t-shirts, dresses, etc.) from the Fashion MNIST dataset. The input to the network will be a 28 × 28-pixel image (converted into a 784-dimensional vector); the output will be a vector of 10 probabilities (one for each class). The cross-entropy loss function that you minimize should be
\[
f_{\text{CE}}(w^{(1)}, \ldots, w^{(10)}, b^{(1)}, \ldots, b^{(10)}) = -\frac{1}{n} \sum_{i=1}^{n} \sum_{k=1}^{10} y_k^{(i)} \log \hat{y}_k^{(i)} + \frac{\alpha}{2} \sum_{k=1}^{c} {w^{(k)}}^{\top} w^{(k)}
\]
After training, evaluate the network on the test set. Record the performance both in terms of (unregularized) cross-entropy
loss (smaller is better) and percent correctly classified examples (larger is better); put this information
into the PDF you submit.
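As a sketch of how the two numbers you report might be computed, assuming your code produces an (n_test, 10) matrix of predicted test-set probabilities and the corresponding one-hot labels (the argument names below are placeholders):

```python
import numpy as np

def evaluate(yhat_test, Y_test):
    """Unregularized cross-entropy and percent correct on the test set.

    yhat_test: (n_test, 10) predicted probabilities; Y_test: (n_test, 10) one-hot labels.
    """
    ce = -np.mean(np.sum(Y_test * np.log(yhat_test), axis=1))                  # smaller is better
    acc = 100.0 * np.mean(yhat_test.argmax(axis=1) == Y_test.argmax(axis=1))   # larger is better
    return ce, acc
```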
Hint 1: it accelerates training if you first normalize all the pixel values of both the training and testing
data by dividing each pixel by 255. Hint 2: when using functions like np.sum and np.mean, make
sure you know what the axis and keepdims parameters mean and that you use them in a way that is
consistent with the math!
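To make Hint 2 concrete, here is a tiny, self-contained illustration of what axis and keepdims do. The keepdims=True form is what lets a row-wise sum broadcast back against the original matrix, which is exactly the pattern needed to divide each row of exp(Z) by its own sum.

```python
import numpy as np

A = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])

print(A.sum(axis=0))                   # [5. 7. 9.]    -> sums down the columns, shape (3,)
print(A.sum(axis=1))                   # [ 6. 15.]     -> sums across each row, shape (2,)
print(A.sum(axis=1, keepdims=True))    # [[ 6.] [15.]] -> shape (2, 1), which broadcasts row-wise

# The pattern you want for softmax-style normalization: each row divided by its own sum.
print(A / A.sum(axis=1, keepdims=True))
```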
4. Logistic Regression [15 points, on paper]: Consider a 2-layer neural network that computes the
function
\[
\hat{y} = \sigma(x^\top w + b)
\]
where x is an example, w is a vector of weights, b is a bias term, and σ is the logistic sigmoid function.
Assume we train this network using the log loss, as described in class. Moreover, suppose all the
training examples are positive. Answer the following questions about convergence. (Informally,
a sequence of numbers converges if it gets closer and closer to a specific number as the sequence
progresses. A sequence that does not converge can do different things, e.g., change erratically, or grow towards $+\infty$ or $-\infty$.) While you are not required to give formal proofs, you should explain your reasoning, which could either be a mathematical argument or a simulation result (a minimal simulation sketch is given after part (c)). Put your answers into your PDF file.
(a) Given a well-chosen learning rate: what value will the training loss converge to during gradient
descent?
(b) Given a well-chosen learning rate: will b always converge; does convergence depend on the exact
training examples; or does it never converge?
(c) Suppose the training set contains exactly 2 examples, $x^{(1)}, x^{(2)} \in \mathbb{R}^2$. Give specific values for these training data such that:
i. w will converge during gradient descent (given a well-chosen learning rate).
ii. w will not converge during gradient descent (no matter what the learning rate).
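If you go the simulation route for any of these parts, a minimal sketch might look like the following: plain gradient descent on the log loss with every label set to 1, printing the loss, w, and b occasionally so you can see what settles down and what keeps growing. The two training examples here are arbitrary placeholders; substitute your own choices when answering part (c).

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Two training examples in R^2, all labeled positive (placeholders: swap in your own values for part (c)).
X = np.array([[1.0, 0.0],
              [0.0, 1.0]])
y = np.ones(2)

w = np.zeros(2)
b = 0.0
lr = 0.1

for t in range(20001):
    yhat = sigmoid(X @ w + b)
    if t % 5000 == 0:
        # Log loss; since every label is 1, the (1 - y) log(1 - yhat) term vanishes.
        loss = -np.mean(np.log(yhat))
        print(f"step {t:6d}  loss {loss:.6f}  w {w}  b {b:.4f}")
    # Gradient descent step on the log loss.
    w -= lr * (X.T @ (yhat - y) / len(y))
    b -= lr * np.mean(yhat - y)
```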
Create a Zip file containing both your Python and PDF files, and then submit on Canvas. If you are working
as part of a group, then only one member of your group should submit (but make sure you have already
signed up in a pre-allocated team for the homework on Canvas).