ISYE6740 Fall 2024 HW4 Rubric
100 points
1. (10 points) Show a step-by-step mathematical derivation of the gradient of the cost
function ℓ(θ) in (1).
Rubric:
Any reasonable attempt 2pts
Correct gradient 8pts
Partial credit may be awarded as appropriate.
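For graders' reference, a sketch of the expected result, assuming (1) is the usual logistic regression log-likelihood with labels y^{(i)} \in \{0, 1\} and \sigma(z) = 1/(1 + e^{-z}) (the exact form of (1) is not reproduced in this rubric):
\ell(\theta) = \sum_{i=1}^{m} \left[ y^{(i)} \log \sigma(\theta^T x^{(i)}) + (1 - y^{(i)}) \log\big(1 - \sigma(\theta^T x^{(i)})\big) \right],
\qquad
\nabla_\theta \ell(\theta) = \sum_{i=1}^{m} \big( y^{(i)} - \sigma(\theta^T x^{(i)}) \big)\, x^{(i)}.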
2. (5 points) Write pseudo-code for performing gradient descent to find the optimizer
θ*. This is essentially what the training procedure does.
Rubric:
Any reasonable attempt 1pt
Reasonable pseudo-code of gradient descent 4pts
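As a reference for grading, a minimal Python sketch of an acceptable answer, assuming the log-likelihood form sketched above; the step size eta and iteration count are illustrative, not prescribed by the assignment:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gradient_ascent(X, y, eta=0.1, n_iters=1000):
    """Full-batch gradient ascent on the log-likelihood l(theta).
    X is the (m, d) data matrix, y the (m,) label vector in {0, 1}."""
    m, d = X.shape
    theta = np.zeros(d)
    for _ in range(n_iters):
        # gradient of l(theta): sum_i (y_i - sigma(theta^T x_i)) x_i
        grad = X.T @ (y - sigmoid(X @ theta))
        theta = theta + eta * grad  # ascent step, since we maximize l
    return theta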
3. (5 points) Write pseudo-code for the stochastic gradient descent algorithm to solve
the logistic regression training problem (1). Please explain the difference between
gradient descent and stochastic gradient descent for training logistic regression.
Rubric:
Reasonable pseudo-code of stochastic gradient descent 3pts
Reasonable explanation of the difference 2pts
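For reference, a minimal SGD sketch under the same assumptions, reusing numpy and the sigmoid helper from the previous sketch; the key difference from full gradient descent is that each update uses the gradient of a single randomly chosen sample rather than the sum over all m samples, so updates are cheaper but noisier:

def stochastic_gradient_ascent(X, y, eta=0.01, n_epochs=20, seed=0):
    """SGD: update theta with one randomly drawn sample per step."""
    rng = np.random.default_rng(seed)
    m, d = X.shape
    theta = np.zeros(d)
    for _ in range(n_epochs):
        for i in rng.permutation(m):
            # single-sample gradient: (y_i - sigma(theta^T x_i)) x_i
            grad_i = (y[i] - sigmoid(X[i] @ theta)) * X[i]
            theta = theta + eta * grad_i
    return theta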
4. (15 points) We will show that the training problem in the basic logistic regression
problem is concave. Derive the Hessian matrix of ℓ(θ) and, based on this, show the
training problem (1) is concave. Explain why the problem can be solved efficiently and
why gradient descent will achieve a unique global optimizer, as we discussed in class.
Rubric:
Correct Hessian expression 10pts
Reasonable attempt 2pts
Reasonable explanation 3pts
The explanation should contain some comments on why the problem can be solved efficiently
and why gradient descent will achieve a unique global optimizer, as stated in the question.
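A sketch of the expected Hessian, under the same assumption on the form of (1) as above, with \sigma_i = \sigma(\theta^T x^{(i)}) and X the matrix whose rows are the x^{(i)}:
\nabla^2_\theta \ell(\theta) = -\sum_{i=1}^{m} \sigma_i (1 - \sigma_i)\, x^{(i)} x^{(i)T} = -X^T S X,
\qquad S = \mathrm{diag}\big(\sigma_1(1-\sigma_1), \ldots, \sigma_m(1-\sigma_m)\big) \succeq 0,
so \nabla^2_\theta \ell(\theta) \preceq 0 for all \theta and \ell is concave (strictly concave when X has full column rank), which is why gradient-based methods reach a global optimizer.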
V = {free, money, no, cap, crypto, real, cash, prize, for, you, big, chance, pizza, is, vibe}.
We will use V_i to represent the i-th word in V. As our training dataset, we are also given 3
example spam messages,
• no cash no pizza
Recall that the Naive Bayes classifier assumes the probability of an input depends on
its input feature. The feature for each sample is defined as x^{(i)} = [x_1^{(i)}, x_2^{(i)}, \ldots, x_d^{(i)}]^T,
i = 1, \ldots, m, and the class of the i-th sample is y^{(i)}. In our case the length of the input vector is
d = 15, which is equal to the number of words in the vocabulary V. Each entry x_j^{(i)} is equal
to the number of times word V_j occurs in the i-th message.
1. (5 points) Calculate the class priors P(y = 0) and P(y = 1) from the training data, where
y = 0 corresponds to spam messages, and y = 1 corresponds to non-spam messages.
Note that these class priors essentially correspond to the frequency of each class in the
training sample. Write down the feature vectors for each spam and non-spam message.
Rubric:
Each correct prior, 1pt, total 2pts
Each correct feature vector, 0.5pt, total 3pts
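For reference when checking feature vectors, a small Python sketch of the counting step; only the one spam message quoted above is used here, and the remaining training messages are as given in the assignment:

import numpy as np

V = ["free", "money", "no", "cap", "crypto", "real", "cash", "prize",
     "for", "you", "big", "chance", "pizza", "is", "vibe"]
index = {word: j for j, word in enumerate(V)}

def feature_vector(message):
    """x_j counts how many times vocabulary word V_j occurs in the message."""
    x = np.zeros(len(V), dtype=int)
    for word in message.split():
        if word in index:
            x[index[word]] += 1
    return x

print(feature_vector("no cash no pizza"))
# "no" appears twice, "cash" and "pizza" once each; all other entries are 0.
# The class priors are just class frequencies in the training set, e.g.
# P(y = 0) = (# spam messages) / m and P(y = 1) = (# non-spam messages) / m.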
2. (15 points) Assuming the keywords follow a multinomial distribution, the likelihood of
a sentence with its feature vector x given a class c is given by

P(x \mid y = c) = \frac{n!}{x_1! \cdots x_d!} \prod_{k=1}^{d} \theta_{c,k}^{x_k}, \qquad c \in \{0, 1\}
Given this, the complete log-likelihood function for our training data is given by

\ell(\theta_{0,1}, \ldots, \theta_{0,d}, \theta_{1,1}, \ldots, \theta_{1,d}) = \sum_{i=1}^{m} \sum_{k=1}^{d} x_k^{(i)} \log \theta_{y^{(i)},k}

(In this example, m = 7.) Calculate the maximum likelihood estimates of \theta_{0,1}, \theta_{0,6},
\theta_{1,2}, \theta_{1,15} by maximizing the log-likelihood function above.
(Hint: We are solving a constrained maximization problem and you will need to introduce
Lagrangian multipliers and consider the Lagrangian function.)
Rubric:
11 points for a proper Lagrangian definition. Partial credit may be given, but for full
credit the two Lagrangian multipliers must be present, similar to the solution.
Each correct MLE requested, 1pt each, 4pts total.
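A sketch of the expected Lagrangian argument (the constraints are \sum_k \theta_{c,k} = 1 for c = 0, 1):

L(\theta, \lambda_0, \lambda_1) = \sum_{i=1}^{m} \sum_{k=1}^{d} x_k^{(i)} \log \theta_{y^{(i)},k}
+ \lambda_0 \Big(1 - \sum_{k=1}^{d} \theta_{0,k}\Big) + \lambda_1 \Big(1 - \sum_{k=1}^{d} \theta_{1,k}\Big),

and setting \partial L / \partial \theta_{c,k} = 0 together with the constraints gives

\theta_{c,k} = \frac{\sum_{i:\, y^{(i)} = c} x_k^{(i)}}{\sum_{k'=1}^{d} \sum_{i:\, y^{(i)} = c} x_{k'}^{(i)}},

i.e. the fraction of all word occurrences in class-c messages that are the word V_k.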
3. (15 points) Given a test message “money for real”, use the Naive Bayes classifier that
you have trained in Parts (a)-(b) to calculate the posterior and decide whether it is
spam or not spam. Derivations must be shown in your report.
Rubric:
Reasonable posterior results 9pts
Correct classification result 6pts
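A sketch of the posterior computation graders should expect: by Bayes' rule (the multinomial coefficient is the same for both classes and cancels when comparing them),

P(y = c \mid x) \;\propto\; P(y = c) \prod_{k=1}^{d} \theta_{c,k}^{x_k},

so for the test message “money for real” only the entries of x for “money”, “for”, and “real” are nonzero, and the predicted class is the c with the larger value above (or the larger log of it).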
1. (15 points) Report the testing accuracy for each of the three classifiers. Comment on their
performance: which performs the best, and make a guess as to why it performs the best in
this setting.
Rubric:
Classification accuracy above 90%: 4pts per classifier, 12pts total
Reasonable explanations 3pts
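When checking reported accuracies, a minimal scikit-learn sketch of the kind of pipeline expected; X_train, X_test, y_train, y_test stand in for whatever split the student used on the homework dataset, and the specific hyperparameters (e.g. n_neighbors=5) are illustrative:

from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

classifiers = {
    "Naive Bayes": GaussianNB(),
    "Logistic Regression": LogisticRegression(),
    "KNN": KNeighborsClassifier(n_neighbors=5),
}
for name, clf in classifiers.items():
    clf.fit(X_train, y_train)                  # train on the training split
    acc = accuracy_score(y_test, clf.predict(X_test))
    print(f"{name}: test accuracy = {acc:.3f}")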
2. (15 points) Now perform PCA to project the data into two-dimensional space. Build
the classifiers (Naive Bayes, Logistic Regression, and KNN) using the two-
dimensional PCA results. Plot the data points and the decision boundary of each classifier
in the two-dimensional space. Comment on the difference between the decision boundaries
of the three classifiers. Please clearly represent the data points with different
labels using different colors.
Rubric:
Each reasonable decision boundary 4pts, 12pts total
Reasonable explanations 3pts
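For reference when judging the plots, a minimal sketch of one common way to draw a two-dimensional decision boundary after PCA; X and y are the homework data and labels, clf is any of the three classifiers, and the grid resolution and margins are illustrative:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

Z = PCA(n_components=2).fit_transform(X)       # project data onto first 2 PCs

def plot_boundary(clf, Z, y, title):
    clf.fit(Z, y)                              # classifier trained on 2-D PCA features
    x1, x2 = np.meshgrid(
        np.linspace(Z[:, 0].min() - 1, Z[:, 0].max() + 1, 300),
        np.linspace(Z[:, 1].min() - 1, Z[:, 1].max() + 1, 300),
    )
    pred = clf.predict(np.c_[x1.ravel(), x2.ravel()]).reshape(x1.shape)
    plt.contourf(x1, x2, pred, alpha=0.3)       # shaded decision regions
    plt.scatter(Z[:, 0], Z[:, 1], c=y)          # points colored by class label
    plt.title(title)
    plt.show()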