ML 2023a Midsem Solution
Instructions: Answer the questions in the boxes provided in this answer booklet only.
You may use supplementary sheets for rough work which must be tied to this booklet.
1. (a) When the number of features is increased in a linear model, state whether the following [4]
are expected to increase or decrease with proper justifications:
i. Bias of the model
ii. Variance of the model
Solution:
i. Bias: expected to decrease. With more features the model class becomes more expressive, so it can fit the underlying relationship more closely.
ii. Variance: expected to increase. With more parameters the fitted model becomes more sensitive to the particular training sample.
(b) How is the error on the following quantities expected to change when you add [2]
regularization such as L2 regularization? State whether it “increases” or “decreases”.
i. Training error
ii. True error
Solution:
i. Training error: increases, since the regularization penalty restricts how closely the model can fit the training data.
ii. True error: decreases (for a reasonably chosen λ), since regularization reduces variance and hence overfitting.
(c) Consider a linear regression model with L1-regularization. [2]
$$\text{Loss}(\theta) = \frac{1}{m}\sum_{n=1}^{m}\left(y_n - f(x_n)\right)^2 + \lambda\,\|\theta\|_1, \quad \text{where } f(x) = \theta^T x$$
(d) Suppose you have a task to classify a bird species into one of 23 classes. Your data for [2]
each bird is described by 20 numerical features. How many parameters are required if
you use a linear model for this bird classification task?
Solution: 23 × (20 + 1) = 23 × 21 = 483, i.e., one weight per feature plus one bias term for each of the 23 classes.
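A quick way to sanity-check this count (a minimal sketch, assuming a multinomial linear classifier in scikit-learn; the random data below is purely illustrative):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# 23 classes, 20 numerical features per bird (random data, only to instantiate the model shapes)
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 20))
y = np.arange(500) % 23  # guarantees every class appears

clf = LogisticRegression(max_iter=1000).fit(X, y)

# One weight vector (20 weights) plus one bias per class: 23 * (20 + 1) = 483
n_params = clf.coef_.size + clf.intercept_.size
print(clf.coef_.shape, clf.intercept_.shape, n_params)  # (23, 20) (23,) 483
```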
(e) Write the expression for the hypothesis f (x) and the loss function L used in logistic [3]
regression. Assume that there are N training examples and d input features. The
training set is given by $\{(x^{(1)}, y^{(1)}), \dots, (x^{(N)}, y^{(N)})\}$ and the $j$th feature of the $i$th
example is given by $x_j^{(i)}$.
Solution:
$$f(x) = \frac{1}{1 + e^{-\theta^T x}}, \qquad L(\theta) = -\frac{1}{N}\sum_{i=1}^{N}\left[\, y^{(i)} \log f(x^{(i)}) + \left(1 - y^{(i)}\right)\log\left(1 - f(x^{(i)})\right)\right]$$
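A minimal NumPy sketch of these two expressions (assuming the bias is folded into θ via a constant-1 feature; the data below is random and purely illustrative):

```python
import numpy as np

def f(theta, X):
    """Logistic hypothesis: sigmoid of theta^T x for each row of X."""
    return 1.0 / (1.0 + np.exp(-X @ theta))

def loss(theta, X, y):
    """Average cross-entropy (negative log-likelihood) over the N examples."""
    p = f(theta, X)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

# Illustrative data: N = 100 examples, d = 5 features plus a constant-1 column for the bias
rng = np.random.default_rng(0)
X = np.hstack([rng.normal(size=(100, 5)), np.ones((100, 1))])
y = rng.integers(0, 2, size=100)
theta = np.zeros(6)
print(loss(theta, X, y))  # log(2) ~ 0.693 when theta = 0
```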
2. (a) Use the following dataset to learn a decision tree for predicting if a candidate is [4]
shortlisted for a job (Y) or not (N), based on their Major (A or C), CGPA (High or Low),
and whether they did a Project (Y or N).
Draw the full decision tree that would be learned for this data using the Information
Gain heuristic (assuming no pruning). You do not need to show all the calculations.
Major CGPA Project Shortlisted
A H Y Y
A L N N
A L Y N
A L N N
C H Y Y
C L Y Y
C H N Y
C L N Y
Solution: Major gives the highest information gain at the root. Major = C is pure (all shortlisted), while for Major = A a further split on CGPA separates the remaining examples perfectly:
Major = C → Y
Major = A, CGPA = High → Y
Major = A, CGPA = Low → N
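The information-gain choices can be verified with a short script (a sketch; the dataset is hard-coded from the table above):

```python
import numpy as np

def entropy(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def info_gain(rows, labels, col):
    gain = entropy(labels)
    for v in set(r[col] for r in rows):
        idx = [i for i, r in enumerate(rows) if r[col] == v]
        gain -= len(idx) / len(rows) * entropy(labels[idx])
    return gain

# (Major, CGPA, Project) -> Shortlisted, copied from the table
rows = [("A", "H", "Y"), ("A", "L", "N"), ("A", "L", "Y"), ("A", "L", "N"),
        ("C", "H", "Y"), ("C", "L", "Y"), ("C", "H", "N"), ("C", "L", "N")]
labels = np.array(["Y", "N", "N", "N", "Y", "Y", "Y", "Y"])

for name, col in zip(["Major", "CGPA", "Project"], range(3)):
    print(name, round(info_gain(rows, labels, col), 3))
# Major has the highest gain; within Major = A, splitting on CGPA then separates the rest perfectly.
```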
(b) Consider the following decision tree to decide whether to go for a long drive. You wish [4]
to apply post-pruning on the tree. Draw all possible candidate pruned trees.
(c) Given a tree T and a candidate pruned tree Pi , how will you decide whether to use T [2]
or Pi ?
(d) Does the pruning of a decision tree reduce variance? Justify your answer. [2]
(e) Consider the following scenario. You have 1000 samples in the training data of a 2-class [2]
classification problem. Each sample is described by 20 features. Suppose that you use
information gain as a heuristic to grow a decision tree T from the training set; and you
use post-pruning to prune the tree to obtain a pruned tree P .
What can you say about the error on the training set of the pruned tree P compared
to that of T ?
Solution: The error on the training set is likely to be higher for the pruned tree P
than for T. If replacing a subtree with a leaf did not increase the training error, the
tree would most likely not have been expanded at that node in the first place.
3. Consider the dataset below in Figure 1. The objective is to learn a line that best separates
these classes. For this problem, assume that we are training an SVM with a quadratic
kernel; that is, the kernel function is a polynomial kernel of degree 2.
Please answer the following questions qualitatively. Give a one sentence justification for each
and draw your solution in the appropriate part of the Figure at the end of the problem.
Assume we are using slack penalty C. Recall the SVM optimization with slack variables is
$$\min_{\theta} \;\; \frac{1}{2}\sum_{j=1}^{d} \theta_j^2 + C \sum_i \psi_i,$$
(a) Where would the decision boundary be for very large values of C (i.e., C → ∞)? [3]
(remember that we are using an SVM with a quadratic kernel.) Draw on the figure
and justify your answer.
Figure 2: For Q3(a), draw the decision boundary and justify in short.
Answer: With a very high value of C, the slack variables must be (close to) zero at the
optimal solution. So the decision boundary will bend to classify every training point
correctly, overfitting to the boundary points as much as possible.
- Page 6 of 9-
(b) Draw a data point which will significantly change the decision boundary learned for [3]
very large values of C. Justify your answer.
Figure 3: For Q3(b), mark the data point for large C and justify in short.
Ans: A red point at (3,16) or a green point at (2.5, 13.5). With a very high C, the decision
boundary will change to overfit this point. Many answers are acceptable here: as long as
the student understands the effect of a high C value, marks a point, and justifies it, full
marks should be given.
(c) For very small C ≈ 0, where would the decision boundary be? For this dataset, which [4]
C (very large or very small) will be preferable?
Figure 4: For (c), mark the decision boundary and justify in short.
Ans: With a low C value, the slack variables are allowed to be large, so the boundary can
ignore the outlier red dots and keep a wide margin. For this dataset, a small C is therefore preferable.
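The qualitative effect of C can be reproduced with a small experiment (a sketch using scikit-learn on synthetic blobs with one outlier; this is not the exam's dataset):

```python
import numpy as np
from sklearn.svm import SVC

# Two well-separated blobs plus one class-0 outlier near the class-1 blob
# (an illustrative stand-in for the figure, not the actual exam data).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal([0, 0], 0.5, size=(20, 2)),   # class 0
               rng.normal([4, 4], 0.5, size=(20, 2)),   # class 1
               [[2.5, 2.5]]])                           # class-0 outlier
y = np.array([0] * 20 + [1] * 20 + [0])

for C in (0.01, 1e4):
    clf = SVC(kernel="poly", degree=2, coef0=1, C=C).fit(X, y)
    # A large C forces the slacks toward zero, so the boundary bends to fit the outlier;
    # a small C tolerates slack, keeps a wide margin, and may simply misclassify it.
    print(f"C={C:g}  train accuracy={clf.score(X, y):.2f}  support vectors={clf.n_support_.sum()}")
```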
(d) Suppose there are two kernel functions K1 (x, x′ ) and K2 (x, x′ ). Show that the function [3]
K1 (x, x′ ) × K2 (x, x′ ) is also a kernel. To show that a function is a kernel
function, it should be factorizable into Φ(x) and Φ(x′ ) for some arbitrary projection
function Φ(·). (Hint: Use the projection functions Φ1 (·) and Φ2 (·) for kernels
K1 (·, ·) and K2 (·, ·). Try to derive the new projection function in terms of these.)
Ans: Assume $\phi_1$ and $\phi_2$ are the feature representations of the two kernels, so that
$K_1(x, x') = \phi_1(x)^\top \phi_1(x')$ and $K_2(x, x') = \phi_2(x)^\top \phi_2(x')$. Then
$$K_1(x, x')\,K_2(x, x') = \Big(\sum_i \phi_1(x)_i\,\phi_1(x')_i\Big)\Big(\sum_j \phi_2(x)_j\,\phi_2(x')_j\Big) = \sum_{i,j} \big[\phi_1(x)_i\,\phi_2(x)_j\big]\big[\phi_1(x')_i\,\phi_2(x')_j\big],$$
so the feature representation for the new kernel is $\Phi(x)_{ij} = \phi_1(x)_i\,\phi_2(x)_j$ for all $i, j$, and the product is therefore a kernel.
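This factorization can also be checked numerically (a sketch using two arbitrary, purely illustrative feature maps φ1 and φ2):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 3))  # 6 points in 3-D, just for a numerical check

# Two arbitrary explicit feature maps (illustrative choices only)
phi1 = lambda x: np.array([x[0], x[1] * x[2], np.sin(x[0])])
phi2 = lambda x: np.array([1.0, x[0] + x[1], x[2] ** 2])

K1 = np.array([[phi1(a) @ phi1(b) for b in X] for a in X])
K2 = np.array([[phi2(a) @ phi2(b) for b in X] for a in X])

# Feature map of the product kernel: all pairwise products phi1(x)_i * phi2(x)_j
phi = lambda x: np.outer(phi1(x), phi2(x)).ravel()
K = np.array([[phi(a) @ phi(b) for b in X] for a in X])

print(np.allclose(K, K1 * K2))                      # True: the product factorizes via phi
print(np.linalg.eigvalsh(K1 * K2).min() >= -1e-9)   # True: the product Gram matrix is PSD
```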
4. (a) In linear PCA, the covariance matrix of the data, Σ = X⊤X, is decomposed into a [4]
weighted sum of its eigenvalues $\lambda_i$ and eigenvectors $p_i$,
$$\Sigma = \sum_i \lambda_i\, p_i\, p_i^\top. \qquad (1)$$
What is the variance of the data projected along the first dimension in terms of p1
and Σ? From this, prove that the first eigenvalue λ1 is identical to the variance of the
data projected along the first principal component.
Ans: The variance of the data projected along the first principal component is $v = p_1^\top \Sigma\, p_1$.
Since $p_1$ is an eigenvector of $\Sigma$ with eigenvalue $\lambda_1$, we have $\Sigma p_1 = \lambda_1 p_1 \implies p_1^\top \Sigma\, p_1 = \lambda_1 p_1^\top p_1 = \lambda_1 \times 1 = \lambda_1$, as $p_1^\top p_1 = 1$.
Hence $\lambda_1$ is identical to the variance of the data projected along the first principal component.
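A quick numerical check of this identity (a sketch on random, purely illustrative data, using Σ = X⊤X on centered data as in the question):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3)) @ np.diag([3.0, 1.0, 0.3])  # anisotropic data
X = X - X.mean(axis=0)                                    # center, as assumed in PCA

Sigma = X.T @ X                           # as defined in the question (unnormalized)
eigvals, eigvecs = np.linalg.eigh(Sigma)
lam1, p1 = eigvals[-1], eigvecs[:, -1]    # largest eigenvalue and its eigenvector

proj = X @ p1                             # projection onto the first principal component
print(np.isclose((proj ** 2).sum(), lam1))   # projected "variance" (sum of squares) equals lambda_1
print(np.isclose(p1 @ Sigma @ p1, lam1))     # p1^T Sigma p1 = lambda_1
```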
(b) In Figure 5 (below), there is a dataset plotted with two classes. Draw the first and [3]
second principal component directions (to your best guess).
Figure 5: Dataset for PCA. For 4(c) draw the principal components and justify.
Ans: The first principal component will be along the x-axis (Feature 1), since the data
has maximum variance along this axis. The second principal component is perpendicular
to the first, and hence lies along the y-axis (Feature 2).
(c) In Figure 5, each of the half-circles belongs to a separate class. Imagine there are two [4]
classifiers learnt.
• 1) Use PCA and take the first principal component to obtain the projected (1-D)
data, then use logistic regression to classify.
• 2) Use Linear Discriminant Analysis (LDA) to find the projection weight vector w,
which projects from 2-D to 1 dimension. Then take the projected values and
use a logistic regression classifier.
Which one among the classifiers will have a better training accuracy? Explain. The
classifier that has a better training accuracy, will it also have a better generalization
accuracy? Explain (in short).
LDA will likely give better training accuracy than PCA because: 1. PCA takes the first
component along the x-axis (Feature 1), and in the process of reducing the dimension many
data points from the two classes overlap. 2. LDA, on the other hand, takes the class labels
into account and focuses on maximizing class separability.
No, better training accuracy may not translate into better generalization accuracy, because
the classifier may overfit the given data.
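A rough way to reproduce this comparison (a sketch using scikit-learn's make_moons as a stand-in for the half-circle dataset in the figure):

```python
from sklearn.datasets import make_moons
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# make_moons is an illustrative substitute for the two half-circle classes
X, y = make_moons(n_samples=400, noise=0.05, random_state=0)

pca_clf = make_pipeline(PCA(n_components=1), LogisticRegression()).fit(X, y)
lda_clf = make_pipeline(LinearDiscriminantAnalysis(n_components=1), LogisticRegression()).fit(X, y)

# LDA uses the labels, so its 1-D projection typically preserves more class separation than PCA's
print("PCA + logistic regression, training accuracy:", pca_clf.score(X, y))
print("LDA + logistic regression, training accuracy:", lda_clf.score(X, y))
```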
(d) Suppose that X is a discrete random variable with the following probability mass [3]
function: $P(X = 0) = \frac{2\theta}{3}$, $P(X = 1) = \frac{\theta}{3}$, $P(X = 2) = \frac{2(1-\theta)}{3}$, $P(X = 3) = \frac{1-\theta}{3}$. The
following 10 independent observations were taken from such a distribution: (3, 0, 2, 1,
3, 2, 1, 0, 2, 1). What is the maximum likelihood estimate of θ?
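Counting the observations gives 2, 3, 3, 2 occurrences of X = 0, 1, 2, 3, so the likelihood is proportional to θ^(2+3) (1 − θ)^(3+2) = θ^5 (1 − θ)^5, which is maximized at θ = 1/2. A quick numerical check (a sketch using a simple grid search):

```python
import numpy as np

obs = [3, 0, 2, 1, 3, 2, 1, 0, 2, 1]

def log_likelihood(theta):
    # PMF from the question, as a function of theta
    pmf = {0: 2 * theta / 3, 1: theta / 3, 2: 2 * (1 - theta) / 3, 3: (1 - theta) / 3}
    return sum(np.log(pmf[x]) for x in obs)

thetas = np.linspace(0.001, 0.999, 999)
print(thetas[np.argmax([log_likelihood(t) for t in thetas])])  # ~0.5
```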