Assignments: Introduction to Machine Learning 2024

Assignment 10

Introduction to Machine Learning


Prof. B. Ravindran
1. K-means algorithm is not a particularly sophisticated approach for Image Segmentation tasks.
Choose the best possible explanation from below which supports the claim:

(a) It takes no account of the spatial proximity of different pixels.


(b) The curse of dimensionality does not affect the performance of K-means algorithm, as it
effectively handles high-dimensional data with minimal loss of accuracy.
(c) The algorithm requires the number of clusters (K) to be specified beforehand.
(d) Initialization does not affect K-means.

Sol. (a) or (c)
Going through the basic k-means algorithm, one can see that it takes no account of the spatial proximity of pixels, that it suffers from the high dimensionality of image data, and that the number of clusters needs to be known a priori in order to use k-means for image segmentation.

2. The pairwise distances between 6 points are given below. Which of the options shows the hierarchy
of clusters created by the single-link clustering algorithm?

     P1  P2  P3  P4  P5  P6
P1    0   3   8   9   5   4
P2    3   0   9   8  10   9
P3    8   9   0   1   6   7
P4    9   8   1   0   7   8
P5    5  10   6   7   0   2
P6    4   9   7   8   2   0

Sol. (b)
Step 1: Connect closest pair of points. Closest pairs are:
d(P3, P4) = 1
d(P5, P6) = 2
d(P1, P2) = 3

Step 2: Connect clusters using the single-link distance. The cluster pair to combine is the one with the smallest distance (here {P1, P2} and {P5, P6}):
d({P1, P2}, {P3, P4}) = 8
d({P1, P2}, {P5, P6}) = 4
d({P5, P6}, {P3, P4}) = 6

Step 3: Connect the final 2 clusters
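
As an additional cross-check, the merge order can also be reproduced programmatically. This is a minimal sketch using scipy, with the distance matrix entered directly from the table above; it is an illustration, not part of the official solution:

import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import squareform

# Pairwise distance matrix from the question (points P1..P6)
D = np.array([
    [0, 3, 8, 9, 5, 4],
    [3, 0, 9, 8, 10, 9],
    [8, 9, 0, 1, 6, 7],
    [9, 8, 1, 0, 7, 8],
    [5, 10, 6, 7, 0, 2],
    [4, 9, 7, 8, 2, 0],
], dtype=float)

# linkage() expects a condensed distance vector; method='single' gives single-link merges
Z = linkage(squareform(D), method='single')
print(Z)  # each row: (cluster i, cluster j, merge distance, size of the new cluster)
# For question 3, use method='complete' to obtain the complete-link hierarchy instead.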

3. For the pairwise distance matrix given in the previous question, which of the following shows
the hierarchy of clusters created by the complete-link clustering algorithm?

Sol. (b)
Step 1: Same as previous question

Step 2: Connect clusters using the complete-link distance. The cluster pair to combine is the one with the smallest distance (here {P5, P6} and {P3, P4}):
d({P1, P2}, {P3, P4}) = 9
d({P1, P2}, {P5, P6}) = 10
d({P5, P6}, {P3, P4}) = 8

Step 3: Connect the final 2 clusters


4. In BIRCH, using number of points N, sum of points SUM and sum of squared points SS, we
can determine the centroid and radius of the combination of any two clusters A and B. How
do you determine the radius of the combined cluster? (In terms of N, SUM and SS of the two clusters A and B)
Radius of a cluster is given by:

Radius = sqrt( SS/N − (SUM/N)^2 )

Note: We use the following definition of radius from the BIRCH paper: "Radius is the average
distance from the member points to the centroid."

(a) Radius = sqrt( SS_A/N_A − (SUM_A/N_A)^2 + SS_B/N_B − (SUM_B/N_B)^2 )
(b) Radius = sqrt( SS_A/N_A − (SUM_A/N_A)^2 ) + sqrt( SS_B/N_B − (SUM_B/N_B)^2 )
(c) Radius = sqrt( (SS_A + SS_B)/(N_A + N_B) − ((SUM_A + SUM_B)/(N_A + N_B))^2 )
(d) Radius = sqrt( SS_A/N_A + SS_B/N_B − ((SUM_A + SUM_B)/(N_A + N_B))^2 )

Sol. (c)
CF_{A+B} = CF_A + CF_B

Therefore,
N_{A+B} = N_A + N_B
SUM_{A+B} = SUM_A + SUM_B
SS_{A+B} = SS_A + SS_B

Replace the above in the formula of radius given in the question to get the formula for the
combined cluster.
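
As an illustration, the additivity of the CF components and the radius formula can be wrapped in a small helper. This is a sketch assuming one-dimensional clusters summarized by the triples (N, SUM, SS); the function name is ours, not from the BIRCH paper:

import math

def combined_radius(N_A, SUM_A, SS_A, N_B, SUM_B, SS_B):
    # CF vectors are additive, so simply add the components of clusters A and B
    N, SUM, SS = N_A + N_B, SUM_A + SUM_B, SS_A + SS_B
    # Radius = sqrt(SS/N - (SUM/N)^2), the definition used in the question
    return math.sqrt(SS / N - (SUM / N) ** 2)

# Example: cluster A = {1, 2} has (N, SUM, SS) = (2, 3, 5); cluster B = {4, 6} has (2, 10, 52)
print(combined_radius(2, 3, 5, 2, 10, 52))  # radius of the merged cluster {1, 2, 4, 6}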
5. Statement 1: CURE is robust to outliers.
Statement 2: Because of multiplicative shrinkage, the effect of outliers is dampened.
(a) Statement 1 is true. Statement 2 is true. Statement 2 is the correct reason for Statement 1.
(b) Statement 1 is true. Statement 2 is true. Statement 2 is not the correct reason for Statement 1.
(c) Statement 1 is true. Statement 2 is false.
(d) Both statements are false.

Sol. (a)
Refer to lecture
6. Which of the following statements about the Rand Index is true?
(a) It is insensitive to the permutations of cluster labels
(b) It is biased towards larger clusters
(c) It cannot handle overlapping clusters
(d) It is unaffected by outliers in the data
Sol. (a) It is insensitive to the permutations of cluster labels. Refer to Lectures.
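
For instance, permuting the cluster labels of a predicted partition leaves the Rand index unchanged; a small sketch with sklearn (toy labels chosen here only for illustration):

from sklearn.metrics import rand_score

true_labels = [0, 0, 1, 1, 2, 2]
pred_a = [1, 1, 0, 0, 2, 2]   # some predicted clustering
pred_b = [2, 2, 1, 1, 0, 0]   # the same partition with cluster labels permuted

# Both calls give the same Rand index, since only the grouping of points matters
print(rand_score(true_labels, pred_a), rand_score(true_labels, pred_b))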
7. In the Rand index, a can be viewed as true positives (pairs of points belonging to the same cluster)
and b as true negatives (pairs of points belonging to different clusters). How, then, are the Rand index
and accuracy from the previous two questions related?
(a) rand-index = accuracy
(b) rand-index = 1.01×accuracy
(c) rand-index = accuracy/2
(d) None of the above

Sol. (d)
The accuracy in the question is computed over individual points, whereas the Rand index is computed over pairs of points. The two are therefore not directly related.

For the following questions, we will be using the iris dataset that can be loaded using the following utility from sklearn: https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_iris.html
Do not make any changes to the dataset unless directed in the question.
8. Run BIRCH on the input features of the iris dataset using Birch(n_clusters=5, threshold=2).
What is the rand-index obtained?
(a) 0.68
(b) 0.71
(c) 0.88
(d) 0.98
Sol. (b)
Code for the solution is given below:

from sklearn.datasets import load_iris
from sklearn.cluster import Birch
from sklearn.metrics import rand_score

# Load the Iris dataset
iris = load_iris()
X = iris.data
true_labels = iris.target

# Instantiate the Birch clustering algorithm with the specified parameters
birch = Birch(n_clusters=5, threshold=2)

# Fit the Birch model to the data
birch.fit(X)

# Obtain the predicted cluster labels
predicted_labels = birch.labels_

# Print the predicted cluster labels
print("Predicted cluster labels:\n", predicted_labels)

# Compute the Rand Index to evaluate clustering performance
rand_index = rand_score(true_labels, predicted_labels)

# Print the Rand Index
print("Rand Index:", rand_index)

9. Run PCA on the Iris dataset input features with n_components=2. Now run DBSCAN using
DBSCAN(eps=0.5, min_samples=5) on both the original features and the PCA features.
What are their respective numbers of outliers/noisy points detected by DBSCAN?
As an extra, you can plot the PCA features on a 2D plot using matplotlib.pyplot.scatter with
parameter c=y_pred (where y_pred is the cluster prediction) to visualise the clusters and
outliers.

(a) 10, 10
(b) 17, 7
(c) 21, 11
(d) 5, 10

Sol. (b)
Code for the solution is given below:

import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.cluster import DBSCAN

# Load the Iris dataset
iris = load_iris()
X = iris.data
y = iris.target

# Perform PCA with n_components = 2
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X)

# Run DBSCAN on the original features
dbscan_original = DBSCAN(eps=0.5, min_samples=5)
dbscan_original.fit(X)
labels_original = dbscan_original.labels_
n_outliers_original = list(labels_original).count(-1)

# Run DBSCAN on the PCA features
dbscan_pca = DBSCAN(eps=0.5, min_samples=5)
dbscan_pca.fit(X_pca)
labels_pca = dbscan_pca.labels_
n_outliers_pca = list(labels_pca).count(-1)

# Visualize clusters and outliers
plt.figure(figsize=(12, 6))

# Plot the PCA features
plt.subplot(1, 2, 1)
plt.scatter(X_pca[:, 0], X_pca[:, 1], c=labels_pca, cmap='viridis')
plt.title('PCA Features')
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')

# Plot the original features
plt.subplot(1, 2, 2)
plt.scatter(X[:, 0], X[:, 1], c=labels_original, cmap='viridis')
plt.title('Original Features')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')

plt.show()

# Print the number of outliers/noisy points detected by DBSCAN
print("Number of outliers detected by DBSCAN on original features:", n_outliers_original)
print("Number of outliers detected by DBSCAN on PCA features:", n_outliers_pca)

Assignment 9
Introduction to Machine Learning
Prof. B. Ravindran
1. (One or more correct options) Consider the Markov Random Field given below. We need
to delete one edge (without deleting any nodes) so that in the resulting graph, B and F are
independent given A. Which of these edges could be deleted to achieve this independence?
Note: In each option, we only delete one edge from the original graph.

[Figure: Markov Random Field over the nodes A, B, C, D, E, F]

(a) AC
(b) BE
(c) CE
(d) AE

Sol. (b), (c)


The required independence relation will hold if all paths from B to F pass through A. This
occurs if either BE or CE is deleted.
If AC or AE is deleted, there are alternative paths from B to F that do not pass through A.
Thus, the required independence relation does not hold in these cases.
2. (One or more correct options) Consider the Markov Random Field from question 1. We need
to delete one node (and also delete the edges incident with that node) so that in the resulting
graph, B and C are independent given A. Which of these nodes could be deleted to achieve
this independence? Note: In each option, we only delete one node and its incident edges from
the original graph.

(a) D
(b) E
(c) F
(d) None of the above

Sol. (b)
The required independence relation will hold if all paths from B to C pass through A. This
occurs if E is deleted.

If D or F are deleted, there are alternative paths from B to C that do not pass through A.
Thus, the required independence relation does not hold in these cases.
3. (One or more correct options) Consider the Markov Random Field from question 1. Which
of the nodes has / have the largest Markov blanket (i.e. the Markov blanket with the most
number of nodes)?

(a) A
(b) B
(c) C
(d) D
(e) E
(f) F

Sol. (a), (c)


In an MRF, the Markov blanket of a node comprises its immediate neighbours. Nodes A and
C have four neighbours each, while the other nodes have fewer than four neighbours. Hence,
A and C have the largest Markov blankets.
4. (One or more correct options) Consider the Bayesian Network given below. Which of the
following independence relations hold?

[Figure: Bayesian network over the nodes A, B, C, D, E, F with factorization P(A) P(B) P(C|A, B) P(D|A) P(E|C) P(F|C)]

(a) A and B are independent if C is given


(b) A and B are independent if no other variables are given
(c) C and D are not independent if A is given
(d) A and F are independent if C is given

Sol. (b), (d)


C is a common descendant of A and B. So, A and B are not independent if C is given.
P (A, B, C, D, E, F ) = P (A)P (B)P (D|A)P (C|A, B)P (E|C)P (F |C). Marginalizing over C, D,
E, F, we get P (A, B) = P (A)P (B). Thus, A and B are marginally independent.
A is a parent of both C and D. So, C and D are independent if A is given.
We have a directed path A → C → F . So, A and F are independent if C is given.
5. In the Bayesian Network from question 4, assume that every variable is binary. What is
the number of independent parameters required to represent all the probability tables for the
distribution?

(a) 8
(b) 12
(c) 16
(d) 24
(e) 36

Sol. (b)
To fully specify the distribution, we would need tables corresponding to P (A), P (B), P (C|A, B),
P (D|A), P (E|C), P (F |C).
The table for P (A) would appear as follows. It has 1 independent parameter since P (A = 1)
can be determined given P (A = 0). The table structure is similar for P (B), which also has 1
parameter.

P(A = 0) P(A = 1)
- -

The table for P (C|A, B) would appear as follows. It has 4 independent parameters, where
each parameter corresponds to a row. Given one of the entries in a row, we can determine the
other one.

P (C = 0|A, B) P (C = 1|A, B)
A = 0, B = 0 - -
A = 0, B = 1 - -
A = 1, B = 0 - -
A = 1, B = 1 - -

The table for P (D|A) would appear as follows. It has 2 independent parameters, where each
parameter corresponds to a row. Given one of the entries in a row, we can determine the
other one. Since P (E|C) and P (F |C) have a similar structure, they also have 2 independent
parameters each.

P (D = 0|A) P (D = 1|A)
A=0 - -
A=1 - -

Hence, the total number of independent parameters is 1 + 1 + 4 + 2 + 2 + 2 = 12
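
The same count can be automated from the network structure; a short sketch assuming binary variables and a hand-written parent list that matches the factorization above (the dictionary is ours, for illustration only):

# Parents of each node in the Bayesian network of question 4 (all variables binary)
parents = {'A': [], 'B': [], 'C': ['A', 'B'], 'D': ['A'], 'E': ['C'], 'F': ['C']}

# A binary node with m parents needs (2 - 1) * 2^m independent parameters
n_params = sum((2 - 1) * 2 ** len(p) for p in parents.values())
print(n_params)  # 1 + 1 + 4 + 2 + 2 + 2 = 12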


6. (One or more correct options) Consider the Bayesian Network from question 4. which of
the given options are valid factorizations to calculate the marginal P (E = e) using variable
elimination (need not be the optimal order)?
(a) Σ_B P(B) Σ_A P(A) Σ_D P(D|A) Σ_C P(C|A, B) Σ_F P(E = e|C) P(F|C)
(b) Σ_A P(A) Σ_D P(D|A) Σ_B P(B) Σ_C P(C|A, B) Σ_F P(E = e|C) P(F|C)
(c) Σ_B P(B) Σ_A P(D|A) Σ_D P(A) Σ_F P(C|A, B) Σ_C P(E = e|C) P(F|C)
(d) Σ_A P(B) Σ_B P(D|A) Σ_D P(A) Σ_F P(C|A, B) Σ_C P(E = e|C) P(F|C)
(e) Σ_A P(A) Σ_B P(B) Σ_C P(C|A, B) Σ_D P(D|A) Σ_F P(E = e|C) P(F|C)

Sol. (a), (b), (e)
P (A, B, C, D, E, F ) = P (A)P (B)P (D|A)P (C|A, B)P (E|C)P (F |C). Apply the variable elim-
ination procedure as discussed in the lecture.
Note that if we are summing over some random variable X, all terms containing X should
appear to the right of Σ_X. Options (c) and (d) are incorrect as they do not satisfy this
condition.

7. (One or more correct options) Consider the MRF given below. Which of the following factor-
ization(s) of P (a, b, c, d, e) satisfies/satisfy the independence assumptions represented by this
MRF?

[Figure: MRF over the nodes A, B, C, D, E with maximal cliques {A, B, C, D} and {B, E}]

(a) P(a, b, c, d, e) = (1/Z) ψ1(a, b, c, d) ψ2(b, e)
(b) P(a, b, c, d, e) = (1/Z) ψ1(b) ψ2(a, c, d) ψ3(a, b, e)
(c) P(a, b, c, d, e) = (1/Z) ψ1(a, b) ψ2(c, d) ψ3(b, e)
(d) P(a, b, c, d, e) = (1/Z) ψ1(a, b) ψ2(c, d) ψ3(b, d, e)
(e) P(a, b, c, d, e) = (1/Z) ψ1(a, c) ψ2(b, d) ψ3(b, e)
(f) P(a, b, c, d, e) = (1/Z) ψ1(c) ψ2(b, e) ψ3(b, a, d)

Sol. (a), (c), (e), (f)


The graph has two maximal cliques: (a, b, c, d) and (b, e). Option (a) is correct as it has a
factor corresponding to each maximal clique.
Option (b) has a factor corresponding to (a, b, e), which is not a clique. Hence, it is not a
valid factorization because the independence assumptions in the MRF are violated. Similarly,
option (d) is incorrect because it has a factor for (b, d, e), which is not a clique.
In options (c), (e) and (f), each factor corresponds to a clique (but need not be a maximal
clique). Hence, they satisfy the independence assumptions and are valid factorizations.
8. (One or more correct options) The following figure shows an HMM for three time steps i =
1, 2, 3. Suppose that it is used to perform part-of-speech tagging for a sentence. Which of the
following statements is/are true?

[Figure: HMM over three time steps with a chain X1 → X2 → X3 and corresponding emissions Y1, Y2, Y3]

(a) The Xi variables represent parts-of-speech and the Yi variables represent the words in
the sentence.
(b) The Yi variables represent parts-of-speech and the Xi variables represent the words in
the sentence.
(c) The Xi variables are observed and the Yi variables need to be predicted.
(d) The Yi variables are observed and the Xi variables need to be predicted.

Sol. (a), (d)


Xi are the hidden variables which need to be predicted. Yi are the observed variables. Given
a sentence, POS tagging identifies the part-of-speech of each word.

Assignment 8
Introduction to Machine Learning
Prof. B. Ravindran
1. Consider the Bayesian network given below. Which of the following statement(s) is/are correct?

(a) B is independent of F, given D.


(b) A is independent of E, given C.
(c) E and F are not independent, given D.
(d) A and B are not independent, given D.

Sol. (a), (d)


There is only one path from B to F, which passes through C. If C is given, then B and F are
independent.
There is an edge from A to E. So, A and E are not independent even if C is given.
E and F are children of D. Hence, they are independent if D is given.
D is a descendant of both A and B. Hence, they are not independent if D is given.
2. Select the correct statement(s) from the ones given below.

(a) Naive Bayes models are a special case of Bayesian networks.


(b) Naive Bayes models are a generalization of Bayesian networks.
(c) With no independence among the variables, a Bayesian network representing a distribution
over n variables would have n(n − 1)/2 edges.
(d) With no independence among the variables, a Bayesian network representing a distribution
over n variables would have n − 1 edges.

Sol. (a), (c)

As explained in the lectures, Naive Bayes uses a specific kind of factorization of the joint
distribution. Hence, it is a special case of Bayesian networks.
With no independence, the factorization would be
P(x1, x2, ..., xn) = P(x1 | x2, ..., xn) · P(x2 | x3, ..., xn) · ... · P(xn−1 | xn) · P(xn)
The i-th node has (n − i) outgoing directed edges.
The total number of edges is (n − 1) + (n − 2) + ... + 1 + 0 = n(n − 1)/2.
3. A decision tree classifier learned from a fixed training set achieves 100% accuracy. Which
of the following models trained using the same training set will also achieve 100% accuracy?
(Assume P (xi |c) as Gaussians)
I Logistic Regressor.
II A polynomial of degree one kernel SVM.
III A linear discriminant function.
IV Naive Bayes classifier.

(a) I
(b) I and II
(c) IV
(d) III
(e) None of the above.

Sol. (e)
The answer is (e), none of the above. A decision tree can perfectly fit training sets that none of the listed models can: for instance, it can represent an XOR-like labelling of the data, whereas a logistic regressor, a degree-one polynomial kernel SVM, a linear discriminant function and a Naive Bayes classifier (with Gaussian class-conditional densities) cannot. Hence, none of the listed models is guaranteed to also achieve 100% accuracy on the same training set.
4. Which of the following points would Bayesians and frequentists disagree on?
(a) The use of a non-Gaussian noise model in probabilistic regression.
(b) The use of probabilistic modelling for regression.
(c) The use of prior distributions on the parameters in a probabilistic model.
(d) The use of class priors in Gaussian Discriminant Analysis.
(e) The idea of assuming a probability distribution over models
Sol. (c), (e)
(c) Represents a fundamental difference in how Bayesians and frequentists approach uncer-
tainty in parameter estimation. Bayesians use prior distributions to incorporate prior infor-
mation, while frequentists typically do not.
(e) Reflects a difference in how Bayesians and frequentists handle model uncertainty. Bayesians
treat models as random variables and assign probabilities to them, whereas frequentists typ-
ically focus on single “best” models based on criteria like likelihood or complexity without
assigning probabilities to models themselves.

5. Consider the following data for 500 instances of home, 600 instances of office and 700 instances
of factory type buildings

Building Balcony Multi-storied Power-backup Total


Home 400 200 100 500
Office 300 150 450 600
Factory 150 450 450 700
Total 850 800 1000 1800

Table 1

Suppose a building has a balcony and power-backup but is not multi-storied. According to
the Naive Bayes algorithm, it is of type

(a) Home
(b) Office
(c) Factory
Sol. (c)
P(Home | Balcony, Multi-storied, Power-backup) ∝ P(Balcony | Home) × P(Multi-storied | Home) × P(Power-backup | Home) × P(Home) = 4/5 × 2/5 × 1/5 × 5/18 = 0.018

P(Office | Balcony, Multi-storied, Power-backup) ∝ P(Balcony | Office) × P(Multi-storied | Office) × P(Power-backup | Office) × P(Office) = 3/6 × 15/60 × 45/60 × 6/18 = 0.031

P(Factory | Balcony, Multi-storied, Power-backup) ∝ P(Balcony | Factory) × P(Multi-storied | Factory) × P(Power-backup | Factory) × P(Factory) = 15/70 × 45/70 × 45/70 × 7/18 = 0.034
6. In AdaBoost, we re-weight points giving points misclassified in previous iterations more weight.
Suppose we introduced a limit or cap on the weight that any point can take (for example, say
we introduce a restriction that prevents any point’s weight from exceeding a value of 10).
Which among the following would be an effect of such a modification? (Multiple options may
be correct)
(a) We may observe the performance of the classifier reduce as the number of stages increase
(b) It makes the final classifier robust to outliers
(c) It may result in lower overall performance
(d) It will make the problem computationally infeasible
Sol. (b) and (c)
Outliers tend to get misclassified. As the number of iterations increase, the weight correspond-
ing to outlier points can become very large resulting in subsequent classifier models trying to
classify the outlier points correctly. This generally has an adverse effect on the overall clas-
sifier. Restricting the weights is one way of mitigating this problem. However, this can also
lower the performance of the classifier.

7. While using Random Forests, if the input data is such that it contains a large number (> 80%)
of irrelevant features (the target variable is independent of these features), which of the
following statements are TRUE?
(a) Random Forests have reduced performance as the fraction of irrelevant features increases.
(b) Random forests have increased performance as the fraction of irrelevant features increases.
(c) The fraction of irrelevant features doesn’t impact the performance of random forest.

Sol. (a)
Random Forests sample a subset of features for each decision tree. Assuming that the number
of features in the sample is a fixed fraction of the total number of features, it is easy to see
that the greater the fraction of irrelevant features, the more the fraction of useless classifiers
(which dilutes the effect of the relevant classifiers).

8. Suppose you have a 6 class classification problem with one input variable. You decide to
use logistic regression to build a predictive model. What is the minimum number of (β0 , β)
parameter pairs that need to be estimated?
(a) 6
(b) 12
(c) 5
(d) 10
Sol. (c)
Explanation We can use logistic regression to predict the probability of the input belonging
a particular class. 5 classifiers can then be used to estimate the probability of the first 5 classes
and the probability of the last class can be estimated using the fact that the class membership
probabilities must sum to 1.

Assignment 7
Introduction to Machine Learning
Prof. B. Ravindran
1. Which of the following statement(s) regarding the evaluation of Machine Learning models
is/are true?

(a) A model with a lower training loss will perform better on a test dataset.
(b) The train and test datasets should represent the underlying distribution of the data.
(c) To determine the variation in the performance of a learning algorithm, we generally use
one training set and one test set.
(d) A learning algorithm can learn different parameter values if given different samples from
the same distribution.

Sol. (b), (d)


The training loss does not necessarily indicate the performance of a model on test data.
For a good estimate of the performance, the train and test data should represent the data
distribution.
We need multiple train and test samples to determine the variation in the learned models.
The learned parameter values may be different for different samples, as explained in the lec-
tures.
2. Suppose we have a classification dataset comprising of 2 classes A and B with 100 and 50
samples respectively. Suppose we use stratified sampling to split the data into train and test
sets. Which of the following train-test splits would be appropriate?

(a) Train- {A : 80 samples, B : 30 samples}, Test- {A : 20 samples, B : 20 samples}


(b) Train- {A : 20 samples, B : 20 samples}, Test- {A : 80 samples, B : 30 samples}
(c) Train- {A : 80 samples, B : 40 samples}, Test- {A : 20 samples, B : 10 samples}
(d) Train- {A : 20 samples, B : 10 samples}, Test- {A : 80 samples, B : 40 samples}

Sol. (c)
In stratified sampling, the train and test sets have the same class proportions as the original
dataset. Also, the train set is generally chosen to be larger than the test set.
Options (c) and (d) preserve the class proportion in the original dataset. Of these two, (c) has
a larger training set while (d) has a larger test set. Hence, (c) is the right option.
3. Suppose we are performing cross-validation on a multiclass classification dataset with N data
points. Which of the following statement(s) is/are correct?

(a) In k-fold cross validation, each fold should have a class-wise proportion similar to the
given dataset.
(b) In k-fold cross-validation, we train one model and evaluate it on the k different test sets.
(c) In LOOCV, we train N different models, using (N-1) data points for training each model.
(d) In LOOCV, we can use the same test data to evaluate all the trained models.

Sol. (a), (c)
If the class-wise proportions are different across different folds, the evaluation of the model will be inaccurate.
In k-fold cross-validation, we divide the dataset into k parts. For a given model, we use one
part as the test set and the remaining (k-1) parts as the training set. This process is repeated
k times to get k models.
LOOCV (Leave One Out Cross Validation) is a special case of k-fold cross-validation. For a
given model, we use one data point as the test set and the remaining (N-1) data points for
training. This process is repeated N times to get N models.
4. Suppose we have a binary classification problem wherein we need to achieve a high recall. On
training four classifiers and evaluating them, we obtain the following confusion matrices. Each
matrix has the format indicated below:

Predicted Positive Predicted Negative


Actual Positive —- —-
Actual Negative —- —-

Which of these classifiers should we prefer?


 
(a) [[4, 6], [3, 87]]
(b) [[8, 2], [11, 79]]
(c) [[5, 5], [0, 90]]
(d) [[2, 8], [4, 86]]

Sol. (b)
The recall is computed as TP / (TP + FN).
For option (b), recall = 8 / (8 + 2) = 0.8, which is the maximum among all the options. Similarly,
we can compute it for the other options and verify that option (b) has the highest recall.

5. Suppose we have a binary classification problem wherein we need to achieve a low False Positive
Rate (FPR). On training four classifiers and evaluating them, we obtain the following confusion
matrices. Each matrix has the format indicated below:

Predicted Positive Predicted Negative


Actual Positive —- —-
Actual Negative —- —-

Which of these classifiers should we prefer?


 
(a) [[4, 6], [6, 84]]
(b) [[8, 2], [13, 77]]
(c) [[5, 5], [2, 88]]
(d) [[10, 0], [4, 86]]

Sol. (c)
The FPR is computed as FP / (FP + TN).
For option (c), FPR = 2 / (2 + 88) = 0.022, which is the minimum among all the options. Similarly,
we compute it for the other options and verify that option (c) has the lowest FPR.
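
Both computations can be checked directly from the matrices; a brief sketch (matrices entered as [[TP, FN], [FP, TN]], following the actual/predicted layout given in the questions):

def recall(m):
    tp, fn = m[0]
    return tp / (tp + fn)

def fpr(m):
    fp, tn = m[1]
    return fp / (fp + tn)

# Confusion matrices from question 4 (recall) and question 5 (FPR)
q4 = {'a': [[4, 6], [3, 87]], 'b': [[8, 2], [11, 79]], 'c': [[5, 5], [0, 90]], 'd': [[2, 8], [4, 86]]}
q5 = {'a': [[4, 6], [6, 84]], 'b': [[8, 2], [13, 77]], 'c': [[5, 5], [2, 88]], 'd': [[10, 0], [4, 86]]}

print({k: round(recall(m), 3) for k, m in q4.items()})  # option (b) has the highest recall
print({k: round(fpr(m), 3) for k, m in q5.items()})     # option (c) has the lowest FPR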

6. We have a logistic regression model that computes the probability p(x) that a given in-
put x belongs to the positive class. For a threshold θ ∈ (0, 1), the class labels f (x) ∈
{negative, positive} are predicted as given below.
f(x) = negative if p(x) < θ, and f(x) = positive if p(x) ≥ θ

For θ = 0.5, we have T P R = 0.8 and F P R = 0.3. Then which of the following statement(s)
is/are correct?

(a) For θ = 0.4, the FPR could be lower than 0.25.


(b) For θ = 0.4, the FPR could be higher than 0.45.
(c) For θ = 0.6, the TPR must be higher than 0.85.
(d) For θ = 0.6, the TPR could be higher than 0.85.
(e) For θ = 0.4, the TPR must be lower than 0.75.
(f) For θ = 0.4, the TPR could be lower than 0.75.

Sol. (b)
FPR = FP/(FP + TN). Here the denominator (FP + TN) is the number of actual negative
samples and it remains constant irrespective of the value of θ. If the value of θ is decreased,
some of the actual negatives which were correctly classified as negatives for θ = 0.5 could be
incorrectly classified as positives for θ = 0.4. Thus, a decrease in θ could increase FP, which
in turn will increase FPR. On the contrary, an increase in θ could decrease FP, which in turn
will decrease FPR. Thus, option (a) is incorrect and option (b) is correct.
TPR = TP/(TP + FN). Here the denominator (TP + FN) is the number of actual positive
samples and it remains constant irrespective of the value of θ. If the value of θ is decreased,
some of the actual positives which were incorrectly classified as negatives for θ = 0.5 could be
correctly classified as positives for θ = 0.4. Thus, a decrease in θ could increase TP, which in
turn will increase TPR. On the contrary, an increase in θ could decrease TP, which in turn
will decrease TPR. Thus, options (c), (d), (e), (f) are incorrect.
Hence, the only correct option is (b). All the other options are incorrect.

7. Consider the following statements.
Statement P: Boosting takes multiple weak classifiers and combines them into a strong
classifier.
Statement Q: Boosting assigns equal weights to the predictions of all the weak classifiers,
resulting in a high overall performance.

(a) P is True. Q is True. Q is the correct explanation for P.
(b) P is True. Q is True. Q is not the correct explanation for P.
(c) P is True. Q is False.
(d) Both P and Q are False.

Sol. (c)

Statement P is true since it summarizes the basic principle of boosting.


As explained in the lecture, boosting determines the proportion of importance each weak
classifier should be assigned and combines them to make the final prediction. It does not
assign equal weights to each classifier. Hence, statement Q is false.

8. Which of the following statement(s) about ensemble methods is/are correct?

(a) The individual classifiers in bagging cannot be trained parallelly.


(b) The individual classifiers in boosting cannot be trained parallelly.
(c) A committee machine can consist of different kinds of classifiers like SVM, decision trees
and logistic regression.
(d) Bagging further increases the variance of an unstable classifier.

Sol. (b), (c)


Please refer to the relevant lectures.

Assignment 6
Introduction to Machine Learning
Prof. B. Ravindran
1. From the given dataset, choose the optimal decision tree learned by a greedy approach:

A B C Y
F F F F
T F T T
T T F T
T T T F

(a)

(b)

(c)
(d) None of the Above.

Sol. (c)
Although we get a better information gain by first splitting on A, Y is just a function of B/C:
Y = B XOR C. Thus, we can build a tree of depth 2 which classifies correctly, and is optimal.
2. Which of the following properties are characteristic of decision trees?

(a) High bias


(b) High variance

(c) Lack of smoothness of prediction surfaces
(d) Unbounded parameter set

Sol. (b), (c) & (d)


Decision trees are generally unstable considering that a small change in the data set can result
in a very different set of splits. This is mainly due to the hierarchical nature of decision trees,
since a change in split points in the initial stages will affect all the subsequent splits.
The decision surfaces that result from decision tree learning are generated by recursive splitting
of the feature space using axis parallel hyper planes. They clearly do not produce smooth
prediction surfaces such as the ones produced by, say, neural networks.
Decision trees do not make any assumptions about the distribution of the data. They are
non-parametric methods where the number of parameters depends solely on the data set on
which training is carried out.
3. Entropy for a 50-50 split between two classes is:
(a) 0
(b) 0.5
(c) 1
(d) None of the above
Sol. (c)
Entropy = −Σ_{i=1}^{n} p_i log2(p_i) = −(0.5 × −1) − (0.5 × −1) = 1
4. Having built a decision tree, we are using reduced error pruning to reduce the size of the tree.
We select a node to collapse. For this particular node, on the left branch, there are 3 training
data points with the following feature values: 5, 7, 9.6 and for the right branch, there are
four training data points with the following feature values: 8.7, 9.8, 10.5, 11. What were the
original responses for data points along the two branches (left & right respectively) and what
is the new response after collapsing the node?

(a) 10.8, 13.33, 14.48


(b) 10.8, 13.33, 12.06
(c) 7.2, 10, 8.8
(d) 7.2, 10, 8.6

Sol. (c)
Original responses:
Left: (5 + 7 + 9.6)/3 = 21.6/3 = 7.2
Right: (8.7 + 9.8 + 10.5 + 11)/4 = 40/4 = 10
New response: 7.2 × 3/7 + 10 × 4/7 = 8.8
5. Given that we can select the same feature multiple times during the recursive partitioning of
the input space, is it always possible to achieve 100% accuracy on the training data (given
that we allow for trees to grow to their maximum size) when building decision trees?

(a) Yes

(b) No

Sol. (b)
Consider a pair of data points with identical input features but different class labels. Such
points can be part of the training data but will not be able to be classified without error.
6. Suppose on performing reduced error pruning, we collapsed a node and observed an improve-
ment in the prediction accuracy on the validation set. Which among the following statements
are possible in light of the performance improvement observed?

(a) The collapsed node helped overcome the effect of one or more noise affected data points
in the training set
(b) The validation set had one or more noise affected data points in the region corresponding
to the collapsed node
(c) The validation set did not have any data points along at least one of the collapsed branches
(d) The validation set did not contain data points which were adversely affected by the
collapsed node.

Sol. (a), (b), (c)


The first option is the kind of error we normally expect pruning to help us overcome. However,
a node collapse which ideally should result in an increase in the overall error of the model may
actually show an improvement due to a number of factors. Perhaps the points which should
have been misclassified due to the collapse are mislabelled in the validation set (option (b)).
Such points may also be missing from the the validation set (option (c)). Finally, even if the
increased error due to the collapsed node is registered in the validation set, it may be masked
by the absence of errors (existing in the training data) in other parts of the validation set
(option (d)).

7. Consider the following data set:

price maintenance capacity airbag profitable


low low 2 no yes
low med 4 yes no
low low 4 no yes
low high 4 no no
med med 4 no no
med med 4 yes yes
med high 2 yes no
med high 5 no yes
high med 4 yes yes
high high 2 yes no
high high 5 yes yes

Considering ‘profitable’ as the binary values attribute we are trying to predict, which of the
attributes would you select as the root in a decision tree with multi-way splits using the
cross-entropy impurity measure?

(a) price

(b) maintenance
(c) capacity
(d) airbag

Sol. (c)
cross-entropy_price(D) = 4/11 (−2/4 log2 2/4 − 2/4 log2 2/4) + 4/11 (−2/4 log2 2/4 − 2/4 log2 2/4) + 3/11 (−2/3 log2 2/3 − 1/3 log2 1/3) = 0.9777
cross-entropy_maintenance(D) = 2/11 (−2/2 log2 2/2 − 0/2 log2 0/2) + 4/11 (−2/4 log2 2/4 − 2/4 log2 2/4) + 5/11 (−2/5 log2 2/5 − 3/5 log2 3/5) = 0.8050
cross-entropy_capacity(D) = 3/11 (−1/3 log2 1/3 − 2/3 log2 2/3) + 6/11 (−3/6 log2 3/6 − 3/6 log2 3/6) + 2/11 (−2/2 log2 2/2 − 0/2 log2 0/2) = 0.7959
cross-entropy_airbag(D) = 5/11 (−3/5 log2 3/5 − 2/5 log2 2/5) + 6/11 (−3/6 log2 3/6 − 3/6 log2 3/6) = 0.9868
8. For the same data set, suppose we decide to construct a decision tree using binary splits and the
Gini index impurity measure. Which among the following feature and split point combinations
would be the best to use as the root node assuming that we consider each of the input features
to be unordered?

(a) price - {low, med}|{high}


(b) maintenance - {high}|{med, low}
(c) maintenance - {high, med}|{low}
(d) capacity - {2}|{4, 5}

Sol. (c)
gini_price({low,med}|{high})(D) = 8/11 × 2 × 4/8 × 4/8 + 3/11 × 2 × 2/3 × 1/3 = 0.4848
gini_maintenance({high}|{med,low})(D) = 5/11 × 2 × 2/5 × 3/5 + 6/11 × 2 × 4/6 × 2/6 = 0.4606
gini_maintenance({high,med}|{low})(D) = 9/11 × 2 × 4/9 × 5/9 + 2/11 × 2 × 1 × 0 = 0.4040
gini_capacity({2}|{4,5})(D) = 3/11 × 2 × 1/3 × 2/3 + 8/11 × 2 × 5/8 × 3/8 = 0.4621
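
Both impurity computations (questions 7 and 8) can be reproduced with a few lines of Python; a sketch over the table above, with the rows copied from the dataset and the helper names being ours:

from collections import Counter
from math import log2

# (price, maintenance, capacity, airbag, profitable) for the 11 buildings
rows = [
    ('low','low',2,'no','yes'), ('low','med',4,'yes','no'), ('low','low',4,'no','yes'),
    ('low','high',4,'no','no'), ('med','med',4,'no','no'), ('med','med',4,'yes','yes'),
    ('med','high',2,'yes','no'), ('med','high',5,'no','yes'), ('high','med',4,'yes','yes'),
    ('high','high',2,'yes','no'), ('high','high',5,'yes','yes'),
]
attrs = ['price', 'maintenance', 'capacity', 'airbag']

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def gini(labels):
    n = len(labels)
    return 1 - sum((c / n) ** 2 for c in Counter(labels).values())

# Question 7: weighted entropy of a multi-way split on each attribute
for i, attr in enumerate(attrs):
    h = sum((cnt / len(rows)) * entropy([r[4] for r in rows if r[i] == v])
            for v, cnt in Counter(r[i] for r in rows).items())
    print(attr, round(h, 4))  # capacity gives the lowest weighted entropy

# Question 8: weighted Gini index of a binary split, attribute value set vs the rest
def split_gini(idx, left_values):
    left = [r[4] for r in rows if r[idx] in left_values]
    right = [r[4] for r in rows if r[idx] not in left_values]
    return (len(left) / len(rows)) * gini(left) + (len(right) / len(rows)) * gini(right)

for idx, vals in [(0, {'low', 'med'}), (1, {'high'}), (1, {'high', 'med'}), (2, {2})]:
    print(attrs[idx], vals, round(split_gini(idx, vals), 4))  # maintenance {high, med}|{low} is lowest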

Assignment 5
Introduction to Machine Learning
Prof. B. Ravindran
1. Consider a feedforward neural network that performs regression on a p-dimensional input to
produce a scalar output. It has m hidden layers and each of these layers has k hidden units.
What is the total number of trainable parameters in the network? Ignore the bias terms.

(a) pk + mk^2
(b) pk + mk^2 + k
(c) pk + (m − 1)k^2 + k
(d) p^2 + (m − 1)pk + k
(e) p^2 + (m − 1)pk + k^2

Sol. (c)
Number of edges between the input layer and the first hidden layer = pk.
Number of edges between the i-th and (i + 1)-th hidden layers is k^2. Taking i = 1, 2, ..., m − 1,
we get (m − 1)k^2 edges.
Since the output is a scalar, there is only one neuron in the output layer. Therefore, the
number of edges between the last (i.e. m-th) hidden layer and the output layer = k.
Hence, the total number of edges = pk + (m − 1)k^2 + k. Each of these edges corresponds to a
parameter.
2. Consider a neural network layer defined as y = ReLU(Wx). Here x ∈ R^p is the input,
y ∈ R^d is the output and W ∈ R^{d×p} is the parameter matrix. The ReLU activation (defined
as ReLU(z) := max(0, z) for a scalar z) is applied element-wise to Wx. Find ∂y_i/∂W_ij, where
i = 1, ..., d and j = 1, ..., p. In the following options, I(condition) is an indicator function that
returns 1 if the condition is true and 0 if it is false.

(a) I(y_i > 0) x_i
(b) I(y_i > 0) x_j
(c) I(y_i ≤ 0) x_i
(d) I(y_i > 0) W_ij x_j
(e) I(y_i ≤ 0) W_ij x_i

Sol. (b)
We have y_i = max(Σ_{j=1}^{p} W_ij x_j, 0).
If Σ_{j=1}^{p} W_ij x_j ≤ 0, then y_i = 0, which implies ∂y_i/∂W_ij = 0.
If Σ_{j=1}^{p} W_ij x_j > 0, then y_i = Σ_{j=1}^{p} W_ij x_j, which implies ∂y_i/∂W_ij = x_j.
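
The indicator form of the gradient can be verified numerically with a finite-difference check; a small sketch with randomly chosen sizes and weights (for illustration only):

import numpy as np

rng = np.random.default_rng(0)
d, p = 3, 4
W = rng.normal(size=(d, p))
x = rng.normal(size=p)

def layer(W):
    return np.maximum(W @ x, 0.0)   # y = ReLU(Wx), applied element-wise

i, j, eps = 1, 2, 1e-6
W_pert = W.copy()
W_pert[i, j] += eps

numeric = (layer(W_pert)[i] - layer(W)[i]) / eps   # finite-difference estimate of dy_i/dW_ij
analytic = float(layer(W)[i] > 0) * x[j]           # I(y_i > 0) * x_j, i.e. option (b)
print(numeric, analytic)                            # the two values agree (up to eps)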

3. Consider a two-layered neural network y = σ(W^(B) σ(W^(A) x)). Let h = σ(W^(A) x) denote the
hidden layer representation. W^(A) and W^(B) are arbitrary weights. Which of the following
statement(s) is/are true? Note: ∇_g(f) denotes the gradient of f w.r.t. g.

(a) ∇_h(y) depends on W^(A).
(b) ∇_{W^(A)}(y) depends on W^(B).
(c) ∇_{W^(A)}(h) depends on W^(B).
(d) ∇_{W^(B)}(y) depends on W^(A).
Sol. (b), (d)
Since y = σ(W^(B) h), we only require W^(B) and not W^(A) to compute y from h.
∇_{W^(A)}(y) can be decomposed into ∇_h(y) and ∇_{W^(A)}(h) using the chain rule. Of these two
components, ∇_h(y) depends on W^(B).
Since h = σ(W^(A) x), we only require W^(A) and not W^(B) to compute h from x.
∇_{W^(B)}(y) depends on h, which in turn depends on W^(A). Hence, ∇_{W^(B)}(y) depends on W^(A).
4. Which of the following statement(s) about the initialization of neural network weights is/are
true?
(a) Two different initializations of the same network could converge to different minima.
(b) For a given initialization, gradient descent will converge to the same minima irrespective
of the learning rate.
(c) The weights should be initialized to a constant value.
(d) The initial values of the weights should be sampled from a probability distribution.
Sol. (a), (d)
Since the loss surface of a neural network is highly non-convex, it has multiple local minima.
Hence, different initializations or learning rates may result in convergence to different minima.
If the weights are initialized to a constant value, all the neurons in a layer will learn similar
features. To avoid this, the initial values should be sampled from a distribution.
5. Consider the following statements about the derivatives of the sigmoid (σ(x) = 1/(1 + exp(−x)))
and tanh (tanh(x) = (exp(x) − exp(−x))/(exp(x) + exp(−x))) activation functions. Which of these statement(s) is/are
correct?
(a) 0 < σ′(x) ≤ 1/8
(b) lim_{x→−∞} σ′(x) = 0
(c) 0 < tanh′(x) ≤ 1
(d) lim_{x→+∞} tanh′(x) = 1
Sol. (b), (c)
σ′(x) = σ(x)(1 − σ(x))
As x → −∞, we have σ(x) → 0 and (1 − σ(x)) → 1, which implies σ′(x) → 0.
0 < σ(x) < 1 implies σ(x)(1 − σ(x)) > 0. The maximum value of σ′(x) is attained at x = 0 since
σ′(0) = (1/2)(1 − 1/2) = 1/4.

tanh′(x) = 1 − (tanh(x))^2

As x → +∞, we have tanh(x) → 1, which implies tanh′(x) → 0.
−1 < tanh(x) < 1 implies 0 < 1 − (tanh(x))^2 ≤ 1. The maximum value of tanh′(x) is attained
at x = 0 since tanh′(0) = 1 − 0^2 = 1.

6. A geometric distribution is defined by the p.m.f. f(x; p) = (1 − p)^(x−1) p for x = 1, 2, ....
Given the samples [4, 5, 6, 5, 4, 3] drawn from this distribution, find the MLE of p. Using this
estimate, find the probability of sampling x ≥ 5 from the distribution.

(a) 0.289
(b) 0.325
(c) 0.417
(d) 0.366

Sol. (d)
The likelihood function is L(p | x_1, ..., x_n) = Π_{i=1}^{n} (1 − p)^(x_i − 1) p = p^n (1 − p)^(Σ x_i − n)
log(L(p | x_1, ..., x_n)) = n log(p) + (Σ x_i − n) log(1 − p)
Take the derivative of the RHS w.r.t. p and equate it to 0. On simplifying, we get p̂_ML = n / Σ x_i.
Substituting the given values, p̂_ML = 6 / (4 + 5 + 6 + 5 + 4 + 3) = 6/27 = 0.222.
P(x ≥ 5) = Σ_{i=5}^{+∞} (1 − p)^(i−1) p = (1 − p)^4
Substituting p̂_ML, P(x ≥ 5) = (1 − 0.222)^4 = 0.366
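
The closed-form estimate and the tail probability can be checked with a couple of lines (a sketch of the arithmetic only):

samples = [4, 5, 6, 5, 4, 3]

# MLE of p for the geometric distribution: n / sum(x_i)
p_hat = len(samples) / sum(samples)

# Under the fitted distribution, P(X >= 5) = (1 - p)^4
tail = (1 - p_hat) ** 4
print(round(p_hat, 3), round(tail, 3))  # 0.222 and 0.366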
7. Consider a Bernoulli distribution with p = 0.7 (true value of the parameter). We draw
samples from this distribution and compute an MAP estimate of p by assuming a prior distri-
bution over p. Which of the following statement(s) is/are true?

(a) If the prior is Beta(2, 6), we will likely require fewer samples for converging to the true
value than if the prior is Beta(6, 2).
(b) If the prior is Beta(6, 2), we will likely require fewer samples for converging to the true
value than if the prior is Beta(2, 6).
(c) With a prior of Beta(2, 100), the estimate will never converge to the true value, regardless
of the number of samples used.
(d) With a prior of U (0, 0.5) (i.e. uniform distribution between 0 and 0.5), the estimate will
never converge to the true value, regardless of the number of samples used.

Sol. (b), (d)


Beta(6, 2) has a much higher density than Beta(2, 6) near the true value 0.7. Thus, Beta(6, 2)
will likely require fewer samples for convergence.
Beta(2, 100) has a high density close to 0 but has a non-zero density at 0.7. Therefore, the
estimate will converge to the true value if we use a sufficiently large number of samples.
However, U (0, 0.5) has zero density at 0.7. Hence, the estimate will never converge to 0.7.
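
A rough numerical illustration of the effect of the prior (a sketch, assuming the standard Beta-Bernoulli MAP estimate (α + k − 1)/(α + β + n − 2) for k successes in n draws; the sample size and seed are chosen arbitrarily):

import numpy as np

rng = np.random.default_rng(0)
p_true = 0.7
draws = rng.random(50) < p_true      # 50 Bernoulli(0.7) samples
k, n = int(draws.sum()), draws.size

for a, b in [(2, 6), (6, 2)]:
    p_map = (a + k - 1) / (a + b + n - 2)
    print(f"Beta({a},{b}) prior -> MAP estimate {p_map:.3f}")
# The Beta(6, 2) prior places more mass near 0.7, so its MAP estimate is closer to the true
# value for the same number of samples, illustrating option (b).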
8. Which of the following statement(s) about parameter estimation techniques is/are true?

(a) To obtain a distribution over the predicted values for a new data point, we need to
compute an integral over the parameter space.
(b) The MAP estimate of the parameter gives a point prediction for a new data point.
(c) The MLE of a parameter gives a distribution of predicted values for a new data point.
(d) We need a point estimate of the parameter to compute a distribution of the predicted
values for a new data point.

Sol. (a), (b)

Option (a) is true and option (d) is false based on the equation written at 01:10 in the "Parameter Estimation III" lecture.
Both MAP and MLE give point estimates for the parameter, leading to a point prediction on
a new data point. Hence, (b) is true and (c) is false.

Assignment 4
Introduction to Machine Learning
Prof. B. Ravindran
1. For a two-class classification problem, we use an SVM classifier and obtain the
following separating hyperplane. We have marked 4 instances of the training
data. Identify the point which will have the most impact on the shape of the
boundary on its removal.

(a) 1
(b) 2
(c) 3
(d) 4

Sol. (a)
We need to identify support vectors on which the hyperplane is supported. In
the figure above, data point 1 is a support vector. Removal of data point 1 will
have an impact on the decision boundary.

2. Consider a soft-margin SVM with a linear kernel and no slack variables, trained
on n points. The number of support vectors returned is k. By adding one extra
point to the dataset and retraining the classifier, what is the maximum possible
number of support vectors that can be returned (tuning parameter C)?

(a) k
(b) n
(c) n + 1
(d) k + 1

Sol. (c)

3. Consider the data set given below.

Feature 1 Feature 2 Class


0 0 A
0 1 B
1 0 B
1 1 A

Claim: The PLA (Perceptron Learning Algorithm) can learn a classifier that
achieves zero misclassification error on the training data. This claim is:

(a) True
(b) False
(c) Depends on the initial weights
(d) True, only if we normalize the feature vectors before applying PLA.

Sol. (b)
This dataset is not linearly separable; hence PLA will not achieve zero misclas-
sification error on the training data.

4. Consider the following dataset:

x y
1 1
2 1
4 -1
5 -1
6 -1
7 -1
9 1
10 1

(Note: x is the feature and y is the output)

Which of these is not a support vector when using a Support Vector Classifier
with a polynomial kernel with degree = 3, C = 1, and gamma = 0.1?
(We recommend using sklearn to solve this question.)

(a) 3
(b) 1
(c) 9
(d) 10

Sol. (a)
The following code will give the support vectors:

>>> from sklearn.svm import SVC
>>> import numpy as np
>>> class_algo = SVC(C=1, kernel='poly', degree=3, gamma=0.1)
>>> X = np.array([1., 2., 4., 5., 6., 7., 9., 10.])
>>> X = X[:, None]
>>> Y = np.array([1, 1, -1, -1, -1, -1, 1, 1])
>>> classifier = class_algo.fit(X, Y)
>>> print(classifier.support_vectors_)

5. Which of the following is/are true about the Perceptron classifier?


(a) It can learn an OR function
(b) It can learn an AND function
(c) The obtained separating hyperplane depends on the order in which the
points are presented in the training process.
(d) For a linearly separable problem, there exists some initialization of the
weights which might lead to non-convergent cases.
Sol. (a), (b) and (c)
OR and AND are linearly separable functions, hence they can be learnt by a perceptron. XOR,
in contrast, is not linearly separable and cannot be learnt by the perceptron learning algorithm,
which can learn only linear decision boundaries.
The perceptron learning algorithm depends on the order in which the data is presented; there
are multiple possible separating hyperplanes, and depending on the order we will converge to
one of them.
We can also prove that the algorithm always converges to a separating hyperplane if one
exists. Hence (d) is false.
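
The claims about OR, AND and XOR can also be checked empirically with sklearn's Perceptron; a small sketch (hyperparameters chosen arbitrarily):

import numpy as np
from sklearn.linear_model import Perceptron

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
targets = {'OR': [0, 1, 1, 1], 'AND': [0, 0, 0, 1], 'XOR': [0, 1, 1, 0]}

for name, y in targets.items():
    clf = Perceptron(max_iter=1000, tol=None, random_state=0).fit(X, y)
    print(name, clf.score(X, y))  # OR and AND reach training accuracy 1.0; XOR does not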

6. In SVMs, a large functional margin represents a confident and correct prediction. Let the
functional margin be defined as

ŷ^(i) = y^(i) (w^T x + b)

and the linear classifier as

h_{w,b}(x) = g(w^T x + b)

For any choice of suitable g(·), if we replace w by 2w and b by 2b, which of the
following is likely to be observed?

(a) No change in h_{w,b}(x) = g(w^T x + b).
(b) Will result in reducing the functional margin by half.
(c) Change in geometric margin.
(d) None of the above.

Sol. (a)

7. Consider the following optimization problem:

min x^2 + 1
s.t. (x − 2)(x − 4) ≤ 0

Select the correct options regarding this optimization problem.

(a) Strong duality holds.
(b) Strong duality doesn't hold.
(c) The Lagrangian can be written as L(x, λ) = (1 + λ)x^2 − 6λx + 1 + 8λ
(d) The dual objective will be g(λ) = −9λ^2/(1 + λ) + 1 + 8λ

Sol. (a), (c)

L(x, λ) = (x^2 + 1) + λ(x^2 − 6x + 8)
L(x, λ) = (1 + λ)x^2 − 6λx + 1 + 8λ
Hence, (c) is correct.
Now we will find the dual objective, g(λ).
∂L/∂x = 2x(1 + λ) − 6λ = 0
x̂ = 3λ/(1 + λ)
g(λ) = −9λ^2/(1 + λ) + 1 + 8λ if λ > −1, and −∞ otherwise.
Note that λ ≤ −1 would mean the function has a minimum instead of a maximum.
Now we want to maximize g, which occurs at λ = 2, giving x̂ = 2. The given optimization
problem is convex and there are points in the relative interior of the domain. Hence strong
duality holds.
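
The primal and dual optima can also be compared numerically with a simple grid search (a sketch; not a formal verification):

import numpy as np

# Primal: minimize x^2 + 1 over the feasible set (x - 2)(x - 4) <= 0, i.e. x in [2, 4]
x = np.linspace(2, 4, 2001)
primal_opt = np.min(x ** 2 + 1)

# Dual: maximize g(lambda) = -9*lambda^2/(1 + lambda) + 1 + 8*lambda over lambda >= 0
lam = np.linspace(0, 10, 10001)
dual_opt = np.max(-9 * lam ** 2 / (1 + lam) + 1 + 8 * lam)

print(primal_opt, dual_opt)  # both equal 5 (at x = 2 and lambda = 2), so the duality gap is zero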

8. Suppose you have trained an SVM which is not performing well, and hence you
have constructed more features from existing features for the model. Which of
the following statements may be true?

(a) We are lowering the bias.


(b) We are lowering the variance.
(c) We are increasing the bias.
(d) We are increasing the variance.

Sol. (a) and (d)

Assignment 3
Introduction to Machine Learning
Prof. B. Ravindran
1. Which of the following statement(s) about decision boundaries and discriminant functions of
classifiers is/are true? (One or more choices may be correct)

(a) In a binary classification problem, all points x on the decision boundary satisfy δ1 (x) =
δ2 (x).
(b) In a three-class classification problem, all points on the decision boundary satisfy δ1 (x) =
δ2 (x) = δ3 (x).
(c) In a three-class classification problem, all points on the decision boundary satisfy at least
one of δ1 (x) = δ2 (x), δ2 (x) = δ3 (x) or δ3 (x) = δ1 (x).
(d) Let the input space be Rn . If x does not lie on the decision boundary, there exists an
ϵ > 0 such that all inputs y satisfying ||y − x|| < ϵ belong to the same class.

Sol. (a), (c), (d)


At any point x on the decision boundary, argmaxi (δi (x)) gives two or more classes corre-
sponding to the highest value of the discriminant function. If a point does not lie on a decision
boundary, it lies in the interior of a region corresponding to one of the classes.
2. The following table gives the binary ground truth labels yi for four input points xi (not given).
We have a logistic regression model with some parameter values that computes the probability
p(xi ) that the label is 1. Compute the likelihood of observing the data given these model
parameters.

Y 1 1 0 1
p(xi ) 0.8 0.4 0.2 0.9

(a) 0.346
(b) 0.230
(c) 0.058
(d) 0.086

Sol. (b)
Apply the equation L = Π_{i=1}^{N} p(x_i)^{y_i} (1 − p(x_i))^{(1 − y_i)}. Here, L = 0.8 × 0.4 × (1 − 0.2) × 0.9 = 0.230.
3. Which of the following statement(s) about logistic regression is/are true?

(a) It learns a model for the probability distribution of the data points in each class.
(b) The output of a linear model is transformed to the range (0, 1) by a sigmoid function.
(c) The parameters are learned by optimizing the mean-squared loss.
(d) The loss function is optimized by using an iterative numerical algorithm.

Sol. (b), (d)


Unlike LDA, logistic regression does not learn a probability distribution for the data. The
optimal parameters are computed by maximizing the log-likelihood.

4. Consider a modified form of logistic regression given below, where k is a positive constant and
β0 and β1 are parameters.

log( (1 − p(x)) / (k p(x)) ) = β0 − β1 x

Then find p(x).

(a) e^{β0} / (k e^{β0} + e^{β1 x})
(b) e^{β1 x} / (e^{β0} + k e^{β1 x})
(c) e^{β1 x} / (k e^{β0} + e^{β1 x})
(d) e^{β1 x} / (k e^{β0} + e^{−β1 x})

Sol. (c)

Exponentiating the given equation,

(1 − p(x)) / (k p(x)) = e^{β0 − β1 x}
1 − p(x) = k p(x) e^{β0 − β1 x}
p(x)(1 + k e^{β0 − β1 x}) = 1
p(x) = 1 / (1 + k e^{β0 − β1 x})
p(x) = e^{β1 x} / (k e^{β0} + e^{β1 x})

5. Consider a Bayesian classifier for a 3-class classification problem. The following tables give the
class-conditioned density fk (x) for three classes k = 1, 2, 3 at some point x in the input space.

k 1 2 3
fk (x) 0.15 0.20 0.05

Note that πk denotes the prior probability of class k. Which of the following statement(s)
about the predicted label at x is/are true? (One or more choices may be correct.)

(a) If the three classes have equal priors, the prediction must be class 2.
(b) If π3 < π2 and π1 < π2 , the prediction may not necessarily be class 2.
(c) If π1 > 2π2 , the prediction could be class 1 or class 3.
(d) If π1 > π2 > π3 , the prediction must be class 1.

Sol. (a), (c)


For a Bayesian classifier, the prediction is given by argmaxk fk (x)πk .
In option (a) and option (b), fk (x)πk would be highest for class 2.
In option (c), π1 > 2π2 ⇒ f1 (x)π1 > f2 (x)π2 . So, the prediction is class 1 if f1 (x)π1 > f3 (x)π3
and class 3 otherwise.
Similarly, in option (d), the prediction could be either class 1 or class 2.

6. The following table gives the binary labels y^(i) for four points (x1^(i), x2^(i)), where i = 1, 2, 3, 4.
Among the given options, which set of parameter values β0, β1, β2 of a standard logistic
regression model p(x_i) = 1 / (1 + e^{−(β0 + β1 x1 + β2 x2)}) results in the highest likelihood for this data?

x1 x2 y
0.4 -0.2 1
0.6 -0.5 1
-0.3 0.8 0
-0.7 0.5 0
(a) β0 = 0.5, β1 = 1.0, β2 = 2.0
(b) β0 = −0.5, β1 = −1.0, β2 = 2.0
(c) β0 = 0.5, β1 = 1.0, β2 = −2.0
(d) β0 = −0.5, β1 = 1.0, β2 = 2.0

Sol. (c)
For each option, first compute the probabilities and then compute the log-likelihood using the
following equation. You can either do this manually or programmatically.

l(β0, β1, β2) = Σ_{i=1}^{N} log(1 − p(x_i)) + Σ_{i=1}^{N} y_i log( p(x_i) / (1 − p(x_i)) )
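
Done programmatically, the comparison looks roughly like the sketch below (data and candidate parameters copied from the question):

import numpy as np

X = np.array([[0.4, -0.2], [0.6, -0.5], [-0.3, 0.8], [-0.7, 0.5]])
y = np.array([1, 1, 0, 0])
options = {'a': (0.5, 1.0, 2.0), 'b': (-0.5, -1.0, 2.0), 'c': (0.5, 1.0, -2.0), 'd': (-0.5, 1.0, 2.0)}

for name, (b0, b1, b2) in options.items():
    p = 1 / (1 + np.exp(-(b0 + b1 * X[:, 0] + b2 * X[:, 1])))
    log_lik = np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))
    print(name, round(log_lik, 3))  # option (c) gives the highest log-likelihood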

7. Which of the following statement(s) about a two-class LDA model is/are true? (One or more
choices may be correct)

(a) It is assumed that the class-conditioned probability density of each class is a Gaussian.
(b) A different covariance matrix is estimated for each class.
(c) At a given point on the decision boundary, the class-conditioned probability densities
corresponding to both classes must be equal.
(d) At a given point on the decision boundary, the class-conditioned probability densities
corresponding to both classes may or may not be equal.

Sol. (a), (d)


Option (a) is true and (b) is false according to the assumptions of LDA.
The log ratio of the posterior probabilities of any two classes is

log( Pr(G = k|X = x) / Pr(G = l|X = x) ) = log( f_k(x) / f_l(x) ) + log( π_k / π_l )

The decision boundary is defined as the set of points at which the posterior probabilities of
both classes are equal, i.e.

log( f_k(x) / f_l(x) ) + log( π_k / π_l ) = 0
If πk = πl (i.e. equal number of samples from both the classes), we obtain fk (x) = fl (x).
Similarly, πk ̸= πl ⇒ fk (x) ̸= fl (x). Since the question does not mention whether there is an
equal number of samples from both classes, option (c) is incorrect, and option (d) is correct.

8. Consider the following two datasets and two LDA models trained respectively on these datasets.
Dataset A: 100 samples of class 0; 50 samples of class 1
Dataset B: 100 samples of class 0 (same as Dataset A); 100 samples of class 1 created by
repeating twice the class 1 samples from Dataset A
The classifier is defined as follows in terms of the decision boundary wT x + b = 0. Here, w is
called the slope and b is called the intercept.
y = 0 if w^T x + b < 0, and y = 1 if w^T x + b ≥ 0

Which of the given statement is true?

(a) The learned decision boundary will be the same for both models.
(b) The two models will have the same slope but different intercepts.
(c) The two models will have different slopes but the same intercept.
(d) The two models may have different slopes and different intercepts.

Sol. (b)

Consider the LDA decision boundary given by


x^T Σ^{−1}(µ_k − µ_l) + log(π_k / π_l) − (1/2)(µ_k + µ_l)^T Σ^{−1}(µ_k − µ_l) = 0

The first term corresponds to w^T x, while the second and third terms constitute the intercept.
From the construction of the two datasets, it is clear that the estimates of µ_0, µ_1 and Σ would
be the same for both datasets. The two decision boundaries would only differ in the term
log(π_0 / π_1), which depends on the sample sizes of each class.
9. Which of the following statement(s) about LDA is/are true? (One or more choices may be
correct)

(a) It minimizes the between-class variance relative to the within-class variance.


(b) It maximizes the between-class variance relative to the within-class variance.
(c) Maximizing the Fisher information results in the same direction of the separating hyper-
plane as the one obtained by equating the posterior probabilities of classes.
(d) Maximizing the Fisher information results in a different direction of the separating hy-
perplane from the one obtained by equating the posterior probabilities of classes.

Sol. (b), (c)


Please refer to the lecture.
10. Which of the following statement(s) regarding logistic regression and LDA is/are true for a
binary classification problem? (One or more choices may be correct)

(a) For any classification dataset, both algorithms learn the same decision boundary.

(b) Adding a few outliers to the dataset is likely to cause a larger change in the decision
boundary of LDA compared to that of logistic regression.
(c) Adding a few outliers to the dataset is likely to cause a similar change in the decision
boundaries of both classifiers.
(d) If the within-class distributions deviate significantly from the Gaussian distribution, lo-
gistic regression is likely to perform better than LDA.

Sol. (b), (d)


The decision boundaries learned by the two techniques are different because logistic regression
uses maximum likelihood estimation (MLE) while LDA performs Bayesian classification by
assuming the distribution of each class is a Gaussian.
Since LDA assumes that the underlying intra-class distributions are Gaussian, outliers have a
greater effect on LDA compared to logistic regression.
LDA will not perform well if the data does not satisfy the assumptions of the model. On the
other hand, logistic regression does not make any assumptions about the data distribution.
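The outlier claim can be illustrated with the following Python sketch (not part of the original solution; the Gaussian data and outlier positions are made up). It fits scikit-learn's LogisticRegression and LinearDiscriminantAnalysis with and without a few extreme, correctly labelled class-0 points and compares how much the coefficient vector moves; on data of this kind the LDA coefficients typically shift far more than the logistic-regression ones.

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(1)
# Synthetic, well-separated Gaussian classes
X = np.vstack([rng.normal([0.0, 0.0], 1.0, size=(200, 2)),
               rng.normal([4.0, 4.0], 1.0, size=(200, 2))])
y = np.r_[np.zeros(200), np.ones(200)]

# A few extreme (but correctly labelled) class-0 points far from both clusters
X_out = np.vstack([X, [[-30.0, -30.0], [-28.0, -32.0], [-35.0, -30.0]]])
y_out = np.r_[y, np.zeros(3)]

for name, clf in [("logistic", LogisticRegression(max_iter=1000)),
                  ("LDA", LinearDiscriminantAnalysis())]:
    w_clean = clf.fit(X, y).coef_.ravel().copy()
    w_noisy = clf.fit(X_out, y_out).coef_.ravel()
    change = np.linalg.norm(w_noisy - w_clean) / np.linalg.norm(w_clean)
    print(name, round(change, 3))   # LDA typically changes much more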

Assignment 2
Introduction to Machine Learning
Prof. B. Ravindran
1. The parameters obtained in linear regression
(a) can take any value in the real space
(b) are strictly integers
(c) always lie in the range [0,1]
(d) can take only non-zero values
Sol. (a)
2. Suppose that we have N independent variables (X1 , X2 , . . . , XN ) and the dependent variable
is Y . Now imagine that you are applying linear regression by fitting the best fit line using the
least square error on this data. You found that the correlation of X1 with Y is -0.005.

(a) Regressing Y on X1 mostly does not explain away Y .


(b) Regressing Y on X1 explains away Y .
(c) The given data is insufficient to determine if regressing Y on X1 explains away Y or not.
(d) None of the above.

Sol. (a)
The absolute value of the correlation coefficient denotes the strength of the linear relationship;
for simple linear regression, R2 equals the squared correlation, which here is only about 2.5 × 10−5 .
Since the absolute correlation is so small, regressing Y on X1 mostly does not explain away Y .
3. The relation between studying time (in hours) and grade on the final examination (0-100) in
a random sample of students in the Introduction to Machine Learning Class was found to be:
Grade = 30.5 + 15.2(h)
How will a student’s grade be affected if she studies for four hours, compared to not studying?

(a) It will go down by 30.4 points.


(b) It will go up by 60.8 points.
(c) The grade will remain unchanged.
(d) It cannot be determined from the information given

Sol. (b)
The slope of the regression line gives the average increase in grade for every hour increase in
studying. So, if studying is increased by four hours, the grade will increase by 4×(15.2) = 60.8.
4. Consider the following 4 training examples:

x y
-1 0.0319
0 0.8692
1 1.9566
2 3.0343


We want to learn a function f (x) = ax + b which is parametrized by (a, b). Using squared error
as the loss function, which of the following parameters would you use to model this function.
(a) (1,1)
(b) (1,2)
(c) (2,1)
(d) (2,2)
Sol. (a)
The line y = x + 1 is the one with minimum squared error out of all the four proposed.
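This can be verified with a quick Python sketch (illustrative only), computing the sum of squared errors of each candidate (a, b) on the four training points:

x = [-1, 0, 1, 2]
y = [0.0319, 0.8692, 1.9566, 3.0343]

for a, b in [(1, 1), (1, 2), (2, 1), (2, 2)]:
    sse = sum((yi - (a * xi + b)) ** 2 for xi, yi in zip(x, y))
    print((a, b), round(sse, 4))
# (1, 1) gives the smallest sum of squared errors (about 0.02).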
5. Consider a modified k-NN method in which once the k nearest neighbours to the query point
are identified, you do a linear regression fit on them and output the fitted value for the query
point. Which of the following is/are true regarding this method.
(a) This method makes an assumption that the data is locally linear.
(b) In order to perform well, this method would need dense distributed training data.
(c) This method has higher bias compared to k-NN
(d) This method has higher variance compared to k-NN
Sol. (a), (b), (d)
Since we are doing a linear fit in the k-neighborhood, we are making an assumption of local
linearity. Hence, (a) holds. The method would need dense distributed training data to perform
well, since in the case of the training data being sparse, the k-neighborhood would end up being
quite spread out (not really local anymore). Hence, the assumption of local linearity would
not give good results. Hence, (b) holds. The method has higher variance, since we now have
two parameters (slope and intercept) instead of one in the case of conventional k-NN. (In the
conventional case, we just try to fit a constant, and the average happens to be the constant
which minimizes the squared error).
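A minimal Python sketch of the method described in the question is given below; the function name and structure are illustrative, not a reference implementation. It finds the k nearest neighbours, fits a least-squares line on them, and returns the fitted value at the query point.

import numpy as np

def knn_local_linear_predict(X_train, y_train, x_query, k=5):
    # Hypothetical sketch of the modified k-NN: local linear fit on the k neighbours
    X_train, y_train = np.asarray(X_train, float), np.asarray(y_train, float)
    x_query = np.asarray(x_query, float)
    # k nearest neighbours by Euclidean distance
    dists = np.linalg.norm(X_train - x_query, axis=1)
    idx = np.argsort(dists)[:k]
    # Local linear fit: augment with a column of ones for the intercept
    A = np.hstack([np.ones((k, 1)), X_train[idx]])
    coef, *_ = np.linalg.lstsq(A, y_train[idx], rcond=None)
    return np.r_[1.0, x_query] @ coef

# e.g. on the 1-D data from question 4 (reshaped to a column):
X4 = np.array([[-1.0], [0.0], [1.0], [2.0]])
y4 = np.array([0.0319, 0.8692, 1.9566, 3.0343])
print(knn_local_linear_predict(X4, y4, [0.5], k=3))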
6. Which of the statements is/are True?
(a) Ridge has sparsity constraint, and it will drive coefficients with low values to 0.
(b) Lasso has a closed form solution for the optimization problem, but this is not the case
for Ridge.
(c) Ridge regression may reduce the number of variables.
(d) If there are two or more highly collinear variables, Lasso will select one of them randomly.
Sol. (c),(d)
Refer to the lecture
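An illustrative scikit-learn sketch of option (d) is given below (the data is synthetic and the regularisation strengths are arbitrary): with two nearly identical columns, Lasso typically zeroes one of them out, while Ridge spreads the weight across both.

import numpy as np
from sklearn.linear_model import Ridge, Lasso

rng = np.random.default_rng(0)
x = rng.normal(size=(100, 1))
# Columns 0 and 1 are highly collinear; columns 2-4 are irrelevant noise
X = np.hstack([x, x + 0.01 * rng.normal(size=(100, 1)), rng.normal(size=(100, 3))])
y = 3.0 * x[:, 0] + rng.normal(scale=0.5, size=100)

print("Lasso:", Lasso(alpha=0.1).fit(X, y).coef_[:2])  # typically one of the pair is exactly 0
print("Ridge:", Ridge(alpha=1.0).fit(X, y).coef_[:2])  # typically splits the weight across the pair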
7. Choose the correct option(s) from the following:
(a) When working with a small dataset, one should prefer low bias/high variance classifiers
over high bias/low variance classifiers.
(b) When working with a small dataset, one should prefer high bias/low variance classifiers
over low bias/high variance classifiers.
(c) When working with a large dataset, one should prefer high bias/low variance classifiers
over low bias/high variance classifiers.

(d) When working with a large dataset, one should prefer low bias/high variance classifiers
over high bias/low variance classifiers.
Sol. (b), (d)
On smaller datasets, variance is a concern since even small changes in the training set may
change the optimal parameters significantly. Hence, a high bias/low variance classifier would
be preferred. On the other hand, with a large dataset, since we have sufficient points to
represent the data distribution accurately, variance is not of much concern. Hence, one would
go for the classifier with low bias even though it has higher variance.
8. Consider the following statements:
Statement A: In Forward stepwise selection, in each step, that variable is chosen which has the
maximum correlation with the residual, then the residual is regressed on that variable, and it
is added to the predictor.
Statement B: In Forward stagewise selection, the variables are added one by one to the previ-
ously selected variables to produce the best fit till then

(a) Both the statements are True.


(b) Statement A is True, and Statement B is False
(c) Statement A if False and Statement B is True
(d) Both the statements are False.

Sol. (d)
Refer to the lecture

9. The linear regression model y = a0 +a1 x1 +a2 x2 +...+ap xp is to be fitted to a set of N training
data points having p attributes each. Let X be the N × (p + 1) matrix of input values (augmented
by 1’s), Y be the N × 1 vector of target values, and θ be the (p + 1) × 1 vector of parameter values
(a0 , a1 , a2 , ..., ap ). If the sum of squared errors is minimized for obtaining the optimal regression
model, which of the following equations holds?

(a) X T X = XY
(b) Xθ = X T Y
(c) X T Xθ = Y
(d) X T Xθ = X T Y

Sol. (d)
This comes from minimizing the sum of the least squares.
RSS(θ) = (Y − Xθ)T (Y − Xθ) (in matrix form)
Taking the derivative with respect to θ and equating it to 0 gives
X T (Y − Xθ) = 0
So, X T Xθ = X T Y, i.e. θ = (X T X)−1 X T Y.
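A quick numeric sanity check (illustrative, with random data) that solving the normal equations gives the same θ as a generic least-squares solver:

import numpy as np

rng = np.random.default_rng(0)
N, p = 50, 4
X = np.hstack([np.ones((N, 1)), rng.normal(size=(N, p))])  # inputs augmented with 1's
y = rng.normal(size=N)

theta = np.linalg.solve(X.T @ X, X.T @ y)          # solve X^T X theta = X^T y
theta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.allclose(theta, theta_lstsq))              # True: both minimize the RSS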

Assignment 1
Introduction to Machine Learning
Prof. B. Ravindran
1. Which of the following is/are unsupervised learning problem(s)? (Multiple options may be
correct)
(a) Grouping documents into different categories based on their topics
(b) Forecasting the hourly temperature in a city based on historical temperature patterns
(c) Identifying close-knit communities of people in a social network
(d) Training an autonomous agent to drive a vehicle
(e) Identifying different species of animals from images
Sol. (a), (c)
The tasks wherein we are provided explicit training labels are called supervised learning prob-
lems (options (b), (e)). In unsupervised learning, we need to find patterns from data even
in the absence of labels (options (a), (c)). In reinforcement learning, we learn a policy using
reward signals from the environment (option (d)).
2. Which of the following statement(s) about Reinforcement Learning (RL) is/are true? (Multiple
options may be correct)
(a) While learning a policy, the goal is to maximize the long-term reward.
(b) During training, the agent is explicitly provided the most optimal action to be taken in
each state.
(c) The state of the environment changes based on the action taken by the agent.
(d) RL is used for building agents to play chess.
(e) RL is used for predicting the prices of apartments from their features.
Sol. (a), (c), (d)
Refer to the lecture on RL for explanations of options (a) to (d). Option (e) is a supervised
learning task.
3. Which of the following is/are classification tasks(s)? (Multiple options may be correct)

(a) Predicting whether an email is spam or not spam


(b) Predicting the number of COVID cases over a given period
(c) Predicting the score of a cricket team
(d) Identifying the language of a text document

Sol. (a), (d)


Options (a), (d) are classification tasks as they predict categorical variables. Options (b), (c)
are regression tasks because they predict numerical variables.
4. Which of the following is/are regression task(s)? (Multiple options may be correct)
(a) Predicting whether or not a customer will repay a loan based on their credit history

(b) Forecasting the amount of rainfall in a given place
(c) Identifying the types of crops from aerial images of farms
(d) Predicting the future price of a stock
Sol. (b), (d)
Options (a), (c) are classification tasks as they predict categorical variables. Options (b), (d)
are regression tasks because they predict numerical variables.

5. Consider the following dataset. Fit a linear regression model of the form y = β0 + β1 x1 + β2 x2
using the mean-squared error loss. Using this model, the predicted value of y at the point
(x1 , x2 ) = (0.5, −1.0) is
x1 x2 y
-1.0 -0.5 -1.947
-0.5 0.0 -0.391
1.0 2.0 6.047
2.0 0.5 4.527
2.5 -1.5 1.287
0.0 -2.0 -3.451

(a) −0.651
(b) −0.737
(c) 0.245
(d) −0.872

Sol. (b)
We can compute the model parameters using the equation β = (X T X)−1 X T y where y is the
third column of the above table and X is a 6 x 3 matrix obtained by concatenating a 6 x 1
vector of ones with the 6 x 2 matrix containing X1 and X2. We get β = [0.498, 1.517, 1.993]
i.e. β0 = 0.498, β1 = 1.517, β2 = 1.993.
Substituting (x1 , x2 ) = (0.5, −1.0) and the model parameters into the RHS of the regression
equation (y = β0 + β1 x1 + β2 x2 ), we get (b) −0.737.
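The computation can be reproduced with a short numpy sketch (illustrative only; the values are taken from the table above):

import numpy as np

X = np.array([[-1.0, -0.5], [-0.5, 0.0], [1.0, 2.0],
              [2.0, 0.5], [2.5, -1.5], [0.0, -2.0]])
y = np.array([-1.947, -0.391, 6.047, 4.527, 1.287, -3.451])

Xa = np.hstack([np.ones((6, 1)), X])            # augment with a column of ones
beta = np.linalg.solve(Xa.T @ Xa, Xa.T @ y)     # should be close to [0.498, 1.517, 1.993]
print(beta)
print(np.array([1.0, 0.5, -1.0]) @ beta)        # approximately -0.737, i.e. option (b)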
6. Consider the following dataset. Using a k-nearest neighbour (k-NN) regression model with
k = 3, predict the value of y at (x1 , x2 ) = (0.5, −1.0). Use the Euclidean distance to find the
nearest neighbours.
x1 x2 y
-1.0 -0.5 -1.947
-0.5 0.0 -0.391
1.0 2.0 6.047
2.0 0.5 4.527
2.5 -1.5 1.287
0.0 -2.0 -3.451

(a) −1.762
(b) −2.061
(c) −1.930

(d) −1.529

Sol. (c)
The 3 nearest neighbours of the point (0.5, −1.0) are (−1.0, −0.5), (−0.5, 0) and (0.0, −2.0).
Taking the mean of the y values at these points, we get (c) −1.930.
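A short numpy sketch reproducing this computation (illustrative only):

import numpy as np

X = np.array([[-1.0, -0.5], [-0.5, 0.0], [1.0, 2.0],
              [2.0, 0.5], [2.5, -1.5], [0.0, -2.0]])
y = np.array([-1.947, -0.391, 6.047, 4.527, 1.287, -3.451])
query = np.array([0.5, -1.0])

dists = np.linalg.norm(X - query, axis=1)       # Euclidean distances to the query
nearest = np.argsort(dists)[:3]                 # indices of the 3 nearest neighbours
print(X[nearest])                               # (0, -2), (-0.5, 0), (-1, -0.5)
print(y[nearest].mean())                        # approximately -1.930, i.e. option (c)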
7. Consider the following statements regarding linear regression and k-NN regression models.
Select the true statements. (Multiple options may be correct)

(a) A linear regressor requires the training data points during inference.
(b) A k-NN regressor requires the training data points during inference.
(c) A k-NN regressor with a higher value of k is less prone to overfitting.
(d) A linear regressor partitions the input space into multiple regions such that the prediction
over a given region is constant.

Sol. (b), (c)


Option (a) is false because a linear regressor only uses the learned parameters during inference.
Option (b) is true because a k-NN regressor requires the training points during inference for
finding the nearest neighbours.
Option (c) is true as explained in the lecture on Bias and Variance.
Option (d) is false because the property described is that of a k-NN regressor and not of a
linear regressor.
8. Consider a binary classification problem where we are given certain measurements from a blood
test and need to predict whether the patient does not have a particular disease (class 0) or has
the disease (class 1). In this problem, false negatives (incorrectly predicting that the patient is
healthy) have more serious consequences as compared to false positives (incorrectly predicting
that the patient has the disease). Which of the following is an appropriate cost matrix for this
classification problem? The row denotes the true class and the column denotes the predicted
class.
 
(a)  [  0  100 ]
     [  0    0 ]
(b)  [  0  100 ]
     [  1    0 ]
(c)  [  0    1 ]
     [  1    0 ]
(d)  [  0    1 ]
     [100    0 ]
(e)  [  0    0 ]
     [100    0 ]

Sol. (d)
From the details given in the question, a false negative should be penalized much more than
a false positive. However, a false positive should also receive a small but positive penalty.
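As an illustrative sketch (the labels below are hypothetical), the chosen matrix can be used to score a set of predictions by indexing it with (true, predicted) pairs, so each false negative costs 100 while each false positive costs 1:

import numpy as np

C = np.array([[0,   1],
              [100, 0]])              # rows = true class, columns = predicted class

y_true = np.array([0, 1, 1, 0, 1])    # hypothetical labels
y_pred = np.array([0, 0, 1, 1, 1])    # one false negative, one false positive

total_cost = C[y_true, y_pred].sum()
print(total_cost)                      # 100 (false negative) + 1 (false positive) = 101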

9. Consider the following dataset with three classes: 0, 1 and 2. x1 and x2 are the independent
variables whereas y is the class label. Using a k-NN classifier with k = 3, predict the class label
at the point (x1 , x2 ) = (0.7, −0.8). Use the Euclidean distance to find the nearest neighbours.
x1 x2 y
-0.5 0.0 0
2.0 0.5 0
-1.0 -0.5 1
0.0 -2.0 1
2.5 -1.5 2
1.0 2.0 2

(a) 0
(b) 1
(c) 2
(d) Cannot be predicted

Sol. (b)
The 3 nearest neighbours of the point (0.7, −0.8) are (−1.0, −0.5), (−0.5, 0.0) and (0.0, −2.0)
having class labels 1, 0 and 1 respectively. Taking the majority of these labels, we get class 1
as the predicted label.
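This can be checked with scikit-learn's KNeighborsClassifier (illustrative sketch; Euclidean distance is the default metric):

import numpy as np
from sklearn.neighbors import KNeighborsClassifier

X = np.array([[-0.5, 0.0], [2.0, 0.5], [-1.0, -0.5],
              [0.0, -2.0], [2.5, -1.5], [1.0, 2.0]])
y = np.array([0, 0, 1, 1, 2, 2])

clf = KNeighborsClassifier(n_neighbors=3).fit(X, y)   # 3-NN with majority voting
print(clf.predict([[0.7, -0.8]]))                      # [1], i.e. option (b)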

10. Suppose that we train two kinds of regression models corresponding to the following equations.

• (i) y = β0 + β1 x1 + β2 x2
• (ii) y = β0 + β1 x1 + β2 x2 + β3 x1 x2

Which of the following statement(s) is/are correct? (Multiple options may be correct)

(a) On a given training dataset, the mean-squared error of (i) is always greater than or equal
to that of (ii).
(b) (i) is likely to have a higher variance than (ii).
(c) (ii) is likely to have a higher variance than (i).
(d) If (ii) overfits the data, then (i) will definitely overfit.
(e) If (ii) underfits the data, then (i) will definitely underfit.

Sol. (a), (c), (e)


Model (ii) is more complex than model (i). So, (ii) will have a lower mean-squared error and
higher variance than (i). If a complex model overfits the data, a simpler model may or may
not overfit. However, if a complex model underfits the data, a simpler model will definitely
underfit.
