
Birla Institute of Technology & Science - Pilani, Hyderabad Campus

First Semester 2023-24


CS F320 – Foundations of Data Science
Comprehensive Examination
Type: Closed Time: 180 mins Max Marks: 80 Date: 19.12.2023

All parts of the same question should be answered together.


1.a. We say that two random variables are pairwise independent if p(X2/X1) = p(X2) and hence p(X2, X1) =
p(X1)p(X2/X1) = p(X1)p(X2).
We say that n random variables are mutually independent if p(Xi/XS) = p(Xi) for all S ⊆ {1, . . . , n}\{i}
and hence p(X1, X2, . . ., Xn) = p(X1) p(X2) … p(Xn).
Prove or disprove: “Pairwise independence between all pairs of variables necessarily implies mutual independence.”
Note: The proof should be complete and correct if you are proving the statement; provide a counterexample
if you are disproving it. [8 Marks]

Sol: Suppose you are tossing two fair coins.


A = {First toss is head} = {HH, HT}
B = {Second toss is head} = {HH, TH}, and
C = {The outcomes are same} = {HH, TT}
P(A) = 1/2, P(B) = 1/2 and P(C) = 1/2.
P(A, B) = 1/4 = P(A)P(B).
P(C, B) = 1/4 = P(C)P(B).
P(A, C) = 1/4 = P(A)P(C).
P(A, B, C) = 1/4 ≠ 1/8 = P(A)P(B)P(C).
Hence A, B and C are pairwise independent but not mutually independent, so the statement is disproved.
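The counterexample can also be checked numerically; the short sketch below (not part of the original solution) enumerates the four equally likely outcomes and verifies that pairwise independence holds while the three-way product condition fails.

```python
# Enumerate the two-coin sample space and check pairwise vs. mutual independence.
from itertools import product

outcomes = list(product("HT", repeat=2))   # HH, HT, TH, TT, each with probability 1/4
p = 1 / len(outcomes)
prob = lambda S: len(S) * p

A = {o for o in outcomes if o[0] == "H"}   # first toss is head
B = {o for o in outcomes if o[1] == "H"}   # second toss is head
C = {o for o in outcomes if o[0] == o[1]}  # both tosses show the same face

# Pairwise independence holds: P(X ∩ Y) = P(X) P(Y) for every pair.
for X, Y in [(A, B), (B, C), (A, C)]:
    assert abs(prob(X & Y) - prob(X) * prob(Y)) < 1e-12

# Mutual independence fails: P(A ∩ B ∩ C) = 1/4 but P(A) P(B) P(C) = 1/8.
print(prob(A & B & C), prob(A) * prob(B) * prob(C))   # 0.25 0.125
```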

1.b. Find out the first principal component that emerges as part of PCA for the following data set. [6 Marks]
X1 X2
4 1
2 3
5 4
1 0
Sol:
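The printed solution is blank; the sketch below follows the usual PCA recipe (centre the data, then take the leading eigenvector of the sample covariance matrix) and is one way the answer could be obtained.

```python
# PCA for the four points of 1.b: centre, form the covariance, take the top eigenvector.
import numpy as np

X = np.array([[4, 1], [2, 3], [5, 4], [1, 0]], dtype=float)
Xc = X - X.mean(axis=0)              # the mean is (3, 2)

S = np.cov(Xc, rowvar=False)         # sample covariance: [[10/3, 2], [2, 10/3]]
eigvals, eigvecs = np.linalg.eigh(S) # eigenvalues returned in ascending order

pc1 = eigvecs[:, -1]                 # eigenvector of the largest eigenvalue
print(eigvals)                       # ≈ [1.33, 5.33]  (i.e. 4/3 and 16/3)
print(pc1)                           # ≈ [0.707, 0.707], i.e. (1, 1)/√2 up to sign
```

So the first principal component is the direction (1, 1)/√2 (up to sign), the axis along which the centred data has the largest variance, 16/3.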
1.c. You are given the following 2D dataset; draw the first and second principal components on the plot.

Note: You will have to reproduce the above figure approximately in your answer sheet and mark the first and
second principal components on it. No calculation is needed; approximately identifying the principal
components is the spirit of the question. [4 Marks]
Sol:

2. Consider the following joint distribution p(X, Y).

a. What is the joint entropy H(X,Y)?


Sol:

b. What are the marginal entropies H(X) and H(Y)?


Sol:
c. The entropy of X conditioned on a specific value of y is defined as
H(X/y) = −∑x p(x/y) log p(x/y).
Compute H(X/y) for each value of y.


Sol:
d. The conditional entropy is defined as
H(X/Y) = ∑y p(y) H(X/y).
Compute this.

Sol:
e. What is the mutual information between X and Y? [10 Marks]
Sol:
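The exam’s joint probability table is not reproduced above, so the sketch below uses a hypothetical 2 × 2 table purely to illustrate how parts (a)–(e) would be computed once the actual p(X, Y) values are substituted.

```python
# Entropies and mutual information from a joint probability table (hypothetical values).
import numpy as np

P = np.array([[1/8, 3/8],            # rows index x, columns index y; placeholder numbers
              [3/8, 1/8]])

def H(p):
    """Entropy in bits of a probability vector, ignoring zero entries."""
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

H_XY = H(P.ravel())                  # (a) joint entropy H(X, Y)
H_X = H(P.sum(axis=1))               # (b) marginal entropy H(X)
H_Y = H(P.sum(axis=0))               #     marginal entropy H(Y)

p_y = P.sum(axis=0)
H_X_given_y = [H(P[:, j] / p_y[j]) for j in range(P.shape[1])]       # (c) H(X/y) per y
H_X_given_Y = sum(p_y[j] * h for j, h in enumerate(H_X_given_y))     # (d) H(X/Y)

I_XY = H_X - H_X_given_Y             # (e) mutual information I(X; Y)
print(H_XY, H_X, H_Y, H_X_given_y, H_X_given_Y, I_XY)
```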

3.a. Suppose X is a discrete random variable taking ‘n’ values, say x1, x2, . . ., xn. What is the discrete
distribution that maximizes the entropy of the random variable? What is the discrete distribution that minimizes
the entropy of the random variable? [4 Marks]
Sol:
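A brief sketch of the standard answer (the printed solution is blank): the uniform distribution maximizes the entropy and a point mass minimizes it.
If p(xi) = 1/n for all i, then H(X) = −∑i (1/n) log(1/n) = log n, and by Jensen’s inequality (equivalently, the non-negativity of the KL divergence from the uniform distribution) no distribution on n values can exceed log n.
If p(xk) = 1 for some k and p(xi) = 0 otherwise (a degenerate, point-mass distribution), then H(X) = 0, which is the minimum possible value since H(X) ≥ 0 always.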

3.b. Prove that H(X,Y) = H(X) + H(Y/X). [4 Marks]


Sol:
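A sketch of the standard chain-rule derivation (the printed solution is blank):
H(X, Y) = −∑x,y p(x, y) log p(x, y)
= −∑x,y p(x, y) log [p(x) p(y/x)]
= −∑x,y p(x, y) log p(x) − ∑x,y p(x, y) log p(y/x)
= −∑x p(x) log p(x) − ∑x,y p(x, y) log p(y/x)
= H(X) + H(Y/X).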
3.c. Can you think of a situation in which identification numbers would be useful for prediction? [2 Marks]
Sol: Student IDs can be a good predictor of graduation date, since IDs are typically assigned sequentially by year of admission.

4.a. Formulate the k-nearest neighbour algorithm as a generative classifier, giving all the necessary mathematical
formulation. [6 Marks]
Sol: Refer to class notes
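Since the solution only points to the class notes, the following is a sketch of the standard generative (density-estimation) view of k-NN, as in Bishop’s treatment; the in-class formulation may differ in its details.
Given N training points with Nk points in class Ck, grow a region of volume V around a query point x until it contains exactly K training points, of which Kk belong to class Ck. Then
p(x/Ck) ≈ Kk / (Nk V),  p(Ck) = Nk / N,  p(x) ≈ K / (N V),
and by Bayes’ theorem the posterior is
p(Ck/x) = p(x/Ck) p(Ck) / p(x) = Kk / K.
Classification assigns x to the class with the largest Kk among its K nearest neighbours, i.e. a majority vote.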
4.b. Suppose there are 80 features in a data set with 25000 training examples and 10000 testing examples. With
Euclidean distance as the similarity metric, a 1-Nearest-Neighbour classifier is built and the misclassification rate
(on the 10000 test examples) is found to be 2.98%. You are asked to randomly permute the features
(columns of the training and test design matrices) and then apply the classifier. Do you think that the
misclassification rate (on the same set of 10000 test examples) of 2.98% changes with the new classifier? Justify
your answer with appropriate reasoning. An answer without the correct justification will be awarded no marks.
[6 Marks]
Note: A design matrix with 10000 examples having 80 features is of 10000 X 80 size. Every training example is
a row in the design matrix.
Sol:
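The key observation is that the Euclidean distance is a sum over coordinates, so permuting the columns identically in the training and test design matrices only reorders the terms: every pairwise distance, and therefore every nearest neighbour and the 2.98% misclassification rate, is unchanged. The toy sketch below (with made-up matrices, not the exam’s data) demonstrates this.

```python
# Permuting the feature columns consistently leaves all Euclidean distances unchanged.
import numpy as np

rng = np.random.default_rng(0)
train = rng.normal(size=(100, 80))   # stand-ins for the 25000 x 80 and 10000 x 80 matrices
test = rng.normal(size=(40, 80))

perm = rng.permutation(80)           # one random permutation of the 80 features

def dists(A, B):
    """Pairwise Euclidean distances between the rows of A and the rows of B."""
    return np.sqrt(((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1))

print(np.allclose(dists(test, train),
                  dists(test[:, perm], train[:, perm])))   # True
```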
5.a. In the probabilistic approach to linear regression, we assume that the target variate t follows a normal
distribution with mean equal to the predicted target (say, wᵀx, where x is a D-dimensional feature vector) and
variance s², where w and s² are the parameters. Assuming you are given a good estimate of w, say w′, show that
the Maximum Likelihood Estimate (MLE) of the variance s² is given by
(1/N) ∑ᵢ₌₁ᴺ (yᵢ − xᵢᵀ w′)². [8 Marks]
Sol:
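A sketch of the standard derivation (the printed solution is blank), writing yᵢ for the observed target of example xᵢ as in the statement above:
The likelihood is L(s²) = ∏ᵢ₌₁ᴺ N(yᵢ | xᵢᵀw′, s²), so the log-likelihood is
ln L(s²) = −(N/2) ln(2π s²) − (1/(2s²)) ∑ᵢ₌₁ᴺ (yᵢ − xᵢᵀw′)².
Setting the derivative with respect to s² to zero:
−N/(2s²) + (1/(2s⁴)) ∑ᵢ₌₁ᴺ (yᵢ − xᵢᵀw′)² = 0
⇒ s²_ML = (1/N) ∑ᵢ₌₁ᴺ (yᵢ − xᵢᵀw′)².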
5.b. Derive the dual formulation of the least-squares linear regression problem as discussed in class, and thus
justify how solving the problem in its dual form helps when the number of features is far larger than the
number of training examples. [6 Marks]
Sol: Refer to class notes
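Since the solution points to the class notes, here is a sketch of the usual dual derivation for regularized least squares (e.g. Bishop §6.1); the in-class version may differ slightly.
Minimize J(w) = (1/2) ∑ₙ (wᵀφ(xₙ) − tₙ)² + (λ/2) wᵀw. Setting ∇J = 0 gives w = Φᵀa with aₙ = −(1/λ)(wᵀφ(xₙ) − tₙ), i.e. the solution is a linear combination of the training feature vectors. Substituting w = Φᵀa and defining the N × N Gram matrix K = ΦΦᵀ gives the dual solution
a = (K + λ I_N)⁻¹ t,  with predictions y(x) = k(x)ᵀ (K + λ I_N)⁻¹ t, where kₙ(x) = φ(xₙ)ᵀφ(x).
The primal solution requires inverting the D × D matrix ΦᵀΦ + λ I_D, whereas the dual requires inverting an N × N matrix and only ever uses inner products between feature vectors. When the number of features D far exceeds the number of training examples N, the dual is therefore much cheaper, and it also allows the inner products to be replaced by a kernel function.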

5.c. Let pemp(x) be the empirical distribution and let q(x/θ) be some model. Show that argmin_θ KL(pemp || q(x/θ))
is obtained by q(x) = q(x/θ′), where θ′ is the MLE. [6 Marks]
Sol: Refer to class notes
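A sketch of the standard argument (the solution points to the class notes): with pemp(x) = (1/N) ∑ₙ δ(x − xₙ),
KL(pemp || q(x/θ)) = ∑x pemp(x) log pemp(x) − ∑x pemp(x) log q(x/θ)
= −H(pemp) − (1/N) ∑ₙ log q(xₙ/θ).
The first term does not depend on θ, so minimizing the KL divergence over θ is the same as maximizing the average log-likelihood (1/N) ∑ₙ log q(xₙ/θ), whose maximizer is the MLE θ′. Hence the minimizing model is q(x) = q(x/θ′).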

6.a. Do you agree with the statement “Ordering of attributes is an important activity in constructing parallel
coordinate plots that needs to be considered with utmost care”? Justify your answer with appropriate reasoning.
[2 Marks]
Sol: Yes, the ordering of attributes is very important: relationships can only be read off between adjacent axes, so a
poor ordering produces cluttered plots that hide correlations and convey little useful information.
6.b. Suppose a hospital tested the age and body-fat data for 18 randomly selected adults, with the following
result. Draw the boxplots for age and %fat. [4 Marks]
Sol:
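The table of the 18 (age, %fat) measurements is not reproduced above, so the arrays in the sketch below are placeholders only; substituting the actual values gives the required boxplots.

```python
# Boxplots for age and %fat (placeholder data; replace with the exam's 18 measurements).
import matplotlib.pyplot as plt

age = [25, 30, 35, 40, 45, 50, 55, 60]                     # placeholder values
fat = [10.0, 15.5, 20.0, 25.5, 28.0, 31.0, 34.5, 40.0]     # placeholder values

fig, axes = plt.subplots(1, 2, figsize=(8, 4))
axes[0].boxplot(age)
axes[0].set_title("age")
axes[1].boxplot(fat)
axes[1].set_title("%fat")
plt.tight_layout()
plt.show()
```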

6.c. Suppose a group of 12 sales price records is as follows: 3, 3, 102, 58, 7, 28, 9, 75, 122, 17, 98,
72. Partition them into three bins by each of the following methods. [4 Marks]
(a) equal-frequency partitioning
(b) equal-width partitioning
Sol:
Equal Frequency:
bin 1: 3,3,7,9
bin 2: 17, 28, 58, 72
bin 3: 75, 98, 102, 122

Equal Width:
Width = (122 − 3)/3 ≈ 39.67
bin 1: 3,3,7,9,17, 28
bin 2: 58, 72, 75
bin 3: 98, 102, 122
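The same partitioning can be reproduced programmatically; the sketch below sorts the 12 records, then forms three equal-frequency bins (4 values each) and three equal-width bins of width (122 − 3)/3 ≈ 39.67.

```python
# Equal-frequency and equal-width binning of the 12 sales price records.
prices = sorted([3, 3, 102, 58, 7, 28, 9, 75, 122, 17, 98, 72])

# Equal-frequency: 12 values / 3 bins = 4 values per bin.
freq_bins = [prices[i:i + 4] for i in range(0, 12, 4)]

# Equal-width: width = (max - min) / 3 ≈ 39.67, boundaries at 3, 42.67, 82.33, 122.
width = (prices[-1] - prices[0]) / 3
edges = [prices[0] + k * width for k in range(4)]
width_bins = [[p for p in prices if edges[k] <= p < edges[k + 1]] for k in range(2)]
width_bins.append([p for p in prices if p >= edges[2]])

print(freq_bins)   # [[3, 3, 7, 9], [17, 28, 58, 72], [75, 98, 102, 122]]
print(width_bins)  # [[3, 3, 7, 9, 17, 28], [58, 72, 75], [98, 102, 122]]
```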
