
Assignment 10

1. When estimating the parameter µ for a Bernoulli distribution using Maximum Likelihood Estimation (MLE), what is the formula?

(A) µ_ML = N/m
(B) µ_ML = m/N
(C) µ_ML = 1/N
(D) µ_ML = m²/N

Solution (B) µ_ML = m/N


Solution
Let the dataset consist of N independent observations D = {x_1, x_2, ..., x_N} drawn from a Bernoulli distribution with parameter µ. Each x_i is either 0 or 1. The probability mass function is P(x_i | µ) = µ^{x_i} (1 − µ)^{1 − x_i}.
The likelihood function L(µ) is the probability of observing the dataset given µ:
L(µ) = P(D | µ) = ∏_{i=1}^{N} P(x_i | µ) = ∏_{i=1}^{N} µ^{x_i} (1 − µ)^{1 − x_i}

Let m = ∑_{i=1}^{N} x_i be the number of successes (observations equal to 1). Then the number of failures (observations equal to 0) is N − m. The likelihood function can be written as:

L(µ) = µ^m (1 − µ)^{N − m}

To find the Maximum Likelihood Estimate (MLE) µ_ML, we maximize L(µ) with respect to µ. It is often easier to maximize the log-likelihood function, ln L(µ):

ln L(µ) = ln(µ^m (1 − µ)^{N − m}) = m ln µ + (N − m) ln(1 − µ)

We take the derivative with respect to µ and set it to zero:


d ln L(µ)/dµ = m/µ − (N − m)/(1 − µ)

Setting the derivative to zero to find the maximum:

m/µ_ML − (N − m)/(1 − µ_ML) = 0

m/µ_ML = (N − m)/(1 − µ_ML)

m(1 − µ_ML) = µ_ML (N − m)
m − m·µ_ML = N·µ_ML − m·µ_ML
m = N·µ_ML
µ_ML = m/N

This corresponds to the sample mean: the number of successes (m) divided by the total number of trials (N).
Therefore, the correct formula is (B) µ_ML = m/N.
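
As a quick numerical check (an illustrative sketch, not part of the original assignment), the snippet below estimates µ from simulated Bernoulli data; the true parameter value, sample size, and variable names are arbitrary choices for the example.

import numpy as np

# Minimal numerical check of the Bernoulli MLE (illustrative sketch).
rng = np.random.default_rng(0)
true_mu = 0.3                              # assumed value used only for the simulation
N = 10_000
x = rng.binomial(n=1, p=true_mu, size=N)   # N Bernoulli(µ) observations

m = x.sum()       # number of successes
mu_ml = m / N     # MLE: the sample mean

print(f"m = {m}, N = {N}, mu_ML = {mu_ml:.4f} (true µ = {true_mu})")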

2. According to the Central Limit Theorem, what happens to the distribution of the sum
of N independent and identically distributed random variables as N grows?

(A) It becomes increasingly uniform
(B) It becomes increasingly Bernoulli
(C) It becomes increasingly Gaussian
(D) It becomes increasingly Poisson

Solution (C) It becomes increasingly Gaussian


Solution
The Central Limit Theorem (CLT) is a fundamental concept in probability theory.
It states that, under certain conditions, the distribution of the sum (or average) of
a large number of independent and identically distributed (i.i.d.) random variables
approaches a Gaussian (Normal) distribution, regardless of the original distribution of
the individual variables.
Let X_1, X_2, ..., X_N be N i.i.d. random variables with finite mean µ and finite non-zero variance σ². Let S_N = X_1 + X_2 + · · · + X_N be their sum.
The Central Limit Theorem states that as N → ∞, the distribution of the standardized sum

Z_N = (S_N − Nµ) / (σ√N)

converges to a standard Normal distribution (a Gaussian distribution with mean 0 and variance 1).
Equivalently, the distribution of the sum S_N itself becomes increasingly well approximated by a Gaussian distribution with mean Nµ and variance Nσ². Similarly, the distribution of the sample mean X̄_N = S_N / N becomes increasingly well approximated by a Gaussian distribution with mean µ and variance σ²/N.
Therefore, as N grows, the distribution of the sum of N i.i.d. random variables becomes
increasingly Gaussian.
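
As an illustrative sketch (not from the original document), the simulation below standardizes sums of i.i.d. Uniform(0, 1) variables and tracks their skewness and excess kurtosis, both of which approach the Gaussian value of 0 as N grows; the sample sizes and the seed are arbitrary.

import numpy as np

# Sums of i.i.d. non-Gaussian variables look increasingly Gaussian (illustrative).
rng = np.random.default_rng(1)
mu, sigma2 = 0.5, 1.0 / 12.0                 # mean and variance of Uniform(0, 1)

for N in (1, 5, 50, 500):
    # 20,000 independent sums of N i.i.d. Uniform(0, 1) variables
    S = rng.uniform(0.0, 1.0, size=(20_000, N)).sum(axis=1)
    Z = (S - N * mu) / np.sqrt(N * sigma2)   # standardized sum Z_N
    skew = np.mean(Z**3)                     # 0 for a Gaussian
    excess_kurt = np.mean(Z**4) - 3.0        # 0 for a Gaussian
    print(f"N={N:4d}: skewness={skew:+.3f}, excess kurtosis={excess_kurt:+.3f}")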

3. When the covariance matrix of a multivariate Gaussian distribution is proportional to the identity matrix (Σ = σ²I), what shape will the contours of constant probability density have?

(A) Ellipses aligned with the axes
(B) Circles
(C) Tilted ellipses
(D) Rectangles

Solution (B) Circles


Solution
The probability density function (PDF) of a D-dimensional multivariate Gaussian
distribution is given by:
 
p(x | µ, Σ) = (1 / ((2π)^{D/2} |Σ|^{1/2})) exp( −(1/2) (x − µ)^T Σ⁻¹ (x − µ) )
where x is the variable vector, µ is the mean vector, and Σ is the covariance matrix.
Contours of constant probability density correspond to points x where the exponent
term is constant:
(x − µ)^T Σ⁻¹ (x − µ) = constant
This expression defines the shape of the contours.
We are given that the covariance matrix is proportional to the identity matrix, Σ = σ²I, where σ² is a positive constant and I is the identity matrix.
The inverse of this covariance matrix is:

Σ⁻¹ = (σ²I)⁻¹ = (1/σ²) I⁻¹ = (1/σ²) I

Substituting this into the exponent expression:

(x − µ)^T ((1/σ²) I) (x − µ) = (1/σ²) (x − µ)^T I (x − µ)

Since I is the identity matrix, I(x − µ) = (x − µ), so the expression becomes:

(1/σ²) (x − µ)^T (x − µ) = (1/σ²) ||x − µ||²

where ||x − µ||² is the squared Euclidean distance between x and the mean µ.
Therefore, the condition for constant probability density becomes:

(1/σ²) ||x − µ||² = constant
||x − µ||² = σ² × constant = constant′

Let R² = constant′. The equation is:

||x − µ||² = R²

This is the equation of a hypersphere centered at µ with radius R. In two dimensions, this corresponds to a circle.
The correct answer is (B) Circles.
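
The reduction of the quadratic form to a scaled Euclidean distance can be verified numerically; the sketch below is illustrative only, with arbitrary values chosen for µ, σ², and the test point.

import numpy as np

# With Σ = σ²I, the quadratic form equals ||x − µ||² / σ²,
# so equal-density contours are circles (hyperspheres in general).
sigma2 = 2.0
mu = np.array([1.0, -1.0])
Sigma_inv = np.linalg.inv(sigma2 * np.eye(2))

x = np.array([3.0, 0.5])
d = x - mu
quad_form = d @ Sigma_inv @ d     # (x − µ)^T Σ⁻¹ (x − µ)
scaled_euclid = d @ d / sigma2    # ||x − µ||² / σ²

print(quad_form, scaled_euclid)   # the two values coincide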

4. When using a Naive Bayes classifier for prediction, what rule is used to assign a class
to a new observation?

(A) Minimum Error Rule
(B) Maximum a Posteriori (MAP) Rule
(C) Minimum Distance Rule
(D) Maximum Likelihood Rule

Solution (B) Maximum a Posteriori (MAP) Rule

5. Given a dataset with features X and classes C, the Naive Bayes formula P(C|X) ∝ P(C) ∏_i P(X_i|C) is derived by applying which of the following?

(A) Central Limit Theorem and independence assumption
(B) Bayes’ theorem and conditional independence assumption
(C) Maximum Likelihood Estimation and uniform prior
(D) Chain rule and joint probability assumption

Solution (B) Bayes’ theorem and conditional independence assumption


Solution
The goal of the Naive Bayes classifier is to find the probability of a class C given a set of features X = (X_1, X_2, ..., X_n). This is the posterior probability P(C|X).
We start by applying Bayes’ Theorem:

P(C|X) = P(X|C) P(C) / P(X)
where:

⇒ P(C|X) is the posterior probability of the class given the features.
⇒ P(X|C) is the likelihood of observing the features given the class.
⇒ P(C) is the prior probability of the class.
⇒ P(X) is the evidence, the probability of observing the features.

For classification, we want to find the class C that maximizes P (C|X). Since P (X)
is the same for all classes given a specific input X, it acts as a normalization constant
and can be ignored when comparing posterior probabilities. Thus, we can write:

P(C|X) ∝ P(X|C) P(C)

The term P(X|C) represents the joint probability of all features given the class, P(X_1, X_2, ..., X_n | C). Calculating this directly can be complex and requires a large amount of data.
The "naive" part of Naive Bayes comes from the assumption that the features X_i are conditionally independent given the class C. This means:

P(X_1, X_2, ..., X_n | C) = P(X_1|C) P(X_2|C) · · · P(X_n|C) = ∏_{i=1}^{n} P(X_i|C)

This is the crucial conditional independence assumption.

Substituting this assumption back into the proportionality derived from Bayes’ Theorem:

P(C|X) ∝ P(C) ∏_{i=1}^{n} P(X_i|C)

This is the formula given in the question.


Therefore, the derivation involves two key steps:

(a) Applying Bayes’ theorem to relate the posterior probability P(C|X) to the likelihood P(X|C) and the prior P(C).
(b) Applying the conditional independence assumption to simplify the likelihood term P(X|C) into a product of individual feature likelihoods P(X_i|C).
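
The MAP prediction rule from question 4 and the factorized posterior from this question can be combined in a few lines of code. The sketch below is illustrative only: it assumes binary features and class-conditional probability tables that have already been estimated, and all the numbers and class names are made up.

import numpy as np

# Naive Bayes MAP prediction with pre-estimated probabilities (illustrative sketch).
log_prior = {"spam": np.log(0.4), "ham": np.log(0.6)}        # log P(C)
log_lik = {                                                  # log P(X_i = v | C)
    "spam": [np.log([0.2, 0.8]), np.log([0.3, 0.7])],
    "ham":  [np.log([0.9, 0.1]), np.log([0.6, 0.4])],
}

def predict(x):
    """Return argmax_C [ log P(C) + sum_i log P(x_i | C) ]  (MAP rule)."""
    scores = {
        c: log_prior[c] + sum(log_lik[c][i][v] for i, v in enumerate(x))
        for c in log_prior
    }
    return max(scores, key=scores.get)

print(predict([1, 1]))   # features favouring "spam"
print(predict([0, 0]))   # features favouring "ham"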

6. What is the probabilistic interpretation of the least squares cost function?

(A) It maximizes the posterior probability of the parameters
(B) It minimizes the Bayesian error rate
(C) It maximizes the likelihood of observing the data given the parameters
(D) It minimizes the prediction variance

Solution (C) It maximizes the likelihood of observing the data given the parameters
Solution
The least squares cost function aims to minimize the sum of squared errors between predicted values f(x_i, θ) and actual values y_i: J(θ) = ∑_i (y_i − f(x_i, θ))².
To find a probabilistic interpretation, we model the target variable as y_i = f(x_i, θ) + ϵ_i, where the noise ϵ_i is assumed to be independent and identically distributed (i.i.d.) Gaussian noise with zero mean and variance σ², i.e., ϵ_i ∼ N(0, σ²).
Under this assumption, the probability of observing y_i given x_i and parameters θ is:

p(y_i | x_i, θ) = (1/√(2πσ²)) exp( −(y_i − f(x_i, θ))² / (2σ²) )

The likelihood of observing the entire dataset is the product of these probabilities (due to the independence assumption):

L(θ) = ∏_{i=1}^{N} p(y_i | x_i, θ)

The log-likelihood is:

ln L(θ) = ∑_{i=1}^{N} ln p(y_i | x_i, θ) = −(N/2) ln(2πσ²) − (1/(2σ²)) ∑_{i=1}^{N} (y_i − f(x_i, θ))²

Maximizing the log-likelihood ln L(θ) with respect to θ is equivalent to minimizing the sum of squared errors term ∑_i (y_i − f(x_i, θ))², because the other terms do not depend on θ and 1/(2σ²) is a positive constant.

Thus, minimizing the least squares cost function is equivalent to performing Maximum
Likelihood Estimation (MLE) for the parameters θ under the assumption of i.i.d.
Gaussian noise.
Therefore, the probabilistic interpretation is that least squares (C) maximizes the
likelihood of observing the data given the parameters.
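
The equivalence can also be seen numerically: for a simple linear model with an assumed noise variance, the θ that minimizes the sum of squared errors also minimizes the Gaussian negative log-likelihood. The sketch below is illustrative; the model, data, and grid of θ values are invented for the example.

import numpy as np

# Least squares and Gaussian maximum likelihood select the same parameter.
rng = np.random.default_rng(2)
x = rng.normal(size=200)
y = 1.7 * x + rng.normal(scale=0.5, size=200)   # y = θ·x + Gaussian noise
sigma2 = 0.25                                   # assumed noise variance

thetas = np.linspace(0.0, 3.0, 301)
sse = np.array([np.sum((y - t * x) ** 2) for t in thetas])            # ∑ (y_i − f(x_i, θ))²
nll = 0.5 * len(x) * np.log(2 * np.pi * sigma2) + sse / (2 * sigma2)  # −ln L(θ)

print("argmin SSE:", thetas[np.argmin(sse)])
print("argmin NLL:", thetas[np.argmin(nll)])    # same θ, as the derivation predicts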

7. What matrix factorization method is typically used for PCA?

(A) QR decomposition
(B) LU decomposition
(C) Singular Value Decomposition (SVD)
(D) Cholesky decomposition

Solution (C) Singular Value Decomposition (SVD)
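
As a brief illustration (not part of the original solution), the sketch below computes principal components by applying SVD to a centred data matrix; the random data are placeholders and the variable names are chosen for the example.

import numpy as np

# PCA via SVD of the centred data matrix (illustrative sketch).
rng = np.random.default_rng(3)
X = rng.normal(size=(100, 5))
Xc = X - X.mean(axis=0)                  # centre each feature

U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
components = Vt                           # principal directions (rows of Vt)
explained_var = S**2 / (len(Xc) - 1)      # eigenvalues of the sample covariance
scores = Xc @ Vt.T                        # data projected onto the components

print(explained_var)                      # variance captured by each component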

8. How is a decision boundary represented mathematically in a Support Vector Machine?

(A) W^T X + b = 0
(B) Y(W^T X + b) > 1
(C) W^T X + b > 0
(D) sign(W^T X + b)

Solution (A) W^T X + b = 0

9. What constraint is applied to ensure data points are correctly classified by an SVM?

(A) W^T X + b = 1 for positive class
(B) Y(W^T X + b) > 1
(C) |W| must be minimized
(D) W^T X + b must be maximized

Solution (B) Y(W^T X + b) > 1


Solution
A Support Vector Machine (SVM) aims to find an optimal hyperplane W^T X + b = 0 that separates data points belonging to different classes. For binary classification, we typically assign labels Y ∈ {+1, −1}.
The SVM not only separates the classes but also maximizes the margin between the hyperplane and the nearest data points (the support vectors) from either class. The margin boundaries are defined by the hyperplanes W^T X + b = 1 and W^T X + b = −1.
To ensure that a data point (X_i, Y_i) is correctly classified *and* lies on the correct side of the margin (or exactly on the margin boundary for support vectors), the following conditions must hold:

⇒ If Y_i = +1, we require W^T X_i + b ≥ 1.
⇒ If Y_i = −1, we require W^T X_i + b ≤ −1.

These two conditions can be compactly written as a single constraint:

Y_i (W^T X_i + b) ≥ 1

This inequality must hold for all data points i in a hard-margin SVM. It ensures that points are correctly classified (Y_i and (W^T X_i + b) have the same sign) and that they lie at least as far from the separating hyperplane as the margin boundary (a distance of 1/||W||).
Let’s examine the options:

(A) W^T X + b = 1 for positive class: This only defines the positive margin boundary; it is not the general constraint for all points.
(B) Y(W^T X + b) > 1: This captures the combined constraint. The standard formulation uses ≥ 1 to include points on the margin (the support vectors); using > 1 would require points to lie strictly outside the margin boundary. Nevertheless, it is the closest representation of the core SVM classification constraint among the options: it ensures correct classification (Y(W^T X + b) > 0) and enforces the margin concept.
(C) |W| must be minimized: Minimizing (1/2)||W||² (equivalent to minimizing ||W||) is the *objective* of the SVM optimization problem (it maximizes the margin 2/||W||), subject to the classification constraints. It is not the constraint itself.
(D) W^T X + b must be maximized: This is not related to the SVM formulation.

Therefore, the constraint that ensures data points are correctly classified while respecting the margin is best represented by option (B) Y(W^T X + b) > 1 (or, more precisely, ≥ 1).
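
A small numerical check of the constraint (illustrative only; the hyperplane and the two data points are made up for this sketch):

import numpy as np

# Checking the hard-margin constraint Y_i (W^T X_i + b) ≥ 1 for a given hyperplane.
W = np.array([1.0, -1.0])
b = -0.5

X = np.array([[3.0, 0.0],    # positive-class point
              [0.0, 3.0]])   # negative-class point
Y = np.array([+1, -1])

margins = Y * (X @ W + b)    # Y_i (W^T X_i + b)
print(margins)               # [2.5  3.5]
print(np.all(margins >= 1))  # True: both points satisfy the constraint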

10. How would the methodology change if maximum likelihood estimation assumes the errors follow a distribution other than the Gaussian?

(A) The least squares cost function would still be optimal
(B) A different cost function would be derived
(C) The algorithm would need to be completely redesigned
(D) There would be no mathematical solution

Solution (B) A different cost function would be derived


Solution
Maximum Likelihood Estimation (MLE) seeks parameters θ that maximize the probability (likelihood) L(θ) of observing the data, given an assumed probability distribution for the data-generation process (often focused on the error term ϵ_i in regression, where y_i = f(x_i, θ) + ϵ_i).
Minimizing the least squares cost function ∑_i (y_i − f(x_i, θ))² is equivalent to maximizing the likelihood L(θ) specifically when the errors ϵ_i are assumed to be i.i.d. Gaussian N(0, σ²). This is because the Gaussian PDF involves the term exp(−(y_i − f(x_i, θ))² / (2σ²)), and maximizing the product of these terms (the likelihood) corresponds to minimizing the sum of the squared errors in the exponent.
If we assume a different distribution for the errors (e.g., Laplace or Student's t), the probability density function p(ϵ_i) changes. This leads to a different mathematical form for the likelihood function L(θ) = ∏_i p(y_i | x_i, θ).
Consequently, maximizing this new likelihood function (or its logarithm) will generally correspond to minimizing a different cost function. For example, assuming Laplace errors leads to minimizing the sum of absolute errors ∑_i |y_i − f(x_i, θ)|.
Therefore, changing the assumed error distribution results in (B): a different cost function would be derived from the principle of maximum likelihood.
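
The link between the assumed noise distribution and the resulting cost function can be made concrete. The sketch below is illustrative: it writes the Gaussian and Laplace negative log-likelihoods up to additive constants, so the squared-error and absolute-error costs fall out directly; the residual values and scale parameters are arbitrary.

import numpy as np

# The assumed error distribution determines the MLE cost function (illustrative sketch).
def gaussian_nll(residuals, sigma=1.0):
    # Up to additive constants: sum of SQUARED errors / (2σ²)  -> least squares
    return np.sum(residuals**2) / (2 * sigma**2)

def laplace_nll(residuals, b=1.0):
    # Up to additive constants: sum of ABSOLUTE errors / b  -> least absolute deviations
    return np.sum(np.abs(residuals)) / b

r = np.array([0.5, -1.0, 2.0])   # example residuals y_i − f(x_i, θ)
print(gaussian_nll(r))           # 2.625
print(laplace_nll(r))            # 3.5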
