Assignment 10 solution
1. When estimating the parameter $\mu$ for a Bernoulli distribution using Maximum Likelihood Estimation (MLE), what is the formula?
(A) $\mu_{ML} = N/m$
(B) $\mu_{ML} = m/N$
(C) $\mu_{ML} = 1/N$
(D) $\mu_{ML} = m^2/N$
Let $m = \sum_{i=1}^{N} x_i$ be the number of successes (observations equal to 1). Then the number of failures (observations equal to 0) is $N - m$. The likelihood function can be written as:
$$L(\mu) = \mu^m (1 - \mu)^{N - m}$$
To find the Maximum Likelihood Estimate (MLE) $\mu_{ML}$, we maximize $L(\mu)$ with respect to $\mu$. It is often easier to maximize the log-likelihood function,
$$\ln L(\mu) = m \ln \mu + (N - m) \ln(1 - \mu).$$
Setting its derivative with respect to $\mu$ to zero, $\frac{m}{\mu_{ML}} - \frac{N - m}{1 - \mu_{ML}} = 0$, gives:
$$m(1 - \mu_{ML}) = \mu_{ML}(N - m)$$
$$m - m\mu_{ML} = N\mu_{ML} - m\mu_{ML}$$
$$m = N\mu_{ML}$$
$$\mu_{ML} = \frac{m}{N}$$
This corresponds to the sample mean, which is the number of successes ($m$) divided by the total number of trials ($N$).
Therefore, the correct formula is (B) $\mu_{ML} = m/N$.
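As a quick numerical check, here is a minimal NumPy sketch (the sample data and the helper name `bernoulli_mle` are illustrative, not part of the assignment); the estimator is simply the sample mean of the 0/1 observations:

```python
import numpy as np

def bernoulli_mle(x):
    """MLE of the Bernoulli parameter: successes m divided by trials N."""
    x = np.asarray(x)
    return x.sum() / x.size  # mu_ML = m / N

# Example: N = 10 trials with m = 7 successes gives mu_ML = 0.7.
x = np.array([1, 1, 0, 1, 1, 0, 1, 1, 0, 1])
print(bernoulli_mle(x))  # 0.7
```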
2. According to the Central Limit Theorem, what happens to the distribution of the sum of $N$ independent and identically distributed random variables as $N$ grows?
Solution: As $N$ grows, the distribution of the (suitably normalized) sum approaches a Gaussian (normal) distribution, regardless of the distribution of the individual variables.
(C) Tilted ellipses
(D) Rectangles
$\|x - \mu\|^2 = R^2$, which is the equation of a circle of radius $R$ centered at $\mu$.
4. When using a Naive Bayes classifier for prediction, what rule is used to assign a class to a new observation?
(A) Minimum Error Rule
(B) Maximum a Posteriori (MAP) Rule
(C) Minimum Distance Rule
(D) Maximum Likelihood Rule
Solution (B) Maximum a Posteriori (MAP) Rule: the classifier assigns the class $C$ that maximizes the posterior probability $P(C|X)$, as derived in the next question.
5. Given a dataset with features $X$ and classes $C$, the Naive Bayes formula $P(C|X) \propto P(C) \prod_i P(X_i|C)$ is derived by applying which of the following?
For classification, we want to find the class $C$ that maximizes $P(C|X)$. By Bayes' theorem, $P(C|X) = \frac{P(X|C)\,P(C)}{P(X)}$. Since $P(X)$ is the same for all classes given a specific input $X$, it acts as a normalization constant and can be ignored when comparing posterior probabilities. Thus, we can write:
$$P(C|X) \propto P(C)\,P(X|C)$$
The term $P(X|C)$ represents the joint probability of all features given the class, $P(X_1, X_2, \ldots, X_n \mid C)$. Calculating this directly can be complex and requires a large amount of data.
The "naive" part of Naive Bayes comes from the assumption that the features $X_i$ are conditionally independent given the class $C$. This means:
$$P(X_1, X_2, \ldots, X_n \mid C) = P(X_1|C)\,P(X_2|C) \cdots P(X_n|C) = \prod_{i=1}^{n} P(X_i|C)$$
Substituting this assumption back into the proportionality derived from Bayes' theorem:
$$P(C|X) \propto P(C) \prod_{i=1}^{n} P(X_i|C)$$
The derivation therefore involves two steps, illustrated by the sketch below:
(a) applying Bayes' theorem to relate the posterior probability $P(C|X)$ to the likelihood $P(X|C)$ and prior $P(C)$;
(b) applying the conditional independence assumption to simplify the likelihood term $P(X|C)$ into a product of individual feature likelihoods $P(X_i|C)$.
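Translating the final proportionality into a prediction rule gives the following minimal sketch (the priors and per-feature likelihood tables for two binary features are hypothetical; log-probabilities are used so the product becomes a numerically stable sum):

```python
import numpy as np

def naive_bayes_predict(x, priors, likelihoods):
    """Return the class c maximizing log P(c) + sum_i log P(x_i | c) (MAP rule)."""
    scores = {}
    for c, prior in priors.items():
        log_post = np.log(prior)
        for i, xi in enumerate(x):
            log_post += np.log(likelihoods[c][i][xi])  # log P(X_i = x_i | C = c)
        scores[c] = log_post
    return max(scores, key=scores.get)

# Hypothetical two-class problem with two binary features.
priors = {"spam": 0.4, "ham": 0.6}
likelihoods = {
    "spam": [{0: 0.2, 1: 0.8}, {0: 0.3, 1: 0.7}],  # P(X_i = value | spam)
    "ham":  [{0: 0.9, 1: 0.1}, {0: 0.8, 1: 0.2}],  # P(X_i = value | ham)
}
print(naive_bayes_predict((1, 1), priors, likelihoods))  # -> "spam"
```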
6. What is the probabilistic interpretation of minimizing the least squares cost function?
Solution (C) It maximizes the likelihood of observing the data given the parameters.
The least squares cost function aims to minimize the sum of squared errors between predicted values $f(x_i, \theta)$ and actual values $y_i$: $J(\theta) = \sum_i (y_i - f(x_i, \theta))^2$.
To find a probabilistic interpretation, we model the target variable as $y_i = f(x_i, \theta) + \epsilon_i$, where the noise $\epsilon_i$ is assumed to be independent and identically distributed (i.i.d.) Gaussian noise with zero mean and variance $\sigma^2$, i.e., $\epsilon_i \sim \mathcal{N}(0, \sigma^2)$.
Under this assumption, the probability of observing $y_i$ given $x_i$ and parameters $\theta$ is:
$$p(y_i \mid x_i, \theta) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left(-\frac{(y_i - f(x_i, \theta))^2}{2\sigma^2}\right)$$
The likelihood of observing the entire dataset is the product of these probabilities (due to the independence assumption):
$$L(\theta) = \prod_{i=1}^{N} p(y_i \mid x_i, \theta)$$
Maximizing the log-likelihood $\ln L(\theta)$ with respect to $\theta$ is equivalent to minimizing the sum of squared errors $\sum_i (y_i - f(x_i, \theta))^2$, because the other terms do not depend on $\theta$ and $1/(2\sigma^2)$ is a positive constant.
Thus, minimizing the least squares cost function is equivalent to performing Maximum Likelihood Estimation (MLE) for the parameters $\theta$ under the assumption of i.i.d. Gaussian noise.
Therefore, the probabilistic interpretation is that least squares (C) maximizes the likelihood of observing the data given the parameters.
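This equivalence can be verified numerically. The sketch below (synthetic data, a simple linear model $f(x, \theta) = \theta x$, and a grid search over $\theta$; all of these choices are illustrative) shows that the same $\theta$ minimizes the squared-error cost and maximizes the Gaussian log-likelihood:

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 50)
theta_true, sigma = 2.0, 0.3
y = theta_true * x + rng.normal(0.0, sigma, size=x.size)  # y_i = f(x_i, theta) + eps_i

thetas = np.linspace(0.0, 4.0, 401)
sse = np.array([np.sum((y - t * x) ** 2) for t in thetas])  # least squares cost J(theta)
loglik = np.array([                                          # Gaussian log-likelihood ln L(theta)
    np.sum(-0.5 * np.log(2 * np.pi * sigma**2) - (y - t * x) ** 2 / (2 * sigma**2))
    for t in thetas
])

# The grid point minimizing the squared error also maximizes the log-likelihood.
print(thetas[np.argmin(sse)], thetas[np.argmax(loglik)])
```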
(A) QR decomposition
(B) LU decomposition
(C) Singular Value Decomposition (SVD)
(D) Cholesky decomposition
(A) $W^T X + b = 0$
(B) $Y(W^T X + b) > 1$
(C) $W^T X + b > 0$
(D) $\operatorname{sign}(W^T X + b)$
Solution (A) $W^T X + b = 0$: this equation defines the separating hyperplane (decision boundary).
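To make the roles of these expressions concrete, here is a minimal sketch (the weight vector and bias values are arbitrary examples): points with $W^T X + b = 0$ lie on the separating hyperplane, while the predicted class comes from $\operatorname{sign}(W^T X + b)$.

```python
import numpy as np

W = np.array([1.0, -2.0])  # example weight vector (illustrative values)
b = 0.5                    # example bias

def decision_value(x):
    """W^T x + b: equals zero exactly on the separating hyperplane."""
    return W @ x + b

def predict(x):
    """Class label (+1 or -1) from the sign of the decision value."""
    return np.sign(decision_value(x))

x_new = np.array([3.0, 1.0])
print(decision_value(x_new), predict(x_new))  # 1.5 1.0 -> positive side of the hyperplane
```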
9. What constraint is applied to ensure data points are correctly classified by an SVM?
For labels $Y_i \in \{+1, -1\}$:
⇒ If $Y_i = +1$, we require $W^T X_i + b \ge 1$.
⇒ If $Y_i = -1$, we require $W^T X_i + b \le -1$.
These two conditions can be compactly written as a single constraint:
$$Y_i(W^T X_i + b) \ge 1$$
This inequality must hold for all data points $i$ in a hard-margin SVM. It ensures that points are correctly classified ($Y_i$ and $W^T X_i + b$ have the same sign) and that they are at least a distance corresponding to the margin boundary from the separating hyperplane.
Let’s examine the options:
(A) $W^T X + b = 1$ for the positive class: this only defines the positive margin boundary; it is not the general constraint for all points.
(B) $Y(W^T X + b) > 1$: this captures the combined constraint. The standard formulation uses $\ge 1$ to include points on the margin (the support vectors); using $> 1$ implies points must lie strictly outside the margin boundary. However, this is the closest representation of the core SVM classification constraint among the options: it ensures correct classification ($Y(W^T X + b) > 0$) and enforces the margin concept.
(C) $\|W\|$ must be minimized: minimizing $\frac{1}{2}\|W\|^2$ (which is equivalent to minimizing $\|W\|$) is the *objective* of the SVM optimization problem (to maximize the margin $2/\|W\|$), subject to the classification constraints. It is not the constraint itself.
(D) $W^T X + b$ must be maximized: this is not directly related to the SVM formulation.
Therefore, the constraint that ensures data points are correctly classified while respecting the margin is best represented by option (B) $Y(W^T X + b) > 1$ (or, more precisely, $\ge 1$).
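As an illustration, the sketch below checks the constraint $Y_i(W^T X_i + b) \ge 1$ on a small hand-made separable dataset (the points, labels, and candidate $W$, $b$ are all made up for the example):

```python
import numpy as np

X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]])  # toy data points
Y = np.array([1, 1, -1, -1])                                        # labels in {+1, -1}
W = np.array([0.5, 0.5])                                            # candidate weight vector
b = 0.0                                                             # candidate bias

margins = Y * (X @ W + b)     # Y_i (W^T X_i + b) for every point
print(margins)                # [2. 3. 2. 2.]
print(np.all(margins >= 1))   # True: every point satisfies the hard-margin constraint
```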
10. How would the methodology change if the maximum likelihood estimation assumes
errors follow a different distribution than Gaussian?
Under the Gaussian error assumption, maximizing the product of Gaussian density terms (the likelihood) corresponds to minimizing the sum of the squared errors in the exponent.
If we assume a different distribution for the errors (e.g., Laplace, Student's t), the probability density function $p(\epsilon_i)$ changes. This leads to a different mathematical form for the likelihood function $L(\theta) = \prod_i p(y_i \mid x_i, \theta)$.
Consequently, maximizing this new likelihood function (or its logarithm) will generally correspond to minimizing a different cost function. For example, assuming Laplace errors leads to minimizing the sum of absolute errors $\sum_i |y_i - f(x_i, \theta)|$.
Therefore, changing the assumed error distribution results in (B): A different cost function would be derived from the principle of maximum likelihood.
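To illustrate the difference, the sketch below fits a constant model $f(x_i, \theta) = \theta$ to a small synthetic sample by grid search: the squared-error cost (Gaussian errors) is minimized near the sample mean, while the absolute-error cost (Laplace errors) is minimized near the sample median.

```python
import numpy as np

y = np.array([1.0, 1.2, 0.9, 1.1, 5.0])   # synthetic observations with one outlier

thetas = np.linspace(0.0, 5.0, 1001)
sse = np.array([np.sum((y - t) ** 2) for t in thetas])    # Gaussian errors -> squared loss
sae = np.array([np.sum(np.abs(y - t)) for t in thetas])   # Laplace errors -> absolute loss

print(thetas[np.argmin(sse)], np.mean(y))    # both ~1.84: the mean minimizes the squared loss
print(thetas[np.argmin(sae)], np.median(y))  # both ~1.1: the median minimizes the absolute loss
```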