
MACHINE LEARNING
PROBABILISTIC PERSPECTIVE

VICTOR

Index

1. Hypothesis Space

2. Bayes Classifier

3. Linear Regression

4. Generalized Linear Regression

5. Non-parametric Density Estimation

6. Parzen Window Estimate

7. K-Nearest Neighbour (KNN)

8. Linear Discriminant Analysis (LDA)

9. Support Vector Machine (SVM) - Linearly Separable Data

10. Support Vector Machine (SVM) - Non-linearly Separable Data

11. SVM with Kernel

12. Neural Networks

13. Backpropagation

14. Decision Trees

15. Ensemble Learning

16. Bagging and Random Forest

17. Boosting

18. XGBoost

19. Principal Component Analysis (PCA)

20. K-means Clustering

21. Expectation Maximization (EM) Algorithm


22. Miscellaneous Machine Learning Terms
Hypothesis space (H)

Let the data be D = {(x_1, y_1), (x_2, y_2), …, (x_n, y_n)}

where y_i is the label of the feature vector x_i.

A hypothesis function h is a function that maps a feature vector x to a label y, i.e.

h : X → Y

The hypothesis space is the set of all possible hypothesis functions, denoted as

H = {h ∣ h : X → Y}

During the learning process, we try to find the best hypothesis function h from the
hypothesis space H that minimizes the error between the predicted label and the true
label.
Loss function (L)

Let the data be D = {(x_1, y_1), (x_2, y_2), …, (x_n, y_n)}

where y_i is the label of the feature vector x_i.

Let the hypothesis function be h : X → Y.

Let ŷ_i = h(x_i) be the prediction of this hypothesis function for the feature vector x_i, whereas the true label is y_i.

Then the loss function L(y_i, ŷ_i) is a function that measures the error between the predicted label and the true label.

Desirable properties

1. Non-negative: L(y_i, ŷ_i) ≥ 0

2. Zero if and only if the prediction is correct: L(y_i, ŷ_i) = 0 if and only if y_i = ŷ_i

3. Continuous and differentiable in y_i and ŷ_i, so that it can be optimized smoothly using gradient descent.

Examples of loss functions

0-1 loss function

L(y_i, ŷ_i) = 0 if y_i = ŷ_i, and 1 if y_i ≠ ŷ_i

Square loss function

L(y_i, ŷ_i) = (y_i − ŷ_i)^2

Cross-entropy loss function

The cross-entropy loss function is specifically used for classification tasks where y_i and ŷ_i represent probabilities. It is defined as:

L(y_i, ŷ_i) = −y_i log(ŷ_i) − (1 − y_i) log(1 − ŷ_i)
Risk function

Recall that the loss function L(y_i, ŷ_i) measures the error between the predicted label and the true label for a single data point.

The risk function R(h) is the expected loss under the data distribution. It is defined as:

R(h) = E_{(x,y)∼P}[L(y, h(x))]

where P is the true data distribution.

Conditional risk

The conditional risk R(h(x)∣x) is the expected loss for a given input x. It is defined as:

R(h(x)∣x) = E_{y∼P(y∣x)}[L(y, h(x))]

The total risk is then the expected value of the conditional risk:

R(h) = E_{x∼P(x)}[R(h(x)∣x)]

Empirical risk

We do not know the true data distribution P; we only have access to a dataset D sampled from P. Hence we approximate the risk function by the empirical risk:

R(h) ≈ (1/n) ∑_{i=1}^{n} L(y_i, h(x_i))

where {(x_i, y_i)}_{i=1}^{n} is the dataset.

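For concreteness, here is a minimal sketch (my own, assuming NumPy and a squared loss) of evaluating the empirical risk of a hypothesis on a dataset:

```python
import numpy as np

def empirical_risk(h, X, y, loss=lambda yi, yh: (yi - yh) ** 2):
    # average loss of hypothesis h over the observed dataset {(x_i, y_i)}
    return np.mean([loss(yi, h(xi)) for xi, yi in zip(X, y)])

# toy example: hypothesis h(x) = 2x on a small dataset
X = np.array([1.0, 2.0, 3.0])
y = np.array([2.1, 3.9, 6.2])
print(empirical_risk(lambda x: 2 * x, X, y))
```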
Learning problem

We are given a dataset D = {(x_i, y_i)}_{i=1}^{n} sampled from an unknown distribution P.

We want to find a hypothesis function h out of the complete hypothesis space H that
minimizes the risk function R(h).
Bayes classifier

Let the data be D = {(x_1, y_1), (x_2, y_2), …, (x_n, y_n)}

where y_i is the label of the feature vector x_i.

Let the hypothesis space be H = {h ∣ h : X → Y}.

The Bayes classifier is defined as:

h_B(x) = 1 if P(y = 1∣x) > P(y = 0∣x), and 0 if P(y = 1∣x) ≤ P(y = 0∣x)

Notations

P (y) is also called the prior probability of class y or class prior probability.

P (x∣y) is also called the likelihood of x given class y .

P (y∣x) is also called the posterior probability of class y given x.

P (x) is also called the evidence.

Bayes classifier is the best classifier for 0-1 loss function

To prove that the Bayes classifier is the best classifier for the 0-1 loss function, we need
to show that it minimizes the expected risk (error) for any given input x.

Let’s define the 0-1 loss function:

L(y, h(x)) = 0 if y = h(x), and 1 if y ≠ h(x)

The expected risk for a classifier h is:

R(h) = E[L(y, h(x))] = ∫ L(y, h(x)) P(x, y) dx dy

For a given x, the conditional risk is:

R(h∣x) = E[L(y, h(x)) ∣ x] = P(y = 0∣x) L(0, h(x)) + P(y = 1∣x) L(1, h(x))

Now, let's consider two cases:

1. If h(x) = 0:

R(h∣x) = P(y = 1∣x)

2. If h(x) = 1:

R(h∣x) = P(y = 0∣x)

The Bayes classifier chooses the class that minimizes this conditional risk:

h_B(x) = arg min_{y ∈ {0,1}} R(h∣x)

This means:

If P (y = 1∣x) > P (y = 0∣x), then hB (x) = 1


If P (y = 1∣x) ≤ P (y = 0∣x), then hB (x) = 0


This is exactly the definition of the Bayes classifier we started with.

Since the Bayes classifier minimizes the conditional risk for every x, it also minimizes the
overall expected risk R(h). Therefore, the Bayes classifier is the optimal classifier for
the 0-1 loss function.
KL Divergence (Kullback-Leibler Divergence)

KL divergence is a measure of how one probability distribution diverges from a second probability distribution.

Mathematically, for two probability distributions P(x) and Q(x), the KL divergence from Q to P is defined as:

D_KL(P ∥ Q) = ∑_x p(x) log(p(x)/q(x))   for discrete distributions

D_KL(P ∥ Q) = ∫ p(x) log(p(x)/q(x)) dx   for continuous distributions

where the sum/integral is over all possible events x, and p(x) and q(x) are the probability mass/density functions of the distributions P(x) and Q(x) respectively.

Intuition

KL divergence is a measure of how one probability distribution diverges from another. It is a measure of the information lost when Q is used to approximate P.

Properties

KL divergence is not symmetric: D_KL(P ∥ Q) ≠ D_KL(Q ∥ P)

KL divergence is always non-negative: D_KL(P ∥ Q) ≥ 0

KL divergence is 0 if and only if P and Q are the same distribution


Minimizing KL Divergence is Equivalent to Maximizing Likelihood

For a typical ML problem, all we have are samples from the true distribution P(x), i.e. data = {x_i}_{i=1}^{N}, where x_i ∈ R^d are data points. We do not know the distribution P(x) explicitly.

We try our best to estimate the true distribution P(x) by Q(x; θ), where θ are the parameters of the model.

We want to know how well our model Q(x; θ) is performing. We can do this by calculating the KL divergence between the true distribution P(x) and the estimated distribution Q(x; θ):

D_KL(P ∥ Q) = ∫ p(x) log(p(x)/q(x; θ)) dx

D_KL(P ∥ Q) = E_{x∼p(x)}[log(p(x)/q(x; θ))]

D_KL(P ∥ Q) = E_{x∼p(x)}[log p(x)] − E_{x∼p(x)}[log q(x; θ)]

We are trying to find the parameters θ that minimize the KL divergence between p(x) and q(x; θ):

θ* = arg min_θ D_KL(p ∥ q(x; θ))

θ* = arg min_θ E_{x∼p(x)}[log p(x)] − E_{x∼p(x)}[log q(x; θ)]

Because E_{x∼p(x)}[log p(x)] is constant with respect to θ, we can ignore it:

θ* = arg max_θ E_{x∼p(x)}[log q(x; θ)]

E_{x∼p(x)}[log q(x; θ)] is called the expected log likelihood.

By the law of large numbers, we can approximate the expected log likelihood by the average log likelihood of the data:

E_{x∼p(x)}[log q(x; θ)] ≈ (1/N) ∑_{i=1}^{N} log q(x_i; θ)

Therefore, our optimization problem becomes:

θ* = arg max_θ (1/N) ∑_{i=1}^{N} log q(x_i; θ)

This is equivalent to maximizing the log likelihood of the data. Hence θ* is also called the maximum likelihood estimate (MLE).
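As a sketch (my own example, assuming NumPy/SciPy), maximizing the average log likelihood of a Gaussian model q(x; θ) = N(x; μ, σ²) over observed samples recovers the usual closed-form MLE:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
data = rng.normal(loc=3.0, scale=2.0, size=1000)   # samples from the unknown P(x)

def neg_avg_log_likelihood(theta):
    mu, log_sigma = theta
    sigma = np.exp(log_sigma)                       # parameterize sigma > 0
    return -np.mean(-0.5 * np.log(2 * np.pi * sigma**2)
                    - (data - mu) ** 2 / (2 * sigma**2))

res = minimize(neg_avg_log_likelihood, x0=[0.0, 0.0])
mu_hat, sigma_hat = res.x[0], np.exp(res.x[1])
print(mu_hat, sigma_hat)          # close to (3, 2)
print(data.mean(), data.std())    # the closed-form MLE agrees
```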


Expectation Maximization (EM) algorithm

Let the data be D = {t_i}_{i=1}^{N}, where the t_i ∈ R^d are i.i.d. data points from p_t(t).

Latent variable models

Let each data point t_i be associated with a latent variable z_i that is not observed. z_i is a random variable that takes values in some finite set {1, …, K} and represents the membership of t_i in one of K clusters.

Hence the data can be represented as D = {(t_i, z_i)}_{i=1}^{N}, where t_i is the observed data, z_i is the latent variable, and the pairs (t_i, z_i) are i.i.d. from p_tz.

The marginal distribution of the observed data is given by:

p_t(t) = ∑_z p_tz(t, z)

Variational inference

We want to maximize the log-likelihood of the observed data:

ℓ(θ) = log ∑_z p^θ_tz(t, z)

Let q(z) be an arbitrary distribution over z.

ℓ(θ) = log ∑_z q(z) · p^θ_tz(t, z) / q(z)

ℓ(θ) = log E_{q(z)}[ p^θ_tz(t, z) / q(z) ]

By Jensen's inequality, we have:

log E_{q(z)}[ p^θ_tz(t, z) / q(z) ] ≥ E_{q(z)}[ log ( p^θ_tz(t, z) / q(z) ) ]
Hence, we have:

ℓ(θ) ≥ E_{q(z)}[ log ( p^θ_tz(t, z) / q(z) ) ]

ℓ(θ), the log-likelihood of the observed data, is also called the evidence.

E_{q(z)}[ log ( p^θ_tz(t, z) / q(z) ) ] is a lower bound on the log-likelihood and is called the evidence lower bound (ELBO).

Note that the ELBO is a function of q(z) and θ.

Optimizing ELBO

Instead of maximizing the evidence, we maximize the evidence lower bound (ELBO). We do
it by maximizing the ELBO with respect to q(z) and θ .

Making ELBO tight

To make the ELBO tight, we consider the difference between the evidence and the ELBO:

ℓ(θ) − ELBO(q, θ) = log p^θ_t(t) − E_{q(z)}[ log ( p^θ_tz(t, z) / q(z) ) ]

ℓ(θ) − ELBO(q, θ) = log p^θ_t(t) − E_{q(z)}[ log ( p^θ_{z∣t}(z∣t) p^θ_t(t) / q(z) ) ]

ℓ(θ) − ELBO(q, θ) = log p^θ_t(t) − E_{q(z)}[ log p^θ_{z∣t}(z∣t) + log p^θ_t(t) − log q(z) ]

ℓ(θ) − ELBO(q, θ) = log p^θ_t(t) − E_{q(z)}[ log p^θ_{z∣t}(z∣t) ] − log p^θ_t(t) + E_{q(z)}[ log q(z) ]

ℓ(θ) − ELBO(q, θ) = E_{q(z)}[ log q(z) − log p^θ_{z∣t}(z∣t) ]

ℓ(θ) − ELBO(q, θ) = E_{q(z)}[ log ( q(z) / p^θ_{z∣t}(z∣t) ) ]

ℓ(θ) − ELBO(q, θ) = KL( q(z) ∥ p^θ_{z∣t}(z∣t) )

The difference between the evidence and the ELBO is the Kullback-Leibler (KL) divergence between q(z) and p^θ_{z∣t}(z∣t). To make the ELBO tight, we need to minimize this KL divergence.

The KL divergence is always non-negative and equals zero if and only if the two distributions are identical. Therefore, to make the ELBO tight, we set:

q(z) = p^θ_{z∣t}(z∣t)

This choice of q(z) makes the ELBO equal to the evidence, achieving the tightest possible bound.

EM Algorithm

The Expectation-Maximization (EM) algorithm is an iterative method to find maximum


likelihood estimates of parameters in statistical models with latent variables. It consists of
two main steps:

1. E-step (Expectation): Compute the expected value of the log-likelihood function


with respect to the conditional distribution of z given t under the current estimate of
the parameters θ .

2. M-step (Maximization): Find the parameter that maximizes this expected log-
likelihood.

Formally, the EM algorithm can be described as follows:

1. Initialize θ (0)

2. Repeat until convergence:


E-step: Compute q^(t)(z) = p^{θ^(t−1)}_{z∣t}(z∣t)

M-step:

1. θ^(t) = arg max_θ E_{q^(t)(z)}[ log ( p^θ_tz(t, z) / q^(t)(z) ) ]

2. θ^(t) = arg max_θ E_{q^(t)(z)}[ log p^θ_tz(t, z) ] − E_{q^(t)(z)}[ log q^(t)(z) ]

3. θ^(t) = arg max_θ E_{q^(t)(z)}[ log p^θ_tz(t, z) ], since the second term is constant with respect to θ

The EM algorithm guarantees that the likelihood increases at each iteration and converges
to a local maximum.

EM algorithm for GMM

Let’s apply the EM algorithm to the Gaussian Mixture Model (GMM) we discussed earlier.
Recall that in a GMM, we have:

Observed data points: t = (t_1, ..., t_N)

Latent variables: z, where z_i ∈ {1, ..., m} indicates the Gaussian component

p_t(t) = ∑_{j=1}^{m} α_j N(t; μ_j, Σ_j)

α_j are the mixing coefficients, with ∑_{j=1}^{m} α_j = 1

μ_j are the mean vectors

Σ_j are the covariance matrices

p(t_i ∣ z_i = k) = N(t_i; μ_k, Σ_k)

Parameters: θ = (α_1, ..., α_m, μ_1, ..., μ_m, Σ_1, ..., Σ_m)

The EM algorithm for GMM proceeds as follows:

1. Initialization:

Choose initial values for the parameters θ = (α_1, ..., α_m, μ_1, ..., μ_m, Σ_1, ..., Σ_m).

2. E-step:

Compute the posterior probabilities (responsibilities) for each data point and each Gaussian component:

q^{k+1}(z_i = s) = p^{θ^k}_{z∣t}(z = s ∣ t = t_i) = α_s N(t_i; μ_s, Σ_s) / ∑_{j=1}^{m} α_j N(t_i; μ_j, Σ_j)

3. M-step:

Update the parameters by maximizing the ELBO with the responsibilities held fixed:

θ^{k+1} = arg max_θ ELBO(q^{k+1}, θ) = arg max_θ E_{q^{k+1}(z)}[ log ( α_z N(t; μ_z, Σ_z) ) ]

The EM algorithm for GMM alternates between these steps until convergence, effectively maximizing the likelihood of the observed data under the Gaussian mixture model.
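Below is a compact sketch (my own, one-dimensional for brevity, assuming NumPy/SciPy) of these E and M steps for a two-component GMM:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
t = np.concatenate([rng.normal(-2, 1, 300), rng.normal(3, 1.5, 700)])

# initialize mixing weights, means, and standard deviations
alpha, mu, sigma = np.array([0.5, 0.5]), np.array([-1.0, 1.0]), np.array([1.0, 1.0])

for _ in range(50):
    # E-step: responsibilities q(z_i = s) for each point and component
    dens = np.stack([a * norm.pdf(t, m, s) for a, m, s in zip(alpha, mu, sigma)])
    resp = dens / dens.sum(axis=0)
    # M-step: update parameters from the responsibility-weighted data
    Nk = resp.sum(axis=1)
    alpha = Nk / len(t)
    mu = (resp * t).sum(axis=1) / Nk
    sigma = np.sqrt((resp * (t - mu[:, None]) ** 2).sum(axis=1) / Nk)

print(alpha, mu, sigma)   # roughly recovers (0.3, 0.7), (-2, 3), (1, 1.5)
```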
Linear Discriminant Analysis (LDA) from a Bayesian perspective

Let the data be D = {(x_i, y_i)}_{i=1}^{N}, where x_i ∈ R^d and y_i ∈ {0, 1}.

Let's assume the parametric form of the conditional density p(x∣y) is:

p(x∣y = 1) ∼ N(x; μ_1, Σ)

p(x∣y = 0) ∼ N(x; μ_2, Σ)

where N (x; μ, Σ) denotes a multivariate Gaussian distribution with mean μ and covariance
matrix Σ. Note that we assume the covariance matrix Σ is the same for both classes, which is a
key assumption in LDA.

In simple words, we are assuming that the data is distributed as Gaussian in each class with
different means but shared covariance matrix.

Also, let's assume that the prior probabilities P(y = 1) and P(y = 0) are the same, i.e., 1/2.

Derivation

Let’s derive the decision boundary for LDA using the Bayes classifier.

1. Bayes Classifier:

The Bayes classifier is defined as:

h_B(x) = 1 if P(y = 1∣x) > P(y = 0∣x), and 0 otherwise.

2. Using Bayes’ Rule:

P (x∣y = k) ⋅ P (y = k)
P (y = k∣x) =
P (x)

3. Decision Rule:

Choose class 1 if:

P (x∣y = 1) ⋅ P (y = 1) > P (x∣y = 0) ⋅ P (y = 0)

4. Taking logarithms (monotonic transformation):


log(P (x∣y = 1)) + log(P (y = 1)) > log(P (x∣y = 0)) + log(P (y = 0))

5. Substituting Gaussian densities:

−(1/2)(x − μ_1)^T Σ^{−1} (x − μ_1) − (d/2) log(2π) − (1/2) log∣Σ∣ + log(P(y = 1)) >
−(1/2)(x − μ_2)^T Σ^{−1} (x − μ_2) − (d/2) log(2π) − (1/2) log∣Σ∣ + log(P(y = 0))

6. Simplifying:

−(1/2)(x − μ_1)^T Σ^{−1} (x − μ_1) + log(P(y = 1)) > −(1/2)(x − μ_2)^T Σ^{−1} (x − μ_2) + log(P(y = 0))

7. Expanding the quadratic terms:

−(1/2)(x^T Σ^{−1} x − 2 μ_1^T Σ^{−1} x + μ_1^T Σ^{−1} μ_1) + log(P(y = 1)) >
−(1/2)(x^T Σ^{−1} x − 2 μ_2^T Σ^{−1} x + μ_2^T Σ^{−1} μ_2) + log(P(y = 0))

8. Cancelling out common terms:

μ_1^T Σ^{−1} x − (1/2) μ_1^T Σ^{−1} μ_1 + log(P(y = 1)) > μ_2^T Σ^{−1} x − (1/2) μ_2^T Σ^{−1} μ_2 + log(P(y = 0))

9. Rearranging:

(μ_1 − μ_2)^T Σ^{−1} x > (1/2)(μ_1^T Σ^{−1} μ_1 − μ_2^T Σ^{−1} μ_2) + log(P(y = 0)/P(y = 1))

10. Final decision boundary:

h_B(x) = 1 if w^T x + w_0 > 0, and 0 otherwise

where:

w = Σ^{−1}(μ_1 − μ_2)

w_0 = −(1/2)(μ_1^T Σ^{−1} μ_1 − μ_2^T Σ^{−1} μ_2) − log(P(y = 0)/P(y = 1))

Note

The decision boundary is linear in x. This is the reason it is called Linear Discriminant Analysis.

The decision boundary w^T x + w_0 = 0 is a hyperplane.

The decision boundary will not be linear if the covariance matrices are not the same for both classes.
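A short sketch (my own, assuming NumPy) of estimating w and w_0 from data and classifying with the resulting linear rule:

```python
import numpy as np

def fit_lda(X0, X1):
    # class means (mu_2 for class 0, mu_1 for class 1) and a pooled covariance estimate
    mu2, mu1 = X0.mean(axis=0), X1.mean(axis=0)
    n0, n1 = len(X0), len(X1)
    Sigma = ((X0 - mu2).T @ (X0 - mu2) + (X1 - mu1).T @ (X1 - mu1)) / (n0 + n1 - 2)
    Sigma_inv = np.linalg.inv(Sigma)
    prior1, prior0 = n1 / (n0 + n1), n0 / (n0 + n1)
    w = Sigma_inv @ (mu1 - mu2)
    w0 = -0.5 * (mu1 @ Sigma_inv @ mu1 - mu2 @ Sigma_inv @ mu2) - np.log(prior0 / prior1)
    return w, w0

def lda_predict(x, w, w0):
    return int(w @ x + w0 > 0)   # 1 if on the positive side of the hyperplane
```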
Linear regression

Let the data be (x_i, y_i) for i = 1, 2, ..., n, where x_i ∈ R^d and y_i ∈ R.

Then we can model y_i as:

Y = β_0 + β_1 X + ε

Where:

Y ∈ R is the dependent variable

X ∈ R^d is the independent variable (or feature vector in higher dimensions)

β_0 is the y-intercept (bias term)

β_1 is the slope (or coefficient vector in higher dimensions)

ε is the error term

Ideal regressor

For mean squared error loss, the ideal regressor is defined as:

h*(x) = E[Y ∣ X = x]

where E[Y ∣ X = x] is the conditional expectation of Y given X = x.

Derivation of ideal regressor

To derive the ideal regressor for squared error loss, we need to find the function h(x) that minimizes the expected squared error:

h* = arg min_h E[(Y − h(X))^2]

Let's expand this expectation:

E[(Y − h(X))^2] = E[Y^2] − 2 E[Y h(X)] + E[h(X)^2]

To minimize this, we differentiate with respect to h(x) and set it to zero:

∂/∂h(x) E[(Y − h(X))^2] = −2 E[Y ∣ X = x] + 2 h(x) = 0

Solving for h*(x):

h*(x) = E[Y ∣ X = x]

This shows that the ideal regressor for squared error loss is indeed the conditional expectation of Y given X = x.

Now, let's derive this further using our linear model:

E[Y ∣ X = x] = E[β_0 + β_1 X + ε ∣ X = x]

Using the linearity of expectation:

E[β_0 + β_1 X + ε ∣ X = x] = E[β_0 ∣ X = x] + E[β_1 X ∣ X = x] + E[ε ∣ X = x]

Simplifying:

1. E[β_0 ∣ X = x] = β_0 (since β_0 is a constant)

2. E[β_1 X ∣ X = x] = β_1 x (since X is fixed at x)

3. E[ε ∣ X = x] = 0 (assuming the error term has zero mean and is independent of X)

Therefore:

h*(x) = β_0 + β_1 x

Interpretation

The ideal regressor h*(x) = β_0 + β_1 x minimizes the expected squared error. It represents the best possible prediction of Y given X = x under the squared error loss, assuming the linear model is correct. This function provides the average value of Y for each value of X, effectively capturing the underlying linear relationship between the variables while averaging out the random noise (represented by ε).
Empirical Risk Minimization

In practice, we don’t have access to the true distribution of the data, so we can’t directly
minimize the expected risk. Instead, we use the empirical risk as an approximation:

R̂(β) = (1/n) ∑_{i=1}^{n} (y_i − (β_0 + β_1 x_i))^2

where n is the number of samples in our dataset.

Note that in this formulation, we don't explicitly see the error term ε that was present in our original model Y = β_0 + β_1 X + ε. This is because the empirical risk is calculated using the observed y_i values, which already incorporate the random error.

To find the optimal parameters β*, we minimize this empirical risk:

β* = arg min_β R̂(β)

This optimization problem has a closed-form solution, which can be derived using linear algebra:

β* = (X^T X)^{−1} X^T Y

Here, X is the design matrix:

X = [ 1  x_11  x_12  ⋯  x_1d
      1  x_21  x_22  ⋯  x_2d
      ⋮   ⋮     ⋮    ⋱   ⋮
      1  x_n1  x_n2  ⋯  x_nd ]   (n × (d+1))

This matrix has n rows (one for each data point) and d + 1 columns. The first column is all 1s (for the intercept term), and the remaining d columns contain the feature values of our data points.

And Y is the vector of target values:

Y = [y_1, y_2, …, y_n]^T   (n × 1)

This is a column vector with n rows, containing the scalar y values of our data points.

The solution β* is a (d + 1) × 1 vector:

β* = [β*_0, β*_1, …, β*_d]^T

To retrieve the individual coefficients from β*:

1. β*_0 (the intercept) is the first element of β*.

2. The slope vector is [β*_1, …, β*_d]^T, i.e., the remaining d elements of β*.

This solution is known as the Ordinary Least Squares (OLS) estimator.
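A minimal sketch (my own, assuming NumPy) of the closed-form OLS fit; lstsq is used rather than an explicit inverse for numerical stability:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 3
X_feat = rng.normal(size=(n, d))
true_beta = np.array([1.5, -2.0, 0.5, 3.0])           # [intercept, slopes...]
y = true_beta[0] + X_feat @ true_beta[1:] + rng.normal(scale=0.1, size=n)

X = np.hstack([np.ones((n, 1)), X_feat])              # design matrix with a column of 1s
beta_star, *_ = np.linalg.lstsq(X, y, rcond=None)     # solves min ||X beta - y||^2
print(beta_star)                                       # close to true_beta
```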


Logistic regression, also known as logit regression (binary classification)

Let the data be D = {(x_1, y_1), (x_2, y_2), …, (x_n, y_n)}

where y_i ∈ {0, 1} is the binary label of the feature vector x_i.

Then the logistic regression model is defined as:

P(y_i = 1 ∣ x_i) = 1 / (1 + e^{−(w^T x_i + b)})

P(y_i = 0 ∣ x_i) = 1 / (1 + e^{w^T x_i + b})

where w is the parameter vector and b is the bias.

Note that

P(y_i = 1 ∣ x_i) + P(y_i = 0 ∣ x_i) = 1

Why define the model like this?

Because, consider:

p(y = 1∣x) = p(x∣y = 1) / (p(x∣y = 1) + p(x∣y = 0))

p(y = 1∣x) = 1 / (1 + p(x∣y = 0)/p(x∣y = 1))

If:

y_i ∈ {0, 1} and the priors P(y_i = 1) and P(y_i = 0) are equal, and

p(x_i ∣ y_i = 0) and p(x_i ∣ y_i = 1) follow Gaussian distributions with different means and equal covariance matrices,

then this becomes:

P(y_i = 1 ∣ x_i) = 1 / (1 + e^{−(w^T x_i + b)})

P(y_i = 0 ∣ x_i) = 1 / (1 + e^{w^T x_i + b})
How to find the parameters w and b?

For the binary classification problem, use logistic regression as hypothesis function and
cross-entropy loss function as the loss function and perform gradient descent to find
the parameters w and b.

Why not use hard labels (0 or 1)?

Using hard labels (0 or 1) directly in backpropagation can lead to several issues, hence
we use soft labels. Soft labels provide a continuous probability distribution over the
classes, allowing for smoother gradients and more stable training. This approach enables
the model to capture uncertainty and learn more nuanced decision boundaries
compared to hard binary classifications.
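A small sketch (my own, assuming NumPy) of fitting w and b by gradient descent on the cross-entropy loss:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, y, lr=0.1, steps=2000):
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(steps):
        p = sigmoid(X @ w + b)                  # P(y=1 | x) for every sample
        grad_w = X.T @ (p - y) / n              # gradient of the average cross-entropy
        grad_b = np.mean(p - y)
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# usage: X is (n, d), y is a 0/1 vector of length n
# w, b = fit_logistic(X, y); preds = (sigmoid(X @ w + b) > 0.5).astype(int)
```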
Softmax regression, also known as multiclass regression

Let the data be D = {(x_1, y_1), (x_2, y_2), …, (x_n, y_n)}

where y_i ∈ {1, 2, ..., K} is the multiclass label of the feature vector x_i, and K is the number of classes.

Then the softmax regression model is defined as:

P(y_i = k ∣ x_i) = e^{w_k^T x_i + b_k} / ∑_{j=1}^{K} e^{w_j^T x_i + b_j}

where w_k is the parameter vector for class k, and b_k is the bias term for class k.

Note that

∑_{k=1}^{K} P(y_i = k ∣ x_i) = 1

This ensures that the probabilities for all classes sum up to 1, providing a valid probability distribution over the K classes.

The reason for using softmax regression instead of hard labels, and the method to find the parameters w_k and b_k, is the same as for logistic regression.

How to find the parameters w_k and b_k?

For the multiclass classification problem, we use softmax regression as the hypothesis function and the cross-entropy loss function as the loss function. We then perform gradient descent to find the parameters w_k and b_k for each class k.
Why use softmax instead of hard labels?


Using softmax instead of hard labels (e.g., one-hot encoding) in multiclass classification
offers several advantages:

1. Smooth gradients: Softmax provides a continuous, differentiable output, allowing


for smoother gradients during backpropagation. This leads to more stable and
efficient training.

2. Probability interpretation: Softmax outputs can be interpreted as probabilities,


giving a measure of the model’s confidence in its predictions for each class.
Maximum a posteriori estimate (MAP)

The Maximum A Posteriori (MAP) estimate is defined as:

θ_MAP = arg max_θ p(θ∣V) = arg max_θ p(V∣θ) p(θ)

where

θ_MAP is the MAP estimate of the parameter θ

p(θ∣V) is the posterior probability of the parameter θ given the observed data V

p(V∣θ) is the likelihood of the observed data V given the parameter θ

p(θ) is the prior probability of the parameter θ

Note that:

Unlike MLE, MAP estimation incorporates prior knowledge about the parameter θ in addition to the observed data V, whereas MLE depends only on the observed data V.

MAP provides a balance between the likelihood of the data and the prior beliefs,
leading to more robust estimates, especially when the data is limited.

Conjugate prior

A conjugate prior is a type of prior distribution p(θ) such that when multiplied with the
likelihood function p(V ∣θ) the posterior distribution p(θ∣V ) that we get is in the same
form as the prior p(θ).

In simple terms, the posterior p(θ∣V) belongs to the same family of distributions as the prior p(θ).
Non-parametric density estimation

What is parametric density estimation?

In parametric density estimation, we assume that the data is generated from a known
distribution, such as the normal distribution, and we estimate the parameters of the
distribution using various methods like maximum likelihood estimation or risk
minimization.

What is non-parametric density estimation?

However, in non-parametric density estimation, we make no assumptions about the


form of the distribution and estimate the density directly from the data.

The basic idea behind non-parametric density estimation is to estimate the probability
density function (PDF) directly from the data without assuming a specific functional
form. One way to approach this is by considering the probability of a data point falling
within a certain region.

Let D = {x_1, x_2, ..., x_n} be our dataset. The probability of a data point falling within a region R can be estimated as:

P(data point in region R) = (number of data points in region R) / (total number of data points) = k/n

where k is the number of data points in region R, and n is the total number of data points.

Also,

P(data point in region R) ≈ p(x) · V

where p(x) is the probability density at a point x in R and V is the volume of the region R.

Combining the above two equations, we get:

k/n = p(x) · V

Rearranging this equation, we get:

p(x) = k / (nV)
This forms the basis for various non-parametric density estimation techniques.
Parzen window estimate, also known as the kernel density estimate

Basic idea

Recall that we had derived the following equation for non-parametric density estimation:

p(x) = k / (nV)

where:

k is the number of data points in region R

n is the total number of data points in the dataset D

V is the volume of the region R

In the Parzen window estimate, we fix the volume V and count k by using a window function.

Problem setting

Given a set of data points D = {x_1, x_2, ..., x_n}, we want to estimate the probability density function p(x) at a given point x, i.e., model the distribution of data points in the dataset.

Formulation using uniform kernel (rectangular kernel)

Let's define the formulation for the Parzen window estimate:

1. Define the volume V_n:

V_n = (h_n)^d

where:

h_n is the side length of the hypercube in R^d

d is the dimension of the data

V_n is the volume of the hypercube in R^d

2. Define the window function φ(u):

φ(u) = 1 if ∣u_j∣ ≤ 1/2 for j = 1, …, d, and 0 otherwise

where:

φ(u) returns 1 if the point u is within the unit hypercube centered at the origin, and 0 otherwise.

u_j is the j-th coordinate of the point u.

3. Then the window function centered at a data point x_i is:

φ((x − x_i)/h_n) = 1 if x is in the hypercube of side h_n centered at x_i, and 0 otherwise

4. Count k_n, the number of points falling in the hypercube of side h_n centered at the query point x:

k_n = ∑_{i=1}^{n} φ((x − x_i)/h_n)

5. The Parzen window estimate:

p(x) = k_n / (n V_n) = (1 / (n (h_n)^d)) ∑_{i=1}^{n} φ((x − x_i)/h_n)

Note that:

Here, h_n is a hyperparameter that controls the width of the window.

As n → ∞, if h_n → 0 and n h_n^d → ∞, then the estimate converges to the true density.
Parzen window with Gaussian kernel

1. Gaussian kernel:

φ(u) = (1 / (2π)^{d/2}) e^{−(1/2) ∥u∥^2}

2. The window function centered at a data point x_i is:

φ((x − x_i)/h_n) = (1 / (2π)^{d/2}) e^{−(1/2) ∥(x − x_i)/h_n∥^2}

3. The Parzen window estimate:

p(x) = k_n / (n V_n) = (1 / (n (h_n)^d)) ∑_{i=1}^{n} φ((x − x_i)/h_n)

Algorithm

1. Choose a value for h_n

2. For each test data point x, evaluate the window function φ((x − x_i)/h_n) at every training point x_i (for the uniform kernel this is a count of the training points falling in the hypercube of side h_n centered at x)

3. Calculate the Parzen window estimate p(x) from these values using the formula above
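A minimal sketch (my own, one-dimensional, assuming NumPy) of the Parzen window estimate with a Gaussian kernel:

```python
import numpy as np

def parzen_estimate(x, data, h):
    # Gaussian window: average of kernels centered at each training point
    u = (x - data) / h
    phi = np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)
    return phi.mean() / h          # p(x) = (1/(n h^d)) sum phi((x - x_i)/h), with d = 1

data = np.random.default_rng(0).normal(size=500)
for x in [-2.0, 0.0, 2.0]:
    print(x, parzen_estimate(x, data, h=0.3))   # roughly follows the N(0, 1) density
```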
K Nearest neighbour (KNN)

Basic idea

Recall that we had derived the following equation for non-parametric density estimation:

p(x) = k / (nV)

where:

k is the number of data points in region R

n is the total number of data points in the dataset D

V is the volume of the region R

In the K-Nearest Neighbour (KNN) method, we instead fix k and let the volume V grow until the region around x contains k data points.

Problem setting

Given a set of data points D = {(x_1, y_1), (x_2, y_2), ..., (x_n, y_n)}, where x_i ∈ R^d is the feature vector and y_i ∈ {1, 2, ..., C} is the class label.

Formulation

The posterior probability of class c given input x_i can be estimated as:

p(y = c ∣ x_i) = p(x_i, y = c) / (p(x_i, y = 1) + p(x_i, y = 2) + ... + p(x_i, y = C))

p(y = c ∣ x_i) = (k_c / (nV)) / ∑_{j=1}^{C} (k_j / (nV)) = k_c / ∑_{j=1}^{C} k_j

where k_j is the number of points of class j among the k nearest neighbours.

Now we can use the Bayes classification rule to assign a label to the new data point x_i.
Algorithm

1. Choose a value for k

2. For each test data point xi , find the k nearest neighbours in the training data D

3. Assign the class label to xi based on the majority class label of the k nearest
neighbours
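A compact sketch (my own, assuming NumPy) of the KNN classification rule:

```python
import numpy as np
from collections import Counter

def knn_predict(x, X_train, y_train, k=5):
    # distances from the query point to every training point
    dists = np.linalg.norm(X_train - x, axis=1)
    nearest = np.argsort(dists)[:k]            # indices of the k nearest neighbours
    votes = Counter(y_train[nearest])          # majority vote over their labels
    return votes.most_common(1)[0][0]

# usage: X_train is (n, d), y_train holds class labels
# label = knn_predict(x_query, X_train, y_train, k=7)
```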
Bias variance tradeoff

It can be shown that the average risk for squared error loss can be decomposed into three components:

R_avg(h) = E_{P_D} E_{P_{x,y}}[(h_D(x) − ĥ(x))^2]   // variance (sensitivity to the dataset)

         + E_{P_{x,y}}[(ĥ(x) − h*(x))^2]            // bias (how different the average classifier is from the optimal classifier)

         + E_{P_{x,y}}[(h*(x) − y)^2]               // irreducible noise (nothing can be done about this)

where:

R_avg(h) is the average risk.

P_D is the distribution of the training data. In simple words, P_D(D = D_i) is the probability of observing the training data D_i.

E_{P_D} is the expectation over the distribution of the training data.

P_{x,y} is the distribution of the input-output pairs.

E_{P_{x,y}} is the expectation over the input-output pairs.

h_D(x) is the classifier trained on the training data D.

ĥ(x) is the average classifier, i.e., ĥ(x) = E_{P_D}[h_D(x)]

h*(x) is the optimal classifier

Term breakdown

Variance:

E_{P_D} E_{P_{x,y}}[(h_D(x) − ĥ(x))^2]

Measures how sensitive the learned classifier h_D(x) is to different training datasets D. High variance means the model changes significantly with different training sets, making it unstable.

Bias:

E_{P_{x,y}}[(ĥ(x) − h*(x))^2]

Measures how much the average learned classifier ĥ(x) deviates from the optimal classifier h*(x) that minimizes the error. High bias means the model is systematically inaccurate or underfits.

Irreducible Noise:

E_{P_{x,y}}[(h*(x) − y)^2]

This term captures the inherent noise in the data y. No model can reduce this part, as it reflects randomness or variability in the data that is not related to the features x.

Bias variance tradeoff

1. Relationship:

As we decrease bias by making our model more complex (e.g., using more features or a more
flexible model), we often increase variance. This means the model may fit the training data very
well but perform poorly on unseen data due to overfitting.

Conversely, if we increase bias by simplifying the model (e.g., using fewer features or a more
rigid model), we may reduce variance, but at the cost of underfitting the training data.

2. Optimal Point:

The goal is to find a balance where both bias and variance are minimized, leading to the lowest
possible total error. This is often visualized as a U-shaped curve where the total error is
minimized at a certain level of model complexity.
Regularization

Regularization is a technique to increase model bias in order to reduce variance, by constraining empirical risk minimization.

Regularized empirical risk minimization:

Reg ERM = min_{h_θ ∈ H} R̂(h_θ)   s.t.   Ω(h_θ) < k

here Ω(h_θ) < k is the regularization constraint, which is a design choice.

This can be solved using the method of Lagrange multipliers:

h(x) = arg min_{h_θ ∈ H} ( R̂(h_θ) + λ Ω(θ) )

here λ is the regularization parameter, which is a design choice.

Norm based regularization

When Ω(θ) is taken to be the p-norm, i.e., ∥θ∥_p:

If p = 1, then we have L1 (lasso) regularization.

If p = 2, then we have L2 (ridge) regularization.

Importance of regularization

It can be shown that:

MLE ≈ ERM

MAP ≈ Reg ERM
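For instance, a short sketch (my own, assuming NumPy) of L2 (ridge) regularized linear regression, which has a closed-form solution:

```python
import numpy as np

def ridge_fit(X, y, lam=1.0):
    # minimize ||X beta - y||^2 + lam * ||beta||^2  (closed-form solution)
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

# usage: increasing lam increases bias but reduces the variance of the estimate
# beta = ridge_fit(X, y, lam=10.0)
```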
Support vector machine (SVM) when data is linearly separable

Dataset is linearly separable

A dataset D = {(x_i, y_i)}_{i=1}^{n}, where x_i ∈ R^d and y_i ∈ {−1, 1}, is said to be linearly separable if there exists a hyperplane that can separate the two classes of data points with zero training error.

Mathematically, a dataset is linearly separable if there exist weights w and bias b such that:

w^T x_i + b > 1 for y_i = 1

w^T x_i + b < −1 for y_i = −1

This can be rewritten as:

y_i (w^T x_i + b) ≥ 1 for all i

In other words, there does not exist any x_i such that −1 ≤ w^T x_i + b ≤ 1.

The distance between the margins is 2/∥w∥.

Optimization problem

Hence the SVM optimization problem is:

min_{w,b} (1/2)∥w∥^2

s.t. y_i (w^T x_i + b) ≥ 1 for all i

Solution to the optimization problem

We can solve this optimization problem using the method of Lagrange multipliers.
The Lagrangian for the SVM optimization problem can be formulated as follows:

L(w, b, μ) = (1/2)∥w∥^2 − ∑_{i=1}^{n} μ_i (y_i (w^T x_i + b) − 1)

where μ_i ≥ 0 are the Lagrange multipliers.

To find the optimal solution, we take the partial derivatives of the Lagrangian with respect to w and b, and set them to zero:

1. Gradient with respect to w:

∂L/∂w = w − ∑_{i=1}^{n} μ_i y_i x_i = 0

2. Gradient with respect to b:

∂L/∂b = −∑_{i=1}^{n} μ_i y_i = 0

3. Complementary slackness condition:

μ_i (y_i (w^T x_i + b) − 1) = 0

The solution can be obtained by solving these equations, which leads to the optimal weights w and bias b that define the separating hyperplane.

From the first equation, we can express w in terms of μ_i:

w = ∑_{i=1}^{n} μ_i y_i x_i

The second equation gives us a constraint on μ_i:

∑_{i=1}^{n} μ_i y_i = 0

Substituting these back into the Lagrangian and simplifying, we get the dual formulation:

Dual Formulation

The dual formulation of the SVM problem can be expressed as:

max_μ ∑_{i=1}^{n} μ_i − (1/2) ∑_{i=1}^{n} ∑_{j=1}^{n} μ_i μ_j y_i y_j (x_i^T x_j)

subject to:

∑_{i=1}^{n} μ_i y_i = 0

μ_i ≥ 0 for all i

Finding w

The dual problem can be solved using quadratic programming techniques. Once we have the optimal μ_i, we can recover w using the equation:

w = ∑_{i=1}^{n} μ_i y_i x_i

Finding b

To find b, we can use any support vector (a point where μ_i > 0) and the fact that for these points, y_i (w^T x_i + b) = 1.

Decision Function

The decision function for classifying new points becomes:

f(x) = sign( ∑_{i=1}^{n} μ_i y_i (x_i^T x) + b )

where only the support vectors (points with μ_i > 0) contribute to the sum.
Support vector machine (SVM) when data is not linearly separable

Dataset is not linearly separable

Unlike the case when the dataset is linearly separable, we cannot find a hyperplane that
separates the two classes of data points with zero training error.

Soft margin

To handle this case, we introduce a slack variable ξ_i for each data point x_i to allow some points to be on the wrong side of the margin or even in the wrong class.

Optimization problem

The optimization problem becomes:

min_{w,b,ξ} (1/2)∥w∥^2 + C ∑_{i=1}^{n} ξ_i

s.t. y_i (w^T x_i + b) ≥ 1 − ξ_i for all i

ξ_i ≥ 0 for all i

Solution to the optimization problem

The Lagrangian for the SVM optimization problem can be formulated as follows:

L(w, b, ξ, μ, ν) = (1/2)∥w∥^2 + C ∑_{i=1}^{n} ξ_i − ∑_{i=1}^{n} μ_i (y_i (w^T x_i + b) − 1 + ξ_i) − ∑_{i=1}^{n} ν_i ξ_i

where μ_i ≥ 0 and ν_i ≥ 0 are the Lagrange multipliers.

To find the optimal solution, we take the partial derivatives of the Lagrangian with respect to w, b, and ξ_i, and set them to zero:

∂L/∂w = 0 ⟹ w = ∑_{i=1}^{n} μ_i y_i x_i

∂L/∂b = 0 ⟹ ∑_{i=1}^{n} μ_i y_i = 0

∂L/∂ξ_i = 0 ⟹ C − μ_i − ν_i = 0

together with the primal feasibility conditions:

y_i (w^T x_i + b) − 1 + ξ_i ≥ 0

ξ_i ≥ 0

The complementary slackness conditions are:

μ_i (y_i (w^T x_i + b) − 1 + ξ_i) = 0

ν_i ξ_i = 0

From these conditions, we can deduce:

1. If 0 < μ_i < C, then ξ_i = 0 and y_i (w^T x_i + b) = 1

2. If μ_i = 0, then y_i (w^T x_i + b) ≥ 1

3. If μ_i = C, then y_i (w^T x_i + b) ≤ 1

Dual Formulation

Substituting these back into the Lagrangian and simplifying, we get the dual formulation:

max_μ ∑_{i=1}^{n} μ_i − (1/2) ∑_{i=1}^{n} ∑_{j=1}^{n} μ_i μ_j y_i y_j (x_i^T x_j)

subject to:

∑_{i=1}^{n} μ_i y_i = 0

0 ≤ μ_i ≤ C for all i

Finding w and b

Once we solve the dual problem and obtain the optimal μ_i, we can find w and b:

w = ∑_{i=1}^{n} μ_i y_i x_i

To find b, we can use any support vector (a point where 0 < μ_i < C) and the fact that for these points, y_i (w^T x_i + b) = 1.

Decision Function

The decision function for classifying new points remains the same as in the linearly separable case:

f(x) = sign( ∑_{i=1}^{n} μ_i y_i (x_i^T x) + b )

where only the support vectors (points with μ_i > 0) contribute to the sum.

The main difference from the linearly separable case is the upper bound C on the
Lagrange multipliers μi , which allows for some misclassifications in the training set

while still finding the optimal separating hyperplane.


SVM with kernel

Prerequisites - Kernel trick

Kernel function: k(x, y) = φ(x)^T φ(y), where φ(x) is a feature mapping that maps the data points to a higher dimensional space; the kernel lets us evaluate this inner product without actually computing the feature mapping φ(x).

In other words, we can compute the kernel function k(x, y) without explicitly computing the feature mapping φ(x).

Examples of kernel functions:

Polynomial kernel: k(x_1, x_2) = φ(x_1)^T φ(x_2) = (1 + x_1^T x_2)^p

Sigmoid kernel: k_s(x_1, x_2) = 1 / (1 + exp(a x_1^T x_2))

RBF/Gaussian kernel: k(x_1, x_2) = exp(−∥x_1 − x_2∥^2 / σ^2)

Motivation

If the dataset D = {(x_i, y_i)}_{i=1}^{n}, where x_i ∈ R^d and y_i ∈ {−1, 1}, is not linearly separable in the original d-dimensional space, we can use a feature mapping φ(x) to map the data points to a higher dimensional space where they become linearly separable, and then use the SVM optimization problem to find the optimal hyperplane.

Hence the new dataset is D′ = {(x′_i, y_i)}_{i=1}^{n}, where x′_i = φ(x_i) and y_i ∈ {−1, 1}.

Optimization problem

The optimization problem for SVM with kernel can be formulated as:

min_{w,b,ξ} (1/2)∥w∥^2 + C ∑_{i=1}^{n} ξ_i

subject to:

y_i (w^T φ(x_i) + b) ≥ 1 − ξ_i,  i = 1, 2, 3, ..., n

ξ_i ≥ 0

Solution to the optimization problem

The Lagrangian for the SVM optimization problem with kernel can be formulated as follows:

L(w, b, ξ, α, β) = (1/2)∥w∥^2 + C ∑_{i=1}^{n} ξ_i − ∑_{i=1}^{n} α_i (y_i (w^T φ(x_i) + b) − 1 + ξ_i) − ∑_{i=1}^{n} β_i ξ_i

where α_i ≥ 0 and β_i ≥ 0 are the Lagrange multipliers.

To find the optimal solution, we take the partial derivatives of the Lagrangian with respect to w, b, and ξ_i, and set them to zero:

∂L/∂w = 0 ⟹ w = ∑_{i=1}^{n} α_i y_i φ(x_i)

∂L/∂b = 0 ⟹ ∑_{i=1}^{n} α_i y_i = 0

∂L/∂ξ_i = 0 ⟹ C − α_i − β_i = 0

together with the primal feasibility conditions:

y_i (w^T φ(x_i) + b) − 1 + ξ_i ≥ 0

ξ_i ≥ 0

The complementary slackness conditions are:

α_i (y_i (w^T φ(x_i) + b) − 1 + ξ_i) = 0

β_i ξ_i = 0

From these conditions, we can deduce:

1. If 0 < α_i < C, then ξ_i = 0 and y_i (w^T φ(x_i) + b) = 1

2. If α_i = 0, then y_i (w^T φ(x_i) + b) ≥ 1

3. If α_i = C, then y_i (w^T φ(x_i) + b) ≤ 1

Dual Formulation

Substituting these back into the Lagrangian and simplifying, we get the dual formulation:

max_α ∑_{i=1}^{n} α_i − (1/2) ∑_{i=1}^{n} ∑_{j=1}^{n} α_i α_j y_i y_j k(x_i, x_j)

subject to:

∑_{i=1}^{n} α_i y_i = 0

0 ≤ α_i ≤ C for all i

where k(x_i, x_j) = φ(x_i)^T φ(x_j) is the kernel function.

Finding w

Once we solve the dual problem and obtain the optimal α_i, we can find w:

w = ∑_{i=1}^{n} α_i y_i φ(x_i)

but we don't need to compute φ(x_i) explicitly. Hence we use the kernel function k(x_i, x) to compute the decision function:

w^T φ(x) = ( ∑_{i=1}^{n} α_i y_i φ(x_i) )^T φ(x) = ∑_{i=1}^{n} α_i y_i φ(x_i)^T φ(x) = ∑_{i=1}^{n} α_i y_i k(x_i, x)

This is often referred to as the "kernel trick", which allows us to work in high-dimensional feature spaces without explicitly computing the feature vectors.

Finding b

To find b, we can use any support vector (a point where 0 < α_i < C) and the fact that for these points, y_i (w^T φ(x_i) + b) = 1.

Decision Function

The decision function for classifying new points becomes:

f(x) = sign( ∑_{i=1}^{n} α_i y_i k(x_i, x) + b )

where only the support vectors (points with α_i > 0) contribute to the sum, and the final predicted class is +1 or −1 according to the sign.

The main difference from the non-kernel SVM is the use of the kernel function k(x_i, x) instead of the dot product x_i^T x. This allows the SVM to find non-linear decision boundaries in the original input space by implicitly working in a higher-dimensional feature space.
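As an illustration (my own sketch, assuming NumPy), the kernel decision function only ever needs the values k(x_i, x), never φ explicitly:

```python
import numpy as np

def rbf_kernel(x1, x2, sigma=1.0):
    return np.exp(-np.linalg.norm(x1 - x2) ** 2 / sigma ** 2)

def svm_decision(x, support_X, support_y, alpha, b, kernel=rbf_kernel):
    # f(x) = sign( sum_i alpha_i y_i k(x_i, x) + b ), summed over support vectors
    s = sum(a * y * kernel(xi, x) for a, y, xi in zip(alpha, support_y, support_X))
    return np.sign(s + b)

# usage: alpha and b would come from solving the dual problem with a QP solver
```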
Support vector machine (SVM) as Empirical Risk Minimization (ERM)

Recall that in the support vector machine (SVM), we were trying to maximize the margin between the two classes of data points:

min_{w,b,ξ} (1/2)∥w∥^2 + C ∑_{i=1}^{n} ξ_i

s.t. y_i (w^T x_i + b) ≥ 1 − ξ_i for all i

ξ_i ≥ 0 for all i

SVM as ERM

This optimization problem can be reformulated as an empirical risk minimization (ERM) problem:

min_{w,b} (1/2) w^T w + C ∑_{i=1}^{n} max(0, 1 − y_i (w^T x_i + b))

This is an ERM problem with:

the loss function l(y_i, f(x_i)) = max(0, 1 − y_i f(x_i)), known as the hinge loss, and

the regularization term (1/2) w^T w.
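This ERM view suggests a simple training procedure; here is a sketch (my own, assuming NumPy) of sub-gradient descent on the regularized hinge loss:

```python
import numpy as np

def train_linear_svm(X, y, C=1.0, lr=0.01, epochs=200):
    # minimize (1/2)||w||^2 + C * sum_i max(0, 1 - y_i (w^T x_i + b))
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(epochs):
        margins = y * (X @ w + b)
        active = margins < 1                       # points with non-zero hinge loss
        grad_w = w - C * (y[active][:, None] * X[active]).sum(axis=0)
        grad_b = -C * y[active].sum()
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# usage: y must be in {-1, +1}; predict with np.sign(X @ w + b)
```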



Neural network

A neural network is a computational model inspired by the structure and function of biological neural networks. Mathematically, it can be defined as a series of function compositions:

f(x) = f_L(f_{L−1}(... f_2(f_1(x))))

where L is the number of layers in the network, and each function f_i represents a layer operation.

For a single layer, the operation can be expressed as:

f_i(x) = σ(W_i x + b_i)

where:

W_i is the weight matrix for layer i

b_i is the bias vector for layer i

σ is a non-linear activation function

The complete neural network can then be written as:

f(x) = σ_L(W_L σ_{L−1}(W_{L−1} ... σ_2(W_2 σ_1(W_1 x + b_1) + b_2) ... + b_{L−1}) + b_L)

This formulation allows the network to learn complex, non-linear mappings from inputs to outputs through the composition of simpler functions and the application of non-linear activations.

Non-linear activation functions σ

Sigmoid or logistic:

σ(x) = 1 / (1 + e^{−x})

Sign function:

sign(x)

Hyperbolic tangent:

tanh(x) = (e^x − e^{−x}) / (e^x + e^{−x})

Rectified Linear Unit (ReLU):

ReLU(x) = max(0, x)

Why do we need non-linear activation functions?

Without non-linear activation functions, the neural network would be equivalent


to a linear model. Let’s derive this:

Consider a neural network with two layers and no activation function:

1. First layer: y = W_1 x + b_1

2. Second layer: z = W_2 y + b_2

Substituting y into the second layer:

z = W_2 (W_1 x + b_1) + b_2

z = W_2 W_1 x + W_2 b_1 + b_2

This can be simplified to:

z = W x + b

Where:

W = W_2 W_1

b = W_2 b_1 + b_2

This is the equation of a linear model. Therefore, without non-linear activation


functions, regardless of the number of layers, a neural network will always produce a
linear transformation of the input.
Non-linear activation functions introduce non-linearity into the network, allowing
it to learn and represent complex, non-linear relationships in the data.
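A quick numerical check of this collapse (my own sketch, assuming NumPy):

```python
import numpy as np

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=4)
W2, b2 = rng.normal(size=(2, 4)), rng.normal(size=2)
x = rng.normal(size=3)

two_linear_layers = W2 @ (W1 @ x + b1) + b2
W, b = W2 @ W1, W2 @ b1 + b2              # the equivalent single linear layer
print(np.allclose(two_linear_layers, W @ x + b))   # True: no extra expressive power
```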
Backpropagation

Notations

L: number of layers in the network.

w^l_{jk}: weight connecting the k-th neuron of the (l−1)-th layer to the j-th neuron of the l-th layer

b^l_j: bias of the j-th neuron in the l-th layer

a^l_j: output (activation) of the j-th neuron in the l-th layer

z^l_j: preactivation output of the j-th neuron in the l-th layer

Here:

z^l_j = ∑_k w^l_{jk} a^{l−1}_k + b^l_j

Also:

a^l_j = σ(z^l_j)

a^l_j = σ( ∑_k w^l_{jk} a^{l−1}_k + b^l_j )

Derivation

Let's assume we use the squared loss function:

L = (1/2) ∥a^L − y∥^2

We define the risk R as the expected loss over the data distribution:

R = E[L]

Risk derivative with respect to the output layer:

∂R/∂a^L_j = ∂/∂a^L_j E[(1/2) ∥a^L − y∥^2] = E[(a^L_j − y_j)]

Let δ^L_j = ∂R/∂z^L_j, then:

δ^L_j = ∂R/∂a^L_j · ∂a^L_j/∂z^L_j = E[(a^L_j − y_j)] · σ′(z^L_j)

Error term for the output layer (using the element-wise product ⊙):

δ^L = ∇_a R ⊙ σ′(z^L)

Error term for hidden layers:

δ^l_j = ∂R/∂z^l_j = ∑_k ∂R/∂z^{l+1}_k · ∂z^{l+1}_k/∂z^l_j = ∑_k δ^{l+1}_k · w^{l+1}_{kj} · σ′(z^l_j)

because the sum runs over all k neurons in the (l+1)-th layer.

This can be written in vector form as:

δ^l = ((w^{l+1})^T δ^{l+1}) ⊙ σ′(z^l)

Gradient of the risk with respect to weights:

∂R/∂w^l_{jk} = ∂R/∂z^l_j · ∂z^l_j/∂w^l_{jk} = δ^l_j · a^{l−1}_k

Gradient of the risk with respect to biases:

∂R/∂b^l_j = ∂R/∂z^l_j · ∂z^l_j/∂b^l_j = δ^l_j

These derivatives form the basis of the backpropagation algorithm, allowing us to


compute the gradients needed for updating the weights and biases in the neural
network.
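A compact sketch (my own, assuming NumPy) of these update rules for a one-hidden-layer network with sigmoid activations and squared loss:

```python
import numpy as np

def sigmoid(z): return 1 / (1 + np.exp(-z))
def dsigmoid(z): return sigmoid(z) * (1 - sigmoid(z))

def backprop_step(x, y, W1, b1, W2, b2, lr=0.1):
    # forward pass
    z1 = W1 @ x + b1; a1 = sigmoid(z1)
    z2 = W2 @ a1 + b2; a2 = sigmoid(z2)
    # backward pass: delta^L = (a^L - y) * sigma'(z^L); delta^l = (W^{l+1}.T delta^{l+1}) * sigma'(z^l)
    delta2 = (a2 - y) * dsigmoid(z2)
    delta1 = (W2.T @ delta2) * dsigmoid(z1)
    # gradient steps: dR/dW^l = outer(delta^l, a^{l-1}), dR/db^l = delta^l
    W2 -= lr * np.outer(delta2, a1); b2 -= lr * delta2
    W1 -= lr * np.outer(delta1, x);  b1 -= lr * delta1
    return W1, b1, W2, b2
```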
Miscellaneous Machine Learning Terms

Epoch

One complete pass of the entire training dataset for training the model.

Batch Size

Risk is defined as

L = (1/N) ∑_{i=1}^{N} l(y_i, ŷ_i)

where l(y_i, ŷ_i) is a general loss function that measures the discrepancy between the true value y_i and the predicted value ŷ_i for each sample.

Calculating the risk over the complete dataset is computationally expensive. Hence we calculate the risk over a small subset of the dataset, called a batch, to perform backpropagation. The batch size is the number of samples in that subset.

Gradient Descent Variants

1. Batch Gradient Descent

Full dataset: Computes the gradient of the risk function using the entire training
dataset.

Update frequency: Weights are updated after evaluating the entire dataset in one
go.

Efficiency: Can be slow for large datasets as it requires calculating gradients over
the whole dataset before updating weights.

Convergence: More stable gradient leads to smoother convergence.


Mathematically:

θ = θ − η∇θ R(θ) ​

where θ are the model parameters, η is the learning rate, and R(θ) is the risk function.

2. Stochastic Gradient Descent (SGD)

Single sample: Updates weights using one randomly chosen sample from the
dataset at a time.

Update frequency: Weights are updated after each individual sample, leading to
more frequent updates.

Efficiency: Faster and more efficient for large datasets as each update only requires
computing the gradient for one sample.

Convergence: Can have a noisier path to convergence but may help escape local
minima due to randomness.

Mathematically:

θ = θ − η∇θ R(θ; x(i) , y (i) )


where (x(i) , y (i) ) is a single training example.

3. Mini-batch Gradient Descent

Batch of samples: Computes the gradient over a small batch of samples (between
full dataset and single sample).

Update frequency: Weights are updated after evaluating the risk on each mini-
batch.

Efficiency: Faster than batch gradient descent but less noisy than stochastic
gradient descent.

Convergence: Provides a balance between the efficiency of SGD and the stability
of batch gradient descent.

Mathematically:
θ = θ − η∇θ R(θ; x(i:i+n) , y (i:i+n) )

where (x(i:i+n) , y (i:i+n) ) represents a mini-batch of n training examples.
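A sketch (my own, assuming NumPy) of a generic mini-batch gradient descent loop; `grad_fn` is a placeholder for the gradient of the risk on a batch:

```python
import numpy as np

def minibatch_gd(theta, X, y, grad_fn, lr=0.01, batch_size=32, epochs=10):
    n = len(X)
    rng = np.random.default_rng(0)
    for _ in range(epochs):                      # one epoch = one full pass over the data
        idx = rng.permutation(n)
        for start in range(0, n, batch_size):
            batch = idx[start:start + batch_size]
            theta = theta - lr * grad_fn(theta, X[batch], y[batch])
    return theta
```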

Batch Normalization

Normalizes the input to a layer by adjusting and scaling the activations.

x̂_i = (x_i − μ_B) / sqrt(σ_B^2 + ε)

Where:

x_i is the input

μ_B is the mini-batch mean

σ_B^2 is the mini-batch variance

ε is a small constant added for numerical stability

Batch Normalization is implemented as a layer in neural networks (in the standard formulation the normalized value is additionally scaled and shifted by learnable parameters γ and β). During inference, the layer uses the moving averages of the mean and variance computed during training, rather than calculating batch statistics, to normalize the inputs.

Layer Normalization

Layer Normalization normalizes the inputs across the features for each sample in a
batch, rather than across the batch for each feature.

Mathematically, for an input x with H features:

μ = (1/H) ∑_{i=1}^{H} x_i

σ^2 = (1/H) ∑_{i=1}^{H} (x_i − μ)^2

x̂_i = (x_i − μ) / sqrt(σ^2 + ε)

y_i = γ x̂_i + β

Where:

μ is the mean of the features for a single sample

σ^2 is the variance of the features for a single sample

ε is a small constant for numerical stability

γ and β are learnable parameters for scaling and shifting

Implementation:

1. Compute mean and variance across features for each sample

2. Normalize each feature using the computed statistics

3. Scale and shift the normalized values with learnable parameters

Unlike Batch Normalization, Layer Normalization’s behavior is the same during training
and inference, as it doesn’t depend on batch statistics.

Dropout

A regularization technique

Also implemented as a layer in neural networks

During training:

Each neuron is “switched on” with probability p

When doing backpropagation, remember if neuron was on or off and update


weights accordingly

The value of p is generally same across all neurons of same layer but can be
different for different layers

During inference/classification:
All neurons are active

Output of each neuron is multiplied by probability p to get the output

N-Fold Cross Validation

A resampling technique used to evaluate machine learning models

Dataset is split into N equal parts

N-1 parts are used for training and 1 part is used for validation

This process is repeated N times, with each part being used for validation once

Finally, the performance of the model is evaluated by averaging the results from all
N folds
Decision Trees

How a decision tree works

At each node, a question is asked about the data that splits the data into two or more non-overlapping subsets.

Question: x_j ≤ θ

Then the data is split into two subsets, one where x_j ≤ θ and one where x_j > θ.

The process is repeated until we reach a leaf node, which assigns the data point to a region of the feature space.

We can do regression or classification of a test data point based on the training data points that lie in that region.

Growing a Decision Tree

Choose a data dimension j and a threshold θ for the split that minimizes a metric, e.g., Gini impurity for classification or mean squared error for regression. Do this recursively.

Gini Impurity

Gini Impurity is a measure of impurity or disorder in a set of data points. It’s used to
determine the quality of a split in a decision tree. The goal is to minimize the Gini
Impurity when growing the tree.

The Gini Impurity is calculated as:

G(set) = ∑_{i=1}^{K} ∑_{j≠i} p_i p_j

G(set) = 1 − ∑_{i=1}^{K} p_i^2

Where:

K is the total number of classes

p_i is the probability of picking a data point with class i (p_j is defined similarly) if you randomly choose from the set

p_i = n_i / N

Where:

n_i is the number of data points of class i in the set

N is the total number of data points in the set

To evaluate a potential binary split, we calculate the weighted average of the Gini Impurities for the resulting subsets:

G_split = (n_left / n) G_left + (n_right / n) G_right

Where:

n_left and n_right are the number of instances in the left and right subsets

n is the total number of instances

G_left and G_right are the Gini Impurities of the left and right subsets

The split with the lowest Gsplit is chosen as the best split for that node.
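A small sketch (my own, assuming NumPy) of evaluating a candidate split with the weighted Gini impurity:

```python
import numpy as np

def gini(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def gini_split(x_feature, labels, threshold):
    left = labels[x_feature <= threshold]
    right = labels[x_feature > threshold]
    n = len(labels)
    return len(left) / n * gini(left) + len(right) / n * gini(right)

# usage: scan features and thresholds, keep the split with the lowest gini_split value
```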

Mean Squared Error

For regression tasks, Mean Squared Error (MSE) is commonly used as the splitting
criterion. MSE measures the average squared difference between the predicted and
actual values.

The Mean Squared Error for a set of data points is calculated as:

MSE(set) = (1/∣set∣) ∑_{i∈set} (y_i − ŷ)^2

Where:

∣set∣ is the number of data points in the set

y_i is the actual value of the i-th data point

ŷ is the mean of the target values in the set, calculated as:

ŷ = (1/∣set∣) ∑_{i∈set} y_i

To evaluate a potential binary split, we calculate the weighted average of the MSEs for the resulting subsets:

MSE_split = (n_left / n) MSE_left + (n_right / n) MSE_right

Where:

n_left and n_right are the number of instances in the left and right subsets

n is the total number of instances

MSE_left and MSE_right are the Mean Squared Errors of the left and right subsets

The split with the lowest MSE_split is chosen as the best split for that node in the regression tree.

Pruning a Decision Tree

Pruning is the process of removing branches from a decision tree to prevent overfitting.
Overfitting occurs when the tree is too complex and fits the training data too closely,
capturing noise and details that are specific to the training data rather than generalizing
to new, unseen data.

Pruning helps to simplify the tree, making it more robust and reducing its complexity.
This can lead to better generalization to new data.
There are two main types of pruning:

1. Pre-pruning (Early Stopping):

This method stops the growth of the tree before it fully fits the training data.

It uses stopping criteria such as:

Maximum depth of the tree

Minimum number of samples required to split an internal node

Minimum number of samples required to be at a leaf node

Pre-pruning is computationally efficient but may result in underfitting if the


stopping criteria are too strict.

2. Post-pruning (Reduced Error Pruning):

This method first grows a full tree and then removes branches that do not provide
significant predictive power.

The process typically involves:

1. Grow a full tree on the training data

2. For each node:

Calculate the accuracy of the tree with and without the node

If removing the node increases accuracy, prune it

Post-pruning often results in better performance but is more computationally


expensive than pre-pruning.
Ensemble learning

Ensemble learning is a machine learning technique that combines multiple models to


improve the overall performance and robustness of the prediction. It is based on the
idea that by aggregating the predictions of several models, we can achieve better results
than any single model alone.
Bagging

Let the original data be D.

Create n new datasets by sampling with replacement from D.

Each new dataset will have the same number of samples as the original dataset,
but some samples will be repeated, and some will be excluded.

Train a model on each of the new datasets.

The final prediction is made by aggregating the predictions of all models:

For classification, the final prediction is made by taking the majority vote of the
predictions of all models.

For regression, the final prediction is made by taking the average of the predictions
of all models.

Random Forest - Bagged decision trees

Bagging of decision trees.

Each decision tree is trained on a random subset of the features.

The final prediction is made by taking the majority vote of the predictions of all
decision trees.
Boosting

In bagging, we train each model independently on a random subset of the data. In boosting, we train each model sequentially on the same data, with each subsequent model focusing on correcting the errors of the combined previous models (for example by increasing the weights of misclassified data points).

Unlike bagging, each model depends on the previous ones and its contribution to the
final prediction is weighted differently.

Mathematical Formulation of Boosting

Let's define the ensemble model H_T(x) as:

H_T(x) = ∑_{t=1}^{T} α_t h_t(x)

where:

h_t(x) is the t-th weak learner

α_t ∈ [0, 1] is the weight of the t-th weak learner

T is the total number of weak learners

The risk (or error) of the ensemble model H is defined as:

R(H) = (1/n) ∑_{i=1}^{n} L(H(x_i), y_i)

where:

L is the loss function

(x_i, y_i) are the input-output pairs in the dataset

n is the number of samples


Our goal is to minimize R(H) with respect to H_T. We do this by gradient descent over functions:

h_T = arg min_{h∈H} R(H_{T−1} + αh)

Using a Taylor expansion, this can be approximated as:

h_T = arg min_{h∈H} R(H_{T−1}) + α ⟨∇R(H_{T−1}), h⟩

where ⟨·, ·⟩ denotes the inner product.

Because R(H_{T−1}) is fixed, we can ignore it:

h_T = arg min_{h∈H} α · ⟨∇R(H_{T−1}), h⟩

Expanding this further:

h_T = arg min_{h∈H} ∑_{i=1}^{n} (∂R(H_{T−1}(x_i))/∂H_{T−1}(x_i)) · h(x_i)

We call r_i = −∂R(H_{T−1}(x_i))/∂H_{T−1}(x_i) the pseudo-residuals.

Thus, the problem reduces to:

h_T = arg min_{h∈H} −∑_{i=1}^{n} r_i · h(x_i) = arg max_{h∈H} ∑_{i=1}^{n} r_i · h(x_i)

This formulation shows that each new weak learner h_T is chosen to align with the pseudo-residuals of the previous ensemble model, effectively focusing on the errors of the combined previous models.
XGBoost - Gradient boosted regression tree

In XGBoost, the weak learners are decision trees.

The algorithm builds these trees sequentially, with each new tree aiming to correct the
errors of the combined previous trees.

1. Recall that r_i = −∂R(H_{T−1}(x_i))/∂H_{T−1}(x_i) are the pseudo-residuals.

2. XGBoost formulates the optimization problem for finding the next weak learner as:

h_T = arg min_{h∈H} ∑_{i=1}^{n} ( r_i · h(x_i) + (1/2) h(x_i)^2 )

3. We can rewrite this by defining ŷ_i = −r_i:

h_T = arg min_{h∈H} ∑_{i=1}^{n} ( −ŷ_i · h(x_i) + (1/2) h(x_i)^2 )

where ŷ_i = ∂R(H_{T−1}(x_i))/∂H_{T−1}(x_i)

4. This formulation is equivalent to:

h_T = arg min_{h∈H} ∑_{i=1}^{n} ( h(x_i) − ŷ_i )^2

which is the standard squared error regression problem with new labels ŷ_i.

5. In practice, ŷ_i is computed as:

ŷ_i = H_{T−1}(x_i) − y_i

where y_i is the true label and H_{T−1}(x_i) is the prediction of the ensemble up to the previous iteration. This follows from taking the gradient of the squared error loss L(y_i, H(x_i)) = (1/2)(y_i − H(x_i))^2 with respect to H(x_i).

Algorithm

The XGBoost algorithm can be summarized in the following steps:

1. Initialize the model with a constant value:

H_0(x) = arg min_γ ∑_{i=1}^{n} L(y_i, γ)

2. For t = 1 to T (number of trees):

a. Update labels:

ŷ_i = H_{t−1}(x_i) − y_i   for i = 1, ..., n

b. Fit a regression tree to the updated labels, giving terminal regions R_jt, j = 1, ..., J_t

c. For each terminal region R_jt, compute:

γ_jt = arg min_γ ∑_{x_i ∈ R_jt} L(y_i, H_{t−1}(x_i) + γ)

This step computes the optimal value γ_jt (the adjustment) for each leaf region R_jt, minimizing the loss function L for the observations in that region. This "boost" is added to the current model predictions in the next step.

d. Update the model:

H_t(x) = H_{t−1}(x) + ν ∑_{j=1}^{J_t} γ_jt I(x ∈ R_jt)

where ν is the learning rate (0 < ν ≤ 1) that controls how much of the new tree's contribution is added to the current model, γ_jt is the prediction adjustment for each terminal region R_jt of the tree, and I(x ∈ R_jt) is an indicator function that equals 1 if x belongs to region R_jt and 0 otherwise.

3. Output the final model:

H(x) = H_T(x) = H_0(x) + ∑_{t=1}^{T} ν ∑_{j=1}^{J_t} γ_jt I(x ∈ R_jt)

The final model is the sum of all the contributions from the individual trees, with each tree's contribution scaled by the learning rate ν.
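A simplified sketch (my own, assuming NumPy and scikit-learn's DecisionTreeRegressor as the base learner) of gradient boosting with squared error loss; note that, following the usual convention, each tree here is fit to the residual y − H rather than the flipped labels ŷ used above (the tree partitions are the same either way):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def fit_gbrt(X, y, n_trees=100, nu=0.1, max_depth=3):
    H = np.full(len(y), y.mean())        # H_0: constant minimizing squared loss
    trees = []
    for _ in range(n_trees):
        residual = y - H                 # negative gradient of (1/2)(y - H)^2
        tree = DecisionTreeRegressor(max_depth=max_depth).fit(X, residual)
        H += nu * tree.predict(X)        # learning-rate-scaled update
        trees.append(tree)
    return y.mean(), trees

def predict_gbrt(X, init, trees, nu=0.1):
    return init + nu * sum(t.predict(X) for t in trees)
```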
Adaboost

Recall

Recall from our discussion on boosting that we defined the ensemble model H_T(x) as:

H_T(x) = ∑_{t=1}^{T} α_t h_t(x)

and that the problem of finding the next weak learner reduces to:

h_T = arg max_{h∈H} ∑_{i=1}^{n} r_i · h(x_i)

where

r_i = −∂R(H_{T−1}(x_i))/∂H_{T−1}(x_i)

Problem setting

Let the dataset be D = {(x_i, y_i)}_{i=1}^{n}, where y_i ∈ {−1, 1}.

The loss function is given by:

L(H(x_i), y_i) = exp(−y_i H(x_i))

The risk of the ensemble model H is given by:

R(H) = (1/n) ∑_{i=1}^{n} exp(−y_i H(x_i))

We want to find the model H_T that minimizes this risk.


Solution
We know:

r_i = −∂R(H_{T−1}(x_i))/∂H_{T−1}(x_i)

Let's calculate the pseudo-residuals:

r_i = −∂/∂H_{T−1}(x_i) [ (1/n) ∑_{j=1}^{n} exp(−y_j H_{T−1}(x_j)) ]

    = −(1/n) ∂/∂H_{T−1}(x_i) exp(−y_i H_{T−1}(x_i))

    = −(1/n) exp(−y_i H_{T−1}(x_i)) (−y_i)

    = (1/n) y_i exp(−y_i H_{T−1}(x_i))

Now, let's define weights w_i for each data point:

w_i = exp(−y_i H_{T−1}(x_i)) / ∑_{j=1}^{n} exp(−y_j H_{T−1}(x_j))

Note that ∑_{i=1}^{n} w_i = 1.

Using these weights (and dropping positive constants that do not affect the arg max), we can rewrite our optimization problem:

h_T = arg max_{h∈H} ∑_{i=1}^{n} r_i · h(x_i)

    = arg max_{h∈H} ∑_{i=1}^{n} y_i exp(−y_i H_{T−1}(x_i)) · h(x_i)

    = arg max_{h∈H} ∑_{i=1}^{n} w_i y_i h(x_i)

The weak learner h_T is chosen to maximize the weighted correlation between its predictions and the true labels.

In practice, we take any h (say an MLP), weight the loss for each data point i by w_i, and then optimize the model parameters to minimize the weighted loss.

Finding step size

After finding h_T, we need to determine its weight α_T in the ensemble.

To find the step size α_{t+1}, we minimize the risk with respect to α:

α_{t+1} = arg min_α R(H_t + αh)

α_{t+1} = arg min_α ∑_{i=1}^{n} e^{−y_i (H_t(x_i) + α h(x_i))}

Differentiating with respect to α and setting the derivative to zero, we get:

α_{t+1} = (1/2) log((1 − ε)/ε)

where ε = ∑_{i : h(x_i) ≠ y_i} w_i is the weighted error of h.

We then update the ensemble:

H_{t+1}(x) = H_t(x) + α_{t+1} h(x)

And update the weights for the next iteration:

w_i^{(t+1)} = w_i^{(t)} e^{−α_{t+1} y_i h(x_i)}

These weights are then normalized to sum to 1.

Algorithm

The AdaBoost algorithm can be summarized in the following steps:

1. Initialize the weights for each training example:

w_i^{(1)} = 1/n   for i = 1, ..., n

2. For t = 1 to T (number of weak learners):

a. Train a weak learner h_t using the weighted training data:

Choose any suitable model as the weak learner, such as a decision tree, MLP, or logistic regression.

For each training example i, weight its contribution to the loss function by w_i^{(t)}.

If using a decision tree:

Modify the splitting criterion to account for sample weights.

When calculating impurity measures (e.g., Gini index or entropy), use weighted sums.

If using an MLP or logistic regression:

Modify the loss function to include sample weights. For example, with binary cross-entropy loss:

L = −∑_{i=1}^{n} w_i^{(t)} [ y_i log(h_t(x_i)) + (1 − y_i) log(1 − h_t(x_i)) ]

Optimize the parameters of h_t to minimize this weighted loss function.

The resulting h_t will focus more on correctly classifying examples with higher weights.

b. Calculate the weighted error of h_t:

ε_t = ∑_{i : h_t(x_i) ≠ y_i} w_i^{(t)}

c. Compute the weight of the weak learner:

α_t = (1/2) log((1 − ε_t)/ε_t)

d. Update the ensemble:

H_t(x) = H_{t−1}(x) + α_t h_t(x)

e. Update the weights for the next iteration:

w_i^{(t+1)} = w_i^{(t)} exp(−α_t y_i h_t(x_i))

f. Normalize the weights:

w_i^{(t+1)} = w_i^{(t+1)} / ∑_{j=1}^{n} w_j^{(t+1)}

3. Output the final ensemble:

H(x) = sign( ∑_{t=1}^{T} α_t h_t(x) )

This algorithm iteratively builds an ensemble of weak learners, each time focusing more
on the examples that were misclassified in previous iterations. The final classifier is a
weighted combination of all the weak learners, where the weights are determined by the
performance of each weak learner.
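A condensed sketch (my own, assuming NumPy and scikit-learn decision stumps as weak learners) of the AdaBoost loop above:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_fit(X, y, T=50):
    # y must be in {-1, +1}
    n = len(y)
    w = np.full(n, 1.0 / n)
    learners, alphas = [], []
    for _ in range(T):
        stump = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
        pred = stump.predict(X)
        eps = w[pred != y].sum()                        # weighted error
        alpha = 0.5 * np.log((1 - eps) / (eps + 1e-12))
        w = w * np.exp(-alpha * y * pred)               # reweight the examples
        w /= w.sum()                                    # normalize
        learners.append(stump); alphas.append(alpha)
    return learners, alphas

def adaboost_predict(X, learners, alphas):
    scores = sum(a * h.predict(X) for h, a in zip(learners, alphas))
    return np.sign(scores)
```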
Unsupervised learning

In unsupervised learning, we don’t have labeled data. We want to find structure in the
data.

Hence the data is of the form X = {x_1, ..., x_N}, where x_i ∈ R^d.
