Machine Learning: A Probabilistic Perspective
Victor
Index
1. Hypothesis Space
2. Bayes Classifier
3. Linear Regression
13. Backpropagation
17. Boosting
18. XGBoost
Hypothesis Space

A hypothesis function $h$ is a function that maps the feature vector $x$ to the label $y$, i.e.,

$$h : X \to Y$$

The hypothesis space is the set of all possible hypothesis functions, denoted as

$$H = \{h \mid h : X \to Y\}$$
During the learning process, we try to find the best hypothesis function h from the
hypothesis space H that minimizes the error between the predicted label and the true
label.
Loss function (L)
Let $\hat{y}_i = h(x_i)$ be the prediction of this hypothesis function for the feature vector $x_i$. Then the loss function $L(y_i, \hat{y}_i)$ is a function that measures the error between the predicted label and the true label.
Desirable properties

1. Non-negative: $L(y_i, \hat{y}_i) \ge 0$.
2. Zero if and only if the prediction is correct: $L(y_i, \hat{y}_i) = 0$ if and only if $y_i = \hat{y}_i$.
3. Continuous and differentiable for all $y_i$ and $\hat{y}_i$, for smooth optimization using gradient descent.
An example for classification is the 0-1 loss:

$$L(y_i, \hat{y}_i) = \begin{cases} 0 & \text{if } y_i = \hat{y}_i \\ 1 & \text{if } y_i \neq \hat{y}_i \end{cases}$$
Risk function

Recall that the loss function $L(y_i, \hat{y}_i)$ measures the error between the predicted label and the true label for a single data point.

The risk function $R(h)$ is the expected loss over the data distribution $P$. It is defined as:

$$R(h) = \mathbb{E}_{(x,y) \sim P}\left[L(y, h(x))\right]$$

Conditional risk

The conditional risk $R(h \mid x)$ is the expected loss for a given input $x$. It is defined as:

$$R(h \mid x) = \mathbb{E}_{y \sim P(y \mid x)}\left[L(y, h(x))\right]$$
Empirical risk
We do not know the true data distribution P and we have access to a dataset D
sampled from P . Hence we approximate the risk function by the empirical risk:
$$R(h) \approx \frac{1}{n} \sum_{i=1}^{n} L(y_i, h(x_i))$$
We want to find a hypothesis function h out of the complete hypothesis space H that
minimizes the risk function R(h).
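As a quick illustration (a minimal sketch, not from the original notes; the threshold classifier below is hypothetical), the empirical risk under the 0-1 loss is just the fraction of misclassified points:

```python
import numpy as np

def empirical_risk(h, X, y):
    """Average 0-1 loss of hypothesis h over the dataset (X, y)."""
    predictions = np.array([h(x) for x in X])
    return np.mean(predictions != y)

# Hypothetical threshold classifier on a toy 1-D dataset
X = np.array([[0.2], [0.8], [0.5], [0.9]])
y = np.array([0, 1, 0, 1])
h = lambda x: int(x[0] > 0.6)
print(empirical_risk(h, X, y))  # fraction of misclassified points
```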
Bayes classifier
$$h_B(x) = \begin{cases} 1 & \text{if } P(y=1 \mid x = x_i) > P(y=0 \mid x = x_i) \\ 0 & \text{if } P(y=1 \mid x = x_i) \le P(y=0 \mid x = x_i) \end{cases}$$
Notations
P (y) is also called the prior probability of class y or class prior probability.
To prove that the Bayes classifier is the best classifier for the 0-1 loss function, we need
to show that it minimizes the expected risk (error) for any given input x.
Under the 0-1 loss:

$$L(y, h(x)) = \begin{cases} 0 & \text{if } y = h(x) \\ 1 & \text{if } y \neq h(x) \end{cases}$$

The conditional risk of a prediction $h(x)$ is then:
1. If $h(x) = 0$: $R(h \mid x) = P(y = 1 \mid x)$
2. If $h(x) = 1$: $R(h \mid x) = P(y = 0 \mid x)$
The Bayes classifier chooses the class that minimizes this conditional risk:

$$h_B(x) = \arg\min_{y \in \{0,1\}} R(y \mid x) = \arg\max_{y \in \{0,1\}} P(y \mid x)$$

This means:

$$R(h_B \mid x) = \min\left(P(y=0 \mid x),\, P(y=1 \mid x)\right) \le R(h \mid x) \quad \text{for any } h$$
Since the Bayes classifier minimizes the conditional risk for every x, it also minimizes the
overall expected risk R(h). Therefore, the Bayes classifier is the optimal classifier for
the 0-1 loss function.
KL Divergence (Kullback–Leibler Divergence)

Mathematically, for two probability distributions $P(x)$ and $Q(x)$, the KL divergence from $Q$ to $P$ is defined as:

$$D_{KL}(P \parallel Q) = \sum_x p(x) \log \frac{p(x)}{q(x)} \quad \text{(or } \int p(x) \log \frac{p(x)}{q(x)}\, dx \text{ for continuous } x\text{)}$$

where the sum/integral is over all possible events $x$, and $p(x)$ and $q(x)$ are the probability density (or mass) functions of distributions $P(x)$ and $Q(x)$ respectively.
Intuition

$D_{KL}(P \parallel Q)$ measures the expected extra log-loss ("surprise") incurred when data generated by $P$ is modeled using $Q$ instead of $P$.
Properties

$D_{KL}(P \parallel Q) \ge 0$, with equality if and only if $P = Q$.
$D_{KL}$ is not symmetric: in general $D_{KL}(P \parallel Q) \neq D_{KL}(Q \parallel P)$, so it is not a true distance metric.
For a typical ML problem, all we have are samples from the true distribution $P(x)$, i.e., $\text{data} = \{x_i\}_{i=1}^{N}$ where $x_i \in \mathbb{R}^d$ are data points. We do not know the distribution $P(x)$ explicitly.
We try our best to estimate the true distribution P (x) by Q(x; θ) where θ are the
parameters of the model.
We want to know how well our model Q(x; θ) is performing. We can do this by
calculating the KL divergence between the true distribution P (x) and the estimated
distribution Q(x; θ).
We are trying to find the parameters $\theta$ that minimize the KL divergence between $p(x)$ and $q(x; \theta)$:

$$D_{KL}(p \parallel q) = \mathbb{E}_{x \sim p(x)}\left[\log p(x)\right] - \mathbb{E}_{x \sim p(x)}\left[\log q(x; \theta)\right]$$

Because $\mathbb{E}_{x \sim p(x)}\left[\log p(x)\right]$ is constant with respect to $\theta$, we can ignore it; minimizing the KL divergence is therefore equivalent to maximizing the expected log-likelihood $\mathbb{E}_{x \sim p(x)}\left[\log q(x; \theta)\right]$.
By the law of large numbers, we can approximate the expected log-likelihood by the average log-likelihood of the data:

$$\mathbb{E}_{x \sim p(x)}\left[\log q(x; \theta)\right] \approx \frac{1}{N} \sum_{i=1}^{N} \log q(x_i; \theta)$$

Hence, the optimal parameters are given by maximum likelihood estimation:

$$\theta^* = \arg\max_\theta \frac{1}{N} \sum_{i=1}^{N} \log q(x_i; \theta)$$
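As a minimal sketch of this idea (assuming a Gaussian model $q(x; \theta)$ with $\theta = (\mu, \sigma)$, which is our choice here, not the notes'), the average log-likelihood is maximized by the sample mean and standard deviation:

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=2.0, scale=1.5, size=1000)  # samples from the unknown P(x)

def avg_log_likelihood(data, mu, sigma):
    """(1/N) sum_i log q(x_i; mu, sigma) for a Gaussian model."""
    return np.mean(-0.5 * np.log(2 * np.pi * sigma**2)
                   - (data - mu) ** 2 / (2 * sigma**2))

# For a Gaussian, the argmax is available in closed form:
mu_mle, sigma_mle = data.mean(), data.std()
print(mu_mle, sigma_mle)                           # close to 2.0 and 1.5
print(avg_log_likelihood(data, mu_mle, sigma_mle))
```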
Let each data point $x_i$ be associated with a latent variable $z_i$ that is not observed. $z_i$ is a random variable that takes values in some finite set $\{1, \ldots, K\}$ and represents the membership of $x_i$ in one of $K$ clusters.
Variational inference
The log-likelihood of the observed data $t$ can be rewritten using any distribution $q(z)$ over the latent variables:

$$\ell(\theta) = \log p^\theta_t(t) = \log \sum_z p^\theta_{tz}(t, z) = \log \sum_z q(z) \frac{p^\theta_{tz}(t, z)}{q(z)}$$

$$\ell(\theta) = \log \mathbb{E}_{q(z)}\left[\frac{p^\theta_{tz}(t, z)}{q(z)}\right]$$

By Jensen's inequality (the logarithm is concave, so $\log \mathbb{E}[\cdot] \ge \mathbb{E}[\log(\cdot)]$), we have:

$$\ell(\theta) \ge \mathbb{E}_{q(z)}\left[\log \frac{p^\theta_{tz}(t, z)}{q(z)}\right]$$
$\ell(\theta)$ is the log-likelihood of the observed data; it is also called the evidence.

$\mathbb{E}_{q(z)}\left[\log \frac{p^\theta_{tz}(t, z)}{q(z)}\right]$ is the lower bound of the log-likelihood and is also called the evidence lower bound, $\text{ELBO}(q, \theta)$.
Optimizing ELBO
Instead of maximizing the evidence, we maximize the evidence lower bound (ELBO). We do
it by maximizing the ELBO with respect to q(z) and θ .
To make the ELBO tight, we consider the difference between the evidence and the ELBO:
$$\ell(\theta) - \text{ELBO}(q, \theta) = \log p^\theta_t(t) - \mathbb{E}_{q(z)}\left[\log \frac{p^\theta_{tz}(t, z)}{q(z)}\right]$$

Using $p^\theta_{tz}(t, z) = p^\theta_{z \mid t}(z \mid t)\, p^\theta_t(t)$:

$$\ell(\theta) - \text{ELBO}(q, \theta) = \log p^\theta_t(t) - \mathbb{E}_{q(z)}\left[\log p^\theta_{z \mid t}(z \mid t) + \log p^\theta_t(t) - \log q(z)\right]$$

$$\ell(\theta) - \text{ELBO}(q, \theta) = \log p^\theta_t(t) - \mathbb{E}_{q(z)}\left[\log p^\theta_{z \mid t}(z \mid t)\right] - \log p^\theta_t(t) + \mathbb{E}_{q(z)}\left[\log q(z)\right]$$

$$\ell(\theta) - \text{ELBO}(q, \theta) = \mathbb{E}_{q(z)}\left[\log \frac{q(z)}{p^\theta_{z \mid t}(z \mid t)}\right] = D_{KL}\left(q(z) \,\Vert\, p^\theta_{z \mid t}(z \mid t)\right)$$

The difference between the evidence and the ELBO is the Kullback–Leibler (KL) divergence between $q(z)$ and $p^\theta_{z \mid t}(z \mid t)$. To make the ELBO tight, we need to minimize this KL divergence.
The KL divergence is always non-negative and equals zero if and only if the two distributions are identical. Therefore, to make the ELBO tight, we set:

$$q(z) = p^\theta_{z \mid t}(z \mid t)$$
This choice of q(z) makes the ELBO equal to the evidence, achieving the tightest possible
bound.
EM Algorithm
1. E-step (Expectation): Using the current parameters, set $q(z)$ to the posterior over the latent variables and form the expected complete-data log-likelihood.
2. M-step (Maximization): Find the parameter that maximizes this expected log-likelihood.

Concretely:

1. Initialize $\theta^{(0)}$.
2. E-step: $q^{(t)}(z) = p^{\theta^{(t-1)}}_{z \mid t}(z \mid t)$
3. M-step:

$$\theta^{(t)} = \arg\max_\theta\; \mathbb{E}_{q^{(t)}(z)}\left[\log p^\theta_{tz}(t, z)\right] - \mathbb{E}_{q^{(t)}(z)}\left[\log q^{(t)}(z)\right] = \arg\max_\theta\; \mathbb{E}_{q^{(t)}(z)}\left[\log p^\theta_{tz}(t, z)\right]$$

as the second term is constant with respect to $\theta$.
The EM algorithm guarantees that the likelihood increases at each iteration and converges
to a local maximum.
Let’s apply the EM algorithm to the Gaussian Mixture Model (GMM) we discussed earlier.
Recall that in a GMM, we have:
$$p^\theta_t(t) = \sum_{j=1}^{m} \alpha_j\, \mathcal{N}(t; \mu_j, \xi_j)$$

where $\alpha_j$ are mixing coefficients with $\sum_{j=1}^{m} \alpha_j = 1$, and $\mu_j$, $\xi_j$ are the mean and covariance matrix of the $j$th Gaussian component.
1. Initialization:
Choose initial values for the parameters θ = (α1 , ..., αm , μ1 , ..., μm , ξ1 , ..., ξm ).
2. E-step:
Compute the posterior probabilities (responsibilities) for each data point and each Gaussian
component:
$$q^{(t+1)}(z = s) = p^{\theta^t}_{z \mid t}(z = s \mid t = t_i) = \frac{\alpha_s\, \mathcal{N}(t_i; \mu_s, \xi_s)}{\sum_{j=1}^{m} \alpha_j\, \mathcal{N}(t_i; \mu_j, \xi_j)}$$
3. M-step:

Update the parameters using the responsibilities as weights:

$$\alpha_s = \frac{1}{n} \sum_{i=1}^{n} q^{(t+1)}(z_i = s), \qquad \mu_s = \frac{\sum_{i=1}^{n} q^{(t+1)}(z_i = s)\, t_i}{\sum_{i=1}^{n} q^{(t+1)}(z_i = s)}, \qquad \xi_s = \frac{\sum_{i=1}^{n} q^{(t+1)}(z_i = s)\, (t_i - \mu_s)(t_i - \mu_s)^T}{\sum_{i=1}^{n} q^{(t+1)}(z_i = s)}$$
The EM algorithm for GMM alternates between these steps until convergence, effectively
maximizing the likelihood of the observed data under the Gaussian mixture model.
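A minimal NumPy sketch of EM for a one-dimensional, two-component GMM (the variable names and the 1-D simplification are ours):

```python
import numpy as np

rng = np.random.default_rng(0)
t = np.concatenate([rng.normal(-2, 1.0, 200), rng.normal(3, 0.5, 200)])
K, n = 2, len(t)
alphas, mus, vars_ = np.full(K, 1 / K), np.array([-1.0, 1.0]), np.ones(K)

def gauss(t, mu, var):
    return np.exp(-(t - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

for _ in range(50):
    # E-step: responsibilities q(z_i = s), one row per data point
    resp = alphas * gauss(t[:, None], mus, vars_)
    resp /= resp.sum(axis=1, keepdims=True)
    # M-step: re-estimate parameters from the weighted data
    Ns = resp.sum(axis=0)
    alphas = Ns / n
    mus = (resp * t[:, None]).sum(axis=0) / Ns
    vars_ = (resp * (t[:, None] - mus) ** 2).sum(axis=0) / Ns

print(alphas, mus, vars_)  # should recover the two generating components
```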
Linear Discriminant Analysis (LDA) from a Bayesian perspective
Let’s assume the parametric form of the conditional density p(x∣y) is:
p(x∣y = 1) ∼ N (x; μ1 , Σ)
p(x∣y = 0) ∼ N (x; μ2 , Σ)
where N (x; μ, Σ) denotes a multivariate Gaussian distribution with mean μ and covariance
matrix Σ. Note that we assume the covariance matrix Σ is the same for both classes, which is a
key assumption in LDA.
In simple words, we are assuming that the data is distributed as Gaussian in each class with
different means but shared covariance matrix.
Also, let's assume that the prior probabilities $P(y=1)$ and $P(y=0)$ are the same, i.e., $1/2$.
Derivation
Let’s derive the decision boundary for LDA using the Bayes classifier.
1. Bayes Classifier:
$$h_B(x) = \begin{cases} 1 & \text{if } P(y=1 \mid x = x_i) > P(y=0 \mid x = x_i) \\ 0 & \text{if } P(y=1 \mid x = x_i) \le P(y=0 \mid x = x_i) \end{cases}$$

2. Posterior Probability (Bayes' rule):

$$P(y = k \mid x) = \frac{P(x \mid y = k) \cdot P(y = k)}{P(x)}$$
3. Decision Rule:
Taking logs of the Gaussian posteriors, predict class 1 if:

$$-\frac{1}{2}(x - \mu_1)^T \Sigma^{-1}(x - \mu_1) - \frac{d}{2}\log(2\pi) - \frac{1}{2}\log|\Sigma| + \log P(y=1) > -\frac{1}{2}(x - \mu_2)^T \Sigma^{-1}(x - \mu_2) - \frac{d}{2}\log(2\pi) - \frac{1}{2}\log|\Sigma| + \log P(y=0)$$
6. Simplifying (the shared terms cancel):

$$-\frac{1}{2}(x - \mu_1)^T \Sigma^{-1}(x - \mu_1) + \log P(y=1) > -\frac{1}{2}(x - \mu_2)^T \Sigma^{-1}(x - \mu_2) + \log P(y=0)$$

Expanding the quadratic forms:

$$-\frac{1}{2}\left(x^T \Sigma^{-1} x - 2\mu_1^T \Sigma^{-1} x + \mu_1^T \Sigma^{-1} \mu_1\right) + \log P(y=1) > -\frac{1}{2}\left(x^T \Sigma^{-1} x - 2\mu_2^T \Sigma^{-1} x + \mu_2^T \Sigma^{-1} \mu_2\right) + \log P(y=0)$$

The $x^T \Sigma^{-1} x$ terms cancel:

$$\mu_1^T \Sigma^{-1} x - \frac{1}{2}\mu_1^T \Sigma^{-1} \mu_1 + \log P(y=1) > \mu_2^T \Sigma^{-1} x - \frac{1}{2}\mu_2^T \Sigma^{-1} \mu_2 + \log P(y=0)$$
9. Rearranging:

$$(\mu_1 - \mu_2)^T \Sigma^{-1} x > \frac{1}{2}\left(\mu_1^T \Sigma^{-1} \mu_1 - \mu_2^T \Sigma^{-1} \mu_2\right) + \log\left(P(y=0)/P(y=1)\right)$$

This gives a linear decision rule:

$$h_B(x) = \begin{cases} 1 & \text{if } w^T x + w_0 > 0 \\ 0 & \text{otherwise} \end{cases}$$

where:

$$w = \Sigma^{-1}(\mu_1 - \mu_2)$$

$$w_0 = -\frac{1}{2}\left(\mu_1^T \Sigma^{-1} \mu_1 - \mu_2^T \Sigma^{-1} \mu_2\right) - \log\left(P(y=0)/P(y=1)\right)$$
Note
The decision boundary is linear in x. This is the reason it is called Linear Discriminant
Analysis.
The decision boundary will not be linear if the covariance matrices are not the same for
both classes.
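A minimal sketch of the derived rule (estimating the class means and a shared covariance from data; the pooled-covariance estimate below is a simple choice of ours):

```python
import numpy as np

rng = np.random.default_rng(0)
X1 = rng.multivariate_normal([2, 2], np.eye(2), 100)    # class 1 samples
X0 = rng.multivariate_normal([-2, -2], np.eye(2), 100)  # class 0 samples

mu1, mu0 = X1.mean(axis=0), X0.mean(axis=0)
Sigma = 0.5 * (np.cov(X1.T) + np.cov(X0.T))  # shared covariance estimate
Sigma_inv = np.linalg.inv(Sigma)

# Linear discriminant from the derivation (equal priors, so the log-ratio term vanishes)
w = Sigma_inv @ (mu1 - mu0)
w0 = -0.5 * (mu1 @ Sigma_inv @ mu1 - mu0 @ Sigma_inv @ mu0)

predict = lambda x: int(w @ x + w0 > 0)
print(predict(np.array([1.5, 2.0])), predict(np.array([-1.0, -3.0])))  # 1, 0
```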
Linear regression
Y = β0 + β1 X + ε
Where:

$Y$ is the response variable and $X$ is the feature variable
$\beta_0$ is the intercept and $\beta_1$ is the slope
$\varepsilon$ is a random error (noise) term with zero mean
Ideal regressor
For mean squared error loss, the ideal regressor is defined as:
h∗ (x) = E[Y ∣X = x]
To derive the ideal regressor for squared error loss, we need to find the function h(x)
that minimizes the expected squared error:
$$\frac{\partial}{\partial h(x)} \mathbb{E}\left[(Y - h(X))^2\right] = -2\,\mathbb{E}[Y \mid X = x] + 2h(x) = 0$$

$$h^*(x) = \mathbb{E}[Y \mid X = x]$$
This shows that the ideal regressor for squared error loss is indeed the conditional
expectation of Y given X = x.
Simplifying, for the linear model $Y = \beta_0 + \beta_1 X + \varepsilon$:

$$h^*(x) = \mathbb{E}[\beta_0 + \beta_1 X + \varepsilon \mid X = x]$$

1. $\mathbb{E}[\beta_0 \mid X = x] = \beta_0$ (a constant)
2. $\mathbb{E}[\beta_1 X \mid X = x] = \beta_1 x$ (since $X$ is fixed at $x$)
3. $\mathbb{E}[\varepsilon \mid X = x] = 0$ (assuming the error term has zero mean and is independent of $X$)

Therefore:

$$h^*(x) = \beta_0 + \beta_1 x$$
Interpretation

The ideal regressor $h^*(x) = \beta_0 + \beta_1 x$ represents the best possible prediction of $Y$ given $X = x$ under the squared error loss,
assuming the linear model is correct. This function provides the average value of Y for
each value of X , effectively capturing the underlying linear relationship between the
variables while averaging out the random noise (represented by ε).
Empirical Risk Minimization
In practice, we don’t have access to the true distribution of the data, so we can’t directly
minimize the expected risk. Instead, we use the empirical risk as an approximation:
$$\hat{R}(\beta) = \frac{1}{n} \sum_{i=1}^{n} \left(y_i - (\beta_0 + \beta_1 x_i)\right)^2$$
Note that in this formulation, we don’t explicitly see the error term ε that was present in
our original model Y = β0 + β1 X + ε. This is because the empirical risk is calculated
using the observed yi values, which already incorporate the random error.
$$\beta^* = \arg\min_\beta \hat{R}(\beta)$$
This optimization problem has a closed-form solution, which can be derived using linear
algebra:
$$\beta^* = (X^T X)^{-1} X^T Y$$
$$X = \begin{bmatrix} 1 & x_{11} & x_{12} & \cdots & x_{1d} \\ 1 & x_{21} & x_{22} & \cdots & x_{2d} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & x_{n1} & x_{n2} & \cdots & x_{nd} \end{bmatrix}_{n \times (d+1)}$$
This matrix has n rows (one for each data point) and d + 1 columns. The first column is
all 1s (for the intercept term), and the remaining d columns contain the feature values of
our data points.
$$Y = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix}_{n \times 1}$$
This is a column vector with n rows, containing the scalar y values of our data points.
$$\beta^* = \begin{bmatrix} \beta_0^* \\ \beta_1^* \\ \vdots \\ \beta_d^* \end{bmatrix}_{(d+1) \times 1}$$

This vector contains the learned parameters:

1. $\beta_0^*$ is the intercept term.
2. $\begin{bmatrix} \beta_1^* & \cdots & \beta_d^* \end{bmatrix}^T_{d \times 1}$ are the slope coefficients for the $d$ features.
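A minimal NumPy sketch of the closed-form solution (using `np.linalg.solve` on the normal equations rather than an explicit inverse, which is numerically preferable):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 100, 3
X_raw = rng.normal(size=(n, d))
beta_true = np.array([1.0, 2.0, -1.0, 0.5])    # [b0, b1, b2, b3], for simulation only
X = np.hstack([np.ones((n, 1)), X_raw])        # prepend the column of 1s
Y = X @ beta_true + 0.1 * rng.normal(size=n)   # noisy targets

# Solve (X^T X) beta = X^T Y, i.e. beta* = (X^T X)^{-1} X^T Y
beta_star = np.linalg.solve(X.T @ X, X.T @ Y)
print(beta_star)  # close to beta_true
```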
Logistic regression

Given a dataset $D = \{(x_i, y_i)\}_{i=1}^{n}$, where $y_i \in \{0, 1\}$ is the binary label of the feature vector $x_i$, logistic regression models the class posterior as:

$$P(y_i = 1 \mid x_i) = \frac{1}{1 + e^{-(w^T x_i + b)}}$$

$$P(y_i = 0 \mid x_i) = 1 - P(y_i = 1 \mid x_i) = \frac{1}{1 + e^{w^T x_i + b}}$$
Note that this sigmoid form is well motivated from a generative viewpoint, because consider (assuming equal class priors):

$$p(y = 1 \mid x) = \frac{p(x \mid y = 1)}{p(x \mid y = 1) + p(x \mid y = 0)}$$

$$p(y = 1 \mid x) = \frac{1}{1 + \frac{p(x \mid y = 0)}{p(x \mid y = 1)}}$$

If $p(x_i \mid y_i = 0)$ and $p(x_i \mid y_i = 1)$ follow Gaussian distributions with different means and a shared covariance matrix, then the log of the likelihood ratio is linear in $x$ and the posterior takes exactly the logistic form:

$$P(y_i = 0 \mid x_i) = \frac{1}{1 + e^{w^T x_i + b}}$$
For the binary classification problem, we use logistic regression as the hypothesis function and cross-entropy as the loss function, and perform gradient descent to find the parameters $w$ and $b$.
Using hard labels (0 or 1) directly in backpropagation can lead to several issues, hence
we use soft labels. Soft labels provide a continuous probability distribution over the
classes, allowing for smoother gradients and more stable training. This approach enables
the model to capture uncertainty and learn more nuanced decision boundaries
compared to hard binary classifications.
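A minimal sketch of logistic regression trained by gradient descent (the gradient of the mean cross-entropy with respect to $w$ is $\frac{1}{n}X^T(p - y)$, a standard result; the synthetic data is ours):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(float)   # synthetic binary labels

w, b, lr = np.zeros(2), 0.0, 0.1
for _ in range(500):
    p = 1 / (1 + np.exp(-(X @ w + b)))      # P(y=1 | x) for all points
    w -= lr * (X.T @ (p - y)) / len(y)      # gradient of mean cross-entropy
    b -= lr * np.mean(p - y)

print(w, b)  # decision boundary roughly along x1 + x2 = 0
```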
Softmax regression, also known as multiclass regression

Here $y_i \in \{1, 2, \ldots, K\}$ is the multiclass label of the feature vector $x_i$, and $K$ is the number of classes. Softmax regression models the class posterior as:
$$P(y_i = k \mid x_i) = \frac{e^{w_k^T x_i + b_k}}{\sum_{j=1}^{K} e^{w_j^T x_i + b_j}}$$

where $w_k$ is the parameter vector for class $k$, and $b_k$ is the bias term for class $k$.
Note that

$$\sum_{k=1}^{K} P(y_i = k \mid x_i) = 1$$
This ensures that the probabilities for all classes sum up to 1, providing a valid
probability distribution over the K classes.
The reason for using soft labels rather than hard labels, and the method for finding the parameters $w_k$ and $b_k$, are the same as for logistic regression.
For the multiclass classification problem, we use softmax regression as the hypothesis
function and cross-entropy loss function as the loss function. We then perform gradient
descent to find the parameters wk and bk for each class k .
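A small sketch of the softmax computation itself (subtracting the max logit before exponentiating is a standard numerical-stability trick, not something the notes require):

```python
import numpy as np

def softmax_probs(x, W, b):
    """P(y=k|x) for all k; rows of W and entries of b are w_k, b_k."""
    logits = W @ x + b
    logits -= logits.max()        # stability; does not change the probabilities
    exp = np.exp(logits)
    return exp / exp.sum()        # sums to 1 over the K classes

W = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]])  # K=3 classes, d=2 features
b = np.zeros(3)
print(softmax_probs(np.array([0.5, -0.2]), W, b))
```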
MAP estimation

$$\theta^*_{MAP} = \arg\max_\theta p(\theta \mid V) = \arg\max_\theta p(V \mid \theta)\, p(\theta)$$

where

$\theta^*_{MAP}$ is the MAP estimate of the parameter $\theta$
$p(\theta \mid V)$ is the posterior probability of the parameter $\theta$ given the observed data $V$
$p(V \mid \theta)$ is the likelihood of the observed data $V$ given the parameter $\theta$
$p(\theta)$ is the prior probability of the parameter $\theta$
Note that:

Unlike MLE, MAP estimation incorporates prior knowledge about the parameter $\theta$ in addition to the observed data $V$, whereas MLE depends only on the observed data $V$.
MAP provides a balance between the likelihood of the data and the prior beliefs,
leading to more robust estimates, especially when the data is limited.
Conjugate prior
A conjugate prior is a type of prior distribution p(θ) such that when multiplied with the
likelihood function p(V ∣θ) the posterior distribution p(θ∣V ) that we get is in the same
form as the prior p(θ).
In simple terms, the posterior $p(\theta \mid V)$ belongs to the same family of distributions as the prior $p(\theta)$.
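As a standard worked example (not spelled out in these notes): the Beta distribution is a conjugate prior for the Bernoulli likelihood. With $k$ successes in $n$ trials,

$$p(\theta) = \text{Beta}(\theta; a, b) \propto \theta^{a-1}(1-\theta)^{b-1}, \qquad p(V \mid \theta) = \theta^{k}(1-\theta)^{n-k}$$

$$p(\theta \mid V) \propto \theta^{(a+k)-1}(1-\theta)^{(b+n-k)-1} = \text{Beta}(\theta; a+k,\, b+n-k)$$

so the posterior is again a Beta distribution, just with updated counts.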
Non-parametric density estimation
In parametric density estimation, we assume that the data is generated from a known
distribution, such as the normal distribution, and we estimate the parameters of the
distribution using various methods like maximum likelihood estimation or risk
minimization.
The basic idea behind non-parametric density estimation is to estimate the probability
density function (PDF) directly from the data without assuming a specific functional
form. One way to approach this is by considering the probability of a data point falling
within a certain region.
Let $D = \{x_1, x_2, \ldots, x_n\}$ be our dataset. The probability of a data point falling within a region $R$ can be approximated by:

$$P \approx \frac{k}{n}$$

where $k$ is the number of data points in region $R$, and $n$ is the total number of data points.

Also, if the region $R$ is small enough that the density is roughly constant inside it:

$$P \approx p(x) \cdot V$$

where $p(x)$ is the probability density at point $x$ and $V$ is the volume of the region $R$. Equating the two expressions:

$$\frac{k}{n} = p(x) \cdot V$$
Rearranging this equation, we get:

$$p(x) = \frac{k}{nV}$$
This forms the basis for various non-parametric density estimation techniques.
Parzen window estimate, also known as the kernel density estimate
Basic idea
Recall that we had derived the following equation for non-parametric density
estimation:
$$p(x) = \frac{k}{nV}$$

where:

$k$ is the number of data points falling in the region around $x$
$V$ is the volume of that region
$n$ is the total number of data points

In the Parzen window estimate, we fix the volume $V$ and count $k$ by using a window function.
Problem setting
Given a set of data points D = {x1 , x2 , ..., xn }, we want to estimate the probability
density function p(x) at a given point x ie model the distribution of data points in the
dataset.
Consider a $d$-dimensional hypercube of side length $h_n$ centered at $x$; its volume is:

$$V_n = (h_n)^d$$

where $h_n$ is the side length of the hypercube and $d$ is the dimension of the feature space.

Define the window function:

$$\phi(u) = \begin{cases} 1 & \text{if } |u_j| \le \frac{1}{2},\; j = 1, \ldots, d \\ 0 & \text{otherwise} \end{cases}$$

where $\phi(u)$ returns 1 if the point $u$ is within the unit hypercube centered at the origin, and 0 otherwise.
Scaling and shifting the window:

$$\phi\left(\frac{x - x_i}{h_n}\right) = \begin{cases} 1 & \text{if } x \text{ is in the hypercube centered at } x_i \text{ of side } h_n \\ 0 & \text{otherwise} \end{cases}$$

The number of data points inside the hypercube of side $h_n$ centered at $x$ is then:

$$k_n = \sum_{i=1}^{n} \phi\left(\frac{x - x_i}{h_n}\right)$$
Substituting into $p(x) = \frac{k}{nV}$:

$$p(x) = \frac{k_n}{n V_n} = \frac{1}{n (h_n)^d} \sum_{i=1}^{n} \phi\left(\frac{x - x_i}{h_n}\right)$$

Note that the estimate is a superposition of window functions centered at the data points, so the choice of window function and of $h_n$ determines the smoothness of the estimated density.
Parzen window with Gaussian kernel
1. Gaussian kernel:
$$\phi(u) = \frac{1}{(2\pi)^{d/2}} e^{-\frac{1}{2}\|u\|^2}$$

2. Scaled and shifted kernel:

$$\phi\left(\frac{x - x_i}{h_n}\right) = \frac{1}{(2\pi)^{d/2}} e^{-\frac{1}{2}\left\|\frac{x - x_i}{h_n}\right\|^2}$$
3. Density estimate:

$$p(x) = \frac{k_n}{n V_n} = \frac{1}{n (h_n)^d} \sum_{i=1}^{n} \phi\left(\frac{x - x_i}{h_n}\right)$$
Algorithm

1. Choose the window side length $h_n$ (the bandwidth).
2. For each test data point $x_i$, find the number of points in the hypercube centered at $x_i$ of side $h_n$ (or, for a smooth kernel, evaluate the kernel at every training point).
3. Calculate the Parzen window estimate $p(x)$ using the values found in the previous step.
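A minimal 1-D sketch of the Parzen estimate with the Gaussian kernel (the bandwidth value is an arbitrary choice for illustration):

```python
import numpy as np

def parzen_gaussian(x, data, h):
    """Kernel density estimate p(x) = (1/(n h)) sum_i phi((x - x_i)/h), d = 1."""
    u = (x - data) / h
    phi = np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)  # Gaussian kernel
    return phi.sum() / (len(data) * h)

rng = np.random.default_rng(0)
data = rng.normal(0, 1, 500)
for x in [-2.0, 0.0, 2.0]:
    print(x, parzen_gaussian(x, data, h=0.3))  # roughly the standard normal pdf
```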
K Nearest neighbour (KNN)
Basic idea
Recall that we had derived the following equation for non-parametric density estimation:

$$p(x) = \frac{k}{nV}$$

where $k$ is the number of data points in a region of volume $V$ around $x$, and $n$ is the total number of data points. In KNN, instead of fixing the volume, we fix $k$ and grow the volume $V$ until it contains the $k$ nearest neighbours of $x$.
Problem setting
Formulation
$$p(y = c \mid x_i) = \frac{p(x_i, y = c)}{p(x_i)} = \frac{\frac{k_c}{nV}}{\frac{k}{nV}} = \frac{k_c}{k}$$

where $k_c$ is the number of the $k$ nearest neighbours of $x_i$ that belong to class $c$.
Now we can use the Bayes classification rule to assign a label to the new data point $x_i$.
Algorithm

1. Choose the number of neighbours $k$.
2. For each test data point $x_i$, find the $k$ nearest neighbours in the training data $D$.
3. Assign the class label to $x_i$ based on the majority class label of the $k$ nearest neighbours.
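A minimal sketch of the algorithm (Euclidean distance and the toy data are our choices):

```python
import numpy as np

def knn_predict(x, X_train, y_train, k=5):
    """Majority vote over the k nearest training points."""
    dists = np.linalg.norm(X_train - x, axis=1)
    nearest = np.argsort(dists)[:k]                  # indices of the k neighbours
    labels, counts = np.unique(y_train[nearest], return_counts=True)
    return labels[np.argmax(counts)]                 # majority class

rng = np.random.default_rng(0)
X_train = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(4, 1, (50, 2))])
y_train = np.array([0] * 50 + [1] * 50)
print(knn_predict(np.array([3.5, 4.2]), X_train, y_train))  # expected: 1
```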
Bias variance tradeoff
It can be shown that the average risk for squared error loss can be decomposed into three components:

$$R_{avg}(h) = \underbrace{\mathbb{E}_{P_D}\mathbb{E}_{P_{x,y}}\left[\left(h_D(x) - \hat{h}(x)\right)^2\right]}_{\text{variance}} + \underbrace{\mathbb{E}_{P_{x,y}}\left[\left(\hat{h}(x) - h^*(x)\right)^2\right]}_{\text{bias}} + \underbrace{\mathbb{E}_{P_{x,y}}\left[\left(h^*(x) - y\right)^2\right]}_{\text{irreducible noise}}$$

where:

$h_D(x)$ is the classifier learned from a particular training set $D$
$\hat{h}(x) = \mathbb{E}_{P_D}[h_D(x)]$ is the average classifier over training sets
$h^*(x)$ is the optimal classifier

Term breakdown

Variance:

$$\mathbb{E}_{P_D}\mathbb{E}_{P_{x,y}}\left[\left(h_D(x) - \hat{h}(x)\right)^2\right]$$

Measures how sensitive the learned classifier $h_D(x)$ is to different training datasets $D$. High variance means the model changes significantly with different training sets, making it unstable.

Bias:

$$\mathbb{E}_{P_{x,y}}\left[\left(\hat{h}(x) - h^*(x)\right)^2\right]$$

Measures how different the average classifier is from the optimal classifier.

Irreducible Noise:

$$\mathbb{E}_{P_{x,y}}\left[\left(h^*(x) - y\right)^2\right]$$

This term captures the inherent noise in the data $y$. No model can reduce this part, as it reflects randomness or variability in the data that is not related to the features $x$.
1. Relationship:
As we decrease bias by making our model more complex (e.g., using more features or a more
flexible model), we often increase variance. This means the model may fit the training data very
well but perform poorly on unseen data due to overfitting.
Conversely, if we increase bias by simplifying the model (e.g., using fewer features or a more
rigid model), we may reduce variance, but at the cost of underfitting the training data.
2. Optimal Point:
The goal is to find a balance where both bias and variance are minimized, leading to the lowest
possible total error. This is often visualized as a U-shaped curve where the total error is
minimized at a certain level of model complexity.
Regularization

Regularized empirical risk minimization adds a complexity penalty to the empirical risk:

$$h^* = \arg\min_{h \in H}\; \hat{R}(h) + \lambda\, \Omega(h)$$

where $\Omega(h)$ penalizes the complexity of the hypothesis and $\lambda \ge 0$ controls the strength of the penalty.

Importance of regularization

$$\text{MLE} \approx \text{ERM}$$

$$\text{MAP} \approx \text{regularized ERM}$$

Adding a prior (MAP) corresponds to adding a regularization term to empirical risk minimization.
Support Vector Machine (SVM) when data is linearly separable

A dataset is said to be linearly separable if there exists a hyperplane that can separate the two classes of data points with zero training error.
Mathematically, a dataset is linearly separable if there exist weights $w$ and bias $b$ such that:

$$w^T x_i + b > 1 \quad \text{for } y_i = 1$$

$$w^T x_i + b < -1 \quad \text{for } y_i = -1$$

The distance between the margins is $\frac{2}{\|w\|}$.
Optimization problem

Maximizing the margin $\frac{2}{\|w\|}$ is equivalent to:

$$\min_{w,b} \frac{1}{2}\|w\|^2$$

subject to:

$$y_i(w^T x_i + b) \ge 1, \quad i = 1, 2, \ldots, n$$
We can solve this optimization problem using the method of Lagrange multipliers.
The Lagrangian for the SVM optimization problem can be formulated as follows:
$$L(w, b, \mu) = \frac{1}{2}\|w\|^2 - \sum_{i=1}^{n} \mu_i\left(y_i(w^T x_i + b) - 1\right)$$

where $\mu_i \ge 0$ are the Lagrange multipliers.
To find the optimal solution, we take the partial derivatives of the Lagrangian with
respect to w , b, and μi , and set them to zero:
$$\frac{\partial L}{\partial w} = w - \sum_{i=1}^{n} \mu_i y_i x_i = 0$$

$$\frac{\partial L}{\partial b} = -\sum_{i=1}^{n} \mu_i y_i = 0$$

Complementary slackness additionally requires:

$$\mu_i\left(y_i(w^T x_i + b) - 1\right) = 0$$
The solution can be obtained by solving these equations, which leads to the optimal
weights w and bias b that define the separating hyperplane.
$$w = \sum_{i=1}^{n} \mu_i y_i x_i$$

$$\sum_{i=1}^{n} \mu_i y_i = 0$$
Substituting these back into the Lagrangian and simplifying, we get the dual
formulation:
Dual Formulation
The dual formulation of the SVM problem can be expressed as:
$$\max_\mu \sum_{i=1}^{n} \mu_i - \frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n} \mu_i \mu_j y_i y_j (x_i^T x_j)$$

subject to:

$$\sum_{i=1}^{n} \mu_i y_i = 0$$

$$\mu_i \ge 0 \quad \text{for all } i$$
Finding w
The dual problem can be solved using quadratic programming techniques. Once we
have the optimal μi , we can recover w using the equation:
$$w = \sum_{i=1}^{n} \mu_i y_i x_i$$
Finding b

To find $b$, we can use any support vector (a point where $\mu_i > 0$) and the fact that for these points $y_i(w^T x_i + b) = 1$, which gives $b = y_i - w^T x_i$.
Decision Function

$$f(x) = \text{sign}\left(\sum_{i=1}^{n} \mu_i y_i\, x_i^T x + b\right)$$
Support Vector Machine (SVM) when data is not linearly separable

Unlike the linearly separable case, we cannot find a hyperplane that separates the two classes of data points with zero training error.
Soft margin
To handle this case, we introduce a slack variable ξi for each data point xi to allow some
points to be on the wrong side of the margin or even in the wrong class.
Optimization problem

$$\min_{w,b,\xi} \frac{1}{2}\|w\|^2 + C\sum_{i=1}^{n} \xi_i$$

subject to:

$$y_i(w^T x_i + b) \ge 1 - \xi_i, \qquad \xi_i \ge 0 \quad \text{for all } i$$
The Lagrangian for the SVM optimization problem can be formulated as follows:
$$L(w, b, \xi, \mu, \nu) = \frac{1}{2}\|w\|^2 + C\sum_{i=1}^{n}\xi_i - \sum_{i=1}^{n}\mu_i\left(y_i(w^T x_i + b) - 1 + \xi_i\right) - \sum_{i=1}^{n}\nu_i \xi_i$$

where $\mu_i \ge 0$ and $\nu_i \ge 0$ are the Lagrange multipliers. Setting the partial derivatives to zero:

$$\frac{\partial L}{\partial w} = 0 \implies w = \sum_{i=1}^{n}\mu_i y_i x_i$$

$$\frac{\partial L}{\partial b} = 0 \implies \sum_{i=1}^{n}\mu_i y_i = 0$$

$$\frac{\partial L}{\partial \xi_i} = 0 \implies C - \mu_i - \nu_i = 0$$

The primal feasibility conditions are:

$$y_i(w^T x_i + b) - 1 + \xi_i \ge 0, \qquad \xi_i \ge 0$$

and the complementary slackness conditions are:

$$\mu_i\left(y_i(w^T x_i + b) - 1 + \xi_i\right) = 0, \qquad \nu_i \xi_i = 0$$
From these conditions:

1. If $0 < \mu_i < C$, then $\xi_i = 0$ and $y_i(w^T x_i + b) = 1$ (the point lies exactly on the margin)
2. If $\mu_i = 0$, then $y_i(w^T x_i + b) \ge 1$ (the point is correctly classified outside the margin)
3. If $\mu_i = C$, then $y_i(w^T x_i + b) \le 1$ (the point is inside the margin or misclassified)
Dual Formulation
Substituting these back into the Lagrangian and simplifying, we get the dual
formulation:
$$\max_\mu \sum_{i=1}^{n}\mu_i - \frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n}\mu_i\mu_j y_i y_j (x_i^T x_j)$$

subject to:

$$\sum_{i=1}^{n}\mu_i y_i = 0$$

$$0 \le \mu_i \le C \quad \text{for all } i$$
Finding w and b
Once we solve the dual problem and obtain the optimal μi , we can find w and b:
$$w = \sum_{i=1}^{n}\mu_i y_i x_i$$

To find $b$, we can use any support vector (a point where $0 < \mu_i < C$) and the fact that for these points $y_i(w^T x_i + b) = 1$.
Decision Function

The decision function for classifying new points remains the same as in the linearly separable case:

$$f(x) = \text{sign}\left(\sum_{i=1}^{n}\mu_i y_i\, x_i^T x + b\right)$$

where only the support vectors (points with $\mu_i > 0$) contribute to the sum.
The main difference from the linearly separable case is the upper bound $C$ on the Lagrange multipliers $\mu_i$, which allows for some misclassifications in the training set.
Kernel SVM

A kernel function $k(x, y) = \phi(x)^T \phi(y)$, where $\phi(x)$ is a feature mapping that maps the data points to a higher-dimensional space, lets us work in that space without actually computing the feature mapping $\phi(x)$.

In other words, we can compute the kernel function $k(x, y)$ without actually computing the feature mapping $\phi(x)$. Common examples include:

Polynomial kernel: $k_p(x_1, x_2) = (x_1^T x_2 + c)^p$
Gaussian (RBF) kernel: $k_g(x_1, x_2) = \exp\left(-\frac{\|x_1 - x_2\|^2}{2\sigma^2}\right)$
Sigmoid kernel: $k_s(x_1, x_2) = \frac{1}{1 + \exp(a\, x_1^T x_2)}$
Motivation

If the data points are not linearly separable in the original $d$-dimensional space, we can use a feature mapping $\phi(x)$ to map the data points to a higher-dimensional space where they become linearly separable, and then use the SVM optimization problem to find the optimal hyperplane.
Hence the new dataset D ′ = {(x′i , yi )}ni=1 where x′i = ϕ(xi ) and yi ∈ {−1, 1}.
Optimization problem
The optimization problem for SVM with a kernel can be formulated as:

$$\min_{w,b,\xi} \frac{1}{2}\|w\|^2 + C\sum_{i=1}^{n}\xi_i$$

subject to:

$$y_i(w^T \phi(x_i) + b) \ge 1 - \xi_i, \quad i = 1, 2, 3, \ldots, n$$

$$\xi_i \ge 0$$
The Lagrangian for the SVM optimization problem with kernel can be formulated as
follows:
$$L(w, b, \xi, \alpha, \beta) = \frac{1}{2}\|w\|^2 + C\sum_{i=1}^{n}\xi_i - \sum_{i=1}^{n}\alpha_i\left(y_i(w^T\phi(x_i) + b) - 1 + \xi_i\right) - \sum_{i=1}^{n}\beta_i\xi_i$$

where $\alpha_i \ge 0$ and $\beta_i \ge 0$ are the Lagrange multipliers.

To find the optimal solution, we take the partial derivatives of the Lagrangian with respect to $w$, $b$, and $\xi_i$, and set them to zero:
$$\frac{\partial L}{\partial w} = 0 \implies w = \sum_{i=1}^{n}\alpha_i y_i \phi(x_i)$$

$$\frac{\partial L}{\partial b} = 0 \implies \sum_{i=1}^{n}\alpha_i y_i = 0$$

$$\frac{\partial L}{\partial \xi_i} = 0 \implies C - \alpha_i - \beta_i = 0$$

The primal feasibility conditions are:

$$y_i(w^T\phi(x_i) + b) - 1 + \xi_i \ge 0, \qquad \xi_i \ge 0$$

and the complementary slackness conditions are:

$$\alpha_i\left(y_i(w^T\phi(x_i) + b) - 1 + \xi_i\right) = 0, \qquad \beta_i\xi_i = 0$$
Dual Formulation
Substituting these back into the Lagrangian and simplifying, we get the dual formulation:
$$\max_\alpha \sum_{i=1}^{n}\alpha_i - \frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n}\alpha_i\alpha_j y_i y_j\, k(x_i, x_j)$$

subject to:

$$\sum_{i=1}^{n}\alpha_i y_i = 0$$

$$0 \le \alpha_i \le C \quad \text{for all } i$$

where $k(x_i, x_j) = \phi(x_i)^T\phi(x_j)$ is the kernel function.
Finding w
Once we solve the dual problem and obtain the optimal αi , we can find w and b:
$$w = \sum_{i=1}^{n}\alpha_i y_i \phi(x_i)$$

but we don't need to compute $\phi(x_i)$ explicitly: $w$ only ever enters the classifier through inner products, so we can use the kernel function directly. This is often referred to as the "kernel trick", which allows us to work in high-dimensional feature spaces without explicitly computing the feature vectors.
Finding b

To find $b$, we can use any support vector (a point where $0 < \alpha_i < C$) and the fact that for these points $y_i(w^T\phi(x_i) + b) = 1$, i.e., $b = y_i - \sum_{j=1}^{n}\alpha_j y_j\, k(x_j, x_i)$.
Decision Function

$$f(x) = \text{sign}\left(\sum_{i=1}^{n}\alpha_i y_i\, k(x_i, x) + b\right)$$

where only the support vectors (points with $\alpha_i > 0$) contribute to the sum.
The main difference from the non-kernel SVM is the use of the kernel function $k(x_i, x)$ instead of the dot product $x_i^T x$. This allows the SVM to find non-linear decision boundaries in the original input space.
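A short sketch using scikit-learn's `SVC`, which solves this dual problem (assuming scikit-learn is available; the circular toy data is ours):

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (np.linalg.norm(X, axis=1) < 1).astype(int)  # circle: not linearly separable

clf = SVC(kernel='rbf', C=1.0, gamma='scale')    # kernel trick; C bounds alpha_i
clf.fit(X, y)
print(clf.predict([[0.0, 0.0], [2.0, 2.0]]))     # inside vs. outside the circle
print(len(clf.support_))                         # number of support vectors
```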
Recall that in the support vector machine (SVM), we were trying to maximize the margin between the two classes of data points:

$$\min_{w,b,\xi} \frac{1}{2}\|w\|^2 + C\sum_{i=1}^{n}\xi_i$$

subject to:

$$y_i(w^T x_i + b) \ge 1 - \xi_i, \qquad \xi_i \ge 0 \quad \text{for all } i$$
SVM as ERM

At the optimum, $\xi_i = \max(0,\, 1 - y_i(w^T x_i + b))$, so the same problem can be written as unconstrained, regularized empirical risk minimization:

$$\min_{w,b} \frac{1}{2}w^T w + C\sum_{i=1}^{n}\max\left(0,\, 1 - y_i(w^T x_i + b)\right)$$

where $\max(0,\, 1 - y_i(w^T x_i + b))$ is the hinge loss.
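A minimal sketch of solving this ERM form directly by subgradient descent on the hinge loss (the $\lambda$ reparameterization of the $\frac{1}{2}\|w\|^2$ term and the toy data are our choices):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = np.where(X[:, 0] + X[:, 1] > 0, 1.0, -1.0)  # labels in {-1, +1}

w, b, lr, lam = np.zeros(2), 0.0, 0.1, 0.01
for _ in range(1000):
    margins = y * (X @ w + b)
    active = margins < 1                         # points with nonzero hinge loss
    # Subgradient of lam/2 ||w||^2 + mean hinge loss
    w -= lr * (lam * w - (y[active, None] * X[active]).sum(axis=0) / len(y))
    b -= lr * (-y[active].sum() / len(y))

print(w, b)
```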
Neural networks

A feed-forward neural network is a composition of functions:

$$h(x) = f_L(f_{L-1}(\cdots f_1(x)))$$

where $L$ is the number of layers in the network, and each function $f_i$ represents a layer operation:

$$f_i(x) = \sigma(W_i x + b_i)$$

where:

$W_i$ is the weight matrix of layer $i$
$b_i$ is the bias vector of layer $i$
$\sigma$ is a non-linear activation function
This formulation allows the network to learn complex, non-linear mappings from inputs
to outputs through the composition of simpler functions and the application of non-
linear activations.
Activation functions

Sigmoid or logistic:

$$\sigma(x) = \frac{1}{1 + e^{-x}}$$

Sign function:

$$\text{sign}(x)$$

Hyperbolic tangent:

$$\tanh(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}}$$

ReLU:

$$\text{ReLU}(x) = \max(0, x)$$
Why are non-linear activations needed? Consider two linear layers stacked without an activation:

1. First layer: $y = W_1 x + b_1$
2. Second layer: $z = W_2 y + b_2$

Substituting:

$$z = W_2(W_1 x + b_1) + b_2 = W_2 W_1 x + W_2 b_1 + b_2 = Wx + b$$

Where:

$$W = W_2 W_1, \qquad b = W_2 b_1 + b_2$$

So without non-linear activations, any stack of linear layers collapses to a single linear layer.
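A quick numerical check of this collapse (shapes chosen arbitrarily):

```python
import numpy as np

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=4)
W2, b2 = rng.normal(size=(2, 4)), rng.normal(size=2)
x = rng.normal(size=3)

z_two_layers = W2 @ (W1 @ x + b1) + b2         # two stacked linear layers
z_one_layer = (W2 @ W1) @ x + (W2 @ b1 + b2)   # the single equivalent layer
print(np.allclose(z_two_layers, z_one_layer))  # True
```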
Notations

$w^l_{jk}$: weight connecting the $k$th neuron of the $(l-1)$th layer to the $j$th neuron of the $l$th layer.

Here:

$$z^l_j = \sum_k w^l_{jk}\, a^{l-1}_k + b^l_j$$

Also:

$$a^l_j = \sigma(z^l_j)$$

$$a^l_j = \sigma\left(\sum_k w^l_{jk}\, a^{l-1}_k + b^l_j\right)$$
Derivation

For a single data point, the squared error loss is:

$$L = \frac{1}{2}\|a^L - y\|^2$$

We define the risk $R$ as the expected loss over the data distribution:

$$R = \mathbb{E}[L]$$
Let

$$\delta^L_j = \frac{\partial R}{\partial z^L_j}$$

then:

$$\delta^L_j = \frac{\partial R}{\partial a^L_j} \cdot \frac{\partial a^L_j}{\partial z^L_j} = \mathbb{E}\left[(a^L_j - y_j)\right]\cdot\sigma'(z^L_j)$$

Error term for the output layer (using the element-wise product $\odot$):

$$\delta^L = \nabla_a R \odot \sigma'(z^L)$$
For hidden layers, the error propagates backwards via the chain rule:

$$\delta^l_j = \frac{\partial R}{\partial z^l_j} = \sum_k \frac{\partial R}{\partial z^{l+1}_k}\cdot\frac{\partial z^{l+1}_k}{\partial z^l_j} = \sum_k \delta^{l+1}_k\, w^{l+1}_{kj}\,\sigma'(z^l_j)$$

In matrix form:

$$\delta^l = \left((w^{l+1})^T\delta^{l+1}\right)\odot\sigma'(z^l)$$
The gradients with respect to the weights and biases are:

$$\frac{\partial R}{\partial w^l_{jk}} = \frac{\partial R}{\partial z^l_j}\cdot\frac{\partial z^l_j}{\partial w^l_{jk}} = \delta^l_j\, a^{l-1}_k$$

$$\frac{\partial R}{\partial b^l_j} = \frac{\partial R}{\partial z^l_j}\cdot\frac{\partial z^l_j}{\partial b^l_j} = \delta^l_j$$
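A minimal sketch of these equations for one hidden layer with sigmoid activations and squared-error loss, for a single data point (so the expectation drops out; all shapes are our choices):

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

rng = np.random.default_rng(0)
x = rng.normal(size=3)                        # input, a^0 = x
y = np.array([1.0, 0.0])                      # target
W1, b1 = rng.normal(size=(4, 3)), np.zeros(4)
W2, b2 = rng.normal(size=(2, 4)), np.zeros(2)

# Forward pass
z1 = W1 @ x + b1; a1 = sigmoid(z1)
z2 = W2 @ a1 + b2; a2 = sigmoid(z2)

# Backward pass: delta^L first, then propagate to the hidden layer
delta2 = (a2 - y) * a2 * (1 - a2)             # sigma'(z) = a(1-a) for sigmoid
delta1 = (W2.T @ delta2) * a1 * (1 - a1)      # ((w^{l+1})^T delta^{l+1}) * sigma'(z^l)

dW2, db2 = np.outer(delta2, a1), delta2       # dR/dw^l_{jk} = delta^l_j a^{l-1}_k
dW1, db1 = np.outer(delta1, x), delta1
print(dW1.shape, dW2.shape)                   # (4, 3) (2, 4)
```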
Epoch
One complete pass of the entire training dataset for training the model.
Batch Size

The number of training samples processed before the model's weights are updated.
Risk is defined as

$$L = \frac{1}{N}\sum_{i=1}^{N} l(y_i, \hat{y}_i)$$

where $l(y_i, \hat{y}_i)$ is a general loss function that measures the discrepancy between the true label and the prediction.

Calculating the risk over the complete dataset is computationally expensive. Hence we calculate the risk over a small subset of the dataset, called a batch, to perform backpropagation.
Batch Gradient Descent

Full dataset: Computes the gradient of the risk function using the entire training dataset.
Update frequency: Weights are updated after evaluating the entire dataset in one
go.
Efficiency: Can be slow for large datasets as it requires calculating gradients over
the whole dataset before updating weights.
θ = θ − η∇θ R(θ)
where θ are the model parameters, η is the learning rate, and R(θ) is the risk function.
Stochastic Gradient Descent (SGD)

Single sample: Updates weights using one randomly chosen sample from the dataset at a time.
Update frequency: Weights are updated after each individual sample, leading to
more frequent updates.
Efficiency: Faster and more efficient for large datasets as each update only requires
computing the gradient for one sample.
Convergence: Can have a noisier path to convergence but may help escape local
minima due to randomness.
Mathematically:

$$\theta = \theta - \eta\nabla_\theta R(\theta; x^{(i)}, y^{(i)})$$
Mini-batch Gradient Descent

Batch of samples: Computes the gradient over a small batch of samples (between full dataset and single sample).
Update frequency: Weights are updated after evaluating the risk on each mini-
batch.
Efficiency: Faster than batch gradient descent but less noisy than stochastic
gradient descent.
Convergence: Provides a balance between the efficiency of SGD and the stability
of batch gradient descent.
Mathematically:
$$\theta = \theta - \eta\nabla_\theta R(\theta; x^{(i:i+n)}, y^{(i:i+n)})$$
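A skeleton of the corresponding training loop (the plug-in least-squares gradient at the end is just for demonstration):

```python
import numpy as np

def minibatch_sgd(theta, X, Y, grad, lr=0.1, batch_size=32, epochs=10):
    """grad(theta, X_batch, Y_batch) returns the gradient on one mini-batch."""
    n = len(X)
    for _ in range(epochs):                      # one epoch = one full pass
        perm = np.random.permutation(n)          # reshuffle every epoch
        for start in range(0, n, batch_size):
            idx = perm[start:start + batch_size]
            theta = theta - lr * grad(theta, X[idx], Y[idx])
    return theta

X = np.random.default_rng(0).normal(size=(100, 2))
Y = X @ np.array([1.0, -2.0])
lsq_grad = lambda th, Xb, Yb: 2 * Xb.T @ (Xb @ th - Yb) / len(Yb)
print(minibatch_sgd(np.zeros(2), X, Y, lsq_grad))  # close to [1, -2]
```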
Batch Normalization
Batch Normalization normalizes each input across the samples of a mini-batch:

$$\hat{x}_i = \frac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}$$

Where:

$x_i$ is the input
$\mu_B$ is the mean of the mini-batch
$\sigma_B^2$ is the variance of the mini-batch
$\epsilon$ is a small constant for numerical stability
Layer Normalization
Layer Normalization normalizes the inputs across the features for each sample in a
batch, rather than across the batch for each feature.
$$\mu = \frac{1}{H}\sum_{i=1}^{H} x_i$$

$$\sigma^2 = \frac{1}{H}\sum_{i=1}^{H}(x_i - \mu)^2$$

$$\hat{x}_i = \frac{x_i - \mu}{\sqrt{\sigma^2 + \epsilon}}$$

$$y_i = \gamma\hat{x}_i + \beta$$

Where:

$H$ is the number of features (hidden units) in the layer
$\gamma$ and $\beta$ are learnable scale and shift parameters
Implementation:
Unlike Batch Normalization, Layer Normalization’s behavior is the same during training
and inference, as it doesn’t depend on batch statistics.
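A minimal sketch of the computation (per sample over the feature axis):

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    """Normalize each row (sample) over its H features, then scale and shift."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    x_hat = (x - mu) / np.sqrt(var + eps)
    return gamma * x_hat + beta

x = np.array([[1.0, 2.0, 3.0], [10.0, 20.0, 30.0]])       # batch of 2, H = 3
print(layer_norm(x, gamma=np.ones(3), beta=np.zeros(3)))  # identical rows after normalization
```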
Dropout
A regularization technique
During training:

Each neuron is kept active with probability $p$ and dropped (its output set to zero) with probability $1 - p$
The value of $p$ is generally the same across all neurons of the same layer but can be different for different layers
During inference/classification:

All neurons are active, and activations are scaled so that their expected values match those seen during training
Cross-validation (N-fold)

The dataset is split into N equal parts (folds)
N-1 parts are used for training and 1 part is used for validation
This process is repeated N times, with each part being used for validation once
Finally, the performance of the model is evaluated by averaging the results from all
N folds
Decision Trees
At each node, a question is asked about the data that splits the data into two or
more non-overlapping subsets.
Example question: is $x_j \le \theta$? The data is then split into two subsets, one where $x_j \le \theta$ and one where $x_j > \theta$.
The process is repeated till we reach leaf node, that classifies the datapoint to a
region of the feature space.
We can do regression or classification of the test datapoint based on the training datapoints that lie in the same region.
Choose a data dimension $j$ and a threshold $\theta$ that minimize a splitting metric, e.g., Gini impurity for classification or mean squared error for regression. Do this recursively.
Gini Impurity
Gini Impurity is a measure of impurity or disorder in a set of data points. It’s used to
determine the quality of a split in a decision tree. The goal is to minimize the Gini
Impurity when growing the tree.
$$G(\text{set}) = \sum_{i=1}^{K}\sum_{j \neq i} p_i p_j$$

$$G(\text{set}) = 1 - \sum_{i=1}^{K} p_i^2$$
Where:

$p_i$ is the probability of picking a data point of class $i$ ($p_j$ is defined similarly) when drawing uniformly at random from the set:

$$p_i = \frac{n_i}{N}$$

where $n_i$ is the number of data points of class $i$ and $N$ is the total number of data points in the set.
To evaluate a potential binary split, we calculate the weighted average of the Gini
Impurities for the resulting subsets:
$$G_{\text{split}} = \frac{n_{\text{left}}}{n} G_{\text{left}} + \frac{n_{\text{right}}}{n} G_{\text{right}}$$
Where:
$n_{\text{left}}$ and $n_{\text{right}}$ are the number of instances in the left and right subsets
$G_{\text{left}}$ and $G_{\text{right}}$ are the Gini impurities of the left and right subsets
The split with the lowest Gsplit is chosen as the best split for that node.
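A small sketch of these computations:

```python
import numpy as np

def gini(labels):
    """Gini impurity G = 1 - sum_i p_i^2 of one node."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def gini_split(y_left, y_right):
    """Weighted Gini impurity of a binary split."""
    n = len(y_left) + len(y_right)
    return (len(y_left) / n) * gini(y_left) + (len(y_right) / n) * gini(y_right)

y = np.array([0, 0, 0, 1, 1, 1])
print(gini(y))                   # 0.5 for a 50/50 node
print(gini_split(y[:3], y[3:]))  # 0.0 for a perfect split
```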
For regression tasks, Mean Squared Error (MSE) is commonly used as the splitting
criterion. MSE measures the average squared difference between the predicted and
actual values.
The Mean Squared Error for a set of data points is calculated as:
$$\text{MSE}(\text{set}) = \frac{1}{|\text{set}|}\sum_{i \in \text{set}}(y_i - \hat{y})^2$$

Where $\hat{y}$ is the mean target value of the set:

$$\hat{y} = \frac{1}{|\text{set}|}\sum_{i \in \text{set}} y_i$$
To evaluate a potential binary split, we calculate the weighted average of the MSEs for
the resulting subsets:
$$\text{MSE}_{\text{split}} = \frac{n_{\text{left}}}{n}\text{MSE}_{\text{left}} + \frac{n_{\text{right}}}{n}\text{MSE}_{\text{right}}$$
Where:
$n_{\text{left}}$ and $n_{\text{right}}$ are the number of instances in the left and right subsets
$\text{MSE}_{\text{left}}$ and $\text{MSE}_{\text{right}}$ are the Mean Squared Errors of the left and right subsets
The split with the lowest MSEsplit is chosen as the best split for that node in the
regression tree.
Pruning is the process of removing branches from a decision tree to prevent overfitting.
Overfitting occurs when the tree is too complex and fits the training data too closely,
capturing noise and details that are specific to the training data rather than generalizing
to new, unseen data.
Pruning helps to simplify the tree, making it more robust and reducing its complexity.
This can lead to better generalization to new data.
There are two main types of pruning:

Pre-pruning (early stopping): stops the growth of the tree before it fully fits the training data.

Post-pruning: first grows a full tree and then removes branches that do not provide significant predictive power. A simple criterion: calculate the accuracy of the tree (e.g., on a validation set) with and without a node, and remove the node if accuracy does not degrade.
Bagging (Bootstrap Aggregating)

Multiple datasets are created by sampling from the original dataset with replacement (bootstrap sampling). Each new dataset will have the same number of samples as the original dataset, but some samples will be repeated and some will be excluded. A separate model is trained on each bootstrapped dataset.

For classification, the final prediction is made by taking the majority vote of the predictions of all models.
For regression, the final prediction is made by taking the average of the predictions of all models.

Random Forest

A random forest applies bagging to decision trees (additionally considering a random subset of features at each split). The final prediction is made by taking the majority vote of the predictions of all decision trees.
Boosting
Unlike bagging, each model depends on the previous ones and its contribution to the
final prediction is weighted differently.
$$H_T(x) = \sum_{t=1}^{T}\alpha_t h_t(x)$$

where $h_t$ are the weak learners and $\alpha_t$ are their weights.

The risk of the ensemble is:

$$R(H) = \frac{1}{n}\sum_{i=1}^{n} L(H(x_i), y_i)$$

where $L$ is the loss function.
At each step, boosting greedily adds the weak learner that most decreases the risk of the current ensemble $H_{T-1}$. Using a first-order (functional gradient) approximation of the risk, the next weak learner is chosen as:

$$h_T = \arg\min_{h \in H}\sum_{i=1}^{n}\frac{\partial R(H_{T-1}(x_i))}{\partial H_{T-1}(x_i)}\cdot h(x_i)$$

Defining the pseudo-residuals $r_i = -\frac{\partial R(H_{T-1}(x_i))}{\partial H_{T-1}(x_i)}$, this is equivalent to:

$$h_T = \arg\max_{h \in H}\sum_{i=1}^{n} r_i\cdot h(x_i)$$
This formulation shows that each new weak learner hT is fit to the pseudo-residuals of
the previous ensemble model, effectively focusing on the errors of the combined
previous models.
XGBoost - Gradient boosted regression tree
The algorithm builds these trees sequentially, with each new tree aiming to correct the
errors of the combined previous trees.
2. XGBoost formulates the optimization problem for finding the next weak learner with an additional quadratic (second-order) term that keeps the new tree's predictions small:

$$h_T = \arg\min_{h \in H}\sum_{i=1}^{n}\left(-r_i\cdot h(x_i) + \frac{1}{2}h(x_i)^2\right)$$

3. We can rewrite this by defining $\hat{y}_i = r_i$:

$$h_T = \arg\min_{h \in H}\sum_{i=1}^{n}\left(-\hat{y}_i\cdot h(x_i) + \frac{1}{2}h(x_i)^2\right)$$

where $\hat{y}_i = -\frac{\partial R(H_{T-1}(x_i))}{\partial H_{T-1}(x_i)}$.

4. Completing the square (adding the constant $\frac{1}{2}\hat{y}_i^2$ does not change the argmin):

$$h_T = \arg\min_{h \in H}\sum_{i=1}^{n}\left(h(x_i) - \hat{y}_i\right)^2$$

which is the standard squared error regression problem with new labels $\hat{y}_i$.

5. For the squared error loss $L(y_i, H(x_i)) = \frac{1}{2}(y_i - H(x_i))^2$, the pseudo-residual is exactly the ordinary residual:

$$\hat{y}_i = y_i - H_{T-1}(x_i)$$

where $y_i$ is the true label and $H_{T-1}(x_i)$ is the prediction of the ensemble up to the previous iteration. For a general loss, $\hat{y}_i$ is obtained from the first-order Taylor expansion of the loss around the current predictions.
Algorithm
1. Initialize the model with a constant value:

$$H_0(x) = \arg\min_\gamma\sum_{i=1}^{n} L(y_i, \gamma)$$
2. For $t = 1, \ldots, T$:

a. Update labels (compute the residuals):

$$\hat{y}_i = y_i - H_{t-1}(x_i) \quad \text{for } i = 1, \ldots, n$$

b. Fit a regression tree to the updated labels, giving terminal regions $R_{jt}$, $j = 1, \ldots, J_t$.

c. Compute the optimal leaf values:

$$\gamma_{jt} = \arg\min_\gamma\sum_{x_i \in R_{jt}} L\left(y_i,\, H_{t-1}(x_i) + \gamma\right)$$

This step computes the optimal value $\gamma_{jt}$ (the adjustment) for each leaf region $R_{jt}$, minimizing the loss function $L$ for observations in that region. This "boost" is added to the current model predictions in the next step.
d. Update the model:

$$H_t(x) = H_{t-1}(x) + \nu\sum_{j=1}^{J_t}\gamma_{jt}\, I(x \in R_{jt})$$
where $\nu$ is the learning rate ($0 < \nu \le 1$) that controls how much of the new tree's contribution is added to the current model, and $\gamma_{jt}$ is the prediction adjustment for each leaf region $R_{jt}$.

3. The final model is:

$$H(x) = H_T(x) = H_0(x) + \sum_{t=1}^{T}\nu\sum_{j=1}^{J_t}\gamma_{jt}\, I(x \in R_{jt})$$
The final model is the sum of all the contributions from the individual trees, with each
tree’s contribution scaled by the learning rate ν .
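A short sketch using scikit-learn's gradient-boosted trees, which implement this algorithm (XGBoost itself is a separate library with a similar interface; `learning_rate` plays the role of $\nu$):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(300, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=300)   # noisy sine, toy data

model = GradientBoostingRegressor(n_estimators=200, learning_rate=0.1,
                                  max_depth=2)     # shallow trees as weak learners
model.fit(X, y)
print(model.predict([[0.0], [1.5]]))               # roughly sin(0), sin(1.5)
```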
Adaboost
Recall
Recall from our discussion on boosting we defined the ensemble model HT (x) as:
$$H_T(x) = \sum_{t=1}^{T}\alpha_t h_t(x)$$

and that the next weak learner is chosen to align with the pseudo-residuals:

$$h_T = \arg\max_{h \in H}\sum_{i=1}^{n} r_i\cdot h(x_i)$$

where

$$r_i = -\frac{\partial R(H_{T-1}(x_i))}{\partial H_{T-1}(x_i)}$$
Problem setting
AdaBoost uses the exponential loss, so the risk is:

$$R(H) = \frac{1}{n}\sum_{i=1}^{n}\exp(-y_i H(x_i))$$
Solution
We know:
$$r_i = -\frac{\partial R(H_{T-1}(x_i))}{\partial H_{T-1}(x_i)} = -\frac{\partial}{\partial H_{T-1}(x_i)}\left(\frac{1}{n}\sum_{j=1}^{n}\exp(-y_j H_{T-1}(x_j))\right)$$

$$= -\frac{1}{n}\frac{\partial}{\partial H_{T-1}(x_i)}\exp(-y_i H_{T-1}(x_i))$$

$$= -\frac{1}{n}\exp(-y_i H_{T-1}(x_i))(-y_i)$$

$$= \frac{1}{n}\, y_i\exp(-y_i H_{T-1}(x_i))$$

Define the normalized weights:

$$w_i = \frac{\exp(-y_i H_{T-1}(x_i))}{\sum_{j=1}^{n}\exp(-y_j H_{T-1}(x_j))}$$

Note that $\sum_{i=1}^{n} w_i = 1$.
Substituting into the weak-learner objective:

$$h_T = \arg\max_{h \in H}\sum_{i=1}^{n} r_i\cdot h(x_i)$$

$$= \arg\max_{h \in H}\sum_{i=1}^{n} y_i\exp(-y_i H_{T-1}(x_i))\cdot h(x_i)$$

$$h_T = \arg\max_{h \in H}\sum_{i=1}^{n} w_i\, y_i\, h(x_i)$$
The weak learner $h_T$ is chosen to maximize the weighted correlation between its predictions and the true labels. We then optimize for the weight $\alpha$ that minimizes the weighted loss.
To find the step size $\alpha_{t+1}$, we minimize the risk with respect to $\alpha$:

$$\alpha_{t+1} = \arg\min_\alpha\sum_{i=1}^{n} e^{-y_i(H_t(x_i) + \alpha h(x_i))}$$

Setting the derivative to zero gives the closed-form solution:

$$\alpha_{t+1} = \frac{1}{2}\log\frac{1-\epsilon}{\epsilon}$$

where $\epsilon = \sum_{i:\, h(x_i) \neq y_i} w_i$ is the weighted error of $h$.
The weights for the next round are updated as:

$$w_i^{(t+1)} = w_i^{(t)}\, e^{-\alpha_{t+1}\, y_i\, h(x_i)}$$
Algorithm
1. Initialize the weights:

$$w_i^{(1)} = \frac{1}{n}\quad\text{for } i = 1, \ldots, n$$
2. For t = 1 to T (number of weak learners):
a. Train a weak learner ht using the weighted training data:
Choose any suitable model as the weak learner, such as a decision tree, MLP, or
logistic regression.
For each training example $i$, weight its contribution to the loss function by $w_i^{(t)}$.
When calculating impurity measures (e.g., Gini index or entropy), use weighted
sums.
Modify the loss function to include sample weights. For example, with binary
cross-entropy loss:
$$L = -\sum_{i=1}^{n} w_i^{(t)}\left[y_i\log(h_t(x_i)) + (1 - y_i)\log(1 - h_t(x_i))\right]$$
The resulting ht will focus more on correctly classifying examples with higher
weights.
b. Compute the weighted error of the weak learner:

$$\epsilon_t = \sum_{i:\, h_t(x_i) \neq y_i} w_i^{(t)}$$

c. Compute the weak learner's weight:

$$\alpha_t = \frac{1}{2}\log\left(\frac{1-\epsilon_t}{\epsilon_t}\right)$$

d. Update the sample weights:

$$w_i^{(t+1)} = w_i^{(t)}\exp(-\alpha_t\, y_i\, h_t(x_i))$$
e. Normalize the weights:

$$w_i^{(t+1)} = \frac{w_i^{(t+1)}}{\sum_{j=1}^{n} w_j^{(t+1)}}$$

3. The final classifier is:

$$H(x) = \text{sign}\left(\sum_{t=1}^{T}\alpha_t h_t(x)\right)$$
This algorithm iteratively builds an ensemble of weak learners, each time focusing more
on the examples that were misclassified in previous iterations. The final classifier is a
weighted combination of all the weak learners, where the weights are determined by the
performance of each weak learner.
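A short sketch using scikit-learn's AdaBoost implementation (decision stumps are the default weak learner; the XOR-like toy data is ours):

```python
import numpy as np
from sklearn.ensemble import AdaBoostClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 2))
y = (X[:, 0] * X[:, 1] > 0).astype(int)     # XOR-like labels: no single stump suffices

clf = AdaBoostClassifier(n_estimators=100)  # T = 100 weak learners
clf.fit(X, y)
print(clf.score(X, y))                      # training accuracy of the ensemble
```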
Unsupervised learning
In unsupervised learning, we don’t have labeled data. We want to find structure in the
data.