Unit-2
1. Maximum Likelihood Estimation (MLE)
Maximum likelihood estimation (MLE) is a statistical approach for estimating a model's
parameters in machine learning. The idea is to find the values of the model parameters that
maximize the likelihood of the observed data, i.e., the values under which the observed data is
most probable.
Let’s look at an example to understand MLE better. Assume that we want to estimate the average
height of a city’s population. However, because of the sheer size of the population, we cannot
calculate the true average height of the population. So, we estimate the average height as follows:
• Defining a statistical model: We start by assuming that the height of the population follows
a normal distribution. This implies that most people are close to the average height, with
fewer people being much shorter or much taller than average.
• Collecting the sample: We then collect a sample of heights from the population and record
the observed values.
• Calculating the likelihood function: Given the population’s average height, the likelihood
function measures the probability of observing the provided data given the parameters in our
model. In our case, the model’s parameters are the normal distribution’s mean and standard
deviation. For computational reasons, the log-likelihood function is often used instead of the
likelihood function.
• Maximizing the likelihood function: MLE aims to find the average height that maximizes
the log-likelihood function of obtaining the observed sample and makes the observed heights
most probable.
Estimation process
We can now model the average height with a normal distribution whose parameters are the
values that maximize the likelihood of the observed sample.
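As a small illustration (not part of the source text), the whole estimation process for this height example can be sketched in a few lines of Python; the sample heights below are made-up values, and for a normal model the numerical optimum coincides with the sample mean and (population) standard deviation.

```python
# A minimal sketch of MLE for the height example: assume a normal model and
# find the mean and standard deviation that maximize the log-likelihood.
import numpy as np
from scipy import stats, optimize

heights = np.array([162.0, 170.5, 158.3, 175.2, 168.9, 171.4, 165.0, 173.8])

def negative_log_likelihood(params):
    mean, std = params
    if std <= 0:                      # standard deviation must be positive
        return np.inf
    return -np.sum(stats.norm.logpdf(heights, loc=mean, scale=std))

# Minimizing the negative log-likelihood is the same as maximizing the likelihood.
result = optimize.minimize(negative_log_likelihood, x0=[160.0, 10.0], method="Nelder-Mead")
mle_mean, mle_std = result.x
print(mle_mean, mle_std)   # close to heights.mean() and heights.std(ddof=0)
```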
In supervised machine learning, we use labeled data to train the model’s parameters. The
training data consists of input features and the corresponding output labels. During the training
phase, we aim to find the model parameters that best capture the patterns in the labeled data.
MLE helps fine-tune the machine learning models. In the training phase, we adjust the model’s
parameters to maximize the likelihood of the labeled data. Alternatively, we can use the negative
log-likelihood as the loss function. A loss function quantifies the difference between the model's
prediction and the actual output, for example:
L = |y − ŷ|
Here, y represents the actual output and ŷ represents the estimated value. We aim to minimize
this loss function L during training to reach an accurate and effective model. Note that minimizing
the negative log-likelihood is equivalent to maximizing the likelihood, and this is a common way
to train machine learning models.
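As a brief aside (an assumption of this sketch, not a claim from the source): when the noise in a regression model is assumed Gaussian, the negative log-likelihood differs from the sum of squared errors only by terms that do not depend on the weights, so minimizing squared error is itself an MLE procedure. A minimal check:

```python
# A minimal sketch: for y = Xw + Gaussian noise, the weights that minimize the
# sum of squared errors also minimize the negative log-likelihood.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
true_w = np.array([1.5, -2.0])
y = X @ true_w + rng.normal(scale=0.5, size=50)

def negative_log_likelihood(w, sigma=0.5):
    residuals = y - X @ w
    n = len(y)
    # 0.5 * SSE / sigma^2 plus a constant that does not depend on w
    return 0.5 * np.sum(residuals ** 2) / sigma**2 + n * np.log(sigma * np.sqrt(2 * np.pi))

w_ls = np.linalg.lstsq(X, y, rcond=None)[0]          # minimizes squared error
# The same w also minimizes the negative log-likelihood (check by perturbing it):
print(negative_log_likelihood(w_ls) <= negative_log_likelihood(w_ls + 0.1))
```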
Reference-https://learningdaily.dev/understanding-maximum-likelihood-estimation-in-
machine-learning-22b915c3e05a
2.Least Square method
Least Square method is a fundamental mathematical technique widely used in data analysis,
statistics, and regression modeling to identify the best-fitting curve or line for a given set of
data points. This method ensures that the overall error is reduced, providing a highly accurate
model for predicting future data trends.
In statistics, when the data can be represented on a cartesian plane by using the independent
and dependent variable as the x and y coordinates, it is called scatter data. This data might not
be useful for making interpretations or for predicting the value of the dependent variable at a
given value of the independent variable. So, we try to get an equation of a line that fits best to the given data
points with the help of the Least Square Method.
Least Square Method is used to derive a generalized linear equation between two variables,
where the independent variable is represented as the x-coordinate and the dependent variable as
the y-coordinate in a 2D cartesian coordinate system. Initially, the known values are marked on a plot.
The plot obtained at this point is called a scatter plot.
Then, we try to represent all the marked points as a straight line or a linear equation. The
equation of such a line is obtained with the help of the Least Square method. This is done to
get the value of the dependent variable for an independent variable for which the value was
initially unknown. This helps us to make predictions for the value of dependent variable.
This method aims at minimizing the sum of squares of deviations as much as possible. The line
obtained from such a method is called a regression line or line of best fit.
Formula for Least Square Method
Least Square Method formula is used to find the best-fitting line through a set of data points.
For a simple linear regression, which is a line of the form y=mx+c, where y is the dependent
variable, x is the independent variable, m is the slope of the line, and c is the y-intercept, the
formulas to calculate the slope (m) and intercept (c) of the line are derived from the following
equations:
1. Slope (m) Formula: m = [n(∑xy) − (∑x)(∑y)] / [n(∑x²) − (∑x)²]
2. Intercept (c) Formula: c = [(∑y) − m(∑x)] / n
Where:
• n is the number of data points,
• ∑xy is the sum of the product of each pair of x and y values,
• ∑x is the sum of all x values,
• ∑y is the sum of all y values,
• ∑x² is the sum of the squares of x values.
The steps to find the line of best fit by using the least square method are discussed below:
• Step 1: Denote the independent variable values as xi and the dependent ones as yi.
• Step 2: Calculate the average values of xi and yi as X and Y.
• Step 3: Presume the equation of the line of best fit as y = mx + c, where m is the slope of
the line and c represents the intercept of the line on the Y-axis.
• Step 4: The slope m can be calculated from the following formula:
m = [Σ (X – xi)×(Y – yi)] / Σ(X – xi)²
• Step 5: The intercept c is calculated from the following formula:
c = Y – mX
Thus, we obtain the line of best fit as y = mx + c, where values of m and c can be calculated
from the formulae defined above.
These formulas are used to calculate the parameters of the line that best fits the data according
to the criterion of the least squares, minimizing the sum of the squared differences between the
observed values and the values predicted by the linear model.
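A minimal sketch applying the step-by-step formulas above (the x and y values are illustrative; numpy's built-in polyfit is used only as a cross-check):

```python
# A minimal sketch of the least square method using the step-by-step formulas above.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])      # independent variable (illustrative values)
y = np.array([2.1, 4.3, 6.2, 8.4, 10.1])     # dependent variable

X_bar, Y_bar = x.mean(), y.mean()

# Step 4: slope m = Σ(X - xi)(Y - yi) / Σ(X - xi)²
m = np.sum((X_bar - x) * (Y_bar - y)) / np.sum((X_bar - x) ** 2)

# Step 5: intercept c = Y - mX
c = Y_bar - m * X_bar

print(f"line of best fit: y = {m:.3f}x + {c:.3f}")
# Cross-check with numpy's built-in least squares fit:
print(np.polyfit(x, y, deg=1))                # [slope, intercept]
```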
Least Square Method Graph
Let us have a look at how the data points and the line of best fit obtained from the Least Square
method look when plotted on a graph.
The red points in the above plot represent the data points for the sample data
available. Independent variables are plotted as x-coordinates and dependent ones are
plotted as y-coordinates. The equation of the line of best fit obtained from the Least Square
method is plotted as the red line in the graph.
We can see from the above graph how the Least Square method helps us to find a line that best
fits the given data points, and hence can be used to make further predictions about the value of
the dependent variable where it is not known initially.
Limitations of the Least Square Method
The Least Square method assumes that the data is evenly distributed and doesn’t contain any
outliers for deriving a line of best fit. But, this method doesn’t provide accurate results for
unevenly distributed data or for data containing outliers.
Reference-https://www.geeksforgeeks.org/least-square-method/
3.Robust linear regression
Robust linear regression is a type of regression analysis designed to overcome the limitations of
ordinary least squares (OLS) regression when the data contains outliers or is not normally
distributed. Unlike OLS, which minimizes the sum of squared residuals, robust linear regression
minimizes a different function of the residuals to reduce the influence of outliers on the
regression model.
Robust regression is preferred over ordinary least squares when the data has issues such as:
1. Outliers: OLS regression is highly sensitive to outliers. A few extreme values can
significantly affect the estimated coefficients.
2. Non-normal errors: If the residuals are not normally distributed, OLS estimations might
be biased or inefficient.
3. Data with heavy tails: Data that do not follow a normal distribution (heavy-tailed
distributions) can lead to unreliable OLS estimates.
Applications
• Robust regression is useful in real-world applications where data may not be perfectly
clean, such as in economics, engineering, finance, and environmental science.
• It is particularly useful when you expect a high level of noise or outliers in the data that
could distort the results from standard linear regression methods.
Summary
Robust linear regression methods are essential when dealing with real-world data that may
contain outliers or is not normally distributed. By using robust methods like Huber Regression,
RANSAC, or LAD, you can obtain more reliable estimates in the presence of outliers or non-
normal errors.
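A minimal sketch comparing ordinary least squares with one of the robust methods mentioned above, scikit-learn's HuberRegressor, on synthetic data containing a few outliers:

```python
# A minimal sketch: ordinary least squares vs. a robust (Huber) fit on data with outliers.
import numpy as np
from sklearn.linear_model import LinearRegression, HuberRegressor

rng = np.random.default_rng(42)
X = rng.uniform(0, 10, size=(100, 1))
y = 3.0 * X.ravel() + 1.0 + rng.normal(scale=1.0, size=100)

outlier_idx = np.argsort(X.ravel())[-5:]     # make the 5 right-most points outliers
y[outlier_idx] += 50.0

ols = LinearRegression().fit(X, y)
huber = HuberRegressor().fit(X, y)

# The OLS slope is pulled upward by the outliers; the Huber slope stays near 3.
print("OLS slope:  ", ols.coef_[0])
print("Huber slope:", huber.coef_[0])
```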
4.RIDGE REGRESSION
Introduction to Ridge Regression
Ridge regression is one of the types of linear regression in which a small amount of bias is
introduced so that we can get better long-term predictions.
In this technique, the cost function is altered by adding a penalty term to it. The amount
of bias added to the model is called the Ridge Regression penalty. It is calculated by
multiplying lambda (λ) by the squared weight of each individual feature.
The equation for the cost function in ridge regression will be:
Cost function = Σ(yi − ŷi)² + λ Σ wj²
where ŷi is the model’s prediction, wj are the feature weights, and λ Σ wj² is the penalty term.
o In the above equation, the penalty term regularizes the coefficients of the model;
hence ridge regression shrinks the magnitudes of the coefficients, which decreases the
complexity of the model.
o As we can see from the above equation, if the values of λ tend to zero, the equation
becomes the cost function of the linear regression model. Hence, for the minimum
value of λ, the model will resemble the linear regression model.
o A general linear or polynomial regression will fail if there is high collinearity between the
independent variables, so to solve such problems, Ridge regression can be used.
o It also helps to solve problems where we have more parameters than samples.
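A minimal sketch of ridge regression on nearly collinear features using scikit-learn (synthetic data; the alpha parameter plays the role of λ above):

```python
# A minimal sketch: ridge regression shrinks coefficients when features are highly collinear.
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(0)
x1 = rng.normal(size=100)
x2 = x1 + rng.normal(scale=0.01, size=100)    # nearly collinear with x1
X = np.column_stack([x1, x2])
y = 2.0 * x1 + rng.normal(scale=0.1, size=100)

print(LinearRegression().fit(X, y).coef_)     # unstable estimates under near-collinearity
print(Ridge(alpha=1.0).fit(X, y).coef_)       # shrunk, more stable coefficients
```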
Reference-https://www.javatpoint.com/regularization-in-machine-learning
5. Bayesian Linear Regression
Bayesian Regression
Bayesian regression is a type of linear regression that uses Bayesian statistics to estimate the
unknown parameters of a model. It uses Bayes’ theorem to estimate the likelihood of a set of
parameters given observed data. The goal of Bayesian regression is to find the best estimate of
the parameters of a linear model that describes the relationship between the independent and
the dependent variables.
The main difference between traditional linear regression and Bayesian regression is the
underlying assumption regarding the data-generating process. Traditional linear regression
assumes that data follows a Gaussian or normal distribution, while Bayesian regression has
stronger assumptions about the nature of the data and puts a prior probability distribution on
the parameters. Bayesian regression also enables more flexibility as it allows for additional
parameters or prior distributions, and can be used to construct an arbitrarily complex model
that explicitly expresses prior beliefs about the data. Additionally, Bayesian regression
provides more accurate predictive measures from fewer data points and is able to construct
estimates for uncertainty around the estimates. On the other hand, traditional linear regressions
are easier to implement and generally faster with simpler models and can provide good results
when the assumptions about the data are valid.
Bayesian Regression can be very useful when we have insufficient data in the dataset or the
data is poorly distributed. The output of a Bayesian Regression model is obtained from a
probability distribution, as compared to regular regression techniques where the output is just
obtained from a single value of each attribute.
Bayes Theorem
Bayes Theorem gives the relationship between an event’s prior probability and its posterior
probability after evidence is taken into account. It states that the conditional probability of an
event is equal to the probability of the evidence given the event, multiplied by the prior
probability of the event, divided by the probability of the evidence.
i.e., P(A|B) = [P(B|A) × P(A)] / P(B)
Where P(A|B) is the probability of event A occurring given that event B has already occurred,
P(B|A) is the probability of event B occurring given that event A has already occurred, P(A) is
the probability of event A occurring and P(B) is the probability of event B occurring.
Maximum Likelihood Estimation (MLE)
MLE is a method used to estimate the parameters of a statistical model by maximizing the
likelihood function. It seeks to find the parameter values that make the observed data most
probable under the assumed model. MLE does not incorporate any prior information or
assumptions about the parameters, and it provides point estimates of the parameters.
Maximum A Posteriori (MAP) Estimation
MAP estimation is a Bayesian approach that combines prior information with the likelihood
function to estimate the parameters. It involves finding the parameter values that maximize the
posterior distribution, which is obtained by applying Bayes’ theorem. In MAP estimation, a
prior distribution is specified for the parameters, representing prior beliefs or knowledge about
their values. The likelihood function is then multiplied by the prior distribution to obtain the
joint distribution, and the parameter values that maximize this joint distribution are selected as
the MAP estimates. MAP estimation provides point estimates of the parameters, similar to
MLE, but incorporates prior information.
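A minimal sketch of Bayesian linear regression using scikit-learn's BayesianRidge, which places Gaussian priors on the weights and reports predictive uncertainty (the data is synthetic):

```python
# A minimal sketch of Bayesian linear regression with scikit-learn's BayesianRidge.
import numpy as np
from sklearn.linear_model import BayesianRidge

rng = np.random.default_rng(1)
X = rng.uniform(-3, 3, size=(30, 1))
y = 0.5 * X.ravel() + 1.0 + rng.normal(scale=0.3, size=30)

model = BayesianRidge()
model.fit(X, y)

X_new = np.array([[0.0], [2.0]])
mean, std = model.predict(X_new, return_std=True)   # predictive mean and uncertainty
print(mean, std)
```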
Reference-https://www.geeksforgeeks.org/implementation-of-bayesian-regression/
6.Linear Models for Classification – Discriminant Functions
Introduction to Linear Models for Classification
Linear models for classification are used to separate data points of different classes using a linear
decision boundary. These models are particularly useful when the relationship between the input
features and the class labels is approximately linear. One of the fundamental concepts in linear
classification is the discriminant function, which helps decide the class to which a data point
belongs.
Discriminant Functions
A discriminant function is a function that assigns a score to each data point, allowing the
classification model to decide which class the data point belongs to based on these scores. The
basic idea is to create a linear function that outputs a different score for each class, and then
assign the class with the highest score to the data point.
Common forms of discriminant-based classification include the following (a minimal sketch of a
linear discriminant classifier is given after this list):
• Linear discriminant functions: The simplest form of a discriminant function is linear. It
assumes that the decision boundary between classes is a linear function of the input features.
• Quadratic Discriminant Analysis (QDA): Unlike linear discriminant functions that assume a
linear boundary, QDA allows for more complex boundaries, which can provide better
classification accuracy for certain datasets.
• Logistic regression: Logistic regression models the probability that an input belongs to a
particular class using the logistic (sigmoid) function applied to a linear combination of the
input features, so its decision boundary is also linear in the input space.
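The sketch promised above: a linear discriminant classifier with hand-chosen, illustrative weights (in practice the weights would be learned, for example with scikit-learn's LinearDiscriminantAnalysis):

```python
# A minimal sketch: assign each point to the class whose linear discriminant score is highest.
import numpy as np

# Illustrative, hand-chosen weights and biases for a 3-class problem with 2 features.
W = np.array([[ 1.0, -0.5],     # class 0
              [-0.3,  0.8],     # class 1
              [ 0.2,  0.1]])    # class 2
b = np.array([0.0, 0.5, -0.2])

def classify(x):
    scores = W @ x + b          # one linear discriminant score per class
    return int(np.argmax(scores))

print(classify(np.array([2.0, 1.0])))   # index of the class with the highest score
```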
Laplacian Approximation
The Laplacian approximation is a technique used in statistics and machine learning to simplify
complex probability distributions, especially in the context of Bayesian inference. Bayesian
inference involves updating our beliefs about a parameter (like the average height of people in a
city) based on observed data (like a sample of people’s heights).
However, the exact calculations for updating these beliefs can be very complicated, particularly
when dealing with large datasets or complex models. The Laplacian approximation provides a
way to make these calculations easier by approximating the complex "posterior" distribution
with a simpler one, usually a bell-shaped curve (Gaussian distribution).
1. Bayesian Inference:
o Bayesian inference is a method of statistical inference in which we update our
beliefs or knowledge about a parameter based on new data.
o We start with a prior belief about what the parameter might be (the prior), and
after seeing the data, we update this belief to a new one (the posterior).
2. The Challenge of Complex Distributions:
o In many real-world situations, the updated belief (posterior) after seeing the data
can be very complex and not easy to calculate exactly.
o This complexity can make it difficult to answer questions about the data, like
predicting future events or understanding the distribution of the parameter.
3. Simplifying the Problem:
o The Laplacian approximation simplifies this problem by assuming that, around
the most likely value of the parameter (where the posterior peaks), the shape of
the distribution is approximately like a bell curve.
o Even if the true distribution is not exactly a bell curve, this assumption makes the
problem much more manageable because bell curves (Gaussians) are
mathematically simple and well-understood.
4. How Laplacian Approximation Works:
o Imagine you have a complicated, lumpy landscape (which represents the complex
posterior distribution). You want to find the highest point (the most likely value of
the parameter) and understand the surrounding area.
o The Laplacian approximation finds the highest point and then approximates the
area around it as a smooth, rounded hill (a bell curve).
o By doing this, it allows us to quickly estimate probabilities and make inferences
without having to deal with all the lumps and bumps of the actual landscape.
Scenario: Let’s say you want to estimate the average weight of apples in an orchard. You start
with a belief (prior) about the average weight based on previous knowledge. Then, you collect a
sample of apples and weigh them (data). After weighing the apples, you want to update your
belief about the average weight of all apples in the orchard (posterior).
• Before Seeing the Data: You might believe that the average weight of apples is around
150 grams, but you are not certain. This is your prior belief.
• After Seeing the Data: You weigh 20 apples, and the results suggest that the average
weight might be closer to 160 grams. You now have a new belief that is more informed
by the data. This is your posterior belief.
Now, suppose the exact shape of this updated belief is complex and difficult to describe. The
Laplacian approximation helps by assuming that around the most likely average weight (160
grams in this case), the shape of the distribution is roughly bell-shaped. This makes further
calculations, like predicting the weight of a new apple, much simpler.
By using the Laplacian approximation, we can quickly approximate probabilities and make
decisions without getting bogged down in complex mathematics. It's like drawing a smooth
curve that fits well around the most likely values and using that simple curve for analysis instead
of the true, more complex curve.
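A minimal numerical sketch of the idea (not tied to the apple numbers above): take an unnormalized log-posterior, find its peak, and build the Gaussian whose variance is the negative inverse of the curvature at that peak, which is exactly what the Laplacian approximation does.

```python
# A minimal sketch of the Laplacian approximation for a one-dimensional posterior.
import numpy as np
from scipy import optimize

def log_posterior(theta):
    # Unnormalized log-posterior of a coin's heads-probability:
    # a Beta(2, 2) prior combined with 7 heads and 3 tails gives a Beta(9, 5) posterior.
    if theta <= 0 or theta >= 1:
        return -np.inf
    return 8 * np.log(theta) + 4 * np.log(1 - theta)

# 1) Find the peak (the MAP estimate).
res = optimize.minimize_scalar(lambda t: -log_posterior(t),
                               bounds=(1e-6, 1 - 1e-6), method="bounded")
theta_map = res.x

# 2) Curvature at the peak via a finite-difference second derivative.
h = 1e-5
second_deriv = (log_posterior(theta_map + h) - 2 * log_posterior(theta_map)
                + log_posterior(theta_map - h)) / h**2

# 3) The Laplacian approximation is a Gaussian centred at the peak.
approx_mean = theta_map
approx_std = np.sqrt(-1.0 / second_deriv)
print(approx_mean, approx_std)    # roughly 0.667 and 0.136
```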
10.Bayesian Logistic Regression
Introduction to Bayesian Logistic Regression
Bayesian logistic regression is a method that combines the principles of Bayesian inference
with logistic regression, a popular technique used for binary classification tasks. In traditional
logistic regression, we aim to find the parameters (weights) that best separate two classes (e.g.,
spam vs. not spam emails) by maximizing the likelihood of the observed data.
In Bayesian logistic regression, instead of finding a single best estimate for these parameters, we
consider them as random variables with a probability distribution. This approach allows us to
incorporate prior knowledge about the parameters and quantify uncertainty in our predictions,
which can be very useful in situations where data is limited or noisy.
1. Logistic Regression:
o Logistic regression is a type of regression analysis used when the dependent
variable (output) is binary (0 or 1, true or false).
o It models the probability that a given input belongs to a particular class using the
logistic function, which outputs a value between 0 and 1.
2. Bayesian Inference:
o Bayesian inference is a method of statistical inference that updates the probability
for a hypothesis as more evidence or information becomes available.
o In Bayesian logistic regression, we start with a prior distribution over the model
parameters (what we believe about the parameters before seeing the data) and
update this to a posterior distribution after observing the data.
3. Prior Distribution:
o The prior represents our beliefs about the parameters before we see any data. For
example, we might believe that all weights are centered around zero with some
uncertainty.
o Choosing the right prior can help guide the model, especially when data is scarce.
Common choices for priors in logistic regression are Gaussian distributions
(which express a belief that the parameters are likely around zero but could be
anywhere).
4. Posterior Distribution:
o After observing the data, we update our prior beliefs to form the posterior
distribution, which combines our prior beliefs with the likelihood of the observed
data under different parameter values.
o The posterior distribution reflects both our initial beliefs and the new information
from the data, allowing us to make more informed predictions that take into
account uncertainty.
5. Likelihood Function:
o The likelihood represents the probability of observing the data given a set of
parameter values. In logistic regression, this is based on the logistic function.
o The likelihood is used in combination with the prior to compute the posterior.
6. Posterior Predictive Distribution:
o Instead of making a single prediction, Bayesian logistic regression provides a
distribution over possible predictions, accounting for uncertainty in the parameter
estimates.
o This distribution can be used to compute predictive probabilities, making the
model's predictions more robust and reliable.
Advantages of Bayesian Logistic Regression
• Incorporates Prior Knowledge: Allows incorporating prior beliefs about the model
parameters, which is helpful when data is limited or when we have domain knowledge.
• Quantifies Uncertainty: Provides a probabilistic framework that quantifies uncertainty
in predictions, which is valuable for risk assessment and decision-making in critical
applications.
• Avoids Overfitting: By regularizing parameter estimates through priors, Bayesian
logistic regression can prevent overfitting, especially in high-dimensional datasets or
when the number of features exceeds the number of observations.
Let's illustrate Bayesian logistic regression using a simple example of binary classification. We
will use synthetic data to classify points into two categories.
Scenario: Suppose we have a dataset of students' study hours and their pass/fail status in an
exam. We can model the probability of passing the exam based on the number of study hours using
Bayesian logistic regression.
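A minimal sketch of this scenario under stated assumptions: synthetic study-hours data, a Gaussian prior on the weights, and a Laplace approximation to the posterior instead of full posterior sampling.

```python
# A minimal sketch of Bayesian logistic regression via a MAP estimate
# with a Gaussian prior, followed by a Laplace approximation of the posterior.
import numpy as np
from scipy import optimize

rng = np.random.default_rng(3)
hours = rng.uniform(0, 10, size=40)
X = np.column_stack([np.ones_like(hours), hours])          # intercept + study hours
true_w = np.array([-4.0, 1.0])                              # students studying > ~4h tend to pass
y = (rng.uniform(size=40) < 1 / (1 + np.exp(-X @ true_w))).astype(float)

def negative_log_posterior(w, prior_std=5.0):
    z = X @ w
    # Bernoulli log-likelihood written in a numerically stable form.
    log_lik = np.sum(y * z - np.logaddexp(0.0, z))
    log_prior = -0.5 * np.sum(w**2) / prior_std**2           # Gaussian prior on the weights
    return -(log_lik + log_prior)

res = optimize.minimize(negative_log_posterior, x0=np.zeros(2))
w_map = res.x

# Laplace approximation: posterior covariance ≈ inverse Hessian at the MAP estimate.
posterior_cov = res.hess_inv if isinstance(res.hess_inv, np.ndarray) else res.hess_inv.todense()

print("MAP weights:", w_map)
print("posterior std of weights:", np.sqrt(np.diag(posterior_cov)))
```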
Conclusion
Bayesian logistic regression provides a powerful framework for binary classification tasks by
combining the strengths of logistic regression with the flexibility and uncertainty quantification
of Bayesian inference. This approach is particularly valuable when dealing with small datasets,
noisy data, or when prior knowledge about the parameters is available.
11.Kernel Functions, Using Kernels in Generalized Linear Models
(GLMs), and Kernel Trick
Introduction to Kernel Functions
1. Kernel Function:
o A kernel function is a function that computes a dot product between two vectors
in a (potentially very high-dimensional) feature space. It allows operations in this
high-dimensional space without explicitly mapping the data to that space.
o Common kernel functions include:
▪ Linear Kernel: Computes the standard inner product between two
vectors.
▪ Polynomial Kernel: Computes a polynomial combination of the input
data points.
▪ Gaussian (RBF) Kernel: Computes similarity based on the distance
between two points, where closer points have a higher similarity.
▪ Sigmoid Kernel: Similar to neural network activation functions, often
used in support vector machines.
2. Kernel Trick:
o The kernel trick is a technique that allows linear classifiers to learn non-linear
boundaries by implicitly mapping input features into high-dimensional feature
spaces. This is achieved by replacing the inner product (dot product) in algorithms
with a kernel function.
o For example, instead of computing φ(x)·φ(y) (where φ is a mapping function to a
high-dimensional space), we compute K(x, y), the kernel function, directly in the
input space. A minimal numeric check of this is given after this list.
o This avoids the computational cost of transforming data into high-dimensional
space explicitly, making algorithms efficient even in very high-dimensional
spaces.
3. Mercer's Theorem:
o Mercer's theorem provides the theoretical foundation for kernel methods, stating
that any continuous, symmetric, and positive semi-definite function can be used
as a kernel function, effectively representing a dot product in some (potentially
infinite-dimensional) feature space.
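The numeric check referred to above: for the degree-2 polynomial kernel, the kernel value computed directly in the input space equals the dot product of explicitly mapped features (the explicit map φ is written out only for the comparison).

```python
# A minimal sketch of the kernel trick: the degree-2 polynomial kernel K(x, y) = (x·y)²
# equals the dot product of explicit feature maps φ(x)·φ(y), computed without ever building φ.
import numpy as np

def phi(v):
    """Explicit feature map for the homogeneous degree-2 polynomial kernel in 2D."""
    return np.array([v[0]**2, v[1]**2, np.sqrt(2) * v[0] * v[1]])

def poly_kernel(x, y):
    return np.dot(x, y) ** 2

x = np.array([1.0, 2.0])
y = np.array([3.0, 0.5])

print(np.dot(phi(x), phi(y)))   # dot product in the 3-D feature space
print(poly_kernel(x, y))        # same value, computed directly in the 2-D input space
```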
Using Kernels in Generalized Linear Models (GLMs)
Generalized Linear Models (GLMs) are a broad class of models that include linear regression,
logistic regression, and Poisson regression. They can be extended to incorporate non-linear
relationships using kernels, enabling them to model more complex patterns in data.
1. GLMs and Feature Space:
o Traditional GLMs operate on the original feature space, applying a linear
combination of the input features to predict the target variable.
o By introducing kernel functions, GLMs can be transformed into models that
operate in a high-dimensional feature space, allowing them to learn non-linear
relationships.
2. Kernelized GLMs:
o In a kernelized GLM, the input data is implicitly mapped to a high-dimensional
space using a kernel function. The model then fits a linear combination of the
mapped features.
o For instance, in kernelized logistic regression, the decision boundary becomes
non-linear in the original feature space but remains a linear function in the high-
dimensional space induced by the kernel.
o This technique allows GLMs to model more complex patterns without changing
the underlying linear structure of the model in the feature space.
3. Advantages of Using Kernels in GLMs:
o Flexibility: Enables the model to capture non-linear relationships.
o Computational Efficiency: The kernel trick allows the use of high-dimensional
feature spaces without explicitly computing transformations, reducing the
computational burden.
o Application to Different Data Types: Kernels can be designed for various types
of data, including sequences, graphs, and text, extending the applicability of
GLMs.
To illustrate the use of kernels in GLMs, let's consider a simple example of binary classification
using kernelized logistic regression.
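The source does not include the example itself, so here is a minimal sketch under common assumptions: an RBF kernel matrix is computed explicitly and fed to an ordinary logistic regression, which makes the decision boundary non-linear in the original two features.

```python
# A minimal sketch of kernelized logistic regression: logistic regression applied to
# RBF kernel similarities instead of the raw features, giving a non-linear boundary.
import numpy as np
from sklearn.datasets import make_moons
from sklearn.linear_model import LogisticRegression
from sklearn.metrics.pairwise import rbf_kernel

X, y = make_moons(n_samples=200, noise=0.2, random_state=0)   # non-linearly separable data

K_train = rbf_kernel(X, X, gamma=2.0)   # each sample represented by its similarity to all training points
clf = LogisticRegression(max_iter=1000).fit(K_train, y)

# To classify new points, compute their kernel similarities to the training set.
X_new = np.array([[0.0, 0.5], [1.0, -0.5]])
K_new = rbf_kernel(X_new, X, gamma=2.0)
print(clf.predict(K_new))
print(f"training accuracy: {clf.score(K_train, y):.2f}")
```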
12.Support Vector Machine Algorithm
Support Vector Machine or SVM is one of the most popular Supervised Learning algorithms,
which is used for Classification as well as Regression problems. However, primarily, it is used
for Classification problems in Machine Learning.
The goal of the SVM algorithm is to create the best line or decision boundary that can segregate
n-dimensional space into classes so that we can easily put the new data point in the correct
category in the future. This best decision boundary is called a hyperplane.
SVM chooses the extreme points/vectors that help in creating the hyperplane. These extreme
cases are called as support vectors, and hence algorithm is termed as Support Vector Machine.
Consider the below diagram in which there are two different categories that are classified using a
decision boundary or hyperplane:
Example: SVM can be understood with the example that we have used in the KNN classifier.
Suppose we see a strange cat that also has some features of dogs. If we want a model that can
accurately identify whether it is a cat or a dog, such a model can be created by using the SVM
algorithm. We will first train our model with lots of images of cats and dogs so that it can learn
the different features of cats and dogs, and then test it with this strange creature. SVM creates a
decision boundary between these two classes (cat and dog) and chooses the extreme cases
(support vectors); on the basis of these support vectors, it will classify the creature as a cat.
Consider the below diagram:
SVM algorithm can be used for Face detection, image classification, text categorization, etc.
Types of SVM
• Linear SVM: Linear SVM is used for linearly separable data, which means if a dataset
can be classified into two classes by using a single straight line, then such data is termed
as linearly separable data, and classifier is used called as Linear SVM classifier.
• Non-linear SVM: Non-Linear SVM is used for non-linearly separated data, which means
if a dataset cannot be classified by using a straight line, then such data is termed as non-
linear data and classifier used is called as Non-linear SVM classifier.
The dimensions of the hyperplane depend on the number of features present in the dataset, which
means if there are 2 features (as shown in the image), then the hyperplane will be a straight line,
and if there are 3 features, then the hyperplane will be a 2-dimensional plane.
We always create the hyperplane that has the maximum margin, which means the maximum
distance between the hyperplane and the nearest data points of either class.
Support Vectors:
The data points or vectors that are the closest to the hyperplane and which affect the position of
the hyperplane are termed as Support Vector. Since these vectors support the hyperplane, hence
called a Support vector.
How does SVM work?
Linear SVM:
The working of the SVM algorithm can be understood by using an example. Suppose we have a
dataset that has two tags (green and blue), and the dataset has two features x1 and x2. We want a
classifier that can classify the pair(x1, x2) of coordinates in either green or blue. Consider the
below image:
As it is a 2-d space, by just using a straight line, we can easily separate these two classes. But
there can be multiple lines that can separate these classes. Consider the below image:
Hence, the SVM algorithm helps to find the best line or decision boundary; this best boundary or
region is called as a hyperplane. SVM algorithm finds the closest point of the lines from both
the classes. These points are called support vectors. The distance between the vectors and the
hyperplane is called as margin. And the goal of SVM is to maximize this margin.
The hyperplane with maximum margin is called the optimal hyperplane.
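A minimal sketch of a linear SVM on two separable clusters using scikit-learn's SVC (the data is synthetic):

```python
# A minimal sketch of a linear SVM: fit a maximum-margin line between two blob-shaped classes.
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y = make_blobs(n_samples=100, centers=2, random_state=6)   # two linearly separable clusters

clf = SVC(kernel="linear", C=1.0)
clf.fit(X, y)

print("support vectors (points closest to the hyperplane):")
print(clf.support_vectors_)
print("predicted class for a new point:", clf.predict([[0.0, 2.0]]))
```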
Non-Linear SVM:
If data is linearly arranged, then we can separate it by using a straight line, but for non-linear
data, we cannot draw a single straight line. Consider the below image:
So to separate these data points, we need to add one more dimension. For linear data, we have
used two dimensions x and y, so for non-linear data, we will add a third dimension z. It can be
calculated as:
z = x² + y²
By adding the third dimension, the sample space will become as below image:
So now, SVM will divide the datasets into classes in the following way. Consider the below
image:
Since we are in 3-d space, the separating boundary looks like a plane parallel to the x-axis. If we
convert it to 2-d space with z = 1, then it will become:
Hence we get a circle of radius 1 in the case of non-linear data.
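A minimal sketch of the same idea with an RBF kernel on circular data (synthetic; the kernel plays the role of the added z = x² + y² dimension described above):

```python
# A minimal sketch of a non-linear SVM: an RBF kernel separates a class
# forming an inner circle from a class forming an outer ring.
from sklearn.datasets import make_circles
from sklearn.svm import SVC

X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

linear_svm = SVC(kernel="linear").fit(X, y)   # cannot separate the rings with a straight line
rbf_svm = SVC(kernel="rbf", gamma=2.0).fit(X, y)

print(f"linear kernel accuracy: {linear_svm.score(X, y):.2f}")   # poor
print(f"RBF kernel accuracy:    {rbf_svm.score(X, y):.2f}")      # near perfect
```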
Reference- https://www.javatpoint.com/machine-learning-support-vector-machine-algorithm