Module-I: Supervised Learning [14 Sessions] [Bloom's Taxonomy Selected: Application]
An overview of Machine Learning (ML); ML workflow; types of ML; Types of features, Feature Engineering -
Data Imputation Methods; Regression – introduction; simple linear regression, loss functions; Polynomial
Regression; Logistic Regression; Softmax Regression with cross entropy as cost function;
Bayesian Learning – Bayes Theorem, estimating conditional probabilities for categorical and continuous
features, Naïve Bayes for supervised learning; Bayesian Belief networks; Support Vector Machines – soft margin
and kernel tricks.
Machine Learning
Machine learning (ML) is a type of artificial intelligence that enables
machines to learn and improve from experience without being explicitly
programmed.
• Supervised learning involves training a model on a labeled dataset,
where each input example is paired with its corresponding output value.
ML Workflow
• Data Collection
• Data Cleaning
• Feature Engineering
• Model Selection
• Model Training
• Model Evaluation
• Model Deployment
• Data collection involves gathering data from various sources, such as databases, APIs,
and web scraping.
• Data cleaning involves preprocessing the data to remove missing values, outliers,
and other anomalies.
• Model selection involves choosing the appropriate algorithm for
the task at hand. This can involve experimenting with different
algorithms and comparing their performance.
• Model training involves fitting the model to the training data using
an optimization algorithm such as gradient descent.
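As a rough end-to-end illustration of these steps (a minimal sketch using scikit-learn and its built-in Iris dataset; the split sizes and algorithm choice are just placeholders):

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Data collection: load a ready-made labeled dataset
X, y = load_iris(return_X_y=True)

# Data cleaning / feature engineering would normally happen here

# Model selection: pick a candidate algorithm to try
model = LogisticRegression(max_iter=1000)

# Model training: fit the model on the training split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model.fit(X_train, y_train)

# Model evaluation: score on held-out data before deployment
print(accuracy_score(y_test, model.predict(X_test)))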
Types of Features
Feature Engineering
Feature engineering is the process of selecting and transforming the
input features to improve the model’s performance.
Some common feature engineering techniques include feature scaling,
one-hot encoding, and dimensionality reduction.
Feature scaling involves scaling the numerical features to a common
scale to prevent bias in the model.
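A minimal sketch of two of these techniques, assuming scikit-learn's StandardScaler and OneHotEncoder and a small made-up feature matrix:

import numpy as np
from sklearn.preprocessing import StandardScaler, OneHotEncoder

# Hypothetical numerical feature (age) and categorical feature (city)
ages = np.array([[25.0], [32.0], [47.0], [51.0]])
cities = np.array([["Delhi"], ["Mumbai"], ["Delhi"], ["Chennai"]])

# Feature scaling: bring the numeric column to zero mean and unit variance
scaled_ages = StandardScaler().fit_transform(ages)

# One-hot encoding: turn each category into its own binary column
encoded_cities = OneHotEncoder().fit_transform(cities).toarray()

print(scaled_ages)
print(encoded_cities)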
Data Transformation
Data transformation is a key step in a machine learning pipeline: the raw data is modified and converted into a better format so that it is more suitable for analysis and model training.
Image Data Transformation:
Numerical Data Transformation:
Numerical data typically requires scaling and normalization to
ensure that all the input features have a similar scale and range.
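For example, min-max normalization rescales every feature to the [0, 1] range. A small sketch, assuming scikit-learn's MinMaxScaler and made-up values:

import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical features with very different ranges (salary vs. years of experience)
X = np.array([
    [30000.0, 1.0],
    [60000.0, 5.0],
    [90000.0, 10.0],
])

# Each column is mapped to [0, 1] as (x - min) / (max - min)
print(MinMaxScaler().fit_transform(X))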
Data Imputation
Methods of Data Imputation
Mean Imputation: Replace missing values with the mean of the
non-missing values in the variable.
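A minimal sketch of mean imputation, assuming scikit-learn's SimpleImputer (pandas' fillna with the column mean would give the same result):

import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical column with one missing value
X = np.array([[4.0], [8.0], [np.nan], [6.0], [2.0]])

# Replace each NaN with the mean of the non-missing values (here 5.0)
print(SimpleImputer(strategy="mean").fit_transform(X))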
Example
Suppose we have a dataset of five data points in which one value is missing (NaN). We impute the missing value by fitting a simple linear regression line

y = a + b·x

where
y is the dependent variable (the missing value we want to impute),
x is the independent variable (in this case, the index of the data point),
b is the slope of the line, and a is the y-intercept.
To estimate the missing value, we first need to fit the linear regression
model to the available data. We can do this by using the least squares
method, which finds the values of a and b that minimize the sum of the
squared differences between the predicted values and the actual values
of the dependent variable. The formula for the slope b and y-intercept
a are given by:
Find the value of Y for x = 12:
Y = 1.5 + (0.95 × 12)
Y = 12.9
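The same fit can be reproduced with an ordinary least-squares routine. The sketch below uses NumPy's polyfit on made-up (x, y) pairs chosen so that the fitted values land near the a ≈ 1.5 and b ≈ 0.95 used above:

import numpy as np

# Hypothetical observed points (the point with the missing y is left out of the fit)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.4, 3.5, 4.3, 5.4, 6.2])

# Least-squares line y = a + b*x; for degree 1, polyfit returns [slope, intercept]
b, a = np.polyfit(x, y, 1)
print(a, b)          # roughly 1.51 and 0.95

# Impute the missing value at x = 12 using the fitted line
print(a + b * 12)    # roughly 12.9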
K-Nearest Neighbor (KNN) Imputation
Calculating K Nearest Neighbors with NaN Euclidean Distance:
Which features will the KNN imputer take into account? Whatever columns are passed as X.
Because it relies on Euclidean distance calculations, the KNN imputer is not applicable to categorical variables.
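A minimal sketch of KNN imputation, assuming scikit-learn's KNNImputer (which uses the NaN-aware Euclidean distance described below) and made-up data:

import numpy as np
from sklearn.impute import KNNImputer

# Hypothetical numeric data with missing entries
X = np.array([
    [1.0, 2.0, np.nan],
    [3.0, 4.0, 3.0],
    [np.nan, 6.0, 5.0],
    [8.0, 8.0, 7.0],
])

# Each NaN is filled with the mean of that feature over the n_neighbors
# rows closest in NaN-aware Euclidean distance
print(KNNImputer(n_neighbors=2).fit_transform(X))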
The formula for the (NaN-aware) Euclidean distance between two data points X and Y is:

d(X, Y) = sqrt( weight × Σ_i (X_i − Y_i)² )

where the sum runs only over the coordinates present in both points, X_i and Y_i represent the values of the i-th feature or dimension of data points X and Y, respectively, and
weight = (total # of coordinates) / (# of present coordinates).
To calculate the Euclidean distance between X and Y, we ignore the missing value (NaN) in feature B and compute the distance using the available features:

d(X, Y) = sqrt(1.5 × 5) ≈ 2.74

In this case, the missing value in feature B does not contribute to the Euclidean distance calculation between X and Y. The distance is determined from the available features A and C, with the weight 3/2 = 1.5 compensating for the dropped coordinate.
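This weighted distance can be checked numerically. The sketch below uses scikit-learn's nan_euclidean_distances helper with hypothetical values for A, B, and C chosen to reproduce the sqrt(1.5 × 5) computation above:

import numpy as np
from sklearn.metrics.pairwise import nan_euclidean_distances

# Features A, B, C; feature B is missing in X
X = np.array([[2.0, np.nan, 4.0]])
Y = np.array([[3.0, 5.0, 6.0]])

# Only A and C are used, scaled by weight = 3 / 2 = 1.5
print(nan_euclidean_distances(X, Y))  # sqrt(1.5 * ((2-3)**2 + (4-6)**2)) ≈ 2.74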
What is Regression?
Simple Linear Regression
Simple linear regression is a type of regression where there is only one
independent variable. The goal of simple linear regression is to find the
line of best fit that minimizes the difference between the predicted values
and the actual values of the dependent variable.
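A minimal sketch of fitting such a line, assuming scikit-learn's LinearRegression and a tiny made-up dataset:

import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical single feature x and target y
X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
y = np.array([2.1, 4.3, 6.2, 8.1, 10.2])

model = LinearRegression().fit(X, y)
print(model.intercept_, model.coef_[0])  # y-intercept and slope of the best-fit line
print(model.predict([[6.0]]))            # prediction for x = 6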
Loss Functions
In order to find the line of best fit in simple linear regression, we need to
define a loss function that measures the difference between the predicted
values and the actual values of the dependent variable.
Two commonly used loss functions in simple linear regression are the
Mean Squared Error (MSE) and the Mean Absolute Error (MAE).
MSE = (1/n) Σ_{i=1..n} (y_i − ŷ_i)²

MAE = (1/n) Σ_{i=1..n} |y_i − ŷ_i|
Loss Functions:
Loss functions are used in regression to quantify the difference
between the predicted values and the actual values of the dependent
variable.
The goal of regression is to minimize the loss function by adjusting
the parameters of the model.
The most commonly used loss function in simple linear regression is
the mean squared error (MSE), which is the average of the squared
differences between the predicted values and the actual values.
Other loss functions include mean absolute error (MAE) and root
mean squared error (RMSE).
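All three losses are straightforward to compute directly; a small NumPy sketch with made-up predictions:

import numpy as np

# Made-up actual and predicted values
y_true = np.array([3.0, 5.0, 2.5, 7.0])
y_pred = np.array([2.5, 5.0, 4.0, 8.0])

mse = np.mean((y_true - y_pred) ** 2)   # mean squared error
mae = np.mean(np.abs(y_true - y_pred))  # mean absolute error
rmse = np.sqrt(mse)                     # root mean squared error

print(mse, mae, rmse)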
Polynomial Regression
Types of Polynomial Regression
A quadratic equation is the general term for a second-degree polynomial equation, but the degree can go up to any value n. Here is the categorization of polynomial regression:
Linear – if the degree is 1
Quadratic – if the degree is 2
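Polynomial regression is typically implemented as linear regression on polynomially expanded features. A sketch of a quadratic (degree 2) fit, assuming scikit-learn's PolynomialFeatures and made-up data:

import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

# Hypothetical data following a roughly quadratic trend
X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
y = np.array([1.2, 4.1, 9.3, 15.8, 25.1])

# degree=2 expands x into [1, x, x^2]; degree=1 would reduce to plain linear regression
model = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
model.fit(X, y)
print(model.predict([[6.0]]))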
Introduction to Logistic Regression
Logistic regression is a popular method for binary classification,
where the goal is to predict the probability of an input belonging to
a particular class.
The logistic regression model uses a logistic function to model
the probability of an input belonging to the positive class.
The logistic function maps any input to a value between 0 and
1, which can be interpreted as a probability.
p(y = 1 | x) = 1 / (1 + e^(−z))

where z = θ0 + θ1·x1 + θ2·x2 + · · · + θn·xn is the linear combination of the input features x1, x2, . . . , xn and their corresponding coefficients θ1, θ2, . . . , θn.
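A minimal sketch of this computation, with made-up coefficients θ and features x (a real model would learn θ from data instead):

import numpy as np

def sigmoid(z):
    # Maps any real number z to a value in (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

theta0 = -1.0                  # hypothetical intercept θ0
theta = np.array([0.8, 0.5])   # hypothetical coefficients θ1, θ2
x = np.array([2.0, 3.0])       # hypothetical input features x1, x2

z = theta0 + np.dot(theta, x)  # linear combination of the features
print(sigmoid(z))              # p(y = 1 | x) ≈ 0.89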
Softmax Regression
• Softmax regression is a generalization of logistic regression that can be used for
multi-class classification problems.
• The softmax function is used to compute the probability of each class, and the class with the highest probability is chosen as the predicted class.
p(y = i | x) = e^(z_i) / Σ_{j=1..K} e^(z_j)
where K is the number of classes, zi is the linear combination of input features for
class i, and the denominator sums over all classes.
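A small NumPy sketch of the softmax computation for K = 3 classes, using made-up scores z:

import numpy as np

def softmax(z):
    # Subtracting the max is a standard numerical-stability trick; it does not change the result
    exp_z = np.exp(z - np.max(z))
    return exp_z / exp_z.sum()

z = np.array([2.0, 1.0, 0.1])  # hypothetical linear scores z_i, one per class
probs = softmax(z)
print(probs, probs.sum())       # class probabilities; they sum to 1
print(np.argmax(probs))         # predicted class = the one with the highest probability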
Cross Entropy Loss
Solving Classification Problems with Naïve Bayes
How does the naïve Bayes algorithm work?
Naïve Bayes treats each feature as its own independent piece of evidence for a class; it doesn't take into account the effect of any other feature.
• Advantage: often used for benchmarking of a model.
Three types of naïve Bayes classifiers in sklearn
• Bernoulli Naïve Bayes – used when the data is binary, like true/false or yes/no.
• Multinomial Naïve Bayes – used when there are discrete counts, such as the number of family members or pages in a book.
• Gaussian Naïve Bayes – used when all features are continuous variables, like temperature or height.
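A minimal sketch of the Gaussian variant on continuous features, assuming scikit-learn's GaussianNB and the built-in Iris dataset:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# GaussianNB models each continuous feature per class with a normal distribution;
# BernoulliNB / MultinomialNB would be used for binary or count features instead
clf = GaussianNB().fit(X_train, y_train)
print(clf.score(X_test, y_test))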
Reversing the condition
Example: Rahul’s favorite breakfast is bagels and his favorite lunch is pizza. The probability of Rahul having
bagels for breakfast is 0.6. The probability of him having pizza for lunch is 0.5. The probability of him having
a bagel for breakfast given that he eats a pizza for lunch is 0.7.
Let’s define event A as Rahul having a bagel for breakfast, Event B as Rahul having a pizza for lunch.
P(A) = 0.6
P(B) = 0.5
P(A|B) = 0.7
If we look at the numbers, the probability of having a bagel (0.6) is different from the probability of having a bagel given that he has pizza for lunch (0.7). This means that having a bagel for breakfast is dependent on having pizza for lunch.
Now, what if we need to know the probability of him having pizza for lunch given that he had a bagel for breakfast, i.e. P(B|A)?
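Applying Bayes' theorem to these numbers gives the reverse probability:

P(B|A) = P(A|B) · P(B) / P(A) = (0.7 × 0.5) / 0.6 ≈ 0.58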
Bayes' theorem describes the probability of an event based on prior knowledge of the conditions that might be related to the event. If we know the conditional probability P(A|B), we can use Bayes' rule to find the reverse probability P(B|A).
Naïve Bayes is a popular algorithm for supervised learning, particularly
for text classification problems. It’s based on Bayes’ theorem, which is
a formula for calculating conditional probabilities.
Bayesian Belief Networks
One common application of Bayesian belief networks
is in medical diagnosis.
Naïve Bayes and Bayesian belief networks are powerful tools for
probabilistic modeling and inference.
In a Bayesian belief network, multiplying together the conditional probabilities of each node given its parents yields the joint probability distribution.
Support Vector Machines (SVMs) are a popular machine
learning algorithm for classification and regression. They work
by finding the hyperplane that maximally separates the data
points in a high-dimensional feature space.
• The basic idea behind SVMs is to find the hyperplane that maximizes
the margin between the positive and negative data points. The
margin is the distance between the hyperplane and the closest data
points from either class. The hyperplane that maximizes the margin is
called the maximum margin hyperplane.
• SVMs can be extended to handle non-linearly separable data using a
technique called the kernel trick. The kernel trick involves mapping
the original feature space to a higher-dimensional space using a non-
linear function, and then finding the maximum margin hyperplane in
this new space.
• Applying a mapping function
Disadvantages:
• Increased Computation
• Increased Learning Cost
• No single rule of thumb that works for all types of data
In practice, data is often not perfectly separable, and finding the
maximum margin hyperplane is not always possible. In these
cases, we can use a variant of SVM called the soft margin SVM.
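In scikit-learn's SVC, the soft margin is controlled by the penalty parameter C: a small C tolerates more margin violations, a large C fits the training data more strictly. A minimal sketch on made-up, slightly overlapping data:

from sklearn.datasets import make_classification
from sklearn.svm import SVC

# Hypothetical, slightly overlapping two-class data
X, y = make_classification(n_samples=200, n_features=2, n_redundant=0,
                           n_clusters_per_class=1, flip_y=0.1, random_state=0)

soft = SVC(kernel="linear", C=0.1).fit(X, y)      # wide margin, more violations tolerated
strict = SVC(kernel="linear", C=100.0).fit(X, y)  # narrow margin, fewer violations
print(soft.score(X, y), strict.score(X, y))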
The kernel trick is a powerful technique for extending SVMs to handle
non-linearly separable data. The basic idea is to map the data into a
higher-dimensional feature space using a non-linear function, and then
apply the linear SVM algorithm to this new feature space.
• The key insight behind the kernel trick is that we can compute the dot
product between the mapped data points without explicitly computing the
mapping. This is done by defining a kernel function that takes two data
points as input and returns the dot product of their mapped features.
• Some common kernel functions include the polynomial kernel, which
computes the dot product of two vectors raised to a certain power, and the
radial basis function kernel, which measures the similarity between two data
points based on their distance in the feature space.
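The trick can be illustrated directly: for the degree-2 polynomial kernel (with no bias term), the kernel value equals the dot product of explicitly mapped features. A small NumPy sketch with made-up 2-D points:

import numpy as np

x = np.array([1.0, 2.0])
y = np.array([3.0, 4.0])

# Degree-2 polynomial kernel (no bias term): k(x, y) = (x . y)^2
k = np.dot(x, y) ** 2

# Explicit degree-2 mapping phi(x) = (x1^2, x2^2, sqrt(2)*x1*x2)
def phi(v):
    return np.array([v[0] ** 2, v[1] ** 2, np.sqrt(2) * v[0] * v[1]])

# The dot product in the mapped space equals the kernel value (both are 121 here)
print(k, np.dot(phi(x), phi(y)))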
Support Vector Machines are a powerful machine learning
algorithm that can handle both linear and non-linearly
separable data.