Module-1
Machine Learning
Machine learning (ML) is a type of artificial intelligence that enables
machines to learn and improve from experience without being
explicitly programmed.
• Supervised learning involves training a model on a labeled dataset,
where the input features are labeled with the corresponding output
values.
ML Workflow
• Data Collection
• Data Cleaning
• Feature Engineering
• Model Selection
• Model Training
• Model Evaluation
• Model Deployment
• Data collection involves gathering data from various sources, such as
databases, APIs, and web scraping.
• Model selection involves choosing the appropriate
algorithm for the task at hand. This can involve
experimenting with different algorithms and comparing
their performance.
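As a rough illustration of the workflow steps above (splitting the collected data, then model selection, training, and evaluation), here is a minimal scikit-learn sketch; the dataset and the two candidate models are placeholder choices, not part of the original material.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Data collection stand-in: a built-in dataset
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Model selection: try candidate algorithms and compare their performance
for model in [LogisticRegression(max_iter=1000), DecisionTreeClassifier(random_state=0)]:
    model.fit(X_train, y_train)                          # model training
    acc = accuracy_score(y_test, model.predict(X_test))  # model evaluation
    print(type(model).__name__, acc)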
Types of Features
Feature Engineering
Feature engineering is the process of selecting and
transforming the input features to improve the model’s
performance.
Some common feature engineering techniques include
feature scaling, one-hot encoding, and dimensionality
reduction.
Feature scaling involves scaling the numerical features to
a common scale to prevent bias in the model.
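A small sketch of two of these techniques, feature scaling and one-hot encoding, with scikit-learn; the column names and values below are made up for illustration.

import pandas as pd
from sklearn.preprocessing import StandardScaler, OneHotEncoder

df = pd.DataFrame({
    "age": [25, 40, 31],
    "salary": [30000, 90000, 52000],
    "city": ["Delhi", "Pune", "Delhi"],
})

# Feature scaling: bring the numerical columns onto a common scale
scaled = StandardScaler().fit_transform(df[["age", "salary"]])

# One-hot encoding: turn the categorical column into binary indicator columns
encoded = OneHotEncoder().fit_transform(df[["city"]]).toarray()

print(scaled)
print(encoded)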
Data Transformation
Data transformation is one of the most important steps in a machine learning pipeline: it modifies the raw data and converts it into a format that is better suited to analysis and model training.
Image Data Transformation:
Numerical Data Transformation:
Numerical data typically requires scaling and normalization to
ensure that all the input features have a similar scale and range.
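For instance, standardization and min-max normalization are two common ways to put numerical features on a similar scale; the two-column array below is a made-up example.

import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler

X = np.array([[1.0, 200.0],
              [2.0, 400.0],
              [3.0, 800.0]])

print(StandardScaler().fit_transform(X))  # zero mean, unit variance per column
print(MinMaxScaler().fit_transform(X))    # rescaled to the [0, 1] range per column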
Data Imputation
Methods of Data Imputation
Mean Imputation: Replace missing values with the
mean of the non-missing values in the variable.
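A minimal sketch of mean imputation using scikit-learn's SimpleImputer; the column values are hypothetical.

import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([[1.0], [2.0], [np.nan], [4.0]])   # one missing value

imputer = SimpleImputer(strategy="mean")
print(imputer.fit_transform(X))   # NaN is replaced with (1 + 2 + 4) / 3 ≈ 2.33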
Regression Imputation: Fit a simple linear regression line, y = a + b·x, to the observed data and use it to predict the missing value,
where
y is the dependent variable (the missing value we want to impute),
x is the independent variable (in this case, the index of the data point),
b is the slope of the line, and a is the y-intercept.
To estimate the missing value, we first need to fit the linear regression
model to the available data. We can do this by using the least squares
method, which finds the values of a and b that minimize the sum of
the squared differences between the predicted values and the actual
values of the dependent variable. The formulas for the slope b and the y-intercept a are given by:

b = \frac{n \sum x_i y_i - \sum x_i \sum y_i}{n \sum x_i^2 - \left(\sum x_i\right)^2}, \qquad a = \bar{y} - b\,\bar{x}
Example: find the value of y for x = 12, given a = 1.5 and b = 0.95:
y = 1.5 + (0.95 × 12)
y = 12.9
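A rough sketch of regression imputation on a made-up series: fit the line by least squares with NumPy's polyfit on the observed points, then predict the value at the missing index.

import numpy as np

x = np.array([1, 2, 3, 4, 5, 6])                  # index of each data point
y = np.array([2.4, 3.5, 4.2, 5.4, np.nan, 7.3])   # hypothetical values, one missing

mask = ~np.isnan(y)
b, a = np.polyfit(x[mask], y[mask], deg=1)   # least-squares slope b and intercept a
print(a + b * x[~mask])                      # imputed value at the missing index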
K-Nearest Neighbor (KNN) Imputation
The KNN algorithm selects the k most similar observations to the
one with the missing value and then takes the average or weighted
average of the values of these observations to fill in the missing
value.
It is referred to as multivariate because it considers multiple variables
or features in the dataset to estimate the missing values. By
leveraging the values of other variables, KNN imputation takes into
account the relationships and patterns present in the data to impute
missing values.
Calculating K Nearest Neighbors with NaN Euclidean Distance:
Which features will the KNN imputer take into account? Whatever columns are passed as X.
Because the distance calculation is Euclidean, the KNN imputer is not applicable to categorical variables.
The formula for the NaN-aware Euclidean distance between two data points X and Y is:

d(X, Y) = \sqrt{\frac{n_{\text{total}}}{n_{\text{present}}} \sum_{i \in \text{present}} (x_i - y_i)^2}

where the sum runs only over the features that are non-missing in both points, and the ratio of total to present features rescales the distance to compensate for the ignored coordinates.
For example, if X = [x_A, NaN, x_C] has a missing value in feature B, we ignore that feature and compute the distance to Y from the available features only, scaled by 3/2 (three features in total, two present).
In this case, the missing value in feature B does not contribute to the Euclidean distance calculation between X and Y; the distance is determined by the available features A and C.
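A small sketch of KNN imputation in scikit-learn; the matrix below is made up. KNNImputer uses this NaN-aware Euclidean distance internally, and nan_euclidean_distances exposes the same calculation directly.

import numpy as np
from sklearn.impute import KNNImputer
from sklearn.metrics.pairwise import nan_euclidean_distances

X = np.array([[1.0, np.nan, 2.0],
              [3.0, 4.0, 3.0],
              [1.0, 6.0, 2.0],
              [8.0, 8.0, 7.0]])   # missing value in feature B of the first row

# Distances ignore missing coordinates and rescale by (total features / present features)
print(nan_euclidean_distances(X[:1], X[1:]))

# Fill the missing value with the average of the 2 nearest rows
print(KNNImputer(n_neighbors=2).fit_transform(X))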
What is Regression?
Simple Linear Regression:
Simple linear regression is a type of regression analysis
that models the relationship between a dependent
variable and a single independent variable.
It assumes a linear relationship between the two variables
and uses a line to model the relationship.
The line is fitted to the data using a least squares regression
approach, which minimizes the sum of the squared
differences between the predicted values and the actual
values of the dependent variable.
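A minimal sketch of fitting a simple linear regression by least squares in scikit-learn; the five data points are hypothetical.

import numpy as np
from sklearn.linear_model import LinearRegression

x = np.array([[1], [2], [3], [4], [5]])    # single independent variable
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])    # dependent variable

model = LinearRegression().fit(x, y)       # least-squares fit of the line
print(model.coef_[0], model.intercept_)    # slope b and y-intercept a
print(model.predict([[6]]))                # predicted value for x = 6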
Loss Functions
\mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2

\mathrm{MAE} = \frac{1}{n} \sum_{i=1}^{n} \left| y_i - \hat{y}_i \right|
Loss Functions:
Loss functions are used in regression to quantify the
difference between the predicted values and the actual
values of the dependent variable.
The goal of regression is to minimize the loss function by
adjusting the parameters of the model.
The most commonly used loss function in simple linear
regression is the mean squared error (MSE), which is the
average of the squared differences between the predicted
values and the actual values.
Other loss functions include mean absolute error (MAE)
and root mean squared error (RMSE).
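As a quick illustration with made-up predictions, all three losses can be computed with NumPy and scikit-learn's metrics module.

import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_error

y_true = np.array([3.0, 5.0, 7.5, 10.0])   # actual values
y_pred = np.array([2.5, 5.5, 7.0, 11.0])   # predicted values

mse = mean_squared_error(y_true, y_pred)
mae = mean_absolute_error(y_true, y_pred)
rmse = np.sqrt(mse)
print(mse, mae, rmse)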
Polynomial Regression
Figure: Polynomial Regression
Polynomial Regression:
What is Polynomial Regression?
In polynomial regression, we describe the relationship between the
independent variable x and the dependent variable y using an nth-
degree polynomial in x.
Types of Polynomial Regression
A quadratic equation is the general term for a second-degree polynomial equation; the degree, however, can go up to any nth value. Here is the categorization of polynomial regression:
Linear – if the degree is 1
Quadratic – if the degree is 2
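A brief sketch of quadratic (degree 2) polynomial regression: expand the input feature into polynomial terms, then fit an ordinary linear regression; the data below is made up.

import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline

x = np.array([[1], [2], [3], [4], [5]])
y = np.array([1.2, 4.1, 9.3, 15.8, 25.1])    # roughly quadratic in x

# PolynomialFeatures adds x^2 (and a bias term) before the linear fit
model = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
model.fit(x, y)
print(model.predict([[6]]))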
Introduction to Logistic Regression
Logistic regression is a popular method for binary
classification, where the goal is to predict the probability
of an input belonging to a particular class.
The logistic regression model uses a logistic function to
model the probability of an input belonging to the
positive class.
The logistic function maps any input to a value
between 0 and 1, which can be interpreted as a
probability.
p(y = 1 \mid x) = \frac{1}{1 + e^{-z}} \qquad (1)

p(y = i \mid x) = \frac{e^{z_i}}{\sum_{j=1}^{K} e^{z_j}} \qquad (2)
where K is the number of classes, z_i is the linear combination of input features for class i, and the denominator sums over all classes.
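A small NumPy sketch of the two functions above, matching equations (1) and (2); the input scores are arbitrary.

import numpy as np

def sigmoid(z):
    # Equation (1): maps any real score z to a probability in (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    # Equation (2): turns K class scores into probabilities that sum to 1
    e = np.exp(z - np.max(z))   # subtract the max for numerical stability
    return e / e.sum()

print(sigmoid(0.8))                         # binary case
print(softmax(np.array([2.0, 1.0, 0.1])))   # multi-class case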
Cross Entropy Loss
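The cross-entropy loss, as commonly defined for binary logistic regression with true label y ∈ {0, 1} and predicted probability ŷ, is:

L(y, \hat{y}) = -\left[\, y \log \hat{y} + (1 - y) \log(1 - \hat{y}) \,\right]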
Solving classification problems with naïve Bayes
How does the naïve Bayes algorithm work?
A probability ranges from 0 (less likely to happen) to 1 (most likely to happen).
Bayes' theorem is centered on conditional probability.
What is conditional probability?
Conditional probability is the probability of an event 'A' happening given that another event 'B' has already happened.
The algorithm is called "naïve" because it looks at each feature on its own: it doesn't take into account the effect of any other feature.
• Advantages: used for benchmarking of a model
• Disadvantages: the strong assumption that the features are independent of one another
Three types of naïve Bayes classifiers in sklearn:
• Bernoulli Naïve Bayes: used when the data is binary, like true or false, yes or no, etc.
• Multinomial Naïve Bayes: used when there are discrete values, such as the number of family members or pages in a book.
• Gaussian Naïve Bayes: used when all features are continuous variables, like temperature or height.
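A minimal sketch of one of these classifiers, Gaussian Naïve Bayes, on a built-in scikit-learn dataset with continuous features; the dataset choice is just for illustration.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = GaussianNB().fit(X_train, y_train)
print(clf.score(X_test, y_test))       # accuracy on the held-out split
print(clf.predict_proba(X_test[:1]))   # class probabilities for one sample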
Reversing the condition
Example: Rahul’s favorite breakfast is bagels and his favorite lunch is pizza. The probability of
Rahul having bagels for breakfast is 0.6. The probability of him having pizza for lunch is 0.5.
The probability of him having a bagel for breakfast, given that he eats a pizza for lunch, is 0.7.
Let’s define event A as Rahul having a bagel for breakfast, Event B as Rahul having a pizza for
lunch.
P(A) = 0.6
P(B) = 0.5
P(A | B) = 0.7
If we look at the numbers, the probability of having a bagel is different from the probability of having a bagel given that he has a pizza for lunch. This means that the probability of having a bagel is dependent on having a pizza for lunch.
Now, what if we need to know the probability of having a pizza for lunch given that he had a bagel for breakfast?
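Using Bayes' theorem with the numbers above, this reversed probability works out to:

P(B \mid A) = \frac{P(A \mid B)\, P(B)}{P(A)} = \frac{0.7 \times 0.5}{0.6} \approx 0.58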
The Bayes theorem describes the probability of an event based on prior knowledge of the conditions that might be related to the event. If we know the conditional probability P(A | B), we can use Bayes' rule to find out the reverse probability P(B | A).
Naïve Bayes is a popular algorithm for supervised learning, particularly
for text classification problems. It’s based on Bayes’ theorem, which is
a formula for calculating conditional probabilities.
Bayesian Belief Networks
One common application of Bayesian belief networks
is in medical diagnosis.
Naïve Bayes and Bayesian belief networks are powerful tools for
probabilistic modeling and inference.
Figure: Joint Probability Distribution
Support Vector Machines (SVMs) are a popular machine
learning algorithm for classification and regression. They work
by finding the hyperplane that maximally separates the data
points in a high-dimensional feature space.
• The basic idea behind SVMs is to find the hyperplane that
maximizes the margin between the positive and negative data
points. The margin is the distance between the hyperplane and the
closest data points from either class. The hyperplane that maximizes
the margin is called the maximum margin hyperplane.
• SVMs can be extended to handle non-linearly separable data using
a technique called the kernel trick. The kernel trick involves
mapping the original feature space to a higher-dimensional space
using a non-linear function, and then finding the maximum margin
hyperplane in this new space.
Disadvantages:
• Increased computation
• Increased learning cost
• No rule of thumb that works for all types of data
• Applying a mapping function
In practice, data is often not perfectly separable, and finding the
maximum margin hyperplane is not always possible. In these
cases, we can use a variant of SVM called the soft margin SVM.
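A brief sketch of a soft margin SVM in scikit-learn: the regularization parameter C controls how strongly margin violations are penalized; the toy points below are made up and not perfectly separable.

import numpy as np
from sklearn.svm import SVC

X = np.array([[1, 2], [2, 1], [2, 3], [3, 2], [6, 5], [7, 8], [8, 7], [2, 2]])
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])   # the last point lies inside the other class

# Smaller C -> softer margin (more violations tolerated); larger C -> harder margin
clf = SVC(kernel="linear", C=1.0).fit(X, y)
print(clf.support_vectors_)   # the points that define the margin
print(clf.predict([[4, 4]]))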
The kernel trick is a powerful technique for extending SVMs to
handle non-linearly separable data. The basic idea is to map the data
into a higher-dimensional feature space using a non-linear function,
and then apply the linear SVM algorithm to this new feature space.
• The key insight behind the kernel trick is that we can compute the dot
product between the mapped data points without explicitly computing
the mapping. This is done by defining a kernel function that takes two
data points as input and returns the dot product of their mapped
features.
• Some common kernel functions include the polynomial kernel, which
computes the dot product of two vectors raised to a certain power, and the
radial basis function kernel, which measures the similarity between two data
points based on their distance in the feature space.
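A short sketch contrasting the polynomial and RBF kernels mentioned above on scikit-learn's built-in two-moons data; the dataset and parameter values are illustrative choices.

from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_moons(n_samples=200, noise=0.2, random_state=0)   # non-linearly separable data
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for kernel in ["poly", "rbf"]:
    # The kernel computes dot products in the mapped space without forming it explicitly
    clf = SVC(kernel=kernel, degree=3, gamma="scale").fit(X_train, y_train)
    print(kernel, clf.score(X_test, y_test))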
Support Vector Machines are a powerful machine learning
algorithm that can handle both linear and non-linearly
separable data.