
• Module-I: Supervised Learning

• An overview of Machine Learning (ML); ML workflow; types of ML; types of features; Feature Engineering – data imputation methods; Regression – introduction, simple linear regression, loss functions; Polynomial Regression; Logistic Regression; Softmax Regression with cross entropy as cost function;

• Bayesian Learning – Bayes theorem, estimating conditional probabilities for categorical and continuous features, Naïve Bayes for supervised learning, Bayesian Belief Networks; Support Vector Machines – soft margin and kernel tricks.

Machine Learning
Machine learning (ML) is a type of artificial intelligence that enables
machines to learn and improve from experience without being
explicitly programmed.

ML algorithms can be divided into three main categories:


• Supervised Learning
• Unsupervised Learning
• Reinforcement Learning

• Supervised learning involves training a model on a labeled dataset, where each input example is paired with the corresponding output value.

• Unsupervised learning involves training a model on an unlabeled dataset, where the goal is to discover patterns and structure in the data.

• Reinforcement learning involves training a model to make decisions by interacting with an environment and receiving feedback in the form of rewards or penalties.

ML Workflow

The ML workflow typically consists of the following steps:

• Data Collection
• Data Cleaning
• Feature Engineering
• Model Selection
• Model Training
• Model Evaluation
• Model Deployment

• Data collection involves gathering data from various sources, such as databases, APIs, and web scraping.

• Data cleaning involves preprocessing the data to handle missing values, outliers, and other anomalies.

• Feature engineering involves selecting and transforming the input features to improve the model's performance. This can include techniques such as feature scaling, one-hot encoding, and dimensionality reduction.

• Model selection involves choosing the appropriate algorithm for the task at hand. This can involve experimenting with different algorithms and comparing their performance.

• Model training involves fitting the model to the training data using an optimization algorithm such as gradient descent.

• Model evaluation involves testing the model on a held-out validation set to assess its performance.

• Model deployment involves integrating the trained model into a larger system or application.

Types of Features

Features can be divided into two main categories:

• Categorical features (discrete values such as colors, types, and categories)

• Numerical features (continuous values such as measurements, counts, and percentages)

Features can also be divided into binary, ordinal, and interval/ratio types, depending on their characteristics and properties.

Feature Engineering
Feature engineering is the process of selecting and
transforming the input features to improve the model’s
performance.
Some common feature engineering techniques include
feature scaling, one-hot encoding, and dimensionality
reduction.
Feature scaling involves scaling the numerical features to
a common scale to prevent bias in the model.
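A minimal sketch of feature scaling and one-hot encoding, assuming scikit-learn and pandas are available; the column names and values are invented for illustration:

```python
# Feature scaling and one-hot encoding with scikit-learn.
# The column names and values are invented for illustration.
import pandas as pd
from sklearn.preprocessing import OneHotEncoder, StandardScaler

df = pd.DataFrame({
    "height_cm": [150.0, 165.0, 180.0, 172.0],    # numerical feature
    "color": ["red", "blue", "red", "green"],      # categorical feature
})

# Feature scaling: bring the numerical column to zero mean and unit variance.
scaler = StandardScaler()
height_scaled = scaler.fit_transform(df[["height_cm"]])

# One-hot encoding: expand the categorical column into binary indicator columns.
encoder = OneHotEncoder()
color_onehot = encoder.fit_transform(df[["color"]]).toarray()

print(height_scaled.ravel())
print(encoder.get_feature_names_out())
print(color_onehot)
```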

Data Transformation
Data transformation is a central step in a machine learning pipeline: the raw data is modified and converted into a format that is better suited to analysis and model training.

Image Data Transformation:

• Image data is typically represented as a matrix of pixel values, and a common transformation is to normalize the pixel values so they have a mean of zero and a standard deviation of one.

Text Data Transformation:

• A common technique is to represent the text as a bag-of-words, which involves creating a dictionary of all the unique words in the text and representing each document as a vector of word frequencies.
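A small bag-of-words sketch, assuming scikit-learn's CountVectorizer; the two example sentences are made up:

```python
# Bag-of-words sketch using scikit-learn's CountVectorizer;
# the two example sentences are invented for illustration.
from sklearn.feature_extraction.text import CountVectorizer

docs = ["the cat sat on the mat", "the dog sat on the log"]

vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(docs)        # sparse matrix of word frequencies

print(vectorizer.get_feature_names_out())      # the learned dictionary of unique words
print(counts.toarray())                        # each row is one document's word-count vector
```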

Numerical Data Transformation:
Numerical data typically requires scaling and normalization to
ensure that all the input features have a similar scale and range.

Data Imputation

• Data imputation is a statistical technique used to fill in missing data points in a dataset.

• There are several methods for data imputation, but one common approach is to use a statistical model to estimate the missing values based on the available data.

Methods of Data Imputation

• Mean Imputation: Replace missing values with the mean of the non-missing values in the variable.

• Regression Imputation: Predict the missing values using a regression model based on the other variables in the dataset.

• Multiple Imputation: Create multiple imputed datasets and combine the results for analysis.

• K-Nearest Neighbor (KNN) Imputation: Predict the missing values using the values of the nearest neighbors in the dataset.
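A short sketch of two of these methods, assuming scikit-learn's SimpleImputer and KNNImputer; the array values are invented:

```python
# Sketch of mean imputation and KNN imputation with scikit-learn,
# on a small invented array containing NaN values.
import numpy as np
from sklearn.impute import SimpleImputer, KNNImputer

X = np.array([[1.0, 2.0],
              [2.0, np.nan],
              [3.0, 6.0],
              [4.0, 8.0]])

# Mean imputation: replace each NaN with the column mean of the observed values.
mean_imputer = SimpleImputer(strategy="mean")
print(mean_imputer.fit_transform(X))

# KNN imputation: replace each NaN with the average of the corresponding feature
# in the k nearest rows, measured with a NaN-aware Euclidean distance.
knn_imputer = KNNImputer(n_neighbors=2)
print(knn_imputer.fit_transform(X))
```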
Example

• Suppose we have a dataset of five data points, some of which contain NaN values.

• We want to impute a missing value using a simple linear regression model. The linear regression model is given by:

y = bx + a

where y is the dependent variable (the missing value we want to impute), x is the independent variable (in this case, the index of the data point), b is the slope of the line, and a is the y-intercept.

To estimate the missing value, we first need to fit the linear regression model to the available data. We can do this using the least squares method, which finds the values of a and b that minimize the sum of the squared differences between the predicted values and the actual values of the dependent variable. The slope b and y-intercept a are given by:

b = Σ_i (x_i − x̄)(y_i − ȳ) / Σ_i (x_i − x̄)²,   a = ȳ − b·x̄

where x̄ and ȳ are the means of the observed x and y values.

With the fitted values a = 1.5 and b = 0.95, the imputed value of y for x = 12 is:

y = 1.5 + (0.95 × 12) = 12.9
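A rough sketch of regression imputation with NumPy, mirroring the procedure above; the data points are invented, so the fitted a and b will differ from the worked example:

```python
# Regression imputation with NumPy: fit y = b*x + a by least squares on the
# observed points, then predict the missing value. The data values here are
# invented, so the fitted a and b differ from the worked example above.
import numpy as np

x = np.array([2.0, 4.0, 6.0, 8.0, 12.0])      # independent variable (index/position)
y = np.array([3.1, 5.6, 7.0, 9.2, np.nan])    # the last value is missing

observed = ~np.isnan(y)
b, a = np.polyfit(x[observed], y[observed], deg=1)   # slope b, intercept a

y_imputed = y.copy()
y_imputed[~observed] = a + b * x[~observed]          # fill the gap with the prediction
print(round(b, 3), round(a, 3), y_imputed)
```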
K-Nearest Neighbor (KNN) Imputation
The KNN algorithm selects the k most similar observations to the
one with the missing value and then takes the average or weighted
average of the values of these observations to fill in the missing
value.
It is referred to as multivariate because it considers multiple variables
or features in the dataset to estimate the missing values. By
leveraging the values of other variables, KNN imputation takes into
account the relationships and patterns present in the data to impute
missing values.

Calculating K Nearest Neighbors with NaN Euclidean Distance:

When calculating K nearest neighbors using Euclidean distance and dealing with missing values (NaN), special handling is required. Here's how it can be performed:

1. Identify the subset of data points that have non-missing values for the target feature.
2. Calculate the Euclidean distance between the data point with the missing value and each data point in the subset.
3. Exclude features that are missing (NaN) in either of the two points being compared.
4. Select the K data points with the smallest Euclidean distances as the nearest neighbors.

Which features will the KNN imputer take into account? Whatever columns are passed as X. Because the method relies on Euclidean distance, the KNN imputer is not directly applicable to categorical variables.

The formula for the (NaN-aware) Euclidean distance between two data points X and Y is:

Distance = sqrt(weight × Σ (X_i − Y_i)²), where the sum runs over the dimensions in which both values are present,

X_i and Y_i are the values of the i-th feature of data points X and Y, respectively, and

weight = (total # of coordinates) / (# of present coordinates).

An example to illustrate the calculation of Euclidean distance when dealing with NaN values:

Consider two data points, X and Y, with three features: A, B, and C.

X: [1, NaN, 3]
Y: [2, ·, 5]   (the value of feature B in Y is not needed, since B is excluded)

To calculate the Euclidean distance between X and Y, we ignore the missing value (NaN) in feature B and compute the distance using the available features:

Euclidean Distance = sqrt(weight × ((X_A − Y_A)² + (X_C − Y_C)²))

Applying the values from X and Y, with weight = 3/2 (three coordinates in total, two of them present):

Euclidean Distance = sqrt(3/2 × ((1 − 2)² + (3 − 5)²)) = sqrt(1.5 × 5) = sqrt(7.5) ≈ 2.74

In this case, the missing value in feature B does not contribute to the Euclidean distance calculation between X and Y. The distance is determined based on the available features A and C.
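The same weighted NaN-aware distance can be checked with scikit-learn's nan_euclidean_distances; the middle value of Y is arbitrary here, since feature B is ignored:

```python
# Checking the NaN-aware Euclidean distance with scikit-learn.
# The middle value of Y is arbitrary, since feature B is excluded anyway.
import numpy as np
from sklearn.metrics.pairwise import nan_euclidean_distances

X = np.array([[1.0, np.nan, 3.0]])
Y = np.array([[2.0, 4.0, 5.0]])

print(nan_euclidean_distances(X, Y))   # [[2.738...]] = sqrt(3/2 * ((1-2)^2 + (3-5)^2))
```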

What is Regression?

• Regression is a statistical modeling technique used to explore the relationship between a dependent variable and one or more independent variables. It is used to model and predict the value of the dependent variable based on the values of the independent variables.

• The goal of regression is to find a function that can accurately predict the value of the dependent variable based on the values of the independent variables.

Simple Linear Regression:
Simple linear regression is a type of regression analysis
that models the relationship between a dependent
variable and a single independent variable.
It assumes a linear relationship between the two variables
and uses a line to model the relationship.
The line is fitted to the data using a least squares regression
approach, which minimizes the sum of the squared
differences between the predicted values and the actual
values of the dependent variable.

Loss Functions

• In order to find the line of best fit in simple linear regression, we need to define a loss function that measures the difference between the predicted values and the actual values of the dependent variable.

• Two commonly used loss functions in simple linear regression are the Mean Squared Error (MSE) and the Mean Absolute Error (MAE):

MSE = (1/n) Σ_{i=1}^{n} (y_i − ŷ_i)²

MAE = (1/n) Σ_{i=1}^{n} |y_i − ŷ_i|
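A minimal NumPy sketch of both loss functions on invented actual and predicted values:

```python
# MSE and MAE on a small set of invented actual vs. predicted values.
import numpy as np

y_true = np.array([3.0, 5.0, 7.5, 9.0])
y_pred = np.array([2.8, 5.4, 7.0, 9.6])

mse = np.mean((y_true - y_pred) ** 2)      # mean of squared differences
mae = np.mean(np.abs(y_true - y_pred))     # mean of absolute differences
print(mse, mae)
```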
Loss Functions:
Loss functions are used in regression to quantify the
difference between the predicted values and the actual
values of the dependent variable.
The goal of regression is to minimize the loss function by
adjusting the parameters of the model.
The most commonly used loss function in simple linear
regression is the mean squared error (MSE), which is the
average of the squared differences between the predicted
values and the actual values.
Other loss functions include mean absolute error (MAE)
and root mean squared error (RMSE).

Polynomial Regression

• Polynomial regression is a type of regression where the relationship between the dependent variable and the independent variable(s) is modeled as an nth degree polynomial. This allows for a more complex relationship between the variables than simple linear regression.

Figure: Polynomial Regression
Polynomial Regression:

What is Polynomial Regression?

• In polynomial regression, we describe the relationship between the independent variable x and the dependent variable y using an nth-degree polynomial in x.

• Polynomial regression is a type of regression analysis that models the relationship between a dependent variable and one or more independent variables using a polynomial function.

• It allows for more complex relationships between the variables by fitting a curve to the data instead of a straight line.

Types of Polynomial Regression

A quadratic equation is the general term for a second-degree polynomial equation; the degree can go up to any nth value. Polynomial regression is categorized by degree:

Linear – if the degree is 1

Quadratic – if the degree is 2

Cubic – if the degree is 3, and so on, on the basis of degree.

• When the linear regression model fails to capture the points in the data and a straight line cannot adequately represent the relationship, we use polynomial regression (see the sketch below).
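A short scikit-learn sketch of polynomial regression, expanding the features with PolynomialFeatures and then fitting an ordinary linear model; the noisy quadratic data is generated purely for illustration:

```python
# Polynomial regression sketch: fit a degree-2 polynomial with scikit-learn
# by expanding the features and then applying ordinary linear regression.
# The data below is an invented noisy quadratic, purely for illustration.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
x = np.linspace(-3, 3, 30).reshape(-1, 1)
y = 0.5 * x.ravel() ** 2 - x.ravel() + rng.normal(scale=0.3, size=30)

model = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
model.fit(x, y)

print(model.predict([[2.0]]))   # predicted y at x = 2
```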

Introduction to Logistic Regression
Logistic regression is a popular method for binary
classification, where the goal is to predict the probability
of an input belonging to a particular class.
The logistic regression model uses a logistic function to
model the probability of an input belonging to the
positive class.
The logistic function maps any input to a value
between 0 and 1, which can be interpreted as a
probability.
p(y = 1 | x) = 1 / (1 + e^(−z))

where z = θ0 + θ1x1 + θ2x2 + · · · + θnxn is the linear combination of the input features x1, x2, . . . , xn and their corresponding coefficients θ1, θ2, . . . , θn.
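A minimal sketch of the logistic model in NumPy; the weights and feature values are invented:

```python
# The logistic model: the sigmoid maps the linear combination z to a
# probability between 0 and 1. The weights and features are invented.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

theta = np.array([0.5, -1.2, 2.0])   # [theta_0, theta_1, theta_2]
x = np.array([1.0, 0.3, 0.8])        # leading 1.0 acts as the bias/intercept term

z = theta @ x                        # z = theta_0 + theta_1*x_1 + theta_2*x_2
print(sigmoid(z))                    # p(y = 1 | x)
```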
Softmax Regression
• Softmax regression is a generalization of logistic regression that can be used for multi-class classification problems.

• The softmax function is used to compute the probability of each class, and the class with the highest probability is chosen as the predicted class.

• The softmax function outputs a probability distribution over the classes, and its outputs sum to 1.

p(y = i | x) = e^(z_i) / Σ_{j=1}^{K} e^(z_j)

where K is the number of classes, z_i is the linear combination of input features for class i, and the denominator sums over all classes.
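A minimal NumPy sketch of the softmax function on made-up class scores:

```python
# The softmax function turns a vector of class scores into probabilities.
# The scores below are invented.
import numpy as np

def softmax(z):
    z = z - np.max(z)          # subtract the max for numerical stability
    exp_z = np.exp(z)
    return exp_z / exp_z.sum()

logits = np.array([2.0, 1.0, 0.1])   # one score z_i per class
probs = softmax(logits)
print(probs, probs.sum())            # the probabilities sum to 1
```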
Cross Entropy Loss

• To train a logistic regression or softmax regression model, we need a cost function that measures the difference between the predicted probabilities and the true labels.

• The cross entropy loss is a popular choice for such a cost function.
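A minimal sketch of the cross entropy loss for a single multi-class example; for the binary case this reduces to −[y·log(p) + (1 − y)·log(1 − p)]. The label and probabilities below are invented:

```python
# Cross entropy for a single multi-class example:
# loss = -sum_i y_i * log(p_i), with y one-hot and p the predicted probabilities.
# The values below are invented.
import numpy as np

y_true = np.array([0.0, 1.0, 0.0])   # one-hot label: the true class is class 1
y_pred = np.array([0.2, 0.7, 0.1])   # predicted probability distribution

eps = 1e-12                          # guard against log(0)
loss = -np.sum(y_true * np.log(y_pred + eps))
print(loss)                          # -log(0.7) ≈ 0.357
```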

Solving classification problems with Naïve Bayes

How does the Naïve Bayes algorithm work?

• The Naive Bayes classifier algorithm is based on a famous theorem called Bayes' theorem:

P(A|B) = P(B|A) · P(A) / P(B)

• It can help us find simple yet powerful solutions to many problems ranging from text analysis to spam detection and much more.
Probability describes how likely an event is to happen: it is a value between 0 and 1, where values near 0 mean the event is less likely to happen and values near 1 mean it is more likely to happen.
Bayes' theorem is centered on conditional probability.

What is conditional probability?

• Conditional probability is the probability of an event 'A' happening given that another event 'B' has already happened.

• The Bayes theorem is an extension of conditional probability. It allows us, in a sense, to use reverse reasoning.
Understanding the Bayes theorem formula

• Prior probability P(A) – the probability of just 'A' occurring.

• Posterior probability P(A|B) – the probability of event 'A' given that event 'B' occurs.

• P(B|A) – the probability of event 'B' happening given that event 'A' has occurred.

• P(B) – the probability of just 'B' occurring.

What makes the Naïve Bayes algorithm naïve?

When the model calculates the conditional probability of one feature given a class,

• …it doesn't take into account the effect of any other feature.

• …it assumes that features are independent from each other.

• …it gives us the flexibility to describe the probability of each feature.
The algorithm's naïvety has some advantages and limitations.

Advantages:

• Quick and simple
• Produces good results with a small amount of training data
• Used for benchmarking of a model
• Works well with continuous data by discretizing it

Disadvantages:

• In most real-world situations some of the features are likely to be dependent on each other, which might cause wrong results.
Three types of Naïve Bayes classifiers in sklearn (a usage sketch follows below):

• Bernoulli Naïve Bayes – used when the data is binary, like true or false, yes or no, etc.
• Multinomial Naïve Bayes – used when there are discrete values, such as the number of family members or pages in a book.
• Gaussian Naïve Bayes – used when all features are continuous variables, like temperature or height.
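A short usage sketch with scikit-learn's GaussianNB on an invented two-feature dataset; BernoulliNB and MultinomialNB follow the same fit/predict pattern:

```python
# Gaussian Naïve Bayes sketch with scikit-learn on a tiny invented dataset
# of two continuous features (e.g. height and weight) and two classes.
import numpy as np
from sklearn.naive_bayes import GaussianNB

X = np.array([[170.0, 65.0],
              [180.0, 80.0],
              [155.0, 50.0],
              [160.0, 54.0]])      # continuous features
y = np.array([1, 1, 0, 0])         # class labels

clf = GaussianNB()
clf.fit(X, y)

print(clf.predict([[175.0, 70.0]]))        # predicted class
print(clf.predict_proba([[175.0, 70.0]]))  # class probabilities
```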
Reversing the condition

Example: Rahul's favorite breakfast is bagels and his favorite lunch is pizza. The probability of Rahul having bagels for breakfast is 0.6. The probability of him having pizza for lunch is 0.5. The probability of him having a bagel for breakfast given that he eats a pizza for lunch is 0.7.

Let's define event A as Rahul having a bagel for breakfast, and event B as Rahul having a pizza for lunch.

P(A) = 0.6
P(B) = 0.5
P(A|B) = 0.7

If we look at the numbers, the probability of having a bagel is different from the probability of having a bagel given that he has a pizza for lunch. This means that the probability of having a bagel is dependent on having a pizza for lunch.

Now what if we need to know the probability of having a pizza for lunch given that he had a bagel for breakfast, i.e. P(B|A)? Bayes' theorem now comes into the picture.

The Bayes theorem describes the probability of an event based on prior knowledge of the conditions that might be related to the event. If we know the conditional probability P(A|B), we can use Bayes' rule to find the reverse probability P(B|A).

• For the previous example, the probability of having a pizza for lunch given that he had a bagel for breakfast is:

P(B|A) = P(A|B) · P(B) / P(A) = 0.7 × 0.5 / 0.6 ≈ 0.583

Naïve Bayes is a popular algorithm for supervised learning, particularly for text classification problems. It's based on Bayes' theorem, which is a formula for calculating conditional probabilities.

• The basic idea behind Naïve Bayes is to calculate the probability of a particular class given some input features.

• This is done by calculating the conditional probabilities of each feature given the class, and then using Bayes' theorem to calculate the probability of the class given the features.

Bayesian Belief Networks

• Bayesian belief networks are a type of graphical model that can be used to represent probabilistic relationships between variables.

• Each variable is represented by a node in the network, and the edges between the nodes represent conditional dependencies.

• The basic idea behind Bayesian belief networks is to use Bayes' theorem to calculate the probabilities of each variable given the values of its parents in the network.

• This can be done by factorizing the joint probability distribution of all the variables into a product of conditional probabilities, as in the small sketch below.
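A minimal sketch of the factorization idea on an invented two-node network (Rain → WetGrass), with purely illustrative probabilities:

```python
# Sketch of the factorization idea on an invented two-node network
# Rain -> WetGrass: P(Rain, WetGrass) = P(Rain) * P(WetGrass | Rain).
# The probability values are purely illustrative.

p_rain = {True: 0.2, False: 0.8}                      # prior P(Rain)
p_wet_given_rain = {True: {True: 0.9, False: 0.1},    # P(WetGrass | Rain=True)
                    False: {True: 0.2, False: 0.8}}   # P(WetGrass | Rain=False)

def joint(rain, wet):
    """Joint probability from the factorized form."""
    return p_rain[rain] * p_wet_given_rain[rain][wet]

# P(Rain=True | WetGrass=True) via Bayes' theorem on the factorized joint.
p_wet = joint(True, True) + joint(False, True)
print(joint(True, True) / p_wet)   # 0.18 / 0.34 ≈ 0.529
```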

One common application of Bayesian belief networks is in medical diagnosis.

• The variables in the network might represent symptoms and diseases, and the edges represent the conditional dependencies between them.

• Given a set of observed symptoms, the network can be used to calculate the probability of each possible disease.

Naïve Bayes and Bayesian belief networks are powerful tools for probabilistic modeling and inference.

• Naïve Bayes is particularly useful for text classification problems, while Bayesian belief networks are useful for modeling complex dependencies between variables.

• With these tools, we can make more accurate predictions and better understand the relationships between different variables in our data.

47 /
CSE3087
51
49
50
51
Figure: the joint probability distribution of the example network.
Support Vector Machines (SVMs) are a popular machine
learning algorithm for classification and regression. They work
by finding the hyperplane that maximally separates the data
points in a high-dimensional feature space.
• The basic idea behind SVMs is to find the hyperplane that
maximizes the margin between the positive and negative data
points. The margin is the distance between the hyperplane and the
closest data points from either class. The hyperplane that maximizes
the margin is called the maximum margin hyperplane.
• SVMs can be extended to handle non-linearly separable data using
a technique called the kernel trick. The kernel trick involves
mapping the original feature space to a higher-dimensional space
using a non-linear function, and then finding the maximum margin
hyperplane in this new space.

Disadvantages:
• Increased computation
• Increased learning cost
• No rule of thumb that works for all types of data
• Applying a mapping function: the data is mapped from the original feature space into a higher-dimensional space where a linear separator can be found.

In practice, data is often not perfectly separable, and finding the maximum margin hyperplane is not always possible. In these cases, we can use a variant of SVM called the soft margin SVM.

• The soft margin SVM allows for some misclassifications, and introduces a slack variable that penalizes data points that are on the wrong side of the margin.

• The objective function for the soft margin SVM includes a regularization term that controls the tradeoff between maximizing the margin and minimizing the classification error.
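A short sketch with scikit-learn's SVC, where the C parameter controls the soft-margin tradeoff (smaller C tolerates more margin violations); the overlapping toy data is invented:

```python
# Soft margin sketch with scikit-learn's SVC: the C parameter controls the
# tradeoff between a wide margin and misclassification penalties.
# The tiny dataset here is invented and not perfectly separable.
import numpy as np
from sklearn.svm import SVC

X = np.array([[0.0, 0.0], [1.0, 1.0], [1.0, 0.0], [0.2, 0.1],
              [3.0, 3.0], [4.0, 4.0], [3.5, 3.0], [2.9, 3.2]])
y = np.array([0, 0, 0, 1, 1, 1, 1, 0])   # two overlapping classes

for C in (0.1, 10.0):
    clf = SVC(kernel="linear", C=C)       # smaller C tolerates more margin violations
    clf.fit(X, y)
    print(C, clf.n_support_)              # number of support vectors per class
```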

The kernel trick is a powerful technique for extending SVMs to
handle non-linearly separable data. The basic idea is to map the data
into a higher-dimensional feature space using a non-linear function,
and then apply the linear SVM algorithm to this new feature space.
• The key insight behind the kernel trick is that we can compute the dot
product between the mapped data points without explicitly computing
the mapping. This is done by defining a kernel function that takes two
data points as input and returns the dot product of their mapped
features.
• Some common kernel functions include the polynomial kernel, which
computes the dot product of two vectors raised to a certain power, and the
radial basis function kernel, which measures the similarity between two data
points based on their distance in the feature space.
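A minimal sketch of the kernel-trick insight: for the degree-2 polynomial kernel K(x, y) = (x·y)², the kernel value equals the dot product of explicitly mapped features, so the mapping never has to be computed; the vectors are invented:

```python
# Kernel trick sketch: the degree-2 polynomial kernel K(x, y) = (x . y)^2
# equals the dot product of explicitly mapped features, so the high-dimensional
# mapping never has to be computed. The vectors below are invented.
import numpy as np

def poly2_kernel(x, y):
    return np.dot(x, y) ** 2

def explicit_map(v):
    # Explicit degree-2 feature map for 2-D input: (v1^2, v2^2, sqrt(2)*v1*v2)
    return np.array([v[0] ** 2, v[1] ** 2, np.sqrt(2) * v[0] * v[1]])

x = np.array([1.0, 2.0])
y = np.array([3.0, 0.5])

print(poly2_kernel(x, y))                          # kernel value (16.0)
print(np.dot(explicit_map(x), explicit_map(y)))    # same value via explicit mapping
```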

Support Vector Machines are a powerful machine learning algorithm that can handle both linear and non-linearly separable data.

• The kernel trick allows us to extend SVMs to handle complex data, and the soft margin SVM allows us to handle data that is not perfectly separable.

• With these techniques, we can build powerful models for classification and regression tasks.
