
Course: Applied Machine Learning (CSE3087)

Module 1
Module-I: Supervised Learning [14 Sessions] [Bloom's Taxonomy Level Selected: Application]
An overview of Machine Learning (ML); ML workflow; types of ML; Types of features, Feature Engineering -
Data Imputation Methods; Regression – introduction; simple linear regression, loss functions; Polynomial
Regression; Logistic Regression; Softmax Regression with cross entropy as cost function;
Bayesian Learning – Bayes Theorem, estimating conditional probabilities for categorical and continuous
features, Naïve Bayes for supervised learning; Bayesian Belief networks; Support Vector Machines – soft margin
and kernel tricks.

Machine Learning
Machine learning (ML) is a type of artificial intelligence that enables
machines to learn and improve from experience without being explicitly
programmed.

ML algorithms can be divided into three main categories:


• Supervised Learning
• Unsupervised Learning
• Reinforcement Learning

• Supervised learning involves training a model on a labeled dataset, where each input is paired with its corresponding output value.

• Unsupervised learning involves training a model on an unlabeled dataset, where the goal is to discover patterns and structure in the data.

• Reinforcement learning involves training a model to make decisions by interacting with an environment and receiving feedback in the form of rewards or penalties.

ML Workflow

The ML workflow typically consists of the following steps:

• Data Collection
• Data Cleaning
• Feature Engineering
• Model Selection
• Model Training
• Model Evaluation
• Model Deployment

• Data collection involves gathering data from various sources, such as databases, APIs,
and web scraping.

• Data cleaning involves preprocessing the data to remove missing values, outliers,
and other anomalies.

• Feature engineering involves selecting and transforming the input features to improve the model’s performance. This can include techniques such as feature scaling, one-hot encoding, and dimensionality reduction.
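As a concrete illustration, here is a minimal scikit-learn sketch of feature scaling and one-hot encoding; the column names (age, income, color) and their values are made up for the example.

import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder

# Toy data frame; the columns and values are hypothetical.
df = pd.DataFrame({
    "age":    [25, 32, 47, 51],
    "income": [40000, 52000, 61000, 58000],
    "color":  ["red", "blue", "red", "green"],
})

preprocess = ColumnTransformer([
    ("scale",  StandardScaler(), ["age", "income"]),               # feature scaling
    ("onehot", OneHotEncoder(handle_unknown="ignore"), ["color"]), # one-hot encoding
])

X = preprocess.fit_transform(df)
print(X.shape)  # 4 rows; 2 scaled numeric columns + 3 one-hot columns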

• Model selection involves choosing the appropriate algorithm for
the task at hand. This can involve experimenting with different
algorithms and comparing their performance.

• Model training involves fitting the model to the training data using
an optimization algorithm such as gradient descent.

• Model evaluation involves testing the model on a held-out validation set to assess its performance.

• Model deployment involves integrating the trained model into a larger system or application.
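A hedged end-to-end sketch of the training and evaluation steps, using scikit-learn's bundled iris data as a stand-in dataset (any labeled dataset would do):

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

# Hold out a validation set for model evaluation.
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=0)

model = LogisticRegression(max_iter=1000)   # model selection: one candidate algorithm
model.fit(X_train, y_train)                 # model training (gradient-based optimizer)
print("validation accuracy:", model.score(X_val, y_val))   # model evaluation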

Types of Features

Features can be divided into two main categories:

• Categorical features (discrete values such as colors, types, and categories)
• Numerical features (continuous values such as measurements, counts, and percentages)

Features can also be divided into binary, ordinal, and interval/ratio types, depending on their characteristics and properties.

Feature Engineering
Feature engineering is the process of selecting and transforming the
input features to improve the model’s performance.
Some common feature engineering techniques include feature scaling,
one-hot encoding, and dimensionality reduction.
Feature scaling involves rescaling the numerical features to a common range so that features with large magnitudes do not dominate the model.

Data Transformation
Data transformation is a key step in a machine learning pipeline: it modifies the raw data and converts it into a format better suited to analysis and model training.

Image Data Transformation: Image data is typically represented as a matrix of pixel values, and a common transformation is to normalize the pixel values so they have a mean of zero and a standard deviation of one.

Text Data Transformation: A common technique is to represent the text as a bag-of-words, which involves creating a dictionary of all the unique words in the text and representing each document as a vector of word frequencies.
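A minimal bag-of-words sketch with scikit-learn; the two example sentences are invented for illustration:

from sklearn.feature_extraction.text import CountVectorizer

docs = ["the cat sat on the mat", "the dog sat on the log"]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)          # each document becomes a vector of word counts

print(vectorizer.get_feature_names_out())   # the dictionary of unique words
print(X.toarray())                          # word-frequency vectors, one row per document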

Numerical Data Transformation:
Numerical data typically requires scaling and normalization to
ensure that all the input features have a similar scale and range.

Data Imputation

Data imputation is a statistical technique used to fill in missing data points in a dataset.
There are several methods for data imputation, but one common approach is to use a statistical model to estimate the missing values based on the available data.

Methods of Data Imputation
Mean Imputation: Replace missing values with the mean of the non-missing values of the variable.

Regression Imputation: Predict the missing values using a regression model based on the other variables in the dataset.

Multiple Imputation: Create multiple imputed datasets and combine the results for analysis.

K-Nearest Neighbor (KNN) Imputation: Predict the missing values using the values of the nearest neighbors in the dataset.
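A short sketch of two of these methods with scikit-learn; the array and its NaN positions are made up:

import numpy as np
from sklearn.impute import SimpleImputer, KNNImputer

X = np.array([[1.0, 2.0],
              [np.nan, 4.0],
              [5.0, 6.0],
              [7.0, np.nan]])

print(SimpleImputer(strategy="mean").fit_transform(X))   # mean imputation
print(KNNImputer(n_neighbors=2).fit_transform(X))        # KNN imputation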

Example

Suppose we have a dataset with five data points with NaN values:

We want to impute the missing value using a simple linear regression model. The linear regression model is given by:

y = bx + a

where y is the dependent variable (the missing value we want to impute), x is the independent variable (in this case, the index of the data point), b is the slope of the line, and a is the y-intercept.

To estimate the missing value, we first need to fit the linear regression model to the available data. We can do this using the least squares method, which finds the values of a and b that minimize the sum of the squared differences between the predicted values and the actual values of the dependent variable. The least squares formulas for the slope b and the y-intercept a are:

b = Σ(x_i − x̄)(y_i − ȳ) / Σ(x_i − x̄)²
a = ȳ − b·x̄

where x̄ and ȳ are the means of the x and y values.
Find the value of y for x = 12 (using the fitted values a = 1.5 and b = 0.95):
y = 1.5 + (0.95 × 12) = 12.9
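A quick numerical check of this prediction, assuming the fitted intercept a = 1.5 and slope b = 0.95 from the example above:

a, b = 1.5, 0.95        # intercept and slope of the fitted line
x = 12
y = a + b * x
print(y)                # 12.9

# With the original data available, the coefficients could be fitted directly, e.g.:
# b, a = numpy.polyfit(x_values, y_values, deg=1)   # x_values, y_values are hypothetical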
K-Nearest Neighbor (KNN) Imputation

The KNN algorithm selects the k most similar observations to the one with the missing value and then takes the average or weighted average of the values of these observations to fill in the missing value.

It is referred to as multivariate because it considers multiple variables or features in the dataset to estimate the missing values. By leveraging the values of other variables, KNN imputation takes into account the relationships and patterns present in the data when imputing missing values.

Calculating K Nearest Neighbors with NaN Euclidean Distance:

When calculating the K nearest neighbors using Euclidean distance in the presence of missing values (NaN), special handling is required. Here is how it can be performed:

1. Identify the subset of data points that have non-missing values for the target feature.
2. Calculate the Euclidean distance between the data point with the missing value and each data point in the subset.
3. Exclude data points with missing values (NaN) in the features being compared.
4. Select the K data points with the smallest Euclidean distances as the nearest neighbors.

Which features does the KNN imputer take into account? Whatever columns are passed as X.
Because it relies on Euclidean distance calculations, the KNN imputer is not directly applicable to categorical variables.

The formula for the (NaN-aware) Euclidean distance between two data points X and Y is:

Distance = sqrt(weight × Σ(X_i − Y_i)²), for i = 1 to the number of dimensions

where X_i and Y_i are the values of the i-th feature of data points X and Y, respectively, and
weight = (total number of coordinates) / (number of coordinates present in both points).

An example to illustrate the calculation of Euclidean distance when dealing with NaN values: consider two data points, X and Y, with three features A, B, and C.

X: [1, NaN, 3]    Y: [2, 4, 5]

To calculate the Euclidean distance between X and Y, we ignore the missing value (NaN) in feature B and compute the distance using the available features:

Euclidean Distance = sqrt(weight × ((X_A − Y_A)² + (X_C − Y_C)²))

Applying the values from X and Y, with weight = 3/2 (3 coordinates in total, 2 present in both points):

Euclidean Distance = sqrt(3/2 × ((1 − 2)² + (3 − 5)²)) = sqrt(1.5 × (1 + 4)) = sqrt(7.5) ≈ 2.74

In this case, the missing value in feature B does not contribute to the Euclidean distance calculation between X and Y. The distance is determined based only on the available features A and C.
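The same NaN-aware distance can be checked with scikit-learn's nan_euclidean_distances helper:

import numpy as np
from sklearn.metrics.pairwise import nan_euclidean_distances

X = np.array([[1.0, np.nan, 3.0]])
Y = np.array([[2.0, 4.0, 5.0]])

# sqrt(3/2 * ((1-2)^2 + (3-5)^2)) = sqrt(7.5) ≈ 2.74
print(nan_euclidean_distances(X, Y))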

What is Regression?

Regression is a statistical modeling technique used to explore the relationship between a dependent variable and one or more independent variables. It is used to model and predict the value of the dependent variable based on the values of the independent variables.

The goal of regression is to find a function that can accurately predict the value of the dependent variable based on the values of the independent variables.

Simple Linear Regression:

Simple linear regression is a type of regression analysis that models the relationship between a dependent variable and a single independent variable.
It assumes a linear relationship between the two variables and uses a line to model the relationship.
The line is fitted to the data using a least squares regression approach, which minimizes the sum of the squared differences between the predicted values and the actual values of the dependent variable.

Simple Linear Regression
Simple linear regression is a type of regression where there is only one
independent variable. The goal of simple linear regression is to find the
line of best fit that minimizes the difference between the predicted values
and the actual values of the dependent variable.

Loss Functions

In order to find the line of best fit in simple linear regression, we need to
define a loss function that measures the difference between the predicted
values and the actual values of the dependent variable.
Two commonly used loss functions in simple linear regression are the
Mean Squared Error (MSE) and the Mean Absolute Error (MAE).

MSE = (1/n) Σ_{i=1}^{n} (y_i − ŷ_i)²

MAE = (1/n) Σ_{i=1}^{n} |y_i − ŷ_i|

Loss Functions:
Loss functions are used in regression to quantify the difference
between the predicted values and the actual values of the dependent
variable.
The goal of regression is to minimize the loss function by adjusting
the parameters of the model.
The most commonly used loss function in simple linear regression is
the mean squared error (MSE), which is the average of the squared
differences between the predicted values and the actual values.
Other loss functions include mean absolute error (MAE) and root
mean squared error (RMSE).
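A small numeric sketch of these loss functions; the true and predicted values below are made up:

import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.5, 6.0, 9.5])

mse = mean_squared_error(y_true, y_pred)    # mean squared error
mae = mean_absolute_error(y_true, y_pred)   # mean absolute error
rmse = np.sqrt(mse)                         # root mean squared error
print(mse, mae, rmse)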

Polynomial Regression

Polynomial regression is a type of regression where the relationship between the dependent variable and the independent variable(s) is modeled as an nth-degree polynomial. This allows for a more complex relationship between the variables than simple linear regression.

Figure: Polynomial Regression


Polynomial Regression:

What is Polynomial Regression? In polynomial regression, we describe the relationship between the independent variable x and the dependent variable y using an nth-degree polynomial in x.

Polynomial regression is a type of regression analysis that models the relationship between a dependent variable and one or more independent variables using a polynomial function. It allows for more complex relationships between the variables by fitting a curve to the data instead of a straight line.

Types of Polynomial Regression
A quadratic equation is a second-degree polynomial equation; the degree, however, can go up to any value n. Here is the categorization of polynomial regression by degree:

Linear – degree 1

Quadratic – degree 2

Cubic – degree 3, and so on, based on the degree.

When a linear regression model fails to capture the points in the data and a straight line cannot adequately represent the relationship, we use polynomial regression, as in the sketch below.
Introduction to Logistic Regression
Logistic regression is a popular method for binary classification,
where the goal is to predict the probability of an input belonging to
a particular class.
The logistic regression model uses a logistic function to model
the probability of an input belonging to the positive class.
The logistic function maps any input to a value between 0 and
1, which can be interpreted as a probability.
p(y = 1|x) = 1 / (1 + e^(−z))

where z = θ0 + θ1x1 + θ2x2 + · · · + θnxn is the linear combination of the input features x1, x2, . . . , xn and their corresponding coefficients θ1, θ2, . . . , θn.
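A minimal numeric sketch of this model; the coefficient values are hypothetical:

import numpy as np

def sigmoid(z):
    # the logistic function: maps any real z to a value in (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

theta = np.array([0.5, 1.2, -0.7])   # hypothetical coefficients theta0, theta1, theta2
x = np.array([1.0, 2.0, 3.0])        # leading 1.0 multiplies the intercept theta0

z = theta @ x                        # z = theta0 + theta1*x1 + theta2*x2
print(sigmoid(z))                    # estimated p(y = 1 | x), here about 0.69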

Softmax Regression
• Softmax regression is a generalization of logistic regression that can be used for
multi-class classification problems.

• The softmax function is used to compute the probability of each class, and the class with the highest probability is chosen as the predicted class.

• The softmax function outputs a probability distribution over the classes, and its outputs sum to 1.

p(y = i|x) = e^(z_i) / Σ_{j=1}^{K} e^(z_j)

where K is the number of classes, z_i is the linear combination of the input features for class i, and the denominator sums over all classes.
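A short numpy sketch of the softmax function; the class scores are invented:

import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))   # subtracting the max improves numerical stability
    return e / e.sum()

z = np.array([2.0, 1.0, 0.1])   # hypothetical scores z_i for K = 3 classes
p = softmax(z)
print(p, p.sum())               # roughly [0.66 0.24 0.10], and the sum is 1.0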
Cross Entropy Loss

To train a logistic regression or softmax regression model, we need a cost function that measures the difference between the predicted probabilities and the true labels.
The cross entropy loss is a popular choice for such a cost function.
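For a single example with one-hot true label y and predicted probabilities p, the cross entropy loss is L = −Σ_k y_k · log(p_k). A tiny sketch with made-up numbers:

import numpy as np

p = np.array([0.7, 0.2, 0.1])   # predicted class probabilities (made up)
y = np.array([1, 0, 0])         # one-hot true label: the correct class is class 0

loss = -np.sum(y * np.log(p))   # cross entropy for this example
print(loss)                     # -log(0.7) ≈ 0.357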

Solving classification problems with naïve Bayes

How does the naïve Bayes algorithm work?

• The naïve Bayes classifier algorithm is based on a famous theorem called Bayes' theorem:

P(A|B) = P(B|A) · P(A) / P(B)

• It can help us find simple yet powerful solutions to many problems, ranging from text analysis to spam detection and much more.

Probability describes how likely an event is to happen: a value between 0 and 1 represents the possibility of an event happening, with values close to 0 meaning the event is less likely to happen and values close to 1 meaning it is most likely to happen.
Bayes' theorem is centered on conditional probability.

What is conditional probability? Conditional probability is the probability of an event ‘A’ happening given that another event ‘B’ has already happened.

Bayes' theorem is an extension of conditional probability. It allows us, in a sense, to use reverse reasoning.
Understanding the Bayes theorem formula

Prior probability P(A) – the probability of just ‘A’ occurring.

Posterior probability P(A|B) – the probability of event ‘A’ given that event ‘B’ occurs.

P(B|A) – the probability of event ‘B’ happening given that event ‘A’ has occurred.

P(B) – the probability of just ‘B’ occurring.


What makes the naïve Bayes algorithm naïve?
When the model calculates the conditional probability of one feature
given a class,

…it doesn’t take into account the effect of any other feature.

…it assumes that features are independent from each other.

…it gives us the flexibility to describe the probability of each feature.


The algorithm’s naivety has some advantages and limitations.

Advantages:
• Quick and simple
• Produces good results with a small amount of training data
• Useful for benchmarking a model
• Works well with continuous data by discretizing

Disadvantages:
• In most real-world situations some of the features are likely to be dependent on each other, which might cause wrong results.
Three types of naïve Bayes classifiers in sklearn:

• Bernoulli Naïve Bayes – used when the data is binary, like true/false or yes/no.
• Multinomial Naïve Bayes – used when there are discrete counts, such as the number of family members or the number of pages in a book.
• Gaussian Naïve Bayes – used when all features are continuous variables, like temperature or height.
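A small sketch of the three variants; the feature matrices and labels below are toy values chosen only to match each variant's expected input type:

import numpy as np
from sklearn.naive_bayes import BernoulliNB, MultinomialNB, GaussianNB

y = np.array([0, 0, 1, 1])

X_binary = np.array([[1, 0], [1, 1], [0, 1], [0, 0]])                 # binary features
X_counts = np.array([[3, 0], [2, 1], [0, 4], [1, 5]])                 # count features
X_cont = np.array([[5.1, 1.2], [4.8, 1.0], [6.3, 2.1], [6.0, 2.4]])   # continuous features

print(BernoulliNB().fit(X_binary, y).predict([[1, 0]]))
print(MultinomialNB().fit(X_counts, y).predict([[2, 0]]))
print(GaussianNB().fit(X_cont, y).predict([[6.1, 2.2]]))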
Reversing the condition
Example: Rahul’s favorite breakfast is bagels and his favorite lunch is pizza. The probability of Rahul having bagels for breakfast is 0.6. The probability of him having pizza for lunch is 0.5. The probability of him having a bagel for breakfast given that he eats pizza for lunch is 0.7.
Let’s define event A as Rahul having a bagel for breakfast, and event B as Rahul having a pizza for lunch.
P(A) = 0.6
P(B) = 0.5
P(A|B) = 0.7

If we look at the numbers, the probability of having a bagel is different from the probability of having a bagel given he has pizza for lunch. This means that the probability of having a bagel is dependent on having pizza for lunch.

Now what if we need to know the probability of having a pizza given he had a bagel for breakfast, i.e. we need to know P(B|A)? Bayes' theorem now comes into the picture.

Bayes' theorem describes the probability of an event based on prior knowledge of the conditions that might be related to the event. If we know the conditional probability P(A|B), we can use Bayes' rule to find the reverse probability P(B|A).

For the previous example, the probability of having a pizza for lunch given that he had a bagel for breakfast is
P(B|A) = P(A|B) · P(B) / P(A) = 0.7 × 0.5 / 0.6 ≈ 0.583.

Naïve Bayes is a popular algorithm for supervised learning, particularly for text classification problems. It is based on Bayes' theorem, which is a formula for calculating conditional probabilities.

• The basic idea behind naïve Bayes is to calculate the probability of a particular class given some input features.
• This is done by calculating the conditional probabilities of each feature given the class, and then using Bayes' theorem to calculate the probability of the class given the features.

Bayesian Belief Networks

Bayesian belief networks are a type of graphical model that can be used to represent probabilistic relationships between variables.

• Each variable is represented by a node in the network, and the edges between the nodes represent conditional dependencies.
• The basic idea behind Bayesian belief networks is to use Bayes' theorem to calculate the probabilities of each variable given the values of its parents in the network.
• This can be done by factorizing the joint probability distribution of all the variables into a product of conditional probabilities, as in the expression below.
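As a generic sketch (not tied to a particular network from these slides): for variables X1, ..., Xn with parent sets Parents(Xi), the joint distribution factorizes as

P(X1, ..., Xn) = Π_{i=1}^{n} P(Xi | Parents(Xi))

For example, for a simple chain A → B → C this gives P(A, B, C) = P(A) · P(B|A) · P(C|B), so only small conditional probability tables are needed instead of one large joint table.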

One common application of Bayesian belief networks is in medical diagnosis.

• The variables in the network might represent symptoms and diseases, and the edges represent the conditional dependencies between them.
• Given a set of observed symptoms, the network can be used to calculate the probability of each possible disease.

Naïve Bayes and Bayesian belief networks are powerful tools for probabilistic modeling and inference.

• Naïve Bayes is particularly useful for text classification problems, while Bayesian belief networks are useful for modeling complex dependencies between variables.
• With these tools, we can make more accurate predictions and better understand the relationships between different variables in our data.

Figure: worked example of the joint probability distribution for a Bayesian belief network.
Support Vector Machines (SVMs) are a popular machine
learning algorithm for classification and regression. They work
by finding the hyperplane that maximally separates the data
points in a high-dimensional feature space.
• The basic idea behind SVMs is to find the hyperplane that maximizes the margin between the positive and negative data points. The margin is the distance between the hyperplane and the closest data points from either class. The hyperplane that maximizes the margin is called the maximum margin hyperplane.
• SVMs can be extended to handle non-linearly separable data using a technique called the kernel trick. The kernel trick involves mapping the original feature space to a higher-dimensional space using a non-linear function, and then finding the maximum margin hyperplane in this new space.
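A hedged scikit-learn sketch of an SVM classifier on toy 2-D data; the points, labels, and parameter values are illustrative only:

import numpy as np
from sklearn.svm import SVC

X = np.array([[1, 2], [2, 3], [3, 3], [6, 5], [7, 8], [8, 8]])
y = np.array([0, 0, 0, 1, 1, 1])

# C controls the soft margin (smaller C tolerates more violations);
# kernel="rbf" applies the kernel trick for non-linear decision boundaries.
clf = SVC(kernel="rbf", C=1.0)
clf.fit(X, y)
print(clf.predict([[2, 2], [7, 7]]))   # expected: class 0, then class 1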

• Applying a mapping function (figure)

Disadvantages:
• Increased computation
• Increased learning cost
• No rule of thumb for all types of data
In practice, data is often not perfectly separable, and finding the maximum margin hyperplane is not always possible. In these cases, we can use a variant of SVM called the soft margin SVM.

• The soft margin SVM allows for some misclassifications, and introduces a slack variable that penalizes data points that are on the wrong side of the margin.
• The objective function for the soft margin SVM includes a regularization term that controls the tradeoff between maximizing the margin and minimizing the classification error.

The kernel trick is a powerful technique for extending SVMs to handle
non-linearly separable data. The basic idea is to map the data into a
higher-dimensional feature space using a non-linear function, and then
apply the linear SVM algorithm to this new feature space.
• The key insight behind the kernel trick is that we can compute the dot
product between the mapped data points without explicitly computing the
mapping. This is done by defining a kernel function that takes two data
points as input and returns the dot product of their mapped features.
• Some common kernel functions include the polynomial kernel, which
computes the dot product of two vectors raised to a certain power, and the
radial basis function kernel, which measures the similarity between two data
points based on their distance in the feature space.
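A small numeric check of the kernel-trick idea for the degree-2 polynomial kernel: the kernel value (x·y + 1)² equals the dot product of an explicit degree-2 feature mapping, computed without ever building that mapping.

import numpy as np

def phi(v):
    # explicit degree-2 feature mapping of a 2-D point
    x1, x2 = v
    return np.array([1.0,
                     np.sqrt(2) * x1, np.sqrt(2) * x2,
                     x1 ** 2, np.sqrt(2) * x1 * x2, x2 ** 2])

x = np.array([1.0, 2.0])
y = np.array([3.0, 0.5])

print((x @ y + 1) ** 2)   # kernel value computed in the original 2-D space -> 25.0
print(phi(x) @ phi(y))    # the same value via the explicit 6-D mapping -> 25.0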

Support Vector Machines are a powerful machine learning
algorithm that can handle both linear and non-linearly
separable data.

• The kernel trick allows us to extend SVMs to handle complex data, and the soft margin SVM allows us to handle data that is not perfectly separable.
• With these techniques, we can build powerful models for classification and regression tasks.
