
UNIT -3 ITAI & ML

The document introduces Bayes Theorem, a fundamental concept in Machine Learning that calculates the probability of an event based on prior knowledge and is widely used in classification tasks. It explains the theorem's components, prerequisites, and its application in Naïve Bayes classifiers, highlighting both advantages and disadvantages. Additionally, it touches on the concept of Maximum Likelihood Estimation for parameter estimation in models used in Machine Learning.

Uploaded by

SRIKANTH KETHA


[INTRODUCTION TO ARTIFICIAL INTELLIGENCE & MACHINE LEARNING]

UNIT 3: BAYESIAN AND COMPUTATIONAL LEARNING

BAYES THEOREM
Machine Learning is one of the fastest-growing areas of Artificial Intelligence, and it is still a developing technology. It draws on many concepts, such as supervised learning, unsupervised learning, reinforcement learning, perceptron models and neural networks. This unit discusses another important concept used in Machine Learning, Bayes theorem: what exactly the theorem is, why it is used in Machine Learning, and examples of its use. So, let's start with a brief introduction to Bayes theorem.


Introduction to Bayes Theorem in Machine Learning

Bayes theorem is named after Thomas Bayes, an English statistician, philosopher and Presbyterian minister who lived in the 18th century. His ideas are used extensively in decision theory and in probability. Bayes theorem is also widely used in Machine Learning, where classes must be predicted accurately. Bayesian methods are used to calculate conditional probabilities in Machine Learning applications that include classification tasks. Further, a simplified form of Bayesian classification (Naïve Bayes classification) is used to reduce computation time and cost.

Bayes theorem is also known as Bayes rule or Bayes law. It helps to determine the probability of an event under uncertain knowledge: it is used to calculate the probability of one event occurring given that another event has already occurred. It relates conditional probability to marginal probability.

In simple words, Bayes theorem lets us update a probability estimate when new evidence arrives, which leads to more accurate results.

Bayes theorem provides a method for calculating conditional probability. The calculation is deceptively simple, yet it makes it easy to compute the conditional probability of events in cases where intuition often fails. Some data scientists assume that Bayes theorem is used mainly in the financial industry, but that is not the case. It is also applied extensively in health and medicine, research and surveys, the aeronautical sector, etc.


What is Bayes Theorem?

Bayes theorem is one of the most popular concepts in machine learning. It helps to calculate the probability of one event occurring, under uncertain knowledge, given that another event has already occurred.

Bayes' theorem can be derived using the product rule and the conditional probability of event X given event Y:

o According to the product rule, the joint probability of X and Y can be expressed using the probability of X given Y:

P(X ∩ Y) = P(X|Y) P(Y) {equation 1}

o Similarly, using the probability of Y given X:

P(X ∩ Y) = P(Y|X) P(X) {equation 2}

Equating the right-hand sides of both equations and dividing by P(Y) gives:

P(X|Y) = P(Y|X) P(X) / P(Y)

The above equation is called Bayes Rule or Bayes Theorem. It holds for any two events X and Y with P(Y) > 0; the events do not need to be independent.

o P(X|Y) is called the posterior, which we need to calculate. It is defined as the updated probability after considering the evidence.
o P(Y|X) is called the likelihood. It is the probability of the evidence when the hypothesis is true.
o P(X) is called the prior probability: the probability of the hypothesis before considering the evidence.
o P(Y) is called the marginal probability. It is defined as the probability of the evidence under any consideration.

Hence, Bayes Theorem can be written as:

posterior = likelihood * prior / evidence
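
The relation posterior = likelihood * prior / evidence can be sketched in a few lines of Python. This is only an illustration: the function name bayes_posterior and the three input probabilities are hypothetical and are not taken from any dataset in this unit.

```python
def bayes_posterior(likelihood, prior, evidence):
    """Return P(X|Y) = P(Y|X) * P(X) / P(Y)."""
    if evidence <= 0:
        raise ValueError("the evidence P(Y) must be positive")
    return likelihood * prior / evidence


# Hypothetical numbers, chosen only for illustration
p_y_given_x = 0.9  # likelihood P(Y|X)
p_x = 0.2          # prior P(X)
p_y = 0.3          # evidence P(Y)

posterior = bayes_posterior(p_y_given_x, p_x, p_y)
print(round(posterior, 3))
```

Note that the evidence P(Y) can never be smaller than P(Y|X) * P(X), so any set of inputs that breaks this would give a "probability" above 1.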

Prerequisites for Bayes Theorem

While studying Bayes theorem, we need to understand a few important concepts. These are as follows:

1. Experiment

An experiment is a planned operation carried out under controlled conditions, such as tossing a coin, drawing a card or rolling a die.

2. Sample Space

The results of an experiment are called outcomes, and the set of all possible outcomes of an experiment is known as the sample space. For example, if we are rolling a die, the sample space will be:

S1 = {1, 2, 3, 4, 5, 6}

Similarly, if our experiment is tossing a coin and recording the outcome, the sample space will be:

S2 = {Head, Tail}

3. Event

An event is defined as a subset of the sample space of an experiment. It is also called a set of outcomes.

Assume that in our experiment of rolling a die there are two events A and B such that:

A = Event when an even number is obtained = {2, 4, 6}

B = Event when a number greater than 4 is obtained = {5, 6}

o Probability of the event A, P(A) = Number of favourable outcomes / Total number of possible outcomes
P(A) = 3/6 = 1/2 = 0.5
o Similarly, probability of the event B, P(B) = Number of favourable outcomes / Total number of possible outcomes
P(B) = 2/6 = 1/3 = 0.333
o Union of events A and B:
A∪B = {2, 4, 5, 6}
o Intersection of events A and B:
A∩B = {6}
o Disjoint events: If the intersection of events A and B is the empty set, then the events are known as disjoint or mutually exclusive events.
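
The die example can be checked with Python sets. This is a minimal sketch that assumes a fair six-sided die, so that every outcome is equally likely; the helper name prob is illustrative.

```python
from fractions import Fraction

sample_space = {1, 2, 3, 4, 5, 6}
A = {n for n in sample_space if n % 2 == 0}  # even number
B = {n for n in sample_space if n > 4}       # number greater than 4


def prob(event):
    """Probability of an event when all outcomes are equally likely."""
    return Fraction(len(event), len(sample_space))


print("P(A) =", prob(A))
print("P(B) =", prob(B))
print("A union B =", sorted(A | B))
print("A intersect B =", sorted(A & B))
print("A and B disjoint?", A.isdisjoint(B))
```

Here A and B share the outcome 6, so they are not disjoint.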

4. Random Variable:

A random variable is a real-valued function that maps each outcome in the sample space of an experiment to a point on the real line. It takes on different values, each with some probability. Despite its name, it is neither random nor a variable; it is a function, and it can be discrete, continuous or a combination of both.

5. Exhaustive Event:

As the name suggests, a set of events is called exhaustive if at least one of the events must occur whenever the experiment is performed.

Thus, two events A and B are said to be exhaustive if either A or B definitely occurs, that is, A∪B is the whole sample space. For example, while tossing a coin the outcome is either a Head or a Tail, so these two events are exhaustive (and in this case also mutually exclusive).

6. Independent Event:

Two events are said to be independent when the occurrence of one event does not affect the occurrence of the other. In simple words, the probability of each event does not depend on the other.

Mathematically, two events A and B are said to be independent if:

P(A ∩ B) = P(AB) = P(A)*P(B)

7. Conditional Probability:

Conditional probability is defined as the probability of an event A, given that another event B has already occurred (i.e. A conditional on B). This is represented by P(A|B) and, provided P(B) > 0, we can define it as:

P(A|B) = P(A ∩ B) / P(B)

8. Marginal Probability:

Marginal probability is defined as the probability of an event A occurring irrespective of any other event B. In Bayes theorem it plays the role of the probability of the evidence under any consideration. It can be computed as:

P(A) = P(A|B)P(B) + P(A|~B)P(~B)

Here ~B represents the event that B does not occur.
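
The definitions of independence, conditional probability and marginal probability can be verified on the die events A = {2, 4, 6} and B = {5, 6} used earlier. The sketch below again assumes a fair die; the helper names P and P_given are illustrative.

```python
from fractions import Fraction

S = {1, 2, 3, 4, 5, 6}
A = {2, 4, 6}
B = {5, 6}
not_B = S - B


def P(event):
    """Probability of an event under equally likely outcomes."""
    return Fraction(len(event), len(S))


def P_given(event, given):
    """Conditional probability P(event | given) = P(event and given) / P(given)."""
    return P(event & given) / P(given)


p_a_given_b = P_given(A, B)

# Marginal probability: P(A) = P(A|B)P(B) + P(A|~B)P(~B)
p_a_total = P_given(A, B) * P(B) + P_given(A, not_B) * P(not_B)

# Independence check: P(A and B) == P(A) * P(B)
independent = P(A & B) == P(A) * P(B)

print(p_a_given_b, p_a_total, independent)
```

For these two events the conditional probability P(A|B) equals P(A), so A and B happen to be independent even though they are not disjoint.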

How to apply Bayes Theorem or Bayes rule in Machine Learning?

Bayes theorem helps us to calculate the single term P(B|A) in terms of P(A|B), P(B), and P(A). This rule is very helpful in scenarios where we have good estimates of P(A|B), P(B), and P(A) and need to determine the fourth term.

The Naïve Bayes classifier is one of the simplest applications of Bayes theorem. It is used in classification, where data must be assigned to classes accurately and quickly.

Let's understand the use of Bayes theorem in machine learning with

below example.

Suppose we have a vector A with m attributes:

A = (A1, A2, A3, ..., Am)

Further, we have n classes represented as C1, C2, C3, ..., Cn.

Given A, our classifier has to predict the best possible class for it. With the help of Bayes theorem, the probability of class Ci given A can be written as:

P(Ci|A) = [P(A|Ci) * P(Ci)] / P(A)

Here, P(A) does not depend on the class: it remains constant across all classes. Therefore, to maximize P(Ci|A), we only have to maximize the term P(A|Ci) * P(Ci).

If we further assume that each of the n classes is equally likely to be the right answer, we can say that:

P(C1) = P(C2) = P(C3) = P(C4) = ... = P(Cn)

In that case it is enough to maximize P(A|Ci), which reduces the computation cost as well as time.

Naïve Bayes simplifies the computation of P(A|C) further by assuming that the attributes are conditionally independent given the class:

P(A|C) = P(A1|C) * P(A2|C) * P(A3|C) * ... * P(Am|C)

This is how Bayes theorem plays a significant role in Machine Learning. The simplification is exact only when the independence assumption holds; otherwise it is an approximation. Hence, by using Bayes theorem we can build the probability of a complex event from the probabilities of smaller events.
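
The classification rule described above (pick the class that maximizes the prior times the product of the per-attribute likelihoods) can be sketched as follows. The function name, class names and all probability values are hypothetical and serve only to illustrate the calculation.

```python
from math import prod


def naive_bayes_predict(attribute_likelihoods, priors):
    """Pick the class C that maximizes P(C) * P(A1|C) * P(A2|C) * ...

    P(A) is ignored because it is the same for every class.
    """
    scores = {
        c: priors[c] * prod(attribute_likelihoods[c])
        for c in priors
    }
    best_class = max(scores, key=scores.get)
    return best_class, scores


# Hypothetical values: two equally likely classes, three attributes each
priors = {"C1": 0.5, "C2": 0.5}
attribute_likelihoods = {
    "C1": [0.8, 0.1, 0.5],  # P(A1|C1), P(A2|C1), P(A3|C1)
    "C2": [0.3, 0.6, 0.5],  # P(A1|C2), P(A2|C2), P(A3|C2)
}

best_class, scores = naive_bayes_predict(attribute_likelihoods, priors)
print(best_class, scores)
```

Because the priors are equal here, the decision depends only on the product of the likelihoods, as noted in the text.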

What is Naïve Bayes Classifier in Machine Learning

Naïve Bayes is a supervised learning algorithm which is based on Bayes theorem and used to solve classification problems. It is one of the simplest and most effective classification algorithms in Machine Learning, and it enables us to build ML models that make quick predictions. It is a probabilistic classifier, which means it predicts on the basis of the probability of an object belonging to each class. Some popular applications of Naïve Bayes are spam filtering, sentiment analysis, and classifying articles.

Advantages of Naïve Bayes Classifier in Machine Learning:

o It is one of the simplest and most effective methods for calculating conditional probability and for text classification problems.
o A Naïve Bayes classifier performs well compared with other models when the assumption of independent predictors holds true.
o It is easier to implement than most other models.
o It requires only a small amount of training data to estimate its parameters, which keeps the training time short.
o It can be used for Binary as well as Multi-class Classifications.

Disadvantages of Naïve Bayes Classifier in Machine Learning:

The main disadvantage of the Naïve Bayes classifier is its assumption of independent predictors: it implicitly assumes that all attributes are independent or unrelated, but in real life it is rarely possible to get mutually independent attributes.

MAXIMUM LIKELIHOOD
In this section I'll explain what the maximum likelihood method for parameter estimation is and go through a simple example to demonstrate the method. Some of the content requires knowledge of fundamental probability concepts such as the definition of joint probability and independence of events.

What are parameters?

Often in machine learning we use a model to describe the process that

results in the data that are observed. For example, we may use a
random forest model to classify whether customers may cancel a
subscription from a service (known as churn modelling) or we may use
a linear model to predict the revenue that will be generated for a
company depending on how much they may spend on advertising (this
would be an example of linear regression). Each model contains its
own set of parameters that ultimately defines what the model looks
like.

For a linear model we can write this as y = mx + c. In this example x could represent the advertising spend and y might be the revenue generated. m and c are parameters for this model. Different values for these parameters will give different lines (see figure below).

Figure: Three linear models with different parameter values.

So parameters define a blueprint for the model. It is only when specific

values are chosen for the parameters that we get an instantiation for
the model that describes a given phenomenon.
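
The idea that parameters are a blueprint and that specific values give one concrete model can be sketched in Python. The function name make_linear_model and the parameter values are hypothetical; they are not the values used in the figure.

```python
def make_linear_model(m, c):
    """Return the line y = m*x + c for fixed parameter values m and c."""
    def predict(x):
        return m * x + c
    return predict


# Two instantiations of the same blueprint (hypothetical parameter values)
model_a = make_linear_model(m=2.0, c=1.0)
model_b = make_linear_model(m=0.5, c=3.0)

advertising_spend = 4.0
print(model_a(advertising_spend), model_b(advertising_spend))
```

The same input gives different predicted revenue under the two models because their parameter values differ.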

Intuitive explanation of maximum likelihood estimation

Maximum likelihood estimation is a method that determines values for

the parameters of a model. The parameter values are found such that
they maximise the likelihood that the process described by the model
produced the data that were actually observed.


The above definition may still sound a little cryptic so let’s go through
an example to help understand this.

Let’s suppose we have observed 10 data points from some process. For
example, each data point could represent the length of time in seconds
that it takes a student to answer a specific exam question. These 10
data points are shown in the figure below

The 10 (hypothetical) data points that we have observed

We first have to decide which model we think best describes the process of generating the data. This part is very important. At the very least, we should have a good idea about which model to use. This usually comes from having some domain expertise but we won't discuss this here.


For these data we’ll assume that the data generation process can be
adequately described by a Gaussian (normal) distribution. Visual
inspection of the figure above suggests that a Gaussian distribution is
plausible because most of the 10 points are clustered in the middle
with few points scattered to the left and the right. (Making this sort of
decision on the fly with only 10 data points is ill-advised but given that
I generated these data points we’ll go with it).

Recall that the Gaussian distribution has 2 parameters: the mean, μ, and the standard deviation, σ. Different values of these parameters result in different curves (just like with the straight lines above). We want to know which curve was most likely responsible for creating the data points that we observed (see figure below). Maximum likelihood estimation is a method that will find the values of μ and σ that result in the curve that best fits the data.

Figure: The 10 data points and possible Gaussian distributions from which the data were drawn. f1 is normally distributed with mean 10 and variance 2.25 (variance is equal to the square of the standard deviation); this is also denoted f1 ∼ N(10, 2.25). f2 ∼ N(10, 9), f3 ∼ N(10, 0.25) and f4 ∼ N(8, 2.25). The goal of maximum likelihood is to find the parameter values that give the distribution that maximises the probability of observing the data.

The true distribution from which the data were generated was f1 ~
N(10, 2.25), which is the blue curve in the figure above.

Calculating the Maximum Likelihood Estimates

Now that we have an intuitive understanding of what maximum

likelihood estimation is we can move on to learning how to calculate
the parameter values. The values that we find are called the maximum
likelihood estimates (MLE).

Again we’ll demonstrate this with an example. Suppose we have three

data points this time and we assume that they have been generated
from a process that is adequately described by a Gaussian distribution.
These points are 9, 9.5 and 11. How do we calculate the maximum
likelihood estimates of the parameter values of the Gaussian
distribution μ and σ?

What we want to calculate is the total probability of observing all of the data, i.e. the joint probability distribution of all observed data points. To do this we would need to calculate some conditional probabilities, which can get very difficult. So it is here that we'll make our first assumption. The assumption is that each data point is generated independently of the others. This assumption makes the maths much easier. If the events (i.e. the process that generates the data) are independent, then the total probability of observing all of the data is the product of observing each data point individually (i.e. the product of the marginal probabilities).

The probability density of observing a single data point x that is generated from a Gaussian distribution is given by:

P(x; μ, σ) = [1 / (σ √(2π))] * exp(-(x - μ)² / (2σ²))

The semicolon used in the notation P(x; μ, σ) is there to emphasise that the symbols that appear after it are parameters of the probability distribution. So it shouldn't be confused with a conditional probability (which is typically represented with a vertical line e.g. P(A|B)).

In our example the total (joint) probability density of observing the three data points is given by:

P(9, 9.5, 11; μ, σ) = [1 / (σ √(2π))] * exp(-(9 - μ)² / (2σ²)) * [1 / (σ √(2π))] * exp(-(9.5 - μ)² / (2σ²)) * [1 / (σ √(2π))] * exp(-(11 - μ)² / (2σ²))

We just have to figure out the values of μ and σ that give the maximum value of the above expression.
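
As a rough illustration of "which parameter values make the data most probable", the joint density of the three points 9, 9.5 and 11 can be evaluated for a few candidate Gaussians. The sketch below uses the standard Gaussian density and, as an assumption made here for illustration, reuses the four candidates f1 to f4 from the earlier figure (given as mean and standard deviation); the source does not evaluate them on these three points.

```python
import math


def gaussian_pdf(x, mu, sigma):
    """Density of a Gaussian with mean mu and standard deviation sigma at x."""
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))


def joint_density(points, mu, sigma):
    """Joint density of independent points: the product of the individual densities."""
    return math.prod(gaussian_pdf(x, mu, sigma) for x in points)


points = [9, 9.5, 11]

# (mean, standard deviation); the standard deviation is the square root of the variance
candidates = {
    "f1": (10, 1.5),
    "f2": (10, 3.0),
    "f3": (10, 0.5),
    "f4": (8, 1.5),
}

densities = {name: joint_density(points, mu, sigma) for name, (mu, sigma) in candidates.items()}
best = max(densities, key=densities.get)

for name, value in densities.items():
    print(name, value)
print("most likely candidate:", best)
```

Comparing a handful of candidates is only a crude search; maximum likelihood estimation finds the best values of μ and σ over all possibilities.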

If you've covered calculus in your maths classes then you'll probably be aware that there is a technique that can help us find maxima (and minima) of functions. It's called differentiation. All we have to do is find the derivative of the function, set the derivative to zero and then rearrange the equation to make the parameter of interest the subject of the equation. And voilà, we'll have our MLE values for our parameters. I'll go through these steps now but I'll assume that the reader knows how to perform differentiation on common functions.

The log likelihood

The above expression for the total probability is actually quite a pain
to differentiate, so it is almost always simplified by taking the natural
logarithm of the expression. This is absolutely fine because the natural
logarithm is a monotonically increasing function. This means that if the
value on the x-axis increases, the value on the y-axis also increases (see
figure below). This is important because it ensures that the maximum
value of the log of the probability occurs at the same point as the
original probability function. Therefore we can work with the simpler
log-likelihood instead of the original likelihood.
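
The claim that the likelihood and the log-likelihood peak at the same point can be checked numerically. The sketch below searches a grid of candidate means for the three data points 9, 9.5 and 11; fixing σ = 1 and using a grid step of 0.01 are assumptions made only for this illustration.

```python
import math

points = [9, 9.5, 11]
sigma = 1.0  # held fixed for this illustration


def likelihood(mu):
    """Joint Gaussian density of the points for a given mean."""
    return math.prod(
        math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))
        for x in points
    )


def log_likelihood(mu):
    """Natural logarithm of the likelihood."""
    return math.log(likelihood(mu))


# Candidate means from 8.00 to 12.00 in steps of 0.01
grid = [8 + 0.01 * k for k in range(401)]

best_by_likelihood = max(grid, key=likelihood)
best_by_log_likelihood = max(grid, key=log_likelihood)

print(best_by_likelihood, best_by_log_likelihood)
```

Both searches select the same grid point, which is the grid value closest to the sample mean of the three points.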

Figure: Monotonic behaviour of the original function y = x (left) and the (natural) logarithm function y = ln(x) (right). These functions are both monotonic because as you go from left to right on the x-axis the y value always increases.

Figure: Example of a non-monotonic function: as you go from left to right on the graph the value of f(x) goes up, then goes down and then goes back up again.

Taking logs of the original expression gives us:

ln P(9, 9.5, 11; μ, σ) = ln[1 / (σ √(2π))] - (9 - μ)² / (2σ²) + ln[1 / (σ √(2π))] - (9.5 - μ)² / (2σ²) + ln[1 / (σ √(2π))] - (11 - μ)² / (2σ²)

This expression can be simplified again using the laws of logarithms to obtain:

ln P(9, 9.5, 11; μ, σ) = -3 ln(σ) - (3/2) ln(2π) - [(9 - μ)² + (9.5 - μ)² + (11 - μ)²] / (2σ²)

This expression can be differentiated to find the maximum. In this example we'll find the MLE of the mean, μ. To do this we take the partial derivative of the function with respect to μ, giving

∂ ln P(9, 9.5, 11; μ, σ) / ∂μ = [(9 - μ) + (9.5 - μ) + (11 - μ)] / σ² = (9 + 9.5 + 11 - 3μ) / σ²

Finally, setting the left hand side of the equation to zero and then rearranging for μ gives:

μ = (9 + 9.5 + 11) / 3 = 9.833

And there we have our maximum likelihood estimate for μ. We can do the same thing with σ too but I'll leave that as an exercise for the keen reader.
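
The result for μ can be checked numerically. The sketch below computes the sample mean of the three points and, going one step beyond the text, also uses the standard closed-form MLE for σ of a Gaussian (the square root of the average squared deviation from the mean, dividing by n rather than n - 1). It then confirms that the log-likelihood is lower at a few nearby parameter values; the offsets of 0.1 are arbitrary.

```python
import math

points = [9, 9.5, 11]
n = len(points)

# Closed-form maximum likelihood estimates for a Gaussian
mu_hat = sum(points) / n
sigma_hat = math.sqrt(sum((x - mu_hat) ** 2 for x in points) / n)


def log_likelihood(mu, sigma):
    """Log of the joint Gaussian density of the points."""
    return sum(
        -math.log(sigma) - 0.5 * math.log(2 * math.pi) - (x - mu) ** 2 / (2 * sigma ** 2)
        for x in points
    )


best = log_likelihood(mu_hat, sigma_hat)
nearby = [
    log_likelihood(mu_hat + d_mu, sigma_hat + d_sigma)
    for d_mu in (-0.1, 0.1)
    for d_sigma in (-0.1, 0.1)
]

print(round(mu_hat, 3), round(sigma_hat, 3))
print(all(best > value for value in nearby))
```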

MINIMUM DESCRIPTION LENGTH PRINCIPLE

Minimum Description Length (MDL) is a model selection principle
where the shortest description of the data is the best model. MDL
methods learn through a data compression perspective and are
sometimes described as mathematical applications of Occam's razor.
The MDL principle can be extended to other forms of inductive
inference and learning, for example to estimation and sequential
prediction, without explicitly identifying a single model of the data.
MDL has its origins mostly in information theory and has been further
developed within the general fields of statistics, theoretical computer
science and machine learning, and more narrowly computational
learning theory.

Historically, there are different, yet interrelated, usages of the definite noun phrase "the minimum description length principle" that vary in what is meant by description:

o Within Jorma Rissanen's theory of learning, a central concept of information theory, models are statistical hypotheses and descriptions are defined as universal codes.
o Rissanen's 1978[1] pragmatic first attempt to automatically derive short descriptions relates to the Bayesian Information Criterion (BIC).
o Within Algorithmic Information Theory, the description length of a data sequence is the length of the smallest program that outputs that data set. In this context, it is also known as the 'idealized' MDL principle and it is closely related to Solomonoff's theory of inductive inference, which is that the best model of a data set is represented by its shortest self-extracting archive.
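
The principle of preferring the shortest description can be illustrated with a small, simplified sketch. It compares two ways of describing a binary sequence: a model with no parameters that spends 1 bit per symbol, and a two-part code that first describes an estimated probability and then the data given that probability. Charging (1/2) log2(n) bits for the parameter is a common crude convention assumed here, not something stated in the text, and all function names and sequences are hypothetical.

```python
import math


def entropy_bits(p):
    """Binary entropy in bits; 0 when p is 0 or 1."""
    if p in (0.0, 1.0):
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))


def length_fair_coin(bits):
    """Model with no free parameters: every symbol costs exactly 1 bit."""
    return float(len(bits))


def length_biased_coin(bits):
    """Two-part code: bits for the parameter plus bits for the data given it."""
    n = len(bits)
    p_hat = sum(bits) / n
    parameter_cost = 0.5 * math.log2(n)  # assumed cost of stating p_hat
    data_cost = n * entropy_bits(p_hat)
    return parameter_cost + data_cost


def mdl_choice(bits):
    """Return the name of the model giving the shorter total description."""
    lengths = {"fair": length_fair_coin(bits), "biased": length_biased_coin(bits)}
    return min(lengths, key=lengths.get), lengths


skewed = [1] * 90 + [0] * 10   # mostly ones
balanced = [1, 0] * 50         # equal numbers of ones and zeros

print(mdl_choice(skewed))
print(mdl_choice(balanced))
```

For the skewed sequence the extra parameter pays for itself through better compression; for the balanced sequence it does not, so the simpler model gives the shorter description.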

GIBBS ALGORITHM
In statistical mechanics, the Gibbs algorithm, introduced by J. Willard Gibbs in 1902, is a criterion for choosing a probability distribution for the statistical ensemble of microstates of a thermodynamic system by minimizing the average log probability

⟨ln p_i⟩ = Σ_i p_i ln p_i

subject to the probability distribution p_i satisfying a set of constraints (usually expectation values) corresponding to the known macroscopic quantities. In 1948, Claude Shannon interpreted the negative of this quantity, which he called information entropy, as a measure of the uncertainty in a probability distribution.[1] In 1957, E.T. Jaynes realized that this quantity could be interpreted as missing information about anything, and generalized the Gibbs algorithm to non-equilibrium systems with the principle of maximum entropy and maximum entropy thermodynamics.

Physicists call the result of applying the Gibbs algorithm the Gibbs distribution for the given constraints, most notably Gibbs's grand canonical ensemble for open systems when the average energy and the average number of particles are given. (See also partition function.)

The general result of the Gibbs algorithm is then a maximum entropy probability distribution. Statisticians identify such distributions as belonging to exponential families.
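
For a single constraint on the average energy, the distribution that results from this procedure has the well-known exponential form p_i proportional to exp(-β E_i), where β is chosen so that the constraint is met. The sketch below finds β by bisection for a toy system; the three energy levels, the target average energy and the search interval are hypothetical values chosen for illustration.

```python
import math


def gibbs_distribution(energies, beta):
    """Probabilities p_i proportional to exp(-beta * E_i)."""
    weights = [math.exp(-beta * e) for e in energies]
    z = sum(weights)  # normalizing constant (the partition function)
    return [w / z for w in weights]


def mean_energy(energies, beta):
    """Average energy under the Gibbs distribution for this beta."""
    return sum(p * e for p, e in zip(gibbs_distribution(energies, beta), energies))


def solve_beta(energies, target, lo=-50.0, hi=50.0, iterations=200):
    """Bisection on beta; the mean energy decreases as beta increases."""
    for _ in range(iterations):
        mid = (lo + hi) / 2
        if mean_energy(energies, mid) > target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


energies = [0.0, 1.0, 2.0]  # hypothetical energy levels
target = 0.5                # hypothetical known average energy

beta = solve_beta(energies, target)
p = gibbs_distribution(energies, beta)
entropy = -sum(p_i * math.log(p_i) for p_i in p)

print(beta, p, entropy)
```

Among all distributions over these three levels with average energy 0.5, this one has the largest entropy, which is the same as the smallest average log probability.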
NAÏVE BAYES CLASSIFIER
o Naïve Bayes algorithm is a supervised learning algorithm, which is based on Bayes theorem and used for solving classification problems.
o It is mainly used in text classification, which involves high-dimensional training datasets.
o Naïve Bayes Classifier is one of the simplest and most effective classification algorithms, and it helps in building fast machine learning models that can make quick predictions.
o It is a probabilistic classifier, which means it predicts on the basis of the probability of an object belonging to each class.
o Some popular applications of the Naïve Bayes algorithm are spam filtering, sentiment analysis, and classifying articles.

Why is it called Naïve Bayes?

The name Naïve Bayes is made up of two words, Naïve and Bayes, which can be described as:

o Naïve: It is called Naïve because it assumes that the occurrence of a certain feature is independent of the occurrence of other features. For example, if a fruit is identified on the basis of color, shape, and taste, then a red, spherical, and sweet fruit is recognized as an apple. Each feature individually contributes to identifying it as an apple, without depending on the others.
o Bayes: It is called Bayes because it depends on the principle of Bayes' Theorem.


Bayes' Theorem:

o Bayes' theorem is also known as Bayes' Rule or Bayes' law, which is used to determine the probability of a hypothesis with prior knowledge. It depends on the conditional probability.
o The formula for Bayes' theorem is given as:

P(A|B) = P(B|A) * P(A) / P(B)

Where,

P(A|B) is Posterior probability: Probability of hypothesis A given the observed event B.

P(B|A) is Likelihood probability: Probability of the evidence given that the hypothesis is true.

P(A) is Prior Probability: Probability of the hypothesis before observing the evidence.

P(B) is Marginal Probability: Probability of the evidence.

Working of Naïve Bayes' Classifier:

The working of the Naïve Bayes classifier can be understood with the help of the example below.

Suppose we have a dataset of weather conditions and a corresponding target variable "Play". Using this dataset, we need to decide whether we should play or not on a particular day according to the weather conditions. To solve this problem, we follow the steps below:

1. Convert the given dataset into frequency tables.
2. Generate the likelihood table by finding the probabilities of the given features.
3. Use Bayes theorem to calculate the posterior probability.

Problem: If the weather is sunny, should the player play or not?

Solution: To solve this, first consider the dataset below:

Index   Outlook    Play
0       Rainy      Yes
1       Sunny      Yes
2       Overcast   Yes
3       Overcast   Yes
4       Sunny      No
5       Rainy      Yes
6       Sunny      Yes
7       Overcast   Yes
8       Rainy      No
9       Sunny      No
10      Sunny      Yes
11      Rainy      No
12      Overcast   Yes
13      Overcast   Yes

Frequency table for the Weather Conditions:

Weather     Yes   No
Overcast    5     0
Rainy       2     2
Sunny       3     2
Total       10    4

Likelihood table for the weather conditions:

Weather     No             Yes             P(Weather)
Overcast    0              5               5/14 = 0.36
Rainy       2              2               4/14 = 0.29
Sunny       2              3               5/14 = 0.36
All         4/14 = 0.29    10/14 = 0.71

Applying Bayes' theorem:

P(Yes|Sunny) = P(Sunny|Yes) * P(Yes) / P(Sunny)

P(Sunny|Yes) = 3/10 = 0.3

P(Sunny) = 5/14 = 0.36

P(Yes) = 10/14 = 0.71

So P(Yes|Sunny) = (3/10) * (10/14) / (5/14) = 3/5 = 0.60

P(No|Sunny) = P(Sunny|No) * P(No) / P(Sunny)

P(Sunny|No) = 2/4 = 0.5

P(No) = 4/14 = 0.29

P(Sunny) = 5/14 = 0.36

So P(No|Sunny) = (2/4) * (4/14) / (5/14) = 2/5 = 0.40

As we can see from the above calculation, P(Yes|Sunny) > P(No|Sunny).

Hence on a Sunny day, the player can play the game.
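
The same calculation can be reproduced directly from the 14-row dataset. The sketch below is an illustration using only the Python standard library; the function name posterior is hypothetical. Working in exact fractions avoids rounding and gives 3/5 for Yes and 2/5 for No, which sum to 1.

```python
from collections import Counter
from fractions import Fraction

# The 14 rows of the weather dataset, in index order
outlook = ["Rainy", "Sunny", "Overcast", "Overcast", "Sunny", "Rainy", "Sunny",
           "Overcast", "Rainy", "Sunny", "Sunny", "Rainy", "Overcast", "Overcast"]
play = ["Yes", "Yes", "Yes", "Yes", "No", "Yes", "Yes",
        "Yes", "No", "No", "Yes", "No", "Yes", "Yes"]

n = len(play)
class_count = Counter(play)                 # frequency of Yes / No
weather_count = Counter(outlook)            # frequency of each outlook
joint_count = Counter(zip(outlook, play))   # frequency table of (outlook, play)


def posterior(label, weather):
    """P(label | weather) = P(weather | label) * P(label) / P(weather)."""
    likelihood = Fraction(joint_count[(weather, label)], class_count[label])
    prior = Fraction(class_count[label], n)
    evidence = Fraction(weather_count[weather], n)
    return likelihood * prior / evidence


p_yes = posterior("Yes", "Sunny")
p_no = posterior("No", "Sunny")
print(p_yes, p_no)
print("Play" if p_yes > p_no else "Do not play")
```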

Advantages of Naïve Bayes Classifier:

o Naïve Bayes is one of the fastest and easiest ML algorithms for predicting the class of a dataset.
o It can be used for Binary as well as Multi-class Classifications.
o It performs well in Multi-class predictions as compared to many other algorithms.
o It is a popular choice for text classification problems.

Disadvantages of Naïve Bayes Classifier:

o Naive Bayes assumes that all features are independent or

unrelated, so it cannot learn the relationship between features.

Applications of Naïve Bayes Classifier:

o It is used for Credit Scoring.
o It is used in medical data classification.
o It can be used in real-time predictions because Naïve Bayes Classifier is an eager learner.
o It is used in Text classification such as Spam filtering and Sentiment analysis.

Types of Naïve Bayes Model:

There are three common types of Naive Bayes Model, which are given below:

o Gaussian: The Gaussian model assumes that features follow a normal distribution. This means that if predictors take continuous values instead of discrete ones, the model assumes that these values are sampled from a Gaussian distribution.
o Multinomial: The Multinomial Naïve Bayes classifier is used when the data is multinomially distributed. It is primarily used for document classification problems, that is, deciding which category (such as Sports, Politics or Education) a particular document belongs to. The classifier uses the frequency of words as the predictors.
o Bernoulli: The Bernoulli classifier works similarly to the Multinomial classifier, but the predictor variables are independent Boolean variables, such as whether a particular word is present or not in a document. This model is also well known for document classification tasks.
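
To make the Bernoulli variant concrete, here is a minimal pure-Python sketch in which each feature records only whether a word is present in a document. The tiny corpus, the function names and the use of add-one (Laplace) smoothing are assumptions made for this illustration; they are not part of the text above, and a library implementation would normally be used in practice.

```python
import math


def train_bernoulli_nb(docs, labels, vocabulary):
    """Estimate class priors and per-class word-presence probabilities."""
    priors, word_probs = {}, {}
    for c in sorted(set(labels)):
        class_docs = [set(d) for d, y in zip(docs, labels) if y == c]
        priors[c] = len(class_docs) / len(docs)
        # Add-one smoothing (an assumption) keeps every probability strictly between 0 and 1
        word_probs[c] = {
            w: (sum(w in d for d in class_docs) + 1) / (len(class_docs) + 2)
            for w in vocabulary
        }
    return priors, word_probs


def predict_bernoulli_nb(doc, priors, word_probs, vocabulary):
    """Return the class with the highest log posterior score."""
    present = set(doc)
    scores = {}
    for c in priors:
        score = math.log(priors[c])
        for w in vocabulary:
            p = word_probs[c][w]
            # Absent words also contribute, through the factor (1 - p)
            score += math.log(p if w in present else 1 - p)
        scores[c] = score
    return max(scores, key=scores.get)


# Hypothetical toy corpus
docs = [["win", "money", "now"], ["meeting", "at", "noon"],
        ["win", "prize"], ["project", "meeting"]]
labels = ["spam", "ham", "spam", "ham"]
vocabulary = sorted({w for d in docs for w in d})

priors, word_probs = train_bernoulli_nb(docs, labels, vocabulary)
prediction = predict_bernoulli_nb(["win", "money"], priors, word_probs, vocabulary)
print(prediction)
```

Summing logarithms instead of multiplying probabilities gives the same ranking of classes while avoiding very small numbers.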

Python Implementation of the Naïve Bayes algorithm:

Now we will implement a Naive Bayes Algorithm using Python. So for

this, we will use the "user_data" dataset, which we have used in our
other classification model. Therefore we can easily compare the Naive
Bayes model with the other models.

Steps to implement:

o Data Pre-processing step
o Fitting Naive Bayes to the Training set
o Predicting the test result
o Test accuracy of the result (Creation of Confusion matrix)
o Visualizing the test set result.

1) Data Pre-processing step:

In this step, we will pre-process/prepare the data so that we can use it efficiently in our code. It is similar to what we did in data pre-processing. The code for this is given below:

    # Importing the libraries
    import numpy as nm
    import matplotlib.pyplot as mtp
    import pandas as pd

    # Importing the dataset
    dataset = pd.read_csv('user_data.csv')
    x = dataset.iloc[:, [2, 3]].values  # feature columns at positions 2 and 3
    y = dataset.iloc[:, 4].values       # target column at position 4

    # Splitting the dataset into the Training set and Test set
    from sklearn.model_selection import train_test_split
    x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.25, random_state = 0)

    # Feature Scaling (fit on the training set only, then apply to the test set)
    from sklearn.preprocessing import StandardScaler
    sc = StandardScaler()
    x_train = sc.fit_transform(x_train)
    x_test = sc.transform(x_test)

In the above code, we have loaded the dataset into our program using "dataset = pd.read_csv('user_data.csv')". The loaded dataset is divided into training and test sets, and then we have scaled the feature variables.


The output for the dataset is given as:

2) Fitting Naive Bayes to the Training Set:

After the pre-processing step, now we will fit the Naive Bayes model
to the Training set. Below is the code for it:

    # Fitting Naive Bayes to the Training set
    from sklearn.naive_bayes import GaussianNB
    classifier = GaussianNB()
    classifier.fit(x_train, y_train)

In the above code, we have used the GaussianNB classifier to fit it to

the training dataset. We can also use other classifiers as per our
requirement.

Output:

Out[6]: GaussianNB(priors=None, var_smoothing=1e-09)
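To make clearer what fitting a Gaussian Naive Bayes model involves, here is a small from-scratch sketch. It is not scikit-learn's implementation: the function names, the constant eps, and the toy data are assumptions made for illustration. For each class it estimates a prior plus a mean and variance per feature, then picks the class with the largest log posterior.

```python
import math
from collections import defaultdict

def fit_gaussian_nb(X, y, eps=1e-9):
    """Estimate the prior and the per-feature mean/variance of each class."""
    by_class = defaultdict(list)
    for row, label in zip(X, y):
        by_class[label].append(row)
    model = {}
    for label, rows in by_class.items():
        n = len(rows)
        columns = list(zip(*rows))
        means = [sum(col) / n for col in columns]
        # eps is a small constant that keeps every variance strictly positive.
        variances = [sum((v - m) ** 2 for v in col) / n + eps
                     for col, m in zip(columns, means)]
        model[label] = (n / len(X), means, variances)
    return model

def predict_one(model, x):
    """Return the class with the highest log posterior (up to a constant)."""
    def log_posterior(prior, means, variances):
        total = math.log(prior)
        for v, m, var in zip(x, means, variances):
            total += -0.5 * math.log(2 * math.pi * var) - (v - m) ** 2 / (2 * var)
        return total
    return max(model, key=lambda label: log_posterior(*model[label]))

# Hypothetical scaled (age, salary) pairs with labels 0 = not purchased, 1 = purchased.
X = [[-1.0, -1.2], [-0.8, -0.9], [-1.1, -0.7], [1.0, 0.9], [1.2, 1.1], [0.9, 1.3]]
y = [0, 0, 0, 1, 1, 1]

model = fit_gaussian_nb(X, y)
print(predict_one(model, [-0.9, -1.0]), predict_one(model, [1.1, 1.0]))
```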

3) Prediction of the test set result:

Now we will predict the test set result. For this, we will create a new
predictor variable y_pred, and will use the predict function to make
the predictions.

# Predicting the Test set results
y_pred = classifier.predict(x_test)

Output:

The above output shows the prediction vector y_pred alongside the
real vector y_test. Some predictions differ from the real values;
these are the incorrect predictions.

4) Creating Confusion Matrix:

Now we will check the accuracy of the Naive Bayes classifier using the
Confusion matrix. Below is the code for it:

# Making the Confusion Matrix
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)

Output:

As we can see in the above confusion matrix output, there are 7+3=
10 incorrect predictions, and 65+25=90 correct predictions.
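The arithmetic above can be checked with a short sketch. The four counts are the ones quoted in the text; their placement in the 2x2 grid (rows as actual class, columns as predicted class) is an assumption, since the original output image is not reproduced here.

```python
# Counts quoted in the text; the grid layout is assumed
# (rows = actual class, columns = predicted class).
cm = [[65, 3],
      [7, 25]]

correct = sum(cm[i][i] for i in range(len(cm)))   # diagonal entries
total = sum(sum(row) for row in cm)
incorrect = total - correct
accuracy = correct / total

print(correct, incorrect, accuracy)
```

With these counts the classifier is right on 90 of the 100 test samples, i.e. an accuracy of 0.9.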

5) Visualizing the training set result:

Next we will visualize the training set result using Naïve Bayes
Classifier. Below is the code for it:

# Visualising the Training set results
from matplotlib.colors import ListedColormap
x_set, y_set = x_train, y_train
X1, X2 = nm.meshgrid(nm.arange(start = x_set[:, 0].min() - 1, stop = x_set[:, 0].max() + 1, step = 0.01),
                     nm.arange(start = x_set[:, 1].min() - 1, stop = x_set[:, 1].max() + 1, step = 0.01))
mtp.contourf(X1, X2, classifier.predict(nm.array([X1.ravel(), X2.ravel()]).T).reshape(X1.shape),
             alpha = 0.75, cmap = ListedColormap(('purple', 'green')))
mtp.xlim(X1.min(), X1.max())
mtp.ylim(X2.min(), X2.max())
for i, j in enumerate(nm.unique(y_set)):
    mtp.scatter(x_set[y_set == j, 0], x_set[y_set == j, 1],
                c = ListedColormap(('purple', 'green'))(i), label = j)
mtp.title('Naive Bayes (Training set)')
mtp.xlabel('Age')
mtp.ylabel('Estimated Salary')
mtp.legend()
mtp.show()

Output:

In the above output we can see that the Naïve Bayes classifier has
separated the data points with a smooth boundary. The boundary is a
curve rather than a straight line because the GaussianNB classifier
models each feature of each class with a Gaussian distribution.

6) Visualizing the Test set result:

# Visualising the Test set results
from matplotlib.colors import ListedColormap
x_set, y_set = x_test, y_test
X1, X2 = nm.meshgrid(nm.arange(start = x_set[:, 0].min() - 1, stop = x_set[:, 0].max() + 1, step = 0.01),
                     nm.arange(start = x_set[:, 1].min() - 1, stop = x_set[:, 1].max() + 1, step = 0.01))
mtp.contourf(X1, X2, classifier.predict(nm.array([X1.ravel(), X2.ravel()]).T).reshape(X1.shape),
             alpha = 0.75, cmap = ListedColormap(('purple', 'green')))
mtp.xlim(X1.min(), X1.max())
mtp.ylim(X2.min(), X2.max())
# Plot the test points on top of the predicted regions
for i, j in enumerate(nm.unique(y_set)):
    mtp.scatter(x_set[y_set == j, 0], x_set[y_set == j, 1],
                c = ListedColormap(('purple', 'green'))(i), label = j)
mtp.title('Naive Bayes (test set)')
mtp.xlabel('Age')
mtp.ylabel('Estimated Salary')
mtp.legend()
mtp.show()

Output:

The above output is the final output for the test set data. As we can
see, the classifier has created a curved boundary to divide the
"purchased" and "not purchased" classes. There are some wrong
predictions, which we have counted in the confusion matrix, but it is
still a reasonably good classifier.

INSTANCE BASED LEARNING - K-NEAREST NEIGHBOUR LEARNING
o K-Nearest Neighbour (K-NN) is one of the simplest Machine Learning
algorithms based on the Supervised Learning technique.
o The K-NN algorithm assumes that similar cases lie close together:
it compares the new case with the available cases and puts the new
case into the category whose members it most resembles.
o The K-NN algorithm stores all the available data and classifies a
new data point based on similarity. This means that when new data
appears, it can be easily classified into a well-suited category
by using the K-NN algorithm.
o The K-NN algorithm can be used for Regression as well as for
Classification, but it is mostly used for Classification
problems.
o K-NN is a non-parametric algorithm, which means it does not
make any assumption about the underlying data distribution.
o It is also called a lazy learner algorithm because it does not learn
from the training set immediately; instead it stores the dataset
and, at the time of classification, performs the computation on
the dataset.
o At the training phase the KNN algorithm just stores the dataset;
when it gets new data, it classifies that data into the category
whose stored examples are most similar to the new data.
o Example: Suppose we have an image of a creature that looks
similar to both a cat and a dog, and we want to know whether it
is a cat or a dog. For this identification we can use the KNN
algorithm, as it works on a similarity measure. Our KNN model
will compare the features of the new image with those of the cat
and dog images and, based on the most similar features, will put
it in either the cat or the dog category.

Why do we need a K-NN Algorithm?

Suppose there are two categories, Category A and Category B, and we
have a new data point x1. In which of these categories does this data
point lie? To solve this type of problem, we need a K-NN algorithm.
With the help of K-NN, we can easily identify the category or class
of a particular data point. Consider the below diagram:

How does K-NN work?

The working of K-NN can be explained on the basis of the below
algorithm:

o Step-1: Select the number K of neighbors.
o Step-2: Calculate the Euclidean distance from the new data point
to each training data point.
o Step-3: Take the K nearest neighbors as per the calculated
Euclidean distance.
o Step-4: Among these K neighbors, count the number of data
points in each category.
o Step-5: Assign the new data point to the category for which the
number of neighbors is maximum.
o Step-6: Our model is ready.
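The steps above can be sketched in a few lines of plain Python. The function name knn_predict and the sample points are hypothetical and chosen only so that the outcome is easy to follow; this is a minimal illustration, not an optimized implementation.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, new_point, k=5):
    # Step 2: distance from the new point to every stored training point.
    distances = [(math.dist(p, new_point), label)
                 for p, label in zip(train_points, train_labels)]
    # Step 3: keep the k closest training points.
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    # Step 4: count how many of those neighbors fall in each category.
    votes = Counter(label for _, label in nearest)
    # Step 5: the category with the most neighbors wins.
    return votes.most_common(1)[0][0]

# Hypothetical 2-D points in two categories.
points = [(1, 1), (1, 2), (2, 1), (4, 4), (4, 5), (5, 4), (5, 5)]
labels = ["A", "A", "A", "B", "B", "B", "B"]

prediction = knn_predict(points, labels, new_point=(2, 2), k=5)
print(prediction)
```

For the new point (2, 2) and k=5, three of the five nearest neighbors belong to category A and two to category B, so the point is assigned to A.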

Suppose we have a new data point and we need to put it in the
required category. Consider the below image:

o Firstly, we choose the number of neighbors; here we choose k=5.
o Next, we calculate the Euclidean distance between the new data
point and the existing data points. The Euclidean distance is the
straight-line distance between two points, which we have already
studied in geometry. For two points $(x_1, y_1)$ and $(x_2, y_2)$
it can be calculated as:

$$d = \sqrt{(x_2 - x_1)^2 + (y_2 - y_1)^2}$$

o By calculating the Euclidean distance we get the nearest
neighbors: three nearest neighbors in category A and two
nearest neighbors in category B. Consider the below image:

o As we can see, 3 of the 5 nearest neighbors are from category A,
hence this new data point must belong to category A.

How to select the value of K in the K-NN Algorithm?

Below are some points to remember while selecting the value of K in
the K-NN algorithm:

o There is no particular way to determine the best value for "K",
so we need to try some values and keep the one that performs
best. A commonly used starting value for K is 5.
o A very low value for K, such as K=1 or K=2, can be noisy and
makes the model sensitive to outliers.
o Large values for K reduce the effect of noise, but too large a
value lets distant points from other categories outvote the true
neighbors.
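One way to "try some values" is to score each candidate K on held-out points. The sketch below uses leave-one-out validation on a small hypothetical dataset; the helper names and the candidate list (1, 3, 5, 7) are assumptions for illustration, and other validation schemes would work as well.

```python
import math
from collections import Counter

def knn_predict(points, labels, query, k):
    """Majority vote among the k training points closest to the query."""
    nearest = sorted(zip(points, labels), key=lambda pl: math.dist(pl[0], query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

def leave_one_out_accuracy(points, labels, k):
    """Predict each point from all the others and return the hit rate."""
    hits = 0
    for i, (query, truth) in enumerate(zip(points, labels)):
        rest_points = points[:i] + points[i + 1:]
        rest_labels = labels[:i] + labels[i + 1:]
        hits += knn_predict(rest_points, rest_labels, query, k) == truth
    return hits / len(points)

# Hypothetical data; the last point behaves like an outlier of category A.
points = [(1, 1), (1, 2), (2, 1), (2, 2), (5, 5), (5, 6), (6, 5), (6, 6), (2, 5)]
labels = ["A", "A", "A", "A", "B", "B", "B", "B", "A"]

scores = {k: leave_one_out_accuracy(points, labels, k) for k in (1, 3, 5, 7)}
best_k = max(scores, key=scores.get)
print(scores, best_k)
```

Odd values of K are used here so that a two-class vote cannot end in a tie.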

Advantages of KNN Algorithm:

o It is simple to implement.
o It is robust to noisy training data when K is sufficiently large.
o It can be more effective if the training data is large.

Disadvantages of KNN Algorithm:

o The value of K always needs to be determined, which may
sometimes be complex.
o The computation cost is high because the distance between the
new data point and all the training samples has to be calculated.

Python implementation of the KNN algorithm

To do the Python implementation of the K-NN algorithm, we will use
the same problem and dataset which we have used in Logistic
Regression. But here we will improve the performance of the model.
Below is the problem description:

Problem for K-NN Algorithm: A car manufacturer has built a new SUV.
The company wants to show ads to the users who are interested in
buying that SUV. For this problem, we have a dataset that contains
information about multiple users of a social network. The dataset
contains lots of information, but we will consider Estimated Salary
and Age as the independent variables and Purchased as the
dependent variable. Below is the dataset:

Steps to implement the K-NN algorithm:

o Data Pre-processing step

o Fitting the K-NN algorithm to the Training set
o Predicting the test result
o Test accuracy of the result(Creation of Confusion matrix)
o Visualizing the test set result.

Data Pre-Processing Step:

The Data Pre-processing step will remain exactly the same as Logistic
Regression. Below is the code for it:

# importing libraries
import numpy as nm
import matplotlib.pyplot as mtp
import pandas as pd

# importing datasets
data_set = pd.read_csv('user_data.csv')

# Extracting Independent and dependent Variable
x = data_set.iloc[:, [2, 3]].values
y = data_set.iloc[:, 4].values

# Splitting the dataset into training and test set.
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.25, random_state = 0)

# feature Scaling
from sklearn.preprocessing import StandardScaler
st_x = StandardScaler()
x_train = st_x.fit_transform(x_train)
x_test = st_x.transform(x_test)

By executing the above code, our dataset is imported into our program
and pre-processed. After feature scaling, our test dataset will look
like:

From the above output image, we can see that our data is successfully
scaled.

o Fitting K-NN classifier to the Training data:

Now we will fit the K-NN classifier to the training data. To do this
we will import the KNeighborsClassifier class from the
sklearn.neighbors module. After importing the class, we will create
the classifier object of the class. The parameters of this class will
be

o n_neighbors: The number of neighbors the algorithm uses. The
default value is 5.
o metric='minkowski': This is the default parameter and it
decides how the distance between the points is measured.
o p=2: With the Minkowski metric, it is equivalent to the
standard Euclidean metric.

And then we will fit the classifier to the training data. Below is
the code for it:

# Fitting K-NN classifier to the training set
from sklearn.neighbors import KNeighborsClassifier
classifier = KNeighborsClassifier(n_neighbors=5, metric='minkowski', p=2)
classifier.fit(x_train, y_train)

Output: By executing the above code, we will get the output as:

Out[10]:
KNeighborsClassifier(algorithm='auto', leaf_size=30,
metric='minkowski',
metric_params=None, n_jobs=None, n_neighbors=5, p=2,
weights='uniform')
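The relationship between metric='minkowski' and p=2 can be seen in a small hand-written sketch. The function below is a hypothetical helper for illustration, not part of scikit-learn; it computes the Minkowski distance of order p, which reduces to the Manhattan distance for p=1 and the Euclidean distance for p=2.

```python
def minkowski(a, b, p):
    """Minkowski distance of order p between two equal-length points."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1 / p)

a, b = (0, 0), (3, 4)
manhattan = minkowski(a, b, 1)   # |3| + |4|
euclidean = minkowski(a, b, 2)   # sqrt(3^2 + 4^2)
print(manhattan, euclidean)
```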
o Predicting the Test Result: To predict the test set result, we will
create a y_pred vector as we did in Logistic Regression. Below is
the code for it:

# Predicting the test set result
y_pred = classifier.predict(x_test)

Output:

The output for the above code will be:

o Creating the Confusion Matrix:

Now we will create the Confusion Matrix for our K-NN model to
see the accuracy of the classifier. Below is the code for it:

# Creating the Confusion matrix
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)

In the above code, we have imported the confusion_matrix function and
stored its result in the variable cm.

Output: By executing the above code, we will get the matrix as below:

In the above image, we can see there are 64+29=93 correct
predictions and 3+4=7 incorrect predictions, whereas in Logistic
Regression there were 11 incorrect predictions. So we can say that
the performance of the model is improved by using the K-NN
algorithm.

o Visualizing the Training set result:

Now, we will visualize the training set result for K-NN model. The
code will remain same as we did in Logistic Regression, except
the name of the graph. Below is the code for it:

# Visualizing the training set result
from matplotlib.colors import ListedColormap
x_set, y_set = x_train, y_train
x1, x2 = nm.meshgrid(nm.arange(start = x_set[:, 0].min() - 1, stop = x_set[:, 0].max() + 1, step = 0.01),
                     nm.arange(start = x_set[:, 1].min() - 1, stop = x_set[:, 1].max() + 1, step = 0.01))
mtp.contourf(x1, x2, classifier.predict(nm.array([x1.ravel(), x2.ravel()]).T).reshape(x1.shape),
             alpha = 0.75, cmap = ListedColormap(('red', 'green')))
mtp.xlim(x1.min(), x1.max())
mtp.ylim(x2.min(), x2.max())
for i, j in enumerate(nm.unique(y_set)):
    mtp.scatter(x_set[y_set == j, 0], x_set[y_set == j, 1],
                c = ListedColormap(('red', 'green'))(i), label = j)
mtp.title('K-NN Algorithm (Training set)')
mtp.xlabel('Age')
mtp.ylabel('Estimated Salary')
mtp.legend()
mtp.show()

Output:

By executing the above code, we will get the below graph:

The output graph is different from the graph which we obtained
in Logistic Regression. It can be understood in the below points:

o As we can see, the graph shows red points and green points.
The green points are for the Purchased(1) and the red points
for the not Purchased(0) value of the variable.
o The graph shows an irregular boundary instead of a straight
line or a smooth curve because K-NN classifies each point by
its nearest neighbors.
o The graph has classified users into the correct categories, as
most of the users who didn't buy the SUV are in the red
region and users who bought the SUV are in the green
region.
o The graph shows a good result, but there are still some
green points in the red region and red points in the green
region. This is not a big issue, because a model that fitted
every training point exactly would be prone to overfitting.
o Hence our model is well trained.

o Visualizing the Test set result:

After the training of the model, we will now test the result on a
new dataset, i.e., the Test dataset. The code remains the same
except for some minor changes: x_train and y_train are
replaced by x_test and y_test.

Below is the code for it:

# Visualizing the test set result
from matplotlib.colors import ListedColormap
x_set, y_set = x_test, y_test
x1, x2 = nm.meshgrid(nm.arange(start = x_set[:, 0].min() - 1, stop = x_set[:, 0].max() + 1, step = 0.01),
                     nm.arange(start = x_set[:, 1].min() - 1, stop = x_set[:, 1].max() + 1, step = 0.01))
mtp.contourf(x1, x2, classifier.predict(nm.array([x1.ravel(), x2.ravel()]).T).reshape(x1.shape),
             alpha = 0.75, cmap = ListedColormap(('red', 'green')))
mtp.xlim(x1.min(), x1.max())
mtp.ylim(x2.min(), x2.max())
for i, j in enumerate(nm.unique(y_set)):
    mtp.scatter(x_set[y_set == j, 0], x_set[y_set == j, 1],
                c = ListedColormap(('red', 'green'))(i), label = j)
mtp.title('K-NN algorithm(Test set)')
mtp.xlabel('Age')
mtp.ylabel('Estimated Salary')
mtp.legend()
mtp.show()

Output:

The above graph shows the output for the test data set. As we can
see in the graph, the predicted output is good, as most of the red
points are in the red region and most of the green points are in the
green region.

However, there are a few green points in the red region and a few red
points in the green region. These are the incorrect predictions that
we observed in the confusion matrix (7 incorrect outputs).

INTRODUCTION TO MACHINE LEARNING (ML):

Machine Learning tutorial provides basic and advanced concepts of
machine learning. Our machine learning tutorial is designed for
students and working professionals.

Machine learning is a growing technology which enables computers to
learn automatically from past data. Machine learning uses various
algorithms for building mathematical models and making predictions
using historical data or information. Currently, it is being used for
various tasks such as image recognition, speech recognition, email
filtering, Facebook auto-tagging, recommender systems, and many
more.

This machine learning tutorial gives you an introduction to machine
learning along with the wide range of machine learning techniques
such as Supervised, Unsupervised, and Reinforcement learning. You
will learn about regression and classification models, clustering
methods, hidden Markov models, and various sequential models.

What is Machine Learning

In the real world, we are surrounded by humans who can learn
everything from their experiences with their learning capability, and
we have computers or machines which work on our instructions. But
can a machine also learn from experiences or past data like a human
does? This is where Machine Learning comes in.

Machine Learning is a subset of artificial intelligence that is
mainly concerned with the development of algorithms which allow a
computer to learn from data and past experiences on its own.

The term machine learning was first introduced by Arthur
Samuel in 1959. We can define it in a summarized way as:

Machine learning enables a machine to automatically learn from data,
improve performance from experiences, and predict things without
being explicitly programmed.

With the help of sample historical data, which is known as training
data, machine learning algorithms build a mathematical model that
helps in making predictions or decisions without being explicitly
programmed. Machine learning brings computer science and statistics
together for creating predictive models. Machine learning constructs
or uses algorithms that learn from historical data. The more
information we provide, the better the performance will be.

A machine has the ability to learn if it can improve its performance
by gaining more data.

How does Machine Learning work

A Machine Learning system learns from historical data, builds the
prediction models, and whenever it receives new data, predicts the
output for it. The accuracy of the predicted output depends upon the
amount of data, as a large amount of data helps to build a better
model which predicts the output more accurately.

Suppose we have a complex problem where we need to perform
some predictions. Instead of writing code for it, we just need to
feed the data to generic algorithms, and with the help of these
algorithms, the machine builds the logic as per the data and predicts
the output. Machine learning has changed our way of thinking about
the problem. The below block diagram explains the working of a
Machine Learning algorithm:

Features of Machine Learning:

o Machine learning uses data to detect various patterns in a given
dataset.
o It can learn from past data and improve automatically.
o It is a data-driven technology.
o Machine learning is quite similar to data mining, as it also deals
with huge amounts of data.
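The learn-from-past-data, then predict-on-new-data workflow described above can be sketched with a very small model. The example below fits a straight line by least squares in plain Python; the data (hours studied against exam score) and the function name fit_line are hypothetical and serve only as an illustration.

```python
def fit_line(xs, ys):
    """Least-squares slope and intercept for a single input feature."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    return slope, mean_y - slope * mean_x

# Hypothetical historical data: hours studied -> exam score.
hours = [1, 2, 3, 4, 5]
scores = [52, 54, 56, 58, 60]

slope, intercept = fit_line(hours, scores)   # the model is built from past data
predicted = slope * 6 + intercept            # prediction for an unseen input
print(slope, intercept, predicted)
```

No rule such as "add two points per hour" was written by hand; the slope and intercept were derived from the data, and the prediction for six hours follows from them.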

Need for Machine Learning

The need for machine learning is increasing day by day. The reason
is that it is capable of doing tasks that are too complex for a
person to implement directly. As humans, we have limitations: we
cannot process huge amounts of data manually, so we need computer
systems, and here machine learning makes things easy for us.

We can train machine learning algorithms by providing them with a
huge amount of data and letting them explore the data, construct the
models, and predict the required output automatically. The
performance of the machine learning algorithm depends on the amount
of data, and it can be measured by the cost function. With the help
of machine learning, we can save both time and money.
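The text does not say which cost function is meant. As one common choice, the mean squared error is sketched below on hypothetical values; a lower cost indicates predictions that lie closer to the actual outputs.

```python
def mean_squared_error(actual, predicted):
    """Average of the squared differences between actual and predicted values."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)

# Hypothetical actual outputs and the predictions of two models.
actual = [3.0, 5.0, 7.0]
good_model = [2.9, 5.1, 7.2]
poor_model = [1.0, 8.0, 4.0]

good_cost = mean_squared_error(actual, good_model)
poor_cost = mean_squared_error(actual, poor_model)
print(good_cost, poor_cost)
```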

The importance of machine learning can be easily understood by its
use cases. Currently, machine learning is used in self-driving
cars, cyber fraud detection, face recognition, friend suggestion
by Facebook, etc. Various top companies such as Netflix and Amazon
have built machine learning models that use a vast amount of
data to analyze user interest and recommend products accordingly.

Following are some key points which show the importance of
Machine Learning:

o Rapid increase in the production of data
o Solving complex problems, which are difficult for a human
o Decision making in various sectors, including finance
o Finding hidden patterns and extracting useful information from
data.

Classification of Machine Learning

At a broad level, machine learning can be classified into three types:

1. Supervised learning
2. Unsupervised learning
3. Reinforcement learning

1) Supervised Learning

Supervised learning is a type of machine learning method in which we
provide sample labeled data to the machine learning system in order
to train it, and on that basis, it predicts the output.

The system creates a model using labeled data to understand the
datasets and learn about each data point. Once the training and
processing are done, we test the model by providing sample data to
check whether it predicts the correct output or not.

The goal of supervised learning is to map input data to the output
data. Supervised learning is based on supervision, and it is the
same as when a student learns things under the supervision of a
teacher. An example of supervised learning is spam filtering.

Supervised learning can be grouped further into two categories of
algorithms:

o Classification
o Regression

2) Unsupervised Learning

Unsupervised learning is a learning method in which a machine learns
without any supervision.

The training is provided to the machine with a set of data that has
not been labeled, classified, or categorized, and the algorithm needs
to act on that data without any supervision. The goal of unsupervised
learning is to restructure the input data into new features or groups
of objects with similar patterns.

In unsupervised learning, we don't have a predetermined result. The
machine tries to find useful insights from the huge amount of data. It
can be further classified into two categories of algorithms:

o Clustering
o Association
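As an illustration of clustering, the sketch below groups unlabeled one-dimensional values with a simple k-means style loop. The algorithm choice, the starting centroids, and the fruit weights are all assumptions made for this example; the text itself does not prescribe a particular clustering method.

```python
def k_means_1d(values, centroids, iterations=10):
    """Group 1-D values around the given starting centroids."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for v in values:
            # Assign each value to its closest centroid.
            nearest = min(range(len(centroids)), key=lambda i: abs(v - centroids[i]))
            clusters[nearest].append(v)
        # Move each centroid to the mean of its cluster (unchanged if the cluster is empty).
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids, clusters

# Hypothetical fruit weights in grams; no labels are given to the algorithm.
weights = [110, 120, 130, 900, 950, 1000]
centroids, clusters = k_means_1d(weights, centroids=[100, 500])
print(centroids, clusters)
```

The light and heavy fruits end up in separate groups even though the algorithm was never told which fruit is which.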

3) Reinforcement Learning

Reinforcement learning is a feedback-based learning method, in which
a learning agent gets a reward for each right action and a penalty
for each wrong action. The agent learns automatically from this
feedback and improves its performance. In reinforcement learning,
the agent interacts with the environment and explores it. The goal of
an agent is to get the most reward points, and hence, it improves its
performance.

A robotic dog that automatically learns the movement of its
limbs is an example of Reinforcement learning.
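The reward-and-penalty loop can be sketched with a tiny agent that chooses between two actions. This is a simplified epsilon-greedy example with fixed rewards, assumed here purely for illustration; the function name run_agent and all parameter values are hypothetical.

```python
import random

def run_agent(rewards, steps=50, epsilon=0.1, seed=0):
    """Epsilon-greedy agent; rewards[a] is the fixed feedback for action a."""
    rng = random.Random(seed)
    value = [0.0] * len(rewards)    # estimated reward of each action
    counts = [0] * len(rewards)     # how often each action was taken
    for _ in range(steps):
        if rng.random() < epsilon:
            action = rng.randrange(len(rewards))                      # explore
        else:
            action = max(range(len(rewards)), key=value.__getitem__)  # exploit
        counts[action] += 1
        # Incremental average of the feedback received for this action.
        value[action] += (rewards[action] - value[action]) / counts[action]
    return value, counts

# Action 0 is penalised (-1), action 1 is rewarded (+1).
value, counts = run_agent(rewards=[-1.0, 1.0])
best_action = value.index(max(value))
print(value, counts, best_action)
```

After being penalised for action 0, the agent's estimate for that action drops below the estimate for action 1, so it settles on the rewarded action while still exploring occasionally.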

History of Machine Learning

Some 40-50 years ago, machine learning was science fiction, but
today it is a part of our daily life. Machine learning is making our
day-to-day life easier, from self-driving cars to the Amazon virtual
assistant "Alexa". However, the idea behind machine learning is old
and has a long history. Below are some milestones in the history of
machine learning:

The early history of Machine Learning (Pre-1940):

o 1834: In 1834, Charles Babbage, the father of the computer,
conceived a device that could be programmed with punch cards.
The machine was never built, but all modern computers rely on
its logical structure.
o 1936: In 1936, Alan Turing gave a theory of how a machine can
determine and execute a set of instructions.

The era of stored program computers:

o 1940s: In the 1940s, "ENIAC", the first electronic
general-purpose computer, was built; it was completed in 1945
and had to be set up manually for each program. After that,
stored-program computers such as EDSAC in 1949 and EDVAC in
1951 were built.
o 1943: In 1943, a human neural network was modeled with an
electrical circuit. In 1950, scientists started putting this idea
to work and analyzed how human neurons might work.

Computer machinery and intelligence:

o 1950: In 1950, Alan Turing published a seminal paper,
"Computing Machinery and Intelligence," on the topic of
artificial intelligence. In his paper, he asked, "Can machines
think?"

Machine intelligence in Games:

o 1952: Arthur Samuel, who was a pioneer of machine learning,
created a program that helped an IBM computer to play
checkers. The more it played, the better it performed.
o 1959: In 1959, the term "Machine Learning" was first coined
by Arthur Samuel.

The first "AI" winter:

o The period from 1974 to 1980 was a tough time for AI and ML
researchers, and it is called the AI winter.
o In this period, machine translation failed to deliver, and
people lost interest in AI, which led to reduced government
funding for research.

Machine Learning from theory to reality

o 1959: In 1959, the first neural network was applied to a
real-world problem: removing echoes over phone lines using an
adaptive filter.
o 1985: In 1985, Terry Sejnowski and Charles Rosenberg invented
a neural network, NETtalk, which was able to teach itself how to
correctly pronounce 20,000 words in one week.
o 1997: IBM's Deep Blue computer won a chess match against the
world champion Garry Kasparov, and it became the first computer
to beat a reigning world chess champion in a match.

Machine Learning at present:

Machine learning research has now advanced greatly, and it is
present everywhere around us, such as in self-driving
cars, Amazon Alexa, chatbots, recommender systems, and many
more. It includes supervised, unsupervised, and reinforcement
learning with clustering, classification, decision tree, SVM
algorithms, etc.

Modern machine learning models can be used for making various
predictions, including weather prediction, disease prediction, stock
market analysis, etc.

DIFFERENCES BETWEEN SUPERVISED AND UNSUPERVISED LEARNING PARADIGMS

Supervised and Unsupervised learning are two techniques of
machine learning. The two techniques are used in different
scenarios and with different datasets. Below, both learning methods
are explained, followed by a table of their differences.

Supervised Machine Learning:

Supervised learning is a machine learning method in which models are
trained using labeled data. In supervised learning, models need to find
the mapping function that maps the input variable (X) to the output
variable (Y).

Supervised learning needs supervision to train the model, which is
similar to how a student learns things in the presence of a teacher.
Supervised learning can be used for two types of
problems: Classification and Regression.

Example: Suppose we have images of different types of fruits. The
task of our supervised learning model is to identify the fruits and
classify them accordingly. To identify the images in supervised
learning, we give the input data as well as the output for it, which
means we train the model with the shape, size, color, and taste of
each fruit. Once the training is completed, we test the model by
giving it a new set of fruits. The model will identify the fruit and
predict the output using a suitable algorithm.

Unsupervised Machine Learning:

Unsupervised learning is another machine learning method in which
patterns are inferred from the unlabeled input data. The goal of
unsupervised learning is to find the structure and patterns in the
input data. Unsupervised learning does not need any supervision.
Instead, it finds patterns in the data on its own.

Unsupervised learning can be used for two types of
problems: Clustering and Association.

Example: To understand unsupervised learning, we will use the
example given above. Unlike supervised learning, here we will not
provide any supervision to the model. We will just provide the input
dataset to the model and allow the model to find the patterns in
the data. With the help of a suitable algorithm, the model will train
itself and divide the fruits into different groups according to the
most similar features between them.

The main differences between Supervised and Unsupervised learning
are given below:

Supervised Learning | Unsupervised Learning
------------------- | ---------------------
Supervised learning algorithms are trained using labeled data. | Unsupervised learning algorithms are trained using unlabeled data.
Supervised learning model takes direct feedback to check if it is predicting the correct output or not. | Unsupervised learning model does not take any feedback.
Supervised learning model predicts the output. | Unsupervised learning model finds the hidden patterns in data.
In supervised learning, input data is provided to the model along with the output. | In unsupervised learning, only input data is provided to the model.
The goal of supervised learning is to train the model so that it can predict the output when it is given new data. | The goal of unsupervised learning is to find the hidden patterns and useful insights from the unknown dataset.
Supervised learning needs supervision to train the model. | Unsupervised learning does not need any supervision to train the model.
Supervised learning can be categorized into Classification and Regression problems. | Unsupervised learning can be classified into Clustering and Association problems.
Supervised learning can be used for those cases where we know the input as well as the corresponding outputs. | Unsupervised learning can be used for those cases where we have only input data and no corresponding output data.
Supervised learning model generally produces more accurate results. | Unsupervised learning model may give less accurate results as compared to supervised learning.
Supervised learning is not close to true Artificial Intelligence, as we first train the model for each data point, and only then can it predict the correct output. | Unsupervised learning is closer to true Artificial Intelligence, as it learns in the way a child learns daily routine things from experience.
It includes various algorithms such as Linear Regression, Logistic Regression, Support Vector Machine, Multi-class Classification, Decision tree, Bayesian Logic, etc. | It includes various algorithms such as Clustering (for example K-means) and the Apriori algorithm.

UNIT - III 57

