UNIT -3 ITAI & ML

The document introduces Bayes Theorem, a fundamental concept in Machine Learning that calculates the probability of an event based on prior knowledge and is widely used in classification tasks. It explains the theorem's components, prerequisites, and its application in Naïve Bayes classifiers, highlighting both advantages and disadvantages. Additionally, it touches on the concept of Maximum Likelihood Estimation for parameter estimation in models used in Machine Learning.

Uploaded by

SRIKANTH KETHA
Copyright
© © All Rights Reserved
[INTRODUCTION TO ARTIFICIAL INTELLIGENCE & MACHINE LEARNING]

UNIT 3: BAYESIAN AND COMPUTATIONAL LEARNING

BAYES THEOREM
Machine Learning is one of the fastest-growing areas of Artificial
Intelligence. We live in the 21st century, which is driven by new
technologies and gadgets; some are yet to be used, and few have
reached their full potential. Machine Learning, too, is a technology
that is still developing. Many concepts contribute to it, such as
supervised learning, unsupervised learning, reinforcement learning,
perceptron models, neural networks, etc. In this article, "Bayes
Theorem in Machine Learning", we discuss another important concept
of Machine Learning: Bayes theorem. We cover what exactly Bayes
theorem is, why it is used in Machine Learning, examples of Bayes
theorem in Machine Learning, and more. So, let's start with a brief
introduction to Bayes theorem.


Introduction to Bayes Theorem in Machine Learning

Bayes theorem is named after Thomas Bayes, an English statistician,
philosopher, and Presbyterian minister of the 18th century. His ideas
underpin decision theory and are used extensively in probability.
Bayes theorem is also widely used in Machine Learning, where we need
to predict classes precisely and accurately. The Bayesian method is
used to calculate conditional probabilities in Machine Learning
applications that include classification tasks. Further, a simplified
version of Bayes theorem (Naïve Bayes classification) is used to
reduce computation time and the average cost of projects.

Bayes theorem is also known as Bayes rule or Bayes law. It helps to
determine the probability of an event under uncertain knowledge: it is
used to calculate the probability of one event occurring given that
another event has already occurred. It relates conditional probability
to marginal probability.

In simple words, Bayes theorem helps to produce more accurate results
by updating a probability as new evidence arrives.

Bayes theorem provides a method for calculating conditional
probability. It is a deceptively simple calculation, yet it makes it
easy to calculate the conditional probability of events where
intuition often fails. Some data scientists assume that Bayes theorem
is used mainly in the financial industry, but that is not the case.
Besides finance, Bayes theorem is also extensively applied in health
and medicine, research and surveys, the aeronautical sector, etc.


What is Bayes Theorem?

Bayes theorem is one of the most popular concepts in machine learning.
It helps to calculate the probability of one event occurring, under
uncertain knowledge, given that another event has already occurred.

Bayes' theorem can be derived using the product rule and the
conditional probability of event X given event Y:

o According to the product rule, the joint probability of X and Y can
be written in terms of the probability of X given Y:

P(X ∩ Y) = P(X|Y) P(Y) {equation 1}

o Further, in terms of the probability of event Y given event X:

P(X ∩ Y) = P(Y|X) P(X) {equation 2}

Mathematically, Bayes theorem is obtained by equating the right-hand
sides of both equations and dividing by P(Y), assuming P(Y) > 0. We
get:

P(X|Y) = P(Y|X) P(X) / P(Y)

Here, X and Y need not be independent events; the only requirement is
that P(Y) > 0. If X and Y were independent, the formula would simply
reduce to P(X|Y) = P(X).

The above equation is called Bayes rule or Bayes theorem.

o P(X|Y) is called the posterior, which we need to calculate. It is
the updated probability of the hypothesis after considering the
evidence.
o P(Y|X) is called the likelihood. It is the probability of the
evidence when the hypothesis is true.
o P(X) is called the prior probability: the probability of the
hypothesis before considering the evidence.
o P(Y) is called the marginal probability. It is the probability of
the evidence under any hypothesis.

Hence, Bayes Theorem can be written as:

posterior = likelihood * prior / evidence
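
The rule above can be sketched in a few lines of Python. This is only an illustration: the function name bayes_posterior and all the numbers (a 1% prior, a 90% likelihood, a 5% false-positive rate) are hypothetical and do not come from the text. The evidence P(Y) is computed here by summing over the two cases "X is true" and "X is false".

```python
def bayes_posterior(likelihood, prior, evidence):
    """Return P(X|Y) = P(Y|X) * P(X) / P(Y)."""
    if evidence <= 0:
        raise ValueError("P(Y) must be positive")
    return likelihood * prior / evidence


# Hypothetical numbers, chosen only for illustration.
prior = 0.01            # P(X): hypothesis is true before seeing evidence
likelihood = 0.90       # P(Y|X): evidence appears when hypothesis is true
false_positive = 0.05   # P(Y|~X): evidence appears when hypothesis is false

# P(Y) = P(Y|X) P(X) + P(Y|~X) P(~X)
evidence = likelihood * prior + false_positive * (1 - prior)

posterior = bayes_posterior(likelihood, prior, evidence)
print(f"P(X|Y) = {posterior:.4f}")
```

Even with a 90% likelihood, the posterior stays small because the prior is small, which is the kind of case where intuition often fails.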

Prerequisites for Bayes Theorem

While studying Bayes theorem, we need to understand a few important
concepts. These are as follows:

1. Experiment

An experiment is a planned operation carried out under controlled
conditions, such as tossing a coin, drawing a card, or rolling a die.

2. Sample Space

The results we may get from an experiment are called possible
outcomes, and the set of all possible outcomes of an experiment is
known as the sample space. For example, if we are rolling a die, the
sample space will be:

S1 = {1, 2, 3, 4, 5, 6}

Similarly, if our experiment consists of tossing a coin and recording
its outcome, then the sample space will be:

S2 = {Head, Tail}

3. Event

An event is a subset of the sample space of an experiment. In other
words, it is a set of outcomes.

Assume that in our experiment of rolling a die there are two events A
and B such that:

A = Event when an even number is obtained = {2, 4, 6}

B = Event when a number is greater than 4 = {5, 6}

o Probability of event A, P(A) = Number of favourable outcomes /
Total number of possible outcomes
P(A) = 3/6 = 1/2 = 0.5
o Similarly, probability of event B, P(B) = Number of favourable
outcomes / Total number of possible outcomes
P(B) = 2/6 = 1/3 = 0.333
o Union of events A and B:
A∪B = {2, 4, 5, 6}
o Intersection of events A and B:
A∩B = {6}
o Disjoint events: If the intersection of events A and B is the empty
set, the events are known as disjoint or mutually exclusive events.
Here A and B are not disjoint, because A∩B = {6}.
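
The die example can be reproduced with Python sets. This is a minimal sketch; the helper name prob is an assumption made for illustration, and exact fractions are used so that 1/3 is not rounded.

```python
from fractions import Fraction

sample_space = {1, 2, 3, 4, 5, 6}

# Events as subsets of the sample space.
A = {x for x in sample_space if x % 2 == 0}   # even number: {2, 4, 6}
B = {x for x in sample_space if x > 4}        # greater than 4: {5, 6}


def prob(event):
    """Favourable outcomes divided by total outcomes (equally likely outcomes)."""
    return Fraction(len(event), len(sample_space))


p_a = prob(A)
p_b = prob(B)
union = A | B
intersection = A & B
disjoint = A.isdisjoint(B)

print(p_a, p_b, sorted(union), sorted(intersection), disjoint)
```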

4. Random Variable:

A random variable is a real-valued function that maps the sample space
of an experiment to the real line. It takes on different values, each
with some probability. Strictly speaking, it is neither random nor a
variable; it is a function, and it can be discrete, continuous, or a
combination of both.

5. Exhaustive Event:

As the name suggests, a set of events is called exhaustive if at least
one of them must occur whenever the experiment is performed.

Thus, two events A and B are exhaustive if either A or B definitely
occurs, i.e. A∪B covers the whole sample space. Exhaustive events may
additionally be mutually exclusive; e.g., while tossing a coin, the
outcome is either a Head or a Tail, never both.

6. Independent Event:

Two events are said to be independent when the occurrence of one event
does not affect the occurrence of the other. In simple words, the
probability of one event does not depend on the other.

Mathematically, two events A and B are said to be independent if:

P(A ∩ B) = P(AB) = P(A)*P(B)

7. Conditional Probability:

Conditional probability is defined as the probability of an event A
given that another event B has already occurred (i.e. A conditional on
B). It is represented by P(A|B) and, provided P(B) > 0, defined as:

P(A|B) = P(A ∩ B) / P(B)

8. Marginal Probability:

Marginal probability is defined as the probability of an event A
occurring irrespective of any other event B. In Bayes theorem, it
plays the role of the probability of the evidence under any
hypothesis.

P(A) = P(A|B)*P(B) + P(A|~B)*P(~B)


Here ~B represents the event that B does not occur.
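
As a quick check of the last three definitions, the sketch below reuses the die events A = {2, 4, 6} and B = {5, 6} from the Event section. The helper names prob and cond_prob are assumptions made for illustration.

```python
from fractions import Fraction

S = {1, 2, 3, 4, 5, 6}
A = {2, 4, 6}   # even number
B = {5, 6}      # number greater than 4


def prob(event):
    return Fraction(len(event), len(S))


def cond_prob(event, given):
    """P(event | given) = P(event ∩ given) / P(given)."""
    return prob(event & given) / prob(given)


# Conditional probability P(A|B)
p_a_given_b = cond_prob(A, B)

# Independence check: P(A ∩ B) == P(A) * P(B)
independent = prob(A & B) == prob(A) * prob(B)

# Marginal probability: P(A) = P(A|B) P(B) + P(A|~B) P(~B)
not_B = S - B
p_a_marginal = cond_prob(A, B) * prob(B) + cond_prob(A, not_B) * prob(not_B)

print(p_a_given_b, independent, p_a_marginal)
```

For these two events P(A|B) equals P(A), so A and B happen to be independent even though they are not disjoint.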

How to apply Bayes Theorem or Bayes rule in Machine Learning?

Bayes theorem lets us calculate the single term P(B|A) in terms of
P(A|B), P(B), and P(A): P(B|A) = P(A|B) P(B) / P(A). This rule is very
helpful in scenarios where we have good estimates of P(A|B), P(B), and
P(A) and need to determine the fourth term, P(B|A).

The Naïve Bayes classifier is one of the simplest applications of
Bayes theorem. It is a classification algorithm that assigns data to
classes and is valued for its speed and reasonable accuracy.

Let's understand the use of Bayes theorem in machine learning with the
example below.

Suppose we have a vector A with m attributes:

A = (A1, A2, A3, ..., Am)

Further, we have n classes, represented as C1, C2, C3, ..., Cn.

Given these two pieces of information, our Machine Learning classifier
has to predict the class of A, choosing the best possible (most
probable) class. With the help of Bayes theorem, we can write the
probability of class Ci given A as:

P(Ci|A) = [P(A|Ci) * P(Ci)] / P(A)

Here, P(A) is the class-independent term: it remains constant across
classes, i.e. its value does not change with the class. To maximize
P(Ci|A), we therefore only have to maximize the term P(A|Ci) * P(Ci).

With n classes, let's assume that every class is equally likely to be
the right answer. Under this assumption:

P(C1) = P(C2) = P(C3) = P(C4) = ... = P(Cn)

Then only P(A|Ci) needs to be maximized, which reduces the computation
cost as well as time.

Naïve Bayes further simplifies the computation of P(A|Ci) by assuming
that the attributes are conditionally independent given the class.
This rarely holds exactly, but it often costs little accuracy in
practice. Under this assumption:

P(A|Ci) = P(A1|Ci) * P(A2|Ci) * P(A3|Ci) * ... * P(Am|Ci)

Hence, by using Bayes theorem in Machine Learning, we can express the
probability of a complex event in terms of the probabilities of
smaller events.
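
The decision rule described above (pick the class that maximizes the prior times the product of per-attribute likelihoods) can be sketched as follows. Everything here is hypothetical: the class names, the attribute names, and the probability tables are invented for illustration, and P(A) is dropped because it is the same for every class.

```python
from math import prod

# Hypothetical priors P(Ci) and per-attribute likelihoods P(Aj|Ci).
priors = {"spam": 0.4, "ham": 0.6}
likelihoods = {
    "spam": {"offer": 0.7, "meeting": 0.1},
    "ham": {"offer": 0.2, "meeting": 0.6},
}


def predict(attributes):
    """Return the class maximizing P(Ci) * product of P(Aj|Ci), plus all scores."""
    scores = {
        c: priors[c] * prod(likelihoods[c][a] for a in attributes)
        for c in priors
    }
    best = max(scores, key=scores.get)
    return best, scores


label_one, scores_one = predict(["offer"])
label_two, scores_two = predict(["offer", "meeting"])
print(label_one, scores_one)
print(label_two, scores_two)
```

The scores are unnormalized: dividing each by their sum would give the posteriors P(Ci|A), but the winning class is the same either way.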

What is Naïve Bayes Classifier in Machine Learning

Naïve Bayes is a supervised learning algorithm, based on Bayes
theorem, that is used to solve classification problems. It is one of
the simplest and most effective classification algorithms in Machine
Learning and enables us to build ML models that make quick
predictions. It is a probabilistic classifier, which means it predicts
on the basis of the probability that an object belongs to each class.
Some popular applications of Naïve Bayes are spam filtering, sentiment
analysis, and classifying articles.

Advantages of Naïve Bayes Classifier in Machine Learning:

o It is one of the simplest and most effective methods for calculating
conditional probabilities and for text classification problems.
o A Naïve Bayes classifier can perform better than more complex models
when the assumption of independent predictors holds true.
o It is easier to implement than many other models.
o It requires only a small amount of training data to estimate its
parameters, which keeps the training time short.
o It can be used for binary as well as multi-class classification.


Disadvantages of Naïve Bayes Classifier in Machine Learning:

The main disadvantage of the Naïve Bayes classifier is its assumption
of independent predictors: it implicitly assumes that all attributes
are independent or unrelated, but in real life it is rarely feasible
to get mutually independent attributes.

MAXIMUM LIKELIHOOD
In this post I’ll explain what the maximum likelihood method for
parameter estimation is and go through a simple example to
demonstrate the method. Some of the content requires knowledge of
fundamental probability concepts such as the definition of joint
probability and independence of events. I’ve written a blog post with
these prerequisites so feel free to read this if you think you need a
refresher.

What are parameters?

Often in machine learning we use a model to describe the process that
results in the data that are observed. For example, we may use a
random forest model to classify whether customers may cancel a
subscription from a service (known as churn modelling) or we may use
a linear model to predict the revenue that will be generated for a
company depending on how much they may spend on advertising (this
would be an example of linear regression). Each model contains its
own set of parameters that ultimately defines what the model looks
like.

For a linear model we can write this as y = mx + c. In this
example x could represent the advertising spend and y might be the
revenue generated. m and c are parameters for this model. Different
values for these parameters will give different lines (see figure below).

Three linear models with different parameter values.

So parameters define a blueprint for the model. It is only when specific
values are chosen for the parameters that we get an instantiation for
the model that describes a given phenomenon.

Intuitive explanation of maximum likelihood estimation

Maximum likelihood estimation is a method that determines values for
the parameters of a model. The parameter values are found such that
they maximise the likelihood that the process described by the model
produced the data that were actually observed.


The above definition may still sound a little cryptic so let’s go through
an example to help understand this.

Let’s suppose we have observed 10 data points from some process. For
example, each data point could represent the length of time in seconds
that it takes a student to answer a specific exam question. These 10
data points are shown in the figure below

The 10 (hypothetical) data points that we have observed

We first have to decide which model we think best describes the
process of generating the data. This part is very important. At the very
least, we should have a good idea about which model to use. This
usually comes from having some domain expertise but we won’t
discuss this here.


For these data we’ll assume that the data generation process can be
adequately described by a Gaussian (normal) distribution. Visual
inspection of the figure above suggests that a Gaussian distribution is
plausible because most of the 10 points are clustered in the middle
with few points scattered to the left and the right. (Making this sort of
decision on the fly with only 10 data points is ill-advised but given that
I generated these data points we’ll go with it).

Recall that the Gaussian distribution has 2 parameters: the mean, μ,
and the standard deviation, σ. Different values of these parameters
result in different curves (just like with the straight lines above). We
want to know which curve was most likely responsible for creating the
data points that we observed (see figure below). Maximum likelihood
estimation is a method that will find the values of μ and σ that result
in the curve that best fits the data.


The 10 data points and possible Gaussian distributions from which the
data were drawn. f1 is normally distributed with mean 10 and variance
2.25 (variance is equal to the square of the standard deviation); this is
also denoted f1 ∼ N(10, 2.25). f2 ∼ N(10, 9), f3 ∼ N(10, 0.25) and f4
∼ N(8, 2.25). The goal of maximum likelihood is to find the parameter
values that give the distribution that maximises the probability of
observing the data.

The true distribution from which the data were generated was f1 ~
N(10, 2.25), which is the blue curve in the figure above.

Calculating the Maximum Likelihood Estimates

Now that we have an intuitive understanding of what maximum
likelihood estimation is we can move on to learning how to calculate
the parameter values. The values that we find are called the maximum
likelihood estimates (MLE).

Again we’ll demonstrate this with an example. Suppose we have three
data points this time and we assume that they have been generated
from a process that is adequately described by a Gaussian distribution.
These points are 9, 9.5 and 11. How do we calculate the maximum
likelihood estimates of the parameter values of the Gaussian
distribution μ and σ?

What we want to calculate is the total probability of observing all of
the data, i.e. the joint probability distribution of all observed data
points. To do this we would need to calculate some conditional
probabilities, which can get very difficult. So it is here that we’ll make
our first assumption. The assumption is that each data point is
generated independently of the others. This assumption makes the
maths much easier. If the events (i.e. the process that generates the
data) are independent, then the total probability of observing all of
the data is the product of the probabilities of observing each data
point individually (i.e. the product of the marginal probabilities).

The probability density of observing a single data point x that is
generated from a Gaussian distribution is given by:

$$P(x; \mu, \sigma) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$$

The semi colon used in the notation P(x; μ, σ) is there to emphasise
that the symbols that appear after it are parameters of the probability
distribution. So it shouldn’t be confused with a conditional probability
(which is typically represented with a vertical line e.g. P(A| B)).

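In our example the total (joint) probability density of observing the
three data points is given by:

$$P(9, 9.5, 11; \mu, \sigma) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{(9-\mu)^2}{2\sigma^2}\right) \times \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{(9.5-\mu)^2}{2\sigma^2}\right) \times \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{(11-\mu)^2}{2\sigma^2}\right)$$

We just have to figure out the values of μ and σ that result in
the maximum value of the above expression.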
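
To make the joint density concrete, here is a minimal sketch that evaluates it for the three data points under two candidate parameter settings. The function names gaussian_pdf and joint_likelihood are assumptions made for illustration; the candidates (μ = 10, σ = 1.5) and (μ = 8, σ = 1.5) correspond to the curves called f1 and f4 earlier, whose variance is 2.25.

```python
import math


def gaussian_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2) at x."""
    coeff = 1.0 / (sigma * math.sqrt(2.0 * math.pi))
    return coeff * math.exp(-((x - mu) ** 2) / (2.0 * sigma ** 2))


def joint_likelihood(data, mu, sigma):
    """Product of the individual densities (independence assumption)."""
    result = 1.0
    for x in data:
        result *= gaussian_pdf(x, mu, sigma)
    return result


data = [9.0, 9.5, 11.0]

like_f1 = joint_likelihood(data, mu=10.0, sigma=1.5)
like_f4 = joint_likelihood(data, mu=8.0, sigma=1.5)

print(f"mu=10, sigma=1.5: {like_f1:.6f}")
print(f"mu=8,  sigma=1.5: {like_f4:.6f}")
```

The setting with the larger value explains the three points better; maximum likelihood estimation searches for the setting with the largest value of all.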

If you’ve covered calculus in your maths classes then you’ll probably
be aware that there is a technique that can help us find maxima (and
minima) of functions. It’s called differentiation. All we have to do is find
the derivative of the function, set the derivative function to zero and
then rearrange the equation to make the parameter of interest the
subject of the equation. And voilà, we’ll have our MLE values for our
parameters. I’ll go through these steps now but I’ll assume that the
reader knows how to perform differentiation on common functions.

The log likelihood

The above expression for the total probability is actually quite a pain
to differentiate, so it is almost always simplified by taking the natural
logarithm of the expression. This is absolutely fine because the natural
logarithm is a monotonically increasing function. This means that if the
value on the x-axis increases, the value on the y-axis also increases (see
figure below). This is important because it ensures that the maximum
value of the log of the probability occurs at the same point as the
original probability function. Therefore we can work with the simpler
log-likelihood instead of the original likelihood.
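
The claim that the likelihood and the log-likelihood peak at the same place can be checked numerically. The sketch below is only an illustration under stated assumptions: σ is held fixed at 1 (an arbitrary choice), and μ is searched over a hypothetical grid from 8 to 12 in steps of 0.01 for the data points 9, 9.5 and 11.

```python
import math

data = [9.0, 9.5, 11.0]
SIGMA = 1.0  # assumed fixed, for illustration only


def pdf(x, mu):
    return math.exp(-((x - mu) ** 2) / (2 * SIGMA ** 2)) / (SIGMA * math.sqrt(2 * math.pi))


def likelihood(mu):
    """Product of the densities of all data points."""
    value = 1.0
    for x in data:
        value *= pdf(x, mu)
    return value


def log_likelihood(mu):
    """Sum of the log densities; the log turns the product into a sum."""
    return sum(math.log(pdf(x, mu)) for x in data)


grid = [8.0 + 0.01 * k for k in range(401)]  # 8.00, 8.01, ..., 12.00

best_mu_likelihood = max(grid, key=likelihood)
best_mu_log = max(grid, key=log_likelihood)

print(best_mu_likelihood, best_mu_log)
```

Both searches select the same grid point, because the logarithm preserves the ordering of the likelihood values.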

Monotonic behaviour of the original function, y = x on the left and the
(natural) logarithm function y = ln(x). These functions are both
monotonic because as you go from left to right on the x-axis the y
value always increases.

Example of a non-monotonic function because as you go from left to
right on the graph the value of f(x) goes up, then goes down and then
goes back up again.

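Taking logs of the original expression gives us:

$$\ln P(9, 9.5, 11; \mu, \sigma) = \ln\left(\frac{1}{\sigma\sqrt{2\pi}}\right) - \frac{(9-\mu)^2}{2\sigma^2} + \ln\left(\frac{1}{\sigma\sqrt{2\pi}}\right) - \frac{(9.5-\mu)^2}{2\sigma^2} + \ln\left(\frac{1}{\sigma\sqrt{2\pi}}\right) - \frac{(11-\mu)^2}{2\sigma^2}$$

This expression can be simplified again using the laws of logarithms to
obtain:

$$\ln P(9, 9.5, 11; \mu, \sigma) = -3\ln\sigma - \frac{3}{2}\ln(2\pi) - \frac{1}{2\sigma^2}\left[(9-\mu)^2 + (9.5-\mu)^2 + (11-\mu)^2\right]$$

This expression can be differentiated to find the maximum. In this
example we’ll find the MLE of the mean, μ. To do this we take the
partial derivative of the function with respect to μ, giving

$$\frac{\partial \ln P(9, 9.5, 11; \mu, \sigma)}{\partial \mu} = \frac{1}{\sigma^2}\left[9 + 9.5 + 11 - 3\mu\right]$$

Finally, setting the left hand side of the equation to zero and then
rearranging for μ gives:

$$\mu = \frac{9 + 9.5 + 11}{3} = 9.833$$

And there we have our maximum likelihood estimate for μ. We can do
the same thing with σ too but I’ll leave that as an exercise for the keen
reader.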
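
The estimate can be verified numerically. In the sketch below, the MLE of μ is the sample mean, as derived above. The formula used for σ is not derived in the text: it is the standard closed-form result for a Gaussian (the square root of the mean squared deviation, dividing by n rather than n − 1) and is included here only as an assumption to check against.

```python
import math

data = [9.0, 9.5, 11.0]
n = len(data)

# MLE of the mean: the sample mean.
mu_hat = sum(data) / n

# Standard closed-form MLE of sigma for a Gaussian (divides by n, not n - 1).
sigma_hat = math.sqrt(sum((x - mu_hat) ** 2 for x in data) / n)


def log_likelihood(mu, sigma):
    """Gaussian log-likelihood of the data under independence."""
    return sum(
        -math.log(sigma) - 0.5 * math.log(2 * math.pi) - (x - mu) ** 2 / (2 * sigma ** 2)
        for x in data
    )


best = log_likelihood(mu_hat, sigma_hat)
# Nudging either parameter away from the estimate should not increase the value.
nudged = [
    log_likelihood(mu_hat + 0.1, sigma_hat),
    log_likelihood(mu_hat - 0.1, sigma_hat),
    log_likelihood(mu_hat, sigma_hat + 0.1),
    log_likelihood(mu_hat, sigma_hat - 0.1),
]

print(f"mu_hat = {mu_hat:.3f}, sigma_hat = {sigma_hat:.3f}")
print(all(value < best for value in nudged))
```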

MINIMUM DESCRIPTION LENGTH PRINCIPLE


Minimum Description Length (MDL) is a model selection principle
where the shortest description of the data is the best model. MDL
methods learn through a data compression perspective and are
sometimes described as mathematical applications of Occam's razor.
The MDL principle can be extended to other forms of inductive
inference and learning, for example to estimation and sequential
prediction, without explicitly identifying a single model of the data.
MDL has its origins mostly in information theory and has been further
developed within the general fields of statistics, theoretical computer
science and machine learning, and more narrowly computational
learning theory.


Historically, there are different, yet interrelated, usages of the definite
noun phrase "the minimum description length principle" that vary in
what is meant by description:

o Within Jorma Rissanen's theory of learning, a central concept
of information theory, models are statistical hypotheses and
descriptions are defined as universal codes.
o Rissanen's 1978[1] pragmatic first attempt to automatically
derive short descriptions relates to the Bayesian Information
Criterion (BIC).
o Within Algorithmic Information Theory, the description length
of a data sequence is the length of the smallest program that
outputs that data set. In this context, it is also known as the
'idealized' MDL principle and it is closely related to
Solomonoff's theory of inductive inference, which is that the
best model of a data set is represented by its shortest
self-extracting archive.

GIBBS ALGORITHM
In statistical mechanics, the Gibbs algorithm, introduced by J. Willard
Gibbs in 1902, is a criterion for choosing a probability distribution for
the statistical ensemble of microstates of a thermodynamic
system by minimizing the average log probability

$$\langle \ln p_i \rangle = \sum_i p_i \ln p_i$$

subject to the probability distribution p_i satisfying a set of constraints
(usually expectation values) corresponding to the
known macroscopic quantities. In 1948, Claude Shannon interpreted
the negative of this quantity, which he called information entropy, as
a measure of the uncertainty in a probability distribution.[1] In
1957, E.T. Jaynes realized that this quantity could be interpreted as
missing information about anything, and generalized the Gibbs
algorithm to non-equilibrium systems with the principle of maximum
entropy and maximum entropy thermodynamics.
Physicists call the result of applying the Gibbs algorithm the Gibbs
distribution for the given constraints, most notably Gibbs's grand
canonical ensemble for open systems when the average energy and
the average number of particles are given. (See also partition
function).
This general result of the Gibbs algorithm is then a maximum entropy
probability distribution. Statisticians identify such distributions as
belonging to exponential families.
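
As a rough illustration of the Gibbs algorithm with a single constraint, the sketch below assumes three microstates with hypothetical energies 0, 1 and 2 and a required average energy of 0.5. With one average-energy constraint, the maximum entropy distribution has the exponential-family form p_i ∝ exp(−β E_i); the code finds β by bisection. The function names and all numbers are assumptions made for illustration.

```python
import math


def gibbs_distribution(energies, beta):
    """p_i = exp(-beta * E_i) / Z, where Z is the partition function."""
    weights = [math.exp(-beta * e) for e in energies]
    z = sum(weights)
    return [w / z for w in weights]


def mean_energy(energies, beta):
    p = gibbs_distribution(energies, beta)
    return sum(pi * e for pi, e in zip(p, energies))


def solve_beta(energies, target, lo=-50.0, hi=50.0, steps=200):
    """Bisection on beta; the mean energy decreases as beta increases."""
    for _ in range(steps):
        mid = (lo + hi) / 2.0
        if mean_energy(energies, mid) > target:
            lo = mid   # mean too high: a larger beta is needed
        else:
            hi = mid
    return (lo + hi) / 2.0


energies = [0.0, 1.0, 2.0]   # hypothetical microstate energies
target = 0.5                 # hypothetical known average energy

beta = solve_beta(energies, target)
p = gibbs_distribution(energies, beta)
entropy = -sum(pi * math.log(pi) for pi in p)

print(f"beta = {beta:.4f}, p = {[round(pi, 4) for pi in p]}, entropy = {entropy:.4f}")
```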
NAÏVE BAYES CLASSIFIER
o Naïve Bayes algorithm is a supervised learning algorithm, which
is based on Bayes theorem and used for solving classification
problems.
o It is mainly used in text classification that includes a high-
dimensional training dataset.
o Naïve Bayes Classifier is one of the simplest and most effective
classification algorithms; it helps in building fast machine
learning models that can make quick predictions.
o It is a probabilistic classifier, which means it predicts on the
basis of the probability of an object.
o Some popular applications of the Naïve Bayes algorithm are spam
filtering, sentiment analysis, and classifying articles.

Why is it called Naïve Bayes?

The name Naïve Bayes is made up of two words, Naïve and Bayes,
which can be described as:

o Naïve: It is called Naïve because it assumes that the occurrence
of a certain feature is independent of the occurrence of other
features. For example, if a fruit is identified on the basis of
color, shape, and taste, then a red, spherical, and sweet fruit is
recognized as an apple. Each feature individually contributes to
identifying it as an apple, without depending on the others.
o Bayes: It is called Bayes because it depends on the principle
of Bayes' Theorem.


Bayes' Theorem:

o Bayes' theorem is also known as Bayes' Rule or Bayes' law. It is
used to determine the probability of a hypothesis with prior
knowledge. It depends on the conditional probability.
o The formula for Bayes' theorem is given as:

P(A|B) = P(B|A) * P(A) / P(B)

Where,

P(A|B) is the Posterior probability: Probability of hypothesis A given
the observed event B.

P(B|A) is the Likelihood: Probability of the evidence given that the
hypothesis is true.

P(A) is the Prior probability: Probability of the hypothesis before
observing the evidence.

P(B) is the Marginal probability: Probability of the evidence.

Working of Naïve Bayes' Classifier:

The working of the Naïve Bayes' Classifier can be understood with the
help of the example below:

Suppose we have a dataset of weather conditions and a corresponding
target variable "Play". Using this dataset, we need to decide whether
or not we should play on a particular day according to the weather
conditions. To solve this problem, we need to follow the steps below:

1. Convert the given dataset into frequency tables.
2. Generate a likelihood table by finding the probabilities of the
given features.
3. Now, use Bayes theorem to calculate the posterior probability.

Problem: If the weather is sunny, should the player play or not?

Solution: To solve this, first consider the dataset below:

      Outlook     Play
 0    Rainy       Yes
 1    Sunny       Yes
 2    Overcast    Yes
 3    Overcast    Yes
 4    Sunny       No
 5    Rainy       Yes
 6    Sunny       Yes
 7    Overcast    Yes
 8    Rainy       No
 9    Sunny       No
10    Sunny       Yes
11    Rainy       No
12    Overcast    Yes
13    Overcast    Yes

Frequency table for the weather conditions:

Weather     Yes    No
Overcast     5     0
Rainy        2     2
Sunny        3     2
Total       10     4

Likelihood table for the weather conditions:

Weather     No             Yes             P(Weather)
Overcast    0              5               5/14 = 0.36
Rainy       2              2               4/14 = 0.29
Sunny       2              3               5/14 = 0.36
All         4/14 = 0.29    10/14 = 0.71

Applying Bayes' theorem:

P(Yes|Sunny) = P(Sunny|Yes) * P(Yes) / P(Sunny)

P(Sunny|Yes) = 3/10 = 0.3

P(Sunny) = 5/14 = 0.36

P(Yes) = 10/14 = 0.71

So P(Yes|Sunny) = (3/10) * (10/14) / (5/14) = 3/5 = 0.60

P(No|Sunny) = P(Sunny|No) * P(No) / P(Sunny)

P(Sunny|No) = 2/4 = 0.5

P(No) = 4/14 = 0.29

P(Sunny) = 5/14 = 0.36

So P(No|Sunny) = (2/4) * (4/14) / (5/14) = 2/5 = 0.40

As we can see from the above calculation, P(Yes|Sunny) > P(No|Sunny).

Hence, on a sunny day, the player can play the game.
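
The three steps (frequency table, likelihoods, posterior) can be reproduced directly from the 14-row dataset above. This is a minimal sketch; the variable and function names are assumptions made for illustration, and exact fractions are used so that no intermediate value is rounded.

```python
from collections import Counter
from fractions import Fraction

# The 14 rows of the weather dataset, in order (rows 0 to 13).
outlook = ["Rainy", "Sunny", "Overcast", "Overcast", "Sunny", "Rainy", "Sunny",
           "Overcast", "Rainy", "Sunny", "Sunny", "Rainy", "Overcast", "Overcast"]
play = ["Yes", "Yes", "Yes", "Yes", "No", "Yes", "Yes",
        "Yes", "No", "No", "Yes", "No", "Yes", "Yes"]

n = len(play)
class_counts = Counter(play)                 # frequency of Yes / No
outlook_counts = Counter(outlook)            # frequency of each weather value
joint_counts = Counter(zip(outlook, play))   # frequency table (weather, play)


def posterior(label, weather):
    """P(label | weather) = P(weather | label) * P(label) / P(weather)."""
    likelihood = Fraction(joint_counts[(weather, label)], class_counts[label])
    prior = Fraction(class_counts[label], n)
    evidence = Fraction(outlook_counts[weather], n)
    return likelihood * prior / evidence


p_yes = posterior("Yes", "Sunny")
p_no = posterior("No", "Sunny")
decision = "Yes" if p_yes > p_no else "No"

print(p_yes, p_no, decision)
```

Because the two posteriors are computed exactly, they sum to 1, which is a useful sanity check on hand calculations that round intermediate values.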

Advantages of Naïve Bayes Classifier:

o Naïve Bayes is one of the fastest and easiest ML algorithms for
predicting the class of a dataset.
o It can be used for binary as well as multi-class classification.
o It performs well in multi-class predictions compared to many other
algorithms.
o It is a popular choice for text classification problems.

Disadvantages of Naïve Bayes Classifier:

o Naive Bayes assumes that all features are independent or
unrelated, so it cannot learn the relationship between features.

Applications of Naïve Bayes Classifier:

o It is used for credit scoring.
o It is used in medical data classification.
o It can be used in real-time predictions because Naïve Bayes
Classifier is an eager learner.
o It is used in text classification such as spam filtering and
sentiment analysis.

Types of Naïve Bayes Model:

There are three types of Naive Bayes Model, which are given below:

o Gaussian: The Gaussian model assumes that features follow a
normal distribution. This means that if predictors take continuous
values instead of discrete ones, the model assumes that these
values are sampled from a Gaussian distribution.
o Multinomial: The Multinomial Naïve Bayes classifier is used
when the data is multinomially distributed. It is primarily used for
document classification problems, i.e. deciding which category
(such as Sports, Politics, Education, etc.) a particular document
belongs to. The classifier uses the frequency of words as the
predictors.
o Bernoulli: The Bernoulli classifier works similarly to the
Multinomial classifier, but the predictor variables are
independent Boolean variables, such as whether a particular word
is present in a document or not. This model is also well known for
document classification tasks.

Python Implementation of the Naïve Bayes algorithm:

Now we will implement a Naive Bayes Algorithm using Python. For
this, we will use the "user_data" dataset, which we have used in our
other classification model. Therefore we can easily compare the Naive
Bayes model with the other models.

Steps to implement:

o Data Pre-processing step
o Fitting Naive Bayes to the Training set
o Predicting the test result
o Test accuracy of the result (creation of Confusion matrix)
o Visualizing the test set result.

1) Data Pre-processing step:

In this step, we will pre-process/prepare the data so that we can use
it efficiently in our code. It is similar to what we did in data
pre-processing. The code for this is given below:

# Importing the libraries
import numpy as nm
import matplotlib.pyplot as mtp
import pandas as pd

# Importing the dataset (columns 2 and 3 are the features, column 4 is the label)
dataset = pd.read_csv('user_data.csv')
x = dataset.iloc[:, [2, 3]].values
y = dataset.iloc[:, 4].values

# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.25, random_state = 0)

# Feature Scaling (fit on the training set only, then apply to the test set)
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
x_train = sc.fit_transform(x_train)
x_test = sc.transform(x_test)

In the above code, we have loaded the dataset into our program using
dataset = pd.read_csv('user_data.csv'). The loaded dataset is
divided into training and test sets, and then we have scaled the
feature variables. The scaler is fitted on the training set only and
then applied to the test set.
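
To see what feature scaling does, the sketch below standardizes one
feature by hand. It assumes the usual standardization formula
(value - mean) / standard deviation, with the mean and standard
deviation learned from the training values only. The function names and
the ages are hypothetical and stand in for the scaler used above.

```python
from statistics import mean, pstdev

def fit_scaler(values):
    """Learn the mean and population standard deviation of one feature."""
    return mean(values), pstdev(values)

def transform(values, mu, sigma):
    """Standardize values with previously learned statistics."""
    return [(v - mu) / sigma for v in values]

train_age = [20, 30, 40, 50]   # hypothetical training ages
test_age = [35, 60]            # hypothetical test ages

mu, sigma = fit_scaler(train_age)               # learned from the training set only
train_scaled = transform(train_age, mu, sigma)
test_scaled = transform(test_age, mu, sigma)    # reuses the training statistics

print(train_scaled)
print(test_scaled)
```

After scaling, the training values have mean 0 and standard deviation 1,
and a test age equal to the training mean (35) maps to 0.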


The output for the dataset is given as:

2) Fitting Naive Bayes to the Training Set:

After the pre-processing step, now we will fit the Naive Bayes model
to the Training set. Below is the code for it:

# Fitting Naive Bayes to the Training set
from sklearn.naive_bayes import GaussianNB
classifier = GaussianNB()
classifier.fit(x_train, y_train)


In the above code, we have used the GaussianNB classifier and fitted
it to the training dataset. We can also use the other Naive Bayes
classifiers (such as Multinomial or Bernoulli) as per our requirement.

Output:

Out[6]: GaussianNB(priors=None, var_smoothing=1e-09)
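
To get a feeling for what a Gaussian Naive Bayes model stores and how it
predicts, here is a minimal from-scratch sketch. It is not the
scikit-learn implementation: the function names, the small constant eps
added to each variance, and the data are assumptions made only for this
illustration.

```python
import math
from collections import defaultdict

def fit_gaussian_nb(X, y, eps=1e-9):
    """Estimate a prior plus per-feature mean and variance for each class."""
    by_class = defaultdict(list)
    for row, label in zip(X, y):
        by_class[label].append(row)
    model = {}
    for label, rows in by_class.items():
        n = len(rows)
        columns = list(zip(*rows))
        means = [sum(col) / n for col in columns]
        # eps keeps the variance strictly positive (simplifying assumption)
        variances = [sum((v - m) ** 2 for v in col) / n + eps
                     for col, m in zip(columns, means)]
        model[label] = (n / len(X), means, variances)
    return model

def predict_one(model, row):
    """Pick the class with the highest log prior + summed log Gaussian densities."""
    def log_posterior(label):
        prior, means, variances = model[label]
        score = math.log(prior)
        for v, m, var in zip(row, means, variances):
            score += -0.5 * math.log(2 * math.pi * var) - (v - m) ** 2 / (2 * var)
        return score
    return max(model, key=log_posterior)

# Hypothetical scaled (age, salary) rows; 0 = not purchased, 1 = purchased
X = [[-1.0, -0.8], [-0.6, -1.1], [-0.9, -0.3], [0.8, 1.0], [1.1, 0.6], [0.7, 1.3]]
y = [0, 0, 0, 1, 1, 1]

model = fit_gaussian_nb(X, y)
print(predict_one(model, [-0.7, -0.9]))  # 0
print(predict_one(model, [0.9, 0.9]))    # 1
```

Each feature is treated independently within a class, which is the
"naive" assumption of the classifier.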

3) Prediction of the test set result:

Now we will predict the test set result. For this, we will create a new
predictor variable y_pred, and will use the predict function to make
the predictions.

# Predicting the Test set results
y_pred = classifier.predict(x_test)

Output:


The above output shows the prediction vector y_pred and the real
vector y_test. We can see that some predictions are different from
the real values; these are the incorrect predictions.

4) Creating Confusion Matrix:

Now we will check the accuracy of the Naive Bayes classifier using the
Confusion matrix. Below is the code for it:

# Making the Confusion Matrix
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)

Output:

As we can see in the above confusion matrix output, there are 7+3=
10 incorrect predictions, and 65+25=90 correct predictions.
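
The accuracy follows directly from these counts. In the sketch below,
the position of the 7 and the 3 inside the matrix is an assumption
(rows taken as actual classes, columns as predicted classes); the
accuracy does not depend on which of the two off-diagonal cells holds
which value.

```python
# Assumed layout: rows are actual classes, columns are predicted classes.
cm = [[65, 3],
      [7, 25]]

correct = sum(cm[i][i] for i in range(len(cm)))   # diagonal entries
total = sum(sum(row) for row in cm)
incorrect = total - correct
accuracy = correct / total

print(correct, incorrect, accuracy)  # 90 10 0.9
```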

5) Visualizing the training set result:

Next we will visualize the training set result using Naïve Bayes
Classifier. Below is the code for it:

# Visualising the Training set results
from matplotlib.colors import ListedColormap
x_set, y_set = x_train, y_train
X1, X2 = nm.meshgrid(nm.arange(start = x_set[:, 0].min() - 1, stop = x_set[:, 0].max() + 1, step = 0.01),
                     nm.arange(start = x_set[:, 1].min() - 1, stop = x_set[:, 1].max() + 1, step = 0.01))
mtp.contourf(X1, X2, classifier.predict(nm.array([X1.ravel(), X2.ravel()]).T).reshape(X1.shape),
             alpha = 0.75, cmap = ListedColormap(('purple', 'green')))
mtp.xlim(X1.min(), X1.max())
mtp.ylim(X2.min(), X2.max())
for i, j in enumerate(nm.unique(y_set)):
    mtp.scatter(x_set[y_set == j, 0], x_set[y_set == j, 1],
                c = ListedColormap(('purple', 'green'))(i), label = j)
mtp.title('Naive Bayes (Training set)')
mtp.xlabel('Age')
mtp.ylabel('Estimated Salary')
mtp.legend()
mtp.show()

Output:


In the above output we can see that the Naïve Bayes classifier has
segregated the data points with a smooth boundary. The boundary is
curved because we have used the GaussianNB classifier, which models
each feature in each class with a Gaussian distribution.

6) Visualizing the Test set result:

# Visualising the Test set results
from matplotlib.colors import ListedColormap
x_set, y_set = x_test, y_test
X1, X2 = nm.meshgrid(nm.arange(start = x_set[:, 0].min() - 1, stop = x_set[:, 0].max() + 1, step = 0.01),
                     nm.arange(start = x_set[:, 1].min() - 1, stop = x_set[:, 1].max() + 1, step = 0.01))
mtp.contourf(X1, X2, classifier.predict(nm.array([X1.ravel(), X2.ravel()]).T).reshape(X1.shape),
             alpha = 0.75, cmap = ListedColormap(('purple', 'green')))
mtp.xlim(X1.min(), X1.max())
mtp.ylim(X2.min(), X2.max())
for i, j in enumerate(nm.unique(y_set)):
    mtp.scatter(x_set[y_set == j, 0], x_set[y_set == j, 1],
                c = ListedColormap(('purple', 'green'))(i), label = j)
mtp.title('Naive Bayes (test set)')
mtp.xlabel('Age')
mtp.ylabel('Estimated Salary')
mtp.legend()
mtp.show()

Output:


The above output is the final output for the test set data. As we can
see, the classifier has created a curved boundary to divide the
"purchased" and "not purchased" classes. There are some wrong
predictions, which we have already counted in the confusion matrix,
but it is still a reasonably good classifier.

INSTANCE BASED LEARNING- K-NEAREST NEIGHBOUR LEARNING
o K-Nearest Neighbour is one of the simplest Machine Learning
algorithms based on the Supervised Learning technique.
o K-NN algorithm assumes similarity between the new case/data and
the available cases, and puts the new case into the category
whose cases are most similar to it.
o K-NN algorithm stores all the available data and classifies a new
data point based on similarity. This means that when new data
appears, it can be easily classified into a well-suited category
by using the K-NN algorithm.
o K-NN algorithm can be used for Regression as well as for
Classification, but mostly it is used for Classification
problems.
o K-NN is a non-parametric algorithm, which means it does not
make any assumption about the underlying data distribution.
o It is also called a lazy learner algorithm because it does not learn
from the training set immediately; instead it stores the dataset
and, at the time of classification, performs an action on the
dataset.
o At the training phase, the KNN algorithm just stores the dataset,
and when it gets new data, it classifies that data into the
category that is most similar to the new data.
o Example: Suppose we have an image of a creature that looks
similar to both a cat and a dog, and we want to know whether it
is a cat or a dog. For this identification, we can use the KNN
algorithm, as it works on a similarity measure. Our KNN model will
compare the features of the new image with those of the cat and
dog images and, based on the most similar features, put it in
either the cat or the dog category.

Why do we need a K-NN Algorithm?

Suppose there are two categories, i.e., Category A and Category B, and
we have a new data point x1. In which of these categories will this
data point lie? To solve this type of problem, we need a K-NN
algorithm. With the help of K-NN, we can easily identify the category
or class of a particular data point. Consider the below diagram:


How does K-NN work?

The working of K-NN can be explained on the basis of the below
algorithm:

o Step-1: Select the number K of the neighbors.
o Step-2: Calculate the Euclidean distance from the new data point
to every point in the training data.
o Step-3: Take the K nearest neighbors as per the calculated
Euclidean distance.
o Step-4: Among these K neighbors, count the number of data
points in each category.
o Step-5: Assign the new data point to the category for which the
number of neighbors is maximum.
o Step-6: Our model is ready.
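
The steps above can be sketched in a few lines of plain Python. This is
a minimal illustration and not the scikit-learn implementation: the
function name knn_predict and the sample points are hypothetical.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, new_point, k=5):
    """Classify new_point by majority vote among its k nearest training points."""
    # Step 2: Euclidean distance from the new point to every training point
    distances = [(math.dist(p, new_point), label)
                 for p, label in zip(train_points, train_labels)]
    # Step 3: keep the k nearest neighbours
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    # Steps 4-5: count the labels among the neighbours, return the most common
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical 2-D points for two categories
points = [(1, 1), (1, 2), (2, 1), (2, 2), (6, 6), (6, 7), (7, 6), (7, 7)]
labels = ["A", "A", "A", "A", "B", "B", "B", "B"]

print(knn_predict(points, labels, (3, 3), k=5))  # A
```

For the point (3, 3) the five nearest neighbours are the four points of
category A and one point of category B, so the vote goes to A.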

Suppose we have a new data point and we need to put it in the
required category. Consider the below image:

o Firstly, we will choose the number of neighbors, so we will
choose k=5.
o Next, we will calculate the Euclidean distance between the data
points. The Euclidean distance is the straight-line distance
between two points, which we have already studied in geometry.
For two points (X1, Y1) and (X2, Y2), it can be calculated as:

d = sqrt((X2 - X1)^2 + (Y2 - Y1)^2)

o By calculating the Euclidean distance we got the nearest
neighbors: three nearest neighbors in category A and two
nearest neighbors in category B. Consider the below image:
o As we can see, 3 of the 5 nearest neighbors are from category A,
hence this new data point is assigned to category A.

How to select the value of K in the K-NN Algorithm?

Below are some points to remember while selecting the value of K in
the K-NN algorithm:

o There is no particular way to determine the best value for "K",
so we need to try some values to find the best out of them. A
commonly used starting value for K is 5.
o A very low value for K, such as K=1 or K=2, can be noisy and
makes the model sensitive to outliers.
o Large values for K smooth out the effect of noise, but they can
blur the boundary between categories and need more computation.
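
One simple way to "try some values" is to hold out each point in turn,
predict it from the remaining points, and compare the accuracy for
several values of K. The sketch below does this in plain Python. The
helper names and the small dataset (which contains one point labeled B
in the middle of the A group) are hypothetical.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, new_point, k):
    """Majority vote among the k training points nearest to new_point."""
    nearest = sorted(zip(train_points, train_labels),
                     key=lambda pair: math.dist(pair[0], new_point))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

def leave_one_out_accuracy(points, labels, k):
    """Hold out each point in turn, predict it from the rest, and average."""
    hits = 0
    for i, (point, label) in enumerate(zip(points, labels)):
        rest_points = points[:i] + points[i + 1:]
        rest_labels = labels[:i] + labels[i + 1:]
        hits += knn_predict(rest_points, rest_labels, point, k) == label
    return hits / len(points)

# Hypothetical data: two groups, with one "B" point lying inside the "A" group
points = [(1, 1), (1, 2), (2, 1), (2, 2), (3, 2),
          (6, 6), (6, 7), (7, 6), (7, 7), (5, 6)]
labels = ["A", "A", "A", "B", "A", "B", "B", "B", "B", "B"]

scores = {k: leave_one_out_accuracy(points, labels, k) for k in (1, 3, 5)}
best_k = max(scores, key=scores.get)
print(scores, best_k)
```

With K=1 the single odd point misleads its neighbours, which is the
sensitivity to outliers mentioned above; larger K values vote it down.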

Advantages of KNN Algorithm:

o It is simple to implement.
o It is robust to the noisy training data
o It can be more effective if the training data is large.

Disadvantages of KNN Algorithm:

o The value of K always needs to be determined, which may be
complex at times.
o The computation cost is high because the distance between the
new data point and all the training samples has to be calculated.

Python implementation of the KNN algorithm

To do the Python implementation of the K-NN algorithm, we will use
the same problem and dataset which we have used in Logistic
Regression. But here we will improve the performance of the model.
Below is the problem description:

Problem for K-NN Algorithm: A car manufacturing company has
manufactured a new SUV. The company wants to show ads to the users
who are interested in buying that SUV. For this problem, we have a
dataset that contains information about multiple users of a social
network. The dataset contains a lot of information, but we will
consider Estimated Salary and Age as the independent variables and
Purchased as the dependent variable. Below is the dataset:


Steps to implement the K-NN algorithm:

o Data Pre-processing step
o Fitting the K-NN algorithm to the Training set
o Predicting the test result
o Test accuracy of the result (Creation of Confusion matrix)
o Visualizing the test set result.

Data Pre-Processing Step:

The Data Pre-processing step will remain exactly the same as Logistic
Regression. Below is the code for it:

# importing libraries
import numpy as nm
import matplotlib.pyplot as mtp
import pandas as pd

# importing datasets
data_set = pd.read_csv('user_data.csv')

# Extracting Independent and dependent Variable
x = data_set.iloc[:, [2, 3]].values
y = data_set.iloc[:, 4].values

# Splitting the dataset into training and test set.
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.25, random_state = 0)

# feature Scaling
from sklearn.preprocessing import StandardScaler
st_x = StandardScaler()
x_train = st_x.fit_transform(x_train)
x_test = st_x.transform(x_test)


By executing the above code, our dataset is imported to our program
and pre-processed. After feature scaling, our test dataset will look
like:

From the above output image, we can see that our data is successfully
scaled.

o Fitting K-NN classifier to the Training data:

Now we will fit the K-NN classifier to the training data. To do this,
we will import the KNeighborsClassifier class from the
sklearn.neighbors module. After importing the class, we will create
the classifier object of the class. The parameters of this class
will be:

o n_neighbors: The number of neighbors used by the algorithm.
The default value is 5.
o metric='minkowski': This is the default parameter and it
decides how the distance between the points is measured.
o p=2: The power of the Minkowski metric. With p=2 it is
equivalent to the standard Euclidean metric.

And then we will fit the classifier to the training data. Below is
the code for it:

# Fitting K-NN classifier to the training set
from sklearn.neighbors import KNeighborsClassifier
classifier = KNeighborsClassifier(n_neighbors=5, metric='minkowski', p=2)
classifier.fit(x_train, y_train)

Output: By executing the above code, we will get the output as:

Out[10]:
KNeighborsClassifier(algorithm='auto', leaf_size=30,
metric='minkowski',
metric_params=None, n_jobs=None, n_neighbors=5, p=2,
weights='uniform')
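
The output shows metric='minkowski' with p=2. To see why this is the
Euclidean distance, the sketch below writes the Minkowski distance by
hand, assuming its standard definition (sum of |a_i - b_i|^p, raised to
the power 1/p). The function name minkowski is our own and is not part
of the library.

```python
import math

def minkowski(a, b, p=2):
    """Minkowski distance: (sum of |a_i - b_i|^p) ** (1/p)."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1 / p)

a, b = (0, 0), (3, 4)
print(minkowski(a, b, p=1))  # 7.0  (Manhattan distance)
print(minkowski(a, b, p=2))  # 5.0  (Euclidean distance)
print(math.dist(a, b))       # 5.0  (built-in Euclidean distance)
```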
o Predicting the Test Result: To predict the test set result, we will
create a y_pred vector as we did in Logistic Regression. Below is
the code for it:

# Predicting the test set result
y_pred = classifier.predict(x_test)

Output:

The output for the above code will be:


o Creating the Confusion Matrix:

Now we will create the Confusion Matrix for our K-NN model to
see the accuracy of the classifier. Below is the code for it:

# Creating the Confusion matrix
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)

In the above code, we have imported the confusion_matrix function and
stored its result in the variable cm.

Output: By executing the above code, we will get the matrix as below:


In the above image, we can see there are 64+29= 93 correct
predictions and 3+4= 7 incorrect predictions, whereas, in Logistic
Regression, there were 11 incorrect predictions. So we can say that
the performance of the model is improved by using the K-NN
algorithm.

o Visualizing the Training set result:

Now, we will visualize the training set result for K-NN model. The
code will remain same as we did in Logistic Regression, except
the name of the graph. Below is the code for it:

# Visualizing the training set result
from matplotlib.colors import ListedColormap
x_set, y_set = x_train, y_train
x1, x2 = nm.meshgrid(nm.arange(start = x_set[:, 0].min() - 1, stop = x_set[:, 0].max() + 1, step = 0.01),
                     nm.arange(start = x_set[:, 1].min() - 1, stop = x_set[:, 1].max() + 1, step = 0.01))
mtp.contourf(x1, x2, classifier.predict(nm.array([x1.ravel(), x2.ravel()]).T).reshape(x1.shape),
             alpha = 0.75, cmap = ListedColormap(('red', 'green')))
mtp.xlim(x1.min(), x1.max())
mtp.ylim(x2.min(), x2.max())
for i, j in enumerate(nm.unique(y_set)):
    mtp.scatter(x_set[y_set == j, 0], x_set[y_set == j, 1],
                c = ListedColormap(('red', 'green'))(i), label = j)
mtp.title('K-NN Algorithm (Training set)')
mtp.xlabel('Age')
mtp.ylabel('Estimated Salary')
mtp.legend()
mtp.show()

Output:

By executing the above code, we will get the below graph:

The output graph is different from the graph which we obtained
in Logistic Regression. It can be understood in the below points:

o As we can see, the graph is showing red points and green
points. The green points are for the Purchased(1) and the red
points for the not Purchased(0) class.
o The graph is showing an irregular boundary instead of
any straight line or smooth curve because it is a K-NN
algorithm, i.e., each region is decided by the nearest neighbors.
o The graph has classified users in the correct categories, as
most of the users who didn't buy the SUV are in the red
region and users who bought the SUV are in the green
region.
o The graph is showing a good result, but still there are some
green points in the red region and red points in the green
region. This is not a big issue, because a boundary that
followed every single training point would be overfitting.
o Hence our model is well trained.

o Visualizing the Test set result:

After the training of the model, we will now test the result by
putting a new dataset, i.e., Test dataset. Code remains the same
except some minor changes: such as x_train and y_train will be
replaced by x_test and y_test.

Below is the code for it:

# Visualizing the test set result
from matplotlib.colors import ListedColormap
x_set, y_set = x_test, y_test
x1, x2 = nm.meshgrid(nm.arange(start = x_set[:, 0].min() - 1, stop = x_set[:, 0].max() + 1, step = 0.01),
                     nm.arange(start = x_set[:, 1].min() - 1, stop = x_set[:, 1].max() + 1, step = 0.01))
mtp.contourf(x1, x2, classifier.predict(nm.array([x1.ravel(), x2.ravel()]).T).reshape(x1.shape),
             alpha = 0.75, cmap = ListedColormap(('red', 'green')))
mtp.xlim(x1.min(), x1.max())
mtp.ylim(x2.min(), x2.max())
for i, j in enumerate(nm.unique(y_set)):
    mtp.scatter(x_set[y_set == j, 0], x_set[y_set == j, 1],
                c = ListedColormap(('red', 'green'))(i), label = j)
mtp.title('K-NN algorithm(Test set)')
mtp.xlabel('Age')
mtp.ylabel('Estimated Salary')
mtp.legend()
mtp.show()

Output:

The above graph is showing the output for the test data set. As we can
see in the graph, the predicted output is good, as most of the red
points are in the red region and most of the green points are in the
green region.

However, there are a few green points in the red region and a few red
points in the green region. These are the incorrect predictions that
we have counted in the confusion matrix (7 incorrect outputs).

INTRODUCTION TO MACHINE LEARNING (ML):


Machine Learning tutorial provides basic and advanced concepts of
machine learning. Our machine learning tutorial is designed for
students and working professionals.

Machine learning is a growing technology which enables computers to
learn automatically from past data. Machine learning uses various
algorithms for building mathematical models and making predictions
using historical data or information. Currently, it is being used for
various tasks such as image recognition, speech recognition, email
filtering, Facebook auto-tagging, recommender systems, and many
more.

This machine learning tutorial gives you an introduction to machine
learning along with the wide range of machine learning techniques
such as Supervised, Unsupervised, and Reinforcement learning. You
will learn about regression and classification models, clustering
methods, hidden Markov models, and various sequential models.

What is Machine Learning

In the real world, we are surrounded by humans who can learn
from their experiences with their learning capability, and
we have computers or machines which work on our instructions. But
can a machine also learn from experiences or past data like a human
does? This is where Machine Learning comes in.

Machine Learning is a subset of artificial intelligence that is
mainly concerned with the development of algorithms which allow a
computer to learn from data and past experiences on its own.

The term machine learning was first introduced by Arthur
Samuel in 1959. We can define it in a summarized way as:

Machine learning enables a machine to automatically learn from data,
improve performance from experiences, and predict things without
being explicitly programmed.

With the help of sample historical data, which is known as training
data, machine learning algorithms build a mathematical model that
helps in making predictions or decisions without being explicitly
programmed. Machine learning brings computer science and statistics
together for creating predictive models. Machine learning constructs
or uses algorithms that learn from historical data. In general, the
more relevant data we provide, the better the performance will be.

A machine has the ability to learn if it can improve its performance
by gaining more data.

How does Machine Learning work

A Machine Learning system learns from historical data, builds the
prediction models, and whenever it receives new data, predicts the
output for it. The accuracy of the predicted output depends upon the
amount of data, as a large amount of data helps to build a better
model which predicts the output more accurately.

Suppose we have a complex problem where we need to perform
some predictions. Instead of writing code for it, we just need to
feed the data to generic algorithms, and with the help of these
algorithms, the machine builds the logic as per the data and predicts
the output. Machine learning has changed our way of thinking about
the problem. The below block diagram explains the working of a
Machine Learning algorithm:


Features of Machine Learning:

o Machine learning uses data to detect various patterns in a given
dataset.
o It can learn from past data and improve automatically.
o It is a data-driven technology.
o Machine learning is similar to data mining, as it also deals
with huge amounts of data.

Need for Machine Learning

The need for machine learning is increasing day by day. The reason
behind the need for machine learning is that it is capable of doing tasks
that are too complex for a person to implement directly. As humans,
we have some limitations, as we cannot process huge amounts of
data manually, so for this we need computer systems, and here
machine learning comes in to make things easy for us.

We can train machine learning algorithms by providing them with a huge
amount of data and letting them explore the data, construct the models,
and predict the required output automatically. The performance of
the machine learning algorithm depends on the amount of data, and
it can be measured by a cost function. With the help of machine
learning, we can save both time and money.

The importance of machine learning can be easily understood from its
use cases. Currently, machine learning is used in self-driving
cars, cyber fraud detection, face recognition, friend suggestions
by Facebook, etc. Various top companies such as Netflix and Amazon
have built machine learning models that use vast amounts of
data to analyze user interest and recommend products accordingly.

Following are some key points which show the importance of
Machine Learning:

o Rapid increase in the production of data
o Solving complex problems, which are difficult for a human
o Decision making in various sectors, including finance
o Finding hidden patterns and extracting useful information from
data.

Classification of Machine Learning

At a broad level, machine learning can be classified into three types:

1. Supervised learning
2. Unsupervised learning
3. Reinforcement learning


1) Supervised Learning

Supervised learning is a type of machine learning method in which we
provide sample labeled data to the machine learning system in order
to train it, and on that basis, it predicts the output.

The system creates a model using labeled data to understand the
datasets and learn about each data point. Once the training and
processing are done, we test the model by providing sample data to
check whether it is predicting the correct output or not.

The goal of supervised learning is to map input data to the output
data. Supervised learning is based on supervision, and it is the
same as when a student learns things under the supervision of a
teacher. An example of supervised learning is spam filtering.

Supervised learning can be grouped further into two categories of
algorithms:

o Classification
o Regression

2) Unsupervised Learning

Unsupervised learning is a learning method in which a machine learns
without any supervision.

The training is provided to the machine with a set of data that has
not been labeled, classified, or categorized, and the algorithm needs
to act on that data without any supervision. The goal of unsupervised
learning is to restructure the input data into new features or groups
of objects with similar patterns.

In unsupervised learning, we don't have a predetermined result. The
machine tries to find useful insights from the huge amount of data. It
can be further classified into two categories of algorithms:

o Clustering
o Association

3) Reinforcement Learning

Reinforcement learning is a feedback-based learning method, in which
a learning agent gets a reward for each right action and gets a penalty
for each wrong action. The agent learns automatically from this
feedback and improves its performance. In reinforcement learning,
the agent interacts with the environment and explores it. The goal of
an agent is to get the most reward points, and hence, it improves its
performance.

A robotic dog, which automatically learns the movement of its
limbs, is an example of Reinforcement learning.
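
The reward-and-penalty idea can be sketched with a very small agent. The
example below is a simplified, single-state setting (an epsilon-greedy
choice between two actions) and is only an illustration: the function
name, the reward values of +1 and -1, and the parameters are
assumptions, and a real reinforcement learning problem would also
involve states of the environment.

```python
import random

def train_agent(right_action, n_actions=2, steps=50, epsilon=0.1, seed=0):
    """Learn action values from +1 rewards and -1 penalties."""
    rng = random.Random(seed)
    values = [0.0] * n_actions   # estimated value of each action
    counts = [0] * n_actions     # how often each action was tried
    for _ in range(steps):
        if rng.random() < epsilon:
            action = rng.randrange(n_actions)                         # explore
        else:
            action = max(range(n_actions), key=lambda a: values[a])   # exploit
        reward = 1 if action == right_action else -1                  # feedback
        counts[action] += 1
        # incremental average of the rewards seen for this action
        values[action] += (reward - values[action]) / counts[action]
    return values

values = train_agent(right_action=1)
best = max(range(len(values)), key=lambda a: values[a])
print(best)  # 1
```

After a penalty for the wrong action, its estimated value drops below
that of the right action, so the agent prefers the rewarded action.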

History of Machine Learning

Some years ago (about 40-50 years), machine learning was science
fiction, but today it is a part of our daily life. Machine learning is
making our day-to-day life easier, from self-driving cars to Amazon's
virtual assistant "Alexa". However, the idea behind machine learning
is old and has a long history. Below are some milestones in the
history of machine learning:


The early history of Machine Learning (Pre-1940):

o 1834: In 1834, Charles Babbage, the father of the computer,
conceived a device that could be programmed with punch cards.
The machine was never built, but all modern
computers rely on its logical structure.
o 1936: In 1936, Alan Turing gave a theory of how a machine can
determine and execute a set of instructions.

The era of stored program computers:

o 1940s: ENIAC, the first electronic general-purpose computer,
was completed in 1945; it was programmed manually. After that,
stored-program computers such as EDSAC in 1949 and EDVAC in
1951 were built.
o 1943: In 1943, a human neural network was modeled with an
electrical circuit. In 1950, scientists started applying this
idea and analyzed how human neurons might work.

Computing machinery and intelligence:

o 1950: In 1950, Alan Turing published a seminal paper,
"Computing Machinery and Intelligence," on the topic of
artificial intelligence. In his paper, he asked, "Can machines
think?"

Machine intelligence in Games:

o 1952: Arthur Samuel, who was a pioneer of machine learning,
created a program that helped an IBM computer to play
checkers. It performed better the more it played.
o 1959: In 1959, the term "Machine Learning" was first coined
by Arthur Samuel.


The first "AI" winter:

o The period from 1974 to 1980 was a tough time for AI and ML
researchers, and this period is called the AI winter.
o In this period, machine translation failed to meet expectations
and people lost interest in AI, which led to reduced
government funding for research.

Machine Learning from theory to reality

o 1959: In 1959, the first neural network was applied to a
real-world problem to remove echoes over phone lines using an
adaptive filter.
o 1985: In 1985, Terry Sejnowski and Charles Rosenberg invented
a neural network NETtalk, which was able to teach itself how to
correctly pronounce 20,000 words in one week.
o 1997: IBM's Deep Blue computer won a chess match against
the reigning world chess champion Garry Kasparov, and it became
the first computer to beat a reigning world champion in a match.

Machine Learning at present:

Machine learning has now advanced greatly in its research,
and it is present everywhere around us, such as in self-driving
cars, Amazon Alexa, chatbots, recommender systems, and many
more. It includes supervised, unsupervised, and reinforcement
learning with clustering, classification, decision tree, SVM
algorithms, etc.

Modern machine learning models can be used for making various
predictions, including weather prediction, disease prediction, stock
market analysis, etc.

DIFFERENCES BETWEEN SUPERVISED AND UNSUPERVISED LEARNING PARADIGMS

Supervised and Unsupervised learning are two main techniques of
machine learning. But both the techniques are used in different
scenarios and with different datasets. Below, both learning methods
are explained, along with a table of their differences.

Supervised Machine Learning:

Supervised learning is a machine learning method in which models are
trained using labeled data. In supervised learning, models need to find
the mapping function to map the input variable (X) to the output
variable (Y).

Supervised learning needs supervision to train the model, which is
similar to how a student learns things in the presence of a teacher.
Supervised learning can be used for two types of
problems: Classification and Regression.

Example: Suppose we have images of different types of fruits. The
task of our supervised learning model is to identify the fruits and
classify them accordingly. To identify the images in supervised
learning, we will give the input data as well as the output for it,
which means we will train the model with the shape, size, color, and
taste of each fruit. Once the training is completed, we will test the
model by giving it a new set of fruits. The model will identify each
fruit and predict the output using a suitable algorithm.

Unsupervised Machine Learning:

Unsupervised learning is another machine learning method in which
patterns are inferred from the unlabeled input data. The goal of
unsupervised learning is to find the structure and patterns in the
input data. Unsupervised learning does not need any supervision.
Instead, it finds patterns in the data on its own.

Unsupervised learning can be used for two types of
problems: Clustering and Association.

Example: To understand unsupervised learning, we will use the
example given above. Unlike supervised learning, here we will not
provide any supervision to the model. We will just provide the input
dataset to the model and allow the model to find the patterns in
the data. With the help of a suitable algorithm, the model will train
itself and divide the fruits into different groups according to the
most similar features between them.
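
The contrast between the two fruit examples can be sketched on a single
numeric feature. In the sketch below the fruit weights, the fruit names
and the function names are all hypothetical. The supervised part uses
the labels to learn one mean weight per class, while the unsupervised
part groups the same weights into two clusters (a small two-means
clustering) without seeing any labels.

```python
# Hypothetical fruit weights in grams
weights = [110, 120, 130, 300, 310, 320]
labels = ["apple", "apple", "apple", "melon", "melon", "melon"]  # supervised only

def fit_class_means(values, labels):
    """Supervised: learn one mean weight per labeled class."""
    groups = {}
    for v, lab in zip(values, labels):
        groups.setdefault(lab, []).append(v)
    return {lab: sum(vs) / len(vs) for lab, vs in groups.items()}

def predict_label(class_means, value):
    """Predict the class whose mean weight is closest to the value."""
    return min(class_means, key=lambda lab: abs(class_means[lab] - value))

def two_means(values, iterations=10):
    """Unsupervised: split the values into two clusters without labels."""
    centers = [min(values), max(values)]           # simple deterministic start
    for _ in range(iterations):
        clusters = [[], []]
        for v in values:
            nearest = 0 if abs(v - centers[0]) <= abs(v - centers[1]) else 1
            clusters[nearest].append(v)
        centers = [sum(c) / len(c) for c in clusters]
    return clusters

class_means = fit_class_means(weights, labels)
print(predict_label(class_means, 125))   # apple
print(two_means(weights))                # [[110, 120, 130], [300, 310, 320]]
```

The supervised model returns a named class, while the clustering only
returns groups; giving those groups a meaning is left to us.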

The main differences between Supervised and Unsupervised learning


are given below:

Supervised Learning | Unsupervised Learning

Supervised learning algorithms are trained using labeled data. | Unsupervised learning algorithms are trained using unlabeled data.

Supervised learning model takes direct feedback to check if it is predicting the correct output or not. | Unsupervised learning model does not take any feedback.

Supervised learning model predicts the output. | Unsupervised learning model finds the hidden patterns in data.

In supervised learning, input data is provided to the model along with the output. | In unsupervised learning, only input data is provided to the model.

The goal of supervised learning is to train the model so that it can predict the output when it is given new data. | The goal of unsupervised learning is to find the hidden patterns and useful insights from the unknown dataset.

Supervised learning needs supervision to train the model. | Unsupervised learning does not need any supervision to train the model.

Supervised learning can be categorized into Classification and Regression problems. | Unsupervised learning can be classified into Clustering and Association problems.

Supervised learning can be used for those cases where we know the input as well as the corresponding outputs. | Unsupervised learning can be used for those cases where we have only input data and no corresponding output data.

Supervised learning model generally produces a more accurate result, since it is trained against known outputs. | Unsupervised learning model may give a less accurate result as compared to supervised learning.

Supervised learning is not close to true Artificial Intelligence, as in this we first train the model on labeled data, and only then can it predict the correct output. | Unsupervised learning is closer to true Artificial Intelligence, as it learns similarly to how a child learns daily routine things from experience.

It includes various algorithms such as Linear Regression, Logistic Regression, Support Vector Machine, Multi-class Classification, Decision tree, Bayesian Logic, etc. | It includes various algorithms such as Clustering (for example K-means) and the Apriori algorithm.
