
RV College of Engineering | Go, change the world

21AI52-Artificial Intelligence and Machine Learning

Unit-IV
Contents

Nearest Neighbor Classifiers


Naive Bayes Classifier
Logistic Regression
Ensemble Methods
Nearest Neighbor Classifiers

Basic idea:
If it walks like a duck, quacks like a duck, then
it’s probably a duck

Figure: to classify a test record, compute its distance to the training records and choose the k "nearest" records.
Nearest Neighbor Classifiers

Find all the training examples that are relatively similar to the attributes of the test example.
These examples, known as nearest neighbors, can be used to determine the class label of the test example.
A nearest-neighbor classifier represents each example as a data point in a d-dimensional space, where d is the number of attributes.
Given a test example, its proximity to the rest of the data points in the training set is computed using one of the proximity measures.
The k-nearest neighbors of a given example z refer to the k points that are closest to z.
Nearest Neighbor Classifiers

The figure illustrates the 1-, 2-, and 3-nearest neighbors of a data point located at the center of each circle.
The data point is classified based on the class labels of its neighbors.
In the case where the neighbors have more than one label, the data point is assigned to the majority class of its nearest neighbors.
Nearest-Neighbor Classifiers

Requires the following:
A set of labeled records.
A proximity metric to compute the distance/similarity between a pair of records (e.g., Euclidean distance).
The value of k, the number of nearest neighbors to retrieve.
A method for using the class labels of the k nearest neighbors to determine the class label of the unknown record (e.g., by taking a majority vote).
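A minimal sketch of this recipe in Python (the tiny dataset and the value of k are made up for illustration):

    import numpy as np
    from collections import Counter

    def knn_predict(X_train, y_train, x_test, k=3):
        # 1. proximity metric: Euclidean distance to every labeled record
        dists = np.linalg.norm(X_train - x_test, axis=1)
        # 2. retrieve the k nearest neighbors
        nearest = np.argsort(dists)[:k]
        # 3. combine their class labels by majority vote
        return Counter(y_train[nearest]).most_common(1)[0][0]

    X_train = np.array([[1.0, 1.0], [1.2, 0.8], [8.0, 9.0], [9.0, 8.5]])
    y_train = np.array(["duck", "duck", "goose", "goose"])
    print(knn_predict(X_train, y_train, np.array([1.1, 0.9]), k=3))  # -> "duck"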
Choice of proximity measure matters

For documents, cosine similarity is better than correlation or Euclidean distance.
Example (two pairs of binary document vectors):
Pair 1: (1,1,1,1,1,1,1,1,1,1,1,0) vs (0,1,1,1,1,1,1,1,1,1,1,1)
Pair 2: (1,0,0,0,0,0,0,0,0,0,0,0) vs (0,0,0,0,0,0,0,0,0,0,0,1)
Euclidean distance = 1.4142 for both pairs, but the cosine similarity measure has different values for these pairs.
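The exact vector entries above are reconstructed from the slide, so treat them as illustrative; a quick numerical check:

    import numpy as np

    def euclidean(u, v):
        return np.linalg.norm(u - v)

    def cosine(u, v):
        return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

    a1 = np.array([1,1,1,1,1,1,1,1,1,1,1,0]); b1 = np.array([0,1,1,1,1,1,1,1,1,1,1,1])
    a2 = np.array([1,0,0,0,0,0,0,0,0,0,0,0]); b2 = np.array([0,0,0,0,0,0,0,0,0,0,0,1])
    print(euclidean(a1, b1), euclidean(a2, b2))  # both ~1.4142
    print(cosine(a1, b1), cosine(a2, b2))        # about 0.91 vs 0.0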
Nearest Neighbor Classification…

Choosing the value of k:


If k is too small, then the nearest-neighbor classifier may be susceptible to overfitting because of noise in the training data.
On the other hand, if k is too large, the nearest-
neighbor classifier may misclassify the test
instance because its list of nearest neighbors may
include data points that are located far away from
its neighborhood
Nearest Neighbor Classification…

Algorithm:

The algorithm computes the distance (or similarity) between each test example z = (x′, y′) and all the training examples (x, y) ∈ D to determine its nearest-neighbor list, Dz.
Such computation can be costly if the number of training
examples is large. However, efficient indexing techniques are
available to reduce the amount of computations needed to find
the nearest neighbors of a test example.
Nearest Neighbor Classification…

Majority Voting

In the majority voting approach, every neighbor has the


same impact on the classification
This makes the algorithm sensitive to the choice of k

Distance-Weighted Voting
One way to reduce the impact of k is to weight the influence of each nearest neighbor xi according to its distance: wi = 1/d(x′, xi)².
As a result, training examples that are located far away from z
have a weaker impact on the classification compared to those
that are located close to z.
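A short sketch of distance-weighted voting with wi = 1/d², continuing the kNN example style above (the data and names are illustrative):

    import numpy as np

    def weighted_knn_predict(X_train, y_train, x_test, k=3):
        dists = np.linalg.norm(X_train - x_test, axis=1)
        nearest = np.argsort(dists)[:k]
        votes = {}
        for i in nearest:
            w = 1.0 / (dists[i] ** 2 + 1e-12)   # wi = 1/d(x', xi)^2; small term avoids division by zero
            votes[y_train[i]] = votes.get(y_train[i], 0.0) + w
        return max(votes, key=votes.get)        # class with the largest total weight

    X = np.array([[0.0, 0.0], [0.1, 0.1], [5.0, 5.0]]); y = np.array(["a", "a", "b"])
    print(weighted_knn_predict(X, y, np.array([0.05, 0.05]), k=3))   # -> "a"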
Nearest-neighbor classifiers

Nearest neighbor
classifiers are local
classifiers
They can produce
decision boundaries
of arbitrary shapes.
Characteristics of Nearest-Neighbor Classifiers
Nearest-neighbor classification is part of a more general
technique known as instance-based learning, which
uses specific training instances to make predictions without
having to maintain an abstraction (or model) derived from
data. Instance-based learning algorithms require a proximity
measure to determine the similarity or distance between
instances and a classification function that returns the
predicted class of a test instance based on its proximity to
other instances.
Lazy learners such as nearest-neighbor classifiers do not require model building. However, classifying a test example can be quite expensive because we need to compute the proximity values individually between the test and training examples. In contrast, eager learners often spend the bulk of their computing resources on model building. Once a model has been built, classifying a test example is extremely fast.
Nearest-neighbor classifiers make their predictions based
on local information, whereas decision tree and rule-based
classifiers attempt to find a global model that fits the entire input
space. Because the classification decisions are made locally, nearest-
neighbor classifiers (with small values of k) are quite susceptible to
noise.
Nearest-neighbor classifiers can produce arbitrarily shaped
decision boundaries. Such boundaries provide a more flexible
model representation compared to decision tree and rule-based
classifiers that are often constrained to rectilinear decision boundaries.
Nearest-neighbor classifiers can produce wrong predictions unless the
appropriate proximity measure and data preprocessing steps are taken.
For example, suppose we want to classify a group of people based on
attributes such as height (measured in meters) and weight (measured
in pounds). The height attribute has a low variability, ranging from 1.5
m to 1.85 m, whereas the weight attribute may vary from 90 lb. to 250
lb. If the scale of the attributes is not taken into consideration, the proximity measure may be dominated by differences in weight.
Nearest Neighbor Classification…

How to handle missing values in


training and test sets?
Proximity computations normally require
the presence of all attributes
Some approaches use the subset of
attributes present in two instances
This may not produce good results since it
effectively uses different proximity
measures for each pair of instances
Thus, proximities are not comparable
Naive Bayes Classifier
Bayes Classifier

In many applications the relationship between the


attribute set and the class variable is non-deterministic.
In other words, the class label of a test record cannot be
predicted with certainty even though its attribute set is
identical to some of the training examples.
This situation may arise because of noisy data that
affect classification but are not included in the analysis.
For example, consider the task of predicting whether a
person is at risk for heart disease based on the person’s
diet and workout frequency.
Although most people who eat healthily and exercise
regularly have less chance of developing heart disease,
they may still do so because of other factors such as
heredity, excessive smoking, and alcohol abuse.
This introduces uncertainty into the learning problem.
Bayes Classifier

Consider a football game between two rival teams: Team


0 and Team 1. Suppose Team 0 wins 65% of the time and
Team 1 wins the remaining matches. Among the games
won by Team 0, only 30% of them come from playing on
Team 1’s football field. On the other hand, 75% of the
victories for Team 1 are obtained while playing at home.
If Team 1 is to host the next match between the two
teams, which team will most likely emerge as the
winner?
This question can be answered by using the well-known
Bayes theorem.
Bayes Classifier

Let X and Y be a pair of random variables. Their joint


probability, P(X = x, Y = y), refers to the probability
that variable X will take on the value x and variable Y
will take on the value y.
A conditional probability is the probability that a random
variable will take on a particular value given that the
outcome for another random variable is known.
For example, the conditional probability P(Y = y|X = x)
refers to the probability that the variable Y will take
on the value y, given that the variable X is observed
to have the value x.
Bayes Classifier

A probabilistic framework for solving


classification problems
Conditional Probability:

The joint and conditional probabilities for X


and Y are related in the following way:
P(X, Y ) = P(Y |X) × P(X) = P(X|Y ) ×
P(Y ).
Bayes Classifier

Conditional Probability: P(Y |X) = P(X, Y ) / P(X), and P(X|Y ) = P(X, Y ) / P(Y ).
Bayes theorem: P(Y |X) = P(X|Y ) × P(Y ) / P(X)
Bayes Classifier

The Bayes theorem can be used to solve the prediction


problem stated below
Consider a football game between two rival teams: Team 0
and Team 1. Suppose Team 0 wins 65% of the time and Team 1
wins the remaining matches. Among the games won by Team
0, only 30% of them come from playing on Team 1’s football
field. On the other hand, 75% of the victories for Team 1 are
obtained while playing at home. If Team 1 is to host the next
match between the two teams, which team will most likely
emerge as the winner?
For notational convenience, let X be the random variable that
represents the team hosting the match and Y be the random
variable that represents the winner of the match. Both X and Y
can take on values from the set {0, 1}.
We can summarize the information given in the problem as follows:
Probability Team 0 wins is P(Y = 0) = 0.65.
Probability Team 1 wins is P(Y = 1) = 1 − P(Y = 0) = 0.35.
Probability Team 1 hosted the match it won is P(X = 1|Y = 1) = 0.75.
Bayes Classifier

Probability Team 0 wins is P(Y = 0) = 0.65.


Probability Team 1 wins is P(Y = 1) = 1 − P(Y = 0) = 0.35.
Probability Team 1 hosted the match it won is P(X = 1|Y = 1) =
0.75.
Probability Team 1 hosted the match won by Team 0 is P(X = 1|Y
= 0) = 0.3
Our objective is to compute P(Y = 1|X = 1), which is the conditional probability that Team 1 wins the next match it will be hosting, and compare it against P(Y = 0|X = 1). Using the Bayes theorem:
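Completing the calculation with the probabilities listed above:

P(Y = 1|X = 1) = P(X = 1|Y = 1) P(Y = 1) / P(X = 1)
             = P(X = 1|Y = 1) P(Y = 1) / [ P(X = 1|Y = 1) P(Y = 1) + P(X = 1|Y = 0) P(Y = 0) ]
             = (0.75 × 0.35) / (0.75 × 0.35 + 0.3 × 0.65)
             = 0.2625 / 0.4575 ≈ 0.5738

so P(Y = 0|X = 1) ≈ 0.4262. Since P(Y = 1|X = 1) > P(Y = 0|X = 1), Team 1 has a better chance of winning the match it hosts.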
Using Bayes Theorem for Classification

If the class variable has a non-deterministic relationship with the attributes, then we can treat X and Y as random variables and capture their relationship probabilistically using P(Y |X).
This conditional probability is also known
as the posterior probability for Y , as
opposed to its prior probability, P(Y ).
Consider each attribute and class label as
random variables
Given a record with attributes (X1, X2,…,
Xd), the goal is to predict class Y
Specifically, we want to find the value
of Y that maximizes P(Y| X1, X2,…, Xd )
Can we estimate P(Y| X1, X2,…, Xd )
directly from data?
Using Bayes Theorem for Classification

During the training phase, we need to learn the posterior


probabilities P(Y |X) for every combination of X and Y based on
information gathered from the training data
By knowing these probabilities, a test record X can be classified by
finding the class Y that maximizes the posterior probability P(Y |
X).
Using Bayes Theorem for Classification

Approach: compute the posterior probability P(Y | X1, X2, …, Xd) using the Bayes theorem:
P(Y | X1, X2, …, Xd) = P(X1, X2, …, Xd | Y ) P(Y ) / P(X1, X2, …, Xd)

Maximum a-posteriori: Choose Y that maximizes


P(Y | X1, X2, …, Xd)
Equivalent to choosing value of Y that maximizes
P(X1, X2, …, Xd|Y) P(Y)

How to estimate P(X1, X2, …, Xd | Y )?


Example Data
Given a Test Record: X = (Refund = No, Marital Status = Divorced, Income = 120K)

We need to estimate P(Evade = Yes | X) and P(Evade = No | X).
In the following we will replace Evade = Yes by Yes, and Evade = No by No.
Conditional Independence

X and Y are conditionally independent


given Z if P(X|YZ) = P(X|Z)

Example: Arm length and reading skills


Young child has shorter arm length and
limited reading skills, compared to
adults
If age is fixed, no apparent relationship
between arm length and reading skills
Arm length and reading skills are
conditionally independent given age
Naïve Bayes Classifier

A naïve Bayes classifier estimates the class-conditional probability by assuming that the attributes are conditionally independent, given the class label y.
The conditional independence assumption can be formally stated as follows:
P(X|Y = y) = Π (i = 1 to d) P(Xi|Y = y),
where each attribute set X = {X1, X2, . . . , Xd} consists of d attributes.
Naïve Bayes Classifier

Assume independence among attributes Xi when


class is given:
P(X1, X2, …, Xd |Yj) = P(X1| Yj) P(X2| Yj)… P(Xd| Yj)

Now we can estimate P(Xi| Yj) for all Xi and Yj


combinations from the training data
To classify a test record, the naïve Bayes classifier computes the posterior probability for each class Y:
P(Y |X) = P(Y ) Π P(Xi|Y ) / P(X)
Since P(X) is fixed for a given record, the new point is classified to Yj if P(Yj) Π P(Xi|Yj) is maximal.
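A minimal sketch of a naïve Bayes classifier for categorical attributes (the tiny dataset below is made up, and no smoothing is applied; smoothing is discussed later in this unit):

    from collections import Counter, defaultdict

    def train_nb(records, labels):
        prior = Counter(labels)                       # class counts for P(Y)
        cond = defaultdict(Counter)                   # counts for P(Xi = value | Y)
        for rec, y in zip(records, labels):
            for i, value in enumerate(rec):
                cond[(i, y)][value] += 1
        return prior, cond

    def predict_nb(prior, cond, rec):
        n = sum(prior.values())
        best, best_score = None, -1.0
        for y, ny in prior.items():
            score = ny / n                            # P(Y)
            for i, value in enumerate(rec):
                score *= cond[(i, y)][value] / ny     # P(Xi = value | Y)
            if score > best_score:
                best, best_score = y, score
        return best

    records = [("No", "Married"), ("Yes", "Single"), ("No", "Divorced"), ("No", "Single")]
    labels  = ["No", "No", "Yes", "Yes"]
    prior, cond = train_nb(records, labels)
    print(predict_nb(prior, cond, ("No", "Single")))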
Naïve Bayes on Example Data
Given a Test Record: X = (Refund = No, Divorced, Income = 120K)

P(X | Yes) =
P(Refund = No | Yes) x
P(Divorced | Yes) x
P(Income = 120K | Yes)
P(X | No) =
P(Refund = No | No) x
P(Divorced | No) x
P(Income = 120K | No)
Estimate Probabilities for Categorical Attributes
P(y) = fraction of instances of
class y
e.g., P(No) = 7/10,
P(Yes) = 3/10

For categorical attributes: P(Xi = c | y) = nc / n,
where nc is the number of instances having attribute value Xi = c and belonging to class y, and n is the number of instances of class y.
Examples:
P(Status = Married | No) = 4/7
P(Refund = Yes | Yes) = 0
Estimate Probabilities continuous attribute

Discretization: Partition the range into bins:


Replace continuous value with bin value
Attribute changed from continuous to ordinal

Probability density estimation:


Assume attribute follows a normal distribution
Use data to estimate parameters of distribution
(e.g., mean and standard deviation)
Once the probability distribution is known, use it to estimate the conditional probability P(Xi|Y).
Estimate Probabilities continuous attributes
Normal distribution:
P(Xi = xi | Y = yj) = 1 / sqrt(2π σij²) × exp( −(xi − μij)² / (2 σij²) )
One distribution for each (Xi, Yj) pair. The distribution is characterized by two parameters, its mean μ and variance σ².
For (Income, Class = No): sample mean = 110, sample variance = 2975.
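Plugging the sample statistics above into the normal density reproduces the values used later in the slides; a quick check in Python:

    import math

    def gaussian_prob(x, mean, var):
        # P(Xi = x | Y) under a normal distribution with the given mean and variance
        return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

    print(gaussian_prob(120, 110, 2975))   # P(Income = 120K | No)  ~ 0.0072
    print(gaussian_prob(120, 90, 25))      # P(Income = 120K | Yes) ~ 1.2e-09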
Estimate Probabilities for Continuous Attributes

The mean of a population is μ = (1/N) Σ xi, and the population variance is σ² = (1/N) Σ (xi − μ)².
π (mathematical constant) ≈ 3.14159265359; e (mathematical constant) ≈ 2.7182.
Example of Naïve Bayes Classifier

Given a Test Record: X = (Refund = No, Divorced, Income = 120K)

Naïve Bayes Classifier:
P(Refund = Yes | No) = 3/7
P(Refund = No | No) = 4/7
P(Refund = Yes | Yes) = 0
P(Refund = No | Yes) = 1
P(Marital Status = Single | No) = 2/7
P(Marital Status = Divorced | No) = 1/7
P(Marital Status = Married | No) = 4/7
P(Marital Status = Single | Yes) = 2/3
P(Marital Status = Divorced | Yes) = 1/3
P(Marital Status = Married | Yes) = 0
For Taxable Income:
If class = No: sample mean = 110, sample variance = 2975
If class = Yes: sample mean = 90, sample variance = 25

P(X | No) = P(Refund = No | No) × P(Divorced | No) × P(Income = 120K | No) = 4/7 × 1/7 × 0.0072 = 0.0006
P(X | Yes) = P(Refund = No | Yes) × P(Divorced | Yes) × P(Income = 120K | Yes) = 1 × 1/3 × 1.2 × 10^−9 = 4 × 10^−10

Since P(X|No)P(No) > P(X|Yes)P(Yes), therefore P(No|X) > P(Yes|X) => Class = No
Naïve Bayes Classifier can make decisions with partial information about attributes in the test record
(conditional probabilities as listed in the previous example)

Even in the absence of information about any attribute, we can use the a priori probabilities of the class variable:
P(Yes) = 3/10, P(No) = 7/10

If we only know that Marital Status is Divorced, then:
P(Yes | Divorced) = 1/3 × 3/10 / P(Divorced)
P(No | Divorced) = 1/7 × 7/10 / P(Divorced)

If we also know that Refund = No, then:
P(Yes | Refund = No, Divorced) = 1 × 1/3 × 3/10 / P(Divorced, Refund = No)
P(No | Refund = No, Divorced) = 4/7 × 1/7 × 7/10 / P(Divorced, Refund = No)

If we also know that Taxable Income = 120, then:
P(Yes | Refund = No, Divorced, Income = 120) = 1.2 × 10^−9 × 1 × 1/3 × 3/10 / P(Divorced, Refund = No, Income = 120)
P(No | Refund = No, Divorced, Income = 120) = 0.0072 × 4/7 × 1/7 × 7/10 / P(Divorced, Refund = No, Income = 120)
Issues with Naïve Bayes Classifier

Given a Test Record: X = (Married)

Naïve Bayes Classifier (conditional probabilities as listed above):
P(Yes) = 3/10, P(No) = 7/10
P(Yes | Married) = 0 × 3/10 / P(Married)
P(No | Married) = 4/7 × 7/10 / P(Married)

Because P(Marital Status = Married | Yes) = 0, the posterior for Yes collapses to zero regardless of the other attribute values; zero conditional probabilities estimated from the training data are therefore a problem for the naïve Bayes classifier.
Issues with Naïve Bayes Classifier

Consider the table with Tid = 7 deleted.

Naïve Bayes Classifier:
P(Refund = Yes | No) = 2/6
P(Refund = No | No) = 4/6
P(Refund = Yes | Yes) = 0
P(Refund = No | Yes) = 1
P(Marital Status = Single | No) = 2/6
P(Marital Status = Divorced | No) = 0
P(Marital Status = Married | No) = 4/6
P(Marital Status = Single | Yes) = 2/3
P(Marital Status = Divorced | Yes) = 1/3
P(Marital Status = Married | Yes) = 0/3
For Taxable Income:
If class = No: sample mean = 91, sample variance = 685
If class = Yes: sample mean = 90, sample variance = 25

Given X = (Refund = Yes, Divorced, 120K):
P(X | No) = 2/6 × 0 × 0.0083 = 0
P(X | Yes) = 0 × 1/3 × 1.2 × 10^−9 = 0

Naïve Bayes will not be able to classify X as Yes or No!
Issues with Naïve Bayes Classifier

To avoid zero conditional probabilities, the fraction nc/n can be replaced by a smoothed estimate (the standard Laplace and m-estimate corrections):
Original: P(Xi = c | y) = nc / n
Laplace estimate: P(Xi = c | y) = (nc + 1) / (n + v)
m-estimate: P(Xi = c | y) = (nc + m p) / (n + m)
where
n: number of training instances belonging to class y
nc: number of instances with Xi = c and Y = y
v: total number of attribute values that Xi can take
p: initial estimate of P(Xi = c | y), known a priori
m: hyper-parameter for our confidence in p
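For example, with Tid = 7 deleted the count for P(Marital Status = Divorced | No) is nc = 0 with n = 6 and v = 3 marital-status values; a quick check of the corrected estimates (the choices p = 1/3 and m = 3 are illustrative):

    def laplace(nc, n, v):
        return (nc + 1) / (n + v)

    def m_estimate(nc, n, m, p):
        return (nc + m * p) / (n + m)

    # P(Marital Status = Divorced | No): nc = 0, n = 6, v = 3 attribute values
    print(laplace(0, 6, 3))            # 1/9 instead of 0
    print(m_estimate(0, 6, 3, 1/3))    # 1/9 with m = 3 and prior estimate p = 1/3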
Example of Naïve Bayes Classifier

A: attributes
M: mammals
N: non-mammals
Example of Naïve Bayes Classifier

A: attributes
M: mammals
N: non-mammals

P(A|M)P(M) > P(A|


N)P(N)
=> Mammals
Naïve Bayes (Summary)

Robust to isolated noise points

Handle missing values by ignoring the


instance during probability estimate
calculations

Robust to irrelevant attributes

Redundant and correlated attributes will


violate class conditional assumption

Use other techniques such as Bayesian Belief


Networks (BBN)
Naïve Bayes (Summary)

They are robust to isolated noise points because such points are averaged out when estimating conditional probabilities from data. Naïve Bayes classifiers can also handle missing values by ignoring the example during model building and classification.
They are robust to irrelevant attributes. If Xi
is an irrelevant attribute, then P(Xi|Y )
becomes almost uniformly distributed. The
class conditional probability for Xi has no
impact on the overall computation of the
posterior probability.
Naïve Bayes (Summary)

Correlated attributes can degrade the performance of naïve Bayes classifiers because the conditional independence assumption no longer holds for such attributes.

Use other techniques such as Bayesian Belief


Networks (BBN)
Bayes Classifier

(a) Estimate the conditional probabilities for P(A = 1|


+), P(B = 1|+), P(C = 1|+), P(A = 1|−), P(B = 1|−),
and P(C = 1|−)
Bayes Classifier
Naïve Bayes

How does Naïve Bayes perform on the following dataset?

Conditional independence of attributes is


violated
Bayesian Belief Networks

Provides graphical representation of probabilistic


relationships among a set of random variables
Consists of:
A directed acyclic graph (dag) : encoding the
dependence relationships among a set of
variables.
Node corresponds to a variable
Arc corresponds to dependence relationship between
a pair of variables
For example, a network may contain three variables A, B, and C, in which A and B are independent variables and each has a direct influence on a third variable, C.
A probability table associating each node to its
immediate parent
Conditional Independence
If there is a directed arc from X to Y , then X is
the parent of Y and Y is the child of X.
If there is a directed path in the network from X
to Z, then X is an ancestor of Z, while Z is a
descendant of X.

D is parent of C
A is child of C
B is descendant of
D
D is ancestor of A
Both B and D are
also non-
descendants of A.
Conditional Independence

D is parent of C
A is child of C
B is descendant of
D
D is ancestor of A

An important property of the Bayesian network


Property 1 (Conditional Independence).

A node in a Bayesian network is


conditionally independent of all of its
nondescendants, if its parents are known
Conditional Independence

In the first diagram, A is conditionally independent of both B and D given C, because the nodes for B and D are non-descendants of node A.
The conditional independence assumption made by a naïve Bayes classifier can also be represented using a Bayesian network, as shown in the second diagram, where y is the target class and {X1, X2, . . . , Xd} is the attribute set.
Probability Tables

Besides the conditional independence


conditions imposed by the network topology,
each node is also associated with a
probability table.
If X does not have any parents, table
contains prior probability P(X)
If X has only one parent (Y), table contains
conditional probability P(X|Y)
If X has multiple parents (Y1, Y2,…, Yk),
table contains conditional probability P(X|
Y1, Y2,…, Yk)
Example of Bayesian Belief Network
Bayesian Belief Networks
Each variable in the diagram is assumed to be
binary-valued.
Parent nodes for heart disease (HD) correspond
to risk factors that may affect the disease, such
as exercise (E) and diet (D)
Child nodes for heart disease correspond to
symptoms of the disease, such as chest pain (CP)
and high blood pressure (BP).
The nodes associated with the risk factors contain
only the prior probabilities, whereas the nodes
for heart disease and its corresponding symptoms
contain the conditional probabilities.
To save space, some of the probabilities can be omitted from the diagram. The omitted probabilities can be recovered by noting that P(X = x̄) = 1 − P(X = x) and P(X = x̄|Y ) = 1 − P(X = x|Y ), where x̄ denotes the opposite outcome of x.
For example, the conditional probability
P(Heart Disease = No | Exercise = No, Diet = Healthy) = 1 − P(Heart Disease = Yes | Exercise = No, Diet = Healthy).
Bayesian Belief Networks :
Model Building
Model building in Bayesian networks involves two
steps: (1) creating the structure of the network, and
(2) estimating the probability values in the tables
associated with each node.
The network topology can be obtained by encoding the
subjective knowledge of domain experts.
Example of Inferencing using BBN
Example of Inferencing using BBN

After performing Step 1 (in the previous slide), let us assume that the variables are ordered in the following way: (E, D, HD, Hb, CP, BP). From Steps 2 to 7, starting with variable D, we obtain the following conditional probabilities:
Example of Inferencing using BBN

Based on these conditional probabilities, we can


create arcs between the nodes
(E, HD), (D, HD), (D, Hb), (HD, CP), (Hb, CP), and
(HD, BP).
Once the right topology has been found, the probability table associated with each node is determined. Estimating such probabilities is fairly straightforward and is similar to the approach used by naïve Bayes classifiers.
Example of Inferencing using BBN

Algorithm for generating the topology of a


Bayesian network guarantees a topology that
does not contain any cycles.
The proof for this is quite straightforward. If a
cycle exists, then there must be at least one arc
connecting the lower-ordered nodes to the
higher-ordered nodes, and at least another arc
connecting the higher-ordered nodes to the lower
ordered
nodes.
Since the algorithm prevents any arc from connecting the lower-ordered nodes to the higher-ordered nodes, there cannot be any cycles in the topology.
Example of Bayesian Belief Network
Example of Inferencing using BBN

Given: X = (E = No, D = Yes, CP = Yes, BP = High)
Compute P(HD | E, D, CP, BP).

P(HD = Yes | E = No, D = Yes) = 0.55, P(CP = Yes | HD = Yes) = 0.8, P(BP = High | HD = Yes) = 0.85
P(HD = Yes | E = No, D = Yes, CP = Yes, BP = High) ∝ 0.55 × 0.8 × 0.85 = 0.374

P(HD = No | E = No, D = Yes) = 0.45, P(CP = Yes | HD = No) = 0.01, P(BP = High | HD = No) = 0.2
P(HD = No | E = No, D = Yes, CP = Yes, BP = High) ∝ 0.45 × 0.01 × 0.2 = 0.0009

=> Classify X as Yes.
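Normalizing the two scores above gives the actual posterior probability; a quick check in Python using the slide's numbers:

    yes_score = 0.55 * 0.8 * 0.85     # proportional to P(HD=Yes | E=No, D=Yes, CP=Yes, BP=High)
    no_score = 0.45 * 0.01 * 0.2      # proportional to P(HD=No  | E=No, D=Yes, CP=Yes, BP=High)
    p_yes = yes_score / (yes_score + no_score)
    print(round(yes_score, 4), round(no_score, 4), round(p_yes, 4))   # 0.374, 0.0009, ~0.9976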
Example of Bayesian Belief Network

Suppose we are interested in using the BBN shown above to diagnose whether a person has heart disease. The following cases illustrate how the diagnosis can be made under different scenarios.
Example of Bayesian Belief Network
Example of Bayesian Belief Network
Example of Inferencing using BBN
Logistic Regression
Logistic Regression

The naïve Bayes and the Bayesian network


classifiers provide different ways of estimating the
conditional probability of an instance x given class y,
P(x|y). Such models are known as probabilistic
generative models. Note that the conditional
probability P(x|y) essentially describes the behavior
of instances in the attribute space that are
generated from class y.
For the purpose of making predictions, we are finally
interested in computing the posterior probability P(y|
x).
Logistic Regression

For example, computing the following ratio of posterior probabilities is sufficient for inferring class labels in a binary classification problem:
odds = P(y = 1|x) / P(y = 0|x)
This ratio is known as the odds.


If this ratio is greater than 1, then x is classified as y = 1.
Otherwise, it is assigned to class y = 0.
One may simply learn a model of the odds based on the
attribute values of training instances, without having to
compute P(x|y) as an intermediate quantity in the Bayes
theorem.
Logistic Regression

Classification models that directly assign class labels without


computing class-conditional probabilities are called
discriminative models.
Logistic regression is a probabilistic discriminative model, which directly estimates the odds of a data instance x using its attribute values. The basic idea of logistic regression is to use a linear predictor, z = w^T x + b, for representing the odds of x as follows:
P(y = 1|x) / P(y = 0|x) = e^(w^T x + b)
where w and b are the parameters of the model and w^T denotes the transpose of the vector w.
Note that if w^T x + b > 0, then x belongs to class 1, since its odds are greater than 1. Otherwise, x belongs to class 0.
Logistic Regression

Equivalently, the posterior probability of class 1 can be written as
P(y = 1|x) = σ(z) = 1 / (1 + e^(−z)), with z = w^T x + b,
where the function σ(·) is known as the logistic or sigmoid function.
Logistic Regression

The figure below shows the behavior of the sigmoid function as we vary z. We can see that σ(z) ≥ 0.5 only when z ≥ 0.
We can also derive P(y = 0|x) using σ(z) as follows:
P(y = 0|x) = 1 − P(y = 1|x) = 1 − σ(z) = σ(−z)
Hence, if we have learned suitable values of the parameters w and b, we can use the two equations above to estimate the posterior probabilities of any data instance x and determine its class label.
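A minimal sketch of scoring an instance with learned parameters (the w, b, and x values below are made up for illustration):

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def predict_proba(w, b, x):
        z = w @ x + b              # linear predictor z = w^T x + b
        return sigmoid(z)          # P(y = 1 | x); note P(y = 0 | x) = sigmoid(-z)

    w = np.array([0.8, -0.4]); b = -0.1
    x = np.array([1.0, 2.0])
    p1 = predict_proba(w, b, x)
    print(p1, 1 if p1 >= 0.5 else 0)   # classify as 1 when sigma(z) >= 0.5, i.e., z >= 0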
Logistic Regression as a Generalized Linear Model

Since the posterior probabilities are real-


valued, their estimation using the previous
equations can be viewed as solving a
regression problem.
Logistic Regression as a Generalized Linear Model

Logistic regression belongs to a broader family of statistical regression models, known as generalized linear models (GLM). In these models, the target variable y is considered to be generated from a probability distribution P(y|x), whose mean μ can be estimated using a link function g(·) as follows:
g(μ) = w^T x + b
For binary classification using logistic regression, y follows a Bernoulli distribution (y can either be 0 or 1) and μ is equal to P(y = 1|x). The link function g(·) of logistic regression, called the logit function, can thus be represented as:
g(μ) = ln( μ / (1 − μ) ) = w^T x + b
Even though logistic regression has relationships with regression
models, it is a classification model since the computed posterior
probabilities are eventually used to determine the class label of a
data instance.
Learning Model Parameters

The parameters of logistic regression, (w,


b), are estimated during training using a
statistical approach known as the
maximum likelihood estimation (MLE)
method.
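There is no closed-form solution, so MLE is usually carried out iteratively. A bare-bones sketch of gradient ascent on the log-likelihood with a made-up one-dimensional dataset (real implementations use more robust optimizers):

    import numpy as np

    def fit_logistic(X, y, lr=0.1, n_iter=1000):
        w = np.zeros(X.shape[1]); b = 0.0
        for _ in range(n_iter):
            p = 1.0 / (1.0 + np.exp(-(X @ w + b)))   # current P(y=1|x) for all instances
            grad_w = X.T @ (y - p)                   # gradient of the log-likelihood w.r.t. w
            grad_b = np.sum(y - p)
            w += lr * grad_w / len(y)                # gradient ascent step (maximize likelihood)
            b += lr * grad_b / len(y)
        return w, b

    X = np.array([[0.0], [1.0], [2.0], [3.0]])
    y = np.array([0, 0, 1, 1])
    w, b = fit_logistic(X, y)
    print(w, b)   # positive weight: larger x pushes P(y = 1|x) up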
Characteristics of Logistic Regression

Discriminative model for classification.


The learned parameters of logistic
regression can be analyzed to understand
the relationships between attributes and
class labels.
Can work more robustly even in high-
dimensional settings
Can handle irrelevant attributes
Cannot handle data instances with missing
values
Ensemble Techniques

Ensemble Methods

Ensemble or classifier combination methods improve classification accuracy by aggregating the predictions of multiple classifiers.
An ensemble method constructs a set of
base classifiers from training data and
performs classification by taking a vote on
the predictions made by each base
classifier.
Ensemble methods tend to perform better
than any single classifier

Example: Why Do Ensemble Methods Work?
Necessary Conditions for Ensemble Methods

The figure shows the error rate of an ensemble of 25 binary classifiers (ε_ensemble) for different base classifier error rates (ε). The diagonal line represents the case in which the base classifiers are identical, while the solid line represents the case in which the base classifiers are independent. Observe that the ensemble classifier performs worse than the base classifiers when ε is larger than 0.5.
Classification error for an
ensemble of 25 base
classifiers, assuming their
errors are uncorrelated.
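The solid line follows from a binomial argument: an ensemble of 25 independent base classifiers errs only when at least 13 of them err. A quick check:

    from math import comb

    def ensemble_error(eps, n=25):
        # probability that a majority (13 or more of 25) of independent base classifiers are wrong
        return sum(comb(n, i) * eps**i * (1 - eps)**(n - i) for i in range(n // 2 + 1, n + 1))

    print(ensemble_error(0.35))   # ~0.06, much lower than the base error rate of 0.35
    print(ensemble_error(0.55))   # above 0.55: worse than the base classifiers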

Necessary Conditions for Ensemble Methods

Ensemble Methods work better than a single base


classifier if:
1. All base classifiers are independent of each other

2. All base classifiers perform better than random

guessing (error rate < 0.5 for binary classification)

Classification error for an


ensemble of 25 base
classifiers, assuming their
errors are uncorrelated.

Rationale for Ensemble Learning

In practice, it is difficult to ensure total


independence among the base classifiers.
Improvements in classification accuracies
have been observed in ensemble methods
in which the base classifiers are somewhat
correlated.
Ensemble Methods work best with
unstable base classifiers
Classifiers that are sensitive to minor
perturbations in training set, due to high model
complexity
Examples: Unpruned decision trees, ANNs, …
Methods for Constructing an Ensemble Classifier
Fig. below presents the logical view of the
ensemble method
Basic idea is to construct multiple classifiers
from the original data and then aggregate their
predictions when classifying unknown examples.
The ensemble of classifiers can be constructed in
many ways:

Constructing Ensemble Classifiers

By manipulating training set


Example: bagging, boosting,
By manipulating input features
Example: random forests

By manipulating class labels


Example: error-correcting output coding

By manipulating learning algorithm


Example: injecting randomness in the initial weights of
ANN

Methods for Constructing an Ensemble Classifier
1. By manipulating the training set.
In this approach, multiple training sets are
created by resampling the original data
according to some sampling distribution and
constructing a classifier from each training set.
The sampling distribution determines how likely
it is that an example will be selected for training,
and it may vary from one trial to another.
Bagging and boosting are two examples of
ensemble methods that manipulate their training
sets

Methods for Constructing an Ensemble Classifier
2. By manipulating the input features.
In this approach, a subset of input features is
chosen to form each training set.
The subset can be either chosen randomly or
based on the recommendation of domain
experts.
Some studies have shown that this approach
works very well with data sets that contain
highly redundant features.
Random forest, is an ensemble method that
manipulates its input features and uses decision
trees as its base classifiers.

Methods for Constructing an Ensemble Classifier
3. By manipulating the class labels.
This method can be used when the number of classes is
sufficiently large.
The training data is transformed into a binary class
problem by randomly partitioning the class labels into two
disjoint subsets, A0 and A1.
Training examples whose class label belongs to the subset
A0 are assigned to class 0, while those that belong to the
subset A1 are assigned to class 1. The relabeled examples
are then used to train a base classifier.
By repeating this process multiple times, an ensemble of
base classifiers is obtained.
When a test example is presented, each base classifier Ci
is used to predict its class label.
If the test example is predicted as class 0, then all the
classes that belong to A0 will receive a vote. Conversely, if
it is predicted to be class 1, then all the classes that belong
to A1 will receive a vote.
The votes are tallied and the class that receives the highest number of votes is assigned to the test example.
Methods for Constructing an Ensemble Classifier
4. By manipulating the learning algorithm.
Many learning algorithms can be manipulated in such a way that applying the algorithm several times on the same training data will result in the construction of different classifiers.
For example, an ensemble of decision trees can be constructed by injecting randomness into the tree-growing procedure: instead of choosing the best splitting attribute at each node, we can randomly choose one of the top k attributes for splitting.
Once an ensemble of classifiers has been learned, a test example x is classified by combining the predictions made by the base classifiers.

Bagging (Bootstrap AGGregatING)

It is a technique that repeatedly samples (with


replacement) from a data set according to a
uniform probability distribution.
Each bootstrap sample has the same size as the
original data.
Because the sampling is done with replacement,
some instances may appear several times in the
same training set, while others may be omitted
from the training set.
On average, a bootstrap sample Di contains approximately 63% of the original training data, because each instance has a probability of 1 − (1 − 1/N)^N of being selected in each Di. If N is sufficiently large, this probability converges to 1 − 1/e ≈ 0.632.
Bagging (Bootstrap AGGregatING)

Bootstrap sampling: sampling with


replacement

Build classifier on each bootstrap sample
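A small sketch of both ideas: drawing one bootstrap sample (roughly 63% unique instances) and bagging decision stumps with scikit-learn (assuming scikit-learn is installed; the dataset is synthetic):

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.ensemble import BaggingClassifier
    from sklearn.datasets import make_classification

    rng = np.random.default_rng(0)
    N = 1000
    idx = rng.integers(0, N, size=N)                     # sampling with replacement
    print(len(np.unique(idx)) / N)                       # ~0.63 of the original instances appear

    X, y = make_classification(n_samples=300, n_features=5, random_state=0)
    stump = DecisionTreeClassifier(max_depth=1)          # base classifier: decision stump
    bag = BaggingClassifier(stump, n_estimators=25, random_state=0).fit(X, y)
    print(bag.score(X, y))                               # ensemble accuracy on the training data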

Bagging Algorithm

Bagging Example

Consider 1-dimensional data set:

Classifier is a decision stump (decision tree of size


1)
Decision rule: x ≤ k versus x > k
Split point k is chosen based on entropy
Without bagging, the best decision stump we can produce splits the instances at either x ≤ 0.35 or x ≤ 0.75. Either way, the accuracy of the tree is at most 70%.
Decision stump: test x ≤ k; if true predict yleft, otherwise predict yright.
Bagging Example

Summary of Trained Decision Stumps:
Bagging Example
Use the majority vote (the sign of the sum of predictions) to determine the class predicted by the ensemble classifier.
Bagging can also increase the complexity (representation capacity) of simple classifiers such as decision stumps.
Boosting

Boosting is an iterative procedure used to adaptively


change the distribution of training examples for
learning base classifiers so that they increasingly
focus on examples that are hard to classify.
Unlike bagging, boosting assigns a weight to each
training example and may adaptively change the
weight at the end of each boosting round.
The weights assigned to the training examples can
be used in the following ways:
1. They can be used to inform the sampling
distribution used to draw a set of bootstrap samples
from the original data.
2. They can be used to learn a model that is biased
toward examples with higher weight.
Boosting
An iterative procedure to adaptively change distribution of
training data by focusing more on previously misclassified
records
Initially, all N records are assigned equal weights 1/N, so that they are equally likely to be chosen for training. A sample is drawn according to the
sampling distribution of the training examples to obtain a new
training set. Next, a classifier is built from the training set and
used to classify all the examples in the original data.
The weights of the training examples are updated at the end
of each boosting round. Examples that are classified
incorrectly will have their weights increased, while those that
are classified correctly will have their weights decreased. This
forces the classifier to focus on examples that are difficult to
classify in subsequent iterations.
Unlike bagging, weights may change at the end of each
boosting round
Boosting

Unlike bagging, weights may change at the end of


each boosting round
Records that are wrongly classified will have their
weights increased in the next round
Records that are classified correctly will have their
weights decreased in the next round

Example 4 is hard to classify


Its weight is increased, therefore it is
more likely to be chosen again in
subsequent rounds

AdaBoost

Base classifiers: C1, C2, …, CT.

In the AdaBoost algorithm, the importance of a base classifier Ci depends on its error rate:
εi = (1/N) Σ (j = 1 to N) wj I( Ci(xj) ≠ yj )
where I(p) = 1 if the predicate p is true, and 0 otherwise.
The importance of a classifier Ci is given by the following parameter:
αi = (1/2) ln( (1 − εi) / εi )
AdaBoost Algorithm

Weight update:
wj(i+1) = ( wj(i) / Zi ) × e^(−αi) if Ci(xj) = yj, and ( wj(i) / Zi ) × e^(αi) if Ci(xj) ≠ yj,
where Zi is a normalization factor chosen so that the updated weights sum to 1.
If any intermediate round produces an error rate higher than 50%, the weights are reverted back to 1/N and the resampling procedure is repeated.
Classification:
C*(x) = argmax over y of Σ (i = 1 to T) αi I( Ci(x) = y )
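A small sketch of one boosting round's weight update, following the formulas above (the toy predictions are made up):

    import numpy as np

    def adaboost_round_weights(w, y_true, y_pred):
        # one AdaBoost weight update: misclassified examples get larger weights
        miss = (y_pred != y_true).astype(float)
        eps = np.sum(w * miss)                     # weighted error rate of this round
        alpha = 0.5 * np.log((1 - eps) / eps)      # importance of the classifier
        w_new = w * np.exp(alpha * np.where(miss == 1, 1.0, -1.0))
        return alpha, w_new / w_new.sum()          # normalize so the weights sum to 1

    # toy usage: 10 examples with equal weights, only the 4th is misclassified
    w = np.full(10, 0.1)
    y = np.ones(10); pred = y.copy(); pred[3] = -1
    alpha, w = adaboost_round_weights(w, y, pred)
    print(alpha, w[3], w[0])   # the misclassified example now carries much more weight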
AdaBoost Algorithm
AdaBoost Example

Consider 1-dimensional data set:

Classifier is a decision stump


Decision rule: x ≤ k versus x > k
Split point k is chosen based on entropy

Decision stump: test x ≤ k; if true predict yleft, otherwise predict yright.
Random Forest Algorithm
Construct an ensemble of decision trees by
manipulating training set as well as features

Use bootstrap sample to train every


decision tree (similar to Bagging)
Use the following tree induction algorithm:
At every internal node of decision tree,
randomly sample p attributes for selecting split
criterion
Repeat this procedure until all leaves are pure
(unpruned tree)

Random Forest Algorithm
Given a training set D consisting of n instances and d attributes, the basic procedure
of training a random forest classifier can be summarized using the following steps
1. Construct a bootstrap sample Di of the training set by randomly sampling n instances

(with replacement) from D.


2. Use Di to learn a decision tree Ti as follows. At every internal node of Ti, randomly
sample a set of p attributes and choose an attribute from this subset that shows the
maximum reduction in an impurity measure for splitting. Repeat this procedure till
every leaf is pure, i.e., containing instances from the same class.
Once an ensemble of decision trees has been constructed, their average prediction (majority vote) on a test instance is used as the final prediction of the random forest.
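A brief scikit-learn sketch of this procedure (assuming scikit-learn; the dataset is synthetic and the parameter values are illustrative):

    from sklearn.ensemble import RandomForestClassifier
    from sklearn.datasets import make_classification

    X, y = make_classification(n_samples=300, n_features=10, random_state=0)
    # n_estimators bootstrapped trees; max_features plays the role of p, the attributes sampled
    # at each internal node; max_depth=None lets each tree grow until its leaves are pure
    rf = RandomForestClassifier(n_estimators=100, max_features="sqrt", max_depth=None,
                                random_state=0).fit(X, y)
    print(rf.score(X, y))
    print(rf.predict(X[:2]))   # majority vote of the trees for the first two instances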

Random Forest Algorithm
Decision trees involved in a random forest
are unpruned trees, as they are allowed to
grow to their largest possible size till every
leaf is pure. Hence, the base classifiers of
random forest represent unstable
classifiers that have low bias but high
variance, because of their large size.
Another property of the base classifiers
learned in random forests is the lack of
correlation among their model parameters
and test predictions.

Characteristics of Random Forest

Gradient Boosting
Constructs a series of models
Models can be any predictive model that has
a differentiable loss function
Commonly, trees are the chosen model
XGBoost (extreme gradient boosting) is a popular
package because of its impressive performance
Boosting can be viewed as optimizing the loss
function by iterative functional gradient
descent.
Implementations of various boosted algorithms
are available in Python, R, Matlab, and more.
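A short sketch using scikit-learn's gradient boosted trees (XGBoost exposes a similar fit/predict interface; the data and settings here are illustrative):

    from sklearn.ensemble import GradientBoostingClassifier
    from sklearn.datasets import make_classification

    X, y = make_classification(n_samples=300, n_features=10, random_state=0)
    # a series of shallow trees, each fit to the gradient of the loss of the current ensemble
    gb = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1, max_depth=3,
                                    random_state=0).fit(X, y)
    print(gb.score(X, y))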

