
Class Adv Classification IV

Bayesian classifiers are statistical classifiers that predict class membership probabilities using Bayes Theorem, with Naïve Bayesian classifiers being computationally simple and effective. They calculate explicit probabilities for hypotheses, allowing for incremental learning by combining prior knowledge with observed data. The Naïve Bayes model assumes independence among features given the class label, simplifying computations and making it practical for various applications such as digit recognition and classification tasks.


Bayesian Classification

What are Bayesian Classifiers?


• Statistical classifiers
• Predict class membership probabilities
• Based on Bayes Theorem
• Naïve Bayesian Classifier
  – Computationally simple
  – Comparable performance with decision tree (DT) and neural network (NN) classifiers
Bayesian Classification

• Probabilistic learning: calculates explicit probabilities for hypotheses; among the most practical approaches to certain types of learning problems
• Incremental: each training example can incrementally increase or decrease the probability that a hypothesis is correct; prior knowledge can be combined with observed data
Bayes Theorem
• Let X be a data sample whose class label is unknown
• Let H be the hypothesis that X belongs to a class C
• For classification, determine P(H|X)
• P(H|X) is the probability that H holds given the observed data sample X
• P(H|X) is the posterior probability of H conditioned on X
Bayes Theorem
Example: sample space is all fruits, described by their color and shape
• X is “round” and “red”
• H = the hypothesis that X is an apple
• P(H|X) is our confidence that X is an apple given that X is “round” and “red”
• P(H) is the prior probability of H, i.e. the probability that any given data sample is an apple, regardless of how it looks
• P(H|X) is based on more information
• Note that P(H) is independent of X
Bayes Theorem
Example: sample space is all fruits
• What is P(X|H)?
  – It is the probability that X is round and red, given that we know X is an apple
• Here P(X) is the prior probability = P(a data sample from our set of fruits is red and round)
Estimating Probabilities
• P(X), P(H), and P(X|H) may be estimated from the given data
• Bayes Theorem:

  P(H|X) = P(X|H) P(H) / P(X)

• Bayes Theorem is used in the Naïve Bayesian Classifier!!
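A quick numeric illustration of the theorem (a minimal Python sketch; the fruit probabilities below are made-up values for illustration, not figures from the slides):

  # P(H)   : prior probability that a fruit is an apple (assumed value)
  # P(X|H) : probability a fruit is round and red given it is an apple (assumed value)
  # P(X)   : probability that any fruit in the sample is round and red (assumed value)
  p_h = 0.20
  p_x_given_h = 0.80
  p_x = 0.30

  p_h_given_x = p_x_given_h * p_h / p_x   # Bayes Theorem
  print(f"P(H|X) = {p_h_given_x:.3f}")    # posterior, about 0.533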
Bayesian Classifiers
• Consider each attribute and the class label as random variables
• Given a record with attributes (A1, A2, …, An)
  – Goal is to predict class C
  – Specifically, we want to find the value of C that maximizes P(C | A1, A2, …, An)
• Can we estimate P(C | A1, A2, …, An) directly from data?


Application

• Digit recognition: an image of a digit is fed to a classifier, which outputs a label such as 5
• X1, …, Xn ∈ {0, 1} (black vs. white pixels)
• Y ∈ {5, 6} (predict whether a digit is a 5 or a 6)
The Bayes Classifier
• In class, we saw that a good strategy is to predict the most probable class given the features: argmax over y of P(Y = y | X1, …, Xn)
  – (for example: what is the probability that the image represents a 5, given its pixels?)
• So … how do we compute that?

The Bayes Classifier

• Use Bayes Rule!

  P(Y | X1, …, Xn) = P(X1, …, Xn | Y) P(Y) / P(X1, …, Xn)

  where P(X1, …, Xn | Y) is the likelihood, P(Y) is the prior, and P(X1, …, Xn) is the normalization constant

• Why did this help? Well, we think that we might be able to specify how features are “generated” by the class label
The Bayes Classifier
• Let’s expand this for our digit recognition task:

  P(Y = 5 | X1, …, Xn) ∝ P(X1, …, Xn | Y = 5) P(Y = 5)
  P(Y = 6 | X1, …, Xn) ∝ P(X1, …, Xn | Y = 6) P(Y = 6)

• To classify, we’ll simply compute these two probabilities and predict based on which one is greater
Model Parameters

• For the Bayes classifier, we need to “learn” two functions, the likelihood and the prior

• How many parameters are required to specify the prior for our digit recognition example?
  – Just one: P(Y = 5), since P(Y = 6) = 1 − P(Y = 5)
Model Parameters

• How many parameters are required to specify the likelihood?
  – (Supposing that each image is 30×30 = 900 pixels)
  – 2(2^900 − 1): one joint distribution over 900 binary pixels for each of the two classes
Model Parameters

• The problem with explicitly modeling P(X1, …, Xn | Y) is that there are usually way too many parameters:
  – We’ll run out of space
  – We’ll run out of time
  – And we’ll need tons of training data (which is usually not available)
The Naïve Bayes Model

• The Naïve Bayes Assumption: assume that all features are independent given the class label Y
• Equationally speaking:

  P(X1, …, Xn | Y) = ∏i P(Xi | Y)
Why is this useful?

• # of parameters for modeling P(X1, …, Xn | Y): 2(2^n − 1)
• # of parameters for modeling P(X1 | Y), …, P(Xn | Y): 2n
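A quick way to see the size of this gap for the 30×30-pixel example (a minimal Python sketch, exact integer arithmetic only):

  # Parameter counts for binary features and two classes (n = 900 pixels).
  n = 900

  joint_params = 2 * (2**n - 1)   # full joint model P(X1,...,Xn | Y)
  naive_params = 2 * n            # Naive Bayes model P(X1|Y), ..., P(Xn|Y)

  print(f"Full joint model : ~10^{len(str(joint_params)) - 1} parameters")
  print(f"Naive Bayes model: {naive_params} parameters")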
Naïve Bayesian Classification
• Also called Simple Bayesian Classification
• Why naïve/simple?
  – Class conditional independence: the effect of an attribute value on a given class is independent of the values of the other attributes
• This assumption simplifies computations
Bayesian Classifiers
• Approach:
  – Compute the posterior probability P(C | A1, A2, …, An) for all values of C using Bayes theorem:

    P(C | A1, A2, …, An) = P(A1, A2, …, An | C) P(C) / P(A1, A2, …, An)

  – Choose the value of C that maximizes P(C | A1, A2, …, An)
  – Equivalent to choosing the value of C that maximizes P(A1, A2, …, An | C) P(C), since the denominator is the same for every class
• How to estimate P(A1, A2, …, An | C)?


Naïve Bayes Classifier

• Assume independence among the attributes Ai when the class is given:
  – P(A1, A2, …, An | Cj) = P(A1 | Cj) P(A2 | Cj) … P(An | Cj)
  – Can estimate P(Ai | Cj) for all Ai and Cj
  – A new point is classified to Cj if P(Cj) ∏i P(Ai | Cj) is maximal


Example
How to Estimate Probabilities from Data?

Training data (Tid, Refund, Marital Status, Taxable Income, Evade):

  1   Yes  Single    125K  No
  2   No   Married   100K  No
  3   No   Single    70K   No
  4   Yes  Married   120K  No
  5   No   Divorced  95K   Yes
  6   No   Married   60K   No
  7   Yes  Divorced  220K  No
  8   No   Single    85K   Yes
  9   No   Married   75K   No
  10  No   Single    90K   Yes

• Class prior: P(C) = Nc / N
  – e.g., P(No) = 7/10, P(Yes) = 3/10

• For discrete attributes: P(Ai | Ck) = |Aik| / Nc
  – where |Aik| is the number of instances having attribute value Ai and belonging to class Ck, and Nc is the number of instances of class Ck
  – Examples:
      P(Status=Married | No) = 4/7
      P(Refund=Yes | Yes) = 0
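A minimal sketch (assuming pandas is available; the table above is re-entered by hand) of how these counts become probabilities:

  import pandas as pd

  # Training data from the slide (Refund, Marital Status, Taxable Income, Evade).
  df = pd.DataFrame({
      "Refund":  ["Yes","No","No","Yes","No","No","Yes","No","No","No"],
      "Marital": ["Single","Married","Single","Married","Divorced",
                  "Married","Divorced","Single","Married","Single"],
      "Income":  [125, 100, 70, 120, 95, 60, 220, 85, 75, 90],
      "Evade":   ["No","No","No","No","Yes","No","No","Yes","No","Yes"],
  })

  # Class priors P(C) = Nc / N
  priors = df["Evade"].value_counts(normalize=True)
  print(priors)                      # No: 0.7, Yes: 0.3

  # Conditional probability for a discrete attribute, P(Ai | Ck) = |Aik| / Nc
  no_rows = df[df["Evade"] == "No"]
  p_married_given_no = (no_rows["Marital"] == "Married").mean()
  print(p_married_given_no)          # 4/7 ≈ 0.571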
How to Estimate Probabilities from Data?
• For continuous attributes:
  – Discretize the range into bins
  – Two-way split: (A < v) or (A > v)
    • choose only one of the two splits as a new attribute
  – Probability density estimation:
    • Assume the attribute follows a normal distribution
    • Use the data to estimate the parameters of the distribution (e.g., mean and standard deviation)
    • Once the probability distribution is known, it can be used to estimate the conditional probability P(Ai | c)
How to Estimate Probabilities from Data?
• Normal distribution:

  P(Ai | cj) = (1 / (sqrt(2π) σij)) · exp( −(Ai − μij)² / (2 σij²) )

  – One pair of parameters (μij, σij) for each (Ai, cj) pair

• For (Income, Class=No), using the training data table above:
  – sample mean = 110
  – sample variance = 2975

  P(Income=120 | No) = (1 / (sqrt(2π) · 54.54)) · exp( −(120 − 110)² / (2 · 2975) ) = 0.0072
Example of Naïve Bayes Classifier
Given a test record:

  X = (Refund = No, Marital Status = Married, Income = 120K)

Naïve Bayes probabilities estimated from the training data:

  P(Refund=Yes | No) = 3/7                 P(Refund=Yes | Yes) = 0
  P(Refund=No | No) = 4/7                  P(Refund=No | Yes) = 1
  P(Marital Status=Single | No) = 2/7      P(Marital Status=Single | Yes) = 2/7
  P(Marital Status=Divorced | No) = 1/7    P(Marital Status=Divorced | Yes) = 1/7
  P(Marital Status=Married | No) = 4/7     P(Marital Status=Married | Yes) = 0

  For Taxable Income:
    If Class=No:  sample mean = 110, sample variance = 2975
    If Class=Yes: sample mean = 90,  sample variance = 25

Classification:

  P(X | Class=No)  = P(Refund=No | No) × P(Married | No) × P(Income=120K | No)
                   = 4/7 × 4/7 × 0.0072 = 0.0024

  P(X | Class=Yes) = P(Refund=No | Yes) × P(Married | Yes) × P(Income=120K | Yes)
                   = 1 × 0 × 1.2×10⁻⁹ = 0

  Since P(X | No) P(No) > P(X | Yes) P(Yes), we have P(No | X) > P(Yes | X)
  => Class = No
Naïve Bayes Classifier
• If one of the conditional probabilities is zero, then the entire expression becomes zero
• Probability estimation:

  Original:    P(Ai | C) = Nic / Nc
  Laplace:     P(Ai | C) = (Nic + 1) / (Nc + c)
  m-estimate:  P(Ai | C) = (Nic + m·p) / (Nc + m)

  where c is the number of classes, p is the prior probability, and m is a parameter
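A minimal Python sketch of the three estimators (the example counts come from the Refund attribute above; function names are illustrative):

  # n_ic : count of training instances in class C with attribute value Ai
  # n_c  : count of training instances in class C
  # c    : number of classes, p : prior probability, m : user-chosen parameter

  def original(n_ic, n_c):
      return n_ic / n_c

  def laplace(n_ic, n_c, c):
      return (n_ic + 1) / (n_c + c)

  def m_estimate(n_ic, n_c, m, p):
      return (n_ic + m * p) / (n_c + m)

  # P(Refund=Yes | Evade=Yes) was 0/3 with the original estimate,
  # but stays non-zero with smoothing.
  print(original(0, 3))            # 0.0
  print(laplace(0, 3, 2))          # 0.2
  print(m_estimate(0, 3, 3, 0.5))  # 0.25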
Example of Naïve Bayes Classifier
Training data (Name, Give Birth, Can Fly, Live in Water, Have Legs, Class):

  human          yes  no   no         yes  mammals
  python         no   no   no         no   non-mammals
  salmon         no   no   yes        no   non-mammals
  whale          yes  no   yes        no   mammals
  frog           no   no   sometimes  yes  non-mammals
  komodo         no   no   no         yes  non-mammals
  bat            yes  yes  no         yes  mammals
  pigeon         no   yes  no         yes  non-mammals
  cat            yes  no   no         yes  mammals
  leopard shark  yes  no   yes        no   non-mammals
  turtle         no   no   sometimes  yes  non-mammals
  penguin        no   no   sometimes  yes  non-mammals
  porcupine      yes  no   no         yes  mammals
  eel            no   no   yes        no   non-mammals
  salamander     no   no   sometimes  yes  non-mammals
  gila monster   no   no   no         yes  non-mammals
  platypus       no   no   no         yes  mammals
  owl            no   yes  no         yes  non-mammals
  dolphin        yes  no   yes        no   mammals
  eagle          no   yes  no         yes  non-mammals

Test record (A): Give Birth = yes, Can Fly = no, Live in Water = yes, Have Legs = no, Class = ?

Let A denote the attributes, M = mammals, N = non-mammals:

  P(A | M) = 6/7 × 6/7 × 2/7 × 2/7 = 0.06
  P(A | N) = 1/13 × 10/13 × 3/13 × 4/13 = 0.0042

  P(A | M) P(M) = 0.06 × 7/20 = 0.021
  P(A | N) P(N) = 0.0042 × 13/20 = 0.0027

  Since P(A | M) P(M) > P(A | N) P(N) => Mammals
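A quick arithmetic check of the products above (plain Python; counts copied from the table):

  # Test record A = (Give Birth=yes, Can Fly=no, Live in Water=yes, Have Legs=no)
  # Counts from the table: 7 mammals, 13 non-mammals out of 20 animals.
  p_a_given_m = (6/7) * (6/7) * (2/7) * (2/7)       # ≈ 0.06
  p_a_given_n = (1/13) * (10/13) * (3/13) * (4/13)  # ≈ 0.0042

  p_m, p_n = 7/20, 13/20                            # class priors

  print(p_a_given_m * p_m)   # ≈ 0.021  -> mammals wins
  print(p_a_given_n * p_n)   # ≈ 0.0027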
Naïve Bayesian Classification
Example
  Age     Income  Student  Credit_rating  Class: Buys_comp
  <=30    HIGH    N        FAIR           N
  <=30    HIGH    N        EXCELLENT      N
  31…40   HIGH    N        FAIR           Y
  >40     MEDIUM  N        FAIR           Y
  >40     LOW     Y        FAIR           Y
  >40     LOW     Y        EXCELLENT      N
  31…40   LOW     Y        EXCELLENT      Y
  <=30    MEDIUM  N        FAIR           N
  <=30    LOW     Y        FAIR           Y
  >40     MEDIUM  Y        FAIR           Y
  <=30    MEDIUM  Y        EXCELLENT      Y
  31…40   MEDIUM  N        EXCELLENT      Y
  31…40   HIGH    Y        FAIR           Y
  >40     MEDIUM  N        EXCELLENT      N
Naïve Bayesian Classification
Example
A = (Age <= 30, Income = MEDIUM, Student = Y, Credit_rating = FAIR, Class = ???)
We need to maximize P(A | Cj) P(Cj) for j = 1, 2.
P(Cj) is computed from the training sample:
  P(buys_comp=Y) = 9/14 = 0.643
  P(buys_comp=N) = 5/14 = 0.357
How do we calculate P(A | Cj) P(Cj) for j = 1, 2?
  P(A | Cj) = P(A1, A2, A3, A4 | Cj) = ∏k P(Ak | Cj)
Naïve Bayesian Classification
Example
  P(age<=30 | buys_comp=Y) = 2/9 = 0.222
  P(age<=30 | buys_comp=N) = 3/5 = 0.600
  P(income=MEDIUM | buys_comp=Y) = 4/9 = 0.444
  P(income=MEDIUM | buys_comp=N) = 2/5 = 0.400
  P(student=Y | buys_comp=Y) = 6/9 = 0.667
  P(student=Y | buys_comp=N) = 1/5 = 0.200
  P(credit_rating=FAIR | buys_comp=Y) = 6/9 = 0.667
  P(credit_rating=FAIR | buys_comp=N) = 2/5 = 0.400
Naïve Bayesian Classification
Example
  P(A | buys_comp=Y) = 0.222 × 0.444 × 0.667 × 0.667 = 0.044
  P(A | buys_comp=N) = 0.600 × 0.400 × 0.200 × 0.400 = 0.019

  P(A | buys_comp=Y) P(buys_comp=Y) = 0.044 × 0.643 = 0.028
  P(A | buys_comp=N) P(buys_comp=N) = 0.019 × 0.357 = 0.007

CONCLUSION: A buys a computer
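A minimal end-to-end sketch reproducing this example in plain Python (the data is re-entered by hand; the helper function name is illustrative):

  # The buys_computer training data from the slides:
  # (age, income, student, credit_rating, buys_comp)
  data = [
      ("<=30", "HIGH",   "N", "FAIR",      "N"),
      ("<=30", "HIGH",   "N", "EXCELLENT", "N"),
      ("31-40","HIGH",   "N", "FAIR",      "Y"),
      (">40",  "MEDIUM", "N", "FAIR",      "Y"),
      (">40",  "LOW",    "Y", "FAIR",      "Y"),
      (">40",  "LOW",    "Y", "EXCELLENT", "N"),
      ("31-40","LOW",    "Y", "EXCELLENT", "Y"),
      ("<=30", "MEDIUM", "N", "FAIR",      "N"),
      ("<=30", "LOW",    "Y", "FAIR",      "Y"),
      (">40",  "MEDIUM", "Y", "FAIR",      "Y"),
      ("<=30", "MEDIUM", "Y", "EXCELLENT", "Y"),
      ("31-40","MEDIUM", "N", "EXCELLENT", "Y"),
      ("31-40","HIGH",   "Y", "FAIR",      "Y"),
      (">40",  "MEDIUM", "N", "EXCELLENT", "N"),
  ]

  def naive_bayes_score(test, cls):
      """P(A | cls) * P(cls) for a test tuple of attribute values."""
      rows = [r for r in data if r[-1] == cls]
      score = len(rows) / len(data)                  # prior P(cls)
      for i, value in enumerate(test):
          matches = sum(1 for r in rows if r[i] == value)
          score *= matches / len(rows)               # P(Ai = value | cls)
      return score

  test = ("<=30", "MEDIUM", "Y", "FAIR")
  scores = {cls: naive_bayes_score(test, cls) for cls in ("Y", "N")}
  print(scores)                       # {'Y': ~0.028, 'N': ~0.007}
  print(max(scores, key=scores.get))  # 'Y' -> buys a computer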

Problem to Solve

(a) Estimate the conditional probabilities P(Red|Yes), P(SUV|Yes), P(Domestic|Yes), P(Red|No), P(SUV|No), and P(Domestic|No) using the m-estimate approach, with p = 1/2 and m = 3.
(b) Predict the class label for a test sample (Red, Domestic, SUV) using the Naïve Bayes approach.
Probability of Error (Bayes Error Rate)
Naïve Bayes (Summary)
• Robust to isolated noise points
• Handles missing values by ignoring the instance during probability estimate calculations
• Robust to irrelevant attributes
• The independence assumption may not hold for some attributes
  – Use other techniques such as Bayesian Belief Networks (BBN)
Ensemble Classifiers

• Introduction & Motivation
• Methods to create an Ensemble
• Bias-Variance Decomposition
• Construction of Ensemble Classifiers
  – Bagging
  – Boosting (AdaBoost)
  – Random Forests
Introduction & Motivation
• Suppose that you are a patient with a set of symptoms
• Instead of taking the opinion of just one doctor (classifier), you decide to take the opinion of a few doctors!
• Is this a good idea? Indeed it is.
• Consult many doctors and then, based on their diagnoses, you can get a fairly accurate idea of the diagnosis.
• Majority voting: ‘bagging’
• More weight to the opinion of some ‘good’ (accurate) doctors: ‘boosting’
• In bagging, you give equal weight to all classifiers, whereas in boosting you give weight according to the accuracy of the classifier.
Ensemble Methods

• Construct a set of classifiers from the training data
• Predict the class label of previously unseen records by aggregating the predictions made by multiple classifiers
General Idea
• Step 1: Create multiple data sets D1, D2, …, Dt from the original training data D
• Step 2: Build multiple classifiers C1, C2, …, Ct (one from each data set)
• Step 3: Combine the classifiers into a single ensemble classifier C*

Figure taken from Tan et al., “Introduction to Data Mining”
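As a concrete illustration of this three-step pipeline, a minimal scikit-learn sketch of bagging over decision trees on synthetic data (the dataset and parameter choices are illustrative, not from the slides):

  from sklearn.datasets import make_classification
  from sklearn.ensemble import BaggingClassifier
  from sklearn.model_selection import train_test_split

  # Synthetic data standing in for the original training set D.
  X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
  X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

  # Steps 1-3: bootstrap-sample t data sets, fit one base classifier (a decision
  # tree by default) on each, and combine them by majority vote.
  ensemble = BaggingClassifier(n_estimators=25, random_state=0)
  ensemble.fit(X_train, y_train)

  print("Ensemble accuracy:", ensemble.score(X_test, y_test))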
Rationale for Ensemble Methods

• Statistical reasons
  – A set of classifiers with similar training performances may have different generalization performances.
  – Combining the outputs of several classifiers reduces the risk of selecting a poorly performing classifier.
• Large volumes of data
  – If the amount of data to be analyzed is too large, a single classifier may not be able to handle it; train different classifiers on different partitions of the data.
• Too little data
  – Ensemble systems can also be used when there is too little data, by using resampling techniques.
Rationale for Ensemble Methods
Why does it work?
• Suppose there are 25 base classifiers
  – Each classifier has error rate ε = 0.35
  – Assume the classifiers are independent
  – The ensemble makes a wrong prediction only if a majority (13 or more) of the base classifiers are wrong:

    Σ_{i=13}^{25} C(25, i) ε^i (1 − ε)^(25−i) = 0.06

  – Check for yourself whether this is correct! (see the sketch below)

Example taken from Tan et al., “Introduction to Data Mining”
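The sketch referenced above, verifying the 0.06 figure in plain Python:

  from math import comb

  # Probability that a majority (>= 13) of 25 independent base classifiers,
  # each with error rate 0.35, are wrong at the same time.
  eps, t = 0.35, 25
  p_ensemble_error = sum(
      comb(t, i) * eps**i * (1 - eps)**(t - i) for i in range(13, t + 1)
  )
  print(f"Ensemble error rate: {p_ensemble_error:.3f}")  # ~0.06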
Ensemble Classifiers (EC)
• An ensemble classifier constructs a set of ‘base classifiers’ from the training data
• Methods for constructing an EC:
  – Manipulating the training set
  – Manipulating the input features
  – Manipulating the class labels
  – Manipulating the learning algorithm
Ensemble Classifiers (EC)
• Manipulating the training set
  – Multiple training sets are created by resampling the data according to some sampling distribution
  – The sampling distribution determines how likely it is that an example will be selected for training, and may vary from one trial to another
  – A classifier is built from each training set using a particular learning algorithm
  – Examples: Bagging and Boosting
Ensemble Classifiers (EC)

• Manipulating the input features
  – A subset of the input features is chosen to form each training set
  – The subset can be chosen randomly or based on inputs given by domain experts
  – Good for data that has redundant features
  – Random Forest is an example, which uses decision trees as its base classifiers (see the sketch below)
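The sketch referenced above: a minimal scikit-learn Random Forest example on synthetic data with redundant features (the dataset and parameters are illustrative):

  from sklearn.datasets import make_classification
  from sklearn.ensemble import RandomForestClassifier

  # Synthetic data with some redundant features, standing in for a real dataset.
  X, y = make_classification(n_samples=1000, n_features=20,
                             n_informative=8, n_redundant=6, random_state=0)

  # Each tree in the forest considers a random subset of features at every split
  # (max_features="sqrt"), which is one way of manipulating the input features.
  forest = RandomForestClassifier(n_estimators=100, max_features="sqrt",
                                  random_state=0)
  forest.fit(X, y)
  print("Training accuracy:", forest.score(X, y))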
Ensemble Classifiers (EC)

• Manipulating the class labels
  – Useful when the number of classes is sufficiently large
  – The training data is transformed into a binary class problem by randomly partitioning the class labels into 2 disjoint subsets, A0 and A1
  – The re-labelled examples are used to train a base classifier
  – By repeating the class-relabelling and model-building steps several times, an ensemble of base classifiers is obtained
  – How is a new tuple classified?
  – Example: error-correcting output coding (pp. 307)
Ensemble Classifiers (EC)

• Manipulating the learning algorithm
  – Learning algorithms can be manipulated in such a way that applying the algorithm several times on the same training data may result in different models
  – Example: an ANN can produce different models by changing the network topology or the initial weights of the links between neurons
  – Example: an ensemble of decision trees can be constructed by introducing randomness into the tree-growing procedure; instead of choosing the best split attribute at each node, we randomly choose one of the top k attributes
Ensemble Classifiers (EC)

• The first 3 approaches are generic: they can be applied to any classifier
• The fourth approach depends on the type of classifier used
• Base classifiers can be generated sequentially or in parallel
Typical Ensemble Procedure
