
BAYESIAN LEARNING

Bayesian Classifiers

Bayesian classifiers are statistical classifiers, and are based on Bayes theorem

They can calculate the probability that a given sample belongs to a particular class

1
BAYESIAN LEARNING

Bayesian learning algorithms are among the most practical approaches to certain types of learning problems

Their results are comparable in many cases to the performance of other classifiers, such as decision trees and neural networks

2
BAYESIAN LEARNING

Bayes Theorem

Let X be a data sample, e.g. a red and round fruit

Let H be some hypothesis, such as that X belongs to a specified class C (e.g. X is an apple)

For classification problems, we want to determine P(H|X), the probability that the hypothesis H holds given the observed data sample X

3
BAYESIAN LEARNING

Prior & Posterior Probability

The probability P(H) is called the prior probability of H, i.e. the probability that any given data sample is an apple, regardless of how the data sample looks

The probability P(H|X) is called the posterior probability. It is based on more information than the prior probability P(H), which is independent of X

4
BAYESIAN LEARNING

Bayes Theorem

It provides a way of calculating the posterior probability:

P(H|X) = P(X|H) P(H) / P(X)

P(X|H) is the probability of X given H (the probability that X is red and round given that X is an apple)

P(X) is the prior probability of X (the probability that a data sample is red and round)

5
BAYESIAN LEARNING

Bayes Theorem: Proof

The posterior probability of the fruit being an apple, given that its shape is round and its colour is red, is
P(H|X) = |H ∩ X| / |X|
i.e. the number of apples which are red and round, divided by the total number of red and round fruits

Since P(H ∩ X) = |H ∩ X| / |total fruits of all sizes and shapes|
and P(X) = |X| / |total fruits of all sizes and shapes|

Hence P(H|X) = P(H ∩ X) / P(X)

6
BAYESIAN LEARNING

Bayes Theorem: Proof

Similarly P(X|H) = P(H ∩ X) / P(H)

Since we have P(H ∩ X) = P(H|X) P(X)
and also P(H ∩ X) = P(X|H) P(H)

Therefore P(H|X) P(X) = P(X|H) P(H)

And hence P(H|X) = P(X|H) P(H) / P(X)

7
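To make the theorem concrete, here is a minimal sketch with made-up numbers for the apple example (none of these values come from the slides):

```python
# Hypothetical numbers for the apple example; not taken from the slides.
p_h = 0.30          # P(H): prior probability that a fruit is an apple
p_x_given_h = 0.70  # P(X|H): probability a fruit is red and round, given it is an apple
p_x = 0.35          # P(X): probability that any fruit is red and round

# Bayes theorem: P(H|X) = P(X|H) P(H) / P(X)
p_h_given_x = p_x_given_h * p_h / p_x
print(f"P(H|X) = {p_h_given_x:.2f}")  # P(H|X) = 0.60
```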
BAYESIAN LEARNING

Naïve (Simple) Bayesian Classification

Studies comparing classification algorithms have found that the simple Bayesian classifier is comparable in performance with decision tree and neural network classifiers

It works as follows:

1. Each data sample is represented by an n-dimensional feature vector, X = (x1, x2, …, xn), depicting n measurements made on the sample from n attributes, respectively A1, A2, …, An

8
BAYESIAN LEARNING

Naïve (Simple) Bayesian Classification

2. Suppose that there are m classes C1, C2, …, Cm. Given an unknown data sample X (i.e. one having no class label), the classifier will predict that X belongs to the class having the highest posterior probability given X

Thus if P(Ci|X) > P(Cj|X) for 1 ≤ j ≤ m, j ≠ i
then X is assigned to Ci

This is called the Bayes decision rule

9
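As a one-line illustration of the decision rule, with hypothetical posteriors:

```python
# Hypothetical posterior probabilities P(Ci|X) for three classes.
posteriors = {"C1": 0.20, "C2": 0.55, "C3": 0.25}

# Bayes decision rule: assign X to the class with the highest posterior.
predicted_class = max(posteriors, key=posteriors.get)
print(predicted_class)  # C2
```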
BAYESIAN LEARNING

Naïve (Simple) Bayesian Classification

3. We have P(Ci|X) = P(X|Ci) P(Ci) / P(X)

As P(X) is constant for all classes, only P(X|Ci) P(Ci) needs to be calculated

The class prior probabilities may be estimated by
P(Ci) = si / s
where si is the number of training samples of class Ci
and s is the total number of training samples

If the class prior probabilities are equal (or not known and thus assumed to be equal), then we need to calculate only P(X|Ci)
10
BAYESIAN LEARNING

Naïve (Simple) Bayesian Classification

4. Given data sets with many attributes, it would be extremely computationally expensive to compute P(X|Ci)

For example, assuming the attributes colour and shape to be Boolean, we need to store 4 probabilities for the category apple:
P(¬red ∧ ¬round | apple)
P(¬red ∧ round | apple)
P(red ∧ ¬round | apple)
P(red ∧ round | apple)

If there are 6 attributes and they are Boolean, then we need to store 2^6 = 64 probabilities
11
BAYESIAN LEARNING

Naïve (Simple) Bayesian Classification

In order to reduce computation, the naïve assumption of class conditional independence is made

This presumes that the values of the attributes are conditionally independent of one another, given the class label of the sample (we assume that there are no dependence relationships among the attributes)

12
BAYESIAN LEARNING

Naïve (Simple) Bayesian Classification

Thus we assume that P(X|Ci) = ∏(k=1..n) P(xk|Ci) = P(x1|Ci) × P(x2|Ci) × … × P(xn|Ci)

Example:
P(colour ∧ shape | apple) = P(colour | apple) P(shape | apple)

For 6 Boolean attributes, we would have only 12 probabilities to store instead of 2^6 = 64
Similarly, for 6 three-valued attributes, we would have 18 probabilities to store instead of 3^6 = 729

13
BAYESIAN LEARNING
Naïve (Simple) Bayesian Classification

The probabilities P(x1|Ci), P(x2|Ci), …, P(xn|Ci) can be estimated from the training samples as follows

For an attribute Ak, which can take on the values x1k, x2k, … (e.g. colour = red, green, …):

P(xk|Ci) = sik / si

where sik is the number of training samples of class Ci having the value xk for Ak
and si is the number of training samples belonging to Ci

e.g. P(red|apple) = 7/10 if 7 out of 10 apples are red


14
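Putting steps 1–4 together, here is a minimal sketch that estimates P(Ci) = si/s and P(xk|Ci) = sik/si by counting, then multiplies them for prediction; the toy fruit data and attribute names are placeholders, not taken from the slides:

```python
from collections import Counter, defaultdict

# Toy training data: (attribute-value dict, class label). Placeholder values.
samples = [
    ({"colour": "red", "shape": "round"}, "apple"),
    ({"colour": "red", "shape": "round"}, "apple"),
    ({"colour": "green", "shape": "round"}, "apple"),
    ({"colour": "yellow", "shape": "long"}, "banana"),
    ({"colour": "yellow", "shape": "long"}, "banana"),
]

# P(Ci) = si / s
class_counts = Counter(label for _, label in samples)
s = len(samples)
priors = {c: n / s for c, n in class_counts.items()}

# sik: counts of each attribute value within each class, for P(xk|Ci) = sik / si
value_counts = defaultdict(Counter)
for x, label in samples:
    for attr, value in x.items():
        value_counts[(label, attr)][value] += 1

def predict(x):
    """Return the class maximising P(Ci) * product over k of P(xk|Ci)."""
    scores = {}
    for c in priors:
        score = priors[c]
        for attr, value in x.items():
            score *= value_counts[(c, attr)][value] / class_counts[c]
        scores[c] = score
    return max(scores, key=scores.get)

print(predict({"colour": "red", "shape": "round"}))  # apple
```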
BAYESIAN LEARNING

Naïve (Simple) Bayesian Classification

Example:

15
Play-tennis example: estimating P(xi|C)

Outlook    Temperature  Humidity  Windy  Class
sunny      hot          high      false  N
sunny      hot          high      true   N
overcast   hot          high      false  P
rain       mild         high      false  P
rain       cool         normal    false  P
rain       cool         normal    true   N
overcast   cool         normal    true   P
sunny      mild         high      false  N
sunny      cool         normal    false  P
rain       mild         normal    false  P
sunny      mild         normal    true   P
overcast   mild         high      true   P
overcast   hot          normal    false  P
rain       mild         high      true   N

P(p) = 9/14        P(n) = 5/14

outlook:     P(sunny|p) = 2/9      P(sunny|n) = 3/5
             P(overcast|p) = 4/9   P(overcast|n) = 0
             P(rain|p) = 3/9       P(rain|n) = 2/5
temperature: P(hot|p) = 2/9        P(hot|n) = 2/5
             P(mild|p) = 4/9       P(mild|n) = 2/5
             P(cool|p) = 3/9       P(cool|n) = 1/5
humidity:    P(high|p) = 3/9       P(high|n) = 4/5
             P(normal|p) = 6/9     P(normal|n) = 1/5
windy:       P(true|p) = 3/9       P(true|n) = 3/5
             P(false|p) = 6/9      P(false|n) = 2/5
Naive Bayesian Classifier (II)

Given a training set, we can compute the probabilities:

Outlook      P    N
sunny        2/9  3/5
overcast     4/9  0
rain         3/9  2/5

Temperature  P    N
hot          2/9  2/5
mild         4/9  2/5
cool         3/9  1/5

Humidity     P    N
high         3/9  4/5
normal       6/9  1/5

Windy        P    N
true         3/9  3/5
false        6/9  2/5
Play-tennis example: classifying X

An unseen sample X = <rain, hot, high, false>

P(X|p)·P(p) = P(rain|p)·P(hot|p)·P(high|p)·P(false|p)·P(p)
            = 3/9 · 2/9 · 3/9 · 6/9 · 9/14 = 0.010582

P(X|n)·P(n) = P(rain|n)·P(hot|n)·P(high|n)·P(false|n)·P(n)
            = 2/5 · 2/5 · 4/5 · 2/5 · 5/14 = 0.018286

Sample X is classified in class n (don't play)
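The same arithmetic can be checked in a few lines (exact fractions to avoid rounding):

```python
from fractions import Fraction as F

# P(X|p)·P(p) and P(X|n)·P(n) for X = <rain, hot, high, false>
score_p = F(3, 9) * F(2, 9) * F(3, 9) * F(6, 9) * F(9, 14)
score_n = F(2, 5) * F(2, 5) * F(4, 5) * F(2, 5) * F(5, 14)

print(float(score_p))                     # ≈ 0.010582
print(float(score_n))                     # ≈ 0.018286
print("n" if score_n > score_p else "p")  # n (don't play)
```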


Naïve Bayesian Classifier: Example 2

Training dataset

Classes:
C1: buys_computer = 'yes'
C2: buys_computer = 'no'

Data sample to classify:
X = (age <= 30, income = medium, student = yes, credit_rating = fair)

age     income  student  credit_rating  buys_computer
<=30    high    no       fair           no
<=30    high    no       excellent      no
31…40   high    no       fair           yes
>40     medium  no       fair           yes
>40     low     yes      fair           yes
>40     low     yes      excellent      no
31…40   low     yes      excellent      yes
<=30    medium  no       fair           no
<=30    low     yes      fair           yes
>40     medium  yes      fair           yes
<=30    medium  yes      excellent      yes
31…40   medium  no       excellent      yes
31…40   high    yes      fair           yes
>40     medium  no       excellent      no
BAYESIAN LEARNING

Naïve (Simple) Bayesian Classification

Example:

Let C1 = the class buys_computer = yes and C2 = the class buys_computer = no

The unknown sample:
X = (age <= 30, income = medium, student = yes, credit_rating = fair)

The prior probability of each class can be computed as
P(buys_computer = yes) = 9/14 = 0.643
P(buys_computer = no) = 5/14 = 0.357
21
BAYESIAN LEARNING

Naïve (Simple) Bayesian Classification

Example:
To compute P(X|Ci) we compute the following conditional probabilities from the training data:

P(age <= 30 | buys_computer = yes) = 2/9 = 0.222
P(age <= 30 | buys_computer = no) = 3/5 = 0.600
P(income = medium | buys_computer = yes) = 4/9 = 0.444
P(income = medium | buys_computer = no) = 2/5 = 0.400
P(student = yes | buys_computer = yes) = 6/9 = 0.667
P(student = yes | buys_computer = no) = 1/5 = 0.200
P(credit_rating = fair | buys_computer = yes) = 6/9 = 0.667
P(credit_rating = fair | buys_computer = no) = 2/5 = 0.400

22
BAYESIAN LEARNING

Naïve (Simple) Bayesian Classification

Example:
Using the above probabilities we obtain

P(X | buys_computer = yes) = 0.222 × 0.444 × 0.667 × 0.667 = 0.044
P(X | buys_computer = no) = 0.600 × 0.400 × 0.200 × 0.400 = 0.019

P(X | buys_computer = yes) P(buys_computer = yes) = 0.044 × 0.643 = 0.028
P(X | buys_computer = no) P(buys_computer = no) = 0.019 × 0.357 = 0.007

And hence the naïve Bayesian classifier predicts that the student will buy a computer, because
P(X | yes) P(yes) = 0.028 > P(X | no) P(no) = 0.007

23
BAYESIAN LEARNING

Naïve (Simple) Bayesian Classification

An Example: Learning to classify text

- Instances (training samples) are text documents
- Classification labels can be: like/dislike, etc.
- The task is to learn from these training examples to predict the class of unseen documents

Design issue:
- How to represent a text document in terms of attribute values

24
BAYESIAN LEARNING

Naïve (Simple) Bayesian Classification

One approach:
- The attributes are the word positions
- The value of an attribute is the word found in that position

Note that the number of attributes may be different for each document

We calculate the prior probabilities of classes from the training samples
The probability of each word appearing in a given position is also calculated
e.g. P("The" in first position | like document)
25
BAYESIAN LEARNING

Naïve (Simple) Bayesian Classification

Second approach:
The frequency with which a word occurs is counted, irrespective of the word's position

Note that here also the number of attributes may be different for each document

The probabilities of words are of the form
e.g. P("The" | like document)

26
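A minimal sketch of this second, word-frequency approach on a made-up four-document corpus; add-one smoothing is used so that unseen words do not zero out the product (the slides return to this correction later):

```python
from collections import Counter, defaultdict
import math

# Tiny illustrative corpus; documents and labels are made up for this sketch.
docs = [
    ("a great and enjoyable read", "like"),
    ("truly great writing", "like"),
    ("boring and disappointing", "dislike"),
    ("a boring read", "dislike"),
]

# Count word frequencies per class, ignoring word positions.
word_counts = defaultdict(Counter)
class_counts = Counter()
for text, label in docs:
    class_counts[label] += 1
    word_counts[label].update(text.split())

vocab = {w for counts in word_counts.values() for w in counts}

def classify(text):
    """Pick the class maximising log P(C) + sum over words of log P(w|C)."""
    scores = {}
    for c in class_counts:
        total = sum(word_counts[c].values())
        score = math.log(class_counts[c] / len(docs))
        for w in text.split():
            # Add-one (Laplacian) smoothing over the vocabulary.
            score += math.log((word_counts[c][w] + 1) / (total + len(vocab)))
        scores[c] = score
    return max(scores, key=scores.get)

print(classify("a great read"))  # like
```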
BAYESIAN LEARNING

Naïve (Simple) Bayesian Classification

Results

An algorithm based on the second approach was applied to the problem of classifying newsgroup articles

- 20 newsgroups were considered
- 1,000 articles of each newsgroup were collected (20,000 articles in total)
- The naïve Bayes algorithm was applied using 2/3 of these articles as training samples
- Testing was done over the remaining 1/3

27
BAYESIAN LEARNING

Naïve (Simple) Bayesian Classification

Results

- Given 20 newsgroups, we would expect random guessing to achieve a classification accuracy of 5%
- The accuracy achieved by this program was 89%

28
BAYESIAN LEARNING

Naïve (Simple) Bayesian Classification

Minor Variant

The algorithm used only a subset of the words occurring in the documents

- The 100 most frequent words were removed (these include words such as "the" and "of")
- Any word occurring fewer than 3 times was also removed

29
Avoiding the Zero-Probability Problem

• Naïve Bayesian prediction requires each conditional probability to be non-zero; otherwise, the predicted probability will be zero

• Ex. Suppose a dataset with 1000 tuples: income = low (0), income = medium (990), and income = high (10)

• Use the Laplacian correction (or Laplacian estimator)
  • Add 1 to each case:
    Prob(income = low) = 1/1003
    Prob(income = medium) = 991/1003
    Prob(income = high) = 11/1003

• The "corrected" probability estimates are close to their "uncorrected" counterparts
30
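A sketch of the correction applied to the counts on the slide:

```python
# Counts from the slide: income = low / medium / high among 1000 tuples.
counts = {"low": 0, "medium": 990, "high": 10}

# Laplacian correction: add 1 to each case; the denominator grows by the
# number of distinct values (1000 + 3 = 1003).
total = sum(counts.values()) + len(counts)
smoothed = {v: (n + 1) / total for v, n in counts.items()}

print(smoothed["low"])     # 1/1003   ≈ 0.000997
print(smoothed["medium"])  # 991/1003 ≈ 0.988
print(smoothed["high"])    # 11/1003  ≈ 0.011
```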
Handling Real-Valued Data

31
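The slide gives no detail; one common choice (an assumption of this sketch, not something stated above) is to model P(xk|Ci) for a real-valued attribute Ak with a Gaussian whose mean and variance are estimated from the training samples of class Ci:

```python
import math

def gaussian_likelihood(x, values):
    """Estimate P(x|Ci) by fitting a Gaussian to the attribute values observed in class Ci."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / (len(values) - 1)
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

# Hypothetical weights (in grams) of the apples in a training set.
apple_weights = [150.0, 160.0, 145.0, 155.0, 170.0]
print(gaussian_likelihood(158.0, apple_weights))  # density of a 158 g fruit under the apple class
```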
The independence hypothesis…

• … makes computation possible
• … yields optimal classifiers when satisfied
• … but is seldom satisfied in practice, as attributes (variables) are often correlated
• Attempts to overcome this limitation:
  – Bayesian networks, which combine Bayesian reasoning with causal relationships between attributes
BAYESIAN LEARNING

Bayesian Belief Networks

In the real world, dependencies can exist between variables, so Bayesian belief networks are used to specify joint conditional probability distributions

They allow class conditional independencies to be defined between subsets of variables

They provide a graphical model of causal relationships, on which learning can be performed

These networks are also called belief networks, Bayesian networks, and probabilistic networks
33
BAYESIAN LEARNING

Bayesian Belief Networks

A belief network is defined by two components

The first is a directed acyclic graph, where each node represents a random variable and each arc represents a probabilistic dependence

[Figure: example network with nodes Age, FamilyH, Diabetes, Mass, Insulin and Glucose]
34
BAYESIAN LEARNING

Bayesian Belief Networks

If an arc is drawn from a node Y to a node Z, then Y is a parent of Z and Z is a descendant of Y

Each variable is conditionally independent of its non-descendants in the graph, given its parents

The variables may be discrete or continuous

[Figure: the Age, FamilyH, Diabetes, Mass, Insulin, Glucose network]

35
BAYESIAN LEARNING

Bayesian Belief Networks


The second component of a belief network consists of a conditional probability table (CPT) for each variable

For a variable Z, it specifies the conditional probability distribution P(Z|Parents(Z))

The conditional probability for each value of Z is listed for each possible combination of values of its parents

CPT for the variable M (Mass), whose parents are FH (FamilyH) and A (Age):

      (FH, A)   (FH, ~A)   (~FH, A)   (~FH, ~A)
M     0.8       0.5        0.7        0.1
~M    0.2       0.5        0.3        0.9
36
BAYESIAN LEARNING

Bayesian Belief Networks

The joint probability of any tuple (z1, …, zn) corresponding to the attributes Z1, …, Zn is computed by

P(z1, …, zn) = ∏(i=1..n) P(zi | Parents(Zi))

where the values P(zi | Parents(Zi)) correspond to the entries in the CPT for Zi
37
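A minimal sketch of this factorisation for a three-node fragment (FamilyH, Age, Mass), using the Mass CPT from the earlier slide; the priors for FamilyH and Age are made up, since the slides do not give them:

```python
# P(FamilyH) and P(Age) are assumed values, not taken from the slides.
p_fh = {True: 0.3, False: 0.7}
p_age = {True: 0.4, False: 0.6}

# P(Mass | FamilyH, Age) from the CPT on the earlier slide.
p_mass = {
    (True, True): 0.8, (True, False): 0.5,
    (False, True): 0.7, (False, False): 0.1,
}

def joint(fh, age, mass):
    """P(fh, age, mass) = P(fh) * P(age) * P(mass | fh, age)."""
    p_m = p_mass[(fh, age)]
    return p_fh[fh] * p_age[age] * (p_m if mass else 1 - p_m)

print(joint(True, False, True))  # 0.3 * 0.6 * 0.5 = 0.09
```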
BAYESIAN LEARNING

Bayesian Belief Networks

A node within the network can be selected as an output node, representing a class label attribute

The structure of the network can be given by an expert

38
BAYESIAN LEARNING

Example: the probability that a fish caught in summer, in the north Atlantic, is a sea bass, and is dark and thin:

P(a3, b1, x2, c3, d2) = P(a3) P(b1) P(x2|a3, b1) P(c3|x2) P(d2|x2)
                      = 0.25 × 0.6 × 0.4 × 0.5 × 0.4 = 0.012

39
BAYESIAN LEARNING

Learning Bayesian Belief Networks

The problem of learning a Bayes network is the problem of finding a network that best matches (according to some scoring metric) a training set of data

By finding a network we mean finding both
- the structure of the net, and
- the conditional probability tables (CPTs) associated with each node

40
BAYESIAN LEARNING

Learning Bayesian Belief Networks

Known Network Structure

If the structure of the network is known, we only have to find the CPTs

Often human experts can come up with the appropriate structure for a problem domain, but not the CPTs

If we have an ample number of training samples, we can compute sample statistics for each node and its parents

41
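A sketch of that counting step for a single CPT entry, P(Mass | FamilyH, Age), over made-up Boolean observations:

```python
from collections import Counter

# Hypothetical samples of (family_history, age_group, mass) observations.
data = [
    (True, True, True), (True, True, True), (True, True, False),
    (True, False, True), (False, True, False), (False, False, False),
    (False, True, True), (True, False, False),
]

# Sample statistic: P(Mass = True | FamilyH = fh, Age = a)
#   = count(fh, a, Mass = True) / count(fh, a)
parent_counts = Counter((fh, a) for fh, a, _ in data)
mass_counts = Counter((fh, a) for fh, a, m in data if m)

def cpt_entry(fh, a):
    return mass_counts[(fh, a)] / parent_counts[(fh, a)]

print(cpt_entry(True, True))  # 2/3 ≈ 0.67
```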
BAYESIAN LEARNING

Learning Bayesian Belief Networks

Unknown Network Structure

If the network structure is not known, we must attempt to find the structure, as well as its associated CPTs, that best fits the training data

In order to do so, we need
- a metric by which to score candidate networks
- a procedure for searching among possible structures

42
BAYESIAN LEARNING

Learning Bayesian Belief Networks

43
BAYESIAN LEARNING

Learning Bayesian Belief Networks

44
BAYESIAN LEARNING

Pros & Cons of Bayesian approach

The Bayesian approach provides a more flexible approach to learning than algorithms that completely eliminate a hypothesis if it is found to be inconsistent with any single example. In the Bayesian approach, each observed training sample can gradually increase or decrease the estimated probability that a hypothesis is correct

The probabilities used can be incrementally updated if new training samples arrive

45
BAYESIAN LEARNING

Pros & Cons of Bayesian approach

One practical difficulty in applying Bayesian methods is that they typically require initial knowledge of many probabilities

A second practical difficulty is the significant computational cost required to determine the Bayes optimal hypothesis in the general case (linear in the number of candidate hypotheses)

46
BAYESIAN LEARNING

Reference

Chapter 6 of T. Mitchell, Machine Learning, McGraw-Hill, 1997

47
INSTANCE-BASED LEARNING

k – NEAREST NEIGHBOUR

48
k – NEAREST NEIGHBOUR LEARNING

Introduction

Key Idea:
Just store all training examples <xi, f(xi)>

Thus the training algorithm is very simple

49
k – NEAREST NEIGHBOUR LEARNING

Introduction

Classification Algorithm (1-nearest neighbour):
* Given query instance xq
* Locate the nearest training example xn
* Estimate f(xq) as f(xn)

50
k – NEAREST NEIGHBOUR LEARNING

Introduction

Classification Algorithm (k-nearest neighbour):
* Given query instance xq
* Locate the k nearest training examples
* Estimate the class label of xq by taking a vote among the class labels of the k nearest neighbours

51
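A minimal sketch of the voting classifier on toy 2-D data, assuming Euclidean distance:

```python
from collections import Counter
import math

# Toy training set: (feature vector, class label). Values are made up.
training = [
    ((1.0, 1.0), "+"), ((1.5, 2.0), "+"), ((2.0, 1.5), "+"),
    ((5.0, 5.0), "-"), ((5.5, 4.5), "-"), ((6.0, 5.5), "-"),
]

def knn_classify(xq, k=3):
    """Vote among the class labels of the k training examples nearest to xq."""
    neighbours = sorted(training, key=lambda t: math.dist(xq, t[0]))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]

print(knn_classify((1.8, 1.2)))  # +
print(knn_classify((5.2, 5.0)))  # -
```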
k – NEAREST NEIGHBOUR LEARNING

Introduction

Note that, in the example shown on the slide, 1-nearest neighbour classifies xq as positive, whereas 5-nearest neighbour classifies it as negative

52
k – NEAREST NEIGHBOUR LEARNING

Introduction

Classification Algorithm (k-nearest neighbour):
* If the class labels are real-valued
  Take the mean of the class labels (target function f values) of the k nearest neighbours

53
k – NEAREST NEIGHBOUR LEARNING

Hypothesis about the target function

What classifications would be assigned if we were to hold the training examples constant and query the algorithm with every possible instance?

The resulting decision surface partitions the instance space into a cell around each training example; this partition is called a Voronoi diagram

54
k – NEAREST NEIGHBOUR LEARNING

Distance weighted NN-algorithm

We weight the contribution of each of the k neighbours according to their distance from the query point xq

The closer neighbours are given a greater weight wi

where wi = 1 / d(xq, xi)²  and  d(xq, xi) is the distance between xq and the training example xi

55
k – NEAREST NEIGHBOUR LEARNING

Distance weighted NN-algorithm

If for some training example xi we have d(xq, xi)² = 0 (i.e. xq exactly matches xi),
we assign the class of xi to xq
If there are several xi equal to xq, we take a majority vote among them

For real-valued target functions, the weighted estimate is
f(xq) = Σi wi f(xi) / Σi wi, summed over the k nearest neighbours

56
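A sketch of the distance-weighted estimate for a real-valued target, with wi = 1/d(xq, xi)² and the exact-match rule above; the training values are made up:

```python
import math

# Toy training set: (feature vector, real-valued target f(xi)). Values are made up.
training = [
    ((1.0, 1.0), 2.0), ((2.0, 2.0), 3.0),
    ((5.0, 5.0), 8.0), ((6.0, 5.0), 9.0),
]

def weighted_knn_estimate(xq, k=3):
    """f(xq) = sum(wi * f(xi)) / sum(wi), with wi = 1 / d(xq, xi)^2."""
    neighbours = sorted(training, key=lambda t: math.dist(xq, t[0]))[:k]
    weights = []
    for x, fx in neighbours:
        d2 = math.dist(xq, x) ** 2
        if d2 == 0:              # exact match: assign the stored value directly
            return fx
        weights.append((1.0 / d2, fx))
    return sum(w * fx for w, fx in weights) / sum(w for w, _ in weights)

print(round(weighted_knn_estimate((1.5, 1.5)), 3))  # ≈ 2.556
```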
k – NEAREST NEIGHBOUR LEARNING

Distance weighted NN-algorithm

If all the training examples are used to determine the classification of xq, then the algorithm is called a global method; otherwise it is called a local method

For real-valued target functions, the global method is also known as Shepard's method

57
k – NEAREST NEIGHBOUR LEARNING

Remarks

Advantages
• It is robust to noisy training data
• Training is fast

Disadvantages
• All attributes are used for the calculation of distances, whereas only a few may be relevant (this problem of irrelevant attributes is called the curse of dimensionality)
• Classification is a slow process

58
Reading Assignment & References

Chapter 8 of T. Mitchell, Machine Learning, McGraw-Hill, 1997

59
