NTU EE6483: Week12_PCA_BayesianInference_before_lecture

Artificial Intelligence & Data Mining

Week 12

WEN Bihan (Asst Prof)


Homepage: https://personal.ntu.edu.sg/bihan.wen/

1
Recap: Bias and Variance

• Model Complexity Analysis:

Underfitting: high bias and low variance.
Overfitting: low bias and high variance.

2
Recap: Overfitting and Underfitting

• Simple Model

• High Bias

• Causes the algorithm to miss relevant relations between the input features and the target outputs.

• Complex Model

• High Variance

• Causes the algorithm to model the noise in the training set.

3
Recap: Overfitting and Underfitting

1. High complexity -> large gap between training and validation error -> overfitting

2. Low complexity -> small gap -> underfitting

4
Overfitting and Underfitting - Examples

• Suppose that you are training a NN for classification. When you reduce the model complexity by removing a layer, your validation accuracy increases.

• How does the model bias and variance change?

• Does your model suffer from overfitting or underfitting?

• Now, if you make your NN even deeper (higher complexity)

• Will your training accuracy increase?

• Will your validation accuracy increase?

5
Regularization to prevent overfitting

• Solutions, in the context of learning neural networks:

1. Limit the model complexity by reducing the model expressiveness.

2. Increase the training data complexity / size, to reduce the variance.

3. Simplify data distribution and dimensionality.

• Dimensionality Reduction

[Figure: dimensionality reduction examples with k = 200, k = 50, and k = 2]
6
Dimensionality Reduction

WEN Bihan (Asst Prof)


Homepage: https://personal.ntu.edu.sg/bihan.wen/

7
Outline

• Concept of Dimensionality Reduction

• Principal Component Analysis (PCA)

• How to derive PCA (not examinable)

• Examples

8
Carry-on Questions

• Why do we need dimensionality reduction?

• What are two objectives achieved by PCA?

• How to calculate PCA (not examinable)?

9
Unsupervised Learning

• Clustering

• Exploit similar structures amongst the data themselves.


• One way to summarize various data points with a single categorical variable,
i.e., the cluster centroid.

10
Unsupervised Learning

• Clustering

• Exploit similar structures amongst the data themselves.


• One way to summarize various data points with a single categorical variable,
i.e., the cluster centroid.

• Dimensionality Reduction

• Another way to simplify the complex and high-dimensional data.


• Summarize data with a lower dimensional real valued vector.

11
Dimensionality Reduction

• Goal: find the best-fitting low-dimensional subspace to represent $\{x_i\}_{i=1}^{N}$.

• Fit 2-d data with a 1-d line.

• Fit 3-d data with a 2-d plane.

12
Dimensionality Reduction

• Goal: find the best-fitting low-dimensional subspace to represent $\{x_i\}_{i=1}^{N}$.

• Given data points in d dimensions

• Convert them to data points in r<d dimensions

• With minimal loss of information

13
Example: Reduce Data from 2D to 1D
[Figure: 2D data with axis labels (inches) and (cm)]

14
Example: Reduce Data from 2D to 1D

Reduce data from 2D to 1D. [Figure: the 2D points are projected onto a 1D line]

15
Example: Reduce Data from 2D to 1D

Reduce data from 2D to 1D. [Figure: the 2D points are projected onto a 1D line]

16
Example: Reduce Data from 3D to 2D

3D data points (𝑥1, 𝑥2, 𝑥3): approximated by a 2D plane.

2D data points (𝑧1, 𝑧2): projected onto the plane.


17
Why Dimensionality Reduction?

• What is wrong with high dimensions?

• High Dimensions = Lots of Features

• Complex system to process

• Inefficient algorithm

• Overfitting to noise or other data corruptions.

18
Dimensionality Reduction

• Why do we need dimensionality reduction?

• Remove Feature Redundancy + Noise

• Prevent Overfitting

• Reduced Model Complexity

• Simplified Data distribution

• Better Visualization

19
Principal Component Analysis

• Principal Component Analysis (PCA)

• Simple and popular method.

• Unsupervised learning.

• To learn the “best” low-dimensional subspace for data projection.

• What does the “best” mean here?

20
Principal Component Analysis

• Two ways to interpret the “Best”:


1. The mean square error (MSE) of the projected data is minimized.

2. The variance of the projected data is maximized.

• These two objectives are achieved simultaneously when PCA is applied for dimensionality reduction.

• Why and How?

21
Principal Component Analysis

1. The mean square error (MSE) of the projected data is minimized.

• 𝒙𝒊 ( black ): the original data.

• 𝒗 ( red ) : PCA subspace.

• 𝒗𝑻 𝒙𝒊 𝒗 ( blue ) : projected data.

• Minimize the MSE (green), i.e., the perpendicular offset, for all {𝒙𝒊}.
22
Principal Component Analysis

2. The variance of the projected data is maximized.

• 𝒗𝑻 𝒙𝒊 𝒗 (blue): the projected data.

• Variance of the projected data = average squared magnitude of the blue projections.

• This blue variance is maximized.

23
Principal Component Analysis

• Minimizing MSE <=> Maximizing Projected Variance

• blue² + green² = black² (Pythagorean theorem)

• black is fixed (given the data)

• Therefore, maximizing blue (variance) is equivalent to minimizing green (MSE).

24
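A short justification of the identity above (a sketch in the slides' notation, for a unit vector 𝒗, using that the residual is orthogonal to the projection):

$$\lVert x_i \rVert^2 = \big\lVert (v^T x_i)v + \big(x_i - (v^T x_i)v\big) \big\rVert^2 = \underbrace{(v^T x_i)^2}_{\text{blue}^2} + \underbrace{\lVert x_i - (v^T x_i)v \rVert^2}_{\text{green}^2},$$

since the cross term $(v^T x_i)\, v^T\big(x_i - (v^T x_i)v\big) = 0$ when $\lVert v \rVert = 1$. Averaging over the (zero-mean) data, the left-hand side is fixed, so maximizing the average blue² (the projected variance) is the same as minimizing the average green² (the MSE).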
Example: PCA

• Which of the following 𝒗 generates higher projected variance?

25
Example: PCA

• Which of the following 𝒗 generates higher projected variance?

• Find the projected mean 𝒖.

• Calculate the distance 𝒅𝒊 between each projected point and the mean.

• The variance of the projected data:

$$\mathrm{var} = \frac{1}{N}\sum_{i=1}^{N} d_i^2$$

26
Example: PCA

• The subspace 𝒗 on the left generates higher projected variance.

[Figure: left subspace, larger projected variance; right subspace, smaller projected variance]
27
Example: PCA

• Which of the following 𝒗 generates smaller projected MSE?

28
Example: PCA

• The subspace 𝒗 on the left generates smaller projected MSE, i.e., a smaller perpendicular offset.

[Figure: comparison of two subspaces; one yields larger projected MSE, the other smaller projected MSE]

29
Derivation of PCA - Optional

1. The mean square error (MSE) of the projected data is minimized.

• For all $\{x_i\}_{i=1}^{N}$, the green² is minimized on average.

• Find the best red 𝒗 (unit vector) to minimize the green².

• The principal component: $z_i = (v^T x_i)\, v$

30
Derivation of PCA - Optional

1. The mean square error (MSE) of the projected data is minimized.

• The principal component: $z_i = (v^T x_i)\, v$

• The MSE to be minimized:

$$\mathrm{MSE} = \frac{1}{N}\sum_{i=1}^{N} \lVert z_i - x_i \rVert^2$$

31
Derivation of PCA - Optional

2. The variance of the projected data is maximized.

• Variance of the projected data:

$$\mathrm{Var} = \frac{1}{N}\sum_{i=1}^{N} \lVert z_i - \bar{z} \rVert^2$$

• $\bar{z}$ denotes the mean of $\{z_i\}$.

32
Derivation of PCA - Optional

2. The variance of the projected data is maximized.

• For zero-mean data $\{x_i\}_{i=1}^{N}$, we have $\bar{z} = 0$, then

$$\mathrm{Var} = \frac{1}{N}\sum_{i=1}^{N} \lVert z_i \rVert^2 = \frac{1}{N}\sum_{i=1}^{N} (v^T x_i)^2$$

33
Derivation of PCA - Optional

• Mathematical Problem Formulation:

• PCA is to maximize the projected data variance.

• Consider a set of zero-mean $\{x_i\}_{i=1}^{N}$:

$$\mathrm{Var} = \frac{1}{N}\sum_{i=1}^{N} \lVert z_i \rVert^2 = \frac{1}{N}\sum_{i=1}^{N} (v^T x_i)^2$$

• Note that this variance is different from the variance (and bias) used in the overfitting analysis.

34
Derivation of PCA - Optional

• Mathematical Problem Formulation:

• PCA is to maximize the projected data variance.

• Consider a set of zero-mean $\{x_i\}_{i=1}^{N}$; the first unit weight vector 𝒗 satisfies:

$$\max_{\lVert v \rVert_2 = 1} \; \frac{1}{N}\sum_{i=1}^{N} (v^T x_i)^2$$

35
Derivation of PCA - Optional

• Recap: represent a set of data vectors $\{x_i\}_{i=1}^{N}$ as a data matrix $X \in \mathbb{R}^{N \times d}$, whose i-th row is $x_i^T$:

$$X = \begin{bmatrix} x_1^T \\ \vdots \\ x_i^T \\ \vdots \\ x_N^T \end{bmatrix}$$

where $d$ is the feature dimension and $N$ is the number of data points.
36
Derivation of PCA - Optional

• Mathematical Problem Formulation:

• PCA is to maximize the projected data variance.

• Consider a set of zero-mean $\{x_i\}_{i=1}^{N}$; the first unit weight vector 𝒗 satisfies:

$$\max_{\lVert v \rVert_2 = 1} \; \frac{1}{N}\sum_{i=1}^{N} (v^T x_i)^2 = \max_{\lVert v \rVert_2 = 1} \; \frac{1}{N}\, v^T X^T X\, v$$

37
Derivation of PCA - Optional

• Mathematical Problem Formulation:

• PCA is to maximize the projected data variance.

• Consider a set of zero-mean $\{x_i\}_{i=1}^{N}$; the first unit weight vector 𝒗 satisfies:

$$\max_{\lVert v \rVert_2 = 1} \; \frac{1}{N}\sum_{i=1}^{N} (v^T x_i)^2 = \max_{\lVert v \rVert_2 = 1} \; \frac{1}{N}\, v^T X^T X\, v$$

• Here $\frac{1}{N}\sum_{i=1}^{N} (v^T x_i)^2 = \frac{1}{N}\, v^T X^T X\, v$ is the variance of the projected data.

• The optimal 𝒗* is the first unit weight vector for PCA.

• The first Principal Component: $z_i = (v^{*T} x_i)\, v^*$

38
Derivation of PCA - Optional

• Mathematical Problem Formulation :

• The optimal 𝒗* is the first unit weight vector for PCA:

$$\max_{\lVert v \rVert_2 = 1} \; \frac{1}{N}\sum_{i=1}^{N} (v^T x_i)^2 = \max_{\lVert v \rVert_2 = 1} \; \frac{1}{N}\, v^T X^T X\, v$$

• You can calculate the other 𝒗 and 𝒛𝒊 in an incremental fashion:

• Substitute 𝒙𝒊 with the residual (𝒙𝒊 − 𝒛𝒊).

• Calculate the next unit weight vector and principal component.

• How do we obtain the optimal 𝒗 here?

39
Derivation of PCA - Optional

• Algorithm:

• We need to use Singular Value Decomposition (SVD):

40
Derivation of PCA - Optional

• Algorithm:

1. Apply SVD to the data matrix 𝑿: $\mathrm{svd}(X) = U \begin{bmatrix} \Sigma \\ 0 \end{bmatrix} V^T$.

2. $V = [v_1, v_2, \dots, v_k, \dots, v_d]$, where $v_k$ is the k-th unit weight vector for PCA.

3. The k-th Principal Component of $x_i$: $(v_k^T x_i)\, v_k$.

4. Project onto the k-dimensional subspace $V_k = [v_1, v_2, \dots, v_k]$:

$$V_k^T x_i = \begin{bmatrix} v_1^T x_i \\ \vdots \\ v_k^T x_i \end{bmatrix}$$

41
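As a companion to the algorithm above, here is a minimal NumPy sketch of PCA via SVD (the function names and the explicit mean-centering step are my own additions; the slides assume zero-mean data):

```python
import numpy as np

def pca_svd(X, k):
    """PCA via SVD, following the algorithm on the slide.

    X : (N, d) data matrix, one sample per row.
    k : target dimension (k <= d).
    Returns (Z, Vk, mean): k-D codes, the k unit weight vectors, and the data mean.
    """
    mean = X.mean(axis=0)             # the derivation assumes zero-mean data
    Xc = X - mean
    # SVD: Xc = U @ diag(S) @ Vt; the rows of Vt are the unit weight vectors v_k
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    Vk = Vt[:k].T                     # (d, k): columns are v_1, ..., v_k
    Z = Xc @ Vk                       # (N, k): projections V_k^T x_i, stacked as rows
    return Z, Vk, mean

def reconstruct(Z, Vk, mean):
    """Map k-D codes back to the original d-D space."""
    return Z @ Vk.T + mean

# Usage example on random data (illustrative only)
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 10))
    Z, Vk, mu = pca_svd(X, k=2)
    X_hat = reconstruct(Z, Vk, mu)
    print(Z.shape, X_hat.shape)       # (100, 2) (100, 10)
```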
Derivation of PCA - Example

• Given data points 𝒙1 = (4, 6, 10), 𝒙2 = (3, 10, 13), 𝒙3 = (−2, −6, −8).

• Calculate the first unit weight vector 𝒗, and the first Principal Component.

1. Data Matrix: $X = \begin{bmatrix} 4 & 6 & 10 \\ 3 & 10 & 13 \\ -2 & -6 & -8 \end{bmatrix}$

2. SVD: $X = U \Sigma V^T$

3. $V = \begin{bmatrix} -0.22 & 0.78 & -0.58 \\ -0.57 & -0.59 & -0.58 \\ -0.79 & 0.20 & 0.58 \end{bmatrix}$, so $v_1 = \begin{bmatrix} -0.22 \\ -0.57 \\ -0.79 \end{bmatrix}$ and $z_{i,k} = (v_k^T x_i)\, v_k$.
42
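A quick NumPy check of this worked example (note that SVD determines each singular vector only up to sign, so the printed v1 may differ from the slide's by a factor of −1; like the slide, this applies SVD to X without mean-centering):

```python
import numpy as np

X = np.array([[ 4.,  6., 10.],
              [ 3., 10., 13.],
              [-2., -6., -8.]])

U, S, Vt = np.linalg.svd(X)
v1 = Vt[0]                       # first unit weight vector (up to sign)
print(np.round(v1, 2))           # approximately +/- [0.22, 0.57, 0.79]

# First principal component of each x_i: (v1^T x_i) v1
Z = np.outer(X @ v1, v1)
print(np.round(Z, 2))
```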
PCA Example 1

• Representation of handwritten digits in the MNIST dataset:

[Figure: sample variance with rotations, scales, etc.]

43
PCA Example 1

• Representation of handwritten digits in the MNIST dataset:

• PCA provides more robust and invariant features.

[Figure: reconstructions of x with k = 2, 10, 50, and 200 principal components]

44
PCA Example 2
• Image Compression

Original Image

• Divide the original 372x492 image into patches:

• Each patch is an instance containing 12x12 pixels on a grid.

• View each patch as a 144-D vector.
45
PCA Example 2
• Image Compression
PCA compression: 144D → 60D

46
PCA Example 2
• Image Compression
PCA compression: 144D → 16D

47
PCA Example 2
• Image Compression

16 most important eigenvectors

[Figure: each of the 16 eigenvectors displayed as a 12x12 patch]

48
PCA Example 2
• Image Compression
PCA compression: 144D → 6D

49
PCA Example 2
• Image Compression
PCA compression: 144D → 1D

50
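A sketch of the patch-based compression pipeline described in this example, run on a synthetic random image in place of the original photo (the 12x12 patch size and the target dimensions follow the slides; the helper functions and the random input are assumptions for illustration):

```python
import numpy as np

def image_to_patches(img, p=12):
    """Split an image into non-overlapping p x p patches, each flattened to a p*p vector."""
    H, W = img.shape
    H, W = H - H % p, W - W % p            # crop so the patch grid divides evenly
    patches = (img[:H, :W]
               .reshape(H // p, p, W // p, p)
               .swapaxes(1, 2)
               .reshape(-1, p * p))
    return patches, (H, W)

def patches_to_image(patches, shape, p=12):
    """Reassemble flattened patches back into an image of the given shape."""
    H, W = shape
    return (patches.reshape(H // p, W // p, p, p)
                   .swapaxes(1, 2)
                   .reshape(H, W))

def pca_compress(patches, k):
    """Project the 144-D patches onto the top-k principal directions and reconstruct."""
    mean = patches.mean(axis=0)
    Xc = patches - mean
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    Vk = Vt[:k].T                          # (144, k) unit weight vectors
    return (Xc @ Vk) @ Vk.T + mean         # reconstruction from the k-D codes

# Illustrative run on a random "image" of the same size as in the slides
img = np.random.default_rng(0).random((372, 492))
patches, shape = image_to_patches(img)
for k in (60, 16, 6, 1):
    rec = patches_to_image(pca_compress(patches, k), shape)
    mse = np.mean((rec - img[:shape[0], :shape[1]]) ** 2)
    print(f"144D -> {k}D, reconstruction MSE = {mse:.4f}")
```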
Carry-on Questions

• Why do we need dimensionality reduction?

• To remove redundancy, prevent overfitting, remove noise, etc.

• What are two objectives achieved by PCA?

• Minimizing MSE and Maximizing Projected Variance

• How to calculate PCA?

• Apply SVD to the data matrix. Select the unit weight vectors 𝒗𝒌 from the V matrix, and calculate $(v_k^T x_i)\, v_k$.
51
Bayesian Inference

WEN Bihan (Asst Prof)


Homepage: https://personal.ntu.edu.sg/bihan.wen/

52
Outline

• Probability and Conditional Probability

• Bayes’ Theorem

• Naïve Bayes

• Examples

53
Carry-on Questions

• What is Bayes’ Theorem?

• What is Naïve Bayes assumption?

54
From deterministic to probabilistic learning

• Training a deep neural network for classification:

• Deterministic model: always gives you the same output for the same input.

55
From deterministic to probabilistic learning

• Training a deep neural network for classification:

• Deterministic inference: gives you the same output for the same input.

• Bayesian Inference

• Tossing a coin twice, what outcomes will we get?

• There are 4 possible outcomes of this experiment: S = {HH, HT, TH, TT}

• Probabilistic model: Prob(HH) = 0.25

56
Basic Concepts in Probability Theory

• Concepts and Notations:

• h = A hypothesis or an event. (Example: tossing a coin.)

57
Basic Concepts in Probability Theory

• Concepts and Notations:

• h = A hypothesis or an event.

• D = A collection of data (e.g., training data).

We tossed the coin 8 times and got 4 heads followed by 4 tails: D = {H,H,H,H,T,T,T,T}
58
Basic Concepts in Probability Theory

• Concepts and Notations:

• h = A hypothesis or an event.

• D = A collection of data (e.g., training data).

• P(h) = Probability that Hypothesis h holds.

• P(D) = Probability of observing the training data D.

P(h) = probability of getting heads

59
Basic Concepts in Probability Theory

• Concepts and Notations:

• h = A hypothesis or an event.

• D = A collection of data (e.g., training data).

• P(h) = Probability that Hypothesis h holds.

• P(D) = Probability of observing the training data D.

• P(D|h) = Probability of observing D when h holds (read “probability of D given h”, or “probability of D conditioned on h”).

60
Basic Concepts in Probability Theory

• We are interested in P(h|D), because

• We always learn from history / knowledge.

• We normally have access to the training data D.

• We need to know P(h|D), which aims to find the most probable hypothesis given the data.

61
Basic Concepts in Probability Theory

• Concepts and Notations:

• h = A hypothesis or an event.

• D = A collection of data (e.g., training data).

• P(h) = Probability that Hypothesis h holds.

• P(D) = Probability of observing the training data D.

• P(D|h) = Probability of observing D when h holds (read “probability of D given h”).

• Can we calculate P(h|D)?

62


Joint and Conditional Probability

• Conditional Probability 𝑃 𝐴 | 𝐵 :

The probability that A happens given that B happened.

• Joint probability 𝑃 𝐴, 𝐵 :

The probability that A and B both happen.

• P(A, B) = P(A | B) · P(B)

63
Bayes’ Theorem

• The simple form of Bayes’ Theorem:

$$P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B)}$$

• How to derive the theorem?

• Use the joint probability: P(A, B) = P(A | B) P(B) = P(B | A) P(A).

• Therefore:

$$P(A \mid B) = \frac{P(A, B)}{P(B)} = \frac{P(B \mid A)\, P(A)}{P(B)}$$

64
Bayes’ Theorem

• Given that P(B) is a constant, the proportional form: $P(A \mid B) \propto P(B \mid A)\, P(A)$

• Sometimes P(B) can also be calculated as $P(B) = \sum_j P(B \mid A_j)\, P(A_j)$, over mutually exclusive and exhaustive events $A_j$.

• This is based on the law of total probability.

65
Bayes’ Theorem

• Now, we can predict P(h|D):

$$P(h \mid D) = \frac{P(D \mid h)\, P(h)}{P(D)}$$

• Some terminology we are going to use:

• Prior probability P(h): prior knowledge of h before observing D.

• Posterior P(h|D): the probability of h after we have observed D.

• Likelihood P(D|h): likelihood of observing D given h.

66
Bayes’ Theorem

• Maximum A Posteriori (MAP) hypothesis:

$$h_{\mathrm{MAP}} = \arg\max_{h} P(h \mid D)$$

• We just learned that $P(h \mid D) = \frac{P(D \mid h)\, P(h)}{P(D)}$. Given D, we have

$$h_{\mathrm{MAP}} = \arg\max_{h} P(D \mid h)\, P(h)$$

• If P(h) is constant, MAP is equivalent to Maximum Likelihood (ML):

$$h_{\mathrm{MAP}} = \arg\max_{h} P(D \mid h)$$

67
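A small illustrative sketch of MAP/ML estimation for the coin example, assuming a Bernoulli model and a grid of candidate hypotheses h = P(heads) (this is an added example, not from the slides; with a flat prior the MAP and ML estimates coincide):

```python
import numpy as np

D = list("HHHHTTTT")                      # the coin-toss data from the slides
n_heads = D.count("H")
n_tails = D.count("T")

candidates = np.linspace(0.0, 1.0, 101)   # candidate hypotheses h = P(heads)
likelihood = candidates**n_heads * (1 - candidates)**n_tails   # P(D | h)
prior = np.ones_like(candidates) / len(candidates)             # flat prior -> MAP == ML

posterior_unnorm = likelihood * prior     # the argmax is unchanged by the constant P(D)
h_map = candidates[np.argmax(posterior_unnorm)]
print(f"h_MAP = {h_map:.2f}")             # 0.50 for 4 heads and 4 tails
```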
Example Questions

Quiz 1: Flipping a coin

• Suppose we are flipping a fair coin twice. What is the probability that both flips are heads?

68
Example Questions

• Quiz 2: Flipping a coin

• Suppose we are flipping a fair coin twice. Given that the first flip is heads, what is the probability that both flips are heads?

69
Example Questions

• Quiz 3: Our IE4483 class is attended by students from both EEE and IEM. Only 50% of the IEM students and 30% of the EEE students pass the exam. Given that 60% of the entire class are EEE students, what is the percentage of IEM students amongst those who pass the exam?

70
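One possible way to set up Quiz 3 with Bayes' theorem and the law of total probability (a sketch of the computation, not part of the original slides):

```python
# Quiz 3 setup: P(pass | IEM) = 0.5, P(pass | EEE) = 0.3, P(EEE) = 0.6, so P(IEM) = 0.4.
p_pass_iem, p_pass_eee = 0.5, 0.3
p_eee = 0.6
p_iem = 1 - p_eee

# Law of total probability for the denominator P(pass)
p_pass = p_pass_iem * p_iem + p_pass_eee * p_eee

# Bayes' theorem: P(IEM | pass)
p_iem_given_pass = p_pass_iem * p_iem / p_pass
print(f"P(IEM | pass) = {p_iem_given_pass:.3f}")   # about 0.526
```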
Example Questions

• Quiz 4: Monty Hall Problem

• You’re a contestant on a game show. You see three closed doors, and
behind one of them is a prize. You choose one door, and the host opens
one of the other doors and reveals that there is no prize behind it. Then
he offers you a chance to switch to the remaining door. Should you take
it?

http://en.wikipedia.org/wiki/Monty_Hall_problem

71
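A simulation sketch of the Monty Hall problem (assuming the standard setup in which the host always opens a non-chosen, non-prize door; the empirical win rates should approach 1/3 when staying and 2/3 when switching):

```python
import random

def play(switch, n_trials=100_000):
    wins = 0
    for _ in range(n_trials):
        prize = random.randrange(3)
        choice = random.randrange(3)
        # The host opens a door that is neither the contestant's pick nor the prize
        opened = next(d for d in range(3) if d != choice and d != prize)
        if switch:
            # Switch to the one remaining closed door
            choice = next(d for d in range(3) if d != choice and d != opened)
        wins += (choice == prize)
    return wins / n_trials

print(f"stay:   {play(switch=False):.3f}")   # about 1/3
print(f"switch: {play(switch=True):.3f}")    # about 2/3
```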
Naïve Bayes

• The basic form of Bayes’ Theorem:

$$P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B)}$$

• It is straightforward if A and B are both single attributes.

• But what if we have multiple conditions or query events?

• How to represent their conditional probabilities?

72
Naïve Bayes

• Extend from simple example to large training datasets.

• Single attribute -> multiple attributes

• How does $P(a_1, a_2 \mid v_j)$ relate to $P(a_1 \mid v_j)$ and $P(a_2 \mid v_j)$?

• Naïve Bayes Assumption:

$$P(a_1, a_2 \mid v_j) = P(a_1 \mid v_j)\, P(a_2 \mid v_j)$$

73
Naïve Bayes

• Naïve Bayes assumption:

• The conditional independence assumption

• The feature values are conditionally independent of one another, given the class.

• Mathematical form:

$$P(a_1, \dots, a_d \mid v_j) = P(a_1 \mid v_j) \cdots P(a_d \mid v_j) = \prod_{i=1}^{d} P(a_i \mid v_j)$$

74
Naïve Bayes - Example

• Play Tennis:

• New Observation: <Sunny,Cool,High,Strong>, do you play tennis?


75
Naïve Bayes - Example

• Play Tennis:

• Compare P(yes | <S,C,H,S>) and P(no | <S,C,H,S>), where S = Sunny (Outlook), C = Cool (Temp.), H = High (Humidity), S = Strong (Wind).

• According to Bayes’ Theorem, this amounts to comparing P(yes, <S,C,H,S>) and P(no, <S,C,H,S>).

• P(yes, <S,C,H,S>) = P(yes) P(<S,C,H,S> | yes)

76
Bayes’ Theorem

• Conditional probability to joint probability:

P(A, B) = P(A | B) · P(B)

P(yes, <S,C,H,S>) = P(yes | <S,C,H,S>) · P(<S,C,H,S>)

P(no, <S,C,H,S>) = P(no | <S,C,H,S>) · P(<S,C,H,S>)

• P(<S,C,H,S>) is a constant, i.e., the same for both choices.

• Let us denote P(<S,C,H,S>) by P(Obser.).

77
Bayes’ Theorem

• Bayes’ Theorem:

$$P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B)}$$

P(yes | Obser.) = P(Obser. | yes) P(yes) / P(Obser.)

• Again, P(Obser.) is a constant, i.e., the same for both choices.

• Just compare P(Obser. | yes) P(yes) and P(Obser. | no) P(no).

78
Naïve Bayes - Example

• Play Tennis:

• Compare P(yes | <S,C,H,S>) and P(no | <S,C,H,S>), where S = Sunny (Outlook), C = Cool (Temp.), H = High (Humidity), S = Strong (Wind).

• According to Bayes’ Theorem, this amounts to comparing P(yes, <S,C,H,S>) and P(no, <S,C,H,S>).

• P(yes, <S,C,H,S>) = P(yes) P(<S,C,H,S> | yes)

• Based on the Naïve Bayes assumption, we have

P(<S,C,H,S> | yes) = P(Sunny | yes) P(Cool | yes) P(High | yes) P(Strong | yes) = 2/9 × 3/9 × 3/9 × 3/9 ≈ 0.00823

• P(yes) = #yes / 14 = 9/14 (by simply counting)

• Thus, P(yes, <S,C,H,S>) = 9/14 × 0.00823 ≈ 0.0053
79
Naïve Bayes - Example

• Play Tennis:

• P(yes, <S,C,H,S>) ≈ 0.0053

• Similarly, P(no, <S,C,H,S>) = P(no) P(<S,C,H,S> | no) = P(no) P(Sunny | no) P(Cool | no) P(High | no) P(Strong | no) ≈ 0.36 × 0.6 × 0.2 × 0.8 × 0.6 ≈ 0.0207

• Do you play tennis?

• P(yes, <S,C,H,S>) < P(no, <S,C,H,S>)

• Given the observation <Sunny, Cool, High, Strong>, it is more likely that you will NOT play tennis.

80
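A compact sketch of the comparison above, plugging in the class prior and conditional probabilities read off the slides (only the final products are computed here; small differences from the slide's rounded values are expected):

```python
import math

# Values from the slides for the query <Sunny, Cool, High, Strong>
p_yes = 9 / 14
cond_yes = [2 / 9, 3 / 9, 3 / 9, 3 / 9]   # P(Sunny|yes), P(Cool|yes), P(High|yes), P(Strong|yes)

p_no = 5 / 14                              # = 1 - 9/14, about 0.36
cond_no = [0.6, 0.2, 0.8, 0.6]             # P(Sunny|no), P(Cool|no), P(High|no), P(Strong|no)

# Naive Bayes: P(class, observation) = P(class) * prod_i P(a_i | class)
score_yes = p_yes * math.prod(cond_yes)    # about 0.0053
score_no = p_no * math.prod(cond_no)       # about 0.0206 (0.0207 on the slides, which round P(no) to 0.36)

print(f"P(yes, <S,C,H,S>) ~ {score_yes:.4f}")
print(f"P(no,  <S,C,H,S>) ~ {score_no:.4f}")
print("Prediction:", "play" if score_yes > score_no else "do not play")
```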
Carry-on Questions

• What is Bayes’ Theorem?

$$P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B)}$$

• What is the Naïve Bayes assumption?

$$P(a_1, \dots, a_d \mid v_j) = \prod_{i=1}^{d} P(a_i \mid v_j)$$

81
What we have learned

• Probability and Conditional Probability

• Hypothesis, joint / conditional probability, etc.

• Bayes’ Theorem

• Various forms of Bayes’ Theorem, prior, likelihood, Posterior, etc.

• Naïve Bayes

• Multi-attributes, Naïve Bayes assumption, examples.

82
