Lecture 13: Bayesian Networks I: CS221 / Spring 2019 / Charikar & Sadigh
[Figure: factor graph with variables X1, X2, X3 and factors f1, f2, f3, f4]
Variables:
X = (X1 , . . . , Xn ), where Xi ∈ Domaini
Factors:
f1 , . . . , fm , with each fj (X) ≥ 0
Weight(x) = ∏_{j=1}^{m} fj(x)
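• As a minimal sketch in code (the factors below are made up, not from the slides), the weight of an assignment is just the product of the factor values:

def weight(x, factors):
    # x: an assignment, e.g. a dict {variable name: value}
    # factors: list of functions, each mapping an assignment to a nonnegative number
    w = 1.0
    for f in factors:
        w *= f(x)
    return w

# Example with two binary variables and two made-up factors:
factors = [
    lambda x: 2.0 if x['X1'] == x['X2'] else 1.0,   # prefers agreement
    lambda x: 3.0 if x['X2'] == 1 else 1.0,         # prefers X2 = 1
]
print(weight({'X1': 1, 'X2': 1}, factors))          # 6.0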
Search problems
Constraint satisfaction problems
Markov decision processes
Adversarial games
Bayesian networks
Machine learning
Basics
Probabilistic programs
Inference
Joint distribution P(S, R):
  s  r  P(S = s, R = r)
  0  0  0.20
  0  1  0.08
  1  0  0.70
  1  1  0.02
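• As a quick sketch in code (the dictionary-based representation is our own choice), marginalization and conditioning on this joint table are just sums and renormalization:

# Joint distribution P(S, R) from the table above, keyed by (s, r)
P = {(0, 0): 0.20, (0, 1): 0.08, (1, 0): 0.70, (1, 1): 0.02}

# Marginalization: P(R = r) = sum over s of P(S = s, R = r)
P_R = {r: sum(p for (s, r2), p in P.items() if r2 == r) for r in [0, 1]}

# Conditioning: P(S = s | R = 1) = P(S = s, R = 1) / P(R = 1)
P_S_given_R1 = {s: P[(s, 1)] / P_R[1] for s in [0, 1]}
print(P_R)            # approximately {0: 0.90, 1: 0.10}
print(P_S_given_R1)   # approximately {0: 0.8, 1: 0.2}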
P(R | T = 1, A = 1): R is the query; T = 1, A = 1 is the condition
(S is marginalized out)
p(b) = ε · [b = 1] + (1 − ε) · [b = 0]
p(e) = ε · [e = 1] + (1 − ε) · [e = 0]
p(a | b, e) = [a = (b ∨ e)]
• Let us try to model the situation. First, we establish that there are three variables, B (burglary), E
(earthquake), and A (alarm). Next, we connect up the variables to model the dependencies.
• Unlike in factor graphs, these dependencies are represented as directed edges. You can intuitively think
about the directionality as suggesting causality, though what this actually means is a deeper question and
beyond the scope of this class.
• For each variable, we specify a local conditional distribution (a factor) of that variable given its parent
variables. In this example, B and E have no parents while A has two parents, B and E. This local
conditional distribution is what governs how a variable is generated.
• We are writing the local conditional distributions using p, while P is reserved for the joint distribution over
all random variables, which is defined as the product of the local conditional distributions.
Bayesian network (alarm)
[Figure: network B → A ← E, with local conditional distributions p(b), p(e), and p(a | b, e)]
P(B = b, E = e, A = a) = p(b)p(e)p(a | b, e)
[demo: ε = 0.05]
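• As a small sketch in code (using ε = 0.05 from the demo; the function names are our own), the joint distribution is just the product of the local conditional distributions:

epsilon = 0.05

def p_b(b): return epsilon if b == 1 else 1 - epsilon      # p(b)
def p_e(e): return epsilon if e == 1 else 1 - epsilon      # p(e)
def p_a(a, b, e): return 1.0 if a == (b or e) else 0.0     # p(a | b, e) = [a = b or e]

# Joint distribution P(B = b, E = e, A = a) = p(b) p(e) p(a | b, e)
for b in [0, 1]:
    for e in [0, 1]:
        for a in [0, 1]:
            print(b, e, a, p_b(b) * p_e(e) * p_a(a, b, e))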
All factors (local conditional distributions) satisfy:
∑_{xi} p(xi | xParents(i)) = 1 for each xParents(i)
Implications:
• Consistency of sub-Bayesian networks
• Consistency of conditional distributions
A short calculation:
P(B = b, E = e) = ∑_a P(B = b, E = e, A = a)
               = ∑_a p(b) p(e) p(a | b, e)
               = p(b) p(e) ∑_a p(a | b, e)
               = p(b) p(e)
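• Continuing the numerical sketch above (with the hypothetical helpers p_b, p_e, p_a and ε = 0.05), we can verify this calculation by enumeration:

# Check: summing the joint over a recovers p(b) p(e)
for b in [0, 1]:
    for e in [0, 1]:
        marginal = sum(p_b(b) * p_e(e) * p_a(a, b, e) for a in [0, 1])
        assert abs(marginal - p_b(b) * p_e(e)) < 1e-9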
[Figure: the full network over B, E, A (with factors p(b), p(e), p(a | b, e)) and the sub-network over B, E alone (with factors p(b), p(e))]
• This property is very attractive, because it means that whenever we have a large Bayesian network where
we don’t care about some of the variables, we can just remove them (graph operations), and this encodes
the same distribution as we would have gotten from marginalizing out those variables (algebraic operations).
The former, being visual, can be more intuitive.
Consistency of local conditionals
[Figure: three-layer network over A, B, C (top), D, E (middle), F, G, H (bottom), where D has parents A and B]
P(D = d | A = a, B = b) = p(d | a, b)
(left-hand side: obtained from probabilistic inference; right-hand side: the local conditional distribution, by definition)
You are coughing and have itchy eyes. Do you have a cold or
allergies?
[demo]
Variables: Cold, Allergies, Cough, Itchy eyes
[Figure: Bayesian network over C (Cold), A (Allergies), H (Cough), I (Itchy eyes) and the corresponding factor graph]
Basics
Probabilistic programs
Inference
B ∼ Bernoulli(ε)
E ∼ Bernoulli(ε)
A = B ∨ E

import random

def Bernoulli(epsilon):
    # returns true (1) with probability epsilon
    return random.random() < epsilon
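• For instance (our own illustration, not from the slides), running the program a few times draws samples of (B, E, A), reusing the Bernoulli function above:

def sample_alarm(epsilon=0.05):
    # one run of the probabilistic program: sample B and E, then set A = B or E
    B = Bernoulli(epsilon)
    E = Bernoulli(epsilon)
    A = B or E
    return B, E, A

for _ in range(5):
    print(sample_alarm())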
X0 = (0, 0)
For each time step i = 1, . . . , n:
    With probability α:
        Xi = Xi−1 + (1, 0)  [go right]
    With probability 1 − α:
        Xi = Xi−1 + (0, 1)  [go down]
[Figure: chain X1 → X2 → X3 → X4 → X5]
Query: what are possible trajectories given evidence X10 = (8, 2)?
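• A rough rejection-sampling sketch of this query (our own code, with n = 10 and a made-up α = 0.5): simulate trajectories and keep only those consistent with the evidence X10 = (8, 2).

import random

def sample_trajectory(n=10, alpha=0.5):
    # Simulate the probabilistic program: start at (0, 0) and at each step
    # go right with probability alpha, otherwise go down.
    x = (0, 0)
    trajectory = [x]
    for i in range(n):
        if random.random() < alpha:
            x = (x[0] + 1, x[1])   # go right
        else:
            x = (x[0], x[1] + 1)   # go down
        trajectory.append(x)
    return trajectory

# Keep only trajectories consistent with the evidence X10 = (8, 2)
consistent = [t for t in (sample_trajectory() for _ in range(10000)) if t[-1] == (8, 2)]
print(len(consistent), consistent[:2])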
[Figures: hidden Markov model with hidden states H1, . . . , H5 and evidence E1, . . . , E5; document model generating words W1, W2, . . . , WL (e.g., "beach", "Paris") from topics such as travel]
• Naive Bayes is a very simple model which can be used for classification. For document classification, we
generate a label and all the words in the document given that label.
• Note that the words are all generated independently, which is not a very realistic model of language, but
naive Bayes models are surprisingly effective for tasks such as document classification.
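• As a toy sketch of the naive Bayes generative story (the labels and word distributions below are entirely made up for illustration):

import random

def sample_document(num_words=5):
    # Naive Bayes: first sample a label Y, then sample each word independently given Y.
    label = random.choice(['travel', 'sports'])
    words_given_label = {
        'travel': ['beach', 'Paris', 'hotel', 'flight'],
        'sports': ['game', 'team', 'score', 'coach'],
    }
    words = [random.choice(words_given_label[label]) for _ in range(num_words)]
    return label, words

print(sample_document())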
Application: topic modeling
Question: given a text document, what topics is it about?
[Figure: document-level topic distribution α = {travel: 0.8, Europe: 0.2} generating the words of the document]
Basics
Probabilistic programs
Inference
Output
P(Q = q | E = e) for all values q
P(X3 = x3 | X2 = 5) ∝ ( ∑_{x1} p(x1) p(x2 = 5 | x1) ) p(x3 | x2 = 5)
                    ∝ p(x3 | x2 = 5)
Fast way:
[whiteboard]
• Let’s first compute the query the old-fashioned way by grinding through the algebra. Then we’ll see a
faster, more graphical way of doing this.
• We start by transforming the query into an expression that references the joint distribution, which allows
us to rewrite as the product of the local conditional probabilities. To do this, we invoke the definition of
marginal and conditional probability.
• One convenient shortcut we will take is to make use of the proportional-to (∝) relation. Note that in the end,
we need to construct a distribution over X3 . This means that any quantity (such as P(X2 = 5)) which
doesn’t depend on X3 can be folded into the proportionality constant. If you don’t believe this, keep it
around to convince yourself that it doesn’t matter. Using ∝ can save you a lot of work.
• Next, we do some algebra to push the summations inside. We notice that ∑_{x4} p(x4 | x3) = 1 because
it’s a local conditional distribution. The factor ∑_{x1} p(x1) p(x2 = 5 | x1) can also be folded into the
proportionality constant.
• The final result is p(x3 | x2 = 5), which matches the query as we expected by the consistency of local
conditional distributions.
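• To make the ‘old-fashioned way’ concrete, here is a small inference-by-enumeration sketch for a chain X1 → X2 → X3 → X4; the domain and local conditional distributions are our own toy choices, not from the lecture:

import itertools
from collections import defaultdict

domain = [4, 5, 6]
p1 = {4: 0.5, 5: 0.3, 6: 0.2}                        # p(x1)

def p_next(prev, cur):
    # p(x_i | x_{i-1}): stay with probability 0.7, otherwise uniform over the other two values
    return 0.7 if cur == prev else 0.15

weights = defaultdict(float)                         # unnormalized P(X3 = x3, X2 = 5)
for x1, x2, x3, x4 in itertools.product(domain, repeat=4):
    if x2 != 5:                                      # condition on the evidence X2 = 5
        continue
    joint = p1[x1] * p_next(x1, x2) * p_next(x2, x3) * p_next(x3, x4)
    weights[x3] += joint                             # marginalize out X1 and X4

Z = sum(weights.values())
posterior = {x3: w / Z for x3, w in weights.items()}
print(posterior)                                     # equals p(x3 | x2 = 5), i.e. approximately {4: 0.15, 5: 0.7, 6: 0.15}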
General strategy
Query:
P(Q | E = e)
[whiteboard]
Query: P(B)
• Marginalize out A, E
Query: P(B | A = 1)
• Condition on A = 1
[Figure: the three-layer network over A, B, C, D, E, F, G, H from before]
[whiteboard]
Query: P(C | B = b)
• Marginalize out everything else, note C ⊥ B
Query: P(C, H | E = e)
• Marginalize out A, D, F, G, note C ⊥ H | E