Graph Lecture 19

Graphical models provide a compact representation for joint probability distributions through directed acyclic graphs (DAGs) and conditional probability tables (CPTs). For Bayesian networks, the DAG encodes local independence assumptions, and the distribution factorizes according to the graph. For Markov random fields, the graph represents conditional independence properties, and the distribution factorizes over cliques in the graph. Variable elimination is a method for efficient probabilistic inference in graphical models that exploits the conditional independence structure to decompose computations. The order of variable elimination can significantly impact computational complexity.


Graphical Models

Aarti Singh
Slides Courtesy: Carlos Guestrin

Machine Learning 10-701/15-781


Nov 15, 2010
Directed – Bayesian Networks
• Compact representation for a joint probability distribution

• Bayes Net = Directed Acyclic Graph (DAG) + Conditional Probability Tables (CPTs)

• Distribution factorizes according to the graph ≡ distribution satisfies the local Markov independence assumptions
  ≡ xk is independent of its non-descendants given its parents pak
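Written out, the factorization this refers to is the product of each variable's CPT given its parents:

\[
P(x_1,\dots,x_n) \;=\; \prod_{k=1}^{n} P(x_k \mid \mathrm{pa}_k)
\]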
Directed – Bayesian Networks
• Graph encodes local independence assumptions (local Markov assumptions)
• Other independence assumptions can be read off the graph using d-separation
• Distribution factorizes according to the graph ≡ distribution satisfies all independence assumptions found by d-separation
  [figure: the set F of distributions that factorize according to the graph and the set I of distributions satisfying the d-separation independencies coincide]
• Does the graph capture all independencies? Yes, for almost all distributions that factorize according to the graph. More in 10-708
D-separation
• a is d-separated from b by c ≡ a ⊥ b | c
• Three important configurations:
  – Causal direction (a → c → b):  a ⊥ b | c
  – Common cause (a ← c → b):  a ⊥ b | c
  – V-structure, “explaining away” (a → c ← b):  a and b are NOT independent given c
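A small numerical illustration of the v-structure case — a minimal numpy sketch with hypothetical CPTs, not from the slides. Here a and b are independent by construction, yet conditioning on c makes them dependent:

import numpy as np

# Hypothetical v-structure a -> c <- b with binary variables, illustrating
# "explaining away": a and b are marginally independent but become dependent
# once the common child c is observed.
P_a = np.array([0.5, 0.5])          # P(a)
P_b = np.array([0.5, 0.5])          # P(b)
P_c1 = np.array([[0.1, 0.9],        # P(c=1 | a, b): rows index a, columns index b
                 [0.9, 0.99]])

joint_c1 = P_a[:, None] * P_b[None, :] * P_c1   # joint P(a, b, c=1)

p_a1_given_c1 = joint_c1[1, :].sum() / joint_c1.sum()
p_a1_given_b1_c1 = joint_c1[1, 1] / joint_c1[:, 1].sum()
print(p_a1_given_c1, p_a1_given_b1_c1)          # the two differ, so a and b are dependent given c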
Undirected – Markov Random Fields
• Popular in statistical physics, computer vision, sensor networks, social networks, protein-protein interaction networks
• Example – image denoising: xi – value at pixel i, yi – observed noisy value
Conditional Independence properties
• No directed edges
• Conditional independence ≡ graph separation
• A, B, C – non-intersecting sets of nodes
• A ⊥ B | C if all paths between nodes in A & B are “blocked”, i.e. every such path contains a node z in C
Factorization
• Joint distribution factorizes according to the graph:
  P(x) = (1/Z) ∏C ψC(xC)
  where each ψC is an arbitrary positive function of a clique xC (e.g. xC = {x1,x2} is a clique, xC = {x2,x3,x4} a maximal clique)
• The normalization constant (partition function) Z = ∑x ∏C ψC(xC) is typically NP-hard to compute


MRF Example
• Often the potentials are written in exponential form, ψC(xC) = exp(−E(xC)), where E(xC) is the energy of the clique (e.g. lower if the variables in the clique take similar values)
MRF Example
Ising model:
• cliques are edges, xC = {xi, xj}
• binary variables xi ∈ {−1, 1}
• the product xi·xj is 1 if xi = xj and −1 if xi ≠ xj
• Probability of an assignment is higher if neighbors xi and xj take the same value
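A minimal sketch of this model, assuming the common single-parameter form P(x) ∝ exp(β ∑(i,j)∈E xi·xj); the slide's exact parameterization (e.g. whether it also couples to the noisy observations yi) is not shown in the extracted text:

import numpy as np
from itertools import product

# Hypothetical 1D Ising chain with 4 spins in {-1, +1} and coupling beta > 0.
beta = 0.5
edges = [(0, 1), (1, 2), (2, 3)]            # cliques are the edges of the chain

def unnormalized_prob(x):
    # Product of edge potentials exp(beta * x_i * x_j): larger when neighbors agree.
    return np.exp(beta * sum(x[i] * x[j] for i, j in edges))

# Partition function Z by brute-force enumeration (this is what is NP-hard in general).
states = list(product([-1, 1], repeat=4))
Z = sum(unnormalized_prob(x) for x in states)

print(unnormalized_prob((1, 1, 1, 1)) / Z)      # aligned neighbors: higher probability
print(unnormalized_prob((1, -1, 1, -1)) / Z)    # disagreeing neighbors: lower probability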
Hammersley-Clifford Theorem
• F – set of distributions that factorize according to the graph
• I – set of distributions that respect the conditional independencies implied by graph separation

• I ⊆ F (for strictly positive distributions)
  Important because: given the independencies of P we can get an MRF structure G
• F ⊆ I
  Important because: we can read independencies of P from the MRF structure G


What you should know…
• Graphical Models: Directed Bayesian networks, Undirected
Markov Random Fields
– A compact representation for large probability
distributions
– Not an algorithm
• Representation of a BN, MRF
– Variables
– Graph
– CPTs
• Why BNs and MRFs are useful
• D-separation (conditional independence) & factorization
Topics in Graphical Models
• Representation
– Which joint probability distributions does a graphical
model represent?

• Inference
– How to answer questions about the joint probability
distribution?
• Marginal distribution of a node variable
• Most likely assignment of node variables

• Learning
– How to learn the parameters and structure of a graphical
model?
Inference
• Possible queries:
  1) Marginal distribution, e.g. P(S)
     Posterior distribution, e.g. P(F|H=1)
  2) Most likely assignment of nodes:
     arg max P(F=f, A=a, S=s, N=n | H=1)
     f,a,s,n
  [figure: Bayes net with Flu → Sinus ← Allergy, Sinus → Headache, Sinus → Nose]
Inference
• Possible queries:
  1) Marginal distribution, e.g. P(S)
     Posterior distribution, e.g. P(F|H=1)

  P(F|H=1) = P(F, H=1) / P(H=1) = P(F, H=1) / ∑f P(F=f, H=1)

  → We will focus on computing P(F, H=1); the posterior then follows with only a constant factor more effort
Marginalization
• Need to marginalize over the other variables:
  P(S) = ∑f,a,n,h P(f, a, S, n, h)
  P(F, H=1) = ∑a,s,n P(F, a, s, n, H=1)     (2^3 terms)
• To marginalize out n binary variables, need to sum over 2^n terms
• Inference seems exponential in the number of variables!
• Actually, inference in graphical models is NP-hard
Bayesian Networks Example
• 18 binary attributes
• Inference
  – P(BatteryAge | Starts=f)
  – need to sum over 2^16 terms!
• Not impressed?
  – HailFinder BN – more than 3^54 = 58149737003040059690390169 terms
Fast Probabilistic Inference
P(F,H=1) = ∑a,s,n P(F, a, s, n, H=1)
         = ∑a,s,n P(F) P(a) P(s|F,a) P(n|s) P(H=1|s)
         = P(F) ∑a P(a) ∑s P(s|F,a) P(H=1|s) ∑n P(n|s)

• Push sums in as far as possible
• Distributive property: x1·z + x2·z = z·(x1 + x2)   (2 multiplies vs. 1 multiply)
Fast Probabilistic Inference
P(F,H=1) = ∑a,s,n P(F, a, s, n, H=1)
         = ∑a,s,n P(F) P(a) P(s|F,a) P(n|s) P(H=1|s)       (8 values × 4 multiplies)
         = P(F) ∑a P(a) ∑s P(s|F,a) P(H=1|s) ∑n P(n|s)     (∑n P(n|s) = 1)
         = P(F) ∑a P(a) ∑s P(s|F,a) P(H=1|s)               (4 values × 1 multiply)
         = P(F) ∑a P(a) g1(F,a)                            (2 values × 1 multiply)
         = P(F) g2(F)                                      (1 multiply)

32 multiplies vs. 7 multiplies
In general: 2^n vs. n·2^k, where k is the scope of the largest factor

(Potential for) exponential reduction in computation!
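To make the reduction concrete, here is a small numpy sketch of the same computation with hypothetical CPTs for the Flu/Allergy/Sinus/Headache/Nose network; the numbers are made up, and the brute-force sum and the pushed-in sums agree:

import numpy as np

# Hypothetical CPTs (binary variables; values are illustrative only).
P_F = np.array([0.9, 0.1])                     # P(F)
P_A = np.array([0.8, 0.2])                     # P(A)
P_S = np.zeros((2, 2, 2))                      # P(S | F, A), indexed [f, a, s]
P_S[0, 0] = [0.95, 0.05]
P_S[0, 1] = [0.40, 0.60]
P_S[1, 0] = [0.30, 0.70]
P_S[1, 1] = [0.10, 0.90]
P_H = np.array([[0.9, 0.1], [0.2, 0.8]])       # P(H | S), indexed [s, h]
P_N = np.array([[0.8, 0.2], [0.3, 0.7]])       # P(N | S), indexed [s, n]

# Brute force: sum the full joint over a, s, n with H = 1 (2^3 terms per value of F).
brute = np.zeros(2)
for f in range(2):
    for a in range(2):
        for s in range(2):
            for n in range(2):
                brute[f] += P_F[f] * P_A[a] * P_S[f, a, s] * P_N[s, n] * P_H[s, 1]

# Pushed-in sums:
# P(F, H=1) = P(F) * sum_a P(a) * sum_s P(s|F,a) P(H=1|s) * sum_n P(n|s)
sum_n = P_N.sum(axis=1)                                 # equals 1 for every s
g1 = np.einsum('fas,s,s->fa', P_S, P_H[:, 1], sum_n)    # eliminate s
g2 = np.einsum('a,fa->f', P_A, g1)                      # eliminate a
pushed = P_F * g2

print(brute, pushed)              # identical up to round-off
print(pushed / pushed.sum())      # posterior P(F | H=1)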


Fast Probabilistic Inference – Variable Elimination
P(F,H=1) = ∑a,s,n P(F) P(a) P(s|F,a) P(n|s) P(H=1|s)
         = P(F) ∑a P(a) ∑s P(s|F,a) P(H=1|s) ∑n P(n|s)
  where ∑n P(n|s) = 1, ∑s P(s|F,a) P(H=1|s) = P(H=1|F,a), and ∑a P(a) P(H=1|F,a) = P(H=1|F)

(Potential for) exponential reduction in computation!


Variable Elimination – Order can make a HUGE difference
P(F,H=1) = ∑a,s,n P(F) P(a) P(s|F,a) P(n|s) P(H=1|s)
         = P(F) ∑a P(a) ∑s P(s|F,a) P(H=1|s) ∑n P(n|s)
  (as above, the intermediate factors P(H=1|F,a) and P(H=1|F) have scope at most 2)

Eliminating s before n instead:
P(F,H=1) = P(F) ∑a P(a) ∑n ∑s P(s|F,a) P(n|s) P(H=1|s)
  generates a factor g(F,a,n) – the scope of the largest factor is now 3

(Potential for) exponential reduction in computation!


Variable Elimination – Order can make a HUGE difference
[figure: node Y connected to X1, X2, X3, X4 (…, Xn)]
• Eliminating the Xi first: the largest factor generated is g(Y) – scope 1
• Eliminating Y first: the largest factor generated is g(X1,X2,..,Xn) – scope n
Variable Elimination Algorithm
• Given a BN – the DAG and CPTs (initial factors p(xi|pai) for i = 1,..,n)
• Given a query P(X|e) ∝ P(X,e), where X is a set of variables and e the evidence
• Instantiate the evidence e, e.g. set H=1   IMPORTANT!!!
• Choose an ordering on the variables, e.g. X1, …, Xn
• For i = 1 to n, if Xi ∉ {X, e}:
  – Collect the factors g1,…,gk that include Xi
  – Generate a new factor g by eliminating (summing out) Xi from the product of these factors: g = ∑Xi g1 × ⋯ × gk
  – Variable Xi has been eliminated!
  – Remove g1,…,gk from the set of factors and add g
• Normalize P(X,e) to obtain P(X|e)
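A minimal sketch of this procedure in Python, assuming discrete variables and a simple dictionary-based factor representation (names and representation are illustrative, not from the slides):

from itertools import product

# A factor is a pair (vars, table): vars is a tuple of variable names and table
# maps each joint assignment (a tuple of values, in the same order) to a number.
# Evidence is handled by restricting the initial CPT factors to the observed
# values before calling variable_elimination.

def multiply(f1, f2, domains):
    v1, t1 = f1
    v2, t2 = f2
    vars_out = v1 + tuple(v for v in v2 if v not in v1)
    table = {}
    for assignment in product(*(domains[v] for v in vars_out)):
        amap = dict(zip(vars_out, assignment))
        table[assignment] = (t1[tuple(amap[v] for v in v1)] *
                             t2[tuple(amap[v] for v in v2)])
    return vars_out, table

def sum_out(factor, var):
    # Marginalize var out of the factor: add up table entries that agree elsewhere.
    vs, t = factor
    keep = tuple(v for v in vs if v != var)
    table = {}
    for assignment, val in t.items():
        key = tuple(a for v, a in zip(vs, assignment) if v != var)
        table[key] = table.get(key, 0.0) + val
    return keep, table

def variable_elimination(factors, order, domains):
    # order: elimination order over the variables not in the query or evidence.
    for var in order:
        involved = [f for f in factors if var in f[0]]
        if not involved:
            continue
        rest = [f for f in factors if var not in f[0]]
        prod = involved[0]
        for f in involved[1:]:
            prod = multiply(prod, f, domains)   # collect factors containing var
        rest.append(sum_out(prod, var))         # eliminate var, add the new factor g
        factors = rest
    result = factors[0]                         # combine whatever remains
    for f in factors[1:]:
        result = multiply(result, f, domains)
    return result                               # unnormalized P(X, e); normalize to get P(X|e)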
Complexity for (Poly)tree graphs
Variable elimination order:
• Consider the undirected version of the graph
• Start from the “leaves” and work up: find a topological order and eliminate variables in reverse order
• This does not create any factors bigger than the original CPTs
• For polytrees, inference is linear in the number of variables (vs. exponential in general)!
Complexity for graphs with loops
• Loop – undirected cycle
• Linear in the number of variables, but exponential in the size of the largest factor generated!
• Moralize the graph: connect the parents of each node into a clique & drop the directions of all edges
• When you eliminate a variable, add edges between its neighbors
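A minimal sketch of this bookkeeping on the (moralized) undirected graph, using a hypothetical 4-cycle as input: eliminating a variable generates a factor over its current neighbors and connects those neighbors into a clique:

def elimination_factor_scopes(adj, order):
    adj = {v: set(nbrs) for v, nbrs in adj.items()}   # work on a copy
    scopes = []
    for v in order:
        nbrs = adj.pop(v)
        scopes.append((v, set(nbrs)))                 # new factor's scope = current neighbors of v
        for u in nbrs:
            adj[u] |= nbrs - {u}                      # connect the neighbors into a clique
            adj[u].discard(v)
    return scopes

# Example: a 4-cycle A - B - C - D - A (hypothetical graph).
adj = {'A': {'B', 'D'}, 'B': {'A', 'C'}, 'C': {'B', 'D'}, 'D': {'A', 'C'}}
for var, scope in elimination_factor_scopes(adj, ['A', 'B', 'C', 'D']):
    print(var, '->', scope)   # the largest scope over the order is the induced width for this order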


Complexity for graphs with loops
• Loop – undirected cycle

  Var eliminated   Factor generated
  S                g1(C,B)
  B                g2(C,O,D)
  D                g3(C,O)
  C                g4(T,O)
  T                g5(O,X)
  O                g6(X)

• Linear in the number of variables, but exponential in the size of the largest factor generated ~ tree-width (max clique size − 1) of the resulting graph!
Example: Large tree-width with small number of parents
• At most 2 parents per node, but the tree-width is O(√n)
• Compact representation does not imply easy inference


Choosing an elimination order
• Choosing best order is NP-complete
– Reduction from MAX-Clique
• Many good heuristics (some with guarantees)
• Ultimately, can’t beat NP-hardness of inference
– Even optimal order can lead to exponential variable
elimination computation
• In practice
– Variable elimination often very effective
– Many (many many) approximate inference approaches
available when variable elimination too expensive
Inference
• Possible queries:
  2) Most likely assignment of nodes:
     arg max P(F=f, A=a, S=s, N=n | H=1)
     f,a,s,n
• Use the distributive property for max:
  max(x1·z, x2·z) = z·max(x1, x2)   (2 multiplies vs. 1 multiply)
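Pushing the max operators inward exactly as we pushed sums gives the max-product (Viterbi-style) version of variable elimination; on the same example:

\[
\max_{a,s,n} P(F)\,P(a)\,P(s \mid F,a)\,P(n \mid s)\,P(H{=}1 \mid s)
\;=\; P(F)\,\max_{a} P(a)\,\max_{s} P(s \mid F,a)\,P(H{=}1 \mid s)\,\max_{n} P(n \mid s)
\]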
Topics in Graphical Models
• Representation
– Which joint probability distributions does a graphical
model represent?

• Inference
– How to answer questions about the joint probability
distribution?
• Marginal distribution of a node variable
• Most likely assignment of node variables

• Learning
– How to learn the parameters and structure of a graphical
model?
Learning
[figure: data x(1), …, x(m)  →  Bayes net = graph structure + parameters (the CPTs P(Xi | PaXi))]
• Given a set of m independent samples (assignments of the random variables),
  find the best (most likely?) Bayes Net (graph structure + CPTs)
Learning the CPTs (given structure)
• For each discrete variable Xk, compute MLE or MAP estimates of its CPT from the data x(1), …, x(m)
• MLEs decouple for each CPT in Bayes Nets
• Given the structure, the log likelihood of the data (for the Flu/Allergy/Sinus/Headache/Nose network) is
  log P(D | θ, G) = ∑j [ log P(f(j)) + log P(a(j)) + log P(s(j) | f(j), a(j)) + log P(h(j) | s(j)) + log P(n(j) | s(j)) ]
  and each term depends only on one CPT’s parameters: θF, θA, θS|F,A, θH|S, θN|S
• Can compute MLEs of each parameter independently!
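A minimal sketch of one such decoupled MLE for a single CPT, say P(S | F, A), with a made-up data array; a MAP estimate would simply add pseudo-counts before normalizing:

import numpy as np

# Hypothetical samples of (F, A, S), one row per data point x(j).
data = np.array([[0, 0, 0], [0, 1, 1], [1, 0, 1], [1, 1, 1], [0, 0, 0]])

counts = np.zeros((2, 2, 2))
for f, a, s in data:
    counts[f, a, s] += 1

# MLE: theta_{s|f,a} = Count(f, a, s) / Count(f, a)
cpt = counts / counts.sum(axis=2, keepdims=True)
print(cpt)   # estimated P(S | F, A); the other CPTs are estimated the same way, independently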
Information theoretic interpretation of MLE
• Plugging in the MLE estimates gives the ML score
• Reminds of entropy

Information theoretic interpretation of MLE
• One term of the score doesn’t depend on the graph structure
• The remaining term is the ML score for the graph structure
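The equations these bullets annotate did not survive extraction; the standard form of the decomposition, which matches the annotations, is:

\[
\frac{1}{m} \log P(D \mid \hat\theta, G)
\;=\; \sum_i \hat I\big(X_i;\, \mathrm{Pa}_{X_i}\big) \;-\; \sum_i \hat H(X_i)
\]

where \(\hat I\) and \(\hat H\) are the empirical mutual information and entropy. The entropy term does not depend on the graph, so the mutual-information term serves as the ML score for the graph structure.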


ML – Decomposable Score
• Log data likelihood
• Decomposable score:
  – Decomposes over families in the BN (a node and its parents)
  – Will lead to significant computational efficiency!!!
  – Score(G : D) = ∑i FamScore(Xi | PaXi : D)
How many trees are there?
• Trees – every node has at most one parent
• n^(n−2) possible trees (Cayley’s Theorem)
• Nonetheless – an efficient optimal algorithm finds the best tree!


Scoring a tree
• Equivalent trees (same score I(A,B) + I(B,C)): the chains A → B → C, A ← B → C, A ← B ← C
• The score provides an indication of structure: the chain A – B – C scores I(A,B) + I(B,C), whereas the tree with A connected to both B and C scores I(A,B) + I(A,C)


Chow-Liu algorithm
• For each pair of variables Xi, Xj
  – Compute the empirical distribution P̂(xi, xj) from counts in the data
  – Compute the mutual information Î(Xi, Xj) = ∑xi,xj P̂(xi,xj) log [ P̂(xi,xj) / (P̂(xi) P̂(xj)) ]
• Define a graph
  – Nodes X1,…,Xn
  – Edge (i,j) gets weight Î(Xi, Xj)
• Optimal tree BN
  – Compute the maximum weight spanning tree (e.g. Prim’s or Kruskal’s algorithm, O(n log n))
  – Directions in the BN: pick any node as root; breadth-first search defines the directions
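A minimal sketch of the algorithm for binary data, using empirical pairwise mutual information as edge weights and a Kruskal-style maximum weight spanning tree (the data and variable indexing are hypothetical):

import numpy as np
from itertools import combinations

def mutual_information(x, y):
    # Empirical mutual information between two binary columns.
    mi = 0.0
    for a in (0, 1):
        for b in (0, 1):
            p_ab = np.mean((x == a) & (y == b))
            p_a, p_b = np.mean(x == a), np.mean(y == b)
            if p_ab > 0:
                mi += p_ab * np.log(p_ab / (p_a * p_b))
    return mi

def chow_liu_tree(data):
    n = data.shape[1]
    # Weight every pair by empirical mutual information, largest first.
    edges = sorted(((mutual_information(data[:, i], data[:, j]), i, j)
                    for i, j in combinations(range(n), 2)), reverse=True)
    parent = list(range(n))                      # union-find for Kruskal
    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i
    tree = []
    for w, i, j in edges:                        # maximum weight spanning tree
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj
            tree.append((i, j))
    return tree                                  # undirected edges; orient by BFS from any root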
Chow-Liu algorithm example
[figure: a graph whose edges are weighted by empirical mutual information; the maximum weight spanning tree is kept and then oriented outward from a chosen root]
Scoring general graphical models
• The graph that maximizes the ML score is the complete graph!
• Information never hurts: H(A|B) ≥ H(A|B,C)
• Adding a parent always increases the ML score: I(A; B,C) ≥ I(A; B)
• The more edges, the fewer independence assumptions, the higher the likelihood of the data – but it will overfit…
• Why does ML for trees work? Restricted model space – tree graphs
Regularizing
• Model selection
  – Use the MDL (Minimum Description Length) score
  – or the BIC score (Bayesian Information Criterion)
• Still NP-hard
• Mostly heuristic (exploit score decomposition)
• Chow-Liu provides the best tree approximation to any distribution
• Start with the Chow-Liu tree; add, delete, or invert edges; evaluate the BIC score
What you should know
• Learning BNs
– Maximum likelihood or MAP learns parameters
– ML score
• Decomposable score
• Information theoretic interpretation (Mutual
information)
– Best tree (Chow-Liu)
– Other BNs, usually local search with BIC score
