Homework 1 Solution
where $X_1, \ldots, X_n$ are the random variables corresponding to the nodes of graph $G$ and $\mathrm{Pa}(X_i)$ denotes the parents of node $X_i$ in graph $G$. Furthermore, let $\mathrm{ND}(X_i)$ denote the non-descendant nodes of node $X_i$ in graph $G$ and $\mathrm{D}(X_i)$ denote its descendant nodes. Using the above factorization, we can see that:
$$
\begin{aligned}
P(X_i \mid \mathrm{ND}(X_i)) &= \frac{P(X_i, \mathrm{ND}(X_i))}{P(\mathrm{ND}(X_i))} \\
&= \frac{\sum_{\mathrm{D}(X_i)} P(X_1, \ldots, X_n)}{\sum_{X_i, \mathrm{D}(X_i)} P(X_1, \ldots, X_n)} \\
&= \frac{\sum_{\mathrm{D}(X_i)} \prod_{j=1}^{n} P(X_j \mid \mathrm{Pa}(X_j))}{\sum_{X_i, \mathrm{D}(X_i)} \prod_{j=1}^{n} P(X_j \mid \mathrm{Pa}(X_j))} \\
&= \frac{\prod_{X_j \in \mathrm{ND}(X_i) \cup \{X_i\}} P(X_j \mid \mathrm{Pa}(X_j)) \, \sum_{\mathrm{D}(X_i)} \prod_{X_j \in \mathrm{D}(X_i)} P(X_j \mid \mathrm{Pa}(X_j))}{\prod_{X_j \in \mathrm{ND}(X_i)} P(X_j \mid \mathrm{Pa}(X_j)) \, \sum_{X_i, \mathrm{D}(X_i)} \prod_{X_j \in \mathrm{D}(X_i) \cup \{X_i\}} P(X_j \mid \mathrm{Pa}(X_j))} \\
&= \frac{\prod_{X_j \in \mathrm{ND}(X_i) \cup \{X_i\}} P(X_j \mid \mathrm{Pa}(X_j)) \cdot 1}{\prod_{X_j \in \mathrm{ND}(X_i)} P(X_j \mid \mathrm{Pa}(X_j)) \cdot 1},
\end{aligned}
\tag{2}
$$
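All factors $P(X_j \mid \mathrm{Pa}(X_j))$ with $X_j \in \mathrm{ND}(X_i)$ are common to the numerator and the denominator and cancel, leaving
$$
P(X_i \mid \mathrm{ND}(X_i)) = P(X_i \mid \mathrm{Pa}(X_i)),
$$
i.e., each variable is conditionally independent of its non-descendants given its parents.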
1.2 D-separation
• The joint distribution can be written as:
$$
P(X_1, \ldots, X_7) = P(X_1) P(X_3 \mid X_1) P(X_2 \mid X_3) P(X_5 \mid X_2) P(X_6 \mid X_2) P(X_7) P(X_4 \mid X_3, X_7). \tag{4}
$$
• Yes. There is only one path from $X_1$ to $X_5$, which is a causal trail. One of the nodes along the path is $X_2$. Therefore, when $X_2$ is observed, the path becomes blocked and $X_1$ is independent of $X_5$.
• No. $X_7$ and $X_3$ are both parents of $X_4$. Therefore, when $X_4$ is observed, these two nodes become dependent. Also, $X_3$ is clearly dependent on $X_2$, and so, when $X_4$ is observed, $X_2$ and $X_7$ become dependent.
• Yes. When $X_3$ is observed, the path between $X_4$ and $X_2$ is blocked. That path is part of the only path between $X_4$ and $X_5$, and so, when $X_3$ is observed, $X_4$ and $X_5$ become independent.
• The variables that are in the Markov blanket of X3 are: X1 , X2 , X4 , and X7 .
1.3 Hidden Markov Model
From the definition of the hidden Markov model (HMM) we have that $x_i$ depends only on $z_i$, and $z_i$ depends only on $z_{i-1}$, so that
$$
P(x_1, \ldots, x_i, z_1, \ldots, z_i) = P(x_i \mid z_i) P(z_i \mid z_{i-1}) P(x_1, \ldots, x_{i-1}, z_1, \ldots, z_{i-1}).
$$
We can now marginalize out the variables $z_1, \ldots, z_{i-1}$ to obtain the following:
$$
\begin{aligned}
P(x_1, \ldots, x_i, z_i) &= \sum_{z_1, \ldots, z_{i-1}} P(x_1, \ldots, x_i, z_1, \ldots, z_i) \\
&= \sum_{z_1, \ldots, z_{i-1}} P(x_i \mid z_i) P(z_i \mid z_{i-1}) P(x_1, \ldots, x_{i-1}, z_1, \ldots, z_{i-1}) \\
&= P(x_i \mid z_i) \sum_{z_1, \ldots, z_{i-1}} P(z_i \mid z_{i-1}) P(x_1, \ldots, x_{i-1}, z_1, \ldots, z_{i-1}) \\
&= P(x_i \mid z_i) \sum_{z_{i-1}} P(z_i \mid z_{i-1}) \sum_{z_1, \ldots, z_{i-2}} P(x_1, \ldots, x_{i-1}, z_1, \ldots, z_{i-1}) \\
&= P(x_i \mid z_i) \sum_{z_{i-1}} P(z_i \mid z_{i-1}) P(x_1, \ldots, x_{i-1}, z_{i-1}),
\end{aligned}
\tag{6}
$$
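This is exactly the forward recursion for HMMs, computing $\alpha_i(z_i) = P(x_1, \ldots, x_i, z_i)$. The following is a minimal sketch of how it could be implemented for discrete observations, assuming hypothetical array names (`pi` for the initial state distribution, `A` for the transition matrix, `B` for the emission matrix); it is not part of the submitted code.

```python
import numpy as np

def forward(pi, A, B, obs):
    """Forward recursion: alpha[i, k] = P(x_1, ..., x_i, z_i = k).

    pi[k] = P(z_1 = k), A[j, k] = P(z_i = k | z_{i-1} = j),
    B[k, x] = P(x_i = x | z_i = k), obs = observed symbol indices.
    """
    n, K = len(obs), len(pi)
    alpha = np.zeros((n, K))
    alpha[0] = pi * B[:, obs[0]]  # base case: P(x_1, z_1) = P(z_1) P(x_1 | z_1)
    for i in range(1, n):
        # P(x_i | z_i) * sum_{z_{i-1}} P(z_i | z_{i-1}) * alpha_{i-1}(z_{i-1})
        alpha[i] = B[:, obs[i]] * (alpha[i - 1] @ A)
    return alpha
```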
This model is quite simple to understand. The reasoning behind it is that the parents of each variable are chosen to be variables describing events that might cause the event described by the child variable. For example, in our model, symptoms are caused by illnesses, and therefore the variables corresponding to symptoms are children of the variables corresponding to the illnesses that can cause those symptoms. The implementation can return the probability of any possible assignment. More details on how to use the code are provided in the “readme.txt” file submitted with the code.
3. Compactness: The total number of parameters of this model is $1 + 2 + 1 + 2 + 2 + 2^3 + 2 + 2^2 + 2^3 + 2^3 + 2^3 + 2^2 = 50$ (i.e., counting 1 parameter for variables with no parents – the probability of the variable being equal to true – and $2^n$ parameters for each variable with $n$ parents – the conditional probability table).
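For instance, this count can be reproduced with a short snippet, reading the per-variable parent counts off the exponents in the sum above (illustrative only, not part of the submitted code):

```python
# Number of parents of each variable, read off the exponents in the sum above.
parent_counts = [0, 1, 0, 1, 1, 3, 1, 2, 3, 3, 3, 2]
total = sum(2 ** n for n in parent_counts)  # 2**0 = 1 covers the parentless variables
print(total)  # 50
```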
4. Accuracy: The $\ell_1$-distance between my model and the true joint probability distribution is equal to 0.3320 in this case, which is significantly smaller than that of the baseline model, as expected.
5. Data Likelihood: The log-likelihood of the provided data set after having fit our model to the data
is equal to -15,163,339.2713.
6. Querying: The code for querying has been submitted along with the rest of the code. The main
idea behind my code is that I consider the joint probability distribution table (effectively a tensor)
over all possible assignments, “filtered” by the observed variables and re-normalized, and I sum over
all dimensions of the table corresponding to variables other than the query variables. The remaining
table (i.e., tensor) is the resulting distribution for the query variables. The outputs for the example queries provided in the problem handout, and for several different models that I tried, can be obtained by running the submitted code following the instructions provided in the “readme.txt” file.
The outputs also include the true probability distributions corresponding to these queries, computed
in the same way from the true probability distribution provided (as opposed to the joint probability
distribution table under our model). Please note that the querying code, as well as the rest of the code,
is highly inefficient – it is not optimized, as optimizing the code is outside the scope of this assignment.
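A minimal sketch of this querying idea (not the submitted code; the function name and argument conventions are hypothetical) could look as follows, with the joint table stored as a NumPy tensor having one binary axis per variable:

```python
import numpy as np

def query(joint, query_vars, evidence):
    """joint: array of shape (2, ..., 2); query_vars: list of axis indices;
    evidence: dict mapping axis index -> observed value (0 or 1)."""
    table = joint.astype(float).copy()
    for axis, value in evidence.items():
        # "Filter" by the observed variables: zero out inconsistent slices.
        index = [slice(None)] * table.ndim
        index[axis] = 1 - value
        table[tuple(index)] = 0.0
    table /= table.sum()  # re-normalize over the remaining assignments
    # Sum over all dimensions corresponding to non-query variables.
    other_axes = tuple(a for a in range(table.ndim) if a not in query_vars)
    return table.sum(axis=other_axes)
```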
7. Improved Graphical Model: I tried several improvements over the graphical model shown above.
All models I went through are included in the submitted code in files named as model#.py, where #
corresponds to the attempt number. On my sixth attempt I had the following model, which includes
some refinements over the previous model:
• IsSummer
• HasFlu ← IsSummer
• HasFoodPoisoning
• HasHayFever
• HasPneumonia ← IsSummer
• HasRespiratoryProblems ← HasFlu | HasHayFever | HasPneumonia | HasFoodPoisoning
• HasGastricProblems ← HasFlu | HasFoodPoisoning
• HasRash ← HasFoodPoisoning | HasHayFever
• Coughs ← HasFlu | HasPneumonia | HasRespiratoryProblems
• IsFatigued ← HasFlu | HasHayFever | HasPneumonia
• Vomits ← HasFoodPoisoning | HasGastricProblems
• HasFever ← HasFlu | HasPneumonia
This model uses 55 parameters and its accuracy, as measured by the $\ell_1$-distance between the model and the true joint probability distribution, is equal to 0.2667, which is significantly better than the previous accuracy we obtained. Furthermore, the log-likelihood value for the data is equal to −14,899,891.2238.
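For reference, the structure listed above could be encoded as a simple parent dictionary (an illustrative sketch, not the submitted model6.py, whose actual format may differ); the same counting rule as in item 3 recovers the 55 parameters:

```python
# Parents of each variable in the sixth-attempt model described above.
model_structure = {
    "IsSummer": [],
    "HasFlu": ["IsSummer"],
    "HasFoodPoisoning": [],
    "HasHayFever": [],
    "HasPneumonia": ["IsSummer"],
    "HasRespiratoryProblems": ["HasFlu", "HasHayFever", "HasPneumonia", "HasFoodPoisoning"],
    "HasGastricProblems": ["HasFlu", "HasFoodPoisoning"],
    "HasRash": ["HasFoodPoisoning", "HasHayFever"],
    "Coughs": ["HasFlu", "HasPneumonia", "HasRespiratoryProblems"],
    "IsFatigued": ["HasFlu", "HasHayFever", "HasPneumonia"],
    "Vomits": ["HasFoodPoisoning", "HasGastricProblems"],
    "HasFever": ["HasFlu", "HasPneumonia"],
}
num_params = sum(2 ** len(parents) for parents in model_structure.values())
print(num_params)  # 55
```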
3 Undirected Graphical Models (Mrinmaya)
Solution courtesy Emmanouil Antonios Platanios (Anthony)
It is also easy to see that if $X_i$ and $X_j$ are not neighbors (i.e., $\{i, j\} \notin E$), then $V \setminus \{i, j\}$ contains the indexes of all the neighbors of $X_i$ (among other nodes) and $j \in V \setminus (\{i\} \cup N_G(X_i))$. Therefore, from the local Markov property we get that $\{X_i \perp X_j \mid X_{N_G(X_i)} : \{i, j\} \notin E\} \Rightarrow \{X_i \perp X_j \mid X_{V \setminus \{i, j\}} : \{i, j\} \notin E\}$. This is in fact the pairwise Markov property and so we have shown that “Local Markov Property” $\Rightarrow$ “Pairwise Markov Property”.
where we used the fact that this is a complete graph (i.e., there exists an edge between every possible
pair of nodes), as mentioned in the question. By matching the form that we obtained with the provided
form, we see that:
$$
\psi_{i,j}(x_i, x_j) = \exp\left(-\frac{1}{2} x_i \Omega_{ij} x_j\right) \quad \text{and} \quad \psi_i(x_i) = \exp\left(x_i \sum_{j \in V} \Omega_{ij} \mu_j\right). \tag{8}
$$
2. Since we take the product over all edges and the product over all nodes, we can incorporate the product over all nodes into the product over all edges, while being careful about “double-counting”. Note that if a node has $n$ neighbors (i.e., $n$ edges involving that node), then if we incorporate the term $\psi_i(x_i)$ corresponding to that node into the edge product, we need to take its $n$th root, because it would appear in $n$ terms of that product (i.e., one term for each edge in which it appears). Therefore, we can see that:
$$
P(X \mid \mu, \Sigma) \propto \prod_{(i,j) \in E} \psi_{i,j}(x_i, x_j)\, \psi_i(x_i)^{\frac{1}{n(i)}}\, \psi_j(x_j)^{\frac{1}{n(j)}}, \tag{9}
$$
which is the form that is provided to us in the problem sheet. Now, we see that if $(i, j) \notin E$, then in order for the value of this density to remain unchanged we need to have $\Omega_{ij} = 0$ (i.e., so that all the terms of the product corresponding to that edge are equal to 1). It is not difficult to see then that, given $X_{V \setminus \{i,j\}}$ and $\Omega_{ij} = 0$, there is no “coupling” between the terms involving $X_i$ and $X_j$, and the conditional probability density function factorizes with respect to these two variables, implying that $X_i \perp X_j \mid X_{V \setminus \{i,j\}}$. Furthermore, following the same reasoning, if $X_i \perp X_j \mid X_{V \setminus \{i,j\}}$ we need to have no coupling between the terms involving $X_i$ and $X_j$ in the above equations. This means that $(i, j)$ must not be in $E$. Thus, we have argued (but not proved formally, as this was not required by the problem statement) that $(i, j) \notin E \Leftrightarrow X_i \perp X_j \mid X_{V \setminus \{i,j\}}$.
where:
$$
z_s = \sum_{v \in V \setminus \{s\}} \theta_v x_v + \sum_{\substack{(v,t) \in E \\ v, t \neq s}} \theta_{v,t} x_v x_t. \tag{11}
$$
It is easy to see that this is a logistic regression model over the neighbors of Xs .
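To make this explicit (assuming binary $x_v \in \{0,1\}$ and the pairwise form $P(x) \propto \exp\{\sum_{v} \theta_v x_v + \sum_{(v,t) \in E} \theta_{v,t} x_v x_t\}$; with $\pm 1$ variables the algebra is analogous), the terms collected in $z_s$ do not involve $x_s$ and cancel between numerator and denominator:
$$
P(X_s = 1 \mid X_{V \setminus \{s\}}) = \frac{\exp\{\theta_s + \sum_{t \in N_G(s)} \theta_{s,t} x_t + z_s\}}{\exp\{z_s\} + \exp\{\theta_s + \sum_{t \in N_G(s)} \theta_{s,t} x_t + z_s\}} = \sigma\Big(\theta_s + \sum_{t \in N_G(s)} \theta_{s,t} x_t\Big),
$$
where $\sigma$ denotes the logistic function.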
4 Generalized Linear Models (Xun)
4.1 Exponential Family
1. The distribution is invariant under linear transformations of $\eta$ and $T(x)$. For instance, we can define $\tilde{\eta} = c\eta$ and $\tilde{T}(x) = T(x)/c$ for any nonzero constant $c$.
2. Moment generating function of $T(X)$:
$$
\begin{aligned}
\psi(t) = \mathbb{E}\left[e^{tT(X)}\right] &= \int e^{tT(x)} h(x) \exp\{\eta T(x) - A(\eta)\}\, dx \qquad (16)\\
&= \int h(x) \exp\{(t+\eta) T(x) - A(\eta)\}\, dx \qquad (17)\\
&= \int h(x) \exp\{(t+\eta) T(x) - A(\eta) + A(t+\eta) - A(t+\eta)\}\, dx \qquad (18)\\
&= \exp\{A(t+\eta) - A(\eta)\} \int h(x) \exp\{(t+\eta) T(x) - A(t+\eta)\}\, dx. \qquad (19)
\end{aligned}
$$
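Since the integrand in (19) is the exponential-family density with natural parameter $t + \eta$ (assuming $t + \eta$ remains in the natural parameter space), the remaining integral equals 1, and hence
$$
\psi(t) = \exp\{A(t + \eta) - A(\eta)\}.
$$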
3. Recall the inner product between two matrices $\langle A, B \rangle = \mathrm{tr}(A^\top B) = \sum_{i,j} A_{ij} B_{ij}$.
$$
\begin{aligned}
p(x \mid \mu, \Sigma) &= (2\pi)^{-\frac{n}{2}} |\Sigma|^{-\frac{1}{2}} \exp\left\{-\frac{1}{2}(x-\mu)^\top \Sigma^{-1} (x-\mu)\right\} \qquad (24)\\
&\propto \exp\left\{-\frac{1}{2}(x-\mu)^\top \Sigma^{-1} (x-\mu) - \frac{1}{2}\log\det\Sigma\right\} \qquad (25)\\
&\propto \exp\left\{-\frac{1}{2} x^\top \Sigma^{-1} x + x^\top \Sigma^{-1}\mu - \frac{1}{2}\mu^\top\Sigma^{-1}\mu - \frac{1}{2}\log\det\Sigma\right\} \qquad (26)\\
&\propto \exp\left\{\mathrm{tr}\left(-\frac{1}{2}\Sigma^{-1} x x^\top\right) + (\Sigma^{-1}\mu)^\top x - \frac{1}{2}\mu^\top\Sigma^{-1}\mu - \frac{1}{2}\log\det\Sigma\right\} \qquad (27)\\
&\propto \exp\left\{\left\langle -\tfrac{1}{2}\Sigma^{-1}, x x^\top \right\rangle + \left\langle \Sigma^{-1}\mu, x \right\rangle - \frac{1}{2}\mu^\top\Sigma^{-1}\mu - \frac{1}{2}\log\det\Sigma\right\} \qquad (28)\\
&\propto \exp\left\{\langle \eta, T(x)\rangle - A(\eta)\right\}, \qquad (29)
\end{aligned}
$$
where
$$
\eta = \begin{pmatrix} \Sigma^{-1}\mu \\[2pt] -\frac{1}{2}\Sigma^{-1} \end{pmatrix}, \qquad T(x) = \begin{pmatrix} x \\[2pt] x x^\top \end{pmatrix}, \qquad A(\eta) = \frac{1}{2}\mu^\top\Sigma^{-1}\mu + \frac{1}{2}\log\det\Sigma. \tag{30}
$$
Inverse parameter mapping:
$$
\mu = -\frac{1}{2}\eta_2^{-1}\eta_1, \qquad \Sigma = -\frac{1}{2}\eta_2^{-1}. \tag{31}
$$
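As a quick sanity check of the mapping in (30) and (31), the following snippet (illustrative only; the example values of $\mu$ and $\Sigma$ are arbitrary) recovers the mean and covariance from the natural parameters:

```python
import numpy as np

# Arbitrary example mean and covariance (assumed values for illustration).
mu = np.array([1.0, -2.0])
Sigma = np.array([[2.0, 0.5], [0.5, 1.0]])

eta1 = np.linalg.solve(Sigma, mu)       # eta_1 = Sigma^{-1} mu
eta2 = -0.5 * np.linalg.inv(Sigma)      # eta_2 = -1/2 Sigma^{-1}

mu_rec = -0.5 * np.linalg.solve(eta2, eta1)   # mu = -1/2 eta_2^{-1} eta_1
Sigma_rec = -0.5 * np.linalg.inv(eta2)        # Sigma = -1/2 eta_2^{-1}

assert np.allclose(mu_rec, mu) and np.allclose(Sigma_rec, Sigma)
```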
(a) Lagrangian:
$$
\begin{aligned}
L(f, \lambda) &= -\int f(x)\log f(x)\, dx + \lambda_0\left(\int f(x)\, dx - 1\right) + \sum_k \lambda_k\left(\int g_k(x) f(x)\, dx - \mu_k\right) \qquad (34)\\
&= \int\left(-f(x)\log f(x) + \lambda_0 f(x) + \sum_k \lambda_k g_k(x) f(x)\right) dx - \lambda_0 - \sum_k \lambda_k \mu_k. \qquad (35)
\end{aligned}
$$
Let $\tilde{L}$ be the integrand, which does not contain derivatives of $f(x)$. By the optimality condition,
$$
\frac{\delta L}{\delta f(x)} = \frac{\partial \tilde{L}}{\partial f(x)} = -\log f(x) - 1 + \lambda_0 + \sum_k \lambda_k g_k(x) = 0. \tag{36}
$$
$$
\Rightarrow \quad f^*(x) = \exp\left\{\sum_k \lambda_k^* g_k(x) + \lambda_0^* - 1\right\}, \tag{37}
$$
where $\lambda^*$ is chosen such that $f^*$ is feasible. This is clearly in exponential family form.
(b) Lagrangian:
$$
L(f, \lambda) = \int\left(-f(x)\log f(x) + \lambda_0 f(x) + \lambda_1 x f(x) + \lambda_2 (x-\mu)^2 f(x)\right) dx + \text{const}. \tag{38}
$$
Let $\tilde{L}$ be the integrand, which does not contain derivatives of $f(x)$. By the optimality condition,
$$
\frac{\delta L}{\delta f(x)} = \frac{\partial \tilde{L}}{\partial f(x)} = -\log f(x) - 1 + \lambda_0 + \lambda_1 x + \lambda_2 (x-\mu)^2 = 0. \tag{39}
$$
$$
\Rightarrow \quad f^*(x) = \exp\left\{\lambda_1^* x + \lambda_2^* (x-\mu)^2 + \lambda_0^* - 1\right\}. \tag{40}
$$
Now check the feasibility. It is easy to see that in order to make $f^*(x)$ integrate to 1, we need $\lambda_1^* = 0$ and $\lambda_2^* < 0$ (otherwise the integral will be unbounded).
The normalization condition $\int f^*(x)\, dx = e^{\lambda_0^* - 1} \int e^{\lambda_2^*(x-\mu)^2}\, dx = e^{\lambda_0^* - 1}\sqrt{-\pi/\lambda_2^*} = 1$ therefore gives
$$
e^{\lambda_0^* - 1} = \sqrt{-\frac{\lambda_2^*}{\pi}}.
$$
Then
$$
f^*(x) = \sqrt{-\frac{\lambda_2^*}{\pi}}\, e^{\lambda_2^*(x-\mu)^2}. \tag{43}
$$
Verify the mean:
$$
\mu = \int x f^*(x)\, dx = \sqrt{-\frac{\lambda_2^*}{\pi}} \int x\, e^{\lambda_2^*(x-\mu)^2}\, dx = \sqrt{-\frac{\lambda_2^*}{\pi}}\; \mu \sqrt{-\frac{\pi}{\lambda_2^*}} = \mu. \tag{44}
$$
where $y'(x) = \frac{dy}{dx}$. A necessary condition for $F[y]$ to attain an extremum is given by the Euler–Lagrange equation:
$$
\frac{\delta F}{\delta y(x)} = \frac{\partial G}{\partial y} - \frac{d}{dx}\frac{\partial G}{\partial y'} = 0, \tag{48}
$$
where the left-hand side is the functional derivative of $F$ w.r.t. $y$ at the point $x$. In both (35) and (38) the integrand only depends on $y(x)$ and $x$, thus (48) becomes
$$
\frac{\delta F}{\delta y(x)} = \frac{\partial G}{\partial y} = 0. \tag{49}
$$
Thus, computing the functional derivative reduces to differentiating the integrand with respect to $f(x)$, as was done in (36) and (39).
2. The stochastic update rule has the form (Eq. (8.84) in Jordan’s textbook):
$$
\theta^{(t+1)} = \theta^{(t)} + \rho\left(y_i - \mu_i^{(t)}\right) x_i, \tag{53}
$$
where $\mu_i^{(t)} = f\left(\langle \theta^{(t)}, x_i \rangle\right)$. Plugging in $f$, we have
$$
\theta^{(t+1)} = \theta^{(t)} + \rho\left(y_i - e^{\langle \theta^{(t)}, x_i \rangle}\right) x_i. \tag{54}
$$
$$
W_{ii} = \frac{d\mu_i}{d\eta_i} = \frac{d^2 A(\eta)}{d\eta_i^2} = e^{\eta_i} = e^{\langle \theta, x_i \rangle}. \tag{56}
$$
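As an illustration of the update in (54), a single stochastic step could be sketched as follows (variable names are hypothetical; this assumes the inverse link $f(s) = e^s$ used above):

```python
import numpy as np

def stochastic_step(theta, x_i, y_i, rho=1e-3):
    """One update: theta <- theta + rho * (y_i - exp(<theta, x_i>)) * x_i."""
    mu_i = np.exp(theta @ x_i)  # predicted mean under the exponential inverse link
    return theta + rho * (y_i - mu_i) * x_i
```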