Tutorial On Helmholtz Machine
Kevin G. Kirby
Department of Computer Science, Northern Kentucky University
June 2006
Preface
When one understands the causes, all vanished images can easily
be found again in the brain through the impression of the cause.
This is the true art of memory.
René Descartes, Cogitationes privatae.
Quoted in Frances Yates, The Art of Memory (1966)
Helmholtz machines are artificial neural networks that, through many cycles of sensing and
dreaming, gradually learn to make their dreams converge to reality, and, in the process, create a
succinct internal model of a fluctuating world.
This tutorial is meant to provide a comfortable introduction for readers with backgrounds in
computer science or mathematics, at a level roughly the same as the artificial intelligence
textbook of Russell and Norvig [1]. The description here is expanded from the work of Dayan
and Hinton [2,3,4,5], with some change of notation and a path of exposition that I hope readers
will find helpful. My attempt to provide comfortable reading means I include a lot of
necessary background information (to keep one from having to pull down an old textbook), a
lot of calculational detail (to keep one from having to grab a pencil), and explicit pseudocode.
To minimize prerequisites, I have trimmed out very important discussions of the relationship of
Helmholtz machines to the EM algorithm, to autoencoders, to variational learning in Bayesian
networks, to information geometry, and to models of the cerebral cortex. This material is meant
to lead a reader up to, but not into, that literature.
I wrote this tutorial for a workshop I gave at the Kunsthochschule für Medien in Cologne in
June 2006. The thesis of this workshop was that Helmholtz machines can play aesthetic and
pedagogical roles that transcend their use in applications. To date they are, I believe, under-appreciated and under-explained. The slides from my lecture partly address the former aspect;
this tutorial is meant to partly address the latter. Comments to [email protected] are welcome.
This process of summing over one argument is called marginalization. By a common abuse of
notation, we just call these functions p(x) and p(y), respectively, distinguishing them
informally by the names of their arguments. As it turns out, p(x) = Prob[ X= x ] and p(y) =
Prob[ Y= y ]. The random variables X and Y are independent if p(xy) = p(x)p(y).
A second useful quantity is p(x|y) ≡ p(xy) / p(y). Then p(x|y) = Prob[ X = x, given Y = y ], a
conditional probability. Note that if X and Y are independent, then p(x|y) = p(x).
These often-used formulas follow from the definitions above:
p(xy) = p(x|y) p(y)

p(x) = Σ_y p(x|y) p(y)

p(xy|d) = p(x|yd) p(y|d)

p(y|x) = p(x|y) p(y) / p(x)

The last of these is Bayes' Theorem; it assumes, of course, that p(x) > 0.
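As a quick check on these identities (an illustration added here, not part of the original exposition), here is a small Python script that verifies marginalization and Bayes' Theorem on a made-up joint distribution over two binary variables:

import itertools

# A made-up joint distribution p(x, y) over two binary variables.
p_joint = {(0, 0): 0.1, (0, 1): 0.2, (1, 0): 0.3, (1, 1): 0.4}

def p_x(x):                       # marginalize over y
    return sum(p_joint[(x, y)] for y in (0, 1))

def p_y(y):                       # marginalize over x
    return sum(p_joint[(x, y)] for x in (0, 1))

def p_x_given_y(x, y):            # conditional probability p(x|y)
    return p_joint[(x, y)] / p_y(y)

def p_y_given_x(y, x):            # conditional probability p(y|x)
    return p_joint[(x, y)] / p_x(x)

# Bayes' Theorem: p(y|x) = p(x|y) p(y) / p(x)
for x, y in itertools.product((0, 1), repeat=2):
    assert abs(p_y_given_x(y, x) - p_x_given_y(x, y) * p_y(y) / p_x(x)) < 1e-12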
A random bit vector D with distribution p(d) can itself be described as a joint distribution over
all N bits in d: p(d) = p(d1 d2 … dN). It is often the case here that each bit is an independent and
identically distributed (IID) random variable. Independent, as above, means that p(d1 d2 … dN)
= p(d1) p(d2) … p(dN). Identically distributed, here, means that there is some constant
probability value p0 such that for all i = 1, 2, …, N:

p(di) = p0        when di = 1
p(di) = 1 − p0    when di = 0

This IID property means that one can write, for any specific bit vector d, the compact
expression:

p(d) = ∏_i p0^{di} ( 1 − p0 )^{1 − di}                                 (1.1)
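For instance (a minimal sketch, with p0 and d chosen arbitrarily), equation (1.1) can be evaluated directly:

import numpy as np

p0 = 0.3                                        # arbitrary firing probability
d = np.array([1, 0, 0, 1, 1])                   # an arbitrary bit vector
p_d = np.prod(p0 ** d * (1 - p0) ** (1 - d))    # equation (1.1)
print(p_d)                                      # 0.3^3 * 0.7^2 = 0.01323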
Given two probability assignments pA(d) and pB(d), one way to quantify how different they are
is to use the Kullback-Leibler divergence, also known as the relative entropy, from A to B:
KL[ pA(D) , pB(D) ] = Σ_d pA(d) log [ pA(d) / pB(d) ]                  (1.5)
This is equal to zero when pA = pB, and is never negative. Using the properties of the logarithm
and the definition of expectation above, we see that
KL[ pA(D) , pB(D) ] = ⟨ −log pB(D) ⟩A − ⟨ −log pA(D) ⟩A
In other words, the KL divergence from A to B is simply the difference in surprise, as averaged
by A. This is not a symmetric function: KL[ pA(D) , pB(D) ] ≠ KL[ pB(D) , pA(D) ], so the terms
"difference" or "divergence" are used, rather than the term "distance".
One way to see the asymmetry in KL divergence is to look at the function A(p,q) ≡ kl(p,q) − kl(q,p),
where p and q are probability vectors (vectors with positive components summing to 1), and

kl( p , q ) ≡ Σ_i p_i log ( p_i / q_i )

The plot below shows A( [x, 1−x]^T, [0.85, 0.15]^T ). If the KL divergence were symmetric, this
curve would be a horizontal line. Note that the degree of asymmetry is very small when p and
q are nearly equal (at x = 0.85).
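The following short Python sketch (added here as an illustration) reproduces this comparison numerically:

import numpy as np

def kl(p, q):
    # kl(p, q) = sum_i p_i log(p_i / q_i), for probability vectors p and q
    p, q = np.asarray(p, float), np.asarray(q, float)
    return float(np.sum(p * np.log(p / q)))

q = np.array([0.85, 0.15])
for x in (0.5, 0.7, 0.85, 0.95):
    p = np.array([x, 1 - x])
    print(x, kl(p, q) - kl(q, p))   # the asymmetry A(p, q); exactly zero at x = 0.85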
One remark about notation, for those fussy enough, like me, to care. It will be easiest to say
"the random bit vector d" even though "the random bit vector D" is the technically correct
phrasing, as bold capital letters denote random variables. (Precisely, a random bit vector D is a
function from a sample space into {0,1}^n, whereas d is merely an element of {0,1}^n.) One
surviving use for the upper/lower case distinction is exemplified in equations 1.2 - 1.5, where
capital letters are placeholders for bound variables. For example, the notation H(X|y) correctly
suggests that we view H(X|y) as a one-argument function of the bit vector y. Such
placeholders are necessary because function names (e.g. H, p) are overloaded in the
common notations of probability theory, so functions must be disambiguated by the names of
their arguments.
[Figure: a two-layer feedforward network. The input pattern x ∈ R^L feeds through the weight
matrix W (entries w_jk) to the pattern y ∈ R^M, which feeds through the weight matrix V
(entries v_ij) to the pattern d ∈ R^N (components d_i).]
In a linear neural network, the patterns x, y, d are vectors of real numbers. The input pattern x
passes through the connection weight matrix W to produce the pattern y. In turn, y passes
through the connection weight matrix V to produce the pattern d. Each neuron computes a
weighted sum of its inputs, plus an additional bias weight. This yields the following:
y_j = Σ_{k=1}^{L} w_jk x_k + w_{j,L+1} ,       or simply:   y = W[x|1]

d_i = Σ_{j=1}^{M} v_ij y_j + v_{i,M+1} ,       or simply:   d = V[y|1]

where the notation [a|1] denotes a column vector a with a 1 appended on the end. This means
the last column of the weight matrices W and V holds the bias weights. Accordingly, the output
pattern of each layer is an affine transformation of its input pattern. (An affine
transformation is a linear transformation followed by a translation.)
The simplest nonlinearity to introduce into this network is to take the linearly transformed
input pattern and push it through a nonlinear squashing function such as the logistic or
sigmoid function:
σ(u) = 1 / ( 1 + e^(−u) ) .

Let the vector function σ be defined by applying σ to each component. Then the neural
network processes patterns as follows:

y = σ( W[x|1] )
d = σ( V[y|1] )

Since σ yields a number between 0 and 1, we can move from interpreting this value as the
output of a real-valued neuron, to the probability that a binary-valued neuron produces the
output 1 (that is, the probability that it fires). Then we can write:

y = SAMPLE[ p_y ]     where p_y = σ( W[x|1] )                          (2.1a)
d = SAMPLE[ p_d ]     where p_d = σ( V[y|1] )                          (2.1b)

where SAMPLE[ p ] is a stochastic function that yields 1 with probability p, and 0 with
probability 1 − p; applied to a vector p, SAMPLE acts on each component. Because the
range of the function σ never reaches 0 or 1, a neuron will never fire (or not fire) with
complete certainty. This kind of layered stochastic network will be the starting point for the
Helmholtz machine.
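To make this concrete, here is a minimal Python sketch of one stochastic forward pass through such a network (the layer sizes and random weights are illustrative assumptions, not values from the text):

import numpy as np

rng = np.random.default_rng(0)

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

def sample(p):
    # SAMPLE[p]: yield 1 with probability p, componentwise
    return (rng.random(p.shape) < p).astype(float)

L, M, N = 4, 3, 9                              # illustrative layer sizes
W = rng.normal(size=(M, L + 1))                # last column holds the bias weights
V = rng.normal(size=(N, M + 1))

x = rng.random(L)                              # an arbitrary input pattern
y = sample(sigmoid(W @ np.append(x, 1.0)))     # equation (2.1a)
d = sample(sigmoid(V @ np.append(y, 1.0)))     # equation (2.1b)
print(d)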
pG(d | xy) = pG(xyd) / pG(xy) = pG(xd | y) / pG(x | y) = pG(d | y)

This conditional independence means we can get the joint distribution, which is the complete
description of behavior, merely by multiplying together the 3 givens of the generative
distribution for 1 → x → y → d :
pG(xyd)
= pG(yd|x) pG(x)
= [ pG(d|yx) pG(y|x) ] pG(x)
= pG(d|y) pG(y|x) pG(x)
[Figure: the generative network. Bias weights bG (components bG_k) feed the top layer x
(components x_k); the weight matrix WG (entries wG_jk) connects x down to the hidden layer y
(components y_j); the weight matrix VG (entries vG_ij) connects y down to the data layer d
(components d_i).]
The set of network connection weight matrices ( bG , WG, VG ) is nothing but a way to specify
(pG(x), pG(y|x), pG(d|y) ), or, equivalently, pG(xyd). Of course, this is a highly constrained form
for a probability distribution. The pG(d) they produce may not be capable of conforming
exactly to the observed p(d), but the machine will try to get it as close as possible.
We get the directional probabilities from the neural network weights as follows, based on
equations 1.1 and 2.1:
pG(x) = ∏_k pG(x_k)^{x_k} [ 1 − pG(x_k) ]^{1 − x_k}                    where pG(x_k) = σ( bG_k )

pG(y|x) = ∏_j pG(y_j|x)^{y_j} [ 1 − pG(y_j|x) ]^{1 − y_j}              where pG(y_j|x) = σ( Σ_{k=1}^{L} wG_jk x_k + wG_{j,L+1} )

pG(d|y) = ∏_i pG(d_i|y)^{d_i} [ 1 − pG(d_i|y) ]^{1 − d_i}              where pG(d_i|y) = σ( Σ_{j=1}^{M} vG_ij y_j + vG_{i,M+1} )

where σ is the sigmoid function σ(u) = ( 1 + exp(−u) )^(−1).
Call x by itself the cause. Call xy (the state of all the hidden units) the explanation of d.
Of course, the actual number of neurons in each layer is unrestricted. And we can imagine
variants of Helmholtz machines with more or fewer layers, different activation functions,
lateral connections, and so on [6]. But for the purposes of this tutorial, we concentrate on the
architecture depicted above.
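As a small illustration (a sketch with made-up weights, not code from the tutorial), the generative probability of a complete state xyd is just the product of the three Bernoulli factors above:

import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

def bernoulli_prob(bits, probs):
    # probability of a bit vector under independent Bernoulli units with the given firing probabilities
    return float(np.prod(probs ** bits * (1 - probs) ** (1 - bits)))

rng = np.random.default_rng(1)
L, M, N = 2, 3, 4                              # illustrative layer sizes
bG = rng.normal(size=L)
WG = rng.normal(size=(M, L + 1))               # last column = bias weights
VG = rng.normal(size=(N, M + 1))

x = np.array([1.0, 0.0])
y = np.array([0.0, 1.0, 1.0])
d = np.array([1.0, 1.0, 0.0, 0.0])

p_x   = bernoulli_prob(x, sigmoid(bG))
p_y_x = bernoulli_prob(y, sigmoid(WG @ np.append(x, 1.0)))
p_d_y = bernoulli_prob(d, sigmoid(VG @ np.append(y, 1.0)))
print(p_x * p_y_x * p_d_y)                     # pG(xyd) = pG(x) pG(y|x) pG(d|y)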
4. Energy
Now let us look at the interesting ways the probabilities involved are interrelated in the
generative distribution. Given the role of explanations, one might ask about the probability of
an explanation xy given some fixed piece of generated data d:
pG(xy | d) = pG(xyd) / pG(d) = pG(xyd) / Σ_xy pG(xyd)                  (4.1)

Defining EG(xy;d) ≡ −log pG(xyd), this can be rewritten as

pG(xy | d) = exp[ −EG(xy;d) ] / Σ_xy exp[ −EG(xy;d) ]
This rewrite is more than just cosmetic. If we use the suggestive name generative energy for
EG(xy;d), an analogy to statistical physics emerges.
Suppose the state of a physical system fluctuates among a set of states { q1, q2, }. One sign
that the system is in thermal equilibrium is that the probability of finding the system in a state
qi is related to its energy E(qi) according to a rule called the Boltzmann distribution:
p(q_i) = exp[ −E(q_i) / T ] / Σ_i exp[ −E(q_i) / T ]
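To connect (4.1) with this Boltzmann form, here is a tiny Python sketch (with made-up energies) that turns an arbitrary energy table into a distribution at temperature T = 1:

import numpy as np

def boltzmann(energies, T=1.0):
    # p(q_i) = exp(-E(q_i)/T) / sum_i exp(-E(q_i)/T)
    w = np.exp(-np.asarray(energies, float) / T)
    return w / w.sum()

E = [0.5, 1.0, 2.0, 2.0]          # made-up energies for four states
print(boltzmann(E))               # lower energy means higher probability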
5. Free Energy
Now the goal of Helmholtz machine learning is to find a G (that is, find bG , WG, and VG) that
makes pG(d) as close to p(d) as possible. When this happens, the machine will have made a
generative model of the world. So we want to find a G that minimizes the function
Φ(G) ≡ KL[ p(D), pG(D) ]                                               (5.1)

It would be nice if we could find a G such that Φ(G) = 0, but this is unlikely to happen since
the feedforward neural network implementation of G constrains the distribution. Minimizing
(5.1) is somewhat simpler than it looks, since if we use the definition of KL divergence we get
Φ(G) = Σ_d p(d) log [ p(d) / pG(d) ] = Σ_d p(d) log p(d) − Σ_d p(d) log pG(d)

where the first term does not depend on G and thus can be ignored in our optimization of G.
What's left is nothing but the expected surprise (under the network's generative distribution) of
the data, weighted by the real-world probability of the data:

Φ0(G) ≡ ⟨ −log pG(D) ⟩ = − Σ_d p(d) log pG(d) .                        (5.2)
As we sample patterns in the real world we keep track of how surprising it would be if the
machine generated that pattern. This accumulated surprise as we sample will get smaller as the
machine learns to model the world, as pG(d) moves closer to p(d).
So now our optimization problem is one of finding a G that minimizes this expected surprise
Φ0(G). As is common in neural network learning, we will use a gradient descent algorithm.
That means we will need to compute various gradients ∇Φ0 of Φ0(G). Since any gradient
operator ∇ is linear, we will have

∇Φ0 = − Σ_d p(d) ∇[ log pG(d) ] .                                      (5.3)

In gradient descent optimization we will make a lot of little changes proportional to −∇Φ0. Since
the gradient consists of a p(d)-weighted sum of the gradients of log pG(d), our descent
algorithm can just sample a d from the real world p(d) distribution, then add in an increment
proportional to ∇[ log pG(d) ]. As we keep adding these up, sampling according to p(d), we
get the same weighted sum as in equation (5.3). So finding the various gradients of log
pG(d) is all we need to do to prepare a gradient descent algorithm to make pG approach p.
Accordingly, we turn our attention to minimizing −log pG(d), the generative surprise, and
rearrange it to get an interesting new form:

−log pG(d)
= −log pG(d) · 1
= −log pG(d) [ Σ_xy pG(xy|d) ]
= − Σ_xy pG(xy|d) log pG(d)
= − Σ_xy pG(xy|d) log [ pG(xyd) / pG(xy|d) ]
= − Σ_xy pG(xy|d) log pG(xyd) + Σ_xy pG(xy|d) log pG(xy|d)
= Σ_xy pG(xy|d) EG(xy;d) − [ − Σ_xy pG(xy|d) log pG(xy|d) ]
= ⟨ EG(XY;d) ⟩G − HG(XY|d)
The first term is the average generative energy of explanations we observe with a pattern d
when we let the generative distribution operate. The second term is the entropy of explanations
we observe with a pattern d when we let the generative distribution operate.
In statistical physics, the Helmholtz free energy of a system with average energy E,
temperature T, and entropy H, is given by:
F = E − TH

Since the expression for −log pG(d) above displays it as average energy minus entropy, it
is analogous to a Helmholtz free energy with temperature 1. We can call this quantity the free
energy of d in G:

FG(d) ≡ −log pG(d) = ⟨ EG(XY;d) ⟩G − HG(XY|d) .                        (5.4)

(The quantity FG(d) = −log pG(d) is just the surprise of the pattern d.)
So we see that to minimize the KL divergence from the real world data distribution to the
Helmholtz machine data distribution means, according to (5.3) above, that we need to
minimize FG(D), the generative free energy averaged according to the real world distribution.
At this point it would seem that gradient-descent learning in a Helmholtz machine all boils
down to calculating some derivatives of FG(d). Specifically, we need to calculate three sets of
derivatives:
∂FG(d)/∂bG_k ,        ∂FG(d)/∂wG_jk ,        ∂FG(d)/∂vG_ij

To calculate these derivatives we need to express FG(d) in terms of the weights via the
formulas for pG(x), pG(y|x), pG(d|y) at the end of section 3. Try out the math. Things don't seem
to work out very well; the equations do not clean up nicely or give us simple expressions for
the derivatives. We must think a little more deeply.
Consider the KL divergence from some distribution pR(xy|d) over explanations to the
generative posterior pG(xy|d):

KL[ pR(XY|d), pG(XY|d) ]
= Σ_xy pR(xy|d) log [ pR(xy|d) / pG(xy|d) ]
= Σ_xy pR(xy|d) log pR(xy|d) − Σ_xy pR(xy|d) log [ pG(xyd) / pG(d) ]
= − HR(XY|d) + ⟨ EG(XY;d) ⟩R + log pG(d)
= − HR(XY|d) + ⟨ EG(XY;d) ⟩R − FG(d).

Rewrite this to get the free energy FG(d), and juxtapose this with the previous section's
expression for the same quantity:

FG(d) = ⟨ EG(XY;d) ⟩G − HG(XY|d)
FG(d) = ⟨ EG(XY;d) ⟩R − HR(XY|d) − KL[ pR(XY|d), pG(XY|d) ]
So what's going on? We have a simple expression for the generative Helmholtz free energy,
and then a more complicated expression for the same thing involving a whole new quantity, the
distribution pR. As it turns out, this involves us in something called a variational method.
Since KL divergences are never negative, we have ⟨ EG(XY;d) ⟩R − HR(XY|d) − FG(d) ≥ 0,
i.e.

FG(d) ≤ ⟨ EG(XY;d) ⟩R − HR(XY|d)

Define the variational free energy (from R to G) as

FRG(d) ≡ ⟨ EG(XY;d) ⟩R − HR(XY|d)                                      (6.1a)
       = FG(d) + KL[ pR(XY|d), pG(XY|d) ]                              (6.1b)

Were the variational free energy FRG(d) minimized over both R and G, we would find that pR ≈
pG and FG(d) would be minimized.
So far nothing in this section has assumed anything about where pR(xy|d) comes from. Now is
the time to pin it down.
pR(x|y) = ∏_k pR(x_k|y)^{x_k} [ 1 − pR(x_k|y) ]^{1 − x_k}              where pR(x_k|y) = σ( Σ_{j=1}^{M} wR_kj y_j + wR_{k,M+1} )

pR(y|d) = ∏_j pR(y_j|d)^{y_j} [ 1 − pR(y_j|d) ]^{1 − y_j}              where pR(y_j|d) = σ( Σ_{i=1}^{N} vR_ji d_i + vR_{j,N+1} )
Since the pattern d is an input, there are no bias weights coming into the data layer from below.
Thus unlike the generative case there are only two equations here. The resulting neural network,
with both generative and recognition connections, is shown below.
[Figure: the complete Helmholtz machine. The generative connections (bG, WG, VG) run
downward through x, y, d, while the recognition connections (VR, WR) run upward from d
through y to x.]
8. Learning
We saw in section 5 that a Helmholtz machine can learn the structure of the world by
minimizing its own generative free energy FG(d). The idea is to use a gradient descent method
for this. However, calculation of gradients is complicated if one uses the direct form of FG(d).
That led us to turn to the variational free energy FRG(d), involving a separate recognition
distribution. Minimizing FRG(d) simultaneously minimizes the free energy FG(d) and the KL
divergence from pR(XY|d) to pG(XY|d), which will make recognition and generation
approximate inverses of each other.
Each iteration in this revised gradient descent algorithm will involve two phases, one involving
the generative weights and one involving the recognition weights. That is, we alternately make
small changes in G and in R.
In the first phase, to be called the wake phase, we make a small change in the generative
weights to decrease the variational free energy from equation (6.1b):
FRG(d) = FG(d) + KL[pR(XY|d), pG(XY|d)].
There is one free variable here: d. Where does it come from? We sample it from the real world
p(d) (not from the generative distribution), as our original goal was to minimize FG(d) where
the average is taken according to the real world distribution. So we sample a d from the world
and tweak the bG, WG, VG so as to decrease FRG(d).
In the second phase, to be called the sleep phase, we make a small change in the recognition
weights to decrease not FRG(d) but a slightly different quantity that approximates it:
F̃RG(d) = FG(d) + KL[ pG(XY|d), pR(XY|d) ] .

Note the order of the KL arguments is switched! Since the KL divergence is not symmetric,
this is a nontrivial kludge. It gives us only one term involving R, rather than two, a very
efficient form which simplifies the learning algorithm.
These adjustments will occur by gradient descent. So we now turn to evaluation of the
derivatives of FRG(d) and F̃RG(d) in order to derive the algorithm.
9. Wake
For the wake phase we will take the derivatives of the variational free energy with respect to
the generative weights. Let's write FRG(d) in the form from equation (6.1a):

FRG(d) = ⟨ EG(XY;d) ⟩R − HR(XY|d)                                      (9.1)

Now our gradient descent learning algorithm will sample a pattern d from the world (according
to p(d)), then update the weights by the small step downhill −ε ∇G FRG(d), where ε > 0 is small.
How do we evaluate this? The entropy term on the right hand side does not depend on G, so we
can just sample ∇G EG(xy;d) where x and y are drawn from the recognition distribution.
We turn to evaluating this gradient of the energy:

∇G EG(xy;d) = −∇G log pG(xyd)
            = −∇G log [ pG(x) pG(y|x) pG(d|y) ]
            = −∇G log pG(x) − ∇G log pG(y|x) − ∇G log pG(d|y)          (9.2)
The needed expressions for log pG(x), log pG(y|x), and log pG(d|y) are available from the last
equations in section 3 for the three layers of weights from the top down:
log pG(x) = log ∏_k ξ_k^{x_k} ( 1 − ξ_k )^{1 − x_k}
          = Σ_k [ x_k log ξ_k + (1 − x_k) log ( 1 − ξ_k ) ]            where ξ_k ≡ σ( bG_k )

log pG(y|x) = log ∏_j ψ_j^{y_j} ( 1 − ψ_j )^{1 − y_j}
            = Σ_j [ y_j log ψ_j + (1 − y_j) log ( 1 − ψ_j ) ]          where ψ_j ≡ σ( Σ_{k=1}^{L} wG_jk x_k + wG_{j,L+1} )

log pG(d|y) = log ∏_i δ_i^{d_i} ( 1 − δ_i )^{1 − d_i}
            = Σ_i [ d_i log δ_i + (1 − d_i) log ( 1 − δ_i ) ]          where δ_i ≡ σ( Σ_{j=1}^{M} vG_ij y_j + vG_{i,M+1} )

(Here ξ_k, ψ_j, and δ_i abbreviate the generative firing probabilities pG(x_k), pG(y_j|x), and pG(d_i|y).)
The various connection weight derivatives that make up ∇G EG(xy;d) are then straightforward
to calculate. These calculations use the fact that if a = σ(b), then da/db = a (1 − a).
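For completeness, this standard identity follows in one line from the definition of σ:

dσ/du = e^(−u) / ( 1 + e^(−u) )² = [ 1 / ( 1 + e^(−u) ) ] · [ e^(−u) / ( 1 + e^(−u) ) ] = σ(u) ( 1 − σ(u) )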
∂ log pG(x) / ∂bG_k
= ∂/∂bG_k Σ_K [ x_K log ξ_K + (1 − x_K) log ( 1 − ξ_K ) ]
= ∂/∂bG_k [ x_k log ξ_k ] + ∂/∂bG_k [ (1 − x_k) log ( 1 − ξ_k ) ]
= x_k ∂/∂bG_k log ξ_k + (1 − x_k) ∂/∂bG_k log ( 1 − ξ_k )
= ( x_k / ξ_k ) ∂ξ_k/∂bG_k − [ (1 − x_k) / ( 1 − ξ_k ) ] ∂ξ_k/∂bG_k
= ( x_k / ξ_k ) ξ_k ( 1 − ξ_k ) − [ (1 − x_k) / ( 1 − ξ_k ) ] ξ_k ( 1 − ξ_k )
= x_k ( 1 − ξ_k ) − ξ_k ( 1 − x_k )
= x_k − ξ_k .

∂ log pG(y|x) / ∂bG_k = 0

∂ log pG(d|y) / ∂bG_k = 0
∂ log pG(x) / ∂wG_jk = 0

∂ log pG(y|x) / ∂wG_jk = ( y_j − ψ_j ) x_k ,   k = 1, 2, ..., L
                       = y_j − ψ_j ,           k = L + 1

∂ log pG(d|y) / ∂wG_jk = 0

∂ log pG(x) / ∂vG_ij = 0

∂ log pG(y|x) / ∂vG_ij = 0

∂ log pG(d|y) / ∂vG_ij = ( d_i − δ_i ) y_j ,   j = 1, 2, ..., M
                       = d_i − δ_i ,           j = M + 1

Combining all this with equation (9.2) and writing it in vector form, we have

−∇_bG EG(xy;d) = ( x − ξ )
−∇_WG EG(xy;d) = ( y − ψ ) [ x | 1 ]^T
−∇_VG EG(xy;d) = ( d − δ ) [ y | 1 ]^T

Keep note of where the quantities on the right hand side came from: d came from the world; x
and y came from the recognition distribution given d; and ξ, ψ, and δ came from the generative
distribution given x and y.
This means gradient descent takes the form of very simple weight increments at each layer:

bG += ε ( x − ξ )
WG += ε ( y − ψ ) [ x | 1 ]^T
VG += ε ( d − δ ) [ y | 1 ]^T

where ε is the learning rate. It is not unusual in neural smithing to allow learning rates to
vary from layer to layer and from iteration to iteration.
This rule has fortuitously turned out to be nothing but a variant of the classic delta rule, in
which the change of a connection weight is proportional to an outer product. That is, weight
matrices change according to a form like M += ε a b^T. In components, this is

m_ij += ε a_i b_j

so the change in the connection between i and j only involves what happens at the neurons at
each end, i and j. Indeed, algorithms for changing neural network weight matrices can only
fairly be called neural network algorithms if the weight changes are local in this way.
We can summarize the wake phase of an iteration in the following pseudocode.
WAKE PHASE
// Experience reality!
d = getSampleFromWorld( )
// Pass sense datum up through the recognition network
y = SAMPLE( σ( VR [d | 1] ) )
x = SAMPLE( σ( WR [y | 1] ) )
// Pass back down through the generation network, saving the computed probabilities
ξ = σ( bG )
ψ = σ( WG [x | 1] )
δ = σ( VG [y | 1] )
// Adjust generative weights by the delta rule
bG += ε ( x − ξ )
WG += ε ( y − ψ ) [x | 1]^T
VG += ε ( d − δ ) [y | 1]^T
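A runnable Python version of this wake phase (a minimal sketch; the world sampler, layer sizes, and learning rate are illustrative assumptions, not the tutorial's own code) might look like this:

import numpy as np

rng = np.random.default_rng(0)

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

def sample(p):
    # SAMPLE[p]: yield 1 with probability p, componentwise
    return (rng.random(p.shape) < p).astype(float)

def append_one(a):
    # the [a|1] notation: a with a 1 appended
    return np.append(a, 1.0)

def wake_phase(d, bG, WG, VG, WR, VR, eps=0.01):
    # One wake-phase update of the generative weights, given a world pattern d.
    # Pass the sense datum up through the recognition network.
    y = sample(sigmoid(VR @ append_one(d)))
    x = sample(sigmoid(WR @ append_one(y)))
    # Pass back down through the generative network, saving the probabilities.
    xi    = sigmoid(bG)
    psi   = sigmoid(WG @ append_one(x))
    delta = sigmoid(VG @ append_one(y))
    # Adjust the generative weights by the delta rule (updates are in place).
    bG += eps * (x - xi)
    WG += eps * np.outer(y - psi, append_one(x))
    VG += eps * np.outer(d - delta, append_one(y))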
10. Sleep
For the sleep phase we will take the derivatives with respect to the recognition weights of an
approximation F̃RG(d) to the variational free energy FRG(d):

F̃RG(d) = FG(d) + KL[ pG(XY|d), pR(XY|d) ] ,

noting that the first term depends only on G, so it can be considered a constant as we now focus
on the dependence on R.
Let the symbol ∇R stand for any derivative with respect to the recognition weights (i.e. for
∂/∂wR_kj or ∂/∂vR_ji). Then:
∇R F̃RG(d)
= ∇R KL[ pG(XY|d), pR(XY|d) ]
= ∇R [ Σ_xy pG(xy|d) log pG(xy|d) − Σ_xy pG(xy|d) log pR(xy|d) ]
= − ∇R Σ_xy pG(xy|d) log pR(xy|d)
= − Σ_xy pG(xy|d) ∇R log pR(xy|d)
= − ⟨ ∇R log pR(XY|d) ⟩G                                               (10.1)
Now our gradient descent learning algorithm will, given a pattern d, update the weights by the
small step downhill −ε ∇R F̃RG(d), where ε > 0 is small. How do we evaluate this? The right
hand side just above says we just use a sample ∇R log pR(xy|d) where x and y are drawn from
the generative distribution. But exactly what d do we use here? We defer this question until we
have developed the update rule.
So now we turn to evaluating the gradient of log pR(xy|d), using the layered nature of R we
introduced in section 7:

∇R log pR(xy|d) = ∇R log pR(x|y) + ∇R log pR(y|d)                      (10.2)

We get the needed expressions for log pR(x|y) and log pR(y|d) from the last equations in
section 7 for the two layers of recognition weights, from the top down:
log pR(x|y) = log ∏_k ρ_k^{x_k} ( 1 − ρ_k )^{1 − x_k}
            = Σ_k [ x_k log ρ_k + (1 − x_k) log ( 1 − ρ_k ) ]          where ρ_k ≡ σ( Σ_{j=1}^{M} wR_kj y_j + wR_{k,M+1} )

log pR(y|d) = log ∏_j η_j^{y_j} ( 1 − η_j )^{1 − y_j}
            = Σ_j [ y_j log η_j + (1 − y_j) log ( 1 − η_j ) ]          where η_j ≡ σ( Σ_{i=1}^{N} vR_ji d_i + vR_{j,N+1} )

(Here ρ_k and η_j abbreviate the recognition firing probabilities pR(x_k|y) and pR(y_j|d).)
The various connection weight derivatives that make up ∇R log pR(xy|d) are then
straightforward to calculate. The algebra and elementary calculus follow the same pattern as
in the previous section. For the middle-to-top recognition weights:

∂ log pR(x|y) / ∂wR_kj = ( x_k − ρ_k ) y_j ,   j = 1, 2, ..., M
                       = x_k − ρ_k ,           j = M + 1

∂ log pR(y|d) / ∂wR_kj = 0

and for the bottom-to-middle recognition weights:

∂ log pR(x|y) / ∂vR_ji = 0

∂ log pR(y|d) / ∂vR_ji = ( y_j − η_j ) d_i ,   i = 1, 2, ..., N
                       = y_j − η_j ,           i = N + 1
This means gradient descent takes the form of very simple weight increments at each layer:

WR += ε ( x − ρ ) [ y | 1 ]^T
VR += ε ( y − η ) [ d | 1 ]^T.
As before, keep note of where the quantities on the right hand side came from: x and y come
from the generative distribution; and ρ and η come from the recognition distribution given y
and d. What about d? In the wake phase, we started at the beginning of the recognition causal
chain d → y → x, generating the explanation xy by sampling d from the world. By symmetry,
in the sleep phase we start at the beginning of the generative causal chain 1 → x → y → d by
sampling x from the generative bias input and feeding it forward to produce the pattern d.
So again we have a delta rule, nicely symmetric with the wake phase. In summary:
SLEEP PHASE
// Initiate a dream!
x = SAMPLE( σ( bG ) )
// Pass the dream signal down through the generation network
y = SAMPLE( σ( WG [x | 1] ) )
d = SAMPLE( σ( VG [y | 1] ) )
// Pass back up through the recognition network, saving the computed probabilities
η = σ( VR [d | 1] )
ρ = σ( WR [y | 1] )
// Adjust recognition weights by the delta rule
VR += ε ( y − η ) [d | 1]^T
WR += ε ( x − ρ ) [y | 1]^T
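Continuing the Python sketch begun after the wake-phase pseudocode (the same assumed helpers sigmoid, sample, and append_one, and the same in-place update convention), the sleep phase and a bare wake-sleep loop might look like this:

def sleep_phase(bG, WG, VG, WR, VR, eps=0.01):
    # One sleep-phase update of the recognition weights, driven by a dream.
    x = sample(sigmoid(bG))
    y = sample(sigmoid(WG @ append_one(x)))
    d = sample(sigmoid(VG @ append_one(y)))
    # Pass back up through the recognition network, saving the probabilities.
    eta = sigmoid(VR @ append_one(d))
    rho = sigmoid(WR @ append_one(y))
    # Adjust the recognition weights by the delta rule.
    VR += eps * np.outer(y - eta, append_one(d))
    WR += eps * np.outer(x - rho, append_one(y))

def wake_sleep(world_sampler, L, M, N, iterations=10000, eps=0.01):
    # Alternate wake and sleep updates; world_sampler() returns one pattern d.
    bG = np.zeros(L)
    WG = np.zeros((M, L + 1)); VG = np.zeros((N, M + 1))
    WR = np.zeros((L, M + 1)); VR = np.zeros((M, N + 1))
    for _ in range(iterations):
        wake_phase(world_sampler(), bG, WG, VG, WR, VR, eps)
        sleep_phase(bG, WG, VG, WR, VR, eps)
    return bG, WG, VG, WR, VR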
If we kick away some of the scaffolding we used to get to this point, the optimization problem
begins to look quite simple itself. Equation (9.1) showed that the wake phase's minimization of
the variational free energy just amounts to minimizing the expected energy, which means
maximizing ⟨ log pG(XYd) ⟩R . Equation (10.1) showed that the sleep phase's minimization of
an approximation to the variational free energy just amounts to maximizing ⟨ log pR(XY|d) ⟩G .
In summary, we have the following bare-bones description:

G (wake): adjust the generative weights to increase log pG(xyd), where d is sampled from the
world and the explanation xy is sampled from the recognition distribution pR(xy|d).
R (sleep): adjust the recognition weights to increase log pR(xy|d), where the dream xyd is
sampled from the generative distribution.

We have found that one easy way to implement such a stochastic causal chain is through a
neural network, with connections running in opposite directions. And when the neurons use
sigmoid activation functions, the derivatives of log probabilities of outputs with respect to
inputs always take on a simple delta rule form; this allows the wake-sleep algorithm to work
via simple local weight updates.
12. A Demonstration
We conclude with a quick confirmation that the code above actually works. This is a variant of
one example used in the Hinton-Dayan paper in Science [2]. The world consists of 3×3 images
of vertical and horizontal bars. Vertical bars occur with twice the probability of horizontal bars,
but within each category (horizontal or vertical) each of the possibilities occurs with the same
probability.
By contrast, Hinton and Dayan used 4×4 images, but all nonzero probabilities were identical,
so that the problem amounted to learning a binary classification. Here we set things up more
generally, to show the Helmholtz machine trying to match a nontrivial probability distribution.
We take d ∈ {0,1}^9, so there are 512 possible patterns. To get the distribution above, we
assign probabilities as follows:
Pattern d                 World p(d)    Helmholtz machine pG(d)

Vertical bars:
[001001001]               0.1111        0.1075
[011011011]               0.1111        0.1070
[100100100]               0.1111        0.0999
[110110110]               0.1111        0.0981
[010010010]               0.1111        0.0955
[101101101]               0.1111        0.0911

Horizontal bars:
[000111111]               0.0556        0.0615
[111000111]               0.0556        0.0529
[000111000]               0.0556        0.0455
[111111000]               0.0556        0.0419
[000000111]               0.0556        0.0415
[111000000]               0.0556        0.0391

Patterns not in the world:
[000010111]               0             0.0412
[111101000]               0             0.0287
[111100000]               0             0.0105
[000011111]               0             0.0079
[001011001]               0             0.0012
[001101001]               0             0.0011
[000111101]               0             0.0011
[110010110]               0             0.0010
others                    0             < 0.0010 each

(The twelve patterns that occur in the world together receive about 88% of the machine's generative probability.)
The machine has clearly captured much of the world's structure here: vertical bars appear with
higher probability than horizontal bars, all patterns in the world are generated by the machine,
and patterns not in the world occur with low probability and are mostly just a bit away from
real patterns. Yet, as experience with these machines has shown (e.g. [6]), the machine hasn't
quite captured the world in what we might judge to be the most natural way, as shown by the
unreal pattern 000010111 having probability 0.0412. El sueño de la razón produce monstruos
(the sleep of reason produces monsters).
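For readers who want to reproduce this experiment, here is one way (an illustrative sketch, not the tutorial's own code) to build the 3×3 bar world and sample from it; the resulting world_sampler can be fed to the wake_sleep sketch given earlier, with illustrative layer sizes such as wake_sleep(world_sampler, L=1, M=8, N=9):

import numpy as np

rng = np.random.default_rng(0)

def bar_world():
    # The twelve 3x3 bar images, flattened row-wise, with their probabilities.
    patterns, probs = [], []
    for cols in ([2], [1, 2], [0], [0, 1], [1], [0, 2]):      # vertical bars, p = 1/9 each
        img = np.zeros((3, 3)); img[:, cols] = 1
        patterns.append(img.flatten()); probs.append(1 / 9)
    for rows in ([1, 2], [0, 2], [1], [0, 1], [2], [0]):      # horizontal bars, p = 1/18 each
        img = np.zeros((3, 3)); img[rows, :] = 1
        patterns.append(img.flatten()); probs.append(1 / 18)
    return np.array(patterns), np.array(probs)

PATTERNS, PROBS = bar_world()

def world_sampler():
    # Draw one pattern d from p(d): vertical bars are twice as likely as horizontal ones.
    return PATTERNS[rng.choice(len(PATTERNS), p=PROBS)]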
Recall from section 5 that we began our work by trying to minimize KL[ p(D), pG(D) ]. We can
track the change in this quantity as the learning process proceeds¹; this is shown in the plot
below. The simulation stops to sample the generative distribution 10,000 times every 500
iterations and plots the KL divergence.

¹ Since our sampling will, early in the learning process, occasionally estimate pG(d) = 0 for some d with p(d) > 0,
we arbitrarily substitute a small value pG(d) = 10^(−6) << 1/512 in place of zero so the KL divergence will not blow
up. This is completely separate from the algorithm; it merely affects the display of results.
24
3.5
KL(p,pG)
2.5
1.5
0.5
0
0
10000
20000
30000
40000
50000
60000
iteration
Since sleep-wake learning is based on steepest descent, it is amenable to many neural smithing
techniques from the savvy optimizer's toolbox to improve performance further.
References
1. Russell, S. and P. Norvig. 2003. Artificial Intelligence: A Modern Approach, 2nd Ed.
Prentice-Hall.
2. Hinton, G.E., P. Dayan, B.J. Frey and R.M. Neal. 1995. The wake-sleep algorithm for
unsupervised neural networks. Science 268: 1158-1161.
3. Dayan, P., G.E. Hinton, R.M. Neal, and R.S. Zemel. 1995. The Helmholtz machine. Neural
Computation 7:5, 889-904.
4. Dayan, P. and L.F. Abbott. 2001. Theoretical Neuroscience: Computational and
Mathematical Modeling of Neural Systems. MIT Press.
5. Dayan, P. 2003. Helmholtz machines and sleep-wake learning. In M.A. Arbib, Ed.,
Handbook of Brain Theory and Neural Networks, Second Edition. MIT Press.
6. Dayan, P. and G.E. Hinton. 1996. Varieties of Helmholtz machine. Neural Networks 9:8,
1385-1403.
7. Ikeda, S., S.-I. Amari and H. Nakahara. 1999. Convergence of the wake-sleep algorithm. In
M.S. Kearns et al, Eds., Advances in Neural Information Processing Systems 11.