Deep Learning Book

Chapter 14: Autoencoders
An autoencoder is a neural network that is trained to attempt to copy its input to its output. Internally, it has a hidden layer h that describes a code used to represent the input. The network may be viewed as consisting of two parts: an encoder function h = f(x) and a decoder that produces a reconstruction r = g(h). This architecture is presented in figure 14.1. If an autoencoder succeeds in simply learning to set g(f(x)) = x everywhere, then it is not especially useful. Instead, autoencoders are designed to be unable to learn to copy perfectly. Usually they are restricted in ways that allow them to copy only approximately, and to copy only input that resembles the training data. Because the model is forced to prioritize which aspects of the input should be copied, it often learns useful properties of the data.
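As a concrete illustration of this structure (a sketch, not code from the book), the following Python/PyTorch snippet defines an encoder f and a decoder g and composes them into a reconstruction r = g(f(x)); the layer sizes, activations, and the use of PyTorch are assumptions made for the example.

    import torch
    import torch.nn as nn

    class Autoencoder(nn.Module):
        """Minimal autoencoder computing r = g(f(x)); sizes are illustrative assumptions."""
        def __init__(self, n_input=784, n_code=32):
            super().__init__()
            # Encoder f: maps the input x to the code h = f(x).
            self.f = nn.Sequential(nn.Linear(n_input, n_code), nn.ReLU())
            # Decoder g: maps the code h to the reconstruction r = g(h).
            self.g = nn.Sequential(nn.Linear(n_code, n_input), nn.Sigmoid())

        def forward(self, x):
            h = self.f(x)   # code layer
            r = self.g(h)   # reconstruction
            return r, h

    x = torch.rand(16, 784)      # a fake minibatch standing in for real data
    r, h = Autoencoder()(x)
    print(r.shape, h.shape)      # torch.Size([16, 784]) torch.Size([16, 32])

Giving the code layer h fewer dimensions than the input, as in this sketch, is one simple way of restricting the autoencoder so that it can copy its input only approximately.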
(Figure 14.1: an autoencoder maps an input x through the encoder f to a code h = f(x); the decoder g then produces a reconstruction r = g(h).)

Modern autoencoders have generalized the idea of an encoder and a decoder beyond deterministic functions to stochastic mappings p_encoder(h | x) and p_decoder(x | h).

The idea of autoencoders has been part of the historical landscape of neural networks for decades (LeCun, 1987; Bourlard and Kamp, 1988; Hinton and Zemel, 1994). Traditionally, autoencoders were used for dimensionality reduction or feature learning. Recently, theoretical connections between autoencoders and latent variable models have brought autoencoders to the forefront of generative modeling, as we will see in chapter 20. Autoencoders may be thought of as being a special case of feedforward networks and may be trained with all the same techniques, typically minibatch gradient descent following gradients computed by back-propagation. Unlike general feedforward networks, autoencoders may also be trained using recirculation (Hinton and McClelland, 1988), a learning algorithm based on comparing the activations of the network on the original input to the activations on the reconstructed input. Recirculation is regarded as more biologically plausible than back-propagation but is rarely used for machine learning applications.
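Since an autoencoder is a special case of a feedforward network, it can be trained with minibatch gradient descent on a reconstruction loss, with gradients computed by back-propagation, exactly as described above. The sketch below is illustrative only; the toy data, layer sizes, learning rate, and squared-error loss are assumptions.

    import torch
    import torch.nn as nn

    data = torch.rand(1024, 64)                      # toy data standing in for a training set

    f = nn.Sequential(nn.Linear(64, 8), nn.ReLU())   # encoder
    g = nn.Linear(8, 64)                             # decoder
    opt = torch.optim.SGD(list(f.parameters()) + list(g.parameters()), lr=0.1)

    for epoch in range(10):
        perm = torch.randperm(len(data))
        for i in range(0, len(data), 32):            # minibatches of 32 examples
            x = data[perm[i:i + 32]]
            r = g(f(x))                              # reconstruction r = g(f(x))
            loss = ((r - x) ** 2).mean()             # reconstruction error
            opt.zero_grad()
            loss.backward()                          # gradients via back-propagation
            opt.step()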
Unfortunately, if the encoder and decoder are allowed too much capacity, the autoencoder can learn to perform the copying task without extracting useful information about the distribution of the data. Theoretically, one could imagine that an autoencoder with a one-dimensional code but a very powerful nonlinear encoder could learn to represent each training example x^(i) with the code i. The decoder could learn to map these integer indices back to the values of specific training examples. This specific scenario does not occur in practice, but it illustrates clearly that an autoencoder trained to perform the copying task can fail to learn anything useful about the dataset if the capacity of the autoencoder is allowed to become too great.

The variational autoencoder (section 20.10.3) and the generative stochastic networks (section 20.12) naturally learn high-capacity, overcomplete encodings of the input and do not require regularization for these encodings to be useful. Their encodings are naturally useful because the models were trained to approximately maximize the probability of the training data rather than to copy the input to the output.
14.2.1 Sparse Autoencoders

A sparse autoencoder is simply an autoencoder whose training criterion involves a sparsity penalty Ω(h) on the code layer h, in addition to the reconstruction error:

L(x, g(f(x))) + Ω(h),    (14.2)

where g(h) is the decoder output, and typically we have h = f(x), the encoder output.

Sparse autoencoders are typically used to learn features for another task, such as classification. An autoencoder that has been regularized to be sparse must respond to unique statistical features of the dataset it has been trained on, rather than simply acting as an identity function. In this way, training to perform the copying task with a sparsity penalty can yield a model that has learned useful features as a byproduct.

We can think of the penalty Ω(h) simply as a regularizer term added to a feedforward network whose primary task is to copy the input to the output (unsupervised learning objective) and possibly also perform some supervised task (with a supervised learning objective) that depends on these sparse features. Unlike other regularizers, such as weight decay, there is not a straightforward Bayesian interpretation to this regularizer. As described in section 5.6.1, training with weight decay and other regularization penalties can be interpreted as a MAP approximation to Bayesian inference, with the added regularizing penalty corresponding to a prior probability distribution over the model parameters. In this view, regularized maximum likelihood corresponds to maximizing p(θ | x), which is equivalent to maximizing log p(x | θ) + log p(θ). The log p(x | θ) term is the usual data log-likelihood term, and the log p(θ) term, the log-prior over parameters, incorporates the preference over particular values of θ. This view is described in section 5.6. Regularized autoencoders defy such an interpretation because the regularizer depends on the data and is therefore by definition not a prior in the formal sense of the word. We can still think of these regularization terms as implicitly expressing a preference over functions.

Rather than thinking of the sparsity penalty as a regularizer for the copying task, we can think of the entire sparse autoencoder framework as approximating maximum likelihood training of a generative model that has latent variables.
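Returning briefly to equation 14.2: the sketch below (an illustration, not the book's code) adds a sparsity penalty on the code layer to the reconstruction error. The particular choice Ω(h) = λ Σ_i |h_i|, the architecture, and the value of λ are assumptions for the example; this absolute value penalty is exactly the form derived from the Laplace prior below.

    import torch
    import torch.nn as nn

    f = nn.Sequential(nn.Linear(64, 128), nn.ReLU())     # overcomplete encoder (assumed sizes)
    g = nn.Linear(128, 64)                               # decoder
    lam = 1e-3                                           # sparsity strength, a hyperparameter

    x = torch.rand(32, 64)                               # fake minibatch
    h = f(x)                                             # code h = f(x)
    reconstruction_error = ((g(h) - x) ** 2).mean()      # L(x, g(f(x)))
    sparsity_penalty = lam * h.abs().sum(dim=1).mean()   # Omega(h) = lambda * sum_i |h_i|
    loss = reconstruction_error + sparsity_penalty       # the criterion of equation 14.2
    loss.backward()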
Suppose we have a model with visible variables x and latent variables h, with an explicit joint distribution p_model(x, h) = p_model(h) p_model(x | h). We refer to p_model(h) as the model's prior distribution over the latent variables, representing the model's beliefs prior to seeing x. This is different from the way we have previously used the word "prior," to refer to the distribution p(θ) encoding our beliefs about the model's parameters before we have seen the training data. The log-likelihood can be decomposed as

log p_model(x) = log Σ_h p_model(h, x).    (14.3)

We can think of the autoencoder as approximating this sum with a point estimate for just one highly likely value of h. This is similar to the sparse coding generative model (section 13.4), but with h being the output of the parametric encoder rather than the result of an optimization that infers the most likely h. From this point of view, with this chosen h, we are maximizing

log p_model(h, x) = log p_model(h) + log p_model(x | h).    (14.4)

The log p_model(h) term can be sparsity inducing. For example, the Laplace prior,

p_model(h_i) = (λ/2) exp(−λ|h_i|),    (14.5)

corresponds to an absolute value sparsity penalty. Expressing the log-prior as an absolute value penalty, we obtain

Ω(h) = λ Σ_i |h_i|,    (14.6)

−log p_model(h) = Σ_i (λ|h_i| − log(λ/2)) = Ω(h) + const,    (14.7)

where the constant term depends only on λ and not on h. We typically treat λ as a hyperparameter and discard the constant term since it does not affect the parameter learning. Other priors, such as the Student t prior, can also induce sparsity. From this point of view of sparsity as resulting from the effect of p_model(h) on approximate maximum likelihood learning, the sparsity penalty is not a regularization term at all. It is just a consequence of the model's distribution over its latent variables. This view provides a different motivation for training an autoencoder: it is a way of approximately training a generative model. It also provides a different reason for why the features learned by the autoencoder are useful: they describe the latent variables that explain the input.
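The correspondence in equations 14.5 to 14.7 can be checked numerically: the negative log-density of the Laplace prior is exactly λ|h_i| minus the constant log(λ/2). The small NumPy check below is illustrative only.

    import numpy as np

    lam = 2.0
    h = np.linspace(-3.0, 3.0, 7)

    log_prior = np.log(lam / 2.0) - lam * np.abs(h)   # log p_model(h_i), from equation 14.5
    penalty = lam * np.abs(h) - np.log(lam / 2.0)     # lambda*|h_i| plus the constant, equation 14.7

    assert np.allclose(-log_prior, penalty)           # -log p_model(h_i) = Omega term + const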
Early work on sparse autoencoders (Ranzato et al., 2007a, 2008) explored various forms of sparsity and proposed a connection between the sparsity penalty and the log Z term that arises when applying maximum likelihood to an undirected probabilistic model p(x) = (1/Z) p̃(x). The idea is that minimizing log Z prevents a probabilistic model from having high probability everywhere, and imposing sparsity on an autoencoder prevents the autoencoder from having low reconstruction error everywhere. In this case, the connection is on the level of an intuitive understanding of a general mechanism rather than a mathematical correspondence. The interpretation of the sparsity penalty as corresponding to log p_model(h) in a directed model p_model(h) p_model(x | h) is more mathematically straightforward.

One way to achieve actual zeros in h for sparse (and denoising) autoencoders was introduced in Glorot et al. (2011b). The idea is to use rectified linear units to produce the code layer. With a prior that actually pushes the representations to zero (like the absolute value penalty), one can thus indirectly control the average number of zeros in the representation.

14.2.2 Denoising Autoencoders

Rather than adding a penalty Ω to the cost function, we can obtain an autoencoder that learns something useful by changing the reconstruction error term of the cost function.

Traditionally, autoencoders minimize some function

L(x, g(f(x))),    (14.8)

where L is a loss function penalizing g(f(x)) for being dissimilar from x, such as the L² norm of their difference. This encourages g ∘ f to learn to be merely an identity function if they have the capacity to do so.

A denoising autoencoder (DAE) instead minimizes

L(x, g(f(x̃))),    (14.9)

where x̃ is a copy of x that has been corrupted by some form of noise. Denoising autoencoders must therefore undo this corruption rather than simply copying their input.

Denoising training forces f and g to implicitly learn the structure of p_data(x), as shown by Alain and Bengio (2013) and Bengio et al. (2013c).
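As a concrete illustration of equation 14.9 (a sketch, not code from the book), the snippet below corrupts each input with additive Gaussian noise and trains the autoencoder to reconstruct the clean input; the architecture, noise level, and optimizer are assumptions.

    import torch
    import torch.nn as nn

    f = nn.Sequential(nn.Linear(64, 32), nn.ReLU())   # encoder
    g = nn.Linear(32, 64)                             # decoder
    opt = torch.optim.Adam(list(f.parameters()) + list(g.parameters()), lr=1e-3)
    sigma = 0.5                                       # corruption noise level (assumed)

    for step in range(100):
        x = torch.rand(32, 64)                        # fake clean minibatch
        x_tilde = x + sigma * torch.randn_like(x)     # corrupted copy, x~ drawn from C(x~ | x)
        loss = ((g(f(x_tilde)) - x) ** 2).mean()      # L(x, g(f(x~))), equation 14.9
        opt.zero_grad()
        loss.backward()
        opt.step()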
Denoising autoencoders thus provide yet another example of how useful properties can emerge as a byproduct of minimizing reconstruction error. They are also an example of how overcomplete, high-capacity models may be used as autoencoders as long as care is taken to prevent them from learning the identity function. Denoising autoencoders are presented in more detail in section 14.5.

14.2.3 Regularizing by Penalizing Derivatives

Figure 14.3: The computational graph of the cost function for a denoising autoencoder, which is trained to reconstruct the clean data point x from its corrupted version x̃. This is accomplished by minimizing the loss L = −log p_decoder(x | h = f(x̃)), where x̃ is a corrupted version of the data example x, obtained through a given corruption process C(x̃ | x). Typically the distribution p_decoder is a factorial distribution whose mean parameters are emitted by a feedforward network g.

1. Sample a training example x from the training data.

2. Sample a corrupted version x̃ from C(x̃ | x).

3. Use (x, x̃) as a training example for estimating the autoencoder reconstruction distribution p_reconstruct(x | x̃) = p_decoder(x | h), with h the output of encoder f(x̃) and p_decoder typically defined by a decoder g(h).

The denoising autoencoder is then simply a feedforward network and may be trained with exactly the same techniques as any other feedforward network. We can therefore view the DAE as performing stochastic gradient descent on the following expectation:

−E_{x∼p̂_data(x)} E_{x̃∼C(x̃|x)} log p_decoder(x | h = f(x̃)),    (14.14)

where p̂_data(x) is the training distribution.
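The three numbered steps and the expectation in equation 14.14 can be written out directly. The sketch below is illustrative only: it assumes binary data, a masking corruption process C(x̃ | x), and a factorial Bernoulli p_decoder whose mean parameters are emitted by g, so that the minimized loss is the negative log-likelihood −log p_decoder(x | h = f(x̃)).

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    f = nn.Sequential(nn.Linear(64, 32), nn.ReLU())   # encoder
    g = nn.Linear(32, 64)                             # decoder emits logits of Bernoulli means

    def corrupt(x, p=0.2):
        """Corruption process C(x~ | x): randomly zero a fraction p of the inputs (assumed)."""
        mask = (torch.rand_like(x) > p).float()
        return x * mask

    x = (torch.rand(32, 64) > 0.5).float()            # 1. sample training examples x (fake binary data)
    x_tilde = corrupt(x)                              # 2. sample corrupted versions x~ from C(x~ | x)
    h = f(x_tilde)                                    #    code h = f(x~)
    logits = g(h)                                     #    mean parameters of p_decoder(x | h)
    # 3. minimize -log p_decoder(x | h = f(x~)); averaged over the minibatch this is a
    #    Monte Carlo estimate of the expectation in equation 14.14.
    loss = F.binary_cross_entropy_with_logits(logits, x)
    loss.backward()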
14.5.1 Estimating the Score

Score matching (Hyvärinen, 2005) is an alternative to maximum likelihood. It provides a consistent estimator of probability distributions based on encouraging the model to have the same score as the data distribution at every training point x. In this context, the score is a particular gradient field:

∇_x log p(x).    (14.15)

Score matching is discussed further in section 18.4. For the present discussion regarding autoencoders, it is sufficient to understand that learning the gradient field of log p_data is one way to learn the structure of p_data itself.

A very important property of DAEs is that their training criterion (with conditionally Gaussian p(x | h)) makes the autoencoder learn a vector field (g(f(x)) − x) that estimates the score of the data distribution. This is illustrated in figure 14.4.

Denoising training of a specific kind of autoencoder (sigmoidal hidden units, linear reconstruction units) using Gaussian noise and mean squared error as the reconstruction cost is equivalent (Vincent, 2011) to training a specific kind of undirected probabilistic model called an RBM with Gaussian visible units. This kind of model is described in detail in section 20.5.1; for the present discussion, it suffices to know that it is a model that provides an explicit p_model(x; θ). When the RBM is trained using denoising score matching (Kingma and LeCun, 2010), its learning algorithm is equivalent to denoising training in the corresponding autoencoder. With a fixed noise level, regularized score matching is not a consistent estimator; it instead recovers a blurred version of the distribution. If the noise level is chosen to approach 0 when the number of examples approaches infinity, however, then consistency is recovered. Denoising score matching is discussed in more detail in section 18.5.

Other connections between autoencoders and RBMs exist. Score matching applied to RBMs yields a cost function that is identical to reconstruction error combined with a regularization term similar to the contractive penalty of the CAE (Swersky et al., 2011). Bengio and Delalleau (2009) showed that an autoencoder gradient provides an approximation to contrastive divergence training of RBMs.

For continuous-valued x, the denoising criterion with Gaussian corruption and reconstruction distribution yields an estimator of the score that is applicable to general encoder and decoder parametrizations (Alain and Bengio, 2013). This means a generic encoder-decoder architecture may be made to estimate the score by training with the squared error criterion

||g(f(x̃)) − x||²    (14.16)

and corruption

C(x̃ | x) = N(x̃; μ = x, Σ = σ²I),    (14.17)

with noise variance σ². See figure 14.5 for an illustration of how this works.

(Figure 14.5: a denoising autoencoder is trained to map corrupted points x̃, drawn from C(x̃ | x), back toward the original data point x via the reconstruction g ∘ f.)
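Given an autoencoder trained with the criterion of equations 14.16 and 14.17, the vector field g(f(x)) − x can be read off directly as a score estimate. The sketch below is illustrative; in particular, the 1/σ² scaling follows the small-noise analysis of Alain and Bengio (2013) and is an added assumption rather than something stated in the text above.

    import torch
    import torch.nn as nn

    sigma = 0.1                                       # std of the Gaussian corruption used in training (assumed)
    f = nn.Sequential(nn.Linear(2, 64), nn.Softplus())
    g = nn.Linear(64, 2)
    # ... assume f and g have been trained with the squared error criterion (14.16)
    # under the Gaussian corruption (14.17) ...

    def score_estimate(x):
        """Vector field g(f(x)) - x; divided by sigma**2 it approximates the score
        grad_x log p_data(x) for small noise (scaling convention assumed)."""
        return (g(f(x)) - x) / sigma ** 2

    x = torch.randn(5, 2)
    print(score_estimate(x))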
variance σ 2. See figure 14.5 for an illustration of how this works.
In general, there is no guarantee that the reconstruction g (f (x )) minus the
input x corresp
corresponds
onds to the gradient of an any y function, let alone to the score. That is
wh
whyy the early results (Vincen
Vincentt, 2011) are sp specialized
ecialized to particular parametrizations,
2
where g (f (x )) − x ma
may y be obtained by taking the deriv
derivativ
ativ
ativee of another function.
Kam
Kamyshansk
yshansk
yshanskaa and Memisevic (2015) generalized the results of Vincent (2011) by
iden
identifying
tifying a family of shallo
shallow
w auto
autoenco
enco
encodersders suc
suchh that g (2f (x)) − x corresp
corresponds
onds to
by training with the squared error
a score for all members of the family criterion
family..
g (f ( x
˜ )) x
2
So far we havhavee describ
described
ed only how the denoising auto autoenco
enco
encoderder learns to represent
(14.16)
a probabilit
probability y distribution. ˜=x
C (x More generally
generally,
˜ x) = (x , one may wan
want
˜ ; µ = x, Σ = σ I ) t to use the auto
autoenco
enco
encoder
der
anda corruption
as generativ
generativee model and draw samples from this distribution. This is describ described
ed
in section 20.11. σ (14.17)
with noise variance . See figure 14.5 for an illustration of how g (fthis
(x ))works.
14.5.1.1x Historical
In general, there is Perspective
no guarantee that the reconstruction minus the
input corresp onds to the gradient of any function, let alone to the score. That is
The idea g (fof )) x MLPs for denoising
(xusing || dates
− back
|| to the work of LeCun (1987)
why the early results (Vincent, 2011) are sp ecialized to particular parametrizations,
and Gallinari et al. (1987). Behnke (2001) also used recurren recurrentt netnetw works to denoise
where may be obtained by taking the derivative of another function.
images. Denoising auto autoenco
enco
encoders g ( f (x )) x
ders are, in some sense, just MLPs trained to denoise.
Kamyshanska and Memisevic (|2015) N generalized the results of Vincent (2011) by
The name “denoising autoenco
autoenco
encoder,” however, refers to a model
der,” how model that is intended not
identifying a family of shallow auto enco ders such that corresp onds to
merely to learn to denoise its input but to learn a go good
od ininternal
ternal representation
a score for all members of the family.
as a side effect of learning to denoise. This idea came m much
uch later ((Vincen
Vincen
Vincentt
So far we hav e describ ed only how the
al.,, 2008, 2010). The learned representation ma
et al. denoising auto
may enco der learns to
y then b e used to pretrain arepresent
a probabilit
deep
deeperer unsup y ervised
distribution.
unsupervised netw
networkMore
ork or agenerally
supervised, onenetw
mayork.
network.wanLikt to
Like use theauto
e sparse auto
autoencoenco
enco der
encoders,
ders,
as a generativ
sparse coding,e− model and
contractiv
contractive drawenco
e auto
autoenco ders, from
samples
encoders, this distribution.
and other regularized auto Thisenco
autoenco is describ
encoders, ed
ders, the
in section
motiv ation20.11
motivation .
for DAEs was to allo
allow
w the learning of a very high-capacity enco encoder
der
14.5.1.1 Historical Perspective
while prev
preven
en
enting
ting the enco
encoder
der and deco
decoder
der from learning a useless − iden
identit
tit
tityy function.
Prior to the introduction of the mo modern
dern DAE, Inay
Inayoshi
oshi and Kurita (2005)
explored someetofal.
the same goals with some of the same metho
methods.ds. Their approac
approach h
The idea of using MLPs for denoising dates back to the work of LeCun (1987)
minimizes reconstruction error in addition to a sup
supervised
ervised ob
objectiv
jectiv
jectivee while injecting
and Gallinari (1987). Behnke (2001) also used recurrent networks to denoise
noise in the hidden la lay
yer of a supervised MLP
MLP,, with the ob objective
jective to impro
improve
ve
images. Denoising auto enco ders are, in some sense, just MLPs trained to denoise.
generalization by intro
intro
troducing
ducing the reconstruction error and the injected noise. Their
The name “denoising auto enco der,” however, refers to a mo del that is intended not
metho
method d was based on a linear enco
encoder,
der, ho
how
wev
ever,
er, and could not learn function
merely to learn to denoise its input but to learn a go od internal representation
families
et al. as pow
powerful
erful as the mo
modern
dern DAE can.
as a side effect of learning to denoise. This idea came much later (Vincent
511
, 2008, 2010). The learned representation may then b e used to pretrain a
deep er unsup ervised network or a supervised network. Like sparse auto enco ders,
sparse coding, contractive auto enco ders, and other regularized auto enco ders, the
motivation for DAEs was to allow the learning of a very high-capacity enco der
while preventing the enco der and deco der from learning a useless identity function.
Prior to the introduction of the mo dern DAE, Inayoshi and Kurita (2005)
explored some of the same goals with some of the same metho ds. Their approach
minimizes reconstruction error in addition to a sup ervised ob jective while injecting
noise in the hidden layer of a supervised MLP, with the ob jective to improve
511
generalization by intro ducing the reconstruction error and the injected noise. Their
CHAPTER 14. AUTOENCODERS
512
CHAPTER 14. AUTOENCODERS
An embedding is typically given by a low-dimensional vector, with fewer dimensions than the "ambient" space of which the manifold is a low-dimensional subset. Some algorithms (nonparametric manifold learning algorithms, discussed below) directly learn an embedding for each training example, while others learn a more general mapping, sometimes called an encoder, or representation function, that maps any point in the ambient space (the input space) to its embedding.

Figure 14.7: If the autoencoder learns a reconstruction function that is invariant to small perturbations near the data points, it captures the manifold structure of the data. Here the manifold structure is a collection of 0-dimensional manifolds. The dashed diagonal line indicates the identity function target for reconstruction. The optimal reconstruction function crosses the identity function wherever there is a data point. The horizontal arrows at the bottom of the plot indicate the r(x) − x reconstruction direction vector at the base of the arrow, in input space, always pointing toward the nearest "manifold" (a single data point in the 1-D case). The denoising autoencoder explicitly tries to make the derivative of the reconstruction function r(x) small around the data points. The contractive autoencoder does the same for the encoder. Although the derivative of r(x) is asked to be small around the data points, it can be large between the data points. The space between the data points corresponds to the region between the manifolds, where the reconstruction function must have a large derivative to map corrupted points back onto the manifold.
Figure 14.8: Nonparametric manifold learning procedures build a nearest neighbor graph in which nodes represent training examples and directed edges indicate nearest neighbor relationships. Various procedures can thus obtain the tangent plane associated with a neighborhood of the graph as well as a coordinate system that associates each training example with a real-valued vector position, or embedding. It is possible to generalize such a representation to new examples by a form of interpolation. As long as the number of examples is large enough to cover the curvature and twists of the manifold, these approaches work well. Images from the QMUL Multiview Face Dataset (Gong et al., 2000).

Figure 14.9: If the tangent planes (see figure 14.6) at each location are known, then they can be tiled to form a global coordinate system or a density function. Each local patch can be thought of as a local Euclidean coordinate system or as a locally flat Gaussian, or "pancake," with a very small variance in the directions orthogonal to the pancake and a very large variance in the directions defining the coordinate system on the pancake. A mixture of these Gaussians provides an estimated density function, as in the manifold Parzen window algorithm (Vincent and Bengio, 2003) or in its nonlocal neural-net-based variant (Bengio et al., 2006c).
Such an approach tiles the manifold with a large number of locally linear Gaussian-like patches (or "pancakes," because the Gaussians are flat in the tangent directions).

A fundamental difficulty with such local nonparametric approaches to manifold learning is raised in Bengio and Monperrus (2005): if the manifolds are not very smooth (they have many peaks and troughs and twists), one may need a very large number of training examples to cover each one of these variations, with no chance to generalize to unseen variations. Indeed, these methods can only generalize the shape of the manifold by interpolating between neighboring examples. Unfortunately, the manifolds involved in AI problems can have very complicated structures that can be difficult to capture from only local interpolation. Consider for example the manifold resulting from translation shown in figure 14.6. If we watch just one coordinate within the input vector, x_i, as the image is translated, we will observe that one coordinate encounters a peak or a trough in its value once for every peak or trough in brightness in the image. In other words, the complexity of the patterns of brightness in an underlying image template drives the complexity of the manifolds that are generated by performing simple image transformations. This motivates the use of distributed representations and deep learning for capturing manifold structure.

14.7 Contractive Autoencoders

The contractive autoencoder (Rifai et al., 2011a,b) introduces an explicit regularizer on the code h = f(x), encouraging the derivatives of f to be as small as possible:

Ω(h) = λ ||∂f(x)/∂x||_F².    (14.18)

The penalty Ω(h) is the squared Frobenius norm (sum of squared elements) of the Jacobian matrix of partial derivatives associated with the encoder function.

There is a connection between the denoising autoencoder and the contractive autoencoder: Alain and Bengio (2013) showed that in the limit of small Gaussian input noise, the denoising reconstruction error is equivalent to a contractive penalty on the reconstruction function that maps x to r = g(f(x)). In other words, denoising autoencoders make the reconstruction function resist small but finite-sized perturbations of the input, while contractive autoencoders make the feature extraction function resist infinitesimal perturbations of the input. When using the Jacobian-based contractive penalty to pretrain features f(x) for use with a classifier, the best classification accuracy usually results from applying the contractive penalty to f(x) rather than to g(f(x)).
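For a single sigmoid encoder layer, the Jacobian in equation 14.18 has a simple closed form (each row of the weight matrix scaled by h_i(1 − h_i)), which makes the penalty cheap to compute. The sketch below is an illustration under that single-layer assumption.

    import torch
    import torch.nn as nn

    enc = nn.Linear(64, 32)                           # single-layer encoder (assumed): h = sigmoid(W x + b)
    lam = 0.1                                         # contraction strength

    x = torch.rand(16, 64)
    h = torch.sigmoid(enc(x))                         # code h = f(x)
    # For a sigmoid layer, dh_i/dx_j = h_i (1 - h_i) W_ij, so the squared Frobenius norm
    # of the Jacobian factorizes as sum_i (h_i (1 - h_i))^2 * sum_j W_ij^2.
    w_sq = (enc.weight ** 2).sum(dim=1)               # sum_j W_ij^2 for each code unit i
    contractive_penalty = lam * ((h * (1 - h)) ** 2 * w_sq).mean(dim=0).sum()

Adding this penalty to the reconstruction error gives the CAE training criterion.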
A contractive penalty on f(x) also has close connections to score matching, as discussed in section 14.5.1.

The name contractive arises from the way that the CAE warps space. Specifically, because the CAE is trained to resist perturbations of its input, it is encouraged to map a neighborhood of input points to a smaller neighborhood of output points. We can think of this as contracting the input neighborhood to a smaller output neighborhood.

To clarify, the CAE is contractive only locally: all perturbations of a training point x are mapped near to f(x). Globally, two different points x and x′ may be mapped to points f(x) and f(x′) that are farther apart than the original points. It is plausible that f could be expanding in-between or far from the data manifolds (see, for example, what happens in the 1-D toy example of figure 14.7). When the Ω(h) penalty is applied to sigmoidal units, one easy way to shrink the Jacobian is to make the sigmoid units saturate to 0 or 1. This encourages the CAE to encode input points with extreme values of the sigmoid, which may be interpreted as a binary code. It also ensures that the CAE will spread its code values throughout most of the hypercube that its sigmoidal hidden units can span.

We can think of the Jacobian matrix J at a point x as approximating the nonlinear encoder f(x) as being a linear operator. This allows us to use the word "contractive" more formally. In the theory of linear operators, a linear operator is said to be contractive if the norm of Jx remains less than or equal to 1 for all unit-norm x. In other words, J is contractive if its image of the unit sphere is completely encapsulated by the unit sphere. We can think of the CAE as penalizing the Frobenius norm of the local linear approximation of f(x) at every training point x in order to encourage each of these local linear operators to become a contraction.

As described in section 14.6, regularized autoencoders learn manifolds by balancing two opposing forces. In the case of the CAE, these two forces are reconstruction error and the contractive penalty Ω(h). Reconstruction error alone would encourage the CAE to learn an identity function. The contractive penalty alone would encourage the CAE to learn features that are constant with respect to x. The compromise between these two forces yields an autoencoder whose derivatives ∂f(x)/∂x are mostly tiny. Only a small number of hidden units, corresponding to a small number of directions in the input, may have significant derivatives.

The goal of the CAE is to learn the manifold structure of the data. Directions x with large Jx rapidly change h, so these are likely to be directions that approximate the tangent planes of the manifold. Experiments by Rifai et al. (2011a,b) show that training the CAE results in most singular values of J dropping below 1 in magnitude.
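This claim can be inspected directly: for a given encoder one can form the Jacobian at a training point and examine its singular values. The sketch below uses an untrained single-layer sigmoid encoder purely for illustration; for a trained CAE, most of the reported singular values would be expected to fall below 1.

    import torch
    import torch.nn as nn

    enc = nn.Linear(64, 32)                           # stand-in for a trained CAE encoder (assumed)

    x = torch.rand(64)                                # one training point
    h = torch.sigmoid(enc(x))
    J = (h * (1 - h)).unsqueeze(1) * enc.weight       # Jacobian dh/dx of a sigmoid layer, shape (32, 64)
    singular_values = torch.linalg.svdvals(J)
    print((singular_values < 1).float().mean())       # fraction of contractive directions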
(Figure: tangent vectors estimated by local PCA (no sharing across regions) and by the contractive autoencoder.)
[...] captures many of the desirable qualitative characteristics.

Another practical issue is that the contraction penalty can obtain useless results if we do not impose some sort of scale on the decoder. For example, the encoder could consist of multiplying the input by a small constant ε, and the decoder could consist of dividing the code by ε. As ε approaches 0, the encoder drives the contractive penalty Ω(h) to approach 0 without having learned anything about the distribution. Meanwhile, the decoder maintains perfect reconstruction. In Rifai et al. (2011a), this is prevented by tying the weights of f and g. Both f and g are standard neural network layers consisting of an affine transformation followed by an element-wise nonlinearity, so it is straightforward to set the weight matrix of g to be the transpose of the weight matrix of f.

14.8 Predictive Sparse Decomposition

Predictive sparse decomposition (PSD) is a model that is a hybrid of sparse coding and parametric autoencoders (Kavukcuoglu et al., 2008). A parametric encoder is trained to predict the output of iterative inference. PSD has been applied to unsupervised feature learning for object recognition in images and video (Kavukcuoglu et al., 2009, 2010; Jarrett et al., 2009; Farabet et al., 2011), as well as for audio (Henaff et al., 2011). The model consists of an encoder f(x) and a decoder g(h) that are both parametric. During training, h is controlled by the optimization algorithm. Training proceeds by minimizing

||x − g(h)||² + λ|h|_1 + γ||h − f(x)||².    (14.19)

As in sparse coding, the training algorithm alternates between minimization with respect to h and minimization with respect to the model parameters. Minimization with respect to h is fast because f(x) provides a good initial value of h, and the cost function constrains h to remain near f(x) anyway. Simple gradient descent can obtain reasonable values of h in as few as ten steps.

The training procedure used by PSD is different from first training a sparse coding model and then training f(x) to predict the values of the sparse coding features. The PSD training procedure regularizes the decoder to use parameters for which f(x) can infer good code values.

Predictive sparse coding is an example of learned approximate inference. In section 19.5, this topic is developed further. The tools presented in chapter 19 make it clear that PSD can be interpreted as training a directed sparse coding probabilistic model by maximizing a lower bound on the log-likelihood of the model.
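A sketch of the alternating minimization described above is given below (illustrative only): the code h is initialized at f(x), refined by a few gradient steps on the criterion of equation 14.19, and the model parameters are then updated with h held fixed. The architecture, step sizes, and number of inner steps are assumptions.

    import torch
    import torch.nn as nn

    f = nn.Sequential(nn.Linear(64, 32), nn.Tanh())   # parametric encoder
    g = nn.Linear(32, 64)                             # parametric decoder
    lam, gamma = 0.1, 1.0
    opt = torch.optim.Adam(list(f.parameters()) + list(g.parameters()), lr=1e-3)

    def psd_cost(x, h):
        # ||x - g(h)||^2 + lambda * |h|_1 + gamma * ||h - f(x)||^2  (equation 14.19)
        return (((g(h) - x) ** 2).sum(dim=1)
                + lam * h.abs().sum(dim=1)
                + gamma * ((h - f(x)) ** 2).sum(dim=1)).mean()

    for step in range(100):
        x = torch.rand(32, 64)                        # fake minibatch
        # Minimization with respect to h: a few gradient steps starting from f(x).
        h = f(x).detach().requires_grad_(True)
        for _ in range(10):
            (grad,) = torch.autograd.grad(psd_cost(x, h), h)
            h = (h - 0.1 * grad).detach().requires_grad_(True)
        # Minimization with respect to the model parameters, with h held fixed.
        opt.zero_grad()
        psd_cost(x, h.detach()).backward()
        opt.step()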