Deep Learning Book

Chapter 14

Autoencoders

An autoencoder is a neural network that is trained to attempt to copy its input to its output. Internally, it has a hidden layer h that describes a code used to represent the input. The network may be viewed as consisting of two parts: an encoder function h = f(x) and a decoder that produces a reconstruction r = g(h). This architecture is presented in figure 14.1. If an autoencoder succeeds in simply learning to set g(f(x)) = x everywhere, then it is not especially useful. Instead, autoencoders are designed to be unable to learn to copy perfectly. Usually they are restricted in ways that allow them to copy only approximately, and to copy only input that resembles the training data. Because the model is forced to prioritize which aspects of the input should be copied, it often learns useful properties of the data.
Modern autoencoders have generalized the idea of an encoder and a decoder beyond deterministic functions to stochastic mappings p_encoder(h | x) and p_decoder(x | h).

The idea of autoencoders has been part of the historical landscape of neural networks for decades (LeCun, 1987; Bourlard and Kamp, 1988; Hinton and Zemel, 1994). Traditionally, autoencoders were used for dimensionality reduction or feature learning. Recently, theoretical connections between autoencoders and latent variable models have brought autoencoders to the forefront of generative modeling, as we will see in chapter 20. Autoencoders may be thought of as being a special case of feedforward networks and may be trained with all the same techniques, typically minibatch gradient descent following gradients computed by back-propagation. Unlike general feedforward networks, autoencoders may also be trained using recirculation (Hinton and McClelland, 1988), a learning algorithm based on comparing the activations of the network on the original input to the activations on the reconstructed input. Recirculation is regarded as more biologically plausible than back-propagation but is rarely used for machine learning applications.
Figure 14.1: The general structure of an autoencoder, mapping an input x to an output (called reconstruction) r through an internal representation or code h. The autoencoder has two components: the encoder f (mapping x to h) and the decoder g (mapping h to r).
14.1 Undercomplete Autoencoders
Copying the input to the output may sound useless, but we are typically not interested in the output of the decoder. Instead, we hope that training the autoencoder to perform the input copying task will result in h taking on useful properties.
One way to obtain useful features from the autoencoder is to constrain h to have a smaller dimension than x. An autoencoder whose code dimension is less than the input dimension is called undercomplete. Learning an undercomplete representation forces the autoencoder to capture the most salient features of the training data.
The learning process is described simply as minimizing a loss function

L(x, g(f(x))),  (14.1)

where L is a loss function penalizing g(f(x)) for being dissimilar from x, such as the mean squared error.

When the decoder is linear and L is the mean squared error, an undercomplete autoencoder learns to span the same subspace as PCA. In this case, an autoencoder trained to perform the copying task has learned the principal subspace of the training data as a side effect.

Autoencoders with nonlinear encoder functions f and nonlinear decoder functions g can thus learn a more powerful nonlinear generalization of PCA. Unfortunately, if the encoder and decoder are allowed too much capacity, the autoencoder can learn to perform the copying task without extracting useful information about the distribution of the data. Theoretically, one could imagine that an autoencoder with a one-dimensional code but a very powerful nonlinear encoder could learn to represent each training example x^(i) with the code i. The decoder could learn to map these integer indices back to the values of specific training examples. This specific scenario does not occur in practice, but it illustrates clearly that an autoencoder trained to perform the copying task can fail to learn anything useful about the dataset if the capacity of the autoencoder is allowed to become too great.
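As a concrete illustration of equation 14.1, the sketch below builds an undercomplete autoencoder whose code dimension is smaller than the input dimension and trains it with mean squared error. This is not code from the book; it is a minimal sketch assuming PyTorch is available, and the layer sizes (784 inputs, a 32-dimensional code) are arbitrary illustrative choices.

```python
# Minimal sketch of an undercomplete autoencoder (assumes PyTorch; sizes are illustrative).
import torch
import torch.nn as nn

input_dim, code_dim = 784, 32          # code dimension < input dimension => undercomplete

encoder = nn.Sequential(nn.Linear(input_dim, 128), nn.ReLU(), nn.Linear(128, code_dim))
decoder = nn.Sequential(nn.Linear(code_dim, 128), nn.ReLU(), nn.Linear(128, input_dim))

params = list(encoder.parameters()) + list(decoder.parameters())
optimizer = torch.optim.Adam(params, lr=1e-3)
loss_fn = nn.MSELoss()                 # L(x, g(f(x))) with squared error, as in eq. 14.1

def train_step(x):                     # x: a minibatch of shape (batch, input_dim)
    h = encoder(x)                     # h = f(x)
    r = decoder(h)                     # r = g(h)
    loss = loss_fn(r, x)               # reconstruction error against the input itself
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

As noted above, with a linear decoder and this squared-error loss, the undercomplete autoencoder learns to span the same subspace as PCA; the nonlinearities here give the more powerful nonlinear generalization.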
14.2 Regularized Autoencoders

Undercomplete autoencoders, with code dimension less than the input dimension, can learn the most salient features of the data distribution. We have seen that these autoencoders fail to learn anything useful if the encoder and decoder are given too much capacity.

A similar problem occurs if the hidden code is allowed to have dimension equal to the input, and in the overcomplete case in which the hidden code has dimension greater than the input. In these cases, even a linear encoder and a linear decoder can learn to copy the input to the output without learning anything useful about the data distribution.

Ideally, one could train any architecture of autoencoder successfully, choosing the code dimension and the capacity of the encoder and decoder based on the complexity of distribution to be modeled. Regularized autoencoders provide the ability to do so. Rather than limiting the model capacity by keeping the encoder and decoder shallow and the code size small, regularized autoencoders use a loss function that encourages the model to have other properties besides the ability to copy its input to its output. These other properties include sparsity of the representation, smallness of the derivative of the representation, and robustness to noise or to missing inputs. A regularized autoencoder can be nonlinear and overcomplete but still learn something useful about the data distribution, even if the model capacity is great enough to learn a trivial identity function.

In addition to the methods described here, which are most naturally interpreted as regularized autoencoders, nearly any generative model with latent variables and equipped with an inference procedure (for computing latent representations given input) may be viewed as a particular form of autoencoder. Two generative modeling approaches that emphasize this connection with autoencoders are the descendants of the Helmholtz machine (Hinton et al., 1995b), such as the variational autoencoder (section 20.10.3) and the generative stochastic networks (section 20.12). These models naturally learn high-capacity, overcomplete encodings of the input and do not require regularization for these encodings to be useful. Their encodings are naturally useful because the models were trained to approximately maximize the probability of the training data rather than to copy the input to the output.
14.2.1 Sparse Autoencoders

A sparse autoencoder is simply an autoencoder whose training criterion involves a sparsity penalty Ω(h) on the code layer h, in addition to the reconstruction error:
L(x, g(f(x))) + Ω(h),  (14.2)

where g(h) is the decoder output, and typically we have h = f(x), the encoder output.

Sparse autoencoders are typically used to learn features for another task, such as classification. An autoencoder that has been regularized to be sparse must respond to unique statistical features of the dataset it has been trained on, rather than simply acting as an identity function. In this way, training to perform the copying task with a sparsity penalty can yield a model that has learned useful features as a byproduct.

We can think of the penalty Ω(h) simply as a regularizer term added to a feedforward network whose primary task is to copy the input to the output (unsupervised learning objective) and possibly also perform some supervised task (with a supervised learning objective) that depends on these sparse features. Unlike other regularizers, such as weight decay, there is not a straightforward Bayesian interpretation to this regularizer. As described in section 5.6.1, training with weight decay and other regularization penalties can be interpreted as a MAP approximation to Bayesian inference, with the added regularizing penalty corresponding to a prior probability distribution over the model parameters. In this view, regularized maximum likelihood corresponds to maximizing p(θ | x), which is equivalent to maximizing log p(x | θ) + log p(θ). The log p(x | θ) term is the usual data log-likelihood term, and the log p(θ) term, the log-prior over parameters, incorporates the preference over particular values of θ. This view is described in section 5.6. Regularized autoencoders defy such an interpretation because the regularizer depends on the data and is therefore by definition not a prior in the formal sense of the word. We can still think of these regularization terms as implicitly expressing a preference over functions.

Rather than thinking of the sparsity penalty as a regularizer for the copying task, we can think of the entire sparse autoencoder framework as approximating maximum likelihood training of a generative model that has latent variables. Suppose we have a model with visible variables x and latent variables h, with an explicit joint distribution p_model(x, h) = p_model(h) p_model(x | h). We refer to p_model(h) as the model's prior distribution over the latent variables, representing the model's beliefs prior to seeing x. This is different from the way we have previously used the word "prior," to refer to the distribution p(θ) encoding our beliefs about the model's parameters before we have seen the training data. The log-likelihood can be decomposed as
log p_model(x) = log Σ_h p_model(h, x).  (14.3)

We can think of the autoencoder as approximating this sum with a point estimate for just one highly likely value for h. This is similar to the sparse coding generative model (section 13.4), but with h being the output of the parametric encoder rather than the result of an optimization that infers the most likely h. From this point of view, with this chosen h, we are maximizing

log p_model(h, x) = log p_model(h) + log p_model(x | h).  (14.4)

The log p_model(h) term can be sparsity inducing. For example, the Laplace prior,

p_model(h_i) = (λ/2) e^(−λ|h_i|),  (14.5)

corresponds to an absolute value sparsity penalty. Expressing the log-prior as an absolute value penalty, we obtain

Ω(h) = λ Σ_i |h_i|,  (14.6)

−log p_model(h) = Σ_i (λ|h_i| − log(λ/2)) = Ω(h) + const,  (14.7)

where the constant term depends only on λ and not h. We typically treat λ as a hyperparameter and discard the constant term since it does not affect the parameter learning. Other priors, such as the Student t prior, can also induce sparsity. From this point of view of sparsity as resulting from the effect of p_model(h) on approximate maximum likelihood learning, the sparsity penalty is not a regularization term at all. It is just a consequence of the model's distribution over its latent variables. This view provides a different motivation for training an autoencoder: it is a way of approximately training a generative model. It also provides a different reason for why the features learned by the autoencoder are useful: they describe the latent variables that explain the input.
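The sketch below shows one way equation 14.2 can be realized with the absolute value penalty of equation 14.6: an L1 term on the code activations is simply added to the reconstruction loss. It is a hedged illustration rather than the book's code; the PyTorch dependency and the weight λ = 1e-3 are assumptions.

```python
# Sparse autoencoder loss: L(x, g(f(x))) + λ * Σ_i |h_i|  (eqs. 14.2 and 14.6).
# Sketch only; `encoder` and `decoder` are any modules such as those defined earlier.
import torch

sparsity_weight = 1e-3                       # λ, treated as a hyperparameter

def sparse_ae_loss(x, encoder, decoder):
    h = encoder(x)                           # code layer h = f(x)
    r = decoder(h)                           # reconstruction g(h)
    reconstruction = torch.mean((r - x) ** 2)
    penalty = sparsity_weight * h.abs().sum(dim=1).mean()   # Ω(h) = λ Σ_i |h_i|
    return reconstruction + penalty
```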
Early work on sparse autoencoders (Ranzato et al., 2007a, 2008) explored various forms of sparsity and proposed a connection between the sparsity penalty and the log Z term that arises when applying maximum likelihood to an undirected probabilistic model p(x) = (1/Z) p̃(x). The idea is that minimizing log Z prevents a probabilistic model from having high probability everywhere, and imposing sparsity on an autoencoder prevents the autoencoder from having low reconstruction error everywhere. In this case, the connection is on the level of an intuitive understanding of a general mechanism rather than a mathematical correspondence. The interpretation of the sparsity penalty as corresponding to log p_model(h) in a directed model p_model(h) p_model(x | h) is more mathematically straightforward.

One way to achieve actual zeros in h for sparse (and denoising) autoencoders was introduced in Glorot et al. (2011b). The idea is to use rectified linear units to produce the code layer. With a prior that actually pushes the representations to zero (like the absolute value penalty), one can thus indirectly control the average number of zeros in the representation.

14.2.2 Denoising Autoencoders

Rather than adding a penalty Ω to the cost function, we can obtain an autoencoder that learns something useful by changing the reconstruction error term of the cost function.

Traditionally, autoencoders minimize some function

L(x, g(f(x))),  (14.8)

where L is a loss function penalizing g(f(x)) for being dissimilar from x, such as the L² norm of their difference. This encourages g ∘ f to learn to be merely an identity function if they have the capacity to do so.

A denoising autoencoder (DAE) instead minimizes

L(x, g(f(x̃))),  (14.9)

where x̃ is a copy of x that has been corrupted by some form of noise. Denoising autoencoders must therefore undo this corruption rather than simply copying their input.

Denoising training forces f and g to implicitly learn the structure of p_data(x), as shown by Alain and Bengio (2013) and Bengio et al. (2013c). Denoising autoencoders thus provide yet another example of how useful properties can emerge as a byproduct of minimizing reconstruction error. They are also an example of how overcomplete, high-capacity models may be used as autoencoders as long as care is taken to prevent them from learning the identity function. Denoising autoencoders are presented in more detail in section 14.5.
14.2.3 Regularizing by Penalizing Derivatives

Another strategy for regularizing an autoencoder is to use a penalty Ω, as in sparse autoencoders,

L(x, g(f(x))) + Ω(h, x),  (14.10)

but with a different form of Ω:

Ω(h, x) = λ Σ_i ||∇_x h_i||².  (14.11)

This forces the model to learn a function that does not change much when x changes slightly. Because this penalty is applied only at training examples, it forces the autoencoder to learn features that capture information about the training distribution.

An autoencoder regularized in this way is called a contractive autoencoder, or CAE. This approach has theoretical connections to denoising autoencoders, manifold learning, and probabilistic modeling. The CAE is described in more detail in section 14.7.
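A sketch of the penalty in equation 14.11: for each code unit h_i, the gradient of h_i with respect to the input is computed and its squared norm accumulated. This illustrative version assumes PyTorch and uses explicit autograd calls, which is simple but not the fastest way to obtain the Jacobian.

```python
# Contractive penalty Ω(h, x) = λ Σ_i ||∇_x h_i||²  (eq. 14.11). Sketch; assumes PyTorch.
import torch

def contractive_penalty(x, encoder, lam=1e-2):
    x = x.clone().requires_grad_(True)        # track gradients with respect to the input
    h = encoder(x)                            # h has shape (batch, code_dim)
    penalty = 0.0
    for i in range(h.shape[1]):               # one backward pass per code unit h_i
        grad_i = torch.autograd.grad(h[:, i].sum(), x, create_graph=True)[0]
        penalty = penalty + (grad_i ** 2).sum(dim=1).mean()
    return lam * penalty
```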
14.3 Representational Power, Layer Size and Depth

Autoencoders are often trained with only a single layer encoder and a single layer decoder. However, this is not a requirement. In fact, using deep encoders and decoders offers many advantages.

Recall from section 6.4.1 that there are many advantages to depth in a feedforward network. Because autoencoders are feedforward networks, these advantages also apply to autoencoders. Moreover, the encoder is itself a feedforward network, as is the decoder, so each of these components of the autoencoder can individually benefit from depth.

One major advantage of nontrivial depth is that the universal approximator theorem guarantees that a feedforward neural network with at least one hidden layer can represent an approximation of any function (within a broad class) to an arbitrary degree of accuracy, provided that it has enough hidden units. This means that an autoencoder with a single hidden layer is able to represent the identity function along the domain of the data arbitrarily well. However, the mapping from input to code is shallow. This means that we are not able to enforce arbitrary constraints, such as that the code should be sparse. A deep autoencoder, with at least one additional hidden layer inside the encoder itself, can approximate any mapping from input to code arbitrarily well, given enough hidden units.

Depth can exponentially reduce the computational cost of representing some functions. Depth can also exponentially decrease the amount of training data needed to learn some functions. See section 6.4.1 for a review of the advantages of depth in feedforward networks.

Experimentally, deep autoencoders yield much better compression than corresponding shallow or linear autoencoders (Hinton and Salakhutdinov, 2006).

A common strategy for training a deep autoencoder is to greedily pretrain the deep architecture by training a stack of shallow autoencoders, so we often encounter shallow autoencoders, even when the ultimate goal is to train a deep autoencoder.
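The following sketch illustrates the greedy strategy mentioned above: each shallow autoencoder is trained on the codes produced by the previous one, and the trained encoders are then stacked to initialize a deep encoder. It is a schematic outline, not the book's procedure in detail; the PyTorch dependency, the layer widths, and the random placeholder data are all assumptions for illustration.

```python
# Greedy layer-wise pretraining of a deep encoder from a stack of shallow autoencoders.
# Sketch only; assumes PyTorch, and the widths and placeholder data are illustrative.
import torch
import torch.nn as nn

def train_shallow_ae(data, in_dim, code_dim, epochs=10, lr=1e-3):
    """Train a one-hidden-layer autoencoder on `data` and return its encoder."""
    enc = nn.Sequential(nn.Linear(in_dim, code_dim), nn.ReLU())
    dec = nn.Linear(code_dim, in_dim)
    opt = torch.optim.Adam(list(enc.parameters()) + list(dec.parameters()), lr=lr)
    for _ in range(epochs):
        r = dec(enc(data))
        loss = ((r - data) ** 2).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()
    return enc

widths = [784, 256, 64, 32]                  # input -> progressively smaller codes
data = torch.randn(1000, widths[0])          # placeholder data for illustration
encoders = []
for in_dim, code_dim in zip(widths[:-1], widths[1:]):
    enc = train_shallow_ae(data, in_dim, code_dim)
    encoders.append(enc)
    with torch.no_grad():
        data = enc(data)                     # next autoencoder is trained on the codes

deep_encoder = nn.Sequential(*encoders)      # stacked encoders initialize a deep encoder
```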
14.4 Stochastic Encoders and Decoders

Autoencoders are just feedforward networks. The same loss functions and output unit types that can be used for traditional feedforward networks are also used for autoencoders.

As described in section 6.2.2.4, a general strategy for designing the output units and the loss function of a feedforward network is to define an output distribution p(y | x) and minimize the negative log-likelihood −log p(y | x). In that setting, y is a vector of targets, such as class labels.

In an autoencoder, x is now the target as well as the input. Yet we can still apply the same machinery as before. Given a hidden code h, we may think of the decoder as providing a conditional distribution p_decoder(x | h). We may then train the autoencoder by minimizing −log p_decoder(x | h). The exact form of this loss function will change depending on the form of p_decoder. As with traditional feedforward networks, we usually use linear output units to parametrize the mean of a Gaussian distribution if x is real valued. In that case, the negative log-likelihood yields a mean squared error criterion. Similarly, binary x values correspond to a Bernoulli distribution whose parameters are given by a sigmoid output unit, discrete x values correspond to a softmax distribution, and so on. Typically, the output variables are treated as being conditionally independent given h so that this probability distribution is inexpensive to evaluate, but some techniques, such as mixture density outputs, allow tractable modeling of outputs with correlations.
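As a concrete reading of the paragraph above, the snippet below computes −log p_decoder(x | h) for two common output choices: a Gaussian with fixed unit variance (which reduces to squared error up to an additive constant) and a Bernoulli parametrized by a sigmoid. This is an illustrative sketch assuming PyTorch; it is not code from the book.

```python
# Negative log-likelihood reconstruction losses for two decoder output distributions.
# Sketch; assumes PyTorch. `mean` / `logits` are the decoder's raw outputs for a batch.
import torch
import torch.nn.functional as F

def gaussian_nll(mean, x):
    # -log N(x; mean, I) up to an additive constant: 0.5 * ||x - mean||^2
    return 0.5 * ((x - mean) ** 2).sum(dim=1).mean()

def bernoulli_nll(logits, x):
    # -log Bernoulli(x; sigmoid(logits)), i.e. binary cross-entropy, for binary-valued x
    return F.binary_cross_entropy_with_logits(logits, x, reduction="mean")
```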
Figure 14.2: The structure of a stochastic autoencoder, in which both the encoder and the decoder are not simple functions but instead involve some noise injection, meaning that their output can be seen as sampled from a distribution, p_encoder(h | x) for the encoder and p_decoder(x | h) for the decoder.

To make a more radical departure from the feedforward networks we have seen previously, we can also generalize the notion of an encoding function f(x) to an encoding distribution p_encoder(h | x), as illustrated in figure 14.2.

Any latent variable model p_model(h, x) defines a stochastic encoder

p_encoder(h | x) = p_model(h | x)  (14.12)

and a stochastic decoder

p_decoder(x | h) = p_model(x | h).  (14.13)

In general, the encoder and decoder distributions are not necessarily conditional distributions compatible with a unique joint distribution p_model(x, h). Alain et al. (2015) showed that training the encoder and decoder as a denoising autoencoder will tend to make them compatible asymptotically (with enough capacity and examples).
14.5 Denoising Autoencoders

The denoising autoencoder (DAE) is an autoencoder that receives a corrupted data point as input and is trained to predict the original, uncorrupted data point as its output.

Figure 14.3: The computational graph of the cost function for a denoising autoencoder, which is trained to reconstruct the clean data point x from its corrupted version x̃. This is accomplished by minimizing the loss L = −log p_decoder(x | h = f(x̃)), where x̃ is a corrupted version of the data example x, obtained through a given corruption process C(x̃ | x). Typically the distribution p_decoder is a factorial distribution whose mean parameters are emitted by a feedforward network g.

The DAE training procedure is illustrated in figure 14.3. We introduce a corruption process C(x̃ | x), which represents a conditional distribution over corrupted samples x̃, given a data sample x. The autoencoder then learns a reconstruction distribution p_reconstruct(x | x̃) estimated from training pairs (x, x̃) as follows:

1. Sample a training example x from the training data.

2. Sample a corrupted version x̃ from C(x̃ | x = x).

3. Use (x, x̃) as a training example for estimating the autoencoder reconstruction distribution p_reconstruct(x | x̃) = p_decoder(x | h), with h the output of encoder f(x̃) and p_decoder typically defined by a decoder g(h).

Typically we can simply perform gradient-based approximate minimization (such as minibatch gradient descent) on the negative log-likelihood −log p_decoder(x | h). As long as the encoder is deterministic, the denoising autoencoder is a feedforward network and may be trained with exactly the same techniques as any other feedforward network.

We can therefore view the DAE as performing stochastic gradient descent on the following expectation:

−E_{x∼p̂_data(x)} E_{x̃∼C(x̃|x)} log p_decoder(x | h = f(x̃)),  (14.14)

where p̂_data(x) is the training distribution.
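A compact sketch of the procedure above and of equation 14.14, using Gaussian corruption and a squared-error reconstruction loss (the Gaussian negative log-likelihood up to a constant). It assumes PyTorch and reuses an encoder/decoder/optimizer like those in the earlier sketches; the noise level sigma = 0.5 is an arbitrary illustrative choice.

```python
# Denoising autoencoder training step: corrupt x, then reconstruct the clean x (eq. 14.14).
# Sketch; assumes PyTorch and an existing `encoder`, `decoder`, `optimizer`.
import torch

sigma = 0.5                                   # noise level of the corruption process C(x̃ | x)

def dae_train_step(x, encoder, decoder, optimizer):
    x_tilde = x + sigma * torch.randn_like(x) # sample x̃ ~ N(x̃; x, σ²I)
    h = encoder(x_tilde)                      # h = f(x̃)
    r = decoder(h)                            # g(f(x̃))
    loss = ((r - x) ** 2).mean()              # compare with the *clean* x, not x̃
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```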
14.5.1 Estimating the Score

Score matching (Hyvärinen, 2005) is an alternative to maximum likelihood. It provides a consistent estimator of probability distributions based on encouraging the model to have the same score as the data distribution at every training point x. In this context, the score is a particular gradient field:

∇_x log p(x).  (14.15)

Score matching is discussed further in section 18.4. For the present discussion regarding autoencoders, it is sufficient to understand that learning the gradient field of log p_data is one way to learn the structure of p_data itself.

A very important property of DAEs is that their training criterion (with conditionally Gaussian p(x | h)) makes the autoencoder learn a vector field (g(f(x)) − x) that estimates the score of the data distribution. This is illustrated in figure 14.4.

Denoising training of a specific kind of autoencoder (sigmoidal hidden units, linear reconstruction units) using Gaussian noise and mean squared error as the reconstruction cost is equivalent (Vincent, 2011) to training a specific kind of undirected probabilistic model called an RBM with Gaussian visible units. This kind of model is described in detail in section 20.5.1; for the present discussion, it suffices to know that it is a model that provides an explicit p_model(x; θ). When the RBM is trained using denoising score matching (Kingma and LeCun, 2010), its learning algorithm is equivalent to denoising training in the corresponding autoencoder. With a fixed noise level, regularized score matching is not a consistent estimator; it instead recovers a blurred version of the distribution. If the noise level is chosen to approach 0 when the number of examples approaches infinity, however, then consistency is recovered. Denoising score matching is discussed in more detail in section 18.5.

Other connections between autoencoders and RBMs exist. Score matching applied to RBMs yields a cost function that is identical to reconstruction error combined with a regularization term similar to the contractive penalty of the CAE (Swersky et al., 2011). Bengio and Delalleau (2009) showed that an autoencoder gradient provides an approximation to contrastive divergence training of RBMs.

For continuous-valued x, the denoising criterion with Gaussian corruption and reconstruction distribution yields an estimator of the score that is applicable to general encoder and decoder parametrizations (Alain and Bengio, 2013). This means a generic encoder-decoder architecture may be made to estimate the score by training with the squared error criterion

||g(f(x̃)) − x||²  (14.16)

and corruption

C(x̃ = x̃ | x) = N(x̃; µ = x, Σ = σ²I),  (14.17)

with noise variance σ². See figure 14.5 for an illustration of how this works.
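The point of equations 14.16 and 14.17 can be made concrete in a few lines: after denoising training with Gaussian corruption, the reconstruction residual points toward higher data density and can be read as an estimate of the score up to a multiplicative factor. The snippet is a sketch assuming PyTorch and a trained encoder/decoder pair such as the DAE sketch above.

```python
# Using a Gaussian-corruption DAE as a score estimator: g(f(x)) - x is, up to a
# multiplicative factor, an estimate of ∇_x log p_data(x). Sketch; assumes PyTorch.
import torch

def estimated_score(x, encoder, decoder):
    with torch.no_grad():
        residual = decoder(encoder(x)) - x    # vector field learned by denoising training
    return residual                           # score estimate up to a multiplicative factor
```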
Figure 14.4: A denoising autoencoder is trained to map a corrupted data point x̃ back to the original data point x. We illustrate training examples x as red crosses lying near a low-dimensional manifold, illustrated with the bold black line. We illustrate the corruption process C(x̃ | x) with a gray circle of equiprobable corruptions. A gray arrow demonstrates how one training example is transformed into one sample from this corruption process. When the denoising autoencoder is trained to minimize the average of squared errors ||g(f(x̃)) − x||², the reconstruction g(f(x̃)) estimates E_{x,x̃∼p_data(x)C(x̃|x)}[x | x̃]. The vector g(f(x̃)) − x̃ points approximately toward the nearest point on the manifold, since g(f(x̃)) estimates the center of mass of the clean points x that could have given rise to x̃. The autoencoder thus learns a vector field g(f(x)) − x, indicated by the green arrows. This vector field estimates the score ∇_x log p_data(x) up to a multiplicative factor that is the average root mean square reconstruction error.
In general, there is no guarantee that the reconstruction g(f(x)) minus the input x corresponds to the gradient of any function, let alone to the score. That is why the early results (Vincent, 2011) are specialized to particular parametrizations, where g(f(x)) − x may be obtained by taking the derivative of another function. Kamyshanska and Memisevic (2015) generalized the results of Vincent (2011) by identifying a family of shallow autoencoders such that g(f(x)) − x corresponds to a score for all members of the family.

So far we have described only how the denoising autoencoder learns to represent a probability distribution. More generally, one may want to use the autoencoder as a generative model and draw samples from this distribution. This is described in section 20.11.

14.5.1.1 Historical Perspective

The idea of using MLPs for denoising dates back to the work of LeCun (1987) and Gallinari et al. (1987). Behnke (2001) also used recurrent networks to denoise images. Denoising autoencoders are, in some sense, just MLPs trained to denoise. The name "denoising autoencoder," however, refers to a model that is intended not merely to learn to denoise its input but to learn a good internal representation as a side effect of learning to denoise. This idea came much later (Vincent et al., 2008, 2010). The learned representation may then be used to pretrain a deeper unsupervised network or a supervised network. Like sparse autoencoders, sparse coding, contractive autoencoders, and other regularized autoencoders, the motivation for DAEs was to allow the learning of a very high-capacity encoder while preventing the encoder and decoder from learning a useless identity function.

Prior to the introduction of the modern
dern DAE, Inay
Inayoshi
oshi and Kurita (2005)
explored someetofal.
the same goals with some of the same metho
methods.ds. Their approac
approach h
The idea of using MLPs for denoising dates back to the work of LeCun (1987)
minimizes reconstruction error in addition to a sup
supervised
ervised ob
objectiv
jectiv
jectivee while injecting
and Gallinari (1987). Behnke (2001) also used recurrent networks to denoise
noise in the hidden la lay
yer of a supervised MLP
MLP,, with the ob objective
jective to impro
improve
ve
images. Denoising auto enco ders are, in some sense, just MLPs trained to denoise.
generalization by intro
intro
troducing
ducing the reconstruction error and the injected noise. Their
The name “denoising auto enco der,” however, refers to a mo del that is intended not
metho
method d was based on a linear enco
encoder,
der, ho
how
wev
ever,
er, and could not learn function
merely to learn to denoise its input but to learn a go od internal representation
families
et al. as pow
powerful
erful as the mo
modern
dern DAE can.
as a side effect of learning to denoise. This idea came much later (Vincent
511
, 2008, 2010). The learned representation may then b e used to pretrain a
deep er unsup ervised network or a supervised network. Like sparse auto enco ders,
sparse coding, contractive auto enco ders, and other regularized auto enco ders, the
motivation for DAEs was to allow the learning of a very high-capacity enco der
while preventing the enco der and deco der from learning a useless identity function.
Prior to the introduction of the mo dern DAE, Inayoshi and Kurita (2005)
explored some of the same goals with some of the same metho ds. Their approach
minimizes reconstruction error in addition to a sup ervised ob jective while injecting
noise in the hidden layer of a supervised MLP, with the ob jective to improve
511
generalization by intro ducing the reconstruction error and the injected noise. Their
Figure 14.5: Vector field learned by a denoising autoencoder around a 1-D curved
manifold near which the data concentrate in a 2-D space. Each arrow is proportional
to the reconstruction minus input vector of the autoencoder and points towards higher
probability according to the implicitly estimated probability distribution. The vector field
has zeros at both maxima of the estimated density function (on the data manifolds) and
at minima of that density function. For example, the spiral arm forms a 1-D manifold of
local maxima that are connected to each other. Local minima appear near the middle of
the gap between two arms. When the norm of reconstruction error (shown by the length
of the arrows) is large, probability can be significantly increased by moving in the direction
of the arrow, and that is mostly the case in places of low probability. The autoencoder
maps these low probability points to higher probability reconstructions. Where probability
is maximal, the arrows shrink because the reconstruction becomes more accurate. Figure
reproduced with permission from Alain and Bengio (2013).
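A rough sketch of how such a vector field could be plotted for a small two-dimensional problem is given below; it is an illustration only, with an untrained toy encoder and decoder standing in for modules obtained by denoising training, and the grid range is an arbitrary choice.

import torch
import torch.nn as nn

# Stand-ins for a trained 2-D autoencoder; in practice f and g would come from
# denoising training as in the earlier sketch.
f = nn.Sequential(nn.Linear(2, 32), nn.Tanh(), nn.Linear(32, 8))
g = nn.Sequential(nn.Linear(8, 32), nn.Tanh(), nn.Linear(32, 2))

xs = torch.linspace(-2.0, 2.0, 20)
grid = torch.cartesian_prod(xs, xs)      # (400, 2) grid of query points
with torch.no_grad():
    field = g(f(grid)) - grid            # reconstruction minus input at each grid point
# Each row of `field` is an arrow like those in figure 14.5; it can be drawn with,
# for example, matplotlib's quiver(grid[:, 0], grid[:, 1], field[:, 0], field[:, 1]).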
14.6 Learning Manifolds with Autoencoders

Like many other machine learning algorithms, autoencoders exploit the idea that data
concentrates around a low-dimensional manifold or a small set of such manifolds, as
described in section 5.11.3. Some machine learning algorithms exploit this idea only
insofar as they learn a function that behaves correctly on the manifold but that may have
unusual behavior if given an input that is off the manifold. Autoencoders take this idea
further and aim to learn the structure of the manifold. To understand how autoencoders
do this, we must present some important characteristics of manifolds.

An important characterization of a manifold is the set of its tangent planes. At a point x
on a d-dimensional manifold, the tangent plane is given by d basis vectors that span the
local directions of variation allowed on the manifold. As illustrated in figure 14.6, these
local directions specify how one can change x infinitesimally while staying on the manifold.

All autoencoder training procedures involve a compromise between two forces:

1. Learning a representation h of a training example x such that x can be approximately
recovered from h through a decoder. The fact that x is drawn from the training data is
crucial, because it means the autoencoder need not successfully reconstruct inputs that
are not probable under the data-generating distribution.

2. Satisfying the constraint or regularization penalty. This can be an architectural
constraint that limits the capacity of the autoencoder, or it can be a regularization term
added to the reconstruction cost. These techniques generally prefer solutions that are
less sensitive to the input.

Clearly, neither force alone would be useful: copying the input to the output is not useful
on its own, nor is ignoring the input. Instead, the two forces together are useful because
they force the hidden representation to capture information about the structure of the
data-generating distribution. The important principle is that the autoencoder can afford
to represent only the variations that are needed to reconstruct training examples. If the
data-generating distribution concentrates near a low-dimensional manifold, this yields
representations that implicitly capture a local coordinate system for this manifold: only
the variations tangent to the manifold around x need to correspond to changes in h = f(x).
Hence the encoder learns a mapping from the input space x to a representation space, a
mapping that is only sensitive to changes along the manifold directions, but that is
insensitive to changes orthogonal to the manifold.
Figure 14.6: An illustration of the concept of a tangent hyperplane. Here we create a 1-D
manifold in 784-D space. We take an MNIST image with 784 pixels and transform it by
translating it vertically. The amount of vertical translation defines a coordinate along
a 1-D manifold that traces out a curved path through image space. This plot shows a
few points along this manifold. For visualization, we have projected the manifold into
2-D space using PCA. An n-dimensional manifold has an n-dimensional tangent plane
at every point. This tangent plane touches the manifold exactly at that point and is
oriented parallel to the surface at that point. It defines the space of directions in which
it is possible to move while remaining on the manifold. This 1-D manifold has a single
tangent line. We indicate an example tangent line at one point, with an image showing
how this tangent direction appears in image space. Gray pixels indicate pixels that do not
change as we move along the tangent line, white pixels indicate pixels that brighten, and
black pixels indicate pixels that darken.
A one-dimensional example is illustrated in figure 14.7, showing that, by making the
reconstruction function insensitive to perturbations of the input around the data points,
we cause the autoencoder to recover the manifold structure.

To understand why autoencoders are useful for manifold learning, it is instructive to
compare them to other approaches. What is most commonly learned to characterize a
manifold is a representation of the data points on (or near) the manifold. Such a
representation for a particular example is also called its embedding. It is typically given
by a low-dimensional vector, with fewer dimensions than the “ambient” space of which
the manifold is a low-dimensional subset. Some algorithms (nonparametric manifold
learning algorithms, discussed below) directly learn an embedding for each training
example, while others learn a more general mapping, sometimes called an encoder, or
representation function, that maps any point in the ambient space (the input space) to
its embedding.

Figure 14.7: If the autoencoder learns a reconstruction function that is invariant to small
perturbations near the data points, it captures the manifold structure of the data. Here
the manifold structure is a collection of 0-dimensional manifolds. The dashed diagonal
line indicates the identity function target for reconstruction. The optimal reconstruction
function crosses the identity function wherever there is a data point. The horizontal
arrows at the bottom of the plot indicate the r(x) − x reconstruction direction vector at
the base of the arrow, in input space, always pointing toward the nearest “manifold” (a
single data point in the 1-D case). The denoising autoencoder explicitly tries to make
the derivative of the reconstruction function r(x) small around the data points. The
contractive autoencoder does the same for the encoder. Although the derivative of r(x) is
asked to be small around the data points, it can be large between the data points. The
space between the data points corresponds to the region between the manifolds, where
the reconstruction function must have a large derivative to map corrupted points back
onto the manifold.
Figure 14.8: Nonparametric manifold learning procedures build a nearest neighbor graph
in which nodes represent training examples and directed edges indicate nearest neighbor
relationships. Various procedures can thus obtain the tangent plane associated with a
neighborhood of the graph as well as a coordinate system that associates each training
example with a real-valued vector position, or embedding. It is possible to generalize
such a representation to new examples by a form of interpolation. As long as the number
of examples is large enough to cover the curvature and twists of the manifold, these
approaches work well. Images from the QMUL Multiview Face Dataset (Gong et al., 2000).

Manifold learning has mostly focused on unsupervised learning procedures that attempt
to capture these manifolds. Most of the initial machine learning research on learning
nonlinear manifolds has focused on nonparametric methods based on the nearest
neighbor graph. This graph has one node per training example and edges connecting
near neighbors to each other. These methods (Schölkopf et al., 1998; Roweis and Saul,
2000; Tenenbaum et al., 2000; Brand, 2003; Belkin and Niyogi, 2003; Donoho and Grimes,
2003; Weinberger and Saul, 2004; Hinton and Roweis, 2003; van der Maaten and Hinton,
2008) associate each node with a tangent plane that spans the directions of variations
associated with the difference vectors between the example and its neighbors, as
illustrated in figure 14.8.

A global coordinate system can then be obtained through an optimization or by solving
a linear system. Figure 14.9 illustrates how a manifold can be tiled by a large number of
locally linear Gaussian-like patches (or “pancakes,” because the Gaussians are flat in the
tangent directions).
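As a rough sketch of the nonparametric idea (not taken from any of the cited methods in particular), the code below estimates a tangent plane at each training example from the difference vectors to its k nearest neighbors by a local PCA, computed with an SVD; the data array X, the neighborhood size k, and the tangent dimension are placeholder assumptions.

import numpy as np

def local_tangents(X, k=10, d_tangent=2):
    """For each row of X, return d_tangent orthonormal tangent directions."""
    n = X.shape[0]
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    tangents = []
    for i in range(n):
        idx = np.argsort(dists[i])[1:k + 1]          # indices of the k nearest neighbors (skip self)
        diffs = X[idx] - X[i]                        # difference vectors to the neighbors
        _, _, vt = np.linalg.svd(diffs, full_matrices=False)
        tangents.append(vt[:d_tangent])              # leading right singular vectors span the tangent plane
    return np.stack(tangents)                        # shape (n, d_tangent, dim)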
Figure 14.9: If the tangent planes (see figure 14.6) at each location are known, then they
can be tiled to form a global coordinate system or a density function. Each local patch
can be thought of as a local Euclidean coordinate system or as a locally flat Gaussian, or
“pancake,” with a very small variance in the directions orthogonal to the pancake and a
very large variance in the directions defining the coordinate system on the pancake. A
mixture of these Gaussians provides an estimated density function, as in the manifold
Parzen window algorithm (Vincent and Bengio, 2003) or in its nonlocal neural-net-based
variant (Bengio et al., 2006c).
A fundamental difficulty with such local nonparametric approaches to manifold learning
is raised in Bengio and Monperrus (2005): if the manifolds are not very smooth (they
have many peaks and troughs and twists), one may need a very large number of training
examples to cover each one of these variations, with no chance to generalize to unseen
variations. Indeed, these methods can only generalize the shape of the manifold by
interpolating between neighboring examples. Unfortunately, the manifolds involved in AI
problems can have very complicated structures that can be difficult to capture from only
local interpolation. Consider for example the manifold resulting from translation shown
in figure 14.6. If we watch just one coordinate within the input vector, x_i, as the image
is translated, we will observe that one coordinate encounters a peak or a trough in its
value once for every peak or trough in brightness in the image. In other words, the
complexity of the patterns of brightness in an underlying image template drives the
complexity of the manifolds that are generated by performing simple image
transformations. This motivates the use of distributed representations and deep learning
for capturing manifold structure.

14.7 Contractive Autoencoders

The contractive autoencoder (Rifai et al., 2011a) introduces an explicit regularizer on
the code h = f(x), encouraging the derivatives of f to be as small as possible:

Ω(h) = λ ||∂f(x)/∂x||²_F.    (14.18)

The penalty Ω(h) is the squared Frobenius norm (sum of squared elements) of the
Jacobian matrix of partial derivatives associated with the encoder function.

There is a connection between the denoising autoencoder and the contractive
autoencoder: Alain and Bengio (2013) showed that in the limit of small Gaussian input
noise, the denoising reconstruction error is equivalent to a contractive penalty on the
reconstruction function that maps x to r = g(f(x)). In other words, denoising
autoencoders make the reconstruction function resist small but finite-sized perturbations
of the input, while contractive autoencoders make the feature extraction function f(x)
resist infinitesimal perturbations of the input. When using the Jacobian-based
contractive penalty to pretrain features f(x) for use with a classifier, the best
classification accuracy usually results from applying the contractive penalty to f(x)
rather than to g(f(x)). A contractive penalty on f(x) also has close connections to score
matching, as discussed in section 14.5.1.
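For the common special case of a single-layer sigmoid encoder h = sigmoid(Wx + b), the Jacobian is diag(h ⊙ (1 − h)) W, so the penalty of equation 14.18 can be computed cheaply. The sketch below is an illustration of this case only; W, b, and the coefficient lam are assumed placeholders, and the penalty would be added to the reconstruction loss during training.

import torch

def contractive_penalty(x, W, b, lam=1e-3):
    """x: (batch, d) inputs; W: (code, d) encoder weights; b: (code,) biases."""
    h = torch.sigmoid(x @ W.t() + b)                    # encoder output h = sigmoid(W x + b)
    jac_sq = (h * (1 - h)) ** 2 * (W ** 2).sum(dim=1)   # per-unit sum of squared Jacobian entries
    return lam * jac_sq.sum(dim=1).mean()               # lam * ||df/dx||_F^2, averaged over the batch

# Usage sketch: total_loss = reconstruction_error + contractive_penalty(x, W, b)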
The name contractive arises from the way that the CAE warps space. Specifically,
because the CAE is trained to resist perturbations of its input, it is encouraged to map a
neighborhood of input points to a smaller neighborhood of output points. We can think
of this as contracting the input neighborhood to a smaller output neighborhood.

To clarify, the CAE is contractive only locally: all perturbations of a training point x are
mapped near to f(x). Globally, two different points x and x′ may be mapped to points
f(x) and f(x′) that are farther apart than the original points. It is plausible that f could
be expanding in-between or far from the data manifolds (see, for example, what happens
in the 1-D toy example of figure 14.7). When the Ω(h) penalty is applied to sigmoidal
units, one easy way to shrink the Jacobian is to make the sigmoid units saturate to 0 or 1.
This encourages the CAE to encode input points with extreme values of the sigmoid,
which may be interpreted as a binary code. It also ensures that the CAE will spread its
code values throughout most of the hypercube that its sigmoidal hidden units can span.

We can think of the Jacobian matrix J at a point x as approximating the nonlinear
encoder f(x) as being a linear operator. This allows us to use the word “contractive”
more formally. In the theory of linear operators, a linear operator is said to be contractive
if the norm of Jx remains less than or equal to 1 for all unit-norm x. In other words, J is
contractive if its image of the unit sphere is completely encapsulated by the unit sphere.
We can think of the CAE as penalizing the Frobenius norm of the local linear
approximation of f(x) at every training point x in order to encourage each of these local
linear operators to become a contraction.

As described in section 14.6, regularized autoencoders learn manifolds by balancing two
opposing forces. In the case of the CAE, these two forces are reconstruction error and
the contractive penalty Ω(h). Reconstruction error alone would encourage the CAE to
learn an identity function. The contractive penalty alone would encourage the CAE to
learn features that are constant with respect to x. The compromise between these two
forces yields an autoencoder whose derivatives ∂f(x)/∂x are mostly tiny. Only a small
number of hidden units, corresponding to a small number of directions in the input, may
have significant derivatives.

The goal of the CAE is to learn the manifold structure of the data. Directions x with
large Jx rapidly change h, so these are likely to be directions that approximate the
tangent planes of the manifold. Experiments by Rifai et al. (2011a,b) show that training
the CAE results in most singular values of J dropping below 1 in magnitude and
therefore becoming contractive.
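A small sketch (not from the original text) of how this spectrum could be inspected for a trained encoder is shown below; it assumes f is a PyTorch module mapping a single input vector to a code vector.

import torch
from torch.autograd.functional import jacobian

def jacobian_spectrum(f, x, num_tangents=5):
    """x: a single input of shape (d,). Returns singular values and leading tangent directions."""
    J = jacobian(f, x)                                   # Jacobian of the code with respect to x, shape (code, d)
    U, S, Vh = torch.linalg.svd(J, full_matrices=False)
    # Singular values below 1 correspond to locally contracted directions; the right
    # singular vectors paired with the largest singular values approximate manifold
    # tangent directions at x (compare figure 14.10).
    return S, Vh[:num_tangents]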
Some singular values remain above 1, however, because the reconstruction error penalty
encourages the CAE to encode the directions with the most local variance. The
directions corresponding to the largest singular values are interpreted as the tangent
directions that the contractive autoencoder has learned. Ideally, these tangent directions
should correspond to real variations in the data. For example, a CAE applied to images
should learn tangent vectors that show how the image changes as objects in the image
gradually change pose, as shown in figure 14.6. Visualizations of the experimentally
obtained singular vectors do seem to correspond to meaningful transformations of the
input image, as shown in figure 14.10.

Figure 14.10: Illustration of tangent vectors of the manifold estimated by local PCA
(no sharing across regions) and by a contractive autoencoder. The location on the
manifold is defined by the input image of a dog drawn from the CIFAR-10 dataset. The
tangent vectors are estimated by the leading singular vectors of the Jacobian matrix
∂h/∂x of the input-to-code mapping. Although both local PCA and the CAE can capture
local tangents, the CAE is able to form more accurate estimates from limited training
data because it exploits parameter sharing across different locations that share a subset
of active hidden units. The CAE tangent directions typically correspond to moving or
changing parts of the object (such as the head or legs). Images reproduced with
permission from Rifai et al. (2011c).

One practical issue with the CAE regularization criterion is that although it is cheap to
compute in the case of a single hidden layer autoencoder, it becomes much more
expensive in the case of deeper autoencoders. The strategy followed by Rifai et al.
(2011a) is to separately train a series of single-layer autoencoders, each trained to
reconstruct the previous autoencoder's hidden layer. The composition of these
autoencoders then forms a deep autoencoder. Because each layer was separately trained
to be locally contractive, the deep autoencoder is contractive as well. The result is not
the same as what would be obtained by jointly training the entire architecture with a
penalty on the Jacobian of the deep model, but it captures many of the desirable
qualitative characteristics.
Another practical issue is that the contraction penalty can obtain useless results if we do
not impose some sort of scale on the decoder. For example, the encoder could consist of
multiplying the input by a small constant ε, and the decoder could consist of dividing
the code by ε. As ε approaches 0, the encoder drives the contractive penalty Ω(h) to
approach 0 without having learned anything about the distribution. Meanwhile, the
decoder maintains perfect reconstruction. In Rifai et al. (2011a), this is prevented by
tying the weights of f and g. Both f and g are standard neural network layers consisting
of an affine transformation followed by an element-wise nonlinearity, so it is
straightforward to set the weight matrix of g to be the transpose of the weight matrix of f.

14.8 Predictive Sparse Decomposition

Predictive sparse decomposition (PSD) is a model that is a hybrid of sparse coding and
parametric autoencoders (Kavukcuoglu et al., 2008). A parametric encoder is trained to
predict the output of iterative inference. PSD has been applied to unsupervised feature
learning for object recognition in images and video (Kavukcuoglu et al., 2009, 2010;
Jarrett et al., 2009; Farabet et al., 2011), as well as for audio (Henaff et al., 2011). The
model consists of an encoder f(x) and a decoder g(h) that are both parametric. During
training, h is controlled by the optimization algorithm. Training proceeds by minimizing

||x − g(h)||² + λ|h|₁ + γ||h − f(x)||².    (14.19)

As in sparse coding, the training algorithm alternates between minimization with respect
to h and minimization with respect to the model parameters. Minimization with respect
to h is fast because f(x) provides a good initial value of h, and the cost function
constrains h to remain near f(x) anyway. Simple gradient descent can obtain reasonable
values of h in as few as ten steps.

The training procedure used by PSD is different from first training a sparse coding model
and then training f(x) to predict the values of the sparse coding features. The PSD
training procedure regularizes the decoder to use parameters for which f(x) can infer
good code values.

Predictive sparse coding is an example of learned approximate inference. In section 19.5,
this topic is developed further. The tools presented in chapter 19 make it clear that PSD
can be interpreted as training a directed sparse coding probabilistic model by maximizing
a lower bound on the log-likelihood of the model.
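The alternating scheme can be sketched as follows (an illustration, not reference code): a few gradient steps minimize the objective of equation 14.19 with respect to h, starting from f(x), and a parameter update follows with h held fixed. The layer sizes, coefficients, learning rates, and number of inner steps are all assumed values.

import torch
import torch.nn as nn

d, code = 784, 256
f = nn.Linear(d, code)                        # parametric encoder f(x)
g = nn.Linear(code, d)                        # parametric decoder g(h)
lam, gamma = 0.1, 1.0
opt = torch.optim.Adam(list(f.parameters()) + list(g.parameters()), lr=1e-3)

def psd_loss(x, h):
    return (((x - g(h)) ** 2).sum(dim=1)                        # ||x - g(h)||^2
            + lam * h.abs().sum(dim=1)                          # lambda * |h|_1
            + gamma * ((h - f(x)) ** 2).sum(dim=1)).mean()      # gamma * ||h - f(x)||^2

def psd_step(x, inner_steps=10, h_lr=0.1):
    h = f(x).detach().requires_grad_(True)        # f(x) provides a good initial value of h
    for _ in range(inner_steps):                  # minimize over h with the parameters fixed
        grad_h, = torch.autograd.grad(psd_loss(x, h), h)
        h = (h - h_lr * grad_h).detach().requires_grad_(True)
    opt.zero_grad()
    psd_loss(x, h.detach()).backward()            # then update the encoder and decoder parameters
    opt.step()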
In practical applications of PSD, the iterative optimization is used only during training.
The parametric encoder f is used to compute the learned features when the model is
deployed. Evaluating f is computationally inexpensive compared to inferring h via
gradient descent. Because f is a differentiable parametric function, PSD models may be
stacked and used to initialize a deep network to be trained with another criterion.

14.9 Applications of Autoencoders

Autoencoders have been successfully applied to dimensionality reduction and
information retrieval tasks. Dimensionality reduction was one of the first applications of
representation learning and deep learning. It was one of the early motivations for
studying autoencoders. For example, Hinton and Salakhutdinov (2006) trained a stack
of RBMs and then used their weights to initialize a deep autoencoder with gradually
smaller hidden layers, culminating in a bottleneck of 30 units. The resulting code yielded
less reconstruction error than PCA into 30 dimensions, and the learned representation
was qualitatively easier to interpret and relate to the underlying categories, with these
categories manifesting as well-separated clusters.

Lower-dimensional representations can improve performance on many tasks, such as
classification. Models of smaller spaces consume less memory and runtime. Many forms
of dimensionality reduction place semantically related examples near each other, as
observed by Salakhutdinov and Hinton (2007b) and Torralba et al. (2008). The hints
provided by the mapping to the lower-dimensional space aid generalization.

One task that benefits even more than usual from dimensionality reduction is
information retrieval, the task of finding entries in a database that resemble a query
entry. This task derives the usual benefits from dimensionality reduction that other
tasks do, but also derives the additional benefit that search can become extremely
efficient in certain kinds of low-dimensional spaces. Specifically, if we train the
dimensionality reduction algorithm to produce a code that is low-dimensional and
binary, then we can store all database entries in a hash table that maps binary code
vectors to entries. This hash table allows us to perform information retrieval by
returning all database entries that have the same binary code as the query. We can also
search over slightly less similar entries very efficiently, just by flipping individual bits
from the encoding of the query. This approach to information retrieval via
dimensionality reduction and binarization is called semantic hashing (Salakhutdinov and
Hinton, 2007b, 2009b) and has been applied to both textual input (Salakhutdinov and
Hinton, 2007b, 2009b) and images (Torralba et al., 2008; Weiss et al., 2008; Krizhevsky
and Hinton, 2011).
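A minimal sketch of the lookup side of this idea is shown below; the database items, the assumed function encode_binary (which must return a hashable tuple of 0/1 values), and the restriction to Hamming distance one are illustrative choices.

from collections import defaultdict

def build_table(items, encode_binary):
    """Index every item under its binary code."""
    table = defaultdict(list)
    for item in items:
        table[encode_binary(item)].append(item)
    return table

def retrieve(query, table, encode_binary):
    code = encode_binary(query)
    results = list(table.get(code, []))                  # entries with exactly the same code
    for i in range(len(code)):                           # then flip each bit to reach Hamming distance 1
        flipped = code[:i] + (1 - code[i],) + code[i + 1:]
        results.extend(table.get(flipped, []))
    return results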
To produce binary codes for semantic hashing, one typically uses an encoding function
with sigmoids on the final layer. The sigmoid units must be trained to be saturated to
nearly 0 or nearly 1 for all input values. One trick that can accomplish this is simply to
inject additive noise just before the sigmoid nonlinearity during training. The magnitude
of the noise should increase over time. To fight that noise and preserve as much
information as possible, the network must increase the magnitude of the inputs to the
sigmoid function, until saturation occurs.

The idea of learning a hashing function has been further explored in several directions,
including the idea of training the representations to optimize a loss more directly linked
to the task of finding nearby examples in the hash table (Norouzi and Fleet, 2011).

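The saturation trick described above can be sketched as follows; the layer sizes and the noise schedule are purely illustrative assumptions.

import torch
import torch.nn as nn

class BinaryCoder(nn.Module):
    def __init__(self, d=784, code=32):
        super().__init__()
        self.pre_code = nn.Linear(d, code)

    def forward(self, x, noise_scale=0.0):
        a = self.pre_code(x)
        if self.training:
            a = a + noise_scale * torch.randn_like(a)   # additive noise just before the sigmoid
        return torch.sigmoid(a)                         # pushed toward 0/1 to preserve information

# During training, noise_scale would be increased over time, e.g. noise_scale = 0.01 * epoch.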