Ad3501-Dl-Unit 5 Notes
Autoencoders:
Autoencoders are an unsupervised learning technique that we can use to learn efficient data
encodings. An autoencoder learns to map its input data back to itself at the output; while doing
so, it learns to encode the data, and the hidden code it produces is a compressed representation
of the input data.
The main aim while training an autoencoder neural network is dimensionality reduction.
The following figure shows the basic working of an autoencoder. Note that this is not a
neural-network-specific diagram; it is just a general representation of how an autoencoder works.
[Figure: basic working of an autoencoder]
An autoencoder should be able to reconstruct the input data efficiently, but by learning its
useful properties rather than memorizing it.
There are many ways to capture important properties when training an autoencoder. Let’s start
by getting to know about undercomplete autoencoders.
Undercomplete Autoencoders
In the previous section, we discussed that we want our autoencoder to learn the important
features of the input data. It should do that instead of trying to memorize and copy the input
data to the output data.
We can do that by making the hidden code have a lower dimensionality than the input data. In an
autoencoder, when the encoding h has a smaller dimension than x, it is called an undercomplete
autoencoder.
This way of obtaining reduced-dimensionality data is closely related to PCA; in PCA we also try
to reduce the dimensionality of the original data. In fact, an undercomplete autoencoder with a
linear decoder trained with mean squared error learns the same subspace as PCA.
The loss function for the above process can be written as
L(x, r) = L(x, g(f(x)))
where L is a loss function (such as mean squared error) that penalizes the reconstruction
r = g(f(x)) for being dissimilar from the input x.
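As a concrete illustration, here is a minimal undercomplete autoencoder sketch in Python with tf.keras, assuming flattened 784-dimensional inputs such as MNIST and an illustrative 32-dimensional code; the layer sizes are arbitrary choices, not something prescribed by these notes.
import tensorflow as tf
from tensorflow.keras import layers, Model

# Encoder f: maps the 784-dimensional input x to a 32-dimensional code h (32 < 784, so undercomplete).
inputs = tf.keras.Input(shape=(784,))
h = layers.Dense(32, activation='relu')(inputs)

# Decoder g: maps the code h back to a 784-dimensional reconstruction r = g(f(x)).
r = layers.Dense(784, activation='sigmoid')(h)

autoencoder = Model(inputs, r)

# The loss L(x, g(f(x))) penalizes reconstructions that differ from the input,
# measured here with mean squared error.
autoencoder.compile(optimizer='adam', loss='mse')

# Training uses the input itself as the target, e.g. autoencoder.fit(x_train, x_train, ...)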
Regularized Autoencoders
In undercomplete autoencoders, we have the coding dimension to be less than the input
dimension.
We also have overcomplete autoencoders, in which the coding dimension is equal to or greater than
the input dimension. But this again raises the issue of the model not learning any useful features
and simply copying the input.
One solution to the above problem is the use of regularized autoencoders. When training a
regularized autoencoder, we need not make it undercomplete. We can choose the coding
dimension and the capacity of the encoder and decoder according to the task at hand.
To properly train a regularized autoencoder, we choose loss functions that help the model to
learn better and capture all the essential features of the input data.
Next, we will take a look at two common ways of implementing regularized autoencoders.
Sparse Autoencoders
In sparse autoencoders, we use a reconstruction loss function together with an additional penalty for sparsity.
Specifically, we can define the loss function as,
L(x,g(f(x))) + Ω(h)
where Ω(h) is the additional sparsity penalty on the code h.
The following is an image showing MNIST digits. The first row shows the original images and
the second row shows the images reconstructed by a sparse autoencoder.
Adding a penalty such as the sparsity penalty helps the autoencoder to capture many of the
useful features of data and not simply copy it.
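One way the sparsity penalty Ω(h) might be realized in code is with an L1 activity regularizer on the code layer; this is a hedged sketch, and the 128-unit code and the 1e-5 weight are illustrative assumptions.
import tensorflow as tf
from tensorflow.keras import layers, regularizers, Model

inputs = tf.keras.Input(shape=(784,))

# The activity regularizer adds Omega(h) = lambda * sum(|h_i|) to the training loss,
# pushing most code activations toward zero (a sparse code).
h = layers.Dense(128, activation='relu',
                 activity_regularizer=regularizers.l1(1e-5))(inputs)
r = layers.Dense(784, activation='sigmoid')(h)

sparse_autoencoder = Model(inputs, r)

# Total objective: L(x, g(f(x))) + Omega(h), with L chosen here as mean squared error.
sparse_autoencoder.compile(optimizer='adam', loss='mse')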
Denoising Autoencoders
In sparse autoencoders, we have seen how the loss function has an additional penalty for the
proper coding of the input data.
But what if we want to achieve similar results without adding the penalty? In that case, we can
use something known as denoising autoencoder.
We can change the reconstruction procedure of the decoder to achieve that. Until now we have
seen the decoder reconstruction as r = g(f(x)) and the loss function as L(x,g(f(x))).
Now, consider adding noise to the input data to make it x~ instead of x. Then the loss function
becomes,
L(x,g(f(x~)))
The following image shows how denoising autoencoder works. The second row shows the
reconstructed images after the decoder has cleared out the noise.
For a proper learning procedure, now the autoencoder will have to minimize the above loss
function. And to do that, it first will have to cancel out the noise, and then perform the decoding.
In a denoising autoencoder, the model cannot simply copy the input to the output, as that would
reproduce the noise. And because the input is corrupted before encoding, we can even use
overcomplete autoencoders here without facing any problems.
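A sketch of this training setup, reusing the autoencoder from the earlier sketch and assuming x_train is a clean dataset scaled to [0, 1]; the 0.3 noise factor and 10 epochs are arbitrary illustrative values.
import numpy as np

# Corrupt the inputs to obtain x~ while keeping the clean x as the target.
noise_factor = 0.3
x_noisy = x_train + noise_factor * np.random.normal(size=x_train.shape)
x_noisy = np.clip(x_noisy, 0.0, 1.0).astype('float32')

# The model minimizes L(x, g(f(x~))): reconstruct the clean image from the noisy one.
autoencoder.fit(x_noisy, x_train, epochs=10, batch_size=128)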
Applications of autoencoders
So far we have seen a variety of autoencoders, and each of them is good at a specific task.
Let's look at some of the tasks they can perform.
Data Compression
Although autoencoders are designed for data compression, they are hardly used for this
purpose in practical situations. The reasons are:
Lossy compression: The output of the autoencoder is not exactly the same as the input; it is a
close but degraded representation. For lossless compression, they are not the way to go.
Data-specific: Autoencoders are only able to meaningfully compress data similar to what they
have been trained on. Since they learn features specific to the given training data, they are
different from a standard data compression algorithm like JPEG or gzip. Hence, we can't expect
an autoencoder trained on handwritten digits to compress landscape photos.
Since we have more efficient and simpler algorithms like JPEG, LZMA, and LZSS (used in WinRAR
in tandem with Huffman coding), autoencoders are not generally used for compression. They have,
however, seen use for image denoising and dimensionality reduction in recent years.
Image Denoising
Autoencoders are very good at denoising images. When an image gets corrupted or there is a
bit of noise in it, we call this image a noisy image.
To obtain proper information about the content of the image, we perform image denoising.
Stochastic Encoders and Decoders
Given a hidden code h, we may think of the decoder as providing a conditional distribution
pdecoder(x|h). We may then train the autoencoder by minimizing −log pdecoder(x|h).
• If x is real-valued (Gaussian), the negative log-likelihood yields a mean squared error loss.
• If x is binary (Bernoulli), it yields sigmoid output units trained with binary cross-entropy.
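In code terms, a hedged sketch of the two corresponding loss choices, using standard tf.keras loss objects; x and reconstruction are assumed tensors of the same shape.
import tensorflow as tf

# Gaussian pdecoder(x|h) with fixed variance:
#   -log p(x|h) = 0.5 * ||x - g(h)||^2 + const  ->  mean squared error.
gaussian_nll = tf.keras.losses.MeanSquaredError()

# Bernoulli pdecoder(x|h), e.g. binary pixels:
#   -log p(x|h) = -sum_i [ x_i log g_i(h) + (1 - x_i) log(1 - g_i(h)) ]
#   ->  sigmoid outputs trained with binary cross-entropy.
bernoulli_nll = tf.keras.losses.BinaryCrossentropy()

# Example usage: loss_value = gaussian_nll(x, reconstruction)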
Learning Manifolds with Autoencoder
Tangent planes: At a point x on a d-dimensional manifold, the tangent plane is given by d basis
vectors that span the local directions of variation allowed on the manifold.
Like all autoencoders, the training procedure balances two forces:
• the reconstruction error, which requires the hidden representation to retain enough information to reconstruct the training example, and
• an architectural constraint (such as a limited code dimension) or a regularization term.
The two forces together are useful because they force the hidden representation to capture
information about the structure of the data-generating distribution. The autoencoder can afford to
represent only the variations that are needed to reconstruct training examples. The encoder learns a
mapping from the input space x to a representation space, a mapping that is sensitive only to
changes along the manifold directions and insensitive to changes orthogonal to the manifold.
How to characterize a manifold: by a representation of the data points on (or near) the manifold.
Such a representation for a particular example is also called an “embedding”. It is typically given
by a low-dimensional vector, with fewer dimensions than the ambient space of which the manifold is
a low-dimensional subset.
Nearest Neighbour Graph:
Nonparametric manifold-learning algorithms build a nearest-neighbour graph over the training
examples and associate each node with a locally estimated tangent plane; a global coordinate system
can then be obtained through an optimization or by solving a linear system.
The fundamental problem with such local nonparametric approaches is that if the manifold is not
very smooth, one may need a very large number of training examples to cover each of its variations,
with no chance to generalize to unseen variations. This motivates the use of distributed
representations and deep learning for capturing manifold structure.
Generative Adversarial Networks (GANs)
There have been a lot of advances in image classification, mostly thanks to convolutional neural
networks.
It turns out these same networks can be turned around and applied to image generation as well:
if we've got a bunch of images, how can we generate more like them?
To gain some intuition, think of a back-and-forth situation between a bank and a money
counterfeiter. At the beginning, the fakes are easy to spot. However, as the counterfeiter keeps
trying different kinds of techniques, some may get past the check. The counterfeiter then can
improve his fakes towards the areas that got past the bank's security checks.
But the bank doesn't give up. It also keeps learning how to tell the fakes apart from real money.
After a long period of back-and-forth, the competition has led the money counterfeiter to create
perfect replicas.
Now, take that same situation, but let the money forger have a spy in the bank that reports back
how the bank is telling fakes apart from real money.
Every time the bank comes up with a new strategy to tell apart fakes, such as using ultraviolet
light, the counterfeiter knows exactly what to do to bypass it, such as replacing the material
with ultraviolet marked cloth.
The second situation is essentially what a generative adversarial network does. The bank is
known as a discriminator network, and in the case of images, is a convolutional neural network
that assigns a probability that an image is real and not fake.
The counterfeiter is known as the generative network, and is a special kind of convolutional
network that uses transpose convolutions, sometimes known as a deconvolutional network.
This generative network takes in some 100 parameters of noise (sometimes known as the code)
and outputs an image accordingly.
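A hedged sketch of such a generator in tf.keras, mapping a 100-dimensional noise code up to a 64x64 RGB image with transpose convolutions; the layer widths, kernel sizes, and strides are illustrative choices, not the exact architecture described here.
import tensorflow as tf
from tensorflow.keras import layers

def build_generator(noise_dim=100):
    # Project the noise code to a small spatial feature map, then repeatedly upsample
    # with transpose ("de")convolutions until the output reaches 64x64x3.
    return tf.keras.Sequential([
        layers.Dense(4 * 4 * 256, activation='relu', input_shape=(noise_dim,)),
        layers.Reshape((4, 4, 256)),
        layers.Conv2DTranspose(128, 4, strides=2, padding='same', activation='relu'),  # 8x8
        layers.Conv2DTranspose(64, 4, strides=2, padding='same', activation='relu'),   # 16x16
        layers.Conv2DTranspose(32, 4, strides=2, padding='same', activation='relu'),   # 32x32
        layers.Conv2DTranspose(3, 4, strides=2, padding='same', activation='tanh'),    # 64x64x3
    ])

# Example: fake_images = build_generator()(tf.random.normal([batch_size, 100]))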
Which part was the spy? Since the discriminator was just a convolutional neural network, we
can backpropagate to find the gradients of the input image. This tells us what parts of the image
to change in order to yield a greater probability of being real in the eyes of the discriminator
network.
All that's left is to update the weights of our generative network with respect to these gradients,
so the generative network outputs images that are more "real" than before.
The two networks are now locked in a competition. The discriminative network is constantly
trying to find differences between the fake images and real images, and the generative network
keeps getting closer and closer to the real deal. In the end, we've trained the generative network
to produce images that can't be differentiated from real images.
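The competition described above can be sketched as a single training step; this is a minimal, hedged example assuming a generator and a convolutional discriminator (outputting a real/fake logit) have already been built with tf.keras.
import tensorflow as tf

bce = tf.keras.losses.BinaryCrossentropy(from_logits=True)
gen_opt = tf.keras.optimizers.Adam(1e-4)
disc_opt = tf.keras.optimizers.Adam(1e-4)

def train_step(real_images, generator, discriminator, noise_dim=100):
    noise = tf.random.normal([tf.shape(real_images)[0], noise_dim])
    with tf.GradientTape() as gen_tape, tf.GradientTape() as disc_tape:
        fake_images = generator(noise, training=True)
        real_logits = discriminator(real_images, training=True)
        fake_logits = discriminator(fake_images, training=True)
        # Discriminator: assign high probability to real images, low to generated ones.
        disc_loss = (bce(tf.ones_like(real_logits), real_logits) +
                     bce(tf.zeros_like(fake_logits), fake_logits))
        # Generator: gradients flow back through the discriminator, indicating how to
        # change the generated images so they look more "real" to it.
        gen_loss = bce(tf.ones_like(fake_logits), fake_logits)
    disc_grads = disc_tape.gradient(disc_loss, discriminator.trainable_variables)
    gen_grads = gen_tape.gradient(gen_loss, generator.trainable_variables)
    disc_opt.apply_gradients(zip(disc_grads, discriminator.trainable_variables))
    gen_opt.apply_gradients(zip(gen_grads, generator.trainable_variables))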
How well does this work? One implementation in TensorFlow was trained on various image sets such
as CIFAR-10 and 64x64 ImageNet samples.
In these samples, 64 images were generated at different iterations of learning. In the beginning,
all the samples are roughly the same brownish color. However, even at iteration 200, some
hints of variation can be spotted. By iteration 900, different colors have emerged, although the
generated images still do not resemble anything. At iteration 5700, the generated images aren't
blurry anymore, but there are no actual objects in the images.
After letting the network run for a couple of hours on a GPU, I was glad to see that nothing broke.
In fact, the generated images are looking pretty close to the real deal. You can see actual objects
now, such as some ducks and cars.
What happens if we scale it up? CIFAR is only 32x32, so let's try ImageNet. I downloaded a set of
150,000 images from the ImageNet 2012 Challenge and rescaled them all to 64x64.
The progression here is basically the same as before. It starts out with some brown blobs, learns
to add color and some lighting, and finally learns the look and feel of real images.
Here are the generated ImageNet samples from the last iteration I trained. These definitely look
a lot better than the earlier iterations. However, especially at this higher resolution, some
problems become apparent.
When generating from CIFAR or ImageNet, there is no concept of classes in the generative
adversarial network. The network is not learning to make images of cars and ducks; it is
learning to make images that look real in general. The problem is, this results in images
that may combine features from all sorts of objects, like the outline of a table
but the coloring of a frog.
Improving GANs and InfoGAN both involve adding multiple objectives to the
discriminator's cost function, which is a good idea. In a simple GAN, the discriminator only
has one idea of what an incredibly "real" image looks like. This leads to the generator either
collapsing to produce only one image no matter what noise it starts with, or producing only
images that have some resemblance to real features but no distinct uniqueness, like our
ImageNet generator.
Improving GANs adds in minibatch discrimination, which is a fancy way of making sure
features within various samples remain varied.
Meanwhile, InfoGAN tries to correlate the initial noise with features in the generated image, so
you can do things such as adjust one of the initial noise variables to change the angle of an
object.
In a plain GAN, the initial noise variables suffer from the same problem as features in a typical
neural network: although they make sense when put together, it's hard to tell what each of
them does individually.
A quick generation script kept all 200 initial noise values constant, except for one that was
linearly adjusted from -1 to 1. The most common result is that some color becomes more prominent
in a certain region of the image.
Unfortunately, this means the generator network has not learned what it means to represent an
object. All it's doing is creating an image that has features that might be present in a photograph,
such as distinct color regions and shadows.
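A hedged sketch of the kind of generation script described above; the generator model and the noise_dim of 100 are illustrative assumptions.
import numpy as np

noise_dim = 100  # size of the generator's input code (assumed)
base_noise = np.random.normal(size=(1, noise_dim)).astype('float32')

images = []
for value in np.linspace(-1.0, 1.0, num=9):
    noise = base_noise.copy()
    noise[0, 0] = value  # vary a single noise variable, keep all the others fixed
    images.append(generator(noise, training=False))
# Inspecting `images` shows how the output changes as that one variable moves from -1 to 1.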
Adding some secondary objectives, such as correlating initial noise and features present, could
add some more concrete value to this noise, and result in images that look less like a mix
between multiple objects.
The problem of generating realistic-looking images is complex. Given how simple the generative
adversarial network is, it has done a pretty good job of creating images that look real, at
least from a distance.
Variational Autoencoders
There were a couple of downsides to using a plain GAN.
First, the images are generated off some arbitrary noise. If you wanted to generate a picture
with specific features, there's no way of determining which initial noise values would produce
that picture, other than searching over the entire distribution.
Second, a generative adversarial model only discriminates between "real" and "fake" images.
There are no constraints that an image of a cat has to look like a cat. This leads to results where
there's no actual object in a generated image, but the style just looks like a picture.
To get an understanding of a VAE, we'll first start from a simple network and add parts step by
step.
Let's say we had a network comprised of a few deconvolution layers. We set the input to always
be a vector of ones. Then, we can train the network to reduce the mean squared error between
itself and one target image. The "data" for that image is now contained within the network's
parameters.
Now, let's try it on multiple images. Instead of a vector of ones, we'll use a one-hot vector for
the input. [1, 0, 0, 0] could mean a cat image, while [0, 1, 0, 0] could mean a dog. This works,
but we can only store up to 4 images. Using a longer vector means adding in more and more
parameters so the network can memorize the different images.
To fix this, we use a vector of real numbers instead of a one-hot vector. We can think of this
as a code for an image, which is where the terms encode/decode come from. For example, [3.3,
4.5, 2.1, 9.8] could represent the cat image, while [3.4, 2.1, 6.7, 4.2] could represent the dog.
This initial vector is known as our latent variables.
Choosing the latent variables randomly, like I did above, is obviously a bad idea. In an
autoencoder, we add in another component that takes in the original images and encodes them
into vectors for us. The deconvolutional layers then "decode" the vectors back to the original
images.
We've finally reached a stage where our model has some hint of a practical use. We can train
our network on as many images as we want. If we save the encoded vector of an image, we can
reconstruct it later by passing it into the decoder portion. What we have is the standard
autoencoder.
However, we're trying to build a generative model here, not just a fuzzy data structure that can
"memorize" images. We can't generate anything yet, since we don't know how to create latent
vectors other than encoding them from images.
There's a simple solution here. We add a constraint on the encoding network, that forces it to
generate latent vectors that roughly follow a unit gaussian distribution. It is this constraint that
separates a variational autoencoder from a standard one.
Generating new images is now easy: all we need to do is sample a latent vector from the unit
gaussian and pass it into the decoder.
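In code, that generation step might look like the following; this is a hedged sketch where the trained decoder model and the latent size n_z are assumed names.
import numpy as np

z = np.random.normal(size=(1, n_z)).astype('float32')  # sample a latent vector from the unit gaussian
new_image = decoder(z, training=False)                  # decode it into an image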
In practice, there's a tradeoff between how accurate our network can be and how close its latent
variables can match the unit gaussian distribution.
We let the network decide this itself. For our loss term, we sum up two separate losses: the
generative loss, which is a mean squared error that measures how accurately the network
reconstructed the images, and a latent loss, which is the KL divergence that measures how
closely the latent variables match a unit gaussian.
When we're calculating the loss for the decoder network, we can sample a standard normal vector,
scale it by the standard deviations, add the mean, and use the result as our latent vector:
# Reparameterization: z = mean + stddev * epsilon, with epsilon ~ N(0, 1).
samples = tf.random_normal([batchsize, n_z], 0, 1, dtype=tf.float32)
sampled_z = z_mean + (z_stddev * samples)
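Continuing in the same TF1-style notation as the snippet above, the two loss terms might be combined as follows; images (the flattened inputs) and generated_images (the decoder output) are assumed names.
# Reconstruction ("generative") loss: squared error between the input and its reconstruction.
generation_loss = tf.reduce_sum(tf.square(images - generated_images), 1)
# Latent loss: KL divergence between N(z_mean, z_stddev^2) and the unit gaussian N(0, 1).
latent_loss = 0.5 * tf.reduce_sum(
    tf.square(z_mean) + tf.square(z_stddev) - tf.log(tf.square(z_stddev)) - 1, 1)
loss = tf.reduce_mean(generation_loss + latent_loss)
Once trained, sampling z from the unit gaussian and passing it through the decoder generates new images, as described earlier.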
In addition to allowing us to generate random latent variables, this constraint also improves the
generalization of our network.
Let's say you were given a bunch of real numbers in the range [0, 10], each paired with a name.
For example, 5.43 means apple, and 5.44 means banana. When someone gives you the number
5.43, you know for sure they are talking about an apple. We can essentially encode infinite
information this way, since there's no limit on how many different real numbers we can have
between [0, 10].
However, what if Gaussian noise with a standard deviation of one was added every time someone
tried to tell you a number? Now when you receive the number 5.43, the original number could have
been anywhere around [4.4, 6.4], so the other person could just as well have meant banana (5.44).
The greater the standard deviation of the added noise, the less information we can pass using that
one variable.
Now we can apply this same logic to the latent variable passed between the encoder and
decoder. The more efficiently we can encode the original image, the higher we can raise the
standard deviation on our gaussian until it reaches one.
This constraint forces the encoder to be very efficient, creating information-rich latent
variables. This improves generalization, so latent variables that we either randomly generated,
or we got from encoding non-training images, will produce a nicer result when decoded.
A downside to the VAE is that it uses direct mean squared error instead of an adversarial
network, so the network tends to produce more blurry images.
There's been some work looking into combining the VAE and the GAN: Using the same
encoder-decoder setup, but using an adversarial network as a metric for training the decoder.