Auto Encoder

The document discusses autoencoders, which are unsupervised neural networks used for dimensionality reduction and feature learning. An autoencoder learns an encoding for input data in an unsupervised manner by training the network to minimize the reconstruction error between the input and output. Convolutional autoencoders apply this concept using convolutional layers. Regularization techniques like sparse regularization can be used to learn more meaningful features by imposing constraints like sparsity during training.


AUTOENCODERS
Guy Golan

AGENDA
- Unsupervised Learning (Introduction)
- Autoencoder (AE) (with code)
- Convolutional AE (with code)
- Regularization: Sparse
- Denoising AE
- Stacked AE
- Contractive AE
INTRODUCTION TO UNSUPERVISED LEARNING
SUPERVISED LEARNING

Supervised Learning
Data: (X, Y)
Goal: learn a mapping function f where f(X) = Y
SUPERVISED LEARNING

Examples: Classification.
- Decision Trees
- Naïve Bayes
- KNN
- SVM
- Perceptron
- Multi-Layer Perceptron

(Figure: 2-D scatter plot over features X1, X2 with a classification decision boundary)
SUPERVISED LEARNING

Examples: Regression.
- Linear Regression
- Logistic Regression

(Figure: regression example plotted over features X1, X2)
SUPERVISED LEARNING VS UNSUPERVISED LEARNING

01. What happens when our labels are noisy?
- Missing values.
- Labeled incorrectly.

02. What happens when we don't have labels for training at all?
SUPERVISED LEARNING VS UNSUPERVISED LEARNING

Up until now we have encountered mostly Supervised Learning problems and algorithms in this seminar.

Let's talk about Unsupervised Learning.
UNSUPERVISED LEARNING

Unsupervised Learning
Data: X (no labels!)
Goal: Learn the structure of the data
(learn correlations between features)
UNSUPERVISED LEARNING

Examples: Clustering, Compression, Feature & Representation Learning, Dimensionality Reduction, Generative Models, etc.
PCA – PRINCIPAL COMPONENT ANALYSIS

- Statistical approach for data compression and visualization.
- Invented by Karl Pearson in 1901.
- Weakness: linear components only.
TRADITIONAL AUTOENCODER

(Figure: encoder-decoder network mapping the input to a latent code z and back)
TRADITIONAL AUTOENCODER
▪ Unlike PCA, we can now use activation functions to achieve non-linearity.

▪ It has been shown that an AE without non-linear activation functions recovers the same subspace as PCA.
USES
- The autoencoder idea has been part of neural network history for decades (LeCun et al., 1987).
- Autoencoders are not used for general-purpose compression: the learned compression is data-specific and lossy.
- Traditionally an autoencoder is used for dimensionality reduction and feature learning.

- Recently, the connection between autoencoders and latent space modeling has brought autoencoders to the front of generative modeling, as we will see in the next lecture.
SIMPLE IDEA

x → f(x) = z → g(z) = x̂

- Given data x (no labels) we would like to learn the functions f (encoder) and g (decoder) where:

  f(x) = s(wx + b) = z
  g(z) = s(w'z + b') = x̂

  where z is some latent representation or code, s is a non-linearity such as the sigmoid, and x̂ is x's reconstruction, so that h(x) = g(f(x)) = x̂ is an approximation of the identity function.
SIMPLE IDEA
Learning the identity function seems trivial, but
with added constraints on the network (such as
limiting the number of hidden neurons or
regularization) we can learn information
about the structure of the data.

Trying to capture the


distribution of the data
(data specific!)
TRAINING THE AE
Using gradient descent we can train the model like any other fully connected NN:

- Traditionally with a squared-error loss function:

  L(x, x̂) = ‖x − x̂‖²

- If our input is interpreted as bit vectors or vectors of bit probabilities, the cross-entropy can be used:

  H(p, q) = − Σ_x p(x) log q(x)
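A minimal Keras sketch of a fully connected AE trained with the losses above; the 784-dimensional input (e.g. flattened images scaled to [0, 1]), the 32-unit code and the optimizer are assumed choices, not values taken from the slides:

```python
from tensorflow.keras import layers, Model

# Minimal fully connected AE sketch; input and code sizes are assumptions.
inp = layers.Input(shape=(784,))
z = layers.Dense(32, activation='relu')(inp)       # encoder: f(x) = s(wx + b) = z
out = layers.Dense(784, activation='sigmoid')(z)   # decoder: g(z) = s(w'z + b') = x_hat
autoencoder = Model(inp, out)

# Squared-error loss; use 'binary_crossentropy' for bit-probability inputs
autoencoder.compile(optimizer='adam', loss='mse')
# autoencoder.fit(x_train, x_train, epochs=50, batch_size=256,
#                 validation_data=(x_test, x_test))
```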
UNDERCOMPLETE AE VS OVERCOMPLETE AE
We distinguish between two types of AE structures:
UNDERCOMPLETE AE

• The hidden layer is undercomplete if it is smaller than the input layer.
  ❑ Compresses the input.
  ❑ Compresses well only for the training distribution.

• The hidden nodes will be:
  ❑ Good features for the training distribution.
  ❑ Bad for other types of input.

(Figure: network x → f(x) with encoder weights w and decoder weights w' producing x̂)
OVERCOMPLETE AE

• The hidden layer is overcomplete if it is greater than the input layer.
  ❑ No compression in the hidden layer.
  ❑ Each hidden unit could copy a different input component.

• No guarantee that the hidden units will extract meaningful structure.

• Adding dimensions is good for training a linear classifier (XOR case example).

• A higher-dimensional code helps model a more complex distribution.
DEEP AUTOENCODER EXAMPLE
https://cs.stanford.edu/people/karpathy/convnetjs/demo/autoencoder.html - by Andrej Karpathy
SIMPLE LATENT SPACE INTERPOLATION - KERAS

(Figure: two inputs are passed through the encoder to obtain codes z1 and z2)
SIMPLE LATENT SPACE INTERPOLATION - KERAS

z_i = α·z1 + (1 − α)·z2

(Figure: the interpolated code z_i is passed through the decoder)
SIMPLE LATENT SPACE INTERPOLATION – KERAS CODE EXAMPLE
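A minimal sketch of the interpolation, assuming `encoder` and `decoder` are the two halves of a trained Keras AE and `x1`, `x2` are two preprocessed inputs with a batch dimension (all names here are placeholders, not taken from the original code slide):

```python
import numpy as np

# Assumed: `encoder`/`decoder` are trained Keras models, x1/x2 shaped (1, input_dim)
z1 = encoder.predict(x1)
z2 = encoder.predict(x2)

# Walk the latent space between the two codes and decode each step
for alpha in np.linspace(0.0, 1.0, num=10):
    z_i = alpha * z1 + (1.0 - alpha) * z2
    x_i = decoder.predict(z_i)   # decoded output for the interpolated code
```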
SIMPLE LATENT SPACE – INTERPOLATION - KERAS
CONVOLUTIONAL AE
* Input values are normalized.
* All conv layers use ReLU activations except the last conv layer, which uses a sigmoid.

CONVOLUTIONAL AE

Encoder → hidden code → decoder, layer by layer:

Input                                             (28, 28, 1)
Conv 1     16 filters @ (3,3,1), same, relu     → (28, 28, 16)
MaxPool 1  (2,2)                                → (14, 14, 16)
Conv 2      8 filters @ (3,3,16), same, relu    → (14, 14, 8)
MaxPool 2  (2,2)                                → (7, 7, 8)
Conv 3      8 filters @ (3,3,8), same, relu     → (7, 7, 8)
MaxPool 3  (2,2)                                → (4, 4, 8)    ← hidden code
DeConv 1    8 filters @ (3,3,8), same, relu     → (4, 4, 8)
UpSample 1 (2,2)                                → (8, 8, 8)
DeConv 2    8 filters @ (3,3,8), same, relu     → (8, 8, 8)
UpSample 2 (2,2)                                → (16, 16, 8)
DeConv 3   16 filters @ (3,3,8), valid, relu    → (14, 14, 16)
UpSample 3 (2,2)                                → (28, 28, 16)
DeConv 4    1 filter @ (5,5,16), same, sigmoid  → (28, 28, 1)   Output

Encoder: Conv 1 through MaxPool 3. Decoder: DeConv 1 through DeConv 4.
CONVOLUTIONAL AE – KERAS EXAMPLE
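A Keras sketch that follows the layer table above; the optimizer, loss and the commented-out training call are assumptions rather than values taken from the slides:

```python
from tensorflow.keras import layers, Model

inp = layers.Input(shape=(28, 28, 1))

# Encoder: three conv + max-pooling blocks down to the (4, 4, 8) code
x = layers.Conv2D(16, (3, 3), activation='relu', padding='same')(inp)
x = layers.MaxPooling2D((2, 2), padding='same')(x)
x = layers.Conv2D(8, (3, 3), activation='relu', padding='same')(x)
x = layers.MaxPooling2D((2, 2), padding='same')(x)
x = layers.Conv2D(8, (3, 3), activation='relu', padding='same')(x)
encoded = layers.MaxPooling2D((2, 2), padding='same')(x)             # (4, 4, 8)

# Decoder: conv + up-sampling blocks back to (28, 28, 1)
x = layers.Conv2D(8, (3, 3), activation='relu', padding='same')(encoded)
x = layers.UpSampling2D((2, 2))(x)
x = layers.Conv2D(8, (3, 3), activation='relu', padding='same')(x)
x = layers.UpSampling2D((2, 2))(x)
x = layers.Conv2D(16, (3, 3), activation='relu', padding='valid')(x)  # (14, 14, 16)
x = layers.UpSampling2D((2, 2))(x)
decoded = layers.Conv2D(1, (5, 5), activation='sigmoid', padding='same')(x)

autoencoder = Model(inp, decoded)
autoencoder.compile(optimizer='adam', loss='binary_crossentropy')
# autoencoder.fit(x_train, x_train, epochs=50, batch_size=128,
#                 validation_data=(x_test, x_test))
```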
CONVOLUTIONAL AE – KERAS EXAMPLE RESULTS
- 50 epochs.
- 88% accuracy on validation set.
REGULARIZATION
Motivation:
- We would like to learn meaningful features without altering
the code’s dimensions (Overcomplete or Undercomplete).

The solution: imposing other constraints on the network.


SPARSELY REGULARIZED AUTOENCODERS

(Figure: activation maps, a bad example)
SPARSELY REGULARIZED AUTOENCODERS

- We want our learned features to be as sparse as possible.
- With sparse features we can generalize better.

(Figure: an image decomposed as a weighted sum of basis images with weights 1, 1, 1, 1, 1, 1, 1, 0.8, 0.8)
SPARSELY REGULARIZED AUTOENCODERS

Recall:
- a_j^(Bn) is defined to be the activation of the j-th hidden (bottleneck) unit of the autoencoder.
- Let a_j^(Bn)(x) be the activation of this specific node on a given input x.
SPARSELY REGULARIZED AUTOENCODERS

Further, let

  ρ̂_j = (1/m) Σ_{i=1}^{m} a_j^(Bn)(x^(i))

be the average activation of hidden unit j (over the training set).

Thus we would like to enforce the constraint

  ρ̂_j = ρ

where ρ is a "sparsity parameter", typically small. In other words, we want the average activation of each hidden neuron j to be close to ρ.
SPARSELY REGULARIZED AUTOENCODERS

- We need to penalize ρ̂_j for deviating from ρ.
- Many choices of the penalty term will give reasonable results.
- For example:

  Σ_{j=1}^{Bn} KL(ρ ‖ ρ̂_j)

  where KL(ρ ‖ ρ̂_j) is a Kullback-Leibler divergence.
SPARSELY REGULARIZED AUTOENCODERS

- A reminder: KL divergence is a standard function for measuring how different two distributions are. Here it is taken between two Bernoulli distributions with means ρ and ρ̂_j:

  KL(ρ ‖ ρ̂_j) = ρ log(ρ / ρ̂_j) + (1 − ρ) log((1 − ρ) / (1 − ρ̂_j))

- It has the properties: KL(ρ ‖ ρ̂_j) = 0 if ρ̂_j = ρ, and otherwise it increases monotonically as ρ̂_j moves away from ρ.

(Figure: KL(ρ ‖ ρ̂_j) as a function of ρ̂_j for ρ = 0.2)
SPARSELY REGULARIZED AUTOENCODERS

- Our overall cost function is now:

  J_S(W, b) = J(W, b) + β Σ_{j=1}^{Bn} KL(ρ ‖ ρ̂_j)

*Note: we need to know ρ̂_j beforehand, so we have to compute a forward pass over the whole training set.
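A sketch of the sparsity penalty as a custom Keras activity regularizer; for practicality it estimates ρ̂_j from the current mini-batch rather than from a full pass over the training set, and the values of ρ, β and the layer sizes are assumptions:

```python
import tensorflow as tf
from tensorflow.keras import layers, regularizers, Model

class KLSparsity(regularizers.Regularizer):
    """beta * sum_j KL(rho || rho_hat_j); rho_hat_j is approximated by the
    mean activation of unit j over the current mini-batch."""
    def __init__(self, rho=0.05, beta=1e-3):       # assumed values
        self.rho, self.beta = rho, beta

    def __call__(self, activations):
        rho, eps = self.rho, 1e-10
        rho_hat = tf.reduce_mean(activations, axis=0)    # per-unit mean activation
        kl = (rho * tf.math.log(rho / (rho_hat + eps)) +
              (1 - rho) * tf.math.log((1 - rho) / (1 - rho_hat + eps)))
        return self.beta * tf.reduce_sum(kl)

inp = layers.Input(shape=(784,))                          # assumed input size
code = layers.Dense(64, activation='sigmoid',
                    activity_regularizer=KLSparsity())(inp)
out = layers.Dense(784, activation='sigmoid')(code)
sparse_ae = Model(inp, out)
sparse_ae.compile(optimizer='adam', loss='binary_crossentropy')
```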
DENOISING AUTOENCODERS
Intuition:
- We still aim to encode the input and NOT to simply mimic the identity function.
- We try to undo the effect of a corruption process stochastically applied to the input.
- The result is a more robust model.

(Figure: noisy input → Encoder → latent space representation → Decoder → denoised input)
DENOISING AUTOENCODERS
Use Case:
- Extract a robust representation for a NN classifier.

(Figure: noisy input → Encoder → latent space representation)
DENOISING AUTOENCODERS
Instead of trying to mimic the identity function by minimizing

  L(x, g(f(x)))

where L is some loss function, a DAE instead minimizes

  L(x, g(f(x̃)))

where x̃ is a copy of x that has been corrupted by some form of noise.
DENOISING AUTOENCODERS

Idea: a representation robust against noise, obtained by corrupting the input with either:
- Random assignment of a subset of the inputs to 0, with probability v.
- Gaussian additive noise.

(Figure: network x → f(x) with encoder weights w and decoder weights w' producing x̂)
DENOISING AUTOENCODERS

• The reconstruction x̂ is computed from the corrupted input x̃.
• The loss function compares the reconstruction x̂ with the noiseless x.

❖ The autoencoder cannot fully trust each feature of x independently, so it must learn the correlations between x's features.
❖ Based on those relations we obtain a model that is less prone to changes in the input.

➢ We are forcing the hidden layer to learn a generalized structure of the data.

(Figure: a noise process p(x̃|x) corrupts x into x̃ (some components set to 0); x̃ is encoded with weights w and decoded with weights w' into x̂)
DENOISING AUTOENCODERS - PROCESS

Take some input x and apply noise to obtain x̃.
DENOISING AUTOENCODERS - PROCESS

x̃ → DAE (encode and decode) → g(f(x̃))
DENOISING AUTOENCODERS - PROCESS

The DAE outputs the reconstruction x̂ = g(f(x̃)).
DENOISING AUTOENCODERS - PROCESS

Compare the reconstruction x̂ with the original (noiseless) x.
DENOISING AUTOENCODERS
DENOISING CONVOLUTIONAL AE – KERAS
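A sketch of the denoising setup, assuming `autoencoder` is the convolutional AE from the previous section and `x_train` / `x_test` are image tensors scaled to [0, 1]; the noise factor of 0.5 and the 50 epochs follow the results below:

```python
import numpy as np

noise_factor = 0.5
x_train_noisy = x_train + noise_factor * np.random.normal(size=x_train.shape)
x_test_noisy = x_test + noise_factor * np.random.normal(size=x_test.shape)
x_train_noisy = np.clip(x_train_noisy, 0.0, 1.0)
x_test_noisy = np.clip(x_test_noisy, 0.0, 1.0)

# Learn to reconstruct the clean images from their corrupted versions
autoencoder.fit(x_train_noisy, x_train,
                epochs=50, batch_size=128,
                validation_data=(x_test_noisy, x_test))
```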
- 50 epochs.
- Noise factor 0.5
- 92% accuracy on validation set.
STACKED AE
- Motivation:
❑ We want to harness the feature-extraction quality of an AE to our advantage.
❑ For example: we can build a deep supervised classifier whose input is the output of an SAE.
❑ The benefit: our deep model's weights are not randomly initialized but are rather "smartly selected".
❑ Also, using this unsupervised technique lets us exploit a larger, unlabeled dataset.
STACKED AE
- Building a SAE consists of two phases:
1. Train each AE layer one after the other.
2. Connect any classifier (SVM / FC NN layer etc.)
STACKED AE

x → SAE → Classifier → y
STACKED AE – TRAIN PROCESS
First Layer Training (AE 1)

x → f1(x) = z1 → g1(z1) = x̂
STACKED AE – TRAIN PROCESS
Second Layer Training (AE 2)

x → f1(x) = z1 → f2(z1) = z2 → g2(z2) = ẑ1
STACKED AE – TRAIN PROCESS
Add any classifier

x → f1(x) = z1 → f2(z1) = z2 → Classifier → Output
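A sketch of the greedy layer-wise procedure shown above; the layer sizes, the 10-class head and the flattened 784-dimensional `x_train` are assumptions for illustration:

```python
from tensorflow.keras import layers, Model

def train_layer_ae(data, code_dim, epochs=10):
    """Train one dense AE on `data` and return only its encoder f_i."""
    inp = layers.Input(shape=(data.shape[1],))
    code = layers.Dense(code_dim, activation='relu')(inp)
    recon = layers.Dense(data.shape[1], activation='linear')(code)
    ae = Model(inp, recon)
    ae.compile(optimizer='adam', loss='mse')
    ae.fit(data, data, epochs=epochs, batch_size=256, verbose=0)
    return Model(inp, code)

# Phase 1: train each AE layer one after the other (x_train assumed (n, 784))
enc1 = train_layer_ae(x_train, 128)          # AE 1: x  -> z1
z1 = enc1.predict(x_train)
enc2 = train_layer_ae(z1, 64)                # AE 2: z1 -> z2

# Phase 2: stack the trained encoders and connect any classifier on top
inp = layers.Input(shape=(784,))
features = enc2(enc1(inp))
out = layers.Dense(10, activation='softmax')(features)   # e.g. a 10-class head
classifier = Model(inp, out)
classifier.compile(optimizer='adam', loss='sparse_categorical_crossentropy',
                   metrics=['accuracy'])
# classifier.fit(x_train, y_train, ...)   # optional end-to-end fine-tuning
```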
CONTRACTIVE AUTOENCODERS

- We are still trying to avoid learning uninteresting features.
- Here we add a regularization term Ω(x) to our loss function to constrain the hidden layer.

(Figure: network x → hidden layer with weights w → reconstruction x̂ with weights w')
CONTRACTIVE AUTOENCODERS

- Idea: we wish to extract features that only reflect variations observed in the training set, and to be invariant to all other variations.

- Points close to each other in the input space maintain that property in the latent space.

- The regularizer is the (squared) Frobenius norm of the Jacobian matrix of the encoder activations with respect to the input.
CONTRACTIVE AUTOENCODERS
Definitions and reminders:

- Frobenius norm (L2): ‖A‖_F = sqrt( Σ_{i,j} a_ij² )
  The Frobenius norm of a matrix is simply the ℓ2 norm applied after flattening the matrix into a vector.

- Jacobian matrix: when y = f(x) is a vector-valued function of a vector x, the most natural interpretation of the derivative of y with respect to x is a matrix, called the Jacobian, containing the partial derivatives of each component of y with respect to each component of x:

  J_f(x) = ∂f(x)/∂x =
    [ ∂f(x)_1/∂x_1  ⋯  ∂f(x)_1/∂x_n ]
    [       ⋮        ⋱        ⋮      ]
    [ ∂f(x)_m/∂x_1  ⋯  ∂f(x)_m/∂x_n ]
CONTRACTIVE AUTOENCODERS
Our new loss function would be:

  L*(x) = L(x) + λ Ω(x)

where

  Ω(x) = ‖J_f(x)‖²_F = Σ_{i,j} ( ∂f(x)_j / ∂x_i )²

and where λ controls the balance between our reconstruction objective and the hidden layer "flatness".
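A sketch of a contractive AE that uses the closed form of the Jacobian penalty for a sigmoid code (∂h_j/∂x_i = h_j(1 − h_j) w_ij), written as a subclassed Keras model with a custom training step; the layer sizes and λ are assumptions:

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

class ContractiveAE(Model):
    """Minimal contractive AE sketch (assumed sizes: 784 -> 64 -> 784)."""
    def __init__(self, lam=1e-4):
        super().__init__()
        self.lam = lam
        self.enc = layers.Dense(64, activation='sigmoid')
        self.dec = layers.Dense(784, activation='sigmoid')

    def call(self, x):
        return self.dec(self.enc(x))

    def train_step(self, data):
        x = data[0] if isinstance(data, (tuple, list)) else data
        with tf.GradientTape() as tape:
            h = self.enc(x)
            x_hat = self.dec(h)
            recon = tf.reduce_mean(tf.reduce_sum(tf.square(x - x_hat), axis=1))
            # ||J_f(x)||_F^2 = sum_j (h_j (1 - h_j))^2 * sum_i w_ij^2
            w_sq = tf.reduce_sum(tf.square(self.enc.kernel), axis=0)
            penalty = tf.reduce_sum(tf.square(h * (1.0 - h)) * w_sq, axis=1)
            loss = recon + self.lam * tf.reduce_mean(penalty)
        grads = tape.gradient(loss, self.trainable_variables)
        self.optimizer.apply_gradients(zip(grads, self.trainable_variables))
        return {"loss": loss}

# cae = ContractiveAE(); cae.compile(optimizer='adam')
# cae.fit(x_train, x_train, epochs=50, batch_size=128)
```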
CONTRACTIVE AUTOENCODERS
Our new loss function would be:

  L*(x) = L(x) + λ Ω(x)

- L(x) alone (λ → 0) would give an encoder that keeps good information.
- Ω(x) alone (λ → ∞) would give an encoder that throws away all information.
- The combination gives an encoder that keeps only the good information.
CONTRACTIVE AUTOENCODERS
- The encoder does not need to be sensitive to variations that are not observed in the training data.
- The encoder must be sensitive to variations observed in the training data in order to reconstruct well.
WHICH AUTOENCODER?
- DAEs make the reconstruction function resist small, finite-sized perturbations of the input.
- CAEs make the feature-encoding function resist small, infinitesimal perturbations of the input.

- Both denoising AEs and contractive AEs perform well!


WHICH AUTOENCODER?
❑ Advantage of DAE: simpler to implement.
  - Requires adding one or two lines of code to a regular AE.
  - No need to compute the Jacobian of the hidden layer.

❑ Advantage of CAE: the gradient is deterministic.
  - Might be more stable than DAE, which uses a sampled gradient.
  - One less hyper-parameter to tune (the noise factor).
SUMMARY
REFERENCES
1. https://arxiv.org/pdf/1206.5538.pdf
2. http://www.deeplearningbook.org/contents/autoencoders.html
3. http://deeplearning.net/tutorial/dA.html
4. http://ufldl.stanford.edu/tutorial/unsupervised/Autoencoders/
5. http://ufldl.stanford.edu/wiki/index.php/Stacked_Autoencoders
6. http://www.jmlr.org/papers/volume11/vincent10a/vincent10a.pdf
7. https://codeburst.io/deep-learning-types-and-autoencoders-a40ee6754663
QUESTIONS?
