Auto Encoder
AGENDA
- Unsupervised Learning (Introduction)
- Autoencoder (AE) (with code)
- Convolutional AE (with code)
- Regularization: Sparse
- Denoising AE
- Stacked AE
- Contractive AE
INTRODUCTION TO UNSUPERVISED LEARNING
SUPERVISED LEARNING
Data: (X, Y)
Goal: Learn a mapping function f such that f(X) = Y
SUPERVISED LEARNING
Examples: Classification.
- KNN
- SVM
- Multi-Layer Perceptron
[Figure: classification example in the (X1, X2) feature plane]
SUPERVISED LEARNING
Examples: Regression.
- Linear Regression
- Logistic Regression
[Figure: regression example in the (X1, X2) plane]
SUPERVISED LEARNING VS UNSUPERVISED LEARNING
01. What happens when our labels are noisy?
• Missing values.
• Labeled incorrectly.
02. What happens when we don't have labels for training at all?
SUPERVISED LEARNING VS UNSUPERVISED LEARNING
Unsupervised Learning
Data: X (no labels!)
Goal: Learn the structure of the data (learn correlations between features)
UNSUPERVISED LEARNING
$$f(x) = s(wx + b) = z$$
and
$$g(z) = s(w'z + b') = \hat{x}$$
such that
$$h(x) = g(f(x)) = \hat{x},$$
where $z$ is some latent representation or code, $s$ is a non-linearity such as the sigmoid, $h$ is an approximation of the identity function, and $\hat{x}$ is $x$'s reconstruction.
SIMPLE IDEA
Learning the identity function seems trivial, but
with added constraints on the network (such as
limiting the number of hidden neurons or
regularization) we can learn information
about the structure of the data.
Typical reconstruction losses are the squared error and the cross-entropy:
$$L(x, \hat{x}) = \| x - \hat{x} \|^2$$
$$H(p, q) = -\sum_{x} p(x) \log q(x)$$
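Below is a minimal sketch of such an autoencoder, assuming Keras with the TensorFlow backend and MNIST as the data; the 784 → 32 → 784 layer sizes and the hyperparameters are illustrative choices, not the deck's original code:

```python
# Minimal fully connected autoencoder sketch (assumes TensorFlow/Keras and MNIST).
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

# z = f(x) = s(Wx + b): encoder with a 32-dimensional code (illustrative size)
inputs = keras.Input(shape=(784,))
z = layers.Dense(32, activation="relu")(inputs)
# x_hat = g(z) = s(W'z + b'): decoder back to the input dimension
x_hat = layers.Dense(784, activation="sigmoid")(z)

autoencoder = keras.Model(inputs, x_hat)
encoder = keras.Model(inputs, z)

# Cross-entropy reconstruction loss H(x, x_hat); "mse" would match L(x, x_hat) = ||x - x_hat||^2
autoencoder.compile(optimizer="adam", loss="binary_crossentropy")

(x_train, _), (x_test, _) = keras.datasets.mnist.load_data()
x_train = x_train.reshape(-1, 784).astype("float32") / 255.0  # normalize to [0, 1]
x_test = x_test.reshape(-1, 784).astype("float32") / 255.0

# The target is the input itself: h(x) = g(f(x)) ≈ x
autoencoder.fit(x_train, x_train, epochs=10, batch_size=256,
                validation_data=(x_test, x_test))
```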
UNDERCOMPLETE AE VS OVERCOMPLETE AE
We distinguish between two types of AE structures:
UNDERCOMPLETE AE
• The hidden layer is undercomplete if it is smaller than the input layer.
❑ Compresses the input.
❑ Compresses well only for the training distribution.
[Figure: undercomplete AE: input $x$ → encoder $f(x)$ → code $z$ → decoder (weights $w'$) → reconstruction $\hat{x}$]
SIMPLE LATENT SPACE INTERPOLATION - KERAS
Encode two inputs into latent codes $z_1$ and $z_2$, then decode the interpolated code
$$z_i = \alpha z_1 + (1 - \alpha) z_2, \qquad \alpha \in [0, 1].$$
[Figure: the Encoder produces $z_1$ and $z_2$; the interpolated $z_i$ is passed through the Decoder]
SIMPLE LATENT SPACE INTERPOLATION – KERAS CODE EXAMPLE
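A sketch of the interpolation, assuming the `autoencoder`/`encoder` models and the normalized `x_test` from the dense AE sketch above; the standalone decoder and all variable names are illustrative, not the deck's original code:

```python
# Latent space interpolation sketch (assumes the `autoencoder`/`encoder` models above).
import numpy as np
from tensorflow import keras

# Build a standalone decoder from the trained decoding layer
code_in = keras.Input(shape=(32,))
decoder = keras.Model(code_in, autoencoder.layers[-1](code_in))

# Encode two test digits into z1 and z2
z = encoder.predict(x_test[:2])
z1, z2 = z[0], z[1]

# z_i = alpha * z1 + (1 - alpha) * z2 for several alpha values in [0, 1]
alphas = np.linspace(0.0, 1.0, num=10)
z_interp = np.stack([a * z1 + (1 - a) * z2 for a in alphas])

# Decode the interpolated codes back to image space
x_interp = decoder.predict(z_interp).reshape(-1, 28, 28)
```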
CONVOLUTIONAL AE
* Input values are normalized.
* All of the conv layers' activation functions are ReLU, except for the last conv, which is a sigmoid.
CONVOLUTIONAL AE
Encoder:
- Input: (28, 28, 1)
- Conv 1: 16 filters @ (3, 3, 1), same padding → (28, 28, 16)
- MaxPool 1: (2, 2) → (14, 14, 16)
- Conv 2: 8 filters @ (3, 3, 16), same padding → (14, 14, 8)
- MaxPool 2: (2, 2) → (7, 7, 8)
- Conv 3: 8 filters @ (3, 3, 8), same padding → (7, 7, 8)
- MaxPool 3: (2, 2) → (4, 4, 8) (hidden code)
Decoder:
- DeConv 1: 8 filters @ (3, 3, 8), same padding → (4, 4, 8)
- UpSample 1: (2, 2) → (8, 8, 8)
- DeConv 2: 8 filters @ (3, 3, 8), same padding → (8, 8, 8)
- UpSample 2: (2, 2) → (16, 16, 8)
- DeConv 3: 16 filters @ (3, 3, 8), valid padding → (14, 14, 16)
- UpSample 3: (2, 2) → (28, 28, 16)
- DeConv 4: 1 filter @ (5, 5, 16), same padding → Output (28, 28, 1)
CONVOLUTIONAL AE – KERAS EXAMPLE
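A sketch of the architecture in the table above, assuming Keras/TensorFlow and MNIST; it follows the well-known Keras convolutional autoencoder example and may differ in small details from the deck's original code:

```python
# Convolutional autoencoder sketch matching the layer table above (assumes TF/Keras, MNIST).
from tensorflow import keras
from tensorflow.keras import layers

inputs = keras.Input(shape=(28, 28, 1))

# Encoder: Conv/MaxPool blocks down to the (4, 4, 8) code
x = layers.Conv2D(16, (3, 3), activation="relu", padding="same")(inputs)   # (28, 28, 16)
x = layers.MaxPooling2D((2, 2), padding="same")(x)                          # (14, 14, 16)
x = layers.Conv2D(8, (3, 3), activation="relu", padding="same")(x)          # (14, 14, 8)
x = layers.MaxPooling2D((2, 2), padding="same")(x)                          # (7, 7, 8)
x = layers.Conv2D(8, (3, 3), activation="relu", padding="same")(x)          # (7, 7, 8)
encoded = layers.MaxPooling2D((2, 2), padding="same")(x)                     # (4, 4, 8) hidden code

# Decoder: Conv/UpSampling blocks back to (28, 28, 1)
x = layers.Conv2D(8, (3, 3), activation="relu", padding="same")(encoded)     # (4, 4, 8)
x = layers.UpSampling2D((2, 2))(x)                                           # (8, 8, 8)
x = layers.Conv2D(8, (3, 3), activation="relu", padding="same")(x)           # (8, 8, 8)
x = layers.UpSampling2D((2, 2))(x)                                           # (16, 16, 8)
x = layers.Conv2D(16, (3, 3), activation="relu", padding="valid")(x)         # (14, 14, 16)
x = layers.UpSampling2D((2, 2))(x)                                           # (28, 28, 16)
decoded = layers.Conv2D(1, (5, 5), activation="sigmoid", padding="same")(x)  # (28, 28, 1)

conv_ae = keras.Model(inputs, decoded)
conv_ae.compile(optimizer="adam", loss="binary_crossentropy")

(x_train, _), (x_test, _) = keras.datasets.mnist.load_data()
x_train = x_train.astype("float32")[..., None] / 255.0
x_test = x_test.astype("float32")[..., None] / 255.0
conv_ae.fit(x_train, x_train, epochs=50, batch_size=128,
            validation_data=(x_test, x_test))
```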
CONVOLUTIONAL AE – KERAS EXAMPLE RESULTS
- 50 epochs.
- 88% accuracy on validation set.
REGULARIZATION
Motivation:
- We would like to learn meaningful features without altering the code's dimensions (overcomplete or undercomplete).
[Figure: an input digit decomposed as a weighted sum of learned basis images, e.g. 1·(part) + 1·(part) + ... + 0.8·(part)]
SPARSELY REGULATED AUTOENCODERS
Recall: $a^{\mathrm{Bn}}_j$ is defined to be the activation of the $j$th hidden unit (bottleneck) of the autoencoder. Let $a^{\mathrm{Bn}}_j(x)$ be the activation of this specific node on a given input $x$.
SPARSELY REGULATED AUTOENCODERS
Further, let
$$\hat{\rho}_j = \frac{1}{m} \sum_{i=1}^{m} a^{\mathrm{Bn}}_j\big(x^{(i)}\big)$$
be the average activation of hidden unit $j$ over the $m$ training examples. We penalize $\hat{\rho}_j$ for deviating from a small sparsity target $\rho$, for example with
$$\sum_{j=1}^{B_n} \mathrm{KL}\big(\rho \,\|\, \hat{\rho}_j\big), \qquad \text{where } \mathrm{KL}\big(\rho \,\|\, \hat{\rho}_j\big) = 0 \text{ if } \hat{\rho}_j = \rho,$$
giving the sparse objective
$$J_S(W, b) = J(W, b) + \beta \sum_{j=1}^{B_n} \mathrm{KL}\big(\rho \,\|\, \hat{\rho}_j\big).$$
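A minimal sketch of this sparsity penalty in Keras (an assumption, not the deck's code): the KL term is attached to the sigmoid bottleneck as an activity regularizer, with the average activation estimated per batch rather than over the full training set, and illustrative values for ρ and β:

```python
# Sparse autoencoder sketch: KL sparsity penalty on the bottleneck activations.
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

rho = 0.05    # sparsity target ρ (illustrative value)
beta = 1e-3   # weight β of the sparsity penalty (illustrative value)

def kl_sparsity(activations):
    # Batch-wise estimate of the average activation ρ̂_j of each bottleneck unit
    rho_hat = tf.reduce_mean(activations, axis=0)
    # KL(ρ || ρ̂_j) for Bernoulli distributions, summed over the bottleneck units
    kl = rho * tf.math.log(rho / (rho_hat + 1e-10)) + \
         (1 - rho) * tf.math.log((1 - rho) / (1 - rho_hat + 1e-10))
    return beta * tf.reduce_sum(kl)

inputs = keras.Input(shape=(784,))
# Sigmoid bottleneck so activations lie in (0, 1); the regularizer adds β·ΣKL to the loss
code = layers.Dense(64, activation="sigmoid", activity_regularizer=kl_sparsity)(inputs)
outputs = layers.Dense(784, activation="sigmoid")(code)

sparse_ae = keras.Model(inputs, outputs)
sparse_ae.compile(optimizer="adam", loss="binary_crossentropy")
```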
DENOISING AUTOENCODERS
[Figure: Noisy Input → Encoder → Latent space representation → Decoder]
Instead of trying to mimic the identity function by minimizing
$$L\big(x, g(f(x))\big),$$
a DAE corrupts the input $x$ into a noisy copy $\tilde{x}$, encodes and decodes it, and minimizes
$$L\big(x, g(f(\tilde{x}))\big).$$
DENOISING AUTOENCODERS - PROCESS
[Figure: $x$ → add noise → $\tilde{x}$ → encode and decode → $\hat{x} = g(f(\tilde{x}))$ → compare $\hat{x}$ with the original $x$]
DENOISING CONVOLUTIONAL AE – KERAS
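A sketch of the denoising setup, assuming MNIST and a convolutional AE such as `conv_ae` from the earlier sketch; the noise factor 0.5 matches the results below, everything else is illustrative:

```python
# Denoising setup sketch: corrupt the inputs, train to reconstruct the clean images.
import numpy as np
from tensorflow import keras

(x_train, _), (x_test, _) = keras.datasets.mnist.load_data()
x_train = x_train.astype("float32")[..., None] / 255.0
x_test = x_test.astype("float32")[..., None] / 255.0

# x̃ = x + Gaussian noise (noise factor 0.5), clipped back to [0, 1]
noise_factor = 0.5
x_train_noisy = np.clip(x_train + noise_factor * np.random.normal(size=x_train.shape),
                        0.0, 1.0).astype("float32")
x_test_noisy = np.clip(x_test + noise_factor * np.random.normal(size=x_test.shape),
                       0.0, 1.0).astype("float32")

# Reuse a convolutional AE (e.g. the `conv_ae` defined earlier) as the denoiser:
# the input is the noisy x̃, the target is the clean x, i.e. minimize L(x, g(f(x̃))).
denoiser = conv_ae
denoiser.fit(x_train_noisy, x_train, epochs=50, batch_size=128,
             validation_data=(x_test_noisy, x_test))
```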
DENOISING CONVOLUTIONAL AE – KERAS RESULTS
- 50 epochs.
- Noise factor 0.5
- 92% accuracy on validation set.
STACKED AE
- Motivation:
❑ We want to harness the feature-extraction quality of an AE to our advantage.
❑ For example, we can build a deep supervised classifier whose input is the output of a SAE.
❑ The benefit: our deep model's weights are not randomly initialized but are instead "smartly selected".
❑ Also, using this unsupervised technique lets us exploit a larger, unlabeled dataset.
STACKED AE
- Building a SAE consists of two phases:
1. Train each AE layer one after the other.
2. Connect any classifier (SVM, a fully connected NN layer, etc.) on top of the trained encoders.
STACKED AE
[Figure: $x$ → SAE → Classifier → $y$]
STACKED AE – TRAIN PROCESS
First Layer Training (AE 1):
$$x \;\to\; f_1(x) = z_1 \;\to\; g_1(z_1) = \hat{x}$$
STACKED AE – TRAIN PROCESS
Second Layer Training (AE 2):
$$x \;\to\; f_1(x) = z_1 \;\to\; f_2(z_1) = z_2 \;\to\; g_2(z_2) = \hat{z}_1$$
(AE 2 is trained to reconstruct $z_1$.)
STACKED AE – TRAIN PROCESS
Add any classifier:
$$x \;\to\; f_1(x) = z_1 \;\to\; f_2(z_1) = z_2 \;\to\; \text{Classifier} \;\to\; \text{Output}$$
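A sketch of this two-phase procedure (greedy layer-wise AE training, then a classifier on top), assuming Keras and MNIST; the 784 → 128 → 64 layer sizes and hyperparameters are illustrative:

```python
# Stacked AE sketch: train AE 1, then AE 2 on AE 1's codes, then stack f1, f2 + a classifier.
from tensorflow import keras
from tensorflow.keras import layers

(x_train, y_train), _ = keras.datasets.mnist.load_data()
x_train = x_train.reshape(-1, 784).astype("float32") / 255.0

# --- Phase 1a: first AE, x -> f1(x) = z1 -> g1(z1) = x_hat
inp1 = keras.Input(shape=(784,))
f1 = layers.Dense(128, activation="relu", name="f1")
z1 = f1(inp1)
x_hat = layers.Dense(784, activation="sigmoid")(z1)
ae1 = keras.Model(inp1, x_hat)
ae1.compile(optimizer="adam", loss="binary_crossentropy")
ae1.fit(x_train, x_train, epochs=10, batch_size=256)

# --- Phase 1b: second AE trained on the first codes, z1 -> f2(z1) = z2 -> g2(z2) = z1_hat
z1_train = keras.Model(inp1, z1).predict(x_train)
inp2 = keras.Input(shape=(128,))
f2 = layers.Dense(64, activation="relu", name="f2")
z2 = f2(inp2)
z1_hat = layers.Dense(128, activation="relu")(z2)
ae2 = keras.Model(inp2, z1_hat)
ae2.compile(optimizer="adam", loss="mse")
ae2.fit(z1_train, z1_train, epochs=10, batch_size=256)

# --- Phase 2: stack the pretrained encoders f1, f2 and add a classifier on top
inp = keras.Input(shape=(784,))
features = f2(f1(inp))                      # weights already "smartly selected"
out = layers.Dense(10, activation="softmax")(features)
classifier = keras.Model(inp, out)
classifier.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
                   metrics=["accuracy"])
classifier.fit(x_train, y_train, epochs=5, batch_size=256)  # also fine-tunes f1, f2
```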
CONTRACTIVE AUTOENCODERS
- We are still trying to avoid uninteresting features.
- Here we add a regularization term $\Omega(x)$ to our loss function to limit the hidden layer.
CONTRACTIVE AUTOENCODERS
- Idea: We wish to extract features that only reflect variations observed in the training set; we would like to be invariant to all other variations.
- Jacobian matrix:
$$J_f(x) = \frac{\partial f(x)}{\partial x} = \begin{pmatrix} \frac{\partial f(x)_1}{\partial x_1} & \cdots & \frac{\partial f(x)_1}{\partial x_n} \\ \vdots & \ddots & \vdots \\ \frac{\partial f(x)_m}{\partial x_1} & \cdots & \frac{\partial f(x)_m}{\partial x_n} \end{pmatrix}$$
CONTRACTIVE AUTOENCODERS
Our new loss function would be:
$$L^*(x) = L(x) + \lambda\,\Omega(x),$$
where
$$\Omega(x) = \big\| J_f(x) \big\|_F^2 = \sum_{i,j} \left( \frac{\partial f(x)_j}{\partial x_i} \right)^2,$$
and where $\lambda$ controls the balance between our reconstruction objective and the hidden layer "flatness".
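A sketch of this loss in Keras (an assumption, not the deck's code), using a single sigmoid encoder layer, for which the squared Frobenius norm of the Jacobian has a closed form; the λ value and layer sizes are illustrative:

```python
# Contractive AE sketch: add λ‖J_f(x)‖_F² to the reconstruction loss (single sigmoid encoder layer).
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

lam = 1e-4  # λ: weight of the contractive penalty (illustrative value)

class ContractiveDense(layers.Dense):
    """Sigmoid Dense encoder that adds λ‖J_f(x)‖_F² to the model loss."""
    def call(self, x):
        h = super().call(x)                       # h = f(x) = s(Wx + b)
        # For sigmoid units: ∂h_j/∂x_i = h_j (1 - h_j) W_ij, so
        # ‖J_f(x)‖_F² = Σ_j (h_j (1 - h_j))² · Σ_i W_ij²   (closed form, no autodiff needed)
        dh = h * (1.0 - h)
        w_sq = tf.reduce_sum(tf.square(self.kernel), axis=0)
        self.add_loss(lam * tf.reduce_mean(tf.reduce_sum(tf.square(dh) * w_sq, axis=1)))
        return h

inputs = keras.Input(shape=(784,))
h = ContractiveDense(64, activation="sigmoid")(inputs)
outputs = layers.Dense(784, activation="sigmoid")(h)
cae = keras.Model(inputs, outputs)

# Total loss: reconstruction (binary cross-entropy) + λ·Ω(x) added by the encoder layer
cae.compile(optimizer="adam", loss="binary_crossentropy")
# cae.fit(x_train, x_train, epochs=10, batch_size=256)  # x_train as in the earlier sketches
```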