Deep Learning

Introduction

Unlike a machine learning model that needs to be explicitly told how to make an accurate prediction, a deep learning model is capable of learning useful representations on its own from the data.
• It is inspired by the structure of the human brain and is particularly effective for object classification, pattern recognition and predictive analysis
• Deep learning classifies, clusters and predicts by using neural networks that have been trained on vast amounts of data
History of neural networks
• 1943: Warren S. McCulloch and Walter Pitts's paper "A Logical Calculus of the Ideas Immanent in Nervous Activity" modeled neurons (cells) and their connectivity in the brain
• 1958: Frank Rosenblatt's paper "The Perceptron: A Probabilistic Model for Information Storage and Organization in the Brain" introduced the weighted-sum concept
• 1974: Paul Werbos's work on backpropagation made learning in multi-layer networks possible
• 1989: Yann LeCun implemented handwritten ZIP code recognition for the US Postal Service using the neural network LeNet
https://www.ibm.com/cloud/learn/neural-networks
Neural Networks
Deep Neural Networks (DNN)
• Multi-Layer Perceptron (MLP)
• Convolutional Neural Network (CNN)
• Recurrent Neural Network (RNN)
Deep learning frameworks
• Building a deep learning solution is a big
challenge because of its complexity
• Frameworks are tools to ease the building of
deep learning solutions
• Frameworks offer a higher level of abstraction
and simplify potentially difficult programming
tasks.
TensorFlow: Developed by Google
– The most widely used deep learning framework
– Based on GitHub stars and forks and Stack Overflow activity
Caffe: Developed by Berkeley Vision and Learning Center (BVLC)
– Popular for CNN modeling (imaging/computer vision applications)
and its Model Zoo (a selection of pre-trained networks)
• In addition to these frameworks, there are also interfaces that wrap one or more frameworks. The most well-known and widely used interface for deep learning today is Keras.
• Keras is a high-level deep learning API, written in Python.
Activation Function
• Can be expressed mathematically
• Should be differentiable
• Should be cheap to compute, even with limited computing power
• Types: step function, linear function and nonlinear functions (S-shaped sigmoid, hyperbolic tangent)
Step function
• Step function: the decision (firing of the neuron) is based on a threshold value (activated if the input is above the threshold, otherwise not activated)
• Suitable only for linearly separable binary classification problems, such as the AND gate (a single step-function neuron cannot solve XOR, which is not linearly separable)
• The zero gradient everywhere is its major problem
Linear Activation Function

• The activation value "a" is directly proportional to the input, a = c·x for a constant c
• Many different neurons can be activated at the same time using a linear activation
• For multiple classes, the class with the maximum activation is chosen
• Its derivative is constant in all situations
• At the output layer it can be used for regression problems
Non-Linear "S"-shaped Activation Functions
• These activation functions are not linear
• The range of the function is between 0 and 1
• The sigmoid function has a smooth gradient: σ(x) = 1 / (1 + e^(−x))
• Also called the logistic function
tanh function
• Output values range between −1 and 1
• Similar to the sigmoid function: tanh(x) = (e^x − e^(−x)) / (e^x + e^(−x))
• Negative inputs to the hyperbolic tangent are mapped to negative outputs
• Inputs near zero are mapped to outputs near zero
• These features help the network avoid getting stuck during training
Rectified Linear Unit Activation Functions

• Transforms the weighted input either to zero or passes the positive part through in proportion to the input
• Piecewise linear functions that output a positive input directly; otherwise the output is zero
Basic Rectified Linear Unit (ReLU)

• Defined as f(x) = max(0, x)
• Also known as a ramp function; analogous to half-wave rectification in electrical engineering
• Enables more efficient training of deeper networks compared to the previously common activation functions
Dying ReLU Problem
• ReLU has a significant problem known as "dying ReLU"
• Consider a neural network with a weight distribution in the form of a low-variance Gaussian centered at +0.1
• Under this condition, most inputs are positive and cause the ReLU nodes to fire
• During a particular backpropagation step, a large-magnitude gradient can pass back to those nodes
• This can shift the weight distribution to a low-variance Gaussian centered at −0.1
• Then all inputs to the node are negative, making the neuron inactive and omitting its weight updates during backpropagation
• How to bypass this: Leaky ReLU
• A small multiplier α, e.g., 0.01, is applied to negative inputs: f(x) = max(αx, x)
EXPONENTIAL LINEAR UNITS (ELUS)
• ELU behaves like ReLU for positive inputs and saturates smoothly for negative inputs: f(x) = x if x > 0, else α(e^x − 1)
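A minimal NumPy sketch (an illustration, not from the slides) of the activation functions discussed above, using their standard definitions:

```python
# Standard definitions of the activations discussed: step, linear,
# sigmoid, tanh, ReLU, Leaky ReLU and ELU.
import numpy as np

def step(x, threshold=0.0):
    return np.where(x >= threshold, 1.0, 0.0)

def linear(x, c=1.0):
    return c * x

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):
    return np.tanh(x)

def relu(x):
    return np.maximum(0.0, x)

def leaky_relu(x, alpha=0.01):
    return np.where(x > 0, x, alpha * x)

def elu(x, alpha=1.0):
    return np.where(x > 0, x, alpha * (np.exp(x) - 1.0))

z = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])   # assumed example inputs
print(relu(z))         # [0.  0.  0.  0.5 2. ]
print(leaky_relu(z))   # [-0.02  -0.005  0.     0.5    2.   ]
```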
softmax function
• The softmax function turns a vector of K real values into a vector of K real values that sum to 1
• It is normally used to convert real-valued activations into class likelihoods
• The softmax transforms the activations into values between 0 and 1, so that they can be interpreted as probabilities
Softmax

  softmax(z)_i = exp(z_i) / Σ_{j=1..K} exp(z_j)

• The z_i values are the elements of the input vector and can take any real value
• K is the number of classes in the multi-class classifier
Example: Calculating the Softmax
• Given an input vector of real-valued scores
• Compute the softmax output
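The slide's input values are not reproduced here, so the sketch below uses an assumed three-class score vector purely for illustration:

```python
# Worked softmax example with assumed scores (not the slide's data).
import numpy as np

def softmax(z):
    # Subtract the max for numerical stability; the result is unchanged.
    e = np.exp(z - np.max(z))
    return e / e.sum()

z = np.array([2.0, 1.0, 0.1])   # assumed scores for 3 classes
print(softmax(z))               # [0.659 0.242 0.099] -> sums to 1
```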
Loss / Error Functions
• Loss functions are used to determine the error (loss) between the output of our algorithm and the given target value
• Two common loss functions
– 0-1 loss function
– quadratic loss function
L1 and L2 Loss
• Mean Absolute Error, or L1 loss:

  MAE = (1/m) Σ_{i=1..m} |h(x^(i)) − y^(i)|

  m – number of samples
  x^(i) – i-th sample from the dataset
  h(x^(i)) – prediction for the i-th sample
  y^(i) – ground-truth label for the i-th sample
MSE (L2)
• Mean Squared Error, or L2 loss:

  MSE = (1/m) Σ_{i=1..m} (y^(i) − ŷ^(i))²

  m – number of samples
  y^(i) – ground-truth label for the i-th sample
  ŷ^(i) – predicted label for the i-th sample
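A small NumPy sketch of the L1 (MAE) and L2 (MSE) losses defined above, with assumed example labels and predictions:

```python
import numpy as np

def l1_loss(y_true, y_pred):
    # Mean Absolute Error: average of |prediction - ground truth|
    return np.mean(np.abs(y_pred - y_true))

def l2_loss(y_true, y_pred):
    # Mean Squared Error: average of squared differences
    return np.mean((y_pred - y_true) ** 2)

y_true = np.array([1.0, 2.0, 3.0])   # assumed ground-truth labels
y_pred = np.array([1.1, 1.8, 3.5])   # assumed predictions
print(l1_loss(y_true, y_pred))        # ~0.267
print(l2_loss(y_true, y_pred))        # 0.1
```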
Loss
• Regression loss
• Classification loss
Cross-Entropy Loss Function
Example

https://towardsdatascience.com/cross-entropy-loss-function-f38c4ec8643e
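As a complement to the linked example, here is a small sketch of the categorical cross-entropy, L = −Σ y_i·log(p_i), with assumed one-hot targets and predicted probabilities:

```python
import numpy as np

def cross_entropy(y_true, y_prob, eps=1e-12):
    # -sum_i y_i * log(p_i), averaged over the batch
    y_prob = np.clip(y_prob, eps, 1.0)
    return -np.mean(np.sum(y_true * np.log(y_prob), axis=1))

y_true = np.array([[0, 1, 0]])         # assumed one-hot target (class 1)
y_prob = np.array([[0.2, 0.7, 0.1]])   # assumed predicted probabilities
print(cross_entropy(y_true, y_prob))    # -log(0.7) ≈ 0.357
```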
Autoencoders
• A neural network trained to try to copy its input to its output
• An unsupervised learning technique
• Has a hidden layer h (the code) used to represent the input
• The network consists of two parts:
  – an encoder function h = f(x)
  – a decoder function r = g(h)
  – reconstruction: g(f(x)) = x̂
• The reconstruction is an approximate copy of the input that resembles the training data
• Used for dimensionality reduction, feature learning and generative modeling
• Suppose we have a set of data points {x^(1), …, x^(m)}, where each data point has many dimensions
• We want to map them to another set of data points {z^(1), …, z^(m)}, where the z's have lower dimensionality than the x's and the z's can faithfully reconstruct the x's
• To map the data back and forth, z and x̃ are functions: z^(i) = f(x^(i)) and x̃^(i) = g(z^(i))
• Our goal is for x̃^(i) to approximate x^(i)
• We use an objective function that is the sum of squared differences between x̃^(i) and x^(i):

  J = Σ_{i=1..m} ||x̃^(i) − x^(i)||²

• which can be minimized using stochastic gradient descent
Undercomplete Autoencoders
• Training the autoencoder to perform the input-copying task should result in h (the hidden code) taking on useful properties of the input
• One way to obtain useful features from the autoencoder is to constrain h to have a smaller dimension than x
• An autoencoder whose code dimension is less than the input dimension is called undercomplete
Linear Autoencoders
• Example: an autoencoder that maps data from 4 dimensions to 2 dimensions, with one hidden layer whose activation function is linear
• This works for the case where the data lie on a linear surface
Non-Linear Autoencoders
• For the case where the data lie on a nonlinear surface, it makes more sense to use a nonlinear autoencoder
• If the data are highly nonlinear, one can add more hidden layers to the network to obtain a deep autoencoder (see the sketch below)
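A minimal Keras sketch of an undercomplete, nonlinear autoencoder. The 784-dimensional input (e.g., flattened 28x28 images), the 32-dimensional code and the layer sizes are illustrative assumptions, not values taken from the slides:

```python
import tensorflow as tf
from tensorflow.keras import layers, models

input_dim, code_dim = 784, 32

# Encoder: h = f(x)
encoder = models.Sequential([
    layers.Input(shape=(input_dim,)),
    layers.Dense(128, activation="relu"),
    layers.Dense(code_dim, activation="relu"),
])

# Decoder: x_hat = g(h)
decoder = models.Sequential([
    layers.Input(shape=(code_dim,)),
    layers.Dense(128, activation="relu"),
    layers.Dense(input_dim, activation="sigmoid"),
])

# Autoencoder: x_hat = g(f(x)), trained to reconstruct its own input
autoencoder = models.Sequential([encoder, decoder])
autoencoder.compile(optimizer="adam", loss="mse")
# autoencoder.fit(x_train, x_train, epochs=10, batch_size=128)
```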
Regularization
• Regularization is a set of strategies used in Machine
Learning to reduce the generalization error. Most
models, after training, perform very well on a specific
subset of the overall population but fail to generalize
well. This is also known as overfitting.
Regularized Autoencoders
• Overcomplete autoencoders (code dimension higher than the input dimension) and autoencoders whose code dimension equals the input dimension fail to learn anything useful: they can simply copy the input without learning the data distribution
• Regularized autoencoders provide a way to train such autoencoders usefully
• A regularized autoencoder uses a loss function that encourages the model to have other properties besides the ability to copy its input to its output
• Other properties include:
  – sparsity of the representation
  – smallness of the derivative of the representation
  – robustness to noise or missing inputs
• A regularized autoencoder can be nonlinear and overcomplete
• Any generative model with latent variables that provides a procedure to infer the latent variables may be viewed as an autoencoder
Types of Autoencoders
• Undercomplete autoencoders
• Sparse Autoencoders
• Denoising autoencoders
• Variational Autoencoders (for generative
modelling)
• Contractive Autoencoders (CAE)
Sparse Autoencoders
• In an autoencoder, the number of hidden units may be large (perhaps even greater than the number of input pixels)
• It can still discover interesting structure by imposing other constraints on the network
• In particular, if we impose a sparsity constraint on the hidden units, the autoencoder will still discover interesting structure in the data, even if the number of hidden units is large
• ρ is a sparsity parameter, typically a small value close to zero
• We would like the average activation ρ̂_j of each hidden neuron to be close to ρ (e.g., 0.05)
• To achieve this, we add an extra penalty term to our optimization objective that penalizes ρ̂_j deviating significantly from ρ
Sparse Autoencoders
• A sparse autoencoder adds a sparsity penalty Ω(h) on the code layer h to the reconstruction error while training:

  L(x, g(f(x))) + Ω(h)

  – where g(h) is the decoder output
  – h = f(x) is the encoder output
• Incorporating sparsity forces more neurons to be inactive
• Sparse autoencoders attempt to enforce this constraint through the sparsity penalty Ω(h)
• This penalizes neurons that are too active, forcing them to activate less
• It forces the model to have only a small number of hidden units activated at the same time
• Sparse autoencoders are used to learn features for another task, such as classification
• The sparsity penalty can yield a model that has learned useful features as a byproduct
• The sparsity penalty prevents the neural network from activating too many neurons and serves as a regularizer, weighted by a hyperparameter λ (see the sketch below)
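A minimal Keras sketch of a sparse autoencoder. Here the sparsity penalty Ω(h) is approximated with an L1 activity regularizer on the code layer (a KL-divergence penalty on the average activation ρ̂_j is another common choice); the layer sizes and the weight 1e-4 (playing the role of λ) are illustrative assumptions:

```python
import tensorflow as tf
from tensorflow.keras import layers, models, regularizers

input_dim, code_dim = 784, 64

sparse_autoencoder = models.Sequential([
    layers.Input(shape=(input_dim,)),
    layers.Dense(code_dim, activation="sigmoid",
                 activity_regularizer=regularizers.l1(1e-4)),  # sparsity penalty on h
    layers.Dense(input_dim, activation="sigmoid"),
])
sparse_autoencoder.compile(optimizer="adam", loss="mse")
# sparse_autoencoder.fit(x_train, x_train, epochs=10, batch_size=128)
```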
Original digits sampled from the MNIST test set (left),
reconstruction of sampled digits with a non-sparse autoencoder (middle), and
reconstruction with a sparse autoencoder (right)
Image Courtesy: https://bradleyboehmke.github.io/HOML/autoencoders.html#denoising-autoenc
Denoising Autoencoders
• Rather than adding a penalty Ω to the cost function, we can obtain an autoencoder that learns something useful by changing the reconstruction error term of the cost function
• The denoising autoencoder (DAE) is an autoencoder that receives a corrupted data point as input and is trained to predict the original, uncorrupted data point as its output
• We train the autoencoder to reconstruct the input from a corrupted copy of the inputs
• This forces the codings to learn more robust features of the inputs
• A denoising autoencoder has two objectives:
  (i) try to encode the inputs so as to preserve the essential signals
  (ii) try to undo the effects of a corruption process stochastically applied to the inputs of the autoencoder
• Undoing the corruption can only be done by capturing the statistical dependencies between the inputs
Autoencoder

An AE consists of:
• an encoder function g(·) parameterized by ϕ
• a decoder function f(·) parameterized by θ
• a metric to quantify the difference between two vectors, e.g., cross-entropy when the activation function is sigmoid
Denoising Auto Encoder
• The input is partially corrupted by adding noise to, or masking, some values of the input vector in a stochastic manner (see the code sketch below)
Figure: Denoising AE architecture, by Lilian Weng
https://lilianweng.github.io/lil-log/2018/08/12/from-autoencoder-to-beta-vae.html
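A minimal sketch of training a denoising autoencoder: corrupt the inputs with Gaussian noise (masking values is another common choice) and train the network to reconstruct the clean inputs. The noise level 0.2 and the layer sizes are illustrative assumptions:

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers, models

input_dim, code_dim = 784, 32

denoising_ae = models.Sequential([
    layers.Input(shape=(input_dim,)),
    layers.Dense(code_dim, activation="relu"),
    layers.Dense(input_dim, activation="sigmoid"),
])
denoising_ae.compile(optimizer="adam", loss="mse")

def corrupt(x, noise_std=0.2):
    # Stochastic corruption: add Gaussian noise and keep values in [0, 1]
    return np.clip(x + noise_std * np.random.randn(*x.shape), 0.0, 1.0)

# Training pairs: corrupted input -> clean target
# denoising_ae.fit(corrupt(x_train), x_train, epochs=10, batch_size=128)
```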
Variational autoencoder (VAE)
• The bottleneck vector (latent vector) is replaced by two vectors, namely a mean vector and a standard deviation vector
• Variational autoencoders learn the parameters of a probability distribution
Training

• Training means learning the encoder and decoder weights 𝑊E and 𝑊D
  – Define a loss function ℒ
  – Use stochastic gradient descent (or Adam) to minimize ℒ
The Loss function
• Reconstruction error
• Similarity between the probability of z given x, p(z|x), and some predefined prior distribution p(z), which can be computed by the Kullback–Leibler divergence (KL):

  KL(p(z|x) || p(z))
Working of VAE
• The encoder network is forced to generate latent vectors that follow the unit Gaussian distribution
• There is a tradeoff between:
  – how accurately the network reconstructs the images
  – how closely the latent variables match the unit Gaussian distribution
• The reconstruction error (generative loss) is measured using the mean squared error
• The Kullback–Leibler (KL) divergence loss measures the closeness of the latent variables to the unit Gaussian distribution (see the sketch below)
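A minimal sketch of the VAE loss described above, assuming the encoder outputs a mean vector and a log-variance vector: reconstruction error (MSE) plus the KL divergence between N(mu, sigma²) and the unit Gaussian N(0, I):

```python
import tensorflow as tf

def sample_latent(mu, log_var):
    # Reparameterization trick: z = mu + sigma * epsilon, epsilon ~ N(0, I)
    eps = tf.random.normal(tf.shape(mu))
    return mu + tf.exp(0.5 * log_var) * eps

def vae_loss(x, x_reconstructed, mu, log_var):
    # Reconstruction error (generative loss), summed over input dimensions
    recon = tf.reduce_sum(tf.square(x - x_reconstructed), axis=-1)
    # Closed-form KL divergence between N(mu, sigma^2) and N(0, I)
    kl = -0.5 * tf.reduce_sum(
        1.0 + log_var - tf.square(mu) - tf.exp(log_var), axis=-1)
    return tf.reduce_mean(recon + kl)
```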
KL-Divergence
• A measure of how one probability distribution P differs from a second, reference probability distribution Q
• Written KL(P || Q)
• The KL divergence can be calculated as the negative sum, over each event x in X, of the probability of the event under P multiplied by the log of the ratio of its probability under Q to its probability under P:

  KL(P || Q) = − Σ_{x ∈ X} P(x) · log(Q(x) / P(x))
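A small NumPy sketch of the discrete KL divergence formula above, using two assumed example distributions over three events:

```python
import numpy as np

def kl_divergence(p, q):
    # KL(P || Q) = sum_x P(x) * log(P(x) / Q(x))
    #            = -sum_x P(x) * log(Q(x) / P(x))
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return np.sum(p * np.log(p / q))

p = [0.10, 0.40, 0.50]   # assumed distribution P
q = [0.80, 0.15, 0.05]   # assumed reference distribution Q
print(kl_divergence(p, q))   # ≈ 1.336 nats
print(kl_divergence(q, q))   # 0.0 for identical distributions
```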


Contractive autoencoder (CAE)
• A contractive autoencoder also tries to prevent an overcomplete autoencoder from learning the identity function
• To make the model robust to slight variations in the input data, an explicit regularizer is added to the objective function of the contractive autoencoder
• The loss function of the contractive autoencoder is given as

  L(x, g(f(x))) + λ · Ω(h),  where Ω(h) = ||J_f(x)||²_F

• The penalty term Ω(h) for the hidden layer is calculated with respect to the input x and is known as the (squared) Frobenius norm of the Jacobian matrix
• The Frobenius norm is obtained by squaring all the elements of the matrix, taking the sum, and then taking the square root of this sum
• If the input has n dimensions and the hidden layer has k dimensions, then the Jacobian J_f(x) is a k × n matrix
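A minimal sketch of the contractive penalty: the squared Frobenius norm of the Jacobian of the encoder output h = f(x) with respect to x, computed here with TensorFlow's automatic differentiation for an assumed small dense encoder (n = 8 inputs, k = 4 hidden units):

```python
import tensorflow as tf
from tensorflow.keras import layers, models

n, k = 8, 4                                # assumed input and code dimensions
encoder = models.Sequential([
    layers.Input(shape=(n,)),
    layers.Dense(k, activation="sigmoid"),
])

x = tf.random.normal((1, n))               # one example input
with tf.GradientTape() as tape:
    tape.watch(x)
    h = encoder(x)
jacobian = tape.batch_jacobian(h, x)               # shape (1, k, n)
frobenius_sq = tf.reduce_sum(tf.square(jacobian))   # Ω(h) = ||J_f(x)||_F^2
print(frobenius_sq.numpy())
```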
CNN Architectures
• Case Studies
– AlexNet
– ResNet
AlexNet – Krizhevsky et al., 2012

Alex Krizhevsky et al., 2012: ImageNet Classification with Deep Convolutional Neural Networks
Full (simplified) AlexNet architecture:
[227x227x3] INPUT
[55x55x96] CONV1: 96 11x11 filters at stride 4, pad 0
[27x27x96] MAX POOL1: 3x3 filters at stride 2
[27x27x96] NORM1: Normalization layer
[27x27x256] CONV2: 256 5x5 filters at stride 1, pad 2
[13x13x256] MAX POOL2: 3x3 filters at stride 2
[13x13x256] NORM2: Normalization layer
[13x13x384] CONV3: 384 3x3 filters at stride 1, pad 1
[13x13x384] CONV4: 384 3x3 filters at stride 1, pad 1
[13x13x256] CONV5: 256 3x3 filters at stride 1, pad 1
[6x6x256] MAX POOL3: 3x3 filters at stride 2
[4096] FC6: 4096 neurons
[4096] FC7: 4096 neurons
[1000] FC8: 1000 neurons (class scores)
Details/Retrospectives:
- first use of ReLU
- used Norm layers (not common anymore)
- data augmentation
- dropout 0.5
- batch size 128
- SGD Momentum 0.9
- Learning rate 1e-2,
- L2 weight decay 5e-4
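A hedged Keras sketch of the simplified layer listing above, purely for illustration: the original AlexNet used local response normalization, approximated here with batch normalization, and the compile settings follow the hyperparameters listed on the slide:

```python
import tensorflow as tf
from tensorflow.keras import layers, models

alexnet = models.Sequential([
    layers.Input(shape=(227, 227, 3)),
    layers.Conv2D(96, 11, strides=4, activation="relu"),                  # CONV1 -> 55x55x96
    layers.MaxPooling2D(3, strides=2),                                    # POOL1 -> 27x27x96
    layers.BatchNormalization(),                                          # NORM1 (approximation)
    layers.Conv2D(256, 5, strides=1, padding="same", activation="relu"),  # CONV2 -> 27x27x256
    layers.MaxPooling2D(3, strides=2),                                    # POOL2 -> 13x13x256
    layers.BatchNormalization(),                                          # NORM2 (approximation)
    layers.Conv2D(384, 3, strides=1, padding="same", activation="relu"),  # CONV3
    layers.Conv2D(384, 3, strides=1, padding="same", activation="relu"),  # CONV4
    layers.Conv2D(256, 3, strides=1, padding="same", activation="relu"),  # CONV5
    layers.MaxPooling2D(3, strides=2),                                    # POOL3 -> 6x6x256
    layers.Flatten(),
    layers.Dense(4096, activation="relu"),                                # FC6
    layers.Dropout(0.5),
    layers.Dense(4096, activation="relu"),                                # FC7
    layers.Dropout(0.5),
    layers.Dense(1000, activation="softmax"),                             # FC8 (class scores)
])
alexnet.compile(
    optimizer=tf.keras.optimizers.SGD(learning_rate=1e-2, momentum=0.9),
    loss="categorical_crossentropy",
)
```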
ImageNet Dataset
• 15 million labeled high-resolution images,
belonging to roughly 22,000 categories
• The images were collected from the web
• In 2010, the ImageNet Large-Scale Visual Recognition Challenge (ILSVRC) was started
• ILSVRC uses a subset of ImageNet with roughly
1000 images in each of 1000 categories
• ImageNet consists of variable-resolution images
• roughly 1.2 million training images
• 50,000 validation images
• 150,000 testing images
Transfer Learning
