Deep Learning
Introduction
Unlike a traditional machine learning model, which needs to be explicitly told how to make an
accurate prediction (for example through hand-engineered features), a deep learning model
learns its own internal representation directly from the data
www.ti.com
• It is inspired by the structure of the human
brain and is particularly effective at object
classification, pattern recognition and predictive
analysis
• Deep learning is a way of classifying, clustering
and predicting using a neural network that
has been trained on a vast amount of data
History of neural networks
• 1943: Warren S. McCulloch and Walter Pitts’s paper “A
logical calculus of the ideas immanent in nervous
activity” modeled nerve cells and their connectivity in the brain.
• 1958: Frank Rosenblatt’s paper “The Perceptron: A
Probabilistic Model for Information Storage and
Organization in the Brain” introduced the weighted-sum
concept
• 1974: Paul Werbos’s work on backpropagation made
learning in multilayer networks possible
• 1989: Yann LeCun implemented handwritten zip code
recognition for the U.S. Postal Service using the neural network LeNet
https://www.ibm.com/cloud/learn/neural-networks
Neural Networks
Deep Neural Networks (DNN)
• Multilayer Perceptron (MLP)
• Convolutional Neural Network (CNN)
• Recurrent Neural Network (RNN)
Deep learning frameworks
• Building a deep learning solution is a big
challenge because of its complexity
• Frameworks are tools to ease the building of
deep learning solutions
• Frameworks offer a higher level of abstraction
and simplify potentially difficult programming
tasks.
TensorFlow: Developed by Google
– Among the most used deep learning frameworks, judged by GitHub
stars and forks and Stack Overflow activity
Caffe: Developed by Berkeley Vision and Learning Center (BVLC)
– Popular for CNN modeling (imaging/computer vision applications)
and its Model Zoo (a selection of pre-trained networks)
• Alongside these frameworks, there are also interfaces that
wrap one or more frameworks. The most
well-known and widely used interface for deep learning
today is Keras.
• Keras is a high-level deep learning API, written in Python.
Activation Function
• Can be expressed mathematically
• Should be differentiable
• Should be easy to calculate with little computing
power
• Types: step function, linear function and
nonlinear functions (S-shaped sigmoid, hyperbolic
tangent)
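The linear and S-shaped types listed above can be sketched in a few lines of Python. This is a minimal illustration using only the standard library; the function names and the `slope` parameter are choices made for this example, not taken from the slides. The derivative helpers show why differentiability matters: they are what a gradient-based learning rule would use.

```python
import math

def linear(z, slope=1.0):
    """Linear activation: output is proportional to the input."""
    return slope * z

def sigmoid(z):
    """S-shaped (logistic) activation, output in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def tanh(z):
    """Hyperbolic tangent activation, output in (-1, 1)."""
    return math.tanh(z)

def sigmoid_grad(z):
    """Derivative of the sigmoid: s(z) * (1 - s(z))."""
    s = sigmoid(z)
    return s * (1.0 - s)

def tanh_grad(z):
    """Derivative of tanh: 1 - tanh(z)^2."""
    return 1.0 - math.tanh(z) ** 2

# Compare the activations at a few sample inputs.
for z in (-2.0, 0.0, 2.0):
    print(z, linear(z), round(sigmoid(z), 4), round(tanh(z), 4))
```

Both nonlinear functions are cheap to evaluate and have simple closed-form derivatives, which matches the requirements in the list above.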
Step function
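• Step function: the decision (firing of the neuron) is based on a
threshold value (activated if above the threshold, otherwise not activated)
• Suited to linearly separable binary-class problems, such as the AND and OR
gates (the XOR gate is not linearly separable)
• The major problem is that the gradient is zero everywhere except at the
threshold, so gradient-based learning cannot update the weights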
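As a small illustration of a threshold decision, the sketch below wires a single neuron with a step activation to reproduce the AND gate. The weights and bias are hand-picked for this example (an assumption, not values from the slides); any line that separates (1, 1) from the other three inputs would work.

```python
def step(z, threshold=0.0):
    """Fire (1) if the input exceeds the threshold, otherwise stay inactive (0)."""
    return 1 if z > threshold else 0

def neuron(inputs, weights, bias):
    """Weighted sum of the inputs followed by the step activation."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return step(z)

# Hypothetical hand-picked parameters that realize the AND gate.
AND_WEIGHTS = (1.0, 1.0)
AND_BIAS = -1.5

and_table = {
    (a, b): neuron((a, b), AND_WEIGHTS, AND_BIAS)
    for a in (0, 1)
    for b in (0, 1)
}
print(and_table)
```

Because the output jumps from 0 to 1 without any slope, the parameters here had to be chosen by hand rather than learned by gradient descent.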
Linear Activation Function
m -> number of samples
x(i) -> i-th sample from the dataset
h(x(i)) -> prediction for the i-th sample
y(i) -> ground truth label for the i-th sample
MSE (L2)
• Mean Squared Error, or L2 loss:
$$\mathrm{MSE} = \frac{1}{m}\sum_{i=1}^{m}\left(y^{(i)} - \hat{y}^{(i)}\right)^2$$
m -> number of samples
y(i) -> ground truth label for the i-th sample
y^(i) -> predicted label for the i-th sample
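Using the symbols m, y(i) and y^(i) defined above, the standard definition of the mean squared error (the average of the squared differences between labels and predictions) can be sketched as follows. The function name and the sample values are made up for illustration.

```python
def mean_squared_error(y_true, y_pred):
    """Average of squared differences between labels and predictions."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    m = len(y_true)  # number of samples
    return sum((y - y_hat) ** 2 for y, y_hat in zip(y_true, y_pred)) / m

# Illustrative values only.
y_true = [3.0, -0.5, 2.0, 7.0]
y_pred = [2.5, 0.0, 2.0, 8.0]
print(mean_squared_error(y_true, y_pred))
```

Squaring makes large errors count disproportionately, which is why this loss is sensitive to outliers.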
Loss
• Regression loss
• Classification loss
Cross-Entropy Loss Function
Example
https://towardsdatascience.com/cross-entropy-loss-function-f38c4ec8643e
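The slide's own worked example is not reproduced here, so the following is an independent sketch of the usual categorical cross-entropy, L = -Σ t_i log(p_i), where t is a one-hot target and p is a predicted probability vector. The softmax step, the `eps` clamp and the sample logits are assumptions added for this illustration.

```python
import math

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(target, predicted, eps=1e-12):
    """Cross-entropy between a target distribution and predicted probabilities."""
    # Clamp probabilities away from zero so that log() is always defined.
    return -sum(t * math.log(max(p, eps)) for t, p in zip(target, predicted))

target = [1.0, 0.0, 0.0]          # one-hot label: class 0 is correct
probs = softmax([2.0, 1.0, 0.1])  # illustrative raw scores
loss = cross_entropy(target, probs)
print([round(p, 3) for p in probs], round(loss, 3))
```

With a one-hot target, the loss reduces to the negative log of the probability assigned to the correct class, so confident correct predictions drive it toward zero.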
Autoencoders
• a neural network trained to try to copy its input to its
output
• unsupervised learning technique
• has a hidden layer h (code) used to represent the input
• network consisting of two parts:
– an encoder function h = f(x)
– a decoder function x^ = g(h)
– reconstruction: x^ = g(f(x))
• The output is an approximate (reconstructed) copy of the input
that resembles the training data
• Used for dimensionality reduction, feature
learning and generative modeling
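To make the encoder/decoder split concrete, here is a deliberately tiny sketch: a linear autoencoder that compresses a 2-D input to a single code value and maps it back. The fixed weight vector `W` is a hypothetical stand-in for parameters that a real autoencoder would learn from data, and sharing it between encoder and decoder (tied weights) is a simplification made for this example.

```python
def encoder(x, w):
    """h = f(x): project the input onto direction w, giving a 1-D code."""
    return sum(wi * xi for wi, xi in zip(w, x))

def decoder(h, w):
    """x_hat = g(h): map the code back into the input space."""
    return [h * wi for wi in w]

def reconstruction_error(x, x_hat):
    """Squared distance between the input and its reconstruction."""
    return sum((a - b) ** 2 for a, b in zip(x, x_hat))

W = (0.6, 0.8)  # hypothetical unit-length direction (not learned here)

x = (3.0, 4.0)          # lies along W, so one number describes it fully
h = encoder(x, W)
x_hat = decoder(h, W)
print(h, x_hat, reconstruction_error(x, x_hat))

x_off = (4.0, -3.0)     # perpendicular to W, so the code cannot capture it
x_off_hat = decoder(encoder(x_off, W), W)
print(x_off_hat, reconstruction_error(x_off, x_off_hat))
```

The second input shows the dimensionality-reduction trade-off: inputs that resemble the structure captured by the code are reconstructed well, while others are not.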
• suppose we have a set of data points
• A sparsity penalty penalizes neurons that are too active, forcing them to activate less
• This forces the model to have only a small number of hidden units activated
at the same time
• Sparse autoencoders are used to learn
features for another task, such as classification
• sparsity penalty can yield a model that has
learned useful features as a byproduct
• The sparsity penalty function prevents the neural
network from activating too many neurons and
serves as a regularizer
• λ is a hyperparameter that controls the strength of the sparsity penalty
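The slides do not say which sparsity penalty is used, so the sketch below assumes one common choice: an L1 penalty on the hidden activations, added to a mean-squared reconstruction error and weighted by λ (`lam` in the code). All names and numbers are illustrative.

```python
def sparse_autoencoder_loss(x, x_hat, h, lam):
    """Reconstruction error plus lam times an L1 penalty on the code h."""
    m = len(x)
    reconstruction = sum((a - b) ** 2 for a, b in zip(x, x_hat)) / m
    sparsity_penalty = sum(abs(a) for a in h)  # assumed L1 form of the penalty
    return reconstruction + lam * sparsity_penalty

x = [1.0, 0.0]
x_hat = [1.0, 0.0]                # assume a perfect reconstruction in both cases
h_dense = [0.5, 0.5, 0.5, 0.5]    # every hidden unit is active
h_sparse = [1.0, 0.0, 0.0, 0.0]   # only one hidden unit is active

print(sparse_autoencoder_loss(x, x_hat, h_dense, lam=0.1))
print(sparse_autoencoder_loss(x, x_hat, h_sparse, lam=0.1))
```

With equal reconstruction quality, the code with fewer active units gets the lower loss; raising `lam` strengthens that preference, and `lam=0` recovers a plain autoencoder loss.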
Original digits sampled from the MNIST test set (left),
reconstruction of sampled digits with a non-sparse autoencoder (middle), and
reconstruction with a sparse autoencoder (right)
Image Courtesy: https://bradleyboehmke.github.io/HOML/autoencoders.html#denoising-autoenc
Denoising Autoencoders
• Rather than adding a penalty Ω to the cost function, a denoising
autoencoder learns something useful by changing the
reconstruction error term of the cost function
• This forces the codings to capture more robust features of the inputs
• A denoising autoencoder can be viewed as having two objectives:
• (i) try to encode the inputs to preserve the
essential signals
• (ii) try to undo the effects of a corruption process
stochastically applied to the inputs of the
autoencoder
An autoencoder (AE) consists of
• an encoder function g(.) parameterized by ϕ
• a decoder function f(.) parameterized by θ
(note that g and f swap roles here relative to the earlier notation h = f(x))
• a metric to quantify the difference between two vectors,
e.g. cross-entropy when the activation function is sigmoid
Denoising Autoencoder
• The input is partially corrupted by adding
noise to, or masking some values of, the input
vector in a stochastic manner
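The two corruption styles mentioned above (masking values and adding noise) can be sketched with the standard `random` module. The function names, the drop probability and the noise level are assumptions chosen for illustration; a seeded generator is used only to make the run repeatable.

```python
import random

def mask_corrupt(x, drop_prob, rng):
    """Masking noise: each value is set to 0 with probability drop_prob."""
    return [0.0 if rng.random() < drop_prob else v for v in x]

def gaussian_corrupt(x, sigma, rng):
    """Additive noise: perturb each value with zero-mean Gaussian noise."""
    return [v + rng.gauss(0.0, sigma) for v in x]

rng = random.Random(0)  # seeded so the corruption is repeatable
x = [0.2, 0.9, 0.4, 0.7, 0.1, 0.8]

x_masked = mask_corrupt(x, drop_prob=0.5, rng=rng)
x_noisy = gaussian_corrupt(x, sigma=0.1, rng=rng)
print(x_masked)
print([round(v, 3) for v in x_noisy])
```

During training, the corrupted vector is what the encoder sees, while the reconstruction error is measured against the clean input `x`; that mismatch is what pushes the network to undo the corruption.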
Contractive Autoencoder (CAE)
• The penalty term Ω(h) for the hidden layer is calculated w.r.t. the input x:
it is the Frobenius norm of the Jacobian matrix of the hidden activations
• The Frobenius norm is obtained by squaring
all of the elements in the matrix, taking the sum,
and then taking the square root of this sum: $\|A\|_F = \sqrt{\sum_{i}\sum_{j} a_{ij}^2}$
• If the input has n dimensions and the hidden
layer has k dimensions, then the Jacobian is a k × n matrix
whose (j, i) entry is ∂h_j/∂x_i
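As a rough sketch of this penalty, assume a hidden layer of the form h = sigmoid(Wx + b); the slides do not specify the layer, so this form and the toy weights are assumptions. For that layer, the Jacobian of h with respect to x has k rows and n columns, and its (j, i) entry is h_j (1 - h_j) W_ji.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def hidden(x, W, b):
    """Hidden activations h = sigmoid(Wx + b); W has k rows and n columns."""
    return [sigmoid(sum(w * xi for w, xi in zip(row, x)) + bj)
            for row, bj in zip(W, b)]

def jacobian(x, W, b):
    """k x n Jacobian dh/dx for the assumed sigmoid layer."""
    h = hidden(x, W, b)
    return [[hj * (1.0 - hj) * w for w in row] for hj, row in zip(h, W)]

def frobenius_norm(M):
    """Square every element, sum them, then take the square root."""
    return math.sqrt(sum(v * v for row in M for v in row))

# Toy parameters: n = 3 inputs, k = 2 hidden units.
W = [[1.0, -1.0, 0.5],
     [0.0, 2.0, 1.0]]
b = [0.0, 0.0]
x = [0.0, 0.0, 0.0]

J = jacobian(x, W, b)
print(len(J), len(J[0]), frobenius_norm(J))
```

A small norm means the hidden code changes little when the input is perturbed, which is the behaviour the penalty rewards.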
CNN Architectures
• Case Studies
– AlexNet
– ResNet
AlexNet (Krizhevsky et al., 2012)
Alex Krizhevsky et al., 2012, “ImageNet Classification with Deep Convolutional Neural Networks”
Full (simplified) AlexNet architecture:
[227x227x3] INPUT
[55x55x96] CONV1: 96 11x11 filters at stride 4, pad 0
[27x27x96] MAX POOL1: 3x3 filters at stride 2
[27x27x96] NORM1: Normalization layer
[27x27x256] CONV2: 256 5x5 filters at stride 1, pad 2
[13x13x256] MAX POOL2: 3x3 filters at stride 2
[13x13x256] NORM2: Normalization layer
[13x13x384] CONV3: 384 3x3 filters at stride 1, pad 1
[13x13x384] CONV4: 384 3x3 filters at stride 1, pad 1
[13x13x256] CONV5: 256 3x3 filters at stride 1, pad 1
[6x6x256] MAX POOL3: 3x3 filters at stride 2
[4096] FC6: 4096 neurons
[4096] FC7: 4096 neurons
[1000] FC8: 1000 neurons (class scores)
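The spatial sizes in the listing above follow from the usual output-size rule for convolution and pooling layers, (W - F + 2P) / S + 1, where W is the input width, F the filter size, P the padding and S the stride. The sketch below re-derives them; the helper name and the layer tuples are written for this example, and the NORM layers are omitted because they do not change the size.

```python
def output_size(width, kernel, stride, pad=0):
    """Spatial output size of a conv or pool layer: (W - F + 2P) / S + 1."""
    span = width - kernel + 2 * pad
    if span % stride != 0:
        raise ValueError("filter does not tile the input evenly")
    return span // stride + 1

# (name, filter size, stride, pad) taken from the listing above.
layers = [
    ("CONV1", 11, 4, 0),
    ("MAX POOL1", 3, 2, 0),
    ("CONV2", 5, 1, 2),
    ("MAX POOL2", 3, 2, 0),
    ("CONV3", 3, 1, 1),
    ("CONV4", 3, 1, 1),
    ("CONV5", 3, 1, 1),
    ("MAX POOL3", 3, 2, 0),
]

size = 227  # input width and height
sizes = {}
for name, kernel, stride, pad in layers:
    size = output_size(size, kernel, stride, pad)
    sizes[name] = size
    print(f"{name}: {size}x{size}")

# The last pooled volume is flattened before the first fully connected layer.
flattened = sizes["MAX POOL3"] ** 2 * 256
print("FC6 input features:", flattened)
```

The depth of each volume is not computed here because it simply equals the number of filters in the layer (or is unchanged by pooling).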
Details/Retrospectives:
- first prominent use of ReLU in a large CNN
- used Norm layers (not common anymore)
- data augmentation
- dropout 0.5
- batch size 128
- SGD Momentum 0.9
- Learning rate 1e-2
- L2 weight decay 5e-4
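To show how the momentum, learning rate and weight decay values listed above interact, here is a minimal sketch of one common formulation of SGD with momentum, in which the weight decay term is folded into the velocity update. The toy loss L(w) = 0.5 * w^2 and the single scalar weight are assumptions for illustration and have nothing to do with AlexNet's actual training run.

```python
def sgd_momentum_step(w, v, grad, lr=1e-2, momentum=0.9, weight_decay=5e-4):
    """One update: v <- momentum*v - lr*(grad + weight_decay*w); w <- w + v."""
    new_v = [momentum * vi - lr * (gi + weight_decay * wi)
             for wi, vi, gi in zip(w, v, grad)]
    new_w = [wi + vi for wi, vi in zip(w, new_v)]
    return new_w, new_v

# Toy problem: minimize L(w) = 0.5 * w^2, whose gradient is simply w.
w, v = [1.0], [0.0]
for _ in range(200):
    grad = list(w)
    w, v = sgd_momentum_step(w, v, grad)
print(w)
```

The velocity accumulates past gradients, so consecutive steps in the same direction speed up, while the weight decay term continually pulls the weights toward zero.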
ImageNet Dataset
• 15 million labeled high-resolution images,
belonging to roughly 22,000 categories
• The images were collected from the web
• The ImageNet Large-Scale Visual
Recognition Challenge (ILSVRC) was first held in 2010
• ILSVRC uses a subset of ImageNet with roughly
1000 images in each of 1000 categories
• ImageNet consists of variable-resolution images
• The ILSVRC subset has roughly 1.2 million training images,
50,000 validation images and
150,000 testing images
Transfer Learning