Unit 2.1

deep learning

Uploaded by

jadhavrohan7337
Copyright
© All Rights Reserved

MIT Art Design and Technology University

MIT School of Computing, Pune


21BTCS031 – Deep Learning & Neural Networks

Class - L.Y. CORE (SEM-I)

Unit - II Deep Networks


Dr. Anant Kaulage
Dr. Sunita Parinam
Dr. Mayura Shelke
Dr. Aditya Pai
AY 2024-2025 SEM-I
Deep Neural Networks

Unit II
Introduction
□ Modern deep learning provides a very powerful
framework for supervised learning.
□ By adding more layers and more units within a
layer, a deep network can represent functions
of increasing complexity
□ Deep feedforward networks, also often
called feedforward neural networks, or
multilayer perceptrons (MLPs) are the
quintessential deep learning models.
□ The goal of a feedforward network is to
approximate some function f*.
Introduction
□ For example, for a classifier, y = f*(x) maps an input x
to a category y.
□ A feedforward network defines a mapping y = f(x; θ)
and learns the value of the parameters θ
that result in the best function approximation
□ Feedforward: information flows through the
function being evaluated from x, through the
intermediate computations used to define f,
and finally to the output y.
□ There are no feedback connections in which outputs of
the model are fed back into itself
Introduction
□ When feedforward neural networks are
extended to include feedback connections,
they are called recurrent neural networks
□ Feedforward neural networks are called
networks because they are typically
represented by composing together many
different functions
■ f(x) = f^(3)(f^(2)(f^(1)(x)))
□ During neural network training, we drive f(x) to
match f∗(x).
y ≈ f∗(x)
Introduction
□ Linear models, such as logistic regression and linear
regression, are appealing because they may be fit
efficiently and reliably
□ To extend linear models to represent nonlinear
functions of x, we can apply the linear model not to x
itself but to a transformed input φ(x)
■ φ is a nonlinear transformation
□ Options for choosing the mapping φ
■ Use a very generic φ, such as the infinite-dimensional φ
that is implicitly used by kernel machines based on the
RBF kernel
■ Manually engineer φ
■ Use deep learning to learn φ
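As a small illustration of the "manually engineer φ" option, the sketch below fits an ordinary linear model on a hand-picked feature map. The map φ(x) = [x1, x2, x1·x2], the XOR truth table used as data, and the names `phi`, `features`, and `theta` are assumptions chosen for illustration; they are not part of the slides.

```python
import numpy as np

def phi(X):
    """Hand-engineered feature map (an illustrative assumption): [x1, x2, x1*x2]."""
    return np.column_stack([X[:, 0], X[:, 1], X[:, 0] * X[:, 1]])

# XOR truth table: a linear model on x itself cannot fit these targets.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 1, 1, 0], dtype=float)

# Linear model applied to phi(x): append a column of ones for the bias
# and solve the least-squares problem.
features = np.column_stack([phi(X), np.ones(len(X))])
theta, *_ = np.linalg.lstsq(features, y, rcond=None)

predictions = features @ theta
```

The model is still linear in its parameters; only the input representation changed. Deep learning replaces the hand-written `phi` with a learned one.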
Learning XOR
□ XOR function: when exactly one of the two binary
values is equal to 1, the XOR function returns 1;
otherwise it returns 0.
□ target function, y = f∗(x)
□ Our model provides a function y = f(x;θ)
□ our learning algorithm will adapt the
parameters θ to make f as similar as possible
to f∗
□ X = {[0, 0]^T, [0, 1]^T, [1, 0]^T, [1, 1]^T}
□ Consider regression problem and use a mean
squared error loss function
Learning XOR
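□ Evaluated on our whole training set, the MSE
loss function is
■ J(θ) = (1/4) Σ_{x∈X} (f∗(x) − f(x; θ))^2
□ Linear model
■ f(x; w, b) = x^T w + b
□ We can minimize J(θ) in closed form with respect to w
and b using the normal equations
■ The solution is w = 0 and b = 1/2, so the linear model
outputs 0.5 everywhere and cannot represent XOR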
□ Simple feedforward network with one hidden
layer containing two hidden units
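The closed-form fit of a linear model to XOR can be checked numerically. The sketch below assumes the linear model f(x; w, b) = x^T w + b and the mean squared error over the four XOR points; the variable names are illustrative.

```python
import numpy as np

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 1, 1, 0], dtype=float)

# Design matrix with a trailing column of ones so the bias b is fitted with w.
A = np.column_stack([X, np.ones(len(X))])

# Normal equations: (A^T A) theta = A^T y
theta = np.linalg.solve(A.T @ A, A.T @ y)
w, b = theta[:2], theta[2]

linear_predictions = A @ theta
mse = np.mean((y - linear_predictions) ** 2)
```

The best linear fit has zero weights and predicts 0.5 for every input, which is why a hidden layer is introduced next.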
Learning XOR
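□ h = f^(1)(x; W, c)
□ y = f^(2)(h; w, b)
□ Complete model: f(x; W, c, w, b) = f^(2)(f^(1)(x))
□ A nonlinear function g, called an activation function,
is applied to the hidden units: h = g(W^T x + c)
□ With the rectified linear unit g(z) = max{0, z}:
f(x; W, c, w, b) = w^T max{0, W^T x + c} + b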
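A minimal sketch of this two-unit network follows. The slides do not list parameter values; the values of `W`, `c`, `w`, and `b` below are one well-known hand-chosen solution and should be read as an assumption rather than as the result of training.

```python
import numpy as np

def xor_network(X, W, c, w, b):
    """f(x; W, c, w, b) = w^T max{0, W^T x + c} + b, applied to each row of X."""
    # Rows of X are inputs, so X @ W computes W^T x for every row at once.
    h = np.maximum(0.0, X @ W + c)
    return h @ w + b

# One hand-chosen solution (an assumption, not learned here).
W = np.array([[1.0, 1.0],
              [1.0, 1.0]])
c = np.array([0.0, -1.0])
w = np.array([1.0, -2.0])
b = 0.0

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
outputs = xor_network(X, W, c, w, b)
```

The hidden layer maps the inputs [0, 1] and [1, 0] to the same point h = [1, 0], so the four inputs become linearly separable in h-space.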
Gradient Based Learning
□ The largest difference between the linear models
and neural networks is that the nonlinearity of a
neural network causes most interesting loss
functions to become non-convex
□ Neural networks are usually trained by using
iterative, gradient-based optimizers
□ This is in contrast to the linear equation solvers used
to train linear regression models and the convex
optimization algorithms with global convergence
guarantees used for logistic regression
□ Convex optimization converges starting from any
initial parameters
Gradient Based Learning
□ What makes non-convex optimization
hard?
■ Potentially many local minima
■ Saddle points
■ Very flat regions
■ Widely varying curvature
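The effect of local minima can be seen on a one-dimensional toy function. The function f(x) = x^4 − 3x^2 + x, the step size, and the starting points below are assumptions picked for illustration; the sketch uses plain gradient descent.

```python
def f(x):
    """Toy non-convex function with two local minima (an illustrative assumption)."""
    return x**4 - 3 * x**2 + x

def grad_f(x):
    """Derivative of f."""
    return 4 * x**3 - 6 * x + 1

def gradient_descent(x0, lr=0.01, steps=2000):
    """Plain gradient descent from the starting point x0."""
    x = x0
    for _ in range(steps):
        x -= lr * grad_f(x)
    return x

x_from_left = gradient_descent(-2.0)    # ends in the minimum at negative x
x_from_right = gradient_descent(2.0)    # ends in the minimum at positive x
```

Both runs stop at a point where the gradient vanishes, but only the run started on the left reaches the lower of the two minima. For a convex function the starting point would not matter.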
Gradient Based Learning
□ Matrix completion, principal component
analysis
□ Low-rank models and tensor decomposition
□ Maximum likelihood estimation with hidden
variables
□ The big one: deep neural networks
Gradient Based Learning
□ How to solve non-convex problems
■ Stochastic gradient descent
■ Mini-batching
■ SVRG (stochastic variance reduced gradient)
■ Momentum
□ There are also specialized methods for
solving non-convex problems
■ Alternating minimization methods
■ Branch-and-bound methods
■ These generally aren’t very popular for
machine learning problems
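The first group of methods can be combined in a few lines. The sketch below shows mini-batch stochastic gradient descent with momentum. To keep the result easy to check, it is run on a convex toy problem (a linear model with noise-free targets), which is an assumption of this example; the same update rule is what gets applied to non-convex network losses. The function name `sgd_momentum` and all hyperparameter values are illustrative.

```python
import numpy as np

def sgd_momentum(X, y, lr=0.05, momentum=0.9, batch_size=16, epochs=200, seed=0):
    """Mini-batch SGD with momentum on the mean squared error of a linear model."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    theta = np.zeros(d)
    velocity = np.zeros(d)
    for _ in range(epochs):
        order = rng.permutation(n)                      # reshuffle every epoch
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]       # indices of one mini-batch
            residual = X[idx] @ theta - y[idx]
            grad = 2.0 * X[idx].T @ residual / len(idx) # gradient of the mini-batch MSE
            velocity = momentum * velocity - lr * grad  # accumulate past gradients
            theta = theta + velocity
    return theta

data_rng = np.random.default_rng(1)
inputs = data_rng.uniform(-1.0, 1.0, size=64)
X = np.column_stack([inputs, np.ones_like(inputs)])     # feature plus bias column
y = 2.0 * inputs + 1.0                                  # noise-free toy targets
theta = sgd_momentum(X, y)
```

Each update uses only `batch_size` examples, so its cost does not grow with the size of the training set; the velocity term smooths the noisy mini-batch gradients.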
Cost Functions: Conditional
Distribution
□ An important aspect of the design of a deep
neural network is the choice of the cost
function
□ our parametric model defines a distribution p(y
| x;θ ) and we simply use the principle of
maximum likelihood
□ cost function: cross-entropy between the
training data and the model’s predictions
□ Most modern neural networks are trained using
maximum likelihood
■ cost function : negative log-likelihood
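For a classifier, the negative log-likelihood cost can be sketched as follows. The function name, the array shapes, and the example probabilities are assumptions made for illustration.

```python
import numpy as np

def negative_log_likelihood(probs, labels):
    """Mean of -log p(y | x) over a batch.

    probs:  (batch, classes) array of model probabilities p(y | x; theta)
    labels: (batch,) array of integer class indices
    """
    picked = probs[np.arange(len(labels)), labels]  # probability given to the true class
    return -np.mean(np.log(picked))

probs = np.array([[0.5, 0.5],
                  [0.9, 0.1]])
labels = np.array([0, 0])
cost = negative_log_likelihood(probs, labels)
```

The cost is small when the model puts high probability on the observed labels and grows without bound as that probability approaches zero.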
Cost Functions
Conditional Statistics
□ Instead of learning a full probability distribution
p(y | x; θ) we often want to learn just one
conditional statistic of y given x.
□ For example, we may have a predictor f(x; θ)
that we wish to use to predict the mean of y
Output Units
□ The choice of cost function is tightly coupled with
the choice of output unit.
□ Most of the time, we simply use the cross-entropy
between the data distribution and the model
distribution.
□ The choice of how to represent the output then
determines the form of the cross-entropy function
□ The role of the output layer is to provide some
additional transformation from the features to
complete the task that the network must perform.
Linear Units for Gaussian
Output Distributions
□ Given features h, a layer of linear output units
produces a vector ŷ = W^T h + b
□ Linear output layers are often used to produce
the mean of a conditional Gaussian distribution
■ p(y | x) = N(y; ŷ, I)
■ Gaussian distribution over y with mean ŷ and
covariance I
□ Maximizing the log-likelihood is then equivalent
to minimizing the mean squared error
□ Because linear units do not saturate, they pose
little difficulty for gradient based optimization
algorithms
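The link between the Gaussian log-likelihood and the squared error can be checked numerically. The sketch below assumes a single d-dimensional target and identity covariance, as in p(y | x) = N(y; ŷ, I); the function names and example values are illustrative.

```python
import numpy as np

def gaussian_nll(y, y_hat):
    """-log N(y; y_hat, I), computed from the density of a unit-covariance Gaussian."""
    d = y.shape[0]
    density = (2.0 * np.pi) ** (-d / 2.0) * np.exp(-0.5 * np.sum((y - y_hat) ** 2))
    return -np.log(density)

def half_squared_error(y, y_hat):
    """0.5 * ||y - y_hat||^2"""
    return 0.5 * np.sum((y - y_hat) ** 2)

y = np.array([1.0, -2.0, 0.5])
constant = 0.5 * len(y) * np.log(2.0 * np.pi)   # does not depend on y_hat

# The gap between the two costs is the same constant for any prediction.
gap_a = gaussian_nll(y, np.zeros(3)) - half_squared_error(y, np.zeros(3))
gap_b = gaussian_nll(y, y) - half_squared_error(y, y)
```

Because the two costs differ only by a constant that does not depend on ŷ, they have the same minimizer and the same gradients with respect to the network parameters.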
Linear Units for Gaussian
Output Distributions
□ Functions that saturate (become very flat)
undermine gradient-based learning
■ Because the gradient becomes very small
■ This happens when the activation functions producing
the output of hidden or output units saturate
□ Negative log-likelihood helps avoid the
saturation problem for many models
■ Many output units involve an exp function that
saturates when its argument is very negative
■ The log function in the negative log-likelihood cost
function undoes the exp of some output units
□ Possible use in VAE
Sigmoid Units for Bernoulli
Output Distributions
□ Many tasks require predicting the value of a
binary variable y .
□ E.g. classification problems with two classes
□ The maximum-likelihood approach is to define
a Bernoulli distribution over y conditioned on x
□ A Bernoulli distribution is defined by just a
single number.
□ The neural net needs to predict only P(y = 1 |
x).
□ For this number to be a valid probability, it
must lie in the interval [0, 1].
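A sigmoid output unit is the usual way to meet this constraint. In the sketch below, `z` stands for the output of a final linear layer (an assumption; the slide does not name it), and the loss is written with softplus so that the log in the negative log-likelihood cancels the exp inside the sigmoid.

```python
import numpy as np

def sigmoid(z):
    """Logistic sigmoid: squashes any real z into the interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def softplus(a):
    """log(1 + exp(a)), computed stably with logaddexp."""
    return np.logaddexp(0.0, a)

def bernoulli_nll(z, y):
    """-log P(y | x) for y in {0, 1} when P(y = 1 | x) = sigmoid(z)."""
    return softplus((1.0 - 2.0 * y) * z)
```

For a confidently wrong prediction (for example z = -1000 with y = 1) the loss grows linearly in |z| instead of flattening out, so the gradient does not vanish where a correction is needed most.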
Softmax Units for Multinoulli
Output Distributions
□ Any time we wish to represent a probability
distribution over a discrete variable with n
possible values, we may use the softmax function
□ Softmax functions are most often used as the
output of a classifier, to represent the probability
distribution over n different classes.
□ In case of a discrete variable with n values,
produce a vector ŷ with ŷ_i = P(y = i | x)
□ First, a linear layer predicts unnormalized log
probabilities: z = W^T h + b, where
z_i = log P̃(y = i | x)
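The softmax function then exponentiates and normalizes z, so that softmax(z)_i = exp(z_i) / Σ_j exp(z_j). A minimal sketch follows; the example values of `z` are arbitrary, and subtracting the maximum is a standard numerical-stability step that is not mentioned in the slides.

```python
import numpy as np

def softmax(z):
    """softmax(z)_i = exp(z_i) / sum_j exp(z_j), taken over the last axis of z."""
    # Subtracting the maximum leaves the result unchanged and avoids overflow in exp.
    shifted = z - np.max(z, axis=-1, keepdims=True)
    exp_z = np.exp(shifted)
    return exp_z / np.sum(exp_z, axis=-1, keepdims=True)

z = np.array([2.0, 1.0, 0.1])   # unnormalized log probabilities
y_hat = softmax(z)              # valid probability distribution over 3 classes
```

Adding the same constant to every z_i does not change the output, which is why the shift is safe.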
Hidden Units
□ How to choose the type of hidden unit to use in the
hidden layers of the model?
□ The design of hidden units is an extremely active area of
research and does not yet have many definitive guiding
theoretical principles
□ Rectified linear units are an excellent default choice
□ Positives:
■ Gives large and consistent gradients (does not saturate) when
active
■ Efficient to optimize, converges much faster than sigmoid or
tanh
□ Negatives:
■ Non zero centered output
■ Units "die" i.e. when inactive they will never update
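The two properties listed above can be seen directly from the ReLU and its derivative. In this sketch the derivative at exactly z = 0 is taken to be 0 by convention, and the weight and bias of the "dead" unit are hypothetical values chosen for illustration.

```python
import numpy as np

def relu(z):
    """Rectified linear unit: max{0, z}."""
    return np.maximum(0.0, z)

def relu_grad(z):
    """Derivative of ReLU with respect to z (taken as 0 at z = 0 by convention)."""
    return (z > 0).astype(float)

z = np.array([-3.0, -0.5, 0.5, 3.0, 30.0])
activations = relu(z)
gradients = relu_grad(z)        # exactly 1 wherever the unit is active

# A unit whose pre-activation is negative for every input receives zero gradient.
inputs = np.array([0.0, 1.0, 2.0])
dead_pre_activation = 1.0 * inputs - 10.0   # hypothetical weight 1.0 and bias -10.0
dead_gradients = relu_grad(dead_pre_activation)
```

The gradient stays at 1 however large the input becomes, so an active unit does not saturate, while a unit that is inactive on all training inputs gets no signal to change its weights.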
Hidden Units
□ Sigmoid
□ Tanh
□ Radial basis function
□ Softplus
□ Hard Tanh
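For reference, the hidden unit types listed above can be written as short functions. The definitions below follow the common textbook forms; the RBF unit in particular is written with a `center` vector and a width `sigma`, which are parameter names assumed for this sketch.

```python
import numpy as np

def sigmoid(z):
    """Logistic sigmoid, output in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def tanh(z):
    """Hyperbolic tangent, output in (-1, 1), zero-centered."""
    return np.tanh(z)

def rbf(x, center, sigma=1.0):
    """Radial basis function unit: exp(-||x - center||^2 / sigma^2)."""
    return np.exp(-np.sum((x - center) ** 2) / sigma**2)

def softplus(z):
    """Smooth version of the rectifier: log(1 + exp(z))."""
    return np.logaddexp(0.0, z)

def hard_tanh(z):
    """Piecewise-linear tanh: max(-1, min(1, z))."""
    return np.clip(z, -1.0, 1.0)
```

Sigmoid and tanh are closely related: tanh(z) = 2·sigmoid(2z) − 1. Both saturate for large |z|, which is the main reason the rectified linear unit is usually preferred for hidden layers.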
Architecture Design
□ Key design consideration for neural networks
□ How many units it should have and how these
units should be connected to each other
□ Neural networks are organized into layers
□ The main architectural considerations are to choose
the depth of the network and the width of each
layer
□ A network with even one hidden layer is
sufficient to fit the training set
□ Deeper networks often are able to use far fewer
units per layer and far fewer parameters and
often generalize to the test set
□ However, deeper networks are also often harder to optimize
□ The ideal network architecture for a task must
be found via experimentation guided by
monitoring the validation set error
Universal Approximation
Properties and Depth
□ At first glance, we may presume that learning a nonlinear
function requires designing a specialized model
□ Fortunately, feedforward networks with hidden layers
provide a universal approximation framework
□ the universal approximation theorem states that a
feedforward network with a linear output layer and at
least one hidden layer with any “squashing” activation
function can approximate any Borel measurable
function from one finite-dimensional space to another
with any desired non-zero amount of error, provided
that the network is given enough hidden units
□ In particular, any continuous function on a closed and
bounded subset of R^n is Borel measurable and can
therefore be approximated by a neural network
□ Mathematically speaking, any neural network
architecture aims at finding a mathematical function
y = f(x) that maps attributes (x) to outputs (y).
□ The accuracy of this mapping differs
depending on the distribution of the dataset and the
architecture of the network employed.
□ The function f(x) can be arbitrarily complex.
□ The Universal Approximation Theorem tells us that
neural networks have a kind of universality, i.e. no
matter what f(x) is, there is a network that can
approximate it to the desired accuracy
□ This result holds for any number of inputs and outputs.
Universal Approximation
Properties and Depth
□ The universal approximation theorem means that
regardless of what function we are trying to learn, we
know that a large MLP will be able to represent this
function.
□ However, we are not guaranteed that the training
algorithm will be able to learn that function
■ Reasons: optimization algorithm used for training may not
be able to find the value of the parameters
■ training algorithm might choose the wrong function due to
overfitting
□ The universal approximation theorem says that there
exists a network large enough to achieve any degree of
accuracy we desire.
□ How large?
□ In summary, a feedforward network with
a single hidden layer is sufficient to represent
any function, but the layer may be
infeasibly large and may fail to learn and
generalize correctly.
□ In many circumstances, using deeper
models can reduce the number of units
required to represent the desired
function and can reduce the amount of
generalization error
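The role of width in a single hidden layer can be illustrated with a small experiment. In the sketch below the target function sin(x), the tanh hidden units with random (untrained) weights, and all sizes are assumptions chosen for illustration; only the linear output layer is fitted, by least squares. It shows approximation on the training points only and says nothing about generalization.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-np.pi, np.pi, 200)
target = np.sin(x)                                     # toy function to approximate

max_units = 100
slopes = rng.normal(0.0, 2.0, size=max_units)          # random, untrained hidden weights
offsets = rng.uniform(-np.pi, np.pi, size=max_units)   # random, untrained hidden biases
hidden = np.tanh(np.outer(x, slopes) + offsets)        # (200, max_units) hidden activations

def fit_error(n_units):
    """Fit the linear output layer on the first n_units hidden units; return training MSE."""
    H = np.column_stack([hidden[:, :n_units], np.ones_like(x)])
    coef, *_ = np.linalg.lstsq(H, target, rcond=None)
    return np.mean((H @ coef - target) ** 2)

errors = {n: fit_error(n) for n in (1, 5, 25, 100)}
```

With more hidden units the output layer has more basis functions to combine, so the approximation error on these points shrinks. The theorem guarantees that a large enough layer exists; it does not say how large, and it does not guarantee that training will find the right parameters.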
Other Architectures
□ In practice, neural networks show considerably more
diversity than a single chain of layers
□ Specialized architectures for computer vision are called
convolutional networks
□ Feedforward networks may also be generalized to
recurrent neural networks for sequence processing

[Figure: Empirical results showing that deeper networks generalize better]
