DL Answers

1. Define activation function and discuss various activation functions with suitable graphs.
2. List the parameters to design a feed-forward neural network and discuss how to select the parameters.
3. Discuss various paradigms of learning methodologies and issues in deep learning.
4. Compare deep learning optimization, regularization and model selection.
5. Describe greedy layer-wise training.
6. Discuss the use of deep neural networks.
7. Discuss the hyper-parameters of convolutional neural networks.
8. Define deep feed-forward network and explain about regularizations.
1 a) Define activation function and discuss various activation functions with suitable graphs

Activation functions are mathematical equations that determine the output of a neural network model. Activation functions also have a major effect on the neural network's ability to converge and on the convergence speed; in some cases, a poor choice of activation function can prevent the network from converging in the first place. An activation function also helps to normalize the output of any input into a range such as -1 to 1 or 0 to 1.
An activation function must be efficient and should reduce the computation time, because a neural network is sometimes trained on millions of data points.
Let's consider a simple neural network model without any hidden layers. Here the output is

Y = ∑ (weights * input + bias)

and it can range from -infinity to +infinity. So it is necessary to bound the output to get the desired prediction or generalized results:

Y = Activation function (∑ (weights * input + bias))
So activation functions are an important part of an artificial neural network. They decide whether a neuron should be activated or not, applying a non-linear transformation to the input before sending it to the next layer of neurons or finalizing the output.
Properties of activation functions
1. Non Linearity
2. Continuously differentiable
3. Range
4. Monotonic
5. Approximates identity near the origin
Types of Activation Functions
The activation function can be broadly classified into 2 categories.
1. Binary Step Function
2. Linear Activation Function
Binary Step Function
A binary step function is generally used in the Perceptron linear classifier. It thresholds the input values to 1 or 0, depending on whether they are greater or less than zero, respectively.

The step function is mainly used in binary classification problems and works well for linearly separable problems; it cannot classify multi-class problems.
Linear Activation Function

The equation for Linear activation function is:


f(x) = ax
When a = 1, f(x) = x; this special case is known as the identity function.
Properties:
1. Range is -infinity to +infinity
2. Provides a convex error surface, so optimisation can be achieved faster
3. df(x)/dx = a, which is constant, so it cannot be optimised with gradient descent
Limitations:
1. Since the derivative is constant, the gradient has no relation to the input
2. Back propagation is effectively constant, as the change is just a multiple of delta x
Non-Linear Activation Functions
Modern neural network models use non-linear activation functions. They allow the model to create
complex mappings between the network’s inputs and outputs, such as images, video, audio, and data
sets that are non-linear or have high dimensionality.
Majorly there are 3 types of Non-Linear Activation functions.
1. Sigmoid Activation Functions
2. Rectified Linear Units or ReLU
3. Complex Nonlinear Activation Functions
Sigmoid Activation Functions
Sigmoid functions are bounded, differentiable, real functions that are defined for all real input values,
and have a non-negative derivative at each point.
Sigmoid or Logistic Activation Function
The sigmoid function is a logistic function, σ(x) = 1 / (1 + e^(-x)), and its output ranges between 0 and 1.

The output of the activation function is always going to be in the range (0, 1), compared to (-inf, inf) for the linear function. It is non-linear, continuously differentiable, monotonic, and has a fixed output range; however, it is not zero-centered.
Hyperbolic Tangent
The function produces outputs on the scale of [-1, 1] and it is a continuous function; in other words, it produces an output for every x value.
y = tanh(x) = (e^x - e^(-x)) / (e^x + e^(-x))

Inverse Hyperbolic Tangent (arctanh)


It is the inverse of tanh: it is defined for inputs in (-1, 1), and its output ranges over all real numbers (-inf, +inf).

Softmax
The softmax function is sometimes called the soft argmax function, or multi-class logistic regression.
This is because the softmax is a generalization of logistic regression that can be used for multi-class
classification, and its formula is very similar to the sigmoid function which is used for logistic
regression. The softmax function can be used in a classifier only when the classes are mutually
exclusive.
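As a concrete illustration, here is a minimal sketch (assuming NumPy is available) of a numerically stable softmax; shifting by the maximum before exponentiating is a standard trick to avoid overflow:

import numpy as np

def softmax(z):
    # Softmax is invariant to adding a constant to all inputs,
    # so subtracting the max improves numerical stability.
    e = np.exp(z - np.max(z))
    return e / e.sum()

print(softmax(np.array([2.0, 1.0, 0.1])))  # class probabilities summing to 1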
Gudermannian
The Gudermannian function relates circular functions and hyperbolic functions without explicitly using
complex numbers.

A standard mathematical form of the Gudermannian function is:
gd(x) = 2 · arctan(tanh(x/2))
GELU (Gaussian Error Linear Units)


An activation function used in recent Transformers such as Google's BERT and OpenAI's GPT-2. This activation function (in its common tanh approximation) takes the form:
GELU(x) = 0.5 · x · (1 + tanh(√(2/π) · (x + 0.044715 · x³)))
So it's just a combination of some functions (e.g. the hyperbolic tangent tanh) and approximated constants.

The curve dips slightly below zero for small negative inputs and then rises. For x greater than zero the output is approximately x, except in the region between roughly x = 0 and x = 1, where it leans towards a slightly smaller y-value.
These pros and cons describe the ReLU (Rectified Linear Unit) activation, f(x) = max(0, x), listed above among the non-linear functions.
Pros:
1. Less time and space complexity.
2. Avoids the vanishing gradient problem.
Cons:
1. Introduces the dead ReLU problem.
2. Does not avoid the exploding gradient problem.
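To accompany the graphs above, here is a self-contained sketch (assuming NumPy and Matplotlib are available) that computes and plots several of the activation functions discussed:

import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(-4, 4, 400)

activations = {
    "sigmoid": 1 / (1 + np.exp(-x)),
    "tanh": np.tanh(x),
    "ReLU": np.maximum(0, x),
    # tanh approximation of GELU, as given above
    "GELU": 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3))),
}

for name, y in activations.items():
    plt.plot(x, y, label=name)
plt.legend()
plt.grid(True)
plt.title("Common activation functions")
plt.show()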
2. List the parameters to design a feed-forward neural network and discuss how to select the parameters
Neural Networks: parameters, hyper parameters and optimization strategies
Neural Networks (NNs) are the typical algorithms used in Deep Learning analysis. NNs can take different shapes and structures; nevertheless, the core skeleton is the following:

So we have our inputs (x), we take the weighted sum of them (with weights equal to w), pass it through an
activation function f(.) and, voilà, we obtain our output. Then, depending on how accurate our predictions
are, the algorithm updates itself through the so-called ‘backpropagation’ phase, according to a given
optimization strategy.
Needless to say, this is a tremendously poor definition, but if you keep it in mind while reading this article, you will better understand its core topic.
Indeed, what I want to focus on is how to approach some characteristic elements of NNs, whose
initialization, optimization and tuning can make your algorithm much more powerful. Before starting, let’s
see which elements I’m talking about:
Parameters: these are the coefficients of the model, and they are chosen by the model itself. It means that the algorithm, while learning, optimizes these coefficients (according to a given optimization strategy) and returns an array of parameters which minimize the error. To give an example, in a linear regression task, you have a model that will look like y = b + ax, where b and a will be your parameters. The only thing you have to do with those parameters is initialize them (we will see later on what that means).
Hyper parameters: these are elements that, differently from the previous ones, you need to set.
Furthermore, the model will not update them according to the optimization strategy: your manual
intervention will always be needed.
Strategies: these are some tips and approaches you should take towards your model. Namely, before feeding your model with data, you might want to normalize the data, especially if you have values on different scales that might affect your algorithm's performance.
So, let’s examine all of them.
Parameters
As anticipated, the only thing you have to do with respect to parameters is initializing them (note that parameter initialization is a strategy). So, what is the best way to initialize them? For sure, what you should NOT do is set them all equal to zero: by doing so, you risk penalizing the whole algorithm. To give an example of the variety of problems you might face, there is that of remaining stuck at weights equal to zero even after several re-weighting procedures.
Hence, here are some ideas to properly initialize your parameters depending on the activation function you decide to employ (I will dwell on activation functions later on).
If you are using the Sigmoid or Tanh activation function, you might use Xavier initialization, with a uniform or normal distribution; in the uniform case, weights are commonly drawn from U[-√(6/(n_i + n_o)), √(6/(n_i + n_o))].

· If you are using a ReLU instead, you could use the He initialization, with a normal distribution of mean 0 and variance 2/n_i,

where n_i and n_o are, respectively, the number of inputs and the number of outputs.
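A short NumPy sketch of these two initializers; the function names and example layer sizes are illustrative assumptions:

import numpy as np

def xavier_uniform(n_i, n_o):
    # Xavier/Glorot: variance scaled by fan-in + fan-out (suits sigmoid/tanh)
    limit = np.sqrt(6.0 / (n_i + n_o))
    return np.random.uniform(-limit, limit, size=(n_i, n_o))

def he_normal(n_i, n_o):
    # He: normal with variance 2 / fan-in (suits ReLU)
    return np.random.randn(n_i, n_o) * np.sqrt(2.0 / n_i)

W1 = xavier_uniform(784, 128)   # e.g. a tanh hidden layer
W2 = he_normal(128, 64)         # e.g. a ReLU hidden layer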
Hyper parameters
This is far more interesting. Hyper parameters require your attention and knowledge much more than
parameters. So, to have an idea on how to handle them, let’s examine some of them:
Number of hidden layers: this is probably the most questionable point. The idea is that you want to keep your NN as simple as possible (you want it to be fast and to generalize well), but at the same time, you want it to classify your input data well. In this case (and in many others related to hyperparameters) you should proceed with manual attempts. It might sound 'old' in the era of self-learning machines, but remember that those machines need to be built before they can learn. So, to build some intuition, I suggest you run some experiments on the TensorFlow Playground: you will see how, after a certain number of layers/neurons, the accuracy does not improve anymore, hence it would be inefficient to keep the algorithm so heavy.
Learning rate: this hyper-parameter refers to the step of back propagation, when parameters are updated according to an optimization function. Basically, it represents how large the change in a weight is after a re-calibration. But what does 're-calibration' mean? Well, if you think about a generic loss function with only one weight, the graphical representation will be something like this:
You want to minimize the loss, so ideally you want your current w to slide towards the minimum. The
procedure should be the following:

The corresponding update rule is:

w_new = w - dL/dw
where the first term is your current weight and the second term is the gradient of your loss function (in this one-dimensional case, it is the first derivative of the loss function with respect to your only weight). Remember that the first derivative is negative when the slope of the tangent line is negative; that's why we put a minus between the two terms (intuition: if the slope is negative, the weight should move towards the right, as in the example). This optimization procedure is called Gradient Descent.
Now let's add a new term to the formula:

w_new = w - γ · dL/dw
This gamma is our learning rate, and it tells the algorithm how large the impact of the gradient on the weight should be. The problem with a small gamma is that the NN will converge (if it converges at all) very slowly, and we might incur the so-called 'vanishing gradient' problem. On the other side, if gamma is very big, the risk is missing the minimum and incurring the 'exploding gradient' scenario.
A good strategy might be starting with a value around 0.1 and then exponentially reducing it: at some point, the value of the loss function starts decreasing within the first few iterations, and that's the signal that the weights took the right direction.
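To make the update rule concrete, here is a toy sketch of gradient descent on an illustrative one-weight loss L(w) = (w - 3)²; the starting point and learning rate are assumptions for the example:

# Minimize the toy loss L(w) = (w - 3)**2 with plain gradient descent
w = 0.0        # initial weight
gamma = 0.1    # learning rate
for step in range(50):
    grad = 2 * (w - 3)      # dL/dw
    w = w - gamma * grad    # the update rule above
print(w)  # approaches the minimum at w = 3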
Momentum: it is a technique used during the back propagation phase. As said regarding the learning rate, parameters are updated so that they converge towards the minimum of the loss function. This process might be too long and affect the efficiency of the algorithm. Hence, one possible solution is keeping track of the previous directions (the gradients of the loss function with respect to the weights) and keeping them as embedded information: this is what momentum is for. It basically increases the speed of convergence not in terms of learning rate (how much a weight is updated each time) but in terms of embedded memory of past re-calibrations (the algorithm knows that the previous direction of that weight was, let's say, right, and it will directly proceed in this direction during the next propagation). We can visualize this if we consider the projection of a two-weight loss function (specifically, a paraboloid):

As you can see, if we add the momentum hyper-parameter, the descending phase is faster, since the model keeps track of the past gradient directions.
If you decide on high values of momentum, the algorithm will massively take past directions into account: this might result in an incredibly fast learning algorithm, but the risk of missing some correct 'deviations' is high. The suggestion is always to start with low values and then increase them little by little.
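Continuing the same toy example, a sketch of the classical momentum update; the momentum coefficient 0.9 is a common but illustrative choice:

w, v = 0.0, 0.0          # weight and velocity (memory of past directions)
gamma, mu = 0.1, 0.9     # learning rate and momentum coefficient
for step in range(50):
    grad = 2 * (w - 3)          # dL/dw for the toy loss L(w) = (w - 3)**2
    v = mu * v - gamma * grad   # accumulate past gradient directions
    w = w + v
print(w)  # approaches w = 3, typically in fewer steps than plain descent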
Activation function: it is the function through which we pass our weighted sum in order to have a meaningful output, namely a vector of probabilities or a 0-1 output. The major activation functions are Sigmoid (for multiclass classification, a variant of this function called the SoftMax function is used: it returns as output a vector of probabilities whose sum is equal to one), Tanh and ReLU.
Note that activation functions can be located at any point in the NN, as many times as you want. However, you always have to think about efficiency and speed. Namely, the ReLU function is very quick in terms of training, while the Sigmoid is more complex and takes more time. Hence, a good practice might be using ReLU for the hidden layers and then, in the last layer, inserting your Sigmoid.
· Minibatch size: when you are facing billions of data points, it can be inefficient (as well as counterproductive) to feed your NN with all of them. A good practice is feeding it with smaller samples of your data, called batches: by doing so, every time the algorithm trains itself, it trains on a sample of the same size as the batch. The typical size is 32 or higher; however, you need to keep in mind that if the size is too big, the risk is an over-generalized model which won't fit new data well.
· Epochs: this represents how many times you want your algorithm to train on your whole dataset (note that epochs are different from iterations: the latter are the number of batches needed to complete one epoch). Again, the number of epochs depends on the kind of data and task you are facing. One idea could be imposing a condition such that training stops when the error is close to zero. Or, more easily, you can start with a relatively low number of epochs and then increase it progressively, tracking some evaluation metrics (like accuracy).
· Dropout: this technique consists of randomly removing some nodes so that the NN is not too heavy. It is applied during the training phase. The idea is that we do not want our NN to be overwhelmed by information, especially if we consider that some nodes might be redundant and useless. So, while building our algorithm, we can decide to keep, for each training stage, each node with probability p (called the 'keep probability') or drop it with probability 1-p (called the 'drop probability').
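A minimal sketch (assuming NumPy) of the keep/drop mechanism just described, in its common "inverted dropout" form:

import numpy as np

def dropout(activations, p=0.8):
    # Keep each node with probability p, drop it with probability 1 - p;
    # dividing by p keeps the expected activation unchanged during training.
    mask = (np.random.rand(*activations.shape) < p).astype(activations.dtype)
    return activations * mask / p

h = np.random.randn(4, 8)        # hypothetical hidden-layer activations
h_train = dropout(h, p=0.8)      # applied at training time only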
Strategies
Strategies are approaches and best practices we might want to adopt towards our algorithm to make it perform better. Among these are the following:
· Parameter initialization: we have been talking about that in the first paragraph.
· Data normalization: while inspecting your data, you might notice that some features are represented on different scales. This might affect the performance of your NN, since convergence becomes slower. Normalizing data means converting all of them to the same scale, within the range [0, 1]. You can also decide to standardize your data, which means making them normally distributed with mean equal to 0 and standard deviation equal to 1 (see the sketch after this list). While data normalization happens before training your NN, another way you can normalize your data is through so-called Batch Normalization: it happens directly during your NN training, specifically after the weighted sum and before the activation function.
· Optimization algorithm: in the previous paragraph, I mentioned gradient descent as the optimization algorithm. However, there are many variants of it: Stochastic Gradient Descent (it minimizes the loss according to gradient descent optimization, and for each iteration it randomly selects a training sample, which is why it's called stochastic), RMSProp (which differs from the previous one in that each parameter has an adapted learning rate) and the Adam optimizer (RMSProp + momentum). Of course, this is not the full list, yet it is sufficient to understand that the Adam optimizer is often the best choice, since it allows you to set different hyper-parameters and customize your NN.
· Regularization: this strategy is pivotal if you want to keep your model simple and avoid overfitting. The idea is that regularization adds a penalty to the model if the weights are too large or too numerous. Indeed, it adds to our loss function a new term which tends to increase (hence, the loss increases too) if the re-calibration procedure increases the weights. There are two kinds of regularization: Lasso regularization (L1), which adds the penalty term λ · Σ |w| to the loss, and Ridge regularization (L2), which adds λ · Σ w².
The L1 regularization tends to shrink weights to zero, with the risk of getting rid of some inputs (since they
will be multiplied with a null value), whereas the L2 might shrink weights to very low values, but not to zero
(hence inputs are preserved).
It is interesting to note that this concept is strongly related to the Information Criteria in time series analysis. Indeed, while optimizing the Maximum Likelihood function of an autoregressive model, we might incur the same problem of overfitting, since this procedure tends to increase the number of parameters: that's why it is good practice to add a penalty if that number increases.
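As anticipated in the data-normalization strategy above, here is a short sketch (assuming NumPy; X is an illustrative feature matrix with columns on different scales):

import numpy as np

X = np.array([[1.0, 200.0],
              [2.0, 300.0],
              [3.0, 400.0]])

# Min-max normalization: rescale every feature to the range [0, 1]
X_norm = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

# Standardization: mean 0 and standard deviation 1 per feature
X_std = (X - X.mean(axis=0)) / X.std(axis=0)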
3. Discuss various paradigms of Learning methodologies and issues in Deep learning
Learning paradigms
Learning theories are usually divided into several paradigms which represent different perspectives on the
learning process. Theories within the same paradigm share the same basic point of view. Currently, the
most commonly accepted learning paradigms are behaviorism, cognitivism, constructivism, connectivism, and humanism.
Here we will refer to the named learning paradigms and their related learning and instructional design theories. A brief overview of each paradigm follows.
 Behaviorism
 Cognitivism
 Humanism
 Constructivism
 Connectivism

Behaviorism (since the 1900s)
What is learning: development of desired behavior
Control locus: environment
Learner role: passive, simply responding to external stimuli
Learning process: external supporting of desired behavior or punishing of undesired behavior
Critics: ignores the learner and his mental processes; depends exclusively on overt behavior

Cognitivism (since the 1960s)
What is learning: acquisition of new knowledge and the development of adequate mental constructions
Control locus: learner
Learner role: active and central to the process; learns objective knowledge from the external world
Learning process: an active process of acquiring and processing new information using prior knowledge and experience
Critics: views knowledge as objective and external to the learner

Humanism (since the 1960s)
What is learning: a means which should help the learner in self-actualization and the development of personal potentials
Control locus: learner
Learner role: active and discovering
Learning process: active learning through experience
Critics: a more psychologically than experimentally grounded approach, based on assumptions of free will and a system of human values which are generally believed to be true, yet sometimes discredited through counterexamples

Constructivism (since the 1970s)
What is learning: construction of new knowledge
Control locus: learner
Learner role: active, constructing his own representation of knowledge using preferred learning styles
Learning process: construction of a subjective representation of knowledge based on prior knowledge and experience
Critics: there is little evidence for some constructivist views, and some even contradict known findings

Connectivism (since the 2000s)
What is learning: a process of connection-forming
Control locus: mostly the learner, but also the environment
Learner role: knowledge acquisition in the form of establishing connections to other nodes
Learning process: learning can also reside outside a person (within a database or an organization) and is focused on establishing connections
Critics: a relatively new and, according to some, not fully developed theory
4. Compare deep learning optimization, regularization and model selection.
Implementing machine learning and deep learning algorithms is different from writing any other type of
software program. While most code goes through the traditional authoring, compilation/interpretation,
testing and execution lifecycle, deep learning models live through a never ending lifecycle of testing and
improvement processes. Most people generically refer to that part of the lifecycle as optimization but, in
reality, it also includes another important area of deep learning theory: regularization. In order to
understand the role that optimization and regularization play in deep learning models we should start by
understanding how those models are composed.
Anatomy of a Deep Learning Model
What is a deep learning algorithm? Obviously, we know it includes a model, but it is not just that, is it? Using a pseudo-math nomenclature, we can define a deep learning algorithm with the following equation:
DL(x)= Model(x) + Cost_Function(Model(x)) + Input_Data_Set (x) + Optimization(Cost_Function(x))
Using this conceptual equation, we can represent any deep learning algorithm as a function of an input data
set, a cost function, a deep neural network model and an optimization process. In the context of this article,
we are focusing on the optimization processes.
What makes those processes so challenging in deep learning systems? One word: size. Deep neural networks include a large number of layers and hidden units that can also include many nodes. That level of complexity directly translates into millions of interconnected nodes, which makes for an absolute optimization nightmare.
When thinking about improving a deep learning model, you should focus the efforts in two main areas:
a) Reducing the cost function.
b) Reducing the generalization error.
Those two subjects have become broad areas of research in the deep learning ecosystem, known as optimization and regularization respectively. Let's look at both definitions in a bit more detail.
Regularization
The role of regularization is to modify a deep learning model to perform well with inputs outside the
training dataset. Specifically, regularization focuses on reducing the test or generalization error without
affecting the initial training error.
The field of deep learning has helped to create many new regularization techniques. Most of them can be summarized as functions to optimize estimators. Very often, regularization techniques optimize estimators by reducing their variance without increasing the corresponding bias (read my previous article about bias and variance). Many times, finding the solution to a deep learning problem is not about creating the best model but a model that regularizes well under the right environment.
Optimization
There are many types of optimization in deep learning, but the most relevant are focused on reducing the cost function of a model. Those techniques typically operate by estimating the gradient at different nodes and trying to minimize it iteratively. Among the many optimization algorithms in the deep learning space, stochastic gradient descent (SGD) has become the most popular variation, with countless implementations in mainstream deep learning frameworks (see my previous article about SGD). It is also common to find variations of SGD, like SGD with Momentum, that work better on specific deep learning algorithms.
What we generally refer to as optimization in a deep learning model is really a constant combination of regularization and optimization techniques. For deep learning practitioners, mastering regularization and optimization is as important as understanding the core algorithms, and it certainly plays a key role in real-world deep learning solutions.
5. Describe Greedy Layer-wise training.
6. Discuss the use of Deep Neural Networks.
Deep Learning Tutorial
Deep learning is a branch of machine learning, which is a subset of artificial intelligence. Since neural networks imitate the human brain, so does deep learning. In deep learning, nothing is programmed explicitly. Basically, it is a class of machine learning that makes use of numerous nonlinear processing units to perform feature extraction as well as transformation. The output from each preceding layer is taken as input by each of the successive layers.
Deep learning models are capable of focusing on the accurate features themselves, requiring only a little guidance from the programmer, and are very helpful in solving the problem of dimensionality. Deep learning algorithms are used especially when we have a huge number of inputs and outputs.
Since deep learning evolved from machine learning, which is itself a subset of artificial intelligence, and the idea behind artificial intelligence is to mimic human behavior, the idea of deep learning is likewise to build algorithms that can mimic the brain.
Deep learning is implemented with the help of neural networks, and the motivation behind neural networks is the biological neuron, which is nothing but a brain cell.
Deep learning is a collection of statistical machine learning techniques for learning feature hierarchies, based on artificial neural networks.
So basically, deep learning is implemented with the help of deep networks, which are nothing but neural networks with multiple hidden layers.
Example of Deep Learning

In the example given above, we provide the raw image data to the input layer. This input layer then determines patterns of local contrast, meaning it differentiates on the basis of colors, luminosity, etc. The 1st hidden layer then determines face features, i.e., it fixates on eyes, nose, lips, etc., and matches those face features to the correct face template. In the 2nd hidden layer, it actually determines the correct face, as can be seen in the image above, after which the result is sent to the output layer. Likewise, more hidden layers can be added to solve more complex problems, for example, finding a particular kind of face with a dark or light complexion. So, as the hidden layers increase, we are able to solve more complex problems.
Architectures
o Deep Neural Networks
It is a neural network that incorporates a certain level of complexity, which means several hidden layers are encompassed between the input and output layers. Such networks are highly proficient at modeling and processing non-linear associations.
o Deep Belief Networks
A deep belief network is a class of deep neural network that comprises multiple layers of belief networks.
Steps to perform DBN:
1. With the help of the Contrastive Divergence algorithm, a layer of features is learned from
perceptible units.
2. Next, the formerly trained features are treated as visible units, which perform learning of
features.
3. Lastly, when the learning of the final hidden layer is accomplished, then the whole DBN is
trained.
o Recurrent Neural Networks
It permits parallel as well as sequential computation, and in this it is similar to the human brain (a large feedback network of connected neurons). Since they are capable of remembering all of the important things about the input they have received, they are more precise.
Types of Deep Learning Networks
1. Feed Forward Neural Network
A feed-forward neural network is none other than an artificial neural network in which the nodes do not form a cycle. In this kind of neural network, all the perceptrons are organized in layers, such that the input layer takes the input and the output layer generates the output. Since the hidden layers do not link with the outside world, they are called hidden layers. Each perceptron in one layer is connected to every node in the subsequent layer, so all the nodes are fully connected. There are no visible or invisible connections between nodes in the same layer, and there are no back-loops in the feed-forward network. To minimize the prediction error, the back propagation algorithm can be used to update the weight values; a minimal sketch of the forward pass follows.
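To make this concrete, a minimal sketch (assuming NumPy) of a forward pass through one fully connected hidden layer; the layer sizes and random weights are illustrative assumptions:

import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

x = np.random.randn(4)                         # input vector
W1, b1 = np.random.randn(4, 8), np.zeros(8)    # input -> hidden weights and biases
W2, b2 = np.random.randn(8, 3), np.zeros(3)    # hidden -> output weights and biases

h = np.maximum(0, x @ W1 + b1)   # hidden layer (ReLU); no cycles, no intra-layer links
y = sigmoid(h @ W2 + b2)         # output layer
# In training, back propagation would now update W1, b1, W2, b2 to reduce the error.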
Applications:
o Data Compression
o Pattern Recognition
o Computer Vision
o Sonar Target Recognition
o Speech Recognition
o Handwritten Characters Recognition
2. Recurrent Neural Network
Recurrent neural networks are yet another variation of feed-forward networks. Here each of the neurons in the hidden layers receives an input with a specific delay in time. The recurrent neural network mainly accesses preceding information from earlier iterations; for example, to guess the next word in a sentence, one must know the words that were used before it. An RNN not only processes the inputs but also shares weights across time, and the size of the model does not increase with the size of the input. However, recurrent neural networks have a slow computational speed, do not consider any future input for the current state, and have problems remembering information from far in the past.
Applications:
o Machine Translation
o Robot Control
o Time Series Prediction
o Speech Recognition
o Speech Synthesis
o Time Series Anomaly Detection
o Rhythm Learning
o Music Composition
3. Convolutional Neural Network
Convolutional neural networks are a special kind of neural network mainly used for image classification, clustering of images and object recognition. CNNs enable the unsupervised construction of hierarchical image representations. To achieve the best accuracy, deep convolutional neural networks are preferred over any other neural network.
Applications:
o Identify Faces, Street Signs, Tumors.
o Image Recognition.
o Video Analysis.
o NLP.
o Anomaly Detection.
o Drug Discovery.
o Checkers Game.
o Time Series Forecasting.
4. Restricted Boltzmann Machine
RBMs are yet another variant of Boltzmann machines. Here the neurons in the input layer and the hidden layer have symmetric connections between them; however, there are no internal connections within a layer. In contrast to RBMs, unrestricted Boltzmann machines do have internal connections inside the hidden layer. These restrictions are what allow RBMs to train efficiently.
Applications:
o Filtering.
o Feature Learning.
o Classification.
o Risk Detection.
o Business and Economic analysis.
5. Autoencoders
An autoencoder neural network is another kind of unsupervised machine learning algorithm. Here the number of hidden cells is smaller than the number of input cells, while the number of input cells is equal to the number of output cells. An autoencoder network is trained to reproduce the fed input at the output, which forces the AE to find common patterns and generalize the data. Autoencoders are mainly used to build a smaller representation of the input; this helps in reconstructing the original data from the compressed data. The algorithm is comparatively simple, as it only requires the output to be identical to the input.
o Encoder: converts the input data into a lower-dimensional representation.
o Decoder: reconstructs the original data from the compressed representation.
Applications:
o Classification.
o Clustering.
o Feature Compression.
Deep learning applications
o Self-Driving Cars
In self-driving cars, deep learning captures images of the surroundings by processing a huge amount of data, and then decides which actions to take: turn left, turn right, or stop. Deciding the right actions automatically can further reduce the accidents that happen every year.
o Voice Controlled Assistance
When we talk about voice-controlled assistance, Siri is the first thing that comes to mind. You can tell Siri whatever you want it to do, and it will search for it and display the results for you.
o Automatic Image Caption Generation
Whatever image you upload, the algorithm works in such a way that it generates a caption accordingly. If you say blue colored eye, it will display a blue-colored eye with a caption at the bottom of the image.
o Automatic Machine Translation
With the help of deep learning, automatic machine translation converts text from one language into another.
Limitations
o It only learns through observations.
o It can suffer from bias issues.
Advantages
o It lessens the need for feature engineering.
o It eliminates needless costs.
o It easily identifies difficult defects.
o It results in the best-in-class performance on problems.
Disadvantages
o It requires an ample amount of data.
o It is quite expensive to train.
o It does not have strong theoretical groundwork.

7. Discuss the Hyper-parameters of Convolutional Neural Networks


Hyper-parameters
An artificial neural network consists of model parameters and hyper-parameters. Model parameters are attributes such as the weights and biases that the model uses to tailor itself to fit the data. Hyper-parameters are attributes or properties that dictate the entire training process and need to be predefined [18]. Hyper-parameters must be predefined because they cannot be directly learned from the training process. Hyper-parameters define model complexity, its capacity to learn, and the rate of convergence for the model parameters; thus, finding the optimal values for the hyper-parameters leads to better efficiency and results. The hyper-parameters a user can set include the learning rate, number of hidden layers, number of hidden nodes, number of epochs, batch size, and the type of activation functions, among others. For the purpose of this research, the hyper-parameters considered are the learning rate, number of hidden layers, number of dense nodes, and the batch size.
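As an illustration, here is a sketch (assuming TensorFlow/Keras is available) that exposes the hyper-parameters named above as explicit variables; the input shape, class count and training data are hypothetical:

import tensorflow as tf

# Hyper-parameters set before training
learning_rate = 1e-3
dense_nodes = 128
batch_size = 32
epochs = 10

model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(32, (3, 3), activation="relu", input_shape=(28, 28, 1)),
    tf.keras.layers.MaxPooling2D((2, 2)),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(dense_nodes, activation="relu"),   # number of dense nodes
    tf.keras.layers.Dense(10, activation="softmax"),
])
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate),
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
# model.fit(x_train, y_train, batch_size=batch_size, epochs=epochs)  # hypothetical data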
8. Define Deep Feed-Forward network and explain regularizations
What is Regularization?
Simply speaking: regularization refers to a set of different techniques that lower the complexity of a neural network model during training and thus prevent overfitting.
There are three very popular and efficient regularization techniques called L1, L2, and dropout, which we are going to discuss in the following.
L2 Regularization
The L2 regularization is the most common type of all regularization techniques and is also commonly known as weight decay or Ridge Regression.
The mathematical derivation of this regularization, as well as the mathematical explanation of why this method works at reducing overfitting, is quite long and complex. Since this is a very practical article, I don't want to focus on the mathematics more than is required. Instead, I want to convey the intuition behind this technique and, most importantly, how to implement it so you can address the overfitting problem during your deep learning projects.
During L2 regularization, the loss function of the neural network is extended by a so-called regularization term, called here Ω. The regularization term Ω is defined as the Euclidean norm (or L2 norm) of the weight matrices, which is the sum over all squared weight values of a weight matrix:

Eq. 1 Regularization term: Ω = Σ w²

The regularization term is weighted by the scalar alpha divided by two and added to the regular loss function that is chosen for the current task. This leads to a new expression for the loss function:

Eq. 2 Regularized loss during L2 regularization: L_new(w) = L(w) + (α/2) · Σ w²


Alpha is sometimes called the regularization rate and is an additional hyperparameter we introduce into the neural network. Simply speaking, alpha determines how much we regularize our model.
In the next step we can compute the gradient of the new loss function and put the gradient into the update rule for the weights:

Eq. 3 Gradient descent during L2 regularization: w_new = w - η · (∇L(w) + α · w), where η is the learning rate


Some reformulations of the update rule lead to an expression which looks very much like the update rule for the weights during regular gradient descent:

Eq. 4 Gradient descent during L2 regularization: w_new = w · (1 - ηα) - η · ∇L(w)


The only difference is that by adding the regularization term we introduce an additional shrinkage of the current weights (the factor (1 - ηα) in the first term of the equation).
In other words, independent of the gradient of the loss function, we are making our weights a little bit smaller each time an update is performed.
L1 Regularization
In the case of L1 regularization (also known as Lasso regression), we simply use another regularization term Ω. This term is the sum of the absolute values of the weight parameters in a weight matrix:

Eq. 5 Regularization term for L1 regularization: Ω = Σ |w|
As in the previous case, we multiply the regularization term by alpha and add the entire thing to the loss function:

Eq. 6 Loss function during L1 regularization: L_new(w) = L(w) + α · Σ |w|


The derivative of the new loss function is the sum of the gradient of the old loss function and the sign of each weight value times alpha:

Eq. 7 Gradient of the loss function during L1 regularization: ∇L_new(w) = ∇L(w) + α · sign(w)
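A NumPy sketch of the two update rules just derived (Eq. 4 and Eq. 7); the learning rate eta and the regularization rate alpha are illustrative values:

import numpy as np

def l2_update(w, grad, eta=0.1, alpha=0.01):
    # Eq. 4: shrink the weights by (1 - eta * alpha), then take the usual step
    return w * (1 - eta * alpha) - eta * grad

def l1_update(w, grad, eta=0.1, alpha=0.01):
    # Eq. 7: subtract alpha * sign(w) on top of the ordinary gradient step
    return w - eta * (grad + alpha * np.sign(w))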


Why do L1 and L2 Regularizations work?
The question you might be asking yourself right now is:
“Why does all of this help to reduce the over fitting issue?”
Let’s tackle this question.
Please consider the plots of the |w| and w² functions, where |w| represents the operation performed during L1 regularization and w² the operation performed during L2 regularization.
Graph. 2 L1 function (red), L2 function (blue). Source: Self made.
In the case of L2 regularization, our weight parameters decrease, but not necessarily become zero, since the
curve becomes flat near zero. On the other hand during the L1 regularization, the weights are always forced
all the way towards zero.
We can also take a different and more mathematical view on this.
In the case of L2, you can think of solving an equation where the sum of squared weight values is equal to or less than a value s. Here s is a constant that exists for each possible value of the regularization rate α. For just two weight values W1 and W2, this equation would look as follows: W1² + W2² ≤ s.
On the other hand, the L1 regularization can be thought of as an equation where the sum of the absolute values of the weight values is less than or equal to a value s. This would look like the following expression: |W1| + |W2| ≤ s.
Basically the introduced equations for L1 and L2 regularizations are constraint functions, which we can
visualize:

Source: An Introduction to Statistical Learning by Gareth James, Daniela Witten, Trevor Hastie, Robert
Tibshirani
The left image shows the constraint function (green area) for the L1 regularization and the right image
shows the constraint function for the L2 regularization. The red ellipses are contours of the loss function
that is used during the gradient descent. In the center of the contours there is a set of optimal weights for
which the loss function has a global minimum.
In the case of L1 and L2 regularization, the estimates of W1 and W2 are given by the first point where the
ellipse intersects with the green constraint area.
Since L2 regularization has a circular constraint area, the intersection won't generally occur on an axis, and thus the estimates for W1 and W2 will both be non-zero.
In the case of L1, the constraint area has a diamond shape with corners, and thus the contours of the loss function will often intersect the constraint region at an axis. When this occurs, one of the estimates (W1 or W2) will be zero.
In a high dimensional space, many of the weight parameters will equal zero simultaneously.
What does Regularization achieve?
 Performing L2 regularization encourages the weight values towards zero (but not exactly zero)
 Performing L1 regularization encourages the weight values to be zero
Intuitively speaking, smaller weights reduce the impact of the hidden neurons. In that case, those hidden neurons become negligible and the overall complexity of the neural network gets reduced.
As mentioned earlier: less complex models typically avoid modeling noise in the data, and therefore there is no overfitting.
But you have to be careful when choosing the regularization rate α: the goal is to strike the right balance between low complexity of the model and accuracy.
 If your alpha value is too high, your model will be simple, but you run the risk of underfitting your data. Your model won't learn enough about the training data to make useful predictions.
 If your alpha value is too low, your model will be more complex, and you run the risk of overfitting your data. Your model will learn too much about the particularities of the training data and won't be able to generalize to new data.
