DL Answers
1. Define activation function and discuss various activation functions with suitable graphs
2. List the Parameters to design feed forward neural network and discuss how to select the
Parameters
3. Discuss various paradigms of Learning methodologies and issues in Deep learning
4. Compare the deep learning optimization, Regularization and model selection.
5. Describe Greedy Layer wise training.
6. Discuss the use of Deep Neural Networks.
7. Discuss the hyperparameters of Convolutional Neural Networks
8. Define Deep Feed Forward network and explain about regularizations
1 a) Define activation function and discuss various activation functions with suitable graphs
Activation functions are mathematical functions that determine the output of each neuron, and therefore of
the neural network model as a whole. They also have a major effect on the network’s ability to converge and
on its convergence speed; in some cases, a poorly chosen activation function can prevent the network from
converging at all. Many activation functions also squash their input into a bounded range, such as
-1 to 1 or 0 to 1.
An activation function must be computationally efficient, because a neural network is sometimes trained on
millions of data points and the function is evaluated for every neuron on each of them.
Let’s consider the simple neural network model without any hidden layers.
The step function is mainly used in binary classification problems and works well for linearly separable
problems. It cannot handle multi-class problems.
Linear Activation Function
The output of the sigmoid activation function always lies in the range (0, 1), compared with (-inf, inf) for
the linear function. It is non-linear, continuously differentiable, monotonic, and has a fixed output range.
However, it is not zero-centered.
Hyperbolic Tangent
The function produces outputs in the range (-1, 1) and is continuous: it produces an output for
every value of x.
y = \tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}
Softmax
The softmax function is sometimes called the soft argmax function, or multi-class logistic regression.
This is because the softmax is a generalization of logistic regression that can be used for multi-class
classification, and its formula is very similar to the sigmoid function which is used for logistic
regression. The softmax function can be used in a classifier only when the classes are mutually
exclusive.
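The sigmoid, hyperbolic tangent and softmax functions discussed above can be sketched in a few lines of plain Python. This is only an illustrative sketch using the standard library: the function names are chosen here, and subtracting the maximum score inside softmax is a common numerical-stability trick that the text above does not mention.

```python
import math


def sigmoid(x):
    """Logistic sigmoid: squashes a real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def tanh(x):
    """Hyperbolic tangent written out from its definition; output lies in (-1, 1)."""
    return (math.exp(x) - math.exp(-x)) / (math.exp(x) + math.exp(-x))


def softmax(scores):
    """Turn a list of real-valued scores into probabilities that sum to 1."""
    shift = max(scores)  # assumption: shift by the max to avoid overflow in exp
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# One score per mutually exclusive class; the largest score gets the largest probability.
probabilities = softmax([2.0, 1.0, 0.1])
print(sigmoid(0.0), tanh(1.0), probabilities)
```

Because the softmax outputs always sum to one, raising the probability of one class necessarily lowers the others, which is why it suits mutually exclusive classes.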
Gudermannian
The Gudermannian function relates circular functions and hyperbolic functions without explicitly using
complex numbers.
It has a negative coefficient, which shifts to a positive coefficient. So when x is greater than zero, the
output will be x, except from when x=0 to x=1, where it slightly leans to a smaller y-value.
Pros:
1. Less time and space complexity
2. Avoids the vanishing gradient problem.
Cons:
1. Introduces the dead ReLU problem.
2. Does not avoid the exploding gradient problem.
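Assuming the pros and cons above refer to the rectified linear unit (ReLU), which the list names only through the "dead ReLU" problem, the following sketch shows where that problem comes from. The function names are illustrative, and the gradient at exactly x = 0 is set to 0 here by convention.

```python
def relu(x):
    """Rectified linear unit: passes positive inputs through, clips the rest to 0."""
    return max(0.0, x)


def relu_grad(x):
    """Derivative of ReLU with respect to its input (taken as 0 at x = 0)."""
    return 1.0 if x > 0 else 0.0


# For positive inputs the gradient is exactly 1, so it does not shrink layer after layer.
# For negative inputs both the output and the gradient are 0: a neuron whose input is
# always negative receives no weight updates and stays "dead".
for x in (-2.0, 0.5, 3.0):
    print(x, relu(x), relu_grad(x))
```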
2. List the Parameters to design feed forward neural network and discuss how to select the parameters
Neural Networks: parameters, hyper parameters and optimization strategies
Neural Networks (NNs) are the typical algorithms used in Deep Learning analysis. NNs can take different
shapes and structures; nevertheless, the core skeleton is always the same.
So we have our inputs (x), we take the weighted sum of them (with weights equal to w), pass it through an
activation function f(.) and, voilà, we obtain our output. Then, depending on how accurate our predictions
are, the algorithm updates itself through the so-called ‘backpropagation’ phase, according to a given
optimization strategy.
Needless to say, this is a very rough definition, but if you keep it in mind while reading this article,
you will better understand its core topic.
Indeed, what I want to focus on is how to approach some characteristic elements of NNs, whose
initialization, optimization and tuning can make your algorithm much more powerful. Before starting, let’s
see which elements I’m talking about:
Parameters: these are the coefficients of the model, and they are chosen by the model itself. It means that
the algorithm, while learning, optimizes these coefficients (according to a given optimization strategy) and
returns an array of parameters which minimize the error. To give an example, in a linear regression task your
model will look like y = b + ax, where b and a are your parameters. The only thing you have
to do with those parameters is initialize them (we will see later on what that means).
Hyperparameters: these are elements that, unlike the previous ones, you need to set yourself.
Furthermore, the model will not update them according to the optimization strategy: your manual
intervention will always be needed.
Strategies: these are tips and approaches you should adopt towards your model. For example, before
feeding in your data, you might want to normalize them, especially if you have values on different scales,
since this might affect your algorithm’s performance.
So, let’s examine all of them.
Parameters
As anticipated, the only thing you have to do with parameters is initialize them (note that
parameter initialization is itself a strategy). So, which is the best way to initialize them? What you
should certainly NOT do is set them all equal to zero: by doing so you risk penalizing the whole
algorithm. For example, every neuron in a layer then receives the same update, and the weights can remain
stuck at zero even after several re-weighting procedures.
Hence, here are some ideas for properly initializing your parameters depending on the activation
function you decide to employ (I will dwell on activation functions later on).
If you are using a Sigmoid or Tanh activation function, you might use Xavier initialization, with a uniform or
normal distribution whose spread is set by the number of inputs and outputs of the layer.
If you are using ReLU instead, you could use He initialization, with a normal distribution whose spread is
set by the number of inputs of the layer.
Here n_i and n_o denote, respectively, the number of inputs and the number of outputs of the layer.
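The initialization formulas themselves did not survive in these notes, so the sketch below uses the commonly cited forms as an assumption: Xavier uniform draws from U(-sqrt(6/(n_i+n_o)), +sqrt(6/(n_i+n_o))), Xavier normal uses variance 2/(n_i+n_o), and He normal uses variance 2/n_i. The function names and the plain nested-list weight matrix are illustrative only.

```python
import math
import random


def xavier_uniform(n_in, n_out, rng):
    """n_in x n_out weights drawn from U(-limit, limit), limit = sqrt(6 / (n_in + n_out))."""
    limit = math.sqrt(6.0 / (n_in + n_out))
    return [[rng.uniform(-limit, limit) for _ in range(n_out)] for _ in range(n_in)]


def xavier_normal(n_in, n_out, rng):
    """n_in x n_out weights drawn from N(0, 2 / (n_in + n_out))."""
    std = math.sqrt(2.0 / (n_in + n_out))
    return [[rng.gauss(0.0, std) for _ in range(n_out)] for _ in range(n_in)]


def he_normal(n_in, n_out, rng):
    """n_in x n_out weights drawn from N(0, 2 / n_in); often paired with ReLU."""
    std = math.sqrt(2.0 / n_in)
    return [[rng.gauss(0.0, std) for _ in range(n_out)] for _ in range(n_in)]


rng = random.Random(0)  # fixed seed so the sketch is reproducible
w_tanh_layer = xavier_uniform(4, 3, rng)
w_relu_layer = he_normal(4, 3, rng)
print(w_tanh_layer[0], w_relu_layer[0])
```

In both schemes the spread shrinks as the layer gets wider, which keeps the weighted sums at a similar scale from layer to layer.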
Hyper parameters
This is far more interesting. Hyperparameters require much more of your attention and knowledge than
parameters. So, to get an idea of how to handle them, let’s examine some of them:
Number of hidden layers: this is probably the most debatable point. The idea is that you want to keep
your NN as simple as possible (you want it to be fast and to generalize well), but at the same time you want
it to classify your input data well. In this case (and in many others related to hyperparameters) you should
proceed by trial and error. It might sound ‘old’ in the era of self-learning machines, but remember that
those machines need to be built before they can learn. To get an intuition for this, I suggest you run some
experiments on the TensorFlow platform: you will see how, beyond a certain number of layers/neurons, the
accuracy does not improve any more, hence it would be inefficient to keep the algorithm so heavy.
Learning rate: this hyperparameter refers to the backpropagation step, when parameters are updated
according to an optimization function. Basically, it represents how large the change in a weight is
after a re-calibration. But what does ‘re-calibration’ mean? Well, think about a generic loss function L(w)
with only one weight w, whose graph is a curve with a single minimum.
You want to minimize the loss, so ideally you want your current w to slide towards the minimum. The
procedure is the following:
w_{new} = w - \frac{dL}{dw}
where the first term is your current weight and the second term is the gradient of your function (in this
one-dimensional case, it is the first derivative of your loss function with respect to your only
weight). Remember that the first derivative is negative where the slope of the tangent line is
negative; that is why we put a minus sign between the two terms (intuition: if the slope is negative, the
weight should move to the right). This optimization procedure is called Gradient
Descent.
Now let’s add a new term to the formula:
w_{new} = w - \gamma \frac{dL}{dw}
This gamma is our learning rate, and it tells the algorithm how strongly the
gradient should affect the weight. The problem with a small gamma is that the NN will converge (if it
converges at all) very slowly, and updates that are already tiny because of the so-called ‘Vanishing
Gradient’ problem become negligible. On the other hand, if gamma is very big, the update can overshoot the
minimum and the loss can diverge, much as it does in the scenario of ‘Exploding Gradient’.
A good strategy might be to start with a value around 0.1 and then reduce it exponentially: at some point,
the value of the loss function starts decreasing in the first few iterations, and that is the signal that
the weight is moving in the right direction.
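The effect of the learning rate described above can be seen on a toy problem. The sketch below is an illustration under assumptions, not a training recipe: the one-weight loss L(w) = (w - 3)^2 and the three values of gamma are chosen here purely for demonstration.

```python
def loss(w):
    """Toy one-weight loss with its minimum at w = 3."""
    return (w - 3.0) ** 2


def grad(w):
    """First derivative of the toy loss with respect to w."""
    return 2.0 * (w - 3.0)


def gradient_descent(w, gamma, steps):
    """Repeat the update w <- w - gamma * dL/dw for a fixed number of steps."""
    for _ in range(steps):
        w = w - gamma * grad(w)
    return w


w_good = gradient_descent(0.0, gamma=0.1, steps=50)     # converges close to 3
w_slow = gradient_descent(0.0, gamma=0.001, steps=50)   # barely moves from 0
w_diverged = gradient_descent(0.0, gamma=1.1, steps=50) # overshoots further each step
print(w_good, w_slow, w_diverged)
```

With gamma = 1.1 each step lands on the opposite side of the minimum and further away than before, which is the overshooting behaviour the text warns about.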
Momentum: it is a technique used during the backpropagation phase. As said regarding the learning rate,
parameters are updated so that they converge towards the minimum of the loss function. This process
might take too long and hurt the efficiency of the algorithm. Hence, one possible solution is keeping track
of the previous directions (that is, the gradients of the loss function with respect to the weights) and
storing them as embedded information: this is what momentum is designed for. It increases the speed of
convergence not in terms of learning rate (how much a weight is updated each time) but in terms of
embedded memory of past re-calibrations (the algorithm knows that the previous direction of that weight was,
let’s say, to the right, and it will keep moving in this direction during the next propagation). We can
visualize it by considering the projection of a two-weight loss function (specifically, a paraboloid).
If we add the momentum hyperparameter, the descent is faster, since the model
keeps track of the past gradient directions.
If you choose high values of momentum, the model will rely heavily on the past directions:
this might result in an incredibly fast learning algorithm, but the risk of missing some correct ‘deviations’
is high. The suggestion is always to start with low values and then increase them little by little.
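To make the idea of momentum concrete, here is a minimal sketch on a two-weight paraboloid. It assumes the classical momentum update (velocity = beta * velocity - gamma * gradient, then weight = weight + velocity); the loss function, the names and the values gamma = 0.01 and beta = 0.9 are illustrative assumptions.

```python
def loss(w):
    """Two-weight paraboloid L(w1, w2) = 0.5 * (w1^2 + 10 * w2^2), minimum at (0, 0)."""
    return 0.5 * (w[0] ** 2 + 10.0 * w[1] ** 2)


def grad(w):
    """Gradient of the paraboloid with respect to (w1, w2)."""
    return [w[0], 10.0 * w[1]]


def descend(w, gamma, beta, steps):
    """Gradient descent with classical momentum; beta = 0 gives plain gradient descent."""
    velocity = [0.0, 0.0]
    for _ in range(steps):
        g = grad(w)
        # The velocity remembers a decaying sum of past gradient directions.
        velocity = [beta * v - gamma * gi for v, gi in zip(velocity, g)]
        w = [wi + v for wi, v in zip(w, velocity)]
    return w


start = [1.0, 1.0]
w_plain = descend(start, gamma=0.01, beta=0.0, steps=100)
w_momentum = descend(start, gamma=0.01, beta=0.9, steps=100)
print(loss(w_plain), loss(w_momentum))
```

With the same learning rate and number of steps, the run with momentum ends at a much lower loss because consecutive gradients that point the same way accumulate in the velocity.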
Activation function: it is the function through which we pass our weighted sum in order to obtain a
meaningful output, for example a vector of probabilities or a 0–1 output. The major activation functions
are Sigmoid (for multiclass classification, a variant of this function called SoftMax is used: it returns as
output a vector of probabilities whose sum is equal to one), Tanh and ReLU.
Note that activation functions can be placed at any point in the NN, as many times as you want. However,
you always have to think about efficiency and speed. For instance, the ReLU function is very cheap to
compute during training, while the Sigmoid is more expensive. Hence, a good practice might be using
ReLU for hidden layers and then, in the last layer, inserting your Sigmoid.
· Minibatch size: when you are facing billions of data points, it might be inefficient (as well as
counterproductive) to feed your NN with all of them at once. A good practice is feeding it with smaller
samples of your data, called batches: by doing so, every time the algorithm updates itself, it trains on a
sample of the size of the batch. The typical size is 32 or higher; however, you need to keep in mind that, if
the size is too big, the risk is a model that generalizes poorly and won’t fit new data well.
· Epochs: this is how many times you want your algorithm to train on your whole dataset (note that
epochs are different from iterations: the number of iterations is the number of batches needed to complete
one epoch). Again, the number of epochs depends on the kind of data and task you are facing. One idea is
to impose a condition such that training stops when the error is close to zero. Or, more simply, you can
start with a relatively low number of epochs and then increase it progressively, tracking some evaluation
metrics (like accuracy).
· Dropout: this technique consists of randomly removing some nodes during the training phase, so that the
NN does not rely too heavily on any single node, which helps to reduce overfitting. The idea is that we do
not want our NN to be overwhelmed by information, especially if we consider that some nodes might be
redundant. So, while building our algorithm, we can decide to keep, at each training step, each node with
probability p (called ‘keep probability’) or drop it with probability 1-p (called ‘drop probability’).
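The minibatch, epoch and dropout items above can be sketched together. This is an illustration under assumptions: the helper names are made up here, the dataset is just the integers 0 to 99, and the dropout shown is the common "inverted" variant, which rescales the kept activations by 1/p so that their expected value is unchanged (the text above does not specify a variant).

```python
import math
import random


def minibatches(data, batch_size, rng):
    """Yield shuffled mini-batches that together cover the dataset once (one epoch)."""
    indices = list(range(len(data)))
    rng.shuffle(indices)
    for start in range(0, len(indices), batch_size):
        yield [data[i] for i in indices[start:start + batch_size]]


def dropout(activations, keep_prob, rng):
    """Keep each node with probability keep_prob (rescaled by 1/keep_prob), else zero it."""
    return [a / keep_prob if rng.random() < keep_prob else 0.0 for a in activations]


rng = random.Random(0)
data = list(range(100))
batch_size = 32

# Iterations per epoch = number of batches needed to see every sample once.
iterations_per_epoch = math.ceil(len(data) / batch_size)

epochs = 3
for epoch in range(epochs):
    batches = list(minibatches(data, batch_size, rng))
    # A real training loop would run one weight update per batch here.
    print(epoch, [len(b) for b in batches])

hidden = [1.0] * 1000
dropped = dropout(hidden, keep_prob=0.8, rng=rng)
print(sum(1 for a in dropped if a != 0.0), "of", len(hidden), "nodes kept")
```

Dropout is applied only while training; at prediction time all nodes are kept, which is why the inverted variant rescales during training instead.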
Strategies
Strategies are approaches and best practices we might want to adopt towards our algorithm to make it
perform better. Among these are the following:
· Parameter initialization: we have been talking about that in the first paragraph.
· Data normalization: while inspecting your data, you might notice that some features are represented on
different scales. This might affect the performance of your NN, since convergence is slower. Normalizing
data means converting all of them to the same scale, within the range [0, 1]. You can also decide to
standardize your data, which means rescaling them so that they have mean equal to 0 and standard
deviation equal to 1. While data normalization happens before training your NN, another way you can
normalize your data is through the so-called Batch Normalization: it happens directly during your NN
training, specifically after the weighted sum and before the activation function.
· Optimization algorithm: in the previous paragraph, I mentioned gradient descent as the optimization
algorithm. However, it has many variants: Stochastic Gradient Descent (it minimizes the loss
according to the gradient descent optimization, and for each iteration it randomly selects a training
sample, which is why it is called stochastic), RMSProp (which differs from the previous one in that each
parameter has an adapted learning rate) and the Adam optimizer (it is RMSProp + momentum). Of course, this
is not the full list, yet it is sufficient to understand that the Adam optimizer is often a good default
choice, since it adapts the learning rate of each parameter and exposes several hyperparameters you can tune.
· Regularization: this strategy is pivotal if you want to keep your model simple and avoid overfitting. The
idea is that regularization adds a penalty to the model if the weights are too large or too many. Indeed, it
adds to our loss function a new term which tends to increase (hence, the loss increases too) if the
re-calibration procedure increases the weights. There are two kinds of regularization: Lasso regularization
(L1), which penalizes the sum of the absolute values of the weights, and Ridge regularization (L2), which
penalizes the sum of their squares.
L1 regularization tends to shrink weights to zero, with the risk of getting rid of some inputs (since they
will be multiplied by a null value), whereas L2 might shrink weights to very low values, but not to zero
(hence inputs are preserved).
It is interesting to note that this concept is strongly related to the Information Criteria in time series
analysis. Indeed, while optimizing the Maximum Likelihood function of an Autoregressive model, we might
run into the same problem of overfitting, since this procedure tends to increase the number of parameters:
that’s why it is good practice to add a penalty when that number increases.
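Two of the strategies listed above, data normalization and regularization, reduce to short formulas. The sketch below is a plain-Python illustration with made-up names; the penalty strength `alpha` and the example numbers are assumptions, and the penalties are written in the common form alpha * sum(|w|) for L1 and alpha * sum(w^2) for L2.

```python
import math


def min_max_normalize(values):
    """Rescale values linearly to the range [0, 1] (assumes they are not all equal)."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


def standardize(values):
    """Rescale values to mean 0 and standard deviation 1 (assumes non-zero spread)."""
    mean = sum(values) / len(values)
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    return [(v - mean) / std for v in values]


def l1_penalty(weights, alpha):
    """Lasso-style penalty: alpha times the sum of absolute weights."""
    return alpha * sum(abs(w) for w in weights)


def l2_penalty(weights, alpha):
    """Ridge-style penalty: alpha times the sum of squared weights."""
    return alpha * sum(w ** 2 for w in weights)


feature = [2.0, 4.0, 6.0]
print(min_max_normalize(feature), standardize(feature))

weights = [1.0, -2.0, 0.0]
data_loss = 0.8  # hypothetical loss value before regularization
print(data_loss + l1_penalty(weights, alpha=0.5))
print(data_loss + l2_penalty(weights, alpha=0.5))
```

The regularized loss grows whenever the weights grow, so the optimizer is rewarded for keeping them small.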
3. Discuss various paradigms of Learning methodologies and issues in Deep learning
Learning paradigms
Learning theories are usually divided into several paradigms which represent different perspectives on the
learning process. Theories within the same paradigm share the same basic point of view. Currently, the
most commonly accepted learning paradigms are behaviorism, cognitivism, constructivism, connectivism,
and humanism.
Here we will refer to the named learning paradigms and their related learning and instructional design
theories. A brief overview of the paradigms follows.
Behaviorism (since 1900s). Learner role: passive, simply responding to external stimuli.
Cognitivism (since 1960s). Learner role: active and central to the process; he learns objective knowledge
from the external world.
Humanism (since 1960s). Learner role: active and discovery.
Constructivism (since 1970s). Learner role: active, constructing his representation of knowledge using
preferred learning styles.
Connectivism (since 2000s). Learner role: knowledge acquisition in the form of establishing connections to
other nodes.
In the example given above, we provide the raw image data to the input layer. The
input layer then determines patterns of local contrast, which means it differentiates on the
basis of colors, luminosity, etc. The 1st hidden layer then determines the facial features, i.e., it fixates
on the eyes, nose, lips, etc., and matches those features against the correct face template. In
the 2nd hidden layer, the network determines the correct face, as can be seen in the above image,
after which the result is sent to the output layer. Likewise, more hidden layers can be added to solve more
complex problems, for example, if you want to find a particular kind of face having large or light
complexions. So, as the number of hidden layers increases, we are able to solve more complex problems.
Architectures
o Deep Neural Networks
It is a neural network with a certain level of complexity, which means that several
hidden layers sit between the input and output layers. Such networks are highly
proficient at modeling and processing non-linear relationships.
o Deep Belief Networks
A deep belief network is a class of Deep Neural Network that comprises multi-layer belief
networks.
Steps to perform DBN:
1. With the help of the Contrastive Divergence algorithm, a layer of features is learned from
the visible units.
2. Next, the previously learned features are treated as visible units, from which a further layer of
features is learned.
3. Lastly, when the learning of the final hidden layer is accomplished, the whole DBN is
trained.
o Recurrent Neural Networks
It permits parallel as well as sequential computation, and it loosely resembles the human
brain (a large feedback network of connected neurons). Since RNNs are able to remember
the important information about the input they have received, they can be more precise on sequential data.
Types of Deep Learning Networks
1. Feed Forward Neural Network
A feed-forward neural network is an Artificial Neural Network in which the
nodes do not form a cycle. In this kind of neural network, all the perceptrons are organized in layers,
such that the input layer takes the input and the output layer generates the output. Since the layers in
between do not link with the outside world, they are called hidden layers. Each perceptron in
one layer is connected to every node in the subsequent layer, so the
layers are fully connected. There are no connections, visible or invisible, between nodes in the
same layer, and there are no back-loops in a feed-forward network. To minimize the prediction error, the
backpropagation algorithm can be used to update the weight values.
Applications:
o Data Compression
o Pattern Recognition
o Computer Vision
o Sonar Target Recognition
o Speech Recognition
o Handwritten Characters Recognition
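The forward pass of the feed-forward network described above can be sketched in plain Python. Everything specific here is a hypothetical choice for illustration: the 2-3-1 layer sizes, the hand-picked weights and biases, and the use of ReLU in the hidden layer with a sigmoid at the output.

```python
import math


def relu(x):
    return max(0.0, x)


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def dense(inputs, weights, biases, activation):
    """Fully connected layer: every unit sees every input (weighted sum + bias, then activation)."""
    return [
        activation(sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(weights, biases)
    ]


# Hypothetical 2-3-1 network; each row holds the incoming weights of one unit.
hidden_w = [[0.5, -0.2], [0.1, 0.4], [-0.3, 0.8]]
hidden_b = [0.0, 0.1, -0.1]
output_w = [[0.7, -0.5, 0.2]]
output_b = [0.05]


def forward(x):
    """Data flows input -> hidden -> output with no loops and no same-layer links."""
    hidden = dense(x, hidden_w, hidden_b, relu)
    return dense(hidden, output_w, output_b, sigmoid)


prediction = forward([1.0, 2.0])
print(prediction)
```

Training would compare `prediction` with a target and use backpropagation to adjust the weights; that step is left out of this sketch.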
2. Recurrent Neural Network
Recurrent neural networks are a variation of feed-forward networks. Here each of the neurons
in the hidden layers receives an input with a specific delay in time. The recurrent neural network
mainly accesses information from preceding steps. For example, to guess the next word in a
sentence, one must know the words that were previously used. It not only processes the
inputs but also shares the same weights across time steps, so the size of the model does not
increase with the size of the input. However, the problems with this recurrent neural network
are that computation is slow, that it does not take any future input into account for the current
state, and that it has difficulty remembering information from far back in the sequence.
Applications:
o Machine Translation
o Robot Control
o Time Series Prediction
o Speech Recognition
o Speech Synthesis
o Time Series Anomaly Detection
o Rhythm Learning
o Music Composition
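The weight sharing across time steps described for recurrent networks can be illustrated with a deliberately tiny example. This sketch assumes a scalar hidden state and the common update h_t = tanh(w_x * x_t + w_h * h_{t-1} + b); the function name and the weight values are hypothetical.

```python
import math


def rnn_forward(inputs, w_x, w_h, b, h0=0.0):
    """Run a scalar RNN over a sequence, reusing the same three parameters at every step."""
    h = h0
    states = []
    for x in inputs:
        # The new state mixes the current input with the previous state.
        h = math.tanh(w_x * x + w_h * h + b)
        states.append(h)
    return states


params = {"w_x": 0.5, "w_h": 0.8, "b": 0.0}

# The same parameters handle sequences of any length: the model does not grow.
short = rnn_forward([1.0, 0.5], **params)
longer = rnn_forward([1.0, 0.5, 2.0, -1.0], **params)
print(short, longer)
```

Each state depends only on the current and earlier inputs, which is why this kind of network cannot take future inputs into account.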
3. Convolutional Neural Network
Convolutional Neural Networks are a special kind of neural network mainly used for image classification,
clustering of images and object recognition. DNNs enable unsupervised construction of hierarchical image
representations. When the best accuracy is needed on such image tasks, deep convolutional neural networks
are usually preferred over other kinds of neural network.
Applications:
o Identify Faces, Street Signs, Tumors.
o Image Recognition.
o Video Analysis.
o NLP.
o Anomaly Detection.
o Drug Discovery.
o Checkers Game.
o Time Series Forecasting.
4. Restricted Boltzmann Machine
RBMs are a variant of Boltzmann Machines. Here the neurons in the input layer and the
hidden layer have symmetric connections between them. However, there are no connections
within a layer. In contrast to RBMs, general Boltzmann machines do have internal
connections inside the hidden layer. These restrictions in RBMs help the model to train efficiently.
Applications:
o Filtering.
o Feature Learning.
o Classification.
o Risk Detection.
o Business and Economic analysis.
5. Autoencoders
An autoencoder neural network is another kind of unsupervised machine learning algorithm. Here the
number of hidden cells is smaller than the number of input cells, while the number of input cells is
equal to the number of output cells. An autoencoder network is trained to produce an output similar
to the input it was fed, which forces it to find common patterns and generalize the data. Autoencoders are
mainly used to obtain a smaller representation of the input and to reconstruct the original data
from that compressed representation. The algorithm is comparatively simple, as it only requires the output
to be identical to the input.
o Encoder: Convert input data in lower dimensions.
o Decoder: Reconstruct the compressed data.
Applications:
o Classification.
o Clustering.
o Feature Compression.
Deep learning applications
o Self-Driving Cars
In self-driving cars, deep learning processes a huge amount of data from images captured around the car
and then decides which action to take: turn left, turn right or stop.
Deciding on actions in this way is expected to further reduce the accidents
that happen every year.
o Voice Controlled Assistance
When we talk about voice-controlled assistance, Siri is the first thing that comes to mind.
You can tell Siri whatever you want it to do for you, and it will search for it and display the
result.
o Automatic Image Caption Generation
Whatever image you upload, the algorithm generates a caption
accordingly. If you say blue colored eye, it will display a blue-colored eye with a caption at the
bottom of the image.
o Automatic Machine Translation
With the help of deep learning, automatic machine translation lets us convert one language into
another.
Limitations
o It only learns through observations.
o It is prone to bias issues.
Advantages
o It lessens the need for feature engineering.
o It eliminates unnecessary costs.
o It easily identifies difficult defects.
o It results in the best-in-class performance on problems.
Disadvantages
o It requires an ample amount of data.
o It is quite expensive to train.
o It does not have strong theoretical groundwork.
Source: An Introduction to Statistical Learning by Gareth James, Daniela Witten, Trevor Hastie, Robert
Tibshirani
The left image shows the constraint region (green area) for L1 regularization and the right image
shows the constraint region for L2 regularization. The red ellipses are contours of the loss function
that is used during gradient descent. At the center of the contours there is a set of optimal weights for
which the loss function has a global minimum.
In the case of L1 and L2 regularization, the estimates of W1 and W2 are given by the first point where an
ellipse intersects the green constraint area.
Since L2 regularization has a circular constraint area, the intersection won’t generally occur on an axis,
and thus the estimates for W1 and W2 will generally both be non-zero.
In the case of L1, the constraint area has a diamond shape with corners, and thus the contours of the loss
function will often intersect the constraint region on an axis. When this occurs, one of the estimates (W1
or W2) will be zero.
In a high-dimensional space, many of the weight parameters can equal zero simultaneously.
What does Regularization achieve?
Performing L2 regularization pushes the weight values towards zero (but not exactly to zero).
Performing L1 regularization pushes the weight values to be exactly zero.
Intuitively speaking, smaller weights reduce the impact of the hidden neurons. In that case, those hidden
neurons become negligible and the overall complexity of the neural network is reduced.
As mentioned earlier, less complex models typically avoid modeling noise in the data and are therefore
less prone to overfitting.
But you have to be careful when choosing the regularization term α. The goal is to strike the right balance
between low complexity of the model and accuracy.
If your alpha value is too high, your model will be simple, but you run the risk of underfitting your
data. Your model won’t learn enough about the training data to make useful predictions.
If your alpha value is too low, your model will be more complex, and you run the risk of
overfitting your data. Your model will learn too much about the particularities of the training data,
and won’t be able to generalize to new data.
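The difference between the two penalties, and the effect of alpha, can be checked on a one-weight toy problem. The sketch assumes a data loss of 0.5 * (w - w_star)^2, where w_star is the hypothetical unregularized optimum; adding the penalty alpha * |w| (L1) or alpha * w^2 (L2) then has the closed-form minimizers used below. Names and numbers are illustrative.

```python
def l1_solution(w_star, alpha):
    """Minimizer of 0.5 * (w - w_star)**2 + alpha * |w|: shrink by alpha, clip at zero."""
    shrunk = max(abs(w_star) - alpha, 0.0)
    return shrunk if w_star >= 0 else -shrunk


def l2_solution(w_star, alpha):
    """Minimizer of 0.5 * (w - w_star)**2 + alpha * w**2: rescale towards zero."""
    return w_star / (1.0 + 2.0 * alpha)


w_star = 0.3  # hypothetical optimum of the unregularized loss
for alpha in (0.0, 0.1, 0.5, 5.0):
    print(alpha, l1_solution(w_star, alpha), l2_solution(w_star, alpha))
```

Once alpha exceeds |w_star| the L1 solution is exactly zero, while the L2 solution only gets smaller; a very large alpha drives both towards zero, which corresponds to the underfitting case described above.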